To many people in consulting and industry, clustering is synonymous with k-means, and the intuitive results of the k-means algorithm generally do a nice job of identifying potential clusters within a given dataset. What is more surprising is how often k-means is naively used as a ‘one-off’ classifier, in a supervised-learning sense, to segment data.
In a strictly unsupervised case it’s impossible to understand the accuracy of this approach, but when you have a supervised learning problem things are a little different. A simple way to visualise how k-means can mis-classify data is to use principal components analysis, or PCA. PCA creates ‘principal components’ of a dataset; these can be thought of intuitively as a new coordinate system that highlights the linear variance in the data. It is a linear projection of your data onto a new space, such that each new dimension is uncorrelated with the last and each new dimension aims to explain the remaining variance.
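To make the ‘uncorrelated linear projection’ idea concrete, here is a minimal sketch of my own (not part of the original analysis) that builds the components by hand from the covariance matrix of the centred iris measurements; prcomp does the equivalent computation for us below, just via the SVD.
# Build principal components by hand: project the centred data onto the
# eigenvectors of its covariance matrix
X = scale(as.matrix(iris[, 1:4]), center = TRUE, scale = FALSE)
eig = eigen(cov(X))
scores = X %*% eig$vectors              # principal component scores
round(cor(scores), 3)                   # off-diagonals ~ 0: the new dimensions are uncorrelated
cumsum(eig$values) / sum(eig$values)    # cumulative proportion of variance explained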
The best way to see this is by example. In the following bit of code I’m going to visualise the first two principal components of the Iris dataset, using Sepal.Length, Sepal.Width, Petal.Length and Petal.Width as features and colouring by Species throughout.
# Load the data
data(iris)
set.seed(12)
# Compute the principal components (note: scaling and centering are switched off - see below)
pc = prcomp(iris[,1:4], scale = FALSE, center = FALSE)
# Create a colour vector ready to plot
cols.pca = ifelse(iris$Species == 'setosa','red',ifelse(iris$Species ==
'versicolor','green','blue'))
# Plot the two principal components with colour indicating the species
plot(pc$x[,1], pc$x[,2], col = cols.pca, main = 'Visualising Iris Data with PCA',
     xlab = 'Principal Component 1', ylab = 'Principal Component 2')
legend('topleft',c('setosa','versicolor','virginica'),fill = c('red','green','blue'))
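Before moving on, it is worth checking how much of the variance those first two components actually capture. This check is my addition rather than part of the original post, but summary(pc) reports it directly:
# Proportion of variance captured by each component
summary(pc)
# Or equivalently, from the component standard deviations:
pc$sdev^2 / sum(pc$sdev^2)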
Pretty cool, right?! Now why am I talking about plotting PCA results again? Oh yeah - now that we can see our high-dimensional Iris data in 2D, we can visualise just how well the k-means algorithm classifies it.
So let’s give this a go. Our iris variables are not scaled, nor are they centred, which is why I included the scale = FALSE, center = FALSE arguments in prcomp above. This switches off prcomp’s automatic centering (scaling is already off by default), so the components are computed on the data exactly as kmeans will see it, since kmeans doesn’t scale or centre data for you. Scaling and centering are often crucial steps before k-means, but for this purpose I will skip them; a sketch of the scaled alternative follows the cluster counts below.
# Apply the kmeans algorithm to the un-scaled data. We are going to assume we want 3 clusters.
fit.unscaled = kmeans(iris[,1:4], centers = 3, iter.max = 1000)
# Cross-tabulate species against cluster membership
# (stringsAsFactors = TRUE so summary() reports counts per cluster on R >= 4.0)
results = data.frame(cbind(as.vector(iris$Species), fit.unscaled$cluster), stringsAsFactors = TRUE)
aggregate(results$X2, by = list(results$X1), summary)
## Group.1 x.1 x.2 x.3
## 1 setosa 50 0 0
## 2 versicolor 0 48 2
## 3 virginica 0 14 36
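As flagged earlier, here is what the scaled-and-centred alternative might look like. This is my own sketch rather than part of the original analysis, and the cluster labels and counts will likely differ from the unscaled run above:
# For comparison: the same fit on scaled and centred data
fit.scaled = kmeans(scale(iris[,1:4]), centers = 3, iter.max = 1000)
table(iris$Species, fit.scaled$cluster)
With that aside, back to the unscaled fit.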
Now that I’ve run the k-means algorithm and inspected the counts, we can visualise the clustering of the data with respect to the classification variable, Species. Instantly you may notice that the observations labelled ‘virginica’ are split across cluster 2 and cluster 3 (14 and 36 respectively).
# Colour cluster membership according to kmeans
cols.kmeans = ifelse(fit.unscaled$cluster == 1,'red',ifelse(fit.unscaled$cluster==2,'green','blue'))
# Plot our data on PC1 and PC2 again, colouring by Species
plot(pc$x[,1], pc$x[,2], col = cols.pca, main = 'Visualising Iris Data with PCA',
     xlab = 'Principal Component 1', ylab = 'Principal Component 2')
# Colour the PC1 and PC2 points now with respect to the kmeans clustering
points(pc$x[,1],pc$x[,2],col=cols.kmeans, pch = 19, cex = 0.5)
Catch the mis-clustered observations? Perhaps this plot is a little clearer:
# Colour points by whether the k-means cluster colour matches the species colour
cols.class = ifelse(cols.kmeans == cols.pca, 'lightblue','black')
# Plot result of our kmeans classification
plot(pc$x[,1], pc$x[,2], col = cols.class, pch = 19, main = 'K-means classification results',
     xlab = 'Principal Component 1', ylab = 'Principal Component 2',
     sub = 'k-means membership vs. Species in the first two principal components')
# Add a legend
legend('topleft',c('Correct classification','Incorrect classification'),fill = c('lightblue','black'))
Ahh. There it is. We correctly identify ‘setosa’ and do a decent job with ‘versicolor’, yet mis-classify 14 of the 50 virginica observations.
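If you’d rather confirm that count in code than read it off the plot, a quick cross-tabulation of the colour mismatches by species (my addition) does the job:
# Count mis-clustered observations per species
table(iris$Species, mismatch = cols.kmeans != cols.pca)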
This shouldn’t be anything new to many of us, but it serves as a good reminder to check what you’re doing with k-means. Plotting against the principal components gives an intuitive way to understand clustering results on complex data.
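One last note from me, not in the original analysis: kmeans starts from random centres, so a single run can settle in a poor local optimum. Asking for several random starts via nstart is cheap insurance:
# Re-run with multiple random starts; kmeans keeps the run with the best within-cluster fit
fit.multi = kmeans(iris[,1:4], centers = 3, nstart = 25, iter.max = 1000)
table(iris$Species, fit.multi$cluster)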