- Initialization step: pick k clusters at random from the training set
- Expectation step: assign each member of the training set to the nearest of those k centers
- Maximization step: update each cluster center to be the centroid of the training samples assigned to it
- Repeat the expectation-maximization steps until convergence, i.e. until there is no change in the assignments or the cluster centers.
This works well, and k-means will find good clusters:
- if they exist.
- if k is appropriate for the data.
- if your clusters are roughly spherical.
- if dimensionality is not too high and the data are not too sparse.
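The loop above can be sketched in a few lines of numpy. This is a minimal illustration, not a production implementation; the function name, the convergence test, and the handling of empty clusters are my own choices:

```python
import numpy as np

def kmeans(X, k, rng=None, max_iter=100):
    """Plain k-means (Lloyd's algorithm) with random initialization."""
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    # Initialization step: pick k samples at random from the training set.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    assignments = np.full(len(X), -1)
    for _ in range(max_iter):
        # Expectation step: assign each sample to the nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_assignments = dists.argmin(axis=1)
        # Convergence: no change in the assignments.
        if np.array_equal(new_assignments, assignments):
            break
        assignments = new_assignments
        # Maximization step: move each center to the centroid of its samples.
        for j in range(k):
            members = X[assignments == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return centers, assignments
```

Running it on two well-separated blobs recovers the blobs regardless of which samples the initialization happens to pick.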
In a key 2007 paper by Arthur & Vassilvitskii, k-means++: the advantages of careful seeding, the authors presented a tweak to k-means that (on average) reduced both the cost and the running time of k-means. It sounds too good to be true, but it works. They tackled the cluster initialization step, to avoid cluster centers being chosen close together. Their idea was as follows:
- Pick the first cluster center randomly from the training set.
- Compute the distance from each training sample to that first center. Pick the second cluster center in proportion to the squared distances.
- Now compute the shortest distance from each of the remaining training samples to all the cluster centers. Pick the next cluster center in proportion to those squared distances.
- Repeat the step above until k centers have been chosen.
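The seeding steps above amount to D² sampling, and can be sketched as follows (a minimal illustration; the function name and signature are my own, and no attempt is made to optimize the distance computation):

```python
import numpy as np

def kmeanspp_seed(X, k, rng=None):
    """k-means++ seeding: sample each new center in proportion to D^2."""
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    # Pick the first cluster center uniformly at random from the training set.
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Shortest squared distance from each sample to the centers chosen so far.
        diffs = X[:, None, :] - np.array(centers)[None, :, :]
        d2 = (diffs ** 2).sum(axis=2).min(axis=1)
        # Pick the next center in proportion to those squared distances.
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)
```

Points coincident with an existing center get weight zero, so the next center is always drawn from territory not yet covered.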
Their analyses and data demonstrated clearly that this is a better approach. However, something bugged me about this. There were two aspects:
- It is probabilistic. There is still a chance that clusters can be close together.
- We choose cluster centers sequentially, which is not necessarily the arrangement we would choose if we could place all k centers simultaneously. That is, imagine your data are uniformly distributed across a disk and k = 3. You choose the first center, whose expected location is the center of the disk. The second center would land at the margin of the disk. Where would the third center be placed? At the margin opposite center #2. These cluster centers are not optimally placed. If we could choose all 3 centers at once, we might place them in an equilateral triangle at 0, 120, and 240 degrees, at, say, radius 1/sqrt(2).
I realized that any approach that attempts to distribute the cluster centers evenly across the span or convex hull of the data will be sensitive to outliers. We might end up placing cluster centers in very low-density or empty regions.
By choosing points in proportion to their squared distances, we are not just choosing far-away points; we are also choosing regions of feature space based on density. Thus, if we have 10 points close together at roughly distance D, we are more likely to choose one of them than a single outlier point at distance 2D. This is precisely what we want in a good initialization step. In a sense, this simple procedure statistically balances distance among the cluster centers against the density of data points, which could itself represent a cluster. Beautiful. As Arthur & Vassilvitskii summarize in one of their presentations: "friends don't let their friends use k-means." Unfortunately, however, it does not seem to have the adoption that it deserves.
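To make the arithmetic concrete, here is the 10-points-versus-one-outlier scenario worked through (a toy computation; the choice D = 1 is mine and only for illustration):

```python
import numpy as np

# Ten points near distance D from the current center, one outlier at 2D.
D = 1.0
d2 = np.array([D**2] * 10 + [(2 * D) ** 2])  # squared distances
probs = d2 / d2.sum()                        # D^2 sampling weights
p_group = probs[:10].sum()   # chance the next center comes from the tight group
p_outlier = probs[10]        # chance it is the lone outlier
print(p_group, p_outlier)    # 10/14 vs 4/14, i.e. roughly 0.71 vs 0.29
```

Even though the outlier is twice as far away, the dense group wins about 71% of the time, which is the density-versus-distance balance described above.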
By the way, Vassilvitskii and colleagues went on to parallelize k-means++ for big data in a 2012 paper in a version called k-means||.