p-value.info: Why k-means++ works

I've become somewhat obsessed by clustering recently. Like many people I tend to use that general workhorse: k-means (also known as Lloyd's algorithm in comp. sci. circles). It does a pretty good job, on average, and is simple and intuitive:

Initialization step: pick k clusters at random from the training set
Expectation step: assign each member of the training set to the nearest of those k centers
Minimization step: update cluster centers to be the centroid of the training samples assigned to them
repeat the expectation-maximization steps until convergence, i.e. there is no change in the assignments or the cluster centers.

With a few tens of iterations, maybe up to a hundred, the algorithm will have converged to a reasonable set of clusters

if they exist.
if k is appropriate for the data.
if your clusters are roughly spherical.
if dimensionality is not too high / data are not too sparse.

While it is guaranteed to converge, it is not guaranteed to find the global minimum. It can get stuck on local optima. This can happen because of the stochastic nature of the initialization step in which it is possible for two or more of the initial cluster centers to be chosen close together and for k-means to then split the cluster. This is problematic enough that it is typically recommended to run k-means multiple times and to then choose the run with the lowest cost (sum of squared distances between the centers and their associated training samples).

In a key paper by David & Vassilvitskii 2007, k-means++: the advantages of careful seeding, they authors presented a tweak to k-means which (on average) reduced both the cost and the running time of k-means. It sounds too good to be true but it works. They tackled the cluster initialization step to avoid cluster centers being chosen close together. Their idea was as follows:

Pick the first cluster center randomly from the training set.
Compute the distance from each training sample to that first center. Pick the second cluster center in proportion to the squared distances.
Now compute the shortest distance from each of the remaining training samples to all the cluster centers. Pick the next cluster center in proportion to those squared distances.
Repeat step above until k centers have been chosen.

This seems simple enough but what is it doing? The first center is chosen randomly. The second center is chosen to be, on average, far from the first. The third is chosen to be far from both the previous two. And so on. With the centers more or less spread out throughout the dataset, there is less chance of splitting clusters and with the centers being a better starting point, the centers can converge more quickly. This more complex initialization comes at a cost---it is as costly as one iteration of k-means---but overall it converges more quickly and so is cheaper.

Their analyses and data demonstrated clearly that this is a better approach. However, something bugged me about this. There were two aspects:

It is probabilistic. There is still a chance that clusters can be close together.
We choose cluster centers sequentially which is not necessarily the same arrangement that one might choose if we could choose the k clusters simultaneously. That is, imagine your data were uniformly distributed across a disk and k = 3. You choose the first cluster, the expected location of which would be the center of the disk. The second cluster would be at the margin of the disk. Where now would the third cluster be placed? At the margin opposite center #2. The clusters centers are not optimally placed. If we could choose 3 cluster centers at once, we might choose them in an equilateral triangle at 0, 120, and 240 degrees and at say radius 1/sqrt(2).

I started thinking through this. How could you deterministically and globally choose the optimal cluster placement? How about a regular square lattice? No. What do you do with k = 7? A 3 * 3 grid with some cells empty? OK then, how about multidimensional cell packing, similar to soap bubbles? That is tricky, computationally intensive but possibly doable. How about an electron shell type arrangement in which we place a cluster center at the center of the data, a next shell with say 2 centers, the next outer shell with more cluster centers. (These ideas seems a little wacky but they seemed feasible at the time.) However, the more I thought about it, and the more I thought how real world data in not multi-uniform, the more the statistical properties of the k-means++ initialization made sense. This is the real point of this post.

I realized that any approach that attempted to distribute the cluster centers evenly across the span or convex hull of the data is going to be sensitive to outliers. We might start to place cluster centers across very low density or empty areas.

By choosing the data in proportion to the squared distances, we are not just choosing far away points but we are also choosing an area of feature space based on density. Thus, if we have 10 points close together roughly at distance D we are more likely to choose one of them than a single outlier point at distance 2D. This is precisely what we want in a good initialization step. In a sense, the simple procedure is statistically balancing distance among the cluster centers and density of data points, which could represent a cluster per se. Beautiful. As David & Vassilvitskii summarize in one of their presentations: "friends don't let their friends use k-means." Unfortunately, however, it does not seem to have the adoption that it deserves.

By the way, Vassilvitskii and colleagues went on to parallelize k-means++ for big data in a 2012 paper in a version called k-means||.

1 comment:

DavidFebruary 23, 2014 at 12:20 AM
I recently came across your blog and have been reading along. I thought I would leave my first comment. I don’t know promptessay.com what to say except that I have enjoyed reading. Nice blog. I will keep visiting this blog very often.I’ll use this information for my essays:)

Note: Only a member of this blog may post a comment.

Monday, November 26, 2012

Why k-means++ works

1 comment: