- **Initialization step**: pick *k* cluster centers at random from the training set.
- **Expectation step**: assign each member of the training set to the nearest of those *k* centers.
- **Minimization step**: update the cluster centers to be the centroids of the training samples assigned to them.
- Repeat the expectation and minimization steps until convergence, i.e. there is no change in the assignments or the cluster centers.
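The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation; the function name and the simple "assignments stopped changing" convergence test are my own choices.

```python
import numpy as np

def kmeans(X, k, rng=None, max_iter=100):
    """Plain k-means with random initialization (a minimal sketch)."""
    rng = np.random.default_rng(rng)
    # Initialization step: pick k training samples at random as centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    assignments = None
    for _ in range(max_iter):
        # Expectation step: assign each sample to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_assignments = dists.argmin(axis=1)
        # Convergence: stop when no assignment changes.
        if assignments is not None and np.array_equal(new_assignments, assignments):
            break
        assignments = new_assignments
        # Minimization step: move each center to the centroid of its samples.
        for j in range(k):
            members = X[assignments == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return centers, assignments
```
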

- if clusters exist.
- if *k* is appropriate for the data.
- if your clusters are roughly spherical.
- if dimensionality is not too high / data are not too sparse.

In a key paper, *k-means++: the advantages of careful seeding* (Arthur & Vassilvitskii, 2007), the authors presented a tweak to k-means which (on average) reduced both the cost *and* the running time of k-means. It sounds too good to be true, but it works. They tackled the cluster initialization step to avoid cluster centers being chosen close together. Their idea was as follows:

- Pick the first cluster center randomly from the training set.
- Compute the distance from each training sample to that first center. Pick the second cluster center in proportion to the squared distances.
- Now compute the shortest distance from each of the remaining training samples to all the cluster centers. Pick the next cluster center in proportion to those squared distances.
- Repeat the step above until *k* centers have been chosen.
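The seeding procedure above can be sketched as follows. This is a rough illustration of the idea rather than the authors' reference implementation, and the function name is my own.

```python
import numpy as np

def kmeans_pp_init(X, k, rng=None):
    """k-means++ seeding: a sketch following the steps above."""
    rng = np.random.default_rng(rng)
    # Pick the first center uniformly at random from the training set.
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Shortest squared distance from each sample to any chosen center.
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        # Pick the next center with probability proportional to d^2.
        probs = d2 / d2.sum()
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)
```

These centers would then be handed to the usual expectation-minimization loop in place of the purely random initialization.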

Their analyses and data clearly demonstrated that this is a better approach. However, two aspects of it bugged me:

- It is probabilistic. There is still a chance that cluster centers end up close together.
- We choose cluster centers sequentially, which is not necessarily the arrangement one would choose if we could place all *k* centers simultaneously. That is, imagine your data were uniformly distributed across a disk and *k* = 3. You choose the first cluster center, the expected location of which would be the center of the disk. The second center would be at the margin of the disk. Where now would the third center be placed? At the margin opposite center #2. The cluster centers are not optimally placed. If we could choose 3 centers at once, we might choose them in an equilateral triangle at 0, 120, and 240 degrees and at, say, radius 1/sqrt(2).

So, could we *deterministically* and *globally* choose the optimal cluster placement? How about a regular square lattice? No. What do you do with *k* = 7? A 3 × 3 grid with some cells empty? OK then, how about multidimensional cell packing, similar to soap bubbles? That is tricky and computationally intensive, but possibly doable. How about an electron-shell-type arrangement, in which we place one cluster center at the center of the data, the next shell with say 2 centers, and the next outer shell with more? (These ideas seem a little wacky, but they seemed feasible at the time.) However, the more I thought about it, and the more I considered how real-world data are not multi-uniform, the more the statistical properties of the k-means++ initialization made sense. This is the real point of this post.

I realized that any approach that attempts to distribute the cluster centers evenly across the span or convex hull of the data will be sensitive to outliers. We might end up placing cluster centers across very low-density or empty areas.

By choosing points in proportion to the squared distances, we are not just choosing far-away points; we are also choosing an area of feature space based on *density*. Thus, if we have 10 points close together at roughly distance *D*, we are more likely to choose one of them than a single outlier point at distance 2*D*. This is precisely what we want in a good initialization step. In a sense, this simple procedure statistically balances distance among the cluster centers against the density of data points, which could represent a cluster *per se*. Beautiful. As Arthur & Vassilvitskii summarize in one of their presentations: "friends don't let their friends use k-means." Unfortunately, however, it does not seem to have the adoption that it deserves.
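The arithmetic behind that 10-points-versus-one-outlier claim is easy to check directly (taking *D* = 1 purely for illustration):

```python
import numpy as np

# Ten clustered points at distance ~D from the current center,
# plus one outlier at distance 2D (here D = 1).
d = np.array([1.0] * 10 + [2.0])

# k-means++ picks the next center with probability proportional to d^2.
probs = d ** 2 / (d ** 2).sum()
p_cluster = probs[:10].sum()  # chance the next center lands in the cluster
p_outlier = probs[10]         # chance it lands on the outlier

# 10 * D^2 = 10 versus (2D)^2 = 4: the dense cluster wins, 10/14 to 4/14.
```

So despite the outlier being twice as far away, the dense group is more than twice as likely to supply the next center.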

By the way, Vassilvitskii and colleagues went on to parallelize k-means++ for big data in a 2012 paper, in a version called k-means||.
