Geometric methods to accelerate $k$-means algorithms

Ryšavý, Petr and Hamerly, Greg

The $k$-means algorithm is popular for data clustering applications. Most implementations use Lloyd's algorithm, which does many unnecessary distance calculations. Several accelerated algorithms (Elkan's, Hamerly's, heap, etc.) have recently been developed which produce exactly the same answer as Lloyd's, only faster. They avoid redundant work using the triangle inequality paired with a set of lower and upper bounds on point-centroid distances. In this paper we propose several novel methods that allow those accelerated algorithms to perform even better, giving up to eight times further speedup. Our methods give tighter lower bound updates, efficiently skip centroids that cannot possibly be close to a set of points, keep extra information about upper bounds to help the heap algorithm avoid more distance computations, and decrease the number of distance calculations that are done in the first iteration.
@inbook{sdm2016,
  author = {Ryšavý, Petr and Hamerly, Greg},
  title = {Geometric methods to accelerate $k$-means algorithms},
  booktitle = {Proceedings of the 2016 SIAM International Conference on Data Mining},
  chapter = {},
  year = {2016},
  month = may,
  pages = {324-332},
  doi = {10.1137/1.9781611974348.37},
  url = {https://epubs.siam.org/doi/abs/10.1137/1.9781611974348.37},
  eprint = {https://epubs.siam.org/doi/pdf/10.1137/1.9781611974348.37}
}