
Fast K-Means with Accurate Bounds

James Newling & François Fleuret, Idiap Research Institute (Computer Vision and Learning Group) & EPFL, June 20th, 2016


  1. Fast K-Means with Accurate Bounds. James Newling & François Fleuret, Idiap Research Institute (Computer Vision and Learning Group) & EPFL (École Polytechnique Fédérale de Lausanne). June 20th, 2016.

  2. K-Means: Problem Statement and Lloyd’s Algorithm. Given data $(x_i)_{i=1}^{N} \in (\mathbb{R}^d)^N$, find centers $(c_k)_{k=1}^{K} \in (\mathbb{R}^d)^K$ minimising $\sum_{i=1}^{N} \min_{k=1:K} \|x_i - c_k\|^2$. The problem is NP-hard, so heuristic algorithms such as Lloyd’s are used. Lloyd’s algorithm run for T iterations requires dKNT FLOPs. We are interested in making it faster.

  3. Lloyd’s Algorithm. [Figure: scatter plot; × marks data, • marks centers.]

  4. Lloyd’s Algorithm. Assignment of datapoint at iteration 1. [Figure]

  5. Lloyd’s Algorithm. All assignments at iteration 1. [Figure]

  6. Lloyd’s Algorithm. Updates at iteration 1. [Figure]

  7. Lloyd’s Algorithm. Assignment of datapoint at iteration 2. [Figure]

  8. Lloyd’s Algorithm. All assignments at iteration 2. [Figure]

  9. Lloyd’s Algorithm. Updates at iteration 2. [Figure]

  10. Lloyd’s Algorithm. Assignment of datapoint at iteration 3. [Figure]

  11. Lloyd’s Algorithm. All assignments at iteration 3. [Figure]

  12. Lloyd’s Algorithm. Updates at iteration 3. [Figure]

  13. Lloyd’s Algorithm. Assignment of datapoint at iteration 4. [Figure]

  14. Lloyd’s Algorithm. All assignments at iteration 4. [Figure]

  15. Lloyd’s Algorithm. Updates at iteration 4. [Figure]
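The assignment and update steps walked through on the slides above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `lloyd`, the fixed iteration count and the handling of empty clusters are assumptions made for the example. It computes every data-center distance in every round, which is the dKNT cost the talk sets out to reduce.

```python
import numpy as np

def lloyd(X, C, n_iter=4):
    """Plain Lloyd's algorithm: X is (N, d) data, C is (K, d) initial centers."""
    C = C.astype(float).copy()
    for _ in range(n_iter):
        # Assignment step: squared distance from every point to every center (N x K).
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: move each center to the mean of its assigned points.
        for k in range(C.shape[0]):
            members = X[labels == k]
            if len(members) > 0:  # assumption: an empty cluster keeps its center
                C[k] = members.mean(axis=0)
    # Final assignment and objective value for the returned centers.
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    return C, d2.argmin(axis=1), d2.min(axis=1).sum()

# Usage on two well-separated synthetic blobs (hypothetical data).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(5, 0.1, (20, 2))])
C, labels, energy = lloyd(X, X[[0, 20]])
```

The returned `energy` is the objective from the problem statement, the sum over datapoints of the squared distance to the nearest center.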

  16. Lloyd’s Algorithm: How to Accelerate. Two approaches: (1) approximate it; (2) be more efficient, that is, get exactly the same output as Lloyd’s algorithm without computing all data-center distances. Methods shown: Pelleg et al. (1999); Kanungo et al. (2002); ∆ Elkan (2003), best high-d; ∆ Yinyang (2015), best mid-d; ∆ Hamerly (2010) and ∆ Annular (2013), best low-d.

  17. Lloyd’s Algorithm: How to Accelerate. Same slide, with the note "only exact for next 13 minutes" added after approach (1).

  18. Using The Triangle Inequality: Elkan’s Two Techniques. Elkan uses the triangle inequality in two distinct ways: (1) center-center distances to bound data-center distances; (2) directly maintain bounds on data-center distances. [Figure: datapoints × and centers •, with upper bounds U and lower bounds L.]

  19. Using The Triangle Inequality: Elkan’s Two Techniques. Same slide, adding: (A) We show that (1) + (2) is slower than just (2). Simplifying helps!
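The two uses of the triangle inequality named on this slide can be checked numerically. The sketch below is an illustration under assumed random data, not code from the talk; the variable names (`skippable`, `U`, `L`, `p`) are chosen for the example. Use (1) relies on the fact that if the distance between the assigned center and another center is at least twice the distance from the point to its assigned center, the other center cannot be closer. Use (2) relies on the fact that a center moving by p changes its distance to any point by at most p.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=3)                          # one datapoint
C_old = rng.normal(size=(4, 3))                 # four centers before the update
C_new = C_old + 0.1 * rng.normal(size=(4, 3))   # the same centers after the update

d_old = np.linalg.norm(x - C_old, axis=1)
a = int(d_old.argmin())                         # index of the nearest old center

# (1) Center-center test: if ||c_a - c_j|| >= 2 ||x - c_a||, then c_j cannot
#     be closer to x than c_a, so ||x - c_j|| need not be computed.
cc = np.linalg.norm(C_old[a] - C_old, axis=1)
skippable = cc >= 2 * d_old[a]
skippable[a] = False

# (2) Maintained bounds: add the displacement p to the upper bound and
#     subtract it from the lower bounds.
p = np.linalg.norm(C_new - C_old, axis=1)
U = d_old[a] + p[a]        # upper bound on ||x - c_a|| after the move
L = d_old - p              # lower bounds on ||x - c_j|| after the move
d_new = np.linalg.norm(x - C_new, axis=1)       # true distances, for checking only
```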

  20. Using The Triangle Inequality. Elkan: K − 1 lower bounds. [Figure: datapoint ×, centers •, upper bound U and lower bound L.]

  21. Using The Triangle Inequality. Yinyang: group lower bounds. [Figure]

  22. Using The Triangle Inequality. Hamerly: 1 lower bound. [Figure]

  23-33. Lower bound updating. [Figure, built up over eleven frames: a datapoint × and a growing sequence of • marks, one added per frame.]

  34. Lower bound updating. [Figure: the same plot, labelled with an ns-bound ($\|\sum \cdot\|$) and an sn-bound ($\sum \|\cdot\|$).]

  35. ns-bounds. All upper and lower bounds in Elkan, Hamerly, Yinyang and Annular are sn-bounds (sum of norms, $\sum \|\cdot\|$), and can be replaced by tighter ns-bounds (norm of sum, $\|\sum \cdot\|$). There is a cost to ns-bounds, as additional memory is required to store historical centers from all rounds and to store the round in which bounds are made tight. This memory overhead can be controlled by periodically clearing the history, which requires an sn-bound update.

  36. ns-bounds. Same slide, adding: (B) We show that ns-bounding generally improves algorithms.
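The difference between the two kinds of bound can be seen on a single center. In the sketch below, one bound subtracts the sum of the norms of the per-round displacements and the other subtracts the norm of the net displacement since the bound was last tight. The trajectory `history` is hypothetical random data and the variable names are assumptions for illustration. By the triangle inequality the net displacement is never longer than the summed path, so the second lower bound is at least as tight, at the price of keeping the old center position.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=2)
# Hypothetical trajectory of one center over 6 rounds, shape (rounds, d).
history = np.cumsum(0.2 * rng.normal(size=(6, 2)), axis=0)

d_tight = np.linalg.norm(x - history[0])    # bound made tight in round 0
steps = np.linalg.norm(np.diff(history, axis=0), axis=1)

# Sum-of-norms bound: subtract the total path length travelled by the center.
lower_sn = d_tight - steps.sum()
# Norm-of-sum bound: subtract the net displacement, which needs the stored
# center from the round in which the bound was tight.
lower_ns = d_tight - np.linalg.norm(history[-1] - history[0])

d_now = np.linalg.norm(x - history[-1])     # true current distance, for checking
```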

  37. Hamerly (2010): bound test, failure 1. [Figure: datapoint × among centers •.]

  38. Hamerly (2010): bound test, failure 2. [Figure]

  39. Hamerly (2010): compute all distances. [Figure]

  40. Hamerly (2010): reset bounds. [Figure]
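The four slide captions above follow the per-point assignment step of Hamerly's algorithm as it is usually described: test with the stored upper bound, tighten it and test again, and only then compute all distances and reset both bounds. The sketch below is one reading of that sequence, with a hypothetical function name and return convention; it is not code from the talk.

```python
import numpy as np

def hamerly_assign(x, C, a, u, l, s):
    """One assignment step for a single point x.

    a: index of the currently assigned center; u: upper bound on ||x - C[a]||;
    l: lower bound on the distance from x to every other center;
    s: s[j] is the distance from C[j] to its nearest other center.
    Returns (a, u, l, outcome).
    """
    m = max(l, s[a] / 2)
    if u <= m:                          # bound test passes: assignment unchanged
        return a, u, l, "skip"
    u = np.linalg.norm(x - C[a])        # failure 1: tighten the upper bound
    if u <= m:                          # test passes with the tight bound
        return a, u, l, "tightened"
    d = np.linalg.norm(x - C, axis=1)   # failure 2: compute all distances
    order = np.argsort(d)
    # Reset bounds: nearest distance is the new u, second nearest the new l.
    return int(order[0]), d[order[0]], d[order[1]], "recomputed"

# Usage with three hypothetical centers, each 10 away from its nearest neighbour.
C = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
x = np.array([1.0, 0.0])
s = np.array([10.0, 10.0, 10.0])
skipped = hamerly_assign(x, C, a=0, u=1.5, l=8.0, s=s)
full = hamerly_assign(x, C, a=1, u=9.5, l=0.5, s=s)
```

In the first call the stored bounds already settle the assignment and no distance is computed. In the second call both tests fail, so the point is reassigned from center 1 to center 0.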

  41. Eliminating distance calculations. $c \notin B(x, r) \Rightarrow c \notin \{c_a^{\text{new}}, c_b^{\text{new}}\}$, where $r = \max_{c \in \{c_a^{\text{old}}, c_b^{\text{old}}\}} \|x - c\|$. [Figure: ball of radius r around datapoint ×, with the old and new centers a and b marked.]
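The implication on this slide can be checked on random data. The sketch below assumes that the "old" centers are the previously nearest and second nearest centers, taken at their current positions; that reading, the stale indices `a_old` and `b_old`, and the data are assumptions for illustration. Because those two centers already lie within distance r of x, the true nearest and second nearest centers must lie inside the ball as well. All distances are computed here only to verify the claim; a practical method would avoid that.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=2)
C = rng.normal(scale=3.0, size=(50, 2))     # current center positions

# Hypothetical stale indices: nearest (a) and second nearest (b) centers
# from an earlier round, which may no longer be the two nearest.
a_old, b_old = 7, 23

# Radius from the slide: the larger of the two distances to those centers.
r = max(np.linalg.norm(x - C[a_old]), np.linalg.norm(x - C[b_old]))

d = np.linalg.norm(x - C, axis=1)           # computed only to check the claim
inside = np.flatnonzero(d <= r)             # centers in the ball B(x, r)
a_new, b_new = np.argsort(d)[:2]            # true nearest and second nearest
```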
