Fast K-Means with Accurate Bounds

James Newling & François Fleuret, Idiap Research Institute, Computer Vision and Learning Group & EPFL (École Polytechnique Fédérale de Lausanne). June 20th, 2016

K-Means Problem Statement and Lloyd’s Algorithm

Given data $(x_i)_{i=1}^{N} \in (\mathbb{R}^d)^N$, find centers $(c_k)_{k=1}^{K} \in (\mathbb{R}^d)^K$ minimising

$$\sum_{i=1}^{N} \min_{k = 1:K} \|x_i - c_k\|^2 .$$

• The problem is NP-hard, so heuristic algorithms such as Lloyd’s are used
• Lloyd’s algorithm run for T iterations requires dKNT FLOPs
• We are interested in making it faster
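As a rough illustration of where the dKNT cost comes from, here is a minimal pure-Python sketch of Lloyd’s algorithm. The function names (`sq_dist`, `lloyd`, `energy`) and the choice to leave an empty cluster's center where it was are assumptions made for this example, not something taken from the slides.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points given as sequences."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def lloyd(data, centers, n_iters):
    """Run n_iters rounds of Lloyd's algorithm; returns (centers, labels)."""
    centers = [list(c) for c in centers]
    labels = [0] * len(data)
    for _ in range(n_iters):
        # Assignment step: N * K distance computations, each costing O(d).
        for i, x in enumerate(data):
            labels[i] = min(range(len(centers)),
                            key=lambda k: sq_dist(x, centers[k]))
        # Update step: each center moves to the mean of its assigned points.
        for k in range(len(centers)):
            members = [x for x, lab in zip(data, labels) if lab == k]
            if members:  # assumption: an empty cluster keeps its old center
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return centers, labels


def energy(data, centers):
    """K-means objective: sum of squared distances to the nearest center."""
    return sum(min(sq_dist(x, c) for c in centers) for x in data)
```

The assignment step is the part the accelerated algorithms target: it is the only place where all N × K data-center distances are computed.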

Lloyd’s Algorithm

[Animation: data (×) and centers (•) over iterations 1 to 4. Each iteration shows three frames: assignment of a single datapoint, all assignments, and the center updates.]

Lloyd’s Algorithm: How to Accelerate

Two approaches:
(1) approximate it
(2) be more efficient: get exactly the same output as Lloyd’s algorithm without computing all data-center distances

Only exact methods, approach (2), for the next 13 minutes.

• Pelleg et al. (1999)
• Kanungo et al. (2002)
• ∆ Elkan (2003): best high-d
• ∆ Yinyang (2015): best mid-d
• ∆ Hamerly (2010)
• ∆ Annular (2013): best low-d

Using The Triangle Inequality: Elkan’s Two Techniques

Elkan uses the triangle inequality in two distinct ways:
(1) center-center distances to bound data-center distances
(2) directly maintain bounds on data-center distances

[Figure: datapoints (×) and centers (•), each datapoint with an upper bound U and a lower bound L.]

(A) We show that (1) + (2) is slower than just (2). Simplifying helps!
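The two uses of the triangle inequality can be sketched as follows. This is an illustrative reading rather than Elkan's actual implementation: the function names are hypothetical, `shifts[j]` is assumed to hold the distance center j moved in the last update, and technique (1) is written in its usual form, namely that d(c_a, c_j) ≥ 2 d(x, c_a) implies d(x, c_j) ≥ d(x, c_a).

```python
import math


def dist(p, q):
    """Euclidean distance between two points given as sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def can_skip_by_center_center(d_x_a, d_a_j):
    """Technique (1): a far-away center j cannot beat the assigned center a.

    d(x, c_j) >= d(c_a, c_j) - d(x, c_a) >= d(x, c_a) when d_a_j >= 2 * d_x_a.
    """
    return d_a_j >= 2.0 * d_x_a


def update_bounds(upper, lowers, assigned, shifts):
    """Technique (2): loosen stored bounds by how far each center moved."""
    new_upper = upper + shifts[assigned]           # U can only grow
    new_lowers = [max(0.0, l - p) for l, p in zip(lowers, shifts)]  # L shrinks
    return new_upper, new_lowers


def can_skip_by_bounds(upper, lower_j):
    """If U <= L_j, center j cannot be strictly closer than the assigned one."""
    return upper <= lower_j
```

Both tests avoid computing d(x, c_j) exactly; they differ in what must be stored and recomputed each round (all center-center distances for the first, per-datapoint bounds for the second).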

Using The Triangle Inequality: lower bounds kept per datapoint

• Elkan: K − 1 lower bounds
• Yinyang: group lower bounds
• Hamerly: 1 lower bound

[Figure, three frames: a datapoint (×) among centers (•), with its upper bound U and the lower bound(s) L kept by each algorithm.]

Lower bound updating

[Animation: a datapoint (×) and the successive positions (•) of a center over several rounds. The final frame contrasts the $\|\sum\cdot\|$-bound with the $\sum\|\cdot\|$-bound.]

$\|\sum\cdot\|$-bounds

All upper and lower bounds in Elkan, Hamerly, Yinyang and Annular are $\sum\|\cdot\|$-bounds (sum of norms: each round, the bound is loosened by the distance the center moved in that round), and can be replaced by tighter $\|\sum\cdot\|$-bounds (norm of sum: the bound is loosened by the net distance the center has moved since the bound was last tight).

$\|\sum\cdot\|$-bounds come at a cost in memory:
• Store historical centers from all rounds
• Store the round in which bounds are made tight

This memory overhead can be controlled by periodically clearing the history, which requires a $\sum\|\cdot\|$-bound update.

(B) We show that $\|\sum\cdot\|$-bounding generally improves algorithms.
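A small worked example may help show why the norm-of-sum bound is tighter than the sum-of-norms bound. The sketch below assumes a center's positions since the bound was last tight are available as a list `history` (a hypothetical name); the function names are likewise made up for illustration.

```python
import math


def norm(v):
    """Euclidean norm of a vector given as a sequence."""
    return math.sqrt(sum(a * a for a in v))


def sn_shift(history):
    """Sum of norms: add up the lengths of the per-round displacements."""
    return sum(norm([b - a for a, b in zip(prev, cur)])
               for prev, cur in zip(history, history[1:]))


def ns_shift(history):
    """Norm of sum: length of the net displacement, first to last position."""
    return norm([b - a for a, b in zip(history[0], history[-1])])


# A center that wanders around three sides of a unit square.
history = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
lower_when_tight = 5.0
lower_sn = lower_when_tight - sn_shift(history)  # loosened by path length
lower_ns = lower_when_tight - ns_shift(history)  # loosened by net movement
```

By the triangle inequality the net displacement never exceeds the path length, so `lower_ns >= lower_sn` always holds. Computing `ns_shift` is what requires keeping the old center positions and the round in which each bound was tight.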

Hamerly (2010)

[Animation, four frames over a datapoint (×) and centers (•): bound test, failure 1; bound test, failure 2; compute all distances; reset bounds.]
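The four frames above can be read as the per-datapoint step of a Hamerly-style algorithm. The following is a minimal sketch under that reading, assuming `upper` bounds the distance to the assigned center and `lower` bounds the distance to every other center; the function name and return format are hypothetical, and at least two centers are assumed.

```python
import math


def dist(p, q):
    """Euclidean distance between two points given as sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def hamerly_point_step(x, centers, assigned, upper, lower):
    """Return (assigned, upper, lower, n_distance_calcs) for one datapoint."""
    if upper <= lower:
        # Bound test passes: the assignment cannot change, no work needed.
        return assigned, upper, lower, 0
    # Failure 1: make the upper bound tight with one distance calculation.
    upper = dist(x, centers[assigned])
    if upper <= lower:
        return assigned, upper, lower, 1
    # Failure 2: compute all distances, then reset both bounds to exact values.
    dists = [dist(x, c) for c in centers]
    order = sorted(range(len(centers)), key=dists.__getitem__)
    nearest, second = order[0], order[1]
    return nearest, dists[nearest], dists[second], 1 + len(centers)
```

In this sketch the full K distance calculations happen only when both tests fail, which is the case the slide on eliminating distance calculations goes on to address.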

Eliminating distance calculations

$$c \notin B(x, r) \;\Rightarrow\; c \notin \{c_a^{\text{new}}, c_b^{\text{new}}\}, \qquad r = \max_{c \in \{c_a^{\text{old}},\, c_b^{\text{old}}\}} \|x - c\|$$

[Figure: a datapoint (×) among centers (•), with $c_a^{\text{old}}$ and $c_b^{\text{old}}$ marked and the ball of radius r around x.]
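The rule above can be checked numerically. The sketch below assumes that a and b index the nearest and second-nearest centers from the previous round, and that r is measured to their current positions; both are interpretations of the slide's notation. For clarity it computes every distance, so it only demonstrates that the rule is safe. How to find the ball's members cheaply is not shown on the slide and is left out here.

```python
import math


def dist(p, q):
    """Euclidean distance between two points given as sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def candidates_in_ball(x, centers, a_old, b_old):
    """Indices of centers inside the closed ball B(x, r).

    r is the larger of the distances from x to the centers indexed by
    a_old and b_old, so the ball always contains at least those two.
    """
    r = max(dist(x, centers[a_old]), dist(x, centers[b_old]))
    return [k for k, c in enumerate(centers) if dist(x, c) <= r]


def two_nearest(x, centers):
    """Brute-force indices of the nearest and second-nearest centers."""
    order = sorted(range(len(centers)), key=lambda k: dist(x, centers[k]))
    return order[0], order[1]


centers = [(0, 0), (3, 0), (10, 0), (-1, 5)]
x = (1, 0)
inside = candidates_in_ball(x, centers, a_old=0, b_old=1)
```

Because at least two centers lie within distance r of x, any center outside the ball is at best third nearest, so the new nearest and second-nearest centers are always among `inside`.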
