Lecture 22: Clustering, Distance Measures, K-Means. Aykut Erdem, May 2016, Hacettepe University.


  1. Lecture 22: Clustering, Distance Measures, K-Means. Aykut Erdem, May 2016, Hacettepe University

  2. Last time… Boosting • Idea: given a weak learner, run it multiple times on (reweighted) training data, then let the learned classifiers vote. • On each iteration t: - Weight each training example by how incorrectly it was classified. - Learn a hypothesis h_t. - Assign a strength α_t to this hypothesis. • Final classifier: a linear combination of the votes of the different classifiers, weighted by their strengths. • Practically useful. • Theoretically interesting. slide by Aarti Singh & Barnabas Poczos

  3. Last time… The AdaBoost Algorithm. slide by Jiri Matas and Jan Šochman

  4. This week • Clustering • Distance measures • K-Means • Spectral clustering • Hierarchical clustering • What is a good clustering?

  5. Distance measures

  6. Distance measures • In studying clustering techniques we will assume that we are given a matrix of distances between all pairs of data points: an m × m matrix whose (i, j) entry is d(x_i, x_j) for data points x_1, …, x_m. slide by Julia Hockenmaier

  7. What is Similarity/Dissimilarity? Hard to define, but we know it when we see it. • The real meaning of similarity is a philosophical question; we will take a more pragmatic approach. • It depends on the representation and the algorithm. For many representations/algorithms, it is easier to think in terms of a distance (rather than a similarity) between vectors. slide by Eric Xing

  8. Defining Distance Measures • Definition: Let O_1 and O_2 be two objects from the universe of possible objects. The distance (dissimilarity) between O_1 and O_2 is a real number denoted by D(O_1, O_2). (Figure: example distances between gene expression profiles gene1 and gene2.) slide by Andrew Moore

  9. A few examples • Euclidean distance: $d(x, y) = \sqrt{\sum_i (x_i - y_i)^2}$ • Correlation coefficient: $s(x, y) = \frac{\sum_i (x_i - \mu_x)(y_i - \mu_y)}{\sigma_x \sigma_y}$ - A similarity rather than a distance. - Can detect similar trends even when magnitudes differ. slide by Andrew Moore
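To make these concrete, here is a minimal Python sketch (our own code, assuming NumPy; not from the lecture) contrasting the Euclidean distance with the correlation coefficient used as a similarity:

```python
import numpy as np

def euclidean(x, y):
    # d(x, y) = sqrt(sum_i (x_i - y_i)^2)
    return np.sqrt(np.sum((x - y) ** 2))

def correlation_similarity(x, y):
    # Pearson correlation: compares trends, not magnitudes;
    # a similarity in [-1, 1], not a distance.
    xc, yc = x - x.mean(), y - y.mean()
    return np.sum(xc * yc) / (np.sqrt(np.sum(xc**2)) * np.sqrt(np.sum(yc**2)))

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 10 * x + 5                       # same trend, very different magnitude
print(euclidean(x, y))               # large distance
print(correlation_similarity(x, y))  # 1.0: perfectly correlated
```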

  10. What properties should a distance measure have? • Symmetry: D(A, B) = D(B, A). - Otherwise, we could say A looks like B but B does not look like A. • Positivity and self-similarity: D(A, B) ≥ 0, and D(A, B) = 0 iff A = B. - Otherwise there will be different objects that we cannot tell apart. • Triangle inequality: D(A, B) + D(B, C) ≥ D(A, C). - Otherwise one could say "A is like B, B is like C, but A is not like C at all". slide by Alan Fern
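These axioms can also be spot-checked empirically on random points; a small sketch (a hypothetical helper of ours, assuming NumPy):

```python
import numpy as np

def check_metric_axioms(dist, n_trials=1000, dim=5, seed=0):
    """Empirically test symmetry, positivity, self-similarity, and the
    triangle inequality for a distance function on random points."""
    rng = np.random.default_rng(seed)
    for _ in range(n_trials):
        a, b, c = rng.normal(size=(3, dim))
        assert np.isclose(dist(a, b), dist(b, a))             # symmetry
        assert dist(a, b) >= 0                                # positivity
        assert np.isclose(dist(a, a), 0.0)                    # self-similarity
        assert dist(a, b) + dist(b, c) >= dist(a, c) - 1e-12  # triangle
    return True

euclidean = lambda x, y: np.linalg.norm(x - y)
print(check_metric_axioms(euclidean))  # True: L2 satisfies all axioms
```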

  11. Distance measures • Euclidean (L2): $d(x, y) = \sqrt{\sum_{i=1}^{d} (x_i - y_i)^2}$ • Manhattan (L1): $d(x, y) = \lVert x - y \rVert_1 = \sum_{i=1}^{d} |x_i - y_i|$ • Infinity (sup) distance (L∞): $d(x, y) = \max_{1 \le i \le d} |x_i - y_i|$ • Note that $d_{L_\infty}(x, y) \le d_{L_2}(x, y) \le d_{L_1}(x, y)$ for any fixed pair of points, but different distances do not induce the same ordering on points. slide by Julia Hockenmaier
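A quick sketch of the three distances (our own helper, assuming NumPy); the asserted inequality holds for any fixed pair of points:

```python
import numpy as np

def lp_distances(x, y):
    diff = np.abs(x - y)
    return {"L1": diff.sum(),               # Manhattan
            "L2": np.sqrt((diff**2).sum()), # Euclidean
            "Linf": diff.max()}             # sup distance

x = np.array([0.0, 0.0])
y = np.array([2.0, -4.0])
d = lp_distances(x, y)
print(d)  # L1 = 6.0, L2 ~ 4.47, Linf = 4.0
assert d["Linf"] <= d["L2"] <= d["L1"]  # always holds for the same pair
```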

  12. Distance measures • Example: x = (x_1, x_2), y = (x_1 - 2, x_2 + 4). - Euclidean: $(4^2 + 2^2)^{1/2} \approx 4.47$ - Manhattan: $4 + 2 = 6$ - Sup: $\max(4, 2) = 4$ slide by Julia Hockenmaier

  13. Distance measures • Different distances do not induce the same ordering on points: - $L_\infty(a, b) = 5$ and $L_2(a, b) = (5^2 + \varepsilon^2)^{1/2}$, just over 5 - $L_\infty(c, d) = 4$ and $L_2(c, d) = (4^2 + 4^2)^{1/2} = 4\sqrt{2} \approx 5.66$ - Hence $L_\infty(c, d) < L_\infty(a, b)$ but $L_2(c, d) > L_2(a, b)$. slide by Julia Hockenmaier
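The reversal is easy to reproduce numerically (a sketch; the concrete coordinates for a, b, c, d are our assumption, chosen so the pairs differ by (5, ε) and (4, 4)):

```python
import numpy as np

eps = 0.1
a, b = np.array([0.0, 0.0]), np.array([5.0, eps])  # differ by (5, eps)
c, d = np.array([0.0, 0.0]), np.array([4.0, 4.0])  # differ by (4, 4)

linf = lambda u, v: np.abs(u - v).max()
l2   = lambda u, v: np.linalg.norm(u - v)

print(linf(a, b), linf(c, d))  # 5.0 > 4.0 : (a, b) farther under L-infinity
print(l2(a, b), l2(c, d))      # ~5.00 < ~5.66 : (c, d) farther under L2
```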

  14. Distance measures • Clustering is sensitive to the distance measure. • Sometimes it is beneficial to use a distance measure that is invariant to transformations that are natural to the problem: - Mahalanobis distance: shift and scale invariance. slide by Julia Hockenmaier

  15. Mahalanobis Distance $d(x, y) = \sqrt{(x - y)^T \Sigma^{-1} (x - y)}$ where Σ is the (symmetric) covariance matrix of the data: $\mu = \frac{1}{m} \sum_{i=1}^{m} x_i$ (the average of the data), $\Sigma = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu)(x_i - \mu)^T$, a matrix of size d × d. This translates all the axes to mean 0 and variance 1 (shift and scale invariance). slide by Julia Hockenmaier
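A minimal sketch of the formula above (our own code, assuming NumPy and an invertible empirical covariance):

```python
import numpy as np

def mahalanobis(x, y, X):
    """Mahalanobis distance between x and y, with the covariance
    estimated from the data matrix X (rows are points)."""
    mu = X.mean(axis=0)
    sigma = (X - mu).T @ (X - mu) / len(X)  # d x d covariance
    sigma_inv = np.linalg.inv(sigma)        # assumes sigma is invertible
    diff = x - y
    return np.sqrt(diff @ sigma_inv @ diff)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2)) * np.array([1.0, 10.0])  # axis 2 has 10x the scale
x, y = X[0], X[1]
print(np.linalg.norm(x - y))  # Euclidean: dominated by the large-scale axis
print(mahalanobis(x, y, X))   # Mahalanobis: rescales each axis by its variance
```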

  16. Distance measures • Some algorithms require a distance between a point x and a set of points A, d(x, A). This might be defined, e.g., as the min/max/avg distance between x and any point in A. • Others require a distance between two sets of points A and B, d(A, B). This might be defined, e.g., as the min/max/avg distance between any point in A and any point in B. slide by Julia Hockenmaier
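Both notions reduce to aggregating pairwise distances; a sketch assuming SciPy (the min/max/avg choices correspond to the single/complete/average linkage used later in hierarchical clustering):

```python
import numpy as np
from scipy.spatial.distance import cdist

def set_distance(A, B, mode="min"):
    """Distance between point sets A and B (rows are points),
    reduced over all cross-pair Euclidean distances."""
    D = cdist(A, B)  # |A| x |B| pairwise distances
    return {"min": D.min(), "max": D.max(), "avg": D.mean()}[mode]

A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[3.0, 0.0], [5.0, 0.0]])
print(set_distance(A, B, "min"))  # 2.0 (single linkage)
print(set_distance(A, B, "max"))  # 5.0 (complete linkage)
print(set_distance(A, B, "avg"))  # 3.5 (average linkage)
# d(x, A): pass x as a one-row array, e.g. set_distance(x[None, :], A, "min")
```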

  17. Clustering algorithms • Partitioning algorithms: construct various partitions and then evaluate them by some criterion. - K-means - Mixture of Gaussians - Spectral clustering • Hierarchical algorithms: create a hierarchical decomposition of the set of objects using some criterion. - Bottom-up: agglomerative - Top-down: divisive slide by Eric Xing

  18. Desirable Properties of a Clustering Algorithm • Scalability (in terms of both time and space) • Ability to deal with different data types • Minimal requirements for domain knowledge to determine input parameters • Ability to deal with noisy data • Interpretability and usability • Optional: incorporation of user-specified constraints. slide by Andrew Moore

  19. K-Means

  20. K-Means • An iterative clustering algorithm: - Initialize: pick K random points as cluster centers (means). - Alternate: (1) assign each data instance to the closest mean; (2) set each mean to the average of its assigned points. - Stop when no point's assignment changes. slide by David Sontag
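This loop translates almost line for line into NumPy; a sketch of Lloyd's algorithm (our own implementation, not code from the lecture):

```python
import numpy as np

def kmeans(X, k, seed=0, max_iter=100):
    """Lloyd's algorithm. X is an (n, d) array; returns (centers, assignments)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # init: K random points
    assign = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its closest mean (O(KN) per iteration).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_assign = dists.argmin(axis=1)
        if assign is not None and np.array_equal(new_assign, assign):
            break  # no assignments changed: converged
        assign = new_assign
        # Update step: move each mean to the average of its assigned points (O(N)).
        for j in range(k):
            if np.any(assign == j):  # leave empty clusters in place
                centers[j] = X[assign == j].mean(axis=0)
    return centers, assign

# Two well-separated blobs; K = 2 should recover them.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, size=(100, 2)), rng.normal(8, 1, size=(100, 2))])
centers, labels = kmeans(X, k=2)
print(np.round(centers, 1))  # two centers, near (0, 0) and (8, 8)
```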


  22. K-Means Clustering: Example • Pick K random points as cluster centers (means). Shown here for K = 2. slide by David Sontag

  23. K-Means Clustering: Example • Iterative Step 1: assign data points to the closest cluster centers. slide by David Sontag

  24. K-Means Clustering: Example • Iterative Step 2: change each cluster center to the average of its assigned points. slide by David Sontag

  25. K-Means Clustering: Example • Repeat until convergence. slide by David Sontag

  26. K-Means Clustering: Example (figure only) slide by David Sontag

  27. K-Means Clustering: Example (figure only) slide by David Sontag

  28. Properties of K-Means Algorithms • Guaranteed to converge in a finite number of iterations. • Running time per iteration: 1. Assigning data points to the closest cluster center: O(KN) time. 2. Changing each cluster center to the average of its assigned points: O(N) time. slide by David Sontag

  29. K-Means Convergence • Objective: the within-cluster sum of squared distances (see below). 1. Fix μ, optimize C: assign each point to its closest center. 2. Fix C, optimize μ: take the partial derivative with respect to μ_i and set it to zero; the optimum is the mean of the assigned points. • K-Means takes an alternating optimization approach; each step is guaranteed to decrease the objective, so the algorithm is guaranteed to converge. slide by Alan Fern
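The equations missing from the extracted slide are the standard ones; filling them in (our reconstruction of the usual derivation):

```latex
% K-means objective: within-cluster sum of squared distances
\min_{\mu, C}\; J(\mu, C) = \sum_{i=1}^{K} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2

% Step 2: fix C, optimize each mu_i
\frac{\partial J}{\partial \mu_i} = -2 \sum_{x \in C_i} (x - \mu_i) = 0
\;\Longrightarrow\;
\mu_i = \frac{1}{\lvert C_i \rvert} \sum_{x \in C_i} x
```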

  30. Demo time…

  31. Example: K-Means for Segmentation • The goal of segmentation is to partition an image into regions, each of which has a reasonably homogeneous visual appearance. (Figure: original image and segmentations for K = 2, K = 3, and K = 10.) slide by David Sontag

  32. Example: K-Means for Segmentation (Figure: original image and segmentations for K = 2, K = 3, and K = 10.) slide by David Sontag

  33. Example: K-Means for Segmentation (Figure: original image and segmentations for K = 2, K = 3, and K = 10.) slide by David Sontag
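One way to reproduce this kind of result, sketched with scikit-learn and Pillow (the lecture does not specify an implementation; `photo.png` is a placeholder path):

```python
import numpy as np
from sklearn.cluster import KMeans
from PIL import Image

img = np.asarray(Image.open("photo.png").convert("RGB"), dtype=float) / 255.0
h, w, _ = img.shape
pixels = img.reshape(-1, 3)  # each pixel is a point in RGB space

for k in (2, 3, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(pixels)
    # Replace every pixel by its cluster center: a K-color segmentation.
    seg = km.cluster_centers_[km.labels_].reshape(h, w, 3)
    Image.fromarray((seg * 255).astype(np.uint8)).save(f"seg_k{k}.png")
```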

  34. Example: Vector quantization • FIGURE 14.9. Sir Ronald A. Fisher (1890-1962) was one of the founders of modern-day statistics, to whom we owe maximum likelihood, sufficiency, and many other fundamental concepts. The image on the left is a 1024 × 1024 grayscale image at 8 bits per pixel. The center image is the result of 2 × 2 block VQ, using 200 code vectors, with a compression rate of 1.9 bits/pixel. The right image uses only four code vectors, with a compression rate of 0.50 bits/pixel. [Figure from Hastie et al. book] slide by David Sontag
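Block VQ is just K-means on image blocks; a sketch of the 2 × 2 case (our own code, assuming scikit-learn and Pillow; `fisher.png` is a placeholder path for a large grayscale image):

```python
import numpy as np
from sklearn.cluster import KMeans
from PIL import Image

img = np.asarray(Image.open("fisher.png").convert("L"), dtype=float)
h, w = img.shape
B = 2  # 2x2 blocks: each block is a 4-dimensional vector

blocks = (img[:h - h % B, :w - w % B]
          .reshape(h // B, B, w // B, B)
          .transpose(0, 2, 1, 3)
          .reshape(-1, B * B))

km = KMeans(n_clusters=200, n_init=4, random_state=0).fit(blocks)
coded = km.cluster_centers_[km.labels_]  # every block -> nearest code vector

recon = (coded.reshape(h // B, w // B, B, B)
              .transpose(0, 2, 1, 3)
              .reshape(h - h % B, w - w % B))
Image.fromarray(recon.clip(0, 255).astype(np.uint8)).save("vq_200.png")
# Rate: log2(200) bits per 4-pixel block, about 1.9 bits/pixel as in the caption.
```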

  35. Bag of Words model • A document represented as a vector of word counts: aardvark 0, about 2, all 2, Africa 1, apple 0, anxious 0, …, gas 1, …, oil 1, …, Zaire 0. slide by Carlos Guestrin
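A bag-of-words vector is just a word histogram over a fixed vocabulary; a minimal sketch (the toy vocabulary and document are our own):

```python
from collections import Counter

vocabulary = ["aardvark", "about", "all", "africa", "apple", "gas", "oil", "zaire"]
doc = "about all the gas and oil in africa , all about supply"

counts = Counter(doc.lower().split())
bow = [counts.get(word, 0) for word in vocabulary]
print(dict(zip(vocabulary, bow)))
# {'aardvark': 0, 'about': 2, 'all': 2, 'africa': 1, 'apple': 0,
#  'gas': 1, 'oil': 1, 'zaire': 0}
```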

  36. (Figure.) slide by Fei Fei Li

  37. Object Bag of 'words' slide by Fei Fei Li

  38. Interest Point Features • Detect patches [Mikolajczyk and Schmid '02] [Matas et al. '02] [Sivic et al. '03] • Normalize patch • Compute SIFT descriptor [Lowe '99] slide by Josef Sivic

  39. Patch Features … slide by Josef Sivic

  40. Dictionary Formation … slide by Josef Sivic

  41. Clustering (usually K-means) … Vector quantization slide by Josef Sivic
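Putting slides 38-41 together: cluster local descriptors to form a visual dictionary, then vector-quantize each image's descriptors into a histogram of visual words. A sketch with scikit-learn; the random arrays stand in for real SIFT descriptors, which the pipeline assumes:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Stand-ins for 128-d SIFT descriptors pooled from many training images.
train_descriptors = rng.normal(size=(5000, 128))

# Dictionary formation: the K-means centers are the "visual words".
vocab_size = 200
km = KMeans(n_clusters=vocab_size, n_init=4, random_state=0).fit(train_descriptors)

def bag_of_visual_words(descriptors):
    """Vector-quantize one image's descriptors into a normalized histogram."""
    words = km.predict(descriptors)  # nearest visual word per patch
    hist = np.bincount(words, minlength=vocab_size).astype(float)
    return hist / hist.sum()

image_descriptors = rng.normal(size=(300, 128))      # one image's local features
print(bag_of_visual_words(image_descriptors).shape)  # (200,)
```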

  42. Clustered Image Patches slide by Fei Fei Li
