 
Lecture 22: Clustering, Distance measures, K-Means
Aykut Erdem, December 2016, Hacettepe University
Last time… Boosting
• Idea: given a weak learner, run it multiple times on (reweighted) training data, then let the learned classifiers vote
• On each iteration t:
- Weight each training example by how incorrectly it was classified
- Learn a hypothesis $h_t$
- Learn a strength for this hypothesis, $\alpha_t$
• Final classifier:
- A linear combination of the votes of the different classifiers, weighted by their strength
• Practically useful
• Theoretically interesting
slide by Aarti Singh & Barnabas Poczos
Last time… The AdaBoost Algorithm
slide by Jiri Matas and Jan Šochman
This week
• Distance measures
• K-Means
• Spectral clustering
• Hierarchical clustering
• What is a good clustering?
Distance measures
Distance measures
• In studying clustering techniques we will assume that we are given a matrix of distances between all pairs of data points $x_1, \dots, x_m$: an $m \times m$ matrix whose entry in row $i$ and column $j$ is $d(x_i, x_j)$.
slide by Julia Hockenmeier
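As a rough illustration of the distance matrix described above, the following sketch builds it with NumPy for the Euclidean distance. The function name `pairwise_distances` and the three sample points are assumptions made for this example, not part of the lecture.

```python
import numpy as np

def pairwise_distances(X):
    """Return the m x m matrix D with D[i, j] = Euclidean distance between rows i and j of X."""
    # Broadcasting yields an (m, m, d) array of coordinate differences.
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

# Three hypothetical 2-D points, chosen so the distances are easy to verify by hand.
X = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
D = pairwise_distances(X)
print(D)
```

The matrix is symmetric with a zero diagonal, so in practice only the upper triangle needs to be stored.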
What is Similarity/Dissimilarity?
Hard to define! But we know it when we see it
• The real meaning of similarity is a philosophical question. We will take a more pragmatic approach.
• Depends on representation and algorithm. For many representations and algorithms, it is easier to think in terms of a distance (rather than a similarity) between vectors.
slide by Eric Xing
Defining Distance Measures
• Definition: Let $O_1$ and $O_2$ be two objects from the universe of possible objects. The distance (dissimilarity) between $O_1$ and $O_2$ is a real number denoted by $D(O_1, O_2)$.
[Figure: two objects, gene1 and gene2, with candidate distance values 0.23, 3, 342.7]
slide by Andrew Moore
A few examples:
• Euclidean distance: $d(x, y) = \sqrt{\sum_i (x_i - y_i)^2}$
• Correlation coefficient: $s(x, y) = \frac{\sum_i (x_i - \mu_x)(y_i - \mu_y)}{\sigma_x \sigma_y}$, where $\mu_x$ is the mean of the entries of $x$ and $\sigma_x = \sqrt{\sum_i (x_i - \mu_x)^2}$ (likewise for $y$), so that $s(x, y) \in [-1, 1]$
- Similarity rather than distance
- Can determine similar trends
slide by Andrew Moore
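To make the contrast between the two measures concrete, here is a minimal sketch in plain Python. The vectors are made up for illustration: `y` follows the same trend as `x` (it is `2x + 10`), so the correlation is 1 even though the Euclidean distance is large.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def correlation(x, y):
    """Pearson correlation coefficient; a similarity in [-1, 1], not a distance."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return num / (sx * sy)

x = [1.0, 2.0, 3.0, 4.0]
y = [12.0, 14.0, 16.0, 18.0]  # hypothetical data: same trend as x, different offset and scale

print(euclidean(x, y))    # large: the vectors are far apart
print(correlation(x, y))  # close to 1: the vectors share the same trend
```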
What properties should a distance measure have?
• Symmetric
- D(A, B) = D(B, A)
- Otherwise, we can say A looks like B but B does not look like A
• Positivity, and self-similarity
- D(A, B) ≥ 0, and D(A, B) = 0 iff A = B
- Otherwise there will be different objects that we cannot tell apart
• Triangle inequality
- D(A, B) + D(B, C) ≥ D(A, C)
- Otherwise one can say “A is like B, B is like C, but A is not like C at all”
slide by Alan Fern
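The three properties above can be checked empirically on a finite sample of points. The sketch below is only a sanity check under that assumption (passing it does not prove a function is a metric everywhere); the helper name `satisfies_metric_axioms` and the sample points are hypothetical. It shows that the Euclidean distance passes while the squared Euclidean distance violates the triangle inequality.

```python
import itertools
import math

def euclidean(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def squared_euclidean(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))

def satisfies_metric_axioms(points, dist, tol=1e-9):
    """Check symmetry, positivity/self-similarity, and the triangle inequality on a finite sample."""
    for a, b in itertools.product(points, repeat=2):
        d = dist(a, b)
        if d < 0 or abs(d - dist(b, a)) > tol:
            return False          # negative distance or asymmetric
        if (d <= tol) != (a == b):
            return False          # D(A, B) = 0 must hold exactly when A = B
    for a, b, c in itertools.product(points, repeat=3):
        if dist(a, b) + dist(b, c) < dist(a, c) - tol:
            return False          # triangle inequality violated
    return True

points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (0.0, 3.0)]
print(satisfies_metric_axioms(points, euclidean))
print(satisfies_metric_axioms(points, squared_euclidean))
```

For the squared distance, the three collinear points give 1 + 1 < 4, which is the violation the check detects.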
Distance measures
• Euclidean ($L_2$): $d(x, y) = \sqrt{\sum_{i=1}^{d} (x_i - y_i)^2}$
• Manhattan ($L_1$): $d(x, y) = \|x - y\|_1 = \sum_{i=1}^{d} |x_i - y_i|$
• Infinity (Sup) distance ($L_\infty$): $d(x, y) = \max_{1 \le i \le d} |x_i - y_i|$
• Note that $L_\infty \le L_2 \le L_1$, but different distances do not induce the same ordering on points.
slide by Julia Hockenmeier
Distance measures
$x = (x_1, x_2)$, $y = (x_1 - 2, x_2 + 4)$
• Euclidean: $(4^2 + 2^2)^{1/2} = 4.47$
• Manhattan: $4 + 2 = 6$
• Sup: $\max(4, 2) = 4$
slide by Julia Hockenmeier
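The worked example above can be reproduced with a few lines of Python. The concrete coordinates of `x` are an arbitrary choice for this sketch; only the offsets (−2, +4) come from the slide.

```python
import math

def l1(x, y):
    """Manhattan distance."""
    return sum(abs(a - b) for a, b in zip(x, y))

def l2(x, y):
    """Euclidean distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def linf(x, y):
    """Sup (infinity) distance."""
    return max(abs(a - b) for a, b in zip(x, y))

x = (5.0, 1.0)              # arbitrary starting point (assumption)
y = (x[0] - 2, x[1] + 4)    # offsets taken from the example

print(l2(x, y))    # about 4.47
print(l1(x, y))    # 6.0
print(linf(x, y))  # 4.0
```

For any pair of points the three values satisfy sup ≤ Euclidean ≤ Manhattan, as the numbers here show.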
Distance measures
• Different distances do not induce the same ordering on points
- $L_\infty(a, b) = 5$ and $L_2(a, b) = (5^2 + \varepsilon^2)^{1/2}$, which is only slightly more than 5 for small $\varepsilon$
- $L_\infty(c, d) = 4$ and $L_2(c, d) = (4^2 + 4^2)^{1/2} = 4\sqrt{2} = 5.66$
- $L_\infty(c, d) < L_\infty(a, b)$, but $L_2(c, d) > L_2(a, b)$
slide by Julia Hockenmeier
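A small numeric check of this ordering flip, assuming concrete coordinates that match the example (coordinate differences of 5 and ε = 0.1 for the pair a, b, and 4 and 4 for the pair c, d). The points themselves are chosen for illustration.

```python
import math

def l2(x, y):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(x, y)))

def linf(x, y):
    return max(abs(p - q) for p, q in zip(x, y))

eps = 0.1                       # assumed small offset
a, b = (0.0, 0.0), (5.0, eps)
c, d = (0.0, 0.0), (4.0, 4.0)

# Under the sup distance, (c, d) is the closer pair ...
print(linf(c, d), linf(a, b))
# ... but under the Euclidean distance, (a, b) is the closer pair.
print(l2(c, d), l2(a, b))
```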
Distance measures
• Clustering is sensitive to the distance measure.
• Sometimes it is beneficial to use a distance measure that is invariant to transformations that are natural to the problem:
- Mahalanobis distance:
✓ Shift and scale invariance
slide by Julia Hockenmeier
Mahalanobis Distance
$d(x, y) = \sqrt{(x - y)^T \Sigma^{-1} (x - y)}$
$\Sigma$ is a (symmetric) covariance matrix:
$\mu = \frac{1}{m} \sum_{i=1}^{m} x_i$ (average of the data)
$\Sigma = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu)(x_i - \mu)^T$, a matrix of size $d \times d$ for $d$-dimensional data
Effectively rescales all the axes to mean = 0 and variance = 1 (shift and scale invariance)
slide by Julia Hockenmeier
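The following NumPy sketch illustrates the shift and scale invariance, using the standard definition of the Mahalanobis distance with the inverse covariance matrix. The data are synthetic and the helper names `covariance` and `mahalanobis` are assumptions for this example.

```python
import numpy as np

def covariance(X):
    """Sigma = (1/m) * sum_i (x_i - mu)(x_i - mu)^T for the rows x_i of X."""
    mu = X.mean(axis=0)
    centered = X - mu
    return centered.T @ centered / len(X)

def mahalanobis(x, y, cov):
    """sqrt((x - y)^T Sigma^{-1} (x - y)), computed with a linear solve rather than an explicit inverse."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))

# Synthetic 2-D data whose second axis lives on a much larger scale.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2)) * np.array([1.0, 100.0])
d_original = mahalanobis(X[0], X[1], covariance(X))

# Rescale and shift every axis, then recompute with the covariance of the transformed data.
Y = X * np.array([3.0, 0.01]) + np.array([10.0, -5.0])
d_rescaled = mahalanobis(Y[0], Y[1], covariance(Y))

print(d_original, d_rescaled)  # the two values agree up to floating-point error
```

With the identity matrix as covariance, the same function reduces to the ordinary Euclidean distance.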
Distance measures
• Some algorithms require distances between a point x and a set of points A, d(x, A). This might be defined e.g. as the min/max/avg distance between x and any point in A.
• Others require distances between two sets of points A, B, d(A, B). This might be defined e.g. as the min/max/avg distance between any point in A and any point in B.
slide by Julia Hockenmeier
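One possible way to express these point-to-set and set-to-set distances is to pass the reduction (min, max, or average) as a parameter. This is a sketch with hypothetical function names and made-up points, using the Euclidean distance as the underlying point-to-point measure.

```python
import math
from statistics import mean

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def point_to_set(x, A, reduce=min):
    """d(x, A): reduce (min, max or mean) over distances from x to every point of A."""
    return reduce([euclidean(x, a) for a in A])

def set_to_set(A, B, reduce=min):
    """d(A, B): reduce over distances between every point of A and every point of B."""
    return reduce([euclidean(a, b) for a in A for b in B])

x = (1.0, 0.0)
A = [(0.0, 0.0), (4.0, 0.0)]
B = [(0.0, 3.0), (4.0, 3.0)]

print(point_to_set(x, A, min), point_to_set(x, A, max), point_to_set(x, A, mean))
print(set_to_set(A, B, min), set_to_set(A, B, max), set_to_set(A, B, mean))
```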
Clustering algorithms
• Partitioning algorithms
- Construct various partitions and then evaluate them by some criterion
- K-means
- Mixture of Gaussians
- Spectral Clustering
• Hierarchical algorithms
- Create a hierarchical decomposition of the set of objects using some criterion
- Bottom-up: agglomerative
- Top-down: divisive
slide by Eric Xing
Desirable Properties of a Clustering Algorithm
• Scalability (in terms of both time and space)
• Ability to deal with different data types
• Minimal requirements for domain knowledge to determine input parameters
• Ability to deal with noisy data
• Interpretability and usability
• Optional
- Incorporation of user-specified constraints
slide by Andrew Moore
K-Means
K-Means
• An iterative clustering algorithm
- Initialize: Pick K random points as cluster centers (means)
- Alternate:
  • Assign data instances to closest mean
  • Assign each mean to the average of its assigned points
- Stop when no points’ assignments change
slide by David Sontag
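The iterative procedure above can be sketched in NumPy as follows. This is a minimal illustration rather than a reference implementation: the function name `kmeans`, the `seed` and `max_iter` parameters, the choice to keep the old center when a cluster becomes empty, and the synthetic two-blob data are all assumptions made for the example.

```python
import numpy as np

def kmeans(X, k, seed=0, max_iter=100):
    """Plain K-Means: random data points as initial centers, then alternate assign/update."""
    rng = np.random.default_rng(seed)
    # Initialize: pick K random data points as cluster centers.
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its closest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Stop when no point changes its assignment.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Update step: each center becomes the average of its assigned points.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:   # assumption: an empty cluster keeps its old center
                centers[j] = members.mean(axis=0)
    return centers, labels

# Two well-separated synthetic blobs of 20 points each.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 0.1, size=(20, 2)),
               rng.normal(10.0, 0.1, size=(20, 2))])
centers, labels = kmeans(X, k=2)
print(centers)
print(labels)
```

On data this well separated the result does not depend on the initialization; in general K-Means can end in different local optima for different seeds.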
K-Means Clustering: Example
• Pick K random points as cluster centers (means); shown here for K = 2
• Iterative Step 1: Assign data points to closest cluster centers
• Iterative Step 2: Change the cluster center to the average of the assigned points
• Repeat until convergence
slide by David Sontag
Properties of K-Means Algorithms
• Guaranteed to converge in a finite number of iterations
• Running time per iteration:
1. Assign data points to closest cluster center: O(KN) time
2. Change the cluster center to the average of its assigned points: O(N) time
slide by David Sontag
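K-Means Convergence
Objective: $\min_{\mu} \min_{C} \sum_{i=1}^{K} \sum_{x \in C_i} \|x - \mu_i\|^2$
1. Fix $\mu$, optimize $C$: assign each point to the cluster whose center is closest (the assignment step of K-Means)
2. Fix $C$, optimize $\mu$: taking the partial derivative with respect to $\mu_i$ and setting it to zero gives $\mu_i = \frac{1}{|C_i|} \sum_{x \in C_i} x$ (the update step of K-Means)
K-Means takes an alternating optimization approach. Each step is guaranteed not to increase the objective, and there are only finitely many possible assignments, so the algorithm is guaranteed to converge.
slide by Alan Fern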
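The monotone behaviour of the objective can be observed numerically. The sketch below records the within-cluster sum of squared distances after every assignment step and every update step on synthetic data; the data, the value of `k`, the fixed iteration count, and the variable names are assumptions for illustration.

```python
import numpy as np

def objective(X, centers, labels):
    """Sum over all points of the squared distance to the assigned center."""
    return float(((X - centers[labels]) ** 2).sum())

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))   # synthetic data
k = 3
centers = X[rng.choice(len(X), size=k, replace=False)].copy()

history = []
for _ in range(20):
    # Fix mu, optimize C: assign every point to its closest center.
    sq_dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = sq_dists.argmin(axis=1)
    history.append(objective(X, centers, labels))
    # Fix C, optimize mu: move every center to the mean of its points.
    for j in range(k):
        if np.any(labels == j):
            centers[j] = X[labels == j].mean(axis=0)
    history.append(objective(X, centers, labels))

print(history[:6])
```

Each entry of `history` is no larger than the one before it, which is the property the convergence argument relies on.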
Demo time…
K-Means Example Applications
Example: K-Means for Segmentation
[Figure: original image and its segmentations for K = 2, K = 3, and K = 10]
• The goal of segmentation is to partition an image into regions, each of which has a reasonably homogeneous visual appearance.
slide by David Sontag
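A rough sketch of how segmentation by K-Means might look in code: every pixel is treated as a point in RGB space, the pixels are clustered, and each pixel is then replaced by the mean color of its cluster. The tiny synthetic image (dark left half, bright right half), the helper names, and the fixed iteration count are assumptions for this example. Clustering on color alone is a simplification that ignores pixel positions.

```python
import numpy as np

def assign(X, centers):
    """Index of the closest center for every row of X."""
    return ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)

def kmeans(X, k, seed=0, n_iter=20):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        labels = assign(X, centers)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers, assign(X, centers)

# Synthetic 16 x 16 RGB image: dark left half, bright right half, plus a little noise.
rng = np.random.default_rng(0)
image = np.full((16, 16, 3), 0.1)
image[:, 8:, :] = 0.9
image += rng.normal(0.0, 0.02, size=image.shape)

pixels = image.reshape(-1, 3)                  # one row per pixel
centers, labels = kmeans(pixels, k=2)
label_map = labels.reshape(16, 16)             # cluster index per pixel
segmented = centers[labels].reshape(image.shape)  # each pixel replaced by its cluster mean
print(label_map[0])
```

Increasing `k` gives more regions and a result closer to the original image.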
Example: Vector quantization
FIGURE 14.9. Sir Ronald A. Fisher (1890-1962) was one of the founders of modern day statistics, to whom we owe maximum-likelihood, sufficiency, and many other fundamental concepts. The image on the left is a 1024 × 1024 grayscale image at 8 bits per pixel. The center image is the result of 2 × 2 block VQ, using 200 code vectors, with a compression rate of 1.9 bits/pixel. The right image uses only four code vectors, with a compression rate of 0.50 bits/pixel.
[Figure from Hastie et al. book]
slide by David Sontag
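The compression rates quoted in the caption follow from a short calculation: each 2 × 2 block is replaced by the index of its code vector, which takes log2(K) bits for K code vectors, spread over the 4 pixels of the block. The sketch below assumes fixed-length indices and ignores the cost of storing the codebook itself; the function name is hypothetical.

```python
import math

def bits_per_pixel(num_code_vectors, pixels_per_block):
    """Bits needed per pixel when each block is stored as a fixed-length codebook index."""
    return math.log2(num_code_vectors) / pixels_per_block

print(round(bits_per_pixel(200, 4), 2))  # 2 x 2 blocks, 200 code vectors
print(bits_per_pixel(4, 4))              # 2 x 2 blocks, 4 code vectors
```

Both values match the caption (about 1.9 and 0.50 bits/pixel), compared with 8 bits/pixel for the original image.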
Example: Simple Linear Iterative Clustering (SLIC) superpixels
λ: spatial regularization parameter
R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Susstrunk. SLIC Superpixels Compared to State-of-the-art Superpixel Methods, IEEE T-PAMI, 2012
Bag of Words model
aardvark 0
about 2
all 2
Africa 1
apple 0
anxious 0
...
gas 1
...
oil 1
…
Zaire 0
slide by Carlos Guestrin
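A bag-of-words vector like the one above can be produced by counting how often each vocabulary word occurs in a document, ignoring word order. The sketch below uses a shortened, hypothetical vocabulary and a made-up sentence; it is not the document behind the counts on the slide.

```python
import re
from collections import Counter

def bag_of_words(text, vocabulary):
    """Count occurrences of each vocabulary word in text (case-insensitive, order ignored)."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return [counts[word] for word in vocabulary]

vocabulary = ["aardvark", "about", "africa", "all", "gas", "oil", "zaire"]  # hypothetical subset
text = "All about the company: all about gas and oil in Africa."           # made-up document

vector = bag_of_words(text, vocabulary)
print(dict(zip(vocabulary, vector)))
```

Words outside the vocabulary (here "the", "company", "and", "in") are simply dropped.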
[Figure: an object and its bag of ‘words’]
slide by Fei Fei Li
Interest Point Features
• Detect patches [Mikolajczyk and Schmid ’02] [Matas et al. ’02] [Sivic et al. ’03]
• Normalize patch
• Compute SIFT descriptor [Lowe ’99]
slide by Josef Sivic
Patch Features …
slide by Josef Sivic
Dictionary Formation …
slide by Josef Sivic