SLIDE 1

Unsupervised Machine Learning and Data Mining

DS 5230 / DS 4420 - Fall 2018

Lecture 10

Jan-Willem van de Meent

SLIDE 2

Clustering

SLIDE 3

Clustering

  • Unsupervised learning (no labels for training)
  • Group data into similar classes that
    • Maximize intra-cluster similarity
    • Minimize inter-cluster similarity
SLIDE 4

Four Types of Clustering

  • 1. Centroid-based (K-means, K-medoids)

Notion of Clusters: Voronoi tessellation

SLIDE 5

Four Types of Clustering

  • 2. Connectivity-based (Hierarchical)

Notion of Clusters: Cut off dendrogram at some depth

SLIDE 6

Four Types of Clustering

  • 3. Density-based (DBSCAN, OPTICS)

Notion of Clusters: Connected regions of high density

SLIDE 7

Four Types of Clustering

  • 4. Distribution-based (Mixture Models)

Notion of Clusters: Distributions on features

SLIDE 8

Review: K-means Clustering

Objective: sum of squared errors (SSE)

 SSE = Σ_n Σ_k z_nk ‖x_n − μ_k‖²

where z_n is a one-hot assignment and μ_k is the center for cluster k.

Alternate between two steps:

  • 1. Minimize SSE w.r.t. z_n (reassign points to centroids)
  • 2. Minimize SSE w.r.t. μ_k (recompute centers)
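As a concrete illustration of this alternation, here is a minimal NumPy sketch (not from the slides; the function name, seed handling, and empty-cluster guard are illustrative choices):

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Minimal K-means: alternate assignments and centroid updates."""
    rng = np.random.default_rng(seed)
    # Initialize centroids as K distinct random data points
    mu = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iters):
        # Step 1: minimize SSE w.r.t. z_n -- assign each point to its closest centroid
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        z = dists.argmin(axis=1)
        # Step 2: minimize SSE w.r.t. mu_k -- recompute each centroid as the mean
        new_mu = np.array([X[z == k].mean(axis=0) if (z == k).any() else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):  # converged: means unchanged
            break
        mu = new_mu
    return z, mu
```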

SLIDE 9

K-means Clustering

[Figure: data points and centroids μ1, μ2, μ3]

Assign each point to closest centroid, then update centroids to average of points

SLIDE 10

K-means Clustering

[Figure: data points and centroids μ1, μ2, μ3, after one more update]

Assign each point to closest centroid, then update centroids to average of points

SLIDE 11

K-means Clustering

[Figure: data points and centroids μ1, μ2, μ3]

Repeat until convergence
 (no points reassigned, means unchanged)

SLIDE 12

K-means Clustering

[Figure: converged assignment with centroids μ1, μ2, μ3]

Repeat until convergence
 (no points reassigned, means unchanged)

SLIDE 13

“Good” Initialization of Centroids

[Figure: K-means iterations 1–6 from a “good” initialization; x–y scatter plots showing the centroids converging to the true clusters]

SLIDE 14

“Bad” Initialization of Centroids

[Figure: K-means iterations 1–5 from a “bad” initialization; x–y scatter plots showing the centroids settling into a poor solution]

SLIDE 15

Importance of Initial Centroids

What is the chance of randomly selecting
one point from each of K clusters?
(assume each cluster has size n = N/K)

 P = K! n^K / (Kn)^K = K!/K^K

This vanishes rapidly as K grows (e.g. for K = 10, P ≈ 0.00036).

Implication: We will almost always have
 multiple initial centroids in same cluster.
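A quick sanity check of the K!/K^K formula above (this snippet is an added illustration, assuming equal-size clusters as stated):

```python
from math import factorial

# P(one random initial centroid lands in each of K equal-size clusters) = K!/K^K
for K in (2, 5, 10):
    p = factorial(K) / K**K
    print(f"K={K:2d}  P={p:.6f}")
# K=10 gives P ~= 0.000363: a single random init almost never covers all clusters
```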

SLIDE 16

Example: 10 Clusters

5 pairs of clusters, two initial points in each pair

[Figure: K-means iterations 1–4 on the 10-cluster data]

SLIDE 17

Example: 10 Clusters

5 pairs of clusters, two initial points in each pair

[Figure: K-means iterations 1–4 on the 10-cluster data, for another initialization]

SLIDE 18

Importance of Initial Centroids

Initialization tricks

  • Use multiple restarts
  • Initialize with hierarchical clustering
  • Select more than K points,
    keep most widely separated points (see the sketch below)
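A rough sketch of the last trick: oversample candidate points, then greedily keep the most widely separated ones. The function name, oversampling factor, and greedy farthest-point strategy are illustrative assumptions, not prescribed by the slides:

```python
import numpy as np

def spread_out_init(X, K, oversample=10, seed=0):
    """Pick K widely separated centroids from an oversampled candidate set."""
    rng = np.random.default_rng(seed)
    n_cand = min(oversample * K, len(X))
    cand = X[rng.choice(len(X), size=n_cand, replace=False)]
    chosen = [cand[0]]
    while len(chosen) < K:
        # distance from each candidate to its nearest already-chosen centroid
        d = np.min([((cand - c) ** 2).sum(axis=1) for c in chosen], axis=0)
        chosen.append(cand[d.argmax()])  # keep the farthest candidate
    return np.array(chosen)
```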


SLIDE 19

Choosing K

[Figure: the same dataset clustered with K = 1, 2, 3]

  • K=1, SSE=873
  • K=2, SSE=173
  • K=3, SSE=134

SLIDE 20

Choosing K

[Figure: SSE cost function plotted against K for K = 1 to 6]

“Elbow finding” (a.k.a. “knee finding”):
 set K to the value just above the “abrupt” increase

(we’ll talk about better methods later in this course)
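In practice the elbow curve can be computed with a library K-means. A small sketch using scikit-learn, on placeholder data (`inertia_` is scikit-learn's name for the SSE):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(300, 2))  # placeholder data

sse = {}
for K in range(1, 7):
    km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(X)
    sse[K] = km.inertia_  # sum of squared distances to the closest centroid
for K, v in sse.items():
    print(K, round(v, 1))
# Plot SSE against K and pick the K at the "elbow", where the curve flattens.
```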

SLIDE 21

K-means Limitations: Differing Sizes

[Figure: original points vs. K-means with 3 clusters]

SLIDE 22

K-means Limitations: Different Densities

[Figure: original points vs. K-means with 3 clusters]

SLIDE 23

K-means Limitations: Non-globular Shapes

[Figure: original points vs. K-means with 2 clusters]

SLIDE 24

Overcoming K-means Limitations

Intuition: “Combine” smaller clusters into larger clusters

  • One Solution: Hierarchical Clustering
  • Another Solution: Density-based Clustering
SLIDE 25

Hierarchical Clustering

SLIDE 26

Dendrogram

(a.k.a. a similarity tree)

The similarity of A and B is represented as the height D(A, B) of the lowest shared internal node.

(Bovine: 0.69395, (Spider Monkey: 0.390, (Gibbon: 0.36079, (Orang: 0.33636, (Gorilla: 0.17147,
 (Chimp: 0.19268, Human: 0.11927): 0.08386): 0.06124): 0.15057): 0.54939);

SLIDE 27

Dendrogram

Natural when measuring
 genetic similarity, distance 
 to common ancestor

(a.k.a. a similarity tree)


SLIDE 28

Example: Iris data

Iris setosa, Iris versicolor, Iris virginica
 (https://en.wikipedia.org/wiki/Iris_flower_data_set)

SLIDE 29

Hierarchical Clustering

(Euclidean distance; data from https://en.wikipedia.org/wiki/Iris_flower_data_set)

SLIDE 30

Edit Distance

Can be defined for any set of discrete features.

Distance between Patty and Selma:
  • Change dress color, 1 point
  • Change earring shape, 1 point
  • Change hair part, 1 point
 D(Patty, Selma) = 3

Distance between Marge and Selma:
  • Change dress color, 1 point
  • Add earrings, 1 point
  • Decrease height, 1 point
  • Take up smoking, 1 point
  • Lose weight, 1 point
 D(Marge, Selma) = 5

SLIDE 31

Edit Distance for Strings

Peter → Piter → Pioter → Piotr
 Substitution (i for e), Insertion (o), Deletion (e)

  • Transform string Q into string C, using only
    substitution, insertion, and deletion.
  • Assume that each of these operators has a
    cost associated with it.
  • The distance between two strings can be
    defined as the cost of the cheapest transformation from Q to C.

Similarity of “Peter” and “Piotr”?
 With substitution, insertion, and deletion each costing 1 unit, D(Peter, Piotr) = 3.

[Figure: dendrogram of name variants (Piotr, Pyotr, Petros, Pietro, Pedro, Pierre, Piero, Peter)]
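The cheapest-transformation cost can be computed by dynamic programming. A standard Levenshtein-style sketch (added here as an illustration; the unit-cost defaults and parameter names are assumptions):

```python
def edit_distance(q, c, sub=1, ins=1, dele=1):
    """Cheapest transformation of q into c via substitution/insertion/deletion."""
    m, n = len(q), len(c)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i * dele
    for j in range(n + 1):
        D[0][j] = j * ins
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = min(D[i - 1][j] + dele,          # delete q[i-1]
                          D[i][j - 1] + ins,           # insert c[j-1]
                          D[i - 1][j - 1] + (0 if q[i - 1] == c[j - 1] else sub))
    return D[m][n]

print(edit_distance("Peter", "Piotr"))  # 3
```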

SLIDE 32

Hierarchical Clustering

(Edit Distance)

[Figure: dendrogram of name variants clustered by edit distance]

Pedro (Portuguese)

Petros (Greek), Peter (English), Piotr (Polish), Peadar (Irish), Pierre (French), Peder (Danish), Peka (Hawaiian), Pietro (Italian), Piero (Italian Alternative), Petr (Czech), Pyotr (Russian)

Cristovao (Portuguese)

Christoph (German), Christophe (French), Cristobal (Spanish), Cristoforo (Italian), Kristoffer (Scandinavian), Krystof (Czech), Christopher (English)

Miguel (Portuguese)

Michalis (Greek), Michael (English), Mick (Irish)

SLIDE 33

Meaningful Patterns

Pedro (Portuguese/Spanish)

Petros (Greek), Peter (English), Piotr (Polish), Peadar (Irish), Pierre (French), Peder (Danish), Peka (Hawaiian), Pietro (Italian), Piero (Italian Alternative), Petr (Czech), Pyotr (Russian)

Edit distance yields a clustering according to geography.

(Slide from Eamonn Keogh)

SLIDE 34

Spurious Patterns

[Figure: dendrogram of national flags: Anguilla, Australia, St. Helena & Dependencies, South Georgia & South Sandwich Islands, U.K., Serbia & Montenegro (Yugoslavia), France, Niger, India, Ireland, Brazil]

Part of this grouping is spurious; there is no connection between the two groups.

In general, clusterings will only be as meaningful as your distance metric.

SLIDE 35

Spurious Patterns

[Figure: the same flag dendrogram, annotated: one group contains former UK colonies; the other group has no relation]

In general, clusterings will only be as meaningful as your distance metric.

SLIDE 36

“Correct” Number of Clusters

A dendrogram can also be used to determine the “correct” number of clusters.

SLIDE 37

“Correct” Number of Clusters

A dendrogram can also be used to determine the “correct” number of clusters:

Determine the number of clusters by looking at the distances at which clusters merge.

SLIDE 38

Detecting Outliers

[Figure: dendrogram with a single isolated branch labeled “Outlier”]

The single isolated branch is suggestive of a data point that is very different from all others.

SLIDE 39

Bottom-up vs Top-down

The number of dendrograms with n leaves is

 (2n − 3)! / [2^(n−2) (n − 2)!]

  Number of leaves    Number of possible dendrograms
  2                   1
  3                   3
  4                   15
  5                   105
  …                   …
  10                  34,459,425

Since we cannot test all possible trees, we have to heuristically search the space of possible trees. We could do this:

Bottom-Up (agglomerative): Starting with each item in its own cluster, find the best pair to merge into a new cluster. Repeat until all clusters are fused together.

Top-Down (divisive): Starting with all the data in a single cluster, consider every possible way to divide the cluster into two. Choose the best division and recursively operate on both sides.
SLIDE 40

Distance Matrix

[Figure: example objects with pairwise distances, e.g. D(·, ·) = 8 for a distant pair and D(·, ·) = 1 for a close pair]

We begin with a distance matrix which contains the distances between every pair of objects in our database.

SLIDE 41

Bottom-up (Agglomerative Clustering)

[Figure: current set of clusters and partial dendrogram]

Consider all possible merges… choose the best.

SLIDE 42

Bottom-up (Agglomerative Clustering)

[Figure: current set of clusters and partial dendrogram]

Consider all possible merges… choose the best.

SLIDE 43

Bottom-up (Agglomerative Clustering)

[Figure: current set of clusters and partial dendrogram]

Consider all possible merges… choose the best.

SLIDE 44

Bottom-up (Agglomerative Clustering)

[Figure: current set of clusters and partial dendrogram]

Consider all possible merges… choose the best.

SLIDE 45

Bottom-up (Agglomerative Clustering)

[Figure: current set of clusters and partial dendrogram]

Consider all possible merges… choose the best.

Can you now implement this?
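One possible answer: a naive O(n³) sketch over a precomputed distance matrix (the function name, merge bookkeeping, and single/complete-link options are illustrative choices, not the slides' reference implementation):

```python
import numpy as np

def agglomerative(D, linkage="single"):
    """Naive bottom-up clustering on a precomputed distance matrix D.
    Returns the merge sequence as (cluster_a, cluster_b, distance)."""
    D = np.asarray(D, dtype=float).copy()
    clusters = {i: [i] for i in range(len(D))}
    agg = min if linkage == "single" else max   # complete link uses max
    merges = []
    while len(clusters) > 1:
        ids = list(clusters)
        # consider all possible merges... choose the best (closest pair)
        a, b = min(((x, y) for i, x in enumerate(ids) for y in ids[i + 1:]),
                   key=lambda pair: D[pair[0], pair[1]])
        merges.append((a, b, D[a, b]))
        # update distances from the merged cluster to every remaining cluster
        for q in ids:
            if q not in (a, b):
                D[a, q] = D[q, a] = agg(D[a, q], D[b, q])
        clusters[a] += clusters.pop(b)
    return merges
```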

SLIDE 46

Bottom-up (Agglomerative Clustering)

[Figure: current set of clusters and partial dendrogram]

Consider all possible merges… choose the best.

Distances between examples (can be calculated using a metric)

SLIDE 47

Bottom-up (Agglomerative Clustering)

[Figure: current set of clusters and partial dendrogram]

Consider all possible merges… choose the best.

How do we calculate the 
 distance to a cluster?

SLIDE 48

Clustering Criteria

Single link (closest point):

 d(A, B) = min_{a∈A, b∈B} d(a, b)

Complete link (furthest point):

 d(A, B) = max_{a∈A, b∈B} d(a, b)

Group average (average distance):

 d(A, B) = (1/|A||B|) Σ_{a∈A, b∈B} d(a, b)

Centroid (distance of averages):

 d(A, B) = d(μ_A, μ_B), where μ_X = (1/|X|) Σ_{x∈X} x

Ward (intra-cluster variance):

 S_{A∪B} = Σ_{x∈A∪B} d(x, μ_{A∪B})²

(Ward’s method merges the pair of clusters that produces the smallest increase in intra-cluster variance.)
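For reference, each of these criteria corresponds to a `method` option in SciPy's hierarchical clustering. A usage sketch on placeholder data (the data and the choice of 3 clusters are assumptions for illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).normal(size=(20, 2))  # placeholder data

# Each criterion above maps onto a `method` argument in SciPy
for method in ("single", "complete", "average", "centroid", "ward"):
    Z = linkage(X, method=method)                    # (n-1) x 4 merge table
    labels = fcluster(Z, t=3, criterion="maxclust")  # cut into 3 clusters
    print(method, np.bincount(labels)[1:])           # cluster sizes
```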

SLIDE 49

Lance-Williams Methods

Recursively update the proximity of a merged cluster R := A ∪ B to every existing cluster Q:

 p(R, Q) = α_A p(A, Q) + α_B p(B, Q) + β p(A, B) + γ |p(A, Q) − p(B, Q)|

  Clustering method | α_A                      | α_B                      | β                      | γ
  Single link       | 1/2                      | 1/2                      | 0                      | −1/2
  Complete link     | 1/2                      | 1/2                      | 0                      | 1/2
  Group average     | m_A/(m_A+m_B)            | m_B/(m_A+m_B)            | 0                      | 0
  Centroid          | m_A/(m_A+m_B)            | m_B/(m_A+m_B)            | −m_A m_B/(m_A+m_B)²    | 0
  Ward’s            | (m_A+m_Q)/(m_A+m_B+m_Q)  | (m_B+m_Q)/(m_A+m_B+m_Q)  | −m_Q/(m_A+m_B+m_Q)     | 0

(m_X denotes the size of cluster X. Single, complete, and group-average link apply to any distance d(x, y) = |x − y|; the centroid and Ward’s updates assume squared Euclidean distance d(x, y) = |x − y|².)
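The update rule itself is a one-liner. A small added sketch showing how the single- and complete-link coefficient settings recover the min and max of the two proximities:

```python
def lance_williams(pAQ, pBQ, pAB, alphaA, alphaB, beta=0.0, gamma=0.0):
    """Generic Lance-Williams update: proximity of merged cluster R = A u B
    to an existing cluster Q, computed from the pre-merge proximities."""
    return alphaA * pAQ + alphaB * pBQ + beta * pAB + gamma * abs(pAQ - pBQ)

# Single link: alpha = 1/2, gamma = -1/2  ==>  min(pAQ, pBQ)
print(lance_williams(2.0, 5.0, 1.0, 0.5, 0.5, gamma=-0.5))  # 2.0
# Complete link: alpha = 1/2, gamma = +1/2  ==>  max(pAQ, pBQ)
print(lance_williams(2.0, 5.0, 1.0, 0.5, 0.5, gamma=0.5))   # 5.0
```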

SLIDE 50

Hierarchical Clustering Summary

+ No need to specify number of clusters
+ Hierarchical structure maps nicely onto
  human intuition in some domains

− Scaling: time complexity at least O(n²)
  in number of examples
− Heuristic search method:
  local optima are a problem
− Interpretation of results is (very) subjective
SLIDE 51

Density-based Clustering

SLIDE 52

DBSCAN

(one of the most-cited clustering methods)

[Figure: DBSCAN discovers arbitrarily shaped clusters and marks noise]

SLIDE 53

DBSCAN

Intuition

  • A cluster is a region of high density
  • Noise points lie in regions of low density

[Figure: arbitrarily shaped clusters and noise points]

SLIDE 54

Defining “High Density”

Naïve approach: for each point in a cluster there are at least a minimum number (MinPts)
 of points in an Eps-neighborhood of that point.

[Figure: a cluster with Eps-neighborhoods drawn around its points]

SLIDE 55

Defining “High Density”

Eps-neighborhood of a point p:

 N_Eps(p) = { q ∈ D | dist(p, q) ≤ Eps }

[Figure: point p and its Eps-neighborhood]

SLIDE 56

Defining “High Density”

  • In each cluster there are two kinds of points:
    – points inside the cluster (core points)
    – points on the border (border points)

[Figure: a cluster with core and border points marked]

An Eps-neighborhood of a border point contains significantly fewer points than an Eps-neighborhood of a core point.

SLIDE 57

Defining “High Density”

For every point p in a cluster C there is a point q ∈ C, so that (1) p is inside of the Eps-neighborhood of q and (2) N_Eps(q) contains at least MinPts points.

[Figure: core point q with p ∈ N_Eps(q) and |N_Eps(q)| = 6 ≥ 5 = MinPts]

Border points are connected to core points; core points have high density.

This gives a better notion of a cluster.

SLIDE 58

Density Reachability

Definition (directly density-reachable) A point p is directly density-reachable from a point q with regard to the parameters Eps and MinPts if
 1) p ∈ N_Eps(q) (reachability)
 2) | N_Eps(q) | ≥ MinPts (core point condition)

Example (MinPts = 5):
 p is directly density-reachable from q, since p ∈ N_Eps(q)
 and | N_Eps(q) | = 6 ≥ 5 = MinPts (core point condition).
 q is not directly density-reachable from p, since
 | N_Eps(p) | = 4 < 5 = MinPts (core point condition fails).

Note: This is an asymmetric relationship.

SLIDE 59

Density Reachability

Definition (density-reachable) A point p is density-reachable from a point q with regard to the parameters Eps and MinPts if there is a chain of points p1, p2, …, ps with p1 = q and ps = p, such that p_{i+1} is directly density-reachable from p_i for all 1 ≤ i ≤ s − 1.

Example (MinPts = 5): | N_Eps(q) | = 5 ≥ 5 = MinPts and | N_Eps(p1) | = 6 ≥ 5 = MinPts (core point conditions), so p is density-reachable from q via p1.

SLIDE 60

Density Connectivity

Definition (density-connected) A point p is density-connected to a point q with regard to the parameters Eps and MinPts if there is a point v such that both p and q are density-reachable from v.

[Figure: p and q both density-reachable from a common point v (MinPts = 5)]

Note: This is a symmetric relationship

SLIDE 61

Definition of a Cluster

A cluster with regard to the parameters Eps and MinPts is a non-empty subset C of the database D with
 1) For all p, q ∈ D: if p ∈ C and q is density-reachable from p with regard to the parameters Eps and MinPts, then q ∈ C. (Maximality)
 2) For all p, q ∈ C: the point p is density-connected to q with regard to the parameters Eps and MinPts. (Connectivity)

SLIDE 62

Definition of Noise

[Figure: clusters and noise points]

Let C1, …, Ck be the clusters of the database D with regard to the parameters Eps_i and MinPts_i (i = 1, …, k). The set of points in the database D not belonging to any cluster C1, …, Ck is called noise:

 Noise = { p ∈ D | p ∉ Ci for all i = 1, …, k }

SLIDE 63

DBSCAN Algorithm

(1) Start with an arbitrary point p from the database and retrieve all points density-reachable from p with regard to Eps and MinPts.
(2) If p is a core point, the procedure yields a cluster with regard to Eps and MinPts, and all points in the cluster are classified.
(3) If p is a border point, no points are density-reachable from p, and DBSCAN visits the next unclassified point in the database.
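A compact NumPy sketch of this procedure (added illustration: Eps-neighborhoods are precomputed from a dense distance matrix, and the label convention of −1 for noise is an assumption, matching scikit-learn's):

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Minimal DBSCAN: returns labels with -1 = noise, clusters numbered from 0."""
    n = len(X)
    dist = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    neighbors = [np.flatnonzero(dist[i] <= eps) for i in range(n)]  # Eps-neighborhoods
    labels = np.full(n, -1)
    visited = np.zeros(n, dtype=bool)
    cluster = 0
    for p in range(n):
        if visited[p] or len(neighbors[p]) < min_pts:
            continue  # border/noise point: may be claimed later by a core point
        stack = [p]   # p is an unvisited core point: grow a new cluster from it
        while stack:
            q = stack.pop()
            if visited[q]:
                continue
            visited[q] = True
            labels[q] = cluster
            if len(neighbors[q]) >= min_pts:   # only core points expand the cluster
                stack.extend(neighbors[q])
        cluster += 1
    return labels
```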

SLIDE 64

DBSCAN Algorithm

[Figure: original points vs. point types: core, border, and noise]

SLIDE 65

DBSCAN Complexity

  • Time complexity: O(N²) if done naively,
    O(N log N) when using a spatial index
    (works in relatively low dimensions)
  • Space complexity: O(N)
SLIDE 66

DBSCAN Strengths

+ Resistant to noise
+ Can handle arbitrary shapes

[Figure: original points vs. recovered clusters]

SLIDE 67

DBSCAN Weaknesses

  • Varying densities
  • High dimensional data
  • Overlapping clusters

[Figure: ground truth vs. clusterings with MinPts = 4, Eps = 9.92 and MinPts = 4, Eps = 9.75]

SLIDE 68

Determining EPS and MINPTS

[Figure: sorted k-th nearest neighbor distances; the “jump” separates cluster points (cluster 1, cluster 2) from noise]

  • Calculate distance of k-th nearest
    neighbor for each point
  • Plot in ascending / descending order
  • Set Eps to max distance before “jump”
    (see the sketch below)
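A sketch of the k-distance computation with scikit-learn (the placeholder data and the choice k = MinPts = 4 are assumptions for illustration):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.random.default_rng(0).normal(size=(500, 2))  # placeholder data
k = 4  # common heuristic: k = MinPts

nn = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1: each point is its own neighbor
dists, _ = nn.kneighbors(X)
k_dist = np.sort(dists[:, -1])[::-1]             # k-th NN distance, descending
# Plot k_dist and set Eps to the distance just before the curve "jumps";
# points left of the jump are then treated as noise.
print(k_dist[:10])
```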
SLIDE 69

K-means vs DBSCAN

[Figure: side-by-side comparison of K-means and DBSCAN clusterings]