SLIDE 1

Unsupervised Machine Learning and Data Mining

DS 5230 / DS 4420 - Fall 2018

Lecture 10

Jan-Willem van de Meent

SLIDE 2

Clustering

SLIDE 3

Clustering

  • Unsupervised learning (no labels for training)
  • Group data into similar classes that
    • Maximize intra-cluster similarity
    • Minimize inter-cluster similarity
SLIDE 4

Four Types of Clustering

  • 1. Centroid-based (K-means, K-medoids)

Notion of Clusters: Voronoi tessellation

SLIDE 5

Four Types of Clustering

  • 2. Connectivity-based (Hierarchical)

Notion of Clusters: Cut off dendrogram at some depth

SLIDE 6

Four Types of Clustering

  • 3. Density-based (DBSCAN, OPTICS)

Notion of Clusters: Connected regions of high density

SLIDE 7

Four Types of Clustering

  • 4. Distribution-based (Mixture Models)

Notion of Clusters: Distributions on features

SLIDE 8

Review: K-means Clustering

Objective: sum of squared errors (SSE)

 SSE = Σ_n Σ_k z_nk ‖x_n − μ_k‖²

where z_n is a one-hot assignment and μ_k is the center for cluster k.

Alternate between two steps:

  • 1. Minimize SSE w.r.t. z_n (reassign points to centroids)
  • 2. Minimize SSE w.r.t. μ_k (recompute centers)
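As a concrete illustration of this alternation, here is a minimal NumPy sketch (not from the slides; the function name, seed handling, and empty-cluster guard are illustrative choices):

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Minimal K-means: alternate assignments and centroid updates."""
    rng = np.random.default_rng(seed)
    # Initialize centroids as K distinct random data points
    mu = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iters):
        # Step 1: minimize SSE w.r.t. z_n -- assign each point to its closest centroid
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        z = dists.argmin(axis=1)
        # Step 2: minimize SSE w.r.t. mu_k -- recompute each centroid as the mean
        new_mu = np.array([X[z == k].mean(axis=0) if (z == k).any() else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):  # converged: means unchanged
            break
        mu = new_mu
    return z, mu
```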

SLIDE 9

K-means Clustering

[Figure: data points and centroids μ1, μ2, μ3]

Assign each point to closest centroid, then update centroids to average of points

SLIDE 10

K-means Clustering

[Figure: data points and centroids μ1, μ2, μ3, after one more update]

Assign each point to closest centroid, then update centroids to average of points

SLIDE 11

K-means Clustering

[Figure: data points and centroids μ1, μ2, μ3]

Repeat until convergence
 (no points reassigned, means unchanged)

SLIDE 12

K-means Clustering

[Figure: converged assignment with centroids μ1, μ2, μ3]

Repeat until convergence
 (no points reassigned, means unchanged)

SLIDE 13

“Good” Initialization of Centroids

[Figure: K-means iterations 1–6 from a “good” initialization; x–y scatter plots showing the centroids converging to the true clusters]

SLIDE 14

“Bad” Initialization of Centroids

[Figure: K-means iterations 1–5 from a “bad” initialization; x–y scatter plots showing the centroids settling into a poor solution]

SLIDE 15

Importance of Initial Centroids

What is the chance of randomly selecting
one point from each of K clusters?
(assume each cluster has size n = N/K)

 P = K! n^K / (Kn)^K = K!/K^K

This vanishes rapidly as K grows (e.g. for K = 10, P ≈ 0.00036).

Implication: We will almost always have
 multiple initial centroids in same cluster.
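A quick sanity check of the K!/K^K formula above (this snippet is an added illustration, assuming equal-size clusters as stated):

```python
from math import factorial

# P(one random initial centroid lands in each of K equal-size clusters) = K!/K^K
for K in (2, 5, 10):
    p = factorial(K) / K**K
    print(f"K={K:2d}  P={p:.6f}")
# K=10 gives P ~= 0.000363: a single random init almost never covers all clusters
```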

SLIDE 16

Example: 10 Clusters

5 pairs of clusters, two initial points in each pair

[Figure: K-means iterations 1–4 on the 10-cluster data]

SLIDE 17

Example: 10 Clusters

5 pairs of clusters, two initial points in each pair

[Figure: K-means iterations 1–4 on the 10-cluster data, for another initialization]

SLIDE 18

Importance of Initial Centroids

Initialization tricks

  • Use multiple restarts
  • Initialize with hierarchical clustering
  • Select more than K points,
    keep most widely separated points (see the sketch below)
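A rough sketch of the last trick: oversample candidate points, then greedily keep the most widely separated ones. The function name, oversampling factor, and greedy farthest-point strategy are illustrative assumptions, not prescribed by the slides:

```python
import numpy as np

def spread_out_init(X, K, oversample=10, seed=0):
    """Pick K widely separated centroids from an oversampled candidate set."""
    rng = np.random.default_rng(seed)
    n_cand = min(oversample * K, len(X))
    cand = X[rng.choice(len(X), size=n_cand, replace=False)]
    chosen = [cand[0]]
    while len(chosen) < K:
        # distance from each candidate to its nearest already-chosen centroid
        d = np.min([((cand - c) ** 2).sum(axis=1) for c in chosen], axis=0)
        chosen.append(cand[d.argmax()])  # keep the farthest candidate
    return np.array(chosen)
```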


SLIDE 19

Choosing K

[Figure: the same dataset clustered with K = 1, 2, 3]

  • K=1, SSE=873
  • K=2, SSE=173
  • K=3, SSE=134

SLIDE 20

Choosing K

[Figure: SSE cost function plotted against K for K = 1 to 6]

“Elbow finding” (a.k.a. “knee finding”):
 set K to the value just above the “abrupt” increase

(we’ll talk about better methods later in this course)
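In practice the elbow curve can be computed with a library K-means. A small sketch using scikit-learn, on placeholder data (`inertia_` is scikit-learn's name for the SSE):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(300, 2))  # placeholder data

sse = {}
for K in range(1, 7):
    km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(X)
    sse[K] = km.inertia_  # sum of squared distances to the closest centroid
for K, v in sse.items():
    print(K, round(v, 1))
# Plot SSE against K and pick the K at the "elbow", where the curve flattens.
```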

SLIDE 21

K-means Limitations: Differing Sizes

[Figure: original points vs. K-means with 3 clusters]

SLIDE 22

K-means Limitations: Different Densities

[Figure: original points vs. K-means with 3 clusters]

SLIDE 23

K-means Limitations: Non-globular Shapes

[Figure: original points vs. K-means with 2 clusters]

SLIDE 24

Overcoming K-means Limitations

Intuition: “Combine” smaller clusters into larger clusters

  • One Solution: Hierarchical Clustering
  • Another Solution: Density-based Clustering
SLIDE 25

Hierarchical Clustering

SLIDE 26

Dendrogram

(a.k.a. a similarity tree)

The similarity of A and B is represented as the height D(A, B) of the lowest shared internal node.

(Bovine: 0.69395, (Spider Monkey: 0.390, (Gibbon: 0.36079, (Orang: 0.33636, (Gorilla: 0.17147,
 (Chimp: 0.19268, Human: 0.11927): 0.08386): 0.06124): 0.15057): 0.54939);

SLIDE 27

Dendrogram

Natural when measuring
 genetic similarity, distance 
 to common ancestor

(a.k.a. a similarity tree)


SLIDE 28

Example: Iris data

Iris setosa, Iris versicolor, Iris virginica
 (https://en.wikipedia.org/wiki/Iris_flower_data_set)

SLIDE 29

Hierarchical Clustering

(Euclidean distance; data from https://en.wikipedia.org/wiki/Iris_flower_data_set)

SLIDE 30

Edit Distance

Can be defined for any set of discrete features.

Distance between Patty and Selma:
  • Change dress color, 1 point
  • Change earring shape, 1 point
  • Change hair part, 1 point
 D(Patty, Selma) = 3

Distance between Marge and Selma:
  • Change dress color, 1 point
  • Add earrings, 1 point
  • Decrease height, 1 point
  • Take up smoking, 1 point
  • Lose weight, 1 point
 D(Marge, Selma) = 5

SLIDE 31

Edit Distance for Strings

Peter → Piter → Pioter → Piotr
 Substitution (i for e), Insertion (o), Deletion (e)

  • Transform string Q into string C, using only
    substitution, insertion, and deletion.
  • Assume that each of these operators has a
    cost associated with it.
  • The distance between two strings can be
    defined as the cost of the cheapest transformation from Q to C.

Similarity of “Peter” and “Piotr”?
 With substitution, insertion, and deletion each costing 1 unit, D(Peter, Piotr) = 3.

[Figure: dendrogram of name variants (Piotr, Pyotr, Petros, Pietro, Pedro, Pierre, Piero, Peter)]
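The cheapest-transformation cost can be computed by dynamic programming. A standard Levenshtein-style sketch (added here as an illustration; the unit-cost defaults and parameter names are assumptions):

```python
def edit_distance(q, c, sub=1, ins=1, dele=1):
    """Cheapest transformation of q into c via substitution/insertion/deletion."""
    m, n = len(q), len(c)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i * dele
    for j in range(n + 1):
        D[0][j] = j * ins
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = min(D[i - 1][j] + dele,          # delete q[i-1]
                          D[i][j - 1] + ins,           # insert c[j-1]
                          D[i - 1][j - 1] + (0 if q[i - 1] == c[j - 1] else sub))
    return D[m][n]

print(edit_distance("Peter", "Piotr"))  # 3
```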

SLIDE 32

Hierarchical Clustering

(Edit Distance)

[Figure: dendrogram of name variants clustered by edit distance]

Pedro (Portuguese)

Petros (Greek), Peter (English), Piotr (Polish), Peadar (Irish), Pierre (French), Peder (Danish), Peka (Hawaiian), Pietro (Italian), Piero (Italian Alternative), Petr (Czech), Pyotr (Russian)

Cristovao (Portuguese)

Christoph (German), Christophe (French), Cristobal (Spanish), Cristoforo (Italian), Kristoffer (Scandinavian), Krystof (Czech), Christopher (English)

Miguel (Portuguese)

Michalis (Greek), Michael (English), Mick (Irish)

SLIDE 33

Meaningful Patterns

Pedro (Portuguese/Spanish)

Petros (Greek), Peter (English), Piotr (Polish), Peadar (Irish), Pierre (French), Peder (Danish), Peka (Hawaiian), Pietro (Italian), Piero (Italian Alternative), Petr (Czech), Pyotr (Russian)

Edit distance yields a clustering according to geography.

(Slide from Eamonn Keogh)

SLIDE 34

Spurious Patterns

[Figure: dendrogram of national flags: Anguilla, Australia, St. Helena & Dependencies, South Georgia & South Sandwich Islands, U.K., Serbia & Montenegro (Yugoslavia), France, Niger, India, Ireland, Brazil]

Part of this grouping is spurious; there is no connection between the two groups.

In general, clusterings will only be as meaningful as your distance metric.

SLIDE 35

Spurious Patterns

[Figure: the same flag dendrogram, annotated: one group contains former UK colonies; the other group has no relation]

In general, clusterings will only be as meaningful as your distance metric.

SLIDE 36

“Correct” Number of Clusters

A dendrogram can also be used to determine the “correct” number of clusters.

SLIDE 37

“Correct” Number of Clusters

A dendrogram can also be used to determine the “correct” number of clusters:

Determine the number of clusters by looking at the distances at which clusters merge.

SLIDE 38

Detecting Outliers

[Figure: dendrogram with a single isolated branch labeled “Outlier”]

The single isolated branch is suggestive of a data point that is very different from all others.

SLIDE 39

Bottom-up vs Top-down

The number of dendrograms with n leaves is

 (2n − 3)! / [2^(n−2) (n − 2)!]

  Number of leaves    Number of possible dendrograms
  2                   1
  3                   3
  4                   15
  5                   105
  …                   …
  10                  34,459,425

Since we cannot test all possible trees, we have to heuristically search the space of possible trees. We could do this:

Bottom-Up (agglomerative): Starting with each item in its own cluster, find the best pair to merge into a new cluster. Repeat until all clusters are fused together.

Top-Down (divisive): Starting with all the data in a single cluster, consider every possible way to divide the cluster into two. Choose the best division and recursively operate on both sides.
SLIDE 40

Distance Matrix

[Figure: example objects with pairwise distances, e.g. D(·, ·) = 8 for a distant pair and D(·, ·) = 1 for a close pair]

We begin with a distance matrix which contains the distances between every pair of objects in our database.

SLIDE 41

Bottom-up (Agglomerative Clustering)

[Figure: current set of clusters and partial dendrogram]

Consider all possible merges… choose the best.

SLIDE 42

Bottom-up (Agglomerative Clustering)

[Figure: current set of clusters and partial dendrogram]

Consider all possible merges… choose the best.

SLIDE 43

Bottom-up (Agglomerative Clustering)

[Figure: current set of clusters and partial dendrogram]

Consider all possible merges… choose the best.

SLIDE 44

Bottom-up (Agglomerative Clustering)

[Figure: current set of clusters and partial dendrogram]

Consider all possible merges… choose the best.

SLIDE 45

Bottom-up (Agglomerative Clustering)

[Figure: current set of clusters and partial dendrogram]

Consider all possible merges… choose the best.

Can you now implement this?
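One possible answer: a naive O(n³) sketch over a precomputed distance matrix (the function name, merge bookkeeping, and single/complete-link options are illustrative choices, not the slides' reference implementation):

```python
import numpy as np

def agglomerative(D, linkage="single"):
    """Naive bottom-up clustering on a precomputed distance matrix D.
    Returns the merge sequence as (cluster_a, cluster_b, distance)."""
    D = np.asarray(D, dtype=float).copy()
    clusters = {i: [i] for i in range(len(D))}
    agg = min if linkage == "single" else max   # complete link uses max
    merges = []
    while len(clusters) > 1:
        ids = list(clusters)
        # consider all possible merges... choose the best (closest pair)
        a, b = min(((x, y) for i, x in enumerate(ids) for y in ids[i + 1:]),
                   key=lambda pair: D[pair[0], pair[1]])
        merges.append((a, b, D[a, b]))
        # update distances from the merged cluster to every remaining cluster
        for q in ids:
            if q not in (a, b):
                D[a, q] = D[q, a] = agg(D[a, q], D[b, q])
        clusters[a] += clusters.pop(b)
    return merges
```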

SLIDE 46

Bottom-up (Agglomerative Clustering)

[Figure: current set of clusters and partial dendrogram]

Consider all possible merges… choose the best.

Distances between examples (can be calculated using a metric)

SLIDE 47

Bottom-up (Agglomerative Clustering)

[Figure: current set of clusters and partial dendrogram]

Consider all possible merges… choose the best.

How do we calculate the 
 distance to a cluster?

SLIDE 48

Clustering Criteria

Single link (closest point):

 d(A, B) = min_{a∈A, b∈B} d(a, b)

Complete link (furthest point):

 d(A, B) = max_{a∈A, b∈B} d(a, b)

Group average (average distance):

 d(A, B) = (1/|A||B|) Σ_{a∈A, b∈B} d(a, b)

Centroid (distance of averages):

 d(A, B) = d(μ_A, μ_B), where μ_X = (1/|X|) Σ_{x∈X} x

Ward (intra-cluster variance):

 S_{A∪B} = Σ_{x∈A∪B} d(x, μ_{A∪B})²

(Ward’s method merges the pair of clusters that produces the smallest increase in intra-cluster variance.)
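For reference, each of these criteria corresponds to a `method` option in SciPy's hierarchical clustering. A usage sketch on placeholder data (the data and the choice of 3 clusters are assumptions for illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).normal(size=(20, 2))  # placeholder data

# Each criterion above maps onto a `method` argument in SciPy
for method in ("single", "complete", "average", "centroid", "ward"):
    Z = linkage(X, method=method)                    # (n-1) x 4 merge table
    labels = fcluster(Z, t=3, criterion="maxclust")  # cut into 3 clusters
    print(method, np.bincount(labels)[1:])           # cluster sizes
```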

SLIDE 49

Lance-Williams Methods

Recursively update the proximity of a merged cluster R := A ∪ B to every existing cluster Q:

 p(R, Q) = α_A p(A, Q) + α_B p(B, Q) + β p(A, B) + γ |p(A, Q) − p(B, Q)|

  Clustering method | α_A                      | α_B                      | β                      | γ
  Single link       | 1/2                      | 1/2                      | 0                      | −1/2
  Complete link     | 1/2                      | 1/2                      | 0                      | 1/2
  Group average     | m_A/(m_A+m_B)            | m_B/(m_A+m_B)            | 0                      | 0
  Centroid          | m_A/(m_A+m_B)            | m_B/(m_A+m_B)            | −m_A m_B/(m_A+m_B)²    | 0
  Ward’s            | (m_A+m_Q)/(m_A+m_B+m_Q)  | (m_B+m_Q)/(m_A+m_B+m_Q)  | −m_Q/(m_A+m_B+m_Q)     | 0

(m_X denotes the size of cluster X. Single, complete, and group-average link apply to any distance d(x, y) = |x − y|; the centroid and Ward’s updates assume squared Euclidean distance d(x, y) = |x − y|².)
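The update rule itself is a one-liner. A small added sketch showing how the single- and complete-link coefficient settings recover the min and max of the two proximities:

```python
def lance_williams(pAQ, pBQ, pAB, alphaA, alphaB, beta=0.0, gamma=0.0):
    """Generic Lance-Williams update: proximity of merged cluster R = A u B
    to an existing cluster Q, computed from the pre-merge proximities."""
    return alphaA * pAQ + alphaB * pBQ + beta * pAB + gamma * abs(pAQ - pBQ)

# Single link: alpha = 1/2, gamma = -1/2  ==>  min(pAQ, pBQ)
print(lance_williams(2.0, 5.0, 1.0, 0.5, 0.5, gamma=-0.5))  # 2.0
# Complete link: alpha = 1/2, gamma = +1/2  ==>  max(pAQ, pBQ)
print(lance_williams(2.0, 5.0, 1.0, 0.5, 0.5, gamma=0.5))   # 5.0
```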

SLIDE 50

Hierarchical Clustering Summary

+ No need to specify number of clusters
+ Hierarchical structure maps nicely onto
  human intuition in some domains

− Scaling: time complexity at least O(n²)
  in number of examples
− Heuristic search method:
  local optima are a problem
− Interpretation of results is (very) subjective
SLIDE 51

Density-based Clustering

SLIDE 52

DBSCAN

(one of the most-cited clustering methods)

[Figure: DBSCAN discovers arbitrarily shaped clusters and marks noise]

SLIDE 53

DBSCAN

Intuition

  • A cluster is a region of high density
  • Noise points lie in regions of low density

[Figure: arbitrarily shaped clusters and noise points]

SLIDE 54

Defining “High Density”

Naïve approach: for each point in a cluster there are at least a minimum number (MinPts)
 of points in an Eps-neighborhood of that point.

[Figure: a cluster with Eps-neighborhoods drawn around its points]

SLIDE 55

Defining “High Density”

Eps-neighborhood of a point p:

 N_Eps(p) = { q ∈ D | dist(p, q) ≤ Eps }

[Figure: point p and its Eps-neighborhood]

SLIDE 56

Defining “High Density”

  • In each cluster there are two kinds of points:
    – points inside the cluster (core points)
    – points on the border (border points)

[Figure: a cluster with core and border points marked]

An Eps-neighborhood of a border point contains significantly fewer points than an Eps-neighborhood of a core point.

SLIDE 57

Defining “High Density”

For every point p in a cluster C there is a point q ∈ C, so that (1) p is inside of the Eps-neighborhood of q and (2) N_Eps(q) contains at least MinPts points.

[Figure: core point q with p ∈ N_Eps(q) and |N_Eps(q)| = 6 ≥ 5 = MinPts]

Border points are connected to core points; core points have high density.

This gives a better notion of a cluster.

SLIDE 58

Density Reachability

Definition (directly density-reachable) A point p is directly density-reachable from a point q with regard to the parameters Eps and MinPts if
 1) p ∈ N_Eps(q) (reachability)
 2) | N_Eps(q) | ≥ MinPts (core point condition)

Example (MinPts = 5):
 p is directly density-reachable from q, since p ∈ N_Eps(q)
 and | N_Eps(q) | = 6 ≥ 5 = MinPts (core point condition).
 q is not directly density-reachable from p, since
 | N_Eps(p) | = 4 < 5 = MinPts (core point condition fails).

Note: This is an asymmetric relationship.

SLIDE 59

Density Reachability

Definition (density-reachable) A point p is density-reachable from a point q with regard to the parameters Eps and MinPts if there is a chain of points p1, p2, …, ps with p1 = q and ps = p, such that p_{i+1} is directly density-reachable from p_i for all 1 ≤ i ≤ s − 1.

Example (MinPts = 5): | N_Eps(q) | = 5 ≥ 5 = MinPts and | N_Eps(p1) | = 6 ≥ 5 = MinPts (core point conditions), so p is density-reachable from q via p1.

SLIDE 60

Density Connectivity

Definition (density-connected) A point p is density-connected to a point q with regard to the parameters Eps and MinPts if there is a point v such that both p and q are density-reachable from v.

[Figure: p and q both density-reachable from a common point v (MinPts = 5)]

Note: This is a symmetric relationship

SLIDE 61

Definition of a Cluster

A cluster with regard to the parameters Eps and MinPts is a non-empty subset C of the database D with
 1) For all p, q ∈ D: if p ∈ C and q is density-reachable from p with regard to the parameters Eps and MinPts, then q ∈ C. (Maximality)
 2) For all p, q ∈ C: the point p is density-connected to q with regard to the parameters Eps and MinPts. (Connectivity)

SLIDE 62

Definition of Noise

[Figure: clusters and noise points]

Let C1, …, Ck be the clusters of the database D with regard to the parameters Eps_i and MinPts_i (i = 1, …, k). The set of points in the database D not belonging to any cluster C1, …, Ck is called noise:

 Noise = { p ∈ D | p ∉ Ci for all i = 1, …, k }

SLIDE 63

DBSCAN Algorithm

(1) Start with an arbitrary point p from the database and retrieve all points density-reachable from p with regard to Eps and MinPts.
(2) If p is a core point, the procedure yields a cluster with regard to Eps and MinPts, and all points in the cluster are classified.
(3) If p is a border point, no points are density-reachable from p, and DBSCAN visits the next unclassified point in the database.
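A compact NumPy sketch of this procedure (added illustration: Eps-neighborhoods are precomputed from a dense distance matrix, and the label convention of −1 for noise is an assumption, matching scikit-learn's):

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Minimal DBSCAN: returns labels with -1 = noise, clusters numbered from 0."""
    n = len(X)
    dist = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    neighbors = [np.flatnonzero(dist[i] <= eps) for i in range(n)]  # Eps-neighborhoods
    labels = np.full(n, -1)
    visited = np.zeros(n, dtype=bool)
    cluster = 0
    for p in range(n):
        if visited[p] or len(neighbors[p]) < min_pts:
            continue  # border/noise point: may be claimed later by a core point
        stack = [p]   # p is an unvisited core point: grow a new cluster from it
        while stack:
            q = stack.pop()
            if visited[q]:
                continue
            visited[q] = True
            labels[q] = cluster
            if len(neighbors[q]) >= min_pts:   # only core points expand the cluster
                stack.extend(neighbors[q])
        cluster += 1
    return labels
```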

SLIDE 64

DBSCAN Algorithm

[Figure: original points vs. point types: core, border, and noise]

SLIDE 65

DBSCAN Complexity

  • Time complexity: O(N²) if done naively,
    O(N log N) when using a spatial index
    (works in relatively low dimensions)
  • Space complexity: O(N)
SLIDE 66

DBSCAN Strengths

+ Resistant to noise
+ Can handle arbitrary shapes

[Figure: original points vs. recovered clusters]

SLIDE 67

DBSCAN Weaknesses

  • Varying densities
  • High dimensional data
  • Overlapping clusters

[Figure: ground truth vs. clusterings with MinPts = 4, Eps = 9.92 and MinPts = 4, Eps = 9.75]

SLIDE 68

Determining EPS and MINPTS

[Figure: sorted k-th nearest neighbor distances; the “jump” separates cluster points (cluster 1, cluster 2) from noise]

  • Calculate distance of k-th nearest
    neighbor for each point
  • Plot in ascending / descending order
  • Set Eps to max distance before “jump”
    (see the sketch below)
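A sketch of the k-distance computation with scikit-learn (the placeholder data and the choice k = MinPts = 4 are assumptions for illustration):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.random.default_rng(0).normal(size=(500, 2))  # placeholder data
k = 4  # common heuristic: k = MinPts

nn = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1: each point is its own neighbor
dists, _ = nn.kneighbors(X)
k_dist = np.sort(dists[:, -1])[::-1]             # k-th NN distance, descending
# Plot k_dist and set Eps to the distance just before the curve "jumps";
# points left of the jump are then treated as noise.
print(k_dist[:10])
```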
SLIDE 69

K-means vs DBSCAN

[Figure: side-by-side comparison of K-means and DBSCAN clusterings]