

SLIDE 1

Machine Learning 2

DS 4420 - Spring 2020

Clustering I

Byron C. Wallace

SLIDE 2

Unsupervised learning

  • So far we have reviewed some fundamentals, discussed Maximum Likelihood Estimation (MLE) for probabilistic models, and covered neural networks / backprop / SGD
  • We have mostly considered supervised settings (implicitly), although the above methods are general; we will shift focus to unsupervised learning for a few weeks
  • Both the probabilistic and neural perspectives will continue to be relevant here, and we will consider the former explicitly for clustering next week


SLIDE 5

Clustering

SLIDE 6

Clustering

Unsupervised learning (no labels for training). Group data into similar classes that:

  • Maximize intra-cluster similarity
  • Minimize inter-cluster similarity
SLIDE 8

What is a natural grouping?

[Figure: the same set of characters grouped in different ways: Simpson’s Family, School Employees, Females, Males.]

Choice of clustering criterion can be task-dependent


SLIDE 11

Defining Distance Measures

[Figure: example proximity values 0.2, 3, 342.7; the names "Peter" and "Piotr" as items to compare.]

Dissimilarity/distance: d(x1, x2)
Similarity: s(x1, x2)
Together these are referred to as proximity: p(x1, x2)


SLIDE 14

Distance Measures

Euclidean distance:
  d(x, y) = sqrt( Σ_{i=1..k} (x_i - y_i)² )

Manhattan distance:
  d(x, y) = Σ_{i=1..k} |x_i - y_i|

Minkowski distance:
  d(x, y) = ( Σ_{i=1..k} |x_i - y_i|^q )^(1/q)

Note that Minkowski generalizes the other two: q = 2 recovers Euclidean, q = 1 recovers Manhattan.
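A minimal NumPy sketch of these distances (an illustration, not code from the slides); a single Minkowski routine with q = 2 and q = 1 recovers the other two:

```python
import numpy as np

def minkowski(x, y, q):
    # ( sum_i |x_i - y_i|^q )^(1/q); q = 2 is Euclidean, q = 1 is Manhattan.
    return (np.abs(x - y) ** q).sum() ** (1 / q)

x, y = np.array([0.0, 0.0]), np.array([3.0, 4.0])
print(minkowski(x, y, 2))  # Euclidean: 5.0
print(minkowski(x, y, 1))  # Manhattan: 7.0
```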


SLIDE 17

Similarity over functions of inputs

  • The preceding measures are distances defined on the original input space X
  • A better representation may be some function of these features, φ(x)

SLIDE 18

Similarity: Kernels

  • Radial Basis Function (RBF)
  • Polynomial
  • Linear (inner product)
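Hedged sketches of these three kernels (assumptions: NumPy vectors; gamma, degree, and c are illustrative hyperparameters, not values from the slides):

```python
import numpy as np

def linear_kernel(x, y):
    return x @ y                                   # plain inner product

def polynomial_kernel(x, y, degree=2, c=1.0):
    return (x @ y + c) ** degree                   # polynomial kernel

def rbf_kernel(x, y, gamma=1.0):
    return np.exp(-gamma * np.sum((x - y) ** 2))   # RBF / Gaussian kernel
```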

SLIDE 19

[Figure: the same data shown under a linear kernel vs. an RBF kernel; axes are the first and second features. Figure from the MML book.]

SLIDE 20

Why kernels?

“The key insight in kernel-based learning is that you can rewrite many linear models in a way that doesn’t require you to ever explicitly compute φ(x).”

  • Daumé, CIML
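A small numeric illustration of this point (my example, not the book’s): the degree-2 kernel (x·y)² equals an inner product of explicit quadratic features φ(x) = (x1², √2·x1·x2, x2²), so φ never has to be built:

```python
import numpy as np

def phi(x):
    # Explicit quadratic feature map for 2D inputs (illustrative only).
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print((x @ y) ** 2)     # kernel evaluation -> 1.0
print(phi(x) @ phi(y))  # same value via explicit features -> 1.0
```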

SLIDE 25

Similarities vs. Distance Measures

Similarity functions:
  • Less formal; encode some notion of similarity but are not necessarily well defined
  • Can be negative
  • May not satisfy the triangle inequality

Distance measures satisfy:
  • Symmetry: D(A, B) = D(B, A)
  • Reflexivity: D(A, A) ≥ 0
  • Positivity (separation): D(A, B) = 0 iff A = B
  • Triangle inequality: D(A, B) ≤ D(A, C) + D(B, C)
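A quick numeric sanity check of the contrast above (a sketch; NumPy assumed): Euclidean distance satisfies the axioms on random points, while a raw inner-product similarity can go negative.

```python
import numpy as np

rng = np.random.default_rng(0)
A, B, C = rng.normal(size=(3, 4))
d = lambda u, v: np.linalg.norm(u - v)

assert np.isclose(d(A, B), d(B, A))      # symmetry
assert d(A, A) == 0 and d(A, B) >= 0     # reflexivity / positivity
assert d(A, B) <= d(A, C) + d(B, C)      # triangle inequality
print(A @ (-A))                          # a similarity; negative here
```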

SLIDE 26

Cosine similarity
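As a reminder of the definition: cosine similarity is cos(x, y) = (x · y) / (||x|| ||y||), which lies in [-1, 1] and depends only on the angle between the vectors, not their magnitudes. A minimal sketch (NumPy assumed; the example vectors are illustrative):

```python
import numpy as np

def cosine_similarity(x, y):
    # cos(x, y) = <x, y> / (||x|| * ||y||)
    return (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))

print(cosine_similarity(np.array([1.0, 0.0]), np.array([1.0, 1.0])))  # ~0.707
```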

SLIDE 27

Four Types of Clustering

  • 1. Centroid-based (K-means, K-medoids)
SLIDE 28

Four Types of Clustering

  • 2. Connectivity-based (Hierarchical)

Notion of Clusters: Cut off dendrogram at some depth

SLIDE 29

Four Types of Clustering

  • 3. Density-based (DBSCAN, OPTICS)

Notion of Clusters: Connected regions of high density

SLIDE 30

Four Types of Clustering

  • 4. Distribution-based (Mixture Models)

Notion of Clusters: Distributions on features

SLIDE 31

K-Means clustering (board)

SLIDE 32

K-means Algorithm

Input: X = {x1, x2, ..., xN}, number of clusters K
Initialize: K random centroids μ1, μ2, ..., μK
Repeat until convergence:
  1. (Assignment) For i = 1, ..., K: Ci = { x ∈ X | i = argmin_{1 ≤ j ≤ K} ||x - μj||² }
  2. (Update) For i = 1, ..., K: μi = argmin_z Σ_{x ∈ Ci} ||z - x||²
Output: C1, C2, ..., CK
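A from-scratch sketch of this loop (assumptions: NumPy only; empty clusters keep their old centroid, a detail the slide leaves open):

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize: K random centroids (here, K distinct data points).
    mu = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iters):
        # Step 1: assign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # Step 2: move each centroid to the mean of its cluster.
        new_mu = np.array([X[labels == i].mean(axis=0)
                           if np.any(labels == i) else mu[i]
                           for i in range(K)])
        if np.allclose(new_mu, mu):   # converged: means unchanged
            break
        mu = new_mu
    return labels, mu
```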


SLIDE 36

K-means Clustering

[Figure: Algorithm: K-means, Distance Metric: Euclidean Distance; centroids μ1, μ2, μ3 on a 2D scatter plot.]

Randomly initialize K centroids μk

SLIDE 37

K-means Clustering

[Figure: 2D scatter plot with centroids μ1, μ2, μ3.]

Assign each point to closest centroid, then update centroids to average of points


SLIDE 39

K-means Clustering

[Figure: 2D scatter plot with centroids μ1, μ2, μ3.]

Repeat until convergence (no points reassigned, means unchanged)


SLIDE 41

K-means Algorithm

Input: X = {x1, x2, ..., xN}, number of clusters K
Initialize: K random centroids μ1, μ2, ..., μK
Repeat until convergence:
  1. (Assignment) For i = 1, ..., K: Ci = { x ∈ X | i = argmin_{1 ≤ j ≤ K} ||x - μj||² }
  2. (Update) For i = 1, ..., K: μi = argmin_z Σ_{x ∈ Ci} ||z - x||²
Output: C1, C2, ..., CK

  • K-means: set μ to the mean of the points in C
  • K-medoids: set μ = x for the point x in C with minimum SSE (a sketch of both updates follows)
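A sketch of the two update rules (assumption: NumPy; `cluster` is an (n, d) array holding the points currently assigned to one cluster):

```python
import numpy as np

def kmeans_update(cluster):
    return cluster.mean(axis=0)       # centroid = coordinate-wise mean

def kmedoids_update(cluster):
    # Medoid must be an actual data point: pick the point with minimum
    # sum of squared distances (SSE) to the rest of the cluster.
    sse = ((cluster[:, None, :] - cluster[None, :, :]) ** 2).sum(axis=(1, 2))
    return cluster[sse.argmin()]
```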
SLIDE 42

Let's see some examples in Python
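A minimal scikit-learn example in this spirit (assumption: synthetic blobs stand in for the course data):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)  # learned centroids
print(km.inertia_)          # SSE of the final clustering
```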

SLIDE 43

“Good” Initialization of Centroids

[Figure: K-means progress over Iterations 1 through 6 from a good initialization; 2D scatter plots (x, y axes) with centroids marked +.]

SLIDE 44

“Bad” Initialization of Centroids

[Figure: K-means progress over Iterations 1 through 5 from a bad initialization; 2D scatter plots (x, y axes) with centroids marked +.]

SLIDE 45

Example: 10 Clusters

5 pairs of clusters, two initial points in each pair

[Figure: K-means over Iterations 1 through 4 on the 10-cluster data; 2D scatter plots (x, y axes).]


SLIDE 47

Importance of Initial Centroids

Initialization tricks

  • Use multiple restarts (see the sketch below)
  • Initialize with hierarchical clustering
  • Select more than K points, then keep the most widely separated ones
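A sketch of the multiple-restarts trick (assumption: it reuses the kmeans() sketch from the algorithm slide; scikit-learn's KMeans does the same thing automatically via its n_init parameter):

```python
import numpy as np

def kmeans_restarts(X, K, n_restarts=10):
    # Rerun K-means from several random initializations and keep the
    # run with the lowest SSE.
    best_sse, best = np.inf, None
    for seed in range(n_restarts):
        labels, mu = kmeans(X, K, seed=seed)
        sse = sum(((X[labels == i] - mu[i]) ** 2).sum() for i in range(K))
        if sse < best_sse:
            best_sse, best = sse, (labels, mu)
    return best
```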


SLIDE 48

Choosing K

[Figure: the same data clustered with K = 1 (SSE = 873), K = 2 (SSE = 173), and K = 3 (SSE = 134); SSE falls as K grows.]

SLIDE 49

Choosing K

[Figure: cost function (SSE) vs. K for K = 1 to 6.]

“Elbow finding” (a.k.a. “knee finding”): set K to the value just above the “abrupt” increase in the cost function.
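A minimal elbow-curve sketch (assumptions: scikit-learn, matplotlib, and synthetic data; K ranges over 1..6 as in the figure):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
Ks = range(1, 7)
sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
       for k in Ks]
plt.plot(Ks, sse, marker="o")
plt.xlabel("K")
plt.ylabel("SSE (cost)")
plt.show()  # choose K near the elbow, where the curve flattens out
```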

SLIDE 50

K-means Limitations: Differing Sizes

Original Points K-means (3 clusters)

SLIDE 51

K-means Limitations: Different Densities

Original Points K-means (3 clusters)

SLIDE 52

K-means Limitations: Non-globular Shapes

Original Points K-means (2 clusters)

SLIDE 53

Overcoming K-means Limitations

Intuition: “Combine” smaller clusters into larger clusters

  • One Solution: Hierarchical Clustering
  • Another Solution: Density-based Clustering
SLIDE 54

K-means in action: download today's notebook starter (and the CSV file) from Blackboard

SLIDE 55

Density-based Clustering

SLIDE 56

DBSCAN

[Figure: arbitrarily shaped clusters and noise points.]

(one of the most-cited clustering methods)

SLIDE 57

DBSCAN

Intuition

  • A cluster is a region of high density
  • Noise points lie in regions of low density


SLIDE 58

Defining “High Density”

Naïve approach

For each point in a cluster there are at least a minimum number (MinPts) of points in an Eps-neighborhood of that point.

SLIDE 59

Defining “High Density”

Eps-neighborhood of a point p: NEps(p) = { q ∈ D | dist(p, q) ≤ Eps }

[Figure: a point p with its Eps-radius neighborhood.]
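A one-line sketch of this definition (assumption: NumPy, with D as an (n, d) array of database points and p a length-d point):

```python
import numpy as np

def eps_neighborhood(D, p, eps):
    # All q in D with dist(p, q) <= Eps (p itself included if p is in D).
    return D[np.linalg.norm(D - p, axis=1) <= eps]
```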

SLIDE 60

Defining “High Density”

  • In each cluster there are two kinds of points:
    - points inside the cluster (core points)
    - points on the border (border points)

An Eps-neighborhood of a border point contains significantly fewer points than an Eps-neighborhood of a core point.

SLIDE 61

Defining “High Density”

For every point p in a cluster C there is a point q ∈ C such that (1) p is inside the Eps-neighborhood of q and (2) NEps(q) contains at least MinPts points.

[Figure: a border point p inside the Eps-neighborhood of a core point q.]

Border points are connected to core points; core points = high density. This gives a better notion of cluster.

SLIDE 62

Density Reachability

Definition: A point p is directly density-reachable from a point q with regard to the parameters Eps and MinPts if
  1) p ∈ NEps(q)   (reachability)
  2) |NEps(q)| ≥ MinPts   (core point condition)

Example (MinPts = 5):
  p is directly density-reachable from q: p ∈ NEps(q), and |NEps(q)| = 6 ≥ 5 = MinPts (core point condition).
  q is not directly density-reachable from p: |NEps(p)| = 4 < 5 = MinPts (core point condition fails).

Note: this is an asymmetric relationship.


SLIDE 64

Density Reachability

Definition: A point p is density-reachable from a point q with regard to the parameters Eps and MinPts if there is a chain of points p1, p2, ..., ps with p1 = q and ps = p such that pi+1 is directly density-reachable from pi for all 1 ≤ i ≤ s-1.

Example (MinPts = 5): p is density-reachable from q through an intermediate point p1, since |NEps(q)| = 5 = MinPts and |NEps(p1)| = 6 ≥ 5 = MinPts (core point conditions).

SLIDE 65

Density Connectivity

Definition (density-connected) A point p is density-connected to a point q with regard to the parameters Eps and MinPts if there is a point v such that both p and q are density-reachable from v.

[Figure: points p and q both density-reachable from an intermediate point v (MinPts = 5).]

Note: This is a symmetric relationship

SLIDE 66

Definition of a Cluster

A cluster with regard to the parameters Eps and MinPts is a non-empty subset C of the database D such that
  1) (Maximality) For all p, q ∈ D: if p ∈ C and q is density-reachable from p with regard to Eps and MinPts, then q ∈ C.
  2) (Connectivity) For all p, q ∈ C: p is density-connected to q with regard to Eps and MinPts.

SLIDE 67

Definition of Noise

[Figure: arbitrarily shaped clusters with surrounding noise points.]

Let C1, ..., Ck be the clusters of the database D with regard to the parameters Eps_i and MinPts_i (i = 1, ..., k). The set of points in the database D not belonging to any cluster C1, ..., Ck is called noise:

Noise = { p ∈ D | p ∉ Ci for all i = 1,...,k}

SLIDE 68

DBSCAN Algorithm

(1) Start with an arbitrary point p from the database and retrieve all points density-reachable from p with regard to Eps and MinPts.
(2) If p is a core point, the procedure yields a cluster with regard to Eps and MinPts, and all points in the cluster are classified.
(3) If p is a border point, no points are density-reachable from p, and DBSCAN visits the next unclassified point in the database.
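A minimal usage sketch (assumption: scikit-learn, whose eps and min_samples parameters play the roles of Eps and MinPts on these slides; the two-moons data is illustrative):

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
db = DBSCAN(eps=0.2, min_samples=5).fit(X)
print(set(db.labels_))  # cluster ids; label -1 marks noise points
```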


SLIDE 71

DBSCAN Algorithm

[Figure: original points (left); point types identified by DBSCAN: core, border, and noise (right).]

SLIDE 72

DBSCAN Strengths

[Figure: original points (left) and the clusters DBSCAN finds (right).]

+ Resistant to noise
+ Can handle arbitrary shapes

SLIDE 73

DBSCAN Weaknesses

Sensitive to hyperparameters

[Figure: ground truth vs. DBSCAN output with MinPts = 4, Eps = 9.92 and with MinPts = 4, Eps = 9.75; a small change in Eps changes the clustering.]

SLIDE 74

K-means vs DBSCAN

[Figure: K-means result (left) vs. DBSCAN result (right) on the same data.]

SLIDE 75

Let’s see what it does with Trump’s tweets…