

SLIDE 1

LECTURE 26: CLUSTERING
CS446 Introduction to Machine Learning (Spring 2015)
University of Illinois at Urbana-Champaign
http://courses.engr.illinois.edu/cs446
Prof. Julia Hockenmaier
juliahmr@illinois.edu

SLIDE 2

Clustering

What should a clustering algorithm achieve?
– A cluster is a set of entities which are alike.
– Entities in different clusters are not alike.
What does "alike" mean? That depends on the application/task.

SLIDE 3

Clustering

Can we formalize this? A cluster is a set of points such that the distance between any two points in the same cluster is less than the distance between any point in the cluster and any point not in it. The distance metric has to be appropriate for the task at hand.

SLIDE 4

Distance Measures for Clustering


SLIDE 5

Distance measures

In studying clustering techniques we will assume that we are given a matrix of distances between all pairs of data points:

[m × m distance matrix: rows and columns are indexed by the points x1, …, xm; the entry in row i, column j is d(xi, xj)]
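To make this concrete, here is a minimal NumPy sketch (not from the original slides; the function name pairwise_distances and the choice of Euclidean distance are assumptions for the example) that builds such a matrix:

```python
import numpy as np

def pairwise_distances(X):
    """Build the m x m matrix D with D[i, j] = Euclidean distance d(x_i, x_j).

    X is an (m, d) array with one row per data point.
    """
    # Squared norms, used in ||x - y||^2 = ||x||^2 - 2 x.y + ||y||^2
    sq = np.sum(X ** 2, axis=1)
    sq_dists = sq[:, None] - 2.0 * (X @ X.T) + sq[None, :]
    # Guard against tiny negative values from floating-point round-off.
    return np.sqrt(np.maximum(sq_dists, 0.0))

X = np.array([[0.0, 0.0], [3.0, 4.0], [1.0, 1.0]])
D = pairwise_distances(X)   # D[0, 1] == 5.0
```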

SLIDE 6

Distance measures

A distance measure d: ℝ^d × ℝ^d → ℝ is a function that satisfies
– d(x, y) > 0 ⇔ x ≠ y
– d(x, y) = 0 ⇔ x = y
d is a metric if it also satisfies:
– The triangle inequality: d(x, y) + d(y, z) ≥ d(x, z)
– Symmetry: d(x, y) = d(y, x)
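As an illustration (assumed code, not from the lecture), these axioms can be sanity-checked on a finite sample of points; failing the check rules a candidate d out, though passing it on a sample proves nothing in general:

```python
import itertools

def is_metric(d, points, tol=1e-12):
    """Check the metric axioms for a distance function d on sample points."""
    for x, y in itertools.product(points, repeat=2):
        if d(x, y) < -tol:                      # non-negativity
            return False
        if (d(x, y) < tol) != (x == y):         # zero iff the points coincide
            return False
        if abs(d(x, y) - d(y, x)) > tol:        # symmetry
            return False
    for x, y, z in itertools.product(points, repeat=3):
        if d(x, z) > d(x, y) + d(y, z) + tol:   # triangle inequality
            return False
    return True
```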

SLIDE 7

Distance measures

Euclidean (L2) distance:
d(x, y) = ‖x − y‖₂ = ( Σ_{i=1}^{d} (x_i − y_i)² )^{1/2}

Manhattan (L1) distance:
d(x, y) = ‖x − y‖₁ = Σ_{i=1}^{d} |x_i − y_i|

Infinity (sup) distance L∞:
d(x, y) = ‖x − y‖∞ = max_{1≤i≤d} |x_i − y_i|

Note that L∞(x, y) ≤ L2(x, y) ≤ L1(x, y), but different distances do not induce the same ordering on points.
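A minimal sketch of the three distances in plain Python (the function names are assumptions for this example):

```python
import math

def euclidean(x, y):
    """L2 distance: square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def manhattan(x, y):
    """L1 distance: sum of absolute coordinate differences."""
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

def sup(x, y):
    """L-infinity distance: largest absolute coordinate difference."""
    return max(abs(xi - yi) for xi, yi in zip(x, y))

# The example from the next slide: y differs from x by (-2, +4).
x, y = (0.0, 0.0), (-2.0, 4.0)
print(euclidean(x, y))  # ~4.47
print(manhattan(x, y))  # 6.0
print(sup(x, y))        # 4.0
```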

SLIDE 8

Distance measures

x = (x1, x2), y = (x1 − 2, x2 + 4)

Euclidean: (4² + 2²)^{1/2} = √20 ≈ 4.47
Manhattan: 4 + 2 = 6
Sup: max(4, 2) = 4

SLIDE 9

Distance Measures

Different distances do not induce the same ordering on points.

Consider a pair of points a, b whose coordinates differ by 5 in one dimension and by a small ε in the other, and a pair c, d whose coordinates differ by 4 in both dimensions:

L∞(a, b) = 5   L2(a, b) = (5² + ε²)^{1/2} ≈ 5 (for small ε)
L∞(c, d) = 4   L2(c, d) = (4² + 4²)^{1/2} ≈ 5.66

So L2(c, d) > L2(a, b) but L∞(c, d) < L∞(a, b): the two distances order these pairs differently.

SLIDE 10

Distance measures

Clustering is sensitive to the distance measure. Sometimes it is beneficial to use a distance measure that is invariant to transformations that are natural to the problem:
– Mahalanobis distance: shift and scale invariance

SLIDE 11

Mahalanobis Distance

Σ is the (symmetric) covariance matrix of the data. Normalizing by Σ translates all the axes to mean 0 and variance 1 (shift and scale invariance).

d(x, y) = ( (x − y)^T Σ⁻¹ (x − y) )^{1/2}

µ = (1/m) Σ_{i=1}^{m} x_i (the average of the data)
Σ = (1/m) Σ_{i=1}^{m} (x_i − µ)(x_i − µ)^T, a matrix of size d × d
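A minimal NumPy sketch of this distance (assumed code, not the course's; np.linalg.solve is used instead of forming Σ⁻¹ explicitly, a standard numerical choice):

```python
import numpy as np

def mahalanobis(x, y, cov):
    """Mahalanobis distance between x and y given a d x d covariance matrix."""
    diff = x - y
    # Solve cov @ z = diff rather than inverting cov explicitly.
    return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))

# Estimate mu and Sigma from data X of shape (m, d), as on the slide.
X = np.random.default_rng(0).normal(size=(100, 2))
mu = X.mean(axis=0)
cov = (X - mu).T @ (X - mu) / len(X)   # d x d covariance matrix
print(mahalanobis(X[0], X[1], cov))
```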

SLIDE 12

Distance measures

Some algorithms require the distance between a point x and a set of points A: d(x, A).

This might be defined, e.g., as the min/max/avg distance between x and any point in A.

Others require the distance between two sets of points A and B: d(A, B).

This might be defined, e.g., as the min/max/avg distance between any point in A and any point in B.
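A small sketch of both kinds of distance, with the min/max/avg choice as a parameter (the helper names are assumptions for this example):

```python
from statistics import mean

def point_set_distance(x, A, d, mode="min"):
    """d(x, A): distance from point x to a set of points A, under the given mode."""
    dists = [d(x, a) for a in A]
    return {"min": min, "max": max, "avg": mean}[mode](dists)

def set_set_distance(A, B, d, mode="min"):
    """d(A, B): distance between point sets A and B over all cross pairs."""
    dists = [d(a, b) for a in A for b in B]
    return {"min": min, "max": max, "avg": mean}[mode](dists)
```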

SLIDE 13

Clustering Methods


SLIDE 14

Clustering Methods

Do the clusters partition the data?
– Hard (yes) vs. soft clustering (no)
Do the clusters have structure?
– Hierarchical (yes) vs. flat clustering (no)
Is the hierarchy induced top-down or bottom-up?
– Top-down: divisive
– Bottom-up: agglomerative
How do we represent the data points?
– As vectors or as vertices in a graph?

SLIDE 15

Graph-based clustering

Each data point is a vertex in an undirected graph. Edge weights correspond to (non-zero) similarities, not distances: 0 ≤ sim(x, y) ≤ 1.
Clustering = graph partitioning!
– Graph cuts, minimum spanning tree, etc.
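The slide does not fix a particular similarity; one common assumed choice is a Gaussian kernel, sim(x, y) = exp(−d(x, y)²/(2σ²)), which maps distances into (0, 1]. A sketch under that assumption:

```python
import math

def similarity(x, y, d, sigma=1.0):
    """Map a distance into a similarity in (0, 1] via a Gaussian kernel.

    The kernel and bandwidth sigma are assumptions for this sketch; the
    slide only requires 0 <= sim(x, y) <= 1.
    """
    return math.exp(-d(x, y) ** 2 / (2 * sigma ** 2))

def build_graph(points, d, sigma=1.0, threshold=1e-3):
    """Weighted undirected graph as a dict of edges; near-zero edges dropped."""
    edges = {}
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            w = similarity(points[i], points[j], d, sigma)
            if w > threshold:
                edges[(i, j)] = w
    return edges
```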

SLIDE 16

Vector-based clustering

Each data point is a vector in a vector space. We can define a distance metric in this space.


SLIDE 17

(Hard) clustering

Given an unlabeled dataset D = {x1, …, xN}, a distance metric d(x, x′) over pairs of points, and a clustering algorithm A, return a partition C of D.
– Partition: a set of clusters C = {C1, …, Ck} such that each element of D belongs to exactly one Ci.

SLIDE 18

Hierarchical Clustering

Hierarchical clustering produces a nested sequence of partitions.
– Agglomerative: place each object in its own cluster and gradually merge the atomic clusters into larger and larger clusters.
– Divisive: start with all objects in one cluster and subdivide it into smaller and smaller clusters.
Example (bottom-up): {(a), (b), (c), (d), (e)} → {(a,b), (c), (d), (e)} → {(a,b), (c,d), (e)} → {(a,b,c,d), (e)} → {(a,b,c,d,e)}
[dendrogram over the points a, b, c, d, e]

SLIDE 19

Agglomerative Clustering

Assume a distance measure between points, d(x1, x2), and define a distance measure between clusters, D(C1, C2).
Algorithm:
– Initialization: put each point in a separate cluster.
– At each stage, merge the two closest clusters according to D.
Different definitions of D (for the same d) give rise to radically different partitions of the data.
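A minimal sketch of this loop (assumed code, not the course's reference implementation), parameterized by the cluster distance D:

```python
def agglomerative(points, D, k=1):
    """Greedy bottom-up clustering: repeatedly merge the two D-closest clusters.

    points: data points; D(C1, C2): distance between two clusters (lists of
    points); k: number of clusters at which to stop. Returns the clusters.
    """
    # Initialization: put each point in a separate cluster.
    clusters = [[p] for p in points]
    while len(clusters) > k:
        # Find the pair of clusters that is closest according to D.
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda pair: D(clusters[pair[0]], clusters[pair[1]]),
        )
        # Merge them; pop j (> i) first so index i remains valid.
        clusters[i] = clusters[i] + clusters.pop(j)
    return clusters
```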

SLIDE 20

Agglomerative Clustering

Single Link Clustering
Define cluster distance as the distance of the closest pair:
D_SL(C1, C2) = min_{x1∈C1, x2∈C2} d(x1, x2)

Complete Link Clustering
Define cluster distance as the distance of the furthest pair:
D_CL(C1, C2) = max_{x1∈C1, x2∈C2} d(x1, x2)

Group Average Clustering
Define cluster distance as the average distance of all pairs:
D_GA(C1, C2) = avg_{x1∈C1, x2∈C2} d(x1, x2)

Error-Sum-of-Squares Clustering (Ward)
ESS(C) = Σ_{x∈C} (x − η_C)²  (η_C is the cluster mean)
D_ESS(C1, C2) = ESS(C1 ∪ C2) − ESS(C1) − ESS(C2)
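These cluster distances can be written directly from the definitions and plugged into the agglomerative loop sketched earlier (assumed helper names; d is a point distance such as euclidean from the earlier sketch):

```python
import numpy as np

def d_single(C1, C2, d):
    """Single link: distance of the closest pair across the two clusters."""
    return min(d(x1, x2) for x1 in C1 for x2 in C2)

def d_complete(C1, C2, d):
    """Complete link: distance of the furthest pair."""
    return max(d(x1, x2) for x1 in C1 for x2 in C2)

def d_average(C1, C2, d):
    """Group average: mean distance over all cross pairs."""
    return sum(d(x1, x2) for x1 in C1 for x2 in C2) / (len(C1) * len(C2))

def ess(C):
    """Error sum of squares: squared distances to the cluster mean."""
    X = np.asarray(C, dtype=float)
    return float(((X - X.mean(axis=0)) ** 2).sum())

def d_ward(C1, C2):
    """Ward: increase in ESS caused by merging C1 and C2."""
    return ess(list(C1) + list(C2)) - ess(C1) - ess(C2)

# Plug into the agglomerative loop above, e.g.:
# clusters = agglomerative(points, lambda A, B: d_single(A, B, euclidean), k=3)
```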

SLIDE 21

Association-Dissociation

Given a collection of points, one way to define the goal of a clustering process is to use the following two measures:
– A measure of similarity within a group of points
– A measure of similarity between different groups
Ideally, we would like to define these so that the within-group similarity is maximized and, at the same time, the between-group similarity is minimized. This turns out to be hard, so we often optimize only one of these objectives.
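As a small illustration (assumed code), the two measures can be computed for a given hard clustering; sim is any similarity function as on the graph-based clustering slide:

```python
from statistics import mean

def within_similarity(clusters, sim):
    """Average similarity over pairs of points that share a cluster."""
    pairs = [sim(x, y)
             for C in clusters
             for i, x in enumerate(C) for y in C[i + 1:]]
    return mean(pairs) if pairs else 0.0

def between_similarity(clusters, sim):
    """Average similarity over pairs of points from different clusters."""
    pairs = [sim(x, y)
             for i, A in enumerate(clusters)
             for B in clusters[i + 1:]
             for x in A for y in B]
    return mean(pairs) if pairs else 0.0
```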