Lecture 11 Jan-Willem van de Meent Clustering Clustering - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Unsupervised Machine Learning and Data Mining

DS 5230 / DS 4420 - Fall 2018

Lecture 11

Jan-Willem van de Meent

slide-2
SLIDE 2

Clustering

slide-3
SLIDE 3

Clustering

  • Unsupervised learning (no labels for training)
  • Group data into classes of similar points that
    • Maximize intra-cluster similarity
    • Minimize inter-cluster similarity
slide-4
SLIDE 4

Four Types of Clustering

  • 1. Centroid-based (K-means, K-medoids)

Notion of Clusters: Voronoi tessellation

slide-5
SLIDE 5

Four Types of Clustering

  • 2. Connectivity-based (Hierarchical)

Notion of Clusters: Cut off dendrogram at some depth

slide-6
SLIDE 6

Four Types of Clustering

  • 3. Density-based (DBSCAN, OPTICS)

Notion of Clusters: Connected regions of high density

slide-7
SLIDE 7

Four Types of Clustering

  • 4. Distribution-based (Mixture Models)

Notion of Clusters: Distributions on features

slide-8
SLIDE 8

Density-based Clustering

slide-9
SLIDE 9

DBSCAN

[Figure: arbitrarily shaped clusters and noise]

(one of the most-cited clustering methods)

slide-10
SLIDE 10

DBSCAN

Intuition

  • A cluster is a region of high density
  • Noise points lie in regions of low density

[Figure: arbitrarily shaped clusters and noise]

slide-11
SLIDE 11

Defining “High Density”

Naïve approach

For each point in a cluster, there are at least a minimum number (MinPts) of points in an Eps-neighborhood of that point.

slide-12
SLIDE 12

Defining “High Density”

Eps-neighborhood of a point p:

NEps(p) = { q ∈ D | dist(p, q) ≤ Eps }

[Figure: point p with its Eps-neighborhood of radius Eps]
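
The Eps-neighborhood definition translates almost directly into code. The following is a minimal sketch, assuming points are stored as coordinate tuples and dist is the Euclidean distance; the function name `eps_neighborhood` and the sample data are illustrative and not part of the lecture.

```python
import math

def eps_neighborhood(points, p, eps):
    """Indices q with dist(points[p], points[q]) <= eps.

    Note that p itself is always included, since dist(p, p) = 0.
    """
    return [q for q, pt in enumerate(points) if math.dist(points[p], pt) <= eps]

# Hypothetical 2-D database D
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
neighbors = eps_neighborhood(points, 0, eps=1.0)
print(neighbors)
```

Because the comparison is ≤, points lying exactly at distance Eps count as neighbors.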

slide-13
SLIDE 13

Defining “High Density”

  • In each cluster there are two kinds of points:
    • points inside the cluster (core points)
    • points on the border (border points)
  • An Eps-neighborhood of a border point contains significantly fewer points than an Eps-neighborhood of a core point.

slide-14
SLIDE 14

Defining “High Density”

Better notion of a cluster

For every point p in a cluster C there is a point q ∈ C such that (1) p is inside the Eps-neighborhood of q and (2) NEps(q) contains at least MinPts points.

  • Core points = high density
  • Border points are connected to core points

[Figure: core and border points, with points p and q labeled]

slide-15
SLIDE 15

Density Reachability

Definition: A point p is directly density-reachable from a point q with regard to the parameters Eps and MinPts if

1) p ∈ NEps(q) (reachability)
2) | NEps(q) | ≥ MinPts (core point condition)

Example (MinPts = 5):

  • p is directly density-reachable from q: p ∈ NEps(q) and | NEps(q) | = 6 ≥ 5 = MinPts (core point condition holds)
  • q is not directly density-reachable from p: | NEps(p) | = 4 < 5 = MinPts (core point condition fails)

Note: This is an asymmetric relationship.
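
The asymmetry can be checked on a small example. This is a sketch under the same assumptions as the definition (Euclidean distance, a point counts as its own neighbor); the function names and the data are hypothetical and chosen so that q is a core point while p is not.

```python
import math

def eps_neighborhood(points, i, eps):
    """Indices of all points within distance eps of points[i] (i included)."""
    return [j for j, pt in enumerate(points) if math.dist(points[i], pt) <= eps]

def directly_density_reachable(points, p, q, eps, min_pts):
    """True if p is directly density-reachable from q."""
    n_q = eps_neighborhood(points, q, eps)
    # 1) reachability: p lies in N_Eps(q); 2) core point condition on q
    return p in n_q and len(n_q) >= min_pts

# Hypothetical data: q (index 0) sits in a dense patch, p (index 4) on its edge.
points = [(0.0, 0.0), (0.5, 0.0), (-0.5, 0.0), (0.0, 0.5), (1.0, 0.0), (0.0, -0.5)]
eps, min_pts = 1.0, 5
q, p = 0, 4

p_from_q = directly_density_reachable(points, p, q, eps, min_pts)
q_from_p = directly_density_reachable(points, q, p, eps, min_pts)
print(p_from_q, q_from_p)
```

Here q has six points in its Eps-neighborhood and p only three, so the relation holds in one direction only.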

slide-16
SLIDE 16

Density Reachability

Definition: A point p is density-reachable from a point q with regard to the parameters Eps and MinPts if there is a chain of points p1, p2, ..., ps with p1 = q and ps = p such that pi+1 is directly density-reachable from pi for all 1 ≤ i ≤ s-1.

Example (MinPts = 5):

  • q: | NEps(q) | = 5 = MinPts (core point condition)
  • p1: | NEps(p1) | = 6 ≥ 5 = MinPts (core point condition)
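
One way to test density-reachability is a breadth-first search that starts at q and only continues through core points, since every link of the chain must leave a core point. The sketch below assumes Euclidean distance and p ≠ q; the names and the sample data are illustrative only.

```python
import math
from collections import deque

def eps_neighborhood(points, i, eps):
    """Indices of all points within distance eps of points[i] (i included)."""
    return [j for j, pt in enumerate(points) if math.dist(points[i], pt) <= eps]

def density_reachable(points, p, q, eps, min_pts):
    """True if p is density-reachable from q (assumes p != q)."""
    visited = {q}
    queue = deque([q])
    while queue:
        current = queue.popleft()
        neighbors = eps_neighborhood(points, current, eps)
        if len(neighbors) < min_pts:
            continue  # not a core point: no chain can continue from here
        for j in neighbors:
            if j == p:
                return True
            if j not in visited:
                visited.add(j)
                queue.append(j)
    return False

# Hypothetical data: five points on a line, spaced 1.0 apart.
# With eps = 1.0 and min_pts = 3 the three inner points are core points.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0)]
forward = density_reachable(points, p=4, q=1, eps=1.0, min_pts=3)
backward = density_reachable(points, p=1, q=4, eps=1.0, min_pts=3)
print(forward, backward)
```

The end point (index 4) is reachable from the core point at index 1 through the chain 1, 2, 3, 4, but nothing is reachable from index 4 because it is not a core point.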

slide-17
SLIDE 17

Density Connectivity

Definition (density-connected) A point p is density-connected to a point q with regard to the parameters Eps and MinPts if there is a point v such that both p and q are density-reachable from v.

[Figure: points p and q, both density-reachable from a point v; MinPts = 5]

Note: This is a symmetric relationship
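
Density-connectivity can be sketched by searching for a witness point v from which both p and q are density-reachable. This brute-force version is an illustration under the assumptions of Euclidean distance and p ≠ q, not an efficient procedure; all names and data are hypothetical.

```python
import math
from collections import deque

def eps_neighborhood(points, i, eps):
    return [j for j, pt in enumerate(points) if math.dist(points[i], pt) <= eps]

def density_reachable(points, p, q, eps, min_pts):
    """True if p is density-reachable from q (search through core points only)."""
    visited, queue = {q}, deque([q])
    while queue:
        current = queue.popleft()
        neighbors = eps_neighborhood(points, current, eps)
        if len(neighbors) < min_pts:
            continue  # chains cannot pass through non-core points
        for j in neighbors:
            if j == p:
                return True
            if j not in visited:
                visited.add(j)
                queue.append(j)
    return False

def density_connected(points, p, q, eps, min_pts):
    """True if some v exists from which both p and q are density-reachable."""
    return any(
        density_reachable(points, p, v, eps, min_pts)
        and density_reachable(points, q, v, eps, min_pts)
        for v in range(len(points))
    )

# Hypothetical data: a line of five points plus one isolated point (index 5).
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0), (10.0, 0.0)]
eps, min_pts = 1.0, 3
ends_connected = density_connected(points, 0, 4, eps, min_pts)
outlier_connected = density_connected(points, 0, 5, eps, min_pts)
print(ends_connected, outlier_connected)
```

The two end points of the line are border points, so neither is density-reachable from the other, yet they are density-connected through the core point in the middle.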

slide-18
SLIDE 18

Definition of a Cluster

A cluster with regard to the parameters Eps and MinPts is a non-empty subset C of the database D such that

1) (Maximality) For all p, q ∈ D: if p ∈ C and q is density-reachable from p with regard to the parameters Eps and MinPts, then q ∈ C.
2) (Connectivity) For all p, q ∈ C: the point p is density-connected to q with regard to the parameters Eps and MinPts.

slide-19
SLIDE 19

Definition of Noise

[Figure: a cluster and noise points]

Let C_1, ..., C_k be the clusters of the database D with regard to the parameters Eps_i and MinPts_i (i = 1, ..., k). The set of points in the database D not belonging to any cluster C_1, ..., C_k is called noise:

Noise = { p ∈ D | p ∉ C_i for all i = 1, ..., k }

slide-20
SLIDE 20

DBSCAN Algorithm

(1) Start with an arbitrary point p from the database and retrieve all points density-reachable from p with regard to Eps and MinPts.
(2) If p is a core point, the procedure yields a cluster with regard to Eps and MinPts, and all points in the cluster are classified.
(3) If p is a border point, no points are density-reachable from p, and DBSCAN visits the next unclassified point in the database.
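
The three steps above can be sketched as a short program. This is a minimal, unoptimized illustration that uses a brute-force neighborhood query and Euclidean distance; the label conventions (-1 for noise, None for unclassified) and all names are assumptions made for this sketch, not taken from the lecture.

```python
import math

NOISE = -1
UNCLASSIFIED = None

def region_query(points, i, eps):
    """Brute-force Eps-neighborhood of points[i] (i included)."""
    return [j for j, pt in enumerate(points) if math.dist(points[i], pt) <= eps]

def dbscan(points, eps, min_pts):
    labels = [UNCLASSIFIED] * len(points)
    cluster_id = 0
    for p in range(len(points)):
        if labels[p] is not UNCLASSIFIED:
            continue
        neighbors = region_query(points, p, eps)
        if len(neighbors) < min_pts:
            # Not a core point: mark as noise for now. It may later be
            # relabeled as a border point of some cluster.
            labels[p] = NOISE
            continue
        # p is a core point: collect everything density-reachable from it.
        labels[p] = cluster_id
        seeds = [j for j in neighbors if j != p]
        while seeds:
            j = seeds.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point of this cluster
            if labels[j] is not UNCLASSIFIED:
                continue  # already classified
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                seeds.extend(j_neighbors)  # j is a core point: keep expanding
        cluster_id += 1
    return labels

# Hypothetical data: two small square blobs and one far-away point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

Points that are still labeled -1 at the end form the noise set. A border point that lies within Eps of core points from two clusters is assigned to whichever cluster reaches it first in this sketch.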

slide-21
SLIDE 21

DBSCAN Algorithm

[Figure panels: Original Points; Point types: core, border and noise]
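
The three point types can be computed directly from the neighborhood sizes. The sketch below assumes Euclidean distance and treats a non-core point as a border point when at least one core point lies in its Eps-neighborhood; the function name and the data are illustrative.

```python
import math

def point_types(points, eps, min_pts):
    """Label every point as 'core', 'border' or 'noise'."""
    neighborhoods = [
        [j for j, q in enumerate(points) if math.dist(p, q) <= eps]
        for p in points
    ]
    core = {i for i, n in enumerate(neighborhoods) if len(n) >= min_pts}
    types = []
    for i, n in enumerate(neighborhoods):
        if i in core:
            types.append("core")
        elif any(j in core for j in n):
            types.append("border")  # within Eps of some core point
        else:
            types.append("noise")
    return types

# Hypothetical data: five points on a line plus one isolated point.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0), (10.0, 0.0)]
types = point_types(points, eps=1.0, min_pts=3)
print(types)
```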

slide-22
SLIDE 22

DBSCAN Complexity

  • Time complexity: O(N²) if done naively, O(N log N) when using a spatial index (works in relatively low dimensions)

  • Space complexity: O(N)
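
To illustrate why an index helps, the sketch below replaces the brute-force scan with a uniform grid whose cells have side length Eps, so a query only inspects the 3 × 3 block of cells around a point. The grid is a simple stand-in chosen for this example, not the tree-based spatial index that the O(N log N) bound refers to, and it is written for 2-D data only; all names and data are hypothetical.

```python
import math
from collections import defaultdict

def build_grid(points, eps):
    """Bucket point indices by grid cell; cells have side length eps."""
    grid = defaultdict(list)
    for i, (x, y) in enumerate(points):
        grid[(math.floor(x / eps), math.floor(y / eps))].append(i)
    return grid

def grid_neighborhood(points, grid, i, eps):
    """Eps-neighborhood of points[i], checking only adjacent cells.

    Any point within eps differs by at most eps in each coordinate,
    so its cell index differs by at most 1 along each axis.
    """
    x, y = points[i]
    cx, cy = math.floor(x / eps), math.floor(y / eps)
    result = []
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            for j in grid.get((cx + dx, cy + dy), ()):
                if math.dist(points[i], points[j]) <= eps:
                    result.append(j)
    return sorted(result)

# Hypothetical 2-D data
points = [(0.0, 0.0), (0.9, 0.0), (1.9, 0.0), (-0.5, -0.5), (5.0, 5.0)]
eps = 1.0
grid = build_grid(points, eps)
print(grid_neighborhood(points, grid, 0, eps))
```

The number of cells to inspect grows as 3^d with the dimension d, which is one way to see why index-based speedups are limited to relatively low dimensions.
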
slide-23
SLIDE 23

DBSCAN Strengths

  • Resistant to noise
  • Can handle arbitrary shapes

[Figure panels: Original Points; Clusters]

slide-24
SLIDE 24

DBSCAN Weaknesses

  • Varying densities
  • High-dimensional data
  • Overlapping clusters
  • Ground Truth

[Figure panels: MinPts = 4, Eps = 9.92; MinPts = 4, Eps = 9.75]

slide-25
SLIDE 25

Determining EPS and MINPTS

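  • Calculate the distance to the k-th nearest neighbor for each point
  • Plot the distances in ascending / descending order
  • Set Eps to the max distance before the “jump”

[Figure: sorted k-distance plot with regions labeled noise, cluster 1 and cluster 2, and the chosen Eps marked]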
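
The k-distance values used for this heuristic can be computed as follows. This is a brute-force sketch assuming Euclidean distance and that a point is not counted as its own neighbor; the function name and data are illustrative, and picking the "jump" is left to visual inspection of the sorted values.

```python
import math

def k_distances(points, k):
    """Distance from each point to its k-th nearest neighbor, sorted descending."""
    result = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return sorted(result, reverse=True)

# Hypothetical data: one tight square of four points and one outlier.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
kdist = k_distances(points, k=2)
print(kdist)
```

In this toy example the outlier produces one large value followed by a sharp drop to 1.0 for the clustered points, which is the kind of jump the bullet list refers to.
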
slide-26
SLIDE 26

K-means vs DBSCAN

[Figure panels: K-means; DBSCAN]