SLIDE 1

Jian Pei: CMPT 459/741 Clustering (3) 1

Distance-based Methods: Drawbacks

  • Hard to find clusters with irregular shapes
  • Hard to specify the number of clusters
  • Heuristic: a cluster must be dense
SLIDE 2

How to Find Irregular Clusters?

  • Divide the whole space into many small areas
    – The density of an area can be estimated
    – Areas may or may not be exclusive
    – A dense area is likely in a cluster
  • Start from a dense area, traverse connected dense areas, and discover clusters of irregular shape

SLIDE 3

Directly Density Reachable

  • Parameters
    – Eps: maximum radius of the neighborhood
    – MinPts: minimum number of points in an Eps-neighborhood of that point
    – NEps(p) = {q | dist(p, q) ≤ Eps}
  • Core object p: |NEps(p)| ≥ MinPts
    – A core object is in a dense area
  • Point q is directly density-reachable from p iff q ∈ NEps(p) and p is a core object

[Figure: q in the Eps-neighborhood of core point p; MinPts = 3, Eps = 1 cm]
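These definitions translate almost directly into code. A minimal Python sketch (the function names and toy points are mine, not from the slides):

```python
from math import dist  # Euclidean distance (Python 3.8+)

def eps_neighborhood(points, p, eps):
    """N_Eps(p) = {q | dist(p, q) <= Eps}; note p itself is included."""
    return [q for q in points if dist(p, q) <= eps]

def is_core(points, p, eps, min_pts):
    """Core object p: |N_Eps(p)| >= MinPts."""
    return len(eps_neighborhood(points, p, eps)) >= min_pts

def directly_density_reachable(points, q, p, eps, min_pts):
    """q is directly density-reachable from p iff q is in N_Eps(p)
    and p is a core object."""
    return dist(p, q) <= eps and is_core(points, p, eps, min_pts)
```

Note that the relation is asymmetric: a border point can be directly density-reachable from a core point, but not the other way around.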

SLIDE 4

Density-Based Clustering

  • Density-reachable
    – If p1 → p2, p2 → p3, …, pn−1 → pn are each directly density-reachable steps, then pn is density-reachable from p1
  • Density-connected
    – If points p and q are both density-reachable from some point o, then p and q are density-connected

[Figure: a chain of points from p1 to p illustrating density-reachability, and points p, q density-connected through a common point]
SLIDE 5

DBSCAN

  • A cluster: a maximal set of density-connected points
    – Discovers clusters of arbitrary shape in spatial databases with noise

[Figure: core, border, and outlier points; Eps = 1 cm, MinPts = 5]

SLIDE 6

DBSCAN: the Algorithm

  • Arbitrarily select a point p
  • Retrieve all points density-reachable from p wrt Eps and MinPts
  • If p is a core point, a cluster is formed
  • If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database
  • Continue the process until all of the points have been processed
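The steps above can be sketched as a straightforward quadratic-time Python implementation; real systems retrieve neighborhoods through a spatial index. The structure and names below are mine:

```python
from math import dist  # Euclidean distance (Python 3.8+)

def dbscan(points, eps, min_pts):
    """Assign each point a cluster id (0, 1, ...) or -1 for noise.
    A direct transcription of the steps on the slide."""
    labels = {}
    cluster_id = 0
    for i, p in enumerate(points):
        if i in labels:                       # already processed
            continue
        neighbors = [j for j, q in enumerate(points) if dist(p, q) <= eps]
        if len(neighbors) < min_pts:          # border point or noise, for now
            labels[i] = -1
            continue
        labels[i] = cluster_id                # p is a core point: new cluster
        frontier = [j for j in neighbors if j != i]
        while frontier:
            j = frontier.pop()
            if labels.get(j) == -1:           # noise reached from a core point
                labels[j] = cluster_id        # ... becomes a border point
                continue
            if j in labels:
                continue
            labels[j] = cluster_id
            j_nbrs = [k for k, q in enumerate(points) if dist(points[j], q) <= eps]
            if len(j_nbrs) >= min_pts:        # expand only through core points
                frontier.extend(j_nbrs)
        cluster_id += 1
    return labels
```

Border points get a cluster label when a cluster grows into them; points never reached from any core point stay labeled −1 (noise).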

SLIDE 7

Challenges for DBSCAN

  • Different clusters may have very different densities
  • Clusters may be in hierarchies
SLIDE 8

OPTICS: A Cluster-ordering Method

  • Idea: order the points to identify the clustering structure
  • “Group” points by density connectivity
    – Hierarchies of clusters
  • Visualize the clusters and the hierarchy
SLIDE 9

Ordering Points

  • Points strongly density-connected should be close to one another
  • Clusters that are density-connected should be close to one another and form a “cluster” of clusters

SLIDE 10

OPTICS: An Example

[Figure: reachability-distance plotted over the cluster-order of the objects; valleys below the thresholds ε and ε′ correspond to clusters, and the reachability-distance is undefined at the start of the order]
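The y-axis of such a plot is the reachability-distance. The slides do not define it; in the original OPTICS paper it is built from the core-distance, roughly as sketched below (function names mine):

```python
from math import dist

def core_distance(points, p, eps, min_pts):
    """Distance from p to its MinPts-th nearest neighbor (counting p itself),
    or None ("undefined") if p is not a core object within Eps."""
    d = sorted(dist(p, q) for q in points)  # includes dist(p, p) = 0
    if len(d) < min_pts or d[min_pts - 1] > eps:
        return None
    return d[min_pts - 1]

def reachability_distance(points, o, p, eps, min_pts):
    """Reachability-distance of o w.r.t. p: max(core-distance(p), dist(p, o)),
    undefined when p is not a core object."""
    cd = core_distance(points, p, eps, min_pts)
    return None if cd is None else max(cd, dist(p, o))
```

OPTICS visits points in order of smallest reachability-distance, which is what makes the valleys in the plot correspond to dense clusters.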

SLIDE 11

DENCLUE: Using Density Functions

  • DENsity-based CLUstEring
  • Major features

    – Solid mathematical foundation
    – Good for data sets with large amounts of noise
    – Allows a compact mathematical description of arbitrarily shaped clusters in high-dimensional data sets
    – Significantly faster than existing algorithms (faster than DBSCAN by a factor of up to 45)
    – But needs a large number of parameters

SLIDE 12

DENCLUE: Techniques

  • Use grid cells
    – Only keep grid cells actually containing data points
    – Manage cells in a tree-based access structure
  • Influence function: describes the impact of a data point on its neighborhood
  • The overall density of the data space is the sum of the influence functions of all data points
  • Clustering by identifying density attractors
    – Density attractor: a local maximum of the overall density function
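As an illustration, with a Gaussian influence function (one common choice; the slides do not fix a particular one) the overall density is just a sum over the data points:

```python
from math import dist, exp

def gaussian_influence(x, y, sigma=1.0):
    """Influence of data point y at location x: exp(-d(x, y)^2 / (2 sigma^2)).
    A common DENCLUE choice; other influence functions are possible."""
    return exp(-dist(x, y) ** 2 / (2 * sigma ** 2))

def overall_density(x, data, sigma=1.0):
    """Overall density at x: the sum of the influences of all data points."""
    return sum(gaussian_influence(x, y, sigma) for y in data)
```

Density attractors would then be found by hill-climbing on `overall_density` along its gradient; a location inside a tight group of points scores far higher than one in empty space.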

SLIDE 13

Density Attractor

SLIDE 14

Center-defined and Arbitrary Clusters

SLIDE 15

A Shrinking-based Approach

  • Difficulties of multi-dimensional clustering
    – Noise (outliers)
    – Clusters of various densities
    – Shapes that are not well defined
  • A novel preprocessing concept: “shrinking”
  • A shrinking-based clustering approach
SLIDE 16

Intuition & Purpose

  • For data points in a data set, what if we could make them move towards the centroid of the natural subgroup they belong to?
  • Natural sparse subgroups become denser, and thus easier to detect
    – Noise points are further isolated

SLIDE 17

Inspiration

  • Newton’s Universal Law of Gravitation
    – Any two objects exert a gravitational force of attraction on each other
    – The direction of the force is along the line joining the objects
    – The magnitude of the force is directly proportional to the product of the gravitational masses of the objects, and inversely proportional to the square of the distance between them
    – G: universal gravitational constant
      • G = 6.67 × 10⁻¹¹ N·m²/kg²

      F_g = G · m1 · m2 / r²
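In code form, with the constant and formula from the slide:

```python
G = 6.67e-11  # universal gravitational constant, N*m^2/kg^2

def gravitational_force(m1, m2, r):
    """F_g = G * m1 * m2 / r^2 (Newton's law of universal gravitation)."""
    return G * m1 * m2 / r ** 2

# Two 1 kg masses 1 m apart attract with a force of G newtons.
```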

SLIDE 18

The Concept of Shrinking

  • A data preprocessing technique
    – Aims to optimize the inner structure of real data sets
  • Each data point is “attracted” by the other data points and moves in the direction in which the attraction is strongest
  • Can be applied in different fields
SLIDE 19

Applying Shrinking to the Clustering Field

  • Shrink the natural sparse clusters to make them much denser, to facilitate the subsequent cluster-detecting process

[Figure: a multi-attribute hyperspace]

SLIDE 20

Data Shrinking

  • Each data point moves along the direction of the density gradient, and the data set shrinks towards the inside of the clusters
  • Points are “attracted” by their neighbors and move to create denser clusters
  • The process runs iteratively, repeated until the data are stabilized or the number of iterations exceeds a threshold

SLIDE 21

Approximation & Simplification

  • Problem: computing the mutual attraction of every pair of data points is too time-consuming: O(n²)
    – Solution: drop Newton’s constant G and set the masses m1 and m2 to unit values
  • Only aggregate the gravitation surrounding each data point
  • Use grids to simplify the computation
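A toy sketch of one shrinking iteration under these simplifications (unit masses, no G): aggregating the attraction of a point's neighbors amounts to moving the point toward their centroid. The actual method aggregates over grid cells; this brute-force version is only illustrative, and the radius parameter is my stand-in for the neighborhood size:

```python
from math import dist

def shrink_step(points, radius):
    """One simplified data-shrinking iteration: move each point to the
    centroid of its neighbors within the given radius (the point itself
    counts as its own neighbor, so isolated points do not move)."""
    moved = []
    for p in points:
        neighbors = [q for q in points if dist(p, q) <= radius]
        centroid = tuple(sum(c) / len(neighbors) for c in zip(*neighbors))
        moved.append(centroid)
    return moved
```

After one step, two nearby points collapse toward their midpoint (the sparse group gets denser) while a distant outlier stays where it is, i.e. becomes further isolated relative to the cluster.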
SLIDE 22

Termination condition

  • The average movement of all points in the current iteration is less than a threshold
  • The number of iterations exceeds a threshold

SLIDE 23

OPTICS on Pendigits Data

[Figures: OPTICS reachability plots before data shrinking and after data shrinking]

SLIDE 24

Biclustering

  • Clustering both objects and attributes simultaneously
  • Four requirements
    – Only a small set of objects in a cluster (bicluster)
    – A bicluster only involves a small number of attributes
    – An object may participate in multiple biclusters, or in none
    – An attribute may be involved in multiple biclusters, or in none

Jian Pei: Big Data Analytics -- Clustering 24

SLIDE 25

Application Examples

  • Recommender systems
    – Objects: users
    – Attributes: items
    – Values: user ratings
  • Microarray data
    – Objects: genes
    – Attributes: samples
    – Values: expression levels

[Figure: an n × m gene–sample/condition matrix W = (w_ij), rows = genes 1…n, columns = samples/conditions 1…m, with entries w11, w12, …, wnm]

SLIDE 26

Biclusters with Constant Values

A bicluster with constant values (objects a1, a33, a86 on attributes b6, b12, b36, b99):

        b6   b12   b36   b99
  a1    60    60    60    60
  a33   60    60    60    60
  a86   60    60    60    60

Constant values on rows:

  10 10 10 10 10
  20 20 20 20 20
  50 50 50 50 50

SLIDE 27

Biclusters with Coherent Values

  • Also known as pattern-based clusters


SLIDE 28

Biclusters with Coherent Evolutions

  • Only up- or down-regulated changes over rows or columns

Coherent evolutions on rows:

  10   50   30   70  20
  20  100   50 1000  30
  50  100   90  120  80
  80   20  100   10
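A simple way to test coherent evolution on two rows is to compare the signs of their consecutive differences; only the direction of each change matters, not its magnitude. A sketch (helper names mine):

```python
def same_evolution(row_a, row_b):
    """True iff the two rows rise and fall in the same places, i.e. the
    signs of consecutive differences agree (coherent evolution on rows)."""
    sign = lambda v: (v > 0) - (v < 0)
    pattern = lambda row: [sign(b - a) for a, b in zip(row, row[1:])]
    return pattern(row_a) == pattern(row_b)
```

For example, 10 50 30 70 20 and 20 100 50 1000 30 both go up, down, up, down, so they evolve coherently even though their values differ wildly.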

SLIDE 29

Differences from Subspace Clustering

  • Subspace clustering uses a global distance/similarity measure
  • Pattern-based clustering looks at patterns
  • A subspace cluster found with a globally defined similarity measure may not follow one common pattern

SLIDE 30

Objects Follow the Same Pattern?

[Figure: Objectblue and Objectgreen plotted on dimensions D1 and D2]

The smaller the pScore, the more consistent the objects.

SLIDE 31

Pattern-based Clusters

  • pScore: the similarity between two objects rx, ry on two attributes au, av

      pScore([[rx.au, rx.av], [ry.au, ry.av]]) = |(rx.au − ry.au) − (rx.av − ry.av)|

  • δ-pCluster (R, D): for any objects rx, ry ∈ R and any attributes au, av ∈ D (δ ≥ 0),

      pScore([[rx.au, rx.av], [ry.au, ry.av]]) ≤ δ
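In Python, with objects stored as rows of a matrix, the definition transcribes directly (function names mine):

```python
from itertools import combinations

def pscore(rx, ry, u, v):
    """pScore of the 2x2 submatrix of objects rx, ry on attributes u, v:
    |(rx[u] - ry[u]) - (rx[v] - ry[v])|."""
    return abs((rx[u] - ry[u]) - (rx[v] - ry[v]))

def is_delta_pcluster(rows, attrs, data, delta):
    """(R, D) is a delta-pCluster iff every pair of objects and every pair
    of attributes has pScore <= delta."""
    return all(pscore(data[x], data[y], u, v) <= delta
               for x, y in combinations(rows, 2)
               for u, v in combinations(attrs, 2))
```

For instance, two rows that differ by a constant shift on every attribute form a 0-pCluster, while a row that breaks the shift pattern pushes the pScore above 0.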

SLIDE 32

Maximal pCluster

  • If (R, D) is a δ-pCluster, then every sub-cluster (R’, D’) with R’ ⊆ R and D’ ⊆ D is a δ-pCluster
    – An anti-monotonic property
    – A large pCluster is accompanied by many small pClusters, so enumerating them all is wasteful
  • Idea: mine only the maximal pClusters!
    – A δ-pCluster is maximal if no proper super-cluster of it is a δ-pCluster

SLIDE 33

Mining Maximal pClusters

  • Given
    – A cluster threshold δ
    – An attribute threshold mina
    – An object threshold mino
  • Task: mine the complete set of significant maximal δ-pClusters
    – A significant δ-pCluster has at least mino objects on at least mina attributes
SLIDE 34

pClusters and Frequent Itemsets

  • A transaction database can be modeled as a binary matrix
  • Frequent itemset: a sub-matrix of all 1’s
    – A 0-pCluster on binary data
    – mino: the support threshold
    – mina: at least mina attributes
    – Maximal pClusters correspond to closed itemsets
  • Frequent itemset mining algorithms cannot be extended straightforwardly to mine pClusters on numeric data

SLIDE 35

Where Should We Start from?

  • How about the pClusters having only 2 objects or 2 attributes?
    – MDS (maximal dimension set)
    – A pCluster must have at least 2 objects and 2 attributes
  • Finding MDSs

    Attribute:   a   b   c   d   e   f   g   h
    Object x:   13  11   9   7   9  13   2  15
    Object y:    7   4  10   1  12   3   4   7
    x − y:       6   7  −1   6  −3  10  −2   8
SLIDE 36

How to Assemble Larger pClusters?

  • Systematically enumerate every combination of attributes D
    – For each attribute subset, find the maximal subsets of objects R s.t. (R, D) is a pCluster
    – Check whether (R, D) is maximal
  • Prune search branches as early as possible
  • Why attributes first and objects later?
    – # of objects >> # of attributes
  • Algorithm MaPle (Pei et al., 2003)

SLIDE 37

More Pruning Techniques

  • Only possible attributes should be considered to grow larger pClusters
  • Prune local maximal pClusters having insufficient possible attributes
  • Extract common attributes from the possible attribute set directly
  • Prune non-maximal pClusters
SLIDE 38

Gene-Sample-Time Series Data

[Figure: a gene × sample × time cube; slicing it yields the Gene-Time, Gene-Sample, and Sample-Time matrices; each cell stores the expression level of gene i on sample j at time k]

SLIDE 39

Mining GST Microarray Data

  • Reduce the gene-sample-time series data to gene-sample data
    – Use Pearson’s correlation coefficient as the coherence measure
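Pearson's correlation coefficient for two equal-length series (here, e.g., the time series of one gene on two samples) is a standard formula, self-contained below:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length series.
    Assumes neither series is constant (nonzero standard deviations)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

Values near +1 mean the two series rise and fall together (coherent), values near −1 mean they move in opposite directions.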

SLIDE 40

Basic Approaches

  • Sample-gene search
    – Enumerate the subsets of samples systematically
    – For each subset of samples, find the genes that are coherent on those samples
  • Gene-sample search
    – Enumerate the subsets of genes systematically
    – For each subset of genes, find the samples on which the genes are coherent

SLIDE 41

Basic Tools

  • Set enumeration tree
  • Sample-gene search and gene-sample search are not symmetric!
    – Many genes, but only a few samples
    – No requirement that the samples be coherent on the genes

SLIDE 42

Phenotypes and Informative Genes

[Figure: an expression matrix over samples 1–7, with the genes (gene1–gene7) split into informative genes and non-informative genes; the informative genes separate the sample phenotypes]

SLIDE 43

The Phenotype Mining Problem

  • Input: a microarray matrix and k
  • Output: phenotypes and informative genes
    – A partition of the samples into k exclusive subsets: the phenotypes
    – Informative genes discriminating the phenotypes
  • Machine learning methods
    – Heuristic search
    – Mutual reinforcing adjustment

SLIDE 44

Requirements

  • The expression levels of each informative gene should be similar over the samples within each phenotype
  • The expression levels of each informative gene should display a clear dissimilarity between each pair of phenotypes

SLIDE 45

To-Do List

  • Read Chapters 10.4 and 11.2
  • Assignment 3
