Clustering: A Categorization of Major Clustering Methods - PowerPoint PPT Presentation




Clustering

 What is Clustering?
 Types of Data in Cluster Analysis
 A Categorization of Major Clustering Methods
 Partitioning Methods
 Hierarchical Methods

What is Clustering?

 Clustering of data is a method by which large sets of data are grouped into clusters of smaller sets of similar data.
 Cluster: a collection of data objects
   Similar to one another within the same cluster
   Dissimilar to the objects in other clusters
 Clustering is unsupervised classification: no predefined classes

What is Clustering?

 Typical applications
   As a stand-alone tool to get insight into data distribution
   As a preprocessing step for other algorithms
 Use cluster detection when you suspect that there are natural groupings that may represent groups of customers or products that have a lot in common.
 When there are many competing patterns in the data, making it hard to spot a single pattern, creating clusters of similar records reduces the complexity within clusters so that other data mining techniques are more likely to succeed.

Examples of Clustering Applications

 Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
 Land use: Identification of areas of similar land use in an earth observation database
 Insurance: Identifying groups of motor insurance policy holders with a high average claim cost
 City-planning: Identifying groups of houses according to their house type, value, and geographical location
 Earthquake studies: Observed earthquake epicenters should be clustered along continent faults

Clustering definition

 Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that:
   data points in one cluster are more similar to one another (high intra-class similarity)
   data points in separate clusters are less similar to one another (low inter-class similarity)
 Similarity measures: e.g. Euclidean distance if attributes are continuous.

Requirements of Clustering in Data Mining

 Scalability
 Ability to deal with different types of attributes
 Discovery of clusters with arbitrary shape
 Minimal requirements for domain knowledge to determine input parameters
 Able to deal with noise and outliers
 Insensitive to order of input records
 High dimensionality
 Incorporation of user-specified constraints
 Interpretability and usability

[Figure: efficient k-means clustering example, http://webscripts.softpedia.com/screenshots/Efficient-K-Means-Clustering-using-JIT_1.png]

[Figure: clustering examples, http://api.ning.com/files/uI4*osegkS5tF-JjFYZai3mGuslDu*-BQ1rFsozaAaDw9IBdc99OjNas3FPKIrdgPXAz34DU0KsbZwl7G8tM5-n4DXTk6Fab/clustering.gif]

http://wiki.na-mic.org/Wiki/index.php/Progress_Report:DTI_Clustering
Project aiming at developing tools in the 3D Slicer for automatic clustering of tractographic paths through diffusion tensor MRI (DTI) data: 'characterize the strength of connectivity between selected regions in the brain'.

Notion of a Cluster is Ambiguous

[Figure: the same initial points grouped as six clusters, four clusters, or two clusters]

Clustering

 What is Cluster Analysis?
 Types of Data in Cluster Analysis
 A Categorization of Major Clustering Methods
 Partitioning Methods
 Hierarchical Methods

Data Matrix

 Represents n objects with p variables (attributes, measures)
 A relational table:

$$
\begin{pmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{pmatrix}
$$

Dissimilarity Matrix

 Proximities of pairs of objects
 d(i,j): dissimilarity between objects i and j
 Nonnegative
 Close to 0: similar

$$
\begin{pmatrix}
0 \\
d(2,1) & 0 \\
d(3,1) & d(3,2) & 0 \\
\vdots & \vdots & \vdots & \ddots \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{pmatrix}
$$

Type of data in clustering analysis

 Continuous variables
 Binary variables
 Nominal and ordinal variables
 Variables of mixed types

Continuous variables

To avoid dependence on the choice of measurement units, the data should be standardized.

Standardize data:

 Calculate the mean absolute deviation:

$$
s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right)
$$

where

$$
m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right)
$$

 Calculate the standardized measurement (z-score):

$$
z_{if} = \frac{x_{if} - m_f}{s_f}
$$

 Using the mean absolute deviation is more robust than using the standard deviation. Since the deviations are not squared, the effect of outliers is somewhat reduced, but their z-scores do not become too small; therefore, the outliers remain detectable.
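For illustration, a minimal NumPy sketch of this standardization (assuming the data sits in an n x p array; the helper name is this illustration's, not the slides'):

```python
import numpy as np

def standardize(X):
    """Standardize each column of X (n objects x p variables) using
    the mean absolute deviation instead of the standard deviation."""
    m = X.mean(axis=0)              # per-variable mean m_f
    s = np.abs(X - m).mean(axis=0)  # mean absolute deviation s_f
    return (X - m) / s              # z-scores z_if

# Example: three objects measured in two variables with very different units
X = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
print(standardize(X))
```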

Similarity/Dissimilarity Between Objects

 Distances are normally used to measure the similarity or dissimilarity between two data objects
 Euclidean distance is probably the most commonly chosen type of distance. It is the geometric distance in the multidimensional space:

$$
d(i,j) = \sqrt{\sum_{k=1}^{p} (x_{ki} - x_{kj})^2}
$$

 Required properties for a distance function:
   d(i,j) >= 0
   d(i,i) = 0
   d(i,j) = d(j,i)
   d(i,j) <= d(i,k) + d(k,j)

http://uk.geocities.com/ahf_alternate/dist.htm#S2

Similarity/Dissimilarity Between Objects

 City-block (Manhattan) distance. This distance is simply the sum of differences across dimensions. In most cases, this distance measure yields results similar to the Euclidean distance. However, note that in this measure, the effect of single large differences (outliers) is dampened (since they are not squared).

$$
d(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|
$$

 The properties stated for the Euclidean distance also hold for this measure.
 Manhattan distance = distance if you had to travel along coordinates only.

Example: x = (5,5), y = (9,8)
  Euclidean: dist(x,y) = sqrt(4^2 + 3^2) = 5
  Manhattan: dist(x,y) = 4 + 3 = 7

Similarity/Dissimilarity Between Objects

 Minkowski distance. Sometimes one may want to increase or decrease the progressive weight that is placed on dimensions on which the respective objects are very different. This measure enables one to accomplish that and is computed as:

$$
d(i,j) = \left(|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q\right)^{1/q}
$$

Similarity/Dissimilarity Between Objects

 Weighted distances
 If we have some idea of the relative importance that should be assigned to each variable, then we can weight them and obtain a weighted distance measure:

$$
d(i,j) = \sqrt{w_1 (x_{i1} - x_{j1})^2 + \cdots + w_p (x_{ip} - x_{jp})^2}
$$
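A minimal sketch covering these distance measures (the function names are this illustration's; q=1 gives Manhattan, q=2 gives Euclidean):

```python
import numpy as np

def minkowski(x, y, q=2):
    """Minkowski distance; q=1 is Manhattan, q=2 is Euclidean."""
    return (np.abs(x - y) ** q).sum() ** (1.0 / q)

def weighted_euclidean(x, y, w):
    """Euclidean distance with per-variable weights w."""
    return np.sqrt((w * (x - y) ** 2).sum())

x, y = np.array([5.0, 5.0]), np.array([9.0, 8.0])
print(minkowski(x, y, q=2))  # 5.0 (Euclidean, as in the slide example)
print(minkowski(x, y, q=1))  # 7.0 (Manhattan)
print(weighted_euclidean(x, y, np.array([1.0, 0.5])))
```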

Binary Variables

 A binary variable has only two states: 0 or 1
 A binary variable is symmetric if both of its states are equally valuable, that is, there is no preference on which outcome should be coded as 1.
 A binary variable is asymmetric if the outcomes of the states are not equally important, such as positive or negative outcomes of a disease test.
 Similarity that is based on symmetric binary variables is called invariant similarity.

Binary Variables

 A contingency table for binary data:

                   Object j
                 1     0     sum
            1    a     b     a+b
  Object i  0    c     d     c+d
            sum  a+c   b+d   p

Binary Variables

 Symmetric binary dissimilarity:

$$
d(i,j) = \frac{b + c}{a + b + c + d}
$$

 Jaccard coefficient (asymmetric binary dissimilarity):

$$
d(i,j) = \frac{b + c}{a + b + c}
$$

Dissimilarity between Binary Variables

 Example

Name  Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
Jack  M       Y      N      P       N       N       N
Mary  F       Y      N      P       N       P       N
Jim   M       Y      P      N       N       N       N

 gender is a symmetric attribute
 the remaining attributes are asymmetric binary
 let the values Y and P be set to 1, and the value N be set to 0

Using the Jaccard coefficient (for the asymmetric variables):

$$
d(\text{jack},\text{mary}) = \frac{0+1}{2+0+1} = 0.33 \qquad
d(\text{jack},\text{jim}) = \frac{1+1}{1+1+1} = 0.67 \qquad
d(\text{jim},\text{mary}) = \frac{1+2}{1+1+2} = 0.75
$$
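A small sketch reproducing these numbers, assuming the 0/1 encoding above:

```python
import numpy as np

def jaccard_dissimilarity(x, y):
    """Jaccard (asymmetric binary) dissimilarity: (b + c) / (a + b + c).
    Matches on 0 (the d cell) are ignored as joint absences."""
    x, y = np.asarray(x), np.asarray(y)
    a = np.sum((x == 1) & (y == 1))
    b = np.sum((x == 1) & (y == 0))
    c = np.sum((x == 0) & (y == 1))
    return (b + c) / (a + b + c)

# Fever, Cough, Test-1..Test-4 with Y/P -> 1 and N -> 0
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]
print(jaccard_dissimilarity(jack, mary))  # 0.33
print(jaccard_dissimilarity(jack, jim))   # 0.67
print(jaccard_dissimilarity(jim, mary))   # 0.75
```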

Nominal Variables

 A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green
 Method 1: simple matching
   m: # of matches, p: total # of variables

$$
d(i,j) = \frac{p - m}{p}
$$

 Method 2: use a large number of binary variables
   creating a new binary variable for each of the M nominal states

Ordinal Variables

 In ordinal variables, order is important
   e.g. Gold, Silver, Bronze
 Can be treated like continuous variables
   the ordered states define the ranking 1, ..., M_f
   replace x_if by its rank r_if in {1, ..., M_f}
   map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by

$$
z_{if} = \frac{r_{if} - 1}{M_f - 1}
$$

   compute the dissimilarity using methods for continuous variables

Variables of Mixed Types

 A database may contain several/all types of variables
   continuous, symmetric binary, asymmetric binary, nominal and ordinal
 One may use a weighted formula to combine their effects:

$$
d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}
$$

where

 delta_ij^(f) = 0 if x_if is missing, or if x_if = x_jf = 0 and the variable f is asymmetric binary; delta_ij^(f) = 1 otherwise
 continuous and ordinal variables: d_ij^(f) is the normalized absolute distance
 binary and nominal variables: d_ij^(f) = 0 if x_if = x_jf; otherwise d_ij^(f) = 1
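A minimal sketch of this combined formula; the kinds/ranges encoding and the use of None for missing values are assumptions of this illustration, not of the slides:

```python
def mixed_dissimilarity(x, y, kinds, ranges):
    """Combined dissimilarity over mixed variable types.
    kinds[f] is one of 'continuous', 'asym_binary', 'nominal';
    ranges[f] is max - min of variable f (normalizes continuous terms)."""
    num = den = 0.0
    for f, kind in enumerate(kinds):
        if x[f] is None or y[f] is None:
            continue  # delta = 0: missing value
        if kind == 'asym_binary' and x[f] == 0 and y[f] == 0:
            continue  # delta = 0: joint absence is ignored
        if kind == 'continuous':
            d = abs(x[f] - y[f]) / ranges[f]  # normalized absolute distance
        else:
            d = 0.0 if x[f] == y[f] else 1.0  # binary / nominal mismatch
        num += d
        den += 1.0
    return num / den

# One continuous, one asymmetric binary, one nominal variable
x = [3.0, 1, 'red']
y = [5.0, 0, 'red']
print(mixed_dissimilarity(x, y, ['continuous', 'asym_binary', 'nominal'],
                          ranges=[10.0, 1, 1]))  # (0.2 + 1 + 0) / 3 = 0.4
```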

Clustering

 What is Cluster Analysis?
 Types of Data in Cluster Analysis
 A Categorization of Major Clustering Methods
 Partitioning Methods
 Hierarchical Methods

Major Clustering Approaches

 Partitioning algorithms: Construct various partitions and then evaluate them by some criterion
 Hierarchy algorithms: Create a hierarchical decomposition of the set of data (or objects) using some criterion
 Density-based: Based on connectivity and density functions. Able to find clusters of arbitrary shape. Continues growing a cluster as long as the density of points in the neighborhood exceeds a specified limit.
 Model-based: A model is hypothesized for each of the clusters, and the idea is to find the best fit of that model to the data

Clustering

 What is Cluster Analysis?
 Types of Data in Cluster Analysis
 A Categorization of Major Clustering Methods
 Partitioning Methods
 Hierarchical Methods

Partitioning Algorithms: Basic Concept

 Partitioning method: Construct a partition of a database D of n objects into a set of k clusters
 Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion
   Global optimal: exhaustively enumerate all partitions
   Heuristic methods: k-means and k-medoids algorithms
     k-means: Each cluster is represented by the center of the cluster
     k-medoids or PAM (Partitioning Around Medoids): Each cluster is represented by one of the objects in the cluster

The K-Means Clustering Method

Given k, the k-means algorithm is implemented in 4 steps:

  1. Partition objects into k nonempty subsets
  2. Compute centroids of the clusters of the current partition. The centroid is the center (mean point) of the cluster.
  3. Assign each object to the cluster with the nearest centroid.
  4. Go back to Step 2; stop when there are no more new assignments.
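For illustration, a minimal NumPy sketch of these four steps (not the slides' reference implementation; it assumes numeric data and that no cluster becomes empty during iteration):

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    """Plain k-means: random initial partition, then alternate between
    computing centroids and reassigning points to the nearest centroid."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(k, size=len(X))  # step 1: random initial partition
    for _ in range(max_iter):
        # step 2: centroid = mean point of each cluster (assumes none empty)
        centroids = np.array([X[labels == c].mean(axis=0) for c in range(k)])
        # step 3: assign each object to the nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):  # step 4: stop when stable
            break
        labels = new_labels
    return labels, centroids

X = np.vstack([np.random.randn(20, 2), np.random.randn(20, 2) + 5])
labels, centroids = k_means(X, k=2)
```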

K-means clustering (k=3)

[Figure: example run of k-means with k=3]

Comments on the K-Means Method

 Strengths & Weaknesses
   Relatively efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations. Normally, k, t << n.
   Often terminates at a local optimum
   Applicable only when the mean is defined
   Need to specify k, the number of clusters, in advance
   Sensitive to noise and outliers, as a small number of such points can influence the mean value
   Not suitable to discover clusters with non-convex shapes

Importance of Choosing Initial Centroids

[Figure: k-means iterations 1-6 for one choice of initial centroids]

[Figure: k-means iterations 1-5 for a different choice of initial centroids]

Getting k Right

 Try different k, looking at the change in the average distance to centroid as k increases.
 The average falls rapidly until the right k, then changes little.

[Figure: average distance to centroid versus k; the best value of k is at the elbow of the curve]
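A minimal sketch of this elbow heuristic, reusing the k_means function from the earlier sketch (an assumption of this illustration, not part of the original slides):

```python
import numpy as np

def average_distance_to_centroid(X, labels, centroids):
    """Mean distance of each point to its assigned centroid."""
    return np.mean(np.linalg.norm(X - centroids[labels], axis=1))

# Try increasing k and look for the elbow where the curve flattens
X = np.vstack([np.random.randn(30, 2) + c for c in ([0, 0], [5, 5], [0, 5])])
for k in range(1, 7):
    labels, centroids = k_means(X, k)
    print(k, average_distance_to_centroid(X, labels, centroids))
```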

The K-Medoids Clustering Method

 Find representative objects, called medoids, in clusters
 PAM (Partitioning Around Medoids, 1987)
   starts from an initial set of medoids and iteratively replaces one of the medoids by one of the non-medoids if it improves the total distance of the resulting clustering
 CLARA (Kaufmann & Rousseeuw, 1990)
 CLARANS (Ng & Han, 1994): Randomized sampling

PAM (Partitioning Around Medoids)

PAM (Kaufman and Rousseeuw, 1987) uses real objects to represent the clusters:

  1. Select k representative objects arbitrarily
  2. For each pair of a non-selected object h and a selected object i, calculate the total swapping cost TC_ih
  3. Select the pair of i and h which corresponds to the minimum TC_ih; if min TC_ih < 0, i is replaced by h. Then assign each non-selected object to the most similar representative object.
  4. Repeat steps 2-3 until there is no change

PAM Clustering: Total Swapping Cost

$$
TC_{ih} = \sum_j C_{jih}
$$

 i, t: medoids; h: medoid candidate; j: a point; C_jih: swapping cost due to j
 If the medoid i is replaced by h as the cluster representative, the cost associated to j is

$$
C_{jih} = d(j,h) - d(j,i)
$$

[Figure: examples of other scenarios. If j stays assigned to another medoid t, there is no cost induced by j: C_jih = 0. If j moves from i to t: C_jih = d(j,t) - d(j,i). If j moves from t to h: C_jih = d(j,h) - d(j,t).]
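A minimal sketch of evaluating one medoid swap under assumed Euclidean distances (illustrative only; PAM itself works from an arbitrary dissimilarity matrix). It uses the fact that, in every scenario above, C_jih equals the change in j's distance to its nearest medoid:

```python
import numpy as np

def total_swapping_cost(X, medoid_idx, i, h):
    """TC_ih: change in total cost if medoid i is replaced by candidate h.
    Sums, over all points j, the change in distance to the nearest medoid."""
    def nearest_dist(j, medoids):
        return min(np.linalg.norm(X[j] - X[m]) for m in medoids)
    before = list(medoid_idx)
    after = [h if m == i else m for m in medoid_idx]
    return sum(nearest_dist(j, after) - nearest_dist(j, before)
               for j in range(len(X)))

X = np.array([[0.0, 0], [1, 0], [5, 5], [6, 5], [10, 0]])
medoids = [0, 2]
print(total_swapping_cost(X, medoids, i=2, h=3))  # negative: the swap improves
```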

Comments on PAM

 PAM is more robust than k-means in the presence of noise and outliers, because a medoid is less influenced by outliers or other extreme values than a mean
 PAM works effectively for small data sets, but does not scale well for large data sets

CLARA (Clustering LARge Applications)

 CLARA (Kaufmann and Rousseeuw, 1990) draws a sample of the dataset and applies PAM on the sample in order to find the medoids.
 If the sample is representative, the medoids of the sample should approximate the medoids of the entire dataset.
 To improve the approximation, multiple samples are drawn and the best clustering is returned as the output.
 The clustering accuracy is measured by the average dissimilarity of all objects in the entire dataset.
 Experiments show that 5 samples of size 40+2k give satisfactory results.

 Strengths and Weaknesses:
   Deals with larger data sets than PAM
   Efficiency depends on the sample size
   A good clustering based on samples will not necessarily represent a good clustering of the whole data set if the sample is biased

CLARANS ("Randomized" CLARA)

 CLARANS (A Clustering Algorithm based on Randomized Search) (Ng and Han '94)
 The clustering process can be presented as searching a graph where every node is a potential solution, that is, a set of k medoids
 Two nodes are neighbours if their sets differ by only one medoid
 Each node can be assigned a cost, defined to be the total dissimilarity between every object and the medoid of its cluster
 The problem corresponds to searching for a minimum on the graph
 At each step, all neighbours of the current node are searched; the neighbour which corresponds to the deepest descent in cost is chosen as the next solution
 For large values of n and k, examining k(n-k) neighbours is time consuming. At each step, CLARANS draws a sample of neighbours to examine.
 Note that CLARA draws a sample of nodes at the beginning of the search; therefore, CLARANS has the benefit of not confining the search to a restricted area.
 If a local optimum is found, CLARANS starts with a new randomly selected node in search of a new local optimum. The number of local optima to search for is a parameter.
 It is more efficient and scalable than both PAM and CLARA, and returns higher quality clusters.

Clustering

 What is Cluster Analysis?
 Types of Data in Cluster Analysis
 A Categorization of Major Clustering Methods
 Partitioning Methods
 Hierarchical Methods

Hierarchical Clustering

 Uses a distance matrix as the clustering criterion.
 These methods work by grouping data into a tree of clusters.
 There are two types of hierarchical clustering:
   Agglomerative: bottom-up strategy
   Divisive: top-down strategy
 Does not require the number of clusters as an input, but needs a termination condition, e.g., the desired number of clusters or a distance threshold for merging.

Hierarchical Clustering

[Figure: agglomerative clustering merges objects a, b, c, d, e step by step (ab, de, cde, abcde), while divisive clustering splits in the reverse order]

Agglomerative hierarchical clustering

[Figure: example run of agglomerative hierarchical clustering]

Clustering result: dendrogram

[Figure: dendrogram of a hierarchical clustering]

Linkage rules (1)

 Single link (nearest neighbour). The distance between two clusters is determined by the distance of the two closest objects (nearest neighbours) in the different clusters.
 This rule will, in a sense, string objects together to form clusters, and the resulting clusters tend to represent long "chains."

[Figure: clusters {1,2,3} and {4,5}; the single-link distance is d24, the distance between the closest pair]

Linkage rules (2)

 Complete link (furthest neighbour). The distances between clusters are determined by the greatest distance between any two objects in the different clusters (i.e., by the "furthest neighbours").
 This method usually performs quite well in cases when the objects actually form naturally distinct "clumps." If the clusters tend to be somehow elongated or of a "chain" type nature, then this method is inappropriate.

[Figure: clusters {1,2,3} and {4,5}; the complete-link distance is d15, the distance between the furthest pair]

Linkage rules (3)

 Pair-group average. The distance between two clusters is calculated as the average distance between all pairs of objects in the two different clusters. This method is also very efficient when the objects form natural distinct "clumps"; however, it performs equally well with elongated, "chain" type clusters.

[Figure: clusters {1,2,3} and {4,5}; the average-link distance is (d14 + d15 + d24 + d25 + d34 + d35)/6]

Linkage rules (4)

 Pair-group centroid. The distance between two clusters is determined as the distance between their centroids.

[Figure: clusters {1,2,3} and {4,5}; the centroid-link distance is the distance between the two cluster centroids]
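For illustration, all four linkage rules are available in SciPy's hierarchical clustering module (a library choice of this sketch, not of the original slides):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Two obvious clumps of points
X = np.vstack([np.random.randn(10, 2), np.random.randn(10, 2) + 8])

# Each method corresponds to one of the linkage rules above
for method in ("single", "complete", "average", "centroid"):
    Z = linkage(X, method=method)                    # stepwise merge history
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters
    print(method, labels)
```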

Comparing single and complete link

[Figure: the same data clustered with single link (SL) and complete link (CL)]

AGNES (Agglomerative Nesting)

 Uses the single-link method and the dissimilarity matrix.
 Repeatedly merges the nodes that have the least dissimilarity
   merge C1 and C2 if objects from C1 and C2 give the minimum Euclidean distance between any two objects from different clusters
 Eventually all nodes belong to the same cluster

[Figure: AGNES progressively merging points into larger clusters]

DIANA (Divisive Analysis)

 Introduced in Kaufmann and Rousseeuw (1990)
 Inverse order of AGNES
   All objects are used to form one initial cluster
   The largest cluster is split according to some principle, e.g., the maximum Euclidean distance between the closest neighbouring objects in different clusters
 Eventually each node forms a cluster on its own

[Figure: DIANA progressively splitting one cluster into smaller ones]
More on Hierarchical Clustering

 Does not scale well
   time complexity of at least O(n^2), where n is the total number of objects
 Can never undo what was done previously
 It's nice that you get a hierarchy instead of an amorphous collection of groups
 Don't need to specify k
   if you want k groups, just cut the (k-1) longest links
 In general gives better quality clusters than k-means-like methods

More on Hierarchical Clustering

 Integration of hierarchical with distance-based clustering
   BIRCH: uses a CF-tree and incrementally adjusts the quality of sub-clusters
   DBSCAN: density-based algorithm based on local connectivity and density functions

BIRCH algorithm

 BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies, by Zhang, Ramakrishnan, Livny (SIGMOD'96)
 A tree is built that captures the information needed to perform clustering
 Introduces two new concepts, which are used to summarize the cluster representation:
   Clustering Feature (contains info about a cluster)
   Clustering Feature Tree

BIRCH - Clustering Feature Vector

 A clustering feature is a triplet summarizing information about sub-clusters of objects. It registers crucial measurements for computing clusters in a compact form.

Clustering Feature: CF = (N, LS, SS)

 N: number of data points
 Linear sum: $LS = \sum_{i=1}^{N} X_i$
 Square sum: $SS = \sum_{i=1}^{N} X_i^2$

Example: for the points (3,4), (2,6), (4,5), (4,7), (3,8):

 LS = (3+2+4+4+3, 4+6+5+7+8) = (16, 30)
 SS = (3^2+2^2+4^2+4^2+3^2, 4^2+6^2+5^2+7^2+8^2) = (54, 190)
 CF = (5, (16,30), (54,190))
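A minimal sketch of a CF triple and its incremental merge property (hypothetical helpers for illustration, not BIRCH's actual data structure code):

```python
import numpy as np

def make_cf(points):
    """Clustering Feature (N, LS, SS) for a set of d-dimensional points."""
    P = np.asarray(points, dtype=float)
    return len(P), P.sum(axis=0), (P ** 2).sum(axis=0)

def merge_cf(cf1, cf2):
    """CFs are additive: merging two sub-clusters just adds the triples."""
    return cf1[0] + cf2[0], cf1[1] + cf2[1], cf1[2] + cf2[2]

cf = make_cf([(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)])
print(cf)  # (5, array([16., 30.]), array([ 54., 190.]))
```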

BIRCH - Clustering Feature Tree

 A tree that stores the clustering features for hierarchical clustering

[Figure: CF tree with non-leaf and leaf nodes; B = max. no. of CFs in a non-leaf node, L = max. no. of CFs in a leaf node; each leaf entry summarizes a cluster]

Notes on BIRCH

 A leaf node represents a cluster.
 A sub-cluster in a leaf node must have a diameter no greater than a given threshold T.
 A point is inserted into the leaf node (cluster) to which it is closer.
 When one item is inserted into a cluster at the leaf node, the restriction T (for the corresponding sub-cluster) must be satisfied. The corresponding CF must be updated.
 If there is no space on the node, the node is split.

BIRCH algorithm

 Incrementally construct a CF tree, a hierarchical data structure for multiphase clustering
 Phase 1: scan the DB to build an initial in-memory CF tree
   If the threshold condition is violated:
     If there is room to insert: insert the point as a single cluster
     If not, split the leaf node: take the two farthest CFs and create two leaf nodes, put the remaining CFs (including the new one) into the closest node
     Update the CFs for non-leaf nodes, and insert a new non-leaf entry into the parent node. We may have to split the parent as well; splitting the root increases the tree height by one.
   If not:
     Insert the point into the closest cluster
 Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF-tree

Some Comments on BIRCH

 It can be shown that CF vectors can be stored and calculated incrementally and accurately as clusters are merged
 Experiments have shown that it scales linearly with the number of objects
 Finds a good clustering with a single scan and improves the quality with a few additional scans
 Handles only numeric data, and is sensitive to the order of the data records
 Better suited to finding spherical clusters

DBSCAN algorithm

 Density-based algorithm: based on local connectivity and density functions
 Major features:
   Discovers clusters of arbitrary shape
   Handles noise
   One scan

DBSCAN: Density-Based Clustering

 Clustering based on density (local cluster criterion), such as density-connected points
 Each cluster has a considerably higher density of points than outside of the cluster

DBSCAN: Density Concepts (1)

 Density: the minimum number of points within a certain distance of each other.
 Two parameters:
   Eps: maximum radius of the neighborhood
   MinPts: minimum number of points in an Eps-neighborhood of that point
 Core point: an object with at least MinPts objects within a radius Eps (its Eps-neighborhood)

[Figure: a core point for Eps = 4cm and MinPts = 5, and a point that is not a core point]

DBSCAN: Density Concepts (2)

 Directly Density-Reachable: A point p is directly density-reachable from a point q with respect to Eps, MinPts if
   1) p belongs to N_Eps(q)
   2) core point condition: |N_Eps(q)| >= MinPts
 (a DDR point needs to be close to a core point, but it does not need to be a core point itself; if not, it is a border point)

[Figure: core point q and directly density-reachable point p, with MinPts = 5 and Eps = 1 cm]

DBSCAN: Density Concepts (3)

 Density-reachable: A point p is density-reachable from a point q wrt. Eps, MinPts if there is a chain of points p1, ..., pn, with p1 = q and pn = p, such that p(i+1) is directly density-reachable from p(i)
 Density-connected: A point p is density-connected to a point q wrt. Eps, MinPts if there is a point o such that both p and q are density-reachable from o wrt. Eps and MinPts

[Figure: chains of core points illustrating "density-reachable from" and "density-connected to"]

DBSCAN: Cluster definition

 A cluster is defined as a maximal set of density-connected points
 A cluster has a core set of points very close to a large number of other points (core points), and then some other points (border points) that are sufficiently close to at least one core point.

Intuition on DBSCAN

 For each core point which is not yet in a cluster:
   Explore its neighbourhood in search of every density-reachable point
   For each neighbourhood point explored:
     If it is a core point -> further explore it
     If it is not a core point -> assign it to the cluster and do not explore it
 The fact that a cluster is composed of the maximal set of points that are density-connected is a property (and therefore a consequence) of the method

DBSCAN: The Algorithm

 Arbitrarily select a point p
 If p is not a core point, no points are density-reachable from p, and DBSCAN visits the next point of the database.
 If p is a core point, a cluster is formed:
   Retrieve all points density-reachable from p wrt Eps and MinPts.
 Continue the process until all of the points have been processed.
 (It is possible that a border point could belong to two clusters. Such a point will be assigned to whichever cluster is generated first.)
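For illustration, a compact sketch of this algorithm under assumed Euclidean distances (an O(n^2) pairwise-distance version, not the indexed implementation behind the O(n lg n) figure below):

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Compact DBSCAN sketch. Returns labels: -1 for noise, 0..k-1 for clusters."""
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = [np.flatnonzero(dist[i] <= eps) for i in range(n)]  # includes i
    labels = np.full(n, -1)
    cluster = 0
    for p in range(n):
        if labels[p] != -1 or len(neighbors[p]) < min_pts:
            continue  # already clustered, or p is not a core point
        labels[p] = cluster
        frontier = list(neighbors[p])  # expand all density-reachable points
        while frontier:
            q = frontier.pop()
            if labels[q] == -1:
                labels[q] = cluster
                if len(neighbors[q]) >= min_pts:  # q is a core point: explore it
                    frontier.extend(neighbors[q])
        cluster += 1
    return labels

X = np.vstack([np.random.randn(30, 2), np.random.randn(30, 2) + 10, [[50.0, 50.0]]])
print(dbscan(X, eps=1.5, min_pts=5))  # the lone far-away point comes out as noise (-1)
```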

DBSCAN

 Experiments have shown DBSCAN to be faster and more precise than CLARANS
 Expected time complexity O(n lg n)

Clustering Summary

 Unsupervised method to find groups of instances
 Many approaches:
   partitioning
   hierarchical
   density-based
 Solution evaluation is difficult:
   Manual inspection by experts
   Benchmarking on existing labels
   Cluster quality measures
     (measure the "tightness" or "purity" of clusters)
     distance measures: high similarity within a cluster, low across clusters

RapidMiner

 http://rapid-i.com/
 Open-source data mining with the Java software RapidMiner
 "RapidMiner is the world-wide leading open-source data mining solution due to the combination of its leading-edge technologies and its functional range. Applications of RapidMiner cover a wide range of real-world data mining tasks."

k-means example

NAME                  Calories  Protein  Fat  Calcium  Iron  Label
BEEF BRAISED          340       20       28   9        2.6   1
HAMBURGER             245       21       17   9        2.7   1
BEEF ROAST            420       15       39   7        2     1
BEEF STEAK            375       19       32   9        2.6   1
BEEF CANNED           180       22       10   17       3.7   1
CHICKEN BROILED       115       20       3    8        1.4   2
CHICKEN CANNED        170       25       7    12       1.5   2
BEEF HEART            160       26       5    14       5.9   3
LAMB LEG ROAST        265       20       20   9        2.6   1
LAMB SHOULDER ROAST   300       18       25   9        2.3   1
SMOKED HAM            340       20       28   9        2.5   1
PORK ROAST            340       19       29   9        2.5   1
PORK SIMMERED         355       19       30   9        2.4   1
BEEF TONGUE           205       18       14   7        2.5   1
VEAL CUTLET           185       23       9    9        2.7   1
BLUEFISH BAKED        135       22       4    25       0.6   2
CLAMS RAW             70        11       1    82       6     3
CLAMS CANNED          45        7        1    74       5.4   3
CRABMEAT CANNED       90        14       2    38       0.8   2
HADDOCK FRIED         135       16       5    15       0.5   2
MACKEREL BROILED      200       19       13   5        1     2
MACKEREL CANNED       155       16       9    157      1.8   3
PERCH FRIED           195       16       11   14       1.3   2
SALMON CANNED         120       17       5    159      0.7   3
SARDINES CANNED       180       22       9    367      2.5   3
TUNA CANNED           170       25       7    7        1.2   2
SHRIMP CANNED         110       23       1    98       2.6   3

[Figure: k-means clustering result on the food nutrition data]


DBSCAN example

[Figure: labeled data]

[Figure: results with k-means]

[Figure: results with DBSCAN]

References

 Data Mining: Concepts and Techniques, Jiawei Han and Micheline Kamber (Morgan Kaufmann, 2000)
 Data Mining: Introductory and Advanced Topics, Margaret Dunham (Prentice Hall, 2002)
 A Tutorial on Clustering Algorithms, http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/index.html
 Clustering Web Search Results, Iwona Białynicka-Birula, http://www.di.unipi.it/~iwona/Clustering.ppt

"Solutions nearly always come from the direction you least expect, which means there's no point in trying to look in that direction because it won't be coming from there." - Douglas Adams