

SLIDE 1

Hierarchy

  • An arrangement or classification of things according to inclusiveness
  • A natural way of abstraction, summarization, compression, and simplification for understanding
  • Typical setting: organize a given set of objects into a hierarchy
    – No or very little supervision
    – Some heuristic guidance on the quality of the hierarchy

Jian Pei: CMPT 459/741 Clustering (2) 1

SLIDE 2

Hierarchical Clustering

  • Group data objects into a tree of clusters
  • Top-down versus bottom-up

[Figure: objects a, b, c, d, e clustered step by step. Agglomerative (AGNES) merges bottom-up from Step 0 to Step 4; divisive (DIANA) splits top-down from Step 4 back to Step 0]

SLIDE 3

AGNES (Agglomerative Nesting)

  • Initially, each object is a cluster
  • Step-by-step cluster merging, until all objects form one cluster
    – Single-link approach
    – Each cluster is represented by all of the objects in the cluster
    – The similarity between two clusters is measured by the similarity of the closest pair of data points belonging to different clusters
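The merging loop above can be sketched in a few lines. This is a minimal single-link AGNES on 2-D points with Euclidean distance; the function name `agnes_single_link` and the sample points are illustrative, not from the slides, and the loop stops at k clusters rather than one so the result is easy to inspect.

```python
# Minimal sketch of single-link AGNES (hypothetical helper, illustrative data).
from math import dist

def agnes_single_link(points, k):
    """Merge clusters until k remain; each cluster is a list of point indices."""
    clusters = [[i] for i in range(len(points))]  # initially, each object is a cluster
    while len(clusters) > k:
        # Single link: cluster distance = distance of the closest cross-cluster pair
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])  # merge the closest pair of clusters
        del clusters[b]
    return clusters

pts = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
print(agnes_single_link(pts, 2))  # [[0, 1, 2, 3], [4]]
```

The quadratic pairwise search per merge is what gives plain hierarchical clustering its high cost, which the later "Challenges" slide notes.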

SLIDE 4

Dendrogram

  • Shows how clusters are merged hierarchically
  • Decomposes data objects into a multi-level nested partitioning (a tree of clusters)
  • A clustering of the data objects: cut the dendrogram at the desired level
    – Each connected component forms a cluster
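Cutting a dendrogram at a desired level is a one-liner in practice. A sketch, assuming SciPy is available; the data points are illustrative:

```python
# Build a single-link dendrogram and cut it so two clusters remain (SciPy assumed).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

pts = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [10, 0]])
Z = linkage(pts, method="single")            # the merge history, i.e. the dendrogram
# criterion="maxclust" cuts at the level that yields at most t connected components
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

Each distinct label corresponds to one connected component of the cut dendrogram; here the four nearby points share a label and the outlying point gets its own.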

SLIDE 5

DIANA (Divisive ANAlysis)

  • Initially, all objects are in one cluster
  • Step-by-step splitting of clusters until each cluster contains only one object

[Figure: a set of 2-D points split step by step into progressively smaller clusters]

SLIDE 6

Distance Measures

  • Minimum distance: d_min(C_i, C_j) = min { d(p, q) : p ∈ C_i, q ∈ C_j }
  • Maximum distance: d_max(C_i, C_j) = max { d(p, q) : p ∈ C_i, q ∈ C_j }
  • Mean distance: d_mean(C_i, C_j) = d(m_i, m_j)
  • Average distance: d_avg(C_i, C_j) = (1 / (n_i n_j)) Σ_{p ∈ C_i} Σ_{q ∈ C_j} d(p, q)

m_i: the mean of cluster C_i; n_i: the number of objects in cluster C_i
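The four measures are easy to compare on a small example. A sketch on 2-D points with Euclidean distance; the function names and the two clusters are illustrative:

```python
# The four inter-cluster distance measures from the slide (illustrative helpers).
from math import dist
from itertools import product

def d_min(Ci, Cj):
    return min(dist(p, q) for p, q in product(Ci, Cj))

def d_max(Ci, Cj):
    return max(dist(p, q) for p, q in product(Ci, Cj))

def d_mean(Ci, Cj):
    # Distance between the cluster means m_i and m_j
    mi = tuple(sum(x) / len(Ci) for x in zip(*Ci))
    mj = tuple(sum(x) / len(Cj) for x in zip(*Cj))
    return dist(mi, mj)

def d_avg(Ci, Cj):
    # (1 / (n_i * n_j)) * sum of d(p, q) over all cross-cluster pairs
    return sum(dist(p, q) for p, q in product(Ci, Cj)) / (len(Ci) * len(Cj))

Ci = [(0, 0), (0, 2)]
Cj = [(4, 0), (4, 2)]
print(d_min(Ci, Cj), d_max(Ci, Cj), d_mean(Ci, Cj), d_avg(Ci, Cj))
```

Note how the measures already disagree on this tiny example: d_min equals d_mean here, while d_avg sits between d_min and d_max.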

SLIDE 7

Challenges

  • Hard to choose merge/split points
    – Merging/splitting is never undone
    – Merging/splitting decisions are critical
  • High complexity: O(n²)
  • Integrating hierarchical clustering with other techniques
    – BIRCH, CURE, CHAMELEON, ROCK

SLIDE 8

BIRCH

  • Balanced Iterative Reducing and Clustering using Hierarchies
  • CF (Clustering Feature) tree: a hierarchical data structure summarizing object information
    – Clustering objects → clustering leaf nodes of the CF tree

SLIDE 9

Clustering Feature Vector

CF = (N, LS, SS)
  N: the number of data points
  LS: the linear sum of the N points, Σ_{i=1..N} o_i
  SS: the square sum of the N points, Σ_{i=1..N} o_i²

Example: for the five points (3,4), (2,6), (4,5), (4,7), (3,8),
CF = (5, (16,30), (54,190))

SLIDE 10

CF-tree in BIRCH

  • Clustering feature:
    – Summarizes the statistics for a cluster
    – Many cluster quality measures (e.g., radius, distance) can be derived
    – Additivity: CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2)
  • A CF tree: a height-balanced tree storing the clustering features for a hierarchical clustering
    – A nonleaf node in the tree has descendants or “children”
    – The nonleaf nodes store sums of the CFs of their children
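The additivity property is what lets a nonleaf node store the sum of its children's CFs. A sketch on 2-D points, with SS kept per dimension as in the slide's example; the helper names are illustrative:

```python
# Clustering features and their additivity (illustrative helpers, 2-D points).
def cf(points):
    n = len(points)
    ls = tuple(sum(p[d] for p in points) for d in range(2))       # linear sum
    ss = tuple(sum(p[d] ** 2 for p in points) for d in range(2))  # square sum
    return (n, ls, ss)

def cf_add(cf1, cf2):
    # Additivity: CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2)
    (n1, ls1, ss1), (n2, ls2, ss2) = cf1, cf2
    return (n1 + n2,
            tuple(a + b for a, b in zip(ls1, ls2)),
            tuple(a + b for a, b in zip(ss1, ss2)))

pts = [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]  # the example from SLIDE 9
print(cf(pts))  # (5, (16, 30), (54, 190))
# Merging two sub-clusters' CFs gives the CF of the union, no raw points needed
assert cf(pts) == cf_add(cf(pts[:2]), cf(pts[2:]))
```

This is why a parent node can summarize an arbitrary subtree in constant space: it never needs to revisit the raw points.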

SLIDE 11

CF Tree

[Figure: a CF tree with branching factor B = 7 and leaf capacity L = 6. The root and nonleaf nodes hold entries (CF_i, child_i); leaf nodes hold CF entries and are chained by prev/next pointers]

SLIDE 12

Parameters of a CF-tree

  • Branching factor: the maximum number of children
  • Threshold: the maximum diameter of sub-clusters stored at the leaf nodes

SLIDE 13

BIRCH Clustering

  • Phase 1: scan the DB to build an initial in-memory CF tree (a multi-level compression of the data that tries to preserve its inherent clustering structure)
  • Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF tree

SLIDE 14

Pros & Cons of BIRCH

  • Linear scalability
    – Good clustering with a single scan
    – Quality can be further improved by a few additional scans
  • Can handle only numeric data
  • Sensitive to the order of the data records
SLIDE 15

Drawbacks of Square Error Based Methods

  • One representative per cluster
    – Good only for convex-shaped clusters of similar size and density
  • k: the number of clusters, given as a parameter
    – Good only if k can be reasonably estimated

SLIDE 16

CURE: the Ideas

  • Each cluster has c representatives
    – Choose c well-scattered points in the cluster
    – Shrink them towards the mean of the cluster by a fraction α
    – The representatives capture the physical shape and geometry of the cluster
  • Merge the closest two clusters
    – Distance of two clusters: the distance between the two closest representatives

SLIDE 17

CURE: the Algorithm

  • Draw a random sample S
  • Partition the sample into p partitions
  • Partially cluster each partition
  • Eliminate outliers
    – Random sampling + removing clusters that grow too slowly
  • Cluster the partial clusters until only k clusters are left
    – Shrink the representatives of the clusters towards the cluster center

SLIDE 18

Data Partitioning and Clustering


SLIDE 19

Shrinking Representative Points

  • Shrink the multiple representative points towards the gravity center by a fraction α
  • Representatives capture the shape

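The shrinking step is a simple linear interpolation toward the gravity center. A sketch on 2-D representative points; the function name `shrink` and the sample points are illustrative:

```python
# CURE-style shrinking of representative points toward the gravity center
# (illustrative helper; alpha is the shrink fraction from the slide).
def shrink(representatives, alpha):
    n = len(representatives)
    # Gravity center (mean) of the representative points
    cx = sum(x for x, _ in representatives) / n
    cy = sum(y for _, y in representatives) / n
    # Move each representative a fraction alpha of the way toward the center
    return [(x + alpha * (cx - x), y + alpha * (cy - y))
            for x, y in representatives]

reps = [(0, 0), (4, 0), (4, 4), (0, 4)]
print(shrink(reps, 0.5))  # each point moves halfway toward the center (2, 2)
```

With alpha = 0 the representatives keep the cluster's full extent; with alpha = 1 they collapse to the mean, recovering a centroid-based method. Intermediate values dampen the effect of outlying representatives while still capturing shape.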

SLIDE 20

Clustering Categorical Data: ROCK

  • RObust Clustering using linKs
    – Use links to measure similarity/proximity: the number of common neighbors between two points
    – Not distance based
    – Computational complexity: O(n² + n·m_m·m_a + n² log n), where m_m is the maximum and m_a the average number of neighbors
  • Basic ideas:
    – Similarity function and neighbors: Sim(T1, T2) = |T1 ∩ T2| / |T1 ∪ T2|
    – Let T1 = {1,2,3}, T2 = {3,4,5}:
      Sim(T1, T2) = |{3}| / |{1,2,3,4,5}| = 1/5 = 0.2
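The slide's similarity function is the Jaccard coefficient on transactions represented as sets; a quick check of the worked example:

```python
# Jaccard similarity on set-valued (categorical) records, as in the ROCK slide.
def sim(t1, t2):
    # |T1 ∩ T2| / |T1 ∪ T2|
    return len(t1 & t2) / len(t1 | t2)

T1, T2 = {1, 2, 3}, {3, 4, 5}
print(sim(T1, T2))  # 1/5 = 0.2
```

ROCK then thresholds this similarity to decide which points are neighbors, and counts common neighbors (links) rather than using the similarity values directly.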

SLIDE 21

Limitations

  • Merging decisions based on static modeling
    – No special characteristics of clusters are considered

[Figure: CURE and BIRCH merge clusters C1 and C2, although C1’ and C2’ are more appropriate candidates for merging]

SLIDE 22

Chameleon

  • Hierarchical clustering using dynamic modeling
  • Measures similarity based on a dynamic model
    – Interconnectivity and closeness (proximity) between two clusters vs. the interconnectivity of the clusters and the closeness of items within the clusters
  • A two-phase algorithm
    – Use a graph partitioning algorithm to cluster objects into a large number of relatively small sub-clusters
    – Find the genuine clusters by repeatedly combining sub-clusters

SLIDE 23

Overall Framework of CHAMELEON

Data Set → Construct Sparse Graph → Partition the Graph → Merge Partitions → Final Clusters

SLIDE 24

To-Do List

  • Read Chapter 10.3
  • (for thesis-based graduate students only) Read the paper “BIRCH: an efficient data clustering method for very large databases”
