Outline Linkage-based Clustering Motivation Definitions - - PowerPoint PPT Presentation

Clustering with Semantic Links

Yin X., Han J. and Yu P. S., 2006, LinkClus: Efficient Clustering via Heterogeneous Semantic Links, Int. Conf. on Very Large Data Bases

Outline

  • Linkage-based Clustering

    – Motivation
    – Definitions
    – Applications

  • Algorithms

    – SimRank
    – ReCoM
    – LinkClus

  • Comparison
  • Conclusions

http://www.cs.umd.edu/hcil/InfovisRepository/contest-2004/17/unzip/entry.html

Why Linkage-Based Clustering

  • Links contain semantic information

    – We can extract complex relationships based on the topology of individual links
    – Links give additional information about inter- and intra-relationships among objects in a database, beyond the attributes of specific objects
    – Objects of different types can be clustered based on linkages to other similar and different objects
  • Multi-typed links

More Reasons…

    – Attributes of individual objects may be unavailable
    – Clustering objects based on attributes may offer no insight into the data, or insight of no interest at the time

Definitions – 1/2

  • Linkage-based clustering

    – Same general definition: clustering is the process of partitioning a set of objects into a set of meaningful sub-classes.
  • Linkage-based cluster

    – Also the same: a collection of data objects that are “similar.”
    – The difference is in measuring “similarity”

  • Similarity

    – The similarity between two objects is the average similarity between the objects linked with them.
    – Two objects are said to be similar if they are linked with similar objects
    – Once similarity is calculated, objects can be clustered in various ways (hierarchical, k-means, k-medoids, etc.)

Definitions – 2/2

  • Link

– A connection between two objects

  • Multi-type link

    – A connection between two objects of differing type
    – Both are essentially pointers.

Applications

  • Recommender systems

– Collaborative filtering: similar users and items are grouped based on their user preferences

  • Web queries
  • Bioinformatics

– Automatic recognition of protein families for phylogenomic analysis

  • Social networks
  • Other cross-referenced databases

SimRank – 1/3

  • Glen Jeh and Jennifer Widom, “SimRank: A measure of structural-context similarity,” In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, Alberta, Canada, July 2002
  • Measures similarity of the structural context in which objects occur, based on their relationships with other objects.
  • Based on the definition that two objects are similar if they are related to similar objects
    – Fuzzy version of: “if a=b and b=c, then a=c”

SimRank – 2/3

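  • Convert web graphs to node-pair (directed) graphs for analysis
    – Nodes represent objects
    – Edges represent relationships
  • Similarity for objects a and b:

$$
s(a,b) =
\begin{cases}
1 & \text{if } a = b \\[4pt]
\dfrac{C}{|I(a)|\,|I(b)|} \displaystyle\sum_{i=1}^{|I(a)|} \sum_{j=1}^{|I(b)|} s\big(I_i(a), I_j(b)\big) & \text{if } a \neq b \\[4pt]
0 & \text{if } I(a) = \emptyset \text{ or } I(b) = \emptyset
\end{cases}
$$

where $C \in (0,1]$, $I(a)$ is the set of items linked to a (its in-neighbors), $|I(a)|$ is their number, and the double sum runs over all in-neighbor pairs of a and b.

[Figure: simple webpage hierarchy]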
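
The recursive definition above can be evaluated by fixed-point iteration. The following is a minimal sketch of the naïve all-pairs version, not the authors' implementation; the function name `simrank`, the default `C=0.8`, the iteration count and the toy graph are all assumptions made for illustration.

```python
from itertools import product


def simrank(in_neighbors, C=0.8, iterations=5):
    """Naive SimRank sketch.

    in_neighbors maps each node to the list of nodes that link to it.
    C and iterations are illustrative defaults, not values from the slides.
    """
    nodes = list(in_neighbors)
    # Start from s(a, b) = 1 if a == b, else 0.
    sim = {(a, b): 1.0 if a == b else 0.0 for a, b in product(nodes, repeat=2)}
    for _ in range(iterations):
        new_sim = {}
        for a, b in product(nodes, repeat=2):
            if a == b:
                new_sim[(a, b)] = 1.0
            elif not in_neighbors[a] or not in_neighbors[b]:
                new_sim[(a, b)] = 0.0
            else:
                # Sum the previous similarities over all in-neighbor pairs.
                total = sum(sim[(i, j)]
                            for i in in_neighbors[a]
                            for j in in_neighbors[b])
                new_sim[(a, b)] = (C * total
                                   / (len(in_neighbors[a]) * len(in_neighbors[b])))
        sim = new_sim
    return sim


# Hypothetical toy graph: one page "univ" links to both "profA" and "profB".
graph = {"univ": [], "profA": ["univ"], "profB": ["univ"]}
scores = simrank(graph)
```

In this toy graph both professors share their only in-neighbor, so their similarity settles at C, while a pair involving a node with no in-neighbors stays at 0.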

SimRank – 3/3

  • Several versions

    – Naïve: computes all pair-wise similarities, O(K n^2 d)
        • K is the number of iterations, d the average number of in-neighbor pairs between any two nodes.
    – Pruning: only searches so far down chains of links, O(K n d_r d)
        • d_r is equivalent to a search radius measured in links
    – Fingerprinting (Fogaras and Racz, 2005)
        • Uses pre-computed random walks through linkages to approximate similarity

ReCoM – 1/3

  • J.D. Wang, H.J. Zeng, Z. Chen, H.J. Lu, L. Tao, W.Y. Ma. “ReCoM: Reinforcement clustering of multi-type interrelated data objects,” SIGIR, 2003
  • Similarity of objects is measured by their own attributes as well as their relationships
  • Uses inter-relationships to reinforce the clustering process of related objects
    – An iterative process done until clusters reach steady state
    – Reinforcement clustering

ReCoM – 2/3

  • Reinforcement clustering

    – Propagate clustering results of one type to all its related types by updating their relationship features.
    – Perform clustering on the updated features
    – Finished when updated features have no effect on current clustering results.

ReCoM – 3/3

  • Also looked at assigning “importance” to objects of the same type
    – e.g. more authoritative web pages, or authors having more publications
    – This feature was not incorporated in the comparison with LinkClus

LinkClus

  • Hierarchical linkage-based clustering technique

    – Other clustering algorithms (e.g. CLARANS) can be used on the similarity measures
    – Similarities are multi-granular: detailed between closely related objects and overall between groups of objects.

  • Propose a hierarchical data structure: SimTree
    – Based on:
        • Hierarchical structures naturally exist for many object types (animal/plant taxonomy, merchandise categories, research communities, etc.)
        • Linkages through these hierarchies tend to a power-law distribution (internet topology, human respiratory system, social networks, automobile networks)

Power-law Distribution Among Linkages

  • Power-law: $y \propto x^{a}$, where x and y are the variables of interest (similarity and proportion of objects) and a is the scaling exponent
  • Metric (similarity) measures connectivity of nodes
  • SimTrees take advantage of the power-law assumption

SimRank similarities for DBLP authors

SimTree

  • Main structure behind LinkClus
  • Designed around power-law assumption

– Stores significant similarities and compresses insignificant ones

  • There is a high proportion of insignificant similarities

– Reduces the number of pair-wise similarities to be evaluated

  • Insignificant similarities are aggregated
SimTree Architecture

  • Leaf-nodes: contain individual objects
  • Parent-nodes: contain groups of leaf-nodes or parent-nodes at level k-1
  • Construction: bottom-up
  • Similarities:
    – Between sibling nodes
    – Between leaves and parents (adjustment factors)
    – Path-based:

$$
s(n_7, n_8) = s(n_7, n_4)\, s(n_4, n_5)\, s(n_5, n_8)
$$

$$
s(n_1, n_k) = \prod_{i=1}^{k-1} s(n_i, n_{i+1})
$$
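
As a small illustration of the path-based similarity, the sketch below multiplies the stored similarities along consecutive nodes of a tree path, here n7 → n4 → n5 → n8. The helper name `path_similarity` and the numeric values are hypothetical; the slides give no concrete numbers.

```python
from math import prod


def path_similarity(path, s):
    """Multiply the stored similarities between consecutive nodes of a path."""
    return prod(s[(path[i], path[i + 1])] for i in range(len(path) - 1))


# Hypothetical stored values: two leaf-to-parent adjustment factors and one
# sibling similarity between the parents n4 and n5.
s = {("n7", "n4"): 0.9, ("n4", "n5"): 0.2, ("n5", "n8"): 0.8}
value = path_similarity(["n7", "n4", "n5", "n8"], s)
```

Only the values on tree edges and between siblings need to be stored; the similarity of any other pair is derived from them on demand.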

SimTree Construction

  • 1. Initialization
    – a. Leaf-node similarity = 1 to itself, 0 to others
    – b. Find tight sets of leaf-nodes and merge into parent-nodes (frequent pattern mining)
  • 2. Iterative similarity updating
    – a. SimTree restructuring using other SimTrees

Initialization

  • Start with 2 or more data types linked together
    – Each data type has its own SimTree
    – Each object forms a leaf-node
    – Convert to transactions and use frequent pattern mining to choose parent nodes

[Figure: SimTree 1, leaf-nodes; SimTree 2, leaf-nodes]

Grouping Leaf-nodes

  • Tight group: a set of nodes co-linked with many objects of other types
  • Frequent pattern: a set of items that co-appear in many transactions
  • Tightness of a group is equal to item-set support
  • Choose the tightest, non-overlapping groups as parent-nodes
  • Constraints on node generation
    – Max children/parent: $c \in [10, 20]$
    – Parent-nodes: $N_l / c \le N_p \le \alpha N_l / c$, with $1 < \alpha \le 2$, where $N_l$ is the number of nodes at the level being grouped and $N_p$ the number of parent-nodes created
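
The grouping step can be sketched as follows, assuming each object of the other type is turned into one transaction listing the leaf-nodes it links to. Candidate generation by frequent pattern mining is omitted, and the greedy choice of non-overlapping groups is an assumption of this sketch, not necessarily how LinkClus selects them. The names `support` and `pick_tight_groups` are hypothetical.

```python
def support(group, transactions):
    """Tightness of a group: number of transactions containing all its members."""
    members = set(group)
    return sum(1 for t in transactions if members <= set(t))


def pick_tight_groups(candidates, transactions):
    """Greedily keep the tightest candidate groups that do not overlap."""
    ranked = sorted(candidates,
                    key=lambda g: support(g, transactions),
                    reverse=True)
    chosen, used = [], set()
    for group in ranked:
        if used.isdisjoint(group):
            chosen.append(group)
            used.update(group)
    return chosen


# Hypothetical transactions: each set lists the leaf-nodes linked to one
# object of the other type.
transactions = [{"a", "b"}, {"a", "b", "c"}, {"a", "b"}, {"c", "d"}, {"c", "d"}]
candidates = [("a", "b"), ("b", "c"), ("c", "d")]
groups = pick_tight_groups(candidates, transactions)
```

Here ("a", "b") has support 3 and ("c", "d") has support 2, so both become parent-nodes, while ("b", "c") is dropped because it overlaps an already chosen group.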

Grouping Leaf-nodes Example

Calculating Similarity – 1/2

  • Calculate similarity between sibling nodes in the SimTrees for each object
  • Similarity is calculated as the average similarity between the objects linked with them

$$
s(a, b) = s\big(\{10, 11, 12, 16\}, \{10, 13, 14, 17\}\big)
$$

Calculating Similarity – 2/2

  • Average similarity is the aggregated path-based similarity
  • Consider s({10,11,12},{13,14})
  • Simplifies time complexity from quadratic (as in SimRank) to linear

$$
s\big(\{10, 11, 12\}, \{13, 14\}\big) = \left( \frac{1}{3} \sum_{i=10}^{12} s(i, 4) \right) \times s(4, 5) \times \left( \frac{1}{2} \sum_{j=13}^{14} s(5, j) \right)
$$

The two bracketed averages are the aggregated adjustment factors; s(4,5) is the sibling similarity.
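
To see why aggregation turns a quadratic computation into a linear one, the sketch below compares the aggregated form with a brute-force average over all leaf pairs. The function names and the numeric values are hypothetical; only the node numbering follows the example s({10,11,12},{13,14}).

```python
def aggregated_similarity(left, right, p_left, p_right, adj, sibling):
    """Average left adjustment * sibling similarity * average right adjustment."""
    avg_left = sum(adj[(i, p_left)] for i in left) / len(left)
    avg_right = sum(adj[(p_right, j)] for j in right) / len(right)
    return avg_left * sibling[(p_left, p_right)] * avg_right


def brute_force_similarity(left, right, p_left, p_right, adj, sibling):
    """Average of the path-based similarity over every (i, j) leaf pair."""
    total = sum(adj[(i, p_left)] * sibling[(p_left, p_right)] * adj[(p_right, j)]
                for i in left for j in right)
    return total / (len(left) * len(right))


# Hypothetical adjustment factors and sibling similarity.
adj = {(10, 4): 0.9, (11, 4): 0.6, (12, 4): 0.3, (5, 13): 0.8, (5, 14): 0.4}
sibling = {(4, 5): 0.5}
fast = aggregated_similarity([10, 11, 12], [13, 14], 4, 5, adj, sibling)
slow = brute_force_similarity([10, 11, 12], [13, 14], 4, 5, adj, sibling)
```

Both give the same value because the sibling similarity is a common factor of every path, so the double sum factorises into a product of two single sums. The aggregated form visits each leaf once instead of each pair once.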

SimTree Restructuring

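  • Move leaf/child-nodes to the most similar parent-sibling after updating similarities
  • Do not exceed node constraints

Example from the figure: $s(10,4) < s(10,5)$ and $s(13,5) < s(13,4)$, so node 10 is more similar to parent 5 and node 13 is more similar to parent 4.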
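
A minimal sketch of the restructuring rule, assuming children are processed one at a time and a move is only made when the target parent still has room. The processing order, the function name `restructure` and the similarity values are assumptions; the slides only state the rule and the constraint.

```python
def restructure(children, parent_of, sim_to_parent, max_children):
    """Move each child to its most similar parent that still has room."""
    parent_of = dict(parent_of)  # work on a copy, leave the input untouched
    counts = {}
    for p in parent_of.values():
        counts[p] = counts.get(p, 0) + 1
    for child in children:
        current = parent_of[child]
        sims = sim_to_parent[child]
        # Try candidate parents from most to least similar.
        for p in sorted(sims, key=sims.get, reverse=True):
            if p == current:
                break  # no better parent with free capacity was found
            if counts.get(p, 0) < max_children:
                counts[current] -= 1
                counts[p] = counts.get(p, 0) + 1
                parent_of[child] = p
                break
    return parent_of


# Hypothetical values matching s(10,4) < s(10,5) and s(13,5) < s(13,4).
parents = {10: 4, 11: 4, 12: 4, 13: 5, 14: 5}
sims = {10: {4: 0.2, 5: 0.6}, 13: {4: 0.7, 5: 0.3}}
moved = restructure([10, 13], parents, sims, max_children=3)
blocked = restructure([10, 13], parents, sims, max_children=2)
```

With a limit of 3 children, node 10 moves under parent 5 and node 13 moves under parent 4. With a limit of 2, both targets are full and nothing moves.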

Comparison

  • Tested LinkClus against the algorithms discussed
    – SimRank
        • Naïve
        • Pruning (P-SimRank)
        • Fingerprinting (F-SimRank)
    – ReCoM
  • Measure clustering accuracy with a modified Jaccard coefficient
    – Assumes two objects are correctly clustered if they share at least one common class label
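
The slides do not give the formula, so the sketch below assumes the usual pair-counting Jaccard coefficient, TP / (TP + FP + FN), with "belongs together" redefined as sharing at least one class label. The function name `pairwise_jaccard` and the toy data are hypothetical.

```python
from itertools import combinations


def pairwise_jaccard(cluster_of, labels_of):
    """Pair-counting Jaccard where a pair is 'truly together' if it shares a label."""
    tp = fp = fn = 0
    for a, b in combinations(sorted(cluster_of), 2):
        same_cluster = cluster_of[a] == cluster_of[b]
        share_label = bool(labels_of[a] & labels_of[b])
        if same_cluster and share_label:
            tp += 1  # clustered together and share a label
        elif same_cluster:
            fp += 1  # clustered together without a common label
        elif share_label:
            fn += 1  # share a label but were separated
    denom = tp + fp + fn
    return tp / denom if denom else 0.0


# Hypothetical clustering of four objects with multi-label ground truth.
cluster_of = {"x": 0, "y": 0, "z": 1, "w": 1}
labels_of = {"x": {"db"}, "y": {"db", "ml"}, "z": {"ml"}, "w": {"ir"}}
score = pairwise_jaccard(cluster_of, labels_of)
```

In this toy case one pair is correctly grouped (x, y), one is wrongly grouped (z, w) and one is wrongly separated (y, z), giving a score of 1/3.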

Databases

  • DBLP (dblp.uni-trier.de)

    – Database and logic programming bibliographies
    – Clustered 4170 authors, 2517 proceedings, 154 conferences and 2518 keywords

  • Email Dataset

    – 370 emails on conferences, 272 on jobs, 789 spam emails (kept 371), and the 2500 most frequent words
    – Clustered two types of objects: emails and words, with ~141,000 links between them.

  • Other Synthetic Sets

DBLP

  • Goal was to correctly label an author’s research area and conference subject area
    – Manually labeled 400 authors and all conferences for accuracy calculations
  • Grouped all object types into 20 clusters

DBLP Accuracy: no keywords

[Figure: accuracy vs. iteration (1 to 19) for LinkClus, SimRank, ReCoM and F-SimRank; panels: Authors, Conferences]

Method               Authors   Conferences   Time/Iteration
LinkClus             0.957     0.723         76.7
LinkClus (CLARANS)   0.953     0.752         107.7
F-SimRank            0.908     0.583         83.6
ReCoM                0.907     0.457         43.1
SimRank              0.958     0.760         1020

DBLP Accuracy: keywords

Values in parentheses are the results without keywords.

Method               Authors         Conferences     Time/Iteration
LinkClus             0.941 (0.957)   0.774 (0.723)   614.0 (76.7)
LinkClus (CLARANS)   0.934 (0.953)   0.729 (0.752)   654.9 (107.7)
F-SimRank            0.674 (0.908)   0.303 (0.583)   136.3 (83.6)
ReCoM                0.936 (0.907)   0.545 (0.457)   101.2 (43.1)
SimRank              0.966 (0.958)   0.795 (0.760)   25348 (1020)

Conclusions

  • Several algorithms are available for linkage-based clustering
    – All are based on the definition that two objects are similar if they are associated with similar objects
  • Linkage-based clustering can incorporate information from heterogeneous data
    – Multi-type links
    – Incorporation of more, non-contradictory information should lead to better clustering
  • LinkClus
    – Results show that it is scalable
    – Forfeits very little accuracy (as compared to SimRank) for speed
  • It would be interesting to see how ReCoM and F-SimRank perform with other parameter setups