As Strong as the Weakest Link: Mining Diverse Cliques in Weighted Graphs



slide-1
SLIDE 1

As Strong as the Weakest Link: Mining Diverse Cliques in Weighted Graphs

Petko Bogdanov (UC Santa Barbara),

with Ben Baumer (Smith College), Prithwish Basu (Raytheon BBN), Amotz Bar-Noy (CUNY) and Ambuj K. Singh (UC Santa Barbara)

ECML/PKDD, Prague, 2013

slide-2
SLIDE 2

Example: Collaboration in sports

Significance of a pair’s success when on a team

slide-3
SLIDE 3

Influential groups


slide-4
SLIDE 4

Multiple teams


slide-5
SLIDE 5

Cliques in gene networks

Gene Interaction Networks

Complexes - interacting functional units*

* Leemor Joshua-Tor, Structure and Function of Nucleic Acid Regulatory Complexes, http://www.hhmi.org/research/structure-and-function-nucleic-acid-regulatory-complexes

slide-6
SLIDE 6

Cliques in other domains

  • Sets of duplicates and near-duplicates in similarity networks
    ○ images
    ○ video
    ○ other complex objects with similarity function
  • Co-evolving time series
    ○ stocks of companies related in a supply chain
    ○ brain regions co-associated in performing a task*

* Hagmann P, Cammoun L, Gigandet X, Meuli R, Honey CJ, Wedeen VJ, Sporns O (2008) Mapping the structural core of human cerebral cortex. PLoS Biology Vol. 6, No. 7, e159

slide-7
SLIDE 7

Challenges

  • Enumeration of cliques
    ○ MAX CLIQUE is NP-hard
  • Ensuring diversity in the result set
    ○ Managing overlap “adds” complexity
  • Size and density of real-world networks
  • How to find the best diverse cliques efficiently while maintaining good quality of the solution

slide-8
SLIDE 8

Outline

  • Motivation and examples
  • Problem statement and properties
  • Proposed solutions
  • Experiments
  • Conclusion


slide-9
SLIDE 9

Basic notions

  • A graph G(V,E,w) represents a network of entities V with edges E among them
  • w defines weights on edges
    ○ higher weight means stronger association
  • A clique is a complete subgraph, i.e. all edges among the selected entities exist
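A minimal Python sketch of these notions, assuming a hypothetical toy graph stored as a dictionary from unordered node pairs to weights. The node names, the weights, and the helper `is_clique` are illustrative and do not come from the slides.

```python
from itertools import combinations

# Hypothetical toy graph G(V, E, w): each edge is an unordered pair of nodes
# mapped to its weight (higher weight = stronger association).
weights = {
    frozenset({"a", "b"}): 0.9,
    frozenset({"a", "c"}): 0.8,
    frozenset({"b", "c"}): 0.7,
    frozenset({"c", "d"}): 0.4,
}

def is_clique(nodes, weights):
    """Return True if every pair of the selected nodes is joined by an edge."""
    return all(frozenset(pair) in weights for pair in combinations(nodes, 2))

abc_is_clique = is_clique({"a", "b", "c"}, weights)  # all three edges exist
abd_is_clique = is_clique({"a", "b", "d"}, weights)  # edges a-d and b-d are missing
print(abc_is_clique, abd_is_clique)
```

Keying edges by `frozenset` keeps the graph undirected without storing each edge twice.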

slide-10
SLIDE 10

Clique strength

  • Strong ties of all pairwise edges
  • A clique is as strong as its weakest link
  • “Flat” teams in which all connections are important
  • Bigger cliques featuring all strong edges are better
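The weakest-link idea can be illustrated with a short sketch. It assumes that the strength of a clique is the minimum weight over its pairwise edges, which is how the slide's slogan reads; the exact score used by the authors may contain further terms. The toy weights and the name `clique_strength` are hypothetical.

```python
from itertools import combinations

# Hypothetical edge weights for a triangle a-b-c.
weights = {
    frozenset({"a", "b"}): 0.9,
    frozenset({"a", "c"}): 0.8,
    frozenset({"b", "c"}): 0.7,
}

def clique_strength(nodes, weights):
    """Weakest-link strength: the minimum weight over all pairwise edges.

    Assumes `nodes` has at least two members and forms a clique in `weights`.
    """
    return min(weights[frozenset(pair)] for pair in combinations(nodes, 2))

strength = clique_strength({"a", "b", "c"}, weights)
average = sum(weights.values()) / len(weights)
print(strength, average)  # the minimum exposes the weak b-c tie; the mean hides it
```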

slide-11
SLIDE 11

Diversity

  • Linear combination of score and diversity via α
  • Higher number of distinct nodes in solution means higher diversity

[Labels on slide: Score; Diversity]
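One way to read this trade-off is sketched below. The exact formula is not recoverable from the transcript, so the form used here is an assumption: (1 - α) times the summed clique scores plus α times the number of distinct nodes covered. The function name `objective` and the example cliques are hypothetical.

```python
def objective(cliques, scores, alpha):
    """Assumed objective: (1 - alpha) * total score + alpha * number of distinct nodes."""
    total_score = sum(scores)
    distinct_nodes = len(set().union(*cliques))
    return (1 - alpha) * total_score + alpha * distinct_nodes

# Two hypothetical top-2 answers: one with heavy overlap, one with none.
overlapping = objective([{1, 2, 3}, {2, 3, 4}], [0.9, 0.9], alpha=0.5)
diverse = objective([{1, 2, 3}, {5, 6, 7}], [0.9, 0.8], alpha=0.5)
score_only = objective([{1, 2, 3}, {2, 3, 4}], [0.9, 0.9], alpha=0.0)
print(overlapping, diverse, score_only)
```

With α = 0.5 the non-overlapping pair wins despite its slightly lower total score; with α = 0 only the scores matter and the overlapping pair would be preferred.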

slide-12
SLIDE 12

Example: Top-2 cliques

Too much overlap

slide-13
SLIDE 13

Example: Top-2 cliques

Slightly lower score but less overlap

slide-14
SLIDE 14

Complexity

  • m-Diverse k-Structures (mDkS) is NP-hard
    ○ reduction from SET COVER
  • Even if we are interested in sets of arbitrary structure, maximizing diversity is NP-hard

[Figure legend: Included in solution]

slide-15
SLIDE 15

Approximation

  • Good news (properties of the objective)
    ○ monotonic
    ○ submodular
    ○ Allows a (1-1/e)-APX
  • Challenges
    ○ Requires greedily finding the next best clique
    ○ MAX CLIQUE is NP-hard to approximate within a constant factor
  • Questions
    ○ Can we develop a solution with APX guarantees that is fast? Limitations?
    ○ Can we develop a very fast solution of good quality?

[Figure labels: Candidate to add to solution; Diminishing return]
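The greedy scheme behind the (1-1/e) guarantee can be sketched as follows. This is an illustration under assumptions, not the authors' code: the marginal gain is taken to be (1 - α) times the clique score plus α times the number of newly covered nodes, and the candidate cliques are handed in as an explicit list, which sidesteps the hard step of finding the next best clique. The names `marginal_gain` and `greedy_select` are hypothetical.

```python
def marginal_gain(clique, score, covered, alpha):
    """Assumed gain of adding a clique: its score plus the nodes it newly covers."""
    return (1 - alpha) * score + alpha * len(clique - covered)

def greedy_select(candidates, m, alpha):
    """Pick up to m cliques, each time taking the one with the largest marginal gain.

    candidates: list of (frozenset_of_nodes, score) pairs.
    """
    chosen, covered = [], set()
    remaining = list(candidates)
    for _ in range(min(m, len(remaining))):
        best = max(remaining, key=lambda c: marginal_gain(c[0], c[1], covered, alpha))
        chosen.append(best)
        covered |= best[0]
        remaining.remove(best)
    return chosen

candidates = [
    (frozenset({1, 2, 3}), 0.9),
    (frozenset({2, 3, 4}), 0.9),
    (frozenset({5, 6, 7}), 0.8),
]
chosen = greedy_select(candidates, m=2, alpha=0.5)
print(chosen)
```

After the first clique is taken, the gain of the overlapping clique {2, 3, 4} drops because only one of its nodes is new. This is the diminishing-return behavior that makes the greedy choice effective.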

slide-16
SLIDE 16

Outline

  • Motivation and examples
  • Problem statement and properties
  • Proposed solutions
  • Experiments
  • Conclusion


slide-17
SLIDE 17

Intuition

  • How to obtain good cliques while reducing the cost of enumeration?
  • Exploit the distribution of edge weights in a real network.
  • Consider good edges first.
  • Include good cliques in the solution before considering all edges, by bounding the contribution of partial cliques

slide-18
SLIDE 18

Optimistic completion

Upper bound for an incomplete clique contribution

  • Current lowest weight will be the lowest in the whole clique
  • The rest of the nodes will not overlap

[Figure label: C]
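The two optimistic assumptions translate into a simple bound. The sketch below assumes cliques of a fixed size k and a contribution of the form (1 - α) times the weakest edge plus α times the number of newly covered nodes; both the form and the name `optimistic_upper_bound` are assumptions made for illustration.

```python
def optimistic_upper_bound(partial, current_min, covered, k, alpha):
    """Upper bound on the contribution of any k-clique that extends `partial`.

    Optimistic assumptions: the current lowest weight stays the lowest in the
    completed clique, and every node still to be added is new to the solution.
    """
    already_new = len(partial - covered)   # nodes of the partial clique not yet covered
    still_to_add = k - len(partial)        # assumed to be entirely non-overlapping
    return (1 - alpha) * current_min + alpha * (already_new + still_to_add)

# Hypothetical partial clique {1, 2} with weakest edge 0.8; node 1 is already covered.
ub = optimistic_upper_bound({1, 2}, current_min=0.8, covered={1}, k=4, alpha=0.5)
print(ub)
```

Adding nodes can only lower the weakest edge, and at most k - |partial| further nodes can be new, so no completion of the partial clique can contribute more than this value.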

slide-19
SLIDE 19

DiCliQ - threshold and prune


slide-20
SLIDE 20

DiCliQ - threshold and prune

  1. Enumerate cliques in a thresholded graph
  2. Compute an upper bound (UB) on the contribution of incomplete cliques
  3. If there is a candidate with a better score contribution than the best UB, add it to the solution

slide-21
SLIDE 21

DiCliQ - threshold and prune

  1. Enumerate cliques in a thresholded graph
  2. Compute an upper bound (UB) on the contribution of incomplete cliques
  3. If there is a candidate with a better score contribution than the best UB, add it to the solution
  4. Lower threshold and repeat
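The threshold-and-prune loop can be sketched as below. This is a simplified illustration, not the DiCliQ implementation: cliques are enumerated by brute force, the clique score is assumed to be its weakest edge, the gain is assumed to be (1 - α) times the score plus α times the newly covered nodes, and a single coarse bound for all not-yet-visible cliques replaces the per-partial-clique bounds. All function names are hypothetical, and k is assumed to be at least 2.

```python
from itertools import combinations

def k_cliques(weights, k, t):
    """All k-cliques using only edges of weight >= t, paired with their weakest edge."""
    kept = {e: w for e, w in weights.items() if w >= t}
    nodes = sorted(set().union(*kept)) if kept else []
    found = []
    for combo in combinations(nodes, k):
        pairs = [frozenset(p) for p in combinations(combo, 2)]
        if all(p in kept for p in pairs):
            found.append((frozenset(combo), min(kept[p] for p in pairs)))
    return found

def threshold_and_prune(weights, m, k, alpha):
    solution, covered = [], set()

    def gain(cand):
        clique, score = cand
        return (1 - alpha) * score + alpha * len(clique - covered)

    thresholds = sorted(set(weights.values()), reverse=True)
    for i, t in enumerate(thresholds):
        last = i == len(thresholds) - 1
        visible = k_cliques(weights, k, t)           # step 1: thresholded enumeration
        # Step 2: a clique not visible yet has an edge below t and at most k new nodes.
        ub_unseen = (1 - alpha) * t + alpha * k
        while len(solution) < m:
            chosen = {c for c, _ in solution}
            cands = [c for c in visible if c[0] not in chosen]
            if not cands:
                break
            best = max(cands, key=gain)
            if not last and gain(best) < ub_unseen:
                break                                # step 4: lower the threshold
            solution.append(best)                    # step 3: candidate beats the bound
            covered |= best[0]
        if len(solution) == m:
            break
    return solution

edges = [("a", "b", 0.9), ("a", "c", 0.9), ("b", "c", 0.9), ("b", "d", 0.85),
         ("c", "d", 0.85), ("e", "f", 0.6), ("e", "g", 0.6), ("f", "g", 0.6)]
weights = {frozenset({u, v}): w for u, v, w in edges}
result = threshold_and_prune(weights, m=2, k=3, alpha=0.5)
print([sorted(c) for c, _ in result])
```

In this toy run the triangle a-b-c is accepted at the highest threshold without looking at any weaker edge. The overlapping triangle b-c-d is held back because an unseen clique could still beat it, and the disjoint triangle e-f-g is chosen once the threshold reaches 0.6.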

slide-22
SLIDE 22

DiCliQ - threshold and prune

  • Implements the GREEDY strategy and hence has a (1-1/e)-approximation factor
  • Exhaustive enumeration of all cliques might incur high cost in very large/dense instances
  • How to scale up the discovery of diverse cliques without compromising the quality much?

slide-23
SLIDE 23

BUDiC - Bottom-up greedy heuristic

  • Greedy expansion around a node based on the UB contribution
  • Incorporates diversity

[Figure labels: C; A (already in the solution); UB? UB?]

slide-24
SLIDE 24

BUDiC - Bottom-up greedy heuristic

  • Greedy expansion around a node based on the UB contribution
  • Incorporates diversity

[Figure labels: C; A (already in the solution)]

Grow away from included nodes based on UB

slide-25
SLIDE 25

BUDiC - Bottom-up greedy heuristic

  • Greedy expansion around a node based on the UB contribution
  • Incorporates diversity
  • Repeat for all nodes
  • Scales much better: O(m*k*|E|)
  • No APX guarantee
  • Good quality on real datasets

[Figure labels: C; A (already in the solution)]

Grow away from included nodes based on UB
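A rough sketch of a bottom-up heuristic in this spirit is given below. It is an illustration under assumptions rather than the BUDiC algorithm itself: the clique score is taken to be the weakest edge, the UB is the optimistic bound (1 - α) times the current weakest edge plus α times the best-case count of new nodes, and ties are broken by node order. The names `grow_clique` and `bottom_up_greedy` are hypothetical, and k is assumed to be at least 2.

```python
from collections import defaultdict

def grow_clique(seed, adj, k, covered, alpha):
    """Grow a k-clique from `seed`, always adding the neighbor with the best UB."""
    clique, low = {seed}, float("inf")
    while len(clique) < k:
        common = set.intersection(*(set(adj[u]) for u in clique)) - clique
        best = None
        for v in sorted(common):
            new_low = min(low, min(adj[v][u] for u in clique))
            # Optimistic: nodes still to be added are all new to the solution.
            fresh = len((clique | {v}) - covered) + (k - len(clique) - 1)
            ub = (1 - alpha) * new_low + alpha * fresh
            if best is None or ub > best[0]:
                best = (ub, v, new_low)
        if best is None:
            return None                      # no k-clique reachable from this seed
        _, v, low = best
        clique.add(v)
    return frozenset(clique), low

def bottom_up_greedy(adj, m, k, alpha):
    solution, covered = [], set()
    for _ in range(m):
        chosen, best = {c for c, _ in solution}, None
        for seed in sorted(adj):             # repeat the expansion for all nodes
            grown = grow_clique(seed, adj, k, covered, alpha)
            if grown is None or grown[0] in chosen:
                continue
            gain = (1 - alpha) * grown[1] + alpha * len(grown[0] - covered)
            if best is None or gain > best[0]:
                best = (gain, grown)
        if best is None:
            break
        solution.append(best[1])
        covered |= best[1][0]
    return solution

edges = [("a", "b", 0.9), ("a", "c", 0.9), ("b", "c", 0.9), ("b", "d", 0.85),
         ("c", "d", 0.85), ("e", "f", 0.6), ("e", "g", 0.6), ("f", "g", 0.6)]
adj = defaultdict(dict)
for u, v, w in edges:
    adj[u][v] = w
    adj[v][u] = w
result = bottom_up_greedy(adj, m=2, k=3, alpha=0.5)
print([sorted(c) for c, _ in result])
```

In the second round the expansion from node b prefers d over a and c, because a and c are already covered. This is the "grow away from included nodes" behavior, obtained here without enumerating all cliques.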

slide-26
SLIDE 26

Outline

  • Motivation and examples
  • Problem statement and properties
  • Proposed solutions
  • Experiments
  • Conclusion


slide-27
SLIDE 27

Data


slide-28
SLIDE 28

Scalability

  • Compare running time to a Baseline (No thresholding) and relative quality to iMDV*
  • α = 0.5, m = 10, k = 5

* S. Bandyopadhyay and M. Bhattacharyya. Mining the largest dense vertexlet in a weighted scale-free graph. Fundam. Inform., 96(1-2):1–25, 2009

[Figure annotations: Apx. guarantee; Scalable, High Quality]

slide-29
SLIDE 29

Scalability on YeastNet

[Plot panels: α=0.5, k=5; α=0.5, m=5]

slide-30
SLIDE 30

Quality


slide-31
SLIDE 31

Discovering gene complexes


slide-32
SLIDE 32

Conclusion

  • General results for diverse clique mining
    ○ application to discovery of effective groups in collaboration
    ○ complexes in gene networks
    ○ similarity/correlation graphs
  • Two scalable algorithms, one with constant factor approximation
  • More than 3 orders of magnitude running time improvement while preserving good quality

slide-33
SLIDE 33

Thank You

Q&A

The research was supported by the Army Research Laboratory under cooperative agreement W911NF-09-2-0053 (NS-CTA).

slide-34
SLIDE 34

Effect of diversity parameter α


slide-35
SLIDE 35

Groups in the other datasets

  • The Harry Potter cast in the movies data set
  • NBA: Nowitzki-Chandler-Stevenson of the defending champion Dallas Mavericks (addition of Chandler positive)
  • MLB: Ramirez-Blake-Kuo of the LA Dodgers (13/14 with an otherwise unremarkable lineup reached the playoffs in 2008)

slide-36
SLIDE 36

Related work

  • Quasi-cliques
    ○ frequency of clique occurrence (not score)
    ○ non-unique labels
  • Weighted cliques
    ○ Bandyopadhyay et al. 2009: no APX guarantees, single clique, extended version does not have as good quality
  • Other subgraph types
    ○ Steiner trees
    ○ Clique percolation (CFinder)
    ○ Edge weights are constraints and not part of score
  • Diversity of node labels within a clique