
As Strong as the Weakest Link: Mining Diverse Cliques in Weighted Graphs

As Strong as the Weakest Link: Mining Diverse Cliques in Weighted Graphs. Petko Bogdanov (UC Santa Barbara), with Ben Baumer (Smith College), Prithwish Basu (Raytheon BBN), Amotz Bar-Noy (CUNY), and Ambuj K. Singh (UC Santa Barbara). ECML/PKDD, Prague, 2013


  1. As Strong as the Weakest Link: Mining Diverse Cliques in Weighted Graphs ● Petko Bogdanov (UC Santa Barbara), with Ben Baumer (Smith College), Prithwish Basu (Raytheon BBN), Amotz Bar-Noy (CUNY), and Ambuj K. Singh (UC Santa Barbara) ● ECML/PKDD, Prague, 2013

  2. Example: Collaboration in sports ● Significance of a pair’s success when on a team

  3. Influential groups

  4. Multiple teams

  5. Cliques in gene networks ● Gene interaction networks ● Complexes: interacting functional units* * Leemor Joshua-Tor, Structure and Function of Nucleic Acid Regulatory Complexes, http://www.hhmi.org/research/structure-and-function-nucleic-acid-regulatory-complexes

  6. Cliques in other domains ● Sets of duplicates and near-duplicates in similarity networks ○ images ○ video ○ other complex objects with a similarity function ● Co-evolving time series ○ stocks of companies related in a supply chain ○ brain regions co-associated in performing a task* * Hagmann P, Cammoun L, Gigandet X, Meuli R, Honey CJ, Wedeen VJ, Sporns O (2008) Mapping the structural core of human cerebral cortex. PLoS Biology Vol. 6, No. 7, e159

  7. Challenges ● Enumeration of cliques ○ MAX CLIQUE is NP-hard ● Ensuring diversity in the result set ○ Managing overlap “adds” complexity ● Size and density of real-world networks ● How to find the best diverse cliques efficiently while maintaining good quality of the solution?

  8. Outline ● Motivation and examples ● Problem statement and properties ● Proposed solutions ● Experiments ● Conclusion

  9. Basic notions ● A graph G(V,E,w) represents a network of entities V with edges E among them ● w defines weights on edges ○ higher weight means stronger association ● A clique is a complete subgraph, i.e. all edges among the selected entities exist

  10. Clique strength ● Strong ties of all pairwise edges ● A clique is as strong as its weakest link ● “Flat” teams in which all connections are important ● Bigger cliques featuring all strong edges are better
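The weakest-link notion on slides 9 and 10 can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `clique_strength`, the dictionary layout (edges keyed by sorted node pairs), and the toy weights are all assumptions made here. The slides also say bigger cliques are better, but how size enters the score is not recoverable from the transcript, so the sketch covers only the weakest edge.

```python
from itertools import combinations


def clique_strength(weights, nodes):
    """Return the weakest edge weight among `nodes`, or None if they are not a clique.

    `weights` maps a sorted node pair (u, v) to an edge weight (hypothetical layout).
    """
    strength = float("inf")
    for u, v in combinations(sorted(nodes), 2):
        w = weights.get((u, v))
        if w is None:
            # A missing edge means the node set is not a complete subgraph.
            return None
        strength = min(strength, w)
    return strength


# Toy graph: higher weight means stronger association.
weights = {
    ("a", "b"): 0.9,
    ("a", "c"): 0.8,
    ("b", "c"): 0.4,
    ("c", "d"): 0.7,
}

print(clique_strength(weights, {"a", "b", "c"}))  # weakest link is b-c
print(clique_strength(weights, {"a", "b", "d"}))  # not a clique
```

One weak edge (b-c here) drags down the whole group, which is the "flat team" intuition from slide 10.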

  11. Diversity ● Objective with two terms, labeled Score and Diversity ● Linear combination of score and diversity via α ● Higher number of distinct nodes in solution means higher diversity

  12. Example: Top-2 cliques ● Too much overlap

  13. Example: Top-2 cliques ● Slightly lower score but less overlap
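The trade-off in the two top-2 examples can be made concrete with a small sketch. The exact formula from slide 11 did not survive in this transcript, so the form below, (1 - α) times the summed clique strengths plus α times the number of distinct nodes, is an assumption, as are the function name `objective` and the toy cliques.

```python
def objective(cliques, alpha):
    """Assumed objective: (1 - alpha) * total strength + alpha * distinct nodes.

    `cliques` is a list of (node_set, strength) pairs; this form is a guess at
    the linear combination described on the slide, not the published formula.
    """
    total_strength = sum(strength for _, strength in cliques)
    distinct_nodes = set().union(*(nodes for nodes, _ in cliques))
    return (1 - alpha) * total_strength + alpha * len(distinct_nodes)


# Hypothetical top-2 answers: one pair overlaps heavily, the other does not.
overlapping = [({"a", "b", "c"}, 0.9), ({"a", "b", "d"}, 0.85)]
diverse = [({"a", "b", "c"}, 0.9), ({"d", "e", "f"}, 0.8)]

for alpha in (0.0, 0.5):
    print(alpha, objective(overlapping, alpha), objective(diverse, alpha))
```

With α = 0 the overlapping pair wins on score alone; with α = 0.5 the pair with slightly lower score but six distinct nodes comes out ahead.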

  14. Complexity ● m-Diverse k-Structures (mDkS) is NP-hard ○ reduction from SET COVER ● Even if we are interested in sets of arbitrary structure, maximizing diversity is NP-hard ● Figure label: Included in solution

  15. Approximation ● Good news ○ monotonic ○ submodular ○ Allows a (1-1/e)-APX ● Figure labels: Diminishing return; Candidate to add to solution ● Challenges ○ Requires greedily finding the next best clique ○ MAX CLIQUE is NP-hard to approximate to a constant ● Questions ○ Can we develop a solution with APX guarantees that is fast? Limitations? ○ Can we develop a very fast solution of good quality?
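The greedy scheme behind the (1-1/e) guarantee for monotone submodular objectives can be sketched generically. In this illustration node coverage stands in for the diverse-clique objective; the names `greedy_select` and `coverage_gain` and the candidate sets are hypothetical, and the candidates are given up front rather than mined from a graph.

```python
def coverage_gain(solution, candidate):
    """Marginal gain of `candidate`: how many nodes it adds beyond `solution`."""
    covered = set().union(*solution)
    return len(set(candidate) - covered)


def greedy_select(candidates, m, gain):
    """Pick up to m candidates, each time taking the one with the largest gain."""
    solution = []
    remaining = list(candidates)
    for _ in range(m):
        if not remaining:
            break
        best = max(remaining, key=lambda c: gain(solution, c))
        if gain(solution, best) <= 0:
            break  # nothing left improves the objective
        solution.append(best)
        remaining.remove(best)
    return solution


candidates = [frozenset("abc"), frozenset("abd"), frozenset("ef")]
picked = greedy_select(candidates, 2, coverage_gain)
print(picked)
```

The second pick shows diminishing returns: once {a, b, c} is in the solution, {a, b, d} adds only one new node, so {e, f} is preferred.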

  16. Outline ● Motivation and examples ● Problem statement and properties ● Proposed solutions ● Experiments ● Conclusion

  17. Intuition ● How to obtain good cliques while reducing the cost of enumeration? ● Exploit the distribution of edge weights in a real network ● Consider good edges first ● Include good cliques in the solution before considering all edges, based on bounding the contribution of partial cliques

  18. Upper bound for an incomplete clique contribution ● Optimistic completion of a partial clique C ● The current lowest weight will be the lowest in the whole clique ● The rest of the nodes will not overlap
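The two optimistic assumptions on this slide translate directly into a bound. The sketch below assumes a contribution of the form (1 - α) times the weakest edge plus α times the number of newly covered nodes, and a maximum clique size k; that form, the function name `upper_bound`, and its parameters are illustrative assumptions, since the slide's own formula is not in the transcript.

```python
def upper_bound(partial_nodes, lowest_weight, covered, k, alpha):
    """Optimistic contribution of any completion of a partial clique.

    Assumes (hypothetically) contribution = (1 - alpha) * weakest edge
    + alpha * newly covered nodes, with cliques of at most k nodes.
    """
    # Nodes of the partial clique that the current solution does not cover yet.
    new_in_partial = len(set(partial_nodes) - set(covered))
    # Optimism 1: every node still to be added is new (no overlap).
    open_slots = max(k - len(partial_nodes), 0)
    # Optimism 2: the current lowest weight stays the lowest in the full clique.
    return (1 - alpha) * lowest_weight + alpha * (new_in_partial + open_slots)


# Partial clique {a, b} with weakest edge 0.8; node a is already in the solution.
bound = upper_bound({"a", "b"}, 0.8, {"a"}, k=4, alpha=0.5)
print(bound)
```

Adding nodes can only lower the weakest edge and can only fall short of the assumed new-node count, so no real completion of the partial clique can beat this value under the assumed contribution.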

  19. DiCliQ - threshold and prune

  20. DiCliQ - threshold and prune 1. Enumerate cliques in a thresholded graph 2. Upper bound 3. If there is a candidate with a better score contribution than the best UB, add it to the solution

  21. DiCliQ - threshold and prune 1. Enumerate cliques in a thresholded graph 2. Upper bound 3. If there is a candidate with a better score contribution than the best UB, add it to the solution 4. Lower threshold and repeat
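The four steps above can be sketched as a loop over decreasing weight thresholds. This is a simplified illustration under stated assumptions, not the DiCliQ implementation: the contribution form ((1 - α) times the weakest edge plus α times newly covered nodes), the coarse bound on cliques hidden below the threshold, the brute-force enumeration, and the names `cliques_up_to_k` and `threshold_and_prune` are all introduced here.

```python
from itertools import combinations


def cliques_up_to_k(adj, k):
    """Yield every clique with 2..k nodes of an adjacency-set graph, once each."""
    def extend(clique, candidates):
        if len(clique) >= 2:
            yield frozenset(clique)
        if len(clique) == k:
            return
        for v in sorted(candidates):
            # Keep only later candidates adjacent to v, so order is canonical.
            yield from extend(clique + [v],
                              {u for u in candidates if u > v and u in adj[v]})
    yield from extend([], set(adj))


def threshold_and_prune(weights, m, k, alpha):
    """Greedy selection of up to m cliques, committing early when a bound allows."""
    solution, covered = [], set()
    thresholds = sorted(set(weights.values()), reverse=True)
    for i, t in enumerate(thresholds):
        # Step 1: thresholded graph keeps only edges with weight >= t.
        adj = {}
        for (u, v), w in weights.items():
            if w >= t:
                adj.setdefault(u, set()).add(v)
                adj.setdefault(v, set()).add(u)
        # Step 2: any clique not visible yet has weakest edge <= next threshold
        # and at most k new nodes (coarse bound, an assumption of this sketch).
        hidden_ub = None
        if i + 1 < len(thresholds):
            hidden_ub = (1 - alpha) * thresholds[i + 1] + alpha * k
        while len(solution) < m:
            best, best_gain = None, 0.0
            for c in cliques_up_to_k(adj, k):
                if c in solution:
                    continue
                strength = min(weights[tuple(sorted(p))] for p in combinations(c, 2))
                gain = (1 - alpha) * strength + alpha * len(c - covered)
                if gain > best_gain:
                    best, best_gain = c, gain
            # Step 3: commit only if no hidden clique could do better.
            if best is None or (hidden_ub is not None and best_gain < hidden_ub):
                break
            solution.append(best)
            covered |= best
        if len(solution) == m:
            break
        # Step 4: otherwise lower the threshold and repeat.
    return solution


# Toy graph: two triangles joined by one weak edge (keys are sorted node pairs).
weights = {("a", "b"): 0.9, ("a", "c"): 0.9, ("b", "c"): 0.8,
           ("d", "e"): 0.7, ("d", "f"): 0.7, ("e", "f"): 0.6,
           ("c", "d"): 0.3}
result = threshold_and_prune(weights, m=2, k=3, alpha=0.5)
print(result)
```

On this toy graph both triangles are committed before the weakest edge (c-d) is ever added to the thresholded graph, which is the saving the slides aim for.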

  22. DiCliQ - threshold and prune ● Implements GREEDY and hence has a (1-1/e)-approximation factor ● Exhaustive enumeration of all cliques might incur high cost in very large/dense instances ● How to scale up the discovery of diverse cliques without compromising the quality much?

  23. BUDiC - Bottom-up greedy heuristic ● Greedy expansion around a node based on the UB contribution ● Incorporates diversity ● Figure labels: A; C; UB?; Already in the solution

  24. BUDiC - Bottom-up greedy heuristic ● Greedy expansion around a node based on the UB contribution ● Incorporates diversity ● Figure labels: A; C; Already in the solution; Grow away from included nodes based on UB

  25. BUDiC - Bottom-up greedy heuristic ● Greedy expansion around a node based on the UB contribution ● Incorporates diversity ● Repeat for all nodes ● Scales much better: O(m*k*|E|) ● No APX guarantee ● Good quality on real datasets ● Figure labels: A; C; Already in the solution; Grow away from included nodes based on UB
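A bottom-up expansion in the spirit of these three slides might look like the sketch below. It is an illustration under assumptions rather than the BUDiC algorithm itself: the contribution and bound forms ((1 - α) times the weakest edge plus α times new nodes, with open slots counted optimistically), the tie-breaking, and the name `bottom_up_sketch` are introduced here, and the sketch makes no attempt to match the stated O(m*k*|E|) running time.

```python
def bottom_up_sketch(weights, m, k, alpha):
    """Grow one clique per seed node, keep the best per round, repeat m times."""
    adj = {}
    for (u, v), w in weights.items():
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w

    solution, covered = [], set()
    for _ in range(m):
        best, best_gain = None, 0.0
        for seed in sorted(adj):
            clique, lowest = [seed], float("inf")
            candidates = set(adj[seed])
            while candidates and len(clique) < k:
                def bound(v):
                    # Optimistic value of the clique after adding v (assumed form).
                    low = min(lowest, min(adj[v][u] for u in clique))
                    new = len((set(clique) | {v}) - covered)
                    return (1 - alpha) * low + alpha * (new + k - len(clique) - 1)
                nxt = max(sorted(candidates), key=bound)
                lowest = min(lowest, min(adj[nxt][u] for u in clique))
                clique.append(nxt)
                # Only nodes adjacent to every member can still join.
                candidates = {u for u in candidates if u in adj[nxt]}
            grown = frozenset(clique)
            if len(grown) < 2 or grown in solution:
                continue
            # Nodes already in the solution add nothing to the diversity term.
            gain = (1 - alpha) * lowest + alpha * len(grown - covered)
            if gain > best_gain:
                best, best_gain = grown, gain
        if best is None:
            break
        solution.append(best)
        covered |= best
    return solution


# Same toy graph shape as before: two triangles joined by a weak edge.
weights = {("a", "b"): 0.9, ("a", "c"): 0.9, ("b", "c"): 0.8,
           ("d", "e"): 0.7, ("d", "f"): 0.7, ("e", "f"): 0.6,
           ("c", "d"): 0.3}
found = bottom_up_sketch(weights, m=2, k=3, alpha=0.5)
print(found)
```

In the second round, expansion from node d prefers e and f over the already covered c, which mirrors the "grow away from included nodes" label on the slides. Unlike the threshold-and-prune approach, nothing here bounds what the greedy growth might have missed, consistent with the slide's "No APX guarantee".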

  26. Outline ● Motivation and examples ● Problem statement and properties ● Proposed solutions ● Experiments ● Conclusion

  27. Data

  28. Scalability ● Compare running time to a Baseline (no thresholding) and relative quality to iMDV* ● α = 0.5, m = 10, k = 5 ● Figure labels: Apx. guarantee; Scalable, High Quality * S. Bandyopadhyay and M. Bhattacharyya. Mining the largest dense vertexlet in a weighted scale-free graph. Fundam. Inform., 96(1-2):1–25, 2009

  29. Scalability on YeastNet ● Panel settings: α=0.5, k=5; α=0.5, m=5

  30. Quality

  31. Discovering gene complexes

  32. Conclusion ● General results for diverse clique mining ○ application to discovery of effective groups in collaboration ○ complexes in gene networks ○ similarity/correlation graphs ● Two scalable algorithms, one with constant factor approximation ● More than 3 orders of magnitude running time improvement while preserving good quality

  33. Thank You ● Q&A ● The research was supported by the Army Research Laboratory under cooperative agreement W911NF-09-2-0053 (NS-CTA).

  34. Effect of diversity parameter α

  35. Groups in the other datasets ● The Harry Potter cast in the movies data set ● NBA: Nowitzki-Chandler-Stevenson of the defending champion Dallas Mavericks (addition of Chandler positive) ● MLB: Ramirez-Blake-Kuo of the LA Dodgers (13/14 with an otherwise unremarkable lineup reached the playoffs in 2008)

  36. Related work ● Quasi-cliques ○ frequency of clique occurrence (not score) ○ non-unique labels ● Weighted cliques ○ Bandyopadhyay et al. 2009: no APX guarantees, single clique, extended version does not have as good quality ● Other subgraph types ○ Steiner trees ○ Clique percolation (CFinder) ○ Edge weights are constraints and not part of score ● Diversity of node labels within a clique
