Scalable Algorithm for Probabilistic Overlapping Community Detection

Kento Nozawa, Kei Wakabayashi (University of Tsukuba). WSDM 2017 workshop on SWM.


  1. Scalable Algorithm for Probabilistic Overlapping Community Detection
     Kento Nozawa, Kei Wakabayashi (University of Tsukuba)
     WSDM 2017 workshop on SWM

  2. Large Graph
     It’s hard to analyze a large graph. Examples:
     • Citation networks
     • Co-author relationships
     • Social networks
     • Hyperlinks on web pages
     Need: decompose a large graph into smaller subgraphs.

  3. Community Structures in Graphs
     Within the same community,
     • nodes are densely connected internally
     • nodes resemble one another
       – Same affiliation
       – Same interest
       – Related research area

  4. Overlapping Community
     Each node can belong to multiple communities, and many graphs have overlapping communities.
     – Example: related research areas in a co-author graph
     [Figure: co-author graph on nodes A, B, C, D, E, w, x, y, z. Blue: data mining. Red: machine learning. A has published in both areas.]

  5. Bag-of-nodes Representation
     Bag-of-words for graphs
     • A node corresponds to one document
     • The node and its adjacency list correspond to the words in the document
     Graph as bag-of-nodes:
       Node as doc   Nodes as words
       A             A, B, C, D, E, x, y, z
       B             B, A, E
       C             C, D, E, A
       D             D, E, C, A
       E             E, B, A, C, D
       x             x, y, A, w
       y             y, A, x, z
       z             z, y, A
       w             w, x
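
The bag-of-nodes construction on slide 5 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the edge list is read off the slide's adjacency lists, and the names `EDGES` and `bag_of_nodes` are made up for this example.

```python
from collections import defaultdict

# Undirected edges of the toy graph, read off the adjacency lists on slide 5.
EDGES = [
    ("A", "B"), ("A", "C"), ("A", "D"), ("A", "E"),
    ("A", "x"), ("A", "y"), ("A", "z"),
    ("B", "E"), ("C", "D"), ("C", "E"), ("D", "E"),
    ("x", "y"), ("x", "w"), ("y", "z"),
]


def bag_of_nodes(edges):
    """Map each node to a 'document': the node itself plus its neighbours."""
    neighbours = defaultdict(set)
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)
    # Sorting only makes the output deterministic; a bag has no order.
    return {node: [node] + sorted(adj) for node, adj in neighbours.items()}


docs = bag_of_nodes(EDGES)
print(docs["B"])  # ['B', 'A', 'E']
print(docs["w"])  # ['w', 'x']
```

Each value can then be handed to any bag-of-words topic model as one document.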

  6. Latent Dirichlet Allocation (LDA) [Blei+, 2003]
     • Probabilistic generative model for bag-of-words data
     • Finds topics from word co-occurrence
     • Each topic defines a distribution over all words
     Example topics (distributions over all words):
       author 0.14, cite 0.12, citation 0.11, review 0.11, …
       coffee 0.15, drink 0.15, beans 0.14, cafe 0.13, …
     Example document (bag-of-words), "Coffee shop":
       drink, drink, coffee, coffee, coffee, coffee, beans, beans, espresso, cafe
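
To make the generative view on slide 6 concrete, here is a small sketch of how LDA would generate one document from fixed topics. The topic weights are the truncated values shown on the slide (the "…" entries are missing, so `random.choices` implicitly renormalises them); the Dirichlet parameter `alpha`, the function names, and the document length are assumptions chosen for illustration.

```python
import random

# Truncated topic-word weights as shown on slide 6 (the elided words are omitted).
TOPICS = [
    {"author": 0.14, "cite": 0.12, "citation": 0.11, "review": 0.11},
    {"coffee": 0.15, "drink": 0.15, "beans": 0.14, "cafe": 0.13},
]


def sample_dirichlet(alpha, k, rng):
    """Draw from a symmetric k-dimensional Dirichlet via normalised Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topics, length, alpha, rng):
    """LDA generative process for a single document."""
    theta = sample_dirichlet(alpha, len(topics), rng)  # per-document topic mixture
    words = []
    for _ in range(length):
        # Pick a topic for this word slot, then a word from that topic.
        topic = rng.choices(topics, weights=theta)[0]
        vocab = list(topic)
        words.append(rng.choices(vocab, weights=[topic[w] for w in vocab])[0])
    return words


rng = random.Random(0)
print(generate_document(TOPICS, length=10, alpha=0.5, rng=rng))
```

Inference runs this process in reverse: given only the documents, it estimates the topics and the per-document mixtures.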

  7. LDA for Graph
     A topic represents an overlapping community. Each community is an affiliation probability distribution over nodes.
     Graph as documents: the bag-of-nodes table (each node as a document, the node and its adjacency list as words).
     Communities (distributions over nodes):
       Community 1: E 0.20, C 0.20, D 0.18, B 0.18, A 0.12, x 0.04, y 0.03, z 0.03, w 0.02
       Community 2: x 0.22, y 0.22, z 0.20, A 0.15, w 0.07, C 0.05, D 0.04, B 0.04, E 0.01
     E, C, D, B and A belong to Community 1 with high probability.
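
One way to read overlapping memberships off the two distributions on slide 7 is to threshold the affiliation probabilities. The slides do not say how memberships are extracted, so the cut-off of 0.06 and the helper `memberships` below are assumptions for illustration only.

```python
# Community-node probabilities copied from slide 7.
COMMUNITIES = [
    {"E": 0.20, "C": 0.20, "D": 0.18, "B": 0.18, "A": 0.12,
     "x": 0.04, "y": 0.03, "z": 0.03, "w": 0.02},
    {"x": 0.22, "y": 0.22, "z": 0.20, "A": 0.15, "w": 0.07,
     "C": 0.05, "D": 0.04, "B": 0.04, "E": 0.01},
]


def memberships(communities, threshold):
    """Return {node: [community indices]} for probabilities >= threshold."""
    result = {}
    for index, distribution in enumerate(communities):
        for node, probability in distribution.items():
            if probability >= threshold:
                result.setdefault(node, []).append(index)
    return result


# 0.06 is an arbitrary illustrative cut-off, not a value from the slides.
members = memberships(COMMUNITIES, threshold=0.06)
print(members["A"])  # [0, 1]
print(members["E"])  # [0]
print(members["w"])  # [1]
```

With this cut-off, A is the only node assigned to both communities, which matches the overlap shown on slide 4.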

  8. Stochastic Variational Inference [Mimno+, 2012]
     Inference algorithm based on stochastic gradient descent
     – Updates the parameters from nodes sampled in each iteration
     – Mini-batch size: the number of nodes sampled as documents per iteration
     Example: with a mini-batch size of 4, one iteration might sample these rows from the bag-of-nodes table:
       Node as doc   Nodes as words
       B             B, A, E
       C             C, D, E, A
       z             z, y, A
       w             w, x
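
The mini-batch loop on slide 8 can be outlined as follows. This is only a skeleton under stated assumptions: it shows the sampling and the decaying-step-size blend that stochastic updates commonly use, but it replaces the local variational step with a placeholder (`noisy_counts`, rescaled word counts). The step-size schedule, `tau0`, `kappa`, and all function names are illustrative and not taken from the slides.

```python
import random
from collections import Counter

# Bag-of-nodes documents from slide 5.
DOCS = {
    "A": ["A", "B", "C", "D", "E", "x", "y", "z"],
    "B": ["B", "A", "E"], "C": ["C", "D", "E", "A"],
    "D": ["D", "E", "C", "A"], "E": ["E", "B", "A", "C", "D"],
    "x": ["x", "y", "A", "w"], "y": ["y", "A", "x", "z"],
    "z": ["z", "y", "A"], "w": ["w", "x"],
}


def step_size(t, tau0=1.0, kappa=0.7):
    """Decaying step size rho_t = (tau0 + t) ** -kappa (assumed schedule)."""
    return (tau0 + t) ** -kappa


def sample_minibatch(docs, batch_size, rng):
    """Sample batch_size nodes without replacement and return their documents."""
    nodes = rng.sample(sorted(docs), batch_size)
    return {node: docs[node] for node in nodes}


def noisy_counts(batch, num_docs):
    """Placeholder estimate: batch word counts rescaled to the full corpus size."""
    scale = num_docs / len(batch)
    counts = Counter()
    for words in batch.values():
        counts.update(words)
    return {word: scale * count for word, count in counts.items()}


rng = random.Random(0)
vocab = sorted(DOCS)
lam = {word: 1.0 for word in vocab}  # stand-in for one global parameter vector
for t in range(100):
    batch = sample_minibatch(DOCS, batch_size=4, rng=rng)
    estimate = noisy_counts(batch, len(DOCS))
    rho = step_size(t)
    # Blend the previous value with the noisy mini-batch estimate.
    lam = {w: (1 - rho) * lam[w] + rho * estimate.get(w, 0.0) for w in vocab}
```

In a real implementation the placeholder would be replaced by the per-document variational update, with one parameter vector per community.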

  9. Experiment
     Evaluation of scalability with respect to graph size
     • Runtime for overlapping community detection
     Quality metrics for overlapping communities
     • Triangle participation ratio (TPR)
       – Ratio of nodes in a community that belong to a triangle
       – Higher is better
     • Conductance
       – Ratio of a community's edges that link to a node outside it
       – Lower is better
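
Both quality metrics on slide 9 can be computed directly from an adjacency map. The sketch below assumes one common formulation (triangles counted inside the community; conductance as boundary edges divided by twice the internal edges plus the boundary edges); the slides do not give exact formulas, so treat these definitions and the function names as assumptions. The toy graph is the one from slide 5.

```python
from collections import defaultdict
from itertools import combinations

EDGES = [
    ("A", "B"), ("A", "C"), ("A", "D"), ("A", "E"),
    ("A", "x"), ("A", "y"), ("A", "z"),
    ("B", "E"), ("C", "D"), ("C", "E"), ("D", "E"),
    ("x", "y"), ("x", "w"), ("y", "z"),
]


def adjacency(edges):
    """Build an undirected adjacency map {node: set of neighbours}."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def triangle_participation_ratio(adj, community):
    """Fraction of community members that sit in a triangle inside the community."""
    members = set(community)
    in_triangle = 0
    for u in members:
        inside = adj[u] & members
        # u is in a triangle if any two of its in-community neighbours are adjacent.
        if any(w in adj[v] for v, w in combinations(inside, 2)):
            in_triangle += 1
    return in_triangle / len(members)


def conductance(adj, community):
    """Boundary edges / (2 * internal edges + boundary edges)."""
    members = set(community)
    cut = sum(len(adj[u] - members) for u in members)
    internal = sum(len(adj[u] & members) for u in members) // 2
    total = 2 * internal + cut
    return cut / total if total else 0.0


adj = adjacency(EDGES)
print(triangle_participation_ratio(adj, "ABCDE"))  # 1.0
print(triangle_participation_ratio(adj, ["A", "x", "y", "z", "w"]))  # 0.8
print(conductance(adj, ["A", "x", "y", "z", "w"]))  # 0.25
```

Node w lowers the TPR of the second community because its only neighbour is x, so it cannot close a triangle.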

  10. Experimental Datasets (from SNAP Datasets)
      Name         #nodes       #edges
      DBLP         317,080      1,049,866
      Orkut        3,072,441    117,185,083
      Friendster   65,608,366   1,806,067,135
      For Friendster only:
      • The data is stored in MySQL
      • Each iteration samples a mini-batch of records from the table
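
Sampling a mini-batch from a database table, as described for Friendster on slide 10, might look like the sketch below. It uses Python's built-in `sqlite3` as a stand-in for MySQL so that it runs without a server; the table name `bag_of_nodes`, its columns, and the space-separated encoding of adjacency lists are all hypothetical. In MySQL the random ordering function is `RAND()` rather than SQLite's `RANDOM()`.

```python
import sqlite3

# A few bag-of-nodes rows from slide 5, used as toy data.
DOCS = {
    "B": ["B", "A", "E"], "C": ["C", "D", "E", "A"],
    "D": ["D", "E", "C", "A"], "x": ["x", "y", "A", "w"],
    "z": ["z", "y", "A"], "w": ["w", "x"],
}


def create_store(docs):
    """Load the documents into an in-memory SQLite table (stand-in for MySQL)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE bag_of_nodes (node TEXT PRIMARY KEY, words TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO bag_of_nodes (node, words) VALUES (?, ?)",
        [(node, " ".join(words)) for node, words in docs.items()],
    )
    conn.commit()
    return conn


def sample_minibatch(conn, batch_size):
    """Fetch batch_size random rows and decode them back into word lists."""
    rows = conn.execute(
        "SELECT node, words FROM bag_of_nodes ORDER BY RANDOM() LIMIT ?",
        (batch_size,),
    ).fetchall()
    return {node: words.split() for node, words in rows}


conn = create_store(DOCS)
batch = sample_minibatch(conn, batch_size=4)
print(sorted(batch))
```

Ordering the whole table randomly is simple but scans every row, so on a table with tens of millions of records a different strategy (for example, sampling random primary keys) would likely be needed.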

  11. Comparison of Runtime
      [Figure: runtime comparison chart. Labels recovered from the figure: "2hours", "7 min."]
      Settings: #communities: 4,000; #iterations: 1,000; mini-batch size: 2,000

  12. The Metrics of DBLP Communities
      • TPR: the median of SVBLDA is the third best
      • Conductance: the median of SVBLDA is the third worst
      Settings: #communities: 4,000; #iterations: 1,000; mini-batch size: 2,000

  13. Parameter Sensitivity in DBLP
      • Vary the mini-batch size or #iterations while fixing the other parameter
      • No significant improvement in TPR/conductance when mini-batch size > 3,000 or #iterations > 2,000
      Fixed values: mini-batch size: 2,000; #iterations: 1,000

  14. Conclusion
      • A scalable community detection algorithm based on LDA for large graphs
      • About 2 hours to detect communities in the large graph
      • A large mini-batch size and a large #iterations are unnecessary for the DBLP dataset
