Improved Parallel Algorithms for Density-Based Network Clustering
Mohsen Ghaffari (ETH), Silvio Lattanzi (Google), Slobodan Mitrović (MIT)
Why density-based network clustering? A wide range of applications:
Community detection
[Leskovec et al. ’08; Chen & Saad ’12; Gionis & Tsourakakis ’15; Mitzenmacher et al. ’15]
Spam detection
[Gibson et al. ‘05]
Computational biology
[Altaf-Ul-Amin et al. ‘06; Fratkin et al. ‘06; Saha et al. ‘10]
…
Goal: Given a graph G, find a subgraph H such that |E(H)| / |V(H)| is maximized.
Example (figure): one subgraph H has density |E(H)| / |V(H)| = 17/13, while another subgraph H′ has density |E(H′)| / |V(H′)| = 11/7.
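The MPC algorithms in this talk are more involved, but the objective itself is easy to experiment with. Below is a sketch of the classic sequential greedy peeling 2-approximation (Charikar '00), not the talk's algorithm: repeatedly delete a minimum-degree vertex and remember the densest intermediate subgraph.

```python
from fractions import Fraction

def densest_subgraph_peel(edges):
    """Greedy peeling: repeatedly delete a minimum-degree vertex and
    return the densest intermediate subgraph seen along the way.
    This is a 2-approximation of max |E(H)| / |V(H)|."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    m = len(edges)
    best_density, best_nodes = Fraction(0), set()
    while adj:
        density = Fraction(m, len(adj))
        if density > best_density:
            best_density, best_nodes = density, set(adj)
        u = min(adj, key=lambda x: len(adj[x]))  # a minimum-degree vertex
        for w in adj[u]:
            adj[w].discard(u)
        m -= len(adj[u])
        del adj[u]
    return best_density, best_nodes

# K4 on {0,1,2,3} with a pendant vertex 4: the densest subgraph is the
# K4, with density 6/4 = 3/2 (the whole graph only has density 7/5).
density, nodes = densest_subgraph_peel(
    [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4)]
)
```

Using `Fraction` keeps the densities exact, so ties and close calls are compared without floating-point error.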
Goal: Given k, find a maximal subgraph of minimum degree at least k. (k-core)
The coreness number of a vertex v is the maximum k for which v is part of the k-core. (Figure: nested 1-core ⊇ 2-core ⊇ 3-core ⊇ 4-core.)
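The k-core for a fixed k can be computed directly from the definition by repeatedly deleting vertices of degree below k until none remain; a minimal sketch:

```python
def k_core(edges, k):
    """Return the vertex set of the k-core: the maximal subgraph in
    which every vertex has degree at least k (empty if none exists)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    changed = True
    while changed:
        low = [v for v, nbrs in adj.items() if len(nbrs) < k]
        changed = bool(low)
        for v in low:
            for w in adj.pop(v):  # delete v, updating its neighbors
                if w in adj:
                    adj[w].discard(v)
    return set(adj)

# A triangle {0,1,2} with a path 2-3-4 attached: the 2-core is the triangle.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)]
```

Note the fixpoint loop: deleting one vertex can push its neighbors below the threshold, so the scan repeats until nothing changes.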
Algorithms performed sequentially.
Examples:
An approach to handling massive data: the data is partitioned across N machines; in each synchronous round the machines compute locally, and the messages they exchange form the next round's data. (Figure: data spread over N machines, producing next-round data each round.)
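The round structure above can be illustrated with a toy single-process simulation (an assumed pedagogical sketch, not from the talk): computing vertex degrees in one map round, one shuffle, and one reduce round.

```python
def mpc_degree_count(edges, n_machines):
    """Toy MPC-style simulation: edges are partitioned across machines;
    each machine emits (vertex, partial_degree) messages, which are
    routed by key to form the next round's data and then summed."""
    # Round 0: partition the input edges across the machines.
    machines = [edges[i::n_machines] for i in range(n_machines)]
    # Round 1: each machine computes local partial degrees.
    messages = []
    for local_edges in machines:
        partial = {}
        for u, v in local_edges:
            partial[u] = partial.get(u, 0) + 1
            partial[v] = partial.get(v, 0) + 1
        messages.extend(partial.items())
    # Shuffle: route each message by key (vertex id) to a machine;
    # the received messages are the next round's data.
    next_round = [{} for _ in range(n_machines)]
    for vertex, count in messages:
        bucket = next_round[vertex % n_machines]
        bucket[vertex] = bucket.get(vertex, 0) + count
    # Round 2: each machine sums the partial degrees it received.
    degrees = {}
    for bucket in next_round:
        degrees.update(bucket)
    return degrees

deg = mpc_degree_count([(0, 1), (0, 2), (1, 2), (2, 3)], n_machines=3)
```

The point of the model is that per-machine memory is limited, so the interesting question is how few such rounds suffice.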
Bahmani, Kumar, Vassilvitskii, VLDB 2012.
Bhattacharya, Henzinger, Nanongkai, Tsourakakis, STOC 2015.
Epasto, Lattanzi, Sozio, WWW 2015.
McGregor, Tench, Vorotnikova, Vu, MFCS 2015.
Esfandiari, Hajiaghayi, Woodruff, SPAA 2016.
Bahmani, Goel, Munagala, Workshop on Algorithms and Models for the Web-Graph 2014.
Esfandiari, Lattanzi, and Mirrokni, ICML 2018.
Sarıyüce, Gedik, Jacques-Silva, Wu, Çatalyürek, VLDB 2013.
Aksu, Canim, Chang, Korpeoglu, Ulusoy, TKDE 2014.
Theorem 1. A (1 + ε)-approximate k-core decomposition can be obtained in O(log log n) MPC rounds with Õ(n) memory per machine.

Theorem 2. A (2 + ε)-approximate k-core decomposition can be obtained in Õ(√log n) MPC rounds with O(n^δ) memory per machine and Õ(max(n^{1+δ}, m)) total memory.

Theorem 3. A (1 + ε)-approximate densest subgraph can be obtained in O(√log n) MPC rounds with O(n^δ) memory per machine and Õ(max(n^{1+δ}, m)) total memory.

Theorem 4. For a graph of arboricity λ, a (2 + ε)λ-orientation can be obtained in Õ(√log n) MPC rounds with O(n^δ) memory per machine and Õ(λn) total memory.

(n = number of vertices, m = number of edges, δ > 0 an arbitrarily small constant.)
High-level idea: Simulate the sequential algorithm.
The sequential algorithm: for k = 1, 2, 3, …, repeatedly remove all the vertices of degree less than k. The coreness of a vertex is the largest k for which it is not removed. (Figure: the peeling process for k = 2; the coreness value of every remaining vertex is ≥ 2.)

Implementing this approach directly can take too many rounds. Idea: process only large thresholds.
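The sequential baseline just described can be sketched directly: peel at thresholds k = 1, 2, …, recording for each vertex the largest k it survives.

```python
def coreness(edges):
    """Sequential peeling: for k = 1, 2, ..., repeatedly remove all
    vertices of degree less than k; a vertex's coreness is the largest
    k for which it is not removed."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    core = {v: 0 for v in adj}
    k = 1
    while adj:
        removed = True
        while removed:  # peel to a fixpoint at threshold k
            low = [v for v, nbrs in adj.items() if len(nbrs) < k]
            removed = bool(low)
            for v in low:
                for w in adj.pop(v):
                    if w in adj:
                        adj[w].discard(v)
        for v in adj:
            core[v] = k  # v survived threshold k, so its coreness is >= k
        k += 1
    return core

# K4 on {0,1,2,3} with a path 3-4-5 attached: the K4 vertices have
# coreness 3, the path vertices peel away already at k = 2.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4), (4, 5)]
core = coreness(edges)
```

This is exactly the approach that, implemented naively in MPC, takes too many rounds: each inner peeling step is a round, and the chain of removals can be long.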
Partition the graph across √n machines and apply the sequential algorithm locally. The local degree of each vertex v with d_v ≥ √n·log n is sharply concentrated around its expectation, so running the sequential algorithm locally finds (1 + ε)-approximate k-cores for all k ≥ √n·log n.
Partitioning across √n machines detects the k-cores for k ≥ √n·log n. What about k < √n·log n? Ignore all the edges between vertices of coreness ≥ √n·log n; the number of remaining edges is Õ(n·√n). Partition the vertices across n^{1/4} machines, detect the k-cores for k ≥ n^{1/4}·log n, and repeat.
n → n^{1/2} → n^{1/4} → … → n^{1/log n}: O(log log n) rounds in total.
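A small numeric illustration of why the recursion takes O(log log n) rounds: the exponent of the degree threshold halves each round, so it hits the polylogarithmic regime after about log log n halvings.

```python
import math

def rounds_until_polylog(n):
    """Count recursion rounds: each round the degree threshold drops
    from t to sqrt(t); stop once t is at most log2(n)."""
    t, rounds = n, 0
    while t > math.log2(n):
        t = math.sqrt(t)
        rounds += 1
    return rounds

# For n = 2**64 the threshold exponent halves 64 -> 32 -> 16 -> 8 -> 4,
# i.e. about log2(log2(n)) rounds.
r = rounds_until_polylog(2**64)
```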
Experiments: SKC = the algorithm in [Esfandiari et al. 2018]; VKC = the algorithms of Theorems 1 and 2.
Theorem 2. A (2 + ε)-approximate k-core decomposition can be obtained in Õ(√log n) MPC rounds with O(n^δ) memory per machine and Õ(max(n^{1+δ}, m)) total memory.
The relaxed sequential algorithm: for k = 1, 2, …, remove all the vertices of degree less than (2 + ε)k. The approximate coreness of a vertex is the largest k for which it is not removed. (Figure: the process for k = 1, comparing the exact 1-coreness with the resulting (2 + ε)-approximate 1-coreness.)
The algorithm terminates in O(log n) iterations! High-level idea: simulate the O(log n) sequential iterations in Õ(√log n) MPC rounds.
Split the O(log n) iterations into √log n phases, each phase consisting of √log n iterations. Simulate each phase for each vertex by gathering its √log n-hop neighborhood. (Figure: the 2-hop neighborhood of a vertex v.)

A √log n-hop neighborhood might be too big! E.g., a vertex may have degree n.
Idea: Sparsify the graph.
Given a parameter k, sparsify the graph by keeping each edge with probability Θ(log n / k). The approximate k-core is preserved after the sparsification.

Some vertices might still have too large a degree, e.g., a vertex of degree n when k = n^{0.1}. "Freeze" all the vertices whose degree exceeds 2^{δ·√log n} after the sparsification. The number of frozen vertices is small and affects the round complexity only by a constant factor.
Is the non-empty k-core for the largest k the same as the densest subgraph? No. (Figure: a graph whose largest non-empty core, the 3-core, has density 3/2, while its 2-core contains a subgraph of density 11/7 > 3/2.)
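A counterexample with the slide's numbers can be checked by brute force. The construction below is an assumed reconstruction of the figure: a maximal 2-degenerate graph on 7 vertices (11 edges, density 11/7, empty 3-core) alongside a K4 (density 6/4 = 3/2, which is the largest non-empty core).

```python
from fractions import Fraction
from itertools import combinations

# Maximal 2-degenerate graph on 7 vertices: start from a triangle and
# attach each new vertex to two earlier ones (3 + 2*4 = 11 edges).
sparse_part = [(0, 1), (0, 2), (1, 2), (3, 1), (3, 2), (4, 2), (4, 3),
               (5, 3), (5, 4), (6, 4), (6, 5)]
k4 = [(u, v) for u, v in combinations([7, 8, 9, 10], 2)]
edges = sparse_part + k4

def density(vertex_set, edges):
    """Density |E(S)| / |S| of the induced subgraph on vertex_set."""
    s = set(vertex_set)
    return Fraction(sum(1 for u, v in edges if u in s and v in s), len(s))

# Brute force over all non-empty vertex subsets (2^11 - 1 of them).
vertices = sorted({v for e in edges for v in e})
best = max(
    density(s, edges)
    for r in range(1, len(vertices) + 1)
    for s in combinations(vertices, r)
)
```

The densest subgraph (density 11/7) lies inside the 2-core and is disjoint from the 3-core (the K4, density 3/2), so maximizing coreness and maximizing density are genuinely different objectives.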