Efficient Diameter Approximation for Large Graphs in MapReduce
Geppino Pucci - Università di Padova, Italy
Based on joint works ([SPAA15], [IPDPS16]) with: Matteo Ceccarello, Andrea Pietracaprina (U. Padova), Eli Upfal (Brown U.)
Outline
- 1. Context
- 2. Computational model
- 3. Previous work
- 4. Diameter approximation algorithm
- 5. Experiments
- 6. Conclusions
Context
Scenario
◮ Large graph analytics: major discovery tool for diverse
application domains (e.g., social/road/biological network analysis, cybersecurity, NLP, cognitive computing)
◮ (Commodity) computer clusters: cheap, widespread platforms
with relatively high communication/synchronization costs
Focus
◮ Approximation of graph diameter
◮ very large, undirected, weighted (sparse) graphs
◮ linear space, few parallel rounds, practical efficiency
Computational Model
MR model [PPRSU12]
◮ Abstraction of popular programming frameworks
(MapReduce/Hadoop, Spark)
◮ Builds upon and simplifies [Karloff+’10][Goodrich’11]
◮ Underlying platform: unspecified number of interconnected
commodity machines
◮ Algorithm: sequence of rounds
◮ 2 parameters: max local space M_L, max aggregate space M_A
MR(M_L, M_A) round
Transforms a multiset X of key-value pairs into a new multiset Y
of key-value pairs by applying a given reduce function to all input
pairs with the same key.
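The round abstraction above can be sketched sequentially as follows. This is a minimal illustration, not the platform's implementation: `mr_round` and `keep_min` are hypothetical names, and the space bounds M_L and M_A are not enforced here.

```python
from collections import defaultdict

def mr_round(pairs, reduce_fn):
    """Simulate one round: group the input multiset by key, then reduce each group."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    output = []
    for key, values in groups.items():
        # Each call sees only the values that share one key.
        output.extend(reduce_fn(key, values))
    return output

def keep_min(key, values):
    """Example reducer: keep the smallest tentative distance proposed for a node."""
    return [(key, min(values))]

# Three distance proposals for two nodes; the round keeps one pair per node.
proposals = [("a", 3), ("b", 5), ("a", 1)]
result = mr_round(proposals, keep_min)
```

In the model, the values grouped under one key must fit in local space M_L, and the whole multiset must fit in aggregate space M_A.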
Previous work
Sequential setting
◮ APSP (Johnson’s alg.): O(n · m + n^2 log n) time
◮ Roditty et al. (STOC’13, SODA’14): 3/2-approximation in
O(min{m^{3/2}, mn^{2/3}}) = o(n · m) time.
◮ Empirically: very few SSSPs guarantee accurate estimates
([MLH09, CGHLM13, C+12, C+13, C+15]).
Parallel setting
◮ Exact diameter through matrix-multiplication: O(log n)
rounds but Ω(n^2) space.
◮ Cohen (JACM’00): (1 + ε)-approximation in O(poly(log n))
time and superlinear space. Not easy to implement.
Previous work (cont’d)
2-Approximation achievable through SSSP PRAM algorithms
∆-stepping (Meyer and Sanders, JoA’03)
◮ Parallel time-work tradeoff by staggering edge relaxations
(dj ← min{dj, di + wij})
◮ At iteration i, compute distances ∈ [(i − 1)∆, i∆].
◮ Small ∆’s: ≃ Dijkstra. Large ∆’s: ≃ Bellman-Ford
◮ Round complexity = Ω(ℓ_{Φ(G)}), where ℓ_{Φ(G)} edges are required
to connect any two nodes at distance Φ(G).
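A sequential sketch of the bucket-based ∆-stepping idea, followed by the 2-approximation of the diameter from a single SSSP. This is an illustration under simplifying assumptions: the function name, the adjacency-dictionary graph format, and the bucket indexing (bucket i holds tentative distances in [i∆, (i + 1)∆)) are choices made here, not taken from the slides.

```python
import math
from collections import defaultdict

def delta_stepping(adj, source, delta):
    """Single-source shortest paths on adj = {u: {v: weight}}, with delta > 0."""
    dist = {u: math.inf for u in adj}
    buckets = defaultdict(set)

    def relax(v, d):
        # d_v <- min{d_v, d}; move v to the bucket of its new tentative distance.
        if d < dist[v]:
            if dist[v] < math.inf:
                buckets[int(dist[v] // delta)].discard(v)
            dist[v] = d
            buckets[int(d // delta)].add(v)

    relax(source, 0)
    while any(buckets.values()):
        i = min(k for k, b in buckets.items() if b)
        settled = set()
        while buckets[i]:
            frontier, buckets[i] = buckets[i], set()
            settled |= frontier
            for u in frontier:                 # light edges may refill bucket i
                for v, w in adj[u].items():
                    if w <= delta:
                        relax(v, dist[u] + w)
        for u in settled:                      # heavy edges are relaxed once per bucket
            for v, w in adj[u].items():
                if w > delta:
                    relax(v, dist[u] + w)
    return dist

graph = {
    "a": {"b": 1, "c": 10},
    "b": {"a": 1, "c": 4},
    "c": {"a": 10, "b": 4, "d": 1},
    "d": {"c": 1},
}
dist_a = delta_stepping(graph, "a", delta=2)
ecc_a = max(dist_a.values())   # eccentricity of "a": ecc_a <= diameter <= 2 * ecc_a
```

On a connected undirected graph, the triangle inequality gives ecc(v) ≤ Φ(G) ≤ 2 · ecc(v) for any node v, which is why one SSSP yields a 2-approximation.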
Our aim
Diameter approximation in linear space and o(ℓ_{Φ(G)}) rounds
Diameter approximation: high-level strategy
Based on shallow-depth clustering:
- 1. Compute a decomposition C of G into clusters of small radius
- 2. Estimate diameter Φ(G) from diameter Φ(G_C), with G_C a
suitable quotient graph derived from C
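The two steps above can be sketched as follows, given a clustering that is already computed. The sketch rests on assumptions not stated on the slide: the quotient edge between two clusters is taken as the minimum, over crossing edges (u, v), of dist[u] + w(u, v) + dist[v], where dist[u] is the weight of a path from u to its center; and the estimate is Φ(G_C) plus twice the maximum cluster radius, a valid but possibly loose upper bound. All function names are hypothetical.

```python
import heapq
import math

def dijkstra(adj, source):
    """Shortest-path weights from source on adj = {u: {v: weight}}."""
    dist = {u: math.inf for u in adj}
    dist[source] = 0
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue                      # stale heap entry
        for v, w in adj[u].items():
            if d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist

def quotient_graph(adj, center, dist):
    """One node per cluster center; edge weight = cheapest center-to-center crossing."""
    q = {c: {} for c in set(center.values())}
    for u, nbrs in adj.items():
        for v, w in nbrs.items():
            cu, cv = center[u], center[v]
            if cu != cv:
                cand = dist[u] + w + dist[v]
                if cand < q[cu].get(cv, math.inf):
                    q[cu][cv] = q[cv][cu] = cand
    return q

def diameter_upper_bound(adj, center, dist):
    """Upper bound on the diameter: quotient diameter + 2 * max cluster radius."""
    q = quotient_graph(adj, center, dist)
    phi_q = max(max(dijkstra(q, c).values()) for c in q)
    return phi_q + 2 * max(dist.values())

# Path a-b-c-d-e with weights 1, 1, 3, 1, split into clusters {a, b, c} and {d, e}.
path = {"a": {"b": 1}, "b": {"a": 1, "c": 1}, "c": {"b": 1, "d": 3},
        "d": {"c": 3, "e": 1}, "e": {"d": 1}}
center = {"a": "b", "b": "b", "c": "b", "d": "d", "e": "d"}
dist = {"a": 1, "b": 0, "c": 1, "d": 0, "e": 1}
bound = diameter_upper_bound(path, center, dist)
```

The bound holds because any two nodes are joined by a path that goes from the first node to its center, follows the quotient path between centers, and ends at the second node.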
Remarks
◮ Previous decompositions ([MPX13, Mey08]) do not guarantee
small (unweighted+weighted) radius
◮ Cluster granularity chosen so that G_C fits into local memory
◮ Small radius → low round complexity, better approximation
Decomposition C: algorithm cluster(τ)
Challenges
Cluster centers are sampled at random. In order to attain small (unweighted+weighted) cluster radius we must
- 1. Ensure higher sampling density in remote regions of the graph
- 2. Avoid heavy edges for cluster growth
Key ingredients
- 1. Progressive clustering strategy
- 2. ∆-stepping approach to cluster growing
Decomposition C: a pathological example
Decomposition C: algorithm cluster(τ)
Progressive clustering [CPPU15]
- 1. Select random batch of τ centers from uncovered nodes
- 2. Grow both old and new clusters until covering half of the
uncovered nodes
- 3. Repeat steps 1-2 until complete coverage
∆-stepping-like cluster growth [CPPU16]
◮ ∆ ← guess on cluster’s minimum weighted radius
◮ In each iteration of progressive clustering (Steps 1-2):
◮ Use only light edges (weight < ∆) and stop at radius ∆
◮ If desired coverage cannot be obtained then ∆ ← 2∆
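A sequential sketch of how progressive clustering and ∆-limited growth might fit together. It is a simplification under stated assumptions, not the algorithm of [CPPU15]/[CPPU16]: growth proceeds in synchronous steps, each cluster may grow by at most ∆ per iteration (the budget is reset at every new batch), a node keeps the first cluster that reaches it, and the signature `cluster(adj, tau, delta, seed)` is hypothetical.

```python
import random

def cluster(adj, tau, delta, seed=0):
    """Return (center, dist, delta) for adj = {u: {v: weight}}; requires delta > 0."""
    if delta <= 0:
        raise ValueError("delta must be positive")
    rng = random.Random(seed)
    center, dist = {}, {}
    uncovered = set(adj)
    while uncovered:
        # Step 1: sample a batch of (at most) tau centers among uncovered nodes.
        batch = rng.sample(sorted(uncovered), min(tau, len(uncovered)))
        for c in batch:
            center[c], dist[c] = c, 0
        uncovered.difference_update(batch)
        # Step 2: grow old and new clusters until half of the uncovered nodes are covered.
        target = len(uncovered) // 2
        growth = {u: 0 for u in center}      # weight added during this iteration
        while len(uncovered) > target:
            joins = {}
            for u in uncovered:
                for v, w in adj[u].items():
                    # Only light edges, and only within growth budget delta.
                    if v in growth and w < delta and growth[v] + w <= delta:
                        if u not in joins or growth[v] + w < joins[u][0]:
                            joins[u] = (growth[v] + w, v)
            if joins:
                for u, (g, v) in joins.items():
                    center[u] = center[v]
                    dist[u] = dist[v] + adj[u][v]   # weight of a real path to the center
                    growth[u] = g
                uncovered.difference_update(joins)
            elif any(v in growth for u in uncovered for v in adj[u]):
                delta *= 2                   # coverage stalled: the guess was too small
            else:
                break                        # the rest is unreachable from current clusters
    return center, dist, delta

example = {"a": {"b": 1}, "b": {"a": 1, "c": 1}, "c": {"b": 1, "d": 5},
           "d": {"c": 5, "e": 1}, "e": {"d": 1}, "z": {}}
center, dist, final_delta = cluster(example, tau=1, delta=2, seed=1)
```

Because every iteration samples fresh centers from the uncovered nodes, sparsely covered regions receive a higher sampling density over time, which is the first of the two challenges listed earlier.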
Algorithm cluster(τ): example (τ = 1, ∆ = 4)
[Figure sequence on a weighted graph G with 17 labeled nodes (A, B, C, D, E, F, G, H, I, L, M, N, O, P, Q, R, S). Panels: Graph G; 1st batch of τ centers; 2nd batch of τ centers; 3rd batch of τ centers]
Decomposition C: algorithm cluster(τ)
Theorem
W.h.p. cluster(τ) computes a decomposition C of G into O(τ log^2 n) clusters
◮ Max cluster radius: O(R(G, τ) log n)
◮ Round complexity: O(min{n/τ, ℓ_{R(G,τ)}} log n)
in MR(n^ε, m), for any constant ε ∈ (0, 1).
where:
◮ R(G, τ): minimum max radius in any τ-clustering of G
◮ ℓ_X: max number of edges in a min-weight path of weight X
Diameter approximation: example
Graph G, weighted diameter Φ(G) = 16
[Figure: graph G partitioned into three clusters, with radii 4, 2, and 3]
Quotient graph G_C
[Figure: quotient graph on nodes A, M, R with edge weights 5 and 7]
Φ(G_C) = 12
Φ(G) ≤ 12 + 4 + 2 = 18 (vs 16)
Diameter approximation: main result
Theorem
For a given weighted graph G, w.h.p. we can compute an upper bound to Φ(G)
◮ Approximation ratio: O(log^3 n)
◮ Round complexity: O(min{n/τ, ℓ_{R(G,τ)} log n} log n)
in MR(n^ε, m), for any constant ε ∈ (0, 1).
Remarks
◮ Round complexity becomes o(ℓ_{Φ(G)}/n^δ) on graphs of bounded
doubling dimension
◮ Practical implementation. On real-world graphs,
approximation ratio < 1.3
◮ Byproduct: linear-space, low-round k-center clustering in MR
Proof Idea
◮ 2-phase decomposition strategy:
◮ Phase 1. Compute an estimate R of R(G, τ) through
progressive sampling.
◮ Phase 2. Perform log n iterations of cluster-growing steps of
fixed radius R from batches of centers selected with geometrically increasing probability
◮ O(log^3 n) approximation: w.h.p. the nodes of each
shortest-path segment of length R belong to O(log^2 n) clusters of radius O(R log n).
Diameter approximation: experiments
Experimental setup
◮ In-house cluster with 16 machines
◮ 18GB RAM / Intel i7 Nehalem 4-core processor
◮ Spark MapReduce platform
Datasets
Graph        n                 m                 Φ(G)
roads-USA    23,947,347        29,166,673        55,859,820
roads-CAL    1,890,815         2,328,872         16,425,258
livejournal  3,997,962         32,681,189        9.41
twitter      41,652,230        1,468,365,182     9.07
mesh(S)      S^2               2S(S − 1)         †
R-MAT(S)     2^S               16 · 2^S          †
roads(S)     ≈ S · 2.3 · 10^7  ≈ S · 5.3 · 10^7  †
† The diameter depends on the size of the graph, controlled by S > 1.
Scalability
[Plot: time (s), ticks from 500 to 4500, versus number of machines (2^1, 2^2, 2^3, 2^4); series: RMAT(26), roads(3)]
Diameter approximation: experiments
◮ We compare our algorithm (CLUSTER) with ∆-stepping
Rounds
[Plot: rounds (log scale, 10^0 to 10^5) per dataset (roads-USA, roads-CAL, mesh, livejournal, twitter, R-MAT(24)); series: CLUSTER, ∆-stepping]
Time
[Plot: time (s) (log scale, 10^1 to 10^5) per dataset (roads-USA, roads-CAL, mesh, livejournal, twitter, R-MAT(24)); series: CLUSTER, ∆-stepping]
Diameter approximation: experiments
Work
[Plot: work (log scale, 10^7 to 10^12) per dataset (roads-USA, roads-CAL, mesh, livejournal, twitter, R-MAT(24)); series: CLUSTER, ∆-stepping]
Approximation
[Plot: approximation ratio (1.0 to 1.5) per dataset (roads-USA, roads-CAL, mesh, livejournal, twitter, R-MAT(24)); series: CLUSTER, ∆-stepping]