[PPT] - Argo: Architecture-Aware Graph Partitioning Angen Zheng PowerPoint Presentation

SLIDE 1

Argo: Architecture-Aware Graph Partitioning

Angen Zheng, Alexandros Labrinidis, Panos K. Chrysanthis, and Jack Lange

Department of Computer Science, University of Pittsburgh

http://db.cs.pitt.edu/group/ http://www.prognosticlab.org/

SLIDE 2

Big Graphs Are Everywhere [SIGMOD’16 Tutorial]

SLIDE 3

A Balanced Partitioning = Even Load Distribution
Minimal Edge-Cut = Minimal Data Comm

[Figure labels: N1, N2, N3]

Assumption: Network is the bottleneck.
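The two objectives on this slide can be made concrete with a small sketch. The graph, the partition map, and the helper names `edge_cut` and `imbalance` below are hypothetical and only illustrate the definitions: edge-cut counts edges whose endpoints land in different partitions, and balance compares the largest partition with the ideal size |V| / k.

```python
from collections import Counter

def edge_cut(edges, part):
    """Count edges whose two endpoints are assigned to different partitions."""
    return sum(1 for u, v in edges if part[u] != part[v])

def imbalance(part, k):
    """Largest partition size divided by the ideal size |V| / k (1.0 is perfect)."""
    sizes = Counter(part.values())
    ideal = len(part) / k
    return max(sizes.values()) / ideal

# Hypothetical 6-vertex graph split over three nodes N1, N2, N3.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0), (0, 3)]
part = {0: "N1", 1: "N1", 2: "N2", 3: "N2", 4: "N3", 5: "N3"}

print("edge-cut:", edge_cut(edges, part))
print("imbalance:", imbalance(part, 3))
```

Under the slide's assumption that the network is the bottleneck, a partitioner would try to lower the first number while keeping the second close to 1.0.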

SLIDE 4

The End of Slow Networks: Network is now as fast as DRAM [C. Bing, VLDB’15]

✓ Dual-socket Xeon E5v2 server with
  ○ DDR3-1600
  ○ 2 FDR 4x NICs per socket
✓ Infiniband: 1.7GB/s~37.5GB/s
✓ DDR3: 6.25GB/s~16.6GB/s

SLIDE 5

The End of Slow Networks: Does edge-cut still matter?

SLIDE 6

Roadmap

✓ Introduction
✓ Does edge-cut still matter?
✓ Why does edge-cut still matter?
✓ Argo
✓ Evaluation
✓ Conclusions

SLIDE 7

The End of Slow Networks: Does edge-cut still matter?

Graph Partitioners: METIS and LDG
Graph Workloads: BFS, SSSP, and PageRank
Graph Dataset: Orkut (|V|=3M, |E|=234M)
Number of Partitions: 16 (one partition per core)

SLIDE 8

The End of Slow Networks: Does edge-cut still matter?

SSSP Execution Time (s)
m:s:c   METIS   LDG
1:2:8   633     2,632
2:2:4   654     2,565
4:2:2   521     631
8:2:1   222     280

Annotation: 9x

m: # of machines used
s: # of sockets used per machine
c: # of cores used per socket

✓ Denser configurations had longer execution time.
  ○ Contention on the memory subsystems impacted performance.
  ○ Network may not always be the bottleneck.
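As a quick check of the m:s:c notation, the sketch below multiplies out each configuration from the table. The helper name `total_cores` is hypothetical; the point is only that every row uses the same 16 cores (one partition per core) and differs in how densely those cores are packed onto machines.

```python
def total_cores(config):
    """Parse an 'm:s:c' string and return machines * sockets * cores per socket."""
    m, s, c = (int(x) for x in config.split(":"))
    return m * s * c

configs = ["1:2:8", "2:2:4", "4:2:2", "8:2:1"]
for cfg in configs:
    print(cfg, "->", total_cores(cfg), "cores")
```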

SLIDE 9

The End of Slow Networks: Does edge-cut still matter?

SSSP Execution Time (s)
m:s:c   METIS   LDG
1:2:8   633     2,632
2:2:4   654     2,565
4:2:2   521     631
8:2:1   222     280

SSSP LLC Misses (in Millions)
m:s:c   METIS    LDG
1:2:8   10,292   44,117
2:2:4   10,626   44,689
4:2:2   2,541    1,061
8:2:1   96       187

Annotations: 9x (execution time), 235x (LLC misses)

✓ Denser configurations had longer execution time.
  ○ Contention on the memory subsystems impacted performance.
  ○ Network may not always be the bottleneck.
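The 9x and 235x annotations appear to match the LDG column when the densest configuration (1:2:8) is compared with the sparsest (8:2:1). That reading is an inference from the numbers, not something the slide states. A short sketch of the arithmetic, using the table values and a hypothetical helper `dense_vs_sparse`:

```python
# Values copied from the two tables; keys are m:s:c configurations.
exec_time_s = {
    "1:2:8": {"METIS": 633, "LDG": 2632},
    "8:2:1": {"METIS": 222, "LDG": 280},
}
llc_misses_millions = {
    "1:2:8": {"METIS": 10292, "LDG": 44117},
    "8:2:1": {"METIS": 96, "LDG": 187},
}

def dense_vs_sparse(table, partitioner):
    """Ratio of the 1:2:8 value to the 8:2:1 value for one partitioner."""
    return table["1:2:8"][partitioner] / table["8:2:1"][partitioner]

for name in ("METIS", "LDG"):
    print(name,
          "time ratio: %.1f" % dense_vs_sparse(exec_time_s, name),
          "LLC miss ratio: %.1f" % dense_vs_sparse(llc_misses_millions, name))
```

For LDG the ratios come out at roughly 9.4 and 235.9, which is consistent with the annotations on the slide.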

SLIDE 10

The End of Slow Networks: Does edge-cut still matter?

SSSP Execution Time (s)
m:s:c   METIS   LDG
1:2:8   633     2,632
2:2:4   654     2,565
4:2:2   521     631
8:2:1   222     280

SSSP LLC Misses (in Millions)
m:s:c   METIS    LDG
1:2:8   10,292   44,117
2:2:4   10,626   44,689
4:2:2   2,541    1,061
8:2:1   96       187

Annotations: 9x (execution time), 235x (LLC misses)

✓ Denser configurations had longer execution time.
  ○ Contention on the memory subsystems impacted performance.
  ○ Network may not always be the bottleneck.

SLIDE 11

The End of Slow Networks: Does edge-cut still matter?

SSSP Execution Time (s)
m:s:c   METIS   LDG
1:2:8   633     2,632
2:2:4   654     2,565
4:2:2   521     631
8:2:1   222     280

SSSP LLC Misses (in Millions)
m:s:c   METIS    LDG
1:2:8   10,292   44,117
2:2:4   10,626   44,689
4:2:2   2,541    1,061
8:2:1   96       187

Annotations: 9x (execution time), 235x (LLC misses)

✓ Denser configurations had longer execution time.
  ○ Contention on the memory subsystems impacted performance.
  ○ Network may not always be the bottleneck.

The distribution of edge-cut matters.

SLIDE 12

The End of Slow Networks: Does edge-cut still matter?

SSSP Execution Time (s)
m:s:c   METIS   LDG
1:2:8   633     2,632
2:2:4   654     2,565
4:2:2   521     631
8:2:1   222     280

SSSP LLC Misses (in Millions)
m:s:c   METIS    LDG
1:2:8   10,292   44,117
2:2:4   10,626   44,689
4:2:2   2,541    1,061
8:2:1   96       187

Annotations: 9x (execution time), 235x (LLC misses)

✓ METIS had lower execution time and LLC misses than LDG.
  ○ Edge-cut matters.
  ○ Higher edge-cut --> higher comm --> higher contention

SLIDE 13

The End of Slow Networks: Does edge-cut still matter?

Yes! Both edge-cut and its distribution matter!

✓ Intra-Node and Inter-Node Data Communication
  ○ Have different performance impact on the memory subsystems of modern multicore machines.

SLIDE 14

Roadmap

✓ Introduction
✓ Does edge-cut still matter?
✓ Why does edge-cut still matter?
✓ Argo
✓ Evaluation
✓ Conclusions

SLIDE 15

Intra-Node Data Comm: Shared Memory

[Figure: Sending Core, Send Buffer, Shared Buffer, Receive Buffer, Receiving Core; numbered steps: 1. Load, 2a. Load, 2b. Write, 3. Load, 4a. Load, 4b. Write]

Extra Memory Copy
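The extra memory copy can be illustrated with a toy model of the three buffers named in the figure. This is only a sketch: the slide does not say which buffer each numbered step touches, so the grouping of steps into two copies below is an assumption, and `shared_memory_transfer` is a hypothetical name.

```python
def shared_memory_transfer(message: bytes):
    """Move a message from a send buffer to a receive buffer via a shared buffer."""
    send_buf = bytearray(message)
    shared_buf = bytearray(len(message))
    recv_buf = bytearray(len(message))
    copies = 0

    # Sending core (assumed): load the message and write it into the shared buffer.
    shared_buf[:] = send_buf
    copies += 1

    # Receiving core (assumed): load from the shared buffer and write the receive buffer.
    recv_buf[:] = shared_buf
    copies += 1

    return bytes(recv_buf), copies

received, copies = shared_memory_transfer(b"distance update")
print(received, "delivered with", copies, "memory copies")
```

Every message crosses memory twice in this model, which is the kind of traffic that could compete for LLC space and memory bandwidth when many cores on one node exchange messages.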

SLIDE 16

Intra-Node Data Comm: Shared Memory

[Figure: Cached Send/Shared/Receive Buffer]

LLC and Memory Bandwidth Contention
Cache Pollution

SLIDE 17

Intra-Node Data Comm: Shared Memory

[Figure: Cached Send/Shared Buffer; Cached Receive/Shared Buffer]

LLC and Memory Bandwidth Contention
Cache Pollution

SLIDE 18

Excess intra-node data communication may hurt performance.

SLIDE 19

Inter-Node Data Comm: RDMA Read/Write

[Figure: Node#1 (Send Buffer, Sending Core, IB HCA); Node#2 (Receive Buffer, Sending Core, IB HCA)]

No Extra Memory Copy and Cache Pollution

SLIDE 20

Offloading excess intra-node data comm across nodes may achieve better performance.

SLIDE 21

Roadmap

✓ Introduction
✓ Does edge-cut still matter?
✓ Why does edge-cut still matter?
✓ Argo
✓ Evaluation
✓ Conclusions

SLIDE 22

Argo: Graph Partitioning Model

[Figure labels: Vertex Stream, Partitioner, ...]

Streaming Graph Partitioning Model [I. Stanton, KDD’12]
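For readers unfamiliar with the streaming model, the sketch below shows the linear deterministic greedy (LDG) heuristic from the cited streaming work, used later in the deck as a baseline. Vertices arrive one at a time with their neighbor lists and are placed immediately. The function name, the tie-breaking rule, and the toy graph are assumptions made for illustration, not Argo's own placement rule.

```python
def ldg_partition(vertex_stream, k, capacity):
    """Assign each streamed (vertex, neighbors) pair to one of k partitions.

    Score for partition i: (# neighbors already in i) * (1 - size_i / capacity).
    Assumes k * capacity is at least the number of streamed vertices.
    """
    assignment = {}
    sizes = [0] * k
    for v, neighbors in vertex_stream:
        best, best_score = None, float("-inf")
        for i in range(k):
            if sizes[i] >= capacity:
                continue  # partition is full
            placed = sum(1 for u in neighbors if assignment.get(u) == i)
            score = placed * (1.0 - sizes[i] / capacity)
            # Ties go to the less loaded partition (an illustrative choice).
            if score > best_score or (score == best_score and sizes[i] < sizes[best]):
                best, best_score = i, score
        assignment[v] = best
        sizes[best] += 1
    return assignment

# Two triangles streamed in vertex order.
stream = [(0, [1, 2]), (1, [0, 2]), (2, [0, 1]),
          (3, [4, 5]), (4, [3, 5]), (5, [3, 4])]
print(ldg_partition(stream, k=2, capacity=3))
```

Each vertex is seen once and never moved, which is what keeps streaming partitioners cheap compared with multi-level ones.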

SLIDE 23

Argo: Architecture-Aware Vertex Placement

Place vertex, v, to a partition, Pi, that maximizes:

[Formula not recovered; annotations: "Weighted Edge-cut" and "Penalize the placement based on the load of Pi"]

✓ Weighted by the relative network comm cost, Argo will
  ○ avoid edge-cut across nodes (inter-node data comm).

Great for cases where the network is the bottleneck.
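The slide's formula did not survive extraction, so the following is only one hypothetical way to combine the two annotated ingredients: a cost-weighted edge-cut term and a load penalty. The scoring function, the `cost` matrix, and the name `place_vertex` are all assumptions and should not be read as Argo's actual objective.

```python
def place_vertex(neighbors, assignment, sizes, capacity, cost):
    """Pick the partition with the highest cost-aware affinity, penalized by load.

    cost[i][j] is an assumed relative communication cost between partitions i and j
    (0 for i == j, small within a node, large across nodes).
    """
    max_cost = max(max(row) for row in cost)
    best, best_score = None, float("-inf")
    for i in range(len(sizes)):
        if sizes[i] >= capacity:
            continue  # partition is full
        # Neighbors that are cheap to reach from partition i contribute more.
        affinity = sum(max_cost - cost[i][assignment[u]]
                       for u in neighbors if u in assignment)
        score = affinity * (1.0 - sizes[i] / capacity)
        if score > best_score:
            best, best_score = i, score
    return best

# Partitions 0 and 1 share a node; partitions 2 and 3 share another node.
cost = [[0, 1, 10, 10],
        [1, 0, 10, 10],
        [10, 10, 0, 1],
        [10, 10, 1, 0]]
assignment = {"a": 0, "b": 0, "c": 0}   # all placed neighbors live in partition 0
sizes = [4, 1, 1, 1]                    # partition 0 is already full
print(place_vertex(["a", "b", "c"], assignment, sizes, capacity=4, cost=cost))
```

In this toy case the vertex goes to partition 1, the one on the same node as its neighbors, so the cut edges stay off the network. A cost-blind score would see no difference between partitions 1, 2, and 3.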

SLIDE 24

Argo: Architecture-Aware Vertex Placement

[Formula not recovered; annotated terms: Refined Intra-Node Network Comm Cost, Original Intra-Node Network Comm Cost, Maximal Inter-Node Network Comm Cost, Degree of Contention (𝞵 ∈ [0, 1])]

✓ Weighted by the refined relative network comm cost, Argo will
  ○ avoid edge-cut across cores of the same node (intra-node data comm).

Bottleneck: Network (𝞵=0) ... Memory (𝞵=1)
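One form consistent with the annotated terms and with the two endpoints (𝞵=0 for a network bottleneck, 𝞵=1 for a memory bottleneck) is a linear interpolation between the original intra-node cost and the maximal inter-node cost. The slide's exact formula is not recoverable, so treat this as an assumed sketch; `refine_intra_node_cost` is a hypothetical name.

```python
def refine_intra_node_cost(original_intra, max_inter, mu):
    """Interpolate the intra-node cost toward the maximal inter-node cost.

    mu = 0 keeps the original cost (network is the bottleneck);
    mu = 1 makes intra-node communication as costly as the worst inter-node link.
    """
    if not 0.0 <= mu <= 1.0:
        raise ValueError("mu must lie in [0, 1]")
    return original_intra + mu * (max_inter - original_intra)

for mu in (0.0, 0.5, 1.0):
    print(mu, refine_intra_node_cost(original_intra=1.0, max_inter=10.0, mu=mu))
```

Raising 𝞵 makes cut edges between cores of the same node look more expensive, which pushes a cost-weighted placement rule to move that communication across nodes instead.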

SLIDE 25

Roadmap

✓ Introduction
✓ Does edge-cut still matter?
✓ Why does edge-cut still matter?
✓ Argo
✓ Evaluation
✓ Conclusions

SLIDE 26

Evaluation: Workloads & Datasets

✓ Three Classic Graph Workloads
  ○ Breadth First Search (BFS)
  ○ Single Source Shortest Path (SSSP)
  ○ PageRank
✓ Three Real-World Large Graphs

Dataset      |V|    |E|
Orkut        3M     234M
Friendster   124M   3.6B
Twitter      52M    3.9B

SLIDE 27

Evaluation: Platform

Cluster Configuration
# of Nodes: 32
Network Topology: FDR Infiniband (Single Switch)
Network Bandwidth: 56Gbps

Compute Node Configuration
# of Sockets: 2 Intel Haswell (10 cores / socket)
L3 Cache: 25MB

SLIDE 28

Evaluation: Partitioners

✓ METIS: the most well-known multi-level partitioner.
✓ LDG: the most well-known streaming partitioner.
✓ ARGO-H: assumes the network is the bottleneck.
  ○ Weights edge-cut by the original network comm costs.
✓ ARGO: assumes memory is the bottleneck.
  ○ Weights edge-cut by the refined network comm costs.

SLIDE 29

Evaluation: SSSP Exec. Time on Orkut dataset

★ Orkut: |V| = 3M, |E| = 234M
★ 60 Partitions: three 20-core machines

[Figure: SSSP execution time vs. Message Grouping Size (group multiple msgs by a single SSSP process to the same destination into one msg); bar annotations as extracted: 2x 2x 3x 1x 4x 2x 5x 1x 2x 1.4x 3x 1x]

✓ ARGO had the lowest SSSP execution time.
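The message grouping size on the x-axis refers to batching messages that one SSSP process sends to the same destination. A minimal sketch of that idea follows. The function name, the message format, and the flush-at-end behavior are assumptions, since the deck does not describe the implementation.

```python
from collections import defaultdict

def group_messages(messages, group_size):
    """Batch (destination, payload) messages per destination.

    A batch is emitted as soon as a destination has group_size pending payloads;
    leftovers are flushed at the end.
    """
    pending = defaultdict(list)
    batches = []
    for dest, payload in messages:
        pending[dest].append(payload)
        if len(pending[dest]) == group_size:
            batches.append((dest, pending.pop(dest)))
    for dest, payloads in pending.items():
        batches.append((dest, payloads))
    return batches

msgs = [(1, "a"), (2, "b"), (1, "c"), (1, "d"), (2, "e")]
for dest, payloads in group_messages(msgs, group_size=2):
    print("to", dest, payloads)
```

Larger group sizes mean fewer, bigger transfers, which is presumably why the evaluation reports results at several grouping sizes.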

SLIDE 30

Evaluation: SSSP LLC Misses on Orkut dataset

★ Orkut: |V| = 3M, |E| = 234M
★ 60 Partitions: three 20-core machines

[Figure: SSSP LLC misses vs. Message Grouping Size; bar annotations as extracted: 50x 38x 9x 4x 3x 6x 1x 1x 1x 9x 1.2x 12x]

✓ ARGO had the lowest LLC Misses.

SLIDE 31

Evaluation: SSSP Comm Vol. on Orkut dataset

★ Orkut: |V| = 3M, |E| = 234M
★ 60 Partitions: three 20-core machines

[Figure: SSSP communication volume; labels as extracted: 64; Intra-Socket: METIS 69%, LDG 49%, ARGO-H 70%]

✓ ARGO had the lowest intra-node communication volume.
✓ Distribution of the edge-cut also matters.

SLIDE 32

Evaluation: SSSP Exec. Time vs Graph Size

★ Twitter: |V| = 52M, |E| = 3.9B
★ 80 Partitions: four 20-core machines
★ Message Grouping Size: 512

✓ ARGO had the lowest SSSP execution time.
✓ Up to 6x improvement against ARGO-H.
✓ Improvement became larger as the graph size increased.

SLIDE 33

Evaluation: SSSP Exec. Time vs # of Partitions

★ Twitter: |V| = 52M, |E| = 3.9B
★ 80~200 Partitions: four up to ten 20-core machines
★ Message Grouping Size: 512

✓ ARGO always outperformed LDG and ARGO-H.
✓ Up to 11x improvement against ARGO-H.

SLIDE 34

Evaluation: SSSP Exec. Time vs # of Partitions

★ Twitter: |V| = 52M, |E| = 3.9B
★ 80~200 Partitions: four up to ten 20-core machines
★ Message Grouping Size: 512

[Figure annotations: * 160 = 13h; * 180 = 6h]

✓ Hours of CPU time saved.

SLIDE 35

Evaluation: Partitioning Overhead

★ Twitter: |V| = 52M, |E| = 3.9B
★ 80~200 Partitions: four up to ten 20-core machines

[Figure: Partitioning Time as a Percentage of the CPU Time Saved (SSSP Execution) vs. # of Partitions]

✓ ARGO is indeed slower than LDG.
✓ The overhead was negligible in comparison to the CPU time saved.
✓ Graph analytics usually have much longer execution time.

SLIDE 36

Conclusions

✓ Findings
  ○ Network is not always the bottleneck.
  ○ Contention on memory subsystems may impact the performance a lot
    due to excess intra-node data comm.
  ○ Both edge-cut and its distribution matter.

✓ ARGO
  ○ Avoids contention by offloading excess intra-node data comm across nodes.
  ○ Achieves up to 11x improvement on real-world workloads.
  ○ Scales well in terms of both graph size and number of partitions.

Acknowledgments:
✓ Peyman Givi
✓ Patrick Pisciuneri

Funding:
✓ NSF CBET-1609120
✓ NSF CBET-1250171
✓ BigData’16 Student Travel Award

Thanks!
