NUMA-aware Graph-structured Analytics

Kaiyuan Zhang, Rong Chen, Haibo Chen. Institute of Parallel and Distributed Systems, Shanghai Jiao Tong University, China


  1. NUMA-aware Graph-structured Analytics
     Kaiyuan Zhang, Rong Chen, Haibo Chen
     Institute of Parallel and Distributed Systems, Shanghai Jiao Tong University, China

  2. Big Data Everywhere
     □ 100 Hrs of Video every minute
     □ 1.11 Billion Users
     □ 6 Billion Photos
     □ 400 Million Tweets/day
     How do we understand and use Big Data?

  3. Data Analytics
     □ 100 Hrs of Video every minute
     □ 1.11 Billion Users
     □ 6 Billion Photos
     □ 400 Million Tweets/day
     Machine Learning and Data Mining, NLP

  4. It’s about the graphs ...

  5. NUMA & Graph-analytics
     Application: NLP
     Hardware:
       Processor | Single  | Multi-Core | Multi-Socket
       Memory    | Unified | NUCA       | NUMA
     Now: e.g. 80 cores with 1 TB RAM = 8 sockets × (10 cores with 128 GB local RAM)

  6. How about Graph-analytics on NUMA systems?

  7. Contribution
     Polymer: NUMA-aware Graph-structured Analytics
     □ A comprehensive analysis that uncovers issues in running graph-analytics systems on NUMA platforms
     □ A new system that exploits both a NUMA-aware data layout and NUMA-aware memory access strategies
     □ Three optimizations for global synchronization efficiency, load balance, and data structure flexibility
     □ A detailed evaluation that demonstrates the performance and scalability benefits

  8. Outline: Background & Issues; Design of Polymer; Evaluation

  10. Example: PageRank
      A centrality analysis algorithm to measure the relative rank for each element of a linked set
      Characteristics
      □ Linked set → data dependence
      □ Rank of who links it → predictable accesses
      □ Convergence → iterative computation
      [Figure: example graph with vertices 1-5]
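
As a rough illustration of the iterative computation described on this slide, here is a minimal PageRank sketch in Python. The damping factor of 0.85, the iteration count, the function name `pagerank`, and the five-vertex graph are assumptions made for illustration; none of them come from the slides.

```python
def pagerank(out_edges, damping=0.85, iterations=50):
    """Iterative PageRank over a dict {vertex: [out-neighbours]}."""
    n = len(out_edges)
    rank = {v: 1.0 / n for v in out_edges}
    for _ in range(iterations):
        # Every vertex starts the round with the "teleport" share.
        nxt = {v: (1.0 - damping) / n for v in out_edges}
        for v, targets in out_edges.items():
            if not targets:
                continue  # dangling vertices are simply skipped in this sketch
            share = damping * rank[v] / len(targets)
            for t in targets:
                nxt[t] += share  # the rank of t depends on who links to it
        rank = nxt
    return rank


# Hypothetical 5-vertex graph (not the one drawn on the slide).
graph = {1: [2, 3], 2: [3], 3: [1], 4: [3, 5], 5: [4]}
ranks = pagerank(graph)
```

Each round reads the previous ranks and writes a fresh set, which is the data dependence and iterative structure the slide lists.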

  11. Graph-analytics
      The scatter-gather model
      □ “scatter”: propagate the current value of a vertex to its neighbors along edges
      □ “gather”: accumulate values from neighbors to compute the next value of a vertex
      In-memory data structure
      □ Graph Topology (TOPO): vertex = 1 2 3 4 5 6; edge = 2 3 3 5 2 5 6 1 3 5 1 2 3 6 2
      □ Application-specific Data (DATA): curr = D1 D2 D3 D4 D5 D6; next
      □ Runtime State (STAT): curr = 1 0 1 1 1 0; next
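
The three in-memory structures on this slide can be sketched as plain arrays. The following Python sketch assumes a compressed sparse row (CSR) layout for TOPO, a sum as the gather operation, and a hypothetical four-vertex graph with 0-based ids; the function name `scatter_gather_step` is likewise illustrative, not taken from any system named in the deck.

```python
# TOPO: CSR arrays for a hypothetical 4-vertex graph (0-based ids).
offsets = [0, 2, 3, 4, 4]  # out-edges of v are edges[offsets[v]:offsets[v + 1]]
edges = [1, 2, 2, 0]       # target vertex ids


def scatter_gather_step(offsets, edges, data_curr, stat_curr):
    """One iteration: active vertices scatter, targets gather by summing."""
    n = len(offsets) - 1
    data_next = [0.0] * n  # DATA/next
    stat_next = [0] * n    # STAT/next
    for v in range(n):     # sequential scan over the vertices
        if not stat_curr[v]:
            continue       # inactive vertices do not scatter
        for e in range(offsets[v], offsets[v + 1]):
            t = edges[e]
            data_next[t] += data_curr[v]  # random write to a neighbour
            stat_next[t] = 1              # the neighbour is active next round
    return data_next, stat_next


data_curr = [1.0, 2.0, 3.0, 4.0]  # DATA/curr
stat_curr = [1, 0, 1, 1]          # STAT/curr
data_next, stat_next = scatter_gather_step(offsets, edges, data_curr, stat_curr)
```

Keeping separate `curr` and `next` arrays lets every vertex read a consistent snapshot while the next values are being accumulated.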

  12. Vertex-centric (e.g. Ligra)
      [Figure: example graph, vertices 1-6]
      Data flow: STAT/curr, TOPO/vertex, TOPO/out-edge, DATA/curr, DATA/next, STAT/next

  13. Vertex-centric (e.g. Ligra)
      Data flow: STAT/curr, TOPO/vertex, TOPO/out-edge, DATA/curr, DATA/next, STAT/next
      Access patterns: DATA/curr: SEQ R; DATA/next: RND W

  14. Edge-centric (e.g. X-Stream)
      [Figure: example graph, vertices 1-6, split into partitions]
      Data flow: TOPO/edge, STAT/curr, DATA/curr → DATA/Uout → shuffle phase → DATA/Uin → DATA/next, STAT/next

  15. Edge-centric (e.g. X-Stream)
      Access patterns along the data flow:
      □ DATA/curr: RND R
      □ DATA/Uout: SEQ W, then SEQ R in the shuffle phase
      □ DATA/Uin: SEQ W in the shuffle phase, then SEQ R
      □ DATA/next: RND W
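
The scatter, shuffle, and gather phases of the edge-centric model can be sketched in a few lines of Python. This is a simplified single-threaded illustration, not X-Stream's actual implementation: the function name `edge_centric_step`, the contiguous partitioning rule, the sum used for gathering, and the sample graph are all assumptions, and the STAT arrays are left out for brevity.

```python
def edge_centric_step(edge_list, data_curr, num_partitions):
    """One scatter -> shuffle -> gather pass over an unordered edge list."""
    n = len(data_curr)

    def part(v):
        return v * num_partitions // n  # contiguous vertex ranges per partition

    # Scatter: stream the edges, read the source value (random read) and
    # append an update to the source partition's Uout buffer (sequential write).
    u_out = [[] for _ in range(num_partitions)]
    for src, dst in edge_list:
        u_out[part(src)].append((dst, data_curr[src]))

    # Shuffle: stream Uout and append each update to the Uin buffer of the
    # partition that owns the target vertex.
    u_in = [[] for _ in range(num_partitions)]
    for buf in u_out:
        for dst, val in buf:
            u_in[part(dst)].append((dst, val))

    # Gather: stream Uin and apply each update to its target (random write).
    data_next = [0.0] * n
    for buf in u_in:
        for dst, val in buf:
            data_next[dst] += val
    return data_next


edge_list = [(0, 1), (0, 2), (1, 2), (2, 0)]  # hypothetical graph, 0-based ids
data_next = edge_centric_step(edge_list, [1.0, 2.0, 3.0, 4.0], num_partitions=2)
```

The update buffers are only ever appended to or streamed, which is where the sequential accesses on the slide come from; the random accesses are confined to the vertex data.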

  16. NUMA Characteristics
      A commodity NUMA machine
      □ Multiple processor nodes (i.e., sockets)
      □ Processor = multiple cores + a local DRAM
      □ A globally shared memory abstraction (cache coherence)
      □ Hallmark: Non-uniform memory access
      Measurements on an 80-core Intel Xeon machine:
        Latency (cycles)
          Inst.  | 0-hop | 1-hop | 2-hop
          Load   | 117   | 271   | 372
          Store  | 108   | 304   | 409
        Bandwidth (MB/s)
          Access | 0-hop | 1-hop | 2-hop | IL
          SEQ    | 3207  | 2455  | 2101  | 2333
          RAND   | 720   | 348   | 307   | 344

  17. NUMA Characteristics
      Observations from the latency and bandwidth measurements on the 80-core Intel Xeon machine:
      □ Sequential remote access is faster than random local access
      □ Random remote access is awesome!
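
The first observation can be checked directly against the bandwidth figures reported for the 80-core Intel Xeon machine. The short Python calculation below only restates the slide's numbers; the variable names are illustrative.

```python
# Bandwidth in MB/s, copied from the slide's table.
bandwidth_mb_s = {
    "SEQ": {"0-hop": 3207, "1-hop": 2455, "2-hop": 2101, "IL": 2333},
    "RAND": {"0-hop": 720, "1-hop": 348, "2-hop": 307, "IL": 344},
}

# Slowest sequential remote access vs. random local access.
seq_remote_vs_rand_local = (
    bandwidth_mb_s["SEQ"]["2-hop"] / bandwidth_mb_s["RAND"]["0-hop"]
)

# Sequential local access vs. the slowest random remote access.
seq_local_vs_rand_remote = (
    bandwidth_mb_s["SEQ"]["0-hop"] / bandwidth_mb_s["RAND"]["2-hop"]
)

print(f"SEQ 2-hop / RAND 0-hop = {seq_remote_vs_rand_local:.2f}x")
print(f"SEQ 0-hop / RAND 2-hop = {seq_local_vs_rand_remote:.2f}x")
```

Even two hops away, sequential access delivers close to three times the bandwidth of random access to local memory, while random access two hops away is more than ten times slower than sequential local access.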

  18. NUMA Characteristics
      The world we lived in: “first-touch” policy
      “binding virtual pages to physical frames locating on a memory node where a thread first touches the pages”
      Data layouts (CPU/MEM figure): Centralized, Interleaved, Associated
      The world we want to live in

  19. NUMA Characteristics
      The world we lived in: “first-touch” policy
      “binding virtual pages to physical frames locating on a memory node where a thread first touches the pages”
      Data layouts (CPU/MEM figure): Centralized, Interleaved, Associated
      □ Both centralized and interleaved data layouts hamper locality and parallelism
      □ The associated layout is the ideal one
      The world we want to live in
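
To make the locality argument concrete, the toy model below places pages under the three layouts and counts how many accesses stay on the local node. It is a back-of-the-envelope sketch, not a measurement: it assumes one worker per node, that each worker only touches its own contiguous range of pages, and that "centralized" means a single thread on node 0 touched every page first. The function names and the page and node counts are hypothetical.

```python
def page_homes(layout, num_pages, num_nodes):
    """Return, for every page, the memory node that holds it."""
    per_node = num_pages // num_nodes
    if layout == "centralized":   # one thread first-touched everything
        return [0] * num_pages
    if layout == "interleaved":   # pages spread round-robin over the nodes
        return [p % num_nodes for p in range(num_pages)]
    if layout == "associated":    # each worker first-touched its own range
        return [min(p // per_node, num_nodes - 1) for p in range(num_pages)]
    raise ValueError(f"unknown layout: {layout}")


def local_fraction(layout, num_pages=64, num_nodes=8):
    """Fraction of page accesses served by the accessing worker's own node."""
    per_node = num_pages // num_nodes
    homes = page_homes(layout, num_pages, num_nodes)
    local = sum(
        1
        for p, home in enumerate(homes)
        if home == min(p // per_node, num_nodes - 1)  # node of the worker using p
    )
    return local / num_pages


fractions = {
    name: local_fraction(name)
    for name in ("centralized", "interleaved", "associated")
}
```

In this model only the associated layout keeps every access local. The model captures locality only; the loss of parallelism when every worker hits the same node's memory under the centralized layout is not modeled.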

  20. NUMA Characteristics
      Lack of locality (access neighboring vertices)
      □ It is inevitable to access remote memory
      □ Random access is always there
      How to mix SEQ/RND accesses with Local/Global memory?
      [Figure: example graph, vertices 1-6, and updates to the vertex array 1-6]

  21. Access Strategy on NUMA
      Legend: SEQ/RND = sequential/random; R/W = read/write; L/G = local/global
      Vertex-centric Model
      □ Completely overlooked (e.g. Ligra)
      Accesses across nodes N0 and N1: SEQ R L, RND W G

  22. Access Strategy on NUMA
      Legend: SEQ/RND = sequential/random; R/W = read/write; L/G = local/global
      Edge-centric Model
      □ Inefficient way (e.g. X-Stream)
      Accesses across nodes N0 and N1: RND R L → SEQ W L → shuffle phase (SEQ R L, SEQ W G) → SEQ R L → RND W L

  23. Scalability & Performance on NUMA
      [Figure: four plots comparing Ligra, X-Stream, and Galois]
      □ Scalability: normalized speedup vs. #cores (1-10), Intel 80-core (8S×10C)
      □ Scalability: normalized speedup vs. #sockets (1-8), Intel 80-core (8S×10C)
      □ Performance: runtime (sec) vs. #sockets (1-8), Intel 80-core (8S×10C)
      □ Scalability: normalized speedup vs. #sockets (1-8), AMD 64-core (8S×8C)
      Plot annotations: 8C: 6.92X; 10C: 6.19X; 8S: 4.58X; 8S: 2.90X; 1S: 132s; 8S: 29s; 1S: 33s; 8S: 12s; 8 Socket: LG 2.9X, XS 1.4X; “worse!”

  24. Scalability & Performance on NUMA
      [Figure: scalability and performance plots for Ligra, X-Stream, and Galois, as on slide 23]
      Minimize remote & random accesses + Eliminate the combination of them

  25. Outline: Background & Issues; Design of Polymer; Evaluation

  26. Goal #1: Reduce remote accesses
      Co-locating data and computation within the same NUMA node

  27. Graph-aware Data Layout
      Co-locating data and computation within the same NUMA node
      1. Graph-aware partitioning
      □ TOPO: vertex = 1 2 3 4 5 6; edge = 2 3 3 5 2 5 6 1 3 5 1 2 3 6 2
      □ DATA: D1 D2 D3 D4 D5 D6
      □ STAT: 1 0 1 1 1 0

  28. Graph-aware Data Layout
      Co-locating data and computation within the same NUMA node
      1. Graph-aware partitioning: intuitive split across N0 and N1
      □ TOPO: vertex = 1 2 3 4 5 6; edge = 2 3 3 5 2 5 6 1 3 5 1 2 3 6 2
      □ DATA: D1 D2 D3 D4 D5 D6
      □ STAT: 1 0 1 1 1 0

  29. Graph-aware Data Layout
      Co-locating data and computation within the same NUMA node
      1. Graph-aware partitioning: sophisticated split across N0 and N1, with the edge array regrouped
      □ TOPO: vertex = 1 2 3 4 5 6; edge = 2 3 3 2 1 3 1 2 3 2 5 5 6 5 6
      □ DATA: D1 D2 D3 D4 D5 D6
      □ STAT: 1 0 1 1 1 0

  30. Graph-aware Data Layout
      Co-locating data and computation within the same NUMA node
      1. Graph-aware partitioning: sophisticated split across N0 and N1, with agents
      □ TOPO: vertex = 1 2 3 4 5 6 on N0 and 1 2 3 4 5 6 on N1 (including agents); edge = 2 3 3 2 1 3 1 2 3 2 5 5 6 5 6
      □ DATA: D1 D2 D3 D4 D5 D6
      □ STAT: 1 0 1 1 1 0

  31. Graph-aware Data Layout
      Co-locating data and computation within the same NUMA node
      1. Graph-aware partitioning: sophisticated split across N0 and N1, with agents
      2. NUMA-aware allocation (Virt-Memory → Phys-Memory)
           | TOPO     | DATA       | STAT
         1 | seq      | seq/rnd    | seq/rnd
         2 | local    | global     | global
         3 | long     | long       | short
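
One way to read the "sophisticated" partitioning is that each node keeps the edges whose target vertex it owns, plus a lightweight agent for every source vertex that lives on another node. The Python sketch below follows that reading; it is an assumption about the scheme rather than Polymer's actual code, and the function name, the contiguous ownership rule, and the six-vertex graph (0-based ids) are hypothetical.

```python
def partition_by_target(edge_list, num_vertices, num_nodes):
    """Group edges by the node that owns their target vertex."""
    per_node = -(-num_vertices // num_nodes)  # ceiling division

    def owner(v):
        return v // per_node  # contiguous vertex ranges per node

    parts = [{"edges": [], "agents": set()} for _ in range(num_nodes)]
    for src, dst in edge_list:
        node = owner(dst)
        parts[node]["edges"].append((src, dst))
        if owner(src) != node:
            # The source lives elsewhere: keep a local stand-in (agent) for it.
            parts[node]["agents"].add(src)
    return parts


# Hypothetical 6-vertex graph; vertices 0-2 belong to N0 and 3-5 to N1.
edge_list = [(0, 1), (0, 2), (1, 2), (1, 4), (2, 1),
             (3, 4), (3, 5), (4, 0), (5, 2)]
parts = partition_by_target(edge_list, num_vertices=6, num_nodes=2)
```

With this grouping, every update produced while a node walks its own edge list lands on a vertex that the same node owns.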

  32. Goal #2: Eliminate “random + remote”
      Random remote access → access neighboring vertices on other nodes
      Distribute the computations on a single vertex over multiple nodes

  33. Goal #2: Eliminate “random + remote”
      Random remote access → access neighboring vertices on other nodes
      Distribute the computations on a single vertex over multiple nodes
      □ Each node handles all edges of partial vertices
      □ Each node handles partial edges of all vertices
      Access patterns labeled in the figure: RND L, SEQ G; RND G, SEQ L

  34. NUMA-aware Access Strategy
      Distribute the computations on a single vertex over multiple NUMA nodes (N0, N1)
      Data flow: STAT/curr, TOPO/vertex, TOPO/out-edge, DATA/curr, DATA/next, STAT/next

  35. NUMA-aware Access Strategy
      Distribute the computations on a single vertex over multiple NUMA nodes (N0, N1)
      □ SEQ R L on DATA/curr, RND W G on DATA/next
      □ SEQ R G on DATA/curr, RND W L on DATA/next
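
The two access patterns on this slide can be compared by counting which accesses cross a node boundary. The sketch below is one reading of the slide, not Polymer's implementation: under `"by_source"` a node processes the out-edges of the vertices it owns, and under `"by_target"` it processes the edges whose target it owns. The function name, the strategy labels, and the sample graph (0-based ids) are hypothetical.

```python
def count_remote_accesses(edge_list, num_vertices, num_nodes, strategy):
    """Count reads of DATA/curr and writes to DATA/next that leave the node."""
    per_node = -(-num_vertices // num_nodes)  # ceiling division

    def owner(v):
        return v // per_node

    counts = {"remote_reads": 0, "remote_writes": 0}
    for src, dst in edge_list:
        # Which node processes this edge under the chosen strategy?
        worker = owner(src) if strategy == "by_source" else owner(dst)
        if owner(src) != worker:
            counts["remote_reads"] += 1   # DATA/curr[src] is on another node
        if owner(dst) != worker:
            counts["remote_writes"] += 1  # DATA/next[dst] is on another node
    return counts


edge_list = [(0, 1), (0, 2), (1, 2), (1, 4), (2, 1),
             (3, 4), (3, 5), (4, 0), (5, 2)]
by_source = count_remote_accesses(edge_list, 6, 2, "by_source")
by_target = count_remote_accesses(edge_list, 6, 2, "by_target")
```

Both strategies cross the node boundary the same number of times here, but grouping by target moves the crossing from the random writes to the reads, which can be issued in source order if each node's edges are sorted by source.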

  36. Optimizations
      1. Rolling update
      2. Hierarchical and efficient barrier
      3. Adaptive data structure

  37. Outline: Background & Issues; Design of Polymer; Evaluation
