NUMA-aware Graph-structured Analytics
Kaiyuan Zhang, Rong Chen, Haibo Chen
Institute of Parallel and Distributed Systems, Shanghai Jiao Tong University, China

Big Data Everywhere
□ 100 hours of video every minute
□ 1.11 billion users
□ 6 billion photos
□ 400 million tweets/day
How do we understand and use Big Data?

Data Analytics
100 hours of video every minute, 1.11 billion users, 6 billion photos, 400 million tweets/day → Machine Learning and Data Mining, NLP

It’s about the graphs ...

NUMA & Graph-analytics
Application / Hardware
□ Processor: Single → Multi-Core → Multi-Socket
□ Memory: Unified → NUCA → NUMA
Now: e.g. 80 cores with 1 TB RAM = 8 sockets × (10 cores with 128 GB local RAM)

How about graph analytics on NUMA systems?

Contribution
Polymer: NUMA-aware Graph-structured Analytics
□ A comprehensive analysis that uncovers issues with running graph analytics systems on NUMA platforms
□ A new system that exploits both NUMA-aware data layout and memory access strategies
□ Three optimizations for global synchronization efficiency, load balance and data structure flexibility
□ A detailed evaluation that demonstrates the performance and scalability benefits

Outline
□ Background & Issues
□ Design of Polymer
□ Evaluation

Example: PageRank
A centrality analysis algorithm to measure the relative rank of each element of a linked set
Characteristics
□ Linked set → data dependence
□ Rank of who links it → predictable accesses
□ Convergence → iterative computation
[Figure: example graph with vertices 1-5]
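The three characteristics above can be seen in a minimal PageRank sketch. This is an illustration, not the implementation from the talk: the function name `pagerank`, the damping factor, the tolerance and the 5-vertex graph are all assumptions introduced here.

```python
def pagerank(out_edges, damping=0.85, tol=1e-10, max_iters=1000):
    """Iterative PageRank on a dict {vertex: [out-neighbors]}.

    damping, tol and max_iters are illustrative defaults, not values
    taken from the slides.
    """
    vertices = sorted(out_edges)
    n = len(vertices)
    rank = {v: 1.0 / n for v in vertices}
    for _ in range(max_iters):
        # Rank held by vertices without out-edges is spread evenly.
        dangling = sum(rank[v] for v in vertices if not out_edges[v])
        nxt = {v: (1.0 - damping) / n + damping * dangling / n for v in vertices}
        for u in vertices:
            for v in out_edges[u]:
                # Data dependence: v's next rank depends on who links to it.
                nxt[v] += damping * rank[u] / len(out_edges[u])
        delta = sum(abs(nxt[v] - rank[v]) for v in vertices)
        rank = nxt
        # Convergence test drives the iterative computation.
        if delta < tol:
            break
    return rank


# Hypothetical 5-vertex graph (not necessarily the one drawn on the slide).
graph = {1: [2, 3], 2: [3], 3: [1], 4: [3, 5], 5: [4]}
ranks = pagerank(graph)
```

In this hypothetical graph vertex 3 has the most in-links and ends up with the highest rank, and the ranks sum to 1 after every iteration.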

Graph-analytics
The scatter-gather model
□ "scatter": propagate the current value of a vertex to its neighbors along edges
□ "gather": accumulate values from neighbors to compute the next value of a vertex
In-memory data structure
□ Graph Topology (TOPO): vertex 1 2 3 4 5 6; edge 2 3 3 5 2 5 6 1 3 5 1 2 3 6 2
□ Application-specific Data (DATA): curr D1 D2 D3 D4 D5 D6; next
□ Runtime State (STAT): curr 1 0 1 1 1 0; next
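A minimal sketch of the scatter-gather loop over the three kinds of arrays named above (TOPO, DATA curr/next, STAT curr/next). The application chosen here, propagating the minimum label along out-edges, is an assumption for illustration, as are the function name, the 0-based vertex ids and the small graph.

```python
def scatter_gather_min_label(offsets, edges):
    """Propagate the minimum label along out-edges until no vertex changes.

    TOPO = (offsets, edges) in CSR form, DATA = curr/next labels,
    STAT = curr/next active flags. Vertex ids are 0-based here.
    """
    n = len(offsets) - 1
    data_curr = list(range(n))      # DATA/curr: each vertex starts with its own id
    stat_curr = [1] * n             # STAT/curr: every vertex is active at first
    while any(stat_curr):
        data_next = list(data_curr)  # DATA/next
        stat_next = [0] * n          # STAT/next
        for u in range(n):
            if not stat_curr[u]:
                continue
            # scatter: push u's current value along its out-edges
            for v in edges[offsets[u]:offsets[u + 1]]:
                # gather: v accumulates (here: takes the min of) incoming values
                if data_curr[u] < data_next[v]:
                    data_next[v] = data_curr[u]
                    stat_next[v] = 1
        data_curr, stat_curr = data_next, stat_next
    return data_curr


# Hypothetical graph: 0->1, 1->2, 2->0, 3->4
offsets = [0, 1, 2, 3, 4, 4]
edges = [1, 2, 0, 4]
labels = scatter_gather_min_label(offsets, edges)
```

Only vertices flagged in STAT/curr scatter in a given iteration, which is why the runtime state is kept separately from the application data.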

Vertex-centric (e.g. Ligra)
[Figure: example graph with vertices 1-6]
Arrays: STAT/curr, TOPO/vertex, TOPO/out-edge, DATA/curr, DATA/next, STAT/next
Access pattern: DATA/curr (SEQ R) → DATA/next (RND W)

Edge-centric (e.g. X-Stream)
[Figure: example graph with vertices 1-6, split into partitions]
Arrays: TOPO/edge, STAT/curr, DATA/curr, DATA/Uout, DATA/Uin, DATA/next, STAT/next
Access pattern: DATA/curr (RND R) → DATA/Uout (SEQ W); shuffle phase: DATA/Uout (SEQ R) → DATA/Uin (SEQ W); DATA/Uin (SEQ R) → DATA/next (RND W)
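The scatter, shuffle and gather phases of the edge-centric model can be sketched as below. This is a simplified illustration and not X-Stream's code: the function name, the contiguous-range partitioning rule and the "sum of incoming values" update are assumptions made for the example.

```python
def edge_centric_step(edge_list, data_curr, num_partitions):
    """One scatter-shuffle-gather step that sums incoming values per vertex."""
    n = len(data_curr)

    def part_of(v):
        # Assumed partitioning: contiguous vertex ranges of roughly equal size.
        return v * num_partitions // n

    # TOPO/edge: the edge list is partitioned by source vertex.
    part_edges = [[] for _ in range(num_partitions)]
    for src, dst in edge_list:
        part_edges[part_of(src)].append((src, dst))

    # Scatter: stream edges in order, read DATA/curr at the source (random read),
    # append updates to the partition's Uout buffer (sequential write).
    u_out = [[(dst, data_curr[src]) for src, dst in part_edges[p]]
             for p in range(num_partitions)]

    # Shuffle: read Uout in order and route each update to the Uin buffer
    # of the partition that owns the destination vertex.
    u_in = [[] for _ in range(num_partitions)]
    for updates in u_out:
        for dst, val in updates:
            u_in[part_of(dst)].append((dst, val))

    # Gather: read Uin in order, apply to DATA/next (random write).
    data_next = [0] * n
    for updates in u_in:
        for dst, val in updates:
            data_next[dst] += val
    return data_next


# Hypothetical 4-vertex graph and values.
example_edges = [(0, 1), (0, 2), (1, 2), (2, 0), (3, 0)]
result = edge_centric_step(example_edges, [1, 2, 3, 4], num_partitions=2)
```

The buffers are only ever appended to and read front to back, which is where the sequential accesses come from; the random accesses are confined to the vertex data of one partition at a time.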

NUMA Characteristics
A commodity NUMA machine
□ Multiple processor nodes (i.e., sockets)
□ Processor = multiple cores + a local DRAM
□ A globally shared memory abstraction (cache-coherent)
□ Hallmark: Non-uniform memory access
Latency (cycles), 80-core Intel Xeon machine
Inst.   0-hop  1-hop  2-hop
Load    117    271    372
Store   108    304    409
Bandwidth (MB/s), 80-core Intel Xeon machine
Access  0-hop  1-hop  2-hop  IL
SEQ     3207   2455   2101   2333
RAND    720    348    307    344
Sequential remote access is faster than random local access & random remote access is awful!
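The two observations can be checked with simple arithmetic on the measured numbers. The snippet below only restates the table values; the variable names are introduced here for illustration.

```python
# Values copied from the latency and bandwidth tables (80-core Intel Xeon machine).
latency_cycles = {
    "Load": {"0-hop": 117, "1-hop": 271, "2-hop": 372},
    "Store": {"0-hop": 108, "1-hop": 304, "2-hop": 409},
}
bandwidth_mbs = {
    "SEQ": {"0-hop": 3207, "1-hop": 2455, "2-hop": 2101, "IL": 2333},
    "RAND": {"0-hop": 720, "1-hop": 348, "2-hop": 307, "IL": 344},
}

# Even the slowest sequential remote access (2-hop) beats random local access.
seq_remote_vs_rand_local = bandwidth_mbs["SEQ"]["2-hop"] / bandwidth_mbs["RAND"]["0-hop"]

# Random access loses more than half of its bandwidth once it goes remote.
rand_local_vs_rand_remote = bandwidth_mbs["RAND"]["0-hop"] / bandwidth_mbs["RAND"]["2-hop"]

# A 2-hop load costs roughly three times the cycles of a local load.
load_remote_vs_local = latency_cycles["Load"]["2-hop"] / latency_cycles["Load"]["0-hop"]

print(f"SEQ 2-hop / RAND local bandwidth: {seq_remote_vs_rand_local:.2f}x")
print(f"RAND local / RAND 2-hop bandwidth: {rand_local_vs_rand_remote:.2f}x")
print(f"Load 2-hop / local latency: {load_remote_vs_local:.2f}x")
```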

NUMA Characteristics
The world we live in: "first-touch" policy
"binding virtual pages to physical frames located on the memory node where a thread first touches the pages"
[Figure: CPU/MEM placement under three data layouts: Centralized, Interleaved, Associated]
Both centralized and interleaved data layouts hamper locality and parallelism; the associated layout is the ideal one.
The world we want to live in: the associated layout
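A toy model can make the difference between the three layouts concrete. It does not call any operating-system API: page ownership is simulated, and the assumption that node i works on the i-th contiguous slice of pages is introduced here purely for illustration, as are the function names.

```python
def place_pages(num_pages, num_nodes, layout):
    """Return the node owning each page under a simplified placement model."""
    if layout == "centralized":
        # One thread on node 0 touches every page first.
        return [0] * num_pages
    if layout == "interleaved":
        # Pages are spread round-robin over the nodes.
        return [p % num_nodes for p in range(num_pages)]
    if layout == "associated":
        # Threads on each node first touch the slice they will work on.
        per_node = -(-num_pages // num_nodes)  # ceiling division
        return [p // per_node for p in range(num_pages)]
    raise ValueError(f"unknown layout: {layout}")


def local_fraction(num_pages, num_nodes, layout):
    """Fraction of pages that are local to the node that works on them."""
    owner = place_pages(num_pages, num_nodes, layout)
    per_node = -(-num_pages // num_nodes)
    local = sum(1 for p in range(num_pages) if owner[p] == p // per_node)
    return local / num_pages


fractions = {name: local_fraction(64, 8, name)
             for name in ("centralized", "interleaved", "associated")}
```

In this model, with 64 pages on 8 nodes, only one page in eight is local under the centralized and interleaved layouts, while every page is local under the associated layout. The centralized layout additionally funnels all traffic through a single node's memory.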

NUMA Characteristics
Lack of locality (access neighboring vertices)
□ It is inevitable to access remote memory
□ Random access is always there
How to mix SEQ/RND accesses with Local/Global accesses?
[Figure: example graph with vertices 1-6 and an update across the vertex array 1 2 3 4 5 6]

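Access Strategy on NUMA
Legend: SEQ = sequential, RND = random; L = local, G = global
Vertex-centric Model
□ Completely overlooked (e.g. Ligra)
□ SEQ R L (sequential read, local), then RND W G (random write, global)
[Figure: example graph with vertices 1-6 split across nodes N0 and N1]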

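Access Strategy on NUMA
Legend: SEQ = sequential, RND = random; L = local, G = global
Edge-centric Model
□ Inefficient way (e.g. X-Stream)
□ RND R L, then SEQ W L
□ Shuffle phase: SEQ R L, then SEQ W G
□ SEQ R L, then RND W L
[Figure: example graph with vertices 1-6 split across nodes N0 and N1]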

Scalability & Performance on NUMA
[Figure: four plots comparing Ligra, X-Stream and Galois. Scalability: normalized speedup vs. #cores (1-10) and vs. #sockets (1-8) on the Intel 80-core machine (8S×10C), annotated 8C: 6.92X, 10C: 6.19X, 8S: 4.58X, 8S: 2.90X. Performance: runtime (sec) vs. #sockets (1-8), annotated 1S: 132s, 8S: 29s and 1S: 33s, 8S: 12s. Scalability: normalized speedup vs. #sockets (1-8), annotated "8 Socket: LG 2.9X, XS 1.4X" and "worse!". Panel labels: Intel 80-core (8S×10C), AMD 64-core (8S×8C)]
Minimize remote & random accesses + eliminate the combination of them

Goal #1: Reduce remote accesses
Co-locating data and computation within the same NUMA node

Graph-aware Data Layout
Co-locating data and computation within the same NUMA node
[Figure: example graph with vertices 1-6; TOPO, DATA and STAT arrays split across nodes N0 and N1]
1. Graph-aware partitioning
□ Unpartitioned: TOPO vertex 1 2 3 4 5 6, edge 2 3 3 5 2 5 6 1 3 5 1 2 3 6 2; DATA D1 D2 D3 D4 D5 D6; STAT 1 0 1 1 1 0
□ Intuitive: the vertex, edge, DATA and STAT arrays are split between N0 and N1 in their original order
□ Sophisticated: the edge array is regrouped as 2 3 3 2 1 3 1 2 3 2 (targets 1-3) and 5 5 6 5 6 (targets 5-6)
□ Agent: with the sophisticated layout, both N0 and N1 hold vertex entries 1 2 3 4 5 6
2. NUMA-aware allocation (Virt-Memory → Phys-Memory)
□ TOPO: 1. seq, 2. local, 3. long
□ DATA: 1. seq/rnd, 2. global, 3. long
□ STAT: 1. seq/rnd, 2. global, 3. short
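A minimal sketch of target-based edge partitioning follows. Two things are assumptions rather than facts from the slides: the per-vertex adjacency lists are one reading of the flattened edge array (chosen because it reproduces the regrouped array shown for the sophisticated layout), and an "agent" is modeled simply as a remote source vertex that a node needs an entry for. The function name is also hypothetical.

```python
def partition_edges_by_target(out_edges, num_nodes):
    """Place each edge on the node that owns its target vertex."""
    vertices = sorted(out_edges)
    per_node = -(-len(vertices) // num_nodes)  # ceiling division
    # Assumed ownership: contiguous ranges of vertex ids per node.
    owner = {v: i // per_node for i, v in enumerate(vertices)}
    node_edges = [[] for _ in range(num_nodes)]
    agents = [set() for _ in range(num_nodes)]
    for src in vertices:
        for dst in out_edges[src]:
            node = owner[dst]
            node_edges[node].append((src, dst))
            if owner[src] != node:
                # src lives elsewhere, so this node needs an agent entry for it.
                agents[node].add(src)
    return owner, node_edges, agents


# Assumed adjacency: one possible split of the edge array on the slide.
graph = {1: [2, 3], 2: [3, 5], 3: [2, 5, 6], 4: [1, 3, 5], 5: [1, 2, 3], 6: [6, 2]}
owner, node_edges, agents = partition_edges_by_target(graph, num_nodes=2)
targets_per_node = [[dst for _, dst in part] for part in node_edges]
```

Under these assumptions `targets_per_node` comes out as the two groups listed for the sophisticated layout, and every write to a target vertex stays on the node that stores it.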

Goal #2: Eliminate "random + remote"
Random remote access → access neighboring vertices on other nodes
Distribute the computations on a single vertex over multiple nodes
□ Each node handles all edges of partial vertices
□ Each node handles partial edges of all vertices
[Figure: the two options drawn on the example graph with vertices 1-6, with access labels RND L / SEQ G and RND G / SEQ L]

NUMA-aware Access Strategy
Distribute the computations on a single vertex over multiple NUMA nodes
[Figure: example graph with vertices 1-6; arrays STAT/curr, TOPO/vertex, TOPO/out-edge, DATA/curr, DATA/next, STAT/next split across N0 and N1]
□ DATA/curr (SEQ R L) → DATA/next (RND W G)
□ DATA/curr (SEQ R G) → DATA/next (RND W L)
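The two access patterns can be compared by counting local and remote accesses per edge. This is a counting model only, with assumptions labeled: the graph and the ownership map are hypothetical, the strategy names are introduced here, and "distributed" stands for letting the node that owns an edge's target process that edge.

```python
def count_accesses(out_edges, owner, strategy):
    """Count local/remote reads of DATA/curr and writes of DATA/next."""
    counts = {"read_local": 0, "read_remote": 0,
              "write_local": 0, "write_remote": 0}
    for src, dsts in out_edges.items():
        for dst in dsts:
            if strategy == "vertex-centric":
                worker = owner[src]   # node owning the source handles the edge
            elif strategy == "distributed":
                worker = owner[dst]   # node owning the target handles the edge
            else:
                raise ValueError(f"unknown strategy: {strategy}")
            # Read the source value, write the target value.
            counts["read_local" if owner[src] == worker else "read_remote"] += 1
            counts["write_local" if owner[dst] == worker else "write_remote"] += 1
    return counts


# Hypothetical graph and ownership (vertices 1-3 on node 0, 4-6 on node 1).
graph = {1: [2, 3], 2: [3, 5], 3: [2, 5, 6], 4: [1, 3, 5], 5: [1, 2, 3], 6: [6, 2]}
owner = {1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 1}
vertex_centric = count_accesses(graph, owner, "vertex-centric")
distributed = count_accesses(graph, owner, "distributed")
```

The number of remote accesses is the same in both cases; what changes is their kind. In the vertex-centric case the remote accesses are the random writes, while in the distributed case they are reads of source values, which can be issued in order, and all random writes stay local.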

Optimizations
1. Rolling update
2. Hierarchical and efficient barrier
3. Adaptive data structure
