SLIDE 1

A Scalable Processing-in-Memory Accelerator for Parallel Graph Processing

Junwhan Ahn, Sungpack Hong*, Sungjoo Yoo, Onur Mutlu+, Kiyoung Choi
Seoul National University, *Oracle Labs, +Carnegie Mellon University
International Symposium on Computer Architecture 2015

Seminar on Computer Architecture

Roknoddin Azizibarzoki

SLIDE 2

Executive Summary

Problem: Performance of graph processing on conventional systems does not scale with graph size.

Observation: High memory bandwidth can sustain scalability in graph processing.

Key Idea: Use Processing-In-Memory to provide high bandwidth, and design specially architected cores to utilize that bandwidth.

Goal: Design an infrastructure with scalable performance for graph processing.

Results: Up to 13.8x performance improvement and 87% memory energy reduction.

SLIDE 3

Graph Processing

SLIDE 4

Graphs

Graphs are abstractions used to represent objects and their relations: vertices represent objects, and edges represent the relations between them. In real-world applications these representations can become very large: the graphs used in this paper reach up to 200 million edges and 7 million vertices, with a memory footprint of 3-5 GB.

Image obtained from: Grandjean, Martin (2015), "Introduction à la visualisation de données, l'analyse de réseau en histoire", Geschichte und Informatik 18/19, pp. 109-128.

SLIDE 5

Graph Processing Workloads

A large amount of data is processed in parallel, almost independently.

Example: PageRank. Originally designed at Google to rank webpages by their link structure, so as to return better search results.

    1  for (v: graph.vertices):
    2    for (u: v.successors):
    3      u.new_rank += v.rank * weight
    4  for (v: graph.vertices):
    5    v.rank = v.new_rank
    6    v.new_rank = alpha

Parallel computation almost independent for each vertex
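As a concrete illustration, the following is a minimal sequential C++ sketch of the two-phase update above; the Vertex type, the weight formula, and the 0.15 reset value are illustrative assumptions, not taken from the paper.

    #include <vector>

    // Minimal PageRank iteration sketch (illustrative only).
    struct Vertex {
        double rank = 1.0;
        double new_rank = 0.15;          // reset value (alpha in the pseudocode)
        std::vector<int> successors;     // indices of outgoing neighbors
    };

    void pagerank_iteration(std::vector<Vertex>& graph, double alpha) {
        // Phase 1 (lines 1-3): scatter each vertex's rank to its successors.
        for (auto& v : graph) {
            if (v.successors.empty()) continue;
            double weight = (1.0 - alpha) / v.successors.size();
            for (int u : v.successors)
                graph[u].new_rank += v.rank * weight;
        }
        // Phase 2 (lines 4-6): commit new ranks and reset the accumulators.
        for (auto& v : graph) {
            v.rank = v.new_rank;
            v.new_rank = alpha;
        }
    }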

SLIDE 6

Graph Processing Workloads Characteristics

Characteristics of this parallel, vertex-independent computation:

  • 1. Frequent random memory accesses
  • 2. Small amount of computation per vertex

Lines 2-3 of the PageRank kernel:

    2    for (u: v.successors):
    3      u.new_rank += v.rank * weight

1. Each successor might lead into a whole new subgraph: frequent random memory accesses.
2. The per-vertex work is a simple multiplication: little computation per vertex.
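To make characteristic 1 concrete, here is a small C++ sketch (mine, not the paper's) using the common compressed sparse row (CSR) layout: the successor indices in col_idx can point anywhere in the rank array, so each new_rank[u] update is effectively a random memory access paired with one multiply-add.

    #include <vector>

    // CSR adjacency: successors of v are col_idx[row_ptr[v] .. row_ptr[v+1]).
    struct CsrGraph {
        std::vector<int> row_ptr;   // size = num_vertices + 1
        std::vector<int> col_idx;   // size = num_edges
    };

    void scatter_ranks(const CsrGraph& g, const std::vector<double>& rank,
                       std::vector<double>& new_rank, double weight) {
        int n = static_cast<int>(g.row_ptr.size()) - 1;
        for (int v = 0; v < n; ++v) {
            for (int e = g.row_ptr[v]; e < g.row_ptr[v + 1]; ++e) {
                int u = g.col_idx[e];             // u can be any vertex id...
                new_rank[u] += rank[v] * weight;  // ...so this access is random
            }
        }
    }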

SLIDE 7

Graph Processing on Conventional Systems

PageRank performance on conventional graph processing infrastructures:

[Figure: PageRank speedup. 32 cores + DDR3 (102.4 GB/s): baseline; 128 cores + DDR3 (102.4 GB/s): +42%; 128 cores + HMC (640 GB/s): +89%; 128 cores using HMC internal bandwidth (8 TB/s): 5.3x, ideally!]

1. More bandwidth helps!
2. Conventional systems do not utilize the available bandwidth.

IDEA:

1. Use HMC-based Processing-In-Memory to provide high bandwidth
2. Design specially architected cores (Tesseract cores) to exploit this bandwidth

INSIGHT: High bandwidth can mitigate the performance bottleneck!

SLIDE 8

Tesseract System

SLIDE 9

Tesseract System

[Figure: Host processor connected to a network of HMC cubes.]

  • A network of HMC cubes
  • Memory-mapped accelerator interface: non-cacheable, with no support for virtualization
  • Each HMC cube contains 32 vaults, each armed with a simple in-order core in its logic layer, so the cores can use the HMC's internal bandwidth
  • Vaults communicate over a crossbar network for remote function calls
  • Specialized cores, armed with a latency-tolerant programming model and graph-processing-based prefetching mechanisms
  • Message-passing interface and prefetching mechanisms
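For orientation, a minimal structural sketch of this organization in C++; the 32-vault count is from the slide, the 16-cube network size is the paper's configuration, and all type names are mine.

    #include <array>
    #include <cstdint>
    #include <vector>

    // Structural sketch of the Tesseract organization (illustrative).
    struct Vault {
        // Each vault pairs a DRAM partition with a simple in-order core
        // on the logic layer, so the core sees the vault's full bandwidth.
        std::vector<uint8_t> dram;     // this vault's memory partition
        // in-order core, message queue, prefetchers ... omitted here
    };

    struct HmcCube {
        std::array<Vault, 32> vaults;  // vaults talk over an in-cube crossbar
    };

    struct TesseractSystem {
        std::array<HmcCube, 16> cubes; // cubes connected in a memory network
        // The host reaches the accelerator through a non-cacheable,
        // memory-mapped interface.
    };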

SLIDE 10

Processing-In-Memory with 3D stacked DRAM

3D stacking makes a large amount of bandwidth available for the cores to utilize. Tesseract pairs it with specialized cores, armed with a latency-tolerant programming model and graph-processing-based prefetching mechanisms.

SLIDE 11

Communications in Tesseract

[Figure: Two Tesseract vaults, each containing an in-order core, DRAM controller, list prefetcher, message-triggered prefetcher, prefetch buffer, and network interface (NI) with a message queue.]

Data needed by a Tesseract core might be present in another vault's memory region.

SLIDE 12

Communications in Tesseract

Data needed by a Tesseract core might be present in another vault's memory region:

    2    for (u: v.successors):
    3      u.new_rank += v.rank * weight

[Figure: Vertex v resides in Vault #x (Tesseract core #x); its successor w resides in Vault #y (Tesseract core #y).]

    for (w: v.successors):
      put(w.id, function() { w.next_rank += weight * v.rank; })
    barrier()

This non-blocking remote function call sends the function address and arguments to the remote core; it increases latency tolerance in the source core and guarantees atomicity.
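To illustrate the put()/barrier() pattern in runnable form, here is a hedged single-threaded C++ emulation (my own sketch, not the paper's hardware): each vault owns a message queue, put() enqueues a closure on the vault that owns the target vertex, and barrier() drains all queues before execution continues.

    #include <functional>
    #include <queue>
    #include <vector>

    // Software emulation sketch of Tesseract-style non-blocking remote
    // function calls. Names (RemoteCallEmu, owner_of) are illustrative.
    class RemoteCallEmu {
    public:
        explicit RemoteCallEmu(int num_vaults) : queues_(num_vaults) {}

        // Non-blocking: enqueue the closure on the vault owning vertex_id
        // and return immediately, like Tesseract's put().
        void put(int vertex_id, std::function<void()> fn) {
            queues_[owner_of(vertex_id)].push(std::move(fn));
        }

        // barrier(): run every pending remote call before proceeding.
        // Executing each vault's queue serially keeps the updates atomic
        // with respect to that vault's data.
        void barrier() {
            for (auto& q : queues_)
                while (!q.empty()) { q.front()(); q.pop(); }
        }

    private:
        int owner_of(int vertex_id) const {
            return vertex_id % static_cast<int>(queues_.size());
        }
        std::vector<std::queue<std::function<void()>>> queues_;
    };

With this, the scatter loop becomes put(w, [=]{ new_rank[w] += rank[v] * weight; }) for each successor, followed by a single barrier(), mirroring the pseudocode above.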

SLIDE 13

Prefetching in Tesseract

[Figure: Tesseract core block diagram, highlighting the message-triggered prefetcher and its prefetch buffer alongside the in-order core, DRAM controller, list prefetcher, and NI message queue.]

The message-triggered prefetcher (later noted as MTP in the evaluation section) prefetches the data referenced by messages in the message queue: when a message enters the queue, a prefetch request is issued, and the message is ready to be serviced once its data is present.
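A minimal event-level model of this mechanism in C++ (my own sketch; the fixed prefetch latency and one-address-per-message simplification are assumptions):

    #include <cstdint>
    #include <deque>

    // Toy timeline model of a message-triggered prefetcher (MTP).
    constexpr uint64_t kPrefetchLatency = 36;  // illustrative, in cycles

    struct Message {
        uint64_t addr;         // data the remote function will touch
        uint64_t ready_cycle;  // cycle at which its prefetch completes
    };

    class MessageTriggeredPrefetcher {
    public:
        // On message arrival, immediately issue a prefetch for its data.
        void enqueue(uint64_t addr, uint64_t now) {
            queue_.push_back({addr, now + kPrefetchLatency});
        }

        // Service a message only once its data sits in the prefetch
        // buffer, so the in-order core never stalls on this access.
        bool try_service(uint64_t now, Message& out) {
            if (queue_.empty() || queue_.front().ready_cycle > now)
                return false;
            out = queue_.front();
            queue_.pop_front();
            return true;
        }

    private:
        std::deque<Message> queue_;
    };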

SLIDE 14

Tesseract Core

[Figure: Tesseract core block diagram: in-order core, DRAM controller, list prefetcher, message-triggered prefetcher, prefetch buffer, and NI message queue.]

Novelties of Tesseract:

  • Usage of PIM (logic-layer integration) to increase the bandwidth available to the cores
  • Message passing, employed to increase latency tolerance and guarantee atomicity
  • Specially crafted prefetching mechanisms to utilize the abundant bandwidth available for graph processing

Other constructs of Tesseract (a list-prefetcher sketch follows after the list):

  • 1. List prefetching: prefetching based on the next elements in the traversal list, with a constant stride (later noted as LP in the evaluation section)
  • 2. Programming API
  • 3. Blocking remote function calls
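The list prefetcher, as I read the slide, can be sketched as follows in C++; the prefetch distance and interface names are illustrative, though start() deliberately mirrors the list_begin(address, size, stride) call from the API in the backup slides.

    #include <cstdint>

    // Toy model of Tesseract's list prefetcher (LP): while walking a
    // list region sequentially, prefetch a fixed number of elements
    // ahead using a constant stride.
    struct ListPrefetcher {
        uint64_t begin = 0, end = 0, stride = 0;
        uint64_t distance = 4;  // elements ahead to prefetch (assumed)

        // Mirrors list_begin(A address, S size, S stride).
        void start(uint64_t address, uint64_t size, uint64_t str) {
            begin = address; end = address + size; stride = str;
        }

        // Called on each demand access inside [begin, end): issue a
        // prefetch for the element `distance` strides ahead, if any.
        void on_access(uint64_t addr) {
            if (stride == 0 || addr < begin || addr >= end) return;
            uint64_t target = addr + distance * stride;
            if (target < end) issue_prefetch(target);
        }

        void issue_prefetch(uint64_t /*addr*/) { /* to DRAM controller */ }
    };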

SLIDE 15

Evaluation

SLIDE 16

Evaluation Methodology

Workloads:

3 real-world graphs:
  • ljournal-2008 (social network)
  • enwiki-2003 (Wikipedia)
  • indochina-2004 (web graph)

5 graph processing algorithms:
  • Average teenage follower
  • Conductance
  • PageRank
  • Single-source shortest path
  • Vertex cover

Simulated systems:
  • DDR3-OoO: DDR3 + out-of-order cores
  • HMC-OoO: HMC + out-of-order cores, higher bandwidth
  • HMC-MC: HMC + a larger number of simpler, less powerful cores
  • Tesseract: Tesseract cores integrated into the HMC logic layer
SLIDE 17

Evaluation Results

Average Performance

[Figure: Average speedup. DDR3-OoO: 1x (baseline); HMC-OoO: +56%; HMC-MC: +25%; Tesseract: 9.0x; Tesseract + LP: 11.6x; Tesseract + LP + MTP: 13.8x.]

SLIDE 18

Evaluation Results

Average Bandwidth Utilization

[Figure: Average memory bandwidth utilization. DDR3-OoO: 80 GB/s; HMC-OoO: 190 GB/s; HMC-MC: 243 GB/s; Tesseract: 1.3 TB/s; Tesseract + LP: 2.2 TB/s; Tesseract + LP + MTP: 2.9 TB/s.]

SLIDE 19

Evaluation Results

Average Memory Energy Consumption

[Figure: Normalized memory energy, broken down into memory layers, logic layers, and cores, for HMC-OoO vs. Tesseract + LP + MTP: an 87% reduction.]

SLIDE 20

Executive Summary

Problem: Performance of graph processing on conventional systems does not scale with graph size.

Observation: High memory bandwidth can sustain scalability in graph processing.

Key Idea: Use Processing-In-Memory to provide high bandwidth, and design specially architected cores to utilize that bandwidth.

Goal: Design an infrastructure with scalable performance for graph processing.

Results: Up to 13.8x performance improvement and 87% memory energy reduction.

SLIDE 21

Analysis

SLIDE 22

Strengths

  • 1. First work to introduce Processing-In-Memory to graph computations
  • 2. Employs specially designed prefetching mechanisms to better utilize bandwidth
  • 3. Non-blocking remote function calls are an effective way to increase latency tolerance
  • 4. The paper is written in a way that is easy to follow
SLIDE 23

Weaknesses

  • 1. Data placement is not treated as a serious concern in this work (GraphP [1] reduces communication in Tesseract with efficient data placement)
  • 2. The paper does not discuss why the approach is limited to graph applications
  • 3. Introducing barriers raises load-balancing concerns
  • 4. No comparison against prevalent graph processing platforms such as GPUs is included in the paper
  • 5. Adapting common applications to the programming model is not easy
SLIDE 24

Takeaways

  • 1. Optimizing a narrow set of factors might lead to underutilization of resources
  • 2. If designed effectively, PIM might be a promising approach to provide high bandwidth for large-scale data processing

SLIDE 25

Discussions

  • 1. Tesseract also provides the other construct, blocking remote function calls. The difference is that the source core has return values it must wait for before continuing. Can you think of ways to optimize blocking remote function calls?

SLIDE 26

Discussions


  • 2. How hard will it be to expand Tesseract to other applications?
SLIDE 27

Discussions

  • 3. How badly will Tesseract suffer from unbalanced workloads?

SLIDE 28

Discussions

  • 4. What if we replace Tesseract cores with GPU streaming multiprocessors (SMs)?

TOM [2]: Transparent Offloading and Mapping

  • 1. What to offload to the GPU-PIM accelerator: decided by the expected bandwidth gain
  • 2. How to map the data and schedule the computation to benefit the most: subsequent accesses have a fixed offset, so they can be mapped together

Result: 30% average performance gain over a baseline GPU without offloading

SLIDE 29

Discussions

  • 4. What if we replace Tesseract cores with GPU streaming multiprocessors?

Still, TOM does not employ specially designed mechanisms to mitigate communication between vaults, so this problem remains. New question: if we have a PIM cube with GPU cores in its logic layer, how can we reduce data movement?

SLIDE 30

Discussions

[Figure: SM access breakdown over vaults for BFS, showing the percentage of accesses from each SM ID to each vault ID.]

  • 1. Remapping?
  • 2. CTA migration?

A CTA (cooperative thread array) is a block of threads scheduled to run on a single GPU SM.

SLIDE 31

Discussions

[Figure: SM access breakdown over vaults for MUMmerGPU, showing the percentage of accesses from each SM ID to each vault ID.]

SLIDE 32

Discussions

  • 5. What about data movement between cubes?

GraphP [1] reduces communication between the cubes in Tesseract with efficient data placement, using 3 key techniques (a partitioning sketch follows after the list):

  • 1. “Source-cut” partitioning: an algorithm ensuring that a vertex and all its incoming edges reside in the same cube
  • 2. “Two-phase vertex program”: a programming model designed for source-cut partitioning
  • 3. “Hierarchical communication and overlapping”
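Following the slide's description of source-cut partitioning (a vertex kept together with all of its incoming edges), a minimal C++ sketch might look like this; the modulo placement function is my assumption, not GraphP's actual algorithm.

    #include <unordered_map>
    #include <vector>

    // Sketch: co-locate every vertex with all of its incoming edges,
    // per the slide's description of "source-cut" partitioning.
    struct Edge { int src, dst; };

    int cube_of(int vertex, int num_cubes) { return vertex % num_cubes; }

    std::unordered_map<int, std::vector<Edge>>
    partition_source_cut(const std::vector<Edge>& edges, int num_cubes) {
        std::unordered_map<int, std::vector<Edge>> per_cube;
        for (const Edge& e : edges) {
            // An edge follows its destination vertex, so a vertex and
            // all of its incoming edges land in one cube; only edges
            // whose source lives elsewhere need cross-cube traffic.
            per_cube[cube_of(e.dst, num_cubes)].push_back(e);
        }
        return per_cube;
    }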
SLIDE 33

Discussions

  • 6. Other mechanisms for the same problem:

GraphR [3]: Accelerating Graph Processing Using ReRAM. Graph computations are performed on dense ReRAM crossbars, which also allow analog computation.

Results: up to 4.12x speedup and 10.96% energy saving over Tesseract

SLIDE 34

References

SLIDE 35

References

[1] M. Zhang et al., "GraphP: Reducing Communication for PIM-Based Graph Processing with Efficient Data Partition," 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA), Vienna, 2018, pp. 544-557.

[2] K. Hsieh et al., "Transparent Offloading and Mapping (TOM): Enabling Programmer-Transparent Near-Data Processing in GPU Systems," 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), Seoul, 2016, pp. 204-216.

[3] L. Song et al., "GraphR: Accelerating Graph Processing Using ReRAM," arXiv preprint, 2017.

SLIDE 36

A Scalable Processing-in-Memory Accelerator for Parallel Graph Processing

Junwhan Ahn, Sungpack Hong*, Sungjoo Yoo, Onur Mutlu+, Kiyoung Choi
Seoul National University, *Oracle Labs, +Carnegie Mellon University
International Symposium on Computer Architecture 2015

Seminar on Computer Architecture

Roknoddin Azizibarzoki

SLIDE 37

Backup Slides

SLIDE 38

Backup Slides

SLIDE 39

Backup Slides

SLIDE 40

Backup Slides

SLIDE 41

Backup Slides

Tesseract programming API:

    get(id, A func, A arg, S arg_size, A ret, S ret_size)
    put(id, A func, A arg, S arg_size, A prefetch_addr)
    disable_interrupt(), enable_interrupt()
    copy(id, A local, A remote, S size)
    list_begin(A address, S size, S stride)
    list_end(A address, S size, S stride)
    barrier()
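To show how these primitives compose, here is a hedged pseudocode sketch of the PageRank scatter phase written against this API, in the style of the slides; the argument marshalling (the arg struct and the add_to_rank helper) is my guess, not the paper's actual listing.

    // Hedged sketch: PageRank scatter using the Tesseract API above.
    list_begin(v.successors, size, stride)      // enable the list prefetcher
    for (w: v.successors):
        arg = { &w.next_rank, v.rank * weight } // what the remote call needs
        // Non-blocking remote call on w's home vault; the last argument
        // is prefetch_addr, letting the message-triggered prefetcher
        // fetch w.next_rank before the call is serviced.
        put(w.id, add_to_rank, &arg, sizeof(arg), &w.next_rank)
    list_end(v.successors, size, stride)
    barrier()                                   // wait for all puts to finish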