 
A Scalable Processing-in-Memory Accelerator for Parallel Graph Processing
Seminar on Computer Architecture
Roknoddin Azizibarzoki
Junwhan Ahn, Sungpack Hong*, Sungjoo Yoo, Onur Mutlu+, Kiyoung Choi
Seoul National University, *Oracle Labs, +Carnegie Mellon University
International Symposium on Computer Architecture 2015
Executive Summary
Problem: Performance of graph processing on conventional systems does not scale in proportion to graph size
Goal: Design an infrastructure with scalable performance for graph processing
Observation: High memory bandwidth can sustain scalability in graph processing
Key Idea: Make use of Processing-In-Memory to provide high bandwidth, and design specially architected cores to utilize that bandwidth
Results: up to 13.8x performance improvement and 87% energy reduction
Graph Processing
Graphs
- Abstractions used to represent objects and their relations
- Vertices are used to represent objects
- Edges are used to represent the relations between the objects
- These representations can become very large in real-world applications
- Graphs used in this paper can reach up to 200 million edges, 7 million vertices, and 3-5 GB of memory footprint
Image obtained from: Grandjean, Martin (2015), "Introduction à la visualisation de données, l'analyse de réseau en histoire", Geschichte und Informatik 18/19 (PDF), pp. 109–128
Graph Processing Workloads
- Large amounts of data are processed in parallel
- The computation is almost independent for each vertex
Example: PageRank
- Originally designed by Google to rank webpages by the links pointing to them, so as to give better webpage suggestions

    for (v: graph.vertices):
        for (u: v.successors):
            u.new_rank += v.rank * weight
    for (v: graph.vertices):
        v.rank = v.new_rank
        v.new_rank = alpha
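The PageRank loop on this slide can be written as a small runnable sketch. This is only an illustration under stated assumptions: the function name `pagerank_step`, the damping value 0.85, and the choice to fold damping and out-degree into `weight` are not taken from the slides.

```python
def pagerank_step(successors, rank, damping=0.85):
    """One iteration of the two-phase loop from the slide.

    successors: dict mapping every vertex to a list of its successors.
    rank: dict mapping every vertex to its current rank.
    damping: assumed constant; the slide only names `weight` and `alpha`.
    """
    n = len(successors)
    alpha = (1.0 - damping) / n  # assumed base value for new_rank
    new_rank = {v: alpha for v in successors}
    # Phase 1: each vertex pushes a contribution to each of its successors.
    for v, succs in successors.items():
        if not succs:
            continue
        weight = damping / len(succs)  # assumed definition of `weight`
        for u in succs:
            new_rank[u] += weight * rank[v]  # random access to u's data
    # Phase 2 (v.rank = v.new_rank) is done by the caller using the result.
    return new_rank


graph = {0: [1, 2], 1: [2], 2: [0]}
ranks = {v: 1.0 / len(graph) for v in graph}
for _ in range(20):
    ranks = pagerank_step(graph, ranks)
```

The inner loop touches `new_rank[u]` for an arbitrary `u`, which is the access pattern the following slides identify as the bottleneck.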
Graph Processing Workloads Characteristics
Characteristics of this parallel, vertex-independent computation:
1. Frequent random memory accesses: each successor might lead you to a whole new subgraph

    for (u: v.successors):

2. Small amount of computation per vertex: a simple multiplication

    u.new_rank += v.rank * weight
Graph Processing on Conventional Systems
PageRank performance on conventional graph processing infrastructures:
[Figure: bar chart of speedup over the 32-core DDR3 baseline]
- 32 Cores, DDR3 (102.4GB/s): baseline
- 128 Cores, DDR3 (102.4GB/s): +42%
- 128 Cores, HMC (640GB/s): +89%
- 128 Cores, HMC Internal BW (8TB/s): 5.3x (ideally!)
Observations:
1. More bandwidth helps!
2. Conventional systems do not fully utilize the available bandwidth
INSIGHT: High bandwidth can mitigate the performance bottleneck!
IDEA:
1. Let's use HMC-based Processing-In-Memory to provide high bandwidth
2. And design specially architected cores to exploit this bandwidth (Tesseract Cores)
Tesseract System
Tesseract System
- Host processor: memory-mapped accelerator interface, non-cacheable, with no support for virtualization
- A network of HMC cubes
- Each HMC cube contains 32 vaults, each with a simple in-order core in its logic layer, so that the cores can use the HMC's internal BW
- Vaults communicate over a crossbar network for remote function calls
- Specialized cores, with a latency-tolerant programming model and prefetching mechanisms tailored to graph processing
- Message passing interface, prefetching mechanisms
Processing-In-Memory with 3D stacked DRAM
- Large amount of bandwidth available for the cores to utilize
- Specialized cores, with a latency-tolerant programming model and prefetching mechanisms tailored to graph processing
Communications in Tesseract
Data needed by a Tesseract core might be present in another vault's memory region
[Figure: Tesseract core block diagram with In-Order Core, DRAM Controller, List Prefetcher, Prefetch Buffer, Message-Triggered Prefetcher, Message Queue, and NI]
Communications in Tesseract
Data needed by a Tesseract core might be present in another vault's memory region
Original loop:

    for (u: v.successors):
        u.new_rank += v.rank * weight

With a non-blocking remote function call:

    for (u: v.successors):
        put(u.id, function() { u.new_rank += weight * v.rank; })
    barrier()

- The non-blocking remote function call increases latency tolerance in the source core and guarantees atomicity
- The function address and arguments are sent to the remote core
[Figure: Tesseract core TC #x in Vault #x holds v; TC #y in Vault #y holds u]
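The put/barrier pattern on this slide can be mimicked in software with per-vault message queues. The sketch below is a toy model, not the hardware mechanism: the names `Vault`, `ToySystem`, `owner`, and `add_rank`, and the modulo mapping of vertices to vaults, are all assumptions made for illustration.

```python
from collections import deque


class Vault:
    """Toy stand-in for one vault: local vertex data plus a message queue."""

    def __init__(self, vault_id):
        self.vault_id = vault_id
        self.data = {}        # vertex id -> dict of vertex fields
        self.queue = deque()  # pending remote function calls


class ToySystem:
    def __init__(self, num_vaults):
        self.vaults = [Vault(i) for i in range(num_vaults)]

    def owner(self, vertex_id):
        # Assumed placement: vertices are spread over vaults by modulo.
        return self.vaults[vertex_id % len(self.vaults)]

    def put(self, vertex_id, func, *args):
        # Non-blocking: enqueue the call at the owning vault and return.
        self.owner(vertex_id).queue.append((func, vertex_id, args))

    def barrier(self):
        # Drain all queues. Each vault runs its own messages one at a time,
        # so updates to a vertex never interleave.
        for vault in self.vaults:
            while vault.queue:
                func, vertex_id, args = vault.queue.popleft()
                func(vault.data[vertex_id], *args)


def add_rank(vertex, amount):
    vertex["new_rank"] += amount


system = ToySystem(num_vaults=2)
for vid in range(4):
    system.owner(vid).data[vid] = {"rank": 1.0, "new_rank": 0.0}
system.put(1, add_rank, 0.5)   # returns immediately
system.put(1, add_rank, 0.25)  # queued behind the first call
system.barrier()               # both updates are applied here
```

The source core never reads remote data itself; it only ships the function and its arguments, which is why it does not stall on the remote access.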
Prefetching in Tesseract
Prefetching the data being referenced in the message queue (later noted as MTP in the evaluation section):
- When a message enters the message queue, a prefetch request is issued
- The message is ready to be serviced when its data is present
[Figure: Tesseract core block diagram with In-Order Core, DRAM Controller, List Prefetcher, Prefetch Buffer, Message-Triggered Prefetcher, Message Queue, and NI]
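The two steps above can be illustrated with a cycle-by-cycle toy model. Everything here is an assumption made for the sketch: the class name `MessageTriggeredPrefetcher`, the fixed 3-cycle latency, the unbounded prefetch buffer, and serving one message per cycle.

```python
from collections import deque


class MessageTriggeredPrefetcher:
    """Toy model: a prefetch is issued when a message is enqueued."""

    def __init__(self, memory, latency=3):
        self.memory = memory       # address -> value, stands in for DRAM
        self.latency = latency     # assumed fixed access latency in cycles
        self.queue = deque()       # (address, function) messages
        self.in_flight = []        # (ready_cycle, address) prefetches
        self.prefetch_buffer = {}  # address -> prefetched value
        self.cycle = 0

    def enqueue(self, address, func):
        self.queue.append((address, func))
        # The prefetch request is issued as soon as the message arrives.
        self.in_flight.append((self.cycle + self.latency, address))

    def tick(self):
        """Advance one cycle; return the address served, or None."""
        self.cycle += 1
        pending = []
        for ready_cycle, address in self.in_flight:
            if ready_cycle <= self.cycle:
                self.prefetch_buffer[address] = self.memory[address]
            else:
                pending.append((ready_cycle, address))
        self.in_flight = pending
        # The head message is serviced only once its data is present.
        if self.queue and self.queue[0][0] in self.prefetch_buffer:
            address, func = self.queue.popleft()
            new_value = func(self.prefetch_buffer[address])
            self.memory[address] = new_value
            self.prefetch_buffer[address] = new_value
            return address
        return None


mtp = MessageTriggeredPrefetcher({10: 1.0, 20: 2.0})
mtp.enqueue(10, lambda x: x + 1.0)
mtp.enqueue(20, lambda x: x * 2.0)
served = [mtp.tick() for _ in range(4)]
```

In this toy run the second message is served one cycle after the first, because its memory latency overlapped with the first message's wait instead of starting after it.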
Tesseract Core
Novelties of Tesseract:
- Usage of PIM (logic layer integration) to increase the BW available to the cores
- Message passing is employed to increase latency tolerance and guarantee atomicity
- Specially crafted prefetching mechanisms are used to utilize the abundant BW available for graph processing
Other Constructs of Tesseract:
1. List Prefetching: prefetching based on the next elements in the list being traversed, with a constant stride (later noted as LP in the evaluation section)
2. Programming API
3. Blocking remote function calls
[Figure: Tesseract core block diagram with In-Order Core, DRAM Controller, List Prefetcher, Prefetch Buffer, Message-Triggered Prefetcher, Message Queue, and NI]
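List prefetching with a constant stride comes down to simple address arithmetic. The helper below is a minimal sketch; the function name `list_prefetch_addresses`, its parameters, and the prefetch depth of 4 are hypothetical and not taken from the paper.

```python
def list_prefetch_addresses(base, stride, count, index, depth=4):
    """Addresses to prefetch while the core works on element `index`.

    base:   address of the first list element
    stride: constant distance in bytes between consecutive elements
    count:  number of elements in the list
    depth:  assumed number of elements to run ahead of the core
    """
    first = index + 1
    last = min(index + depth, count - 1)  # never run past the list end
    return [base + i * stride for i in range(first, last + 1)]


# While the core processes element 0 of a 10-element list with 16-byte
# elements, the next four elements are requested.
ahead = list_prefetch_addresses(0x1000, 16, 10, 0)
```

Because the stride is constant, the addresses are known before the core asks for them, which is what lets a simple in-order core keep the memory busy.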
Evaluation
Evaluation Methodology
Simulated Systems:
- DDR3-OoO: DDR3 + OoO cores
- HMC-OoO: HMC + OoO cores, higher bandwidth
- HMC-MC: HMC + a larger number of simpler, less powerful cores
- Tesseract: logic layer integration of the HMC with Tesseract cores
Workloads:
3 real-world graphs:
• ljournal-2008 (social network)
• enwiki-2013 (Wikipedia)
• indochina-2004 (web graph)
5 graph processing algorithms:
• Average teenage follower
• Conductance
• PageRank
• Single-source shortest path
• Vertex cover
Evaluation Results
Average Performance
[Figure: bar chart of speedup over DDR3-OoO]
- DDR3-OoO: baseline
- HMC-OoO: +56%
- HMC-MC: +25%
- Tesseract: 9.0x
- Tesseract LP: 11.6x
- Tesseract LP + MTP: 13.8x
Evaluation Results
Average Bandwidth Utilization
[Figure: bar chart of memory bandwidth (TB/s)]
- DDR3-OoO: 80GB/s
- HMC-OoO: 190GB/s
- HMC-MC: 243GB/s
- Tesseract: 1.3TB/s
- Tesseract LP: 2.2TB/s
- Tesseract LP + MTP: 2.9TB/s
Evaluation Results
Average Memory Energy Consumption
[Figure: stacked bar chart of normalized energy, broken down into Memory Layers, Logic Layers, and Cores]
- Tesseract LP + MTP: -87% relative to HMC-OoO
Executive Summary
Problem: Performance of graph processing on conventional systems does not scale in proportion to graph size
Goal: Design an infrastructure with scalable performance for graph processing
Observation: High memory bandwidth can sustain scalability in graph processing
Key Idea: Make use of Processing-In-Memory to provide high bandwidth, and design specially architected cores to utilize that bandwidth
Results: 10x performance improvement and 87% energy reduction
Analysis
Strengths
1. First work to introduce Processing-In-Memory to graph computations
2. Employs specially designed prefetching mechanisms to better utilize BW
3. The non-blocking remote function call is an effective way to increase latency tolerance
4. The paper is written in a way that is easy to follow
Weaknesses
1. Data placement is not treated as a serious concern in this work (GraphP [1] reduces communication in Tesseract with efficient data placement)
2. The paper does not discuss why it is limited to graph applications
3. Introducing barriers raises the concern of load balancing
4. No comparison against prevalent graph processing platforms such as GPUs is included in the paper
5. Adapting common applications to the programming model is not easy
Takeaways
1. Optimizing a narrow set of factors might lead to underutilization of resources
2. If designed effectively, PIM might be a promising approach to provide high bandwidth for large-scale data processing
Discussions
1. There is another construct called blocking remote function calls. The difference is that these calls have return values, and the source core waits for them to come back. Can you think of ways to optimize blocking remote function calls?
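To make the difference concrete, here is a toy software model of the two call types, with a single worker thread standing in for a vault's in-order core. The names `ToyVault`, `get`, `put`, `read`, and `add` are hypothetical, and the model only illustrates the waiting behaviour, not the hardware.

```python
from concurrent.futures import ThreadPoolExecutor


class ToyVault:
    """Toy vault whose single worker thread models one in-order core."""

    def __init__(self):
        self.data = {}
        self._core = ThreadPoolExecutor(max_workers=1)

    def put(self, func, *args):
        # Non-blocking: hand the call to the remote core and continue.
        return self._core.submit(func, self.data, *args)

    def get(self, func, *args):
        # Blocking: the caller waits until the return value comes back.
        return self._core.submit(func, self.data, *args).result()

    def shutdown(self):
        self._core.shutdown(wait=True)


def read(data, key):
    return data[key]


def add(data, key, amount):
    data[key] += amount


vault = ToyVault()
vault.data["v"] = 3
vault.put(add, "v", 4)           # source core does not wait
value = vault.get(read, "v")     # source core stalls for the result
vault.shutdown()
```

With one worker, calls run in submission order, so the `get` observes the earlier `put`. The time the caller spends inside `get` is the stall that the discussion question asks how to reduce.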
Discussions
2. How hard will it be to expand Tesseract to other applications?
Discussions
3. How badly will Tesseract suffer from unbalanced workloads?
Discussions
4. What if we replace the Tesseract cores with GPU Streaming Multiprocessors?
TOM [2]: Transparent Offloading and Mapping
1. What to offload to the GPU-PIM accelerator: decided by the bandwidth gain
2. How to map the data and schedule the computation to benefit the most: subsequent accesses have a certain offset, thus we can map them together
Result: 30% average performance gain over a baseline with a GPU without offloading