Analysis and Optimization of the Memory Hierarchy for Graph - PowerPoint PPT Presentation

Scalable and Energy-efficient Architecture Lab (SEAL) Analysis and Optimization of the Memory Hierarchy for Graph Processing Workloads Abanti Basak , Shuangchen Li, Xing Hu, Sang Min Oh, Xinfeng Xie, Li Zhao*, Xiaowei Jiang*, and Yuan Xie University of California, Santa Barbara *Server Architecture Group, Alibaba Inc.

Scalable and Energy-efficient Architecture Lab (SEAL) Executive Summary Memory-bound behavior in single-machine in-memory graph processing Data-aware characterization of the core and the cache hierarchy to understand the memory-bound behavior Architecture design and evaluation of DROPLET, a data-aware and decoupled prefetcher for graphs to solve the memory access bottleneck 2

Scalable and Energy-efficient Architecture Lab (SEAL) Section I Memory-bound behavior in single-machine in-memory graph processing • Application domains • Single-machine in-memory graph processing • Memory access bottleneck 3

Scalable and Energy-efficient Architecture Lab (SEAL) Transportation Graph Processing Financial money flows 4

Scalable and Energy-efficient Architecture Lab (SEAL) Interest in Graph Processing 5

Scalable and Energy-efficient Architecture Lab (SEAL) Single-Machine In-Memory Graph Processing Many common-case industry and academic graphs fit in RAM of a high-end server Big-memory machines Ex: 1) Intel Xeon with 1.5TB RAM 2) HPE’s MACHINE 6

Scalable and Energy-efficient Architecture Lab (SEAL) Bottleneck… 45% of cycles are DRAM-bound stall cycles Only 15% of cycles are fully utilized by core without stalling 1.0 synchronization fraction of time 0.8 45% DRAM 0.6 0.4 L3 cache front-end 0.2 15% no stalls 0.0 Cycle stack of PageRank on Orkut dataset Data collected using Sniper on a quad-core architecture 7

Scalable and Energy-efficient Architecture Lab (SEAL) Section II Data-aware characterization of the core and the cache hierarchy to understand the memory-bound behavior • Novelty • Background • Characterization setup • Profiling observations • Summary 8

Scalable and Energy-efficient Architecture Lab (SEAL) Characterization of Core and Cache Hierarchy Novelty compared to prior characterization [ IISWC ‘15, MASCOTS ‘16, SC ’15 ]: simulated environment: data-aware profiling: explicit exploration of guidelines to managing performance sensitivity different data types of hardware design parameters 9

Scalable and Energy-efficient Architecture Lab (SEAL) Background: Data Type Terminology Compressed Sparse Row (CSR) data layout • Structure data -> neighbor ID array • Property data -> vertex data array • Intermediate data -> any other data 10

Scalable and Energy-efficient Architecture Lab (SEAL) Experimental Setup Hardware Characteristics on SniperSIM Algorithms (GAP Benchmark) • • 4-core, 128-entry ROB, 2.66GHz Connected Components ( CC ) • • Private L1D/I caches, 32KB, 8-way SA, 4 cycles PageRank ( PR ) • • Private L2 cache, 256KB, 8-way SA, 8 cycles Betweenness Centrality ( BC ) • • Shared L3, 8MB, 16-way SA, 30 cycles Breadth First Search ( BFS ) • • DDR3 DRAM, access latency = 45 ns Single Source Shortest Path ( SSSP ) Datasets (|V|= # vertices, |E| = # edges) • Kron 16.8M |V| 260M |E| • Urand 8.4M |V| 134M |E| • Orkut 3M |V| 117M |E| • LiveJournal 4.8M |V| 68.5M |E| • Road 23.9M |V| 57.7M |E| 11

Scalable and Energy-efficient Architecture Lab (SEAL) Profiling Overview • Can we achieve higher Memory-Level Parallelism (MLP)? - If not, what factor is restricting MLP? • What is the relative performance sensitivity of different cache levels? • How do different data types use the memory hierarchy? 12

Scalable and Energy-efficient Architecture Lab (SEAL) Instruction Window size does not hinder MLP We increase IW size to 4X...... 1.2 30 BC BFS PR CC SSSP speedup with larger ROB SSSP CC BC BFS PR 1.0 utilization change (%) 20 0.8 DRAM BW 0.6 10 0.4 0.2 0 0.0 kron livejournal orkut road urand kron livejournal orkut road urand kron livejournal orkut road urand kron livejournal orkut road urand kron livejournal orkut road urand mean kron livejournal orkut road urand kron livejournal orkut road urand kron livejournal orkut road urand kron livejournal orkut road urand kron livejournal orkut road urand mean Average speedup is Average memory BW utilization only 1.44% increases by only 2.7% 13

Scalable and Energy-efficient Architecture Lab (SEAL) Load-load dependency hinders MLP For every load in ROB, we track its dependency backward until we find an older load…. (producer load) LD[R5] -> R3 ADD R1, R3 -> R4 LD[R4] -> R2 43.2% of loads are part of a dependency chain with chain length of 2.5 (consumer load) 14

Scalable and Energy-efficient Architecture Lab (SEAL) Property data is consumer in load-load dependency We break down producer and consumer loads by application data type… • Property data is mostly a consumer (54%) rather than a producer (6%) • Structure data is mostly a producer (41%) rather than a consumer (6%) 15

Scalable and Energy-efficient Architecture Lab (SEAL) Private L2 cache shows negligible performance sensitivity We vary L2 cache configurations… An architecture without private L2 caches is just as fine for graph processing 16

Scalable and Energy-efficient Architecture Lab (SEAL) Shared LLC shows higher performance sensitivity We vary shared LLC capacities… 17.4% performance improvement for 4X increase in LLC capacity 17

Scalable and Energy-efficient Architecture Lab (SEAL) Heterogeneous Reuse Distances We break down memory hierarchy usage by application data type…. DRAM L3 L2 L1 structure (%) BC BFS CC PR SSSP 100 80 Structure data has the largest reuse 60 40 distance: serviced by L1 and DRAM 20 0 property (%) 100 80 Property data has a larger reuse 60 distance than that serviced by L2 cache 40 20 0 intermediate (%) 100 80 Intermediate data accesses are mostly 60 40 on-chip cache hits 20 0 kron livejournal orkut road urand kron livejournal orkut road urand kron livejournal orkut road urand kron livejournal orkut road urand kron livejournal orkut road urand 18

Scalable and Energy-efficient Architecture Lab (SEAL) To Summarize…. Memory-bound behavior caused by: • Heterogeneous reuse distances of different data types leading to intensive DRAM accesses for structure and property data • Low MLP due to load-load dependency chains, limiting the possibility of overlapping DRAM accesses 19

Scalable and Energy-efficient Architecture Lab (SEAL) Section III Architecture design and evaluation of DROPLET, a data-aware and decoupled prefetcher for graphs to solve the memory access bottleneck • DROPLET introduction • DROPLET overview • L2 structure streamer • Property prefetcher • Evaluation 20

Scalable and Energy-efficient Architecture Lab (SEAL) DROPLET: Data-AwaRe DecOuPLed PrEfeTcher struct_trigger Data-aware Structure 1 Private L2 cache struct_req Streamer Decoupled: Data-aware: prop_dat prop_req struct_req overcomes prefetches data struct_dat serialization types Shared Inclusive Coherence engine LLC from load-load according to dependency reuse prop_dat struct_req prop_req struct_dat distances prop_req Data-aware Memory prop_trigger Property 2 Controller Prefetcher struct_req struct_dat prop_dat 21

Scalable and Energy-efficient Architecture Lab (SEAL) DROPLET Overview struct_trigger Data-aware Structure 1 Private L2 cache struct_req Streamer prop_dat prop_req struct_req struct_dat Data-aware structure streamer Shared Inclusive Coherence sends a prefetch request for engine LLC structure data prop_dat struct_req prop_req struct_dat prop_req Data-aware Memory prop_trigger Property 2 Controller Prefetcher struct_req struct_dat prop_dat 22

Scalable and Energy-efficient Architecture Lab (SEAL) DROPLET Overview struct_trigger Data-aware Structure 1 Private L2 cache • Copy of prefetched structure struct_req Streamer cacheline triggers property prop_dat prop_req struct_req prefetcher in MC. struct_dat Shared Inclusive Coherence • Property prefetcher uses engine LLC information in structure prop_dat cacheline to calculate struct_req prop_req struct_dat property prefetch addresses. prop_req Data-aware Memory prop_trigger Property 2 Controller Prefetcher struct_req struct_dat prop_dat 23

Scalable and Energy-efficient Architecture Lab (SEAL) DROPLET Overview struct_trigger Data-aware Structure 1 Private L2 cache • Property prefetch address is struct_req Streamer used to check the coherence prop_dat prop_req struct_req engine for on-chip presence struct_dat Shared Inclusive Coherence of data engine LLC • If not on-chip, line up prop_dat request in MC struct_req prop_req struct_dat • If on-chip, query LLC for prop_req Data-aware property data Memory prop_trigger Property 2 Controller Prefetcher struct_req struct_dat prop_dat 24

Analysis and Optimization of the Memory Hierarchy for Graph - PowerPoint PPT Presentation

Scalable and Energy-efficient Architecture Lab (SEAL) Analysis and Optimization of the Memory Hierarchy for Graph Processing Workloads Abanti Basak , Shuangchen Li, Xing Hu, Sang Min Oh, Xinfeng Xie, Li Zhao, Xiaowei Jiang, and Yuan Xie

Virtual Memory 1 Memory Hierarchy Memory 4GB Cache 1M Registers 1K Question: What if

Memory Hierarchy Design Memory Hierarchy Design Chapter 5 and Appendix C 1 Overview

What Is Memory Hierarchy A typical memory hierarchy today: Lecture 13: Cache Basics and Cache

Memory Hierarchy Motivation, Definitions, Four Questions about Memory Hierarchy Soner Onder

Abstractions for Practical Systems Caching and the memory hierarchy Operating systems and the

1 5.1 Introduction A Typical Memory Hierarchy A Typical Memory Hierarchy Memory Technology

Memory II. Memory improvement III. Problems with memory 3 systems/stages of Memory: memory

Memory Hierarchy: Caching CSE 141, S2'06 Jeff Brown The memory subsystem Computer Control

EE 457 Unit 7a Cache and Memory Hierarchy 2 Memory Hierarchy & Caching Use several

1 Basic use of caches Levels in the memory hierarchy When fetching an instruction, first

Why memory hierarchy (3 rd Ed: p.468-487, 4 th Ed: p. 452-470) users want unlimited fast

Data Management Systems Storage Management The Memory hierarchy Memory hierarchy

Memory Hierarchy: Cache Memory hierarchy Cache basics Locality Cache organization Cache-aware

Memory Hierarchy Design Issues Memory Hierarchy Design Issues in Many in Many-Core Processors

Hierarchy of School Marketing Needs Leadership Day - February 16, 2018 Maslows Hierarchy of

Extensions of the Caucal Hierarchy? Pawe Parys University of Warsaw LATA 2019 Caucal

Tools for Memory Performance Analysis Goals Expose the hierarchy Show the placement and

Generative Deep Neural Networks for Dialogue Presented By Shantanu Kumar Adapted from slides by

Visualization & Visual Analytics 1 Angus Forbes creativecoding.evl.uic.edu/courses/cs424

Hierarchical Probabilistic Models for Object Segmentation S. M. Ali Eslami Christopher K. I.

Computer-System Architecture Operating System Concepts 2.2 Silberschatz, Galvin and Gagne

What is Computer Architecture? Structure: static arrangement of the parts Organization:

Rigel: An Architecture and Scalable Programming Interface for a 1000-core Accelerator By:

Operating Systems Operating Systems CMPSC 473 CMPSC 473 Exam 1 Review Exam 1 Review February

Analysis and Optimization of the Memory Hierarchy for Graph - PowerPoint PPT Presentation

Scalable and Energy-efficient Architecture Lab (SEAL) Analysis and Optimization of the Memory Hierarchy for Graph Processing Workloads Abanti Basak , Shuangchen Li, Xing Hu, Sang Min Oh, Xinfeng Xie, Li Zhao*, Xiaowei Jiang*, and Yuan Xie

Virtual Memory 1 Memory Hierarchy Memory 4GB Cache 1M Registers 1K Question: What if

Memory Hierarchy Design Memory Hierarchy Design Chapter 5 and Appendix C 1 Overview

What Is Memory Hierarchy A typical memory hierarchy today: Lecture 13: Cache Basics and Cache

Memory Hierarchy Motivation, Definitions, Four Questions about Memory Hierarchy Soner Onder

Abstractions for Practical Systems Caching and the memory hierarchy Operating systems and the

1 5.1 Introduction A Typical Memory Hierarchy A Typical Memory Hierarchy Memory Technology

Memory II. Memory improvement III. Problems with memory 3 systems/stages of Memory: memory

Memory Hierarchy: Caching CSE 141, S2'06 Jeff Brown The memory subsystem Computer Control

EE 457 Unit 7a Cache and Memory Hierarchy 2 Memory Hierarchy &amp; Caching Use several

1 Basic use of caches Levels in the memory hierarchy When fetching an instruction, first

Why memory hierarchy (3 rd Ed: p.468-487, 4 th Ed: p. 452-470) users want unlimited fast

Data Management Systems Storage Management The Memory hierarchy Memory hierarchy

Memory Hierarchy: Cache Memory hierarchy Cache basics Locality Cache organization Cache-aware

Memory Hierarchy Design Issues Memory Hierarchy Design Issues in Many in Many-Core Processors

Hierarchy of School Marketing Needs Leadership Day - February 16, 2018 Maslows Hierarchy of

Extensions of the Caucal Hierarchy? Pawe Parys University of Warsaw LATA 2019 Caucal

Tools for Memory Performance Analysis Goals Expose the hierarchy Show the placement and

Generative Deep Neural Networks for Dialogue Presented By Shantanu Kumar Adapted from slides by

Visualization &amp; Visual Analytics 1 Angus Forbes creativecoding.evl.uic.edu/courses/cs424

Hierarchical Probabilistic Models for Object Segmentation S. M. Ali Eslami Christopher K. I.

Computer-System Architecture Operating System Concepts 2.2 Silberschatz, Galvin and Gagne

What is Computer Architecture? Structure: static arrangement of the parts Organization:

Rigel: An Architecture and Scalable Programming Interface for a 1000-core Accelerator By:

Operating Systems Operating Systems CMPSC 473 CMPSC 473 Exam 1 Review Exam 1 Review February

Scalable and Energy-efficient Architecture Lab (SEAL) Analysis and Optimization of the Memory Hierarchy for Graph Processing Workloads Abanti Basak , Shuangchen Li, Xing Hu, Sang Min Oh, Xinfeng Xie, Li Zhao, Xiaowei Jiang, and Yuan Xie

EE 457 Unit 7a Cache and Memory Hierarchy 2 Memory Hierarchy & Caching Use several

Visualization & Visual Analytics 1 Angus Forbes creativecoding.evl.uic.edu/courses/cs424