

SLIDE 1

Practical Near-Data Processing for In-Memory Analytics Frameworks

Mingyu Gao, Grant Ayers, Christos Kozyrakis
Stanford University, http://mast.stanford.edu
PACT – Oct 19, 2015

SLIDE 2

Motivating Trends

 End of Dennard scaling → systems are energy limited
 Emerging big data workloads

  • Massive datasets, limited temporal locality, irregular access patterns
  • They perform poorly on conventional cache hierarchies

 Need alternatives to improve energy efficiency


[Figures: MapReduce, graphs, deep neural networks. Source: http://oceanaute.blogspot.com/2015/06/how-to-shuffle-sort-mapreduce.html]

SLIDE 3

PIM & NDP

 Improve performance & energy by avoiding data movement
 Processing-In-Memory (1990s – 2000s)

  • Same-die integration is too expensive

 Near-Data Processing

  • Enabled by 3D integration
  • Practical technology solution
  • Processing on the logic die


[Figures: Hybrid Memory Cube (HMC) and High Bandwidth Memory (HBM). Source: www.extremetech.com]

SLIDE 4

Base NDP Hardware

[Diagram: a 3D memory stack with DRAM dies atop a logic die; each vault spans DRAM banks and a vertical channel down to its vault logic, with a NoC connecting vaults on the logic die]

 Stacks linked to host multi-core processor

  • Code with temporal locality: runs on host
  • Code without temporal locality: runs on NDP

 3D memory stack

  • ~10x bandwidth, 3-5x power improvement
  • 8-16 vaults per stack
  • Vertical channel and dedicated vault controller per vault

 NDP cores

  • General-purpose, in-order cores
  • FPU, L1 I/D caches, no L2
  • Multithreaded for latency tolerance


[Diagram: host processor linked to memory stacks over high-speed serial links]

SLIDE 5

Challenges and Contributions

 NDP for large-scale highly distributed analytics frameworks

? Maintaining general coherence is expensive

Scalable and adaptive software-assisted coherence

? Inefficient communication and synchronization through the host processor

Pull-based model for direct communication, plus remote atomic operations

? Hardware/software interface

A lightweight runtime that hides low-level details and simplifies programming

? Processing capability and energy efficiency

Balanced and efficient hardware

 A general, efficient, balanced, practical-to-use NDP architecture


SLIDE 6

Example App: PageRank

 Edge-centric, scatter-gather graph processing framework
 Other analytics frameworks have similar behaviors


Edge-centric SG PageRank:

    edge_scatter(edge_t e):
        src sends update u over e      // u = src.rank / src.out_degree
    update_gather(update_t u):
        apply u to dst                 // sum += u
                                       // if all gathered: dst.rank = b * sum + (1 - b)
    while not done:
        for e in all edges:   edge_scatter(e)
        for u in all updates: update_gather(u)

Key behaviors: sequential accesses (stream in/out), communication between graph partitions, synchronization between iterations, partitioned dataset with local processing.
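The pseudocode above maps directly onto real code. Below is a minimal single-threaded C++ sketch of one iteration; the data layout, the damping factor b = 0.85, and the assumption that every scattering vertex has out_degree > 0 are illustrative choices, not taken from the slides.

    #include <cstddef>
    #include <vector>

    struct Vertex { double rank = 1.0; int out_degree = 0; };
    struct Edge   { int src, dst; };
    struct Update { int dst; double contrib; };

    // One edge-centric scatter-gather iteration (assumes out_degree > 0).
    void pagerank_iteration(std::vector<Vertex>& v,
                            const std::vector<Edge>& edges, double b = 0.85) {
        std::vector<Update> updates;            // streamed out sequentially
        for (const Edge& e : edges)             // edge_scatter
            updates.push_back({e.dst, v[e.src].rank / v[e.src].out_degree});

        std::vector<double> sum(v.size(), 0.0);
        for (const Update& u : updates)         // update_gather
            sum[u.dst] += u.contrib;
        for (std::size_t i = 0; i < v.size(); ++i)
            v[i].rank = b * sum[i] + (1 - b);   // the slide's gather step
    }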

SLIDE 7

Architecture Design

Memory model, communication, coherence, …
Lightweight hardware structures and software runtime

SLIDE 8

Shared Memory Model

 Unified physical address space across stacks

  • Direct access from any NDP/host core to memory in any vault/stack

 In PageRank

  • One thread to access data in a remote graph partition
  • For edges across two partitions

 Implementation

  • Memory ctrl forwards local/remote accesses
  • Shared router in each vault


[Diagram: NDP cores issue memory requests to the vault's memory controller, which serves local requests from vault memory and forwards remote ones through a shared router]
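As an illustration of the forwarding decision, the sketch below classifies a request as local or remote from its physical address. The bit layout (4-bit stack and vault fields) is an assumption for a 16-stack, 16-vault-per-stack system, not the actual HMC mapping.

    #include <cstdint>

    // Assumed physical address layout: | stack (4b) | vault (4b) | offset |
    struct VaultRouter {
        uint32_t my_stack, my_vault;

        bool is_local(uint64_t paddr) const {
            uint32_t vault = (paddr >> 36) & 0xF;   // assumed vault field
            uint32_t stack = (paddr >> 40) & 0xF;   // assumed stack field
            return stack == my_stack && vault == my_vault;
        }

        // The memory controller serves local requests from vault DRAM and
        // hands remote ones to the shared per-vault router.
        void route(uint64_t paddr) {
            if (is_local(paddr)) { /* access local vault memory */ }
            else                 { /* forward through router to remote vault */ }
        }
    };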

SLIDE 9

Virtual Memory Support

 NDP threads access virtual address space

  • Small TLB per core (32 entries)
  • Large pages to minimize TLB misses (2 MB)
  • 32 entries × 2 MB give 64 MB of reach, sufficient to cover local memory & remote buffers

 In PageRank

  • Each core works on local data, much smaller than the entire dataset
  • 0.25% miss rate for PageRank

 TLB misses served by OS in host

  • Similar to IOMMU misses in conventional systems


SLIDE 10

Software-Assisted Coherence

 Maintaining general coherence is expensive in NDP systems

  • Highly distributed, multiple stacks

 Analytics frameworks

  • Little data sharing except for communication
  • Data partitioning is coarse-grained

 Only allow data to be cached in one cache

  • Owner cache
  • No need to check other caches

 Page-level, coarse-grained ownership

  • Owner cache configurable through PTE


[Diagram: two vaults, each with NDP cores, L1 caches, a memory controller, and vault memory. The owner cache is identified by the TLB; the memory vault is identified by the physical address]
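A minimal sketch of the ownership check, assuming a hypothetical PTE encoding with an explicit owner-cache ID field (names and widths are illustrative):

    #include <cstdint>

    // Hypothetical PTE: translation plus a software-set owner-cache ID.
    struct PTE {
        uint64_t phys_page;     // identifies the home memory vault
        uint16_t owner_cache;   // the one cache allowed to hold this page
    };

    // The TLB yields the owner cache directly on each access; since at
    // most one cache may hold a line, no directory lookup is needed.
    bool may_cache(const PTE& pte, uint16_t this_cache) {
        return pte.owner_cache == this_cache;
    }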

SLIDE 11

Software-Assisted Coherence

 Scalable

  • Avoids directory lookup and storage

 Adaptive

  • Data may overflow to other vaults
  • Able to cache data from any vault in local cache

 Flush only when owner cache changes

  • Rarely happens, as dataset partitioning is fixed


[Diagram: the same two-vault system, with the dataset spanning both vault memories]

SLIDE 12

Communication

 Pull-based model

  • Producer buffers intermediate/result data locally and separately
  • Post small message (address, size) to consumer
  • Consumer pulls data when it needs it, using plain load instructions

[Diagram: producer cores process tasks into local buffers; consumer cores pull the buffered data for their own tasks]
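A sketch of the pull-based protocol under assumed types: the producer buffers results locally and posts only a small (address, size) descriptor; the consumer later streams the data in with ordinary loads. MsgQueue and Descriptor are hypothetical names, and the queue is single-producer/single-consumer with no overflow check, for brevity.

    #include <atomic>
    #include <cstddef>

    // Small message the producer posts: where the data lives, and how much.
    struct Descriptor { const char* addr; std::size_t bytes; };

    // Hypothetical descriptor queue placed in the consumer's vault.
    struct MsgQueue {
        static constexpr std::size_t kSlots = 1024;
        Descriptor slots[kSlots];
        std::atomic<std::size_t> head{0}, tail{0};

        void post(const Descriptor& d) {   // producer side: cheap message
            slots[tail.load(std::memory_order_relaxed) % kSlots] = d;
            tail.fetch_add(1, std::memory_order_release);
        }
        bool pull(Descriptor& d) {         // consumer side: pull when ready
            std::size_t h = head.load(std::memory_order_relaxed);
            if (h == tail.load(std::memory_order_acquire)) return false;
            d = slots[h % kSlots];
            head.store(h + 1, std::memory_order_relaxed);
            return true;
        }
    };

    // The consumer then reads d.addr[0 .. d.bytes) with plain sequential
    // loads, directly from the producer's local buffer.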

SLIDE 13

Communication

 Pull-based model is efficient and scalable

  • Sequential accesses to data
  • Asynchronous and highly parallel
  • Avoids the overheads of extra copies
  • Eliminates host processor bottleneck

 In PageRank

  • Used to communicate the update lists across partitions


SLIDE 14

Communication

 HW optimization: remote load buffers (RLBs; sketched after this list)

  • A small buffer per NDP core (a few cachelines)
  • Prefetch and cache remote (sequential) load accesses
  • Remote data are not cacheable in the local cache
  • Avoids changing the owner cache, which would force a cache flush

 Coherence guarantee with RLBs

  • Remote stores bypass RLB
  • All writes go to the owner cache
  • Owner cache always has the most up-to-date data
  • Flush RLBs at synchronization point
  • … at which time new data are guaranteed to be visible to others
  • Cheap, as each iteration is long and the RLB is small
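A behavioral C++ model of an RLB (sizes and structure are illustrative assumptions): remote sequential loads are cached here instead of in the local L1, remote stores bypass it, and the whole buffer is invalidated at every synchronization point.

    #include <cstdint>
    #include <cstring>

    constexpr int kLines = 4;        // "a few cachelines" per NDP core
    constexpr int kLineBytes = 64;

    struct RLB {
        uint64_t tags[kLines] = {};
        bool     valid[kLines] = {};
        char     data[kLines][kLineBytes] = {};

        // Serve a remote load; keeping it out of the local cache means the
        // page's owner cache never changes (a change would force a flush).
        const char* load_line(uint64_t paddr) {
            uint64_t line = paddr / kLineBytes;
            int slot = static_cast<int>(line % kLines);
            if (!valid[slot] || tags[slot] != line) {
                fetch_from_owner(line, data[slot]);  // owner cache always has
                tags[slot] = line;                   // the up-to-date data
                valid[slot] = true;
            }
            return data[slot];
        }

        // Remote stores bypass the RLB and go straight to the owner cache.

        void flush() {               // at each synchronization point
            std::memset(valid, 0, sizeof valid);
        }

        // Stub standing in for the hardware's remote read.
        static void fetch_from_owner(uint64_t, char* dst) {
            std::memset(dst, 0, kLineBytes);
        }
    };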


SLIDE 15

Synchronization

 Remote atomic operations

  • Fetch-and-add, compare-and-swap, etc.
  • HW support at memory controllers [Ahn et al. HPCA’05]

 Higher-level synchronization primitives

  • Built from remote atomic operations
  • E.g., hierarchical, tree-style barrier implementation (sketched below)
  • Core → vault → stack → global

 In PageRank

  • Used to build the barrier between iterations
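One level of such a barrier can be built from fetch-and-add, as in the sketch below. The sense-reversing structure is a standard software technique used here for illustration, not taken from the slides; in the NDP system, count.fetch_add would map onto a remote atomic served at the memory controller of the vault holding the barrier word.

    #include <atomic>

    struct Barrier {
        std::atomic<int> count{0};
        std::atomic<int> sense{0};
        const int n;                         // participants at this level
        explicit Barrier(int participants) : n(participants) {}

        void wait() {
            int my_sense = 1 - sense.load();
            if (count.fetch_add(1) == n - 1) {   // last arrival at this level
                count.store(0);                  // (would signal the parent level
                sense.store(my_sense);           //  in the core→vault→stack tree)
            } else {
                while (sense.load() != my_sense) { /* spin */ }
            }
        }
    };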


SLIDE 16

Software Runtime

 Hide low-level coherence/communication features

  • Exposes a simple API (see the sketch at the end of this slide)

 Data partitioning and program launch

  • Optionally specify running core and owner cache close to dataset
  • Hints need not be perfect; correctness is guaranteed by remote accesses

 Hybrid workloads

  • Programmers coarsely divide work between host and NDP
  • Based on temporal locality and parallelism
  • Guarantee no concurrent accesses from host and NDP cores
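The slides do not spell out the API; the sketch below is a hypothetical flavor of what the launch interface could look like, with optional placement hints (all names and signatures are illustrative stubs).

    #include <cstddef>
    #include <functional>

    struct PlacementHint {
        int stack = -1, vault = -1, core = -1;   // -1: runtime chooses
    };

    // Launch a task near its data partition. Hints need not be perfect:
    // remote accesses keep execution correct even if placement is off.
    inline void ndp_launch(std::function<void()> task, PlacementHint hint = {}) {
        (void)hint;   // a real runtime would pin the task per the hint
        task();       // stub: run inline
    }

    // Set the owner cache for a page range (backed by PTE updates in a
    // real system; a no-op placeholder here).
    inline void ndp_set_owner(void* base, std::size_t bytes, int owner_core) {
        (void)base; (void)bytes; (void)owner_core;
    }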


SLIDE 17

Evaluation

Three analytics frameworks: MapReduce, Graph, DNN

SLIDE 18

Methodology

 Infrastructure

  • zsim
  • McPAT + CACTI + Micron’s DRAM power calculator

 Calibrated against public HMC literature

 Applications

  • MapReduce: Hist, LinReg, grep
  • Graph: PageRank, SSSP, ALS
  • DNN: ConvNet, MLP, dA
SLIDE 19

Porting Frameworks

 MapReduce

  • In map phase, input data streamed in
  • Shuffle phase handled by pull-based communication

 Graph

  • Edge-centric
  • Pull remote update lists when gathering

 Deep Neural Networks

  • Convolution/pooling layers are handled similarly to Graph
  • Fully-connected layers use a local combiner before communication (see the sketch after this list)

 Once the framework is ported, no changes to the user-level apps
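A sketch of the local-combiner idea for fully-connected layers, under assumed data types: partial activation vectors produced within a vault are reduced locally, so only one combined vector per destination crosses the network instead of one per task.

    #include <cstddef>
    #include <vector>

    // Reduce in-vault partial outputs for one fully-connected layer.
    // Shrinks communicated data from (tasks x N) values to N values.
    std::vector<float> local_combine(
            const std::vector<std::vector<float>>& partials) {
        std::vector<float> combined(partials.front().size(), 0.0f);
        for (const auto& p : partials)
            for (std::size_t i = 0; i < p.size(); ++i)
                combined[i] += p[i];    // partial sums reduce associatively
        return combined;                // then posted via the pull-based model
    }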


SLIDE 20

Graph: Edge- vs. Vertex-Centric

 2.9x performance and energy improvement

  • The edge-centric version optimizes for spatial locality
  • Higher utilization for cachelines and DRAM rows

[Charts: normalized performance and energy for SSSP and ALS, vertex-centric vs. edge-centric]

SLIDE 21

Balance: PageRank

 Performance scales to 4-8 cores per vault

  • Bandwidth saturates beyond that

 Final design

  • 4 cores per vault
  • 1.0 GHz
  • 2-threaded
  • Area constrained

[Charts: normalized performance and bandwidth utilization vs. cores per vault (2-16), for 0.5/1.0 GHz cores with 1/2/4 threads; bandwidth saturates after 8 cores]

SLIDE 22

Scalability

[Chart: performance scaling vs. number of stacks (1, 2, 4, 8, 16) for Hist, PageRank, ConvNet]

 Performance scales well up to 16 stacks (256 vaults, 1024 threads)
 Inter-stack links are not heavily used

SLIDE 23

Final Comparison

 Four systems

  • Conv-DDR3: host processor + 4 DDR3 channels
  • Conv-3D: host processor + 8 HMC stacks
  • Base-NDP: host processor + 8 HMC stacks with NDP cores; communication coordinated by the host
  • NDP: similar to Base-NDP, but with our coherence and communication schemes


SLIDE 24

Final Comparison

 Conv-3D: 20% improvement for Graph (bandwidth-bound), but higher energy
 Base-NDP: 3.5x faster and 3.4x less energy than Conv-DDR3
 NDP: up to 16x improvement over Conv-DDR3, 2.5x over Base-NDP

[Charts: normalized execution time and energy for Conv-DDR3, Conv-3D, Base-NDP, and NDP]

SLIDE 25

Hybrid Workloads

 Use both host processor and NDP cores for processing
 NDP portion: similar speedup
 Host portion: slight slowdown

  • Due to coarse-grained address interleaving

[Chart: execution time breakdown, host time vs. NDP time, for FisherScoring and K-Core]

SLIDE 26

Conclusion

 Lightweight hardware structures and software runtime

  • Hides hardware details
  • Scalable and adaptive software-assisted coherence model
  • Efficient communication and synchronization

 Balanced and efficient hardware
 Up to 16x improvement over the DDR3 baseline

  • 2.5x improvement over previous NDP systems

 Software optimization

  • 3x improvement from spatial locality


SLIDE 27

Thanks!

Questions?