Practical Near-Data Processing for In-Memory Analytics Frameworks
Mingyu Gao, Grant Ayers, Christos Kozyrakis Stanford University http://mast.stanford.edu PACT – Oct 19, 2015
Motivating Trends
End of Dennard scaling: systems are energy-limited
Emerging big-data workloads
Need alternatives to improve energy efficiency
MapReduce Graphs Deep Neural Networks
Figs: http://oceanaute.blogspot.com/2015/06/how-to-shuffle-sort-mapreduce.html
Improve performance & energy by avoiding data movement
Processing-In-Memory (1990s-2000s)
Near-Data Processing
Hybrid Memory Cube (HMC) High Bandwidth Memory (HBM)
Figs: www.extremetech.com
[Diagram: a 3D memory stack with DRAM dies on top of a logic die; each vault contains DRAM banks, a channel, and vault logic, connected by an on-die NoC.]
Stacks linked to host multi-core processor
3D memory stack
[Diagram: host processor connected to memory stacks via high-speed serial links.]
NDP for large-scale highly distributed analytics frameworks
Challenge: maintaining general coherence is expensive
Solution: scalable and adaptive software-assisted coherence
Challenge: communication and synchronization through the host processor are inefficient
Solution: pull-based model for direct communication, plus remote atomic operations
Challenge: hardware/software interface
Solution: a lightweight runtime that hides low-level details and simplifies programming
Challenge: processing capability and energy efficiency
Solution: balanced and efficient hardware
Goal: a general, efficient, balanced, practical-to-use NDP architecture
An edge-centric, scatter-gather graph-processing framework
Other analytics frameworks have similar behaviors
edge_scatter(edge_t e):
    src sends update over e
update_gather(update_t u):
    apply u to dst

while not done:
    for e in all edges:
        edge_scatter(e)
    for u in all updates:
        update_gather(u)
Edge-centric SG PageRank
scatter: u = src.rank / src.out_degree
gather:  sum += u
         if all gathered: dst.rank = b * sum + (1 - b)
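The edge-centric scatter-gather loop for PageRank can be sketched in plain Python. This is a minimal single-machine sketch under stated assumptions (the graph, iteration count, and damping factor b are illustrative), not the paper's distributed implementation:

```python
# Minimal sketch of edge-centric, scatter-gather PageRank.
# Edges are streamed sequentially (the "stream in/out" access pattern);
# updates are accumulated per destination, then applied per vertex.

def pagerank_edge_centric(edges, num_vertices, b=0.85, iters=20):
    out_degree = [0] * num_vertices
    for src, dst in edges:
        out_degree[src] += 1

    rank = [1.0 / num_vertices] * num_vertices
    for _ in range(iters):
        sums = [0.0] * num_vertices
        # Scatter: each edge carries src's contribution to dst.
        for src, dst in edges:
            sums[dst] += rank[src] / out_degree[src]
        # Gather: apply accumulated updates (dst.rank = b * sum + (1 - b)).
        rank = [b * s + (1 - b) for s in sums]
    return rank
```

In the NDP setting, the scatter phase generates updates for remote partitions while the gather phase consumes them locally; this sketch only shows the per-iteration dataflow.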
Sequential accesses (stream in/out)
Communication between graph partitions
Synchronization between iterations
Partitioned dataset, local processing
Memory model, communication, coherence, …
Lightweight hardware structures and software runtime
Unified physical address space across stacks
In PageRank
Implementation
[Diagram: each vault pairs NDP cores with a memory controller; a router directs each memory request to the local vault memory or to a remote vault.]
NDP threads access virtual address space
In PageRank
TLB misses are served by the OS on the host
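A unified physical address space across stacks implies that any physical address identifies a stack, a vault, and an offset within that vault. The sketch below illustrates one possible bit-field decoding; the field widths and layout are assumptions for illustration, not the mapping used in the actual system:

```python
# Illustrative decoding of a unified physical address into
# (stack, vault, offset). Field widths are assumptions, chosen only
# to show the idea of routing a request by address bits.

STACK_BITS = 4    # up to 16 stacks
VAULT_BITS = 4    # up to 16 vaults per stack
OFFSET_BITS = 28  # byte offset within a vault

def decode(paddr):
    offset = paddr & ((1 << OFFSET_BITS) - 1)
    vault = (paddr >> OFFSET_BITS) & ((1 << VAULT_BITS) - 1)
    stack = (paddr >> (OFFSET_BITS + VAULT_BITS)) & ((1 << STACK_BITS) - 1)
    return stack, vault, offset
```

A router in the vault logic can compare the decoded (stack, vault) pair with its own identity to decide between local service and forwarding.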
Maintaining general coherence is expensive in NDP systems
Analytics frameworks allow data to be cached in only one cache at a time
Page-level, coarse-grained tracking
[Diagram: Vault 0 and Vault 1, each with NDP cores, caches, and a memory controller; the owner cache is identified by the TLB, the home memory vault by the physical address.]
Scalable
Adaptive
Flush only when owner cache changes
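The single-owner, flush-on-owner-change policy can be modeled as a per-page owner table. This is a simplified software model (the table structure and method names are illustrative, not the paper's interface):

```python
# Simplified model of page-level, single-owner software coherence:
# each page may be cached by at most one cache at a time; when a
# different cache accesses the page, the previous owner is flushed
# (written back and invalidated) before ownership transfers.

class PageOwnerTable:
    def __init__(self):
        self.owner = {}   # page number -> owning cache id
        self.flushes = 0  # flush count, for illustration only

    def access(self, page, cache_id):
        prev = self.owner.get(page)
        if prev is not None and prev != cache_id:
            self.flush(page, prev)
        self.owner[page] = cache_id

    def flush(self, page, cache_id):
        self.flushes += 1  # stand-in for an actual writeback/invalidate
```

Because analytics workloads mostly process their own partition, owner changes (and thus flushes) are rare, which is what makes this coarse-grained scheme cheap.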
[Diagram: dataset partitioned across Vault 0 and Vault 1.]
Pull-based model
[Diagram: cores pull tasks from a process buffer and execute them locally, rather than having tasks pushed to them.]
Pull-based model is efficient and scalable
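The pull-based model can be sketched with per-partition update buffers: producers append updates to the destination partition's buffer, and each consumer drains its own buffer when it is ready, instead of updates being pushed through the host processor. A minimal sketch (buffer structure and function names are assumptions):

```python
from collections import defaultdict, deque

# Minimal sketch of pull-based communication between partitions.
# Producers enqueue updates at the destination; consumers pull on demand.

buffers = defaultdict(deque)

def send_update(dst_partition, update):
    # Producer side: append to the destination partition's buffer.
    buffers[dst_partition].append(update)

def pull_updates(partition):
    # Consumer side: drain this partition's buffer when ready.
    drained = []
    q = buffers[partition]
    while q:
        drained.append(q.popleft())
    return drained
```

Pulling decouples producers from consumers: no core stalls waiting for a remote receiver, and the host processor stays off the critical communication path.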
In PageRank
HW optimization: remote load buffers (RLBs)
Coherence guarantee with RLBs
Remote atomic operations
Higher-level synchronization primitives
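Remote atomics such as fetch-and-add can serve as the building block for higher-level primitives like a barrier between iterations. The sketch below uses Python threading as a stand-in for a hardware remote atomic performed at the memory side; the class and function names are illustrative:

```python
import threading

# Sketch: an atomically updated counter (stand-in for a remote
# fetch-and-add executed near memory) and a simple one-shot barrier
# built on top of it.

class RemoteCounter:
    def __init__(self):
        self._lock = threading.Lock()
        self._value = 0

    def fetch_add(self, delta):
        # Atomic read-modify-write; returns the old value.
        with self._lock:
            old = self._value
            self._value += delta
            return old

def barrier(counter, num_threads, event):
    # The last arriving thread releases everyone.
    if counter.fetch_add(1) == num_threads - 1:
        event.set()
    event.wait()
```

Performing the atomic at the memory side avoids bouncing the counter's cache line between cores, which is the point of supporting such operations in NDP hardware.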
In PageRank
Hide low-level coherence/communication features
Data partitioning and program launch
Hybrid workloads
Three analytics frameworks: MapReduce, Graph, DNN
Infrastructure
Calibrated against public HMC literature
Applications
MapReduce
Graph
Deep Neural Networks
Once the framework is ported, no changes to the user-level apps
2.9x performance and energy improvement
[Charts: normalized performance and energy for SSSP and ALS, comparing vertex-centric and edge-centric implementations.]
Performance scales to 4-8 cores per vault
Final design
[Chart: normalized performance and bandwidth utilization vs. number of cores per vault (2 to 16), for 0.5 GHz and 1.0 GHz cores with 1, 2, or 4 threads each.]
Saturate after 8 cores
Performance Scaling vs. # Stacks
[Chart: normalized speedup of Hist, PageRank, and ConvNet with 1, 2, 4, 8, and 16 stacks.]
Performance scales well up to 16 stacks (256 vaults, 1024 threads) Inter-stack links are not heavily used
Four systems
Conv-3D: 20% improvement for Graph (bandwidth-bound), at higher energy
Base-NDP: 3.5x faster and 3.4x less energy than Conv-DDR3
NDP: up to 16x improvement over Conv-DDR3, 2.5x over Base-NDP
[Charts: normalized execution time and energy for Conv-DDR3, Conv-3D, Base-NDP, and NDP.]
Use both the host processor and NDP cores for processing
NDP portion: similar speedup
Host portion: slight slowdown (interleaving)
[Chart: execution-time breakdown (host time vs. NDP time) for FisherScoring and K-Core.]
Lightweight hardware structures and software runtime
Balanced and efficient hardware Up to 16x improvement over DDR3 baseline
Software optimization
Questions?