Practical Near-Data Processing for In-Memory Analytics Frameworks - PowerPoint PPT Presentation


  1. Practical Near-Data Processing for In-Memory Analytics Frameworks Mingyu Gao, Grant Ayers, Christos Kozyrakis Stanford University http://mast.stanford.edu PACT – Oct 19, 2015

  2. Motivating Trends
     - End of Dennard scaling: systems are energy limited
     - Emerging big data workloads (MapReduce, graph processing, deep neural networks)
       o Massive datasets, limited temporal locality, irregular access patterns
       o They perform poorly on conventional cache hierarchies
     - Need alternatives to improve energy efficiency
     [Figures: MapReduce shuffle, graph, and deep neural network examples; fig. credit: http://oceanaute.blogspot.com/2015/06/how-to-shuffle-sort-mapreduce.html]

  3. PIM & NDP
     - Improve performance & energy by avoiding data movement
     - Processing-In-Memory (1990s to 2000s)
       o Same-die integration is too expensive
     - Near-Data Processing
       o Enabled by 3D integration
       o Practical technology solution
       o Processing on the logic die
     [Figures: Hybrid Memory Cube (HMC) and High Bandwidth Memory (HBM); fig. credit: www.extremetech.com]

  4. Base NDP Hardware
     - Memory stacks linked to the host multi-core processor by high-speed serial links
       o Code with temporal locality: runs on the host
       o Code without temporal locality: runs on the NDP cores
     - 3D memory stack
       o ~10x bandwidth, 3-5x power improvement
       o 8-16 vaults per stack, each with a vertical channel and a dedicated vault controller
     - NDP cores on the logic die
       o General-purpose, in-order cores
       o FPU and L1 I/D caches, no L2
       o Multithreaded for latency tolerance
     [Figure: host processor linked to a memory stack; each vault spans DRAM banks on the DRAM dies plus vault logic and a NoC on the logic die]

  5. Challenges and Contributions
     NDP for large-scale, highly distributed analytics frameworks:
     - Challenge: maintaining general coherence is expensive
       Contribution: scalable, adaptive, software-assisted coherence
     - Challenge: inefficient communication and synchronization through the host processor
       Contribution: pull-based model for direct communication, plus remote atomic operations
     - Challenge: hardware/software interface
       Contribution: a lightweight runtime that hides low-level details and makes programming easier
     - Challenge: processing capability and energy efficiency
       Contribution: balanced and efficient hardware
     - Overall: a general, efficient, balanced, practical-to-use NDP architecture

  6. Example App: PageRank
     - Edge-centric, scatter-gather graph processing framework
     - Other analytics frameworks have similar behaviors
     Edge-centric scatter-gather PageRank (pseudocode from the slide):
       edge_scatter(edge_t e):
         u = src.rank / src.out_degree
         src sends update u over e
       update_gather(update_t u):
         sum += u
         if all updates gathered, apply to dst:
           dst.rank = b * sum + (1 - b)
       while not done:
         for e in all edges: edge_scatter(e)
         for u in all updates: update_gather(u)
     Properties called out on the slide:
     - Sequential accesses (stream edges/updates in and out)
     - Partitioned dataset, local processing
     - Communication between graph partitions
     - Synchronization between iterations
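
     To make the scatter-gather structure concrete, here is a minimal single-partition C++ sketch of one edge-centric PageRank iteration. The Edge/Update types, the damping-factor name beta, and the local updates buffer are illustrative assumptions for this sketch, not the paper's framework code.

       #include <cstddef>
       #include <cstdint>
       #include <vector>

       // Illustrative types; not the paper's framework API.
       struct Edge   { uint32_t src, dst; };
       struct Update { uint32_t dst; double contribution; };

       struct Graph {
           std::vector<Edge>     edges;       // streamed sequentially (scatter)
           std::vector<double>   rank;        // per-vertex rank
           std::vector<uint32_t> out_degree;  // per-vertex out-degree
       };

       // One iteration of edge-centric PageRank over a single partition.
       void pagerank_iteration(Graph& g, double beta = 0.85) {
           std::vector<Update> updates;  // buffered locally; pulled by consumers in the NDP design
           std::vector<double> sum(g.rank.size(), 0.0);

           // Scatter: stream over edges, emit one update per edge.
           for (const Edge& e : g.edges)
               updates.push_back({e.dst, g.rank[e.src] / g.out_degree[e.src]});

           // Gather: stream over updates, accumulate per destination.
           for (const Update& u : updates)
               sum[u.dst] += u.contribution;

           // Apply: new rank once all updates for a vertex are gathered.
           for (std::size_t v = 0; v < g.rank.size(); ++v)
               g.rank[v] = beta * sum[v] + (1.0 - beta);
       }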

  7. Architecture Design
     Memory model, communication, coherence, ...
     Lightweight hardware structures and software runtime

  8. Shared Memory Model
     - Unified physical address space across stacks
       o Direct access from any NDP/host core to memory in any vault/stack
     - In PageRank
       o A thread can access data in a remote graph partition (for edges that span two partitions)
     - Implementation
       o The memory controller forwards each request to local or remote memory
       o Shared router in each vault
     [Figure: NDP cores issue memory requests to the vault's memory controller, which serves local vault memory or forwards remote requests through the vault router]
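
     A minimal sketch of the local/remote forwarding decision described above, assuming a simple interleaving where high physical-address bits select the stack and vault. The field widths and the access_local_dram/route_to_remote_vault helpers are hypothetical placeholders, not the paper's actual address map.

       #include <cstdint>
       #include <cstdio>

       // Hypothetical physical-address layout: | stack | vault | offset |.
       constexpr uint64_t kOffsetBits = 32;  // per-vault capacity (assumption)
       constexpr uint64_t kVaultBits  = 4;   // 16 vaults per stack (assumption)
       constexpr uint64_t kStackBits  = 3;   // 8 stacks (assumption)

       struct VaultId { uint64_t stack, vault; };

       // Placeholder back-ends for the two paths.
       void access_local_dram(uint64_t paddr) {
           std::printf("local  %#llx\n", (unsigned long long)paddr);
       }
       void route_to_remote_vault(uint64_t paddr, VaultId home) {
           std::printf("remote %#llx -> stack %llu vault %llu\n", (unsigned long long)paddr,
                       (unsigned long long)home.stack, (unsigned long long)home.vault);
       }

       VaultId home_vault(uint64_t paddr) {
           uint64_t vault = (paddr >> kOffsetBits) & ((1ull << kVaultBits) - 1);
           uint64_t stack = (paddr >> (kOffsetBits + kVaultBits)) & ((1ull << kStackBits) - 1);
           return {stack, vault};
       }

       // Memory-controller decision: serve from local vault memory or forward via the router.
       void handle_request(uint64_t paddr, VaultId self) {
           VaultId home = home_vault(paddr);
           if (home.stack == self.stack && home.vault == self.vault)
               access_local_dram(paddr);
           else
               route_to_remote_vault(paddr, home);
       }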

  9. Virtual Memory Support  NDP threads access virtual address space o Small TLB per core (32 entries) o Large pages to minimize TLB misses (2 MB) o Sufficient to cover local memory & remote buffers  In PageRank o Each core works on local data, much smaller than the entire dataset o 0.25% miss rate for PageRank  TLB misses served by OS in host o Similar to IOMMU misses in conventional systems 9
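
     A back-of-the-envelope check of the claim above (the combination is an inference, not a number from the paper): 32 TLB entries with 2 MB pages give each NDP core a TLB reach of 32 x 2 MB = 64 MB, large enough that a per-core partition of local data plus a few remote buffers can typically be covered without frequent misses.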

  10. Software-Assisted Coherence
     - Maintaining general coherence is expensive in NDP systems
       o Highly distributed, multiple stacks
     - Analytics frameworks
       o Little data sharing except for communication
       o Data partitioning is coarse-grained
     - Only allow each piece of data to be cached in one cache (its owner cache)
       o Memory vault identified by physical address
       o Owner cache identified by TLB
       o No need to check other caches
     - Page-level, coarse-grained
       o Owner cache configurable through the PTE
     [Figure: two vaults (Vault 0, Vault 1), each with vault memory, a memory controller, L1 caches, and NDP cores]
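
     A minimal sketch of the lookup path this slide implies, assuming the TLB entry carries an owner-cache ID alongside the translation. The TlbEntry fields and cache IDs are hypothetical, not the paper's exact encoding.

       #include <cstdint>

       // Hypothetical TLB entry: translation plus software-configured owner cache (set via the PTE).
       struct TlbEntry {
           uint64_t vpage, ppage;
           uint16_t owner_cache_id;  // the only L1 allowed to cache this page
       };

       enum class Path { CacheLocally, BypassToOwnerVault };

       // Decide how an access from cache `my_cache_id` treats a page: only the owner
       // cache may keep a copy, so no other cache ever needs to be checked.
       Path access_path(const TlbEntry& e, uint16_t my_cache_id) {
           return (e.owner_cache_id == my_cache_id) ? Path::CacheLocally
                                                    : Path::BypassToOwnerVault;
       }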

  11. Software-Assisted Coherence
     - Scalable
       o Avoids directory lookups and directory storage
     - Adaptive
       o A dataset may overflow into another vault
       o Data from any vault can still be cached in the local (owner) cache
     - Flush only when the owner cache changes
       o Rarely happens, as dataset partitioning is fixed
     [Figure: a dataset spanning Vault 0 and Vault 1 memories, with caches and NDP cores in each vault]
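
     A tiny sketch of the "flush only when the owner changes" rule, assuming hypothetical flush_cache and update_pte_owner helpers; it only exists to show where the single flush happens.

       #include <cstdint>
       #include <cstdio>

       // Hypothetical helpers standing in for runtime/OS services.
       void flush_cache(uint16_t cache_id) {
           std::printf("flush owner cache %u\n", (unsigned)cache_id);
       }
       void update_pte_owner(uint64_t page, uint16_t new_owner) {
           std::printf("page %#llx -> owner %u\n", (unsigned long long)page, (unsigned)new_owner);
       }

       // Reassigning a page's owner cache is the only event that requires a flush,
       // and it is rare because the dataset partitioning is fixed.
       void change_owner(uint64_t page, uint16_t old_owner, uint16_t new_owner) {
           if (old_owner == new_owner) return;   // common case: nothing to do
           flush_cache(old_owner);               // write back / invalidate the old owner's copies
           update_pte_owner(page, new_owner);    // from now on only the new owner may cache the page
       }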

  12. Communication
     - Pull-based model
       o The producer buffers intermediate/result data locally and separately
       o It posts a small message (address, size) to the consumer
       o The consumer pulls the data when it needs it, using ordinary load instructions
     [Figure: producer cores process tasks into local buffers; consumer cores later pull the buffered data]

  13. Communication
     - The pull-based model is efficient and scalable
       o Sequential accesses to data
       o Asynchronous and highly parallel
       o Avoids the overhead of extra copies
       o Eliminates the host-processor bottleneck
     - In PageRank
       o Used to communicate the update lists across partitions
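
     A software-level sketch of the pull-based handoff under a shared address space. The Message mailbox, buffer names, and use of std::atomic/std::thread are stand-ins for this sketch; real NDP hardware would post the (address, size) descriptor through the vault network rather than OS threads.

       #include <atomic>
       #include <cstdio>
       #include <thread>
       #include <vector>

       // Hypothetical descriptor the producer posts: where its buffered data live and how much.
       struct Message { const double* addr; size_t size; };

       std::vector<double>   update_buffer;      // producer-local buffer (owned by the producer's vault)
       std::atomic<Message*> mailbox{nullptr};   // small posted message, not the data itself

       void producer() {
           for (int i = 0; i < 1024; ++i)        // buffer results locally and sequentially
               update_buffer.push_back(i * 0.5);
           static Message m{update_buffer.data(), update_buffer.size()};
           mailbox.store(&m, std::memory_order_release);  // post (address, size) to the consumer
       }

       void consumer() {
           Message* m;
           while ((m = mailbox.load(std::memory_order_acquire)) == nullptr)
               std::this_thread::yield();        // wait for the posted message
           double sum = 0.0;
           for (size_t i = 0; i < m->size; ++i)  // pull the data with plain (remote) loads
               sum += m->addr[i];
           std::printf("pulled %zu updates, sum = %f\n", m->size, sum);
       }

       int main() {
           std::thread p(producer), c(consumer);
           p.join(); c.join();
       }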

  14. Communication
     - HW optimization: remote load buffers (RLBs)
       o A small buffer per NDP core (a few cachelines)
       o Prefetches and caches remote (sequential) load accesses
         • Remote data are not cacheable in the local cache
         • We do not want the owner cache to change, since that forces a cache flush
     - Coherence guarantees with RLBs
       o Remote stores bypass the RLB
         • All writes go to the owner cache
         • The owner cache always has the most up-to-date data
       o RLBs are flushed at synchronization points
         • ... at which time new data are guaranteed to be visible to others
         • Cheap, as each iteration is long and the RLB is small
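
     A behavioral sketch of an RLB as it might look in a simulator, simplified to a single cacheline and without the prefetch-ahead; the buffer size, line size, and the remote_read_block callback are illustrative assumptions, not the paper's exact design.

       #include <array>
       #include <cstdint>
       #include <functional>

       constexpr size_t kLineBytes = 64;  // cacheline size (assumption)

       // Single-entry remote load buffer: holds the last remote line touched by sequential loads.
       class RemoteLoadBuffer {
       public:
           explicit RemoteLoadBuffer(std::function<void(uint64_t, uint8_t*)> remote_read_block)
               : fetch_(std::move(remote_read_block)) {}

           // Remote loads are served from the buffer when they hit; misses refill one line.
           uint8_t load_byte(uint64_t addr) {
               uint64_t line = addr & ~(uint64_t)(kLineBytes - 1);
               if (!valid_ || line != tag_) {
                   fetch_(line, data_.data());  // pull the whole line from the remote vault
                   tag_ = line;
                   valid_ = true;
               }
               return data_[addr - line];
           }

           // Remote stores bypass the RLB entirely (they go to the owner cache), so there is no write path.

           // Flushed at synchronization points so newly written remote data become visible.
           void flush() { valid_ = false; }

       private:
           std::function<void(uint64_t, uint8_t*)> fetch_;
           std::array<uint8_t, kLineBytes> data_{};
           uint64_t tag_ = 0;
           bool valid_ = false;
       };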

  15. Synchronization
     - Remote atomic operations
       o Fetch-and-add, compare-and-swap, etc.
       o HW support at the memory controllers [Ahn et al. HPCA'05]
     - Higher-level synchronization primitives
       o Built from remote atomic operations
       o E.g., hierarchical, tree-style barrier implementation
         • Core -> vault -> stack -> global
     - In PageRank
       o Barrier between iterations
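
     A sketch of a hierarchical sense-reversing barrier built only from fetch-and-add, the primitive the slide assumes the memory controllers provide. In this software form the fetch-and-add maps to std::atomic::fetch_add, and the tree is simplified to two tiers (vault, then global) rather than the slide's core -> vault -> stack -> global hierarchy.

       #include <atomic>
       #include <memory>
       #include <thread>
       #include <vector>

       // One tier of the tree: `expected` participants synchronize via fetch-and-add.
       class FlatBarrier {
       public:
           explicit FlatBarrier(int expected) : expected_(expected) {}

           void wait() {
               int my_sense = sense_.load(std::memory_order_acquire);
               // The fetch-and-add would be a remote atomic at the memory controller in the NDP design.
               if (count_.fetch_add(1, std::memory_order_acq_rel) + 1 == expected_) {
                   count_.store(0, std::memory_order_relaxed);
                   sense_.store(1 - my_sense, std::memory_order_release);  // last arriver releases everyone
               } else {
                   while (sense_.load(std::memory_order_acquire) == my_sense)
                       std::this_thread::yield();
               }
           }

       private:
           const int expected_;
           std::atomic<int> count_{0};
           std::atomic<int> sense_{0};
       };

       // Hierarchical barrier: cores meet at their vault barrier first; one
       // representative per vault then meets at the global barrier.
       struct HierarchicalBarrier {
           std::vector<std::unique_ptr<FlatBarrier>> vault_levels;  // one barrier per vault
           FlatBarrier global_level;

           HierarchicalBarrier(int cores_per_vault, int vaults) : global_level(vaults) {
               for (int v = 0; v < vaults; ++v)
                   vault_levels.push_back(std::make_unique<FlatBarrier>(cores_per_vault));
           }

           void wait(int vault_id, bool is_vault_representative) {
               vault_levels[vault_id]->wait();                    // cores meet within their vault
               if (is_vault_representative) global_level.wait();  // one core per vault meets globally
               vault_levels[vault_id]->wait();                    // fan-out: everyone leaves together
           }
       };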

  16. Software Runtime
     - Hides the low-level coherence/communication features
       o Exposes a simple set of APIs
     - Data partitioning and program launch
       o Programmers can optionally place a task's cores and owner cache close to its dataset
       o The placement need not be perfect; correctness is guaranteed by remote access
     - Hybrid workloads
       o Work is coarsely divided between host and NDP by the programmer
         • Based on temporal locality and parallelism
       o The runtime guarantees no concurrent accesses from host and NDP cores
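
     To illustrate the kind of interface such a runtime might expose, here is a hypothetical C++ API sketch. The names ndp::alloc_partition, ndp::launch, and ndp::barrier are inventions for this example; the paper does not publish its exact API.

       #include <cstddef>
       #include <functional>

       // Hypothetical runtime interface; names and signatures are illustrative only.
       namespace ndp {

       struct Placement {
           int stack = -1;  // preferred stack for the data (-1 = let the runtime choose)
           int vault = -1;  // preferred vault; also selects the owner cache via the PTE
       };

       // Allocate a dataset partition in a chosen vault and set its owner cache.
       void* alloc_partition(std::size_t bytes, Placement hint);

       // Launch one worker per NDP core of the chosen vault; placement hints need not
       // be perfect, since remote accesses keep the program correct either way.
       void launch(Placement where, std::function<void(int core_id)> body);

       // Global barrier built from remote atomic operations.
       void barrier();

       }  // namespace ndp

       // Example use for one PageRank partition (sketch only, not runnable against a real runtime):
       //   void* edges = ndp::alloc_partition(bytes, ndp::Placement{0, 3});
       //   ndp::launch(ndp::Placement{0, 3}, [&](int core) { scatter_edges(core); });
       //   ndp::barrier();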

  17. Evaluation
     Three analytics frameworks: MapReduce, Graph, DNN

  18. Methodology
     - Infrastructure
       o zsim
       o McPAT + CACTI + Micron's DRAM power calculator
     - Calibrated against public HMC literature
     - Applications
       o MapReduce: Hist, LinReg, Grep
       o Graph: PageRank, SSSP, ALS
       o DNN: ConvNet, MLP, dA

  19. Porting Frameworks
     - MapReduce
       o In the map phase, input data are streamed in
       o The shuffle phase is handled by pull-based communication
     - Graph
       o Edge-centric
       o Remote update lists are pulled during the gather phase
     - Deep Neural Networks
       o Convolution/pooling layers are handled similarly to Graph
       o Fully-connected layers use a local combiner before communication
     - Once a framework is ported, no changes are needed to the user-level apps
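
     As an illustration of the "local combiner" idea for fully-connected layers, here is a sketch that sums partial activations locally before any cross-partition communication. The partitioning of input neurons across vaults and the function name are assumptions for this example, not the paper's DNN framework code.

       #include <cstddef>
       #include <vector>

       // Each vault holds a slice of the input neurons and the matching weight columns.
       // For a fully-connected layer y = W x, a vault can compute a complete partial sum
       // for every output neuron from its local slice, then communicate one combined
       // vector per vault instead of per-element updates.
       std::vector<float> local_combine(
               const std::vector<float>& x_local,               // local slice of input activations
               const std::vector<std::vector<float>>& w_local,  // w_local[j][i]: weight of local input i for output j
               std::size_t num_outputs) {
           std::vector<float> partial(num_outputs, 0.0f);
           for (std::size_t j = 0; j < num_outputs; ++j)
               for (std::size_t i = 0; i < x_local.size(); ++i)
                   partial[j] += w_local[j][i] * x_local[i];
           return partial;  // pulled by the consumer and reduced across vaults
       }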

  20. Graph: Edge- vs. Vertex-Centric
     [Charts: normalized performance and normalized energy for SSSP and ALS, vertex-centric vs. edge-centric]
     - 2.9x performance and energy improvement
       o The edge-centric version optimizes for spatial locality
       o Higher utilization of cachelines and DRAM rows

  21. Balance: PageRank
     - Performance scales up to 4-8 cores per vault
       o Bandwidth saturates after 8 cores
     - Final design
       o 4 cores per vault
       o 1.0 GHz, 2-threaded
       o Area constrained
     [Charts: normalized performance and bandwidth utilization vs. number of cores per vault (0-16), for 0.5 GHz and 1.0 GHz cores with 1, 2, or 4 threads]

  22. Scalability
     [Chart: performance scaling vs. number of stacks; normalized speedup for Hist, PageRank, and ConvNet with 1, 2, 4, 8, and 16 stacks]
     - Performance scales well up to 16 stacks (256 vaults, 1024 threads)
     - Inter-stack links are not heavily used

  23. Final Comparison
     - Four systems
       o Conv-DDR3
         • Host processor + 4 DDR3 channels
       o Conv-3D
         • Host processor + 8 HMC stacks
       o Base-NDP
         • Host processor + 8 HMC stacks with NDP cores
         • Communication coordinated by the host
       o NDP
         • Similar to Base-NDP
         • With our coherence and communication schemes

  24. Final Comparison
     [Charts: normalized execution time and energy for Conv-DDR3, Conv-3D, Base-NDP, and NDP]
     - Conv-3D: ~20% improvement for Graph (bandwidth-bound), but consumes more energy
     - Base-NDP: 3.5x faster and 3.4x less energy than Conv-DDR3
     - NDP: up to 16x improvement over Conv-DDR3, 2.5x over Base-NDP

  25. Hybrid Workloads
     - Use both the host processor and the NDP cores for processing
     - NDP portion: similar speedup
     - Host portion: slight slowdown
       o Due to coarse-grained address interleaving
     [Chart: execution time breakdown (host time vs. NDP time) for FisherScoring and K-Core]
