

SLIDE 1

Practical Near-Data Processing for In-Memory Analytics Frameworks

Mingyu Gao, Grant Ayers, Christos Kozyrakis
Stanford University, http://mast.stanford.edu
PACT – Oct 19, 2015

SLIDE 2

Motivating Trends

 End of Dennard scaling → systems are energy limited
 Emerging big data workloads

  • Massive datasets, limited temporal locality, irregular access patterns
  • They perform poorly on conventional cache hierarchies

 Need alternatives to improve energy efficiency


[Figures: MapReduce, graphs, deep neural networks. Source: http://oceanaute.blogspot.com/2015/06/how-to-shuffle-sort-mapreduce.html]

SLIDE 3

PIM & NDP

 Improve performance & energy by avoiding data movement
 Processing-In-Memory (1990s – 2000s)

  • Same-die integration is too expensive

 Near-Data Processing

  • Enabled by 3D integration
  • Practical technology solution
  • Processing on the logic die


[Figures: Hybrid Memory Cube (HMC) and High Bandwidth Memory (HBM). Source: www.extremetech.com]

SLIDE 4

Base NDP Hardware

[Diagram: a 3D memory stack with DRAM dies atop a logic die; each vault spans DRAM banks and a vertical channel down to its vault logic, with a NoC connecting vaults on the logic die]

 Stacks linked to host multi-core processor

  • Code with temporal locality: runs on host
  • Code without temporal locality: runs on NDP

 3D memory stack

  • ~10x bandwidth, 3-5x power improvement
  • 8-16 vaults per stack
  • Vertical channel and dedicated vault controller per vault

 NDP cores

  • General-purpose, in-order cores
  • FPU, L1 I/D caches, no L2
  • Multithreaded for latency tolerance


[Diagram: host processor linked to memory stacks over high-speed serial links]

SLIDE 5

Challenges and Contributions

 NDP for large-scale highly distributed analytics frameworks

? Maintaining general coherence is expensive

Scalable and adaptive software-assisted coherence

? Inefficient communication and synchronization through the host processor

Pull-based model for direct communication, plus remote atomic operations

? Hardware/software interface

A lightweight runtime that hides low-level details and simplifies programming

? Processing capability and energy efficiency

Balanced and efficient hardware

 A general, efficient, balanced, practical-to-use NDP architecture


SLIDE 6

Example App: PageRank

 Edge-centric, scatter-gather graph processing framework
 Other analytics frameworks have similar behaviors


Edge-centric SG PageRank:

    edge_scatter(edge_t e):
        src sends update u over e      // u = src.rank / src.out_degree
    update_gather(update_t u):
        apply u to dst                 // sum += u
                                       // if all gathered: dst.rank = b * sum + (1 - b)
    while not done:
        for e in all edges:   edge_scatter(e)
        for u in all updates: update_gather(u)

Key behaviors: sequential accesses (stream in/out), communication between graph partitions, synchronization between iterations, partitioned dataset with local processing.
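The pseudocode above maps directly onto real code. Below is a minimal single-threaded C++ sketch of one iteration; the data layout, the damping factor b = 0.85, and the assumption that every scattering vertex has out_degree > 0 are illustrative choices, not taken from the slides.

    #include <cstddef>
    #include <vector>

    struct Vertex { double rank = 1.0; int out_degree = 0; };
    struct Edge   { int src, dst; };
    struct Update { int dst; double contrib; };

    // One edge-centric scatter-gather iteration (assumes out_degree > 0).
    void pagerank_iteration(std::vector<Vertex>& v,
                            const std::vector<Edge>& edges, double b = 0.85) {
        std::vector<Update> updates;            // streamed out sequentially
        for (const Edge& e : edges)             // edge_scatter
            updates.push_back({e.dst, v[e.src].rank / v[e.src].out_degree});

        std::vector<double> sum(v.size(), 0.0);
        for (const Update& u : updates)         // update_gather
            sum[u.dst] += u.contrib;
        for (std::size_t i = 0; i < v.size(); ++i)
            v[i].rank = b * sum[i] + (1 - b);   // the slide's gather step
    }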

SLIDE 7

Architecture Design

Memory model, communication, coherence, …
Lightweight hardware structures and software runtime

SLIDE 8

Shared Memory Model

 Unified physical address space across stacks

  • Direct access from any NDP/host core to memory in any vault/stack

 In PageRank

  • One thread to access data in a remote graph partition
  • For edges across two partitions

 Implementation

  • Memory ctrl forwards local/remote accesses
  • Shared router in each vault


[Diagram: NDP cores issue memory requests to the vault's memory controller, which serves local requests from vault memory and forwards remote ones through a shared router]
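As an illustration of the forwarding decision, the sketch below classifies a request as local or remote from its physical address. The bit layout (4-bit stack and vault fields) is an assumption for a 16-stack, 16-vault-per-stack system, not the actual HMC mapping.

    #include <cstdint>

    // Assumed physical address layout: | stack (4b) | vault (4b) | offset |
    struct VaultRouter {
        uint32_t my_stack, my_vault;

        bool is_local(uint64_t paddr) const {
            uint32_t vault = (paddr >> 36) & 0xF;   // assumed vault field
            uint32_t stack = (paddr >> 40) & 0xF;   // assumed stack field
            return stack == my_stack && vault == my_vault;
        }

        // The memory controller serves local requests from vault DRAM and
        // hands remote ones to the shared per-vault router.
        void route(uint64_t paddr) {
            if (is_local(paddr)) { /* access local vault memory */ }
            else                 { /* forward through router to remote vault */ }
        }
    };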

SLIDE 9

Virtual Memory Support

 NDP threads access virtual address space

  • Small TLB per core (32 entries)
  • Large pages to minimize TLB misses (2 MB)
  • 32 entries × 2 MB give 64 MB of reach, sufficient to cover local memory & remote buffers

 In PageRank

  • Each core works on local data, much smaller than the entire dataset
  • 0.25% miss rate for PageRank

 TLB misses served by OS in host

  • Similar to IOMMU misses in conventional systems


SLIDE 10

Software-Assisted Coherence

 Maintaining general coherence is expensive in NDP systems

  • Highly distributed, multiple stacks

 Analytics frameworks

  • Little data sharing except for communication
  • Data partitioning is coarse-grained

 Only allow data to be cached in one cache

  • Owner cache
  • No need to check other caches

 Page-level, coarse-grained ownership

  • Owner cache configurable through PTE


[Diagram: two vaults, each with NDP cores, L1 caches, a memory controller, and vault memory. The owner cache is identified by the TLB; the memory vault is identified by the physical address]
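A minimal sketch of the ownership check, assuming a hypothetical PTE encoding with an explicit owner-cache ID field (names and widths are illustrative):

    #include <cstdint>

    // Hypothetical PTE: translation plus a software-set owner-cache ID.
    struct PTE {
        uint64_t phys_page;     // identifies the home memory vault
        uint16_t owner_cache;   // the one cache allowed to hold this page
    };

    // The TLB yields the owner cache directly on each access; since at
    // most one cache may hold a line, no directory lookup is needed.
    bool may_cache(const PTE& pte, uint16_t this_cache) {
        return pte.owner_cache == this_cache;
    }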

SLIDE 11

Software-Assisted Coherence

 Scalable

  • Avoids directory lookup and storage

 Adaptive

  • Data may overflow to other vaults
  • Able to cache data from any vault in local cache

 Flush only when owner cache changes

  • Rarely happens, as dataset partitioning is fixed


[Diagram: the same two-vault system, with the dataset spanning both vault memories]

SLIDE 12

Communication

 Pull-based model

  • Producer buffers intermediate/result data locally and separately
  • Post small message (address, size) to consumer
  • Consumer pulls data when it needs it, using plain load instructions

[Diagram: producer cores process tasks into local buffers; consumer cores pull the buffered data for their own tasks]
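A sketch of the pull-based protocol under assumed types: the producer buffers results locally and posts only a small (address, size) descriptor; the consumer later streams the data in with ordinary loads. MsgQueue and Descriptor are hypothetical names, and the queue is single-producer/single-consumer with no overflow check, for brevity.

    #include <atomic>
    #include <cstddef>

    // Small message the producer posts: where the data lives, and how much.
    struct Descriptor { const char* addr; std::size_t bytes; };

    // Hypothetical descriptor queue placed in the consumer's vault.
    struct MsgQueue {
        static constexpr std::size_t kSlots = 1024;
        Descriptor slots[kSlots];
        std::atomic<std::size_t> head{0}, tail{0};

        void post(const Descriptor& d) {   // producer side: cheap message
            slots[tail.load(std::memory_order_relaxed) % kSlots] = d;
            tail.fetch_add(1, std::memory_order_release);
        }
        bool pull(Descriptor& d) {         // consumer side: pull when ready
            std::size_t h = head.load(std::memory_order_relaxed);
            if (h == tail.load(std::memory_order_acquire)) return false;
            d = slots[h % kSlots];
            head.store(h + 1, std::memory_order_relaxed);
            return true;
        }
    };

    // The consumer then reads d.addr[0 .. d.bytes) with plain sequential
    // loads, directly from the producer's local buffer.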

SLIDE 13

Communication

 Pull-based model is efficient and scalable

  • Sequential accesses to data
  • Asynchronous and highly parallel
  • Avoids the overheads of extra copies
  • Eliminates host processor bottleneck

 In PageRank

  • Used to communicate the update lists across partitions


SLIDE 14

Communication

 HW optimization: remote load buffers (RLBs; sketched after this list)

  • A small buffer per NDP core (a few cachelines)
  • Prefetch and cache remote (sequential) load accesses
  • Remote data are not cacheable in the local cache
  • Avoids changing the owner cache, which would force a cache flush

 Coherence guarantee with RLBs

  • Remote stores bypass RLB
  • All writes go to the owner cache
  • Owner cache always has the most up-to-date data
  • Flush RLBs at synchronization point
  • … at which time new data are guaranteed to be visible to others
  • Cheap, as each iteration is long and the RLB is small
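A behavioral C++ model of an RLB (sizes and structure are illustrative assumptions): remote sequential loads are cached here instead of in the local L1, remote stores bypass it, and the whole buffer is invalidated at every synchronization point.

    #include <cstdint>
    #include <cstring>

    constexpr int kLines = 4;        // "a few cachelines" per NDP core
    constexpr int kLineBytes = 64;

    struct RLB {
        uint64_t tags[kLines] = {};
        bool     valid[kLines] = {};
        char     data[kLines][kLineBytes] = {};

        // Serve a remote load; keeping it out of the local cache means the
        // page's owner cache never changes (a change would force a flush).
        const char* load_line(uint64_t paddr) {
            uint64_t line = paddr / kLineBytes;
            int slot = static_cast<int>(line % kLines);
            if (!valid[slot] || tags[slot] != line) {
                fetch_from_owner(line, data[slot]);  // owner cache always has
                tags[slot] = line;                   // the up-to-date data
                valid[slot] = true;
            }
            return data[slot];
        }

        // Remote stores bypass the RLB and go straight to the owner cache.

        void flush() {               // at each synchronization point
            std::memset(valid, 0, sizeof valid);
        }

        // Stub standing in for the hardware's remote read.
        static void fetch_from_owner(uint64_t, char* dst) {
            std::memset(dst, 0, kLineBytes);
        }
    };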


SLIDE 15

Synchronization

 Remote atomic operations

  • Fetch-and-add, compare-and-swap, etc.
  • HW support at memory controllers [Ahn et al. HPCA’05]

 Higher-level synchronization primitives

  • Built from remote atomic operations
  • E.g., hierarchical, tree-style barrier implementation (sketched below)
  • Core → vault → stack → global

 In PageRank

  • Used to build the barrier between iterations
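One level of such a barrier can be built from fetch-and-add, as in the sketch below. The sense-reversing structure is a standard software technique used here for illustration, not taken from the slides; in the NDP system, count.fetch_add would map onto a remote atomic served at the memory controller of the vault holding the barrier word.

    #include <atomic>

    struct Barrier {
        std::atomic<int> count{0};
        std::atomic<int> sense{0};
        const int n;                         // participants at this level
        explicit Barrier(int participants) : n(participants) {}

        void wait() {
            int my_sense = 1 - sense.load();
            if (count.fetch_add(1) == n - 1) {   // last arrival at this level
                count.store(0);                  // (would signal the parent level
                sense.store(my_sense);           //  in the core→vault→stack tree)
            } else {
                while (sense.load() != my_sense) { /* spin */ }
            }
        }
    };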


SLIDE 16

Software Runtime

 Hide low-level coherence/communication features

  • Exposes a simple API (see the sketch at the end of this slide)

 Data partitioning and program launch

  • Optionally specify running core and owner cache close to dataset
  • Hints need not be perfect; correctness is guaranteed by remote accesses

 Hybrid workloads

  • Programmers coarsely divide work between host and NDP
  • Based on temporal locality and parallelism
  • Guarantee no concurrent accesses from host and NDP cores
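The slides do not spell out the API; the sketch below is a hypothetical flavor of what the launch interface could look like, with optional placement hints (all names and signatures are illustrative stubs).

    #include <cstddef>
    #include <functional>

    struct PlacementHint {
        int stack = -1, vault = -1, core = -1;   // -1: runtime chooses
    };

    // Launch a task near its data partition. Hints need not be perfect:
    // remote accesses keep execution correct even if placement is off.
    inline void ndp_launch(std::function<void()> task, PlacementHint hint = {}) {
        (void)hint;   // a real runtime would pin the task per the hint
        task();       // stub: run inline
    }

    // Set the owner cache for a page range (backed by PTE updates in a
    // real system; a no-op placeholder here).
    inline void ndp_set_owner(void* base, std::size_t bytes, int owner_core) {
        (void)base; (void)bytes; (void)owner_core;
    }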


SLIDE 17

Evaluation

Three analytics frameworks: MapReduce, Graph, DNN

SLIDE 18

Methodology

 Infrastructure

  • zsim
  • McPAT + CACTI + Micron’s DRAM power calculator

 Calibrated against public HMC literature

 Applications

  • MapReduce: Hist, LinReg, grep
  • Graph: PageRank, SSSP, ALS
  • DNN: ConvNet, MLP, dA
SLIDE 19

Porting Frameworks

 MapReduce

  • In map phase, input data streamed in
  • Shuffle phase handled by pull-based communication

 Graph

  • Edge-centric
  • Pull remote update lists when gathering

 Deep Neural Networks

  • Convolution/pooling layers are handled similarly to Graph
  • Fully-connected layers use a local combiner before communication (see the sketch after this list)

 Once the framework is ported, no changes to the user-level apps
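A sketch of the local-combiner idea for fully-connected layers, under assumed data types: partial activation vectors produced within a vault are reduced locally, so only one combined vector per destination crosses the network instead of one per task.

    #include <cstddef>
    #include <vector>

    // Reduce in-vault partial outputs for one fully-connected layer.
    // Shrinks communicated data from (tasks x N) values to N values.
    std::vector<float> local_combine(
            const std::vector<std::vector<float>>& partials) {
        std::vector<float> combined(partials.front().size(), 0.0f);
        for (const auto& p : partials)
            for (std::size_t i = 0; i < p.size(); ++i)
                combined[i] += p[i];    // partial sums reduce associatively
        return combined;                // then posted via the pull-based model
    }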


SLIDE 20

Graph: Edge- vs. Vertex-Centric

 2.9x performance and energy improvement

  • The edge-centric version optimizes for spatial locality
  • Higher utilization for cachelines and DRAM rows

[Charts: normalized performance and energy for SSSP and ALS, vertex-centric vs. edge-centric]

SLIDE 21

Balance: PageRank

 Performance scales to 4-8 cores per vault

  • Bandwidth saturates beyond that

 Final design

  • 4 cores per vault
  • 1.0 GHz
  • 2-threaded
  • Area constrained

[Charts: normalized performance and bandwidth utilization vs. cores per vault (2-16), for 0.5/1.0 GHz cores with 1/2/4 threads; bandwidth saturates after 8 cores]

SLIDE 22

Scalability

[Chart: performance scaling vs. number of stacks (1, 2, 4, 8, 16) for Hist, PageRank, ConvNet]

 Performance scales well up to 16 stacks (256 vaults, 1024 threads)
 Inter-stack links are not heavily used

SLIDE 23

Final Comparison

 Four systems

  • Conv-DDR3: host processor + 4 DDR3 channels
  • Conv-3D: host processor + 8 HMC stacks
  • Base-NDP: host processor + 8 HMC stacks with NDP cores; communication coordinated by the host
  • NDP: similar to Base-NDP, but with our coherence and communication schemes


SLIDE 24

Final Comparison

 Conv-3D: 20% improvement for Graph (bandwidth-bound), but higher energy
 Base-NDP: 3.5x faster and 3.4x less energy than Conv-DDR3
 NDP: up to 16x improvement over Conv-DDR3, 2.5x over Base-NDP

[Charts: normalized execution time and energy for Conv-DDR3, Conv-3D, Base-NDP, and NDP]

SLIDE 25

Hybrid Workloads

 Use both host processor and NDP cores for processing
 NDP portion: similar speedup
 Host portion: slight slowdown

  • Due to coarse-grained address interleaving

[Chart: execution time breakdown, host time vs. NDP time, for FisherScoring and K-Core]

SLIDE 26

Conclusion

 Lightweight hardware structures and software runtime

  • Hides hardware details
  • Scalable and adaptive software-assisted coherence model
  • Efficient communication and synchronization

 Balanced and efficient hardware
 Up to 16x improvement over the DDR3 baseline

  • 2.5x improvement over previous NDP systems

 Software optimization

  • 3x improvement from spatial locality


SLIDE 27

Thanks!

Questions?