www.inf.ed.ac.uk DATE 2019
Near-Memory Processing: It's the SW and HW, stupid! Boris Grot
The End is Near Here!
Where do we go from here?
An exponential is ending…
10%, 20%, ... improvement in performance of component X won’t get you far
– No new transistors
– Fixed power ceiling
Emerging technologies are either incremental (e.g., Intel’s Xpoint Memory) or cover niche areas (e.g., quantum)
The Way Forward: Vertical Integration
Software/hardware co-design for high efficiency and programmability
Is this always a good idea?
No!
Need high volume for cost-efficiency
Need large perf/Watt gains to be worth the effort
This Talk
Vertical integration for in-memory data analytics
Data Analytics Takes Center Stage
User data grows exponentially
– Need to monetize data
In-memory data operators
– Poor locality
– Low computational requirement
– Highly parallel
Data movement
– High energy cost
– High BW requirement
Data movement bottlenecks data analytics
Cost of Moving Data
Data access is much more expensive than an arithmetic operation
– Memory access (CPU to DRAM): 640 pJ
– Fixed-point add: 0.1 pJ
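To put the two energy figures side by side, the short sketch below divides one by the other. The constant and function names are illustrative only; the two values are the ones quoted on the slide.

```python
# Energy figures quoted on the slide, in picojoules.
MEMORY_ACCESS_PJ = 640.0
FIXED_POINT_ADD_PJ = 0.1


def adds_per_memory_access(access_pj: float, add_pj: float) -> float:
    """Return how many additions cost as much energy as one memory access."""
    return access_pj / add_pj


ratio = adds_per_memory_access(MEMORY_ACCESS_PJ, FIXED_POINT_ADD_PJ)
print(f"One memory access costs roughly as much energy as {ratio:.0f} fixed-point adds")
```

With these numbers the ratio is about 6400, which is why reducing data movement matters more than speeding up the arithmetic.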
DRAM BW Bottleneck
– 24 GB/s off-chip BW
– 100’s of GB/s internally
[Figure: DRAM memory arrays and row buffer, connected to the CPU]
Internal DRAM BW presents big opportunity
Logic inside DRAM? Not a Good Idea
Fabrication processes not compatible
– DRAM is optimized for density
– Logic is irregular, wire-intensive
In-memory logic failed in the 90s
– DRAM is cost-sensitive
[Figure: DRAM die with logic placed next to the memory arrays]
Must exploit DRAM in a non-disruptive manner
Near-Memory Processing (NMP)
3D logic/DRAM stack
– Exposes internal BW to processing elements
– But constrains logic layer’s area/power envelope
Exploit the bandwidth without data movement
[Figure: CPU to DRAM: 640 pJ, 24 GB/s; logic layer to DRAM: 150 pJ, 128 GB/s]
How to Best Exploit DRAM BW?
DRAM internals optimized for density
DRAM accesses must activate rows
– Single access activates KBs of data
– Activations dominate access latency & energy
Can’t utilize internal BW with random access
– Need to maintain many open rows
– Complex bookkeeping logic
Need sequential access to utilize DRAM BW
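A toy model can make the row-activation argument concrete. The sketch below counts activations for a single bank that keeps one row open at a time. The 2 KB row size, the 8-byte access granularity, and all names are assumptions for illustration; the slide only says a single access activates "KBs of data".

```python
import random

ROW_BYTES = 2048   # assumed row size (the slide only says "KBs")
WORD_BYTES = 8     # assumed access granularity


def count_activations(addresses, row_bytes=ROW_BYTES):
    """Count row activations for an access stream on one bank with one open row."""
    open_row = None
    activations = 0
    for addr in addresses:
        row = addr // row_bytes
        if row != open_row:   # row-buffer miss: a new row must be activated
            activations += 1
            open_row = row
    return activations


n_words = 4096
sequential = [i * WORD_BYTES for i in range(n_words)]
shuffled = sequential[:]
random.Random(0).shuffle(shuffled)   # same addresses, random order

seq_acts = count_activations(sequential)
rand_acts = count_activations(shuffled)
print(f"sequential: {seq_acts} activations, random: {rand_acts} activations")
```

The sequential stream activates each of the 16 rows exactly once, while the shuffled stream touches the same data but reopens rows over and over. Real DRAM has many banks and richer timing constraints, so this only illustrates the trend.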
NMP HW-Algorithm Co-Design
Algorithms: Must have sequential access
– Even if we perform more work
Hardware: Must leverage data parallelism
– On a tight area/power budget
HW-algorithm co-design necessary to make best use of NMP
Example data operator: Join
Iterates over a pair of tables to find matching keys
Major operation in data analytics
Q: SELECT ... FROM A, B WHERE A.Key = B.Key
[Figure: Join of table A (keys C F A D B E) with table B (keys A G Z C M E) produces Result (keys A C E)]
Baseline: CPU Hash Join
Best performing algorithm in CPU-centric systems
Performed in two phases: Partition & Probe
1. Partition generates cache sized partitions
2. Probe builds and queries cache resident hash tables
[Figure: Partition scatters the input keys (C F A D B E) into partitions using H(x); Probe builds and queries a hash table per partition]
Optimized for random accesses to cached data
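The two phases can be sketched in a few lines of Python. This is a minimal illustration of a partitioned hash join, not the tuned CPU implementation the talk uses as its baseline; the function name, the `(key, payload)` tuple layout, and the partition count are assumptions.

```python
from collections import defaultdict


def hash_join(table_a, table_b, num_partitions=4):
    """Partitioned hash join over lists of (key, payload) tuples."""
    # Phase 1 (Partition): scatter both inputs by hash of the key, so that
    # matching keys always land in the same partition pair.
    parts_a = defaultdict(list)
    parts_b = defaultdict(list)
    for key, payload in table_a:
        parts_a[hash(key) % num_partitions].append((key, payload))
    for key, payload in table_b:
        parts_b[hash(key) % num_partitions].append((key, payload))

    # Phase 2 (Probe): per partition, build a hash table on A, then probe it
    # with the tuples of B. Both steps are random accesses into the table.
    result = []
    for p in range(num_partitions):
        built = defaultdict(list)
        for key, payload in parts_a.get(p, []):
            built[key].append(payload)
        for key, payload in parts_b.get(p, []):
            for a_payload in built.get(key, []):
                result.append((key, a_payload, payload))
    return result


# Keys taken from the Join example slide.
a = [(k, f"a{i}") for i, k in enumerate("CFADBE")]
b = [(k, f"b{i}") for i, k in enumerate("AGZCME")]
matches = sorted(k for k, _, _ in hash_join(a, b))
print(matches)
```

On a CPU the partitions are sized to fit in cache, so the random probes are cheap. The same access pattern issued directly against DRAM is what the following slides examine.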
NMP Hash Join
Goal: maximum MLP
– Limited by bookkeeping logic
[Figure: the NMP logic hashes keys with H(x) and issues scattered requests (&C, &F, &A, &D) to DRAM]
Poor row buffer utilization
Random accesses are inefficient and under-utilize internal BW
Eliminate Random Access?
Insight: use Sort Join
– Performs mostly sequential accesses
– But has higher algorithmic complexity
Trade algorithmic complexity for desirable access pattern
– Hash join: O(n) random accesses
– Sort join: O(n log n) sequential accesses
Utilizing internal DRAM BW compensates for increased cost
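For comparison with the hash join, a sort join can be sketched as a sort followed by a linear merge. This is a generic textbook sort-merge join written for illustration, not the Mondrian implementation; the function name and tuple layout are assumptions.

```python
def sort_join(table_a, table_b):
    """Sort-merge join over lists of (key, payload) tuples."""
    # Sorting costs O(n log n), but both the sort and the merge below
    # walk their inputs in order rather than jumping around memory.
    a = sorted(table_a, key=lambda t: t[0])
    b = sorted(table_b, key=lambda t: t[0])

    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i][0] < b[j][0]:
            i += 1
        elif a[i][0] > b[j][0]:
            j += 1
        else:
            key = a[i][0]
            # Find the run of equal keys on each side.
            i_end = i
            while i_end < len(a) and a[i_end][0] == key:
                i_end += 1
            j_end = j
            while j_end < len(b) and b[j_end][0] == key:
                j_end += 1
            # Emit every pairing within the two runs.
            for _, pa in a[i:i_end]:
                for _, pb in b[j:j_end]:
                    result.append((key, pa, pb))
            i, j = i_end, j_end
    return result


# Keys taken from the Join example slide.
a = [(k, f"a{i}") for i, k in enumerate("CFADBE")]
b = [(k, f"b{i}") for i, k in enumerate("AGZCME")]
joined_keys = [k for k, _, _ in sort_join(a, b)]
print(joined_keys)
```

The merge loop only ever advances its two cursors forward, which is the access pattern the talk argues DRAM rewards.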
NMP Sort Join: Sequential Accesses
Drop OoO logic
– Reduces area/power of NMP
Add stream buffer
– Simple logic utilizes BW
[Figure: two stream buffers, with base addresses &A and &B, fetch the sorted runs A C E G and B D F H from DRAM at consecutive offsets (&A + 0, &A + 1, ... and &B + 0, &B + 1, ...)]
Good row buffer utilization
Sequential access moves bottleneck to compute
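The base-plus-offset addressing shown in the figure can be mimicked with a toy stream buffer. The class below is a software analogy with hypothetical names; it says nothing about how the actual hardware buffer is sized or prefetches.

```python
class StreamBuffer:
    """Toy stream buffer: reads memory[base], memory[base + 1], ... in order."""

    def __init__(self, memory, base, length):
        self.memory = memory
        self.base = base
        self.length = length
        self.offset = 0
        self.fetches = []   # addresses requested so far, kept for inspection

    def empty(self):
        return self.offset >= self.length

    def peek(self):
        return self.memory[self.base + self.offset]

    def pop(self):
        addr = self.base + self.offset
        self.fetches.append(addr)
        self.offset += 1
        return self.memory[addr]


def merge_streams(sa, sb):
    """Merge two sorted streams, consuming each strictly front to back."""
    out = []
    while not sa.empty() and not sb.empty():
        out.append(sa.pop() if sa.peek() <= sb.peek() else sb.pop())
    while not sa.empty():
        out.append(sa.pop())
    while not sb.empty():
        out.append(sb.pop())
    return out


# Two sorted runs laid out back to back, as in the slide's figure.
memory = list("ACEG") + list("BDFH")
stream_a = StreamBuffer(memory, base=0, length=4)
stream_b = StreamBuffer(memory, base=4, length=4)
merged = merge_streams(stream_a, stream_b)
print(merged, stream_a.fetches, stream_b.fetches)
```

Each stream requests consecutive addresses only, so no reordering or dependence tracking is needed to keep the requests flowing.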
NMP Sort Join: Compute
Use area/power budget for SIMD
General purpose SIMD keeps up with memory BW
Partitioning Phase
Partitioning basics:
– Each partition contains buckets of objects
– For a given object, target bucket determined using a hash
– The order of objects within each bucket is irrelevant → buckets are unordered
Insight: the order in which tuples are written into a bucket in the target partition is irrelevant
Partitioning phase: tuples are permutable
Partitioning Phase
Leverage tuple’s permutability property
Turn partition’s random accesses sequential
– Enable use of SIMD during partition
Mondrian
Algorithm + hardware co-design for near-memory processing of data analytics
NMP Algorithms
– Use sequential memory accesses
– Avoid random memory accesses
– Target partitioning and compute phases
NMP Hardware
– High memory parallelism using simple SIMD hardware
– Exploit sequential memory accesses
Methodology
Flexus cycle-accurate simulator [Wenisch’06]
Big data operators:
– Scan
– Join
– Group By
– Sort
Memory subsystem:
– 4 HMC stacks
  – 20 GB/s external BW
  – 128 GB/s internal BW
Simulated systems:
– CPU-centric: ARM Cortex-A57
  – 16 cores
  – 3-wide, 128-entry ROB @ 2 GHz
– NMP: Mobile ARM core
  – 16 cores per stack
  – 3-wide, 48-entry ROB @ 1 GHz
– Mondrian: SIMD in-order
  – 16 cores per stack
  – 1024-bit SIMD @ 1 GHz
Evaluation: Performance
[Figure: speedup (log scale, 1 to 100) of NMP and Mondrian for each operator: Scan, Sort, Group by, Join]
Mondrian achieves superior BW utilization
NMP can’t utilize memory BW with random accesses
Mondrian BW utilization compensates for extra log(n) work
Summary
End of technology scaling → must think vertical
– Software + hardware co-design
Big data analytics are a critical workload
– Large datasets, little locality → memory bottleneck!
Moving compute near memory improves performance
– But need to conform to DRAM constraints
Mondrian is algorithm-hardware NMP for analytics
– Adapt algorithms/HW to DRAM constraints
– Sequential rather than random memory access
– Simple hardware to exploit memory bandwidth
inf.ed.ac.uk/bgrot
Thank you!
Questions?
Mondrian Energy Efficiency
[Figure: efficiency improvement (performance/energy, log scale, 1 to 100) of NMP-OoO and Mondrian for each operator: Scan, Sort, Group by, Join]