Near-Memory Processing: It’s the SW and HW, stupid! Boris Grot



slide-1
SLIDE 1

www.inf.ed.ac.uk DATE 2019

Near-Memory Processing: It’s the SW and HW, stupid!

Boris Grot

slide-2
SLIDE 2

Where do we go from here?

The End is Near Here!

slide-3
SLIDE 3

An exponential is ending…

10%, 20%, ... improvement in performance of component X won’t get you far

– No new transistors
– Fixed power ceiling

Emerging technologies are either incremental (e.g., Intel’s Xpoint Memory) or cover niche areas (e.g., quantum)

slide-4
SLIDE 4

The Way Forward: Vertical Integration

Software/hardware co-design for high efficiency and programmability

Is this always a good idea?

No!

– Need high volume for cost-efficiency
– Need large perf/Watt gains to be worth the effort

slide-5
SLIDE 5

This Talk

Vertical integration for in-memory data analytics


slide-6
SLIDE 6

Data Analytics Takes Center Stage

User data grows exponentially

– Need to monetize data

In-memory data operators

– Poor locality
– Low computational requirement
– Highly parallel

slide-7
SLIDE 7

Data Analytics Takes Center Stage

Data movement

– High energy cost
– High BW requirement

Data movement bottlenecks data analytics

slide-8
SLIDE 8

Cost of Moving Data

Data access is much more expensive than an arithmetic operation

[Figure: DRAM and CPU]

– Memory access: 640 pJ
– Fixed-point add: 0.1 pJ
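
To put the two energy figures side by side, here is a back-of-the-envelope sketch in Python. The per-operation energies are the slide's; the workload (one memory access and one add per value, no caching) and the function name `energy_breakdown` are assumptions made only for illustration.

```python
# Energy figures from the slide (picojoules per operation).
MEMORY_ACCESS_PJ = 640.0
FIXED_POINT_ADD_PJ = 0.1


def energy_breakdown(num_values):
    """Return (movement_pj, compute_pj) for summing num_values values.

    Assumes one memory access and one add per value (illustrative only;
    caches and wide accesses would amortize the access cost in practice).
    """
    movement_pj = num_values * MEMORY_ACCESS_PJ
    compute_pj = num_values * FIXED_POINT_ADD_PJ
    return movement_pj, compute_pj


movement, compute = energy_breakdown(1_000_000)
ratio = movement / compute
print(f"data movement costs {ratio:.0f}x the arithmetic")
```

Under these assumptions the ratio is independent of the number of values: it is simply 640 pJ divided by 0.1 pJ.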

slide-9
SLIDE 9

DRAM BW Bottleneck

– 24 GB/s off-chip BW
– 100’s of GB/s internally

[Figure: DRAM with memory arrays and row buffer; CPU]

Internal DRAM BW presents big opportunity

slide-10
SLIDE 10

Logic inside DRAM? Not a Good Idea

Fabrication processes not compatible

– DRAM is optimized for density
– Logic is irregular, wire-intensive

In-memory logic failed in the 90s

– DRAM is cost-sensitive

[Figure: DRAM with memory arrays and logic]

Must exploit DRAM in a non-disruptive manner

slide-11
SLIDE 11

Near-Memory Processing (NMP)

3D logic/DRAM stack

– Exposes internal BW to processing elements
– But constrains logic layer’s area/power envelope

Exploit the bandwidth without data movement

[Figure: logic/DRAM stack and CPU. DRAM to CPU: 640 pJ, 24 GB/s. DRAM to logic layer: 150 pJ, 128 GB/s]
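
The four figures on this slide can be compared with a short Python sketch. It assumes that 640 pJ and 24 GB/s describe the DRAM-to-CPU path and that 150 pJ and 128 GB/s describe the DRAM-to-logic-layer path (a pairing inferred from the preceding slides); the dataset size and the helper `stream_time_s` are hypothetical.

```python
# Figures from the slide; the pairing with each path is an assumption.
CPU_PATH = {"energy_pj": 640.0, "bandwidth_gb_s": 24.0}   # DRAM -> CPU
NMP_PATH = {"energy_pj": 150.0, "bandwidth_gb_s": 128.0}  # DRAM -> logic layer


def stream_time_s(dataset_gb, path):
    """Lower bound on the time to stream dataset_gb at the path's peak bandwidth."""
    return dataset_gb / path["bandwidth_gb_s"]


dataset_gb = 48.0  # hypothetical dataset size, chosen only for round numbers
cpu_time = stream_time_s(dataset_gb, CPU_PATH)
nmp_time = stream_time_s(dataset_gb, NMP_PATH)
bandwidth_gain = cpu_time / nmp_time
energy_gain = CPU_PATH["energy_pj"] / NMP_PATH["energy_pj"]
print(f"bandwidth-bound speedup: {bandwidth_gain:.2f}x")
print(f"energy per access: {energy_gain:.2f}x lower")
```

These are upper bounds: they only hold if the logic layer can actually sustain the internal bandwidth.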

slide-12
SLIDE 12

How to Best Exploit DRAM BW?

DRAM internals optimized for density

DRAM accesses must activate rows

– Single access activates KBs of data
– Activations dominate access latency & energy

Can’t utilize internal BW with random access

– Need to maintain many open rows
– Complex bookkeeping logic

Need sequential access to utilize DRAM BW
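
The effect of access order on row activations can be illustrated with a toy model in Python. This is a minimal sketch, not a DRAM simulator: it assumes a single bank with one open row, and the row size (2 KB) and access size (64 B) are hypothetical values chosen for illustration.

```python
import random

ROW_BYTES = 2048    # assumed row size (the slide only says "KBs")
ACCESS_BYTES = 64   # assumed access granularity


def count_activations(addresses):
    """Count row activations for a single bank that keeps one row open."""
    open_row = None
    activations = 0
    for addr in addresses:
        row = addr // ROW_BYTES
        if row != open_row:   # row-buffer miss: a new row must be activated
            activations += 1
            open_row = row
    return activations


num_accesses = 4096
sequential = [i * ACCESS_BYTES for i in range(num_accesses)]
shuffled = sequential[:]
random.Random(0).shuffle(shuffled)   # same addresses, random order

seq_acts = count_activations(sequential)
rand_acts = count_activations(shuffled)
print(f"sequential: {seq_acts} activations, random: {rand_acts} activations")
```

In sequential order each row is activated once and fully consumed; in random order almost every access lands in a different row than the previous one.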

slide-13
SLIDE 13

NMP HW-Algorithm Co-Design

Algorithms: Must have sequential access

– Even if we perform more work

Hardware: Must leverage data parallelism

– On a tight area/power budget


HW-algorithm co-design necessary to make best use of NMP

slide-14
SLIDE 14

Example data operator: Join

Iterates over a pair of tables to find matching keys

Major operation in data analytics

Q: SELECT ... FROM A, B WHERE A.Key = B.Key

[Figure: Join of tables A and B]

A: C F A D B E
B: A G Z C M E
Result: A C E

slide-15
SLIDE 15

Baseline: CPU Hash Join

Best performing algorithm in CPU-centric systems

Performed in two phases: Partition & Probe

1. Partition generates cache-sized partitions
2. Probe builds and queries cache-resident hash tables

[Figure: Partition: C F A D B E → A D C E F B. Probe: H(x) over E F B]

Optimized for random accesses to cached data
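
The two phases can be sketched in a few lines of Python. This is an illustrative software version only, not the implementation evaluated in the talk: the function names, the use of Python's built-in `hash`, the partition count, and the assumption that keys in table A are unique are all choices made for this sketch. The key lists are taken from the join example.

```python
from collections import defaultdict


def partition(keys, num_partitions):
    """Phase 1: scatter keys into partitions by hash (ideally cache-sized)."""
    parts = defaultdict(list)
    for key in keys:
        parts[hash(key) % num_partitions].append(key)
    return parts


def hash_join(table_a, table_b, num_partitions=2):
    """Phase 2: per partition, build a hash table on A and probe it with B.

    Assumes keys in table_a are unique (the build side is a set).
    """
    parts_a = partition(table_a, num_partitions)
    parts_b = partition(table_b, num_partitions)
    result = []
    for p in range(num_partitions):
        build = set(parts_a.get(p, []))   # build: small hash table
        # probe: each lookup is a random access into the hash table
        result.extend(k for k in parts_b.get(p, []) if k in build)
    return result


A = ["C", "F", "A", "D", "B", "E"]
B = ["A", "G", "Z", "C", "M", "E"]
print(sorted(hash_join(A, B)))  # → ['A', 'C', 'E']
```

The probe's random lookups are cheap when the hash table fits in cache, which is what the partition phase is for.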

slide-16
SLIDE 16

NMP Hash Join

[Figure: NMP logic applies H(X) to keys C F A D B E and issues accesses to DRAM]

Goal: maximum MLP

  • Limited by bookkeeping logic
slide-17
SLIDE 17

NMP Hash Join

[Figure: H(X) maps keys C and F to DRAM addresses &C and &F]

Poor row buffer utilization

slide-18
SLIDE 18

NMP Hash Join

[Figure: H(X) maps keys A and D to DRAM addresses &A and &D]

Random accesses are inefficient and under-utilize internal BW

slide-19
SLIDE 19

Eliminate Random Access?

Insight: use Sort Join

– Performs mostly sequential accesses
– But has higher algorithmic complexity

Trade algorithmic complexity for desirable access pattern

– Hash join: O(n) random accesses
– Sort join: O(n log n) sequential accesses

[Figure: Hash join (via H(x)) vs. sort join on keys C F A D]

Utilizing internal DRAM BW compensates for increased cost
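
A sort join can be sketched in Python as follows. This is a minimal illustration under stated assumptions: keys are unique within each table, and Python's built-in `sorted` stands in for whatever sequential-friendly sort an NMP system would use. The key lists are taken from the join example.

```python
def sort_join(table_a, table_b):
    """Sort-merge join on keys; assumes keys are unique within each table."""
    a = sorted(table_a)   # O(n log n) sort phase
    b = sorted(table_b)
    i = j = 0
    result = []
    # Merge phase: both cursors only move forward, so each sorted table
    # is read once, front to back.
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            i += 1
        elif a[i] > b[j]:
            j += 1
        else:
            result.append(a[i])
            i += 1
            j += 1
    return result


A = ["C", "F", "A", "D", "B", "E"]
B = ["A", "G", "Z", "C", "M", "E"]
print(sort_join(A, B))  # → ['A', 'C', 'E']
```

The sort costs more operations than hashing, but the merge never jumps backwards or sideways in memory.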

slide-20
SLIDE 20

NMP Sort Join: Sequential Accesses

[Figure: NMP logic with two base addresses; DRAM holds runs A C E G and B D F H]

Drop OoO logic

  • Reduces area/power of NMP

Add stream buffer

  • Simple logic utilizes BW
slide-21
SLIDE 21

NMP Sort Join: Sequential Accesses

[Figure: base addresses &A and &B point at runs A C E G and B D F H in DRAM]

slide-22
SLIDE 22

NMP Sort Join: Sequential Accesses

[Figure: NMP logic issues sequential requests &A + 0, &A + 1 and &B + 0, &B + 1 to runs A C E G and B D F H in DRAM]

Good row buffer utilization

slide-23
SLIDE 23

NMP Sort Join: Sequential Accesses

[Figure: continued animation of sequential requests &A + 0, &A + 1, &B + 0, &B + 1 to DRAM]

slide-24
SLIDE 24

NMP Sort Join: Sequential Accesses

[Figure: requests advance to &A + 1 and &B + 1]

Sequential access moves bottleneck to compute

slide-25
SLIDE 25

NMP Sort Join: Compute

[Figure: NMP logic with base addresses &A and &B processing runs A C E G and B D F H]

Use area/power budget for SIMD

General purpose SIMD keeps up with memory BW

slide-26
SLIDE 26

Partitioning Phase

Partitioning basics:

– Each partition contains buckets of objects
– For a given object, target bucket determined using a hash
– The order of objects within each bucket is irrelevant → buckets are unordered

Insight: the order in which tuples are written into a bucket in the target partition is irrelevant

Partitioning phase: tuples are permutable

slide-27
SLIDE 27

Partitioning Phase

Leverage tuples’ permutability property

Turn partition’s random accesses sequential

– Enable use of SIMD during partition
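
A software analogy for permutability, sketched in Python. The slides do not describe how the hardware implements this, so the function below is only an illustration: each partition keeps an append cursor, and a tuple is written to the next free slot of its target partition regardless of arrival order. The name `partition_permutable` and the use of the built-in `hash` are assumptions.

```python
def partition_permutable(tuples, num_partitions, key_hash=hash):
    """Write each tuple to the next free slot of its target partition.

    Because order within a partition is irrelevant, the write position is
    just an append cursor, so every partition is filled sequentially.
    """
    partitions = [[] for _ in range(num_partitions)]
    for key, payload in tuples:
        target = key_hash(key) % num_partitions
        partitions[target].append((key, payload))   # sequential append
    return partitions


tuples = [(k, k * 10) for k in range(12)]
in_order = partition_permutable(tuples, 4)
reversed_arrival = partition_permutable(list(reversed(tuples)), 4)

# Arrival order changes the layout inside a partition, not its contents.
same_contents = all(
    sorted(p) == sorted(q) for p, q in zip(in_order, reversed_arrival)
)
print(same_contents)  # → True
```

Since any arrival order yields a valid partition, writes never need to target a precomputed slot.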


slide-28
SLIDE 28

Mondrian

Algorithm + hardware co-design for near-memory processing of data analytics

NMP Algorithms

– Use sequential memory accesses
– Avoid random memory accesses
– Target partitioning and compute phases

NMP Hardware

– High memory parallelism using simple SIMD hardware
– Exploit sequential memory accesses

slide-29
SLIDE 29

Methodology

Big data operators:

– Scan
– Join
– Group By
– Sort

Memory subsystem:

  • 4 HMC stacks

– 20 GB/s external BW
– 128 GB/s internal BW

Simulated systems:

  • CPU-centric: ARM Cortex-A57

– 16 cores
– 3-wide, 128-entry ROB @ 2GHz

  • NMP: Mobile ARM core

– 16 cores per stack
– 3-wide, 48-entry ROB @ 1GHz

  • Mondrian: SIMD in-order

– 16 cores per stack
– 1024-bit SIMD @ 1GHz

Flexus cycle accurate simulator [Wenisch’06]
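
A quick sanity check of the Mondrian configuration against the memory bandwidth, as a Python sketch. The SIMD width, clock, core count and internal bandwidth are the slide's numbers; treating 128 GB/s as a per-stack figure, splitting it evenly across cores, and issuing one full-width SIMD operation per cycle are assumptions made for this estimate.

```python
# Figures from the methodology slide.
SIMD_BITS = 1024
CLOCK_HZ = 1e9               # 1 GHz
CORES_PER_STACK = 16
INTERNAL_BW_BYTES_S = 128e9  # 128 GB/s, assumed to be per stack

# Assumption: one full-width SIMD operation per cycle.
bytes_per_cycle = SIMD_BITS // 8
core_peak_bytes_s = bytes_per_cycle * CLOCK_HZ

# Bandwidth share if the stack's internal BW is split evenly across cores.
per_core_share_bytes_s = INTERNAL_BW_BYTES_S / CORES_PER_STACK
cycles_per_vector = core_peak_bytes_s / per_core_share_bytes_s

print(f"peak consumption per core: {core_peak_bytes_s / 1e9:.0f} GB/s")
print(f"bandwidth share per core: {per_core_share_bytes_s / 1e9:.0f} GB/s")
print(f"cycles available per {bytes_per_cycle}-byte vector: {cycles_per_vector:.0f}")
```

Under these assumptions each core has several cycles to process every vector it receives, so a simple in-order SIMD core has headroom to keep pace with its share of the bandwidth.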

slide-30
SLIDE 30

Evaluation: Performance

[Figure: speedup (log scale, ticks 1, 10, 100) per operator (Scan, Sort, Group by, Join) for NMP and Mondrian]

slide-31
SLIDE 31

Evaluation: Performance

Mondrian achieves superior BW utilization

slide-32
SLIDE 32

Evaluation: Performance

NMP can’t utilize memory BW with random accesses

slide-33
SLIDE 33

Evaluation: Performance

Mondrian BW utilization compensates for extra log(n) work

slide-34
SLIDE 34

Summary

End of technology scaling → must think vertical

– Software + hardware co-design

Big data analytics are a critical workload

– Large datasets, little locality → memory bottleneck!

Moving compute near memory improves performance

– But need to conform to DRAM constraints

Mondrian is algorithm-hardware co-designed NMP for analytics

– Adapt algorithms/HW to DRAM constraints
– Sequential rather than random memory access
– Simple hardware to exploit memory bandwidth

slide-35
SLIDE 35

inf.ed.ac.uk/bgrot

Thank you!

Questions?


slide-36
SLIDE 36

Mondrian Energy Efficiency

[Figure: efficiency improvement (performance/energy, axis ticks 1, 10, 100) per operator (Scan, Sort, Group by, Join) for NMP-OoO and Mondrian]