Near-Memory Processing: It’s the SW and HW, stupid!
Boris Grot, www.inf.ed.ac.uk
DATE 2019

The End is Near Here! Where do we go from here?

An exponential is ending…
10%, 20%, ... improvement in performance of component X won’t get you far
– No new transistors
– Fixed power ceiling
Emerging technologies are either incremental (e.g., Intel’s XPoint memory) or cover niche areas (e.g., quantum)

The Way Forward: Vertical Integration
Software/hardware co-design for high efficiency and programmability
Is this always a good idea? No!
Need high volume for cost-efficiency
Need large perf/Watt gains to be worth the effort

This Talk
Vertical integration for in-memory data analytics


Data Analytics Takes Center Stage
User data grows exponentially
– Need to monetize data
In-memory data operators
– Poor locality
– Low computational requirement
– Highly parallel
Data movement
– High energy cost
– High BW requirement
Data movement bottlenecks data analytics

Cost of Moving Data
Memory access (CPU to DRAM): 640 pJ
Fixed-point add: 0.1 pJ
Data access is much more expensive than an arithmetic operation
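To put the two energy figures side by side, the short sketch below computes how many fixed-point adds fit into the energy of one memory access. It uses only the two numbers quoted on the slide; the function and constant names are illustrative.

```python
# Energy figures quoted on the slide, in picojoules per operation.
MEMORY_ACCESS_PJ = 640.0    # one DRAM access issued by the CPU
FIXED_POINT_ADD_PJ = 0.1    # one fixed-point add


def adds_per_access(access_pj: float = MEMORY_ACCESS_PJ,
                    add_pj: float = FIXED_POINT_ADD_PJ) -> float:
    """Number of adds that cost the same energy as a single memory access."""
    return access_pj / add_pj


ratio = adds_per_access()
print(f"One memory access costs about as much energy as {ratio:.0f} adds")
```

Under these figures, an operator that performs only a handful of arithmetic operations per fetched value spends nearly all of its energy on data movement.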

DRAM BW Bottleneck
[Figure: CPU attached to a DRAM chip containing memory arrays and a row buffer]
Internal BW: 100’s of GB/s
Off-chip BW: 24 GB/s
Internal DRAM BW presents big opportunity

Logic inside DRAM? Not a Good Idea
Fabrication processes not compatible
– DRAM is optimized for density
– Logic is irregular, wire-intensive
In-memory logic failed in the 90s
– DRAM is cost-sensitive
Must exploit DRAM in a non-disruptive manner

Near-Memory Processing (NMP)
3D logic/DRAM stack
– Exposes internal BW to processing elements
– But constrains logic layer’s area/power envelope
[Figure: CPU to DRAM: 640 pJ, 24 GB/s; logic layer to DRAM: 150 pJ, 128 GB/s]
Exploit the bandwidth without data movement
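A rough back-of-the-envelope comparison of the two paths, using the four figures on this slide. The sketch assumes a purely bandwidth-bound scan and reads the 640 pJ and 150 pJ figures as per-access energies; the dataset size is hypothetical and chosen only to keep the arithmetic simple.

```python
# Figures quoted on the slide.
CPU_BW_GB_S = 24.0       # off-chip bandwidth seen by the CPU
NMP_BW_GB_S = 128.0      # internal bandwidth seen by the logic layer
CPU_ACCESS_PJ = 640.0    # energy of a memory access from the CPU
NMP_ACCESS_PJ = 150.0    # energy of a memory access from the logic layer

DATASET_GB = 96.0        # hypothetical dataset size, for illustration only


def scan_seconds(dataset_gb: float, bandwidth_gb_s: float) -> float:
    """Time to stream a dataset once, assuming the scan is bandwidth-bound."""
    return dataset_gb / bandwidth_gb_s


cpu_time = scan_seconds(DATASET_GB, CPU_BW_GB_S)
nmp_time = scan_seconds(DATASET_GB, NMP_BW_GB_S)
bandwidth_gain = cpu_time / nmp_time
energy_gain = CPU_ACCESS_PJ / NMP_ACCESS_PJ

print(f"CPU scan: {cpu_time:.2f} s, NMP scan: {nmp_time:.2f} s")
print(f"Bandwidth gain: {bandwidth_gain:.2f}x, energy gain per access: {energy_gain:.2f}x")
```

These are upper bounds under the stated assumptions: the later slides argue that the internal bandwidth is only reachable when the access pattern is sequential.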

How to Best Exploit DRAM BW?
DRAM internals optimized for density
DRAM accesses must activate rows
– Single access activates KBs of data
– Activations dominate access latency & energy
Can’t utilize internal BW with random access
– Need to maintain many open rows
– Complex bookkeeping logic
Need sequential access to utilize DRAM BW
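The effect of access order on row activations can be illustrated with a toy model. This is a sketch under simplifying assumptions, not a DRAM simulator: a single bank that keeps one row open, a 2048-byte row (the slide only says "KBs"), and 8-byte accesses. All names and sizes are hypothetical.

```python
import random

ROW_BYTES = 2048      # assumed row size; the slide only says "KBs"
ACCESS_BYTES = 8      # assumed size of one access


def count_activations(addresses, row_bytes=ROW_BYTES):
    """Count row activations for one bank that keeps a single row open."""
    open_row = None
    activations = 0
    for addr in addresses:
        row = addr // row_bytes
        if row != open_row:      # row buffer miss: a new row must be activated
            activations += 1
            open_row = row
    return activations


NUM_ACCESSES = 4096
sequential = [i * ACCESS_BYTES for i in range(NUM_ACCESSES)]

# Same addresses, visited in a random order (fixed seed for repeatability).
shuffled = sequential[:]
random.Random(0).shuffle(shuffled)

seq_acts = count_activations(sequential)
rand_acts = count_activations(shuffled)
print(f"sequential: {seq_acts} activations, random: {rand_acts} activations")
```

In this model the sequential stream activates each row exactly once, while the shuffled stream re-activates rows on most accesses even though it touches the same data.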

NMP HW-Algorithm Co-Design
Algorithms: Must have sequential access
– Even if we perform more work
Hardware: Must leverage data parallelism
– On a tight area/power budget
HW-algorithm co-design necessary to make best use of NMP

Example data operator: Join
Iterates over a pair of tables to find matching keys
Major operation in data analytics
Q: SELECT ... FROM A, B WHERE A.Key = B.Key
[Figure: tables A and B joined on Key to produce the Result table]
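As a reference for what the join operator computes, here is a minimal nested-loop sketch of the query on the slide. The table contents are made up for illustration (the figure's values are not recoverable), and each table is assumed to be a list of (key, payload) pairs.

```python
def nested_loop_join(table_a, table_b):
    """Return (key, payload_a, payload_b) for every pair of rows with equal keys."""
    return [
        (key_a, payload_a, payload_b)
        for key_a, payload_a in table_a
        for key_b, payload_b in table_b
        if key_a == key_b
    ]


# Hypothetical tables of (key, payload) pairs.
table_a = [("C", 1), ("F", 2), ("A", 3), ("D", 4)]
table_b = [("A", 10), ("C", 20), ("B", 30), ("E", 40)]

result = nested_loop_join(table_a, table_b)
print(result)
```

This quadratic version only defines the expected result; hash join and sort join compute the same set of matches with far fewer comparisons.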

Baseline: CPU Hash Join
Best performing algorithm in CPU-centric systems
Performed in two phases: Partition & Probe
1. Partition generates cache-sized partitions
2. Probe builds and queries cache-resident hash tables
[Figure: hash function H(x) scatters tuples into partitions; each partition is then probed]
Optimized for random accesses to cached data
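The two phases can be sketched as follows. This is a simplified illustration rather than a tuned implementation: the partition count is a hypothetical parameter standing in for "cache-sized", Python's built-in `hash` stands in for H(x), and tables are lists of (key, payload) pairs.

```python
from collections import defaultdict


def partition(table, num_partitions):
    """Phase 1: scatter tuples into partitions according to a hash of the key."""
    parts = [[] for _ in range(num_partitions)]
    for key, payload in table:
        parts[hash(key) % num_partitions].append((key, payload))
    return parts


def probe(part_a, part_b):
    """Phase 2: build a hash table on one partition and query it with the other."""
    built = defaultdict(list)
    for key, payload in part_a:
        built[key].append(payload)
    return [
        (key, payload_a, payload_b)
        for key, payload_b in part_b
        for payload_a in built.get(key, ())
    ]


def hash_join(table_a, table_b, num_partitions=4):
    """Join two tables; matching keys always land in the same partition index."""
    result = []
    parts_a = partition(table_a, num_partitions)
    parts_b = partition(table_b, num_partitions)
    for part_a, part_b in zip(parts_a, parts_b):
        result.extend(probe(part_a, part_b))
    return result


# Hypothetical tables with integer keys.
table_a = [(1, "a1"), (2, "a2"), (3, "a3"), (2, "a2b")]
table_b = [(2, "b2"), (3, "b3"), (4, "b4")]
print(sorted(hash_join(table_a, table_b)))
```

Both the scatter in `partition` and the lookups in `probe` touch memory in a key-dependent order, which is harmless when the data sits in a cache but becomes the random-access pattern discussed on the following slides when it sits in DRAM.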

NMP Hash Join
[Figure: NMP core hashes tuples with H(x) and issues scattered requests (&C, &F, &A, &D) to DRAM]
Goal: maximum memory-level parallelism (MLP)
• Limited by bookkeeping logic
Poor row buffer utilization
Random accesses are inefficient and under-utilize internal BW

Eliminate Random Access?
Insight: use Sort Join
– Performs mostly sequential accesses
– But has higher algorithmic complexity
Trade algorithmic complexity for desirable access pattern
– Hash join: O(n) random accesses
– Sort join: O(n log n) sequential accesses
Utilizing internal DRAM BW compensates for increased cost
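A minimal sort-merge join sketch shows where the sequential access pattern comes from: after sorting, both tables are consumed front to back by two cursors. Python's built-in `sorted` stands in for the O(n log n) sort phase; the table contents are hypothetical (key, payload) pairs.

```python
def sort_merge_join(table_a, table_b):
    """Join two tables by sorting both on the key and merging them in one pass."""
    a = sorted(table_a, key=lambda row: row[0])
    b = sorted(table_b, key=lambda row: row[0])
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i][0] < b[j][0]:
            i += 1                      # advance the cursor with the smaller key
        elif a[i][0] > b[j][0]:
            j += 1
        else:
            key = a[i][0]
            # Find the run of rows sharing this key in each table.
            i_end = i
            while i_end < len(a) and a[i_end][0] == key:
                i_end += 1
            j_end = j
            while j_end < len(b) and b[j_end][0] == key:
                j_end += 1
            # Emit every pairing of the two runs.
            for _, payload_a in a[i:i_end]:
                for _, payload_b in b[j:j_end]:
                    result.append((key, payload_a, payload_b))
            i, j = i_end, j_end
    return result


# Hypothetical tables; key "A" appears twice in the first one.
table_a = [("C", 1), ("A", 2), ("F", 3), ("A", 4)]
table_b = [("A", 10), ("D", 20), ("C", 30)]
print(sort_merge_join(table_a, table_b))
```

The merge loop only ever moves its cursors forward, so its memory accesses are two sequential streams; the price is the extra sorting work noted on the slide.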

NMP Sort Join: Sequential Accesses
[Figure: two stream buffers with base addresses &A and &B issue sequential requests (&A + 0, &A + 1, &B + 0, &B + 1, ...) to DRAM]
Drop OoO logic
• Reduces area/power of NMP
Add stream buffer
• Simple logic utilizes BW
Good row buffer utilization
Sequential access moves bottleneck to compute

NMP Sort Join: Compute
[Figure: stream buffers for &A and &B feeding the NMP compute units]
Use area/power budget for SIMD
General purpose SIMD keeps up with memory BW

Partitioning Phase
Partitioning basics:
– Each partition contains buckets of objects
– For a given object, target bucket determined using a hash
– The order of objects within each bucket is irrelevant → buckets are unordered
Insight: the order in which tuples are written into a bucket in the target partition is irrelevant
Partitioning phase: tuples are permutable

Partitioning Phase
Leverage the tuples’ permutability property
Turn the partition phase’s random accesses into sequential ones
– Enable use of SIMD during partition
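One way to see why permutability helps is a software sketch in which every bucket is a contiguous region with its own write cursor, so tuples are simply appended in whatever order they arrive. This is an illustration of the property only, not the mechanism used by the Mondrian hardware; the modulo hash, the integer keys, and all function names are assumptions.

```python
def partition_sequential(tuples, num_buckets):
    """Place tuples into contiguous per-bucket regions, appending in arrival order."""
    # Pass 1: count how many tuples fall into each bucket.
    counts = [0] * num_buckets
    for key, _ in tuples:
        counts[key % num_buckets] += 1

    # Exclusive prefix sum gives the start offset of each bucket's region.
    starts = []
    total = 0
    for count in counts:
        starts.append(total)
        total += count

    # Pass 2: each bucket is written front to back through its own cursor.
    out = [None] * len(tuples)
    cursor = list(starts)
    for tup in tuples:
        b = tup[0] % num_buckets
        out[cursor[b]] = tup
        cursor[b] += 1
    return out, starts, counts


def bucket(out, starts, counts, b):
    """Return the tuples stored in bucket b."""
    return out[starts[b]:starts[b] + counts[b]]


# Hypothetical input, partitioned in two different arrival orders.
tuples = [(5, "e"), (2, "b"), (8, "h"), (1, "a"), (4, "d")]
forward = partition_sequential(tuples, 3)
backward = partition_sequential(list(reversed(tuples)), 3)

for b in range(3):
    # Same bucket contents regardless of arrival order; only the internal order differs.
    print(b, bucket(*forward, b), bucket(*backward, b))
```

Because any order within a bucket is acceptable, the writer never has to place a tuple at a specific slot, and each bucket's region is filled sequentially.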

Mondrian
Algorithm + hardware co-design for near-memory processing of data analytics
NMP Algorithms
– Use sequential memory accesses
– Avoid random memory accesses
– Target partitioning and compute phases
NMP Hardware
– High memory parallelism using simple SIMD hardware
– Exploit sequential memory accesses

Methodology
Flexus cycle-accurate simulator [Wenisch’06]
Big data operators:
– Scan
– Join
– Group By
– Sort
Memory subsystem:
• 4 HMC stacks
– 20 GB/s external BW
– 128 GB/s internal BW
Simulated systems:
• CPU-centric: ARM Cortex-A57
– 16 cores
– 3-wide, 128-entry ROB @ 2GHz
• NMP: Mobile ARM core
– 16 cores per stack
– 3-wide, 48-entry ROB @ 1GHz
• Mondrian: SIMD in-order
– 16 cores per stack
– 1024-bit SIMD @ 1GHz

Evaluation: Performance
[Figure: speedup (log scale, 1 to 100) of NMP and Mondrian for the Scan, Sort, Group by, and Join operators]
Mondrian achieves superior BW utilization
NMP can’t utilize memory BW with random accesses
Mondrian BW utilization compensates for extra log(n) work

Summary
End of technology scaling → must think vertical
– Software + hardware co-design
Big data analytics are a critical workload
– Large datasets, little locality → memory bottleneck!
Moving compute near memory improves performance
– But need to conform to DRAM constraints
Mondrian is algorithm-hardware NMP for analytics
– Adapt algorithms/HW to DRAM constraints
– Sequential rather than random memory access
– Simple hardware to exploit memory bandwidth

Thank you! Questions?
inf.ed.ac.uk/bgrot

Mondrian Energy Efficiency
[Figure: efficiency improvement (performance/energy; axis ticks 1, 10, 100) of NMP-OoO and Mondrian for the Scan, Sort, Group by, and Join operators]
