Towards General-Purpose Acceleration by Exploiting Common Data-Dependence Forms (MICRO 2019)


SLIDE 1

Towards General-Purpose Acceleration by Exploiting Common Data-Dependence Forms

Vidushi Dadu, Jian Weng, Sihao Liu, Tony Nowatzki (UCLA). MICRO 2019.

SLIDE 2

Challenging trade-off in domain-specific and domain-agnostic acceleration

[Figure: efficiency/generality spectrum, from DOMAIN-SPECIFIC (maximum efficiency) through DOMAIN-AGNOSTIC to CPU (maximum generality)]

REASON: Control/memory data-dependence

Domain-specific designs provide support for application-specific dependences; CPUs rely on vectorization and prefetching.

SLIDE 3

Challenging trade-off in domain-specific and domain-agnostic acceleration

[Figure: the same efficiency/generality spectrum; OUR GOAL is marked as a DOMAIN-AGNOSTIC design positioned toward the efficiency of domain-specific accelerators]

SLIDE 4

[Figure: Control Dependence vs. Memory Dependence. Memory dependence: a request vector (a[3], a[0], a[5], a[1]) accesses arbitrary locations in memory a[0..5]. Control dependence: branches select among arbitrary code regions (Code 1 to Code 4)]

Programmable accelerators (e.g., GPUs) fail to handle arbitrary control/memory dependence.

Insight: Restricted control and memory dependence is sufficient for many data-processing algorithms.

SLIDE 5

Outline

  • Irregularity is ubiquitous
  • Sufficient and Exploitable forms of Control and Memory dependence
  • Example Workload: Matrix Multiply
  • Exploiting data-dependence with SPU accelerator
  • uArch: Stream-join Dataflow & Compute-enabled Scratchpad
  • SPU Multicore Design
  • Evaluating SPU
  • Conclusion

SLIDE 6

Irregularity is Ubiquitous

Sparsity within the dataset (machine learning); data structures representing relationships (graphs); the need to reorder data (databases).

[Figure: example irregular workloads: pruned neural network, decision-tree building, Bayesian networks, sorting, database join (Table Z = Inner Join(X, Y)), triangle counting]

SLIDE 7

Irregularity Stems from Data-dependence

Main insight: there are narrow forms of dependence which are:

  • Sufficient to express many algorithms (from ML, graph analytics, databases)

  • Exploitable with minimal hardware overhead

Data-dependent aspects of execution

  • 1. Control flow: if(f(a[i]))
  • 2. Memory Access: b[a[i]]

Restricted control flow: Stream-Join. Restricted memory access: Alias-Free Indirection.
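To make the two data-dependent forms concrete, here is a minimal C sketch (not from the slides; the array contents are invented for illustration):

#include <stdio.h>

int main(void) {
    int a[5] = {3, 0, 4, 1, 2};        /* hypothetical data */
    int b[5] = {10, 20, 30, 40, 50};
    int sum = 0;
    for (int i = 0; i < 5; i++) {
        /* 1. Data-dependent control flow: whether the branch is taken
           depends on the loaded value a[i], not on the loop index. */
        if (a[i] > 1)
            sum += a[i];
        /* 2. Data-dependent memory access: the load address b + a[i]
           is itself computed from loaded data. */
        sum += b[a[i]];
    }
    printf("sum = %d\n", sum);
    return 0;
}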

SLIDE 8

Algorithm Classification

  • Regular: no control/memory dependence
  • Stream Join: restricted control dependence
  • Alias-free Indirect: restricted memory dependence
  • General irregularity: unrestricted control and memory dependence

SLIDE 9

Regular Example: Dense Matrix Multiply

[Figure: dense multiply example: Input Vector A (N) multiplied with Input Matrix B (NxN) and accumulated into Output Vector C (N)]

  • No data-dependence; i.e., the dynamic pattern of control and data access is known a priori (see the sketch below).
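For contrast, a dense version of this kernel in plain C (a minimal sketch, not from the slides) has a fully static pattern: every branch outcome and every address is a function of the loop indices alone.

#define N 4

/* Dense C = B * A: the control flow and the access pattern are
   known a priori, independent of the data values. */
void dense_mv(const float B[N][N], const float A[N], float C[N]) {
    for (int i = 0; i < N; i++) {
        float acc = 0.0f;
        for (int j = 0; j < N; j++)
            acc += B[i][j] * A[j];
        C[i] = acc;
    }
}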


Sparse matrix-multiply can be implemented in two ways:

  • 1. Inner product: Data-dependent control
  • 2. Outer product: Data-dependent memory
SLIDE 10

Sparse Inner Product Multiply (stream-join)

[Figure: vectors A and B[0] stored in CSR (Compressed Sparse Row) format as idx/val arrays; where the idx streams match, the values are multiplied (total += 3*1), and a conditional output of 0 means no multiplication]

  • Known memory access pattern, but unpredictability in control

SLIDE 11

Sparse Inner Product Multiply (stream-join)

[Figure: the same CSR idx/val streams for A and B[0] as on the previous slide]

  • Known memory access pattern, but unpredictability in control
  • Stream Join:
  • Memory read can be independent of data*
  • The order in which we consume the data streams is data-dependent

float sparse_dotp(row r1, r2)
  int i1=0, i2=0
  float total=0
  while (i1 < r1.cnt && i2 < r2.cnt)
    if (r1.idx[i1] == r2.idx[i2])
      total += r1.val[i1] * r2.val[i2]
      i1++; i2++
    elif (r1.idx[i1] > r2.idx[i2])
      i2++
    else
      i1++
  ...

(The compare-and-advance structure of this loop is indicative of stream-join.)
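The pseudocode above can be made runnable with a little scaffolding; a sketch in plain C, where the sparse_row struct layout and the example data are assumptions for illustration, not the SPU programming interface:

#include <stdio.h>

typedef struct {
    int cnt;            /* number of nonzeros in the row */
    const int *idx;     /* sorted column indices */
    const float *val;   /* corresponding values */
} sparse_row;

/* Stream-join dot product of two sorted sparse rows. */
float sparse_dotp(sparse_row r1, sparse_row r2) {
    int i1 = 0, i2 = 0;
    float total = 0.0f;
    while (i1 < r1.cnt && i2 < r2.cnt) {
        if (r1.idx[i1] == r2.idx[i2]) {          /* indices match: multiply */
            total += r1.val[i1] * r2.val[i2];
            i1++; i2++;
        } else if (r1.idx[i1] > r2.idx[i2]) {    /* r2's head can no longer match: drop it */
            i2++;
        } else {                                 /* r1's head can no longer match: drop it */
            i1++;
        }
    }
    return total;
}

int main(void) {
    /* hypothetical rows, indices sorted ascending */
    int ia[] = {2, 3, 4};  float va[] = {2, 3, 5};
    int ib[] = {1, 3, 4};  float vb[] = {1, 1, 2};
    sparse_row a = {3, ia, va}, b = {3, ib, vb};
    printf("%f\n", sparse_dotp(a, b));   /* matches at idx 3 and 4: 3*1 + 5*2 = 13 */
    return 0;
}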

SLIDE 12

Sparse Outer Product Multiply (Alias-free Indirection)

[Figure: sparse outer-product multiply: A (idx/val) and the matrix B stored in CSC (Compressed Sparse Column) format as idx/val arrays, with partial products accumulated into the output vector C]

  • High memory unpredictability, but known control pattern
  • No unknown dependencies (only atomic updates: out[i]=out[i]+prod[i])
SLIDE 13

Sparse Outer Product Multiply (Alias-free Indirection)

[Figure: the same outer-product example: A and B in CSC (Compressed Sparse Column) format, accumulating into the output vector C]

  • High memory unpredictability, but known control pattern
  • No unknown dependencies (only atomic updates: out[i]=out[i]+prod[i])
  • Alias-free Indirect:
  • Produce addresses depending on other data
  • Memory dependences, but no unknown (data-dependent) aliases

float sparse_mv(row r1, m2)
  ...
  for i1 = 0 to r1.cnt, ++i1
    cid = r1.idx[i1]
    for i2 = ptr[cid] to ptr[cid+1], ++i2
      out_vec[m2.idx[i2]] += r1.val[i1] * m2.val[i2]   // indirection
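A runnable C sketch of the same outer-product idea; the sparse_row and csc_matrix struct layouts here are assumptions for illustration, not the SPU interface:

typedef struct {
    int cnt;            /* number of nonzeros */
    const int *idx;     /* sorted indices */
    const float *val;   /* values */
} sparse_row;

typedef struct {
    const int *ptr;     /* column start offsets, length ncols+1 */
    const int *idx;     /* row index of each nonzero */
    const float *val;   /* value of each nonzero */
} csc_matrix;

/* out_vec[.] += r1 * m2, scattering partial products through m2.idx:
   the store addresses are data-dependent, but each iteration only
   accumulates, so there are no unknown aliases to check. */
void sparse_mv(sparse_row r1, csc_matrix m2, float *out_vec) {
    for (int i1 = 0; i1 < r1.cnt; ++i1) {
        int cid = r1.idx[i1];                                  /* column selected by A */
        for (int i2 = m2.ptr[cid]; i2 < m2.ptr[cid + 1]; ++i2)
            out_vec[m2.idx[i2]] += r1.val[i1] * m2.val[i2];    /* indirection */
    }
}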

SLIDE 14

Graph Mining (e.g. Triangle Counting)

[Figure: example graph (nodes a..f), its edge list (processed with stream-join), and per-node neighbor lists A..F (fetched with alias-free indirection)]

  • For every pair of connected nodes, find if they have a common neighbor (see the sketch below)
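As a sketch (not from the slides) of the per-edge step in plain C: for an edge (u, v), common neighbors are found by stream-joining the two sorted adjacency lists.

/* Count common neighbors of u and v, given their sorted neighbor lists.
   This is exactly the stream-join pattern from the sparse dot product. */
int common_neighbors(const int *adj_u, int len_u,
                     const int *adj_v, int len_v) {
    int i = 0, j = 0, count = 0;
    while (i < len_u && j < len_v) {
        if (adj_u[i] == adj_v[j])      { count++; i++; j++; }
        else if (adj_u[i] < adj_v[j])  { i++; }
        else                           { j++; }
    }
    return count;
}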

SLIDE 15

How each workload uses the two forms (stream-join use / alias-free indirection use):

  • Machine Learning, Neural Net (FC + Conv): Inner Product Mult. / Outer Product Mult.
  • Machine Learning, Supp. Vector (SVM): same as above
  • Machine Learning, Decision Trees (GBDT): condition on node type / sparse data access + histogramming
  • Machine Learning, Bayesian Networks: DAG access (indirect)
  • Databases, Join (inner): Sort-Join / Hash-Join
  • Databases, Sort: Merge-Sort / Radix-Sort
  • Databases, Filter: Generate Filtered Col. / Generate Column Ind.
  • Graph, PageRank & BFS: sparse join of active list / indirect acc. for edges
  • Graph, Triangle Counting: find common neighbor edges / indirect acc. for edges

SLIDE 16

Outline

  • Irregularity is ubiquitous
  • Sufficient and Exploitable forms of Control and Memory dependence
  • Example Workload: Matrix Multiply
  • Exploiting data-dependence with SPU accelerator
  • uArch: Stream-join Dataflow & Compute-enabled Scratchpad
  • SPU Multicore Design
  • Evaluating SPU
  • Conclusion

SLIDE 17

Approach: Start with a Dense Programmable Accelerator

[Figure: examples of dense accelerators (Google TPU v2, ISCA'17; PuDianNao, ASPLOS'15; Tabla, HPCA'16) and a stereotypical dense accelerator core: systolic array, control, router, and wide scratchpad]

SLIDE 18

Approach: Start with a Dense Programmable Accelerator

[Figure: the baseline core: systolic array, control, wide scratchpad, router]

SLIDE 19

Approach: Start with a Dense Programmable Accelerator

[Figure: the core with the systolic array extended to support stream-join control]

SLIDE 20

Approach: Start with a Dense Programmable Accelerator

[Figure: the core with a systolic array supporting stream-join control and a banked, compute-enabled scratchpad with an indirect reorder buffer (I-ROB) for fast alias-free indirect access]

SLIDE 21

Specializing for Stream Join

[Figure: the SPU core diagram again, highlighting the systolic array that supports stream-join control]

SLIDE 22

Novel Dataflow for Stream Join

[Figure: the sparse MM example as a traditional dataflow mapped onto a systolic array of PEs: Ld idxA / Ld idxB, compare, increment, address generation, Ld valA / Ld valB, multiply, accumulate. Problems: control-dependent loads, a cyclic dependence, and an unpredictable branch!]

SLIDE 23
Novel Dataflow for Stream Join

  • Observation: for a stream join, memory is (mostly) separable from computation
  • Idea: allow the dataflow to conditionally pop/discard/reset values based on control decisions

[Figure: sparse MM example, traditional dataflow vs. novel stream-join dataflow. Traditional: Ld idxA / Ld idxB, compare, increment, address generation, Ld valA / Ld valB, multiply, accumulate. Stream-join: strm idxA / strm idxB feed a compare node whose control outputs (>, <, =) gate how strm valA / strm valB are consumed, multiplied, and accumulated]
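As a software analogy for this dataflow (a sketch, not the actual SPU dataflow description), each input port can be modeled as a FIFO whose head is either popped or reused depending on the compare node's control output:

typedef struct { const int *data; int head, len; } fifo;     /* model of a stream port */

static int  peek (const fifo *f) { return f->data[f->head]; }
static int  empty(const fifo *f) { return f->head >= f->len; }
static void pop  (fifo *f)       { f->head++; }               /* consume the head */
/* "reuse" = simply do not pop: the same head is presented again next iteration */

float join_values(fifo idxA, fifo idxB, const float *valA, const float *valB) {
    float acc = 0.0f;
    while (!empty(&idxA) && !empty(&idxB)) {
        int cmp = peek(&idxA) - peek(&idxB);   /* the compare node */
        if (cmp == 0) {                        /* '=': fire the multiply, pop both */
            acc += valA[idxA.head] * valB[idxB.head];
            pop(&idxA); pop(&idxB);
        } else if (cmp < 0) {                  /* '<': discard A's head, reuse B's */
            pop(&idxA);
        } else {                               /* '>': discard B's head, reuse A's */
            pop(&idxB);
        }
    }
    return acc;
}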

SLIDE 24

Novel Dataflow for Stream Join

[Figure: the same traditional vs. stream-join dataflow, beginning to step through the sparse MM example: the heads of the idx streams reach the compare node]

SLIDE 25

Novel Dataflow for Stream Join

[Figure: the same dataflow, one step later: the '<' comparison result causes the smaller index to be consumed]

SLIDE 26

Novel Dataflow for Stream Join

[Figure: the same dataflow, after another consume step driven by the '<' result]

SLIDE 27

Other Kernels as Stream Join

[Figure: stream-join dataflow graphs for other kernels: Database Join (compare key streams strm key1 / strm key2, concatenate the matching value streams strm val1 / strm val2), Merge for sort (compare key streams, mux the selected key to the output stream), Resparsify/filter (compare a stream against sparse indices s-ind / values s-val), and ReLU (max with an accumulator initialized to 0)]
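For example, the merge step of merge-sort (one of the kernels above) is also a stream join in plain C; a minimal sketch, not from the slides:

/* Merge two sorted key streams into one sorted output stream. */
void merge_streams(const int *k1, int n1, const int *k2, int n2, int *out) {
    int i = 0, j = 0, o = 0;
    while (i < n1 && j < n2)                       /* compare heads, pop the smaller */
        out[o++] = (k1[i] <= k2[j]) ? k1[i++] : k2[j++];
    while (i < n1) out[o++] = k1[i++];             /* drain whichever stream remains */
    while (j < n2) out[o++] = k2[j++];
}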

SLIDE 28

Supporting Stream-Join in Hardware

[Figure: a CGRA processing element with a functional unit, accumulator (ACC), control logic, and input FIFOs (FIFO0-FIFO2); stream-join flow control adds discard/reuse/reset decisions alongside the normal dataflow. The PEs are arranged as a "systolic" CGRA]

SLIDE 29

Specializing for Indirection

[Figure: the SPU core diagram again, highlighting the compute-enabled banked scratchpad with I-ROB for fast alias-free indirect access]

SLIDE 30

Indirection with guaranteed alias-freedom

Traditional banked memory (typical GPU): banks behind an arbitrated crossbar to/from the compute fabric and network, with a dependence check over each request vector.

  • Observation: the dependence check serializes over a group of non-conflicting requests.
  • Idea: SPU allows aggressive reordering with a simple reorder buffer; it is known a priori that the requests will not alias, so serialization is not required.

SPU banked memory: address generation and per-bank queues behind the arbitrated crossbar, plus an indirect reorder buffer. This supports indirect updates, indirect access without a dependence check, and reordering for indirect reads.

[Figure: traditional (GPU-style) banked memory vs. SPU banked memory, handling a vector read request and a vector write request]

SLIDE 31

Indirect Access Reordering in Scratchpad

[Figure: indirect access reordering in the scratchpad: a vector read request and a vector write request, handled by a typical GPU vs. SPU]

SLIDE 32

SPU: Sparse Processing Unit

  • Network: traditional mesh NoC
  • Scratchpads in a global address space: for nearby-core communication
  • Broadcast: using the memory stream engine
  • Synchronization: dataflow counters (sync on SPAD write)

[Figure: SPU multicore: a grid of SPU cores and a memory stream engine connected to main memory; each core contains the systolic array with stream-join control and the compute-enabled banked scratchpad with I-ROB for alias-free indirect access]

SLIDE 33

Outline

  • Irregularity is ubiquitous
  • Sufficient and Exploitable forms of Control and Memory dependence
  • Example Workload: Matrix Multiply
  • Exploiting data-dependence with SPU accelerator
  • uArch: Stream-join Dataflow & Compute-enabled Scratchpad
  • SPU Multicore Design
  • Evaluating SPU
  • Conclusion

SLIDE 34

Methodology

  • Programming: C + intrinsics + dataflow graphs
  • SPU simulation: gem5 + Ruby (RISC-V in-order control core)

Workload implementations on CPU / GPU:

  • GBDT: LightGBM / LightGBM
  • Kernel-SVM: LibSVM / hand-coded
  • AC: hand-coded / hand-coded
  • FC: MKL SPBLAS / cuSparse
  • Conv layer: MKL-DNN / cuDNN
  • Graph Alg.: GraphMat
  • TPCH: MonetDB

Benchmarks and datasets (with varying sparsity):

  • GBDT: Cifar10-bin (1), Higgs-bin (0.28), Yahoo-bin (0.05), Ltrc-bin (0.008)
  • KSVM: Connect (0.33), Higgs (0.92), Yahoo (0.59), Ltrc (0.24)
  • CONV: VGG-3 (0.34), VGG-4 (0.1), ALEX-2 (0.14), RES-1 (0.05)
  • FC: VGG-12 (0.04), VGG-13 (0.09), ALEX-6 (0.16), RES-1 (0.22)
  • AC: Pigs, Munin, Andes, Mildew
  • Graph: Flickr, Fb-artist, NY-road

SLIDE 35

Domain-agnostic comparison points

Comparison points (overall design and core design):

  • P4000 Pascal GPU: SMs with SIMD units, L1 cache, and shared memory; on-chip memory 4MB, 3696 FP units, 243 GB/s memory bandwidth
  • SPU-inorder: an array of in-order cores with linear and banked scratchpads; on-chip memory 3MB, 2560 FP units, 256 GB/s memory bandwidth
  • SPU: cores with a systolic CGRA plus linear and banked scratchpads; on-chip memory 3MB, 2632 FP units, 256 GB/s memory bandwidth

SLIDE 36

Domain-specific comparison points

SCNN (Sparse convolution): ISCA’17 EIE (Sparse fully connected): ISCA’16 Graphicionado (Graph Analytics): MICRO’16 Q100 (Databases): ASPLOS’14

SLIDE 37

Overall Results

[Figure: speedup normalized over CPU (log scale, 1 to 100) for GPU, SPU-inorder, SPU, and ASICs (EIE, SCNN, Q100, Graphicionado). Machine learning: FC, CONV, KSVM, AC, GBDT, GM (geomean); graph processing: PR, BFS, GM; databases: N-SH, SH, GM]

SLIDE 38

Cost of adding stream-join in systolic CGRA

  • 1.69x area overhead due to the addition of flow control.
  • 1.63x power overhead.

Compared to the whole design, this is a 6.9% area overhead and a 14.2% power overhead.

Methodology: SPU's DGRA is implemented in Chisel and synthesized using Synopsys DC with a 28nm UMC technology library.

[Figure: normalized area and power of a traditional CGRA vs. a CGRA with stream-join support, on a 0.5 to 2 scale]

SLIDE 39

Conclusion

[Figure: qualitative placement of designs by efficiency on stream-join algorithms vs. efficiency on alias-free indirection algorithms: SCNN (Conv), EIE (FC), Graphicionado (Graph), Intel SpM-SpV, Q100, OuterSPACE, CPU, GPU, and SPU]

SLIDE 40

EXTRA SLIDES

SLIDE 41

Programming SPU

Example of gradient boosting decision trees (GBDT)

  • Stream-join control is expressed in the dataflow graph
  • Alias-free indirection is expressed as an update stream

SLIDE 42

Approach Overview

Stream-Dataflow ISA

  • Streaming memory: streams of data fetched from memory and stored back to memory.
  • Dataflow computation: dependence graph (DFG) with input/output vector ports.

[Figure: a small DFG (A[0..N] and B[0..N] multiplied and added into Out) with streams between memory, local storage, and the dataflow graph]

Sparsity-Enabled SPU Core

  • Compute-enabled, high-bandwidth indirect scratchpad
  • Decomposable memory/network/compute
  • Systolic array with novel meta-reuse control flow

[Figure: the core (systolic array, ctrl, router, wide scratchpad, I-ROB); dataflow computation maps to the systolic array, and streams of data flow from the wide scratchpad]

SLIDE 43

Note: the relative positions are the best to our knowledge.

[Figure: the same efficiency placement as the conclusion slide, with additional points for vector-thread architectures, Plasticine, and triggered instructions]

SLIDE 44

Indirect Access Reordering in Scratchpad

[Figure: I-ROB buffer contents at cycle 2, typical GPU vs. SPU]

SLIDE 45

SPU: Sparse Processing Unit

[Figure: three SPU multicore configurations (grids of SPU cores with a global stream engine and main memory) illustrating independent lanes with/without communication (fully connected layer: broadcast row), local spatial communication (sparse convolution: communication with neighbors for halos), and pipelined communication (arithmetic circuits: pipelined DAG traversal with pipelined node updates; graph processing: core-to-core communication)]

SLIDE 46

How much density is exploitable?

  • Bounded by memory bandwidth, sparse versions are better at less than 50% density.
  • SPU-sparse has exponential gain with sparsity.

SLIDE 47

Alias-free Indirection Abstractions

1: Indirect Memory -- d = a[b[i]]

  • Allows specifying indirect loads or stores using an input stream as the address values.
  • Offset list for array-of-structs organization.

Example C code:

  struct {int f1, f2;} a[N];
  for (i = 0; i < N; i++) {
    ind = b[i];
    .. = a[ind].f1;
    .. = a[ind].f2;
  }

Stream code:

  str1 = load(b[0..n])
  -> ind_load(addr=str1, offset_list={0,4})

SLIDE 48

Update stream

2: Histogram -- a[hist_bin] += c

  • Enhance ISA with compute-enabled semantics for the access stream
  • Add update stream for common reduction operations
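In plain C, the pattern that the update stream offloads looks like this (a sketch; the SPU expresses it as a stream rather than a loop, and the function name is hypothetical):

/* Histogram update: the store address hist + bin[i] is data-dependent,
   but the only cross-iteration dependence is the commutative += on a bin,
   so the updates can be freely reordered. */
void histogram_update(const int *bin, const float *c, float *hist, int n) {
    for (int i = 0; i < n; i++)
        hist[bin[i]] += c[i];
}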
SLIDE 49

Sparsity-Enhanced Memory Microarchitecture

[Figure: memory microarchitecture: linear and indirect address generation fed by a linear access stream table and an indirect rd/wr/atomic stream table, a MUX and arbiter into a crossbar (e.g., 16x32), rd-wr bank queues with control logic, a composable banked scratchpad, a linear scratchpad with its control unit, the NoC, and an indirect ROB to/from the compute fabric]

We keep two logical scratchpad memories: banked and linear.

  • Linear memory: reads/writes

SLIDE 50

Sparsity-Enhanced Memory Microarchitecture

[Figure: the same memory microarchitecture diagram as the previous slide]

We keep two logical scratchpad memories: banked and linear.

  • Linear memory: reads/writes
  • Banked memory: indirect writes/updates and indirect reads

SLIDE 51

Benefits of Heterogeneity on CGRA

  • Effective vectorization width is increased by 64 / data-width.
  • DGRA supports Concat and Extract using sub-networks.

[Figure: example DFG mapping to the DGRA: 16-bit Mul and Sub nodes are combined via Concat into 32-bit and 64-bit Mul operations over inputs A, B, C, D; CGRA datapath width = 64 bits, initial datatypes = 16 bits]

SLIDE 52

Sparsity-Enhanced Computation Microarchitecture

  • DGRA switch: same external interface, but splits inputs and outputs.
  • DGRA processing element: decomposed into fine-grained PEs.

SLIDE 53

Cost of Decomposability & Stream-Join

[Figure: normalized cost of the systolic CGRA as stream-join support is added and as the design is made more decomposable]

SLIDE 54

Remaining Challenges

  • Generality
  • What about other forms of irregularity? (task-based?)
  • Programmability Challenges
  • Workload balance: (1) same amount of work in each core; (2) efficient use of available on-chip memory (global addressing helps in this case)
  • Partitioning of computation/memory (locality & parallelism)
  • Programming in low-level intrinsics (dataflow compute & stream memory)

  • Virtualization/integration with CPU

SLIDE 55

Domain-agnostic comparison points

  • 24-core Intel Skylake CPU
  • P4000 Pascal GPU
  • SPU-inorder
  • SPU

[Figure: overall design and core design of each comparison point: CPU out-of-order cores with L1/L2 caches, GPU SMs with L1 cache and shared memory, and SPU cores with linear and banked scratchpads; annotated capacities include 4MB and 2.5MB of on-chip memory and FP-unit counts of 3584 and 2048]