
Towards General-Purpose Acceleration by Exploiting Common Data-Dependence Forms

Towards General-Purpose Acceleration by Exploiting Common Data-Dependence Forms. Vidushi Dadu, Jian Weng, Sihao Liu, Tony Nowatzki. UCLA. MICRO 2019.


  1. Towards General-Purpose Acceleration by Exploiting Common Data-Dependence Forms. Vidushi Dadu, Jian Weng, Sihao Liu, Tony Nowatzki. UCLA. MICRO 2019.

  2. Challenging trade-off in domain-specific and domain-agnostic acceleration. • CPU: maximum generality • Domain-agnostic: relies on vectorization and prefetching • Domain-specific: support for application-specific dependencies; maximum efficiency • Reason: control/memory data-dependence

  3. Challenging trade-off in domain-specific and domain-agnostic acceleration. • CPU: maximum generality • Domain-agnostic: relies on vectorization and prefetching • Domain-specific: support for application-specific dependencies; maximum efficiency • Our goal: domain-agnostic acceleration

  4. Programmable Accelerators (e.g. GPUs) Fail to Handle Arbitrary Control/Memory Dependence. • Memory dependence: arbitrary vector access (e.g. requesting a[3], a[0], a[5], a[1] from memory holding a[0] to a[5]) • Control dependence: arbitrary code execution location (a branch may lead to any of Code 1 to Code 4) • Insight: Restricted control and memory dependence is sufficient for many data-processing algorithms.

  5. Outline • Irregularity is ubiquitous • Sufficient and exploitable forms of control and memory dependence • Example workload: Matrix Multiply • Exploiting data-dependence with SPU accelerator • uArch: Stream-join Dataflow & Compute-enabled Scratchpad • SPU Multicore Design • Evaluating SPU • Conclusion

  6. Irregularity is Ubiquitous. • Sparsity within dataset (Machine Learning): pruned neural network, decision tree building • Data-structures representing relationships (Graphs): Bayesian networks, triangle counting • Purpose to reorder data (Databases): sorting, database join (Table Z = Inner Join (X, Y))

  7. Irregularity Stems from Data-dependence. Data-dependent aspects of execution: (1) Control flow: if (f(a[i])). Restricted control flow: Stream-Join. (2) Memory access: b[a[i]]. Restricted memory access: Alias-Free Indirection. Main insight: There are narrow forms of dependence which are: • Sufficient to express many algorithms (from ML, graph analytics, databases) • Exploitable with minimal hardware overhead
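
The two data-dependent aspects named on this slide can be illustrated with a minimal Python sketch. The function names and the predicate `f` are hypothetical, chosen only for illustration; the slide gives just the expressions `if (f(a[i]))` and `b[a[i]]`.

```python
def data_dependent_control(a, f):
    """Control flow depends on data: the branch outcome is unknown until a[i] is read."""
    total = 0
    for x in a:
        if f(x):  # which iterations do work is decided by the data itself
            total += x
    return total


def data_dependent_memory(a, b):
    """Memory access depends on data: the address into b is produced by a[i]."""
    return [b[index] for index in a]  # indirection b[a[i]]


if __name__ == "__main__":
    print(data_dependent_control([1, 2, 3, 4], lambda x: x % 2 == 0))
    print(data_dependent_memory([2, 0, 1], [10, 20, 30]))
```

In the first function the loop bounds are known but the executed path is not; in the second the path is fixed but the addresses are not.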

  8. Algorithm Classification. • Regular: no control/memory dependence • Stream Join: restricted control dependence • Alias-free Indirect: restricted memory dependence • General Irregularity

  9. Regular Example: Dense Matrix Multiply. Input Vector A (N) × Input Matrix B (NxN), summed (∑) into Output Vector C (N). • No data-dependence; i.e. the dynamic pattern of: • Control • Data Access • … is known a priori. Sparse matrix-multiply can be implemented in two ways: (1) Inner product: data-dependent control. (2) Outer product: data-dependent memory. (Figure: example sparse input vector A and sparse NxN input matrix B.)
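
As a point of comparison for the sparse variants, here is a minimal sketch of the regular (dense) case, assuming the slide's layout of a length-N input vector A multiplied by an NxN matrix B into a length-N output vector C. The function name `dense_vec_mat` is an assumption, not from the slides.

```python
def dense_vec_mat(a, b):
    """Compute c[j] = sum over i of a[i] * b[i][j] for a length-N vector and an NxN matrix."""
    n = len(a)
    c = [0] * n
    for j in range(n):      # loop bounds depend only on N
        for i in range(n):  # every address is a fixed function of (i, j)
            c[j] += a[i] * b[i][j]
    return c


if __name__ == "__main__":
    print(dense_vec_mat([1, 2], [[1, 2], [3, 4]]))
```

Neither the branches nor the addresses depend on the values in `a` or `b`, which is what makes the pattern known a priori.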

  10. Sparse Inner Product Multiply (stream-join). CSR format: Compressed Sparse Row. A: idx 2 3 5, val 2 3 4. B[0]: idx 0 1 3, val 1 4 1. Matching index 3 gives total += 3 * 1. Output of conditional: 0 0 0 1 0 (a conditional output of 0 means no multiplication). • Known memory access pattern, but unpredictability in control

  11. Sparse Inner Product Multiply (stream-join). CSR format: Compressed Sparse Row. A: idx 2 3 5, val 2 3 4. B[0]: idx 0 1 3, val 1 4 1. Matching index 3 gives total += 3 * 1. Output of conditional: 0 0 0 1 0 (a conditional output of 0 means no multiplication).
     float sparse_dotp(row r1, r2)
       int i1 = 0, i2 = 0
       float total = 0
       while (i1 < r1.cnt && i2 < r2.cnt)
         if (r1.idx[i1] == r2.idx[i2])
           total += r1.val[i1] * r2.val[i2]
           i1++; i2++
         elif (r1.idx[i1] < r2.idx[i2])    // indicative of Stream-Join: advance the stream with the smaller index
           i1++
         else
           i2++
       ...
     • Known memory access pattern, but unpredictability in control • Stream Join: • Memory read can be independent of data* • Order that we consume streams of data is data-dependent
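
A runnable version of the stream-join inner product can be sketched in Python as follows. It assumes each sparse row is given as two parallel lists (sorted indices and values) rather than the slide's `row` struct, and it advances whichever stream currently holds the smaller index.

```python
def sparse_dotp(idx1, val1, idx2, val2):
    """Dot product of two sparse rows given as sorted index lists with parallel value lists."""
    i1 = i2 = 0
    total = 0
    while i1 < len(idx1) and i2 < len(idx2):
        if idx1[i1] == idx2[i2]:
            # Indices match: this is the only case that multiplies.
            total += val1[i1] * val2[i2]
            i1 += 1
            i2 += 1
        elif idx1[i1] < idx2[i2]:
            i1 += 1  # row 1 is behind, consume from row 1
        else:
            i2 += 1  # row 2 is behind, consume from row 2
    return total


if __name__ == "__main__":
    # Values from the slide: A has idx 2 3 5 / val 2 3 4, B[0] has idx 0 1 3 / val 1 4 1.
    print(sparse_dotp([2, 3, 5], [2, 3, 4], [0, 1, 3], [1, 4, 1]))
```

On the slide's data only index 3 appears in both rows, so the single multiplication is 3 * 1. The reads walk each list sequentially; only the order of consumption depends on the data.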

  12. Sparse Outer Product Multiply (Alias-free Indirection). CSC: Compressed Sparse Column. A: idx 1 3 5, val 2 3 4. B: idx 0 1 5 3 4 0 3 5 0 3, val 1 2 2 3 2 4 3 5 1 1. Products are accumulated into output vector C. • High memory unpredictability, but known control pattern • No unknown dependencies (only atomic updates: out[i] = out[i] + prod[i])

  13. Sparse Outer Product Multiply (Alias-free Indirection). CSC: Compressed Sparse Column. A: idx 1 3 5, val 2 3 4. B: idx 0 1 5 3 4 0 3 5 0 3, val 1 2 2 3 2 4 3 5 1 1. Products are accumulated into output vector C.
     float sparse_mv(row r1, m2)
       ...
       for i1 = 0 to r1.cnt, ++i1
         cid = r1.idx[i1]
         for i2 = ptr[cid] to ptr[cid+1], ++i2
           out_vec[m2.idx[i2]] += r1.val[i1] * m2.val[i2]    // indirection
     • High memory unpredictability, but known control pattern • No unknown dependencies (only atomic updates: out[i] = out[i] + prod[i]) • Alias-free Indirect: • Produce addresses depending on other data • Memory dependences, but no unknown (data-dependent) aliases
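
The outer-product loop can be sketched in runnable Python as below. This is an illustration under assumptions: the matrix is passed as three flat lists (`ptr`, `idx`, `val`) in a compressed layout, the output length `n_out` is an added parameter, and the 3x3 example data is hypothetical rather than taken from the slide.

```python
def sparse_mv(vec_idx, vec_val, ptr, idx, val, n_out):
    """Multiply a compressed sparse matrix by a sparse vector, scattering into out_vec."""
    out_vec = [0] * n_out
    for i1 in range(len(vec_idx)):
        cid = vec_idx[i1]  # which compressed column to walk
        for i2 in range(ptr[cid], ptr[cid + 1]):
            # Indirection: the output address comes from idx[i2]. The only
            # dependence between iterations is the accumulate on out_vec.
            out_vec[idx[i2]] += vec_val[i1] * val[i2]
    return out_vec


if __name__ == "__main__":
    # Hypothetical matrix [[1, 0, 2], [0, 3, 0], [4, 0, 5]] stored column by column.
    ptr = [0, 2, 3, 5]
    idx = [0, 2, 1, 0, 2]
    val = [1, 4, 3, 2, 5]
    # Sparse vector [1, 0, 2].
    print(sparse_mv([0, 2], [1, 2], ptr, idx, val, 3))
```

The loop trip counts come from `ptr`, so control is predictable once the structure is loaded, while the write addresses are not known until `idx` is read.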

  14. Graph Mining (e.g. Triangle Counting). • For every pair of connected nodes (alias-free indirect), find if they have a common neighbor (stream-join). (Figure: example graph with nodes a to f and its edge list.)
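
One way to combine the two forms for triangle counting is sketched below. It assumes an oriented adjacency list in which `adj[u]` holds, in sorted order, only the neighbors with an id larger than `u`, so each triangle is counted once. This representation and the function names are assumptions for illustration, not the deck's implementation.

```python
def count_common(n1, n2):
    """Stream-join over two sorted neighbor lists: count the ids present in both."""
    i = j = count = 0
    while i < len(n1) and j < len(n2):
        if n1[i] == n2[j]:
            count += 1
            i += 1
            j += 1
        elif n1[i] < n2[j]:
            i += 1
        else:
            j += 1
    return count


def count_triangles(adj):
    """adj[u] is the sorted list of neighbors of u whose id is larger than u."""
    triangles = 0
    for u in range(len(adj)):
        for v in adj[u]:
            # Indirect read: which list adj[v] we fetch depends on the edge data.
            triangles += count_common(adj[u], adj[v])
    return triangles


if __name__ == "__main__":
    # Hypothetical graph: triangle 0-1-2 plus a pendant edge 2-3.
    print(count_triangles([[1, 2], [2], [3], []]))
```

Fetching `adj[v]` is the indirect access for edges, and `count_common` is the stream-join that finds common neighbors.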

  15. Domain | Algorithm | Stream Join (irreg. control) | Alias-free Indirection (irregular memory)
     Machine Learning | Neural Net (FC + Conv) | Inner Product Mult. | Outer Product Mult.
     Machine Learning | Supp. Vector (SVM) | same as above | same as above
     Machine Learning | Decision Trees (GBDT) | Sparse data access | + Histogramming
     Machine Learning | Bayesian Networks | Condition on node type | + DAG Access
     Graph | Page Rank & BFS | Sparse join of active list | + Indirect acc. for edges
     Graph | Triangle Counting | Find common neighbor edges | + Indirect acc. for edges
     Databases | Join (inner) | Sort-Join | Hash-Join
     Databases | Sort | Merge-Sort | Radix-Sort
     Databases | Filter | Generate Filtered Col. | Generate Column Ind.
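
To make the database row concrete, the sketch below contrasts a sort-join (stream-join form) with a hash-join (indirect-access form) for an inner join on unique keys. The key values and function names are hypothetical, and a Python `dict` stands in for the hash table.

```python
def sort_join(x_keys, y_keys):
    """Inner join of two sorted, duplicate-free key lists by merging them as streams."""
    i = j = 0
    out = []
    while i < len(x_keys) and j < len(y_keys):
        if x_keys[i] == y_keys[j]:
            out.append(x_keys[i])
            i += 1
            j += 1
        elif x_keys[i] < y_keys[j]:
            i += 1
        else:
            j += 1
    return out


def hash_join(x_keys, y_keys):
    """Inner join by building a table on y and probing it with each key of x."""
    table = {}
    for position, key in enumerate(y_keys):
        table[key] = position  # build phase: data-dependent write address
    return [key for key in x_keys if key in table]  # probe phase: data-dependent read


if __name__ == "__main__":
    x = ["A", "B", "D", "F"]
    y = ["B", "C", "F", "G"]
    print(sort_join(x, y), hash_join(x, y))
```

Both functions return the same keys; the sort-join has predictable reads and data-dependent control, while the hash-join has simple control and data-dependent addresses.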

  16. Outline • Irregularity is ubiquitous • Sufficient and exploitable forms of control and memory dependence • Example workload: Matrix Multiply • Exploiting data-dependence with SPU accelerator • uArch: Stream-join Dataflow & Compute-enabled Scratchpad • SPU Multicore Design • Evaluating SPU • Conclusion

  17. Approach: Start with a Dense Programmable Accelerator. Examples: PuDianNao (ASPLOS’15), Google TPU v2 (ISCA’17), Tabla (HPCA’16). Stereotypical dense accelerator core: wide scratchpad, router, control (Ctrl), systolic array.

  18. Approach: Start with a Dense Programmable Accelerator. Components: wide scratchpad, router, control (Ctrl), systolic array.

  19. Approach: Start with a Dense Programmable Accelerator. Components: wide scratchpad, router, control (Ctrl), systolic array. • Systolic array supporting stream-join control

  20. Approach: Start with a Dense Programmable Accelerator. Components: bank scratchpad with I-ROB, router, control (Ctrl), systolic array. • Compute-enabled scratchpad for fast alias-free indirect access • Systolic array supporting stream-join control

  21. Specializing for Stream Join. Components: bank scratchpad with I-ROB, router, control (Ctrl), systolic array. • Compute-enabled scratchpad for fast alias-free indirect access • Systolic array supporting stream-join control

  22. Novel Dataflow for Stream Join. Sparse MM example: A: idx 2 3 5, val 2 3 4; B[0]: idx 0 1 3, val 1 4 1. Traditional dataflow, mapped onto a systolic array of PEs: Gen addr, Ld idxA / Ld idxB, Cmp (<=, >=, =), ++, Gen addr, Ld valA / Ld valB, ×, acc. Problems: control-dependent load, cyclic dependence, unpredictable branch!

  23. Novel Dataflow for Stream Join. Sparse MM example: A: idx 2 3 5, val 2 3 4; B[0]: idx 0 1 3, val 1 4 1. Traditional dataflow: Gen addr, Ld idxA / Ld idxB, Cmp (<=, >=, =), ++, Gen addr, Ld valA / Ld valB, ×, acc. Novel stream-join dataflow: streams idxA and idxB feed Cmp (>, <, =), whose control output c steers streams valA and valB into × and acc (with init). • Observation: For a stream join, memory is (mostly) separable from computation • Idea: Allow dataflow to conditionally pop/discard/reset values based on control decisions.
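
The idea of popping or discarding stream values based on a control decision can be modeled in software as below. This is only a behavioral sketch: `collections.deque` objects stand in for the four input streams, the function name and the returned trace are assumptions, and it does not describe the SPU hardware itself.

```python
from collections import deque


def stream_join_dataflow(idx_a, val_a, idx_b, val_b):
    """Model a compare node steering a multiply-accumulate node over four streams."""
    strm_idx_a, strm_idx_b = deque(idx_a), deque(idx_b)
    strm_val_a, strm_val_b = deque(val_a), deque(val_b)
    acc = 0      # accumulator, initialised once per join
    trace = []   # control decisions, recorded for inspection only
    while strm_idx_a and strm_idx_b:
        # Compare node: looks at the stream heads and emits a control token.
        # Convention here: "<" means the head of A is smaller than the head of B.
        if strm_idx_a[0] == strm_idx_b[0]:
            ctrl = "="
        elif strm_idx_a[0] < strm_idx_b[0]:
            ctrl = "<"
        else:
            ctrl = ">"
        trace.append(ctrl)
        if ctrl == "=":
            # Pop both heads and use both values.
            strm_idx_a.popleft()
            strm_idx_b.popleft()
            acc += strm_val_a.popleft() * strm_val_b.popleft()
        elif ctrl == "<":
            # Discard the head of A without computing.
            strm_idx_a.popleft()
            strm_val_a.popleft()
        else:
            # Discard the head of B without computing.
            strm_idx_b.popleft()
            strm_val_b.popleft()
    return acc, trace


if __name__ == "__main__":
    print(stream_join_dataflow([2, 3, 5], [2, 3, 4], [0, 1, 3], [1, 4, 1]))
```

The streams are filled without regard to the comparison results, which reflects the observation that memory is mostly separable from computation; only the pop and discard decisions depend on the data.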

  24. Novel Dataflow for Stream Join. Sparse MM example: A: idx 2 3 5, val 2 3 4; B[0]: idx 0 1 3, val 1 4 1. (Figure, animation step: traditional dataflow beside the novel stream-join dataflow, with the first index values of streams idxA and idxB in flight toward the Cmp node.)

  25. Novel Dataflow for Stream Join. Sparse MM example: A: idx 2 3 5, val 2 3 4; B[0]: idx 0 1 3, val 1 4 1. (Figure, animation step: traditional dataflow beside the novel stream-join dataflow; Cmp receives index 2 from idxA and index 0 from idxB and signals which stream to consume.)

  26. Novel Dataflow for Stream Join. Sparse MM example: A: idx 2 3 5, val 2 3 4; B[0]: idx 0 1 3, val 1 4 1. (Figure, animation step: traditional dataflow beside the novel stream-join dataflow; Cmp receives index 2 from idxA and index 1 from idxB and signals which stream to consume.)
