Towards General-Purpose Acceleration by Exploiting Common Data-Dependence Forms (MICRO 2019)


SLIDE 1

Towards General-Purpose Acceleration by Exploiting Common Data-Dependence Forms

Vidushi Dadu, Jian Weng, Sihao Liu, Tony Nowatzki (UCLA). MICRO 2019.

SLIDE 2

Challenging trade-off in domain-specific and domain-agnostic acceleration

[Figure: efficiency/generality spectrum, from DOMAIN-SPECIFIC (maximum efficiency) through DOMAIN-AGNOSTIC to CPU (maximum generality)]

REASON: Control/memory data-dependence

Domain-specific designs provide support for application-specific dependences; CPUs rely on vectorization and prefetching.

SLIDE 3

Challenging trade-off in domain-specific and domain-agnostic acceleration

[Figure: the same efficiency/generality spectrum; OUR GOAL is marked as a DOMAIN-AGNOSTIC design positioned toward the efficiency of domain-specific accelerators]

SLIDE 4

[Figure: Control Dependence vs. Memory Dependence. Memory dependence: a request vector (a[3], a[0], a[5], a[1]) accesses arbitrary locations in memory a[0..5]. Control dependence: branches select among arbitrary code regions (Code 1 to Code 4)]

Programmable accelerators (e.g., GPUs) fail to handle arbitrary control/memory dependence.

Insight: Restricted control and memory dependence is sufficient for many data-processing algorithms.

SLIDE 5

Outline

  • Irregularity is ubiquitous
  • Sufficient and Exploitable forms of Control and Memory dependence
  • Example Workload: Matrix Multiply
  • Exploiting data-dependence with SPU accelerator
  • uArch: Stream-join Dataflow & Compute-enabled Scratchpad
  • SPU Multicore Design
  • Evaluating SPU
  • Conclusion

SLIDE 6

Irregularity is Ubiquitous

Sparsity within the dataset (machine learning); data structures representing relationships (graphs); the need to reorder data (databases).

[Figure: example irregular workloads: pruned neural network, decision-tree building, Bayesian networks, sorting, database join (Table Z = Inner Join(X, Y)), triangle counting]

SLIDE 7

Irregularity Stems from Data-dependence

Main insight: there are narrow forms of dependence which are:

  • Sufficient to express many algorithms (from ML, graph analytics, databases)

  • Exploitable with minimal hardware overhead

Data-dependent aspects of execution

  • 1. Control flow: if(f(a[i]))
  • 2. Memory Access: b[a[i]]

Restricted control flow: Stream-Join. Restricted memory access: Alias-Free Indirection.
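To make the two data-dependent forms concrete, here is a minimal C sketch (not from the slides; the array contents are invented for illustration):

#include <stdio.h>

int main(void) {
    int a[5] = {3, 0, 4, 1, 2};        /* hypothetical data */
    int b[5] = {10, 20, 30, 40, 50};
    int sum = 0;
    for (int i = 0; i < 5; i++) {
        /* 1. Data-dependent control flow: whether the branch is taken
           depends on the loaded value a[i], not on the loop index. */
        if (a[i] > 1)
            sum += a[i];
        /* 2. Data-dependent memory access: the load address b + a[i]
           is itself computed from loaded data. */
        sum += b[a[i]];
    }
    printf("sum = %d\n", sum);
    return 0;
}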

SLIDE 8

Algorithm Classification

  • Regular: no control/memory dependence
  • Stream Join: restricted control dependence
  • Alias-free Indirect: restricted memory dependence
  • General irregularity: unrestricted control and memory dependence

SLIDE 9

Regular Example: Dense Matrix Multiply

[Figure: dense multiply example: Input Vector A (N) multiplied with Input Matrix B (NxN) and accumulated into Output Vector C (N)]

  • No data-dependence; i.e., the dynamic pattern of control and data access is known a priori (see the sketch below).
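For contrast, a dense version of this kernel in plain C (a minimal sketch, not from the slides) has a fully static pattern: every branch outcome and every address is a function of the loop indices alone.

#define N 4

/* Dense C = B * A: the control flow and the access pattern are
   known a priori, independent of the data values. */
void dense_mv(const float B[N][N], const float A[N], float C[N]) {
    for (int i = 0; i < N; i++) {
        float acc = 0.0f;
        for (int j = 0; j < N; j++)
            acc += B[i][j] * A[j];
        C[i] = acc;
    }
}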


Sparse matrix-multiply can be implemented in two ways:

  • 1. Inner product: Data-dependent control
  • 2. Outer product: Data-dependent memory
SLIDE 10

Sparse Inner Product Multiply (stream-join)

[Figure: vectors A and B[0] stored in CSR (Compressed Sparse Row) format as idx/val arrays; where the idx streams match, the values are multiplied (total += 3*1), and a conditional output of 0 means no multiplication]

  • Known memory access pattern, but unpredictability in control

SLIDE 11

Sparse Inner Product Multiply (stream-join)

[Figure: the same CSR idx/val streams for A and B[0] as on the previous slide]

  • Known memory access pattern, but unpredictability in control
  • Stream Join:
  • Memory read can be independent of data*
  • The order in which we consume the data streams is data-dependent

float sparse_dotp(row r1, r2)
  int i1=0, i2=0
  float total=0
  while (i1 < r1.cnt && i2 < r2.cnt)
    if (r1.idx[i1] == r2.idx[i2])
      total += r1.val[i1] * r2.val[i2]
      i1++; i2++
    elif (r1.idx[i1] > r2.idx[i2])
      i2++
    else
      i1++
  ...

(The compare-and-advance structure of this loop is indicative of stream-join.)
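The pseudocode above can be made runnable with a little scaffolding; a sketch in plain C, where the sparse_row struct layout and the example data are assumptions for illustration, not the SPU programming interface:

#include <stdio.h>

typedef struct {
    int cnt;            /* number of nonzeros in the row */
    const int *idx;     /* sorted column indices */
    const float *val;   /* corresponding values */
} sparse_row;

/* Stream-join dot product of two sorted sparse rows. */
float sparse_dotp(sparse_row r1, sparse_row r2) {
    int i1 = 0, i2 = 0;
    float total = 0.0f;
    while (i1 < r1.cnt && i2 < r2.cnt) {
        if (r1.idx[i1] == r2.idx[i2]) {          /* indices match: multiply */
            total += r1.val[i1] * r2.val[i2];
            i1++; i2++;
        } else if (r1.idx[i1] > r2.idx[i2]) {    /* r2's head can no longer match: drop it */
            i2++;
        } else {                                 /* r1's head can no longer match: drop it */
            i1++;
        }
    }
    return total;
}

int main(void) {
    /* hypothetical rows, indices sorted ascending */
    int ia[] = {2, 3, 4};  float va[] = {2, 3, 5};
    int ib[] = {1, 3, 4};  float vb[] = {1, 1, 2};
    sparse_row a = {3, ia, va}, b = {3, ib, vb};
    printf("%f\n", sparse_dotp(a, b));   /* matches at idx 3 and 4: 3*1 + 5*2 = 13 */
    return 0;
}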

SLIDE 12

Sparse Outer Product Multiply (Alias-free Indirection)

[Figure: sparse outer-product multiply: A (idx/val) and the matrix B stored in CSC (Compressed Sparse Column) format as idx/val arrays, with partial products accumulated into the output vector C]

  • High memory unpredictability, but known control pattern
  • No unknown dependencies (only atomic updates: out[i]=out[i]+prod[i])
SLIDE 13

Sparse Outer Product Multiply (Alias-free Indirection)

[Figure: the same outer-product example: A and B in CSC (Compressed Sparse Column) format, accumulating into the output vector C]

  • High memory unpredictability, but known control pattern
  • No unknown dependencies (only atomic updates: out[i]=out[i]+prod[i])
  • Alias-free Indirect:
  • Produce addresses depending on other data
  • Memory dependences, but no unknown (data-dependent) aliases

float sparse_mv(row r1, m2)
  ...
  for i1 = 0 to r1.cnt, ++i1
    cid = r1.idx[i1]
    for i2 = ptr[cid] to ptr[cid+1], ++i2
      out_vec[m2.idx[i2]] += r1.val[i1] * m2.val[i2]   // indirection
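A runnable C sketch of the same outer-product idea; the sparse_row and csc_matrix struct layouts here are assumptions for illustration, not the SPU interface:

typedef struct {
    int cnt;            /* number of nonzeros */
    const int *idx;     /* sorted indices */
    const float *val;   /* values */
} sparse_row;

typedef struct {
    const int *ptr;     /* column start offsets, length ncols+1 */
    const int *idx;     /* row index of each nonzero */
    const float *val;   /* value of each nonzero */
} csc_matrix;

/* out_vec[.] += r1 * m2, scattering partial products through m2.idx:
   the store addresses are data-dependent, but each iteration only
   accumulates, so there are no unknown aliases to check. */
void sparse_mv(sparse_row r1, csc_matrix m2, float *out_vec) {
    for (int i1 = 0; i1 < r1.cnt; ++i1) {
        int cid = r1.idx[i1];                                  /* column selected by A */
        for (int i2 = m2.ptr[cid]; i2 < m2.ptr[cid + 1]; ++i2)
            out_vec[m2.idx[i2]] += r1.val[i1] * m2.val[i2];    /* indirection */
    }
}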

SLIDE 14

Graph Mining (e.g. Triangle Counting)

[Figure: example graph (nodes a..f), its edge list (processed with stream-join), and per-node neighbor lists A..F (fetched with alias-free indirection)]

  • For every pair of connected nodes, find if they have a common neighbor (see the sketch below)
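As a sketch (not from the slides) of the per-edge step in plain C: for an edge (u, v), common neighbors are found by stream-joining the two sorted adjacency lists.

/* Count common neighbors of u and v, given their sorted neighbor lists.
   This is exactly the stream-join pattern from the sparse dot product. */
int common_neighbors(const int *adj_u, int len_u,
                     const int *adj_v, int len_v) {
    int i = 0, j = 0, count = 0;
    while (i < len_u && j < len_v) {
        if (adj_u[i] == adj_v[j])      { count++; i++; j++; }
        else if (adj_u[i] < adj_v[j])  { i++; }
        else                           { j++; }
    }
    return count;
}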

SLIDE 15

How each workload uses the two forms (stream-join use / alias-free indirection use):

  • Machine Learning, Neural Net (FC + Conv): Inner Product Mult. / Outer Product Mult.
  • Machine Learning, Supp. Vector (SVM): same as above
  • Machine Learning, Decision Trees (GBDT): condition on node type / sparse data access + histogramming
  • Machine Learning, Bayesian Networks: DAG access (indirect)
  • Databases, Join (inner): Sort-Join / Hash-Join
  • Databases, Sort: Merge-Sort / Radix-Sort
  • Databases, Filter: Generate Filtered Col. / Generate Column Ind.
  • Graph, PageRank & BFS: sparse join of active list / indirect acc. for edges
  • Graph, Triangle Counting: find common neighbor edges / indirect acc. for edges

SLIDE 16

Outline

  • Irregularity is ubiquitous
  • Sufficient and Exploitable forms of Control and Memory dependence
  • Example Workload: Matrix Multiply
  • Exploiting data-dependence with SPU accelerator
  • uArch: Stream-join Dataflow & Compute-enabled Scratchpad
  • SPU Multicore Design
  • Evaluating SPU
  • Conclusion

SLIDE 17

Approach: Start with a Dense Programmable Accelerator

[Figure: examples of dense accelerators (Google TPU v2, ISCA'17; PuDianNao, ASPLOS'15; Tabla, HPCA'16) and a stereotypical dense accelerator core: systolic array, control, router, and wide scratchpad]

SLIDE 18

Approach: Start with a Dense Programmable Accelerator

[Figure: the baseline core: systolic array, control, wide scratchpad, router]

SLIDE 19

Approach: Start with a Dense Programmable Accelerator

[Figure: the core with the systolic array extended to support stream-join control]

SLIDE 20

Approach: Start with a Dense Programmable Accelerator

[Figure: the core with a systolic array supporting stream-join control and a banked, compute-enabled scratchpad with an indirect reorder buffer (I-ROB) for fast alias-free indirect access]

SLIDE 21

Specializing for Stream Join

[Figure: the SPU core diagram again, highlighting the systolic array that supports stream-join control]

SLIDE 22

Novel Dataflow for Stream Join

[Figure: the sparse MM example as a traditional dataflow mapped onto a systolic array of PEs: Ld idxA / Ld idxB, compare, increment, address generation, Ld valA / Ld valB, multiply, accumulate. Problems: control-dependent loads, a cyclic dependence, and an unpredictable branch!]

SLIDE 23
Novel Dataflow for Stream Join

  • Observation: for a stream join, memory is (mostly) separable from computation
  • Idea: allow the dataflow to conditionally pop/discard/reset values based on control decisions

[Figure: sparse MM example, traditional dataflow vs. novel stream-join dataflow. Traditional: Ld idxA / Ld idxB, compare, increment, address generation, Ld valA / Ld valB, multiply, accumulate. Stream-join: strm idxA / strm idxB feed a compare node whose control outputs (>, <, =) gate how strm valA / strm valB are consumed, multiplied, and accumulated]
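As a software analogy for this dataflow (a sketch, not the actual SPU dataflow description), each input port can be modeled as a FIFO whose head is either popped or reused depending on the compare node's control output:

typedef struct { const int *data; int head, len; } fifo;     /* model of a stream port */

static int  peek (const fifo *f) { return f->data[f->head]; }
static int  empty(const fifo *f) { return f->head >= f->len; }
static void pop  (fifo *f)       { f->head++; }               /* consume the head */
/* "reuse" = simply do not pop: the same head is presented again next iteration */

float join_values(fifo idxA, fifo idxB, const float *valA, const float *valB) {
    float acc = 0.0f;
    while (!empty(&idxA) && !empty(&idxB)) {
        int cmp = peek(&idxA) - peek(&idxB);   /* the compare node */
        if (cmp == 0) {                        /* '=': fire the multiply, pop both */
            acc += valA[idxA.head] * valB[idxB.head];
            pop(&idxA); pop(&idxB);
        } else if (cmp < 0) {                  /* '<': discard A's head, reuse B's */
            pop(&idxA);
        } else {                               /* '>': discard B's head, reuse A's */
            pop(&idxB);
        }
    }
    return acc;
}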

SLIDE 24

Novel Dataflow for Stream Join

[Figure: the same traditional vs. stream-join dataflow, beginning to step through the sparse MM example: the heads of the idx streams reach the compare node]

SLIDE 25

Novel Dataflow for Stream Join

[Figure: the same dataflow, one step later: the '<' comparison result causes the smaller index to be consumed]

SLIDE 26

Novel Dataflow for Stream Join

[Figure: the same dataflow, after another consume step driven by the '<' result]

SLIDE 27

Other Kernels as Stream Join

[Figure: stream-join dataflow graphs for other kernels: Database Join (compare key streams strm key1 / strm key2, concatenate the matching value streams strm val1 / strm val2), Merge for sort (compare key streams, mux the selected key to the output stream), Resparsify/filter (compare a stream against sparse indices s-ind / values s-val), and ReLU (max with an accumulator initialized to 0)]
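For example, the merge step of merge-sort (one of the kernels above) is also a stream join in plain C; a minimal sketch, not from the slides:

/* Merge two sorted key streams into one sorted output stream. */
void merge_streams(const int *k1, int n1, const int *k2, int n2, int *out) {
    int i = 0, j = 0, o = 0;
    while (i < n1 && j < n2)                       /* compare heads, pop the smaller */
        out[o++] = (k1[i] <= k2[j]) ? k1[i++] : k2[j++];
    while (i < n1) out[o++] = k1[i++];             /* drain whichever stream remains */
    while (j < n2) out[o++] = k2[j++];
}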

SLIDE 28

Supporting Stream-Join in Hardware

[Figure: a CGRA processing element with a functional unit, accumulator (ACC), control logic, and input FIFOs (FIFO0-FIFO2); stream-join flow control adds discard/reuse/reset decisions alongside the normal dataflow. The PEs are arranged as a "systolic" CGRA]

SLIDE 29

Specializing for Indirection

[Figure: the SPU core diagram again, highlighting the compute-enabled banked scratchpad with I-ROB for fast alias-free indirect access]

SLIDE 30

Indirection with guaranteed alias-freedom

Traditional banked memory (typical GPU): banks behind an arbitrated crossbar to/from the compute fabric and network, with a dependence check over each request vector.

  • Observation: the dependence check serializes over a group of non-conflicting requests.
  • Idea: SPU allows aggressive reordering with a simple reorder buffer; it is known a priori that the requests will not alias, so serialization is not required.

SPU banked memory: address generation and per-bank queues behind the arbitrated crossbar, plus an indirect reorder buffer. This supports indirect updates, indirect access without a dependence check, and reordering for indirect reads.

[Figure: traditional (GPU-style) banked memory vs. SPU banked memory, handling a vector read request and a vector write request]

SLIDE 31

Indirect Access Reordering in Scratchpad

[Figure: indirect access reordering in the scratchpad: a vector read request and a vector write request, handled by a typical GPU vs. SPU]

SLIDE 32

SPU: Sparse Processing Unit

  • Network: traditional mesh NoC
  • Scratchpads in a global address space: for nearby-core communication
  • Broadcast: using the memory stream engine
  • Synchronization: dataflow counters (sync on SPAD write)

[Figure: SPU multicore: a grid of SPU cores and a memory stream engine connected to main memory; each core contains the systolic array with stream-join control and the compute-enabled banked scratchpad with I-ROB for alias-free indirect access]

SLIDE 33

Outline

  • Irregularity is ubiquitous
  • Sufficient and Exploitable forms of Control and Memory dependence
  • Example Workload: Matrix Multiply
  • Exploiting data-dependence with SPU accelerator
  • uArch: Stream-join Dataflow & Compute-enabled Scratchpad
  • SPU Multicore Design
  • Evaluating SPU
  • Conclusion

SLIDE 34

Methodology

  • Programming: C + intrinsics + dataflow graphs
  • SPU simulation: gem5 + Ruby (RISC-V in-order control core)

Workload implementations on CPU / GPU:

  • GBDT: LightGBM / LightGBM
  • Kernel-SVM: LibSVM / hand-coded
  • AC: hand-coded / hand-coded
  • FC: MKL SPBLAS / cuSparse
  • Conv layer: MKL-DNN / cuDNN
  • Graph Alg.: GraphMat
  • TPCH: MonetDB

Benchmarks and datasets (with varying sparsity):

  • GBDT: Cifar10-bin (1), Higgs-bin (0.28), Yahoo-bin (0.05), Ltrc-bin (0.008)
  • KSVM: Connect (0.33), Higgs (0.92), Yahoo (0.59), Ltrc (0.24)
  • CONV: VGG-3 (0.34), VGG-4 (0.1), ALEX-2 (0.14), RES-1 (0.05)
  • FC: VGG-12 (0.04), VGG-13 (0.09), ALEX-6 (0.16), RES-1 (0.22)
  • AC: Pigs, Munin, Andes, Mildew
  • Graph: Flickr, Fb-artist, NY-road

SLIDE 35

Domain-agnostic comparison points

Comparison points (overall design and core design):

  • P4000 Pascal GPU: SMs with SIMD units, L1 cache, and shared memory; on-chip memory 4MB, 3696 FP units, 243 GB/s memory bandwidth
  • SPU-inorder: an array of in-order cores with linear and banked scratchpads; on-chip memory 3MB, 2560 FP units, 256 GB/s memory bandwidth
  • SPU: cores with a systolic CGRA plus linear and banked scratchpads; on-chip memory 3MB, 2632 FP units, 256 GB/s memory bandwidth

SLIDE 36

Domain-specific comparison points

SCNN (Sparse convolution): ISCA’17 EIE (Sparse fully connected): ISCA’16 Graphicionado (Graph Analytics): MICRO’16 Q100 (Databases): ASPLOS’14

SLIDE 37

Overall Results

[Figure: speedup normalized over CPU (log scale, 1 to 100) for GPU, SPU-inorder, SPU, and ASICs (EIE, SCNN, Q100, Graphicionado). Machine learning: FC, CONV, KSVM, AC, GBDT, GM (geomean); graph processing: PR, BFS, GM; databases: N-SH, SH, GM]

SLIDE 38

Cost of adding stream-join in systolic CGRA

  • 1.69x area overhead due to the addition of flow control.
  • 1.63x power overhead.

Compared to the whole design, this is a 6.9% area overhead and a 14.2% power overhead.

Methodology: SPU's DGRA is implemented in Chisel and synthesized using Synopsys DC with a 28nm UMC technology library.

[Figure: normalized area and power of a traditional CGRA vs. a CGRA with stream-join support, on a 0.5 to 2 scale]

SLIDE 39

Conclusion

[Figure: qualitative placement of designs by efficiency on stream-join algorithms vs. efficiency on alias-free indirection algorithms: SCNN (Conv), EIE (FC), Graphicionado (Graph), Intel SpM-SpV, Q100, OuterSPACE, CPU, GPU, and SPU]

SLIDE 40

EXTRA SLIDES

SLIDE 41

Programming SPU

Example of gradient boosting decision trees (GBDT)

  • Stream-join control is expressed in the dataflow graph
  • Alias-free indirection is expressed as an update stream

SLIDE 42

Approach Overview

Stream-Dataflow ISA

  • Streaming memory: streams of data fetched from memory and stored back to memory.
  • Dataflow computation: dependence graph (DFG) with input/output vector ports.

[Figure: a small DFG (A[0..N] and B[0..N] multiplied and added into Out) with streams between memory, local storage, and the dataflow graph]

Sparsity-Enabled SPU Core

  • Compute-enabled, high-bandwidth indirect scratchpad
  • Decomposable memory/network/compute
  • Systolic array with novel meta-reuse control flow

[Figure: the core (systolic array, ctrl, router, wide scratchpad, I-ROB); dataflow computation maps to the systolic array, and streams of data flow from the wide scratchpad]

SLIDE 43

Note: the relative positions are the best to our knowledge.

[Figure: the same efficiency placement as the conclusion slide, with additional points for vector-thread architectures, Plasticine, and triggered instructions]

SLIDE 44

Indirect Access Reordering in Scratchpad

[Figure: I-ROB buffer contents at cycle 2, typical GPU vs. SPU]

SLIDE 45

SPU: Sparse Processing Unit

[Figure: three SPU multicore configurations (grids of SPU cores with a global stream engine and main memory) illustrating independent lanes with/without communication (fully connected layer: broadcast row), local spatial communication (sparse convolution: communication with neighbors for halos), and pipelined communication (arithmetic circuits: pipelined DAG traversal with pipelined node updates; graph processing: core-to-core communication)]

SLIDE 46

How much density is exploitable?

  • Bounded by memory bandwidth, sparse versions are better at less than 50% density.
  • SPU-sparse has exponential gain with sparsity.

SLIDE 47

Alias-free Indirection Abstractions

1: Indirect Memory -- d = a[b[i]]

  • Allows specifying indirect loads or stores using an input stream as the address values.
  • Offset list for array-of-structs organization.

Example C code:

  struct {int f1, f2;} a[N];
  for (i = 0; i < N; i++) {
    ind = b[i];
    .. = a[ind].f1;
    .. = a[ind].f2;
  }

Stream code:

  str1 = load(b[0..n])
  -> ind_load(addr=str1, offset_list={0,4})

SLIDE 48

Update stream

2: Histogram -- a[hist_bin] += c

  • Enhance ISA with compute-enabled semantics for the access stream
  • Add update stream for common reduction operations
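In plain C, the pattern that the update stream offloads looks like this (a sketch; the SPU expresses it as a stream rather than a loop, and the function name is hypothetical):

/* Histogram update: the store address hist + bin[i] is data-dependent,
   but the only cross-iteration dependence is the commutative += on a bin,
   so the updates can be freely reordered. */
void histogram_update(const int *bin, const float *c, float *hist, int n) {
    for (int i = 0; i < n; i++)
        hist[bin[i]] += c[i];
}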
SLIDE 49

Sparsity-Enhanced Memory Microarchitecture

[Figure: memory microarchitecture: linear and indirect address generation fed by a linear access stream table and an indirect rd/wr/atomic stream table, a MUX and arbiter into a crossbar (e.g., 16x32), rd-wr bank queues with control logic, a composable banked scratchpad, a linear scratchpad with its control unit, the NoC, and an indirect ROB to/from the compute fabric]

We keep two logical scratchpad memories: banked and linear.

  • Linear memory: reads/writes

SLIDE 50

Sparsity-Enhanced Memory Microarchitecture

[Figure: the same memory microarchitecture diagram as the previous slide]

We keep two logical scratchpad memories: banked and linear.

  • Linear memory: reads/writes
  • Banked memory: indirect writes/updates and indirect reads

SLIDE 51

Benefits of Heterogeneity on CGRA

  • Effective vectorization width is increased by 64 / data-width.
  • DGRA supports Concat and Extract using sub-networks.

[Figure: example DFG mapping to the DGRA: 16-bit Mul and Sub nodes are combined via Concat into 32-bit and 64-bit Mul operations over inputs A, B, C, D; CGRA datapath width = 64 bits, initial datatypes = 16 bits]

SLIDE 52

Sparsity-Enhanced Computation Microarchitecture

  • DGRA switch: same external interface, but splits inputs and outputs.
  • DGRA processing element: decomposed into fine-grained PEs.

SLIDE 53

Cost of Decomposability & Stream-Join

[Figure: normalized cost of the systolic CGRA as stream-join support is added and as the design is made more decomposable]

SLIDE 54

Remaining Challenges

  • Generality
  • What about other forms of irregularity? (task-based?)
  • Programmability Challenges
  • Workload balance: (1) same amount of work in each core; (2) efficient use of available on-chip memory (global addressing helps in this case)
  • Partitioning of computation/memory (locality & parallelism)
  • Programming in low-level intrinsics (dataflow compute & stream memory)

  • Virtualization/integration with CPU

SLIDE 55

Domain-agnostic comparison points

  • 24-core Intel Skylake CPU
  • P4000 Pascal GPU
  • SPU-inorder
  • SPU

[Figure: overall design and core design of each comparison point: CPU out-of-order cores with L1/L2 caches, GPU SMs with L1 cache and shared memory, and SPU cores with linear and banked scratchpads; annotated capacities include 4MB and 2.5MB of on-chip memory and FP-unit counts of 3584 and 2048]