A Hybrid Systolic-Dataflow Architecture for Inductive Matrix Algorithms
Jian Weng, Sihao Liu, Zhengrong Wang, Vidushi Dadu, Tony Nowatzki University of California, Los Angeles Feb 27th, 2020
1
2
Inductive Kernels
[Figure: three panels, each labeled F( ) =]
12..32
3
[Figure: % of the ideal performance for dsp, cpu, gpu (axis 0% to 25%)]
and Implicit Vector Padding
Lanes
spatial architectures, and general purpose processors
with ASIC collection
4
[Figure: multi-lane diagram; labels: Sync, Private Spad, Shared Spad, Ctrl, XFER]
flexibility
5
6
[Figure: dataflow graph with nodes ×, ×, −, >>]
are determined at compile time
shared PE
execution
[Figure panels: Dependence Graph, Tagged Dataflow, Systolic]
5.8x Area 4.2x Power
programming interfaces
7
[Figure: % of the ideal performance for systolic and tagged dataflow (axis 0% to 80%)]
[Figure: lane diagram; labels: Sync, Scratch Memory, Ctrl, XFER]
8
[Figure: dataflow graph with nodes ×, ×, −, >>]
spatial architecture
coordinated by the controller
interfaces between dynamic/static timing
for (j = 0; j < n; ++j) {
  x[j] = x[j] / a[j,j];
  for (i = j+1; i < n; ++i)
    x[i] -= x[j] * a[j,i];
}
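The loop nest above can be tried out directly. The following is a minimal sketch in plain Python, not code from the talk; the function name `solve_inplace` and the 3x3 test system are assumptions made for illustration. It transcribes the loop as written, so it reads only `a[j][i]` with `i >= j` and, on exit, `x` satisfies transpose(a) * x = b.

```python
def solve_inplace(a, x):
    """Transcription of the slide's solver loop (hypothetical helper name).

    On entry x holds the right-hand side b; on exit it holds the solution
    of transpose(a) * x = b. Only entries a[j][i] with i >= j are read.
    """
    n = len(x)
    for j in range(n):
        x[j] = x[j] / a[j][j]
        # The inner trip count n-j-1 shrinks as j grows (the inductive part).
        for i in range(j + 1, n):
            x[i] -= x[j] * a[j][i]
    return x


# Illustrative data, chosen so that the exact solution is [1, 2, 3].
a = [[2.0, 1.0, 3.0],
     [0.0, 4.0, 5.0],
     [0.0, 0.0, 8.0]]
b = [2.0, 9.0, 37.0]

x = solve_inplace(a, list(b))
# Residual of transpose(a) * x - b, one entry per row.
residual = [sum(a[j][i] * x[j] for j in range(i + 1)) - b[i]
            for i in range(len(b))]
print(x, residual)
```

The inner loop runs n-j-1 times for outer iteration j, so the amount of work per outer iteration is not constant.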
9
[Figure: lane diagram with dataflow nodes ×, −, ÷]
produce|consume: 1|(n-j-1)
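One way to read the 1|(n-j-1) annotation is that each value produced by the divide is reused by n-j-1 inner-loop iterations. The sketch below checks that reading against the solver loop by counting writes and reads of `x[j]`; the function name `count_reuse` is hypothetical and the interpretation of the annotation is an assumption, not something the slide text states outright.

```python
def count_reuse(n):
    """Count how often x[j] is produced and then consumed in the solver loop."""
    produced = [0] * n
    consumed = [0] * n
    for j in range(n):
        produced[j] += 1              # x[j] = x[j] / a[j,j]
        for i in range(j + 1, n):
            consumed[j] += 1          # x[i] -= x[j] * a[j,i] reads x[j]
    return produced, consumed


produced, consumed = count_reuse(5)
print(produced)  # one production per outer iteration
print(consumed)  # n-j-1 consumptions for outer iteration j
```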
for (k = 0; k < n; ++k) {
  inv = 1.0 / a[k,k];
  invsqrt = 1.0 / sqrt(a[k,k]);
  for (j = k; j < n; ++j)
    l[j,k] = a[k,j] * invsqrt;
  for (j = k+1; j < n; ++j)
    for (i = j; i < n; ++i)
      a[j,i] -= a[k,i] * a[k,j] * inv;
}
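As a sanity check of the Cholesky loop nest above, here is a minimal plain-Python transcription. The function name `cholesky_slide` and the 3x3 input are assumptions chosen for illustration; the input is symmetric positive definite, and the code works on a copy so the caller's matrix is left untouched.

```python
import math


def cholesky_slide(a_in):
    """Transcription of the slide's loop nest; returns lower-triangular l
    with l * transpose(l) equal to the input (hypothetical helper name)."""
    n = len(a_in)
    a = [row[:] for row in a_in]          # work on a copy
    l = [[0.0] * n for _ in range(n)]
    for k in range(n):
        inv = 1.0 / a[k][k]
        invsqrt = 1.0 / math.sqrt(a[k][k])
        for j in range(k, n):
            l[j][k] = a[k][j] * invsqrt
        # Trailing update touches only the upper triangle (i >= j).
        for j in range(k + 1, n):
            for i in range(j, n):
                a[j][i] -= a[k][i] * a[k][j] * inv
    return l


a_in = [[4.0, 2.0, 2.0],
        [2.0, 5.0, 3.0],
        [2.0, 3.0, 6.0]]
l = cholesky_slide(a_in)

# Rebuild l * transpose(l) to compare against the input.
n = len(a_in)
rebuilt = [[sum(l[r][t] * l[c][t] for t in range(n)) for c in range(n)]
           for r in range(n)]
print(l)
```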
10
[Figure: lane diagram with dataflow nodes ×, ×, ×, ÷, ÷]
11
12
[Figure: lane diagram with dataflow nodes ×, −, ÷, ×, −]
for (j = 0; j < n; ++j) {
  x[j] = x[j] / a[j,j];
  for (i = j+1; i < n; i += 2) {
    x[i] -= x[j] * a[j,i];
    if (i+1 < n)
      x[i+1] -= x[j] * a[j,i+1];
  }
}
Rectangular slicing: a[n:m,p:q]
Triangular slicing: a[j+1:n], a[j+2:n], a[j+3:n], …, a[n-2:n]
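The two-wide version of the solver can be compared against the scalar loop to confirm that the `i+1 < n` guard covers the iterations where the triangular slice has odd length. The sketch below is a software model only, not the hardware mechanism from the talk; the function names and test matrices are assumptions for illustration.

```python
def solve_scalar(a, x):
    """Scalar solver loop, one element per inner iteration."""
    n = len(x)
    for j in range(n):
        x[j] = x[j] / a[j][j]
        for i in range(j + 1, n):
            x[i] -= x[j] * a[j][i]
    return x


def solve_unrolled2(a, x):
    """Same loop with the inner loop unrolled by two and a guarded second slot."""
    n = len(x)
    for j in range(n):
        x[j] = x[j] / a[j][j]
        for i in range(j + 1, n, 2):
            x[i] -= x[j] * a[j][i]
            if i + 1 < n:                 # second slot is idle on a ragged tail
                x[i + 1] -= x[j] * a[j][i + 1]
    return x


a3 = [[2.0, 1.0, 3.0],
      [0.0, 4.0, 5.0],
      [0.0, 0.0, 8.0]]
b3 = [2.0, 9.0, 37.0]

a4 = [[2.0, 1.0, 3.0, 1.0],
      [0.0, 4.0, 5.0, 2.0],
      [0.0, 0.0, 8.0, 3.0],
      [0.0, 0.0, 0.0, 5.0]]
b4 = [1.0, 2.0, 3.0, 4.0]

print(solve_unrolled2(a3, list(b3)))
print(solve_unrolled2(a4, list(b4)))
```

Each `x[i]` receives the same updates in the same order in both versions, so the results match exactly rather than only up to rounding.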
13
for (k = 0; k < n; ++k) {
  inv = 1.0 / a[k,k];
  invsqrt = 1.0 / sqrt(a[k,k]);
  for (j = k; j < n; ++j)
    l[j,k] = a[k,j] * invsqrt;
  for (j = k+1; j < n; ++j)
    for (i = j; i < n; ++i)
      a[j,i] -= a[k,i] * a[k,j] * inv;
}
14
[Figure: lane diagram with dataflow nodes ×, ×, ×, ÷, ÷]
Work per outer iteration k: O(1) for inv and invsqrt, O(n) for the l[j,k] loop, O(n²) for the a[j,i] update.
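Assuming the three complexity labels refer to the three regions of the Cholesky loop nest (the scalar reciprocals, the column of l, and the trailing update), the imbalance can be made concrete by counting iterations. The sketch below does that in plain Python; `region_work` is a hypothetical helper, and counting the reciprocal pair as a single unit of work is a simplification.

```python
def region_work(n, k):
    """Iteration counts of the three Cholesky regions for outer iteration k."""
    point = 1                                          # inv and invsqrt, once per k
    vector = sum(1 for j in range(k, n))               # l[j,k] loop
    matrix = sum(1 for j in range(k + 1, n)
                 for i in range(j, n))                 # a[j,i] update
    return point, vector, matrix


for n in (8, 16):
    print(n, region_work(n, 0))
```

For k = 0 the matrix region performs (n-1)n/2 updates, so doubling n roughly quadruples that region while the point region stays fixed.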
15
[Figure: lane diagram with dataflow nodes ×, −, ÷]
for (j = 0; j < n; ++j) {
  x[j] = x[j] / a[j,j];
  for (i = j+1; i < n; ++i)
    x[i] -= x[j] * a[j,i];
}
produce|consume: 1|(n-j-1), 1|(n-j-1)
16
[Figure: lane diagram with dataflow nodes ×, −, ÷]
for (j = 0; j < n; ++j) {
  x[j] = x[j] / a[j,j];
  for (i = j+1; i < n; i += 2) {
    x[i] -= x[j] * a[j,i];
    if (i+1 < n)
      x[i+1] -= x[j] * a[j,i+1];
  }
}
Rectangular slicing: a[n:m,p:q]
Triangular slicing: a[j+1:n], a[j+2:n], a[j+3:n], …, a[n-2:n]
𝑘𝑙+1
𝑜
a[k,j:n]
17
[Figure: lane diagram with dataflow nodes ×, −, ÷, ×, −]
for (j = 0; j < n; ++j) {
  x[j] = x[j] / a[j,j];
  for (i = j+1; i < n; i += 2) {
    x[i] -= x[j] * a[j,i];
    if (i+1 < n)
      x[i+1] -= x[j] * a[j,i+1];
  }
}
Generate masks according to the striding pattern
Triangular slicing: a[j+1:n], a[j+2:n], a[j+3:n], …, a[n-2:n]
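To illustrate what generating masks from the striding pattern could look like, the sketch below builds one boolean mask per vector issue for each triangular slice a[j+1:n]. It is a software model under stated assumptions: the vector width parameter, the function names `lane_masks` and `triangular_masks`, and the mask layout are all hypothetical and are not taken from the slides.

```python
def lane_masks(length, width):
    """One mask per vector issue for a stream of `length` elements.

    A lane is enabled (True) only if it maps to a real element; the last
    issue of a ragged stream has its trailing lanes disabled.
    """
    masks = []
    for start in range(0, length, width):
        valid = min(width, length - start)
        masks.append([lane < valid for lane in range(width)])
    return masks


def triangular_masks(n, width):
    """Masks for every slice a[j+1:n], whose length n-j-1 shrinks with j."""
    return {j: lane_masks(n - j - 1, width) for j in range(n)}


m = triangular_masks(4, 2)
for j, masks in m.items():
    print(j, masks)
```

Because the slice length is a simple function of j, the masks follow from the stream description alone and the loop body does not need an explicit `if (i+1 < n)` test.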
18
[Figure: multi-lane diagram; labels: Sync, Scratch Memory, Private Spad, Ctrl, XFER]
19
commands multiple lanes
execute
[Figure: multi-lane diagram; labels: Sync, Private Spad, Shared Spad, Ctrl, XFER]
Lane 0 Lane 1 01010101 a[k,j:n] a[0,j:n] 1 a[1,j:n] a[2
20
architecture simulator
21
Same peak performance
[Figure: breakdown by component. Legend: CGRA Net, Trig Net, FUs, SPAD, VP/SE, Control Core. Data labels: 0.48, 0.16, 0.56, 0.48, 0.24, 0.04]
22
[Figure: speedup over the TI DSP (log scale, 0.01 to 100) for cpu, gpu, systolic, tagged, revel]
23
[Figure: speedup over the TI DSP (log scale, 0.01 to 100) for cpu, gpu, systolic, dataflow, revel]
a promising next-generation digital signal processing architecture.
considering multiple execution models.
24