Proposing a Fast and Scalable Systolic Array for Matrix Multiplication
Bahar Asgari, Ramyad Hadidi, Hyesoon Kim
Matrix Multiplication
Matrix multiplication is the key operation in many applications
• Example: convolution in neural networks
Systolic arrays perform matrix multiplication, an operation that
• Includes many similar operations (i.e., multiply-and-accumulate)
• Exhibits a high data reuse rate
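[Figure: Convolution: K filters of size F × F × C applied to a W × H × C input produce a W × H × K output. Matrix multiplication: a K × (F²·C) matrix times an (F²·C) × (W·H) matrix gives a K × (W·H) matrix.]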
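The convolution-to-matrix-multiplication mapping on this slide can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the function name `conv_as_matmul`, the nested-list layout, stride 1, zero padding, and an odd filter size F are all assumptions chosen so that the output keeps the W × H spatial size.

```python
def conv_as_matmul(x, filters):
    """Convolution (cross-correlation, as in DNNs) expressed as one matrix product.

    x:       H x W x C nested lists (input feature map)
    filters: K x F x F x C nested lists (F assumed odd)
    Returns a K x (W*H) matrix. Stride 1 and zero padding are assumptions.
    """
    H, W, C = len(x), len(x[0]), len(x[0][0])
    K, F = len(filters), len(filters[0])
    pad = F // 2

    # Weight matrix: K rows, F*F*C columns (each filter flattened into a row).
    weights = [
        [filters[k][dy][dx][c] for dy in range(F) for dx in range(F) for c in range(C)]
        for k in range(K)
    ]

    # Patch matrix, stored column by column: one flattened F*F*C patch per pixel.
    patches = []
    for y in range(H):
        for col in range(W):
            patch = []
            for dy in range(F):
                for dx in range(F):
                    yy, xx = y + dy - pad, col + dx - pad
                    inside = 0 <= yy < H and 0 <= xx < W
                    for c in range(C):
                        patch.append(x[yy][xx][c] if inside else 0)
            patches.append(patch)

    # (K x F*F*C) times (F*F*C x W*H) gives K x W*H.
    return [[sum(w * v for w, v in zip(row, patch)) for patch in patches]
            for row in weights]


# Usage: a 3 x 3 single-channel input and one 3 x 3 all-ones filter.
x = [[[1], [2], [3]], [[4], [5], [6]], [[7], [8], [9]]]
filters = [[[[1], [1], [1]], [[1], [1], [1]], [[1], [1], [1]]]]
out = conv_as_matmul(x, filters)
print(len(out), len(out[0]))  # K rows, W*H columns
```

Each row of the result holds one output channel flattened over the W·H pixel positions, which is the K × (W·H) shape named on the slide.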
Systolic Arrays for Matrix Multiplication
• Non-stationary: none of the operands is stationary
$A_{n \times m} \times B_{m \times p} = C_{n \times p}$
[Figure: array of MAC units with operands A (n × m) and B (m × p)]
• Phase 1, only processing: time steps 1 to n + m
• Phase 2, processing and offloading: time steps n + m + 1 to n + m + p - 2
• Phase 3, only offloading: time steps n + m + p - 2 + 1 to n + m + p - 2 + n
• Total time steps: 2n + m + p - 2
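The non-stationary timing can be illustrated with a small functional model. This is a sketch under stated assumptions, not a register-level description of the array in the slides: the helper name `systolic_matmul_skewed` is hypothetical, MAC unit (i, j) is assumed to receive the operand pair with index k = t - i - j at step t (the usual skewed input schedule), and the n offloading steps are taken directly from the slide's total.

```python
def systolic_matmul_skewed(A, B):
    """Functional model of a grid of MAC units with skewed operand streams.

    A is n x m, B is m x p (nested lists). Returns (C, total_steps).
    Hypothetical helper for illustration only.
    """
    n, m, p = len(A), len(B), len(B[0])
    C = [[0] * p for _ in range(n)]

    # With the assumed skew, the last MAC happens at t = (n-1) + (p-1) + (m-1),
    # so all multiply-accumulates fit in n + m + p - 2 steps.
    compute_steps = n + m + p - 2
    for t in range(compute_steps):
        for i in range(n):
            for j in range(p):
                k = t - i - j  # operand pair reaching unit (i, j) at step t
                if 0 <= k < m:
                    C[i][j] += A[i][k] * B[k][j]

    offload_steps = n  # assumption from the slides: results drain in n steps
    return C, compute_steps + offload_steps


A = [[1, 2, 3], [4, 5, 6]]
B = [[7, 8], [9, 10], [11, 12]]
C, steps = systolic_matmul_skewed(A, B)
print(C, steps)
```

For this 2 × 3 by 3 × 2 example the step count equals 2n + m + p - 2, matching the total on the slide.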
Systolic Arrays for Matrix Multiplication
• Stationary: one operand (here, B) is stationary
$A_{n \times m} \times B_{m \times p} = C_{n \times p}$
[Figure: array of MAC units with operands A (n × m) and B (m × p)]
• Phase 1, only loading B: time steps 1 to m - 1
• Phase 2, loading B and processing: time step m - 1 + 1 = m
• Phase 3, only processing: time steps m + 1 to m + m - 1 = 2m - 1
• Phase 4, processing and offloading: time steps 2m - 1 + 1 to 2m - 1 + n + p - 2
• Phase 5, only offloading: time step 2m - 1 + n + p - 2 + 1
• Total time steps: n + 2m + p - 2
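The phase boundaries of the stationary design can be tabulated directly from the time-step expressions on these slides. The sketch below is only bookkeeping over those expressions; the function name `stationary_phases` and the tuple layout are hypothetical.

```python
def stationary_phases(n, m, p):
    """Return (phase name, first step, last step) for the stationary array.

    Boundaries are copied from the slide expressions; the helper itself is
    a hypothetical illustration.
    """
    end_phase4 = 2 * m - 1 + n + p - 2
    return [
        ("only loading B", 1, m - 1),
        ("loading B and processing", m, m),
        ("only processing", m + 1, 2 * m - 1),
        ("processing and offloading", 2 * m, end_phase4),
        ("only offloading", end_phase4 + 1, n + 2 * m + p - 2),
    ]


phases = stationary_phases(4, 4, 4)
for name, first, last in phases:
    print(f"{name}: steps {first} to {last}")
```

Each phase starts one step after the previous one ends, and the last step equals the slide's total of n + 2m + p - 2.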
Key Challenge
The systolic arrays proposed by prior work are not scalable:
• Their latency grows linearly with the size of the inputs
• Latency is the key metric for single-batch inference
$A_{n \times m} \times B_{m \times p} = C_{n \times p}$
• Stationary time steps: n + 2m + p - 2
• Non-stationary time steps: 2n + m + p - 2
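As a worked instance of the linear growth, assume square operands with n = m = p = N (an assumption made only for illustration). Substituting into the two expressions gives:

```latex
\begin{align*}
T_{\text{stationary}}(N) &= N + 2N + N - 2 = 4N - 2,\\
T_{\text{non-stationary}}(N) &= 2N + N + N - 2 = 4N - 2.
\end{align*}
```

Under this assumption both designs take 4N - 2 time steps, so doubling N roughly doubles the latency.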
Key Insight and Proposed Systolic Array
Matrix multiplication consists of
• Multiplications
• Additions: summing m numbers can be done in log(m) steps
In an optimized implementation
• Latency increases sublinearly with the input size
We propose a systolic array with separate
• Multiplier array
• Adder-tree array
Time steps: n + 2m + p - 2 becomes n + m + log(m) + p - 2
[Figure: array of multipliers and adder trees with operands A (n × m) and B (m × p)]
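The claim that m numbers can be summed in log(m) steps can be sketched as a pairwise reduction. This is a software illustration, not the hardware adder tree itself: the name `adder_tree_sum` is hypothetical, the input is assumed non-empty, and for m that is not a power of two the depth comes out as the ceiling of log2(m).

```python
def adder_tree_sum(values):
    """Sum a non-empty list by pairwise addition, level by level.

    Returns (total, depth), where depth is the number of tree levels used.
    Hypothetical helper for illustration.
    """
    level = list(values)
    depth = 0
    while len(level) > 1:
        # All additions within one level are independent, so hardware
        # could perform them in the same time step.
        reduced = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            reduced.append(level[-1])  # odd element passes through unchanged
        level = reduced
        depth += 1
    return level[0], depth


total, depth = adder_tree_sum([1, 2, 3, 4, 5, 6, 7, 8])
print(total, depth)
```

With eight inputs the reduction takes three levels, compared with seven sequential additions in a chain of accumulators.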
Our proposed systolic array
• One operand (here, B) is stationary
$A_{n \times m} \times B_{m \times p} = C_{n \times p}$
[Figure: array of multipliers and adder trees with operands A (n × m) and B (m × p)]
• Phase 1, only loading B: time steps 1 to m - 1
• Phase 2, loading B and multiplication: time step m - 1 + 1 = m
• Phase 3, multiplication and addition: time steps m + 1 to m + n + p - 2
• Phase 4, only addition: time steps m + n + p - 2 + 1 to m + n + p - 2 + log(m)
• Total time steps: n + m + log(m) + p - 2
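The three time-step totals from the slides can be compared side by side. The sketch below only evaluates those closed-form expressions; the function names are hypothetical, and reading log(m) as the ceiling of the base-2 logarithm is an assumption, since the slides do not state the base or rounding.

```python
import math


def latency_non_stationary(n, m, p):
    # Total from the slides: 2n + m + p - 2
    return 2 * n + m + p - 2


def latency_stationary(n, m, p):
    # Total from the slides: n + 2m + p - 2
    return n + 2 * m + p - 2


def latency_proposed(n, m, p):
    # Total from the slides: n + m + log(m) + p - 2
    # Assumption: log is base 2, rounded up to a whole step.
    return n + m + math.ceil(math.log2(m)) + p - 2


for size in (8, 64, 512):
    print(size,
          latency_non_stationary(size, size, size),
          latency_stationary(size, size, size),
          latency_proposed(size, size, size))
```

These are time-step counts from the formulas only; they are not the measured speedups, which depend on the hardware implementation and the benchmark layers.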
Implementation
Tools and devices:
• Zynq XC7Z020
• Vivado HLS
Benchmark:
• DNNs (VGG16, VGGS, AlexNet, CifarNet, ResNet50)
Metrics:
• Latency
• Energy consumption
Results – Speedup and Energy Consumption
[Figure: speedup over non-stationary for VGGS, AlexNet, CifarNet, VGG16, ResNet50, and their geometric mean (GMEAN); series: stationary, non-stationary, our proposed systolic array]
Our proposed systolic array is
• 1.99x faster than non-stationary while consuming 2.12x less energy
• 1.83x faster than stationary while consuming 2.27x less energy
Conclusions
Systolic arrays have seen significant interest
• because their interconnections satisfy the data-reuse requirement of matrix multiplication.
Although the systolic arrays in prior work offer high throughput, their latency is not optimized
• Latency is the key factor for single-batch inference!
To optimize latency, we propose a new systolic array consisting of separate multiplier and adder-tree arrays
• It is faster than both prior proposals as the size of the operands grows