Proposing a Fast and Scalable Systolic Array for Matrix Multiplication



SLIDE 1

Proposing a Fast and Scalable Systolic Array for Matrix Multiplication

Bahar Asgari, Ramyad Hadidi, Hyesoon Kim

SLIDE 2

Matrix Multiplication

Matrix multiplication is the key operation in many applications
- Example: convolution in neural networks

Systolic arrays perform matrix multiplication, which
- Includes several similar operations (i.e., multiply and accumulate)
- Captures high data reuse rate

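[Figure: Convolution: an input of size W × H × C convolved (*) with K filters of size F × F × C gives an output of size W × H × K. Matrix multiplication: a (W·H) × (F²·C) matrix times an (F²·C) × K matrix gives a (W·H) × K matrix.]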
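The convolution-to-matrix-multiplication mapping on this slide can be sketched in plain Python. This is a minimal illustration, not the authors' code: it assumes stride 1, zero padding that keeps the W × H output size, and an odd filter size F. The names `im2col`, `flatten_filters`, `matmul`, and `conv_direct` are illustrative and do not come from the slides.

```python
def im2col(image, F):
    """Unroll image[h][w][c] into a (W*H) x (F*F*C) matrix.

    Assumes stride 1, zero padding, and odd F, so one row is produced per pixel.
    """
    H, W, C = len(image), len(image[0]), len(image[0][0])
    pad = F // 2
    rows = []
    for y in range(H):
        for x in range(W):
            patch = []
            for dy in range(-pad, pad + 1):
                for dx in range(-pad, pad + 1):
                    yy, xx = y + dy, x + dx
                    inside = 0 <= yy < H and 0 <= xx < W
                    for c in range(C):
                        patch.append(image[yy][xx][c] if inside else 0)
            rows.append(patch)
    return rows


def flatten_filters(filters):
    """Turn filters[k][fy][fx][c] into an (F*F*C) x K matrix, one column per filter."""
    columns = [[v for row in f for pixel in row for v in pixel] for f in filters]
    return [list(r) for r in zip(*columns)]


def matmul(A, B):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def conv_direct(image, filters):
    """Reference convolution (cross-correlation form) used to check the reshaping."""
    H, W, C = len(image), len(image[0]), len(image[0][0])
    F = len(filters[0])
    pad = F // 2
    out = []
    for y in range(H):
        for x in range(W):
            pixel = []
            for f in filters:
                acc = 0
                for fy in range(F):
                    for fx in range(F):
                        yy, xx = y + fy - pad, x + fx - pad
                        if 0 <= yy < H and 0 <= xx < W:
                            for c in range(C):
                                acc += image[yy][xx][c] * f[fy][fx][c]
                pixel.append(acc)
            out.append(pixel)
    return out
```

Multiplying `im2col(image, F)` by `flatten_filters(filters)` yields one row per output pixel and one column per filter, which is the (W·H) × K result shown on the slide.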

SLIDE 3

Systolic Arrays for Matrix Multiplication

- Non-stationary
  - None of the operands are stationary

$A_{n \times m} \times B_{m \times p} = C_{n \times p}$

[Figure: array of MAC units with operand matrices A (n × m) and B (m × p)]

SLIDE 4

Systolic Arrays for Matrix Multiplication

- Non-stationary
  - None of the operands are stationary
- Phase 1:
  - only processing
  - Time steps: 1

$A_{n \times m} \times B_{m \times p} = C_{n \times p}$

SLIDE 5

Systolic Arrays for Matrix Multiplication

- Non-stationary
  - None of the operands are stationary
- Phase 1:
  - only processing
  - Time steps: 2

$A_{n \times m} \times B_{m \times p} = C_{n \times p}$

SLIDE 6

Systolic Arrays for Matrix Multiplication

- Non-stationary
  - None of the operands are stationary
- Phase 1:
  - only processing
  - Time steps: 3

$A_{n \times m} \times B_{m \times p} = C_{n \times p}$

SLIDE 7

Systolic Arrays for Matrix Multiplication

- Non-stationary
  - None of the operands are stationary
- Phase 1:
  - only processing
  - Time steps: 4

$A_{n \times m} \times B_{m \times p} = C_{n \times p}$

SLIDE 8

Systolic Arrays for Matrix Multiplication

- Non-stationary
  - None of the operands are stationary
- Phase 1:
  - only processing
  - Time steps: 5

$A_{n \times m} \times B_{m \times p} = C_{n \times p}$

SLIDE 9

Systolic Arrays for Matrix Multiplication

- Non-stationary
  - None of the operands are stationary
- Phase 1:
  - only processing
  - Time steps: n + m

$A_{n \times m} \times B_{m \times p} = C_{n \times p}$

SLIDE 10

Systolic Arrays for Matrix Multiplication

- Non-stationary
  - None of the operands are stationary
- Phase 2:
  - processing and offloading
  - Time steps: n + m + 1

$A_{n \times m} \times B_{m \times p} = C_{n \times p}$

[Timeline: Phase 1]

SLIDE 11

Systolic Arrays for Matrix Multiplication

- Non-stationary
  - None of the operands are stationary
- Phase 3:
  - only offloading
  - Time steps: n + m + p - 2 + 1

$A_{n \times m} \times B_{m \times p} = C_{n \times p}$

[Timeline: Phase 1, Phase 2]

SLIDE 12

Systolic Arrays for Matrix Multiplication

- Non-stationary
  - None of the operands are stationary
- Phase 3:
  - only offloading
  - Time steps: n + m + p - 2 + 2

$A_{n \times m} \times B_{m \times p} = C_{n \times p}$

[Timeline: Phase 1, Phase 2]

SLIDE 13

Systolic Arrays for Matrix Multiplication

- Non-stationary
  - None of the operands are stationary
- Phase 3:
  - only offloading
  - Time steps: n + m + p - 2 + n

$A_{n \times m} \times B_{m \times p} = C_{n \times p}$

[Timeline: Phase 1, Phase 2]

SLIDE 14

Systolic Arrays for Matrix Multiplication

- Non-stationary
  - None of the operands are stationary
- Phase 3:
  - only offloading
  - Time steps: 2n + m + p - 2

$A_{n \times m} \times B_{m \times p} = C_{n \times p}$
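The processing part of the non-stationary walkthrough can be checked with a small cycle-level sketch. It rests on an assumption that the slides do not spell out: row i of A is delayed by i steps and column j of B by j steps, so the MAC unit at position (i, j) sees the pair (A[i][k], B[k][j]) at time step i + j + k. The function name `simulate_non_stationary` is illustrative, and the n offloading steps are taken from the slide counters rather than simulated.

```python
def simulate_non_stationary(A, B):
    """Cycle-level sketch: MAC unit (i, j) accumulates C[i][j] in place.

    Returns (C, processing_steps, total_steps).
    """
    n, m, p = len(A), len(B), len(B[0])
    C = [[0] * p for _ in range(n)]
    remaining = n * m * p            # multiply-accumulate operations still to do
    t = 0
    while remaining > 0:
        for i in range(n):
            for j in range(p):
                k = t - i - j        # operand pair reaching unit (i, j) at step t
                if 0 <= k < m:
                    C[i][j] += A[i][k] * B[k][j]
                    remaining -= 1
        t += 1
    processing_steps = t             # comes out as n + m + p - 2
    offload_steps = n                # assumption: read off the slides, not simulated
    return C, processing_steps, processing_steps + offload_steps


if __name__ == "__main__":
    A = [[1, 2, 3], [4, 5, 6]]
    B = [[1, 0, 2, 1], [0, 1, 1, 2], [3, 1, 0, 1]]
    print(simulate_non_stationary(A, B))
```

Under these assumptions the last multiply-accumulate happens at step n + m + p - 2, and adding the n offloading steps gives the 2n + m + p - 2 total on the slide.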

SLIDE 15

Systolic Arrays for Matrix Multiplication

- Stationary
  - One operand (here, B) is stationary

$A_{n \times m} \times B_{m \times p} = C_{n \times p}$

[Figure: array of MAC units with operand matrices A (n × m) and B (m × p)]

SLIDE 16

Systolic Arrays for Matrix Multiplication

- Stationary
  - One operand (here, B) is stationary
- Phase 1:
  - only loading B
  - Time steps: 1

$A_{n \times m} \times B_{m \times p} = C_{n \times p}$

SLIDE 17

Systolic Arrays for Matrix Multiplication

- Stationary
  - One operand (here, B) is stationary
- Phase 1:
  - only loading B
  - Time steps: m - 1

$A_{n \times m} \times B_{m \times p} = C_{n \times p}$

SLIDE 18

Systolic Arrays for Matrix Multiplication

- Stationary
  - One operand (here, B) is stationary
- Phase 2:
  - loading B and processing
  - Time steps: m - 1 + 1

$A_{n \times m} \times B_{m \times p} = C_{n \times p}$

[Timeline: Phase 1]

SLIDE 19

Systolic Arrays for Matrix Multiplication

- Stationary
  - One operand (here, B) is stationary
- Phase 3:
  - only processing
  - Time steps: m + 1

$A_{n \times m} \times B_{m \times p} = C_{n \times p}$

[Timeline: Phases 1 & 2]

SLIDE 20

Systolic Arrays for Matrix Multiplication

- Stationary
  - One operand (here, B) is stationary
- Phase 3:
  - only processing
  - Time steps: m + m - 1

$A_{n \times m} \times B_{m \times p} = C_{n \times p}$

[Timeline: Phases 1 & 2]

SLIDE 21

Systolic Arrays for Matrix Multiplication

- Stationary
  - One operand (here, B) is stationary
- Phase 4:
  - processing and offloading
  - Time steps: 2m - 1 + 1

$A_{n \times m} \times B_{m \times p} = C_{n \times p}$

[Timeline: Phases 1 & 2 & 3]

SLIDE 22

Systolic Arrays for Matrix Multiplication

- Stationary
  - One operand (here, B) is stationary
- Phase 4:
  - processing and offloading
  - Time steps: 2m - 1 + 2

$A_{n \times m} \times B_{m \times p} = C_{n \times p}$

[Timeline: Phases 1 & 2 & 3]

SLIDE 23

Systolic Arrays for Matrix Multiplication

- Stationary
  - One operand (here, B) is stationary
- Phase 4:
  - processing and offloading
  - Time steps: 2m - 1 + 3

$A_{n \times m} \times B_{m \times p} = C_{n \times p}$

[Timeline: Phases 1 & 2 & 3]

SLIDE 24

Systolic Arrays for Matrix Multiplication

- Stationary
  - One operand (here, B) is stationary
- Phase 4:
  - processing and offloading
  - Time steps: 2m - 1 + n + p - 2

$A_{n \times m} \times B_{m \times p} = C_{n \times p}$

[Timeline: Phases 1 & 2 & 3]

SLIDE 25

Systolic Arrays for Matrix Multiplication

- Stationary
  - One operand (here, B) is stationary
- Phase 5:
  - only offloading
  - Time steps: 2m - 1 + n + p - 2 + 1

$A_{n \times m} \times B_{m \times p} = C_{n \times p}$

[Timeline: Phases 1 & 2 & 3, Phase 4]

SLIDE 26

Systolic Arrays for Matrix Multiplication

- Stationary
  - One operand (here, B) is stationary
- Phase 5:
  - only offloading
  - Time steps: n + 2m + p - 2

$A_{n \times m} \times B_{m \times p} = C_{n \times p}$
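The phase counters of the stationary walkthrough can be tallied to confirm the total. The per-phase step counts below are an interpretation of the slide counters (each phase ends where the next one starts), and the names `stationary_phases` and `stationary_total` are illustrative.

```python
def stationary_phases(n, m, p):
    """Per-phase step counts for the stationary array, as read off the slide counters."""
    return {
        "phase 1: only loading B": m - 1,                  # steps 1 .. m - 1
        "phase 2: loading B and processing": 1,            # step m
        "phase 3: only processing": m - 1,                 # steps m + 1 .. 2m - 1
        "phase 4: processing and offloading": n + p - 2,   # up to 2m - 1 + n + p - 2
        "phase 5: only offloading": 1,                     # final step
    }


def stationary_total(n, m, p):
    """Sum of all phases; should equal n + 2m + p - 2."""
    return sum(stationary_phases(n, m, p).values())


if __name__ == "__main__":
    for name, steps in stationary_phases(4, 4, 4).items():
        print(f"{name}: {steps}")
    print("total:", stationary_total(4, 4, 4))
```

Summing the five phases gives (m - 1) + 1 + (m - 1) + (n + p - 2) + 1 = n + 2m + p - 2, which matches the final counter.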

SLIDE 27

Key Challenge

The systolic arrays proposed by prior work are not scalable:
- Their latency grows linearly with the size of the inputs
- Latency is the key metric for single-batch inference

$A_{n \times m} \times B_{m \times p} = C_{n \times p}$

- Stationary: Time steps: n + 2m + p - 2
- Non-stationary: Time steps: 2n + m + p - 2
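To make the linear growth concrete, the two step-count formulas can be evaluated for square matrices of increasing size. This is a small sketch; the function names and the chosen sizes are illustrative, and only the formulas come from the slides.

```python
def non_stationary_steps(n, m, p):
    """Total time steps of the non-stationary array, per the slide formula."""
    return 2 * n + m + p - 2


def stationary_steps(n, m, p):
    """Total time steps of the stationary array, per the slide formula."""
    return n + 2 * m + p - 2


if __name__ == "__main__":
    # Square case n = m = p = size; both formulas reduce to 4 * size - 2.
    for size in (16, 32, 64, 128):
        print(size,
              non_stationary_steps(size, size, size),
              stationary_steps(size, size, size))
```

For n = m = p both expressions reduce to 4n - 2, so doubling the matrix size roughly doubles the latency of either design.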

SLIDE 28

Key Insight and Proposed Systolic Array

Matrix multiplication consists of
- Multiplication
- Additions
  - This can be done in log(m) steps for m numbers

In an optimized implementation
- Latency increases sublinearly with the input size

We propose a systolic array with separate
- Multiplier array
- Adder-tree array

Time steps: n + 2m + p - 2 becomes n + m + log(m) + p - 2 (the 2m term is replaced by m + log(m))

[Figure: array of multipliers feeding an adder tree, with operand matrices A (n × m) and B (m × p)]
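The log(m) claim for the additions can be illustrated with a pairwise reduction. The sketch below models an adder tree in software, assuming base-2 logarithm and one tree level per time step; `adder_tree_sum` is an illustrative name, not something defined on the slides.

```python
def adder_tree_sum(values):
    """Sum values pairwise, level by level.

    Returns (total, levels), where levels is the depth of the adder tree.
    """
    level = list(values)
    levels = 0
    while len(level) > 1:
        # Each adder on this level combines two neighbouring partial sums.
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:           # an odd element passes through unchanged
            nxt.append(level[-1])
        level = nxt
        levels += 1
    return (level[0] if level else 0), levels


if __name__ == "__main__":
    products = [1, 2, 3, 4, 5, 6, 7, 8]   # e.g. m = 8 partial products
    print(adder_tree_sum(products))
```

For m inputs the loop runs ceil(log2(m)) times, whereas accumulating the same products one after another in a chain of MAC units takes a number of steps proportional to m.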

SLIDE 29

Our proposed systolic array

One operand (here, B) is stationary

$A_{n \times m} \times B_{m \times p} = C_{n \times p}$

[Figure: array of multipliers feeding an adder tree, with operand matrices A (n × m) and B (m × p)]

SLIDE 30

Our proposed systolic array

One operand (here, B) is stationary

- Phase 1:
  - only loading B
  - Time steps: 1

$A_{n \times m} \times B_{m \times p} = C_{n \times p}$

SLIDE 31

Our proposed systolic array

One operand (here, B) is stationary

- Phase 1:
  - only loading B
  - Time steps: m - 1

$A_{n \times m} \times B_{m \times p} = C_{n \times p}$

SLIDE 32

Our proposed systolic array

One operand (here, B) is stationary

- Phase 2:
  - loading B and multiplication
  - Time steps: m - 1 + 1

$A_{n \times m} \times B_{m \times p} = C_{n \times p}$

[Timeline: Phase 1]

SLIDE 33

Our proposed systolic array

One operand (here, B) is stationary

- Phase 3:
  - multiplication and addition
  - Time steps: m + 1

$A_{n \times m} \times B_{m \times p} = C_{n \times p}$

[Timeline: Phases 1 & 2]

SLIDE 34

Our proposed systolic array

One operand (here, B) is stationary

- Phase 3:
  - multiplication and addition
  - Time steps: m + 2

$A_{n \times m} \times B_{m \times p} = C_{n \times p}$

[Timeline: Phases 1 & 2]

SLIDE 35

Our proposed systolic array

One operand (here, B) is stationary

- Phase 3:
  - multiplication and addition
  - Time steps: m + 3

$A_{n \times m} \times B_{m \times p} = C_{n \times p}$

[Timeline: Phases 1 & 2]

SLIDE 36

Our proposed systolic array

One operand (here, B) is stationary

- Phase 3:
  - multiplication and addition
  - Time steps: m + 4

$A_{n \times m} \times B_{m \times p} = C_{n \times p}$

[Timeline: Phases 1 & 2]

SLIDE 37

Our proposed systolic array

One operand (here, B) is stationary

- Phase 4:
  - only addition
  - Time steps: m + n + p - 2 + 1

$A_{n \times m} \times B_{m \times p} = C_{n \times p}$

[Timeline: Phases 1 & 2, Phase 3]

SLIDE 38

Our proposed systolic array

One operand (here, B) is stationary

- Phase 4:
  - only addition
  - Time steps: m + n + p - 2 + 2

$A_{n \times m} \times B_{m \times p} = C_{n \times p}$

[Timeline: Phases 1 & 2, Phase 3]

SLIDE 39

Our proposed systolic array

One operand (here, B) is stationary

- Phase 4:
  - only addition
  - Time steps: m + n + p - 2 + 3

$A_{n \times m} \times B_{m \times p} = C_{n \times p}$

[Timeline: Phases 1 & 2, Phase 3]

SLIDE 40

Our proposed systolic array

One operand (here, B) is stationary

- Phase 4:
  - only addition
  - Time steps: m + n + p - 2 + log(m)

$A_{n \times m} \times B_{m \times p} = C_{n \times p}$

[Timeline: Phases 1 & 2, Phase 3]

SLIDE 41

Our proposed systolic array

One operand (here, B) is stationary

- Phase 4:
  - only addition
  - Time steps: n + m + log(m) + p - 2

$A_{n \times m} \times B_{m \times p} = C_{n \times p}$
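The final counter can be compared against the two prior designs with a short sketch. It assumes that log(m) on the slides means the base-2 adder-tree depth rounded up; the per-phase counts are an interpretation of the slide counters, and all function names are illustrative. The ratios it prints are step-count ratios only and are not the measured speedups reported in the results.

```python
import math


def proposed_phases(n, m, p):
    """Per-phase step counts for the proposed array, as read off the slide counters."""
    return {
        "phase 1: only loading B": m - 1,
        "phase 2: loading B and multiplication": 1,
        "phase 3: multiplication and addition": n + p - 2,
        # Assumption: log(m) is the base-2 adder-tree depth, rounded up.
        "phase 4: only addition": math.ceil(math.log2(m)),
    }


def proposed_steps(n, m, p):
    """Sum of the phases: n + m + log(m) + p - 2."""
    return sum(proposed_phases(n, m, p).values())


def stationary_steps(n, m, p):
    return n + 2 * m + p - 2


def non_stationary_steps(n, m, p):
    return 2 * n + m + p - 2


if __name__ == "__main__":
    for size in (16, 64, 256):
        ours = proposed_steps(size, size, size)
        print(size, ours,
              round(stationary_steps(size, size, size) / ours, 2),
              round(non_stationary_steps(size, size, size) / ours, 2))
```

Because the 2m term is replaced by m + log(m), the gap between the proposed count and the stationary count widens as m grows.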

SLIDE 42

Implementation

Tools and Devices:
- ZYNQ XC7z020
- Vivado HLS

Benchmark:
- DNNs (VGG16, VGGS, AlexNet, CifarNet, ResNet50)

Metrics:
- Latency
- Energy consumption

SLIDE 43

Results – Speedup and Energy Consumption

[Figure: bar chart of speedup over non-stationary (axis ticks 1 to 3) for VGGS, AlexNet, CifarNet, VGG16, ResNet50, and GMEAN; series: Stationary, Non-stationary, Our proposed systolic array]

Our proposed systolic array is
- 1.99x faster than non-stationary while consuming 2.12x less energy
- 1.83x faster than stationary while consuming 2.27x less energy

SLIDE 44

Conclusions

Systolic arrays have seen significant interest
- because of their unique interconnections, which satisfy the unique requirement of data reuse in matrix multiplication.

Although the systolic arrays in prior work offer high throughput, their latency is not optimized
- Latency is the key factor for single-batch inference!

To optimize latency, we propose a new systolic array consisting of separate multiplier and adder-tree arrays
- It is faster than both prior proposals when the size of the operands grows