Proposing a Fast and Scalable Systolic Array for Matrix Multiplication

Bahar Asgari, Ramyad Hadidi, Hyesoon Kim

Matrix Multiplication

Matrix multiplication is the key operation in many applications.

Example: convolution in neural networks
- Convolution: K filters of size F × F × C are applied to a W × H × C input, producing a W × H × K output.
- Matrix multiplication: the same computation is a (K × F²·C) matrix times an (F²·C × W·H) matrix, giving a K × W·H result.

Systolic arrays perform matrix multiplication, which
- includes several similar operations (i.e., multiply and accumulate)
- captures a high data reuse rate
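The mapping from convolution to a single matrix multiplication can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `conv_as_matmul`, the nested-list layout, stride 1, and zero padding with an odd filter size F (so that the output keeps W·H positions, as on the slide) are all assumptions.

```python
def conv_as_matmul(inp, filters):
    """Convolve by building two matrices and multiplying them.

    inp:     C x H x W nested lists (input feature map)
    filters: K x C x F x F nested lists (K filters)
    Returns a K x (H*W) matrix, one row per filter.
    Assumes stride 1, odd F, and zero padding (hypothetical choices).
    """
    C, H, W = len(inp), len(inp[0]), len(inp[0][0])
    K, F = len(filters), len(filters[0][0])
    pad = F // 2

    def pixel(c, y, x):
        # Zero padding outside the input borders.
        return inp[c][y][x] if 0 <= y < H and 0 <= x < W else 0

    # Weight matrix: K rows, F*F*C columns (one flattened filter per row).
    weights = [[filters[k][c][i][j]
                for c in range(C) for i in range(F) for j in range(F)]
               for k in range(K)]

    # Patch matrix: F*F*C rows, H*W columns (one flattened window per column).
    patches = [[pixel(c, y + i - pad, x + j - pad)
                for y in range(H) for x in range(W)]
               for c in range(C) for i in range(F) for j in range(F)]

    # Plain matrix multiplication: (K x F*F*C) times (F*F*C x H*W).
    return [[sum(w * v for w, v in zip(row, column))
             for column in zip(*patches)]
            for row in weights]


image = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]        # C = 1, H = W = 3
box = [[[[1, 1, 1], [1, 1, 1], [1, 1, 1]]]]        # K = 1, F = 3
result = conv_as_matmul(image, box)
print(len(result), len(result[0]))
```

The patch matrix repeats each input pixel up to F² times, which is the data reuse that a systolic array is meant to exploit.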

Systolic Arrays for Matrix Multiplication

Non-stationary
- None of the operands is stationary
- A (n × m) × B (m × p) = C (n × p)

[Figure: non-stationary systolic array of MAC units, fed by A (n × m) and B (m × p)]

Systolic Arrays for Matrix Multiplication

Non-stationary: none of the operands is stationary. A (n × m) × B (m × p) = C (n × p)
- Phase 1, only processing: time steps 1, 2, ..., n + m
- Phase 2, processing and offloading: begins at time step n + m + 1
- Phase 3, only offloading: time steps n + m + p - 2 + 1, n + m + p - 2 + 2, ..., n + m + p - 2 + n
- Total time steps: 2n + m + p - 2
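To make the data movement concrete, here is a minimal cycle-level sketch of a textbook array in which neither operand stays in place and each MAC unit accumulates one element of C. It is an assumption that this matches the slides' design: the n × p grid, the input skewing, and the name `simulate_non_stationary` are illustrative. Offloading is not modeled, so the sketch covers only the n + m + p - 2 multiply-accumulate steps and not the further n steps in the slides' total of 2n + m + p - 2.

```python
def simulate_non_stationary(A, B):
    """Simulate an n x p grid of MAC units computing C = A x B.

    Rows of A enter from the left (row i delayed by i cycles) and move right;
    columns of B enter from the top (column j delayed by j cycles) and move down.
    Returns (C, number of multiply-accumulate cycles). Offloading is not modeled.
    """
    n, m, p = len(A), len(B), len(B[0])
    acc = [[0] * p for _ in range(n)]        # one accumulator per MAC unit
    a_reg = [[None] * p for _ in range(n)]   # A value currently held by each unit
    b_reg = [[None] * p for _ in range(n)]   # B value currently held by each unit
    cycles = n + m + p - 2

    for t in range(cycles):
        new_a = [[None] * p for _ in range(n)]
        new_b = [[None] * p for _ in range(n)]
        for i in range(n):
            for j in range(p):
                if j == 0:
                    k = t - i                # skewed injection at the left edge
                    new_a[i][j] = A[i][k] if 0 <= k < m else None
                else:
                    new_a[i][j] = a_reg[i][j - 1]   # shift right
                if i == 0:
                    k = t - j                # skewed injection at the top edge
                    new_b[i][j] = B[k][j] if 0 <= k < m else None
                else:
                    new_b[i][j] = b_reg[i - 1][j]   # shift down
        a_reg, b_reg = new_a, new_b

        # Every unit that holds a valid operand pair does one multiply-accumulate.
        for i in range(n):
            for j in range(p):
                if a_reg[i][j] is not None and b_reg[i][j] is not None:
                    acc[i][j] += a_reg[i][j] * b_reg[i][j]

    return acc, cycles


C, mac_cycles = simulate_non_stationary([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(C, mac_cycles)
```

Because of the skew, unit (i, j) sees A[i][k] and B[k][j] in the same cycle, which is why no operand ever needs to be held in place.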

Systolic Arrays for Matrix Multiplication

Stationary
- One operand (here, B) is stationary
- A (n × m) × B (m × p) = C (n × p)

[Figure: stationary systolic array of MAC units holding B (m × p), with A (n × m) streamed in]

Systolic Arrays for Matrix Multiplication

Stationary: one operand (here, B) is stationary. A (n × m) × B (m × p) = C (n × p)
- Phase 1, only loading B: time steps 1, ..., m - 1
- Phase 2, loading B and processing: time step m - 1 + 1
- Phase 3, only processing: time steps m + 1, ..., m + m - 1
- Phase 4, processing and offloading: time steps 2m - 1 + 1, ..., 2m - 1 + n + p - 2
- Phase 5, only offloading: time step 2m - 1 + n + p - 2 + 1
- Total time steps: n + 2m + p - 2
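The phase boundaries of the stationary design can be tabulated directly from the time-step expressions on these slides. The sketch below only restates that arithmetic; the function name `stationary_phases` and the phase labels are illustrative, and phases that would be empty for small sizes are dropped as an assumption.

```python
def stationary_phases(n, m, p):
    """Return (phase, first_step, last_step) for the stationary array.

    Boundaries are taken from the slides' time-step expressions;
    empty phases (first > last) are filtered out.
    """
    boundaries = [
        ("load B", 1, m - 1),
        ("load B + process", m, m),
        ("process", m + 1, 2 * m - 1),
        ("process + offload", 2 * m, 2 * m - 1 + n + p - 2),
        ("offload", n + 2 * m + p - 2, n + 2 * m + p - 2),
    ]
    return [(name, first, last) for name, first, last in boundaries
            if first <= last]


for name, first, last in stationary_phases(4, 4, 4):
    print(f"{name}: steps {first}..{last}")
```

The last step of the final phase equals the total n + 2m + p - 2, and two of the m-dependent phases (loading B, then processing across the m stages) are what the proposed design later shortens.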

Key Challenge

The systolic arrays proposed by prior work are not scalable:
- Their latency grows linearly with the size of the inputs
- Latency is the key metric for single-batch inference

A (n × m) × B (m × p) = C (n × p)
- Non-stationary time steps: 2n + m + p - 2
- Stationary time steps: n + 2m + p - 2

Key Insight and Proposed Systolic Array

Matrix multiplication consists of
- Multiplication
- Additions: these can be done in log(m) steps for m numbers

In an optimized implementation
- Latency increases sublinearly with the input size

We propose a systolic array with separate
- Multiplier array
- Adder-tree array

Time steps: the 2m term of n + 2m + p - 2 becomes m + log(m)

[Figure: proposed array with B (m × p) held in a multiplier array, A (n × m) streamed in, and an adder tree]
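The claim that m numbers can be added in log(m) steps can be illustrated with a pairwise reduction, where every level of the tree adds neighbouring values in parallel. This is a minimal software sketch, not a hardware description; the name `adder_tree_sum` is hypothetical, and for m that is not a power of two the depth is the ceiling of log2(m).

```python
def adder_tree_sum(values):
    """Sum values with a binary adder tree; return (total, number of levels).

    Each loop iteration models one tree level, where all pairwise
    additions would happen in the same time step in hardware.
    """
    level = list(values)
    if not level:
        raise ValueError("adder tree needs at least one input")
    depth = 0
    while len(level) > 1:
        # Add neighbouring pairs; an unpaired last value passes through.
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])
        level = nxt
        depth += 1
    return level[0], depth


total, levels = adder_tree_sum(range(1, 9))   # m = 8 products
print(total, levels)
```

A chain of accumulators would need m - 1 sequential additions for the same inputs, which is the linear term the adder tree replaces.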

Our proposed systolic array

One operand (here, B) is stationary. A (n × m) × B (m × p) = C (n × p)

[Figure: multiplier array holding B (m × p), with A (n × m) streamed in, followed by an adder tree]

Our proposed systolic array

One operand (here, B) is stationary. A (n × m) × B (m × p) = C (n × p)
- Phase 1, only loading B: time steps 1, ..., m - 1
- Phase 2, loading B and multiplication: time step m - 1 + 1
- Phase 3, multiplication and addition: begins at time step m + 1
- Phase 4, only addition: time steps m + n + p - 2 + 1, ..., m + n + p - 2 + log(m)
- Total time steps: n + m + log(m) + p - 2
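The three time-step totals from the slides can be compared side by side. The sketch below simply evaluates the formulas; the function names are illustrative, and rounding log(m) up to an integer with base 2 is an assumption, since the slides write only log(m). These are time-step counts, not the measured speedups reported in the results.

```python
def steps_non_stationary(n, m, p):
    return 2 * n + m + p - 2


def steps_stationary(n, m, p):
    return n + 2 * m + p - 2


def steps_proposed(n, m, p):
    # Assumption: log(m) is the depth of a binary adder tree, ceil(log2(m)),
    # computed with integer arithmetic to avoid floating-point rounding.
    tree_depth = (m - 1).bit_length()
    return n + m + tree_depth + p - 2


for size in (16, 64, 256):
    print(size,
          steps_non_stationary(size, size, size),
          steps_stationary(size, size, size),
          steps_proposed(size, size, size))
```

With n = m = p, both prior designs need 4n - 2 steps while the proposed one needs 3n - 2 + log(n), so the gap widens as the operands grow.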

Implementation

Tools and devices:
- Zynq XC7Z020
- Vivado HLS

Benchmark:
- DNNs (VGG16, VGGS, AlexNet, CifarNet, ResNet50)

Metrics:
- Latency
- Energy consumption

Results – Speedup and Energy Consumption

Our proposed systolic array is
- 1.99x faster than non-stationary while consuming 2.12x less energy
- 1.83x faster than stationary while consuming 2.27x less energy

[Figure: speedup over non-stationary for VGGS, AlexNet, CifarNet, VGG16, ResNet50, and GMEAN; series: stationary, non-stationary, our proposed systolic array]

Conclusions

Systolic arrays have seen significant interest
- because their unique interconnections satisfy the data-reuse requirement of matrix multiplication.

Although the systolic arrays in prior work offer high throughput, their latency is not optimized
- Latency is the key factor for single-batch inference!

To optimize latency, we propose a new systolic array consisting of separate multiplier and adder-tree arrays
- It is faster than both prior proposals as the size of the operands grows
