

SLIDE 1

E-LSTM: Efficient Inference of Sparse LSTM on Embedded Heterogeneous System

Runbin Shi1, Junjie Liu1, Shuo Wang2, Yun Liang2, Hayden So1

1Department of Electrical and Electronic Engineering

The University of Hong Kong

2Center for Energy-efficient Computing and Applications

School of EECS, Peking University

Design Automation Conference, June 2019

1 / 17

SLIDE 2

Table of Contents

1. Background
   - LSTM-based Neural Networks
   - Target Embedded-Platform of E-LSTM
2. Method: An Area-saving Sparse Weight Format (eSELL)
3. Method: Optimizations for LSTM Inter-Cell Parallelism
   - Generic E-LSTM Architecture and Throughput Bottleneck
   - Optimization with Inherent Sparsity in LSTM Arithmetic
   - Scheduling with Cell-fusion
4. Experiments and Evaluation
5. Conclusion

2 / 17

SLIDE 3

Iterative Cell Evaluation in LSTM Inference

An illustration of the LSTM-cell iteration: the cell consumes the input sequence x = (x1, x2, …, xts) step by step (one embedded vector per time step) and emits the output sequence h = (h1, h2, …, hts).

Input sequence: words, audio, image. Output sequence: translation, prediction.

Figure: The LSTM cell and its iterative evaluation over the temporal sequence.

3 / 17

SLIDE 4

Iterative Cell Evaluation in LSTM Inference

An illustration of the LSTM-cell iteration, unrolled: cell iteration t takes the input xt together with the context (ht−1, ct−1) from the previous iteration and passes (ht, ct) on through the context link to iteration t+1.

Figure: The LSTM cell and its iterative evaluation over the temporal sequence (unrolled view with the context link).

3 / 17

SLIDE 5

Arithmetic of LSTM-cell Computation

Figure: Detailed dataflow in the LSTM cell at iteration t. The gates (σ, tanh) combine xt, ht−1, and ct−1 with the weights (Wf, Uf, bf), (Wi, Ui, bi), (Wc, Uc, bc), and (Wo, Uo, bo) to produce ft, it, ot, ct, and ht.

ft = σ(Wf xt + Uf ht−1 + bf)                     (1)
it = σ(Wi xt + Ui ht−1 + bi)                     (2)
ct = ft · ct−1 + it · tanh(Wc xt + Uc ht−1 + bc) (3)
ot = σ(Wo xt + Uo ht−1 + bo)                     (4)
ht = ot · tanh(ct)                               (5)
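For concreteness, here is a minimal NumPy sketch of one cell iteration following Eq. (1)-(5); the per-gate layout of `params` is this sketch's own convention, not something prescribed by the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell(x_t, h_prev, c_prev, params):
    """One LSTM cell iteration, following Eq. (1)-(5).

    params maps each gate ("f", "i", "c", "o") to its (W, U, b) triple.
    """
    Wf, Uf, bf = params["f"]
    Wi, Ui, bi = params["i"]
    Wc, Uc, bc = params["c"]
    Wo, Uo, bo = params["o"]

    f_t = sigmoid(Wf @ x_t + Uf @ h_prev + bf)                       # Eq. (1)
    i_t = sigmoid(Wi @ x_t + Ui @ h_prev + bi)                       # Eq. (2)
    c_t = f_t * c_prev + i_t * np.tanh(Wc @ x_t + Uc @ h_prev + bc)  # Eq. (3)
    o_t = sigmoid(Wo @ x_t + Uo @ h_prev + bo)                       # Eq. (4)
    h_t = o_t * np.tanh(c_t)                                         # Eq. (5)
    return h_t, c_t
```

Iterating this function over t = 1, …, ts and feeding (ht, ct) back in reproduces the context link shown on the previous slides.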

4 / 17

SLIDE 6

Heavy Workload vs. Low Performance of the Embedded CPU

Main Computational Workload of LSTM

Matrix-vector multiplications:
- W·xt, with W = (Wf, Wi, Wc, Wo)^T ∈ R^(4n×m) and xt ∈ R^m
- U·ht, with U = (Uf, Ui, Uc, Uo)^T ∈ R^(4n×n) and ht ∈ R^n

In a benchmark layer for machine comprehension, m = n = 1500 and one sequence has 35 time steps (cell iterations), which amounts to 630,000,000 MACC operations per sequence. One LSTM layer therefore costs 0.63 second on a CPU delivering 1 GOp/s.
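A quick check of the figures above, as a sketch (one MACC is counted as one operation, and the 1 GOp/s throughput is the slide's assumption for the embedded CPU):

```python
n = m = 1500    # hidden-state and input sizes of the benchmark layer
ts = 35         # time steps (cell iterations) per sequence

macc_per_step = 4 * n * m + 4 * n * n    # W @ x_t plus U @ h_t
macc_per_seq = macc_per_step * ts
print(macc_per_seq)                      # 630000000 MACCs per sequence

cpu_ops_per_s = 1e9                      # 1 GOp/s embedded CPU
print(macc_per_seq / cpu_ops_per_s)      # ~0.63 s for one LSTM layer
```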

5 / 17

SLIDE 7

Heavy Workload vs. Low Performance of the Embedded CPU

Main Computational Workload of LSTM

Matrix-vector multiplications:
- W·xt, with W = (Wf, Wi, Wc, Wo)^T ∈ R^(4n×m) and xt ∈ R^m
- U·ht, with U = (Uf, Ui, Uc, Uo)^T ∈ R^(4n×n) and ht ∈ R^n

Sparsity in Weight Matrix

Sparsity(W, U) ∈ [0.2, 0.8]. CPU performance drops further when computing sparse matrix-vector multiplication (SpMV).

Embedded Solution for LSTM Inference

A heterogeneous system that couples a CPU with a generic LSTM accelerator.

5 / 17

SLIDE 8

Target Platform: Tightly-coupled Heterogeneous System

Figure: Tightly-coupled architecture. A RISC-V CPU (with L1 DCache and DRAM controller) and the hardware accelerator with its buffer sit on the same chip; the accelerator attaches to the CPU through the 64-bit ROCC data and control paths rather than through its own path to DRAM, which saves area.

Advantages:

- lower latency: 1 cycle for a DCache access via ROCC vs. 30 cycles for a DRAM access
- smaller area: the accelerator shares the CPU's DRAM controller instead of integrating its own

Limitations:

- chip-area limitation: weights must be stored off-chip
- ROCC bandwidth: 64 bits/cycle

6 / 17

SLIDE 9

Table of Contents

1. Background
   - LSTM-based Neural Networks
   - Target Embedded-Platform of E-LSTM
2. Method: An Area-saving Sparse Weight Format (eSELL)
3. Method: Optimizations for LSTM Inter-Cell Parallelism
   - Generic E-LSTM Architecture and Throughput Bottleneck
   - Optimization with Inherent Sparsity in LSTM Arithmetic
   - Scheduling with Cell-fusion
4. Experiments and Evaluation
5. Conclusion

7 / 17

SLIDE 10

eSELL: Area-saving Sparse Weight Format

Access Coalescing

Figure: Column-major SpMV with the Compressed Sparse Column (CSC) format. The matrix is streamed from off-chip while the vector and result stay on-chip; sustaining 4 MACCs per cycle requires 4 write ports into the result buffer (Port 0-3).

Figure: Coalesced access to the result buffer. The same 4 MACCs per cycle need only a single write port (Port 0), yielding a 63% area reduction compared with CSC.

SRAM Area Estimation [?]

area ∝ (#bits)^0.9 × (#ports)^0.7
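Plugging the port counts from the two figures into this model gives roughly the reported saving. A back-of-the-envelope sketch, assuming the coalesced design needs one result-buffer write port where the CSC design needs four and the stored bit count stays about the same:

```python
def sram_area(bits, ports):
    """Relative SRAM area from the model: area ∝ (#bits)^0.9 * (#ports)^0.7."""
    return bits ** 0.9 * ports ** 0.7

bits = 16 * 1024   # any common buffer size; it cancels out in the ratio
reduction = 1 - sram_area(bits, 1) / sram_area(bits, 4)
print(f"{reduction:.0%}")   # ~62%, in line with the ~63% reported for eSELL vs. CSC
```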

8 / 17

SLIDE 11

eSELL: Area-saving Sparse Weight Format

Weight Format Construction

STEP 1 tiles the weight matrix (MATw × MATh) into blocks (BLKw × BLKh), splits each block into chunks of CHKh rows, permutes the rows within a chunk, and records the row permutation index (IDXrow). STEPs 2-4 then pack, per chunk, the chunk width (CHKw), the non-zero values, and the encoded column indices (EIDXcol) into the final eSELL bins.

Figure: Steps for eSELL weight format construction.
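To make the chunking and permutation steps concrete, here is a minimal NumPy sketch of a SELL-style construction (fixed chunk height, rows permuted within each chunk by non-zero count, rows padded to the per-chunk width). The eSELL-specific encoding of the column indices (EIDXcol) and the final bin packing follow the paper and are not reproduced here.

```python
import numpy as np

def build_sell_chunks(mat, chk_h=4):
    """Split `mat` into chunks of `chk_h` rows, permute rows within each chunk
    by descending non-zero count, and pad every row to the chunk width."""
    chunks = []
    for top in range(0, mat.shape[0], chk_h):
        block = mat[top:top + chk_h]
        nnz = np.count_nonzero(block, axis=1)
        idx_row = np.argsort(-nnz)          # STEP 1: row permutation index (IDXrow)
        block = block[idx_row]
        chk_w = int(nnz.max())              # chunk width (CHKw)
        values = np.zeros((len(block), chk_w))
        col_idx = np.zeros((len(block), chk_w), dtype=int)
        for r, row in enumerate(block):
            cols = np.flatnonzero(row)      # column positions of the non-zeros
            values[r, :len(cols)] = row[cols]
            col_idx[r, :len(cols)] = cols
        chunks.append({"idx_row": idx_row, "chk_w": chk_w,
                       "values": values, "col_idx": col_idx})
    return chunks
```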

8 / 17

SLIDE 12

eSELL: Area-saving Sparse Weight Format

Alignment to the ROCC Interface

The eSELL stream is packed into 64-bit ROCC words. Each chunk begins with a chunk head holding the row permutation index IDXrow (3×4 bits), the chunk width CHKw (2 bits), and the encoded column indices EIDXcol (3×4 bits), followed by the 16-bit values; leftover bits at the end of a chunk's last word are marked invalid.

Figure: eSELL storage/transmission pattern aligned with the 64-bit ROCC interface.

8 / 17

SLIDE 13

Table of Contents

1. Background
   - LSTM-based Neural Networks
   - Target Embedded-Platform of E-LSTM
2. Method: An Area-saving Sparse Weight Format (eSELL)
3. Method: Optimizations for LSTM Inter-Cell Parallelism
   - Generic E-LSTM Architecture and Throughput Bottleneck
   - Optimization with Inherent Sparsity in LSTM Arithmetic
   - Scheduling with Cell-fusion
4. Experiments and Evaluation
5. Conclusion

9 / 17

SLIDE 14

Generic Accelerator Hardware for Embedded LSTM

The accelerator comprises three SpMV PEs and a vector PE, on-chip buffers (BUFwx, BUFuh, BUFx, BUFb, BUFh, BUFc), an eSELL decoder, and a controller. It attaches to the CPU via the ROCC interface, which carries instructions, data I/O, and load/store requests for the weight matrices and vectors.

Figure: Accelerator architecture in E-LSTM.

10 / 17

SLIDE 15

Throughput Bottleneck

Pipeline diagram for the single SpMV-PE case: the ROCC interface alternates between loading matrix W and matrix U, and the PE computes Wx1, Uh1, Wx2, Uh2, …, Wxts, Uhts in sequence.

Figure: Cell iterations processed in sequence; both the ROCC interface and the PE are fully utilized.

11 / 17

SLIDE 16

Throughput Bottleneck

Pipeline diagram for the multiple SpMV-PE case: the SpMV-PEs compute Wx1, Wx2, Wx3 in parallel, but Uh1, Uh2, Uh3 must still be computed in sequence, leaving PE stall periods while the ROCC interface loads W and U.

Figure: Wxt processed in parallel and Uht in sequence.

11 / 17

SLIDE 17

Throughput Bottleneck

Pipeline diagram for the multiple SpMV-PE case (repeated from the previous slide): Wxt processed in parallel, Uht in sequence, with PE stall periods.

Figure: Wxt processed in parallel and Uht in sequence.

Pipeline Stall

Wxt and Uht cannot be computed concurrently, because the ROCC interface can load only one word of W or U per cycle. PE stalls are therefore unavoidable, and Uht becomes the throughput bottleneck.

11 / 17

SLIDE 18

Optimization 1: Shorten the Uht period with the inherent sparsity of ht

Backtrace of ht computation:

ot = σ(Wo xt + Uo ht−1 + bo)

ht = ot · tanh(ct)

σ(x) = 1 / (1 + e^(−x)); the region with σ(x) < 0.1 accounts for p ≈ 32% of the gate outputs.

Figure: The sigmoid function σ(x).

Inherent sparsity of ht

Since P(ot < 0.1) ≈ 0.32 and tanh(ct) ∈ (−1, 1), a considerable portion of ht is close to zero and can be regarded as zero in the Uht computation.

12 / 17

SLIDE 19

Optimization 1: Shorten the Uht period with the inherent sparsity of ht

Sparse-Matrix Sparse-Vector Multiplication (SpMSpV) in Uht

Figure: SpMV, the original computation of Uht. Every weight column of U is read, regardless of whether the corresponding entry of ht is zero.

Figure: SpMSpV, the Uht computation considering the inherent sparsity of ht. Only the weight columns matching non-zero entries of ht are read.

In this example, SpMSpV achieves 3× speedup on Uht computation.
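A minimal NumPy sketch of the SpMSpV idea: first apply the inherent sparsity of ht (near-zero entries treated as zero, following the 0.1-threshold argument on the previous slide), then read only the weight columns of U that match the surviving non-zeros. Thresholding ht directly and keeping U dense are simplifications of this sketch; the accelerator reads eSELL-encoded columns instead.

```python
import numpy as np

def spmspv(U, h, threshold=0.1):
    """Compute U @ h while skipping weight columns whose h entry is treated as zero."""
    h_sparse = np.where(np.abs(h) < threshold, 0.0, h)   # inherent sparsity of h_t
    active = np.flatnonzero(h_sparse)                    # columns that must be read
    res = np.zeros(U.shape[0])
    for j in active:
        res += U[:, j] * h_sparse[j]                     # one column read per non-zero
    return res
```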

12 / 17

SLIDE 20

Optimization 2: Scheduling with Cell-fusion

Inter-cell Parallel Scheme: Cell-fusion

Cell-fusion Scheme

Assuming there are Npe PEs, (Npe − 1) of them process Wxt (SpMV) and the remaining one processes Uht (SpMSpV). In addition, each SpMV-PE processes the Wxt of Nfuse cell iterations in an interleaved fashion.

Figure: Cell-fusion pipeline schedule (Nfuse = 3, ts = 18). The ROCC interface alternates between loading W and U; two PEs compute the fused groups Wx1,2,3, Wx4,5,6, … while the remaining PE computes Uh1, Uh2, …, Uh18 in order, with stalls only at a few points in the schedule. The total time splits into Tprolog, Tmain, and Tepilog.

13 / 17

SLIDE 21

Optimization 2: Scheduling with Cell-fusion

Inter-cell Parallel Scheme: Cell-fusion

Figure: Cell-fusion pipeline schedule (Nfuse = 3, ts = 18), as on the previous slide.

Advantage: Wxt and Uht are processed concurrently. In every Nfuse cycles, the ROCC interface is occupied for 1 cycle loading W and for (Nfuse − 1) cycles loading U.

13 / 17

SLIDE 22

Optimization 2: Scheduling with Cell-fusion

Fine-tuning fusion factor (Nfuse) in Cell-fusion

Fine-tuning Nfuse for the minimum computation period.

Figure: Cell-fusion pipeline schedule (Nfuse = 3, ts = 18), as on the previous slides.

13 / 17

SLIDE 23

Optimization 2: Scheduling with Cell-fusion

Fine-tuning fusion factor (Nfuse) in Cell-fusion

Fine-tuning Nfuse for the minimum computation period.

Figure: Cell-fusion pipeline schedule (Nfuse = 3, ts = 18), as on the previous slides.

Optimization target: minimize over Nfuse the period T(Nfuse) = Tprolog + Tmain + Tepilog, using the time-consumption model T = f(Nfuse, Npe, ts, len(x), len(h), Spw, Spu, Sph).
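The fine-tuning itself reduces to a one-dimensional sweep over Nfuse. Below is a sketch of that search loop, assuming a hypothetical estimate_cycles(nfuse, n_pe, layer) that evaluates the time-consumption model above (the model itself is not reproduced here):

```python
def tune_nfuse(estimate_cycles, layer, n_pe, nfuse_max=16):
    """Pick the fusion factor that minimizes the modeled computation period.

    estimate_cycles(nfuse, n_pe, layer) is assumed to return
    T = Tprolog + Tmain + Tepilog for the given configuration.
    """
    best = None
    for nfuse in range(1, nfuse_max + 1):
        t = estimate_cycles(nfuse, n_pe, layer)
        if best is None or t < best[1]:
            best = (nfuse, t)
    return best   # (optimal Nfuse, modeled cycles)
```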

13 / 17

SLIDE 24

Table of Contents

1. Background
   - LSTM-based Neural Networks
   - Target Embedded-Platform of E-LSTM
2. Method: An Area-saving Sparse Weight Format (eSELL)
3. Method: Optimizations for LSTM Inter-Cell Parallelism
   - Generic E-LSTM Architecture and Throughput Bottleneck
   - Optimization with Inherent Sparsity in LSTM Arithmetic
   - Scheduling with Cell-fusion
4. Experiments and Evaluation
5. Conclusion

14 / 17

SLIDE 25

Experiments and Evaluation

Implementation Methods

- Sparse LSTM benchmark layers: PyTorch
- Heterogeneous system with the LSTM accelerator: C++ behavioral model in Spike (a gem5-like simulator in the RISC-V ecosystem)
- Scripts for eSELL-format model translation and Nfuse fine-tuning: Python
- Source code: https://github.com/rbshi/elstm

15 / 17

SLIDE 26

Experiments and Evaluation

Evaluation of eSELL Sparse Format

Figure: Bit volume of the compressed sparse matrix versus sparsity for the Dense, eSELL, EC, and CSC formats (matrix size: 768 × 768).

eSELL's bit volume is only competitive with the other formats, but it cuts the SRAM area cost by 63% thanks to fewer result-buffer ports.
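For context, here is a rough sketch of how the dense and CSC curves of such a comparison can be reproduced, assuming 16-bit values and ceil(log2)-sized indices; the EC and eSELL volumes depend on their specific encodings and are not modeled here.

```python
import math

def bit_volume(n_rows, n_cols, density, value_bits=16):
    """Approximate storage cost (in bits) of a dense matrix and of CSC."""
    nnz = int(round(density * n_rows * n_cols))
    row_idx_bits = math.ceil(math.log2(n_rows))
    ptr_bits = math.ceil(math.log2(nnz + 1)) if nnz else 1
    dense = n_rows * n_cols * value_bits
    csc = nnz * (value_bits + row_idx_bits) + (n_cols + 1) * ptr_bits
    return dense, csc

for density in (0.2, 0.4, 0.6, 0.8):
    print(density, *bit_volume(768, 768, density))
```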

15 / 17

SLIDE 27

Experiments and Evaluation

Sparse LSTM Benchmark Layers

Table: Benchmark LSTM layers from three real-world applications.

Name            Layer  len(x)  len(h)  ts  Spw  Spu  Sph   Score
OCR (MNIST)     LSTM1  28      128     28  0.3  0.5  0.22  98.68 / 98.61 / 98.11
                LSTM2  128     128     28  0.2  0.4  0.29  (the higher, the better)
LM (PTB)        LSTM1  800     800     35  0.2  0.5  0.56  81.33 / 81.67 / 88.52
                LSTM2  800     800     35  0.2  0.6  0.41  (the lower, the better)
LM (Wikitext)   LSTM1  1500    1500    35  0.4  0.5  0.37  101.63 / 102.15 / 106.5
                LSTM2  1500    1500    35  0.3  0.4  0.39  (the lower, the better)

- Spw / Spu: sparsity of the weight matrices W and U
- Sph: sparsity of the hidden state ht
- len(x) / len(h): size of the input vector xt / hidden-state vector ht
- ts: time steps (length) of a sequence

15 / 17

SLIDE 28

Experiments and Evaluation

Performance Comparison

Accelerator hardware configuration: 3 PEs (12 MACC ops/cycle). Schemes in comparison: Original vs. Sparse ht vs. Cell fusion + Sparse ht.

Figure: Time consumption (cycles) of the three schemes on each benchmark layer; the bars are annotated with the speedup over the Original scheme.

Benchmark              Original  Sparse ht  Cell fusion + Sparse ht  Optimal Nfuse
OCR (MNIST)   LSTM1    1x        1.31x      1.40x                    4
OCR (MNIST)   LSTM2    1x        1.30x      1.51x                    3
LM (PTB)      LSTM1    1x        1.52x      2.20x                    3
LM (PTB)      LSTM2    1x        1.40x      2.03x                    2
LM (Wikitext) LSTM1    1x        1.34x      1.87x                    4
LM (Wikitext) LSTM2    1x        1.34x      1.83x                    6

Max. speedup: 1.52× with the Sparse ht scheme; 2.2× with the Cell fusion + Sparse ht scheme.

15 / 17

SLIDE 29

Table of Contents

1. Background
   - LSTM-based Neural Networks
   - Target Embedded-Platform of E-LSTM
2. Method: An Area-saving Sparse Weight Format (eSELL)
3. Method: Optimizations for LSTM Inter-Cell Parallelism
   - Generic E-LSTM Architecture and Throughput Bottleneck
   - Optimization with Inherent Sparsity in LSTM Arithmetic
   - Scheduling with Cell-fusion
4. Experiments and Evaluation
5. Conclusion

16 / 17

SLIDE 30

Conclusion

- E-LSTM provides a solution for LSTM acceleration on an embedded heterogeneous system that accounts for both latency and chip-area cost.
- E-LSTM leverages the inherent sparsity in the algorithm and proposes the cell-fusion scheme; with a fine-tuned fusion factor, a significant speedup is achieved, and the scheme is applicable to other LSTM accelerators as well.
- The open-source framework contributes to the RISC-V heterogeneous-system community.

17 / 17