

  1. E-LSTM: Efficient Inference of Sparse LSTM on Embedded Heterogeneous System
     Runbin Shi¹, Junjie Liu¹, Shuo Wang², Yun Liang², Hayden So¹
     ¹Department of Electrical and Electronic Engineering, The University of Hong Kong
     ²Center for Energy-efficient Computing and Applications, School of EECS, Peking University
     Design Automation Conference, June 2019

  2. Table of Contents
     1. Background: LSTM-based Neural Networks; Target Embedded Platform of E-LSTM
     2. Method: An Area-saving Sparse Weight Format (eSELL)
     3. Method: Optimizations for LSTM Inter-Cell Parallelism: Generic E-LSTM Architecture and Throughput Bottleneck; Optimization with Inherent Sparsity in LSTM Arithmetic; Scheduling with Cell-fusion
     4. Experiments and Evaluation
     5. Conclusion

  3. Iterative Cell Evaluation in LSTM Inference
     An illustration of the LSTM-cell iteration: the embedded LSTM cell consumes an input sequence x = (x₁, x₂, …, x_ts) (words, audio, images), one vector per time step (ts), and produces an output sequence h = (h₁, h₂, …, h_ts) (translation, prediction).
     Figure: The LSTM cell and its iterative evaluation over a temporal sequence.

  4. Iterative Cell Evaluation in LSTM Inference (cont.)
     Unrolled over time, each cell iteration t consumes x_t together with the context (h_{t−1}, c_{t−1}) linked from the previous iteration, and emits (h_t, c_t).
     Figure: The LSTM cell and its iterative evaluation over a temporal sequence.

  5. Arithmetic of LSTM-cell Computation
     Figure: Detailed dataflow in the LSTM cell (gates f_t, i_t, o_t; cell state c_t; output h_t).
     f_t = σ(W_f x_t + U_f h_{t−1} + b_f)                              (1)
     i_t = σ(W_i x_t + U_i h_{t−1} + b_i)                              (2)
     c_t = f_t · c_{t−1} + i_t · tanh(W_c x_t + U_c h_{t−1} + b_c)     (3)
     o_t = σ(W_o x_t + U_o h_{t−1} + b_o)                              (4)
     h_t = o_t · tanh(c_t)                                             (5)
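Equations (1)-(5) can be sketched as a single NumPy cell step. The gate keys and the dict-based weight layout below are illustrative choices, not the E-LSTM implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell(x_t, h_prev, c_prev, W, U, b):
    """One LSTM cell iteration following Eqs. (1)-(5).
    W: (n, m) input weights, U: (n, n) recurrent weights, b: (n,)
    biases, each a dict keyed by gate name 'f', 'i', 'c', 'o'."""
    f = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])  # forget gate, Eq. (1)
    i = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])  # input gate,  Eq. (2)
    c = f * c_prev + i * np.tanh(W['c'] @ x_t + U['c'] @ h_prev + b['c'])  # Eq. (3)
    o = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])  # output gate, Eq. (4)
    h = o * np.tanh(c)                                    # Eq. (5)
    return h, c
```

Iterating this function over t = 1 … ts reproduces the unrolled evaluation shown on the previous slide.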

  6. Heavy Workload vs. Low Performance of Embedded CPU
     The main computational workload of LSTM is matrix-vector multiplication:
     W x_t, with W = (W_f, W_i, W_c, W_o)ᵀ ∈ R^{4n×m} and x_t ∈ R^m
     U h_t, with U = (U_f, U_i, U_c, U_o)ᵀ ∈ R^{4n×n} and h_t ∈ R^n
     In a benchmark layer for machine comprehension, m = n = 1500 and one sequence has 35 time steps (cell iterations), giving 630,000,000 MACC operations per sequence. One LSTM layer therefore costs 0.63 s on a CPU delivering 1 GOp/s.
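The 630M figure follows directly from the dimensions above; a quick sanity check of the arithmetic:

```python
# Workload arithmetic for the benchmark layer (m = n = 1500, 35 steps).
m = n = 1500                            # layer dimensions
ts = 35                                 # time steps (cell iterations) per sequence
macc_per_step = 4 * n * m + 4 * n * n   # W @ x_t plus U @ h_{t-1}, all four gates
total_macc = ts * macc_per_step         # MACC operations per sequence
seconds_on_cpu = total_macc / 1e9       # latency on a 1 GOp/s embedded CPU
```

With these numbers, `macc_per_step` is 18,000,000 and `total_macc` is 630,000,000, matching the 0.63 s latency claimed on the slide.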

  7. Heavy Workload vs. Low Performance of Embedded CPU (cont.)
     Sparsity in the weight matrices: Sparsity(W, U) ∈ [0.2, 0.8], yet CPU performance drops when computing sparse matrix-vector multiplication (SpMV).
     Embedded solution for LSTM inference: a heterogeneous system coupling a CPU with a generic LSTM accelerator.

  8. Target Platform: Tightly-coupled Heterogeneous System
     A RISC-V CPU is coupled to the hardware accelerator through the ROCC interface (control and data paths); weights reside in off-chip DRAM behind the CPU's DRAM controller.
     Advantages:
     - Lower latency: 1 cycle (DCache buffer access via ROCC) vs. 30 cycles (DRAM access).
     - Smaller area: the accelerator reuses the CPU-side DRAM controller, saving chip area.
     Limitations:
     - The chip-area limitation forces off-chip weight storage.
     - ROCC bandwidth: 64 bits/cycle.
     Figure: Tightly-coupled architecture.

  9. Table of Contents (section divider; next: Method: An Area-saving Sparse Weight Format (eSELL))

  10. eSELL: Area-saving Sparse Weight Format: Access Coalescing
     Figure (left): Column-major SpMV with the Compressed Sparse Column (CSC) format, 4 MACCs per cycle; the scattered result writes require one result-buffer port per MACC.
     Figure (right): Coalesced access to the result buffer, 4 MACCs per cycle with fewer ports, yielding a 63% area reduction relative to CSC.
     SRAM area estimation: area ∝ (#bits)^0.9 × (#ports)^0.7
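For reference, the baseline column-major CSC SpMV that eSELL is compared against can be sketched as a plain software model (the hardware performs four such MACCs per cycle; the function name and argument layout here are illustrative):

```python
import numpy as np

def spmv_csc(values, row_idx, col_ptr, x, n_rows):
    """Column-major sparse matrix-vector product y = A @ x with A in
    CSC format: for each column j, scatter values[k] * x[j] into the
    result rows. This scattered write pattern is what drives the
    result-buffer port count that eSELL's coalescing reduces."""
    y = np.zeros(n_rows)
    for j in range(len(col_ptr) - 1):
        for k in range(col_ptr[j], col_ptr[j + 1]):
            y[row_idx[k]] += values[k] * x[j]   # one MACC per stored non-zero
    return y
```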

  11. eSELL: Area-saving Sparse Weight Format: Weight Format Construction
     The eSELL format is built in four steps over blocks (BLK) and chunks (CHK) of the weight matrix: (STEP 1) extract the non-zero values and their column indices; (STEP 2) permute rows within a block, recorded in the row permutation index (IDX_row); (STEP 3) group the permuted rows into chunks, each with a common chunk width (CHK_w); (STEP 4) encode the column indices into a binary form (EIDX_col).
     Figure: Steps for eSELL weight format construction.

  12. eSELL: Area-saving Sparse Weight Format: Alignment to the ROCC Interface
     A head word packs the metadata of two chunks, each consisting of IDX_row (3×4 bits), EIDX_col (3×4 bits), and CHK_w (2 bits), plus invalid padding bits; the following words each carry four 16-bit weight values.
     Figure: eSELL storage/transmission pattern aligned with the 64-bit ROCC interface.
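The value-word packing can be illustrated with a minimal sketch. This covers only the four 16-bit values per 64-bit ROCC word, not the richer chunk-head bit layout:

```python
def pack_rocc_word(vals):
    """Pack four 16-bit weight values into one 64-bit ROCC word
    (value words only; the chunk-head encoding is more involved)."""
    assert len(vals) == 4 and all(0 <= v < (1 << 16) for v in vals)
    word = 0
    for i, v in enumerate(vals):
        word |= v << (16 * i)       # value i occupies bits [16i, 16i + 15]
    return word

def unpack_rocc_word(word):
    """Recover the four 16-bit values from a packed 64-bit word."""
    return [(word >> (16 * i)) & 0xFFFF for i in range(4)]
```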

  13. Table of Contents (section divider; next: Method: Optimizations for LSTM Inter-Cell Parallelism)

  14. Generic Accelerator Hardware for Embedded LSTM
     The accelerator comprises an instruction controller, an eSELL weight-matrix decoder, an array of SpMV PEs, a vector PE, on-chip buffers (BUF_x, BUF_wx, BUF_uh, BUF_h, BUF_b, BUF_c), and a load/store request interface for data I/O over ROCC.
     Figure: Accelerator architecture in E-LSTM.

  15. Throughput Bottleneck
     Pipeline diagram for the single SpMV-PE case: the ROCC loads matrix W and then matrix U, while the PE computes Wx₁, Uh₁, Wx₂, …, Wx_ts, Uh_ts.
     Figure: Cell iterations processed in sequence; both the ROCC and the PE are fully utilized.

  16. Throughput Bottleneck (cont.)
     Pipeline diagram for the multiple SpMV-PE case: three SpMV-PEs compute Wx_t in parallel (PE1: Wx₁, Wx₄; PE2: Wx₂, Wx₅; PE3: Wx₃, Wx₆), but Uh₁, Uh₂, Uh₃ are still computed in sequence, leaving PE stall periods.
     Figure: W x_t processed in parallel and U h_t in sequence.

  17. Throughput Bottleneck (cont.)
     Figure: W x_t processed in parallel and U h_t in sequence (same pipeline diagram as the previous slide).
     Pipeline stall: W x_t and U h_t cannot be computed concurrently, because the ROCC interface can load only one word of W or U per cycle. The PE stall is therefore unavoidable, and U h_t becomes the throughput bottleneck.

  18. Optimization 1: Shorten the U h_t Period with the Inherent Sparsity of h_t
     Backtrace of the h_t computation:
     o_t = σ(W_o x_t + U_o h_{t−1} + b_o)
     h_t = o_t · tanh(c_t), with σ(x) = 1 / (1 + e^{−x})
     Inherent sparsity of h_t: since P(o_t < 0.1) ≈ 0.32 and tanh(c_t) ∈ (−1, 1), a considerable portion of h_t is close to zero and can be regarded as zero in the U h_t computation.
     Figure: the sigmoid function σ(x), with the σ(x) = 0.1 level marked at p = 32%.

  19. Optimization 1: Shorten the U h_t Period with the Inherent Sparsity of h_t (cont.)
     Sparse-matrix sparse-vector multiplication (SpMSpV) in U h_t: only the weight columns of U whose corresponding h_t entry is non-zero need to be read.
     Figure (left): SpMV, the original computation of U h_t. Figure (right): SpMSpV, the U h_t computation exploiting the inherent sparsity of h_t. In this example, SpMSpV achieves a 3× speedup on the U h_t computation.
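The SpMSpV idea can be sketched in software. The per-column data layout and the 0.1 threshold default below are illustrative assumptions, not the accelerator's datapath:

```python
import numpy as np

def spmspv(U_cols, h, n_rows, thresh=0.1):
    """Column-major U @ h exploiting the inherent sparsity of h:
    weight columns whose matching h entry is below the threshold are
    never fetched. U_cols[j] is column j as a (values, row_indices)
    pair; the 0.1 default mirrors the output-gate threshold above."""
    y = np.zeros(n_rows)
    cols_read = 0
    for j, (vals, rows) in enumerate(U_cols):
        if abs(h[j]) < thresh:
            continue                      # h_t entry regarded as zero
        cols_read += 1                    # this weight column must be read
        for v, r in zip(vals, rows):
            y[r] += v * h[j]
    return y, cols_read
```

The `cols_read` counter makes the speedup visible: with a third of the h entries near zero, roughly a third of the U columns are skipped.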

  20. Optimization 2: Scheduling with Cell-fusion
     Inter-cell parallel scheme (cell-fusion): assuming there are N_pe PEs, (N_pe − 1) of them process W x_t (SpMV) and the remaining one processes U h_t (SpMSpV). In addition, each SpMV-PE processes the W x_t of N_fuse cell iterations in an interleaved manner.
     Figure: pipeline diagram with N_fuse = 3 and ts = 18. The ROCC alternates between loading W and loading U (with free slots); PE1 computes Wx₁,₂,₃, Wx₇,₈,₉, Wx₁₃,₁₄,₁₅, PE2 computes Wx₄,₅,₆, Wx₁₀,₁₁,₁₂, Wx₁₆,₁₇,₁₈, and PE3 computes Uh₁ … Uh₁₈, with the schedule split into T_prolog, T_main, and T_epilog phases.

  21. Optimization 2: Scheduling with Cell-fusion (cont.)
     Figure: same pipeline diagram as the previous slide (N_fuse = 3, ts = 18).
     Advantage: W x_t and U h_t are processed concurrently. In every N_fuse cycles, the ROCC interface is occupied by loading W for 1 cycle and by loading U for (N_fuse − 1) cycles.
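The ROCC occupancy claim can be checked with a small counting model (a deliberate simplification: the T_prolog and T_epilog phases and free slots are ignored):

```python
def rocc_schedule(n_fuse, cycles):
    """Model the steady-state (T_main) ROCC occupancy under cell-fusion:
    in every window of n_fuse cycles, 1 cycle streams a W word to the
    SpMV-PEs and the remaining (n_fuse - 1) cycles stream U words to
    the SpMSpV-PE. Returns (W cycles, U cycles) over `cycles` cycles."""
    w_cycles = sum(1 for t in range(cycles) if t % n_fuse == 0)
    u_cycles = cycles - w_cycles
    return w_cycles, u_cycles
```

For N_fuse = 3, the interface spends one third of its cycles on W and two thirds on U, matching the 1 : (N_fuse − 1) ratio on the slide.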
