

SLIDE 1

E-LSTM: Efficient Inference of Sparse LSTM on Embedded Heterogeneous System

Runbin Shi1, Junjie Liu1, Shuo Wang2, Yun Liang2, Hayden So1

1Department of Electrical and Electronic Engineering

The University of Hong Kong

2Center for Energy-efficient Computing and Applications

School of EECS, Peking University

Design Automation Conference, June 2019

1 / 17

SLIDE 2

Table of Contents

1. Background
   - LSTM-based Neural Networks
   - Target Embedded-Platform of E-LSTM
2. Method: An Area-saving Sparse Weight Format (eSELL)
3. Method: Optimizations for LSTM Inter-Cell Parallelism
   - Generic E-LSTM Architecture and Throughput Bottleneck
   - Optimization with Inherent Sparsity in LSTM Arithmetic
   - Scheduling with Cell-fusion
4. Experiments and Evaluation
5. Conclusion

2 / 17

SLIDE 3

Iterative Cell Evaluation in LSTM Inference

An illustration of the LSTM-cell iteration: the cell consumes the input sequence x = (x1, x2, …, xts) step by step (one embedded vector per time step) and emits the output sequence h = (h1, h2, …, hts).

Input sequence: words, audio, image. Output sequence: translation, prediction.

Figure: The LSTM cell and its iterative evaluation over the temporal sequence.

3 / 17

SLIDE 4

Iterative Cell Evaluation in LSTM Inference

An illustration of the LSTM-cell iteration, unrolled: cell iteration t takes the input xt together with the context (ht−1, ct−1) from the previous iteration and passes (ht, ct) on through the context link to iteration t+1.

Figure: The LSTM cell and its iterative evaluation over the temporal sequence (unrolled view with the context link).

3 / 17

SLIDE 5

Arithmetic of LSTM-cell Computation

Figure: Detailed dataflow in the LSTM cell at iteration t. The gates (σ, tanh) combine xt, ht−1, and ct−1 with the weights (Wf, Uf, bf), (Wi, Ui, bi), (Wc, Uc, bc), and (Wo, Uo, bo) to produce ft, it, ot, ct, and ht.

ft = σ(Wf xt + Uf ht−1 + bf)                     (1)
it = σ(Wi xt + Ui ht−1 + bi)                     (2)
ct = ft · ct−1 + it · tanh(Wc xt + Uc ht−1 + bc) (3)
ot = σ(Wo xt + Uo ht−1 + bo)                     (4)
ht = ot · tanh(ct)                               (5)
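For concreteness, here is a minimal NumPy sketch of one cell iteration following Eq. (1)-(5); the per-gate layout of `params` is this sketch's own convention, not something prescribed by the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell(x_t, h_prev, c_prev, params):
    """One LSTM cell iteration, following Eq. (1)-(5).

    params maps each gate ("f", "i", "c", "o") to its (W, U, b) triple.
    """
    Wf, Uf, bf = params["f"]
    Wi, Ui, bi = params["i"]
    Wc, Uc, bc = params["c"]
    Wo, Uo, bo = params["o"]

    f_t = sigmoid(Wf @ x_t + Uf @ h_prev + bf)                       # Eq. (1)
    i_t = sigmoid(Wi @ x_t + Ui @ h_prev + bi)                       # Eq. (2)
    c_t = f_t * c_prev + i_t * np.tanh(Wc @ x_t + Uc @ h_prev + bc)  # Eq. (3)
    o_t = sigmoid(Wo @ x_t + Uo @ h_prev + bo)                       # Eq. (4)
    h_t = o_t * np.tanh(c_t)                                         # Eq. (5)
    return h_t, c_t
```

Iterating this function over t = 1, …, ts and feeding (ht, ct) back in reproduces the context link shown on the previous slides.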

4 / 17

SLIDE 6

Heavy Workload vs. Low Performance of the Embedded CPU

Main Computational Workload of LSTM

Matrix-vector multiplications:
- W·xt, with W = (Wf, Wi, Wc, Wo)^T ∈ R^(4n×m) and xt ∈ R^m
- U·ht, with U = (Uf, Ui, Uc, Uo)^T ∈ R^(4n×n) and ht ∈ R^n

In a benchmark layer for machine comprehension, m = n = 1500 and one sequence has 35 time steps (cell iterations), which amounts to 630,000,000 MACC operations per sequence. One LSTM layer therefore costs 0.63 second on a CPU delivering 1 GOp/s.
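A quick check of the figures above, as a sketch (one MACC is counted as one operation, and the 1 GOp/s throughput is the slide's assumption for the embedded CPU):

```python
n = m = 1500    # hidden-state and input sizes of the benchmark layer
ts = 35         # time steps (cell iterations) per sequence

macc_per_step = 4 * n * m + 4 * n * n    # W @ x_t plus U @ h_t
macc_per_seq = macc_per_step * ts
print(macc_per_seq)                      # 630000000 MACCs per sequence

cpu_ops_per_s = 1e9                      # 1 GOp/s embedded CPU
print(macc_per_seq / cpu_ops_per_s)      # ~0.63 s for one LSTM layer
```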

5 / 17

SLIDE 7

Heavy Workload vs. Low Performance of the Embedded CPU

Main Computational Workload of LSTM

Matrix-vector multiplications:
- W·xt, with W = (Wf, Wi, Wc, Wo)^T ∈ R^(4n×m) and xt ∈ R^m
- U·ht, with U = (Uf, Ui, Uc, Uo)^T ∈ R^(4n×n) and ht ∈ R^n

Sparsity in Weight Matrix

Sparsity(W, U) ∈ [0.2, 0.8]. CPU performance drops further when computing sparse matrix-vector multiplication (SpMV).

Embedded Solution for LSTM Inference

A heterogeneous system that couples a CPU with a generic LSTM accelerator.

5 / 17

SLIDE 8

Target Platform: Tightly-coupled Heterogeneous System

Figure: Tightly-coupled architecture. A RISC-V CPU (with L1 DCache and DRAM controller) and the hardware accelerator with its buffer sit on the same chip; the accelerator attaches to the CPU through the 64-bit ROCC data and control paths rather than through its own path to DRAM, which saves area.

Advantages:

- lower latency: 1 cycle for a DCache access via ROCC vs. 30 cycles for a DRAM access
- smaller area: the accelerator shares the CPU's DRAM controller instead of integrating its own

Limitations:

- chip-area limitation: weights must be stored off-chip
- ROCC bandwidth: 64 bits/cycle

6 / 17

SLIDE 9

Table of Contents

1. Background
   - LSTM-based Neural Networks
   - Target Embedded-Platform of E-LSTM
2. Method: An Area-saving Sparse Weight Format (eSELL)
3. Method: Optimizations for LSTM Inter-Cell Parallelism
   - Generic E-LSTM Architecture and Throughput Bottleneck
   - Optimization with Inherent Sparsity in LSTM Arithmetic
   - Scheduling with Cell-fusion
4. Experiments and Evaluation
5. Conclusion

7 / 17

SLIDE 10

eSELL: Area-saving Sparse Weight Format

Access Coalescing

Figure: Column-major SpMV with the Compressed Sparse Column (CSC) format. The matrix is streamed from off-chip while the vector and result stay on-chip; sustaining 4 MACCs per cycle requires 4 write ports into the result buffer (Port 0-3).

Figure: Coalesced access to the result buffer. The same 4 MACCs per cycle need only a single write port (Port 0), yielding a 63% area reduction compared with CSC.

SRAM Area Estimation [?]

area ∝ (#bits)^0.9 × (#ports)^0.7
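Plugging the port counts from the two figures into this model gives roughly the reported saving. A back-of-the-envelope sketch, assuming the coalesced design needs one result-buffer write port where the CSC design needs four and the stored bit count stays about the same:

```python
def sram_area(bits, ports):
    """Relative SRAM area from the model: area ∝ (#bits)^0.9 * (#ports)^0.7."""
    return bits ** 0.9 * ports ** 0.7

bits = 16 * 1024   # any common buffer size; it cancels out in the ratio
reduction = 1 - sram_area(bits, 1) / sram_area(bits, 4)
print(f"{reduction:.0%}")   # ~62%, in line with the ~63% reported for eSELL vs. CSC
```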

8 / 17

SLIDE 11

eSELL: Area-saving Sparse Weight Format

Weight Format Construction

STEP 1 tiles the weight matrix (MATw × MATh) into blocks (BLKw × BLKh), splits each block into chunks of CHKh rows, permutes the rows within a chunk, and records the row permutation index (IDXrow). STEPs 2-4 then pack, per chunk, the chunk width (CHKw), the non-zero values, and the encoded column indices (EIDXcol) into the final eSELL bins.

Figure: Steps for eSELL weight format construction.
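To make the chunking and permutation steps concrete, here is a minimal NumPy sketch of a SELL-style construction (fixed chunk height, rows permuted within each chunk by non-zero count, rows padded to the per-chunk width). The eSELL-specific encoding of the column indices (EIDXcol) and the final bin packing follow the paper and are not reproduced here.

```python
import numpy as np

def build_sell_chunks(mat, chk_h=4):
    """Split `mat` into chunks of `chk_h` rows, permute rows within each chunk
    by descending non-zero count, and pad every row to the chunk width."""
    chunks = []
    for top in range(0, mat.shape[0], chk_h):
        block = mat[top:top + chk_h]
        nnz = np.count_nonzero(block, axis=1)
        idx_row = np.argsort(-nnz)          # STEP 1: row permutation index (IDXrow)
        block = block[idx_row]
        chk_w = int(nnz.max())              # chunk width (CHKw)
        values = np.zeros((len(block), chk_w))
        col_idx = np.zeros((len(block), chk_w), dtype=int)
        for r, row in enumerate(block):
            cols = np.flatnonzero(row)      # column positions of the non-zeros
            values[r, :len(cols)] = row[cols]
            col_idx[r, :len(cols)] = cols
        chunks.append({"idx_row": idx_row, "chk_w": chk_w,
                       "values": values, "col_idx": col_idx})
    return chunks
```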

8 / 17

SLIDE 12

eSELL: Area-saving Sparse Weight Format

Alignment to the ROCC Interface

The eSELL stream is packed into 64-bit ROCC words. Each chunk begins with a chunk head holding the row permutation index IDXrow (3×4 bits), the chunk width CHKw (2 bits), and the encoded column indices EIDXcol (3×4 bits), followed by the 16-bit values; leftover bits at the end of a chunk's last word are marked invalid.

Figure: eSELL storage/transmission pattern aligned with the 64-bit ROCC interface.

8 / 17

SLIDE 13

Table of Contents

1. Background
   - LSTM-based Neural Networks
   - Target Embedded-Platform of E-LSTM
2. Method: An Area-saving Sparse Weight Format (eSELL)
3. Method: Optimizations for LSTM Inter-Cell Parallelism
   - Generic E-LSTM Architecture and Throughput Bottleneck
   - Optimization with Inherent Sparsity in LSTM Arithmetic
   - Scheduling with Cell-fusion
4. Experiments and Evaluation
5. Conclusion

9 / 17

SLIDE 14

Generic Accelerator Hardware for Embedded LSTM

The accelerator comprises three SpMV PEs and a vector PE, on-chip buffers (BUFwx, BUFuh, BUFx, BUFb, BUFh, BUFc), an eSELL decoder, and a controller. It attaches to the CPU via the ROCC interface, which carries instructions, data I/O, and load/store requests for the weight matrices and vectors.

Figure: Accelerator architecture in E-LSTM.

10 / 17

SLIDE 15

Throughput Bottleneck

Pipeline diagram for the single SpMV-PE case: the ROCC interface alternates between loading matrix W and matrix U, and the PE computes Wx1, Uh1, Wx2, Uh2, …, Wxts, Uhts in sequence.

Figure: Cell iterations processed in sequence; both the ROCC interface and the PE are fully utilized.

11 / 17

SLIDE 16

Throughput Bottleneck

Pipeline diagram for the multiple SpMV-PE case: the SpMV-PEs compute Wx1, Wx2, Wx3 in parallel, but Uh1, Uh2, Uh3 must still be computed in sequence, leaving PE stall periods while the ROCC interface loads W and U.

Figure: Wxt processed in parallel and Uht in sequence.

11 / 17

SLIDE 17

Throughput Bottleneck

Pipeline diagram for the multiple SpMV-PE case (repeated from the previous slide): Wxt processed in parallel, Uht in sequence, with PE stall periods.

Figure: Wxt processed in parallel and Uht in sequence.

Pipeline Stall

Wxt and Uht cannot be computed concurrently, because the ROCC interface can load only one word of W or U per cycle. PE stalls are therefore unavoidable, and Uht becomes the throughput bottleneck.

11 / 17

SLIDE 18

Optimization 1: Shorten the Uht period with the inherent sparsity of ht

Backtrace of ht computation:

ot = σ(Wo xt + Uo ht−1 + bo)

ht = ot · tanh(ct)

σ(x) = 1 / (1 + e^(−x)); the region with σ(x) < 0.1 accounts for p ≈ 32% of the gate outputs.

Figure: The sigmoid function σ(x).

Inherent sparsity of ht

Since P(ot < 0.1) ≈ 0.32 and tanh(ct) ∈ (−1, 1), a considerable portion of ht is close to zero and can be regarded as zero in the Uht computation.

12 / 17

SLIDE 19

Optimization 1: Shorten the Uht period with the inherent sparsity of ht

Sparse-Matrix Sparse-Vector Multiplication (SpMSpV) in Uht

Figure: SpMV, the original computation of Uht. Every weight column of U is read, regardless of whether the corresponding entry of ht is zero.

Figure: SpMSpV, the Uht computation considering the inherent sparsity of ht. Only the weight columns matching non-zero entries of ht are read.

In this example, SpMSpV achieves 3× speedup on Uht computation.
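A minimal NumPy sketch of the SpMSpV idea: first apply the inherent sparsity of ht (near-zero entries treated as zero, following the 0.1-threshold argument on the previous slide), then read only the weight columns of U that match the surviving non-zeros. Thresholding ht directly and keeping U dense are simplifications of this sketch; the accelerator reads eSELL-encoded columns instead.

```python
import numpy as np

def spmspv(U, h, threshold=0.1):
    """Compute U @ h while skipping weight columns whose h entry is treated as zero."""
    h_sparse = np.where(np.abs(h) < threshold, 0.0, h)   # inherent sparsity of h_t
    active = np.flatnonzero(h_sparse)                    # columns that must be read
    res = np.zeros(U.shape[0])
    for j in active:
        res += U[:, j] * h_sparse[j]                     # one column read per non-zero
    return res
```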

12 / 17

SLIDE 20

Optimization 2: Scheduling with Cell-fusion

Inter-cell Parallel Scheme: Cell-fusion

Cell-fusion Scheme

Assuming there are Npe PEs, (Npe − 1) of them process Wxt (SpMV) and the remaining one processes Uht (SpMSpV). In addition, each SpMV-PE processes the Wxt of Nfuse cell iterations in an interleaved fashion.

Figure: Cell-fusion pipeline schedule (Nfuse = 3, ts = 18). The ROCC interface alternates between loading W and U; two PEs compute the fused groups Wx1,2,3, Wx4,5,6, … while the remaining PE computes Uh1, Uh2, …, Uh18 in order, with stalls only at a few points in the schedule. The total time splits into Tprolog, Tmain, and Tepilog.

13 / 17

SLIDE 21

Optimization 2: Scheduling with Cell-fusion

Inter-cell Parallel Scheme: Cell-fusion

Figure: Cell-fusion pipeline schedule (Nfuse = 3, ts = 18), as on the previous slide.

Advantage: Wxt and Uht are processed concurrently. In every Nfuse cycles, the ROCC interface is occupied for 1 cycle loading W and for (Nfuse − 1) cycles loading U.

13 / 17

SLIDE 22

Optimization 2: Scheduling with Cell-fusion

Fine-tuning fusion factor (Nfuse) in Cell-fusion

Fine-tuning Nfuse for the minimum computation period.

Figure: Cell-fusion pipeline schedule (Nfuse = 3, ts = 18), as on the previous slides.

13 / 17

SLIDE 23

Optimization 2: Scheduling with Cell-fusion

Fine-tuning fusion factor (Nfuse) in Cell-fusion

Fine-tuning Nfuse for the minimum computation period.

Figure: Cell-fusion pipeline schedule (Nfuse = 3, ts = 18), as on the previous slides.

Optimization target: minimize over Nfuse the period T(Nfuse) = Tprolog + Tmain + Tepilog, using the time-consumption model T = f(Nfuse, Npe, ts, len(x), len(h), Spw, Spu, Sph).
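The fine-tuning itself reduces to a one-dimensional sweep over Nfuse. Below is a sketch of that search loop, assuming a hypothetical estimate_cycles(nfuse, n_pe, layer) that evaluates the time-consumption model above (the model itself is not reproduced here):

```python
def tune_nfuse(estimate_cycles, layer, n_pe, nfuse_max=16):
    """Pick the fusion factor that minimizes the modeled computation period.

    estimate_cycles(nfuse, n_pe, layer) is assumed to return
    T = Tprolog + Tmain + Tepilog for the given configuration.
    """
    best = None
    for nfuse in range(1, nfuse_max + 1):
        t = estimate_cycles(nfuse, n_pe, layer)
        if best is None or t < best[1]:
            best = (nfuse, t)
    return best   # (optimal Nfuse, modeled cycles)
```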

13 / 17

SLIDE 24

Table of Contents

1. Background
   - LSTM-based Neural Networks
   - Target Embedded-Platform of E-LSTM
2. Method: An Area-saving Sparse Weight Format (eSELL)
3. Method: Optimizations for LSTM Inter-Cell Parallelism
   - Generic E-LSTM Architecture and Throughput Bottleneck
   - Optimization with Inherent Sparsity in LSTM Arithmetic
   - Scheduling with Cell-fusion
4. Experiments and Evaluation
5. Conclusion

14 / 17

SLIDE 25

Experiments and Evaluation

Implementation Methods

- Sparse LSTM benchmark layers: PyTorch
- Heterogeneous system with the LSTM accelerator: C++ behavioral model in Spike (a gem5-like simulator in the RISC-V ecosystem)
- Scripts for eSELL-format model translation and Nfuse fine-tuning: Python
- Source code: https://github.com/rbshi/elstm

15 / 17

SLIDE 26

Experiments and Evaluation

Evaluation of eSELL Sparse Format

Figure: Bit volume of the compressed sparse matrix versus sparsity for the Dense, eSELL, EC, and CSC formats (matrix size: 768 × 768).

eSELL's bit volume is only competitive with the other formats, but it cuts the SRAM area cost by 63% thanks to fewer result-buffer ports.
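For context, here is a rough sketch of how the dense and CSC curves of such a comparison can be reproduced, assuming 16-bit values and ceil(log2)-sized indices; the EC and eSELL volumes depend on their specific encodings and are not modeled here.

```python
import math

def bit_volume(n_rows, n_cols, density, value_bits=16):
    """Approximate storage cost (in bits) of a dense matrix and of CSC."""
    nnz = int(round(density * n_rows * n_cols))
    row_idx_bits = math.ceil(math.log2(n_rows))
    ptr_bits = math.ceil(math.log2(nnz + 1)) if nnz else 1
    dense = n_rows * n_cols * value_bits
    csc = nnz * (value_bits + row_idx_bits) + (n_cols + 1) * ptr_bits
    return dense, csc

for density in (0.2, 0.4, 0.6, 0.8):
    print(density, *bit_volume(768, 768, density))
```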

15 / 17

SLIDE 27

Experiments and Evaluation

Sparse LSTM Benchmark Layers

Table: Benchmark LSTM layers from three real-world applications.

Name            Layer  len(x)  len(h)  ts  Spw  Spu  Sph   Score
OCR (MNIST)     LSTM1  28      128     28  0.3  0.5  0.22  98.68 / 98.61 / 98.11
                LSTM2  128     128     28  0.2  0.4  0.29  (the higher, the better)
LM (PTB)        LSTM1  800     800     35  0.2  0.5  0.56  81.33 / 81.67 / 88.52
                LSTM2  800     800     35  0.2  0.6  0.41  (the lower, the better)
LM (Wikitext)   LSTM1  1500    1500    35  0.4  0.5  0.37  101.63 / 102.15 / 106.5
                LSTM2  1500    1500    35  0.3  0.4  0.39  (the lower, the better)

- Spw / Spu: sparsity of the weight matrices W and U
- Sph: sparsity of the hidden state ht
- len(x) / len(h): size of the input vector xt / hidden-state vector ht
- ts: time steps (length) of a sequence

15 / 17

SLIDE 28

Experiments and Evaluation

Performance Comparison

Accelerator hardware configuration: 3 PEs (12 MACC ops/cycle). Schemes in comparison: Original vs. Sparse ht vs. Cell fusion + Sparse ht.

Figure: Time consumption (cycles) of the three schemes on each benchmark layer; the bars are annotated with the speedup over the Original scheme.

Benchmark              Original  Sparse ht  Cell fusion + Sparse ht  Optimal Nfuse
OCR (MNIST)   LSTM1    1x        1.31x      1.40x                    4
OCR (MNIST)   LSTM2    1x        1.30x      1.51x                    3
LM (PTB)      LSTM1    1x        1.52x      2.20x                    3
LM (PTB)      LSTM2    1x        1.40x      2.03x                    2
LM (Wikitext) LSTM1    1x        1.34x      1.87x                    4
LM (Wikitext) LSTM2    1x        1.34x      1.83x                    6

Max. speedup: 1.52× with the Sparse ht scheme; 2.2× with the Cell fusion + Sparse ht scheme.

15 / 17

SLIDE 29

Table of Contents

1. Background
   - LSTM-based Neural Networks
   - Target Embedded-Platform of E-LSTM
2. Method: An Area-saving Sparse Weight Format (eSELL)
3. Method: Optimizations for LSTM Inter-Cell Parallelism
   - Generic E-LSTM Architecture and Throughput Bottleneck
   - Optimization with Inherent Sparsity in LSTM Arithmetic
   - Scheduling with Cell-fusion
4. Experiments and Evaluation
5. Conclusion

16 / 17

SLIDE 30

Conclusion

- E-LSTM provides a solution for LSTM acceleration on an embedded heterogeneous system that accounts for both latency and chip-area cost.
- E-LSTM leverages the inherent sparsity in the algorithm and proposes the cell-fusion scheme; with a fine-tuned fusion factor, a significant speedup is achieved, and the scheme is applicable to other LSTM accelerators as well.
- The open-source framework contributes to the RISC-V heterogeneous-system community.

17 / 17