Bit Fusion: Bit-Level Dynamically Composable Architecture for Deep Neural Networks




SLIDE 1

Bit Fusion

Bit-Level Dynamically Composable Architecture for Deep Neural Networks

Hardik Sharma, Jongse Park, Naveen Suda†, Liangzhen Lai†, Benson Chau, Vikas Chandra†, Hadi Esmaeilzadeh‡
Alternative Computing Technologies (ACT) Lab

Georgia Institute of Technology
†Arm, Inc.
‡University of California, San Diego

SLIDE 2

DNNs Tolerate Low-Bitwidth Operations

[Figure: stacked bar chart (0% to 100%) of the share of multiply-adds at each operand bitwidth pair (1bit/1bit, 2bit/2bit, 4bit/4bit, 8bit/1bit, 8bit/8bit) for AlexNet, CIFAR10, LSTM, LeNet-5, RESNET-18, RNN, SVHN, VGG-7, and the average (Avg)]

>99.4% of multiply-adds require fewer than 8 bits

SLIDE 3

Bitwidth Flexibility is Necessary for Accuracy

A fixed-bitwidth accelerator would either achieve limited benefits (8-bit) or compromise on accuracy (<8-bit)

AlexNet, IMAGENET dataset (Mishra et al., WRPN, arXiv 2017):
Conv. 8b/8b, Conv. 4b/4b, Conv. 4b/4b, Conv. 4b/4b, Conv. 4b/4b, FC 4b/4b, FC 4b/4b, FC 8b/8b

LeNet, MNIST dataset (Li et al., TWN, arXiv 2016):
Conv. 2b/2b, Conv. 2b/2b, FC 2b/2b, FC 2b/2b

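SLIDE 4

Our Approach: Bit-level Composability

BitBricks (BBs) are bit-level composable compute units

[Figure: a BitBrick (BB) with inputs x1, x0, sx and y1, y0, sy (sx and sy select the sign mode) and wire widths labeled 3, 3, and 6; a Fusion Unit built from 16 BitBricks in four groups of four, each group feeding an adder (+), with a final adder combining the groups; WBUF, input forward, and psum forward connections]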

SLIDE 5

Compute units (BitBricks) logically fuse at runtime to form Fused-PEs (F-PEs) that dynamically match the bitwidth of the DNN layers

[Figure, four panels, each with WBUF, input forward, and psum forward connections:
(a) Fusion Unit with 16 BitBricks
(b) 16x Parallelism, Binary (1-bit) or Ternary (2-bit): 16 F-PEs
(c) 4x Parallelism, Mixed-Bitwidth (2-bit weights, 8-bit inputs): 4 F-PEs
(d) No Parallelism, 8-bits: 1 F-PE]

SLIDE 6

Config #1: Binary/Ternary Mode

Each BitBrick performs a binary/ternary multiplication
16x parallelism

[Figure: Fusion Unit of 16 BitBricks, each acting as its own F-PE; Input 2-bit, Weight 2-bit]

SLIDE 7

Config #2: 4-bit Mode

Four BitBricks fuse to form a Fused-PE (F-PE)
4x parallelism

[Figure: Fusion Unit of 16 BitBricks grouped into F-PEs; Input (4-bit) and Weight (4-bit) are each shown as two 2-bit slices feeding the partial products]
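The fusion in this configuration can be read as long multiplication in base 4. The following is a minimal Python sketch under stated assumptions: operands are unsigned (the sign handling inside a BitBrick is not modeled), and the helper names `split_2bit` and `fused_multiply` are hypothetical. Each 2-bit by 2-bit product stands in for one BitBrick, and the shifts and the running sum stand in for the adders of the Fusion Unit.

```python
def split_2bit(value, bits):
    """Split an unsigned `bits`-wide integer into 2-bit slices, least significant first."""
    assert bits % 2 == 0 and 0 <= value < (1 << bits)
    return [(value >> shift) & 0b11 for shift in range(0, bits, 2)]


def fused_multiply(x, w, x_bits=4, w_bits=4):
    """Multiply x by w by summing shifted 2-bit x 2-bit partial products.

    Returns (product, number_of_2bit_products). The count models how many
    BitBricks one fused multiplication would occupy (an assumption of this sketch).
    """
    total = 0
    bricks = 0
    for i, x_slice in enumerate(split_2bit(x, x_bits)):
        for j, w_slice in enumerate(split_2bit(w, w_bits)):
            # Slice i of x has weight 4**i, slice j of w has weight 4**j,
            # so their product is shifted left by 2 * (i + j) bits.
            total += (x_slice * w_slice) << (2 * (i + j))
            bricks += 1
    return total, bricks


product, bricks = fused_multiply(13, 11)          # 4-bit x 4-bit
wide_product, wide_bricks = fused_multiply(200, 173, x_bits=8, w_bits=8)
```

For 4-bit operands the inner statement runs four times, which matches the four BitBricks per F-PE stated on this slide; for 8-bit operands it runs sixteen times, which would occupy a whole Fusion Unit.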

SLIDE 8

Config #3: 8-bit, 4-bit (Mixed-Mode)

Eight BitBricks fuse to form a Fused-PE (F-PE)
2x parallelism

[Figure: Fusion Unit of 16 BitBricks grouped into F-PEs; Input (8-bit) is shown as four 2-bit slices and Weight (4-bit) as two 2-bit slices feeding the partial products]
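The parallelism quoted for each configuration follows from counting 2-bit slices. A small sketch, assuming a Fusion Unit of 16 BitBricks and one BitBrick per pair of 2-bit slices; the constant and function names are illustrative and are not part of the BitFusion ISA.

```python
BITBRICKS_PER_FUSION_UNIT = 16  # from the Fusion Unit figure
BRICK_BITS = 2                  # each BitBrick handles 2-bit operands


def bricks_per_fpe(input_bits, weight_bits):
    """Number of BitBricks one Fused-PE needs for the given operand bitwidths."""
    input_slices = -(-input_bits // BRICK_BITS)    # ceiling division
    weight_slices = -(-weight_bits // BRICK_BITS)
    return input_slices * weight_slices


def parallelism(input_bits, weight_bits):
    """Number of Fused-PEs that fit in one Fusion Unit."""
    return BITBRICKS_PER_FUSION_UNIT // bricks_per_fpe(input_bits, weight_bits)


configs = {
    "binary/ternary (2-bit/2-bit)": parallelism(2, 2),
    "4-bit/4-bit": parallelism(4, 4),
    "mixed (8-bit input, 4-bit weight)": parallelism(8, 4),
    "mixed (8-bit input, 2-bit weight)": parallelism(8, 2),
    "8-bit/8-bit": parallelism(8, 8),
}
```

Under these assumptions the sketch reproduces the figures on the slides: 16x for binary/ternary, 4x for 4-bit, 2x for 8-bit inputs with 4-bit weights, 4x for 8-bit inputs with 2-bit weights, and no parallelism at 8-bit.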

SLIDE 9

Spatial Fusion vs. Temporal Design

Temporal Design (Bit Serial): Combine results over time

[Figure: four units, each with its own shifter (<<) and output (Out), receive the operand pairs a0 b0 ... a3 b3, c0 d0 ... c3 d3, e0 f0 ... e3 f3, and g0 h0 ... g3 h3 respectively; axis: inputs over time (1, 2, 3)]

Spatial Fusion (Bit Parallel): Combine results over space

[Figure: one unit with four shifters (<<) and a single output (Out) receives a0 b0 c0 d0 e0 f0 g0 h0, then a1 b1 ... h1, then a2 b2 ... h2, then a3 b3 ... h3; axis: inputs over time (1, 2, 3)]
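The contrast between the two designs can be illustrated with a toy model. This sketch is an assumption-laden simplification, not the Stripes or Bit Fusion microarchitecture: operands are unsigned, the temporal design is modeled as one 2-bit multiplier reused over many steps with an accumulation register, and the spatial design is modeled as many 2-bit multipliers whose shifted outputs are summed in one step. The function names are hypothetical.

```python
def slices_2bit(value, bits):
    """2-bit slices of an unsigned integer, least significant first."""
    return [(value >> shift) & 0b11 for shift in range(0, bits, 2)]


def temporal_multiply(x, w, bits=8):
    """One multiplier reused over time; a register accumulates shifted results.

    Returns (product, steps, multipliers_used).
    """
    accumulator = 0  # models the register that holds the running sum
    steps = 0
    for i, x_slice in enumerate(slices_2bit(x, bits)):
        for j, w_slice in enumerate(slices_2bit(w, bits)):
            accumulator += (x_slice * w_slice) << (2 * (i + j))
            steps += 1
    return accumulator, steps, 1


def spatial_multiply(x, w, bits=8):
    """All partial products formed side by side and reduced in a single step.

    Returns (product, steps, multipliers_used).
    """
    partials = [
        (x_slice * w_slice) << (2 * (i + j))
        for i, x_slice in enumerate(slices_2bit(x, bits))
        for j, w_slice in enumerate(slices_2bit(w, bits))
    ]
    return sum(partials), 1, len(partials)


temporal = temporal_multiply(200, 173)
spatial = spatial_multiply(200, 173)
```

Both functions return the same product; they differ only in how the work is spread, over steps in the temporal model and over multipliers in the spatial model.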

SLIDE 10

Spatial Fusion Surpasses Temporal Design

Area (um^2)    BitBricks   Shift-Add   Register   Total Area
Temporal       463         2989        1454       4905
Fusion Unit    369         934         91         1394

Power (nW)     BitBricks   Shift-Add   Register   Total Power
Temporal       60          550         1103       1712
Fusion Unit    46          424         69         538

Synthesized using a commercial 45 nm technology

3.5x lower area
3.2x lower power
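The two headline ratios can be checked directly from the totals in the table. A short sketch; the dictionary layout is only an illustration, and the per-component breakdown is included to show where the savings come from.

```python
# Values copied from the area and power table on this slide.
area_um2 = {
    "Temporal":    {"BitBricks": 463, "Shift-Add": 2989, "Register": 1454, "Total": 4905},
    "Fusion Unit": {"BitBricks": 369, "Shift-Add": 934,  "Register": 91,   "Total": 1394},
}
power_nw = {
    "Temporal":    {"BitBricks": 60, "Shift-Add": 550, "Register": 1103, "Total": 1712},
    "Fusion Unit": {"BitBricks": 46, "Shift-Add": 424, "Register": 69,   "Total": 538},
}


def reduction(table, column):
    """Ratio of the temporal design to the Fusion Unit for one column."""
    return table["Temporal"][column] / table["Fusion Unit"][column]


area_ratio = reduction(area_um2, "Total")
power_ratio = reduction(power_nw, "Total")
per_component_area = {c: reduction(area_um2, c) for c in ("BitBricks", "Shift-Add", "Register")}
per_component_power = {c: reduction(power_nw, c) for c in ("BitBricks", "Shift-Add", "Register")}

print(f"area: {area_ratio:.1f}x lower, power: {power_ratio:.1f}x lower")
```

In this breakdown the Register column shows the largest ratio for both area and power, and the BitBricks column the smallest.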

SLIDE 11

Bit Fusion Systolic Array Architecture

[Figure: a control unit and a grid of four Fusion Units, each built from 16 BitBricks (BB) with adders (+) and paired with its own WBUF; two shared IBUFs; two OBUFs, each followed by an adder (+), a Pooling Unit, and an Activation Unit]

SLIDE 12

Programmability: BitFusion ISA

Requirements

  • Amortize cost of bit-level fusion
  • Enable flexible data-path
  • Concise

SLIDE 13

ISA: Amortize the Cost of Bit-Level Fusion

Use a block-structured ISA for groups of operations (layers)

[Figure: three convolution layers (8-bit/8-bit, 4-bit/1-bit, 4-bit/8-bit), each mapped to one instruction block; a block opens with "Block begin" carrying the bitwidth pair (8-bit/8-bit for Conv 1, then 4-bit/1-bit) and closes with "Block end: next block"]
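The idea of setting bitwidths once per block rather than once per instruction can be sketched as a small data structure. Everything in this example is hypothetical: the `Block` class, its field names, and the instruction counts are placeholders for illustration and do not reflect the actual BitFusion instruction encoding.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    """One instruction block: bitwidths are fixed at block begin for the whole body."""
    name: str
    input_bits: int
    weight_bits: int
    body: list = field(default_factory=list)  # instructions that inherit the bitwidths


def run(blocks):
    """Walk blocks in order and count fusion reconfigurations versus instructions."""
    reconfigurations = 0
    executed = 0
    for block in blocks:
        reconfigurations += 1        # "block begin": configure fusion once
        executed += len(block.body)  # the body runs under that one configuration
        # "block end": continue with the next block
    return reconfigurations, executed


# Placeholder bodies: 1000 dummy instructions per layer, chosen only for illustration.
program = [
    Block("conv1", 8, 8, ["mac"] * 1000),
    Block("conv2", 4, 1, ["mac"] * 1000),
    Block("conv3", 4, 8, ["mac"] * 1000),
]
reconfigurations, executed = run(program)
```

With these placeholder counts the fusion configuration changes three times for three thousand instructions, which is the amortization the slide describes.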

SLIDE 14

ISA: Concise Expression for DNNs

Use loop instructions, as DNNs consist of a large number of repeated operations

[Figure: Fully-Connected Layer drawn as three matrices with dimensions labeled IC, OC, and B]

loop: for j in (1 to OC)
loop: for k in (1 to IC)
loop: for i in (1 to B)

SLIDE 15

ISA: Concise Expression for DNNs

DNNs have regular memory access patterns
Use loop indices to generate memory accesses

[Figure: Fully-Connected Layer drawn as three matrices with dimensions labeled IC, OC, and B]

loop: for j in (1 to OC)
loop: for k in (1 to IC)
loop: for i in (1 to B)

input:  k × 1 + j × 0 + i × IC
weight: k × 1 + j × IC + i × 0
output: k × 0 + j × 1 + i × OC
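The address expressions on this slide are enough to run a fully-connected layer over flat buffers. A minimal sketch, assuming zero-based loop indices (the slide counts from 1), row-major buffers, and a hypothetical function name `fc_layer`; the stride triples are the (k, j, i) coefficients from the slide.

```python
def fc_layer(inputs, weights, B, IC, OC):
    """Fully-connected layer over flat lists.

    inputs  holds B * IC values, weights holds OC * IC values,
    and the result holds B * OC values.
    """
    # Coefficients of (k, j, i) in each address expression.
    input_stride = (1, 0, IC)
    weight_stride = (1, IC, 0)
    output_stride = (0, 1, OC)

    def address(stride, k, j, i):
        return k * stride[0] + j * stride[1] + i * stride[2]

    outputs = [0] * (B * OC)
    for j in range(OC):          # output channels
        for k in range(IC):      # input channels
            for i in range(B):   # batch
                outputs[address(output_stride, k, j, i)] += (
                    inputs[address(input_stride, k, j, i)]
                    * weights[address(weight_stride, k, j, i)]
                )
    return outputs


# Two samples with three input channels, two output channels.
result = fc_layer(
    inputs=[1, 2, 3, 4, 5, 6],
    weights=[1, 0, 1, 0, 1, 0],
    B=2, IC=3, OC=2,
)
```

A zero coefficient means the operand is reused across that loop: the weight address ignores the batch index i, and the output address ignores k because k is the reduction dimension.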

SLIDE 16

ISA: Flexible Storage

The ISA changes the semantics of off-chip and on-chip memory accesses according to the bitwidth of the operands

  • 2-bit mode, 16x parallelism. Need: 32-bit inputs, 32-bit weights
  • 8-bit mode, 1x parallelism. Need: 8-bit input, 8-bit weight
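The access widths above are the operand bitwidth multiplied by the parallelism. The sketch below also shows one way a 32-bit word could carry sixteen 2-bit operands; the lane order (lane 0 in the least significant bits) and the function names are assumptions of this example, not something the slides specify.

```python
def bits_per_access(operand_bits, parallelism):
    """Bits one Fusion Unit needs per access for one operand type."""
    return operand_bits * parallelism


def pack(values, operand_bits):
    """Pack unsigned operands into one integer word, lane 0 in the low bits."""
    word = 0
    for lane, value in enumerate(values):
        assert 0 <= value < (1 << operand_bits)
        word |= value << (lane * operand_bits)
    return word


def unpack(word, operand_bits, lanes):
    """Inverse of pack: recover `lanes` operands from a word."""
    mask = (1 << operand_bits) - 1
    return [(word >> (lane * operand_bits)) & mask for lane in range(lanes)]


two_bit_lanes = [0, 1, 2, 3] * 4             # sixteen 2-bit operands
word = pack(two_bit_lanes, operand_bits=2)   # fits in 32 bits
restored = unpack(word, operand_bits=2, lanes=16)
```

The same buffer word is therefore read as one 8-bit operand in 8-bit mode or as sixteen 2-bit operands in 2-bit mode, depending on the configured bitwidth.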

SLIDE 17

ISA: Flexible Storage (Software View)

Software views the buffers as having a flexible aspect ratio

[Figure: WBUF feeding a register, drawn three times with 32-bit, 16-bit, and 8-bit access widths]

SLIDE 18

Benchmarked Platforms

GPU

  • Nvidia Titan-X (High Performance)
  • Nvidia Tegra TX2 (Low Power)

ASIC

  • Stripes (Micro’16): Bit-Serial
  • Eyeriss (ISCA’16): Optimized Dataflow

SLIDE 19

Benchmarked DNN Models

DNN         Type   Multiply-Adds   Bit-Flexible Model Weights   Original Model Weights
AlexNet     CNN    2,678 MOps      116.3 MBytes                 898.6 MBytes
CIFAR10     CNN    617 MOps        3.3 MBytes                   53.5 MBytes
LSTM        RNN    13 MOps         6.2 MBytes                   49.4 MBytes
LeNet-5     CNN    16 MOps         0.5 MBytes                   8.2 MBytes
SVHN        CNN    158 MOps        0.8 MBytes                   24.4 MBytes
VGG-7       CNN    317 MOps        2.7 MBytes                   43.3 MBytes
RESNET-18   CNN    4,269 MOps      13 MBytes                    103.7 MBytes
RNN         RNN    17 MOps         8.0 MBytes                   64.0 MBytes
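The weight-size columns imply a per-model compression ratio. A short sketch that computes it, assuming the model-to-value pairing AlexNet, CIFAR10, LSTM, LeNet-5 and SVHN, VGG-7, RESNET-18, RNN read off the two groups on this slide; the variable names are illustrative.

```python
# (bit-flexible MBytes, original MBytes) per model, from the table on this slide.
weights_mbytes = {
    "AlexNet":   (116.3, 898.6),
    "CIFAR10":   (3.3, 53.5),
    "LSTM":      (6.2, 49.4),
    "LeNet-5":   (0.5, 8.2),
    "SVHN":      (0.8, 24.4),
    "VGG-7":     (2.7, 43.3),
    "RESNET-18": (13.0, 103.7),
    "RNN":       (8.0, 64.0),
}

compression = {
    model: original / flexible
    for model, (flexible, original) in weights_mbytes.items()
}

for model, ratio in sorted(compression.items(), key=lambda item: item[1]):
    print(f"{model:10s} {ratio:5.1f}x smaller")
```

The ratios cluster near 8x, 16x, and 30x. That pattern would be consistent with 4-bit, 2-bit, and 1-bit weights replacing 32-bit originals, although the slide does not state the original precision.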

SLIDE 20

Comparison with Eyeriss

3.9× speedup and 5.1× energy reduction over Eyeriss

[Figure: bar chart of improvement over Eyeriss (axis 0x to 12x) in Performance and Energy Reduction for AlexNet, Cifar-10, LSTM, LeNet-5, ResNet-18, RNN, SVHN, VGG-7, and geomean. Bar labels as extracted, with the bar-to-benchmark order not preserved: 5.1 9.9 10.0 5.1 1.9 4.3 4.8 14.0 1.5 3.9 7.7 8.6 2.7 1.9 2.7 2.4 13.0 1.9]

SLIDE 21

Comparison with Stripes

2.6× speedup and 3.9× energy reduction over Stripes

[Figure: bar chart of improvement over Stripes (axis 0× to 8×) in Performance and Energy Reduction for AlexNet, Cifar-10, LSTM, LeNet-5, ResNet-18, RNN, SVHN, VGG-7, and geomean. Bar labels as extracted, with the bar-to-benchmark order not preserved: 3.9x 4.4x 2.7x 3.0x 4.4x 7.8x 3.1x 6.0x 2.7x 2.6x 2.9x 1.8x 2.0x 2.6x 5.2x 2.1x 4.0x 1.8x]
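The chart on this slide has a geomean entry, and the headline figures can be checked as geometric means of the per-benchmark bars. This sketch rests on an assumption about extraction order that the slide itself does not confirm: the 18 bar labels are read as two series of nine, each led by its geomean value, with the 3.9 series taken as energy reduction and the 2.6 series as performance. Which bar belongs to which benchmark does not matter for the mean.

```python
import math


def geomean(values):
    """Geometric mean of positive numbers."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Bar labels from this slide, in extraction order.
labels = [3.9, 4.4, 2.7, 3.0, 4.4, 7.8, 3.1, 6.0, 2.7,
          2.6, 2.9, 1.8, 2.0, 2.6, 5.2, 2.1, 4.0, 1.8]

# Assumed split: first label of each series is the reported geomean.
energy_reported, energy_bars = labels[0], labels[1:9]
perf_reported, perf_bars = labels[9], labels[10:18]

energy_geomean = geomean(energy_bars)
perf_geomean = geomean(perf_bars)

print(f"energy reduction: computed {energy_geomean:.2f}, reported {energy_reported}")
print(f"performance:      computed {perf_geomean:.2f}, reported {perf_reported}")
```

Under that reading, both computed means land within rounding distance of the reported 3.9× and 2.6×, given that the bar labels themselves are rounded to one decimal place.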

SLIDE 22

Comparison with GPUs

Bit Fusion provides almost the same performance as Titan Xp (250 W) with only 895 mW

[Figure: bar chart of speedup over TX2 (axis 0× to 30×) for TitanX-INT8 and Bit Fusion across AlexNet, Cifar-10, LSTM, LeNet-5, ResNet-18, RNN, SVHN, VGG-7, and geomean. Bar labels as extracted, with the bar-to-benchmark order not preserved: 16x 48x 14x 39x 5x 11x 38x 34x 3x 19x 30x 21x 7x 31x 27x 7x 29x 23x]

SLIDE 23

Conclusion

  • Emerging research shows we can reduce bitwidths for DNNs without losing accuracy
  • Bit Fusion defines a new dimension of bit-level dynamic composability to leverage this opportunity
  • The BitFusion ISA exposes this capability to the software stack

SLIDE 24