Bit Fusion: Bit-Level Dynamically Composable Architecture for Deep Neural Networks

Hardik Sharma, Jongse Park, Naveen Suda†, Liangzhen Lai†, Benson Chau, Vikas Chandra†, Hadi Esmaeilzadeh‡
Georgia Institute of Technology; † Arm, Inc.; ‡ University of California, San Diego
Alternative Computing Technologies (ACT) Lab

2 DNNs Tolerate Low-Bitwidth Operations
[Figure: stacked bar chart, 0% to 100%, of multiply-adds by operand bitwidth (1bit/1bit, 2bit/2bit, 4bit/4bit, 8bit/1bit, 8bit/8bit) for AlexNet, CIFAR10, LeNet-5, VGG-7, LSTM, ResNet-18, SVHN, RNN, and the average.]
More than 99.4% of multiply-adds require fewer than 8 bits.

3 Bitwidth Flexibility is Necessary for Accuracy
AlexNet, ImageNet dataset (Mishra et al., WRPN, arXiv 2017):
Conv.   Conv.   Conv.   Conv.   Conv.   FC      FC      FC
8b/8b   4b/4b   4b/4b   4b/4b   4b/4b   4b/4b   4b/4b   8b/8b
LeNet, MNIST dataset (Li et al., TWN, arXiv 2016):
Conv.   Conv.   FC      FC
2b/2b   2b/2b   2b/2b   2b/2b
A fixed-bitwidth accelerator would either achieve limited benefits (8-bit) or compromise on accuracy (<8-bit).

4 Our Approach: Bit-level Composability
[Figure: a Fusion Unit built from a grid of BitBricks (BB) and adders, fed by WBUF and an Input Forward path and producing a Psum Forward output; a single BitBrick is shown with a sign mode input and operands sx, x1, x0 and sy, y1, y0.]
BitBricks (BBs) are bit-level composable compute units.

5 Compute units (BitBricks) logically fuse at runtime to form Fused-PEs (F-PEs) that dynamically match the bitwidth of the DNN layers.
[Figure, four panels, each with WBUF, Input forward, and Psum forward: (a) Fusion Unit with 16 BitBricks; (b) 16x parallelism, binary (1-bit) or ternary (2-bit); (c) 4x parallelism, mixed-bitwidth (2-bit weights, 8-bit inputs); (d) no parallelism, 8-bits.]

6 Config #1: Binary/Ternary Mode
[Figure: Fusion Unit with 2-bit inputs and 2-bit weights; each of the 16 BitBricks acts as its own F-PE.]
Each BitBrick performs a binary/ternary multiplication. 16x parallelism.

7 Config #2: 4-bit Mode
[Figure: Fusion Unit with a 4-bit input and a 4-bit weight, each split into two 2-bit halves; the BitBricks produce partial products that are summed within each F-PE.]
Four BitBricks fuse to form a Fused-PE (F-PE). 4x parallelism.

8 Config #3: 8-bit, 4-bit (Mixed-Mode)
[Figure: Fusion Unit with an 8-bit input split into four 2-bit slices and a 4-bit weight split into two 2-bit slices; the BitBricks produce partial products that are summed within each F-PE.]
Eight BitBricks fuse to form a Fused-PE (F-PE). 2x parallelism.
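The three configurations above share one idea: a wide multiply is split into 2-bit by 2-bit partial products that are shifted and summed. The sketch below illustrates that arithmetic in Python. It is not the authors' hardware description; the function names (`to_digits`, `bitbrick`, `fused_multiply`) are hypothetical, and it assumes two's-complement operands in which only the top 2-bit digit carries the sign.

```python
def to_digits(value, bits):
    """Split a signed `bits`-wide value into 2-bit digits, lowest digit first.

    Assumption: two's complement, so only the top digit is signed (-2..1);
    all lower digits are unsigned (0..3).
    """
    assert bits % 2 == 0 and bits >= 2
    assert -(1 << (bits - 1)) <= value < (1 << (bits - 1))
    raw = value & ((1 << bits) - 1)
    digits = [(raw >> shift) & 0b11 for shift in range(0, bits, 2)]
    if digits[-1] >= 2:
        digits[-1] -= 4  # reinterpret the top digit as signed
    return digits


def bitbrick(a, b):
    """Model of one BitBrick: the product of two 2-bit digits."""
    return a * b


def fused_multiply(x, y, x_bits, y_bits):
    """Multiply x by y using only 2-bit products, shifts, and adds.

    Returns (product, number_of_bitbricks_used).
    """
    total = 0
    bricks = 0
    for i, a in enumerate(to_digits(x, x_bits)):
        for j, b in enumerate(to_digits(y, y_bits)):
            total += bitbrick(a, b) << (2 * (i + j))  # align the partial product
            bricks += 1
    return total, bricks


if __name__ == "__main__":
    for x_bits, y_bits in [(2, 2), (4, 4), (8, 4), (8, 8)]:
        _, bricks = fused_multiply(1, 1, x_bits, y_bits)
        print(f"{x_bits}-bit x {y_bits}-bit: {bricks} BitBricks, "
              f"{16 // bricks}x parallelism in a 16-BitBrick Fusion Unit")
```

Under these assumptions the brick counts reproduce the slide figures: one BitBrick for 2-bit operands (16x), four for 4-bit by 4-bit (4x), and eight for 8-bit by 4-bit (2x).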

9 Spatial Fusion vs. Temporal Design
[Figure: two datapaths processing operands a through h, each with bits 0 to 3, as inputs over time; both use shifters (<<) and produce an output.]
Temporal Design (Bit Serial): combine results over time.
Spatial Fusion (Bit Parallel): combine results over space.
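The contrast between the two designs can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the Stripes or Bit Fusion datapath: it assumes unsigned operands and one-bit slices, and the function names and the "cycles" count are hypothetical stand-ins for time steps.

```python
def bit_serial_multiply(x, y, y_bits):
    """Temporal design: consume one bit of y per cycle and accumulate."""
    acc = 0
    cycles = 0
    for bit in range(y_bits):
        if (y >> bit) & 1:
            acc += x << bit  # shift-add into the accumulator register
        cycles += 1
    return acc, cycles


def spatial_multiply(x, y, y_bits):
    """Spatial design: form every partial product at once, then sum them."""
    partials = [(x << bit) if (y >> bit) & 1 else 0 for bit in range(y_bits)]
    return sum(partials), 1  # a single step, at the cost of more adders


if __name__ == "__main__":
    print(bit_serial_multiply(13, 11, 4))
    print(spatial_multiply(13, 11, 4))
```

Both functions return the same product; they differ only in whether the partial products are combined across steps or across parallel units.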

10 Spatial Fusion Surpasses Temporal Design
Area (um^2)    BitBricks   Shift-Add   Register   Total Area
Temporal       463         2989        1454       4905
Fusion Unit    369         934         91         1394
The Fusion Unit has 3.5x lower area.
Power (nW)     BitBricks   Shift-Add   Register   Total Power
Temporal       60          550         1103       1712
Fusion Unit    46          424         69         538
The Fusion Unit has 3.2x lower power.
Synthesized using a commercial 45 nm technology.
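The 3.5x and 3.2x figures follow directly from the totals on this slide. A short check, using only the numbers quoted above (the variable names are illustrative):

```python
# Totals as quoted on the slide.
area_um2 = {"Temporal": 4905, "Fusion Unit": 1394}
power_nw = {"Temporal": 1712, "Fusion Unit": 538}

area_ratio = area_um2["Temporal"] / area_um2["Fusion Unit"]
power_ratio = power_nw["Temporal"] / power_nw["Fusion Unit"]

print(f"area:  {area_ratio:.1f}x lower")
print(f"power: {power_ratio:.1f}x lower")
```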

11 Bit Fusion Systolic Array Architecture
[Figure: a systolic array of Fusion Units, each built from BitBricks and adders and each with its own WBUF; rows share an IBUF, a Control block drives the array, and each column feeds a Pooling Unit, an Activation Unit, and an OBUF.]

12 Programmability: BitFusion ISA
Requirements: amortize the cost of bit-level fusion; concise; enable a flexible data-path.

13 ISA: Amortize the Cost of Bit-Level Fusion
[Figure: a sequence of instruction blocks. Each block opens with "Block begin" and a bitwidth pair (8-bit/8-bit, 4-bit/1-bit), contains a Convolution at a stated bitwidth, and closes with "Block end: next block". Labels shown: Conv 1, 8-bit/8-bit, 4-bit/8-bit, 4-bit/1-bit.]
Use a block-structured ISA for groups of operations (layers).

14 ISA: Concise Expression for DNNs
loop: for i in (1 to B)
loop: for j in (1 to OC)
loop: for k in (1 to IC)
[Figure: fully-connected layer with a B × IC input, an IC × OC weight matrix, and a B × OC output.]
Use loop instructions, as DNNs consist of a large number of repeated operations.

15 ISA: Concise Expression for DNNs
loop: for i in (1 to B)
loop: for j in (1 to OC)
loop: for k in (1 to IC)
input address: k × 1 + j × 0 + i × IC
weight address: k × 1 + j × IC + i × 0
output address: k × 0 + j × 1 + i × OC
[Figure: fully-connected layer with a B × IC input, an IC × OC weight matrix, and a B × OC output.]
DNNs have a regular memory access pattern. Use loop indices to generate memory accesses.
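The loop nest and stride-based address generation on this slide can be written out as ordinary code. The sketch below is a software model only, assuming flat row-major buffers and zero-based loop indices (the slide counts from 1); `fc_layer` and `addr` are hypothetical names, not part of the BitFusion ISA.

```python
def addr(strides, k, j, i):
    """Address = k * stride_k + j * stride_j + i * stride_i."""
    stride_k, stride_j, stride_i = strides
    return k * stride_k + j * stride_j + i * stride_i


def fc_layer(inputs, weights, B, IC, OC):
    """Fully-connected layer over flat buffers, addressed by loop indices.

    inputs:  B x IC, row-major    weights: OC x IC, row-major
    returns: B x OC, row-major
    """
    in_strides = (1, 0, IC)    # input:  k*1 + j*0  + i*IC
    w_strides = (1, IC, 0)     # weight: k*1 + j*IC + i*0
    out_strides = (0, 1, OC)   # output: k*0 + j*1  + i*OC

    outputs = [0] * (B * OC)
    for i in range(B):              # batch
        for j in range(OC):         # output channels
            for k in range(IC):     # input channels
                outputs[addr(out_strides, k, j, i)] += (
                    inputs[addr(in_strides, k, j, i)]
                    * weights[addr(w_strides, k, j, i)]
                )
    return outputs


if __name__ == "__main__":
    x = [1, 2, 3, 4, 5, 6]   # B=2, IC=3
    w = [1, 0, 1, 0, 1, 0]   # OC=2, IC=3
    print(fc_layer(x, w, B=2, IC=3, OC=2))
```

A stride of 0 means the index does not affect that operand: the input does not depend on the output channel j, the weight does not depend on the batch index i, and the output accumulates over k.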

16 ISA: Flexible Storage
2-bit mode: 16x parallelism. Need: 32-bit inputs, 32-bit weights.
8-bit mode: 1x parallelism. Need: 8-bit input, 8-bit weight.
The ISA changes the semantics of off-chip and on-chip memory accesses according to the bitwidth of the operands.
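The access widths above are the operand bitwidth multiplied by the parallelism: 16 lanes of 2 bits make a 32-bit access, and 1 lane of 8 bits makes an 8-bit access. A minimal sketch of that packing follows; the helper names are hypothetical, and placing lane 0 in the least significant bits is an assumption, not something the slide specifies.

```python
def access_width(operand_bits, parallelism):
    """Bits fetched per access: one operand per parallel lane."""
    return operand_bits * parallelism


def pack(values, operand_bits):
    """Pack unsigned operands into one word, lane 0 in the low bits."""
    word = 0
    for lane, value in enumerate(values):
        assert 0 <= value < (1 << operand_bits)
        word |= value << (lane * operand_bits)
    return word


def unpack(word, operand_bits, parallelism):
    """Recover the per-lane operands from a packed word."""
    mask = (1 << operand_bits) - 1
    return [(word >> (lane * operand_bits)) & mask
            for lane in range(parallelism)]


if __name__ == "__main__":
    lanes = [i % 4 for i in range(16)]   # sixteen 2-bit operands
    word = pack(lanes, operand_bits=2)
    print(access_width(2, 16), hex(word))
    print(unpack(word, operand_bits=2, parallelism=16) == lanes)
```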

17 ISA: Flexible Storage (Software View)
[Figure: the same WBUF shown three times, feeding an 8-bit, a 16-bit, and a 32-bit register.]
Software views the buffers as having a flexible aspect ratio.

18 Benchmarked Platforms
GPU, low power: Nvidia Tegra TX2
GPU, high performance: Nvidia Titan-X
ASIC, bit-serial: Stripes (Micro’16)
ASIC, optimized dataflow: Eyeriss (ISCA’16)

19 Benchmarked DNN Models
DNN         Type   Multiply-Adds   Bit-Flexible Model Weights   Original Model Weights
AlexNet     CNN    2,678 MOps      116.3 MBytes                 898.6 MBytes
CIFAR10     CNN    617 MOps        3.3 MBytes                   53.5 MBytes
LSTM        RNN    13 MOps         6.2 MBytes                   49.4 MBytes
LeNet-5     CNN    16 MOps         0.5 MBytes                   8.2 MBytes
RESNET-18   CNN    4,269 MOps      13 MBytes                    103.7 MBytes
RNN         RNN    17 MOps         8.0 MBytes                   64.0 MBytes
SVHN        CNN    158 MOps        0.8 MBytes                   24.4 MBytes
VGG-7       CNN    317 MOps        2.7 MBytes                   43.3 MBytes

20 Comparison with Eyeriss
[Figure: bar chart of improvement over Eyeriss (performance and energy reduction) for AlexNet, Cifar-10, LSTM, LeNet-5, ResNet-18, RNN, SVHN, VGG-7, and the geomean.]
3.9× speedup and 5.1× energy reduction over Eyeriss.

21 Comparison with Stripes
[Figure: bar chart of improvement over Stripes (performance and energy reduction) for AlexNet, Cifar-10, LSTM, LeNet-5, ResNet-18, RNN, SVHN, VGG-7, and the geomean.]
2.6× speedup and 3.9× energy reduction over Stripes.

22 Comparison with GPUs
[Figure: bar chart of speedup over TX2 for TitanX-INT8 and Bit Fusion on AlexNet, Cifar-10, LSTM, LeNet-5, ResNet-18, RNN, SVHN, VGG-7, and the geomean.]
Bit Fusion provides almost the same performance as Titan Xp (250 W) with only 895 mW.

23 Conclusion
Emerging research shows that we can reduce bitwidths for DNNs without losing accuracy.
Bit Fusion defines a new dimension of bit-level dynamic composability to leverage this opportunity.
The BitFusion ISA exposes this capability to the software stack.
