

SLIDE 1

High-Performance Hardware for Machine Learning

William Dally, NVIDIA Corporation and Stanford University
U.C. Berkeley, October 19, 2016

SLIDE 2

Machine learning is transforming computing

Speech, natural language understanding, question answering, game playing (Go), vision, autonomous vehicles, control, ad placement.

SLIDE 3

Whole research fields rendered irrelevant

SLIDE 4

Hardware and Data enable DNNs

SLIDE 5

The Need for Speed

Important property of neural networks: results get better with more data + bigger models + more computation. (Better algorithms, new insights, and improved techniques always help, too!)

IMAGE RECOGNITION
  • 2012: AlexNet, 8 layers, 1.4 GFLOP, ~16% error
  • 2015: ResNet, 152 layers, 22.6 GFLOP, ~3.5% error
  • 16x growth in model compute

SPEECH RECOGNITION
  • 2014: Deep Speech 1, 80 GFLOP, 7,000 hrs of data, ~8% error
  • 2015: Deep Speech 2, 465 GFLOP, 12,000 hrs of data, ~5% error
  • 10x growth in training ops

SLIDE 6

DNN primer

SLIDE 7

WHAT NETWORK? DNNS, CNNS, AND RNNS

SLIDE 8

DNN: KEY OPERATION IS DENSE M x V

Output activations are the product of the weight matrix and the input activations: $b = W a$, i.e. $b_i = \sum_j W_{ij}\, a_j$.
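A minimal NumPy sketch of this dense M x V (an illustration, not from the slides); the ReLU nonlinearity and the tensor shapes are assumptions:

```python
import numpy as np

def fc_layer(W, a, g=lambda x: np.maximum(x, 0.0)):
    """Dense fully-connected layer: b_i = g(sum_j W_ij * a_j).

    W : (out_features, in_features) weight matrix
    a : (in_features,) input activations
    g : nonlinearity (ReLU assumed here)
    """
    return g(W @ a)

# Example: 4 output activations from 3 input activations
W = np.random.randn(4, 3).astype(np.float32)
a = np.random.randn(3).astype(np.float32)
b = fc_layer(W, a)
print(b.shape)  # (4,)
```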

SLIDE 9

CNNS – For image inputs, convolutional stages act as trained feature detectors

SLIDE 10

CNNs require convolution in addition to M x V: input maps $A_{xyk}$ are convolved with multiple 3D kernels $K_{uvkj}$ to produce output maps $B_{xyj}$.

SLIDE 11

4 Distinct Sub-problems

Training vs. inference, convolutional vs. fully-connected: train conv, inference conv, train FC, inference FC.

  • Convolutional layers: B x S weight reuse, activation dominated
  • Fully-connected layers: B weight reuse, weight dominated
  • Training: 32b FP, large batches, large memory footprint, minimize training time
  • Inference: 8b int, small (unit) batches, meet real-time constraint

SLIDE 12

DNNs are Trivially Parallelized

SLIDE 13

Lots of parallelism in a DNN

  • Inputs
  • Points of a feature map
  • Filters
  • Elements within a filter
  • Multiplies within layer are independent
  • Sums are reductions
  • Only layers are dependent
  • No data dependent operations

=> can be statically scheduled

SLIDE 14

Data Parallel – Run multiple inputs in parallel

  • Doesn’t affect latency for one input
  • Requires P-fold larger batch size
  • For training requires coordinated weight update
SLIDE 15

Parameter Update

Large Scale Distributed Deep Networks, Jeff Dean et al., 2013

Parameter server model: the data is split into shards, one per worker; each worker computes a weight update ∆p on its shard and sends it to the parameter server, which applies p' = p + ∆p and returns the updated parameters p' to the workers.
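A minimal sketch of this synchronous data-parallel update (my illustration, not the paper's implementation); `compute_gradient`, the learning rate, and the toy loss in the usage example are placeholders:

```python
import numpy as np

def data_parallel_step(p, data_shards, compute_gradient, lr=0.01):
    """One synchronous data-parallel step: p' = p + sum of worker updates.

    Each of the P workers holds a replica of the parameters p and computes a
    gradient on its own data shard, so the effective batch is P-fold larger;
    the parameter server applies the coordinated weight update.
    """
    delta_p = np.zeros_like(p)
    for shard in data_shards:                  # conceptually one worker each
        grad = compute_gradient(p, shard)      # placeholder worker gradient
        delta_p += -lr * grad / len(data_shards)
    return p + delta_p                         # p' = p + delta_p

# Toy usage: minimize ||p||^2, whose gradient is 2p on every shard
p = np.array([1.0, -2.0])
p = data_parallel_step(p, data_shards=[None] * 4,
                       compute_gradient=lambda p, shard: 2 * p)
```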

SLIDE 16

Model-Parallel Convolution – by output region (x,y)

Input maps $A_{xyk}$ are convolved with multiple 3D kernels $K_{uvkj}$ to produce output maps $B_{xyj}$; each PE is assigned one output region XY.

6D loop:
    forall region XY
      for each output map j
        for each input map k
          for each pixel x,y in XY
            for each kernel element u,v
              Bxyj += A(x-u)(y-v)k * Kuvkj
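A runnable NumPy transcription of this loop nest (a sketch for clarity, not an optimized kernel); the tensor shapes and the 'valid'-region indexing convention are my simplifications:

```python
import numpy as np

def conv_6d(A, K):
    """B[x, y, j] += A[x+u, y+v, k] * K[u, v, k, j] over all u, v, k.

    A : (X, Y, C)        input maps
    K : (U, V, C, J)     multiple 3D kernels
    B : (X-U+1, Y-V+1, J) output maps ('valid' region)

    In the model-parallel scheme, each PE would run this loop nest over
    its own output region XY.
    """
    X, Y, C = A.shape
    U, V, _, J = K.shape
    B = np.zeros((X - U + 1, Y - V + 1, J), dtype=A.dtype)
    for j in range(J):                      # each output map
        for k in range(C):                  # each input map
            for x in range(B.shape[0]):     # each pixel in the region
                for y in range(B.shape[1]):
                    for u in range(U):      # each kernel element
                        for v in range(V):
                            B[x, y, j] += A[x + u, y + v, k] * K[u, v, k, j]
    return B
```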

SLIDE 17

Model Parallel Fully-Connected Layer (M x V)

$b = W a$: the weight matrix $W_{ij}$ is partitioned across PEs, each holding a stripe of rows; every PE receives the input activations $a_j$ and produces the corresponding slice of the output activations $b_i$.
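A sketch of this partitioning in NumPy, assuming the split is by rows of W so each PE produces a contiguous slice of b; the PE count and the sequential loop standing in for parallel hardware are illustrative:

```python
import numpy as np

def model_parallel_fc(W, a, num_pes=4):
    """Split rows of W across PEs; each PE computes its slice of b = W @ a."""
    row_blocks = np.array_split(W, num_pes, axis=0)   # one stripe per PE
    partial = [block @ a for block in row_blocks]     # runs on separate PEs
    return np.concatenate(partial)                    # gather the slices of b

W = np.random.randn(8, 6).astype(np.float32)
a = np.random.randn(6).astype(np.float32)
assert np.allclose(model_parallel_fc(W, a), W @ a, atol=1e-5)
```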

SLIDE 18

GPUs

SLIDE 19

Pascal GP100

  • 10 TeraFLOPS FP32
  • 20 TeraFLOPS FP16
  • 16GB HBM – 750GB/s
  • 300W TDP
  • 67GFLOPS/W (FP16)
  • 16nm process
  • 160GB/s NVLink
SLIDE 20

NVIDIA DGX-1 WORLD’S FIRST DEEP LEARNING SUPERCOMPUTER

  • 170 TFLOPS
  • 8x Tesla P100 16GB
  • NVLink Hybrid Cube Mesh
  • Optimized deep learning software
  • Dual Xeon
  • 7 TB SSD deep learning cache
  • Dual 10GbE, quad IB 100Gb
  • 3RU, 3200W

SLIDE 21

Facebook’s deep learning machine

  • Purpose-Built for Deep Learning Training

  • 2x faster training for faster deployment
  • 2x larger networks for higher accuracy
  • Powered by eight Tesla M40 GPUs
  • Open Rack compliant

"Most of the major advances in machine learning and AI in the past few years have been contingent on tapping into powerful GPUs and huge data sets to build and train advanced models." (Serkan Piantino, Engineering Director of Facebook AI Research)

SLIDE 22

NVIDIA Parker

  • 1.5 Teraflop FP16
  • 4GB of LPDDR4 @ 25.6 GB/s
  • 15 W TDP (1W idle, <10W typical)
  • 100GFLOPS/W (FP16)
  • 16nm process

[SoC block diagram: ARM v8 CPU complex (2x Denver 2 + 4x A57) with coherent HMP, security engines, 2D engine, 4K60 video encoder and decoder, audio engine, display engines, image processor (ISP), 128-bit LPDDR4, boot and power-management processor, GigE Ethernet MAC, I/O, and safety engine]

SLIDE 23

XAVIER

AI SUPERCOMPUTER SOC

  • 7 billion transistors, 16nm FF
  • 8-core custom ARM64 CPU
  • 512-core Volta GPU
  • New computer vision accelerator
  • Dual 8K HDR video processors
  • Designed for ASIL C functional safety

SLIDE 24

ONE ARCHITECTURE

  • DRIVE PX 2: 2 Parker + 2 Pascal GPU | 20 TOPS DL | 120 SPECINT | 80W
  • XAVIER (AI supercomputer SoC): 20 TOPS DL | 160 SPECINT | 20W

SLIDE 25

Parallel GPUs on Deep Speech 2

[Chart: training time in seconds (2^11 to 2^19) vs. number of GPUs (2^0 to 2^7) for two Deep Speech 2 configurations, 5-3 (2560) and 9-7 (1760)]

Baidu, Deep Speech 2: End-to-End Speech Recognition in English and Mandarin, 2015

SLIDE 26

Reduced Precision

SLIDE 27

How Much Precision is Needed for Dense M x V?

The operation in question is $b_i = g\left(\sum_j w_{ij}\, a_j\right)$: the weight matrix $w_{ij}$ multiplies the input activations $a_j$ to produce output activations $b_i$ through the nonlinearity $g$.

SLIDE 28

Number Representation

| Format | Sign bits | Exponent bits | Mantissa bits | Range | Accuracy |
|--------|-----------|---------------|---------------|-------|----------|
| FP32   | 1 | 8 | 23 | 10^-38 to 10^38 | .000006% |
| FP16   | 1 | 5 | 10 | 6x10^-5 to 6x10^4 | .05% |
| Int32  | 1 | - | 31 | 0 to 2x10^9 | 1/2 |
| Int16  | 1 | - | 15 | 0 to 6x10^4 | 1/2 |
| Int8   | 1 | - | 7  | 0 to 127 | 1/2 |

SLIDE 29

Cost of Operations

| Operation | Energy (pJ) | Area (µm²) |
|-----------|-------------|------------|
| 8b Add | 0.03 | 36 |
| 16b Add | 0.05 | 67 |
| 32b Add | 0.1 | 137 |
| 16b FP Add | 0.4 | 1360 |
| 32b FP Add | 0.9 | 4184 |
| 8b Mult | 0.2 | 282 |
| 32b Mult | 3.1 | 3495 |
| 16b FP Mult | 1.1 | 1640 |
| 32b FP Mult | 3.7 | 7700 |
| 32b SRAM Read (8KB) | 5 | N/A |
| 32b DRAM Read | 640 | N/A |

Energy numbers are from Mark Horowitz “Computing’s Energy Problem (and what we can do about it)”, ISSCC 2014 Area numbers are from synthesized result using Design Compiler under TSMC 45nm tech node. FP units used DesignWare Library.

SLIDE 30

The Importance of Staying Local

  • LPDDR DRAM (GB): 640 pJ/word
  • On-chip SRAM (MB): 50 pJ/word
  • Local SRAM (KB): 5 pJ/word

SLIDE 31

Mixed Precision

MAC datapath: $b_i$ += $w_{ij} \cdot a_j$

  • Store weights as 4b using trained quantization; decode to 16b
  • Store activations as 16b
  • 16b x 16b multiply; round the result to 16b
  • Accumulate at 24b or 32b to avoid saturation

Batch normalization important to ‘center’ dynamic range
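A sketch of the recipe above, assuming a 16-entry (4-bit) codebook produced by trained quantization; the codebook values, shapes, and layer layout are illustrative:

```python
import numpy as np

def mixed_precision_fc(weight_idx, codebook, a_fp16):
    """4b weight indices -> decode to fp16 -> 16b multiply -> 32b accumulate.

    weight_idx : (out, in) uint8 array holding 4-bit codebook indices
    codebook   : (16,) float16 shared weight values (trained quantization)
    a_fp16     : (in,) float16 input activations
    """
    w_fp16 = codebook[weight_idx]                # decode 4b index to 16b weight
    prod = (w_fp16 * a_fp16).astype(np.float16)  # 16b x 16b multiply, round to 16b
    b = prod.astype(np.float32).sum(axis=1)      # accumulate at 32b to avoid saturation
    return b

codebook = np.linspace(-1, 1, 16, dtype=np.float16)
idx = np.random.randint(0, 16, size=(4, 8), dtype=np.uint8)
a = np.random.randn(8).astype(np.float16)
print(mixed_precision_fc(idx, codebook, a))
```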

SLIDE 32

Weight Update

Weight update datapath: $\Delta w_{ij} = \alpha \cdot g_i \cdot a_j$, then $w_{ij}$ += $\Delta w_{ij}$. The learning rate $\alpha$ may be very small ($10^{-5}$ or less), so in a low-precision representation $\Delta w$ is rounded to zero: no learning!

SLIDE 33

Stochastic Rounding

Same update datapath, but with stochastic rounding (SR) applied to $\Delta w_{ij}$ before it is added to $w_{ij}$. The learning rate may still be very small ($10^{-5}$ or less) and $\Delta w$ very small, but the stochastically rounded value satisfies $E(\Delta w'_{ij}) = \Delta w_{ij}$, so small updates are preserved in expectation rather than being lost to rounding.

SLIDE 34

Reduced Precision For Training

  • S. Gupta et al., "Deep Learning with Limited Numerical Precision," ICML 2015: 16-bit fixed-point training with stochastic rounding

Forward pass: $b_i = g\left(\sum_j w_{ij}\, a_j\right)$
Weight update: $w_{ij} \leftarrow w_{ij} + \alpha\, g_i\, a_j$

SLIDE 35

Pruning

SLIDE 36

Pruning

[Figure: a network before pruning and after pruning, showing pruned synapses and pruned neurons]

Han et al. Learning both Weights and Connections for Efficient Neural Networks, NIPS 2015

SLIDE 37

Retrain to Recover Accuracy

Pipeline: train connectivity, prune connections, retrain the remaining weights.

[Chart: accuracy loss vs. fraction of parameters pruned away (40% to 100%), comparing L1 and L2 regularization with and without retraining, and L2 regularization with iterative pruning and retraining; retraining, especially iterative prune-and-retrain, recovers the accuracy lost by pruning]

Han et al. Learning both Weights and Connections for Efficient Neural Networks, NIPS 2015

SLIDE 38

Pruning of VGG-16

SLIDE 39

Pruning NeuralTalk and LSTM

SLIDE 40

Speedup of Pruning on CPU/GPU

  • Intel Core i7 5930K: MKL CBLAS GEMV (dense), MKL SPBLAS CSRMV (sparse)
  • NVIDIA GeForce GTX Titan X: cuBLAS GEMV (dense), cuSPARSE CSRMV (sparse)
  • NVIDIA Tegra K1: cuBLAS GEMV (dense), cuSPARSE CSRMV (sparse)

SLIDE 41

Trained Quantization

(Weight Sharing)

Pipeline: train connectivity, prune connections, train weights (pruning: fewer weights; the original network at 100% of its size shrinks to 10% at the same accuracy), then cluster the weights, generate a code book, quantize the weights with the code book, and retrain the code book (quantization: less precision per weight; 10% shrinks to 3.7% of the original size at the same accuracy).

Han et al. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, arXiv 2015

SLIDE 42

Weight Sharing via K-Means

[Figure: a 4x4 block of 32-bit float weights is clustered into four centroids by k-means; each weight is replaced by a 2-bit cluster index into the centroid table. During fine-tuning, the gradients are grouped by cluster index and reduced (summed), and each summed gradient, scaled by the learning rate, updates its centroid]

Han et al. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, arXiv 2015
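A minimal sketch of weight sharing via k-means in the spirit of this figure: cluster 32-bit weights into a small codebook plus 2-bit indices, then fine-tune the centroids with gradients grouped by cluster. The initialization, learning rate, and placeholder gradients are assumptions:

```python
import numpy as np

def kmeans_share(weights, k=4, iters=20):
    """Cluster weights into k centroids; return (indices, centroids)."""
    flat = weights.ravel()
    centroids = np.linspace(flat.min(), flat.max(), k)   # linear init
    for _ in range(iters):
        idx = np.argmin(np.abs(flat[:, None] - centroids[None, :]), axis=1)
        for c in range(k):
            if np.any(idx == c):
                centroids[c] = flat[idx == c].mean()
    return idx.reshape(weights.shape), centroids

def finetune_centroids(centroids, idx, grad, lr=0.01):
    """Group gradients by cluster index, reduce (sum), and update centroids."""
    for c in range(len(centroids)):
        centroids[c] -= lr * grad[idx == c].sum()
    return centroids

W = np.random.randn(4, 4).astype(np.float32)   # 32-bit weights
idx, centroids = kmeans_share(W)               # 2-bit indices + 4-entry codebook
W_shared = centroids[idx]                      # reconstructed (quantized) weights
grad = np.random.randn(4, 4) * 0.1             # placeholder gradient
centroids = finetune_centroids(centroids, idx, grad)
```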

SLIDE 43

Trained Quantization

Han et al. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, arXiv 2015

SLIDE 44

Bits per Weight

SLIDE 45

Pruning + Trained Quantization

SLIDE 46

30x – 50x Compression Means

  • Complex DNNs can be put in mobile applications (<100MB total)

– 1GB network (250M weights) becomes 20-30MB

  • Memory bandwidth reduced by 30-50x

– Particularly for FC layers in real-time applications with no reuse

  • Memory working set fits in on-chip SRAM

– 5pJ/word access vs 640pJ/word

SLIDE 47

Efficient Inference Engine (EIE)

SLIDE 48

Sparse Matrix Representation

[Figure: sparse matrix-vector multiply $\tilde{b} = W \tilde{a}$ followed by ReLU. Rows of the sparse weight matrix are interleaved across PE0-PE3; the input activation vector $\tilde{a}$ is itself sparse, and ReLU zeroes the negative outputs, so only nonzero activations meeting nonzero weights produce work]

SLIDE 49

Sparse Matrix Representation

For each PE, the nonzero weights of its rows are stored in a compressed sparse column (CSC) layout: a virtual weight array (e.g. W0,0, W0,1, W4,2, W0,3, W4,3, ...), a relative index array giving the number of zeros skipped since the previous nonzero in the same column, and a column pointer array marking where each column begins.
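A sketch of this encoding and the corresponding sparse M x V (an illustration of the scheme described above, without EIE's fixed-width index fields or per-PE partitioning):

```python
import numpy as np

def encode_csc_relative(W):
    """Encode W in CSC form with relative row indices.

    Returns (vals, rel_idx, col_ptr): for each nonzero, rel_idx is the number
    of zeros skipped since the previous nonzero in the same column.
    """
    vals, rel_idx, col_ptr = [], [], [0]
    for j in range(W.shape[1]):
        last = -1
        for i in range(W.shape[0]):
            if W[i, j] != 0:
                vals.append(W[i, j])
                rel_idx.append(i - last - 1)   # zeros skipped since last nonzero
                last = i
        col_ptr.append(len(vals))
    return np.array(vals), np.array(rel_idx), np.array(col_ptr)

def sparse_mxv(vals, rel_idx, col_ptr, a, n_rows):
    """b = W @ a, skipping zero activations and zero weights entirely."""
    b = np.zeros(n_rows)
    for j, aj in enumerate(a):
        if aj == 0:                            # dynamic activation sparsity
            continue
        row = -1
        for p in range(col_ptr[j], col_ptr[j + 1]):
            row += rel_idx[p] + 1              # decode the relative index
            b[row] += vals[p] * aj
    return b

W = np.array([[0.5, 0.0, 0.0], [0.0, 0.0, 0.3], [0.2, 0.0, 0.0]])
a = np.array([1.0, 0.0, 2.0])
vals, rel, ptr = encode_csc_relative(W)
assert np.allclose(sparse_mxv(vals, rel, ptr, a, 3), W @ a)
```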

SLIDE 50

EIE Architecture

SLIDE 51

Scalability

[Chart: speedup (1x to 100x, log scale) vs. number of PEs (1 to 256) for Alex-6/7/8, VGG-6/7/8, NT-We, NT-Wd, and NT-LSTM]

SLIDE 52

Load Balance

[Chart: load balance (0% to 100%) on Alex-6/7/8, VGG-6/7/8, NT-We, NT-Wd, and NT-LSTM for FIFO depths from 1 to 256]

SLIDE 53

Implementation

SLIDE 54

Energy Distribution

SLIDE 55

FC Layer: Speedup on EIE

[Chart: speedup relative to a dense CPU baseline (log scale, 0.1x to 1000x) on Alex-6/7/8, VGG-6/7/8, NT-We, NT-Wd, NT-LSTM, and their geometric mean, for CPU, GPU, and mGPU running dense and compressed models, and for EIE. EIE's geometric-mean speedup over the dense CPU baseline is 189x, with a maximum of 1018x on VGG-6]

SLIDE 56

FC Layer: Energy Efficiency on EIE

[Chart: energy efficiency relative to a dense CPU baseline (log scale, 1x to 100,000x) on the same benchmarks and platforms. EIE's geometric-mean energy-efficiency gain over the dense CPU baseline is 24,207x, with a maximum of 119,797x on VGG-6]

SLIDE 57

Comparison: Throughput

SLIDE 58

Comparison: Area Efficiency

SLIDE 59

Comparison: Energy Efficiency

SLIDE 60

Sparse Convolutional Accelerator

SLIDE 61

Blocking CNN Inference

[Figure: input activations (x, y, c), weights (c, k), and output activations (X', Y', k); the highlighted (purple) block of each tensor is allocated to one PE]

SLIDE 62

Sparse Convolution

  • Only compute where both operands are nonzero
  • 10-30x Reduction in work

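A sketch of this idea: gather the nonzero activations and nonzero weights, multiply all pairs, and scatter-add the products into the output. The single-channel setting and 'valid' output region are simplifications of mine:

```python
import numpy as np

def sparse_conv2d(A, K):
    """2D 'valid' convolution computing only products of nonzero operands.

    A : (X, Y) sparse input activations (single channel for clarity)
    K : (U, V) sparse kernel
    """
    U, V = K.shape
    B = np.zeros((A.shape[0] - U + 1, A.shape[1] - V + 1))
    nz_a = np.argwhere(A != 0)                 # gather nonzero activations
    nz_k = np.argwhere(K != 0)                 # gather nonzero weights
    for (x, y) in nz_a:
        for (u, v) in nz_k:                    # all-pairs multiply
            ox, oy = x - u, y - v              # output address computation
            if 0 <= ox < B.shape[0] and 0 <= oy < B.shape[1]:
                B[ox, oy] += A[x, y] * K[u, v] # scatter-add into the output
    return B

A = np.zeros((6, 6)); A[1, 2] = 1.0; A[4, 4] = -2.0   # mostly zero activations
K = np.zeros((3, 3)); K[0, 0] = 0.5; K[2, 1] = 0.25    # pruned kernel
print(sparse_conv2d(A, K))
```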

SLIDE 63

Sparse Convolution Engine

[Block diagram: a sparse weight buffer and a sparse input buffer feed an M x W multiplier array; the weight and input indices drive output address computation, and the products are scatter-added into a banked output buffer]

SLIDE 64

Conclusion

SLIDE 65

Hardware and Data enable DNNs

SLIDE 66

Summary

  • Hardware has enabled the current resurgence of DNNs

– And limits the size of today’s networks

  • Inference
    – Dynamically sparse activations x statically sparse weights
    – 8b weights are sufficient (and can be compressed to 2-4b)
    – Energy is dominated by data movement and buffering
    – Fixed-function hardware will dominate inference
  • Training
    – Only dynamic sparsity (3x on activations, 2x from dropout)
    – Medium precision (FP16 for weights)
    – Large memory footprint (batch x retained activations), which can be 10s to 100s of GB
    – Parallelism to 10 PF today, 100 PF in the near future (limited by communication bandwidth)
    – GPUs will dominate training

SLIDE 67

4 Distinct Sub-problems

  • Training, convolutional: 32b FP, batch activation storage, communication for parallelism, GPUs ideal
  • Training, fully-connected: 32b FP, batch weight storage, communication for parallelism, GPUs ideal
  • Inference, convolutional: low precision, compressed, latency-sensitive, arithmetic dominated, fixed-function HW
  • Inference, fully-connected: low precision, compressed, latency-sensitive, no weight reuse, storage dominated, fixed-function HW

Column and row summaries: convolutional layers see B x S weight reuse and are activation dominated; fully-connected layers see B weight reuse and are weight dominated. Training uses 32b FP with large batches to minimize training time and enable larger networks; inference uses 8b int with small (unit) batches to meet real-time constraints.

SLIDE 68

Thank You