High-Performance Hardware for Machine Learning
U.C. Berkeley, October 19, 2016
William Dally, NVIDIA Corporation and Stanford University
Machine learning is transforming computing

- Speech
- Natural Language Understanding
- Question Answering
- Game Playing (Go)
- Vision
- Autonomous Vehicles
- Control
- Ad Placement
Whole research fields rendered irrelevant
Hardware and Data enable DNNs
The Need for Speed
[Figures: progress in image recognition and speech recognition]
Important Property of Neural Networks

Results get better with more data + bigger models + more computation.
(Better algorithms, new insights, and improved techniques always help, too!)

Image recognition, 16x growth in model compute:
  2012  AlexNet      8 layers    1.4 GFLOP   ~16% error
  2015  ResNet     152 layers   22.6 GFLOP   ~3.5% error

Speech recognition, 10x growth in training ops:
  2014  Deep Speech 1    80 GFLOP    7,000 hrs of data    ~8% error
  2015  Deep Speech 2   465 GFLOP   12,000 hrs of data    ~5% error
DNN primer
What Network? DNNs, CNNs, and RNNs
DNN: Key Operation Is Dense M × V

[Figure: the weight matrix $W_{ij}$ multiplies the input activations $a_j$ to produce the output activations $b_i$]

$b_i = \sum_j W_{ij}\, a_j$
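To make the key operation concrete, here is a minimal numpy sketch (not from the original deck; the layer sizes and nonlinearity are arbitrary) of a fully-connected layer as a dense matrix-vector product:

```python
import numpy as np

def fc_layer(W, a, g=np.tanh):
    """Dense fully-connected layer: b_i = g(sum_j W_ij * a_j)."""
    return g(W @ a)

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 512)).astype(np.float32)  # weight matrix W_ij
a = rng.standard_normal(512).astype(np.float32)         # input activations a_j
b = fc_layer(W, a)                                      # output activations b_i
print(b.shape)                                          # (256,)
```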
CNNs – for image inputs, convolutional stages act as trained feature detectors
CNNs Require Convolution in Addition to M × V

[Figure: input maps $A_{xyc}$ are convolved with multiple 3D kernels $K_{uvkj}$ to produce output maps $B_{xyk}$]

$B_{xyj} = \sum_k \sum_{u,v} A_{(x-u)(y-v)k} \cdot K_{uvkj}$
4 Distinct Sub-problems

                Convolutional     Fully-Connected
Training        Train Conv        Train FC
Inference       Inference Conv    Inference FC

Convolutional layers see B × S weight reuse and are activation-dominated; fully-connected layers see only B weight reuse and are weight-dominated.
Training uses 32b FP with large batches, has a large memory footprint, and aims to minimize training time; inference uses 8b int with small (unit) batches and must meet a real-time constraint.
DNNs are Trivially Parallelized
Lots of parallelism in a DNN
- Inputs
- Points of a feature map
- Filters
- Elements within a filter
- Multiplies within layer are independent
- Sums are reductions
- Only layers are dependent
- No data dependent operations
=> can be statically scheduled
Data Parallel – Run multiple inputs in parallel
- Doesn’t affect latency for one input
- Requires P-fold larger batch size
- For training requires coordinated weight update
Parameter Update

[Figure: parameter-server model. A central parameter server holds the parameters p; workers, each with its own data shard, pull p, compute gradients, and push updates $\Delta p$; the server applies $p' = p + \Delta p$]

Large Scale Distributed Deep Networks, Jeff Dean et al., NIPS 2012
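A toy sketch of the parameter-server pattern may help; everything here is illustrative (the class, the worker function, and the stand-in "gradient" are mine, not Dean et al.'s system): workers each hold a data shard, compute an update against the current parameters, and the server applies p' = p + Δp.

```python
import numpy as np

class ParameterServer:
    """Toy parameter server: workers push updates, the server applies
    p' = p + dp, and workers pull the new parameters."""
    def __init__(self, p):
        self.p = p.copy()

    def apply(self, dp):
        self.p += dp                       # p' = p + dp

def worker_delta(p, shard, lr=0.1):
    # Stand-in for backprop on one data shard: gradient step of a
    # quadratic loss pulling p toward the shard mean.
    return -lr * (p - shard.mean(axis=0))

rng = np.random.default_rng(0)
server = ParameterServer(rng.standard_normal(10).astype(np.float32))
shards = [rng.standard_normal((100, 10)).astype(np.float32) for _ in range(4)]

for step in range(20):
    p = server.p.copy()                            # workers pull p
    deltas = [worker_delta(p, s) for s in shards]  # workers compute dp
    server.apply(np.mean(deltas, axis=0))          # coordinated weight update
```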
Model-Parallel Convolution – by Output Region (x, y)

[Figure: input maps $A_{xyk}$ convolved with multiple 3D kernels $K_{uvkj}$ produce output maps $B_{xyj}$, partitioned into regions, one region per processor]

6D loop:
    Forall region XY                 (regions computed in parallel)
      For each output map j
        For each input map k
          For each pixel x, y in XY
            For each kernel element u, v
              Bxyj += A(x-u)(y-v)k × Kuvkj
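The loop nest translates directly to code. A sketch, with made-up tensor shapes and boundary handling omitted (the region must start at least U-1, V-1 pixels into the image so the indices stay in bounds); each region is independent, so each can run on a different processor:

```python
import numpy as np

def conv_region(A, K, x0, x1, y0, y1):
    """Compute one output region B[x0:x1, y0:y1, :]. Regions are
    independent, so each can be assigned to a different processor."""
    U, V, C, J = K.shape
    B = np.zeros((x1 - x0, y1 - y0, J), dtype=A.dtype)
    for j in range(J):                  # each output map
        for k in range(C):              # each input map
            for x in range(x0, x1):     # each pixel in the region
                for y in range(y0, y1):
                    for u in range(U):  # each kernel element
                        for v in range(V):
                            B[x - x0, y - y0, j] += A[x - u, y - v, k] * K[u, v, k, j]
    return B

rng = np.random.default_rng(0)
A = rng.standard_normal((16, 16, 3)).astype(np.float32)   # input maps A_xyc
K = rng.standard_normal((3, 3, 3, 8)).astype(np.float32)  # kernels K_uvkj
B = conv_region(A, K, 4, 8, 4, 8)   # one region; others run in parallel
print(B.shape)                      # (4, 4, 8)
```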
Model-Parallel Fully-Connected Layer (M × V)

[Figure: the weight matrix $W_{ij}$ is partitioned into row blocks, one block per processor; the input activations $a_j$ are broadcast, and each processor computes its own slice of the output activations $b_i = \sum_j W_{ij}\, a_j$]
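A numpy sketch of the same partitioning (illustrative sizes): splitting the weight matrix by rows means each PE produces a disjoint slice of b, so no cross-PE reduction is needed, only a broadcast of a.

```python
import numpy as np

def model_parallel_mxv(W, a, n_pes=4):
    """Split the weight matrix by rows across PEs; each PE multiplies its
    block by the broadcast activations to produce its slice of b = W @ a."""
    row_blocks = np.array_split(W, n_pes, axis=0)
    partial = [Wp @ a for Wp in row_blocks]   # one product per PE
    return np.concatenate(partial)

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 4)).astype(np.float32)
a = rng.standard_normal(4).astype(np.float32)
assert np.allclose(model_parallel_mxv(W, a), W @ a, atol=1e-5)
```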
GPUs
Pascal GP100
- 10 TeraFLOPS FP32
- 20 TeraFLOPS FP16
- 16GB HBM – 750GB/s
- 300W TDP
- 67GFLOPS/W (FP16)
- 16nm process
- 160GB/s NVLink
NVIDIA DGX-1: World's First Deep Learning Supercomputer

- 170 TFLOPS
- 8x Tesla P100 16GB, NVLink hybrid cube mesh
- Optimized deep learning software
- Dual Xeon
- 7 TB SSD deep learning cache
- Dual 10GbE, quad IB 100Gb
- 3RU – 3200W
Facebook's Deep Learning Machine

- Purpose-built for deep learning training
- 2x faster training for faster deployment
- 2x larger networks for higher accuracy
- Powered by eight Tesla M40 GPUs
- Open Rack compliant

"Most of the major advances in machine learning and AI in the past few years have been contingent on tapping into powerful GPUs and huge data sets to build and train advanced models." – Serkan Piantino, Engineering Director, Facebook AI Research
NVIDIA Parker
- 1.5 Teraflop FP16
- 4GB of LPDDR4 @ 25.6 GB/s
- 15 W TDP (1W idle, <10W typical)
- 100GFLOPS/W (FP16)
- 16nm process
[SoC block diagram: ARM v8 CPU complex (2x Denver 2 + 4x A57) with coherent HMP, security engines, 2D engine, 4K60 video encoder and decoder, audio engine, display engines, image processor (ISP), 128-bit LPDDR4 interface, boot and power-management processor, GigE Ethernet MAC, I/O, and a safety engine]
Xavier: AI Supercomputer SoC

- 7 billion transistors, 16nm FF
- 8-core custom ARM64 CPU
- 512-core Volta GPU
- New computer-vision accelerator
- Dual 8K HDR video processors
- Designed for ASIL C functional safety
One Architecture

- DRIVE PX 2 (2x Parker + 2x Pascal GPU): 20 TOPS DL | 120 SPECint | 80W
- Xavier (AI supercomputer SoC): 20 TOPS DL | 160 SPECint | 20W
Parallel GPUs on Deep Speech 2
[Figure: training time (log scale, roughly $2^{11}$ to $2^{19}$ seconds) vs. number of GPUs ($2^0$ to $2^7$) for the 5-3 (2560) and 9-7 (1760) model configurations]
Baidu, Deep Speech 2: End-to-End Speech Recognition in English and Mandarin, 2015
Reduced Precision
How Much Precision is Needed for Dense M x V?

[Figure: weight matrix $W_{ij}$ times input activations $a_j$ gives output activations $b_i$]

$b_i = g\Big(\sum_j W_{ij}\, a_j\Big)$
Number Representation

Format   S   E   M    Range               Accuracy
FP32     1   8   23   10^-38 – 10^38      .000006%
FP16     1   5   10   6x10^-5 – 6x10^4    .05%
Int32    1   –   31   0 – 2x10^9          ½
Int16    1   –   15   0 – 6x10^4          ½
Int8     1   –   7    0 – 127             ½

(S = sign bit, E = exponent bits, M = mantissa bits; integer accuracy is ±½ LSB.)
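The accuracy column can be sanity-checked in a few lines of numpy (my own illustration, not from the deck): fp16 rounding costs about 0.05% relative error, while integer quantization costs at most half an LSB.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.5, 1.0, 100_000).astype(np.float32)

# FP16 keeps ~11 bits of mantissa: relative error is bounded near 0.05%.
err_fp16 = np.abs(x - x.astype(np.float16).astype(np.float32)) / x
print(f"fp16 max relative error: {err_fp16.max():.3%}")

# Int8 quantization: scale onto [0, 127] and round to the nearest integer;
# the worst-case error is half of one least-significant bit.
scale = 127.0 / x.max()
q = np.round(x * scale)
err_lsb = np.abs(x * scale - q)            # error measured in LSBs
print(f"int8 max error: {err_lsb.max():.2f} LSB")
```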
Cost of Operations

Operation              Energy (pJ)   Area (µm²)
8b Add                 0.03          36
16b Add                0.05          67
32b Add                0.1           137
16b FP Add             0.4           1360
32b FP Add             0.9           4184
8b Mult                0.2           282
32b Mult               3.1           3495
16b FP Mult            1.1           1640
32b FP Mult            3.7           7700
32b SRAM Read (8KB)    5             N/A
32b DRAM Read          640           N/A

Energy numbers are from Mark Horowitz, "Computing's Energy Problem (and what we can do about it)", ISSCC 2014. Area numbers are from synthesis with Design Compiler under the TSMC 45nm technology node; FP units use the DesignWare library.
The Importance of Staying Local
- LPDDR DRAM (GBs): 640 pJ/word
- On-chip SRAM (MBs): 50 pJ/word
- Local SRAM (KBs): 5 pJ/word
Mixed Precision

[Figure: multiply-accumulate datapath, $b_i \mathrel{+}= w_{ij} \times a_j$]

- Store weights as 4b using trained quantization; decode to 16b
- Store activations as 16b
- 16b × 16b multiply; round the result to 16b
- Accumulate in 24b or 32b to avoid saturation
- Batch normalization is important to 'center' the dynamic range
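A sketch of that datapath in numpy (codebook values and sizes are made up): 4-bit indices decode through a 16-entry codebook to 16-bit weights, the multiply stays in fp16, and accumulation is widened to fp32 so it cannot saturate.

```python
import numpy as np

def mixed_precision_dot(w_idx, codebook, a_fp16):
    """4b weight indices -> fp16 weights via codebook; 16b x 16b multiply
    rounded to 16b; accumulate in 32b to avoid saturation."""
    w_fp16 = codebook[w_idx]                          # decode 4b -> 16b
    prod = (w_fp16 * a_fp16).astype(np.float16)       # 16b multiply, 16b result
    return np.float32(prod.astype(np.float32).sum())  # wide accumulator

rng = np.random.default_rng(0)
codebook = np.linspace(-1, 1, 16).astype(np.float16)  # 16 shared weight values
w_idx = rng.integers(0, 16, size=1024)                # 4-bit indices
a = rng.standard_normal(1024).astype(np.float16)      # 16b activations
print(mixed_precision_dot(w_idx, codebook, a))
```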
Weight Update

[Figure: weight-update datapath. The gradient $g_i$ and activation $a_j$ are multiplied, scaled by the learning rate $\alpha$, and accumulated into the weight: $w_{ij} \mathrel{+}= \Delta w_{ij}$, with $\Delta w_{ij} = \alpha\, g_i\, a_j$]

The learning rate may be very small ($10^{-5}$ or less), so $\Delta w$ rounds to zero: no learning!
Stochastic Rounding

[Figure: the same datapath with a stochastic-rounding (SR) unit applied to $\Delta w_{ij}$, producing $\Delta w'_{ij}$ before accumulation into $w_{ij}$]

The learning rate may be very small ($10^{-5}$ or less), so $\Delta w$ is very small, but stochastic rounding preserves it in expectation:

$E(\Delta w'_{ij}) = \Delta w_{ij}$
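A minimal numpy sketch of stochastic rounding to a fixed-point grid (the step size is illustrative): rounding up with probability equal to the fractional remainder makes the result unbiased, so tiny updates survive in expectation where round-to-nearest would drop them to zero.

```python
import numpy as np

def stochastic_round(x, step, rng):
    """Round x to a multiple of `step`, rounding up with probability
    frac(x/step), so that E[result] == x (unbiased)."""
    scaled = x / step
    floor = np.floor(scaled)
    frac = scaled - floor
    return step * (floor + (rng.random(x.shape) < frac))

rng = np.random.default_rng(0)
dw = np.full(100_000, 3e-6)   # tiny update, well below one 1e-4 step
print(stochastic_round(dw, 1e-4, rng).mean())    # ~3e-6: learning survives
print((np.round(dw / 1e-4) * 1e-4).mean())       # round-to-nearest: 0.0
```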
Reduced Precision for Training

S. Gupta et al., "Deep Learning with Limited Numerical Precision", ICML 2015

Forward pass: $b_i = g\Big(\sum_j w_{ij}\, a_j\Big)$; weight update: $w_{ij} \leftarrow w_{ij} + \alpha\, \Delta w_{ij}$
Pruning

[Figure: before pruning vs. after pruning; pruning synapses removes connections, pruning neurons removes whole units]
Han et al. Learning both Weights and Connections for Efficient Neural Networks, NIPS 2015
Retrain to Recover Accuracy

Pipeline: Train Connectivity → Prune Connections → Train Weights (iterate)

[Figure: accuracy loss (0% to -4.5%) vs. parameters pruned away (40% to 100%). L1/L2 regularization without retraining lose accuracy quickly; L1/L2 with retraining hold up much longer; L2 with iterative prune-and-retrain keeps accuracy loss near zero until roughly 90% of parameters are pruned]
Han et al. Learning both Weights and Connections for Efficient Neural Networks, NIPS 2015
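A sketch of the iterative prune-and-retrain loop (the "retraining" here is a stand-in; the real pipeline runs SGD on the surviving weights with the pruning mask enforced):

```python
import numpy as np

def prune_by_magnitude(W, fraction):
    """Zero out the smallest-magnitude `fraction` of weights; keep a mask."""
    thresh = np.quantile(np.abs(W), fraction)
    mask = np.abs(W) > thresh
    return W * mask, mask

def retrain(W, mask, rng, steps=10, lr=1e-3):
    """Stand-in for SGD on the surviving weights; multiplying updates by
    the mask keeps pruned weights at zero."""
    for _ in range(steps):
        fake_grad = rng.standard_normal(W.shape).astype(np.float32)
        W = W - lr * fake_grad * mask
    return W

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256)).astype(np.float32)
for frac in (0.5, 0.7, 0.9):            # iterative prune-and-retrain
    W, mask = prune_by_magnitude(W, frac)
    W = retrain(W, mask, rng)
print(f"{(W == 0).mean():.0%} of weights pruned")   # ~90%
```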
Pruning of VGG-16
Pruning Neural Talk and LSTM
Speedup of Pruning on CPU/GPU
- Intel Core i7 5930K: MKL CBLAS GEMV / MKL SPBLAS CSRMV
- NVIDIA GeForce GTX Titan X: cuBLAS GEMV / cuSPARSE CSRMV
- NVIDIA Tegra K1: cuBLAS GEMV / cuSPARSE CSRMV
Trained Quantization (Weight Sharing)

Pipeline: Train Connectivity → Prune Connections → Train Weights → Cluster the Weights → Generate Code Book → Quantize the Weights with Code Book → Retrain Code Book

Pruning: less quantity. Quantization: less precision.
Original network (100% size) → pruned (10% size) → pruned + quantized (3.7% size), with the same accuracy at each step.
Han et al. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, arXiv 2015
Weight Sharing via K-Means

[Figure: a 4x4 block of 32-bit float weights (values such as 2.09, -0.98, 1.48, 0.09, ...) is clustered into four centroids (codebook: 2.00, 1.50, 0.00, -1.00); each weight is replaced by a 2-bit cluster index. During fine-tuning, the gradients are grouped by cluster index and summed (reduce), and each summed gradient, scaled by the learning rate (lr), updates its shared centroid, giving fine-tuned centroids 1.96, 1.48, -0.04, -0.97]
Han et al. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, arXiv 2015
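A sketch of the scheme in the figure (my own k-means, with centroids initialized linearly over the weight range): cluster the weights, store 2-bit indices plus a 4-entry codebook, and fine-tune by reducing each cluster's gradients onto its shared centroid.

```python
import numpy as np

def kmeans_share(W, n_clusters=4, iters=20):
    """Cluster weights into a small codebook; each weight is then stored
    as a log2(n_clusters)-bit index (2 bits for 4 clusters)."""
    flat = W.ravel()
    centroids = np.linspace(flat.min(), flat.max(), n_clusters)  # linear init
    for _ in range(iters):
        idx = np.abs(flat[:, None] - centroids[None, :]).argmin(axis=1)
        for c in range(n_clusters):
            if np.any(idx == c):
                centroids[c] = flat[idx == c].mean()
    return idx.reshape(W.shape), centroids

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 4)).astype(np.float32)
idx, codebook = kmeans_share(W)
W_hat = codebook[idx]                    # decode: shared weights

# One fine-tuning step: group the gradients by cluster index, sum them
# ("reduce"), and update each shared centroid with the summed gradient.
grad = rng.standard_normal(W.shape)
for c in range(len(codebook)):
    codebook[c] -= 0.01 * grad[idx == c].sum()
```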
Trained Quantization

[Figure: accuracy vs. bits per weight]

Han et al. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, arXiv 2015
Pruning + Trained Quantization
30x–50x Compression Means

- Complex DNNs can be put in mobile applications (<100MB total)
  – A 1GB network (250M weights) becomes 20–30MB
- Memory bandwidth reduced by 30–50x
  – Particularly for FC layers in real-time applications with no reuse
- Memory working set fits in on-chip SRAM
  – 5 pJ/word access vs. 640 pJ/word
Efficient Inference Engine (EIE)
Sparse Matrix Representation

[Figure: a sparse 8x4 weight matrix (entries $w_{0,0}, w_{0,1}, w_{0,3}, w_{1,2}, w_{2,1}, w_{2,3}, w_{4,2}, w_{4,3}, w_{5,0}, w_{6,3}, w_{7,1}$) is interleaved row-wise across four processing elements PE0–PE3. Multiplying by a sparse activation vector $\tilde a$ (only $a_1$ and $a_3$ nonzero) yields $\tilde b = (b_0, b_1, -b_2, b_3, -b_4, b_5, b_6, -b_7)$; ReLU then zeroes the negative entries, leaving $b_0, b_1, b_3, b_5, b_6$]
Each PE stores its share of the matrix in compressed form: an array of virtual weights (for PE0: $W_{0,0}, W_{0,1}, W_{4,2}, W_{0,3}, W_{4,3}$), a relative index per weight encoding how many zeros precede it, and column pointers marking where each column begins.
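A plain-Python sketch of how such a representation is consumed (standing in for the PE datapath, with the relative-index encoding simplified to absolute row indices): the outer loop skips zero activations entirely, and the inner loop touches only stored weights.

```python
import numpy as np

def sparse_mxv(col_ptr, row_idx, vals, a, n_rows):
    """CSC-style sparse M x V that skips zero activations: for each nonzero
    a_j, walk column j's stored weights and scatter-accumulate into b."""
    b = np.zeros(n_rows, dtype=np.float32)
    for j, aj in enumerate(a):
        if aj == 0.0:                        # dynamic activation sparsity
            continue
        for p in range(col_ptr[j], col_ptr[j + 1]):
            b[row_idx[p]] += vals[p] * aj    # static weight sparsity
    return np.maximum(b, 0.0)                # ReLU makes the next layer sparse

# Build the CSC arrays for a small sparse matrix.
W = np.array([[0.5, 0.0, 0.0, 0.2],
              [0.0, 0.0, 0.3, 0.0],
              [0.0, 0.4, 0.0, 0.1]], dtype=np.float32)
col_ptr, row_idx, vals = [0], [], []
for j in range(W.shape[1]):
    nz = np.nonzero(W[:, j])[0]
    row_idx.extend(nz)
    vals.extend(W[nz, j])
    col_ptr.append(len(vals))

a = np.array([0.0, 2.0, 0.0, 1.0], dtype=np.float32)    # sparse activations
print(sparse_mxv(col_ptr, row_idx, vals, a, n_rows=3))  # [0.2 0.  0.9]
```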
EIE Architecture
Scalability
[Figure: speedup (1x to 100x, log scale) vs. number of PEs (1 to 256) on Alex-6/7/8, VGG-6/7/8, NT-We, NT-Wd, and NT-LSTM]
Load Balance
[Figure: load balance (0% to 100%) on the same benchmarks for FIFO depths from 1 to 256]
Implementation
Energy Distribution
FC Layer: Speedup on EIE
[Figure: FC-layer speedup (log scale, 0.1x to 1000x) over the dense CPU baseline for CPU/GPU/mGPU, dense and compressed, and EIE, on Alex-6/7/8, VGG-6/7/8, NT-We, NT-Wd, and NT-LSTM. EIE ranges from 60x to 1018x, with a geometric mean of 189x]
FC Layer: Energy Efficiency on EIE
[Figure: FC-layer energy efficiency (log scale, up to 100,000x) relative to the dense CPU baseline for the same platforms and benchmarks. EIE ranges from 8,053x to 119,797x, with a geometric mean of 24,207x]
Comparison: Throughput
Comparison: Area Efficiency
Comparison: Energy Efficiency
Sparse Convolutional Accelerator
Blocking CNN Inference
[Figure: blocking of input activations (x × y × c), weights (c input channels × k output channels), and output activations (X′ × Y′ × k); the highlighted (purple) blocks are the portion allocated to one PE]
Sparse Convolution

- Only compute where both operands are nonzero
- 10–30x reduction in work

[Figure: sparse input activations * sparse weights = output]
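A sketch of the idea (my own scatter-style formulation, with boundary handling simplified to a full-size output): iterating only over nonzero activations and nonzero weights makes the work proportional to nnz(A)·nnz(K) rather than the full dense loop count.

```python
import numpy as np

def sparse_conv2d(A, K):
    """Direct 2D convolution that visits only nonzero activations and
    nonzero weights, scatter-adding each product to its output location."""
    X, Y = A.shape
    U, V = K.shape
    B = np.zeros((X + U - 1, Y + V - 1), dtype=A.dtype)
    a_nz = list(zip(*np.nonzero(A)))      # nonzero input coordinates
    k_nz = list(zip(*np.nonzero(K)))      # nonzero weight coordinates
    for x, y in a_nz:                     # work is nnz(A) * nnz(K),
        for u, v in k_nz:                 # not X*Y*U*V as in the dense loop
            B[x + u, y + v] += A[x, y] * K[u, v]
    return B

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 8)) * (rng.random((8, 8)) < 0.3)   # ~70% zeros
K = rng.standard_normal((3, 3)) * (rng.random((3, 3)) < 0.5)   # ~50% zeros
print(sparse_conv2d(A, K).shape)          # (10, 10)
```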
Sparse Convolution Engine

[Figure: a sparse weight buffer (W entries) and a sparse input buffer (M entries) feed an M×W multiplier array; the operands' indices drive output-address computation, and a scatter-add unit accumulates the products into a banked output buffer]
Conclusion
Hardware and Data enable DNNs
Summary
- Hardware has enabled the current resurgence of DNNs
  – And limits the size of today's networks
- Inference
  – Dynamically sparse activations × statically sparse weights
  – 8b weights sufficient (can be compressed to 2–4b)
  – Energy dominated by data movement and buffering
  – Fixed-function hardware will dominate inference
- Training
  – Only dynamic sparsity (3x activations, 2x dropout)
  – Medium precision (FP16 for weights)
  – Large memory footprint (batch × retained activations): can be 10s–100s of GB
  – Parallelism to 10 PF today, 100 PF in the near future (limited by communication bandwidth)
  – GPUs will dominate training
4 Distinct Sub-problems

Train Conv:      32b FP, batch activation storage, communication for parallelism; GPUs ideal.
Train FC:        32b FP, batch weight storage, communication for parallelism; GPUs ideal.
Inference Conv:  low-precision, compressed, latency-sensitive, arithmetic-dominated; fixed-function HW.
Inference FC:    low-precision, compressed, latency-sensitive, no weight reuse, storage-dominated; fixed-function HW.

Convolutional layers: B × S weight reuse, activation-dominated. Fully-connected layers: B weight reuse, weight-dominated.