Computer Arithmetic in Deep Learning
Bryan Catanzaro
@ctnzr
What do we want AI to do?
- Keep us organized
- Help us find things
- Guide us to content
- Help us communicate
- Drive us to work
- Serve drinks?
OCR-based Translation App (Baidu IDL)
Medical Diagnostics App (Baidu BDL)
- AskADoctor can assess 520 different diseases, representing ~90 percent of the most common medical problems.
Image Captioning (Baidu IDL)
- "A yellow bus driving down a road with green trees and green grass in the background."
- "Living room with white couch and blue carpeting."
- "Room in apartment gets some afternoon sun."
Image Q&A (Baidu IDL)
- Sample questions and answers
Natural User Interfaces
- Goal: Make interacting with computers as natural as interacting with humans
- AI problems:
  – Speech recognition
  – Emotion recognition
  – Semantic understanding
  – Dialog systems
  – Speech synthesis
Demo
- Deep Speech public API
Computer vision: Find coffee mug [Andrew Ng]
- The camera sees: [image of raw pixel values]
- Why is computer vision hard?
Neurons in the brain [Andrew Ng]
Deep Learning: Neural network
Artificial Neural Networks [Andrew Ng]
Supervised learning (learning from tagged data)
- Input X: image
- Output tag Y: Yes/No (Is it a coffee mug?)
- Data: labeled (image, tag) pairs
Learning X ➡ Y mappings is hugely useful
Machine learning in practice
- Progress bound by latency of hypothesis testing
- The loop: Idea → Code → Test
  – Think really hard… → hack it up in Matlab → run on a workstation
Deep Neural Net
- A very simple universal approximator
- One layer: $y_j = f\left(\sum_i w_{ij} x_i\right)$, mapping inputs $x$ through weights $w$ to outputs $y$
- Nonlinearity: $f(x) = \begin{cases} 0, & x < 0 \\ x, & x \ge 0 \end{cases}$ (ReLU)
- A deep neural net stacks many such layers (see the sketch below)
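A minimal NumPy sketch of this layer (illustrative only; the sizes and names are assumptions, and ReLU is the nonlinearity from the slide):

    import numpy as np

    def relu(x):
        # f(x) = 0 for x < 0, x for x >= 0
        return np.maximum(x, 0.0)

    def layer(x, W):
        # y_j = f(sum_i w_ij * x_i): one dense layer followed by the nonlinearity
        return relu(W @ x)

    # Stacking layers gives a deep network: y = f(W3 f(W2 f(W1 x)))
    rng = np.random.default_rng(0)
    x = rng.standard_normal(64)                               # input vector
    weights = [rng.standard_normal((64, 64)) * 0.1 for _ in range(3)]
    y = x
    for W in weights:
        y = layer(y, W)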
Why Deep Learning?
- 1. Scale Matters
  – Bigger models usually win
- 2. Data Matters
  – More data means less cleverness necessary
- 3. Productivity Matters
  – Teams with better tools can try out more ideas
[Plot: Accuracy vs. Data & Compute, comparing Deep Learning with many previous methods]
Training Deep Neural Networks
- Each layer computes $y_j = f\left(\sum_i w_{ij} x_i\right)$ over inputs $x$, weights $w$, outputs $y$
- Computation dominated by dot products
- Multiple inputs, multiple outputs, and batching mean GEMM (sketch below)
  – Compute bound
- Convolutional layers are even more compute bound
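A sketch of why batching matters (illustrative, not the actual training code): one example at a time makes each layer a matrix-vector product (GEMV), while a minibatch makes it a matrix-matrix product (GEMM) with much higher arithmetic intensity.

    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.standard_normal((2048, 2048)).astype(np.float32)   # layer weights

    # One example at a time: matrix-vector product (GEMV), bandwidth bound
    x = rng.standard_normal(2048).astype(np.float32)
    y = W @ x

    # A minibatch of examples: matrix-matrix product (GEMM), compute bound
    X = rng.standard_normal((2048, 128)).astype(np.float32)    # 128 examples as columns
    Y = W @ X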
Computational Characteristics
- High arithmetic intensity
  – Arithmetic operations / byte of data
  – O(Exaflops) / O(Terabytes) : ~10^6 (worked out below)
  – Math limited
  – Arithmetic matters
- Medium size datasets
  – Generally fit on 1 node
- Training 1 model: ~20 Exaflops
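A rough worked version of that ratio; the dataset size is an assumed order of magnitude, only the ~20 Exaflops figure comes from the slide:

$$\frac{\text{arithmetic}}{\text{data}} \approx \frac{2 \times 10^{19}\ \text{flops}}{\mathcal{O}(10^{13})\ \text{bytes}} \sim 10^{6}\ \text{flops per byte}$$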
Speech Recognition: Traditional ASR
- Getting higher performance is hard
- Improve each stage by expert engineering
[Plot: Accuracy vs. Data + Model Size; Traditional ASR improves through expert engineering]
Speech recognition: Traditional ASR
- Huge investment in features for speech!
  – Decades of work to get very small improvements
  – e.g. Spectrogram, MFCC, Flux
Speech Recognition 2: Deep Learning!
- Since 2011, deep learning for features
- Pipeline: Audio → DL Acoustic Model → HMM → Language Model → Transcription
  – "The quick brown fox jumps over the lazy dog."
Speech Recognition 2: Deep Learning!
- With more data, DL acoustic models perform better than traditional models
[Plot: Accuracy vs. Data + Model Size; DL V1 for Speech vs. Traditional ASR]
Speech Recognition 3: "Deep Speech"
- End-to-end learning: audio straight to transcription
  – "The quick brown fox jumps over the lazy dog."
Speech Recognition 3: "Deep Speech"
- We believe end-to-end DL works better when we have big models and lots of data
[Plot: Accuracy vs. Data + Model Size; Deep Speech vs. DL V1 for Speech vs. Traditional ASR]
End-to-end speech with DL
- Deep neural network predicts characters directly from audio
  – e.g. "T H _ E … D O G"
Recurrent Network
- RNNs model temporal dependence (sketch below)
- Various flavors used in many applications
  – LSTM, GRU, Bidirectional, …
  – Especially sequential data (time series, text, etc.)
- Sequential dependence complicates parallelism
- Feedback complicates arithmetic
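A minimal sketch of the temporal dependence in a vanilla RNN (illustrative only; these names and shapes are assumptions, not Deep Speech's actual recurrent layers):

    import numpy as np

    def rnn_forward(xs, Wx, Wh, b):
        # h_t depends on h_{t-1}: this sequential dependence limits parallelism over time
        h = np.zeros(Wh.shape[0], dtype=xs.dtype)
        hs = []
        for x_t in xs:                        # xs: sequence of input frames
            h = np.tanh(Wx @ x_t + Wh @ h + b)
            hs.append(h)
        return np.stack(hs)

    rng = np.random.default_rng(0)
    T, d_in, d_h = 50, 160, 256               # frames, input features, hidden units
    xs = rng.standard_normal((T, d_in)).astype(np.float32)
    Wx = (rng.standard_normal((d_h, d_in)) * 0.1).astype(np.float32)
    Wh = (rng.standard_normal((d_h, d_h)) * 0.1).astype(np.float32)
    b = np.zeros(d_h, dtype=np.float32)
    hs = rnn_forward(xs, Wx, Wh, b)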
Connectionist Temporal Classification (a cost function for end-to-end learning)
- We compute this in log space (sketch below)
- Probabilities are tiny
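Because the path probabilities underflow quickly, sums of products are carried in log space. A hedged sketch of the standard log-sum-exp trick (not Baidu's CTC code):

    import numpy as np

    def logsumexp(log_ps):
        # log(sum_i exp(log_ps[i])) computed stably, even when every p_i underflows to 0
        m = np.max(log_ps)
        return m + np.log(np.sum(np.exp(log_ps - m)))

    # Each of these probabilities is around 1e-400, far below what fp64 can represent
    log_ps = np.array([-920.0, -921.0, -925.0])
    print(np.sum(np.exp(log_ps)))      # 0.0 -- the naive sum is useless
    print(logsumexp(log_ps))           # about -919.7, the log of the sum survives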
Training sets
- Train on 45k hours (~5 years) of data
  – Still growing
- Languages
  – English
  – Mandarin
- End-to-end deep learning is key to assembling large datasets
Performance for RNN training
- 55% of GPU FMA peak using a single GPU
- ~48% of peak using 8 GPUs in one node
- This scalability is key to large models & large datasets
[Plot: TFLOP/s vs. number of GPUs (1 to 128) for a typical training run, one node and multi node]
Computer Arithmetic for training
- Standard practice: FP32
- But big efficiency gains from smaller arithmetic
  – e.g. NVIDIA GP100 has 21 Tflops of 16-bit FP, but 10.5 Tflops of 32-bit FP
- Expect continued push to lower precision
- Some people report success in very low precision training
  – Down to 1 bit!
  – Quite dependent on problem/dataset
Training: Stochastic Gradient Descent
$$w' = w - \frac{\gamma}{n} \sum_i \nabla_w Q(x_i, w)$$
- Simple algorithm (sketch below)
  – Add momentum to power through local minima
  – Compute gradient by backpropagation
- Operates on minibatches
  – This makes it a GEMM problem instead of GEMV
- Choose minibatches stochastically
  – Important to avoid memorizing training order
- Difficult to parallelize
  – Prefers lots of small steps
  – Increasing minibatch size not always helpful
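A minimal sketch of the update above with momentum over stochastic minibatches (illustrative; grad_fn and the dataset handling are hypothetical placeholders, not the talk's training code):

    import numpy as np

    def sgd_momentum_step(w, v, grad, lr=1e-4, momentum=0.9):
        # v accumulates a decaying sum of past updates, helping power through local minima
        v = momentum * v - lr * grad
        return w + v, v

    # Training loop: w' = w - (lr/n) * sum_i grad Q(x_i, w), over stochastic minibatches
    def train(w, dataset, grad_fn, epochs=10, batch_size=512):
        v = np.zeros_like(w)
        rng = np.random.default_rng(0)
        for _ in range(epochs):
            order = rng.permutation(len(dataset))               # shuffle: avoid memorizing training order
            for start in range(0, len(dataset), batch_size):
                batch = dataset[order[start:start + batch_size]]
                grad = grad_fn(w, batch)                        # mean gradient over the minibatch (GEMM-heavy)
                w, v = sgd_momentum_step(w, v, grad)
        return w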
Training: Learning rate
$$w' = w - \frac{\gamma}{n} \sum_i \nabla_w Q(x_i, w)$$
- γ is very small (1e-4)
- We learn by making many very small updates to the parameters
- Terms in this equation are often very lopsided
- This is a Computer Arithmetic Problem
Cartoon optimization problem [Erich Elsen]
$$Q = -(w - 3)^2 + 3, \qquad \frac{\partial Q}{\partial w} = -2(w - 3), \qquad \gamma = 0.01$$
[Plot: Q vs. w, with the gradient $\partial Q / \partial w$ and the scaled update $\gamma \, \partial Q / \partial w$ marked]
Rounding is not our friend
- The update $\gamma \, \partial Q / \partial w$ can be smaller than the resolution of FP16 near $w$, so $w - \gamma \, \partial Q / \partial w$ rounds right back to $w$ (demonstrated in the sketch below)
[Erich Elsen]
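A small NumPy fp16 demonstration of the same effect (the numbers are illustrative, not taken from the talk):

    import numpy as np

    w = np.float16(100.0)        # a weight
    update = np.float16(0.01)    # gamma * dQ/dw, much smaller than the fp16 spacing at 100

    # fp16 spacing near 100 is 0.0625, so round-to-nearest returns w unchanged
    print(np.spacing(w))         # 0.0625
    print(w - update)            # 100.0 -- the update is lost, no matter how often we apply it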
Solution 1: Stochastic Rounding [S. Gupta et al., 2015]
- Round up or down with probability related to the distance to the neighboring grid points
- Efficient to implement
  – Just need a bunch of random numbers
  – And an FMA instruction with round-to-nearest-even
- Example with $x = 100$, $y = 0.01$, grid spacing $\epsilon = 1$:
$$x + y = \begin{cases} 100 & \text{w.p. } 0.99 \\ 101 & \text{w.p. } 0.01 \end{cases}$$
[Erich Elsen]
Stochastic Rounding
- After adding 0.01 to 100, 100 times
  – With round-to-nearest-even we will still have 100
  – With stochastic rounding we expect to have 101
- Allows us to make optimization progress even when the updates are small (sketch below)
[Erich Elsen]
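A hedged sketch of stochastic rounding onto a grid with spacing 1, matching the example above (real implementations round onto the fp16 grid and use an FMA with round-to-nearest-even, as the slide notes):

    import numpy as np

    def stochastic_round(x, rng):
        # Round to a neighboring grid point with probability given by proximity:
        # 100.01 -> 101 with probability 0.01, and 100 with probability 0.99
        lo = np.floor(x)
        return lo + (rng.random(np.shape(x)) < (x - lo))

    rng = np.random.default_rng(0)
    w = 100.0
    for _ in range(100):                   # add 0.01 one hundred times
        w = stochastic_round(w + 0.01, rng)
    print(w)                               # ~101 in expectation; round-to-nearest would stay at 100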
Solution 2: High precision accumulation
- Keep two copies of the weights
  – One in high precision (fp32)
  – One in low precision (fp16)
- Accumulate updates to the high precision copy
- Round the high precision copy to low precision and perform computations
[Erich Elsen]
High precision accumulation
- After adding 0.01 to 100, 100 times
  – We will have exactly 101 in the high precision weights, which rounds to 101 in the low precision weights
- Allows for accurate accumulation while maintaining the benefits of fp16 computation (sketch below)
- Requires more weight storage, but weights are usually a small part of the memory footprint
[Erich Elsen]
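A minimal sketch of the two-copy scheme (illustrative, not the production training loop): accumulate in fp32 "master" weights, round to fp16 for the compute.

    import numpy as np

    w_master = np.full(1000, 100.0, dtype=np.float32)   # high precision copy: receives the updates
    w_lowp = w_master.astype(np.float16)                # low precision copy: used for compute

    for _ in range(100):
        grad = np.full_like(w_master, -1.0)             # placeholder gradient from the fp16 forward/backward pass
        w_master -= 1e-2 * grad                         # accumulate small updates in fp32
        w_lowp = w_master.astype(np.float16)            # re-round for the next fp16 pass

    print(w_master[0], w_lowp[0])                       # ~101.0 in both; pure fp16 would still be 100.0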
Deep Speech Training Results [Erich Elsen]
[Plot: Deep Speech training curves with FP16 storage and FP32 math]
Deployment
- Once a model is trained, we need to deploy it
- Technically a different problem
  – No more SGD
  – Just forward-propagation
- Arithmetic can be even smaller for deployment
  – We currently use FP16
  – 8-bit fixed point can work with small accuracy loss
- Need to choose scale factors for each layer (sketch below)
  – Higher precision accumulation very helpful
- Although all of this is ad hoc
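A hedged sketch of 8-bit inference with a per-layer scale factor and higher precision accumulation (the symmetric scheme and all names here are assumptions, not the talk's actual deployment code):

    import numpy as np

    def choose_scale(x, bits=8):
        # Per-layer scale factor: map the layer's observed range onto the int8 grid
        return np.max(np.abs(x)) / (2 ** (bits - 1) - 1)

    def quantize(x, scale):
        return np.clip(np.round(x / scale), -127, 127).astype(np.int8)

    rng = np.random.default_rng(0)
    W = rng.standard_normal((256, 256)).astype(np.float32)
    x = rng.standard_normal(256).astype(np.float32)

    w_scale, x_scale = choose_scale(W), choose_scale(x)
    Wq, xq = quantize(W, w_scale), quantize(x, x_scale)

    # int8 multiplies, accumulated in higher precision (int32), then rescaled back to float
    y = (Wq.astype(np.int32) @ xq.astype(np.int32)) * (w_scale * x_scale)
    print(np.max(np.abs(y - W @ x)))        # small accuracy loss relative to fp32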
Magnitude distributions [M. Shoeybi]
[Histogram: frequency (log scale) vs. log_2(magnitude) for the parameters, inputs, and outputs of Dense Layer 1]
- "Peaked" power law distributions
Determinism
- Determinism is very important
- With so much randomness, it is hard to tell if you have a bug
- Networks train despite bugs, although accuracy is impaired
- Reproducibility is important
  – For the usual scientific reasons
  – Progress not possible without reproducibility
- We use synchronous SGD
Conclusion
- Deep Learning is solving many hard problems
- Many interesting computer arithmetic issues in Deep Learning
- The DL community could use your help understanding them!
  – Pick the right format
  – Mix formats
  – Better arithmetic hardware
Thanks
- Andrew Ng, Adam Coates, Awni Hannun,