

  1. Computer Arithmetic in Deep Learning Bryan Catanzaro @ctnzr

  2. What do we want AI to do?
  • Keep us organized
  • Guide us to content
  • Help us find things
  • Help us communicate
  • Drive us to work
  • Serve drinks?
  @ctnzr

  3. OCR-based Translation App (Baidu IDL)
  [Screenshot: app translating the word "hello"]
  Bryan Catanzaro

  4. Medical Diagnostics App (Baidu BDL)
  • AskADoctor can assess 520 different diseases, representing ~90 percent of the most common medical problems.
  Bryan Catanzaro

  5. Image Captioning (Baidu IDL)
  • "A yellow bus driving down a road with green trees and green grass in the background."
  • "Living room with white couch and blue carpeting. Room in apartment gets some afternoon sun."
  Bryan Catanzaro

  6. Image Q&A Baidu IDL Sample questions and answers @ctnzr

  7. Natural User Interfaces
  • Goal: Make interacting with computers as natural as interacting with humans
  • AI problems:
  – Speech recognition
  – Emotion recognition
  – Semantic understanding
  – Dialog systems
  – Speech synthesis
  @ctnzr

  8. Demo • Deep Speech public API @ctnzr

  9. Computer vision: Find coffee mug Andrew Ng

  10. Computer vision: Find coffee mug Andrew Ng

  11. Why is computer vision hard? The camera sees: Andrew Ng

  12. Artificial Neural Networks
  [Figure: neurons in the brain alongside a deep learning neural network with an output layer]
  Andrew Ng

  13. Computer vision: Find coffee mug Andrew Ng

  14. Supervised learning (learning from tagged data)
  • Input X: image
  • Output Y: tag Yes/No (Is it a coffee mug?)
  • Data: images labeled Yes / No
  • Learning X ➡ Y mappings is hugely useful
  Andrew Ng

  15. Machine learning in practice
  • Progress bound by latency of hypothesis testing
  [Figure: Idea → Code → Test loop: think really hard, hack it up in Matlab, run on a workstation]
  @ctnzr

  16. Deep Neural Net
  • A very simple universal approximator
  [Diagram: inputs x, weights w, outputs y]
  • One layer: y_j = f(Σ_i w_ij x_i)
  • Nonlinearity (ReLU): f(x) = 0 for x < 0, f(x) = x for x ≥ 0
  @ctnzr
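A minimal NumPy sketch of the one-layer computation above (not from the talk; shapes and names are illustrative):

```python
import numpy as np

def relu(x):
    # f(x) = 0 for x < 0, f(x) = x for x >= 0
    return np.maximum(0.0, x)

def layer_forward(x, W):
    # y_j = f(sum_i w_ij * x_i): one dot product per output, then the nonlinearity
    return relu(W @ x)

W = np.random.randn(4, 3)   # 3 inputs, 4 outputs
x = np.random.randn(3)
y = layer_forward(x, W)     # shape (4,)
```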

  17. Why Deep Learning?
  1. Scale Matters
  – Bigger models usually win
  2. Data Matters
  – More data means less cleverness necessary
  3. Productivity Matters
  – Teams with better tools can try out more ideas
  [Chart: Accuracy vs. Data & Compute; deep learning keeps improving where many previous methods plateau]
  @ctnzr

  18. Training Deep Neural Networks
  [Diagram: inputs x, weights w, outputs y]   y_j = f(Σ_i w_ij x_i)
  • Computation dominated by dot products
  • Multiple inputs, multiple outputs, plus batching means GEMM
  – Compute bound
  • Convolutional layers even more compute bound
  @ctnzr
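To make the GEMM point concrete, a small sketch (illustrative sizes, not from the talk) showing that batching turns many matrix-vector products into one matrix-matrix product:

```python
import numpy as np

W = np.random.randn(512, 256)    # weights: 256 inputs -> 512 outputs
X = np.random.randn(256, 128)    # minibatch of 128 input columns

# One matrix-vector product (GEMV) per example...
Y_loop = np.stack([W @ X[:, i] for i in range(128)], axis=1)

# ...versus a single matrix-matrix product (GEMM) over the whole minibatch.
# Same result, but W is reused across all 128 columns, so the GEMM is
# compute bound rather than memory bound.
Y_gemm = W @ X
assert np.allclose(Y_loop, Y_gemm)
```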

  19. Computational Characteristics
  • High arithmetic intensity
  – Arithmetic operations per byte of data
  – O(Exaflops) / O(Terabytes): ~10^6
  – Math limited: arithmetic matters
  • Medium size datasets
  – Generally fit on 1 node
  • Training 1 model: ~20 Exaflops
  Bryan Catanzaro
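A back-of-the-envelope arithmetic-intensity calculation for a single FP32 GEMM, with illustrative sizes (the ~10^6 figure above is for the whole training run, not one operation):

```python
M, N, K = 2048, 128, 2048                  # layer outputs, batch size, layer inputs
flops = 2 * M * N * K                      # one multiply + one add per inner-product step
bytes_moved = 4 * (M * K + K * N + M * N)  # read W and X, write Y, 4 bytes each (FP32)
print(flops / bytes_moved)                 # ~57 flops/byte: limited by math, not bandwidth
```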

  20. Speech Recognition: Traditional ASR
  • Getting higher performance is hard
  • Improve each stage by expert engineering
  [Chart: Accuracy vs. Data + Model Size; traditional ASR improves only slowly]
  @ctnzr

  21. Speech recognition: Traditional ASR
  • Huge investment in features for speech!
  – Decades of work to get very small improvements
  – e.g. Spectrogram, Flux, MFCC
  @ctnzr

  22. Speech Recognition 2: Deep Learning!
  • Since 2011, deep learning for features
  [Pipeline diagram: Acoustic Model + HMM + Language Model → Transcription: "The quick brown fox jumps over the lazy dog."]
  @ctnzr

  23. Speech Recognition 2: Deep Learning!
  • With more data, DL acoustic models perform better than traditional models
  [Chart: Accuracy vs. Data + Model Size; DL V1 for Speech overtakes Traditional ASR]
  @ctnzr

  24. Speech Recognition 3: "Deep Speech"
  • End-to-end learning
  [Pipeline diagram: audio → single network → Transcription: "The quick brown fox jumps over the lazy dog."]
  @ctnzr

  25. Speech Recognition 3: "Deep Speech"
  • We believe end-to-end DL works better when we have big models and lots of data
  [Chart: Accuracy vs. Data + Model Size; Deep Speech above DL V1 for Speech, above Traditional ASR]
  @ctnzr

  26. End-to-end speech with DL
  • Deep neural network predicts characters directly from audio
  [Figure: network emitting the character sequence "T H _ E … D O G"]
  @ctnzr

  27. Recurrent Network
  • RNNs model temporal dependence
  • Various flavors used in many applications
  – LSTM, GRU, Bidirectional, …
  – Especially sequential data (time series, text, etc.)
  • Sequential dependence complicates parallelism (see the sketch below)
  • Feedback complicates arithmetic
  @ctnzr
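A minimal vanilla RNN step (a sketch, not the talk's LSTM/GRU variants; names and sizes are illustrative) makes the serial dependence visible:

```python
import numpy as np

def rnn_forward(xs, h0, Wx, Wh, b):
    # h_t = tanh(Wx @ x_t + Wh @ h_{t-1} + b): every step consumes the
    # previous step's output, so the time dimension is inherently serial.
    h, hs = h0, []
    for x in xs:
        h = np.tanh(Wx @ x + Wh @ h + b)
        hs.append(h)
    return hs

T, D, H = 10, 8, 16
xs = [np.random.randn(D) for _ in range(T)]
hs = rnn_forward(xs, np.zeros(H), np.random.randn(H, D),
                 np.random.randn(H, H), np.zeros(H))
```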

  28. Connectionist Temporal Classification (a cost function for end-to-end learning)
  • We compute this in log space
  • Probabilities are tiny
  @ctnzr
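Why log space matters here: products of many tiny per-frame probabilities underflow, while log-space sums via log-sum-exp stay representable. A small sketch (illustrative numbers, not the CTC recursion itself):

```python
import numpy as np

# A path through 1000 frames with per-frame probability ~0.1 underflows
# even in double precision...
p = np.prod(np.full(1000, 0.1))             # 0.0: underflow
# ...but in log space the product becomes a sum, which is well behaved:
log_p = np.sum(np.log(np.full(1000, 0.1)))  # about -2302.6
# Summing two path probabilities in log space uses log-sum-exp:
log_total = np.logaddexp(log_p, log_p)      # log(p + p), computed stably
```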

  29. Training sets
  • Train on 45k hours (~5 years) of data
  – Still growing
  • Languages
  – English
  – Mandarin
  • End-to-end deep learning is key to assembling large datasets
  @ctnzr

  30. Performance for RNN training
  [Chart: TFLOP/s vs. number of GPUs (1 to 128), single node and multi node, with a typical training run marked]
  • 55% of GPU FMA peak using a single GPU
  • ~48% of peak using 8 GPUs in one node
  • This scalability is key to large models & large datasets
  @ctnzr

  31. Computer Arithmetic for training
  • Standard practice: FP32
  • But big efficiency gains from smaller arithmetic
  – e.g. NVIDIA GP100 has 21 Tflops of 16-bit FP, but 10.5 Tflops of 32-bit FP
  • Expect continued push to lower precision
  • Some people report success with very low precision training
  – Down to 1 bit!
  – Quite dependent on problem/dataset
  Bryan Catanzaro

  32. Training: Stochastic Gradient Descent
  w′ = w − (γ/n) Σ_i ∇_w Q(x_i, w)
  • Simple algorithm
  – Add momentum to power through local minima
  – Compute gradient by backpropagation
  • Operates on minibatches
  – This makes it a GEMM problem instead of GEMV
  • Choose minibatches stochastically
  – Important to avoid memorizing training order
  • Difficult to parallelize
  – Prefers lots of small steps
  – Increasing minibatch size not always helpful
  @ctnzr
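A minimal sketch of the update above with momentum and stochastic minibatches (grad_fn and data are illustrative placeholders, not from the talk; data is assumed to be a NumPy array of examples):

```python
import numpy as np

def sgd_momentum(w, grad_fn, data, lr=1e-4, mu=0.9, batch=128, steps=1000):
    rng = np.random.default_rng(0)
    v = np.zeros_like(w)
    for _ in range(steps):
        # Sample the minibatch stochastically: avoids memorizing training order.
        idx = rng.choice(len(data), size=batch, replace=False)
        g = np.mean([grad_fn(x, w) for x in data[idx]], axis=0)
        v = mu * v - lr * g   # momentum helps power through local minima
        w = w + v             # many small updates: the arithmetic problem below
    return w
```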

  33. Training: Learning rate
  w′ = w − (γ/n) Σ_i ∇_w Q(x_i, w)
  • γ is very small (1e-4)
  • We learn by making many very small updates to the parameters
  • Terms in this equation are often very lopsided: a computer arithmetic problem
  @ctnzr

  34. Cartoon optimization problem
  Q = −(w − 3)² + 3
  ∂Q/∂w = −2(w − 3)
  γ = 0.01
  [Erich Elsen] @ctnzr
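Worked in code, the cartoon is gradient ascent on the concave objective Q (a sketch, assuming full precision):

```python
w, lr = 0.0, 0.01
for _ in range(500):
    grad = -2.0 * (w - 3.0)   # dQ/dw for Q = -(w - 3)^2 + 3
    w += lr * grad            # many tiny steps climb toward the maximum at w = 3
print(w)                      # ~2.9999 in full precision
```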

  35. Cartoon Optimization Problem
  [Figure: Q vs. w; each update moves w by γ · ∂Q/∂w along the gradient]
  [Erich Elsen] @ctnzr

  36. Rounding is not our friend
  [Figure: near w, the update γ · ∂Q/∂w is smaller than the resolution of FP16]
  [Erich Elsen] @ctnzr
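This failure is easy to reproduce directly (a sketch; 100 and 0.01 echo the example on the next slides):

```python
import numpy as np

w = np.float16(100.0)
update = np.float16(0.01)
# Representable FP16 values near 100 are 0.0625 apart, so round-to-nearest
# returns the unchanged weight: the update is lost no matter how often we add it.
print(w + update == w)   # True
```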

  37. Solution 1: Stochastic Rounding [S. Gupta et al., 2015]
  • Round up or down with probability related to the distance to the neighboring grid points
  • Example: x = 100, y = 0.01, ε = 1
  – x + y rounds to 100 w.p. 0.99, to 101 w.p. 0.01
  • Efficient to implement
  – Just need a bunch of random numbers
  – And an FMA instruction with round-to-nearest-even
  [Erich Elsen] @ctnzr
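A sketch of stochastic rounding to a grid of spacing ε (real implementations round to the FP16 grid; a uniform grid keeps the idea visible):

```python
import numpy as np

rng = np.random.default_rng(0)

def stochastic_round(x, eps=1.0):
    # Round down or up with probability given by the distance to the two
    # neighboring grid points, so E[stochastic_round(x)] == x.
    lo = np.floor(x / eps) * eps
    p_up = (x - lo) / eps
    return lo + eps * (rng.random(np.shape(x)) < p_up)

# x = 100, y = 0.01, eps = 1: the sum rounds to 101 about 1% of the time,
# so on average the tiny update is preserved.
print(stochastic_round(np.full(10000, 100.01)).mean())   # ~100.01
```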

  38. Stochastic Rounding
  • After adding 0.01 to 100, 100 times:
  – With round-to-nearest-even we will still have 100
  – With stochastic rounding we expect to have 101
  • Allows us to make optimization progress even when the updates are small
  [Erich Elsen] @ctnzr

  39. Solution 2: High precision accumulation
  • Keep two copies of the weights
  – One in high precision (fp32)
  – One in low precision (fp16)
  • Accumulate updates into the high precision copy
  • Round the high precision copy to low precision and perform computations with it
  [Erich Elsen] @ctnzr

  40. High precision accumulation
  • After adding 0.01 to 100, 100 times:
  – We will have exactly 101 in the high precision weights, which rounds to 101 in the low precision weights
  • Allows for accurate accumulation while maintaining the benefits of fp16 computation
  • Requires more weight storage, but weights are usually a small part of the memory footprint
  [Erich Elsen] @ctnzr
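A sketch of the two-copy scheme (values mirror the example above; in a real trainer the fp16 copy w16 would feed the forward/backward GEMMs, and the gradient would come from that fp16 math):

```python
import numpy as np

lr = np.float32(1.0)
w_master = np.float32(100.0)        # high precision (fp32) copy
for _ in range(100):
    w16 = np.float16(w_master)      # low precision copy used for computation
    grad = np.float16(-0.01)        # illustrative gradient from fp16 compute
    w_master = w_master - lr * np.float32(grad)   # accumulate the update in fp32
print(w_master, np.float16(w_master))  # ~101.0, and it rounds to 101.0 in fp16
```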

  41. Deep Speech Training Results
  [Chart: training curves comparing FP16 storage vs. FP32 math]
  [Erich Elsen] @ctnzr

  42. Deployment
  • Once a model is trained, we need to deploy it
  • Technically a different problem
  – No more SGD, just forward propagation
  • Arithmetic can be even smaller for deployment
  – We currently use FP16
  – 8-bit fixed point can work with small accuracy loss
  • Need to choose scale factors for each layer
  – Higher precision accumulation very helpful
  • Although all of this is ad hoc
  @ctnzr
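One simple way to pick a scale factor is max-abs scaling per tensor; the talk calls all of this ad hoc, so the following is only an illustrative sketch, not the deployed scheme:

```python
import numpy as np

def quantize_int8(x):
    # One scale factor per tensor so the largest magnitude maps to 127.
    scale = np.max(np.abs(x)) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

W = np.random.randn(64, 32).astype(np.float32)
x = np.random.randn(32).astype(np.float32)
qW, sW = quantize_int8(W)
qx, sx = quantize_int8(x)

# 8-bit multiplies with higher precision (int32) accumulation, then rescale:
y = (qW.astype(np.int32) @ qx.astype(np.int32)).astype(np.float32) * (sW * sx)
print(np.max(np.abs(y - W @ x)))   # small quantization error
```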

  43. Magnitude distributions
  [Chart: frequency vs. log_2(magnitude) for Dense Layer 1: parameters, input, and output]
  • "Peaked" power law distributions
  [M. Shoeybi] @ctnzr

  44. Determinism
  • Determinism is very important
  – With so much randomness, it's hard to tell if you have a bug
  – Networks train despite bugs, although accuracy is impaired
  • Reproducibility is important
  – For the usual scientific reasons
  – Progress is not possible without reproducibility
  • We use synchronous SGD
  @ctnzr

  45. Conclusion
  • Deep Learning is solving many hard problems
  • Many interesting computer arithmetic issues in Deep Learning
  • The DL community could use your help understanding them!
  – Pick the right format
  – Mix formats
  – Better arithmetic hardware
  @ctnzr

  46. Thanks
  • Andrew Ng, Adam Coates, Awni Hannun, Patrick LeGresley, Erich Elsen, Greg Diamos, Chris Fougner, Mohammed Shoeybi … and all of SVAIL
  Bryan Catanzaro @ctnzr
