SLIDE 1

Computer Arithmetic in Deep Learning

Bryan Catanzaro @ctnzr

SLIDE 2

@ctnzr

What do we want AI to do?

  • Keep us organized
  • Guide us to content
  • Help us find things
  • Help us communicate
  • Drive us to work
  • Serve drinks?

SLIDE 3

Bryan Catanzaro

OCR-based Translation App

Baidu IDL

hello

SLIDE 4

Bryan Catanzaro

Medical Diagnostics App

Baidu BDL

AskADoctor can assess 520 different diseases, representing ~90 percent of the most common medical problems.

SLIDE 5

Bryan Catanzaro

Image Captioning

Baidu IDL

  • “A yellow bus driving down a road with green trees and green grass in the background.”
  • “Living room with white couch and blue carpeting.”
  • “Room in apartment gets some afternoon sun.”

SLIDE 6

@ctnzr

Image Q&A

Baidu IDL

Sample questions and answers

SLIDE 7

@ctnzr

Natural User Interfaces

  • Goal: Make interacting with computers as natural as interacting with humans
  • AI problems:
    – Speech recognition
    – Emotional recognition
    – Semantic understanding
    – Dialog systems
    – Speech synthesis

SLIDE 8

@ctnzr

Demo

  • Deep Speech public API
SLIDE 9

Computer vision: Find coffee mug

Andrew Ng

SLIDE 10

Andrew Ng

Computer vision: Find coffee mug

SLIDE 11

Why is computer vision hard?

The camera sees:

Andrew Ng

SLIDE 12

Deep Learning: Neural network

[Figure: neurons in the brain alongside an artificial neural network and its output]

Andrew Ng

SLIDE 13

Computer vision: Find coffee mug

Andrew Ng

SLIDE 14

Supervised learning (learning from tagged data)

Data: input image X, output tag Y: Yes/No (Is it a coffee mug?)

Learning X ➡ Y mappings is hugely useful

Andrew Ng

SLIDE 15

@ctnzr

Machine learning in practice

  • Progress bound by the latency of hypothesis testing

Idea (think really hard…) → Code (hack up in Matlab) → Test (run on workstation) → repeat

SLIDE 16

@ctnzr

Deep Neural Net

  • A very simple universal approximator

    $y_j = f\left(\sum_i w_{ij} x_i\right)$

    (inputs x, weights w, outputs y)

  • One-layer nonlinearity (ReLU):

    $f(x) = \begin{cases} 0, & x < 0 \\ x, & x \ge 0 \end{cases}$
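To make the formula concrete, here is a minimal NumPy sketch of the one-layer rule stacked into a tiny deep net (the `relu` and `layer` names and all shapes are illustrative, not from the talk):

```python
import numpy as np

def relu(x):
    # f(x) = 0 for x < 0, x for x >= 0
    return np.maximum(x, 0.0)

def layer(x, W):
    # y_j = f(sum_i w_ij * x_i): one dot product per output unit
    return relu(W @ x)

# Stacking layers gives a (tiny) deep neural net
rng = np.random.default_rng(0)
x = rng.standard_normal(4)        # inputs x
W1 = rng.standard_normal((8, 4))  # weights w, layer 1
W2 = rng.standard_normal((3, 8))  # weights w, layer 2
y = layer(layer(x, W1), W2)       # outputs y
```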

SLIDE 17

@ctnzr

Why Deep Learning?

  • 1. Scale Matters

– Bigger models usually win

  • 2. Data Matters

– More data means less cleverness necessary

  • 3. Productivity Matters

– Teams with better tools can try out more ideas

[Chart: Accuracy vs. Data & Compute, comparing Deep Learning with many previous methods]

SLIDE 18

@ctnzr

Training Deep Neural Networks

$y_j = f\left(\sum_i w_{ij} x_i\right)$

  • Computation dominated by dot products
  • Multiple inputs, multiple outputs, and batching mean GEMM (see the sketch below)
    – Compute bound
  • Convolutional layers are even more compute bound
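As noted above, a sketch of why batching turns the computation into GEMM; the shapes here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((1024, 512))  # weights: outputs x inputs

x = rng.standard_normal(512)          # a single example: matrix-vector
y = W @ x                             # product (GEMV), bandwidth bound

X = rng.standard_normal((512, 128))   # a minibatch of 128 examples
Y = W @ X                             # one GEMM: W is reused across the
                                      # batch, so the layer is compute bound
```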

SLIDE 19

Bryan Catanzaro

Computational Characteristics

  • High arithmetic intensity
    – Arithmetic operations / byte of data
    – O(Exaflops) / O(Terabytes) ≈ 10^6
    – Math limited
    – Arithmetic matters
  • Medium-size datasets
    – Generally fit on 1 node

Training 1 model: ~20 Exaflops
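A back-of-the-envelope check of that ratio; the ~20 Exaflops figure is from the slide, while the 20 TB of data is an assumed round number for illustration:

```python
total_flops = 20e18                # ~20 Exaflops to train one model (slide)
total_bytes = 20e12                # assumption: O(Terabytes) of data, say 20 TB
print(total_flops / total_bytes)   # 1e6 arithmetic ops per byte: math limited
```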

SLIDE 20

@ctnzr

Speech Recognition: Traditional ASR

  • Getting higher performance is hard
  • Improve each stage by engineering

[Chart: Accuracy vs. Data + Model Size; traditional ASR improves slowly, through expert engineering]

SLIDE 21

@ctnzr

Speech recognition: Traditional ASR

  • Huge investment in features for speech!
    – Decades of work to get very small improvements

[Figure: hand-engineered features: Spectrogram, MFCC, Flux]

SLIDE 22

@ctnzr

Speech Recognition 2: Deep Learning!

  • Since 2011, deep learning for features

[Pipeline: Acoustic Model → HMM → Language Model → Transcription: “The quick brown fox jumps over the lazy dog.”]

SLIDE 23

@ctnzr

Speech Recognition 2: Deep Learning!

  • With more data, DL acoustic models perform better than traditional models

[Chart: Accuracy vs. Data + Model Size, comparing DL V1 for Speech with Traditional ASR]

SLIDE 24

@ctnzr

Speech Recognition 3: “Deep Speech”

  • End-to-end learning

Transcription: “The quick brown fox jumps over the lazy dog.”

SLIDE 25

@ctnzr

Speech Recognition 3: “Deep Speech”

  • We believe end-to-end DL works better when we have big models and lots of data

[Chart: Accuracy vs. Data + Model Size, comparing Deep Speech, DL V1 for Speech, and Traditional ASR]

SLIDE 26

@ctnzr

End-to-end speech with DL

  • Deep neural network predicts characters directly from audio

[Figure: network emitting the character sequence “T H _ E … D O G”]

SLIDE 27

@ctnzr

Recurrent Network

  • RNNs model temporal dependence
  • Various flavors used in many applications
    – LSTM, GRU, Bidirectional, …
    – Especially sequential data (time series, text, etc.)
  • Sequential dependence complicates parallelism
  • Feedback complicates arithmetic (see the sketch below)
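A minimal sketch of the recurrence behind those last two bullets, using a plain (vanilla) RNN step $h_t = f(W x_t + U h_{t-1})$; the function name is illustrative:

```python
import numpy as np

def rnn_forward(xs, W, U, h0):
    # h_t = relu(W @ x_t + U @ h_{t-1})
    # Each step needs h_{t-1}, so timesteps are serialized (hurting
    # parallelism), and rounding error in each h_t feeds back into
    # every later step (complicating arithmetic).
    h, hs = h0, []
    for x in xs:
        h = np.maximum(W @ x + U @ h, 0.0)
        hs.append(h)
    return hs
```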
SLIDE 28

@ctnzr

Connectionist Temporal Classification (a cost function for end-to-end learning)

  • We compute this in log space (see the sketch below)
  • Probabilities are tiny
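CTC sums products of many tiny per-frame probabilities, which underflow any floating-point format; log space avoids this. A standard sketch of the two operations needed (not the Deep Speech implementation):

```python
import numpy as np

def log_mul(la, lb):
    # p * q in log space is addition
    return la + lb

def log_add(la, lb):
    # p + q in log space, via log-sum-exp to avoid underflow
    m = max(la, lb)
    return m + np.log(np.exp(la - m) + np.exp(lb - m))

# 1000 frames each with probability 0.1: the direct product is 1e-1000,
# which underflows even float64 to 0.0, but its log is perfectly ordinary.
logp = 1000 * np.log(0.1)          # ~= -2302.6
```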
SLIDE 29

@ctnzr

Training sets

  • Train on 45k hours (~5 years) of data
    – Still growing
  • Languages
    – English
    – Mandarin
  • End-to-end deep learning is key to assembling large datasets

SLIDE 30

@ctnzr

Performance for RNN training

  • 55% of GPU FMA peak using a single GPU
  • ~48% of peak using 8 GPUs in one node
  • This scalability is key to large models & large datasets

[Chart: TFLOP/s (1–512) vs. number of GPUs (1–128), one node and multi node, with a typical training run marked]

SLIDE 31

Bryan Catanzaro

Computer Arithmetic for training

  • Standard practice: FP32
  • But big efficiency gains from smaller arithmetic
  • e.g., NVIDIA GP100 has 21 Tflops of 16-bit FP, but 10.5 Tflops of 32-bit FP
  • Expect continued push to lower precision
  • Some people report success with very low precision training
    – Down to 1 bit!
    – Quite dependent on problem/dataset

SLIDE 32

@ctnzr

Training: Stochastic Gradient Descent

  • Simple algorithm (sketched below)
    – Add momentum to power through local minima
    – Compute gradient by backpropagation
  • Operates on minibatches
    – This makes it a GEMM problem instead of GEMV
  • Choose minibatches stochastically
    – Important to avoid memorizing the training order
  • Difficult to parallelize
    – Prefers lots of small steps
    – Increasing minibatch size is not always helpful

$w' = w - \frac{\gamma}{n} \sum_i \nabla_w Q(x_i, w)$
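A minimal sketch of the update with momentum, as referenced above; the helper name and the momentum coefficient 0.9 are illustrative, not from the talk:

```python
import numpy as np

def sgd_momentum_step(w, v, grad, gamma=1e-4, mu=0.9):
    # grad stands for the minibatch average (1/n) sum_i grad_w Q(x_i, w),
    # computed by backpropagation; v is the momentum velocity.
    v = mu * v - gamma * grad
    return w + v, v

rng = np.random.default_rng(0)
w, v = rng.standard_normal(10), np.zeros(10)
grad = rng.standard_normal(10)     # stand-in for a backprop result
w, v = sgd_momentum_step(w, v, grad)
```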

SLIDE 33

@ctnzr

Training: Learning rate

$w' = w - \frac{\gamma}{n} \sum_i \nabla_w Q(x_i, w)$

  • γ is very small (1e-4)
  • We learn by making many very small updates to the parameters
  • Terms in this equation are often very lopsided

→ a Computer Arithmetic Problem

SLIDE 34

@ctnzr

Cartoon optimization problem

[Erich Elsen]

$Q = -(w - 3)^2 + 3, \qquad \frac{\partial Q}{\partial w} = -2(w - 3), \qquad \gamma = 0.01$

SLIDE 35

@ctnzr

Cartoon Optimization Problem

[Erich Elsen]

[Plot: Q vs. w, with the gradient ∂Q/∂w and the scaled step γ ∂Q/∂w marked]

SLIDE 36

@ctnzr

Rounding is not our friend

[Plot: near the optimum, the step γ ∂Q/∂w falls below the resolution of FP16, so w + γ ∂Q/∂w rounds back to w]

[Erich Elsen]
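A small demo of that failure mode, assuming np.float16 as a stand-in for FP16 hardware and reusing the cartoon problem above:

```python
import numpy as np

# Gradient ascent on Q = -(w - 3)^2 + 3, done entirely in FP16
gamma = np.float16(0.01)
w = np.float16(2.99)                         # near the optimum at w = 3

grad = np.float16(-2) * (w - np.float16(3))  # dQ/dw ~= 0.02
step = gamma * grad                          # ~2e-4, below half the FP16
                                             # grid spacing (~1e-3) near 3
print(w + step == w)                         # True: the update rounds away
```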
SLIDE 37

@ctnzr

Solution 1: Stochastic Rounding [S. Gupta et al., 2015]

  • Round up or down with probability related to the distance to the neighboring grid points
  • Efficient to implement
    – Just need a bunch of random numbers
    – And an FMA instruction with round-to-nearest-even

[Erich Elsen]

With $x = 100,\; y = 0.01,\; \epsilon = 1$ (the grid spacing):

$x + y = \begin{cases} 100 & \text{w.p. } 0.99 \\ 101 & \text{w.p. } 0.01 \end{cases}$
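A sketch of the rule on an abstract grid of spacing ε; the `stochastic_round` helper is illustrative (a real implementation applies this when rounding FMA results to FP16):

```python
import numpy as np

def stochastic_round(x, eps, rng):
    # Round to the grid of spacing eps, rounding up with probability
    # proportional to the distance from the lower grid point.
    lo = np.floor(x / eps) * eps
    p_up = (x - lo) / eps
    return lo + eps * (rng.random() < p_up)

# x = 100, y = 0.01, eps = 1: each add yields 101 w.p. 0.01, so after
# 100 adds we expect ~101 instead of being stuck at 100.
rng = np.random.default_rng(0)
w = 100.0
for _ in range(100):
    w = stochastic_round(w + 0.01, eps=1.0, rng=rng)
```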

SLIDE 38

@ctnzr

Stochastic Rounding

  • After adding .01 to 100, 100 times
    – With round-to-nearest-even (r2ne) we will still have 100
    – With stochastic rounding we expect to have 101
  • Allows us to make optimization progress even when the updates are small

[Erich Elsen]

SLIDE 39

@ctnzr

Solution 2: High precision accumulation

  • Keep two copies of the weights
    – One in high precision (fp32)
    – One in low precision (fp16)
  • Accumulate updates into the high precision copy
  • Round the high precision copy to low precision and perform computations with it

[Erich Elsen]

SLIDE 40

@ctnzr

High precision accumulation

  • After adding .01 to 100, 100 times
    – We will have exactly 101 in the high precision weights, which will round to 101 in the low precision weights
  • Allows for accurate accumulation while maintaining the benefits of fp16 computation (see the sketch below)
  • Requires more weight storage, but weights are usually a small part of the memory footprint

[Erich Elsen]
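A minimal sketch of the two-copy scheme; the names are illustrative and the FP16 forward/backward pass is reduced to a constant stand-in update:

```python
import numpy as np

master = np.float32(100.0)   # high precision (fp32) copy of a weight
w16 = np.float16(master)     # low precision (fp16) copy used for compute

for _ in range(100):
    update = np.float32(0.01)   # stand-in for -gamma * gradient
    master = master + update    # accumulate in fp32: small steps survive
    w16 = np.float16(master)    # re-round for the next fp16 pass

print(w16)   # ~101.0, whereas pure fp16 accumulation would stay at 100.0
```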

SLIDE 41

@ctnzr

Deep Speech Training Results

[Erich Elsen]

[Chart: Deep Speech training curves with FP16 storage vs. FP32 math]

SLIDE 42

@ctnzr

Deployment

  • Once a model is trained, we need to deploy it
  • Technically a different problem
    – No more SGD
    – Just forward-propagation
  • Arithmetic can be even smaller for deployment
    – We currently use FP16
    – 8-bit fixed point can work with small accuracy loss
  • Need to choose scale factors for each layer (sketched below)
    – Higher precision accumulation very helpful
  • Although all of this is ad hoc
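As referenced above, a sketch of per-layer scale factors with higher-precision accumulation; symmetric int8 quantization is one common recipe, not necessarily the exact scheme used, and all names are illustrative:

```python
import numpy as np

def quantize_int8(x):
    # Per-tensor scale factor mapping the observed range onto [-127, 127];
    # deployments typically calibrate scales per layer on sample inputs.
    scale = np.max(np.abs(x)) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
W, x = rng.standard_normal((64, 32)), rng.standard_normal(32)
qW, sW = quantize_int8(W)
qx, sx = quantize_int8(x)
acc = qW.astype(np.int32) @ qx.astype(np.int32)  # int32 accumulation
y = acc * (sW * sx)                              # rescale back to float
```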
SLIDE 43

@ctnzr

Magnitude distributions

[M. Shoeybi]

[Histogram: frequency vs. log_2(magnitude) for the parameters, input, and output of Dense, Layer 1]

  • “Peaked” power law distributions
SLIDE 44

@ctnzr

Determinism

  • Determinism is very important
  • With so much randomness, it is hard to tell if you have a bug
  • Networks train despite bugs, although accuracy is impaired
  • Reproducibility is important
    – For the usual scientific reasons
    – Progress is not possible without reproducibility
  • We use synchronous SGD
SLIDE 45

@ctnzr

Conclusion

  • Deep Learning is solving many hard problems
  • Many interesting computer arithmetic issues in Deep Learning
  • The DL community could use your help understanding them!
    – Pick the right format
    – Mix formats
    – Better arithmetic hardware

SLIDE 46

@ctnzr

Thanks

  • Andrew Ng, Adam Coates, Awni Hannun, Patrick LeGresley, Erich Elsen, Greg Diamos, Chris Fougner, Mohammed Shoeybi … and all of SVAIL

Bryan Catanzaro @ctnzr