Training RNNs with 16-bit Floa5ng Point Erich Elsen - PowerPoint PPT Presentation

Training ¡RNNs ¡with ¡16-‑bit ¡ Floa5ng ¡Point ¡ Erich ¡Elsen ¡ Research ¡Scien5st ¡

Silicon ¡Valley ¡AI ¡Lab ¡@Baidu ¡ • State ¡of ¡the ¡Art ¡Speech ¡Recogni5on ¡ CTC Systems ¡ Fully Connected – Deep ¡Speech ¡2 ¡(hJp://arxiv.org/abs/ 1512.02595) ¡ • Easily ¡Adaptable ¡for ¡a ¡variety ¡of ¡ Recurrent languages ¡ or GRU (Bidirectional) • Trained ¡on ¡up ¡to ¡40,000 ¡hours ¡of ¡speech ¡ Batch Normalization – 4.5 ¡years! ¡ • Each ¡Model ¡requires ¡over ¡20 ¡Exa-‑Flops ¡to ¡ train ¡ 1D or 2D Invariant – Can ¡take ¡two ¡or ¡more ¡weeks ¡using ¡32 ¡GPUs ¡ Convolution • => ¡FP16 ¡is ¡one ¡tool ¡to ¡decrease ¡training ¡ Spectrogram 5me ¡ Erich ¡Elsen ¡

Neural ¡Network ¡Op5miza5on ¡ • A ¡network ¡consists ¡of ¡parameters, ¡x, ¡and ¡we ¡try ¡to ¡ minimize ¡the ¡cost ¡J, ¡of ¡the ¡network ¡on ¡some ¡data ¡set ¡ • Everybody ¡uses ¡some ¡variant ¡of ¡Batch ¡Stochas5c ¡ Gradient ¡Descent ¡(SGD) ¡ – Batch ¡= ¡use ¡mul5ple ¡examples ¡per ¡gradient ¡calcula5on ¡ ¡ x n + 1 = x n − α ∂ J ∂ x – α ¡oden ¡between ¡.01 ¡and ¡.0001 ¡ • The ¡ra5o ¡of ¡the ¡two ¡terms ¡on ¡the ¡right ¡is ¡very ¡ important ¡ Erich ¡Elsen ¡

Training ¡Recurrent ¡Neural ¡Networks ¡ (RNNs) ¡ • Parameters ¡are ¡W, ¡U ¡ • x, ¡h ¡are ¡the ¡network ¡ac5va5ons ¡ • Forward ¡pass ¡eqn: ¡ h t = σ ( Wx t + Uh t − 1 ) • Performance ¡of ¡GEMM ¡very ¡important ¡ Erich ¡Elsen ¡

Training ¡RNNs ¡ Memory ¡Usage ¡ • Must ¡save ¡x, ¡h ¡for ¡the ¡backward ¡pass! ¡ • Most ¡memory ¡is ¡used ¡to ¡store ¡ac5va5ons ¡ • Weights ¡< ¡10% ¡allocated ¡memory ¡ • Standard ¡is ¡to ¡use ¡32-‑bit ¡floa5ng ¡point ¡for ¡weights ¡and ¡ac5va5ons ¡ 2048 ¡ U ¡ h ¡ h ¡ h ¡ h ¡ h ¡ h ¡ h ¡ MB ¡= ¡32 ¡-‑ ¡256 ¡ Timesteps ¡= ¡50 ¡-‑ ¡800 ¡ Erich ¡Elsen ¡

FP16 ¡ • By ¡using ¡only ¡16-‑bits ¡per ¡number ¡we: ¡ – can ¡store ¡twice ¡as ¡many ¡numbers ¡ • Increase ¡mini-‑batch ¡size! ¡ – can ¡move ¡twice ¡as ¡many ¡numbers ¡around ¡in ¡the ¡same ¡ amount ¡of ¡5me ¡ • All ¡bandwidth ¡bound ¡opera5ons ¡improve ¡ – Hardware ¡arithme5c ¡units ¡take ¡up ¡less ¡and ¡are ¡faster ¡ • New ¡hardware ¡will ¡have ¡twice ¡as ¡many ¡FP16 ¡flops ¡as ¡FP32 ¡ • All ¡compute ¡bound ¡opera5ons ¡will ¡improve ¡ • No ¡Free ¡Lunch ¡– ¡Op5miza5on ¡becomes ¡more ¡ difficult ¡ Erich ¡Elsen ¡

FP32 ¡GEMM ¡Performance ¡ Erich ¡Elsen ¡

Pseudo-‑FP16 ¡GEMM ¡Performance ¡ Inputs/Outputs ¡FP16, ¡ • Internals ¡FP32 ¡ ¡ Use ¡Today ¡On ¡Maxwell: ¡ ¡ • CublasSgemmEx • Nervana GEMM 2-‑3x ¡Faster! ¡ Erich ¡Elsen ¡

True ¡FP16 ¡GEMM ¡Performance ¡ (es5mated) ¡ Inputs/Outputs/ Internals ¡FP16 ¡ ¡ Use ¡In ¡the ¡Future ¡ on ¡Pascal ¡ ¡ 4-‑6x ¡Faster! ¡ ¡ Op5miza5on ¡ problem ¡even ¡more ¡ challenging ¡ Erich ¡Elsen ¡

Number ¡Representa5on ¡ • Total ¡bits ¡= ¡1 ¡sign ¡bit ¡+ ¡bits ¡for ¡m ¡+ ¡bits ¡for ¡e ¡ N = ± m ∗ 2 e • Intui5on ¡– ¡2^m ¡values ¡between ¡each ¡power ¡of ¡2 ¡ • Imagine ¡2 ¡bits ¡for ¡m ¡and ¡2 ¡bits ¡for ¡e ¡ Erich ¡Elsen ¡

FP16 ¡vs ¡FP32 ¡ FP16 ¡ FP32 ¡ Max. ¡Value ¡ ~61,000 ¡ ~1e38 ¡ Grid ¡points ¡between ¡each ¡ 2048 ¡ ~16,700,000 ¡ power ¡of ¡2 ¡ Smallest ¡number ¡you ¡can ¡ add ¡to ¡one ¡and ¡get ¡a ¡ ~.000489 ¡ ~.00000006 ¡ different ¡number ¡ (ULP ¡rela5ve ¡to ¡1) ¡ Erich ¡Elsen ¡

Rounding ¡ • Generally ¡accepted ¡prac5ce ¡(implemented ¡in ¡ hardware) ¡is ¡round ¡to ¡nearest ¡even ¡(r2ne) ¡ • Go ¡to ¡the ¡nearest ¡point ¡and ¡if ¡you’re ¡exactly ¡halfway ¡ go ¡to ¡the ¡nearest ¡even ¡number ¡(in ¡binary) ¡ Erich ¡Elsen ¡

Summa5on ¡and ¡Rounding ¡ • x ¡updated ¡by ¡adding ¡a ¡sequence ¡of ¡rela5vely ¡ small ¡numbers ¡ • if ¡the ¡updates ¡are ¡too ¡small, ¡we ¡will ¡never ¡ make ¡any ¡progress ¡with ¡round ¡to ¡nearest ¡even ¡ x n + 1 = x n − α ∂ J ∂ x = x n Erich ¡Elsen ¡

Saddle ¡Points! ¡ • Local ¡minimum ¡not ¡a ¡problem ¡ • Saddle ¡points ¡are ¡what ¡make ¡op5miza5on ¡ hard ¡ – Flat ¡ – Small ¡Deriva5ve ¡ Image ¡from ¡wikipedia ¡ Erich ¡Elsen ¡

In ¡1-‑D ¡ J = − ( x − 3) 2 + 3 ∂ J ∂ x = − 2( x − 3) α ¡ ¡= ¡.01 ¡ Erich ¡Elsen ¡

1-‑D ¡Op5miza5on ¡Problem ¡ Erich ¡Elsen ¡

1-‑D ¡ra5os ¡of ¡updates ¡to ¡x ¡ Erich ¡Elsen ¡

Solu5on ¡1 ¡ Stochas5c ¡Rounding ¡ • Round ¡up ¡or ¡down ¡with ¡probability ¡related ¡to ¡ the ¡distance ¡to ¡the ¡neighboring ¡grid ¡points ¡ • Example ¡– ¡if ¡the ¡closest ¡grid ¡points ¡are ¡100 ¡ and ¡101 ¡and ¡the ¡value ¡is ¡100.01 ¡ – We ¡round ¡up ¡1% ¡of ¡the ¡5me ¡ – Round ¡down ¡99% ¡of ¡the ¡5me ¡ Erich ¡Elsen ¡

Stochas5c ¡Rounding ¡ • Ader ¡adding ¡.01, ¡100 ¡5mes ¡to ¡100 ¡ – With ¡r2ne ¡we ¡will ¡s5ll ¡have ¡100 ¡ – With ¡stochas5c ¡rounding ¡we ¡will ¡ expect ¡ to ¡have ¡ 101 ¡ • Allows ¡us ¡to ¡make ¡op5miza5on ¡progress ¡even ¡ when ¡the ¡updates ¡are ¡small ¡ Erich ¡Elsen ¡

Solu5on ¡2 ¡ High ¡precision ¡accumula5on ¡ • Keep ¡two ¡copies ¡of ¡the ¡weights ¡ – One ¡in ¡high ¡precision ¡(fp32) ¡ – One ¡in ¡low ¡precision ¡(fp16) ¡ • Accumulate ¡updates ¡to ¡the ¡high ¡precision ¡ copy ¡ • Round ¡the ¡high ¡precision ¡copy ¡to ¡low ¡ precision ¡and ¡perform ¡computa5ons ¡ Erich ¡Elsen ¡

High ¡precision ¡accumula5on ¡ • Ader ¡adding ¡.01, ¡100 ¡5mes ¡to ¡100 ¡ – We ¡will ¡have ¡exactly ¡101 ¡in ¡the ¡high ¡precision ¡ weights, ¡which ¡will ¡round ¡to ¡101 ¡in ¡the ¡low ¡ precision ¡weights ¡ • Allows ¡for ¡accurate ¡accumula5on ¡while ¡ maintaining ¡the ¡benefits ¡of ¡fp16 ¡computa5on ¡ • Requires ¡more ¡weight ¡storage, ¡but ¡weights ¡ are ¡usually ¡a ¡small ¡part ¡of ¡the ¡memory ¡ footprint ¡ Erich ¡Elsen ¡

Batch ¡Normaliza5on ¡Helps ¡ Erich ¡Elsen ¡

Results ¡ Erich ¡Elsen ¡

Conclusion ¡ • Half ¡precision ¡enables ¡bigger, ¡deeper ¡ networks ¡ • Half ¡precision ¡enables ¡faster ¡training ¡and ¡ evalua5on ¡of ¡networks ¡ • Half ¡precision ¡enables ¡beJer ¡scaling ¡to ¡ mul5ple ¡GPUs ¡ • Training ¡can ¡be ¡tricky, ¡these ¡techniques ¡can ¡ help ¡ Erich ¡Elsen ¡

Training RNNs with 16-bit Floa5ng Point Erich Elsen - PowerPoint PPT Presentation

Training RNNs with 16-bit Floa5ng Point Erich Elsen Research Scien5st Silicon Valley AI Lab @Baidu State of the Art Speech Recogni5on CTC

Recursive Neural Networks and Its Applications LU Yangyang luyy11@sei.pku.edu.cn KERE Seminar

MIXTURE DENSITY NETWORKS MIXTURE DENSITY NETWORKS Charles Martin SO FAR; RNNS THAT MODEL

Listing Bit Strings List all bit strings of length 3. Listing Bit Strings List all bit strings

CS6501: Deep Learning for Visual Recognition Recurrent Neural Networks (RNNs) Todays Class

Introduction to CNNs and RNNs with PyTorch Introduction to CNNs and RNNs with PyTorch Presented

CS6501: Deep Learning for Visual Recognition Recurrent Neural Networks (RNNs) Todays Class

Lecture 13 : Lecture 13 : Special Bit Instructions Todays Goals L Learn bit-set and

https://bit.ly/3pptcRS 3 4 https://bit.ly/2UiBgWq Vase Face Face https://bit.ly/3luge2Q

Bit Basics Eric McCreath Bit Basics A bit (Binary digIT) is single unit of binary storage. A bit

Distributed Optimization of CNNs and RNNs GTC 2015 William Chan williamchan.ca

Persistent RNNs (stashing recurrent weights on-chip) Gregory Diamos Baidu SVAIL April 7, 2016

Introduction to RNNs Arun Mallya Best viewed with Computer Modern fonts installed

Recurrent Neural Networks (RNNs) for NLP MACHINE LEARNING MEETUP DR. ANA PELETEIRO RAMALLO

The MIPS instruction set architecture The MIPS has a 32 bit architecture, with 32 bit

Bit Basics A bit (Binary digIT) is single unit of binary storage. A bit is normally group with

Fixed Point Real Numbers 16-bit Unsigned with Binary Point: 8 XXXXXXXX.XXXXXXXX

^v' f it i ; {l,r ;" ^' - ru!; ;: 'L) (,h aLP Hr# 'il';Y o.r,r,t n to h, fl0 ,ff

With Splitting Steepest Descent Splitting yields adaptive net structure optimization Questions

Lecture 19 Additional Material: Optimization CS109A Introduction to Data Science Pavlos

Lecture 6: Optimization CS109B Data Science 2 Pavlos Protopapas and Mark Glickman Outline

Asymptotics of work distributions in driven Langevin systems Andreas Engel, Sascha von

Modeling Prey-Predator Populations Alison Pool and Lydia Silva December 13, 2006 Alison Pool and

Toward a Polanyi Rule Polanyi Rule Picture of Nuclear Picture of Nuclear Toward

Direct image-analysis methods for surgical simulation and mixed meshfree methods Jack S. Hale

Training RNNs with 16-bit Floa5ng Point Erich Elsen - PowerPoint PPT Presentation

Training RNNs with 16-bit Floa5ng Point Erich Elsen Research Scien5st Silicon Valley AI Lab @Baidu State of the Art Speech Recogni5on CTC

Recursive Neural Networks and Its Applications LU Yangyang luyy11@sei.pku.edu.cn KERE Seminar

MIXTURE DENSITY NETWORKS MIXTURE DENSITY NETWORKS Charles Martin SO FAR; RNNS THAT MODEL

Listing Bit Strings List all bit strings of length 3. Listing Bit Strings List all bit strings

CS6501: Deep Learning for Visual Recognition Recurrent Neural Networks (RNNs) Todays Class

Introduction to CNNs and RNNs with PyTorch Introduction to CNNs and RNNs with PyTorch Presented

CS6501: Deep Learning for Visual Recognition Recurrent Neural Networks (RNNs) Todays Class

Lecture 13 : Lecture 13 : Special Bit Instructions Todays Goals L Learn bit-set and

https://bit.ly/3pptcRS 3 4 https://bit.ly/2UiBgWq Vase Face Face https://bit.ly/3luge2Q

Bit Basics Eric McCreath Bit Basics A bit (Binary digIT) is single unit of binary storage. A bit

Distributed Optimization of CNNs and RNNs GTC 2015 William Chan williamchan.ca

Persistent RNNs (stashing recurrent weights on-chip) Gregory Diamos Baidu SVAIL April 7, 2016

Introduction to RNNs Arun Mallya Best viewed with Computer Modern fonts installed

Recurrent Neural Networks (RNNs) for NLP MACHINE LEARNING MEETUP DR. ANA PELETEIRO RAMALLO

The MIPS instruction set architecture The MIPS has a 32 bit architecture, with 32 bit

Bit Basics A bit (Binary digIT) is single unit of binary storage. A bit is normally group with

Fixed Point Real Numbers 16-bit Unsigned with Binary Point: 8 XXXXXXXX.XXXXXXXX

^v' f it i ; {l,r ;&quot; ^' - ru!; ;: 'L) (,h aLP Hr# 'il';Y o.r,r,t n to h, fl0 ,ff

With Splitting Steepest Descent Splitting yields adaptive net structure optimization Questions

Lecture 19 Additional Material: Optimization CS109A Introduction to Data Science Pavlos

Lecture 6: Optimization CS109B Data Science 2 Pavlos Protopapas and Mark Glickman Outline

Asymptotics of work distributions in driven Langevin systems Andreas Engel, Sascha von

Modeling Prey-Predator Populations Alison Pool and Lydia Silva December 13, 2006 Alison Pool and

Toward a Polanyi Rule Polanyi Rule Picture of Nuclear Picture of Nuclear Toward

Direct image-analysis methods for surgical simulation and mixed meshfree methods Jack S. Hale

^v' f it i ; {l,r ;" ^' - ru!; ;: 'L) (,h aLP Hr# 'il';Y o.r,r,t n to h, fl0 ,ff