SLIDE 1

Mixed Precision Training

Computing Platform Division, PAI Team

SLIDE 2

Overview

  • What is mixed-precision & Why mixed-precision
  • How mixed-precision
  • Mixed-precision tools on PAI-TensorFlow
  • Experimental results

SLIDE 3

What is mixed-precision

  • Mixed-precision
    • FP32 and FP16
    • More precision formats in the future
  • Tensor Core
    • Matrix-multiply-and-accumulate units
    • FP16 storage/inputs
    • FP32/FP16 accumulator
    • Used by ops such as Conv and MatMul

SLIDE 4

Why mixed-precision

  • Two key points that matter in training/inference:
    • Computation
      • Tensor Core gives 8X higher throughput in MP than in FP32 (FP32: 15 TFLOPS vs. Tensor Core MP: 120 TFLOPS)
    • Memory access
      • Inputs are FP16
      • Memory access is reduced by 2X

SLIDE 5

Overview

  • What is mixed-precision & Why mixed-precision
  • How mixed-precision
  • Mixed-precision tools on PAI-TensorFlow
  • Experimental results

SLIDE 6

How mixed-precision

  • Key strategies in mixed-precision training
    • Issues when using FP16 for training, and their solutions
      • Fewer bits in the fraction → precision gap in sums
      • Fewer bits in the exponent → gradient underflow
    • Arithmetic precision design
      • Considering both efficiency and performance

SLIDE 7

Issues using FP16 for training

  • Fewer bits in the fraction: precision gap in sums
    • In A + B, if A/B > 2^10, B degrades to zero
    • For FP32, the ratio can be up to 2^23
    • Common in the weight update: W ← W + lr*dW
  • Fewer bits in the exponent: gradient underflow
    • Gradients smaller than 2^-24 become zero

[Figure: FP16 vs. FP32 floating-point formats]
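As a quick illustration of this precision gap (a minimal NumPy example added here, not from the slides):

```python
import numpy as np

# FP16 has 10 fraction bits, so once A/B is large enough the smaller addend
# is lost to rounding: 2048 + 1 stays 2048 in half precision.
print(np.float16(2048.0) + np.float16(1.0))   # 2048.0 -> B "degrades to zero"

# FP32 has 23 fraction bits, so the same sum keeps B's contribution.
print(np.float32(2048.0) + np.float32(1.0))   # 2049.0
```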

SLIDE 8

Precision gap in sum

  • Variables vs. gradients
    • Weight update: W ← W + lr*dW (lr normally in [10^-4, 10^-1])
  • Solution: variables stored in FP32, and optimizer computation in FP32

Variables: 2^-16 to 2^-4; gradients: 2^-30 to 2^-5

  • Fig. Histograms of variables and gradients in Faster R-CNN

SLIDE 9

Gradients underflow in FP16

  • Gradients of variables

  • Fig. Histograms of variable gradients in Faster R-CNN, trained in mixed precision (FP16) and in FP32 respectively

SLIDE 10

Gradients underflow in FP16

  • Gradients of activations

  • Fig. Histograms of activation gradients in Faster R-CNN, trained in mixed precision (FP16) and in FP32 respectively

SLIDE 11

Gradients underflow in FP16

Solution: shift the gradients into the representable range using loss scaling

SLIDE 12

Gradients underflow in FP16

  • Constant loss scaling
    • Scale the loss by a factor S
    • Backprop to compute dW
    • Unscale dW by 1/S
  • Automatic loss scaling (see the sketch after this list)
    • Start with a large scaling factor S
    • For each training iteration:
      • Scale the loss by S
      • Backprop to compute dW
      • Unscale dW by 1/S
      • If dW contains Inf/NaN, skip the update and decrease S by a step factor (S ← S/step)
      • Otherwise, apply dW to W
      • If there has been no Inf/NaN for N updates, increase S by a step factor (S ← S*step)
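A minimal sketch of this automatic loss-scaling loop in plain Python/NumPy (the helper names compute_grads/apply_update and the default values for the step factor and patience are illustrative, not from the slides):

```python
import numpy as np

def auto_loss_scale_train(compute_grads, apply_update, num_iters,
                          init_scale=2.0 ** 15, step_factor=2.0, patience=2000):
    """Sketch of automatic loss scaling; helper names and defaults are illustrative."""
    scale = init_scale                      # start with a large scaling factor S
    good_steps = 0                          # consecutive updates without Inf/NaN
    for _ in range(num_iters):
        # Backprop on loss * S, so the gradients come back multiplied by S.
        grads = compute_grads(loss_scale=scale)
        # Unscale dW by 1/S before the optimizer sees it.
        grads = [g / scale for g in grads]
        if not all(np.all(np.isfinite(g)) for g in grads):
            scale /= step_factor            # Inf/NaN: skip the update, decrease S
            good_steps = 0
        else:
            apply_update(grads)             # normal weight update (in FP32)
            good_steps += 1
            if good_steps >= patience:      # no Inf/NaN for N updates
                scale *= step_factor        # increase S again
                good_steps = 0
    return scale
```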
SLIDE 13

How mixed-precision

  • Key strategies in mixed-precision training
    • Issues when using FP16 for training, and their solutions
      • Fewer bits in the fraction → precision gap in sums
        • Solution: variables stored in FP32, and optimizer computation in FP32
      • Fewer bits in the exponent → gradient underflow
        • Solution: loss scaling
    • Arithmetic precision design
      • Considering both efficiency and performance

SLIDE 15

Arithmetic precision design

  • Arithmetic can be categorized into:

1. Compute-bound
  • Convolution, MatMul
  • Take advantage of Tensor Core:
    • Inputs: FP16
    • Accumulator: FP32
    • Outputs: FP32

2. Memory-bound
  ① Reductions
    • Batch-norm / layer-norm / group-norm
    • Softmax / average pooling
    • Inputs/outputs in FP16, computation in FP32
  ② Element-wise operations
    • Add/mul, etc.
    • Inputs/outputs in FP16, computation in FP16 or FP32

SLIDE 16

Arithmetic precision design

  • Compute-bound operations (see the casting sketch below):
    • Inputs in FP16
    • Computation using Tensor Core
    • Outputs in FP32
  • Memory-bound operations:
    • Inputs/outputs in FP16
    • Computation in FP32
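A minimal TensorFlow 1.x sketch of this casting pattern (the function names are illustrative; note that stock tf.matmul keeps its input dtype, so the FP32 output of a compute-bound op is modeled with an explicit cast):

```python
import tensorflow as tf

def matmul_mp(x_fp32, w_fp32):
    """Compute-bound op: FP16 inputs, Tensor Core math, FP32 result."""
    y16 = tf.matmul(tf.cast(x_fp32, tf.float16),
                    tf.cast(w_fp32, tf.float16))   # eligible for Tensor Cores
    return tf.cast(y16, tf.float32)                # hand the result back in FP32

def softmax_mp(x_fp16):
    """Memory-bound reduction: FP16 in/out, FP32 internal computation."""
    y32 = tf.nn.softmax(tf.cast(x_fp16, tf.float32))
    return tf.cast(y32, tf.float16)
```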

SLIDE 17

How mixed-precision training

Computation (forward and backward) → can be in MP; optimizer-related part → should be in FP32

SLIDE 18

MP training (var in FP32):

  • Convert the computation part to MP (see the sketch below)
  • Keep the optimizer part in FP32

Computation in MP; optimizer-related part in FP32
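A minimal TensorFlow 1.x sketch of this split (shapes, layer, and optimizer settings are illustrative): the variable lives in FP32, the heavy computation runs in FP16, and the optimizer updates the FP32 master copy.

```python
import tensorflow as tf

x = tf.placeholder(tf.float16, [None, 1024])             # activations flow in FP16
labels = tf.placeholder(tf.float32, [None, 10])

w = tf.get_variable("w", [1024, 10], dtype=tf.float32)   # FP32 master variable
logits = tf.cast(tf.matmul(x, tf.cast(w, tf.float16)),   # computation part in MP
                 tf.float32)

loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits_v2(labels=labels, logits=logits))

# Optimizer-related part stays in FP32: the gradient w.r.t. w arrives in FP32
# (the cast's gradient casts back), and the update touches the FP32 master copy.
opt = tf.train.MomentumOptimizer(learning_rate=0.1, momentum=0.9)
train_op = opt.minimize(loss)
```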

SLIDE 19

MP training (var in FP32):

  • Loss Scaling strategy (constant scaling)

SLIDE 20

MP training (var in FP32):

  • Auto Loss Scaling strategy

SLIDE 24

Overview

  • What is mixed-precision & Why mixed-precision
  • How mixed-precision
  • Mixed-precision tools on PAI-TensorFlow
  • Experimental results

SLIDE 25

MP training tools on PAI-TF

  • Graph optimization + loss-scaling training strategy
  • Graph optimization: AutoMixedPrecision graph-optimization pass
  • MP training strategy: MP optimizer wrapper
    • Wraps the standard optimizers to automatically adopt the constant/automatic loss-scaling strategy (usage sketch below)
    • opt = tf.contrib.mixed_precision.MixedPrecisionOptimizer(opt)
    • Both constant and automatic loss scaling supported

[Diagram: FP32 graph_def → MP graph_def via automatic conversion; standard optimizer → mixed-precision optimizer]
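A usage sketch of the wrapper on PAI-TF (only the wrapper call itself is taken from the slide; the toy loss and optimizer settings are illustrative, and any extra constructor arguments for choosing constant vs. automatic scaling would follow the PAI-TF documentation):

```python
import tensorflow as tf

# Stand-in model loss; in practice this is the loss of the MP-converted graph.
w = tf.get_variable("w", [10], dtype=tf.float32)
loss = tf.reduce_sum(tf.square(w))

opt = tf.train.MomentumOptimizer(learning_rate=0.1, momentum=0.9)
# Wrap the standard optimizer so it applies constant/automatic loss scaling.
opt = tf.contrib.mixed_precision.MixedPrecisionOptimizer(opt)
train_op = opt.minimize(loss)   # used like any standard TF optimizer
```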

SLIDE 26

Experimental results

  • ResNet50 on ImageNet

SLIDE 27

Experimental results

  • Faster R-CNN (VGG backbone) on PASCAL VOC 07

SLIDE 28

Experimental results

  • SSD (VGG backbone) on PASCAL VOC 07+12

SLIDE 29

Experimental results

  • Small NMT on WMT German-English
  • Encoder: 2 layers
  • Decoder: 2 layers with attention

SLIDE 30

PGAN

  • PGAN (Progressive Growing of GANs)

Karras, Tero, et al. "Progressive Growing of GANs for Improved Quality, Stability, and Variation."

SLIDE 31

PGAN

  • G loss

Karras, Tero, et al. "Progressive Growing of GANs for Improved Quality, Stability, and Variation."

SLIDE 32

PGAN

  • Generation results

(CIFAR-10 dataset)

[Images: generated samples under fp32, mp-no-scaling, and mp-auto-scaling]

Exp.                 fp32     mp-auto-scaling   mp-no-scaling
sliced_wasserstein   9.3764   9.1662            7.9601

SLIDE 33

Font Generation

Pyramid Embedded Generative Adversarial Network for Automated Font Generation

SLIDE 34

Font Generation

  • G loss

Pyramid Embedded Generative Adversarial Network for Automated Font Generation

SLIDE 35

Font Generation

  • Generation results (金陵刻经体, the "Jinling Sutra Carving" typeface)

[Images: generated glyphs under fp32, mp-no-scaling, and mp-auto-scaling]

SLIDE 36

Wide & Deep Learning

Wide & Deep Learning for Recommender Systems

  • Predict the probability that an individual has an annual income of over 50,000 dollars

SLIDE 37

Wide & Deep Learning

  • Loss

Exp        fp32     mp-no-scaling
Accuracy   84.31%   84.27%

Wide & Deep Learning for Recommender Systems

SLIDE 38

Further exploration: small inputs (for normalization layers)

  • Underflow in FP16 gradients
  • Design the model to be more adaptive to the FP16 representation
    • Move the gradients themselves into the FP16 representable range
    • Especially the activation gradients
  • Batch normalization

SLIDE 39

Small inputs

  • Derivatives of the BN layer
  • Smaller inputs and bigger derivatives
    • Reduce the magnitude of the inputs
    • This reduces the magnitude of the forward activations, so as to reduce overflow in forward propagation when using FP16
    • It also increases the magnitude of the derivatives
  • Tip for networks with BN (see the sketch below):
    • Normalize the layer to have a std of 1/S rather than 1.0
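A minimal sketch of this tip in TensorFlow 1.x (the value of S and the choice of a fixed, non-learned scale are illustrative; the slides only state the target std of 1/S):

```python
import tensorflow as tf

def bn_small_std(x, training, S=8.0):
    """Batch norm whose output has std 1/S instead of 1.0."""
    # Plain batch norm (no learned scale), giving outputs with std ~= 1.0 ...
    y = tf.layers.batch_normalization(x, scale=False, training=training)
    # ... then shrink to std 1/S: smaller forward activations (less FP16
    # overflow) and, per the slides, larger activation gradients in backprop.
    return y * (1.0 / S)
```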

SLIDE 40

Small inputs

  • ResNet32+CIFAR10
  • Activations and the gradients

[Figure: histograms of activations and activation gradients]

SLIDE 41

Small inputs

  • ResNet32+CIFAR10
  • All without loss scaling

SLIDE 42

Small inputs

  • SSD on PASCAL VOC 07+12
  • Activations and the gradients

[Figure: histograms of activations and gradients]

SLIDE 43

Small inputs

  • SSD on PASCAL VOC 07+12
  • Activations and the gradients

SLIDE 44

Conclusion

  • Mixed-precision tools are now supported on PAI-TensorFlow
  • More effort is still being put into exploring mixed precision further
    • More precision formats supported
    • More training strategies
SLIDE 45

Thank you