Duncan Riach, GTC 2019
Determinism in Deep Learning (S9911)
1
RANDOMNESS
Pseudo-random number generation
Random mini-batching
Stochastic gradient descent
Data augmentation
Regularization / generalization
2
DETERMINISM
Elimination of truly random effects
Bit-exact reproducibility from run to run
Same model weights
Same inference results
Same graph generated
3
Reasonably high performance
No changes to models
4
number of GPUs
GPU architecture
driver version
CUDA version
cuDNN version
framework version
distribution setup
5
In safety-critical applications
Hold all independent variables constant
Reproduce a failure in a long run
Refactor without introducing bugs
6
7
[Figure: distribution of model accuracy across runs: probability of a given accuracy for the reference model and after a change, with the "correct" accuracy range marked.]
“TensorFlow is inherently non-deterministic.”
“GPUs are inherently non-deterministic.”
“This problem can’t be solved.”
“Nobody cares about this.”
“Non-determinism is required for high performance.”
“It’s easy. Just set the seeds.”
8
random seeds
tf.reduce_sum / tf.reduce_mean
broadcast addition (for adding bias)
TensorFlow autotune
gate_gradients
TensorRT
asynchronous reductions
GEMM split between thread-blocks
Eigen kernels
max-pooling
distributed gradient update
multi-threading in the data loader
image and video decoding
data augmentation
CPU compute
CUDA atomicAdd()
9
bit.ly/two-sigma-determinism
tf.reduce_sum()
add bias using tf.add()
10
input = tf.constant([[1, 2, 3], [4, 5, 6]])
a = tf.reshape(input, [1, -1])   # [[1, 2, 3, 4, 5, 6]]
b = tf.ones_like(a)              # [[1, 1, 1, 1, 1, 1]]
deterministic_sum = tf.matmul(
    a, b, transpose_b=True)      # [[21]]
11
Append a column of ones to the batch of layer inputs and a row of biases to the weight matrix, so a single matmul produces the layer outputs:

[ i00 i01 1 ]   [ w00 w01 w02 w03 ]
[ i10 i11 1 ] x [ w10 w11 w12 w13 ]
[ i20 i21 1 ]   [ b0  b1  b2  b3  ]

deterministic_mm_with_bias = tf.matmul(concat_1(i), concat(w, b))
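A minimal runnable sketch of this trick (TF 1.x; the shapes are illustrative, and the concat helpers are written out since concat_1 and concat are only named on the slide):

import tensorflow as tf  # TF 1.x

i = tf.constant([[1., 2.], [3., 4.], [5., 6.]])         # [batch, in]
w = tf.constant([[.1, .2, .3, .4], [.5, .6, .7, .8]])   # [in, out]
b = tf.constant([.01, .02, .03, .04])                   # [out]

# Append a ones column to the inputs and a bias row to the weights,
# so one (deterministic) matmul computes i @ w + b.
ones = tf.ones([tf.shape(i)[0], 1])
i_aug = tf.concat([i, ones], axis=1)                    # [batch, in + 1]
w_aug = tf.concat([w, tf.reshape(b, [1, -1])], axis=0)  # [in + 1, out]
deterministic_mm_with_bias = tf.matmul(i_aug, w_aug)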
12
tf.reduce_sum() is deterministic
tf.add() is deterministic
13
Project MagLev: at-scale machine-learning platform
2D object detection model for autonomous vehicles
Production scale:
○ Millions of trainable variables
○ Millions of training examples
14
15
Determine what is working
Determine precisely what is not working
Generate hypotheses
Test hypotheses using divide and conquer
16
[Figure: training-graph schematic: a data loader feeds examples from the example store into the model (forward ops and variables), producing a prediction; the loss function compares prediction with target, and back-prop (backward ops) produces weight gradients (wgrad) and data gradients (dgrad).]
17
Insert probe ops at various places in the graph
Train the model twice
Identify the location and step of non-determinism injection
18
from tensorflow_determinism import probe  # package: tensorflow-determinism

tensorflow_op_output = probe.monitor(
    tensorflow_op_output, "name_for_place_in_graph")
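For example, a probe might be dropped onto the output of a convolution (a hypothetical placement; the tensor shapes and names here are illustrative):

import tensorflow as tf  # TF 1.x

x = tf.random_normal([8, 32, 32, 3])                     # NHWC input
filters = tf.Variable(tf.random_normal([3, 3, 3, 16]))
conv_out = tf.nn.conv2d(x, filters, strides=[1, 1, 1, 1], padding='SAME')
conv_out = probe.monitor(conv_out, "conv1_output")       # monitored copy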
19
Inserts back-propagatable monitor ops for:
○ … (zero-dimensional tensor)
20
Some of the other types of monitors:
○ For monitoring the output of a Keras layer
○ Place between compute_gradients() and apply_gradients() (see the sketch after this list)
○ Use before training, after each step, or at the end of training
Also monitoring tools for tf.estimator and tf.keras, gradients, and trainable variables
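A hypothetical sketch of that gradient placement (the monitor API is assumed from the earlier probe example):

# Assumes `optimizer` and `loss` exist, and that no gradient is None.
grads_and_vars = optimizer.compute_gradients(loss)
grads_and_vars = [(probe.monitor(g, v.op.name + "_grad"), v)
                  for g, v in grads_and_vars]
train_op = optimizer.apply_gradients(grads_and_vars)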
21
22
[Figure: weight gradients, ∆loss/∆weights (per output channel), are produced by a reduction.]
23
[Figure: the filter (C_input × H × W, per output channel) against the N × H × W convolution output; the weight gradient reduces ∆loss/∆convolution_output (per output channel).]
24
[Figure: input (C_input × H × W, per batch index) and output (C_output × H × W, per batch index); input gradients, ∆loss/∆input (per batch index), are formed from ∆loss/∆output (per batch index) by a final reduction of partial gradients from each thread-block.]
25
26
[Figure: the atomics unit sits next to memory. Warp A issues atomicAdd(0x10000, 5.0) while Warp B issues atomicAdd(0x10000, 4.0); starting from 1.0 at 0x10000, one ordering produces 1.0 → 6.0 → 10.0, the other 1.0 → 5.0 → 10.0.]
27
Serializes operations without stalling parallel threads
Assures atomic read-modify-write of memory
○ i.e. avoids race conditions
Very easy to program
No need to synchronize between thread-blocks
Very fast read-modify-write loop near memory/cache
A + B == B + A, usually
(A + B) + C != (B + C) + A, usually
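The non-associativity is easy to demonstrate (a plain-Python illustration, not from the slides):

a, b, c = 0.1, 0.2, 0.3
print((a + b) + c)                 # 0.6000000000000001
print(a + (b + c))                 # 0.6
print((a + b) + c == a + (b + c))  # False: order changes the result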
28
CUDA atomicAdd()
TensorFlow cuDNN auto-tuning
TF_CUDNN_DETERMINISTIC to disable auto-tuning and select deterministic cuDNN convolution algorithms
Added to TensorFlow master branch: bit.ly/tf-pr-24747

$ export TF_CUDNN_DETERMINISTIC=true
$ python tf_training_script.py

#!/usr/bin/python
import os
os.environ['TF_CUDNN_DETERMINISTIC'] = 'true'
import tensorflow as tf
# build a graph
29
[Figure: the bias gradient, ∆loss/∆bias (a single value per output channel), is produced by a reduction.]
30
tensorflow.python.ops.nn.bias_add() uses CUDA atomicAdd()
[Figure: a bias value (per output channel) is broadcast-added across the N × H × W bias-add output; the bias gradient reduces ∆loss/∆bias_add_output (per output channel).]
Dynamically patch tensorflow.python.ops.nn.bias_add()
Use deterministic ops, including implicit broadcasting

from tensorflow_determinism import patch
patch.bias_add()
31
# Deterministic broadcast add in place of the atomicAdd-based kernel:
if data_format == 'NCHW':
    value = tf.math.add(value, tf.reshape(bias, (1, tf.size(bias), 1, 1)))
elif data_format == 'NHWC' or data_format is None:
    value = tf.math.add(value, bias)
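A rough sketch of how such a dynamic patch can be wired up (an assumption for illustration; the actual patch in tensorflow-determinism may differ):

import tensorflow as tf
from tensorflow.python.ops import nn

def _deterministic_bias_add(value, bias, data_format=None, name=None):
    # Replace the atomicAdd-based kernel with a deterministic broadcast add.
    if data_format == 'NCHW':
        return tf.math.add(
            value, tf.reshape(bias, (1, tf.size(bias), 1, 1)), name=name)
    return tf.math.add(value, bias, name=name)

nn.bias_add = _deterministic_bias_add  # monkey-patch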
tf.nn.fused_batch_norm() back-prop
○ Approximately every 10 steps
○ Temporary solution: run on CPU
gate_gradients=tf.train.Optimizer.GATE_OP (default)
○ Approximately every 100 steps
○ GATE_GRAPH is guaranteed to be deterministic (see the sketch below)
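A minimal sketch of selecting GATE_GRAPH (TF 1.x; the optimizer and loss here are placeholders for illustration):

import tensorflow as tf  # TF 1.x

w = tf.Variable(1.0)
loss = tf.square(w)  # illustrative loss

optimizer = tf.train.MomentumOptimizer(learning_rate=0.01, momentum=0.9)
train_op = optimizer.minimize(
    loss, gate_gradients=tf.train.Optimizer.GATE_GRAPH)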
32
Every few thousand steps, at random locations
Changed from Pascal to Volta card => non-determinism persisted
Added ability to dump and compare probed tensors between runs
Suspected memory allocation and ownership (time / location)
Ran on cluster => fully deterministic
Updated my driver => fully deterministic locally
Possible causes: off-by-one memory allocation, incorrect cache invalidation, race conditions, clock speed, interface trims
batch-norm and gate_gradients fixes not required
33
Deterministic cuDNN convolution fixes upstreamed to TensorFlow master branch
34
Autonomous-vehicle production model training fully deterministically and correctly on millions of examples
TensorFlow determinism debugging tool developed
35
[Chart: training performance, deterministic vs. non-deterministic: a 6% decrease with the unoptimized bias-add solution.]
Based on single-GPU determinism recipe
Two GPUs: deterministic out-of-the-box
More than two GPUs: non-deterministic
Horovod uses NCCL2 ring-allreduce
36
37
Patarasuk, P. & Yuan, X. (2007). Bandwidth Efficient All-reduce Operation on Tree Topologies. IPDPS 2007, pp. 1-8. doi:10.1109/IPDPS.2007.370405
[Figure: ring all-reduce over GPUs 1-4 exchanging chunks A-D across steps 1-3 (after Patarasuk & Yuan, 2007).]
38
Batch-reduce partial gradient tensors as they become ready
Order of reduction changes on each training step (apparently)
For now: disable Tensor Fusion

$ HOROVOD_FUSION_THRESHOLD=0 python train.py
39
[Figure: Horovod Tensor Fusion]
40
Segmentation and Labeling (CT): BoneVCAR
Optimal Scans (MR): GE MR AIRx
Alerts for Critical Conditions (X-Ray): GE Critical Care Suite
41
CUDA atomicAdd()
TF_CUDNN_DETERMINISTIC
Added to TensorFlow master branch: bit.ly/tf-pr-25269

$ export TF_CUDNN_DETERMINISTIC=true
$ python tf_training_script.py

#!/usr/bin/python
import os
os.environ['TF_CUDNN_DETERMINISTIC'] = 'true'
import tensorflow as tf
# build a graph
42
Noticed while I was debugging the distilled model
Much greater variance than GPU
Injection occurring at the weight-update step
Solution: use a single CPU thread

session_config.intra_op_parallelism_threads = 1  # default: 2
session_config.inter_op_parallelism_threads = 1  # default: 5

Only needed when running on CPU (vs GPU)
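In context (TF 1.x), a minimal sketch of building such a session:

import tensorflow as tf  # TF 1.x

# Single-threaded CPU execution to avoid the weight-update race:
session_config = tf.ConfigProto(
    intra_op_parallelism_threads=1,
    inter_op_parallelism_threads=1)
sess = tf.Session(config=session_config)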
43
SUM OF WEIGHTS       | FINAL LOSS
==================== | ==================
Training five times with no fixes
Training twice with all fixes
Training bigger config twice with all fixes:
  3.7987217940390110 | 3.9343416929244994 (2.43 s)
  3.7987217940390110 | 3.9343416929244994 (2.41 s)
44
1. Set TF_CUDNN_DETERMINISTIC=true
○ Disables TensorFlow cuDNN auto-tuning
○ Uses deterministic cuDNN convolution back-prop algorithms
○ Uses deterministic cuDNN max-pooling algorithm
2. Dynamically patch tf.nn.bias_add()
3. Set random seeds for all random number generators
○ random.seed(SEED), np.random.seed(SEED), tf.set_random_seed(SEED)
4. Set HOROVOD_FUSION_THRESHOLD=0 for more than 2 GPUs
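Pulled together, the single-GPU part of the recipe looks roughly like this (TF 1.x; SEED is any fixed value, and the patch import follows the earlier slide):

import os
os.environ['TF_CUDNN_DETERMINISTIC'] = 'true'  # step 1

import random
import numpy as np
import tensorflow as tf
from tensorflow_determinism import patch

patch.bias_add()  # step 2: deterministic bias-add

SEED = 123
random.seed(SEED)  # step 3: seed all RNGs
np.random.seed(SEED)
tf.set_random_seed(SEED)

# step 4 (shell, for more than 2 GPUs):
# HOROVOD_FUSION_THRESHOLD=0 python train.py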
45
Analysis of TF v1.12, v1.13.1, and master branch (on 2019-03-03)
About 13 ops that use CUDA atomicAdd()
There are ten other CUDA atomic operations, e.g. atomicCAS()
'atomic' is present in 167 files in the TensorFlow repo
○ Some of these may be related to CUDA atomics
CUDA atomics are not always associated with non-determinism
There are faster, deterministic ways to reduce within thread-blocks
○ i.e. logarithmic tree reductions using inter-thread shuffling (sketched below)
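A plain-Python illustration of the fixed-order idea behind a tree reduction (conceptual only; a real implementation would be a CUDA kernel using warp shuffles):

def tree_reduce(values):
    # Pairwise-sum until one value remains; the combination order is
    # fixed by the loop structure, so the result is bit-exact run-to-run.
    while len(values) > 1:
        paired = [values[i] + values[i + 1]
                  for i in range(0, len(values) - 1, 2)]
        if len(values) % 2:
            paired.append(values[-1])  # carry the odd element forward
        values = paired
    return values[0]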
46
All forward propagation (of course)
○ Probably no need to set TF_CUDNN_DETERMINISTIC=true
○ Possible issues with “deconvolution”
Disable TensorFlow cuDNN autotuning
○ Set TF_CUDNN_USE_AUTOTUNE=false
TensorRT
○ ~500 CUDA kernels, all of them deterministic
○ Timing-based auto-tuning running on the target architecture can produce different graphs on each run
○ We’re working on adding a mechanism to TensorRT to address this
47
Set all the seeds:
○ random.seed(SEED), np.random.seed(SEED)
○ torch.manual_seed(SEED), torch.cuda.manual_seed_all(SEED)
torch.backends.cudnn.deterministic = True
○ Covers convolution and max-pooling
I hear that some ops may still be non-deterministic
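As one block (a consolidated sketch of exactly the calls listed above; SEED is any fixed value):

import random
import numpy as np
import torch

SEED = 123
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)
torch.backends.cudnn.deterministic = True  # deterministic conv and max-pooling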
48
Release current solution in NGC TensorFlow container
TF_CUDNN_DETERMINISTIC in TensorFlow v2.0 (end of year)
Make bias_add deterministic at the CUDA kernel level
Open-source the determinism debug tool
Add a single deterministic switch for all of TensorFlow
Improve deterministic performance of Horovod
Deterministic simulated environments for reinforcement learning
49
Tero Karras, Tim Zaman, Hao Wu, Jose Alvarez Lopez, Ben Barsdell, Rakesh Ranjan, Simon Layton, John Montrym, Jorge Albericio Latorre, Nicolas Koumchatzky, Carl Case
50
Yifang Xu, William Zhang, Lauri Peltonen, Joey Conway, Matthijs De Smedt, Kevin Vincent, Bryan Catanzaro, Michael O’Connor, Stephen Warren, Bob Keating, Andrew Kerr, Nathan Luehr, Conrado Silva Miranda, Jussi Rasanen, Dilip Sequeira, Mikko Ronkainen, Xiang Bo Kong, Sharan Chetlur, Luke Durant, Kevin Brown, Marc Edgar, Cindy Riach, Mostafa Hagog
Neither TensorFlow nor GPUs are inherently non-deterministic
Root cause is asynchronous floating-point operations
Use CUDA floating-point atomic operations with care
Deterministic kernels are often already available
This was a hard problem to solve, but not impossible
It’s a very important topic; a lot of people care about it
New tools and methodology for debugging
Automated vigilance is warranted
51
52
watch: github.com/NVIDIA/tensorflow-determinism
follow: twitter.com/DuncanARiach
connect: www.linkedin.com/in/duncanriach
email: duncan@nvidia.com