SLIDE 1

ACCELERATED COMPUTING FOR AI

Bryan Catanzaro, 7 December 2018

SLIDE 2

ACCELERATED COMPUTING: REDUCE LATENCY OF IDEA GENERATION

Research as a sequential, cyclic process

Cycle: Idea → Code ("Hack") → Train → Test → Invent → Idea → …

  • Code: limited by programmability
  • Train: limited by throughput
  • Invent: limited by ingenuity

SLIDE 3

WHY IS DEEP LEARNING SUCCESSFUL?

  • Big data sets
  • New algorithms
  • Computing hardware (the focus of this talk)

[Chart: accuracy vs. data & compute. Deep learning keeps improving as data and compute grow, where many previous methods plateau.]

SLIDE 4

MORE COMPUTE: MORE AI

[Chart: compute used by the largest AI training runs over time, doubling roughly every 3.5 months; source: https://blog.openai.com/ai-and-compute/]

SLIDE 5

DEEP NEURAL NETWORKS 101

Simple, powerful function approximators

$$y_j = f\left(\sum_i w_{ij} x_i\right), \qquad f(x) = \begin{cases} 0, & x < 0 \\ x, & x \ge 0 \end{cases}$$

One layer: nonlinearity ∘ linear combination

[Diagram: fully connected layer with inputs x, weights w, outputs y; stacking such layers gives a deep neural network]
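As a concrete illustration of the formula above, here is a minimal NumPy sketch of one layer and a small stack of layers (the sizes are hypothetical):

```python
import numpy as np

def relu(x):
    # f(x) = 0 for x < 0, x for x >= 0
    return np.maximum(0.0, x)

def layer(x, w):
    # y_j = f(sum_i w_ij * x_i): linear combination, then nonlinearity
    return relu(w @ x)

# A small, illustrative deep network: compose layers
rng = np.random.default_rng(0)
weights = [rng.standard_normal((64, 128)),
           rng.standard_normal((64, 64)),
           rng.standard_normal((10, 64))]
x = rng.standard_normal(128)
for w in weights:
    x = layer(x, w)
print(x.shape)  # (10,)
```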

SLIDE 6

TRAINING NEURAL NETWORKS

Computation is dominated by dot products. With multiple inputs, multiple outputs, and batching, the work becomes matrix multiplication, which is compute bound.

$$y_j = f\left(\sum_i w_{ij} x_i\right)$$

Train one model: 20+ Exaflops
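To see why batching makes training compute bound, here is a back-of-the-envelope sketch (the sizes are hypothetical) comparing FLOPs to bytes moved for a matrix-vector product vs. a batched matrix-matrix product:

```python
# Arithmetic intensity (FLOPs per byte) of y = W x vs. Y = W X,
# with W an n x n matrix in FP16 (2 bytes/element). Sizes are illustrative.
n, batch = 4096, 512
bytes_per = 2

# Matrix-vector: 2*n*n FLOPs, but all of W must be read for every vector
flops_mv = 2 * n * n
bytes_mv = bytes_per * (n * n + 2 * n)
print(f"matrix-vector: {flops_mv / bytes_mv:6.1f} FLOPs/byte")  # ~1: bandwidth bound

# Batched matrix-matrix: W is read once and reused across the whole batch
flops_mm = 2 * n * n * batch
bytes_mm = bytes_per * (n * n + 2 * n * batch)
print(f"matrix-matrix: {flops_mm / bytes_mm:6.1f} FLOPs/byte")  # hundreds: compute bound
```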

SLIDE 7

LAWS OF PHYSICS

Successful AI uses Accelerated Computing

[Chart: GPU TFLOPS over time, log scale (0.1 to 10). Accelerated performance grew 20x in 10 years, reaching Volta, while general-purpose performance lagged: a 20x gap and growing…]

SLIDE 8

MATRIX MULTIPLICATION

Thor’s hammer

Multiplying an $m \times k$ matrix by a $k \times n$ matrix produces an $m \times n$ result, touching $O(n^2)$ data while performing $O(n^3)$ work:

$$\text{communication: } O(n^2), \qquad \text{computation: } O(n^3)$$

So arithmetic outgrows data movement as matrices get larger.
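A quick illustrative check of that ratio for square matrices: FLOPs scale as $n^3$ while elements touched scale as $n^2$, so FLOPs per element grows linearly with $n$:

```python
# For square n x n GEMM: ~2*n^3 FLOPs vs. 3*n^2 matrix elements touched
for n in (256, 1024, 4096):
    flops = 2 * n**3
    elements = 3 * n * n   # A and B read, C written
    print(f"n={n:5d}: {flops / elements:8.1f} FLOPs per element")
```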

SLIDE 9

TENSOR CORE

Mixed-precision matrix math on 4x4 matrices:

$$D = A B + C, \qquad D_{i,j} = \sum_{k=0}^{3} A_{i,k} B_{k,j} + C_{i,j}$$

where $A$ and $B$ are FP16, and $C$ and $D$ are FP16 or FP32.
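A minimal PyTorch sketch of the same operation at a larger size; on Volta-class GPUs, cuBLAS dispatches FP16 GEMMs like this onto Tensor Cores (the 1024 sizes here are illustrative; multiples of 8 help):

```python
import torch

# D = A @ B + C with FP16 inputs; on a Volta-class GPU this GEMM
# is mapped onto Tensor Cores by cuBLAS under the hood.
A = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
B = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
C = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
D = torch.addmm(C, A, B)   # fused D = C + A @ B
print(D.dtype, D.shape)
```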

SLIDE 10

CHUNKY INSTRUCTIONS AMORTIZE OVERHEAD

Taking advantage of that $O(n^3)$ goodness

Operation   FMAs   Energy**   Overhead*
HFMA        1      1.5 pJ     2000%
HDP4A       4      6.0 pJ     500%
HMMA        128    110 pJ     27%

*Overhead (instruction fetch, decode, and operand fetch) is ~30 pJ per instruction.
**Energy numbers from a 45 nm process (Bill Dally).

Tensor cores yield efficiency benefits, but are still programmable
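The overhead column is just the fixed ~30 pJ instruction overhead divided by the math energy of each instruction; a quick check reproduces the table:

```python
# Overhead% = fixed per-instruction overhead / math energy per instruction
overhead_pj = 30.0
for op, fmas, energy_pj in [("HFMA", 1, 1.5), ("HDP4A", 4, 6.0), ("HMMA", 128, 110.0)]:
    print(f"{op:6s} {fmas:4d} FMAs  overhead = {overhead_pj / energy_pj:6.0%}")
```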

SLIDE 11

TESLA V100

  • 21B transistors, 815 mm²
  • 80 SMs*, 5120 CUDA cores, 640 Tensor Cores
  • 32 GB HBM2 at 900 GB/s
  • 300 GB/s NVLink

*The full GV100 chip contains 84 SMs.

SLIDE 12

GPU PERFORMANCE COMPARISON

                         P100           V100            Ratio   T4
Training acceleration    10 TFLOPS      120 TFLOPS      12x     65 TFLOPS
Inference acceleration   20 TFLOPS      120 TFLOPS      6x      130 TOPS
FP64 / FP32              5/10 TFLOPS    7.5/15 TFLOPS   1.5x    0.25/8 TFLOPS
Memory bandwidth         720 GB/s       900 GB/s        1.2x    320 GB/s
NVLink bandwidth         160 GB/s       300 GB/s        1.9x    n/a
L2 cache                 4 MB           6 MB            1.5x    4 MB
L1 caches                1.3 MB         10 MB           7.7x    6 MB
Power                    250 W          300 W           1.2x    70 W

SLIDE 13

PRECISION

  • Turing follows Volta (Tesla T4, Titan RTX)
  • Includes lower-precision Tensor Cores (not shown: 1-bit at 128x throughput)
  • 32-bit accumulation

[Chart: VOLTA+ math throughput by precision]

SLIDE 14

COMPUTATIONAL EVOLUTION

[Timeline, 2012–2018: AlexNet; 1-bit SGD; FFT convolutions; cuDNN; Winograd convolutions; batch normalization; NCCL; sparsely gated mixture of experts; phased LSTM; Mask R-CNN; persistent RNNs; Transformer; GLOW; OpenAI Five; BigGAN]

New solvers, new layers, new scaling techniques, new applications for old techniques, and much more…

Deep learning changes every day, which is in tension with specialization.

SLIDE 15

PROGRAMMABILITY

Computation is dominated by linear operations, but the research happens elsewhere:

  • New loss functions (e.g., CTC loss)
  • New nonlinearities (e.g., Swish)
  • New normalizations
  • New inputs & outputs

CUDA, a fast and flexible parallel C++, is where the research happens.
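For instance, a new nonlinearity like Swish is a one-liner in a framework, while the heavy linear algebra still runs on optimized GPU kernels; a minimal PyTorch sketch:

```python
import torch

def swish(x):
    # Swish (Ramachandran et al., 2017): x * sigmoid(x)
    return x * torch.sigmoid(x)

x = torch.randn(8, 1024, device="cuda" if torch.cuda.is_available() else "cpu")
y = swish(x)   # elementwise novelty, expressed in a few characters
```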

SLIDE 16

REFINING CUDA: CUDA GRAPHS

Launch latencies:

  • CUDA 10.0 takes at least 2.2 µs of CPU time to launch each CUDA kernel on Linux
  • A pre-defined graph allows launching any number of kernels in one single operation

Latency & Overhead Reductions

[Diagram: launching kernels A through E one by one interleaves CPU launch overhead with GPU work; with a graph, the CPU builds it once and launches it with a single call, then sits idle while A through E run back-to-back]

Useful for small models; works with JIT graph compilers (see the sketch below).
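For a feel of the pattern, here is a sketch using the CUDA graph capture API as later exposed in PyTorch (torch.cuda.graph postdates this talk; the model and shapes are hypothetical):

```python
import torch

# Hypothetical tiny model and static input buffer for capture
model = torch.nn.Linear(256, 256).cuda()
static_x = torch.randn(32, 256, device="cuda")

# Warm up on a side stream before capture (required by the API)
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    static_y = model(static_x)
torch.cuda.current_stream().wait_stream(s)

# Capture the whole kernel sequence into one graph...
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_y = model(static_x)

# ...then replay it with a single launch per iteration
static_x.copy_(torch.randn(32, 256, device="cuda"))
g.replay()   # results appear in static_y
```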

SLIDE 17

CUDA LIBRARIES

  • cuBLAS: linear algebra, many flavors of GEMM
  • cuDNN: neural network kernels: convolutions (direct, Winograd, FFT) and recurrent neural networks
  • Convolutions can achieve better than "speed of light": Winograd and FFT variants need fewer FLOPs than direct convolution

Optimized Kernels

Lowering Convolutions to GEMM
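One way libraries reuse GEMM is the classic im2col lowering; here is a minimal NumPy sketch of the idea (illustrative only, not cuDNN's actual implementation):

```python
import numpy as np

def conv2d_as_gemm(x, w):
    """Lower a 2D convolution (no padding, stride 1) to one GEMM.
    x: (C, H, W) input; w: (K, C, R, S) filters."""
    C, H, W = x.shape
    K, _, R, S = w.shape
    out_h, out_w = H - R + 1, W - S + 1
    # im2col: each output position becomes one column of unrolled patches
    cols = np.empty((C * R * S, out_h * out_w), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            cols[:, i * out_w + j] = x[:, i:i+R, j:j+S].ravel()
    # One big matrix multiply: (K, C*R*S) @ (C*R*S, out_h*out_w)
    y = w.reshape(K, -1) @ cols
    return y.reshape(K, out_h, out_w)
```

Production libraries use "implicit GEMM" variants that avoid materializing `cols`, but the math is the same.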

SLIDE 18

IMPROVED HEURISTICS FOR CONVOLUTIONS

cuDNN 7.4.1 (Nov 2018) vs. cuDNN 7.0.5 (Dec 2017)

[Chart: speedup of unique cuDNN convolution API calls for the SSD detector model at batch sizes 32, 128, and 256; log-scale axis from 0.1x to 100x]
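In PyTorch, cuDNN's algorithm heuristics and autotuning can be exercised with one flag (a common setting, not specific to this benchmark):

```python
import torch

# Let cuDNN benchmark the available convolution algorithms for the
# actual shapes used by the model and cache the fastest choice.
torch.backends.cudnn.benchmark = True
```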

SLIDE 19

PERSISTENT RNN SPEEDUP ON V100

[Chart: speedup of unique cuDNN persistent RNN API calls for GNMT at batch=32; axis from 0x to 12x]

cuDNN 7.4.1 (Nov 2018) vs. cuDNN 7.0.5 (Dec 2017)

SLIDE 20

TENSOR CORES WITH FP32 MODELS

cuDNN 7.4.1 (Nov 2018) vs. cuDNN 7.0.5 (Dec 2017)

  • Enabled as an experimental feature in the TensorFlow NGC container via an environment variable (same for cuBLAS)
  • Should be used in conjunction with loss scaling (see the sketch after the chart)

[Chart: speedup at batch sizes 32 and 128 for ResNet-50 v1.5, SSD, and Mask R-CNN; axis from 0x to 3.5x]

Average speedup of unique cuDNN convolution calls during training
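Loss scaling keeps small FP16 gradients from flushing to zero: scale the loss up before backward, then unscale the gradients before the update. A minimal manual sketch of the idea (the model and the fixed scale of 128 are hypothetical; AMP automates this with dynamic scaling):

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda().half()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_scale = 128.0   # hypothetical fixed scale

x = torch.randn(32, 1024, device="cuda", dtype=torch.float16)
loss = model(x).float().pow(2).mean()   # toy loss, computed in FP32

(loss * loss_scale).backward()          # scale up before backward
for p in model.parameters():
    p.grad.div_(loss_scale)             # unscale before the update
optimizer.step()
optimizer.zero_grad()
```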

SLIDE 21

NVIDIA DGX-2

  • Two Intel Xeon Platinum CPUs
  • 1.5 TB system memory
  • 30 TB NVMe SSD internal storage
  • Two GPU boards: 8 Tesla V100 32GB GPUs per board, 6 NVSwitches per board, 512 GB total HBM2 memory, interconnected by plane card
  • Twelve NVSwitches: 2.4 TB/s bisection bandwidth
  • Eight EDR InfiniBand/100 GigE: 1600 Gb/s total bidirectional bandwidth
  • PCIe switch complex
  • Dual 10/25 GigE

SLIDE 22

NVSWITCH: NETWORK FABRIC FOR AI

  • 2.4 TB/s bisection bandwidth, equivalent to a PCIe bus with 1,200 lanes
  • Inspired by leading-edge research that demands unrestricted model parallelism
  • Each GPU can make random reads, writes, and atomics to every other GPU's memory
  • 18 NVLink ports per switch
SLIDE 23

DGX-2: ALL-TO-ALL CONNECTIVITY

  • Each switch connects to 8 GPUs
  • Each GPU connects to 6 switches
  • Each switch connects to the other half of the system with 8 links
  • 2 links on each switch are reserved

SLIDE 24

FRAMEWORKS

  • Several AI frameworks let researchers prototype rapidly
  • Different perspectives on APIs
  • All are GPU accelerated

SLIDE 25

AUTOMATIC MIXED PRECISION

Mixed-precision training uses half-precision floating point (FP16) to accelerate training. You can start using mixed precision today with four lines of code. This example uses AMP (Automatic Mixed Precision), a PyTorch library.

No hyperparameters changed

Four Lines of Code => 2.3x Training Speedup in PyTorch (RN-50)

```diff
+ amp_handle = amp.init()

  # ... define model and optimizer
  for x, y in dataset:
      prediction = model(x)
      loss = criterion(prediction, y)
-     loss.backward()
+     with amp_handle.scale_loss(loss, optimizer) as scaled_loss:
+         scaled_loss.backward()
      optimizer.step()
```
SLIDE 26

AUTOMATIC MIXED PRECISION

Real-world single-GPU runs using default PyTorch ImageNet example

NVIDIA PyTorch 18.08-py3 container; AMP for mixed precision; minibatch = 256.

Single-GPU ResNet-50 speedup for FP32 -> mixed precision (with 2x batch size):

  • MXNet: 2.9x
  • TensorFlow: 2.2x
  • TensorFlow + XLA: ~3x
  • PyTorch: 2.3x

Work ongoing to bring to 3x everywhere

Four Lines of Code => 2.3x Training Speedup (RN-50)

SLIDE 27

DATA LOADERS

Fast training means greater demands on the rest of the system: data transfer from storage and the network, and CPU bottlenecks happen fast. GPU-accelerated, user-defined data loaders move decompression & augmentation to the GPU, for both still images and videos.

  • DALI moves all this to the GPU: https://github.com/NVIDIA/DALI (see the sketch below)
  • NVVL is a research video data loader using HW decoding: https://github.com/NVIDIA/NVVL
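A rough sketch of a DALI image pipeline in the style of its early (2018-era) API; the operator names and the data path are assumptions from that API generation and have since changed, so treat this as illustrative:

```python
from nvidia.dali.pipeline import Pipeline
import nvidia.dali.ops as ops
import nvidia.dali.types as types

class TrainPipeline(Pipeline):
    """Read files and decode JPEGs with nvJPEG, off the CPU's critical path."""
    def __init__(self, batch_size, num_threads, device_id, data_dir):
        super().__init__(batch_size, num_threads, device_id)
        self.read = ops.FileReader(file_root=data_dir, random_shuffle=True)
        # "mixed": JPEG decode split between CPU and GPU (nvJPEG)
        self.decode = ops.nvJPEGDecoder(device="mixed", output_type=types.RGB)

    def define_graph(self):
        jpegs, labels = self.read()
        return self.decode(jpegs), labels

pipe = TrainPipeline(batch_size=256, num_threads=4, device_id=0,
                     data_dir="/data/imagenet/train")  # hypothetical path
pipe.build()
images, labels = pipe.run()
```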

SLIDE 28

SIMULATION

Many important AI tasks involve agents interacting with the real world, and for this you need simulators: physics and appearance. Simulation has a big role to play in AI progress, and RL needs good simulators. NVIDIA PhysX is now open source: https://github.com/NVIDIAGameWorks/PhysX-3.4

SLIDE 29

MAKE INGENUITY THE LIMITING FACTOR

  • High computational intensity plus programmability & flexibility are fundamental for AI systems
  • A systems approach is needed: chips are not enough
  • And lots of software to make it all useful

Accelerated Computing for AI
Bryan Catanzaro, @ctnzr
