DEEP LEARNING DEPLOYMENT WITH NVIDIA TENSORRT Shashank Prasanna

AGENDA
Deep Learning in Production
 - Current Approaches
 - Deployment Challenges
NVIDIA TensorRT
 - Programmable Inference Accelerator
 - Performance, Optimizations and Features
Example
 - Import, Optimize and Deploy TensorFlow Models with TensorRT
Key Takeaways and Additional Resources
Q&A

DEEP LEARNING IN PRODUCTION
• Speech Recognition
• Recommender Systems
• Autonomous Driving
• Real-time Object Recognition
• Robotics
• Real-time Language Translation
• Many More…

CURRENT DEPLOYMENT WORKFLOW
TRAINING
[Figure labels: Training Data, Data Management, Training, Model Assessment, Trained Neural Network; Training framework or custom application; CUDA, NVIDIA Deep Learning SDK (cuDNN, cuBLAS, NCCL)]
UNOPTIMIZED DEPLOYMENT
1 Deploy training framework
2 Deploy custom application using NVIDIA DL SDK
3 Framework or custom CPU-Only application

CHALLENGES WITH CURRENT APPROACHES
Requirement: High Throughput
Challenge: Unable to process high-volume, high-velocity data
➢ Impact: Increased cost ($, time) per inference
Requirement: Low Response Time
Challenge: Applications don’t deliver real-time results
➢ Impact: Negatively affects user experience (voice recognition, personalized recommendations, real-time object detection)
Requirement: Power and Memory Efficiency
Challenge: Inefficient applications
➢ Impact: Increased cost (running and cooling), makes deployment infeasible
Requirement: Deployment-Grade Solution
Challenge: Research frameworks not designed for production
➢ Impact: Framework overhead and dependencies increase time to solution and affect productivity

NVIDIA TENSORRT
Programmable Inference Accelerator
[Figure: FRAMEWORKS, TensorRT (Optimizer, Runtime), GPU PLATFORMS: TESLA P4, JETSON TX2, DRIVE PX 2, NVIDIA DLA, TESLA V100]
developer.nvidia.com/tensorrt

TENSORRT PERFORMANCE
40x Faster CNNs on V100 vs. CPU-Only, Under 7ms Latency (ResNet50)
• CPU-Only: 140 images/sec, 14 ms latency
• V100 + TensorFlow: 305 images/sec, 6.67 ms latency
• V100 + TensorRT: 5,700 images/sec, 6.83 ms latency
Inference throughput (images/sec) on ResNet50. V100 + TensorRT: NVIDIA TensorRT (FP16), batch size 39, Tesla V100-SXM2-16GB, E5-2690 v4@2.60GHz 3.5GHz Turbo (Broadwell) HT On. V100 + TensorFlow: Preview of volta optimized TensorFlow (FP16), batch size 2, Tesla V100-PCIE-16GB, E5-2690 v4@2.60GHz 3.5GHz Turbo (Broadwell) HT On. CPU-Only: Intel Xeon-D 1587 Broadwell-E CPU and Intel DL SDK. Score doubled to comprehend Intel's stated claim of 2x performance improvement on Skylake with AVX512.
140x Faster Language Translation RNNs on V100 vs. CPU-Only Inference (OpenNMT)
• CPU-Only + Torch: 4 sentences/sec, 280 ms latency
• V100 + Torch: 25 sentences/sec, 153 ms latency
• V100 + TensorRT: 550 sentences/sec, 117 ms latency
Inference throughput (sentences/sec) on OpenNMT 692M. V100 + TensorRT: NVIDIA TensorRT (FP32), batch size 64, Tesla V100-PCIE-16GB, E5-2690 v4@2.60GHz 3.5GHz Turbo (Broadwell) HT On. V100 + Torch: Torch (FP32), batch size 4, Tesla V100-PCIE-16GB, E5-2690 v4@2.60GHz 3.5GHz Turbo (Broadwell) HT On. CPU-Only: Torch (FP32), batch size 1, Intel E5-2690 v4@2.60GHz 3.5GHz Turbo (Broadwell) HT On.
developer.nvidia.com/tensorrt
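The headline speedups can be sanity-checked from the chart values. The sketch below assumes the bars read as 140, 305 and 5,700 images/sec for ResNet50 and 4, 25 and 550 sentences/sec for OpenNMT; the helper names are illustrative and not part of any NVIDIA tool.

```python
# Throughput values as read off the two charts (an assumption about the bar labels).
resnet50 = {"CPU-Only": 140, "V100 + TensorFlow": 305, "V100 + TensorRT": 5700}
opennmt = {"CPU-Only + Torch": 4, "V100 + Torch": 25, "V100 + TensorRT": 550}


def speedup(throughputs, fast, slow):
    """Ratio of two throughputs measured in the same unit."""
    return throughputs[fast] / throughputs[slow]


def batch_latency_ms(batch_size, throughput_per_sec):
    """Time to finish one batch at a steady throughput, in milliseconds."""
    return 1000.0 * batch_size / throughput_per_sec


cnn_vs_cpu = speedup(resnet50, "V100 + TensorRT", "CPU-Only")
cnn_vs_tf = speedup(resnet50, "V100 + TensorRT", "V100 + TensorFlow")
rnn_vs_cpu = speedup(opennmt, "V100 + TensorRT", "CPU-Only + Torch")

# Batch size 39 is the TensorRT ResNet50 configuration quoted in the footnote.
trt_resnet_latency = batch_latency_ms(39, resnet50["V100 + TensorRT"])

print(f"ResNet50, TensorRT vs CPU-Only:   {cnn_vs_cpu:.1f}x")
print(f"ResNet50, TensorRT vs TensorFlow: {cnn_vs_tf:.1f}x")
print(f"OpenNMT,  TensorRT vs CPU-Only:   {rnn_vs_cpu:.1f}x")
print(f"ResNet50, TensorRT batch latency: {trt_resnet_latency:.2f} ms")
```

The ratios come out near 40.7, 18.7 and 137.5, in line with the rounded 40x, 18x and 140x quoted in the deck, and batch size divided by throughput lands close to the 6.83 ms latency label.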

TENSORRT DEPLOYMENT WORKFLOW
Step 1: Optimize trained model
• Trained Neural Network → Import Model → TensorRT Optimizer → Serialize Engine → Optimized Plans (Plan 1, Plan 2, Plan 3)
Step 2: Deploy optimized plans with runtime
• Optimized Plans (Plan 1, Plan 2, Plan 3) → De-serialize Engine → TensorRT Runtime Engine → Deploy Runtime
• Targets: Data center, Automotive, Embedded

MODEL IMPORTING
➢ AI Researchers
➢ Data Scientists
Example: Importing a TensorFlow model
[Figure labels: Model Importer (Python/C++ API); Other Frameworks, Network Definition API (Python/C++ API); Runtime inference (C++ or Python API)]
developer.nvidia.com/tensorrt

TENSORRT LAYERS
Built-in Layer Support
• Convolution
• LSTM and GRU
• Activation: ReLU, tanh, sigmoid
• Pooling: max and average
• Scaling
• Element wise operations
• LRN
• Fully-connected
• SoftMax
• Deconvolution
Custom Layer API
[Figure labels: Deployed Application, TensorRT Runtime, Custom Layer, CUDA Runtime]

TENSORRT OPTIMIZATIONS
• Layer & Tensor Fusion
• Weights & Activation Precision Calibration
• Kernel Auto-Tuning
• Dynamic Tensor Memory
➢ Optimizations are completely automatic
➢ Performed with a single function call

LAYER & TENSOR FUSION
[Figure: Un-Optimized Network (input, 1x1 / 3x3 / 5x5 conv., bias, relu, max pool, concat, next input) beside TensorRT Optimized Network (input, 1x1 / 3x3 / 5x5 CBR, max pool, next input)]

LAYER & TENSOR FUSION
• Vertical Fusion
• Horizontal Fusion
• Layer Elimination
[Figure: Un-Optimized Network (input, 1x1 / 3x3 / 5x5 conv., bias, relu, max pool, concat, next input) beside TensorRT Optimized Network (input, 1x1 / 3x3 / 5x5 CBR, max pool, next input)]
Network layers before and after optimization:
• VGG19: 43 before, 27 after
• Inception V3: 309 before, 113 after
• ResNet-152: 670 before, 159 after
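To make the fusion idea concrete, here is a toy sketch in plain Python, not TensorRT code. It assumes a single-channel 1x1 convolution, which reduces to a per-element scale, and shows that a fused conv + bias + ReLU ("CBR") pass gives the same result as three separate layers. All function names are hypothetical.

```python
def conv1x1(xs, weight):
    """1x1 convolution on a single channel reduces to a per-element scale."""
    return [weight * x for x in xs]


def add_bias(xs, bias):
    return [x + bias for x in xs]


def relu(xs):
    return [max(0.0, x) for x in xs]


def unfused_cbr(xs, weight, bias):
    """Three separate layers: three passes and two intermediate tensors."""
    return relu(add_bias(conv1x1(xs, weight), bias))


def fused_cbr(xs, weight, bias):
    """Vertical fusion: one pass computes conv + bias + ReLU per element."""
    return [max(0.0, weight * x + bias) for x in xs]


def horizontal_fused_cbr(xs, params):
    """Horizontal fusion: sibling CBR layers reading the same input share one pass."""
    outputs = [[] for _ in params]
    for x in xs:
        for out, (weight, bias) in zip(outputs, params):
            out.append(max(0.0, weight * x + bias))
    return outputs


activations = [-2.0, -0.5, 0.0, 1.5, 3.0]
separate = unfused_cbr(activations, 2.0, 1.0)
fused = fused_cbr(activations, 2.0, 1.0)
branches = horizontal_fused_cbr(activations, [(2.0, 1.0), (1.0, 0.0)])
print(separate == fused, branches[0] == fused)
```

The fused versions compute the same values while touching the input once and skipping the intermediate tensors, which is the kind of saving the layer counts above reflect.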

FP16, INT8 PRECISION CALIBRATION
Precision and dynamic range:
• FP32: -3.4x10^38 ~ +3.4x10^38 (training precision)
• FP16: -65504 ~ +65504 (no calibration required)
• INT8: -128 ~ +127 (requires calibration)
Precision calibration for INT8 inference:
➢ Minimizes information loss between FP32 and INT8 inference on a calibration dataset
➢ Completely automatic
[Chart: Reduced Precision Inference Performance (ResNet50), Images/Second from 0 to 6,000, for CPU-Only, P4 and V100; bars labelled FP32, INT8 and FP16 Tensor Core]
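The FP16 and FP32 limits in the dynamic range table can be checked with Python's standard `struct` module, which supports IEEE 754 half precision through the `e` format. This is only an illustration of the number formats; the helper names are hypothetical and nothing here calls TensorRT.

```python
import struct


def roundtrip_fp16(value):
    """Store a Python float as IEEE 754 half precision and read it back."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


def fits_fp16(value):
    """True if the value can be packed as a finite half-precision number."""
    try:
        struct.pack("<e", value)
        return True
    except OverflowError:
        return False


FP16_MAX = 65504.0
# Largest finite FP32 value, decoded from its bit pattern 0x7F7FFFFF.
FP32_MAX = struct.unpack("<f", bytes.fromhex("ffff7f7f"))[0]

print(f"FP32 max: {FP32_MAX:.4e}")
print(f"FP16 max representable: {roundtrip_fp16(FP16_MAX)}")
print(f"70000.0 fits in FP16: {fits_fp16(70000.0)}")
print(f"0.1 stored as FP16: {roundtrip_fp16(0.1)}")
```

Values beyond 65504 overflow in FP16, and values inside the range lose low-order bits, which is why reduced precision trades some accuracy for throughput.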

FP16, INT8 PRECISION CALIBRATION
Precision and dynamic range:
• FP32: -3.4x10^38 ~ +3.4x10^38 (training precision)
• FP16: -65504 ~ +65504 (no calibration required)
• INT8: -128 ~ +127 (requires calibration)
Precision calibration for INT8 inference:
➢ Minimizes information loss between FP32 and INT8 inference on a calibration dataset
➢ Completely automatic
Top 1 accuracy, FP32 vs. INT8:
• Googlenet: FP32 68.87%, INT8 68.49%, difference 0.38%
• VGG: FP32 68.56%, INT8 68.45%, difference 0.11%
• Resnet-50: FP32 73.11%, INT8 72.54%, difference 0.57%
• Resnet-152: FP32 75.18%, INT8 74.56%, difference 0.61%
[Chart: Reduced Precision Inference Performance (ResNet50), Images/Second from 0 to 6,000, for CPU-Only, P4 and V100; bars labelled FP32, INT8 and FP16 Tensor Core]
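A minimal sketch of why INT8 needs calibration while FP16 does not: the INT8 range of -128 to +127 has to be mapped onto the values a tensor actually takes. The example below substitutes the simplest possible rule, scaling by the largest magnitude seen in a calibration batch. It is an assumption made for illustration, not TensorRT's calibrator, which the slide describes only as minimizing information loss.

```python
def calibrate_scale(calibration_values):
    """Pick one scale so the largest observed magnitude maps to 127."""
    max_abs = max(abs(v) for v in calibration_values)
    return max_abs / 127.0 if max_abs > 0 else 1.0


def quantize(value, scale):
    """FP32 -> INT8: scale, round to nearest, then saturate to [-128, 127]."""
    return max(-128, min(127, round(value / scale)))


def dequantize(q, scale):
    """INT8 -> approximate FP32 value."""
    return q * scale


# Hypothetical activations standing in for a calibration dataset.
calibration_batch = [-1.27, -0.4, 0.0, 0.33, 0.9, 1.0]
scale = calibrate_scale(calibration_batch)
quantized = [quantize(v, scale) for v in calibration_batch]
restored = [dequantize(q, scale) for q in quantized]
max_error = max(abs(a - b) for a, b in zip(calibration_batch, restored))

print(quantized)
print(f"largest round-trip error: {max_error:.6f}")
print(f"value outside the calibrated range saturates to {quantize(5.0, scale)}")
```

Inside the calibrated range the round-trip error stays within half a quantization step, while anything outside it saturates, so the choice of range decides how much information survives.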

KERNEL AUTO-TUNING, DYNAMIC TENSOR MEMORY
Kernel Auto-Tuning
• 100s of specialized kernels
• Optimized for every GPU platform (Tesla V100, Jetson TX2, Drive PX2, ...)
• Multiple parameters: batch size, input dimensions, filter dimensions
Dynamic Tensor Memory
• Reduces memory footprint and improves memory re-use
• Manages memory allocation for each tensor only for the duration of its usage
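The dynamic tensor memory idea, holding memory for a tensor only while it is in use, can be sketched as a lifetime-based buffer planner. This is a simplified greedy scheme written for illustration; the function name, the tensor tuples and the sizes are all hypothetical, and TensorRT's actual allocator is not described in the deck.

```python
def plan_buffers(tensors):
    """Assign each tensor a buffer slot, reusing slots whose tensor is dead.

    tensors: list of (name, size, first_use, last_use), in execution steps.
    Returns (assignment, slot_sizes).
    """
    assignment = {}
    slot_sizes = []    # capacity needed by each slot
    slot_free_at = []  # first step at which each slot is free again
    for name, size, first_use, last_use in sorted(tensors, key=lambda t: t[2]):
        for slot, free_at in enumerate(slot_free_at):
            if free_at <= first_use:
                break  # this slot's previous tensor is no longer needed
        else:
            slot = len(slot_sizes)  # no free slot: open a new one
            slot_sizes.append(0)
            slot_free_at.append(0)
        slot_sizes[slot] = max(slot_sizes[slot], size)
        slot_free_at[slot] = last_use + 1
        assignment[name] = slot
    return assignment, slot_sizes


# A four-layer chain: each tensor is written at one step and read at the next.
chain = [
    ("input", 4, 0, 1),
    ("conv1", 8, 1, 2),
    ("conv2", 8, 2, 3),
    ("output", 2, 3, 3),
]
assignment, slot_sizes = plan_buffers(chain)
naive_total = sum(size for _, size, _, _ in chain)

print(assignment)
print(f"with reuse: {sum(slot_sizes)} units, without reuse: {naive_total} units")
```

For this chain two slots are enough, because a tensor's buffer is handed to a later tensor as soon as its last reader has run.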

TENSORRT DEPLOYMENT WORKFLOW
Step 1: Optimize trained model
• Trained Neural Network → Import Model → TensorRT Optimizer → Serialize Engine → Optimized Plans (Plan 1, Plan 2, Plan 3)
Step 2: Deploy optimized plans with runtime
• Optimized Plans (Plan 1, Plan 2, Plan 3) → De-serialize Engine → TensorRT Runtime Engine → Deploy Runtime
• Targets: Data center, Automotive, Embedded

EXAMPLE: DEPLOYING TENSORFLOW MODELS WITH TENSORRT
Import, optimize and deploy TensorFlow models using the TensorRT Python API
Steps:
• Start with a frozen TensorFlow model
• Create a model parser
• Optimize model and create a runtime engine
• Perform inference using the optimized runtime engine
[Figure: Deployment and Inference. Labels: New Data, Trained Neural Network, TensorRT Optimizer, Optimized Runtime Engine, Inference Results]
developer.nvidia.com/tensorrt

7 STEPS TO DEPLOYMENT WITH TENSORRT
Step 1: Convert trained model into TensorRT format
Step 2: Create a model parser
Step 3: Register inputs and outputs
Step 4: Optimize model and create a runtime engine
Step 5: Serialize optimized engine
Step 6: De-serialize engine
Step 7: Perform inference
developer.nvidia.com/tensorrt

RECAP: DEPLOYMENT WORKFLOW
Step 1: Optimize trained model
• VGG19 (FP32, FP16, Batch Size 1) → Import Model → Serialize Engine → Plan file keras_vgg19_b1_fp32.engine
Step 2: Deploy optimized plans with runtime
• Plan file keras_vgg19_b1_fp32.engine → De-serialize Engine → TensorRT Runtime Engine → Deploy Runtime
• New flower images → Prediction Results

CHALLENGES ADDRESSED BY TENSORRT
Requirement: High Throughput
TensorRT Delivers: Maximizes inference performance on NVIDIA GPUs
➢ INT8, FP16 Precision Calibration, Layer & Tensor Fusion, Kernel Auto-Tuning
Requirement: Low Response Time
➢ Up to 40x Faster than CPU-Only inference and 18x faster inference of TensorFlow models
➢ Under 7ms real-time latency
Requirement: Power and Memory Efficiency
TensorRT Delivers: Performs target specific optimizations
➢ Platform specific kernels for Embedded (Jetson), Datacenter (Tesla GPUs) and Automotive (DrivePX)
➢ Dynamic Tensor Memory management improves memory re-use
Requirement: Deployment-Grade Solution
TensorRT Delivers: Designed for production environments
➢ No framework overhead, minimal dependencies
➢ Multiple frameworks, Network Definition API
➢ C++, Python API, Custom Layer API
