S8286 : QUICK AND EASY DL WORKFLOW PROOF OF CONCEPT
Alec Gunny, Ken Hester - PowerPoint PPT Presentation


SLIDE 1

Alec Gunny, Ken Hester

S8286 : QUICK AND EASY DL WORKFLOW PROOF OF CONCEPT

SLIDE 2

AGENDA

Deep Learning in Production

  • Current Approaches
  • Deployment Challenges

NVIDIA TensorRT

  • Programmable Inference Accelerator
  • Performance, Optimizations and Features

Example

  • Import, Optimize and Deploy TensorFlow Models with TensorRT

Additional Resources

Q&A

SLIDE 3

SINGLE GPU PLATFORM FOR ALL ACCELERATED WORKLOADS

TESLA V100 - UNIVERSAL GPU: BOOSTS ALL ACCELERATED WORKLOADS
(10M users, 40 years of video/day)

Workloads: HPC, AI Training, AI Inference, Data Analytics

NVIDIA DEEP LEARNING SDK and CUDA libraries: cuDNN, cuBLAS, NCCL, DeepStream SDK

450+ Applications | DGX

SLIDE 4

WHERE TO TRAIN

At Your Desk

On-Prem

In-the-Cloud

SLIDE 5

CURRENT DEPLOYMENT WORKFLOW

TRAINING

Training Data → Training Data Management → Model Assessment → Trained Neural Network

CUDA, NVIDIA Deep Learning SDK (cuDNN, cuBLAS, NCCL)

UNOPTIMIZED DEPLOYMENT options:

  1. Deploy training framework
  2. Deploy custom application using NVIDIA DL SDK
  3. Framework or custom CPU-Only application

SLIDE 6

CHALLENGES WITH CURRENT APPROACHES

Requirement: High Throughput
Challenge: Unable to process high-volume, high-velocity data
➢ Impact: Increased cost ($, time) per inference

Requirement: Low Response Time
Challenge: Applications don't deliver real-time results
➢ Impact: Negatively affects user experience (voice recognition, personalized recommendations, real-time object detection)

Requirement: Power and Memory Efficiency
Challenge: Inefficient applications
➢ Impact: Increased cost (running and cooling), makes deployment infeasible

Requirement: Deployment-Grade Solution
Challenge: Research frameworks not designed for production
➢ Impact: Framework overhead and dependencies increase time to solution and affect productivity

SLIDE 7

NVIDIA DEEP LEARNING SDK and CUDA

developer.nvidia.com/deep-learning-software

NVIDIA DEEP LEARNING SOFTWARE PLATFORM

TRAINING

Training Data → Training Data Management → Model Assessment → Trained Neural Network

INFERENCE

  • Data center: TensorRT
  • Automotive: DriveWorks SDK
  • Embedded: JetPack SDK

SLIDE 8

NVIDIA TENSORRT

Programmable Inference Accelerator

developer.nvidia.com/tensorrt

GPU PLATFORMS: DRIVE PX 2, JETSON TX2, NVIDIA DLA, TESLA P4, TESLA V100

FRAMEWORKS → TensorRT (Optimizer + Runtime) → GPU PLATFORMS

SLIDE 9

NVIDIA TENSORRT PROGRAMMABLE INFERENCING PLATFORM

Inputs: TRT Network API, UFF
TensorRT: Optimizer + Runtime
GPU targets: TESLA V100, DRIVE PX 2, TESLA P4, JETSON TX2, NVIDIA DLA

SLIDE 10

TENSORRT PERFORMANCE

developer.nvidia.com/tensorrt

40x Faster CNNs on V100 vs. CPU-Only, Under 7 ms Latency (ResNet50)

  • CPU-Only: 140 images/sec, 14 ms
  • V100 + TensorFlow: 305 images/sec, 6.67 ms
  • V100 + TensorRT: 5,700 images/sec, 6.83 ms

Inference throughput (images/sec) on ResNet50. V100 + TensorRT: NVIDIA TensorRT (FP16), batch size 39, Tesla V100-SXM2-16GB, E5-2690 v4 @ 2.60GHz, 3.5GHz Turbo (Broadwell), HT On. V100 + TensorFlow: preview of Volta-optimized TensorFlow (FP16), batch size 2, Tesla V100-PCIE-16GB, E5-2690 v4 @ 2.60GHz, 3.5GHz Turbo (Broadwell), HT On. CPU-Only: Intel Xeon-D 1587 Broadwell-E CPU and Intel DL SDK; score doubled to reflect Intel's stated claim of 2x performance improvement on Skylake with AVX512.

140x Faster Language Translation RNNs on V100 vs. CPU-Only Inference (OpenNMT)

  • CPU-Only + Torch: 4 sentences/sec, 280 ms
  • V100 + Torch: 25 sentences/sec, 153 ms
  • V100 + TensorRT: 550 sentences/sec, 117 ms

Inference throughput (sentences/sec) on OpenNMT 692M. V100 + TensorRT: NVIDIA TensorRT (FP32), batch size 64, Tesla V100-PCIE-16GB, E5-2690 v4 @ 2.60GHz, 3.5GHz Turbo (Broadwell), HT On. V100 + Torch: Torch (FP32), batch size 4, Tesla V100-PCIE-16GB, same CPU. CPU-Only: Torch (FP32), batch size 1, Intel E5-2690 v4 @ 2.60GHz, 3.5GHz Turbo (Broadwell), HT On.
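The throughput and latency figures on this slide are consistent with the relation throughput ≈ batch size / latency (batch sizes taken from the configuration notes). A quick sanity check in Python:

```python
# Sanity-check the chart numbers: for a fully pipelined engine,
# throughput (items/sec) is roughly batch_size / latency.

def throughput(batch_size, latency_ms):
    """Items processed per second when each batch takes latency_ms."""
    return batch_size / (latency_ms / 1000.0)

# V100 + TensorRT on ResNet50: batch 39 at 6.83 ms
trt_resnet50 = throughput(39, 6.83)   # ≈ 5,710 images/sec (chart: 5,700)

# V100 + TensorFlow on ResNet50: batch 2 at 6.67 ms
tf_resnet50 = throughput(2, 6.67)     # ≈ 300 images/sec (chart: 305)
```

The small gaps against the chart values are rounding in the reported latencies.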

SLIDE 11

TENSORRT DEPLOYMENT WORKFLOW

Step 1: Optimize trained model
  Trained Neural Network → Import Model → TensorRT Optimizer → Serialize Engine → Optimized Plans (Plan 1, Plan 2, Plan 3)

Step 2: Deploy optimized plans with runtime
  Optimized Plans (Plan 1, Plan 2, Plan 3) → De-serialize Engine → Deploy Runtime (TensorRT Runtime Engine)
  Targets: Embedded, Automotive, Data center
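The two-step workflow can be sketched in Python. Only the plan-file handling below is plain, portable file I/O; the build and de-serialize steps need NVIDIA's `tensorrt` package and a supported GPU, so they are guarded, and the exact builder/runtime API names vary across TensorRT versions.

```python
# Sketch of the two-step TensorRT workflow: build once, serialize the
# engine to a "plan" file, then de-serialize it at deployment time.
# The `tensorrt` import is guarded so the sketch runs without the SDK.
try:
    import tensorrt as trt
except ImportError:
    trt = None

def save_plan(engine_bytes, path):
    # Step 1 output: persist the serialized engine ("plan") to disk.
    with open(path, "wb") as f:
        f.write(engine_bytes)

def load_plan(path):
    # Step 2 input: read the plan back for the runtime to de-serialize.
    with open(path, "rb") as f:
        return f.read()

def deploy(path):
    # Step 2: hand the plan bytes to the TensorRT runtime (sketch only).
    if trt is None:
        raise RuntimeError("TensorRT is not installed")
    plan = load_plan(path)
    # A TensorRT runtime object would de-serialize `plan` into an engine
    # here and create an execution context for inference.
    return plan
```

The plan file is what gets shipped to the embedded, automotive, or data-center target; the training framework never needs to be installed there.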

SLIDE 12

TENSORRT DEPLOYMENT WORKFLOW


SLIDE 13

MODEL IMPORTING

developer.nvidia.com/tensorrt

  • Model Importer (Python/C++ API)
  • Network Definition API (Python/C++ API) — for other frameworks
  • Runtime inference: C++ or Python API

➢ AI Researchers ➢ Data Scientists

Example: Importing a TensorFlow model
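A minimal sketch of that TensorFlow import path, assuming the TensorRT 3-era `uff` converter and `tensorrt` Python bindings; the wrapper function is hypothetical and the converter's argument details vary by TensorRT version.

```python
# Hypothetical wrapper around the TensorFlow import path: freeze the
# graph in TensorFlow, convert it to UFF, then feed the UFF model to the
# TensorRT optimizer. Requires NVIDIA's `tensorrt` and `uff` packages,
# guarded here so the sketch loads without them.
try:
    import tensorrt as trt
    import uff
except ImportError:
    trt = uff = None

def import_tensorflow_model(frozen_graph_path, output_nodes):
    """Convert a frozen TensorFlow graph to a UFF model that the
    TensorRT optimizer can consume (illustrative sketch only)."""
    if trt is None or uff is None:
        raise RuntimeError("the tensorrt and uff packages are required")
    # 1. Convert the frozen TensorFlow graph to UFF.
    uff_model = uff.from_tensorflow_frozen_model(frozen_graph_path,
                                                 output_nodes)
    # 2. The UFF model would then be parsed and built into an engine by
    #    the TensorRT optimizer in a single build call.
    return uff_model
```

Models from other frameworks would instead be described layer by layer through the Network Definition API mentioned above.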

SLIDE 14

TENSORRT OPTIMIZATIONS

  • Kernel Auto-Tuning
  • Layer & Tensor Fusion
  • Dynamic Tensor Memory
  • Weights & Activation Precision Calibration

➢ Optimizations are completely automatic
➢ Performed with a single function call
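What the user controls are a few knobs set before that single build call. The helper below is a hypothetical sketch: its dictionary keys mirror common TensorRT builder attributes but it is not the TensorRT API itself.

```python
# Hypothetical helper bundling the options typically set on the TensorRT
# builder before the one build call that runs all optimizations above.

def builder_settings(max_batch_size=8, fp16=True, workspace_mb=256):
    """Bundle build-time options: batch size bound, reduced precision,
    and scratch memory available to kernel auto-tuning."""
    return {
        "max_batch_size": max_batch_size,          # largest batch the engine must serve
        "fp16_mode": fp16,                         # weights & activation precision
        "max_workspace_size": workspace_mb << 20,  # bytes of auto-tuning scratch space
    }

settings = builder_settings(max_batch_size=39, fp16=True)
```

Everything else (which kernels to use, which layers to fuse, when to allocate tensor memory) is decided by the optimizer automatically.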

SLIDE 15

TENSORRT DEPLOYMENT WORKFLOW


SLIDE 16

NVIDIA TENSORRT 3 NOW AVAILABLE

Key features: Volta TensorCore support, TensorFlow importer, Python API

  • Volta TensorCore Support: 3.7x faster inference on Tesla V100 vs. Tesla P100 under 7 ms real-time latency
  • Import TensorFlow Models: optimize and deploy TensorFlow models up to 18x faster vs. the TensorFlow framework
  • Python API: improved productivity with an easy-to-use Python API for data science workflows

Free download to members of the NVIDIA Developer Program: developer.nvidia.com/tensorrt

SLIDE 17

NVIDIA JETPACK 3.2

SDK for embedded AI computing

  • Deep Learning: TensorRT, cuDNN, DIGITS Workflow
  • Computer Vision: VisionWorks, OpenCV
  • Multimedia: ISP Support, Camera Imaging, Video CODEC
  • GPU Compute: CUDA, CUDA Libs
  • Also includes ROS compatibility, OpenGL, advanced developer tools, and much more

SLIDE 18

Jetson TX2

AI Computer on a Module

  • Advanced tech for intelligent machines
  • Unmatched performance under 10W
  • Smaller than a credit card

DEMO

SLIDE 19

LEARN MORE

  • Jetson: developer.nvidia.com/embedded-computing
  • Success Stories: developer.nvidia.com/embedded/learn/success-stories
  • Partners and Ecosystem: developer.nvidia.com/embedded/community
  • Deep Learning Institute: www.nvidia.com/object/deep-learning-institute.html
  • Two Days To A Demo: developer.nvidia.com/embedded/twodaystoademo
  • Inception Program: www.nvidia.com/inception

SLIDE 20