S8495: DEPLOYING DEEP NEURAL NETWORKS AS-A-SERVICE USING TENSORRT AND NVIDIA-DOCKER


SLIDE 1

S8495: DEPLOYING DEEP NEURAL NETWORKS AS-A-SERVICE USING TENSORRT AND NVIDIA-DOCKER

Prethvi Kashinkunti, Solutions Architect
Alec Gunny, Solutions Architect

SLIDE 2

AGENDA

Deploying Deep Learning Models

  • Current Approaches
  • Production Deployment Challenges

NVIDIA TensorRT as a Deployment Solution

  • Performance, Optimizations and Features

Deploying DL models with TensorRT

  • Import, Optimize and Deploy
  • TensorFlow image classification
  • PyTorch LSTM
  • Caffe object detection

Inference Server Demos

Q&A

SLIDE 3

WHAT DO I DO WITH MY TRAINED DL MODELS?

  • Congrats, you’ve just finished training your DL model (and it works)!
  • My DL serving solution wish list:
    • Can deliver sufficient performance (the key metric!)
    • Is easy to set up
    • Can handle models for multiple use cases from various training frameworks
    • Can be accessed easily by end users

Gain insight from data

SLIDE 4

CURRENT DEPLOYMENT WORKFLOW

TRAINING (CUDA, NVIDIA Deep Learning SDK: cuDNN, cuBLAS, NCCL):
Training Data → Training Data Management → Model Assessment → Trained Neural Network

DEPLOYMENT options for the trained network:

  1. Deploy framework or custom CPU-only application
  2. Deploy training framework on GPU
  3. Deploy custom application using NVIDIA DL SDK

SLIDE 5

DEEP LEARNING AS-A-(EASY) SERVICE

  • Opportunities for optimizing our deployment performance:
    1. High-performance serving infrastructure
    2. Improving model inference performance (we’ll start here)
  • DL-aaS proof of concept:
    • Use NVIDIA TensorRT to create optimized inference engines for our models
      • Freely available as a container on the NVIDIA GPU Cloud (ngc.nvidia.com)
      • More details to come on TensorRT…
    • Create a simple Python Flask application to expose the models via REST endpoints
Proof of Concept

SLIDE 6

DEEP LEARNING AS-A-(EASY) SERVICE

Architecture Diagram:

  • End users send inference requests to the server and receive responses
  • Server with GPU runs the NVIDIA GPU Cloud container (nvcr.io/nvidia/tensorrt:18.01-py2)
  • A Python Flask app exposes RESTful API endpoints backed by TensorRT inference engines:
    • /classify (Keras/TF)
    • /generate (PyTorch)
    • /detect (Caffe)
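For illustration, a client request against one of these endpoints could look like the sketch below. The /classify path comes from the slide; the host placeholder, port, payload field name, and response format are assumptions, since the slides don't specify the request schema.

```python
import requests

# Hypothetical client call to the /classify endpoint exposed by the Flask app.
# The field name "image" and the JSON response shape are assumptions.
with open("flower.jpg", "rb") as f:
    resp = requests.post("http://<server-ip>:5000/classify", files={"image": f})

print(resp.status_code)
print(resp.json())  # e.g. predicted class label and confidence (format assumed)
```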

SLIDE 7

NVIDIA TENSORRT OVERVIEW

SLIDE 8

NVIDIA TENSORRT

Programmable Inference Accelerator

developer.nvidia.com/tensorrt

TensorRT (Optimizer + Runtime) takes trained models from the FRAMEWORKS and deploys them across GPU PLATFORMS: DRIVE PX 2, JETSON TX2, NVIDIA DLA, TESLA P4, TESLA V100

SLIDE 9

TENSORRT OPTIMIZATIONS

  • Kernel Auto-Tuning
  • Layer & Tensor Fusion
  • Dynamic Tensor Memory
  • Weights & Activation Precision Calibration

➢ Optimizations are completely automatic
➢ Performed with a single function call

SLIDE 10

TENSORRT DEPLOYMENT WORKFLOW

Step 1: Optimize trained model
  Trained Neural Network → Import Model → TensorRT Optimizer → Serialize Engine → Optimized Plans (Plan 1, Plan 2, Plan 3)

Step 2: Deploy optimized plans with runtime
  Optimized Plans (Plan 1, Plan 2, Plan 3) → De-serialize Engine → TensorRT Runtime Engine → Deploy Runtime (Embedded, Automotive, Data center)
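As a concrete illustration of the two steps, here is a minimal sketch using the TensorRT Python API. The calls follow the later trt.Builder/trt.Runtime style rather than the older interface shipped in the 18.01 container, and the trivial one-layer network is a stand-in for a real model produced by one of the importer paths shown later.

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

# Step 1: optimize a trained model into a serialized engine ("plan").
# A trivial ReLU network stands in for a network built by the UFF/Caffe parsers
# or the Network Definition API.
builder = trt.Builder(TRT_LOGGER)
network = builder.create_network()
inp = network.add_input("data", trt.float32, (3, 224, 224))
relu = network.add_activation(inp, trt.ActivationType.RELU)
network.mark_output(relu.get_output(0))
builder.max_batch_size = 1
builder.max_workspace_size = 1 << 20
engine = builder.build_cuda_engine(network)

with open("model.plan", "wb") as f:
    f.write(engine.serialize())

# Step 2: on the deployment target, de-serialize the plan with the runtime.
runtime = trt.Runtime(TRT_LOGGER)
with open("model.plan", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())
```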

SLIDE 11

IMPORTING MODELS TO TENSORRT

SLIDE 12

TENSORRT DEPLOYMENT WORKFLOW

Step 1: Optimize trained model
  Trained Neural Network → Import Model → TensorRT Optimizer → Serialize Engine → Optimized Plans (Plan 1, Plan 2, Plan 3)

Step 2: Deploy optimized plans with runtime
  Optimized Plans (Plan 1, Plan 2, Plan 3) → De-serialize Engine → TensorRT Runtime Engine → Deploy Runtime (Embedded, Automotive, Data center)

SLIDE 13

MODEL IMPORTING PATHS

developer.nvidia.com/tensorrt

  • Model Importer (Python/C++ API)
  • Network Definition API (Python/C++ API) for other frameworks
  • Runtime inference via the C++ or Python API

➢ AI Researchers ➢ Data Scientists

SLIDE 14

VGG19: KERAS/TF

  • Model is a Keras VGG19 model pretrained on ImageNet, fine-tuned on the flowers dataset from TF-Slim
  • Using the TF backend, freeze the graph to convert weight variables to constants
  • Import into TensorRT using the built-in TF -> UFF -> TRT parser (see the sketch below)

Image classification
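A minimal sketch of that path is below, assuming TF 1.x with the Keras backend. The node names ("input_1", "predictions/Softmax"), file names, and the newer trt.UffParser-style API (the 18.01 container used an older interface) are all illustrative assumptions, since the actual fine-tuned model isn't given in the slides.

```python
import uff
import tensorrt as trt
import tensorflow as tf
from tensorflow.python.framework import graph_util, graph_io

# 1) Freeze the Keras/TF graph: convert weight variables to constants (TF 1.x API).
sess = tf.keras.backend.get_session()
frozen = graph_util.convert_variables_to_constants(
    sess, sess.graph.as_graph_def(), ["predictions/Softmax"])  # output node name assumed
graph_io.write_graph(frozen, ".", "vgg19_frozen.pb", as_text=False)

# 2) TF -> UFF -> TRT: convert the frozen graph to UFF, then parse it and build an engine.
uff.from_tensorflow_frozen_model("vgg19_frozen.pb",
                                 ["predictions/Softmax"],
                                 output_filename="vgg19.uff")

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(TRT_LOGGER)
network = builder.create_network()
parser = trt.UffParser()
parser.register_input("input_1", (3, 224, 224))   # CHW input shape for VGG19
parser.register_output("predictions/Softmax")
parser.parse("vgg19.uff", network)
engine = builder.build_cuda_engine(network)
```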

SLIDE 15

CHAR_RNN: PYTORCH

  • Model is a character-level RNN (using an LSTM cell) trained with PyTorch
  • Training data: .py files from the PyTorch source code
  • Export the PyTorch model weights to NumPy and permute them to match the FICO weight ordering used by cuDNN/TensorRT (see the sketch below)
  • Import into TensorRT using the Network Definition API

Text Generation
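A sketch of the export step is below. The model definition and hyperparameters are placeholders (the slides don't give them), and the gate re-ordering is only noted in a comment, since the exact permutation depends on how the TensorRT network is defined.

```python
import numpy as np
import torch.nn as nn

# Placeholder char-RNN; the real architecture/hyperparameters from the talk are not
# given in the slides.
model = nn.LSTM(input_size=128, hidden_size=512, num_layers=2)

# PyTorch packs each LSTM weight matrix with gates in (i, f, g, o) order; before
# feeding the arrays to TensorRT's Network Definition API they generally need to be
# sliced per gate and re-ordered to match the cuDNN/TensorRT gate convention.
weights = {name: p.detach().cpu().numpy() for name, p in model.state_dict().items()}
np.savez("char_rnn_weights.npz", **weights)
```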

SLIDE 16

SINGLE SHOT DETECTOR: CAFFE

  • Model is an SSD object detection model trained with Caffe
  • Training data: annotated traffic intersection data
  • The network includes several layers unsupported by TensorRT (Permute, PriorBox, etc.), which requires the custom layer API!
  • Use the built-in Caffe network parser to import the network along with the custom layers (see the sketch below)

Object Detection
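The Caffe import path, minus the custom-layer plugins, might look like the sketch below. File names and the output blob name are assumptions, the API shown is the newer trt.CaffeParser Python interface rather than the one in the 18.01 container, and the Permute/PriorBox plugins (implemented through the custom layer API, C++ in this era) are deliberately omitted.

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(TRT_LOGGER)
network = builder.create_network()

# Parse the Caffe prototxt/caffemodel into a TensorRT network.
# NOTE: unsupported layers (Permute, PriorBox, ...) must be supplied as custom-layer
# plugins, which this sketch omits.
parser = trt.CaffeParser()
model_tensors = parser.parse(deploy="ssd_deploy.prototxt",   # file names assumed
                             model="ssd.caffemodel",
                             network=network,
                             dtype=trt.float32)

# Output blob name is an assumption; SSD Caffe models commonly expose "detection_out".
network.mark_output(model_tensors.find("detection_out"))
engine = builder.build_cuda_engine(network)
```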

SLIDE 17

DESIGNING THE INFERENCE SERVER

Using the TensorRT Python API, we can wrap all of these inference engines together in a simple Flask application

Similar example code is provided in the TensorRT container

Create three endpoints to expose models:

/classify /generate /detect

Putting it all together…
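A stripped-down version of such an app is sketched below. The three endpoint paths come from the slides; the handler names, request/response formats, and the stubbed run_*_engine functions standing in for the TensorRT execution code are illustrative assumptions.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Stubs standing in for the three TensorRT inference engines; in the real app these
# would run the deserialized engines (image classifier, char-RNN, SSD detector).
def run_classify_engine(image_bytes):
    return {"class": "daisy", "confidence": 0.97}   # placeholder result

def run_generate_engine(seed_text):
    return {"generated": seed_text + " ..."}        # placeholder result

def run_detect_engine(image_bytes):
    return {"detections": []}                        # placeholder result

@app.route("/classify", methods=["POST"])
def classify():
    return jsonify(run_classify_engine(request.files["image"].read()))

@app.route("/generate", methods=["POST"])
def generate():
    seed = (request.get_json(silent=True) or {}).get("seed", "")
    return jsonify(run_generate_engine(seed))

@app.route("/detect", methods=["POST"])
def detect():
    return jsonify(run_detect_engine(request.files["image"].read()))

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```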

SLIDE 18

SCALING IT UP

SLIDE 19

DESIGNING THE INFERENCE SERVER

  • Our DL-aaS proof of concept works!
  • One main drawback: single-threaded serving
  • Instead, use tools like Gunicorn and Nginx to easily scale the inference workload across more compute (see the sketch below)
  • Multi-threaded, containerized workers, each tied to its own GPU
  • Straightforward to integrate with the Flask app

Easy improvements for better performance

A single entry point (<IP>:8000) handles load balancing among the workers (<IP>:5000, <IP>:5001, <IP>:5002, <IP>:5003)
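One way to wire this up is a Gunicorn configuration file, which is plain Python. The sketch below is an assumption about how each containerized worker could be pinned to its own GPU, not code from the presentation; the environment variable names, port, and GPU index are placeholders, and the Nginx configuration that proxies the single entry point is not shown.

```python
# gunicorn_conf.py (hypothetical): each containerized worker runs
#   gunicorn -c gunicorn_conf.py app:app
# and an Nginx entry point load-balances across the worker ports.
import os

# Port and GPU index per worker container; both are placeholders supplied at launch,
# e.g. WORKER_PORT=5000 GPU_ID=0 for the first container.
bind = "0.0.0.0:" + os.environ.get("WORKER_PORT", "5000")
workers = 1       # one serving process per container, tied to one GPU
timeout = 120     # allow longer-running inference requests

def post_fork(server, worker):
    # Restrict the forked worker to its assigned GPU before any CUDA context is
    # created, so each container's TensorRT engines run on their own device.
    os.environ["CUDA_VISIBLE_DEVICES"] = os.environ.get("GPU_ID", "0")
```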

SLIDE 20

GETTING CLOSER TO PRODUCTION

  • Previous example mostly addresses our needs, but has room for improvement…
  • Potential improvements:
    • Batching of requests
    • Autoscaling of compute resources based on workload
    • Improving performance of pre/post-processing around TensorRT inference (e.g., image resizing)
    • Better UI/UX for the client side

Areas for potential improvement

SLIDE 21

TENSORRT KEY TAKEAWAYS

✓ Generate optimized, deployment-ready runtime engines for low-latency inference
✓ Import models trained using Caffe or TensorFlow, or use the Network Definition API
✓ Deploy in FP32 or reduced precision (INT8, FP16) for higher throughput
✓ Optimize frequently used layers and integrate user-defined custom layers

SLIDE 22

LEARN MORE

  • GPU Inference Whitepaper: https://images.nvidia.com/content/pdf/inference-technical-overview.pdf
  • Blog post on using TensorRT 3.0 for TF model inference: https://devblogs.nvidia.com/tensorrt-3-faster-tensorflow-inference/
  • TensorRT documentation: http://docs.nvidia.com/deeplearning/sdk/index.html#inference

Helpful Links

SLIDE 23

LEARN MORE

PRODUCT PAGE: developer.nvidia.com/tensorrt

DOCUMENTATION: docs.nvidia.com/deeplearning/sdk

TRAINING: nvidia.com/dli

SLIDE 24

Q&A

SLIDE 25