S8495: DEPLOYING DEEP NEURAL NETWORKS AS-A-SERVICE USING TENSORRT AND NVIDIA-DOCKER


SLIDE 1

S8495: DEPLOYING DEEP NEURAL NETWORKS AS-A-SERVICE USING TENSORRT AND NVIDIA-DOCKER

Prethvi Kashinkunti, Solutions Architect
Alec Gunny, Solutions Architect

SLIDE 2

AGENDA

Deploying Deep Learning Models

  • Current Approaches
  • Production Deployment Challenges

NVIDIA TensorRT as a Deployment Solution

  • Performance, Optimizations and Features

Deploying DL models with TensorRT

  • Import, Optimize and Deploy
  • TensorFlow image classification
  • PyTorch LSTM
  • Caffe object detection

Inference Server Demos

Q&A

SLIDE 3

WHAT DO I DO WITH MY TRAINED DL MODELS?

  • Congrats, you’ve just finished training your DL model (and it works)!
  • My DL serving solution wish list:
    • Can deliver sufficient performance (the key metric!)
    • Is easy to set up
    • Can handle models for multiple use cases from various training frameworks
    • Can be accessed easily by end users

Gain insight from data

SLIDE 4

CURRENT DEPLOYMENT WORKFLOW

TRAINING (CUDA, NVIDIA Deep Learning SDK: cuDNN, cuBLAS, NCCL):
Training Data → Training Data Management → Model Assessment → Trained Neural Network

DEPLOYMENT options for the trained network:

  1. Deploy framework or custom CPU-only application
  2. Deploy training framework on GPU
  3. Deploy custom application using NVIDIA DL SDK

SLIDE 5

DEEP LEARNING AS-A-(EASY) SERVICE

  • Opportunities for optimizing our deployment performance:
    1. High-performance serving infrastructure
    2. Improving model inference performance (we’ll start here)
  • DL-aaS proof of concept:
    • Use NVIDIA TensorRT to create optimized inference engines for our models
      • Freely available as a container on the NVIDIA GPU Cloud (ngc.nvidia.com)
      • More details to come on TensorRT…
    • Create a simple Python Flask application to expose the models via REST endpoints
Proof of Concept

SLIDE 6

DEEP LEARNING AS-A-(EASY) SERVICE

Architecture Diagram:

  • End users send inference requests to the server and receive responses
  • Server with GPU runs the NVIDIA GPU Cloud container (nvcr.io/nvidia/tensorrt:18.01-py2)
  • A Python Flask app exposes RESTful API endpoints backed by TensorRT inference engines:
    • /classify (Keras/TF)
    • /generate (PyTorch)
    • /detect (Caffe)
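For illustration, a client request against one of these endpoints could look like the sketch below. The /classify path comes from the slide; the host placeholder, port, payload field name, and response format are assumptions, since the slides don't specify the request schema.

```python
import requests

# Hypothetical client call to the /classify endpoint exposed by the Flask app.
# The field name "image" and the JSON response shape are assumptions.
with open("flower.jpg", "rb") as f:
    resp = requests.post("http://<server-ip>:5000/classify", files={"image": f})

print(resp.status_code)
print(resp.json())  # e.g. predicted class label and confidence (format assumed)
```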

SLIDE 7

NVIDIA TENSORRT OVERVIEW

SLIDE 8

NVIDIA TENSORRT

Programmable Inference Accelerator

developer.nvidia.com/tensorrt

TensorRT (Optimizer + Runtime) takes trained models from the FRAMEWORKS and deploys them across GPU PLATFORMS: DRIVE PX 2, JETSON TX2, NVIDIA DLA, TESLA P4, TESLA V100

SLIDE 9

TENSORRT OPTIMIZATIONS

  • Kernel Auto-Tuning
  • Layer & Tensor Fusion
  • Dynamic Tensor Memory
  • Weights & Activation Precision Calibration

➢ Optimizations are completely automatic
➢ Performed with a single function call

SLIDE 10

TENSORRT DEPLOYMENT WORKFLOW

Step 1: Optimize trained model
  Trained Neural Network → Import Model → TensorRT Optimizer → Serialize Engine → Optimized Plans (Plan 1, Plan 2, Plan 3)

Step 2: Deploy optimized plans with runtime
  Optimized Plans (Plan 1, Plan 2, Plan 3) → De-serialize Engine → TensorRT Runtime Engine → Deploy Runtime (Embedded, Automotive, Data center)
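As a concrete illustration of the two steps, here is a minimal sketch using the TensorRT Python API. The calls follow the later trt.Builder/trt.Runtime style rather than the older interface shipped in the 18.01 container, and the trivial one-layer network is a stand-in for a real model produced by one of the importer paths shown later.

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

# Step 1: optimize a trained model into a serialized engine ("plan").
# A trivial ReLU network stands in for a network built by the UFF/Caffe parsers
# or the Network Definition API.
builder = trt.Builder(TRT_LOGGER)
network = builder.create_network()
inp = network.add_input("data", trt.float32, (3, 224, 224))
relu = network.add_activation(inp, trt.ActivationType.RELU)
network.mark_output(relu.get_output(0))
builder.max_batch_size = 1
builder.max_workspace_size = 1 << 20
engine = builder.build_cuda_engine(network)

with open("model.plan", "wb") as f:
    f.write(engine.serialize())

# Step 2: on the deployment target, de-serialize the plan with the runtime.
runtime = trt.Runtime(TRT_LOGGER)
with open("model.plan", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())
```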

SLIDE 11

IMPORTING MODELS TO TENSORRT

SLIDE 12

TENSORRT DEPLOYMENT WORKFLOW

Step 1: Optimize trained model
  Trained Neural Network → Import Model → TensorRT Optimizer → Serialize Engine → Optimized Plans (Plan 1, Plan 2, Plan 3)

Step 2: Deploy optimized plans with runtime
  Optimized Plans (Plan 1, Plan 2, Plan 3) → De-serialize Engine → TensorRT Runtime Engine → Deploy Runtime (Embedded, Automotive, Data center)

SLIDE 13

MODEL IMPORTING PATHS

developer.nvidia.com/tensorrt

  • Model Importer (Python/C++ API)
  • Network Definition API (Python/C++ API) for other frameworks
  • Runtime inference via the C++ or Python API

➢ AI Researchers ➢ Data Scientists

SLIDE 14

VGG19: KERAS/TF

  • Model is a Keras VGG19 model pretrained on ImageNet, fine-tuned on the flowers dataset from TF-Slim
  • Using the TF backend, freeze the graph to convert weight variables to constants
  • Import into TensorRT using the built-in TF -> UFF -> TRT parser (see the sketch below)

Image classification
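A minimal sketch of that path is below, assuming TF 1.x with the Keras backend. The node names ("input_1", "predictions/Softmax"), file names, and the newer trt.UffParser-style API (the 18.01 container used an older interface) are all illustrative assumptions, since the actual fine-tuned model isn't given in the slides.

```python
import uff
import tensorrt as trt
import tensorflow as tf
from tensorflow.python.framework import graph_util, graph_io

# 1) Freeze the Keras/TF graph: convert weight variables to constants (TF 1.x API).
sess = tf.keras.backend.get_session()
frozen = graph_util.convert_variables_to_constants(
    sess, sess.graph.as_graph_def(), ["predictions/Softmax"])  # output node name assumed
graph_io.write_graph(frozen, ".", "vgg19_frozen.pb", as_text=False)

# 2) TF -> UFF -> TRT: convert the frozen graph to UFF, then parse it and build an engine.
uff.from_tensorflow_frozen_model("vgg19_frozen.pb",
                                 ["predictions/Softmax"],
                                 output_filename="vgg19.uff")

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(TRT_LOGGER)
network = builder.create_network()
parser = trt.UffParser()
parser.register_input("input_1", (3, 224, 224))   # CHW input shape for VGG19
parser.register_output("predictions/Softmax")
parser.parse("vgg19.uff", network)
engine = builder.build_cuda_engine(network)
```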

SLIDE 15

CHAR_RNN: PYTORCH

  • Model is a character-level RNN (using an LSTM cell) trained with PyTorch
  • Training data: .py files from the PyTorch source code
  • Export the PyTorch model weights to NumPy and permute them to match the FICO weight ordering used by cuDNN/TensorRT (see the sketch below)
  • Import into TensorRT using the Network Definition API

Text Generation
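A sketch of the export step is below. The model definition and hyperparameters are placeholders (the slides don't give them), and the gate re-ordering is only noted in a comment, since the exact permutation depends on how the TensorRT network is defined.

```python
import numpy as np
import torch.nn as nn

# Placeholder char-RNN; the real architecture/hyperparameters from the talk are not
# given in the slides.
model = nn.LSTM(input_size=128, hidden_size=512, num_layers=2)

# PyTorch packs each LSTM weight matrix with gates in (i, f, g, o) order; before
# feeding the arrays to TensorRT's Network Definition API they generally need to be
# sliced per gate and re-ordered to match the cuDNN/TensorRT gate convention.
weights = {name: p.detach().cpu().numpy() for name, p in model.state_dict().items()}
np.savez("char_rnn_weights.npz", **weights)
```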

SLIDE 16

SINGLE SHOT DETECTOR: CAFFE

  • Model is an SSD object detection model trained with Caffe
  • Training data: annotated traffic intersection data
  • The network includes several layers unsupported by TensorRT (Permute, PriorBox, etc.), which requires the custom layer API!
  • Use the built-in Caffe network parser to import the network along with the custom layers (see the sketch below)

Object Detection
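The Caffe import path, minus the custom-layer plugins, might look like the sketch below. File names and the output blob name are assumptions, the API shown is the newer trt.CaffeParser Python interface rather than the one in the 18.01 container, and the Permute/PriorBox plugins (implemented through the custom layer API, C++ in this era) are deliberately omitted.

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(TRT_LOGGER)
network = builder.create_network()

# Parse the Caffe prototxt/caffemodel into a TensorRT network.
# NOTE: unsupported layers (Permute, PriorBox, ...) must be supplied as custom-layer
# plugins, which this sketch omits.
parser = trt.CaffeParser()
model_tensors = parser.parse(deploy="ssd_deploy.prototxt",   # file names assumed
                             model="ssd.caffemodel",
                             network=network,
                             dtype=trt.float32)

# Output blob name is an assumption; SSD Caffe models commonly expose "detection_out".
network.mark_output(model_tensors.find("detection_out"))
engine = builder.build_cuda_engine(network)
```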

SLIDE 17

DESIGNING THE INFERENCE SERVER

Using the TensorRT Python API, we can wrap all of these inference engines together in a simple Flask application

Similar example code is provided in the TensorRT container

Create three endpoints to expose models:

/classify /generate /detect

Putting it all together…
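A stripped-down version of such an app is sketched below. The three endpoint paths come from the slides; the handler names, request/response formats, and the stubbed run_*_engine functions standing in for the TensorRT execution code are illustrative assumptions.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Stubs standing in for the three TensorRT inference engines; in the real app these
# would run the deserialized engines (image classifier, char-RNN, SSD detector).
def run_classify_engine(image_bytes):
    return {"class": "daisy", "confidence": 0.97}   # placeholder result

def run_generate_engine(seed_text):
    return {"generated": seed_text + " ..."}        # placeholder result

def run_detect_engine(image_bytes):
    return {"detections": []}                        # placeholder result

@app.route("/classify", methods=["POST"])
def classify():
    return jsonify(run_classify_engine(request.files["image"].read()))

@app.route("/generate", methods=["POST"])
def generate():
    seed = (request.get_json(silent=True) or {}).get("seed", "")
    return jsonify(run_generate_engine(seed))

@app.route("/detect", methods=["POST"])
def detect():
    return jsonify(run_detect_engine(request.files["image"].read()))

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```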

SLIDE 18

SCALING IT UP

SLIDE 19

DESIGNING THE INFERENCE SERVER

  • Our DL-aaS proof of concept works!
  • One main drawback: single-threaded serving
  • Instead, use tools like Gunicorn and Nginx to easily scale the inference workload across more compute (see the sketch below)
  • Multi-threaded, containerized workers, each tied to its own GPU
  • Straightforward to integrate with the Flask app

Easy improvements for better performance

A single entry point (<IP>:8000) handles load balancing among the workers (<IP>:5000, <IP>:5001, <IP>:5002, <IP>:5003)
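One way to wire this up is a Gunicorn configuration file, which is plain Python. The sketch below is an assumption about how each containerized worker could be pinned to its own GPU, not code from the presentation; the environment variable names, port, and GPU index are placeholders, and the Nginx configuration that proxies the single entry point is not shown.

```python
# gunicorn_conf.py (hypothetical): each containerized worker runs
#   gunicorn -c gunicorn_conf.py app:app
# and an Nginx entry point load-balances across the worker ports.
import os

# Port and GPU index per worker container; both are placeholders supplied at launch,
# e.g. WORKER_PORT=5000 GPU_ID=0 for the first container.
bind = "0.0.0.0:" + os.environ.get("WORKER_PORT", "5000")
workers = 1       # one serving process per container, tied to one GPU
timeout = 120     # allow longer-running inference requests

def post_fork(server, worker):
    # Restrict the forked worker to its assigned GPU before any CUDA context is
    # created, so each container's TensorRT engines run on their own device.
    os.environ["CUDA_VISIBLE_DEVICES"] = os.environ.get("GPU_ID", "0")
```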

SLIDE 20

GETTING CLOSER TO PRODUCTION

  • Previous example mostly addresses our needs, but has room for improvement…
  • Potential improvements:
    • Batching of requests
    • Autoscaling of compute resources based on workload
    • Improving performance of pre/post-processing around TensorRT inference (e.g., image resizing)
    • Better UI/UX for the client side

Areas for potential improvement

SLIDE 21

TENSORRT KEY TAKEAWAYS

✓ Generate optimized, deployment-ready runtime engines for low-latency inference
✓ Import models trained using Caffe or TensorFlow, or use the Network Definition API
✓ Deploy in FP32 or reduced precision (INT8, FP16) for higher throughput
✓ Optimize frequently used layers and integrate user-defined custom layers

SLIDE 22

LEARN MORE

  • GPU Inference Whitepaper: https://images.nvidia.com/content/pdf/inference-technical-overview.pdf
  • Blog post on using TensorRT 3.0 for TF model inference: https://devblogs.nvidia.com/tensorrt-3-faster-tensorflow-inference/
  • TensorRT documentation: http://docs.nvidia.com/deeplearning/sdk/index.html#inference

Helpful Links

SLIDE 23

LEARN MORE

PRODUCT PAGE: developer.nvidia.com/tensorrt

DOCUMENTATION: docs.nvidia.com/deeplearning/sdk

TRAINING: nvidia.com/dli

SLIDE 24

Q&A

SLIDE 25