S8495: Deploying Deep Neural Networks As-a-Service Using TensorRT and NVIDIA-Docker



  1. S8495: DEPLOYING DEEP NEURAL NETWORKS AS-A-SERVICE USING TENSORRT AND NVIDIA-DOCKER. Prethvi Kashinkunti, Solutions Architect; Alec Gunny, Solutions Architect

  2. AGENDA
     - Deploying deep learning models: current approaches; production deployment challenges
     - NVIDIA TensorRT as a deployment solution: performance, optimizations and features
     - Deploying DL models with TensorRT: import, optimize and deploy (TensorFlow image classification, PyTorch LSTM, Caffe object detection)
     - Inference server demos
     - Q&A

  3. WHAT DO I DO WITH MY TRAINED DL MODELS? Gain insight from data. Congrats, you've just finished training your DL model (and it works)! My DL serving solution wish list:
     - Delivers sufficient performance (the key metric!)
     - Is easy to set up
     - Can handle models for multiple use cases from various training frameworks
     - Can be accessed easily by end users

  4. CURRENT DEPLOYMENT WORKFLOW
     (Diagram: the training side runs data management, training, and model assessment on CUDA and the NVIDIA Deep Learning SDK (cuDNN, cuBLAS, NCCL) to produce a trained neural network. Deployment options: 1) deploy a framework or custom CPU-only application, 2) deploy the training framework on GPU, 3) deploy a custom application using the NVIDIA DL SDK.)

  5. DEEP LEARNING AS-A-(EASY) SERVICE: Proof of Concept
     Opportunities for optimizing our deployment performance:
     1. High-performance serving infrastructure
     2. Improving model inference performance (we'll start here)
     DL-aaS proof of concept:
     - Use NVIDIA TensorRT to create optimized inference engines for our models; TensorRT is freely available as a container on NVIDIA GPU Cloud (ngc.nvidia.com), with more details to come
     - Create a simple Python Flask application to expose the models via REST endpoints

  6. DEEP LEARNING AS-A-(EASY) SERVICE: Architecture Diagram
     (Diagram: end users send inference requests to RESTful API endpoints served by a Python Flask app on a GPU server, running inside the NVIDIA GPU Cloud container nvcr.io/nvidia/tensorrt:18.01-py2. The TensorRT inference engines are exposed as /classify (Keras/TF), /generate (PyTorch), and /detect (Caffe).)

  7. NVIDIA TENSORRT OVERVIEW

  8. NVIDIA TENSORRT: Programmable Inference Accelerator
     (Diagram: the TensorRT Optimizer and Runtime sit between training frameworks and GPU platforms such as Tesla V100, Tesla P4, Jetson TX2, DRIVE PX 2, and the NVIDIA DLA.) developer.nvidia.com/tensorrt

  9. TENSORRT OPTIMIZATIONS
     - Layer & tensor fusion
     - Weights & activation precision calibration
     - Kernel auto-tuning
     - Dynamic tensor memory
     ➢ Optimizations are completely automatic
     ➢ Performed with a single function call

  10. TENSORRT DEPLOYMENT WORKFLOW
      Step 1: Optimize the trained model. Import the trained neural network into the TensorRT Optimizer and serialize the optimized engines to plan files.
      Step 2: Deploy the optimized plans with the runtime. The TensorRT Runtime Engine de-serializes the plans and runs inference in the data center, in automotive platforms, or on embedded devices.
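The two steps map onto a small amount of code. Below is a minimal sketch using the TensorRT Python API; the serialize/de-serialize calls follow the API of later TensorRT releases (exact names vary by version), and the plan file path is hypothetical.

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

# Step 1: once the TensorRT builder has produced an optimized engine,
# serialize it to a plan file that can be shipped to the deployment target.
def save_plan(engine, path="model.plan"):
    with open(path, "wb") as f:
        f.write(engine.serialize())

# Step 2: at deployment time, de-serialize the plan with the TensorRT runtime
# and use the resulting engine for inference.
def load_plan(path="model.plan"):
    with open(path, "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
        return runtime.deserialize_cuda_engine(f.read())
```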

  11. IMPORTING MODELS TO TENSORRT

  12. TENSORRT DEPLOYMENT WORKFLOW (revisited: optimize the trained model into serialized plans, then de-serialize and deploy the plans with the TensorRT runtime in the data center, automotive, or embedded targets)

  13. MODEL IMPORTING PATHS (for AI researchers and data scientists)
      - Import trained models from supported frameworks through the model importers (Python/C++ API)
      - Describe models from other frameworks directly with the Network Definition API (Python/C++ API)
      - Run inference through the C++ or Python runtime API
      developer.nvidia.com/tensorrt

  14. VGG19: KERAS/TF (image classification)
      - Model: Keras VGG19 pretrained on ImageNet, fine-tuned on the flowers dataset from TF-Slim
      - Using the TF backend, freeze the graph to convert weight variables to constants
      - Import into TensorRT using the built-in TF -> UFF -> TRT parser (see the sketch below)
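As a rough illustration of that path, the sketch below freezes a Keras/TF graph and feeds it through the UFF parser. It assumes TF 1.x-era freezing utilities and the UFF-based TensorRT Python API (removed in recent TensorRT releases); the input/output node names and sizes are placeholders, not taken from the talk.

```python
import tensorflow as tf           # TF 1.x-style graph freezing
import uff                        # UFF converter shipped with older TensorRT
import tensorrt as trt

OUTPUT_NODE = "predictions/Softmax"   # placeholder node name

# 1. Freeze the Keras/TF graph: convert weight variables to constants.
sess = tf.keras.backend.get_session()
frozen = tf.graph_util.convert_variables_to_constants(
    sess, sess.graph_def, output_node_names=[OUTPUT_NODE])

# 2. Convert the frozen GraphDef to UFF.
uff_model = uff.from_tensorflow(graphdef=frozen, output_nodes=[OUTPUT_NODE])

# 3. Parse the UFF model and build an optimized TensorRT engine.
TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
with trt.Builder(TRT_LOGGER) as builder, builder.create_network() as network, \
     trt.UffParser() as parser:
    parser.register_input("input_1", (3, 224, 224))   # CHW input, name assumed
    parser.register_output(OUTPUT_NODE)
    parser.parse_buffer(uff_model, network)
    builder.max_batch_size = 8
    builder.max_workspace_size = 1 << 30               # 1 GiB scratch space
    engine = builder.build_cuda_engine(network)
```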

  15. CHAR_RNN: PYTORCH (text generation)
      - Model: character-level RNN (LSTM cell) trained with PyTorch
      - Training data: .py files from the PyTorch source code
      - Export the PyTorch model weights to NumPy and permute them to match the FICO weight ordering used by cuDNN/TensorRT (see the sketch below)
      - Import into TensorRT using the Network Definition API
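The weight-export step is plain PyTorch/NumPy. The sketch below pulls the LSTM weight matrices out as NumPy arrays and reorders the four gate blocks: PyTorch concatenates them as (input, forget, cell, output), and the target order is written here to follow the slide's "FICO" mnemonic as an assumption; adjust it to whatever your cuDNN/TensorRT version expects.

```python
import numpy as np
import torch.nn as nn

HIDDEN = 256
lstm = nn.LSTM(input_size=128, hidden_size=HIDDEN, num_layers=1)

PYTORCH_ORDER = ("i", "f", "g", "o")   # PyTorch's gate-block ordering
TARGET_ORDER = ("f", "i", "g", "o")    # assumed "FICO" target ordering

def reorder_gates(weight):
    """Split a (4*hidden, ...) array into gate blocks and re-concatenate them."""
    blocks = dict(zip(PYTORCH_ORDER, np.split(weight, 4, axis=0)))
    return np.concatenate([blocks[g] for g in TARGET_ORDER], axis=0)

# Export to NumPy and permute; the results feed the Network Definition API.
w_ih = reorder_gates(lstm.weight_ih_l0.detach().numpy())   # (4*hidden, input)
w_hh = reorder_gates(lstm.weight_hh_l0.detach().numpy())   # (4*hidden, hidden)
b_ih = reorder_gates(lstm.bias_ih_l0.detach().numpy())     # (4*hidden,)
b_hh = reorder_gates(lstm.bias_hh_l0.detach().numpy())
```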

  16. SINGLE SHOT DETECTOR: CAFFE (object detection)
      - Model: SSD object detection model trained with Caffe
      - Training data: annotated traffic intersection data
      - The network includes several layers unsupported by TensorRT (Permute, PriorBox, etc.), which requires the custom layer API
      - Use the built-in Caffe network parser to import the network along with the custom layers (see the sketch below)
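A rough sketch of that import path with the TensorRT Python API is below. The file names and output blob name are placeholders, and it assumes a TensorRT build whose plugin registry already provides the SSD layers; otherwise the unsupported layers have to be implemented through the custom layer / plugin API.

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
trt.init_libnvinfer_plugins(TRT_LOGGER, "")   # load the registered TensorRT plugins

with trt.Builder(TRT_LOGGER) as builder, builder.create_network() as network, \
     trt.CaffeParser() as parser:
    # deploy.prototxt / ssd.caffemodel are hypothetical file names
    blob_to_tensor = parser.parse("deploy.prototxt", "ssd.caffemodel",
                                  network, trt.float32)
    network.mark_output(blob_to_tensor.find("detection_out"))  # output blob assumed
    builder.max_workspace_size = 1 << 30
    engine = builder.build_cuda_engine(network)
```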

  17. DESIGNING THE INFERENCE SERVER: Putting it all together
      - Using the TensorRT Python API, we can wrap all of these inference engines together into a simple Flask application (similar example code is provided in the TensorRT container)
      - Create three endpoints to expose the models: /classify, /generate, /detect (see the sketch below)
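A minimal sketch of that Flask front end is below; the run_*() helpers are placeholder stand-ins for the TensorRT engine execution code built on the previous slides.

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

# Placeholder helpers: in the real server these would execute the TensorRT
# engines built earlier (Keras/TF VGG19, PyTorch char-RNN, Caffe SSD).
def run_classifier(image_bytes):
    return {"class": "placeholder"}

def run_generator(payload):
    return {"text": "placeholder"}

def run_detector(image_bytes):
    return {"boxes": []}

@app.route("/classify", methods=["POST"])
def classify():
    return jsonify(run_classifier(request.data))

@app.route("/generate", methods=["POST"])
def generate():
    return jsonify(run_generator(request.get_json()))

@app.route("/detect", methods=["POST"])
def detect():
    return jsonify(run_detector(request.data))

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```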

  18. SCALING IT UP

  19. DESIGNING THE INFERENCE SERVER: Easy improvements for better performance
      - Our DL-aaS proof of concept works, but it has one main drawback: single-threaded serving
      - Instead, tools like Gunicorn and Nginx make it easy to scale the inference workload across more compute, and are straightforward to integrate with the Flask app (see the configuration sketch below)
      (Diagram: a single entrypoint at <IP>:8000 handles load balancing among multithreaded containerized workers at <IP>:5000 through <IP>:5003, each tied to its own GPU.)
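One way to sketch this in Python is a Gunicorn configuration file for each containerized worker pool, with Nginx (not shown) acting as the single load-balancing entrypoint. The file name, worker count, and GPU_ID environment variable are illustrative assumptions, not values from the talk.

```python
# gunicorn.conf.py: launched with e.g. `gunicorn -c gunicorn.conf.py app:app`
import os

bind = "0.0.0.0:5000"   # each container exposes its own port (5000-5003 in the diagram)
workers = 4             # multiple worker processes instead of single-threaded Flask

def post_fork(server, worker):
    # Pin this container's workers to a single GPU; GPU_ID is an assumed
    # environment variable set when the container is started.
    os.environ.setdefault("CUDA_VISIBLE_DEVICES", os.environ.get("GPU_ID", "0"))
```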

  20. GETTING CLOSER TO PRODUCTION: Areas for potential improvement
      The previous example mostly addresses our needs, but has room for improvement. Potential improvements:
      - Batching of requests
      - Autoscaling of compute resources based on workload
      - Improving the performance of pre/post-processing around TensorRT inference (e.g. image resizing)
      - Better UI/UX on the client side

  21. TENSORRT KEY TAKEAWAYS
      ✓ Generate optimized, deployment-ready runtime engines for low-latency inference
      ✓ Import models trained with Caffe or TensorFlow, or use the Network Definition API
      ✓ Deploy in FP32 or reduced precision (INT8, FP16) for higher throughput (see the sketch below)
      ✓ Optimize frequently used layers and integrate user-defined custom layers
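For the reduced-precision point, a hedged sketch of the relevant builder flags is below. It uses the older builder-level attributes of the TensorRT Python API (newer releases move these onto the builder config), and INT8 additionally requires a calibrator or explicit dynamic ranges.

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
with trt.Builder(TRT_LOGGER) as builder, builder.create_network() as network:
    # ... populate `network` via a parser or the Network Definition API ...
    builder.max_workspace_size = 1 << 30
    if builder.platform_has_fast_fp16:
        builder.fp16_mode = True        # allow FP16 kernels where they are faster
    # builder.int8_mode = True          # INT8 also needs builder.int8_calibrator
    engine = builder.build_cuda_engine(network)
```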

  22. LEARN MORE: Helpful links
      - GPU Inference Whitepaper: https://images.nvidia.com/content/pdf/inference-technical-overview.pdf
      - Blog post on using TensorRT 3.0 for TF model inference: https://devblogs.nvidia.com/tensorrt-3-faster-tensorflow-inference/
      - TensorRT documentation: http://docs.nvidia.com/deeplearning/sdk/index.html#inference

  23. LEARN MORE
      - Product page: developer.nvidia.com/tensorrt
      - Documentation: docs.nvidia.com/deeplearning/sdk
      - Training: nvidia.com/dli

  24. Q&A
