  1. MAXIMIZING UTILIZATION FOR DATA CENTER INFERENCE WITH TENSORRT INFERENCE SERVER. David Goodwin, Soyoung Jeong

  2. AGENDA Important capabilities to maximize data center utilization; TensorRT Inference Server architecture for maximum utilization: multi-framework, multi-model, model concurrency; real-world use case: Naver

  3. MAXIMIZING UTILIZATION Often the GPU is not fully utilized by a single model. Increase utilization by: supporting a variety of model frameworks; supporting concurrent execution of one or multiple models; supporting many model types (CNN, RNN, “stateless”, “stateful”); enabling both “online” and “offline” inference use cases; and enabling scalable, reliable deployment.

  4. TENSORRT INFERENCE SERVER Architected for Maximum Datacenter Utilization. Supports a variety of model frameworks: TensorRT, TensorFlow, Caffe2, custom. Supports concurrent execution of one or multiple models: multi-model, multi-GPU, and asynchronous HTTP and GRPC request handling. Supports many model types (CNN, RNN, “stateless”, “stateful”) with multiple scheduling and batching algorithms. Enables both “online” and “offline” inference use cases: batch 1, batch n, and dynamic batching. Enables scalable, reliable deployment: Prometheus metrics, live/ready endpoints, Kubernetes integration.
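
The live/ready endpoints and Prometheus metrics mentioned above are what an orchestrator such as Kubernetes polls to decide whether the server is healthy and serving. A minimal probe sketch follows; the ports and endpoint paths are assumptions chosen for illustration, not taken from the slides.

```python
# Minimal liveness/readiness probe sketch for the inference server.
# The host, ports, and endpoint paths below are assumptions for illustration.
import requests

SERVER = "http://localhost:8000"   # assumed HTTP endpoint of the inference server
METRICS = "http://localhost:8002"  # assumed Prometheus metrics endpoint

def server_is_live() -> bool:
    """True if the server process is up (liveness)."""
    try:
        return requests.get(f"{SERVER}/api/health/live", timeout=2).status_code == 200
    except requests.RequestException:
        return False

def server_is_ready() -> bool:
    """True if models are loaded and the server can accept inference requests (readiness)."""
    try:
        return requests.get(f"{SERVER}/api/health/ready", timeout=2).status_code == 200
    except requests.RequestException:
        return False

if __name__ == "__main__":
    print("live:", server_is_live(), "ready:", server_is_ready())
    # Prometheus scrapes GPU utilization, request counts, latencies, etc. from the metrics endpoint.
    print(requests.get(f"{METRICS}/metrics", timeout=2).text[:200])
```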

  5. EXTENSIBLE ARCHITECTURE An extensible backend architecture allows support for multiple frameworks and for custom backends. An extensible scheduler architecture allows support for different model types and different batching strategies. CUDA is leveraged to support model concurrency and multi-GPU execution.

  6. MODEL REPOSITORY A file-system based repository of the models loaded and served by the inference server. Model metadata describes the framework, scheduling, batching, concurrency and other aspects of each model, for example: ModelX (platform: TensorRT, scheduler: default, concurrency: …); ModelY (platform: TensorRT, scheduler: dynamic-batcher, concurrency: …); ModelZ (platform: TensorFlow, scheduler: sequence-batcher, concurrency: …).
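
As a concrete illustration, the per-model metadata above can be pictured roughly as the structure below. This is a minimal Python sketch whose field names simply mirror the slide (platform, scheduler, concurrency); it is not the server's actual configuration schema, and the concurrency values are hypothetical placeholders because the slide elides them.

```python
# Illustrative model-repository metadata mirroring the slide's ModelX/ModelY/ModelZ.
# Field names are illustrative, not the server's real config schema; concurrency
# values are hypothetical placeholders (the slide leaves them elided).
MODEL_REPOSITORY = {
    "ModelX": {"platform": "TensorRT",   "scheduler": "default",          "concurrency": 2},
    "ModelY": {"platform": "TensorRT",   "scheduler": "dynamic-batcher",  "concurrency": 2},
    "ModelZ": {"platform": "TensorFlow", "scheduler": "sequence-batcher", "concurrency": 1},
}

for name, meta in MODEL_REPOSITORY.items():
    print(f"{name}: {meta['platform']} model, {meta['scheduler']} scheduler, "
          f"{meta['concurrency']} execution context(s) per GPU")
```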

  7. BACKEND ARCHITECTURE A backend acts as the interface between inference requests and a standard or custom framework. Supported standard frameworks: TensorRT, TensorFlow, Caffe2. Providers efficiently communicate inference request inputs and outputs (HTTP or GRPC), with efficient data movement and no additional copies. [Diagram: an inference request for ModelX flows through the default scheduler into the ModelX backend's TensorRT runtime, with input and output tensors exchanged via providers]

  8. MULTIPLE MODELS [Diagram: three backends served side by side: ModelX with a default scheduler and TensorRT runtime, ModelY with a dynamic batcher and TensorRT runtime, and ModelZ with a sequence batcher and TensorFlow runtime, each handling its own inference requests]

  9. MODEL CONCURRENCY Multiple Models Sharing a GPU. By default each model gets one instance on each available GPU (or one CPU instance if there are no GPUs). Each instance has an execution context that encapsulates the state needed by the runtime to execute the model. [Diagram: the ModelX, ModelY and ModelZ backends each hold one execution context on the same GPU]

  10. MODEL CONCURRENCY Multiple Instances of the Same Model. Model metadata allows multiple instances to be configured for each model. Multiple model instances allow multiple inference requests to be executed simultaneously. [Diagram: the ModelX backend's default scheduler feeding several TensorRT execution contexts on one GPU]

  11. MODEL CONCURRENCY Multiple Instances of Multiple Models. [Diagram: the ModelX (default scheduler), ModelY (dynamic batcher) and ModelZ (sequence batcher) backends, each with one or more execution contexts, all executing on the same GPU]
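
To make the instance/context idea concrete, here is a purely illustrative Python sketch of several execution contexts of one model pulling requests from a shared queue and running concurrently. The class name and the thread-based concurrency are assumptions made for the sketch; in the server itself each context executes on the GPU (via CUDA streams, as shown later).

```python
# Illustrative sketch: multiple execution contexts of one model serving requests
# concurrently. Thread-based and in-process only; the real server runs each
# context on the GPU and relies on CUDA streams for concurrency.
import queue
import threading
import time

class ExecutionContext:  # hypothetical name, for illustration only
    def __init__(self, model_name: str, instance_id: int):
        self.model_name = model_name
        self.instance_id = instance_id

    def execute(self, request: str) -> str:
        time.sleep(0.05)  # stand-in for running the model on this request
        return f"{self.model_name}[{self.instance_id}] handled {request}"

def serve(model_name: str, num_instances: int, requests):
    pending = queue.Queue()
    for r in requests:
        pending.put(r)

    def worker(ctx: ExecutionContext):
        while True:
            try:
                req = pending.get_nowait()
            except queue.Empty:
                return  # no more queued requests for this model
            print(ctx.execute(req))

    threads = [threading.Thread(target=worker, args=(ExecutionContext(model_name, i),))
               for i in range(num_instances)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

serve("ModelX", num_instances=3, requests=[f"request-{i}" for i in range(6)])
```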

  12.–18. CONCURRENT EXECUTION TIMELINE GPU Activity Over Time. [Animation built up across slides 12–18: incoming inference requests for ModelX and ModelY arrive over time; as each request arrives it begins executing on the GPU, so the three ModelX executions and three ModelY executions overlap and run concurrently]

  19. SHARING A GPU CUDA Enables Multiple Model Execution on a GPU. [Diagram: the execution contexts of the ModelX backend (default scheduler) and ModelY backend (dynamic batcher) each map onto CUDA streams; the GPU's hardware scheduler interleaves work from those streams so both models execute on the same GPU]

  20. MULTI-GPU Execution Contexts Can Target Multiple GPUs. [Diagram: as on the previous slide, but the ModelX and ModelY execution contexts issue their CUDA streams to two GPUs, each with its own hardware scheduler]

  21. CUSTOM FRAMEWORK Integrate Custom Logic Into the Inference Server. Provide an implementation of your “framework”/“runtime” as a shared library that implements a simple API: Initialize, Finalize, Execute. All inference server features are available: multi-model, multi-GPU, concurrent execution, scheduling and batching algorithms, etc. (see the sketch after this slide). [Diagram: an inference request for ModelCustom flows through the default scheduler into a custom wrapper that loads libcustom.so, with input and output tensors exchanged via providers]
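
The slide names the custom-backend interface only as Initialize, Finalize and Execute, delivered as a shared library; the Python sketch below just illustrates that three-call lifecycle. Every class and method name here is hypothetical, and the real interface is a shared-library API rather than a Python class.

```python
# Illustrative sketch of the Initialize / Execute / Finalize lifecycle described
# for a custom backend. The real interface is a shared-library API; every name
# below is hypothetical and only mirrors the three calls named on the slide.
class CustomBackend:
    def initialize(self, model_config: dict):
        """Load weights, allocate buffers, and read this model's configuration."""
        self.prefix = model_config.get("prefix", "echo")

    def execute(self, inputs: dict) -> dict:
        """Run one inference: map named input tensors to named output tensors."""
        return {"OUTPUT0": [f"{self.prefix}: {x}" for x in inputs["INPUT0"]]}

    def finalize(self):
        """Release any resources acquired in initialize()."""
        pass

# The server would drive the same sequence for each loaded custom model:
backend = CustomBackend()
backend.initialize({"prefix": "custom"})
print(backend.execute({"INPUT0": ["request-1", "request-2"]}))
backend.finalize()
```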

  22. SCHEDULER ARCHITECTURE The scheduler is responsible for managing all inference requests to a given model and distributing those requests to the model's available execution contexts. Each model can configure the type of scheduler appropriate for that model. [Diagram: a per-model scheduler inside the backend dispatching requests to the model's execution contexts]

  23. DEFAULT SCHEDULER Distribute Individual Requests Across Available Contexts. [Diagram: a mix of batch-1 and batch-4 requests arriving at the ModelX backend, which has a default scheduler and two execution contexts]

  24. DEFAULT SCHEDULER Distribute Individual Requests Across Available Contexts. Incoming requests to ModelX are queued in the scheduler. [Diagram: the request queue sitting in front of the two ModelX execution contexts]

  25. DEFAULT SCHEDULER Distribute Individual Requests Across Available Contexts. Assume the GPU is fully utilized by executing two batch-4 inferences at the same time. Requests are assigned in order to ready contexts; with the small batches currently executing, utilization = 3/8 = 37.5%.

  26. DEFAULT SCHEDULER Distribute Individual Requests Across Available Contexts. When a context completes, a new request is assigned to it; at this point utilization = 2/8 = 25%.

  27. DEFAULT SCHEDULER Distribute Individual Requests Across Available Contexts. When a context completes, a new request is assigned to it; now utilization = 4/8 = 50%. (A worked version of this arithmetic follows this slide.)
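
The utilization figures on slides 25–27 are the ratio of the batch sizes currently executing to the capacity the two contexts could saturate (two batch-4 inferences, i.e. 8). A small worked example follows; the individual batch sizes are chosen only for illustration, since the slides give just the resulting ratios.

```python
# Worked example of the utilization ratio used on slides 25-27:
# utilization = (sum of batch sizes currently executing) / (contexts * saturating batch size)
CONTEXTS = 2          # two execution contexts for ModelX
SATURATING_BATCH = 4  # the slides assume two batch-4 inferences fully utilize the GPU
CAPACITY = CONTEXTS * SATURATING_BATCH  # = 8

def utilization(executing_batch_sizes):
    return sum(executing_batch_sizes) / CAPACITY

# The batch sizes below are illustrative choices that reproduce the slides' ratios.
print(utilization([1, 2]))  # 3/8 = 0.375 (slide 25)
print(utilization([2]))     # 2/8 = 0.25  (slide 26)
print(utilization([2, 2]))  # 4/8 = 0.5   (slide 27)
```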

  28. DYNAMIC BATCHING SCHEDULER Group Requests To Form Larger Batches, Increase GPU Utilization. The default scheduler takes advantage of multiple model instances, but GPU utilization is dependent on the batch size of each inference request. Batching is often one of the best ways to increase GPU utilization. The dynamic batching scheduler (aka dynamic batcher) forms larger batches by combining multiple inference requests.

  29. DYNAMIC BATCHING SCHEDULER Group Requests To Form Larger Batches, Increase GPU Utilization. [Diagram: a mix of batch-1 and batch-4 requests arriving at the ModelY backend, which has a dynamic batcher and two execution contexts]

  30. DYNAMIC BATCHING SCHEDULER Group Requests To Form Larger Batches, Increase GPU Utilization. Incoming requests to ModelY are queued in the scheduler. [Diagram: the request queue sitting in front of the dynamic batcher and the two ModelY execution contexts]

  31. DYNAMIC BATCHING SCHEDULER Group Requests To Form Larger Batches, Increase GPU Utilization. The dynamic batcher configuration for ModelY can specify a preferred batch size; assume batch-4 gives the best utilization. The dynamic batcher then groups queued requests into batch-4 batches to give 100% utilization (see the sketch after this slide). [Diagram: queued requests combined into batch-4 batches, one per execution context]
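
A minimal sketch of the grouping idea, under stated assumptions: each request carries a batch size, the preferred batch size is 4, and a request is never split across batches. Real configuration details such as queuing-delay limits and multiple preferred sizes are omitted.

```python
# Illustrative dynamic-batching sketch: greedily combine queued requests into
# batches whose total size reaches a preferred batch size. Assumptions: each
# request carries a batch size, requests are never split, no queuing-delay limit.
from dataclasses import dataclass

@dataclass
class Request:
    request_id: str
    batch_size: int

def dynamic_batches(queued, preferred_batch_size=4):
    batches, current, current_size = [], [], 0
    for req in queued:
        if current and current_size + req.batch_size > preferred_batch_size:
            batches.append(current)  # close the batch once adding would overflow it
            current, current_size = [], 0
        current.append(req)
        current_size += req.batch_size
    if current:
        batches.append(current)
    return batches

queued = [Request("a", 1), Request("b", 1), Request("c", 2), Request("d", 4), Request("e", 1)]
for i, batch in enumerate(dynamic_batches(queued)):
    total = sum(r.batch_size for r in batch)
    print(f"batch {i}: {[r.request_id for r in batch]} (total size {total})")
```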

  32. SEQUENCE BATCHING SCHEDULER Dynamic Batching for Stateful Models. The default and dynamic-batching schedulers work with stateless models: each request is scheduled and executed independently. Some models are stateful, so a sequence of inference requests must be routed to the same model instance; examples are “online” ASR, TTS and similar models, and models that use LSTM, GRU, etc. to maintain state across inference requests. Multi-instance execution and batching are still required by these models to maximize GPU utilization. The sequence-batching scheduler provides dynamic batching for stateful models (see the sketch after this slide).
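
A minimal sketch of the routing constraint the sequence batcher enforces, under assumptions: each request carries a sequence (correlation) ID, and every request with the same ID must land in the same batch slot of the same execution context so that the model's state for that sequence stays in place. Real details such as sequence start/end signals, timeouts and queuing are omitted, and all names are chosen for illustration.

```python
# Illustrative sequence-batching sketch: all requests with the same sequence ID are
# pinned to one (context, batch-slot) pair so per-sequence state (LSTM/GRU hidden
# state, etc.) stays with the same model instance. Start/end signals and timeouts
# are omitted; names are hypothetical.
class SequenceBatcher:
    def __init__(self, num_contexts=2, slots_per_context=4):
        self.free_slots = [(c, s) for c in range(num_contexts)
                                  for s in range(slots_per_context)]
        self.assignments = {}  # sequence_id -> (context, slot)

    def route(self, sequence_id):
        """Return the (context, slot) that must handle every request in this sequence."""
        if sequence_id not in self.assignments:
            if not self.free_slots:
                raise RuntimeError("no free slot; the request would be queued")
            self.assignments[sequence_id] = self.free_slots.pop(0)
        return self.assignments[sequence_id]

    def release(self, sequence_id):
        """Called when a sequence ends, freeing its slot for a new sequence."""
        self.free_slots.append(self.assignments.pop(sequence_id))

batcher = SequenceBatcher()
for seq_id in ["s1", "s2", "s1", "s3", "s1"]:
    print(seq_id, "->", batcher.route(seq_id))  # s1 always routes to the same slot
```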
