

SLIDE 1

S9422: AN AUTO-BATCHING API FOR HIGH-PERFORMANCE RNN INFERENCE

Murat Efe Guney – Developer Technology Engineer, NVIDIA
March 20, 2019

SLIDE 2

REAL-TIME INFERENCE

Sequence Models Based on RNNs

  • Sequence models for automatic speech recognition (ASR), translation, and speech generation
  • Real-time applications have a stream of inference requests from multiple users
  • Challenge is to perform inferencing with low latency and high throughput

[Figure: utterances from multiple users ("Hello, my name is Alice", "I am Susan", "This is Bob") streaming into a single ASR model]

SLIDE 3

BATCHING VS NON-BATCHING

Batch size = 1

  • Run a single RNN inference task on a GPU
  • Low-latency, but the GPU is underutilized

Batch size = N

  • Group RNN inference instances together
  • High throughput and GPU utilization
  • Allows employing Tensor Cores in Volta and Turing

Batching: Grouping Inference Requests Together

[Figure: multiple inference requests grouped into one batch, all sharing the same weights W]
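To make the batching benefit concrete, here is a minimal sketch (not from the talk; the cuBLAS handle, device pointers, and column-major layout are assumptions): applying the shared weights to one request is a matrix-vector product, while applying them to N grouped requests becomes a single matrix-matrix product that keeps the GPU busy.

// Illustrative sketch only: apply one RNN-style layer W to a batch of inputs.
// batch == 1 behaves like a GEMV (low latency, GPU underutilized);
// batch == N is one large GEMM (high throughput and utilization).
#include <cublas_v2.h>

void apply_layer(cublasHandle_t handle,
                 const float* dW,   // [hidden x input]  shared weights (column-major)
                 const float* dX,   // [input  x batch]  inputs for 1..N requests
                 float*       dY,   // [hidden x batch]  outputs
                 int hidden, int input, int batch)
{
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                hidden, batch, input,
                &alpha, dW, hidden,
                        dX, input,
                &beta,  dY, hidden);
}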

SLIDE 4

BATCHING VS NON-BATCHING

Performance Data on T4

[Chart: RNN Inference Throughput and Latency on T4 for batch sizes 1, 32, 64, 128, 256. Series: FP32 throughput, FP16 w/Tensor Core throughput, FP32 latency, FP16 w/TC latency; latency in ms per timestep, throughput in timesteps per ms. Labeled data points: 1.2, 23.0, 27.6, 32.9, 31.5 and 1.8, 51.4, 83.8, 116.5, 138.4 timesteps per ms (apparently the FP32 and FP16 w/TC throughput series, respectively)]

SLIDE 5

RNN BATCHING

  • Existing real-time codes are written for inferencing many instances with batch size = 1
  • Real-time batching requires extra programming effort
  • A naïve implementation can suffer from a significant increase in latency
  • An ideal solution allows making a tradeoff between latency and throughput
  • RNN cells provide an opportunity to merge inference tasks at different timesteps

Challenges and Opportunities

SLIDE 6

RNN BATCHING

Combining RNNs at Different Timesteps

[Figure: four inference tasks with different arrival times sharing a batch of size 4; timesteps are batched across tasks at different positions (t0, t1, t2, ...) using common model parameters, and when a task finishes its slot is filled with a new inference task]

SLIDE 7

RNN CELLS

RNN Cells Supported in TensorRT and cuDNN

LSTM:
  it = σ(Wi xt + Ri ht-1 + bWi + bRi)
  ft = σ(Wf xt + Rf ht-1 + bWf + bRf)
  ot = σ(Wo xt + Ro ht-1 + bWo + bRo)
  c't = tanh(Wc xt + Rc ht-1 + bWc + bRc)
  ct = ft ◦ ct-1 + it ◦ c't
  ht = ot ◦ tanh(ct)

TANH:
  ht = tanh(Wi xt + Ri ht-1 + bWi + bRi)

RELU:
  ht = ReLU(Wi xt + Ri ht-1 + bWi + bRi)

GRU:
  it = σ(Wi xt + Ri ht-1 + bWi + bRi)
  rt = σ(Wr xt + Rr ht-1 + bWr + bRr)
  h't = tanh(Wh xt + rt ◦ (Rh ht-1 + bRh) + bWh)
  ht = (1 - it) ◦ h't + it ◦ ht-1
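To make the LSTM equations concrete, here is a minimal single-timestep reference implementation in plain C++ (an illustrative sketch only, not the talk's code: the array layout, the fused input/recurrent biases, and the sigmoidf helper are assumptions; cuDNN and TensorRT provide the optimized kernels):

// Minimal reference implementation of one LSTM timestep, matching the equations above.
#include <cmath>
#include <vector>

static float sigmoidf(float x) { return 1.0f / (1.0f + std::exp(-x)); }

// W[g]: [hidden x input] input weights, R[g]: [hidden x hidden] recurrent weights,
// b[g]: [hidden] biases (bW + bR already summed), for gates g = i, f, o, c.
void lstm_step(int input, int hidden,
               const std::vector<float>& x,                  // x_t, length = input
               std::vector<float>& h, std::vector<float>& c, // h_{t-1}, c_{t-1}; updated in place
               const std::vector<std::vector<float>>& W,
               const std::vector<std::vector<float>>& R,
               const std::vector<std::vector<float>>& b)
{
    const std::vector<float> h_prev = h;   // keep h_{t-1} while h is overwritten
    for (int j = 0; j < hidden; ++j) {
        float gate[4];
        for (int g = 0; g < 4; ++g) {      // 0:i  1:f  2:o  3:c'
            float acc = b[g][j];
            for (int k = 0; k < input; ++k)  acc += W[g][j * input + k]  * x[k];
            for (int k = 0; k < hidden; ++k) acc += R[g][j * hidden + k] * h_prev[k];
            gate[g] = acc;
        }
        const float i_t   = sigmoidf(gate[0]);
        const float f_t   = sigmoidf(gate[1]);
        const float o_t   = sigmoidf(gate[2]);
        const float c_hat = std::tanh(gate[3]);
        c[j] = f_t * c[j] + i_t * c_hat;   // ct = ft ◦ ct-1 + it ◦ c't
        h[j] = o_t * std::tanh(c[j]);      // ht = ot ◦ tanh(ct)
    }
}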

SLIDE 8

HIGH-PERFORMANCE RNN INFERENCING

  • High-performance implementations of Tanh, RELU, LSTM and GRU recurrent cells
  • An arbitrary batch size and number of timesteps can be executed
  • Easy access to internal and hidden states of the RNN cells for each timestep
  • Persistent kernels for small minibatch and long sequence lengths (compute capability >= 6.0)
  • LSTMs with recurrent projections to reduce the op count
  • Utilize Tensor Cores for FP16 and FP32 cells (125 TFLOPs on V100 and 65 TFLOPs on T4)

cuDNN Features
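As a concrete illustration of two of these features, the sketch below configures an LSTM descriptor with the static persistent algorithm and a recurrent projection (a hedged example against the cuDNN 7-era API; the sizes are placeholders, and whether a given combination is supported depends on the cuDNN version):

// Hedged sketch (not code from the talk).
#include <cudnn.h>

void configure_lstm(cudnnHandle_t handle, cudnnRNNDescriptor_t rnnDesc,
                    cudnnDropoutDescriptor_t dropoutDesc)
{
    const int hiddenSize = 1024, numLayers = 7, recProjSize = 512;  // placeholder sizes

    // Persistent kernels target small minibatches / long sequences (CC >= 6.0)
    cudnnSetRNNDescriptor_v6(handle, rnnDesc,
                             hiddenSize, numLayers, dropoutDesc,
                             CUDNN_LINEAR_INPUT, CUDNN_UNIDIRECTIONAL,
                             CUDNN_LSTM,
                             CUDNN_RNN_ALGO_PERSIST_STATIC,
                             CUDNN_DATA_FLOAT);

    // Recurrent projection shrinks the recurrent GEMMs to reduce the op count
    cudnnSetRNNProjectionLayers(handle, rnnDesc, recProjSize, /*outProjSize=*/0);
}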

SLIDE 9

UTILIZING TENSOR CORES

cuDNN

cuDNN, cuBLAS and TensorRT

// input, output and weight data types are FP16
cudnnSetRNNMatrixMathType(cudnnRnnDesc, CUDNN_TENSOR_OP_MATH);

// input, output and weight data types are FP32, converted internally to FP16
cudnnSetRNNMatrixMathType(cudnnRnnDesc, CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION);

cuBLAS and cuBLASLt

cublasGemmEx(...);
cublasLtMatmul(...);

TensorRT

builder->setFp16Mode(true);
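For reference, a hedged sketch of an FP16 GEMM that is eligible for Tensor Cores through cublasGemmEx (written against the cuBLAS 10-era API; the handle and device pointers are assumed to be set up, and this is not code from the talk):

// FP16 inputs/outputs, FP32 accumulation, tensor-op algorithm selection.
#include <cublas_v2.h>
#include <cuda_fp16.h>

void gemm_fp16_tensor_cores(cublasHandle_t handle,
                            const __half* dA, const __half* dB, __half* dC,
                            int m, int n, int k)
{
    const float alpha = 1.0f, beta = 0.0f;
    cublasSetMathType(handle, CUBLAS_TENSOR_OP_MATH);        // opt in to Tensor Core math
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                 m, n, k,
                 &alpha, dA, CUDA_R_16F, m,
                         dB, CUDA_R_16F, k,
                 &beta,  dC, CUDA_R_16F, m,
                 CUDA_R_32F,                                 // accumulate in FP32
                 CUBLAS_GEMM_DEFAULT_TENSOR_OP);             // allow tensor-op algorithms
}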

SLIDE 10

RNN INFERENCING WITH CUDNN

cudnnCreateRNNDescriptor(&rnnDesc);                          // create an RNN descriptor
cudnnSetRNNDescriptor(rnnDesc, …);                           // set the RNN descriptor
cudnnGetRNNLinLayerMatrixParams(cudnnHandle, rnnDesc, …);    // set weights
cudnnGetRNNLinLayerBiasParams(cudnnHandle, rnnDesc, …);      // set biases
cudnnRNNForwardInference(cudnnHandle, rnnDesc, …);           // perform inferencing
cudnnDestroyRNNDescriptor(rnnDesc);                          // destroy the RNN descriptor

Key Functions

SLIDE 11

AUTO-BATCHING FOR HIGH THROUGHPUT

  • Rely on cuDNN, cuBLAS and TensorRT for a high-performance RNN implementation
  • Inputs, hidden states and outputs are tracked automatically with a new API
  • Exploits optimization opportunities by overlapping compute, transfers and host computations
  • Similar ideas explored in:
      • Low-latency RNN inference using cellular batching (Jinyang Li et al., GTC 2018)
      • Deep Speech 2: End-to-End Speech Recognition in English and Mandarin (Dario Amodei et al., CoRR, 2015)

Automatically Group Inference Instances

SLIDE 12

STREAMING INFERENCE API

  • Non-blocking function calls with a mechanism to wait on completion
  • Inferencing can be performed in segments with multiple timesteps for real-time processing
  • A background thread combines and executes the single inference tasks

An Auto-batching Solution

[Figure: two inference instances submitting work in segments; Inference-0 submits 4 steps (t0-t3), Inference-1 submits 4 steps (t0-t3), Inference-0 submits 4 more (t4-t7), Inference-1 submits 3 more (t4-t6), then the caller waits for t7 to complete]

SLIDE 13

STREAMING INFERENCE API

List of Functions

streamHandle = CreateRNNInferenceStream(modelDesc);
rnnHandle = CreateRNNInference(streamHandle);
RNNInference(rnnHandle, pInput, pOutput, seqLength);
timeStep = WaitRNNInferenceTimeStep(rnnHandle, timeStep);
timeStep = GetRNNInferenceProgress(rnnHandle);
DestroyRNNInference(rnnHandle);
DestroyRNNInferenceStream(streamHandle);
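As a small usage sketch of the progress query (the doOtherHostWork helper is a hypothetical placeholder, not part of the API), a caller can poll instead of blocking in WaitRNNInferenceTimeStep:

// Sketch: poll the progress counter and overlap host work until timestep 10 is done
int target = 10;
while (GetRNNInferenceProgress(rnnHandle) < target) {
    doOtherHostWork();   // hypothetical: e.g. prepare features for the next segment
}
// outputs for timesteps 0..9 of this inference instance are now valid in pOutput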

SLIDE 14

EXAMPLE USAGE

// Create the inference stream with shared model parameters
streamHandle = CreateRNNInferenceStream(modelDesc);

// Create two RNN inference instances
rnnHandle[0] = CreateRNNInference(streamHandle);
rnnHandle[1] = CreateRNNInference(streamHandle);

// Request inferencing for each inference instance with 10 timesteps (non-blocking calls)
RNNInference(rnnHandle[0], pInput[0], pOutput[0], 10);
RNNInference(rnnHandle[1], pInput[1], pOutput[1], 10);

// Request an additional 5 timesteps for the second inference instance
RNNInference(rnnHandle[1], pInput[1] + 10*inputSize, pOutput[1] + 10*outputSize, 5);

// Wait for the completion of the last submitted inferencing job
WaitRNNInferenceTimeStep(rnnHandle[1], 15);

// Destroy the two inference instances and the inference stream
DestroyRNNInference(rnnHandle[0]);
DestroyRNNInference(rnnHandle[1]);
DestroyRNNInferenceStream(streamHandle);

Two Inference Instances

SLIDE 15

RNN INFERENCING WITH SEGMENTS

Execution Queue and Task Switching for Batch Size = 2

[Figure: execution queue and task switching for batch size = 2. Inference tasks Inference-0, Inference-1 and Inference-2 arrive at different times, each contributing timesteps t0-t7; because only two tasks fit in a batch, the execution queue interleaves their segments over time (snapshots X, Y, Z), storing a task's states when it is swapped out of a batch slot (store Inference-0 states, store Inference-2 states) and restoring them when it resumes (restore Inference-0 states)]

SLIDE 16

IMPLEMENTATION

  1. Find the inference tasks ready to execute timesteps
  2. Determine the batch slots for each inference task
  3. Send the inputs to the GPU for batched processing
  4. Restore hidden states as needed (+ cell states for LSTMs)
  5. Batched execution on the GPU
  6. Store hidden states as needed (+ cell states for LSTMs)
  7. Send the batched results back to the host
  8. De-batch the results on the host

Auto-batching and GPU Execution
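A minimal host-side sketch of one iteration of this loop is shown below. It is illustrative only: the Task bookkeeping and the commented GPU stand-ins are assumptions, not the talk's implementation.

// Illustrative sketch of one iteration of the batching loop (steps 1-8 above).
#include <cstddef>
#include <vector>

struct Task {
    std::vector<float> input, output;    // pending inputs / produced outputs
    std::vector<float> hidden, cell;     // saved h/c states, restored when the task resumes
    std::size_t nextStep = 0;            // next timestep to execute
    std::size_t readySteps = 0;          // timesteps submitted so far
    int slot = -1;                       // batch slot, -1 if currently swapped out
};

void run_one_timestep(std::vector<Task*>& tasks, int maxBatch) {
    // 1. find the inference tasks that have timesteps ready to execute
    std::vector<Task*> batch;
    for (Task* t : tasks)
        if (t->nextStep < t->readySteps && (int)batch.size() < maxBatch)
            batch.push_back(t);

    // 2. determine the batch slot for each selected task
    for (int s = 0; s < (int)batch.size(); ++s)
        batch[s]->slot = s;

    // 3. gather this timestep's inputs and copy them host -> device     (stand-in)
    // 4. restore hidden (+ cell) states of tasks that were swapped out  (stand-in)
    // 5. one batched RNN step on the GPU, e.g. cudnnRNNForwardInference (stand-in)
    // 6. store hidden (+ cell) states of tasks about to be swapped out  (stand-in)
    // 7. copy the batched results device -> host                        (stand-in)

    // 8. de-batch the results on the host and advance each task
    for (Task* t : batch)
        t->nextStep++;                   // its row of the batched output goes to t->output
}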

SLIDE 17

IMPLEMENTATION

Batching, Executing, and De-batching Inference Tasks

[Figure: per-timestep pipeline over the inference tasks' inputs and outputs; host ops (batching, de-batching, store/restore of states) and GPU ops (input HtoD copy, LSTM, projection, Top K, output DtoH copy)]

A background thread accepts inference tasks. At each timestep the tasks are batched, executed on the GPU, and de-batched.

SLIDE 18

PERFORMANCE OPTIMIZATIONS

Hiding Host Processing, Data Transfers, and State Management

  • Overlapping opportunities between timesteps for compute, batching, de-batching and transfers
  • Perform batching and de-batching on separate CPU threads: provides better CPU bandwidth and GPU overlap
  • Employ three CUDA streams and triple-buffering of the output to better exploit concurrency

[Figure: timesteps t, t+1, t+2, t+3 rotated over CUDA streams 0, 1, 2; each timestep's batching, input HtoD, restore, LSTM, projection, Top K, store, output DtoH and de-batching stages overlap with those of neighbouring timesteps]
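As a rough illustration of the stream rotation, the sketch below cycles timesteps over three CUDA streams with triple-buffered output slots (the launch_timestep helper is a hypothetical placeholder, not part of the talk's code):

// Rough sketch of the three-stream rotation with triple-buffered outputs.
#include <cuda_runtime.h>

void run_segment(int numTimesteps) {
    cudaStream_t streams[3];
    for (int i = 0; i < 3; ++i)
        cudaStreamCreate(&streams[i]);

    for (int t = 0; t < numTimesteps; ++t) {
        cudaStream_t s = streams[t % 3];   // rotate over three CUDA streams
        int outputBuffer = t % 3;          // triple-buffered output slot
        // launch_timestep(s, outputBuffer);  // hypothetical: enqueue this timestep's
                                              // HtoD, restore, LSTM, store, DtoH work
        (void)s; (void)outputBuffer;
    }

    cudaDeviceSynchronize();               // wait for all in-flight timesteps
    for (int i = 0; i < 3; ++i)
        cudaStreamDestroy(streams[i]);
}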

SLIDE 19

PERFORMANCE EXPERIMENTS

  • Input size = 128
  • 7 LSTM layers with 1024 hidden cells
  • A final projection layer with 1024 outputs
  • Timesteps per inference segment = 10
  • Total sequence length = 1000
  • Experiments are performed on T4 and GV100
  • End-to-end time: from task submission to results arriving at the host

An Example LSTM Model

[Figure: the LSTM model's 128- and 1024-wide layers; one inference request has 10 timesteps]

SLIDE 20

PERFORMANCE EXPERIMENTS

Benchmarking Code

// Queue up inferencing tasks with 10 timesteps each
time[0] = time();
RNNInference(rnnHandle[0], pInput[0], pOutput[0], 10);
time[1] = time();
RNNInference(rnnHandle[1], pInput[1], pOutput[1], 10);
...
RNNInference(rnnHandle[N-1], pInput[N-1], pOutput[N-1], 10);

// Wait for the completion of the first inferencing task
WaitRNNInferenceTimeStep(rnnHandle[0], 10);
time[0] = time() - time[0];   // elapsed time for the first task
RNNInference(rnnHandle[N], pInput[N], pOutput[N], 10);

// Wait for the completion of the second inferencing task
WaitRNNInferenceTimeStep(rnnHandle[1], 10);
time[1] = time() - time[1];   // elapsed time for the second task
RNNInference(rnnHandle[N+1], pInput[N+1], pOutput[N+1], 10);
...

There are at most N inference requests in flight at any given time. Measure the time required to finish each inference request, including the data transfer time.

SLIDE 21

COMPARISON AGAINST BATCHED CUDNN

FP32 Model, GV100 Numbers

[Chart: Throughput of Streaming Inference API vs. Batched cuDNN (FP32 model, GV100) for batch sizes 32, 64, 128, 256, 512; absolute throughput in timesteps per ms for the Streaming API and batched cuDNN, plus the Streaming API throughput as a percentage (0-100%) of batched cuDNN]

SLIDE 22

PERFORMANCE WITH TENSOR CORES

FP16 on GV100

[Chart: Streaming Inference Performance on GV100; throughput (TFlop/sec) for batch sizes 32, 64, 128, 256, 512, comparing FP32 and FP16 w/Tensor Cores]

SLIDE 23

PERFORMANCE WITH TENSOR CORES

FP16 on T4

[Chart: Streaming Inference Performance on T4; throughput (TFlop/sec) for batch sizes 32, 64, 128, 256, 512, comparing FP32 and FP16 w/Tensor Cores]

SLIDE 24

LATENCY VS THROUGHPUT TRADEOFF

Assuming each inference segment represents 100 ms of audio, choose a batch size that maximizes throughput while staying within the latency budget. A batch size keeps up with real time only if a segment's inference latency stays below the 100 ms it represents; within that budget, a larger batch lets a single GPU serve more inference instances.

FP16 on T4

[Chart: latency percentiles (50%, 90%, 95%, 99%) in ms versus inference instances served by a GPU / batch size: 257/32, 441/64, 692/128, 913/256, 977/512]

SLIDE 25

NVIDIA TENSORRT INFERENCE SERVER

Production Data Center Inference Server

  • Maximize real-time inference performance of GPUs
  • Quickly deploy and manage multiple models per GPU per node
  • Easily scale to heterogeneous GPUs and multi-GPU nodes
  • Integrates with orchestration systems and auto-scalers via latency and health metrics
  • Open source for thorough customization and integration

[Figure: TensorRT Inference Server instances deployed on nodes with Tesla T4, Tesla V100, and Tesla P4 GPUs]

SLIDE 26

INFERENCE SERVER ARCHITECTURE

Models supported

  • TensorFlow GraphDef/SavedModel
  • TensorFlow and TensorRT GraphDef
  • TensorRT Plans
  • Caffe2 NetDef (ONNX import)
  • Custom backends

  • Multi-GPU support
  • Concurrent model execution
  • Server HTTP REST API / gRPC
  • Python/C++ client libraries


Available with Monthly Updates

SLIDE 27

INFERENCE SERVER BATCHERS

Dynamic Batching

  • TensorRT Inference Server (TRTIS) groups inference requests based on customer-defined metrics for optimal performance
  • The customer defines 1) batch size and/or 2) latency requirements

Sequence Batching

  • TRTIS can keep track of the inference requests belonging to a stateful model
  • The client application assigns a correlation ID to a stream of inferences belonging to the same sequence
  • Used together with a custom backend to store and restore the internal states of the model

Dynamic and Sequence Batching

SLIDE 28

SUMMARY

  • Designed and implemented the Streaming Inference API
  • Automatically batches RNN inference requests together to achieve high throughput
  • Code written for batch size = 1 achieves ≥66% of the throughput of batched execution (FP32)
  • Allows utilizing the Tensor Cores on the Volta and Turing architectures
  • Hit latency targets by choosing the right batch size
  • Generalizes to sequence models with interdependent inference streams
  • Use the TRTIS sequence batcher and custom backends for high-performance real-time inferencing
  • See also S9438 - Maximizing Utilization for Data Center Inference with TensorRT Inference Server

SLIDE 29

RESOURCES

TRTIS blog post and documentation:

  • https://devblogs.nvidia.com/nvidia-serves-deep-learning-inference/
  • https://docs.nvidia.com/deeplearning/sdk/tensorrt-inference-server-guide/docs/
