S9422: AN AUTO-BATCHING API FOR HIGH-PERFORMANCE RNN INFERENCE
Murat Efe Guney, Developer Technology Engineer, NVIDIA
March 20, 2019
REAL-TIME INFERENCE
Sequence Models Based on RNNs
- Sequence models power automatic speech recognition (ASR), translation, and speech generation
- Real-time applications receive a stream of inference requests from many users
- The challenge is to perform inference with both low latency and high throughput
[Diagram: concurrent user utterances (“Hello, my name is Alice”, “I am Susan”, “This is Bob”) streaming into a shared ASR model]
BATCHING VS NON-BATCHING
Batch size = 1
- Run a single RNN inference task on a GPU
- Low-latency, but the GPU is underutilized
Batch size = N
- Group RNN inference instances together
- High throughput and GPU utilization
- Allows employing Tensor Cores in Volta and Turing
Batching: Grouping Inference Requests Together
[Diagram: N inference requests grouped into one batch, all sharing the same weights W]
BATCHING VS NON-BATCHING
Performance Data on T4
RNN Inference Throughput and Latency (chart)

Throughput (timesteps per ms):
  Batch size            1     32    64    128    256
  FP32                 1.2   23.0  27.6   32.9   31.5
  FP16 w/Tensor Cores  1.8   51.4  83.8  116.5  138.4

[Chart also plots FP32 and FP16 w/Tensor Core latency (ms per timestep) at each batch size.]
RNN BATCHING
- Existing real-time code is written to run many inference instances at batch size = 1
- Real-time batching requires extra programming effort
- A naïve implementation can suffer a significant increase in latency
- An ideal solution allows a tradeoff between latency and throughput
- RNN cells provide an opportunity to merge inference tasks that are at different timesteps
Challenges and Opportunities
RNN BATCHING
Combining RNNs at Different Timesteps
[Diagram: inference tasks arriving at times t0-t2 fill the slots of a batch of size 4; each batched timestep shares the common model parameters, and a finished slot is filled with a new inference task.]
RNN CELLS
RNN Cells Supported in TensorRT and cuDNN
TANH:
  h_t = tanh(W_i·x_t + R_i·h_{t-1} + b_{Wi} + b_{Ri})

RELU:
  h_t = ReLU(W_i·x_t + R_i·h_{t-1} + b_{Wi} + b_{Ri})

LSTM:
  i_t  = σ(W_i·x_t + R_i·h_{t-1} + b_{Wi} + b_{Ri})
  f_t  = σ(W_f·x_t + R_f·h_{t-1} + b_{Wf} + b_{Rf})
  o_t  = σ(W_o·x_t + R_o·h_{t-1} + b_{Wo} + b_{Ro})
  c'_t = tanh(W_c·x_t + R_c·h_{t-1} + b_{Wc} + b_{Rc})
  c_t  = f_t ∘ c_{t-1} + i_t ∘ c'_t
  h_t  = o_t ∘ tanh(c_t)

GRU:
  i_t  = σ(W_i·x_t + R_i·h_{t-1} + b_{Wi} + b_{Ri})
  r_t  = σ(W_r·x_t + R_r·h_{t-1} + b_{Wr} + b_{Rr})
  h'_t = tanh(W_h·x_t + r_t ∘ (R_h·h_{t-1} + b_{Rh}) + b_{Wh})
  h_t  = (1 − i_t) ∘ h'_t + i_t ∘ h_{t-1}
HIGH-PERFORMANCE RNN INFERENCING
- High-performance implementations of the Tanh, ReLU, LSTM, and GRU recurrent cells
- Arbitrary batch sizes and numbers of timesteps can be executed
- Easy access to the internal and hidden states of the RNN cells at each timestep
- Persistent kernels for small minibatches and long sequence lengths (compute capability >= 6.0)
- LSTMs with recurrent projections to reduce the op count
- Tensor Cores utilized for FP16 and FP32 cells (125 TFLOPS on V100, 65 TFLOPS on T4)
cuDNN Features
UTILIZING TENSOR CORES
cuDNN
cuDNN, cuBLAS and TensorRT
// Input, output and weight data types are FP16
cudnnSetRNNMatrixMathType(cudnnRnnDesc, CUDNN_TENSOR_OP_MATH);

// Input, output and weights are FP32, converted internally to FP16
cudnnSetRNNMatrixMathType(cudnnRnnDesc, CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION);
cuBLAS and cuBLASLt
cublasGemmEx(...);
cublasLtMatmul(...);
TensorRT
builder->setFp16Mode(true);
RNN INFERENCING WITH CUDNN
cudnnCreateRNNDescriptor(&rnnDesc);                         // create an RNN descriptor
cudnnSetRNNDescriptor(rnnDesc, …);                          // configure the RNN descriptor
cudnnGetRNNLinLayerMatrixParams(cudnnHandle, rnnDesc, …);   // locate the weights to set them
cudnnGetRNNLinLayerBiasParams(cudnnHandle, rnnDesc, …);     // locate the biases to set them
cudnnRNNForwardInference(cudnnHandle, rnnDesc, …);          // perform inference
cudnnDestroyRNNDescriptor(rnnDesc);                         // destroy the RNN descriptor
Key Functions
AUTO-BATCHING FOR HIGH THROUGHPUT
- Rely on cuDNN, cuBLAS, and TensorRT for the high-performance RNN implementations
- Inputs, hidden states, and outputs are tracked automatically through a new API
- Exploits optimization opportunities by overlapping compute, transfers, and host processing
- Similar ideas were explored in:
  - Low-latency RNN inference using cellular batching (Jinyang Li et al., GTC 2018)
  - Deep Speech 2: End-to-End Speech Recognition in English and Mandarin (Dario Amodei et al., CoRR, 2015)
Automatically Group Inference Instances
STREAMING INFERENCE API
- Non-blocking function calls with a mechanism to wait on completion
- Inference can be performed in segments of multiple timesteps for real-time processing
- A background thread combines and executes the individual inference tasks
An Auto-batching Solution
[Diagram: Inference-0 and Inference-1 each submit segments of timesteps (t0-t3, then t4-t7 and t4-t6); the caller then waits for the completion of t7.]
STREAMING INFERENCE API
List of Functions
streamHandle = CreateRNNInferenceStream(modelDesc);
rnnHandle = CreateRNNInference(streamHandle);
RNNInference(rnnHandle, pInput, pOutput, seqLength);
timeStep = WaitRNNInferenceTimeStep(rnnHandle, timeStep);
timeStep = GetRNNInferenceProgress(rnnHandle);
DestroyRNNInference(rnnHandle);
DestroyRNNInferenceStream(streamHandle);
EXAMPLE USAGE
// Create the inference stream with shared model parameters
streamHandle = CreateRNNInferenceStream(modelDesc);

// Create two RNN inference instances
rnnHandle[0] = CreateRNNInference(streamHandle);
rnnHandle[1] = CreateRNNInference(streamHandle);

// Request inference for each instance with 10 timesteps (non-blocking calls)
RNNInference(rnnHandle[0], pInput[0], pOutput[0], 10);
RNNInference(rnnHandle[1], pInput[1], pOutput[1], 10);

// Request an additional 5 timesteps for the second inference instance
RNNInference(rnnHandle[1], pInput[1] + 10*inputSize, pOutput[1] + 10*outputSize, 5);

// Wait for the completion of the most recently added inference job
WaitRNNInferenceTimeStep(rnnHandle[1], 15);

// Destroy the two inference instances and the inference stream
DestroyRNNInference(rnnHandle[0]);
DestroyRNNInference(rnnHandle[1]);
DestroyRNNInferenceStream(streamHandle);
Two Inference Instances
RNN INFERENCING WITH SEGMENTS
Execution Queue and Task Switching for Batch Size = 2
[Diagram: three inference instances arrive at different times and submit segments of timesteps (t0-t3, t4-t7, ...). With batch size = 2, the execution queue interleaves segments from two instances at a time; when a third instance arrives, the scheduler stores the hidden states of a paused instance (inference-0, then inference-2) and restores them when that instance is swapped back in.]
IMPLEMENTATION
1. Find the inference tasks with timesteps ready to execute
2. Determine the batch slot for each inference task
3. Send the inputs to the GPU for batched processing
4. Restore hidden states as needed (plus cell states for LSTMs)
5. Execute the batch on the GPU
6. Store hidden states as needed (plus cell states for LSTMs)
7. Send the batched results back to the host
8. De-batch the results on the host
Auto-batching and GPU Execution
IMPLEMENTATION
Batching, Executing, and De-batching Inference Tasks
- A background thread accepts inference tasks
- At each timestep, inference tasks are batched, executed on the GPU, and de-batched
[Pipeline diagram: per timestep, host-side batching → input HtoD copy → restore states → LSTM layers → projection → top-K → store states → output DtoH copy → host-side de-batching; host ops and GPU ops are marked separately.]
PERFORMANCE OPTIMIZATIONS
Hiding Host Processing, Data Transfers, and State Management
- Overlapping opportunities between timesteps for compute, batching, de-batching, and transfers
- Perform batching and de-batching on separate CPU threads: provides better CPU bandwidth and CPU/GPU overlap
- Employ three CUDA streams and triple-buffering of the output to better exploit concurrency
[Pipeline diagram: consecutive timesteps t, t+1, t+2, t+3 rotate over CUDA streams 0, 1, and 2; each timestep's batching, input HtoD copy, restore, LSTM, projection, top-K, store, output DtoH copy, and de-batching overlap with the neighboring timesteps.]
PERFORMANCE EXPERIMENTS
- Input size = 128
- 7 LSTM layers with 1024 hidden cells
- A final projection layer with 1024 outputs
- Timesteps per inference segment = 10
- Total sequence length = 1000
- Experiments are performed on T4 and GV100
- End-to-end time: from task submission to results arriving at the host
An Example LSTM Model
[Diagram: the LSTM model with a 128-wide input, stacked 1024-cell LSTM layers, and the projection output; one inference request carries 10 timesteps.]
PERFORMANCE EXPERIMENTS
Benchmarking Code
// Queue up inference tasks with 10 timesteps each
time[0] = time();
RNNInference(rnnHandle[0], pInput[0], pOutput[0], 10);
time[1] = time();
RNNInference(rnnHandle[1], pInput[1], pOutput[1], 10);
...
RNNInference(rnnHandle[N-1], pInput[N-1], pOutput[N-1], 10);

// Wait for the completion of the first inference task
WaitRNNInferenceTimeStep(rnnHandle[0], 10);
time[0] = time() - time[0];
RNNInference(rnnHandle[N], pInput[N], pOutput[N], 10);

// Wait for the completion of the second inference task
WaitRNNInferenceTimeStep(rnnHandle[1], 10);
time[1] = time() - time[1];
RNNInference(rnnHandle[N+1], pInput[N+1], pOutput[N+1], 10);
...
There are at most N inference requests in flight at any given time. Measure the time required to finish each inference request, including the data transfer time.
COMPARISON AGAINST BATCHED CUDNN
FP32 Model, GV100 Numbers
[Chart: throughput of the Streaming Inference API vs. batched cuDNN (timesteps per ms) for batch sizes 32-512, plus the streaming throughput as a percentage of batched cuDNN.]
PERFORMANCE WITH TENSOR CORES
FP16 on GV100
[Chart: Streaming Inference throughput on GV100 (TFLOP/s) for FP32 and FP16 w/Tensor Cores at batch sizes 32-512.]
PERFORMANCE WITH TENSOR CORES
FP16 on T4
[Chart: Streaming Inference throughput on T4 (TFLOP/s) for FP32 and FP16 w/Tensor Cores at batch sizes 32-512.]
LATENCY VS THROUGHPUT TRADEOFF
- Assuming each inference segment represents 100 ms of audio
- Choose the batch size that maximizes throughput while staying within the latency budget
FP16 on T4
[Chart: latency percentiles (50%, 90%, 95%, 99%) in ms for the inference instances served by one GPU at each batch size: 257 instances at batch 32, 441 at 64, 692 at 128, 913 at 256, and 977 at 512.]
NVIDIA TENSORRT INFERENCE SERVER
Production Data Center Inference Server
- Maximize real-time inference performance of GPUs
- Quickly deploy and manage multiple models per GPU per node
- Easily scale to heterogeneous GPUs and multi-GPU nodes
- Integrates with orchestration systems and auto-scalers via latency and health metrics
- Open source for thorough customization and integration
[Diagram: TensorRT Inference Server instances running on nodes with Tesla T4, Tesla V100, and Tesla P4 GPUs.]
INFERENCE SERVER ARCHITECTURE
Models supported
- TensorFlow GraphDef/SavedModel
- TensorFlow and TensorRT GraphDef
- TensorRT Plans
- Caffe2 NetDef (ONNX import)
- Custom backends
- Multi-GPU support
- Concurrent model execution
- HTTP REST API and gRPC endpoints
- Python/C++ client libraries
Available with Monthly Updates
INFERENCE SERVER BATCHERS
Dynamic Batching
- TensorRT Inference Server (TRTIS) groups inference requests based on customer-defined criteria for optimal performance
- The customer defines 1) the batch size and/or 2) latency requirements
Sequence Batching
- TRTIS can keep track of the inference requests belonging to a stateful model
- The client application assigns a correlation ID to a stream of inferences belonging to the same sequence
- Use together with a custom backend to store and restore the internal states of the model
Dynamic and Sequence Batching
SUMMARY
- Designed and implemented the Streaming Inference API
- Automatically batches RNN inference requests together to achieve high throughput
- Code written for batch size = 1 achieves ≥66% of the throughput of batched execution (FP32)
- Allows utilizing the Tensor Cores on the Volta and Turing architectures
- Hit latency targets by choosing the right batch size
- Generalizes to sequence models with interdependent inference streams
- TRTIS sequence batcher and custom backends enable high-performance real-time inference
- See also: S9438 - Maximizing Utilization for Data Center Inference with TensorRT Inference Server
RESOURCES
TRTIS blog post and documentation:
- https://devblogs.nvidia.com/nvidia-serves-deep-learning-inference/
- https://docs.nvidia.com/deeplearning/sdk/tensorrt-inference-server-guide/docs/