S9422: AN AUTO-BATCHING API FOR HIGH-PERFORMANCE RNN INFERENCE
Murat Efe Guney, Developer Technology Engineer, NVIDIA
March 20, 2019
REAL-TIME INFERENCE
Sequence Models Based on RNNs
- Sequence models power automatic speech recognition (ASR), translation, and speech generation
- Real-time applications receive a stream of inference requests from many users
- The challenge is to perform inference with both low latency and high throughput
[Diagram: concurrent user utterances (“Hello, my name is Alice”, “I am Susan”, “This is Bob”) streaming into a shared ASR model]
BATCHING VS NON-BATCHING
Batch size = 1
- Run a single RNN inference task on a GPU
- Low-latency, but the GPU is underutilized
Batch size = N
- Group RNN inference instances together
- High throughput and GPU utilization
- Allows employing Tensor Cores in Volta and Turing
Batching: Grouping Inference Requests Together
[Diagram: N inference requests grouped into one batch, all sharing the same weights W]
BATCHING VS NON-BATCHING
Performance Data on T4
RNN Inference Throughput and Latency (chart)

Throughput (timesteps per ms):
  Batch size            1     32    64    128    256
  FP32                 1.2   23.0  27.6   32.9   31.5
  FP16 w/Tensor Cores  1.8   51.4  83.8  116.5  138.4

[Chart also plots FP32 and FP16 w/Tensor Core latency (ms per timestep) at each batch size.]
RNN BATCHING
- Existing real-time code is written to run many inference instances at batch size = 1
- Real-time batching requires extra programming effort
- A naïve implementation can suffer a significant increase in latency
- An ideal solution allows a tradeoff between latency and throughput
- RNN cells provide an opportunity to merge inference tasks that are at different timesteps
Challenges and Opportunities
RNN BATCHING
Combining RNNs at Different Timesteps
[Diagram: inference tasks arriving at times t0-t2 fill the slots of a batch of size 4; each batched timestep shares the common model parameters, and a finished slot is filled with a new inference task.]
RNN CELLS
RNN Cells Supported in TensorRT and cuDNN
TANH:
  h_t = tanh(W_i·x_t + R_i·h_{t-1} + b_{Wi} + b_{Ri})

RELU:
  h_t = ReLU(W_i·x_t + R_i·h_{t-1} + b_{Wi} + b_{Ri})

LSTM:
  i_t  = σ(W_i·x_t + R_i·h_{t-1} + b_{Wi} + b_{Ri})
  f_t  = σ(W_f·x_t + R_f·h_{t-1} + b_{Wf} + b_{Rf})
  o_t  = σ(W_o·x_t + R_o·h_{t-1} + b_{Wo} + b_{Ro})
  c'_t = tanh(W_c·x_t + R_c·h_{t-1} + b_{Wc} + b_{Rc})
  c_t  = f_t ∘ c_{t-1} + i_t ∘ c'_t
  h_t  = o_t ∘ tanh(c_t)

GRU:
  i_t  = σ(W_i·x_t + R_i·h_{t-1} + b_{Wi} + b_{Ri})
  r_t  = σ(W_r·x_t + R_r·h_{t-1} + b_{Wr} + b_{Rr})
  h'_t = tanh(W_h·x_t + r_t ∘ (R_h·h_{t-1} + b_{Rh}) + b_{Wh})
  h_t  = (1 − i_t) ∘ h'_t + i_t ∘ h_{t-1}
HIGH-PERFORMANCE RNN INFERENCING
- High-performance implementations of the Tanh, ReLU, LSTM, and GRU recurrent cells
- Arbitrary batch sizes and numbers of timesteps can be executed
- Easy access to the internal and hidden states of the RNN cells at each timestep
- Persistent kernels for small minibatches and long sequence lengths (compute capability >= 6.0)
- LSTMs with recurrent projections to reduce the op count
- Tensor Cores utilized for FP16 and FP32 cells (125 TFLOPS on V100, 65 TFLOPS on T4)
cuDNN Features
UTILIZING TENSOR CORES
cuDNN
cuDNN, cuBLAS and TensorRT
// Input, output and weight data types are FP16
cudnnSetRNNMatrixMathType(cudnnRnnDesc, CUDNN_TENSOR_OP_MATH);

// Input, output and weights are FP32, converted internally to FP16
cudnnSetRNNMatrixMathType(cudnnRnnDesc, CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION);
cuBLAS and cuBLASLt
cublasGemmEx(...);
cublasLtMatmul(...);
TensorRT
builder->setFp16Mode(true);
RNN INFERENCING WITH CUDNN
cudnnCreateRNNDescriptor(&rnnDesc);                         // create an RNN descriptor
cudnnSetRNNDescriptor(rnnDesc, …);                          // configure the RNN descriptor
cudnnGetRNNLinLayerMatrixParams(cudnnHandle, rnnDesc, …);   // locate the weights to set them
cudnnGetRNNLinLayerBiasParams(cudnnHandle, rnnDesc, …);     // locate the biases to set them
cudnnRNNForwardInference(cudnnHandle, rnnDesc, …);          // perform inference
cudnnDestroyRNNDescriptor(rnnDesc);                         // destroy the RNN descriptor
Key Functions
AUTO-BATCHING FOR HIGH THROUGHPUT
- Rely on cuDNN, cuBLAS, and TensorRT for the high-performance RNN implementations
- Inputs, hidden states, and outputs are tracked automatically through a new API
- Exploits optimization opportunities by overlapping compute, transfers, and host processing
- Similar ideas were explored in:
  - Low-latency RNN inference using cellular batching (Jinyang Li et al., GTC 2018)
  - Deep Speech 2: End-to-End Speech Recognition in English and Mandarin (Dario Amodei et al., CoRR, 2015)
Automatically Group Inference Instances
STREAMING INFERENCE API
- Non-blocking function calls with a mechanism to wait on completion
- Inference can be performed in segments of multiple timesteps for real-time processing
- A background thread combines and executes the individual inference tasks
An Auto-batching Solution
[Diagram: Inference-0 and Inference-1 each submit segments of timesteps (t0-t3, then t4-t7 and t4-t6); the caller then waits for the completion of t7.]
STREAMING INFERENCE API
List of Functions
streamHandle = CreateRNNInferenceStream(modelDesc);
rnnHandle = CreateRNNInference(streamHandle);
RNNInference(rnnHandle, pInput, pOutput, seqLength);
timeStep = WaitRNNInferenceTimeStep(rnnHandle, timeStep);
timeStep = GetRNNInferenceProgress(rnnHandle);
DestroyRNNInference(rnnHandle);
DestroyRNNInferenceStream(streamHandle);
EXAMPLE USAGE
// Create the inference stream with shared model parameters
streamHandle = CreateRNNInferenceStream(modelDesc);

// Create two RNN inference instances
rnnHandle[0] = CreateRNNInference(streamHandle);
rnnHandle[1] = CreateRNNInference(streamHandle);

// Request inference for each instance with 10 timesteps (non-blocking calls)
RNNInference(rnnHandle[0], pInput[0], pOutput[0], 10);
RNNInference(rnnHandle[1], pInput[1], pOutput[1], 10);

// Request an additional 5 timesteps for the second inference instance
RNNInference(rnnHandle[1], pInput[1] + 10*inputSize, pOutput[1] + 10*outputSize, 5);

// Wait for the completion of the most recently added inference job
WaitRNNInferenceTimeStep(rnnHandle[1], 15);

// Destroy the two inference instances and the inference stream
DestroyRNNInference(rnnHandle[0]);
DestroyRNNInference(rnnHandle[1]);
DestroyRNNInferenceStream(streamHandle);
Two Inference Instances
RNN INFERENCING WITH SEGMENTS
Execution Queue and Task Switching for Batch Size = 2
[Diagram: three inference instances arrive at different times and submit segments of timesteps (t0-t3, t4-t7, ...). With batch size = 2, the execution queue interleaves segments from two instances at a time; when a third instance arrives, the scheduler stores the hidden states of a paused instance (inference-0, then inference-2) and restores them when that instance is swapped back in.]
IMPLEMENTATION
1. Find the inference tasks with timesteps ready to execute
2. Determine the batch slot for each inference task
3. Send the inputs to the GPU for batched processing
4. Restore hidden states as needed (plus cell states for LSTMs)
5. Execute the batch on the GPU
6. Store hidden states as needed (plus cell states for LSTMs)
7. Send the batched results back to the host
8. De-batch the results on the host
Auto-batching and GPU Execution
IMPLEMENTATION
Batching, Executing, and De-batching Inference Tasks
- A background thread accepts inference tasks
- At each timestep, inference tasks are batched, executed on the GPU, and de-batched
[Pipeline diagram: per timestep, host-side batching → input HtoD copy → restore states → LSTM layers → projection → top-K → store states → output DtoH copy → host-side de-batching; host ops and GPU ops are marked separately.]
PERFORMANCE OPTIMIZATIONS
Hiding Host Processing, Data Transfers, and State Management
- Overlapping opportunities between timesteps for compute, batching, de-batching, and transfers
- Perform batching and de-batching on separate CPU threads: provides better CPU bandwidth and CPU/GPU overlap
- Employ three CUDA streams and triple-buffering of the output to better exploit concurrency
[Pipeline diagram: consecutive timesteps t, t+1, t+2, t+3 rotate over CUDA streams 0, 1, and 2; each timestep's batching, input HtoD copy, restore, LSTM, projection, top-K, store, output DtoH copy, and de-batching overlap with the neighboring timesteps.]
PERFORMANCE EXPERIMENTS
- Input size = 128
- 7 LSTM layers with 1024 hidden cells
- A final projection layer with 1024 outputs
- Timesteps per inference segment = 10
- Total sequence length = 1000
- Experiments are performed on T4 and GV100
- End-to-end time: from task submission to results arriving at the host
An Example LSTM Model
[Diagram: the LSTM model with a 128-wide input, stacked 1024-cell LSTM layers, and the projection output; one inference request carries 10 timesteps.]
PERFORMANCE EXPERIMENTS
Benchmarking Code
// Queue up inference tasks with 10 timesteps each
time[0] = time();
RNNInference(rnnHandle[0], pInput[0], pOutput[0], 10);
time[1] = time();
RNNInference(rnnHandle[1], pInput[1], pOutput[1], 10);
...
RNNInference(rnnHandle[N-1], pInput[N-1], pOutput[N-1], 10);

// Wait for the completion of the first inference task
WaitRNNInferenceTimeStep(rnnHandle[0], 10);
time[0] = time() - time[0];
RNNInference(rnnHandle[N], pInput[N], pOutput[N], 10);

// Wait for the completion of the second inference task
WaitRNNInferenceTimeStep(rnnHandle[1], 10);
time[1] = time() - time[1];
RNNInference(rnnHandle[N+1], pInput[N+1], pOutput[N+1], 10);
...
There are at most N inference requests in flight at any given time. Measure the time required to finish each inference request, including the data transfer time.
COMPARISON AGAINST BATCHED CUDNN
FP32 Model, GV100 Numbers
[Chart: throughput of the Streaming Inference API vs. batched cuDNN (timesteps per ms) for batch sizes 32-512, plus the streaming throughput as a percentage of batched cuDNN.]
PERFORMANCE WITH TENSOR CORES
FP16 on GV100
[Chart: Streaming Inference throughput on GV100 (TFLOP/s) for FP32 and FP16 w/Tensor Cores at batch sizes 32-512.]
PERFORMANCE WITH TENSOR CORES
FP16 on T4
[Chart: Streaming Inference throughput on T4 (TFLOP/s) for FP32 and FP16 w/Tensor Cores at batch sizes 32-512.]
LATENCY VS THROUGHPUT TRADEOFF
- Assuming each inference segment represents 100 ms of audio
- Choose the batch size that maximizes throughput while staying within the latency budget
FP16 on T4
[Chart: latency percentiles (50%, 90%, 95%, 99%) in ms for the inference instances served by one GPU at each batch size: 257 instances at batch 32, 441 at 64, 692 at 128, 913 at 256, and 977 at 512.]
NVIDIA TENSORRT INFERENCE SERVER
Production Data Center Inference Server
- Maximize real-time inference performance of GPUs
- Quickly deploy and manage multiple models per GPU per node
- Easily scale to heterogeneous GPUs and multi-GPU nodes
- Integrates with orchestration systems and auto-scalers via latency and health metrics
- Open source for thorough customization and integration
[Diagram: TensorRT Inference Server instances running on nodes with Tesla T4, Tesla V100, and Tesla P4 GPUs.]
INFERENCE SERVER ARCHITECTURE
Models supported
- TensorFlow GraphDef/SavedModel
- TensorFlow and TensorRT GraphDef
- TensorRT Plans
- Caffe2 NetDef (ONNX import)
- Custom backends
- Multi-GPU support
- Concurrent model execution
- HTTP REST API and gRPC endpoints
- Python/C++ client libraries
Available with Monthly Updates
INFERENCE SERVER BATCHERS
Dynamic Batching
- TensorRT Inference Server (TRTIS) groups inference requests based on customer-defined criteria for optimal performance
- The customer defines 1) the batch size and/or 2) latency requirements
Sequence Batching
- TRTIS can keep track of the inference requests belonging to a stateful model
- The client application assigns a correlation ID to a stream of inferences belonging to the same sequence
- Use together with a custom backend to store and restore the internal states of the model
Dynamic and Sequence Batching
SUMMARY
- Designed and implemented the Streaming Inference API
- Automatically batches RNN inference requests together to achieve high throughput
- Code written for batch size = 1 achieves ≥66% of the throughput of batched execution (FP32)
- Allows utilizing the Tensor Cores on the Volta and Turing architectures
- Hit latency targets by choosing the right batch size
- Generalizes to sequence models with interdependent inference streams
- TRTIS sequence batcher and custom backends enable high-performance real-time inference
- See also: S9438 - Maximizing Utilization for Data Center Inference with TensorRT Inference Server
RESOURCES
TRTIS blog post and documentation:
- https://devblogs.nvidia.com/nvidia-serves-deep-learning-inference/
- https://docs.nvidia.com/deeplearning/sdk/tensorrt-inference-server-guide/docs/