S8822: OPTIMIZING NMT WITH TENSORRT
Micah Villmow, Senior TensorRT Software Engineer
DOUGLAS ADAMS – BABEL FISH
Neural Machine Translation Unit
OVER 100X FASTER, IS IT REALLY POSSIBLE?
Over 200 years
NVIDIA TENSORRT
Programmable Inference Accelerator
developer.nvidia.com/tensorrt
TensorRT (Optimizer + Runtime) connects trained models from the frameworks to NVIDIA GPU platforms: DRIVE PX 2, JETSON TX2, NVIDIA DLA, TESLA P4, TESLA V100.
TENSORRT LAYERS

Built-in Layer Support:
- Convolution
- LSTM and GRU
- Activation: ReLU, tanh, sigmoid
- Pooling: max and average
- Scaling
- Element-wise operations
- LRN
- Fully-connected
- SoftMax
- Deconvolution

Custom Layer API: the Deployed Application calls into the TensorRT Runtime, which executes built-in layers and user-provided Custom Layers on top of the CUDA Runtime.
TENSORRT OPTIMIZATIONS
- Kernel Auto-Tuning
- Layer & Tensor Fusion
- Dynamic Tensor Memory
- Weights & Activation Precision Calibration
40x Faster CNNs on V100 vs. CPU-Only, Under 7 ms Latency (ResNet50)

Configuration       Throughput (images/sec)   Latency (ms)
CPU-Only            140                       14
V100 + TensorFlow   305                       6.67
V100 + TensorRT     5700                      6.83

Inference throughput (images/sec) on ResNet50. V100 + TensorRT: NVIDIA TensorRT (FP16), batch size 39, Tesla V100-SXM2-16GB, E5-2690 v4 @ 2.60 GHz, 3.5 GHz Turbo (Broadwell), HT on. V100 + TensorFlow: preview of Volta-optimized TensorFlow (FP16), batch size 2, Tesla V100-PCIE-16GB, E5-2690 v4 @ 2.60 GHz, 3.5 GHz Turbo (Broadwell), HT on. CPU-Only: Intel Xeon-D 1587 (Broadwell-E) with the Intel DL SDK; score doubled to account for Intel's stated claim of a 2x performance improvement on Skylake with AVX512.
Agenda
- What is NMT? (this section)
- What is the current state?
- What are the problems?
- How did we solve it?
- What perf is possible?
ACRONYMS AND DEFINITIONS
- NMT: Neural Machine Translation
- OpenNMT: open-source NMT project for academia and industry
- Token: the minimum representation used for encoding (symbol, word, character, or subword)
- Sequence: a number of tokens wrapped by special start- and end-of-sequence tokens
- Beam Search: a directed, partial breadth-first tree search algorithm
- TopK: a partial sort resulting in the N min/max elements
- Unk: a special token that represents unknown translations
OPENNMT INFERENCE

Input → Input Setup → Encoder (Encoder RNN) → Decoder loop: Decoder RNN → Attention Model → Projection → TopK → Beam Search (Beam Scoring → Beam Shuffle → Batch Reduction) → Output
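To make the data flow concrete, here is a minimal NumPy sketch of the decode loop with toy dimensions, random weights, and a greedy single-beam search; every name and shape is illustrative, not the actual TensorRT plugin API:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, HIDDEN, STEPS = 16, 8, 5

# Toy "trained" parameters: embedding, recurrent cell, output projection.
embed = rng.standard_normal((VOCAB, HIDDEN))
W_x = rng.standard_normal((HIDDEN, HIDDEN))
W_h = rng.standard_normal((HIDDEN, HIDDEN))
proj = rng.standard_normal((HIDDEN, VOCAB))

def rnn_step(x, h):
    # Simple tanh RNN standing in for the LSTM decoder.
    return np.tanh(x @ W_x + h @ W_h)

h = np.zeros(HIDDEN)                 # the encoder's final state would go here
token, END = 1, 2                    # <S> start id and </S> end id
output = []
for _ in range(STEPS):               # decoder loop: RNN -> projection -> TopK
    x = embed[token]                 # Embedding
    h = rnn_step(x, h)               # Decoder RNN (attention omitted here)
    logits = h @ proj                # Projection
    token = int(np.argmax(logits))   # TopK with k=1 (greedy "beam")
    if token == END:
        break                        # Batch Reduction degenerates to a stop check
    output.append(token)
print(output)
```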
DECODER EXAMPLE

Each decode step runs Embedding → Decoder RNN → Attention Model → Projection → TopK → Beam Search (Beam Scoring → Beam Shuffle → Batch Reduction).

Iteration 0: every beam is seeded with the start token <S>, and the step proposes candidate first words, e.g. "This", "The", "He", "What", "The".
Iteration 1+: each surviving beam extends its hypothesis by one token, e.g. "This is", "The house", "He ran", "What time", "The cow".
TRAINING VS INFERENCE

Training graph: Input → Input Setup → Encoder (Encoder RNN) → Decoder (Decoder RNN → Attention Model → Projection) → Output.
Inference graph: the same pipeline plus the search machinery, inserting TopK and Beam Search (Beam Scoring, Beam Shuffle, Batch Reduction) between the Projection and the Output.
Agenda
- What is NMT?
- What is the current state? (this section)
- What are the problems?
- How did we solve it?
- What perf is possible?
INFERENCE TIME IS BEAM SEARCH TIME
- Wu et al., 2016, "Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation", arXiv:1609.08144
- Sharan Narang, Jun 2017, Baidu's DeepBench, https://github.com/baidu-research/DeepBench
- Rui Zhao, Dec 2017, "Why does inference run 20x slower than training", https://github.com/tensorflow/nmt/issues/204
- David Levinthal, Ph.D., Jan 2018, "Evaluating RNN performance across hardware platforms"
Agenda
- What is NMT?
- What is the current state?
- What are the problems? (this section)
- How did we solve it?
- What perf is possible?
PERF ANALYSIS
KERNEL ANALYSIS
Agenda
- What is NMT?
- What is the current state?
- What are the problems?
- How did we solve it? (this section)
- What perf is possible?
ENCODER

The Encoder stage consists of Input Setup followed by the Encoder RNN.
Input Setup
Input "Hello. This is a test. Bye." → Tokenization → "Hello ." | "This is a test ." | "Bye ." → Gather → Encoder Input as padded token IDs (e.g. 42 23 | 73 3 8 19 23 | 98 23).
A PrefixSumPlugin derives the Sequence Length Buffer from the per-sentence token counts (2, 5, 2). Setup also emits the Decoder Start Tokens and a constant zero state buffer.
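A small NumPy illustration of this setup stage; the vocabulary mapping is hypothetical, chosen only to reproduce the token IDs on the slide:

```python
import numpy as np

# Hypothetical vocabulary matching the token ids shown above.
vocab = {"Hello": 42, ".": 23, "This": 73, "is": 3, "a": 8, "test": 19, "Bye": 98}

sentences = ["Hello .", "This is a test .", "Bye ."]
ids = [[vocab[t] for t in s.split()] for s in sentences]
lengths = np.array([len(x) for x in ids])            # [2, 5, 2]

# Pad to the max length so the batch is one rectangular tensor.
batch = np.zeros((len(ids), lengths.max()), dtype=np.int32)
for i, row in enumerate(ids):
    batch[i, :len(row)] = row

# PrefixSumPlugin analogue: an exclusive prefix sum turns lengths into
# offsets into the packed (un-padded) token stream.
offsets = np.concatenate([[0], np.cumsum(lengths)])  # [0, 2, 7, 9]
print(batch, lengths, offsets, sep="\n")
```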
Encoder
The Encoder Input IDs pass through an Embedding Plugin and into a PackedRNN, which reads the Sequence Lengths (2, 5, 2) so padded positions are skipped. The RNN starts from the trained hidden and cell states and produces the Encoder Hidden State, Encoder Cell State, and the per-timestep Context Vector (e.g. .1, .35, .123, .93, 1.4, 1, .01, .42, .20).
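The PackedRNN itself skips the padded work entirely; the masked loop below is a NumPy approximation with the same semantics (padded steps leave the state untouched), using illustrative shapes and random weights:

```python
import numpy as np

rng = np.random.default_rng(1)
B, T, H = 3, 5, 4                       # batch, max time, hidden size
lengths = np.array([2, 5, 2])           # from the Sequence Length Buffer
x = rng.standard_normal((B, T, H))      # embedded encoder input
W_x = rng.standard_normal((H, H))
W_h = rng.standard_normal((H, H))

h = np.zeros((B, H))                    # "trained" initial state in the deck
context = np.zeros((B, T, H))           # per-timestep outputs = Context Vector
for t in range(T):
    h_new = np.tanh(x[:, t] @ W_x + h @ W_h)
    alive = (t < lengths)[:, None]      # mask: padded steps keep the old state
    h = np.where(alive, h_new, h)
    context[:, t] = np.where(alive, h_new, 0.0)
print(h.shape, context.shape)
```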
DECODER

Per decode step: Decoder RNN → Attention Model → Projection → TopK → Beam Search (Beam Scoring → Beam Shuffle → Batch Reduction).
Decoder, 1st Iteration
Every batch entry starts from the start-of-sentence token <S> (Batch0 ... BatchN), run through the Embedding Plugin into the RNN. The RNN is seeded with the Encoder Hidden State and Encoder Cell State, and produces the Decode Hidden State, Decode Cell State, and the Decoder Output scores (e.g. Batch0: .124, ..., BatchN: .912).
Decoder, 2nd+ Iteration
From the second step on, the Decoder Input is the surviving token per beam (character-level here: Batch0 beams 0-4 hold こ, ん, に, ち, は, the characters of こんにちは, "hello"; BatchN beams hold さ, よ, う, な, ら, the characters of さようなら, "goodbye"), again embedded via the Embedding Plugin. The RNN consumes the previous hidden and cell state and emits the next hidden and cell state plus per-beam Decoder Output scores (Batch0: .18, .32, .85, .39, .75; BatchN: .79, .27, .81, .93, .73).
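One plausible way to reuse ordinary RNN kernels for this step, and presumably what the per-beam layout amounts to, is folding the beam dimension into the batch dimension; a hedged NumPy sketch with illustrative shapes:

```python
import numpy as np

rng = np.random.default_rng(2)
BATCH, BEAM, H, VOCAB = 2, 5, 4, 16
embed = rng.standard_normal((VOCAB, H))

# Previous step's surviving token per (batch, beam), e.g. character ids.
tokens = rng.integers(0, VOCAB, size=(BATCH, BEAM))
prev_h = rng.standard_normal((BATCH, BEAM, H))

# Fold beams into the batch dimension so the decoder RNN kernel
# sees an ordinary batch of size BATCH * BEAM.
x = embed[tokens.reshape(-1)]           # (BATCH*BEAM, H) embedding gather
h = prev_h.reshape(-1, H)               # (BATCH*BEAM, H)
W_x, W_h = rng.standard_normal((2, H, H))
next_h = np.tanh(x @ W_x + h @ W_h)     # one RNN step for all beams at once
print(next_h.reshape(BATCH, BEAM, H).shape)
```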
Global Attention Model
The per-beam Decoder Output (Batch0: .18, .32, .85, .39, .75; BatchN: .79, .27, .81, .93, .73) is projected by a FullyConnected layer and scored against the encoder Context Vector with a BatchedGemm. A RaggedSoftmax normalizes the attention weights while masking positions beyond each sentence's length in the Sequence Length Buffer. A second BatchedGemm forms the weighted context, which is Concat'ed with the decoder output and passed through a FullyConnected layer and TanH.
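A NumPy sketch of this attention block, including the ragged (length-masked) softmax; layer shapes and weights are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
B, T, H = 3, 5, 4
lengths = np.array([2, 5, 2])                 # Sequence Length Buffer
context = rng.standard_normal((B, T, H))      # encoder Context Vector
dec_out = rng.standard_normal((B, H))         # decoder output (beams folded in)
W_a = rng.standard_normal((H, H))             # FullyConnected
W_c = rng.standard_normal((2 * H, H))         # post-concat FullyConnected

# BatchedGemm: score the decoder state against every encoder timestep.
scores = np.einsum('btH,bH->bt', context, dec_out @ W_a)

# RaggedSoftmax: normalize only over each sentence's real length.
mask = np.arange(T)[None, :] < lengths[:, None]
scores = np.where(mask, scores, -np.inf)
scores -= scores.max(axis=1, keepdims=True)   # stabilize exp()
alpha = np.exp(scores)
alpha /= alpha.sum(axis=1, keepdims=True)

# Second BatchedGemm: attention-weighted context, then Concat + FC + TanH.
ctx = np.einsum('bt,btH->bH', alpha, context)
attn_out = np.tanh(np.concatenate([ctx, dec_out], axis=1) @ W_c)
print(alpha.round(2), attn_out.shape, sep="\n")
```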
Projection
The Attention Output is one vector per batch × beam entry (Batch0 beams: [.9,...,.1], [0,...,.3], [.1,...,0], [.6,...,.8], [.3,...,.2]; BatchN beams: [.4,...,.9], [.5,...,.2], [0,...,.7], [0,...,2], [.1,...,.9]). A FullyConnected layer projects each vector onto the vocabulary, followed by Softmax and Log, giving the per-token log-probabilities of the Projection Output. Working in log space lets beam scores accumulate by addition without underflow.
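A minimal sketch of the Softmax + Log stage, written as the numerically stable fused log-softmax:

```python
import numpy as np

def log_softmax(logits):
    # Softmax followed by Log, fused and stabilized: subtracting the max
    # avoids overflow in exp() without changing the result.
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

scores = np.array([[2.0, 1.0, 0.1]])
print(log_softmax(scores))   # log-probabilities per vocabulary entry
```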
TopK Part 1: Intra-beam
A first TopK runs independently within each beam of the Projection Output, keeping each beam's best candidates as (index, probability) pairs: Beam0 [1,3]/[.9,.8], Beam1 [2,4]/[.99,.5], Beam2 [9,0]/[.3,.8], Beam3 [5,0]/[.1,.93], Beam4 [7,6]/[.85,.99]. A Gather then flattens the winners into a single candidate list.
TopK Part 2: Inter-beam
The gathered probabilities [.9,.8,.99,.55,.3,.8,.1,.93,.85,.99] feed a second TopK across all beams, yielding indices [2,9,7,0,8] and probabilities [.99,.99,.93,.9,.85]. A Beam Mapping Plugin translates the flat indices back to (source beam, vocabulary index) pairs: Beam1/Idx2, Beam4/Idx6, Beam3/Idx0, Beam0/Idx1, Beam4/Idx7.
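Both TopK stages and the beam mapping can be reproduced in a few lines of NumPy using the slide's numbers (argsort stands in for the GPU partial sort):

```python
import numpy as np

# Per-beam candidate (vocab index, probability) pairs from TopK Part 1,
# using the numbers on the slide (5 beams, 2 candidates each).
idx  = np.array([[1, 3], [2, 4], [9, 0], [5, 0], [7, 6]])
prob = np.array([[.9, .8], [.99, .5], [.3, .8], [.1, .93], [.85, .99]])

# Gather: flatten per-beam winners into one candidate list.
flat_prob = prob.reshape(-1)

# TopK Part 2: pick the 5 best across all beams.
k = 5
top = np.argsort(-flat_prob, kind="stable")[:k]   # flat indices [2 9 7 0 8]

# Beam Mapping Plugin analogue: flat index -> (source beam, vocab index).
beams = top // prob.shape[1]
print(list(zip(beams, idx.reshape(-1)[top], flat_prob[top])))
# -> (beam1, 2, .99) (beam4, 6, .99) (beam3, 0, .93) (beam0, 1, .9) (beam4, 7, .85)
```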
Beam Search – Beam Shuffle
The selected parents are Beam1, Beam4, Beam3, Beam0, Beam4. The Beam Shuffle Plugin reorders the per-beam state to match: the new Beam0-Beam4 states are copies of the old Beam1, Beam4, Beam3, Beam0, and Beam4 states. A parent may survive twice (Beam4) while another drops out entirely (Beam2).
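The shuffle is a gather along the beam axis; a NumPy sketch with illustrative shapes:

```python
import numpy as np

rng = np.random.default_rng(4)
BEAM, H = 5, 4
hidden = rng.standard_normal((BEAM, H))   # per-beam RNN hidden state
cell   = rng.standard_normal((BEAM, H))   # per-beam RNN cell state

# Parents chosen by the inter-beam TopK on the previous slide:
# new beams 0..4 continue from old beams 1, 4, 3, 0, 4.
parent = np.array([1, 4, 3, 0, 4])

# Beam Shuffle is just a gather along the beam axis; beam 2 is dropped
# and beam 4 is duplicated, exactly as in the slide's diagram.
hidden, cell = hidden[parent], cell[parent]
print(hidden.shape, cell.shape)
```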
Beam Search – Beam Scoring
After the shuffle, the Beam Scoring Plugin processes each beam: EOS detection, sentence-probability updates, backtrack state storage (so output sentences can be reconstructed later), sequence-length increments, and an end-of-beam/end-of-batch heuristic. It emits a Batch Finished Bitmap, e.g. [0001100011...010].
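A hedged NumPy sketch of the scoring step; the end-of-batch heuristic shown (a sentence finishes when its best-scoring beam ends) is an illustrative stand-in for the deck's unspecified heuristic:

```python
import numpy as np

EOS = 2                                   # end-of-sequence token id
BATCH, BEAM = 4, 5
rng = np.random.default_rng(5)

scores  = rng.standard_normal((BATCH, BEAM))      # accumulated sentence log-probs
new_tok = rng.integers(0, 5, size=(BATCH, BEAM))  # tokens selected this step
step_lp = -rng.random((BATCH, BEAM))              # this step's log-probs
seq_len = np.full((BATCH, BEAM), 3)

# Sentence Probability Update and Sequence Length Increment.
scores += step_lp
seq_len += 1

# EOS Detection: a beam that just emitted EOS is finished.
# (Backtrack State Storage would also record this step's parent/token here.)
ended = new_tok == EOS

# End-of-beam/batch heuristic (illustrative): a sentence is finished once
# its best-scoring beam has ended.
best = scores.argmax(axis=1)
finished_bitmap = ended[np.arange(BATCH), best]
print(finished_bitmap.astype(int))                # Batch Finished Bitmap
```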
Beam Search – Batch Reduction
A sum reduction over the Batch Finished Bitmap counts the finished sentences, and a single 32-bit value transfers to the host as the new batch size. The Encoder/State Reduction Plugin then compacts the TopK output, encoder output, and beam state so finished sentences drop out of later iterations.
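A NumPy sketch of the reduction; boolean-mask compaction stands in for the Encoder/State Reduction Plugin:

```python
import numpy as np

finished = np.array([0, 1, 0, 0, 1, 1, 0], dtype=bool)  # Batch Finished Bitmap

# Reduce Operation (Sum): count what remains; only this 32-bit value
# crosses to the host as the new batch size.
new_batch_size = int((~finished).sum())

# Encoder/State Reduction: compact every per-sentence tensor so finished
# sentences drop out of the next decoder iteration.
BEAM, H, T = 5, 4, 6
rng = np.random.default_rng(6)
enc_out    = rng.standard_normal((len(finished), T, H))
beam_state = rng.standard_normal((len(finished), BEAM, H))

keep = ~finished
enc_out, beam_state = enc_out[keep], beam_state[keep]
print(new_batch_size, enc_out.shape, beam_state.shape)
```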
Output
If any sentences remain unfinished, the shuffled beam state becomes the next Decoder Input. Once all are done, the Beam State is copied device-to-host and the host backtracks it into the final output: こんにちは。これはテストです。さようなら。 ("Hello. This is a test. Goodbye.")
TENSORRT ANALYSIS
TENSORRT KERNEL ANALYSIS
Agenda
- What is NMT?
- What is the current state?
- What are the problems?
- How did we solve it?
- What perf is possible? (this section)
RESULTS

140x Faster Language Translation RNNs on V100 vs. CPU-Only Inference (OpenNMT)

Configuration      Throughput (sentences/sec)   Latency (ms)
CPU-Only + Torch   4                            280
V100 + Torch       25                           153
V100 + TensorRT    550                          117

Inference throughput (sentences/sec) on OpenNMT 692M. V100 + TensorRT: NVIDIA TensorRT (FP32), batch size 64, Tesla V100-PCIE-16GB, E5-2690 v4 @ 2.60 GHz, 3.5 GHz Turbo (Broadwell), HT on. V100 + Torch: Torch (FP32), batch size 4, Tesla V100-PCIE-16GB, same host CPU. CPU-Only: Torch (FP32), batch size 1, Intel E5-2690 v4 @ 2.60 GHz, 3.5 GHz Turbo (Broadwell), HT on.
SUMMARY
- TopK no longer dominates sequence inference time.
- RNN inference is compute bound, not memory bound.
- TensorRT accelerates sequence inference.
- Over two orders of magnitude higher throughput than CPU-only.
- Latency reduced by more than half versus CPU-only.
LEARN MORE
- Product page: developer.nvidia.com/tensorrt
- Documentation: docs.nvidia.com/deeplearning/sdk
- Training: nvidia.com/dli