S8822: OPTIMIZING NMT WITH TENSORRT
Micah Villmow, Senior TensorRT Software Engineer
DOUGLAS ADAMS – BABEL FISH
Neural Machine Translation Unit
OVER 100X FASTER, IS IT REALLY POSSIBLE?
Over 200 years
NVIDIA TENSORRT
Programmable Inference Accelerator
developer.nvidia.com/tensorrt
TensorRT (Optimizer + Runtime) connects trained models from the frameworks to NVIDIA GPU platforms: DRIVE PX 2, JETSON TX2, NVIDIA DLA, TESLA P4, TESLA V100.
TENSORRT LAYERS

Built-in Layer Support:
- Convolution
- LSTM and GRU
- Activation: ReLU, tanh, sigmoid
- Pooling: max and average
- Scaling
- Element-wise operations
- LRN
- Fully-connected
- SoftMax
- Deconvolution

Custom Layer API: the Deployed Application calls into the TensorRT Runtime, which executes built-in layers and user-provided Custom Layers on top of the CUDA Runtime.
TENSORRT OPTIMIZATIONS
- Kernel Auto-Tuning
- Layer & Tensor Fusion
- Dynamic Tensor Memory
- Weights & Activation Precision Calibration
40x Faster CNNs on V100 vs. CPU-Only, Under 7 ms Latency (ResNet50)

Configuration       Throughput (images/sec)   Latency (ms)
CPU-Only            140                       14
V100 + TensorFlow   305                       6.67
V100 + TensorRT     5700                      6.83

Inference throughput (images/sec) on ResNet50. V100 + TensorRT: NVIDIA TensorRT (FP16), batch size 39, Tesla V100-SXM2-16GB, E5-2690 v4 @ 2.60 GHz, 3.5 GHz Turbo (Broadwell), HT on. V100 + TensorFlow: preview of Volta-optimized TensorFlow (FP16), batch size 2, Tesla V100-PCIE-16GB, E5-2690 v4 @ 2.60 GHz, 3.5 GHz Turbo (Broadwell), HT on. CPU-Only: Intel Xeon-D 1587 (Broadwell-E) with the Intel DL SDK; score doubled to account for Intel's stated claim of a 2x performance improvement on Skylake with AVX512.
Agenda
- What is NMT? (this section)
- What is the current state?
- What are the problems?
- How did we solve it?
- What perf is possible?
ACRONYMS AND DEFINITIONS
- NMT: Neural Machine Translation
- OpenNMT: open-source NMT project for academia and industry
- Token: the minimum representation used for encoding (symbol, word, character, or subword)
- Sequence: a number of tokens wrapped by special start- and end-of-sequence tokens
- Beam Search: a directed, partial breadth-first tree search algorithm
- TopK: a partial sort resulting in the N min/max elements
- Unk: a special token that represents unknown translations
OPENNMT INFERENCE

Input → Input Setup → Encoder (Encoder RNN) → Decoder loop: Decoder RNN → Attention Model → Projection → TopK → Beam Search (Beam Scoring → Beam Shuffle → Batch Reduction) → Output
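To make the data flow concrete, here is a minimal NumPy sketch of the decode loop with toy dimensions, random weights, and a greedy single-beam search; every name and shape is illustrative, not the actual TensorRT plugin API:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, HIDDEN, STEPS = 16, 8, 5

# Toy "trained" parameters: embedding, recurrent cell, output projection.
embed = rng.standard_normal((VOCAB, HIDDEN))
W_x = rng.standard_normal((HIDDEN, HIDDEN))
W_h = rng.standard_normal((HIDDEN, HIDDEN))
proj = rng.standard_normal((HIDDEN, VOCAB))

def rnn_step(x, h):
    # Simple tanh RNN standing in for the LSTM decoder.
    return np.tanh(x @ W_x + h @ W_h)

h = np.zeros(HIDDEN)                 # the encoder's final state would go here
token, END = 1, 2                    # <S> start id and </S> end id
output = []
for _ in range(STEPS):               # decoder loop: RNN -> projection -> TopK
    x = embed[token]                 # Embedding
    h = rnn_step(x, h)               # Decoder RNN (attention omitted here)
    logits = h @ proj                # Projection
    token = int(np.argmax(logits))   # TopK with k=1 (greedy "beam")
    if token == END:
        break                        # Batch Reduction degenerates to a stop check
    output.append(token)
print(output)
```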
DECODER EXAMPLE

Each decode step runs Embedding → Decoder RNN → Attention Model → Projection → TopK → Beam Search (Beam Scoring → Beam Shuffle → Batch Reduction).

Iteration 0: every beam is seeded with the start token <S>, and the step proposes candidate first words, e.g. "This", "The", "He", "What", "The".
Iteration 1+: each surviving beam extends its hypothesis by one token, e.g. "This is", "The house", "He ran", "What time", "The cow".
TRAINING VS INFERENCE

Training graph: Input → Input Setup → Encoder (Encoder RNN) → Decoder (Decoder RNN → Attention Model → Projection) → Output.
Inference graph: the same pipeline plus the search machinery, inserting TopK and Beam Search (Beam Scoring, Beam Shuffle, Batch Reduction) between the Projection and the Output.
Agenda
- What is NMT?
- What is the current state? (this section)
- What are the problems?
- How did we solve it?
- What perf is possible?
INFERENCE TIME IS BEAM SEARCH TIME
- Wu et al., 2016, "Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation", arXiv:1609.08144
- Sharan Narang, Jun 2017, Baidu's DeepBench, https://github.com/baidu-research/DeepBench
- Rui Zhao, Dec 2017, "Why does inference run 20x slower than training", https://github.com/tensorflow/nmt/issues/204
- David Levinthal, Ph.D., Jan 2018, "Evaluating RNN performance across hardware platforms"
Agenda
- What is NMT?
- What is the current state?
- What are the problems? (this section)
- How did we solve it?
- What perf is possible?
PERF ANALYSIS
KERNEL ANALYSIS
Agenda
- What is NMT?
- What is the current state?
- What are the problems?
- How did we solve it? (this section)
- What perf is possible?
ENCODER

The Encoder stage consists of Input Setup followed by the Encoder RNN.
Input Setup
Input "Hello. This is a test. Bye." → Tokenization → "Hello ." | "This is a test ." | "Bye ." → Gather → Encoder Input as padded token IDs (e.g. 42 23 | 73 3 8 19 23 | 98 23).
A PrefixSumPlugin derives the Sequence Length Buffer from the per-sentence token counts (2, 5, 2). Setup also emits the Decoder Start Tokens and a constant zero state buffer.
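A small NumPy illustration of this setup stage; the vocabulary mapping is hypothetical, chosen only to reproduce the token IDs on the slide:

```python
import numpy as np

# Hypothetical vocabulary matching the token ids shown above.
vocab = {"Hello": 42, ".": 23, "This": 73, "is": 3, "a": 8, "test": 19, "Bye": 98}

sentences = ["Hello .", "This is a test .", "Bye ."]
ids = [[vocab[t] for t in s.split()] for s in sentences]
lengths = np.array([len(x) for x in ids])            # [2, 5, 2]

# Pad to the max length so the batch is one rectangular tensor.
batch = np.zeros((len(ids), lengths.max()), dtype=np.int32)
for i, row in enumerate(ids):
    batch[i, :len(row)] = row

# PrefixSumPlugin analogue: an exclusive prefix sum turns lengths into
# offsets into the packed (un-padded) token stream.
offsets = np.concatenate([[0], np.cumsum(lengths)])  # [0, 2, 7, 9]
print(batch, lengths, offsets, sep="\n")
```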
Encoder
The Encoder Input IDs pass through an Embedding Plugin and into a PackedRNN, which reads the Sequence Lengths (2, 5, 2) so padded positions are skipped. The RNN starts from the trained hidden and cell states and produces the Encoder Hidden State, Encoder Cell State, and the per-timestep Context Vector (e.g. .1, .35, .123, .93, 1.4, 1, .01, .42, .20).
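The PackedRNN itself skips the padded work entirely; the masked loop below is a NumPy approximation with the same semantics (padded steps leave the state untouched), using illustrative shapes and random weights:

```python
import numpy as np

rng = np.random.default_rng(1)
B, T, H = 3, 5, 4                       # batch, max time, hidden size
lengths = np.array([2, 5, 2])           # from the Sequence Length Buffer
x = rng.standard_normal((B, T, H))      # embedded encoder input
W_x = rng.standard_normal((H, H))
W_h = rng.standard_normal((H, H))

h = np.zeros((B, H))                    # "trained" initial state in the deck
context = np.zeros((B, T, H))           # per-timestep outputs = Context Vector
for t in range(T):
    h_new = np.tanh(x[:, t] @ W_x + h @ W_h)
    alive = (t < lengths)[:, None]      # mask: padded steps keep the old state
    h = np.where(alive, h_new, h)
    context[:, t] = np.where(alive, h_new, 0.0)
print(h.shape, context.shape)
```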
DECODER

Per decode step: Decoder RNN → Attention Model → Projection → TopK → Beam Search (Beam Scoring → Beam Shuffle → Batch Reduction).
Decoder, 1st Iteration
Every batch entry starts from the start-of-sentence token <S> (Batch0 ... BatchN), run through the Embedding Plugin into the RNN. The RNN is seeded with the Encoder Hidden State and Encoder Cell State, and produces the Decode Hidden State, Decode Cell State, and the Decoder Output scores (e.g. Batch0: .124, ..., BatchN: .912).
Decoder, 2nd+ Iteration
From the second step on, the Decoder Input is the surviving token per beam (character-level here: Batch0 beams 0-4 hold こ, ん, に, ち, は, the characters of こんにちは, "hello"; BatchN beams hold さ, よ, う, な, ら, the characters of さようなら, "goodbye"), again embedded via the Embedding Plugin. The RNN consumes the previous hidden and cell state and emits the next hidden and cell state plus per-beam Decoder Output scores (Batch0: .18, .32, .85, .39, .75; BatchN: .79, .27, .81, .93, .73).
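One plausible way to reuse ordinary RNN kernels for this step, and presumably what the per-beam layout amounts to, is folding the beam dimension into the batch dimension; a hedged NumPy sketch with illustrative shapes:

```python
import numpy as np

rng = np.random.default_rng(2)
BATCH, BEAM, H, VOCAB = 2, 5, 4, 16
embed = rng.standard_normal((VOCAB, H))

# Previous step's surviving token per (batch, beam), e.g. character ids.
tokens = rng.integers(0, VOCAB, size=(BATCH, BEAM))
prev_h = rng.standard_normal((BATCH, BEAM, H))

# Fold beams into the batch dimension so the decoder RNN kernel
# sees an ordinary batch of size BATCH * BEAM.
x = embed[tokens.reshape(-1)]           # (BATCH*BEAM, H) embedding gather
h = prev_h.reshape(-1, H)               # (BATCH*BEAM, H)
W_x, W_h = rng.standard_normal((2, H, H))
next_h = np.tanh(x @ W_x + h @ W_h)     # one RNN step for all beams at once
print(next_h.reshape(BATCH, BEAM, H).shape)
```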
Global Attention Model
The per-beam Decoder Output (Batch0: .18, .32, .85, .39, .75; BatchN: .79, .27, .81, .93, .73) is projected by a FullyConnected layer and scored against the encoder Context Vector with a BatchedGemm. A RaggedSoftmax normalizes the attention weights while masking positions beyond each sentence's length in the Sequence Length Buffer. A second BatchedGemm forms the weighted context, which is Concat'ed with the decoder output and passed through a FullyConnected layer and TanH.
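A NumPy sketch of this attention block, including the ragged (length-masked) softmax; layer shapes and weights are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
B, T, H = 3, 5, 4
lengths = np.array([2, 5, 2])                 # Sequence Length Buffer
context = rng.standard_normal((B, T, H))      # encoder Context Vector
dec_out = rng.standard_normal((B, H))         # decoder output (beams folded in)
W_a = rng.standard_normal((H, H))             # FullyConnected
W_c = rng.standard_normal((2 * H, H))         # post-concat FullyConnected

# BatchedGemm: score the decoder state against every encoder timestep.
scores = np.einsum('btH,bH->bt', context, dec_out @ W_a)

# RaggedSoftmax: normalize only over each sentence's real length.
mask = np.arange(T)[None, :] < lengths[:, None]
scores = np.where(mask, scores, -np.inf)
scores -= scores.max(axis=1, keepdims=True)   # stabilize exp()
alpha = np.exp(scores)
alpha /= alpha.sum(axis=1, keepdims=True)

# Second BatchedGemm: attention-weighted context, then Concat + FC + TanH.
ctx = np.einsum('bt,btH->bH', alpha, context)
attn_out = np.tanh(np.concatenate([ctx, dec_out], axis=1) @ W_c)
print(alpha.round(2), attn_out.shape, sep="\n")
```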
Projection
The Attention Output is one vector per batch × beam entry (Batch0 beams: [.9,...,.1], [0,...,.3], [.1,...,0], [.6,...,.8], [.3,...,.2]; BatchN beams: [.4,...,.9], [.5,...,.2], [0,...,.7], [0,...,2], [.1,...,.9]). A FullyConnected layer projects each vector onto the vocabulary, followed by Softmax and Log, giving the per-token log-probabilities of the Projection Output. Working in log space lets beam scores accumulate by addition without underflow.
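A minimal sketch of the Softmax + Log stage, written as the numerically stable fused log-softmax:

```python
import numpy as np

def log_softmax(logits):
    # Softmax followed by Log, fused and stabilized: subtracting the max
    # avoids overflow in exp() without changing the result.
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

scores = np.array([[2.0, 1.0, 0.1]])
print(log_softmax(scores))   # log-probabilities per vocabulary entry
```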
TopK Part 1: Intra-beam
A first TopK runs independently within each beam of the Projection Output, keeping each beam's best candidates as (index, probability) pairs: Beam0 [1,3]/[.9,.8], Beam1 [2,4]/[.99,.5], Beam2 [9,0]/[.3,.8], Beam3 [5,0]/[.1,.93], Beam4 [7,6]/[.85,.99]. A Gather then flattens the winners into a single candidate list.
TopK Part 2: Inter-beam
The gathered probabilities [.9,.8,.99,.55,.3,.8,.1,.93,.85,.99] feed a second TopK across all beams, yielding indices [2,9,7,0,8] and probabilities [.99,.99,.93,.9,.85]. A Beam Mapping Plugin translates the flat indices back to (source beam, vocabulary index) pairs: Beam1/Idx2, Beam4/Idx6, Beam3/Idx0, Beam0/Idx1, Beam4/Idx7.
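Both TopK stages and the beam mapping can be reproduced in a few lines of NumPy using the slide's numbers (argsort stands in for the GPU partial sort):

```python
import numpy as np

# Per-beam candidate (vocab index, probability) pairs from TopK Part 1,
# using the numbers on the slide (5 beams, 2 candidates each).
idx  = np.array([[1, 3], [2, 4], [9, 0], [5, 0], [7, 6]])
prob = np.array([[.9, .8], [.99, .5], [.3, .8], [.1, .93], [.85, .99]])

# Gather: flatten per-beam winners into one candidate list.
flat_prob = prob.reshape(-1)

# TopK Part 2: pick the 5 best across all beams.
k = 5
top = np.argsort(-flat_prob, kind="stable")[:k]   # flat indices [2 9 7 0 8]

# Beam Mapping Plugin analogue: flat index -> (source beam, vocab index).
beams = top // prob.shape[1]
print(list(zip(beams, idx.reshape(-1)[top], flat_prob[top])))
# -> (beam1, 2, .99) (beam4, 6, .99) (beam3, 0, .93) (beam0, 1, .9) (beam4, 7, .85)
```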
Beam Search – Beam Shuffle
The selected parents are Beam1, Beam4, Beam3, Beam0, Beam4. The Beam Shuffle Plugin reorders the per-beam state to match: the new Beam0-Beam4 states are copies of the old Beam1, Beam4, Beam3, Beam0, and Beam4 states. A parent may survive twice (Beam4) while another drops out entirely (Beam2).
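The shuffle is a gather along the beam axis; a NumPy sketch with illustrative shapes:

```python
import numpy as np

rng = np.random.default_rng(4)
BEAM, H = 5, 4
hidden = rng.standard_normal((BEAM, H))   # per-beam RNN hidden state
cell   = rng.standard_normal((BEAM, H))   # per-beam RNN cell state

# Parents chosen by the inter-beam TopK on the previous slide:
# new beams 0..4 continue from old beams 1, 4, 3, 0, 4.
parent = np.array([1, 4, 3, 0, 4])

# Beam Shuffle is just a gather along the beam axis; beam 2 is dropped
# and beam 4 is duplicated, exactly as in the slide's diagram.
hidden, cell = hidden[parent], cell[parent]
print(hidden.shape, cell.shape)
```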
Beam Search – Beam Scoring
After the shuffle, the Beam Scoring Plugin processes each beam: EOS detection, sentence-probability updates, backtrack state storage (so output sentences can be reconstructed later), sequence-length increments, and an end-of-beam/end-of-batch heuristic. It emits a Batch Finished Bitmap, e.g. [0001100011...010].
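A hedged NumPy sketch of the scoring step; the end-of-batch heuristic shown (a sentence finishes when its best-scoring beam ends) is an illustrative stand-in for the deck's unspecified heuristic:

```python
import numpy as np

EOS = 2                                   # end-of-sequence token id
BATCH, BEAM = 4, 5
rng = np.random.default_rng(5)

scores  = rng.standard_normal((BATCH, BEAM))      # accumulated sentence log-probs
new_tok = rng.integers(0, 5, size=(BATCH, BEAM))  # tokens selected this step
step_lp = -rng.random((BATCH, BEAM))              # this step's log-probs
seq_len = np.full((BATCH, BEAM), 3)

# Sentence Probability Update and Sequence Length Increment.
scores += step_lp
seq_len += 1

# EOS Detection: a beam that just emitted EOS is finished.
# (Backtrack State Storage would also record this step's parent/token here.)
ended = new_tok == EOS

# End-of-beam/batch heuristic (illustrative): a sentence is finished once
# its best-scoring beam has ended.
best = scores.argmax(axis=1)
finished_bitmap = ended[np.arange(BATCH), best]
print(finished_bitmap.astype(int))                # Batch Finished Bitmap
```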
Beam Search – Batch Reduction
A sum reduction over the Batch Finished Bitmap counts the finished sentences, and a single 32-bit value transfers to the host as the new batch size. The Encoder/State Reduction Plugin then compacts the TopK output, encoder output, and beam state so finished sentences drop out of later iterations.
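A NumPy sketch of the reduction; boolean-mask compaction stands in for the Encoder/State Reduction Plugin:

```python
import numpy as np

finished = np.array([0, 1, 0, 0, 1, 1, 0], dtype=bool)  # Batch Finished Bitmap

# Reduce Operation (Sum): count what remains; only this 32-bit value
# crosses to the host as the new batch size.
new_batch_size = int((~finished).sum())

# Encoder/State Reduction: compact every per-sentence tensor so finished
# sentences drop out of the next decoder iteration.
BEAM, H, T = 5, 4, 6
rng = np.random.default_rng(6)
enc_out    = rng.standard_normal((len(finished), T, H))
beam_state = rng.standard_normal((len(finished), BEAM, H))

keep = ~finished
enc_out, beam_state = enc_out[keep], beam_state[keep]
print(new_batch_size, enc_out.shape, beam_state.shape)
```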
Output
If any sentences remain unfinished, the shuffled beam state becomes the next Decoder Input. Once all are done, the Beam State is copied device-to-host and the host backtracks it into the final output: こんにちは。これはテストです。さようなら。 ("Hello. This is a test. Goodbye.")
TENSORRT ANALYSIS
TENSORRT KERNEL ANALYSIS
Agenda
- What is NMT?
- What is the current state?
- What are the problems?
- How did we solve it?
- What perf is possible? (this section)
RESULTS

140x Faster Language Translation RNNs on V100 vs. CPU-Only Inference (OpenNMT)

Configuration      Throughput (sentences/sec)   Latency (ms)
CPU-Only + Torch   4                            280
V100 + Torch       25                           153
V100 + TensorRT    550                          117

Inference throughput (sentences/sec) on OpenNMT 692M. V100 + TensorRT: NVIDIA TensorRT (FP32), batch size 64, Tesla V100-PCIE-16GB, E5-2690 v4 @ 2.60 GHz, 3.5 GHz Turbo (Broadwell), HT on. V100 + Torch: Torch (FP32), batch size 4, Tesla V100-PCIE-16GB, same host CPU. CPU-Only: Torch (FP32), batch size 1, Intel E5-2690 v4 @ 2.60 GHz, 3.5 GHz Turbo (Broadwell), HT on.
SUMMARY
- TopK no longer dominates sequence inference time.
- RNN inference is compute bound, not memory bound.
- TensorRT accelerates sequence inference.
- Over two orders of magnitude higher throughput than CPU-only.
- Latency reduced by more than half versus CPU-only.
LEARN MORE
- Product page: developer.nvidia.com/tensorrt
- Documentation: docs.nvidia.com/deeplearning/sdk
- Training: nvidia.com/dli