SLIDE 1

S8822 – OPTIMIZING NMT WITH TENSORRT

Micah Villmow, Senior TensorRT Software Engineer

SLIDE 2

Over 100x faster, is it really possible?

SLIDE 3

DOUGLAS ADAMS – BABEL FISH

Neural Machine Translation Unit

SLIDE 4

OVER 100X FASTER, IS IT REALLY POSSIBLE?

Over 200 years

SLIDE 5

NVIDIA TENSORRT

Programmable Inference Accelerator

developer.nvidia.com/tensorrt

[Diagram: trained models from frameworks feed into TensorRT (Optimizer + Runtime) and deploy to GPU platforms: DRIVE PX 2, Jetson TX2, NVIDIA DLA, Tesla P4, Tesla V100]

SLIDE 6

TENSORRT LAYERS

Built-in Layer Support:

  • Convolution
  • LSTM and GRU
  • Activation: ReLU, tanh, sigmoid
  • Pooling: max and average
  • Scaling
  • Element-wise operations
  • LRN
  • Fully-connected
  • SoftMax
  • Deconvolution

Custom Layer API: layers outside this list are implemented as custom plugins; in the deployed application, the TensorRT runtime and the custom layers execute side by side on the CUDA runtime.
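The NMT-specific plugins discussed later (embedding, beam shuffle, beam scoring, and so on) all hang off this Custom Layer API. As a rough sketch of its shape, here is a do-nothing identity plugin written against the TensorRT 3.x-era nvinfer1::IPlugin interface that matches this talk's timeframe; the plugin interface has evolved across TensorRT versions, so treat the exact method set as illustrative rather than definitive.

```cpp
#include <NvInfer.h>
#include <cuda_runtime.h>
#include <cstring>

// Illustrative identity plugin: copies its single input to its single output.
// Real NMT plugins (beam shuffle, beam scoring, ...) would launch custom
// CUDA kernels from enqueue() instead of a plain device-to-device copy.
class IdentityPlugin : public nvinfer1::IPlugin
{
public:
    int getNbOutputs() const override { return 1; }

    nvinfer1::Dims getOutputDimensions(int index, const nvinfer1::Dims* inputs,
                                       int nbInputDims) override
    {
        return inputs[0];  // output shape matches the input shape
    }

    void configure(const nvinfer1::Dims* inputDims, int nbInputs,
                   const nvinfer1::Dims* outputDims, int nbOutputs,
                   int maxBatchSize) override
    {
        // Cache the per-sample element count for use in enqueue().
        mCount = 1;
        for (int i = 0; i < inputDims[0].nbDims; ++i)
            mCount *= inputDims[0].d[i];
    }

    int initialize() override { return 0; }
    void terminate() override {}
    size_t getWorkspaceSize(int maxBatchSize) const override { return 0; }

    int enqueue(int batchSize, const void* const* inputs, void** outputs,
                void* workspace, cudaStream_t stream) override
    {
        // Asynchronous copy on the execution stream supplied by TensorRT.
        cudaError_t err = cudaMemcpyAsync(outputs[0], inputs[0],
                                          batchSize * mCount * sizeof(float),
                                          cudaMemcpyDeviceToDevice, stream);
        return err == cudaSuccess ? 0 : 1;
    }

    size_t getSerializationSize() override { return sizeof(mCount); }
    void serialize(void* buffer) override { std::memcpy(buffer, &mCount, sizeof(mCount)); }

private:
    size_t mCount{1};
};
```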

SLIDE 7

TENSORRT OPTIMIZATIONS

  • Layer & Tensor Fusion
  • Kernel Auto-Tuning
  • Dynamic Tensor Memory
  • Weights & Activation Precision Calibration

40x Faster CNNs on V100 vs. CPU-Only, Under 7 ms Latency (ResNet50)

Configuration       Throughput (images/sec)   Latency (ms)
CPU-Only            140                        14
V100 + TensorFlow   305                        6.67
V100 + TensorRT     5,700                      6.83

Inference throughput (images/sec) on ResNet50. V100 + TensorRT: NVIDIA TensorRT (FP16), batch size 39, Tesla V100-SXM2-16GB, E5-2690 v4 @ 2.60GHz, 3.5GHz Turbo (Broadwell), HT on. V100 + TensorFlow: preview of Volta-optimized TensorFlow (FP16), batch size 2, Tesla V100-PCIE-16GB, E5-2690 v4 @ 2.60GHz, 3.5GHz Turbo (Broadwell), HT on. CPU-Only: Intel Xeon-D 1587 (Broadwell-E) CPU and Intel DL SDK; score doubled to reflect Intel's stated claim of a 2x performance improvement on Skylake with AVX-512.

SLIDE 8

Agenda

  • What is NMT?
  • What is current state?
  • What are the problems?
  • How did we solve it?
  • What perf is possible?

SLIDE 9

ACRONYMS AND DEFINITIONS

NMT: Neural Machine Translation
OpenNMT: open-source NMT project for academia and industry
Token: the minimum representation used for encoding (symbol, word, character, or subword)
Sequence: a number of tokens wrapped by special start- and end-of-sequence tokens
Beam Search: a directed, partial breadth-first tree search algorithm
TopK: a partial sort resulting in the N min/max elements
Unk: a special token that represents unknown translations
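Since TopK recurs throughout the talk, here is a minimal host-side sketch of what it computes, using std::partial_sort over (probability, index) pairs; the GPU implementations discussed later are parallel equivalents of this.

```cpp
#include <algorithm>
#include <cstdio>
#include <utility>
#include <vector>

// Return the indices and values of the k largest elements of `scores`.
// This is the meaning of "TopK": a partial sort, not a full sort.
std::vector<std::pair<float, int>> topK(const std::vector<float>& scores, int k)
{
    std::vector<std::pair<float, int>> items;
    for (int i = 0; i < static_cast<int>(scores.size()); ++i)
        items.push_back({scores[i], i});
    std::partial_sort(items.begin(), items.begin() + k, items.end(),
                      [](const auto& a, const auto& b) { return a.first > b.first; });
    items.resize(k);
    return items;
}

int main()
{
    for (auto [p, idx] : topK({0.1f, 0.9f, 0.3f, 0.8f, 0.05f}, 2))
        std::printf("token %d with probability %.2f\n", idx, p);  // tokens 1 and 3
}
```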

SLIDE 10

OPENNMT INFERENCE

[Diagram: OpenNMT inference splits into an Encoder (Input → Input Setup → Encoder RNN) and a looped Decoder (Decoder RNN → Attention Model → Projection → TopK → Beam Search: Beam Shuffle, Beam Scoring, Batch Reduction → Output).]

SLIDE 11

DECODER EXAMPLE

[Diagram: each decoder iteration runs Input → Embedding → Decoder RNN → Attention Model → Projection → TopK → Beam Search (Beam Shuffle, Beam Scoring, Batch Reduction) → Output.]

Iteration 0 starts every hypothesis from the start token <S> and produces first-word candidates such as "This", "The", "He", "What", "The".

Iteration 1+ feeds those candidates back in and extends them with next-word candidates such as "is", "house", "ran", "time", "cow". A toy implementation of this loop follows.
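The sketch below is a toy host-side beam-search decoder. The Beam struct and the stepDecoder stub are hypothetical stand-ins for the embedding/RNN/attention/projection stack on the slides; only the iteration structure mirrors the diagram.

```cpp
#include <algorithm>
#include <cstdio>
#include <cstdlib>
#include <vector>

constexpr int kVocab = 5;   // toy vocabulary size
constexpr int kBeams = 3;   // beam width
constexpr int kEos   = 0;   // hypothetical end-of-sequence token id

struct Beam {
    std::vector<int> tokens;  // tokens emitted so far
    float logProb = 0.0f;     // running sequence log-probability
    bool  done = false;       // EOS reached
};

// Hypothetical stand-in for Embedding + Decoder RNN + Attention + Projection:
// log-probabilities over the vocabulary given only the last token.
std::vector<float> stepDecoder(int lastToken)
{
    std::vector<float> logProbs(kVocab);
    for (int v = 0; v < kVocab; ++v)
        logProbs[v] = -std::abs(v - lastToken) - 1.0f;  // arbitrary toy scores
    return logProbs;
}

int main()
{
    std::vector<Beam> beams{{{1}, 0.0f, false}};  // iteration 0: start token <S>
    for (int iter = 0; iter < 4; ++iter) {
        // Expand every live beam with every token, then keep the best kBeams
        // overall (a real decoder prunes per beam with TopK first).
        std::vector<Beam> cand;
        for (const Beam& b : beams) {
            if (b.done) { cand.push_back(b); continue; }
            std::vector<float> lp = stepDecoder(b.tokens.back());
            for (int v = 0; v < kVocab; ++v) {
                Beam nb = b;
                nb.tokens.push_back(v);
                nb.logProb += lp[v];
                nb.done = (v == kEos);
                cand.push_back(nb);
            }
        }
        size_t k = std::min<size_t>(kBeams, cand.size());
        std::partial_sort(cand.begin(), cand.begin() + k, cand.end(),
                          [](const Beam& a, const Beam& b) { return a.logProb > b.logProb; });
        cand.resize(k);
        beams = cand;  // "beam shuffle": state follows the surviving hypotheses
    }
    for (int t : beams[0].tokens) std::printf("%d ", t);
    std::printf("(log-prob %.2f)\n", beams[0].logProb);
}
```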

SLIDE 12

TRAINING VS INFERENCE

[Diagram: training runs Input → Input Setup → Encoder RNN → Decoder RNN → Attention Model → Projection → Output. Inference adds the beam-search machinery after Projection: TopK, Beam Shuffle, Beam Scoring, and Batch Reduction.]

SLIDE 13

Agenda

  • What is NMT?
  • What is current state?
  • What are the problems?
  • How did we solve it?
  • What perf is possible?

SLIDE 14

INFERENCE TIME IS BEAM SEARCH TIME

  • Wu et al., 2016, ‘Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation’, arXiv:1609.08144
  • Sharan Narang, Jun. 2017, Baidu's DeepBench, https://github.com/baidu-research/DeepBench
  • Rui Zhao, Dec. 2017, ‘Why does inference run 20x slower than training.’, https://github.com/tensorflow/nmt/issues/204
  • David Levinthal, Ph.D., Jan. 2018, ‘Evaluating RNN performance across hardware platforms.’

SLIDE 15

Agenda

  • What is NMT?
  • What is current state?
  • What are the problems?
  • How did we solve it?
  • What perf is possible?

SLIDE 16

PERF ANALYSIS

SLIDE 17

KERNEL ANALYSIS

SLIDE 18

Agenda

  • What is NMT?
  • What is current state?
  • What are the problems?
  • How did we solve it?
  • What perf is possible?

SLIDE 19

ENCODER

[Diagram: Encoder = Input → Input Setup → Encoder RNN.]

SLIDE 20

Setup: the input "Hello. This is a test. Bye." is tokenized into "Hello . This is a test . Bye ." and the token IDs are gathered into the Encoder Input buffer; a PrefixSumPlugin builds the Sequence Length Buffer from the per-sentence lengths. Setup also prepares the Decoder Start Tokens and a constant zero state buffer.

Example Encoder Input token IDs: 42 23 0 73 3 8 19 23 0 98 23 0; Sequence Length Buffer: 2 5 2.
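A minimal sketch of the prefix-sum step follows, assuming the plugin's job is to turn per-sentence lengths into start offsets for the packed token buffer; the slide does not show the exact buffer layout, so the offset interpretation is an assumption.

```cpp
#include <cstdio>
#include <numeric>
#include <vector>

int main()
{
    // Sequence lengths from the slide's example: "Hello .", "This is a test .", "Bye ."
    std::vector<int> lengths{2, 5, 2};

    // Exclusive prefix sum: offsets[i] is where sentence i starts in the
    // packed token buffer; the final element is the total token count.
    std::vector<int> offsets(lengths.size() + 1, 0);
    std::partial_sum(lengths.begin(), lengths.end(), offsets.begin() + 1);

    for (size_t i = 0; i < lengths.size(); ++i)
        std::printf("sentence %zu: offset %d, length %d\n", i, offsets[i], lengths[i]);
    std::printf("total tokens: %d\n", offsets.back());
}
```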

SLIDE 21

Encoder: an Embedding Plugin maps the Encoder Input token IDs (42 23 0 73 3 8 19 23 0 98 23 0) to dense vectors (e.g., .1 .35 .123 .93 1.4 1 .01 .42 .20). A PackedRNN consumes the embeddings, the Sequence Lengths (2 5 2), and the trained hidden and cell states, and emits the Encoder Hidden State, Encoder Cell State, and Context Vector.
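The Embedding Plugin amounts to a row gather; here is a minimal host sketch, assuming a [vocab x dim] embedding table stored row-major (the table values and dimensions are made up for illustration).

```cpp
#include <cstdio>
#include <vector>

int main()
{
    const int dim = 2;  // embedding width (toy)
    // Row-major [vocab x dim] embedding table learned at training time.
    std::vector<float> table = {
        0.0f, 0.1f,   // token 0
        1.0f, 1.1f,   // token 1
        2.0f, 2.1f,   // token 2
    };
    std::vector<int> tokenIds = {2, 0, 1};  // encoder input IDs

    // Embedding lookup: copy one table row per input token.
    std::vector<float> embedded(tokenIds.size() * dim);
    for (size_t i = 0; i < tokenIds.size(); ++i)
        for (int d = 0; d < dim; ++d)
            embedded[i * dim + d] = table[tokenIds[i] * dim + d];

    for (size_t i = 0; i < tokenIds.size(); ++i)
        std::printf("token %d -> %.1f %.1f\n", tokenIds[i],
                    embedded[i * dim], embedded[i * dim + 1]);
}
```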

SLIDE 22

DECODER

[Diagram: Decoder = Input → Decoder RNN → Attention Model → Projection → TopK → Beam Search (Beam Shuffle → Beam Scoring → Batch Reduction) → Output, looped until all sentences finish.]

SLIDE 23

Decoder, 1st iteration: every batch entry starts from the start-of-sentence token (Batch0 <S> … BatchN <S>), mapped through an Embedding Plugin. The decoder RNN is seeded with the Encoder Hidden State and Encoder Cell State and produces the Decode Hidden State, Decode Cell State, and the Decoder Output (e.g., Batch0 .124 … BatchN .912).

SLIDE 24

Decoder, 2nd+ iteration: the Decoder Input now carries one token per beam per batch entry (e.g., Batch 0 holds こ ん に ち は across Beams 0-4 and Batch N holds さ よ う な ら), again mapped through the Embedding Plugin. The RNN consumes the previous hidden and cell states and emits the next hidden and cell states plus the Decoder Output, one score per batch x beam (e.g., Batch 0: .18 .32 .85 .39 .75; Batch N: .79 .27 .81 .93 .73).

SLIDE 25

Global Attention Model: the Decoder Output (one vector per batch x beam) is projected through a FullyConnected layer, then a BatchedGemm scores it against the Context Vector; a RaggedSoftmax, masked by the Sequence Length Buffer, normalizes the scores; a second BatchedGemm forms the weighted context, which is concatenated with the decoder output and passed through a final FullyConnected layer with TanH activation. A sketch of this step appears below.
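This is a host-side sketch of the attention step for a single (batch, beam) pair, assuming Luong-style dot-product attention as used in OpenNMT; the toy dimensions, the omission of the learned weight matrix, and the masking behavior attributed to RaggedSoftmax are the assumptions here.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// One decoder step of global attention for a single (batch, beam) entry.
// encStates: one dim-sized vector per source position (the context vectors).
// decState:  the dim-sized decoder output for this step.
// validLen:  the sentence's true length; positions past it are masked out,
//            which is what the RaggedSoftmax plugin does batch-wide.
std::vector<float> attend(const std::vector<std::vector<float>>& encStates,
                          const std::vector<float>& decState, int validLen)
{
    const int dim = static_cast<int>(decState.size());

    // Scores: dot product of the decoder state with each encoder position
    // (the first BatchedGemm; the learned weight matrix is omitted here).
    std::vector<float> w(encStates.size(), 0.0f);
    for (int t = 0; t < validLen; ++t)
        for (int d = 0; d < dim; ++d) w[t] += decState[d] * encStates[t][d];

    // Ragged softmax: normalize over the valid positions only.
    float maxS = *std::max_element(w.begin(), w.begin() + validLen);
    float sum = 0.0f;
    for (int t = 0; t < validLen; ++t) { w[t] = std::exp(w[t] - maxS); sum += w[t]; }
    for (int t = 0; t < validLen; ++t) w[t] /= sum;

    // Weighted context (the second BatchedGemm); Concat + FullyConnected +
    // TanH would follow to produce the Attention Output.
    std::vector<float> context(dim, 0.0f);
    for (int t = 0; t < validLen; ++t)
        for (int d = 0; d < dim; ++d) context[d] += w[t] * encStates[t][d];
    return context;
}

int main()
{
    // Three encoder positions (one padded), two-dimensional states.
    std::vector<std::vector<float>> enc = {{1, 0}, {0, 1}, {0, 0}};
    std::vector<float> ctx = attend(enc, {2, 1}, /*validLen=*/2);
    std::printf("context: %.3f %.3f\n", ctx[0], ctx[1]);  // ~0.731 0.269
}
```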

SLIDE 26

Projection: the Attention Output, one vector per batch x beam (e.g., Batch 0: [.9,…,.1] [0,…,.3] [.1,…,0] [.6,…,.8] [.3,…,.2]; Batch N: [.4,…,.9] [.5,…,.2] [0,…,.7] [0,…,2] [.1,…,.9]), is multiplied against the projection weights in a FullyConnected layer, then passed through Softmax and Log, yielding the Projection Output: per-token log-probabilities for every batch x beam entry.
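The Softmax + Log tail is a numerically delicate spot; here is a minimal stable log-softmax sketch. The max-subtraction trick is standard practice, not something the slide states.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Numerically stable log-softmax over one vocabulary-sized logit vector:
// log(softmax(x)_i) = x_i - max(x) - log(sum_j exp(x_j - max(x))).
std::vector<float> logSoftmax(const std::vector<float>& logits)
{
    float maxL = *std::max_element(logits.begin(), logits.end());
    float sum = 0.0f;
    for (float x : logits) sum += std::exp(x - maxL);
    float logSum = std::log(sum);

    std::vector<float> out(logits.size());
    for (size_t i = 0; i < logits.size(); ++i)
        out[i] = logits[i] - maxL - logSum;
    return out;
}

int main()
{
    for (float v : logSoftmax({1.0f, 2.0f, 3.0f}))
        std::printf("%.4f ", v);  // -2.4076 -1.4076 -0.4076
    std::printf("\n");
}
```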

SLIDE 27

TopK, part 1 (intra-beam): a TopK over the Projection Output keeps the best candidates within each beam as (index, probability) pairs, e.g., Beam0 [1,3] / [.9,.8], Beam1 [2,4] / [.99,.5], Beam2 [9,0] / [.3,.8], Beam3 [5,0] / [.1,.93], Beam4 [7,6] / [.85,.99]. A Gather then flattens these per-beam candidates for the inter-beam pass.

SLIDE 28

TopK, part 2 (inter-beam): the gathered probabilities [.9,.8,.99,.55,.3,.8,.1,.93,.85,.99] feed a second TopK across all beams, producing flat indices [2,9,7,0,8] with probabilities [.99,.99,.93,.9,.85]. A Beam Mapping Plugin converts the flat indices back into (beam, token index) pairs using the intra-beam output:

Beam1,Idx2  Beam4,Idx6  Beam3,Idx0  Beam0,Idx1  Beam4,Idx7
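A host sketch of the two-stage TopK, assuming five beams and (as on the slides) two surviving candidates per beam from stage 1:

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

struct Candidate { float prob; int beam; int token; };

int main()
{
    // Stage-1 (intra-beam) output from the slides: two (token, prob)
    // candidates per beam, already flattened by the Gather step.
    std::vector<Candidate> flat = {
        {.90f, 0, 1}, {.80f, 0, 3},   // beam 0
        {.99f, 1, 2}, {.55f, 1, 4},   // beam 1
        {.30f, 2, 9}, {.80f, 2, 0},   // beam 2
        {.10f, 3, 5}, {.93f, 3, 0},   // beam 3
        {.85f, 4, 7}, {.99f, 4, 6},   // beam 4
    };

    // Stage 2 (inter-beam): keep the 5 best candidates across all beams.
    const int k = 5;
    std::partial_sort(flat.begin(), flat.begin() + k, flat.end(),
                      [](const Candidate& a, const Candidate& b) { return a.prob > b.prob; });

    // Matches the slide's selection: Beam1,Idx2 Beam4,Idx6 Beam3,Idx0
    // Beam0,Idx1 Beam4,Idx7 (the order of the two tied .99 entries may vary).
    for (int i = 0; i < k; ++i)
        std::printf("Beam%d,Idx%d (%.2f)\n", flat[i].beam, flat[i].token, flat[i].prob);
}
```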

SLIDE 29

Beam Search – Beam Shuffle: the selected pairs (Beam1,Idx2; Beam4,Idx6; Beam3,Idx0; Beam0,Idx1; Beam4,Idx7) tell the Beam Shuffle Plugin how to reorder the per-beam state: the Beam0-Beam4 states become Beam1 State+1, Beam4 State+1, Beam3 State+1, Beam0 State+1, Beam4 State+1. A surviving beam can be duplicated (Beam4) while another drops out (Beam2).
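A minimal sketch of the shuffle as a gather over state rows, assuming state is simply one flat vector per beam:

```cpp
#include <cstdio>
#include <vector>

int main()
{
    // One state row per beam (just 2 floats each for illustration).
    std::vector<std::vector<float>> state = {
        {0.0f, 0.1f}, {1.0f, 1.1f}, {2.0f, 2.1f}, {3.0f, 3.1f}, {4.0f, 4.1f}};

    // Source beam for each output slot, from the TopK selection above:
    // Beam1, Beam4, Beam3, Beam0, Beam4.
    std::vector<int> srcBeam = {1, 4, 3, 0, 4};

    // Beam shuffle is a gather: output slot i copies the state of srcBeam[i].
    // Beam 4 is duplicated and beam 2 is dropped; on the GPU this would be
    // one batched copy kernel rather than a host loop.
    std::vector<std::vector<float>> next(srcBeam.size());
    for (size_t i = 0; i < srcBeam.size(); ++i)
        next[i] = state[srcBeam[i]];

    for (size_t i = 0; i < next.size(); ++i)
        std::printf("slot %zu <- beam %d: %.1f %.1f\n", i, srcBeam[i], next[i][0], next[i][1]);
}
```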

SLIDE 30

Beam Search – Beam Scoring: given the shuffled states and the selected pairs (Beam1,Idx2; Beam4,Idx6; Beam3,Idx0; Beam0,Idx1; Beam4,Idx7), the Beam Scoring Plugin performs:

  • EOS detection
  • Sentence probability update
  • Backtrack state storage
  • Sequence length increment
  • End-of-beam/batch heuristic

Its output includes a Batch Finished Bitmap, e.g., [0001100011…010].
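A host-side sketch of this bookkeeping, assuming the plugin folds each chosen token's log-probability into a running score; backtrack storage and the end-of-beam heuristics are omitted, and kEos is a hypothetical token id.

```cpp
#include <cstdio>
#include <vector>

constexpr int kEos = 2;  // hypothetical end-of-sequence token id

struct BeamState {
    float score = 0.0f;    // running sequence log-probability
    int   length = 0;      // sequence length so far
    bool  finished = false;
};

// One scoring step over a batch of beams: fold in each chosen token's
// log-probability, bump the sequence length, detect EOS, and emit the
// finished bitmap that Batch Reduction consumes on the next slide.
std::vector<bool> scoreStep(std::vector<BeamState>& beams,
                            const std::vector<int>& newTokens,
                            const std::vector<float>& newLogProbs)
{
    std::vector<bool> finishedBitmap(beams.size());
    for (size_t i = 0; i < beams.size(); ++i) {
        if (!beams[i].finished) {
            beams[i].score += newLogProbs[i];            // sentence probability update
            beams[i].length += 1;                        // sequence length increment
            beams[i].finished = (newTokens[i] == kEos);  // EOS detection
        }
        finishedBitmap[i] = beams[i].finished;
    }
    return finishedBitmap;
}

int main()
{
    std::vector<BeamState> beams(3);
    auto bitmap = scoreStep(beams, {5, kEos, 7}, {-0.1f, -0.2f, -0.3f});
    for (bool b : bitmap) std::printf("%d", b ? 1 : 0);  // prints 010
    std::printf("\n");
}
```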

SLIDE 31

Beam Search – Batch Reduction: a reduce (sum) over the Batch Finished Bitmap [0001100011…010] is transferred to the host as a single 32-bit value to set the new batch size. The Encoder/State Reduction Plugin then gathers the Encoder Output and Beam State (and the TopK/Gather buffers) down to the still-active sentences, so finished ones drop out of later iterations.
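A host sketch of the reduction, assuming compaction simply removes finished rows; the GPU plugin does the equivalent with reduction and gather kernels, and only the single 32-bit count crosses to the host.

```cpp
#include <cstdio>
#include <vector>

int main()
{
    // One flag per sentence in the batch: 1 = finished, 0 = still decoding.
    std::vector<int> finishedBitmap = {0, 0, 1, 0, 1};
    std::vector<float> beamState    = {.1f, .2f, .3f, .4f, .5f};  // one row each

    // Reduce (sum) the bitmap; the new batch size is what remains active.
    int finished = 0;
    for (int f : finishedBitmap) finished += f;
    int newBatchSize = static_cast<int>(finishedBitmap.size()) - finished;

    // Compact the state so active sentences are contiguous.
    std::vector<float> compacted;
    for (size_t i = 0; i < finishedBitmap.size(); ++i)
        if (!finishedBitmap[i]) compacted.push_back(beamState[i]);

    std::printf("new batch size: %d of %zu\n", newBatchSize, finishedBitmap.size());
}
```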

SLIDE 32

Output: each iteration appends the selected pairs (Beam1,Idx2; Beam4,Idx6; Beam3,Idx0; Beam0,Idx1; Beam4,Idx7) to the stored beam state. If not all sentences are done, the loop produces the next Decoder Input; once all are done, the Beam State is copied device-to-host and the host backtracks through it to emit the translations: こんにちは。 これはテストです。 さようなら。 ("Hello. This is a test. Goodbye.")

SLIDE 33

TENSORRT ANALYSIS

SLIDE 34

TENSORRT KERNEL ANALYSIS

SLIDE 35

Agenda

  • What is NMT?
  • What is current state?
  • What are the problems?
  • How did we solve it?
  • What perf is possible?

SLIDE 36

RESULTS

140x Faster Language Translation RNNs on V100 vs. CPU-Only Inference (OpenNMT)

Configuration      Throughput (sentences/sec)   Latency (ms)
CPU-Only + Torch   4                            280
V100 + Torch       25                           153
V100 + TensorRT    550                          117

Inference throughput (sentences/sec) on OpenNMT 692M. V100 + TensorRT: NVIDIA TensorRT (FP32), batch size 64, Tesla V100-PCIE-16GB, E5-2690 v4 @ 2.60GHz, 3.5GHz Turbo (Broadwell), HT on. V100 + Torch: Torch (FP32), batch size 4, Tesla V100-PCIE-16GB, E5-2690 v4 @ 2.60GHz, 3.5GHz Turbo (Broadwell), HT on. CPU-Only: Torch (FP32), batch size 1, Intel E5-2690 v4 @ 2.60GHz, 3.5GHz Turbo (Broadwell), HT on.

SLIDE 37

SUMMARY

  • Showed that TopK no longer dominates sequence inference time.
  • Showed that RNN inference is compute bound, not memory bound.
  • TensorRT accelerates sequence inference.
  • Over two orders of magnitude higher throughput than CPU-only inference.
  • Latency reduced by more than half versus CPU-only inference.


SLIDE 38

LEARN MORE

developer.nvidia.com/tensorrt – PRODUCT PAGE

docs.nvidia.com/deeplearning/sdk – DOCUMENTATION

nvidia.com/dli – TRAINING

SLIDE 39

Q&A
