SLIDE 1

FASTER TRANSFORMER

Bo Yang Hsueh, 2019/12/18

SLIDE 2

AGENDA

What is Faster Transformer
  Introduce the Transformer and Faster Transformer 1.0

New Features in Faster Transformer 2.0
  Introduce Faster Transformer 2.0

Faster Transformer 2.0 performance
  Demonstrate the performance of Faster Transformer 2.0

Network Pruning

Q&A time

SLIDE 3

WHAT IS FASTER TRANSFORMER

SLIDE 4

WHAT IS FASTER TRANSFORMER

What is Transformer

Proposed in “Attention Is All You Need” [1]
Uses only the attention mechanism
Applications:
  QA
  Online classification
  Search: relationship of ads

[Diagram: Encoder (Self-Attention + Feed Forward Network, N layers) and Decoder (Self-Attention + Encoder-Decoder Attention + Feed Forward Network, N layers)]

[1] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł. and Polosukhin, I., 2017. Attention is all you need. In Advances in Neural Information Processing Systems (pp. 5998-6008).

SLIDE 5

WHAT IS FASTER TRANSFORMER

What is Transformer

The Transformer is the major component of BERT [1]
BERT was proposed in 2018 and became the state-of-the-art method at the time
However, the model is very large, making it hard to meet the latency requirements of real applications

[1] Devlin, J., Chang, M.W., Lee, K. and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

SLIDE 6

LONG STORY OF FASTER TRANSFORMER

[Timeline diagram, 2017/12 – 2019/09:]

Attention Is All You Need
BERT
Ant Financial (QA): use with batch size 1; plan to optimize attention only
Meituan (online classification)
Most customers were asking about BERT; attention only is not enough, so optimize the transformer layer entirely
Complete Faster Transformer 1.0, optimized on the BERT model
Plan to extend Faster Transformer to the decoder

SLIDE 7

FASTER TRANSFORMER 1.0 FEATURES

Optimize the encoder

An equivalent forward implementation of the BERT transformer layer
  Single layer, forward only
  Built on top of CUDA + cuBLAS
  Supports FP32/FP16 on NVIDIA Tesla P4/V100/T4
  Arbitrary batch size; sequence length 32/64/128
  Base model (12 heads * size 64) or smaller (4 heads * size 32)
  Provides C++/TensorRT plugin/TensorFlow OP APIs

SLIDE 8

FASTER TRANSFORMER 1.0 DETAIL

What we do in Faster Transformer 1.0

TensorFlow splits an operation into many basic operations
  E.g., layer norm is split into add, sub, mean, sqrt, ...
  Each basic op adds kernel launch overhead

Fuse the operations other than GEMM as much as possible (see the sketch below):
  add bias + layer norm
  add bias + activation
  transpose the 3 matrices (Q/K/V) together in attention
  ...
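A minimal sketch of what such a fusion can look like, assuming one thread block of 256 threads per token, FP32 data, and hidden_dim floats of dynamic shared memory; the kernel name and layout are illustrative, not the actual FasterTransformer code. The point is that the bias add, the mean/variance reductions, and the normalization all happen in one kernel launch instead of several separate TensorFlow ops:

// Illustrative fused "add bias + layer norm" CUDA kernel (assumptions as above).
__global__ void add_bias_layernorm(const float* __restrict__ input,
                                   const float* __restrict__ bias,
                                   const float* __restrict__ gamma,
                                   const float* __restrict__ beta,
                                   float* __restrict__ output,
                                   int hidden_dim)
{
    extern __shared__ float row[];          // one token's hidden_dim values
    __shared__ float partial[256];          // reduction scratch (blockDim.x <= 256)
    __shared__ float s_mean, s_inv_std;

    const float* in  = input  + blockIdx.x * hidden_dim;
    float*       out = output + blockIdx.x * hidden_dim;

    // 1) add bias and accumulate the sum for the mean
    float local_sum = 0.f;
    for (int i = threadIdx.x; i < hidden_dim; i += blockDim.x) {
        float v = in[i] + bias[i];
        row[i] = v;
        local_sum += v;
    }
    partial[threadIdx.x] = local_sum;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {   // block-wide reduction
        if (threadIdx.x < s) partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) s_mean = partial[0] / hidden_dim;
    __syncthreads();

    // 2) variance
    float local_var = 0.f;
    for (int i = threadIdx.x; i < hidden_dim; i += blockDim.x) {
        float d = row[i] - s_mean;
        local_var += d * d;
    }
    partial[threadIdx.x] = local_var;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) s_inv_std = rsqrtf(partial[0] / hidden_dim + 1e-6f);
    __syncthreads();

    // 3) normalize, scale, and shift -- still the same kernel launch
    for (int i = threadIdx.x; i < hidden_dim; i += blockDim.x)
        out[i] = (row[i] - s_mean) * s_inv_std * gamma[i] + beta[i];
}

// Launch: one block per token, e.g.
// add_bias_layernorm<<<batch_size * seq_len, 256, hidden_dim * sizeof(float)>>>(...);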

SLIDE 9

FASTER TRANSFORMER 1.0 DETAIL

How to use Faster Transformer?

Provide C++, TensorFlow, and TensorRT APIs
Provide sample code to demonstrate usage
In C++: [code sample shown on the slide]

SLIDE 10

FASTER TRANSFORMER 1.0 DETAIL

How to use Faster Transformer?

Provide C++, TensorFlow, and TensorRT APIs
Provide sample code to demonstrate usage
In TensorFlow: [code sample shown on the slide]

SLIDE 11

FASTER TRANSFORMER 1.0 SUMMARY

Faster Transformer 1.0 gives about a 1.5x speedup compared to TensorFlow with XLA in FP16
Faster Transformer 1.0 is released at https://github.com/NVIDIA/DeepLearningExamples/tree/master/FasterTransformer
Currently we only optimize the encoder; what about the decoder?

SLIDE 12

WHY WE NEED TO OPTIMIZE DECODER

Encoder vs. Decoder

Encoder: computes the entire sentence in one pass
  A few large matrix multiplications
  E.g., one pass for a length-128 sentence

Decoder: computes word by word, sequence-length times
  Many small matrix multiplications
  E.g., 128 passes for a length-128 sentence
  (GEMM shapes are sketched below)
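As a rough illustration, assuming the hidden sizes quoted on the later benchmark slide (encoder hidden size 768, decoder hidden size 512), a single encoder projection per layer is one large GEMM, while the decoder re-launches a tiny GEMM at every generation step:

$$\underbrace{[128 \times 768]\,[768 \times 768]}_{\text{encoder: 1 launch per layer}} \qquad \text{vs.} \qquad \underbrace{[1 \times 512]\,[512 \times 512]}_{\text{decoder: 128 launches per layer}}$$

The small per-step GEMMs cannot keep the GPU busy, so kernel launch overhead and low occupancy dominate the decoder's runtime.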

SLIDE 13-31

WHY WE NEED TO OPTIMIZE DECODER

Translating Progress (step-by-step animation)

[Diagram: the source sentence "I love you ." is embedded and passed through the Encoder once, producing the encoder output. The Decoder then runs word by word: starting from a NULL (start) token, each step embeds the previously generated token, runs the Decoder against the encoder output, and emits the next token, producing 我 (I), 爱 (love), 你 (you), 。 (period) one at a time.]

SLIDE 32

WHY WE NEED TO OPTIMIZE DECODER

Decoder consumes more time

In Faster Transformer 1.0 we implemented a highly optimized transformer layer for the encoder. However, in the whole translation process, most of the time is consumed by the decoder.

Encoder vs. Decoder: encoder < 10 ms vs. decoder > 100 ms in most cases
E.g., batch size 1, sequence length 32, on NVIDIA Tesla T4 with FP32:
  Encoder: 12 layers, hidden units 768: 2.74 ms
  Decoding: beam width 4, 6 layers, hidden units 512: 64.16 ms (over 95% of the total)

So we optimize the decoder in Faster Transformer 2.0

SLIDE 33

NEW FEATURES IN FASTER TRANSFORMER 2.0

SLIDE 34

NEW FEATURE IN FASTER TRANSFORMER 2.0

Summary

We propose two components: Decoder and Decoding
  Both are based on the OpenNMT-tf [1] model
  Decoder contains two attention layers and an FFN, providing a 1.4x ~ 2x speedup
  Decoding contains the whole translation process, providing a 1.5x ~ 9x speedup
  The smaller the batch size, the larger the speedup

[1] https://github.com/OpenNMT/OpenNMT-tf

[Diagram: the Decoder block = Self-Attention + Encoder-Decoder Attention + Feed Forward Network]

SLIDE 35-36

NEW FEATURE IN FASTER TRANSFORMER 2.0

Decoder and Decoding

[Diagram: the Encoder (Self-Attention + Feed Forward Network, N layers) feeds the Decoder (Self-Attention + Encoder-Decoder Attention + Feed Forward Network, N layers). Decoding wraps the full generation loop around the Decoder: lookup embedding table, run the N decoder layers, compute log probs, then beam search.]

SLIDE 37

NEW FEATURE IN FASTER TRANSFORMER 2.0

decoding(encoder_result, start_id) {
    id = start_id
    while (finished == false) {
        // embed the token produced at the previous step
        decoder_input = lookup_embedding_table(id)
        // run all decoder layers against the encoder output
        decoder_output = decoder(decoder_input, encoder_result, num_layer)
        // project to vocabulary log probabilities
        log_prob = dense(decoder_output)
        // pick the next token(s) with beam search
        id = beamsearch(log_prob, candidate_number)
    }
}

Decoder and Decoding

SLIDE 38

NEW FEATURE IN FASTER TRANSFORMER 2.0

Decoder and Decoding

Compared to the Decoder, Decoding is more efficient
If we translate a 32-word sentence:
  We need to call the Decoder 32 times, incurring 32 rounds of op launch overhead
  We only need to call Decoding once

Decoding also provides an optimized naïve beam search (a minimal sketch of one step follows)
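A minimal sketch of what one step of a naïve beam search does, under the usual log-probability formulation; this is an illustration of the idea, not FasterTransformer's kernel. Every live beam is expanded over the whole vocabulary, the running scores are added, and only the best beam_width candidates survive:

// One step of naive beam search (illustrative host-side C++, not the GPU code).
#include <algorithm>
#include <vector>

struct Candidate { float score; int parent_beam; int token; };

std::vector<Candidate> beam_search_step(const std::vector<float>& log_prob,   // [beam_width * vocab_size]
                                        const std::vector<float>& cum_score,  // running score per beam
                                        int beam_width, int vocab_size)
{
    std::vector<Candidate> all;
    all.reserve(static_cast<size_t>(beam_width) * vocab_size);
    for (int b = 0; b < beam_width; ++b)
        for (int v = 0; v < vocab_size; ++v)
            all.push_back({cum_score[b] + log_prob[b * vocab_size + v], b, v});

    // keep the beam_width highest-scoring (token, parent beam) pairs
    std::partial_sort(all.begin(), all.begin() + beam_width, all.end(),
                      [](const Candidate& a, const Candidate& c) { return a.score > c.score; });
    all.resize(beam_width);
    return all;
}

In Decoding this expansion and top-k selection run inside the single fused op on the GPU, rather than as a chain of separate TensorFlow ops launched once per step.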

SLIDE 39

NEW FEATURE IN FASTER TRANSFORMER 2.0

How to use Decoder and Decoding?

Similar to Faster Transformer 1.0:
  Provide C++ and TensorFlow APIs
  Provide sample code to demonstrate usage
Decoder in TensorFlow: [code sample shown on the slide]

SLIDE 40

NEW FEATURE IN FASTER TRANSFORMER 2.0

How to use Decoder and Decoding?

Similar to Faster Transformer 1.0:
  Provide C++ and TensorFlow APIs
  Provide sample code to demonstrate usage
Decoding in TensorFlow: [code sample shown on the slide]

SLIDE 41

FASTER TRANSFORMER 2.0 PERFORMANCE

SLIDE 42

FASTER TRANSFORMER 2.0 PERFORMANCE

Environment Setting

Docker: nvcr.io/nvidia/tensorflow:19.07-py2
  CUDA 10.1
  TensorFlow 1.14
  Python 2.7

CPU: Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz
GPU: NVIDIA Tesla T4 (mclk 5000 MHz, pclk 1590 MHz)
GPU: NVIDIA Tesla V100 (mclk 877 MHz, pclk 1380 MHz)

SLIDE 43

FASTER TRANSFORMER 2.0 PERFORMANCE

Decoder benchmark on NVIDIA Tesla T4

Since the batch size is 1, the bottleneck is not compute capability, so FP16 brings no benefit.

<batch size, seq len>   TensorFlow FP32 (ms)   Faster Decoder FP32 (ms)   FP32 Speedup   TensorFlow FP16 (ms)   Faster Decoder FP16 (ms)
(1, 32)                 441.68                 146.54                     3.01           508.81                 165.88
(1, 64)                 872.39                 309.96                     2.81           1038.71                326.69
(1, 128)                1714.01                660.30                     2.59           2082.92                661.00

SLIDE 44

FASTER TRANSFORMER 2.0 PERFORMANCE

FP16 speedup is computed against the faster TensorFlow variant (which is sometimes TensorFlow FP32); a worked example follows the table.

Decoder benchmark on NVIDIA Tesla T4

<batch size, seq len>   TensorFlow FP32 (ms)   Faster Decoder FP32 (ms)   FP32 Speedup   TensorFlow FP16 (ms)   Faster Decoder FP16 (ms)   FP16 Speedup
(32, 32)                470.93                 183.48                     2.56           568.83                 167.42                     2.81
(64, 32)                503.57                 232.70                     2.16           579.21                 183.74                     2.74
(128, 32)               614.59                 344.77                     1.78           641.98                 238.27                     2.58
(256, 32)               802.18                 573.25                     1.40           735.67                 348.74                     2.11
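For example, in the (32, 32) row TensorFlow FP32 (470.93 ms) is faster than TensorFlow FP16 (568.83 ms), so the FP16 speedup is taken against the FP32 time:

$$\text{FP16 speedup} = \frac{\min(t_{\mathrm{TF,FP32}},\ t_{\mathrm{TF,FP16}})}{t_{\mathrm{FasterDecoder,FP16}}} = \frac{\min(470.93,\ 568.83)}{167.42} \approx 2.81$$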

SLIDE 45

FASTER TRANSFORMER 2.0 PERFORMANCE

FP16 speedup is computed against the faster TensorFlow variant (sometimes TensorFlow FP32). Beam width is set to 4.

Decoding benchmark on NVIDIA Tesla T4

<batch size, beam width, seq len>   TensorFlow FP32 (ms)   Faster Decoding FP32 (ms)   FP32 Speedup   TensorFlow FP16 (ms)   Faster Decoding FP16 (ms)   FP16 Speedup
(1, 4, 32)                          430.39                 64.16                       6.70           537.95                 49.07                       8.77
(1, 4, 64)                          876.24                 135.42                      6.47           1056.78                97.45                       8.99
(1, 4, 128)                         1799.16                318.65                      5.64           2145.74                240.85                      7.47

SLIDE 46

FASTER TRANSFORMER 2.0 PERFORMANCE

FP16 speedup is computed against the faster TensorFlow variant (sometimes TensorFlow FP32). Beam width is set to 4.

Decoding benchmark on NVIDIA Tesla T4

<batch size, beam width, seq len>   TensorFlow FP32 (ms)   Faster Decoding FP32 (ms)   FP32 Speedup   TensorFlow FP16 (ms)   Faster Decoding FP16 (ms)   FP16 Speedup
(32, 4, 32)                         597.42                 217.61                      2.74           646.07                 128.39                      4.65
(64, 4, 32)                         789.22                 395.85                      1.99           769.17                 246.89                      3.11
(128, 4, 32)                        1223.72                726.43                      1.68           996.03                 424.53                      2.34
(256, 4, 32)                        2188.00                1385.60                     1.58           1599.58                781.38                      2.04

SLIDE 47

FASTER TRANSFORMER 2.0 PERFORMANCE

FP16 speedup is computed against the faster TensorFlow variant (sometimes TensorFlow FP32). Beam width is set to 4.

Decoding benchmark on NVIDIA Tesla V100

<batch size, beam width, seq len>   TensorFlow FP32 (ms)   Faster Decoding FP32 (ms)   FP32 Speedup   TensorFlow FP16 (ms)   Faster Decoding FP16 (ms)   FP16 Speedup
(1, 4, 32)                          440.46                 58.70                       7.50           531.70                 46.18                       9.53
(1, 4, 64)                          888.19                 122.50                      7.25           1065.76                93.84                       9.46
(1, 4, 128)                         1821.76                293.21                      6.21           2076.63                293.21                      6.21

SLIDE 48

FASTER TRANSFORMER 2.0 PERFORMANCE

FP16 speedup is computed against the faster TensorFlow variant (sometimes TensorFlow FP32). Beam width is set to 4.

Decoding benchmark on NVIDIA Tesla V100

<batch size, beam width, seq len>   TensorFlow FP32 (ms)   Faster Decoding FP32 (ms)   FP32 Speedup   TensorFlow FP16 (ms)   Faster Decoding FP16 (ms)   FP16 Speedup
(32, 4, 32)                         543.27                 101.35                      5.36           630.55                 73.37                       7.40
(64, 4, 32)                         648.27                 157.54                      4.11           793.83                 106.77                      6.07
(128, 4, 32)                        838.43                 277.77                      3.02           867.71                 169.04                      4.96
(256, 4, 32)                        1221.30                493.85                      2.47           1101.36                290.44                      3.79

SLIDE 49

FASTER TRANSFORMER 2.0 PERFORMANCE

Summary

Decoder on NVIDIA Tesla T4
  2.5x speedup for batch size 1 (online translation scenario)
  2x speedup for large batch sizes in FP16

Decoding on NVIDIA Tesla T4
  7x speedup for batch size 1 and beam width 4 (online translation scenario)
  2x speedup for large batch sizes in FP16

Decoding on NVIDIA Tesla V100
  6x speedup for batch size 1 and beam width 4 (online translation scenario)
  3x speedup for large batch sizes in FP16

SLIDE 50

OTHER WORK

SLIDE 51

NETWORK PRUNING

To speed up the transformer further in the large-batch case, we try to accelerate inference with network pruning
We choose [1] as the pruning algorithm
It prunes one column or row of a weight matrix at a time (a rough form of the criterion is sketched below)

[1] Molchanov, P., Mallya, A., Tyree, S., Frosio, I. and Kautz, J., 2019. Importance Estimation for Neural Network Pruning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 11264-11272).
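A rough sketch of the idea in [1], stated loosely rather than in the paper's exact notation: the importance of a structural group, here a row or column of a weight matrix W, is estimated from a first-order Taylor expansion of the loss, accumulating weight-times-gradient terms over the group, for example

$$\mathcal{I}(\text{row } i) \;\approx\; \sum_{j} \left( \frac{\partial \mathcal{L}}{\partial W_{ij}}\, W_{ij} \right)^{2}$$

The rows/columns with the smallest importance are removed, and the network is then fine-tuned (the 2 to 4 epochs shown on the later slide).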

SLIDE 52

NETWORK PRUNING

Input in ℝ^(N×L), Weight in ℝ^(L×O), Output in ℝ^(N×O)
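Writing the same shapes as a matrix product (bias omitted for brevity):

$$Y = XW, \qquad X \in \mathbb{R}^{N \times L},\quad W \in \mathbb{R}^{L \times O},\quad Y \in \mathbb{R}^{N \times O}$$

Pruning a column of W removes one output feature (O shrinks) and pruning a row removes one input feature (L shrinks), so the GEMM itself becomes smaller; this is why whole-row/column pruning can turn into real wall-clock speedup on dense GEMM hardware.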

SLIDE 53

NETWORK PRUNING

We successfully prune 50% of the rows/columns of the weights in the BERT model
We expect about a 2x speedup with roughly a 2.8% accuracy loss

Model               Sparsity   Acc (%)   Reduced acc (%)   Total fine-tuning time
Baseline            0%         84.06     0.00
Multiple stages 1   30%        83.23     -0.83             3 epochs
                    40%        82.22     -1.84
                    50%        79.80     -4.26
Multiple stages 2   30%        83.37     -0.69             2 epochs
                    40%        82.52     -1.54             3 epochs
                    50%        81.27     -2.79             4 epochs

SLIDE 54

Q&A TIME

SLIDE 55