Neural Machine Translation in Sogou, Inc.


  1. Neural Machine Translation In Sogou, Inc. Feifei Zhai and Dongxu Yang

  2. Sogou Company: Strong R&D Capabilities
  • 2,100 employees, of which 76% are technology staff, the highest in China's Internet industry
  • No. 2 Chinese Internet company in terms of user base
  • 38% of employees hold graduate or doctoral degrees
  • PC MAU 520MM and mobile MAU 560MM, covering 96% of the Internet users in China
  • Robust revenue growth: revenue CAGR of 126% from 2011 to 2015; in 2015 revenue reached $592 million, with profit of $110 million

  3. Rich Product Line
  • Sogou Search, including Web Search and 24 vertical search products
  • UGC platform: Sogou Wenwen, Sogou Encyclopedia, Sogou Guidance
  • Sogou exclusive: WeChat search, Zhihu search, English search

  4. Outline: 1. Neural Machine Translation  2. Related application scenarios

  5. Machine Translation
  • Automatically translate a sentence in the source language into the target language
    Example: 布什 与 沙龙 举行 了 会谈 → Bush held talks with Sharon
  • Methods: rule-based machine translation (RBMT), example-based machine translation (EBMT), statistical machine translation (SMT), ...

  6. Neural Machine Translation – A New Era
  • Model the direct mapping between source and target language with a neural network
    Example: 布什 与 沙龙 举行 了 会谈 → Bush held talks with Sharon
  • Really amazing translation quality
    [Chart: Edinburgh's WMT results over the years, 2013-2016, comparing phrase-based SMT, syntax-based SMT, and neural MT; neural MT reaches a BLEU of 24.7 in 2016. From Sennrich 2016, http://www.meta-net.eu/events/meta-forum-2016/slides/09_sennrich.pdf]

  7. Neural Machine Translation – A New Era: Encoder-Decoder Framework
  • Encoder: represent the source sentence as a vector with a neural network
  • Decoder: generate target words one by one based on the vector from the encoder
    Example: 布什 与 沙龙 举行 了 会谈 </s> → Bush held talks with Sharon </s>
  • What do we actually have in the encoded vector? (Sutskever et al., 2014)
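To make the framework concrete, below is a minimal, hypothetical sketch in numpy: toy dimensions, plain tanh RNN cells instead of the GRU/LSTM units used in real systems, and random weights rather than trained ones. It only illustrates the two roles, encoding the whole source sentence into one vector and generating target words one at a time from it.

```python
import numpy as np

# Toy encoder-decoder sketch (hypothetical dimensions, untrained weights).
np.random.seed(0)
V_SRC, V_TGT, E, H = 20, 20, 8, 16           # vocab sizes, embedding dim, hidden dim
Es = np.random.randn(V_SRC, E) * 0.1         # source embeddings
Et = np.random.randn(V_TGT, E) * 0.1         # target embeddings
W_enc = np.random.randn(E + H, H) * 0.1      # encoder recurrence
W_dec = np.random.randn(E + H, H) * 0.1      # decoder recurrence
W_out = np.random.randn(H, V_TGT) * 0.1      # output projection

def encode(src_ids):
    """Compress the whole source sentence into one fixed-size vector."""
    h = np.zeros(H)
    for w in src_ids:
        h = np.tanh(np.concatenate([Es[w], h]) @ W_enc)
    return h                                  # "the encoded vector"

def decode(h, bos=0, eos=1, max_len=10):
    """Generate target words one by one, conditioned only on the encoded vector."""
    out, w = [], bos
    for _ in range(max_len):
        h = np.tanh(np.concatenate([Et[w], h]) @ W_dec)
        w = int(np.argmax(h @ W_out))         # greedy choice of the next word
        if w == eos:
            break
        out.append(w)
    return out

print(decode(encode([2, 5, 7, 3])))           # arbitrary output: weights are random
```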

  8. Neural Machine Translation – A New Era: Attention Mechanism
  • For each target word to be generated, dynamically calculate the source-language information related to it
    Example: a weighted average over the source words 布什 与 沙龙 举行 了 会谈 </s> guides the generation of "Bush held talks ..."
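A minimal numpy sketch of attention as a weighted average over encoder states. The dot-product scoring used here is an assumption made for brevity; the systems discussed above typically use a small learned scoring network.

```python
import numpy as np

def attend(enc_states, dec_state):
    """enc_states: (src_len, H) encoder states; dec_state: (H,) current decoder state."""
    scores = enc_states @ dec_state             # one relevance score per source word
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                    # softmax over source positions
    context = weights @ enc_states              # weighted average of encoder states
    return context, weights

H = 16
enc_states = np.random.randn(7, H)              # e.g. 布什 与 沙龙 举行 了 会谈 </s>
dec_state = np.random.randn(H)
context, weights = attend(enc_states, dec_state)
print(weights.round(2), context.shape)          # weights sum to 1, context is (H,)
```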

  9. Sogou Neural Machine Translation Engine
  • A purely neural commercial machine translation engine
  • Techniques: stacked encoders and decoders, residual networks, dual learning, zero-shot learning, length normalization, domain adaptation, ...
    [Diagram: stacked encoder over 布什 与 沙龙 举行 了 会谈, attention mechanism over the encoder hidden states, and a softmax layer in the decoder producing "Bush held talks with Sharon"]
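Of the techniques listed, length normalization is easy to show in isolation. The sketch below uses the GNMT-style length penalty as a stand-in; whether Sogou uses exactly this formula is an assumption, and the point is only that dividing by a length-dependent penalty keeps beam search from always preferring short hypotheses.

```python
# Hedged sketch: GNMT-style length normalization for beam-search scoring.
def normalized_score(log_prob_sum, length, alpha=0.6):
    penalty = ((5.0 + length) / 6.0) ** alpha
    return log_prob_sum / penalty

short = (-6.0, 4)   # (total log-prob, length): wins on the raw sum (-6 > -7)
long_ = (-7.0, 9)   # better per-word log-prob, so it wins after normalization
print(normalized_score(*short))   # ~ -4.70
print(normalized_score(*long_))   # ~ -4.21
```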

  10. Sogou Neural Machine Translation Engine
  • Keep optimizing our translation engine on the translation model, bilingual data mining, distributed training, and decoding
  • Focus on Chinese-English and English-Chinese translation for now
  • Good performance on Chinese-English and English-Chinese translation
    [Charts: human evaluation scores for Chinese-English and English-Chinese translation, comparing the Sogou engine's initial and current performance]

  11. Challenges in Real Application
  • Training is too slow (Sutskever et al., 2014; Wu et al., 2016)
  • Decoding is slow: less than 200ms per translation request on average is needed to meet the real-time standard
  • Take a one-layer GRU NMT system as an example (vocabulary size 80,000; word embedding 620; hidden state 1,000):
    - Encoder (bidirectional): ~16M MACs per word (forward only): 2*3*2000*1000 + 2*3*620*1000
    - Decoder: ~70M MACs per word (forward only); for training: 3*3620*1000 + 3*2000*1000 + 80000*620, while for beam-search inference the decoder computation is BeamSize times larger!
  • We need fast training and decoding
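As a sanity check on those per-word cost figures, the short calculation below reproduces the ~16M and ~70M MAC counts from the stated dimensions. The grouping of terms follows the slide's own accounting; the beam size of 10 at the end is a hypothetical value for illustration.

```python
# Reproduce the per-word MAC estimates from the slide's dimensions.
vocab, emb, hidden = 80_000, 620, 1_000
ctx = 2 * hidden                    # bidirectional encoder context = 2000

# Encoder: bidirectional GRU, 3 gate matrices per direction, forward pass only.
enc_macs = 2 * 3 * ctx * hidden + 2 * 3 * emb * hidden
# Decoder: GRU over a 3620-dim input (per the slide) plus the output projection.
dec_macs = 3 * 3620 * hidden + 3 * ctx * hidden + vocab * emb

print(f"encoder: {enc_macs / 1e6:.1f}M MACs/word")    # ~15.7M, rounded to ~16M on the slide
print(f"decoder: {dec_macs / 1e6:.1f}M MACs/word")    # ~66.5M, rounded to ~70M on the slide
beam = 10                                             # hypothetical beam size
print(f"beam-search decoder: ~{dec_macs * beam / 1e6:.0f}M MACs/word")
```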

  12. Distributed Training
  • Parameter server
    - Keeps the current model parameters
    - Receives gradients from workers and updates the parameters accordingly
  • Workers
    - Use GPUs for model training
    - Communicate with the parameter server to update parameters

  13. Distributed Training
  • Asynchronous
    - Each worker sends its locally updated parameters to the parameter server
    - The parameter server averages the worker's parameters with its own version
    - The updated parameters are returned to the worker
  • Synchronous
    - Each worker sends its gradients to the parameter server
    - The parameter server updates the parameters after it has received the gradients from all workers
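A single-process sketch of the two update policies just described, with numpy arrays standing in for the model, gradients, and network transport. The averaging rule in the asynchronous branch follows the slide's wording; real parameter servers add staleness handling, sharding, and locking.

```python
import numpy as np

# Toy parameter server illustrating the two policies from the slides.
class ParameterServer:
    def __init__(self, dim):
        self.params = np.zeros(dim)

    # Asynchronous: a worker pushes its locally updated parameters; the server
    # averages them with its own copy and returns the result to that worker.
    def async_update(self, worker_params):
        self.params = 0.5 * (self.params + worker_params)
        return self.params.copy()

    # Synchronous: the server waits for gradients from all workers,
    # then applies one combined update.
    def sync_update(self, worker_grads, lr=0.1):
        self.params -= lr * np.mean(worker_grads, axis=0)
        return self.params.copy()

ps = ParameterServer(dim=4)
print(ps.async_update(np.ones(4)))                               # one asynchronous push
print(ps.sync_update([np.full(4, g) for g in (0.1, 0.2, 0.3)]))  # one synchronous step
```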

  14. Distributed Training
  • Acceleration ratio
    - Asynchronous: around 3x acceleration with 10 GPU cards
    - Synchronous: acceleration ratio vs. number of GPUs (effective batch size = batch size * number of GPUs)

      Number of GPUs | Acceleration ratio | Acceleration efficiency
      1              | 1.000              | 1.000
      4              | 3.904              | 0.976
      8              | 7.408              | 0.926
      16             | 13.232             | 0.827
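The acceleration efficiency column is simply the measured speedup divided by the number of GPUs, as the short check below confirms for the synchronous numbers in the table.

```python
# Scaling efficiency = measured speedup / number of GPUs (synchronous training).
measured = {1: 1.0, 4: 3.904, 8: 7.408, 16: 13.232}
for n_gpu, speedup in measured.items():
    print(f"{n_gpu:>2} GPUs: ratio {speedup:6.3f}, efficiency {speedup / n_gpu:.3f}")
```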

  15. Training Acceleration
  • Acceleration on a single card
    - Corpus shuffle
      * Global random shuffle
      * Local sort: sort by sentence length within each group of 20 mini-batches, so that sentence lengths are similar inside each mini-batch
    - Optimizer selection
      * Adadelta, Momentum
      * Adam: about 2 times faster than the above
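A minimal sketch of the shuffle-then-locally-sort batching scheme. The group size of 20 mini-batches comes from the slide; the batch size, the toy corpus, and the final shuffling of batch order are assumptions for illustration.

```python
import random

def make_batches(sentences, batch_size=80, group=20, seed=0):
    """Global random shuffle, then sort by length within each window of
    `group` mini-batches so that every mini-batch holds sentences of
    similar length (less padding, faster training)."""
    sents = sentences[:]
    random.Random(seed).shuffle(sents)                # global random shuffle
    span = batch_size * group                         # one local-sort window
    batches = []
    for i in range(0, len(sents), span):
        window = sorted(sents[i:i + span], key=len)   # local sort by length
        batches += [window[j:j + batch_size] for j in range(0, len(window), batch_size)]
    random.Random(seed + 1).shuffle(batches)          # optional: randomize batch order
    return batches

corpus = [["w"] * random.randint(3, 50) for _ in range(10_000)]
batches = make_batches(corpus)
print(len(batches), sorted(len(s) for s in batches[0])[:5])
```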

  16. Training Acceleration
  • Acceleration on a single card
    - Use a better GPU or a newer CUDA version if possible ☺
    [Chart: batch time (s) and speedup across GPU/CUDA configurations; speedups of roughly 1x, 1.33x, 1.59x, 1.97x, and 2.26x]

  17. Decoding Acceleration
  • Compute acceleration
    - Fusion of computations
      * Fuse element-wise operations together
      * Fuse matrix multiplications into larger ones; parameter matrices can also be fused ahead of time
      * Fuse the input embedding projections together for all steps, instead of projecting at each step
    - CUDA function selection
      * For batchsize=1, use a level-2 cuBLAS function (matrix-vector) instead of level-3 (matrix-matrix)
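The fusion ideas can be illustrated without CUDA: concatenating parameter matrices ahead of time turns several small matrix products into one larger one, and the input-embedding projections for a whole sentence can be computed in a single batched multiplication. The numpy sketch below only demonstrates the arithmetic equivalence; the actual speedup comes from launching fewer, larger GPU kernels.

```python
import numpy as np

E, H, T = 620, 1000, 30
x = np.random.randn(E)                                        # one decoder input
W_r, W_z, W_h = (np.random.randn(E, H) for _ in range(3))     # GRU gate weights

# Unfused: three separate matrix-vector products per step.
r, z, h = x @ W_r, x @ W_z, x @ W_h

# Fused: one wider product using a parameter matrix concatenated ahead of time.
W_all = np.concatenate([W_r, W_z, W_h], axis=1)               # done once, offline
r2, z2, h2 = np.split(x @ W_all, 3)
assert np.allclose(r, r2) and np.allclose(z, z2) and np.allclose(h, h2)

# Project the input embeddings for all T steps in one large multiplication,
# instead of one matrix-vector product at every decoding step.
X = np.random.randn(T, E)                                     # embeddings for T steps
proj_all = X @ W_all                                          # shape (T, 3*H)
print(proj_all.shape)
```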

  18. Decoding Acceleration
  • Batch processing
    - About 3x faster than translating sentences one by one; use batch mode if possible
  • Sentence reordering (sentence lengths may vary greatly)
    - Encoder: reorder sentences by length and scale the batch size at each step
    - Decoder: rearrange beams at each step and scale the batch size according to the remaining beams
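A sketch of the encoder-side reordering: sort sentences by length, then at each time step compute only the sentences that are still active, so the effective batch size shrinks as the shorter sentences finish. The decoder-side beam rearrangement follows the same idea with beams instead of sentences. The example lengths are made up.

```python
import numpy as np

def batched_encode_schedule(lengths):
    """Return the length-sorted processing order and the per-step batch size."""
    order = np.argsort(lengths)[::-1]             # longest sentences first
    sorted_lens = np.asarray(lengths)[order]
    schedule = []
    for t in range(int(sorted_lens[0])):
        active = int(np.sum(sorted_lens > t))     # sentences still needing step t
        schedule.append(active)                   # batch size actually computed at step t
    return order, schedule

lengths = [5, 12, 7, 3, 12, 9]
order, schedule = batched_encode_schedule(lengths)
print(order.tolist())   # processing order (indices into the original batch)
print(schedule)         # [6, 6, 6, 5, 5, 4, 4, 3, 3, 2, 2, 2]
```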

  19. Decoding Acceleration
  • Other acceleration methods
    - Use a better GPU or a newer CUDA version if possible ☺
    [Chart: batch time (s) and speedup across GPU/CUDA configurations; speedups of roughly 1x, 1.35x, 1.81x, 2.21x, and 2.67x]

  20. Comparison with Training
  • P40 vs. P100

      GPU              | P40      | P100
      TFLOPS           | 12       | 9.3
      Memory bandwidth | 346 GB/s | 732 GB/s

  • Batch size
    - Training: 80 or more, so computation dominates
    - Inference: 10 or less, so memory bandwidth also plays an important role
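A rough back-of-the-envelope view of why batch size decides which resource dominates: every weight byte loaded from memory is reused roughly once per sentence in the batch, so arithmetic intensity grows with batch size. The FP32 assumption and the exact balance-point numbers below are illustrative estimates, not measurements.

```python
# Roofline-style estimate: FLOPs available per byte of memory bandwidth,
# versus FLOPs actually performed per byte of weights at a given batch size.
gpus = {"P40": (12e12, 346e9), "P100": (9.3e12, 732e9)}   # (peak FLOPS, bytes/s)

for name, (flops, bw) in gpus.items():
    print(f"{name}: needs ~{flops / bw:.0f} FLOPs per byte to stay compute-bound")

for batch in (80, 10, 1):
    # ~2 FLOPs (multiply + add) per weight, reused `batch` times; 4 bytes per FP32 weight.
    intensity = 2 * batch / 4
    print(f"batch {batch:>2}: ~{intensity:.1f} FLOPs per weight byte")
```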

  21. Outline: 1. Neural Machine Translation  2. Related application scenarios

  22. Sogou Translate Related Products
  • Translation box in search results
  • Translation vertical channel
  • Translation with OCR

  23. Sogou Translate Related Products
  • Overseas search: a Chinese query is machine-translated into an English query, the English search results are machine-translated back into Chinese abstracts, and the English webpages are machine-translated into Chinese webpages

