Moving to Neural Machine Translation at Google


  1. Moving to Neural Machine Translation at Google. Mike Schuster, Google Brain Team, 12/18/2017.

  2. Growing Use of Deep Learning at Google. Across many products/areas (measured by the number of directories containing model description files): Android Apps, GMail, Image Understanding, Maps, NLP, Photos, Speech, Translation, YouTube, many research uses, and many others.

  3. Why we care about translations
  ● 50% of Internet content is in English.
  ● Only 20% of the world's population speaks English.
  ● To make the world's information accessible, we need translations.

  4. Google Translate, a truly global product: 1B+ translations every single day (that is 140 billion words), 1B+ monthly active users, and 103 languages covering 99% of the online population.

  5. Agenda
  ● Quick History
  ● From Sequence to Sequence-to-Sequence Models
  ● BNMT (Brain Neural Machine Translation)
  ○ Architecture & Training
  ○ Segmentation Model
  ○ TPU and Quantization
  ● Multilingual Models
  ● What's next?


  8. Quick Research History
  ● Various people at Google tried to improve translation with neural networks
  ○ Brain team, Translate team
  ● Sequence-to-Sequence models (NIPS 2014)
  ○ Based on many earlier approaches to estimate P(Y|X) directly
  ○ State-of-the-art on WMT En->Fr using custom software, very long training
  ○ Translation could be learned without explicit alignment!
  ○ Drawback: all information needs to be carried in the internal state
  ■ Translation breaks down for long sentences!
  ● Attention Models (2014)
  ○ Remove the drawback by giving access to all encoder states
  ■ Translation quality is now independent of sentence length!

  9. Old vs. New
  ● Old: Phrase-based translation
  ○ Lots of individual pieces (preprocessing plus many components)
  ○ Optimized somewhat independently
  ● New: Neural machine translation
  ○ End-to-end learning
  ○ Simpler architecture (a single neural network)
  ○ Plus results are much better!

  10. Expected time to launch: 3 years. Actual time to launch: 13.5 months.
  ● Sept 2015: Began project using TensorFlow
  ● Feb 2016: First zh->en production data results
  ● Sept 2016: zh->en launched
  ● Nov 2016: 8 languages launched (16 pairs to/from English)
  ● Mar 2017: 7 more launched (Hindi, Russian, Vietnamese, Thai, Polish, Arabic, Hebrew)
  ● Apr 2017: 26 more launched
  ● Jun/Aug 2017: 36/20 more launched (16 European, 8 Indic, Indonesian, Afrikaans)
  ● 97 launched!


  13. Original: Kilimanjaro is a snow-covered mountain 19,710 feet high, and is said to be the highest mountain in Africa. Its western summit is called the Masai “Ngaje Ngai,” the House of God. Close to the western summit there is the dried and frozen carcass of a leopard. No one has explained what the leopard was seeking at that altitude.
  Back translation from Japanese (old): Kilimanjaro is 19,710 feet of the mountain covered with snow, and it is said that the highest mountain in Africa. Top of the west, “Ngaje Ngai” in the Maasai language, has been referred to as the house of God. The top close to the west, there is a dry, frozen carcass of a leopard. Whether the leopard had what the demand at that altitude, there is no that nobody explained.
  Back translation from Japanese (new): Kilimanjaro is a mountain of 19,710 feet covered with snow, which is said to be the highest mountain in Africa. The summit of the west is called “Ngaje Ngai” God‘s house in Masai language. There is a dried and frozen carcass of a leopard near the summit of the west. No one can explain what the leopard was seeking at that altitude.

  14. Translation Quality
  ● Human ratings from 0 (worst translation) to 6 (perfect translation)
  ● Asian languages improved the most
  ● Some improvements as big as the last 10 years of improvements combined
  [Chart: change in translation quality for Chinese to English, Zh/Ja/Ko/Tr to English, and almost all language pairs; improvements range from >0.1 up to 0.6-1.5, with >0.5 considered a significant, launchable change.]

  15. Relative Error Reduction [chart]

  16. Does quality matter? +75% increase in daily English-Korean translations on Android over the past six months.

  17. Neural Recurrent Sequence Models
  ● Predict next token: P(Y) = P(Y1) * P(Y2|Y1) * P(Y3|Y1,Y2) * ...
  ○ Language Models, state-of-the-art on public benchmarks
  ■ Exploring the Limits of Language Modeling
  [Diagram: an RNN is fed EOS, Y1, Y2, Y3 as inputs and predicts Y1, Y2, Y3, EOS as outputs]
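A minimal sketch of this factorization, assuming a toy vocabulary and randomly initialized weights (the cell, sizes, and token IDs are illustrative, not the production model):

```python
import numpy as np

# Toy autoregressive RNN language model: P(Y) = prod_t P(Y_t | Y_<t).
VOCAB, HIDDEN = 10, 16                        # illustrative sizes
rng = np.random.default_rng(0)
W_xh = rng.normal(0, 0.1, (VOCAB, HIDDEN))    # input embedding weights
W_hh = rng.normal(0, 0.1, (HIDDEN, HIDDEN))   # recurrent weights
W_hy = rng.normal(0, 0.1, (HIDDEN, VOCAB))    # output projection

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sequence_log_prob(tokens, eos=0):
    """Score a sequence by feeding EOS, Y1, Y2, ... and predicting Y1, Y2, ..., EOS."""
    h = np.zeros(HIDDEN)
    log_p = 0.0
    inputs = [eos] + tokens            # shifted inputs
    targets = tokens + [eos]           # next-token targets
    for x, y in zip(inputs, targets):
        h = np.tanh(W_xh[x] + h @ W_hh)   # recurrent state update
        p = softmax(h @ W_hy)             # distribution over the next token
        log_p += np.log(p[y])             # accumulate log P(Y_t | Y_<t)
    return log_p

print(sequence_log_prob([3, 7, 2]))  # log P(Y1=3, Y2=7, Y3=2, EOS)
```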

  18. Applications
  ● Speech Recognition
  ○ Estimate state posterior probabilities per 10ms frame
  ● Video Recommendations
  ○ With hierarchical softmax and a MaxEnt model for the top 500k YouTube videos

  19. Image Captioning
  ● Combine image classification and sequence model
  ○ Feed output from the image classifier and let it predict text
  ○ Show and Tell: A Neural Image Caption Generator
  ● Example caption: “A close up of a child holding a stuffed animal”
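A minimal sketch of the "feed the classifier output into a sequence model" idea, in the same toy numpy style (the feature size, projection, and greedy decoding are illustrative; Show and Tell uses a trained CNN and LSTM):

```python
import numpy as np

FEAT, HIDDEN, VOCAB = 32, 16, 10              # illustrative sizes
rng = np.random.default_rng(3)
W_img = rng.normal(0, 0.1, (FEAT, HIDDEN))    # projects image features to the initial state
W_xh  = rng.normal(0, 0.1, (VOCAB, HIDDEN))
W_hh  = rng.normal(0, 0.1, (HIDDEN, HIDDEN))
W_hy  = rng.normal(0, 0.1, (HIDDEN, VOCAB))

def caption(image_features, eos=0, max_len=8):
    """Condition a toy RNN language model on image features and decode greedily."""
    h = np.tanh(image_features @ W_img)       # classifier output seeds the decoder state
    y, words = eos, []
    for _ in range(max_len):
        h = np.tanh(W_xh[y] + h @ W_hh)
        y = int(np.argmax(h @ W_hy))          # pick the most likely next word
        if y == eos:
            break
        words.append(y)
    return words

print(caption(rng.normal(size=FEAT)))         # untrained weights, so arbitrary token IDs
```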

  20. Sequence to Sequence
  ● Learn to map: X1, X2, EOS -> Y1, Y2, Y3, EOS
  ● Encoder/Decoder framework (the decoder by itself is just a neural LM)
  ● Theoretically, any sequence length for input/output works
  [Diagram: the encoder reads X1, X2, EOS; the decoder is fed EOS, Y1, Y2, Y3 and predicts Y1, Y2, Y3, EOS]
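A minimal encoder/decoder sketch in the same toy setting, assuming separate source and target vocabularies and random weights (all sizes and names are illustrative):

```python
import numpy as np

SRC_VOCAB, TGT_VOCAB, HIDDEN = 12, 10, 16     # illustrative sizes
rng = np.random.default_rng(1)
Enc_xh = rng.normal(0, 0.1, (SRC_VOCAB, HIDDEN))
Enc_hh = rng.normal(0, 0.1, (HIDDEN, HIDDEN))
Dec_xh = rng.normal(0, 0.1, (TGT_VOCAB, HIDDEN))
Dec_hh = rng.normal(0, 0.1, (HIDDEN, HIDDEN))
Dec_hy = rng.normal(0, 0.1, (HIDDEN, TGT_VOCAB))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def encode(src_tokens):
    """Compress the whole source sentence into one final state."""
    h = np.zeros(HIDDEN)
    for x in src_tokens:
        h = np.tanh(Enc_xh[x] + h @ Enc_hh)
    return h   # this single vector is the information bottleneck

def decode_greedy(h, eos=0, max_len=10):
    """The decoder is just a neural LM conditioned on the encoder's final state."""
    y, out = eos, []
    for _ in range(max_len):
        h = np.tanh(Dec_xh[y] + h @ Dec_hh)
        y = int(np.argmax(softmax(h @ Dec_hy)))
        if y == eos:
            break
        out.append(y)
    return out

print(decode_greedy(encode([5, 3, 0])))   # untrained weights, so arbitrary output
```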

  21. Sequence to Sequence in 1999...
  ● NN for estimating P(Y|X) directly, for equal-length X and Y
  ● Encoder (BRNN)/Decoder framework, but in a single NN
  ● NIPS 1999: Better Generative Models for Sequential Data Problems: Bidirectional Recurrent Mixture Density Networks
  [Diagram: inputs X1, X2, X3, EOS; outputs Y1, Y2, Y3, EOS]

  22. Deep Sequence to Sequence
  [Diagram: stacked encoder LSTMs read the source sequence (..., X3, X2, </s>); stacked decoder LSTMs followed by a softmax are fed <s>, Y1, ... and produce Y1, Y2, ..., </s>]

  23. Attention Mechanism
  ● Addresses the information bottleneck problem
  ○ All encoder states accessible instead of only the final one
  [Diagram: for each decoder step i, a score e_ij is computed from the decoder state T_i and each encoder state S_j; the softmax-normalized scores weight the encoder states into a context vector]
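A minimal sketch of additive attention over the encoder states, in the same toy numpy setting (the scoring function and sizes follow the standard additive form and are not necessarily the exact production variant):

```python
import numpy as np

HIDDEN = 16
rng = np.random.default_rng(2)
W_t = rng.normal(0, 0.1, (HIDDEN, HIDDEN))   # projects the decoder state T_i
W_s = rng.normal(0, 0.1, (HIDDEN, HIDDEN))   # projects the encoder states S_j
v   = rng.normal(0, 0.1, HIDDEN)             # scoring vector

def attention(T_i, S):
    """Return attention weights over all encoder states and the context vector.

    S is a (source_len, HIDDEN) matrix of encoder states, so the decoder can
    look at every source position instead of only the final encoder state.
    """
    scores = np.array([v @ np.tanh(W_t @ T_i + W_s @ S_j) for S_j in S])  # e_ij
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                      # softmax over source positions
    context = alpha @ S                       # weighted sum of encoder states
    return alpha, context

S = rng.normal(size=(5, HIDDEN))              # 5 encoder states (toy values)
T_i = rng.normal(size=HIDDEN)                 # current decoder state
alpha, context = attention(T_i, S)
print(alpha.round(3), context.shape)
```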

  24. BNMT Model Architecture [diagram]


  27. Model Training
  ● Runs on ~100 GPUs (12 replicas, 8 GPUs each)
  ○ Because the softmax size is only 32k, it can be fully calculated (no sampling or HSM)
  ● Optimization
  ○ Combination of Adam & SGD with delayed exponential decay
  ○ 128/256 sentence pairs combined into one batch (run in one 'step')
  ● Training time
  ○ ~1 week for 2.5M steps = ~300M sentence pairs
  ○ For example, on English->French we use only 15% of the available data!
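A sketch of what an "Adam first, then SGD with delayed exponential decay" schedule can look like; the switch-over step, learning rates, and decay constants below are illustrative placeholders, not the values used for the production models:

```python
# Return which optimizer and learning rate to use at a given global step.
# All numeric constants are assumptions for illustration only.
def learning_rate(step,
                  adam_steps=60_000,      # assumed: run Adam for an initial phase
                  adam_lr=2e-4,           # assumed Adam learning rate
                  sgd_lr=0.5,             # assumed constant SGD learning rate
                  decay_start=1_000_000,  # assumed: start decaying late in training
                  decay_every=200_000):   # assumed: halve the rate periodically
    """Piecewise schedule: Adam warm phase, constant SGD, then delayed exponential decay."""
    if step < adam_steps:
        return "adam", adam_lr
    if step < decay_start:
        return "sgd", sgd_lr
    halvings = (step - decay_start) // decay_every + 1
    return "sgd", sgd_lr * (0.5 ** halvings)

for s in (0, 100_000, 1_100_000, 2_400_000):
    print(s, learning_rate(s))
```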

  28. Wordpiece Model (WPM)
  ● Dictionary too big (~100M unique words!)
  ○ Cut words into smaller units
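A minimal sketch of how a fixed wordpiece vocabulary can cut a rare word into smaller units by greedy longest-match; the tiny vocabulary and the "##" continuation marker are illustrative only, since the real segmentation model and vocabulary are learned from data:

```python
# Greedy longest-match segmentation with a fixed wordpiece vocabulary.
VOCAB = {"un", "##break", "##able", "talk", "##s"}   # illustrative vocabulary

def wordpiece(word, vocab=VOCAB, unk="<unk>"):
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece       # mark word-internal pieces
            if piece in vocab:
                pieces.append(piece)
                break
            end -= 1
        else:                              # no piece matched at this position
            return [unk]
        start = end
    return pieces

print(wordpiece("unbreakable"))   # ['un', '##break', '##able']
print(wordpiece("talks"))         # ['talk', '##s']
```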
