
The Best of Both Worlds: Combining Recent Advances in Neural Machine Translation
Mia Xu Chen*, Orhan Firat*, Ankur Bapna*, Melvin Johnson, Wolfgang Macherey, George Foster, Llion Jones, Mike Schuster, Noam Shazeer, Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Zhifeng Chen, Yonghui Wu, Macduff Hughes
July 16, 2018, ACL’18, Melbourne
*Equal Contribution

This is NOT an architecture search paper!

A Brief History of NMT Models
Timeline, 2014 to 2018, in order of appearance:
● Sutskever et al.; Cho et al. (Seq2Seq)
● Bahdanau et al. (Attention)
● Wu et al. (Google-NMT)
● Gehring et al. (Conv-Seq2Seq)
● Vaswani et al. (Transformer)
● Chen et al. (RNMT+ and Hybrids)
Legend: Data, Model, Hyperparameters (the legend markers were not recovered)

The Best of Both Worlds - I
● Each new approach is accompanied by a set of modeling and training techniques.
● Goal: Tease apart architectures and their accompanying techniques.
  1. Identify key modeling and training techniques.
  2. Apply them to RNN-based Seq2Seq → RNMT+
  3. Conclusion: RNMT+ outperforms all three previous approaches.

The Best of Both Worlds - II
● Also, each new approach has a fundamental architecture (signature wiring of neural network).
● Goal: Analyse properties of each architecture.
  1. Combine their strengths.
  2. Devise new hybrid architectures → Hybrids
  3. Conclusion: Hybrids obtain further improvements over all the others.

Building Blocks
● RNN Based NMT - RNMT
● Convolutional NMT - ConvS2S
● Conditional Transformation Based NMT - Transformer

GNMT - Wu et al.
● Core Components:
  ○ RNNs
  ○ Attention (Additive)
  ○ biLSTM + uniLSTM
  ○ Deep residuals
  ○ Async Training
● Pros:
  ○ De facto standard
  ○ Modelling state space
● Cons:
  ○ Temporal dependence
  ○ Not enough gradients
*Figure from “Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation”, Wu et al. 2016
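The "Attention (Additive)" component can be illustrated with a minimal single-head sketch in plain Python. The function name, argument names, and the absence of bias terms are assumptions made for illustration; this is not the GNMT implementation.

```python
import math


def additive_attention(query, keys, values, W_q, W_k, v):
    """Additive (Bahdanau-style) attention for one decoder query.

    score_i = v . tanh(W_q @ query + W_k @ key_i), followed by a softmax
    over source positions. All parameter names here are hypothetical.
    """
    def matvec(M, x):
        return [sum(m * xi for m, xi in zip(row, x)) for row in M]

    q_proj = matvec(W_q, query)
    scores = []
    for key in keys:
        k_proj = matvec(W_k, key)
        hidden = [math.tanh(a + b) for a, b in zip(q_proj, k_proj)]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))

    # Softmax over source positions, shifted by the max for stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Context vector: attention-weighted sum of the value vectors.
    dim = len(values[0])
    context = [sum(w * val[d] for w, val in zip(weights, values))
               for d in range(dim)]
    return weights, context
```

Unlike dot-product attention, the score passes through a learned projection and a tanh, so the query and key are compared in a hidden space rather than directly.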

ConvS2S - Gehring et al.
● Core Components:
  ○ Convolution - GLUs
  ○ Multi-hop attention
  ○ Positional embeddings
  ○ Careful initialization
  ○ Careful normalization
  ○ Sync Training
● Pros:
  ○ No temporal dependence
  ○ More interpretable than RNN
  ○ Parallel decoder outputs during training
● Cons:
  ○ Need to stack more to increase the receptive field
*Figure from “Convolutional Sequence to Sequence Learning”, Gehring et al. 2017
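The gated linear unit (GLU) named under "Convolution - GLUs" can be sketched as follows. The sketch assumes the convolution has already produced a vector with twice the hidden size; the function name is illustrative.

```python
import math


def glu(x):
    """Gated linear unit: split x into halves (a, b) and return a * sigmoid(b).

    The first half carries the content and the second half acts as a
    per-dimension gate, so the output has half as many dimensions as the input.
    """
    if len(x) % 2 != 0:
        raise ValueError("GLU input must have an even number of elements")
    half = len(x) // 2
    a, b = x[:half], x[half:]
    return [ai / (1.0 + math.exp(-bi)) for ai, bi in zip(a, b)]
```

A gate input of 0 passes half of the content through, a large positive gate passes nearly all of it, and a large negative gate suppresses it.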

Transformer - Vaswani et al.
● Core Components:
  ○ Self-Attention
  ○ Multi-headed attention
  ○ Layout: N->f()->D->R
  ○ Careful normalization
  ○ Careful batching
  ○ Sync training
  ○ Label Smoothing
  ○ Per-token loss
  ○ Learning rate schedule
  ○ Checkpoint Averaging
● Pros:
  ○ Gradients everywhere - faster optimization
  ○ Parallel encoding both training/inference
● Cons:
  ○ Combines many advances at once
  ○ Fragile
*Figure from “Attention is All You Need”, Vaswani et al. 2017
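The "Layout: N->f()->D->R" item can be read as normalize, transform, dropout, residual-add. The sketch below assumes that reading; the helper names are illustrative, and the learned gain and bias of layer normalization are left out for brevity.

```python
import math
import random


def layer_norm(x, eps=1e-6):
    """Normalize a vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]


def dropout(x, rate, rng, training):
    """Inverted dropout: zero each unit with probability `rate` during training."""
    if not training or rate == 0.0:
        return list(x)
    keep = 1.0 - rate
    return [xi / keep if rng.random() < keep else 0.0 for xi in x]


def sublayer(x, f, rate=0.1, rng=None, training=False):
    """Apply one sub-layer in N -> f() -> D -> R order."""
    rng = rng or random.Random(0)
    normed = layer_norm(x)                                # N: normalize the input
    transformed = f(normed)                               # f(): attention or feed-forward
    dropped = dropout(transformed, rate, rng, training)   # D: dropout
    return [xi + di for xi, di in zip(x, dropped)]        # R: residual add
```

Because the residual path carries the un-normalized input, a sub-layer whose transform outputs zeros leaves the input unchanged.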

The Best of Both Worlds - I: RNMT+
● The Architecture:
  ○ Bi-directional encoder 6 x LSTM
  ○ Uni-directional decoder 8 x LSTM
  ○ Layer normalized LSTM cell
    ■ Per-gate normalization
  ○ Multi-head attention
    ■ 4 heads
    ■ Additive (Bahdanau) attention
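One way to picture the "layer normalized LSTM cell" with "per-gate normalization" is the single-step sketch below. It assumes the pre-activation of each gate is normalized separately before its nonlinearity; the placement of the bias, the missing learned gain, and all names are assumptions rather than the RNMT+ implementation.

```python
import math


def layer_norm(x, eps=1e-6):
    """Normalize a vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def ln_lstm_step(x, h_prev, c_prev, weights):
    """One LSTM step with per-gate layer normalization.

    `weights` maps each gate name in ("i", "f", "o", "g") to a tuple
    (W, U, b): W is hidden x input, U is hidden x hidden, b has length hidden.
    """
    def affine(W, U, b):
        return [
            sum(w * xi for w, xi in zip(w_row, x))
            + sum(u * hi for u, hi in zip(u_row, h_prev))
            + bi
            for w_row, u_row, bi in zip(W, U, b)
        ]

    # Per-gate normalization: each gate's pre-activation is normalized on its own.
    pre = {gate: layer_norm(affine(*weights[gate])) for gate in ("i", "f", "o", "g")}

    i = [sigmoid(z) for z in pre["i"]]    # input gate
    f = [sigmoid(z) for z in pre["f"]]    # forget gate
    o = [sigmoid(z) for z in pre["o"]]    # output gate
    g = [math.tanh(z) for z in pre["g"]]  # candidate cell update

    c = [fi * ci + ii * gi for fi, ci, ii, gi in zip(f, c_prev, i, g)]
    h = [oi * math.tanh(ci) for oi, ci in zip(o, c)]
    return h, c
```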

Model Comparison - I: BLEU Scores
[BLEU tables for WMT’14 En-Fr (35M sentence pairs) and WMT’14 En-De (4.5M sentence pairs) were not recovered]
● RNMT+/ConvS2S: 32 GPUs, 4096 sentence pairs/batch.
● Transformer Base/Big: 16 GPUs, 65536 tokens/batch.

Model Comparison - II: Speed and Size
[Speed and size tables for WMT’14 En-Fr (35M sentence pairs) and WMT’14 En-De (4.5M sentence pairs) were not recovered]
● RNMT+/ConvS2S: 32 GPUs, 4096 sentence pairs/batch.
● Transformer Base/Big: 16 GPUs, 65536 tokens/batch.

Stability: Ablations
Evaluate importance of four key techniques:
1. Label smoothing
  ○ Significant for both
2. Multi-head attention
  ○ Significant for both
3. Layer Normalization
  ○ Critical to stabilize training (especially with multi-head attention)
4. Synchronous training
  ○ Critical for Transformer
  ○ Significant quality drop for RNMT+
  ○ Successful only with a tailored learning-rate schedule
[Ablation table on WMT’14 En-Fr was not recovered; * indicates an unstable training run]
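Label smoothing, the first technique in the ablation, can be sketched as mixing the one-hot target with a uniform distribution over the vocabulary. The uniform-mixing variant, the default epsilon of 0.1, and the function names are assumptions for illustration.

```python
def smoothed_targets(target_index, vocab_size, epsilon=0.1):
    """Return the label-smoothed target distribution for one token.

    Each class receives epsilon / vocab_size, and the true class additionally
    receives 1 - epsilon, so the distribution still sums to 1.
    """
    uniform = epsilon / vocab_size
    dist = [uniform] * vocab_size
    dist[target_index] += 1.0 - epsilon
    return dist


def cross_entropy(target_dist, log_probs):
    """Cross-entropy between a target distribution and model log-probabilities."""
    return -sum(t * lp for t, lp in zip(target_dist, log_probs))
```

With epsilon set to 0 the targets reduce to the usual one-hot vector, which makes the two settings easy to compare in an ablation.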

The Best of Both Worlds - II: Hybrids
Strengths of each architecture:
● RNMT+
  ○ Highly expressive - continuous state space representation.
● Transformer
  ○ Full receptive field - powerful feature extractor.
● Combining individual architecture strengths:
  ○ Capture complementary information - “Best of Both Worlds”.
● Trainability - important concern with hybrids
  ○ Connections between different types of layers need to be carefully designed.

Encoder - Decoder Hybrids
Separation of roles:
● Decoder - conditional LM
● Encoder - build feature representations
→ Designed to contrast the roles. (last two rows; the results table was not recovered)

Encoder Layer Hybrids
Improved feature extraction:
● Enrich stateful representations with global self-attention
● Increased capacity
Details:
● Pre-trained components to improve trainability
● Layer normalization at layer boundaries
Cascaded Hybrid - vertical combination
Multi-Column Hybrid - horizontal combination
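The difference between the vertical (cascaded) and horizontal (multi-column) combinations can be sketched with encoders treated as plain functions from a sequence of vectors to a sequence of vectors. The function names are hypothetical, and merging the columns by concatenation is an assumption of this sketch.

```python
def cascaded_encoder(sequence, rnn_encoder, transformer_layers):
    """Vertical combination: stack Transformer layers on top of the RNN encoder."""
    hidden = rnn_encoder(sequence)
    for layer in transformer_layers:
        hidden = layer(hidden)
    return hidden


def multi_column_encoder(sequence, columns):
    """Horizontal combination: run encoders side by side on the same input.

    The per-position outputs of all columns are merged by concatenation
    (an assumed merge operation).
    """
    outputs = [column(sequence) for column in columns]
    merged = []
    for position in range(len(sequence)):
        combined = []
        for output in outputs:
            combined.extend(output[position])
        merged.append(combined)
    return merged
```

In the cascaded form the second encoder only sees the features produced by the first, while in the multi-column form every column sees the raw input and the output width grows with the number of columns.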

Encoder Layer Hybrids (continued; slide content was not recovered)

Lessons Learnt
Need to separate other improvements from the architecture itself:
● Your good ol’ architecture may shine with new modelling and training techniques
● Stronger baselines (Denkowski and Neubig, 2017)
Dull Teachers - Smart Students
● “A model with a sufficiently advanced lr-schedule is indistinguishable from magic.”
Understanding and Criticism
● Hybrids have the potential, more than duct taping.
● Game is on for the next generation of NMT architectures

Thank You
Open source implementation coming soon!
https://ai.google/research/join-us/
https://ai.google/research/join-us/ai-residency/
