SLIDE 1

Sparse and Constrained Attention for Neural Machine Translation

Chaitanya Malaviya¹, Pedro Ferreira², André F.T. Martins²,³

¹Carnegie Mellon University, ²Instituto Superior Técnico, ³Unbabel

SLIDE 2

Adequacy in Neural Machine Translation

Repetitions

Source: und wir benutzen dieses wort mit solcher verachtung .
Translation: and we use this word with such contempt contempt .
Reference: and we say that word with such contempt .

Dropped words

Source: Ein 28-jähriger Koch, der kürzlich nach Pittsburgh gezogen war, wurde diese Woche im Treppenhaus eines örtlichen Einkaufszentrums tot aufgefunden .
Translation: A 28-year-old chef who recently moved to Pittsburgh was found dead in the staircase this week .
Reference: A 28-year-old chef who recently moved to Pittsburgh was found dead in the staircase of a local shopping mall this week .

SLIDE 3

Previous Work

  • Conditioning on coverage vectors to track attention history (Mi, 2016; Tu, 2016).
  • Gating architectures and adaptive attention to control the amount of source context (Tu, 2017; Li & Zhu, 2017).
  • Reconstruction loss (Tu, 2017).
  • Coverage penalty during decoding (Wu, 2016).

SLIDE 4

Main Contributions

  • 1. Fertility-based neural machine translation model (bounds on source attention weights).
  • 2. Novel attention transform function: constrained sparsemax (enforces these bounds).
  • 3. Evaluation metrics: REP-score and DROP-score.

SLIDE 5

NMT + Attention Architecture

SLIDE 6

[Figure: encoder-decoder diagram. The source "J'ai mangé le sandwich" is encoded into hidden states h1…h4; attn_score followed by attn_transform yields attention weights, producing context vectors c1…c4 used with decoder states g1…g4 to generate "I ate the sandwich".]

attn_score: 


  • dot-product (Luong, 2015)
  • bilinear function
  • MLP (Bahdanau, 2014)

attn_transform:


  • traditional softmax
  • constrained softmax (Martins & Kreutzer, 2017)
  • sparsemax (Martins & Astudillo, 2016)
  • constrained sparsemax (this work)
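
For concreteness, here is a minimal NumPy sketch of the three attn_score variants above, followed by the traditional softmax transform. The weight shapes and function names are illustrative assumptions, not the paper's exact parameterization:

import numpy as np

def dot_score(h, s):
    # Luong (2015): score_j = h_j · s (encoder/decoder dims must match)
    return h @ s                            # (n_src,)

def bilinear_score(h, s, W):
    # bilinear: score_j = h_j^T W s
    return h @ W @ s                        # (n_src,)

def mlp_score(h, s, v, Wh, Ws):
    # Bahdanau (2014): score_j = v^T tanh(Wh h_j + Ws s)
    return np.tanh(h @ Wh.T + Ws @ s) @ v   # (n_src,)

# Toy usage: 4 encoder states of size 8, one decoder state.
rng = np.random.default_rng(0)
h, s = rng.normal(size=(4, 8)), rng.normal(size=8)
z = bilinear_score(h, s, rng.normal(size=(8, 8)))
p = np.exp(z - z.max()); p /= p.sum()       # traditional softmax transform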

SLIDE 7

Attention Transform Functions

  • Sparsemax: Euclidean projection of z onto the probability simplex; yields sparse probability distributions.
  • Constrained softmax: returns the distribution closest to softmax whose attention probabilities are bounded above by u.
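
Written out (following the standard definitions in the cited papers), with Δ the probability simplex:

\[
\mathrm{sparsemax}(\mathbf{z}) = \arg\min_{\mathbf{p}\,\in\,\Delta} \|\mathbf{p}-\mathbf{z}\|^2,
\qquad
\mathrm{csoftmax}(\mathbf{z};\mathbf{u}) = \arg\min_{\mathbf{p}\,\in\,\Delta,\ \mathbf{p}\le\mathbf{u}} \mathrm{KL}\big(\mathbf{p}\,\|\,\mathrm{softmax}(\mathbf{z})\big).
\]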

SLIDE 8

Attention Transform Functions


Sparse and Constrained?

SLIDE 9

Constrained Sparsemax

  • Provides sparse and bounded probability distributions.
  • This transformation has two levels of sparsity: over time steps & over attended words at each step.
  • Efficient linear and sublinear time algorithms for forward and backward propagation.
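
Combining the two transforms above, constrained sparsemax is the Euclidean projection onto the simplex intersected with the box [0, u]:

\[
\mathrm{csparsemax}(\mathbf{z};\mathbf{u}) = \arg\min_{\mathbf{p}\,\in\,\Delta,\ \mathbf{p}\le\mathbf{u}} \|\mathbf{p}-\mathbf{z}\|^2 .
\]

The solution has the form p_i = clip(z_i − τ, 0, u_i) for a threshold τ. As a minimal numerical sketch (not the paper's linear/sublinear algorithms), τ can be found by bisection, assuming sum(u) ≥ 1 so the problem is feasible:

import numpy as np

def csparsemax(z, u, iters=60):
    # argmin_p ||p - z||^2  s.t.  sum(p) = 1, 0 <= p <= u.
    # The mass sum_i clip(z_i - tau, 0, u_i) decreases as tau grows,
    # so bisection on tau converges to sum(p) = 1.
    z, u = np.asarray(z, float), np.asarray(u, float)
    lo, hi = z.min() - 1.0, z.max()   # mass(lo) >= 1 >= mass(hi)
    for _ in range(iters):
        tau = 0.5 * (lo + hi)
        if np.clip(z - tau, 0.0, u).sum() > 1.0:
            lo = tau
        else:
            hi = tau
    return np.clip(z - 0.5 * (lo + hi), 0.0, u)

# csparsemax([1.2, 0.9, 0.3, -0.5], [0.4] * 4) -> [0.4, 0.4, 0.2, 0.0]
# (up to bisection tolerance): sparse AND bounded above by u.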

SLIDE 10

Visualization: Attention transform functions

  • csparsemax provides sparse and constrained probabilities.

[Figure: attention distributions at decoding steps t=0, t=1, t=2 under each transform.]

SLIDE 11

Fertility-based NMT Model

SLIDE 12

Fertility-based NMT

  • Allocate fertilities for each source word as attention budgets that are exhausted over decoding.
  • Fertility predictor: train a biLSTM model supervised by fertilities from fast_align (IBM Model 2); see the sketch below.
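
As a rough illustration of the supervision signal, fertilities can be read off fast_align's output ("i-j" source-target index pairs) by counting how many target words each source word aligns to. This helper is a sketch, not the paper's code:

def fertilities(align_line, n_src):
    # align_line: one fast_align output line, e.g. "0-0 1-2 1-3 3-4"
    f = [0] * n_src
    for pair in align_line.split():
        src, _tgt = map(int, pair.split("-"))
        f[src] += 1                 # one more target word for source word src
    return f

# fertilities("0-0 1-2 1-3 3-4", n_src=4) -> [1, 2, 0, 1]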

SLIDE 13

Fertility-based NMT

  • Fertilities incorporated as upper bounds on the attention weights (reconstructed below).
  • Exhaustion strategy to encourage more attention for words with larger credit remaining.
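
The formulas on this slide were images in the original deck; consistent with the paper's description, the upper bound fed to csparsemax at decoding step t is each source word's unspent fertility:

\[
u_{t,j} \;=\; f_j \;-\; \sum_{t' < t} \alpha_{t',j},
\]

so the total attention a source word receives over the whole decoding can never exceed its predicted fertility f_j; the exhaustion strategy then biases the pre-attention scores toward words whose remaining credit u_{t,j} is largest.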

SLIDE 14

Experiments

SLIDE 15

Experiments

  • Experiments performed on 3 language pairs: De-En (IWSLT 2014), Ro-En (Europarl), Ja-En (KFTT).
  • Joint BPE with 32K merge operations.
  • Default hyperparameter settings in OpenNMT-Py.
  • Baselines: softmax, +CovPenalty (Wu, 2016), and +CovVector (Tu, 2016).

SLIDE 16

Evaluation Metrics: REP-Score & DROP-Score

REP-score:

  • Penalizes n-gram repetitions in predicted translations.
  • Normalized by the number of words in the reference corpus.

DROP-score:

  • Find word alignments from source to reference and from source to predicted translation.
  • % of source words aligned to some word in the reference but to no word in the predicted translation.
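
As a toy sketch of both metrics (the exact n-gram orders and normalizations used in the paper are assumptions here):

def repeated_ngrams(tokens, n):
    # count n-grams repeated immediately back-to-back
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return sum(grams[i] == grams[i + n] for i in range(max(0, len(grams) - n)))

def rep_score(preds, refs, orders=(1, 2)):
    # repetitions in predictions, normalized by reference corpus length
    reps = sum(repeated_ngrams(p, n) for p in preds for n in orders)
    return 100.0 * reps / sum(len(r) for r in refs)

def drop_score(ref_aligned, pred_aligned, n_src):
    # ref_aligned / pred_aligned: sets of source indices aligned (e.g. by
    # fast_align) to the reference / predicted translation
    return 100.0 * len(ref_aligned - pred_aligned) / n_src

pred = "and we use this word with such contempt contempt .".split()
ref = "and we say that word with such contempt .".split()
rep_score([pred], [ref])   # > 0: flags the repeated "contempt"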

SLIDE 17

Results


BLEU Scores:

                      De-En   Ja-En   Ro-En
softmax               29.67   20.36   29.51
softmax+CovPenalty    29.81   20.70   29.69
softmax+CovVector     30.08   21.53   29.63
csparsemax            29.77   21.31   29.85

SLIDE 18


REP Scores (lower is better):

                      De-En   Ja-En   Ro-En
softmax                2.45   13.48    3.37
softmax+CovPenalty     2.48   14.12    3.47
softmax+CovVector      2.42   11.07    2.93
csparsemax             1.98   11.40    2.67

DROP Scores (lower is better):

                      De-En   Ja-En   Ro-En
softmax                5.59   23.30    5.89
softmax+CovPenalty     5.49   22.79    5.74
softmax+CovVector      5.47   22.18    5.65
csparsemax             5.44   21.59    5.23

SLIDE 19
  • csparsemax yields a sparse set of alignments and avoids repetitions.

[Figure: attention maps, csparsemax vs. softmax.]

SLIDE 20

Examples of Translations

SLIDE 21


More in the paper…

SLIDE 22

Thank You!

Code: www.github.com/Unbabel/sparse_constrained_attention

Questions?
