SLIDE 1

Sparse and Constrained Attention for Neural Machine Translation

Chaitanya Malaviya¹, Pedro Ferreira², André F.T. Martins²,³

¹Carnegie Mellon University, ²Instituto Superior Técnico, ³Unbabel

SLIDE 2

Adequacy in Neural Machine Translation

Repetitions

Source: und wir benutzen dieses wort mit solcher verachtung .
Translation: and we use this word with such contempt contempt .
Reference: and we say that word with such contempt .

Dropped words

Source: Ein 28-jähriger Koch, der kürzlich nach Pittsburgh gezogen war, wurde diese Woche im Treppenhaus eines örtlichen Einkaufszentrums tot aufgefunden .
Translation: A 28-year-old chef who recently moved to Pittsburgh was found dead in the staircase this week .
Reference: A 28-year-old chef who recently moved to Pittsburgh was found dead in the staircase of a local shopping mall this week .

SLIDE 3

Previous Work

  • Conditioning on coverage vectors to track attention history (Mi, 2016; Tu, 2016).
  • Gating architectures and adaptive attention to control the amount of source context (Tu, 2017; Li & Zhu, 2017).
  • Reconstruction loss (Tu, 2017).
  • Coverage penalty during decoding (Wu, 2016).

SLIDE 4

Main Contributions

  • 1. Fertility-based neural machine translation model (bounds on source attention weights).
  • 2. Novel attention transform function: constrained sparsemax (enforces these bounds).
  • 3. Evaluation metrics: REP-score and DROP-score.

SLIDE 5

NMT + Attention Architecture

SLIDE 6

[Figure: encoder-decoder diagram. The source "J'ai mangé le sandwich" is encoded into hidden states h1…h4; attn_score followed by attn_transform yields attention weights, producing context vectors c1…c4 used with decoder states g1…g4 to generate "I ate the sandwich".]

attn_score: 


  • dot-product (Luong, 2015)
  • bilinear function
  • MLP (Bahdanau, 2014)

attn_transform:


  • traditional softmax
  • constrained softmax (Martins & Kreutzer, 2017)
  • sparsemax (Martins & Astudillo, 2016)
  • constrained sparsemax (this work)
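
For concreteness, here is a minimal NumPy sketch of the three attn_score variants above, followed by the traditional softmax transform. The weight shapes and function names are illustrative assumptions, not the paper's exact parameterization:

import numpy as np

def dot_score(h, s):
    # Luong (2015): score_j = h_j · s (encoder/decoder dims must match)
    return h @ s                            # (n_src,)

def bilinear_score(h, s, W):
    # bilinear: score_j = h_j^T W s
    return h @ W @ s                        # (n_src,)

def mlp_score(h, s, v, Wh, Ws):
    # Bahdanau (2014): score_j = v^T tanh(Wh h_j + Ws s)
    return np.tanh(h @ Wh.T + Ws @ s) @ v   # (n_src,)

# Toy usage: 4 encoder states of size 8, one decoder state.
rng = np.random.default_rng(0)
h, s = rng.normal(size=(4, 8)), rng.normal(size=8)
z = bilinear_score(h, s, rng.normal(size=(8, 8)))
p = np.exp(z - z.max()); p /= p.sum()       # traditional softmax transform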

SLIDE 7

Attention Transform Functions

  • Sparsemax: Euclidean projection of z onto the probability simplex; yields sparse probability distributions.
  • Constrained softmax: returns the distribution closest to softmax whose attention probabilities are bounded above by u.
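
Written out (following the standard definitions in the cited papers), with Δ the probability simplex:

\[
\mathrm{sparsemax}(\mathbf{z}) = \arg\min_{\mathbf{p}\,\in\,\Delta} \|\mathbf{p}-\mathbf{z}\|^2,
\qquad
\mathrm{csoftmax}(\mathbf{z};\mathbf{u}) = \arg\min_{\mathbf{p}\,\in\,\Delta,\ \mathbf{p}\le\mathbf{u}} \mathrm{KL}\big(\mathbf{p}\,\|\,\mathrm{softmax}(\mathbf{z})\big).
\]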

SLIDE 8

Attention Transform Functions


Sparse and Constrained?

SLIDE 9

Constrained Sparsemax

  • Provides sparse and bounded probability distributions.
  • This transformation has two levels of sparsity: over time steps & over attended words at each step.
  • Efficient linear and sublinear time algorithms for forward and backward propagation.
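
Combining the two transforms above, constrained sparsemax is the Euclidean projection onto the simplex intersected with the box [0, u]:

\[
\mathrm{csparsemax}(\mathbf{z};\mathbf{u}) = \arg\min_{\mathbf{p}\,\in\,\Delta,\ \mathbf{p}\le\mathbf{u}} \|\mathbf{p}-\mathbf{z}\|^2 .
\]

The solution has the form p_i = clip(z_i − τ, 0, u_i) for a threshold τ. As a minimal numerical sketch (not the paper's linear/sublinear algorithms), τ can be found by bisection, assuming sum(u) ≥ 1 so the problem is feasible:

import numpy as np

def csparsemax(z, u, iters=60):
    # argmin_p ||p - z||^2  s.t.  sum(p) = 1, 0 <= p <= u.
    # The mass sum_i clip(z_i - tau, 0, u_i) decreases as tau grows,
    # so bisection on tau converges to sum(p) = 1.
    z, u = np.asarray(z, float), np.asarray(u, float)
    lo, hi = z.min() - 1.0, z.max()   # mass(lo) >= 1 >= mass(hi)
    for _ in range(iters):
        tau = 0.5 * (lo + hi)
        if np.clip(z - tau, 0.0, u).sum() > 1.0:
            lo = tau
        else:
            hi = tau
    return np.clip(z - 0.5 * (lo + hi), 0.0, u)

# csparsemax([1.2, 0.9, 0.3, -0.5], [0.4] * 4) -> [0.4, 0.4, 0.2, 0.0]
# (up to bisection tolerance): sparse AND bounded above by u.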

SLIDE 10

Visualization: Attention transform functions

  • csparsemax provides sparse and constrained probabilities.

[Figure: attention distributions at decoding steps t=0, t=1, t=2 under each transform.]

SLIDE 11

Fertility-based NMT Model

SLIDE 12

Fertility-based NMT

  • Allocate fertilities for each source word as attention budgets that are exhausted over decoding.
  • Fertility predictor: train a biLSTM model supervised by fertilities from fast_align (IBM Model 2); see the sketch below.
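
As a rough illustration of the supervision signal, fertilities can be read off fast_align's output ("i-j" source-target index pairs) by counting how many target words each source word aligns to. This helper is a sketch, not the paper's code:

def fertilities(align_line, n_src):
    # align_line: one fast_align output line, e.g. "0-0 1-2 1-3 3-4"
    f = [0] * n_src
    for pair in align_line.split():
        src, _tgt = map(int, pair.split("-"))
        f[src] += 1                 # one more target word for source word src
    return f

# fertilities("0-0 1-2 1-3 3-4", n_src=4) -> [1, 2, 0, 1]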

SLIDE 13

Fertility-based NMT

  • Fertilities incorporated as upper bounds on the attention weights (reconstructed below).
  • Exhaustion strategy to encourage more attention for words with larger credit remaining.
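
The formulas on this slide were images in the original deck; consistent with the paper's description, the upper bound fed to csparsemax at decoding step t is each source word's unspent fertility:

\[
u_{t,j} \;=\; f_j \;-\; \sum_{t' < t} \alpha_{t',j},
\]

so the total attention a source word receives over the whole decoding can never exceed its predicted fertility f_j; the exhaustion strategy then biases the pre-attention scores toward words whose remaining credit u_{t,j} is largest.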

SLIDE 14

Experiments

SLIDE 15

Experiments

  • Experiments performed on 3 language pairs: De-En (IWSLT 2014), Ro-En (Europarl), Ja-En (KFTT).
  • Joint BPE with 32K merge operations.
  • Default hyperparameter settings in OpenNMT-Py.
  • Baselines: softmax, +CovPenalty (Wu, 2016), and +CovVector (Tu, 2016).

SLIDE 16

Evaluation Metrics: REP-Score & DROP-Score

REP-score:

  • Penalizes n-gram repetitions in predicted translations.
  • Normalized by the number of words in the reference corpus.

DROP-score:

  • Find word alignments from source to reference and from source to predicted translation.
  • % of source words aligned to some word in the reference but to no word in the predicted translation.
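
As a toy sketch of both metrics (the exact n-gram orders and normalizations used in the paper are assumptions here):

def repeated_ngrams(tokens, n):
    # count n-grams repeated immediately back-to-back
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return sum(grams[i] == grams[i + n] for i in range(max(0, len(grams) - n)))

def rep_score(preds, refs, orders=(1, 2)):
    # repetitions in predictions, normalized by reference corpus length
    reps = sum(repeated_ngrams(p, n) for p in preds for n in orders)
    return 100.0 * reps / sum(len(r) for r in refs)

def drop_score(ref_aligned, pred_aligned, n_src):
    # ref_aligned / pred_aligned: sets of source indices aligned (e.g. by
    # fast_align) to the reference / predicted translation
    return 100.0 * len(ref_aligned - pred_aligned) / n_src

pred = "and we use this word with such contempt contempt .".split()
ref = "and we say that word with such contempt .".split()
rep_score([pred], [ref])   # > 0: flags the repeated "contempt"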

SLIDE 17

Results


BLEU Scores:

                      De-En   Ja-En   Ro-En
softmax               29.67   20.36   29.51
softmax+CovPenalty    29.81   20.70   29.69
softmax+CovVector     30.08   21.53   29.63
csparsemax            29.77   21.31   29.85

SLIDE 18


REP Scores (lower is better):

                      De-En   Ja-En   Ro-En
softmax                2.45   13.48    3.37
softmax+CovPenalty     2.48   14.12    3.47
softmax+CovVector      2.42   11.07    2.93
csparsemax             1.98   11.40    2.67

DROP Scores (lower is better):

                      De-En   Ja-En   Ro-En
softmax                5.59   23.30    5.89
softmax+CovPenalty     5.49   22.79    5.74
softmax+CovVector      5.47   22.18    5.65
csparsemax             5.44   21.59    5.23

SLIDE 19
  • csparsemax yields a sparse set of alignments and avoids repetitions.

[Figure: attention maps, csparsemax vs. softmax.]

SLIDE 20

Examples of Translations

SLIDE 21


More in the paper…

SLIDE 22

Thank You!

Code: www.github.com/Unbabel/sparse_constrained_attention

Questions?
