Saliency-driven Word Alignment Interpretation for NMT


SLIDE 1

Saliency-driven Word Alignment Interpretation for NMT

Shuoyang Ding, Hainan Xu, Philipp Koehn
The Fourth Conference on Machine Translation
Florence, Italy, August 1st, 2019

SLIDE 2

Revisiting Six Challenges

  • poor out-of-domain performance
  • poor low-resource performance
  • low frequency words
  • long sentences
  • attention is not word alignment
  • large beam does not help


[Koehn and Knowles 2017]


SLIDE 4

A Model Interpretation Problem


SLIDE 6

Related Findings Outside MT

  • “Attention is not Explanation”


[Jain and Wallace NAACL 2019]

  • “Is Attention Interpretable?” (Spoiler: No)

[Serrano and Smith ACL 2019]

  • We also have empirical results that corroborate these findings.
  • … and we have a method that works better!


SLIDE 7

Saliency: Identifying Important Features

SLIDE 8

Recap


SLIDE 10

Focus on solten


SLIDE 11

Perturbation


SLIDE 13

Assumption


The output score is more sensitive to perturbations in important features.
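This assumption can be illustrated with a toy scorer (an assumed stand-in for the NMT output score, not the actual model): nudging an important feature moves the output far more than nudging an unimportant one.

```python
import numpy as np

# Toy output score: feature 0 has a large weight (important),
# feature 1 a tiny one (unimportant). Weights are assumed for illustration.
w = np.array([5.0, 0.1])
score = lambda x: float(w @ x)

x = np.array([1.0, 1.0])
eps = 1e-2
base = score(x)

# Perturb each feature by eps and measure how much the score moves.
deltas = []
for i in range(len(x)):
    x_pert = x.copy()
    x_pert[i] += eps
    deltas.append(abs(score(x_pert) - base))

print(deltas)  # perturbing the important feature moves the score far more
```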

SLIDE 14

E.g.


SLIDE 17

Saliency


SLIDE 20

What’s good about this?

  • 1. Derivatives are easy to obtain in any DL toolkit
  • 2. Model-agnostic
  • 3. Adapts to the choice of output words


SLIDE 21

Prior Work on Saliency

  • Widely used and studied in Computer Vision!
    [Simonyan et al. 2013] [Springenberg et al. 2014] [Smilkov et al. 2017]
  • Also used in a few NLP works for qualitative analysis
    [Aubakirova and Bansal 2016] [Li et al. 2016] [Ding et al. 2017] [Arras et al. 2016; 2017] [Mudrakarta et al. 2018]


SLIDE 22

SmoothGrad

  • Gradients are a very local measure of sensitivity.
  • Highly non-linear models may have pathological points where the gradients are noisy.
  • Solution: compute saliency for multiple copies of the same input corrupted with Gaussian noise, then average the saliency over the copies.



[Smilkov et al. 2017]
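A minimal numpy sketch of SmoothGrad on a toy function y = tanh(w · x) whose gradient is analytic (an assumed stand-in for the model; in the talk the gradients come from the NMT model): draw N noisy copies of the input and average their gradients.

```python
import numpy as np

rng = np.random.default_rng(0)
w = np.array([2.0, -1.0, 0.5])   # toy weights, assumed for illustration

def grad(x):
    # Analytic gradient of the toy function y = tanh(w . x):
    # dy/dx = (1 - tanh(w . x)^2) * w
    return (1.0 - np.tanh(w @ x) ** 2) * w

x = np.array([0.3, -0.2, 0.1])
N, sigma = 30, 0.15              # the SmoothGrad hyper-parameters used in the talk

# SmoothGrad: average gradients over N Gaussian-corrupted copies of x.
noisy = [grad(x + rng.normal(scale=sigma, size=x.shape)) for _ in range(N)]
smooth_grad = np.mean(noisy, axis=0)
print(smooth_grad)
```

The averaging suppresses the noise that a single pathological point would contribute, while keeping the overall direction of the gradient.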

SLIDE 23

Establishing Saliency for Words

SLIDE 24

“Feature” in Computer Vision


Photo Credit: Hainan Xu

SLIDE 25

“Feature” in NLP


It’s straightforward to compute saliency for a single dimension of the word embedding.

SLIDE 26

“Feature” in NLP


But how do we compose the saliency of each dimension into the saliency of a word?

SLIDE 27

Our Proposal


Consider word embedding lookup as a dot product between the embedding matrix and a one-hot vector.
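This view can be checked directly (numpy sketch; the sizes are illustrative):

```python
import numpy as np

V, d = 6, 3                      # illustrative vocabulary / embedding sizes
E = np.arange(V * d, dtype=float).reshape(V, d)  # embedding matrix

i = 4                            # identity of the input word
p = np.zeros(V)
p[i] = 1.0                       # one-hot vector with a single 1 at position i

# Embedding lookup of word i is exactly the matrix-vector product E^T p.
assert np.allclose(E.T @ p, E[i])
print(E.T @ p)
```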

SLIDE 28

Our Proposal


The 1 in the one-hot vector denotes the identity of the input word.

SLIDE 29

Our Proposal


Let’s perturb that 1 like a real value, i.e., take gradients with respect to the 1.

SLIDE 30

Our Proposal


eᵢ ⋅ ∂y/∂eᵢ        range: (−∞, ∞)
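A minimal numpy sketch of this proposal with a toy linear scorer standing in for the NMT model (all weights are assumed for illustration): by the chain rule, the gradient of the output score with respect to the 1 in the one-hot vector equals eᵢ ⋅ ∂y/∂eᵢ.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 5, 4                     # toy vocabulary and embedding sizes
E = rng.normal(size=(V, d))     # embedding matrix
w = rng.normal(size=d)          # toy linear scorer: y = w . e

i = 2                           # input word id
p = np.zeros(V)
p[i] = 1.0                      # one-hot input vector
e = E.T @ p                     # embedding lookup as a dot product
y = w @ e                       # output score

# Gradient of y w.r.t. the 1 in the one-hot vector (chain rule): w . E[i]
grad_wrt_onehot = w @ E[i]
# Equivalently, the proposed word saliency e_i . dy/de_i (here dy/de = w):
saliency = e @ w

assert np.isclose(grad_wrt_onehot, saliency)
print(saliency)
```

With a real NMT model the same quantity can be read off from automatic differentiation, since the one-hot entry is just another input to differentiate through.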

SLIDE 31

Experiment

SLIDE 32

Evaluation

  • Evaluation of interpretations is tricky!
  • Fortunately, there are human judgments to rely on.
  • We need to do forced decoding with the NMT model.


SLIDE 33

Setup

  • Architectures: Convolutional S2S, LSTM, Transformer (with fairseq default hyper-parameters)
  • Dataset: following Zenkel et al. [2019], which covers de-en, fr-en, and ro-en
  • SmoothGrad hyper-parameters: N=30 and σ=0.15


SLIDE 34

Baselines

  • Attention weights
  • Smoothed Attention: forward pass on multiple corrupted input samples, then average the attention weights over samples
  • [Li et al. 2016]: compute the element-wise absolute value of embedding gradients, then average over embedding dimensions
  • [Li et al. 2016] + SmoothGrad
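The two gradient-based word scores differ only in how the per-dimension embedding gradients are composed. A toy numpy sketch with hand-picked values (not the paper’s code; the embedding and gradient are assumed):

```python
import numpy as np

e = np.array([0.5, -1.0, 2.0])       # a word embedding (assumed values)
grad_e = np.array([0.2, 0.4, -0.1])  # dy/de from some model (assumed values)

# Li et al. [2016]: mean of the element-wise absolute embedding gradients.
li_score = float(np.mean(np.abs(grad_e)))

# This paper: dot product of the embedding with its gradient, i.e. the
# gradient taken w.r.t. the word's one-hot entry; note it can be negative.
ours_score = float(e @ grad_e)

print(li_score, ours_score)
```

The Li et al. score ignores the embedding values themselves, while the proposed score weights each gradient dimension by the corresponding embedding dimension.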


SLIDE 35

Convolutional S2S on de-en

[Chart: AER (15–45) of Attention, Smoothed Attention, Li+Grad, Li+SmoothGrad, Ours+Grad, and Ours+SmoothGrad, compared against fast-align, Zenkel et al. [2019], and GIZA++]

SLIDE 36

Attention on de-en

[Chart: AER (15–65) of attention for Conv, LSTM, and Transformer, compared against fast-align, Zenkel et al. [2019], and GIZA++]

SLIDE 37

Ours+SmoothGrad on de-en

[Chart: AER (15–65) of Ours+SmoothGrad for Conv, LSTM, and Transformer, compared against fast-align, Zenkel et al. [2019], and GIZA++]

SLIDE 38

Li vs. Ours


SLIDE 40

Conclusion

SLIDE 41

Conclusion

  • Saliency + a proper word-level score formulation is a better interpretation method than attention
  • NMT models do learn interpretable alignments. We just need to properly uncover them!



Paper / Code / Slides:

https://github.com/shuoyangd/meerkat