

SLIDE 1

Hint-Based Training for Non-Autoregressive Translation

EMNLP-IJCNLP 2019

Zhuohan Li, Zi Lin, Di He, Fei Tian, Tao Qin, Liwei Wang, Tie-Yan Liu

SLIDE 2

Autoregressive MT models

[Diagram: Transformer. Encoder (×M layers): embeddings of x1–x3 → multi-head self-attention → FFN → context. Decoder (×N layers): embeddings of <sos>, y1–y3 → masked multi-head self-attention → multi-head encoder-to-decoder attention → FFN → softmax, producing y1–y4 one token at a time.]

SLIDE 3

Non-autoregressive MT models

[Diagram: non-autoregressive Transformer. Encoder (×M layers) as before, with its source-side embeddings copied to form the decoder input. Decoder (×N layers): multi-head self-attention → multi-head positional attention → multi-head encoder-to-decoder attention → FFN → softmax, producing y1–y4 in parallel.]

SLIDE 4

Previous works

  • Prior work on non-autoregressive MT

[Diagram: the non-autoregressive model, annotated with additions from prior work: fertility prediction on the encoder side [Gu et al.] and ×R repeated decoder passes (iterative refinement) [Lee et al.].]

SLIDE 5

Quality-speedup trade-off

[Chart: BLEU score (WMT14 En-De, 15–29) vs. decoding speedup (0x–18x) for prior models: Gu et al. (several variants), Kaiser et al., Lee et al., and the autoregressive baseline.]

SLIDE 6

Hidden states similarity

Hidden-state cosine similarity of a sampled sentence in IWSLT14 De-En (panels: autoregressive model, non-autoregressive model).

SLIDE 7

Attention distribution

Encoder-to-decoder attention distribution of an informative head for a sampled sentence from IWSLT14 De-En (panels: autoregressive model, non-autoregressive model).

SLIDE 8

[Diagram: the autoregressive teacher (encoder ×M; decoder ×N with masked self-attention) and the non-autoregressive student (encoder ×M with copy; decoder ×N with positional attention) side by side, with hints passed from the teacher's decoder to the student's decoder.]

Hint-based training from an autoregressive teacher to a non-autoregressive student.

SLIDE 9

Hint-based training from an autoregressive teacher to a non-autoregressive student

  • Hints on hidden states
  • Hints on word alignments

SLIDE 10
Hints on hidden states

  • Direct regression to the teacher's hidden states fails because of the discrepancy between the two models.
  • Instead, we penalize pairs of student hidden states that are highly similar when the corresponding teacher states are not:

\mathcal{L}_{\mathrm{hid}} = \frac{2}{M \, U (U-1)} \sum_{m=1}^{M} \sum_{t=1}^{U} \sum_{u=t+1}^{U} \varrho\left(e^{st}_{t,u,m}, \, e^{tr}_{t,u,m}\right)

  • e^{st}_{t,u,m} and e^{tr}_{t,u,m} are the cosine similarities between the t-th and u-th decoder hidden states at layer m of the student and teacher models, respectively.
  • \varrho is a fixed function that applies a penalty only when the student's hidden states are similar while the teacher's are not.
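A minimal sketch of one way to compute such a penalty for a single layer (my reading of the slide, not the authors' code: the thresholded form of the penalty function and the cutoffs `gamma`/`delta` are assumptions):

```python
import numpy as np

# Sketch of the hidden-state hint for one decoder layer: penalize position
# pairs whose student states are nearly parallel while the teacher keeps
# them apart.
def hidden_state_hint_loss(student_h, teacher_h, gamma=0.9, delta=0.5):
    def pairwise_cos(h):  # h: (U, d) hidden states for one layer
        n = h / np.linalg.norm(h, axis=1, keepdims=True)
        return n @ n.T
    e_st, e_tr = pairwise_cos(student_h), pairwise_cos(teacher_h)
    u = len(student_h)
    loss, pairs = 0.0, u * (u - 1) / 2
    for t in range(u):
        for s in range(t + 1, u):
            # assumed penalty: fire only when the student pair is similar
            # (above gamma) but the teacher pair is not (below delta)
            if e_st[t, s] > gamma and e_tr[t, s] < delta:
                loss += e_st[t, s]
    return loss / pairs

# Two student states collapsed together while the teacher separates them:
st = np.array([[1.0, 0.0], [1.0, 0.0]])
tr = np.array([[1.0, 0.0], [0.0, 1.0]])
print(hidden_state_hint_loss(st, tr))  # -> 1.0
```

Averaging this over all M decoder layers gives the per-sentence hint term.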

SLIDE 11
Hints on word alignments

  • We minimize the KL divergence between the per-head encoder-to-decoder attention distributions of the student and teacher models:

\mathcal{L}_{\mathrm{align}} = \frac{1}{M \, H \, U} \sum_{m=1}^{M} \sum_{t=1}^{U} \sum_{h=1}^{H} D_{\mathrm{KL}}\left(b^{tr}_{t,h,m} \,\|\, b^{st}_{t,h,m}\right)

Total loss

\mathcal{L} = \mathcal{L}_{\mathrm{nll}} + \mu \, \mathcal{L}_{\mathrm{hid}} + \nu \, \mathcal{L}_{\mathrm{align}}
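A sketch of the alignment hint with explicit attention tensors (the tensor layout and the direction of the KL are my assumptions, not the paper's code):

```python
import numpy as np

# Sketch: average KL(teacher ∥ student) of encoder-to-decoder attention rows
# over M layers, H heads, and U target positions.
def attention_hint_loss(student_attn, teacher_attn, eps=1e-9):
    # both arrays: (M layers, H heads, U target positions, source length)
    kl = teacher_attn * (np.log(teacher_attn + eps) - np.log(student_attn + eps))
    m, h, u, _ = teacher_attn.shape
    return kl.sum() / (m * h * u)

uniform = np.full((1, 1, 2, 2), 0.5)              # teacher attends uniformly
peaked = np.array([[[[0.9, 0.1], [0.1, 0.9]]]])   # student attends sharply
print(attention_hint_loss(uniform, uniform))      # -> 0.0 (identical distributions)
print(attention_hint_loss(peaked, uniform) > 0)   # -> True (mismatch is penalized)
```

The `eps` guard only keeps the logarithms finite for zero attention weights; with identical inputs the loss is exactly zero.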

SLIDE 12

Experimental settings

  • Datasets: WMT14 En-De, WMT14 De-En, IWSLT14 De-En
  • Models: Transformer-base, Transformer-small
  • Inference: non-autoregressive model; non-autoregressive model with teacher reranking

SLIDE 13

Experimental results


SLIDE 14

Quality-speedup trade-off

[Chart: BLEU score (WMT14 En-De, 15–29) vs. decoding speedup (0x–35x): Gu et al. (several variants), Kaiser et al., Lee et al., the autoregressive baseline, ours, and ours with reranking.]

SLIDE 15

Hidden states similarity

Hidden-state cosine similarity of a sampled sentence in IWSLT14 De-En (panels: autoregressive model, non-autoregressive model without hints, non-autoregressive model with hints).

SLIDE 16

Attention distribution

Encoder-to-decoder attention distribution of an informative head for a sampled sentence from IWSLT14 De-En (panels: autoregressive model, non-autoregressive model without hints, non-autoregressive model with hints).

SLIDE 17

Ablation studies


Ablation studies on IWSLT14 De-En. Results are BLEU scores without teacher rescoring.

SLIDE 18

Summary


Instead of adding new modules that can slow down the model, we proposed a method that leverages hints from the autoregressive model to guide the training of the non-autoregressive model.

SLIDE 19

Thanks!

Q&A Zhuohan Li (zhuohan@cs.berkeley.edu)
