Hint-Based Training for Non-Autoregressive Translation
EMNLP-IJCNLP 2019
Zhuohan Li, Zi Lin, Di He, Fei Tian, Tao Qin, Liwei Wang, Tie-Yan Liu
Autoregressive MT models

[Figure: the standard Transformer. Encoder (×M blocks): embeddings of x1 x2 x3, multi-head self-attention, FFN, producing the context. Decoder (×N blocks): masked multi-head self-attention, multi-head encoder-to-decoder attention, FFN, and a softmax per position; it reads the shifted target <sos> y1 y2 y3 and predicts y1 y2 y3 y4 left to right.]
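The diagram's key constraint is the masked self-attention: y_t can only be produced after y_1 .. y_{t-1}. A minimal sketch of greedy decoding makes the sequential cost explicit (the `model.encode`/`model.decode` interface and all names here are hypothetical, not from the paper): a length-U output needs U decoder passes.

```python
import torch

def greedy_decode(model, src, sos_id, eos_id, max_len=64):
    """Greedy autoregressive decoding: one decoder pass per output token."""
    memory = model.encode(src)                   # encoder runs once
    ys = torch.full((1, 1), sos_id, dtype=torch.long)
    for _ in range(max_len):
        logits = model.decode(ys, memory)        # decoder re-reads the whole prefix
        next_tok = logits[:, -1].argmax(dim=-1, keepdim=True)
        ys = torch.cat([ys, next_tok], dim=1)    # y_t depends on y_1 .. y_{t-1}
        if next_tok.item() == eos_id:
            break
    return ys[:, 1:]                             # strip <sos>
```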
Non-autoregressive MT models

[Figure: the non-autoregressive Transformer. The encoder (×M blocks) is unchanged, but its input embeddings are copied over to serve as the decoder's inputs. The decoder (×N blocks) inserts a multi-head positional attention layer between the self-attention and the encoder-to-decoder attention, and its softmax outputs y1 y2 y3 y4 are produced in parallel.]
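Because the decoder needs no target-side history, a single parallel pass suffices. A sketch under assumptions: a hypothetical `nat_model` whose `encode` returns both the encoder context and the source embeddings, with a uniform copy that stretches the source embeddings to a chosen target length.

```python
import torch

def nat_decode(nat_model, src, target_len):
    """One-pass non-autoregressive decoding with uniformly copied inputs."""
    memory, src_emb = nat_model.encode(src)      # src_emb: (1, S, d)
    S = src_emb.size(1)
    # Uniform copy: map each of the target_len decoder slots to a source position.
    idx = torch.linspace(0, S - 1, steps=target_len).round().long()
    dec_inputs = src_emb[:, idx]                 # (1, target_len, d)
    logits = nat_model.decode(dec_inputs, memory)
    return logits.argmax(dim=-1)                 # all target_len tokens at once
```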
[Figure: the same architecture annotated with prior work on choosing the decoder inputs and length: fertility prediction (Gu et al.) and ×R iterative refinement passes (Lee et al.).]
[Figure: speedup vs. BLEU score on WMT14 En-De for prior work (Gu et al., Kaiser et al., Lee et al.) and the autoregressive baseline; existing non-autoregressive models trade a sizable BLEU drop for their speedup.]
[Figure: hidden-state cosine similarity of a sampled sentence from IWSLT14 De-En, for the autoregressive model and the non-autoregressive model; the non-autoregressive decoder's hidden states are markedly more similar across positions.]
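The heat map above is just the matrix of pairwise cosine similarities between one decoder layer's hidden states; a self-contained sketch of how such a matrix is computed:

```python
import torch
import torch.nn.functional as F

def pairwise_cosine(hidden):
    """hidden: (U, d) hidden states -> (U, U) cosine-similarity matrix."""
    h = F.normalize(hidden, dim=-1)   # unit-normalize each position's state
    return h @ h.t()

# Example: 7 target positions with 512-dimensional states.
sim = pairwise_cosine(torch.randn(7, 512))
```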
[Figure: encoder-to-decoder attention distribution of an informative head for a sampled sentence from IWSLT14 De-En; the autoregressive model's head shows a clear word alignment, while the non-autoregressive model's is far more diffuse.]
[Figure: the hint-based training setup. The non-autoregressive student (encoder ×M; decoder ×N with positional attention; source embeddings copied to the decoder inputs) is trained alongside the pre-trained autoregressive teacher (encoder ×M; masked decoder ×N fed <sos> y1 y2 y3); hints flow from the teacher's decoder to the student's decoder.]
We pass two kinds of hints between the two models: hints on hidden states and hints on word alignments.
Hints on hidden states:

\mathcal{L}_{hid} = \frac{2}{M U (U - 1)} \sum_{m=1}^{M} \sum_{t=1}^{U} \sum_{s=t+1}^{U} \phi\left(d^{tr}_{s,t,m}, d^{st}_{s,t,m}\right)

where U is the target length, M is the number of decoder layers, d^{tr}_{s,t,m} and d^{st}_{s,t,m} are the pairwise cosine similarities between the hidden states at layer m of the teacher and student models, and the penalty \phi is nonzero only where the student's hidden states are similar while the teacher's are not.
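A sketch of this loss in PyTorch. The thresholds gamma_tr, gamma_st and the hinge form of phi are illustrative assumptions, not taken verbatim from the paper; only the structure (penalize pairs where the student is similar but the teacher is not, averaged over layers and position pairs) follows the formula above.

```python
import torch
import torch.nn.functional as F

def hidden_hint_loss(student_h, teacher_h, gamma_tr=0.5, gamma_st=0.5):
    """student_h, teacher_h: lists of (U, d) hidden states, one per layer."""
    total, count = 0.0, 0
    for hs, ht in zip(student_h, teacher_h):
        ds = F.normalize(hs, dim=-1) @ F.normalize(hs, dim=-1).t()  # student sims
        dt = F.normalize(ht, dim=-1) @ F.normalize(ht, dim=-1).t()  # teacher sims
        mask = torch.triu(torch.ones_like(ds, dtype=torch.bool), diagonal=1)
        # phi (assumed hinge form): active only where the teacher is dissimilar
        # but the student is similar.
        phi = (dt < gamma_tr).float() * F.relu(ds - gamma_st)
        total = total + phi[mask].sum()
        count += mask.sum().item()
    return total / count   # mean over pairs = the 2 / (M U (U-1)) normalization
```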
Hints on word alignments:

\mathcal{L}_{align} = \frac{1}{M H U} \sum_{m=1}^{M} \sum_{h=1}^{H} \sum_{t=1}^{U} \mathrm{KL}\left( b^{tr}_{t,h,m} \,\|\, b^{st}_{t,h,m} \right)

where H is the number of attention heads and b^{tr}_{t,h,m}, b^{st}_{t,h,m} are the per-head encoder-to-decoder attention distributions of the teacher and student models at target position t, head h, and layer m.
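A corresponding sketch for the alignment hint, assuming both models' encoder-to-decoder attention maps are gathered into (M, H, U, S) tensors (this layout is an assumption); it averages KL(teacher || student) over layers, heads, and target positions, as in the formula.

```python
import torch

def align_hint_loss(teacher_attn, student_attn, eps=1e-9):
    """teacher_attn, student_attn: (M, H, U, S); each row sums to 1 over S."""
    kl = (teacher_attn * ((teacher_attn + eps).log()
                          - (student_attn + eps).log())).sum(dim=-1)
    return kl.mean()   # averages over M layers, H heads, U positions
```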
The overall training loss combines the word-level negative log-likelihood with the two hint terms:

\mathcal{L} = \mathcal{L}_{nll} + \mu \mathcal{L}_{hid} + \nu \mathcal{L}_{align}
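Putting it together, as a one-line sketch (the default values of mu and nu here are placeholders, not the paper's tuned settings):

```python
def total_loss(l_nll, l_hid, l_align, mu=0.1, nu=0.1):
    # L = L_nll + mu * L_hid + nu * L_align; mu, nu are tuning hyperparameters.
    return l_nll + mu * l_hid + nu * l_align
```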
Datasets: WMT14 En-De, WMT14 De-En, IWSLT14 De-En
Models: Transformer-base, Transformer-small
Inference: non-autoregressive model; non-autoregressive model with teacher reranking
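Teacher reranking at inference can be sketched as: decode one candidate per target length in parallel with the non-autoregressive model, then keep the candidate the autoregressive teacher scores highest. The `teacher.log_prob` interface is an assumption, and `nat_decode` is the earlier sketch.

```python
import torch

def decode_with_reranking(nat_model, teacher, src, candidate_lengths):
    """Generate one candidate per length, keep the teacher's favorite."""
    candidates = [nat_decode(nat_model, src, L) for L in candidate_lengths]
    # teacher.log_prob(src, y): total log-probability of y under the teacher;
    # in practice all candidates can be scored in one batched teacher pass.
    scores = torch.tensor([teacher.log_prob(src, y) for y in candidates])
    return candidates[scores.argmax().item()]
```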
[Figure: speedup vs. BLEU score on WMT14 En-De including our results: "Ours" and "Ours (with reranking)" reach higher BLEU than prior non-autoregressive models (Gu et al., Kaiser et al., Lee et al.) while keeping large speedups over the autoregressive baseline.]
[Figure: hidden-state cosine similarity of a sampled sentence from IWSLT14 De-En, for the autoregressive model, the non-autoregressive model without hints, and the non-autoregressive model with hints; with hints, the student's similarity pattern moves toward the teacher's.]
[Figure: encoder-to-decoder attention distribution of an informative head for a sampled sentence from IWSLT14 De-En, for the autoregressive model, the non-autoregressive model without hints, and the non-autoregressive model with hints; training with hints recovers a much sharper alignment.]
[Table: ablation studies on IWSLT14 De-En; results are BLEU scores without teacher rescoring.]
Instead of adding new modules that can slow the model down, we proposed a method that leverages hints from the autoregressive teacher to help train the non-autoregressive model.
Q&A Zhuohan Li (zhuohan@cs.berkeley.edu)