 
Hint-Based Training for Non-Autoregressive Translation
Zhuohan Li, Zi Lin, Fei Tian, Tao Qin, Liwei Wang, Tie-Yan Liu, Di He
EMNLP-IJCNLP 2019
Autoregressive MT models
[Figure: Transformer encoder (embeddings of x1, x2, x3; ×M layers of multi-head self-attention and FFN) passing a context to the decoder (×N layers of masked multi-head self-attention, multi-head encoder-to-decoder attention, and FFN, followed by softmax). Decoder inputs are <sos>, y1, y2, y3; outputs are y1, y2, y3, y4.]
Non-autoregressive MT models
[Figure: the same encoder (×M) passing a context to a decoder (×N layers of multi-head self-attention, multi-head positional attention, multi-head encoder-to-decoder attention, and FFN, followed by softmax). Decoder inputs are copied from the source embeddings of x1, x2, x3; outputs are y1, y2, y3, y4.]
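The difference between the two decoder designs can be illustrated with a toy sketch. The functions `toy_next_token` and `toy_token_at` are hypothetical stand-ins for a real decoder, not part of the presented models; the only point is the dependency structure: an autoregressive decoder needs the tokens produced so far, while a non-autoregressive decoder computes each position from the source alone.

```python
def toy_next_token(source, prefix):
    """Hypothetical autoregressive step: depends on the generated prefix."""
    return source[len(prefix)].upper()


def toy_token_at(source, position):
    """Hypothetical non-autoregressive step: depends only on the source."""
    return source[position].upper()


def decode_autoregressive(source, length):
    prefix = []
    for _ in range(length):
        # Step t cannot start before steps 1..t-1 have finished.
        prefix.append(toy_next_token(source, prefix))
    return prefix


def decode_non_autoregressive(source, length):
    # No position reads another position's output, so on parallel
    # hardware all positions could be computed at the same time.
    return [toy_token_at(source, i) for i in range(length)]


source = ["a", "b", "c"]
print(decode_autoregressive(source, 3))
print(decode_non_autoregressive(source, 3))
```

In this toy both decoders return the same tokens; real models differ in quality precisely because the non-autoregressive decoder cannot condition on earlier outputs.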
Previous works on non-autoregressive MT
[Figure: the non-autoregressive architecture annotated with earlier extensions: fertilities predicted on the encoder side [Gu et al.] and a decoder repeated ×R times [Lee et al.].]
Quality-speedup trade-off
[Figure: speedup (y-axis, 0x to 18x) against BLEU score on WMT14 En-De (x-axis, 15 to 29). Plotted systems: Gu et al. (three points), Kaiser et al., Lee et al., and the autoregressive model.]
Hidden states similarity
[Figure: two panels, "Autoregressive model" and "Non-autoregressive model".]
Hidden states cosine-similarity of a sampled sentence in IWSLT14 De-En.
Attention distribution
[Figure: two panels, "Autoregressive model" and "Non-autoregressive model".]
Encoder-to-decoder attention distribution of an informative head of a sampled sentence from IWSLT14 De-En.
Hint-based training from autoregressive teacher to non-autoregressive student
[Figure: the autoregressive model and the non-autoregressive model side by side, with an arrow labeled "Hints" from the autoregressive model to the non-autoregressive model.]
Hint-based training from autoregressive teacher to non-autoregressive student
• Hints on hidden states
• Hints on word alignments
Hints on hidden states
• Direct regression fails because of the discrepancy between the two models
• We penalize hidden states that are highly similar:
$$\mathcal{L}_{hid} = \frac{2}{(T-1)\,T\,N} \sum_{s=1}^{T-1} \sum_{t=s+1}^{T} \sum_{l=1}^{N} \phi\left(d_{\mathrm{st}}, d_{\mathrm{tr}}\right)$$
• $d_{\mathrm{st}}$ and $d_{\mathrm{tr}}$ are the cosine similarities between the $s$-th and $t$-th hidden states at layer $l$ of the student and teacher models, respectively
• $T$ is the number of target positions and $N$ the number of decoder layers
• $\phi$ is a fixed function that penalizes only when the student's hidden states are similar while the teacher's are not
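A minimal pure-Python sketch of the hidden-state hint loss may make the triple sum concrete. The slide does not give the exact form of the penalty function, so the thresholded negative-log form in `phi`, the threshold values `gamma_st` and `gamma_tr`, and the clamp `eps` are all assumptions made for illustration; the data layout `[layer][position]` is likewise hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def phi(d_st, d_tr, gamma_st=0.9, gamma_tr=0.5, eps=1e-6):
    """Assumed penalty: active only when the student pair is similar
    (d_st >= gamma_st) while the teacher pair is not (d_tr <= gamma_tr)."""
    if d_st >= gamma_st and d_tr <= gamma_tr:
        return -math.log(max(1.0 - d_st, eps))
    return 0.0


def hidden_hint_loss(student, teacher):
    """student, teacher: [layer][position] -> hidden vector.
    Both must have the same number of layers N and positions T >= 2."""
    n_layers = len(student)
    T = len(student[0])
    total = 0.0
    for l in range(n_layers):
        for s in range(T - 1):
            for t in range(s + 1, T):
                d_st = cosine(student[l][s], student[l][t])
                d_tr = cosine(teacher[l][s], teacher[l][t])
                total += phi(d_st, d_tr)
    # Average over the T(T-1)/2 position pairs and the N layers.
    return 2.0 * total / ((T - 1) * T * n_layers)


teacher = [[[1.0, 0.0], [0.0, 1.0]]]    # one layer, two distinct states
collapsed = [[[1.0, 0.0], [1.0, 0.0]]]  # student states are identical
print(hidden_hint_loss(teacher, teacher))
print(hidden_hint_loss(collapsed, teacher))
```

The loss is zero when the student's pairwise similarities already look like the teacher's, and positive when the student collapses two positions that the teacher keeps apart.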
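Hints on word alignments
• We minimize the KL-divergence between the per-head encoder-to-decoder attention distributions of the student and teacher models:
$$\mathcal{L}_{align} = \frac{1}{T\,N\,H} \sum_{t=1}^{T} \sum_{l=1}^{N} \sum_{h=1}^{H} D_{KL}\left(a^{\mathrm{tr}}_{t,l,h} \,\|\, a^{\mathrm{st}}_{t,l,h}\right)$$
• $a^{\mathrm{tr}}_{t,l,h}$ and $a^{\mathrm{st}}_{t,l,h}$ are the teacher's and student's attention distributions at target position $t$, layer $l$, and head $h$
Total loss
$$\mathcal{L} = \mathcal{L}_{nll} + \lambda \mathcal{L}_{hid} + \mu \mathcal{L}_{align}$$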
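The alignment hint and the combined objective can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the authors' implementation: the KL direction (teacher first), the data layout `[position][layer][head]`, the clamp `eps`, and the default weights `lam` and `mu` are all hypothetical choices.

```python
import math


def kl_divergence(p, q, eps=1e-9):
    """D_KL(p || q) for two discrete distributions over source tokens."""
    return sum(pi * math.log(pi / max(qi, eps))
               for pi, qi in zip(p, q) if pi > 0.0)


def alignment_hint_loss(teacher_attn, student_attn):
    """attn: [target position][layer][head] -> attention distribution.
    Averages the per-head KL over T positions, N layers and H heads."""
    T = len(teacher_attn)
    N = len(teacher_attn[0])
    H = len(teacher_attn[0][0])
    total = 0.0
    for t in range(T):
        for l in range(N):
            for h in range(H):
                total += kl_divergence(teacher_attn[t][l][h],
                                       student_attn[t][l][h])
    return total / (T * N * H)


def total_loss(nll, hid, align, lam=1.0, mu=1.0):
    """Negative log-likelihood plus the two weighted hint terms."""
    return nll + lam * hid + mu * align


teacher_attn = [[[[0.5, 0.5]]]]  # T = N = H = 1, two source tokens
student_attn = [[[[0.9, 0.1]]]]
align = alignment_hint_loss(teacher_attn, student_attn)
print(align)
print(total_loss(nll=1.0, hid=0.0, align=align))
```

The alignment term vanishes when the student's attention matches the teacher's, so with both hint terms at zero the objective reduces to the usual negative log-likelihood.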
Experimental settings
• Datasets: WMT14 En-De, WMT14 De-En, IWSLT14 De-En
• Models: Transformer-base, Transformer-small
• Inference: non-autoregressive model; non-autoregressive model with teacher reranking
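Teacher reranking can be pictured as picking, among several candidate translations, the one the autoregressive teacher scores highest. The sketch below assumes the candidates are already available and uses a hypothetical lookup table `toy_scores` in place of a real teacher's log-probabilities; how the candidates are generated is not stated on the slide.

```python
def rerank_with_teacher(candidates, teacher_log_prob):
    """Return the candidate to which the teacher assigns the highest score."""
    return max(candidates, key=teacher_log_prob)


# Hypothetical teacher log-probabilities for three candidate outputs.
toy_scores = {
    ("a",): -2.0,
    ("a", "b"): -1.2,
    ("a", "b", "c"): -0.4,
}
best = rerank_with_teacher(list(toy_scores), toy_scores.get)
print(best)
```

Because the teacher only scores complete candidates, it never has to generate tokens one by one during reranking.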
Experimental results
Quality-speedup trade-off
[Figure: speedup (y-axis, 0x to 35x) against BLEU score on WMT14 En-De (x-axis, 15 to 29). Plotted systems: Ours, Ours (with reranking), Gu et al. (three points), Kaiser et al., Lee et al., and the autoregressive model.]
Hidden states similarity
[Figure: three panels, "Autoregressive model", "Non-autoregressive model without hints", and "Non-autoregressive model with hints".]
Hidden states cosine-similarity of a sampled sentence in IWSLT14 De-En.
Attention distribution
[Figure: three panels, "Autoregressive model", "Non-autoregressive model without hints", and "Non-autoregressive model with hints".]
Encoder-to-decoder attention distribution of an informative head of a sampled sentence from IWSLT14 De-En.
Ablation studies
Ablation studies on IWSLT14 De-En. Results are BLEU scores without teacher rescoring.
Summary
Instead of adding new modules that can slow down the model, we propose a method that leverages hints from the autoregressive model to help train the non-autoregressive model.
Thanks! Q&A
Zhuohan Li (zhuohan@cs.berkeley.edu)