Hint-Based Training for Non-Autoregressive Translation


  1. Hint-Based Training for Non-Autoregressive Translation. Zhuohan Li, Zi Lin, Fei Tian, Tao Qin, Liwei Wang, Tie-Yan Liu, Di He. EMNLP-IJCNLP 2019

  2. Autoregressive MT models. [Diagram: a Transformer encoder-decoder. The encoder (×M blocks) applies multi-head self-attention and FFN layers to the embedded source tokens x1, x2, x3. The decoder (×N blocks) applies masked multi-head self-attention over the shifted targets <sos>, y1, y2, y3, encoder-to-decoder attention over the encoder context, and FFN layers, producing y1...y4 through per-position softmax layers one step at a time.]
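
A minimal sketch (not from the slides) of the sequential decoding loop this diagram implies, assuming hypothetical `model.encode` and `model.decode` wrappers around the Transformer encoder and decoder; it illustrates why latency grows with target length:

```python
import torch

def autoregressive_decode(model, src_tokens, max_len, sos_id, eos_id):
    """Greedy autoregressive decoding: one decoder pass per output token."""
    memory = model.encode(src_tokens)                       # encoder runs once
    ys = torch.full((src_tokens.size(0), 1), sos_id, dtype=torch.long)
    for _ in range(max_len):
        logits = model.decode(ys, memory)                   # masked self-attention over ys
        next_tok = logits[:, -1].argmax(dim=-1, keepdim=True)
        ys = torch.cat([ys, next_tok], dim=1)               # feed the new token back in
        if (next_tok == eos_id).all():
            break
    return ys[:, 1:]                                        # drop <sos>
```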

  3. Non-autoregressive MT models. [Diagram: the same Transformer encoder (×M), but the decoder (×N) drops the masked self-attention; its inputs are copies of the source embeddings x1, x2, x3, and each block applies multi-head self-attention, multi-head positional attention, encoder-to-decoder attention, and FFN layers, producing y1...y4 through softmax in parallel.]
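
For contrast, a sketch of single-pass non-autoregressive decoding under the same hypothetical model interface: the decoder input is built by copying source embeddings to the target length, and all positions are predicted in one parallel pass.

```python
import torch

def non_autoregressive_decode(model, src_tokens, tgt_len):
    """One parallel decoder pass; no masked self-attention, no feedback loop."""
    memory = model.encode(src_tokens)
    src_emb = model.embed(src_tokens)                                # (B, T_x, d)
    # Uniformly map each of the tgt_len decoder positions to a source position
    idx = torch.linspace(0, src_tokens.size(1) - 1, tgt_len).round().long()
    dec_inputs = src_emb[:, idx]                                     # copied embeddings (B, T_y, d)
    logits = model.decode_parallel(dec_inputs, memory)               # (B, T_y, vocab)
    return logits.argmax(dim=-1)                                     # all tokens at once
```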

  4. Previous works on non-autoregressive MT. [Diagram: the same non-autoregressive architecture, annotated with two prior ideas: predicting fertilities to build the decoder input (Gu et al.) and refining the output over ×R iterative decoder passes (Lee et al.).]
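
As a rough illustration of the fertility idea attributed to Gu et al. (a sketch under simplifying assumptions, not their implementation): each source embedding is repeated according to its predicted fertility to form the decoder input, which also fixes the target length.

```python
import torch

def fertility_copy(src_emb, fertilities):
    """src_emb: list of (T_x, d) tensors; fertilities: list of (T_x,) long tensors.
    Repeats every source embedding fertility-many times and pads to a batch."""
    copied = [emb.repeat_interleave(f, dim=0)      # a token with fertility 2 is copied twice
              for emb, f in zip(src_emb, fertilities)]
    return torch.nn.utils.rnn.pad_sequence(copied, batch_first=True)
```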

  5. Quality-speedup trade-off. [Plot: decoding speedup (up to ~18×) vs. BLEU score on WMT14 En-De for Gu et al., Kaiser et al., Lee et al., and the autoregressive baseline.]

  6. Hidden states similarity. [Figure: hidden-state cosine-similarity matrices of a sampled sentence from IWSLT14 De-En, for the autoregressive model and the non-autoregressive model.]

  7. Attention distribution. [Figure: encoder-to-decoder attention distribution of an informative head on a sampled sentence from IWSLT14 De-En, for the autoregressive model and the non-autoregressive model.]

  8. Hint-based training from autoregressive teacher to non-autoregressive student. [Diagram: the autoregressive model (left) and the non-autoregressive model (right) side by side, with hints flowing from the teacher's decoder to the student's decoder.]

  9. Hint-based training from autoregressive teacher to non-autoregressive student: hints on hidden states and hints on word alignments.

  10. Hints on hidden states
  • Direct regression to the teacher's hidden states fails because of the discrepancy between the two models.
  • Instead, we penalize pairs of student hidden states that are highly similar when the teacher's are not:
    \mathcal{L}_{\mathrm{hid}} = \frac{1}{M\, T_y (T_y - 1)} \sum_{l=1}^{M} \sum_{t=1}^{T_y} \sum_{s=t+1}^{T_y} \phi\left(d^{\mathrm{st}}_{l,t,s},\, d^{\mathrm{tr}}_{l,t,s}\right)
  • d^{st}_{l,t,s} and d^{tr}_{l,t,s} are the cosine similarities between the t-th and s-th hidden states at layer l of the student and teacher models, respectively.
  • φ is a fixed function that penalizes a pair only when the student's hidden states are similar while the teacher's are not.
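
A minimal sketch of this penalty, assuming stacked decoder hidden states of shape (M layers, T_y positions, d) for both models; the similarity thresholds gamma_s and gamma_t are illustrative placeholders, and the exact form of φ and the normalization in the paper may differ.

```python
import torch
import torch.nn.functional as F

def hidden_hint_loss(student_h, teacher_h, gamma_s=0.9, gamma_t=0.3):
    """Count position pairs whose student hidden states are highly similar
    while the corresponding teacher hidden states are not."""
    M, T, _ = student_h.shape
    # pairwise cosine similarities per layer: (M, T, T)
    d_st = F.cosine_similarity(student_h.unsqueeze(2), student_h.unsqueeze(1), dim=-1)
    d_tr = F.cosine_similarity(teacher_h.unsqueeze(2), teacher_h.unsqueeze(1), dim=-1)
    # phi = 1 only when the student pair is "collapsed" but the teacher pair is not
    phi = ((d_st > gamma_s) & (d_tr < gamma_t)).float()
    upper = torch.triu(torch.ones(T, T), diagonal=1)   # count each pair (t, s > t) once
    return (phi * upper).sum() / (M * T * (T - 1))
```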

  11. Hints on word alignments
  • We minimize the KL-divergence between the per-head encoder-to-decoder attention distributions of the student and teacher models:
    \mathcal{L}_{\mathrm{align}} = \frac{1}{M\, H\, T_y} \sum_{l=1}^{M} \sum_{t=1}^{T_y} \sum_{h=1}^{H} D_{\mathrm{KL}}\left(a^{\mathrm{tr}}_{l,t,h} \,\|\, a^{\mathrm{st}}_{l,t,h}\right)
  • Total loss: \mathcal{L} = \mathcal{L}_{\mathrm{nll}} + \mu \mathcal{L}_{\mathrm{hid}} + \nu \mathcal{L}_{\mathrm{align}}
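
A corresponding sketch for the alignment hint and the combined objective, assuming for simplicity that the student's and teacher's encoder-to-decoder attention tensors share the shape (M layers, H heads, T_y, T_x); mu and nu are the hint-loss weights from the slide, and h_st, h_tr, a_st, a_tr are hypothetical tensors collected during the forward passes.

```python
import torch

def alignment_hint_loss(student_attn, teacher_attn, eps=1e-8):
    """Per-head KL divergence KL(teacher || student) between encoder-to-decoder
    attention distributions; the last dimension (source positions) sums to 1."""
    kl = (teacher_attn * ((teacher_attn + eps).log() - (student_attn + eps).log())).sum(-1)
    return kl.mean()   # average over layers, heads, and target positions

# total training objective, combining the word-level loss with both hint losses
# loss = nll_loss + mu * hidden_hint_loss(h_st, h_tr) + nu * alignment_hint_loss(a_st, a_tr)
```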

  12. Experimental settings
  • Datasets: WMT14 En-De, WMT14 De-En, IWSLT14 De-En
  • Models: Transformer-base, Transformer-small
  • Inference: non-autoregressive model; non-autoregressive model with teacher reranking

  13. Experimental results

  14. Quality-speedup trade-off. [Plot: decoding speedup (up to ~35×) vs. BLEU score on WMT14 En-De; our model and our model with reranking compared to Gu et al., Kaiser et al., Lee et al., and the autoregressive baseline.]

  15. Hidden states similarity. [Figure: hidden-state cosine-similarity of a sampled sentence from IWSLT14 De-En for the autoregressive model, the non-autoregressive model without hints, and the non-autoregressive model with hints.]

  16. Attention distribution. [Figure: encoder-to-decoder attention distribution of an informative head on a sampled sentence from IWSLT14 De-En for the autoregressive model, the non-autoregressive model without hints, and the non-autoregressive model with hints.]

  17. Ablation studies. [Table: ablation studies on IWSLT14 De-En; results are BLEU scores without teacher rescoring.]

  18. Summary. Instead of adding new modules that slow the model down, we proposed a method that leverages hints from the autoregressive model to help train the non-autoregressive model.

  19. Thanks! Q&A. Zhuohan Li (zhuohan@cs.berkeley.edu)
