Sequence-to-Sequence Learning as Beam-Search Optimization


  1. Sequence-to-Sequence Learning as Beam-Search Optimization Sam Wiseman and Alexander M. Rush

  2. Seq2Seq as a General-Purpose NLP/Text Generation Tool: Machine Translation (Luong et al. [2015]); Question Answering; Conversation; Parsing (Vinyals et al. [2015]); Sentence Compression (Filippova et al. [2015]); Summarization; Caption Generation; Video-to-Text; Grammar Correction

  3. Room for Improvement? Despite its tremendous success, standard Seq2Seq has some potential issues [Ranzato et al. 2016; Bengio et al. 2015]: (1) train/test mismatch; (2) Seq2Seq models next words rather than whole sequences. Goal of the talk: describe a simple variant of Seq2Seq, and a corresponding beam-search training scheme, to address these issues.

  4. Review: Sequence-to-Sequence (Seq2Seq) Models. The encoder RNN (red) encodes the source into a representation x; the decoder RNN (blue) generates the translation word-by-word.

  5. Review: Seq2Seq Generation Details. The decoder state is computed recurrently, e.g. h_3 = RNN(w_3, h_2). Probability of generating the t'th word:

     p(w_t | w_1, ..., w_{t−1}, x; θ) = softmax(W_out h_{t−1} + b_out)
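A minimal pure-Python sketch of this output layer (the 2-dimensional hidden state, 3-word vocabulary, and weight values below are made up for illustration; a real decoder uses learned parameters):

```python
import math

def softmax(logits):
    # numerically stable softmax over a list of logits
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_word_probs(h, W_out, b_out):
    # one logit per vocabulary word: logit_v = W_out[v] . h + b_out[v]
    logits = [sum(w * x for w, x in zip(row, h)) + b
              for row, b in zip(W_out, b_out)]
    return softmax(logits)

# toy sizes: 2-dim hidden state, 3-word vocabulary
h = [1.0, -0.5]
W_out = [[0.2, 0.1], [0.0, 0.3], [-0.1, 0.4]]
b_out = [0.0, 0.0, 0.1]
probs = next_word_probs(h, W_out, b_out)
```

The softmax is what makes the model locally normalized: each step's scores are forced to a probability distribution over the vocabulary.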

  6. Review: Train and Test.
     Train objective: given source-target pairs (x, y_{1:T}), minimize the NLL of each word independently, conditioned on the gold history y_{1:t−1}:

        NLL(θ) = − Σ_t ln p(w_t = y_t | y_{1:t−1}, x; θ)

     Test objective: structured prediction:

        ŷ_{1:T} = argmax_{w_{1:T}} Σ_t ln p(w_t | w_{1:t−1}, x; θ)

     It is typical to approximate the argmax with beam search.
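The word-level training objective can be sketched as follows; `step_probs` is a hypothetical stand-in for the model's per-step distributions, each conditioned on the gold prefix:

```python
import math

def sequence_nll(step_probs, gold):
    # NLL(theta) = -sum_t ln p(w_t = y_t | y_{1:t-1}, x; theta)
    # step_probs[t] maps each word to its model probability at step t,
    # conditioned on the gold history y_{1:t-1}
    return -sum(math.log(step_probs[t][y_t]) for t, y_t in enumerate(gold))

# hypothetical per-step distributions for a 2-word gold target
step_probs = [{"the": 0.6, "a": 0.4},
              {"dog": 0.7, "cat": 0.3}]
nll = sequence_nll(step_probs, ["the", "dog"])
```

Note the sum decomposes over words: nothing in this objective looks at the sequence as a whole, which is exactly issue #1 below.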

  7. Review: Beam Search at Test Time (K = 3).
     For t = 1 ... T, for all k and for all possible output words w:

        s(w_t = w, ŷ^{(k)}_{1:t−1}) ← ln p(ŷ^{(k)}_{1:t−1} | x) + ln p(w_t = w | ŷ^{(k)}_{1:t−1}, x)

     Update beam:

        ŷ^{(1:K)}_{1:t} ← K-argmax_{w_t, k} s(w_t, ŷ^{(k)}_{1:t−1})

     [Diagram: the beam starts with the candidates "a", "the", "red" at t = 1.]

  17. Review: Beam Search at Test Time (K = 3), final frame.
      [Diagram: after six steps the beam holds the hypotheses "a red dog smells home today", "the dog dog barks quickly Friday", and "red blue cat walks straight now".]
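The beam-search loop from the slides above can be sketched in runnable form. The first-order `table` below is an invented stand-in for the decoder's conditional distribution p(w_t = w | ŷ^{(k)}_{1:t−1}, x); a real decoder conditions on the whole prefix and on x, and also handles an end-of-sequence token:

```python
import math

def beam_search(next_logprobs, vocab, T, K):
    # each beam item is (cumulative log-prob, prefix tuple)
    beam = [(0.0, ())]
    for _ in range(T):
        candidates = []
        for logprob, prefix in beam:
            dist = next_logprobs(prefix)
            for w in vocab:
                # s(w_t = w, prefix) = score(prefix) + ln p(w | prefix)
                candidates.append((logprob + dist[w], prefix + (w,)))
        # K-argmax over all (word, hypothesis) expansions
        beam = sorted(candidates, key=lambda c: c[0], reverse=True)[:K]
    return beam

# toy first-order "model": the next-word distribution depends only on
# the previous word (all probabilities invented for illustration)
table = {
    None:  {"a": 0.5, "red": 0.3, "dog": 0.2},
    "a":   {"a": 0.1, "red": 0.6, "dog": 0.3},
    "red": {"a": 0.1, "red": 0.1, "dog": 0.8},
    "dog": {"a": 0.4, "red": 0.3, "dog": 0.3},
}

def next_logprobs(prefix):
    last = prefix[-1] if prefix else None
    return {w: math.log(p) for w, p in table[last].items()}

beam = beam_search(next_logprobs, ["a", "red", "dog"], T=2, K=3)
```

Sorting all K × |V| expansions and keeping the top K is exactly the K-argmax update from the slide.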

  18. Seq2Seq Issues Revisited.
      Issue #1: train/test mismatch (cf. Ranzato et al. [2016]):

         NLL(θ) = − Σ_t ln p(w_t = y_t | y_{1:t−1}, x; θ)

      (a) Training conditions on the true history ("exposure bias"): the model only ever sees gold prefixes at training time, but must condition on its own predictions at test time.
      (b) We train with word-level NLL, but evaluate with BLEU-like sequence-level metrics.

      Idea #1: train with beam search, using a loss that incorporates (sub)sequence-level costs.

  22. Idea #1: Train with Beam Search.
      Replace the NLL with a loss that penalizes search error:

         L(θ) = Σ_t Δ(ŷ^{(K)}_{1:t}) [ 1 − s(y_t, y_{1:t−1}) + s(ŷ^{(K)}_t, ŷ^{(K)}_{1:t−1}) ]

      y_{1:t} is the gold prefix; ŷ^{(K)}_{1:t} is the K'th prefix on the beam.
      s(ŷ^{(k)}_t, ŷ^{(k)}_{1:t−1}) is the score of the history (ŷ^{(k)}_t, ŷ^{(k)}_{1:t−1}).
      Δ(ŷ^{(K)}_{1:t}) lets us scale the loss by the badness of predicting ŷ^{(K)}_{1:t}.
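A per-time-step sketch of this loss. I hinge it at zero on the assumption that, as in standard multi-margin objectives, no loss accrues once the gold prefix outscores the K'th beam prefix by the margin of 1:

```python
def margin_loss_at_t(delta, gold_score, kth_beam_score):
    # one term of L(theta): require the gold prefix's score to beat the
    # K'th (lowest-ranked) beam prefix's score by a margin of 1,
    # scaled by Delta, the badness of the K'th beam prefix
    return delta * max(0.0, 1.0 - gold_score + kth_beam_score)

# margin violated: gold outscores the K'th prefix by only 0.5
violated = margin_loss_at_t(1.0, gold_score=2.0, kth_beam_score=1.5)
# margin satisfied by more than 1: no loss, no gradient
satisfied = margin_loss_at_t(1.0, gold_score=3.0, kth_beam_score=1.5)
```

Intuitively: if the gold prefix is safely on the beam, the term is zero; if it is about to fall off (or has fallen off), the term pushes the gold score up and the K'th beam score down.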

  26. Seq2Seq Issues Revisited.
      Issue #2: Seq2Seq models next-word probabilities:

         s(w_t = w, ŷ^{(k)}_{1:t−1}) ← ln p(ŷ^{(k)}_{1:t−1} | x) + ln p(w_t = w | ŷ^{(k)}_{1:t−1}, x)

      (a) The sequence score is a sum of locally normalized word scores, which gives rise to "label bias" [Lafferty et al. 2001].
      (b) What if we want to train with sequence-level constraints?

      Idea #2: don't locally normalize.
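To make (a) concrete, the sketch below contrasts the two scoring schemes on hypothetical raw scores. With local normalization, every history is forced to spread exactly probability mass 1 over its continuations, so a bad history cannot score all of its continuations low; summing raw scores removes that constraint:

```python
import math

def local_logprob(step_logits, seq):
    # locally normalized: log-softmax each step's scores, then sum;
    # each step contributes logits[w] - log Z_t
    total = 0.0
    for logits, w in zip(step_logits, seq):
        m = max(logits.values())
        log_z = m + math.log(sum(math.exp(v - m) for v in logits.values()))
        total += logits[w] - log_z
    return total

def global_score(step_logits, seq):
    # unnormalized alternative: just sum the raw scores, with a single
    # (implicit) sequence-level normalization if probabilities are needed
    return sum(logits[w] for logits, w in zip(step_logits, seq))

# hypothetical raw scores for a 2-step sequence
step_logits = [{"the": 2.0, "a": 1.0}, {"dog": 0.5, "cat": 0.5}]
lp = local_logprob(step_logits, ["the", "dog"])
gs = global_score(step_logits, ["the", "dog"])
```

The locally normalized score is always a log-probability (≤ 0), while the raw sequence score is unconstrained, which is what lets sequence-level training signals act on it directly.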
