

  1. Non-Autoregressive Decoding Xiachong Feng

  2. Outline • Transformer • The Importance of Generation Order in Language Modeling EMNLP18 • Insertion Transformer: Flexible Sequence Generation via Insertion Operations ICML19 • Non-Monotonic Sequential Text Generation ICML19 • Insertion-based Decoding with Automatically Inferred Generation Order • Levenshtein Transformer • Conclusion • Paper List • Reference

  3. Transformer

  4. Transformer

  5. Scaled Dot-Product Attention https://cips-upload.bj.bcebos.com/ssatt2019/CIPS_SSATT_2019_问答系统_唐都钰_段楠.pdf
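The slide shows the standard scaled dot-product attention, softmax(QKᵀ/√d_k)·V. Below is a minimal unmasked, single-head NumPy sketch; the function name and shapes are illustrative assumptions, not taken from the deck:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v) -> output: (n_q, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # query-key similarities, scaled
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # weighted sum of values
```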

  6. Example https://cips-upload.bj.bcebos.com/ssatt2019/CIPS_SSATT_2019_问答系统_唐都钰_段楠.pdf

  7. Multi-Head Attention https://cips-upload.bj.bcebos.com/ssatt2019/CIPS_SSATT_2019_问答系统_唐都钰_段楠.pdf

  8. Transformer https://cips-upload.bj.bcebos.com/ssatt2019/CIPS_SSATT_2019_问答系统_唐都钰_段楠.pdf

  9. The Importance of Generation Order in Language Modeling Nicolas Ford, Daniel Duckworth, Mohammad Norouzi, George E. Dahl Google Brain EMNLP18

  10. Overview • Linguistic intuition might suggest that we should first generate some abstract representation of what we want to say and then serialize it. • The best ordering we tried generates function words first and content words last, which cuts against the idea of committing to the general topic of a sentence first and only then deciding exactly how to phrase it.

  11. Two-pass Language Models • Produces partially-filled sentence “templates” and then fills in the missing tokens • Partitions the vocabulary into a set of first-pass and second-pass tokens • Template y^(1): the first-pass tokens plus a special placeholder token standing in for each withheld token • Second-pass tokens y^(2): the tokens filled in during the second pass

  12. Two-pass Language Models • Two copies of the Transformer model • Neural language model: the first copy generates the template, so it has no encoder (sentence → template) • Conditional translation model: the second copy is a sequence-to-sequence model that translates the template into the complete sentence (template → final); a sketch of the template construction follows
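As a rough illustration of the two-pass pipeline, here is a toy template construction; the actual first-pass/second-pass vocabulary split and the placeholder token name are the paper's choices and are assumed here:

```python
# Hypothetical template construction for a two-pass language model.
FIRST_PASS = {"the", "a", "of", "to", "and", "on", "is", "."}   # assumed partition
PLACEHOLDER = "<missing>"                                       # assumed token name

def make_template(sentence):
    """Split a sentence into a first-pass template and the second-pass tokens."""
    template, second_pass = [], []
    for tok in sentence.split():
        if tok in FIRST_PASS:
            template.append(tok)
        else:
            template.append(PLACEHOLDER)    # one placeholder per withheld token
            second_pass.append(tok)
    return template, second_pass

# The template language model is trained to produce `template`; the translation
# model learns to map template -> full sentence.
print(make_template("the cat sat on the mat ."))
# (['the', '<missing>', '<missing>', 'on', 'the', '<missing>', '.'], ['cat', 'sat', 'mat'])
```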

  13. Two-pass Language Models [Figure: an example template and its completed sentence]

  14. Results • It is easier to first decide something about the sentence's syntactic structure. • It is preferable to delay committing to a rare token for as long as possible, since all subsequent decisions will then be conditioning on a low-probability event.

  15. Insertion Transformer: Flexible Sequence Generation via Insertion Operations Mitchell Stern, William Chan, Jamie Kiros, Jakob Uszkoreit Google Brain, University of California, Berkeley ICML19

  16. Insertion Transformer • x : source sequence • y : target sequence • ŷ_t : hypothesis canvas at time t • C : content vocabulary (the token vocabulary) • l : insertion location, l ∈ [0, |ŷ_t|]

  17. Insertion Transformer Model • Full decoder self-attention: remove the causal self-attention mask • Slot representations via concatenated outputs: add special marker tokens at the beginning and end of the decoder input, extending the sequence length by two; take the resulting n + 2 vectors in the final layer and concatenate each adjacent pair to obtain n + 1 slot representations (see the sketch below)
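A minimal NumPy sketch of the slot-representation construction described in the second bullet; the shapes are assumed, and the real model does this on batched Transformer decoder states:

```python
import numpy as np

def slot_representations(decoder_out):
    """decoder_out: (n + 2, d) final-layer states, markers included.
    Returns (n + 1, 2 * d): each slot concatenates its two neighbouring states."""
    left = decoder_out[:-1]    # state closing the slot on the left
    right = decoder_out[1:]    # state opening the slot on the right
    return np.concatenate([left, right], axis=-1)

H = slot_representations(np.random.randn(6, 8))   # 4 tokens + 2 markers -> 5 slots of dim 16
```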

  18. Model • Joint content-location distribution: compute a matrix of per-slot vocabulary logits from the slot representations, flatten this matrix into a vector, and take a single softmax (sketched below) • Joint distribution via a conditional factorization: a location distribution obtained by scoring the slots with a learnable query vector, times a content distribution computed from the l-th row of H
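A sketch of the first (flatten-and-softmax) parameterization, assuming H holds the slot representations and W is a learned projection to vocabulary logits:

```python
import numpy as np

def joint_content_location(H, W):
    """H: (n_slots, d_slot), W: (d_slot, vocab). Returns p[l, c] = P(token c at slot l)."""
    logits = H @ W                       # per-slot vocabulary logits
    flat = logits.reshape(-1)            # flatten the (slot, token) matrix
    flat = np.exp(flat - flat.max())
    return (flat / flat.sum()).reshape(logits.shape)   # one softmax over all pairs
```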

  19. Contextualized Vocabulary Bias [Equation: the slot logits receive a vocabulary-sized bias computed from a context vector and shared across slots, in addition to a global bias]

  20. Training and Loss Functions • Left-to-Right • Example: (x, y) • Sample a length k ~ Uniform([0, |y|]) • Create a new data point ((x, ŷ = (y_1, …, y_k)), y_{k+1}) • Loss: classification loss (negative log-likelihood) • Note: only the last position is a valid insertion slot (a sampling sketch follows)
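A sketch of that sampling step, assuming plain Python lists of tokens; the loss itself is just the negative log-likelihood of the returned target:

```python
import random

def sample_l2r_example(x, y, eos="<eos>"):
    """x: source tokens, y: target tokens. Returns ((x, prefix), target_token)."""
    k = random.randint(0, len(y))          # k ~ Uniform({0, ..., |y|})
    prefix = list(y[:k])                   # current canvas y_1 .. y_k
    target = y[k] if k < len(y) else eos   # insert y_{k+1}, or finish the sequence
    return (x, prefix), target
```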

  21. Balanced Binary Tree • Parallelism

  22. Balanced Binary Tree • Example: (x, y) • Sample a length k ~ Uniform([0, |y|]) • Sample a random subsequence ŷ of y of length k: 1. shuffle the indices of y, 2. extract the first k, 3. reorder them to restore the original order (see the sketch below)
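A sketch of the subsequence sampling just described; shuffling indices rather than tokens keeps duplicate words unambiguous:

```python
import random

def sample_canvas(y):
    """Return a uniformly random subsequence of y to use as the hypothesis canvas."""
    k = random.randint(0, len(y))      # k ~ Uniform({0, ..., |y|})
    idx = list(range(len(y)))
    random.shuffle(idx)                # 1. shuffle
    kept = sorted(idx[:k])             # 2. take the first k, 3. restore original order
    return [y[i] for i in kept]
```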

  23. Soft binary tree loss • The weight w_l(i) grows as the distance of token i from the centre of slot l's span shrinks (distance ↓ ⇒ w_l(i) ↑) • (y_{i_l}, …, y_{j_l}): the span of target tokens yet to be produced at slot l • [Figure: slots l = 0, 1, …, 5 over the current hypothesis]
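A sketch of the soft weighting this slide implies: tokens near the centre of a slot's missing span get most of the weight, with a temperature controlling how soft the preference is. The exact normalization and temperature schedule are the paper's choices; this is only illustrative:

```python
import numpy as np

def soft_tree_weights(span_positions, tau=1.0):
    """span_positions: indices of the yet-to-be-produced tokens in one slot."""
    pos = np.asarray(span_positions, dtype=float)
    centre = (pos[0] + pos[-1]) / 2.0
    scores = -np.abs(pos - centre) / tau   # smaller distance -> larger score
    w = np.exp(scores - scores.max())
    return w / w.sum()                     # softmax over the span

soft_tree_weights([3, 4, 5, 6, 7], tau=0.5)   # mass concentrates on the centre position 5
```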

  24. Uniform

  25. Balanced binary tree and uniform losses

  26. Greedy Decoding • Choose the single (token, location) action with the highest probability • Sequence finalization: decode until an end-of-sequence token gets selected • Slot finalization: restrict the argmax to locations whose maximum-probability decision is not end-of-slot, and decode until the model predicts end-of-slot for every location (see the sketch below)
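A sketch of the sequence-finalization variant of greedy decoding; the `model.best_action` interface is an assumption standing in for an argmax over the joint content-location distribution, and slot finalization is omitted for brevity:

```python
def greedy_decode(model, x, max_steps=200, eos="<eos>"):
    """Insert one token per step until the model chooses end-of-sequence."""
    canvas = []
    for _ in range(max_steps):
        token, slot = model.best_action(x, canvas)   # argmax over (token, slot) pairs
        if token == eos:                             # sequence finalization
            break
        canvas.insert(slot, token)
    return canvas
```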

  27. Parallel Decoding • For each location l, pick the most likely token under the conditional factorization of the joint distribution and insert them all simultaneously • Slot finalization: a slot stops receiving insertions once it predicts end-of-slot (sketched below)
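A sketch of parallel decoding under the same assumed interface (`model.best_token_per_slot` returning one argmax token, or an end-of-slot marker, per slot); with the balanced-binary-tree loss this needs roughly log₂(n) passes for an n-token output:

```python
def parallel_decode(model, x, max_steps=64, eos_slot="<eos-slot>"):
    """Insert the best token into every unfinished slot simultaneously each step."""
    canvas = []
    for _ in range(max_steps):
        choices = model.best_token_per_slot(x, canvas)     # len(canvas) + 1 entries, one per slot
        inserted = 0
        for slot, token in sorted(enumerate(choices), reverse=True):
            if token != eos_slot:           # slot finalization: skip finished slots
                canvas.insert(slot, token)  # inserting right-to-left keeps earlier indices valid
                inserted += 1
        if inserted == 0:                   # every slot predicted end-of-slot
            break
    return canvas
```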

  28. Non-Monotonic Sequential Text Generation Sean Welleck, Kianté Brantley, Hal Daumé III, Kyunghyun Cho New York University, University of Maryland, College Park, Microsoft Research, Facebook AI Research, CIFAR Azrieli Global Scholar ICML19

  29. Overview • Recursively generating words to the left and then words to the right of each generated word, yielding a binary tree (illustrated below) • Learning is framed as imitation learning, including a coaching method which moves from imitating an oracle to reinforcing the policy's own preferences • [Figure: level-order generation vs. in-order reading order]
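A toy illustration of the tree view, assuming the hypothetical sentence "the cat sat down": the root word is generated first, each node then generates its left subtree and its right subtree, and reading the tree in-order recovers the text:

```python
class Node:
    def __init__(self, word, left=None, right=None):
        self.word, self.left, self.right = word, left, right

def in_order(node):
    """In-order traversal of the generated binary tree gives the output sentence."""
    if node is None or node.word == "<end>":
        return []
    return in_order(node.left) + [node.word] + in_order(node.right)

# "sat" is generated first (non-monotonically), then its left and right subtrees.
tree = Node("sat", Node("cat", Node("the")), Node("down"))
assert in_order(tree) == ["the", "cat", "sat", "down"]
```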

  30. Imita=on Learning • Imitation Learning with Recurrent Neural Networks • Learning to Search Better than Your Teacher ICML15 • https://zhuanlan.zhihu.com/p/25688750 • https://blog.csdn.net/WASEFADG/article/details/83651126 • https://www.quora.com/What-is-imitation-learning

  31. Notation • Vocabulary Ṽ = V ∪ {⟨end⟩} • State space Ṽ*: a state s ∈ S corresponds to a sequence of tokens from Ṽ • Initial state: the empty sequence ⟨⟩ • Termination: the ⟨end⟩ token • Action a: select an element of the vocabulary and append it to the state • A mapping from in-order positions to level-order (generation) positions • Policy π(a|s)

  32. Challenge • The sequences Y alone only tell us what the final output sequences of words should be, but not what tree(s) should be used to get there.

  33. Imitation Learning • In the first step, the oracle policy's action is to produce any word w that appears anywhere in Y. • All words to the left of w in Y are then generated recursively on the left (following the same procedure), and all words to the right of w in Y are generated recursively on the right (see the sketch below). • Because the oracle is non-deterministic (many “correct” actions are available at any given time), we inform this oracle policy with the current learned policy, encouraging it to favor actions that are preferred by the current policy.
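A sketch of the valid-action structure the oracle exploits: a node responsible for the target span Y[lo:hi] may emit any word in that span, which splits the remaining work into a left and a right child. The uniform random choice here merely stands in for whatever distribution the oracle puts over these correct actions:

```python
import random

def uniform_oracle_action(Y, lo, hi):
    """Return (word, left_span, right_span) for the span Y[lo:hi]; <end> if the span is empty."""
    if lo >= hi:
        return "<end>", None, None
    m = random.randrange(lo, hi)            # every m in [lo, hi) is a "correct" action
    return Y[m], (lo, m), (m + 1, hi)       # children handle the words left/right of Y[m]
```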

  34. Background: Learning to Search Learning to Search Better than Your Teacher ICML15

  35. Loss (three nested expectations) • 𝔼₁: over training instances • 𝔼₂: over states s reached by running the roll-in policy for t steps, i.e. the state distribution it induces • 𝔼₃: the cost-to-go under the roll-out policy for every possible action a at that state

  36. Cost Measurement • When dealing with recurrent neural network policies, a cost function more analogous to a cross-entropy loss can be preferable. • Use a KL-divergence-type loss, measuring the difference between the action distribution produced by π and the action distribution preferred by the roll-out policy. • First sample one training sequence, run the roll-in policy for t steps, and compute the KL divergence at that state using π* (the reference or oracle) as the roll-out policy; learning corresponds to minimizing this KL divergence iteratively with respect to the parameters of π (a minimal sketch follows).
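A minimal sketch of the KL term, assuming both action distributions are available as plain probability vectors over the vocabulary at one roll-in state; the training loss sums this quantity over sampled roll-in states:

```python
import numpy as np

def kl_cost(oracle_probs, policy_probs, eps=1e-12):
    """D_KL(oracle || policy) at a single state, with a small epsilon for stability."""
    p, q = oracle_probs + eps, policy_probs + eps
    return float(np.sum(p * (np.log(p) - np.log(q))))
```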

  37. Roll-In Policies • In most formal analyses, the roll-in policy is a stochastic mixture of the learned policy π and the oracle policy π* • Experimentally, it has often been found that simply using the oracle's state distribution is optimal Learning to Search Better than Your Teacher ICML15

  38. Oracle Policies • Uniform oracle: assigns equal probability 1/n to each of the n valid actions • Coaching oracle: prefers valid actions that are also preferred by the current parameterized policy • Annealed coaching oracle: a mixture of the two, with the mixing weight β annealed from 1 to 0 (see the sketch below)
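A sketch of how the annealed coaching oracle can be assembled from the two ingredients above. The exact form of the coaching distribution is the paper's design choice; here it is assumed to be the learned policy restricted to valid actions and renormalized:

```python
import numpy as np

def annealed_oracle(valid_mask, policy_probs, beta, eps=1e-12):
    """valid_mask: 0/1 vector over the vocab; policy_probs: current pi(a|s); beta: 1 -> 0."""
    uniform = valid_mask / valid_mask.sum()            # uniform oracle over valid actions
    coaching = valid_mask * policy_probs
    coaching = coaching / (coaching.sum() + eps)       # policy's own (renormalized) preferences
    return beta * uniform + (1.0 - beta) * coaching    # anneal from pure oracle to pure coaching
```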

  39. Word Reordering Examples

  40. Insertion-based Decoding with Automatically Inferred Generation Order Jiatao Gu, Qi Liu, Kyunghyun Cho Facebook AI Research New York University

  41. Motivation • L2R is not necessarily the optimal option for generating sequences. • For instance, people sometimes tend to think of central phrases first before building up a whole sentence.

  42. Orders as Latent Variables • P_T is the set of all permutations of (1, …, T) • An order π = (z_2, z_3, …, z_T, z_{T+1}) ∈ P_T • y_π = {(y_2, z_2), …, (y_{T+1}, z_{T+1})}, where (y_t, z_t) is the t-th generated token and its absolute position • Two special tokens: (y_0, z_0) = (⟨s⟩, 0) and (y_1, z_1) = (⟨/s⟩, T + 1) • Termination: decoding ends once y_{T+2} = ⟨eod⟩ is generated
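A hypothetical worked example of this (token, absolute-position) view for the target y = (I, have, a, dream), so T = 4; one possible order generates "have" first:

```python
# (token, absolute position) pairs listed in generation order, special tokens first.
y_pi = [("<s>", 0), ("</s>", 5), ("have", 2), ("I", 1), ("a", 3), ("dream", 4)]

# Sorting by absolute position recovers the final sentence regardless of the generation order.
final = [w for w, z in sorted(y_pi, key=lambda p: p[1])]
assert final == ["<s>", "I", "have", "a", "dream", "</s>"]
```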
