Non-Autoregressive Decoding
Xiachong Feng
Outline
- Transformer
- The Importance of Generation Order in Language Modeling (EMNLP 2018)
- Insertion Transformer: Flexible Sequence Generation via Insertion Operations (ICML 2019)
- Non-Monotonic Sequential Text Generation (ICML 2019)
- Insertion-based Decoding with Inferred Generation Order (InDIGO)
- Levenshtein Transformer
https://cips-upload.bj.bcebos.com/ssatt2019/CIPS_SSATT_2019_问答系统_唐都钰_段楠.pdf
The Importance of Generation Order in Language Modeling
Nicolas Ford, Daniel Duckworth, Mohammad Norouzi, George E. Dahl (Google Brain), EMNLP 2018
Intuition: humans may form an abstract representation of what we want to say and then serialize it. Yet the best-performing orders in this study generate common words first and rare words last, which cuts against the idea of committing to the general topic of a sentence first and only then deciding exactly how to phrase it.
Two-pass language model: the sentence y is split into two passes of tokens, y = (z^(1), z^(2)), which together generate the sentence.
Template: the first-pass tokens z^(1) plus a special placeholder token at every position still to be filled; the second pass z^(2) generates the remaining tokens.
First pass: a language model that produces the template, so it has no encoder.
Second pass: a sequence-to-sequence model that translates the template into the complete sentence.
(Training data: sentence → template; generation: template → final sentence.)
A risk of this factorization: if the first pass makes a mistake, subsequent decisions will then be conditioning on a low-probability event.
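The two-pass scheme above can be sketched as follows. This is a toy illustration, not the paper's code: `first_pass` builds the template by masking non-common words, and `fill_model` is a hypothetical stand-in for the second-pass seq2seq model.

```python
# Sketch of the two-pass template setup (hypothetical models, not the paper's code).
PLACEHOLDER = "_"

def first_pass(common_vocab, sentence):
    # Build the template: keep common words, mask the rest with a placeholder.
    return [w if w in common_vocab else PLACEHOLDER for w in sentence]

def second_pass(template, fill_model):
    # Replace each placeholder using a (hypothetical) fill model that
    # conditions on the whole template and the position to fill.
    return [fill_model(template, i) if w == PLACEHOLDER else w
            for i, w in enumerate(template)]

common = {"the", "a", "on", "sat"}
sent = ["the", "cat", "sat", "on", "the", "mat"]
tpl = first_pass(common, sent)
print(tpl)  # ['the', '_', 'sat', 'on', 'the', '_']
filled = second_pass(tpl, lambda t, i: "cat" if i == 1 else "mat")
print(filled)
```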
Insertion Transformer: Flexible Sequence Generation via Insertion Operations
Mitchell Stern, William Chan, Jamie Kiros, Jakob Uszkoreit (Google Brain, University of California, Berkeley), ICML 2019
ŷ_t: hypothesis canvas at time t
p(c, l | x, ŷ_t): joint distribution over content c (a token) and location l (an insertion slot), given the source x and the current hypothesis
Slot representations: pad the decoder input to extend the sequence length by two (boundary markers at both ends), then concatenate each adjacent pair of hidden states to obtain n + 1 slot representations, stacked into a matrix H of slot representations.
Content-location distribution variants:
- Joint: flatten the (n + 1) × |V| matrix of slot-token logits into a vector and take one softmax over all (slot, token) pairs.
- Factorized: score each slot against a learnable query vector to pick the location, then predict the token from the m-th row of H.
- Bias terms: a context vector and a shared bias (a global bias over the vocabulary).
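The slot construction can be sketched with plain array operations. A minimal sketch, assuming shapes only (boundary states and dimensions here are illustrative, not from the paper's code):

```python
import numpy as np

def slot_representations(h, h_bos, h_eos):
    """h: (n, d) decoder hidden states; h_bos/h_eos: (d,) boundary states.
    Pads to length n + 2, then concatenates each adjacent pair of states,
    yielding n + 1 slot vectors of size 2d."""
    padded = np.vstack([h_bos[None, :], h, h_eos[None, :]])   # (n + 2, d)
    return np.concatenate([padded[:-1], padded[1:]], axis=1)  # (n + 1, 2d)

n, d = 3, 4
H = slot_representations(np.random.randn(n, d), np.zeros(d), np.ones(d))
print(H.shape)  # (4, 8): n + 1 slots, each of size 2d
```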
With a current hypothesis ŷ = (y_1, …, y_n), inserting one token yields (y_1, …, y_{n+1}).
(Figure: the n + 1 insertion slots, indexed m = 0, …, n, between adjacent tokens.)
Training: for each slot, the targets are the span of tokens from the target sequence that belongs in that slot.
Balanced binary-tree loss: within a span, the smaller a token's distance to the span's center (distance ↓), the larger its weight w(i) ↑, encouraging balanced, parallelizable insertion orders.
Termination: keep inserting at a slot while the decision is not end-of-slot; slot-level and sequence-level finalization differ in how the end decision enters the joint distribution factorization.
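The center-weighted span loss can be sketched as a softmax over negative distances to the span center. A toy illustration of the weighting idea only; the temperature name `tau` is an assumption, not the paper's notation:

```python
import math

def span_weights(span_len, tau=1.0):
    """Weight each position in a slot's target span: tokens nearer the
    center get exponentially more weight (balanced binary-tree idea)."""
    center = (span_len - 1) / 2.0
    scores = [math.exp(-abs(i - center) / tau) for i in range(span_len)]
    z = sum(scores)
    return [s / z for s in scores]

w = span_weights(5)
print(w)  # symmetric, peaked at the middle token
```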
Non-Monotonic Sequential Text Generation
Sean Welleck, Kianté Brantley, Hal Daumé III, Kyunghyun Cho (New York University; University of Maryland, College Park; Microsoft Research; Facebook AI Research; CIFAR Azrieli Global Scholar), ICML 2019
Generation starts from a root word and recursively generates words to its left and right, yielding a binary tree. Training uses an annealed oracle, which moves from imitating an oracle to reinforcing the policy's own preferences. The tree is generated in level-order; an in-order traversal recovers the sentence.
Vocabulary: Ṽ = V ∪ {⟨end⟩}.
Given the ground-truth sequence Y*, an oracle knows what the words should be, but not what tree(s) should be used to get there.
Oracle: pick any valid word as the root of the current span; all words to its left are generated recursively on the left (by the same procedure), and all words to its right are generated recursively on the right.
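The recursion above, and the fact that an in-order traversal recovers the sentence, can be sketched directly. The middle-word root choice here is one heuristic for illustration; the oracle may pick any valid word:

```python
def build_tree(words):
    """Recursively pick a root and build left/right subtrees from the
    remaining span, yielding a binary tree over the sentence."""
    if not words:
        return None
    mid = len(words) // 2  # illustration: an oracle may pick any valid word
    return (build_tree(words[:mid]), words[mid], build_tree(words[mid + 1:]))

def in_order(node, out):
    """In-order traversal recovers the original left-to-right sentence."""
    if node is None:
        return out
    left, word, right = node
    in_order(left, out)
    out.append(word)
    in_order(right, out)
    return out

sent = "a b c d e".split()
recovered = in_order(build_tree(sent), [])
print(recovered)  # ['a', 'b', 'c', 'd', 'e']
```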
inform this oracle policy with the current learned policy, encouraging it to favor actions that are preferred by the current policy.
Learning to Search Better than Your Teacher ICML15
Learning to search: roll in with a policy to reach a state, then supervise the action taken at that state.
Because the oracle provides a full action distribution rather than a single action, a loss function more analogous to a cross-entropy loss can be preferred: the KL divergence between the action distribution produced by π and the action distribution preferred by π_out. States are sampled by rolling in: executing π from the initial state for a few steps, and computing the KL divergence at that state using π* (the reference/oracle policy) as π_out. Learning corresponds to minimizing this KL divergence iteratively with respect to the parameters of π. Rolling in with the learned policy π while supervising with the oracle policy π* means the state distribution matches test time while the supervision at each state stays optimal.
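The per-state objective can be sketched numerically. The distributions below are toy values standing in for the oracle's and the learned policy's action distributions at one rolled-in state:

```python
import math

def kl(p, q, eps=1e-12):
    """KL(p || q) between two discrete action distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

pi_star = [0.7, 0.2, 0.1]   # oracle's preferred action distribution (toy)
pi      = [0.5, 0.3, 0.2]   # learned policy's distribution at the same state
loss = kl(pi_star, pi)      # the quantity minimized w.r.t. pi's parameters
print(loss)
```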
Insertion-based Decoding with Automatically Inferred Generation Order (InDIGO)
Jiatao Gu, Qi Liu, Kyunghyun Cho (Facebook AI Research, New York University), TACL 2019
Words can be generated in an arbitrary order, each inserted into the partial result, before building up a whole sentence.
(w_t, z_t): the t-th generated token and its absolute position
Decoding ends when the model generates the special token ⟨eod⟩ (end of decoding).
r_{j,t}: the relative-position representation of token j at decoding step t; r_{j,t} is a vector with entries in {-1, 0, 1} (left of, same as, right of). Stacking R_t = [r_{0,t}, r_{1,t}, …, r_{t,t}] shows the relative-position representations of all the words in the sequence. Inserting a new token never changes existing entries; it only appends a new row and column (step t → t + 1).
A sequence of length T admits up to T! generation orders (permutations of tokens), so the marginal likelihood sums over all of them.
Marginalizing over all orders is intractable; training instead maximizes an evidence lower bound, by introducing an approximate posterior distribution of generation orders for each sequence.
Levenshtein Transformer
Jiatao Gu, Changhan Wang, and Jake Zhao (Junbo) (Facebook AI Research, New York University, Tigerobo Inc.), NeurIPS 2019
Levenshtein Distance("Saturday", "Sundays") = 4
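The standard dynamic-programming computation makes the example on this slide concrete:

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

print(levenshtein("Saturday", "Sundays"))  # 4
```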
Insertion and deletion are complementary: together they let the model refine previously generated text and handle length changes during decoding.
Sequence generation as a Markov decision process (agent / environment):
- State y ∈ Y: a sequence, possibly built from scratch (an uncompleted sequence), with initial sequence y_0 ∈ Y
- Edit action a ∈ A: {insert, delete}
- Policy π: Y → P(A)
- Reward function: R(y) = −D(y, y*), the negative distance to the reference y*
Three classification heads per refinement step:
- Deletion classifier: for every token position except the boundaries ⟨s⟩ and ⟨/s⟩, predict "deleted" (0) or "kept" (1).
- Placeholder classifier: at every consecutive position pair, predict how many placeholder tokens to insert.
- Token classifier: replace the placeholders with actual tokens in the vocabulary.
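One refinement iteration over these three heads can be sketched as follows. The three head functions here are hypothetical stand-ins for the learned classifiers, not the paper's code; the convention 0 = delete, 1 = keep follows the slide:

```python
PLH = "<plh>"

def refine_once(seq, delete_head, placeholder_head, token_head):
    # 1) Deletion: keep boundaries, drop tokens the deletion head marks 0.
    kept = [seq[0]] + [w for i, w in enumerate(seq[1:-1], 1)
                       if delete_head(seq, i) == 1] + [seq[-1]]
    # 2) Placeholder insertion: between every consecutive pair, insert the
    #    predicted number of placeholder tokens.
    with_plh = []
    for i, w in enumerate(kept):
        with_plh.append(w)
        if i < len(kept) - 1:
            with_plh += [PLH] * placeholder_head(kept, i)
    # 3) Token prediction: replace each placeholder with a vocabulary token.
    return [token_head(with_plh, i) if w == PLH else w
            for i, w in enumerate(with_plh)]

seq = ["<s>", "the", "cat", "</s>"]
out = refine_once(seq,
                  delete_head=lambda s, i: 1,  # toy head: keep everything
                  placeholder_head=lambda s, i: 1 if s[i] == "cat" else 0,
                  token_head=lambda s, i: "sat")
print(out)  # ['<s>', 'the', 'cat', 'sat', '</s>']
```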
Dual-policy (imitation) learning: roll in with one policy, supervise with the expert policy's suggested actions.
- Roll-in: a stochastic mixture, chosen by drawing u ~ uniform[0, 1] against a mixture factor. For learning insertion: either random word dropping applied to the ground truth, or the model's own deletion output; either way, any sequence ready to insert tokens. For learning deletion: either the initial input, or the output of the insertion model (decoded by doing argmax).
- Expert policy: the optimal edit actions, derived from the Levenshtein distance to the target.
- Knowledge distillation: first train an autoregressive teacher model on the same datasets and then replace the ground-truth sequence z* by the beam-search result of this teacher model.
Comparison: Insertion Transformer · Non-Monotonic · InDIGO · Levenshtein Transformer
https://zhuanlan.zhihu.com/p/73417154