SLIDE 1

Non-Autoregressive Decoding

Xiachong Feng

SLIDE 2

Outline

  • Transformer
  • The Importance of Generation Order in Language Modeling (EMNLP18)
  • Insertion Transformer: Flexible Sequence Generation via Insertion Operations (ICML19)
  • Non-Monotonic Sequential Text Generation (ICML19)
  • Insertion-based Decoding with Automatically Inferred Generation Order
  • Levenshtein Transformer
  • Conclusion
  • Paper List
  • Reference
SLIDE 3

Transformer

SLIDE 4

Transformer

SLIDE 5

Scaled Dot-Product Attention

https://cips-upload.bj.bcebos.com/ssatt2019/CIPS_SSATT_2019_问答系统_唐都钰_段楠.pdf
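As a minimal NumPy sketch of the attention formula, Attention(Q, K, V) = softmax(QKᵀ/√d_k)·V (shapes and variable names here are illustrative, not taken from the slides):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (n_q, n_k) similarities
    scores -= scores.max(axis=-1, keepdims=True)  # stabilize the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

# toy shapes: 3 queries attending over 4 key/value pairs of dimension 8
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out, w = scaled_dot_product_attention(Q, K, V)
```

Each output row is a convex combination of the value rows; the √d_k scaling keeps the pre-softmax scores from growing with the key dimension.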

SLIDE 6

Example

https://cips-upload.bj.bcebos.com/ssatt2019/CIPS_SSATT_2019_问答系统_唐都钰_段楠.pdf

SLIDE 7

Multi-Head Attention

https://cips-upload.bj.bcebos.com/ssatt2019/CIPS_SSATT_2019_问答系统_唐都钰_段楠.pdf

SLIDE 8

Transformer

https://cips-upload.bj.bcebos.com/ssatt2019/CIPS_SSATT_2019_问答系统_唐都钰_段楠.pdf

SLIDE 9

The Importance of Generation Order in Language Modeling

Nicolas Ford, Daniel Duckworth, Mohammad Norouzi, George E. Dahl Google Brain EMNLP18

SLIDE 10

Overview

  • Linguistic intuition might suggest that we should first generate some abstract representation of what we want to say and then serialize it.
  • The best ordering we tried generates function words first and content words last, which cuts against the idea of committing to the general topic of a sentence first and only then deciding exactly how to phrase it.

SLIDE 11

Two-pass Language Models

  • Produces partially-filled sentence "templates" and then fills in the missing tokens.
  • Partitions the vocabulary into a set of first-pass and second-pass tokens to generate sentences.

The sentence z is split into a template z^(1) (first-pass tokens plus a special placeholder token) and the second-pass tokens z^(2).
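A toy sketch of the templating idea, assuming for illustration that the first-pass tokens are simply the most frequent ones (the corpus, the size-3 vocabulary split, and the `__` placeholder are all invented here, not from the paper):

```python
from collections import Counter

def make_template(tokens, first_pass_vocab, placeholder="__"):
    # keep first-pass tokens, replace everything else with the placeholder
    return [t if t in first_pass_vocab else placeholder for t in tokens]

# hypothetical split: the most frequent tokens act as first-pass tokens
corpus = "the cat sat on the mat and the dog sat on the rug".split()
first_pass = {tok for tok, _ in Counter(corpus).most_common(3)}

template = make_template("the dog sat on the mat".split(), first_pass)
# frequent (function-like) words survive; rare (content) words become "__"
```

The second-pass model would then condition on this template and predict the tokens behind each placeholder.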

SLIDE 12

Two-pass Language Models

  • Two copies of the Transformer model.
  • Neural language model p1: the first copy just generates the template (sentence → template), so it has no encoder.
  • Conditional translation model p2: the second copy is a sequence-to-sequence model that translates the template into the complete sentence (template → final).

SLIDE 13

Two-pass Language Models

(figure: an example sentence and its template)

SLIDE 14

Results

  • It is easier to first decide something about the sentence's syntactic structure.
  • It is preferable to delay committing to a rare token for as long as possible, as all subsequent decisions will then be conditioning on a low-probability event.

SLIDE 15

Insertion Transformer: Flexible Sequence Generation via Insertion Operations

Mitchell Stern, William Chan, Jamie Kiros, Jakob Uszkoreit Google Brain, University of California, Berkeley ICML19

SLIDE 16

Insertion Transformer

  • x : source canvas (sequence)
  • y : target canvas (sequence)
  • ŷ_t : hypothesis canvas at time t
  • C : content vocabulary (token vocabulary for sequences)
  • l : insertion locations ∈ [0, |ŷ_t|]

SLIDE 17

Insertion Transformer Model

  • Full Decoder Self-Attention: remove the causal self-attention mask.
  • Slot Representations via Concatenated Outputs:
  • Add special marker tokens at the beginning and end of the decoder input to extend the sequence length by two.
  • Take the resulting n + 2 vectors in the final layer and concatenate each adjacent pair to obtain n + 1 slot representations.
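The slot construction can be sketched directly (NumPy; the hidden size and hypothesis length are illustrative):

```python
import numpy as np

def slot_representations(h):
    """Concatenate each adjacent pair of the n + 2 final-layer vectors
    (marker positions included) into n + 1 slot vectors of size 2 * d."""
    return np.concatenate([h[:-1], h[1:]], axis=-1)

d, n = 16, 5                      # hidden size, current hypothesis length
h = np.random.randn(n + 2, d)     # n tokens plus the two marker positions
slots = slot_representations(h)   # one representation per insertion slot
```

Slot i thus sees both its left neighbor (vector i) and its right neighbor (vector i + 1), which is what makes an insertion decision between two tokens well-informed.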

SLIDE 18

Model

  • Joint content-location distribution: scores for every (token, slot) pair form a matrix computed from the matrix of slot representations H; flatten this matrix into a vector and apply a softmax over all pairs.
  • Joint distribution using a conditional factorization: first a location distribution, scored with a learnable query vector against the rows of H (the l-th row of H represents the l-th slot), then a content distribution conditioned on the chosen slot.

SLIDE 19

Contextualized Vocabulary Bias

  • A context vector produces a vocabulary bias that is shared across all slots, added on top of a global (input-independent) bias.

SLIDE 20

Training and Loss Functions

  • Left-to-Right
  • Example: (x, y)
  • Sample a length l ~ Uniform(0, |y|)
  • Create a new data point ((x, ỹ = (y1, …, yl)), y_{l+1})
  • Loss: classification loss (negative log-likelihood)
  • Note: only the last position is considered as the insertion point
SLIDE 21

Balanced Binary Tree

  • Parallelism
SLIDE 22

Balanced Binary Tree

  • Example: (x, y)
  • Sample a length l ~ Uniform(0, |y|)
  • Sample a random subsequence ỹ of y of length l:
  • 1. Shuffle the indices of y
  • 2. Extract the first l indices
  • 3. Reorder them into their original relative order
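The three-step subsequence sampling can be sketched as (variable names invented):

```python
import random

def sample_subsequence(y, rng=random):
    """Sample l ~ Uniform(0, |y|), then a length-l subsequence of y that
    keeps the original token order (shuffle indices, truncate, re-sort)."""
    l = rng.randint(0, len(y))        # 1. sample a length (bounds inclusive)
    idx = list(range(len(y)))
    rng.shuffle(idx)                  # 2. shuffle the positions
    kept = sorted(idx[:l])            # 3. keep the first l, restore order
    return [y[i] for i in kept]

y = ["three", "friends", "ate", "lunch", "together"]
sub = sample_subsequence(y)
```

Sorting the surviving indices is what makes the result a proper subsequence rather than a random permutation.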
SLIDE 23

Soft binary tree loss

(figure: slots l = 0 … 5 over a partial hypothesis; each slot is associated with the span of target tokens yet to be produced there)

  • Each slot is trained against the span of tokens from the target output yet to be produced at that slot.
  • Tokens nearer the middle of the span receive larger weight: distance ↓, weight w(i) ↑.

SLIDE 24

Uniform

SLIDE 25

Balanced binary tree and uniform losses

SLIDE 26

Greedy Decoding

  • Choose the action with the highest probability.
  • Sequence finalization: continue until an end-of-sequence token gets selected.
  • Slot finalization: restrict the argmax to locations whose maximum-probability decision is not end-of-slot, until the model predicts an end-of-slot token for every location.
SLIDE 27

Parallel Decoding

  • For each location l, insert the highest-probability token under the conditional factorization of the joint distribution.
  • Slot finalization as in greedy decoding.
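A rough simulation of parallel decoding, with a toy model standing in for the trained Insertion Transformer (`predict_slots`, `toy_model`, and the `<eos_slot>` literal are hypothetical stand-ins; the middle-of-gap proposal mimics the balanced-binary-tree behavior):

```python
EOS_SLOT = "<eos_slot>"

def parallel_decode_step(canvas, predict_slots):
    """One parallel step: take the argmax token of every slot and insert
    them all at once (right to left, so earlier indices stay valid).
    `predict_slots` is a hypothetical stand-in for the trained model."""
    picks = predict_slots(canvas)            # one (token, prob) per slot
    new_canvas, finished = list(canvas), True
    for slot in range(len(canvas), -1, -1):
        token, _ = picks[slot]
        if token != EOS_SLOT:
            new_canvas.insert(slot, token)
            finished = False
    return new_canvas, finished

# toy "model" that proposes the middle missing target token for each gap
target = ["a", "b", "c", "d", "e"]
def toy_model(canvas):
    picks = []
    bounds = [-1] + [target.index(t) for t in canvas] + [len(target)]
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        gap = list(range(lo + 1, hi))
        picks.append((target[gap[len(gap) // 2]], 1.0) if gap else (EOS_SLOT, 1.0))
    return picks

canvas, done = [], False
while not done:
    canvas, done = parallel_decode_step(canvas, toy_model)
```

With one insertion per slot per step, a length-n sequence finishes in roughly log₂(n) rounds instead of n.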

SLIDE 28

Non-Monotonic Sequential Text Generation

Sean Welleck, Kianté Brantley, Hal Daumé III, Kyunghyun Cho. New York University, University of Maryland College Park, Microsoft Research, Facebook AI Research, CIFAR Azrieli Global Scholar. ICML19

SLIDE 29

Overview

  • Recursively generate a word, then the words to its left, and then the words to its right, yielding a binary tree.
  • Learning is framed as imitation learning, including a coaching method which moves from imitating an oracle to reinforcing the policy's own preferences.

(figure: the generated tree, read in level-order vs. in-order)
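The left/right recursion and the in-order read-out can be illustrated with a toy pivot rule (in the paper the pivot word is chosen by the learned policy; the middle-word rule below is only for illustration):

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Node:
    word: str
    left: "Optional[Node]" = None
    right: "Optional[Node]" = None

def build_tree(words: List[str]) -> Optional[Node]:
    # pick a pivot word first, then recursively generate everything to
    # its left and everything to its right (toy pivot: the middle word)
    if not words:
        return None
    mid = len(words) // 2
    return Node(words[mid], build_tree(words[:mid]), build_tree(words[mid + 1:]))

def in_order(node: Optional[Node]) -> List[str]:
    # an in-order traversal of the generated tree yields the final sentence
    if node is None:
        return []
    return in_order(node.left) + [node.word] + in_order(node.right)

sentence = "are you having fun".split()
tree = build_tree(sentence)
```

Any binary tree whose in-order traversal equals the sentence is a valid generation order, which is exactly why supervision over trees is not given for free.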

SLIDE 30

Imitation Learning

  • Imitation Learning with Recurrent Neural Networks
  • Learning to Search Better than Your Teacher ICML15
  • https://zhuanlan.zhihu.com/p/25688750
  • https://blog.csdn.net/WASEFADG/article/details/83651126
  • https://www.quora.com/What-is-imitation-learning
SLIDE 31

Notation

  • Vocabulary V; V̄ = V ∪ {<end>}
  • State space: V̄*
  • A state s ∈ S corresponds to a sequence of tokens from V̄
  • Init state: the empty sequence <>
  • End state: <end>
  • Action a: select an element from the vocabulary and append it to the state
  • τ(t): maps the in-order sequence to level order
  • Policy π(a|s)
SLIDE 32

Challenge

  • The sequences Y alone only tell us what the final output sequences of words should be, but not what tree(s) should be used to get there.

SLIDE 33

Imitation Learning

  • In the first step, the oracle policy's action is to produce any word w that appears anywhere in Y.
  • All words to the left of w in Y are generated recursively on the left (following the same procedure), and all words to the right of w in Y are generated recursively on the right.
  • Since the oracle is non-deterministic (many "correct" actions are available at any given time), we inform this oracle policy with the current learned policy, encouraging it to favor actions that are preferred by the current policy.

SLIDE 34

Background: Learning to Search

Learning to Search Better than Your Teacher ICML15

SLIDE 35

Loss

The loss is a nesting of three expectations:

  • (3) 𝔼: draw states s according to the state distribution induced by the roll-in policy π^in, and compute the cost-to-go under π^out for all possible actions a at that state
  • (2) 𝔼: over running the roll-in policy for t-many steps
  • (1) 𝔼: over training instances

SLIDE 36

Cost Measurement

  • When dealing with recurrent neural network policies, a cost function more analogous to a cross-entropy loss can be preferred.
  • Use a KL-divergence type loss, measuring the difference between the action distribution produced by π and the action distribution preferred by π^out.
  • First sample one training sequence, run the roll-in policy for t steps, and compute the KL divergence at that state using π* (the reference or oracle) as π^out. Learning corresponds to minimizing this KL divergence iteratively with respect to the parameters of π.

SLIDE 37

Roll-In Policies

  • In most formal analyses, the roll-in policy is a stochastic mixture of the learned policy π and the oracle policy π*.
  • Experimentally, it has often been found that simply using the oracle's state distribution is optimal.

Learning to Search Better than Your Teacher ICML15

SLIDE 38

Oracle Policies

  • Uniform Oracle: assigns equal probability 1/n to each of the n valid actions.
  • Coaching Oracle: prefers actions that are preferred by the current parameterized policy.
  • Annealed Coaching Oracle: anneals the mixing coefficient β from 1 to 0.
SLIDE 39

Word Reordering Examples

SLIDE 40

Insertion-based Decoding with Automatically Inferred Generation Order

Jiatao Gu, Qi Liu, Kyunghyun Cho. Facebook AI Research, New York University

SLIDE 41

Motivation

  • L2R (left-to-right) is not necessarily the optimal option for generating sequences.
  • For instance, people sometimes tend to think of central phrases first before building up a whole sentence.

SLIDE 42

Orders as Latent Variables

  • P_T is the set of all the permutations of (1, …, T)
  • π = (z2, z3, …, zT, zT+1) ∈ P_T
  • y_π = {(y0, z0), …, (yT+1, zT+1)}, where (y_t, z_t) represents the t-th generated token and its absolute position
  • Two special tokens: (y0, z0) = (<s>, 0) and (y1, z1) = (</s>, T + 1)
  • Objective: generation terminates with the end-of-decoding token <eod>

SLIDE 43

Relative Representation of Positions

  • r_i^t: the relative-position representation of token i at decoding step t
  • r_i^t is a vector with values in {0, 1, −1}
  • The matrix R^t = [r_0^t, r_1^t, …, r_t^t] shows the relative-position representations of all the words in the sequence
  • The relative representations can be mapped back to absolute positions
  • The matrix is updated after every insertion
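A sketch of recovering absolute positions from the relative matrix, under the assumed sign convention that R[i, j] = 1 when token i lies to the right of token j (the convention, like the 3-token example, is illustrative):

```python
import numpy as np

def absolute_positions(R):
    """Recover absolute positions from a relative-position matrix R with
    R[i, j] = 1 if token i is right of token j, -1 if left, 0 on the
    diagonal: a token's absolute position is the count of tokens to its left."""
    return np.maximum(R, 0).sum(axis=1)

# tokens generated in the order c, a, b; final left-to-right order: a b c
R = np.array([[ 0,  1,  1],    # c: right of a, right of b
              [-1,  0, -1],    # a: left of c, left of b
              [-1,  1,  0]])   # b: left of c, right of a
positions = absolute_positions(R)
```

Storing only pairwise left/right relations is what lets an insertion update the matrix with one new row and column instead of shifting every absolute index.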
SLIDE 44

Insertion-based Decoding

  • Given y_{0:t} and r_{0:t}
  • Predict y_{t+1} and r_{t+1}
  • Note: the prediction only concerns a key token y_k that has already been generated
  • s = −1 if y_{t+1} is inserted to the left of y_k, and s = 1 otherwise
SLIDE 45

Insertion-based Decoding

SLIDE 46

Transformer-InDIGO

  • Relative position-based self-attention
SLIDE 47

Transformer-InDIGO

  • Word & Position Prediction
SLIDE 48

Transformer-InDIGO

SLIDE 49

Learning

  • This is intractable, since we would need to enumerate all T! permutations of tokens.
  • Maximize the evidence lower bound (ELBO) of the original objective by introducing an approximate posterior distribution over generation orders q(π|x, y), which provides the probabilities of latent generation orders based on the ground-truth sequences x and y.
SLIDE 50

Searched Adaptive Order (SAO)

  • Beam search in the space of all the permutations of the target sequence.
  • At each step, each hypothesis is a sub-sequence; the remaining (left-over) words and their corresponding positions are scored.
  • Select the top-B sub-sequences as the new beam B for the next step.
SLIDE 51

Levenshtein Transformer

Jiatao Gu, Changhan Wang, and Jake Zhao (Junbo). Facebook AI Research, New York University, Tigerobo Inc.

SLIDE 52

Levenshtein Distance

4 = Levenshtein Distance(Saturday, Sundays)

  • 1. Saturday → Sturday // delete the first a
  • 2. Sturday → Surday // delete the first t
  • 3. Surday → Sunday // replace r with n
  • 4. Sunday → Sundays // add s at the end
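The distance itself is a classic dynamic program (Wagner-Fischer, shown here with two rolling rows); the slide's example gives 4, as claimed:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn a into b (Wagner-Fischer DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute
        prev = curr
    return prev[-1]
```

The Levenshtein Transformer's reward is the negative of exactly this quantity between the current canvas and the reference.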
SLIDE 53

Overview

  • Humans can revise, replace, revoke or delete any part of their generated text.
  • Atomic operations: insertion and deletion.
  • Not only generation but also sequence refinement, allowing dynamic length changes.
  • Partially autoregressive model.
SLIDE 54

Problem Formulation

  • Markov Decision Process (MDP)

Agent ↔ environment ℰ

  • Edit action a ∈ A: {insert, delete}
  • Sequence y ∈ V^Nmax, generated from scratch or refined from an uncompleted sequence; y0 ∈ Y is the initial sequence
  • Policy π: Y → P(A)
  • Reward function: R(y) = −D(y, y*), where D is the (Levenshtein) distance to the reference y*

SLIDE 55

Actions

a ∈ A: {insert, delete}

<s> </s>

SLIDE 56

Deletion

  • Makes a binary decision for each token: 1 (delete this token) or 0 (keep it)
  • The boundary tokens are always kept, so the sequence boundary cannot be broken
SLIDE 57

Insertion

  • Two phases: placeholder prediction and token prediction
  • At all locations, predict the possibility of adding one or several placeholders
  • Then, for every placeholder predicted as above, replace the placeholders with actual tokens from the vocabulary
SLIDE 58

Policy combination

  • Delete tokens → insert placeholders → replace placeholders with new tokens
  • The computation can be parallelized within each sub-task.
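One refinement round can be sketched with the three sub-policies as plain functions (the lambdas stand in for the learned classifiers; the tokens, counts, and `<plh>` literal are invented for illustration):

```python
PLH = "<plh>"

def refinement_step(seq, delete_fn, placeholder_fn, token_fn):
    """One refinement round as three sub-policies applied in turn; each
    sub-task is independent across positions, hence parallelizable."""
    # 1. deletion: keep token i iff its binary decision is 1
    seq = [t for t, keep in zip(seq, delete_fn(seq)) if keep]
    # 2. placeholder insertion: n_i placeholders between consecutive positions
    counts = placeholder_fn(seq)
    out = []
    for i, tok in enumerate(seq):
        out.append(tok)
        if i < len(seq) - 1:
            out.extend([PLH] * counts[i])
    # 3. token prediction: replace every placeholder with an actual token
    fills = iter(token_fn(out))
    return [next(fills) if t == PLH else t for t in out]

# toy round turning "<s> a d </s>" into "<s> a b c </s>"
result = refinement_step(
    ["<s>", "a", "d", "</s>"],
    delete_fn=lambda s: [1, 1, 0, 1],    # drop "d", keep the boundaries
    placeholder_fn=lambda s: [0, 2],     # two placeholders before "</s>"
    token_fn=lambda s: ["b", "c"],       # fill the placeholders
)
```

Within each of the three phases every position is decided independently, which is what makes the model only partially autoregressive.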
SLIDE 59

Levenshtein Transformer

SLIDE 60

Levenshtein Transformer

  • Decoder outputs (h0, h1, …, hn) are passed to three policy classifiers:
  • 1. Deletion Classifier: scans over the input tokens (except for the boundaries) and predicts "deleted" (0) or "kept" (1) for each token position
  • 2. Placeholder Classifier: predicts the number of tokens to be inserted at every consecutive position pair
  • 3. Token Classifier: fills in tokens replacing all the placeholders
SLIDE 61

Dual-policy Learning

(figure: states produced by the roll-in policy are corrected by the expert policy's suggested actions)

SLIDE 62

Roll-in Policy

  • Learning to Delete
  • The roll-in input is either the initial input, or, with mixture factor u ~ uniform[0, 1], the output of applying insertion to any sequence that is ready to insert tokens.
  • That insertion output is obtained by sampling instead of doing argmax.
  • Learning to Insert
  • With u ~ uniform[0, 1], apply random word dropping to the ground-truth sequence, or use the output of the deletion model.
SLIDE 63

Expert Policy

  • Oracle: actions obtained by minimizing the Levenshtein distance to the ground truth.
  • Teacher Model: first train an autoregressive teacher model using the same datasets and then replace the ground-truth sequence y* by the beam-search result of this teacher model.

SLIDE 64

Conclusion

Insertion Transformer · Non-Monotonic · InDIGO · Levenshtein Transformer

SLIDE 65

Paper List

SLIDE 66

Reference

  • Shannon Reads (香侬读) | In what order should we generate? Insertion- and deletion-based sequence generation methods: https://zhuanlan.zhihu.com/p/73417154
  • https://cips-upload.bj.bcebos.com/ssatt2019/CIPS_SSATT_2019_问答系统_唐都钰_段楠.pdf

SLIDE 67

Thanks!