SLIDE 1

Non-Autoregressive Decoding

Xiachong Feng

SLIDE 2

Outline

  • Transformer
  • The Importance of Generation Order in Language Modeling (EMNLP18)
  • Insertion Transformer: Flexible Sequence Generation via Insertion Operations (ICML19)
  • Non-Monotonic Sequential Text Generation (ICML19)
  • Insertion-based Decoding with Automatically Inferred Generation Order
  • Levenshtein Transformer
  • Conclusion
  • Paper List
  • Reference
SLIDE 3

Transformer

SLIDE 4

Transformer

SLIDE 5

Scaled Dot-Product Attention

https://cips-upload.bj.bcebos.com/ssatt2019/CIPS_SSATT_2019_问答系统_唐都钰_段楠.pdf
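As a minimal NumPy sketch of the attention formula, Attention(Q, K, V) = softmax(QKᵀ/√d_k)·V (shapes and variable names here are illustrative, not taken from the slides):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (n_q, n_k) similarities
    scores -= scores.max(axis=-1, keepdims=True)  # stabilize the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

# toy shapes: 3 queries attending over 4 key/value pairs of dimension 8
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out, w = scaled_dot_product_attention(Q, K, V)
```

Each output row is a convex combination of the value rows; the √d_k scaling keeps the pre-softmax scores from growing with the key dimension.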

SLIDE 6

Example

https://cips-upload.bj.bcebos.com/ssatt2019/CIPS_SSATT_2019_问答系统_唐都钰_段楠.pdf

SLIDE 7

Multi-Head Attention

https://cips-upload.bj.bcebos.com/ssatt2019/CIPS_SSATT_2019_问答系统_唐都钰_段楠.pdf

SLIDE 8

Transformer

https://cips-upload.bj.bcebos.com/ssatt2019/CIPS_SSATT_2019_问答系统_唐都钰_段楠.pdf

SLIDE 9

The Importance of Generation Order in Language Modeling

Nicolas Ford, Daniel Duckworth, Mohammad Norouzi, George E. Dahl Google Brain EMNLP18

SLIDE 10

Overview

  • Linguistic intuition might suggest that we should first generate some abstract representation of what we want to say and then serialize it.
  • The best ordering we tried generates function words first and content words last, which cuts against the idea of committing to the general topic of a sentence first and only then deciding exactly how to phrase it.

SLIDE 11

Two-pass Language Models

  • Produces partially-filled sentence "templates" and then fills in the missing tokens.
  • Partitions the vocabulary into a set of first-pass and second-pass tokens to generate sentences.

The sentence z is split into a template z^(1) (first-pass tokens plus a special placeholder token) and the second-pass tokens z^(2).
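A toy sketch of the templating idea, assuming for illustration that the first-pass tokens are simply the most frequent ones (the corpus, the size-3 vocabulary split, and the `__` placeholder are all invented here, not from the paper):

```python
from collections import Counter

def make_template(tokens, first_pass_vocab, placeholder="__"):
    # keep first-pass tokens, replace everything else with the placeholder
    return [t if t in first_pass_vocab else placeholder for t in tokens]

# hypothetical split: the most frequent tokens act as first-pass tokens
corpus = "the cat sat on the mat and the dog sat on the rug".split()
first_pass = {tok for tok, _ in Counter(corpus).most_common(3)}

template = make_template("the dog sat on the mat".split(), first_pass)
# frequent (function-like) words survive; rare (content) words become "__"
```

The second-pass model would then condition on this template and predict the tokens behind each placeholder.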

SLIDE 12

Two-pass Language Models

  • Two copies of the Transformer model.
  • Neural language model p1: the first copy just generates the template (sentence → template), so it has no encoder.
  • Conditional translation model p2: the second copy is a sequence-to-sequence model that translates the template into the complete sentence (template → final).

SLIDE 13

Two-pass Language Models

(figure: an example sentence and its template)

SLIDE 14

Results

  • It is easier to first decide something about the sentence's syntactic structure.
  • It is preferable to delay committing to a rare token for as long as possible, as all subsequent decisions will then be conditioning on a low-probability event.

SLIDE 15

Insertion Transformer: Flexible Sequence Generation via Insertion Operations

Mitchell Stern, William Chan, Jamie Kiros, Jakob Uszkoreit Google Brain, University of California, Berkeley ICML19

SLIDE 16

Insertion Transformer

  • x : source canvas (sequence)
  • y : target canvas (sequence)
  • ŷ_t : hypothesis canvas at time t
  • C : content vocabulary (token vocabulary for sequences)
  • l : insertion locations ∈ [0, |ŷ_t|]

SLIDE 17

Insertion Transformer Model

  • Full Decoder Self-Attention: remove the causal self-attention mask.
  • Slot Representations via Concatenated Outputs:
  • Add special marker tokens at the beginning and end of the decoder input to extend the sequence length by two.
  • Take the resulting n + 2 vectors in the final layer and concatenate each adjacent pair to obtain n + 1 slot representations.
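The slot construction can be sketched directly (NumPy; the hidden size and hypothesis length are illustrative):

```python
import numpy as np

def slot_representations(h):
    """Concatenate each adjacent pair of the n + 2 final-layer vectors
    (marker positions included) into n + 1 slot vectors of size 2 * d."""
    return np.concatenate([h[:-1], h[1:]], axis=-1)

d, n = 16, 5                      # hidden size, current hypothesis length
h = np.random.randn(n + 2, d)     # n tokens plus the two marker positions
slots = slot_representations(h)   # one representation per insertion slot
```

Slot i thus sees both its left neighbor (vector i) and its right neighbor (vector i + 1), which is what makes an insertion decision between two tokens well-informed.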

SLIDE 18

Model

  • Joint content-location distribution: scores for every (token, slot) pair form a matrix computed from the matrix of slot representations H; flatten this matrix into a vector and apply a softmax over all pairs.
  • Joint distribution using a conditional factorization: first a location distribution, scored with a learnable query vector against the rows of H (the l-th row of H represents the l-th slot), then a content distribution conditioned on the chosen slot.

SLIDE 19

Contextualized Vocabulary Bias

  • A context vector produces a vocabulary bias that is shared across all slots, added on top of a global (input-independent) bias.

SLIDE 20

Training and Loss Functions

  • Left-to-Right
  • Example: (x, y)
  • Sample a length l ~ Uniform(0, |y|)
  • Create a new data point ((x, ỹ = (y1, …, yl)), y_{l+1})
  • Loss: classification loss (negative log-likelihood)
  • Note: only the last position is considered as the insertion point
SLIDE 21

Balanced Binary Tree

  • Parallelism
SLIDE 22

Balanced Binary Tree

  • Example: (x, y)
  • Sample a length l ~ Uniform(0, |y|)
  • Sample a random subsequence ỹ of y of length l:
  • 1. Shuffle the indices of y
  • 2. Extract the first l indices
  • 3. Reorder them into their original relative order
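The three-step subsequence sampling can be sketched as (variable names invented):

```python
import random

def sample_subsequence(y, rng=random):
    """Sample l ~ Uniform(0, |y|), then a length-l subsequence of y that
    keeps the original token order (shuffle indices, truncate, re-sort)."""
    l = rng.randint(0, len(y))        # 1. sample a length (bounds inclusive)
    idx = list(range(len(y)))
    rng.shuffle(idx)                  # 2. shuffle the positions
    kept = sorted(idx[:l])            # 3. keep the first l, restore order
    return [y[i] for i in kept]

y = ["three", "friends", "ate", "lunch", "together"]
sub = sample_subsequence(y)
```

Sorting the surviving indices is what makes the result a proper subsequence rather than a random permutation.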
SLIDE 23

Soft binary tree loss

(figure: slots l = 0 … 5 over a partial hypothesis; each slot is associated with the span of target tokens yet to be produced there)

  • Each slot is trained against the span of tokens from the target output yet to be produced at that slot.
  • Tokens nearer the middle of the span receive larger weight: distance ↓, weight w(i) ↑.

SLIDE 24

Uniform

SLIDE 25

Balanced binary tree and uniform losses

SLIDE 26

Greedy Decoding

  • Choose the action with the highest probability.
  • Sequence finalization: continue until an end-of-sequence token gets selected.
  • Slot finalization: restrict the argmax to locations whose maximum-probability decision is not end-of-slot, until the model predicts an end-of-slot token for every location.
SLIDE 27

Parallel Decoding

  • For each location l, insert the highest-probability token under the conditional factorization of the joint distribution.
  • Slot finalization as in greedy decoding.
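A rough simulation of parallel decoding, with a toy model standing in for the trained Insertion Transformer (`predict_slots`, `toy_model`, and the `<eos_slot>` literal are hypothetical stand-ins; the middle-of-gap proposal mimics the balanced-binary-tree behavior):

```python
EOS_SLOT = "<eos_slot>"

def parallel_decode_step(canvas, predict_slots):
    """One parallel step: take the argmax token of every slot and insert
    them all at once (right to left, so earlier indices stay valid).
    `predict_slots` is a hypothetical stand-in for the trained model."""
    picks = predict_slots(canvas)            # one (token, prob) per slot
    new_canvas, finished = list(canvas), True
    for slot in range(len(canvas), -1, -1):
        token, _ = picks[slot]
        if token != EOS_SLOT:
            new_canvas.insert(slot, token)
            finished = False
    return new_canvas, finished

# toy "model" that proposes the middle missing target token for each gap
target = ["a", "b", "c", "d", "e"]
def toy_model(canvas):
    picks = []
    bounds = [-1] + [target.index(t) for t in canvas] + [len(target)]
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        gap = list(range(lo + 1, hi))
        picks.append((target[gap[len(gap) // 2]], 1.0) if gap else (EOS_SLOT, 1.0))
    return picks

canvas, done = [], False
while not done:
    canvas, done = parallel_decode_step(canvas, toy_model)
```

With one insertion per slot per step, a length-n sequence finishes in roughly log₂(n) rounds instead of n.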

SLIDE 28

Non-Monotonic Sequential Text Generation

Sean Welleck, Kianté Brantley, Hal Daumé III, Kyunghyun Cho. New York University, University of Maryland College Park, Microsoft Research, Facebook AI Research, CIFAR Azrieli Global Scholar. ICML19

SLIDE 29

Overview

  • Recursively generate a word, then the words to its left, and then the words to its right, yielding a binary tree.
  • Learning is framed as imitation learning, including a coaching method which moves from imitating an oracle to reinforcing the policy's own preferences.

(figure: the generated tree, read in level-order vs. in-order)
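The left/right recursion and the in-order read-out can be illustrated with a toy pivot rule (in the paper the pivot word is chosen by the learned policy; the middle-word rule below is only for illustration):

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Node:
    word: str
    left: "Optional[Node]" = None
    right: "Optional[Node]" = None

def build_tree(words: List[str]) -> Optional[Node]:
    # pick a pivot word first, then recursively generate everything to
    # its left and everything to its right (toy pivot: the middle word)
    if not words:
        return None
    mid = len(words) // 2
    return Node(words[mid], build_tree(words[:mid]), build_tree(words[mid + 1:]))

def in_order(node: Optional[Node]) -> List[str]:
    # an in-order traversal of the generated tree yields the final sentence
    if node is None:
        return []
    return in_order(node.left) + [node.word] + in_order(node.right)

sentence = "are you having fun".split()
tree = build_tree(sentence)
```

Any binary tree whose in-order traversal equals the sentence is a valid generation order, which is exactly why supervision over trees is not given for free.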

SLIDE 30

Imitation Learning

  • Imitation Learning with Recurrent Neural Networks
  • Learning to Search Better than Your Teacher ICML15
  • https://zhuanlan.zhihu.com/p/25688750
  • https://blog.csdn.net/WASEFADG/article/details/83651126
  • https://www.quora.com/What-is-imitation-learning
SLIDE 31

Notation

  • Vocabulary V; V̄ = V ∪ {<end>}
  • State space: V̄*
  • A state s ∈ S corresponds to a sequence of tokens from V̄
  • Init state: the empty sequence <>
  • End state: <end>
  • Action a: select an element from the vocabulary and append it to the state
  • τ(t): maps the in-order sequence to level order
  • Policy π(a|s)
SLIDE 32

Challenge

  • The sequences Y alone only tell us what the final output sequences of words should be, but not what tree(s) should be used to get there.

SLIDE 33

Imitation Learning

  • In the first step, the oracle policy's action is to produce any word w that appears anywhere in Y.
  • All words to the left of w in Y are generated recursively on the left (following the same procedure), and all words to the right of w in Y are generated recursively on the right.
  • Since the oracle is non-deterministic (many "correct" actions are available at any given time), we inform this oracle policy with the current learned policy, encouraging it to favor actions that are preferred by the current policy.

SLIDE 34

Background: Learning to Search

Learning to Search Better than Your Teacher ICML15

SLIDE 35

Loss

The loss is a nesting of three expectations:

  • (3) 𝔼: draw states s according to the state distribution induced by the roll-in policy π^in, and compute the cost-to-go under π^out for all possible actions a at that state
  • (2) 𝔼: over running the roll-in policy for t-many steps
  • (1) 𝔼: over training instances

SLIDE 36

Cost Measurement

  • When dealing with recurrent neural network policies, a cost function more analogous to a cross-entropy loss can be preferred.
  • Use a KL-divergence type loss, measuring the difference between the action distribution produced by π and the action distribution preferred by π^out.
  • First sample one training sequence, run the roll-in policy for t steps, and compute the KL divergence at that state using π* (the reference or oracle) as π^out. Learning corresponds to minimizing this KL divergence iteratively with respect to the parameters of π.

SLIDE 37

Roll-In Policies

  • In most formal analyses, the roll-in policy is a stochastic mixture of the learned policy π and the oracle policy π*.
  • Experimentally, it has often been found that simply using the oracle's state distribution is optimal.

Learning to Search Better than Your Teacher ICML15

SLIDE 38

Oracle Policies

  • Uniform Oracle: assigns equal probability 1/n to each of the n valid actions.
  • Coaching Oracle: prefers actions that are preferred by the current parameterized policy.
  • Annealed Coaching Oracle: anneals the mixing coefficient β from 1 to 0.
SLIDE 39

Word Reordering Examples

SLIDE 40

Insertion-based Decoding with Automatically Inferred Generation Order

Jiatao Gu, Qi Liu, Kyunghyun Cho. Facebook AI Research, New York University

SLIDE 41

Motivation

  • L2R (left-to-right) is not necessarily the optimal option for generating sequences.
  • For instance, people sometimes tend to think of central phrases first before building up a whole sentence.

SLIDE 42

Orders as Latent Variables

  • P_T is the set of all the permutations of (1, …, T)
  • π = (z2, z3, …, zT, zT+1) ∈ P_T
  • y_π = {(y0, z0), …, (yT+1, zT+1)}, where (y_t, z_t) represents the t-th generated token and its absolute position
  • Two special tokens: (y0, z0) = (<s>, 0) and (y1, z1) = (</s>, T + 1)
  • Objective: generation terminates with the end-of-decoding token <eod>

SLIDE 43

Relative Representation of Positions

  • r_i^t: the relative-position representation of token i at decoding step t
  • r_i^t is a vector with values in {0, 1, −1}
  • The matrix R^t = [r_0^t, r_1^t, …, r_t^t] shows the relative-position representations of all the words in the sequence
  • The relative representations can be mapped back to absolute positions
  • The matrix is updated after every insertion
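A sketch of recovering absolute positions from the relative matrix, under the assumed sign convention that R[i, j] = 1 when token i lies to the right of token j (the convention, like the 3-token example, is illustrative):

```python
import numpy as np

def absolute_positions(R):
    """Recover absolute positions from a relative-position matrix R with
    R[i, j] = 1 if token i is right of token j, -1 if left, 0 on the
    diagonal: a token's absolute position is the count of tokens to its left."""
    return np.maximum(R, 0).sum(axis=1)

# tokens generated in the order c, a, b; final left-to-right order: a b c
R = np.array([[ 0,  1,  1],    # c: right of a, right of b
              [-1,  0, -1],    # a: left of c, left of b
              [-1,  1,  0]])   # b: left of c, right of a
positions = absolute_positions(R)
```

Storing only pairwise left/right relations is what lets an insertion update the matrix with one new row and column instead of shifting every absolute index.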
SLIDE 44

Insertion-based Decoding

  • Given y_{0:t} and r_{0:t}
  • Predict y_{t+1} and r_{t+1}
  • Note: the prediction only concerns a key token y_k that has already been generated
  • s = −1 if y_{t+1} is inserted to the left of y_k, and s = 1 otherwise
SLIDE 45

Insertion-based Decoding

SLIDE 46

Transformer-InDIGO

  • Relative position-based self-attention
SLIDE 47

Transformer-InDIGO

  • Word & Position Prediction
SLIDE 48

Transformer-InDIGO

SLIDE 49

Learning

  • This is intractable, since we would need to enumerate all T! permutations of tokens.
  • Maximize the evidence lower bound (ELBO) of the original objective by introducing an approximate posterior distribution over generation orders q(π|x, y), which provides the probabilities of latent generation orders based on the ground-truth sequences x and y.
SLIDE 50

Searched Adaptive Order (SAO)

  • Beam search in the space of all the permutations of the target sequence.
  • At each step, each hypothesis is a sub-sequence; the remaining (left-over) words and their corresponding positions are scored.
  • Select the top-B sub-sequences as the new beam B for the next step.
SLIDE 51

Levenshtein Transformer

Jiatao Gu, Changhan Wang, and Jake Zhao (Junbo). Facebook AI Research, New York University, Tigerobo Inc.

SLIDE 52

Levenshtein Distance

4 = Levenshtein Distance(Saturday, Sundays)

  • 1. Saturday → Sturday // delete the first a
  • 2. Sturday → Surday // delete the first t
  • 3. Surday → Sunday // replace r with n
  • 4. Sunday → Sundays // add s at the end
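The distance itself is a classic dynamic program (Wagner-Fischer, shown here with two rolling rows); the slide's example gives 4, as claimed:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn a into b (Wagner-Fischer DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute
        prev = curr
    return prev[-1]
```

The Levenshtein Transformer's reward is the negative of exactly this quantity between the current canvas and the reference.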
SLIDE 53

Overview

  • Humans can revise, replace, revoke or delete any part of their generated text.
  • Atomic operations: insertion and deletion.
  • Not only generation but also sequence refinement, allowing dynamic length changes.
  • Partially autoregressive model.
SLIDE 54

Problem Formulation

  • Markov Decision Process (MDP)

Agent ↔ environment ℰ

  • Edit action a ∈ A: {insert, delete}
  • Sequence y ∈ V^Nmax, generated from scratch or refined from an uncompleted sequence; y0 ∈ Y is the initial sequence
  • Policy π: Y → P(A)
  • Reward function: R(y) = −D(y, y*), where D is the (Levenshtein) distance to the reference y*

SLIDE 55

Actions

a ∈ A: {insert, delete}

<s> </s>

SLIDE 56

Deletion

  • Makes a binary decision for each token: 1 (delete this token) or 0 (keep it)
  • The boundary tokens are always kept, so the sequence boundary cannot be broken
SLIDE 57

Insertion

  • Two phases: placeholder prediction and token prediction
  • At all locations, predict the possibility of adding one or several placeholders
  • Then, for every placeholder predicted as above, replace the placeholders with actual tokens from the vocabulary
SLIDE 58

Policy combination

  • Delete tokens → insert placeholders → replace placeholders with new tokens
  • The computation can be parallelized within each sub-task.
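One refinement round can be sketched with the three sub-policies as plain functions (the lambdas stand in for the learned classifiers; the tokens, counts, and `<plh>` literal are invented for illustration):

```python
PLH = "<plh>"

def refinement_step(seq, delete_fn, placeholder_fn, token_fn):
    """One refinement round as three sub-policies applied in turn; each
    sub-task is independent across positions, hence parallelizable."""
    # 1. deletion: keep token i iff its binary decision is 1
    seq = [t for t, keep in zip(seq, delete_fn(seq)) if keep]
    # 2. placeholder insertion: n_i placeholders between consecutive positions
    counts = placeholder_fn(seq)
    out = []
    for i, tok in enumerate(seq):
        out.append(tok)
        if i < len(seq) - 1:
            out.extend([PLH] * counts[i])
    # 3. token prediction: replace every placeholder with an actual token
    fills = iter(token_fn(out))
    return [next(fills) if t == PLH else t for t in out]

# toy round turning "<s> a d </s>" into "<s> a b c </s>"
result = refinement_step(
    ["<s>", "a", "d", "</s>"],
    delete_fn=lambda s: [1, 1, 0, 1],    # drop "d", keep the boundaries
    placeholder_fn=lambda s: [0, 2],     # two placeholders before "</s>"
    token_fn=lambda s: ["b", "c"],       # fill the placeholders
)
```

Within each of the three phases every position is decided independently, which is what makes the model only partially autoregressive.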
SLIDE 59

Levenshtein Transformer

SLIDE 60

Levenshtein Transformer

  • Decoder outputs (h0, h1, …, hn) are passed to three policy classifiers:
  • 1. Deletion Classifier: scans over the input tokens (except for the boundaries) and predicts "deleted" (0) or "kept" (1) for each token position
  • 2. Placeholder Classifier: predicts the number of tokens to be inserted at every consecutive position pair
  • 3. Token Classifier: fills in tokens replacing all the placeholders
SLIDE 61

Dual-policy Learning

(figure: states produced by the roll-in policy are corrected by the expert policy's suggested actions)

SLIDE 62

Roll-in Policy

  • Learning to Delete
  • The roll-in input is either the initial input, or, with mixture factor u ~ uniform[0, 1], the output of applying insertion to any sequence that is ready to insert tokens.
  • That insertion output is obtained by sampling instead of doing argmax.
  • Learning to Insert
  • With u ~ uniform[0, 1], apply random word dropping to the ground-truth sequence, or use the output of the deletion model.
SLIDE 63

Expert Policy

  • Oracle: actions obtained by minimizing the Levenshtein distance to the ground truth.
  • Teacher Model: first train an autoregressive teacher model using the same datasets and then replace the ground-truth sequence y* by the beam-search result of this teacher model.

SLIDE 64

Conclusion

Insertion Transformer · Non-Monotonic · InDIGO · Levenshtein Transformer

SLIDE 65

Paper List

SLIDE 66

Reference

  • Shannon Reads (香侬读) | In what order should we generate? Insertion- and deletion-based sequence generation methods: https://zhuanlan.zhihu.com/p/73417154
  • https://cips-upload.bj.bcebos.com/ssatt2019/CIPS_SSATT_2019_问答系统_唐都钰_段楠.pdf

SLIDE 67

Thanks!