  1. Rethinking the Generation Orders of Sequence jcykcai

  2. Why left-to-right? • Humans do it • But humans also: • first form some abstract idea of what to say • then serialize it

  3. The Importance of Generation Order in Language Modeling. Nicolas Ford, Daniel Duckworth, Mohammad Norouzi, George E. Dahl (Google Brain). EMNLP 2018

  4. Goal • Better generation order? • Wait! Does it really matter?

  5. Framework • Two-pass language models • Vocabulary partition: first-pass and second-pass tokens • y = (y^(1), y^(2)) • y^(1) (template): consists only of first-pass tokens and special placeholders • y^(2): the remaining second-pass tokens (see the sketch below)
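
A minimal sketch of the template construction, assuming a "common first" style partition where the first-pass vocabulary is a fixed set of frequent tokens; the vocabulary, placeholder symbol, and function name below are illustrative, not the paper's exact setup:

```python
PLACEHOLDER = "_"  # illustrative placeholder symbol

def split_two_pass(tokens, first_pass_vocab):
    """Split a sentence into (template y^(1), second-pass tokens y^(2)).

    The template keeps first-pass tokens in place and replaces every
    second-pass token with a placeholder; the second pass collects the
    replaced tokens in their original left-to-right order.
    """
    template, second_pass = [], []
    for tok in tokens:
        if tok in first_pass_vocab:
            template.append(tok)
        else:
            template.append(PLACEHOLDER)
            second_pass.append(tok)
    return template, second_pass

# Toy "common first" vocabulary for illustration.
first_pass_vocab = {"the", "is", "a", "to", "of", "on", "at", "."}
sentence = "scotland 's next game is a friendly against the czech republic .".split()
template, second = split_two_pass(sentence, first_pass_vocab)
print(template)  # ['_', '_', '_', '_', 'is', 'a', '_', '_', 'the', '_', '_', '.']
print(second)    # ['scotland', "'s", 'next', 'game', 'friendly', 'against', 'czech', 'republic']
```

Each order variant on the next slide corresponds to a different choice of first-pass vocabulary (e.g., function words, content words, common tokens, rare tokens).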

  6. Order Variants • common first • rare first • function first • content first • odd first • Table 1: example sentences from the dataset and their corresponding templates under each order, with second-pass positions marked by the placeholder token

  7. Language Models • The total probability of a sentence y is p(y) = p_1(y^(1)) · p_2(y^(2) | y^(1)) • The template y^(1) is a deterministic function of y • Architecture: template decoder + template encoder + second-pass decoder (see the sketch below)
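
A hedged sketch of how the three components combine under this factorization; `template_lm` and `second_pass_model` are assumed interfaces for illustration, not the paper's code:

```python
def two_pass_log_prob(y1, y2, template_lm, second_pass_model):
    """log p(y) = log p1(y^(1)) + log p2(y^(2) | y^(1)).

    y1 is the template (first-pass tokens plus placeholders), y2 the
    remaining second-pass tokens. `template_lm` scores the template with
    the first decoder; `second_pass_model` encodes the template and then
    scores the second-pass tokens with the second decoder.
    """
    log_p1 = template_lm.log_prob(y1)                            # template decoder
    template_encoding = second_pass_model.encode(y1)             # template encoder
    log_p2 = second_pass_model.log_prob(y2, template_encoding)   # second-pass decoder
    return log_p1 + log_p2
```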

  8. Experiments • Perplexity on LM1B:

     Model              Train    Validation  Test
     odd first          39.925   45.377      45.196
     rare first         38.283   43.293      43.077
     content first      38.321   42.564      42.394
     common first       36.525   41.018      40.895
     function first     36.126   40.246      40.085
     baseline           38.668   41.888      41.721
     enhanced baseline  35.945   39.845      39.726

     • Content-dependent generation orders do have a large effect on model quality
     • Function-first is the best (common-first is second)
     • It is easier to first decide the syntactic structure
     • Delaying the rare tokens helps

  9. Recent Advances https://arxiv.org/pdf/1902.01370.pdf https://arxiv.org/pdf/1902.02192.pdf https://arxiv.org/pdf/1902.03249.pdf

  10. Insertion Transformer: Flexible Sequence Generation via Insertion Operations. Mitchell Stern, William Chan, Jamie Kiros, Jakob Uszkoreit. ICML 2019

  11. Model • Architecture: Transformer with full self-attention in the decoder • Slot representations • Content-location distribution: what to insert and where to insert it • p(c, l | x, ŷ_t) = InsertionTransformer(x, ŷ_t) (see the sketch below)
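
A rough PyTorch sketch of a content-location head in this spirit: each slot between adjacent positions of the current hypothesis is represented from the pair of decoder states, and a single softmax is taken jointly over all (content, location) pairs. Layer shapes and names are assumptions for illustration, not the paper's exact parameterization:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContentLocationHead(nn.Module):
    """Joint distribution p(c, l | x, y_hat_t) over (content, location) pairs."""

    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        # A slot between two adjacent positions is represented by the
        # concatenation of their decoder hidden states.
        self.proj = nn.Linear(2 * hidden_size, hidden_size)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, decoder_states: torch.Tensor) -> torch.Tensor:
        # decoder_states: (T + 2, hidden) for a partial hypothesis of length T,
        # including start/end markers, which yields T + 1 insertion slots.
        left, right = decoder_states[:-1], decoder_states[1:]
        slots = torch.tanh(self.proj(torch.cat([left, right], dim=-1)))
        logits = self.out(slots)                       # (T + 1, vocab)
        # One softmax jointly over every (content, location) pair.
        return F.log_softmax(logits.reshape(-1), dim=0).view_as(logits)
```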

  12. Termination • Termination conditions • Sequence finalization • Slot finalization (enables parallel inference; see the sketch below)
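
A simplified sketch of greedy parallel decoding with slot-level termination: at each step every slot proposes its best content, slots whose best content is the end token are left alone, and decoding stops once every slot prefers the end token. The `model.best_per_slot` interface is an assumption for illustration:

```python
def parallel_greedy_decode(model, src, max_steps=64, eos="<eos>"):
    """Greedy parallel decoding sketch with slot-level termination.

    `model.best_per_slot(src, hyp)` is an assumed interface returning the
    argmax content for each of the len(hyp) + 1 insertion slots of the
    current hypothesis.
    """
    hyp = []
    for _ in range(max_steps):
        best = model.best_per_slot(src, hyp)          # one proposal per slot
        inserts = [(slot, tok) for slot, tok in enumerate(best) if tok != eos]
        if not inserts:                               # every slot prefers <eos>:
            break                                     # the sequence is finalized
        # Apply insertions right-to-left so earlier slot indices stay valid.
        for slot, tok in sorted(inserts, reverse=True):
            hyp.insert(slot, tok)
    return hyp
```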

  13. Insertion Transformer: Flexible Sequence Generation via Insertion Operations

     Serial generation:
     t   Canvas                                   Insertion
     0   []                                       (ate, 0)
     1   [ate]                                    (together, 1)
     2   [ate, together]                          (friends, 0)
     3   [friends, ate, together]                 (three, 0)
     4   [three, friends, ate, together]          (lunch, 3)
     5   [three, friends, ate, lunch, together]   (⟨EOS⟩, 5)

     Parallel generation:
     t   Canvas                                   Insertions
     0   []                                       (ate, 0)
     1   [ate]                                    (friends, 0), (together, 1)
     2   [friends, ate, together]                 (three, 0), (lunch, 2)
     3   [three, friends, ate, lunch, together]   (⟨EOS⟩, 5)

     Figure 1. Examples demonstrating how the clause "three friends ate lunch together" can be generated using our insertion framework. On the left, a serial generation process is used in which one insertion is performed at a time. On the right, a parallel generation process is used with multiple insertions being allowed per time step. Our model can either be trained to follow specific orderings or to maximize entropy over all valid actions. Some options permit highly efficient parallel decoding, as shown in our experiments.

  14. Training • The form of a single training instance • Sample generation steps (partial sentences; see the sketch below) • Variants • Left-to-Right • Balanced Binary Tree • Uniform
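
A sketch of how a training instance could be sampled: draw a random subsequence of the target as the partial canvas, and record every missing token as a valid (content, slot) insertion. How these valid actions are weighted (left-to-right, balanced binary tree, or uniform) is what distinguishes the variants; only the canvas/target sampling is shown, and the function is illustrative:

```python
import random

def sample_training_instance(target):
    """Sample (canvas, valid_insertions) from a full target sequence.

    The canvas is a randomly chosen subsequence of the target; each missing
    token is a valid insertion into the slot between its surviving
    neighbours. Illustrative sketch only.
    """
    n = len(target)
    k = random.randint(0, n)                      # canvas length
    keep = sorted(random.sample(range(n), k))     # kept indices, in order
    canvas = [target[i] for i in keep]

    valid = []                                    # (content, slot) pairs
    slot = 0
    kept = set(keep)
    for i, tok in enumerate(target):
        if i in kept:
            slot += 1                             # passed a canvas token
        else:
            valid.append((tok, slot))             # insert before canvas[slot]
    return canvas, valid

canvas, valid = sample_training_instance("three friends ate lunch together".split())
# e.g. canvas = ['ate', 'together'], valid = [('three', 0), ('friends', 0), ('lunch', 1)]
```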

  15. Results

     Loss                  Termination   BLEU (+EOS)     +Distillation   +Distillation, +Parallel
     Left-to-Right         Sequence      20.92 (20.92)   23.29 (23.36)   -
     Binary Tree (τ=0.5)   Slot          20.35 (21.39)   24.49 (25.55)   25.33 (25.70)
     Binary Tree (τ=1.0)   Slot          21.02 (22.37)   24.36 (25.43)   25.43 (25.76)
     Binary Tree (τ=2.0)   Slot          20.52 (21.95)   24.59 (25.80)   25.33 (25.80)
     Uniform               Sequence      19.34 (22.64)   22.75 (25.45)   -
     Uniform               Slot          18.26 (22.16)   22.39 (25.58)   24.31 (24.91)

     • +Parallel is even better!
     • Greedy search may suffer from issues related to local search that are circumvented by making multiple updates to the hypothesis at once.

  16. Results

     Model                                       BLEU    Iterations
     Autoregressive Left-to-Right
       Transformer (Vaswani et al., 2017)        27.3    n
     Semi-Autoregressive Left-to-Right
       SAT (Wang et al., 2018)                   24.83   n/6
       Blockwise Parallel (Stern et al., 2018)   27.40   ≈ n/5
     Non-Autoregressive
       NAT (Gu et al., 2018)                     17.69   1
       Iterative Refinement (Lee et al., 2018)   21.61   10
     Our Approach (Greedy)
       Insertion Transformer + Left-to-Right     23.94   n
       Insertion Transformer + Binary Tree       27.29   n
       Insertion Transformer + Uniform           27.12   n
     Our Approach (Parallel)
       Insertion Transformer + Binary Tree       27.41   ≈ log₂ n
       Insertion Transformer + Uniform           26.72   ≈ log₂ n

     • Comparable performance
     • Fewer generation iterations => faster?

  17. Limitations • Must recompute the decoder hidden states for every position after each insertion • Autoregressive vs. non-autoregressive • Expressive power vs. parallel decoding

  18. Non-Monotonic Sequential Text Generation. Sean Welleck, Kianté Brantley, Hal Daumé III, Kyunghyun Cho. ICML 2019

  19. Goal • Learn a good generation order without • specifying an order in advance • requiring additional annotation

  20. Formulation • Figure 1: a binary tree generating the sequence "how are you ?" • Generate a word at an arbitrary position, then recursively generate words to its left and words to its right.

  21. Formulation • Figure 1, continued • The full generation is performed in a level-order traversal (green). • The output is read off via an in-order traversal (blue). (See the sketch below.)
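
A small sketch of the two traversals, with a hard-coded binary tree whose in-order read-off spells "how are you ?" (the exact tree shape in Figure 1 may differ; this only illustrates level-order generation vs. in-order read-off):

```python
from collections import deque

class Node:
    def __init__(self, token, left=None, right=None):
        self.token, self.left, self.right = token, left, right

END = "<end>"

# One possible tree for "how are you ?": root "are", left subtree "how",
# right subtree "you" -> "?" (remaining children are <end> leaves).
tree = Node("are",
            left=Node("how", Node(END), Node(END)),
            right=Node("you", Node(END), Node("?", Node(END), Node(END))))

def level_order(root):
    """Order in which the policy generates tokens (breadth-first)."""
    out, queue = [], deque([root])
    while queue:
        node = queue.popleft()
        out.append(node.token)
        for child in (node.left, node.right):
            if child is not None:
                queue.append(child)
    return out

def read_off(root):
    """In-order traversal that yields the final sequence, skipping <end>."""
    if root is None or root.token == END:
        return []
    return read_off(root.left) + [root.token] + read_off(root.right)

print(level_order(tree))  # ['are', 'how', 'you', '<end>', '<end>', '<end>', '?', '<end>', '<end>']
print(read_off(tree))     # ['how', 'are', 'you', '?']
```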

  22. Imitation Learning • Learn a generation policy that mimics the actions of an oracle generation policy • Oracle policies • Uniform oracle: similar to quick-sort • Coaching oracle: reinforce the policy's own preferences, π*_coaching(a|s) ∝ π*_uniform(a|s) · π(a|s) • Annealed coaching oracle: π*_annealed(a|s) = β π*_uniform(a|s) + (1 − β) π*_coaching(a|s) (worked example below)
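
A worked toy example of the oracle definitions over a small action set; the current-policy probabilities and the β value are made up for illustration:

```python
def normalize(d):
    z = sum(d.values())
    return {a: p / z for a, p in d.items()}

# Valid actions in some state s: the free tokens that may be produced next.
valid_actions = ["how", "are", "you", "?"]

# Uniform oracle: all valid actions are equally good.
pi_uniform = {a: 1.0 / len(valid_actions) for a in valid_actions}

# Current policy's (made-up) preferences over the same actions.
pi_policy = {"how": 0.5, "are": 0.2, "you": 0.2, "?": 0.1}

# Coaching oracle: proportional to uniform oracle times current policy.
pi_coaching = normalize({a: pi_uniform[a] * pi_policy[a] for a in valid_actions})

# Annealed coaching oracle: mixture with weight beta on the uniform oracle.
beta = 0.3
pi_annealed = {a: beta * pi_uniform[a] + (1 - beta) * pi_coaching[a]
               for a in valid_actions}
```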

  23. Imitation Learning • Annealed coaching oracle • Random oracle encourages exploration • Reinforcement leads to a specific generation order • A special case for comparison • Deterministic Left-to-Right Oracle (standard order)

  24. Policy Networks • A partial binary tree is treated as a flat sequence of nodes in a level-order traversal • Essentially, it is still a sequence model • Transformers or LSTMs can be applied
