Sequence-to-Sequence Transformations in Text Processing
Narada Warakagoda
Seq2seq Transformation
Variable length input → Model → Variable length output
Example Applications
- Summarization (extractive/abstractive)
- Machine translation
- Dialog systems / chatbots
- Text generation
- Question answering
Seq2seq Transformation
Variable length input → Model → Variable length output, but the model size should be constant.
Solution: apply a constant-sized neural net module repeatedly on the data.
Possible Approaches
- Recurrent networks
  - Apply the NN module in a serial fashion
- Convolutional networks
  - Apply the NN modules in a hierarchical fashion
- Self-attention
  - Apply the NN module in a parallel fashion
Processing Pipeline
Variable length input → Encoder → Intermediate representation → Decoder → Variable length output
Processing Pipeline
Variable length text → Embedding → Encoder → Intermediate representation → Decoder (with Attention) → Variable length output
Architecture Variants

Encoder                                   Decoder                                   Attention
Recurrent net                             Recurrent net                             No
Recurrent net                             Recurrent net                             Yes
Convolutional net                         Convolutional net                         No
Convolutional net                         Recurrent net                             Yes
Convolutional net                         Convolutional net                         Yes
Fully connected net with self-attention   Fully connected net with self-attention   Yes
Possible Approaches
- Recurrent networks
  - Apply the NN module in a serial fashion
- Convolutional networks
  - Apply the NN modules in a hierarchical fashion
- Self-attention
  - Apply the NN module in a parallel fashion
RNN-decoder with RNN-encoder
[Figure: RNN encoder-decoder translating the Norwegian "Tusen Takk" into "Thanks Very Much". The encoder reads the source words up to <end>; the decoder starts from <start>, and a softmax over the vocabulary follows each decoder step. Each box = RNN cell.]
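A minimal sketch of this encoder-decoder in PyTorch (the class name, vocabulary and layer sizes are illustrative, not from the slides):

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, emb=256, hidden=512):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.GRU(emb, hidden, batch_first=True)
        self.decoder = nn.GRU(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)   # feeds the softmax

    def forward(self, src, tgt_in):
        # Encode the variable-length source into a fixed-size state.
        _, state = self.encoder(self.src_emb(src))
        # Decode, initialized with the encoder's final state.
        dec_out, _ = self.decoder(self.tgt_emb(tgt_in), state)
        return self.out(dec_out)                  # logits; softmax at each step
```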
RNN-dec with RNN-enc, Training
[Figure: the same encoder-decoder during training. The ground truths "Thanks Very Much <end>" are used both as the (shifted) decoder inputs and as the targets of the softmax outputs.]
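One training step under this scheme (teacher forcing), reusing the Seq2Seq sketch above; the target batch is assumed to already contain <start> ... <end> token ids:

```python
import torch.nn as nn

loss_fn = nn.CrossEntropyLoss()

def train_step(model, optimizer, src, tgt):
    # Teacher forcing: ground-truth words are the decoder inputs, shifted
    # by one ("<start> Thanks Very Much" in, "Thanks Very Much <end>" out).
    tgt_in, tgt_out = tgt[:, :-1], tgt[:, 1:]
    logits = model(src, tgt_in)
    loss = loss_fn(logits.reshape(-1, logits.size(-1)), tgt_out.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```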
RNN-dec with RNN-enc, Decoding
[Figure: the same encoder-decoder at decoding time. The word chosen by each softmax ("Thanks", "Much", "Very") is fed back as the next decoder input, producing "Thanks Much Very <end>".]
Greedy decoding: at each step the single most probable word is chosen.
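A greedy-decoding sketch for the same model (batch size 1; start_id and end_id are the assumed <start>/<end> token ids):

```python
import torch

@torch.no_grad()
def greedy_decode(model, src, start_id, end_id, max_len=50):
    _, state = model.encoder(model.src_emb(src))   # encode the source
    token = torch.tensor([[start_id]])
    result = []
    for _ in range(max_len):
        dec_out, state = model.decoder(model.tgt_emb(token), state)
        token = model.out(dec_out).argmax(-1)      # keep only the best word
        if token.item() == end_id:
            break
        result.append(token.item())
    return result
```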
Decoding Approaches
- Optimal decoding
- Greedy decoding
  - Easy
  - Not optimal
- Beam search
  - Closer to the optimal decoder
  - Choose the top N candidates instead of the best one at each step
Beam Search Decoding
[Figure: beam search tree keeping the top 3 partial hypotheses at each step; beam width = 3.]
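A simplified beam search sketch. step_fn is a hypothetical callback returning log-probabilities over the vocabulary for a given prefix; there is no length normalization, for brevity:

```python
def beam_search(step_fn, start_id, end_id, beam_width=3, max_len=50):
    beams = [([start_id], 0.0)]            # (prefix, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for word, logp in enumerate(step_fn(prefix)):
                candidates.append((prefix + [word], score + logp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates:   # keep the top N, not just the best
            (finished if prefix[-1] == end_id else beams).append((prefix, score))
            if len(beams) == beam_width:
                break
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[1])[0]
```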
Straight-forward Extensions
[Figure: building blocks for straightforward extensions. An RNN cell maps the current input and current state to the next state; an LSTM cell additionally carries a control (cell) state; a bidirectional cell runs one pass in each direction over the input; a stacked cell puts several cells on top of each other.]
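In common frameworks these extensions are constructor flags; a PyTorch sketch with illustrative sizes:

```python
import torch.nn as nn

# LSTM cell instead of a vanilla RNN, two stacked layers,
# and a bidirectional pass over the input:
encoder = nn.LSTM(input_size=256, hidden_size=512,
                  num_layers=2,          # stacked cell
                  bidirectional=True,    # forward + backward over the input
                  batch_first=True)
```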
RNN-decoder with RNN-encoder with Attention
[Figure: the same encoder-decoder, where each decoder box = RNN cell + a context vector computed by attention over the encoder states.]
Attention
- The context is given by c_i = Σ_j α_{i,j} h_j
- The attention weights α_{i,j} are dynamic
- Generally defined by α_{i,j} = exp(e_{i,j}) / Σ_k exp(e_{i,k}), with e_{i,j} = f(s_{i-1}, h_j),
  where the function f can be defined in several ways (sketched after this list):
  - Dot product: f(s, h) = sᵀh
  - Weighted dot product: f(s, h) = sᵀW h
  - Use another MLP (e.g. 2 layers): f(s, h) = vᵀ tanh(W₁s + W₂h)
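A sketch of the three score functions and the resulting context vector; s is the decoder state, H the matrix of encoder states, and W, W1, W2, v stand for assumed learned parameters:

```python
import torch

def dot_score(s, H):                     # f(s, h_j) = s . h_j
    return H @ s

def general_score(s, H, W):              # f(s, h_j) = s^T W h_j
    return H @ (W @ s)

def mlp_score(s, H, W1, W2, v):          # f(s, h_j) = v^T tanh(W1 s + W2 h_j)
    return torch.tanh(H @ W2.T + W1 @ s) @ v

def context(s, H, score_fn, *params):
    alpha = torch.softmax(score_fn(s, H, *params), dim=0)  # attention weights
    return alpha @ H                     # c_i = sum_j alpha_{i,j} h_j
```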
Attention
[Figure: the attention module combined with the RNN cell: the context vector enters the cell together with the current input.]
Example: Google Neural Machine Translation
Possible Approaches
- Recurrent networks
  - Apply the NN module in a serial fashion
- Convolutional networks
  - Apply the NN modules in a hierarchical fashion
- Self-attention
  - Apply the NN module in a parallel fashion
Why Convolution
- Recurrent networks are serial
  - Unable to be parallelized
  - The “distance” between a feature vector and the different inputs is not constant
- Convolutional networks
  - Can be parallelized (faster)
  - The “distance” between a feature vector and the different inputs is constant
Long range dependency capture with conv nets
[Figure: a stack of convolutions with kernel width k over an input of length n; each layer widens the receptive field, so distant inputs are connected after about n/k layers.]
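A quick check of that growth, assuming stride-1, undilated convolutions:

```python
# Each extra layer with kernel width k adds (k - 1) inputs to the
# receptive field, so depth L covers L * (k - 1) + 1 positions.
def receptive_field(num_layers, k):
    return num_layers * (k - 1) + 1

print(receptive_field(3, 5))   # 3 layers of width-5 kernels see 13 inputs
```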
Conv net, Recurrent net with Attention
Gehring et al., A Convolutional Encoder Model for Neural Machine Translation (2016)
[Figure (Gehring et al. 2016): a convolutional encoder feeding an RNN decoder. Two encoder CNNs share the source embeddings: CNN-a produces the states z_j that are scored for attention, and CNN-c produces the states that the attention weights a_{i,j} combine into the context c_i. On the decoder side, the hidden state h_i and the previous target embedding g_i are combined as d_i = W h_i + g_i, and the outputs y_i are emitted step by step.]
Two conv nets with attention
Gehring et al., Convolutional Sequence to Sequence Learning (2017)
[Figure (Gehring et al. 2017): two convolutional nets with attention. Encoder convolutions (weights W) produce states z_j, j = 1,2,3, from source embeddings e_j; decoder convolutions (weights W_d) produce states h_i, i = 1,…,4, each combined with the target embedding g_i into d_i; attention weights a_{i,j} score d_i against the z_j and yield contexts c_i that are added back into the decoder.]
Possible Approaches
- Recurrent networks
  - Apply the NN module in a serial fashion
- Convolutional networks
  - Apply the NN modules in a hierarchical fashion
- Self-attention
  - Apply the NN module in a parallel fashion
Why Self-attention
- Recurrent networks are serial
  - Unable to be parallelized
  - The “distance” between a feature vector and the different inputs is not constant
- Self-attention networks
  - Can be parallelized (faster)
  - The “distance” between a feature vector and the different inputs does not depend on the input length
FCN with self-attention
[Figure: Transformer architecture: the encoder reads the inputs, the decoder reads the previous words and outputs the probability of the next words.]
Vaswani et al., Attention Is All You Need (2017)
Scaled dot-product attention
Attention(Q, K, V) = softmax(QKᵀ / √d_k) V, computed from a query Q, keys K and values V.
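A direct transcription of the formula (a sketch; the optional mask argument becomes relevant for decoder self-attention below):

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # QK^T / sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    return torch.softmax(scores, dim=-1) @ V
```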
Multi-Head Attention
- Several scaled dot-product attention heads run in parallel on different learned projections of Q, K and V; their outputs are concatenated and projected.
Encoder Self-attention
[Figure: self-attention within the encoder: each position attends over all positions of the previous layer, so queries, keys and values all come from the same place.]
Decoder Self-attention
- Almost the same as encoder self-attention
- But only leftward positions are considered (see the causal mask sketch below)
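A minimal sketch of that leftward restriction as a causal mask, usable as the mask argument of the attention sketch above:

```python
import torch

T = 4                                        # illustrative sequence length
causal_mask = torch.tril(torch.ones(T, T))   # position i attends to j <= i only
# Zeros above the diagonal turn into -inf scores before the softmax.
```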
Encoder-decoder attention
[Figure: queries come from the decoder state; keys and values come from the encoder states.]
Overall Operation
[Figure: step-by-step translation: the previous words are fed back into the decoder to predict the next word.]
From: Neural Machine Translation, Philipp Koehn
Reinforcement Learning
- Machine Translation / Summarization
- Dialog Systems
Why Reinforcement Learning
- Exposure bias
  - In training, ground truths are used; in testing, the word generated at the previous step is used to generate the next word
  - Using generated words in training requires sampling: non-differentiable
- The maximum likelihood criterion is not directly relevant to the evaluation metrics
  - BLEU (machine translation)
  - ROUGE (summarization)
  - Using BLEU/ROUGE in training: non-differentiable
Sequence Generation as Reinforcement Learning
- Agent: the recurrent net
- State: hidden layers, attention weights, etc.
- Action: the next word
- Policy: generate the next word (action) given the current hidden layers and attention weights (state)
- Reward: score computed using the evaluation metric (e.g. BLEU)
Maximum Likelihood Training (Revisited)
Minimize the negative log likelihood: L_ML(θ) = − Σ_t log p_θ(y*_t | y*_1 … y*_{t−1}, x), where the y*_t are the ground-truth words
Reinforcement Learning Formulation
Minimize the expected negative reward, L_RL(θ) = − E_{y∼p_θ}[ r(y) ], using the REINFORCE algorithm
Reinforcement Learning Details
- Expected reward: L(θ) = E_{y∼p_θ}[ r(y) ]
- We need the gradient ∇_θ L(θ)
- We need to write it as an expectation, so that we can evaluate it using samples. The log-derivative trick, ∇_θ p_θ(y) = p_θ(y) ∇_θ log p_θ(y), gives
  ∇_θ L(θ) = E_{y∼p_θ}[ r(y) ∇_θ log p_θ(y) ]
- This is an expectation: approximate it with the sample mean
- In practice we use only one sample y^s ∼ p_θ, so ∇_θ L(θ) ≈ r(y^s) ∇_θ log p_θ(y^s)
Reinforcement Learning Details
- Gradient with a baseline b: ∇_θ L(θ) ≈ (r(y^s) − b) ∇_θ log p_θ(y^s)
- The one-sample estimate has high variance; the baseline combats this problem
- The baseline can be anything independent of y^s
- It can, for example, be estimated as the reward of the word sequence generated using argmax at each cell (a sketch follows)
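A one-sample REINFORCE step with that argmax baseline, as a sketch; sample_seq and reward are hypothetical stand-ins for sampling from the decoder and for a metric such as BLEU/ROUGE:

```python
import torch

def reinforce_step(model, optimizer, src, reward, sample_seq):
    # y^s ~ p_theta, with logp = sum_t log p_theta(y^s_t | ...)
    y_sample, logp = sample_seq(model, src, greedy=False)
    with torch.no_grad():
        y_greedy, _ = sample_seq(model, src, greedy=True)
    baseline = reward(y_greedy)            # reward of the argmax sequence
    # Gradient of -(r(y^s) - b) * log p_theta(y^s); the baseline cuts variance.
    loss = -(reward(y_sample) - baseline) * logp
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```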
Reinforcement Learning
- Machine Translation / Summarization
- Dialog Systems
Maximum Likelihood Dialog Systems
[Figure: seq2seq dialog model: the encoder reads "How Are You?"; the decoder, started from <start> and fed "I Am …", generates "I Am Fine".]
Why Reinforcement Learning
- The maximum likelihood criterion is not directly relevant to successful dialogs
  - Dull responses (“I don’t know”)
  - Repetitive responses
- Need to integrate developer-defined rewards relevant to the longer-term goals of the dialog
Dialog Generation as Reinforcement Learning
- Agent: the recurrent net
- State: previous dialog turns
- Action: the next dialog utterance
- Policy: generate the next dialog utterance (action) given the previous dialog turns (state)
- Reward: score computed from relevant factors such as ease of answering, information flow, semantic coherence, etc.
Training Setup
[Figure: two agents, each with its own encoder and decoder, hold a conversation; each agent's decoder output becomes the other agent's encoder input.]
Training Procedure
- From the viewpoint of a given agent, the procedure is similar to that of sequence generation
  - REINFORCE algorithm
- Appropriate rewards must be calculated based on the current and previous dialog turns
- Can be initialized with maximum-likelihood-trained models
Adversarial Learning
- Use a discriminator, as in GANs, to calculate the reward
- Same training procedure based on REINFORCE for the generator
[Figure: the discriminator is trained to tell generated dialog from human dialog; its output serves as the generator's reward.]
Question Answering
- Slightly different from the sequence-to-sequence model.
Passage/Document/Context + Question/Query → Model → fixed-length output: a single-word answer, or the start and end points of the answer within the passage
QA: Naive Approach
- Combine the question and the passage and use an RNN to classify it.
- Will not work, because the relationship between the passage and the question is not adequately captured.
Question and passage → Model → fixed-length output (single-word answer, or start and end points of the answer)
QA: More Successful Approach
- Use attention between the question and the passage
  - Bi-directional attention, co-attention
- Temporal relationship modeling
- Classification, or prediction of the start and end points of the answer within the passage
QA Example with Bi-directional Attention
Seo et al., Bi-Directional Attention Flow for Machine Comprehension (2016)