 
Sequence(s)-to-Sequence Transformations in Text Processing, Narada Warakagoda
Seq2seq Transformation: Variable length input -> Model -> Variable length output
Example Applications ● Summarization (extractive/abstractive) ● Machine translation ● Dialog systems/chatbots ● Text generation ● Question answering
Seq2seq Transformation: Variable length input -> Model -> Variable length output. Model size should be constant. Solution: Apply a constant-sized neural net module repeatedly on the data.
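As a minimal sketch of this idea, the snippet below applies one fixed-size recurrent cell to inputs of different lengths. The tanh cell, the sizes, and the names `rnn_cell` and `encode` are assumptions made for illustration; they do not come from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
INPUT_DIM, STATE_DIM = 4, 8  # illustrative sizes (assumed)

# One constant-sized module: its parameter count never depends on sequence length.
W_x = rng.normal(scale=0.1, size=(STATE_DIM, INPUT_DIM))
W_h = rng.normal(scale=0.1, size=(STATE_DIM, STATE_DIM))
b = np.zeros(STATE_DIM)

def rnn_cell(state, x):
    """One application of the module: (current state, current input) -> next state."""
    return np.tanh(W_h @ state + W_x @ x + b)

def encode(sequence):
    """Apply the same cell once per time step, whatever the length of `sequence`."""
    state = np.zeros(STATE_DIM)
    for x in sequence:
        state = rnn_cell(state, x)
    return state

short = rng.normal(size=(3, INPUT_DIM))   # 3 time steps
long = rng.normal(size=(50, INPUT_DIM))   # 50 time steps
print(encode(short).shape, encode(long).shape)
```

Both calls reuse the same three parameter arrays; only the number of loop iterations changes with the input length.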
Possible Approaches ● Recurrent networks ● Apply the NN module in a serial fashion ● Convolutional networks ● Apply the NN modules in a hierarchical fashion ● Self-attention ● Apply the NN module in a parallel fashion
Processing Pipeline: Variable length input -> Encoder -> Intermediate representation -> Decoder -> Variable length output
Processing Pipeline: Variable length text -> Embedding -> Variable length input -> Encoder -> Intermediate representation -> Attention -> Decoder -> Variable length output
Architecture Variants
Encoder | Decoder | Attention
Recurrent net | Recurrent net | No
Recurrent net | Recurrent net | Yes
Convolutional net | Convolutional net | No
Convolutional net | Recurrent net | Yes
Convolutional net | Convolutional net | Yes
Fully connected net with self-attention | Fully connected net with self-attention | Yes
RNN-decoder with RNN-encoder [Figure: encoder RNN cells read "Tusen Takk <end>"; decoder RNN cells, each followed by a softmax, are fed "<start> Thanks Very Much".]
RNN-dec with RNN-enc, Training [Figure: encoder reads "Tusen Takk <end>"; the decoder is fed the ground truths "<start> Thanks Very Much" and its softmax outputs are trained towards "Thanks Very Much <end>".]
RNN-dec with RNN-enc, Decoding [Figure: greedy decoding. Encoder reads "Tusen Takk <end>"; the decoder starts from "<start>" and each output word ("Thanks Much Very <end>") is fed back as the next decoder input.]
Decoding Approaches ● Optimal decoding ● Greedy decoding ● Easy ● Not optimal ● Beam search ● Closer to the optimal decoder ● Keep the top N candidates instead of only the best one at each step.
Beam Search Decoding Beam Width = 3
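A minimal sketch of beam search follows, using a made-up next-word table in place of a trained decoder. `TOY_MODEL`, `next_log_probs`, and `beam_search` are hypothetical names, and the probabilities are invented so that greedy decoding (beam width 1) and a wider beam give different answers. No length normalization is applied, which a practical system would usually add.

```python
import math

# Hypothetical toy model: next-word probabilities given only the last word.
TOY_MODEL = {
    "<start>": {"Thank": 0.55, "Thanks": 0.45},
    "Thank": {"You": 0.6, "<end>": 0.4},
    "You": {"<end>": 1.0},
    "Thanks": {"Very": 0.9, "<end>": 0.1},
    "Very": {"Much": 1.0},
    "Much": {"<end>": 1.0},
}

def next_log_probs(prefix):
    """Log-probabilities of each possible next word after `prefix`."""
    return {tok: math.log(p) for tok, p in TOY_MODEL[prefix[-1]].items()}

def beam_search(beam_width, max_len=6):
    """Return (log-probability, tokens) of the best hypothesis found."""
    beams = [(0.0, ["<start>"])]   # unfinished hypotheses
    finished = []                  # hypotheses that already produced <end>
    for _ in range(max_len):
        candidates = []
        for score, tokens in beams:
            for tok, lp in next_log_probs(tokens).items():
                candidates.append((score + lp, tokens + [tok]))
        # Keep only the top `beam_width` expansions at this step.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, tokens in candidates[:beam_width]:
            (finished if tokens[-1] == "<end>" else beams).append((score, tokens))
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[0])

print(beam_search(1)[1])  # greedy decoding
print(beam_search(3)[1])  # beam width 3
```

With this table, the greedy path commits to the locally best first word and ends with a lower overall probability than the hypothesis a width-3 beam retains.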
Straightforward Extensions ● RNN cell: (current state, current input) -> next state ● LSTM cell: (current state, current control state, current input) -> (next state, next control state) ● Bidirectional cell ● Stacked cell
RNN-decoder with RNN-encoder with Attention [Figure: encoder RNN cells read "Tusen Takk <end>"; decoder RNN cells, each followed by a softmax, are fed "<start> Thanks Very Much" and additionally combine (+) a context computed from the encoder.]
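Attention ● Context is given by a weighted sum of the encoder states, with the attention weights as coefficients ● Attention weights are dynamic: they are recomputed for every decoder step ● Generally defined by a softmax over scores, where each score is a function f of the decoder state and an encoder state; f can be defined in several ways. ● Dot product ● Weighted dot product ● Use another MLP (e.g. 2 layers)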
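The three choices of the score function f can be sketched as follows. This is an illustration under assumptions: the symbols (`decoder_state`, `encoder_states`, `W`, `W1`, `w2`), the tanh nonlinearity, and the dimensions are not taken from the slides, whose equations did not survive.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_context(decoder_state, encoder_states, score_fn):
    """Return (context, weights) for one decoder step."""
    scores = np.array([score_fn(decoder_state, h) for h in encoder_states])
    weights = softmax(scores)            # dynamic: depends on the current decoder state
    context = weights @ encoder_states   # weighted sum of encoder states
    return context, weights

rng = np.random.default_rng(0)
DIM = 6                                          # assumed state size
W = rng.normal(scale=0.1, size=(DIM, DIM))       # weighted dot product
W1 = rng.normal(scale=0.1, size=(DIM, 2 * DIM))  # 2-layer MLP, first layer
w2 = rng.normal(scale=0.1, size=DIM)             # 2-layer MLP, second layer

def dot_score(d, h):
    return d @ h

def weighted_dot_score(d, h):
    return d @ W @ h

def mlp_score(d, h):
    return w2 @ np.tanh(W1 @ np.concatenate([d, h]))

encoder_states = rng.normal(size=(3, DIM))
decoder_state = rng.normal(size=DIM)
for fn in (dot_score, weighted_dot_score, mlp_score):
    context, weights = attention_context(decoder_state, encoder_states, fn)
    print(fn.__name__, np.round(weights, 3))
```

Whatever f is used, the weights are non-negative and sum to one, so the context stays a convex combination of the encoder states.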
Attention [Figure: RNN cell combined (+) with the attention context]
Example: Google Neural Machine Translation
Why Convolution ● Recurrent networks are serial ● Cannot be parallelized ● "Distance" between a feature vector and different inputs is not constant ● Convolutional networks ● Can be parallelized (faster) ● "Distance" between a feature vector and different inputs is constant
Long range dependency capture with conv nets [Figure: stacked convolution layers; labels n and k]
Conv net, Recurrent net with Attention [Figure: encoders CNN-a and CNN-c with outputs z_1..z_4 and y_1..y_4; decoder states h_i, h_{i+1}; g_i; attention weights a_{i,1}..a_{i,4}; context c_i] $d_i = W_d h_i + g_i$ Gehring et al., A Convolutional Encoder Model for Neural Machine Translation (2016)
Two conv nets with attention [Figure: encoder inputs e_1..e_3 with outputs z_1..z_3; decoder inputs g_1..g_4 with states h_i, i = 1,...,4; d_i computed through W_d; attention weights a_{i,j}, i = 1,...,4, j = 1,...,3; contexts c_1..c_4] Gehring et al., Convolutional Sequence to Sequence Learning, 2017
Why Self-attention ● Recurrent networks are serial ● Cannot be parallelized ● "Distance" between a feature vector and different inputs is not constant ● Self-attention networks ● Can be parallelized (faster) ● "Distance" between a feature vector and different inputs does not depend on the input length
FCN with self-attention: Inputs and Previous Words -> Probability of the next words. Vaswani et al., Attention Is All You Need, 2017
Scaled dot-product attention: Query, Keys, Values
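Scaled dot-product attention computes softmax(Q K^T / sqrt(d_k)) V, as defined in Vaswani et al. (2017). The NumPy sketch below is one way to write it; the function name, the optional `mask` argument, and the example shapes are assumptions for illustration.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v) -> output (n_q, d_v), weights (n_q, n_k)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # similarity of every query with every key
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # masked-out positions get ~zero weight
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)        # softmax over the keys
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))   # 2 queries
K = rng.normal(size=(5, 4))   # 5 keys
V = rng.normal(size=(5, 3))   # 5 values
output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape, weights.shape)
```

Each output row is a weighted average of the value rows, and all queries are processed in one matrix product rather than one after another.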
Multi-Head Attention
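Multi-head attention runs several scaled dot-product attentions in parallel on different learned projections and concatenates the results. A minimal sketch, assuming random projection matrices in place of trained ones and the hypothetical names `multi_head_attention`, `params`, `D_MODEL`, and `NUM_HEADS`:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X_q, X_kv, params, num_heads):
    """X_q: (n_q, d_model) supplies queries; X_kv: (n_k, d_model) supplies keys and values."""
    W_q, W_k, W_v, W_o = params
    d_head = X_q.shape[-1] // num_heads
    Q, K, V = X_q @ W_q, X_kv @ W_k, X_kv @ W_v
    heads = []
    for h in range(num_heads):
        sl = slice(h * d_head, (h + 1) * d_head)   # columns belonging to head h
        weights = softmax(Q[:, sl] @ K[:, sl].T / np.sqrt(d_head))
        heads.append(weights @ V[:, sl])
    # Concatenate the heads and mix them with the output projection.
    return np.concatenate(heads, axis=-1) @ W_o

rng = np.random.default_rng(0)
D_MODEL, NUM_HEADS = 8, 2   # illustrative sizes (assumed)
params = [rng.normal(scale=0.1, size=(D_MODEL, D_MODEL)) for _ in range(4)]
x = rng.normal(size=(5, D_MODEL))
out = multi_head_attention(x, x, params, NUM_HEADS)   # self-attention: queries = keys = values
print(out.shape)
```

Passing the same sequence as `X_q` and `X_kv` gives self-attention; passing decoder states as `X_q` and encoder states as `X_kv` gives encoder-decoder attention.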
Encoder Self-attention
Decoder Self-attention • Almost the same as encoder self-attention • But only leftward positions are considered.
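Restricting attention to leftward positions is commonly implemented with a lower-triangular mask on the attention scores. The sketch below is a simplification: it uses the input itself as queries, keys, and values (no learned projections), and the name `causal_self_attention` is an assumption.

```python
import numpy as np

def causal_self_attention(X):
    """X: (n, d). Position i attends only to positions j <= i."""
    n, d = X.shape
    scores = X @ X.T / np.sqrt(d)
    mask = np.tril(np.ones((n, n), dtype=bool))   # True where j <= i
    scores = np.where(mask, scores, -np.inf)      # forbid looking to the right
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)                      # exp(-inf) = 0 for masked positions
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ X, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 6))
out, weights = causal_self_attention(X)
print(np.round(weights, 2))
```

Because the weights above the diagonal are zero, changing a later word cannot alter the outputs at earlier positions, which is what allows the decoder to be trained on whole sequences in parallel.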
Encoder-decoder attention: Decoder state, Encoder states
Overall Operation: Previous Words -> Next Word. Neural Machine Translation, Philipp Koehn
Reinforcement Learning ● Machine Translation/Summarization ● Dialog Systems
Why Reinforcement Learning ● Exposure bias ● In training, ground truths are used. In testing, the word generated in the previous step is used to generate the next word. ● Using generated words in training requires sampling: non-differentiable ● Maximum likelihood criterion is not directly relevant to evaluation metrics ● BLEU (machine translation) ● ROUGE (summarization) ● Using BLEU/ROUGE in training: non-differentiable
Sequence Generation as Reinforcement Learning ● Agent: The recurrent net ● State: Hidden layers, attention weights, etc. ● Action: Next word ● Policy: Generate the next word (action) given the current hidden layers and attention weights (state) ● Reward: Score computed using the evaluation metric (e.g. BLEU)
Maximum Likelihood Training (Revisit) Minimize the negative log likelihood
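For one training pair, the negative log likelihood is the sum over decoder steps of -log p(y_t | y_<t, x), where the ground-truth prefix is fed to the decoder (teacher forcing). The sketch below assumes made-up softmax outputs rather than a trained model; `VOCAB`, `step_probs`, and `negative_log_likelihood` are hypothetical names.

```python
import numpy as np

VOCAB = ["<start>", "Thanks", "Very", "Much", "<end>"]
TOKEN_ID = {tok: i for i, tok in enumerate(VOCAB)}

def negative_log_likelihood(step_probs, target_tokens):
    """-sum_t log p(y_t | y_<t, x) for one training pair.

    step_probs[t] is the softmax output at decoder step t, computed with the
    ground-truth prefix as decoder input (teacher forcing).
    """
    ids = [TOKEN_ID[tok] for tok in target_tokens]
    picked = step_probs[np.arange(len(ids)), ids]  # probability given to each true word
    return -np.sum(np.log(picked))

# Made-up softmax outputs for the 4 decoder steps (each row sums to 1).
step_probs = np.array([
    [0.05, 0.70, 0.10, 0.10, 0.05],
    [0.05, 0.10, 0.60, 0.20, 0.05],
    [0.05, 0.05, 0.10, 0.70, 0.10],
    [0.05, 0.05, 0.05, 0.05, 0.80],
])
targets = ["Thanks", "Very", "Much", "<end>"]
loss = negative_log_likelihood(step_probs, targets)
print(loss)
```

The loss only looks at the probability assigned to the ground-truth word at each step; it never scores the sequence the model would generate on its own.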
Reinforcement Learning Formulation: Minimize the expected negative reward, using the REINFORCE algorithm
Reinforcement Learning Details ● Expected reward ● We need the gradient ● Need to write this as an expectation, so that we can evaluate it using samples. Use the log derivative trick: ● This is an expectation ● Approximate this with the sample mean ● In practice we use only one sample
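The equations for these steps did not survive in the slide text. One standard way to write them is given below; the notation is an assumption ($\theta$ for the model parameters, $w^s$ for a sampled word sequence, $r$ for the reward, $p_\theta$ for the model distribution).

```latex
\begin{align}
L(\theta) &= -\,\mathbb{E}_{w^s \sim p_\theta}\!\left[ r(w^s) \right]
           = -\sum_{w^s} p_\theta(w^s)\, r(w^s) \\
\nabla_\theta L(\theta)
  &= -\sum_{w^s} \nabla_\theta p_\theta(w^s)\, r(w^s) \\
  &= -\sum_{w^s} p_\theta(w^s)\, \nabla_\theta \log p_\theta(w^s)\, r(w^s)
     \qquad \text{(log derivative trick: } \nabla_\theta p_\theta = p_\theta \nabla_\theta \log p_\theta \text{)} \\
  &= -\,\mathbb{E}_{w^s \sim p_\theta}\!\left[ r(w^s)\, \nabla_\theta \log p_\theta(w^s) \right] \\
  &\approx -\frac{1}{N} \sum_{n=1}^{N} r\!\left(w^{s,n}\right) \nabla_\theta \log p_\theta\!\left(w^{s,n}\right) \\
  &\approx -\,r(w^s)\, \nabla_\theta \log p_\theta(w^s) \qquad (N = 1)
\end{align}
```

The reward enters only as a scalar weight on the gradient of the log-probability, so the metric itself never needs to be differentiated.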
Reinforcement Learning Details ● Gradient ● This estimate has high variance. Use a baseline to combat this problem. ● The baseline can be anything independent of the sampled word sequence ● It can, for example, be estimated as the reward for the word sequence generated using argmax at each cell.
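A single-sample REINFORCE update with the argmax (greedy) sequence as baseline can be sketched on a toy problem. Everything here is an assumption made for illustration: the policy is one independent softmax per position rather than a recurrent net, `reward` is a simple stand-in for BLEU/ROUGE, and the learning rate and step count are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB_SIZE, LENGTH = 5, 4
reference = np.array([1, 2, 3, 4])   # hypothetical ground-truth token ids

def reward(seq):
    """Toy stand-in for BLEU/ROUGE: fraction of positions matching the reference."""
    return float(np.mean(seq == reference))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def reinforce_grad(logits, sample, baseline):
    """Single-sample gradient (w.r.t. logits) of the loss -(r - b) * log p(sample)."""
    probs = softmax(logits)
    grad_log_p = np.eye(VOCAB_SIZE)[sample] - probs   # d log p(sample) / d logits
    return -(reward(sample) - baseline) * grad_log_p

logits = np.zeros((LENGTH, VOCAB_SIZE))   # toy policy parameters
for step in range(500):
    probs = softmax(logits)
    sample = np.array([rng.choice(VOCAB_SIZE, p=p) for p in probs])  # one sample
    greedy = probs.argmax(axis=-1)                                   # argmax at each cell
    logits -= 0.5 * reinforce_grad(logits, sample, reward(greedy))   # gradient descent step

print(reward(softmax(logits).argmax(axis=-1)))
```

A sample that scores better than the greedy sequence has its probability raised, and one that scores worse has it lowered; when the two rewards are equal the update is exactly zero.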
Maximum Likelihood Dialog Systems [Figure: encoder reads "How Are You?"; decoder is fed "<start> I Am" and outputs "I Am Fine".]
Why Reinforcement Learning ● Maximum likelihood criterion is not directly relevant to successful dialogs ● Dull responses ("I don't know") ● Repetitive responses ● Need to integrate developer-defined rewards relevant to longer-term goals of the dialog
Dialog Generation as Reinforcement Learning ● Agent: The recurrent net ● State: Previous dialog turns ● Action: Next dialog utterance ● Policy: Generate the next dialog utterance (action) given the previous dialog turns (state) ● Reward: Score computed based on relevant factors such as ease of answering, information flow, semantic coherence, etc.
Training Setup [Figure: two agents, Agent 1 and Agent 2, each consisting of an encoder and a decoder]
Training Procedure ● From the viewpoint of a given agent, the procedure is similar to that of sequence generation ● REINFORCE algorithm ● Appropriate rewards must be calculated based on current and previous dialog turns. ● Can be initialized with maximum likelihood trained models.
Adversarial Learning ● Use a discriminator, as in GANs, to calculate the reward ● Same training procedure based on REINFORCE for the generator [Figure: Discriminator, Human Dialog]
Question Answering ● Slightly different from the sequence-to-sequence model. Variable length inputs (Question/Query, Passage/Document/Context) -> Model -> Fixed length output (single word answer / start-end points of the answer)
QA - Naive Approach ● Combine the question and the passage and use an RNN to classify the result. ● Will not work because the relationship between the passage and the question is not adequately captured. Variable length input (question and passage) -> Model -> Fixed length output (single word answer / start-end points of the answer)