Sequence-to-Sequence Transformations in Text Processing
Narada Warakagoda
Seq2seq Transformation
Variable length input → Model → Variable length output
Example Applications
- Summarization (extractive/abstractive)
- Machine translation
- Dialog systems / chatbots
- Text generation
- Question answering
Seq2seq Transformation
Variable length input → Model → Variable length output, but the model size should be constant.
Solution: apply a constant-sized neural net module repeatedly on the data.
Possible Approaches
- Recurrent networks
  - Apply the NN module in a serial fashion
- Convolutional networks
  - Apply the NN modules in a hierarchical fashion
- Self-attention
  - Apply the NN module in a parallel fashion
Processing Pipeline
Variable length input → Encoder → Intermediate representation → Decoder → Variable length output
Processing Pipeline
Variable length text → Embedding → Encoder → Intermediate representation → Decoder (with Attention) → Variable length output
Architecture Variants

Encoder                                   Decoder                                   Attention
Recurrent net                             Recurrent net                             No
Recurrent net                             Recurrent net                             Yes
Convolutional net                         Convolutional net                         No
Convolutional net                         Recurrent net                             Yes
Convolutional net                         Convolutional net                         Yes
Fully connected net with self-attention   Fully connected net with self-attention   Yes
Possible Approaches
- Recurrent networks
  - Apply the NN module in a serial fashion
- Convolutional networks
  - Apply the NN modules in a hierarchical fashion
- Self-attention
  - Apply the NN module in a parallel fashion
RNN-decoder with RNN-encoder
[Figure: RNN encoder-decoder translating the Norwegian "Tusen Takk" into "Thanks Very Much". The encoder reads the source words up to <end>; the decoder starts from <start>, and a softmax over the vocabulary follows each decoder step. Each box = RNN cell.]
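A minimal sketch of this encoder-decoder in PyTorch (the class name, vocabulary and layer sizes are illustrative, not from the slides):

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, emb=256, hidden=512):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.GRU(emb, hidden, batch_first=True)
        self.decoder = nn.GRU(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)   # feeds the softmax

    def forward(self, src, tgt_in):
        # Encode the variable-length source into a fixed-size state.
        _, state = self.encoder(self.src_emb(src))
        # Decode, initialized with the encoder's final state.
        dec_out, _ = self.decoder(self.tgt_emb(tgt_in), state)
        return self.out(dec_out)                  # logits; softmax at each step
```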
RNN-dec with RNN-enc, Training
[Figure: the same encoder-decoder during training. The ground truths "Thanks Very Much <end>" are used both as the (shifted) decoder inputs and as the targets of the softmax outputs.]
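One training step under this scheme (teacher forcing), reusing the Seq2Seq sketch above; the target batch is assumed to already contain <start> ... <end> token ids:

```python
import torch.nn as nn

loss_fn = nn.CrossEntropyLoss()

def train_step(model, optimizer, src, tgt):
    # Teacher forcing: ground-truth words are the decoder inputs, shifted
    # by one ("<start> Thanks Very Much" in, "Thanks Very Much <end>" out).
    tgt_in, tgt_out = tgt[:, :-1], tgt[:, 1:]
    logits = model(src, tgt_in)
    loss = loss_fn(logits.reshape(-1, logits.size(-1)), tgt_out.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```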
RNN-dec with RNN-enc, Decoding
[Figure: the same encoder-decoder at decoding time. The word chosen by each softmax ("Thanks", "Much", "Very") is fed back as the next decoder input, producing "Thanks Much Very <end>".]
Greedy decoding: at each step the single most probable word is chosen.
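A greedy-decoding sketch for the same model (batch size 1; start_id and end_id are the assumed <start>/<end> token ids):

```python
import torch

@torch.no_grad()
def greedy_decode(model, src, start_id, end_id, max_len=50):
    _, state = model.encoder(model.src_emb(src))   # encode the source
    token = torch.tensor([[start_id]])
    result = []
    for _ in range(max_len):
        dec_out, state = model.decoder(model.tgt_emb(token), state)
        token = model.out(dec_out).argmax(-1)      # keep only the best word
        if token.item() == end_id:
            break
        result.append(token.item())
    return result
```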
Decoding Approaches
- Optimal decoding
- Greedy decoding
  - Easy
  - Not optimal
- Beam search
  - Closer to the optimal decoder
  - Choose the top N candidates instead of the best one at each step
Beam Search Decoding
[Figure: beam search tree keeping the top 3 partial hypotheses at each step; beam width = 3.]
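A simplified beam search sketch. step_fn is a hypothetical callback returning log-probabilities over the vocabulary for a given prefix; there is no length normalization, for brevity:

```python
def beam_search(step_fn, start_id, end_id, beam_width=3, max_len=50):
    beams = [([start_id], 0.0)]            # (prefix, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for word, logp in enumerate(step_fn(prefix)):
                candidates.append((prefix + [word], score + logp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates:   # keep the top N, not just the best
            (finished if prefix[-1] == end_id else beams).append((prefix, score))
            if len(beams) == beam_width:
                break
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[1])[0]
```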
Straight-forward Extensions
[Figure: building blocks for straightforward extensions. An RNN cell maps the current input and current state to the next state; an LSTM cell additionally carries a control (cell) state; a bidirectional cell runs one pass in each direction over the input; a stacked cell puts several cells on top of each other.]
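In common frameworks these extensions are constructor flags; a PyTorch sketch with illustrative sizes:

```python
import torch.nn as nn

# LSTM cell instead of a vanilla RNN, two stacked layers,
# and a bidirectional pass over the input:
encoder = nn.LSTM(input_size=256, hidden_size=512,
                  num_layers=2,          # stacked cell
                  bidirectional=True,    # forward + backward over the input
                  batch_first=True)
```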
RNN-decoder with RNN-encoder with Attention
[Figure: the same encoder-decoder, where each decoder box = RNN cell + a context vector computed by attention over the encoder states.]
Attention
- The context is given by c_i = Σ_j α_{i,j} h_j
- The attention weights α_{i,j} are dynamic
- Generally defined by α_{i,j} = exp(e_{i,j}) / Σ_k exp(e_{i,k}), with e_{i,j} = f(s_{i-1}, h_j),
  where the function f can be defined in several ways (sketched after this list):
  - Dot product: f(s, h) = sᵀh
  - Weighted dot product: f(s, h) = sᵀW h
  - Use another MLP (e.g. 2 layers): f(s, h) = vᵀ tanh(W₁s + W₂h)
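A sketch of the three score functions and the resulting context vector; s is the decoder state, H the matrix of encoder states, and W, W1, W2, v stand for assumed learned parameters:

```python
import torch

def dot_score(s, H):                     # f(s, h_j) = s . h_j
    return H @ s

def general_score(s, H, W):              # f(s, h_j) = s^T W h_j
    return H @ (W @ s)

def mlp_score(s, H, W1, W2, v):          # f(s, h_j) = v^T tanh(W1 s + W2 h_j)
    return torch.tanh(H @ W2.T + W1 @ s) @ v

def context(s, H, score_fn, *params):
    alpha = torch.softmax(score_fn(s, H, *params), dim=0)  # attention weights
    return alpha @ H                     # c_i = sum_j alpha_{i,j} h_j
```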
Attention
[Figure: the attention module combined with the RNN cell: the context vector enters the cell together with the current input.]
Example: Google Neural Machine Translation
Possible Approaches
- Recurrent networks
  - Apply the NN module in a serial fashion
- Convolutional networks
  - Apply the NN modules in a hierarchical fashion
- Self-attention
  - Apply the NN module in a parallel fashion
Why Convolution
- Recurrent networks are serial
  - Unable to be parallelized
  - The “distance” between a feature vector and the different inputs is not constant
- Convolutional networks
  - Can be parallelized (faster)
  - The “distance” between a feature vector and the different inputs is constant
Long range dependency capture with conv nets
[Figure: a stack of convolutions with kernel width k over an input of length n; each layer widens the receptive field, so distant inputs are connected after about n/k layers.]
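A quick check of that growth, assuming stride-1, undilated convolutions:

```python
# Each extra layer with kernel width k adds (k - 1) inputs to the
# receptive field, so depth L covers L * (k - 1) + 1 positions.
def receptive_field(num_layers, k):
    return num_layers * (k - 1) + 1

print(receptive_field(3, 5))   # 3 layers of width-5 kernels see 13 inputs
```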
Conv net, Recurrent net with Attention
Gehring et al., A Convolutional Encoder Model for Neural Machine Translation (2016)
[Figure (Gehring et al. 2016): a convolutional encoder feeding an RNN decoder. Two encoder CNNs share the source embeddings: CNN-a produces the states z_j that are scored for attention, and CNN-c produces the states that the attention weights a_{i,j} combine into the context c_i. On the decoder side, the hidden state h_i and the previous target embedding g_i are combined as d_i = W h_i + g_i, and the outputs y_i are emitted step by step.]
Two conv nets with attention
Gehring et al., Convolutional Sequence to Sequence Learning (2017)
[Figure (Gehring et al. 2017): two convolutional nets with attention. Encoder convolutions (weights W) produce states z_j, j = 1,2,3, from source embeddings e_j; decoder convolutions (weights W_d) produce states h_i, i = 1,…,4, each combined with the target embedding g_i into d_i; attention weights a_{i,j} score d_i against the z_j and yield contexts c_i that are added back into the decoder.]
Possible Approaches
- Recurrent networks
  - Apply the NN module in a serial fashion
- Convolutional networks
  - Apply the NN modules in a hierarchical fashion
- Self-attention
  - Apply the NN module in a parallel fashion
Why Self-attention
- Recurrent networks are serial
  - Unable to be parallelized
  - The “distance” between a feature vector and the different inputs is not constant
- Self-attention networks
  - Can be parallelized (faster)
  - The “distance” between a feature vector and the different inputs does not depend on the input length
FCN with self-attention
[Figure: Transformer architecture: the encoder reads the inputs, the decoder reads the previous words and outputs the probability of the next words.]
Vaswani et al., Attention Is All You Need (2017)
Scaled dot-product attention
Attention(Q, K, V) = softmax(QKᵀ / √d_k) V, computed from a query Q, keys K and values V.
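A direct transcription of the formula (a sketch; the optional mask argument becomes relevant for decoder self-attention below):

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # QK^T / sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    return torch.softmax(scores, dim=-1) @ V
```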
Multi-Head Attention
- Several scaled dot-product attention heads run in parallel on different learned projections of Q, K and V; their outputs are concatenated and projected.
Encoder Self-attention
[Figure: self-attention within the encoder: each position attends over all positions of the previous layer, so queries, keys and values all come from the same place.]
Decoder Self-attention
- Almost the same as encoder self-attention
- But only leftward positions are considered (see the causal mask sketch below)
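A minimal sketch of that leftward restriction as a causal mask, usable as the mask argument of the attention sketch above:

```python
import torch

T = 4                                        # illustrative sequence length
causal_mask = torch.tril(torch.ones(T, T))   # position i attends to j <= i only
# Zeros above the diagonal turn into -inf scores before the softmax.
```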
Encoder-decoder attention
[Figure: queries come from the decoder state; keys and values come from the encoder states.]
Overall Operation
[Figure: step-by-step translation: the previous words are fed back into the decoder to predict the next word.]
From: Neural Machine Translation, Philipp Koehn
Reinforcement Learning
- Machine Translation / Summarization
- Dialog Systems
Why Reinforcement Learning
- Exposure bias
  - In training, ground truths are used; in testing, the word generated at the previous step is used to generate the next word
  - Using generated words in training requires sampling: non-differentiable
- The maximum likelihood criterion is not directly relevant to the evaluation metrics
  - BLEU (machine translation)
  - ROUGE (summarization)
  - Using BLEU/ROUGE in training: non-differentiable
Sequence Generation as Reinforcement Learning
- Agent: the recurrent net
- State: hidden layers, attention weights, etc.
- Action: the next word
- Policy: generate the next word (action) given the current hidden layers and attention weights (state)
- Reward: score computed using the evaluation metric (e.g. BLEU)
Maximum Likelihood Training (Revisited)
Minimize the negative log likelihood: L_ML(θ) = − Σ_t log p_θ(y*_t | y*_1 … y*_{t−1}, x), where the y*_t are the ground-truth words
Reinforcement Learning Formulation
Minimize the expected negative reward, L_RL(θ) = − E_{y∼p_θ}[ r(y) ], using the REINFORCE algorithm
Reinforcement Learning Details
- Expected reward: L(θ) = E_{y∼p_θ}[ r(y) ]
- We need the gradient ∇_θ L(θ)
- We need to write it as an expectation, so that we can evaluate it using samples. The log-derivative trick, ∇_θ p_θ(y) = p_θ(y) ∇_θ log p_θ(y), gives
  ∇_θ L(θ) = E_{y∼p_θ}[ r(y) ∇_θ log p_θ(y) ]
- This is an expectation: approximate it with the sample mean
- In practice we use only one sample y^s ∼ p_θ, so ∇_θ L(θ) ≈ r(y^s) ∇_θ log p_θ(y^s)
Reinforcement Learning Details
- Gradient with a baseline b: ∇_θ L(θ) ≈ (r(y^s) − b) ∇_θ log p_θ(y^s)
- The one-sample estimate has high variance; the baseline combats this problem
- The baseline can be anything independent of y^s
- It can, for example, be estimated as the reward of the word sequence generated using argmax at each cell (a sketch follows)
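A one-sample REINFORCE step with that argmax baseline, as a sketch; sample_seq and reward are hypothetical stand-ins for sampling from the decoder and for a metric such as BLEU/ROUGE:

```python
import torch

def reinforce_step(model, optimizer, src, reward, sample_seq):
    # y^s ~ p_theta, with logp = sum_t log p_theta(y^s_t | ...)
    y_sample, logp = sample_seq(model, src, greedy=False)
    with torch.no_grad():
        y_greedy, _ = sample_seq(model, src, greedy=True)
    baseline = reward(y_greedy)            # reward of the argmax sequence
    # Gradient of -(r(y^s) - b) * log p_theta(y^s); the baseline cuts variance.
    loss = -(reward(y_sample) - baseline) * logp
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```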
Reinforcement Learning
- Machine Translation / Summarization
- Dialog Systems
Maximum Likelihood Dialog Systems
[Figure: seq2seq dialog model: the encoder reads "How Are You?"; the decoder, started from <start> and fed "I Am …", generates "I Am Fine".]
Why Reinforcement Learning
- The maximum likelihood criterion is not directly relevant to successful dialogs
  - Dull responses (“I don’t know”)
  - Repetitive responses
- Need to integrate developer-defined rewards relevant to the longer-term goals of the dialog
Dialog Generation as Reinforcement Learning
- Agent: the recurrent net
- State: previous dialog turns
- Action: the next dialog utterance
- Policy: generate the next dialog utterance (action) given the previous dialog turns (state)
- Reward: score computed from relevant factors such as ease of answering, information flow, semantic coherence, etc.
Training Setup
[Figure: two agents, each with its own encoder and decoder, hold a conversation; each agent's decoder output becomes the other agent's encoder input.]
Training Procedure
- From the viewpoint of a given agent, the procedure is similar to that of sequence generation
  - REINFORCE algorithm
- Appropriate rewards must be calculated based on the current and previous dialog turns
- Can be initialized with maximum-likelihood-trained models
Adversarial Learning
- Use a discriminator, as in GANs, to calculate the reward
- Same training procedure based on REINFORCE for the generator
[Figure: the discriminator is trained to tell generated dialog from human dialog; its output serves as the generator's reward.]
Question Answering
- Slightly different from the sequence-to-sequence model.
Passage/Document/Context + Question/Query → Model → fixed-length output: a single-word answer, or the start and end points of the answer within the passage
QA: Naive Approach
- Combine the question and the passage and use an RNN to classify it.
- Will not work, because the relationship between the passage and the question is not adequately captured.
Question and passage → Model → fixed-length output (single-word answer, or start and end points of the answer)
QA: More Successful Approach
- Use attention between the question and the passage
  - Bi-directional attention, co-attention
- Temporal relationship modeling
- Classification, or prediction of the start and end points of the answer within the passage
QA Example with Bi-directional Attention
Seo et al., Bi-Directional Attention Flow for Machine Comprehension (2016)