slide-1
SLIDE 1

Sequence(s)-to-Sequence Transformations in Text Processing

Narada Warakagoda

slide-2
SLIDE 2

Seq2seq Transformation

[Figure: the model maps a variable-length input to a variable-length output.]

slide-3
SLIDE 3

Example Applications

  • Summarization (extractive/abstractive)
  • Machine translation
  • Dialog systems / chatbots
  • Text generation
  • Question answering
slide-4
SLIDE 4

Seq2seq Transformation

Model size should be constant, yet the model must handle variable-length input and variable-length output. Solution: apply a constant-sized neural net module repeatedly over the data (a sketch follows).
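A minimal sketch of this idea, assuming a plain RNN cell (all names and shapes are illustrative, not from the slides): the same fixed-size module is applied once per time step, so the parameter count does not depend on the input length.

```python
import numpy as np

def rnn_encode(inputs, W_xh, W_hh, b_h):
    """Apply one constant-sized RNN cell repeatedly over a variable-length input."""
    h = np.zeros(W_hh.shape[0])           # initial state
    for x in inputs:                       # one application of the same module per step
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)
    return h                               # fixed-size summary, whatever the input length

# The same weights handle sequences of any length:
d_in, d_h = 8, 16
rng = np.random.default_rng(0)
W_xh, W_hh, b_h = rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h)), np.zeros(d_h)
print(rnn_encode(rng.normal(size=(5, d_in)), W_xh, W_hh, b_h).shape)   # (16,)
print(rnn_encode(rng.normal(size=(12, d_in)), W_xh, W_hh, b_h).shape)  # (16,)
```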

slide-5
SLIDE 5

Possible Approaches

  • Recurrent networks: apply the NN module in a serial fashion
  • Convolutional networks: apply the NN modules in a hierarchical fashion
  • Self-attention: apply the NN module in a parallel fashion
slide-6
SLIDE 6

Processing Pipeline

[Figure: the encoder maps the variable-length input to an intermediate representation, which the decoder maps to the variable-length output.]

slide-7
SLIDE 7

Processing Pipeline

[Figure: variable-length text -> embedding -> encoder -> intermediate representation -> decoder (with attention) -> variable-length output.]

slide-8
SLIDE 8

Architecture Variants

Encoder                                  | Decoder                                  | Attention
-----------------------------------------|------------------------------------------|----------
Recurrent net                            | Recurrent net                            | No
Recurrent net                            | Recurrent net                            | Yes
Convolutional net                        | Convolutional net                        | No
Convolutional net                        | Recurrent net                            | Yes
Convolutional net                        | Convolutional net                        | Yes
Fully connected net with self-attention  | Fully connected net with self-attention  | Yes

slide-9
SLIDE 9

Possible Approaches

  • Recurrent networks: apply the NN module in a serial fashion
  • Convolutional networks: apply the NN modules in a hierarchical fashion
  • Self-attention: apply the NN module in a parallel fashion
slide-10
SLIDE 10

RNN-decoder with RNN-encoder

[Figure: the RNN encoder reads "Tusen Takk <end>"; the RNN decoder, starting from "<start>", emits "Thanks", "Very", "Much" through softmax layers. Each box = an RNN cell.]

slide-11
SLIDE 11

RNN-dec with RNN-enc, Training

[Figure: same encoder-decoder; during training the ground truths "Thanks Very Much <end>" are used as the softmax targets, and the ground-truth words are fed as the decoder inputs.]
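A minimal PyTorch sketch of this training setup, in which the ground-truth target words are both fed as decoder inputs and used as the softmax targets. The toy model and dimensions are assumptions for illustration, not the architecture on the slide.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, dim=64):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, tgt_vocab)            # logits -> softmax/cross-entropy

    def forward(self, src, tgt_in):
        _, h = self.encoder(self.src_emb(src))          # final encoder state
        dec_out, _ = self.decoder(self.tgt_emb(tgt_in), h)
        return self.out(dec_out)                        # logits at every target position

model = Seq2Seq(src_vocab=1000, tgt_vocab=1000)
src     = torch.randint(0, 1000, (2, 4))   # source ids, e.g. "Tusen Takk <end>"
tgt_in  = torch.randint(0, 1000, (2, 4))   # decoder inputs: "<start> Thanks Very Much"
tgt_out = torch.randint(0, 1000, (2, 4))   # ground truths:  "Thanks Very Much <end>"
loss = nn.CrossEntropyLoss()(model(src, tgt_in).flatten(0, 1), tgt_out.flatten())
loss.backward()
```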

slide-12
SLIDE 12

RNN-dec with RNN-enc, Decoding

[Figure: greedy decoding. The encoder reads "Tusen Takk <end>"; at each step the decoder's most probable word is fed back as the next input, here producing "Thanks Much Very <end>".]
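A sketch of the greedy decoding loop, where the most probable word at each step is fed back as the next decoder input. The `step` function, standing in for one decoder cell plus softmax, is an assumed interface, not something defined in the slides.

```python
import numpy as np

def greedy_decode(step, state, start_id, end_id, max_len=50):
    """step(token_id, state) -> (probabilities over the vocabulary, new state)."""
    token, output = start_id, []
    for _ in range(max_len):
        probs, state = step(token, state)
        token = int(np.argmax(probs))       # greedy: always keep the single best word
        if token == end_id:
            break
        output.append(token)
    return output
```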

slide-13
SLIDE 13

Decoding Approaches

  • Optimal decoding
  • Greedy decoding: easy, but not optimal
  • Beam search: closer to the optimal decoder; choose the top N candidates instead of only the best one at each step.

slide-14
SLIDE 14

Beam Search Decoding

Beam Width = 3
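A compact beam-search sketch using the same assumed `step` interface as in the greedy sketch: at every step the top N partial hypotheses (here N = the beam width) are kept, ranked by summed log-probability, instead of only the single best one.

```python
import numpy as np

def beam_search(step, state, start_id, end_id, beam_width=3, max_len=50):
    beams = [([start_id], 0.0, state, False)]            # (tokens, log-prob, state, finished)
    for _ in range(max_len):
        candidates = []
        for tokens, score, st, done in beams:
            if done:                                      # finished hypotheses carry over
                candidates.append((tokens, score, st, True))
                continue
            probs, new_st = step(tokens[-1], st)
            for tok in np.argsort(probs)[-beam_width:]:   # expand only the top-N words
                candidates.append((tokens + [int(tok)],
                                   score + float(np.log(probs[tok])),
                                   new_st, int(tok) == end_id))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(done for *_, done in beams):
            break
    return beams[0][0]                                    # best-scoring hypothesis
```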

slide-15
SLIDE 15

Straight-forward Extensions

[Figure: cell variants that map (current state, current input) to the next state: a plain RNN cell; an LSTM cell, which also carries a control (cell) state; a bidirectional cell; and a stacked cell.]

slide-16
SLIDE 16

RNN-decoder with RNN-encoder with Attention

[Figure: same encoder-decoder as before, translating "Tusen Takk <end>" into "Thanks Very Much", but each decoder box is now an RNN cell plus a context vector computed with attention.]

slide-17
SLIDE 17

Attention

  • The context is given by c_i = Σ_j α_ij h_j, a weighted sum of the encoder states h_j
  • The attention weights α_ij are dynamic
  • They are generally defined (in standard notation) by α_ij = softmax_j(e_ij), with e_ij = f(s_{i-1}, h_j),

where the function f can be defined in several ways (see the sketch below):

  • Dot product
  • Weighted dot product
  • Another MLP (e.g. 2 layers)
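A small sketch of these three choices of f, scoring a decoder state against every encoder state and turning the scores into attention weights and a context vector (shapes and names are illustrative assumptions).

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention(s, H, mode="dot", W=None, mlp=None):
    """s: decoder state (d,); H: encoder states (T, d) -> context vector (d,)."""
    if mode == "dot":                      # e_j = s . h_j
        scores = H @ s
    elif mode == "weighted_dot":           # e_j = s^T W h_j (bilinear)
        scores = H @ (W @ s)
    else:                                  # e_j = mlp([s; h_j]), e.g. a 2-layer MLP
        scores = np.array([mlp(np.concatenate([s, h])) for h in H])
    alphas = softmax(scores)               # dynamic attention weights
    return alphas @ H                      # context = weighted sum of encoder states
```
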
slide-18
SLIDE 18

Attention

[Figure: the attention context combined (+) with the RNN cell in the decoder.]

slide-19
SLIDE 19

Example: Google Neural Machine Translation

slide-20
SLIDE 20

Possible Approaches

  • Recurrent networks: apply the NN module in a serial fashion
  • Convolutional networks: apply the NN modules in a hierarchical fashion
  • Self-attention: apply the NN module in a parallel fashion
slide-21
SLIDE 21

Why Convolution

  • Recurrent networks are serial
    • Cannot be parallelized
    • The "distance" between a feature vector and the different inputs is not constant
  • Convolutional networks
    • Can be parallelized (faster)
    • The "distance" between a feature vector and the different inputs is constant

slide-22
SLIDE 22

Long range dependency capture with conv nets

[Figure: stacked convolutions with kernel size k over an input sequence of length n.]

slide-23
SLIDE 23

Conv net, Recurrent net with Attention

Gehring et al., A Convolutional Encoder Model for Neural Machine Translation (2016)

[Figure: convolutional encoder (CNN-a and CNN-c) with a recurrent decoder. The encoder produces outputs z_1 ... z_4; the decoder produces hidden states h_i and output words y_1 ... y_4. Attention weights a_{i,1} ... a_{i,4} give the context c_i, and the attention query is d_i = W h_i + g_i, where g_i is the embedding of the previous target word.]

slide-24
SLIDE 24

Two conv nets with attention

Gehring et al., Convolutional Sequence to Sequence Learning, 2017

[Figure: fully convolutional encoder and decoder (ConvS2S). The encoder produces outputs z_j and embeddings e_j (j = 1, 2, 3); the decoder produces states h_i (i = 1, 2, 3, 4), summaries d_i = W_d h_i + g_i, attention weights a_{i,j}, contexts c_i, and target embeddings g_i.]

slide-25
SLIDE 25

Possible Approaches

  • Recurrent networks: apply the NN module in a serial fashion
  • Convolutional networks: apply the NN modules in a hierarchical fashion
  • Self-attention: apply the NN module in a parallel fashion
slide-26
SLIDE 26

Why Self-attention

  • Recurrent networks are serial
    • Cannot be parallelized
    • The "distance" between a feature vector and the different inputs is not constant
  • Self-attention networks
    • Can be parallelized (faster)
    • The "distance" between a feature vector and the different inputs does not depend on the input length

slide-27
SLIDE 27

FCN with self-attention

[Figure: the model takes the inputs and the previous words and outputs the probability of the next word.]

Vaswani et al., Attention Is All You Need, 2017

slide-28
SLIDE 28

Scaled dot product attention

[Figure: scaled dot-product attention computed from a query, keys, and values.]
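A minimal sketch of scaled dot-product attention, softmax(Q K^T / sqrt(d_k)) V, as defined in Vaswani et al.; batching and masking are omitted for brevity.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v) -> (n_q, d_v)."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])                    # scaled dot products
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                         # weighted sum of the values
```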

slide-29
SLIDE 29

Multi-Head Attention

slide-30
SLIDE 30

Encoder Self-attention

[Figure: each encoder state attends over all the other encoder states (self-attention).]

slide-31
SLIDE 31

Decoder Self-attention

  • Almost the same as encoder self-attention
  • But only leftward positions are considered (a masking sketch follows).
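A sketch of how the leftward restriction is usually enforced: scores for positions to the right of the query are set to -inf before the softmax, so each position can only attend to earlier ones. The projection matrices here are illustrative.

```python
import numpy as np

def causal_self_attention(X, Wq, Wk, Wv):
    """X: (T, d) decoder inputs; each position attends only to leftward positions."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)    # True above the diagonal
    scores = np.where(mask, -np.inf, scores)                 # block rightward positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V
```
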
slide-32
SLIDE 32

Encoder-decoder attention

[Figure: the current decoder state attends over the encoder states.]

slide-33
SLIDE 33

Overall Operation

[Figure: overall operation: the previous words go in, the next word comes out.]

Neural Machine Translation, Philipp Koehn

slide-34
SLIDE 34

Reinforcement Learning

  • Machine Translation/Summarization
  • Dialog Systems
slide-35
SLIDE 35

Reinforcement Learning

  • Machine Translation/Summarization
  • Dialog Systems
slide-36
SLIDE 36

Why Reinforcement Learning

  • Exposure bias
    • In training, ground truths are used; in testing, the word generated in the previous step is used to generate the next word.
    • Using generated words in training requires sampling: non-differentiable
  • The maximum likelihood criterion is not directly relevant to the evaluation metrics
    • BLEU (machine translation)
    • ROUGE (summarization)
    • Using BLEU/ROUGE in training: non-differentiable
slide-37
SLIDE 37

Sequence Generation as Reinforcement Learning

  • Agent: the recurrent net
  • State: hidden layers, attention weights, etc.
  • Action: the next word
  • Policy: generate the next word (action) given the current hidden layers and attention weights (state)
  • Reward: a score computed using the evaluation metric (e.g. BLEU)

slide-38
SLIDE 38

Maximum Likelihood Training (Revisited)

Minimize the negative log likelihood
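In standard notation (the slide's own formula is an image and is reconstructed here), the objective for a target sequence y_1 ... y_T given an input x is:

```latex
L_{\mathrm{ML}}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(y_t \mid y_{1:t-1},\, x\right)
```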

slide-39
SLIDE 39

Reinforcement Learning Formulation

Minimize the expected negative reward, using the REINFORCE algorithm

slide-40
SLIDE 40

Reinforcement Learning Details

  • Expected reward
  • We need its gradient
  • We need to write this gradient as an expectation, so that we can evaluate it using samples; use the log-derivative trick (see the formulas below):
  • This is an expectation
  • Approximate it with a sample mean
  • In practice we use only one sample
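Reconstructed in standard notation, the expected negative reward, its gradient via the log-derivative trick, and the one-sample approximation are:

```latex
L(\theta) = -\,\mathbb{E}_{y \sim p_\theta}\!\left[r(y)\right], \qquad
\nabla_\theta L(\theta) = -\,\mathbb{E}_{y \sim p_\theta}\!\left[r(y)\, \nabla_\theta \log p_\theta(y)\right]
\approx -\, r(y^{s})\, \nabla_\theta \log p_\theta(y^{s}), \quad y^{s} \sim p_\theta .
```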
slide-41
SLIDE 41

Reinforcement Learning Details

  • Gradient
  • This estimate has high variance. Use a baseline to combat this problem.
  • The baseline can be anything independent of the sampled word sequence.
  • It can, for example, be estimated as the reward of the word sequence generated using argmax at each cell (a sketch follows).
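A PyTorch-style sketch of this kind of baseline: the reward of the argmax-decoded sequence is subtracted from the reward of the sampled sequence, and only the sampled sequence's log-probability is differentiated. The `model.sample`, `model.argmax_decode`, and `reward` interfaces are assumptions for illustration, not functions defined in the slides.

```python
import torch

def rl_loss(model, src, reward):
    """REINFORCE with a greedy-decoding baseline (interfaces are assumed)."""
    sample, log_probs = model.sample(src)            # y^s ~ p_theta, with per-word log-probs
    with torch.no_grad():
        baseline_seq = model.argmax_decode(src)      # argmax-decoded sequence as the baseline
    advantage = reward(sample) - reward(baseline_seq)    # r(y^s) - r(baseline)
    return -advantage * log_probs.sum()              # minimize the expected negative reward
```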

slide-42
SLIDE 42

Reinforcement Learning

  • Machine Translation/Summarization
  • Dialog Systems
slide-43
SLIDE 43

Maximum Likelihood Dialog Systems

[Figure: a seq2seq dialog model. The encoder reads "How Are You?"; the decoder, fed "<start> I Am", generates "I Am Fine".]

slide-44
SLIDE 44

Why Reinforcement Learning

  • The maximum likelihood criterion is not directly relevant to successful dialogs
    • Dull responses ("I don't know")
    • Repetitive responses
  • Need to integrate developer-defined rewards relevant to the longer-term goals of the dialog

slide-45
SLIDE 45

Dialog Generation as Reinforcement Learning

  • Agent: the recurrent net
  • State: previous dialog turns
  • Action: the next dialog utterance
  • Policy: generate the next dialog utterance (action) given the previous dialog turns (state)
  • Reward: a score computed based on relevant factors such as ease of answering, information flow, semantic coherence, etc.

slide-46
SLIDE 46

Training Setup

[Figure: two agents, each with its own encoder and decoder, converse with each other during training.]

slide-47
SLIDE 47

Training Procedure

  • From the viewpoint of a given agent, the procedure is similar to that of sequence generation
  • REINFORCE algorithm
  • Appropriate rewards must be calculated based on the current and previous dialog turns.
  • Can be initialized with maximum-likelihood-trained models.

slide-48
SLIDE 48

Adversarial Learning

  • Use a discriminator, as in GANs, to calculate the reward
  • Same training procedure based on REINFORCE for the generator

[Figure: the discriminator scores generated dialog against human dialog; its output is used as the reward.]

slide-49
SLIDE 49

Question Answering

  • Slightly different from the sequence-to-sequence model.

[Figure: the model takes variable-length inputs (a passage/document/context and a question/query) and produces a fixed-length output: a single-word answer, or the start and end points of the answer.]

slide-50
SLIDE 50

QA - Naive Approach

  • Combine the question and the passage and use an RNN to classify it.
  • This will not work, because the relationship between the passage and the question is not adequately captured.

[Figure: the model takes the combined question and passage as a variable-length input and produces a fixed-length output: a single-word answer, or the start and end points of the answer.]

slide-51
SLIDE 51

QA - More Successful Approach

  • Use attention between the question and the passage
    • Bi-directional attention, co-attention
  • Temporal relationship modeling
  • Classification, or prediction of the start and end points of the answer within the passage (a sketch of such a span-prediction head follows).
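A sketch of the start/end-point prediction mentioned in the last item: two linear layers score every passage position as a possible start or end of the answer span. Names and shapes are illustrative; `passage_repr` stands for the attention-enriched passage representation.

```python
import torch
import torch.nn as nn

class SpanHead(nn.Module):
    """Predict the start and end positions of the answer within the passage."""
    def __init__(self, dim):
        super().__init__()
        self.start = nn.Linear(dim, 1)
        self.end = nn.Linear(dim, 1)

    def forward(self, passage_repr):                         # (batch, passage_len, dim)
        start_logits = self.start(passage_repr).squeeze(-1)  # (batch, passage_len)
        end_logits = self.end(passage_repr).squeeze(-1)
        return start_logits, end_logits                      # softmax over positions

# Trained with cross-entropy against the true start/end indices:
head = SpanHead(dim=128)
passage_repr = torch.randn(2, 50, 128)
start_logits, end_logits = head(passage_repr)
loss = (nn.CrossEntropyLoss()(start_logits, torch.tensor([3, 10]))
        + nn.CrossEntropyLoss()(end_logits, torch.tensor([5, 12])))
```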

slide-52
SLIDE 52

QA Example with Bi-directional Attention

Bi-Directional Attention Flow for Machine Comprehension, Seo M. et al.