The Neural Noisy Channel: Generative Models for Sequence to Sequence Modeling - PowerPoint PPT Presentation



SLIDE 1

The Neural Noisy Channel:
 Generative Models
 for
 Sequence to Sequence Modeling

Chris Dyer

SLIDE 2

The Neural Noisy Channel:
 Generative Models
 for
 Sequence to Sequence Modeling

Chris Dyer

EVERYTHING

SLIDE 3

Text Classification

What is a discriminative problem?

SLIDE 4

Text Summary

What is a discriminative problem?

SLIDE 5

Text Translation

What is a discriminative problem?

SLIDE 6

Text Output

What is a discriminative problem?

  • Discriminative problems (in contrast to, e.g., density estimation, clustering, or dimensionality reduction problems) seek to select the correct output for a given text input
  • Neural network models are very good discriminative models, but they need a lot of training data to achieve good performance
SLIDE 7
  • Discriminative training objectives are similar to the following:
  • That is, they directly model the posterior distribution over outputs given inputs.
  • In many domains, we have lots of paired samples to train our models on, so this estimation problem is feasible.
  • We have also developed very powerful function classes for modeling complex relationships between inputs and outputs.

Discriminative Models

L(x, y, W) = log p(y | x; W)

SLIDE 8

Text Classification

L(W) = Σ_i log p(y_i | x_i; W)

(diagram: tokens x_1 … x_5 are encoded to predict the label y, modeling p(y | x))
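To make the objective concrete, here is a minimal sketch (assuming a linear softmax classifier as a stand-in for the networks on these slides) of computing L(W) = Σ_i log p(y_i | x_i; W):

```python
import numpy as np

def log_softmax(z):
    # Numerically stable log-softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def discriminative_loss(W, X, y):
    """Total conditional log-likelihood sum_i log p(y_i | x_i; W).

    X: (n, d) inputs; y: (n,) integer labels; W: (d, k) weights.
    """
    logp = log_softmax(X @ W)               # (n, k) log p(y | x; W)
    return logp[np.arange(len(y)), y].sum()

# Toy check: confident, correct weights give log-likelihood near 0.
X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([0, 1])
W = np.array([[2.0, -2.0], [-2.0, 2.0]])
print(discriminative_loss(W, X, y))         # ≈ -0.036
```

Training maximizes this quantity (equivalently, minimizes the negative log-likelihood) by gradient ascent on W.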

SLIDE 9
  • Generative models are a kind of density estimation problem:
  • They can, however, be used to compute the same conditional probabilities as discriminative models:
  • The renormalization by p(x) is cause for concern, but making the Bayes-optimal prediction under a 0-1 “cost” means we can ignore the renormalization:

Generative Models

L(x, y, W) = log p(x, y | W)

p(y | x) = p(x, y) / p(x) = p(x, y) / Σ_y′ p(x, y′)

ŷ = argmax_y p(y | x) = argmax_y p(x, y) / Σ_y′ p(x, y′) = argmax_y p(x, y)
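A tiny numeric illustration (with a made-up joint distribution) of why the renormalization can be ignored under 0-1 cost:

```python
import numpy as np

# Hypothetical joint p(x, y) over 3 inputs (rows) and 2 outputs (columns).
p_xy = np.array([[0.30, 0.10],
                 [0.05, 0.25],
                 [0.20, 0.10]])

def predict_posterior(x):
    # argmax_y p(y | x) = argmax_y p(x, y) / sum_y' p(x, y')
    return int(np.argmax(p_xy[x] / p_xy[x].sum()))

def predict_joint(x):
    # argmax_y p(x, y): dividing by p(x) does not change the argmax.
    return int(np.argmax(p_xy[x]))

# The Bayes-optimal decisions agree with and without renormalization.
assert all(predict_posterior(x) == predict_joint(x) for x in range(3))
```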

SLIDE 10
  • A traditionally useful way of formulating a generative model involves the application of Bayes’ rule.
  • This formulation posits the existence of two independent models: a prior probability over outputs p(y) and a likelihood p(x | y), which says how likely an input x is to be observed with output y.
  • Why might we favor this model?
  • Humans learn new tasks quickly from small amounts of data
  • But they often have a great deal of prior knowledge about the output space.
  • Outputs are chosen that justify the input, whereas in discriminative models, outputs are chosen that make the discriminative model happy.

Bayes’ Rule

p(y | x) = p(x | y) p(y) / p(x) = p(x | y) p(y) / Σ_y′ p(x | y′) p(y′)

SLIDE 11

But didn’t we use generative models
 and give them up for some reason?

SLIDE 12
  • Generative models frequently require modeling complex distributions, e.g., sentences, speech, images
  • Traditionally: complex distributions → lots of (conditional) independence assumptions (think naive Bayes, or n-grams, or HMMs)
  • Neural networks are powerful density estimators that figure out their own independence assumptions
  • The motivating hypothesis in this work: the previous empirical limits of generative models were due to bad independence assumptions, not the generative modeling paradigm per se.

Generative Neural Models

SLIDE 13
  • Let’s investigate empirical properties of generative vs. discriminative recurrent networks commonly used in NLP applications
  • Ng and Jordan (2001) show that linear models that are trained to generate have lower sample complexity (although higher asymptotic errors) than models that are trained to discriminate (naive Bayes vs. logistic regression)
  • What about nonlinear models such as neural networks?
  • Formal characterization of the generalization behavior of complex neural networks is difficult, with findings from convex problems failing to account for empirical facts about their generalization (Zhang et al., 2017)

Reasons for Optimism

SLIDE 14

Warm up: Text Classification

SLIDE 15

Warm up: Text Classification

{real news, fake news} x y

SLIDE 16

Discriminative Model

L(W) = Σ_i log p(y_i | x_i; W)

(diagram: tokens x_1 … x_5 are encoded to predict the label y, modeling p(y | x))

SLIDE 17

Generative Model

L(W) = Σ_i log p(x_i | y_i) p(y_i)

(diagram: the class embedding v_y and tokens x_1 … x_4 condition the generation of x_2 … x_5, i.e., p(x_2 | x_<2, y), p(x_3 | x_<3, y), p(x_4 | x_<4, y), p(x_5 | x_<5, y))
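As a minimal sketch of this generative classifier (using a smoothed bigram language model per class in place of the slide’s LSTM, on made-up data), classification picks the class whose language model best explains the text:

```python
from collections import Counter
import math

# Hypothetical toy training data: (text, label) pairs.
train = [("the markets rallied today", "real"),
         ("markets rallied on earnings", "real"),
         ("aliens endorse moon cheese", "fake"),
         ("moon cheese cures everything", "fake")]

def fit(docs):
    # One (bigram, unigram) count table per class, plus class priors.
    models, priors = {}, Counter()
    for text, y in docs:
        priors[y] += 1
        bi, uni = models.setdefault(y, (Counter(), Counter()))
        toks = ["<s>"] + text.split()
        for a, b in zip(toks, toks[1:]):
            bi[(a, b)] += 1
            uni[a] += 1
    return models, priors

def log_joint(models, priors, text, y, V=10000):
    # log p(y) + sum_t log p(x_t | x_{t-1}, y), add-one smoothed
    # (V is an assumed vocabulary size for smoothing).
    bi, uni = models[y]
    score = math.log(priors[y] / sum(priors.values()))
    toks = ["<s>"] + text.split()
    for a, b in zip(toks, toks[1:]):
        score += math.log((bi[(a, b)] + 1) / (uni[a] + V))
    return score

def classify(models, priors, text):
    # argmax_y p(x, y), as in the decision rule above.
    return max(models, key=lambda y: log_joint(models, priors, text, y))

models, priors = fit(train)
print(classify(models, priors, "markets rallied"))      # "real"
print(classify(models, priors, "moon cheese rallied"))  # "fake"
```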

SLIDE 18

Full Dataset Results (% accuracy)

Model                                  AGNews  DBPedia  Yahoo  Yelp Binary
Naive Bayes                              90.0     96.0   68.7         86.0
Kneser-Ney Bayes                         89.3     95.4   69.3         81.8
Discriminative LSTM                      92.1     98.7   73.7         92.6
Generative LSTM                          90.7     94.8   70.5         90.0
Bag of Words (Zhang et al., 2015)        88.8     96.6   68.9         92.2
char-CRNN (Xiao and Cho, 2016)           91.4     98.6   71.7         94.5
very deep CNN (Conneau et al., 2016)     91.3     98.7   73.4         95.7

SLIDE 19

Sample Complexity and Asymptotic Errors

(figure: four panels, one each for Sogou, Yahoo, DBPedia, and Yelp Binary, plotting % accuracy against log(#training examples + 1) for naive Bayes, Kneser-Ney Bayes, the discriminative LSTM, and the generative LSTM)

SLIDE 20
  • With discriminative training, we can use these class embeddings as the softmax weights
  • This technique is not successful, since the model (understandably) does not want to predict the new class when it is trained discriminatively
  • In the generative case, the model predicts instances of the new class with very high precision but very low recall
  • When we do self-training on these newly predicted examples, we are able to obtain good results in the zero-shot setting (about 60% of the time, depending on the hidden class)

Zero-shot Learning

SLIDE 21

Class                     Precision  Recall  Accuracy
company                        98.9    46.6      93.3
educational institution        99.2    49.5      92.8
artist                         88.3     4.3      90.3
athlete                        96.5    90.1      94.6
office holder                  89.1
means of transportation        96.5    74.3      94.2
building                       99.9    37.7      92.1
natural place                  98.9    88.2      95.4
village                        99.9    68.1      93.8
animal                         99.7    68.1      93.8
plant                          99.2    76.9      94.3
album                          0.03   0.001      88.8
film                           99.4    73.3      94.5
written work                   93.8    26.5      91.3

Zero-shot Learning

SLIDE 22

Adversarial Examples

  • Generative models also provide an estimate of p(x), that is, the marginal likelihood of the input.
  • The likelihood of the input is a good estimate of “what the model knows”. Adversarial examples that fall outside of it are a good indication that the model should stop what it’s doing and get help.
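A minimal sketch of this idea, assuming we already have per-input log p(x) scores from a generative model (the scores and the margin below are hypothetical): flag any input whose marginal likelihood falls well below what was seen in-domain.

```python
# Hypothetical log p(x) scores for in-domain inputs, from a trained
# generative model (stand-in values).
in_domain_logpx = [-42.0, -38.5, -40.1, -39.7]

def ood_threshold(scores, margin=5.0):
    # Anything `margin` nats below the worst in-domain score is suspect.
    return min(scores) - margin

def is_suspect(logpx, threshold):
    return logpx < threshold

t = ood_threshold(in_domain_logpx)
assert not is_suspect(-41.0, t)   # typical input: accept
assert is_suspect(-60.0, t)       # improbable input: abstain, "get help"
```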

SLIDE 23
  • Generative models of text approach their asymptotic errors more rapidly (better in the small-data regime), can handle new classes, can perform zero-shot learning by acquiring knowledge about a new class from an auxiliary task, and have a good estimate of p(x)
  • Discriminative models of text have lower asymptotic errors and faster training and inference times

Discussion

SLIDE 24

Main Course:
 Sequence to Sequence Modeling

SLIDE 25
  • Many problems in text processing can be formulated as sequence to sequence problems
  • Translation: input is a source language sentence, output is a target language sentence
  • Summarization: input is a document, output is a short summary
  • Parsing: input is a sentence, output is a (linearized) parse tree
  • Code generation: input is a text description of an algorithm, output is a program
  • Text to speech: input is an encoding of the linguistic features associated with how a text should be pronounced, output is a waveform.
  • Speech recognition: input is an encoding of a waveform (or spectrum), output is text.

Seq2Seq Modeling

SLIDE 26
  • State-of-the-art performance in most applications, provided enough data exists
  • But there are some serious problems
  • You can’t use “unpaired” samples of x and y to train the model
  • “Explaining away effects”: models like this learn to ignore “inconvenient” inputs (i.e., x) in favor of high-probability continuations of an output prefix (y_<i)

Seq2Seq Modeling

SLIDE 27

“Source model” “Channel model”

Generative: Seq2Seq Models

SLIDE 28

“Source model” “Channel model”

The world is colorful because of the Internet...

Generative: Seq2Seq Models

SLIDE 29

“Source model” “Channel model”

The world is colorful because of the Internet...

Generative: Seq2Seq Models

SLIDE 30

“Source model” “Channel model”

The world is colorful because of the Internet... 世界因互联网而更多彩...

Generative: Seq2Seq Models

SLIDE 31

“Source model” “Channel model”

The world is colorful because of the Internet... 世界因互联网而更多彩...

Source model can be estimated from
 unpaired y’s

Generative: Seq2Seq Models

SLIDE 32

“Source model” “Channel model”

The world is colorful because of the Internet... 世界因互联网而更多彩...

Generative: Seq2Seq Models

SLIDE 33

The world is colorful because of the Internet... 世界因互联网而更多彩...

Generative: Seq2Seq Models

SLIDE 34

The world is colorful because of the Internet... 世界因互联网而更多彩...

Generative: Seq2Seq Models

Is proposed output well-formed?

SLIDE 35

The world is colorful because of the Internet... 世界因互联网而更多彩...

Generative: Seq2Seq Models

Is proposed output well-formed? Does proposed output explain the observed input?

SLIDE 36

The world is colorful because of the Internet... 世界因互联网而更多彩...

Generative: Seq2Seq Models

Is proposed output well-formed? Does proposed output explain the observed input?

Model form avoids explaining away of inputs (“label bias”).

SLIDE 37
  • Features:
  • Component models can be researched, parameterised, trained, and even deployed separately.
  • Make principled use of unpaired data.
  • Outputs have to explain the input
  • Mitigate risks due to label bias (explaining away of inputs)
  • Detection of inputs that the model will be “unfamiliar” with
  • This work’s innovation: neural network component models
  • Training: straightforward.
  • Decoding: challenging.

Generative: Seq2Seq Models

SLIDE 38

Label Bias?

Label bias is a species of “explaining away” that causes trouble in directed (locally normalized) models.

(diagram: candidate transductions a b c → x y z, a b c′ → x y z, a b′ c → x y z, and d → w)

SLIDE 39

Label Bias?

Label bias is a species of “explaining away” that causes trouble in directed (locally normalized) models.

(diagram: candidate transductions a b c → x y z, a b c′ → x y z, a b′ c → x y z, d → w, and now a b′ d → x y z)

SLIDE 40
  • We retain the standard decision rule:
  • Challenges
  • The hypothesis space is very large (Σ*, in fact)
  • → We need to factorise the search problem
  • This is somewhat easy to do in a direct model (chain rule!)
  • But even there we can only approximate the search

Decoding
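The approximate search in the direct model can be sketched as a small beam search over the chain-rule factorization; the next-token distribution below is a hypothetical stand-in for a neural decoder:

```python
import math

def next_logprobs(prefix):
    # Toy next-token log-probabilities p(y_i | y_<i, x); a stand-in for
    # a trained decoder's softmax over the vocabulary.
    table = {(): {"a": math.log(0.7), "b": math.log(0.3)},
             ("a",): {"b": math.log(0.8), "</s>": math.log(0.2)},
             ("b",): {"</s>": math.log(1.0)},
             ("a", "b"): {"</s>": math.log(1.0)}}
    return table[tuple(prefix)]

def beam_search(k=2, max_len=3):
    # Extend each hypothesis one token at a time; keep the top-k prefixes
    # by accumulated log p(y_<i | x); collect finished hypotheses.
    beam, done = [([], 0.0)], []
    for _ in range(max_len):
        cands = []
        for prefix, score in beam:
            for tok, lp in next_logprobs(prefix).items():
                if tok == "</s>":
                    done.append((prefix, score + lp))
                else:
                    cands.append((prefix + [tok], score + lp))
        beam = sorted(cands, key=lambda c: -c[1])[:k]
        if not beam:
            break
    return max(done, key=lambda c: c[1])[0]

print(beam_search())  # ['a', 'b']
```

Note that greedy decoding (k=1) would also work here, but in general beam search only approximates the true argmax.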

SLIDE 41

Direct model:

Decoding: Direct Model vs. Generative Model

SLIDE 42

Direct model:

Chain rule!

Decoding: Direct Model vs. Generative Model

SLIDE 43

Direct model: Not perfect, but

Chain rule!

Decoding: Direct Model vs. Generative Model

(Compare to using greedy decoding with MEMMs)

SLIDE 44

Generative model (naive):

Decoding: Direct Model vs. Generative Model

SLIDE 45

Generative model (naive):

Decoding: Direct Model vs. Generative Model

Chain rule!

SLIDE 46

Generative model (naive):

Decoding: Direct Model vs. Generative Model

Probability doesn’t work
 like this.

SLIDE 47

Decoding: Generative Model

Outline of solution: introduce a latent variable z that determines when enough of the conditioning context has been read to generate another symbol. How much of y do we need to read to model the jth token of x?

SLIDE 48

The Segment to Segment
 Neural Transduction Model

(diagram labels: conditioning context, output sequence)

Introduced as a direct model by Yu et al. (2016). It’s a good direct model! It is also exactly what we need for the channel model. Similar model: Graves (2012).
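A toy sketch of the latent-alignment marginalization described above, with uniform, made-up probabilities standing in for the model’s learned read/emit distributions, as a forward dynamic program over how much of y has been read before each token of x is emitted:

```python
def forward_marginal(y_len, x_len, p_read=0.5, p_emit_token=0.1):
    # alpha[i][j]: probability of having emitted the first j tokens of x
    # after reading i tokens of y, summed over all monotone alignments z.
    # p_emit_token stands in for p(x_j | x_<j, y_<=i); p_read for advancing i.
    alpha = [[0.0] * (x_len + 1) for _ in range(y_len + 1)]
    alpha[0][0] = 1.0
    for i in range(y_len + 1):
        for j in range(x_len + 1):
            if i > 0:   # read one more conditioning token
                alpha[i][j] += alpha[i - 1][j] * p_read
            if j > 0:   # emit the next output token
                alpha[i][j] += alpha[i][j - 1] * (1 - p_read) * p_emit_token
    return alpha[y_len][x_len]

# With 2 reads and 3 emits there are C(5, 2) = 10 monotone alignments,
# each with probability 0.5**2 * (0.5 * 0.1)**3.
print(forward_marginal(2, 3))
```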

SLIDE 49

Expensive to go through every token y_j in the vocabulary and calculate its score. Use an auxiliary direct model p(y | x) to guide the search.

Decoding with the Segment to Segment
 Neural Transduction Model

SLIDE 50

Possible proposals: Chinese markets open Chinese markets closed Market close Financial markets

Decoding with the Segment to Segment
 Neural Transduction Model

SLIDE 51

Possible proposals: Chinese markets open Chinese markets closed Market close Financial markets

Expanded objective

Decoding with the Segment to Segment
 Neural Transduction Model
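Rescoring such proposals can be sketched as follows; the component log-probabilities, interpolation weight, and length bias are all hypothetical stand-ins for the trained channel, language, and direct models:

```python
# Hypothetical component scores for candidate outputs y given one input x.
candidates = {
    "chinese markets open":   {"channel": -4.0, "lm": -6.0, "direct": -2.0},
    "chinese markets closed": {"channel": -3.0, "lm": -6.5, "direct": -2.5},
    "financial markets":      {"channel": -9.0, "lm": -5.0, "direct": -3.0},
}

def noisy_channel_score(s, length, lam_direct=0.3, length_bias=0.2):
    # log p(x|y) + log p(y) + lambda * log p(y|x) + bias * |y|
    return (s["channel"] + s["lm"]
            + lam_direct * s["direct"] + length_bias * length)

def decode(cands):
    # Pick the proposal with the best combined objective.
    return max(cands, key=lambda y: noisy_channel_score(
        cands[y], length=len(y.split())))

print(decode(candidates))  # "chinese markets closed"
```

Here the channel term rewards outputs that explain the input, while the language model term rewards well-formed outputs, matching the two questions posed on the earlier slides.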

SLIDE 52

Experiments

  • Abstractive Sentence Summarisation
  • Machine translation
SLIDE 53

Abstractive Sentence Summarisation

  • Task: generating a condensed version of a sentence while preserving its meaning.
  • Data: pairs of headlines and first sentences of articles.
  • Example:
  • Source: vietnam will accelerate the export of industrial goods mainly by developing auxiliary industries , and helping enterprises sharpen competitive edges , according to the ministry of industry on thursday .
  • Target: vietnam to boost industrial goods export
  • Evaluation using ROUGE* (higher is better)
SLIDE 54

Abstractive Sentence Summarisation

Model                          Parallel data  Data for LM  ROUGE-1  ROUGE-2  ROUGE-L
Direct                         1m             -            30.78    14.67    28.57
Direct                         3.8m           -            33.82    16.66    31.50
Channel + LM + bias            1m             1m           31.96    14.89    29.51
Direct + channel + LM + bias   1m             1m           33.18    15.65    30.45
Channel + LM + bias            1m             3.8m         32.51    15.00    29.90
Direct + channel + LM + bias   1m             3.8m         33.35    15.77    30.68
Channel + LM + bias            3.8m           3.8m         34.12    16.41    31.38
Direct + channel + LM + bias   3.8m           3.8m         34.41    16.86    31.83

SLIDE 59

Abstractive Sentence Summarisation

  • State-of-the-art results across many different models on ROUGE-2
  • Best existing model for incorporating unpaired data
  • Human annotators preferred summaries from the generative model 2 to 1

SLIDE 60

Machine Translation

  • Medium-sized Chinese-English news parallel data
  • Large LSTM language model trained on English news + target side of parallel data
  • Evaluation using BLEU-4 (higher is better)
SLIDE 61

Machine Translation

Model                          BLEU
seq2seq w/o attention          11.19
seq2seq with attention         25.27
Direct model                   23.33
Direct + LM + bias             23.33
Channel + LM + bias            26.28
Direct + channel + LM + bias   26.44


SLIDE 64

Discussion

  • Generative models have benefits for “discriminative problems”
  • Learning efficiency
  • Improved sample complexity
  • Approach the asymptotic error rate more rapidly, although with higher asymptotic errors (empirical observation)
  • Incorporation of unpaired training samples / prior knowledge
  • Avoid label bias / explaining-away effects
SLIDE 65

Thanks!