Generative Models for Discriminative Problems
Chris Dyer, DeepMind. ASRU 2017, December 19, 2017.
Terminological clarification

A discriminative problem: for some input x, find Y(x), the most likely y in a set.

Discriminative models represent p(y | x) directly: logistic/linear/... regressions, MLPs, CRFs, MEMMs, seq2seq (+attention).

Generative models represent the joint p(x, y), often by breaking it into p(y) p(x | y): naive Bayes, GMMs, HMMs, PCFGs, the IBM translation models.
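Spelled out, both families target the same decision rule (a standard identity, independent of the model family):

Y(x) = argmax_y p(y | x)
     = argmax_y p(x, y)          (p(y | x) is proportional to p(x, y) for fixed x)
     = argmax_y p(y) p(x | y)    (Bayes' rule / the generative factorization)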
Discriminative seq2seq models currently deliver state-of-the-art results, e.g., in ASR (Chiu et al., last week) and MT (Bentivogli et al., 2016).
So why still care about generative models?

- Better sample efficiency than discriminative models (Ng & Jordan, 2001)
- Conditional generation → even better sample efficiency
- Learning from unlabeled samples, semisupervised learning, exploit natural conditional independencies
- Generative models "know what they know"
But generative models fell out of favor:

- Inputs x are drawn from complex distributions (sentences, documents, speech, images)
- The classical generative toolkit (naive Bayes, n-grams, HMMs, statistical translation models) copes by making independence assumptions!
- Usually "bad independence assumptions"!
Running examples:
- Classification: x = a document → y = POLITICS
- Language modeling: x = Colorless green ideas sleep furiously
- Translation: x = Welcome to Okinawa → y = 沖縄へようこそ。(Japanese for "Welcome to Okinawa.")
Revisiting the classical generative/discriminative pair (Ng and Jordan, 2001): does the story hold with neural models, on datasets "at scale", across a range of data conditions? (Yogatama, D, et al., arXiv 2017)
Discriminative LSTM: maximize

L(W) = Σ_i log p(y_i | x_i; W)

Generative LSTM: maximize

L(W) = Σ_i log p(x_i | y_i) p(y_i)

where the class-conditional language model factors by the chain rule,

p(x | y) = p(x1 | y) p(x2 | x<2, y) p(x3 | x<3, y) p(x4 | x<4, y) p(x5 | x<5, y) ...

and the joint is p(x, y) = p(y) p(x | y).
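A minimal sketch of such a generative classifier, assuming a particular architecture (class embedding as the LSTM's initial state, learned prior logits); the paper's exact parameterization may differ:

```python
import torch
import torch.nn as nn

class GenerativeLSTMClassifier(nn.Module):
    """Class-conditional LSTM LM: fit p(x | y) p(y), classify by Bayes' rule."""
    def __init__(self, vocab_size, num_classes, emb_dim=64, hid_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.class_embed = nn.Embedding(num_classes, hid_dim)   # v_y conditions the LM
        self.lstm = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)
        self.prior_logits = nn.Parameter(torch.zeros(num_classes))  # learned p(y)

    def log_joint(self, x, y):
        """log p(x, y) = log p(y) + sum_j log p(x_j | x_<j, y).

        x: (batch, time) token ids, where x[:, 0] is a BOS symbol.
        y: (batch,) class ids.
        """
        h0 = self.class_embed(y).unsqueeze(0)                   # (1, batch, hid)
        hidden, _ = self.lstm(self.embed(x[:, :-1]), (h0, torch.zeros_like(h0)))
        logp = self.out(hidden).log_softmax(-1)                 # (batch, time-1, vocab)
        tok = logp.gather(2, x[:, 1:].unsqueeze(-1)).squeeze(-1).sum(-1)
        return self.prior_logits.log_softmax(0)[y] + tok

    def classify(self, x):
        # Generative classification: evaluate log p(x, y) for every class y.
        ys = [torch.full((x.size(0),), c, dtype=torch.long)
              for c in range(self.prior_logits.numel())]
        scores = torch.stack([self.log_joint(x, y) for y in ys], dim=1)
        return scores.argmax(dim=1)
```

Training maximizes log_joint over the labeled pairs; note that classify already previews a cost discussed below: one LM pass per class.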
Text classification, test accuracy (%):

Model                                  AGNews  DBPedia  Yahoo  Yelp Binary
Bag of Words (Zhang et al., 2015)       88.8    96.6    68.9    92.2
char-CRNN (Xiao and Cho, 2016)          91.4    98.6    71.7    94.5
very deep CNN (Conneau et al., 2016)    91.3    98.7    73.4    95.7
Discriminative LSTM                     92.1    98.7    73.7    92.6
Naive Bayes                             90.0    96.0    68.7    86.0
Kneser-Ney Bayes                        89.3    95.4    69.3    81.8
Generative LSTM                         90.7    94.8    70.5    90.0
[Figure: test-set accuracy (20-80%) vs. log(training instances + 1) on Yahoo! Answers data (1,395,000 instances / 10 classes), comparing KN-Bayes, naive Bayes, generative LSTM, and discriminative LSTM.]

[Figure: the same learning curves (% accuracy vs. log(#training + 1)) for Sogou, Yahoo, DBPedia, and Yelp Binary, each comparing naive Bayes, KN Bayes, disc LSTM, and gen LSTM.]
As in Ng and Jordan (2001): generative models approach their asymptotic errors more rapidly (better in the small-data regime) and give a good estimate of p(x); discriminative models reach lower asymptotic errors and have faster training and inference time. A caveat for generative classification: you have to evaluate the likelihood of the document for every class!
Back to the running examples. Next problem: modeling sentences together with their syntax (x = Colorless green ideas sleep furiously).
Recurrent neural network grammars (RNNGs): a joint generative model of sentences and their trees. Each action is conditioned on the history of compressed elements (completed constituents) and non-compressed terminals; the derivation is a tree+sequence representation (other traversal orders are possible). (D, et al., ACL 2016; Kuncoro, D, et al., EACL 2017)
Derivation of "(S (NP The hungry cat) (VP meows) .)" as stack states, actions, and probabilities:

stack                                   action       probability
(empty)                                 NT(S)        p(nt(S) | top)
(S                                      NT(NP)       p(nt(NP) | (S)
(S (NP                                  GEN(The)     p(gen(The) | (S, (NP)
(S (NP The                              GEN(hungry)  p(gen(hungry) | (S, (NP, The)
(S (NP The hungry                       GEN(cat)     p(gen(cat) | ...)
(S (NP The hungry cat                   REDUCE       p(reduce | ...)
(S (NP The hungry cat)                  NT(VP)       p(nt(VP) | (S, (NP The hungry cat))
(S (NP The hungry cat) (VP              GEN(meows)   ...
(S (NP The hungry cat) (VP meows        REDUCE       ...
(S (NP The hungry cat) (VP meows)       GEN(.)       ...
(S (NP The hungry cat) (VP meows) .     REDUCE       ...
(S (NP The hungry cat) (VP meows) .)    (complete)

REDUCE compresses the just-completed constituent, e.g. "(NP The hungry cat", into a single composite symbol.
Trees paired with sentences correspond one-to-one to sequences of actions (specifically, the DFS, left-to-right traversal of the trees), and the stack summarizes the history of actions. Decomposing with the chain rule, i.e.

p(x, y) = p(actions(x, y))                           (prop 1)
p(actions(x, y)) = Π_i p(ai | a<i)                   (chain rule)
                 = Π_i p(ai | stack(a<i))            (prop 2)
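A small sketch of prop 1 (illustrative code, not the paper's implementation): the bijection between a tree and its top-down action sequence is just a left-to-right DFS, here over trees written as nested tuples.

```python
def tree_to_actions(tree):
    """Linearize a constituency tree into NT/GEN/REDUCE actions (left-to-right DFS)."""
    actions = []
    def visit(node):
        if isinstance(node, tuple):           # nonterminal: (label, child1, ...)
            actions.append(("NT", node[0]))   # open a constituent
            for child in node[1:]:
                visit(child)
            actions.append(("REDUCE", None))  # close it (compose the constituent)
        else:                                 # terminal word
            actions.append(("GEN", node))
    visit(tree)
    return actions

tree = ("S", ("NP", "The", "hungry", "cat"), ("VP", "meows"), ".")
print(tree_to_actions(tree))
# p(x, y) is then the product over steps of p(a_i | stack(a_<i)),
# each factor a softmax over actions in the full neural model.
```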
[Figure: the stack "(S (NP The hungry cat) (VP meows" encoded by a stack LSTM, with hidden states h1 h2 h3 h4 summarizing successively deeper stack prefixes.]
[Figure: the composition function. On REDUCE we need a representation for the completed constituent (NP The hungry cat): the label NP, the children The, hungry, cat, and the closing bracket are read off the stack and composed into a single vector v that stands in for the whole constituent.]
[Figure: after REDUCE, the stack holds "(S (NP The hungry cat) (VP meows)" and the stack LSTM states h1 ... h4 are updated accordingly.] Stack LSTMs: (D, et al., ACL 2015; Ballesteros, D, et al., EMNLP 2015)
Inference: we cannot exhaustively evaluate all candidate y's. Instead, train a discriminative model q(y | x), which approximates the posterior over trees, and use it (via importance sampling, developed below) to approximate the inference problems (maximization, marginalization) we care about.
Parsing results (PTB, F1):

Model                                   Type                F1
Petrov and Klein (2007)                 Gen                 90.1
Shindo et al. (2012), single model      Gen                 91.1
Vinyals et al. (2015), PTB only         Disc                90.5
Shindo et al. (2012), ensemble          Gen+Ensemble        92.4
Vinyals et al. (2015), semisupervised   Disc+SemiSup        92.8
Discriminative RNNG, PTB only           Disc                91.7
Generative RNNG, PTB only               Gen                 93.6
Choe and Charniak (2016), semisup.      Gen+SemiSup         93.8
Fried et al. (2017)                     Gen+Semi+Ensemble   94.7
Takeaway: a single generative model serves for both language modeling and parsing, and the parser gets better with more data.
Last problem: translation (x = Welcome to Okinawa → y = 沖縄へようこそ。).
A worry about direct seq2seq models: they can ignore "inconvenient" inputs (i.e., x) in favor of high-probability continuations of an output prefix (y<i). (Yu, D, et al., ICLR 2017) Label bias is a species of "explaining away" that causes trouble in directed (locally normalized) models. [Figure: a toy transduction example illustrating label bias.]
The noisy channel: decode with Bayes' rule, p(y | x) ∝ p(y) p(x | y).

- "Source model" p(y): e.g., y = The world is colorful because of the Internet.
- "Channel model" p(x | y): e.g., x = 世界はインターネットのためにカラフルです。(Japanese for "The world is colorful because of the Internet.")
- The source model can be estimated from unpaired y's.
- The inference model form avoids explaining away of inputs ("label bias").
Can we do this with models without bad independence assumptions? In earlier sampling/reranking attempts the number of samples (k) was massive, and the channel model didn't help unless k was even bigger. So what is a workable parameterization for a noisy channel MT model?
Direct model:

p(y | x) = Π_i p(y_i | y<i, x)    Chain rule!

Not perfect, but a well-defined distribution. (Compare to using greedy decoding with MEMMs.)

Generative model (naive): try the same trick for the channel,

p(x | y) = Π_j p(x_j | x<j, y<i)    "Chain rule!"

with each x_j conditioned on only the prefix of y generated so far. Probability doesn't work like this: the conditioning context cannot silently depend on how much of y happens to have been generated.
Outline of solution: introduce a latent variable z that determines when enough of the conditioning context has been read to generate another symbol, i.e., how much of y do we need to read to model the jth token of x. [Figure: alignment between the conditioning context and the output sequence.] Introduced as a direct model by Yu et al. (2016); it's a good direct model, and it is also exactly what we need for the channel model. Similar model: Graves (2012).
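A minimal sketch of the underlying computation, under simplifying assumptions: the latent z is a monotone read/emit alignment, and log_emit / log_shift are hypothetical stand-ins for the model's parameterized scores. The forward dynamic program below marginalizes z in p(x | y) = Σ_z Π_j p(x_j | x<j, y<z_j).

```python
import math

def _logadd(*xs):
    m = max(xs)
    if m == float("-inf"):
        return m
    return m + math.log(sum(math.exp(v - m) for v in xs))

def log_p_x_given_y(x, y, log_emit, log_shift):
    """log p(x | y), marginalizing monotone alignments by a forward DP.

    log_emit(j, i):  log prob of emitting x[j] with read head after y[:i]
    log_shift(j, i): log prob of reading one more y token (head i -> i+1)
    Both are assumed interfaces standing in for neural scores.
    """
    J, I = len(x), len(y)
    NEG_INF = float("-inf")
    # alpha[j][i]: log prob of having emitted x[:j] with read head after y[:i]
    alpha = [[NEG_INF] * (I + 1) for _ in range(J + 1)]
    alpha[0][0] = 0.0
    for j in range(J + 1):
        for i in range(I + 1):
            if alpha[j][i] == NEG_INF:
                continue
            if i < I:   # read the next y token
                alpha[j][i + 1] = _logadd(alpha[j][i + 1], alpha[j][i] + log_shift(j, i))
            if j < J:   # emit the next x token
                alpha[j + 1][i] = _logadd(alpha[j + 1][i], alpha[j][i] + log_emit(j, i))
    # Sum over final read positions (a real model might force i = I).
    return _logadd(*[alpha[J][i] for i in range(I + 1)])
```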
Decoding: it is expensive to go through every token yj in the vocabulary and calculate channel scores, so use an auxiliary direct model q(y, z | x) to guide the search. Possible proposals: Chinese markets open; Chinese markets closed; Market close; Financial markets.
Expanded objective: score candidates with the direct model, the channel model, a language model (trained on news + the target side of the parallel data), and a bias term.
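A sketch of how such an interpolated score could look; interpreting "bias" as a length bonus and treating the weights as tunable hyperparameters are both assumptions here, not the paper's exact recipe:

```python
def combined_score(log_q_y_given_x, log_p_x_given_y, log_p_y, y_len,
                   lam_direct=1.0, lam_channel=1.0, lam_lm=1.0, lam_bias=1.0):
    """score(y) = l1*log q(y|x) + l2*log p(x|y) + l3*log p(y) + l4*|y|."""
    return (lam_direct * log_q_y_given_x
            + lam_channel * log_p_x_given_y
            + lam_lm * log_p_y
            + lam_bias * y_len)

# Usage: rank the proposal model's candidates by the combined score and
# output the argmax; dropping terms recovers the ablations in the table below.
```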
MT results (BLEU):

Model                          BLEU
Seq2seq with attention         25.27
Direct model (q by itself)     23.33
Direct + LM + bias             23.33
Channel + LM + bias            26.28
Direct + channel + LM + bias   26.44

(The rows with the channel model are generative; the seq2seq and direct rows are discriminative.)
Wrapping up:

- Generative models are competitive again when paired with learning to do inference.
- What determines the "generative" vs. "discriminative" regime, and where is the crossover point?
- A generative model "knows what it knows": examples that fall out of this are a good indication that the model should stop what it's doing and get help.
Bonus: zero-shot classification. Learn word embeddings with an auxiliary task and use them as class embeddings vy; train the generative classifier on n − 1 classes and classify documents of the held-out class via its embedding.
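A sketch of that idea, assuming the GenerativeLSTMClassifier from the earlier sketch and a hypothetical word_vec lookup whose vectors match the LSTM hidden size; this is one plausible realization, not necessarily the paper's:

```python
import torch

def zero_shot_log_likelihood(model, x, class_name, word_vec):
    """Score log p(x | y=class_name) with v_y taken from pretrained word embeddings."""
    v_y = torch.as_tensor(word_vec(class_name), dtype=torch.float32)  # (hid_dim,)
    # Substitute the word embedding for the learned class embedding v_y.
    h0 = v_y.expand(x.size(0), -1).unsqueeze(0).contiguous()          # (1, batch, hid)
    hidden, _ = model.lstm(model.embed(x[:, :-1]), (h0, torch.zeros_like(h0)))
    logp = model.out(hidden).log_softmax(-1)
    return logp.gather(2, x[:, 1:].unsqueeze(-1)).squeeze(-1).sum(-1)
```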
Zero-shot results, per held-out class:

Class                     Precision  Recall  Accuracy
company                     98.9      46.6    93.3
educational institution     99.2      49.5    92.8
athlete                     96.5      90.1    94.6
means of transportation     96.5      74.3    94.2
building                    99.9      37.7    92.1
natural place               98.9      88.2    95.4
village                     99.9      68.1    93.8
animal                      99.7      68.1    93.8
plant                       99.2      76.9    94.3
film                        99.4      73.3    94.5
written work                93.8      26.5    91.3
AVERAGE                     98.3      63.6    93.6
Estimating p(x) by importance sampling. Assume we've got a conditional distribution q(y | x) such that (i) p(x, y) > 0 ⟹ q(y | x) > 0, (ii) sampling y ∼ q(y | x) is tractable, and (iii) evaluating q(y | x) is tractable. Let the importance weights be

w(x, y) = p(x, y) / q(y | x)

Then

p(x) = Σ_{y ∈ Y(x)} p(x, y) = Σ_{y ∈ Y(x)} w(x, y) q(y | x) = E_{y∼q(y|x)} w(x, y)

Replace this expectation with its Monte Carlo estimate, drawing y(i) ∼ q(y | x) for i ∈ {1, 2, ..., N}:

E_{q(y|x)} w(x, y) ≈(MC) (1/N) Σ_{i=1}^{N} w(x, y(i))
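A minimal sketch of this estimator in log space (the three function arguments are assumed interfaces to the proposal and the generative model):

```python
import math

def log_marginal_likelihood(x, sample_from_q, log_q, log_joint_p, n_samples=100):
    """Estimate log p(x) = log E_{y~q}[p(x, y) / q(y | x)] by Monte Carlo.

    sample_from_q(x)  -> a sampled y (e.g., a parse tree)
    log_q(y, x)       -> log q(y | x) under the proposal
    log_joint_p(x, y) -> log p(x, y) under the generative model
    """
    log_w = [log_joint_p(x, (y := sample_from_q(x))) - log_q(y, x)
             for _ in range(n_samples)]
    # log-mean-exp of the importance weights, for numerical stability
    m = max(log_w)
    return m + math.log(sum(math.exp(v - m) for v in log_w)) - math.log(n_samples)
```

The same estimate of p(x), summed over a test corpus, yields the language-model perplexities reported below.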
Language modeling results (perplexity):

Model            Perplexity
5-gram IKN         169.3
LSTM LM            113.4
Generative (IS)    102.4