Generative Models for Discriminative Problems
Chris Dyer, DeepMind


SLIDE 1

Generative Models for Discriminative Problems

Chris Dyer
DeepMind
ASRU 2017, December 19, 2017

SLIDE 2: Terminological clarification
  • A discriminative problem: for some input x, find Y(x), the most likely y in a set of candidate outputs
  • A discriminative model directly models p(y | x): logistic/linear/… regressions, MLPs, CRFs, MEMMs, seq2seq(+att)
  • A generative model for a discriminative problem models p(x, y), often by breaking it into p(y)p(x | y): naive Bayes, GMMs, HMMs, PCFGs, the IBM translation models

[Figure: graphical models relating x and y for the two model types.]
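The distinction above can be made concrete with a toy classifier. This is a minimal sketch, not from the talk: the class prior and word likelihoods are invented numbers, and classification applies Bayes' rule to the naive Bayes joint p(y)p(x | y).

```python
import math

# Toy generative classifier: model p(x, y) = p(y) * p(x | y) and predict
# argmax_y p(y | x) via Bayes' rule. All numbers here are made-up.
prior = {"POLITICS": 0.5, "SPORTS": 0.5}               # p(y)
likelihood = {                                          # p(word | y), naive Bayes style
    "POLITICS": {"vote": 0.4, "game": 0.1, "ideas": 0.5},
    "SPORTS":   {"vote": 0.1, "game": 0.7, "ideas": 0.2},
}

def classify(words):
    """Return the class maximizing log p(y) + sum_t log p(x_t | y)."""
    scores = {}
    for y in prior:
        scores[y] = math.log(prior[y]) + sum(math.log(likelihood[y][w]) for w in words)
    return max(scores, key=scores.get)

print(classify(["game", "game"]))   # SPORTS
print(classify(["vote", "ideas"]))  # POLITICS
```

Because the joint is modeled, the same numbers also give p(x) by summing over classes, which a discriminative model cannot do.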


SLIDE 7: But why?

(Chiu et al., last week; Bentivogli et al., 2016)

SLIDE 8: Why generative models? Five reasons
  • “Human-like learning” looks more like model building + inference than optimizing pattern-recognition functions (Lake et al., 2015)
  • Generative models may be more sample efficient than equivalent discriminative models (Ng & Jordan, 2001)
  • In some domains, we can build (relatively) accurate models of data generation → even better sample efficiency
  • Exploit alternative data/variables: zero-shot learning, learning from unpaired samples, semisupervised learning, exploiting natural conditional independencies
  • Reduce label bias when producing sequential outputs
  • Safety considerations: model introspection by sampling; generative models “know what they know”


SLIDE 14: But didn’t we use generative models and give them up for some reason?

SLIDE 15: Why not generative models?
  • To use “generative models for discriminative problems” we must model complex distributions (sentences, documents, speech, images)
  • Complex distributions → lots of bad independence assumptions (naive Bayes, n-grams, HMMs, statistical translation models)
  • But: neural networks let the learner figure out their own independence assumptions!
  • Using generative models requires solving difficult inference problems
  • Inference problems are especially difficult when you get rid of the “bad independence assumptions”!
  • You aren’t “optimizing the task”!




SLIDE 20: Case studies
  • Text categorization (x = a document, y = POLITICS)
  • Syntactic parsing (x = Colorless green ideas sleep furiously, y = its parse tree)
  • Sequence-to-sequence transduction (x = Welcome to Okinawa, y = 沖縄へようこそ。)


SLIDE 22: Text categorization: experimental setup (Yogatama, D, et al., arXiv 2017)
  • Supervised classification
  • Sample efficiency of a generative-discriminative pair (Ng and Jordan, 2001)
  • How well do generative models do on standard datasets “at scale”?
  • How well do generative models do across a range of data conditions?

SLIDE 23: Text categorization: discriminative model

Directly model p(y | x) and maximize the conditional log-likelihood:

L(W) = Σ_i log p(y_i | x_i; W)

[Figure: a network reads x1 x2 x3 x4 x5 and predicts y.]

SLIDE 24: Text categorization: generative model

Maximize the joint log-likelihood

L(W) = Σ_i log p(x_i | y_i) p(y_i)

with a class prior p(y) and an autoregressive class-conditional likelihood: p(x2 | x<2, y), p(x3 | x<3, y), p(x4 | x<4, y), p(x5 | x<5, y), …

[Figure: a class embedding v_y and tokens x1 … x4 feed a language model that predicts x2 … x5.]
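The generative classifier can be sketched with a class prior plus one language model per class. Assumptions: the per-class LM here is an add-one-smoothed bigram model standing in for the talk's class-conditional LSTM, and the training sentences are invented.

```python
import math
from collections import defaultdict

# Sketch of the generative text classifier:
#   score(y) = log p(y) + sum_t log p(x_t | x_<t, y),
# with one autoregressive LM per class (here a tiny add-one bigram model).
def train_bigram_lm(sentences, vocab):
    counts = defaultdict(lambda: defaultdict(int))
    for s in sentences:
        toks = ["<s>"] + s.split()
        for a, b in zip(toks, toks[1:]):
            counts[a][b] += 1
    def logprob(sentence):
        toks = ["<s>"] + sentence.split()
        lp = 0.0
        for a, b in zip(toks, toks[1:]):
            # add-one smoothing over the vocabulary
            lp += math.log((counts[a][b] + 1) / (sum(counts[a].values()) + len(vocab)))
        return lp
    return logprob

vocab = {"the", "match", "was", "tense", "bill", "passed"}
lms = {
    "SPORTS":   train_bigram_lm(["the match was tense"], vocab),
    "POLITICS": train_bigram_lm(["the bill passed"], vocab),
}
prior = {"SPORTS": 0.5, "POLITICS": 0.5}

def classify(x):
    return max(prior, key=lambda y: math.log(prior[y]) + lms[y](x))

print(classify("the match was tense"))  # SPORTS
```

The same structure, with the bigram LM replaced by an LSTM, is the "Generative LSTM" in the following table.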

SLIDE 25: Supervised text categorization (test accuracy)

Model | AGNews | DBPedia | Yahoo | Yelp Binary
Bag of Words (Zhang et al., 2015) | 88.8 | 96.6 | 68.9 | 92.2
char-CRNN (Xiao and Cho, 2016) | 91.4 | 98.6 | 71.7 | 94.5
very deep CNN (Conneau et al., 2016) | 91.3 | 98.7 | 73.4 | 95.7
Discriminative LSTM | 92.1 | 98.7 | 73.7 | 92.6
Naive Bayes | 90.0 | 96.0 | 68.7 | 86.0
Kneser-Ney Bayes | 89.3 | 95.4 | 69.3 | 81.8
Generative LSTM | 90.7 | 94.8 | 70.5 | 90.0




SLIDE 30: Sample efficiency

[Figure: test-set accuracy (20-80%) vs. log(training instances + 1) for KN-Bayes, Naive Bayes, Gen LSTM, and Disc LSTM.]

Yahoo! Answers data: 1,395,000 instances / 10 classes

SLIDE 33: Sample efficiency

[Figure: four panels (Sogou, Yahoo, DBPedia, Yelp Binary) plotting % accuracy vs. log(#training + 1) for naive Bayes, KN Bayes, disc LSTM, and gen LSTM.]


SLIDE 34: Discussion
  • Generative models of text approach their asymptotic errors more rapidly (better in the small-data regime), and provide an estimate of p(x).
  • Discriminative models of text have lower asymptotic errors and faster training and inference.
  • The downside of the generative approach: inference is expensive. We have to evaluate the likelihood of the document for every class!

SLIDE 35: Case studies
  • Text categorization (x = a document, y = POLITICS)
  • Syntactic parsing (x = Colorless green ideas sleep furiously, y = its parse tree)
  • Sequence-to-sequence transduction (x = Welcome to Okinawa, y = 沖縄へようこそ。)

SLIDE 36: Syntactic parsing: Recurrent Neural Net Grammars (D, et al., ACL 2016; Kuncoro, D, et al., EACL 2017)
  • Generate symbols sequentially using an RNN
  • Add some control symbols to rewrite the history
  • Occasionally compress a sequence into a constituent
  • RNN predicts the next terminal/control symbol based on the history of compressed elements and non-compressed terminals
  • This is a top-down, left-to-right generation of a tree+sequence (other traversal orders are possible)
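A possible linearization of the tree+sequence generation, as a sketch: the tree encoding (nested tuples) and function name are my own, while the action names NT, GEN, REDUCE follow the slides.

```python
# Sketch: linearize a phrase-structure tree into the top-down, left-to-right
# RNNG action sequence NT(label), GEN(word), REDUCE. Trees are nested tuples
# whose first element is the nonterminal label; strings are terminals.
def tree_to_actions(tree):
    if isinstance(tree, str):                 # terminal symbol
        return ["GEN(%s)" % tree]
    label, children = tree[0], tree[1:]
    actions = ["NT(%s)" % label]              # open a nonterminal
    for child in children:
        actions += tree_to_actions(child)
    actions.append("REDUCE")                  # close the constituent
    return actions

tree = ("S", ("NP", "The", "hungry", "cat"), ("VP", "meows"), ".")
print(tree_to_actions(tree))
# ['NT(S)', 'NT(NP)', 'GEN(The)', 'GEN(hungry)', 'GEN(cat)', 'REDUCE',
#  'NT(VP)', 'GEN(meows)', 'REDUCE', 'GEN(.)', 'REDUCE']
```

This is exactly the action sequence traced in the derivation slides that follow.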

SLIDE 40: Example derivation

The hungry cat meows loudly


SLIDE 41-42: Example derivation (stack, action, probability)

  1. Stack: (empty); action NT(S); probability p(nt(S) | top)
  2. Stack: (S; action NT(NP); probability p(nt(NP) | (S)
  3. Stack: (S (NP; action GEN(The); probability p(gen(The) | (S, (NP)
  4. Stack: (S (NP The; action GEN(hungry); probability p(gen(hungry) | (S, (NP, The)
  5. Stack: (S (NP The hungry; action GEN(cat); probability p(gen(cat) | . . .)
  6. Stack: (S (NP The hungry cat; action REDUCE; probability p(reduce | . . .)
  7. Stack: (S (NP The hungry cat); action NT(VP); probability p(nt(VP) | (S, (NP The hungry cat))
  8. Stack: (S (NP The hungry cat) (VP; action GEN(meows)
  9. Stack: (S (NP The hungry cat) (VP meows; action REDUCE
  10. Stack: (S (NP The hungry cat) (VP meows); action GEN(.)
  11. Stack: (S (NP The hungry cat) (VP meows) .; action REDUCE

The REDUCE action compresses “The hungry cat” into a single composite symbol on the stack.

SLIDE 43: Deriving the model
  • Valid (tree, string) pairs are in bijection with valid sequences of actions (specifically, the DFS, left-to-right traversal of the trees) [prop 1]
  • Every stack configuration perfectly encodes the complete history of actions. [prop 2]
  • Therefore, the probability decomposition is justified by the chain rule:

p(x, y) = p(actions(x, y))              (prop 1)
        = Π_i p(a_i | a_<i)             (chain rule)
        = Π_i p(a_i | stack(a_<i))      (prop 2)


SLIDE 44: Modeling the next action

Stack so far: (S (NP The hungry cat) (VP meows

We need p(a_i | history), where the history is summarized by hidden states h1 h2 h3 h4.
  • 1. Unbounded depth → recurrent neural nets


SLIDE 46: Modeling the next action

Stack so far: (S (NP The hungry cat) (VP meows; we need p(a_i | history).
  • 1. Unbounded depth → recurrent neural nets
  • 2. Arbitrarily complex trees → recursive neural nets
SLIDE 47: Syntactic composition

Need a single vector representation for the constituent (NP The hungry cat). What head type?

[Figure: the symbols (, NP, The, hungry, cat, ) are read by a composition network that produces one vector for the whole constituent, typed by the nonterminal NP.]


SLIDE 50: Syntactic composition: recursion

Need representations for (NP The hungry cat) and (NP The (ADJP very hungry) cat).

[Figure: the composed vector v for (ADJP very hungry) is itself an input to the composition of the larger NP; composition applies recursively.]
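One way to sketch the recursive composition: a plain RNN with random, untrained weights (my own toy setup, not the talk's learned composition function) reads the label embedding and the child vectors and returns one vector per constituent; the composed ADJP vector feeds the larger NP's composition.

```python
import math, random

# Toy syntactic composition: start from the nonterminal label's embedding,
# run a simple tanh RNN over the child vectors, and return the final hidden
# state as the constituent's vector. Weights are random, for illustration.
random.seed(0)
DIM = 4
W_h = [[random.uniform(-0.5, 0.5) for _ in range(DIM)] for _ in range(DIM)]
W_x = [[random.uniform(-0.5, 0.5) for _ in range(DIM)] for _ in range(DIM)]
embed = {sym: [random.uniform(-0.5, 0.5) for _ in range(DIM)]
         for sym in ["NP", "ADJP", "The", "very", "hungry", "cat"]}

def rnn_step(h, x):
    return [math.tanh(sum(W_h[i][j] * h[j] + W_x[i][j] * x[j] for j in range(DIM)))
            for i in range(DIM)]

def compose(label, child_vecs):
    h = embed[label]                 # the label embedding initializes the state
    for v in child_vecs:
        h = rnn_step(h, v)
    return h

# Recursion: the vector for (ADJP very hungry) is a child of the larger NP.
adjp = compose("ADJP", [embed["very"], embed["hungry"]])
np_vec = compose("NP", [embed["The"], adjp, embed["cat"]])
print(len(np_vec))  # one DIM-dimensional vector for the whole constituent
```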

SLIDE 52: Modeling the next action
  • 1. Unbounded depth → recurrent neural nets
  • 2. Arbitrarily complex trees → recursive neural nets
  • 3. Limited updates to state → stack RNNs

After a REDUCE, the stack (S (NP The hungry cat) (VP meows becomes (S (NP The hungry cat) (VP meows), and we need p(a_{i+1} | stack).

(D, et al., ACL 2015; Ballesteros, D, et al., EMNLP 2015)
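A stack RNN can be sketched as a stack of saved hidden states, so a pop restores the exact pre-push state; the toy additive "RNN" below stands in for a real LSTM cell.

```python
# Minimal stack RNN sketch: hidden states live on a stack. push() extends
# from the current top state; pop() restores the state that was there
# before, so only the top of the stack is ever updated.
class StackRNN:
    def __init__(self):
        self.states = [0.0]                        # h_0 for the empty stack
    def push(self, x):
        self.states.append(self.states[-1] + x)    # h_new = f(h_top, x); toy f
    def pop(self):
        return self.states.pop()
    def top(self):
        return self.states[-1]

s = StackRNN()
s.push(1.0); s.push(2.0)
print(s.top())   # 3.0
s.pop()          # popping exactly restores the earlier state
print(s.top())   # 1.0
```

This is what makes REDUCE cheap: the states computed before a constituent was opened never need to be recomputed.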

SLIDE 53: Inference
  • In text categorization, it was not really a problem to exhaustively evaluate all candidate y’s.
  • Here, we can’t do that: we have O(2^|x|) candidates!
  • Outline of the solution:
  • Learn a tractable instrumental distribution, q(y | x), which approximates the posterior over trees
  • Use importance sampling to solve the inference problems (maximization, marginalization) we care about


SLIDE 54: Results: parsing

Model | Type | F1
Petrov and Klein (2007) | Gen | 90.1
Shindo et al. (2012), single model | Gen | 91.1
Vinyals et al. (2015), PTB only | Disc | 90.5
Shindo et al. (2012), ensemble | Gen + Ensemble | 92.4
Vinyals et al. (2015), semisupervised | Disc + SemiSup | 92.8
Discriminative (this work), PTB only | Disc | 91.7
Generative (this work), PTB only | Gen | 93.6
Choe and Charniak (2016), semisupervised | Gen + SemiSup | 93.8
Fried et al. (2017) | Gen + Semi + Ensemble | 94.7


SLIDE 56: Discussion
  • RNNGs are effective both for modeling language and for parsing
  • Generative parser outperforms discriminative parser
  • Expectation: the discriminative model would do better with more data
  • We are in the “generative” regime!

SLIDE 57: Case studies
  • Text categorization (x = a document, y = POLITICS)
  • Syntactic parsing (x = Colorless green ideas sleep furiously, y = its parse tree)
  • Sequence-to-sequence transduction (x = Welcome to Okinawa, y = 沖縄へようこそ。)

SLIDE 58: Seq2Seq modeling: direct model (Yu, D, et al., ICLR 2017)
  • State-of-the-art performance in most applications
  • Two serious problems that concern us:
  • Nontrivial to use “unpaired” samples of x or y to train the model
  • “Explaining-away effects”: models like this learn to ignore “inconvenient” inputs (i.e., x) in favor of high-probability continuations of an output prefix (y_<i)

SLIDE 59: Seq2Seq modeling: what is label bias?

Label bias is a species of “explaining away” that causes trouble in directed (locally normalized) models.

[Figure: a worked example over strings such as “a b c x y z”, “a b′ c x y z”, and “a b′ d x y z”.]

SLIDE 60: Seq2Seq modeling: generative model

Decompose the joint as p(x, y) = p(y) p(x | y): a “source model” over outputs and a “channel model” over inputs given outputs (the noisy channel).


SLIDE 64: Seq2Seq modeling: generative model

The “source model” generates the output, e.g. The world is colorful because of the Internet.; the “channel model” generates the observed input from it, e.g. 世界はインターネットのためにカラフルです。 (“The world is colorful because of the Internet.”)

SLIDE 65: Seq2Seq modeling: generative model

The source model p(y) can be estimated from unpaired y’s.

SLIDE 66: Seq2Seq modeling: generative model

This model form avoids explaining away of inputs (“label bias”).
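Noisy-channel selection can be sketched as scoring each candidate output y by log p(y) + log p(x | y); all candidate strings and probabilities below are invented.

```python
import math

# Noisy-channel scoring sketch: pick the output y maximizing
# log p(y) (source model) + log p(x | y) (channel model).
source_lm = {"the cat meows": 0.6, "cat the meows": 0.01, "the dog barks": 0.39}
channel   = {                       # p(x | y) for the observed input x
    "the cat meows": 0.5,
    "cat the meows": 0.5,           # the channel alone can't distinguish these
    "the dog barks": 0.01,
}

def noisy_channel_best(candidates):
    return max(candidates,
               key=lambda y: math.log(source_lm[y]) + math.log(channel[y]))

print(noisy_channel_best(list(source_lm)))  # the source model breaks the tie
```

Note the division of labor: the channel scores adequacy, while the source model (trainable on unpaired y's) scores fluency.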

SLIDE 67: Seq2Seq modeling: generative model
  • Question: can we use neural network component models without bad independence assumptions?
  • Training: straightforward
  • Decoding: challenging

SLIDE 68: Decoding
  • Some bad initial results:
  • The IS algorithm we proposed hurt us unless the number of samples (k) was massive
  • Reranking a k-best list from a direct model didn’t help unless k was even bigger
  • Question: can we develop a left-to-right decoder for a noisy channel MT model?

SLIDE 69-71: Decoding: direct vs. generative

Direct model, by the chain rule:

p(y | x) = Π_j p(y_j | x, y_<j)

Each factor conditions on the full input x and the output prefix, so we can decode left to right. Not perfect, but workable (compare to using greedy decoding with MEMMs).
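The chain-rule factorization of the direct model supports simple left-to-right decoding, sketched here with an invented lookup-table "model" in place of a neural network.

```python
# Greedy left-to-right decoding for a direct model factored by the chain
# rule: y_j = argmax p(y_j | x, y_<j). The toy "model" below is a lookup
# table keyed on the previous token (a stand-in for a seq2seq network).
cond = {
    None:  {"the": 0.8, "a": 0.2},
    "the": {"cat": 0.6, "dog": 0.3, "</s>": 0.1},
    "cat": {"</s>": 0.9, "cat": 0.1},
    "dog": {"</s>": 1.0},
    "a":   {"</s>": 1.0},
}

def greedy_decode(max_len=10):
    y, prev = [], None
    for _ in range(max_len):
        nxt = max(cond[prev], key=cond[prev].get)  # local argmax at each step
        if nxt == "</s>":
            break
        y.append(nxt)
        prev = nxt
    return y

print(greedy_decode())  # ['the', 'cat']
```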

SLIDE 72-74: Decoding: direct vs. generative

Generative model (naive): try to apply the chain rule the same way, interleaving generation of y with scoring of x. Probability doesn’t work like this: the channel model p(x | y) conditions on the complete output y, so prefixes of y don’t yield well-formed factors.

SLIDE 75: Decoding: direct vs. generative

Outline of solution: introduce a latent variable z that determines when enough of the conditioning context has been read to generate another symbol. How much of y do we need to read to model the jth token of x?

SLIDE 76: The Segment to Segment Model

[Figure: alignment grid between the conditioning context and the output sequence.]
  • Introduced as a direct model by Yu et al. (2016)
  • It’s a good direct model
  • It is also exactly what we need for the channel model
  • Similar model: Graves (2012)

SLIDE 77: Decoding with an auxiliary model

It is expensive to go through every token y_j in the vocabulary and calculate its score. Instead, use an auxiliary direct model q(y, z | x) to guide the search.

SLIDE 78-79: Decoding with an auxiliary model

Possible proposals: Chinese markets open / Chinese markets closed / Market close / Financial markets

Proposals are rescored under the expanded objective.
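Rescoring proposals from the auxiliary model can be sketched as follows; the proposal list matches the slide, while the log-probabilities and interpolation weight are invented.

```python
# Sketch of decoding with an auxiliary model: the direct model q proposes a
# short list of candidate extensions, and each proposal is rescored with
# the full objective log p(x | y) + log p(y) (+ a weighted log q term).
proposals = {                      # log q(y | x) for proposals from q
    "Chinese markets open":   -1.0,
    "Chinese markets closed": -1.2,
    "Market close":           -2.0,
    "Financial markets":      -2.5,
}
channel_lp = {"Chinese markets open": -4.0, "Chinese markets closed": -2.5,
              "Market close": -3.5, "Financial markets": -5.0}
lm_lp      = {"Chinese markets open": -3.0, "Chinese markets closed": -3.1,
              "Market close": -4.0, "Financial markets": -2.8}

def rescore(lam=0.5):
    # channel + LM + a small bias toward the proposal model's score
    return max(proposals, key=lambda y: channel_lp[y] + lm_lp[y] + lam * proposals[y])

print(rescore())
```

This keeps search cheap (only the proposal set is scored by the expensive channel model) while letting the generative objective make the final choice.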

SLIDE 80: Experiments: machine translation
  • Medium-sized Chinese-English news parallel data
  • Large LSTM language model trained on English news + target side of parallel data
  • Evaluation using BLEU-4 (higher is better)

SLIDE 81: Experiments: machine translation

Model | BLEU
Seq2seq with attention | 25.27
Direct model (q by itself) | 23.33
Direct + LM + bias | 23.33
Channel + LM + bias | 26.28
Direct + channel + LM + bias | 26.44

(The first three rows are discriminative; the channel-based rows use the generative model.)


SLIDE 84: Conclusions
  • Generative models can be used well for “discriminative problems”
  • Especially in data-restricted scenarios
  • Especially with neural nets, which let us define great generative models
  • Open questions:
  • Inference is hard, but there are lots of exciting possibilities for learning to do inference
  • Is there a theoretical account of when a particular dataset is in the “generative” vs. “discriminative” regime, and where the crossover point is?

SLIDE 85

Thank you!

SLIDE 86: Outlier detection
  • Generative models also provide an estimate of p(x)
  • The likelihood of the input is a good estimate of “what the model knows”. Low-likelihood examples are a good indication that the model should stop what it’s doing and get help.
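Thresholding the model's log-likelihood gives a minimal outlier detector, sketched here with invented per-token log-probabilities and an invented threshold.

```python
import math

# Outlier detection sketch: flag inputs whose likelihood under the
# generative model falls below a threshold.
def loglik(tokens, lm):
    # unseen tokens get a very small floor probability
    return sum(lm.get(t, math.log(1e-6)) for t in tokens)

lm = {"the": math.log(0.1), "cat": math.log(0.05), "meows": math.log(0.02)}
THRESH = -20.0   # in practice, tuned on held-out data

def is_outlier(tokens):
    return loglik(tokens, lm) < THRESH

print(is_outlier(["the", "cat", "meows"]))          # familiar input: False
print(is_outlier(["colorless", "green", "ideas"]))  # all unseen: True
```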


SLIDE 87: Zero-shot learning
  • Learn (label) concepts, to be used as class embeddings v_y, from an auxiliary task
  • For example, from a large unannotated corpus, learn standard word embeddings and use them as class embeddings
  • Fix the class embeddings during training
  • When we see a new class, use the word embedding for the class
  • Train on n − 1 classes; predict over all n classes
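A minimal zero-shot sketch: class embeddings are fixed pretrained vectors and scoring is cosine similarity (the 2-d vectors and the scoring rule are my own illustrative choices); adding a new class requires no retraining.

```python
import math

# Zero-shot sketch: class embeddings come from pretrained word vectors and
# are held fixed; a new class becomes usable at test time simply by adding
# its word vector. Documents are scored by cosine similarity here.
class_emb = {"animal": [1.0, 0.1], "building": [0.0, 1.0]}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def predict(doc_vec, classes):
    return max(classes, key=lambda y: cosine(doc_vec, class_emb[y]))

# Training saw only "animal" and "building"; at test time we add an unseen
# class by adding its (pretrained) embedding.
class_emb["plant"] = [0.9, 0.9]
print(predict([1.0, 1.0], ["animal", "building", "plant"]))  # plant
```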


SLIDE 88: Zero-shot learning

Class | Precision | Recall | Accuracy
company | 98.9 | 46.6 | 93.3
educational institution | 99.2 | 49.5 | 92.8
athlete | 96.5 | 90.1 | 94.6
means of transportation | 96.5 | 74.3 | 94.2
building | 99.9 | 37.7 | 92.1
natural place | 98.9 | 88.2 | 95.4
village | 99.9 | 68.1 | 93.8
animal | 99.7 | 68.1 | 93.8
plant | 99.2 | 76.9 | 94.3
film | 99.4 | 73.3 | 94.5
written work | 93.8 | 26.5 | 91.3
AVERAGE | 98.3 | 63.6 | 93.6


SLIDE 89: Inference: importance sampling

Assume we’ve got a conditional distribution q(y | x) s.t.
  (i) p(x, y) > 0 ⇒ q(y | x) > 0
  (ii) sampling y ∼ q(y | x) is tractable
  (iii) evaluating q(y | x) is tractable

Let the importance weights be w(x, y) = p(x, y) / q(y | x). Then

p(x) = Σ_{y ∈ Y(x)} p(x, y) = Σ_{y ∈ Y(x)} w(x, y) q(y | x) = E_{y∼q(y|x)} w(x, y)

SLIDE 90: Inference: importance sampling

Replace this expectation with its Monte Carlo estimate: draw y^(i) ∼ q(y | x) for i ∈ {1, 2, …, N}, then

E_{q(y|x)} w(x, y) ≈ (1/N) Σ_{i=1}^{N} w(x, y^(i))
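The importance-sampling estimator can be checked on a toy discrete Y(x), where the exact marginal is a small sum; all distributions below are invented.

```python
import random

# Importance-sampling sketch for p(x) = E_{y~q}[w(x, y)] with
# w(x, y) = p(x, y) / q(y | x), on a toy discrete Y(x).
random.seed(0)
Y = ["y1", "y2", "y3"]
p_joint = {"y1": 0.10, "y2": 0.05, "y3": 0.01}  # p(x, y) for a fixed x
q = {"y1": 0.5, "y2": 0.3, "y3": 0.2}           # proposal q(y | x), covers the support

def is_estimate(n=20000):
    total = 0.0
    for _ in range(n):
        y = random.choices(Y, weights=[q[v] for v in Y])[0]
        total += p_joint[y] / q[y]               # importance weight w(x, y)
    return total / n

exact = sum(p_joint.values())                    # 0.16, the true marginal p(x)
print(round(is_estimate(), 3), exact)
```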

SLIDE 91: Results: language modeling

Model | Perplexity
5-gram IKN | 169.3
LSTM LM | 113.4
Generative (IS) | 102.4