SLIDE 1

Text Generative Models

CSCI 699

Instructor: Xiang Ren USC Computer Science

slide-2
SLIDE 2

Language Modeling

CSCI 699

Instructor: Xiang Ren USC Computer Science

SLIDE 3

Are These Sentences OK?

  • Jane went to the store.
  • store to Jane went the.
  • Jane went store.
  • Jane goed to the store.
  • The store went to Jane.
  • The food truck went to Jane.
SLIDE 4

Calculating the Probability of a Sentence
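The slide's equation did not survive extraction; the standard chain-rule decomposition it presumably shows is (a reconstruction, not the slide's exact notation):

```latex
P(X) = \prod_{i=1}^{|X|} P(x_i \mid x_1, \ldots, x_{i-1})
```

Each word's probability is conditioned on all the words before it, so a model of the next-word distribution suffices to score whole sentences.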
SLIDE 6

Review: Count-based Language Models

SLIDE 7

Count-based Language Models
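The formulas on this slide were lost in extraction; the standard maximum-likelihood n-gram estimate such slides present is (a reconstruction, where c(·) counts occurrences in the training data):

```latex
P(x_i \mid x_{i-n+1}, \ldots, x_{i-1}) \approx \frac{c(x_{i-n+1}, \ldots, x_i)}{c(x_{i-n+1}, \ldots, x_{i-1})}
```

In practice this is combined with smoothing or interpolation with lower-order n-grams, since raw counts assign zero probability to unseen n-grams.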

SLIDE 8

A Refresher on Evaluation
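The slide body did not survive extraction; the standard evaluation measure for language models it presumably reviews is perplexity over a held-out set of M words (a reconstruction, not the slide's exact notation):

```latex
\mathrm{PPL} = \exp\!\left(-\frac{1}{M} \sum_{i=1}^{M} \log P(x_i \mid x_{<i})\right)
```

Lower is better; a perplexity of k means the model is, on average, as uncertain as a uniform choice among k words.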

SLIDE 10

What Can we Do w/ LMs?

  • Score sentences:

    Jane went to the store . → high
    store to Jane went the . → low
    (same as calculating loss for training)

  • Generate sentences:

    while didn’t choose end-of-sentence symbol:
        calculate probability
        sample a new word from the probability distribution
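The generation loop above can be sketched in Python; the toy next-word distribution below is invented for illustration (a real language model would supply it):

```python
import random

# Toy "language model": returns a next-word distribution given the history.
# The vocabulary and probabilities here are made up for illustration.
def next_word_distribution(history):
    if not history or history[-1] != "store":
        return {"Jane": 0.2, "went": 0.2, "to": 0.2, "the": 0.2, "store": 0.2}
    return {"</s>": 1.0}

def sample_sentence(max_len=20):
    sentence = []
    while not sentence or sentence[-1] != "</s>":
        if len(sentence) >= max_len:   # safety cap for this sketch
            break
        dist = next_word_distribution(sentence)
        words, probs = zip(*dist.items())
        sentence.append(random.choices(words, weights=probs)[0])
    return sentence
```

Sampling (rather than always taking the most likely word) is what makes repeated runs produce different sentences.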

SLIDE 13

Problems and Solutions?

  • Cannot share strength among similar words:

    she bought a car        she purchased a car
    she bought a bicycle    she purchased a bicycle
    → solution: class-based language models

  • Cannot condition on context with intervening words:

    Dr. Jane Smith
    Dr. Gertrude Smith
    → solution: skip-gram language models

  • Cannot handle long-distance dependencies:

    for tennis class he wanted to buy his own racquet
    for programming class he wanted to buy his own computer
    → solution: cache, trigger, topic, syntactic models, etc.

SLIDE 14

An Alternative: Featurized Log-Linear Models

SLIDE 17

An Alternative: Featurized Models

  • Calculate features of the context
  • Based on the features, calculate probabilities
  • Optimize feature weights using gradient descent, etc.
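The scoring formula itself was lost in extraction; a log-linear LM with one weight vector per (distance, word) feature is standardly written as (a reconstruction, where b is a bias vector over the output vocabulary and W^(j) selects a vector by the word at distance j):

```latex
s = b + \sum_{j=1}^{n-1} W^{(j)}_{x_{i-j}}, \qquad
p(\cdot \mid x_{i-n+1}, \ldots, x_{i-1}) = \operatorname{softmax}(s)
```

The following slides walk through exactly this sum for the context "giving a".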

SLIDE 22

Example:

Previous words: “giving a"

Words we’re predicting:
  • How likely are they?
  • How likely are they given prev. word is “a”?
  • How likely are they given 2nd prev. word is “giving”?
  • Total score

SLIDE 23

Softmax
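The softmax formula itself was lost in extraction; a minimal sketch of the standard (numerically stable) version, with names of my own choosing:

```python
import numpy as np

# Softmax turns a vector of arbitrary scores into a probability distribution.
# Subtracting the max doesn't change the result but avoids overflow in exp.
def softmax(scores):
    shifted = scores - np.max(scores)
    exp = np.exp(shifted)
    return exp / exp.sum()
```

Applied to the total scores from the previous slide, this yields the next-word probabilities.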

SLIDE 24

A Computation Graph View

giving → lookup2,  a → lookup1
lookup2(giving) + lookup1(a) + bias = scores
softmax(scores) = probs

Each vector is the size of the output vocabulary.

SLIDE 26

A Note: “Lookup”

  • lookup(2) can be viewed as “grabbing” a single vector from a big matrix of word embeddings (num. words × vector size)
  • Similarly, it can be viewed as multiplying the embedding matrix by a “one-hot” vector (1 in position 2, 0 elsewhere)
  • The former tends to be faster
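The two views can be checked in a few lines; the matrix sizes below are invented for illustration:

```python
import numpy as np

# A small embedding matrix: num. words x vector size.
num_words, vector_size = 4, 3
E = np.arange(num_words * vector_size, dtype=float).reshape(num_words, vector_size)

# View 1: "grab" row 2 of the embedding matrix directly.
grabbed = E[2]

# View 2: multiply a one-hot row vector by the matrix.
one_hot = np.zeros(num_words)
one_hot[2] = 1.0
multiplied = one_hot @ E

assert np.allclose(grabbed, multiplied)  # same result; direct indexing is faster
```

Direct indexing skips the multiplications by zero, which is why frameworks implement lookup that way.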
SLIDE 28

Training a Model

  • Reminder: to train, we calculate a “loss function” (a measure of how bad our predictions are), and move the parameters to reduce the loss
  • The most common loss function for probabilistic models is “negative log likelihood”
  • e.g. for predicted probabilities p = [0.002, 0.003, 0.329, 0.444, 0.090, …], if element 3 (or zero-indexed, 2) is the correct answer, the loss is −log 0.329 ≈ 1.112
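The worked number on the slide can be verified directly:

```python
import math

# Negative log likelihood of the correct word under the predicted distribution.
# These are the probabilities shown on the slide.
p = [0.002, 0.003, 0.329, 0.444, 0.090]
correct = 2                    # zero-indexed element 3 on the slide
loss = -math.log(p[correct])   # approx. 1.112
```

Because the loss is the negative log of the correct word's probability, pushing the loss down pushes that probability up.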

SLIDE 29

Parameter Update
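The update rule itself did not survive extraction; the standard stochastic gradient descent step presumably shown is (a reconstruction, with learning rate η and per-example loss ℓ):

```latex
\theta^{(t+1)} = \theta^{(t)} - \eta \, \nabla_{\theta} \, \ell\!\left(\theta^{(t)}\right)
```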

SLIDE 30

Choosing a Vocabulary

SLIDE 32

Unknown Words

  • Necessity for UNK words:
    • We won’t have all the words in the world in training data
    • Larger vocabularies require more memory and computation time
  • Common ways:
    • Frequency threshold (usually UNK <= 1)
    • Rank threshold
SLIDE 33

Evaluation and Vocabulary

  • Important: the vocabulary must be the same over models you compare
  • Or more accurately, all models must be able to generate the test set (it’s OK if they can generate more than the test set, but not less)
  • e.g. comparing a character-based model to a word-based model is fair, but not vice-versa

SLIDE 34

Beyond Linear Models

SLIDE 35

Linear Models can’t Learn Feature Combinations

farmers eat steak → high    farmers eat hay → low
cows eat steak → low        cows eat hay → high

  • These can’t be expressed by linear features
  • What can we do?
    • Remember combinations as features (individual scores for “farmers eat”, “cows eat”) → feature space explosion!
    • Neural nets

SLIDE 36

Neural Language Models

  • (See Bengio et al. 2004)

giving → lookup,  a → lookup  (concatenate into h)
hidden = tanh(W1*h + b1)
W*hidden + bias = scores
softmax(scores) = probs

SLIDE 37

Where is Strength Shared?

(same computation graph: lookup → tanh(W1*h + b1) → scores → softmax → probs)

  • Word embeddings: similar input words get similar vectors
  • Softmax matrix: similar output words get similar rows
  • Hidden layer: similar contexts get similar hidden states

SLIDE 40
What Problems are Handled?

  • Cannot share strength among similar words:
    she bought a car        she purchased a car
    she bought a bicycle    she purchased a bicycle
    → solved, and similar contexts as well!
  • Cannot condition on context with intervening words:
    Dr. Jane Smith
    Dr. Gertrude Smith
    → solved!
  • Cannot handle long-distance dependencies:
    for tennis class he wanted to buy his own racquet
    for programming class he wanted to buy his own computer
    → not solved yet

SLIDE 41

Training Tricks

SLIDE 43

Shuffling the Training Data

  • Stochastic gradient methods update the parameters a little bit at a time
  • What if we have the sentence “I love this sentence so much!” at the end of the training data 50 times?
  • To train correctly, we should randomly shuffle the order at each time step
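A minimal sketch of per-epoch shuffling (the function name and generator structure are my own, for illustration):

```python
import random

# Shuffle the sentence order before each epoch so repeated or clustered
# sentences don't all arrive at the end of training.
def epochs(train_data, num_epochs):
    data = list(train_data)
    for _ in range(num_epochs):
        random.shuffle(data)
        for sentence in data:
            yield sentence  # each yielded sentence would trigger one SGD update
```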
SLIDE 44

Other Optimization Options

  • SGD with Momentum: remember gradients from past time steps to prevent sudden changes
  • Adagrad: adapt the learning rate to reduce the learning rate for frequently updated parameters (as measured by the variance of the gradient)
  • Adam: like Adagrad, but keeps a running average of momentum and gradient variance
  • Many others: RMSProp, Adadelta, etc.

(See Ruder 2016 reference for more details)
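As one concrete case, SGD with momentum can be sketched as follows (hyperparameter values are invented for illustration):

```python
# One SGD-with-momentum step: the velocity accumulates past gradients,
# smoothing out sudden changes in the update direction.
def momentum_step(theta, grad, velocity, lr=0.1, beta=0.9):
    velocity = beta * velocity + grad
    theta = theta - lr * velocity
    return theta, velocity
```

Adagrad and Adam follow the same shape but additionally track per-parameter gradient statistics to scale the learning rate.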

SLIDE 46

Early Stopping, Learning Rate Decay

  • Neural nets have tons of parameters: we want to prevent them from over-fitting
  • We can do this by monitoring our performance on held-out development data and stopping training when it starts to get worse
  • It also sometimes helps to reduce the learning rate and continue training

SLIDE 47

Dropout

  • Neural nets have lots of parameters, and are prone to overfitting
  • Dropout: randomly zero out nodes in the hidden layer with probability p at training time only
  • Because the number of nodes at training/test is different, scaling is necessary:
    • Standard dropout: scale by p at test time
    • Inverted dropout: scale by 1/(1-p) at training time
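Inverted dropout can be sketched in a few lines (here p is the drop probability; sizes and names are my own):

```python
import numpy as np

# Inverted dropout: zero out units with probability p at training time and
# scale the survivors by 1/(1-p), so no rescaling is needed at test time.
def inverted_dropout(h, p, rng):
    mask = (rng.random(h.shape) >= p).astype(h.dtype)
    return h * mask / (1.0 - p)
```

Because the surviving activations are scaled up during training, the expected activation matches the test-time (no-dropout) behavior.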

SLIDE 48

Efficiency Tricks: Operation Batching

SLIDE 49

Efficiency Tricks: Mini-batching

  • On modern hardware, 10 operations of size 1 are much slower than 1 operation of size 10
  • Minibatching combines together smaller operations into one big one
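The equivalence behind minibatching can be shown directly (matrix sizes invented for illustration):

```python
import numpy as np

# Applying one weight matrix to 10 vectors one at a time vs. as a single
# batched matrix multiply: same result, but the batched version is one big
# operation, which is much faster on real hardware.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8))
xs = rng.normal(size=(10, 8))   # a minibatch of 10 input vectors

one_at_a_time = np.stack([W @ x for x in xs])
batched = xs @ W.T

assert np.allclose(one_at_a_time, batched)
```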

SLIDE 50

Minibatching

SLIDE 51

Conditional Generation

CSCI 699

Instructor: Xiang Ren USC Computer Science

SLIDE 52

Language Models

  • Language models are generative models of text

s ~ P(x)

“The Malfoys!” said Hermione. Harry was watching him. He looked like Madame Maxime. When she strode up the wrong staircase to visit himself. “I’m afraid I’ve definitely been suspended from power, no chance — indeed?” said Snape. He put his head back behind them and read groups as they crossed a corner and fluttered down onto their ink lamp, and picked up his spoon. The doorbell rang. It was a lot cleaner down in London.

Text Credit: Max Deutsch (https://medium.com/deep-writing/

SLIDE 54

Conditioned Language Models

  • Not just generate text, generate text according to some specification:

    Input X           Output Y (Text)      Task
    Structured Data   NL Description       NL Generation
    English           Japanese             Translation
    Document          Short Description    Summarization
    Utterance         Response             Response Generation
    Image             Text                 Image Captioning
    Speech            Transcript           Speech Recognition

SLIDE 55

Formulation and Modeling

SLIDE 56

Calculating the Probability of a Sentence

SLIDE 57

Conditional Language Models

SLIDE 58

(One Type of) Language Model (Mikolov et al. 2011)

<s> → LSTM → predict I
I → LSTM → predict hate
hate → LSTM → predict this
this → LSTM → predict movie
movie → LSTM → predict </s>

SLIDE 59

(One Type of) Conditional Language Model (Sutskever et al. 2014)

Encoder: kono eiga ga kirai → LSTM → LSTM → LSTM → LSTM
Decoder: <s> → LSTM → argmax “I” → LSTM → argmax “hate” → LSTM → argmax “this” → LSTM → argmax “movie” → LSTM → argmax “</s>”
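The encoder-decoder idea can be sketched with a toy recurrence (all weights, sizes, and word ids below are invented for illustration; a real model would use trained LSTMs):

```python
import numpy as np

rng = np.random.default_rng(0)
H, V = 4, 5                      # hidden size, vocab size; pretend id 0 is "</s>"
Wx, Wh = rng.normal(size=(H, V)), rng.normal(size=(H, H))
Wout = rng.normal(size=(V, H))

def step(h, word_id):            # one (very simplified) recurrent step
    x = np.eye(V)[word_id]
    return np.tanh(Wx @ x + Wh @ h)

def translate(src_ids, max_len=10):
    h = np.zeros(H)
    for w in src_ids:            # encoder: read the source sentence
        h = step(h, w)
    out = []
    prev = 1                     # pretend id 1 is "<s>"
    while len(out) < max_len:    # decoder: greedy argmax generation
        h = step(h, prev)
        prev = int(np.argmax(Wout @ h))
        out.append(prev)
        if prev == 0:            # stop at "</s>"
            break
    return out
```

The key structural point matches the slide: the decoder's initial state comes from the encoder's final state.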

SLIDE 62

How to Pass Hidden State?

  • Initialize decoder w/ encoder (Sutskever et al. 2014):
    encoder → decoder
  • Transform (can be different dimensions):
    encoder → transform → decoder
  • Input at every time step (Kalchbrenner & Blunsom 2013):
    encoder → decoder, decoder, decoder

SLIDE 63

Methods of Generation

SLIDE 65

The Generation Problem

  • We have a model of P(Y|X); how do we use it to generate a sentence?
  • Two methods:
    • Sampling: try to generate a random sentence according to the probability distribution.
    • Argmax: try to generate the sentence with the highest probability.

SLIDE 66

Ancestral Sampling

  • Randomly generate words one-by-one.
  • An exact method for sampling from P(X), no further work needed.

    while yj-1 != “</s>”:
        yj ~ P(yj | X, y1, …, yj-1)

SLIDE 67

Greedy Search

  • One by one, pick the single highest-probability word
  • Not exact, real problems:
    • Will often generate the “easy” words first
    • Will prefer multiple common words to one rare word

    while yj-1 != “</s>”:
        yj = argmax P(yj | X, y1, …, yj-1)

SLIDE 68

Beam Search

  • Instead of picking one high-probability word, maintain several paths
  • Some in reading materials, more in a later class
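A minimal sketch of the idea (the toy next-word distribution is invented; a real system would use the conditional model's P(y | X, y<j)):

```python
import math

# Toy next-word distribution, the same at every step, for illustration only.
def next_probs(prefix):
    return {"a": 0.5, "b": 0.3, "</s>": 0.2}

def beam_search(beam_size=2, max_len=4):
    beams = [([], 0.0)]                        # (words so far, log-probability)
    for _ in range(max_len):
        candidates = []
        for words, lp in beams:
            if words and words[-1] == "</s>":  # finished hypotheses carry over
                candidates.append((words, lp))
                continue
            for w, p in next_probs(words).items():
                candidates.append((words + [w], lp + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]         # keep only the best few paths
    return beams
```

Keeping several partial hypotheses avoids the greedy trap of committing to an "easy" word early.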
SLIDE 69

Model Ensembling

SLIDE 70

Ensembling

  • Combine predictions from multiple models (e.g. two LSTMs each predict the next word from <s>, and their predictions are combined)
  • Why?
    • Multiple models make somewhat uncorrelated errors
    • Models tend to be more uncertain when they are about to make errors
    • Smooths over idiosyncrasies of the model

SLIDE 71

Linear Interpolation
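The formula itself did not survive extraction; linear interpolation of M models is standardly written as (a reconstruction, not the slide's exact notation):

```latex
P(x_i \mid \text{context}) = \sum_{m=1}^{M} \lambda_m \, P_m(x_i \mid \text{context}),
\qquad \lambda_m \ge 0, \quad \sum_{m=1}^{M} \lambda_m = 1
```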

SLIDE 72

Log-linear Interpolation
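Again reconstructing the lost formula: log-linear interpolation takes a weighted combination in log space and renormalizes over the vocabulary (a reconstruction, not the slide's exact notation):

```latex
P(x_i \mid \text{context}) =
\frac{\exp\!\big(\sum_{m=1}^{M} \lambda_m \log P_m(x_i \mid \text{context})\big)}
     {\sum_{x'} \exp\!\big(\sum_{m=1}^{M} \lambda_m \log P_m(x' \mid \text{context})\big)}
```

Unlike linear interpolation, a single model assigning near-zero probability can veto a word here.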

SLIDE 73

Parameter Averaging

  • Problem: ensembling means we have to use M models at test time, increasing our time/memory complexity
  • Parameter averaging is a cheap way to get some good effects of ensembling
  • Basically, write out models several times near the end of training, and take the average of parameters
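The averaging step is simple; a sketch over checkpoints stored as parameter dictionaries (the representation is my own, for illustration):

```python
# Average several checkpoints written out near the end of training;
# one averaged model then replaces an M-model ensemble at test time.
def average_checkpoints(checkpoints):
    avg = {}
    for name in checkpoints[0]:
        avg[name] = sum(ckpt[name] for ckpt in checkpoints) / len(checkpoints)
    return avg
```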

SLIDE 74

Ensemble Distillation (e.g. Kim et al. 2016)

  • Problem: parameter averaging only works for models within the same run
  • Knowledge distillation trains a model to copy the ensemble
  • Specifically, it tries to match the distribution over predicted words
  • Why? We want the model to make the same mistakes as an ensemble
  • Shown to increase accuracy notably
SLIDE 75

Stacking

  • What if we have two very different models where prediction of outputs is done in very different ways?
  • e.g. a word-by-word translation model and a character-by-character translation model
  • Stacking uses the output of one system in calculating features for another system

SLIDE 76

How do we Evaluate?

SLIDE 77

Basic Evaluation Paradigm

  • Use parallel test set
  • Use system to generate translations
  • Compare target translations w/ reference
SLIDE 78

Human Evaluation

  • Ask a human to do evaluation
  • Final goal, but slow, expensive, and sometimes inconsistent
SLIDE 79

BLEU

  • Works by comparing n-gram overlap w/ reference
  • Pros: easy to use, good for measuring system improvement
  • Cons: often doesn’t match human eval; bad for comparing very different systems

SLIDE 80

METEOR

  • Like BLEU in overall principle, with many other tricks: considers paraphrases, reordering, and function word/content word differences
  • Pros: generally significantly better than BLEU, esp. for high-resource languages
  • Cons: requires extra resources for new languages (although these can be made automatically), and more complicated

SLIDE 81

Perplexity

  • Calculate the perplexity of the words in the held-out set without doing generation
  • Pros: naturally solves the multiple-reference problem!
  • Cons: doesn’t consider decoding or actually generating output
  • May be reasonable for problems with lots of ambiguity

SLIDE 82

What Do We Condition On?

SLIDE 83

From Structured Data

  • (e.g. Wen et al 2015)

When you say “Natural Language Generation” to an old-school NLPer, it means this

SLIDE 84

From Input + Labels

  • (e.g. Zhou and Neubig 2017)

For example, word + morphological tags -> inflected word

  • Other options: politeness/gender in translation, etc.
SLIDE 85

From Images

  • (e.g. Karpathy et al. 2015)

Input is image features, output is text

SLIDE 86

Other Auxiliary Information

  • Name of a recipe + ingredients -> recipe (Kiddon et al. 2016)
  • TED talk description -> TED talk (Hoang et al. 2016)
  • etc. etc.
SLIDE 87

Questions?