

slide-1
SLIDE 1

Maxent and Neural Language Models (part 1)

CMSC 473/673 UMBC

Some slides adapted from 3SLP

slide-2
SLIDE 2

Outline

Maximum Entropy models
  • Defining the model
  • Defining the objective
  • Learning: Optimizing the objective
  • Math: gradient derivation
Neural (language) models

slide-3
SLIDE 3

Terminology

Log-Linear Models
(Multinomial) logistic regression / Softmax regression
Maximum Entropy models (MaxEnt)
Generalized Linear Models
Discriminative Naïve Bayes
Very shallow (sigmoidal) neural nets

(different names for the same family: viewed as statistical regression, as a model based in information theory, or, to be cool today :), as a very shallow neural net; "MaxEnt" is the common NLP term)

slide-4
SLIDE 4

Maxent Models are Flexible

Maxent models can be used:

  • to create featureful language models, or
  • to design discriminatively trained classifiers
slide-5
SLIDE 5

Maxent Models as Featureful n-gram Language Models

generatively trained: learn to model (class-specific) language

p(w_i | label, w_{i−N+1:i−1}) = maxent(label, w_{i−N+1:i−1}, w_i)

p(Colorless green ideas sleep furiously | Label) = p(Colorless | Label, <BOS>) * … * p(<EOS> | Label, furiously)

Model each n-gram term with a maxent model

slide-6
SLIDE 6

Maxent Models for Classification: Discriminatively โ€ฆ

๐‘ž ๐‘ ๐‘Œ) = ๐ง๐›๐ฒ๐Ÿ๐จ๐ฎ(๐‘Œ; ๐‘)

Discriminatively trained classifier

Directly model the posterior

slide-7
SLIDE 7

Maxent Models for Classification: Discriminatively or Generatively Trained

๐‘ž ๐‘ ๐‘Œ) โˆ ๐’’ ๐‘Œ ๐‘) โˆ— ๐‘ž(๐‘)

๐‘ž ๐‘ ๐‘Œ) = maxent(๐‘Œ; ๐‘)

Discriminatively trained classifier Generatively trained classifier with maxent-based language model

Directly model the posterior Model the posterior with Bayes rule

slide-8
SLIDE 8

Maximum Entropy (Log-linear) Models For Discriminatively Trained Classifiers

discriminatively trained: classify in one go

๐‘ž ๐‘ง ๐‘ฆ) = maxent ๐‘ฆ, ๐‘ง

(weโ€™ll start with this one)

slide-9
SLIDE 9

Discriminative ML Classification in 30 Seconds

  • Common goal: probabilistic classifier p(y | x)
  • Often done by defining meaningful features between x and y

slide-10
SLIDE 10

Discriminative ML Classification in 30 Seconds

  • Common goal: probabilistic classifier p(y | x)
  • Often done by defining meaningful features between x and y
    – Denoted by a general vector of K features f(x, y) = (f_1(x, y), …, f_K(x, y))

slide-11
SLIDE 11

Discriminative ML Classification in 30 Seconds

  • Common goal: probabilistic classifier p(y | x)
  • Often done by defining meaningful features between x and y
    – Denoted by a general vector of K features f(x, y) = (f_1(x, y), …, f_K(x, y))
  • Features can be thought of as "soft" rules
    – E.g., POSITIVE-sentiment tweets may be more likely to contain the word "happy"

slide-12
SLIDE 12

Discriminative ML Classification in 30 Seconds

  • Common goal: probabilistic classifier p(y | x)
  • Often done by defining meaningful features between x and y
    – Denoted by a general vector of K features f(x, y) = (f_1(x, y), …, f_K(x, y))
  • Features can be thought of as "soft" rules
    – E.g., POSITIVE-sentiment tweets may be more likely to contain the word "happy"

Q: What are the features in a Naïve Bayes classifier?

slide-13
SLIDE 13

Core Aspects to Maxent Classifier p(y|x)

We need to define

  • features ๐‘” ๐‘ฆ, ๐‘ง between x and y that are

meaningful;

  • weights ๐œ„ (one per feature) to say how

important each feature is; and

  • a way to form probabilities from ๐‘” and ๐œ„
slide-14
SLIDE 14

Discriminative Document Classification

ATTACK

Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.

p(ATTACK | doc)

slide-15
SLIDE 15

Discriminative Document Classification

ATTACK

Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.

  • # killed:
  • Type:
  • Perp:

shot

slide-16
SLIDE 16

Discriminative Document Classification

ATTACK

Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.

slide-21
SLIDE 21

Discriminative Document Classification

ATTACK

Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.

These extractions are all features that have fired (likely have some significance)

slide-22
SLIDE 22

ATTACK

Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.

We need to score the different extracted clues.

score1(๐Ÿ—, ATTACK) score2(๐Ÿ—, ATTACK) score3(๐Ÿ—, ATTACK)

slide-23
SLIDE 23

Score and Combine Our Clues

score1(doc, ATTACK)
score2(doc, ATTACK)
score3(doc, ATTACK)
…
scorek(doc, ATTACK)

COMBINE ➔ posterior probability of ATTACK

slide-24
SLIDE 24

Scoring Our Clues

Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region .

score(doc, ATTACK) =
  score1(doc, ATTACK)
  score2(doc, ATTACK)
  score3(doc, ATTACK)
  …

(ignore the feature indexing for now)

slide-25
SLIDE 25

Scoring Our Clues

Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region .

score(doc, ATTACK) =
  score1(doc, ATTACK)
  score2(doc, ATTACK)
  score3(doc, ATTACK)
  …

Learn these scores… but how? What do we optimize?
slide-26
SLIDE 26

https://www.csee.umbc.edu/courses/undergraduate/473/f19/loglin-tutorial/

Lesson 1

slide-27
SLIDE 27

Maxent Modeling

Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.

p(ATTACK | doc) ∝ SNAP(score(doc, ATTACK))

slide-28
SLIDE 28

What function…

  • operates on any real number?
  • is never less than 0?

slide-29
SLIDE 29

What function…

  • operates on any real number?
  • is never less than 0?

f(x) = exp(x)

slide-30
SLIDE 30

Maxent Modeling

Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.

p(ATTACK | doc) ∝ exp(score(doc, ATTACK))

slide-31
SLIDE 31

Maxent Modeling

Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.

p(ATTACK | doc) ∝ exp(score1(doc, ATTACK) + score2(doc, ATTACK) + score3(doc, ATTACK) + …)

slide-32
SLIDE 32

Maxent Modeling

Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.

p(ATTACK | doc) ∝ exp(score1(doc, ATTACK) + score2(doc, ATTACK) + score3(doc, ATTACK) + …)

Learn the scores (but we'll declare what combinations should be looked at)

slide-33
SLIDE 33

Maxent Modeling

Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.

p(ATTACK | doc) ∝ exp(weight1 · applies1(doc, ATTACK) + weight2 · applies2(doc, ATTACK) + weight3 · applies3(doc, ATTACK) + …)

slide-34
SLIDE 34

Maxent Modeling

Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.

p(ATTACK | doc) ∝ exp(weight1 · applies1(doc, ATTACK) + weight2 · applies2(doc, ATTACK) + weight3 · applies3(doc, ATTACK) + …)

K different weights… for K different features

slide-35
SLIDE 35

Maxent Modeling

Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.

p(ATTACK | doc) ∝ exp(weight1 · applies1(doc, ATTACK) + weight2 · applies2(doc, ATTACK) + weight3 · applies3(doc, ATTACK) + …)

K different weights… for K different features… multiplied and then summed

slide-36
SLIDE 36

Maxent Modeling

Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.

p(ATTACK | doc) ∝ exp(weight_vec · feature_vec(doc, ATTACK))

K different weights… for K different features… multiplied and then summed

slide-37
SLIDE 37

Maxent Modeling

Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.

p(ATTACK | doc) ∝ exp(θᵀ f(doc, ATTACK))

K different weights… for K different features… multiplied and then summed

slide-38
SLIDE 38

๐œ„๐‘ˆ๐‘”(๐Ÿ—, ATTACK)

Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region .

p( | ) =

ATTACK

exp( ))

Maxent Modeling

1 Z

Q: How do we define Z?

slide-39
SLIDE 39

Normalization for Classification

p(y | x) ∝ exp(θᵀ f(x, y))

classify doc x with label y in one go

Z = Σ_{label y} exp(θᵀ f(x, y))

slide-40
SLIDE 40

Normalization for Classification (long form)

p(y | x) ∝ exp(θᵀ f(x, y))

classify doc x with label y in one go

Z = Σ_{label y} exp(weight1 · applies1(x, y) + weight2 · applies2(x, y) + weight3 · applies3(x, y) + …)

slide-41
SLIDE 41

Core Aspects to Maxent Classifier p(y|x)

  • features ๐‘” ๐‘ฆ, ๐‘ง between x and y that are

meaningful;

  • weights ๐œ„ (one per feature) to say how

important each feature is; and

  • a way to form probabilities from ๐‘” and ๐œ„

๐‘ž ๐‘ง ๐‘ฆ) = exp(๐œ„๐‘ˆ๐‘”(๐‘ฆ, ๐‘ง)) ฯƒ๐‘งโ€ฒ exp(๐œ„๐‘ˆ๐‘”(๐‘ฆ, ๐‘งโ€ฒ))

slide-42
SLIDE 42

Outline

Maximum Entropy models
  • Defining the model
  • Defining the objective
  • Learning: Optimizing the objective
  • Math: gradient derivation
Neural (language) models

1. Defining Appropriate Features
2. Understanding features in conditional models

slide-43
SLIDE 43

Defining Appropriate Features in a Maxent Model

Feature functions help extract useful features (characteristics) of the data. They turn data into numbers. Features that are not 0 are said to have fired.

slide-44
SLIDE 44

Defining Appropriate Features in a Maxent Model

Feature functions help extract useful features (characteristics) of the data. They turn data into numbers. Features that are not 0 are said to have fired. They are generally templated, and often binary-valued (0 or 1), but can be real-valued.

slide-45
SLIDE 45

Templated Features

Define a feature f_clue(doc, label) for each clue you want to consider. The feature f_clue fires if the clue applies to / can be found in the (doc, label) pair. A clue is often a target phrase (an n-gram) and a label.

slide-46
SLIDE 46

Templated Features

Define a feature f_clue(doc, label) for each clue you want to consider. The feature f_clue fires if the clue applies to / can be found in the (doc, label) pair. A clue is often a target phrase (an n-gram) and a label.

Q: For a classifier p(label | doc), are clues that depend only on doc useful?

slide-47
SLIDE 47

Maxent Modeling: Templated Binary Feature Functions

applies_target,type(doc, ATTACK) = 1 if target occurs in doc and type == ATTACK; 0 otherwise   (binary)

Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.

p(ATTACK | doc) ∝ exp(weight1 · applies1(doc, ATTACK) + weight2 · applies2(doc, ATTACK) + weight3 · applies3(doc, ATTACK) + …)

slide-48
SLIDE 48

Example of a Templated Binary Feature Function

applies_target,type(doc, ATTACK) = 1 if target occurs in doc and type == ATTACK; 0 otherwise

applies_hurt,ATTACK(doc, ATTACK) = 1 if "hurt" occurs in doc and ATTACK == ATTACK; 0 otherwise
slide-49
SLIDE 49

Example of a Templated Binary Feature Function

applies_target,type(doc, ATTACK) = 1 if target occurs in doc and type == ATTACK; 0 otherwise

applies_hurt,ATTACK(doc, ATTACK) = 1 if "hurt" occurs in doc and ATTACK == ATTACK; 0 otherwise

Q: What does this function check?
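The templated feature above can be written directly as code (a sketch; the substring membership test stands in for however "occurs in" is defined):

```python
def applies(target, type_, doc, label):
    """Fires (returns 1) iff the target phrase occurs in doc AND the label equals type_."""
    return 1 if target in doc and label == type_ else 0
```

So applies("hurt", "ATTACK", doc, label) checks two things at once: that "hurt" is in the document, and that the label under consideration is ATTACK.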

slide-50
SLIDE 50

Example of a Templated Binary Feature Function

applies_target,type(doc, ATTACK) = 1 if target occurs in doc and type == ATTACK; 0 otherwise

applies_hurt,ATTACK(doc, ATTACK) = 1 if "hurt" occurs in doc and ATTACK == ATTACK; 0 otherwise

Q: If there are V vocab types and L label types:
  1. How many features are defined if unigram targets are used?

slide-51
SLIDE 51

Example of a Templated Binary Feature Function

applies_target,type(doc, ATTACK) = 1 if target occurs in doc and type == ATTACK; 0 otherwise

applies_hurt,ATTACK(doc, ATTACK) = 1 if "hurt" occurs in doc and ATTACK == ATTACK; 0 otherwise

Q: If there are V vocab types and L label types:
  1. How many features are defined if unigram targets are used?

A1: V × L

slide-52
SLIDE 52

Example of a Templated Binary Feature Function

applies_target,type(doc, ATTACK) = 1 if target occurs in doc and type == ATTACK; 0 otherwise

applies_hurt,ATTACK(doc, ATTACK) = 1 if "hurt" occurs in doc and ATTACK == ATTACK; 0 otherwise

Q: If there are V vocab types and L label types:
  1. How many features are defined if unigram targets are used?
  2. How many features are defined if bigram targets are used?

A1: V × L

slide-53
SLIDE 53

Example of a Templated Binary Feature Function

applies_target,type(doc, ATTACK) = 1 if target occurs in doc and type == ATTACK; 0 otherwise

applies_hurt,ATTACK(doc, ATTACK) = 1 if "hurt" occurs in doc and ATTACK == ATTACK; 0 otherwise

Q: If there are V vocab types and L label types:
  1. How many features are defined if unigram targets are used?
  2. How many features are defined if bigram targets are used?

A1: V × L   A2: V² × L

slide-54
SLIDE 54

Example of a Templated Binary Feature Function

applies_target,type(doc, ATTACK) = 1 if target occurs in doc and type == ATTACK; 0 otherwise

applies_hurt,ATTACK(doc, ATTACK) = 1 if "hurt" occurs in doc and ATTACK == ATTACK; 0 otherwise

Q: If there are V vocab types and L label types:
  1. How many features are defined if unigram targets are used?
  2. How many features are defined if bigram targets are used?
  3. How many features are defined if unigram and bigram targets are used?

A1: V × L   A2: V² × L

slide-55
SLIDE 55

Example of a Templated Binary Feature Function

applies_target,type(doc, ATTACK) = 1 if target occurs in doc and type == ATTACK; 0 otherwise

applies_hurt,ATTACK(doc, ATTACK) = 1 if "hurt" occurs in doc and ATTACK == ATTACK; 0 otherwise

Q: If there are V vocab types and L label types:
  1. How many features are defined if unigram targets are used?
  2. How many features are defined if bigram targets are used?
  3. How many features are defined if unigram and bigram targets are used?

A1: V × L   A2: V² × L   A3: (V + V²) × L

slide-56
SLIDE 56

More on Feature Functions

applies_target,type(doc, ATTACK) = 1 if target occurs in doc and type == ATTACK; 0 otherwise   (binary)

Templated real-valued: ???
Non-templated real-valued: ???

slide-57
SLIDE 57

More on Feature Functions

applies_target,type(doc, ATTACK) = 1 if target occurs in doc and type == ATTACK; 0 otherwise   (binary)

applies(doc, ATTACK) = log p(doc | ATTACK)   (non-templated real-valued)

Templated real-valued: ???

slide-58
SLIDE 58

More on Feature Functions

applies_target,type(doc, ATTACK) = 1 if target occurs in doc and type == ATTACK; 0 otherwise   (binary)

applies_target,type(doc, ATTACK) = log p(doc | ATTACK) + log p(type | ATTACK) + log p(ATTACK | type)   (templated real-valued)

applies(doc, ATTACK) = log p(doc | ATTACK)   (non-templated real-valued)

slide-59
SLIDE 59

Understanding Conditioning

p(y | x) ∝ count(x)

Q: Is this a good model?

slide-60
SLIDE 60

Understanding Conditioning

p(y | x) ∝ exp(θ · f(x))

Q: Is this a good model?

slide-61
SLIDE 61

https://www.csee.umbc.edu/courses/undergraduate/473/f19/loglin-tutorial/

Lesson 11

slide-62
SLIDE 62

Earlier, I said: Maxent Models as Featureful n-gram Language Models of text x

p(w_i | label, w_{i−N+1:i−1}) = maxent(label, w_{i−N+1:i−1}, w_i)

p(Colorless green ideas sleep furiously | Label) = p(Colorless | Label, <BOS>) * … * p(<EOS> | Label, furiously)

Model each n-gram term with a maxent model

Q: What would this look like?

slide-63
SLIDE 63

Language Model with Maxent n-grams

๐‘ž๐‘œ ๐Ÿ— ๐‘ง) = เท‘

๐‘—=1 ๐‘

maxent(๐‘ง, ๐‘ฆ๐‘—โˆ’๐‘œ+1:๐‘—โˆ’๐‘—, ๐‘ฆ๐‘—)

n-gram label

slide-64
SLIDE 64

Language Model with Maxent n-grams

๐‘ž๐‘œ ๐Ÿ— ๐‘ง) = เท‘

๐‘—=1 ๐‘

maxent(๐‘ง, ๐‘ฆ๐‘—โˆ’๐‘œ+1:๐‘—โˆ’๐‘—, ๐‘ฆ๐‘—)

= เท‘

๐‘—=1 ๐‘

exp(๐œ„๐‘ˆ๐‘”(๐‘ง, ๐‘ฆ๐‘—โˆ’๐‘œ+1:๐‘—โˆ’๐‘—, ๐‘ฆ๐‘—)) ฯƒ๐‘ฆโ€ฒ exp(๐œ„๐‘ˆ๐‘”(๐‘ง, ๐‘ฆ๐‘—โˆ’๐‘œ+1:๐‘—โˆ’๐‘—, ๐‘ฆโ€ฒ))

n-gram label

slide-65
SLIDE 65

Language Model with Maxent n-grams

๐‘ž๐‘œ ๐Ÿ— ๐‘ง) = เท‘

๐‘—=1 ๐‘

maxent(๐‘ง, ๐‘ฆ๐‘—โˆ’๐‘œ+1:๐‘—โˆ’๐‘—, ๐‘ฆ๐‘—)

n-gram

= เท‘

๐‘—=1 ๐‘ exp(๐œ„๐‘ˆ๐‘”(๐‘ง, ๐‘ฆ๐‘—โˆ’๐‘œ+1:๐‘—โˆ’๐‘—, ๐‘ฆ๐‘—))

๐‘Ž(๐‘ง, ๐‘ฆ๐‘—โˆ’๐‘œ+1:๐‘—โˆ’๐‘—)

slide-66
SLIDE 66

Language Model with Maxent n-grams

๐‘ž๐‘œ ๐Ÿ— ๐‘ง) = เท‘

๐‘—=1 ๐‘

maxent(๐‘ง, ๐‘ฆ๐‘—โˆ’๐‘œ+1:๐‘—โˆ’๐‘—, ๐‘ฆ๐‘—)

n-gram

= เท‘

๐‘—=1 ๐‘ exp(๐œ„๐‘ˆ๐‘”(๐‘ง, ๐‘ฆ๐‘—โˆ’๐‘œ+1:๐‘—โˆ’๐‘—, ๐‘ฆ๐‘—))

๐‘Ž(๐‘ง, ๐‘ฆ๐‘—โˆ’๐‘œ+1:๐‘—โˆ’๐‘—)

Q: Why is this Z a function of the context?
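One maxent n-gram factor, sketched in Python (the feature function and weights below are illustrative; the point is that Z must be recomputed per (label, history), because the sum ranges over next-word candidates in that specific context):

```python
import math

def lm_prob(theta, label, history, word, vocab, feats):
    """p(word | label, history) for one maxent n-gram factor."""
    def s(w):
        return sum(theta.get(k, 0.0) for k in feats(label, history, w))
    Z = sum(math.exp(s(w)) for w in vocab)   # a function of label AND history
    return math.exp(s(word)) / Z
```

With a single feature keyed on (label, previous word, next word) and a positive weight on ("L", "ideas", "sleep"), the model prefers "sleep" over other next words after "ideas", and the probabilities over the vocabulary sum to 1.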

slide-67
SLIDE 67

Outline

Maximum Entropy models
  • Defining the model
  • Defining the objective
  • Learning: Optimizing the objective
  • Math: gradient derivation
Neural (language) models

slide-68
SLIDE 68

p_θ(y | x): the probabilistic model

F(θ; x, y): the objective

slide-69
SLIDE 69

Primary Objective: Likelihood

  • Goal: maximize the score your model gives to the training data it observes
  • This is called the likelihood of your data
  • In classification, this is p(label | doc)
  • For language modeling, this is p(doc | label)
slide-70
SLIDE 70

Objective = Full Likelihood? (in LM)

∏_i p_θ(w_i | h_i) ∝ ∏_i exp(θᵀ f(w_i, h_i))

(assume h_i has whatever context and n-gram history necessary)

These values can have very small magnitude ➔ underflow. Differentiating this product could be a pain.

slide-71
SLIDE 71

Logarithms

(0, 1] ➔ (−∞, 0]
Products ➔ Sums: log(ab) = log(a) + log(b); log(a/b) = log(a) − log(b)
Inverse of exp: log(exp(x)) = x

slide-72
SLIDE 72

Log-Likelihood (n-gram LM)

Differentiating this becomes nicer (even though Z depends on θ). Wide range of (negative) numbers. Sums are more stable.

Products ➔ Sums: log(ab) = log(a) + log(b); log(a/b) = log(a) − log(b)

log ∏_i p_θ(w_i | h_i) = Σ_i log p_θ(w_i | h_i)

slide-73
SLIDE 73

Log-Likelihood (n-gram LM)

Wide range of (negative) numbers. Sums are more stable. Differentiating this becomes nicer (even though Z depends on θ).

Inverse of exp: log(exp(x)) = x

log ∏_i p_θ(w_i | h_i) = Σ_i log p_θ(w_i | h_i) = Σ_i [θᵀ f(w_i, h_i) − log Z(h_i)]

slide-74
SLIDE 74

Log-Likelihood (n-gram LM)

Wide range of (negative) numbers. Sums are more stable.

log ∏_i p_θ(w_i | h_i) = Σ_i log p_θ(w_i | h_i) = Σ_i [θᵀ f(w_i, h_i) − log Z(h_i)] = F(θ)

slide-75
SLIDE 75

Outline

Maximum Entropy classifiers
  • Defining the model
  • Defining the objective
  • Learning: Optimizing the objective
  • Math: gradient derivation
Neural (language) models

slide-76
SLIDE 76

How will we optimize F(ฮธ)?

Calculus

slide-77
SLIDE 77

[plot: F(θ) vs θ]

slide-78
SLIDE 78

[plot: F(θ) vs θ, with maximum at θ*]

slide-79
SLIDE 79

[plot: F(θ) vs θ, with F′(θ), the derivative of F wrt θ; maximum at θ*]

slide-80
SLIDE 80

[plot: F(θ) vs θ, with F′(θ), the derivative of F wrt θ; maximum at θ*]

What if you can't find the roots? Follow the derivative

slide-81
SLIDE 81

[plot: F(θ) vs θ, with F′(θ), the derivative of F wrt θ; maximum at θ*]

What if you can't find the roots? Follow the derivative

Set t = 0. Pick a starting value θ_t. Until converged:

  • 1. Get value z_t = F(θ_t)

slide-82
SLIDE 82

[plot: F(θ) vs θ, with F′(θ), the derivative of F wrt θ; maximum at θ*]

What if you can't find the roots? Follow the derivative

Set t = 0. Pick a starting value θ_t. Until converged:

  • 1. Get value z_t = F(θ_t)
  • 2. Get derivative g_t = F′(θ_t)

slide-83
SLIDE 83

[plot: F(θ) vs θ, with F′(θ), the derivative of F wrt θ; maximum at θ*]

What if you can't find the roots? Follow the derivative

Set t = 0. Pick a starting value θ_t. Until converged:

  • 1. Get value z_t = F(θ_t)
  • 2. Get derivative g_t = F′(θ_t)
  • 3. Get scaling factor ρ_t
  • 4. Set θ_{t+1} = θ_t + ρ_t · g_t
  • 5. Set t += 1


slide-86
SLIDE 86

[plot: F(θ) vs θ, with F′(θ), the derivative of F wrt θ; maximum at θ*]

What if you can't find the roots? Follow the derivative

Set t = 0. Pick a starting value θ_t. Until converged:

  • 1. Get value z_t = F(θ_t)
  • 2. Get derivative g_t = F′(θ_t)
  • 3. Get scaling factor ρ_t
  • 4. Set θ_{t+1} = θ_t + ρ_t · g_t
  • 5. Set t += 1

slide-87
SLIDE 87

Remember: Common Derivative Rules

slide-88
SLIDE 88

Gradient = Multi-variable derivative

K-dimensional input K-dimensional output

slide-89
SLIDE 89

Gradient Ascent

slide-95
SLIDE 95

[plot: F(θ) vs θ, with F′(θ), the derivative of F wrt θ; maximum at θ*]

What if you can't find the roots? Follow the gradient

Set t = 0. Pick a starting value θ_t. Until converged:

  • 1. Get value z_t = F(θ_t)
  • 2. Get gradient g_t = F′(θ_t)
  • 3. Get scaling factor ρ_t
  • 4. Set θ_{t+1} = θ_t + ρ_t · g_t
  • 5. Set t += 1

(θ and g are now K-dimensional vectors)
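The follow-the-derivative loop, as code (a 1-D sketch with a fixed scaling factor ρ; the concave quadratic objective in the usage below is a made-up example, not from the slides):

```python
def gradient_ascent(F, dF, theta0, rho=0.1, steps=100):
    """Maximize F by repeatedly stepping in the direction of its derivative dF."""
    theta = theta0
    for _ in range(steps):
        z = F(theta)              # step 1: get value z_t (tracked for monitoring)
        g = dF(theta)             # step 2: get derivative g_t
        theta = theta + rho * g   # steps 3-4: fixed rho_t, update theta_{t+1}
    return theta
```

For F(θ) = −(θ − 3)², starting at θ = 0, the iterates converge to the maximizer θ* = 3.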

slide-96
SLIDE 96

Outline

Maximum Entropy classifiers
  • Defining the model
  • Defining the objective
  • Learning: Optimizing the objective
  • Math: gradient derivation
Neural (language) models

slide-97
SLIDE 97

Reminder: Expectation of a Random Variable

number of pieces of candy: 1, 2, 3, 4, 5, 6
1/6 · 1 + 1/6 · 2 + 1/6 · 3 + 1/6 · 4 + 1/6 · 5 + 1/6 · 6 = 3.5

𝔼[X] = Σ_x x · p(x)

slide-98
SLIDE 98

Reminder: Expectation of a Random Variable

number of pieces of candy: 1, 2, 3, 4, 5, 6
1/2 · 1 + 1/10 · 2 + 1/10 · 3 + 1/10 · 4 + 1/10 · 5 + 1/10 · 6 = 2.5

𝔼[X] = Σ_x x · p(x)


slide-100
SLIDE 100

Expectations Depend on a Probability Distribution

number of pieces of candy: 1, 2, 3, 4, 5, 6
1/2 · 1 + 1/10 · 2 + 1/10 · 3 + 1/10 · 4 + 1/10 · 5 + 1/10 · 6 = 2.5

𝔼[X] = Σ_x x · p(x)

slide-101
SLIDE 101

Log-Likelihood (n-gram LM)

Wide range of (negative) numbers. Sums are more stable. Differentiating this becomes nicer (even though Z depends on θ).

log ∏_i p_θ(w_i | h_i) = Σ_i log p_θ(w_i | h_i) = Σ_i [θᵀ f(w_i, h_i) − log Z(h_i)] = F(θ)

slide-102
SLIDE 102

Log-Likelihood Gradient

Each component k is the difference between:

slide-103
SLIDE 103

Log-Likelihood Gradient

Each component k is the difference between:

the total value of feature f_k in the training data: Σ_i f_k(w_i, h_i)

slide-104
SLIDE 104

Log-Likelihood Gradient

Each component k is the difference between:

the total value of feature f_k in the training data: Σ_i f_k(w_i, h_i)

and

the total value the current model p_θ computes for feature f_k: Σ_i 𝔼_{w′∼p}[f_k(w′, h_i)]

slide-105
SLIDE 105

https://www.csee.umbc.edu/courses/undergraduate/473/f19/loglin-tutorial/

Lesson 6

slide-106
SLIDE 106

๐›ผ๐œ„๐บ ๐œ„ = ๐›ผ๐œ„ เท

๐‘—

๐œ„๐‘ˆ๐‘” ๐‘ฆ๐‘—, โ„Ž๐‘— โˆ’ log ๐‘Ž(โ„Ž๐‘—)

Log-Likelihood Gradient Derivation for LM p ๐‘ฆ๐‘— | โ„Ž๐‘—

๐‘ง๐‘—

slide-107
SLIDE 107

๐›ผ๐œ„๐บ ๐œ„ = ๐›ผ๐œ„ เท

๐‘—

๐œ„๐‘ˆ๐‘” ๐‘ฆ๐‘—, โ„Ž๐‘— โˆ’ log ๐‘Ž(โ„Ž๐‘—)

= เท

๐‘—

๐‘” ๐‘ฆ๐‘—, โ„Ž๐‘— โˆ’

Log-Likelihood Gradient Derivation for LM p ๐‘ฆ๐‘— | โ„Ž๐‘—

๐‘ง๐‘—

๐‘Ž โ„Ž๐‘— = เท

๐‘ฆโ€ฒ

exp(๐œ„ โ‹… ๐‘” ๐‘ฆโ€ฒ, โ„Ž๐‘— )

slide-108
SLIDE 108

๐›ผ๐œ„๐บ ๐œ„ = ๐›ผ๐œ„ เท

๐‘—

๐œ„๐‘ˆ๐‘” ๐‘ฆ๐‘—, โ„Ž๐‘— โˆ’ log ๐‘Ž(โ„Ž๐‘—)

= เท

๐‘—

๐‘” ๐‘ฆ๐‘—, โ„Ž๐‘— โˆ’ เท

๐‘—

เท

๐‘ฆโ€ฒ

exp ๐œ„๐‘ˆ๐‘” ๐‘ฆโ€ฒ, โ„Ž๐‘— ๐‘Ž โ„Ž๐‘— ๐‘”(๐‘ฆโ€ฒ, โ„Ž๐‘—)

Log-Likelihood Gradient Derivation for LM p ๐‘ฆ๐‘— | โ„Ž๐‘—

๐œ– ๐œ–๐œ„ log ๐‘•(โ„Ž ๐œ„ ) = ๐œ–๐‘• ๐œ–โ„Ž(๐œ„) ๐œ–โ„Ž ๐œ–๐œ„ use the (calculus) chain rule

scalar p(xโ€™ | hi) vector of functions

slide-109
SLIDE 109

๐›ผ๐œ„๐บ ๐œ„ = ๐›ผ๐œ„ เท

๐‘—

๐œ„๐‘ˆ๐‘” ๐‘ฆ๐‘—, โ„Ž๐‘— โˆ’ log ๐‘Ž(โ„Ž๐‘—)

= เท

๐‘—

๐‘” ๐‘ฆ๐‘—, โ„Ž๐‘— โˆ’ เท

๐‘—

เท

๐‘ฆโ€ฒ

exp ๐œ„๐‘ˆ๐‘” ๐‘ฆโ€ฒ, โ„Ž๐‘— ๐‘Ž โ„Ž๐‘— ๐‘”(๐‘ฆโ€ฒ, โ„Ž๐‘—)

Log-Likelihood Gradient Derivation for LM p ๐‘ฆ๐‘— | โ„Ž๐‘—

Do we want these to fully match? What does it mean if they do? What if we have missing values in our data?

slide-110
SLIDE 110

Gradient Optimization for LM p(w_i | h_i)

Set t = 0. Pick a starting value θ_t. Until converged:

  • 1. Get value z_t = F(θ_t)
  • 2. Get derivative g_t = F′(θ_t)
  • 3. Get scaling factor ρ_t
  • 4. Set θ_{t+1} = θ_t + ρ_t · g_t
  • 5. Set t += 1

F(θ) = Σ_i [θᵀ f(w_i, h_i) − log Z(h_i)]

∂F/∂θ_k = Σ_i f_k(w_i, h_i) − Σ_i Σ_{w′} f_k(w′, h_i) p(w′ | h_i)

slide-111
SLIDE 111

Gradient Optimization for Classifier p(y | x)

Set t = 0. Pick a starting value θ_t. Until converged:

  • 1. Get func. value F(θ_t)
  • 2. Get derivative g_t = F′(θ_t)
  • 3. Get scaling factor ρ_t
  • 4. Set θ_{t+1} = θ_t + ρ_t · g_t
  • 5. Set t += 1

F(θ) = θᵀ f(x, y) − log Z(x)

∂F/∂θ_k = f_k(x, y) − Σ_{y′} f_k(x, y′) p(y′ | x)

slide-112
SLIDE 112

Preventing Extreme Values

Naïve Bayes

Extreme values are 0 probabilities

p(x | y) ∝ count(y, x)   ➔   p(x | y) ∝ count(y, x) + α

slide-113
SLIDE 113

Preventing Extreme Values

Naïve Bayes: extreme values are 0 probabilities
p(x | y) ∝ count(y, x)   ➔   p(x | y) ∝ count(y, x) + α

Log-linear models: extreme values are large θ values
F(θ) = Σ_i log p_θ(y_i | x_i)   ➔   F(θ) = Σ_i log p_θ(y_i | x_i) − R(θ)

slide-114
SLIDE 114

Preventing Extreme Values

Naïve Bayes: extreme values are 0 probabilities
p(x | y) ∝ count(y, x)   ➔   p(x | y) ∝ count(y, x) + α

Log-linear models: extreme values are large θ values
F(θ) = Σ_i log p_θ(y_i | x_i)   ➔   F(θ) = Σ_i log p_θ(y_i | x_i) − R(θ)   (regularization)

slide-115
SLIDE 115

(Squared) L2 Regularization
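One common convention for the squared L2 regularizer (assumed here, along with the hyperparameter name `lam`) is R(θ) = (λ/2)·‖θ‖², whose gradient λθ is subtracted from the log-likelihood gradient:

```python
import numpy as np

lam = 0.1  # regularization strength, a tunable hyperparameter

def l2_penalty(theta):
    # R(theta) = (lam / 2) * ||theta||^2; the 1/2 makes the gradient tidy
    return 0.5 * lam * float(np.dot(theta, theta))

def l2_grad(theta):
    # dR/dtheta = lam * theta, subtracted from the objective's gradient
    return lam * np.asarray(theta, dtype=float)

theta = np.array([3.0, -4.0])
# penalty: 0.5 * 0.1 * 25 = 1.25; gradient contribution: [0.3, -0.4]
```

Because the penalty grows quadratically, it pushes hardest against exactly the large θ values the previous slides warn about.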

slide-116
SLIDE 116

https://www.csee.umbc.edu/courses/undergraduate/473/f19/loglin-tutorial/

Lesson 8

slide-117
SLIDE 117

Outline

Maximum Entropy classifiers
  • Defining the model
  • Defining the objective
  • Learning: Optimizing the objective
  • Math: gradient derivation
Neural (language) models

slide-118
SLIDE 118

Revisiting the SNAP Function

softmax

๐‘ž ๐‘ฃ ๐‘ค) โˆ exp(๐œ„ โ‹… ๐‘” ๐‘ค, ๐‘ฃ )

slide-119
SLIDE 119

Revisiting the SNAP Function

softmax

๐‘ž ๐‘ฃ ๐‘ค) โˆ exp(๐œ„ โ‹… ๐‘” ๐‘ค, ๐‘ฃ ) softmax ๐’œ ๐‘— = exp(๐‘จ๐‘—) ฯƒ๐‘˜ exp(๐‘จ

๐‘˜)

slide-120
SLIDE 120

N-gram Language Models

predict the next word given some context… wi-3 wi-2 wi-1 → wi

slide-121
SLIDE 121

N-gram Language Models

predict the next word given some context…

๐‘ž ๐‘ฅ๐‘— ๐‘ฅ๐‘—โˆ’3, ๐‘ฅ๐‘—โˆ’2, ๐‘ฅ๐‘—โˆ’1) โˆ ๐‘‘๐‘๐‘ฃ๐‘œ๐‘ข(๐‘ฅ๐‘—โˆ’3, ๐‘ฅ๐‘—โˆ’2, ๐‘ฅ๐‘—โˆ’1, ๐‘ฅ๐‘—)

wi-3 wi-2 wi-1 → wi   compute beliefs about what is likely…
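A minimal sketch of the proportional-to-count estimate above, using a toy corpus and relative frequencies (the helper name `ngram_probs` is an assumption; unseen contexts would need smoothing, which this sketch omits):

```python
from collections import Counter

def ngram_probs(corpus, context):
    # p(w_i | w_{i-3}, w_{i-2}, w_{i-1}) proportional to
    # count(w_{i-3}, w_{i-2}, w_{i-1}, w_i): a relative-frequency estimate.
    counts = Counter(tuple(corpus[i:i + 4]) for i in range(len(corpus) - 3))
    matches = {g[3]: c for g, c in counts.items() if g[:3] == tuple(context)}
    total = sum(matches.values())     # assumes the context was observed
    return {w: c / total for w, c in matches.items()}

corpus = "the cat sat on the mat the cat sat on the rug".split()
p = ngram_probs(corpus, ("sat", "on", "the"))  # {'mat': 0.5, 'rug': 0.5}
```

The context "sat on the" is followed once by "mat" and once by "rug", so each gets probability 0.5.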


slide-123
SLIDE 123

Maxent Language Models

predict the next word given some context… wi-3 wi-2 wi-1 → wi   compute beliefs about what is likely…

๐‘ž ๐‘ฅ๐‘— ๐‘ฅ๐‘—โˆ’3, ๐‘ฅ๐‘—โˆ’2, ๐‘ฅ๐‘—โˆ’1) = softmax(๐œ„ โ‹… ๐‘”(๐‘ฅ๐‘—โˆ’3, ๐‘ฅ๐‘—โˆ’2, ๐‘ฅ๐‘—โˆ’1, ๐‘ฅ๐‘—))

slide-124
SLIDE 124

Neural Language Models

predict the next word given some context… wi-3 wi-2 wi-1 → wi   compute beliefs about what is likely…

๐‘ž ๐‘ฅ๐‘— ๐‘ฅ๐‘—โˆ’3, ๐‘ฅ๐‘—โˆ’2, ๐‘ฅ๐‘—โˆ’1) = softmax(๐œ„ โ‹… ๐’ˆ(๐‘ฅ๐‘—โˆ’3, ๐‘ฅ๐‘—โˆ’2, ๐‘ฅ๐‘—โˆ’1, ๐‘ฅ๐‘—))

can we learn the feature function(s)?

slide-125
SLIDE 125

Neural Language Models

predict the next word given some context… wi-3 wi-2 wi-1 → wi   compute beliefs about what is likely…

๐‘ž ๐‘ฅ๐‘— ๐‘ฅ๐‘—โˆ’3, ๐‘ฅ๐‘—โˆ’2, ๐‘ฅ๐‘—โˆ’1) = softmax(๐œ„๐’™๐’‹ โ‹… ๐’ˆ(๐‘ฅ๐‘—โˆ’3, ๐‘ฅ๐‘—โˆ’2, ๐‘ฅ๐‘—โˆ’1))

can we learn the feature function(s) for just the context? can we learn word-specific weights (by type)?

slide-126
SLIDE 126

Neural Language Models

predict the next word given some context… wi-3 wi-2 wi-1 → wi   compute beliefs about what is likely…

๐‘ž ๐‘ฅ๐‘— ๐‘ฅ๐‘—โˆ’3, ๐‘ฅ๐‘—โˆ’2, ๐‘ฅ๐‘—โˆ’1) = softmax(๐œ„๐‘ฅ๐‘— โ‹… ๐’ˆ(๐‘ฅ๐‘—โˆ’3, ๐‘ฅ๐‘—โˆ’2, ๐‘ฅ๐‘—โˆ’1))

create/use "distributed representations"… ei-3 ei-2 ei-1 ew

slide-127
SLIDE 127

Neural Language Models

predict the next word given some context… wi-3 wi-2 wi-1 → wi   compute beliefs about what is likely…

๐‘ž ๐‘ฅ๐‘— ๐‘ฅ๐‘—โˆ’3, ๐‘ฅ๐‘—โˆ’2, ๐‘ฅ๐‘—โˆ’1) = softmax(๐œ„๐‘ฅ๐‘— โ‹… ๐’ˆ(๐‘ฅ๐‘—โˆ’3, ๐‘ฅ๐‘—โˆ’2, ๐‘ฅ๐‘—โˆ’1))

create/use "distributed representations"… ei-3 ei-2 ei-1
combine these representations… C = f (a matrix-vector product)
ew

slide-128
SLIDE 128

Neural Language Models

predict the next word given some context… wi-3 wi-2 wi-1 → wi   compute beliefs about what is likely…

๐‘ž ๐‘ฅ๐‘— ๐‘ฅ๐‘—โˆ’3, ๐‘ฅ๐‘—โˆ’2, ๐‘ฅ๐‘—โˆ’1) = softmax(๐œ„๐‘ฅ๐‘— โ‹… ๐’ˆ(๐‘ฅ๐‘—โˆ’3, ๐‘ฅ๐‘—โˆ’2, ๐‘ฅ๐‘—โˆ’1))

create/use "distributed representations"… ei-3 ei-2 ei-1
combine these representations… C = f (a matrix-vector product)
ew

matrix-vector product

ew θwi
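Putting the pieces together, a toy forward pass in the spirit of this architecture might look as follows; all sizes, the random initializations, and the choice of tanh as the combining nonlinearity are illustrative assumptions (a real model would learn E, W, and θ jointly):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, h = 10, 4, 5                 # toy vocab size, embedding dim, hidden dim
E = rng.normal(size=(V, d))        # embedding table: one row e_w per word type
W = rng.normal(size=(h, 3 * d))    # combines the three context embeddings
theta = rng.normal(size=(V, h))    # one output weight vector theta_w per word

def nplm(context):
    # p(w_i | w_{i-3}, w_{i-2}, w_{i-1}) = softmax(theta_w . f(context)):
    # look up distributed representations, combine them with a
    # matrix-vector product plus nonlinearity, then score every word.
    x = np.concatenate([E[w] for w in context])
    hidden = np.tanh(W @ x)        # learned feature function f(context)
    s = theta @ hidden             # theta_w . f(context) for every word w
    p = np.exp(s - s.max())        # stable softmax over the vocabulary
    return p / p.sum()

p = nplm([1, 2, 3])  # a distribution over all V words
```

Note how the context features f(context) are computed once and scored against every word's θ_w, exactly the factorization the equation on this slide describes.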


slide-130
SLIDE 130

โ€œA Neural Probabilistic Language Model,โ€ Bengio et al. (2003)

Baselines

LM Name               N-gram   Params.       Test Ppl.
Interpolation         3        -             336
Kneser-Ney backoff    3        -             323
Kneser-Ney backoff    5        -             321
Class-based backoff   3        500 classes   312
Class-based backoff   5        500 classes   312

slide-131
SLIDE 131

โ€œA Neural Probabilistic Language Model,โ€ Bengio et al. (2003)

Baselines

LM Name               N-gram   Params.       Test Ppl.
Interpolation         3        -             336
Kneser-Ney backoff    3        -             323
Kneser-Ney backoff    5        -             321
Class-based backoff   3        500 classes   312
Class-based backoff   5        500 classes   312

NPLM

N-gram   Word Vector Dim.   Hidden Dim.   Mix with non-neural LM   Ppl.
5        60                 50            No                       268
5        60                 50            Yes                      257
5        30                 100           No                       276
5        30                 100           Yes                      252

slide-132
SLIDE 132

โ€œA Neural Probabilistic Language Model,โ€ Bengio et al. (2003)

Baselines

LM Name               N-gram   Params.       Test Ppl.
Interpolation         3        -             336
Kneser-Ney backoff    3        -             323
Kneser-Ney backoff    5        -             321
Class-based backoff   3        500 classes   312
Class-based backoff   5        500 classes   312

NPLM

N-gram   Word Vector Dim.   Hidden Dim.   Mix with non-neural LM   Ppl.
5        60                 50            No                       268
5        60                 50            Yes                      257
5        30                 100           No                       276
5        30                 100           Yes                      252

"we were not able to see signs of over-fitting (on the validation set), possibly because we ran only 5 epochs (over 3 weeks using 40 CPUs)" (Sect. 4.2)

slide-133
SLIDE 133

Outline

Maximum Entropy classifiers
  • Defining the model
  • Defining the objective
  • Learning: Optimizing the objective
  • Math: gradient derivation
Neural (language) models