SLIDE 1

Naïve Bayes, Maxent and Neural Models

CMSC 473/673 UMBC

Some slides adapted from 3SLP

SLIDE 2

Outline

• Recap: classification (MAP vs. noisy channel) & evaluation
• Naïve Bayes (NB) classification
  • Terminology: bag-of-words
  • “Naïve” assumption
  • Training & performance
  • NB as a language model
• Maximum Entropy classifiers
  • Defining the model
  • Defining the objective
  • Learning: optimizing the objective
  • Math: gradient derivation
• Neural (language) models

SLIDE 3

Probabilistic Classification

Discriminatively trained classifier: directly model the posterior. Posterior classification/decoding: maximum a posteriori.

Generatively trained classifier: model the posterior with Bayes’ rule. Noisy channel model decoding.
SLIDE 4

Posterior Decoding: Probabilistic Text Classification

• Assigning subject categories, topics, or genres
• Spam detection
• Authorship identification
• Age/gender identification
• Language identification
• Sentiment analysis
• …

p(class | observed data) = p(observed data | class) · p(class) / p(observed data)

• p(observed data | class): class-based likelihood (the language model)
• p(class): prior probability of the class
• p(observed data): observation likelihood (averaged over all classes)
SLIDE 5

Noisy Channel Model

What I want to tell you: “sports” → what you actually see: “The Os lost again…”. Decode: hypothesized intents (“sad stories”, “sports”); rerank by reweighting according to what’s likely → “sports”.
SLIDE 6

Noisy Channel

• Machine translation
• Speech-to-text
• Spelling correction
• Text normalization
• Part-of-speech tagging
• Morphological analysis
• Image captioning
• …

p(possible (clean) output | observed (noisy) text) ∝ p(clean output) · p(noisy text | clean output)

• p(clean output): the (clean) language model
• p(noisy text | clean output): the observation (noisy) likelihood, i.e. the translation/decode model
SLIDE 7

Use Logarithms

SLIDE 8

Accuracy, Precision, and Recall

• Accuracy: % of items correct
• Precision: % of selected items that are correct
• Recall: % of correct items that are selected

                         Actually Correct      Actually Incorrect
Selected/guessed         True Positive (TP)    False Positive (FP)
Not selected/guessed     False Negative (FN)   True Negative (TN)
SLIDE 9

A combined measure: F

Weighted (harmonic) average of Precision & Recall; the balanced F1 measure sets β = 1.
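The formula being referenced (standard; β > 1 favors recall, β < 1 favors precision):

F_β = (1 + β²) · P · R / (β² · P + R)

A minimal sketch of all three metrics from confusion-matrix counts; the function name and example numbers are ours, not the deck’s:

def precision_recall_f(tp, fp, fn, beta=1.0):
    """Precision, recall, and F_beta from confusion-matrix counts."""
    precision = tp / (tp + fp)   # % of selected items that are correct
    recall = tp / (tp + fn)      # % of correct items that are selected
    b2 = beta ** 2
    f_beta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_beta

# e.g. 8 TP, 2 FP, 4 FN -> precision 0.8, recall ~0.667, F1 ~0.727
print(precision_recall_f(8, 2, 4))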

SLIDE 10

Outline

• Recap: classification (MAP vs. noisy channel) & evaluation
• Naïve Bayes (NB) classification
  • Terminology: bag-of-words
  • “Naïve” assumption
  • Training & performance
  • NB as a language model
• Maximum Entropy classifiers
  • Defining the model
  • Defining the objective
  • Learning: optimizing the objective
  • Math: gradient derivation
• Neural (language) models

SLIDE 11

The Bag of Words Representation

SLIDE 12

The Bag of Words Representation

SLIDE 13

The Bag of Words Representation

SLIDE 14

Bag of Words Representation

γ( document ) = class c

The document is reduced to a vector of word counts, which the classifier maps to a class:

seen 2, sweet 1, whimsical 1, recommend 1, happy 1, …
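A minimal sketch of this reduction (whitespace tokenization and lowercasing are simplifying assumptions here):

from collections import Counter

def bag_of_words(text):
    """Reduce a document to unordered word counts; position is discarded."""
    return Counter(text.lower().split())

print(bag_of_words("I love this fun film and I recommend it"))
# Counter({'i': 2, 'love': 1, 'this': 1, 'fun': 1, 'film': 1, ...})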

SLIDE 15

Naïve Bayes Classifier

Start with Bayes Rule

Here the label is the class we predict from the text. Q: Are we doing discriminative training or generative training?
SLIDE 16

Naïve Bayes Classifier

Start with Bayes Rule

Here the label is the class we predict from the text. Q: Are we doing discriminative training or generative training?

A: generative training
SLIDE 17

Naïve Bayes Classifier

Adopt the naïve bag-of-words representation: the label is X; each word (token) is a Y_i.
SLIDE 18

Naïve Bayes Classifier

Adopt the naïve bag-of-words representation: the label is X; each word (token) is a Y_i. Assume position doesn’t matter.
SLIDE 19

Naïve Bayes Classifier

Adopt the naïve bag-of-words representation: the label is X; each word (token) is a Y_i. Assume position doesn’t matter. Assume the feature probabilities are independent given the class X.
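Putting the three assumptions together gives the usual multinomial NB decision rule, written in the deck’s convention (label x, word tokens y_i):

x̂ = argmax_x p(x) · Π_i p(y_i | x)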

SLIDE 20

Multinomial Naïve Bayes: Learning

From training corpus, extract Vocabulary

SLIDE 21

Multinomial Naïve Bayes: Learning

From training corpus, extract Vocabulary.

Calculate P(c_j) terms: for each c_j in C:
    docs_j = all docs with class = c_j
SLIDE 22

Brill and Banko (2001): with enough data, the classifier may not matter.
SLIDE 23

Multinomial Naïve Bayes: Learning

From training corpus, extract Vocabulary.

Calculate P(c_j) terms: for each c_j in C:
    docs_j = all docs with class = c_j

Calculate P(w_k | c_j) terms (the class unigram LM, p(w_k | c_j)):
    Text_j = single doc containing all of docs_j
    for each word w_k in Vocabulary:
        n_k = # of occurrences of w_k in Text_j
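A compact sketch of this training loop. The slide’s estimation formula was an image that didn’t survive extraction, so the add-one (Laplace) smoothing below is an assumption rather than a transcription:

from collections import Counter, defaultdict
import math

def train_nb(docs, labels):
    """docs: list of token lists; labels: parallel list of classes."""
    vocab = {w for doc in docs for w in doc}             # extract Vocabulary
    log_prior, log_lik = {}, defaultdict(dict)
    for c in set(labels):
        docs_c = [d for d, l in zip(docs, labels) if l == c]
        log_prior[c] = math.log(len(docs_c) / len(docs))  # P(c_j)
        text_c = Counter(w for d in docs_c for w in d)    # Text_j
        n = sum(text_c.values())
        for w in vocab:                    # P(w_k | c_j), add-one smoothed
            log_lik[c][w] = math.log((text_c[w] + 1) / (n + len(vocab)))
    return log_prior, log_lik

def classify_nb(doc, log_prior, log_lik):
    """argmax over classes of log P(c) + sum of log P(w | c)."""
    return max(log_prior, key=lambda c: log_prior[c] +
               sum(log_lik[c][w] for w in doc if w in log_lik[c]))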

SLIDE 24

Naïve Bayes and Language Modeling

Naïve Bayes classifiers can use any sort of feature. But if, as in the previous slides, we use only word features, and we use all of the words in the text (not a subset), then Naïve Bayes has an important similarity to language modeling.

SLIDE 25

Naïve Bayes as a Language Model

Sec.13.2.1

Positive model: P(I) = 0.1, P(love) = 0.1, P(this) = 0.01, P(fun) = 0.05, P(film) = 0.1
Negative model: P(I) = 0.2, P(love) = 0.001, P(this) = 0.01, P(fun) = 0.005, P(film) = 0.1

SLIDE 26

Naïve Bayes as a Language Model

Which class assigns the higher probability to s?

s = “I love this fun film”

Sec.13.2.1

Positive model: P(I) = 0.1, P(love) = 0.1, P(this) = 0.01, P(fun) = 0.05, P(film) = 0.1
Negative model: P(I) = 0.2, P(love) = 0.001, P(this) = 0.01, P(fun) = 0.005, P(film) = 0.1

SLIDE 27

Naïve Bayes as a Language Model

Which class assigns the higher probability to s?

s = “I love this fun film”

Positive model: P(I) = 0.1, P(love) = 0.1, P(this) = 0.01, P(fun) = 0.05, P(film) = 0.1
Negative model: P(I) = 0.2, P(love) = 0.001, P(this) = 0.01, P(fun) = 0.005, P(film) = 0.1

P(s | pos) = 0.1 × 0.1 × 0.01 × 0.05 × 0.1
P(s | neg) = 0.2 × 0.001 × 0.01 × 0.005 × 0.1

Sec.13.2.1

SLIDE 28

Naïve Bayes as a Language Model

Which class assigns the higher probability to s?

s = “I love this fun film”

Positive model: P(I) = 0.1, P(love) = 0.1, P(this) = 0.01, P(fun) = 0.05, P(film) = 0.1
Negative model: P(I) = 0.2, P(love) = 0.001, P(this) = 0.01, P(fun) = 0.005, P(film) = 0.1

P(s | pos) = 0.1 × 0.1 × 0.01 × 0.05 × 0.1 = 5e-7
P(s | neg) = 0.2 × 0.001 × 0.01 × 0.005 × 0.1 = 1e-9

5e-7 ≈ P(s | pos) > P(s | neg) ≈ 1e-9

Sec.13.2.1

SLIDE 29

Summary: Naïve Bayes is Not So Naïve

• Very fast, low storage requirements
• Robust to irrelevant features
• Very good in domains with many equally important features
• Optimal if the independence assumptions hold
• Dependable baseline for text classification (but often not the best)
SLIDE 30

But: Naïve Bayes Isn’t Without Issue

• Model the posterior in one go?
• Are the features really uncorrelated?
• Are plain counts always appropriate?
• Are there “better” ways of handling missing/noisy data (automated, more principled)?
SLIDE 31

Outline

• Recap: classification (MAP vs. noisy channel) & evaluation
• Naïve Bayes (NB) classification
  • Terminology: bag-of-words
  • “Naïve” assumption
  • Training & performance
  • NB as a language model
• Maximum Entropy classifiers
  • Defining the model
  • Defining the objective
  • Learning: optimizing the objective
  • Math: gradient derivation
• Neural (language) models
SLIDE 32

Connections to Other Techniques

Log-linear models go by many names, depending on the view taken:
• (Multinomial) logistic regression / softmax regression (as statistical regression)
• Maximum Entropy models (MaxEnt) (based in information theory)
• Generalized Linear Models
• Discriminative Naïve Bayes
• Very shallow (sigmoidal) neural nets (to be cool today :) )
SLIDE 33

Maxent Models for Classification: Discriminatively or Generatively Trained

Discriminatively trained classifier: directly model the posterior.
Generatively trained classifier: model the posterior with Bayes’ rule.
SLIDE 34

Maximum Entropy (Log-linear) Models

discriminatively trained: classify in one go

SLIDE 35

Maximum Entropy (Log-linear) Models

generatively trained: learn to model language

SLIDE 36

Document Classification

ATTACK

Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.

p( ATTACK | doc )
SLIDE 37

Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.

Document Classification

ATTACK

  • # killed:
  • Type:
  • Perp:

shot

ATTACK

SLIDE 38

Document Classification

ATTACK

Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.

SLIDE 39

Document Classification

ATTACK

Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.

SLIDE 40

Document Classification

ATTACK

Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.

SLIDE 41

Document Classification

ATTACK

Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.

SLIDE 42

Document Classification

ATTACK

Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.

SLIDE 43

ATTACK

Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.

We need to score the different combinations.

SLIDE 44

Score and Combine Our Possibilities

score1(fatally shot, ATTACK)
score2(seriously wounded, ATTACK)
score3(Shining Path, ATTACK)
…
score_k(department, ATTACK)

COMBINE → posterior probability of ATTACK. Are all of these uncorrelated?
SLIDE 45

Score and Combine Our Possibilities

score1(fatally shot, ATTACK)
score2(seriously wounded, ATTACK)
score3(Shining Path, ATTACK)

COMBINE → posterior probability of ATTACK

Q: What are the score and combine functions for Naïve Bayes?
SLIDE 46

Scoring Our Possibilities

Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region .

score( doc , ATTACK ) combines:

score1(fatally shot, ATTACK)
score2(seriously wounded, ATTACK)
score3(Shining Path, ATTACK)
SLIDE 47

https://www.csee.umbc.edu/courses/undergraduate/473/f18/loglin-tutorial/

https://goo.gl/BQCdH9 Lesson 1

SLIDE 48

Maxent Modeling

Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.

p( ATTACK | doc ) ∝ SNAP(score( doc , ATTACK ))
SLIDE 49

What function…

• operates on any real number?
• is never less than 0?
SLIDE 50

What function…

• operates on any real number?
• is never less than 0?

f(x) = exp(x)
SLIDE 51

Maxent Modeling

Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.

p( ATTACK | doc ) ∝ exp(score( doc , ATTACK ))
SLIDE 52

Maxent Modeling

Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.

p( ATTACK | doc ) ∝ exp( score1(fatally shot, ATTACK) + score2(seriously wounded, ATTACK) + score3(Shining Path, ATTACK) + … )
SLIDE 53

Maxent Modeling

Learn the scores (but we’ll declare what combinations should be looked at).

Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.

p( ATTACK | doc ) ∝ exp( score1(fatally shot, ATTACK) + score2(seriously wounded, ATTACK) + score3(Shining Path, ATTACK) + … )
SLIDE 54

Maxent Modeling

Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.

p( ATTACK | doc ) ∝ exp( weight1 · occurs1(fatally shot, ATTACK) + weight2 · occurs2(seriously wounded, ATTACK) + weight3 · occurs3(Shining Path, ATTACK) + … )
SLIDE 55

Maxent Modeling: Feature Functions

Feature functions help extract useful features (characteristics) of the data. Generally templated. Often binary-valued (0 or 1), but can be real-valued.

binary:
occurs_{target,type}(fatally shot, ATTACK) = 1 if target == fatally shot and type == ATTACK; 0 otherwise

Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.

p( ATTACK | doc ) ∝ exp( weight1 · occurs1(fatally shot, ATTACK) + weight2 · occurs2(seriously wounded, ATTACK) + weight3 · occurs3(Shining Path, ATTACK) + … )
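A sketch of one templated binary feature, matching the definition above (the closure-based templating is our illustration, not the deck’s code):

def make_occurs(target, type_):
    """Instantiate the binary feature occurs_{target,type}."""
    def occurs(span, label):
        return 1.0 if span == target and label == type_ else 0.0
    return occurs

occurs1 = make_occurs("fatally shot", "ATTACK")
print(occurs1("fatally shot", "ATTACK"))  # 1.0
print(occurs1("fatally shot", "SPORTS"))  # 0.0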

SLIDE 56

More on Feature Functions

Feature functions help extract useful features (characteristics) of the data. Generally templated. Often binary-valued (0 or 1), but can be real-valued.

binary:
occurs_{target,type}(fatally shot, ATTACK) = 1 if target == fatally shot and type == ATTACK; 0 otherwise

templated, real-valued:
occurs_{target,type}(fatally shot, ATTACK) = log p(fatally shot | ATTACK) + log p(type | ATTACK) + log p(ATTACK | type)

non-templated, real-valued:
occurs(fatally shot, ATTACK) = log p(fatally shot | ATTACK)

non-templated, count-valued: ???

SLIDE 57

More on Feature Functions

Feature functions help extract useful features (characteristics) of the data. Generally templated. Often binary-valued (0 or 1), but can be real-valued.

binary:
occurs_{target,type}(fatally shot, ATTACK) = 1 if target == fatally shot and type == ATTACK; 0 otherwise

templated, real-valued:
occurs_{target,type}(fatally shot, ATTACK) = log p(fatally shot | ATTACK) + log p(type | ATTACK) + log p(ATTACK | type)

non-templated, real-valued:
occurs(fatally shot, ATTACK) = log p(fatally shot | ATTACK)

non-templated, count-valued:
occurs(fatally shot, ATTACK) = count(fatally shot, ATTACK)

SLIDE 58

Maxent Modeling

Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.

p( ATTACK | doc ) = (1/Z) · exp( weight1 · applies1(fatally shot, ATTACK) + weight2 · applies2(seriously wounded, ATTACK) + weight3 · applies3(Shining Path, ATTACK) + … )

Q: How do we define Z?

SLIDE 59

Normalization for Classification

Z = Σ over labels x′ of exp( weight1 · occurs1(fatally shot, x′) + weight2 · occurs2(seriously wounded, x′) + weight3 · occurs3(Shining Path, x′) + … )

p(x | y) ∝ exp(θ ⋅ f(x, y)): classify doc y with label x in one go
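A minimal sketch of this normalization: score every label, exponentiate, divide by Z. The feature extractor f(x, doc), which returns a weight-aligned vector, is a stand-in:

import math

def maxent_posterior(theta, f, labels, doc):
    """p(x | doc) = exp(theta . f(x, doc)) / Z, Z summing over all labels."""
    scores = {x: sum(t * v for t, v in zip(theta, f(x, doc))) for x in labels}
    z = sum(math.exp(s) for s in scores.values())  # Z: normalize over labels
    return {x: math.exp(s) / z for x, s in scores.items()}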

SLIDE 60

Normalization for Language Model

general class-based (X) language model of doc y

SLIDE 61

Normalization for Language Model

Can be significantly harder in the general case

general class-based (X) language model of doc y

SLIDE 62

Normalization for Language Model

Can be significantly harder in the general case Simplifying assumption: maxent n-grams!

general class-based (X) language model of doc y

SLIDE 63

Understanding Conditioning

Is this a good language model?

SLIDE 64

Understanding Conditioning

Is this a good language model?

SLIDE 65

Understanding Conditioning

Is this a good language model? (no)

SLIDE 66

Understanding Conditioning

Is this a good posterior classifier? (no)

SLIDE 67

https://www.csee.umbc.edu/courses/undergraduate/473/f18/loglin-tutorial/

https://goo.gl/BQCdH9 Lesson 11

SLIDE 68

Outline

• Recap: classification (MAP vs. noisy channel) & evaluation
• Naïve Bayes (NB) classification
  • Terminology: bag-of-words
  • “Naïve” assumption
  • Training & performance
  • NB as a language model
• Maximum Entropy classifiers
  • Defining the model
  • Defining the objective
  • Learning: optimizing the objective
  • Math: gradient derivation
• Neural (language) models
SLIDE 69

p_θ(x | y): the probabilistic model

objective (given observations)
SLIDE 70

Objective = Full Likelihood?

• These values can have very small magnitude ➔ underflow
• Differentiating this product could be a pain
SLIDE 71

Logarithms

• (0, 1] ➔ (-∞, 0]
• Products ➔ sums: log(ab) = log(a) + log(b); log(a/b) = log(a) − log(b)
• Inverse of exp: log(exp(x)) = x
SLIDE 72

Log-Likelihood

• Differentiating this becomes nicer (even though Z depends on θ)
• Wide range of (negative) numbers
• Sums are more stable

Products ➔ sums: log(ab) = log(a) + log(b); log(a/b) = log(a) − log(b)
SLIDE 73

Log-Likelihood

• Wide range of (negative) numbers
• Sums are more stable
• Differentiating this becomes nicer (even though Z depends on θ)

Inverse of exp: log(exp(x)) = x
SLIDE 74

Log-Likelihood

• Wide range of (negative) numbers
• Sums are more stable

= F(θ)
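Written out (reconstructed to match the gradient on slide 110, with x_i the gold label for doc y_i):

F(θ) = Σ_i log p_θ(x_i | y_i) = Σ_i [ θ ⋅ f(x_i, y_i) − log Σ_{x′} exp(θ ⋅ f(x′, y_i)) ]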

SLIDE 75

Outline

• Recap: classification (MAP vs. noisy channel) & evaluation
• Naïve Bayes (NB) classification
  • Terminology: bag-of-words
  • “Naïve” assumption
  • Training & performance
  • NB as a language model
• Maximum Entropy classifiers
  • Defining the model
  • Defining the objective
  • Learning: optimizing the objective
  • Math: gradient derivation
• Neural (language) models
SLIDE 76

How will we optimize F(θ)?

Calculus

SLIDE 77

[Plot: F(θ) as a function of θ]
SLIDE 78

[Plot: F(θ) as a function of θ, with its maximum at θ*]
SLIDE 79

[Plot: F(θ) and F’(θ), the derivative of F wrt θ, with the maximum at θ*]
SLIDE 80

Example

F(x) = −(x − 2)²

Differentiate: F’(x) = −2x + 4. Solve F’(x) = 0: x = 2.
SLIDE 81

Common Derivative Rules

SLIDE 82

[Plot: F(θ), F’(θ) (the derivative of F wrt θ), maximum at θ*]

What if you can’t find the roots? Follow the derivative.
SLIDE 83

[Plot: F(θ), F’(θ) (the derivative of F wrt θ), maximum at θ*; starting point (θ0, y0)]

What if you can’t find the roots? Follow the derivative.

Set t = 0. Pick a starting value θ_t. Until converged:
  1. Get value y_t = F(θ_t)
SLIDE 84

[Plot: F(θ), F’(θ) (the derivative of F wrt θ), maximum at θ*; point (θ0, y0) with derivative g0]

What if you can’t find the roots? Follow the derivative.

Set t = 0. Pick a starting value θ_t. Until converged:
  1. Get value y_t = F(θ_t)
  2. Get derivative g_t = F’(θ_t)
SLIDE 85

[Plot: F(θ), F’(θ) (the derivative of F wrt θ), maximum at θ*; step from θ0 to θ1 along g0]

What if you can’t find the roots? Follow the derivative.

Set t = 0. Pick a starting value θ_t. Until converged:
  1. Get value y_t = F(θ_t)
  2. Get derivative g_t = F’(θ_t)
  3. Get scaling factor ρ_t
  4. Set θ_{t+1} = θ_t + ρ_t · g_t
  5. Set t += 1
SLIDE 86

[Plot: F(θ), F’(θ) (the derivative of F wrt θ), maximum at θ*; iterates θ0 → θ1 → θ2 along g0, g1]

What if you can’t find the roots? Follow the derivative.

Set t = 0. Pick a starting value θ_t. Until converged:
  1. Get value y_t = F(θ_t)
  2. Get derivative g_t = F’(θ_t)
  3. Get scaling factor ρ_t
  4. Set θ_{t+1} = θ_t + ρ_t · g_t
  5. Set t += 1
SLIDE 87

[Plot: F(θ), F’(θ) (the derivative of F wrt θ), maximum at θ*; iterates θ0 → θ1 → θ2 → θ3 along g0, g1, g2]

What if you can’t find the roots? Follow the derivative.

Set t = 0. Pick a starting value θ_t. Until converged:
  1. Get value y_t = F(θ_t)
  2. Get derivative g_t = F’(θ_t)
  3. Get scaling factor ρ_t
  4. Set θ_{t+1} = θ_t + ρ_t · g_t
  5. Set t += 1
SLIDE 88

[Plot: F(θ), F’(θ) (the derivative of F wrt θ), maximum at θ*; iterates θ0 → θ1 → θ2 → θ3 along g0, g1, g2]

What if you can’t find the roots? Follow the derivative.

Set t = 0. Pick a starting value θ_t. Until converged:
  1. Get value y_t = F(θ_t)
  2. Get derivative g_t = F’(θ_t)
  3. Get scaling factor ρ_t
  4. Set θ_{t+1} = θ_t + ρ_t · g_t
  5. Set t += 1
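The loop above in runnable form, applied to the earlier example F(x) = −(x − 2)², F’(x) = −2x + 4, with a constant scaling factor ρ (an assumption; choosing ρ_t is its own topic):

def gradient_ascent(F, F_prime, theta0, rho=0.1, tol=1e-8, max_iter=1000):
    theta = theta0
    for t in range(max_iter):
        y = F(theta)             # 1. get value
        g = F_prime(theta)       # 2. get derivative
        theta = theta + rho * g  # 3-4. scale it, step uphill
        if abs(g) < tol:         # converged when the derivative ~ 0
            break
    return theta

print(gradient_ascent(lambda x: -(x - 2) ** 2, lambda x: -2 * x + 4, theta0=0.0))
# ~2.0, matching the closed-form solution from before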

SLIDE 89

Gradient = Multi-variable derivative

K-dimensional input ➔ K-dimensional output
SLIDE 90

Gradient Ascent

SLIDE 91

Gradient Ascent

SLIDE 92

Gradient Ascent

SLIDE 93

Gradient Ascent

SLIDE 94

Gradient Ascent

SLIDE 95

Gradient Ascent

SLIDE 96

Outline

• Recap: classification (MAP vs. noisy channel) & evaluation
• Naïve Bayes (NB) classification
  • Terminology: bag-of-words
  • “Naïve” assumption
  • Training & performance
  • NB as a language model
• Maximum Entropy classifiers
  • Defining the model
  • Defining the objective
  • Learning: optimizing the objective
  • Math: gradient derivation
• Neural (language) models
SLIDE 97

Expectations

Outcomes 1-6 (number of pieces of candy), each with probability 1/6:
1/6 · 1 + 1/6 · 2 + 1/6 · 3 + 1/6 · 4 + 1/6 · 5 + 1/6 · 6 = 3.5
SLIDE 98

Expectations

Outcomes 1-6 (number of pieces of candy), with P(1) = 1/2 and P(2) = … = P(6) = 1/10:
1/2 · 1 + 1/10 · 2 + 1/10 · 3 + 1/10 · 4 + 1/10 · 5 + 1/10 · 6 = 2.5
SLIDE 99

Expectations

Outcomes 1-6 (number of pieces of candy), with P(1) = 1/2 and P(2) = … = P(6) = 1/10:
1/2 · 1 + 1/10 · 2 + 1/10 · 3 + 1/10 · 4 + 1/10 · 5 + 1/10 · 6 = 2.5
SLIDE 100

Expectations

Outcomes 1-6 (number of pieces of candy), with P(1) = 1/2 and P(2) = … = P(6) = 1/10:
1/2 · 1 + 1/10 · 2 + 1/10 · 3 + 1/10 · 4 + 1/10 · 5 + 1/10 · 6 = 2.5
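What both slides are computing, as a probability-weighted sum. The same kind of expectation, taken under p_θ(x′ | y), shows up in the log-likelihood gradient shortly:

def expectation(p, f):
    """E[f] = sum over outcomes x of p(x) * f(x)."""
    return sum(p[x] * f(x) for x in p)

fair = {x: 1 / 6 for x in range(1, 7)}
skewed = {1: 1 / 2, 2: 1 / 10, 3: 1 / 10, 4: 1 / 10, 5: 1 / 10, 6: 1 / 10}
print(expectation(fair, lambda x: x))    # 3.5
print(expectation(skewed, lambda x: x))  # 2.5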

SLIDE 101

Log-Likelihood

• Wide range of (negative) numbers
• Sums are more stable
• Differentiating this becomes nicer (even though Z depends on θ)

= F(θ)
SLIDE 102

Log-Likelihood Gradient

Each component k is the difference between:

SLIDE 103

Log-Likelihood Gradient

Each component k is the difference between:
• the total value of feature f_k in the training data
SLIDE 104

Log-Likelihood Gradient

Each component k is the difference between:
• the total value of feature f_k in the training data, and
• the total value the current model p_θ expects for feature f_k, summing f_k(x′, y_i) weighted by p(x′ | y_i) over all labels x′ for each doc y_i
SLIDE 105

https://www.csee.umbc.edu/courses/undergraduate/473/f18/loglin-tutorial/

https://goo.gl/BQCdH9 Lesson 6

SLIDE 106

Log-Likelihood Gradient Derivation

SLIDE 107

Log-Likelihood Gradient Derivation

SLIDE 108

Log-Likelihood Gradient Derivation

Use the (calculus) chain rule:

∂/∂θ log g(h(θ)) = (∂g/∂h(θ)) · (∂h/∂θ)

Here p(x′ | y_i) is a scalar, and f is a vector of functions.
SLIDE 109

Log-Likelihood Gradient Derivation

Do we want these to fully match? What does it mean if they do? What if we have missing values in our data?

SLIDE 110

Gradient Optimization

Set t = 0. Pick a starting value θ_t. Until converged:
  1. Get value y_t = F(θ_t)
  2. Get derivative g_t = F’(θ_t)
  3. Get scaling factor ρ_t
  4. Set θ_{t+1} = θ_t + ρ_t · g_t
  5. Set t += 1

∂F/∂θ_k = Σ_i f_k(x_i, y_i) − Σ_i Σ_{x′} f_k(x′, y_i) · p(x′ | y_i)
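A sketch of that gradient: observed feature totals minus expected feature totals, reusing the earlier maxent_posterior sketch (names and data layout are ours):

def log_likelihood_gradient(theta, f, labels, data):
    """data: list of (doc, gold_label) pairs. Returns dF/dtheta."""
    grad = [0.0] * len(theta)
    for doc, gold in data:
        posterior = maxent_posterior(theta, f, labels, doc)  # p(x' | y_i)
        gold_feats = f(gold, doc)
        for k in range(len(theta)):
            grad[k] += gold_feats[k]                      # observed total
            grad[k] -= sum(posterior[x] * f(x, doc)[k]    # expected total
                           for x in labels)
    return grad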

SLIDE 111

Do we want these to fully match? What does it mean if they do? What if we have missing values in our data?

SLIDE 112

Preventing Extreme Values

Naïve Bayes: extreme values are 0 probabilities
SLIDE 113

Preventing Extreme Values

Naïve Bayes: extreme values are 0 probabilities
Log-linear models: extreme values are large θ values
SLIDE 114

Preventing Extreme Values

Naïve Bayes: extreme values are 0 probabilities
Log-linear models: extreme values are large θ values

➔ regularization
SLIDE 115

(Squared) L2 Regularization
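The slide’s formula was an image; a standard form of the squared-L2-regularized objective (λ, the regularization strength, is our notation) is:

F_reg(θ) = Σ_i log p_θ(x_i | y_i) − (λ/2) · ‖θ‖²

Large weights are pulled back toward 0, which addresses exactly the “extreme values” issue named on the previous slide.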

SLIDE 116

https://www.csee.umbc.edu/courses/undergraduate/473/f18/loglin-tutorial/

https://goo.gl/BQCdH9 Lesson 8

SLIDE 117

Outline

• Recap: classification (MAP vs. noisy channel) & evaluation
• Naïve Bayes (NB) classification
  • Terminology: bag-of-words
  • “Naïve” assumption
  • Training & performance
  • NB as a language model
• Maximum Entropy classifiers
  • Defining the model
  • Defining the objective
  • Learning: optimizing the objective
  • Math: gradient derivation
• Neural (language) models
SLIDE 118

Revisiting the SNAP Function

softmax

SLIDE 119

Revisiting the SNAP Function

softmax
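SNAP, revealed: the softmax, which exponentiates each score and normalizes. A minimal, numerically stable sketch (subtracting the max is a standard trick, not something the slide shows):

import math

def softmax(scores):
    """exp(s_j) / sum_k exp(s_k), shifted by max(s) to avoid overflow."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

print(softmax([2.0, 1.0, 0.1]))  # ~[0.659, 0.242, 0.099]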

SLIDE 120

N-gram Language Models

predict the next word given some context: w_{i-3}, w_{i-2}, w_{i-1} → w_i
SLIDE 121

N-gram Language Models

predict the next word given some context (w_{i-3}, w_{i-2}, w_{i-1} → w_i); compute beliefs about what is likely:

p(w_i | w_{i-3}, w_{i-2}, w_{i-1}) ∝ count(w_{i-3}, w_{i-2}, w_{i-1}, w_i)
SLIDE 122

N-gram Language Models

predict the next word given some context (w_{i-3}, w_{i-2}, w_{i-1} → w_i); compute beliefs about what is likely:

p(w_i | w_{i-3}, w_{i-2}, w_{i-1}) ∝ count(w_{i-3}, w_{i-2}, w_{i-1}, w_i)
SLIDE 123

Maxent Language Models

predict the next word given some context (w_{i-3}, w_{i-2}, w_{i-1} → w_i); compute beliefs about what is likely:

p(w_i | w_{i-3}, w_{i-2}, w_{i-1}) ∝ softmax(θ ⋅ f(w_{i-3}, w_{i-2}, w_{i-1}, w_i))
SLIDE 124

Neural Language Models

predict the next word given some context (w_{i-3}, w_{i-2}, w_{i-1} → w_i); compute beliefs about what is likely:

p(w_i | w_{i-3}, w_{i-2}, w_{i-1}) ∝ softmax(θ ⋅ f(w_{i-3}, w_{i-2}, w_{i-1}, w_i))

can we learn the feature function(s)?
SLIDE 125

Neural Language Models

predict the next word given some context (w_{i-3}, w_{i-2}, w_{i-1} → w_i); compute beliefs about what is likely:

p(w_i | w_{i-3}, w_{i-2}, w_{i-1}) ∝ softmax(θ_{w_i} ⋅ f(w_{i-3}, w_{i-2}, w_{i-1}))

can we learn the feature function(s) for just the context? can we learn word-specific weights (by type)?
SLIDE 126

Neural Language Models

predict the next word given some context (w_{i-3}, w_{i-2}, w_{i-1} → w_i); compute beliefs about what is likely:

p(w_i | w_{i-3}, w_{i-2}, w_{i-1}) ∝ softmax(θ_{w_i} ⋅ f(w_{i-3}, w_{i-2}, w_{i-1}))

create/use “distributed representations”: e_{i-3}, e_{i-2}, e_{i-1}, e_w
SLIDE 127

Neural Language Models

predict the next word given some context (w_{i-3}, w_{i-2}, w_{i-1} → w_i); compute beliefs about what is likely:

p(w_i | w_{i-3}, w_{i-2}, w_{i-1}) ∝ softmax(θ_{w_i} ⋅ f(w_{i-3}, w_{i-2}, w_{i-1}))

create/use “distributed representations” e_{i-3}, e_{i-2}, e_{i-1}; combine these representations: C = f (a matrix-vector product); e_w
SLIDE 128

Neural Language Models

predict the next word given some context (w_{i-3}, w_{i-2}, w_{i-1} → w_i); compute beliefs about what is likely:

p(w_i | w_{i-3}, w_{i-2}, w_{i-1}) ∝ softmax(θ_{w_i} ⋅ f(w_{i-3}, w_{i-2}, w_{i-1}))

create/use “distributed representations” e_{i-3}, e_{i-2}, e_{i-1}; combine these representations: C = f (a matrix-vector product); e_w, θ_{w_i}
SLIDE 129

Neural Language Models

predict the next word given some context (w_{i-3}, w_{i-2}, w_{i-1} → w_i); compute beliefs about what is likely:

p(w_i | w_{i-3}, w_{i-2}, w_{i-1}) ∝ softmax(θ_{w_i} ⋅ f(w_{i-3}, w_{i-2}, w_{i-1}))

create/use “distributed representations” e_{i-3}, e_{i-2}, e_{i-1}; combine these representations: C = f (a matrix-vector product); e_w, θ_{w_i}
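A sketch of this forward pass in the spirit of Bengio et al. (2003): look up context embeddings, combine them (concatenation followed by an affine map and tanh is one common choice for f, not necessarily the deck’s exact picture), then softmax against per-word weights θ_w. All dimensions and the random initialization are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(0)
V, d, h = 5000, 30, 100             # vocab size, embedding dim, hidden dim
E = rng.normal(0, 0.1, (V, d))      # distributed representations e_w
W = rng.normal(0, 0.1, (h, 3 * d))  # combines the 3 context embeddings
b = np.zeros(h)
Theta = rng.normal(0, 0.1, (V, h))  # per-word output weights theta_w

def next_word_probs(w3, w2, w1):
    """p(w_i | w_{i-3}, w_{i-2}, w_{i-1}) = softmax(Theta @ f(context))."""
    context = np.concatenate([E[w3], E[w2], E[w1]])  # e_{i-3}, e_{i-2}, e_{i-1}
    C = np.tanh(W @ context + b)                     # combine: matrix-vector product
    scores = Theta @ C                               # theta_{w_i} . f(...) for every w_i
    exps = np.exp(scores - scores.max())             # softmax over the vocabulary
    return exps / exps.sum()

p = next_word_probs(10, 42, 7)
print(p.shape, p.sum())  # (5000,) 1.0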

SLIDE 130

“A Neural Probabilistic Language Model,” Bengio et al. (2003)

Baselines

LM Name             | N-gram | Params      | Test Ppl.
Interpolation       | 3      |             | 336
Kneser-Ney backoff  | 3      |             | 323
Kneser-Ney backoff  | 5      |             | 321
Class-based backoff | 3      | 500 classes | 312
Class-based backoff | 5      | 500 classes | 312
SLIDE 131

“A Neural Probabilistic Language Model,” Bengio et al. (2003)

Baselines

LM Name             | N-gram | Params      | Test Ppl.
Interpolation       | 3      |             | 336
Kneser-Ney backoff  | 3      |             | 323
Kneser-Ney backoff  | 5      |             | 321
Class-based backoff | 3      | 500 classes | 312
Class-based backoff | 5      | 500 classes | 312

NPLM

N-gram | Word Vector Dim. | Hidden Dim. | Mix with non-neural LM | Test Ppl.
5      | 60               | 50          | No                     | 268
5      | 60               | 50          | Yes                    | 257
5      | 30               | 100         | No                     | 276
5      | 30               | 100         | Yes                    | 252
SLIDE 132

“A Neural Probabilistic Language Model,” Bengio et al. (2003)

Baselines

LM Name             | N-gram | Params      | Test Ppl.
Interpolation       | 3      |             | 336
Kneser-Ney backoff  | 3      |             | 323
Kneser-Ney backoff  | 5      |             | 321
Class-based backoff | 3      | 500 classes | 312
Class-based backoff | 5      | 500 classes | 312

NPLM

N-gram | Word Vector Dim. | Hidden Dim. | Mix with non-neural LM | Test Ppl.
5      | 60               | 50          | No                     | 268
5      | 60               | 50          | Yes                    | 257
5      | 30               | 100         | No                     | 276
5      | 30               | 100         | Yes                    | 252

“we were not able to see signs of over-fitting (on the validation set), possibly because we ran only 5 epochs (over 3 weeks using 40 CPUs)” (Sect. 4.2)