Maxent Models (II), CMSC 473/673, UMBC, September 20, 2017



SLIDE 1

Maxent Models (II)

CMSC 473/673 UMBC September 20th, 2017

SLIDE 2

Announcements: Assignment 1

Due 11:59 PM, Saturday 9/23 (~3.5 days)
Use the submit utility with class id cs473_ferraro and assignment id a1
We must be able to run it on GL!
Common pitfall #1: forgetting files
Common pitfall #2: incorrect paths to files
Common pitfall #3: 3rd-party libraries

SLIDE 3

Announcements: Question 6

$$q^{(o)}(y_j \mid y_{j-o+1:j-1}) = \mu_o\, g^{(o)}(y_{j-o+1:j}) + (1 - \mu_o)\, q^{(o-1)}(y_j \mid y_{j-o+2:j-1})$$

$$q^{(o)}(y_j \mid y_{j-o+1:j-1}) = \mu_{o,o}\, g^{(o)}(y_{j-o+1:j}) + \mu_{o,o-1}\, g^{(o-1)}(y_{j-o+2:j}) + \cdots + \mu_{o,0}\, g^{(0)}(\cdot)$$

$$\mu_{o,0} = 1 - \sum_{n=0}^{o-1} \mu_{o,o-n}$$
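The interpolation recursion above can be sketched in code. Here `g` (the per-order maximum-likelihood estimates) and the weights `mu` are assumed to be given, and all names are illustrative; this mirrors the two-parameter form:

```python
def interp_prob(order, history, word, g, mu):
    """Interpolated estimate q^(o)(word | history).

    g[o] maps an o-gram tuple to its maximum-likelihood estimate g^(o);
    mu[o] is the interpolation weight for order o (both assumed given).
    """
    if order == 0:
        return g[0][()]  # order-0 base case (e.g., a uniform estimate)
    ml = g[order].get(tuple(history) + (word,), 0.0)
    # mix the order-o ML estimate with the next-lower-order interpolated
    # estimate, dropping the oldest word of the history
    return mu[order] * ml + (1.0 - mu[order]) * interp_prob(
        order - 1, history[1:], word, g, mu)
```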

SLIDE 4

Announcements: Course Project

Official handout will be out Friday 9/22

Until then, focus on assignment 1

Teams of 1-3
Mixed undergrad/grad teams are encouraged but not required
Some novel aspect is needed

Ex 1: reimplement an existing technique and apply it to a new domain
Ex 2: reimplement an existing technique and apply it to a new (human) language
Ex 3: explore a novel technique on an existing problem

SLIDE 5

Recap from last time…

SLIDE 6

Classify or Decode with Bayes Rule

p(Y | X): how well does text Y represent label X? p(X): how likely is label X overall?

For “simple” or “flat” labels:
  * iterate through the labels
  * evaluate the score for each label, keeping only the best (or n best)
  * return the best (or n best) label and score
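The iterate-score-return loop above, as a sketch; the `likelihood` and `prior` arguments are placeholder functions standing in for p(Y | X) and p(X):

```python
def classify(text, labels, likelihood, prior):
    """Decode with Bayes' rule: return the label X maximizing
    p(Y | X) * p(X), together with its score."""
    best_label, best_score = None, float("-inf")
    for x in labels:                       # iterate through labels
        score = likelihood(text, x) * prior(x)
        if score > best_score:             # keep only the best
            best_label, best_score = x, score
    return best_label, best_score          # return the best label and score
```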

SLIDE 7

Classification Evaluation: the 2-by-2 contingency table

                          Actually Correct     Actually Incorrect
Selected/guessed          True Positive (TP)   False Positive (FP)
Not selected/not guessed  False Negative (FN)  True Negative (TN)


SLIDE 8

Classification Evaluation: Accuracy, Precision, and Recall

Accuracy: % of items correct
Precision: % of selected items that are correct
Recall: % of correct items that are selected

                          Actually Correct     Actually Incorrect
Selected/guessed          True Positive (TP)   False Positive (FP)
Not selected/not guessed  False Negative (FN)  True Negative (TN)
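The three metrics computed directly from the table's four cells:

```python
def accuracy(tp, fp, fn, tn):
    """% of items correct: (TP + TN) over everything."""
    return (tp + tn) / (tp + fp + fn + tn)

def precision(tp, fp):
    """% of selected items that are correct: TP / (TP + FP)."""
    return tp / (tp + fp)

def recall(tp, fn):
    """% of correct items that are selected: TP / (TP + FN)."""
    return tp / (tp + fn)
```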

SLIDE 9

Language Modeling as Naïve Bayes Classifier

Adopt a naïve bag-of-words representation Y_i
Assume position doesn’t matter
Assume the feature probabilities are independent given the class X
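Under these assumptions, scoring a class reduces to a sum of log probabilities over the bag of words. A sketch, where `log_prior` and `log_likelihood` are assumed lookup tables, and the unseen-word floor is a simplistic stand-in for proper smoothing:

```python
import math

def nb_log_score(doc_words, label, log_prior, log_likelihood):
    """log p(X) + sum_i log p(word_i | X): position is ignored and
    words are treated as independent given the class."""
    score = log_prior[label]
    for w in doc_words:
        # simplistic floor for unseen words (stand-in for real smoothing)
        score += log_likelihood[label].get(w, math.log(1e-10))
    return score
```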

SLIDE 10

Naïve Bayes Summary

Potential Advantages

Very fast, low storage requirements
Robust to irrelevant features
Very good in domains with many equally important features
Optimal if the independence assumptions hold
Dependable baseline for text classification (but often not the best)

Potential Issues

Model the posterior in one go?
Are the features really uncorrelated?
Are plain counts always appropriate?
Are there “better” (automated, more principled) ways of handling missing/noisy data?

SLIDE 11

Naïve Bayes Summary (continued)

The potential issues are relevant for classification…

SLIDE 12

Naïve Bayes Summary (continued)

The potential issues are relevant for classification… and language modeling.

SLIDE 13

Maximum Entropy Models

a more general language model

argmax_X p(Y | X) · p(X)

SLIDE 14

Maximum Entropy Models

a more general language model: argmax_X p(Y | X) · p(X)
classify in one go: argmax_X p(X | Y)

SLIDE 15

Document Classification

Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.

Label: ATTACK

We need to score the different combinations.

SLIDE 16

Score and Combine Our Possibilities

score1(fatally shot, ATTACK) score2(seriously wounded, ATTACK) score3(Shining Path, ATTACK)

COMBINE into the posterior probability of ATTACK. Are all of these uncorrelated?

scorek(department, ATTACK)

SLIDE 17

Score and Combine Our Possibilities

score1(fatally shot, ATTACK) score2(seriously wounded, ATTACK) score3(Shining Path, ATTACK)

COMBINE into the posterior probability of ATTACK

Q: What are the score and combine functions for Naïve Bayes?

SLIDE 18

Scoring Our Possibilities

Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.

score(doc, ATTACK) =
  score1(fatally shot, ATTACK)
  score2(seriously wounded, ATTACK)
  score3(Shining Path, ATTACK)
  …

Learn these scores… but how? What do we optimize?
SLIDE 19

Maxent Modeling

doc: Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.

p(ATTACK | doc) ∝ SNAP(score(doc, ATTACK))

SLIDE 20

Maxent Modeling

p(ATTACK | doc) ∝ exp(score(doc, ATTACK))

f(x) = exp(x): exp gives a positive, unnormalized probability

SLIDE 21

Maxent Modeling

p(ATTACK | doc) ∝ exp( score1(fatally shot, ATTACK) + score2(seriously wounded, ATTACK) + score3(Shining Path, ATTACK) + … )

Learn the scores (but we’ll declare what combinations should be looked at)

SLIDE 22

Maxent Modeling

p(ATTACK | doc) ∝ exp( weight1 * applies1(fatally shot, ATTACK) + weight2 * applies2(seriously wounded, ATTACK) + weight3 * applies3(Shining Path, ATTACK) + … )

SLIDE 23

Q: What if none of our features apply?

SLIDE 24

Guiding Principle for Log-Linear Models

“[The log-linear estimate] is the least biased estimate possible on the given information; i.e., it is maximally noncommittal with regard to missing information.”

Edwin T. Jaynes, 1957

SLIDE 25

Guiding Principle for Log-Linear Models

If no features apply, f = 0, so exp(θ · f) = exp(θ · 0) = 1 for every label: the estimate stays maximally noncommittal.

SLIDE 26

exp( θ1 * f1(fatally shot, ATTACK) + θ2 * f2(seriously wounded, ATTACK) + θ3 * f3(Shining Path, ATTACK) + … )

Easier-to-write form

SLIDE 27

exp( θ1 * f1(fatally shot, ATTACK) + θ2 * f2(seriously wounded, ATTACK) + θ3 * f3(Shining Path, ATTACK) + … )

Easier-to-write form: K weights, K features

SLIDE 28

exp(θ · f(doc, ATTACK))

Easier-to-write form

K-dimensional weight vector θ, K-dimensional feature vector f, combined via a dot product
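In this form the unnormalized score is just exp of a dot product. A minimal sketch; note that when no features apply (f = 0) the score is exp(0) = 1:

```python
import math

def maxent_score(theta, features):
    """Unnormalized maxent score exp(theta . f) for K-dimensional
    weight and feature vectors."""
    dot = sum(t * f for t, f in zip(theta, features))
    return math.exp(dot)
```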

SLIDE 29

Log-Linear Models

SLIDE 31

Log-Linear Models

f: feature function(s), also called sufficient statistics or “strength” function(s)

SLIDE 32

Log-Linear Models

θ: feature weights, also called natural parameters or distribution parameters

SLIDE 33

Log-Linear Models

How do we normalize?

SLIDE 34

Normalization for Classification

Z = Σ_{label x} exp( weight1 * applies1(fatally shot, x) + weight2 * applies2(seriously wounded, x) + weight3 * applies3(Shining Path, x) + … )

p(x | y) = exp(θ · f(x, y)) / Z

classify doc y with label x in one go
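Dividing each label's exponentiated score by Z gives the posterior. A sketch, with `feats[x]` standing in for the feature vector f(x, y) of the document under label x:

```python
import math

def label_posterior(feats, theta):
    """p(x | y) = exp(theta . f(x, y)) / Z, where Z sums the
    exponentiated scores over all labels x."""
    scores = {x: math.exp(sum(t * fi for t, fi in zip(theta, f)))
              for x, f in feats.items()}
    Z = sum(scores.values())
    return {x: s / Z for x, s in scores.items()}
```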

SLIDE 35

Normalization for Language Model

general class-based (X) language model of doc y


SLIDE 37

Normalization for Language Model

Can be significantly harder in the general case
Simplifying assumption: maxent n-grams!

general class-based (X) language model of doc y
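With the n-gram assumption, Z only has to sum over the vocabulary for one fixed context, rather than over all possible documents. A sketch with a hypothetical feature function `f(word, context)`:

```python
import math

def maxent_ngram_prob(word, context, vocab, theta, f):
    """p(word | context) = exp(theta . f(word, context)) / Z, where
    Z sums over the (finite) vocabulary for this one context."""
    def score(w):
        return math.exp(sum(t * fi for t, fi in zip(theta, f(w, context))))
    Z = sum(score(w) for w in vocab)
    return score(word) / Z
```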

SLIDE 38

https://www.csee.umbc.edu/courses/undergraduate/473/f17/loglin-tutorial/

https://goo.gl/B23Rxo Lesson 1

SLIDE 39

Connections to Other Techniques

Log-Linear Models

SLIDE 44

Connections to Other Techniques

Log-Linear Models are known as, or connect to:
  as statistical regression: (multinomial) logistic regression, softmax regression
  based in information theory: Maximum Entropy models (MaxEnt)
  a form of: Generalized Linear Models
  viewed as: discriminative Naïve Bayes
  to be cool today :): very shallow (sigmoidal) neural nets

SLIDE 45

Learning θ

SLIDE 46

p_θ(y | x)

probabilistic model
objective (given observations)

SLIDE 47

How will we optimize F(θ)?

Calculus

SLIDE 50

[Figure: F(θ) plotted against θ; F′(θ) is the derivative of F with respect to θ; the maximum is at θ*]

SLIDE 51

Example

F(x) = −(x − 2)²

Differentiate: F′(x) = −2x + 4
Solve F′(x) = 0 → x = 2
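The same worked example in code, checking that the derivative vanishes at the maximum:

```python
def F(x):
    """The objective from the example."""
    return -(x - 2) ** 2

def F_prime(x):
    """Its derivative, obtained by the power rule."""
    return -2 * x + 4
```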

SLIDE 52

Common Derivative Rules

SLIDE 53

[Figure: F(θ) vs. θ, with derivative F′(θ) and maximum θ*]

What if you can’t find the roots? Follow the derivative

SLIDE 59

[Figure: F(θ) vs. θ, with derivative F′(θ) and maximum θ*]

What if you can’t find the roots? Follow the derivative

Set t = 0
Pick a starting value θ_t
Until converged:
  1. Get value y_t = F(θ_t)
  2. Get derivative g_t = F′(θ_t)
  3. Get scaling factor ρ_t
  4. Set θ_{t+1} = θ_t + ρ_t * g_t
  5. Set t += 1

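The loop above, applied to the example F(x) = −(x − 2)²; the step size ρ is held fixed here for simplicity, though in practice it is often decayed over time:

```python
def follow_derivative(F, F_prime, theta, rho=0.1, tol=1e-8, max_steps=10000):
    """Follow the derivative uphill until the objective value
    stops changing."""
    y = F(theta)                       # 1. get value
    for _ in range(max_steps):
        g = F_prime(theta)             # 2. get derivative
        theta = theta + rho * g        # 3-4. scale and step uphill
        y_new = F(theta)
        if abs(y_new - y) < tol:       # converged?
            break
        y = y_new
    return theta

theta_star = follow_derivative(lambda x: -(x - 2) ** 2,
                               lambda x: -2 * x + 4,
                               theta=0.0)
```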

SLIDE 60

Gradient = Multi-variable derivative

K-dimensional input, K-dimensional output

SLIDE 61

Gradient Ascent
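In K dimensions the update is the same, with the gradient in place of the derivative. A sketch on a toy quadratic objective; the objective and its optimum at (1, −3) are illustrative only:

```python
def gradient_ascent(grad, theta, rho=0.1, steps=200):
    """theta <- theta + rho * grad(theta): K-dimensional input,
    K-dimensional gradient."""
    for _ in range(steps):
        g = grad(theta)
        theta = [t + rho * gi for t, gi in zip(theta, g)]
    return theta

# maximize F(theta) = -(theta_0 - 1)^2 - (theta_1 + 3)^2
grad = lambda th: [-2.0 * (th[0] - 1.0), -2.0 * (th[1] + 3.0)]
theta_hat = gradient_ascent(grad, [0.0, 0.0])
```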
