Maxent Models (III), & Neural Language Models CMSC 473/673 - - PowerPoint PPT Presentation

β–Ά
maxent models iii neural language models
SMART_READER_LITE
LIVE PREVIEW

Maxent Models (III), & Neural Language Models CMSC 473/673 - - PowerPoint PPT Presentation

Maxent Models (III), & Neural Language Models CMSC 473/673 UMBC September 25 th , 2017 Some slides adapted from 3SLP Recap from last time Maximum Entropy Models a more general language model argmax ) ()


slide-1
SLIDE 1

Maxent Models (III), & Neural Language Models

CMSC 473/673 UMBC September 25th, 2017

Some slides adapted from 3SLP

SLIDE 2

Recap from last time…

SLIDE 3

Maximum Entropy Models

a more general language model: argmax_X p(Y | X) * p(X)
classify in one go: argmax_X p(X | Y)

SLIDE 4

Maximum Entropy Models

Feature function(s) = sufficient statistics = "strength" function(s)
Feature weights = natural parameters = distribution parameters

SLIDE 5

[Figure: plot of F(θ) against θ, marking the derivative of F with respect to θ (F'(θ)) and the optimum θ*]

What if you can't find the roots? Follow the derivative.

Set t = 0. Pick a starting value θ_t. Until converged:

  • 1. Get value y_t = F(θ_t)
  • 2. Get derivative g_t = F'(θ_t)
  • 3. Get scaling factor ρ_t
  • 4. Set θ_{t+1} = θ_t + ρ_t * g_t
  • 5. Set t += 1

[Figure annotations: iterates θ_0, θ_1, θ_2, θ_3 with values y_0, y_1, y_2, y_3 and derivatives g_0, g_1, g_2]
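This loop translates directly to code. A minimal Python sketch of the follow-the-derivative idea (the example function F, its derivative, and the fixed step size ρ are illustrative assumptions, not from the slides):

def follow_the_derivative(F, F_prime, theta0, rho=0.1, tol=1e-8, max_iters=10000):
    """Gradient ascent on F, following the numbered steps on this slide."""
    theta = theta0
    for t in range(max_iters):
        y = F(theta)                       # 1. get value y_t = F(theta_t), handy for monitoring
        g = F_prime(theta)                 # 2. get derivative g_t = F'(theta_t)
        theta_next = theta + rho * g       # 3-4. scale and step: theta_{t+1} = theta_t + rho_t * g_t
        if abs(theta_next - theta) < tol:  # "until converged"
            return theta_next
        theta = theta_next                 # 5. t += 1 happens via the loop
    return theta

# Example: F(theta) = -(theta - 2)^2 has its maximum at theta* = 2.
print(follow_the_derivative(lambda t: -(t - 2) ** 2, lambda t: -2 * (t - 2), theta0=0.0))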

SLIDE 6

[Figure: plot of F(θ) against θ, marking the derivative of F with respect to θ (F'(θ)) and the optimum θ*]

What if you can't find the roots? Follow the derivative.

Set t = 0

Pick a starting value θ_t

Until converged:

  • 1. Get value y_t = F(θ_t)
  • 2. Get derivative g_t = F'(θ_t)
  • 3. Get scaling factor ρ_t
  • 4. Set θ_{t+1} = θ_t + ρ_t * g_t
  • 5. Set t += 1

[Figure annotations: iterates θ_0, θ_1, θ_2, θ_3 with values y_0, y_1, y_2, y_3 and derivatives g_0, g_1, g_2]

SLIDE 7

Connections to Other Techniques

Log-Linear Models:

  • (Multinomial) logistic regression / softmax regression (as statistical regression)
  • Maximum Entropy models (MaxEnt) (based in information theory)
  • Generalized Linear Models (a form of)
  • Discriminative Naïve Bayes (viewed as)
  • Very shallow (sigmoidal) neural nets (to be cool today :) )

SLIDE 8

https://www.csee.umbc.edu/courses/undergraduate/473/f17/loglin-tutorial/

https://goo.gl/B23Rxo

SLIDE 9

Objective = Full Likelihood?

SLIDE 10

Objective = Full Likelihood?

These values can have very small magnitude → underflow. Differentiating this product could be a pain.

SLIDE 11

Logarithms

(0, 1] → (-∞, 0]
Products → Sums: log(ab) = log(a) + log(b), log(a/b) = log(a) − log(b)
Inverse of exp: log(exp(x)) = x
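A quick numerical illustration (my own small Python example, not from the slides) of why sums of logs are preferred over products of probabilities:

import math

probs = [1e-5] * 100                      # 100 small probabilities
print(math.prod(probs))                   # 0.0 -- the product underflows
print(sum(math.log(p) for p in probs))    # about -1151.3 -- the log-sum stays representable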

SLIDE 12

Log-Likelihood

Differentiating this becomes nicer (even though Z depends on θ). Wide range of (negative) numbers. Sums are more stable.

Products → Sums: log(ab) = log(a) + log(b), log(a/b) = log(a) − log(b)

SLIDE 13

Log-Likelihood

Differentiating this becomes nicer (even though Z depends on θ). Wide range of (negative) numbers. Sums are more stable.

Inverse of exp: log(exp(x)) = x, where p(y | x) ∝ exp(θ · f(x, y))

SLIDE 14

Log-Likelihood

Wide range of (negative) numbers. Sums are more stable.

Inverse of exp: log(exp(x)) = x, where p(y | x) ∝ exp(θ · f(x, y))

Differentiating this becomes nicer (even though Z depends on θ)

SLIDE 15

Log-Likelihood

Wide range of (negative) numbers. Sums are more stable. Differentiating this becomes nicer (even though Z depends on θ).
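Spelling out the objective these slides build toward, under the model form p(y | x) ∝ exp(θ · f(x, y)) shown above (a reconstruction; the exact on-slide rendering is not in the text):

\log p_\theta(y \mid x) = \theta \cdot f(x, y) - \log Z_\theta(x), \qquad Z_\theta(x) = \sum_{y'} \exp\big(\theta \cdot f(x, y')\big)

so the log-likelihood of the training data is \sum_i \theta \cdot f(x_i, y_i) - \sum_i \log Z_\theta(x_i).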

SLIDE 16

Expectations

Number of pieces of candy: 1, 2, 3, 4, 5, 6, each with probability 1/6.
Expectation: 1/6 * 1 + 1/6 * 2 + 1/6 * 3 + 1/6 * 4 + 1/6 * 5 + 1/6 * 6 = 3.5

SLIDE 17

Expectations

Number of pieces of candy: 1, 2, 3, 4, 5, 6, with probabilities 1/2, 1/10, 1/10, 1/10, 1/10, 1/10.
Expectation: 1/2 * 1 + 1/10 * 2 + 1/10 * 3 + 1/10 * 4 + 1/10 * 5 + 1/10 * 6 = 2.5

SLIDE 18

Expectations

Number of pieces of candy: 1, 2, 3, 4, 5, 6, with probabilities 1/2, 1/10, 1/10, 1/10, 1/10, 1/10.
Expectation: 1/2 * 1 + 1/10 * 2 + 1/10 * 3 + 1/10 * 4 + 1/10 * 5 + 1/10 * 6 = 2.5

SLIDE 19

Expectations

Number of pieces of candy: 1, 2, 3, 4, 5, 6, with probabilities 1/2, 1/10, 1/10, 1/10, 1/10, 1/10.
Expectation: 1/2 * 1 + 1/10 * 2 + 1/10 * 3 + 1/10 * 4 + 1/10 * 5 + 1/10 * 6 = 2.5
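The same computations in a few lines of Python (a small sketch of the two candy distributions above):

values = [1, 2, 3, 4, 5, 6]                    # number of pieces of candy
uniform = [1/6] * 6                            # slide 16's distribution
skewed = [1/2, 1/10, 1/10, 1/10, 1/10, 1/10]   # slides 17-19's distribution

def expectation(probs):
    return sum(p * v for p, v in zip(probs, values))

print(expectation(uniform))  # 3.5
print(expectation(skewed))   # 2.5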

SLIDE 20

Log-Likelihood Gradient

Each component k is the difference between:

SLIDE 21

Log-Likelihood Gradient

Each component k is the difference between: the total value of feature f_k in the training data

SLIDE 22

Log-Likelihood Gradient

Each component k is the difference between: the total value of feature f_k in the training data

and

the total value the current model p_θ expects feature f_k to have

SLIDE 23

Log-Likelihood Gradient

Each component k is the difference between: the total value of feature f_k in the training data

and

the total value the current model p_θ expects feature f_k to have

"moment matching"

SLIDE 24

https://www.csee.umbc.edu/courses/undergraduate/473/f17/loglin-tutorial/

https://goo.gl/B23Rxo Lesson 6

SLIDE 25

Log-Likelihood Gradient Derivation

SLIDE 26

Log-Likelihood Gradient Derivation

depends on θ

SLIDE 27

Log-Likelihood Gradient Derivation

depends on θ

SLIDE 28

Log-Likelihood Gradient Derivation

depends on θ

SLIDE 29

Log-Likelihood Gradient Derivation

SLIDE 30

Log-Likelihood Gradient Derivation

πœ– πœ–πœ„ log 𝑕(β„Ž πœ„ ) = πœ–π‘• πœ–β„Ž(πœ„) πœ–β„Ž πœ–πœ„ use the (calculus) chain rule

SLIDE 31

Log-Likelihood Gradient Derivation

πœ– πœ–πœ„ log 𝑕(β„Ž πœ„ ) = πœ–π‘• πœ–β„Ž(πœ„) πœ–β„Ž πœ–πœ„ use the (calculus) chain rule

scalar p(y’ | xi) vector of functions

SLIDE 32

Log-Likelihood Gradient Derivation

SLIDE 33

Log-Likelihood Derivative Derivation

πœ–πΊ πœ–πœ„π‘™ = ෍

𝑗

𝑔

𝑙 𝑦𝑗, 𝑧𝑗 βˆ’ ෍ 𝑗

෍

𝑧′

𝑔

𝑙 𝑦𝑗, 𝑧′ π‘ž 𝑧′ 𝑦𝑗)
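A minimal Python sketch of this gradient for a log-linear classifier (the data format, label set, and feature function f are illustrative assumptions; numpy is assumed for the vector arithmetic):

import numpy as np

def log_likelihood_gradient(theta, data, labels, f):
    """dF/dtheta_k = sum_i f_k(x_i, y_i) - sum_i sum_{y'} f_k(x_i, y') p(y' | x_i)."""
    grad = np.zeros_like(theta)
    for x, y in data:
        grad += f(x, y)                                    # observed feature values
        scores = np.array([theta @ f(x, yp) for yp in labels])
        p = np.exp(scores - scores.max())                  # softmax, shifted for stability
        p /= p.sum()                                       # p(y' | x) under the current model
        for yp, p_yp in zip(labels, p):
            grad -= p_yp * f(x, yp)                        # expected feature values
    return grad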

SLIDE 34

Do we want these to fully match? What does it mean if they do? What if we have missing values in our data?

SLIDE 35

Preventing Extreme Values

Naïve Bayes

Extreme values are 0 probabilities

SLIDE 36

Preventing Extreme Values

Naïve Bayes: extreme values are 0 probabilities
Log-linear models: extreme values are large θ values

SLIDE 37

Preventing Extreme Values

Naïve Bayes: extreme values are 0 probabilities
Log-linear models: extreme values are large θ values

regularization

SLIDE 38

(Squared) L2 Regularization
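The slide body is not captured in the text; as an assumption consistent with the gradient discussion above, the standard squared-L2 penalty subtracts a multiple of the squared weight norm from the log-likelihood:

F_{\text{reg}}(\theta) = \sum_i \log p_\theta(y_i \mid x_i) - \frac{\lambda}{2}\lVert\theta\rVert_2^2,
\qquad
\frac{\partial F_{\text{reg}}}{\partial \theta_k} = \frac{\partial F}{\partial \theta_k} - \lambda\,\theta_k

Larger λ pulls the weights toward zero, which is exactly the "preventing extreme θ values" motivation from the previous slides.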

SLIDE 39

https://www.csee.umbc.edu/courses/undergraduate/473/f17/loglin-tutorial/

https://goo.gl/B23Rxo Lesson 8

SLIDE 40

(More on) Connections to Other Machine Learning Techniques

SLIDE 41

Classification: Discriminative Naïve Bayes

[Diagram: Naïve Bayes graphical model relating the Label/class to the Observed features]

SLIDE 42

Classification: Discriminative Naïve Bayes

[Diagram: Naïve Bayes vs. Maxent/Logistic Regression, each relating the Label/class to the Observed features]

SLIDE 43

Multinomial Logistic Regression

SLIDE 44

Multinomial Logistic Regression

(in one dimension)

SLIDE 45

Multinomial Logistic Regression

SLIDE 46

Understanding Conditioning

Is this a good language model?

SLIDE 47

Understanding Conditioning

Is this a good language model?

SLIDE 48

Understanding Conditioning

Is this a good language model? (no)

SLIDE 49

Understanding Conditioning

Is this a good posterior classifier? (no)

SLIDE 50

https://www.csee.umbc.edu/courses/undergraduate/473/f17/loglin-tutorial/

https://goo.gl/B23Rxo Lesson 11

SLIDE 51

Connections to Other Techniques

Log-Linear Models:

  • (Multinomial) logistic regression / softmax regression (as statistical regression)
  • Maximum Entropy models (MaxEnt) (based in information theory)
  • Generalized Linear Models (a form of)
  • Discriminative Naïve Bayes (viewed as)
  • Very shallow (sigmoidal) neural nets (to be cool today :) )

SLIDE 52

Revisiting the SNAP Function

softmax

SLIDE 53

Revisiting the SNAP Function

softmax
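A numerically stable softmax in a few lines of Python (a sketch; subtracting the max score leaves the result unchanged because it cancels in the numerator and denominator):

import numpy as np

def softmax(scores):
    z = np.asarray(scores, dtype=float)
    e = np.exp(z - z.max())      # shift for numerical stability
    return e / e.sum()           # exponentiate and normalize

print(softmax([1.0, 2.0, 3.0]))  # approximately [0.09, 0.245, 0.665]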

SLIDE 54

N-gram Language Models

predict the next word given some context… w_{i-3} w_{i-2} w_{i-1} w_i

SLIDE 55

N-gram Language Models

predict the next word given some context…

π‘ž π‘₯𝑗 π‘₯π‘—βˆ’3,π‘₯π‘—βˆ’2, π‘₯π‘—βˆ’1) ∝ π‘‘π‘π‘£π‘œπ‘’(π‘₯π‘—βˆ’3, π‘₯π‘—βˆ’2, π‘₯π‘—βˆ’1, π‘₯𝑗)

wi-3 wi-2 wi wi-1 compute beliefs about what is likely…

SLIDE 56

N-gram Language Models

predict the next word given some context…

π‘ž π‘₯𝑗 π‘₯π‘—βˆ’3, π‘₯π‘—βˆ’2, π‘₯π‘—βˆ’1) ∝ π‘‘π‘π‘£π‘œπ‘’(π‘₯π‘—βˆ’3, π‘₯π‘—βˆ’2, π‘₯π‘—βˆ’1, π‘₯𝑗)

wi-3 wi-2 wi wi-1 compute beliefs about what is likely…
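A tiny count-based 4-gram model in Python matching the proportionality above (my sketch; unsmoothed, so unseen contexts get probability 0):

from collections import Counter, defaultdict

counts = defaultdict(Counter)   # counts[(w_{i-3}, w_{i-2}, w_{i-1})][w_i]

def train(tokens):
    for a, b, c, d in zip(tokens, tokens[1:], tokens[2:], tokens[3:]):
        counts[(a, b, c)][d] += 1

def p_next(context, w):
    """p(w | context) = count(context, w) / count(context)."""
    ctx = counts[tuple(context)]
    total = sum(ctx.values())
    return ctx[w] / total if total else 0.0

train("the cat sat on the mat . the cat sat on the rug .".split())
print(p_next(["cat", "sat", "on"], "the"))   # 1.0 in this toy corpus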

SLIDE 57

Maxent Language Models

predict the next word given some context… w_{i-3} w_{i-2} w_{i-1} w_i; compute beliefs about what is likely…

p(w_i | w_{i-3}, w_{i-2}, w_{i-1}) ∝ softmax(θ · f(w_{i-3}, w_{i-2}, w_{i-1}, w_i))

SLIDE 58

Neural Language Models

predict the next word given some context… w_{i-3} w_{i-2} w_{i-1} w_i; compute beliefs about what is likely…

p(w_i | w_{i-3}, w_{i-2}, w_{i-1}) ∝ softmax(θ · f(w_{i-3}, w_{i-2}, w_{i-1}, w_i))

can we learn the feature function(s)?

SLIDE 59

Neural Language Models

predict the next word given some context… w_{i-3} w_{i-2} w_{i-1} w_i; compute beliefs about what is likely…

p(w_i | w_{i-3}, w_{i-2}, w_{i-1}) ∝ softmax(θ_{w_i} · f(w_{i-3}, w_{i-2}, w_{i-1}))

can we learn the feature function(s) for just the context? can we learn word-specific weights (by type)?

SLIDE 60

Neural Language Models

predict the next word given some context… w_{i-3} w_{i-2} w_{i-1} w_i; compute beliefs about what is likely…

p(w_i | w_{i-3}, w_{i-2}, w_{i-1}) ∝ softmax(θ_{w_i} · f(w_{i-3}, w_{i-2}, w_{i-1}))

create/use "distributed representations"… e_{i-3} e_{i-2} e_{i-1} e_w

SLIDE 61

Neural Language Models

predict the next word given some context… w_{i-3} w_{i-2} w_{i-1} w_i; compute beliefs about what is likely…

p(w_i | w_{i-3}, w_{i-2}, w_{i-1}) ∝ softmax(θ_{w_i} · f(w_{i-3}, w_{i-2}, w_{i-1}))

create/use "distributed representations"… e_{i-3} e_{i-2} e_{i-1}; combine these representations… C = f

matrix-vector product

e_w

SLIDE 62

Neural Language Models

predict the next word given some context… w_{i-3} w_{i-2} w_{i-1} w_i; compute beliefs about what is likely…

p(w_i | w_{i-3}, w_{i-2}, w_{i-1}) ∝ softmax(θ_{w_i} · f(w_{i-3}, w_{i-2}, w_{i-1}))

create/use "distributed representations"… e_{i-3} e_{i-2} e_{i-1}; combine these representations… C = f

matrix-vector product

e_w, θ_{w_i}

SLIDE 63

Neural Language Models

predict the next word given some context… w_{i-3} w_{i-2} w_{i-1} w_i; compute beliefs about what is likely…

p(w_i | w_{i-3}, w_{i-2}, w_{i-1}) ∝ softmax(θ_{w_i} · f(w_{i-3}, w_{i-2}, w_{i-1}))

create/use "distributed representations"… e_{i-3} e_{i-2} e_{i-1}; combine these representations… C = f

matrix-vector product

e_w, θ_{w_i}
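Putting slides 58-63 together, a forward-pass sketch of a Bengio-style neural LM in Python (the sizes, the tanh nonlinearity, and the single combining matrix are assumptions for illustration, not necessarily the slides' exact architecture):

import numpy as np

rng = np.random.default_rng(0)
V, d, h = 10_000, 30, 100                # vocab size, embedding dim, hidden dim (assumed)
E = rng.normal(0, 0.01, (V, d))          # distributed representations e_w (one row per word)
W = rng.normal(0, 0.01, (h, 3 * d))      # combines the three context embeddings (matrix-vector product)
Theta = rng.normal(0, 0.01, (V, h))      # word-specific output weights theta_w (by type)

def p_next(context_ids):
    """p(w_i | w_{i-3}, w_{i-2}, w_{i-1}) = softmax_w( theta_w . g(context) )."""
    c = np.concatenate([E[i] for i in context_ids])   # look up e_{i-3}, e_{i-2}, e_{i-1}
    g = np.tanh(W @ c)                                # learned feature function of the context
    scores = Theta @ g                                # theta_w . g(context) for every word w
    e = np.exp(scores - scores.max())
    return e / e.sum()

probs = p_next([17, 42, 7])              # arbitrary word ids for the three context words
print(probs.shape, probs.sum())          # (10000,) 1.0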

SLIDE 64

"A Neural Probabilistic Language Model," Bengio et al. (2003)

Baselines

LM Name               N-gram   Params.       Test Ppl.
Interpolation         3        -             336
Kneser-Ney backoff    3        -             323
Kneser-Ney backoff    5        -             321
Class-based backoff   3        500 classes   312
Class-based backoff   5        500 classes   312

SLIDE 65

"A Neural Probabilistic Language Model," Bengio et al. (2003)

Baselines

LM Name               N-gram   Params.       Test Ppl.
Interpolation         3        -             336
Kneser-Ney backoff    3        -             323
Kneser-Ney backoff    5        -             321
Class-based backoff   3        500 classes   312
Class-based backoff   5        500 classes   312

NPLM

N-gram   Word Vector Dim.   Hidden Dim.   Mix with non-neural LM   Ppl.
5        60                 50            No                       268
5        60                 50            Yes                      257
5        30                 100           No                       276
5        30                 100           Yes                      252

SLIDE 66

"A Neural Probabilistic Language Model," Bengio et al. (2003)

Baselines

LM Name               N-gram   Params.       Test Ppl.
Interpolation         3        -             336
Kneser-Ney backoff    3        -             323
Kneser-Ney backoff    5        -             321
Class-based backoff   3        500 classes   312
Class-based backoff   5        500 classes   312

NPLM

N-gram   Word Vector Dim.   Hidden Dim.   Mix with non-neural LM   Ppl.
5        60                 50            No                       268
5        60                 50            Yes                      257
5        30                 100           No                       276
5        30                 100           Yes                      252

"we were not able to see signs of over-fitting (on the validation set), possibly because we ran only 5 epochs (over 3 weeks using 40 CPUs)" (Sect. 4.2)