

slide-1
SLIDE 1

Natural Language Processing (CSE 490U): Featurized Language Models

Noah Smith

© 2017 University of Washington
nasmith@cs.washington.edu

January 9, 2017

1 / 62

slide-2
SLIDE 2

What’s wrong with n-grams?

Data sparseness: most histories and most words will be seen only rarely (if at all).

2 / 62

slide-3
SLIDE 3

What’s wrong with n-grams?

Data sparseness: most histories and most words will be seen only rarely (if at all). Next central idea: teach histories and words how to share.

3 / 62

slide-4
SLIDE 4

Log-Linear Models: Definitions

We define a conditional log-linear model p(Y | X) as:

◮ Y is the set of events/outputs (for language modeling, V)
◮ X is the set of contexts/inputs (for n-gram language modeling, V^{n−1})
◮ φ : X × Y → R^d is a feature vector function
◮ w ∈ R^d are the model parameters

p_w(Y = y \mid X = x) = \frac{\exp(w \cdot \phi(x, y))}{\sum_{y' \in Y} \exp(w \cdot \phi(x, y'))}

4 / 62
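To make the definition concrete, here is a minimal NumPy sketch of the conditional log-linear distribution above. The feature function, candidate set, and weights are made-up placeholders for illustration, not anything from the slides.

```python
import numpy as np

def log_linear_probs(w, phi, x, Y):
    """p_w(y | x) for every y in Y, where phi(x, y) returns a feature vector."""
    scores = np.array([w @ phi(x, y) for y in Y])   # linear scores w . phi(x, y)
    scores -= scores.max()                          # shift for numerical stability
    expscores = np.exp(scores)                      # nonnegative
    return expscores / expscores.sum()              # divide by Z_w(x)

# Toy example with two hand-made (hypothetical) features:
Y = ["man", "woman", "hippopotamus"]

def phi(x, y):
    # phi_1: does y start with the same letter as the last history word?
    # phi_2: is y a long word?
    return np.array([float(y[0] == x[-1][0]), float(len(y) > 6)])

w = np.array([1.5, -0.5])
print(dict(zip(Y, log_linear_probs(w, phi, ("the", "man"), Y))))
```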

slide-5
SLIDE 5

Breaking It Down

p_w(Y = y \mid X = x) = \frac{\exp(w \cdot \phi(x, y))}{\sum_{y' \in Y} \exp(w \cdot \phi(x, y'))}

5 / 62

slide-6
SLIDE 6

Breaking It Down

p_w(Y = y \mid X = x) = \frac{\exp(w \cdot \phi(x, y))}{\sum_{y' \in Y} \exp(w \cdot \phi(x, y'))}

linear score: w \cdot \phi(x, y)

6 / 62

slide-7
SLIDE 7

Breaking It Down

p_w(Y = y \mid X = x) = \frac{\exp(w \cdot \phi(x, y))}{\sum_{y' \in Y} \exp(w \cdot \phi(x, y'))}

linear score: w \cdot \phi(x, y)
nonnegative: \exp(w \cdot \phi(x, y))

7 / 62

slide-8
SLIDE 8

Breaking It Down

p_w(Y = y \mid X = x) = \frac{\exp(w \cdot \phi(x, y))}{\sum_{y' \in Y} \exp(w \cdot \phi(x, y'))}

linear score: w \cdot \phi(x, y)
nonnegative: \exp(w \cdot \phi(x, y))
normalizer: \sum_{y' \in Y} \exp(w \cdot \phi(x, y')) = Z_w(x)

8 / 62

slide-9
SLIDE 9

Breaking It Down

p_w(Y = y \mid X = x) = \frac{\exp(w \cdot \phi(x, y))}{\sum_{y' \in Y} \exp(w \cdot \phi(x, y'))}

linear score: w \cdot \phi(x, y)
nonnegative: \exp(w \cdot \phi(x, y))
normalizer: \sum_{y' \in Y} \exp(w \cdot \phi(x, y')) = Z_w(x)

"Log-linear" comes from the fact that:

\log p_w(Y = y \mid X = x) = w \cdot \phi(x, y) - \underbrace{\log Z_w(x)}_{\text{constant in } y}

This is an instance of the family of generalized linear models.

9 / 62

slide-10
SLIDE 10

The Geometric View

Suppose we have instance x, Y = {y1, y2, y3, y4}, and there are only two features, φ1 and φ2.

[Figure: the four feature vectors φ(x, y1), . . . , φ(x, y4) plotted in the (φ1, φ2) plane.]

10 / 62

slide-11
SLIDE 11

The Geometric View

Suppose we have instance x, Y = {y1, y2, y3, y4}, and there are only two features, φ1 and φ2.

[Figure: the four feature vectors φ(x, y1), . . . , φ(x, y4) plotted in the (φ1, φ2) plane, together with the line]

w · φ = w1φ1 + w2φ2 = 0

11 / 62

slide-12
SLIDE 12

The Geometric View

Suppose we have instance x, Y = {y1, y2, y3, y4}, and there are only two features, φ1 and φ2.

[Figure: the four feature vectors φ(x, y1), . . . , φ(x, y4) in the (φ1, φ2) plane, with a weight vector w inducing the ranking below.]

p(y3 | x) > p(y1 | x) > p(y4 | x) > p(y2 | x)

12 / 62

slide-13
SLIDE 13

The Geometric View

Suppose we have instance x, Y = {y1, y2, y3, y4}, and there are only two features, φ1 and φ2.

[Figure: the four feature vectors φ(x, y1), . . . , φ(x, y4) plotted in the (φ1, φ2) plane.]

13 / 62

slide-14
SLIDE 14

The Geometric View

Suppose we have instance x, Y = {y1, y2, y3, y4}, and there are only two features, φ1 and φ2.

[Figure: the four feature vectors φ(x, y1), . . . , φ(x, y4) in the (φ1, φ2) plane, with a weight vector w inducing the ranking below.]

p(y3 | x) > p(y1 | x) > p(y2 | x) > p(y4 | x)

14 / 62

slide-15
SLIDE 15

Why Build Language Models This Way?

◮ Exploit features of histories for sharing of statistical strength and better smoothing (Lau et al., 1993)
◮ Condition the whole text on more interesting variables like the gender, age, or political affiliation of the author (Eisenstein et al., 2011)
◮ Interpretability!
  ◮ Each feature φ_k controls a factor of e^{w_k} in the probability.
  ◮ If w_k < 0 then φ_k makes the event less likely by a factor of 1/e^{w_k}.
  ◮ If w_k > 0 then φ_k makes the event more likely by a factor of e^{w_k}.
  ◮ If w_k = 0 then φ_k has no effect.

15 / 62

slide-16
SLIDE 16

Log-Linear n-Gram Models

p_w(X = x) = \prod_{j=1}^{\ell} p_w(X_j = x_j \mid X_{1:j-1} = x_{1:j-1})
           = \prod_{j=1}^{\ell} \frac{\exp(w \cdot \phi(x_{1:j-1}, x_j))}{Z_w(x_{1:j-1})}

(n-gram assumption)

           = \prod_{j=1}^{\ell} \frac{\exp(w \cdot \phi(x_{j-n+1:j-1}, x_j))}{Z_w(x_{j-n+1:j-1})}
           = \prod_{j=1}^{\ell} \frac{\exp(w \cdot \phi(h_j, x_j))}{Z_w(h_j)}

where \ell is the length of x and h_j abbreviates the truncated history x_{j-n+1:j-1}.

16 / 62
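A minimal sketch of this factorization: chain the per-position log-linear conditionals to score a whole sentence. The feature function phi, vocabulary V, and the "<s>" history padding are placeholders I introduce for the example; the slides leave the padding scheme unspecified.

```python
import numpy as np

def sentence_log_prob(w, phi, sentence, V, n=2):
    """log p_w(x) = sum_j [ w . phi(h_j, x_j) - log Z_w(h_j) ] under an n-gram
    log-linear model.  phi, V, and the padding symbol are illustrative."""
    padded = ["<s>"] * (n - 1) + list(sentence)         # hypothetical history padding
    total = 0.0
    for j in range(n - 1, len(padded)):
        h, x = tuple(padded[j - n + 1:j]), padded[j]
        scores = np.array([w @ phi(h, v) for v in V])   # one score per word in V
        log_Z = np.logaddexp.reduce(scores)             # log Z_w(h_j), computed stably
        total += w @ phi(h, x) - log_Z
    return total
```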

slide-17
SLIDE 17

Example

History: "The man who knew too"; candidate next words: much, many, little, few, . . . , hippopotamus

17 / 62

slide-18
SLIDE 18

What Features in φ(Xj−n+1:j−1, Xj)?

18 / 62

slide-19
SLIDE 19

What Features in φ(Xj−n+1:j−1, Xj)?

◮ Traditional n-gram features: “Xj−1 = the ∧ Xj = man”

19 / 62

slide-20
SLIDE 20

What Features in φ(Xj−n+1:j−1, Xj)?

◮ Traditional n-gram features: “Xj−1 = the ∧ Xj = man”
◮ “Gappy” n-grams: “Xj−2 = the ∧ Xj = man”

20 / 62

slide-21
SLIDE 21

What Features in φ(Xj−n+1:j−1, Xj)?

◮ Traditional n-gram features: “Xj−1 = the ∧ Xj = man”
◮ “Gappy” n-grams: “Xj−2 = the ∧ Xj = man”
◮ Spelling features: “Xj’s first character is capitalized”

21 / 62

slide-22
SLIDE 22

What Features in φ(Xj−n+1:j−1, Xj)?

◮ Traditional n-gram features: “Xj−1 = the ∧ Xj = man”
◮ “Gappy” n-grams: “Xj−2 = the ∧ Xj = man”
◮ Spelling features: “Xj’s first character is capitalized”
◮ Class features: “Xj is a member of class 132”

22 / 62

slide-23
SLIDE 23

What Features in φ(Xj−n+1:j−1, Xj)?

◮ Traditional n-gram features: “Xj−1 = the ∧ Xj = man”
◮ “Gappy” n-grams: “Xj−2 = the ∧ Xj = man”
◮ Spelling features: “Xj’s first character is capitalized”
◮ Class features: “Xj is a member of class 132”
◮ Gazetteer features: “Xj is listed as a geographic place name”

23 / 62

slide-24
SLIDE 24

What Features in φ(Xj−n+1:j−1, Xj)?

◮ Traditional n-gram features: “Xj−1 = the ∧ Xj = man”
◮ “Gappy” n-grams: “Xj−2 = the ∧ Xj = man”
◮ Spelling features: “Xj’s first character is capitalized”
◮ Class features: “Xj is a member of class 132”
◮ Gazetteer features: “Xj is listed as a geographic place name”

You can define any features you want!

◮ Too many features, and your model will overfit
◮ Too few (good) features, and your model will not learn

24 / 62

slide-25
SLIDE 25

What Features in φ(Xj−n+1:j−1, Xj)?

◮ Traditional n-gram features: “Xj−1 = the ∧ Xj = man”
◮ “Gappy” n-grams: “Xj−2 = the ∧ Xj = man”
◮ Spelling features: “Xj’s first character is capitalized”
◮ Class features: “Xj is a member of class 132”
◮ Gazetteer features: “Xj is listed as a geographic place name”

You can define any features you want!

◮ Too many features, and your model will overfit
  ◮ “Feature selection” methods, e.g., ignoring features with very low counts, can help.
◮ Too few (good) features, and your model will not learn

25 / 62
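As an illustration of the feature templates above, here is a small sketch of a feature function φ(h, v) that emits sparse binary features. The feature names, the word-class lookup, and the gazetteer are invented for the example.

```python
def phi(h, v, word_class=None, gazetteer=frozenset()):
    """Sparse feature map for history h (a tuple of words) and candidate word v.
    Returns {feature name: value}; absent features are implicitly 0."""
    feats = {}
    feats[f"bigram:{h[-1]}_{v}"] = 1.0                 # traditional n-gram feature
    if len(h) >= 2:
        feats[f"gappy:{h[-2]}_*_{v}"] = 1.0            # "gappy" n-gram (skips one word)
    if v[:1].isupper():
        feats["spelling:capitalized"] = 1.0            # spelling feature
    if word_class is not None:
        feats[f"class:{word_class.get(v)}"] = 1.0      # word-class feature
    if v in gazetteer:
        feats["gazetteer:place_name"] = 1.0            # gazetteer feature
    return feats

print(phi(("knew", "too"), "much"))
```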

slide-26
SLIDE 26

“Feature Engineering”

◮ Many advances in NLP (not just language modeling) have come from careful design of features.

26 / 62

slide-27
SLIDE 27

“Feature Engineering”

◮ Many advances in NLP (not just language modeling) have come from careful design of features.

◮ Sometimes “feature engineering” is used pejoratively.

27 / 62

slide-28
SLIDE 28

“Feature Engineering”

◮ Many advances in NLP (not just language modeling) have come from careful design of features.
◮ Sometimes “feature engineering” is used pejoratively.
◮ Some people would rather not spend their time on it!

28 / 62

slide-29
SLIDE 29

“Feature Engineering”

◮ Many advances in NLP (not just language modeling) have come from careful design of features.
◮ Sometimes “feature engineering” is used pejoratively.
◮ Some people would rather not spend their time on it!
◮ There is some work on automatically inducing features (Della Pietra et al., 1997).

29 / 62

slide-30
SLIDE 30

“Feature Engineering”

◮ Many advances in NLP (not just language modeling) have come from careful design of features.
◮ Sometimes “feature engineering” is used pejoratively.
◮ Some people would rather not spend their time on it!
◮ There is some work on automatically inducing features (Della Pietra et al., 1997).
◮ More recent work in neural networks can be seen as discovering features (instead of engineering them).

30 / 62

slide-31
SLIDE 31

“Feature Engineering”

◮ Many advances in NLP (not just language modeling) have come from careful design of features.
◮ Sometimes “feature engineering” is used pejoratively.
◮ Some people would rather not spend their time on it!
◮ There is some work on automatically inducing features (Della Pietra et al., 1997).
◮ More recent work in neural networks can be seen as discovering features (instead of engineering them).
◮ But in much of NLP, there’s a strong preference for interpretable features.

31 / 62

slide-32
SLIDE 32

How to Estimate w?

n-gram:
  p_\theta(x) = \prod_{j=1}^{\ell} \theta_{x_j \mid h_j}
  Parameters: \theta_{v \mid h}, \forall v \in V, h \in (V \cup \{\})^{n-1}
  MLE: \hat{\theta}_{v \mid h} = \frac{c(hv)}{c(h)}

log-linear n-gram:
  p_w(x) = \prod_{j=1}^{\ell} \frac{\exp(w \cdot \phi(h_j, x_j))}{Z_w(h_j)}
  Parameters: w_k, \forall k \in \{1, \ldots, d\}
  MLE: no closed form

32 / 62

slide-33
SLIDE 33

MLE for w

◮ Let training data consist of \{(h_i, x_i)\}_{i=1}^{N}.

33 / 62

slide-34
SLIDE 34

MLE for w

◮ Let training data consist of \{(h_i, x_i)\}_{i=1}^{N}.
◮ Maximum likelihood estimation is:

\max_{w \in \mathbb{R}^d} \sum_{i=1}^{N} \log p_w(x_i \mid h_i)
  = \max_{w \in \mathbb{R}^d} \sum_{i=1}^{N} \log \frac{\exp(w \cdot \phi(h_i, x_i))}{Z_w(h_i)}
  = \max_{w \in \mathbb{R}^d} \sum_{i=1}^{N} \bigg( w \cdot \phi(h_i, x_i) - \log \underbrace{\sum_{v \in V} \exp(w \cdot \phi(h_i, v))}_{Z_w(h_i)} \bigg)

34 / 62

slide-35
SLIDE 35

MLE for w

◮ Let training data consist of \{(h_i, x_i)\}_{i=1}^{N}.
◮ Maximum likelihood estimation is:

\max_{w \in \mathbb{R}^d} \sum_{i=1}^{N} \log p_w(x_i \mid h_i)
  = \max_{w \in \mathbb{R}^d} \sum_{i=1}^{N} \log \frac{\exp(w \cdot \phi(h_i, x_i))}{Z_w(h_i)}
  = \max_{w \in \mathbb{R}^d} \sum_{i=1}^{N} \bigg( w \cdot \phi(h_i, x_i) - \log \underbrace{\sum_{v \in V} \exp(w \cdot \phi(h_i, v))}_{Z_w(h_i)} \bigg)

◮ This is concave in w.

35 / 62

slide-36
SLIDE 36

MLE for w

◮ Let training data consist of \{(h_i, x_i)\}_{i=1}^{N}.
◮ Maximum likelihood estimation is:

\max_{w \in \mathbb{R}^d} \sum_{i=1}^{N} \log p_w(x_i \mid h_i)
  = \max_{w \in \mathbb{R}^d} \sum_{i=1}^{N} \log \frac{\exp(w \cdot \phi(h_i, x_i))}{Z_w(h_i)}
  = \max_{w \in \mathbb{R}^d} \sum_{i=1}^{N} \bigg( w \cdot \phi(h_i, x_i) - \log \underbrace{\sum_{v \in V} \exp(w \cdot \phi(h_i, v))}_{Z_w(h_i)} \bigg)

◮ This is concave in w.
◮ Z_w(h_i) involves a sum over V (one term per word in the vocabulary).

36 / 62

slide-37
SLIDE 37

MLE for w

\max_{w \in \mathbb{R}^d} \sum_{i=1}^{N} \underbrace{w \cdot \phi(h_i, x_i) - \log Z_w(h_i)}_{f_i(w)}

37 / 62

slide-38
SLIDE 38

MLE for w

\max_{w \in \mathbb{R}^d} \sum_{i=1}^{N} \underbrace{w \cdot \phi(h_i, x_i) - \log Z_w(h_i)}_{f_i(w)}

Hope/fear view: for each instance i,

◮ increase the score of the correct output x_i:  \mathrm{score}(x_i) = w \cdot \phi(h_i, x_i)
◮ decrease the “softened max” score overall:  \log \sum_{v \in V} \exp \mathrm{score}(v)

38 / 62

slide-39
SLIDE 39

MLE for w

\max_{w \in \mathbb{R}^d} \sum_{i=1}^{N} \underbrace{w \cdot \phi(h_i, x_i) - \log Z_w(h_i)}_{f_i(w)}

Gradient view:

\nabla_w f_i = \underbrace{\phi(h_i, x_i)}_{\text{observed features}} - \underbrace{\sum_{v \in V} p_w(v \mid h_i) \cdot \phi(h_i, v)}_{\text{expected features}}

Setting this to zero means getting the model’s expectations to match empirical observations.

39 / 62
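A minimal NumPy sketch of this gradient for a single training instance: observed features minus expected features under the model. The feature function phi, the vocabulary V, and the dense feature vectors are placeholders for illustration.

```python
import numpy as np

def grad_f_i(w, phi, h_i, x_i, V):
    """Gradient of f_i(w) = w . phi(h_i, x_i) - log Z_w(h_i):
    observed features minus expected features under p_w(. | h_i)."""
    feats = np.stack([phi(h_i, v) for v in V])         # |V| x d matrix of feature vectors
    scores = feats @ w
    p = np.exp(scores - np.logaddexp.reduce(scores))   # p_w(v | h_i) for each v in V
    observed = phi(h_i, x_i)
    expected = p @ feats                               # sum_v p_w(v | h_i) * phi(h_i, v)
    return observed - expected
```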

slide-40
SLIDE 40

MLE for w: Algorithms

◮ Batch methods (L-BFGS is popular)
◮ Stochastic gradient ascent/descent more common today, especially with special tricks for adapting the step size over time

◮ Many specialized methods (e.g., “iterative scaling”)

40 / 62

slide-41
SLIDE 41

Stochastic Gradient Descent

Goal: minimize \sum_{i=1}^{N} f_i(w) with respect to w.

Input: initial value w, number of epochs T, learning rate α

For t ∈ {1, . . . , T}:
  ◮ Choose a random permutation π of {1, . . . , N}.
  ◮ For i ∈ {1, . . . , N}:
      w ← w − α · ∇_w f_{π(i)}(w)

Output: w

41 / 62
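The pseudocode above translates almost line for line into Python. This sketch assumes a gradient function grad(w, h_i, x_i) and NumPy-array weights; the learning rate and epoch count are placeholder values, not recommendations from the slides.

```python
import random

def sgd(w, data, grad, alpha=0.1, epochs=5):
    """Stochastic gradient descent, following the slide's pseudocode.
    `data` is a list of (h_i, x_i) pairs; `grad(w, h_i, x_i)` returns the
    gradient of f_i at w.  `w` is assumed to be a NumPy array."""
    for _ in range(epochs):                        # t in {1, ..., T}
        order = list(range(len(data)))
        random.shuffle(order)                      # random permutation pi of {1, ..., N}
        for i in order:
            h_i, x_i = data[i]
            w = w - alpha * grad(w, h_i, x_i)      # w <- w - alpha * grad f_{pi(i)}(w)
    return w
```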

slide-42
SLIDE 42

Avoiding Overfitting

Maximum likelihood estimation:

\max_{w \in \mathbb{R}^d} \sum_{i=1}^{N} w \cdot \phi(h_i, x_i) - \log Z_w(h_i)

◮ If φ_j(h, x) is (almost) always positive, we can always increase the objective (a little bit) by increasing w_j toward +∞.

42 / 62

slide-43
SLIDE 43

Avoiding Overfitting

Maximum likelihood estimation:

\max_{w \in \mathbb{R}^d} \sum_{i=1}^{N} w \cdot \phi(h_i, x_i) - \log Z_w(h_i)

◮ If φ_j(h, x) is (almost) always positive, we can always increase the objective (a little bit) by increasing w_j toward +∞.

Standard solution is to add a regularization term:

\max_{w \in \mathbb{R}^d} \sum_{i=1}^{N} \bigg( w \cdot \phi(h_i, x_i) - \log \sum_{v \in V} \exp(w \cdot \phi(h_i, v)) \bigg) - \lambda \|w\|_p^p

where λ > 0 is a hyperparameter and p = 2 or 1.

43 / 62
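A minimal sketch of the regularized objective for p = 2, written as a function of w; the value of λ and the placeholders phi and V are illustrative, and the penalty is applied once to the whole objective as in the formula above.

```python
import numpy as np

def regularized_objective(w, data, phi, V, lam=0.1):
    """sum_i [ w . phi(h_i, x_i) - log Z_w(h_i) ]  -  lam * ||w||_2^2   (p = 2)."""
    total = 0.0
    for h_i, x_i in data:
        scores = np.array([w @ phi(h_i, v) for v in V])
        total += w @ phi(h_i, x_i) - np.logaddexp.reduce(scores)   # log-likelihood term
    return total - lam * np.sum(w ** 2)                            # L2 penalty
```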

slide-44
SLIDE 44

MLE for w

If we had more time, we’d study this problem more carefully! Here’s what you must remember:

◮ There is no closed form; you must use a numerical optimization algorithm like stochastic gradient descent.
◮ Log-linear models are powerful but expensive (Z_w(h_i)).
◮ Regularization is very important; we don’t actually do MLE.
  ◮ Just like for n-gram models! Only even more so, since log-linear models are even more expressive.

44 / 62

slide-45
SLIDE 45

To-Do List

◮ Online quiz: due 11:59 pm Tuesday
◮ Read: Collins (2011) §2
◮ A1, out today, due January 18

45 / 62

slide-46
SLIDE 46

References I

Galen Andrew and Jianfeng Gao. Scalable training of ℓ1-regularized log-linear models. In Proc. of ICML, 2007.

Adam Berger, Stephen Della Pietra, and Vincent Della Pietra. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1):39–71, 1996.

Michael Collins. Log-linear models, MEMMs, and CRFs, 2011. URL http://www.cs.columbia.edu/~mcollins/crf.pdf.

Stephen Della Pietra, Vincent Della Pietra, and John Lafferty. Inducing features of random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(4):380–393, 1997.

Jacob Eisenstein, Amr Ahmed, and Eric P. Xing. Sparse additive generative models of text. In Proc. of ICML, 2011.

Joshua Goodman. Classes for fast maximum entropy training. In Proc. of ICASSP, 2001.

John Langford, Lihong Li, and Tong Zhang. Sparse online learning via truncated gradient. In NIPS, 2009.

Raymond Lau, Ronald Rosenfeld, and Salim Roukos. Trigger-based language models: A maximum entropy approach. In Proc. of ICASSP, 1993.

Roni Rosenfeld. Adaptive Statistical Language Modeling: A Maximum Entropy Approach. PhD thesis, Carnegie Mellon University, 1994.

Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), pages 267–288, 1996.

46 / 62

slide-47
SLIDE 47

Extras

47 / 62

slide-48
SLIDE 48

Special Case: Logistic Regression

Consider the case where Y = {+1, −1}.

p_w(Y = +1 \mid x) = \frac{\exp(w \cdot \phi(x, +1))}{\exp(w \cdot \phi(x, +1)) + \exp(w \cdot \phi(x, -1))}

48 / 62

slide-49
SLIDE 49

Special Case: Logistic Regression

Consider the case where Y = {+1, −1}.

p_w(Y = +1 \mid x) = \frac{\exp(w \cdot \phi(x, +1))}{\exp(w \cdot \phi(x, +1)) + \exp(w \cdot \phi(x, -1))}
                   = \mathrm{logit}^{-1}\big(w \cdot (\phi(x, +1) - \phi(x, -1))\big)

49 / 62

slide-50
SLIDE 50

Special Case: Logistic Regression

Consider the case where Y = {+1, −1}.

p_w(Y = +1 \mid x) = \frac{\exp(w \cdot \phi(x, +1))}{\exp(w \cdot \phi(x, +1)) + \exp(w \cdot \phi(x, -1))}
                   = \mathrm{logit}^{-1}\big(w \cdot (\phi(x, +1) - \phi(x, -1))\big)
                   = \mathrm{logit}^{-1}(w \cdot f(x))   (notation change: f(x) = \phi(x, +1) - \phi(x, -1))

50 / 62

slide-51
SLIDE 51

Special Case: Logistic Regression

Consider the case where Y = {+1, −1}.

p_w(Y = +1 \mid x) = \frac{\exp(w \cdot \phi(x, +1))}{\exp(w \cdot \phi(x, +1)) + \exp(w \cdot \phi(x, -1))}
                   = \mathrm{logit}^{-1}\big(w \cdot (\phi(x, +1) - \phi(x, -1))\big)
                   = \mathrm{logit}^{-1}(w \cdot f(x))   (notation change: f(x) = \phi(x, +1) - \phi(x, -1))

◮ Should be familiar, if you know about logistic regression.

51 / 62

slide-52
SLIDE 52

Special Case: Logistic Regression

Consider the case where Y = {+1, −1}.

p_w(Y = +1 \mid x) = \frac{\exp(w \cdot \phi(x, +1))}{\exp(w \cdot \phi(x, +1)) + \exp(w \cdot \phi(x, -1))}
                   = \mathrm{logit}^{-1}\big(w \cdot (\phi(x, +1) - \phi(x, -1))\big)
                   = \mathrm{logit}^{-1}(w \cdot f(x))   (notation change: f(x) = \phi(x, +1) - \phi(x, -1))

◮ Should be familiar, if you know about logistic regression.
◮ When Y = {1, 2, . . . , k}, log-linear models are often called multinomial logistic regression.

52 / 62
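A small sketch of this binary special case: the inverse logit is the logistic sigmoid of the score w · f(x). The weight vector and the two-dimensional f(x) below are made-up values for illustration.

```python
import numpy as np

def p_positive(w, f_x):
    """p_w(Y = +1 | x) = logit^{-1}(w . f(x)), i.e., the logistic sigmoid of the score."""
    return 1.0 / (1.0 + np.exp(-(w @ f_x)))

# Hypothetical example: f(x) = phi(x, +1) - phi(x, -1) for some feature design.
w = np.array([0.8, -1.2])
f_x = np.array([1.0, 0.5])
print(p_positive(w, f_x))   # probability of the +1 label
```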

slide-53
SLIDE 53

Special Case: Classic n-Gram Language Model

Consider an n-gram language model, where X = V^{n−1} and Y = V. Let:

◮ d = 1
◮ φ_1(h, v) = \log c(hv)
◮ w_1 = 1
◮ Z(h) = \sum_{v' \in V} \exp \log c(hv') = \sum_{v' \in V} c(hv') = c(h)

53 / 62

slide-54
SLIDE 54

Special Case: Classic n-Gram Language Model

Consider an n-gram language model, where X = V^{n−1} and Y = V. Let:

◮ d = 1
◮ φ_1(h, v) = \log c(hv)
◮ w_1 = 1
◮ Z(h) = \sum_{v' \in V} \exp \log c(hv') = \sum_{v' \in V} c(hv') = c(h)

Alternately:

◮ d = |V|^n
◮ φ_{\tilde{h},\tilde{v}}(h, v) = \begin{cases} 1 & \text{if } h = \tilde{h} \wedge v = \tilde{v} \\ 0 & \text{otherwise} \end{cases}
◮ w_{\tilde{h},\tilde{v}} = \log \frac{c(\tilde{h}\tilde{v})}{c(\tilde{h})}
◮ Z(h) = 1

54 / 62
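A quick numerical check of the first construction: with a single feature φ_1(h, v) = log c(hv) and w_1 = 1, the log-linear probability reduces to the relative-frequency estimate c(hv)/c(h). The toy bigram counts below are invented.

```python
import math
from collections import Counter

# Hypothetical bigram counts c(hv) for history h = ("too",).
counts = Counter({("too", "much"): 8, ("too", "many"): 4, ("too", "few"): 2})
h = ("too",)
V = ["much", "many", "few"]

def p(v):
    # exp(w1 * phi1(h, v)) / Z(h) with phi1 = log c(hv) and w1 = 1, so Z(h) = c(h).
    Z = sum(math.exp(math.log(counts[h + (u,)])) for u in V)
    return math.exp(math.log(counts[h + (v,)])) / Z

for v in V:
    print(v, p(v), counts[h + (v,)] / sum(counts[h + (u,)] for u in V))  # the two columns match
```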

slide-55
SLIDE 55

ℓ1 Regularization

This case warrants a little more discussion:

\max_{w \in \mathbb{R}^d} \sum_{i=1}^{N} \bigg( w \cdot \phi(h_i, x_i) - \log \sum_{v \in V} \exp(w \cdot \phi(h_i, v)) \bigg) - \lambda \|w\|_1

Note that: \|w\|_1 = \sum_{j=1}^{d} |w_j|

◮ This results in sparsity (i.e., many wj = 0).

55 / 62

slide-56
SLIDE 56

ℓ1 Regularization

This case warrants a little more discussion:

\max_{w \in \mathbb{R}^d} \sum_{i=1}^{N} \bigg( w \cdot \phi(h_i, x_i) - \log \sum_{v \in V} \exp(w \cdot \phi(h_i, v)) \bigg) - \lambda \|w\|_1

Note that: \|w\|_1 = \sum_{j=1}^{d} |w_j|

◮ This results in sparsity (i.e., many w_j = 0).
◮ Many have argued that this is a good thing (Tibshirani, 1996); it’s a kind of feature selection.

56 / 62

slide-57
SLIDE 57

ℓ1 Regularization

This case warrants a little more discussion:

\max_{w \in \mathbb{R}^d} \sum_{i=1}^{N} \bigg( w \cdot \phi(h_i, x_i) - \log \sum_{v \in V} \exp(w \cdot \phi(h_i, v)) \bigg) - \lambda \|w\|_1

Note that: \|w\|_1 = \sum_{j=1}^{d} |w_j|

◮ This results in sparsity (i.e., many w_j = 0).
◮ Many have argued that this is a good thing (Tibshirani, 1996); it’s a kind of feature selection.
◮ Do not confuse it with data sparseness (a problem to be overcome)!

57 / 62

slide-58
SLIDE 58

ℓ1 Regularization

This case warrants a little more discussion:

\max_{w \in \mathbb{R}^d} \sum_{i=1}^{N} \bigg( w \cdot \phi(h_i, x_i) - \log \sum_{v \in V} \exp(w \cdot \phi(h_i, v)) \bigg) - \lambda \|w\|_1

Note that: \|w\|_1 = \sum_{j=1}^{d} |w_j|

◮ This results in sparsity (i.e., many w_j = 0).
◮ Many have argued that this is a good thing (Tibshirani, 1996); it’s a kind of feature selection.
◮ Do not confuse it with data sparseness (a problem to be overcome)!
◮ This is not differentiable at w_j = 0.

58 / 62

slide-59
SLIDE 59

ℓ1 Regularization

This case warrants a little more discussion:

\max_{w \in \mathbb{R}^d} \sum_{i=1}^{N} \bigg( w \cdot \phi(h_i, x_i) - \log \sum_{v \in V} \exp(w \cdot \phi(h_i, v)) \bigg) - \lambda \|w\|_1

Note that: \|w\|_1 = \sum_{j=1}^{d} |w_j|

◮ This results in sparsity (i.e., many w_j = 0).
◮ Many have argued that this is a good thing (Tibshirani, 1996); it’s a kind of feature selection.
◮ Do not confuse it with data sparseness (a problem to be overcome)!
◮ This is not differentiable at w_j = 0.
◮ Optimization: special solutions for batch (e.g., Andrew and Gao, 2007) and stochastic (e.g., Langford et al., 2009) settings.

59 / 62
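This is not the slide's algorithm, but one common way to handle the nondifferentiability in a stochastic setting, related in spirit to the truncated-gradient work cited above: take a gradient step on the smooth part of the objective, then apply a soft-thresholding (proximal) step for the ℓ1 penalty, which is what drives weights exactly to zero. The step size and λ are placeholders.

```python
import numpy as np

def soft_threshold(w, tau):
    """Proximal operator of tau * ||w||_1: shrink each coordinate toward 0,
    setting it exactly to 0 when its magnitude is below tau (this creates sparsity)."""
    return np.sign(w) * np.maximum(np.abs(w) - tau, 0.0)

def proximal_sgd_step(w, grad_smooth, alpha, lam):
    """One update: gradient descent on the smooth (negative log-likelihood) part,
    then soft-threshold to account for the lambda * ||w||_1 term."""
    return soft_threshold(w - alpha * grad_smooth, alpha * lam)
```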

slide-60
SLIDE 60

Maximum Entropy

Consider a distribution p over events in X. The Shannon entropy (in bits) of p is defined as:

H(p) = -\sum_{x \in X} p(X = x) \cdot \begin{cases} 0 & \text{if } p(X = x) = 0 \\ \log_2 p(X = x) & \text{otherwise} \end{cases}

This is a measure of “randomness”; entropy is zero when p is deterministic and \log_2 |X| when p is uniform.

Maximum entropy principle: among distributions that fit the data, pick the one with the greatest entropy.

60 / 62
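A small sketch of the entropy computation, with the 0 · log 0 = 0 convention handled by skipping zero-probability events; the example distributions are arbitrary.

```python
import math

def entropy_bits(p):
    """Shannon entropy H(p) in bits for a distribution given as a list of probabilities."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)   # skip p(x) = 0 terms (0 log 0 = 0)

print(entropy_bits([1.0, 0.0, 0.0]))           # deterministic: 0 bits
print(entropy_bits([0.25, 0.25, 0.25, 0.25]))  # uniform over 4 events: log2(4) = 2 bits
```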

slide-61
SLIDE 61

Maximum Entropy

If “fit the data” is taken to mean

\forall k \in \{1, \ldots, d\}, \quad E_p[\phi_k] = \tilde{E}[\phi_k],

then the MLE of the log-linear family with features φ is the maximum entropy solution. This is why log-linear models are sometimes called “maxent” models (e.g., Berger et al., 1996).

61 / 62

slide-62
SLIDE 62

“Whole Sentence” Log-Linear Models

(Rosenfeld, 1994)

Instead of a log-linear model for each word-given-history, define a single log-linear model over the event space V†:

p_w(x) = \frac{\exp(w \cdot \phi(x))}{Z_w}

◮ Any feature of the sentence could be included in this model!
◮ Z_w is deceptively simple-looking!

Z_w = \sum_{x \in V^{\dagger}} \exp(w \cdot \phi(x))

62 / 62