

Sequential Data Modeling – Conditional Random Fields


Graham Neubig Nara Institute of Science and Technology (NAIST)


Prediction Problems

Given x, predict y


  • A book review: “Oh, man I love this book!” / “This book is so boring...”
    Is it positive? yes / no
    → Binary Prediction (2 choices)

  • A tweet: “On the way to the park!” / “公園に行くなう!” (“On the way to the park now!”)
    Its language? English / Japanese
    → Multi-class Prediction (several choices)

  • A sentence: “I read a book”
    Its parts-of-speech? “N VBD DET NN”
    → Structured Prediction (millions of choices)


Logistic Regression


Example we will use:

  • Given an introductory sentence from Wikipedia
  • Predict whether the article is about a person
  • This is binary classification (of course!)

Given: “Gonso was a Sanron sect priest (754-827) in the late Nara and early Heian periods.”
Predict: Yes!

Given: “Shichikuzan Chigogataki Fudomyoo is a historical site located at Magura, Maizuru City, Kyoto Prefecture.”
Predict: No!


Review: Linear Prediction Model

  • Each element that helps us predict is a feature
  • Each feature has a weight, positive if it indicates “yes”, and negative if it indicates “no”

  • For a new example, sum the weights
  • If the sum is at least 0: “yes”, otherwise: “no”

Example features and weights:

  contains “priest”            w = 2
  contains “(<#>-<#>)”         w = 1
  contains “site”              w = -3
  contains “Kyoto Prefecture”  w = -1

“Kuya (903-972) was a priest born in Kyoto Prefecture.”

2 + -1 + 1 = 2, so the answer is “yes”


Review: Mathematical Formulation

y = sign(w⋅ϕ(x)) = sign(∑i=1…I wi⋅ϕi(x))

  • x: the input
  • φ(x): vector of feature functions {φ1(x), φ2(x), …, φI(x)}
  • w: the weight vector {w1, w2, …, wI}
  • y: the prediction, +1 if “yes”, -1 if “no”
  • (sign(v) is +1 if v >= 0, -1 otherwise)
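As a minimal runnable sketch of this prediction rule (the sparse feature map and the helper names are illustrative assumptions, not the lecture's code):

from collections import defaultdict

def create_features(x):
    """Unigram features: phi counts each word in x."""
    phi = defaultdict(float)
    for word in x.split():
        phi["unigram " + word] += 1
    return phi

def predict_one(w, phi):
    """y = sign(w . phi(x)): +1 if the weighted sum is >= 0, else -1."""
    score = sum(w.get(name, 0.0) * value for name, value in phi.items())
    return 1 if score >= 0 else -1

w = {"unigram priest": 2, "unigram site": -3}
print(predict_one(w, create_features("Kuya was a priest")))  # -> 1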


Perceptron and Probabilities

  • Sometimes we want the probability P(y∣x)
    • Estimating confidence in predictions
    • Combining with other systems
  • However, the perceptron only gives us a prediction y = sign(w⋅ϕ(x))

In other words:

  P(y=1∣x) = 1 if w⋅ϕ(x) ≥ 0
  P(y=1∣x) = 0 if w⋅ϕ(x) < 0

[Plot: p(y|x) as a function of w⋅ϕ(x): a step that jumps from 0 to 1 at w⋅ϕ(x) = 0]


The Logistic Function

  • The logistic function is a “softened” version of the function used in the perceptron

[Plots: p(y|x) vs. w⋅ϕ(x) — Perceptron: step function; Logistic function: smooth S-shaped curve rising from 0 to 1]

P(y=1∣x) = e^(w⋅ϕ(x)) / (1 + e^(w⋅ϕ(x)))

  • Can account for uncertainty
  • Differentiable
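A small runnable sketch of this function (the numerically stable form and the names are my own additions):

import math

def logistic(score):
    """P(y=1|x) = e^s / (1 + e^s), computed without overflow."""
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    e = math.exp(score)
    return e / (1.0 + e)

print(logistic(0.0))   # 0.5: maximum uncertainty at the decision boundary
print(logistic(10.0))  # ~0.99995: a confident "yes"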

Logistic Regression

  • Train based on conditional likelihood
  • Find the parameters w that maximize the conditional likelihood of all answers yi given the examples xi

  • How do we solve this?

ŵ = argmaxw ∏i P(yi∣xi; w)


Review: Perceptron Training Algorithm

create map w
for I iterations
  for each labeled pair x, y in the data
    phi = create_features(x)
    y' = predict_one(w, phi)
    if y' != y
      w += y * phi

  • In other words
  • Try to classify each training example
  • Every time we make a mistake, update the weights
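A runnable version of this loop, using sparse feature maps (dicts); the unigram feature function and the data format are illustrative assumptions:

from collections import defaultdict

def create_features(x):
    phi = defaultdict(float)
    for word in x.split():
        phi["unigram " + word] += 1
    return phi

def train_perceptron(data, iterations=10):
    w = defaultdict(float)                       # create map w
    for _ in range(iterations):                  # for I iterations
        for x, y in data:                        # each labeled pair x, y
            phi = create_features(x)
            score = sum(w[k] * v for k, v in phi.items())
            y_pred = 1 if score >= 0 else -1     # predict_one(w, phi)
            if y_pred != y:                      # made a mistake:
                for k, v in phi.items():         #   w += y * phi
                    w[k] += y * v
    return w

w = train_perceptron([("I love this book", 1), ("this book is so boring", -1)])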

Stochastic Gradient Descent

  • Online training algorithm for probabilistic models (including logistic regression):

create map w
for I iterations
  for each labeled pair x, y in the data
    w += α * dP(y|x)/dw

  • In other words
  • For every training example, calculate the gradient (the direction that will increase the probability of y); see the sketch after this list

  • Move in that direction, multiplied by learning rate α
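A sketch of this loop for logistic regression, filling in the gradient that the next slides derive; the helper names and data format are illustrative assumptions:

import math
from collections import defaultdict

def create_features(x):
    phi = defaultdict(float)
    for word in x.split():
        phi["unigram " + word] += 1
    return phi

def train_lr_sgd(data, iterations=10, alpha=1.0):
    w = defaultdict(float)                       # create map w
    for _ in range(iterations):                  # for I iterations
        for x, y in data:                        # each labeled pair x, y
            phi = create_features(x)
            s = sum(w[k] * v for k, v in phi.items())
            # dP(y|x)/dw = y * phi(x) * e^s / (1 + e^s)^2 (derived below)
            grad = y * math.exp(s) / (1 + math.exp(s)) ** 2
            for k, v in phi.items():             # w += alpha * dP(y|x)/dw
                w[k] += alpha * grad * v
    return w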


Gradient of the Logistic Function

  • Take the derivative of the probability

d/dw P(y=1∣x) = d/dw [ e^(w⋅ϕ(x)) / (1 + e^(w⋅ϕ(x))) ]
              = ϕ(x) e^(w⋅ϕ(x)) / (1 + e^(w⋅ϕ(x)))²

d/dw P(y=−1∣x) = d/dw [ 1 − e^(w⋅ϕ(x)) / (1 + e^(w⋅ϕ(x))) ]
               = −ϕ(x) e^(w⋅ϕ(x)) / (1 + e^(w⋅ϕ(x)))²

[Plot: the gradient is largest near w⋅ϕ(x) = 0 and approaches 0 for large |w⋅ϕ(x)|]


Example: Initial Update

  • Set α=1, initialize w=0

x = “A site , located in Maizuru , Kyoto”   y = -1

w⋅ϕ(x) = 0

d/dw P(y=−1∣x) = −e⁰/(1+e⁰)² ϕ(x) = −0.25 ϕ(x)

w ← w + (−0.25) ϕ(x)

wunigram “A” = -0.25       wunigram “site” = -0.25
wunigram “,” = -0.5        wunigram “located” = -0.25
wunigram “in” = -0.25      wunigram “Maizuru” = -0.25
wunigram “Kyoto” = -0.25
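A quick numeric check of the gradient scale used in this update and the next one (plain Python, my own verification):

import math

s = 0.0   # w . phi(x) = 0 on this first example
print(-math.exp(s) / (1 + math.exp(s)) ** 2)  # -0.25

s = -1.0  # w . phi(x) = -1 on the next slide's example
print(math.exp(s) / (1 + math.exp(s)) ** 2)   # 0.196...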


Example: Second Update

x = “Shoken , monk born in Kyoto”   y = 1

w⋅ϕ(x) = −1

d/dw P(y=1∣x) = e^(−1)/(1+e^(−1))² ϕ(x) = 0.196 ϕ(x)

w ← w + 0.196 ϕ(x)

wunigram “A” = -0.25       wunigram “site” = -0.25
wunigram “,” = -0.304      wunigram “located” = -0.25
wunigram “in” = -0.054     wunigram “Maizuru” = -0.25
wunigram “Kyoto” = -0.054  wunigram “Shoken” = 0.196
wunigram “monk” = 0.196    wunigram “born” = 0.196


Calculating Optimal Sequences, Probabilities


Sequence Likelihood

  • Logistic regression considered the probability of a single label y∈{−1,+1}: P(y∣x)
  • What if we want to consider the probability of a sequence?

    Xi: I visited Nara
    Yi: PRN VBD NNP

    P(Y∣X)


Calculating Multi-class Probabilities

  • Each sequence has its own feature vector
  • Use weights for each feature to calculate scores

For “time flies”, with weights wT,<S>,N = 1, wE,N,time = 1, wT,V,</S> = 1 (all others 0):

  “N V”:  φT,<S>,N=1  φT,N,V=1  φT,V,</S>=1  φE,N,time=1  φE,V,flies=1   φ⋅w = 3
  “V N”:  φT,<S>,V=1  φT,V,N=1  φT,N,</S>=1  φE,V,time=1  φE,N,flies=1   φ⋅w = 0
  “N N”:  φT,<S>,N=1  φT,N,N=1  φT,N,</S>=1  φE,N,time=1  φE,N,flies=1   φ⋅w = 2
  “V V”:  φT,<S>,V=1  φT,V,V=1  φT,V,</S>=1  φE,V,time=1  φE,V,flies=1   φ⋅w = 1


The Softmax Function

  • Turn scores into probabilities by taking the exponent and normalizing (the Softmax function):

      P(Y∣X) = e^(w⋅ϕ(Y,X)) / ∑Ỹ e^(w⋅ϕ(Ỹ,X))

For the four sequences above:

  exp(φ(“N V”)⋅w) = 20.08   →  P(N V | time flies) = .6437
  exp(φ(“N N”)⋅w) = 7.39    →  P(N N | time flies) = .2369
  exp(φ(“V V”)⋅w) = 2.72    →  P(V V | time flies) = .0872
  exp(φ(“V N”)⋅w) = 1.00    →  P(V N | time flies) = .0320
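A brute-force sketch of these numbers: enumerate every tag sequence for the two-word sentence, score it with the slide's three nonzero weights, and normalize (the feature-name strings are my own encoding of the slide's φT/φE features):

import math
from itertools import product

w = {"T <S> N": 1.0, "E N time": 1.0, "T V </S>": 1.0}  # all other weights 0
words = ["time", "flies"]

def features(tags):
    """Transition (T) and emission (E) features of one tag sequence."""
    phi, prev = [], "<S>"
    for word, tag in zip(words, tags):
        phi += ["T %s %s" % (prev, tag), "E %s %s" % (tag, word)]
        prev = tag
    return phi + ["T %s </S>" % prev]

scores = {tags: math.exp(sum(w.get(f, 0.0) for f in features(tags)))
          for tags in product("NV", repeat=len(words))}
Z = sum(scores.values())                  # sum over all Y~ of e^(w.phi)
for tags, s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(" ".join(tags), round(s / Z, 4))
# N V 0.6439, N N 0.2369, V V 0.0871, V N 0.0321 (the slide's values, up to rounding)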


Calculating Edge Features

  • Like perceptron, can calculate features for each edge

Lattice: <S> → {N, V} (“time”) → {N, V} (“flies”) → </S>

  <S>→N (“time”):   φT,<S>,N=1   φE,N,time=1
  <S>→V (“time”):   φT,<S>,V=1   φE,V,time=1
  N→N (“flies”):    φT,N,N=1     φE,N,flies=1
  N→V (“flies”):    φT,N,V=1     φE,V,flies=1
  V→N (“flies”):    φT,V,N=1     φE,N,flies=1
  V→V (“flies”):    φT,V,V=1     φE,V,flies=1
  N→</S>:           φT,N,</S>=1
  V→</S>:           φT,V,</S>=1


Calculating Edge Probabilities

  • Calculate scores, and take exponent

Lattice: <S> → {N, V} (“time”) → {N, V} (“flies”) → </S>

  <S>→N:   e^(w⋅φ) = 7.39   P = .881
  <S>→V:   e^(w⋅φ) = 1.00   P = .119
  N→N:     e^(w⋅φ) = 1.00   P = .237
  N→V:     e^(w⋅φ) = 1.00   P = .644
  V→N:     e^(w⋅φ) = 1.00   P = .032
  V→V:     e^(w⋅φ) = 1.00   P = .087
  N→</S>:  e^(w⋅φ) = 1.00   P = .269
  V→</S>:  e^(w⋅φ) = 2.72   P = .731

  • This is now the same form as the HMM
  • Can use the Viterbi algorithm
  • Calculate probabilities using forward-backward
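A sketch of forward-backward on this small lattice, reusing the weights from the softmax sketch above to reproduce the inner-edge marginals (P = .644, .237, .087, .032); the lattice encoding is my own:

import math

TAGS = ["N", "V"]
words = ["time", "flies"]
w = {"T <S> N": 1.0, "E N time": 1.0, "T V </S>": 1.0}

def edge_score(prev, tag, word=None):
    """e^(w.phi) for one edge; edges into </S> have no emission feature."""
    s = w.get("T %s %s" % (prev, tag), 0.0)
    if word is not None:
        s += w.get("E %s %s" % (tag, word), 0.0)
    return math.exp(s)

n = len(words)
# forward[i][t]: total score of all paths from <S> to tag t at position i
forward = [{t: edge_score("<S>", t, words[0]) for t in TAGS}]
for i in range(1, n):
    forward.append({t: sum(forward[i-1][p] * edge_score(p, t, words[i])
                           for p in TAGS) for t in TAGS})
# backward[i][t]: total score of all paths from tag t at position i to </S>
backward = [None] * n
backward[n-1] = {t: edge_score(t, "</S>") for t in TAGS}
for i in range(n - 2, -1, -1):
    backward[i] = {t: sum(edge_score(t, q, words[i+1]) * backward[i+1][q]
                          for q in TAGS) for t in TAGS}

Z = sum(forward[n-1][t] * backward[n-1][t] for t in TAGS)
for p in TAGS:  # marginal probability of each second-word edge
    for t in TAGS:
        prob = forward[0][p] * edge_score(p, t, words[1]) * backward[1][t] / Z
        print("%s->%s P=%.3f" % (p, t, prob))  # N->N .237, N->V .644, ...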

Conditional Random Fields


Maximizing CRF Likelihood

  • Want to maximize the likelihood for sequences:

      P(Y∣X) = e^(w⋅ϕ(Y,X)) / ∑Ỹ e^(w⋅ϕ(Ỹ,X))

      ŵ = argmaxw ∏i P(Yi∣Xi; w)

  • For convenience, we consider the log likelihood:

      log P(Y∣X) = w⋅ϕ(Y,X) − log ∑Ỹ e^(w⋅ϕ(Ỹ,X))

  • Want to find the gradient for stochastic gradient descent:

      d/dw log P(Y∣X)


Deriving a CRF Gradient:

log P(Y∣X) = w⋅ϕ(Y,X) − log ∑Ỹ e^(w⋅ϕ(Ỹ,X))
           = w⋅ϕ(Y,X) − log Z

d/dw log P(Y∣X) = ϕ(Y,X) − d/dw log ∑Ỹ e^(w⋅ϕ(Ỹ,X))
                = ϕ(Y,X) − (1/Z) ∑Ỹ d/dw e^(w⋅ϕ(Ỹ,X))
                = ϕ(Y,X) − ∑Ỹ ( e^(w⋅ϕ(Ỹ,X)) / Z ) ϕ(Ỹ,X)
                = ϕ(Y,X) − ∑Ỹ P(Ỹ∣X) ϕ(Ỹ,X)


In Other Words...

  • To get the gradient, we:

      d/dw log P(Y∣X) = ϕ(Y,X) − ∑Ỹ P(Ỹ∣X) ϕ(Ỹ,X)

    add the correct feature vector, and subtract the expectation of the features


Example

The same four sequences, features, and probabilities as before:

  “N V”:  φT,<S>,N=1  φT,N,V=1  φT,V,</S>=1  φE,N,time=1  φE,V,flies=1   P = .644
  “V N”:  φT,<S>,V=1  φT,V,N=1  φT,N,</S>=1  φE,V,time=1  φE,N,flies=1   P = .032
  “N N”:  φT,<S>,N=1  φT,N,N=1  φT,N,</S>=1  φE,N,time=1  φE,N,flies=1   P = .237
  “V V”:  φT,<S>,V=1  φT,V,V=1  φT,V,</S>=1  φE,V,time=1  φE,V,flies=1   P = .087

Gradient for the correct answer “N V” (count in “N V” minus expectation):

  φT,<S>,N, φE,N,time:   1 − .644 − .237 = .119
  φT,N,V:                1 − .644 = .356
  φT,V,</S>, φE,V,flies: 1 − .644 − .087 = .269
  φT,V,N:                0 − .032 = −.032
  φT,N,N:                0 − .237 = −.237
  φT,V,V:                0 − .087 = −.087
  φT,<S>,V, φE,V,time:   0 − .032 − .087 = −.119
  φT,N,</S>, φE,N,flies: 0 − .032 − .237 = −.269
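To verify these numbers, the brute-force softmax sketch from earlier can be extended to compute the expectation term explicitly (this reuses `features`, `scores`, and `Z` from that sketch; the reuse is an assumption of this illustration):

from collections import defaultdict

correct = defaultdict(float)
for f in features(("N", "V")):          # phi(Y, X) of the correct answer
    correct[f] += 1

expectation = defaultdict(float)
for tags, s in scores.items():          # sum over Y~ of P(Y~|X) phi(Y~, X)
    for f in features(tags):
        expectation[f] += s / Z

for f in sorted(set(correct) | set(expectation)):
    print(f, round(correct[f] - expectation[f], 3))
# e.g. T N V 0.356, T N N -0.237, E N flies -0.269, matching the slide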


Combinatorial Explosion

  • Problem! The number of hypotheses Ỹ in the sum is exponential: O(T^|X|), where T is the number of tags

      d/dw log P(Y∣X) = ϕ(Y,X) − ∑Ỹ P(Ỹ∣X) ϕ(Ỹ,X)


Calculate Feature Expectations using Edge Probabilities!

  • If we know the edge probabilities, just multiply them!

For the first edges of the lattice (<S> → {N, V}, over “time”):

  <S>→N:  φT,<S>,N=1, φE,N,time=1   e^(w⋅φ) = 7.39   P = .881
  <S>→V:  φT,<S>,V=1, φE,V,time=1   e^(w⋅φ) = 1.00   P = .119

  φT,<S>,N, φE,N,time:  1 − .881 = .119
  φT,<S>,V, φE,V,time:  0 − .119 = −.119

Same answer as when we explicitly expand all Ỹ (1 − .644 − .237 = .119 and 0 − .032 − .087 = −.119)!


CRF Training Procedure

create map w
for I iterations
  for each labeled pair X, Y in the data
    gradient = φ(Y,X)
    calculate e^(φ(edge)⋅w) for each edge
    run the forward-backward algorithm to get P(edge)
    for each edge
      gradient -= P(edge) * φ(edge)
    w += α * gradient

  • Can perform stochastic gradient descent, like logistic regression
  • The only major difference is the gradient calculation
  • α is the learning rate (see the sketch below)
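A compact sketch of one such update for the two-word lattice, assuming a helper edge_marginals(words, w) that wraps the forward-backward sketch above and returns {(position, prev_tag, tag): P(edge)}; that helper, and all names here, are illustrative assumptions:

from collections import defaultdict

def edge_features(prev, tag, word=None):
    phi = ["T %s %s" % (prev, tag)]
    if word is not None:
        phi.append("E %s %s" % (tag, word))
    return phi

def crf_sgd_step(words, tags, w, alpha=1.0):
    gradient = defaultdict(float)
    prev = "<S>"                               # gradient = phi(Y, X)
    for word, tag in zip(words + [None], tags + ["</S>"]):
        for f in edge_features(prev, tag, word):
            gradient[f] += 1
        prev = tag
    # gradient -= P(edge) * phi(edge), using forward-backward marginals
    for (i, p, t), prob in edge_marginals(words, w).items():
        word = words[i] if i < len(words) else None
        for f in edge_features(p, t, word):
            gradient[f] -= prob
    for f, g in gradient.items():              # w += alpha * gradient
        w[f] += alpha * g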

Learning Algorithms


Batch Learning

  • Online Learning: Update after each example
  • Batch Learning: Update after all examples

Online Stochastic Gradient Descent:

create map w
for I iterations
  for each labeled pair x, y in the data
    w += α * dP(y|x)/dw

Batch Stochastic Gradient Descent:

create map w
for I iterations
  create map gradient
  for each labeled pair x, y in the data
    gradient += α * dP(y|x)/dw
  w += gradient


Batch Learning Algorithms: Newton/Quasi-Newton Methods

  • Newton-Raphson Method:
    • Choose how far to update using the second-order derivatives (the Hessian matrix)
    • Faster convergence, but |w|×|w| time and memory
  • Limited Memory Broyden-Fletcher-Goldfarb-Shanno algorithm (L-BFGS):
    • Guesses the second-order derivatives from first-order information
    • Most widely used?
    • Library: http://www.chokkan.org/software/liblbfgs/
    • More information: http://homes.cs.washington.edu/~galen/files/quasi-newton-notes.pdf


Online Learning vs. Batch Learning

  • Online:
    • In general, simpler mathematical derivation
    • Often converges faster
  • Batch:
    • More stable (results do not depend on example order)
    • Trivially parallelizable

Regularization


Cannot Distinguish Between Large and Small Classifiers

  • For these examples:

      -1  he saw a bird in the park
      +1  he saw a robbery in the park

  • Which classifier is better?

      Classifier 1:  he +3,  saw -5,  a +0.5,  bird -1,  robbery +1,  in +5,  the -3,  park -2
      Classifier 2:  bird -1,  robbery +1


  • Probably classifier 2! It doesn't use irrelevant information.


Regularization

  • A penalty on adding extra weights
  • L2 regularization:
    • Big penalty on large weights, small penalty on small weights
    • High accuracy
  • L1 regularization:
    • Uniform penalty whether weights are large or small
    • Will cause many weights to become zero → small model

[Plot: the L1 and L2 penalties as a function of the weight value]


Regularization in Logistic Regression/CRF

  • To do so in logistic regression/CRF, we add the penalty to the log likelihood (for the whole corpus):

      ŵ = argmaxw ( ∑i log P(Yi∣Xi; w) − c ∑w∈w w² )      ← L2 Regularization

  • c adjusts the strength of the regularization
    • smaller: more freedom to fit the data
    • larger: less freedom to fit the data, better generalization
  • L1 is also used; it is slightly more difficult to optimize
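A minimal sketch of how this penalty enters an SGD update: differentiating −c∑w² adds a −2cw term that shrinks every weight toward zero (applying the penalty at each example, rather than once per corpus, is a common simplification; the names are illustrative):

def l2_regularized_update(w, gradient, alpha=1.0, c=0.01):
    """One SGD step on log P(y|x; w) - c * sum(w_i^2), with w, gradient as dicts."""
    for f in set(w) | set(gradient):
        # usual gradient step, plus the -2*c*w regularizer term
        w[f] = w.get(f, 0.0) + alpha * (gradient.get(f, 0.0) - 2 * c * w.get(f, 0.0))
    return w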


Conclusion


Conclusion

  • Logistic regression is a probabilistic classifier
  • Conditional random fields are probabilistic structured discriminative prediction models
  • Can be trained using:
    • Online stochastic gradient descent (like the perceptron)
    • Batch learning using a method such as L-BFGS
  • Regularization can help solve problems of overfitting

Thank You!