
Sequential Data Modeling - Conditional Random Fields Graham Neubig - PowerPoint PPT Presentation


  1. Sequential Data Modeling - Conditional Random Fields
     Graham Neubig, Nara Institute of Science and Technology (NAIST)

  2. Prediction Problems
     Given x, predict y.

  3. Prediction Problems
     Given x, predict y.
     ● Binary prediction (2 choices): given a book review, predict whether it is positive.
         "Oh, man I love this book!" → yes
         "This book is so boring..." → no
     ● Multi-class prediction (several choices): given a tweet, predict its language.
         "On the way to the park!" → English
         "公園に行くなう!" → Japanese
     ● Structured prediction (millions of choices): given a sentence, predict its parts of speech.
         "I read a book" → N VBD DET NN

  4. Logistic Regression

  5. Example we will use
     ● Given an introductory sentence from Wikipedia
     ● Predict whether the article is about a person
         "Gonso was a Sanron sect priest (754-827) in the late Nara and early Heian periods." → Yes!
         "Shichikuzan Chigogataki Fudomyoo is a historical site located at Magura, Maizuru City, Kyoto Prefecture." → No!
     ● This is binary classification (of course!)

  6. Review: Linear Prediction Model
     ● Each element that helps us predict is a feature:
         contains “priest”, contains “(<#>-<#>)”, contains “site”, contains “Kyoto Prefecture”
     ● Each feature has a weight, positive if it indicates “yes” and negative if it indicates “no”:
         w_{contains “priest”} = 2
         w_{contains “(<#>-<#>)”} = 1
         w_{contains “site”} = -3
         w_{contains “Kyoto Prefecture”} = -1
     ● For a new example, sum the weights of the features it contains:
         "Kuya (903-972) was a priest born in Kyoto Prefecture." → 2 + (-1) + 1 = 2
     ● If the sum is at least 0: “yes”, otherwise: “no”
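The weighted sum on this slide can be sketched in a few lines of Python. The feature names and weights are copied from the slide; the helper names (`extract_features`, `score`, `predict`) and the regular expression used for the “(<#>-<#>)” feature are assumptions made for illustration, not part of the original deck.

```python
import re

# Weights copied from the slide; the dictionary layout is an assumption.
WEIGHTS = {
    'contains "priest"': 2,
    'contains "(<#>-<#>)"': 1,
    'contains "site"': -3,
    'contains "Kyoto Prefecture"': -1,
}


def extract_features(sentence):
    """Return the set of binary features that fire for a sentence."""
    features = set()
    for phrase in ("priest", "site", "Kyoto Prefecture"):
        if phrase in sentence:
            features.add(f'contains "{phrase}"')
    # "(<#>-<#>)" is read here as a parenthesized number range, e.g. (903-972).
    if re.search(r"\(\d+-\d+\)", sentence):
        features.add('contains "(<#>-<#>)"')
    return features


def score(sentence, weights=WEIGHTS):
    """Sum the weights of all features that fire."""
    return sum(weights.get(name, 0) for name in extract_features(sentence))


def predict(sentence, weights=WEIGHTS):
    """Answer "yes" if the sum is at least 0, otherwise "no"."""
    return "yes" if score(sentence, weights) >= 0 else "no"


example = "Kuya (903-972) was a priest born in Kyoto Prefecture."
print(score(example), predict(example))
```

For the slide's example sentence, three features fire (priest, the number range, and Kyoto Prefecture), which gives the same sum of 2 as on the slide.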

  7. Review: Mathematical Formulation
     $y = \mathrm{sign}(w \cdot \phi(x)) = \mathrm{sign}\left(\sum_{i=1}^{I} w_i \cdot \phi_i(x)\right)$
     ● x: the input
     ● φ(x): vector of feature functions {φ_1(x), φ_2(x), …, φ_I(x)}
     ● w: the weight vector {w_1, w_2, …, w_I}
     ● y: the prediction, +1 if “yes”, -1 if “no”
     ● sign(v) is +1 if v >= 0, -1 otherwise

  8. Perceptron and Probabilities
     ● Sometimes we want the probability $P(y \mid x)$
         ● Estimating confidence in predictions
         ● Combining with other systems
     ● However, the perceptron only gives us a prediction: $y = \mathrm{sign}(w \cdot \phi(x))$
     ● In other words:
         $P(y = 1 \mid x) = 1$ if $w \cdot \phi(x) \geq 0$
         $P(y = 1 \mid x) = 0$ if $w \cdot \phi(x) < 0$
     [Figure: step function; p(y|x) from 0 to 1 plotted against w*phi(x) from -10 to 10]

  9. The Logistic Function
     ● The logistic function is a “softened” version of the function used in the perceptron:
         $P(y = 1 \mid x) = \frac{e^{w \cdot \phi(x)}}{1 + e^{w \cdot \phi(x)}}$
     ● Can account for uncertainty
     ● Differentiable
     [Figure: two panels, “Perceptron” (step function) and “Logistic Function” (smooth curve); each plots p(y|x) from 0 to 1 against w*phi(x) from -10 to 10]
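To compare the two curves numerically, here is a minimal sketch that evaluates both functions at a few scores. The function names are hypothetical, and the branch on the sign of the score is an implementation choice (it avoids overflow in `math.exp` for large scores), not something the slide prescribes.

```python
import math


def perceptron_probability(score):
    """Hard step: probability 1 if the score is at least 0, otherwise 0."""
    return 1.0 if score >= 0 else 0.0


def logistic_probability(score):
    """P(y=1|x) = e^s / (1 + e^s), where s = w . phi(x)."""
    if score >= 0:
        # Algebraically equal to e^s / (1 + e^s), but never exponentiates
        # a large positive number.
        return 1.0 / (1.0 + math.exp(-score))
    exp_s = math.exp(score)
    return exp_s / (1.0 + exp_s)


for s in (-10, -1, 0, 1, 10):
    print(s, perceptron_probability(s), round(logistic_probability(s), 4))
```

At a score of 0 the logistic function returns 0.5, where the perceptron jumps straight from 0 to 1.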

  10. Logistic Regression
      ● Train based on conditional likelihood
      ● Find the parameters w that maximize the conditional likelihood of all answers $y_i$ given the examples $x_i$:
          $\hat{w} = \operatorname{argmax}_w \prod_i P(y_i \mid x_i; w)$
      ● How do we solve this?

  11. Review: Perceptron Training Algorithm
          create map w
          for I iterations
              for each labeled pair x, y in the data
                  phi = create_features(x)
                  y' = predict_one(w, phi)
                  if y' != y
                      w += y * phi
      ● In other words
          ● Try to classify each training example
          ● Every time we make a mistake, update the weights
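A runnable version of the pseudocode might look like the sketch below. The slide does not define `create_features` or `predict_one`, so both bodies are assumptions: features are taken to be whitespace-separated unigram counts, and prediction is the sign rule from the earlier review slide. The two training sentences are shortened from the deck's Wikipedia example.

```python
from collections import defaultdict


def create_features(x):
    """Unigram counts (assumption: tokens are separated by whitespace)."""
    phi = defaultdict(int)
    for word in x.split():
        phi["UNI:" + word] += 1
    return phi


def predict_one(w, phi):
    """Return +1 if w . phi is at least 0, otherwise -1."""
    score = sum(w[name] * value for name, value in phi.items())
    return 1 if score >= 0 else -1


def train_perceptron(data, iterations=10):
    w = defaultdict(float)  # "create map w": every weight starts at 0
    for _ in range(iterations):
        for x, y in data:
            phi = create_features(x)
            if predict_one(w, phi) != y:
                # Mistake: move the weights toward the correct label.
                for name, value in phi.items():
                    w[name] += y * value
    return w


data = [
    ("Gonso was a Sanron sect priest", 1),
    ("Shichikuzan Chigogataki Fudomyoo is a historical site", -1),
]
w = train_perceptron(data)
for x, y in data:
    print(y, predict_one(w, create_features(x)))
```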

  12. Stochastic Gradient Descent
      ● Online training algorithm for probabilistic models (including logistic regression)
          create map w
          for I iterations
              for each labeled pair x, y in the data
                  w += α * dP(y|x)/dw
      ● In other words
          ● For every training example, calculate the gradient (the direction that will increase the probability of y)
          ● Move in that direction, multiplied by the learning rate α
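The loop above can be sketched for logistic regression as follows. As on the slide, the update follows the gradient of the probability itself rather than of its logarithm. The unigram feature function, the function names, and the default of 10 iterations are assumptions; the two training sentences are the ones used in the deck's update examples.

```python
import math
from collections import Counter, defaultdict


def create_features(x):
    """Unigram counts (assumption: tokens are separated by whitespace)."""
    return Counter(x.split())


def prob_positive(w, phi):
    """P(y=1|x) under the logistic function, written to avoid overflow."""
    score = sum(w[name] * value for name, value in phi.items())
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    exp_s = math.exp(score)
    return exp_s / (1.0 + exp_s)


def train_sgd(data, iterations=10, alpha=1.0):
    w = defaultdict(float)
    for _ in range(iterations):
        for x, y in data:
            phi = create_features(x)
            p1 = prob_positive(w, phi)
            # dP(y|x)/dw = y * p1 * (1 - p1) * phi(x), because
            # e^s / (1 + e^s)^2 equals p1 * (1 - p1).
            scale = y * p1 * (1.0 - p1)
            for name, value in phi.items():
                w[name] += alpha * scale * value
    return w


def probability(w, x, y):
    """P(y|x) for a label y in {+1, -1}."""
    p1 = prob_positive(w, create_features(x))
    return p1 if y == 1 else 1.0 - p1


data = [
    ("A site , located in Maizuru , Kyoto", -1),
    ("Shoken , monk born in Kyoto", 1),
]
w = train_sgd(data)
for x, y in data:
    print(y, round(probability(w, x, y), 3))
```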

  13. Gradient of the Logistic Function
      ● Take the derivative of the probability:
          $\frac{d}{dw} P(y = 1 \mid x) = \frac{d}{dw} \frac{e^{w \cdot \phi(x)}}{1 + e^{w \cdot \phi(x)}} = \phi(x) \frac{e^{w \cdot \phi(x)}}{(1 + e^{w \cdot \phi(x)})^2}$
          $\frac{d}{dw} P(y = -1 \mid x) = \frac{d}{dw} \left(1 - \frac{e^{w \cdot \phi(x)}}{1 + e^{w \cdot \phi(x)}}\right) = -\phi(x) \frac{e^{w \cdot \phi(x)}}{(1 + e^{w \cdot \phi(x)})^2}$
      [Figure: dp(y|x)/dw*phi(x), on an axis from 0 to 0.4, plotted against w*phi(x) from -10 to 10]
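One way to sanity-check these derivatives is to compare them with a finite-difference estimate. The sketch below differentiates with respect to the scalar score s = w·φ(x); by the chain rule, the gradient with respect to w is that value times φ(x). All function names are hypothetical.

```python
import math


def prob_positive(score):
    """P(y=1|x) as a function of the score s = w . phi(x)."""
    return 1.0 / (1.0 + math.exp(-score))


def grad_positive(score):
    """Analytic d/ds P(y=1|x) = e^s / (1 + e^s)^2, evaluated as p * (1 - p)."""
    p = prob_positive(score)
    return p * (1.0 - p)


def grad_negative(score):
    """Analytic d/ds P(y=-1|x): the same magnitude with the opposite sign."""
    return -grad_positive(score)


def numeric_derivative(f, s, eps=1e-6):
    """Central finite difference, used only to check the analytic form."""
    return (f(s + eps) - f(s - eps)) / (2.0 * eps)


for s in (-2.0, 0.0, 1.5):
    numeric = numeric_derivative(prob_positive, s)
    print(s, round(grad_positive(s), 6), round(numeric, 6))
```

The derivative is largest at s = 0, where it equals 0.25, and shrinks toward 0 as the score moves away from 0 in either direction.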

  14. Example: Initial Update
      ● Set α = 1, initialize w = 0
      ● x = "A site , located in Maizuru , Kyoto", y = -1
          $w \cdot \phi(x) = 0$
          $\frac{d}{dw} P(y = -1 \mid x) = -\frac{e^{0}}{(1 + e^{0})^2} \phi(x) = -0.25\,\phi(x)$
          $w \leftarrow w + (-0.25)\,\phi(x)$
      ● Resulting weights:
          w_{unigram “A”} = -0.25
          w_{unigram “site”} = -0.25
          w_{unigram “,”} = -0.5
          w_{unigram “located”} = -0.25
          w_{unigram “in”} = -0.25
          w_{unigram “Maizuru”} = -0.25
          w_{unigram “Kyoto”} = -0.25

  15. Example: Second Update
      ● x = "Shoken , monk born in Kyoto", y = 1
      ● Current weights of the words in x: “,” = -0.5, “in” = -0.25, “Kyoto” = -0.25
          $w \cdot \phi(x) = -1$
          $\frac{d}{dw} P(y = 1 \mid x) = \frac{e^{-1}}{(1 + e^{-1})^2} \phi(x) = 0.196\,\phi(x)$
          $w \leftarrow w + 0.196\,\phi(x)$
      ● Resulting weights:
          w_{unigram “A”} = -0.25
          w_{unigram “site”} = -0.25
          w_{unigram “,”} = -0.304
          w_{unigram “located”} = -0.25
          w_{unigram “in”} = -0.054
          w_{unigram “Maizuru”} = -0.25
          w_{unigram “Kyoto”} = -0.054
          w_{unigram “Shoken”} = 0.196
          w_{unigram “monk”} = 0.196
          w_{unigram “born”} = 0.196
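The two updates can be replayed with a short script. The helper `sgd_update` is a hypothetical name, and unigram counting by whitespace is an assumption that matches the tokenized sentences on the slides. The exact second step size is about 0.1966, which the slides show as 0.196, so the printed weights can differ from the slide values in the third decimal place.

```python
import math
from collections import Counter, defaultdict


def sgd_update(w, x, y, alpha=1.0):
    """Apply w += alpha * dP(y|x)/dw once and return the scalar multiplying phi(x)."""
    phi = Counter(x.split())
    score = sum(w[token] * count for token, count in phi.items())
    p1 = 1.0 / (1.0 + math.exp(-score))  # P(y=1|x)
    scale = y * p1 * (1.0 - p1)          # y * e^s / (1 + e^s)^2
    for token, count in phi.items():
        w[token] += alpha * scale * count
    return scale


w = defaultdict(float)
first = sgd_update(w, "A site , located in Maizuru , Kyoto", -1)
second = sgd_update(w, "Shoken , monk born in Kyoto", 1)

print(round(first, 4), round(second, 4))
for token in sorted(w):
    print(token, round(w[token], 4))
```

The comma appears twice in the first sentence, which is why its weight moves by twice the step size in the first update.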

  16. Calculating Optimal Sequences, Probabilities

  17. Sequence Likelihood
      ● Logistic regression considered the probability $P(y \mid x)$ of a single label $y \in \{-1, +1\}$
      ● What if we want to consider the probability $P(Y \mid X)$ of a sequence?
          X_i = I visited Nara
          Y_i = PRN VBD NNP

  18. Calculating Multi-class Probabilities
      ● Each sequence has its own feature vector φ(Y, X), shown here for X = "time flies":
          Y = N V: φ_{T,<S>,N} = 1, φ_{T,N,V} = 1, φ_{T,V,</S>} = 1, φ_{E,N,time} = 1, φ_{E,V,flies} = 1
          Y = V N: φ_{T,<S>,V} = 1, φ_{T,V,N} = 1, φ_{T,N,</S>} = 1, φ_{E,V,time} = 1, φ_{E,N,flies} = 1
          Y = N N: φ_{T,<S>,N} = 1, φ_{T,N,N} = 1, φ_{T,N,</S>} = 1, φ_{E,N,time} = 1, φ_{E,N,flies} = 1
          Y = V V: φ_{T,<S>,V} = 1, φ_{T,V,V} = 1, φ_{T,V,</S>} = 1, φ_{E,V,time} = 1, φ_{E,V,flies} = 1
      ● Use weights for each feature to calculate scores:
          w_{T,<S>,N} = 1, w_{T,V,</S>} = 1, w_{E,N,time} = 1 (all other weights are 0)
          Y = N V: φ(Y, X) · w = 3
          Y = V N: φ(Y, X) · w = 0
          Y = N N: φ(Y, X) · w = 2
          Y = V V: φ(Y, X) · w = 1
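The feature vectors and scores above can be reproduced with a small sketch. Representing each feature as a tuple such as `("T", "<S>", "N")` and the function names are assumptions for illustration; the three nonzero weights are the ones given on the slide.

```python
from collections import Counter

# The three nonzero weights from the slide; every other weight is taken as 0.
WEIGHTS = {
    ("T", "<S>", "N"): 1,
    ("T", "V", "</S>"): 1,
    ("E", "N", "time"): 1,
}


def sequence_features(words, tags):
    """Count transition (T) and emission (E) features for one tag sequence."""
    phi = Counter()
    padded = ["<S>"] + list(tags) + ["</S>"]
    for prev, nxt in zip(padded, padded[1:]):
        phi[("T", prev, nxt)] += 1
    for tag, word in zip(tags, words):
        phi[("E", tag, word)] += 1
    return phi


def sequence_score(words, tags, weights=WEIGHTS):
    """Dot product w . phi(Y, X)."""
    phi = sequence_features(words, tags)
    return sum(weights.get(name, 0) * value for name, value in phi.items())


words = ["time", "flies"]
for tags in (("N", "V"), ("V", "N"), ("N", "N"), ("V", "V")):
    print(tags, sequence_score(words, tags))
```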

  19. The Softmax Function
      ● Turn scores into probabilities by taking the exponent and normalizing (the softmax function):
          $P(Y \mid X) = \frac{e^{w \cdot \phi(Y, X)}}{\sum_{\tilde{Y}} e^{w \cdot \phi(\tilde{Y}, X)}}$
      ● Take the exponent (X = "time flies"):
          Y = N V: exp(φ(Y, X) · w) = 20.08
          Y = V N: exp(φ(Y, X) · w) = 1.00
          Y = N N: exp(φ(Y, X) · w) = 7.39
          Y = V V: exp(φ(Y, X) · w) = 2.72
      ● Normalize:
          P(N V | time flies) = 0.6437
          P(V N | time flies) = 0.0320
          P(N N | time flies) = 0.2369
          P(V V | time flies) = 0.0872
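A minimal softmax over the four candidate sequences is sketched below. The scores 3, 0, 2 and 1 come from the previous slide; the function name and the subtraction of the maximum score (a common stability trick that does not change the result) are assumptions. The computed values agree with the slide to about three decimal places.

```python
import math


def softmax(scores):
    """Map {label: score} to {label: probability}."""
    highest = max(scores.values())
    # Subtracting the maximum leaves the ratios unchanged but keeps exp() small.
    exps = {label: math.exp(s - highest) for label, s in scores.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


scores = {"N V": 3, "V N": 0, "N N": 2, "V V": 1}
probs = softmax(scores)
for label, p in probs.items():
    print(label, round(p, 4))
```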

  20. Calculating Edge Features
      ● Like the perceptron, we can calculate features for each edge of the lattice for "time flies":
          <S> → N (time): φ_{T,<S>,N} = 1, φ_{E,N,time} = 1
          <S> → V (time): φ_{T,<S>,V} = 1, φ_{E,V,time} = 1
          N (time) → N (flies): φ_{T,N,N} = 1, φ_{E,N,flies} = 1
          N (time) → V (flies): φ_{T,N,V} = 1, φ_{E,V,flies} = 1
          V (time) → N (flies): φ_{T,V,N} = 1, φ_{E,N,flies} = 1
          V (time) → V (flies): φ_{T,V,V} = 1, φ_{E,V,flies} = 1
          N (flies) → </S>: φ_{T,N,</S>} = 1
          V (flies) → </S>: φ_{T,V,</S>} = 1

  21. Calculating Edge Probabilities
      ● Calculate scores, and take the exponent (lattice for "time flies"):
          <S> → N (time): e^{w·φ} = 7.39, P = .881
          <S> → V (time): e^{w·φ} = 1.00, P = .119
          N (time) → N (flies): e^{w·φ} = 1.00, P = .237
          N (time) → V (flies): e^{w·φ} = 1.00, P = .644
          V (time) → N (flies): e^{w·φ} = 1.00, P = .032
          V (time) → V (flies): e^{w·φ} = 1.00, P = .087
          N (flies) → </S>: e^{w·φ} = 1.00, P = .269
          V (flies) → </S>: e^{w·φ} = 2.72, P = .731
      ● This is now the same form as the HMM
      ● Can use the Viterbi algorithm
      ● Calculate probabilities using forward-backward
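The edge probabilities listed above can be recovered with a small forward-backward pass over the lattice. This is a sketch under stated assumptions: the function names and the key layout of the returned dictionary are hypothetical, and the computation is done directly on exponentiated scores for readability, whereas longer sequences would normally call for log-space arithmetic.

```python
import math

TAGS = ["N", "V"]
# The three nonzero weights from the earlier slide; all others are taken as 0.
WEIGHTS = {("T", "<S>", "N"): 1, ("T", "V", "</S>"): 1, ("E", "N", "time"): 1}


def edge_score(prev, tag, word):
    """exp(w . phi) for one lattice edge; word is None on the edge into </S>."""
    s = WEIGHTS.get(("T", prev, tag), 0)
    if word is not None:
        s += WEIGHTS.get(("E", tag, word), 0)
    return math.exp(s)


def edge_marginals(words):
    n = len(words)
    # forward[i][t]: summed score of all partial paths from <S> ending in t at i.
    forward = [{} for _ in range(n)]
    for t in TAGS:
        forward[0][t] = edge_score("<S>", t, words[0])
    for i in range(1, n):
        for t in TAGS:
            forward[i][t] = sum(
                forward[i - 1][p] * edge_score(p, t, words[i]) for p in TAGS
            )
    # backward[i][t]: summed score of all partial paths from t at i to </S>.
    backward = [{} for _ in range(n)]
    for t in TAGS:
        backward[n - 1][t] = edge_score(t, "</S>", None)
    for i in range(n - 2, -1, -1):
        for t in TAGS:
            backward[i][t] = sum(
                edge_score(t, nxt, words[i + 1]) * backward[i + 1][nxt]
                for nxt in TAGS
            )
    z = sum(forward[n - 1][t] * backward[n - 1][t] for t in TAGS)
    # Keys are (position of the edge's target, previous tag, next tag).
    marginals = {}
    for t in TAGS:
        marginals[(0, "<S>", t)] = forward[0][t] * backward[0][t] / z
        marginals[(n, t, "</S>")] = forward[n - 1][t] * backward[n - 1][t] / z
    for i in range(1, n):
        for p in TAGS:
            for t in TAGS:
                marginals[(i, p, t)] = (
                    forward[i - 1][p] * edge_score(p, t, words[i]) * backward[i][t] / z
                )
    return marginals


marginals = edge_marginals(["time", "flies"])
for key in sorted(marginals):
    print(key, round(marginals[key], 3))
```

Replacing each sum with a max (and remembering which predecessor achieved it) turns the forward pass into the Viterbi search mentioned on the slide.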

  22. Conditional Random Fields

  23. Maximizing CRF Likelihood
      ● Want to maximize the likelihood for sequences:
          $\hat{w} = \operatorname{argmax}_w \prod_i P(Y_i \mid X_i; w)$
          $P(Y \mid X) = \frac{e^{w \cdot \phi(Y, X)}}{\sum_{\tilde{Y}} e^{w \cdot \phi(\tilde{Y}, X)}}$
      ● For convenience, we consider the log likelihood:
          $\log P(Y \mid X) = w \cdot \phi(Y, X) - \log \sum_{\tilde{Y}} e^{w \cdot \phi(\tilde{Y}, X)}$
      ● Want to find the gradient for stochastic gradient descent:
          $\frac{d}{dw} \log P(Y \mid X)$
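For a sentence as short as "time flies", the log likelihood can be computed by brute-force enumeration of every tag sequence, as in the sketch below. The gradient it returns uses the standard result for log-linear models (observed feature counts minus expected feature counts), which is not derived in this excerpt, so treat that part as an assumption about where the derivation leads. Function names and the tuple feature encoding are hypothetical, and enumeration is only feasible for tiny inputs.

```python
import itertools
import math
from collections import Counter

TAGS = ["N", "V"]


def sequence_features(words, tags):
    """Transition (T) and emission (E) feature counts for one tag sequence."""
    phi = Counter()
    padded = ["<S>"] + list(tags) + ["</S>"]
    for prev, nxt in zip(padded, padded[1:]):
        phi[("T", prev, nxt)] += 1
    for tag, word in zip(tags, words):
        phi[("E", tag, word)] += 1
    return phi


def log_likelihood_and_gradient(w, words, gold_tags):
    """Return log P(Y|X) and its gradient, enumerating every candidate Y."""
    candidates = list(itertools.product(TAGS, repeat=len(words)))
    feats = {c: sequence_features(words, c) for c in candidates}
    scores = {
        c: sum(w.get(name, 0.0) * value for name, value in feats[c].items())
        for c in candidates
    }
    # log of the normalizer, computed with the usual max-subtraction trick.
    highest = max(scores.values())
    log_z = highest + math.log(sum(math.exp(s - highest) for s in scores.values()))
    gold = tuple(gold_tags)
    log_p = scores[gold] - log_z
    # Observed counts minus expected counts under the current model.
    gradient = dict(feats[gold])
    for c in candidates:
        p = math.exp(scores[c] - log_z)
        for name, value in feats[c].items():
            gradient[name] = gradient.get(name, 0.0) - p * value
    return log_p, gradient


w = {("T", "<S>", "N"): 1.0, ("T", "V", "</S>"): 1.0, ("E", "N", "time"): 1.0}
log_p, gradient = log_likelihood_and_gradient(w, ["time", "flies"], ["N", "V"])
print(round(math.exp(log_p), 4))
for name in sorted(gradient):
    print(name, round(gradient[name], 3))
```

With the weights from the earlier "time flies" example, exponentiating the returned log likelihood gives back the softmax probability of the sequence N V.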
