NLP Programming Tutorial 6 - Advanced Discriminative Learning
Graham Neubig, Nara Institute of Science and Technology (NAIST)


  1. NLP Programming Tutorial 6 - Advanced Discriminative Learning. Graham Neubig, Nara Institute of Science and Technology (NAIST)

  2. Review: Classifiers and the Perceptron

  3. Prediction Problems: given x, predict y

  4. Example we will use:
     ● Given an introductory sentence from Wikipedia
     ● Predict whether the article is about a person
     Given: “Gonso was a Sanron sect priest (754-827) in the late Nara and early Heian periods.” → Predict: Yes!
     Given: “Shichikuzan Chigogataki Fudomyoo is a historical site located at Magura, Maizuru City, Kyoto Prefecture.” → Predict: No!
     ● This is binary classification

  5. Mathematical Formulation
     y = sign(w ⋅ φ(x)) = sign(∑_{i=1}^{I} w_i ⋅ φ_i(x))
     ● x: the input
     ● φ(x): vector of feature functions {φ_1(x), φ_2(x), …, φ_I(x)}
     ● w: the weight vector {w_1, w_2, …, w_I}
     ● y: the prediction, +1 if “yes”, -1 if “no”
     ● (sign(v) is +1 if v >= 0, -1 otherwise)
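     A minimal Python sketch of this prediction rule. The names create_features and predict_one follow the pseudocode on the later slides, but the unigram feature template is an assumption based on the examples above:

         def create_features(x):
             """Map a sentence to a sparse map of unigram counts (assumed template)."""
             phi = {}
             for word in x.split():
                 key = "UNI:" + word
                 phi[key] = phi.get(key, 0) + 1
             return phi

         def predict_one(w, phi):
             """Return sign(w . phi): +1 if the score is >= 0, else -1."""
             score = sum(w.get(name, 0) * value for name, value in phi.items())
             return 1 if score >= 0 else -1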

  6. Online Learning
     create map w
     for I iterations
         for each labeled pair x, y in the data
             phi = create_features(x)
             y' = predict_one(w, phi)
             if y' != y
                 update_weights(w, phi, y)
     ● In other words:
     ● Try to classify each training example
     ● Every time we make a mistake, update the weights
     ● There are many different online learning algorithms; the simplest is the perceptron (a runnable sketch of this loop follows the next slide)

  7. Perceptron Weight Update
     w ← w + y φ(x)
     ● In other words:
     ● If y = 1, increase the weights for features in φ(x) – features in positive examples get a higher weight
     ● If y = -1, decrease the weights for features in φ(x) – features in negative examples get a lower weight
     → Every time we update, our predictions get better!
     update_weights(w, phi, y)
         for name, value in phi:
             w[name] += value * y
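     Putting slides 6 and 7 together, a minimal runnable sketch of perceptron training, reusing create_features and predict_one from the sketch after slide 5 (the data format, a list of (sentence, label) pairs, is an assumption):

         def update_weights(w, phi, y):
             """Perceptron update: w <- w + y * phi(x)."""
             for name, value in phi.items():
                 w[name] = w.get(name, 0) + value * y

         def train_perceptron(data, iterations=10):
             """data: list of (sentence, label) pairs, with label in {+1, -1}."""
             w = {}
             for _ in range(iterations):
                 for x, y in data:
                     phi = create_features(x)
                     if predict_one(w, phi) != y:  # update only on mistakes
                         update_weights(w, phi, y)
             return w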

  8. Stochastic Gradient Descent and Logistic Regression

  9. Perceptron and Probabilities
     ● Sometimes we want the probability P(y|x)
     ● Estimating confidence in predictions
     ● Combining with other systems
     ● However, the perceptron only gives us a prediction y = sign(w ⋅ φ(x))
     In other words, the perceptron's implicit probability is a step function:
     P(y = 1|x) = 1 if w ⋅ φ(x) ≥ 0
     P(y = 1|x) = 0 if w ⋅ φ(x) < 0
     [Plot: p(y|x) as a step function of w ⋅ φ(x), jumping from 0 to 1 at w ⋅ φ(x) = 0]

  10. The Logistic Function
     ● The logistic function is a “softened” version of the function used in the perceptron:
     P(y = 1|x) = e^{w ⋅ φ(x)} / (1 + e^{w ⋅ φ(x)})
     [Plots: the perceptron's step function vs. the smooth logistic curve, both as functions of w ⋅ φ(x)]
     ● Can account for uncertainty
     ● Differentiable
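     A minimal sketch of this probability in Python, assuming the same sparse feature maps as above (the two-branch form is an implementation detail for numerical stability, not from the slide):

         import math

         def prob_one(w, phi):
             """P(y=1|x) = e^{w.phi} / (1 + e^{w.phi}), the logistic function."""
             score = sum(w.get(name, 0) * value for name, value in phi.items())
             # Mathematically exp(score) / (1 + exp(score)); the two branches
             # just avoid overflow for large |score|.
             if score >= 0:
                 return 1.0 / (1.0 + math.exp(-score))
             return math.exp(score) / (1.0 + math.exp(score))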

  11. Logistic Regression
     ● Train based on conditional likelihood
     ● Find the parameters w that maximize the conditional likelihood of all answers y_i given the examples x_i:
     ŵ = argmax_w ∏_i P(y_i | x_i; w)
     ● How do we solve this?

  12. Stochastic Gradient Descent
     ● Online training algorithm for probabilistic models (including logistic regression)
     create map w
     for I iterations
         for each labeled pair x, y in the data
             w += α * dP(y|x)/dw
     ● In other words:
     ● For every training example, calculate the gradient (the direction that will increase the probability of y)
     ● Move in that direction, multiplied by the learning rate α (a runnable sketch follows the derivation on the next slide)

  13. Gradient of the Logistic Function
     ● Take the derivative of the probability:
     d/dw P(y = 1|x) = d/dw [ e^{w ⋅ φ(x)} / (1 + e^{w ⋅ φ(x)}) ]
                     = e^{w ⋅ φ(x)} / (1 + e^{w ⋅ φ(x)})² ⋅ φ(x)
     d/dw P(y = -1|x) = d/dw [ 1 − e^{w ⋅ φ(x)} / (1 + e^{w ⋅ φ(x)}) ]
                      = −e^{w ⋅ φ(x)} / (1 + e^{w ⋅ φ(x)})² ⋅ φ(x)
     [Plot: dP(y|x)/d(w ⋅ φ(x)) as a function of w ⋅ φ(x), peaking at w ⋅ φ(x) = 0]
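     Slides 12 and 13 combine into the following minimal SGD sketch, reusing create_features from the sketch after slide 5 (the names gradient and sgd_update are assumptions). Since y ∈ {+1, -1}, the two gradient cases fold into a single expression multiplied by y:

         import math

         def gradient(w, phi, y):
             """dP(y|x)/dw = y * e^{w.phi} / (1 + e^{w.phi})^2 * phi(x) (slide 13)."""
             score = sum(w.get(name, 0) * value for name, value in phi.items())
             coeff = math.exp(score) / (1 + math.exp(score)) ** 2
             return {name: y * coeff * value for name, value in phi.items()}

         def sgd_update(w, phi, y, alpha=1.0):
             """w += alpha * dP(y|x)/dw (slide 12)."""
             for name, g in gradient(w, phi, y).items():
                 w[name] = w.get(name, 0) + alpha * g

         # Reproduces the initial update on the next slide: with w = 0 the
         # coefficient is e^0 / (1 + e^0)^2 = 0.25, and y = -1 makes it -0.25.
         w = {}
         phi = create_features("A site , located in Maizuru , Kyoto")
         sgd_update(w, phi, y=-1, alpha=1.0)  # "," occurs twice, so its weight is -0.5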

  14. Example: Initial Update
     ● Set α = 1, initialize w = 0
     y = -1, x = “A site , located in Maizuru , Kyoto”
     w ⋅ φ(x) = 0
     d/dw P(y = -1|x) = −e^0 / (1 + e^0)² φ(x) = −0.25 φ(x)
     w ← w − 0.25 φ(x)
     Resulting weights:
     w_unigram “A” = -0.25        w_unigram “site” = -0.25     w_unigram “,” = -0.5
     w_unigram “located” = -0.25  w_unigram “in” = -0.25       w_unigram “Maizuru” = -0.25
     w_unigram “Kyoto” = -0.25

  15. Example: Second Update
     y = 1, x = “Shoken , monk born in Kyoto”
     w ⋅ φ(x) = -0.5 + -0.25 + -0.25 = -1 (the current weights of “,”, “in”, and “Kyoto”)
     d/dw P(y = 1|x) = e^{-1} / (1 + e^{-1})² φ(x) = 0.196 φ(x)
     w ← w + 0.196 φ(x)
     Resulting weights:
     w_unigram “A” = -0.25        w_unigram “site” = -0.25     w_unigram “located” = -0.25
     w_unigram “Maizuru” = -0.25  w_unigram “,” = -0.304       w_unigram “in” = -0.054
     w_unigram “Kyoto” = -0.054   w_unigram “Shoken” = 0.196   w_unigram “monk” = 0.196
     w_unigram “born” = 0.196
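     Continuing the demo after slide 13 with this second example reproduces the numbers above:

         # Second update (this slide): w . phi = -1, so the coefficient is
         # e^{-1} / (1 + e^{-1})^2 = 0.196, added to each feature's weight.
         phi2 = create_features("Shoken , monk born in Kyoto")
         sgd_update(w, phi2, y=1, alpha=1.0)
         # e.g. w["UNI:,"] is now about -0.304, w["UNI:Kyoto"] about -0.054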

  16. SGD Learning Rate?
     ● How to set the learning rate α?
     ● Usually decay over time: α = 1 / (C + t), where C is a parameter and t is the number of samples seen so far
     ● Or, use held-out data, and reduce the learning rate when the held-out likelihood stops rising
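     The decay schedule as a one-line helper (the function name is an assumption; the formula is from the slide):

         def learning_rate(C, t):
             """Decaying SGD learning rate: alpha = 1 / (C + t)."""
             return 1.0 / (C + t)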

  17. Classification Margins

  18. Choosing between Equally Accurate Classifiers
     ● Which classifier is better? Dotted or Dashed?
     [Figure: O and X training points separated by two candidate decision boundaries, one dotted and one dashed]

  19. Choosing between Equally Accurate Classifiers
     ● Which classifier is better? Dotted or Dashed?
     [Figure: the same O and X points with the two candidate boundaries]
     ● Answer: Probably the dashed line.
     ● Why?: It has a larger margin.

  20. What is a Margin?
     ● The distance between the classification plane and the nearest example:
     [Figure: O and X points, the decision boundary, and the margin to the closest points]
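     As a formula (a standard formalization, not stated on the slide): for a separating weight vector w with no bias term, the margin is min_i y_i (w ⋅ φ(x_i)) / ‖w‖, the smallest distance from a training point to the plane w ⋅ φ(x) = 0.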

  21. Support Vector Machines
     ● The most famous margin-based classifier
     ● Hard margin: explicitly maximize the margin
     ● Soft margin: allow for some mistakes
     ● Usually trained with batch learning
     ● Batch learning: slightly higher accuracy, more stable
     ● Online learning: simpler, less memory, faster convergence
     ● Learn more about SVMs: http://disi.unitn.it/moschitti/material/Interspeech2010-Tutorial.Moschitti.pdf
     ● Batch learning libraries: LIBSVM, LIBLINEAR, SVMlight

  22. Online Learning with a Margin
     ● Penalize not only mistakes, but also correct answers under a margin (a runnable sketch follows below):
     create map w
     for I iterations
         for each labeled pair x, y in the data
             phi = create_features(x)
             val = w * phi * y
             if val <= margin
                 update_weights(w, phi, y)
     ● A correct classifier will always make w * phi * y > 0
     ● If margin = 0, this is the perceptron algorithm
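     A minimal Python sketch of this margin-based variant, reusing create_features and update_weights from the perceptron sketches above (the function name and default margin value are assumptions; the margin is a hyperparameter to tune):

         def train_margin_perceptron(data, margin=1.0, iterations=10):
             """Perceptron-style training that also updates on correct answers
             whose score y * (w . phi) is at or below the margin."""
             w = {}
             for _ in range(iterations):
                 for x, y in data:
                     phi = create_features(x)
                     val = y * sum(w.get(name, 0) * v for name, v in phi.items())
                     if val <= margin:  # mistake, or correct but within the margin
                         update_weights(w, phi, y)
             return w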

  23. Regularization

  24. Cannot Distinguish Between Large and Small Classifiers
     ● For these examples:
     -1 he saw a bird in the park
     +1 he saw a robbery in the park
     ● Which classifier is better?
     Classifier 1: he +3, saw -5, a +0.5, bird -1, robbery +1, in +5, the -3, park -2
     Classifier 2: bird -1, robbery +1

  25. Cannot Distinguish Between Large and Small Classifiers
     ● For these examples:
     -1 he saw a bird in the park
     +1 he saw a robbery in the park
     ● Which classifier is better?
     Classifier 1: he +3, saw -5, a +0.5, bird -1, robbery +1, in +5, the -3, park -2
     Classifier 2: bird -1, robbery +1
     ● Probably classifier 2! It doesn't use irrelevant information.

  26. Regularization
     ● A penalty on adding extra weights
     ● L2 regularization: big penalty on large weights, small penalty on small weights; high accuracy
     ● L1 regularization: uniform increase in the penalty whether weights are large or small; will cause many weights to become zero → small model
     [Plot: the L1 and L2 penalties as functions of the weight value, over the range -2 to 2]
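     As formulas (standard definitions consistent with the plot, not written out on the slide; λ is the regularization strength): the L2 penalty is λ ∑_i w_i², and the L1 penalty is λ ∑_i |w_i|.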

  27. L1 Regularization in Online Learning
     ● After each update, reduce every weight by a constant c:
     update_weights(w, phi, y, c)
         for name, value in w:
             if abs(value) < c:               # if |value| < c, set the weight to zero
                 w[name] = 0
             else:
                 w[name] -= sign(value) * c   # if value > 0, decrease by c; if value < 0, increase by c
         for name, value in phi:
             w[name] += value * y
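     A runnable Python version of this update (the sign helper is spelled out since Python has no built-in sign; iterating over a copy of the keys is an implementation detail needed to modify w inside the loop):

         def sign(v):
             return 1 if v > 0 else (-1 if v < 0 else 0)

         def update_weights_l1(w, phi, y, c):
             """Perceptron-style update with an L1 penalty: shrink every weight
             toward zero by c (clipping small weights to exactly zero), then
             apply the usual w += y * phi update."""
             for name in list(w):                   # copy keys: w is modified in the loop
                 if abs(w[name]) < c:
                     w[name] = 0                    # small weights are clipped to zero
                 else:
                     w[name] -= sign(w[name]) * c   # larger weights shrink by c
             for name, value in phi.items():
                 w[name] = w.get(name, 0) + value * y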
