Slide 1: NLP Programming Tutorial 6 – Advanced Discriminative Learning

Graham Neubig
Nara Institute of Science and Technology (NAIST)

Slide 2: Review: Classifiers and the Perceptron

Slide 3: Prediction Problems

Given x, predict y

Slide 4: Example we will use:

  • Given an introductory sentence from Wikipedia
  • Predict whether the article is about a person
  • This is binary classification

Given: “Gonso was a Sanron sect priest (754-827) in the late Nara and early Heian periods.”
Predict: Yes!

Given: “Shichikuzan Chigogataki Fudomyoo is a historical site located at Magura, Maizuru City, Kyoto Prefecture.”
Predict: No!

Slide 5: Mathematical Formulation

y = sign(w⋅ϕ(x)) = sign(∑_{i=1}^{I} w_i⋅ϕ_i(x))

  • x: the input
  • φ(x): vector of feature functions {φ1(x), φ2(x), …, φI(x)}
  • w: the weight vector {w1, w2, …, wI}
  • y: the prediction, +1 if “yes”, -1 if “no”
  • (sign(v) is +1 if v >= 0, -1 otherwise)
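As a minimal sketch (not part of the original slides), this prediction rule can be written in Python, assuming sparse feature vectors stored as dicts from feature name to value:

    # Sketch of y = sign(w . phi(x)) for sparse feature dicts.
    def predict_one(w, phi):
        score = sum(w.get(name, 0.0) * value for name, value in phi.items())
        return 1 if score >= 0 else -1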
Slide 6: Online Learning

create map w
for I iterations
    for each labeled pair x, y in the data
        phi = create_features(x)
        y' = predict_one(w, phi)
        if y' != y
            update_weights(w, phi, y)

  • In other words:
  • Try to classify each training example
  • Every time we make a mistake, update the weights
  • There are many different online learning algorithms
  • The simplest is the perceptron
Slide 7: Perceptron Weight Update

  • In other words:
  • If y = 1, increase the weights for features in ϕ(x)

– Features for positive examples get a higher weight

  • If y = -1, decrease the weights for features in ϕ(x)

– Features for negative examples get a lower weight

→ Every time we update, our predictions get better!

w ← w + y⋅ϕ(x)

update_weights(w, phi, y)
    for name, value in phi:
        w[name] += value * y
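Combining the loop from slide 6 with this update gives a complete trainer. A sketch, reusing predict_one from above; the unigram create_features here is one illustrative choice, not the only one:

    from collections import defaultdict

    def create_features(x):
        # Unigram features over whitespace-split tokens (an assumption here).
        phi = defaultdict(float)
        for word in x.split():
            phi["UNI:" + word] += 1
        return phi

    def update_weights(w, phi, y):
        # w <- w + y * phi(x)
        for name, value in phi.items():
            w[name] += value * y

    def train_perceptron(data, iterations):
        # data: list of (sentence, label) pairs with label in {+1, -1}
        w = defaultdict(float)
        for _ in range(iterations):
            for x, y in data:
                phi = create_features(x)
                if predict_one(w, phi) != y:  # only update on mistakes
                    update_weights(w, phi, y)
        return w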

Slide 8: Stochastic Gradient Descent and Logistic Regression

Slide 9: Perceptron and Probabilities

[Figure: p(y|x) as a step function of w⋅ϕ(x)]

  • Sometimes we want the probability P(y|x)
  • Estimating confidence in predictions
  • Combining with other systems
  • However, the perceptron only gives us a prediction:

y = sign(w⋅ϕ(x))

In other words:

P(y=1|x) = 1 if w⋅ϕ(x) ≥ 0
P(y=1|x) = 0 if w⋅ϕ(x) < 0

Slide 10: The Logistic Function

  • The logistic function is a “softened” version of the step function used in the perceptron

[Figure: the perceptron's step function vs. the logistic function, p(y|x) plotted against w⋅ϕ(x)]

P(y=1|x) = e^{w⋅ϕ(x)} / (1 + e^{w⋅ϕ(x)})

  • Can account for uncertainty
  • Differentiable
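A short sketch of this function in Python; branching on the sign of the score (a standard stability trick, not from the slides) avoids overflow in exp:

    import math

    def logistic(score):
        # P(y=1|x) = e^s / (1 + e^s) = 1 / (1 + e^-s), where s = w . phi(x)
        if score >= 0:
            return 1.0 / (1.0 + math.exp(-score))
        e = math.exp(score)
        return e / (1.0 + e)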
Slide 11: Logistic Regression

  • Train based on conditional likelihood
  • Find the parameters w that maximize the conditional likelihood of all answers y_i given the examples x_i:

ŵ = argmax_w ∏_i P(y_i|x_i; w)

  • How do we solve this?
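(The next slides answer this with stochastic gradient descent.) In practice the product is handled in log form; a sketch reusing the logistic and create_features helpers above, together with the identity P(y|x) = logistic(y · w⋅ϕ(x)) for y ∈ {+1, -1}:

    import math

    def log_likelihood(w, data):
        # log prod_i P(y_i|x_i; w) = sum_i log P(y_i|x_i; w)
        total = 0.0
        for x, y in data:
            phi = create_features(x)
            score = sum(w.get(name, 0.0) * value for name, value in phi.items())
            total += math.log(logistic(y * score))
        return total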

Slide 12: Stochastic Gradient Descent

  • Online training algorithm for probabilistic models (including logistic regression)

create map w
for I iterations
    for each labeled pair x, y in the data
        w += α * dP(y|x)/dw

  • In other words:
  • For every training example, calculate the gradient (the direction that will increase the probability of y)
  • Move in that direction, multiplied by learning rate α
Slide 13: Gradient of the Logistic Function

[Figure: dp(y|x)/dw⋅ϕ(x) plotted against w⋅ϕ(x)]

  • Take the derivative of the probability:

d/dw P(y=1|x) = d/dw [ e^{w⋅ϕ(x)} / (1 + e^{w⋅ϕ(x)}) ]
             = ϕ(x) e^{w⋅ϕ(x)} / (1 + e^{w⋅ϕ(x)})²

d/dw P(y=-1|x) = d/dw [ 1 − e^{w⋅ϕ(x)} / (1 + e^{w⋅ϕ(x)}) ]
              = −ϕ(x) e^{w⋅ϕ(x)} / (1 + e^{w⋅ϕ(x)})²
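Both cases collapse to d/dw P(y|x) = y⋅ϕ(x)⋅e^{w⋅ϕ(x)}/(1 + e^{w⋅ϕ(x)})², and the scalar coefficient equals p(1−p) with p = P(y=1|x). A sketch of the resulting SGD step, reusing logistic from above:

    def sgd_update(w, phi, y, alpha):
        # w += alpha * dP(y|x)/dw, with dP(y|x)/dw = y * p * (1 - p) * phi(x)
        s = sum(w.get(name, 0.0) * value for name, value in phi.items())
        p = logistic(s)  # e^s / (1 + e^s)
        for name, value in phi.items():
            w[name] = w.get(name, 0.0) + alpha * y * p * (1.0 - p) * value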

Slide 14: Example: Initial Update

  • Set α = 1, initialize w = 0

x = A site , located in Maizuru , Kyoto    y = -1

w⋅ϕ(x) = 0

d/dw P(y=-1|x) = −ϕ(x) e⁰ / (1 + e⁰)² = −0.25 ϕ(x)

w ← w + (−0.25) ϕ(x)

w_{unigram “A”} = -0.25
w_{unigram “site”} = -0.25
w_{unigram “,”} = -0.5
w_{unigram “located”} = -0.25
w_{unigram “in”} = -0.25
w_{unigram “Maizuru”} = -0.25
w_{unigram “Kyoto”} = -0.25

Slide 15: Example: Second Update

x = Shoken , monk born in Kyoto    y = 1

w⋅ϕ(x) = −0.5 − 0.25 − 0.25 = −1    (from the weights of “,”, “in”, and “Kyoto”)

d/dw P(y=1|x) = ϕ(x) e^{−1} / (1 + e^{−1})² = 0.196 ϕ(x)

w ← w + 0.196 ϕ(x)

w_{unigram “A”} = -0.25
w_{unigram “site”} = -0.25
w_{unigram “,”} = -0.304
w_{unigram “located”} = -0.25
w_{unigram “in”} = -0.054
w_{unigram “Maizuru”} = -0.25
w_{unigram “Kyoto”} = -0.054
w_{unigram “Shoken”} = 0.196
w_{unigram “monk”} = 0.196
w_{unigram “born”} = 0.196
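The two constants can be verified numerically (a quick check, not on the slides):

    import math

    def grad_coeff(s):
        # e^s / (1 + e^s)^2 == p * (1 - p), with p the logistic of s
        p = 1.0 / (1.0 + math.exp(-s))
        return p * (1.0 - p)

    print(grad_coeff(0.0))   # 0.25    (first update:  w.phi(x) = 0)
    print(grad_coeff(-1.0))  # 0.1966… (second update: w.phi(x) = -1, the 0.196 above)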

Slide 16: SGD Learning Rate?

  • How do we set the learning rate α?
  • Usually, decay it over time:

α = 1 / (C + t)

where C is a parameter and t is the number of samples seen so far.

  • Or, use held-out data, and reduce the learning rate when the held-out likelihood stops rising
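As a one-line sketch of the decay schedule:

    def learning_rate(C, t):
        # alpha = 1 / (C + t); C is a tuned constant, t counts examples seen so far
        return 1.0 / (C + t)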

Slide 17: Classification Margins

Slide 18: Choosing between Equally Accurate Classifiers

  • Which classifier is better? Dotted or Dashed?

[Figure: O and X points separated by a dotted line and a dashed line]

Slide 19: Choosing between Equally Accurate Classifiers

  • Which classifier is better? Dotted or Dashed?
  • Answer: Probably the dashed line.
  • Why?: It has a larger margin.

[Figure: the same O and X points; the dashed line leaves more space around the examples]

Slide 20: What is a Margin?

  • The distance between the classification plane and the nearest example:

[Figure: separating plane with the margin marked as the distance to the closest O and X examples]

Slide 21: Support Vector Machines

  • The most famous margin-based classifier
  • Hard margin: explicitly maximize the margin
  • Soft margin: allow for some mistakes
  • Usually trained with batch learning
  • Batch learning: slightly higher accuracy, more stable
  • Online learning: simpler, less memory, faster convergence
  • Learn more about SVMs: http://disi.unitn.it/moschitti/material/Interspeech2010-Tutorial.Moschitti.pdf
  • Batch learning libraries: LIBSVM, LIBLINEAR, SVMlight

Slide 22: Online Learning with a Margin

  • Penalize not only mistakes, but also correct answers under a margin:

create map w
for I iterations
    for each labeled pair x, y in the data
        phi = create_features(x)
        val = w * phi * y
        if val <= margin
            update_weights(w, phi, y)

(A correct classifier will always make w * phi * y > 0)
★ If margin = 0, this is the perceptron algorithm
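A runnable sketch of this loop, reusing create_features and update_weights from the perceptron code earlier; the margin value is a free parameter:

    from collections import defaultdict

    def train_margin(data, iterations, margin):
        w = defaultdict(float)
        for _ in range(iterations):
            for x, y in data:
                phi = create_features(x)
                val = y * sum(w.get(name, 0.0) * value for name, value in phi.items())
                if val <= margin:  # margin = 0 recovers the perceptron
                    update_weights(w, phi, y)
        return w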

Slide 23: Regularization

Slide 24: Cannot Distinguish Between Large and Small Classifiers

  • For these examples, which classifier is better?

-1  he saw a bird in the park
+1  he saw a robbery in the park

Classifier 1: he +3, saw -5, a +0.5, bird -1, robbery +1, in +5, the -3, park -2
Classifier 2: bird -1, robbery +1

Slide 25: Cannot Distinguish Between Large and Small Classifiers

  • For these examples, which classifier is better?

-1  he saw a bird in the park
+1  he saw a robbery in the park

Classifier 1: he +3, saw -5, a +0.5, bird -1, robbery +1, in +5, the -3, park -2
Classifier 2: bird -1, robbery +1

Probably classifier 2! It doesn't use irrelevant information.

Slide 26: Regularization

  • A penalty on adding extra weights
  • L2 regularization:

– Big penalty on large weights, small penalty on small weights
– High accuracy

  • L1 regularization:

– Uniform penalty whether weights are large or small
– Will cause many weights to become zero → small model

[Figure: L2 (quadratic) and L1 (linear) penalty curves as a function of the weight value]

Slide 27: L1 Regularization in Online Learning

  • After each update, reduce every weight by a constant c:

update_weights(w, phi, y, c)
    for name, value in w:
        if abs(value) < c:
            w[name] = 0                  ★ if absolute value < c, set weight to zero
        else:
            w[name] -= sign(value) * c   ★ if value > 0, decrease by c; if value < 0, increase by c
    for name, value in phi:
        w[name] += value * y
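The same procedure as runnable Python (a sketch; Python has no built-in sign, so it is spelled out):

    def sign(value):
        return 1 if value >= 0 else -1

    def update_weights_l1(w, phi, y, c):
        # Regularize: pull every weight toward zero by c, clipping to zero.
        for name in list(w):
            if abs(w[name]) < c:
                w[name] = 0.0
            else:
                w[name] -= sign(w[name]) * c
        # Then apply the usual update w <- w + y * phi(x).
        for name, value in phi.items():
            w[name] = w.get(name, 0.0) + value * y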

Slide 28: Example

  • Every turn, we regularize, then update

Regularization: c = 0.1
Updates: {1, 0} on the 1st and 5th turns, {0, -1} on the 3rd turn

Turn   Change        w
R1     {0, 0}        {0, 0}
U1     {+1, 0}       {1, 0}
R2     {-0.1, 0}     {0.9, 0}
U2     {0, 0}        {0.9, 0}
R3     {-0.1, 0}     {0.8, 0}
U3     {0, -1}       {0.8, -1}
R4     {-0.1, +0.1}  {0.7, -0.9}
U4     {0, 0}        {0.7, -0.9}
R5     {-0.1, +0.1}  {0.6, -0.8}
U5     {+1, 0}       {1.6, -0.8}
R6     {-0.1, +0.1}  {1.5, -0.7}
U6     {0, 0}        {1.5, -0.7}
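This trace can be reproduced with a few lines of Python (a quick check, treating w as just a pair of numbers):

    c = 0.1
    updates = {1: (1, 0), 3: (0, -1), 5: (1, 0)}  # turn -> update vector
    w = [0.0, 0.0]
    for turn in range(1, 7):
        # Regularize: move each weight toward zero by c, clipping to zero.
        w = [0.0 if abs(v) < c else v - (c if v > 0 else -c) for v in w]
        # Update (only on turns 1, 3, and 5).
        u = updates.get(turn, (0, 0))
        w = [round(v + d, 1) for v, d in zip(w, u)]
        print(turn, w)  # ends at [1.5, -0.7] after turn 6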

Slide 29: Efficiency Problems

  • Typical number of features:
  • Each sentence (phi): 10~1000
  • Overall (w): 1,000,000~100,000,000

This loop over all of w is VERY SLOW!

update_weights(w, phi, y, c)
    for name, value in w:
        if abs(value) <= c:
            w[name] = 0
        else:
            w[name] -= sign(value) * c
    for name, value in phi:
        w[name] += value * y

Slide 30: Efficiency Trick

  • Regularize only when the value is used!
  • This is called “lazy evaluation”, and is used in many applications

getw(w, name, c, iter, last)
    if iter != last[name]:    # regularize several times
        c_size = c * (iter - last[name])
        if abs(w[name]) <= c_size:
            w[name] = 0
        else:
            w[name] -= sign(w[name]) * c_size
        last[name] = iter
    return w[name]
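In runnable form, the same idea with a last map recording when each weight was most recently regularized (a sketch, reusing sign from the L1 code above; every read of a weight must go through getw):

    def getw(w, name, c, iteration, last):
        # Apply all regularization steps this feature has missed, in one go.
        if iteration != last.get(name, 0):
            c_size = c * (iteration - last.get(name, 0))
            value = w.get(name, 0.0)
            if abs(value) <= c_size:
                w[name] = 0.0
            else:
                w[name] = value - sign(value) * c_size
            last[name] = iteration
        return w.get(name, 0.0)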

Slide 31: Choosing the Regularization Constant

  • The regularization constant c has a large effect
  • Large value:

– small model
– lower score on training set
– less overfitting

  • Small value:

– large model
– higher score on training set
– more overfitting

  • Choose the best regularization value on a development set
  • e.g. 0.0001, 0.001, 0.01, 0.1, 1.0
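A sketch of that search; train_l1 and accuracy are hypothetical helpers standing in for whatever trainer and scorer you wrote:

    # Hypothetical helpers: train_l1(data, c) -> model, accuracy(model, data) -> float
    best_c, best_acc = None, 0.0
    for c in [0.0001, 0.001, 0.01, 0.1, 1.0]:
        model = train_l1(train_data, c)
        acc = accuracy(model, dev_data)  # always compare on held-out data
        if acc > best_acc:
            best_c, best_acc = c, acc
    print("best c:", best_c, "dev accuracy:", best_acc)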
Slide 32: Exercise

Slide 33: Exercise

  • Write a program:

– train-svm / train-lr: create an SVM or LR model with L2 regularization (constant 0.001)

  • Train a model on data-en/titles-en-train.labeled
  • Predict the labels of data-en/titles-en-test.word
  • Grade your answers and compare them with the perceptron:

script/grade-prediction.py data-en/titles-en-test.labeled your_answer

  • Extra challenge:

– Try many different regularization constants
– Implement the efficiency trick
Slide 34: Thank You!