
CS447: Natural Language Processing (http://courses.engr.illinois.edu/cs447), Lecture 5: Logistic Regression. Julia Hockenmaier, juliahmr@illinois.edu, 3324 Siebel Center.


1. CS447: Natural Language Processing (http://courses.engr.illinois.edu/cs447). Lecture 5: Logistic Regression. Julia Hockenmaier, juliahmr@illinois.edu, 3324 Siebel Center.

2. Part 1: Review and Overview

3. Probabilistic classifiers
We want to find the most likely class y for the input x:
y* = argmax_y P(Y = y | X = x)
P(Y = y | X = x): the probability that the class label is y when the input feature vector is x.
Notation: y* = argmax_y f(y) means that y* is the y that maximizes f(y).

4. Modeling P(Y | X) with Bayes Rule
Bayes Rule relates P(Y | X) to P(X | Y) and P(Y):
P(Y | X) = P(Y, X) / P(X) = P(X | Y) P(Y) / P(X) ∝ P(X | Y) P(Y)
(posterior = likelihood × prior / P(X))
Bayes rule: the posterior P(Y | X) is proportional to the prior P(Y) times the likelihood P(X | Y).

5. Modeling P(Y | X) with Bayes Rule (terminology)
P(Y | X) = P(Y, X) / P(X) = P(X | Y) P(Y) / P(X) ∝ P(X | Y) P(Y)
Posterior P(Y | X): probability of the label Y after having seen the data X.
Likelihood P(X | Y): probability of the data X according to class Y.
Prior P(Y): probability of the label Y, independent of the data.
Bayes rule: the posterior P(Y | X) is proportional to the prior P(Y) times the likelihood P(X | Y).

6. Using Bayes Rule for our classifier
y* = argmax_y P(Y | X)
   = argmax_y P(X | Y) P(Y) / P(X)   [Bayes Rule]
   = argmax_y P(X | Y) P(Y)          [P(X) doesn't change the argmax over y]
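To make the argmax concrete, here is a minimal Python sketch of this classifier, assuming we already have values for P(y) and P(x | y) for one fixed input x. The class names and probability values below are invented for illustration, not taken from the lecture.

```python
# Minimal sketch: pick the class that maximizes P(x | y) * P(y).
# All names and numbers below are invented toy values.

prior = {"pos": 0.6, "neg": 0.4}          # P(Y = y)
likelihood = {"pos": 0.02, "neg": 0.005}  # P(X = x | Y = y) for this particular x

# y* = argmax_y P(x | y) P(y); P(x) is dropped because it does not affect the argmax
y_star = max(prior, key=lambda y: likelihood[y] * prior[y])
print(y_star)  # -> "pos"  (0.02 * 0.6 = 0.012  >  0.005 * 0.4 = 0.002)
```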

7. Classification more generally
Raw data → feature function → feature vector → classifier → class label(s).
Before we can use a classifier on our data, we have to map the data to "feature" vectors.

8. Feature engineering as a prerequisite for classification
To talk about classification mathematically, we assume each input item is represented as a "feature" vector x = (x_1, ..., x_N):
— Each element in x is one feature.
— The number of elements/features N is fixed, and may be very large.
— x has to capture all the information about the item that the classifier needs.
But the raw data points (e.g. documents to classify) are typically not in vector form. Before we can train a classifier, we therefore have to first define a suitable feature function that maps raw data points to vectors.
In practice, feature engineering (designing suitable feature functions) is very important for accurate classification.
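As one illustration of such a feature function, here is a hypothetical bag-of-words mapping in Python; the vocabulary and example document are invented.

```python
# Hypothetical bag-of-words feature function: a fixed vocabulary defines the
# N dimensions of the feature vector, and each raw document is mapped to a
# vector of word counts. Vocabulary and document are invented for illustration.

vocabulary = ["great", "terrible", "movie", "plot"]  # fixes N = 4

def feature_function(document):
    tokens = document.lower().split()
    return [tokens.count(word) for word in vocabulary]

x = feature_function("Great movie great plot")
print(x)  # -> [2, 0, 1, 1]
```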

9. Probabilistic classifiers
A probabilistic classifier returns the most likely class y* for input x:
y* = argmax_y P(Y = y | X = x)
[Last class:] Naive Bayes uses Bayes Rule: y* = argmax_y P(y | x) = argmax_y P(x | y) P(y).
Naive Bayes models the joint distribution of the class and the data: P(x | y) P(y) = P(x, y).
Joint models are also called generative models because we can view them as stochastic processes that generate (labeled) items: sample/pick a label y with P(y), and then an item x with P(x | y).
[Today:] Logistic Regression models P(y | x) directly. This is also called a discriminative or conditional model, because it only models the probability of the class given the input, and not of the raw data itself.
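A small Python sketch of this generative view, using invented toy distributions: first sample a label with P(y), then generate the words of an item with P(word | y).

```python
import random

# Sketch of the generative ("stochastic process") view of a joint model:
# sample y ~ P(y), then sample each word of x from P(word | y).
# Both distributions below are toy numbers invented for illustration.

p_y = {"pos": 0.6, "neg": 0.4}
p_word_given_y = {
    "pos": {"great": 0.7, "terrible": 0.3},
    "neg": {"great": 0.2, "terrible": 0.8},
}

def generate(doc_length=3):
    y = random.choices(list(p_y), weights=list(p_y.values()))[0]
    words = p_word_given_y[y]
    x = random.choices(list(words), weights=list(words.values()), k=doc_length)
    return y, x

print(generate())  # e.g. ('pos', ['great', 'great', 'terrible'])
```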

10. Key questions for today's class
— What do we mean by generative vs. discriminative models/classifiers?
— Why is it difficult to incorporate complex features into a generative model like Naive Bayes?
— How can we use (standard or multinomial) logistic regression for (binary or multiclass) classification?
— How can we train logistic regression models with (stochastic) gradient descent?

11. Today's class
Part 1: Review and Overview
Part 2: From generative to discriminative classifiers (Logistic Regression and Multinomial Regression)
Part 3: Learning Logistic Regression Models with (Stochastic) Gradient Descent
Reading: Chapter 5 (Jurafsky & Martin, 3rd Edition)

12. Part 2: From Generative to Discriminative Probability Models

13. Probabilistic classifiers (repeated from slide 9)
Naive Bayes is a generative model: it models the joint distribution P(x, y) = P(x | y) P(y) and classifies with y* = argmax_y P(x | y) P(y).
Logistic Regression models P(y | x) directly: a discriminative (conditional) model of the class given the input, not of the raw data itself.

14. (Directed) Graphical Models
Graphical models are a visual notation for probability models.
Each node represents a distribution over one random variable: a node X represents P(X).
Arrows represent dependencies (i.e. what other random variables the current node is conditioned on):
Y → X represents P(X | Y) P(Y); the graph Y → X ← Z represents P(X | Y, Z) P(Y) P(Z).

15. Generative vs Discriminative Models
In classification:
— The data x = (x_1, ..., x_n) is observed (shaded nodes).
— The label y is hidden (and needs to be inferred).
Generative Model (Naive Bayes): models P(x | y); the label node Y points to each feature node X_i.
Discriminative Model (Logistic Regression): models P(y | x); each feature node X_i points to the label node Y.

16. How do we model P(Y = y | X = x) such that we can compute it for any x?
We've probably never seen any particular x that we want to classify at test time.
Even if we could define and compute probability distributions P(Y = y | X_i = x_i) for any single feature x_i of x = (x_1, ..., x_i, ..., x_n)
— Good: each of these sums to 1: Σ_{y_j ∈ Y} P(Y = y_j | X_i = x_i) = 1 —
we can't just multiply these probabilities together to get one distribution over all y_j ∈ Y for a given x:
Bad: P(Y = y | X = x) := Π_{i=1...n} P(Y = y | X_i = x_i) does not sum to 1, since Σ_{y_j ∈ Y} [ Π_{i=1...n} P(Y = y_j | X_i = x_i) ] < 1.
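A quick numeric check of the "Bad" case, with two invented per-feature conditionals for a two-class problem:

```python
# Two perfectly valid per-feature conditionals (each sums to 1 over the two
# classes), but their product over features does not sum to 1 over the classes.
# Numbers are invented.

p_y_given_x1 = {"pos": 0.9, "neg": 0.1}  # P(Y | X_1 = x_1)
p_y_given_x2 = {"pos": 0.6, "neg": 0.4}  # P(Y | X_2 = x_2)

total = sum(p_y_given_x1[y] * p_y_given_x2[y] for y in ("pos", "neg"))
print(total)  # 0.9*0.6 + 0.1*0.4 = 0.58, which is < 1
```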

17. The sigmoid function σ(x)
The sigmoid function σ(x) maps any real number x to the range (0, 1):
σ(x) = e^x / (e^x + 1) = 1 / (1 + e^(−x))
[Plot: σ(x) for x from −10 to 10; the curve rises from near 0, passes through 0.5 at x = 0, and approaches 1.]

18. Using σ() with feature vectors
We can use the sigmoid σ() to express a Bernoulli distribution.
Coin flips: P(Heads) = σ(x) and P(Tails) = 1 − P(Heads) = 1 − σ(x).
But to use the sigmoid for binary classification, we need to model the conditional probability P(Y ∈ {0, 1} | X = x) such that it depends on the particular feature vector x ∈ X.
Also: we don't know how important each feature (element) x_i of x = (x_1, ..., x_n) is for our particular classification task... and we need to feed a single real number into σ()!
Solution: assign (learn) a vector of feature weights f = (f_1, ..., f_n), compute the dot product fx = Σ_{i=1}^{n} f_i x_i to obtain a single real number, and then compute σ(fx).
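A minimal Python sketch of this recipe, with hypothetical weights f and feature values x (the numbers are made up; in practice the weights are learned):

```python
import math

# Weight vector f, feature vector x, dot product fx = sum_i f_i x_i,
# and sigma(fx) interpreted as P(Y = 1 | x).

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

f = [1.2, -0.8, 0.5]   # feature weights (hypothetical, normally learned)
x = [2.0, 1.0, 0.0]    # feature vector for one input item (hypothetical)

fx = sum(f_i * x_i for f_i, x_i in zip(f, x))  # 1.2*2.0 - 0.8*1.0 + 0.5*0.0 = 1.6
print(sigmoid(fx))  # ~0.83 = P(Y = 1 | x); P(Y = 0 | x) = 1 - 0.83
```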
