Pattern Recognition 2019: Linear Models for Classification
Ad Feelders (Universiteit Utrecht)


  1. Pattern Recognition 2019: Linear Models for Classification. Ad Feelders, Universiteit Utrecht.

  2. Classification Problems. We are concerned with two problems: (1) predicting the class of an object on the basis of a number of variables that describe the object, and (2) estimating the class probabilities of an object. The two are interconnected, since prediction is usually based on the estimated probabilities.

  3. Examples of Classification Problems. Churn: is a customer going to leave for a competitor? Spam filter: is an e-mail message spam or not? Medical diagnosis: does a patient have breast cancer? Handwritten digit recognition.

  4. Classification Problems. In this kind of classification problem there is a target variable $t$ that assumes values in an unordered discrete set. An important special case is when there are only two classes, in which case we usually choose $t \in \{0, 1\}$. The goal of a classification procedure is to predict the target value (class label) given a set of input values $\mathbf{x} = (x_1, \ldots, x_D)$ measured on the same object.

  5. Classification Problems. At a particular point $\mathbf{x}$ the value of $t$ is not uniquely determined: it can assume either of its values, with probabilities that depend on the location of $\mathbf{x}$ in the input space. We write $y(\mathbf{x}) = p(\mathcal{C}_1 \mid \mathbf{x}) = 1 - p(\mathcal{C}_2 \mid \mathbf{x})$. The goal of a classification procedure is to produce an estimate of $y(\mathbf{x})$ at every input point.

  6. Two Types of Approaches to Classification. Discriminative models ("regression"; section 4.3), and generative models ("density estimation"; section 4.2).

  7. Discriminative Models. Discriminative methods model only the conditional distribution of $t$ given $\mathbf{x}$; the probability distribution of $\mathbf{x}$ itself is not modeled. For the binary classification problem:
$$y(\mathbf{x}) = p(\mathcal{C}_1 \mid \mathbf{x}) = p(t = 1 \mid \mathbf{x}) = f(\mathbf{x}, \mathbf{w}),$$
where $f(\mathbf{x}, \mathbf{w})$ is some deterministic function of $\mathbf{x}$.

  8. Discriminative Models. Examples of discriminative classification methods: the linear probability model, logistic regression, feed-forward neural networks, ...

  9. Generative Models. An alternative paradigm for estimating $y(\mathbf{x})$ is based on density estimation. Here Bayes' theorem
$$y(\mathbf{x}) = p(\mathcal{C}_1 \mid \mathbf{x}) = \frac{p(\mathcal{C}_1)\, p(\mathbf{x} \mid \mathcal{C}_1)}{p(\mathcal{C}_1)\, p(\mathbf{x} \mid \mathcal{C}_1) + p(\mathcal{C}_2)\, p(\mathbf{x} \mid \mathcal{C}_2)}$$
is applied, where $p(\mathbf{x} \mid \mathcal{C}_k)$ are the class-conditional probability density functions and $p(\mathcal{C}_k)$ are the unconditional ("prior") probabilities of each class.
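
As a minimal sketch of this computation (not part of the slides), the posterior can be evaluated once the class-conditional densities and priors are specified. The one-dimensional Gaussian densities and the prior values below are invented for illustration.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical 1-D Gaussian class-conditional densities and priors
# (all parameter values are made up for illustration).
prior = {1: 0.3, 2: 0.7}                  # p(C_k)
cond = {1: norm(loc=0.0, scale=1.0),      # p(x | C_1)
        2: norm(loc=2.0, scale=1.5)}      # p(x | C_2)

def posterior_c1(x):
    """y(x) = p(C_1 | x) via Bayes' theorem."""
    num = prior[1] * cond[1].pdf(x)
    den = num + prior[2] * cond[2].pdf(x)
    return num / den

print(posterior_c1(0.5))   # posterior probability of class 1 at x = 0.5
```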

  10. Generative Models. Examples of generative classification methods: linear/quadratic discriminant analysis, the naive Bayes classifier, ...

  11. Discriminative Models: the Linear Probability Model. In the linear probability model, we assume that
$$p(t = 1 \mid \mathbf{x}) = E[t \mid \mathbf{x}] = \mathbf{w}^\top \mathbf{x}.$$
Problem: the linear function $\mathbf{w}^\top \mathbf{x}$ is not guaranteed to produce values between 0 and 1. Negative probabilities and probabilities bigger than 1 go against the axioms of probability.
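
A small numeric illustration of the problem (the weights and inputs are made up): the linear response readily leaves the interval [0, 1].

```python
import numpy as np

w = np.array([0.2, 0.15])                  # made-up weights [w_0, w_1]
X = np.array([[1, -5], [1, 3], [1, 8]])    # rows [1, x], with a bias term

p = X @ w                                  # "probabilities" under the linear model
print(p)                                   # [-0.55  0.65  1.4]: outside [0, 1]
```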

  12. Linear Response Function. [Figure: a linear response function plotted against $x$, crossing below 0 and above 1.]

  13. Logistic Regression. Logistic response function:
$$E[t \mid \mathbf{x}] = p(t = 1 \mid \mathbf{x}) = \frac{e^{\mathbf{w}^\top \mathbf{x}}}{1 + e^{\mathbf{w}^\top \mathbf{x}}}$$
or (divide numerator and denominator by $e^{\mathbf{w}^\top \mathbf{x}}$):
$$p(t = 1 \mid \mathbf{x}) = \frac{1}{1 + e^{-\mathbf{w}^\top \mathbf{x}}} = (1 + e^{-\mathbf{w}^\top \mathbf{x}})^{-1} \quad \text{(4.59 and 4.87)}$$
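
As a sketch, the two algebraic forms of the logistic response can be checked numerically (the input values are arbitrary stand-ins for $\mathbf{w}^\top \mathbf{x}$):

```python
import numpy as np

def sigmoid(z):
    """Logistic response: sigma(z) = (1 + e^{-z})^{-1}."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-4.0, 0.0, 4.0])      # stand-ins for w^T x
print(sigmoid(z))                   # [0.018 0.5   0.982]: always in (0, 1)
# The two forms on the slide agree:
print(np.allclose(sigmoid(z), np.exp(z) / (1 + np.exp(z))))   # True
```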

  14. Logistic Response Function. [Figure: the S-shaped logistic response function, rising from 0 to 1.]

  15. Linearization: the Logit Transformation. Since $p(t = 1 \mid \mathbf{x})$ and $p(t = 0 \mid \mathbf{x})$ have to add up to one, it follows that
$$p(t = 0 \mid \mathbf{x}) = \frac{1}{1 + e^{\mathbf{w}^\top \mathbf{x}}}.$$
Hence,
$$\frac{p(t = 1 \mid \mathbf{x})}{p(t = 0 \mid \mathbf{x})} = e^{\mathbf{w}^\top \mathbf{x}}.$$
Therefore
$$\ln \frac{p(t = 1 \mid \mathbf{x})}{p(t = 0 \mid \mathbf{x})} = \mathbf{w}^\top \mathbf{x}.$$
The ratio $p(t = 1 \mid \mathbf{x}) / p(t = 0 \mid \mathbf{x})$ is called the odds.
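
A quick numerical check of the logit transformation (illustration only; the value of $\mathbf{w}^\top \mathbf{x}$ is made up):

```python
import numpy as np

z = 1.3                        # an arbitrary value of w^T x
p = 1 / (1 + np.exp(-z))       # p(t = 1 | x)
odds = p / (1 - p)
print(odds, np.exp(z))         # both e^{1.3} ~ 3.669: the odds equal e^{w^T x}
print(np.log(odds))            # the logit recovers w^T x = 1.3
```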

  16. Linear Separation. Assign to class $t = 1$ if $p(t = 1 \mid \mathbf{x}) > p(t = 0 \mid \mathbf{x})$, i.e. if
$$\frac{p(t = 1 \mid \mathbf{x})}{p(t = 0 \mid \mathbf{x})} > 1.$$
This is true if
$$\ln \frac{p(t = 1 \mid \mathbf{x})}{p(t = 0 \mid \mathbf{x})} > 0.$$
So: assign to class $t = 1$ if $\mathbf{w}^\top \mathbf{x} > 0$, and to class $t = 0$ otherwise.
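
A minimal sketch of this decision rule, with invented weights and a bias column prepended to the inputs:

```python
import numpy as np

def classify(X, w):
    """Assign class 1 where w^T x > 0, else class 0 (linear decision boundary)."""
    return (X @ w > 0).astype(int)

w = np.array([-1.0, 0.5])                     # made-up weights [w_0, w_1]
X = np.array([[1, 1.0], [1, 3.0], [1, 5.0]])  # rows [1, x] with a bias term
print(classify(X, w))                         # [0 1 1]
```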

  17. Maximum Likelihood Estimation. Let $t = 1$ if heads, $t = 0$ if tails, and $\mu = p(t = 1)$. For one coin flip,
$$p(t) = \mu^t (1 - \mu)^{1 - t}.$$
Note that $p(1) = \mu$ and $p(0) = 1 - \mu$, as required. For a sequence of $N$ independent coin flips,
$$p(\mathbf{t}) = p(t_1, t_2, \ldots, t_N) = \prod_{n=1}^{N} \mu^{t_n} (1 - \mu)^{1 - t_n},$$
which defines the likelihood function when viewed as a function of $\mu$.

  18. Maximum Likelihood Estimation. In a sequence of 10 coin flips we observe $\mathbf{t} = (1, 0, 1, 1, 0, 1, 1, 1, 1, 0)$. The corresponding likelihood function is
$$p(\mathbf{t} \mid \mu) = \mu \cdot (1-\mu) \cdot \mu \cdot \mu \cdot (1-\mu) \cdot \mu \cdot \mu \cdot \mu \cdot \mu \cdot (1-\mu) = \mu^7 (1 - \mu)^3.$$
The corresponding log-likelihood function is
$$\ln p(\mathbf{t} \mid \mu) = \ln\!\left(\mu^7 (1 - \mu)^3\right) = 7 \ln \mu + 3 \ln(1 - \mu).$$

  19. Computing the Maximum. To determine the maximum we take the derivative and equate it to zero:
$$\frac{d \ln p(\mathbf{t} \mid \mu)}{d\mu} = \frac{7}{\mu} - \frac{3}{1 - \mu} = 0,$$
which yields the maximum likelihood estimate $\mu_{ML} = 0.7$. This is just the relative frequency of heads in the sample.
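
The maximum can also be located numerically; the grid search below is only an illustration, and it reproduces $\mu_{ML} = 0.7$, the relative frequency of heads.

```python
import numpy as np

t = np.array([1, 0, 1, 1, 0, 1, 1, 1, 1, 0])   # the observed coin flips

def loglik(mu):
    """ln p(t | mu) = sum_n [t_n ln mu + (1 - t_n) ln(1 - mu)]."""
    return np.sum(t * np.log(mu) + (1 - t) * np.log(1 - mu))

grid = np.linspace(0.01, 0.99, 99)             # step 0.01, avoiding log(0)
mu_ml = grid[np.argmax([loglik(m) for m in grid])]
print(mu_ml, t.mean())                         # both 0.7
```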

  20. Log-likelihood Function for $\mathbf{t} = (1, 0, 1, 1, 0, 1, 1, 1, 1, 0)$. [Figure: plot of the log-likelihood over $\mu \in [0, 1]$, peaking at $\mu = 0.7$.]

  21. ML Estimation for Logistic Regression. Now the probability of success $p(t_n = 1)$ depends on the value of $\mathbf{x}_n$:
$$p(t_n = 1 \mid \mathbf{x}_n) = (1 + e^{-\mathbf{w}^\top \mathbf{x}_n})^{-1} = y_n, \qquad p(t_n = 0 \mid \mathbf{x}_n) = (1 + e^{\mathbf{w}^\top \mathbf{x}_n})^{-1} = 1 - y_n.$$
We can represent the probability distribution of $t_n$ as
$$p(t_n) = y_n^{t_n} (1 - y_n)^{1 - t_n}, \qquad t_n \in \{0, 1\}; \quad n = 1, \ldots, N.$$

  22. ML Estimation for Logistic Regression: Example.

  n | x_n | t_n | p(t_n)
  --|-----|-----|-----------------------------
  1 |  8  |  0  | (1 + e^{w_0 + 8 w_1})^{-1}
  2 | 12  |  0  | (1 + e^{w_0 + 12 w_1})^{-1}
  3 | 15  |  1  | (1 + e^{-w_0 - 15 w_1})^{-1}
  4 | 10  |  1  | (1 + e^{-w_0 - 10 w_1})^{-1}
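
As a sketch, the likelihood of this small data set can be written as a function of the weights (the trial weight values passed in below are invented):

```python
import numpy as np

x = np.array([8.0, 12.0, 15.0, 10.0])   # the example data from the slide
t = np.array([0, 0, 1, 1])

def likelihood(w0, w1):
    """Product over n of y_n^{t_n} (1 - y_n)^{1 - t_n} for this data set."""
    y = 1 / (1 + np.exp(-(w0 + w1 * x)))    # y_n = p(t_n = 1 | x_n)
    return np.prod(y**t * (1 - y)**(1 - t))

print(likelihood(-10.0, 0.9))           # made-up trial weights; higher is better
```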

  23. LR: Likelihood Function. Since the $t_n$ observations are independent:
$$p(\mathbf{t} \mid \mathbf{w}) = \prod_{n=1}^{N} p(t_n) = \prod_{n=1}^{N} y_n^{t_n} (1 - y_n)^{1 - t_n} \quad (4.89)$$
Or, taking minus the natural log:
$$-\ln p(\mathbf{t} \mid \mathbf{w}) = -\ln \prod_{n=1}^{N} y_n^{t_n} (1 - y_n)^{1 - t_n} = -\sum_{n=1}^{N} \left\{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \right\} \quad (4.90)$$
This is called the cross-entropy error function.

  24. LR: Error Function. Since for the logistic regression model $y_n = (1 + e^{-\mathbf{w}^\top \mathbf{x}_n})^{-1}$ and $1 - y_n = (1 + e^{\mathbf{w}^\top \mathbf{x}_n})^{-1}$, we get
$$E(\mathbf{w}) = \sum_{n=1}^{N} \left\{ t_n \ln(1 + e^{-\mathbf{w}^\top \mathbf{x}_n}) + (1 - t_n) \ln(1 + e^{\mathbf{w}^\top \mathbf{x}_n}) \right\}.$$
This is a non-linear function of the parameters, and there is no closed-form solution. The error function is globally convex, however, so we can estimate $\mathbf{w}$ with, e.g., gradient descent.
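
A minimal gradient-descent sketch for minimizing $E(\mathbf{w})$, using the standard gradient $\nabla E(\mathbf{w}) = \sum_n (y_n - t_n)\,\mathbf{x}_n$; the data set, learning rate, and iteration count are made up for illustration.

```python
import numpy as np

def fit_logreg(X, t, lr=0.005, iters=20000):
    """Minimize the cross-entropy error E(w) by batch gradient descent.

    X is an (N, D) design matrix whose first column is all ones (bias);
    the gradient of E(w) is sum_n (y_n - t_n) x_n.
    """
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        y = 1 / (1 + np.exp(-(X @ w)))   # y_n = p(t_n = 1 | x_n)
        w -= lr * (X.T @ (y - t))        # one gradient step
    return w

# Tiny invented data set: one predictor plus a bias column.
X = np.column_stack([np.ones(4), [8.0, 12.0, 15.0, 10.0]])
t = np.array([0.0, 0.0, 1.0, 1.0])
print(fit_logreg(X, t))                  # fitted [w_0, w_1]
```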

  25. Fitted Response Function. Substitute the maximum likelihood estimates into the response function to obtain the fitted response function
$$\hat{p}(t = 1 \mid \mathbf{x}) = \frac{e^{\mathbf{w}_{ML}^\top \mathbf{x}}}{1 + e^{\mathbf{w}_{ML}^\top \mathbf{x}}}.$$

  26. Example: Programming Assignment. Model the probability of successfully completing a programming assignment. Explanatory variable: "programming experience". We find $w_0 = -3.0597$ and $w_1 = 0.1615$, so
$$\hat{p}(t = 1 \mid x_n) = \frac{e^{-3.0597 + 0.1615\, x_n}}{1 + e^{-3.0597 + 0.1615\, x_n}}.$$
For 14 months of programming experience:
$$\hat{p}(t = 1 \mid x = 14) = \frac{e^{-3.0597 + 0.1615(14)}}{1 + e^{-3.0597 + 0.1615(14)}} \approx 0.31.$$
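
A one-line check of this calculation (a sketch reusing the slide's estimates):

```python
import numpy as np

w0, w1 = -3.0597, 0.1615            # maximum likelihood estimates from the slide

def p_success(x):
    """Fitted response: estimated probability of completing the assignment."""
    return 1 / (1 + np.exp(-(w0 + w1 * x)))

print(p_success(14))                # ~0.31 for 14 months of experience
```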

  27. Interpretation of Weights. In the case of a single predictor variable, the odds of $t = 1$ are given by
$$\frac{p(t = 1 \mid x)}{p(t = 0 \mid x)} = e^{w_0 + w_1 x}.$$
If we increase $x$ by 1 unit, the odds become
$$e^{w_0 + w_1 (x + 1)} = e^{w_0 + w_1 x + w_1} = e^{w_0 + w_1 x}\, e^{w_1},$$
since $e^{a + b} = e^a \times e^b$. We have $e^{w_1} = e^{0.1615} \approx 1.175$. Hence, every extra month of programming experience increases the odds of success by $17.5\%$.
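
And the corresponding odds-ratio check (a sketch; the evaluation point $x = 14$ is arbitrary, since the ratio is the same at any $x$):

```python
import numpy as np

w0, w1 = -3.0597, 0.1615                 # estimates from the slide

def odds(x):
    """Odds of success: p(t=1|x) / p(t=0|x) = e^{w0 + w1 x}."""
    return np.exp(w0 + w1 * x)

print(odds(15) / odds(14))               # e^{w1} ~ 1.175: +17.5% per extra month
print(np.exp(w1))                        # same value, computed directly
```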
