Applied Machine Learning: Logistic Regression - PowerPoint PPT Presentation


  1. Applied Machine Learning: Logistic Regression. Siamak Ravanbakhsh, COMP 551 (Winter 2020).

  2. Learning objectives: what linear classifiers are; the logistic regression model; its loss function; the maximum-likelihood view; multi-class classification.

  3. Motivation: we have seen KNN for classification; today we see more classifiers, in particular linear classifiers. Logistic regression is the most commonly reported data science method used at work (source: 2017 Kaggle survey).

  4. Classification problem: a dataset of inputs $x^{(n)} \in \mathbb{R}^D$ and discrete targets $y^{(n)} \in \{0, \dots, C\}$; for binary classification, $y^{(n)} \in \{0, 1\}$. Linear classification: the decision boundaries are linear, i.e. a linear decision boundary $w^\top x + b = 0$. How do we find these boundaries? Different approaches give different linear classifiers.
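
As a quick aside (not from the slides), a minimal sketch of how a linear decision boundary is used at prediction time; the weights and bias below are made-up values for illustration:

      import numpy as np

      # hypothetical weights and bias for a 2-D input, chosen arbitrarily for illustration
      w = np.array([1.0, -2.0])
      b = 0.5

      def predict(x, w, b):
          """Predict class 1 if the point lies on the positive side of w^T x + b = 0, else 0."""
          return int(np.dot(w, x) + b > 0)

      print(predict(np.array([3.0, 1.0]), w, b))   # 1.0*3 - 2.0*1 + 0.5 =  1.5 > 0  -> 1
      print(predict(np.array([0.0, 1.0]), w, b))   # 1.0*0 - 2.0*1 + 0.5 = -1.5 < 0  -> 0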

  5. Using linear regression (first idea): fit a linear model to each class $c$,
     $w_c^* = \arg\min_{w_c} \frac{1}{2} \sum_{n=1}^{N} (w_c^\top x^{(n)} - I(y^{(n)} = c))^2$.
     The class label for a new instance is then $\hat{y}^{(n)} = \arg\max_c w_c^\top x^{(n)}$, and the decision boundary between any two classes $c, c'$ is where $w_c^\top x = w_{c'}^\top x$. Recall that the bias is absorbed into the weights by writing $x = [1, x]^\top$, so $w^\top x$ includes the intercept.
     [Figure: three linear scores $w_1^\top x$, $w_2^\top x$, $w_3^\top x$ plotted against the feature $x_1$; where are the decision boundaries?] The instances are linearly separable, so we should be able to find these boundaries. Where is the problem?
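
A minimal sketch of this first idea, assuming a design matrix X of shape N x D (with a leading column of ones for the bias) and integer labels in {0, ..., C-1}; np.linalg.lstsq does the per-class least-squares fit:

      import numpy as np

      def fit_linear_regression_classifier(X, y, C):
          """Fit one least-squares linear model per class against the indicator targets I(y == c)."""
          W = np.zeros((C, X.shape[1]))
          for c in range(C):
              t = (y == c).astype(float)                      # 0/1 indicator targets for class c
              W[c], *_ = np.linalg.lstsq(X, t, rcond=None)    # w_c = argmin ||X w - t||^2
          return W

      def predict(W, X):
          """Class label = argmax_c w_c^T x."""
          return np.argmax(X @ W.T, axis=1)

      # toy usage on random data, purely to show the shapes involved
      rng = np.random.default_rng(0)
      X = np.column_stack([np.ones(30), rng.normal(size=(30, 2))])
      y = rng.integers(0, 3, size=30)
      print(predict(fit_linear_regression_classifier(X, y, C=3), X)[:5])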

  6. Using linear regression (first idea), binary classification $y \in \{0, 1\}$: we are fitting two linear models, $a^\top x$ and $b^\top x$. The decision boundary is where $a^\top x - b^\top x = 0$, i.e. $(a - b)^\top x = 0$, so one weight vector $w = a - b$ is enough: predict $y = 1$ if $w^\top x > 0$, predict $y = 0$ if $w^\top x < 0$, with the boundary at $w^\top x = 0$.

  7. Using linear regression (first idea), binary classification $y \in \{0, 1\}$: consider an instance with target $y^{(n)} = 1$ that is correctly classified with $w^\top x^{(n)} = 100 > 0$; the L2 loss due to this instance is $\frac{1}{2}(100 - 1)^2 = \frac{99^2}{2}$. Another instance with $w^\top x^{(n')} = -2 < 0$ is incorrectly classified, yet its L2 loss is only $\frac{1}{2}(-2 - 1)^2 = \frac{9}{2}$. A correct prediction can have a higher loss than an incorrect one! Solution: we should try squashing all the positive scores together and all the negative ones together.
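
A quick numeric check of the example above, assuming the target is y = 1 for both scores; the confidently correct score incurs a far larger squared loss than the incorrect one:

      y = 1.0
      for score in (100.0, -2.0):               # w^T x for the correct and the incorrect instance
          print(score, 0.5 * (score - y) ** 2)  # 100.0 -> 4900.5 (= 99^2 / 2),  -2.0 -> 4.5 (= 9 / 2)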

  8. Logistic function. Idea: apply a squashing function to $w^\top x$, i.e. use $\sigma(w^\top x)$. Desirable property of $\sigma : \mathbb{R} \to \mathbb{R}$: all $w^\top x > 0$ are squashed close together, and all $w^\top x < 0$ are squashed together. The logistic function $\sigma(w^\top x) = \frac{1}{1 + e^{-w^\top x}}$ has these properties. The decision boundary is $w^\top x = 0 \Leftrightarrow \sigma(w^\top x) = \frac{1}{2}$; it is still a linear decision boundary. [Figure: the logistic function plotted against $w^\top x$.]
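
A minimal sketch of the squashing behaviour, using only numpy: large positive logits map close to 1, large negative logits close to 0, and the boundary $w^\top x = 0$ maps to 1/2:

      import numpy as np

      def sigmoid(z):
          """Logistic function: sigma(z) = 1 / (1 + exp(-z))."""
          return 1.0 / (1.0 + np.exp(-z))

      print(sigmoid(np.array([-100.0, -2.0, 0.0, 2.0, 100.0])))
      # ~[0.0, 0.119, 0.5, 0.881, 1.0]: positives squashed towards 1, negatives towards 0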

  9. Logistic regression: model. $f_w(x) = \sigma(w^\top x) = \frac{1}{1 + e^{-w^\top x}}$, where $z = w^\top x$ is the logit and $\sigma$ is the logistic function (also called a squashing function or activation function). Note that the decision boundary is still linear.

  10. Logistic regression: the loss. First idea: use the misclassification error $L_{0/1}(\hat{y}, y) = I(y \neq \mathrm{sign}(\hat{y} - \frac{1}{2}))$ with $\hat{y} = \sigma(w^\top x)$. This is not a continuous function of $w$, so it is hard to optimize.

  11. Logistic regression: the loss. Second idea: use the L2 loss $L_2(\hat{y}, y) = \frac{1}{2}(y - \hat{y})^2$ with $\hat{y} = \sigma(w^\top x)$. Thanks to the squashing, the previous problem is resolved and the loss is continuous. Still a problem: it is hard to optimize (non-convex in $w$).

  12. Logistic regression: the loss. Third idea: use the cross-entropy loss $L_{CE}(\hat{y}, y) = -y \log(\hat{y}) - (1 - y) \log(1 - \hat{y})$ with $\hat{y} = \sigma(w^\top x)$. It is convex in $w$ and has a probabilistic interpretation (soon!).

  13. Cost function: we need to optimize the cost with respect to the parameters. First, simplify:
     $J(w) = \sum_{n=1}^{N} -y^{(n)} \log(\sigma(w^\top x^{(n)})) - (1 - y^{(n)}) \log(1 - \sigma(w^\top x^{(n)}))$.
     Substituting the logistic function, $\log\left(\frac{1}{1 + e^{-w^\top x}}\right) = -\log(1 + e^{-w^\top x})$ and $\log\left(1 - \frac{1}{1 + e^{-w^\top x}}\right) = \log\left(\frac{1}{1 + e^{w^\top x}}\right) = -\log(1 + e^{w^\top x})$, which gives the simplified cost
     $J(w) = \sum_{n=1}^{N} y^{(n)} \log(1 + e^{-w^\top x^{(n)}}) + (1 - y^{(n)}) \log(1 + e^{w^\top x^{(n)}})$.

  14. Cost function: implementing the simplified cost
     $J(w) = \sum_{n=1}^{N} y^{(n)} \log(1 + e^{-w^\top x^{(n)}}) + (1 - y^{(n)}) \log(1 + e^{w^\top x^{(n)}})$:

      import numpy as np

      def cost(w,    # D
               X,    # N x D
               y):   # N
          z = np.dot(X, w)                            # N logits, w^T x^(n)
          J = np.mean(y * np.log1p(np.exp(-z))        # mean rather than sum: same minimizer
                      + (1 - y) * np.log1p(np.exp(z)))
          return J

     Why not np.log(1 + np.exp(-z))? For small $\epsilon$, computing $\log(1 + \epsilon)$ directly suffers from floating-point inaccuracies, while np.log1p computes $\log(1 + \epsilon) = \epsilon - \frac{\epsilon^2}{2} + \frac{\epsilon^3}{3} - \dots$ accurately:

      In [3]: np.log(1 + 1e-100)
      Out[3]: 0.0

      In [4]: np.log1p(1e-100)
      Out[4]: 1e-100
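
A small usage check for the cost above, reusing the function (and numpy import) from the snippet, with made-up data; at $w = 0$ every prediction is $\sigma(0) = 0.5$, so the mean cost is $\log 2$:

      rng = np.random.default_rng(0)
      X = rng.normal(size=(100, 3))
      y = rng.integers(0, 2, size=100).astype(float)
      print(cost(np.zeros(3), X, y))    # log(2) ~ 0.6931, independent of the labels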

  15. Example: binary classification on the Iris flowers dataset (a classic dataset originally used by Fisher): $N_c = 50$ samples with $D = 4$ features for each of $C = 3$ species of Iris flower. Our setting: 2 classes (blue vs. others) and 1 feature (petal width) plus a bias.
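
A sketch of this setup, assuming scikit-learn's bundled copy of the Iris dataset (sklearn.datasets.load_iris); petal width is the fourth feature, and "blue vs. others" is taken here to mean one species vs. the other two, which is an assumption about the slide's colour coding:

      import numpy as np
      from sklearn.datasets import load_iris

      iris = load_iris()
      petal_width = iris.data[:, 3]                      # D = 4 features; petal width is the last one
      X = np.column_stack([np.ones_like(petal_width),    # bias column
                           petal_width])                 # N x 2 design matrix
      y = (iris.target == 0).astype(float)               # binary target: one species vs. the others
      print(X.shape, y.mean())                           # (150, 2) 0.333...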

  16. Example: binary classification. We have two weights, associated with the bias and petal width. [Figure: $J(w)$ as a function of these two weights, showing the starting point $w = [0, 0]$ and the optimum $w^*$; alongside, the fitted curve $\sigma(w_0^* + w_1^* x)$ over petal width $x$.]

  17. Gradient: how did we find the optimal weights? In contrast to linear regression, there is no closed-form solution. The cost is
     $J(w) = \sum_{n=1}^{N} y^{(n)} \log(1 + e^{-w^\top x^{(n)}}) + (1 - y^{(n)}) \log(1 + e^{w^\top x^{(n)}})$.
     Taking the partial derivative,
     $\frac{\partial}{\partial w_d} J(w) = \sum_n -y^{(n)} x_d^{(n)} \frac{e^{-w^\top x^{(n)}}}{1 + e^{-w^\top x^{(n)}}} + (1 - y^{(n)}) x_d^{(n)} \frac{e^{w^\top x^{(n)}}}{1 + e^{w^\top x^{(n)}}} = \sum_n -y^{(n)} x_d^{(n)} (1 - \hat{y}^{(n)}) + (1 - y^{(n)}) x_d^{(n)} \hat{y}^{(n)} = \sum_n (\hat{y}^{(n)} - y^{(n)}) x_d^{(n)}$,
     so the gradient is $\nabla J(w) = \sum_n (\hat{y}^{(n)} - y^{(n)}) x^{(n)}$ with $\hat{y}^{(n)} = \sigma(w^\top x^{(n)})$.
     Compare to the gradient for linear regression, $\nabla J(w) = \sum_n (\hat{y}^{(n)} - y^{(n)}) x^{(n)}$ with $\hat{y}^{(n)} = w^\top x^{(n)}$: the same form.
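
A minimal gradient-descent sketch built on the gradient above; the learning rate and iteration count are arbitrary choices, and X, y can be the Iris design matrix and labels from slide 15 or any N x D matrix with 0/1 labels:

      import numpy as np

      def sigmoid(z):
          return 1.0 / (1.0 + np.exp(-z))

      def gradient(w, X, y):
          """grad J(w) = sum_n (yhat_n - y_n) x_n, with yhat_n = sigma(w^T x_n)."""
          return X.T @ (sigmoid(X @ w) - y)

      def fit_logistic_regression(X, y, lr=1e-3, iters=10_000):
          """Plain (full-batch) gradient descent; there is no closed-form solution."""
          w = np.zeros(X.shape[1])
          for _ in range(iters):
              w -= lr * gradient(w, X, y)
          return w

      # e.g. with the Iris X, y from slide 15:  w = fit_logistic_regression(X, y)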

  18. Probabilistic view of logistic regression: interpret the output as $\hat{y} = p_w(y = 1 \mid x) = \frac{1}{1 + e^{-w^\top x}} = \sigma(w^\top x)$. The logit function is the inverse of the logistic: $\log \frac{\hat{y}}{1 - \hat{y}} = w^\top x$, i.e. the log-ratio of the class probabilities is linear. Likelihood: the probability of the data as a function of the model parameters,
     $L(w) = p_w(y^{(n)} \mid x^{(n)}) = \mathrm{Bernoulli}(y^{(n)}; \sigma(w^\top x^{(n)})) = (\hat{y}^{(n)})^{y^{(n)}} (1 - \hat{y}^{(n)})^{1 - y^{(n)}}$.
     It is a function of $w$ ($\hat{y}^{(n)}$ is the probability that $y^{(n)} = 1$), and it is not a probability distribution over $w$. Likelihood of the dataset:
     $L(w) = \prod_{n=1}^{N} p_w(y^{(n)} \mid x^{(n)}) = \prod_{n=1}^{N} (\hat{y}^{(n)})^{y^{(n)}} (1 - \hat{y}^{(n)})^{1 - y^{(n)}}$.
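
A sketch of the dataset likelihood and log-likelihood under this Bernoulli model (any weight vector w, N x D matrix X, and 0/1 labels y); the raw product is only practical for small N, which is part of why the next slide switches to logs:

      import numpy as np

      def sigmoid(z):
          return 1.0 / (1.0 + np.exp(-z))

      def likelihood(w, X, y):
          """L(w) = prod_n yhat_n^y_n * (1 - yhat_n)^(1 - y_n)."""
          yh = sigmoid(X @ w)
          return np.prod(yh ** y * (1.0 - yh) ** (1.0 - y))

      def log_likelihood(w, X, y):
          """Sum of log Bernoulli probabilities (use the log1p form of slide 14 if yh is near 0 or 1)."""
          yh = sigmoid(X @ w)
          return np.sum(y * np.log(yh) + (1.0 - y) * np.log(1.0 - yh))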

  19. Maximum likelihood & logistic regression. Likelihood: $L(w) = \prod_{n=1}^{N} p_w(y^{(n)} \mid x^{(n)}) = \prod_{n=1}^{N} (\hat{y}^{(n)})^{y^{(n)}} (1 - \hat{y}^{(n)})^{1 - y^{(n)}}$. Maximum likelihood: use the model that maximizes the likelihood of the observations, $w^* = \arg\max_w L(w)$. For large $N$ the likelihood is a product of many probabilities and quickly underflows towards zero, so we work with the log-likelihood instead (same maximum):
     $\max_w \sum_{n=1}^{N} \log p_w(y^{(n)} \mid x^{(n)}) = \max_w \sum_{n=1}^{N} y^{(n)} \log(\hat{y}^{(n)}) + (1 - y^{(n)}) \log(1 - \hat{y}^{(n)}) = \min_w J(w)$,
     which is the cross-entropy cost function! So using the cross-entropy loss in logistic regression is maximizing the conditional likelihood.
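
A quick numeric check of this equivalence, reusing the cost from slide 14 and the log_likelihood sketch above on made-up data: the negative log-likelihood, averaged over N, coincides with the cross-entropy cost:

      rng = np.random.default_rng(1)
      X = rng.normal(size=(50, 3))
      y = rng.integers(0, 2, size=50).astype(float)
      w = rng.normal(size=3)
      print(cost(w, X, y))                        # cross-entropy cost (mean over N)
      print(-log_likelihood(w, X, y) / len(y))    # same value, up to floating-point error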

  20. Maximum likelihood & linear regression: the squared-error loss also has a maximum-likelihood interpretation. Conditional probability: $p_w(y \mid x) = \mathcal{N}(y \mid w^\top x, \sigma^2) = \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(y - w^\top x)^2}{2 \sigma^2}}$, with mean $\mu = w^\top x$, variance $\sigma^2$, and standard deviation $\sigma$ (not to be confused with the logistic function). Image: http://blog.nguyenvq.com/blog/2009/05/12/linear-regression-plot-with-normal-curves-for-error-sideways/

  21. Maximum likelihood & linear regression: the squared-error loss also has a maximum-likelihood interpretation. Conditional probability: $p_w(y \mid x) = \mathcal{N}(y \mid w^\top x, \sigma^2) = \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(y - w^\top x)^2}{2 \sigma^2}}$. Likelihood: $L(w) = \prod_{n=1}^{N} p_w(y^{(n)} \mid x^{(n)})$. Log-likelihood: $\ell(w) = -\frac{1}{2 \sigma^2} \sum_n (y^{(n)} - w^\top x^{(n)})^2 + \text{constants}$. Optimal parameters: $w^* = \arg\max_w \ell(w) = \arg\min_w \frac{1}{2} \sum_n (y^{(n)} - w^\top x^{(n)})^2$: linear least squares! Image: http://blog.nguyenvq.com/blog/2009/05/12/linear-regression-plot-with-normal-curves-for-error-sideways/
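
A sketch verifying this equivalence numerically on synthetic data (the true weights, noise level, and perturbation below are made up): the ordinary least-squares solution also minimizes the Gaussian negative log-likelihood, since the two objectives differ only by a positive scaling and an additive constant:

      import numpy as np

      rng = np.random.default_rng(2)
      X = np.column_stack([np.ones(40), rng.normal(size=40)])
      y = X @ np.array([1.0, 2.0]) + 0.3 * rng.normal(size=40)    # synthetic linear data with noise

      sigma = 0.3

      def gaussian_nll(w):
          """Negative log-likelihood of y under N(w^T x, sigma^2)."""
          resid = y - X @ w
          return np.sum(resid ** 2) / (2.0 * sigma ** 2) + len(y) * np.log(sigma * np.sqrt(2.0 * np.pi))

      w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)                # least-squares fit
      print(gaussian_nll(w_ls) <= gaussian_nll(w_ls + 0.1))       # True: the least-squares fit minimizes the NLL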
