

SLIDE 1

Applied Machine Learning

Logistic Regression

Siamak Ravanbakhsh

COMP 551 (winter 2020)

SLIDE 2

Learning objectives

• what are linear classifiers
• logistic regression: model, loss function, maximum likelihood view
• multi-class classification

SLIDE 3

Motivation

Logistic regression is the most commonly reported data science method used at work (source: 2017 Kaggle survey).

We have seen KNN for classification; today we see more classifiers (linear classifiers).

SLIDE 4

Classification problem

dataset of inputs $x^{(n)} \in \mathbb{R}^D$ and discrete targets $y^{(n)} \in \{0, \ldots, C\}$

binary classification: $y^{(n)} \in \{0, 1\}$

linear classification: decision boundaries are linear, i.e. the decision boundary is $w^\top x + b = 0$

how do we find these boundaries? different approaches give different linear classifiers

SLIDE 5

Using linear regression

first idea: fit a linear model to each class c:

$w_c^* = \arg\min_{w_c} \frac{1}{2} \sum_{n=1}^N \big(w_c^\top x^{(n)} - \mathbb{I}(y^{(n)} = c)\big)^2$

(recall $x = [1, x_1]^\top$, so the bias is absorbed into $w_c$)

the decision boundary between any two classes is $w_c^\top x = w_{c'}^\top x$

the class label for a new instance is then $\hat{y}^{(n)} = \arg\max_c \; w_c^\top x^{(n)}$

(figure: the three per-class scores $w_1^\top x$, $w_2^\top x$, $w_3^\top x$ plotted against the feature $x_1$)

where are the decision boundaries? the instances are linearly separable, so we should be able to find these boundaries; where is the problem?

SLIDE 6

Using linear regression

Binary classification: $y \in \{0, 1\}$

first idea: we are fitting 2 linear models, $a^\top x$ and $b^\top x$, one per class

since $a^\top x - b^\top x = (a - b)^\top x = w^\top x$, one weight vector is enough:

$y = 1$ if $w^\top x > 0$, and $y = 0$ if $w^\top x < 0$

the decision boundary is where $a^\top x - b^\top x = w^\top x = 0$

SLIDE 7

Using linear regression

Binary classification: $y \in \{0, 1\}$

first idea: we are fitting 2 linear models $a^\top x$, $b^\top x$ (equivalently, one model $w^\top x$); consider two instances, both with target $y = 1$:

correctly classified instance: $w^\top x^{(n)} = 100 > 0$; L2 loss due to this instance: $(100 - 1)^2 = 99^2$

incorrectly classified instance: $w^\top x^{(n')} = -2 < 0$; L2 loss due to this instance: $(-2 - 1)^2 = 9$

a correct prediction can have higher loss than the incorrect one!

solution: we should try squashing all positive instances together and all the negative ones together

SLIDE 8

Logistic function

Idea: apply a squashing function to the linear score: $w^\top x \to \sigma(w^\top x)$

the logistic function has these properties:

$\sigma(w^\top x) = \frac{1}{1 + e^{-w^\top x}}$

desirable property of $\sigma: \mathbb{R} \to (0, 1)$: all instances with $w^\top x > 0$ are squashed close together (near 1), and all instances with $w^\top x < 0$ are squashed together (near 0)

the decision boundary is still linear: $w^\top x = 0 \Leftrightarrow \sigma(w^\top x) = \frac{1}{2}$

SLIDE 9

Logistic regression: model

$f_w(x) = \sigma(w^\top x) = \frac{1}{1 + e^{-w^\top x}}$

$\sigma$ is called the logistic function, squashing function, or activation function; its input $z = w^\top x$ is called the logit

note the linear decision boundary
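As a concrete reading of this model, here is a minimal NumPy sketch (the names logistic and predict_proba are illustrative choices, not from the slides):

import numpy as np

def logistic(z):
    # logistic (sigmoid) squashing function: R -> (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(w, X):
    # X: N x D design matrix (a column of ones can carry the bias)
    # w: D weights; returns the N predicted probabilities of class 1
    return logistic(np.dot(X, w))

Thresholding the returned probabilities at 1/2 is the same as checking the sign of $w^\top x$, i.e. the linear decision boundary above.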

SLIDE 10

Logistic regression: the loss

first idea: use the misclassification error

$L_{0/1}(\hat{y}, y) = \mathbb{I}\big(y \neq \mathrm{sign}(\hat{y} - \tfrac{1}{2})\big)$, where $\hat{y} = \sigma(w^\top x)$

not a continuous function (in $w$), hard to optimize
SLIDE 11

Logistic regression: the loss

second idea: use the L2 loss

$L_2(\hat{y}, y) = \frac{1}{2}(y - \hat{y})^2$, where $\hat{y} = \sigma(w^\top x)$

thanks to squashing, the previous problem is resolved and the loss is continuous

still a problem: hard to optimize (non-convex in $w$)

SLIDE 12

Logistic regression: the loss

third idea: use the cross-entropy loss

$L_{CE}(\hat{y}, y) = -y \log(\hat{y}) - (1 - y)\log(1 - \hat{y})$, where $\hat{y} = \sigma(w^\top x)$

it is convex in $w$ and has a probabilistic interpretation (soon!)
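To see why this loss behaves better than L2 on the example from slide 7, here is a small illustrative check for the two instances with scores 100 and -2 and true label y = 1 (it uses the simplified per-instance form derived on the next slides; the snippet is a sketch, not from the slides):

import numpy as np

# cross-entropy loss of a y = 1 instance as a function of the score z = w^T x
ce_loss_y1 = lambda z: np.log1p(np.exp(-z))

print(ce_loss_y1(100.0))   # ~ 0.0   confidently correct instance, negligible loss
print(ce_loss_y1(-2.0))    # ~ 2.13  misclassified instance, larger loss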

SLIDE 13

Cost function

we need to optimize the cost wrt. the parameters; first: simplify

$J(w) = \sum_{n=1}^N -y^{(n)} \log\big(\sigma(w^\top x^{(n)})\big) - (1 - y^{(n)}) \log\big(1 - \sigma(w^\top x^{(n)})\big)$

substitute the logistic function:

$\log(\hat{y}) = \log\Big(\frac{1}{1 + e^{-w^\top x}}\Big) = -\log\big(1 + e^{-w^\top x}\big)$

$\log(1 - \hat{y}) = \log\Big(\frac{1}{1 + e^{w^\top x}}\Big) = -\log\big(1 + e^{w^\top x}\big)$

simplified cost:

$J(w) = \sum_{n=1}^N y^{(n)} \log\big(1 + e^{-w^\top x^{(n)}}\big) + (1 - y^{(n)}) \log\big(1 + e^{w^\top x^{(n)}}\big)$

SLIDE 14

Cost function

implementing the simplified cost:

$J(w) = \sum_{n=1}^N y^{(n)} \log\big(1 + e^{-w^\top x^{(n)}}\big) + (1 - y^{(n)}) \log\big(1 + e^{w^\top x^{(n)}}\big)$

def cost(w,  # D
         X,  # N x D
         y   # N
         ):
    z = np.dot(X, w)  # N x 1
    J = np.mean(y * np.log1p(np.exp(-z)) + (1 - y) * np.log1p(np.exp(z)))
    return J

why not np.log(1 + np.exp(-z))?

for small $\epsilon$, $\log(1 + \epsilon)$ suffers from floating point inaccuracies:

In [3]: np.log(1 + 1e-100)
Out[3]: 0.0
In [4]: np.log1p(1e-100)
Out[4]: 1e-100

$\log(1 + \epsilon) = \epsilon - \frac{\epsilon^2}{2} + \frac{\epsilon^3}{3} - \ldots$

SLIDE 15

Example: binary classification

classification on the Iris flowers dataset: $N_c = 50$ samples with D = 4 features, for each of C = 3 species of Iris flower (a classic dataset originally used by Fisher)

our setting:

• 2 classes (blue vs. others)
• 1 feature (petal width + bias)
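A sketch of this setup using scikit-learn's copy of the dataset (treating the "blue" class as class 0, i.e. setosa, and taking column 3 as petal width are assumptions made for illustration):

import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
petal_width = iris.data[:, 3]          # assumed: column 3 is petal width (cm)
y = (iris.target == 0).astype(float)   # assumed: the "blue" class is class 0

# design matrix x = [1, petal width], so the bias is part of w
X = np.column_stack([np.ones_like(petal_width), petal_width])
print(X.shape)                         # (150, 2)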

SLIDE 16

Example: binary classification

we have two weights, one for the bias and one for petal width; the figure shows the cost $J(w)$ as a function of these two weights, together with the fitted curve $\sigma(w^\top x)$ at the initial point $w = [0, 0]$ and at the optimum $w^*$

SLIDE 17

Gradient

cost:

$J(w) = \sum_{n=1}^N y^{(n)} \log\big(1 + e^{-w^\top x^{(n)}}\big) + (1 - y^{(n)}) \log\big(1 + e^{w^\top x^{(n)}}\big)$

taking the partial derivative:

$\frac{\partial}{\partial w_d} J(w) = \sum_n -y^{(n)} x_d^{(n)} \frac{e^{-w^\top x^{(n)}}}{1 + e^{-w^\top x^{(n)}}} + (1 - y^{(n)}) x_d^{(n)} \frac{e^{w^\top x^{(n)}}}{1 + e^{w^\top x^{(n)}}}$

$= \sum_n -x_d^{(n)} y^{(n)} (1 - \hat{y}^{(n)}) + x_d^{(n)} \hat{y}^{(n)} (1 - y^{(n)}) = \sum_n x_d^{(n)} \big(\hat{y}^{(n)} - y^{(n)}\big)$

gradient: $\nabla J(w) = \sum_n x^{(n)} \big(\hat{y}^{(n)} - y^{(n)}\big)$, with $\hat{y}^{(n)} = \sigma(w^\top x^{(n)})$

compare to the gradient for linear regression: $\nabla J(w) = \sum_n x^{(n)} \big(\hat{y}^{(n)} - y^{(n)}\big)$, with $\hat{y}^{(n)} = w^\top x^{(n)}$

(in contrast to linear regression, there is no closed form solution)

how did we find the optimal weights?
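A minimal sketch of this gradient together with a plain gradient-descent loop (the learning rate, iteration count, and function names are illustrative choices, not prescribed by the slides):

import numpy as np

def gradient(w, X, y):
    # grad J(w) = sum_n x^(n) (yhat^(n) - y^(n)),  with yhat = sigma(w^T x)
    yh = 1.0 / (1.0 + np.exp(-np.dot(X, w)))    # N predicted probabilities
    return np.dot(X.T, yh - y)                  # D-dimensional gradient

def gradient_descent(X, y, lr=0.1, num_iters=1000):
    w = np.zeros(X.shape[1])
    for _ in range(num_iters):
        w = w - lr * gradient(w, X, y)          # step against the gradient
    return w

Run on the X, y built for the Iris setting above, this is one way to obtain the optimal weights $w^*$ shown on slide 16.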

SLIDE 18

Probabilistic view of logistic regression

probabilistic interpretation of logistic regression: $\hat{y} = p_w(y = 1 \mid x) = \sigma(w^\top x) = \frac{1}{1 + e^{-w^\top x}}$

$\hat{y}^{(n)}$ is the probability of $y^{(n)} = 1$

$\log\frac{\hat{y}}{1 - \hat{y}} = w^\top x$: the logit function is the inverse of the logistic, and the log-ratio of class probabilities is linear

probability of the data as a function of the model parameters (the likelihood; a function of $w$, not a probability distribution function):

$p_w(y^{(n)} \mid x^{(n)}) = \mathrm{Bernoulli}\big(y^{(n)};\, \sigma(w^\top x^{(n)})\big) = \hat{y}^{(n)\, y^{(n)}} (1 - \hat{y}^{(n)})^{1 - y^{(n)}}$

likelihood of the dataset: $L(w) = \prod_{n=1}^N p_w(y^{(n)} \mid x^{(n)}) = \prod_{n=1}^N \hat{y}^{(n)\, y^{(n)}} (1 - \hat{y}^{(n)})^{1 - y^{(n)}}$

SLIDE 19

Maximum likelihood & logistic regression

use the model that maximizes the likelihood of the observations

maximum likelihood: $w^* = \arg\max_w L(w)$

likelihood: $L(w) = \prod_{n=1}^N p_w(y^{(n)} \mid x^{(n)}) = \prod_{n=1}^N \hat{y}^{(n)\, y^{(n)}} (1 - \hat{y}^{(n)})^{1 - y^{(n)}}$

the likelihood is a product of N terms and becomes numerically unwieldy for large N, so work with the log-likelihood instead (same maximum):

$\max_w \sum_{n=1}^N \log p_w(y^{(n)} \mid x^{(n)}) = \max_w \sum_{n=1}^N y^{(n)} \log(\hat{y}^{(n)}) + (1 - y^{(n)}) \log(1 - \hat{y}^{(n)}) = \min_w J(w)$

the cross-entropy cost function! so using the cross-entropy loss in logistic regression is maximizing the conditional likelihood

SLIDE 20

Maximum likelihood & linear regression

the squared error loss also has a max-likelihood interpretation

• cond. probability: $p_w(y \mid x) = \mathcal{N}(y \mid w^\top x, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(y - w^\top x)^2}{2\sigma^2}}$

mean $\mu = w^\top x$, variance $\sigma^2$, standard deviation $\sigma$ (don't confuse with the logistic function)

image: http://blog.nguyenvq.com/blog/2009/05/12/linear-regression-plot-with-normal-curves-for-error-sideways/

SLIDE 21

Maximum likelihood & linear regression

the squared error loss also has a max-likelihood interpretation

• cond. probability: $p_w(y \mid x) = \mathcal{N}(y \mid w^\top x, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(y - w^\top x)^2}{2\sigma^2}}$

likelihood: $L(w) = \prod_{n=1}^N p_w(y^{(n)} \mid x^{(n)})$

log-likelihood: $\ell(w) = -\frac{1}{2\sigma^2} \sum_n \big(y^{(n)} - w^\top x^{(n)}\big)^2 + \text{constants}$

• optimal params.: $w^* = \arg\max_w \ell(w) = \arg\min_w \frac{1}{2} \sum_n \big(y^{(n)} - w^\top x^{(n)}\big)^2$: linear least squares!

SLIDE 22

Multiclass classification

binary classification: Bernoulli likelihood $\mathrm{Bernoulli}(y \mid \hat{y}) = \hat{y}^y (1 - \hat{y})^{1 - y}$

subject to $\hat{y} \in [0, 1]$: we use the logistic function $\hat{y} = \sigma(z) = \sigma(w^\top x)$ to ensure this

C classes: categorical likelihood $\mathrm{Categorical}(y \mid \hat{y}) = \prod_{c=1}^C \hat{y}_c^{\mathbb{I}(y = c)}$

subject to $\hat{y}_c \in [0, 1]$ and $\sum_c \hat{y}_c = 1$; how to enforce this? achieved using the softmax function

SLIDE 23

Softmax

generalization of the logistic to > 2 classes; the logistic produces a single probability $\sigma : \mathbb{R} \to (0, 1)$, and the probability of the second class is $1 - \sigma(z)$, so $\sum_c \hat{y}_c = 1$

softmax: $\hat{y}_c = \mathrm{softmax}(z)_c = \frac{e^{z_c}}{\sum_{c'=1}^C e^{z_{c'}}}$

softmax maps $\mathbb{R}^C \to \Delta^C$, the probability simplex: $p \in \Delta^C \Rightarrow \sum_{c=1}^C p_c = 1$

if the input values are large, softmax becomes similar to argmax; example: $\mathrm{softmax}([10, 100, -1]) \approx [0, 1, 0]$

so, similar to the logistic, this is also a squashing function

implementation (subtracting the max for numerical stability):

def softmax(z  # C x ... array
            ):
    z = z - np.max(z, 0)    # numerical stability: shift so the max is 0
    yh = np.exp(z)
    yh /= np.sum(yh, 0)
    return yh
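A quick check of the argmax-like behavior on the example above (illustrative usage, assuming NumPy is imported as np and the softmax function defined on this slide):

z = np.array([10.0, 100.0, -1.0])
print(np.round(softmax(z), 3))   # -> [0. 1. 0.], essentially argmax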

SLIDE 24

Multiclass classification

C classes: categorical likelihood $\mathrm{Categorical}(y \mid \hat{y}) = \prod_{c=1}^C \hat{y}_c^{\mathbb{I}(y = c)}$

using softmax to enforce the sum-to-one constraint:

$\hat{y}_c = \mathrm{softmax}\big([w_{[1]}^\top x, \ldots, w_{[C]}^\top x]\big)_c = \frac{e^{w_{[c]}^\top x}}{\sum_{c'} e^{w_{[c']}^\top x}}$

so we have one parameter vector for each class

to simplify the equations we write $z_c = w_{[c]}^\top x$, so that

$\hat{y}_c = \mathrm{softmax}([z_1, \ldots, z_C])_c = \frac{e^{z_c}}{\sum_{c'} e^{z_{c'}}}$

SLIDE 25

Likelihood

C classes: categorical likelihood $\mathrm{Categorical}(y \mid \hat{y}) = \prod_{c=1}^C \hat{y}_c^{\mathbb{I}(y = c)}$, using softmax to enforce the sum-to-one constraint:

$\hat{y}_c = \mathrm{softmax}([z_1, \ldots, z_C])_c = \frac{e^{z_c}}{\sum_{c'} e^{z_{c'}}}$, where $z_c = w_{[c]}^\top x$

substituting softmax in the categorical likelihood:

$L(\{w_{[c]}\}) = \prod_{n=1}^N \prod_{c=1}^C \mathrm{softmax}\big([z_1^{(n)}, \ldots, z_C^{(n)}]\big)_c^{\mathbb{I}(y^{(n)} = c)} = \prod_{n=1}^N \prod_{c=1}^C \left(\frac{e^{z_c^{(n)}}}{\sum_{c'} e^{z_{c'}^{(n)}}}\right)^{\mathbb{I}(y^{(n)} = c)}$

SLIDE 26

One-hot encoding

likelihood: $L(\{w_{[c]}\}) = \prod_{n=1}^N \prod_{c=1}^C \left(\frac{e^{z_c^{(n)}}}{\sum_{c'} e^{z_{c'}^{(n)}}}\right)^{\mathbb{I}(y^{(n)} = c)}$

log-likelihood: $\ell(\{w_{[c]}\}) = \sum_{n=1}^N \sum_{c=1}^C \mathbb{I}(y^{(n)} = c)\, z_c^{(n)} - \sum_{n=1}^N \log \sum_{c'} e^{z_{c'}^{(n)}}$

• one-hot encoding for labels: $y^{(n)} \to [\mathbb{I}(y^{(n)} = 1), \ldots, \mathbb{I}(y^{(n)} = C)]$

def one_hot(y,  # vector of size N with class labels in {1,...,C}
            ):
    N, C = y.shape[0], np.max(y)
    y_hot = np.zeros((N, C))          # np.zeros takes the shape as a tuple
    y_hot[np.arange(N), y - 1] = 1    # label c goes to column c-1
    return y_hot
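For example (an illustrative call, assuming NumPy is imported as np):

print(one_hot(np.array([1, 3, 2])))
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]]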

log-likelihood with this encoding: $\ell(\{w_{[c]}\}) = \sum_{n=1}^N \left( y^{(n)\top} z^{(n)} - \log \sum_{c'} e^{z_{c'}^{(n)}} \right)$

using this encoding from now on

SLIDE 27

One-hot encoding

side note: we can also use this encoding for categorical input features:

$x_d^{(n)} \to [\mathbb{I}(x_d^{(n)} = 1), \ldots, \mathbb{I}(x_d^{(n)} = C)]$

problem: these features are not linearly independent. why? this might become an issue for linear regression. why?

solution: remove one of the one-hot encoded features:

$x_d^{(n)} \to [\mathbb{I}(x_d^{(n)} = 1), \ldots, \mathbb{I}(x_d^{(n)} = C - 1)]$

SLIDE 28

Implementing the cost function

the softmax cross-entropy cost function is the negative of the log-likelihood (similar to the binary case):

$J(\{w_{[c]}\}) = -\sum_{n=1}^N \left( y^{(n)\top} z^{(n)} - \log \sum_{c'} e^{z_{c'}^{(n)}} \right)$, where $z_c = w_{[c]}^\top x$

def cost(X,  # N x D design matrix
         y,  # N labels in {1,...,C}
         W   # C x D: one weight vector per class
         ):
    Z = np.dot(X, W.T)   # N x C
    Y = one_hot(y)       # N x C
    # logsumexp below expects a C x N array, so pass the transpose
    nll = - np.sum(np.sum(Z * Y, 1) - logsumexp(Z.T))
    return nll

a naive implementation of log-sum-exp causes over/underflow; prevent this using the following trick:

$\log \sum_c e^{z_c} = \bar{z} + \log \sum_c e^{z_c - \bar{z}}$, where $\bar{z} \leftarrow \max_c z_c$

def logsumexp(Z  # C x N
              ):
    Zmax = np.max(Z, axis=0)[None, :]                       # 1 x N, per-column max
    lse = Zmax + np.log(np.sum(np.exp(Z - Zmax), axis=0))   # 1 x N
    return lse

SLIDE 29

Optimization

given the training data $\mathcal{D} = \{(x^{(n)}, y^{(n)})\}_n$, find the best model parameters $\{w_{[c]}\}_c$ by minimizing the cost (maximizing the likelihood of $\mathcal{D}$):

$J(\{w_{[c]}\}) = \sum_{n=1}^N -y^{(n)\top} z^{(n)} + \log \sum_{c'} e^{z_{c'}^{(n)}}$, where $z_c = w_{[c]}^\top x$

we need to use gradient descent (for now, calculate the gradient):

$\nabla J(w) = \left[\frac{\partial}{\partial w_{[1],1}} J, \ldots, \frac{\partial}{\partial w_{[1],D}} J, \ldots, \frac{\partial}{\partial w_{[C],D}} J\right]^\top$, a vector of length $C \times D$

SLIDE 30

Gradient

$J(\{w_{[c]}\}) = \sum_{n=1}^N -y^{(n)\top} z^{(n)} + \log \sum_{c'} e^{z_{c'}^{(n)}}$, where $z_c = w_{[c]}^\top x$

we need to use gradient descent (for now, calculate the gradient)

using the chain rule:

$\frac{\partial J}{\partial w_{[c],d}} = \sum_{n=1}^N \frac{\partial J}{\partial z_c^{(n)}} \frac{\partial z_c^{(n)}}{\partial w_{[c],d}} = \sum_{n=1}^N \left( -y_c^{(n)} + \frac{e^{z_c^{(n)}}}{\sum_{c'} e^{z_{c'}^{(n)}}} \right) x_d^{(n)} = \sum_n \big( \hat{y}_c^{(n)} - y_c^{(n)} \big) x_d^{(n)}$

so the derivative of log-sum-exp is the softmax; this looks familiar!
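A minimal NumPy sketch of this multiclass gradient, written to match the cost function on slide 28 (the function name and the reuse of the softmax and one_hot helpers from earlier slides are assumed for illustration):

import numpy as np

def gradient_multiclass(X,  # N x D design matrix
                        y,  # N labels in {1,...,C}
                        W   # C x D weights, one row per class
                        ):
    Z = np.dot(X, W.T)             # N x C logits, z_c = w_[c]^T x
    Yh = softmax(Z.T).T            # N x C predicted probabilities (softmax over classes)
    Y = one_hot(y)                 # N x C one-hot labels
    return np.dot((Yh - Y).T, X)   # C x D gradient: sum_n (yhat_c - y_c) x_d

Reshaping this C x D matrix into a vector recovers the length C x D gradient on slide 29, and a gradient-descent loop like the one sketched for the binary case applies unchanged.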

SLIDE 31

Summary

• logistic regression: logistic activation function + cross-entropy loss
• probabilistic interpretation: using maximum likelihood to derive the cost function
• multi-class classification: softmax + cross-entropy cost function
• one-hot encoding
• gradient calculation (will use later!)
• Gaussian likelihood ↔ L2 loss; Bernoulli likelihood ↔ cross-entropy loss