Applied Machine Learning
Logistic and Softmax Regression
Siamak Ravanbakhsh
COMP 551 (Fall 2020)
Learning objectives:
- what are linear classifiers
- logistic regression model
- loss function
- maximum likelihood view
- multi-class classification
source: 2017 Kaggle survey
Logistic Regression is the most commonly reported data science method used at work
dataset of inputs $x^{(n)}$ and discrete targets $y^{(n)} \in \{0, \ldots, C\}$
binary classification: $y^{(n)} \in \{0, 1\}$
linear classification: a linear decision boundary $w^\top x + w_0 = 0$
how do we find these boundaries? different approaches give different linear classifiers
binary classification, first idea
fit a linear model $\hat y = w^\top x$ to predict the label $y \in \{-1, 1\}$
given a new instance, assign the label according to the sign of $w^\top x$
set the decision boundary at $w^\top x = 0$
first idea (continued): consider two instances with label $y = 1$
a correctly classified instance: $w^\top x^{(n)} = 100 > 0$; the L2 loss due to this instance is $(100 - 1)^2 = 99^2$
an incorrectly classified instance: $w^\top x^{(n')} = -2 < 0$; the L2 loss due to this instance is $(-2 - 1)^2 = 9$
the correct prediction can have a higher loss than the incorrect one!
solution: squash all the positive instances together and all the negative instances together
consider fitting a linear model to predict the label, but now apply a squashing function to $w^\top x$:
$\sigma(w^\top x) = \frac{1}{1 + e^{-w^\top x}}$
this is the logistic function (also called a squashing function or activation function); its inverse is the logit
the decision boundary is still linear: $w^\top x = 0 \;\Leftrightarrow\; \sigma(w^\top x) = \frac{1}{2}$
desirable property of $\sigma : \mathbb{R} \to (0, 1)$: all $w^\top x > 0$ are squashed close together, and all $w^\top x < 0$ are squashed close together
[figure: $\sigma(w^\top x)$ classifiers for different weights]
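As a concrete illustration (not part of the original slides), here is a minimal NumPy sketch of the logistic function; the helper name logistic is our own choice:

import numpy as np

def logistic(z):
    # logistic (sigmoid) squashing function: sigma(z) = 1 / (1 + exp(-z)), maps R to (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

print(logistic(0.0))                      # 0.5, i.e. on the decision boundary
print(logistic(np.array([-5.0, 5.0])))    # ~[0.0067, 0.9933]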
example
recall the way we included a bias parameter: $x = [1, x_1]$
the input feature $x_1$ is generated uniformly in $[-5, 5]$; for all values less than 2 we have $y = 1$, and $y = 0$ otherwise
a good fit to this data is the model shown (green), with $w \approx [9.1, -4.5]$, that is,
$\hat y = \sigma(-4.5 x_1 + 9.1)$
what is our model's decision boundary?
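As a quick check (our own illustration, reusing the logistic sketch above): the decision boundary is where $\sigma(\cdot) = \frac{1}{2}$, i.e. $9.1 - 4.5 x_1 = 0$, so $x_1 = 9.1 / 4.5 \approx 2.02$, close to the threshold of 2 used to generate the labels.

w = np.array([9.1, -4.5])                           # [bias weight, weight on x1]
boundary_x1 = -w[0] / w[1]                          # solve 9.1 - 4.5 * x1 = 0
print(boundary_x1)                                  # ~2.022
print(logistic(w @ np.array([1.0, boundary_x1])))   # 0.5 at the boundary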
to find a good model, we need to define a cost function; the best model is the one with the lowest cost, and the cost is the sum of the loss values for the individual points
option 1: use the misclassification error (0/1 loss), $L_{0/1}(y, \hat y) = \mathbb{I}\big(y \neq \mathbb{I}(\sigma(w^\top x) \geq \tfrac{1}{2})\big)$
not a continuous function (in $w$): hard to optimize
option 2: use the L2 loss on the squashed prediction, $L_2(y, \hat y) = \tfrac{1}{2}\big(y - \sigma(w^\top x)\big)^2$
thanks to squashing, the previous problem is resolved and the loss is continuous; still a problem: it is hard to optimize (non-convex in $w$)
option 3: use the cross-entropy loss, $L_{CE}(y, \hat y) = -y \log(\hat y) - (1 - y) \log(1 - \hat y)$ with $\hat y = \sigma(w^\top x)$
it is convex in $w$, and it has a probabilistic interpretation (soon!)
examples:
$L_{CE}(y = 1, \hat y = .9) = -\log(.9)$ is smaller than $L_{CE}(y = 1, \hat y = .5) = -\log(.5)$
$L_{CE}(y = 0, \hat y = .9) = -\log(.1)$ is larger than $L_{CE}(y = 0, \hat y = .5) = -\log(.5)$
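A quick numeric check of these values (our own illustration; the helper name cross_entropy is not from the slides):

import numpy as np

def cross_entropy(y, y_hat):
    # binary cross-entropy loss for a single prediction
    return -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)

print(cross_entropy(1, 0.9), cross_entropy(1, 0.5))   # ~0.105 < ~0.693
print(cross_entropy(0, 0.9), cross_entropy(0, 0.5))   # ~2.303 > ~0.693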
summing over the training set gives the cost
$J(w) = -\sum_{n=1}^{N} \Big[ y^{(n)} \log \sigma(w^\top x^{(n)}) + \big(1 - y^{(n)}\big) \log\big(1 - \sigma(w^\top x^{(n)})\big) \Big]$
simplified cost
we need to optimize the cost w.r.t. the parameters; first, simplify:
substituting the logistic function:
$\log\big(1 - \sigma(w^\top x)\big) = \log\Big(\frac{1}{1 + e^{w^\top x}}\Big) = -\log\big(1 + e^{w^\top x}\big)$
$\log \sigma(w^\top x) = \log\Big(\frac{1}{1 + e^{-w^\top x}}\Big) = -\log\big(1 + e^{-w^\top x}\big)$
substituting these into the cost gives the simplified cost
$J(w) = \sum_{n=1}^{N} y^{(n)} \log\big(1 + e^{-w^\top x^{(n)}}\big) + \big(1 - y^{(n)}\big) \log\big(1 + e^{w^\top x^{(n)}}\big)$
def cost(w,  # D
         x,  # N x D
         y   # N
         ):
    z = np.dot(x, w)  # N
    J = np.mean(y * np.log1p(np.exp(-z))
                + (1 - y) * np.log1p(np.exp(z)))
    return J

why not np.log(1 + np.exp(-z))? when np.exp(-z) is very small, np.log(1 + np.exp(-z)) suffers from floating point inaccuracies (the 1 + ... rounds to 1), while np.log1p does not:

In [3]: np.log(1 + 1e-100)
Out[3]: 0.0
In [4]: np.log1p(1e-100)
Out[4]: 1e-100
(recall $\log(1 + x) = x - \frac{x^2}{2} + \frac{x^3}{3} - \cdots$, so for a tiny $x$ the result should be approximately $x$, not 0)
example: Iris flowers
$N_c = 50$ samples with $D = 4$ features for each of $C = 3$ species of Iris flower (a classic dataset originally used by Fisher)
binary task (blue vs. others), using a single feature: petal width + bias, i.e. $w^\top x = w_0 \cdot 1 + w_1 \cdot (\text{petal width})$
so we have two weights, associated with the bias and petal width
[figure: the simplified cost $J(w)$ as a function of these two weights]
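A minimal sketch of setting up this binary task, assuming scikit-learn's load_iris and reusing the cost function above; treating class 0 as the "blue" class is our own assumption:

import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
petal_width = iris.data[:, 3]                # 4th feature is petal width
y = (iris.target == 0).astype(float)         # one species vs. the others
x = np.column_stack([np.ones_like(petal_width), petal_width])   # [1, petal width]

print(cost(np.array([0.0, 0.0]), x, y))      # log(2) ~ 0.693 at w = [0, 0]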
taking the partial derivative of the cost,
$\frac{\partial}{\partial w_d} J(w) = \sum_n -y^{(n)} x_d^{(n)} \frac{e^{-w^\top x^{(n)}}}{1 + e^{-w^\top x^{(n)}}} + \big(1 - y^{(n)}\big)\, x_d^{(n)} \frac{e^{w^\top x^{(n)}}}{1 + e^{w^\top x^{(n)}}}$
$= \sum_n -x_d^{(n)} y^{(n)} \big(1 - \hat y^{(n)}\big) + x_d^{(n)} \hat y^{(n)} \big(1 - y^{(n)}\big) = \sum_n x_d^{(n)} \big(\hat y^{(n)} - y^{(n)}\big)$
gradient: $\nabla J(w) = \sum_n x^{(n)} \big(\hat y^{(n)} - y^{(n)}\big)$, where $\hat y^{(n)} = \sigma(w^\top x^{(n)})$
compare to the gradient for linear regression: $\nabla J(w) = \sum_n x^{(n)} \big(\hat y^{(n)} - y^{(n)}\big)$, where $\hat y^{(n)} = w^\top x^{(n)}$
(in contrast to linear regression, there is no closed-form solution)
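A minimal sketch of this gradient and a plain gradient-descent loop built on it (our own illustration, reusing the logistic helper above; the learning rate and iteration count are arbitrary choices):

def gradient(w, x, y):
    # gradient of the cross-entropy cost, averaged over N (matching np.mean in cost)
    yh = logistic(np.dot(x, w))        # N predicted probabilities
    return np.dot(x.T, yh - y) / len(y)

def fit_logistic(x, y, lr=0.1, n_iters=1000):
    # plain gradient descent on the logistic regression cost
    w = np.zeros(x.shape[1])
    for _ in range(n_iters):
        w = w - lr * gradient(w, x, y)
    return w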
how did we find the optimal weights?
cost: $J(w) = \sum_{n=1}^{N} y^{(n)} \log\big(1 + e^{-w^\top x^{(n)}}\big) + \big(1 - y^{(n)}\big) \log\big(1 + e^{w^\top x^{(n)}}\big)$
interpret the prediction as a class probability
$\hat y = p_w(y = 1 \mid x) = \sigma(w^\top x)$
conditional likelihood of the labels given the inputs:
$L(w) = \prod_{n=1}^{N} p\big(y^{(n)} \mid x^{(n)}; w\big) = \prod_{n=1}^{N} \big(\hat y^{(n)}\big)^{y^{(n)}} \big(1 - \hat y^{(n)}\big)^{1 - y^{(n)}}$
the log-ratio of class probabilities is linear (the logit function is the inverse of the logistic):
$\log \frac{\hat y}{1 - \hat y} = \log \frac{\sigma(w^\top x)}{1 - \sigma(w^\top x)} = \log \frac{1}{e^{-w^\top x}} = w^\top x$
$p\big(y^{(n)} \mid x^{(n)}; w\big) = \big(\hat y^{(n)}\big)^{y^{(n)}} \big(1 - \hat y^{(n)}\big)^{1 - y^{(n)}} = \text{Bernoulli}\big(y^{(n)}; \sigma(w^\top x^{(n)})\big)$
so we have a Bernoulli likelihood
likelihood: $L(w) = \prod_{n=1}^{N} p\big(y^{(n)} \mid x^{(n)}; w\big) = \prod_{n=1}^{N} \big(\hat y^{(n)}\big)^{y^{(n)}} \big(1 - \hat y^{(n)}\big)^{1 - y^{(n)}}$
log-likelihood: find the $w$ that maximizes it
$w^* = \arg\max_w \sum_{n=1}^{N} \log p\big(y^{(n)} \mid x^{(n)}; w\big) = \arg\max_w \sum_{n=1}^{N} y^{(n)} \log\big(\hat y^{(n)}\big) + \big(1 - y^{(n)}\big) \log\big(1 - \hat y^{(n)}\big) = \arg\min_w J(w)$
the cross-entropy cost function! so using the cross-entropy loss in logistic regression is maximizing the conditional likelihood
we saw a similar interpretation for linear regression (the L2 loss maximizes the conditional Gaussian likelihood)
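A quick numeric sanity check of this equivalence (our own illustration, reusing the cost and logistic helpers above): the cross-entropy cost equals the negative average log-likelihood.

rng = np.random.default_rng(0)
x = np.column_stack([np.ones(20), rng.normal(size=20)])   # toy inputs with a bias column
y = (rng.random(20) < 0.5).astype(float)                  # toy binary labels
w = rng.normal(size=2)

yh = logistic(np.dot(x, w))
neg_avg_loglik = -np.mean(y * np.log(yh) + (1 - y) * np.log(1 - yh))
print(np.isclose(cost(w, x, y), neg_avg_loglik))          # True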
binary classification: Bernoulli likelihood
$\text{Bernoulli}(y \mid \hat y) = \hat y^{\,y} (1 - \hat y)^{1 - y}$
subject to $0 \leq \hat y \leq 1$; we use the logistic function to ensure this: $\hat y = \sigma(z) = \sigma(w^\top x)$
C classes: categorical likelihood
$\text{Categorical}(y \mid \hat y) = \prod_{c=1}^{C} \hat y_c^{\,\mathbb{I}(y = c)}$
subject to $\sum_c \hat y_c = 1$ (with $\hat y_c \geq 0$); this is achieved using the softmax function
using this probabilistic view, we extend logistic regression to the multiclass setting
softmax: a generalization of the logistic to more than 2 classes
logistic: $\sigma : \mathbb{R} \to (0, 1)$ produces a single probability; the probability of the second class is $1 - \sigma(z)$
softmax: $\mathbb{R}^C \to \Delta^C$, where $\Delta^C$ is the probability simplex (recall: $p \in \Delta^C \Leftrightarrow p_c \geq 0$ and $\sum_{c=1}^{C} p_c = 1$)
$\text{softmax}(z)_c = \frac{e^{z_c}}{\sum_{c'=1}^{C} e^{z_{c'}}}$
if the input values are large, softmax becomes similar to argmax, e.g.
$\text{softmax}([10, 100, -1]) \approx [0, 1, 0]$
example: similar to the logistic, softmax is also a squashing function
$\text{softmax}([1, 1, 2, 0]) = \Big[\frac{e}{2e + e^2 + 1},\; \frac{e}{2e + e^2 + 1},\; \frac{e^2}{2e + e^2 + 1},\; \frac{1}{2e + e^2 + 1}\Big]$
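A minimal softmax sketch (our own illustration), using the max-subtraction stabilization that the slides introduce a bit later for log-sum-exp:

import numpy as np

def softmax(z):
    # subtracting the max does not change the result but prevents overflow in exp
    z = np.asarray(z, dtype=float)
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

print(softmax([10, 100, -1]))   # ~[0, 1, 0]
print(softmax([1, 1, 2, 0]))    # ~[0.20, 0.20, 0.53, 0.07], i.e. [e, e, e^2, 1] / (2e + e^2 + 1)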
C classes: categorical likelihood
$\text{Categorical}(y \mid \hat y) = \prod_{c=1}^{C} \hat y_c^{\,\mathbb{I}(y = c)}$
using the softmax to enforce the sum-to-one constraint:
$\hat y_c = \frac{e^{w_c^\top x}}{\sum_{c'} e^{w_{c'}^\top x}}$
so we have one parameter vector $w_c$ for each class
to simplify the equations we write $z_c = w_c^\top x$, so that $\hat y = \text{softmax}([z_1, \ldots, z_C])$, i.e. $\hat y_c = \frac{e^{z_c}}{\sum_{c'} e^{z_{c'}}}$
substituting the softmax, with $z_c^{(n)} = w_c^\top x^{(n)}$ and $\hat y^{(n)} = \text{softmax}\big([z_1^{(n)}, \ldots, z_C^{(n)}]\big)$, into the categorical likelihood gives
$L(\{w_c\}) = \prod_{n=1}^{N} \prod_{c=1}^{C} \Big(\frac{e^{z_c^{(n)}}}{\sum_{c'} e^{z_{c'}^{(n)}}}\Big)^{\mathbb{I}(y^{(n)} = c)}$
log-likelihood:
$\ell(\{w_c\}) = \sum_{n=1}^{N} \Big[ \sum_{c=1}^{C} \mathbb{I}\big(y^{(n)} = c\big)\, z_c^{(n)} - \log \sum_{c'} e^{z_{c'}^{(n)}} \Big]$
writing the label as a one-hot vector $y^{(n)} = \big[\mathbb{I}(y^{(n)} = 1), \ldots, \mathbb{I}(y^{(n)} = C)\big]$, this becomes
$\ell(\{w_c\}) = \sum_{n=1}^{N} \Big[ y^{(n)\top} z^{(n)} - \log \sum_{c'} e^{z_{c'}^{(n)}} \Big]$
we use this one-hot encoding of the labels from now on
side note: we can also use this encoding for categorical features, $x_d^{(n)} \to \big[\mathbb{I}(x_d^{(n)} = 1), \ldots, \mathbb{I}(x_d^{(n)} = C)\big]$
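A minimal one-hot encoding sketch (our own illustration; it assumes 0-based integer class labels, whereas the slides index classes from 1):

import numpy as np

def one_hot(y, C):
    # y: integer labels in {0, ..., C-1}; returns an N x C matrix of indicators
    return np.eye(C)[y]

print(one_hot(np.array([0, 2, 1]), 3))
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]]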
softmax cross-entropy: the cost function is the negative of the log-likelihood, similar to the binary case
$J(\{w_c\}) = \sum_{n=1}^{N} \Big[ \log \sum_{c'} e^{z_{c'}^{(n)}} - y^{(n)\top} z^{(n)} \Big]$, where $z_c^{(n)} = w_c^\top x^{(n)}$
a naive implementation of log-sum-exp causes over/underflow; prevent this using this one weird trick:
$\log \sum_c e^{z_c} = \bar z + \log \sum_c e^{z_c - \bar z}$, where $\bar z \leftarrow \max_c z_c$
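A minimal sketch of the stabilized log-sum-exp (our own illustration; scipy.special.logsumexp provides the same computation):

import numpy as np

def log_sum_exp(z):
    # stable log(sum_c exp(z_c)): shift by the max so the largest exponent becomes exp(0) = 1
    z = np.asarray(z, dtype=float)
    z_bar = np.max(z)
    return z_bar + np.log(np.sum(np.exp(z - z_bar)))

print(log_sum_exp([1000.0, 1000.0]))   # ~1000.69; the naive np.log(np.sum(np.exp(z))) overflows to inf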
given the training data $\mathcal{D} = \{(x^{(n)}, y^{(n)})\}_n$, find the best model parameters $\{w_c\}_c$ by minimizing the cost (maximizing the likelihood of $\mathcal{D}$)
$J(\{w_c\}) = \sum_{n=1}^{N} \Big[ \log \sum_{c'} e^{z_{c'}^{(n)}} - y^{(n)\top} z^{(n)} \Big]$, where $z_c^{(n)} = w_c^\top x^{(n)}$
we need to use gradient descent (for now, calculate the gradient)
$\nabla J = \Big[\frac{\partial J}{\partial w_{1,1}}, \ldots, \frac{\partial J}{\partial w_{1,D}}, \ldots, \frac{\partial J}{\partial w_{C,D}}\Big]^\top$ has length $C \times D$
using the chain rule:
$\frac{\partial J}{\partial w_{c,d}} = \sum_{n} \frac{\partial J}{\partial z_c^{(n)}} \frac{\partial z_c^{(n)}}{\partial w_{c,d}}$
the derivative of log-sum-exp is the softmax:
$\frac{\partial}{\partial z_c^{(n)}} \log \sum_{c'} e^{z_{c'}^{(n)}} = \frac{e^{z_c^{(n)}}}{\sum_{c'} e^{z_{c'}^{(n)}}} = \hat y_c^{(n)}$
and $\frac{\partial z_c^{(n)}}{\partial w_{c,d}} = x_d^{(n)}$, so
$\frac{\partial J}{\partial w_{c,d}} = \sum_{n} \big(\hat y_c^{(n)} - y_c^{(n)}\big)\, x_d^{(n)}$
this looks familiar!
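A minimal sketch of the softmax-regression cost and gradient based on these formulas (our own illustration, assuming numpy as np and reusing the log_sum_exp and one_hot helpers above; storing the weights as a C x D matrix W is our own choice):

def softmax_cost_grad(W, x, y_onehot):
    # W: C x D weights, x: N x D inputs, y_onehot: N x C one-hot labels
    z = np.dot(x, W.T)                               # N x C logits, z[n, c] = w_c^T x^(n)
    lse = np.array([log_sum_exp(zn) for zn in z])    # N stable log-sum-exp values
    J = np.sum(lse - np.sum(y_onehot * z, axis=1))   # negative log-likelihood
    y_hat = np.exp(z - lse[:, None])                 # N x C softmax probabilities
    grad = np.dot((y_hat - y_onehot).T, x)           # C x D, entry [c, d] = sum_n (yhat_c - y_c) x_d
    return J, grad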
naive Bayes vs. logistic regression

naive Bayes:
- learns the joint distribution
- the maximum-likelihood estimates of the prior and likelihood have a closed-form solution (using empirical frequencies)
- makes stronger assumptions
- usually works better with smaller datasets
- linear decision boundary for Gaussian naive Bayes if the variance is fixed

logistic regression:
- learns the conditional distribution
- no closed-form solution (use numerical optimization)
- weaker assumptions, since it doesn't model the distribution of the input (x)
- usually works better with larger datasets
- linear decision boundary
logistic regression: logistic activation function + cross-entropy cost function
probabilistic interpretation: using maximum likelihood to derive the cost function (Gaussian likelihood ↔ L2 loss; Bernoulli likelihood ↔ cross-entropy loss)
multi-class classification: softmax + cross-entropy cost function
gradient calculation (will use later!)