NPFL129, Lecture 4
Multiclass Logistic Regression, Multilayer Perceptron
Milan Straka
October 26, 2020
Charles University in Prague, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics
An extension of the perceptron, which models the conditional probabilities $p(C_1|x)$ and $p(C_0|x) = 1 - p(C_1|x)$. Logistic regression can in fact handle also more than two classes, which we will see shortly.

Logistic regression employs the following parametrization of the conditional class probabilities:
$$p(C_1|x) = \sigma(x^T w + b),$$
$$p(C_0|x) = 1 - p(C_1|x),$$
where $\sigma$ is the sigmoid function
$$\sigma(x) = \frac{1}{1 + e^{-x}}.$$
It can be trained using the SGD algorithm.
To give some meaning to the sigmoid function, starting with
$$p(C_1|x) = \sigma(y(x; w)) = \frac{1}{1 + e^{-y(x;w)}},$$
we can arrive at
$$y(x; w) = \log\left(\frac{p(C_1|x)}{p(C_0|x)}\right),$$
where the model prediction $y(x; w)$ is called a logit, and it is the logarithm of the odds of the two class probabilities.
To train the logistic regression $y(x; w) = x^T w$, we use MLE (maximum likelihood estimation). Note that $p(C_1|x; w) = \sigma(y(x; w))$. Therefore, the loss for a batch $X = \{(x_1, t_1), (x_2, t_2), \ldots, (x_N, t_N)\}$ is
$$L(X) = \frac{1}{N} \sum_i -\log(p(C_{t_i}|x_i; w)).$$

Input: Input dataset ($X \in \mathbb{R}^{N \times D}$, $t \in \{0, +1\}^N$), learning rate $\alpha \in \mathbb{R}^+$.
- $w \leftarrow 0$
- until convergence (or until patience is over), process a batch of $N$ examples:
  - $g \leftarrow \nabla_w \frac{1}{N} \sum_i -\log(p(C_{t_i}|x_i; w))$
  - $w \leftarrow w - \alpha g$
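A minimal NumPy sketch of this training loop (the batch size, epoch count, and stopping rule are illustrative assumptions, and the bias is assumed absorbed into the weights via a constant feature):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def train_logreg(X, t, alpha=0.1, epochs=100, batch_size=16):
    """SGD for binary logistic regression; X is N x D, t contains 0/1 targets."""
    N, D = X.shape
    w = np.zeros(D)  # zero initialization, as in the algorithm above
    for _ in range(epochs):
        for i in range(0, N, batch_size):
            xb, tb = X[i:i + batch_size], t[i:i + batch_size]
            p = sigmoid(xb @ w)             # p(C_1 | x; w)
            g = xb.T @ (p - tb) / len(xb)   # gradient of the mean NLL
            w -= alpha * g
    return w
```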
The logistic regression is in fact an extended linear regression. A linear regression model, which is followed by some activation function $a$, is called a generalized linear model:
$$p(t|x; w, b) = a(y(x; w, b)) = a(x^T w + b).$$

| Name | Activation | Distribution | Loss | Gradient |
|------|------------|--------------|------|----------|
| linear regression | identity | ? | $\mathrm{MSE} \propto \mathbb{E}\,(y(x) - t)^2$ | $(y(x) - t) \cdot x$ |
| logistic regression | $\sigma(x)$ | Bernoulli | $\mathrm{NLL} \propto \mathbb{E}\,{-\log(p(t|x))}$ | $(a(y(x)) - t) \cdot x$ |
During regression, we predict a number, not a real probability distribution. In order to generate a distribution, we might consider a distribution with the mean equal to the predicted value and a fixed variance $\sigma^2$ – the most general such distribution is the normal distribution.
Therefore, assume our model generates a distribution $p(t|x; w) = \mathcal{N}(t; y(x; w), \sigma^2)$. Now we can apply MLE and get
$$\begin{aligned}
\arg\max_w p(X; w)
&= \arg\min_w \sum_{i=1}^N -\log p(t_i|x_i; w) \\
&= \arg\min_w \sum_{i=1}^N -\log \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(t_i - y(x_i; w))^2}{2\sigma^2}} \\
&= \arg\min_w -N \log(2\pi\sigma^2)^{-1/2} + \sum_{i=1}^N \frac{(t_i - y(x_i; w))^2}{2\sigma^2} \\
&= \arg\min_w \sum_{i=1}^N \frac{(t_i - y(x_i; w))^2}{2\sigma^2}
 = \arg\min_w \frac{1}{N} \sum_{i=1}^N (t_i - y(x_i; w))^2.
\end{aligned}$$
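A quick numerical check of the derivation (the values below are made up): the normal NLL equals the MSE up to a multiplicative and an additive constant, so both have the same minimizer.

```python
import numpy as np

y, t, sigma = np.array([1.0, 2.0, 3.0]), np.array([1.5, 1.0, 3.5]), 0.7
N = len(t)

# Per-example normal NLL, summed over the batch.
nll = np.sum(0.5 * np.log(2 * np.pi * sigma**2) + (t - y)**2 / (2 * sigma**2))
mse = np.mean((t - y)**2)

# NLL = N/2 * log(2*pi*sigma^2) + N/(2*sigma^2) * MSE
assert np.isclose(nll, N / 2 * np.log(2 * np.pi * sigma**2) + N / (2 * sigma**2) * mse)
```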
We have therefore extended the GLM table to

| Name | Activation | Distribution | Loss | Gradient |
|------|------------|--------------|------|----------|
| linear regression | identity | Normal | $\mathrm{NLL} \propto \mathrm{MSE}$ | $(y(x) - t) \cdot x$ |
| logistic regression | $\sigma(x)$ | Bernoulli | $\mathrm{NLL} \propto \mathbb{E}\,{-\log(p(t|x))}$ | $(a(y(x)) - t) \cdot x$ |
To extend the binary logistic regression to a multiclass case with $K$ classes, we:
- generate $K$ outputs using a weight matrix $W \in \mathbb{R}^{D \times K}$: $y(x; W) = x^T W$, or in other words, $y(x; W)_i = x^T (W_{*,i})$,
- generalize the sigmoid function to a softmax function, such that
$$\mathrm{softmax}(y)_i = \frac{e^{y_i}}{\sum_j e^{y_j}}.$$

Note that the original sigmoid function can be written as
$$\sigma(x) = \mathrm{softmax}\big([x\ \ 0]\big)_0 = \frac{e^x}{e^x + e^0} = \frac{1}{1 + e^{-x}}.$$

The resulting classifier is also known as multinomial logistic regression, maximum entropy classifier or softmax regression.
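A NumPy sketch of softmax; subtracting the maximum before exponentiating exploits the shift invariance derived below and avoids overflow (a standard numerical trick, not explicit in the slides):

```python
import numpy as np

def softmax(y):
    """Softmax of a vector y; invariant to adding a constant to all components."""
    e = np.exp(y - np.max(y))  # shift by max(y) for numerical stability
    return e / e.sum()

y = np.array([1.0, 2.0, 0.5])
assert np.allclose(softmax(y), softmax(y + 100.0))  # shift invariance
assert np.isclose(softmax(np.array([3.0, 0.0]))[0],
                  1 / (1 + np.exp(-3.0)))            # recovers the sigmoid
```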
From the definition of the softmax function
$$\mathrm{softmax}(y)_i = \frac{e^{y_i}}{\sum_j e^{y_j}},$$
it is natural to obtain the interpretation of the model outputs $y(x; W)$ as logits:
$$y(x; W)_i = \log(p(C_i|x; W)) + c.$$
The constant $c$ is present, because the output of the model is overparametrized (the probability of one class could be computed from the probabilities of the remaining ones). This is connected to the fact that softmax is invariant to addition of a constant:
$$\mathrm{softmax}(y + c)_i = \frac{e^{y_i + c}}{\sum_j e^{y_j + c}} = \frac{e^{y_i} \cdot e^c}{e^c \sum_j e^{y_j}} = \mathrm{softmax}(y)_i.$$
The difference between the softmax and the sigmoid output can be compared on the binary case, where the binary logistic regression model output is
$$y(x; w) = \log\left(\frac{p(C_1|x; w)}{p(C_0|x; w)}\right),$$
while the outputs of the softmax variant with two outputs can be interpreted as $y(x; W)_0 = \log(p(C_0|x; W)) + c$ and $y(x; W)_1 = \log(p(C_1|x; W)) + c$.

If we consider $y(x; W)_0$ to be zero, the model can then predict only the probability $p(C_1|x)$, and the constant $c$ is fixed to $-\log(p(C_0|x; W))$, recovering the original interpretation.

Therefore, we could produce only $K - 1$ outputs for $K$ classes, fixing the last output $y_K$ to zero, resulting in the interpretation of the model outputs analogous to the binary case:
$$y(x; W)_i = \log\left(\frac{p(C_i|x; W)}{p(C_K|x; W)}\right).$$
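A small numerical check (illustrative values) that a two-output softmax depends only on the difference of the outputs, i.e., it equals a sigmoid of that difference – which is exactly why one output can be fixed to zero:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def softmax(y):
    e = np.exp(y - np.max(y))
    return e / e.sum()

y0, y1 = 0.3, 1.7
p = softmax(np.array([y0, y1]))
# p(C_1) depends only on the difference y1 - y0, the binary logit.
assert np.isclose(p[1], sigmoid(y1 - y0))
```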
Using the softmax function, we naturally define
$$p(C_i|x; W) = \mathrm{softmax}(x^T W)_i = \frac{e^{(x^T W)_i}}{\sum_j e^{(x^T W)_j}}.$$
We can then use MLE and train the model using stochastic gradient descent.

Input: Input dataset ($X \in \mathbb{R}^{N \times D}$, $t \in \{0, 1, \ldots, K-1\}^N$), learning rate $\alpha \in \mathbb{R}^+$.
- $w \leftarrow 0$
- until convergence (or until patience is over), process a batch of $N$ examples:
  - $g \leftarrow \nabla_w \frac{1}{N} \sum_i -\log(p(C_{t_i}|x_i; w))$
  - $w \leftarrow w - \alpha g$
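A NumPy sketch of the multiclass training loop (the batching scheme and epoch count are illustrative assumptions); the gradient of the mean NLL with respect to $W$ is $x^T(\mathrm{softmax}(xW) - \mathbf{1}_t)$ averaged over the batch:

```python
import numpy as np

def softmax_rows(Y):
    E = np.exp(Y - Y.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def train_multiclass_logreg(X, t, K, alpha=0.1, epochs=100, batch_size=16):
    """X is N x D, t contains class indices in {0, ..., K-1}."""
    N, D = X.shape
    W = np.zeros((D, K))
    for _ in range(epochs):
        for i in range(0, N, batch_size):
            xb, tb = X[i:i + batch_size], t[i:i + batch_size]
            P = softmax_rows(xb @ W)          # p(C_k | x; W), one row per example
            P[np.arange(len(tb)), tb] -= 1    # a(y(x)) - one_hot(t)
            W -= alpha * xb.T @ P / len(xb)   # averaged NLL gradient step
    return W
```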
Note that the decision regions of the binary/multiclass logistic regression are convex (and therefore connected). To see this, consider $x_A$ and $x_B$ in the same decision region $R_k$. Any point $x$ lying on the line connecting them is their linear combination, $x = \lambda x_A + (1 - \lambda) x_B$ for some $\lambda \in [0, 1]$, and from the linearity of $y(x) = Wx$ it follows that
$$y(x) = \lambda y(x_A) + (1 - \lambda) y(x_B).$$
Given that $y(x_A)_k$ was the largest among $y(x_A)$, and also given that $y(x_B)_k$ was the largest among $y(x_B)$, it must be the case that $y(x)_k$ is the largest among all $y(x)$.
The multiclass logistic regression can now be added to the GLM table:

| Name | Activation | Distribution | Loss | Gradient |
|------|------------|--------------|------|----------|
| linear regression | identity | Normal | $\mathrm{NLL} \propto \mathrm{MSE}$ | $(y(x) - t) \cdot x$ |
| logistic regression | $\sigma(x)$ | Bernoulli | $\mathrm{NLL} \propto \mathbb{E}\,{-\log(p(t|x))}$ | $(a(y(x)) - t) \cdot x$ |
| multiclass logistic regression | $\mathrm{softmax}(x)$ | categorical | $\mathrm{NLL} \propto \mathbb{E}\,{-\log(p(t|x))}$ | $(a(y(x)) - \mathbf{1}_t) \cdot x$ |

where $\mathbf{1}_t$ denotes the one-hot encoding of the target class $t$.
There exist several other GLMs; we now describe a final one, this time for regression and not for classification. Compared to regular linear regression, where we assume the output distribution is normal, we turn our attention to the Poisson distribution.

The Poisson distribution is a discrete distribution suitable for modeling the probability of a given number of events occurring in a fixed time interval, if these events occur with a known rate and independently of each other:
$$P(x = k; \lambda) = \frac{\lambda^k e^{-\lambda}}{k!}.$$
It is easy to show that if $x$ has a Poisson distribution, $\mathbb{E}[x] = \lambda$ and $\mathrm{Var}(x) = \lambda$.
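A quick sanity check of the formula and of $\mathbb{E}[x] = \mathrm{Var}(x) = \lambda$ by sampling (the rate, seed, sample size and tolerances are arbitrary):

```python
import numpy as np
from math import factorial, exp

lam = 3.0
pmf = lambda k: lam**k * exp(-lam) / factorial(k)
assert np.isclose(sum(pmf(k) for k in range(100)), 1.0)  # probabilities sum to one

samples = np.random.default_rng(42).poisson(lam, size=100_000)
assert abs(samples.mean() - lam) < 0.05
assert abs(samples.var() - lam) < 0.1
```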
An important difference from the normal distribution is that the normal distribution assumes the variance does not depend on the mean, i.e., that the model “makes errors of the same magnitude everywhere”. On the other hand, the variance of a Poisson distribution increases with the mean.

[Figure: comparison of an OLS fit and a Poisson regression fit; https://bookdown.org/roback/bookdown-bysh/bookdown-bysh_files/figure-html/OLSpois-1.png]
Poisson regression is a generalized linear model producing a Poisson distribution (i.e., the mean rate $\lambda$). Again, we use NLL as the loss. To choose a suitable activation, we might be interested in keeping the same form of the gradient as in the previous models – requiring the gradient to be $(a(y(x)) - t) \cdot x$ yields $a(x) = \exp(x)$, which means the linear part of the model is predicting $\log(\lambda)$.

| Name | Activation | Distribution | Loss | Gradient |
|------|------------|--------------|------|----------|
| linear regression | identity | Normal | $\mathrm{NLL} \propto \mathrm{MSE}$ | $(y(x) - t) \cdot x$ |
| logistic regression | $\sigma(x)$ | Bernoulli | $\mathrm{NLL} \propto \mathbb{E}\,{-\log(p(t|x))}$ | $(a(y(x)) - t) \cdot x$ |
| multiclass logistic regression | $\mathrm{softmax}(x)$ | categorical | $\mathrm{NLL} \propto \mathbb{E}\,{-\log(p(t|x))}$ | $(a(y(x)) - \mathbf{1}_t) \cdot x$ |
| Poisson regression | $\exp(x)$ | Poisson | $\mathrm{NLL} \propto \mathbb{E}\,{-\log(p(t|x))}$ | $(a(y(x)) - t) \cdot x$ |
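A sketch of Poisson regression using the gradient from the table; with $a = \exp$, the update has the same form as for the other GLMs (full-batch gradient descent and the hyperparameters are illustrative choices):

```python
import numpy as np

def train_poisson_reg(X, t, alpha=0.01, epochs=200):
    """X is N x D, t contains nonnegative event counts; x^T w predicts log(lambda)."""
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(epochs):
        lam = np.exp(X @ w)       # a(y(x)) = exp(x^T w) = predicted rate lambda
        g = X.T @ (lam - t) / N   # (a(y(x)) - t) * x, averaged over the data
        w -= alpha * g
    return w
```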
[Figure: a network with input layer $x_1, \ldots, x_4$ and output layer $y_1, y_2$ with activation $a$.]

We can reformulate the generalized linear models in the following framework. Assume we have an input node for every input feature. Additionally, we have an output node for every model output ($K$ output nodes for classification into $K$ classes). Every input node and every output node are connected with a directed edge, and every edge has an associated weight. The value of every output node is computed by summing the values of its predecessors multiplied by the corresponding weights, adding the bias of this node, and finally passing the result through an activation function $a$:
$$y_i = a\Big(\sum_j x_j w_{j,i} + b_i\Big),$$
or in matrix form $y = a(x^T W + b)$, and for a whole batch of inputs $X$, $Y = a(XW + b)$.
[Figure: a multilayer perceptron with input layer $x_1, \ldots, x_4$, hidden layer $h_1, \ldots, h_4$ with activation $f$, and output layer $y_1, y_2$ with activation $a$.]

We now extend the model by adding a hidden layer with activation $f$. The computation is performed analogously:
$$h_i = f\Big(\sum_j x_j w^h_{j,i} + b^h_i\Big), \qquad y_i = a\Big(\sum_j h_j w^y_{j,i} + b^y_i\Big),$$
or in matrix form
$$h = f(x^T W^h + b^h), \qquad y = a(h^T W^y + b^y),$$
and for a batch of inputs $X$, $H = f(XW^h + b^h)$ and $Y = a(HW^y + b^y)$.
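The batched computation written directly in NumPy (the sizes and the choice of $f = \tanh$, $a = \mathrm{softmax}$ are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, H, K = 8, 4, 5, 3                    # batch, input, hidden, output sizes
X = rng.normal(size=(N, D))
W_h, b_h = rng.normal(size=(D, H)) * 0.1, np.zeros(H)
W_y, b_y = rng.normal(size=(H, K)) * 0.1, np.zeros(K)

H_out = np.tanh(X @ W_h + b_h)             # hidden layer, activation f = tanh
logits = H_out @ W_y + b_y
Y = np.exp(logits - logits.max(axis=1, keepdims=True))
Y /= Y.sum(axis=1, keepdims=True)          # output layer, activation a = softmax
```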
Note that:
- the structure of the input layer depends on the input features;
- the structure and the activation function of the output layer depend on the target data;
- however, the hidden layer has no pre-image in the data and is completely arbitrary – which is the reason why it is called a hidden layer.

Also note that we can absorb the biases into the weights analogously to the generalized linear models.
Output layer activation functions:
- regression:
  - identity activation: we model a normal distribution on output (linear regression);
  - $\exp(x)$: we model a Poisson distribution on output (Poisson regression);
- binary classification:
  - $\sigma(x) \stackrel{\textrm{def}}{=} \frac{1}{1 + e^{-x}}$: we model the Bernoulli distribution (the model predicts a probability);
- $K$-class classification:
  - $\mathrm{softmax}(y) \propto e^y$, $\mathrm{softmax}(y)_i \stackrel{\textrm{def}}{=} \frac{e^{y_i}}{\sum_j e^{y_j}}$: we model the (usually overparametrized) categorical distribution.
Hidden layer activation functions:
- no activation (identity): does not help, a composition of linear mappings is still a linear mapping;
- $\sigma$ (but works suboptimally – nonsymmetrical, $\frac{d\sigma}{dx}(0) = 1/4$);
- $\tanh$: the result of making $\sigma$ symmetrical and making the derivative in zero equal to 1; $\tanh(x) = 2\sigma(2x) - 1$;
- ReLU: $\max(0, x)$; the most common non-linear activation used nowadays.
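The activations and their derivatives side by side, as a minimal Python reference (derivative values in the comments):

```python
import numpy as np

identity = lambda x: x                     # derivative: 1 everywhere
sigmoid  = lambda x: 1 / (1 + np.exp(-x))  # derivative: sigma(x)*(1-sigma(x)), 1/4 at 0
tanh     = lambda x: np.tanh(x)            # derivative: 1 - tanh(x)**2, 1 at 0
relu     = lambda x: np.maximum(0, x)      # derivative: 1 for x > 0, 0 for x < 0

x = np.linspace(-3, 3, 7)
assert np.allclose(tanh(x), 2 * sigmoid(2 * x) - 1)  # the relation above
```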
The multilayer perceptron can be trained using an SGD algorithm:

Input: Input dataset ($X \in \mathbb{R}^{N \times D}$, targets $t$), learning rate $\alpha \in \mathbb{R}^+$.
- $w \leftarrow$ small random values (unlike the GLMs above, zero initialization would not work here; see the note on initialization below)
- until convergence (or until patience is over), process a batch of $N$ examples:
  - $g \leftarrow \nabla_w \frac{1}{N} \sum_j -\log p(t_j|x_j; w)$
  - $w \leftarrow w - \alpha g$
[Figure: the same multilayer perceptron, with hidden layer activation $f$ and output layer activation $a$.]

Assume we have an MLP with an input of size $D$, weights $W^h \in \mathbb{R}^{D \times H}$, $b^h \in \mathbb{R}^H$, a hidden layer of size $H$ with activation $f$, weights $W^y \in \mathbb{R}^{H \times K}$, $b^y \in \mathbb{R}^K$, and finally an output layer of size $K$ with activation $a$.

In order to compute the gradient of the loss $L$ with respect to all weights, you should proceed gradually:
- first compute $\frac{\partial L}{\partial y}$,
- then compute $\frac{\partial y}{\partial y^{in}}$, where $y^{in}$ are the inputs to the output layer (i.e., before applying the activation function $a$; in other words, $y = a(y^{in})$),
- then compute $\frac{\partial y^{in}}{\partial W^y}$ and $\frac{\partial y^{in}}{\partial b^y}$, which allows us to obtain $\frac{\partial L}{\partial W^y} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial y^{in}} \cdot \frac{\partial y^{in}}{\partial W^y}$ and analogously $\frac{\partial L}{\partial b^y}$,
- followed by $\frac{\partial y^{in}}{\partial h}$ and $\frac{\partial h}{\partial h^{in}}$,
- and finally, using $\frac{\partial h^{in}}{\partial W^h}$ and $\frac{\partial h^{in}}{\partial b^h}$, we can compute $\frac{\partial L}{\partial W^h}$ and $\frac{\partial L}{\partial b^h}$.
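A sketch of this procedure in NumPy for a single example, assuming $f = \tanh$, $a = \mathrm{softmax}$ and the NLL loss (these choices are illustrative; with them, the product $\frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial y^{in}}$ simplifies to softmax minus one-hot):

```python
import numpy as np

def mlp_gradients(x, t, W_h, b_h, W_y, b_y):
    """Gradients of the NLL of a single example (x, t) for a tanh/softmax MLP."""
    # Forward pass, keeping the intermediate values.
    h_in = x @ W_h + b_h
    h = np.tanh(h_in)
    y_in = h @ W_y + b_y
    y = np.exp(y_in - y_in.max()); y /= y.sum()   # softmax

    # Backward pass, following the order from the text above.
    d_y_in = y.copy(); d_y_in[t] -= 1             # dL/dy * dy/dy_in = softmax - one_hot
    d_W_y = np.outer(h, d_y_in)                   # dL/dW_y
    d_b_y = d_y_in                                # dL/db_y
    d_h = W_y @ d_y_in                            # dL/dh
    d_h_in = d_h * (1 - h**2)                     # dL/dh_in, since tanh' = 1 - tanh^2
    d_W_h = np.outer(x, d_h_in)                   # dL/dW_h
    d_b_h = d_h_in                                # dL/db_h
    return d_W_h, d_b_h, d_W_y, d_b_y
```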
One way to interpret the hidden layer is:
- the part from the hidden layer to the output layer is the previously used generalized linear model (linear regression, logistic regression, …);
- the part from the inputs to the hidden layer can be considered automatically constructed features. The features are a linear mapping of the input values followed by a non-linearity, and the theorem below proves they can always be constructed to achieve as good a fit of the training data as required.

Note that the weights in an MLP must be initialized randomly. If we used just zeros, all the constructed features (hidden layer nodes) would behave identically and we would never distinguish them. Using random weights corresponds to using random features, which allows the SGD to make progress (improve the individual features).
Universal approximation theorem: Let $\varphi(x)$ be a nonconstant, bounded and nondecreasing continuous function. (Later a proof was given also for $\varphi = \mathrm{ReLU}$.) Then for any $\varepsilon > 0$ and any continuous function $f$ on $[0, 1]^m$, there exist $N \in \mathbb{N}$, $v_i \in \mathbb{R}$, $b_i \in \mathbb{R}$ and $w_i \in \mathbb{R}^m$, such that if we denote
$$F(x) = \sum_{i=1}^N v_i \varphi(x^T w_i + b_i),$$
then for all $x \in [0, 1]^m$:
$$|F(x) - f(x)| < \varepsilon.$$
Sketch of the proof for ReLU: If a function is continuous on a closed interval, it can be approximated by a sequence of line segments (a piecewise linear function) to arbitrary precision.

[Figure: piecewise linear approximation of a continuous function; https://miro.medium.com/max/844/1*lihbPNQgl7oKjpCsmzPDKw.png]

However, we can create a sequence of $k$ linear segments as a sum of $k$ ReLU units – at every endpoint a new ReLU starts (i.e., the input of the ReLU is zero at the endpoint), with a tangent which is the difference between the target tangent and the tangent of the approximation until this point.
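A NumPy illustration of this construction (the target function $\sin$ and the number of endpoints are arbitrary choices): each ReLU turns on at a segment endpoint and contributes the difference between the new target slope and the slope accumulated so far.

```python
import numpy as np

relu = lambda x: np.maximum(0, x)

f = np.sin
endpoints = np.linspace(0, 2 * np.pi, 20)
values = f(endpoints)
slopes = np.diff(values) / np.diff(endpoints)  # slope of each linear segment

def F(x):
    # First ReLU sets the initial slope; each following one adjusts it.
    out = values[0] + slopes[0] * relu(x - endpoints[0])
    for i in range(1, len(slopes)):
        out = out + (slopes[i] - slopes[i - 1]) * relu(x - endpoints[i])
    return out

xs = np.linspace(0, 2 * np.pi, 1000)
print(np.max(np.abs(F(xs) - f(xs))))  # max error shrinks as endpoints are added
```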
Sketch of the proof for a squashing function $\varphi(x)$ (i.e., a nonconstant, bounded and nondecreasing continuous function like the sigmoid): We can prove that $\varphi$ can be made arbitrarily close to a hard threshold by compressing it horizontally.

[Figure: a sigmoid compressed horizontally, approaching a step function; https://hackernoon.com/hn-images/1*N7dfPwbiXC-Kk4TCbfRerA.png]

Then we approximate the original function using a series of straight line segments.

[Figure: approximation of a function by a series of segments; https://hackernoon.com/hn-images/1*hVuJgUTLUFWTMmJhl_fomg.png]