ECE 6254 - Spring 2020 - Lecture 7 v1.0 - revised January 30, 2020

Linear Discriminant Analysis and Logistic Regression

Matthieu R. Bloch

1 Linear Discriminant Analysis

Linear Discriminant Analysis (LDA) is an attempt to improve on one of the shortcomings of Naive Bayes, namely the assumption that, given a label, the features are independent. Instead, LDA models the features as jointly Gaussian, with a covariance matrix that is class-independent. Specifically, let x = [x1, · · · , xd]⊺ ∈ Rd be a random feature vector and let y be the label. LDA posits that given y = k the feature vector x has a Gaussian distribution Px|y ∼ N(µk, Σ). Note that the mean µk is class dependent but the covariance matrix Σ is class independent. It will be convenient to denote a Gaussian multivariate distribution with parameters µ and Σ by

$$\phi(x; \mu, \Sigma) \triangleq \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu)^\intercal \Sigma^{-1} (x-\mu)\right). \tag{1}$$

Given this model, LDA then estimates the parameters µk and Σ, as well as the prior πk, from the data.

Lemma 1.1. Let Nk be the number of data points with label k. The Maximum Likelihood Estimators (MLEs) for LDA are

$$\forall k \quad \hat{\pi}_k = \frac{N_k}{N}, \tag{2}$$

$$\forall k \quad \hat{\mu}_k = \frac{1}{N_k} \sum_{i: y_i = k} x_i, \tag{3}$$

$$\hat{\Sigma} = \frac{1}{N} \sum_{k=0}^{K-1} \sum_{i: y_i = k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^\intercal. \tag{4}$$
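Before turning to the proof, here is a minimal NumPy sketch of the estimators (2)-(4); the function name lda_mle and the arrays X (N × d) and y (labels in {0, . . . , K − 1}) are illustrative placeholders, not part of the lecture.

```python
import numpy as np

def lda_mle(X, y, K):
    """Compute the LDA MLEs (2)-(4): class priors, class means,
    and the pooled (class-independent) covariance matrix."""
    N, d = X.shape
    pi_hat = np.zeros(K)
    mu_hat = np.zeros((K, d))
    Sigma_hat = np.zeros((d, d))
    for k in range(K):
        Xk = X[y == k]
        Nk = Xk.shape[0]
        pi_hat[k] = Nk / N                  # (2)
        mu_hat[k] = Xk.mean(axis=0)         # (3)
        R = Xk - mu_hat[k]
        Sigma_hat += R.T @ R                # accumulate within-class scatter
    Sigma_hat /= N                          # (4), note the 1/N normalization
    return pi_hat, mu_hat, Sigma_hat
```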

Proof. The MLE for the prior class distributions was already derived in Lecture 4. What is perhaps a bit surprising is that the joint MLE for all the parameters θ ≜ ({πk}k, {µk}k, Σ) takes the form given above. The likelihood of the parameters is

$$L(\theta) = \prod_{i=1}^{N} \prod_{k=0}^{K-1} \pi_k^{\mathbb{1}\{y_i = k\}} \, \phi(x_i; \mu_k, \Sigma)^{\mathbb{1}\{y_i = k\}}, \tag{5}$$

so that the log-likelihood takes the form

$$\ell(\theta) = \sum_{i=1}^{N} \sum_{k=0}^{K-1} \mathbb{1}\{y_i = k\} \left( \ln \pi_k - \frac{1}{2}(x_i - \mu_k)^\intercal \Sigma^{-1} (x_i - \mu_k) \right) - \frac{Nd}{2}\ln(2\pi) - \frac{N}{2}\ln|\Sigma| \tag{6}$$

$$= \underbrace{\sum_{k=0}^{K-1} N_k \ln \pi_k}_{\triangleq \ell_1(\theta)} + \underbrace{\sum_{k=0}^{K-1} \sum_{i=1}^{N} -\frac{\mathbb{1}\{y_i = k\}}{2} (x_i - \mu_k)^\intercal \Sigma^{-1} (x_i - \mu_k) - \frac{Nd}{2}\ln(2\pi) - \frac{N}{2}\ln|\Sigma|}_{\triangleq \ell_2(\theta)}. \tag{7}$$


Note that {πk} do not interact with {µk} and Σ. Consequently, the MLE of {πk} is the one we studied previously, $\hat{\pi}_k = \frac{N_k}{N}$ where $N_k = \sum_{i=1}^{N} \mathbb{1}\{y_i = k\}$.

Let us focus on maximizing ℓ2(θ). Taking the gradient with respect to µk and setting it to 0 yields

$$\frac{\partial \ell_2(\theta)}{\partial \mu_k} = \sum_{i=1}^{N} -\frac{\mathbb{1}\{y_i = k\}}{2} \left( -2\Sigma^{-1} x_i + 2\Sigma^{-1} \mu_k \right) \tag{8}$$

$$= \Sigma^{-1} \left( \sum_{i: y_i = k} x_i - N_k \mu_k \right) \tag{9}$$

$$= 0. \tag{10}$$

Conveniently, note that the factor Σ−1 (assumed non-singular) does not affect the solution, and we obtain

$$\mu_k = \frac{1}{N_k} \sum_{i: y_i = k} x_i.$$

Finally, to take the gradient with respect to Σ, we rewrite ℓ2(θ) as

$$\ell_2(\theta) = \sum_{k=0}^{K-1} \sum_{i=1}^{N} -\frac{\mathbb{1}\{y_i = k\}}{2} \operatorname{tr}\left( (x_i - \mu_k)^\intercal \Sigma^{-1} (x_i - \mu_k) \right) - \frac{Nd}{2}\ln(2\pi) - \frac{N}{2}\ln|\Sigma| \tag{11}$$

$$= -\frac{1}{2} \operatorname{tr}\Bigg( \Sigma^{-1} \underbrace{\sum_{k=0}^{K-1} \sum_{i: y_i = k} (x_i - \mu_k)(x_i - \mu_k)^\intercal}_{\triangleq S} \Bigg) - \frac{Nd}{2}\ln(2\pi) - \frac{N}{2}\ln|\Sigma|, \tag{12}$$

and we obtain (check the Matrix Cookbook for the derivation rules)

$$\frac{\partial \ell_2(\theta)}{\partial \Sigma} = -\frac{1}{2}\left( -\Sigma^{-1} S \Sigma^{-1} + N \Sigma^{-1} \right) = \frac{1}{2} \Sigma^{-1} \left( S \Sigma^{-1} - N \mathbf{I} \right) = 0. \tag{13}$$

Again, for Σ−1 non-singular, we obtain Σ = S/N.
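As a sanity check on Lemma 1.1, here is a minimal NumPy/SciPy sketch that evaluates the log-likelihood (6) and verifies that the closed-form estimators are not improved upon by a perturbed parameter choice; the synthetic data and the helper lda_log_likelihood are illustrative assumptions, not part of the lecture.

```python
import numpy as np
from scipy.stats import multivariate_normal

def lda_log_likelihood(X, y, pi, mu, Sigma):
    """Evaluate the LDA log-likelihood (6) on labeled data."""
    ll = 0.0
    for k in range(len(pi)):
        Xk = X[y == k]
        ll += Xk.shape[0] * np.log(pi[k])                          # the ell_1 term
        ll += multivariate_normal.logpdf(Xk, mean=mu[k], cov=Sigma).sum()
    return ll

# Synthetic two-class data sharing a covariance matrix, as LDA assumes.
rng = np.random.default_rng(0)
X0 = rng.multivariate_normal([0.0, 0.0], np.eye(2), size=200)
X1 = rng.multivariate_normal([2.0, 1.0], np.eye(2), size=100)
X = np.vstack([X0, X1])
y = np.array([0] * 200 + [1] * 100)

# Closed-form MLEs from Lemma 1.1.
pi_hat = np.array([200 / 300, 100 / 300])
mu_hat = np.array([X0.mean(axis=0), X1.mean(axis=0)])
S = sum((Xk - Xk.mean(axis=0)).T @ (Xk - Xk.mean(axis=0)) for Xk in (X0, X1))
Sigma_hat = S / 300

# Perturbing the MLEs should never increase the log-likelihood.
ll_mle = lda_log_likelihood(X, y, pi_hat, mu_hat, Sigma_hat)
ll_alt = lda_log_likelihood(X, y, pi_hat, mu_hat + 0.3, 1.5 * Sigma_hat)
assert ll_mle >= ll_alt
```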

You might notice that the covariance estimator is biased, but the bias vanishes as the number of points gets large. In practice, you could choose any other estimator of your liking; we will discuss this again in the context of the bias-variance tradeoff.

Lemma 1.2. The LDA classifier is

$$h_{\mathrm{LDA}}(x) = \operatorname*{argmin}_{k} \left( \frac{1}{2}(x - \hat{\mu}_k)^\intercal \hat{\Sigma}^{-1} (x - \hat{\mu}_k) - \log \hat{\pi}_k \right). \tag{14}$$

For K = 2, the LDA classifier is a linear classifier.
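Before the proof, here is a minimal NumPy sketch of the decision rule (14); the function name lda_predict and its arguments are hypothetical placeholders (e.g., the estimates produced by the earlier fitting sketch).

```python
import numpy as np

def lda_predict(x, pi_hat, mu_hat, Sigma_hat):
    """LDA classifier (14): return the class minimizing the discriminant."""
    Sigma_inv = np.linalg.inv(Sigma_hat)
    scores = [0.5 * (x - m) @ Sigma_inv @ (x - m) - np.log(p)
              for m, p in zip(mu_hat, pi_hat)]
    return int(np.argmin(scores))
```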

Proof. The first part of the lemma follows by remembering that for a plug-in classifier, we have


h(x) ≜ argmaxk ηk(x). Here,

$$\operatorname*{argmax}_{k} \eta_k(x) = \operatorname*{argmax}_{k} P_{y|x}(k|x) \tag{15}$$

$$\stackrel{(a)}{=} \operatorname*{argmax}_{k} P_{x|y}(x|k)\,\hat{\pi}_k \tag{16}$$

$$\stackrel{(b)}{=} \operatorname*{argmax}_{k} \left( \log P_{x|y}(x|k) + \log \hat{\pi}_k \right) \tag{17}$$

$$= \operatorname*{argmax}_{k} \left( -\log\left( (2\pi)^{d/2} |\hat{\Sigma}|^{1/2} \right) - \frac{1}{2}(x - \hat{\mu}_k)^\intercal \hat{\Sigma}^{-1} (x - \hat{\mu}_k) + \log \hat{\pi}_k \right) \tag{18}$$

$$\stackrel{(c)}{=} \operatorname*{argmin}_{k} \left( \frac{1}{2}(x - \hat{\mu}_k)^\intercal \hat{\Sigma}^{-1} (x - \hat{\mu}_k) - \log \hat{\pi}_k \right), \tag{19}$$

where (a) follows by Bayes’ rule and the fact that Px does not depend on k; (b) follows because x → log x is increasing; (c) follows by dropping all the terms that do not depend on k and the fact that argmaxx f(x) = argminx −f(x).

For K = 2, notice that the classifier is effectively performing the test

$$\eta_0(x) \lessgtr \eta_1(x) \;\Leftrightarrow\; \frac{1}{2}(x - \hat{\mu}_0)^\intercal \hat{\Sigma}^{-1} (x - \hat{\mu}_0) - \log \hat{\pi}_0 \;\gtrless\; \frac{1}{2}(x - \hat{\mu}_1)^\intercal \hat{\Sigma}^{-1} (x - \hat{\mu}_1) - \log \hat{\pi}_1 \tag{20}$$

$$\Leftrightarrow\; -\hat{\mu}_0^\intercal \hat{\Sigma}^{-1} x + \frac{1}{2}\hat{\mu}_0^\intercal \hat{\Sigma}^{-1} \hat{\mu}_0 - \log \hat{\pi}_0 \;\gtrless\; -\hat{\mu}_1^\intercal \hat{\Sigma}^{-1} x + \frac{1}{2}\hat{\mu}_1^\intercal \hat{\Sigma}^{-1} \hat{\mu}_1 - \log \hat{\pi}_1$$

$$\Leftrightarrow\; \underbrace{(\hat{\mu}_1 - \hat{\mu}_0)^\intercal \hat{\Sigma}^{-1}}_{\triangleq w^\intercal} x + \underbrace{\frac{1}{2}\hat{\mu}_0^\intercal \hat{\Sigma}^{-1} \hat{\mu}_0 - \log \hat{\pi}_0 - \frac{1}{2}\hat{\mu}_1^\intercal \hat{\Sigma}^{-1} \hat{\mu}_1 + \log \hat{\pi}_1}_{\triangleq b} \;\gtrless\; 0 \tag{21}$$

$$\Leftrightarrow\; w^\intercal x + b \gtrless 0. \tag{22}$$

The set H ≜ {x ∈ Rd : w⊺x + b = 0} is a hyperplane, which is an affine subspace of Rd of dimension d − 1. H acts as a linear boundary between the two classes that we are trying to distinguish, and the test in (22) is simply checking on what side of the hyperplane the point x lies.
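For the binary case, here is a minimal NumPy sketch of the hyperplane form (21)-(22); the arrays pi_hat, mu_hat, Sigma_hat are hypothetical inputs (e.g., the MLEs from Lemma 1.1), and for any x this rule agrees with the argmin rule (14).

```python
import numpy as np

def lda_binary_hyperplane(pi_hat, mu_hat, Sigma_hat):
    """Return (w, b) of the linear rule w^T x + b >< 0 from (21)-(22)."""
    Sigma_inv = np.linalg.inv(Sigma_hat)
    w = Sigma_inv @ (mu_hat[1] - mu_hat[0])
    b = (0.5 * mu_hat[0] @ Sigma_inv @ mu_hat[0] - np.log(pi_hat[0])
         - 0.5 * mu_hat[1] @ Sigma_inv @ mu_hat[1] + np.log(pi_hat[1]))
    return w, b

def lda_binary_predict(x, w, b):
    """Classify as 1 when x lies on the positive side of the hyperplane."""
    return int(w @ x + b > 0)
```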

To conclude on LDA, note that the generative model Px|y ∼ N(µk, Σ) is rarely accurate. In addition, there are quite a few parameters to estimate, including K − 1 class priors, Kd means, and ½ d(d + 1) elements of the covariance matrix (see the quick count below). This works well if N ≫ d but works poorly if N ≪ d without other tricks (dimensionality reduction, structured covariance) that we will discuss later. A natural extension of LDA is Quadratic Discriminant Analysis (QDA), in which we allow the covariance matrix Σk to vary with each class k. This results in a quadratic decision boundary instead of the linear boundary established in Lemma 1.2. However, perhaps the biggest issue with LDA is, in Vapnik’s words, that “one should solve the [classification] problem directly and never solve a more general problem as an intermediate step [such as modeling P(x|y)].” With LDA, as should be clear from Lemma 1.1, we are actually modeling the entire joint distribution Px,y, when we really only care about ηk(x) for classification.
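As a quick, hypothetical illustration of the parameter count above (the values d = 100 and K = 2 are made up for the example):

```python
# Number of parameters LDA must estimate for d features and K classes.
d, K = 100, 2
n_params = (K - 1) + K * d + d * (d + 1) // 2   # priors + means + covariance
print(n_params)  # 5251, already large compared to a modest sample size N
```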

With Vapnik’s word of caution in mind, let us revisit one last time the binary classifier with LDA. You should check for yourself that

$$\eta_1(x) = \frac{\hat{\pi}_1 \phi(x; \hat{\mu}_1, \hat{\Sigma})}{\hat{\pi}_1 \phi(x; \hat{\mu}_1, \hat{\Sigma}) + \hat{\pi}_0 \phi(x; \hat{\mu}_0, \hat{\Sigma})} = \frac{1}{1 + \exp(-(w^\intercal x + b))}, \tag{23}$$


where w and b are defined as per (22). In other words, we do not need to estimate the full joint distribution. All that seems to be required are the parameters w and b, and LDA makes a detour to compute these parameters as a function of the mean and covariance matrix of a Gaussian distribution. The direct estimation of these parameters leads to another linear classifier called the logistic classifier.

2 Logistic regression

The key idea behind (binary) logistic regression is to assume that η1(x) is of the form

$$\frac{1}{1 + \exp(-(w^\intercal x + b))} \triangleq 1 - \eta_0(x), \tag{24}$$

and to directly estimate ŵ and b̂ from the data. One therefore obtains an estimate of the conditional distribution Py|x(1|x) as

$$\hat{\eta}_1(x) = \frac{1}{1 + \exp(-(\hat{w}^\intercal x + \hat{b}))}. \tag{25}$$

Since the function x → 1/(1 + e−x) is called the logistic map, the corresponding classifier inherited the name and is defined as

$$h_{\mathrm{LR}}(x) = \mathbb{1}\left\{ \hat{\eta}_1(x) \geqslant \frac{1}{2} \right\} = \mathbb{1}\left\{ \hat{w}^\intercal x + \hat{b} \geqslant 0 \right\}. \tag{26}$$

This is again a linear classifier. Note that LDA led to a similar classifier with the specific choice of parameters (see (22))

$$\hat{w} = \hat{\Sigma}^{-1}(\hat{\mu}_1 - \hat{\mu}_0), \qquad \hat{b} = \frac{1}{2}\hat{\mu}_0^\intercal \hat{\Sigma}^{-1} \hat{\mu}_0 - \frac{1}{2}\hat{\mu}_1^\intercal \hat{\Sigma}^{-1} \hat{\mu}_1 + \log\frac{\hat{\pi}_1}{\hat{\pi}_0}. \tag{27}$$

Note that this is not what the MLE of (ŵ, b̂) would result in, and we will analyze this in more detail.
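To make the direct estimation concrete, here is a minimal NumPy sketch that fits (ŵ, b̂) by gradient ascent on the logistic log-likelihood; the data, step size, and iteration count are hypothetical choices for illustration, not values from the lecture.

```python
import numpy as np

def fit_logistic_regression(X, y, lr=0.1, n_iter=5000):
    """Maximize sum_i [y_i log eta_1(x_i) + (1 - y_i) log(1 - eta_1(x_i))]
    over (w, b) by gradient ascent, with eta_1 as in (25)."""
    N, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(n_iter):
        eta1 = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # current estimate of P(y=1|x)
        grad_w = X.T @ (y - eta1) / N              # gradient of the averaged log-likelihood
        grad_b = np.mean(y - eta1)
        w += lr * grad_w
        b += lr * grad_b
    return w, b

# Hypothetical two-class Gaussian data, matching the LDA setting above.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (200, 2)), rng.normal(2.0, 1.0, (100, 2))])
y = np.array([0] * 200 + [1] * 100)
w_hat, b_hat = fit_logistic_regression(X, y)
labels = (X @ w_hat + b_hat >= 0).astype(int)      # the classifier (26)
```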