

  1. Lecture #13: Discriminant Analysis. Data Science 1: CS 109A, STAT 121A, AC 209A, E-109A. Pavlos Protopapas, Kevin Rader, Margo Levine, Rahul Dave

  2. Lecture Outline: Discriminant Analysis; LDA for one predictor; LDA for p > 1; QDA; Comparison of Classification Methods (so far)

  3. Discriminant Analysis

  4. Classification Methods. By the end of Module 2, we will have learned the following classification methods: 1. Logistic Regression; 2. k-NN; 3. Discriminant Analysis; 4. Classification Trees. Today's lecture is focused on Discriminant Analysis: linear (LDA) and quadratic (QDA). Wednesday's lecture will cover Classification Trees.

  5. Linear Discriminant Analysis (LDA). Linear discriminant analysis (LDA) takes a different approach to classification than logistic regression. Rather than attempting to model the conditional distribution of Y given X, P(Y = k | X = x), LDA models the distribution of the predictors X given the different categories that Y takes on, P(X = x | Y = k). To flip these distributions around and recover P(Y = k | X = x), an analyst uses Bayes' theorem. In this setting with one feature (one X), Bayes' theorem can then be written as: P(Y = k | X = x) = f_k(x) π_k / ∑_{j=1}^{K} f_j(x) π_j. What does this mean?

  6. Linear Discriminant Analysis (LDA). P(Y = k | X = x) = f_k(x) π_k / ∑_{j=1}^{K} f_j(x) π_j. The left-hand side, P(Y = k | X = x), is called the posterior probability and gives the probability that the observation is in the kth category given that the feature, X, takes on a specific value, x. The numerator on the right is the conditional distribution of the feature within category k, f_k(x), times the prior probability, π_k, that the observation is in the kth category. The Bayes classifier then assigns the observation to the group for which the posterior probability is largest.
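
To make the posterior calculation concrete, here is a minimal Python sketch with invented numbers (two classes, made-up priors and density values; nothing here comes from the lecture):

    # Hypothetical two-class example: priors pi_k and class-conditional
    # densities f_k(x) evaluated at one observed x (all values invented).
    pi = [0.7, 0.3]          # prior probabilities P(Y = k)
    f_at_x = [0.05, 0.20]    # f_k(x) for the observed value x

    # Bayes' theorem: the posterior is proportional to f_k(x) * pi_k.
    unnormalized = [f * p for f, p in zip(f_at_x, pi)]
    posteriors = [u / sum(unnormalized) for u in unnormalized]

    print(posteriors)                          # approximately [0.368, 0.632]
    print(posteriors.index(max(posteriors)))   # the Bayes classifier picks class 1

Note that the denominator is the same for every class, which is why only the numerator f_k(x) π_k matters when picking the largest posterior.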

  7. Inventor of LDA: R.A. Fisher. The 'Father' of Statistics. More famous for work in genetics (statistically concluded that Mendel's genetic experiments were 'massaged'). Novel statistical work includes: 1. Experimental Design; 2. ANOVA; 3. the F-test (why do you think it's called the F-test?); 4. Exact test for 2x2 tables; 5. Maximum Likelihood Theory; 6. Use of the α = 0.05 significance level: "The value for which P = .05, or 1 in 20, is 1.96 or nearly 2; it is convenient to take this point as a limit in judging whether a deviation is to be considered significant or not."; 7. And so much more...

  8. LDA for one predictor

  9. LDA for one predictor. LDA has the simplest form when there is just one predictor/feature (p = 1). In order to estimate f_k(x), we have to assume it comes from a specific distribution. If X is quantitative, what distribution do you think we should use?

  10. LDA for one predictor. LDA has the simplest form when there is just one predictor/feature (p = 1). In order to estimate f_k(x), we have to assume it comes from a specific distribution. If X is quantitative, what distribution do you think we should use? One common assumption is that f_k(x) comes from a Normal distribution: f_k(x) = (1 / √(2πσ_k²)) exp( −(x − µ_k)² / (2σ_k²) ). In shorthand notation, this is often written as X | Y = k ∼ N(µ_k, σ_k²), meaning the distribution of the feature X within category k is Normally distributed with mean µ_k and variance σ_k².
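
A minimal Python sketch of this class-conditional density (the function name and example values are my own, not from the slides):

    import numpy as np

    def normal_density(x, mu_k, sigma2_k):
        """f_k(x) under the assumption X | Y = k ~ N(mu_k, sigma2_k)."""
        return np.exp(-(x - mu_k) ** 2 / (2 * sigma2_k)) / np.sqrt(2 * np.pi * sigma2_k)

    # Illustrative call: the density of a standard Normal at x = 1 is about 0.242.
    print(normal_density(x=1.0, mu_k=0.0, sigma2_k=1.0))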

  11. LDA for one predictor (cont.). An extra assumption that the variances are equal, σ_1² = σ_2² = ... = σ_K², will simplify our lives. Plugging this assumed 'likelihood' (aka, distribution) into the Bayes' formula (to get the posterior) results in: P(Y = k | X = x) = π_k (1/√(2πσ²)) exp( −(x − µ_k)² / (2σ²) ) / ∑_{j=1}^{K} π_j (1/√(2πσ²)) exp( −(x − µ_j)² / (2σ²) ). The Bayes classifier chooses the class k that maximizes this for the observed value of x. How should we maximize?

  12. LDA for one predictor (cont.). An extra assumption that the variances are equal, σ_1² = σ_2² = ... = σ_K², will simplify our lives. Plugging this assumed 'likelihood' (aka, distribution) into the Bayes' formula (to get the posterior) results in: P(Y = k | X = x) = π_k (1/√(2πσ²)) exp( −(x − µ_k)² / (2σ²) ) / ∑_{j=1}^{K} π_j (1/√(2πσ²)) exp( −(x − µ_j)² / (2σ²) ). The Bayes classifier chooses the class k that maximizes this for the observed value of x. How should we maximize? So we take the log of this expression and rearrange to simplify our maximization...
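
The step summarized as "take the log and rearrange" can be sketched as follows (my own filling-in of the algebra, in LaTeX notation; the denominator and any term that is the same for every k can be ignored, since they do not affect which class attains the maximum):

    \log\left(\pi_k f_k(x)\right)
      = \log \pi_k - \frac{(x - \mu_k)^2}{2\sigma^2} + \text{const}
      = \log \pi_k + \frac{x \mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2}
        - \frac{x^2}{2\sigma^2} + \text{const}

Dropping −x²/(2σ²) and the constant, neither of which depends on k, leaves exactly the discriminant δ_k(x) shown on the next slide.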

  13. LDA for one predictor (cont.). So in order to perform classification, we maximize the following simplified expression: δ_k(x) = x µ_k / σ² − µ_k² / (2σ²) + log π_k. How does this simplify if we have just two classes (K = 2) and if we set our prior probabilities to be equal?

  14. LDA for one predictor (cont.). So in order to perform classification, we maximize the following simplified expression: δ_k(x) = x µ_k / σ² − µ_k² / (2σ²) + log π_k. How does this simplify if we have just two classes (K = 2) and if we set our prior probabilities to be equal? This is equivalent to choosing a decision boundary for x for which x = (µ_1² − µ_2²) / (2(µ_1 − µ_2)) = (µ_1 + µ_2) / 2. Intuitively, why does this expression make sense? What do we use in practice?

  15. LDA for one predictor (cont.). In practice we don't know the true mean, variance, and prior, so we estimate them with the classical estimates and plug them into the expression: µ̂_k = (1/n_k) ∑_{i: y_i = k} x_i and σ̂² = (1/(n − K)) ∑_{k=1}^{K} ∑_{i: y_i = k} (x_i − µ̂_k)², where n is the total sample size and n_k is the sample size within class k (thus, n = ∑_k n_k).
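
A minimal Python sketch of these plug-in estimates (the function name and the assumption that x and y are 1-D NumPy arrays are mine, not the lecture's):

    import numpy as np

    def lda_estimates(x, y):
        """Class means and pooled variance estimate for one-predictor LDA.
        x: 1-D array of feature values; y: 1-D array of class labels."""
        classes = np.unique(y)
        n, K = len(x), len(classes)
        mu_hat = np.array([x[y == k].mean() for k in classes])
        # Pooled variance: within-class squared deviations summed over all
        # classes, divided by n - K.
        ss = sum(((x[y == k] - m) ** 2).sum() for k, m in zip(classes, mu_hat))
        sigma2_hat = ss / (n - K)
        return classes, mu_hat, sigma2_hat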

  16. LDA for one predictor (cont.). This classifier works great if the classes are about equal in proportion, but it can easily be extended to unequal class sizes. Instead of assuming all priors are equal, we set the priors to match the 'prevalence' in the data set: π̂_k = n_k / n. Note: we can use a prior probability from knowledge of the subject as well; for example, if we expect the test set to have a different prevalence than the training set. How could we do this in the Cancer data set in HW 6?

  17. LDA for one predictor (cont.). Plugging all of these estimates back into the original logged maximization formula, we get: δ̂_k(x) = x µ̂_k / σ̂² − µ̂_k² / (2σ̂²) + log π̂_k. Thus this classifier is called the linear discriminant classifier: the discriminant function is a linear function of x.
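
Putting the pieces together, here is a self-contained sketch of the one-predictor linear discriminant classifier (function name is my own; x_train and y_train are assumed to be 1-D NumPy arrays):

    import numpy as np

    def lda_fit_predict(x_new, x_train, y_train):
        """Classify new scalar observations with one-predictor LDA plug-in estimates."""
        classes = np.unique(y_train)
        n, K = len(x_train), len(classes)
        mu = np.array([x_train[y_train == k].mean() for k in classes])
        sigma2 = sum(((x_train[y_train == k] - m) ** 2).sum()
                     for k, m in zip(classes, mu)) / (n - K)
        pi = np.array([np.mean(y_train == k) for k in classes])
        # delta_k(x) = x * mu_k / sigma^2 - mu_k^2 / (2 sigma^2) + log(pi_k)
        x_new = np.atleast_1d(x_new).astype(float)
        delta = np.outer(x_new, mu) / sigma2 - mu ** 2 / (2 * sigma2) + np.log(pi)
        return classes[np.argmax(delta, axis=1)]

On simulated one-dimensional data this should agree with scikit-learn's LinearDiscriminantAnalysis (in sklearn.discriminant_analysis) up to how the priors are specified, which is a reasonable way to sanity-check the sketch.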

  18. Illustration of LDA when p = 1 [figure]

  19. LDA for p > 1

  20. LDA when p > 1. LDA generalizes 'nicely' to the case when there is more than one predictor. Instead of assuming the one predictor is Normally distributed, it assumes that the set of predictors for each class is 'multivariate normal distributed' (shorthand: MVN). What does that mean?

  21. LDA when p > 1. LDA generalizes 'nicely' to the case when there is more than one predictor. Instead of assuming the one predictor is Normally distributed, it assumes that the set of predictors for each class is 'multivariate normal distributed' (shorthand: MVN). What does that mean? This means that the vector of X for an observation has a multidimensional normal distribution with a mean vector, µ, and a covariance matrix, Σ.
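
To get a feel for the mean vector and covariance matrix, here is a small sketch that draws from an MVN with p = 2 (all numbers invented for illustration):

    import numpy as np

    mu = np.array([1.0, -0.5])            # mean vector: one entry per predictor
    Sigma = np.array([[1.0, 0.3],
                      [0.3, 2.0]])        # p x p covariance matrix (symmetric)

    rng = np.random.default_rng(0)
    X_k = rng.multivariate_normal(mu, Sigma, size=500)   # draws of X | Y = k

    print(X_k.mean(axis=0))   # close to mu
    print(np.cov(X_k.T))      # close to Sigma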

  22. MVN distribution for 2 variables. Here is a visualization of the Multivariate Normal distribution with 2 variables: [figure]

  23. MVN distribution. The joint PDF of the Multivariate Normal distribution, X ∼ MVN(µ, Σ), is: f(x) = 1 / ((2π)^{p/2} |Σ|^{1/2}) exp( −(1/2) (x − µ)ᵀ Σ⁻¹ (x − µ) ), where x is a p-dimensional vector and |Σ| is the determinant of the p × p covariance matrix. Let's do a quick dimension analysis sanity check...

  24. MVN distribution. The joint PDF of the Multivariate Normal distribution, X ∼ MVN(µ, Σ), is: f(x) = 1 / ((2π)^{p/2} |Σ|^{1/2}) exp( −(1/2) (x − µ)ᵀ Σ⁻¹ (x − µ) ), where x is a p-dimensional vector and |Σ| is the determinant of the p × p covariance matrix. Let's do a quick dimension analysis sanity check... What do µ and Σ look like?
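
A direct transcription of this density into Python, as a sketch (the function name is mine; using a linear solve instead of an explicit matrix inverse is just a standard numerical choice):

    import numpy as np

    def mvn_pdf(x, mu, Sigma):
        """Evaluate the multivariate Normal density f(x) at a p-vector x."""
        p = len(mu)
        diff = x - mu
        norm_const = 1.0 / ((2 * np.pi) ** (p / 2) * np.sqrt(np.linalg.det(Sigma)))
        return norm_const * np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff))

    # Quick dimension check: mu is a length-p vector, Sigma is p x p, and the
    # quadratic form (x - mu)^T Sigma^{-1} (x - mu) collapses to a scalar.
    mu = np.array([0.0, 0.0])
    Sigma = np.eye(2)
    print(mvn_pdf(np.array([0.0, 0.0]), mu, Sigma))   # 1 / (2*pi), about 0.159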
