

  1. Pattern Recognition 2019: Linear Models for Classification (2). Ad Feelders, Universiteit Utrecht, December 11, 2019.

  2. Two types of approaches to classification. Discriminative models ("regression"; section 4.3) model only the conditional distribution of t given x. Generative models ("density estimation"; section 4.2) model the joint distribution of t and x.

  3. Generative Models. In classification we want to estimate p(C_k | x). In generative models, we use Bayes' rule:

     $$p(C_k \mid x) = \frac{p(C_k)\, p(x \mid C_k)}{\sum_{j=1}^{K} p(C_j)\, p(x \mid C_j)},$$

     where the p(x | C_j) are the class-conditional probability distributions and the p(C_j) are the unconditional ("prior") probabilities of each class.
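     As a minimal illustration of this rule (not from the slides; the function name and array-based interface are my own), a posterior can be computed from the priors and the class-conditional densities evaluated at a single point x:

     ```python
     import numpy as np

     def posterior(priors, likelihoods):
         # priors[j] = p(C_j); likelihoods[j] = p(x | C_j) at one point x
         joint = np.asarray(priors) * np.asarray(likelihoods)
         return joint / joint.sum()  # Bayes' rule; denominator sums over all K classes
     ```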

  4. Generative Models. The training data are partitioned into subsets D = {D_1, ..., D_K} by class label. The data in D_j are used to estimate p(x | C_j). The prior probabilities p(C_j) are estimated from the observed class frequencies. These estimates are plugged into Bayes' formula to obtain the probability estimates $\hat{p}(C_k \mid x)$.

  5. Generative Models: example (discrete features). Test mailing data:

                    respondents              non-respondents
     age        male  female  total       male  female  total
     18-25        15      10     25          7       3     10
     26-35        15      20     35         10      10     20
     36-50        10      10     20         10      20     30
     51-64        10       5     15         40      40     80
     65+           5       0      5         40      20     60
     total        55      45    100        107      93    200

  6. Generative Models: example. $\hat{p}(\text{respondent}) = 100/300 = 1/3$ and $\hat{p}(\text{non-respondent}) = 2/3$. Class-conditional estimates:

                    respondents              non-respondents
     age        male  female  total       male  female  total
     18-25      0.15    0.10   0.25      0.035   0.015   0.05
     26-35      0.15    0.20   0.35      0.05    0.05    0.10
     36-50      0.10    0.10   0.20      0.05    0.10    0.15
     51-64      0.10    0.05   0.15      0.20    0.20    0.40
     65+        0.05    0.00   0.05      0.20    0.10    0.30
     total      0.55    0.45   1.00      0.535   0.465   1.00

  7. Using Bayes' Rule. Estimated probability of response for an 18-25-year-old male (R = respondent, M = male):

     $$\hat{p}(R \mid 18\text{-}25, M) = \frac{\hat{p}(18\text{-}25, M \mid R)\, \hat{p}(R)}{\hat{p}(18\text{-}25, M)} = \frac{0.15 \times 1/3}{0.15 \times 1/3 + 0.035 \times 2/3} \approx 0.68$$

     Assign the person to the respondents, because this is the class with the highest estimated probability for 18-25-year-old males.
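     A quick check of this calculation (a sketch; the numbers are read straight off the tables on slides 5 and 6):

     ```python
     p_r, p_nr = 1/3, 2/3              # priors: respondent, non-respondent
     lik_r, lik_nr = 0.15, 0.035       # p(18-25, male | class) from slide 6
     post = p_r * lik_r / (p_r * lik_r + p_nr * lik_nr)
     print(round(post, 2))             # 0.68 -> assign to respondents
     ```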

  8. Curse of Dimensionality. With D input variables of m possible values each, we have to estimate $m^D - 1$ probabilities per group. For D = 10 and m = 5: $5^{10} - 1 = 9{,}765{,}624$ probabilities. If N = 1000, almost all cells are empty; we have $1000 / 9{,}765{,}624 \approx 0.0001$ observations per cell. Curse of dimensionality: in high dimensions, almost all of the input space is empty.
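     The arithmetic is easy to verify (a one-off sketch):

     ```python
     D, m, N = 10, 5, 1000
     cells = m**D - 1                  # free probabilities per group
     print(cells)                      # 9765624
     print(N / cells)                  # ~0.0001 observations per cell
     ```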

  9. Naive Bayes Assumption. Assume the input variables are independent within each group, i.e.

     $$p(x \mid C_k) = p(x_1 \mid C_k)\, p(x_2 \mid C_k) \cdots p(x_D \mid C_k)$$

     Instead of $m^D - 1$ parameters, we only have to estimate $D(m - 1)$ parameters per group. So with D = 10 and m = 5, we only have to estimate 40 probabilities per group.

  10. Using Naive Bayes. Estimated probability of response for an 18-25-year-old male with naive Bayes:

     $$\hat{p}(R \mid 18\text{-}25, M) = \frac{\hat{p}(18\text{-}25 \mid R)\, \hat{p}(M \mid R)\, \hat{p}(R)}{\hat{p}(18\text{-}25, M)} = \frac{0.25 \times 0.55 \times 1/3}{0.25 \times 0.55 \times 1/3 + 0.05 \times 0.535 \times 2/3} \approx 0.72$$

     The probability estimate is higher, but both models lead to the same allocation.
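     The same check under the naive Bayes factorization (a sketch; the marginals come from the "total" row and column of slide 6):

     ```python
     p_r, p_nr = 1/3, 2/3
     num = 0.25 * 0.55 * p_r           # p(18-25|R) * p(M|R) * p(R)
     den = num + 0.05 * 0.535 * p_nr   # + p(18-25|NR) * p(M|NR) * p(NR)
     print(round(num / den, 2))        # 0.72
     ```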

  11. Continuous features: normal distribution. Suppose $x \sim N(\mu, \Sigma)$, with

     $$x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}, \qquad \mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{21} & \sigma_2^2 \end{pmatrix}$$

     Correlation coefficient: $\rho_{12} = \dfrac{\sigma_{12}}{\sigma_1 \sigma_2}$.

  12. Contour Plot 1: independent, same variance. [Contour plot of the bivariate normal density with $\mu_1 = 0$, $\mu_2 = 0$, $\rho_{12} = 0$, $\sigma_1^2 = \sigma_2^2 = 1$.]

  13. Contour Plot 2: positive correlation. [Contour plot with $\mu_1 = 10$, $\mu_2 = 25$, $\rho_{12} = 0.7$, $\sigma_1^2 = \sigma_2^2 = 1$.]

  14. Contour Plot 3: negative correlation. [Contour plot with $\mu_1 = 15$, $\mu_2 = 5$, $\rho_{12} = -0.6$, $\sigma_1^2 = \sigma_2^2 = 1$.]

  15. Multivariate Normal Distribution. D variables, i.e. $x = [x_1, \ldots, x_D]^\top$, with

     $$\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_D \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \sigma_1^2 & \sigma_{12} & \sigma_{13} & \cdots & \sigma_{1D} \\ \sigma_{21} & \sigma_2^2 & \sigma_{23} & \cdots & \sigma_{2D} \\ \vdots & & & & \vdots \\ \sigma_{D1} & \sigma_{D2} & \sigma_{D3} & \cdots & \sigma_D^2 \end{pmatrix}$$

     Formula for the normal probability density:

     $$p(x) = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\left( -\tfrac{1}{2} (x - \mu)^\top \Sigma^{-1} (x - \mu) \right)$$
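     A direct transcription of this density into code (a sketch; `mvn_density` is my own name, and `scipy.stats.multivariate_normal` would do the same job):

     ```python
     import numpy as np

     def mvn_density(x, mu, sigma):
         # p(x) = (2 pi)^(-D/2) |Sigma|^(-1/2) exp(-0.5 (x-mu)^T Sigma^-1 (x-mu))
         d = np.asarray(x) - np.asarray(mu)
         D = d.shape[0]
         quad = d @ np.linalg.solve(sigma, d)   # (x-mu)^T Sigma^-1 (x-mu)
         norm = (2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(sigma))
         return np.exp(-0.5 * quad) / norm
     ```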

  16. Normality Assumption in Classification. If in class k, $x \sim N(\mu_k, \Sigma_k)$, then p(x | C_k) has the form

     $$p(x \mid C_k) = \frac{1}{(2\pi)^{D/2} |\Sigma_k|^{1/2}} \exp\left( -\tfrac{1}{2} (x - \mu_k)^\top \Sigma_k^{-1} (x - \mu_k) \right)$$

  17. Normality Assumption in Classification. Estimating p(x | C_k) comes down to estimating the mean vector $\mu_k$ and the covariance matrix $\Sigma_k$ for each class. If there are D variables in x, then there are D means in the mean vector and $D(D+1)/2$ distinct elements in the (symmetric) covariance matrix, making a total of $(D^2 + 3D)/2$ parameters to be estimated for each class.
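     For instance, with D = 10 this count works out to 65 parameters per class (a one-line check):

     ```python
     D = 10
     print(D + D * (D + 1) // 2)       # (D**2 + 3*D) // 2 == 65
     ```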

  18. Optimal Allocation Rule. Assign x to group k if $p(C_k \mid x)$ is larger than $p(C_j \mid x)$ for all $j \neq k$. Via Bayes' formula this leads to the rule: assign to group k if

     $$p(x \mid C_k)\, p(C_k) > p(x \mid C_j)\, p(C_j) \quad \forall j \neq k$$

     (since the denominator cancels).

  19. Optimal Allocation Rule for Normal Densities. Fill in the formula for the normal density for p(x | C_k). Then we get the following optimal allocation rule: assign to group k if

     $$\frac{p(C_k)}{(2\pi)^{D/2} |\Sigma_k|^{1/2}} \exp\left( -\tfrac{1}{2} (x - \mu_k)^\top \Sigma_k^{-1} (x - \mu_k) \right) > \frac{p(C_j)}{(2\pi)^{D/2} |\Sigma_j|^{1/2}} \exp\left( -\tfrac{1}{2} (x - \mu_j)^\top \Sigma_j^{-1} (x - \mu_j) \right)$$

     for all $j \neq k$.

  20. Optimal Allocation Rule for Normal Densities. Take the natural logarithm:

     $$\ln\left[ \frac{p(C_k)}{(2\pi)^{D/2} |\Sigma_k|^{1/2}} \exp\left( -\tfrac{1}{2} (x - \mu_k)^\top \Sigma_k^{-1} (x - \mu_k) \right) \right] = -\tfrac{D}{2} \ln(2\pi) - \tfrac{1}{2} \ln|\Sigma_k| - \tfrac{1}{2} (x - \mu_k)^\top \Sigma_k^{-1} (x - \mu_k) + \ln p(C_k)$$

     Cancel the term $-\tfrac{D}{2}\ln(2\pi)$, which is common to all groups, and multiply by $-2$ (which reverses the inequality, so we will minimize the result):

     $$\ln|\Sigma_k| + (x - \mu_k)^\top \Sigma_k^{-1} (x - \mu_k) - 2 \ln p(C_k)$$

  21. Optimal Allocation Rule for Normal Densities. Discriminant function for class k:

     $$d_k(x) = \ln|\Sigma_k| - 2 \ln p(C_k) + (x - \mu_k)^\top \Sigma_k^{-1} (x - \mu_k) = \underbrace{\ln|\Sigma_k| - 2 \ln p(C_k) + \mu_k^\top \Sigma_k^{-1} \mu_k}_{\text{constant}} \; \underbrace{- 2 \mu_k^\top \Sigma_k^{-1} x}_{\text{linear}} \; + \underbrace{x^\top \Sigma_k^{-1} x}_{\text{quadratic}}$$

     Assign to class k if $d_k(x) < d_j(x)$ for all $j \neq k$.
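     A sketch of this rule in code (function names are my own; this is the quadratic discriminant from the slide, minimized over classes):

     ```python
     import numpy as np

     def discriminant(x, mu, sigma, prior):
         # d_k(x) = ln|Sigma_k| - 2 ln p(C_k) + (x-mu_k)^T Sigma_k^-1 (x-mu_k)
         d = x - mu
         _, logdet = np.linalg.slogdet(sigma)
         return logdet - 2 * np.log(prior) + d @ np.linalg.solve(sigma, d)

     def allocate(x, mus, sigmas, priors):
         # assign x to the class with the smallest discriminant value
         scores = [discriminant(x, m, S, p) for m, S, p in zip(mus, sigmas, priors)]
         return int(np.argmin(scores)) + 1  # classes numbered 1..K as on the slides
     ```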

  22. Estimation. Estimate $p(C_k)$, $\mu_k$, $\Sigma_k$ from the training data:

     $$\hat{p}(C_k) = \hat{p}(t = k) = \frac{N_k}{N},$$

     where $N_k$ is the number of observations from group k. The mean of $x_i$ in group k is estimated by

     $$\hat{\mu}_{i,k} = \bar{x}_{i,k} = \frac{1}{N_k} \sum_{n:\, t_n = k} x_{n,i}$$

     for $k = 1, \ldots, K$ and $i = 1, \ldots, D$.

  23. Estimation. Unbiased estimate of the covariance between $x_i$ and $x_j$ in group k:

     $$\hat{\Sigma}_k^{ij} = \frac{1}{N_k - 1} \sum_{n:\, t_n = k} (x_{n,i} - \bar{x}_{i,k})(x_{n,j} - \bar{x}_{j,k})$$

     for $k = 1, \ldots, K$ and $i, j = 1, \ldots, D$. If $j = i$, this is the variance of $x_i$ in group k.
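     These plug-in estimates are straightforward to compute (a sketch; `fit_gaussian_classes` is my own name, and `np.cov` with its default divisor $N_k - 1$ matches the unbiased estimate above):

     ```python
     import numpy as np

     def fit_gaussian_classes(X, t):
         # X: (N, D) data matrix; t: (N,) class labels
         priors, means, covs = [], [], []
         for k in np.unique(t):
             Xk = X[t == k]
             priors.append(len(Xk) / len(X))        # N_k / N
             means.append(Xk.mean(axis=0))          # sample mean per class
             covs.append(np.cov(Xk, rowvar=False))  # unbiased covariance
         return priors, means, covs
     ```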

  24. Numeric Example. Training data:

     x1   x2   t
      2    4   1
      3    6   1
      4   14   1
      4   18   2
      5   10   2
      6    8   2

  25. Estimates. Group 1:

     $$\hat{p}(C_1) = \tfrac{3}{6} = \tfrac{1}{2}, \qquad \bar{x}_1 = \begin{pmatrix} 3 \\ 8 \end{pmatrix}, \qquad \hat{\Sigma}_1 = \begin{pmatrix} 1 & 5 \\ 5 & 28 \end{pmatrix}$$

     Group 2:

     $$\hat{p}(C_2) = \tfrac{3}{6} = \tfrac{1}{2}, \qquad \bar{x}_2 = \begin{pmatrix} 5 \\ 12 \end{pmatrix}, \qquad \hat{\Sigma}_2 = \begin{pmatrix} 1 & -5 \\ -5 & 28 \end{pmatrix}$$
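     Running the `fit_gaussian_classes` sketch from slide 23 on this training set reproduces the slide's numbers:

     ```python
     import numpy as np

     X = np.array([[2, 4], [3, 6], [4, 14], [4, 18], [5, 10], [6, 8]], dtype=float)
     t = np.array([1, 1, 1, 2, 2, 2])
     priors, means, covs = fit_gaussian_classes(X, t)
     print(priors)       # [0.5, 0.5]
     print(means[0])     # [3. 8.]
     print(covs[0])      # [[ 1.  5.] [ 5. 28.]]
     print(means[1])     # [ 5. 12.]
     print(covs[1])      # [[ 1. -5.] [-5. 28.]]
     ```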
