

  1. Pattern Recognition 2019: Linear Models for Classification (2). Ad Feelders, Universiteit Utrecht, December 11, 2019.

  2. Two types of approaches to classification. Discriminative models ("regression"; section 4.3) model only the conditional distribution of t given x. Generative models ("density estimation"; section 4.2) model the joint distribution of t and x.

  3. Generative Models. In classification we want to estimate p(C_k | x). In generative models, we use Bayes' rule:

     $$p(C_k \mid x) = \frac{p(C_k)\, p(x \mid C_k)}{\sum_{j=1}^{K} p(C_j)\, p(x \mid C_j)},$$

     where the p(x | C_j) are the class-conditional probability distributions and the p(C_j) are the unconditional ("prior") probabilities of each class.
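     As a minimal illustration of this rule (not from the slides; the function name and array-based interface are my own), a posterior can be computed from the priors and the class-conditional densities evaluated at a single point x:

     ```python
     import numpy as np

     def posterior(priors, likelihoods):
         # priors[j] = p(C_j); likelihoods[j] = p(x | C_j) at one point x
         joint = np.asarray(priors) * np.asarray(likelihoods)
         return joint / joint.sum()  # Bayes' rule; denominator sums over all K classes
     ```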

  4. Generative Models. The training data are partitioned into subsets D = {D_1, ..., D_K} by class label. The data in D_j are used to estimate p(x | C_j). The prior probabilities p(C_j) are estimated from the observed class frequencies. These estimates are plugged into Bayes' formula to obtain the probability estimates $\hat{p}(C_k \mid x)$.

  5. Generative Models: example (discrete features). Test mailing data:

                    respondents              non-respondents
     age        male  female  total       male  female  total
     18-25        15      10     25          7       3     10
     26-35        15      20     35         10      10     20
     36-50        10      10     20         10      20     30
     51-64        10       5     15         40      40     80
     65+           5       0      5         40      20     60
     total        55      45    100        107      93    200

  6. Generative Models: example. $\hat{p}(\text{respondent}) = 100/300 = 1/3$ and $\hat{p}(\text{non-respondent}) = 2/3$. Class-conditional estimates:

                    respondents              non-respondents
     age        male  female  total       male  female  total
     18-25      0.15    0.10   0.25      0.035   0.015   0.05
     26-35      0.15    0.20   0.35      0.05    0.05    0.10
     36-50      0.10    0.10   0.20      0.05    0.10    0.15
     51-64      0.10    0.05   0.15      0.20    0.20    0.40
     65+        0.05    0.00   0.05      0.20    0.10    0.30
     total      0.55    0.45   1.00      0.535   0.465   1.00

  7. Using Bayes' Rule. Estimated probability of response for an 18-25-year-old male (R = respondent, M = male):

     $$\hat{p}(R \mid 18\text{-}25, M) = \frac{\hat{p}(18\text{-}25, M \mid R)\, \hat{p}(R)}{\hat{p}(18\text{-}25, M)} = \frac{0.15 \times 1/3}{0.15 \times 1/3 + 0.035 \times 2/3} \approx 0.68$$

     Assign the person to the respondents, because this is the class with the highest estimated probability for 18-25-year-old males.
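     A quick check of this calculation (a sketch; the numbers are read straight off the tables on slides 5 and 6):

     ```python
     p_r, p_nr = 1/3, 2/3              # priors: respondent, non-respondent
     lik_r, lik_nr = 0.15, 0.035       # p(18-25, male | class) from slide 6
     post = p_r * lik_r / (p_r * lik_r + p_nr * lik_nr)
     print(round(post, 2))             # 0.68 -> assign to respondents
     ```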

  8. Curse of Dimensionality. With D input variables of m possible values each, we have to estimate $m^D - 1$ probabilities per group. For D = 10 and m = 5: $5^{10} - 1 = 9{,}765{,}624$ probabilities. If N = 1000, almost all cells are empty; we have $1000 / 9{,}765{,}624 \approx 0.0001$ observations per cell. Curse of dimensionality: in high dimensions, almost all of the input space is empty.
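     The arithmetic is easy to verify (a one-off sketch):

     ```python
     D, m, N = 10, 5, 1000
     cells = m**D - 1                  # free probabilities per group
     print(cells)                      # 9765624
     print(N / cells)                  # ~0.0001 observations per cell
     ```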

  9. Naive Bayes Assumption. Assume the input variables are independent within each group, i.e.

     $$p(x \mid C_k) = p(x_1 \mid C_k)\, p(x_2 \mid C_k) \cdots p(x_D \mid C_k)$$

     Instead of $m^D - 1$ parameters, we only have to estimate $D(m - 1)$ parameters per group. So with D = 10 and m = 5, we only have to estimate 40 probabilities per group.

  10. Using Naive Bayes. Estimated probability of response for an 18-25-year-old male with naive Bayes:

     $$\hat{p}(R \mid 18\text{-}25, M) = \frac{\hat{p}(18\text{-}25 \mid R)\, \hat{p}(M \mid R)\, \hat{p}(R)}{\hat{p}(18\text{-}25, M)} = \frac{0.25 \times 0.55 \times 1/3}{0.25 \times 0.55 \times 1/3 + 0.05 \times 0.535 \times 2/3} \approx 0.72$$

     The probability estimate is higher, but both models lead to the same allocation.
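     The same check under the naive Bayes factorization (a sketch; the marginals come from the "total" row and column of slide 6):

     ```python
     p_r, p_nr = 1/3, 2/3
     num = 0.25 * 0.55 * p_r           # p(18-25|R) * p(M|R) * p(R)
     den = num + 0.05 * 0.535 * p_nr   # + p(18-25|NR) * p(M|NR) * p(NR)
     print(round(num / den, 2))        # 0.72
     ```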

  11. Continuous features: normal distribution. Suppose $x \sim N(\mu, \Sigma)$, with

     $$x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}, \qquad \mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{21} & \sigma_2^2 \end{pmatrix}$$

     Correlation coefficient: $\rho_{12} = \dfrac{\sigma_{12}}{\sigma_1 \sigma_2}$.

  12. Contour Plot 1: independent, same variance. [Contour plot of the bivariate normal density with $\mu_1 = 0$, $\mu_2 = 0$, $\rho_{12} = 0$, $\sigma_1^2 = \sigma_2^2 = 1$.]

  13. Contour Plot 2: positive correlation. [Contour plot with $\mu_1 = 10$, $\mu_2 = 25$, $\rho_{12} = 0.7$, $\sigma_1^2 = \sigma_2^2 = 1$.]

  14. Contour Plot 3: negative correlation. [Contour plot with $\mu_1 = 15$, $\mu_2 = 5$, $\rho_{12} = -0.6$, $\sigma_1^2 = \sigma_2^2 = 1$.]

  15. Multivariate Normal Distribution. D variables, i.e. $x = [x_1, \ldots, x_D]^\top$, with

     $$\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_D \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \sigma_1^2 & \sigma_{12} & \sigma_{13} & \cdots & \sigma_{1D} \\ \sigma_{21} & \sigma_2^2 & \sigma_{23} & \cdots & \sigma_{2D} \\ \vdots & & & & \vdots \\ \sigma_{D1} & \sigma_{D2} & \sigma_{D3} & \cdots & \sigma_D^2 \end{pmatrix}$$

     Formula for the normal probability density:

     $$p(x) = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\left( -\tfrac{1}{2} (x - \mu)^\top \Sigma^{-1} (x - \mu) \right)$$
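     A direct transcription of this density into code (a sketch; `mvn_density` is my own name, and `scipy.stats.multivariate_normal` would do the same job):

     ```python
     import numpy as np

     def mvn_density(x, mu, sigma):
         # p(x) = (2 pi)^(-D/2) |Sigma|^(-1/2) exp(-0.5 (x-mu)^T Sigma^-1 (x-mu))
         d = np.asarray(x) - np.asarray(mu)
         D = d.shape[0]
         quad = d @ np.linalg.solve(sigma, d)   # (x-mu)^T Sigma^-1 (x-mu)
         norm = (2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(sigma))
         return np.exp(-0.5 * quad) / norm
     ```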

  16. Normality Assumption in Classification. If in class k, $x \sim N(\mu_k, \Sigma_k)$, then p(x | C_k) has the form

     $$p(x \mid C_k) = \frac{1}{(2\pi)^{D/2} |\Sigma_k|^{1/2}} \exp\left( -\tfrac{1}{2} (x - \mu_k)^\top \Sigma_k^{-1} (x - \mu_k) \right)$$

  17. Normality Assumption in Classification. Estimating p(x | C_k) comes down to estimating the mean vector $\mu_k$ and the covariance matrix $\Sigma_k$ for each class. If there are D variables in x, then there are D means in the mean vector and $D(D+1)/2$ distinct elements in the (symmetric) covariance matrix, making a total of $(D^2 + 3D)/2$ parameters to be estimated for each class.
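     For instance, with D = 10 this count works out to 65 parameters per class (a one-line check):

     ```python
     D = 10
     print(D + D * (D + 1) // 2)       # (D**2 + 3*D) // 2 == 65
     ```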

  18. Optimal Allocation Rule. Assign x to group k if $p(C_k \mid x)$ is larger than $p(C_j \mid x)$ for all $j \neq k$. Via Bayes' formula this leads to the rule: assign to group k if

     $$p(x \mid C_k)\, p(C_k) > p(x \mid C_j)\, p(C_j) \quad \forall j \neq k$$

     (since the denominator cancels).

  19. Optimal Allocation Rule for Normal Densities. Fill in the formula for the normal density for p(x | C_k). Then we get the following optimal allocation rule: assign to group k if

     $$\frac{p(C_k)}{(2\pi)^{D/2} |\Sigma_k|^{1/2}} \exp\left( -\tfrac{1}{2} (x - \mu_k)^\top \Sigma_k^{-1} (x - \mu_k) \right) > \frac{p(C_j)}{(2\pi)^{D/2} |\Sigma_j|^{1/2}} \exp\left( -\tfrac{1}{2} (x - \mu_j)^\top \Sigma_j^{-1} (x - \mu_j) \right)$$

     for all $j \neq k$.

  20. Optimal Allocation Rule for Normal Densities. Take the natural logarithm:

     $$\ln\left[ \frac{p(C_k)}{(2\pi)^{D/2} |\Sigma_k|^{1/2}} \exp\left( -\tfrac{1}{2} (x - \mu_k)^\top \Sigma_k^{-1} (x - \mu_k) \right) \right] = -\tfrac{D}{2} \ln(2\pi) - \tfrac{1}{2} \ln|\Sigma_k| - \tfrac{1}{2} (x - \mu_k)^\top \Sigma_k^{-1} (x - \mu_k) + \ln p(C_k)$$

     Cancel the term $-\tfrac{D}{2}\ln(2\pi)$, which is common to all groups, and multiply by $-2$ (which reverses the inequality, so we will minimize the result):

     $$\ln|\Sigma_k| + (x - \mu_k)^\top \Sigma_k^{-1} (x - \mu_k) - 2 \ln p(C_k)$$

  21. Optimal Allocation Rule for Normal Densities. Discriminant function for class k:

     $$d_k(x) = \ln|\Sigma_k| - 2 \ln p(C_k) + (x - \mu_k)^\top \Sigma_k^{-1} (x - \mu_k) = \underbrace{\ln|\Sigma_k| - 2 \ln p(C_k) + \mu_k^\top \Sigma_k^{-1} \mu_k}_{\text{constant}} \; \underbrace{- 2 \mu_k^\top \Sigma_k^{-1} x}_{\text{linear}} \; + \underbrace{x^\top \Sigma_k^{-1} x}_{\text{quadratic}}$$

     Assign to class k if $d_k(x) < d_j(x)$ for all $j \neq k$.
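     A sketch of this rule in code (function names are my own; this is the quadratic discriminant from the slide, minimized over classes):

     ```python
     import numpy as np

     def discriminant(x, mu, sigma, prior):
         # d_k(x) = ln|Sigma_k| - 2 ln p(C_k) + (x-mu_k)^T Sigma_k^-1 (x-mu_k)
         d = x - mu
         _, logdet = np.linalg.slogdet(sigma)
         return logdet - 2 * np.log(prior) + d @ np.linalg.solve(sigma, d)

     def allocate(x, mus, sigmas, priors):
         # assign x to the class with the smallest discriminant value
         scores = [discriminant(x, m, S, p) for m, S, p in zip(mus, sigmas, priors)]
         return int(np.argmin(scores)) + 1  # classes numbered 1..K as on the slides
     ```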

  22. Estimation. Estimate $p(C_k)$, $\mu_k$, $\Sigma_k$ from the training data:

     $$\hat{p}(C_k) = \hat{p}(t = k) = \frac{N_k}{N},$$

     where $N_k$ is the number of observations from group k. The mean of $x_i$ in group k is estimated by

     $$\hat{\mu}_{i,k} = \bar{x}_{i,k} = \frac{1}{N_k} \sum_{n:\, t_n = k} x_{n,i}$$

     for $k = 1, \ldots, K$ and $i = 1, \ldots, D$.

  23. Estimation. Unbiased estimate of the covariance between $x_i$ and $x_j$ in group k:

     $$\hat{\Sigma}_k^{ij} = \frac{1}{N_k - 1} \sum_{n:\, t_n = k} (x_{n,i} - \bar{x}_{i,k})(x_{n,j} - \bar{x}_{j,k})$$

     for $k = 1, \ldots, K$ and $i, j = 1, \ldots, D$. If $j = i$, this is the variance of $x_i$ in group k.
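     These plug-in estimates are straightforward to compute (a sketch; `fit_gaussian_classes` is my own name, and `np.cov` with its default divisor $N_k - 1$ matches the unbiased estimate above):

     ```python
     import numpy as np

     def fit_gaussian_classes(X, t):
         # X: (N, D) data matrix; t: (N,) class labels
         priors, means, covs = [], [], []
         for k in np.unique(t):
             Xk = X[t == k]
             priors.append(len(Xk) / len(X))        # N_k / N
             means.append(Xk.mean(axis=0))          # sample mean per class
             covs.append(np.cov(Xk, rowvar=False))  # unbiased covariance
         return priors, means, covs
     ```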

  24. Numeric Example. Training data:

     x1   x2   t
      2    4   1
      3    6   1
      4   14   1
      4   18   2
      5   10   2
      6    8   2

  25. Estimates. Group 1:

     $$\hat{p}(C_1) = \tfrac{3}{6} = \tfrac{1}{2}, \qquad \bar{x}_1 = \begin{pmatrix} 3 \\ 8 \end{pmatrix}, \qquad \hat{\Sigma}_1 = \begin{pmatrix} 1 & 5 \\ 5 & 28 \end{pmatrix}$$

     Group 2:

     $$\hat{p}(C_2) = \tfrac{3}{6} = \tfrac{1}{2}, \qquad \bar{x}_2 = \begin{pmatrix} 5 \\ 12 \end{pmatrix}, \qquad \hat{\Sigma}_2 = \begin{pmatrix} 1 & -5 \\ -5 & 28 \end{pmatrix}$$
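     Running the `fit_gaussian_classes` sketch from slide 23 on this training set reproduces the slide's numbers:

     ```python
     import numpy as np

     X = np.array([[2, 4], [3, 6], [4, 14], [4, 18], [5, 10], [6, 8]], dtype=float)
     t = np.array([1, 1, 1, 2, 2, 2])
     priors, means, covs = fit_gaussian_classes(X, t)
     print(priors)       # [0.5, 0.5]
     print(means[0])     # [3. 8.]
     print(covs[0])      # [[ 1.  5.] [ 5. 28.]]
     print(means[1])     # [ 5. 12.]
     print(covs[1])      # [[ 1. -5.] [-5. 28.]]
     ```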
