  1. Supervised Learning: Linear Methods (1/2) Applied Multivariate Statistics – Spring 2012

  2. Overview
      Review: Conditional Probability
      LDA / QDA: Theory
      Fisher’s Discriminant Analysis
      LDA: Example
      Quality control: Test set and Cross-validation
      Case study: Text recognition

  3. Conditional Probability
      Sample space with two events: T (medical test positive) and C (patient has cancer)
      (Marginal) probability: $P(T)$, $P(C)$
      Conditional probability restricts attention to a new sample space: $P(C \mid T)$ conditions on the people with a positive test, $P(T \mid C)$ on the people with cancer. Note that $P(C \mid T)$ can be small even when $P(T \mid C)$ is large.
      Bayes' theorem: $P(C \mid T) = \frac{P(T \mid C)\, P(C)}{P(T)}$, where $P(C)$ is the prior, $P(C \mid T)$ the posterior, and $P(T \mid C)$ the class-conditional probability
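
A quick numeric illustration of the last point, with purely illustrative (assumed) numbers for a rare disease and a fairly accurate test:

```r
# Assumed, purely illustrative numbers
prior <- 0.01   # P(C): prevalence of cancer
sens  <- 0.90   # P(T|C): positive test given cancer (large)
fpr   <- 0.05   # P(T|no C): false positive rate

# Law of total probability: P(T) = P(T|C) P(C) + P(T|no C) P(no C)
p_t <- sens * prior + fpr * (1 - prior)

# Bayes' theorem: P(C|T) = P(T|C) P(C) / P(T)
sens * prior / p_t   # approx 0.15: small, although P(T|C) = 0.90
```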

  4. One approach to supervised learning
      $P(C \mid X) = \frac{P(C)\, P(X \mid C)}{P(X)} \propto P(C)\, P(X \mid C)$
      Prior / prevalence $P(C)$: estimated by the fraction of samples in that class
      Assume: $X \mid C \sim N(\mu_c, \Sigma_c)$
      Bayes rule: choose the class where $P(C \mid X)$ is maximal (this rule is “optimal” if all types of error are equally costly)
      Special case, two classes (0/1): choose c = 1 if $P(C{=}1 \mid X) > 0.5$, i.e. if the posterior odds $P(C{=}1 \mid X) / P(C{=}0 \mid X) > 1$
      In practice: estimate $P(C)$, $\mu_c$, $\Sigma_c$
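
A minimal one-dimensional sketch of the two-class rule (all parameters assumed; dnorm evaluates the Gaussian class-conditional density):

```r
# Assumed toy parameters for X|C ~ N(mu_c, sigma^2)
prior0 <- 0.7; prior1 <- 0.3    # estimated priors P(C=0), P(C=1)
mu0 <- 0; mu1 <- 2; sigma <- 1  # estimated class means, shared sd

x <- 1.5                        # new observation

# Posterior odds P(C=1|X=x) / P(C=0|X=x); P(X) cancels
odds <- (prior1 * dnorm(x, mu1, sigma)) /
        (prior0 * dnorm(x, mu0, sigma))
if (odds > 1) 1 else 0          # choose class 1 iff odds > 1
```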

  5. QDA: Doing the math…
      Gaussian class-conditional density: $P(x \mid C) = \frac{1}{\sqrt{(2\pi)^d\, |\Sigma_c|}} \exp\!\left(-\frac{1}{2}(x - \mu_c)^T \Sigma_c^{-1} (x - \mu_c)\right)$, and $P(C \mid X) \propto P(C)\, P(X \mid C)$
      Use the fact that maximizing $P(C \mid X)$ is equivalent to maximizing $\log P(C \mid X)$:
      $\delta_c(x) = \log\big(P(C)\, P(X \mid C)\big) = \log P(C) - \frac{1}{2}\log|\Sigma_c| - \frac{1}{2}(x - \mu_c)^T \Sigma_c^{-1}(x - \mu_c) + \text{const}$
      (prior term, additional covariance term, squared Mahalanobis distance term)
      Choose the class where $\delta_c(x)$ is maximal
      Special case, two classes: the decision boundary, i.e. the values of x where $\delta_0(x) = \delta_1(x)$, is quadratic in x  Quadratic Discriminant Analysis (QDA)
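
The score $\delta_c(x)$ translates directly into code; a minimal sketch (the function name and inputs are my own, not from the slides):

```r
# Quadratic discriminant score of observation x for one class
# mu: class mean, Sigma: class covariance, prior: P(C)
delta_qda <- function(x, mu, Sigma, prior) {
  d <- x - mu
  log(prior) - 0.5 * log(det(Sigma)) -
    0.5 * drop(t(d) %*% solve(Sigma) %*% d)  # -1/2 * squared Mahalanobis dist.
}

# Classify to the class with the largest score, e.g. for two classes:
# which.max(c(delta_qda(x, mu0, Sigma0, p0), delta_qda(x, mu1, Sigma1, p1))) - 1
```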

  6. Simplification
      Assume the same covariance matrix in all classes, i.e. $X \mid C \sim N(\mu_c, \Sigma)$ with $\Sigma$ fixed for all classes
      $\delta_c(x) = \log P(C) - \frac{1}{2}\log|\Sigma| - \frac{1}{2}(x - \mu_c)^T \Sigma^{-1}(x - \mu_c) + \text{const} = \log P(C) - \frac{1}{2}(x - \mu_c)^T \Sigma^{-1}(x - \mu_c) + \text{const}'$
      $\big(= \log P(C) + x^T \Sigma^{-1} \mu_c - \frac{1}{2}\mu_c^T \Sigma^{-1} \mu_c + \text{const}''\big)$: only the prior and the squared Mahalanobis distance to each class mean remain class-dependent
      Decision boundary is linear in x  Linear Discriminant Analysis (LDA)
      Example: a point whose physical (Euclidean) distance to the means of classes 0 and 1 is equal, with equal priors, is classified to class 0 when its Mahalanobis distance to class 0 is smaller
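
Base R already ships the squared Mahalanobis distance, so the equal-prior rule is short; a sketch with assumed toy values, constructed so the Euclidean distances tie but the Mahalanobis distances do not:

```r
x   <- c(1, 1)                            # assumed test point
mu0 <- c(0, 0); mu1 <- c(2, 0)            # assumed class means
Sigma <- matrix(c(1, 0.8, 0.8, 1), 2, 2)  # shared covariance

sqrt(sum((x - mu0)^2))  # Euclidean distance to mu0: 1.41...
sqrt(sum((x - mu1)^2))  # Euclidean distance to mu1: 1.41... (equal)

mahalanobis(x, mu0, Sigma)  # approx 1.11
mahalanobis(x, mu1, Sigma)  # 10 -> with equal priors, classify to class 0
```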

  7. LDA vs. QDA
      LDA: + only few parameters to estimate, hence accurate estimates; - inflexible (linear decision boundary)
      QDA: + more flexible (quadratic decision boundary); - many parameters to estimate, hence less accurate estimates

  8. Fisher’s Discriminant Analysis: Idea
      Find the direction(s) in which the groups are separated best
      Class Y, predictors $X = (X_1, \ldots, X_d)$; consider the projection $U = w^T X$
      Find w so that the groups are separated best along U; in analogy to the 1st principal component, this direction is called the 1st linear discriminant (also: 1st canonical variable)
      Measure of separation: Rayleigh coefficient $J(w) = \frac{D(U)}{\mathrm{Var}(U)}$, where $D(U) = \big(E[U \mid Y{=}0] - E[U \mid Y{=}1]\big)^2$
      With $E[X \mid Y{=}j] = \mu_j$ and $\mathrm{Var}(X \mid Y{=}j) = \Sigma$, this gives $E[U \mid Y{=}j] = w^T \mu_j$ and $\mathrm{Var}(U) = w^T \Sigma w$
      Small $J(w)$: groups overlap along U; large $J(w)$: groups well separated
      Concept extendable to many groups
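
For two groups the maximizer of J(w) has the well-known closed form $w \propto \Sigma^{-1}(\mu_0 - \mu_1)$; a minimal sketch on assumed toy data:

```r
set.seed(1)
X0 <- matrix(rnorm(100, mean = 0), ncol = 2)  # group 0 (toy data)
X1 <- matrix(rnorm(100, mean = 2), ncol = 2)  # group 1 (toy data)

mu0 <- colMeans(X0); mu1 <- colMeans(X1)
Sigma <- (cov(X0) + cov(X1)) / 2              # pooled within-group covariance

w <- solve(Sigma, mu0 - mu1)                  # direction of best separation

# Rayleigh coefficient J(w) = D(U) / Var(U) for this w
drop(crossprod(w, mu0 - mu1)^2 / (t(w) %*% Sigma %*% w))
```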

  9. LDA and Linear Discriminants
      Direction with largest J(w): 1st linear discriminant (LD 1)
      Orthogonal to LD 1, again with largest J(w): LD 2; etc.
      At most min(number of dimensions, number of groups - 1) LDs; e.g. 3 groups in 10 dimensions need 2 LDs
      Computed using an eigenvalue decomposition or a singular value decomposition
      “Proportion of trace”: the % of the variance between the group means captured by each LD
      R: function «lda» in package MASS does LDA and computes the linear discriminants (a «qda» function is also available)
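
A minimal sketch of these counts with «lda» (toy data assumed; the printed “Proportion of trace” line reports the captured % per LD):

```r
library(MASS)

# 3 groups in 10 dimensions: lda() returns min(10, 3 - 1) = 2 LDs
set.seed(1)
d <- data.frame(g = factor(rep(1:3, each = 30)),
                matrix(rnorm(90 * 10), ncol = 10))
d[d$g == 2, -1] <- d[d$g == 2, -1] + 1   # shift group means (toy separation)
d[d$g == 3, -1] <- d[d$g == 3, -1] + 2

fit <- lda(g ~ ., data = d)
ncol(fit$scaling)   # 2: coefficients of LD 1 and LD 2
fit                 # printout ends with "Proportion of trace" per LD
```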

  10. Example: Classification of Iris flowers
      Three species: Iris setosa, Iris versicolor, Iris virginica
      Classify according to sepal/petal length/width
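
A sketch of this classification in R, using the iris data set that ships with R:

```r
library(MASS)

fit  <- lda(Species ~ Sepal.Length + Sepal.Width +
                      Petal.Length + Petal.Width, data = iris)
pred <- predict(fit)   # predicted species, posteriors, LD scores

# Flowers projected onto the two linear discriminants
plot(pred$x, col = iris$Species, pch = 19, xlab = "LD1", ylab = "LD2")
```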

  11. Quality of classification
      Using the training data also as test data leads to overfitting: the estimated error is too optimistic for new data
      Separate test data: split the samples into a training and a test set
      Cross-validation (CV; e.g. leave-one-out cross-validation): every row is the test case once, with the remaining rows as training data
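
«lda» performs leave-one-out cross-validation directly via CV = TRUE (each row is classified by a fit to all other rows):

```r
library(MASS)

cv <- lda(Species ~ ., data = iris, CV = TRUE)
head(cv$class)                  # LOOCV-predicted class for each row
mean(cv$class != iris$Species)  # LOOCV estimate of the error rate
```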

  12. Measures for prediction error
      Confusion matrix (e.g. 100 samples):

                     Truth = 0   Truth = 1   Truth = 2
      Estimate = 0       23           7           6
      Estimate = 1        3          27           4
      Estimate = 2        3           1          26

      Error rate: 1 - sum(diagonal entries) / (number of samples) = 1 - 76/100 = 0.24
      We expect the classifier to predict 24% of new observations incorrectly (this is just a rough estimate)
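
In R the confusion matrix and the error rate are one line each (here computed from the iris LOOCV fit of the previous sketch):

```r
library(MASS)

cv   <- lda(Species ~ ., data = iris, CV = TRUE)
conf <- table(Estimate = cv$class, Truth = iris$Species)  # confusion matrix
conf
1 - sum(diag(conf)) / sum(conf)  # error rate = 1 - sum(diagonal) / n
```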

  13. Example: Digit recognition
      7129 hand-written digits
      Each (centered) digit was put in a 16*16 grid
      Measure the grey value in each part of the grid, i.e. 256 grey values per digit (the illustration uses an 8*8 grid)
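
The feature extraction amounts to flattening each grid into one row of a data matrix; a schematic sketch (no real digit data here, a random matrix stands in for one scanned digit):

```r
# Assumed stand-in for one centered digit: a 16*16 grid of grey values
digit <- matrix(runif(16 * 16), nrow = 16)

# Flatten to 256 grey-value features; one such row per digit
# (stacked into a matrix) is the input for lda()/qda()
features <- as.vector(digit)
length(features)  # 256
```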

  14. Concepts to know
      Idea of LDA / QDA
      Meaning of linear discriminants
      Cross-validation
      Confusion matrix, error rate

  15. R functions to know
      lda (in package MASS)
