Classification with generative models 2 (DSE 210)

  1. Classification with parametrized models
     Classifiers with a fixed number of parameters can represent a limited set of functions. Learning a model is about picking a good approximation. Typically the x's are points in d-dimensional Euclidean space, R^d.
     Two ways to classify:
     • Generative: model the individual classes.
     • Discriminative: model the decision boundary between the classes.

  2. The Bayes-optimal prediction
     [Figure: a mixture density Pr(x) over one feature x, with class-conditional densities P_1(x), P_2(x), P_3(x) and class weights π_1 = 10%, π_2 = 50%, π_3 = 40%.]
     Labels Y = {1, 2, ..., k}, density Pr(x) = π_1 P_1(x) + ··· + π_k P_k(x).
     For any x ∈ X and any label j,
         Pr(y = j | x) = Pr(y = j) Pr(x | y = j) / Pr(x) = π_j P_j(x) / Σ_{i=1}^k π_i P_i(x).
     Bayes-optimal prediction: h*(x) = arg max_j π_j P_j(x).

     The winery prediction problem
     Which winery is it from: 1, 2, or 3? Using one feature ('Alcohol'), the error rate is 29%. What if we use two features?
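     The rule h*(x) = arg max_j π_j P_j(x) is short to implement. Below is a minimal sketch in Python, assuming three hypothetical one-dimensional Gaussian classes with the priors from the figure; the means and standard deviations are made up for illustration, not taken from the slides.

         import numpy as np
         from scipy.stats import norm

         # Hypothetical class priors (pi_1, pi_2, pi_3) and 1-d class-conditional
         # densities; the Gaussian parameters below are illustrative only.
         priors = np.array([0.10, 0.50, 0.40])
         densities = [norm(loc=2.0, scale=1.0),    # P_1
                      norm(loc=5.0, scale=1.5),    # P_2
                      norm(loc=9.0, scale=2.0)]    # P_3

         def bayes_optimal(x):
             """Return the 1-based label j maximizing pi_j * P_j(x)."""
             scores = [pi * d.pdf(x) for pi, d in zip(priors, densities)]
             return int(np.argmax(scores)) + 1

         print(bayes_optimal(4.2))   # class whose weighted density is largest at x = 4.2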

  3. The data set, again
     Training set obtained from 130 bottles:
     • Winery 1: 43 bottles
     • Winery 2: 51 bottles
     • Winery 3: 36 bottles
     For each bottle, 13 features: 'Alcohol', 'Malic acid', 'Ash', 'Alcalinity of ash', 'Magnesium', 'Total phenols', 'Flavanoids', 'Nonflavanoid phenols', 'Proanthocyanins', 'Color intensity', 'Hue', 'OD280/OD315 of diluted wines', 'Proline'.
     Also, a separate test set of 48 labeled points. This time: 'Alcohol' and 'Flavanoids'.

     Why it helps to add features
     Better separation between the classes! The error rate drops from 29% to 8%.
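     This appears to be the classic UCI wine data set, which ships with scikit-learn; assuming that is the source, pulling out the two features used here might look like the sketch below (the course's particular 130/48 train/test split is not reproduced).

         from sklearn.datasets import load_wine

         # Load the UCI wine data: 178 bottles from 3 wineries, 13 features each.
         data = load_wine()

         # Keep only the two features used on this slide.
         alcohol = data.data[:, data.feature_names.index('alcohol')]
         flavanoids = data.data[:, data.feature_names.index('flavanoids')]
         labels = data.target   # winery, encoded as 0, 1, 2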

  4. Bivariate distributions
     Simplest option: treat each variable as independent.
     Example: for a large collection of people, measure the two variables H = height and W = weight. Independence would mean
         Pr(H = h, W = w) = Pr(H = h) Pr(W = w),
     which would also imply E(HW) = E(H) E(W). Is this an accurate approximation? No: we'd expect height and weight to be positively correlated.

     Types of correlation
     [Figure: three scatter plots. Height vs. weight: H, W positively correlated, which also implies E(HW) > E(H) E(W). Two further plots: X, Y negatively correlated; X, Y uncorrelated.]

  5. Pearson (1903): fathers and sons
     [Figure: scatter plot of the heights of fathers and their full-grown sons; son's height (inches) against father's height (inches), both axes running from 58 to 78.]
     How to quantify the degree of correlation?

     Correlation pictures
     [Figure: scatter plots illustrating correlation values r = 0, 0.25, 0.5, 0.75, 1 and r = −0.25, −0.5, −0.75.]

  6. Covariance and correlation
     Suppose X has mean µ_X and Y has mean µ_Y.
     • Covariance: cov(X, Y) = E[(X − µ_X)(Y − µ_Y)] = E[XY] − µ_X µ_Y.
       This is maximized when X = Y, in which case it equals var(X). In general, it is at most std(X) std(Y).
     • Correlation: corr(X, Y) = cov(X, Y) / (std(X) std(Y)).
       This is always in the range [−1, 1].

     Covariance and correlation: example 1
       x    y    Pr(x, y)
      −1   −1      1/3
      −1    1      1/6
       1   −1      1/3
       1    1      1/6
     Then µ_X = 0, µ_Y = −1/3, var(X) = 1, var(Y) = 8/9, and cov(X, Y) = 0, so corr(X, Y) = 0.
     In this case, X, Y are independent. Independent variables always have zero covariance and correlation.

  7. Covariance and correlation: example 2
       x     y    Pr(x, y)
      −1   −10      1/6
      −1    10      1/3
       1   −10      1/3
       1    10      1/6
     Then µ_X = 0, µ_Y = 0, var(X) = 1, var(Y) = 100, and cov(X, Y) = −10/3, so corr(X, Y) = −1/3.
     In this case, X and Y are negatively correlated.

     Return to winery example
     Better separation between the classes! The error rate drops from 29% to 8%.
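     Both examples are easy to verify numerically. A small sketch that computes the covariance and correlation directly from a finite joint distribution, using example 2's table (swapping in example 1's table gives cov = corr = 0):

         import numpy as np

         # Joint distribution of example 2: each row is (x, y, Pr(x, y)).
         table = [(-1, -10, 1/6), (-1, 10, 1/3), (1, -10, 1/3), (1, 10, 1/6)]
         xs, ys, ps = (np.array(t, dtype=float) for t in zip(*table))

         mu_x, mu_y = np.sum(ps * xs), np.sum(ps * ys)
         var_x = np.sum(ps * (xs - mu_x) ** 2)
         var_y = np.sum(ps * (ys - mu_y) ** 2)
         cov = np.sum(ps * (xs - mu_x) * (ys - mu_y))
         corr = cov / np.sqrt(var_x * var_y)
         print(cov, corr)   # -3.333... = -10/3 and -0.333... = -1/3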

  8. The bivariate Gaussian
     Model class 1 by a bivariate Gaussian, parametrized by:
         mean µ = (13.7, 3.0)  and  covariance matrix Σ = [ 0.20  0.06 ]
                                                          [ 0.06  0.12 ]

     The bivariate (2-d) Gaussian
     A distribution over (x_1, x_2) ∈ R^2, parametrized by:
     • Mean (µ_1, µ_2) ∈ R^2, where µ_1 = E(X_1) and µ_2 = E(X_2).
     • Covariance matrix Σ = [ Σ_11  Σ_12 ]   where Σ_11 = var(X_1), Σ_22 = var(X_2),
                             [ Σ_21  Σ_22 ]   and Σ_12 = Σ_21 = cov(X_1, X_2).
     Density is highest at the mean, and falls off in ellipsoidal contours.
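     Fitting such a Gaussian is nothing more than computing the empirical mean and covariance of a class's points. A minimal sketch, assuming `points` is an n × 2 array of ('Alcohol', 'Flavanoids') values for winery-1 bottles; the four rows below are made-up stand-ins:

         import numpy as np

         # Hypothetical (alcohol, flavanoids) measurements for winery 1.
         points = np.array([[13.8, 2.9], [13.6, 3.1], [13.7, 3.0], [13.9, 2.8]])

         mu = points.mean(axis=0)               # estimate of (mu_1, mu_2)
         Sigma = np.cov(points, rowvar=False)   # 2 x 2 covariance estimate
         print(mu)
         print(Sigma)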

  9. Density of the bivariate Gaussian
     • Mean µ = (µ_1, µ_2) ∈ R^2, where µ_1 = E(X_1) and µ_2 = E(X_2).
     • Covariance matrix Σ = [ Σ_11  Σ_12 ]
                             [ Σ_21  Σ_22 ]
     Density, writing x = (x_1, x_2):
         p(x_1, x_2) = (1 / (2π |Σ|^{1/2})) exp( −(1/2) (x − µ)^T Σ^{−1} (x − µ) )

     Bivariate Gaussian: examples
     In either case, the mean is (1, 1).
         Σ = [ 4  0 ]        Σ = [ 4    1.5 ]
             [ 0  1 ]            [ 1.5  1   ]
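     scipy exposes exactly this density, which makes the formula easy to sanity-check; a sketch using the class-1 parameters from the previous slide:

         import numpy as np
         from scipy.stats import multivariate_normal

         mu = np.array([13.7, 3.0])
         Sigma = np.array([[0.20, 0.06],
                           [0.06, 0.12]])

         # At the mean the exponent vanishes, so the density there should equal
         # 1 / (2 * pi * |Sigma|^{1/2}).
         rv = multivariate_normal(mean=mu, cov=Sigma)
         print(rv.pdf(mu))
         print(1 / (2 * np.pi * np.sqrt(np.linalg.det(Sigma))))   # same number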

  10. The decision boundary
      Going from 1 feature to 2 takes the error rate from 29% down to 8%. What kind of function is this decision boundary? And can we use more features?

  11. DSE 210: Probability and statistics, Winter 2018
      Worksheet 6 — Generative models 2
      1. Would you expect the following pairs of random variables to be uncorrelated, positively correlated, or negatively correlated?
         (a) The weight of a new car and its price.
         (b) The weight of a car and the number of seats in it.
         (c) The age in years of a second-hand car and its current market value.
      2. Consider a population of married couples in which every wife is exactly 0.9 of her husband's age. What is the correlation between husband's age and wife's age?
      3. Each of the following scenarios describes a joint distribution (x, y). In each case, give the parameters of the (unique) bivariate Gaussian that satisfies these properties.
         (a) x has mean 2 and standard deviation 1, y has mean 2 and standard deviation 0.5, and the correlation between x and y is −0.5.
         (b) x has mean 1 and standard deviation 1, and y is equal to x.
      4. Roughly sketch the shapes of the following Gaussians N(µ, Σ). For each, you only need to show a representative contour line which is qualitatively accurate (has approximately the right orientation, for instance).
         (a) µ = (0, 0) and Σ = [ 9  0 ]
                                [ 0  1 ]
         (b) µ = (0, 0) and Σ = [  1     −0.75 ]
                                [ −0.75   1    ]
      5. For each of the two Gaussians in the previous problem, check your answer using Python: draw 100 random samples from that Gaussian and plot them (one possible sketch follows below).
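      For problem 5, one possible sketch (numpy for the sampling, matplotlib for the plot; details like the seed and figure size are arbitrary):

          import numpy as np
          import matplotlib.pyplot as plt

          # The two Gaussians from problem 4.
          gaussians = [(np.zeros(2), np.array([[9.0, 0.0], [0.0, 1.0]])),
                       (np.zeros(2), np.array([[1.0, -0.75], [-0.75, 1.0]]))]

          rng = np.random.default_rng(0)
          fig, axes = plt.subplots(1, 2, figsize=(8, 4))
          for ax, (mu, Sigma) in zip(axes, gaussians):
              samples = rng.multivariate_normal(mu, Sigma, size=100)
              ax.scatter(samples[:, 0], samples[:, 1], s=10)
              ax.set_aspect('equal')   # so the contour orientation is not distorted
          plt.show()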

  12. Linear algebra primer (DSE 210)
      Data as vectors and matrices
      [Figure: data points plotted in the plane, axes from 0 to 6.]

  13. Matrix-vector notation
      Vector x ∈ R^d: a column of entries x_1, x_2, ..., x_d.
      Matrix M ∈ R^{r×d}:
          M = [ M_11  M_12  ···  M_1d ]
              [ M_21  M_22  ···  M_2d ]
              [  ⋮     ⋮          ⋮  ]
              [ M_r1  M_r2  ···  M_rd ]
      M_ij = entry at row i, column j.

      Transpose of vectors and matrices
      The transpose of a column vector x is the row vector x^T with the same entries, and vice versa. For example,
          M = [ 1  2  0  4 ]   has transpose   M^T = [ 1  3  8 ]
              [ 3  9  1  6 ]                         [ 2  9  7 ]
              [ 8  7  0  2 ]                         [ 0  1  0 ]
                                                     [ 4  6  2 ]
      • (A^T)_ij = A_ji
      • (A^T)^T = A

  14. Adding and subtracting vectors and matrices

      Dot product of two vectors
      Dot product of vectors x, y ∈ R^d: x · y = x_1 y_1 + x_2 y_2 + ··· + x_d y_d.
      What is the dot product between these two vectors?
      [Figure: two vectors x and y drawn in the plane, with axes from −4 to 4.]

  15. Dot products and angles
      Dot product of vectors x, y ∈ R^d: x · y = x_1 y_1 + x_2 y_2 + ··· + x_d y_d.
      It tells us the angle θ between x and y:
          cos θ = (x · y) / (‖x‖ ‖y‖).
      x is orthogonal (at right angles) to y if and only if x · y = 0.
      When x, y are unit vectors (length 1): cos θ = x · y. What is x · x?

      Linear and quadratic functions
      In one dimension:
      • Linear: f(x) = 3x + 2
      • Quadratic: f(x) = 4x^2 − 2x + 6
      In higher dimension, e.g. x = (x_1, x_2, x_3):
      • Linear: 3x_1 − 2x_2 + x_3 + 4
      • Quadratic: x_1^2 − 2x_1x_3 + 6x_2^2 + 7x_1 + 9
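      A quick numerical check of the angle formula; the two vectors here are arbitrary, chosen to be orthogonal:

          import numpy as np

          x = np.array([3.0, 4.0])
          y = np.array([4.0, -3.0])

          cos_theta = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
          print(cos_theta)     # 0.0: x and y are orthogonal
          print(np.dot(x, x))  # 25.0 = ||x||^2, answering "what is x . x?"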

  16. Linear functions and dot products
      [Figure: the linear separator 4x_1 + 3x_2 = 12 drawn in the plane.]
      For x = (x_1, ..., x_d) ∈ R^d, linear separators are of the form
          w_1 x_1 + w_2 x_2 + ··· + w_d x_d = c.
      This can be written as w · x = c, for w = (w_1, ..., w_d).

      More general linear functions
      A linear function from R^4 to R: f(x_1, x_2, x_3, x_4) = 3x_1 − 2x_3.
      A linear function from R^4 to R^3: f(x_1, x_2, x_3, x_4) = (4x_1 − x_2, x_3, −x_1 + 6x_4).

  17. Matrix-vector product
      Product of matrix M ∈ R^{r×d} and vector x ∈ R^d: the vector Mx ∈ R^r whose ith entry is the dot product of the ith row of M with x.

      The identity matrix
      The d × d identity matrix I_d sends each x ∈ R^d to itself: I_d x = x. It has 1s on the diagonal and 0s elsewhere:
          I_d = [ 1  0  0  ···  0 ]
                [ 0  1  0  ···  0 ]
                [ 0  0  1  ···  0 ]
                [ ⋮  ⋮  ⋮       ⋮ ]
                [ 0  0  0  ···  1 ]
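      In numpy both of these are one-liners; a small sketch with an arbitrary 2 × 3 matrix:

          import numpy as np

          M = np.array([[1.0, 2.0, 0.0],
                        [3.0, 9.0, 1.0]])   # M in R^{2 x 3}
          x = np.array([1.0, 0.0, 2.0])     # x in R^3

          print(M @ x)                   # ith entry = ith row of M dotted with x -> [1. 5.]
          I3 = np.eye(3)                 # the 3 x 3 identity matrix
          print(np.allclose(I3 @ x, x))  # the identity sends x to itself -> True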

  18. Matrix-matrix product
      Product of matrix A ∈ R^{r×k} and matrix B ∈ R^{k×p}: an r × p matrix AB, spelled out on the next slide.

  19. Matrix products
      If A ∈ R^{r×k} and B ∈ R^{k×p}, then AB is an r × p matrix with (i, j) entry
          (AB)_ij = (dot product of the ith row of A and the jth column of B) = Σ_{ℓ=1}^{k} A_iℓ B_ℓj.
      • I_k B = B and A I_k = A
      • Can check: (AB)^T = B^T A^T
      • For two vectors u, v ∈ R^d, what is u^T v?

      Some special cases
      For a vector x ∈ R^d, what are x^T x and x x^T?
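      These closing questions can be checked numerically: u^T v is a 1 × 1 matrix containing the dot product, x^T x is the squared length of x, and x x^T is a d × d rank-one (outer-product) matrix. A sketch with arbitrary small matrices:

          import numpy as np

          A = np.array([[1.0, 2.0], [0.0, 1.0], [3.0, 1.0]])   # 3 x 2
          B = np.array([[1.0, 0.0, 2.0], [4.0, 1.0, 0.0]])     # 2 x 3

          print(np.allclose((A @ B).T, B.T @ A.T))   # (AB)^T = B^T A^T -> True

          x = np.array([[1.0], [3.0], [0.0]])   # a column vector, shape 3 x 1
          print(x.T @ x)   # 1 x 1 matrix [[10.]]: the dot product x . x
          print(x @ x.T)   # 3 x 3 outer product x x^T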
