  1. Bayes Decision Theory - I Ken Kreutz-Delgado (Nuno Vasconcelos) ECE 175A – Winter 2012 - UCSD

  2. Statistical Learning from Data • Goal: Given a relationship between a feature vector x and a vector y, and i.i.d. data samples (x_i, y_i), find an approximating function f(x) ≈ y, i.e. a predictor ŷ = f(x). This is called training or learning. • Two major types of learning: – Unsupervised Classification (aka Clustering) or Regression ("blind" curve fitting): only X is known. – Supervised Classification or Regression: both X and the target value Y are known during training; only X is known at test time.

  3. Nearest Neighbor Classifier • The simplest possible classifier that one could think of: – It consists of assigning to a new, unclassified vector the same class label as that of the closest vector in the labeled training set – E.g., to classify the unlabeled point "Red":  measure Red's distance to all other labeled training points  if the closest point to Red is labeled "A = square", assign Red to class A  otherwise assign Red to the "B = circle" class • This works a lot better than one might expect, particularly if there are a lot of labeled training points

  4. Nearest Neighbor Classifier • To define this classification procedure rigorously, define: – a training set D = {(x_1, y_1), …, (x_n, y_n)} – x_i is a vector of observations, y_i is the class label – a new vector x to classify • The decision rule is: set y = y_{i^*}, where

  i^* = \arg\min_{i \in \{1, \dots, n\}} d(x, x_i)

– argmin means: "the i that minimizes the distance"
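The decision rule above can be sketched in a few lines of NumPy; the toy training set below is a hypothetical illustration, not from the slides:

```python
import numpy as np

def nearest_neighbor(train_X, train_y, x):
    """Assign x the label of its closest training point (Euclidean distance)."""
    dists = np.linalg.norm(train_X - x, axis=1)  # d(x, x_i) for every i
    i_star = np.argmin(dists)                    # i* = argmin_i d(x, x_i)
    return train_y[i_star]

# Hypothetical training set: class "A" near the origin, class "B" near (5, 5)
train_X = np.array([[0.0, 0.0], [1.0, 0.5], [5.0, 5.0], [4.5, 5.5]])
train_y = np.array(["A", "A", "B", "B"])

print(nearest_neighbor(train_X, train_y, np.array([0.3, 0.2])))  # "A"
print(nearest_neighbor(train_X, train_y, np.array([4.8, 5.1])))  # "B"
```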

  5. Metrics • We have seen some examples: – R^d, with the Euclidean inner product, norm, and distance:

  \langle x, y \rangle = x^T y = \sum_{i=1}^{d} x_i y_i, \qquad \|x\|^2 = x^T x = \sum_{i=1}^{d} x_i^2, \qquad d^2(x, y) = \sum_{i=1}^{d} (x_i - y_i)^2

– Continuous functions, with the analogous inner product, norm² = "energy", and distance² = "energy" of the difference:

  \langle f, g \rangle = \int f(x)\, g(x)\, dx, \qquad \|f\|^2 = \int f^2(x)\, dx, \qquad d^2(f, g) = \int [f(x) - g(x)]^2\, dx

  6. Euclidean Distance • We considered in detail the Euclidean distance

  d(x, y) = \sqrt{\sum_{i=1}^{d} (x_i - y_i)^2}

• Which points are equidistant from x? They satisfy d(x, y) = r, i.e.

  \sum_{i=1}^{d} (x_i - y_i)^2 = r^2

– e.g., in two dimensions, (x_1 - y_1)^2 + (x_2 - y_2)^2 = r^2 • The points equidistant from x lie on spheres centered at x • Why would we need any other metric?

  7. Inner Products • Fish example: – features are L = fish length, W = scale width – measure L in meters and W in millimeters  typical L: 0.70 m for salmon, 0.40 m for sea-bass  typical W: 35 mm for salmon, 40 mm for sea-bass – I have three fish:  F1 = (.7, 35), F2 = (.4, 40), F3 = (.75, 37.8)  F1 is clearly a salmon, F2 clearly a sea-bass, and F3 looks like a salmon  yet d(F1, F3) = 2.8 > d(F2, F3) = 2.23 – there seems to be something wrong here – but if scale width is also measured in meters:  F1 = (.7, .035), F2 = (.4, .040), F3 = (.75, .0378)  and now d(F1, F3) = .05 < d(F2, F3) = 0.35 – which seems to be right: the units are commensurate
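The distances quoted on this slide are easy to verify numerically; a quick check of the mixed-unit vs. commensurate-unit feature vectors:

```python
import numpy as np

def d(a, b):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(np.array(a) - np.array(b)))

# Length in meters, scale width in millimeters (mixed units)
F1, F2, F3 = (0.7, 35.0), (0.4, 40.0), (0.75, 37.8)
print(d(F1, F3), d(F2, F3))  # ~2.80 > ~2.23: F3 wrongly looks closer to the sea-bass

# Scale width also in meters (commensurate units)
G1, G2, G3 = (0.7, 0.035), (0.4, 0.040), (0.75, 0.0378)
print(d(G1, G3), d(G2, G3))  # ~0.05 < ~0.35: F3 correctly closer to the salmon
```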

  8. Inner Product • Suppose the scale width is also measured in meters: – I have three fish:  F1 = (.7, .035), F2 = (.4, .040), F3 = (.75, .0378)  and now d(F1, F3) = .05 < d(F2, F3) = 0.35 – which seems to be right • The problem is that the Euclidean distance depends on the units (or scaling) of each axis – e.g., if I multiply the second coordinate by 1,000:

  d'(x, y) = \sqrt{(x_1 - y_1)^2 + 1{,}000{,}000\,(x_2 - y_2)^2}

The 2nd coordinate's influence on the distance increases 1,000-fold! • Often the "right" units are not clear (e.g., car speed vs. weight)

  9. Inner Products • We need to work with the "right", or at least "better", units • Apply a transformation to get a "better" feature space: x' = A x • Examples: – Taking A = R, with R proper and orthogonal, is equivalent to a rotation – Another important special case is scaling (A = S, for S diagonal):

  S x = \begin{pmatrix} s_1 & & 0 \\ & \ddots & \\ 0 & & s_n \end{pmatrix} \begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix} = \begin{pmatrix} s_1 x_1 \\ \vdots \\ s_n x_n \end{pmatrix}

– We can combine these two transformations by taking A = S R
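A quick NumPy check of these two special cases, with made-up numbers: a rotation R leaves Euclidean distances unchanged, while a scaling S does not, which is exactly why choosing S well matters:

```python
import numpy as np

theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])  # proper orthogonal: a rotation
S = np.diag([1.0, 1000.0])                       # diagonal: per-axis scaling
A = S @ R                                        # combined transformation x' = SRx

x, y = np.array([0.7, 0.035]), np.array([0.4, 0.040])

# The rotation preserves the Euclidean distance ...
print(np.linalg.norm(R @ x - R @ y), np.linalg.norm(x - y))
# ... but the scaling changes it
print(np.linalg.norm(S @ x - S @ y))
```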

  10. Inner Products • What is the Euclidean inner product in the transformed space?

  \langle x', y' \rangle = (Ax)^T (Ay) = x^T A^T A y = x^T M y, \qquad M = A^T A

• Using the weighted inner product \langle x, y \rangle = x^T M y in the original space is equivalent to working in the transformed space • More generally, what is a "good" M? – Let the data tell us! – One possibility is to take M to be the inverse of the covariance matrix Σ:

  d^2(x, y) = (x - y)^T \Sigma^{-1} (x - y)

• This is the Mahalanobis distance – This distance is adapted to the data "scatter" and thereby yields "natural" units under a Gaussian assumption
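A minimal sketch of the Mahalanobis distance with an example covariance matrix (the numbers are illustrative assumptions): along a high-variance axis a large step is "cheap", along a low-variance axis even a small step is "far":

```python
import numpy as np

def mahalanobis(x, y, Sigma):
    """Mahalanobis distance: sqrt((x - y)^T Sigma^{-1} (x - y))."""
    diff = np.asarray(x, float) - np.asarray(y, float)
    return float(np.sqrt(diff @ np.linalg.solve(Sigma, diff)))

# Example covariance: variance 4 along axis 1, variance 0.25 along axis 2
Sigma = np.array([[4.0, 0.0],
                  [0.0, 0.25]])

# A step of 2 along the high-variance axis and a step of 0.5 along the
# low-variance axis are equally far in Mahalanobis terms
print(mahalanobis([2.0, 0.0], [0.0, 0.0], Sigma))  # 1.0
print(mahalanobis([0.0, 0.5], [0.0, 0.0], Sigma))  # 1.0
```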

  11. The Multivariate Gaussian • Using the Mahalanobis distance = assuming Gaussian data • Mahalanobis distance (to the mean μ):

  d^2(x, \mu) = (x - \mu)^T \Sigma^{-1} (x - \mu)

Gaussian:

  P_X(x) = \frac{1}{\sqrt{(2\pi)^d |\Sigma|}} \exp\left\{ -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right\}

– Points of high probability are those at small distance from the center of the data distribution (the mean) – Thus the Mahalanobis distance can be interpreted as the "right" norm for a certain type of non-Cartesian space

  12. The Multivariate Gaussian • For Gaussian data, the Mahalanobis distance tells us all we could possibly know statistically about the data: – The pdf for a d-dimensional Gaussian of mean μ and covariance Σ is

  P_X(x) = \frac{1}{\sqrt{(2\pi)^d |\Sigma|}} \exp\left\{ -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right\}

– This is equivalent to

  P_X(x) = \frac{1}{K} \exp\left\{ -\frac{1}{2} d^2(x, \mu) \right\}

which is the exponential of the negative squared Mahalanobis distance, up to a constant scaling factor K.  The constant K is needed only to ensure that the pdf integrates to 1
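The equivalence above can be coded directly: evaluate the Gaussian pdf as exp(−½ d²(x, μ)) / K with K = sqrt((2π)^d |Σ|). A sketch (the test point is an arbitrary illustration):

```python
import numpy as np

def mahalanobis_sq(x, mu, Sigma):
    """Squared Mahalanobis distance (x - mu)^T Sigma^{-1} (x - mu)."""
    diff = np.asarray(x, float) - np.asarray(mu, float)
    return float(diff @ np.linalg.solve(Sigma, diff))

def gaussian_pdf(x, mu, Sigma):
    """P_X(x) = (1/K) exp(-0.5 d^2(x, mu)), K = sqrt((2 pi)^d |Sigma|)."""
    d = len(mu)
    K = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(Sigma))
    return float(np.exp(-0.5 * mahalanobis_sq(x, mu, Sigma)) / K)

mu, Sigma = np.zeros(2), np.eye(2)
print(gaussian_pdf(mu, mu, Sigma))          # peak value at the mean: 1/(2*pi)
print(gaussian_pdf([1.0, 0.0], mu, Sigma))  # smaller: larger Mahalanobis distance
```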

  13. "Optimal" Classifiers • Some metrics are "better" than others • The meaning of "better" is connected to how well adapted the metric is to the properties of the data • Can we be more rigorous? Can we have an "optimal" metric? What could we mean by "optimal"? • To talk about optimality we start by defining a cost or loss: for a predictor ŷ = f(x) of the true value y, the loss is L(y, ŷ) – The cost is a real-valued loss function that we want to minimize – It depends on the true y and the prediction ŷ = f(x) – The value of the cost tells us how good our predictor f(·) is

  14. Loss Functions for Classification • Classification problem: the loss is a function of classification errors – What types of errors can we have? – Two types: false positives and false negatives  Consider a face detection problem  If you say "face" when the image is not a face, you have a false positive; if you say "non-face" when the image is a face, you have a false negative (a miss) – Obviously, we have similar sub-classes for non-errors:  true positives and true negatives – The positive/negative part reflects what we say (predict) – The true/false part reflects the reality of the situation

  15. Loss Functions • Are some errors more important than others? – Depends on the problem – Consider a snake looking for lunch  The snake likes to eat frogs  but dart frogs are highly poisonous  The snake must classify each frog that it sees, Y ∈ {"dart", "regular"}  The losses are clearly different:

  prediction | frog = dart | frog = regular
  "regular"  |      ∞      |       0
  "dart"     |      0      |      10

(mistaking a poisonous dart frog for a regular one is fatal, so its cost dwarfs the cost of passing up lunch)

  16. Loss Functions • But not all snakes are the same – The one to the right is a dart frog predator – It also can classify each frog it sees, Y ∈ {"dart", "regular"}, but it actually prefers to eat dart frogs and thus it might pass up a regular frog in its search for a tastier meal  However, other frogs are OK to eat too – Its losses are:

  prediction | frog = dart | frog = regular
  "regular"  |     10      |       0
  "dart"     |      0      |      10

  17. (Conditional) Risk as Average Cost • Given a loss function, denote the cost of classifying a data vector x generated from class j as class i by L[j → i] • Conditioned on an observed data vector x, to measure how good the classifier is on average if one (always) decides i, use the (conditional) expected value of the loss, aka the (data-conditional) risk:

  R(x, i) \overset{\text{def}}{=} E\{ L[Y \to i] \mid x \} = \sum_j L[j \to i] \, P_{Y|X}(j \mid x)

• This means that the risk of classifying x as i is equal to – the sum, over all classes j, of the cost of classifying x as i when the truth is j, times the conditional probability that the true class is j (where the conditioning is on the observed value of x)
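The risk formula can be evaluated directly. A sketch using the snake example, where the posterior P(dart | x) = 0.1 and the fatal-mistake cost of 1000 are hypothetical numbers chosen for illustration:

```python
import numpy as np

def conditional_risk(L, posterior):
    """R(x, i) = sum_j L[j -> i] P_{Y|X}(j | x), for every decision i.

    L[j, i] is the loss of deciding class i when the truth is class j;
    posterior[j] is P_{Y|X}(j | x). Returns one risk value per decision i."""
    return posterior @ L

# Classes j in {0: dart, 1: regular}; decisions i in {0: "dart", 1: "regular"}
L = np.array([[0.0, 1000.0],   # truth = dart: deciding "regular" is very costly
              [10.0,   0.0]])  # truth = regular: deciding "dart" forfeits lunch
posterior = np.array([0.1, 0.9])  # hypothetical P(dart | x) = 0.1

R = conditional_risk(L, posterior)
print(R)             # [9.0, 100.0]: risk of deciding "dart" vs "regular"
print(np.argmin(R))  # 0: decide "dart" even though dart is the unlikely class
```

Note how the asymmetric loss drives the decision: the snake should behave as if the frog is a dart frog even at 10% posterior probability, because the cost of the opposite mistake is so large.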
