

  1. Lecture 4: Bayesian Decision Theory and Maximum Likelihood Estimation. Dr. Chengjiang Long, Computer Vision Researcher at Kitware Inc., Adjunct Professor at RPI. Email: longc3@rpi.edu

  2. Recap of the Previous Lecture (C. Long, Lecture 4, January 30, 2018)

  3. Recap of the Previous Lecture. The conditional risk of taking action $\alpha_i$ given observation $\mathbf{x}$ is $R(\alpha_i \mid \mathbf{x}) = \sum_{j=1}^{c} \lambda(\alpha_i \mid \omega_j)\, P(\omega_j \mid \mathbf{x})$. Example: from a medical image, we want to classify (determine) whether it contains cancer tissue or not. The likelihood-ratio decision thresholds are $\theta_a = P(\omega_2)/P(\omega_1)$ (zero-one loss) and $\theta_b = \dfrac{P(\omega_2)(\lambda_{12} - \lambda_{22})}{P(\omega_1)(\lambda_{21} - \lambda_{11})}$ (general loss). Ground truth is always unknown to the classifier.
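The minimum-risk rule recapped above can be sketched in a few lines. This is a minimal illustration, not from the lecture: the posteriors and the loss matrix (a costly missed cancer vs. a cheap false alarm) are hypothetical numbers chosen for the example.

```python
import numpy as np

# Hypothetical 2-class cancer-screening setup: posteriors for one test image,
# and a loss matrix lam[i][j] = cost of action a_i when the true class is w_j.
posteriors = np.array([0.3, 0.7])          # P(w1|x), P(w2|x)
lam = np.array([[0.0, 10.0],               # decide w1 (healthy): missing cancer is costly
                [1.0,  0.0]])              # decide w2 (cancer): false alarm is cheap

# Conditional risk R(a_i|x) = sum_j lam[i][j] * P(w_j|x)
risks = lam @ posteriors
decision = int(np.argmin(risks))           # Bayes rule: pick the minimum-risk action
print(risks, decision)                     # risks [7.0, 0.3] -> decide w2
```

Even though the loss matrix is asymmetric, the rule stays the same: compute each action's conditional risk and take the argmin.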

  4. Outline
     • Bayesian Decision Theory
     • Error Bound
     • ROC
     • Missing Features
     • Compound Bayesian Decision Theory
     • Maximum Likelihood Estimation
     • Example with Real-World Data

  5. Outline (repeated; next section: Error Bound)

  6. Error Bounds. Exact error calculations can be difficult, so it is often easier to estimate error bounds. For two classes, the conditional error is $P(\text{error} \mid \mathbf{x}) = \min\left[P(\omega_1 \mid \mathbf{x}),\, P(\omega_2 \mid \mathbf{x})\right]$.

  7. Error Bounds. If the class-conditional distributions are Gaussian, then $P(\text{error}) \le P(\omega_1)^{\beta} P(\omega_2)^{1-\beta} e^{-k(\beta)}$ for $0 \le \beta \le 1$, where
  $k(\beta) = \dfrac{\beta(1-\beta)}{2}(\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1)^{T}\left[\beta\Sigma_1 + (1-\beta)\Sigma_2\right]^{-1}(\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1) + \dfrac{1}{2}\ln\dfrac{\left|\beta\Sigma_1 + (1-\beta)\Sigma_2\right|}{\left|\Sigma_1\right|^{\beta}\left|\Sigma_2\right|^{1-\beta}}$

  8. Error Bounds. The Chernoff bound is obtained by minimizing $e^{-k(\beta)}$ over $\beta$. This is a 1-D optimization problem, regardless of the dimensionality of the class-conditional densities.

  9. Error Bounds. The Bhattacharyya bound is obtained by setting $\beta = 0.5$. It is easier to compute than the Chernoff bound, but looser. Note: the Chernoff and Bhattacharyya bounds may not be good bounds if the densities are not Gaussian.
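For 1-D Gaussians the Bhattacharyya bound has a simple closed form, so it can be checked numerically. A minimal sketch, assuming two illustrative class-conditional densities (the means, variances, and equal priors are made up for the example):

```python
import math

# Illustrative 1-D Gaussian class-conditional densities with equal priors.
mu1, var1 = 0.0, 1.0
mu2, var2 = 2.0, 1.0
P1 = P2 = 0.5

# k(1/2) for Gaussians (the general k(beta) evaluated at beta = 0.5):
# k(1/2) = (mu2-mu1)^2 / (4*(var1+var2)) + 0.5*ln( ((var1+var2)/2) / sqrt(var1*var2) )
avg_var = 0.5 * (var1 + var2)
k_half = (mu2 - mu1) ** 2 / (4 * (var1 + var2)) \
         + 0.5 * math.log(avg_var / math.sqrt(var1 * var2))

# Bhattacharyya bound: P(error) <= sqrt(P1*P2) * exp(-k(1/2))
bound = math.sqrt(P1 * P2) * math.exp(-k_half)
print(round(bound, 4))   # ~0.3033
```

Here the true Bayes error is about 0.159, so the bound (about 0.303) holds but is indeed loose, as the slide notes.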

  10. Outline (repeated; next section: ROC)

  11. Receiver Operating Characteristic (ROC) Curve
     • Every classifier typically employs some kind of threshold, e.g. $\theta_a = P(\omega_2)/P(\omega_1)$ or $\theta_b = \dfrac{P(\omega_2)(\lambda_{12} - \lambda_{22})}{P(\omega_1)(\lambda_{21} - \lambda_{11})}$.
     • Changing the threshold affects the performance of the classifier.
     • ROC curves allow us to evaluate the performance of a classifier across different thresholds.

  12. Example: Person Authentication. Authenticate a person using biometrics (e.g., fingerprints). There are two possible distributions (i.e., classes): authentic users (A) and impostors (I).

  13. Example: Person Authentication. Possible decisions:
     (1) correct acceptance (true positive): X belongs to A, and we decide A
     (2) incorrect acceptance (false positive): X belongs to I, and we decide A
     (3) correct rejection (true negative): X belongs to I, and we decide I
     (4) incorrect rejection (false negative): X belongs to A, and we decide I

  14. ROC Curve. FPR: False Positive Rate (X-axis); TPR: True Positive Rate (Y-axis).
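An ROC curve is traced by sweeping the decision threshold and recording one (FPR, TPR) point per threshold. A small sketch with synthetic authentication scores (the score distributions for genuine users and impostors are hypothetical, chosen only to illustrate the sweep):

```python
import numpy as np

# Synthetic match scores: higher score = more likely "authentic" (class A).
rng = np.random.default_rng(0)
scores_A = rng.normal(2.0, 1.0, 500)   # genuine users
scores_I = rng.normal(0.0, 1.0, 500)   # impostors
scores = np.concatenate([scores_A, scores_I])
labels = np.concatenate([np.ones(500), np.zeros(500)])  # 1 = authentic, 0 = impostor

# Sweep the threshold; each value yields one (FPR, TPR) point of the ROC curve.
thresholds = np.linspace(scores.min(), scores.max(), 50)
tpr = [(scores[labels == 1] >= t).mean() for t in thresholds]  # true positive rate
fpr = [(scores[labels == 0] >= t).mean() for t in thresholds]  # false positive rate
print(list(zip(fpr, tpr))[:3])
```

At the lowest threshold everything is accepted (TPR = FPR = 1); raising the threshold moves the operating point toward (0, 0), and the gap between TPR and FPR along the way reflects how separable the two score distributions are.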

  15. Outline (repeated; next section: Missing Features)

  16. Missing Features
     • Suppose $\mathbf{x} = (x_1, x_2)$ is a test vector where $x_1$ is missing and the observed value of $x_2$ is $\hat{x}_2$. How can we classify it?
     • If we set $x_1$ equal to its average value, we will classify $\mathbf{x}$ as $\omega_3$.
     • But if $p(\hat{x}_2 \mid \omega_2)$ is larger, should we classify $\mathbf{x}$ as $\omega_2$?

  17. Missing Features
     • Suppose $\mathbf{x} = [\mathbf{x}_g, \mathbf{x}_b]$ ($\mathbf{x}_g$: good, i.e. observed, features; $\mathbf{x}_b$: bad, i.e. missing, features).
     • Derive the Bayes rule using the good features: marginalize the posterior probability over the bad features,
       $P(\omega_i \mid \mathbf{x}_g) = \dfrac{\int p(\omega_i, \mathbf{x}_g, \mathbf{x}_b)\, d\mathbf{x}_b}{\int p(\mathbf{x}_g, \mathbf{x}_b)\, d\mathbf{x}_b}$.
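The marginalization above is easiest to see in a discrete sketch: sum the joint over the missing feature's values, then normalize. The joint probability table below is hypothetical, invented only to make the arithmetic concrete.

```python
import numpy as np

# Hypothetical joint p(w, xg, xb): shape (2 classes, 3 xg values, 2 xb values).
# Entries are made-up probabilities that sum to 1.
joint = np.array([[[0.10, 0.05], [0.08, 0.02], [0.05, 0.05]],
                  [[0.02, 0.08], [0.05, 0.15], [0.15, 0.20]]])

xg = 1  # observed "good" feature value; the "bad" feature xb is missing

# Marginalize the bad feature out: p(w, xg) = sum_xb p(w, xg, xb),
# then normalize over classes to get the posterior P(w | xg).
marginal = joint[:, xg, :].sum(axis=1)
posterior = marginal / marginal.sum()
print(posterior)   # decide the class with the largest posterior
```

With these numbers the posterior is [1/3, 2/3], so the Bayes rule decides class 2 using only the observed feature.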

  18. Outline (repeated; next section: Compound Bayesian Decision Theory)

  19. Compound Bayesian Decision Theory
     • Sequential decision: decide as each pattern (e.g., fish) emerges.
     • Compound decision: wait for n patterns (e.g., fish) to emerge, then make all n decisions jointly. This can improve performance when consecutive states of nature are not statistically independent.

  20. Compound Bayesian Decision Theory
     • Suppose $\boldsymbol{\omega} = (\omega(1), \ldots, \omega(n))$ denotes the n states of nature, where each $\omega(k)$ can take one of c values $\omega_1, \omega_2, \ldots, \omega_c$ (i.e., c categories).
     • Suppose $P(\boldsymbol{\omega})$ is the prior probability of the n states of nature.
     • Suppose $\mathbf{X} = (\mathbf{x}_1, \ldots, \mathbf{x}_n)$ are the n observed vectors.
     • It is unacceptable to simplify the problem of calculating $P(\boldsymbol{\omega})$ by assuming that the states of nature are independent.
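The compound decision can be sketched by scoring whole state sequences jointly. In this hypothetical example (all priors, transition probabilities, and likelihoods are invented), consecutive states are Markov-dependent, so the jointly best sequence can disagree with the decision a sequential rule would make on the first pattern alone.

```python
import numpy as np
from itertools import product

# n = 3 observed patterns, c = 2 states of nature; all numbers illustrative.
prior0 = np.array([0.5, 0.5])          # P(omega(1))
trans = np.array([[0.9, 0.1],          # P(omega(k+1) | omega(k)):
                  [0.1, 0.9]])         # consecutive states tend to repeat
lik = np.array([[0.60, 0.40],          # lik[k][s] = p(x_k | state s)
                [0.45, 0.55],
                [0.30, 0.70]])

# Enumerate all c^n sequences and maximize P(omega) * prod_k p(x_k | omega(k)).
best_seq, best_score = None, -1.0
for seq in product([0, 1], repeat=3):
    score = prior0[seq[0]] * lik[0][seq[0]]
    for k in range(1, 3):
        score *= trans[seq[k - 1]][seq[k]] * lik[k][seq[k]]
    if score > best_score:
        best_seq, best_score = seq, score
print(best_seq)   # (1, 1, 1)
```

Note that the first observation by itself favors state 0 (likelihood 0.60 vs. 0.40), yet the jointly optimal sequence is all state 1: the dependence between consecutive states changes the answer, which is the point of deciding jointly.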

  21. Outline (repeated; next section: Maximum Likelihood Estimation)

  22. Intuition
     • We could design an optimal classifier if we knew the priors and the class-conditional densities. Unfortunately, we rarely have this complete information!
     • Instead, design a classifier from training data.
     • Samples are often too few for class-conditional estimation (large dimension of feature space).

  23. Supervised Learning in a Nutshell

  24. Statistical Estimation View
     • Probabilities to the rescue: x and y are random variables.
     • IID: Independent, Identically Distributed.
     • Both training and testing data are sampled IID from P(X, Y).
     • Learn on the training set; then we have some hope of generalizing to the test set.

  25. Parameter Estimation
     • Use a priori information about the problem, e.g., normality of the class-conditional density.
     • This simplifies the problem: from estimating an unknown distribution function to estimating its parameters.

  26. Why Gaussians? Why does the entire world seem to always be harping on about Gaussians?
     – Central Limit Theorem!
     – They're easy (and we like easy).
     – Closely related to squared loss (for regression).
     – A mixture of Gaussians is sufficient to approximate many distributions.

  27. Parameter Estimation: two views.
     • Maximum likelihood: values of parameters are fixed but unknown.
     • Bayesian estimation: parameters are random variables having some known a priori distribution.

  28. Parameter Estimation
     • Parameters in ML estimation are fixed but unknown!
     • The best parameters are obtained by maximizing the probability of obtaining the samples observed.
     • Bayesian methods view the parameters as random variables having some known distribution.
     • In either approach, we use the posterior $P(\omega_i \mid \mathbf{x})$ for our classification rule.

  29. Maximum Likelihood Estimation: Independence Across Classes
     • For each class $\omega_i$ we have a proposed density $p(\mathbf{x} \mid \omega_i, \boldsymbol{\theta}_i)$ with unknown parameters $\boldsymbol{\theta}_i$ which we need to estimate.
     • Since we assumed independence of data across the classes, estimation is an identical procedure for all classes.
     • To simplify notation, we drop the sub-indexes and say that we need to estimate the parameters $\boldsymbol{\theta}$ for the density $p(\mathbf{x})$.

  30. Maximum-Likelihood Estimation
     • Has good convergence properties as the sample size increases.
     • Simpler than alternative techniques.
     • General principle: assume c datasets (classes) $D_1, D_2, \ldots, D_c$ drawn independently according to $p(\mathbf{x} \mid \omega_j)$. Assume that $p(\mathbf{x} \mid \omega_j)$ has a known parametric form determined by the parameter vector $\boldsymbol{\theta}_j$. Further assume that $D_i$ gives no information about $\boldsymbol{\theta}_j$ ($i \ne j$).

  31. Maximum-Likelihood Estimation
     • Use the set of independent samples $D$ to estimate $\boldsymbol{\theta}$.
     • Our goal is to determine $\hat{\boldsymbol{\theta}}$ (the value of $\boldsymbol{\theta}$ that best agrees with the observed training data).
     • Note: if $D$ is fixed, $p(D \mid \boldsymbol{\theta})$ is a function of $\boldsymbol{\theta}$, not a density.

  32. Example: Gaussian Case
     • Assume we have c classes with Gaussian class-conditional densities.
     • Use the information provided by the training samples to estimate the parameters $\boldsymbol{\theta}_j$; each $\boldsymbol{\theta}_j$ is associated with one category.
     • Suppose that $D$ contains n samples, $\mathbf{x}_1, \ldots, \mathbf{x}_n$.

  33. Maximum-Likelihood Estimation
     • $p(D \mid \boldsymbol{\theta})$ is called the likelihood of $\boldsymbol{\theta}$ w.r.t. the set of samples.
     • The ML estimate of $\boldsymbol{\theta}$ is, by definition, the value $\hat{\boldsymbol{\theta}}$ that maximizes $p(D \mid \boldsymbol{\theta})$: "It is the value of $\boldsymbol{\theta}$ that best agrees with the actually observed training samples."

  34. Optimal Estimation
     • Let $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_p)^T$ and let $\nabla_{\boldsymbol{\theta}}$ be the gradient operator.
     • We define $l(\boldsymbol{\theta}) = \ln p(D \mid \boldsymbol{\theta})$ as the log-likelihood function.
     • New problem statement: determine the $\boldsymbol{\theta}$ that maximizes the log likelihood, $\hat{\boldsymbol{\theta}} = \arg\max_{\boldsymbol{\theta}} l(\boldsymbol{\theta})$.
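For the Gaussian case, maximizing the log likelihood has the familiar closed form: the sample mean and the average squared deviation. A minimal sketch with synthetic data (the true mean 3.0 and standard deviation 2.0 are invented for the demonstration):

```python
import numpy as np

# Synthetic 1-D training set drawn from a known Gaussian.
rng = np.random.default_rng(42)
D = rng.normal(loc=3.0, scale=2.0, size=10_000)

# ML estimates for a 1-D Gaussian (setting the gradient of l(theta) to zero):
# mu_hat = sample mean, var_hat = average squared deviation (the biased form).
mu_hat = D.mean()
var_hat = ((D - mu_hat) ** 2).mean()

def log_likelihood(mu, var, x):
    """l(theta) = sum_k ln p(x_k | mu, var) for a 1-D Gaussian."""
    return np.sum(-0.5 * np.log(2 * np.pi * var) - (x - mu) ** 2 / (2 * var))

# The closed-form estimate should score at least as high as nearby parameters.
print(mu_hat, var_hat)
print(log_likelihood(mu_hat, var_hat, D) >= log_likelihood(mu_hat + 0.1, var_hat, D))
```

Since the Gaussian MLE is the global maximizer of $l(\boldsymbol{\theta})$, perturbing either the mean or the variance can only lower the log likelihood, which the last line checks numerically.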
