Lecture 4: Bayesian Decision Theory and Max Likelihood Estimation


SLIDE 1

Lecture 4: Bayesian Decision Theory and Max Likelihood Estimation

  • Dr. Chengjiang Long

Computer Vision Researcher at Kitware Inc. Adjunct Professor at RPI. Email: longc3@rpi.edu

SLIDE 2
  • C. Long

Lecture 4, January 30, 2018

Recap Previous Lecture

SLIDE 3

Recap Previous Lecture

From a medical image, we want to classify (determine) whether it contains cancer tissue or not.

R(αi|x) = ∑(j=1..c) λ(αi|ωj) P(ωj|x)

θa = P(ω2)/P(ω1)

θb = [P(ω2)(λ12 − λ22)] / [P(ω1)(λ21 − λ11)]
  • Ground truth is always unknown for classifiers.
SLIDE 4

Outline

  • Bayesian Decision Theory
  • Error Bound
  • ROC
  • Missing Features
  • Compound Bayesian Decision Theory
  • Max Likelihood Estimation
  • Example with Real World Data
SLIDE 5

Outline

  • Bayesian Decision Theory
  • Error Bound
  • ROC
  • Missing Features
  • Compound Bayesian Decision Theory
  • Max Likelihood Estimation
  • Example with Real World Data
SLIDE 6

Error Bounds

  • Exact error calculations could be difficult – it is easier to estimate error bounds!

P(error|x) = min[P(ω1|x), P(ω2|x)], so

P(error) = ∫ min[P(ω1|x), P(ω2|x)] p(x) dx

SLIDE 7

Error Bounds

  • If the class conditional distributions are Gaussian, then

P(error) ≤ P(ω1)^β P(ω2)^(1−β) e^(−κ(β)), for 0 ≤ β ≤ 1, where

κ(β) = [β(1−β)/2] (μ2 − μ1)ᵀ [βΣ1 + (1−β)Σ2]⁻¹ (μ2 − μ1) + (1/2) ln( |βΣ1 + (1−β)Σ2| / (|Σ1|^β |Σ2|^(1−β)) )

SLIDE 8

Error Bounds

  • The Chernoff bound is obtained by minimizing e^(−κ(β)) over β.
  • This is a 1-D optimization problem, regardless of the dimensionality of the class conditional densities.

SLIDE 9

Error Bounds

  • The Bhattacharyya bound is obtained by setting β = 0.5. It is easier to compute than the Chernoff bound, but looser.
  • Note: the Chernoff and Bhattacharyya bounds will not be good bounds if the densities are not Gaussian.
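To make the Bhattacharyya bound concrete, here is a minimal sketch in Python/NumPy (the function name and the test densities are illustrative assumptions, not part of the lecture). It evaluates κ(1/2) for two Gaussian class densities and returns the bound √(P(ω1)P(ω2))·e^(−κ(1/2)):

```python
import numpy as np

def bhattacharyya_bound(mu1, cov1, mu2, cov2, p1, p2):
    """Bhattacharyya bound on P(error) for two Gaussian class densities.

    Evaluates kappa(1/2) (the Chernoff exponent at beta = 0.5) and returns
    sqrt(p1 * p2) * exp(-kappa(1/2)), an upper bound on the Bayes error.
    """
    mu1, mu2 = np.asarray(mu1, float), np.asarray(mu2, float)
    cov1, cov2 = np.asarray(cov1, float), np.asarray(cov2, float)
    cov_avg = (cov1 + cov2) / 2.0
    diff = mu2 - mu1
    # Mahalanobis-like term plus log-determinant term
    k = diff @ np.linalg.inv(cov_avg) @ diff / 8.0 \
        + 0.5 * np.log(np.linalg.det(cov_avg)
                       / np.sqrt(np.linalg.det(cov1) * np.linalg.det(cov2)))
    return np.sqrt(p1 * p2) * np.exp(-k)
```

For identical densities and equal priors the bound is 0.5 (no information), and it shrinks quickly as the class means separate.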

SLIDE 10

Outline

  • Bayesian Decision Theory
  • Error Bound
  • ROC
  • Missing Features
  • Compound Bayesian Decision Theory
  • Max Likelihood Estimation
  • Example with Real World Data
SLIDE 11

Receiver Operating Characteristic (ROC) Curve

  • Every classifier typically employs some kind of a threshold.
  • Changing the threshold will affect the performance of the classifier.
  • ROC curves allow us to evaluate the performance of a classifier using different thresholds.

θa = P(ω2)/P(ω1)

θb = [P(ω2)(λ12 − λ22)] / [P(ω1)(λ21 − λ11)]
SLIDE 12

Example: Person Authentication

  • Authenticate a person using biometrics (e.g.,

fingerprints).

  • There are two possible distributions (i.e., classes):
SLIDE 13

Example: Person Authentication

  • Possible decisions:

(1) correct acceptance (true positive): X belongs to A, and we decide A
(2) incorrect acceptance (false positive): X belongs to I, and we decide A
(3) correct rejection (true negative): X belongs to I, and we decide I
(4) incorrect rejection (false negative): X belongs to A, and we decide I

[Figure: score distributions for the authentic (A) and impostor (I) classes, with the regions labeled correct acceptance, false positive, correct rejection, and false negative.]
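The four outcomes above can be tallied mechanically; a minimal sketch (the labels 'A'/'I' and the function name are illustrative):

```python
import numpy as np

def authentication_outcomes(true_labels, decisions):
    """Count the four decision outcomes; 'A' = authentic, 'I' = impostor."""
    t, d = np.asarray(true_labels), np.asarray(decisions)
    tp = int(np.sum((t == 'A') & (d == 'A')))  # correct acceptance
    fp = int(np.sum((t == 'I') & (d == 'A')))  # incorrect acceptance
    tn = int(np.sum((t == 'I') & (d == 'I')))  # correct rejection
    fn = int(np.sum((t == 'A') & (d == 'I')))  # incorrect rejection
    return tp, fp, tn, fn
```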

SLIDE 14

ROC Curve

FPR: False Positive Rate (X-axis). TPR: True Positive Rate (Y-axis).

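The FPR/TPR definitions can be sketched as a threshold sweep over a hypothetical score-based acceptor (names and data are illustrative, not from the slides):

```python
import numpy as np

def roc_points(scores_authentic, scores_impostor, thresholds):
    """FPR/TPR pairs obtained by sweeping the decision threshold.

    A sample is accepted (decided 'authentic') when its score >= threshold.
    """
    scores_authentic = np.asarray(scores_authentic, float)
    scores_impostor = np.asarray(scores_impostor, float)
    points = []
    for t in thresholds:
        tpr = float(np.mean(scores_authentic >= t))  # correct acceptances
        fpr = float(np.mean(scores_impostor >= t))   # incorrect acceptances
        points.append((fpr, tpr))
    return points
```

Plotting these points for many thresholds traces out the ROC curve.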

SLIDE 15

Outline

  • Bayesian Decision Theory
  • Error Bound
  • ROC
  • Missing Features
  • Compound Bayesian Decision Theory
  • Max Likelihood Estimation
  • Example with Real World Data
SLIDE 16

Missing Features

  • Suppose x = (x1, x2) is a test vector where x1 is missing and x2 = x̂2 is observed; how can we classify it?
  • If we set x1 equal to the average value, we will classify x as ω3.
  • But p(x̂2|ω2) is larger; should we classify x as ω2?

SLIDE 17

Missing Features

  • Suppose x=[xg, xb] (xg: good features, xb: bad features)
  • Derive the Bayes rule using the good features by marginalizing the posterior probability over the bad features:

P(ωi|xg) = ∫ p(ωi, xg, xb) dxb / p(xg)
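On a discrete grid, the marginalization over bad features is a single sum; a minimal sketch assuming a joint probability table P(ωi, xg, xb) (the function name and table layout are illustrative):

```python
import numpy as np

def posterior_good_features(joint):
    """joint[i, g, b] = P(omega_i, x_g = g, x_b = b) on a discrete grid.

    Marginalize out the bad feature x_b, then normalize over classes to
    obtain P(omega_i | x_g).
    """
    marg = joint.sum(axis=2)        # P(omega_i, x_g)
    return marg / marg.sum(axis=0)  # P(omega_i | x_g)
```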

SLIDE 18

Outline

  • Bayesian Decision Theory
  • Error Bound
  • ROC
  • Missing Features
  • Compound Bayesian Decision Theory
  • Max Likelihood Estimation
  • Example with Real World Data
SLIDE 19

Compound Bayesian Decision Theory

  • Sequential decision
  • Decide as each pattern (e.g., fish) emerges.
  • Compound decision
  • Wait for n patterns (e.g., fish) to emerge.
  • Make all n decisions jointly.

Could improve performance when consecutive states of nature are not statistically independent.
SLIDE 20

Compound Bayesian Decision Theory

  • Suppose ω = (ω(1), …, ω(n)) denotes the n states of nature, where each ω(i) can take one of c values ω1, ω2, …, ωc (i.e., c categories).
  • Suppose P(ω) is the prior probability of the n states of nature.
  • Suppose X = (x1, …, xn) are n observed vectors.

It is unacceptable to simplify the problem of calculating P(ω) by assuming that the states of nature are independent.

SLIDE 21

Outline

  • Bayesian Decision Theory
  • Error Bound
  • ROC
  • Missing Features
  • Compound Bayesian Decision Theory
  • Max Likelihood Estimation
  • Example with Real World Data
SLIDE 22

Intuition

  • We could design an optimal classifier if we knew:
    – P(ωi) (priors)
    – p(x|ωi) (class conditional densities)
    – Unfortunately, we rarely have this complete information!
  • Design a classifier from training data.
  • Training samples are often too few for class conditional estimation (large dimension of feature space).

SLIDE 23

Supervised Learning in a Nutshell

SLIDE 24

Statistical Estimation View

  • Probabilities to the rescue:
  • x and y are random variables
  • IID: Independent Identically Distributed
  • Both training & testing data sampled IID from P(X,Y)
  • Learn on training set
  • Have some hope of generalizing to test set
SLIDE 25

Parameter Estimation

  • Use a priori information about the problem, e.g., normality of p(x|ωi):

p(x|ωi) ~ N(μi, Σi)

  • Simplify the problem:
  • from estimating an unknown distribution function
  • to estimating parameters (μi, Σi)
SLIDE 26

Why Gaussians?

  • Why does the entire world seem to always be harping on about Gaussians?

  – Central Limit Theorem!
  – They're easy (and we like easy)
  – Closely related to squared loss (for regression)
  – A mixture of Gaussians is sufficient to approximate many distributions

SLIDE 27

Parameter Estimation

  • Maximum likelihood: values of parameters are fixed but unknown.
  • Bayesian estimation: parameters are random variables having some known a priori distribution.

SLIDE 28

Parameter Estimation

  • Parameters in ML estimation are fixed but unknown!
  • Best parameters are obtained by maximizing the

probability of obtaining the samples observed.

  • Bayesian methods view the parameters as random

variables having some known distribution.

  • In either approach, we use P(ωi|x) for our classification rule.

SLIDE 29

Maximum Likelihood Estimation: Independence Across Classes

  • For each class ωi we have a proposed density p(x|ωi, θi) with unknown parameters θi which we need to estimate.
  • Since we assumed independence of data across the classes, estimation is an identical procedure for all classes.
  • To simplify notation, we drop sub-indexes and say that we need to estimate parameters θ for the density p(x).

SLIDE 30

Maximum-Likelihood Estimation

  • Has good convergence properties as the sample

size increases

  • Simpler than alternative techniques
  • General principle
  • Assume c datasets (classes) D1, D2, …, Dc, drawn independently according to p(x|ωj).
  • Assume that p(x|ωj) has a known parametric form determined by the parameter vector θj, i.e., p(x|ωj) = p(x|ωj, θj).
  • Further assume that Di gives no information about θj if i ≠ j.

SLIDE 31

Maximum-Likelihood Estimation

  • Use a set of independent samples D = {x1, …, xn} to estimate θ; by independence, p(D|θ) = Π(k=1..n) p(xk|θ).
  • Our goal is to determine θ̂, the value of θ that best agrees with the observed training data.
  • Note: if D is fixed, p(D|θ) is a function of θ, not a density.
SLIDE 32

Example: Gaussian case

  • Assume we have c classes and p(x|ωj) ~ N(μj, Σj).
  • Use the information provided by the training samples to estimate the parameter vector θj = (μj, Σj) associated with each category.
  • Suppose that D contains n samples, x1, …, xn.
SLIDE 33

Maximum-Likelihood Estimation

  • p(D|θ) is called the likelihood of θ w.r.t. the set of samples.
  • The ML estimate of θ is, by definition, the value θ̂ that maximizes p(D|θ): "It is the value of θ that best agrees with the actually observed training samples."
SLIDE 34

Optimal Estimation

  • Let θ = (θ1, …, θp)ᵀ and let ∇θ be the gradient operator.
  • We define l(θ) = ln p(D|θ) as the log likelihood function.
  • New problem statement: determine the θ̂ that maximizes the log likelihood, θ̂ = arg maxθ l(θ).
SLIDE 35

Optimal Estimation

  • A solution to ∇θ l(θ) = 0 could be:
  • a local or global maximum
  • a local or global minimum
  • a saddle point
  • a point on the boundary of the parameter space
SLIDE 36

Example of ML estimation: Unknown μ

  • Samples are drawn from a multivariate normal population: p(xk|μ) ~ N(μ, Σ).
  • ln p(xk|μ) = −(1/2) ln[(2π)^d |Σ|] − (1/2)(xk − μ)ᵀ Σ⁻¹ (xk − μ); therefore ∇μ ln p(xk|μ) = Σ⁻¹ (xk − μ).
  • The ML estimate for μ must satisfy: ∑(k=1..n) Σ⁻¹ (xk − μ̂) = 0.
SLIDE 37

Example of ML estimation: Unknown μ

  • Multiplying by Σ and rearranging, we obtain: μ̂ = (1/n) ∑(k=1..n) xk
  • Just the arithmetic average of the training samples!
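A one-line check of this result, with illustrative toy data:

```python
import numpy as np

# ML estimate of the Gaussian mean: the arithmetic average of the samples
samples = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
mu_hat = samples.mean(axis=0)  # (1/n) * sum_k x_k
```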

SLIDE 38

Example of ML estimation: Unknown μ and σ²

  • Parameters: θ = (θ1, θ2) = (μ, σ²) (univariate Gaussian case)
  • Objective function: l(θ) = ∑(k=1..n) ln p(xk|θ) = ∑(k=1..n) [ −(1/2) ln(2πθ2) − (xk − θ1)²/(2θ2) ]
SLIDE 39

Example of ML estimation: Unknown μ and σ²

  • Setting the summed derivatives w.r.t. θ1 and θ2 to zero gives conditions (1) and (2).
  • Combining (1) and (2), one obtains: μ̂ = (1/n) ∑(k=1..n) xk, σ̂² = (1/n) ∑(k=1..n) (xk − μ̂)²
SLIDE 40

How good are these estimates?

  • Two measures of “goodness” are used for statistical

estimates

  • BIAS: how close is the estimate to the true value?
  • VARIANCE: how much does it change for different

datasets?

  • The bias-variance tradeoff
  • In most cases, you can only decrease one of them at the

expense of the other

SLIDE 41

What is the bias of the ML estimate of the mean?

  • E[μ̂] = E[(1/n) ∑(k=1..n) xk] = (1/n) ∑(k=1..n) E[xk] = μ. Therefore the sample mean is an unbiased estimate.
SLIDE 42

What is the bias of the ML estimate of the variance?

 E[σ̂²] = ((n − 1)/n) σ² ≠ σ². Thus, the ML estimate of the variance is BIASED.

  • This is because the ML estimate of the variance uses μ̂ instead of the true mean μ.

 How "bad" is this bias?

  • For n → ∞ the bias becomes zero asymptotically.
  • The bias is only noticeable when we have very few samples, in which case we should not be doing statistics in the first place!

 Notice that MATLAB uses an unbiased estimate of the covariance, normalizing by n − 1 (except in the extreme case of n = 1, where the n − 1 normalization is undefined).
SLIDE 43

Outline

  • Bayesian Decision Theory
  • Error Bound
  • ROC
  • Missing Features
  • Compound Bayesian Decision Theory
  • Max Likelihood Estimation
  • Example with Real World Data
SLIDE 44

Example with real world data (1)

  • Image is acquired by the

ROSIS-03 optical sensor over the University of Pavia, Italy

  • Spatial dimension: 610 x 340

pixels

  • Spatial resolution: 1.3m per

pixel

  • Spectral dimension: 103 spectral channels (0.43 to 0.86 μm)

SLIDE 45

Example with real world data (2)

SLIDE 46

Example with real world data (3)

  • We split reference data into sets of training and test samples:
SLIDE 47

Spectral Context for HS Image

SLIDE 48

Spectral Context for HS Image

SLIDE 49

Maximum Likelihood Classification

  • Feature vector: a vector of radiance values x for each

pixel

103 spectral bands, so the dimensionality of the feature vector is d = 103

SLIDE 50

Maximum Likelihood Classification

  • Samples of each class k are assumed to have a

Gaussian distribution

  • Parameters of the distribution for each class are estimated from the training samples, using the maximum likelihood estimates:

μ̂k = (1/mk) ∑(x∈Dk) x,  Σ̂k = (1/mk) ∑(x∈Dk) (x − μ̂k)(x − μ̂k)ᵀ
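A per-class sketch of these ML estimates (the function name and toy data are illustrative, not from the lecture):

```python
import numpy as np

def estimate_class_params(X, y, k):
    """ML estimates (mean, covariance) for class k from labeled samples."""
    Xk = X[y == k]
    mu = Xk.mean(axis=0)
    centered = Xk - mu
    cov = centered.T @ centered / len(Xk)  # ML covariance: divide by m_k
    return mu, cov
```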

SLIDE 51

Maximum Likelihood Classification

  • For each class k, P = [d(d+1)/2 + d] parameters have

to be estimated

  • If d = 103, P = 5459!
  • We have only from 231 to 548 training samples per

class

  • To avoid a significant parameter estimation error: P

<< mk (mk – number of training samples for class k)

SLIDE 52

Maximum Likelihood Classification

  • Dimensionality reduction must be performed first, to

reduce the dimensionality d

  • The first 3 bands of the 103-band image are omitted.
  • A 10 band image is obtained by averaging over every

10 bands (new d = 10)
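The band-averaging step can be sketched as a reshape-and-mean over a (rows, cols, bands) cube (a sketch assuming the 3 extra bands were already dropped, leaving a band count divisible by 10):

```python
import numpy as np

def reduce_bands(cube, group=10):
    """Average every `group` consecutive bands of a (rows, cols, d) cube."""
    rows, cols, d = cube.shape
    return cube.reshape(rows, cols, d // group, group).mean(axis=3)
```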

SLIDE 53

Maximum Likelihood Classification

  • 1) Parameters of Gaussian distributions for each

class are estimated

  • 2) The whole image is classified using K = 9 (number of classes) discriminant functions (MAP classification), with priors P(ωk) = mk/m, where m is the total number of training samples.
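The MAP classification step can be sketched with the standard Gaussian log-discriminant gk(x) = −(1/2)(x − μk)ᵀ Σk⁻¹ (x − μk) − (1/2) ln|Σk| + ln P(ωk) (the function name and toy parameters below are illustrative):

```python
import numpy as np

def map_classify(x, mus, covs, priors):
    """Assign x to the class with the largest Gaussian MAP discriminant."""
    scores = []
    for mu, cov, p in zip(mus, covs, priors):
        diff = x - mu
        g = (-0.5 * diff @ np.linalg.inv(cov) @ diff
             - 0.5 * np.log(np.linalg.det(cov)) + np.log(p))
        scores.append(g)
    return int(np.argmax(scores))
```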

SLIDE 54

Maximum Likelihood Classification

SLIDE 55

Maximum Likelihood Classification

SLIDE 56

Conclusions for the classification example

  • Classification accuracies are high for most of the

classes

  • Other feature extraction (dimensionality reduction) methods can be used to further improve the accuracies.

SLIDE 57

Computational complexity

  • Example: complexity of a ML estimation of the

parameters in a classifier for Gaussian priors in d dimension, with n training samples

  • For each of the c categories: mean O(nd), covariance O(nd²), inverse and determinant O(d³); dominated by O(nd²) when n > d.
  • Overall computational complexity for learning is O(cnd²).
  • Computational complexity for classification of one sample is O(cd²).

SLIDE 58

Computational complexity

  • Parallel implementations
  • Space complexity
  • Time complexity
  • Example: Estimation of the sample mean using d processors,

each adding n values

  • Space complexity: O(d)
  • Time complexity: O(n)
SLIDE 59