Lecture 4: Bayesian Decision Theory and Max Likelihood Estimation


SLIDE 1

Lecture 4: Bayesian Decision Theory and Max Likelihood Estimation

  • Dr. Chengjiang Long

Computer Vision Researcher at Kitware Inc. Adjunct Professor at RPI. Email: longc3@rpi.edu

SLIDE 2
  • C. Long

Lecture 4, January 30, 2018

Recap Previous Lecture

SLIDE 3

Recap Previous Lecture

From a medical image, we want to classify (determine) whether it contains cancer tissue or not.

R(αi|x) = ∑(j=1..c) λ(αi|ωj) P(ωj|x)

θa = P(ω2)/P(ω1)

θb = [P(ω2)(λ12 − λ22)] / [P(ω1)(λ21 − λ11)]
  • Ground truth is always unknown for classifiers.
SLIDE 4

Outline

  • Bayesian Decision Theory
  • Error Bound
  • ROC
  • Missing Features
  • Compound Bayesian Decision Theory
  • Max Likelihood Estimation
  • Example with Real World Data
SLIDE 5

Outline

  • Bayesian Decision Theory
  • Error Bound
  • ROC
  • Missing Features
  • Compound Bayesian Decision Theory
  • Max Likelihood Estimation
  • Example with Real World Data
SLIDE 6

Error Bounds

  • Exact error calculations could be difficult – it is easier to estimate error bounds!

P(error|x) = min[P(ω1|x), P(ω2|x)], so

P(error) = ∫ min[P(ω1|x), P(ω2|x)] p(x) dx

SLIDE 7

Error Bounds

  • If the class conditional distributions are Gaussian, then

P(error) ≤ P(ω1)^β P(ω2)^(1−β) e^(−κ(β)), for 0 ≤ β ≤ 1, where

κ(β) = [β(1−β)/2] (μ2 − μ1)ᵀ [βΣ1 + (1−β)Σ2]⁻¹ (μ2 − μ1) + (1/2) ln( |βΣ1 + (1−β)Σ2| / (|Σ1|^β |Σ2|^(1−β)) )

SLIDE 8

Error Bounds

  • The Chernoff bound is obtained by minimizing e^(−κ(β)) over β.
  • This is a 1-D optimization problem, regardless of the dimensionality of the class conditional densities.

SLIDE 9

Error Bounds

  • The Bhattacharyya bound is obtained by setting β = 0.5. It is easier to compute than the Chernoff bound, but looser.
  • Note: the Chernoff and Bhattacharyya bounds will not be good bounds if the densities are not Gaussian.
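To make the Bhattacharyya bound concrete, here is a minimal sketch in Python/NumPy (the function name and the test densities are illustrative assumptions, not part of the lecture). It evaluates κ(1/2) for two Gaussian class densities and returns the bound √(P(ω1)P(ω2))·e^(−κ(1/2)):

```python
import numpy as np

def bhattacharyya_bound(mu1, cov1, mu2, cov2, p1, p2):
    """Bhattacharyya bound on P(error) for two Gaussian class densities.

    Evaluates kappa(1/2) (the Chernoff exponent at beta = 0.5) and returns
    sqrt(p1 * p2) * exp(-kappa(1/2)), an upper bound on the Bayes error.
    """
    mu1, mu2 = np.asarray(mu1, float), np.asarray(mu2, float)
    cov1, cov2 = np.asarray(cov1, float), np.asarray(cov2, float)
    cov_avg = (cov1 + cov2) / 2.0
    diff = mu2 - mu1
    # Mahalanobis-like term plus log-determinant term
    k = diff @ np.linalg.inv(cov_avg) @ diff / 8.0 \
        + 0.5 * np.log(np.linalg.det(cov_avg)
                       / np.sqrt(np.linalg.det(cov1) * np.linalg.det(cov2)))
    return np.sqrt(p1 * p2) * np.exp(-k)
```

For identical densities and equal priors the bound is 0.5 (no information), and it shrinks quickly as the class means separate.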

SLIDE 10

Outline

  • Bayesian Decision Theory
  • Error Bound
  • ROC
  • Missing Features
  • Compound Bayesian Decision Theory
  • Max Likelihood Estimation
  • Example with Real World Data
SLIDE 11

Receiver Operating Characteristic (ROC) Curve

  • Every classifier typically employs some kind of a threshold.
  • Changing the threshold will affect the performance of the classifier.
  • ROC curves allow us to evaluate the performance of a classifier using different thresholds.

θa = P(ω2)/P(ω1)

θb = [P(ω2)(λ12 − λ22)] / [P(ω1)(λ21 − λ11)]
SLIDE 12

Example: Person Authentication

  • Authenticate a person using biometrics (e.g.,

fingerprints).

  • There are two possible distributions (i.e., classes):
SLIDE 13

Example: Person Authentication

  • Possible decisions:

(1) correct acceptance (true positive): X belongs to A, and we decide A
(2) incorrect acceptance (false positive): X belongs to I, and we decide A
(3) correct rejection (true negative): X belongs to I, and we decide I
(4) incorrect rejection (false negative): X belongs to A, and we decide I

[Figure: score distributions for the authentic (A) and impostor (I) classes, with the regions labeled correct acceptance, false positive, correct rejection, and false negative.]
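The four outcomes above can be tallied mechanically; a minimal sketch (the labels 'A'/'I' and the function name are illustrative):

```python
import numpy as np

def authentication_outcomes(true_labels, decisions):
    """Count the four decision outcomes; 'A' = authentic, 'I' = impostor."""
    t, d = np.asarray(true_labels), np.asarray(decisions)
    tp = int(np.sum((t == 'A') & (d == 'A')))  # correct acceptance
    fp = int(np.sum((t == 'I') & (d == 'A')))  # incorrect acceptance
    tn = int(np.sum((t == 'I') & (d == 'I')))  # correct rejection
    fn = int(np.sum((t == 'A') & (d == 'I')))  # incorrect rejection
    return tp, fp, tn, fn
```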

SLIDE 14

ROC Curve

FPR: False Positive Rate (X-axis). TPR: True Positive Rate (Y-axis).

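The FPR/TPR definitions can be sketched as a threshold sweep over a hypothetical score-based acceptor (names and data are illustrative, not from the slides):

```python
import numpy as np

def roc_points(scores_authentic, scores_impostor, thresholds):
    """FPR/TPR pairs obtained by sweeping the decision threshold.

    A sample is accepted (decided 'authentic') when its score >= threshold.
    """
    scores_authentic = np.asarray(scores_authentic, float)
    scores_impostor = np.asarray(scores_impostor, float)
    points = []
    for t in thresholds:
        tpr = float(np.mean(scores_authentic >= t))  # correct acceptances
        fpr = float(np.mean(scores_impostor >= t))   # incorrect acceptances
        points.append((fpr, tpr))
    return points
```

Plotting these points for many thresholds traces out the ROC curve.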

SLIDE 15

Outline

  • Bayesian Decision Theory
  • Error Bound
  • ROC
  • Missing Features
  • Compound Bayesian Decision Theory
  • Max Likelihood Estimation
  • Example with Real World Data
SLIDE 16

Missing Features

  • Suppose x = (x1, x2) is a test vector where x1 is missing and x2 = x̂2 is observed; how can we classify it?
  • If we set x1 equal to the average value, we will classify x as ω3.
  • But p(x̂2|ω2) is larger; should we classify x as ω2?

SLIDE 17

Missing Features

  • Suppose x=[xg, xb] (xg: good features, xb: bad features)
  • Derive the Bayes rule using the good features by marginalizing the posterior probability over the bad features:

P(ωi|xg) = ∫ p(ωi, xg, xb) dxb / p(xg)
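On a discrete grid, the marginalization over bad features is a single sum; a minimal sketch assuming a joint probability table P(ωi, xg, xb) (the function name and table layout are illustrative):

```python
import numpy as np

def posterior_good_features(joint):
    """joint[i, g, b] = P(omega_i, x_g = g, x_b = b) on a discrete grid.

    Marginalize out the bad feature x_b, then normalize over classes to
    obtain P(omega_i | x_g).
    """
    marg = joint.sum(axis=2)        # P(omega_i, x_g)
    return marg / marg.sum(axis=0)  # P(omega_i | x_g)
```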

SLIDE 18

Outline

  • Bayesian Decision Theory
  • Error Bound
  • ROC
  • Missing Features
  • Compound Bayesian Decision Theory
  • Max Likelihood Estimation
  • Example with Real World Data
SLIDE 19

Compound Bayesian Decision Theory

  • Sequential decision
  • Decide as each pattern (e.g., fish) emerges.
  • Compound decision
  • Wait for n patterns (e.g., fish) to emerge.
  • Make all n decisions jointly.

Could improve performance when consecutive states of nature are not statistically independent.
SLIDE 20

Compound Bayesian Decision Theory

  • Suppose ω = (ω(1), …, ω(n)) denotes the n states of nature, where each ω(i) can take one of c values ω1, ω2, …, ωc (i.e., c categories).
  • Suppose P(ω) is the prior probability of the n states of nature.
  • Suppose X = (x1, …, xn) are n observed vectors.

It is unacceptable to simplify the problem of calculating P(ω) by assuming that the states of nature are independent.

SLIDE 21

Outline

  • Bayesian Decision Theory
  • Error Bound
  • ROC
  • Missing Features
  • Compound Bayesian Decision Theory
  • Max Likelihood Estimation
  • Example with Real World Data
SLIDE 22

Intuition

  • We could design an optimal classifier if we knew:
    – P(ωi) (priors)
    – p(x|ωi) (class conditional densities)
    – Unfortunately, we rarely have this complete information!
  • Design a classifier from training data.
  • Training samples are often too few for class conditional estimation (large dimension of feature space).

SLIDE 23

Supervised Learning in a Nutshell

SLIDE 24

Statistical Estimation View

  • Probabilities to the rescue:
  • x and y are random variables
  • IID: Independent Identically Distributed
  • Both training & testing data sampled IID from P(X,Y)
  • Learn on training set
  • Have some hope of generalizing to test set
SLIDE 25

Parameter Estimation

  • Use a priori information about the problem, e.g., normality of p(x|ωi):

p(x|ωi) ~ N(μi, Σi)

  • Simplify the problem:
  • from estimating an unknown distribution function
  • to estimating parameters (μi, Σi)
SLIDE 26

Why Gaussians?

  • Why does the entire world seem to always be harping on about Gaussians?

  – Central Limit Theorem!
  – They're easy (and we like easy)
  – Closely related to squared loss (for regression)
  – A mixture of Gaussians is sufficient to approximate many distributions

SLIDE 27

Parameter Estimation

  • Maximum likelihood: values of parameters are fixed but unknown.
  • Bayesian estimation: parameters are random variables having some known a priori distribution.

SLIDE 28

Parameter Estimation

  • Parameters in ML estimation are fixed but unknown!
  • Best parameters are obtained by maximizing the

probability of obtaining the samples observed.

  • Bayesian methods view the parameters as random

variables having some known distribution.

  • In either approach, we use P(ωi|x) for our classification rule.

SLIDE 29

Maximum Likelihood Estimation: Independence Across Classes

  • For each class ωi we have a proposed density p(x|ωi, θi) with unknown parameters θi which we need to estimate.
  • Since we assumed independence of data across the classes, estimation is an identical procedure for all classes.
  • To simplify notation, we drop sub-indexes and say that we need to estimate parameters θ for the density p(x).

SLIDE 30

Maximum-Likelihood Estimation

  • Has good convergence properties as the sample

size increases

  • Simpler than alternative techniques
  • General principle
  • Assume c datasets (classes) D1, D2, …, Dc, drawn independently according to p(x|ωj).
  • Assume that p(x|ωj) has a known parametric form determined by the parameter vector θj, i.e., p(x|ωj) = p(x|ωj, θj).
  • Further assume that Di gives no information about θj if i ≠ j.

SLIDE 31

Maximum-Likelihood Estimation

  • Use a set of independent samples D = {x1, …, xn} to estimate θ; by independence, p(D|θ) = Π(k=1..n) p(xk|θ).
  • Our goal is to determine θ̂, the value of θ that best agrees with the observed training data.
  • Note: if D is fixed, p(D|θ) is a function of θ, not a density.
SLIDE 32

Example: Gaussian case

  • Assume we have c classes and p(x|ωj) ~ N(μj, Σj).
  • Use the information provided by the training samples to estimate the parameter vector θj = (μj, Σj) associated with each category.
  • Suppose that D contains n samples, x1, …, xn.
SLIDE 33

Maximum-Likelihood Estimation

  • p(D|θ) is called the likelihood of θ w.r.t. the set of samples.
  • The ML estimate of θ is, by definition, the value θ̂ that maximizes p(D|θ): "It is the value of θ that best agrees with the actually observed training samples."
SLIDE 34

Optimal Estimation

  • Let θ = (θ1, …, θp)ᵀ and let ∇θ be the gradient operator.
  • We define l(θ) = ln p(D|θ) as the log likelihood function.
  • New problem statement: determine the θ̂ that maximizes the log likelihood, θ̂ = arg maxθ l(θ).
SLIDE 35

Optimal Estimation

  • A solution to ∇θ l(θ) = 0 could be:
  • a local or global maximum
  • a local or global minimum
  • a saddle point
  • a point on the boundary of the parameter space
SLIDE 36

Example of ML estimation: Unknown μ

  • Samples are drawn from a multivariate normal population: p(xk|μ) ~ N(μ, Σ).
  • ln p(xk|μ) = −(1/2) ln[(2π)^d |Σ|] − (1/2)(xk − μ)ᵀ Σ⁻¹ (xk − μ); therefore ∇μ ln p(xk|μ) = Σ⁻¹ (xk − μ).
  • The ML estimate for μ must satisfy: ∑(k=1..n) Σ⁻¹ (xk − μ̂) = 0.
SLIDE 37

Example of ML estimation: Unknown μ

  • Multiplying by Σ and rearranging, we obtain: μ̂ = (1/n) ∑(k=1..n) xk
  • Just the arithmetic average of the training samples!
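A one-line check of this result, with illustrative toy data:

```python
import numpy as np

# ML estimate of the Gaussian mean: the arithmetic average of the samples
samples = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
mu_hat = samples.mean(axis=0)  # (1/n) * sum_k x_k
```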

SLIDE 38

Example of ML estimation: Unknown μ and σ²

  • Parameters: θ = (θ1, θ2) = (μ, σ²) (univariate Gaussian case)
  • Objective function: l(θ) = ∑(k=1..n) ln p(xk|θ) = ∑(k=1..n) [ −(1/2) ln(2πθ2) − (xk − θ1)²/(2θ2) ]
SLIDE 39

Example of ML estimation: Unknown μ and σ²

  • Setting the summed derivatives w.r.t. θ1 and θ2 to zero gives conditions (1) and (2).
  • Combining (1) and (2), one obtains: μ̂ = (1/n) ∑(k=1..n) xk, σ̂² = (1/n) ∑(k=1..n) (xk − μ̂)²
SLIDE 40

How good are these estimates?

  • Two measures of “goodness” are used for statistical

estimates

  • BIAS: how close is the estimate to the true value?
  • VARIANCE: how much does it change for different

datasets?

  • The bias-variance tradeoff
  • In most cases, you can only decrease one of them at the

expense of the other

SLIDE 41

What is the bias of the ML estimate of the mean?

  • E[μ̂] = E[(1/n) ∑(k=1..n) xk] = (1/n) ∑(k=1..n) E[xk] = μ. Therefore the sample mean is an unbiased estimate.
SLIDE 42

What is the bias of the ML estimate of the variance?

 E[σ̂²] = ((n − 1)/n) σ² ≠ σ². Thus, the ML estimate of the variance is BIASED.

  • This is because the ML estimate of the variance uses μ̂ instead of the true mean μ.

 How "bad" is this bias?

  • For n → ∞ the bias becomes zero asymptotically.
  • The bias is only noticeable when we have very few samples, in which case we should not be doing statistics in the first place!

 Notice that MATLAB uses an unbiased estimate of the covariance, normalizing by n − 1 (except in the extreme case of n = 1, where the n − 1 normalization is undefined).
SLIDE 43

Outline

  • Bayesian Decision Theory
  • Error Bound
  • ROC
  • Missing Features
  • Compound Bayesian Decision Theory
  • Max Likelihood Estimation
  • Example with Real World Data
SLIDE 44

Example with real world data (1)

  • Image is acquired by the

ROSIS-03 optical sensor over the University of Pavia, Italy

  • Spatial dimension: 610 x 340

pixels

  • Spatial resolution: 1.3m per

pixel

  • Spectral dimension: 103 spectral channels (0.43 to 0.86 μm)

SLIDE 45

Example with real world data (2)

SLIDE 46

Example with real world data (3)

  • We split reference data into sets of training and test samples:
SLIDE 47

Spectral Context for HS Image

SLIDE 48

Spectral Context for HS Image

SLIDE 49

Maximum Likelihood Classification

  • Feature vector: a vector of radiance values x for each

pixel

103 spectral bands, so the dimensionality of the feature vector is d = 103

SLIDE 50

Maximum Likelihood Classification

  • Samples of each class k are assumed to have a

Gaussian distribution

  • Parameters of the distribution for each class are estimated from the training samples, using the maximum likelihood estimates:

μ̂k = (1/mk) ∑(x∈Dk) x,  Σ̂k = (1/mk) ∑(x∈Dk) (x − μ̂k)(x − μ̂k)ᵀ
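A per-class sketch of these ML estimates (the function name and toy data are illustrative, not from the lecture):

```python
import numpy as np

def estimate_class_params(X, y, k):
    """ML estimates (mean, covariance) for class k from labeled samples."""
    Xk = X[y == k]
    mu = Xk.mean(axis=0)
    centered = Xk - mu
    cov = centered.T @ centered / len(Xk)  # ML covariance: divide by m_k
    return mu, cov
```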

SLIDE 51

Maximum Likelihood Classification

  • For each class k, P = [d(d+1)/2 + d] parameters have

to be estimated

  • If d = 103, P = 5459!
  • We have only from 231 to 548 training samples per

class

  • To avoid a significant parameter estimation error: P

<< mk (mk – number of training samples for class k)

SLIDE 52

Maximum Likelihood Classification

  • Dimensionality reduction must be performed first, to

reduce the dimensionality d

  • The first 3 bands of the 103-band image are omitted.
  • A 10 band image is obtained by averaging over every

10 bands (new d = 10)
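The band-averaging step can be sketched as a reshape-and-mean over a (rows, cols, bands) cube (a sketch assuming the 3 extra bands were already dropped, leaving a band count divisible by 10):

```python
import numpy as np

def reduce_bands(cube, group=10):
    """Average every `group` consecutive bands of a (rows, cols, d) cube."""
    rows, cols, d = cube.shape
    return cube.reshape(rows, cols, d // group, group).mean(axis=3)
```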

SLIDE 53

Maximum Likelihood Classification

  • 1) Parameters of Gaussian distributions for each

class are estimated

  • 2) The whole image is classified using K = 9 (number of classes) discriminant functions (MAP classification), with priors P(ωk) = mk/m, where m is the total number of training samples.
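The MAP classification step can be sketched with the standard Gaussian log-discriminant gk(x) = −(1/2)(x − μk)ᵀ Σk⁻¹ (x − μk) − (1/2) ln|Σk| + ln P(ωk) (the function name and toy parameters below are illustrative):

```python
import numpy as np

def map_classify(x, mus, covs, priors):
    """Assign x to the class with the largest Gaussian MAP discriminant."""
    scores = []
    for mu, cov, p in zip(mus, covs, priors):
        diff = x - mu
        g = (-0.5 * diff @ np.linalg.inv(cov) @ diff
             - 0.5 * np.log(np.linalg.det(cov)) + np.log(p))
        scores.append(g)
    return int(np.argmax(scores))
```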

SLIDE 54

Maximum Likelihood Classification

SLIDE 55

Maximum Likelihood Classification

SLIDE 56

Conclusions for the classification example

  • Classification accuracies are high for most of the

classes

  • Other feature extraction (dimensionality reduction) methods can be used to further improve the accuracies.

SLIDE 57

Computational complexity

  • Example: complexity of a ML estimation of the

parameters in a classifier for Gaussian priors in d dimension, with n training samples

  • For each of the c categories: mean O(nd), covariance O(nd²), inverse and determinant O(d³); dominated by O(nd²) when n > d.
  • Overall computational complexity for learning is O(cnd²).
  • Computational complexity for classification of one sample is O(cd²).

SLIDE 58

Computational complexity

  • Parallel implementations
  • Space complexity
  • Time complexity
  • Example: Estimation of the sample mean using d processors,

each adding n values

  • Space complexity: O(d)
  • Time complexity: O(n)
SLIDE 59