SLIDE 1

Lecture 3: Bayesian Decision Theory

  • Dr. Chengjiang Long

Computer Vision Researcher at Kitware Inc. Adjunct Professor at RPI. Email: longc3@rpi.edu

SLIDE 2

Recap Previous Lecture

SLIDE 3

Outline

  • What's Bayesian Decision Theory?
  • A More General Theory
  • Discriminant Function and Decision Boundary
  • Multivariate Gaussian Density

SLIDE 4

Outline

  • What's Bayesian Decision Theory?
  • A More General Theory
  • Discriminant Function and Decision Boundary
  • Multivariate Gaussian Density

SLIDE 5

Bayesian Decision Theory

  • Design classifiers to make decisions subject to minimizing an expected "risk".
  • The simplest risk is the classification error (i.e., assuming that all misclassification costs are equal).
  • When misclassification costs are not equal, the risk can include the cost associated with the different misclassifications.

SLIDE 6

Terminology

  • State of nature ω (class label):
    • e.g., ω1 for sea bass, ω2 for salmon
  • Probabilities P(ω1) and P(ω2) (priors):
    • e.g., prior knowledge of how likely it is to get a sea bass or a salmon
  • Probability density function p(x) (evidence):
    • e.g., how frequently we will measure a pattern with feature value x (e.g., x corresponds to lightness)

SLIDE 7

Terminology

  • Conditional probability density p(x/ωj) (likelihood):
    • e.g., how frequently we will measure a pattern with feature value x given that the pattern belongs to class ωj

SLIDE 8

Terminology

  • Conditional probability P(ωj/x) (posterior):
    • e.g., the probability that the fish belongs to class ωj given feature x.
  • Ultimately, we are interested in computing P(ωj/x) for each class ωj.

SLIDE 9

Decision Rule

  • Decision rule using the priors only: decide ω1 if P(ω1) > P(ω2); otherwise decide ω2.
  • Favours the most likely class.
  • This rule makes the same decision every time.
  • i.e., it is optimum if no other information is available.

SLIDE 10

Decision Rule

  • Using Bayes' rule:

    P(ωj/x) = p(x/ωj) P(ωj) / p(x)  =  (likelihood × prior) / evidence

    where the evidence is

    p(x) = Σ j=1..2 p(x/ωj) P(ωj)

  • Decide ω1 if P(ω1/x) > P(ω2/x); otherwise decide ω2.
  • Equivalently, decide ω1 if p(x/ω1)P(ω1) > p(x/ω2)P(ω2); otherwise decide ω2.
  • Equivalently, decide ω1 if p(x/ω1)/p(x/ω2) > P(ω2)/P(ω1); otherwise decide ω2.
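To make the rule concrete, here is a minimal Python sketch of a two-class Bayes decision; the Gaussian class-conditional densities and the priors are made-up numbers for illustration, not values from the lecture:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical class-conditional densities (illustrative only):
# p(x/w1) ~ N(2, 1), p(x/w2) ~ N(4, 1), with priors P(w1) = 2/3, P(w2) = 1/3.
priors = np.array([2 / 3, 1 / 3])

def posteriors(x):
    likelihoods = np.array([norm.pdf(x, loc=2, scale=1),
                            norm.pdf(x, loc=4, scale=1)])
    joint = likelihoods * priors              # p(x/wj) * P(wj)
    return joint / joint.sum()                # divide by the evidence p(x)

def decide(x):
    # Decide w1 if P(w1/x) > P(w2/x); otherwise decide w2.
    return 1 if posteriors(x)[0] > posteriors(x)[1] else 2

print(posteriors(3.0), decide(3.0))
```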

SLIDE 11

Decision Rule

  • Example (figure): the class-conditional densities p(x/ωj) and the resulting posteriors P(ωj/x), for priors P(ω1) = 2/3 and P(ω2) = 1/3.

SLIDE 12

Probability of Error

  • The probability of error is defined as: P(error/x) = P(ω1/x) if we decide ω2, and P(ω2/x) if we decide ω1.
  • What is the average probability of error? P(error) = ∫ P(error/x) p(x) dx
  • The Bayes rule is optimum, that is, it minimizes the average probability of error: it guarantees P(error/x) = min[P(ω1/x), P(ω2/x)] at every x.

SLIDE 13

Where do Probabilities come from?

  • There are two competing answers:
  • Relative frequency (objective) approach:
    • Probabilities can only come from experiments.
  • Bayesian (subjective) approach:
    • Probabilities may reflect degrees of belief and can be based on opinion.

SLIDE 14

Example: Objective approach

  • Classify cars according to whether they cost more or less than $50K:
  • Classes: C1 if price > 50K, C2 if price <= 50K
  • Feature: x, the height of a car
  • Use Bayes' rule to compute the posterior probabilities:

    P(Ci/x) = p(x/Ci) P(Ci) / p(x)

  • We need to estimate p(x/C1), p(x/C2), P(C1), P(C2)

SLIDE 15

Example: Objective approach

  • Collect data:
    • Ask drivers how much their car was and measure its height.
  • Determine the prior probabilities P(C1), P(C2):
    • e.g., 1209 samples: #C1 = 221, #C2 = 988

    P(C1) = 221/1209 = 0.183
    P(C2) = 988/1209 = 0.817

SLIDE 16

Example: Objective approach

  • Determine the class-conditional probabilities (likelihoods) p(x/Ci):
    • Discretize the car height into bins and use the normalized histogram.
  • Calculate the posterior probability for each bin, e.g. for x = 1.0:

    P(C1/x=1.0) = p(x=1.0/C1) P(C1) / [ p(x=1.0/C1) P(C1) + p(x=1.0/C2) P(C2) ]
                = 0.2081 × 0.183 / (0.2081 × 0.183 + 0.0597 × 0.817)
                = 0.438
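A compact Python sketch of this whole pipeline, using synthetic (height, price) data; the variable names, the made-up price/height relationship, and the bin layout are illustrative assumptions, so the numbers will not match the slide:

```python
import numpy as np

rng = np.random.default_rng(0)
prices = rng.uniform(10_000, 120_000, size=1209)                     # hypothetical car prices
heights = 1.8 - prices / 200_000 + rng.normal(0.0, 0.1, size=1209)   # made-up trend: pricier cars sit lower

c1 = heights[prices > 50_000]             # class C1: price > 50K
c2 = heights[prices <= 50_000]            # class C2: price <= 50K
P_C1, P_C2 = len(c1) / len(heights), len(c2) / len(heights)          # priors from counts

bins = np.linspace(heights.min(), heights.max(), 21)                 # 20 height bins
p_x_C1, _ = np.histogram(c1, bins=bins, density=True)                # normalized histogram ~ p(x/C1)
p_x_C2, _ = np.histogram(c2, bins=bins, density=True)                # normalized histogram ~ p(x/C2)

def posterior_C1(x):
    b = int(np.clip(np.searchsorted(bins, x) - 1, 0, len(p_x_C1) - 1))  # bin index for x
    num = p_x_C1[b] * P_C1
    den = num + p_x_C2[b] * P_C2          # evidence restricted to the two classes
    return num / den if den > 0 else 0.0

print(P_C1, P_C2, posterior_C1(1.3))
```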

SLIDE 17

Outline

  • What's Bayesian Decision Theory?
  • A More General Theory
  • Discriminant Function and Decision Boundary
  • Multivariate Gaussian Density

SLIDE 18

A More General Theory

  • Use more than one feature.
  • Allow more than two categories.
  • Allow actions other than classifying the input into one of the possible categories (e.g., rejection).
  • Employ a more general error function (i.e., an expected "risk") by associating a "cost" (based on a "loss" function) with different errors.

SLIDE 19

Terminology

  • Features form a vector x ∈ R^d
  • A set of c categories ω1, ω2, …, ωc
  • A finite set of l actions α1, α2, …, αl
  • A loss function λ(αi/ωj): the cost associated with taking action αi when the correct classification category is ωj

SLIDE 20

Conditional Risk (or Expected Loss)

  • Suppose we observe x and take action αi.
  • The conditional risk (or expected loss) of taking action αi is defined as:

    R(αi/x) = Σ j=1..c λ(αi/ωj) P(ωj/x)

Example: from a medical image, we want to determine whether or not it contains cancer tissue.
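A minimal sketch of minimum-risk decision making in Python, assuming the posteriors P(ωj/x) have already been computed; the loss matrix entries are illustrative (e.g., missing a cancer is given a much larger loss than a false alarm):

```python
import numpy as np

# Hypothetical loss matrix: rows = actions, columns = true classes.
# lam[i, j] = loss lambda(a_i / w_j) for taking action a_i when the true class is w_j.
lam = np.array([[0.0, 10.0],    # a_1: report "healthy" (very costly if the true class is "cancer")
                [1.0,  0.0]])   # a_2: report "cancer"

posterior = np.array([0.8, 0.2])            # P(w1/x), P(w2/x) for the observed image x

cond_risk = lam @ posterior                 # R(a_i/x) = sum_j lambda(a_i/w_j) P(w_j/x)
best_action = int(np.argmin(cond_risk))     # Bayes rule: take the action with minimum conditional risk

print(cond_risk, best_action)
```

Here the risks are R(α1/x) = 2.0 and R(α2/x) = 0.8, so the minimum-risk action is to report "cancer" even though "healthy" is the more probable class.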

SLIDE 21

Overall Risk

  • Suppose α(x) is a general decision rule that determines which action α1, α2, …, αl to take for every x.
  • The overall risk is defined as:

    R = ∫ R(α(x)/x) p(x) dx

  • The optimum decision rule is the Bayes rule.

SLIDE 22

Overall Risk

  • The Bayes rule minimizes R by:
    (i) computing R(αi/x) for every αi given an x
    (ii) choosing the action αi with the minimum R(αi/x)
  • The resulting minimum R* is called the Bayes risk and is the best (i.e., optimum) performance that can be achieved:

    R* = min R

SLIDE 23

Example: Two-category classification

  • Define:
    • α1: decide ω1
    • α2: decide ω2
    • λij = λ(αi/ωj)
  • The conditional risks are (from R(αi/x) = Σj λ(αi/ωj) P(ωj/x)):

    R(α1/x) = λ11 P(ω1/x) + λ12 P(ω2/x)
    R(α2/x) = λ21 P(ω1/x) + λ22 P(ω2/x)

SLIDE 24

Example: Two-category classification

  • Minimum-risk decision rule:
    Decide ω1 if (λ21 − λ11) P(ω1/x) > (λ12 − λ22) P(ω2/x); otherwise decide ω2.
  • Equivalently, using the likelihood ratio:
    Decide ω1 if p(x/ω1)/p(x/ω2) > (λ12 − λ22) P(ω2) / [(λ21 − λ11) P(ω1)]; otherwise decide ω2.
    The left-hand side is the likelihood ratio and the right-hand side is a threshold that depends only on the losses and the priors.
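As a worked example with made-up numbers (not from the slide): take λ11 = λ22 = 0, λ12 = 2, λ21 = 1, P(ω1) = 2/3, and P(ω2) = 1/3. Then the threshold on the likelihood ratio is

$$ \frac{(\lambda_{12}-\lambda_{22})\,P(\omega_2)}{(\lambda_{21}-\lambda_{11})\,P(\omega_1)} \;=\; \frac{2 \cdot \tfrac{1}{3}}{1 \cdot \tfrac{2}{3}} \;=\; 1, $$

so we decide ω1 only when p(x/ω1) > p(x/ω2). Under zero-one loss the threshold would have been P(ω2)/P(ω1) = 1/2; the larger cost λ12 of wrongly deciding ω1 raises the threshold and shrinks the ω1 region.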

SLIDE 25

Special Case: Zero-One Loss Function

  • Assign the same loss to all errors:

    λ(αi/ωj) = 0 if i = j, and 1 if i ≠ j

  • The conditional risk corresponding to this loss function:

    R(αi/x) = Σ j≠i P(ωj/x) = 1 − P(ωi/x)

SLIDE 26

Special Case: Zero-One Loss Function

  • The decision rule becomes:
    Decide ωi if P(ωi/x) > P(ωj/x) for all j ≠ i
    (equivalently, decide ω1 if p(x/ω1)/p(x/ω2) > P(ω2)/P(ω1) in the two-category case).
  • The overall risk turns out to be the average probability of error!

SLIDE 27

Example

  • Assuming zero-one loss:
    Decide ω1 if p(x/ω1)/p(x/ω2) > θa; otherwise decide ω2, where θa = P(ω2)/P(ω1).
  • Assuming a general loss:
    Decide ω1 if p(x/ω1)/p(x/ω2) > θb; otherwise decide ω2, where θb = P(ω2)(λ12 − λ22) / [P(ω1)(λ21 − λ11)].
  • Assume λ12 > λ21 (e.g., with λ11 = λ22 = 0): then θb > θa, so the region where we decide ω1 shrinks.

SLIDE 28

Outline

  • What's Bayesian Decision Theory?
  • A More General Theory
  • Discriminant Function and Decision Boundary
  • Multivariate Gaussian Density
  • Error Bound, ROC, Missing Features and Compound Bayesian Decision Theory
  • Summary

SLIDE 29

Discriminant Functions

  • A useful way to represent a classifier is through discriminant functions gi(x), i = 1, …, c, where a feature vector x is assigned to class ωi if gi(x) > gj(x) for all j ≠ i.
  • (Figure: a network computes the c discriminants and a max unit selects the winning class.)

SLIDE 30

Discriminants for Bayes Classifier

  • Is the choice of gi unique?
  • Replacing gi(x) with f(gi(x)), where f(·) is monotonically increasing, does not change the classification results.
  • Equivalent discriminants for the Bayes classifier:

    gi(x) = P(ωi/x) = p(x/ωi) P(ωi) / p(x)
    gi(x) = p(x/ωi) P(ωi)
    gi(x) = ln p(x/ωi) + ln P(ωi)   ← we'll use this discriminant extensively!

SLIDE 31

Case of two categories

  • It is more common to use a single discriminant function (dichotomizer) instead of two:
    g(x) = g1(x) − g2(x); decide ω1 if g(x) > 0, otherwise decide ω2.
  • Examples:

    g(x) = P(ω1/x) − P(ω2/x)
    g(x) = ln [p(x/ω1)/p(x/ω2)] + ln [P(ω1)/P(ω2)]

SLIDE 32

Decision Regions and Boundaries

  • Discriminants divide the feature space into decision regions R1, R2, …, Rc, separated by decision boundaries.
  • The decision boundary between two regions is defined by: g1(x) = g2(x)
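For two 1-D Gaussian classes the boundary point can be found numerically. A small sketch (the means, variances, and priors are illustrative, not from the lecture):

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import norm

P1, P2 = 2 / 3, 1 / 3                                     # illustrative priors

def g1(x):
    return np.log(norm.pdf(x, 2, 1)) + np.log(P1)         # g1(x) = ln p(x/w1) + ln P(w1)

def g2(x):
    return np.log(norm.pdf(x, 4, 1)) + np.log(P2)         # g2(x) = ln p(x/w2) + ln P(w2)

# Decision boundary: the x where g1(x) = g2(x); bracket it and solve.
x0 = brentq(lambda x: g1(x) - g2(x), 0.0, 6.0)
print(x0)   # about 3.35: shifted away from the midpoint 3.0 because P(w1) > P(w2)
```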

SLIDE 33

Outline

  • What's Bayesian Decision Theory?
  • A More General Theory
  • Discriminant Function and Decision Boundary
  • Multivariate Gaussian Density

SLIDE 34

Why are Gaussians so Useful?

  • They represent many probability distributions in nature quite accurately.
  • In our case, they are appropriate when patterns can be represented as random variations of an ideal prototype (represented by the mean feature vector).
  • Everyday examples: the height or weight of a population.

SLIDE 35

Multivariate Gaussian Density

  • A normal distribution over two or more variables (d variables/dimensions), parameterized by a d-dimensional mean vector μ and a d×d covariance matrix Σ.
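For reference, the standard form of the d-dimensional Gaussian density with mean vector μ and covariance matrix Σ is:

$$ p(\mathbf{x}) \;=\; \frac{1}{(2\pi)^{d/2}\,|\boldsymbol{\Sigma}|^{1/2}} \exp\!\left(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\mathsf T}\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right), $$

often abbreviated as x ~ N(μ, Σ).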

SLIDE 36

The Covariance Matrix

  • For our purposes, assume the matrix is positive definite, so the determinant of the matrix is always positive.
  • Matrix elements:
    • Main diagonal: the variance of each individual variable
    • Off-diagonal: the covariance of each variable pairing i and j (note: values are repeated, as the matrix is symmetric)
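A small Python sketch of these properties, estimating Σ from synthetic 2-D data (the mean and covariance used to generate the samples are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
# 500 samples of a hypothetical 2-D feature vector (e.g., length and lightness of a fish)
X = rng.multivariate_normal(mean=[5.0, 2.0],
                            cov=[[1.0, 0.6],
                                 [0.6, 0.5]], size=500)

Sigma = np.cov(X, rowvar=False)          # sample covariance matrix (2 x 2)

print(np.diag(Sigma))                    # main diagonal: variance of each feature
print(Sigma[0, 1], Sigma[1, 0])          # off-diagonal: covariance, repeated by symmetry
print(np.linalg.det(Sigma) > 0)          # positive definite => positive determinant
```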

SLIDE 37

Discriminant Function for Multivariate Gaussian Density

  • We will consider three special cases for:
    • normally distributed features, and
    • minimum error-rate classification (0-1 loss)
  • Recall the discriminant we will use: gi(x) = ln p(x/ωi) + ln P(ωi)

SLIDE 38

Minimum Error-Rate Discriminant Function for Multivariate Gaussian Feature Distributions

  • Taking the natural log (ln) of p(x/ωi) P(ωi), with p(x/ωi) a multivariate Gaussian, gives a general form for our discriminant functions:

    gi(x) = −(1/2)(x − μi)ᵀ Σi⁻¹ (x − μi) − (d/2) ln 2π − (1/2) ln |Σi| + ln P(ωi)
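A direct Python sketch of this general discriminant; the class means, covariances, and priors below are placeholders for illustration:

```python
import numpy as np

def gaussian_discriminant(x, mu, Sigma, prior):
    """g_i(x) = ln p(x/w_i) + ln P(w_i) for a Gaussian class-conditional density."""
    d = len(mu)
    diff = x - mu
    mahal_sq = diff @ np.linalg.solve(Sigma, diff)      # (x - mu)^T Sigma^{-1} (x - mu)
    log_det = np.linalg.slogdet(Sigma)[1]               # ln |Sigma|
    return (-0.5 * mahal_sq - 0.5 * d * np.log(2 * np.pi)
            - 0.5 * log_det + np.log(prior))

# Two hypothetical classes in 2-D
mus    = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), np.array([[2.0, 0.3], [0.3, 1.0]])]
priors = [0.5, 0.5]

x = np.array([1.0, 2.0])
scores = [gaussian_discriminant(x, m, S, P) for m, S, P in zip(mus, Sigmas, priors)]
print(int(np.argmax(scores)))    # assign x to the class with the largest g_i(x)
```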

SLIDE 39

Special Cases for Binary Classification

  • Purpose: an overview of the commonly assumed cases for the feature likelihood densities.
  • Goal: eliminate common additive constants in the discriminant functions. These do not affect the classification decision, since they are the same for every class (only the differences between discriminants matter).
  • Also, look at the resulting decision surfaces.
  • Three special cases:
    • Case I: statistically independent features, identically distributed Gaussians for each class
    • Case II: identical covariances for each class
    • Case III: arbitrary covariances

SLIDE 40

Case I: Σi = σ²I

  • This case satisfies two conditions: (1) the features are statistically independent, and (2) each feature has the same variance.
  • Remove the items shown in red on the slide: they are the same across classes ("unimportant additive constants").
  • Inverse of the covariance matrix: Σi⁻¹ = (1/σ²) I
  • Its only effect is to scale the vector product by 1/σ².
  • Discriminant function: gi(x) = −‖x − μi‖² / (2σ²) + ln P(ωi)

SLIDE 41

Case I: Σi = σ²I (continued)

  • Linear discriminant function, produced by factoring the previous form:

    gi(x) = wiᵀ x + wi0, with weight vector wi = μi / σ²

  • Threshold or bias for class i: wi0 = −μiᵀμi / (2σ²) + ln P(ωi)
  • A change in the prior P(ωi) translates the decision boundary.

SLIDE 42

Case I: Σi = σ²I (continued)

  • Decision boundary: wᵀ(x − x0) = 0, with w = μi − μj and
    x0 = (μi + μj)/2 − [σ² / ‖μi − μj‖²] ln[P(ωi)/P(ωj)] (μi − μj)
  • The decision boundary goes through x0 along the line between the means, orthogonal to this line.
  • If the priors are equal, x0 lies midway between the means (minimum-distance classifier); otherwise x0 is shifted away from the more likely mean.
  • If the variance is small relative to the distance between the means, the priors have limited effect on the boundary location.

SLIDE 43

Case 1: Statistically Independent Features with Identical Variances

SLIDE 44

Example: Translation of Decision Boundaries Through Changing Priors

SLIDE 45

Case II: Identical Covariances

  • Remove the terms in red; as in Case I, these can be ignored (same across classes).
  • Squared Mahalanobis distance (highlighted in yellow on the slide):

    (x − μi)ᵀ Σ⁻¹ (x − μi)

  • This is the distance from x to the mean of class i, taking the covariance into account; it defines contours of fixed density.
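A quick Python sketch of the squared Mahalanobis distance for a shared covariance Σ (the matrix, mean, and query point are illustrative):

```python
import numpy as np

Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])            # shared covariance for all classes (Case II)
mu_i = np.array([1.0, 1.0])               # mean of class i
x = np.array([2.5, 0.0])

diff = x - mu_i
mahal_sq = diff @ np.linalg.solve(Sigma, diff)   # (x - mu_i)^T Sigma^{-1} (x - mu_i)
eucl_sq = diff @ diff                            # ordinary squared Euclidean distance

print(mahal_sq, eucl_sq)   # the two differ unless Sigma is the identity
```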

SLIDE 46

Case II: Identical Covariances

  • Expansion of the squared Mahalanobis distance:

    (x − μi)ᵀ Σ⁻¹ (x − μi) = xᵀΣ⁻¹x − 2μiᵀΣ⁻¹x + μiᵀΣ⁻¹μi

    The last step uses the symmetry of the covariance matrix, and thus of its inverse: μiᵀΣ⁻¹x = xᵀΣ⁻¹μi.

  • Once again, the term xᵀΣ⁻¹x (shown in red above) is an additive constant independent of the class, and can be removed.

SLIDE 47

Multivariate Gaussian Density, Case II: Σi = Σ (continued)

  • Linear discriminant function: gi(x) = wiᵀ x + wi0, with wi = Σ⁻¹μi and wi0 = −(1/2)μiᵀΣ⁻¹μi + ln P(ωi)
  • Decision boundary: wᵀ(x − x0) = 0, with w = Σ⁻¹(μi − μj) and
    x0 = (μi + μj)/2 − [ln(P(ωi)/P(ωj)) / (μi − μj)ᵀΣ⁻¹(μi − μj)] (μi − μj)

SLIDE 48

Case II: Identical Covariances

  • Notes on the decision boundary:
    • As in Case I, it passes through a point x0 lying on the line between the two class means. Again, x0 is in the middle if the priors are identical.
    • The hyperplane defined by the boundary is generally not orthogonal to the line between the two means.

SLIDE 49

Case III: Σi arbitrary

  • Only the (d/2) ln 2π term (in red above) is the same across classes and can be removed.
  • The discriminant function is quadratic:

    gi(x) = xᵀ Wi x + wiᵀ x + wi0, with Wi = −(1/2)Σi⁻¹, wi = Σi⁻¹μi, and
    wi0 = −(1/2)μiᵀΣi⁻¹μi − (1/2) ln |Σi| + ln P(ωi)

SLIDE 50

Case III: Σi arbitrary (continued)

  • Decision boundaries are hyperquadrics: they can be hyperplanes, pairs of hyperplanes, hyperspheres, hyperellipsoids, hyperparaboloids, or hyperhyperboloids.
  • Decision regions need not be simply connected, even in one dimension (next slide).
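A tiny Python sketch of this one-dimensional effect, with made-up parameters: when the two class variances differ, g1(x) − g2(x) is quadratic in x, so one class is decided only on a bounded interval and the other class's region splits into two disjoint pieces.

```python
import numpy as np

# Two 1-D Gaussian classes with equal priors but different variances (illustrative)
mu1, s1 = 0.0, 1.0
mu2, s2 = 0.0, 3.0
P1 = P2 = 0.5

# g_i(x) = -ln(s_i) - (x - mu_i)^2 / (2 s_i^2) + ln P_i, so g1(x) - g2(x) = a x^2 + b x + c
a = -1 / (2 * s1**2) + 1 / (2 * s2**2)
b = mu1 / s1**2 - mu2 / s2**2
c = (-mu1**2 / (2 * s1**2) + mu2**2 / (2 * s2**2)
     - np.log(s1) + np.log(s2) + np.log(P1) - np.log(P2))

boundaries = np.sort(np.roots([a, b, c]))
print(boundaries)   # two boundary points: w1 is chosen only between them, w2 on both sides
```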

SLIDE 51

Case III: Σi arbitrary

SLIDE 52

Case III: Σi arbitrary

Nonlinear decision boundaries

SLIDE 53

Example: Case III

With P(ω1) = P(ω2), the decision boundary shown in the figure does not pass through the midpoint of μ1 and μ2.
