CS 559: Machine Learning Fundamentals and Applications, 4th Set of Notes



slide-1
SLIDE 1

CS 559: Machine Learning Fundamentals and Applications 4th Set of Notes

Instructor: Philippos Mordohai
Webpage: www.cs.stevens.edu/~mordohai
E-mail: Philippos.Mordohai@stevens.edu
Office: Lieb 215

1

slide-2
SLIDE 2

Overview

  • Parameter Estimation

– Frequentist or Maximum Likelihood approach (cont.)
– Bayesian approach (Barber Ch. 8 and DHS Ch. 3)
  • Cross-validation
  • Overfitting
  • Naïve Bayes Classifier
  • Non-parametric Techniques

2

slide-3
SLIDE 3

MLE Classifier Example

3

slide-4
SLIDE 4

Data

  • Pima Indians Diabetes Database

– http://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes

– Number of Instances: 768
– Number of Attributes: 8 plus class
– Class Distribution (class value 1 is interpreted as "tested positive for diabetes"):
  • Class 0: 500 instances
  • Class 1: 268 instances

4

slide-5
SLIDE 5

Data

Attributes: (all numeric-valued)

  • 1. Number of times pregnant
  • 2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test
  • 3. Diastolic blood pressure (mm Hg)
  • 4. Triceps skin fold thickness (mm)
  • 5. 2-hour serum insulin (mu U/ml)
  • 6. Body mass index (weight in kg / (height in m)^2)
  • 7. Diabetes pedigree function
  • 8. Age (years)
  • 9. Class variable (0 or 1)

5

slide-6
SLIDE 6

Simple MLE Classifier

data = dlmread('pima-indians-diabetes.data');
data = reshape(data,[],9);
% use randperm to re-order the data (ignore if not using Matlab)
rp = randperm(length(data));
data = data(rp,:);
train_data = data(1:length(data)/2,:);
test_data = data(length(data)/2+1:end,:);

6

slide-7
SLIDE 7

% pick a feature
active_feat = 3;
% training
mean1 = mean(train_data(train_data(:,9)==0,active_feat))
mean2 = mean(train_data(train_data(:,9)==1,active_feat))
var1 = var(train_data(train_data(:,9)==0,active_feat))
var2 = var(train_data(train_data(:,9)==1,active_feat))
prior1tmp = length(train_data(train_data(:,9)==0));   % number of class-0 rows
prior2tmp = length(train_data(train_data(:,9)==1));   % number of class-1 rows
prior1 = prior1tmp/(prior1tmp+prior2tmp)
prior2 = prior2tmp/(prior1tmp+prior2tmp)

7

slide-8
SLIDE 8

% testing
correct = 0;
wrong = 0;
for i = 1:length(test_data)
    lklhood1 = exp(-(test_data(i,active_feat)-mean1)^2/(2*var1))/sqrt(var1);
    lklhood2 = exp(-(test_data(i,active_feat)-mean2)^2/(2*var2))/sqrt(var2);
    post1 = lklhood1*prior1;
    post2 = lklhood2*prior2;
    if(post1 > post2 && test_data(i,9) == 0)
        correct = correct+1;
    elseif(post1 < post2 && test_data(i,9) == 1)
        correct = correct+1;
    else
        wrong = wrong+1;
    end
end

8

slide-9
SLIDE 9

Training/Test Split

  • Randomly split dataset into two parts:

– Training data
– Test data

  • Use training data to optimize parameters
  • Evaluate error using test data

9

slide-10
SLIDE 10

Training/Test Split

  • How many points in each set?
  • Very hard question

– Too few points in the training set: the learned classifier is bad
– Too few points in the test set: the classifier evaluation is insufficient

  • Cross-validation
  • Leave-one-out cross-validation
  • Bootstrapping

10

slide-11
SLIDE 11

Cross-Validation

  • In practice
  • Available data => training and validation
  • Train on the training data
  • Test on the validation data
  • k-fold cross validation:

– Data randomly separated into k groups
– Each time, k−1 groups are used for training and one for testing

11
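As a rough illustration (not from the slides), here is a minimal Matlab sketch of k-fold cross-validation on the Pima data loaded on slide 6; train_and_test is a placeholder for any classifier (e.g. the MLE classifier of slides 6-8) that returns an accuracy, and is not defined in the slides.

% Hypothetical k-fold cross-validation skeleton ('data' as loaded on slide 6)
k = 5;
n = size(data,1);
idx = randperm(n);                          % random order of the rows
fold = ceil((1:n)/(n/k));                   % fold label 1..k for each position
acc = zeros(k,1);
for f = 1:k
    test_rows  = idx(fold == f);
    train_rows = idx(fold ~= f);
    % train_and_test is a placeholder, not defined in the slides
    acc(f) = train_and_test(data(train_rows,:), data(test_rows,:));
end
mean_cv_accuracy = mean(acc)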

slide-12
SLIDE 12

Cross Validation and Test Accuracy

  • If we select parameters so that CV is highest:

– Does CV represent future test accuracy?
– Slightly different

  • If we have enough parameters, we can achieve 100% CV accuracy as well

– e.g. more parameters than # of training data points

  • But test accuracy may be different
  • So split available data with class labels, into:

– training
– validation
– testing

12

slide-13
SLIDE 13

Cross Validation and Test Accuracy

  • Use CV on training + validation data
  • Classify test data with the best parameters from CV

13

slide-14
SLIDE 14

Overfitting

  • Prediction error: probability that a test pattern is not in the class with the maximum (true) posterior
  • Training error: probability that a test pattern is not in the class with the maximum (estimated) posterior
  • The classifier is optimized w.r.t. the training error

– The training error is an optimistically biased estimate of the prediction error

14

slide-15
SLIDE 15

Overfitting

Overfitting: a learning algorithm overfits the training data if it outputs a solution w when another solution w' exists such that:

$$\mathrm{error}_{\mathrm{train}}(w) < \mathrm{error}_{\mathrm{train}}(w') \quad \text{AND} \quad \mathrm{error}_{\mathrm{true}}(w') < \mathrm{error}_{\mathrm{true}}(w)$$

15
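The following slides illustrate this with the fish classifier from DHS. As a quick hands-on companion, here is a minimal Matlab sketch (not from the slides; the data, noise level, and polynomial degrees are made up) showing the training error shrinking while the error on held-out data grows as model complexity increases.

% Hypothetical overfitting demo: polynomial fits of increasing degree
rng(0);
x_train = linspace(0,1,10)';  y_train = sin(2*pi*x_train) + 0.2*randn(10,1);
x_test  = linspace(0,1,100)'; y_test  = sin(2*pi*x_test)  + 0.2*randn(100,1);
for degree = [1 3 9]
    p = polyfit(x_train, y_train, degree);            % fit on training data only
    err_train = mean((polyval(p, x_train) - y_train).^2);
    err_test  = mean((polyval(p, x_test)  - y_test ).^2);
    fprintf('degree %d: train MSE %.3f, test MSE %.3f\n', degree, err_train, err_test);
end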

slide-16
SLIDE 16

Pattern Classification, Chapter 1 16

Fish Classifier from DHS Ch. 1

slide-17
SLIDE 17

Pattern Classification, Chapter 1 17

Minimum Training Error

slide-18
SLIDE 18

Final Decision Boundary

Pattern Classification, Chapter 1 18

slide-19
SLIDE 19

Typical Behavior

Slide credit: A. Smola 19

slide-20
SLIDE 20

Typical Behavior

Slide credit: A. Smola 20

slide-21
SLIDE 21

Bayesian Parameter Estimation

  • Gaussian Case
  • General Estimation

21

slide-22
SLIDE 22

Bayesian Estimation

  • In MLE, θ was assumed fixed
  • In BE, θ is a random variable
  • Suppose we have some idea of the range where the parameters θ should be

– Shouldn’t we utilize this prior knowledge in hope that it will lead to better parameter estimation?

Pattern Classification, Chapter 3 22

slide-23
SLIDE 23

Bayesian Estimation

  • Let θ be a random variable with prior

distribution P(θ)

– This is the key difference between ML and Bayesian parameter estimation
– This allows us to use a prior to express the uncertainty present before seeing the data
– Frequentist approach does not account for uncertainty in θ (see bootstrap for more on this, however)

Pattern Classification, Chapter 2 23

slide-24
SLIDE 24

Motivation

  • As in MLE, suppose p(x|θ) is completely

specified if θ is given

  • But now θ is a random variable with prior

p(θ)

– Unlike MLE case, p(x|θ) is a conditional density

  • After we observe the data D, using Bayes

rule we can compute the posterior p(θ|D)

Pattern Classification, Chapter 2 24

slide-25
SLIDE 25

Motivation

  • Recall that for the MAP classifier we find the

class ωi that maximizes the posterior p(ω|D)

  • By analogy, a reasonable estimate of θ is the one that maximizes the posterior p(θ|D)
  • But θ is not our final goal, our final goal is the

unknown p(x)

  • Therefore a better thing to do is to maximize

p(x|D), this is as close as we can come to the unknown p(x) !

Pattern Classification, Chapter 2 25

slide-26
SLIDE 26

Parameter Distribution

  • Assumptions:

– p(x) is unknown, but has known parametric form
– Parameter vector θ is unknown
– p(x | θ) is completely known
– Prior density p(θ) is known

  • Observation of samples provides posterior

density p(θ|D)

– Hopefully peaked around true value of θ

  • Treat each class separately and drop subscripts

Pattern Classification, Chapter 3 26

slide-27
SLIDE 27
  • Converted problem of learning probability

density function to learning parameter vector

  • Goal: compute p(x|D) as best possible

estimate of p(x)

Pattern Classification, Chapter 3 27

p(x) is completely known given θ, independent of the samples in D:

$$p(x \mid D) = \int p(x, \theta \mid D)\, d\theta = \int p(x \mid \theta, D)\, p(\theta \mid D)\, d\theta = \int p(x \mid \theta)\, p(\theta \mid D)\, d\theta$$

slide-28
SLIDE 28
  • Links class-conditional density p(x|D) to

posterior density p(θ|D)

Pattern Classification, Chapter 3 28

 

$$p(x \mid D) = \int p(x \mid \theta)\, p(\theta \mid D)\, d\theta$$

slide-29
SLIDE 29

Bayesian Parameter Estimation: Gaussian Case

Goal: estimate µ using the a-posteriori density p(µ | D)

– The univariate case: p(µ | D), where µ is the only unknown parameter; µ0 and σ0 are known; µ0 is the best guess for µ, σ0 is the uncertainty of that guess

Pattern Classification, Chapter 3 29

$$p(x \mid \mu) \sim N(\mu, \sigma^2), \qquad p(\mu) \sim N(\mu_0, \sigma_0^2)$$

slide-30
SLIDE 30

Pattern Classification, Chapter 3 30

  • α depends on D, not µ
  • (1) shows how training samples affect our idea about the true value of µ

$$p(\mu \mid D) = \frac{p(D \mid \mu)\, p(\mu)}{\int p(D \mid \mu)\, p(\mu)\, d\mu} = \alpha \prod_{k=1}^{n} p(x_k \mid \mu)\, p(\mu) \qquad (1)$$

slide-31
SLIDE 31

Pattern Classification, Chapter 3 31

Reproducing density (remains Gaussian); (1) and (2) yield:

$$p(\mu \mid D) \sim N(\mu_n, \sigma_n^2) \qquad (2)$$

$$\mu_n = \left(\frac{n\sigma_0^2}{n\sigma_0^2 + \sigma^2}\right)\hat{\mu}_n + \frac{\sigma^2}{n\sigma_0^2 + \sigma^2}\,\mu_0
\qquad \text{and} \qquad
\sigma_n^2 = \frac{\sigma_0^2\,\sigma^2}{n\sigma_0^2 + \sigma^2}$$

where $\hat{\mu}_n$ is the empirical (sample) mean.

slide-32
SLIDE 32
  • µn is a linear combination of the empirical and the prior information

  • Each additional observation decreases

uncertainty about µ

Pattern Classification, Chapter 3 32

$$\mu_n = \left(\frac{n\sigma_0^2}{n\sigma_0^2 + \sigma^2}\right)\hat{\mu}_n + \frac{\sigma^2}{n\sigma_0^2 + \sigma^2}\,\mu_0
\qquad \text{and} \qquad
\sigma_n^2 = \frac{\sigma_0^2\,\sigma^2}{n\sigma_0^2 + \sigma^2}$$
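As a rough numerical companion (not from the slides; the prior, variance, and sample values are made up), here is a minimal Matlab sketch of these update equations.

% Hypothetical Bayesian update of the mean (sigma^2 assumed known)
mu0 = 0; sigma0_sq = 4;                      % prior p(mu) = N(mu0, sigma0^2)
sigma_sq = 1;                                % known data variance
x = 2 + sqrt(sigma_sq)*randn(20,1);          % 20 samples with true mean 2
n = length(x);
mu_hat = mean(x);                            % empirical (sample) mean
mu_n = (n*sigma0_sq/(n*sigma0_sq + sigma_sq))*mu_hat ...
     + (sigma_sq/(n*sigma0_sq + sigma_sq))*mu0;
sigma_n_sq = sigma0_sq*sigma_sq/(n*sigma0_sq + sigma_sq);
fprintf('posterior p(mu|D) = N(%.3f, %.4f)\n', mu_n, sigma_n_sq);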

slide-33
SLIDE 33

Pattern Classification, Chapter 3 33

– The univariate case: p(x | D)

  • p(µ | D) has been computed
  • p(x | D) remains to be computed*

It provides: * the desired class-conditional density p(x | Dj, ωj). Using the Bayes formula, we obtain the Bayesian classification rule:

$$p(x \mid D) = \int p(x \mid \mu)\, p(\mu \mid D)\, d\mu \quad \text{is Gaussian}$$

$$p(x \mid D) \sim N(\mu_n,\ \sigma^2 + \sigma_n^2)$$

$$\max_{\omega_j} p(\omega_j \mid x, D) \;\Leftrightarrow\; \max_{\omega_j} p(x \mid \omega_j, D_j)\, p(\omega_j)$$

slide-34
SLIDE 34
  • We have:

– Replaced mean with conditional mean – Increased variance to account for additional uncertainty in x due to inexact knowledge of mean

Pattern Classification, Chapter 3 34

$$p(x \mid D) \sim N(\mu_n,\ \sigma^2 + \sigma_n^2)$$

slide-35
SLIDE 35

Bayesian Parameter Estimation: General Theory

– p(x | D) computation can be applied to any situation in which the unknown density can be parameterized: the basic assumptions are:

  • The form of p(x | θ) is assumed known, but the value of θ is not known exactly
  • Our knowledge about θ is assumed to be contained in a known prior density p(θ)
  • The rest of our knowledge about θ is contained in a set D of n random variables x1, x2, …, xn that follow p(x)

Pattern Classification, Chapter 3 35

slide-36
SLIDE 36

Pattern Classification, Chapter 3 36

The basic problem is: "Compute the posterior density p(θ | D)", then "Derive p(x | D)". Using the Bayes formula, we have:

$$p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{\int p(D \mid \theta)\, p(\theta)\, d\theta}$$

And by the independence assumption:

$$p(D \mid \theta) = \prod_{k=1}^{n} p(x_k \mid \theta)$$

slide-37
SLIDE 37

Recursive Bayes Learning

  • Assume that training samples become

available one by one

  • Due to independence, result is

independent of order:

Pattern Classification, Chapter 3 37

$$p(\theta \mid D^n) \propto p(x_n \mid \theta)\, p(\theta \mid D^{n-1})$$

$$p(D^n \mid \theta) = \prod_{k=1}^{n} p(x_k \mid \theta)$$
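A minimal Matlab sketch of recursive learning for the Gaussian-mean case of slides 29-31 (our illustration with made-up numbers; the one-sample update follows from the slide 31 formulas with n = 1). Processing the samples one at a time, in any order, gives the same posterior as the batch computation.

% Hypothetical recursive update of p(mu | D^n), sigma^2 known
mu0 = 0; s_sq = 4; sigma_sq = 1;             % prior N(mu0, s_sq), known variance
x = 2 + randn(20,1);                         % samples with true mean 2
mu_n = mu0; s_n_sq = s_sq;                   % start from the prior
for k = 1:length(x)                          % incorporate one sample at a time
    mu_n   = (s_n_sq*x(k) + sigma_sq*mu_n) / (s_n_sq + sigma_sq);
    s_n_sq = s_n_sq*sigma_sq / (s_n_sq + sigma_sq);
end
fprintf('recursive posterior: N(%.3f, %.4f)\n', mu_n, s_n_sq);
% re-running with x(randperm(length(x))) gives the same result (order independence)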

slide-38
SLIDE 38

Estimation of p(x|D)

  • The basic problem is: Compute p(x | D)
  • Compute the posterior density p(θ | D)
  • Then derive p(x | D)
  • Repeat for all classes to obtain p(x | ωi)
  • Combine with p(ωi) to get posteriors

Pattern Classification, Chapter 3 38

$$p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{\int p(D \mid \theta)\, p(\theta)\, d\theta}$$

slide-39
SLIDE 39

Conjugate Priors

  • Prior is conjugate to likelihood if it leads to

itself as posterior

  • Closed form representation of posterior
  • If the prior on θ, with hyperparameters α,

has some p(θ|α), the posterior given data D is of the same form but with updated hyperparameters p(θ|D,α) = p(θ|α’)

39 Barber, Chapter 8
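A standard worked example (ours, not from the slides) is the Beta prior, which is conjugate to the Bernoulli likelihood; it makes the hyperparameter update α → α' concrete:

$$p(\theta \mid \alpha, \beta) = \mathrm{Beta}(\theta \mid \alpha, \beta) \propto \theta^{\alpha-1}(1-\theta)^{\beta-1}, \qquad p(D \mid \theta) = \theta^{k}(1-\theta)^{n-k}$$

$$p(\theta \mid D, \alpha, \beta) \propto \theta^{\alpha+k-1}(1-\theta)^{\beta+n-k-1} = \mathrm{Beta}(\theta \mid \alpha+k,\ \beta+n-k)$$

where k is the number of successes in the n Bernoulli trials of D; the posterior has the same form as the prior, with updated hyperparameters (α', β') = (α + k, β + n − k).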

slide-40
SLIDE 40

Bayesian Inference of Mean and Variance

  • Uni-variate Gaussian
  • Posterior of parameters
  • Prior of mean (Gaussian)

40 Barber, Chapter 8

slide-41
SLIDE 41

Bayesian Inference of Mean and Variance

  • Posterior

after some manipulation …

41 Barber, Chapter 8

slide-42
SLIDE 42

Bayesian Inference of Mean and Variance

  • Use inverse Gamma distribution for p(σ2)
  • Then, posterior is also Gauss-Inverse-

Gamma

42 Barber, Chapter 8

slide-43
SLIDE 43

ML vs. Bayesian Parameter Estimation: Summary

43

slide-44
SLIDE 44

BE vs. MLE

  • BE: p(x|D) can be thought of as the weighted average of

the proposed model for all possible values of θ

  • Contrast this with the MLE solution, which always gives us a single model: p(x | θ̂)
  • When we have many possible solutions, taking their sum averaged by their probabilities seems better than picking just one solution

44

slide-45
SLIDE 45

Bayesian Estimation vs. MLE

  • In practice, it may be hard to do integration

analytically and we may have to resort to numerical methods

  • The MLE solution requires differentiation, instead of integration, to get θ̂

– Differentiation is easy and can always be done analytically

45

slide-46
SLIDE 46

When do Maximum-Likelihood and Bayes Methods Differ?

  • Equivalent asymptotically (for infinite

training data)

– For reasonable prior distributions – When prior p(θ) is uninformative and p(θ|D) is peaked

  • MLE computationally cheaper, simpler

solutions

  • BE uses more information (more general

model)

46

slide-47
SLIDE 47

Naïve Bayes Classifier (not BE)

  • Simple classifier that applies Bayes' rule with

strong (naive) independence assumptions

  • A.k.a. the "independent feature model”
  • p(ωi | x1, x2, …) = α p(x1 | ωi) p(x2 | ωi) … p(ωi)
  • Often performs reasonably well despite

simplicity

47
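As a rough sketch (not from the slides), the single-feature MLE classifier of slides 6-8 extends to a Gaussian naive Bayes classifier by multiplying per-feature class-conditional likelihoods; it assumes train_data and test_data from slide 6, and the per-feature Gaussian model is our choice.

% Hypothetical Gaussian naive Bayes on the Pima data
feats = 1:8;
mu = zeros(2, numel(feats)); v = zeros(2, numel(feats)); prior = zeros(2,1);
for c = 0:1
    Xc = train_data(train_data(:,9)==c, feats);
    mu(c+1,:)  = mean(Xc);
    v(c+1,:)   = var(Xc) + 1e-6;             % small constant avoids zero variance
    prior(c+1) = size(Xc,1) / size(train_data,1);
end
correct = 0;
for i = 1:size(test_data,1)
    x = test_data(i, feats);
    % log posterior (up to a constant): log prior + sum of per-feature log likelihoods
    logpost = log(prior') - 0.5*sum(((repmat(x,2,1)-mu).^2)./v + log(2*pi*v), 2)';
    [~, c] = max(logpost);
    correct = correct + (test_data(i,9) == c-1);
end
fprintf('naive Bayes accuracy: %.3f\n', correct/size(test_data,1));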

slide-48
SLIDE 48

Naïve Bayes Classifier

  • NB is known to produce posteriors closer

to extremes (0 or 1) than true posteriors

– Why?

  • NB performs well when only

small amounts of training data are available

– Why?

48

slide-49
SLIDE 49

Non-parametric Classification

49

slide-50
SLIDE 50

The Histogram

  • The simplest form of non-parametric density estimation is the

histogram

– Divide the sample space into a number of bins
– Approximate the density at the center of each bin by the fraction of points that fall into the bin
– Two parameters: bin width and starting position of the first bin (or other equivalent pairs)

  • Drawbacks:

– Depends on the position of the bin centers

  • Often compute two histograms, offset by ½ bin width

– Discontinuities as an artifact of the bin boundaries
– Curse of dimensionality

50
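A minimal Matlab sketch of a 1-D histogram density estimate (illustrative data and bin width, not from the slides): each bin height is the fraction of points in the bin divided by the bin width, so the estimate integrates to 1.

% Hypothetical 1-D histogram density estimate
x = [randn(500,1); 3 + 0.5*randn(300,1)];    % bimodal sample
h = 0.25;                                    % bin width
edges = min(x):h:(max(x)+h);                 % starting position = min(x)
counts = histc(x, edges);                    % number of points per bin
p_hat = counts(:) / (numel(x)*h);            % fraction per bin / bin width
stairs(edges(:), p_hat);                     % piecewise-constant density estimate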

slide-51
SLIDE 51

Introduction

  • All parametric densities are unimodal (have a single local

maximum), whereas many practical problems involve multi-modal densities

  • Non-parametric procedures can be used with arbitrary

distributions and without the assumption that the forms of the underlying densities are known

  • There are two types of non-parametric methods:

– Estimate p(x | ωj)
– Bypass the density function and go directly to posterior probability estimation

Pattern Classification, Chapter 4 51

slide-52
SLIDE 52

Density Estimation

– The probability that a vector x will fall in a region R is:

$$P = \int_{R} p(x')\, dx' \qquad (1)$$

– P is a smoothed (or averaged) version of the density function p(x). If we have a sample of size n, the probability that k of the n points fall in R is:

$$P_k = \binom{n}{k} P^{k} (1-P)^{n-k} \qquad (2)$$

and the expected value for k is:

$$E(k) = nP \qquad (3)$$

Pattern Classification, Chapter 4 52

slide-53
SLIDE 53

ML Estimate

The ML estimate of P is the value θ̂ that maximizes P(k | θ):

$$\hat{\theta} = \frac{k}{n}$$

Therefore, the ratio k/n is a good estimate for the probability P and hence for the density function p(x) (for large n).

Pattern Classification, Chapter 4 53

slide-54
SLIDE 54

Assumptions

If p(x) is continuous and the region R is so small that p does not vary significantly within it, we can write:

$$\int_{R} p(x')\, dx' \approx p(x)\, V \qquad (4)$$

where x is a point within R and V is the volume enclosed by R.

Combining equations (1), (3) and (4) yields:

$$p(x) \approx \frac{k/n}{V}$$

Pattern Classification, Chapter 4 54

slide-55
SLIDE 55
  • The volume V needs to approach 0, if we want to use this

estimate

  • Practically, V cannot be allowed to become small since the number of

samples is always limited

  • One will have to accept a certain amount of variance in the ratio k/n
  • Theoretically, if an unlimited number of samples is available, we can circumvent this difficulty

To estimate the density at x, we form a sequence of regions R1, R2, … containing x: the first region contains one sample, the second two samples, and so on. Let Vn be the volume of Rn, kn the number of samples falling in Rn, and pn(x) the nth estimate for p(x):

$$p_n(x) = \frac{k_n / n}{V_n} \qquad (7)$$

Pattern Classification, Chapter 4 55

slide-56
SLIDE 56

Three necessary conditions should apply if we want pn(x) to converge to p(x):

$$\lim_{n \to \infty} V_n = 0, \qquad \lim_{n \to \infty} k_n = \infty, \qquad \lim_{n \to \infty} k_n/n = 0$$

There are two different ways of obtaining sequences of regions that satisfy these conditions:

(a) Shrink an initial region, with Vn = V1/√n, and show that pn(x) converges to p(x). This is called "the Parzen-window estimation method".

(b) Specify kn as some function of n, such as kn = √n; the volume Vn is grown until it encloses kn neighbors of x. This is called "the kn-nearest-neighbor estimation method".

Pattern Classification, Chapter 4 56

slide-57
SLIDE 57

Pattern Classification, Chapter 4 57

slide-58
SLIDE 58

Parzen Windows

– The Parzen-window approach to estimating densities assumes that the region Rn is a d-dimensional hypercube
– φ((x − xi)/hn) is equal to unity if xi falls within the hypercube of volume Vn centered at x, and equal to zero otherwise

Pattern Classification, Chapter 4 58

         

$$V_n = h_n^d \qquad (h_n:\ \text{length of the edge of } R_n)$$

Let φ(u) be the following window function:

$$\varphi(u) = \begin{cases} 1 & |u_j| \le \tfrac{1}{2}, \quad j = 1, \dots, d \\ 0 & \text{otherwise} \end{cases}$$

slide-59
SLIDE 59

– The number of samples in this hypercube is:

$$k_n = \sum_{i=1}^{n} \varphi\!\left(\frac{x - x_i}{h_n}\right)$$

By substituting kn in equation (7), we obtain the following estimate:

$$p_n(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{V_n}\, \varphi\!\left(\frac{x - x_i}{h_n}\right)$$

pn(x) estimates p(x) as an average of functions of x and the samples xi (i = 1, …, n). These functions φ can be general.

Pattern Classification, Chapter 4 59

slide-60
SLIDE 60

Window Functions

  • Conditions for estimating legitimate

density function

– Non-negative – Integrate to 1

  • In other words, the window function should

be a probability density function

Pattern Classification, Chapter 4 60

$$\varphi(x) \ge 0 \qquad \text{and} \qquad \int \varphi(x)\, dx = 1$$

slide-61
SLIDE 61
  • The behavior of the Parzen-window method

– Case where p(x) ~ N(0,1)
– Let

$$\varphi(u) = \frac{1}{\sqrt{2\pi}}\, e^{-u^2/2} \qquad \text{and} \qquad h_n = \frac{h_1}{\sqrt{n}} \quad (h_1:\ \text{a known parameter})$$

Thus

$$p_n(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h_n}\, \varphi\!\left(\frac{x - x_i}{h_n}\right)$$

is an average of normal densities centered at the samples xi.

Pattern Classification, Chapter 4 61
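A minimal Matlab sketch of this Gaussian-window Parzen estimate (our illustration; the data and h1 are made up):

% Hypothetical Parzen-window estimate with a Gaussian window
xi = randn(100,1);                           % n = 100 samples from N(0,1)
n  = numel(xi);
h1 = 1;  hn = h1/sqrt(n);                    % window width shrinks with n
x  = linspace(-4, 4, 200)';                  % evaluation points
pn = zeros(size(x));
for i = 1:n
    u  = (x - xi(i)) / hn;
    pn = pn + exp(-u.^2/2)/sqrt(2*pi);       % phi((x - x_i)/h_n)
end
pn = pn / (n*hn);                            % average of the scaled windows
plot(x, pn);                                 % compare with the true N(0,1) density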

slide-62
SLIDE 62

Numerical Results

For n = 1 and h1=1

Pattern Classification, Chapter 4 62

$$p_1(x) = \varphi(x - x_1) = \frac{1}{\sqrt{2\pi}}\, e^{-(x - x_1)^2/2} \sim N(x_1, 1)$$

slide-63
SLIDE 63

Pattern Classification, Chapter 4 63

For n = 10 and h = 0.1, the contributions of the individual samples are clearly observable

slide-64
SLIDE 64

Pattern Classification, Chapter 4 64

slide-65
SLIDE 65

Analogous results are also obtained in two dimensions as illustrated:

Pattern Classification, Chapter 4 65

slide-66
SLIDE 66

Pattern Classification, Chapter 4 66

slide-67
SLIDE 67

– Case where p(x) = λ1 U(a,b) + λ2 T(c,d)

  • unknown density, mixture of a uniform and a triangle density

Pattern Classification, Chapter 4 67

slide-68
SLIDE 68

Pattern Classification, Chapter 4 68

slide-69
SLIDE 69

Classification

  • In classifiers based on Parzen-window

estimation:

– We estimate the densities for each category and classify a test point by the label corresponding to the maximum posterior
– The decision region for a Parzen-window classifier depends upon the choice of window function, as illustrated in the following figure

Pattern Classification, Chapter 4 69

slide-70
SLIDE 70

Pattern Classification, Chapter 4 70

Remember discussion on overfitting

slide-71
SLIDE 71

K - Nearest Neighbor Estimation

  • Goal: a solution for the problem of the unknown “best” window

function

– Let the cell volume be a function of the training data
– Center a cell about x and let it grow until it captures kn samples (kn = f(n))
– These kn samples are called the kn nearest neighbors of x

  • Benefits

– If the density is high near x, the cell will be small, which provides good resolution
– If the density is low, the cell will grow large and stop when higher-density regions are reached
– We can obtain a family of estimates by setting kn = k1√n and choosing different values for k1

Pattern Classification, Chapter 4 71

slide-72
SLIDE 72

Illustration

For n = 1 and kn = 1, the estimate becomes:

$$p_n(x) = \frac{k_n}{n V_n} = \frac{1}{V_1} = \frac{1}{2\,|x - x_1|}$$

(it goes to infinity at x = x1)

Pattern Classification, Chapter 4 72
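A minimal Matlab sketch of the kn-nearest-neighbor estimate in one dimension (our illustration; kn = round(sqrt(n)) follows the kn = √n choice above):

% Hypothetical 1-D k_n-nearest-neighbor density estimate
xi = randn(400,1);                           % samples from N(0,1)
n  = numel(xi);
kn = round(sqrt(n));                         % k_n = sqrt(n)
x  = linspace(-4, 4, 200)';
pn = zeros(size(x));
for j = 1:numel(x)
    d = sort(abs(xi - x(j)));                % distances to all samples, ascending
    Vn = 2*d(kn);                            % interval length that captures k_n neighbors
    pn(j) = (kn/n) / Vn;                     % p_n(x) = (k_n/n)/V_n
end
plot(x, pn);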

slide-73
SLIDE 73

Pattern Classification, Chapter 4 73

slide-74
SLIDE 74

Pattern Classification, Chapter 4 74

slide-75
SLIDE 75

Estimation of Posterior Probabilities

  • Goal: estimate P(ωi | x) from a set of n labeled samples
  • Place a cell of volume V around x and capture k samples
  • If ki samples amongst the k turn out to be labeled ωi, then:

pn(x, ωi) = ki / (nV). An estimate for pn(ωi | x) is:

Pattern Classification, Chapter 4 75

$$p_n(\omega_i \mid x) = \frac{p_n(x, \omega_i)}{\sum_{j=1}^{c} p_n(x, \omega_j)} = \frac{k_i}{k}$$

slide-76
SLIDE 76

– ki/k is the fraction of the samples within the cell that are labeled ωi
– For minimum error rate, the most frequently represented category within the cell is selected => this is equivalent to posterior estimation
– If k is large and the cell sufficiently small, the performance will approach the best possible

Pattern Classification, Chapter 4 76

slide-77
SLIDE 77

The Nearest–Neighbor Rule

  • Let Dn = {x1, x2, …, xn} be a set of n labeled prototypes
  • Let x’  Dn be the closest prototype to a test point x then the

nearest-neighbor rule for classifying x is to assign it the label associated with x’

  • The nearest-neighbor rule leads to an error rate greater than the

minimum possible: the Bayes rate

  • If the number of prototypes is large (unlimited), the error rate of the

nearest-neighbor classifier is never worse than twice the Bayes rate (it can be proven!)

  • If n → ∞, it is always possible to find x' sufficiently close so that:

P(ωi | x') ≈ P(ωi | x)

Pattern Classification, Chapter 4 77

slide-78
SLIDE 78

78 Pattern Classification, Chapter 4

slide-79
SLIDE 79

The k–Nearest-Neighbor Rule

  • Goal: Classify x by assigning it the label

most frequently represented among the k nearest samples

  • Use a voting scheme

Pattern Classification, Chapter 4 79

slide-80
SLIDE 80

Pattern Classification, Chapter 4 80

slide-81
SLIDE 81

Matlab Example

data = dlmread('pima-indians-diabetes.data');
data = reshape(data,[],9);
% use randperm to re-order data; ignore if not using Matlab
rp = randperm(length(data));
data = data(rp,:);
%split = length(data)/2;
split = 300;
train_data = data(1:split,:);
test_data = data(split+1:end,:);

81

slide-82
SLIDE 82

% pick features
active_feat = [1:3];
% training
% NOT NEEDED
% testing
correct = 0;
wrong = 0;

82

slide-83
SLIDE 83

for i = 1:length(test_data)
    sample = test_data(i,active_feat);
    dist = train_data(:,active_feat) - repmat(sample,length(train_data),1);
    dist = dist*dist';
    % we are only interested in the diagonal elements
    % DON'T USE QUADRATIC DISTANCE COMPUTATION IN PRACTICE
    fin_dist = diag(dist);
    [min_d index] = min(fin_dist);
    if(test_data(i,9) == train_data(index,9))
        correct = correct+1;
    else
        wrong = wrong+1;
    end
end

83
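To turn the 1-NN loop above into the k-nearest-neighbor rule of slide 79, one option (our sketch, not from the slides) is to sort the distances and take a majority vote among the k closest training labels; this also avoids the quadratic distance matrix warned about above.

% Hypothetical k-NN version of the test loop (assumes train_data, test_data,
% active_feat from slides 81-82)
k = 5;
correct = 0; wrong = 0;
for i = 1:length(test_data)
    sample = test_data(i,active_feat);
    d = sum((train_data(:,active_feat) - repmat(sample,length(train_data),1)).^2, 2);
    [~, order] = sort(d);                    % ascending squared distances
    votes = train_data(order(1:k), 9);       % labels of the k nearest neighbors
    if(test_data(i,9) == mode(votes))        % majority vote
        correct = correct+1;
    else
        wrong = wrong+1;
    end
end
fprintf('k-NN accuracy: %.3f\n', correct/(correct+wrong));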