CS 559: Machine Learning Fundamentals and Applications 4th Set of Notes
Instructor: Philippos Mordohai Webpage: www.cs.stevens.edu/~mordohai E-mail: Philippos.Mordohai@stevens.edu Office: Lieb 215
Overview
– Parameter Estimation
– Pima Indians Diabetes dataset (UCI Machine Learning Repository): http://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes
% pick a feature
active_feat = 3;

% training: per-class mean, variance and prior (column 9 is the class label)
mean1 = mean(train_data(train_data(:,9)==0, active_feat))
mean2 = mean(train_data(train_data(:,9)==1, active_feat))
var1  = var(train_data(train_data(:,9)==0, active_feat))
var2  = var(train_data(train_data(:,9)==1, active_feat))

prior1tmp = sum(train_data(:,9)==0);
prior2tmp = sum(train_data(:,9)==1);
prior1 = prior1tmp/(prior1tmp+prior2tmp)
prior2 = prior2tmp/(prior1tmp+prior2tmp)
% testing
correct = 0; wrong = 0;
for i = 1:length(test_data)
  % class-conditional likelihoods under the univariate Gaussian model
  lklhood1 = exp(-(test_data(i,active_feat)-mean1)^2/(2*var1)) / sqrt(var1);
  lklhood2 = exp(-(test_data(i,active_feat)-mean2)^2/(2*var2)) / sqrt(var2);
  % unnormalized posteriors
  post1 = lklhood1*prior1;
  post2 = lklhood2*prior2;
  if(post1 > post2 && test_data(i,9) == 0)
    correct = correct+1;
  elseif(post1 < post2 && test_data(i,9) == 1)
    correct = correct+1;
  else
    wrong = wrong+1;
  end
end
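The overall accuracy on the test set then follows directly from the two counters:

% fraction of test samples classified correctly
accuracy = correct/(correct+wrong)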
[Figures from Pattern Classification, Chapter 1; some slides credited to A. Smola]
[Figures and equations from Pattern Classification, Chapters 2 and 3]
Bayesian parameter estimation (Pattern Classification, Chapter 3):
$$p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{\int p(D \mid \theta)\, p(\theta)\, d\theta} \qquad (1)$$
$$p(D \mid \theta) = \prod_{k=1}^{n} p(x_k \mid \theta)$$
Univariate Gaussian with unknown mean $\mu$, known variance $\sigma^2$, and prior $p(\mu) \sim N(\mu_0, \sigma_0^2)$: the posterior $p(\mu \mid D) \sim N(\mu_n, \sigma_n^2)$ with
$$\mu_n = \frac{n\sigma_0^2}{n\sigma_0^2 + \sigma^2}\,\hat{\mu}_n + \frac{\sigma^2}{n\sigma_0^2 + \sigma^2}\,\mu_0, \qquad \sigma_n^2 = \frac{\sigma_0^2\,\sigma^2}{n\sigma_0^2 + \sigma^2}$$
where $\hat{\mu}_n = \frac{1}{n}\sum_{k=1}^{n} x_k$ is the empirical (sample) mean.
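A minimal MATLAB sketch of these update equations; the toy data, prior parameters, and variable names are illustrative assumptions:

% Bayesian estimate of the mean of a univariate Gaussian with known variance
data   = randn(50,1) + 2;   % 50 toy samples from N(2,1)
sigma2 = 1;                 % known data variance
mu0    = 0; sigma0_2 = 10;  % prior p(mu) ~ N(mu0, sigma0_2)

n      = length(data);
mu_hat = mean(data);        % empirical (sample) mean

% posterior parameters of p(mu | D) ~ N(mu_n, sigma_n2)
mu_n     = (n*sigma0_2/(n*sigma0_2 + sigma2))*mu_hat + (sigma2/(n*sigma0_2 + sigma2))*mu0;
sigma_n2 = (sigma0_2*sigma2)/(n*sigma0_2 + sigma2);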
The predictive density is again Gaussian:
$$p(x \mid D) \sim N(\mu_n,\ \sigma^2 + \sigma_n^2)$$
In classification, this is computed separately for each class $\omega_j$ from its own training set $D_j$, giving $p(x \mid \omega_j, D_j)$.
Bayesian parameter estimation, general theory (Pattern Classification, Chapter 3): the desired density is obtained by integrating over the parameter,
$$p(x \mid D) = \int p(x \mid \theta)\, p(\theta \mid D)\, d\theta, \qquad p(D \mid \theta) = \prod_{k=1}^{n} p(x_k \mid \theta)$$
Recursive Bayes learning processes one sample at a time:
$$p(\theta \mid D^n) = \frac{p(x_n \mid \theta)\, p(\theta \mid D^{n-1})}{\int p(x_n \mid \theta)\, p(\theta \mid D^{n-1})\, d\theta}$$
[Figures from Barber, Chapter 8]
Histograms
– Divide the sample space into a number of bins
– Approximate the density at the center of each bin by the fraction of points that fall into the bin
– Two parameters: bin width and starting position of the first bin (or other equivalent pairs)
Drawbacks:
– The estimate depends on the position of the bin centers: two histograms of the same data, offset by ½ bin width, can look quite different
– Discontinuities at bin boundaries are an artifact of the binning, not of the underlying density
– Curse of dimensionality: the number of bins grows exponentially with the dimension
A minimal sketch of a histogram estimate follows this list.
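A minimal MATLAB sketch of a histogram density estimate; the toy data and bin settings are illustrative assumptions:

% toy 1-D data
x = randn(1000,1);

% histogram density estimate: counts / (n * bin width)
bin_width = 0.25;
edges     = min(x):bin_width:max(x)+bin_width;   % starting position = min(x)
counts    = histc(x, edges);
p_hat     = counts / (length(x) * bin_width);    % approximate density in each bin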
– Estimate p(x | ω_j) – Bypass the density function and go directly to posterior probability estimation
Basic density estimation (Pattern Classification, Chapter 4): the probability that exactly $k$ of $n$ samples fall in a region $R$ of probability mass $P$ is binomial,
$$P_k = \binom{n}{k} P^k (1-P)^{\,n-k} \qquad (2)$$
so the fraction of samples falling in $R$ is a good estimate of $P$:
$$\hat{P} \cong \frac{k}{n}$$
If $p(x)$ is approximately constant over a small region of volume $V$, then $P \cong p(x)\,V$ and
$$p(x) \cong \frac{k/n}{V}$$
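As a quick numeric illustration (the numbers are chosen here for concreteness): with n = 100 samples of which k = 10 fall in a region of volume V = 0.5, the estimate is p(x) ≅ (10/100)/0.5 = 0.2.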
In practice, the number of samples is always limited. To circumvent this difficulty, we form a sequence of regions R1, R2, … containing x: the first region is used with one sample, the second with two samples, and so on. Let Vn be the volume of Rn, kn the number of samples falling in Rn, and pn(x) the nth estimate for p(x):
$$p_n(x) = \frac{k_n / n}{V_n}$$
Three necessary conditions (listed below) should hold if we want pn(x) to converge to p(x). There are two different ways of obtaining sequences of regions that satisfy these conditions: (a) Shrink an initial region by specifying the volume Vn as a function of n, such as Vn = 1/√n, and show that pn(x) converges to p(x). This is called "the Parzen-window estimation method". (b) Specify kn as some function of n, such as kn = √n; the volume Vn is grown until it encloses kn neighbors of x. This is called "the kn-nearest-neighbor estimation method".
$$1)\ \lim_{n\to\infty} V_n = 0 \qquad 2)\ \lim_{n\to\infty} k_n = \infty \qquad 3)\ \lim_{n\to\infty} k_n/n = 0$$
Parzen windows: assume the region Rn is a d-dimensional hypercube with edge length $h_n$, so $V_n = h_n^d$; the window function $\varphi(u)$ equals 1 if $|u_j| \le 1/2$ for all $j = 1,\dots,d$, and 0 otherwise.
pn(x) estimates p(x) as an average of functions of x and the samples {xi}, i = 1,…,n. These window functions can be quite general (they need not be hypercubes).
$$p_n(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{V_n}\, \varphi\!\left(\frac{x - x_i}{h_n}\right)$$
Illustration: the behavior of the Parzen-window method when p(x) ~ N(0,1), using a Gaussian window
$$\varphi(u) = \frac{1}{\sqrt{2\pi}}\, e^{-u^2/2}, \qquad h_n = \frac{h_1}{\sqrt{n}} \quad (h_1:\ \text{known parameter})$$
so that
$$p_n(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h_n}\, \varphi\!\left(\frac{x - x_i}{h_n}\right)$$
For n = 1, the estimate is a single Gaussian centered at the first sample: $p_1(x) = \frac{1}{h_1}\varphi\!\left(\frac{x - x_1}{h_1}\right) \sim N(x_1, h_1^2)$.
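A minimal MATLAB sketch of the Gaussian Parzen-window estimate above; the toy data, evaluation grid, and choice of h1 are illustrative assumptions:

% toy 1-D data from N(0,1)
x  = randn(100,1);
n  = length(x);
h1 = 1;                      % illustrative initial window width
hn = h1/sqrt(n);             % shrinking window width

% evaluate the Parzen estimate on a grid
xs  = linspace(-4, 4, 200);
p_n = zeros(size(xs));
for j = 1:length(xs)
    u      = (xs(j) - x) / hn;                     % scaled distances to all samples
    p_n(j) = mean(exp(-u.^2/2) / (sqrt(2*pi)*hn)); % average of Gaussian windows
end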
For n = 10 and h = 0.1, the contributions of the individual samples are clearly observable
Analogous results are also obtained in two dimensions.
– Case where p(x) = λ1·U(a,b) + λ2·T(c,d): a mixture of a uniform and a triangle density
Remember the discussion on overfitting.
kn-Nearest-Neighbor Estimation: a solution to the problem of choosing the "best" window function
– Let the cell volume be a function of the training data
– Center a cell about x and let it grow until it captures kn samples, where kn = f(n)
– These kn samples are the kn nearest neighbors of x
– If the density is high near x, the cell will be small, which provides good resolution
– If the density is low, the cell will grow large, stopping when it reaches regions of higher density
– We can obtain a family of estimates by setting kn = k1·√n and choosing different values for k1 (see the sketch below)
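A minimal MATLAB sketch of a kn-nearest-neighbor density estimate in one dimension; the toy data, evaluation grid, and value of k1 are illustrative assumptions:

% toy 1-D data
x  = randn(200,1);
n  = length(x);
k1 = 1;
kn = max(1, round(k1*sqrt(n)));      % number of neighbors grows as k1*sqrt(n)

% evaluate the k-NN density estimate on a grid
xs  = linspace(-4, 4, 200);
p_n = zeros(size(xs));
for j = 1:length(xs)
    d      = sort(abs(x - xs(j)));   % distances to all samples, ascending
    Vn     = 2*d(kn);                % length of the interval holding the kn nearest samples
    p_n(j) = (kn/n) / Vn;            % density estimate (kn/n) / Vn
end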
Estimation of posterior probabilities with nearest neighbors: place a cell around x that captures k samples, ki of which belong to class ωi; then
$$P_n(\omega_i \mid x) = \frac{p_n(x, \omega_i)}{\sum_{j=1}^{c} p_n(x, \omega_j)} = \frac{k_i}{k}$$
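A minimal MATLAB sketch of this posterior estimate, reusing the train_data, test_data, and active_feat variables from the code elsewhere in these notes; the query point and choice of k are illustrative assumptions:

% estimate P(omega_i | x) as k_i / k from the k nearest training samples
k = 15;                                          % number of neighbors
x_query = test_data(1, active_feat);             % example query: first test sample
d = abs(train_data(:,active_feat) - x_query);    % 1-D distances to all training samples
[~, idx] = sort(d);                              % neighbor indices, nearest first
neighbor_labels = train_data(idx(1:k), 9);       % class labels (column 9) of the k nearest
post0 = sum(neighbor_labels == 0) / k;           % P(class 0 | x) ~ k_0 / k
post1 = sum(neighbor_labels == 1) / k;           % P(class 1 | x) ~ k_1 / k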
– The nearest-neighbor rule for classifying x is to assign it the label associated with its nearest prototype x'
– The error rate of any classifier, including the nearest-neighbor rule, cannot be lower than the minimum possible: the Bayes rate
– In the limit of unlimited training data, the error rate of the nearest-neighbor classifier is never worse than twice the Bayes rate (it can be proven!)
– Intuition: with many prototypes, x' is close to x, so P(ω_i | x') ≈ P(ω_i | x)
% load the Pima Indians Diabetes data: 8 features + class label in column 9
data = dlmread('pima-indians-diabetes.data');
data = reshape(data,[],9);

% use randperm to re-order data. ignore if not using Matlab
rp = randperm(length(data));
data = data(rp,:);

% split into training and test sets
%split = length(data)/2;
split = 300;
train_data = data(1:split,:);
test_data  = data(split+1:end,:);
% nearest-neighbor classification on a single feature
correct = 0; wrong = 0;
for i = 1:length(test_data)
  sample = test_data(i,active_feat);
  dist = train_data(:,active_feat) - repmat(sample,length(train_data),1);
  dist = dist*dist';
  % we are only interested in the diagonal elements
  % DON'T USE QUADRATIC DISTANCE COMPUTATION IN PRACTICE
  fin_dist = diag(dist);
  [min_d index] = min(fin_dist);
  if(test_data(i,9) == train_data(index,9))
    correct = correct+1;
  else
    wrong = wrong+1;
  end
end
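As the comment warns, forming the full N×N product just to read its diagonal is wasteful; the element-wise version below gives the same squared distances with linear memory:

% same values as diag(dist*dist'), computed directly
fin_dist = (train_data(:,active_feat) - sample).^2;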