Introduction to Machine Learning
2. Basic Tools
Alex Smola & Geoff Gordon Carnegie Mellon University
http://alex.smola.org/teaching/cmu2013-10-701x 10-701
This is not a toy dataset
http://wadejohnston1962.files.wordpress.com/2012/09/datainoneminute.jpg
Linear regression: fit
f(x) = ax + b
by minimizing
\min_{a,b} \sum_{i=1}^{m} \frac{1}{2} (a x_i + b - y_i)^2
Setting the partial derivatives to zero yields
\partial_a [\ldots] = 0 = \sum_{i=1}^{m} x_i (a x_i + b - y_i)
\partial_b [\ldots] = 0 = \sum_{i=1}^{m} (a x_i + b - y_i)
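Solving the two equations jointly gives the familiar closed form (a standard derivation, with \bar{x} = \frac{1}{m}\sum_i x_i and \bar{y} = \frac{1}{m}\sum_i y_i denoting sample means):
b = \bar{y} - a \bar{x}, \qquad a = \frac{\sum_{i=1}^{m} x_i y_i - m \bar{x} \bar{y}}{\sum_{i=1}^{m} x_i^2 - m \bar{x}^2}
that is, a is the sample covariance of x and y divided by the sample variance of x.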
In vector notation, with \bar{x} := (x, 1):
f(x) = \langle a, x \rangle + b = \langle w, \bar{x} \rangle
\min_{w} \sum_{i=1}^{m} \frac{1}{2} \left( \langle w, \bar{x}_i \rangle - y_i \right)^2
Setting the gradient to zero:
0 = \sum_{i=1}^{m} \bar{x}_i \left( \langle w, \bar{x}_i \rangle - y_i \right)
\Longleftrightarrow \left[ \sum_{i=1}^{m} \bar{x}_i \bar{x}_i^\top \right] w = \sum_{i=1}^{m} y_i \bar{x}_i
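In Octave, these normal equations can be solved directly. A minimal sketch, assuming X is an m x d matrix with one data point per row and y an m x 1 target vector (names chosen here for illustration):
Xbar = [X, ones(size(X, 1), 1)];   % augment each row: xbar_i = (x_i, 1)
w = (Xbar' * Xbar) \ (Xbar' * y);  % solve [sum_i xbar_i xbar_i'] w = sum_i y_i xbar_i
f = [xnew, 1] * w;                 % prediction at a new point xnew (1 x d row vector)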
Richer function classes via feature maps:
f(x) = \langle w, (1, x) \rangle, \quad f(x) = \langle w, (1, x, x^2) \rangle, \quad f(x) = \langle w, (1, x, x^2, x^3) \rangle, \quad f(x) = \langle w, \phi(x) \rangle
The same normal equations hold with an arbitrary feature map \phi:
0 = \sum_{i=1}^{m} \phi(x_i) \left( \langle w, \phi(x_i) \rangle - y_i \right)
\Longleftrightarrow \left[ \sum_{i=1}^{m} \phi(x_i) \phi(x_i)^\top \right] w = \sum_{i=1}^{m} y_i \phi(x_i)
f(x) = \langle w, \phi(x) \rangle, \qquad \min_{w} \sum_{i=1}^{m} \frac{1}{2} \left( \langle w, \phi(x_i) \rangle - y_i \right)^2
Training:
phi_xx = [xx.^4, xx.^3, xx.^2, xx, 1.0 + 0.0 * xx];  % quartic polynomial features, one row per point
w = (yy' * phi_xx) / (phi_xx' * phi_xx);             % solve the normal equations (w is a row vector)
Testing:
phi_x = [x.^4, x.^3, x.^2, x, 1.0 + 0.0 * x];
y = phi_x * w';
warning: matrix singular to machine precision, rcond = 5.8676e-19
warning: attempting to find minimum norm solution
warning: matrix singular to machine precision, rcond = 5.86761e-19
warning: attempting to find minimum norm solution
warning: dgelsd: rank deficient 8x8 matrix, rank = 7
warning: matrix singular to machine precision, rcond = 1.10156e-21
warning: attempting to find minimum norm solution
warning: matrix singular to machine precision, rcond = 1.10145e-21
warning: attempting to find minimum norm solution
warning: dgelsd: rank deficient 9x9 matrix, rank = 6
warning: matrix singular to machine precision, rcond = 2.16217e-26
warning: attempting to find minimum norm solution
warning: matrix singular to machine precision, rcond = 1.66008e-26
warning: attempting to find minimum norm solution
warning: dgelsd: rank deficient 10x10 matrix, rank = 5
Underfitting: the model is too simple to explain the data.
Overfitting: the model is too complicated to learn from the data.
Numerical breakdown: the matrix inverse fails (see the warnings above).
We need to quantify model complexity against the amount of data.
Parzen Windows
p(y|x) = \frac{p(x, y)}{p(x)} = \frac{p(x|y) \, p(y)}{\sum_{y'} p(x|y') \, p(y')}
Counts (n = 25):
         English  Chinese  German  French  Spanish
male        5        2        3       1       0
female      6        3        2       2       1
Joint probabilities (counts divided by 25):
         English  Chinese  German  French  Spanish
male      0.20     0.08     0.12    0.04    0.00
female    0.24     0.12     0.08    0.08    0.04
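A worked instance of the formula, reading joint probabilities off the table above:
p(\text{male} \mid \text{German}) = \frac{p(\text{German}, \text{male})}{p(\text{German})} = \frac{0.12}{0.12 + 0.08} = 0.6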
In higher dimensions this breaks down: there is not enough data. The number of bins grows exponentially, since we need many bins per dimension, and the probability mass per cell shrinks accordingly (e.g. by a factor of 10^10 for ten bins in ten dimensions).
[Figure: histogram of a sample vs. the underlying density]
Can't we just go and smooth this out?
Use the empirical density (delta distributions):
p_{\mathrm{emp}}(x) = \frac{1}{m} \sum_{i=1}^{m} \delta_{x_i}(x)
Smear it out with a nonnegative smoothing kernel k_x(x') satisfying
\int_{\mathcal{X}} k_x(x') \, dx' = 1 \quad \text{for all } x
This yields the Parzen window estimate
\hat{p}(x) = \frac{1}{m} \sum_{i=1}^{m} k_{x_i}(x)
Common smoothing kernels:
Gauss:        (2\pi)^{-1/2} e^{-x^2/2}
Laplace:      \frac{1}{2} e^{-|x|}
Epanechnikov: \frac{3}{4} \max(0, 1 - x^2)
Uniform:      \frac{1}{2} \chi_{[-1,1]}(x)
dist = norm(X - x * ones(1,m), 'columns');                   % distances from query x to all m samples
p = (1/m) * ((2 * pi)**(-d/2)) * sum(exp(-0.5 * dist.**2));  % Gaussian Parzen estimate at x
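Repeating this computation over a grid of query points traces out the whole density estimate. A sketch under the same conventions (X is d x m with one sample per column; a bandwidth r, which the snippet above fixes at 1, is made explicit here):
r = 1.0;                          % bandwidth; tune by cross-validation (see below)
xs = linspace(40, 110, 200);      % 1-d grid of query points (d = 1)
m = size(X, 2); d = size(X, 1);
ps = zeros(size(xs));
for j = 1:length(xs)
  dist = norm(X - xs(j) * ones(1, m), 'columns');
  ps(j) = (1/m) * (2 * pi * r^2)^(-d/2) * sum(exp(-0.5 * (dist / r).^2));
end
plot(xs, ps);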
[Figure: Parzen window estimates for bandwidths 0.3, 1, 3, and 10, against the sample and the underlying density]
Bandwidth (size) matters; the kernel shape matters mostly in theory.
General form, with bandwidth r:
k_{x_i}(x) = r^{-d} \, h\!\left( \frac{x - x_i}{r} \right)
How to choose r? Maximizing the likelihood of the training set
\Pr\{X\} = \prod_{i=1}^{m} p(x_i)
fails: it puts spikes on all data points, since x_i explains x_i best ...
[Figure: overfitted density estimate; density ≈ 0 away from the data, density ≫ 0 at the samples]
Likelihood on the training set is much higher than typical.
[Figure: oversmoothed density estimate]
Likelihood on the training set is very similar to the typical one. Too simple.
Quantify the fit by the log-likelihood on a held-out set X' = \{x'_1, \ldots, x'_{n'}\}:
L(X'|X) := \frac{1}{n'} \sum_{i=1}^{n'} \log \hat{p}(x'_i)
The training average \frac{1}{n} \sum_{i=1}^{n} \log \hat{p}(x_i) is easy to compute, but it overestimates the quantity we actually care about, the expectation \mathbf{E}_x[\log \hat{p}(x)], which is difficult to compute directly. Holding out X' avoids the bias but is wasteful: the held-out points are not used for estimating the density.
Leave-one-out cross-validation avoids this waste: remove x_i, estimate the density from the remaining points, and evaluate it at x_i:
\log \hat{p}(x_i | X \setminus \{x_i\}) = \log \frac{1}{n-1} \sum_{j \neq i} k(x_j, x_i)
Since \hat{p}(x) = \frac{1}{n} \sum_{i=1}^{n} k(x_i, x), the averaged leave-one-out log-likelihood is
\frac{1}{n} \sum_{i=1}^{n} \log \left[ \frac{n}{n-1} \hat{p}(x_i) - \frac{1}{n-1} k(x_i, x_i) \right]
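A sketch of using this to select the bandwidth r for a 1-d Gaussian Parzen estimate (assumed setup: X is a 1 x n row vector of samples; the candidate grid matches the bandwidths in the figures above):
rs = [0.3, 1, 3, 10];            % candidate bandwidths
n = length(X);
D = X' - X;                      % n x n pairwise differences (1-d data, broadcasting)
Ls = zeros(size(rs));
for t = 1:length(rs)
  r = rs(t);
  K = (2 * pi * r^2)^(-1/2) * exp(-0.5 * (D / r).^2);
  K(1:n+1:end) = 0;              % zero the diagonal: leave x_i out
  Ls(t) = mean(log(sum(K, 1) / (n - 1)));
end
[~, t_best] = max(Ls);
r_best = rs(t_best);             % bandwidth with highest leave-one-out log-likelihood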
More generally, k-fold cross-validation partitions X into folds X_1, \ldots, X_k and averages
\frac{1}{k} \sum_{i=1}^{k} l\big(p(X_i | X \setminus X_i)\big)
(the error estimate is for a training set of size \frac{k-1}{k} n)
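A minimal k-fold split in Octave (illustrative; score is a hypothetical function that fits on the training part and evaluates the loss l on the held-out fold):
k = 10;
n = length(X);
idx = randperm(n);                       % shuffle before splitting into folds
scores = zeros(1, k);
for i = 1:k
  test  = idx(i:k:n);                    % every k-th index forms fold i
  train = setdiff(idx, test);
  scores(i) = score(X(train), X(test));  % hypothetical: fit on train, evaluate on test
end
cv_estimate = mean(scores);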
Geoff Watson
From density estimation to classification
Estimate p(x|y = 1) and p(x|y = -1) by Parzen windows; Bayes rule then gives
p(y|x) = \frac{p(x|y) \, p(y)}{p(x)} = \frac{ \frac{1}{m_y} \sum_{y_i = y} k(x_i, x) \cdot \frac{m_y}{m} }{ \frac{1}{m} \sum_{i} k(x_i, x) }
where the ratios k(x_j, x) / \sum_i k(x_i, x) act as local weights.
The decision rule compares the two posteriors:
p(y = 1|x) - p(y = -1|x) = \frac{\sum_j y_j \, k(x_j, x)}{\sum_i k(x_i, x)} = \sum_j y_j \frac{k(x_j, x)}{\sum_i k(x_i, x)}
Watson-Nadaraya classifier
dist = norm(X - x * ones(1,m), 'columns');  % distances from query x to all m samples
f = sum(y .* exp(-0.5 * dist.**2));         % unnormalized vote; classify by sign(f)
The same combination of labels y_j and local weights also makes sense for real-valued y_j:
\hat{y}(x) = \sum_j y_j \frac{k(x_j, x)}{\sum_i k(x_i, x)}
Watson-Nadaraya regression estimate
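A sketch in the same style as the classifier snippet, now with real-valued labels and the normalization made explicit (same assumed conventions: X is d x m, y is 1 x m, Gaussian kernel with bandwidth r):
dist = norm(X - x * ones(1, m), 'columns');  % distances from query x to all samples
w = exp(-0.5 * (dist / r).^2);               % local weights k(x_j, x)
yhat = sum(y .* w) / sum(w);                 % Watson-Nadaraya estimate at x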
Nearest neighbors: for a previously seen instance, simply remember its label. Works even with small amounts of data. The neighborhood function is a hard threshold.
\hat{y}(x) = \sum_j y_j \frac{k(x_j, x)}{\sum_i k(x_i, x)} = \sum_j y_j w_j(x)
Consistency sketch: use an increasing number of neighbors k, so that the variance of \sum_j y_j w_j(x) can be made small (e.g. via Hoeffding's theorem for the tail; show that it vanishes), up to some approximation error in the neighborhood. Such guarantees, combined with data structures offering logarithmic time lookup, make nearest neighbor methods practical.
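Without such data structures, a brute-force k-nearest-neighbor vote is still only a few lines (a sketch; labels y_j in {-1, +1}, conventions as in the snippets above):
dist = norm(X - x * ones(1, m), 'columns');  % distances from query x to all samples
[~, order] = sort(dist);                     % indices of samples, nearest first
yhat = sign(sum(y(order(1:k))));             % majority vote over the k nearest labels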
Silverman's Rule
Bernard Silverman
Use the average distance from the k nearest neighbors as a per-point bandwidth:
r_i = \frac{r}{k} \sum_{x \in \mathrm{NN}(x_i, k)} \| x_i - x \|
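A sketch of computing these adaptive bandwidths by brute force (conventions as above; r and k are the global scale and neighbor count):
R = zeros(1, m);
for i = 1:m
  dist = norm(X - X(:, i) * ones(1, m), 'columns');
  s = sort(dist);
  R(i) = (r / k) * sum(s(2:k+1));   % skip s(1) = 0, the point itself
end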
[Figure: true density, non-adaptive estimate, adaptive estimate, and the distance distribution]
Kernels and the algorithm
Cross-validation, leave-one-out, bias and variance
Classification, regression, novelty detection
Further reading:
Cover trees for nearest neighbor lookup: http://hunch.net/~jl/projects/cover_tree/cover_tree.html
(Andrew Moore's tutorial from his PhD thesis)
http://dx.doi.org/10.1137/1109020
http://www.jstor.org/stable/25049340
The np package for nonparametric estimation in R: http://cran.r-project.org/web/packages/np/index.html
http://projecteuclid.org/euclid.aos/1176343886
http://www-isl.stanford.edu/people/cover/papers/transIT/0021cove.pdf
Rates of Convergence for Nearest Neighbor Procedures.
http://cseweb.ucsd.edu/~dasgupta/papers/nnactive.pdf
http://cgm.cs.mcgill.ca/~godfried/teaching/pr-notes/dasarathy.pdf
http://valis.cs.uiuc.edu/~sariel/papers/04/survey/survey.pdf