Lecture 3:
− Kernel Regression
− Distance Metrics
− Curse of Dimensionality
− Linear Regression
Aykut Erdem
October 2016 Hacettepe University
Administrative
− Assignment 1 will be out on Wednesday
− It is due October 26 (i.e. in two weeks)
− Pencil-and-paper derivations
− Implementing a kNN classifier
− numpy/Python code
2
3
− It can be slow in testing
− Finding NN in high dimensions is slow
adopted from Fei-Fei Li & Andrej Karpathy & Justin Johnson
4
X = Document, Y = Topic (Sports / Science / News)
X = Cell Image, Y = Diagnosis (Anemic cell / Healthy cell)
slide by Aarti Singh and Barnabas Poczos
5
Stock Market Prediction
X = Feb01, Y = ?
slide by Aarti Singh and Barnabas Poczos
6
slide by Sanja Fidler
7
slide by Sanja Fidler
8
slide by Sanja Fidler
9
10
slide by Sanja Fidler
11
12
slide by Thorsten Joachims
– Attribute vectors: $y_j \in Y$
– Target attribute: $z_j \in \mathbb{R}$
– Similarity function: $L : Y \times Y \to \mathbb{R}$
– Number of nearest neighbors to consider: $k$
– New example $y'$; k-nearest neighbors: the $k$ training examples with largest $L(y_j, y')$
– Prediction: the average of the $k$ nearest neighbors' target values, $\frac{1}{k} \sum_{i \in \mathrm{knn}(\vec{x}')} y_i$
13
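Since the prediction rule above is just "average the k nearest targets", it fits in a few lines of numpy. A minimal sketch, assuming Euclidean distance as the similarity measure; the array names and toy data are made up for illustration:

import numpy as np

def knn_regress(X_train, y_train, x_query, k):
    # Distances from the query to every training point (Euclidean)
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k nearest training points
    nn = np.argsort(dists)[:k]
    # Prediction: average of the k neighbors' target values
    return y_train[nn].mean()

# Toy usage with 1-D inputs
X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([0.1, 0.9, 2.2, 2.8])
print(knn_regress(X_train, y_train, np.array([1.5]), k=2))  # (0.9 + 2.2) / 2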
[Figure: 1-nearest neighbor regression on (x, y) data; at each query point, the prediction is the value of the closest datapoint.]
Figure Credit: Carlos Guestrin
slide by Dhruv Batra
14 Figure Credit: Andrew Moore
slide by Dhruv Batra
15 Figure Credit: Andrew Moore
slide by Dhruv Batra
$x_1 = (x_{11}, x_{12}),\; x_2 = (x_{21}, x_{22}),\; \ldots,\; x_N = (x_{N1}, x_{N2})$.
16
The relative scalings in the distance metric affect region shapes
Slide Credit: Carlos Guestrin
slide by Dhruv Batra
$\mathrm{Dist}(x_i, x_j) = (x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2$
$\mathrm{Dist}(x_i, x_j) = (x_{i1} - x_{j1})^2 + (3x_{i2} - 3x_{j2})^2$
We make decisions by taking into account lots of factors, and we must decide how important each one is with respect to the others.
Reviews (out of 5 stars) | $ | Distance | Cuisine (out of 10)
4 | 30 | 21 | 7
2 | 15 | 12 | 8
5 | 27 | 53 | 9
3 | 20 | 5 | 6
17
slide by Richard Zemel
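To see concretely why the relative scalings matter, compute distances on the table above with and without rescaling; standardizing each column is one common fix, used here as an assumption rather than the slide's prescription:

import numpy as np

# Rows = restaurants; columns = Reviews, $, Distance, Cuisine (as in the table)
R = np.array([[4., 30., 21., 7.],
              [2., 15., 12., 8.],
              [5., 27., 53., 9.],
              [3., 20.,  5., 6.]])

# Raw Euclidean distance between restaurants 1 and 3:
# dominated by the wide-range Distance column
print(np.linalg.norm(R[0] - R[2]))

# Standardizing each column first puts the factors on an equal footing
Z = (R - R.mean(axis=0)) / R.std(axis=0)
print(np.linalg.norm(Z[0] - Z[2]))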
18
Scaled Euclidean distance: $D(x, x') = \sqrt{\sum_d a_d (x_d - x'_d)^2}$, or equivalently $D(x, x') = \sqrt{(x - x')^\top A (x - x')}$, where $A$ is a diagonal matrix with entries $a_d$.
Slide Credit: Carlos Guestrin
slide by Dhruv Batra
Scaled Euclidean (L2): diagonal $A$; Mahalanobis: non-diagonal $A$
19 Slide Credit: Carlos Guestrin
slide by Dhruv Batra
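Both distances are instances of one formula, $D(x, x') = \sqrt{(x - x')^\top A (x - x')}$. A minimal numpy sketch, with the matrices A chosen arbitrarily for illustration; the diagonal case reproduces the "3x" scaling from the earlier example:

import numpy as np

def scaled_dist(x, xp, A):
    # D(x, x') = sqrt((x - x')^T A (x - x'))
    d = x - xp
    return np.sqrt(d @ A @ d)

x  = np.array([1.0, 2.0])
xp = np.array([2.0, 0.0])

A_diag = np.diag([1.0, 9.0])          # scaled Euclidean: weight dim 2 by 3^2
A_full = np.array([[2.0, 0.5],
                   [0.5, 1.0]])       # Mahalanobis: non-diagonal (positive definite) A
print(scaled_dist(x, xp, A_diag), scaled_dist(x, xp, A_full))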
20
Image Credit: By Waldir (Based on File:MinkowskiCircles.svg) [CC BY-SA 3.0 (http://creativecommons.org/licenses/by-sa/3.0)], via Wikimedia Commons
slide by Dhruv Batra
$D = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$
p = 1: L1 norm (absolute); p → ∞: $L_\infty$ (max) norm
21 Slide Credit: Carlos Guestrin
slide by Dhruv Batra
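The Minkowski distance is one line of numpy; a quick sketch showing how p = 1 gives the L1 norm, p = 2 the Euclidean norm, and large p approaches the max norm (the points and p values are arbitrary):

import numpy as np

def minkowski(x, y, p):
    # D = ( sum_i |x_i - y_i|^p )^(1/p)
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

x = np.array([0.0, 0.0])
y = np.array([3.0, 4.0])
print(minkowski(x, y, 1))    # 7.0  (L1, absolute)
print(minkowski(x, y, 2))    # 5.0  (L2, Euclidean)
print(minkowski(x, y, 100))  # ~4.0 (approaches Linf = max |x_i - y_i|)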
Scaled Euclidean (L2)
Does the model grow with the training data?
– Yes = Non-parametric models
– No = Parametric models
– http://www.theparticle.com/applets/ml/nearest_neighbor/
22
23
slide by Thorsten Joachims
– Attribute vectors: $y_j \in Y$
– Target attribute: $z_j \in \mathbb{R}$
– Similarity function: $L : Y \times Y \to \mathbb{R}$
– Number of nearest neighbors to consider: $k$
– New example $y'$; k-nearest neighbors: the $k$ training examples with largest $L(y_j, y')$
− Distance metric: Euclidean (and others)
− How many neighbors to look at? All of them
− Weighting function: $w_i = \exp(-d(x_i, \mathrm{query})^2 / \sigma^2)$
  Nearby points to the query are weighted strongly, far points weakly.
  The $\sigma$ parameter is the kernel width. Very important.
− Prediction: the weighted average of the outputs, $\mathrm{predict} = \sum_i w_i y_i / \sum_i w_i$
24 Slide Credit: Carlos Guestrin
slide by Dhruv Batra
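The whole kernel-regression recipe, sketched in numpy with Gaussian weights as on the slide; the toy data and the particular sigma values are assumptions for illustration:

import numpy as np

def kernel_regress(X_train, y_train, x_query, sigma):
    # Gaussian weights: w_i = exp(-d(x_i, query)^2 / sigma^2)
    d2 = np.sum((X_train - x_query) ** 2, axis=1)
    w = np.exp(-d2 / sigma ** 2)
    # Weighted average of the outputs: sum(w_i y_i) / sum(w_i)
    return np.sum(w * y_train) / np.sum(w)

X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([0.1, 0.9, 2.2, 2.8])
# Small sigma: the prediction stays local to the query
print(kernel_regress(X_train, y_train, np.array([1.5]), sigma=0.5))
# Large sigma: all weights become nearly equal, approaching the global mean (1.5)
print(kernel_regress(X_train, y_train, np.array([1.5]), sigma=50.0))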
$w_i = \exp(-d(x_i, \mathrm{query})^2 / \sigma^2)$
(Our examples use Gaussian)
25 Slide Credit: Carlos Guestrin
slide by Dhruv Batra
26 Image Credit: Ben Taskar
slide by Dhruv Batra
− No learning: most of the real work is done at test time; for every test sample, we must search through the whole dataset
− Must use tricks like approximate nearest neighbour search
27
slide by Dhruv Batra
− Distances overwhelmed by noisy features
− Distances become meaningless in high dimensions
28
slide by Dhruv Batra
29
slide by Dhruv Batra
Consider applying a kNN classifier to data where the inputs are uniformly distributed in the D-dimensional unit cube. Suppose we estimate the density of class labels around a test point x by "growing" a hyper-cube around x until it contains a desired fraction f of the data points. The expected edge length of this cube is $e_D(f) = f^{1/D}$. If D = 10 and we want to base our estimate on 10% of the data, we have $e_{10}(0.1) = 0.8$, so we need to extend the cube 80% along each dimension around x. Even if we only use 1% of the data, we find $e_{10}(0.01) = 0.63$.
30
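These numbers follow directly from the formula; a quick Python sketch tabulating e_D(f) = f**(1/D) reproduces them:

# Edge length of the hyper-cube that captures a fraction f of data
# uniformly distributed in the D-dimensional unit cube: e_D(f) = f**(1/D)
for D in (1, 3, 10):
    for f in (0.1, 0.01):
        print(f"D={D:2d}, f={f}: edge length = {f ** (1 / D):.2f}")
# D=10, f=0.1  -> 0.79 (extend ~80% along each dimension)
# D=10, f=0.01 -> 0.63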
[Figure: edge length of cube vs. fraction of data in neighborhood, for d = 1, 3, 5, 7, 10.]
slide by Kevin Murphy
Even for small fractions of the data, the neighborhood is no longer very local.
31
t(x) = f(x) + ε
with ε some noise
32
slide by Sanja Fidler
− How do we parametrize the model (the curve)?
− What loss (objective) function should we use to judge fit?
− How do we optimize fit to unseen test data (generalization)?
33
slide by Sanja Fidler
34
https://archive.ics.uci.edu/ml/datasets/Housing
slide by Sanja Fidler
− x is the input feature (per capita crime rate)
− t is the target output (median house price)
− the superscript (i) simply indicates the training examples (we have N in this case)
y(x) = w0 + w1x
− Use the training examples to construct a hypothesis, or function approximator, that maps x to predicted y
− Evaluate hypothesis on test set
35
slide by Sanja Fidler
A simple model typically does not fit the data exactly; the lack of fit can be considered noise. Sources of noise:
− Imprecision in data attributes (input noise, e.g. noise in per-capita crime)
− Errors in data targets (mislabeling, e.g. noise in house prices)
− Additional attributes not taken into account by data attributes affect target values (latent variables). In the example, what else could affect house prices?
− Model may be too simple to account for data targets
36
slide by Sanja Fidler
37
slide by Sanja Fidler
y(x) = function(x, w)
Linear: $y(x) = w_0 + w_1 x$
Standard loss / cost / objective function measures the squared error between y and the true value t:
$\ell(w) = \sum_{n=1}^{N} \left[ t^{(n)} - y(x^{(n)}) \right]^2$
For the linear model:
$\ell(w) = \sum_{n=1}^{N} \left[ t^{(n)} - (w_0 + w_1 x^{(n)}) \right]^2$
For a particular hypothesis (y(x) defined by a choice of w, drawn in red), what does the loss represent geometrically? The sum of squared errors (the squared lengths of the green vertical lines).
38-45
slide by Sanja Fidler
46
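Minimizing this loss for the linear model is the classic least-squares problem, which numpy solves directly. A minimal sketch; the synthetic data below merely stands in for the (crime rate, house price) pairs and is not the slides' dataset:

import numpy as np

# Synthetic stand-in for the housing data: t = f(x) + noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
t = 2.0 - 0.15 * x + rng.normal(0, 0.2, size=50)

# Design matrix [1, x]; least squares minimizes sum (t - (w0 + w1 x))^2
X = np.column_stack([np.ones_like(x), x])
w, *_ = np.linalg.lstsq(X, t, rcond=None)
w0, w1 = w
print(f"w0 = {w0:.3f}, w1 = {w1:.3f}")

# Loss at the optimum
loss = np.sum((t - (w0 + w1 * x)) ** 2)
print(f"sum-of-squared-errors = {loss:.3f}")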