photo: @rewardyfahmi // Unsplash
BBM406 Fundamentals of Machine Learning
Lecture 3: Kernel Regression, Distance Metrics, Curse of Dimensionality
Aykut Erdem // Hacettepe University // Fall 2019
Administrative
− Assignment 1 will be out (this/next week):
  − Pencil-and-paper derivations
  − Implementing kernel regression
  − numpy/Python code

Movie Recommendation System
− Ratings: userId, movieId, rating, timestamp
− Movies: movieId, title, genre
− Links: movieId, imdbId, and tmdbId
− Tags: userId, movieId, tag, timestamp
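Since the assignment calls for numpy/Python code, a minimal sketch of loading these tables with pandas may help. The CSV file names here are an assumption (the usual MovieLens-style names); the actual files handed out with the assignment may differ.

```python
import pandas as pd

# Assumed file names; adjust to the paths given with the assignment.
ratings = pd.read_csv("ratings.csv")  # userId, movieId, rating, timestamp
movies = pd.read_csv("movies.csv")    # movieId, title, genre
links = pd.read_csv("links.csv")      # movieId, imdbId, tmdbId
tags = pd.read_csv("tags.csv")        # userId, movieId, tag, timestamp

# Join ratings with titles to see what each user rated.
print(ratings.merge(movies, on="movieId").head())
```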
Recall from last time… Nearest Neighbors
− It can be slow at test time
− Finding nearest neighbors in high dimensions is slow
adapted from Fei-Fei Li & Andrej Karpathy & Justin Johnson
Classification examples:
− X = Document, Y = Topic ∈ {Sports, Science, News}
− X = Cell Image, Y = Diagnosis ∈ {Anemic cell, Healthy cell}
slide by Aarti Singh and Barnabas Poczos
Regression example: Stock Market Prediction
− X = Feb 01, Y = ? (predicting a continuous value over time t)
slide by Aarti Singh and Barnabas Poczos
slide by Sanja Fidler
− nonparametric
− parametric (simple model)
The training targets are given to us as t(x) = f(x) + ε, with ε some noise.
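A minimal numpy sketch of this data-generating process; the particular f (a sine) and the noise level are illustrative assumptions, not from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=50)        # inputs
f = np.sin(2 * np.pi * x)                 # an assumed true function f(x)
eps = rng.normal(0.0, 0.1, size=x.shape)  # noise term ε
t = f + eps                               # observed targets t(x) = f(x) + ε
```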
slide by Sanja Fidler
slide by Thorsten Joachims
– Attribute vectors: y_j ∈ Y
– Target attribute: z_j ∈ ℝ
– Similarity function: L : Y × Y → ℝ
– Number of nearest neighbors to consider: k
– New example y′
– K-nearest neighbors: the k training examples with the largest L(y_j, y′)
– Prediction: h(x₀) = (1/k) · Σ_{i ∈ kNN(x₀)} y_i
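A minimal numpy sketch of this prediction rule, using Euclidean distance in place of a general similarity function L; the names and data are illustrative.

```python
import numpy as np

def knn_regress(X_train, y_train, x0, k=3):
    """h(x0) = (1/k) * sum of the targets of the k nearest training points."""
    dists = np.linalg.norm(X_train - x0, axis=1)  # distance to every point
    nn = np.argsort(dists)[:k]                    # indices of the k nearest
    return y_train[nn].mean()                     # average their targets

# Tiny usage example
X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([0.0, 0.8, 0.9, 0.1])
print(knn_regress(X_train, y_train, np.array([1.4]), k=2))
```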
[Figure: 1-NN regression in 1D. For each query x, the prediction is the y value of the closest datapoint. Figure Credit: Carlos Guestrin]
slide by Dhruv Batra
Figure Credit: Andrew Moore
slide by Dhruv Batra
Figure Credit: Andrew Moore
slide by Dhruv Batra
x1 = (x11, x12), x2 = (x21, x22), …, xN = (xN1, xN2).
The relative scalings in the distance metric affect region shapes
Slide Credit: Carlos Guestrin
slide by Dhruv Batra
Dist(xi, xj) = (xi1 − xj1)² + (xi2 − xj2)²
Dist(xi, xj) = (xi1 − xj1)² + (3xi2 − 3xj2)²
Example: Choosing a restaurant
− In everyday life we need to make decisions by taking into account lots of factors
− The question is what weight we put on each of these factors (how important are they with respect to the others)

Reviews (out of 5 stars) | $  | Distance | Cuisine (out of 10)
4                        | 30 | 21       | 7
2                        | 15 | 12       | 8
5                        | 27 | 53       | 9
3                        | 20 | 5        | 6
slide by Richard Zemel
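To make the weighting concrete, here is a small numpy sketch using the table above; the weight values are illustrative assumptions about how much each factor matters.

```python
import numpy as np

# Rows: restaurants; columns: reviews (of 5), price ($), distance, cuisine (of 10)
restaurants = np.array([
    [4, 30, 21, 7],
    [2, 15, 12, 8],
    [5, 27, 53, 9],
    [3, 20,  5, 6],
], dtype=float)

def weighted_dist(a, b, w):
    # Scaled Euclidean distance: sqrt(sum_i w_i * (a_i - b_i)^2)
    return np.sqrt(np.sum(w * (a - b) ** 2))

# Assumed weights: reviews and cuisine matter a lot, price and distance little.
w = np.array([10.0, 0.1, 0.05, 5.0])
print(weighted_dist(restaurants[0], restaurants[1], w))
print(weighted_dist(restaurants[0], restaurants[2], w))
```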
D(x, x′) = √( Σ_i a_i (x_i − x′_i)² ), where a_i is the weight on dimension i. Or equivalently, D(x, x′) = √( (x − x′)ᵀ A (x − x′) ), where A is a diagonal matrix with the a_i on its diagonal.
Slide Credit: Carlos Guestrin
slide by Dhruv Batra
Scaled Euclidean (L2): diagonal A. Mahalanobis: non-diagonal A.
Slide Credit: Carlos Guestrin
slide by Dhruv Batra
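A minimal sketch of the matrix form above; the example vectors and the matrix A are illustrative. A diagonal A recovers the scaled Euclidean distance, while a full (symmetric, positive semi-definite) A gives the Mahalanobis distance.

```python
import numpy as np

def general_dist(x, xp, A):
    """sqrt((x - x')^T A (x - x'))"""
    d = x - xp
    return np.sqrt(d @ A @ d)

x = np.array([1.0, 2.0])
xp = np.array([2.0, 0.0])
A_diag = np.diag([1.0, 9.0])     # scaled Euclidean (diagonal A)
A_full = np.array([[2.0, 0.5],
                   [0.5, 1.0]])  # Mahalanobis (non-diagonal A)
print(general_dist(x, xp, A_diag), general_dist(x, xp, A_full))
```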
Image Credit: Waldir (based on File:MinkowskiCircles.svg) [CC BY-SA 3.0 (http://creativecommons.org/licenses/by-sa/3.0)], via Wikimedia Commons
slide by Dhruv Batra
Minkowski distance: D = ( Σ_{i=1}^{n} |x_i − y_i|^p )^(1/p)
− p = 1: L1 norm (absolute)
− p → ∞: L∞ (max) norm
Slide Credit: Carlos Guestrin
slide by Dhruv Batra
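A quick numpy sketch of the Minkowski formula for a few values of p; the vectors are illustrative. As p grows, the result approaches the max norm.

```python
import numpy as np

def minkowski(x, y, p):
    """D = (sum_i |x_i - y_i|^p)^(1/p)"""
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 0.0, 3.5])
for p in [1, 2, 10, 100]:
    print(p, minkowski(x, y, p))       # p=1: L1, p=2: L2, large p: ~max
print("Linf:", np.max(np.abs(x - y)))  # the p → ∞ limit
```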
Scaled Euclidean (L2)
slide by Thorsten Joachims
Given:
– Training examples (y_1, z_1), …, (y_n, z_n)
– Attribute vectors: y_j ∈ Y
– Target attribute: z_j ∈ ℝ
– Similarity function: L : Y × Y → ℝ
– Number of nearest neighbors to consider: k
– New example y′
– K-nearest neighbors: the k training examples with the largest L(y_j, y′)
Kernel Regression/Classification
Four things make a memory-based learner:
1. A distance metric
   − Euclidean (and others)
2. How many nearby neighbors to look at?
   − All of them
3. A weighting function (optional)
   − wi = exp(−d(xi, query)² / σ²)
   − Nearby points to the query are weighted strongly, far points weakly. The σ parameter is the kernel width, and it is very important.
4. How to fit with the local points?
   − Predict the weighted average of the outputs: predict = Σ wi yi / Σ wi
Slide Credit: Carlos Guestrin
slide by Dhruv Batra
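Putting the four pieces together, here is a minimal numpy sketch of kernel regression with the Gaussian weighting above; array names and test values are illustrative.

```python
import numpy as np

def kernel_regression(X_train, y_train, x_query, sigma=1.0):
    """Weighted-average prediction: sum(w_i * y_i) / sum(w_i)."""
    d2 = np.sum((X_train - x_query) ** 2, axis=1)  # d(x_i, query)^2
    w = np.exp(-d2 / sigma ** 2)                   # Gaussian weights
    return np.sum(w * y_train) / np.sum(w)

# Tiny usage example
X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([0.0, 0.8, 0.9, 0.1])
print(kernel_regression(X_train, y_train, np.array([1.4]), sigma=0.5))
```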
wi = exp(−d(xi, query)² / σ²)
(Our examples use Gaussian)
Slide Credit: Carlos Guestrin
slide by Dhruv Batra
What happens as σ → ∞? All points receive nearly equal weight, so the prediction approaches the global average of the outputs.
What happens as σ → 0? Only the closest point keeps appreciable weight, so kernel regression approaches 1-NN.
Image Credit: Ben Taskar
slide by Dhruv Batra
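A quick numeric check of both limits, reusing the data from the kernel_regression sketch above; the σ values are illustrative (a very small σ underflows the weights, so 0.2 stands in for the σ → 0 limit).

```python
import numpy as np

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 0.8, 0.9, 0.1])
q = np.array([1.4])

for sigma in [0.2, 1.0, 100.0]:
    d2 = np.sum((X - q) ** 2, axis=1)
    w = np.exp(-d2 / sigma ** 2)
    print(sigma, np.sum(w * y) / np.sum(w))
# sigma = 0.2  -> ~0.80 (dominated by the nearest point, like 1-NN)
# sigma = 100  -> ~0.45 (nearly uniform weights: the global mean of y)
```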
Problems with memory-based learning:
− No learning: most of the real work is done at test time
− For every test sample, we must search through the entire dataset: very slow!
− Must use tricks like approximate nearest-neighbour search
− Distances are overwhelmed by noisy features
− Distances become meaningless in high dimensions
slide by Dhruv Batra
Consider applying a k-NN classifier to data where the inputs are uniformly distributed in the D-dimensional unit cube. Suppose we estimate the density of class labels around a test point x by "growing" a hyper-cube around x until it contains a desired fraction f of the data points. The expected edge length of this cube is e_D(f) = f^(1/D). If D = 10 and we want to base our estimate on 10% of the data, we have e_10(0.1) = 0.8, so we need to extend the cube 80% along each dimension around x. Even if we use only 1% of the data, e_10(0.01) = 0.63, so such a neighborhood is no longer very local.

[Plot: edge length of the cube vs. fraction of data in the neighborhood, for d = 1, 3, 5, 7, 10. slide by Kevin Murphy]
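A one-liner check of e_D(f) = f^(1/D), reproducing the numbers above:

```python
def edge_length(f, D):
    """Edge of the hyper-cube holding a fraction f of uniform data in D dims."""
    return f ** (1.0 / D)

print(edge_length(0.10, 10))  # ~0.80: 80% of each axis for 10% of the data
print(edge_length(0.01, 10))  # ~0.63: even 1% of the data needs 63% per axis
```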
Parametric vs Non-parametric Models
Does the model complexity grow with the size of the training data?
– Yes = Non-parametric models
– No = Parametric models
Next Lecture: Linear Regression, Least Squares Optimization, Model Complexity, Regularization