Lecture 3: Kernel Regression, Distance Metrics, Curse of Dimensionality, Linear Regression (PowerPoint PPT Presentation)


SLIDE 1

Lecture 3:

− Kernel Regression
− Distance Metrics
− Curse of Dimensionality
− Linear Regression

Aykut Erdem

October 2016, Hacettepe University

SLIDE 2

Administrative

  • Assignment 1 will be out on Wednesday.
  • It is due October 26 (i.e., in two weeks).
  • It includes:
− Pencil-and-paper derivations
− Implementing a kNN classifier
− numpy/Python code
  • Note: Lecture slides are not enough; you should also read the related book chapters!

SLIDE 3

Recall from last time… Nearest Neighbors

  • Very simple method
  • Retains all training data
− It can be slow at test time
− Finding nearest neighbors in high dimensions is slow
  • Metrics are very important
  • Good baseline

adopted from Fei-Fei Li & Andrej Karpathy & Justin Johnson

SLIDE 4

Classification

  • Input: X
− Real-valued, vectors over reals
− Discrete values (0, 1, 2, …)
− Other structures (e.g., strings, graphs, etc.)
  • Output: Y
− Discrete (0, 1, 2, …)

Examples: X = Document, Y = Topic (Sports / Science / News); X = Cell Image, Y = Diagnosis (Anemic cell / Healthy cell)

slide by Aarti Singh and Barnabas Poczos

SLIDE 5

Regression

  • Input: X
− Real-valued, vectors over reals
− Discrete values (0, 1, 2, …)
− Other structures (e.g., strings, graphs, etc.)
  • Output: Y
− Real-valued, vectors over reals

Example: Stock Market Prediction (X = Feb 01, Y = ?)

slide by Aarti Singh and Barnabas Poczos

SLIDE 6

What should I watch tonight?

slide by Sanja Fidler

SLIDE 7

What should I watch tonight?

slide by Sanja Fidler

SLIDE 8

What should I watch tonight?

slide by Sanja Fidler

SLIDE 9

Today

  • Kernel regression
− nonparametric
  • Distance metrics
  • Linear regression (more on Thursday)
− parametric
− simple model

SLIDE 10

Simple 1-D Regression

  • Circles are data points (i.e., training examples) that are given to us
  • The data points are uniform in x, but may be displaced in y:

t(x) = f(x) + ε

with ε some noise

  • In green is the “true” curve that we don’t know

slide by Sanja Fidler

SLIDE 11

Kernel Regression

SLIDE 12

K-NN for Regression

  • Given: Training data {(y1, z1), …, (yn, zn)}
− Attribute vectors: yj ∈ Y
− Target attribute: zj ∈ ℜ
  • Parameters:
− Similarity function: L : Y × Y → ℜ
− Number of nearest neighbors to consider: k
  • Prediction rule
− New example y′
− K-nearest neighbors: the k training examples with the largest L(yj, y′)
− Predict the average target of the k nearest neighbors:

h(y′) = (1/k) Σj∈knn(y′) zj

slide by Thorsten Joachims
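The prediction rule above (average the targets of the k most similar training points) can be sketched in numpy. This is an illustrative sketch, not code from the slides: the function name and toy data are mine, and plain Euclidean distance stands in for the similarity function L.

```python
import numpy as np

def knn_regress(X_train, z_train, x_query, k=3):
    """Predict the target at x_query as the mean of the targets of the
    k nearest training points (Euclidean distance as the metric)."""
    dists = np.linalg.norm(X_train - x_query, axis=1)  # distance to every training point
    nn = np.argsort(dists)[:k]                         # indices of the k nearest neighbors
    return z_train[nn].mean()                          # average their targets

# toy 1-D data on a grid, z = 2x
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
z = np.array([0.0, 2.0, 4.0, 6.0, 8.0])
print(knn_regress(X, z, np.array([2.1]), k=3))  # neighbors x = 2, 3, 1 -> (4 + 6 + 2) / 3 = 4.0
```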

SLIDE 13

1-NN for Regression

[Figure: 1-D data; in each region of x, the prediction is the y-value of the closest datapoint]

Figure Credit: Carlos Guestrin
slide by Dhruv Batra

SLIDE 14

1-NN for Regression

  • Often bumpy (overfits)

Figure Credit: Andrew Moore
slide by Dhruv Batra
SLIDE 15

9-NN for Regression

  • Often bumpy (overfits)

Figure Credit: Andrew Moore
slide by Dhruv Batra

SLIDE 16

Multivariate distance metrics

  • Suppose the input vectors x1, x2, …, xN are two-dimensional:

x1 = (x11, x12), x2 = (x21, x22), …, xN = (xN1, xN2).

  • One can draw the nearest-neighbor regions in input space.
  • The relative scalings in the distance metric affect region shapes:

Dist(xi, xj) = (xi1 − xj1)² + (xi2 − xj2)²
Dist(xi, xj) = (xi1 − xj1)² + (3xi2 − 3xj2)²

Slide Credit: Carlos Guestrin
slide by Dhruv Batra
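The scaling effect is easy to verify numerically: multiplying the second coordinate by 3 multiplies its contribution to the squared distance by 9. A small sketch (the function name and example points are mine):

```python
def dist_sq(xi, xj, scale=1.0):
    """Squared distance with the second coordinate rescaled:
    (xi1 - xj1)^2 + (scale*xi2 - scale*xj2)^2"""
    return (xi[0] - xj[0]) ** 2 + (scale * xi[1] - scale * xj[1]) ** 2

a, b = (0.0, 1.0), (1.0, 0.0)
print(dist_sq(a, b))           # 2.0: both coordinates contribute equally
print(dist_sq(a, b, scale=3))  # 10.0: the second coordinate now dominates
```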

SLIDE 17

Example: Choosing a restaurant

  • In everyday life we need to make decisions by taking into account lots of factors
  • The question is what weight we put on each of these factors (how important are they with respect to the others)

Reviews (out of 5 stars) | $ | Distance | Cuisine (out of 10)
4 | 30 | 21 | 7
2 | 15 | 12 | 8
5 | 27 | 53 | 9
3 | 20 | 5 | 6

  • individuals’ preferences
  • ?

slide by Richard Zemel
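One way to make the slide's question concrete is a linear weighted score over the table's four factors. The weights below are purely hypothetical preferences invented for illustration (negative for cost-like factors such as price and distance):

```python
# Each row: (reviews out of 5, price $, distance, cuisine out of 10), from the slide's table
restaurants = [(4, 30, 21, 7), (2, 15, 12, 8), (5, 27, 53, 9), (3, 20, 5, 6)]

# Hypothetical preference weights: reviews and cuisine add to the score,
# price and distance subtract from it
weights = (2.0, -0.1, -0.05, 1.0)

def score(r):
    return sum(w * v for w, v in zip(weights, r))

best = max(restaurants, key=score)
print(best)  # under these weights the third restaurant wins
```

A different diner, with different weights, would rank the same table differently, which is exactly the point of the slide.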

SLIDE 18

Euclidean distance metric

D(x, x′) = sqrt( Σi σi² (xi − x′i)² )

Or equivalently,

D(x, x′) = sqrt( (x − x′)ᵀ A (x − x′) )

where A is a diagonal matrix with entries σi².

Slide Credit: Carlos Guestrin
slide by Dhruv Batra

SLIDE 19

Notable distance metrics (and their level sets)

  • Scaled Euclidean (L2)
  • Mahalanobis (non-diagonal A)

Slide Credit: Carlos Guestrin
slide by Dhruv Batra

SLIDE 20

Minkowski distance

D = ( Σi=1..n |xi − yi|^p )^(1/p)

Image Credit: By Waldir (Based on File:MinkowskiCircles.svg) [CC BY-SA 3.0 (http://creativecommons.org/licenses/by-sa/3.0)], via Wikimedia Commons
slide by Dhruv Batra
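The Minkowski distance is easy to check numerically for the familiar special cases p = 1 (L1/Manhattan) and p = 2 (Euclidean); this sketch and its toy points are mine:

```python
import numpy as np

def minkowski(x, y, p):
    """Minkowski distance: D = (sum_i |x_i - y_i|^p)^(1/p)."""
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

x = np.array([0.0, 0.0])
y = np.array([3.0, 4.0])
print(minkowski(x, y, 1))  # p=1, Manhattan: 3 + 4 = 7.0
print(minkowski(x, y, 2))  # p=2, Euclidean: sqrt(9 + 16) = 5.0
```

As p grows, the result approaches max_i |x_i − y_i|, the Linf (max) norm of the next slide.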

SLIDE 21

Notable distance metrics (and their level sets)

  • Scaled Euclidean (L2)
  • L1 norm (absolute)
  • Linf (max) norm

Slide Credit: Carlos Guestrin
slide by Dhruv Batra

SLIDE 22

Parametric vs Non-parametric Models

  • Does the capacity (size of the hypothesis class) grow with the size of the training data?
− Yes = Non-parametric models
− No = Parametric models
  • Example
− http://www.theparticle.com/applets/ml/nearest_neighbor/

SLIDE 23

Weighted K-NN for Regression

  • Given: Training data {(y1, z1), …, (yn, zn)}
− Attribute vectors: yj ∈ Y
− Target attribute: zj ∈ ℜ
  • Parameters:
− Similarity function: L : Y × Y → ℜ
− Number of nearest neighbors to consider: k
  • Prediction rule
− New example y′
− K-nearest neighbors: the k training examples with the largest L(yj, y′)
− Predict the similarity-weighted average of their targets:

h(y′) = Σj∈knn(y′) L(yj, y′) zj / Σj∈knn(y′) L(yj, y′)

slide by Thorsten Joachims

SLIDE 24

Kernel Regression/Classification

Four things make a memory-based learner:

  • A distance metric
− Euclidean (and others)
  • How many nearby neighbors to look at?
− All of them
  • A weighting function (optional)
− wi = exp(−d(xi, query)² / σ²)
− Nearby points to the query are weighted strongly, far points weakly. The σ parameter is the kernel width. Very important.
  • How to fit with the local points?
− Predict the weighted average of the outputs:

predict = Σ wi yi / Σ wi

Slide Credit: Carlos Guestrin
slide by Dhruv Batra
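The four ingredients above combine into a one-function sketch of kernel regression: use all training points, weight them with the Gaussian kernel, and predict the weighted average. The function name and toy data below are mine:

```python
import numpy as np

def kernel_regress(x_train, y_train, x_query, sigma=1.0):
    """Gaussian-kernel regression on 1-D inputs:
    w_i = exp(-d(x_i, query)^2 / sigma^2), predict sum(w_i y_i) / sum(w_i)."""
    w = np.exp(-((x_train - x_query) ** 2) / sigma ** 2)  # weight for every training point
    return np.sum(w * y_train) / np.sum(w)                # weighted average of outputs

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.0, 1.0, 2.0, 3.0])
# the query sits midway between x=1 and x=2; by symmetry the weights pair up
print(kernel_regress(x, y, 1.5, sigma=0.5))  # 1.5
```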

SLIDE 25

Weighting/Kernel functions

wi = exp(−d(xi, query)² / σ²)

(Our examples use Gaussian)

Slide Credit: Carlos Guestrin
slide by Dhruv Batra

SLIDE 26

Effect of Kernel Width

  • What happens as σ → ∞?
  • What happens as σ → 0?

Image Credit: Ben Taskar
slide by Dhruv Batra
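The slide's two questions can be answered empirically with a Gaussian-weighted predictor (re-defined here so the sketch is self-contained; the names and toy data are mine): as σ → ∞ every training point gets equal weight and the prediction tends to the global mean of the targets, while as σ → 0 only the single nearest neighbor keeps any weight.

```python
import numpy as np

def kernel_regress(x_train, y_train, x_query, sigma):
    w = np.exp(-((x_train - x_query) ** 2) / sigma ** 2)
    return np.sum(w * y_train) / np.sum(w)

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.0, 10.0, 20.0, 30.0])

# sigma very large: all weights approach 1, so the prediction
# approaches the global mean of y (15.0), regardless of the query
print(kernel_regress(x, y, 0.2, sigma=1e6))

# sigma very small: only the nearest neighbor (x = 0, y = 0) keeps
# non-negligible weight, so the prediction collapses to 1-NN
print(kernel_regress(x, y, 0.2, sigma=1e-2))
```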

SLIDE 27

Problems with Instance-Based Learning

  • Expensive
− No learning: most real work is done during testing
− For every test sample, must search through the whole dataset (very slow!)
− Must use tricks like approximate nearest neighbour search
  • Doesn’t work well when there are a large number of irrelevant features
− Distances overwhelmed by noisy features
  • Curse of Dimensionality
− Distances become meaningless in high dimensions

slide by Dhruv Batra


SLIDE 30

Curse of Dimensionality

  • Consider applying a KNN classifier/regressor to data where the inputs are uniformly distributed in the D-dimensional unit cube.
  • Suppose we estimate the density of class labels around a test point x by “growing” a hyper-cube around x until it contains a desired fraction f of the data points.
  • The expected edge length of this cube will be eD(f) = f^(1/D).
  • If D = 10, and we want to base our estimate on 10% of the data, we have e10(0.1) = 0.8, so we need to extend the cube 80% along each dimension around x.
  • Even if we only use 1% of the data, we find e10(0.01) = 0.63, so the neighborhood is no longer very local.

[Figure: edge length of the cube vs. fraction of data in the neighborhood, for d = 1, 3, 5, 7, 10]

slide by Kevin Murphy
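The slide's numbers follow directly from eD(f) = f^(1/D); a quick check (the helper name is mine):

```python
def edge_length(f, D):
    """Expected edge length of the hyper-cube that captures a fraction f
    of uniformly distributed data in D dimensions: e_D(f) = f^(1/D)."""
    return f ** (1.0 / D)

print(round(edge_length(0.10, 10), 2))  # 0.79 (about 80% of each dimension)
print(round(edge_length(0.01, 10), 2))  # 0.63
print(round(edge_length(0.10, 1), 2))   # 0.1 (in 1-D the neighborhood stays local)
```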

SLIDE 31

Linear Regression

SLIDE 32

Simple 1-D Regression

  • Circles are data points (i.e., training examples) that are given to us
  • The data points are uniform in x, but may be displaced in y:

t(x) = f(x) + ε

with ε some noise

  • In green is the “true” curve that we don’t know
  • Goal: We want to fit a curve to these points

slide by Sanja Fidler
SLIDE 33

Simple 1-D Regression

  • Key Questions:
− How do we parametrize the model (the curve)?
− What loss (objective) function should we use to judge fit?
− How do we optimize fit to unseen test data (generalization)?

slide by Sanja Fidler

SLIDE 34

Example: Boston House Prices

  • Estimate the median house price in a neighborhood based on neighborhood statistics
  • Look at the first (of 13) attributes: per capita crime rate
  • Use this to predict house prices in other neighborhoods
  • Is this a good input (attribute) to predict house prices?

https://archive.ics.uci.edu/ml/datasets/Housing

slide by Sanja Fidler

SLIDE 35

Represent the data

  • Data described as pairs D = {(x(1), t(1)), (x(2), t(2)), …, (x(N), t(N))}
− x is the input feature (per capita crime rate)
− t is the target output (median house price)
− (i) simply indexes the training examples (we have N in this case)
  • Here t is continuous, so this is a regression problem
  • Model outputs y, an estimate of t:

y(x) = w0 + w1x

  • What type of model did we choose?
  • Divide the dataset into training and testing examples
− Use the training examples to construct a hypothesis, or function approximator, that maps x to predicted y
− Evaluate the hypothesis on the test set

slide by Sanja Fidler

SLIDE 36

Noise

  • A simple model typically does not exactly fit the data; the lack of fit can be considered noise
  • Sources of noise:
− Imprecision in data attributes (input noise, e.g. noise in per-capita crime)
− Errors in data targets (mislabeling, e.g. noise in house prices)
− Additional attributes not captured by the data attributes that affect target values (latent variables). In the example, what else could affect house prices?
− Model may be too simple to account for data targets

slide by Sanja Fidler

SLIDE 37

Least-Squares Regression

y(x) = function(x, w)

slide by Sanja Fidler

SLIDE 38

Least-Squares Regression

  • Define a model. Linear: y(x) = function(x, w)
  • Standard loss/cost/objective function measures the squared error between y and the true value t
  • For a particular hypothesis (y(x) defined by a choice of w, drawn in red), what does the loss represent geometrically?

slide by Sanja Fidler


SLIDE 40

Least-Squares Regression

  • Define a model. Linear: y(x) = w0 + w1x
  • Standard loss/cost/objective function measures the squared error between y and the true value t:

ℓ(w) = Σn=1..N [t(n) − y(x(n))]²

  • For a particular hypothesis (y(x) defined by a choice of w, drawn in red), what does the loss represent geometrically?

slide by Sanja Fidler

SLIDE 41

Least-Squares Regression

  • Define a model. Linear: y(x) = w0 + w1x
  • Standard loss/cost/objective function measures the squared error between y and the true value t. Linear model:

ℓ(w) = Σn=1..N [t(n) − (w0 + w1x(n))]²

  • For a particular hypothesis (y(x) defined by a choice of w, drawn in red), what does the loss represent geometrically?

slide by Sanja Fidler


SLIDE 43

Least-Squares Regression

  • Define a model. Linear: y(x) = w0 + w1x
  • Standard loss/cost/objective function measures the squared error between y and the true value t. Linear model:

ℓ(w) = Σn=1..N [t(n) − (w0 + w1x(n))]²

  • The loss for the red hypothesis is the sum of the squared vertical errors (squared lengths of green vertical lines)

slide by Sanja Fidler

SLIDE 44

Least-Squares Regression

  • Define a model. Linear: y(x) = w0 + w1x
  • Standard loss/cost/objective function measures the squared error between y and the true value t. Linear model:

ℓ(w) = Σn=1..N [t(n) − (w0 + w1x(n))]²

  • How do we obtain the weights w = (w0, w1)?

slide by Sanja Fidler

SLIDE 45

Least-Squares Regression

  • Define a model. Linear: y(x) = w0 + w1x
  • Standard loss/cost/objective function measures the squared error between y and the true value t. Linear model:

ℓ(w) = Σn=1..N [t(n) − (w0 + w1x(n))]²

  • How do we obtain the weights w = (w0, w1)? Find w that minimizes the loss ℓ(w)

slide by Sanja Fidler
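Minimizing ℓ(w) is the subject of the next lecture, but as a preview, the minimizer for the linear model has a standard closed form (the normal equations). A sketch; the function name and toy data are mine:

```python
import numpy as np

def fit_least_squares(x, t):
    """Find w = (w0, w1) minimizing sum_n (t(n) - (w0 + w1 x(n)))^2
    using the closed-form least-squares solution."""
    X = np.stack([np.ones_like(x), x], axis=1)  # design matrix with columns [1, x]
    w, *_ = np.linalg.lstsq(X, t, rcond=None)   # least-squares solve of X w = t
    return w  # (w0, w1)

# noise-free line t = 1 + 2x: the fit should recover the coefficients exactly
x = np.array([0.0, 1.0, 2.0, 3.0])
t = 1.0 + 2.0 * x
w0, w1 = fit_least_squares(x, t)
print(round(w0, 6), round(w1, 6))  # 1.0 2.0
```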

SLIDE 46

Next Lecture:

Least Squares Optimization, Model Complexity, and Regularization