Lecture 3: Kernel Regression, Curse of Dimensionality (Aykut Erdem)



slide-1
SLIDE 1

Lecture 3:

− Kernel Regression
− Curse of Dimensionality

Aykut Erdem

February 2016 Hacettepe University

slide-2
SLIDE 2

Administrative

  • Assignment 1 will be out on Thursday
  • It is due March 4 (i.e. in two weeks).
  • It includes

− Pencil-and-paper derivations
− Implementing kNN classifier
− Implementing linear regression
− numpy/Python code

  • Note: Lecture slides are not enough; you should also read the related book chapters!

2

slide-3
SLIDE 3

Recall from last time… Nearest Neighbors

3

  • Very simple method
  • Retain all training data

− It can be slow in testing
− Finding NN in high dimensions is slow

  • Metrics are very important
  • Good baseline

adopted from Fei-Fei Li & Andrej Karpathy & Justin Johnson

slide-4
SLIDE 4

Classification

  • Input: X
  • Real valued, vectors over reals.
  • Discrete values (0,1,2,…)
  • Other structures (e.g., strings, graphs, etc.)

  • Output: Y
  • Discrete (0,1,2,...)

4

Sports / Science / News

X = Document

Y = Topic

Anemic cell / Healthy cell

X = Cell Image

Y = Diagnosis

slide by Aarti Singh and Barnabas Poczos

slide-5
SLIDE 5

Regression

  • Input: X
  • Real valued, vectors over reals.
  • Discrete values (0,1,2,…)
  • Other structures (e.g., strings, graphs, etc.)

  • Output: Y
  • Real valued, vectors over reals.

5

Stock Market Prediction

X = Feb 01

Y = ?

slide by Aarti Singh and Barnabas Poczos

slide-6
SLIDE 6

What should I watch tonight?

6

slide by Sanja Fidler

slide-7
SLIDE 7

What should I watch tonight?

7

slide by Sanja Fidler

slide-8
SLIDE 8

What should I watch tonight?

8

slide by Sanja Fidler

slide-9
SLIDE 9

Today

  • Kernel regression

− nonparametric


  • Distances
  • Next: Linear regression

− parametric
− simple model

9

slide-10
SLIDE 10

Simple 1-D Regression

  • Circles are data points (i.e., training examples) that are given to us
  • The data points are uniform in x, but may be displaced in y:

t(x) = f(x) + ε

with ε some noise

  • In green is the “true” curve that we don’t know

10

slide by Sanja Fidler
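A minimal sketch of how a training set like this could be generated; the sine curve for the unknown f(x), the noise level, and the number of points are illustrative assumptions, not taken from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # The "true" curve we pretend not to know (assumed sine, for illustration).
    return np.sin(2 * np.pi * x)

x_train = np.linspace(0.0, 1.0, 10)                               # uniform in x
t_train = f(x_train) + rng.normal(scale=0.2, size=x_train.shape)  # displaced in y by noise ε
```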

slide-11
SLIDE 11

Kernel Regression

11

slide-12
SLIDE 12

1-NN for Regression

12

[Figure: 1-NN regression fit over x-y data; at each marked query point, the prediction is the y value of the closest training point]

Figure Credit: Carlos Guestrin

slide by Dhruv Batra
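A minimal sketch of this 1-NN regression rule; the example data are made up for illustration:

```python
import numpy as np

def predict_1nn(x_query, x_train, t_train):
    """1-NN regression: return the target of the single closest training point."""
    idx = np.argmin(np.abs(np.asarray(x_train) - x_query))
    return np.asarray(t_train)[idx]

# The closest training point to 0.25 is 0.3, so the prediction is its target, 2.0.
print(predict_1nn(0.25, [0.0, 0.3, 0.7, 1.0], [1.0, 2.0, 0.5, -1.0]))
```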

slide-13
SLIDE 13

1-NN for Regression

13 Figure Credit: Andrew Moore

slide by Dhruv Batra

  • Often bumpy (overfits)
slide-14
SLIDE 14

9-NN for Regression

  • Often bumpy (overfits)

14 Figure Credit: Andrew Moore

slide by Dhruv Batra

slide-15
SLIDE 15

Weighted K-NN for Regression

15

slide by Thorsten Joachims

  • Given: Training data ((y1, z1), …, (yn, zn))

− Attribute vectors: yj ∈ Y
− Target attribute: zj ∈ ℝ

  • Parameters:

− Similarity function: L : Y × Y → ℝ
− Number of nearest neighbors to consider: k

  • Prediction rule

− New example y’
− k-nearest neighbors: the k training examples with the largest L(yj, y’)
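A minimal sketch of a weighted k-NN regression predictor in the spirit of this slide, for 1-D inputs. The Gaussian similarity and the weighted-average prediction step are illustrative assumptions, since the slide text above stops at finding the k nearest neighbors:

```python
import numpy as np

def predict_weighted_knn(x_query, x_train, t_train, k=3, sigma=0.2):
    """Weighted k-NN regression (sketch): similarity-weighted average of the
    targets of the k training points most similar to the query."""
    d = np.abs(np.asarray(x_train) - x_query)   # distances to all training points
    nn = np.argsort(d)[:k]                      # indices of the k nearest neighbors
    w = np.exp(-(d[nn] ** 2) / sigma ** 2)      # Gaussian similarity weights
    return np.sum(w * np.asarray(t_train)[nn]) / np.sum(w)

# The prediction is pulled mostly toward the target of the closest neighbor.
print(predict_weighted_knn(0.25, [0.0, 0.3, 0.7, 1.0], [1.0, 2.0, 0.5, -1.0], k=2))
```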

slide-16
SLIDE 16

Multivariate distance metrics

  • Suppose the input vectors x1, x2, …xN are two dimensional:

x1 = ( x11 , x12 ) , x2 = ( x21 , x22 ) , …xN = ( xN1 , xN2 ).

  • One can draw the nearest-neighbor regions in input space.

16

The relative scalings in the distance metric affect region shapes

Slide Credit: Carlos Guestrin

slide by Dhruv Batra

Dist(xi, xj) = (xi1 − xj1)² + (xi2 − xj2)²

Dist(xi, xj) = (xi1 − xj1)² + (3xi2 − 3xj2)²
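A small sketch (made-up points, not from the slides) showing how the second metric above, which rescales the second coordinate by 3, can change which point is the nearest neighbor of a query:

```python
import numpy as np

a = np.array([1.0, 0.0])        # two candidate neighbors (hypothetical values)
b = np.array([0.0, 0.4])
q = np.array([0.0, 0.0])        # query point

def sq_dist(u, v, scale=(1.0, 1.0)):
    """Squared distance with per-dimension scaling factors."""
    s = np.asarray(scale)
    return np.sum((s * (u - v)) ** 2)

print(sq_dist(q, a), sq_dist(q, b))                   # 1.0, 0.16 -> b is closer
print(sq_dist(q, a, (1, 3)), sq_dist(q, b, (1, 3)))   # 1.0, 1.44 -> a is closer
```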

slide-17
SLIDE 17

Example: Choosing a restaurant

  • In everyday life we need to make decisions by taking into account lots of factors
  • The question is what weight we put on each of these factors (how important they are with respect to the others)

17

Reviews (out of 5 stars)   $    Distance   Cuisine (out of 10)
4                          30   21         7
2                          15   12         8
5                          27   53         9
3                          20   5          6

  • individuals’ preferences
  • ?

slide by Richard Zemel

slide-18
SLIDE 18

Euclidean distance metric

18

D(x, x’)² = Σi σi² (xi − x’i)²

where the σi scale each dimension. Or equivalently,

D(x, x’)² = (x − x’)ᵀ A (x − x’)

with A a diagonal matrix holding the σi² on its diagonal.

Slide Credit: Carlos Guestrin

slide by Dhruv Batra

slide-19
SLIDE 19

Notable distance metrics (and their level sets)

Scaled Euclidean (L2)
Mahalanobis (non-diagonal A)

19 Slide Credit: Carlos Guestrin

slide by Dhruv Batra

slide-20
SLIDE 20

Minkowski distance

20

Image Credit: By Waldir (Based on File:MinkowskiCircles.svg) 
 [CC BY-SA 3.0 (http://creativecommons.org/licenses/by-sa/3.0)], via Wikimedia Commons slide by Dhruv Batra

D = ( Σi=1..n |xi − yi|^p )^(1/p)
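A minimal sketch of this Minkowski distance; p = 1 gives the L1 norm, p = 2 the Euclidean distance, and large p approaches the max (L∞) norm shown on the next slide:

```python
import numpy as np

def minkowski(x, y, p=2):
    """Minkowski distance: (sum_i |x_i - y_i|^p)^(1/p)."""
    return np.sum(np.abs(np.asarray(x) - np.asarray(y)) ** p) ** (1.0 / p)

print(minkowski([0, 0], [3, 4], p=1))  # 7.0 (L1)
print(minkowski([0, 0], [3, 4], p=2))  # 5.0 (Euclidean)
```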

slide-21
SLIDE 21

Notable distance metrics (and their level sets)

Scaled Euclidean (L2)
L1 norm (absolute)
L∞ (max) norm

21 Slide Credit: Carlos Guestrin

slide by Dhruv Batra

slide-22
SLIDE 22

Kernel Regression/Classification

Four things make a memory based learner:

  • A distance metric

− Euclidean (and others)

  • How many nearby neighbors to look at?

− All of them

  • A weighting function (optional)

− wi = exp(−d(xi, query)² / σ²)
− Nearby points to the query are weighted strongly, far points weakly.

The σ parameter is the Kernel Width. Very important.

  • How to fit with the local points?

− Predict the weighted average of the outputs

predict = Σ wi yi / Σ wi

22 Slide Credit: Carlos Guestrin

slide by Dhruv Batra
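A minimal sketch that puts these four pieces together for 1-D kernel regression, assuming a Gaussian weighting function over all training points and a weighted average of their outputs (variable names are illustrative):

```python
import numpy as np

def kernel_regression(x_query, x_train, y_train, sigma=0.2):
    """Kernel regression prediction: weighted average of all training outputs,
    with Gaussian weights w_i = exp(-d(x_i, query)^2 / sigma^2)."""
    d2 = (np.asarray(x_train) - x_query) ** 2            # squared distances to the query
    w = np.exp(-d2 / sigma ** 2)                         # Gaussian weights (sigma = kernel width)
    return np.sum(w * np.asarray(y_train)) / np.sum(w)   # weighted average of outputs
```

Note that as σ grows very large the weights become nearly uniform and the prediction approaches the global average of the outputs, while as σ → 0 the closest point dominates and the predictor behaves like 1-NN; this is the intuition behind the kernel-width questions two slides below.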

slide-23
SLIDE 23

Weighting/Kernel functions

wi = exp(−d(xi, query)² / σ²)

(Our examples use Gaussian)

23 Slide Credit: Carlos Guestrin

slide by Dhruv Batra

slide-24
SLIDE 24

Effect of Kernel Width

  • What happens as σ → ∞?
  • What happens as σ → 0?

24 Image Credit: Ben Taskar

slide by Dhruv Batra

slide-25
SLIDE 25

Problems with Instance-Based Learning

  • Expensive

− No learning: most real work is done during testing
− For every test sample, must search through the whole dataset – very slow!
− Must use tricks like approximate nearest-neighbor search

  • Doesn’t work well with a large number of irrelevant features

− Distances overwhelmed by noisy features

  • Curse of Dimensionality

− Distances become meaningless in high dimensions

25

slide by Dhruv Batra

slide-26
SLIDE 26

Curse of Dimensionality

  • Consider applying a KNN classifier/regressor to data where the inputs are uniformly distributed in the D-dimensional unit cube.
  • Suppose we estimate the density of class labels around a test point x by “growing” a hyper-cube around x until it contains a desired fraction f of the data points.
  • The expected edge length of this cube will be eD(f) = f^(1/D).
  • If D = 10 and we want to base our estimate on 10% of the data, we have e10(0.1) ≈ 0.8, so we need to extend the cube 80% along each dimension around x; the neighborhood is no longer very local.
  • Even if we only use 1% of the data, we find e10(0.01) ≈ 0.63.

26

[Figure: edge length of the cube vs. fraction of data in the neighborhood, for d = 1, 3, 5, 7, 10]

slide by Kevin Murphy
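A quick check of the numbers above, computing the expected edge length eD(f) = f^(1/D):

```python
def edge_length(f, D):
    """Expected edge length of a hyper-cube containing a fraction f of
    uniformly distributed data in D dimensions: f**(1/D)."""
    return f ** (1.0 / D)

print(edge_length(0.10, 10))   # ~0.794 -> ~80% of each dimension
print(edge_length(0.01, 10))   # ~0.631
print(edge_length(0.10, 1))    # 0.1   -> in 1-D the neighborhood really is local
```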

slide-27
SLIDE 27

Next Lecture: Linear Regression

27