Lecture 3: Kernel Regression, Distance Metrics, Curse of Dimensionality, Linear Regression (PowerPoint PPT Presentation)


SLIDE 1

Lecture 3:

− Kernel Regression
− Distance Metrics
− Curse of Dimensionality
− Linear Regression

Aykut Erdem

October 2016, Hacettepe University

SLIDE 2

Administrative

  • Assignment 1 will be out on Wednesday.
  • It is due October 26 (i.e., in two weeks).
  • It includes:
− Pencil-and-paper derivations
− Implementing a kNN classifier
− numpy/Python code
  • Note: Lecture slides are not enough; you should also read the related book chapters!

SLIDE 3

Recall from last time… Nearest Neighbors

  • Very simple method
  • Retains all training data
− It can be slow at test time
− Finding nearest neighbors in high dimensions is slow
  • Metrics are very important
  • Good baseline

adopted from Fei-Fei Li & Andrej Karpathy & Justin Johnson

SLIDE 4

Classification

  • Input: X
− Real-valued, vectors over reals
− Discrete values (0, 1, 2, …)
− Other structures (e.g., strings, graphs, etc.)
  • Output: Y
− Discrete (0, 1, 2, …)

Examples: X = Document, Y = Topic (Sports / Science / News); X = Cell Image, Y = Diagnosis (Anemic cell / Healthy cell)

slide by Aarti Singh and Barnabas Poczos

SLIDE 5

Regression

  • Input: X
− Real-valued, vectors over reals
− Discrete values (0, 1, 2, …)
− Other structures (e.g., strings, graphs, etc.)
  • Output: Y
− Real-valued, vectors over reals

Example: Stock Market Prediction (X = Feb 01, Y = ?)

slide by Aarti Singh and Barnabas Poczos

SLIDE 6

What should I watch tonight?

slide by Sanja Fidler

SLIDE 7

What should I watch tonight?

slide by Sanja Fidler

SLIDE 8

What should I watch tonight?

slide by Sanja Fidler

SLIDE 9

Today

  • Kernel regression
− nonparametric
  • Distance metrics
  • Linear regression (more on Thursday)
− parametric
− simple model

SLIDE 10

Simple 1-D Regression

  • Circles are data points (i.e., training examples) that are given to us
  • The data points are uniform in x, but may be displaced in y:

t(x) = f(x) + ε

with ε some noise

  • In green is the “true” curve that we don’t know

slide by Sanja Fidler

SLIDE 11

Kernel Regression

SLIDE 12

K-NN for Regression

  • Given: Training data {(y1, z1), …, (yn, zn)}
− Attribute vectors: yj ∈ Y
− Target attribute: zj ∈ ℜ
  • Parameters:
− Similarity function: L : Y × Y → ℜ
− Number of nearest neighbors to consider: k
  • Prediction rule
− New example y′
− K-nearest neighbors: the k training examples with the largest L(yj, y′)
− Predict the average target of the k nearest neighbors:

h(y′) = (1/k) Σj∈knn(y′) zj

slide by Thorsten Joachims
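The prediction rule above (average the targets of the k most similar training points) can be sketched in numpy. This is an illustrative sketch, not code from the slides: the function name and toy data are mine, and plain Euclidean distance stands in for the similarity function L.

```python
import numpy as np

def knn_regress(X_train, z_train, x_query, k=3):
    """Predict the target at x_query as the mean of the targets of the
    k nearest training points (Euclidean distance as the metric)."""
    dists = np.linalg.norm(X_train - x_query, axis=1)  # distance to every training point
    nn = np.argsort(dists)[:k]                         # indices of the k nearest neighbors
    return z_train[nn].mean()                          # average their targets

# toy 1-D data on a grid, z = 2x
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
z = np.array([0.0, 2.0, 4.0, 6.0, 8.0])
print(knn_regress(X, z, np.array([2.1]), k=3))  # neighbors x = 2, 3, 1 -> (4 + 6 + 2) / 3 = 4.0
```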

SLIDE 13

1-NN for Regression

[Figure: 1-D data; in each region of x, the prediction is the y-value of the closest datapoint]

Figure Credit: Carlos Guestrin
slide by Dhruv Batra

SLIDE 14

1-NN for Regression

  • Often bumpy (overfits)

Figure Credit: Andrew Moore
slide by Dhruv Batra
SLIDE 15

9-NN for Regression

  • Often bumpy (overfits)

Figure Credit: Andrew Moore
slide by Dhruv Batra

SLIDE 16

Multivariate distance metrics

  • Suppose the input vectors x1, x2, …, xN are two-dimensional:

x1 = (x11, x12), x2 = (x21, x22), …, xN = (xN1, xN2).

  • One can draw the nearest-neighbor regions in input space.
  • The relative scalings in the distance metric affect region shapes:

Dist(xi, xj) = (xi1 − xj1)² + (xi2 − xj2)²
Dist(xi, xj) = (xi1 − xj1)² + (3xi2 − 3xj2)²

Slide Credit: Carlos Guestrin
slide by Dhruv Batra
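The scaling effect is easy to verify numerically: multiplying the second coordinate by 3 multiplies its contribution to the squared distance by 9. A small sketch (the function name and example points are mine):

```python
def dist_sq(xi, xj, scale=1.0):
    """Squared distance with the second coordinate rescaled:
    (xi1 - xj1)^2 + (scale*xi2 - scale*xj2)^2"""
    return (xi[0] - xj[0]) ** 2 + (scale * xi[1] - scale * xj[1]) ** 2

a, b = (0.0, 1.0), (1.0, 0.0)
print(dist_sq(a, b))           # 2.0: both coordinates contribute equally
print(dist_sq(a, b, scale=3))  # 10.0: the second coordinate now dominates
```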

SLIDE 17

Example: Choosing a restaurant

  • In everyday life we need to make decisions by taking into account lots of factors
  • The question is what weight we put on each of these factors (how important are they with respect to the others)

Reviews (out of 5 stars) | $ | Distance | Cuisine (out of 10)
4 | 30 | 21 | 7
2 | 15 | 12 | 8
5 | 27 | 53 | 9
3 | 20 | 5 | 6

  • individuals’ preferences
  • ?

slide by Richard Zemel
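One way to make the slide's question concrete is a linear weighted score over the table's four factors. The weights below are purely hypothetical preferences invented for illustration (negative for cost-like factors such as price and distance):

```python
# Each row: (reviews out of 5, price $, distance, cuisine out of 10), from the slide's table
restaurants = [(4, 30, 21, 7), (2, 15, 12, 8), (5, 27, 53, 9), (3, 20, 5, 6)]

# Hypothetical preference weights: reviews and cuisine add to the score,
# price and distance subtract from it
weights = (2.0, -0.1, -0.05, 1.0)

def score(r):
    return sum(w * v for w, v in zip(weights, r))

best = max(restaurants, key=score)
print(best)  # under these weights the third restaurant wins
```

A different diner, with different weights, would rank the same table differently, which is exactly the point of the slide.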

SLIDE 18

Euclidean distance metric

D(x, x′) = sqrt( Σi σi² (xi − x′i)² )

Or equivalently,

D(x, x′) = sqrt( (x − x′)ᵀ A (x − x′) )

where A is a diagonal matrix with entries σi².

Slide Credit: Carlos Guestrin
slide by Dhruv Batra

SLIDE 19

Notable distance metrics (and their level sets)

  • Scaled Euclidean (L2)
  • Mahalanobis (non-diagonal A)

Slide Credit: Carlos Guestrin
slide by Dhruv Batra

SLIDE 20

Minkowski distance

D = ( Σi=1..n |xi − yi|^p )^(1/p)

Image Credit: By Waldir (Based on File:MinkowskiCircles.svg) [CC BY-SA 3.0 (http://creativecommons.org/licenses/by-sa/3.0)], via Wikimedia Commons
slide by Dhruv Batra
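The Minkowski distance is easy to check numerically for the familiar special cases p = 1 (L1/Manhattan) and p = 2 (Euclidean); this sketch and its toy points are mine:

```python
import numpy as np

def minkowski(x, y, p):
    """Minkowski distance: D = (sum_i |x_i - y_i|^p)^(1/p)."""
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

x = np.array([0.0, 0.0])
y = np.array([3.0, 4.0])
print(minkowski(x, y, 1))  # p=1, Manhattan: 3 + 4 = 7.0
print(minkowski(x, y, 2))  # p=2, Euclidean: sqrt(9 + 16) = 5.0
```

As p grows, the result approaches max_i |x_i − y_i|, the Linf (max) norm of the next slide.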

SLIDE 21

Notable distance metrics (and their level sets)

  • Scaled Euclidean (L2)
  • L1 norm (absolute)
  • Linf (max) norm

Slide Credit: Carlos Guestrin
slide by Dhruv Batra

SLIDE 22

Parametric vs Non-parametric Models

  • Does the capacity (size of the hypothesis class) grow with the size of the training data?
− Yes = Non-parametric models
− No = Parametric models
  • Example
− http://www.theparticle.com/applets/ml/nearest_neighbor/

SLIDE 23

Weighted K-NN for Regression

  • Given: Training data {(y1, z1), …, (yn, zn)}
− Attribute vectors: yj ∈ Y
− Target attribute: zj ∈ ℜ
  • Parameters:
− Similarity function: L : Y × Y → ℜ
− Number of nearest neighbors to consider: k
  • Prediction rule
− New example y′
− K-nearest neighbors: the k training examples with the largest L(yj, y′)
− Predict the similarity-weighted average of their targets:

h(y′) = Σj∈knn(y′) L(yj, y′) zj / Σj∈knn(y′) L(yj, y′)

slide by Thorsten Joachims

SLIDE 24

Kernel Regression/Classification

Four things make a memory-based learner:

  • A distance metric
− Euclidean (and others)
  • How many nearby neighbors to look at?
− All of them
  • A weighting function (optional)
− wi = exp(−d(xi, query)² / σ²)
− Nearby points to the query are weighted strongly, far points weakly. The σ parameter is the kernel width. Very important.
  • How to fit with the local points?
− Predict the weighted average of the outputs:

predict = Σ wi yi / Σ wi

Slide Credit: Carlos Guestrin
slide by Dhruv Batra
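The four ingredients above combine into a one-function sketch of kernel regression: use all training points, weight them with the Gaussian kernel, and predict the weighted average. The function name and toy data below are mine:

```python
import numpy as np

def kernel_regress(x_train, y_train, x_query, sigma=1.0):
    """Gaussian-kernel regression on 1-D inputs:
    w_i = exp(-d(x_i, query)^2 / sigma^2), predict sum(w_i y_i) / sum(w_i)."""
    w = np.exp(-((x_train - x_query) ** 2) / sigma ** 2)  # weight for every training point
    return np.sum(w * y_train) / np.sum(w)                # weighted average of outputs

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.0, 1.0, 2.0, 3.0])
# the query sits midway between x=1 and x=2; by symmetry the weights pair up
print(kernel_regress(x, y, 1.5, sigma=0.5))  # 1.5
```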

SLIDE 25

Weighting/Kernel functions

wi = exp(−d(xi, query)² / σ²)

(Our examples use Gaussian)

Slide Credit: Carlos Guestrin
slide by Dhruv Batra

SLIDE 26

Effect of Kernel Width

  • What happens as σ → ∞?
  • What happens as σ → 0?

Image Credit: Ben Taskar
slide by Dhruv Batra
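The slide's two questions can be answered empirically with a Gaussian-weighted predictor (re-defined here so the sketch is self-contained; the names and toy data are mine): as σ → ∞ every training point gets equal weight and the prediction tends to the global mean of the targets, while as σ → 0 only the single nearest neighbor keeps any weight.

```python
import numpy as np

def kernel_regress(x_train, y_train, x_query, sigma):
    w = np.exp(-((x_train - x_query) ** 2) / sigma ** 2)
    return np.sum(w * y_train) / np.sum(w)

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.0, 10.0, 20.0, 30.0])

# sigma very large: all weights approach 1, so the prediction
# approaches the global mean of y (15.0), regardless of the query
print(kernel_regress(x, y, 0.2, sigma=1e6))

# sigma very small: only the nearest neighbor (x = 0, y = 0) keeps
# non-negligible weight, so the prediction collapses to 1-NN
print(kernel_regress(x, y, 0.2, sigma=1e-2))
```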

SLIDE 27

Problems with Instance-Based Learning

  • Expensive
− No learning: most real work is done during testing
− For every test sample, must search through the whole dataset (very slow!)
− Must use tricks like approximate nearest neighbour search
  • Doesn’t work well when there are a large number of irrelevant features
− Distances overwhelmed by noisy features
  • Curse of Dimensionality
− Distances become meaningless in high dimensions

slide by Dhruv Batra


SLIDE 30

Curse of Dimensionality

  • Consider applying a KNN classifier/regressor to data where the inputs are uniformly distributed in the D-dimensional unit cube.
  • Suppose we estimate the density of class labels around a test point x by “growing” a hyper-cube around x until it contains a desired fraction f of the data points.
  • The expected edge length of this cube will be eD(f) = f^(1/D).
  • If D = 10, and we want to base our estimate on 10% of the data, we have e10(0.1) = 0.8, so we need to extend the cube 80% along each dimension around x.
  • Even if we only use 1% of the data, we find e10(0.01) = 0.63, so the neighborhood is no longer very local.

[Figure: edge length of the cube vs. fraction of data in the neighborhood, for d = 1, 3, 5, 7, 10]

slide by Kevin Murphy
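The slide's numbers follow directly from eD(f) = f^(1/D); a quick check (the helper name is mine):

```python
def edge_length(f, D):
    """Expected edge length of the hyper-cube that captures a fraction f
    of uniformly distributed data in D dimensions: e_D(f) = f^(1/D)."""
    return f ** (1.0 / D)

print(round(edge_length(0.10, 10), 2))  # 0.79 (about 80% of each dimension)
print(round(edge_length(0.01, 10), 2))  # 0.63
print(round(edge_length(0.10, 1), 2))   # 0.1 (in 1-D the neighborhood stays local)
```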

SLIDE 31

Linear Regression

SLIDE 32

Simple 1-D Regression

  • Circles are data points (i.e., training examples) that are given to us
  • The data points are uniform in x, but may be displaced in y:

t(x) = f(x) + ε

with ε some noise

  • In green is the “true” curve that we don’t know
  • Goal: We want to fit a curve to these points

slide by Sanja Fidler
SLIDE 33

Simple 1-D Regression

  • Key Questions:
− How do we parametrize the model (the curve)?
− What loss (objective) function should we use to judge fit?
− How do we optimize fit to unseen test data (generalization)?

slide by Sanja Fidler

SLIDE 34

Example: Boston House Prices

  • Estimate the median house price in a neighborhood based on neighborhood statistics
  • Look at the first (of 13) attributes: per capita crime rate
  • Use this to predict house prices in other neighborhoods
  • Is this a good input (attribute) to predict house prices?

https://archive.ics.uci.edu/ml/datasets/Housing

slide by Sanja Fidler

SLIDE 35

Represent the data

  • Data described as pairs D = {(x(1), t(1)), (x(2), t(2)), …, (x(N), t(N))}
− x is the input feature (per capita crime rate)
− t is the target output (median house price)
− (i) simply indexes the training examples (we have N in this case)
  • Here t is continuous, so this is a regression problem
  • Model outputs y, an estimate of t:

y(x) = w0 + w1x

  • What type of model did we choose?
  • Divide the dataset into training and testing examples
− Use the training examples to construct a hypothesis, or function approximator, that maps x to predicted y
− Evaluate the hypothesis on the test set

slide by Sanja Fidler

SLIDE 36

Noise

  • A simple model typically does not exactly fit the data; the lack of fit can be considered noise
  • Sources of noise:
− Imprecision in data attributes (input noise, e.g. noise in per-capita crime)
− Errors in data targets (mislabeling, e.g. noise in house prices)
− Additional attributes not captured by the data attributes that affect target values (latent variables). In the example, what else could affect house prices?
− Model may be too simple to account for data targets

slide by Sanja Fidler

SLIDE 37

Least-Squares Regression

y(x) = function(x, w)

slide by Sanja Fidler

SLIDE 38

Least-Squares Regression

  • Define a model. Linear: y(x) = function(x, w)
  • Standard loss/cost/objective function measures the squared error between y and the true value t
  • For a particular hypothesis (y(x) defined by a choice of w, drawn in red), what does the loss represent geometrically?

slide by Sanja Fidler


SLIDE 40

Least-Squares Regression

  • Define a model. Linear: y(x) = w0 + w1x
  • Standard loss/cost/objective function measures the squared error between y and the true value t:

ℓ(w) = Σn=1..N [t(n) − y(x(n))]²

  • For a particular hypothesis (y(x) defined by a choice of w, drawn in red), what does the loss represent geometrically?

slide by Sanja Fidler

SLIDE 41

Least-Squares Regression

  • Define a model. Linear: y(x) = w0 + w1x
  • Standard loss/cost/objective function measures the squared error between y and the true value t. Linear model:

ℓ(w) = Σn=1..N [t(n) − (w0 + w1x(n))]²

  • For a particular hypothesis (y(x) defined by a choice of w, drawn in red), what does the loss represent geometrically?

slide by Sanja Fidler


SLIDE 43

Least-Squares Regression

  • Define a model. Linear: y(x) = w0 + w1x
  • Standard loss/cost/objective function measures the squared error between y and the true value t. Linear model:

ℓ(w) = Σn=1..N [t(n) − (w0 + w1x(n))]²

  • The loss for the red hypothesis is the sum of the squared vertical errors (squared lengths of green vertical lines)

slide by Sanja Fidler

SLIDE 44

Least-Squares Regression

  • Define a model. Linear: y(x) = w0 + w1x
  • Standard loss/cost/objective function measures the squared error between y and the true value t. Linear model:

ℓ(w) = Σn=1..N [t(n) − (w0 + w1x(n))]²

  • How do we obtain the weights w = (w0, w1)?

slide by Sanja Fidler

SLIDE 45

Least-Squares Regression

  • Define a model. Linear: y(x) = w0 + w1x
  • Standard loss/cost/objective function measures the squared error between y and the true value t. Linear model:

ℓ(w) = Σn=1..N [t(n) − (w0 + w1x(n))]²

  • How do we obtain the weights w = (w0, w1)? Find w that minimizes the loss ℓ(w)

slide by Sanja Fidler
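Minimizing ℓ(w) is the subject of the next lecture, but as a preview, the minimizer for the linear model has a standard closed form (the normal equations). A sketch; the function name and toy data are mine:

```python
import numpy as np

def fit_least_squares(x, t):
    """Find w = (w0, w1) minimizing sum_n (t(n) - (w0 + w1 x(n)))^2
    using the closed-form least-squares solution."""
    X = np.stack([np.ones_like(x), x], axis=1)  # design matrix with columns [1, x]
    w, *_ = np.linalg.lstsq(X, t, rcond=None)   # least-squares solve of X w = t
    return w  # (w0, w1)

# noise-free line t = 1 + 2x: the fit should recover the coefficients exactly
x = np.array([0.0, 1.0, 2.0, 3.0])
t = 1.0 + 2.0 * x
w0, w1 = fit_least_squares(x, t)
print(round(w0, 6), round(w1, 6))  # 1.0 2.0
```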

SLIDE 46

Next Lecture:

Least Squares Optimization, Model Complexity, and Regularization