MLCC 2017: Local Methods and Bias-Variance Trade-Off
Lorenzo Rosasco
UNIGE-MIT-IIT
June 25, 2017
About this class
1. Introduce a basic class of learning methods, namely local methods.
2. Discuss the fundamental concept of bias-variance trade-off to understand parameter tuning (a.k.a. model selection).
Outline
1. Learning with Local Methods
2. From Bias-Variance to Cross-Validation
The problem
What is the price of one house given its area? Start from data...

  Area (m²)     Price (€)
  x1 = 62       y1 = 99,200
  x2 = 64       y2 = 135,700
  x3 = 65       y3 = 93,300
  x4 = 66       y4 = 114,000
  ...           ...

[Figure: scatter plot of price (€, on the order of 10^5) versus area (m²) for the houses dataset.]

Let S be the houses example dataset (n = 100):
  S = {(x1, y1), . . . , (xn, yn)}
Given a new point x∗, we want to predict y∗ by means of S.
Example
Let x∗ be a 300 m² house.

  Area (m²)      Price (€)
  ...            ...
  x93 = 255      y93 = 274,600
  x94 = 264      y94 = 324,900
  x95 = 310      y95 = 311,200
  x96 = 480      y96 = 515,400
  ...            ...

[Figure: scatter plot of price versus area for the houses dataset.]

What is its price?
Nearest Neighbors
Nearest neighbor: y∗ is the value of the closest point to x∗ in S. Here the closest point is x95 = 310, so
  y∗ = 311,200
Nearest Neighbors
◮ S = {(x_i, y_i)}_{i=1}^n with x_i ∈ R^D, y_i ∈ R
◮ x∗ ∈ R^D the new point
◮ y∗ the predicted output, y∗ = f̂(x∗), where
    y∗ = y_j,   j = argmin_{i=1,...,n} ‖x∗ − x_i‖
Computational cost O(nD): we compute n times the distance ‖x∗ − x_i‖, each costing O(D).
In general, let d : R^D × R^D → R be a distance on the input space; then
    f̂(x) = y_j,   j = argmin_{i=1,...,n} d(x, x_i)
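As an illustration, here is a minimal NumPy sketch of the nearest-neighbor rule above; the function and variable names (nearest_neighbor, X, Y, x_new) are our own and not part of the course material.

import numpy as np

def nearest_neighbor(X, Y, x_new):
    # X: (n, D) training inputs, Y: (n,) training outputs, x_new: (D,) new point.
    # Compute the n Euclidean distances ||x_new - x_i||, each costing O(D).
    distances = np.linalg.norm(X - x_new, axis=1)
    j = np.argmin(distances)  # index of the closest training point
    return Y[j]

# Toy usage on the rows of the houses table (areas in m^2, prices in euros):
X = np.array([[255.0], [264.0], [310.0], [480.0]])
Y = np.array([274600.0, 324900.0, 311200.0, 515400.0])
print(nearest_neighbor(X, Y, np.array([300.0])))  # prints 311200.0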
Extensions
Nearest neighbor takes y∗ to be the value of the closest point to x∗ in S.
Can we do better? (For example, using more points.)
K-Nearest Neighbors
K-nearest neighbor: y∗ is the mean of the values of the K closest points to x∗ in S. If K = 3, the three closest points are x93, x94, x95, so
  y∗ = (274,600 + 324,900 + 311,200) / 3 ≈ 303,567
K-Nearest Neighbors
◮ S = {(x_i, y_i)}_{i=1}^n with x_i ∈ R^D, y_i ∈ R
◮ x∗ ∈ R^D the new point
◮ K an integer with K ≪ n
◮ j_1, . . . , j_K defined by
    j_1 = argmin_{i ∈ {1,...,n}} ‖x∗ − x_i‖
    j_t = argmin_{i ∈ {1,...,n} \ {j_1,...,j_{t−1}}} ‖x∗ − x_i‖   for t ∈ {2, . . . , K}
◮ predicted output
    y∗ = (1/K) Σ_{i ∈ {j_1,...,j_K}} y_i
K-Nearest Neighbors (cont.)
    f̂(x) = (1/K) Σ_{i=1}^K y_{j_i}
◮ Computational cost O(nD + n log n): compute the n distances ‖x − x_i‖ for i ∈ {1, . . . , n} (each costs O(D)), then sort them in O(n log n).
◮ General metric d: f̂ is the same, but j_1, . . . , j_K are defined by
    j_1 = argmin_{i ∈ {1,...,n}} d(x, x_i)
    j_t = argmin_{i ∈ {1,...,n} \ {j_1,...,j_{t−1}}} d(x, x_i)   for t ∈ {2, . . . , K}
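A minimal sketch of the K-NN rule above (again our own code; a real implementation would typically use spatial data structures such as k-d trees rather than sorting all distances):

import numpy as np

def knn_predict(X, Y, x_new, K):
    # Average the outputs of the K training points closest to x_new.
    distances = np.linalg.norm(X - x_new, axis=1)  # n distances, O(nD)
    neighbors = np.argsort(distances)[:K]          # sorting costs O(n log n)
    return Y[neighbors].mean()

# On the houses table with K = 3 this returns (274600 + 324900 + 311200) / 3.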
Parzen Windows
K-NN puts equal weight on the values of the selected points. Can we generalize it? Points closer to x∗ should influence its value more.

PARZEN WINDOWS:
    f̂(x) = ( Σ_{i=1}^n y_i k(x, x_i) ) / ( Σ_{i=1}^n k(x, x_i) )
where k is a similarity function:
◮ k(x, x′) ≥ 0 for all x, x′ ∈ R^D
◮ k(x, x′) → 1 when ‖x − x′‖ → 0
◮ k(x, x′) → 0 when ‖x − x′‖ → ∞
Parzen Windows
Examples of k (with σ > 0; (·)_+ denotes the positive part):
◮ k1(x, x′) = sign(1 − ‖x − x′‖/σ)_+
◮ k2(x, x′) = (1 − ‖x − x′‖/σ)_+
◮ k3(x, x′) = (1 − ‖x − x′‖²/σ²)_+
◮ k4(x, x′) = exp(−‖x − x′‖²/(2σ²))
◮ k5(x, x′) = exp(−‖x − x′‖/σ)
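For concreteness, a sketch (our code, not the course's) of the Parzen-window estimator with the Gaussian similarity k4 defined above:

import numpy as np

def parzen_predict(X, Y, x_new, sigma):
    # Similarity-weighted average of the training outputs, using
    # k4(x, x') = exp(-||x - x'||^2 / (2 sigma^2)).
    sq_dists = np.sum((X - x_new) ** 2, axis=1)
    k = np.exp(-sq_dists / (2.0 * sigma ** 2))
    # Note: if x_new is far from every training point the denominator can
    # underflow to zero, so a real implementation would guard against that.
    return np.dot(k, Y) / np.sum(k)

With a hard window such as k1, the same formula reduces to averaging the outputs of all training points within distance σ of x_new.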
K-NN example
K-nearest neighbor depends on K.
[Figures: the K-NN solution on a toy dataset for K = 1, 2, 3, 4, 5, 9, 15.]
Changing K the result changes a lot! How to select K?
Outline
1. Learning with Local Methods
2. From Bias-Variance to Cross-Validation
Optimal choice for the Hyper-parameters
◮ S = {(x_i, y_i)}_{i=1}^n training set. Write Y = (y_1, . . . , y_n) and X = (x_1^⊤, . . . , x_n^⊤).
◮ K ∈ N hyperparameter of the learning algorithm
◮ f̂_{S,K} learned function (depends on S and K)
The expected loss E_K is
    E_K = E_S E_{x,y} (y − f̂_{S,K}(x))²
The optimal hyperparameter K∗ should minimize E_K:
    K∗ = argmin_{K ∈ 𝒦} E_K
where 𝒦 is the set of candidate values.
Optimal choice for the Hyper-parameters (cont.)
The optimal hyperparameter K∗ should minimize E_K:
    K∗ = argmin_{K ∈ 𝒦} E_K
Ideally! (In practice we do not have access to the distribution.)
◮ We can still try to understand the above minimization problem: does a solution exist? What does it depend on?
◮ Yet, ultimately, we need something we can compute!
Example: regression problem
Define the pointwise expected loss
    E_K(x) = E_S E_{y|x} (y − f̂_{S,K}(x))²
By definition, E_K = E_x E_K(x).
Regression setting:
◮ Regression model y = f∗(x) + δ
◮ E δ = 0, E δ² = σ²
Now
    E_K(x) = E_S E_{y|x} (y − f̂_{S,K}(x))² = E_S E_{y|x} (f∗(x) + δ − f̂_{S,K}(x))²
that is,
    E_K(x) = E_S (f∗(x) − f̂_{S,K}(x))² + σ²
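The last step is worth spelling out. A short derivation of our own, in LaTeX, using only the assumptions E δ = 0, E δ² = σ², and the fact that the test noise δ is independent of the training set S:

\begin{aligned}
\mathcal{E}_K(x)
&= \mathbb{E}_S \mathbb{E}_{y|x}\big(f_*(x) + \delta - \hat f_{S,K}(x)\big)^2 \\
&= \mathbb{E}_S\big(f_*(x) - \hat f_{S,K}(x)\big)^2
   + 2\,\mathbb{E}_S\big(f_*(x) - \hat f_{S,K}(x)\big)\,\mathbb{E}[\delta]
   + \mathbb{E}[\delta^2] \\
&= \mathbb{E}_S\big(f_*(x) - \hat f_{S,K}(x)\big)^2 + \sigma^2 .
\end{aligned}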
Bias-Variance trade-off for K-NN
Define the noiseless K-NN (it is ideal!)
    f̃_{S,K}(x) = (1/K) Σ_{l ∈ K_x} f∗(x_l)
where K_x denotes the set of indices of the K nearest neighbors of x. Note that f̃_{S,K}(x) = E_{y|x} f̂_{S,K}(x).
Consider
    E_K(x) = (f∗(x) − E_X f̃_{S,K}(x))²   [bias]
           + E_S (f̃_{S,K}(x) − f̂_{S,K}(x))²   [variance]
           + σ²
The variance term can be rewritten as
    E_S (f̃_{S,K}(x) − f̂_{S,K}(x))² = (1/K²) E_X Σ_{l ∈ K_x} E_{y_l|x_l} (y_l − f∗(x_l))² = σ²/K
so that
    E_K(x) = (f∗(x) − E_X f̃_{S,K}(x))² + σ²/K + σ²
with the first term the bias, the second the variance, and σ² the irreducible noise.
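The σ²/K step follows from the independence of the noise across training points. A sketch of our own, in LaTeX, conditioning on the training inputs X so that the neighbor set K_x is fixed:

\begin{aligned}
\mathbb{E}_{Y|X}\big(\tilde f_{S,K}(x) - \hat f_{S,K}(x)\big)^2
&= \mathbb{E}_{Y|X}\Big(\tfrac{1}{K}\textstyle\sum_{l \in K_x}\big(f_*(x_l) - y_l\big)\Big)^2 \\
&= \tfrac{1}{K^2}\textstyle\sum_{l \in K_x}\mathbb{E}_{y_l|x_l}\big(y_l - f_*(x_l)\big)^2
 = \tfrac{1}{K^2}\,K\sigma^2 = \tfrac{\sigma^2}{K},
\end{aligned}

since the noise terms on distinct training points are independent and zero-mean, so all cross terms vanish; averaging over X then gives the variance term on the slide.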
Bias-Variance trade-off
[Figure: the bias and variance terms of the error as functions of K; the variance σ²/K decreases as K grows while the bias increases, giving the trade-off.]
How to choose the hyper-parameters
The bias-variance trade-off is theoretical, but it shows that:
◮ an optimal parameter exists, and
◮ it depends on the noise and on the unknown target function.
How to choose K in practice?
◮ Idea: train on some data and validate the parameter on new, unseen data, as a proxy for the ideal case.
Hold-out Cross-validation
For each K:
1. shuffle and split S into T (training) and V (validation)
2. train the algorithm on T and compute the empirical loss on V:
    Ê_K = (1/|V|) Σ_{(x,y) ∈ V} (y − f̂_{T,K}(x))²
3. select the K̂ that minimizes Ê_K.
The above procedure can be repeated to increase stability, with K selected to minimize the error over the trials.
There are other, related parameter selection methods (k-fold cross-validation, leave-one-out, ...).
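A minimal sketch of the procedure above for K-NN; the function name, the 50/50 split and the reuse of knn_predict from the earlier sketch are our own choices.

import numpy as np

def holdout_select_K(X, Y, K_values, val_fraction=0.5, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))             # 1. shuffle ...
    n_val = int(val_fraction * len(X))
    val, train = idx[:n_val], idx[n_val:]     # ... and split S into V and T
    errors = {}
    for K in K_values:
        preds = np.array([knn_predict(X[train], Y[train], x, K) for x in X[val]])
        errors[K] = np.mean((Y[val] - preds) ** 2)   # 2. empirical loss on V
    K_hat = min(errors, key=errors.get)              # 3. K_hat minimizing the loss
    return K_hat, errors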
Training and Validation Error behavior
[Figure: training and validation error as functions of K, with T = 50% of S and n = 200.]
The validation error is minimized at K̂ = 8.
Wrapping up
In this class we had our first encounter with learning algorithms (local methods) and with the problem of tuning their parameters (via the bias-variance trade-off and cross-validation) to avoid overfitting and achieve generalization.
Next Class
High Dimensions: Beyond local methods!