MLCC 2017 Local Methods and Bias Variance Trade-Off, Lorenzo Rosasco - PowerPoint PPT Presentation



SLIDE 1

MLCC 2017 Local Methods and Bias Variance Trade-Off

Lorenzo Rosasco UNIGE-MIT-IIT June 25, 2017

SLIDE 2

About this class

  • 1. Introduce a basic class of learning methods, namely local methods.
  • 2. Discuss the fundamental concept of bias-variance trade-off to understand parameter tuning (a.k.a. model selection).

MLCC 2017 2

SLIDE 3

Outline

Learning with Local Methods
From Bias-Variance to Cross-Validation


SLIDE 7

The problem

What is the price of one house given its area? Start from data...

Area (m²)   Price (€)
x1 = 62     y1 = 99,200
x2 = 64     y2 = 135,700
x3 = 65     y3 = 93,300
x4 = 66     y4 = 114,000
...         ...

[Figure: scatter plot of price vs. area]

Let S be the houses example dataset (n = 100): S = {(x1, y1), . . . , (xn, yn)}. Given a new point x∗ we want to predict y∗ by means of S.


SLIDE 10

Example

Let x∗ be a 300 m² house.

Area (m²)   Price (€)
...         ...
x93 = 255   y93 = 274,600
x94 = 264   y94 = 324,900
x95 = 310   y95 = 311,200
x96 = 480   y96 = 515,400
...         ...

[Figure: scatter plot of price vs. area]

What is its price?

SLIDE 11

Nearest Neighbors

Nearest Neighbor: y∗ is the value of the point in S closest to x∗. Here y∗ = 311,200.

Area (m²)   Price (€)
...         ...
x93 = 255   y93 = 274,600
x94 = 264   y94 = 324,900
x95 = 310   y95 = 311,200
x96 = 480   y96 = 515,400
...         ...

[Figure: scatter plot of price vs. area]



SLIDE 17

Nearest Neighbors

◮ S = {(xi, yi)}ⁿᵢ₌₁ with xi ∈ R^D, yi ∈ R
◮ x∗ ∈ R^D the new point
◮ y∗ the predicted output: y∗ = f̂(x∗) = yj, where j = arg min_{i=1,...,n} ‖x∗ − xi‖

Computational cost O(nD): we compute n distances ‖x∗ − xi‖, each costing O(D).

In general, let d : R^D × R^D → R be a distance on the input space; then f̂(x) = yj with j = arg min_{i=1,...,n} d(x, xi).
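The rule above fits in a few lines. A minimal NumPy sketch (the function name and the tiny dataset are illustrative choices, not part of the lecture; the toy prices echo the houses table):

```python
import numpy as np

def nearest_neighbor(X, Y, x_star):
    """1-NN prediction: return the output of the training point closest to x_star.

    X: (n, D) array of inputs, Y: (n,) array of outputs.
    Cost is O(nD): n Euclidean distances, each costing O(D).
    """
    dists = np.linalg.norm(X - x_star, axis=1)  # ||x_star - xi|| for each i
    j = np.argmin(dists)                        # index of the closest point
    return Y[j]

# Toy 1D example: areas in m^2 and prices
X = np.array([[62.0], [64.0], [255.0], [310.0], [480.0]])
Y = np.array([99200.0, 135700.0, 274600.0, 311200.0, 515400.0])
print(nearest_neighbor(X, Y, np.array([300.0])))  # closest area is 310 -> 311200.0
```

Swapping `np.linalg.norm` for any other distance d gives the general-metric version from the slide.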


SLIDE 19

Extensions

Nearest Neighbor takes y∗ to be the value of the point in S closest to x∗.

Area (m²)   Price (€)
...         ...
x93 = 255   y93 = 274,600
x94 = 264   y94 = 324,900
x95 = 310   y95 = 311,200
x96 = 480   y96 = 515,400
...         ...

[Figure: scatter plot of price vs. area]

Can we do better? (for example, using more points)

SLIDE 20

K-Nearest Neighbors

K-Nearest Neighbors: y∗ is the mean of the values of the K points in S closest to x∗. If K = 3 we have y∗ = (274,600 + 324,900 + 311,200)/3 ≈ 303,600.

Area (m²)   Price (€)
...         ...
x93 = 255   y93 = 274,600
x94 = 264   y94 = 324,900
x95 = 310   y95 = 311,200
x96 = 480   y96 = 515,400
...         ...

[Figure: scatter plot of price vs. area]



SLIDE 25

K-Nearest Neighbors

◮ S = {(xi, yi)}ⁿᵢ₌₁ with xi ∈ R^D, yi ∈ R
◮ x∗ ∈ R^D the new point
◮ Let K be an integer, K ≪ n
◮ j1, . . . , jK defined by j1 = arg min_{i∈{1,...,n}} ‖x∗ − xi‖ and jt = arg min_{i∈{1,...,n}\{j1,...,jt−1}} ‖x∗ − xi‖ for t ∈ {2, . . . , K}
◮ predicted output: y∗ = (1/K) Σ_{i∈{j1,...,jK}} yi


SLIDE 28

K-Nearest Neighbors (cont.)

f̂(x) = (1/K) Σᴷᵢ₌₁ y_{ji}

◮ Computational cost O(nD + n log n): compute the n distances ‖x − xi‖ for i ∈ {1, . . . , n} (each costs O(D)), then order them, O(n log n).
◮ General metric d: f̂ is the same, but j1, . . . , jK are defined by j1 = arg min_{i∈{1,...,n}} d(x, xi) and jt = arg min_{i∈{1,...,n}\{j1,...,jt−1}} d(x, xi) for t ∈ {2, . . . , K}.
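The selection of j1, . . . , jK by repeated arg min is just the first K entries of the distances sorted in increasing order, which matches the O(nD + n log n) cost stated above. A minimal sketch (function name and toy data are illustrative, not from the lecture):

```python
import numpy as np

def knn_predict(X, Y, x_star, K=3):
    """K-NN regression: average the outputs of the K training points
    closest to x_star.  O(nD) for the distances, O(n log n) to sort."""
    dists = np.linalg.norm(X - x_star, axis=1)  # ||x_star - xi|| for each i
    order = np.argsort(dists)                   # j1, ..., jn by increasing distance
    return Y[order[:K]].mean()                  # mean of the K nearest outputs

# Toy 1D example mirroring the slide: the 3 nearest areas to 300 are 310, 264, 255
X = np.array([[255.0], [264.0], [310.0], [480.0]])
Y = np.array([274600.0, 324900.0, 311200.0, 515400.0])
print(knn_predict(X, Y, np.array([300.0]), K=3))
```

With K = 1 this reduces to the nearest-neighbor rule; a partial sort (e.g. `np.argpartition`) would lower the sorting cost to O(n) in practice.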


SLIDE 32

Parzen Windows

K-NN puts equal weights on the values of the selected points. Can we generalize it? Points closer to x∗ should influence its value more.

PARZEN WINDOWS:

f̂(x) = ( Σⁿᵢ₌₁ yi k(x, xi) ) / ( Σⁿᵢ₌₁ k(x, xi) )

where k is a similarity function:

◮ k(x, x′) ≥ 0 for all x, x′ ∈ R^D
◮ k(x, x′) → 1 when ‖x − x′‖ → 0
◮ k(x, x′) → 0 when ‖x − x′‖ → ∞

SLIDE 33

Parzen Windows

Examples of k:

◮ k1(x, x′) = sign(1 − ‖x − x′‖/σ)₊, with σ > 0
◮ k2(x, x′) = (1 − ‖x − x′‖/σ)₊, with σ > 0
◮ k3(x, x′) = (1 − ‖x − x′‖²/σ²)₊, with σ > 0
◮ k4(x, x′) = e^(−‖x − x′‖²/(2σ²)), with σ > 0
◮ k5(x, x′) = e^(−‖x − x′‖/σ), with σ > 0
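A minimal sketch of the Parzen-window estimator using the Gaussian kernel k4 (the function name and the toy data are illustrative assumptions, not from the lecture):

```python
import numpy as np

def parzen_predict(X, Y, x_star, sigma=1.0):
    """Parzen-window regression with the Gaussian kernel k4:
    f(x) = sum_i yi k(x, xi) / sum_i k(x, xi).
    K-NN's hard 0/1 weights are replaced by smooth, distance-decaying ones."""
    d2 = np.sum((X - x_star) ** 2, axis=1)  # squared distances ||x_star - xi||^2
    w = np.exp(-d2 / (2.0 * sigma ** 2))    # kernel weights k4(x_star, xi)
    return np.sum(Y * w) / np.sum(w)        # weighted average of the outputs

# Toy 1D data: the prediction at x = 1 is pulled toward nearby points
X = np.array([[0.0], [1.0], [2.0]])
Y = np.array([0.0, 1.0, 4.0])
print(parzen_predict(X, Y, np.array([1.0]), sigma=0.5))
```

Here σ plays the role K plays for K-NN: a small σ concentrates the weight on the closest points, a large σ averages over many.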

SLIDE 34

K-NN example

K-Nearest Neighbors depends on K. When K = 1:

[Figure: K-NN fit for K = 1]

SLIDE 35

K-NN example

K-Nearest Neighbors depends on K. When K = 2:

[Figure: K-NN fit for K = 2]

SLIDE 36

K-NN example

K-Nearest Neighbors depends on K. When K = 3:

[Figure: K-NN fit for K = 3]

SLIDE 37

K-NN example

K-Nearest Neighbors depends on K. When K = 4:

[Figure: K-NN fit for K = 4]

SLIDE 38

K-NN example

K-Nearest Neighbors depends on K. When K = 5:

[Figure: K-NN fit for K = 5]

SLIDE 39

K-NN example

K-Nearest Neighbors depends on K. When K = 9:

[Figure: K-NN fit for K = 9]

SLIDE 40

K-NN example

K-Nearest Neighbors depends on K. When K = 15:

[Figure: K-NN fit for K = 15]

SLIDE 41

K-NN example

[Figure: K-NN fit]

Changing K, the result changes a lot! How do we select K?

SLIDE 42

Outline

Learning with Local Methods
From Bias-Variance to Cross-Validation


SLIDE 46

Optimal choice for the Hyper-parameters

◮ S = {(xi, yi)}ⁿᵢ₌₁ training set. Write Y = (y1, . . . , yn) and X = (x1⊤, . . . , xn⊤).
◮ K ∈ N hyperparameter of the learning algorithm
◮ f̂_{S,K} the learned function (depends on S and K)

The expected loss E_K is

E_K = E_S E_{x,y} (y − f̂_{S,K}(x))²

The optimal hyperparameter K∗ should minimize E_K:

K∗ = arg min_{K∈𝒦} E_K


SLIDE 49

Optimal choice for the Hyper-parameters (cont.)

The optimal hyperparameter K∗ should minimize E_K:

K∗ = arg min_{K∈𝒦} E_K

Ideally! (In practice we don't have access to the distribution.)

◮ We can still try to understand the above minimization problem: does a solution exist? What does it depend on?
◮ Yet, ultimately, we need something we can compute!

SLIDE 56

Example: regression problem

Define the pointwise expected loss E_K(x) = E_S E_{y|x} (y − f̂_{S,K}(x))². By definition E_K = E_x E_K(x).

Regression setting:

◮ Regression model: y = f∗(x) + δ
◮ E δ = 0, E δ² = σ²

Now

E_K(x) = E_S E_{y|x} (y − f̂_{S,K}(x))² = E_S E_{y|x} (f∗(x) + δ − f̂_{S,K}(x))²

that is,

E_K(x) = E_S (f∗(x) − f̂_{S,K}(x))² + σ² . . .
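The last step uses E δ = 0 and E δ² = σ²: expanding the square, and using that the test noise δ is independent of the training set S, the cross term vanishes. In LaTeX, a sketch of the expansion:

```latex
\mathbb{E}_S \mathbb{E}_{y|x}\,\big(f_*(x) + \delta - \hat f_{S,K}(x)\big)^2
 = \mathbb{E}_S\,\big(f_*(x) - \hat f_{S,K}(x)\big)^2
 + 2\,\mathbb{E}_S\big[f_*(x) - \hat f_{S,K}(x)\big]\,\underbrace{\mathbb{E}[\delta]}_{=\,0}
 + \underbrace{\mathbb{E}[\delta^2]}_{=\,\sigma^2}
```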


SLIDE 59

Bias Variance trade-off for K-NN

Define the noiseless K-NN (it is ideal!)

f̃_{S,K}(x) = (1/K) Σ_{l∈K_x} f∗(x_l)

Note that f̃_{S,K}(x) = E_{y|x} f̂_{S,K}(x). Consider

E_K(x) = (f∗(x) − E_X f̃_{S,K}(x))²  [bias]  +  E_S (f̃_{S,K}(x) − f̂_{S,K}(x))²  [variance]  +  σ²  . . .

SLIDE 60

Bias Variance trade-off for K-NN

Define the noiseless K-NN (it is ideal!)

f̃_{S,K}(x) = (1/K) Σ_{l∈K_x} f∗(x_l)

Note that f̃_{S,K}(x) = E_{y|x} f̂_{S,K}(x). Consider

E_K(x) = (f∗(x) − E_X f̃_{S,K}(x))²  [bias]  +  (1/K²) E_X Σ_{l∈K_x} E_{y_l|x_l} (y_l − f∗(x_l))²  [variance]  +  σ²  . . .

SLIDE 61

Bias Variance trade-off for K-NN

Define the noiseless K-NN (it is ideal!)

f̃_{S,K}(x) = (1/K) Σ_{l∈K_x} f∗(x_l)

Note that f̃_{S,K}(x) = E_{y|x} f̂_{S,K}(x). Consider

E_K(x) = (f∗(x) − E_X f̃_{S,K}(x))²  [bias]  +  σ²/K  [variance]  +  σ²
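The σ²/K variance term can be checked numerically: draw many training sets from a synthetic regression model, compute the K-NN prediction at a fixed point, and look at how the spread of the predictions shrinks as K grows. A Monte Carlo sketch (the model f∗ = sin, the noise level, and the chosen K values are arbitrary illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def knn_at(x_star, X, Y, K):
    """K-NN prediction at a single point x_star (1D inputs)."""
    order = np.argsort(np.abs(X - x_star))
    return Y[order[:K]].mean()

# Synthetic regression model y = f*(x) + delta with known f* and noise sigma
f_star = np.sin
sigma = 0.3
n, trials, x0 = 200, 300, 0.5

preds = {K: [] for K in (1, 5, 25)}
for _ in range(trials):                      # draw many training sets S
    X = rng.uniform(0, np.pi, n)
    Y = f_star(X) + sigma * rng.normal(size=n)
    for K in preds:
        preds[K].append(knn_at(x0, X, Y, K))

for K, p in preds.items():
    p = np.array(p)
    bias2 = (f_star(x0) - p.mean()) ** 2     # squared bias at x0
    var = p.var()                            # variance over training sets
    print(K, round(bias2, 4), round(var, 4), round(sigma**2 / K, 4))
```

The empirical variance tracks σ²/K (0.09, 0.018, 0.0036 here), while the bias grows with K as the average extends over farther points.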

SLIDE 62

Bias-Variance trade-off

[Figure: bias and variance as functions of K]


SLIDE 67

How to choose the hyper-parameters

The bias-variance trade-off is theoretical, but it shows that:

◮ an optimal parameter exists, and
◮ it depends on the noise and the unknown target function.

How to choose K in practice?

◮ Idea: train on some data and validate the parameter on new unseen data, as a proxy for the ideal case.


SLIDE 74

Hold-out Cross-validation

For each K:

  • 1. shuffle and split S into T (training) and V (validation)
  • 2. train the algorithm on T and compute the empirical loss on V:

    Ê_K = (1/|V|) Σ_{(x,y)∈V} (y − f̂_{T,K}(x))²

  • 3. select the K̂ that minimizes Ê_K.

The above procedure can be repeated to improve stability, with K selected to minimize the error over trials. There are other related parameter-selection methods (k-fold cross-validation, leave-one-out, ...).
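The three steps above can be sketched directly. A minimal hold-out implementation for 1D K-NN regression (the helper names `knn_predict` and `holdout_error`, the synthetic sin data, and the 50% split are illustrative assumptions):

```python
import numpy as np

def knn_predict(Xtr, Ytr, x, K):
    """K-NN regression prediction at a single 1D point x."""
    order = np.argsort(np.abs(Xtr - x))
    return Ytr[order[:K]].mean()

def holdout_error(X, Y, K, frac=0.5, rng=None):
    """Shuffle and split S into T (training) and V (validation),
    'train' K-NN on T, and return the empirical squared loss on V."""
    if rng is None:
        rng = np.random.default_rng()
    idx = rng.permutation(len(X))
    cut = int(frac * len(X))
    tr, va = idx[:cut], idx[cut:]
    preds = np.array([knn_predict(X[tr], Y[tr], x, K) for x in X[va]])
    return np.mean((Y[va] - preds) ** 2)

# Synthetic data; pick the K with the smallest validation error
rng = np.random.default_rng(0)
X = rng.uniform(0, np.pi, 200)
Y = np.sin(X) + 0.2 * rng.normal(size=200)
errors = {K: holdout_error(X, Y, K, rng=rng) for K in range(1, 21)}
K_hat = min(errors, key=errors.get)
print(K_hat)
```

Repeating the split several times and averaging the validation errors per K, as the slide suggests, makes the selected K̂ more stable.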


SLIDE 76

Training and Validation Error behavior

[Figure: training vs. validation error as a function of K; T = 50% of S, n = 200]

K̂ = 8.

SLIDE 77

Wrapping up

In this class we made our first encounter with learning algorithms (local methods) and the problem of tuning their parameters (via bias-variance trade-off and cross-validation) to avoid overfitting and achieve generalization.


SLIDE 78

Next Class

High Dimensions: Beyond local methods!
