SLIDE 1

MLCC 2019 Local Methods and Bias Variance Trade-Off

Lorenzo Rosasco UNIGE-MIT-IIT

SLIDE 2

About this class

  • 1. Introduce a basic class of learning methods, namely local methods.
  • 2. Discuss the fundamental concept of bias-variance trade-off to understand parameter tuning (a.k.a. model selection).

SLIDE 3

Outline

Learning with Local Methods
From Bias-Variance to Cross-Validation

SLIDE 4

The problem

What is the price of a house, given its area?

SLIDE 5

The problem

What is the price of a house, given its area? Start from the data...

SLIDE 6

The problem

What is the price of a house, given its area? Start from the data...

Area (m²)    Price (€)
x1 = 62      y1 = 99,200
x2 = 64      y2 = 135,700
x3 = 65      y3 = 93,300
x4 = 66      y4 = 114,000
...          ...

[Scatter plot: price (×10⁵ €) vs. area (m²)]

Let S be the house dataset (n = 100): S = {(x_1, y_1), …, (x_n, y_n)}

SLIDE 7

The problem

What is the price of a house, given its area? Start from the data...

Area (m²)    Price (€)
x1 = 62      y1 = 99,200
x2 = 64      y2 = 135,700
x3 = 65      y3 = 93,300
x4 = 66      y4 = 114,000
...          ...

[Scatter plot: price (×10⁵ €) vs. area (m²)]

Let S be the house dataset (n = 100): S = {(x_1, y_1), …, (x_n, y_n)}. Given a new point x∗, we want to predict y∗ by means of S.

SLIDE 8

Example

Let x∗ be a 300 m² house.

SLIDE 9

Example

Let x∗ be a 300 m² house.

Area (m²)    Price (€)
...          ...
x93 = 255    y93 = 274,600
x94 = 264    y94 = 324,900
x95 = 310    y95 = 311,200
x96 = 480    y96 = 515,400
...          ...

[Scatter plot: price (×10⁵ €) vs. area (m²)]

What is its price?

SLIDE 10

Nearest Neighbors

Nearest Neighbor: y∗ is the value of the point closest to x∗ in S. Here y∗ = 311,200.

Area (m²)    Price (€)
...          ...
x93 = 255    y93 = 274,600
x94 = 264    y94 = 324,900
x95 = 310    y95 = 311,200
x96 = 480    y96 = 515,400
...          ...

[Scatter plot: price (×10⁵ €) vs. area (m²)]

SLIDE 11

Nearest Neighbors

Nearest Neighbor: y∗ is the value of the point closest to x∗ in S. Here y∗ = 311,200.

Area (m²)    Price (€)
...          ...
x93 = 255    y93 = 274,600
x94 = 264    y94 = 324,900
x95 = 310    y95 = 311,200
x96 = 480    y96 = 515,400
...          ...

[Scatter plot: price (×10⁵ €) vs. area (m²)]

SLIDE 12

Nearest Neighbors

◮ S = {(x_i, y_i)}_{i=1}^n with x_i ∈ R^D, y_i ∈ R

SLIDE 13

Nearest Neighbors

◮ S = {(x_i, y_i)}_{i=1}^n with x_i ∈ R^D, y_i ∈ R
◮ x∗ ∈ R^D the new point

SLIDE 14

Nearest Neighbors

◮ S = {(x_i, y_i)}_{i=1}^n with x_i ∈ R^D, y_i ∈ R
◮ x∗ ∈ R^D the new point
◮ y∗ = f̂(x∗) the predicted output, where

    y∗ = y_j,   j = argmin_{i=1,…,n} ‖x∗ − x_i‖

SLIDE 15

Nearest Neighbors

◮ S = {(x_i, y_i)}_{i=1}^n with x_i ∈ R^D, y_i ∈ R
◮ x∗ ∈ R^D the new point
◮ y∗ = f̂(x∗) the predicted output, where

    y∗ = y_j,   j = argmin_{i=1,…,n} ‖x∗ − x_i‖

Computational cost O(nD): we compute the n distances ‖x∗ − x_i‖, each costing O(D).

SLIDE 16

Nearest Neighbors

◮ S = {(x_i, y_i)}_{i=1}^n with x_i ∈ R^D, y_i ∈ R
◮ x∗ ∈ R^D the new point
◮ y∗ = f̂(x∗) the predicted output, where

    y∗ = y_j,   j = argmin_{i=1,…,n} ‖x∗ − x_i‖

Computational cost O(nD): we compute the n distances ‖x∗ − x_i‖, each costing O(D).

In general, let d : R^D × R^D → R be a distance on the input space; then

    f̂(x) = y_j,   j = argmin_{i=1,…,n} d(x, x_i)
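To make the rule concrete, here is a minimal sketch in Python/NumPy (a sketch only: the function name nn_predict and the toy data are illustrative, not part of the course material):

    import numpy as np

    def nn_predict(X, Y, x_star):
        """Nearest-neighbor prediction: y_j with j = argmin_i ||x_star - x_i||."""
        dists = np.linalg.norm(X - x_star, axis=1)  # n distances, each O(D)
        j = np.argmin(dists)                        # index of the closest point
        return Y[j]

    # Toy usage on the house example: areas (m^2) and prices (euros).
    X = np.array([[255.0], [264.0], [310.0], [480.0]])
    Y = np.array([274_600, 324_900, 311_200, 515_400])
    print(nn_predict(X, Y, np.array([300.0])))      # -> 311200 (the 310 m^2 house)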

SLIDE 17

Extensions

Nearest Neighbor sets y∗ to the value of the point closest to x∗ in S.

Area (m²)    Price (€)
...          ...
x93 = 255    y93 = 274,600
x94 = 264    y94 = 324,900
x95 = 310    y95 = 311,200
x96 = 480    y96 = 515,400
...          ...

[Scatter plot: price (×10⁵ €) vs. area (m²)]

SLIDE 18

Extensions

Nearest Neighbor sets y∗ to the value of the point closest to x∗ in S.

Area (m²)    Price (€)
...          ...
x93 = 255    y93 = 274,600
x94 = 264    y94 = 324,900
x95 = 310    y95 = 311,200
x96 = 480    y96 = 515,400
...          ...

[Scatter plot: price (×10⁵ €) vs. area (m²)]

Can we do better? (For example, using more points.)

SLIDE 19

K-Nearest Neighbors

K-Nearest Neighbors: y∗ is the mean of the values of the K points closest to x∗ in S. If K = 3 we have

    y∗ = (274,600 + 324,900 + 311,200) / 3 ≈ 303,567

Area (m²)    Price (€)
...          ...
x93 = 255    y93 = 274,600
x94 = 264    y94 = 324,900
x95 = 310    y95 = 311,200
x96 = 480    y96 = 515,400
...          ...

[Scatter plot: price (×10⁵ €) vs. area (m²)]

SLIDE 20

K-Nearest Neighbors

K-Nearest Neighbors: y∗ is the mean of the values of the K points closest to x∗ in S. If K = 3 we have

    y∗ = (274,600 + 324,900 + 311,200) / 3 ≈ 303,567

Area (m²)    Price (€)
...          ...
x93 = 255    y93 = 274,600
x94 = 264    y94 = 324,900
x95 = 310    y95 = 311,200
x96 = 480    y96 = 515,400
...          ...

[Scatter plot: price (×10⁵ €) vs. area (m²)]

SLIDE 21

K-Nearest Neighbors

◮ S = {(x_i, y_i)}_{i=1}^n with x_i ∈ R^D, y_i ∈ R
◮ x∗ ∈ R^D the new point

SLIDE 22

K-Nearest Neighbors

◮ S = {(x_i, y_i)}_{i=1}^n with x_i ∈ R^D, y_i ∈ R
◮ x∗ ∈ R^D the new point
◮ let K be an integer with K ≪ n

SLIDE 23

K-Nearest Neighbors

◮ S = {(x_i, y_i)}_{i=1}^n with x_i ∈ R^D, y_i ∈ R
◮ x∗ ∈ R^D the new point
◮ let K be an integer with K ≪ n
◮ j_1, …, j_K defined by j_1 = argmin_{i∈{1,…,n}} ‖x∗ − x_i‖ and j_t = argmin_{i∈{1,…,n}∖{j_1,…,j_{t−1}}} ‖x∗ − x_i‖ for t ∈ {2, …, K}

SLIDE 24

K-Nearest Neighbors

◮ S = {(x_i, y_i)}_{i=1}^n with x_i ∈ R^D, y_i ∈ R
◮ x∗ ∈ R^D the new point
◮ let K be an integer with K ≪ n
◮ j_1, …, j_K defined by j_1 = argmin_{i∈{1,…,n}} ‖x∗ − x_i‖ and j_t = argmin_{i∈{1,…,n}∖{j_1,…,j_{t−1}}} ‖x∗ − x_i‖ for t ∈ {2, …, K}
◮ predicted output

    y∗ = (1/K) Σ_{i∈{j_1,…,j_K}} y_i
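In code, a sketch of the K-NN rule (again with illustrative names; np.argpartition selects the K smallest distances without fully sorting them):

    import numpy as np

    def knn_predict(X, Y, x_star, K):
        """K-NN prediction: mean of the outputs of the K points closest to x_star."""
        dists = np.linalg.norm(X - x_star, axis=1)   # n distances, O(nD)
        nearest = np.argpartition(dists, K - 1)[:K]  # indices j_1, ..., j_K
        return Y[nearest].mean()

    X = np.array([[255.0], [264.0], [310.0], [480.0]])
    Y = np.array([274_600, 324_900, 311_200, 515_400])
    print(knn_predict(X, Y, np.array([300.0]), K=3))  # mean of the 3 closest prices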

SLIDE 25

K-Nearest Neighbors (cont.)

f̂(x) = (1/K) Σ_{i=1}^K y_{j_i}

SLIDE 26

K-Nearest Neighbors (cont.)

f̂(x) = (1/K) Σ_{i=1}^K y_{j_i}

◮ Computational cost O(nD + n log n): compute the n distances ‖x∗ − x_i‖ for i ∈ {1, …, n} (each costs O(D)), then sort them in O(n log n).

SLIDE 27

K-Nearest Neighbors (cont.)

f̂(x) = (1/K) Σ_{i=1}^K y_{j_i}

◮ Computational cost O(nD + n log n): compute the n distances ‖x∗ − x_i‖ for i ∈ {1, …, n} (each costs O(D)), then sort them in O(n log n).
◮ General metric d: f̂ is the same, but j_1, …, j_K are defined by j_1 = argmin_{i∈{1,…,n}} d(x, x_i) and j_t = argmin_{i∈{1,…,n}∖{j_1,…,j_{t−1}}} d(x, x_i) for t ∈ {2, …, K}
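A variant with a user-supplied metric, matching the general definition above (a sketch; the Manhattan distance below is just an example choice, not one prescribed by the course):

    import numpy as np

    def knn_predict_metric(X, Y, x_star, K, d=None):
        """K-NN with a generic distance d(X, x); defaults to the Manhattan distance."""
        if d is None:
            d = lambda A, x: np.sum(np.abs(A - x), axis=1)
        dists = d(X, x_star)
        nearest = np.argpartition(dists, K - 1)[:K]
        return Y[nearest].mean()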

SLIDE 28

Parzen Windows

K-NN puts equal weights on the values of the selected points.

SLIDE 29

Parzen Windows

K-NN puts equal weights on the values of the selected points. Can we generalize it?

SLIDE 30

Parzen Windows

K-NN puts equal weights on the values of the selected points. Can we generalize it? Points closer to x∗ should influence its value more.

SLIDE 31

Parzen Windows

K-NN puts equal weights on the values of the selected points. Can we generalize it? Points closer to x∗ should influence its value more.

PARZEN WINDOWS:

    f̂(x) = ( Σ_{i=1}^n y_i k(x, x_i) ) / ( Σ_{i=1}^n k(x, x_i) )

where k is a similarity function:
◮ k(x, x′) ≥ 0 for all x, x′ ∈ R^D
◮ k(x, x′) → 1 when ‖x − x′‖ → 0
◮ k(x, x′) → 0 when ‖x − x′‖ → ∞

SLIDE 32

Parzen Windows

Examples of k:
◮ k1(x, x′) = sign(1 − ‖x − x′‖/σ)_+, with σ > 0
◮ k2(x, x′) = (1 − ‖x − x′‖/σ)_+, with σ > 0
◮ k3(x, x′) = (1 − ‖x − x′‖²/σ²)_+, with σ > 0
◮ k4(x, x′) = e^{−‖x − x′‖²/2σ²}, with σ > 0
◮ k5(x, x′) = e^{−‖x − x′‖/σ}, with σ > 0
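A sketch of the Parzen-window estimator using the Gaussian similarity k4 (names are illustrative; with compactly supported kernels such as k1–k3 one should also guard against an all-zero denominator):

    import numpy as np

    def parzen_predict(X, Y, x_star, sigma):
        """Parzen-window prediction: weighted mean with Gaussian weights (k4)."""
        sq_dists = np.sum((X - x_star) ** 2, axis=1)
        weights = np.exp(-sq_dists / (2 * sigma ** 2))  # closer points weigh more
        return np.dot(weights, Y) / np.sum(weights)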

SLIDE 33

K-NN example

K-Nearest neighbor depends on K. When K = 1

[Plot: K-NN solution for K = 1]

SLIDE 34

K-NN example

K-Nearest neighbor depends on K. When K = 2

[Plot: K-NN solution for K = 2]

SLIDE 35

K-NN example

K-Nearest neighbor depends on K. When K = 3

[Plot: K-NN solution for K = 3]

SLIDE 36

K-NN example

K-Nearest neighbor depends on K. When K = 4

[Plot: K-NN solution for K = 4]

SLIDE 37

K-NN example

K-Nearest neighbor depends on K. When K = 5

[Plot: K-NN solution for K = 5]

SLIDE 38

K-NN example

K-Nearest neighbor depends on K. When K = 9

[Plot: K-NN solution for K = 9]

SLIDE 39

K-NN example

K-Nearest neighbor depends on K. When K = 15

[Plot: K-NN solution for K = 15]

SLIDE 40

K-NN example

K-Nearest neighbor depends on K.

[Plot: K-NN solutions for several values of K]

The result changes a lot with K! How do we select K?

SLIDE 41

Outline

Learning with Local Methods
From Bias-Variance to Cross-Validation

SLIDE 42

Optimal choice for the Hyper-parameters

◮ S = {(x_i, y_i)}_{i=1}^n training set; write Y = (y_1, …, y_n) and X = (x_1^⊤, …, x_n^⊤)
◮ K ∈ N hyperparameter of the learning algorithm
◮ f̂_{S,K} learned function (depends on S and K)

SLIDE 43

Optimal choice for the Hyper-parameters

◮ S = {(x_i, y_i)}_{i=1}^n training set; write Y = (y_1, …, y_n) and X = (x_1^⊤, …, x_n^⊤)
◮ K ∈ N hyperparameter of the learning algorithm
◮ f̂_{S,K} learned function (depends on S and K)

The expected loss is

    E_K = E_S E_{x,y} (y − f̂_{S,K}(x))²

SLIDE 44

Optimal choice for the Hyper-parameters

◮ S = {(x_i, y_i)}_{i=1}^n training set; write Y = (y_1, …, y_n) and X = (x_1^⊤, …, x_n^⊤)
◮ K ∈ N hyperparameter of the learning algorithm
◮ f̂_{S,K} learned function (depends on S and K)

The expected loss is

    E_K = E_S E_{x,y} (y − f̂_{S,K}(x))²

Optimal hyperparameter K∗ should minimize E_K

SLIDE 45

Optimal choice for the Hyper-parameters

◮ S = {(x_i, y_i)}_{i=1}^n training set; write Y = (y_1, …, y_n) and X = (x_1^⊤, …, x_n^⊤)
◮ K ∈ N hyperparameter of the learning algorithm
◮ f̂_{S,K} learned function (depends on S and K)

The expected loss is

    E_K = E_S E_{x,y} (y − f̂_{S,K}(x))²

Optimal hyperparameter K∗ should minimize E_K:

    K∗ = argmin_{K∈K} E_K

SLIDE 46

Optimal choice for the Hyper-parameters (cont.)

Optimal hyperparameter K∗ should minimize E_K

SLIDE 47

Optimal choice for the Hyper-parameters (cont.)

Optimal hyperparameter K∗ should minimize E_K:

    K∗ = argmin_{K∈K} E_K

SLIDE 48

Optimal choice for the Hyper-parameters (cont.)

Optimal hyperparameter K∗ should minimize E_K:

    K∗ = argmin_{K∈K} E_K

Ideally! (In practice we don't have access to the distribution.)
◮ We can still try to understand the above minimization problem: does a solution exist? What does it depend on?
◮ Yet, ultimately, we need something we can compute!

SLIDE 49

Example: regression problem

Define the pointwise expected loss

    E_K(x) = E_S E_{y|x} (y − f̂_{S,K}(x))²

SLIDE 50

Example: regression problem

Define the pointwise expected loss

    E_K(x) = E_S E_{y|x} (y − f̂_{S,K}(x))²

By definition, E_K = E_x E_K(x).

SLIDE 51

Example: regression problem

Define the pointwise expected loss

    E_K(x) = E_S E_{y|x} (y − f̂_{S,K}(x))²

By definition, E_K = E_x E_K(x).

Regression setting:
◮ regression model y = f∗(x) + δ

SLIDE 52

Example: regression problem

Define the pointwise expected loss

    E_K(x) = E_S E_{y|x} (y − f̂_{S,K}(x))²

By definition, E_K = E_x E_K(x).

Regression setting:
◮ regression model y = f∗(x) + δ
◮ E[δ] = 0, E[δ²] = σ²

SLIDE 53

Example: regression problem

Define the pointwise expected loss

    E_K(x) = E_S E_{y|x} (y − f̂_{S,K}(x))²

By definition, E_K = E_x E_K(x).

Regression setting:
◮ regression model y = f∗(x) + δ
◮ E[δ] = 0, E[δ²] = σ²

Now

    E_K(x) = E_S E_{y|x} (y − f̂_{S,K}(x))²

SLIDE 54

Example: regression problem

Define the pointwise expected loss

    E_K(x) = E_S E_{y|x} (y − f̂_{S,K}(x))²

By definition, E_K = E_x E_K(x).

Regression setting:
◮ regression model y = f∗(x) + δ
◮ E[δ] = 0, E[δ²] = σ²

Now

    E_K(x) = E_S E_{y|x} (y − f̂_{S,K}(x))²
           = E_S E_{y|x} (f∗(x) + δ − f̂_{S,K}(x))²

SLIDE 55

Example: regression problem

Define the pointwise expected loss

    E_K(x) = E_S E_{y|x} (y − f̂_{S,K}(x))²

By definition, E_K = E_x E_K(x).

Regression setting:
◮ regression model y = f∗(x) + δ
◮ E[δ] = 0, E[δ²] = σ²

Now

    E_K(x) = E_S E_{y|x} (y − f̂_{S,K}(x))²
           = E_S E_{y|x} (f∗(x) + δ − f̂_{S,K}(x))²

that is,

    E_K(x) = E_S (f∗(x) − f̂_{S,K}(x))² + σ² …
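The step hidden by the ellipsis, spelled out for completeness (a standard computation: expand the square and use that the test noise δ is independent of the training set S):

    E_S E_{y|x} (f∗(x) + δ − f̂_{S,K}(x))²
      = E_S (f∗(x) − f̂_{S,K}(x))² + 2 E[δ] E_S (f∗(x) − f̂_{S,K}(x)) + E[δ²]
      = E_S (f∗(x) − f̂_{S,K}(x))² + σ²

since E[δ] = 0 and E[δ²] = σ².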

SLIDE 56

Bias Variance trade-off for K-NN

Define the noiseless K-NN (an ideal quantity!)

    f̃_{S,K}(x) = (1/K) Σ_{l∈K_x} f∗(x_l)

where K_x denotes the indices of the K training points closest to x.

SLIDE 57

Bias Variance trade-off for K-NN

Define the noiseless K-NN (an ideal quantity!)

    f̃_{S,K}(x) = (1/K) Σ_{l∈K_x} f∗(x_l)

Note that f̃_{S,K}(x) = E_{y|x} f̂_{S,K}(x).

SLIDE 58

Bias Variance trade-off for K-NN

Define the noiseless K-NN (an ideal quantity!)

    f̃_{S,K}(x) = (1/K) Σ_{l∈K_x} f∗(x_l)

Note that f̃_{S,K}(x) = E_{y|x} f̂_{S,K}(x). Consider

    E_K(x) = (f∗(x) − E_X f̃_{S,K}(x))²  [bias]  +  E_S (f̃_{S,K}(x) − f̂_{S,K}(x))²  [variance]  +  σ²  …

SLIDE 59

Bias Variance trade-off for K-NN

Define the noiseless K-NN (an ideal quantity!)

    f̃_{S,K}(x) = (1/K) Σ_{l∈K_x} f∗(x_l)

Note that f̃_{S,K}(x) = E_{y|x} f̂_{S,K}(x). Consider

    E_K(x) = (f∗(x) − E_X f̃_{S,K}(x))²  [bias]  +  (1/K²) E_X Σ_{l∈K_x} E_{y_l|x_l} (y_l − f∗(x_l))²  [variance]  +  σ²  …

SLIDE 60

Bias Variance trade-off for K-NN

Define the noiseless K-NN (an ideal quantity!)

    f̃_{S,K}(x) = (1/K) Σ_{l∈K_x} f∗(x_l)

Note that f̃_{S,K}(x) = E_{y|x} f̂_{S,K}(x). Consider

    E_K(x) = (f∗(x) − E_X f̃_{S,K}(x))²  [bias]  +  σ²/K  [variance]  +  σ²
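The trade-off can be checked numerically. A sketch (the synthetic target, noise level, and all names are our own illustrative choices) that estimates bias² and variance of K-NN at a fixed test point by resampling training sets:

    import numpy as np

    rng = np.random.default_rng(0)
    f_star = lambda x: np.sin(2 * np.pi * x)        # "unknown" target, known here
    sigma, n, x_test, trials = 0.3, 200, 0.5, 500

    def knn_1d(x_tr, y_tr, x, K):
        nearest = np.argsort(np.abs(x_tr - x))[:K]  # K nearest neighbors of x
        return y_tr[nearest].mean()

    for K in (1, 5, 20, 100):
        preds = np.empty(trials)
        for t in range(trials):
            x_tr = rng.uniform(0, 1, n)
            y_tr = f_star(x_tr) + sigma * rng.normal(size=n)  # y = f*(x) + delta
            preds[t] = knn_1d(x_tr, y_tr, x_test, K)
        bias2 = (f_star(x_test) - preds.mean()) ** 2          # squared bias
        var = preds.var()                                     # variance, ~ sigma^2/K
        print(f"K={K:3d}  bias^2={bias2:.4f}  variance={var:.4f}")

As K grows, the variance term shrinks (roughly like σ²/K) while the bias term grows, as in the decomposition above.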

SLIDE 61

Bias-Variance trade-off

[Plot: bias and variance as functions of K]

SLIDE 62

How to choose the hyper-parameters

Bias-Variance trade-off is theoretical, but shows that:

SLIDE 63

How to choose the hyper-parameters

Bias-Variance trade-off is theoretical, but shows that: ◮ an optimal parameter exists and

SLIDE 64

How to choose the hyper-parameters

Bias-Variance trade-off is theoretical, but shows that: ◮ an optimal parameter exists and ◮ it depends on the noise and the unknown target function.

SLIDE 65

How to choose the hyper-parameters

Bias-Variance trade-off is theoretical, but shows that: ◮ an optimal parameter exists and ◮ it depends on the noise and the unknown target function. How to choose K in practice?

SLIDE 66

How to choose the hyper-parameters

Bias-Variance trade-off is theoretical, but shows that: ◮ an optimal parameter exists and ◮ it depends on the noise and the unknown target function. How to choose K in practice? ◮ Idea: train on some data and validate the parameter on new unseen data as a proxy for the ideal case.

SLIDE 67

Hold-out Cross-validation

SLIDE 68

Hold-out Cross-validation

For each K

SLIDE 69

Hold-out Cross-validation

For each K

  • 1. shuffle and split S into T (training) and V (validation)

SLIDE 70

Hold-out Cross-validation

For each K

  • 1. shuffle and split S into T (training) and V (validation)
  • 2. train the algorithm on T and compute the empirical loss on V:

        Ê_K = (1/|V|) Σ_{(x,y)∈V} (y − f̂_{T,K}(x))²

SLIDE 71

Hold-out Cross-validation

For each K

  • 1. shuffle and split S into T (training) and V (validation)
  • 2. train the algorithm on T and compute the empirical loss on V:

        Ê_K = (1/|V|) Σ_{(x,y)∈V} (y − f̂_{T,K}(x))²

SLIDE 72

Hold-out Cross-validation

For each K

  • 1. shuffle and split S into T (training) and V (validation)
  • 2. train the algorithm on T and compute the empirical loss on V:

        Ê_K = (1/|V|) Σ_{(x,y)∈V} (y − f̂_{T,K}(x))²

  • 3. select the K̂ that minimizes Ê_K.

SLIDE 73

Hold-out Cross-validation

For each K

  • 1. shuffle and split S into T (training) and V (validation)
  • 2. train the algorithm on T and compute the empirical loss on V:

        Ê_K = (1/|V|) Σ_{(x,y)∈V} (y − f̂_{T,K}(x))²

  • 3. select the K̂ that minimizes Ê_K.

The above procedure can be repeated to improve stability, with K selected to minimize the error over trials.

SLIDE 74

Hold-out Cross-validation

For each K

  • 1. shuffle and split S into T (training) and V (validation)
  • 2. train the algorithm on T and compute the empirical loss on V:

        Ê_K = (1/|V|) Σ_{(x,y)∈V} (y − f̂_{T,K}(x))²

  • 3. select the K̂ that minimizes Ê_K.

The above procedure can be repeated to improve stability, with K selected to minimize the error over trials. There are other related parameter-selection methods (k-fold cross-validation, leave-one-out, …).
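A sketch of the hold-out procedure (illustrative names; knn_predict is the K-NN predictor sketched earlier, and the 50/50 split mirrors the experiment on the next slides):

    import numpy as np

    def holdout_select(X, Y, K_grid, frac=0.5, seed=0):
        """Hold-out CV: return the K in K_grid minimizing the validation error."""
        rng = np.random.default_rng(seed)
        perm = rng.permutation(len(Y))                # 1. shuffle ...
        split = int(frac * len(Y))
        T, V = perm[:split], perm[split:]             # ... and split S into T and V
        errors = {}
        for K in K_grid:
            preds = np.array([knn_predict(X[T], Y[T], x, K) for x in X[V]])
            errors[K] = np.mean((Y[V] - preds) ** 2)  # 2. empirical loss on V
        return min(errors, key=errors.get)            # 3. K_hat minimizing E_hat_K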

SLIDE 75

V-Fold Cross-validation

SLIDE 76

Training and Validation Error behavior

[Plot: training vs. validation error as a function of K (T = 50% of S, n = 200)]

SLIDE 77

Training and Validation Error behavior

[Plot: training vs. validation error as a function of K (T = 50% of S, n = 200)]

K̂ = 8.

SLIDE 78

Wrapping up

In this class we made our first encounter with learning algorithms (local methods) and the problem of tuning their parameters (via bias-variance trade-off and cross-validation) to avoid overfitting and achieve generalization.

SLIDE 79

Next Class

High Dimensions: Beyond local methods!
