MLCC 2017: Local Methods and Bias-Variance Trade-Off
Lorenzo Rosasco
UNIGE-MIT-IIT
June 25, 2017
About this class
1. Introduce a basic class of learning methods, namely local methods.
2. Discuss the fundamental concept of bias-variance trade-off to understand parameter tuning (a.k.a. model selection).
Outline
1. Learning with Local Methods
2. From Bias-Variance to Cross-Validation
The problem
What is the price of one house given its area? Start from data...

  Area (m²)     Price (€)
  x1 = 62       y1 = 99,200
  x2 = 64       y2 = 135,700
  x3 = 65       y3 = 93,300
  x4 = 66       y4 = 114,000
  ...           ...

[Figure: scatter plot of price (€, on the order of 10^5) versus area (m²) for the houses dataset.]

Let S be the houses example dataset (n = 100):
  S = {(x1, y1), . . . , (xn, yn)}
Given a new point x∗, we want to predict y∗ by means of S.
Example
Let x∗ be a 300 m² house.

  Area (m²)      Price (€)
  ...            ...
  x93 = 255      y93 = 274,600
  x94 = 264      y94 = 324,900
  x95 = 310      y95 = 311,200
  x96 = 480      y96 = 515,400
  ...            ...

[Figure: scatter plot of price versus area for the houses dataset.]

What is its price?
Nearest Neighbors
Nearest neighbor: y∗ is the value of the closest point to x∗ in S. Here the closest point is x95 = 310, so
  y∗ = 311,200
Nearest Neighbors
◮ S = {(x_i, y_i)}_{i=1}^n with x_i ∈ R^D, y_i ∈ R
◮ x∗ ∈ R^D the new point
◮ y∗ the predicted output, y∗ = f̂(x∗), where
    y∗ = y_j,   j = argmin_{i=1,...,n} ‖x∗ − x_i‖
Computational cost O(nD): we compute n times the distance ‖x∗ − x_i‖, each costing O(D).
In general, let d : R^D × R^D → R be a distance on the input space; then
    f̂(x) = y_j,   j = argmin_{i=1,...,n} d(x, x_i)
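As an illustration, here is a minimal NumPy sketch of the nearest-neighbor rule above; the function and variable names (nearest_neighbor, X, Y, x_new) are our own and not part of the course material.

import numpy as np

def nearest_neighbor(X, Y, x_new):
    # X: (n, D) training inputs, Y: (n,) training outputs, x_new: (D,) new point.
    # Compute the n Euclidean distances ||x_new - x_i||, each costing O(D).
    distances = np.linalg.norm(X - x_new, axis=1)
    j = np.argmin(distances)  # index of the closest training point
    return Y[j]

# Toy usage on the rows of the houses table (areas in m^2, prices in euros):
X = np.array([[255.0], [264.0], [310.0], [480.0]])
Y = np.array([274600.0, 324900.0, 311200.0, 515400.0])
print(nearest_neighbor(X, Y, np.array([300.0])))  # prints 311200.0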
Extensions
Nearest neighbor takes y∗ to be the value of the closest point to x∗ in S.
Can we do better? (For example, using more points.)
K-Nearest Neighbors
K-nearest neighbor: y∗ is the mean of the values of the K closest points to x∗ in S. If K = 3, the three closest points are x93, x94, x95, so
  y∗ = (274,600 + 324,900 + 311,200) / 3 ≈ 303,567
K-Nearest Neighbors
◮ S = {(x_i, y_i)}_{i=1}^n with x_i ∈ R^D, y_i ∈ R
◮ x∗ ∈ R^D the new point
◮ K an integer with K ≪ n
◮ j_1, . . . , j_K defined by
    j_1 = argmin_{i ∈ {1,...,n}} ‖x∗ − x_i‖
    j_t = argmin_{i ∈ {1,...,n} \ {j_1,...,j_{t−1}}} ‖x∗ − x_i‖   for t ∈ {2, . . . , K}
◮ predicted output
    y∗ = (1/K) Σ_{i ∈ {j_1,...,j_K}} y_i
K-Nearest Neighbors (cont.)
    f̂(x) = (1/K) Σ_{i=1}^K y_{j_i}
◮ Computational cost O(nD + n log n): compute the n distances ‖x − x_i‖ for i ∈ {1, . . . , n} (each costs O(D)), then sort them in O(n log n).
◮ General metric d: f̂ is the same, but j_1, . . . , j_K are defined by
    j_1 = argmin_{i ∈ {1,...,n}} d(x, x_i)
    j_t = argmin_{i ∈ {1,...,n} \ {j_1,...,j_{t−1}}} d(x, x_i)   for t ∈ {2, . . . , K}
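A minimal sketch of the K-NN rule above (again our own code; a real implementation would typically use spatial data structures such as k-d trees rather than sorting all distances):

import numpy as np

def knn_predict(X, Y, x_new, K):
    # Average the outputs of the K training points closest to x_new.
    distances = np.linalg.norm(X - x_new, axis=1)  # n distances, O(nD)
    neighbors = np.argsort(distances)[:K]          # sorting costs O(n log n)
    return Y[neighbors].mean()

# On the houses table with K = 3 this returns (274600 + 324900 + 311200) / 3.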
Parzen Windows
K-NN puts equal weight on the values of the selected points. Can we generalize it? Points closer to x∗ should influence its value more.

PARZEN WINDOWS:
    f̂(x) = ( Σ_{i=1}^n y_i k(x, x_i) ) / ( Σ_{i=1}^n k(x, x_i) )
where k is a similarity function:
◮ k(x, x′) ≥ 0 for all x, x′ ∈ R^D
◮ k(x, x′) → 1 when ‖x − x′‖ → 0
◮ k(x, x′) → 0 when ‖x − x′‖ → ∞
Parzen Windows
Examples of k (with σ > 0; (·)_+ denotes the positive part):
◮ k1(x, x′) = sign(1 − ‖x − x′‖/σ)_+
◮ k2(x, x′) = (1 − ‖x − x′‖/σ)_+
◮ k3(x, x′) = (1 − ‖x − x′‖²/σ²)_+
◮ k4(x, x′) = exp(−‖x − x′‖²/(2σ²))
◮ k5(x, x′) = exp(−‖x − x′‖/σ)
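For concreteness, a sketch (our code, not the course's) of the Parzen-window estimator with the Gaussian similarity k4 defined above:

import numpy as np

def parzen_predict(X, Y, x_new, sigma):
    # Similarity-weighted average of the training outputs, using
    # k4(x, x') = exp(-||x - x'||^2 / (2 sigma^2)).
    sq_dists = np.sum((X - x_new) ** 2, axis=1)
    k = np.exp(-sq_dists / (2.0 * sigma ** 2))
    # Note: if x_new is far from every training point the denominator can
    # underflow to zero, so a real implementation would guard against that.
    return np.dot(k, Y) / np.sum(k)

With a hard window such as k1, the same formula reduces to averaging the outputs of all training points within distance σ of x_new.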
K-NN example
K-nearest neighbor depends on K.
[Figures: the K-NN solution on a toy dataset for K = 1, 2, 3, 4, 5, 9, 15.]
Changing K the result changes a lot! How to select K?
Outline
1. Learning with Local Methods
2. From Bias-Variance to Cross-Validation
Optimal choice for the Hyper-parameters
◮ S = {(x_i, y_i)}_{i=1}^n training set. Write Y = (y_1, . . . , y_n) and X = (x_1^⊤, . . . , x_n^⊤).
◮ K ∈ N hyperparameter of the learning algorithm
◮ f̂_{S,K} learned function (depends on S and K)
The expected loss E_K is
    E_K = E_S E_{x,y} (y − f̂_{S,K}(x))²
The optimal hyperparameter K∗ should minimize E_K:
    K∗ = argmin_{K ∈ 𝒦} E_K
where 𝒦 is the set of candidate values.
Optimal choice for the Hyper-parameters (cont.)
The optimal hyperparameter K∗ should minimize E_K:
    K∗ = argmin_{K ∈ 𝒦} E_K
Ideally! (In practice we do not have access to the distribution.)
◮ We can still try to understand the above minimization problem: does a solution exist? What does it depend on?
◮ Yet, ultimately, we need something we can compute!
Example: regression problem
Define the pointwise expected loss
    E_K(x) = E_S E_{y|x} (y − f̂_{S,K}(x))²
By definition, E_K = E_x E_K(x).
Regression setting:
◮ Regression model y = f∗(x) + δ
◮ E δ = 0, E δ² = σ²
Now
    E_K(x) = E_S E_{y|x} (y − f̂_{S,K}(x))² = E_S E_{y|x} (f∗(x) + δ − f̂_{S,K}(x))²
that is,
    E_K(x) = E_S (f∗(x) − f̂_{S,K}(x))² + σ²
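The last step is worth spelling out. A short derivation of our own, in LaTeX, using only the assumptions E δ = 0, E δ² = σ², and the fact that the test noise δ is independent of the training set S:

\begin{aligned}
\mathcal{E}_K(x)
&= \mathbb{E}_S \mathbb{E}_{y|x}\big(f_*(x) + \delta - \hat f_{S,K}(x)\big)^2 \\
&= \mathbb{E}_S\big(f_*(x) - \hat f_{S,K}(x)\big)^2
   + 2\,\mathbb{E}_S\big(f_*(x) - \hat f_{S,K}(x)\big)\,\mathbb{E}[\delta]
   + \mathbb{E}[\delta^2] \\
&= \mathbb{E}_S\big(f_*(x) - \hat f_{S,K}(x)\big)^2 + \sigma^2 .
\end{aligned}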
Bias-Variance trade-off for K-NN
Define the noiseless K-NN (it is ideal!)
    f̃_{S,K}(x) = (1/K) Σ_{l ∈ K_x} f∗(x_l)
where K_x denotes the set of indices of the K nearest neighbors of x. Note that f̃_{S,K}(x) = E_{y|x} f̂_{S,K}(x).
Consider
    E_K(x) = (f∗(x) − E_X f̃_{S,K}(x))²   [bias]
           + E_S (f̃_{S,K}(x) − f̂_{S,K}(x))²   [variance]
           + σ²
The variance term can be rewritten as
    E_S (f̃_{S,K}(x) − f̂_{S,K}(x))² = (1/K²) E_X Σ_{l ∈ K_x} E_{y_l|x_l} (y_l − f∗(x_l))² = σ²/K
so that
    E_K(x) = (f∗(x) − E_X f̃_{S,K}(x))² + σ²/K + σ²
with the first term the bias, the second the variance, and σ² the irreducible noise.
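The σ²/K step follows from the independence of the noise across training points. A sketch of our own, in LaTeX, conditioning on the training inputs X so that the neighbor set K_x is fixed:

\begin{aligned}
\mathbb{E}_{Y|X}\big(\tilde f_{S,K}(x) - \hat f_{S,K}(x)\big)^2
&= \mathbb{E}_{Y|X}\Big(\tfrac{1}{K}\textstyle\sum_{l \in K_x}\big(f_*(x_l) - y_l\big)\Big)^2 \\
&= \tfrac{1}{K^2}\textstyle\sum_{l \in K_x}\mathbb{E}_{y_l|x_l}\big(y_l - f_*(x_l)\big)^2
 = \tfrac{1}{K^2}\,K\sigma^2 = \tfrac{\sigma^2}{K},
\end{aligned}

since the noise terms on distinct training points are independent and zero-mean, so all cross terms vanish; averaging over X then gives the variance term on the slide.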
Bias-Variance trade-off
[Figure: the bias and variance terms of the error as functions of K; the variance σ²/K decreases as K grows while the bias increases, giving the trade-off.]
How to choose the hyper-parameters
The bias-variance trade-off is theoretical, but it shows that:
◮ an optimal parameter exists, and
◮ it depends on the noise and on the unknown target function.
How to choose K in practice?
◮ Idea: train on some data and validate the parameter on new, unseen data, as a proxy for the ideal case.
Hold-out Cross-validation
For each K:
1. shuffle and split S into T (training) and V (validation)
2. train the algorithm on T and compute the empirical loss on V:
    Ê_K = (1/|V|) Σ_{(x,y) ∈ V} (y − f̂_{T,K}(x))²
3. select the K̂ that minimizes Ê_K.
The above procedure can be repeated to increase stability, with K selected to minimize the error over the trials.
There are other, related parameter selection methods (k-fold cross-validation, leave-one-out, ...).
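A minimal sketch of the procedure above for K-NN; the function name, the 50/50 split and the reuse of knn_predict from the earlier sketch are our own choices.

import numpy as np

def holdout_select_K(X, Y, K_values, val_fraction=0.5, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))             # 1. shuffle ...
    n_val = int(val_fraction * len(X))
    val, train = idx[:n_val], idx[n_val:]     # ... and split S into V and T
    errors = {}
    for K in K_values:
        preds = np.array([knn_predict(X[train], Y[train], x, K) for x in X[val]])
        errors[K] = np.mean((Y[val] - preds) ** 2)   # 2. empirical loss on V
    K_hat = min(errors, key=errors.get)              # 3. K_hat minimizing the loss
    return K_hat, errors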
Training and Validation Error behavior
[Figure: training and validation error as functions of K, with T = 50% of S and n = 200.]
The validation error is minimized at K̂ = 8.
Wrapping up
In this class we had our first encounter with learning algorithms (local methods) and with the problem of tuning their parameters (via the bias-variance trade-off and cross-validation) to avoid overfitting and achieve generalization.
Next Class
High Dimensions: Beyond local methods!