Lecture 22 Interpolation, Overfitting, Ridgeless Regression, and Neural Networks
Sasha Rakhlin
Nov 21, 2019
In the previous lecture, we derived upper bounds on Rademacher averages of a set of neural networks in terms of norms of the weight matrices.
Fix the design X_{1:n} and average over the labels. The error of an estimator f̂ of the regression function f* decomposes into variance and squared bias:

E ∥f̂ − f*∥² = E ∥f̂ − E_{Y_{1:n}}[f̂]∥² + E ∥E_{Y_{1:n}}[f̂] − f*∥²,

where ∥f̂ − f*∥² = E_x (f̂(x) − f*(x))².
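As a quick numerical illustration (my own sketch, not from the lecture; the least-squares estimator and all variable names below are ad hoc choices), the decomposition can be checked by Monte Carlo over repeated draws of the labels for a fixed design:

import numpy as np

rng = np.random.default_rng(0)
n, d, trials = 50, 5, 2000
X = rng.normal(size=(n, d))              # fixed design
w_star = rng.normal(size=d)
f_star = X @ w_star                      # f*(X_i) on the design points

def fit_predict(Y):
    # least-squares fit; returns predictions on the design points
    w_hat = np.linalg.lstsq(X, Y, rcond=None)[0]
    return X @ w_hat

preds = np.stack([fit_predict(f_star + rng.normal(size=n)) for _ in range(trials)])
mean_pred = preds.mean(axis=0)           # approximates E_{Y_{1:n}}[f_hat]

total    = ((preds - f_star) ** 2).mean()
variance = ((preds - mean_pred) ** 2).mean()
bias_sq  = ((mean_pred - f_star) ** 2).mean()
print(total, variance + bias_sq)         # the two printed numbers agree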
A basic local-averaging method is the kernel (Nadaraya–Watson) estimator with bandwidth h:

f̂(x) = ∑_{i=1}^n Y_i K_h(x − X_i) / ∑_{i=1}^n K_h(x − X_i).
(figure from Györfi et al.)
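A minimal sketch of the kernel estimator above (assuming a Gaussian kernel; the function name and toy data are mine):

import numpy as np

def nadaraya_watson(x, X, Y, h):
    # f_hat(x) = sum_i Y_i K_h(x - X_i) / sum_i K_h(x - X_i)
    # Gaussian kernel K_h(u) = exp(-||u||^2 / (2 h^2)); any kernel works here
    w = np.exp(-np.sum((X - x) ** 2, axis=1) / (2 * h ** 2))
    return np.dot(w, Y) / np.sum(w)

# toy usage: noisy samples of a one-dimensional regression function
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(200, 1))
Y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=200)
print(nadaraya_watson(np.array([0.5]), X, Y, h=0.1))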
Decompose f̂(x) − f*(x) into a bias term,

∑_{i=1}^n (f*(X_i) − f*(x)) K_h(x − X_i) / ∑_{i=1}^n K_h(x − X_i),

which is of order h when f* is Lipschitz (the kernel only weights points within distance roughly h of x), and a noise term whose second moment is

E[ ∑_{i=1}^n σ² K_h(x − X_i)² / (∑_{i=1}^n K_h(x − X_i))² ] ≲ 1/(nh^d).
Balancing the two terms with the choice h ≍ n^{−1/(2+d)} yields

E ∥f̂ − f*∥² ≲ h² + 1/(nh^d) ≍ n^{−2/(2+d)}.
(figure from “Elements of Statistical Learning,” Hastie, Tibshirani, Friedman)
▸ Fitting data too well?
▸ Bias too low, variance too high?
Ridge regression:

ŵ = argmin_{w∈R^d} ∑_{i=1}^n (⟨w, X_i⟩ − Y_i)² + λ∥w∥²,

with prediction f̂(x) = x^T X^T (XX^T + λI)^{−1} Y. Equivalently, ŵ = X^T c = ∑_{i=1}^n c_i X_i with c = (XX^T + λI)^{−1} Y, so that f̂(x) = ∑_{i=1}^n c_i ⟨x, X_i⟩: the solution is a linear combination of the data points.
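A small numerical check (a sketch of my own, not from the slides) that the familiar form (X^T X + λI)^{−1} X^T Y and the data-space form X^T (XX^T + λI)^{−1} Y agree:

import numpy as np

rng = np.random.default_rng(2)
n, d, lam = 30, 10, 0.5
X = rng.normal(size=(n, d))
Y = rng.normal(size=n)

w_primal = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)
c = np.linalg.solve(X @ X.T + lam * np.eye(n), Y)     # c = (XX^T + lam*I)^{-1} Y
w_dual = X.T @ c                                      # w = X^T c = sum_i c_i X_i

print(np.allclose(w_primal, w_dual))                  # True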
More generally, for an RKHS F with kernel K, kernel ridge regression

f̂ = argmin_{f∈F} ∑_{i=1}^n (f(X_i) − Y_i)² + λ∥f∥²_K

has the solution f̂(x) = K(x, X)^T (K + λI)^{−1} Y, where K = [K(X_i, X_j)]_{i,j=1}^n and K(x, X) = (K(x, X_1), …, K(x, X_n))^T.
Letting λ → 0 gives the “ridgeless” estimators

f̂(x) = x^T X^T (XX^T)^{−1} Y  and  f̂(x) = K(x, X)^T K^{−1} Y,

which interpolate the data and coincide with the minimum-norm solutions, e.g. argmin{∥f∥_K : f ∈ F, f(X_i) = Y_i for all i} in the kernel case.
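A sketch of the ridgeless kernel estimator on toy data (the Gaussian kernel and all names are my own choices):

import numpy as np

def gauss_kernel(A, B, gamma=1.0):
    # K(a, b) = exp(-gamma * ||a - b||^2), computed for all pairs of rows
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d2)

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 2))
Y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=40)

K = gauss_kernel(X, X)
c = np.linalg.solve(K, Y)                  # ridgeless: c = K^{-1} Y

def f_hat(x_new):
    return gauss_kernel(np.atleast_2d(x_new), X)[0] @ c

# maximum error on the training points: near zero, i.e. the data are interpolated
print(np.max(np.abs(np.array([f_hat(x) for x in X]) - Y)))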
Write Y = f*(X) + noise. The variance part of the error of the ridgeless estimator is K(x, X)^T K^{−1}(Y − f*(X)), and

E[ (K(x, X)^T K^{−1}(Y − f*(X)))² ] ≤ σ²_Y ⋅ E ∥K(x, X)^T K^{−1}∥²,

where σ²_Y is a uniform upper bound on the variance of the noise.
Under suitable assumptions one can show an upper bound of the form

E ∥K(x, X)^T K^{−1}∥² ≲ min_{i∈N} {…},

so the variance of the ridgeless estimator can be small in high dimensions. In contrast, for fixed dimension d,

E ∥K(x, X)^T K^{−1}∥² ≥ Ω_d(1),

so the variance does not vanish as n grows.
Consider a two-layer network

f(x) = ∑_{i=1}^m a_i σ(w_i^T x),

trained on the empirical squared loss L = (1/n) ∑_{j=1}^n (f(x_j) − y_j)². The gradients are

∂L/∂w_i = (2/n) ∑_{j=1}^n (f(x_j) − y_j) a_i σ′(w_i^T x_j) x_j,   ∂L/∂a_i = (2/n) ∑_{j=1}^n (f(x_j) − y_j) σ(w_i^T x_j).
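To make these formulas concrete, here is a sketch (ReLU activation, toy data, and all names are assumptions of mine) that implements the gradients above and runs plain gradient descent as a discretization of gradient flow:

import numpy as np

rng = np.random.default_rng(4)
m, n, d = 50, 20, 3
W = rng.normal(size=(m, d))                        # first-layer weights w_i
a = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)   # second-layer weights a_i
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

relu = lambda z: np.maximum(z, 0.0)

def forward(W, a):
    # f(x_j) = sum_i a_i * sigma(w_i^T x_j)
    return relu(X @ W.T) @ a

def grads(W, a):
    # gradients of L = (1/n) sum_j (f(x_j) - y_j)^2, matching the formulas above
    pre = X @ W.T                                  # entries w_i^T x_j (n x m)
    r = (2.0 / n) * (relu(pre) @ a - y)            # factors (2/n)(f(x_j) - y_j)
    gW = (r[:, None] * (pre > 0) * a[None, :]).T @ X   # rows are dL/dw_i
    ga = relu(pre).T @ r                               # entries are dL/da_i
    return gW, ga

lr = 0.1
for _ in range(500):                               # discretized gradient flow
    gW, ga = grads(W, a)
    W, a = W - lr * gW, a - lr * ga
print(((forward(W, a) - y) ** 2).mean())           # training loss after descent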
Under gradient flow, the prediction at any point x evolves as

d/dt f(x) = ∑_{i=1}^m (∂f(x)/∂w_i)^T (dw_i/dt) + ∑_{i=1}^m (∂f(x)/∂a_i) (da_i/dt)
          = −∑_{i=1}^m (∂f(x)/∂w_i)^T (∂L/∂w_i) − ∑_{i=1}^m (∂f(x)/∂a_i) (∂L/∂a_i).

Keeping the contribution of the first-layer weights, this becomes

d/dt f(x) = −(2/n) ∑_{j=1}^n (f(x_j) − y_j) K_m(x, x_j),   where   K_m(x, x′) = ∑_{i=1}^m a_i² σ′(w_i^T x) σ′(w_i^T x′) (x^T x′).
For σ = ReLU and w ∼ N(0, I), the expectation of each summand has a closed form:

E_w[ σ′(w^T x) σ′(w^T x′) ] (x^T x′) = (π − ∠(x, x′))/(2π) ⋅ (x^T x′),

so as m → ∞ (with suitably scaled a_i) the kernel K_m concentrates around a fixed kernel K∞(x, x′).
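The closed form can be checked by Monte Carlo; a sketch under the assumption w ∼ N(0, I_d):

import numpy as np

rng = np.random.default_rng(5)
d, samples = 3, 200_000

x  = np.array([1.0, 0.0, 0.0])
xp = np.array([np.cos(1.0), np.sin(1.0), 0.0])    # angle(x, x') = 1 radian

W = rng.normal(size=(samples, d))                 # w ~ N(0, I_d)
indic = (W @ x > 0) & (W @ xp > 0)                # sigma'(w^T x) sigma'(w^T x') for ReLU
mc = indic.mean() * (x @ xp)                      # Monte Carlo estimate

theta = np.arccos((x @ xp) / (np.linalg.norm(x) * np.linalg.norm(xp)))
closed_form = (np.pi - theta) / (2 * np.pi) * (x @ xp)
print(mc, closed_form)                            # the two should be close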
On the training data, the predictions V(t) = (f(x_1, t), …, f(x_n, t)) then follow the linear dynamics

d/dt f(x_i) = −(2/n) ∑_{j=1}^n (f(x_j) − y_j) K∞(x_i, x_j),

i.e., absorbing constants into the time scale, dV/dt = −K(V − Y) with K = [K∞(x_i, x_j)]. Solution is
V(t) = Y + e^{−tK}(V_0 − Y). For a test point x,

f(x, t) = f(x, 0) − ∫_0^t K∞(x, X) e^{−s⋅K}(V_0 − Y) ds = f(x, 0) − K∞(x, X) [∫_0^t e^{−sK} ds] (V_0 − Y).

As t → ∞, ∫_0^t e^{−sK} ds → K^{−1}, so with an initialization for which f(⋅, 0) ≈ 0 the prediction converges to K∞(x, X) K^{−1} Y: the min norm interpolant.
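A sketch (toy data and my own names) that evaluates the integral form above and compares it, for large t, with the minimum-norm interpolant K∞(x, X) K^{−1} Y:

import numpy as np

def k_infty(A, B):
    # NTK-type kernel (pi - angle(a, b)) / (2*pi) * <a, b> for unit-norm inputs
    cos = np.clip(A @ B.T, -1.0, 1.0)
    return (np.pi - np.arccos(cos)) / (2 * np.pi) * (A @ B.T)

rng = np.random.default_rng(6)
X = rng.normal(size=(15, 5))
X /= np.linalg.norm(X, axis=1, keepdims=True)     # put inputs on the unit sphere
Y = rng.normal(size=15)
x_test = np.eye(5)[0]

K = k_infty(X, X)
V0 = np.zeros(15)                                 # assume f(., 0) = 0 at initialization
t = 1e6

# integral_0^t e^{-sK} ds = K^{-1}(I - e^{-tK}), computed via the eigendecomposition of K
evals, U = np.linalg.eigh(K)
evals = np.maximum(evals, 1e-12)                  # guard against numerically zero eigenvalues
integral = U @ np.diag((1 - np.exp(-t * evals)) / evals) @ U.T

f_t = 0.0 - k_infty(x_test[None, :], X)[0] @ (integral @ (V0 - Y))
f_min_norm = k_infty(x_test[None, :], X)[0] @ np.linalg.solve(K, Y)
print(f_t, f_min_norm)                            # nearly equal for large t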
▸ First, zero training error does not contradict consistency (that is, the expected loss approaching the Bayes error): it suffices that E ∥f̂ − f*∥² is small, since this can be written as the expected loss minus the Bayes error.
▸ Since the Bayes error is a constant but the training error is zero, we should not hope to control the expected loss via uniform deviations between training and test error.
▸ The bias-variance decomposition can circumvent the uniform-deviations argument.
▸ In the RKHS and NN examples, there are infinitely many ERMs (interpolants), and some of them (in particular, the minimum-norm one) can generalize well.
▸ In experiments, increasing noise level (equivalently, Bayes error) leads …