SLIDE 1

Lecture 22 Interpolation, Overfitting, Ridgeless Regression, and Neural Networks

Sasha Rakhlin

Nov 21, 2019


SLIDE 2

In the previous lecture, we derived upper bounds on Rademacher averages of a set of neural networks in terms of norms of weight matrices, without explicit dependence on the number of neurons. Such a result is useful to control uniform deviations between empirical and expected errors, or for margin-based bounds. As we discussed earlier, analyses that employ uniform deviations are not the only path to understanding out-of-sample performance. Today we will discuss methods for which the empirical error can be zero while the out-of-sample error is far from zero. In such situations, the bias-variance decomposition (rather than the estimation-approximation decomposition) might be more useful.

SLIDE 3

Bias-Variance

Bias-Variance decomposition:

$$\mathbb{E}\,\|\hat{f}_n - f_*\|^2 \;=\; \mathbb{E}\,\big\|\hat{f}_n - \mathbb{E}_{Y_{1:n}}[\hat{f}_n]\big\|^2 \;+\; \mathbb{E}\,\big\|\mathbb{E}_{Y_{1:n}}[\hat{f}_n] - f_*\big\|^2 .$$

Recall that the above estimation error can be written as an excess prediction error:

$$\mathbb{E}\,\|\hat{f}_n - f_*\|^2 \;=\; \mathbb{E}\,\big(\hat{f}_n(X) - Y\big)^2 \;-\; \min_{f}\,\mathbb{E}\,\big(f(X) - Y\big)^2 .$$

SLIDE 4

Outline

▸ Local Kernel Regression: Nadaraya-Watson
▸ Interpolation
▸ Local Methods
▸ Kernel Ridge(less) Regression
▸ Wide Neural Networks
▸ Summary

SLIDE 5

Nadaraya-Watson estimator:

$$\hat{f}_n(x) \;=\; \sum_{i=1}^{n} Y_i\, W_i(x), \qquad W_i(x) \;=\; \frac{K_h(x - X_i)}{\sum_{j=1}^{n} K_h(x - X_j)} .$$

Here $K_h(x - X_i)$ is a notion of "distance" between $x$ and $X_i$.
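
To make this concrete, here is a minimal Python sketch (not from the slides) of the Nadaraya-Watson estimator with a compactly supported box kernel; the data-generating process and the bandwidth choice are illustrative assumptions.

```python
# Illustrative sketch (not from the lecture): Nadaraya-Watson regression
# with a compactly supported "box" kernel K(x) = 1{||x|| <= 1}.
import numpy as np

def nadaraya_watson(x, X, Y, h):
    """Predict at a single point x from training data (X, Y) with bandwidth h."""
    dists = np.linalg.norm(X - x, axis=1)    # ||x - X_i||
    weights = (dists <= h).astype(float)     # K_h(x - X_i) for the box kernel
    total = weights.sum()
    if total == 0:                           # no training point within distance h
        return 0.0
    return np.dot(weights, Y) / total        # sum_i Y_i W_i(x)

# Toy usage: noisy observations of a smooth target on [0, 1]^2.
rng = np.random.default_rng(0)
n, d = 500, 2
X = rng.uniform(0, 1, size=(n, d))
Y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n)
print(nadaraya_watson(np.array([0.5, 0.5]), X, Y, h=n ** (-1 / (2 + d))))
```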

SLIDE 6

Fix a kernel $K: \mathbb{R}^d \to \mathbb{R}_{\ge 0}$. Assume $K$ is zero outside the unit Euclidean ball at the origin (not true for $e^{-\|x\|^2}$, but close enough).

(figure from Györfi et al.)

Let $K_h(x) = K(x/h)$, so that $K_h(x - x')$ is zero if $\|x - x'\| \ge h$. Here $h$ is the "bandwidth", a tunable parameter. Assume $K(x) > c\,\mathbb{1}\{\|x\| \le 1\}$ for some $c > 0$. This is important for the "averaging effect" to kick in.

SLIDE 7

Unlike the k-NN example, bias is easier to estimate. Bias: for a given $x$,

$$\mathbb{E}_{Y_{1:n}}[\hat{f}_n(x)] \;=\; \mathbb{E}_{Y_{1:n}}\Big[\sum_{i=1}^{n} Y_i\, W_i(x)\Big] \;=\; \sum_{i=1}^{n} f_*(X_i)\, W_i(x),$$

and so

$$\mathbb{E}_{Y_{1:n}}[\hat{f}_n(x)] - f_*(x) \;=\; \sum_{i=1}^{n} \big(f_*(X_i) - f_*(x)\big)\, W_i(x).$$

Suppose $f_*$ is 1-Lipschitz. Since $K_h$ is zero outside the $h$-radius ball,

$$\big|\mathbb{E}_{Y_{1:n}}[\hat{f}_n(x)] - f_*(x)\big|^2 \;\le\; h^2 .$$

SLIDE 8

Variance: we have

$$\hat{f}_n(x) - \mathbb{E}_{Y_{1:n}}[\hat{f}_n(x)] \;=\; \sum_{i=1}^{n} \big(Y_i - f_*(X_i)\big)\, W_i(x).$$

The expectation of the square of this difference is at most

$$\mathbb{E}\Big[\sum_{i=1}^{n} \big(Y_i - f_*(X_i)\big)^2\, W_i(x)^2\Big]$$

since the cross terms are zero (fix the $X$'s, take expectation with respect to the $Y$'s). We are left analyzing

$$n\,\mathbb{E}\left[\frac{K_h(x - X_1)^2}{\big(\sum_{i=1}^{n} K_h(x - X_i)\big)^2}\right].$$

Under some assumptions on the density of $X$, the denominator is at least $(nh^d)^2$ with high probability, whereas $\mathbb{E}\,K_h(x - X_1)^2 = O(h^d)$ assuming $\int K^2 < \infty$. This gives an overall variance of $O(1/(nh^d))$. Many details are skipped here (e.g. problems at the boundary, assumptions, etc.).

SLIDE 9

Overall, bias and variance with $h \sim n^{-\frac{1}{2+d}}$ yield

$$\mathbb{E}\,\|\hat{f}_n - f_*\|^2 \;\lesssim\; h^2 + \frac{1}{n h^d} \;=\; n^{-\frac{2}{2+d}}.$$

(This choice of $h$ balances the two terms: setting $h^2 = 1/(nh^d)$ gives $h = n^{-1/(2+d)}$.)

SLIDE 10

Outline

▸ Local Kernel Regression: Nadaraya-Watson
▸ Interpolation
▸ Local Methods
▸ Kernel Ridge(less) Regression
▸ Wide Neural Networks
▸ Summary

SLIDE 11

Can a learning method be successful if it interpolates the training data?


SLIDE 12

Bias-Variance and Overfitting

(from "Elements of Statistical Learning," Hastie, Tibshirani, Friedman)

SLIDE 13

Outline

▸ Local Kernel Regression: Nadaraya-Watson
▸ Interpolation
▸ Local Methods
▸ Kernel Ridge(less) Regression
▸ Wide Neural Networks
▸ Summary

SLIDE 14

Consider the Nadaraya-Watson estimator. Take a kernel that approaches a large value $\tau$ at $0$, e.g.

$$K(x) = \min\{1/\|x\|^{\alpha},\ \tau\}.$$

Large $\tau$ means $\hat{f}_n(X_i) \approx Y_i$, since the weight $W_i(X_i)$ is dominating. If $\tau = \infty$, we get interpolation $\hat{f}_n(X_i) = Y_i$ of all training data. Yet, the sketched proof still goes through. Hence, "memorizing the data" (governed by the parameter $\tau$) is completely decoupled from the bias-variance trade-off (as given by the parameter $h$). Contrast this with conventional wisdom: fitting the data too well means overfitting.

NB: Of course, we could always redefine any $\hat{f}_n$ to be equal to $Y_i$ on $X_i$, but our example shows more explicitly how memorization is governed by a parameter that is independent of bias-variance.
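
A minimal sketch (not from the slides) of this singular-kernel construction; the truncation at radius h, the choice α = 1, and the toy data are illustrative assumptions.

```python
# Illustrative sketch (not from the lecture): Nadaraya-Watson with a singular
# kernel K(x) = min{1/||x||^alpha, tau}, truncated at bandwidth h.  As tau grows,
# the weight at a training point dominates and f_hat(X_i) -> Y_i, while the
# bias-variance behaviour is still controlled by h.
import numpy as np

def singular_kernel(dist, alpha, tau):
    """K(x) = min{1/||x||^alpha, tau}; dist is ||x - X_i||."""
    with np.errstate(divide="ignore"):
        vals = np.where(dist > 0, 1.0 / dist ** alpha, np.inf)
    return np.minimum(vals, tau)

def nw_singular(x, X, Y, h, alpha=1.0, tau=1e12):
    dists = np.linalg.norm(X - x, axis=1)
    weights = singular_kernel(dists, alpha, tau) * (dists <= h)  # zero outside radius h
    return np.dot(weights, Y) / weights.sum()

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 2))
Y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)
print(nw_singular(X[0], X, Y, h=0.3), Y[0])   # prediction at a training point ~ Y[0]
```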

SLIDE 15

What is overfitting?

▸ Fitting data too well?
▸ Bias too low, variance too high?

Key takeaway: we should not conflate these two.


SLIDE 16

Outline

▸ Local Kernel Regression: Nadaraya-Watson
▸ Interpolation
▸ Local Methods
▸ Kernel Ridge(less) Regression
▸ Wide Neural Networks
▸ Summary

SLIDE 17

We saw that local methods such as Nadaraya-Watson can interpolate the data yet generalize. How about global methods such as (regularized) least squares? Below, we will show that minimum-norm interpolants of the data (which can be seen as limiting solutions when we turn off regularization) can indeed generalize.


SLIDE 18

First, recall Ridge Regression:

$$\hat{w}_\lambda = \operatorname*{argmin}_{w\in\mathbb{R}^d}\ \sum_{i=1}^{n} \big(\langle w, x_i\rangle - y_i\big)^2 + \lambda\,\|w\|^2$$

has the closed-form solution

$$\hat{w}_\lambda \;=\; X^{\top}\underbrace{(XX^{\top} + \lambda I)^{-1} Y}_{c} \;=\; X^{\top} c \;=\; \sum_{i=1}^{n} c_i x_i,$$

implying the functional form

$$\hat{f}_\lambda(x) \;=\; \langle \hat{w}_\lambda, x\rangle \;=\; \sum_{i=1}^{n} c_i\, \langle x, x_i\rangle .$$
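
A quick numerical sketch (not from the slides) of the two equivalent closed forms; the toy data are an assumption.

```python
# Illustrative sketch (not from the lecture): ridge regression via the
# "dual" closed form w_hat = X^T (X X^T + lambda I)^{-1} Y, checked against
# the usual primal form (X^T X + lambda I)^{-1} X^T Y.
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 50, 200, 0.1                             # n < d: more features than samples
X = rng.standard_normal((n, d))
Y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

c = np.linalg.solve(X @ X.T + lam * np.eye(n), Y)    # c = (XX^T + lam I)^{-1} Y
w_dual = X.T @ c                                     # w_hat = X^T c = sum_i c_i x_i
w_primal = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

print(np.allclose(w_dual, w_primal))                 # the two forms coincide
```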

SLIDE 19

Kernel Ridge Regression:

$$\hat{f}_\lambda = \operatorname*{argmin}_{f\in\mathcal{F}}\ \sum_{i=1}^{n} \big(f(x_i) - y_i\big)^2 + \lambda\,\|f\|_K^2$$

Representer Theorem:

$$\hat{f}_\lambda(x) = \sum_{i=1}^{n} c_i\, K(x, x_i) \qquad (1)$$

The solution to Kernel Ridge Regression is given by $c = (K + \lambda I)^{-1} Y$, where $[K]_{i,j} = K(x_i, x_j)$, and the functional form (1) can be written succinctly as

$$\hat{f}_\lambda(x) = K(x, X)^{\top}(K + \lambda I)^{-1} Y,$$

where $K(x, X) = [K(x, x_1), K(x, x_2), \ldots, K(x, x_n)]^{\top}$.
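
A minimal sketch (not from the slides) of kernel ridge regression with a Gaussian kernel; the kernel choice and the toy data are assumptions.

```python
# Illustrative sketch (not from the lecture): kernel ridge regression with a
# Gaussian (RBF) kernel, f_hat(x) = K(x, X)^T (K + lambda I)^{-1} Y.
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gram matrix K[i, j] = exp(-gamma * ||A_i - B_j||^2)."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(100, 3))
Y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(100)

lam = 1e-2
K = rbf_kernel(X, X)
c = np.linalg.solve(K + lam * np.eye(len(X)), Y)     # c = (K + lam I)^{-1} Y

x_test = rng.uniform(-1, 1, size=(5, 3))
predictions = rbf_kernel(x_test, X) @ c              # K(x, X)^T c for each test point
print(predictions)
```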

SLIDE 20

Min-Norm Interpolation (Ridgeless Regression)

Linear case with $n < d$: the limiting solution as $\lambda \to 0$ is the minimum-norm solution that interpolates the data. Indeed,

$$\hat{w}_0 = X^{\top}(XX^{\top})^{-1} Y$$

is the unique solution in the span of the data; solutions outside the span have larger norms.

Kernel case: the $\lambda \to 0$ solution (as a function) is

$$\hat{f}_0(x) = K(x, X)^{\top} K^{-1} Y,$$

which we can write as the solution to

$$\operatorname*{argmin}_{f\in\mathcal{F}}\ \|f\|_K \quad \text{s.t.}\quad f(x_i) = y_i .$$
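
A minimal sketch (not from the slides) of the ridgeless limit in the linear case with n < d; the tiny-λ comparison and the toy data are assumptions.

```python
# Illustrative sketch (not from the lecture): the min-norm interpolant as the
# ridgeless limit.  With n < d, w_0 = X^T (X X^T)^{-1} Y fits the data exactly,
# agrees with ridge regression as lambda -> 0, and matches numpy's
# least-squares routine, which returns the minimum-norm solution.
import numpy as np

rng = np.random.default_rng(0)
n, d = 30, 100
X = rng.standard_normal((n, d))
Y = rng.standard_normal(n)

w0 = X.T @ np.linalg.solve(X @ X.T, Y)                            # min-norm interpolant
w_ridge = X.T @ np.linalg.solve(X @ X.T + 1e-10 * np.eye(n), Y)   # tiny lambda
w_lstsq = np.linalg.lstsq(X, Y, rcond=None)[0]                    # numpy's min-norm solution

print(np.allclose(X @ w0, Y))                                     # interpolates the data
print(np.allclose(w0, w_ridge, atol=1e-6), np.allclose(w0, w_lstsq))
```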

SLIDE 21

Bias-Variance Analysis of Kernel Ridgeless Regression

Variance: we have

$$\hat{f}_n(x) - \mathbb{E}_{Y_{1:n}}[\hat{f}_n(x)] \;=\; K(x, X)^{\top} K^{-1}\big(Y - f_*(X)\big),$$

where $f_*(X) = [f_*(x_1), \ldots, f_*(x_n)]^{\top}$. Then

$$\mathbb{E}_S\,\big\|\hat{f}_n(x) - \mathbb{E}_{Y_{1:n}}[\hat{f}_n(x)]\big\|^2 \;\le\; \sigma_Y^2 \cdot \mathbb{E}\,\big\|K(x, X)^{\top} K^{-1}\big\|^2,$$

where $\sigma_Y^2$ is a uniform upper bound on the variance of the noise.
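
A minimal sketch (not from the slides) that estimates this variance term by Monte Carlo for an inner-product kernel; the choice g(t) = e^t and the Gaussian design are assumptions.

```python
# Illustrative sketch (not from the lecture): Monte Carlo estimate of the
# variance term E ||K(x, X)^T K^{-1}||^2 for an inner-product kernel
# k(x, x') = g(<x, x'>/d), here with g(t) = exp(t) as an assumed example.
import numpy as np

def kernel(A, B, d):
    return np.exp(A @ B.T / d)

rng = np.random.default_rng(0)
n, d, trials = 200, 50, 50
vals = []
for _ in range(trials):
    X = rng.standard_normal((n, d))                    # training inputs
    x = rng.standard_normal((1, d))                    # fresh test point
    K = kernel(X, X, d)
    v = np.linalg.solve(K, kernel(X, x, d)).ravel()    # K^{-1} K(X, x)
    vals.append(np.sum(v ** 2))                        # ||K(x, X)^T K^{-1}||^2
print(np.mean(vals))
```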

SLIDE 22

Bias-Variance Analysis of Kernel Ridgeless Regression

(Liang, R., Zhai '19): under appropriate assumptions, for kernels of the form $k(x, x') = g(\langle x, x'\rangle / d)$,

$$\mathbb{E}\,\big\|K(x, X)^{\top} K^{-1}\big\|^2 \;\lesssim\; \min_{i\in\mathbb{N}}\ \left\{\frac{d^{\,i}}{n} + \frac{n}{d^{\,i+1}}\right\},$$

and the bias is dominated by the variance.

Conclusion: the out-of-sample error of minimum-norm interpolation can be small if $d \asymp n^{\alpha}$ with $\alpha \in (0, 1)$ and not the inverse of an integer.

SLIDE 23

High dimensionality required

Interpolation is not always a good idea! Take the Laplace kernel

$$K_\sigma(x, x') = \sigma^{-d}\exp\{-\|x - x'\|/\sigma\}$$

and let $\hat{f}_n$ be the minimum-norm interpolant, as before.

(R. and Zhai '18): with probability $1 - O(n^{-1/2})$, for any choice of $\sigma$,

$$\mathbb{E}\,\|\hat{f}_n - f_*\|^2 \;\ge\; \Omega_d(1).$$

Hence, interpolation with the Laplace kernel does not work in small $d$. High dimensionality can help!

SLIDE 24

Outline

▸ Local Kernel Regression: Nadaraya-Watson
▸ Interpolation
▸ Local Methods
▸ Kernel Ridge(less) Regression
▸ Wide Neural Networks
▸ Summary

SLIDE 25

We now turn to a particular setting of wide randomly initialized neural networks and sketch an argument that backprop on such networks leads to an approximate minimum-norm interpolant with respect to a certain kernel. Hence, the analysis of the previous part applies to these neural networks. Unlike the a-posteriori margin-based bounds for NN, the analysis we present is somewhat more satisfying since it includes the Bayes error term (see discussion at the end) and elucidates the implicit regularization of gradient descent on wide neural nets.


SLIDE 26

One-hidden-layer NN:

$$f(x) = f(x; W, a) = \frac{1}{\sqrt{m}}\sum_{i=1}^{m} a_i\,\sigma(w_i^{\top} x), \qquad (2)$$

where $W = (w_1, \ldots, w_m) \in \mathbb{R}^{d\times m}$ and $a \in \mathbb{R}^m$.

Square loss:

$$L = \frac{1}{2n}\sum_{j=1}^{n}\big(f(x_j; W, a) - y_j\big)^2 \qquad (3)$$

Gradients:

$$\frac{\partial L}{\partial a_i} = \frac{1}{n}\sum_{j=1}^{n}\frac{\sigma(w_i^{\top} x_j)}{\sqrt{m}}\big(f(x_j; W, a) - y_j\big), \qquad (4)$$

and

$$\frac{\partial L}{\partial w_i} = \frac{1}{n}\sum_{j=1}^{n}\frac{a_i\, x_j\,\sigma'(w_i^{\top} x_j)}{\sqrt{m}}\big(f(x_j; W, a) - y_j\big). \qquad (5)$$

Gradient flow (continuous version of backprop):

$$\frac{d w_i(t)}{dt} = -\frac{\partial L}{\partial w_i}, \qquad \frac{d a_i(t)}{dt} = -\frac{\partial L}{\partial a_i}.$$
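
A minimal sketch (not from the slides) of the network (2), the loss (3), and the gradient (5), with gradient flow discretized as plain gradient descent and the second layer a held fixed (as on the next slide); widths, step size, and data are illustrative assumptions.

```python
# Illustrative sketch (not from the lecture): one-hidden-layer ReLU network
# f(x; W, a) = (1/sqrt(m)) sum_i a_i sigma(w_i^T x), its square loss, and the
# gradient (5), applied as discretized gradient flow (plain gradient descent).
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def f(X, W, a):
    """X: (n, d), W: (d, m), a: (m,).  Returns network outputs, shape (n,)."""
    return relu(X @ W) @ a / np.sqrt(W.shape[1])

rng = np.random.default_rng(0)
n, d, m, lr = 10, 20, 2000, 0.2
X = rng.standard_normal((n, d))
Y = rng.standard_normal(n)
W = rng.standard_normal((d, m))          # random initialization
a = rng.choice([-1.0, 1.0], size=m)      # keep a fixed for simplicity

for _ in range(2000):                     # Euler discretization of gradient flow
    residual = f(X, W, a) - Y                                # (n,)
    act_grad = (X @ W > 0).astype(float)                     # sigma'(w_i^T x_j)
    # dL/dw_i = (1/n) sum_j a_i x_j sigma'(w_i^T x_j) residual_j / sqrt(m)
    grad_W = X.T @ (act_grad * residual[:, None]) * a / (n * np.sqrt(m))
    W -= lr * grad_W

print(np.max(np.abs(f(X, W, a) - Y)))     # max training residual (near zero in this wide regime)
```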

SLIDE 27

For brevity, write $f_t(x) = f(x; W(t), a(t))$, and let's actually not update $a(t)$, for simplicity of presentation. How does $f_t$ evolve under gradient flow?

$$\frac{d f_t(x)}{dt} = \sum_{i=1}^{m}\left(\frac{\partial f_t(x)}{\partial w_i}\right)^{\top}\frac{d w_i}{dt} + \sum_{i=1}^{m}\left(\frac{\partial f_t(x)}{\partial a_i}\right)^{\top}\frac{d a_i}{dt} = -\sum_{i=1}^{m}\left(\frac{\partial f_t(x)}{\partial w_i}\right)^{\top}\frac{\partial L}{\partial w_i}$$

$$= -\sum_{i=1}^{m}\left(\frac{\partial f_t(x)}{\partial w_i}\right)^{\top}\left(\frac{1}{n}\sum_{j=1}^{n}\big(f_t(x_j) - y_j\big)\frac{\partial f_t(x_j)}{\partial w_i}\right) = -\frac{1}{n}\sum_{j=1}^{n}\big(f_t(x_j) - y_j\big)\underbrace{\sum_{i=1}^{m}\left(\frac{\partial f_t(x)}{\partial w_i}\right)^{\top}\left(\frac{\partial f_t(x_j)}{\partial w_i}\right)}_{K_m(x,\,x_j)}$$

where

$$K_m(x, x') = \frac{1}{m}\sum_{i=1}^{m} a_i^2\,\sigma'(w_i^{\top} x)\,\sigma'(w_i^{\top} x')\,(x^{\top} x').$$
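
A minimal sketch (not from the slides) of the empirical kernel K_m at a random ReLU initialization; dimensions and the ±1 second layer are assumptions.

```python
# Illustrative sketch (not from the lecture): the empirical kernel
# K_m(x, x') = (1/m) sum_i a_i^2 sigma'(w_i^T x) sigma'(w_i^T x') (x^T x')
# for a ReLU network (sigma'(z) = 1{z > 0}) at a random initialization.
import numpy as np

def empirical_ntk(X, W, a):
    """K_m over all pairs of rows of X.  X: (n, d), W: (d, m), a: (m,)."""
    act = (X @ W > 0).astype(float)                  # sigma'(w_i^T x_j), shape (n, m)
    weighted = act * (a ** 2)                        # a_i^2 sigma'(w_i^T x_j)
    return (weighted @ act.T) * (X @ X.T) / W.shape[1]

rng = np.random.default_rng(0)
n, d, m = 5, 10, 50_000
X = rng.standard_normal((n, d))
W = rng.standard_normal((d, m))
a = rng.choice([-1.0, 1.0], size=m)
print(empirical_ntk(X, W, a))
```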

SLIDE 28

As $m \to \infty$, $K_m$ converges to

$$K_\infty(x, x') = \mathbb{E}_{a,w}\, a^2\,\sigma'(w^{\top} x)\,\sigma'(w^{\top} x')\,(x^{\top} x') = \frac{\pi - \angle(x, x')}{2\pi}\,(x^{\top} x').$$

First, consider the dynamics

$$\frac{d f_t(x)}{dt} = -\frac{1}{n}\sum_{j=1}^{n}\big(f_t(x_j) - y_j\big)\,K_\infty(x, x_j)$$

and let's specialize to the dynamics of $V_t = (f_t(x_1), \ldots, f_t(x_n))$, the values of the solution on the data:

$$\frac{d}{dt} V_t = -K\cdot(V_t - Y), \qquad K_{i,j} = \tfrac{1}{n}K_\infty(x_i, x_j).$$

The solution is

$$V_t = Y + e^{-tK}(V_0 - Y),$$

and hence the function $f_t$ quickly converges to the values $y_i$ on the $x_i$, assuming $K \succ 0$.
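
A minimal sketch (not from the slides) of the limiting ReLU kernel K∞ and the resulting linear dynamics on the training values; the toy data are an assumption.

```python
# Illustrative sketch (not from the lecture): the m -> infinity ReLU kernel
# K_inf(x, x') = (pi - angle(x, x')) / (2*pi) * (x^T x'), and the induced linear
# dynamics V_t = Y + exp(-t K)(V_0 - Y) on the training points.
import numpy as np
from scipy.linalg import expm

def ntk_infinity(X):
    G = X @ X.T                                        # Gram matrix x_i^T x_j
    norms = np.sqrt(np.diag(G))
    cos = np.clip(G / np.outer(norms, norms), -1.0, 1.0)
    return (np.pi - np.arccos(cos)) / (2 * np.pi) * G

rng = np.random.default_rng(0)
n, d = 8, 10
X = rng.standard_normal((n, d))
Y = rng.standard_normal(n)

K = ntk_infinity(X) / n                                # K_{i,j} = (1/n) K_inf(x_i, x_j)
V0 = np.zeros(n)                                       # function values at t = 0
for t in [1.0, 10.0, 100.0]:
    Vt = Y + expm(-t * K) @ (V0 - Y)
    print(t, np.max(np.abs(Vt - Y)))                   # -> 0 as t grows (K > 0)
```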

SLIDE 29

So, at least in the continuous-time and $m \to \infty$ regime, the solution interpolates the data. (Du et al 2018) performed the finite-time analysis. Briefly, there are two main difficulties: the gap between $K_m$ and $K_\infty$, and the fact that $K_m = K_m(t)$ also changes with time and can differ from $K_m(0)$.

What happens to the rest of the function? Let's go back to

$$\frac{d f_t(x)}{dt} = -\frac{1}{n}\sum_{j=1}^{n}\big(f_t(x_j) - y_j\big)\,K_\infty(x, x_j) = -K_\infty(x, X)(V_t - Y),$$

where $K_\infty(x, X)$ denotes the row vector with entries $\tfrac{1}{n}K_\infty(x, x_j)$. Then

$$f_t(x) \approx f_0(x) - \int_0^t K_\infty(x, X)\, e^{-sK}(V_0 - Y)\, ds$$

$$= f_0(x) - K_\infty(x, X)\left[\int_0^t e^{-sK}\, ds\right](V_0 - Y)$$

$$= f_0(x) - K_\infty(x, X)\,K^{-1}\big(I - e^{-tK}\big)(V_0 - Y)$$

$$= \underbrace{K_\infty(x, X)\,K^{-1}\big(I - e^{-tK}\big)Y}_{\text{min-norm interpolant (as } t \to \infty)} \;+\; \Big\{f_0(x) - K_\infty(x, X)\,K^{-1}\big(I - e^{-tK}\big)V_0\Big\}.$$

SLIDE 30

Here is another motivation for the result. Expand the neural network as a linear function around the random initialization $W_0$:

$$f(x; W) \approx f(x; W_0) + \langle W - W_0,\ \nabla_{W_0} f(x; W_0)\rangle .$$

If the weights $W$ do not move far from initialization, the function is approximately linear in the weights. This means that gradient flow quickly converges to a zero square-loss solution (with min norm). Essentially, the Neural Tangent Kernel (NTK) regime is a linearization of the network in which $x \mapsto \nabla_{W_0} f(x; W_0)$ becomes the feature map. Clearly, in this regime we are losing the "nonlinearity" in the weights that we desired when we left the land of kernel methods. Large width of the NN brought us back.
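
A minimal sketch (not from the slides) of this linearization for the one-hidden-layer ReLU network with a fixed: the feature map x ↦ ∇_{W0} f(x; W0) and a check that its Gram matrix equals K_m; dimensions are illustrative assumptions.

```python
# Illustrative sketch (not from the lecture): the NTK linearization.  For the
# one-hidden-layer ReLU network above (with a fixed), the feature map
# x -> grad_{W0} f(x; W0) has blocks a_i * 1{w_i^T x > 0} * x / sqrt(m), and
# the kernel K_m is just the inner product of these features.
import numpy as np

def ntk_features(X, W0, a):
    """Rows are flattened gradients of f(x; W0) w.r.t. W0.  Shape (n, d*m)."""
    n, d = X.shape
    m = W0.shape[1]
    act = (X @ W0 > 0).astype(float)                             # 1{w_i^T x > 0}, shape (n, m)
    # gradient w.r.t. w_i is a_i * 1{w_i^T x > 0} * x / sqrt(m)
    feats = (act * a)[:, None, :] * X[:, :, None] / np.sqrt(m)   # (n, d, m)
    return feats.reshape(n, d * m)

rng = np.random.default_rng(0)
n, d, m = 6, 4, 300
X = rng.standard_normal((n, d))
W0 = rng.standard_normal((d, m))
a = rng.choice([-1.0, 1.0], size=m)

Phi = ntk_features(X, W0, a)                 # linearized feature matrix, (n, d*m)
act = (X @ W0 > 0).astype(float)
K_m = (act * a**2) @ act.T * (X @ X.T) / m   # K_m from the previous slides
print(np.allclose(Phi @ Phi.T, K_m))         # inner products of features give K_m
```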


SLIDE 31

Outline

▸ Local Kernel Regression: Nadaraya-Watson
▸ Interpolation
▸ Local Methods
▸ Kernel Ridge(less) Regression
▸ Wide Neural Networks
▸ Summary

SLIDE 32

A few remarks.

▸ First, zero training error does not contradict consistency (that is, expected error close to the Bayes error). We already knew this from our discussion of the Perceptron, but the phenomenon is a bit more surprising when the Bayes error is not close to zero (that is, there is noise).

▸ Since the Bayes error is a constant but the training error is zero, we should probably not use uniform deviations as a step in our proofs for the interpolation regime (since the empirical error cannot be close to the expected error). This is not a precise statement, but a hunch.

▸ The bias-variance decomposition can circumvent the uniform-deviations analysis; the latter is typically tailored to ERM/MLE.

▸ In the RKHS and NN examples, there are infinitely many ERMs and some of them are terrible, so the particular regularity (e.g. min norm) has to play a key role. Thus, the ERM-style analysis of passing to uniform deviations is loose unless one finds a good modification.

▸ In experiments, increasing the noise level (equivalently, the Bayes error) leads to solutions with equally-increased expected loss (the 45-degree line being ideal). This phenomenon can be captured when we show that $\mathbb{E}\,\|\hat{f}_n - f_*\|^2$ is small, since this quantity can be written as the expected loss minus the Bayes error. It is harder to justify margin bounds in this regime, since there is no Bayes error term.