

slide-1
SLIDE 1

Intro. Min-norm Interpolant Regression Classification

Theory for Minimum Norm Interpolation: Regression and Classification in High Dimensions

Tengyuan Liang Classification: with Pragya Sur (Harvard) Regression: with Sasha Rakhlin (MIT), Xiyu Zhai (MIT)

1 / 37

slide-2
SLIDE 2

Intro. Min-norm Interpolant Regression Classification

OUTLINE

  • Motivation: min-norm interpolants
  • Regression: multiple descent of risk
  • Classification: boosting on separable data

2 / 37

slide-3
SLIDE 3

Intro. Min-norm Interpolant Regression Classification

OUTLINE

  • Motivation: min-norm interpolants
  • Regression: multiple descent of risk
  • application to wide neural networks
  • restricted lower isometry of kernels
  • small-ball property
  • Classification: boosting on separable data
  • precise high-dim asymptotics
  • convex Gaussian min-max theorem
  • algorithmic implications on boosting

2 / 37

slide-4
SLIDE 4

Intro. Min-norm Interpolant Regression Classification

OVER-PARAMETRIZED REGIME OF STAT/ML

Model class complex enough to interpolate the training data.

[Figure: Kernel Regression on MNIST — log(error) vs. ridge penalty λ ∈ [0, 1.2], one curve per digits pair [i, j]: [2,5], [2,9], [3,6], [3,8], [4,7].]

λ = 0: the interpolants on training data.

MNIST data from LeCun et al. (2010)

3 / 37

slide-5
SLIDE 5

Intro. Min-norm Interpolant Regression Classification

OVER-PARAMETRIZED REGIME OF STAT/ML

Model class complex enough to interpolate the training data.

[Figure: Kernel Regression on MNIST — log(error) vs. ridge penalty λ ∈ [0, 1.2], one curve per digits pair [i, j] with i ∈ {2, 3, 4} and j ∈ {5, 6, 7, 8, 9}.]

λ = 0: the interpolants on training data.

MNIST data from LeCun et al. (2010)
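To make the figure concrete, here is a minimal sketch of this kind of experiment: kernel ridge regression on one digits pair with the penalty λ swept toward 0 (the interpolating limit). It uses scikit-learn's small digits dataset as a stand-in for MNIST; the RBF kernel, gamma, and the λ grid are illustrative assumptions, not the exact setup behind the plot.

```python
# Minimal sketch of the experiment behind the figure: kernel ridge regression
# on one digits pair, sweeping the penalty lambda toward 0 (the interpolating
# limit). Uses sklearn's small digits set as a stand-in for MNIST; the RBF
# kernel, gamma, and the lambda grid are illustrative choices.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
mask = np.isin(y, [3, 8])                              # one digits pair, e.g. [3, 8]
X, y = X[mask] / 16.0, np.where(y[mask] == 3, -1.0, 1.0)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.5, random_state=0)

for lam in [1.0, 0.1, 0.01, 1e-4, 1e-8]:               # lambda -> 0: the interpolant
    model = KernelRidge(alpha=lam, kernel="rbf", gamma=0.02).fit(Xtr, ytr)
    err = np.mean(np.sign(model.predict(Xte)) != yte)
    print(f"lambda = {lam:g}   test error = {err:.3f}")
```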

3 / 37

slide-6
SLIDE 6

Intro. Min-norm Interpolant Regression Classification

OVER-PARAMETRIZED REGIME OF STAT/ML

Model class complex enough to interpolate the training data.

Zhang, Bengio, Hardt, Recht, and Vinyals (2016)

In fact, many models behave the same on the training data, yet practical methods and algorithms favor certain functions. Principle: among the models that interpolate, algorithms favor a certain form of minimalism.

4 / 37

slide-7
SLIDE 7

Intro. Min-norm Interpolant Regression Classification

OVER-PARAMETRIZED REGIME OF STAT/ML

Principle: among the models that interpolate, algorithms favor a certain form of minimalism.

  • over-parametrized linear model and matrix factorization
  • kernel machines
  • support vector machines
  • boosting, AdaBoost
  • two-layer ReLU networks

4 / 37

slide-8
SLIDE 8

Intro. Min-norm Interpolant Regression Classification

OVER-PARAMETRIZED REGIME OF STAT/ML

Principle: among the models that interpolate, algorithms favor a certain form of minimalism.

  • over-parametrized linear model and matrix factorization
  • kernel machines
  • support vector machines
  • boosting, AdaBoost
  • two-layer ReLU networks

Minimalism is typically measured in terms of a norm, which motivates the study of min-norm interpolants.

4 / 37

slide-9
SLIDE 9

Intro. Min-norm Interpolant Regression Classification

MIN-NORM INTERPOLANTS

Minimalism is typically measured in terms of a norm, which motivates the study of min-norm interpolants.

Regression:  f̂ = argmin_f ∥f∥_norm,  s.t.  yi = f(xi) ∀i ∈ [n].

Classification:  f̂ = argmin_f ∥f∥_norm,  s.t.  yi ⋅ f(xi) ≥ 1 ∀i ∈ [n].

5 / 37

slide-10
SLIDE 10

Intro. Min-norm Interpolant Regression Classification

REGRESSION

Multiple Descent of Minimum-Norm Interpolants and Restricted Lower Isometry of Kernels with Sasha Rakhlin (MIT), Xiyu Zhai (MIT)

6 / 37

slide-11
SLIDE 11

Intro. Min-norm Interpolant Regression Classification

SHAPE OF RISK CURVE

Classic: U-shaped curve. Recent: double-descent curve.

Belkin, Hsu, Ma, and Mandal (2018); Hastie, Montanari, Rosset, and Tibshirani (2019)

Question: shape of the risk curve w.r.t. “over-parametrization”?

7 / 37

slide-12
SLIDE 12

Intro. Min-norm Interpolant Regression Classification

SHAPE OF RISK CURVE

Classic: U-shaped curve. Recent: double-descent curve.

Belkin, Hsu, Ma, and Mandal (2018); Hastie, Montanari, Rosset, and Tibshirani (2019)

Question: shape of the risk curve w.r.t. “over-parametrization”? We model the intrinsic dimension as d = n^α with α ∈ (0, 1), with feature covariance Σd = Id, and consider a non-linear kernel regression model.

7 / 37

slide-13
SLIDE 13

Intro. Min-norm Interpolant Regression Classification

SHAPE OF RISK CURVE

We consider the intrinsic dim. d = n^α with α ∈ (0, 1), and a non-linear kernel regression model.

DGP.

  • {xi}_{i=1}^n i.i.d. ∼ µ = P⊗d; the distribution of each coordinate x ∼ P satisfies the weak moment condition ∀t > 0, P(∣x∣ > t) ≤ C(1 + t)^{−ν}.
  • target f⋆(x) := E[Y ∣ X = x], with bounded Var[Y ∣ X = x].

Kernel.

  • h ∈ C^∞(R), h(t) = ∑_{i=0}^∞ αi t^i with αi ≥ 0.
  • inner product kernel k(x, z) = h(⟨x, z⟩/d).

Target Function.

  • Assume f⋆(x) = ∫ k(x, z) ρ⋆(z) µ(dz) with ∥ρ⋆∥µ ≤ C.

8 / 37

slide-14
SLIDE 14

Intro. Min-norm Interpolant Regression Classification

SHAPE OF RISK CURVE

We consider the intrinsic dim. d = n^α with α ∈ (0, 1), and a non-linear kernel regression model. Given n i.i.d. data pairs (xi, yi) ∼ P_{X,Y}, what is the risk curve for the minimum RKHS-norm ∥⋅∥H interpolant f̂?

f̂ = argmin_f ∥f∥H,  s.t.  yi = f(xi) ∀i ∈ [n].
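By the representer theorem, this minimum RKHS-norm interpolant has the closed form f̂(·) = k(·, X) K(X, X)^{−1} y (ridgeless kernel regression), assuming K(X, X) is invertible. Below is a minimal numerical sketch under the setup above; the coordinate law P, the choice h(t) = exp(t), and the target function are illustrative assumptions.

```python
# Minimal sketch of the min-RKHS-norm interpolant f_hat(x) = k(x, X) K^{-1} y
# (ridgeless kernel regression) under the setup d = n^alpha, k(x, z) = h(<x, z>/d).
# The coordinate law P, the choice h(t) = exp(t), and the target are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, alpha = 500, 0.66
d = int(n ** alpha)

def kernel(A, B):                              # inner-product kernel k(x, z) = h(<x, z>/d)
    return np.exp(A @ B.T / d)                 # h(t) = exp(t): all Taylor coeffs >= 0

X = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(n, d))     # i.i.d. coordinates, variance 1
f_star = lambda Z: np.sin(Z[:, 0]) + 0.5 * Z[:, 1]          # illustrative target
y = f_star(X) + 0.1 * rng.standard_normal(n)

coef = np.linalg.solve(kernel(X, X), y)        # lambda = 0: exact interpolation of (X, y)

X_test = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(2000, d))
risk = np.mean((kernel(X_test, X) @ coef - f_star(X_test)) ** 2)
print(f"n = {n}, d = {d}: estimated risk of the interpolant = {risk:.4f}")
```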

8 / 37

slide-15
SLIDE 15

Intro. Min-norm Interpolant Regression Classification

SHAPE OF RISK CURVE

Theorem (L., Rakhlin & Zhai, ’19). For any integer ι ≥ 1, consider d = n^α where α ∈ (1/(ι+1), 1/ι).

slide-16
SLIDE 16

Intro. Min-norm Interpolant Regression Classification

SHAPE OF RISK CURVE

Theorem (L., Rakhlin & Zhai, ’19). For any integer ι ≥ 1, consider d = n^α where α ∈ (1/(ι+1), 1/ι). With probability at least 1 − δ − e^{−n/d^ι} on the design X ∈ R^{n×d},

E[∥f̂ − f⋆∥²µ ∣ X] ≤ C ⋅ (d^ι/n + n/d^{ι+1}) ≍ n^{−β},  β := min{(ι + 1)α − 1, 1 − ια}.

Here the constant C(δ, ι, h, P) does not depend on d, n.

9 / 37

slide-17
SLIDE 17

Intro. Min-norm Interpolant Regression Classification

MULTIPLE DESCENT

[Figure: rate exponent β = min{(ι + 1)α − 1, 1 − ια} plotted against the scaling exponent α (d = n^α); x-axis ticks at 1/4, 1/3, 1/2, 1.]

  • multiple-descent behavior of the rates as the scaling d = n^α changes.

10 / 37

slide-18
SLIDE 18

Intro. Min-norm Interpolant Regression Classification

MULTIPLE DESCENT

[Figure: rate exponent β vs. scaling exponent α (d = n^α).]

  • multiple-descent behavior of the rates as the scaling d = n^α changes.
  • valley: a “valley” on the rate curve at d = n^{1/(ι+1/2)}, ι ∈ N

10 / 37

slide-19
SLIDE 19

Intro. Min-norm Interpolant Regression Classification

MULTIPLE DESCENT

[Figure: rate exponent β vs. scaling exponent α (d = n^α).]

  • multiple-descent behavior of the rates as the scaling d = n^α changes.
  • valley: a “valley” on the rate curve at d = n^{1/(ι+1/2)}, ι ∈ N
  • over-parametrization: towards the over-parametrized regime, the rate at the bottom of the valley improves

10 / 37

slide-20
SLIDE 20

Intro. Min-norm Interpolant Regression Classification

MULTIPLE DESCENT

[Figure: rate exponent β vs. scaling exponent α (d = n^α).]

  • multiple-descent behavior of the rates as the scaling d = n^α changes.
  • valley: a “valley” on the rate curve at d = n^{1/(ι+1/2)}, ι ∈ N (derivation sketched below)
  • over-parametrization: towards the over-parametrized regime, the rate at the bottom of the valley improves
  • empirical: preliminary empirical evidence of multiple descent
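Where the valley location and its rate come from: balancing the two terms in the exponent β from the theorem reproduces the d = n^{1/(ι+1/2)} location quoted above. The short derivation below is supplied here for completeness.

```latex
% Balance the two terms of \beta = \min\{(\iota+1)\alpha - 1,\; 1 - \iota\alpha\}
% over \alpha \in (1/(\iota+1), 1/\iota):
\[
(\iota+1)\alpha - 1 = 1 - \iota\alpha
\;\Longrightarrow\;
\alpha = \frac{2}{2\iota+1} = \frac{1}{\iota + 1/2},
\qquad
\beta = 1 - \iota\cdot\frac{2}{2\iota+1} = \frac{1}{2\iota+1}.
\]
% So the \iota-th valley sits at d = n^{1/(\iota+1/2)}, where the risk is
% \asymp n^{-1/(2\iota+1)}: n^{-1/3} for \iota = 1, n^{-1/5} for \iota = 2, etc.,
% which is why the best achievable rate improves toward the over-parametrized end.
```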

10 / 37

slide-21
SLIDE 21

Intro. Min-norm Interpolant Regression Classification

EMPIRICAL EVIDENCE

Empirical evidence of multiple-descent behavior as the scaling d = n^α changes.

11 / 37

slide-22
SLIDE 22

Intro. Min-norm Interpolant Regression Classification

MULTIPLE DESCENT

[Figure: rate exponent β vs. scaling exponent α (d = n^α) — theory curve overlaid with empirical estimates.]

12 / 37

slide-23
SLIDE 23

Intro. Min-norm Interpolant Regression Classification

MULTIPLE DESCENT

[Figure: rate exponent β vs. scaling exponent α (d = n^α).]

  • multiple-descent behavior of the rates as the scaling d = n^α changes.
  • α = 1: Liang and Rakhlin (2018)
  • α = 0: Rakhlin and Zhai (2018)
  • α = 1, double descent: Belkin, Hsu, Ma, and Mandal (2018); Hastie, Montanari, Rosset, and Tibshirani (2019); Bartlett, Long, Lugosi, and Tsigler (2019)
  • general α, stair-case rates, random Fourier features: Ghorbani, Mei, Misiakiewicz, and Montanari (2019)

13 / 37

slide-24
SLIDE 24

Intro. Min-norm Interpolant Regression Classification

APPLICATION TO WIDE NEURAL NETWORKS

Neural Tangent Kernel (NTK)

Jacot, Gabriel, and Hongler (2018); Du, Zhai, Poczos, and Singh (2018)......

k_NTK(x, x′) = (1/4π) U(⟨x, x′⟩ / (∥x∥∥x′∥)),  where U(t) = 3t(π − arccos(t)) + √(1 − t²)

14 / 37

slide-25
SLIDE 25

Intro. Min-norm Interpolant Regression Classification

APPLICATION TO WIDE NEURAL NETWORKS

Neural Tangent Kernel (NTK)

Jacot, Gabriel, and Hongler (2018); Du, Zhai, Poczos, and Singh (2018)......

k_NTK(x, x′) = (1/4π) U(⟨x, x′⟩ / (∥x∥∥x′∥)),  U(t) = 3t(π − arccos(t)) + √(1 − t²)

Our results can be generalized to kernels of the form

k(x, x′) = ∑_{i=0}^∞ αi ⋅ (⟨x, x′⟩ / (∥x∥∥x′∥))^i,

which include the NTK.

Corollary (L., Rakhlin & Zhai, ’19). Consider an integer ι that satisfies d^ι log d ≾ n ≾ d^{ι+1}/log d; then

Risk ≾ d^ι/n + (n log d)/d^{ι+1}.
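A minimal sketch of this kernel class in code, instantiated with the NTK formula quoted above; the Gram matrix is then used like any other inner-product kernel, e.g. for the ridgeless interpolant. The data and the tiny jitter added for numerical stability are illustrative assumptions.

```python
# Minimal sketch of this kernel class, instantiated with the NTK formula quoted
# above: k_NTK(x, x') = (1/4pi) U(t), U(t) = 3t(pi - arccos t) + sqrt(1 - t^2),
# where t = <x, x'>/(|x||x'|). Data and the tiny jitter are illustrative choices.
import numpy as np

def ntk_kernel(X, Z):
    """Gram matrix of the NTK as written on the slide (t = cosine similarity)."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Zn = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    t = np.clip(Xn @ Zn.T, -1.0, 1.0)
    return (3.0 * t * (np.pi - np.arccos(t)) + np.sqrt(1.0 - t * t)) / (4.0 * np.pi)

# Usage: ridgeless (min-RKHS-norm) interpolation with the NTK Gram matrix.
rng = np.random.default_rng(1)
n, d = 300, 50
X = rng.standard_normal((n, d))
y = np.sign(X[:, 0]) + 0.1 * rng.standard_normal(n)         # illustrative labels
coef = np.linalg.solve(ntk_kernel(X, X) + 1e-10 * np.eye(n), y)
x_new = rng.standard_normal((5, d))
print(ntk_kernel(x_new, X) @ coef)                           # predictions at new points
```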

14 / 37

slide-26
SLIDE 26

Intro. Min-norm Interpolant Regression Classification

IDEAS BEHIND THE PROOF

Proof idea: on a filtration of spaces indexed by the polynomial basis, establish restricted lower isometry of the empirical kernel.

Filtrated empirical kernel:

nK^{[≤ι]}_{ij} := ∑_{r1,⋯,rd ≥ 0, r1+⋯+rd ≤ ι} c_{r1⋯rd} α_{r1+⋯+rd} p_{r1⋯rd}(xi) p_{r1⋯rd}(xj) / d^{r1+⋯+rd}

nK^{[≤ι]} = Φ Φ⊺,  with Φ ∈ R^{n × C(ι+d, ι)}.

Filtrated sample covariance operator:

Θ^{[≤ι]} := (1/n) Φ⊺ Φ ∈ R^{C(ι+d, ι) × C(ι+d, ι)}

15 / 37

slide-27
SLIDE 27

Intro. Min-norm Interpolant Regression Classification

IDEAS BEHIND THE PROOF

Proof idea: on a filtration of spaces indexed by the polynomial basis, establish restricted lower isometry of the empirical kernel.

Filtrated empirical kernel:

nK^{[≤ι]}_{ij} := ∑_{r1,⋯,rd ≥ 0, r1+⋯+rd ≤ ι} c_{r1⋯rd} α_{r1+⋯+rd} p_{r1⋯rd}(xi) p_{r1⋯rd}(xj) / d^{r1+⋯+rd}

nK^{[≤ι]} = Φ Φ⊺,  with Φ ∈ R^{n × C(ι+d, ι)}.

Filtrated sample covariance operator:

Θ^{[≤ι]} := (1/n) Φ⊺ Φ ∈ R^{C(ι+d, ι) × C(ι+d, ι)}

Restricted Lower Isometry of Kernel: all non-zero eigenvalues of K^{[≤ι]} are lower bounded by d^{−ι}, i.e.,

λ_min(Θ^{[≤ι]}) ≿ d^{−ι}
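A minimal numerical sketch of this claim: build the degree-≤ι filtrated polynomial features Φ with the scaling above, form Θ = Φ⊺Φ/n, and compare its smallest eigenvalue with d^{−ι}. The sizes, the coordinate law P, and the Taylor coefficients αi used below are illustrative assumptions.

```python
# Minimal numerical sketch of the restricted-lower-isometry claim: build the
# degree <= iota filtrated polynomial features Phi (scaled as above), form
# Theta = Phi^T Phi / n, and compare lambda_min(Theta) with d^{-iota}.
# The sizes, the coordinate law P, and the Taylor coefficients are illustrative.
import numpy as np
from itertools import combinations_with_replacement
from math import factorial

rng = np.random.default_rng(0)
d, iota, n = 8, 2, 4000                      # want d^iota * log d = o(n)
X = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(n, d))   # i.i.d. coordinates, variance 1
alpha = [1.0, 1.0, 0.5]                      # Taylor coefficients alpha_0 .. alpha_iota of h

cols = []
for deg in range(iota + 1):
    for combo in combinations_with_replacement(range(d), deg):
        r = np.bincount(np.array(combo, dtype=int), minlength=d)      # multi-index (r1..rd)
        c = factorial(deg) / np.prod([factorial(int(k)) for k in r])  # multinomial coefficient
        mono = np.prod(X ** r, axis=1)                                # monomial p_r(x_i)
        cols.append(np.sqrt(c * alpha[deg]) * mono / d ** (deg / 2.0))
Phi = np.column_stack(cols)                  # n x C(iota + d, iota) feature matrix
Theta = Phi.T @ Phi / n                      # filtrated sample covariance operator
lam_min = np.linalg.eigvalsh(Theta).min()
print(f"lambda_min(Theta) = {lam_min:.2e}   vs   d^(-iota) = {d ** -iota:.2e}")
```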

15 / 37

slide-28
SLIDE 28

Intro. Min-norm Interpolant Regression Classification

IDEAS BEHIND THE PROOF

Small-ball approach rather than standard concentration.

Lower bounding λ_min((1/n) Ψ⊺Ψ) is equivalent to: for all u with ∥u∥ = 1, lower bound (1/n)∥Ψu∥².

Utilize non-negativity:

(1/n)∥Ψu∥² = (1/n) ∑_{i=1}^n ⟨Ψ(xi), u⟩² ≥ c1 E[⟨Ψ(X), u⟩²] ⋅ (1/n) ∑_{i=1}^n I{⟨Ψ(xi), u⟩² ≥ c1 E[⟨Ψ(X), u⟩²]}

Small-ball property: there exist constants c1, c2 such that

P(⟨Ψ(xi), u⟩² ≥ c1 E[⟨Ψ(X), u⟩²]) ≥ c2

Koltchinskii and Mendelson (2015); Mendelson (2014)

which implies, w.p. at least 1 − exp(−c ⋅ n),

(1/n) ∑_{i=1}^n I{⟨Ψ(xi), u⟩² ≥ c1 E[⟨Ψ(X), u⟩²]} ≥ c2/2

Non-trivial: verify the small-ball property for polynomials (weakly dependent) via Paley–Zygmund.
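For completeness, here is the Paley–Zygmund inequality that the argument invokes, and how it yields the small-ball constants; the moment-comparison constant C_ι below is an assumption that has to be verified for the (weakly dependent) polynomial features, which is exactly the non-trivial part.

```latex
% Paley–Zygmund: for a nonnegative random variable Z with E[Z^2] < \infty
% and any \theta \in (0,1),
\[
\mathbb{P}\bigl(Z \ge \theta\,\mathbb{E}[Z]\bigr)
\;\ge\; (1-\theta)^2\,\frac{(\mathbb{E}[Z])^2}{\mathbb{E}[Z^2]}.
\]
% Applied to Z = \langle \Psi(x), u\rangle^2: if a moment-comparison bound
% E[Z^2] \le C_\iota\,(E[Z])^2 holds uniformly over u, then taking
% c_1 = \theta and c_2 = (1-\theta)^2/C_\iota gives the small-ball property
%   P(\langle\Psi(x),u\rangle^2 \ge c_1\,E[\langle\Psi(X),u\rangle^2]) \ge c_2.
```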

16 / 37

slide-29
SLIDE 29

Intro. Min-norm Interpolant Regression Classification

CLASSIFICATION

Precise High-Dimensional Asymptotic Theory for Boosting and Min-L1-Norm Interpolated Classifiers with Pragya Sur (Harvard)

17 / 37

slide-30
SLIDE 30

Intro. Min-norm Interpolant Regression Classification

MIN-L1-NORM INTERPOLATED CLASSIFIER

Regression so far; what about classification?

Given n i.i.d. data pairs {(xi, yi)}_{i=1}^n with labels yi ∈ {±1} and feature vectors xi ∈ R^p, we consider the minimum-L1-norm interpolated classifier

θ̂ = argmin_θ ∥θ∥1,  s.t.  yi x⊺i θ ≥ 1,

when the data is separable.
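This optimization is a linear program (it is the “LP” curve in the theory-vs-empirical comparison later): split θ = u − v with u, v ≥ 0 and minimize the sum of the entries. A minimal scipy sketch, with synthetic separable data as an illustrative assumption:

```python
# Minimal LP sketch of the min-L1-norm interpolated classifier
#   min ||theta||_1  s.t.  y_i x_i^T theta >= 1,
# via the standard split theta = u - v with u, v >= 0 (scipy linprog).
# The synthetic separable data below is an illustrative assumption.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, p = 100, 300                              # over-parametrized: psi = p/n = 3
theta_star = np.zeros(p); theta_star[:10] = 1.0
X = rng.standard_normal((n, p))
y = np.sign(X @ theta_star + 1e-12)          # noiseless labels -> separable

# Variables z = [u; v]; objective sum(u) + sum(v) = ||theta||_1.
c = np.ones(2 * p)
A_ub = -np.hstack([y[:, None] * X, -(y[:, None] * X)])   # -y_i x_i^T (u - v) <= -1
b_ub = -np.ones(n)
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method="highs")
theta_hat = res.x[:p] - res.x[p:]
margin = (y * (X @ theta_hat)).min() / np.linalg.norm(theta_hat, 1)
print(f"||theta_hat||_1 = {np.linalg.norm(theta_hat, 1):.3f}, "
      f"normalized L1-margin = {margin:.4f}")
```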

18 / 37

slide-31
SLIDE 31

Intro. Min-norm Interpolant Regression Classification

MIN-L1-NORM INTERPOLATED CLASSIFIER

Regression so far; what about classification?

Given n i.i.d. data pairs {(xi, yi)}_{i=1}^n with labels yi ∈ {±1} and feature vectors xi ∈ R^p, we consider the minimum-L1-norm interpolated classifier

θ̂ = argmin_θ ∥θ∥1,  s.t.  yi x⊺i θ ≥ 1,

when the data is separable. The min-L1-norm interpolated classifier agrees with the max-L1-margin direction:

max_{∥θ∥1≤1} min_{1≤i≤n} yi x⊺i θ =: κℓ1(X, y).

18 / 37

slide-32
SLIDE 32

Intro. Min-norm Interpolant Regression Classification

WHY L1 MARGIN?

Algorithmic: on separable data, the Boosting iterate θ̂^{t,η}_boost with infinitesimal step-size η agrees with the min-L1-norm direction asymptotically:

lim_{η→0} lim_{t→∞} θ̂^{t,η}_boost / ∥θ̂^{t,η}_boost∥1 = θ̂ .

Freund and Schapire (1995); Zhang and Yu (2005)
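A minimal sketch of this connection: boosting viewed as coordinate descent on the exponential loss with a small constant step size η; on separable data the L1-normalized iterate tracks the max-L1-margin direction as η → 0 and t → ∞ (Zhang and Yu 2005). The data, step size, and iteration count below are illustrative assumptions, not the paper's exact algorithm.

```python
# Minimal sketch of boosting as coordinate descent on the exponential loss,
# with a small constant step size eta. On separable data the normalized
# iterate theta_t / ||theta_t||_1 tracks the max-L1-margin direction as
# eta -> 0 and t -> infinity. Illustrative settings only.
import numpy as np

rng = np.random.default_rng(0)
n, p, eta, T = 100, 300, 0.05, 5000
X = rng.standard_normal((n, p))
y = np.sign(X @ (np.arange(p) < 10).astype(float) + 1e-12)

theta = np.zeros(p)
for t in range(T):
    w = np.exp(-y * (X @ theta))            # exponential-loss weights
    grad = -(y * w) @ X                     # gradient of sum_i exp(-y_i x_i^T theta)
    j = np.argmax(np.abs(grad))             # weak learner = best single coordinate
    theta[j] -= eta * np.sign(grad[j])      # small fixed step on that coordinate

margin = (y * (X @ theta)).min() / np.linalg.norm(theta, 1)
print(f"normalized L1-margin after {T} boosting steps: {margin:.4f}")
```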

19 / 37

slide-33
SLIDE 33

Intro. Min-norm Interpolant Regression Classification

PRECISE HIGH-DIM ASYMPTOTIC THEORY FOR BOOSTING

  • DGP. xi ∼ N(0, Λ) i.i.d. with covariance Λ ∈ R^{p×p}, and yi generated with some f : R → [0, 1] via

    P(yi = +1 ∣ xi) = f(x⊺i θ⋆),

    for some θ⋆ ∈ R^p. Consider the high-dimensional asymptotic regime with over-parametrization ratio p/n → ψ ∈ (0, ∞) as p, n → ∞.
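A minimal sketch of this data-generating process, with Λ = I_p and a logistic link f as illustrative assumptions:

```python
# Minimal sketch of the DGP above: x_i ~ N(0, Lambda) i.i.d. and
# P(y_i = +1 | x_i) = f(x_i^T theta_star), in the regime p/n -> psi.
# Lambda = I_p and the logistic link f are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
psi, n = 2.0, 400
p = int(psi * n)
theta_star = rng.standard_normal(p)
theta_star /= np.linalg.norm(theta_star)               # fix the signal scale
f = lambda t: 1.0 / (1.0 + np.exp(-t))                 # link f: R -> [0, 1]

X = rng.standard_normal((n, p))                        # Lambda = I_p
y = np.where(rng.uniform(size=n) < f(X @ theta_star), 1.0, -1.0)
print(f"p/n = {p / n:.1f}, fraction of +1 labels = {np.mean(y == 1):.2f}")
```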

20 / 37

slide-34
SLIDE 34

Intro. Min-norm Interpolant Regression Classification

PRECISE HIGH-DIM ASYMPTOTIC THEORY FOR BOOSTING

Statistical.

  • how large is the empirical L1-margin?
  • angle between θ̂ (the min-L1-norm interpolated classifier) and the truth θ⋆?

  • generalization properties of Boosting?

Computational.

  • how many iterations of Boosting (precisely, as a function of the over-parametrization ratio p/n) are required for an ε-approximation to the max-L1-margin?

  • what proportion of features is activated by Boosting (with zero initialization) when the training error vanishes?

20 / 37

slide-35
SLIDE 35

Intro. Min-norm Interpolant Regression Classification

PRECISE HIGH-DIM ASYMPTOTIC THEORY FOR BOOSTING

Theorem (L. & Sur, ’20). Under mild conditions, for ψ ≥ ψ⋆(0), the following sharp asymptotic characterization holds:

lim_{n,p→∞} p^{1/2} ⋅ κℓ1(X, y) = κ⋆(ψ, µ),  a.s.

Generalization error:

lim_{n,p→∞} P_{x,y}(y ⋅ x⊺ θ̂ℓ1 < 0) = Err⋆(ψ, µ),  a.s.

Thrampoulidis et al. (2014, 2015, 2018); Gordon (1988) Montanari et al. (2019); Deng et al. (2019); Shcherbina and Tirozzi (2003); Gardner (1988)

20 / 37

slide-36
SLIDE 36

Intro. Min-norm Interpolant Regression Classification

PRECISE HIGH-DIM ASYMPTOTIC THEORY FOR BOOSTING

κ⋆(ψ, µ) enjoys the following analytic characterization [L. & Sur, ’20].

Define Fκ : R × R≥0 → R≥0,

Fκ(c1, c2) := (E[(κ − c1 Y Z1 − c2 Z2)²])^{1/2},

where Z2 ⊥ (Y, Z1),  Zi ∼ N(0, 1) for i = 1, 2,  and  P(Y = +1 ∣ Z1) = 1 − P(Y = −1 ∣ Z1) = f(ρ ⋅ Z1).

21 / 37

slide-37
SLIDE 37

Intro. Min-norm Interpolant Regression Classification

PRECISE HIGH-DIM ASYMPTOTIC THEORY FOR BOOSTING

κ⋆(ψ, µ) enjoys the following analytic characterization [L. & Sur, ’20].

Fixed-point equations for (c1, c2, s) ∈ R × R>0 × R>0 given ψ > 0, where the expectation is over (Λ, W, G) ∼ µ ⊗ N(0, 1) =: Q. Writing, for brevity,

A := Λ^{1/2}G + ψ^{−1/2}[∂1Fκ(c1, c2) − c1 c2^{−1} ∂2Fκ(c1, c2)] Λ^{1/2}W,  D := ψ^{−1/2} c2^{−1} ∂2Fκ(c1, c2),

the equations read

c1 = − E_Q[ Λ^{1/2}W ⋅ prox_s(A) / D ],

c1² + c2² = E_Q[ (prox_s(A) / D)² ],

1 = E_Q[ prox_s(A) / D ],

with prox_λ(t) = argmin_u { λ∣u∣ + (1/2)(u − t)² } = sgn(t)(∣t∣ − λ)₊.

21 / 37

slide-38
SLIDE 38

Intro. Min-norm Interpolant Regression Classification

PRECISE HIGH-DIM ASYMPTOTIC THEORY FOR BOOSTING

κ⋆(ψ, µ) enjoys the following analytic characterization [L. & Sur, ’20].

Fixed-point equations for (c1, c2, s) ∈ R × R>0 × R>0 given ψ > 0, where the expectation is over (Λ, W, G) ∼ µ ⊗ N(0, 1) =: Q. Writing, for brevity,

A := Λ^{1/2}G + ψ^{−1/2}[∂1Fκ(c1, c2) − c1 c2^{−1} ∂2Fκ(c1, c2)] Λ^{1/2}W,  D := ψ^{−1/2} c2^{−1} ∂2Fκ(c1, c2),

the equations read

c1 = − E_Q[ Λ^{1/2}W ⋅ prox_s(A) / D ],

c1² + c2² = E_Q[ (prox_s(A) / D)² ],

1 = E_Q[ prox_s(A) / D ],

with prox_λ(t) = argmin_u { λ∣u∣ + (1/2)(u − t)² } = sgn(t)(∣t∣ − λ)₊.

Define T(ψ, κ) := ψ^{−1/2}[Fκ(c1, c2) − c1 ∂1Fκ(c1, c2) − c2 ∂2Fκ(c1, c2)] − s, evaluated at the solution c1(ψ, κ), c2(ψ, κ), s(ψ, κ). Then

κ⋆(ψ, µ) := inf{κ ≥ 0 : T(ψ, κ) ≥ 0}.

21 / 37

slide-39
SLIDE 39

Intro. Min-norm Interpolant Regression Classification

THEORY VS. EMPIRICAL

[Figure: Max-L1-Margin — theory (CGMT) vs. empirical (LP), as a function of the over-parametrization ratio.]

[Figure: Generalization Error for the Min-L1-Interpolated Classifier — theory (CGMT) vs. empirical (LP).]

22 / 37

slide-40
SLIDE 40

Intro. Min-norm Interpolant Regression Classification

TECHNICAL REMARKS

Our results build upon the Convex Gaussian Minimax Theorem, Thrampoulidis et al. (2014, 2015, 2018); Gordon (1988), and the work on the L2-margin by Montanari et al. (2019).

The L1 case introduces some technical issues to overcome:

  • we prove a stronger uniform deviation result suited to the L1 case, via a self-normalization property.
  • different fixed-point equation systems.
  • the (normalized) max L1 margin is much larger than the max L2 margin.

23 / 37

slide-41
SLIDE 41

Intro. Min-norm Interpolant Regression Classification

ALGORITHMIC: BOOSTING

Theorem (L. & Sur, ’20). With a proper (non-vanishing) learning rate, the sequence {θ̂t}_{t=0}^∞ obtained by the Boosting algorithm satisfies: for any 0 < ε < 1, when the number of iterations t ≥ Tε(p) with

lim_{n,p→∞} Tε(p) / (p log² n) = 12 ε^{−2} / κ⋆²(ψ, µ),

the solution θ̂t/∥θ̂t∥1 is a (1 − ε)-approximation to the Min-L1-Interpolated Classifier:

p^{1/2} ⋅ min_{i∈[n]} yi x⊺i θ̂t / ∥θ̂t∥1 ∈ [(1 − ε) ⋅ κ⋆(ψ, µ), κ⋆(ψ, µ)].

24 / 37

slide-42
SLIDE 42

Intro. Min-norm Interpolant Regression Classification

ALGORITHMIC: ACTIVATED FEATURES BY BOOSTING

Theorem (L. & Sur, ’20). Let S0(p) be the number of features selected when Boosting (for the first time, at iteration t) attains zero training error with initialization θ̂0 = 0:

(1/n) ∑_{i=1}^n I{yi x⊺i θ̂t ≤ 0} = 0,  with  S0(p) := #{j ∈ [p] : θ̂t_j ≠ 0}.

We show

limsup_{n,p→∞} S0(p) / (p log² p) ≤ (12 / κ⋆²(ψ, µ)) ∧ 1.

25 / 37

slide-43
SLIDE 43

Intro. Min-norm Interpolant Regression Classification

PROOF SKETCH

Step 1: √p-rescaling of the L1 ball

ξ^{(n,p)}_{ψ,κ} := min_{∥θ∥1≤√p} max_{∥λ∥2≤1, λ≥0} (1/√p) λ⊺(κ1 − (y ⊙ X)θ)

It is not hard to see that

ξ^{(n,p)}_{ψ,κ} = 0  if and only if  κ ≤ p^{1/2} ⋅ κℓ1({xi, yi}_{i=1}^n),
ξ^{(n,p)}_{ψ,κ} > 0  if and only if  κ > p^{1/2} ⋅ κℓ1({xi, yi}_{i=1}^n).

26 / 37

slide-44
SLIDE 44

Intro. Min-norm Interpolant Regression Classification

PROOF SKETCH

Step 1: √p-rescaling of the L1 ball

ξ^{(n,p)}_{ψ,κ} := min_{∥θ∥1≤√p} max_{∥λ∥2≤1, λ≥0} (1/√p) λ⊺(κ1 − (y ⊙ X)θ)

equivalently,

ξ^{(n,p)}_{ψ,κ} = min_{∥θ∥1≤√p} max_{∥λ∥2≤1, λ≥0} (1/√p) λ⊺(κ1 − (y ⊙ z)⟨w, Λ^{1/2}θ⟩) − (1/√p) λ⊺ Z Π_{w⊥}(Λ^{1/2}θ)

Step 2: reduction via Gordon’s comparison (convex Gaussian min-max theorem)

Thrampoulidis et al. (2014, 2015, 2018); Gordon (1988)

ξ̂^{(n,p)}_{ψ,κ} := min_{∥θ∥1≤√p} max_{∥λ∥2≤1, λ≥0} (1/√p) λ⊺(κ1 − (y ⊙ z)⟨w, Λ^{1/2}θ⟩ − z̃ ∥Π_{w⊥}(Λ^{1/2}θ)∥2) + (1/√p) ∥λ∥2 ⟨g, Π_{w⊥}(Λ^{1/2}θ)⟩
              = min_{∥θ∥1≤√p} [ ψ^{−1/2} F̂κ(⟨w, Λ^{1/2}θ⟩, ∥Π_{w⊥}(Λ^{1/2}θ)∥2) + (1/√p) ⟨Π_{w⊥}(g), Λ^{1/2}θ⟩ ]

26 / 37

slide-45
SLIDE 45

Intro. Min-norm Interpolant Regression Classification

TECHNICAL CHALLENGES IN L1 CASE

Step 3: large n, p limit

The empirical problem (finite-dimensional optimization):

ξ̂^{(n,p)}_{ψ,κ} = min_{∥θ∥1≤√p} [ ψ^{−1/2} F̂κ(⟨w, Λ^{1/2}θ⟩, ∥Π_{w⊥}(Λ^{1/2}θ)∥2) + (1/√p) ⟨Π_{w⊥}(g), Λ^{1/2}θ⟩ ]

Let’s naively take the limit (infinite-dimensional optimization):

ξ̃^{(∞,∞)}_{ψ,κ} := min_{∥h∥_{L1(Q)}≤1} [ ψ^{−1/2} Fκ(⟨w, Λ^{1/2}h⟩_{L2(Q)}, ∥Π_{w⊥}(Λ^{1/2}h)∥_{L2(Q)}) + ⟨Π_{w⊥}(G), Λ^{1/2}h⟩_{L2(Q)} ]

One needs to show

lim_{p→∞, p/n(p)→ψ} ξ̂^{(n,p)}_{ψ,κ} = ξ̃^{(∞,∞)}_{ψ,κ}  a.s.

27 / 37

slide-46
SLIDE 46

Intro. Min-norm Interpolant Regression Classification

TECHNICAL CHALLENGES IN L1 CASE

Step 3: large n, p limit

The empirical problem (finite-dimensional optimization):

ξ̂^{(n,p)}_{ψ,κ} = min_{∥θ∥1≤√p} [ ψ^{−1/2} F̂κ(⟨w, Λ^{1/2}θ⟩, ∥Π_{w⊥}(Λ^{1/2}θ)∥2) + (1/√p) ⟨Π_{w⊥}(g), Λ^{1/2}θ⟩ ]

Let’s naively take the limit (infinite-dimensional optimization):

ξ̃^{(∞,∞)}_{ψ,κ} := min_{∥h∥_{L1(Q)}≤1} [ ψ^{−1/2} Fκ(⟨w, Λ^{1/2}h⟩_{L2(Q)}, ∥Π_{w⊥}(Λ^{1/2}h)∥_{L2(Q)}) + ⟨Π_{w⊥}(G), Λ^{1/2}h⟩_{L2(Q)} ]

One needs to show

lim_{p→∞, p/n(p)→ψ} ξ̂^{(n,p)}_{ψ,κ} = ξ̃^{(∞,∞)}_{ψ,κ}  a.s.

L1 vs. L2 geometry: for the constraint set ∥θ∥1 ≤ √p, define c1 = ⟨w, Λ^{1/2}θ⟩ and c2 = ∥Π_{w⊥}(Λ^{1/2}θ)∥2; c2 could be of order √p → ∞.

27 / 37

slide-47
SLIDE 47

Intro. Min-norm Interpolant Regression Classification

TECHNICAL CHALLENGES IN L1 CASE

Step 3: large n, p limit

The empirical problem (finite-dimensional optimization):

ξ̂^{(n,p)}_{ψ,κ} = min_{∥θ∥1≤√p} [ ψ^{−1/2} F̂κ(⟨w, Λ^{1/2}θ⟩, ∥Π_{w⊥}(Λ^{1/2}θ)∥2) + (1/√p) ⟨Π_{w⊥}(g), Λ^{1/2}θ⟩ ]

Let’s naively take the limit (infinite-dimensional optimization):

ξ̃^{(∞,∞)}_{ψ,κ} := min_{∥h∥_{L1(Q)}≤1} [ ψ^{−1/2} Fκ(⟨w, Λ^{1/2}h⟩_{L2(Q)}, ∥Π_{w⊥}(Λ^{1/2}h)∥_{L2(Q)}) + ⟨Π_{w⊥}(G), Λ^{1/2}h⟩_{L2(Q)} ]

One needs to show

lim_{p→∞, p/n(p)→ψ} ξ̂^{(n,p)}_{ψ,κ} = ξ̃^{(∞,∞)}_{ψ,κ}  a.s.

L1 vs. L2 geometry: for the constraint set ∥θ∥1 ≤ √p, define c1 = ⟨w, Λ^{1/2}θ⟩ and c2 = ∥Π_{w⊥}(Λ^{1/2}θ)∥2; c2 could be of order √p → ∞.

[L. & Sur ’20] shows uniform deviation over an unbounded domain for the fixed-point equations (KKT), using a key self-normalization property of ∂iFκ(c1, c2).

For i = 1, 2, we have w.p. at least 1 − n^{−2},

sup_{∣c1∣≤M, c2>0} ∣∂iF̂κ(c1, c2) − ∂iFκ(c1, c2)∣ ≤ C log n / √n

27 / 37

slide-48
SLIDE 48

Intro. Min-norm Interpolant Regression Classification

[BACKUP] CONVEX GAUSSIAN MINIMAX THEOREM

Let C1 ⊂ R^n, C2 ⊂ R^p be two compact sets and let R : C1 × C2 → R be a continuous function. Let X = (Xij) ∈ R^{n×p}, g ∼ N(0, In) and h ∼ N(0, Ip) be independent vectors and matrices with standard Gaussian entries. Define

Q1(X) = min_{w1∈C1} max_{w2∈C2} w1⊺ X w2 + R(w1, w2)

Q2(g, h) = min_{w1∈C1} max_{w2∈C2} ∥w2∥ g⊺w1 + ∥w1∥ h⊺w2 + R(w1, w2).

Then:

  1. For all t ∈ R, P(Q1(X) ≤ t) ≤ 2 P(Q2(g, h) ≤ t).
  2. Suppose C1 and C2 are both convex, and R is convex-concave in (w1, w2). Then, for all t ∈ R, P(Q1(X) ≥ t) ≤ 2 P(Q2(g, h) ≥ t).

28 / 37

slide-49
SLIDE 49

Intro. Min-norm Interpolant Regression Classification

SUMMARY

Research agenda: statistical/generalization theory for min-norm interpolants (naive use of Rademacher complexity or VC dimension does not explain this well).

  • Regression: [L. & Rakhlin ’18], [L. & Dou ’19], [L., Rakhlin & Zhai ’19]
  • Classification: [L. & Sur ’20]

29 / 37

slide-50
SLIDE 50

References

Thank you!

  • 1. Liang, T. & Sur, P. (2020). A Precise High-Dimensional Asymptotic Theory for Boosting and Min-L1-Norm Interpolated Classifiers. arXiv:2002.01586
  • 2. Liang, T., Rakhlin, A. & Zhai, X. (2019). On the Multiple Descent of Minimum-Norm Interpolants and Restricted Lower Isometry of Kernels. arXiv:1908.10292
  • 3. Liang, T. & Rakhlin, A. (2018). Just Interpolate: Kernel “Ridgeless” Regression Can Generalize. The Annals of Statistics, to appear.
  • 4. Dou, X. & Liang, T. (2019). Training Neural Networks as Learning Data-adaptive Kernels: Provable Representation and Approximation Benefits. Journal of the American Statistical Association, to appear.

Peter L. Bartlett, Philip M. Long, Gábor Lugosi, and Alexander Tsigler. Benign overfitting in linear regression. arXiv preprint arXiv:1906.11300, 2019.
Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine learning and the bias-variance trade-off. arXiv preprint arXiv:1812.11118, 2018.
Zeyu Deng, Abla Kammoun, and Christos Thrampoulidis. A model of double descent for high-dimensional binary linear classification. arXiv preprint arXiv:1911.05822, 2019.
Simon S. Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes over-parameterized neural networks. arXiv preprint arXiv:1810.02054, 2018.
Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In European Conference on Computational Learning Theory, pages 23–37. Springer, 1995.

30 / 37

slide-51
SLIDE 51

References

PROOF IDEA: RESTRICTED LOWER ISOMETRY

Proof idea: on a filtration of spaces, establish restricted lower isometry.

Koltchinskii and Mendelson (2015); Mendelson (2014)

31 / 37

slide-52
SLIDE 52

References

PROOF IDEA: RESTRICTED LOWER ISOMETRY

Proof idea: on a filtration of spaces indexed by the polynomial basis, establish restricted lower isometry of the empirical kernel.

Define nK := [k(xi, xj)]_{i,j∈[n]} ∈ R^{n×n}, so that

nKij = h(x⊺i xj / d) = ∑_{ι=0}^∞ αι (x⊺i xj / d)^ι = ∑_{r1,⋯,rd ≥ 0} c_{r1⋯rd} α_{r1+⋯+rd} p_{r1⋯rd}(xi) p_{r1⋯rd}(xj) / d^{r1+⋯+rd}

Define the filtrated empirical kernel

nK^{[≤ι]}_{ij} := ∑_{r1,⋯,rd ≥ 0, r1+⋯+rd ≤ ι} c_{r1⋯rd} α_{r1+⋯+rd} p_{r1⋯rd}(xi) p_{r1⋯rd}(xj) / d^{r1+⋯+rd}

with c_{r1⋯rd} = (r1+⋯+rd)! / (r1!⋯rd!) and p_{r1⋯rd}(xi) = (xi[1])^{r1}⋯(xi[d])^{rd}, the monomials with multi-index r1⋯rd.

31 / 37

slide-53
SLIDE 53

References

RESTRICTED LOWER ISOMETRY OF KERNEL

Filtrated empirical kernel:

nK^{[≤ι]}_{ij} := ∑_{r1,⋯,rd ≥ 0, r1+⋯+rd ≤ ι} c_{r1⋯rd} α_{r1+⋯+rd} p_{r1⋯rd}(xi) p_{r1⋯rd}(xj) / d^{r1+⋯+rd}

nK^{[≤ι]} = Φ Φ⊺,  with filtrated polynomial features Φ ∈ R^{n × C(ι+d, ι)},

Φ_{i,(r1⋯rd)} = (c_{r1⋯rd} α_{r1+⋯+rd})^{1/2} p_{r1⋯rd}(xi) / d^{(r1+⋯+rd)/2}

Filtrated sample covariance operator:

Θ^{[≤ι]} := (1/n) Φ⊺ Φ ∈ R^{C(ι+d, ι) × C(ι+d, ι)}

32 / 37

slide-54
SLIDE 54

References

RESTRICTED LOWER ISOMETRY OF KERNEL

Filtrated empirical kernel:

nK^{[≤ι]}_{ij} := ∑_{r1,⋯,rd ≥ 0, r1+⋯+rd ≤ ι} c_{r1⋯rd} α_{r1+⋯+rd} p_{r1⋯rd}(xi) p_{r1⋯rd}(xj) / d^{r1+⋯+rd}

nK^{[≤ι]} = Φ Φ⊺,  with filtrated polynomial features Φ ∈ R^{n × C(ι+d, ι)},

Φ_{i,(r1⋯rd)} = (c_{r1⋯rd} α_{r1+⋯+rd})^{1/2} p_{r1⋯rd}(xi) / d^{(r1+⋯+rd)/2}

Filtrated sample covariance operator:

Θ^{[≤ι]} := (1/n) Φ⊺ Φ ∈ R^{C(ι+d, ι) × C(ι+d, ι)}

Restricted Lower Isometry of Kernel: all non-zero eigenvalues of K^{[≤ι]} are lower bounded by d^{−ι}, i.e.,

λ_min(Θ^{[≤ι]}) ≿ d^{−ι}

32 / 37

slide-55
SLIDE 55

References

RESTRICTED LOWER ISOMETRY OF KERNEL

Lemma (L., Rakhlin & Zhai, ’19). Assume the Taylor coefficients of h satisfy αi > 0 for all i. Consider any positive integer ι that satisfies d^ι log d = o(n) and ι < ν, where ν is the tail-decay exponent of P. Then with probability at least 1 − exp(−C ⋅ n/d^ι), all non-zero eigenvalues of K^{[≤ι]} are ≥ C ⋅ d^{−ι}.

33 / 37

slide-56
SLIDE 56

References

RESTRICTED LOWER ISOMETRY OF KERNEL

Lemma (L., Rakhlin & Zhai, ’19). Assume the Taylor coefficients of h satisfy αi > 0 for all i. Consider any positive integer ι that satisfies d^ι log d = o(n) and ι < ν, where ν is the tail-decay exponent of P. Then with probability at least 1 − exp(−C ⋅ n/d^ι), all non-zero eigenvalues of K^{[≤ι]} are ≥ C ⋅ d^{−ι}.

Some wrong-but-useful intuition:

  • the non-zero eigenvalues of K^{[≤ι]} equal those of Θ^{[≤ι]}

33 / 37

slide-57
SLIDE 57

References

RESTRICTED LOWER ISOMETRY OF KERNEL

Lemma (L., Rakhlin & Zhai, ’19). Assume the Taylor coefficients of h satisfy αi > 0 for all i. Consider any positive integer ι that satisfies d^ι log d = o(n) and ι < ν, where ν is the tail-decay exponent of P. Then with probability at least 1 − exp(−C ⋅ n/d^ι), all non-zero eigenvalues of K^{[≤ι]} are ≥ C ⋅ d^{−ι}.

Some wrong-but-useful intuition:

  • the non-zero eigenvalues of K^{[≤ι]} equal those of Θ^{[≤ι]}
  • suppose the monomials ∏_{i=1}^d (x[i])^{ri} were orthogonal (wrong); then

    E[Θ^{[≤ι]}] = diag(C(0), ⋯, C(ι′)⋅d^{−ι′}, ⋯, C(ι)⋅d^{−ι}),  where the d^{−ι} block contains C(d+ι−1, d−1) such entries

33 / 37

slide-58
SLIDE 58

References

RESTRICTED LOWER ISOMETRY OF KERNEL

Lemma (L., Rakhlin & Zhai, ’19). Assume the Taylor coefficients of h satisfy αi > 0 for all i. Consider any positive integer ι that satisfies d^ι log d = o(n) and ι < ν, where ν is the tail-decay exponent of P. Then with probability at least 1 − exp(−C ⋅ n/d^ι), all non-zero eigenvalues of K^{[≤ι]} are ≥ C ⋅ d^{−ι}.

Some wrong-but-useful intuition:

  • the non-zero eigenvalues of K^{[≤ι]} equal those of Θ^{[≤ι]}
  • suppose the monomials ∏_{i=1}^d (x[i])^{ri} were orthogonal (wrong); then

    E[Θ^{[≤ι]}] = diag(C(0), ⋯, C(ι′)⋅d^{−ι′}, ⋯, C(ι)⋅d^{−ι}),  where the d^{−ι} block contains C(d+ι−1, d−1) such entries

  • even so, standard concentration fails (at least applied naively):

    sup_{u ∈ B2 in R^{C(d+ι, ι)}} u⊺(Θ^{[≤ι]} − E[Θ^{[≤ι]}]) u ≤ (1/√n) Var(⋯)

33 / 37

slide-59
SLIDE 59

References

RESTRICTED LOWER ISOMETRY OF KERNEL

Then, how to make it right? Two Ideas.

34 / 37

slide-60
SLIDE 60

References

RESTRICTED LOWER ISOMETRY OF KERNEL

Then, how to make it right? Two ideas.

Idea 1: Gram–Schmidt process on polynomials (weakly dependent):

{1, t, t², ⋯} → {1, q1(t), q2(t), ⋯},  an orthogonal polynomial basis in L²_P.

34 / 37

slide-61
SLIDE 61

References

RESTRICTED LOWER ISOMETRY OF KERNEL

Then, how to make it right? Two ideas.

Idea 1: Gram–Schmidt process on polynomials (weakly dependent):

{1, t, t², ⋯} → {1, q1(t), q2(t), ⋯},  an orthogonal polynomial basis in L²_P.

Φ_{i,(r1⋯rd)} → Ψ_{i,(r1⋯rd)} = (c_{r1⋯rd} α_{r1+⋯+rd})^{1/2} ∏_{j∈[d]} q_{rj}(xi[j]) / d^{(r1+⋯+rd)/2}

Φ = ΨΛ,  with Λ ∈ R^{C(ι+d, ι) × C(ι+d, ι)} upper-triangular.
34 / 37

slide-62
SLIDE 62

References

RESTRICTED LOWER ISOMETRY OF KERNEL

Then, how to make it right? Two ideas.

Idea 1: Gram–Schmidt process on polynomials (weakly dependent):

{1, t, t², ⋯} → {1, q1(t), q2(t), ⋯},  an orthogonal polynomial basis in L²_P.

Φ_{i,(r1⋯rd)} → Ψ_{i,(r1⋯rd)} = (c_{r1⋯rd} α_{r1+⋯+rd})^{1/2} ∏_{j∈[d]} q_{rj}(xi[j]) / d^{(r1+⋯+rd)/2}

Φ = ΨΛ,  with Λ ∈ R^{C(ι+d, ι) × C(ι+d, ι)} upper-triangular.

Claim: weak dependence ⇒ ∥Λ∥op, ∥Λ^{−1}∥op ≤ C(ι), so

u⊺ Θ^{[≤ι]} u = (1/n)∥Φu∥² = (1/n)∥ΨΛu∥² ≥ λ_min((1/n)Ψ⊺Ψ) ∥Λu∥² ≍ λ_min((1/n)Ψ⊺Ψ) ∥u∥²

34 / 37

slide-63
SLIDE 63

References

RESTRICTED LOWER ISOMETRY OF KERNEL

Then, how to make it right? Two Ideas. Idea 2: small-ball approach rather than standard concentration

35 / 37

slide-64
SLIDE 64

References

RESTRICTED LOWER ISOMETRY OF KERNEL

Then, how to make it right? Two ideas.

Idea 2: small-ball approach rather than standard concentration.

Lower bounding λ_min((1/n) Ψ⊺Ψ) is equivalent to: for all u with ∥u∥ = 1, lower bound (1/n)∥Ψu∥².

Utilize non-negativity:

(1/n)∥Ψu∥² = (1/n) ∑_{i=1}^n ⟨Ψ(xi), u⟩² ≥ c1 E[⟨Ψ(X), u⟩²] ⋅ (1/n) ∑_{i=1}^n I{⟨Ψ(xi), u⟩² ≥ c1 E[⟨Ψ(X), u⟩²]}

Small-ball property: there exist constants c1, c2 such that

P(⟨Ψ(xi), u⟩² ≥ c1 E[⟨Ψ(X), u⟩²]) ≥ c2

which implies, w.p. at least 1 − exp(−c ⋅ n),

(1/n) ∑_{i=1}^n I{⟨Ψ(xi), u⟩² ≥ c1 E[⟨Ψ(X), u⟩²]} ≥ c2/2

Non-trivial: verify the small-ball property for polynomials (weakly dependent) via Paley–Zygmund.

35 / 37

slide-65
SLIDE 65

References

RESTRICTED LOWER ISOMETRY OF KERNEL

Then, how to make it right? Two ideas.

Lemma (L., Rakhlin & Zhai, ’19). All non-zero eigenvalues of K^{[≤ι]} are ≥ C ⋅ d^{−ι}.

Mendelson (2014); Liang et al. (2019); Ghorbani et al. (2019)

35 / 37

slide-66
SLIDE 66

References

INTUITION: WEAKLY DEPENDENT

For any three distinct polynomial features indexed by (r1⋯rd), (r′1⋯r′d), (r′′1⋯r′′d),

∏_{j∈[d]} q_{rj}(x[j]),  ∏_{j∈[d]} q_{r′j}(x[j]),  ∏_{j∈[d]} q_{r′′j}(x[j]),

the third moment E[q_{r1⋯rd} q_{r′1⋯r′d} q_{r′′1⋯r′′d}] ≠ 0 only if ∀j ∈ [d], rj + r′j ≥ r′′j.

Among such triplets, at most a 3^{2ι}/d^ι = O(1/d^ι) fraction has a non-zero third moment.

36 / 37

slide-67
SLIDE 67

References

BACK TO MULTIPLE DESCENT PROOF: SKETCH

Decompose the risk into bias and variance. Surprisingly, both terms can be bounded by E_{x∼P⊗d} ∥k(X, X)^{−1} k(X, x)∥².

37 / 37

slide-68
SLIDE 68

References

BACK TO MULTIPLE DESCENT PROOF: SKETCH

Decompose the risk into bias and variance. Surprisingly, both terms can be bounded by E_{x∼P⊗d} ∥k(X, X)^{−1} k(X, x)∥².

Sketch:

Ex∥k(X, X)^{−1} k(X, x)∥²
  ≲ ∑_{i=0}^{ι} Ex∥K^{−1} (1/n)(Xx)^i/d^i∥² + Ex∥K^{−1} (1/n) ∑_{i=ι+1}^{∞} (Xx)^i/d^i∥²
  ≲ (1/n²) ∑_{i=0}^{ι} Ex∥K^{−1}(Xx)^i/d^i∥² + ∥(nK)^{−1}∥²_op ⋅ Ex∥∑_{i=ι+1}^{∞} (Xx)^i/d^i∥²
  ≲ (1/n²) ∑_{i=0}^{ι} Ex[∥(K^{[≤i]})^+∥²_op ⋅ ∥(Xx)^i/d^i∥²] + n/d^{ι+1}
  ≲ (1/n²) ∑_{i=0}^{ι} Ex[d^{2i} ⋅ ∥(Xx)^i/d^i∥²] + n/d^{ι+1}    (using restricted lower isometry)
  ≲ d^ι/n + n/d^{ι+1}.

37 / 37