Minimum-Norm Interpolation in Statistical Learning: new phenomena in high dimensions


SLIDE 1

Minimum-Norm Interpolation in Statistical Learning: new phenomena in high dimensions Tengyuan Liang

Regression: with Sasha Rakhlin (MIT), Xiyu Zhai (MIT) Classification: with Pragya Sur (Harvard)

SLIDE 2

OUTLINE

  • Motivation: min-norm interpolants for over-parametrized models
  • Regression: multiple descent of risk for kernels/neural networks
  • Classification: precise asymptotics of boosting algorithms

SLIDE 3

OVERPARAMETRIZED REGIME OF STAT/ML

Model class complex enough to interpolate the training data.

Zhang, Bengio, Hardt, Recht, and Vinyals (2016) Belkin et al. (2018a,b); Liang and Rakhlin (2018); Bartlett et al. (2019); Hastie et al. (2019)

[Figure: Kernel Regression on MNIST. x-axis: regularization λ (0.0 to 1.2); y-axis: log(error); one curve per digit pair [i, j]: [2,5], [2,9], [3,6], [3,8], [4,7].]

λ = 0: the interpolants on training data.

MNIST data from LeCun et al. (2010)

SLIDE 4

OVERPARAMETRIZED REGIME OF STAT/ML

Model class complex enough to interpolate the training data.

Zhang, Bengio, Hardt, Recht, and Vinyals (2016) Belkin et al. (2018a,b); Liang and Rakhlin (2018); Bartlett et al. (2019); Hastie et al. (2019)

[Figure: Kernel Regression on MNIST. x-axis: regularization λ (0.0 to 1.2); y-axis: log(error); one curve per digit pair [i, j] with i ∈ {2, 3, 4} and j ∈ {5, 6, 7, 8, 9}.]

λ = 0: the interpolants on training data.

MNIST data from LeCun et al. (2010)
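To make the figure concrete, here is a minimal sketch of kernel ridge regression with a λ sweep down to the λ = 0 interpolant. The Gaussian kernel, the bandwidth, and the synthetic data standing in for the MNIST digit pairs are illustrative assumptions, not the talk's experiment.

    import numpy as np

    def gaussian_kernel(X, Z, gamma=0.05):
        # illustrative kernel choice; squared distances between all pairs of rows
        d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-gamma * d2)

    rng = np.random.default_rng(0)
    n, d = 200, 20
    X, X_test = rng.standard_normal((n, d)), rng.standard_normal((n, d))
    y, y_test = np.sign(X[:, 0]), np.sign(X_test[:, 0])          # synthetic binary labels
    K, K_test = gaussian_kernel(X, X), gaussian_kernel(X_test, X)
    for lam in (1.0, 0.1, 0.01, 0.0):                            # lambda = 0: the interpolant
        coef = np.linalg.solve(K + lam * np.eye(n), y)
        err = np.mean(np.sign(K_test @ coef) != y_test)
        print(f"lambda = {lam:<4}  test error = {err:.3f}")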

SLIDE 5

OVERPARAMETRIZED REGIME OF STAT/ML

In fact, many models behave identically on the training data. Practical methods and algorithms favor certain functions! Principle: among the models that interpolate, algorithms favor a certain form of minimalism.

SLIDE 6

OVERPARAMETRIZED REGIME OF STAT/ML

Principle: among the models that interpolate, algorithms favor a certain form of minimalism.

  • overparametrized linear model and matrix factorization
  • kernel regression
  • support vector machines, Perceptron
  • boosting, AdaBoost
  • two-layer ReLU networks, deep neural networks

SLIDE 7

OVERPARAMETRIZED REGIME OF STAT/ML

Principle: among the models that interpolate, algorithms favor a certain form of minimalism.

  • overparametrized linear model and matrix factorization
  • kernel regression
  • support vector machines, Perceptron
  • boosting, AdaBoost
  • two-layer ReLU networks, deep neural networks

Minimalism is typically measured by a norm, which motivates the study of min-norm interpolants.

SLIDE 8

MIN-NORM INTERPOLANTS

Minimalism is typically measured by a norm, which motivates the study of min-norm interpolants.

Regression:       f̂ = arg min_f ∥f∥_norm,  s.t.  y_i = f(x_i)  ∀ i ∈ [n].

Classification:   f̂ = arg min_f ∥f∥_norm,  s.t.  y_i · f(x_i) ≥ 1  ∀ i ∈ [n].
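As a concrete instance of the regression formulation above, a minimal NumPy sketch (my own, with illustrative dimensions and data) of the minimum-ℓ2-norm interpolant in an overparametrized linear model, which is exactly what the pseudoinverse returns:

    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 50, 200                                  # overparametrized: p > n, infinitely many interpolants
    X = rng.standard_normal((n, p))
    y = rng.standard_normal(n)

    # among all theta with X @ theta = y, the pseudoinverse returns the minimum-l2-norm one
    theta_hat = np.linalg.pinv(X) @ y
    print(np.allclose(X @ theta_hat, y))            # True: interpolates the training data
    print(np.linalg.norm(theta_hat))                # the smallest l2 norm among all interpolants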

SLIDE 9

Multiple Descent of Minimum-Norm Interpolants and Restricted Lower Isometry of Kernels
with Sasha Rakhlin (MIT), Xiyu Zhai (MIT)

Regression:   f̂ = arg min_f ∥f∥_norm,  s.t.  y_i = f(x_i)  ∀ i ∈ [n].

SLIDE 10

SHAPE OF RISK CURVE

Classic: U-shaped curve. Recent: double-descent curve.

Belkin, Hsu, Ma, and Mandal (2018a); Hastie, Montanari, Rosset, and Tibshirani (2019)

Question: shape of the risk curve w.r.t. “over-parametrization”?

SLIDE 11

SHAPE OF RISK CURVE

Classic: U-shaped curve. Recent: double-descent curve.

Belkin, Hsu, Ma, and Mandal (2018a); Hastie, Montanari, Rosset, and Tibshirani (2019)

Question: shape of the risk curve w.r.t. “over-parametrization”? We model the intrinsic dimension d = n^α with α ∈ (0, 1), and feature covariance Σ_d = I_d. We consider the non-linear kernel regression model.

SLIDE 12

DATA GENERATING PROCESS

DGP.
  • {x_i}_{i=1}^n i.i.d. ∼ µ = P^{⊗d}; the distribution of each coordinate satisfies a weak moment condition.
  • target f⋆(x) := E[Y | X = x], with bounded Var[Y | X = x].

Kernel.
  • h ∈ C^∞(R), h(t) = Σ_{i=0}^∞ α_i t^i with α_i ≥ 0.
  • inner-product kernel k(x, z) = h(⟨x, z⟩/d).

Target function.
  • Assume f⋆(x) = ∫ k(x, z) ρ⋆(z) µ(dz) with ∥ρ⋆∥_µ ≤ C.

SLIDE 13

DATA GENERATING PROCESS

Given n i.i.d. data pairs (x_i, y_i) ∼ P_{X,Y}. What is the risk curve for the minimum-RKHS-norm (∥·∥_H) interpolant f̂?

  f̂ = arg min_f ∥f∥_H,  s.t.  y_i = f(x_i)  ∀ i ∈ [n].
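A minimal sketch of the object in question: the minimum-RKHS-norm interpolant for an inner-product kernel k(x, z) = h(⟨x, z⟩/d), obtained by solving the kernel system at λ = 0. The choice h(t) = exp(t) (non-negative Taylor coefficients) and the synthetic data are illustrative assumptions, not the talk's setup.

    import numpy as np

    def kernel_matrix(X, Z, h):
        d = X.shape[1]
        return h(X @ Z.T / d)                      # entries h(<x_i, z_j>/d)

    def min_rkhs_norm_interpolant(X_train, y_train, h):
        K = kernel_matrix(X_train, X_train, h)
        alpha = np.linalg.solve(K, y_train)        # ridgeless: lambda = 0
        return lambda X_new: kernel_matrix(X_new, X_train, h) @ alpha

    rng = np.random.default_rng(0)
    n, alpha_exp = 400, 0.5                        # d = n^alpha scaling, as in the slides
    d = int(n ** alpha_exp)
    X = rng.standard_normal((n, d))
    y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n)
    f_hat = min_rkhs_norm_interpolant(X, y, h=np.exp)
    print(np.max(np.abs(f_hat(X) - y)))            # ~ 0: the interpolant fits the training data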

SLIDE 14

SHAPE OF RISK CURVE

For any integer ι ≥ 1, consider d = n^α where α ∈ (1/(ι+1), 1/ι).

Theorem (L., Rakhlin & Zhai, '19).

SLIDE 15

SHAPE OF RISK CURVE

For any integer ι ≥ 1, consider d = n^α where α ∈ (1/(ι+1), 1/ι).

Theorem (L., Rakhlin & Zhai, '19). With probability at least 1 − δ − e^{−n/d^ι} on the design X ∈ R^{n×d},

  E[ ∥f̂ − f⋆∥²_µ | X ] ≤ C · ( d^ι/n + n/d^{ι+1} ) ≍ n^{−β},   β := min{ (ι+1)α − 1, 1 − ια }.

Here the constant C(δ, ι, h, P) does not depend on d, n.
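The rate exponent in the theorem can be traced directly as a function of α; the sketch below (my own evaluation of the displayed formula, with ι = ⌊1/α⌋ so that α ∈ (1/(ι+1), 1/ι)) already shows the multiple-descent shape discussed on the next slides.

    import math

    def rate_exponent(alpha):
        # beta(alpha) = min{(iota+1)*alpha - 1, 1 - iota*alpha}, iota = floor(1/alpha)
        iota = math.floor(1.0 / alpha)
        return min((iota + 1) * alpha - 1.0, 1.0 - iota * alpha)

    # the exponent rises and falls repeatedly as alpha decreases: multiple descent
    for alpha in (0.9, 2/3, 0.55, 0.45, 2/5, 0.35, 0.3, 2/7):
        print(f"alpha = {alpha:.3f}   beta = {rate_exponent(alpha):.3f}")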

SLIDE 16

MULTIPLE DESCENT

[Figure: risk rate exponent as a function of α, for d = n^α; x-axis ticks at 1/4, 1/3, 1/2, 1.]

  • multiple-descent behavior of the rates as the scaling d = n^α changes.

SLIDE 17

MULTIPLE DESCENT

[Figure: risk rate exponent as a function of α, for d = n^α; x-axis ticks at 1/4, 1/3, 1/2, 1.]

  • multiple-descent behavior of the rates as the scaling d = n^α changes.
  • valley: “valley” on the rate curve at d = n^{1/(ι+1/2)}, ι ∈ N

SLIDE 18

MULTIPLE DESCENT

[Figure: risk rate exponent as a function of α, for d = n^α; x-axis ticks at 1/4, 1/3, 1/2, 1.]

  • multiple-descent behavior of the rates as the scaling d = n^α changes.
  • valley: “valley” on the rate curve at d = n^{1/(ι+1/2)}, ι ∈ N
  • over-parametrization: moving toward the over-parametrized regime, the good rate at the bottom of the valley gets better

SLIDE 19

MULTIPLE DESCENT

[Figure: risk rate exponent as a function of α, for d = n^α; x-axis ticks at 1/4, 1/3, 1/2, 1.]

  • multiple-descent behavior of the rates as the scaling d = n^α changes.
  • valley: “valley” on the rate curve at d = n^{1/(ι+1/2)}, ι ∈ N
  • over-parametrization: moving toward the over-parametrized regime, the good rate at the bottom of the valley gets better
  • empirical: preliminary empirical evidence of multiple descent

SLIDE 20

EMPIRICAL EVIDENCE

Empirical evidence of multiple-descent behavior as the scaling d = n^α changes.

SLIDE 21

MULTIPLE DESCENT

[Figure: theoretical rate curve overlaid with empirical rates, as a function of α for d = n^α; legend: theory vs. empirical.]

SLIDE 22

APPLICATION TO WIDE NEURAL NETWORKS

Neural Tangent Kernel (NTK)

Jacot, Gabriel, and Hongler (2018); Du, Zhai, Poczos, and Singh (2018), ...

  k_NTK(x, x′) = U( ⟨x, x′⟩ / (∥x∥∥x′∥) ),  with  U(t) = (1/(4π)) · ( 3t(π − arccos(t)) + √(1 − t²) )

Compositional Kernel of Deep Neural Networks (DNN)

Daniely et al. (2016); Poole et al. (2016); Liang and Tran-Bach (2020)

  k_DNN(x, x′) = Σ_{i=0}^∞ α_i · ( ⟨x, x′⟩ / (∥x∥∥x′∥) )^i

SLIDE 23

APPLICATION TO WIDE NEURAL NETWORKS

Neural Tangent Kernel (NTK)

Jacot, Gabriel, and Hongler (2018); Du, Zhai, Poczos, and Singh (2018), ...

  k_NTK(x, x′) = U( ⟨x, x′⟩ / (∥x∥∥x′∥) ),  with  U(t) = (1/(4π)) · ( 3t(π − arccos(t)) + √(1 − t²) )

Compositional Kernel of Deep Neural Networks (DNN)

Daniely et al. (2016); Poole et al. (2016); Liang and Tran-Bach (2020)

  k_DNN(x, x′) = Σ_{i=0}^∞ α_i · ( ⟨x, x′⟩ / (∥x∥∥x′∥) )^i

Corollary (L., Rakhlin & Zhai, '19). Multiple descent phenomena hold for kernels including the NTK and the compositional kernel of DNNs.
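A minimal NumPy transcription of the NTK formula displayed above (my own implementation of U(t); the random test data are illustrative):

    import numpy as np

    def ntk_kernel(X, Z):
        # cosine similarity t = <x, z> / (||x|| ||z||), then U(t) from the display above
        Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
        Zn = Z / np.linalg.norm(Z, axis=1, keepdims=True)
        t = np.clip(Xn @ Zn.T, -1.0, 1.0)
        return (3.0 * t * (np.pi - np.arccos(t)) + np.sqrt(1.0 - t ** 2)) / (4.0 * np.pi)

    rng = np.random.default_rng(1)
    X = rng.standard_normal((4, 10))
    print(ntk_kernel(X, X))                        # 4x4 kernel matrix; diagonal equals U(1) = 3/4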

SLIDE 24

Precise High-Dimensional Asymptotic Theory for Boosting and Min-ℓ1-Norm Interpolated Classifiers
with Pragya Sur (Harvard)

Classification:   f̂ = arg min_f ∥f∥_norm,  s.t.  y_i · f(x_i) ≥ 1  ∀ i ∈ [n].

SLIDE 25

PROBLEM FORMULATION

Given n i.i.d. data pairs {(x_i, y_i)}_{1 ≤ i ≤ n}, with (x, y) ∼ P:
  • y_i ∈ {±1} binary labels, x_i ∈ R^p feature vector (weak learners).

Consider when the data is linearly separable:

  P( ∃ θ ∈ R^p : y_i x_i^⊺ θ > 0 for 1 ≤ i ≤ n ) → 1.

Natural to consider the overparametrized regime p/n → ψ ∈ (0, ∞).

SLIDE 26

BOOSTING/ADABOOST

“... mystery of AdaBoost as the most important unsolved problem in Machine Learning”
Wald Lecture, Breiman (2004)

“An important open problem is to derive more careful and precise bounds which can be used for this purpose. Besides paying closer attention to constant factors, such an analysis might also involve the measurement of more sophisticated statistics.”

Schapire, Freund, Bartlett, and Lee (1998)

SLIDE 27

ℓ1 GEOMETRY, MARGIN, AND INTERPOLATION

min-ℓ1-norm interpolation is equivalent to max-ℓ1-margin:

  max_{∥θ∥_1 ≤ 1}  min_{1 ≤ i ≤ n}  y_i x_i^⊺ θ  =:  κ_{ℓ1}(X, y).

Prior understanding:

  • generalization error ≲ 1/(√n · κ) · (log factors, constants)   Schapire, Freund, Bartlett, and Lee (1998)
  • optimization steps ≲ 1/κ² · (log factors, constants)   Rosset, Zhu, and Hastie (2004); Zhang and Yu (2005); Telgarsky (2013)
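The max-ℓ1-margin κ_{ℓ1}(X, y) is a linear program; below is a minimal sketch (my own helper, with illustrative Gaussian data) using scipy.optimize.linprog, in the spirit of the LP baseline used in the talk's later plots.

    import numpy as np
    from scipy.optimize import linprog

    def max_l1_margin(X, y):
        # kappa_{l1}(X, y) = max_{||theta||_1 <= 1} min_i y_i <x_i, theta>, as an LP
        n, p = X.shape
        # variables z = [theta_plus (p), theta_minus (p), kappa (1)]; maximize kappa
        c = np.zeros(2 * p + 1)
        c[-1] = -1.0
        S = y[:, None] * X                                        # rows y_i * x_i
        A_margin = np.hstack([-S, S, np.ones((n, 1))])            # kappa <= y_i <x_i, theta>
        A_l1 = np.concatenate([np.ones(2 * p), [0.0]])[None, :]   # ||theta||_1 <= 1
        A_ub = np.vstack([A_margin, A_l1])
        b_ub = np.concatenate([np.zeros(n), [1.0]])
        bounds = [(0, None)] * (2 * p) + [(None, None)]
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
        return res.x[-1]

    rng = np.random.default_rng(0)
    n, p = 100, 300                                               # psi = p/n = 3
    X = rng.standard_normal((n, p))
    y = np.where(X[:, 0] + 0.3 * rng.standard_normal(n) > 0, 1.0, -1.0)
    print(np.sqrt(p) * max_l1_margin(X, y))                       # compare with p^{1/2} kappa_{l1} in the theory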

SLIDE 28

ℓ1 GEOMETRY, MARGIN, AND INTERPOLATION

Prior understanding:

  • generalization error ≲ 1/(√n · κ) · (log factors, constants)   Schapire, Freund, Bartlett, and Lee (1998)
  • optimization steps ≲ 1/κ² · (log factors, constants)   Rosset, Zhu, and Hastie (2004); Zhang and Yu (2005); Telgarsky (2013)

However, many questions remain.

Statistical
  • how large is the ℓ1-margin κ_{ℓ1}(X, y)?
  • angle between the interpolated classifier θ̂ and the truth θ⋆?
  • precise generalization error of Boosting? relation to the Bayes error?

Computational
  • effect of increasing overparametrization ψ = p/n on optimization?
  • proportion of weak learners activated by Boosting with zero initialization?

SLIDE 29

DATA GENERATING PROCESS

  • DGP. x_i ∼ N(0, Λ) i.i.d. with diagonal covariance Λ ∈ R^{p×p}, and y_i generated with a non-decreasing f : R → [0, 1],

      P(y_i = +1 | x_i) = 1 − P(y_i = −1 | x_i) = f(x_i^⊺ θ⋆),

    for some θ⋆ ∈ R^p.

Consider the high-dimensional asymptotic regime with overparametrization ratio p/n → ψ ∈ (0, ∞) as n, p → ∞.

  signal strength:  ∥Λ^{1/2} θ⋆∥ → ρ ∈ (0, ∞),
  coordinates:      w̄_j = √p · λ_j^{1/2} θ⋆,j / ρ,  1 ≤ j ≤ p.

Assume (1/p) Σ_{j=1}^p δ_{(λ_j, w̄_j)} ⇒ µ in Wasserstein-2, where µ is a distribution on R_{>0} × R.
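A minimal simulation sketch of this DGP (my own); the logistic choice f(t) = 1/(1 + e^{−t}), the uniform spectrum for Λ, and the Gaussian θ⋆ are illustrative assumptions.

    import numpy as np

    def simulate_dgp(n, p, rho=1.0, seed=0):
        rng = np.random.default_rng(seed)
        lam = rng.uniform(0.5, 1.5, size=p)                             # diagonal of Lambda (illustrative)
        theta_star = rng.standard_normal(p)
        theta_star *= rho / np.linalg.norm(np.sqrt(lam) * theta_star)   # ||Lambda^{1/2} theta*|| = rho
        X = rng.standard_normal((n, p)) * np.sqrt(lam)                  # rows x_i ~ N(0, Lambda)
        p_plus = 1.0 / (1.0 + np.exp(-X @ theta_star))                  # f(x_i' theta*), logistic choice
        y = np.where(rng.uniform(size=n) < p_plus, 1, -1)
        return X, y, lam, theta_star

    X, y, lam, theta_star = simulate_dgp(n=200, p=600)                  # psi = p/n = 3
    print(X.shape, np.mean(y == 1))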

SLIDE 30

PRECISE HIGH-DIM ASYMPTOTIC THEORY FOR BOOSTING

Theorem (L. & Sur, '20). For ψ ≥ ψ⋆ (the separability threshold), a sharp asymptotic characterization holds:

  Margin:   lim_{n,p→∞, p/n→ψ}  p^{1/2} · κ_{ℓ1}(X, y) = κ⋆(ψ, µ),  a.s.

  Generalization error:   lim_{n,p→∞, p/n→ψ}  P_{x,y}( y · x^⊺ θ̂_{ℓ1} < 0 ) = Err⋆(ψ, µ),  a.s.

SLIDE 31

PRECISE HIGH-DIM ASYMPTOTIC THEORY FOR BOOSTING

Theorem (L. & Sur, '20). For ψ ≥ ψ⋆ (the separability threshold), a sharp asymptotic characterization holds:

  Margin:   lim_{n,p→∞, p/n→ψ}  p^{1/2} · κ_{ℓ1}(X, y) = κ⋆(ψ, µ),  a.s.

  Generalization error:   lim_{n,p→∞, p/n→ψ}  P_{x,y}( y · x^⊺ θ̂_{ℓ1} < 0 ) = Err⋆(ψ, µ),  a.s.

Precise asymptotics can also be established for the

  Angle:   ⟨θ̂_{ℓ1}, θ⋆⟩_Λ / ( ∥θ̂_{ℓ1}∥_Λ ∥θ⋆∥_Λ ),   and the Loss:   Σ_{j∈[p]} ℓ(θ̂_{ℓ1,j}, θ⋆,j).

SLIDE 32

PRECISE HIGH-DIM ASYMPTOTIC THEORY FOR BOOSTING

Theorem (L. & Sur, '20). For ψ ≥ ψ⋆ (the separability threshold), a sharp asymptotic characterization holds:

  Margin:   lim_{n,p→∞, p/n→ψ}  p^{1/2} · κ_{ℓ1}(X, y) = κ⋆(ψ, µ),  a.s.

  Generalization error:   lim_{n,p→∞, p/n→ψ}  P_{x,y}( y · x^⊺ θ̂_{ℓ1} < 0 ) = Err⋆(ψ, µ),  a.s.

Precise asymptotics can also be established for the

  Angle:   ⟨θ̂_{ℓ1}, θ⋆⟩_Λ / ( ∥θ̂_{ℓ1}∥_Λ ∥θ⋆∥_Λ ),   and the Loss:   Σ_{j∈[p]} ℓ(θ̂_{ℓ1,j}, θ⋆,j).

Gaussian comparison: Gordon (1988); Thrampoulidis et al. (2014, 2015, 2018).
ℓ2-margin: Gardner (1988); Shcherbina and Tirozzi (2003); Deng et al. (2019); Montanari et al. (2019).

SLIDE 33

THEORY VS. EMPIRICAL

x-axis: varying overparametrization ratio ψ.

[Left panel: Margin, p^{1/2} · κ_{ℓ1}(X, y) → κ⋆(ψ, µ); curves labeled CGMT (theory) and LP (empirical).]
[Right panel: Generalization, P_{x,y}( y · x^⊺ θ̂_{ℓ1} < 0 ) → Err⋆(ψ, µ); curves labeled CGMT (theory) and LP (empirical).]

Blue: empirical (numerical solution via linear programming) vs. Red: theoretical (fixed point of a non-linear equation system).

SLIDE 34

THEORY VS. EMPIRICAL

x-axis: varying overparametrization ratio ψ.

[Left panel: Margin, p^{1/2} · κ_{ℓ1}(X, y) → κ⋆(ψ, µ); curves labeled CGMT (theory) and LP (empirical).]
[Right panel: Generalization, P_{x,y}( y · x^⊺ θ̂_{ℓ1} < 0 ) → Err⋆(ψ, µ); curves labeled CGMT (theory) and LP (empirical).]

Blue: empirical (numerical solution via linear programming) vs. Red: theoretical (fixed point of a non-linear equation system).

Strikingly accurate asymptotics for Breiman's max-min-margin!   max_{∥θ∥_1 ≤ 1} min_{1 ≤ i ≤ n} y_i x_i^⊺ θ

SLIDE 35

NON-LINEAR EQUATION SYSTEM: FIXED POINT

[L. & Sur, '20]: κ⋆(ψ, µ) admits an analytic characterization via the fixed point c1(ψ, κ), c2(ψ, κ), s(ψ, κ).

Define F_κ(·, ·) : R × R_{≥0} → R_{≥0},

  F_κ(c1, c2) := ( E[ (κ − c1 · Y Z1 − c2 · Z2)_+^2 ] )^{1/2},

where Z2 ⊥ (Y, Z1), Z_i ∼ N(0, 1) for i = 1, 2, and P(Y = +1 | Z1) = 1 − P(Y = −1 | Z1) = f(ρ · Z1).

SLIDE 36

NON-LINEAR EQUATION SYSTEM: FIXED POINT

[L. & Sur, '20]: κ⋆(ψ, µ) admits an analytic characterization via the fixed point c1(ψ, κ), c2(ψ, κ), s(ψ, κ).

Fixed-point equations for (c1, c2, s) ∈ R × R_{>0} × R_{>0} given ψ > 0, where the expectation is over (Λ, W, G) ∼ µ ⊗ N(0, 1) =: Q. For brevity write

  Π := Λ^{1/2} G + ψ^{−1/2} [ ∂1 F_κ(c1, c2) − c1 c2^{−1} ∂2 F_κ(c1, c2) ] Λ^{1/2} W,   D := ψ^{−1/2} c2^{−1} ∂2 F_κ(c1, c2).

  c1 = − E_Q [ Λ^{−1/2} W · prox_s(Π) / D ],

  c1^2 + c2^2 = E_Q [ ( Λ^{−1/2} prox_s(Π) / D )^2 ],

  1 = E_Q [ Λ^{−1} prox_s(Π) / D ],

with prox_λ(t) = arg min_s { λ|s| + (1/2)(s − t)^2 } = sgn(t) (|t| − λ)_+.

Define T(ψ, κ) := ψ^{−1/2} [ F_κ(c1, c2) − c1 ∂1 F_κ(c1, c2) − c2 ∂2 F_κ(c1, c2) ] − s, evaluated at c1(ψ, κ), c2(ψ, κ), s(ψ, κ). Then

  κ⋆(ψ, µ) := inf{ κ ≥ 0 : T(ψ, κ) ≥ 0 }
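Two ingredients of the system above are easy to sanity-check numerically; the sketch below (my own) implements the soft-thresholding prox and a Monte Carlo evaluation of F_κ(c1, c2), with an illustrative logistic choice of f and ρ = 1.

    import numpy as np

    def prox(t, lam):
        # prox_lambda(t) = sgn(t) * (|t| - lambda)_+  (soft thresholding)
        return np.sign(t) * np.maximum(np.abs(t) - lam, 0.0)

    def F_kappa(kappa, c1, c2, rho=1.0, n_mc=200_000, seed=0):
        rng = np.random.default_rng(seed)
        Z1, Z2 = rng.standard_normal(n_mc), rng.standard_normal(n_mc)   # Z2 independent of (Y, Z1)
        p_plus = 1.0 / (1.0 + np.exp(-rho * Z1))                        # P(Y = +1 | Z1) = f(rho Z1)
        Y = np.where(rng.uniform(size=n_mc) < p_plus, 1.0, -1.0)
        val = np.maximum(kappa - c1 * Y * Z1 - c2 * Z2, 0.0)
        return np.sqrt(np.mean(val ** 2))          # (E[(kappa - c1 Y Z1 - c2 Z2)_+^2])^{1/2}

    print(prox(np.array([-2.0, 0.3, 1.5]), 0.5))
    print(F_kappa(kappa=1.0, c1=0.5, c2=0.5))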

SLIDE 37

GENERALIZATION ERROR, BAYES ERROR, AND ANGLE

With c⋆_i := c_i(ψ, κ⋆(ψ, µ)), i = 1, 2:

  Err⋆(ψ, µ) = P( c⋆_1 · Y Z1 + c⋆_2 · Z2 < 0 )

  BayesErr(ψ, µ) = P( Y Z1 < 0 )

SLIDE 38

GENERALIZATION ERROR, BAYES ERROR, AND ANGLE

With c⋆_i := c_i(ψ, κ⋆(ψ, µ)), i = 1, 2:

  Err⋆(ψ, µ) = P( c⋆_1 · Y Z1 + c⋆_2 · Z2 < 0 )

  BayesErr(ψ, µ) = P( Y Z1 < 0 )

  ⟨θ̂_{ℓ1}, θ⋆⟩_Λ / ( ∥θ̂_{ℓ1}∥_Λ ∥θ⋆∥_Λ )  →  c⋆_1 / √( (c⋆_1)^2 + (c⋆_2)^2 )

Mannor et al. (2002); Jiang (2004); Bartlett and Traskin (2007); Bartlett et al. (2004)

Resolves an open question posed in Breiman '99.
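Given candidate values of c⋆_1 and c⋆_2, the limiting quantities above can be evaluated by Monte Carlo; a minimal sketch (my own, again with an illustrative logistic f and ρ = 1):

    import numpy as np

    def limiting_errors(c1_star, c2_star, rho=1.0, n_mc=500_000, seed=0):
        rng = np.random.default_rng(seed)
        Z1, Z2 = rng.standard_normal(n_mc), rng.standard_normal(n_mc)
        p_plus = 1.0 / (1.0 + np.exp(-rho * Z1))                  # illustrative logistic f
        Y = np.where(rng.uniform(size=n_mc) < p_plus, 1.0, -1.0)
        err_star = np.mean(c1_star * Y * Z1 + c2_star * Z2 < 0)   # Err*(psi, mu)
        bayes_err = np.mean(Y * Z1 < 0)                           # BayesErr(psi, mu)
        return err_star, bayes_err

    print(limiting_errors(c1_star=1.0, c2_star=0.8))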

SLIDE 39

Statistical and algorithmic implications: significantly improves over prior generalization bounds.

  • overparametrization → faster optimization
  • overparametrization → sparser solution

SLIDE 40

SUMMARY

Research agenda: statistical and computational theory for min-norm interpolants (which naive usage of Rademacher complexity or VC dimension struggles to explain).

SLIDE 41

SUMMARY

Research agenda: statistical and computational theory for min-norm interpolants (which naive usage of Rademacher complexity or VC dimension struggles to explain).

  • Regression: [L. & Rakhlin ’18, AOS], [L., Rakhlin & Zhai ’19, COLT]
  • Classification: [L. & Sur ’20]
  • Kernels vs. Neural Networks: [L. & Dou ’19, JASA], [L. & Tran-Bach ’20]

SLIDE 42

References

Thank you!

  • Liang, T. & Sur, P. (2020). A Precise High-Dimensional Asymptotic Theory for Boosting and Min-L1-Norm Interpolated Classifiers. arXiv:2002.01586.
  • Liang, T. & Tran-Bach, H. (2020). Mehler's Formula, Branching Process, and Compositional Kernels of Deep Neural Networks. arXiv:2004.04767.
  • Liang, T., Rakhlin, A. & Zhai, X. (2019). On the Multiple Descent of Minimum-Norm Interpolants and Restricted Lower Isometry of Kernels. Conference on Learning Theory (COLT), 2020.
  • Liang, T. & Rakhlin, A. (2018). Just Interpolate: Kernel "Ridgeless" Regression Can Generalize. The Annals of Statistics, 2020.
  • Dou, X. & Liang, T. (2019). Training Neural Networks as Learning Data-adaptive Kernels: Provable Representation and Approximation Benefits. Journal of the American Statistical Association, 2020.

Peter L. Bartlett and Mikhail Traskin. AdaBoost is consistent. Journal of Machine Learning Research, 8(Oct):2347–2368, 2007.

Peter L. Bartlett, Peter J. Bickel, Peter Bühlmann, Yoav Freund, Jerome Friedman, Trevor Hastie, Wenxin Jiang, Michael I. Jordan, Vladimir Koltchinskii, Gábor Lugosi, Jon D. McAuliffe, Ya'acov Ritov, Saharon Rosset, Robert E. Schapire, Robert Tibshirani, Nicolas Vayatis, Bin Yu, Tong Zhang, and Ji Zhu. Discussions of boosting papers, and rejoinders. Annals of Statistics, 32(1):85–134, February 2004. ISSN 0090-5364, 2168-8966. doi:10.1214/aos/1105988581.