

slide-1
SLIDE 1

Intro. Min-norm Interpolant Regression Classification

Theory for Minimum Norm Interpolation: Regression and Classification in High Dimensions

Tengyuan Liang Classification: with Pragya Sur (Harvard) Regression: with Sasha Rakhlin (MIT), Xiyu Zhai (MIT)

1 / 37

slide-2
SLIDE 2

Intro. Min-norm Interpolant Regression Classification

OUTLINE

  • Motivation: min-norm interpolants
  • Regression: multiple descent of risk
  • Classification: boosting on separable data

2 / 37

slide-3
SLIDE 3

Intro. Min-norm Interpolant Regression Classification

OUTLINE

  • Motivation: min-norm interpolants
  • Regression: multiple descent of risk
  • application to wide neural networks
  • restricted lower isometry of kernels
  • small-ball property
  • Classification: boosting on separable data
  • precise high-dim asymptotics
  • convex Gaussian min-max theorem
  • algorithmic implications on boosting

2 / 37

slide-4
SLIDE 4

Intro. Min-norm Interpolant Regression Classification

OVER-PARAMETRIZED REGIME OF STAT/ML

Model class complex enough to interpolate the training data.

[Figure: Kernel Regression on MNIST — log(error) vs. ridge penalty λ ∈ [0, 1.2], one curve per digits pair [i, j]: [2,5], [2,9], [3,6], [3,8], [4,7].]

λ = 0: the interpolants on training data.

MNIST data from LeCun et al. (2010)

3 / 37

slide-5
SLIDE 5

Intro. Min-norm Interpolant Regression Classification

OVER-PARAMETRIZED REGIME OF STAT/ML

Model class complex enough to interpolate the training data.

[Figure: Kernel Regression on MNIST — log(error) vs. ridge penalty λ ∈ [0, 1.2], one curve per digits pair [i, j] with i ∈ {2, 3, 4} and j ∈ {5, 6, 7, 8, 9}.]

λ = 0: the interpolants on training data.

MNIST data from LeCun et al. (2010)
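To make the figure concrete, here is a minimal sketch of this kind of experiment: kernel ridge regression on one digits pair with the penalty λ swept toward 0 (the interpolating limit). It uses scikit-learn's small digits dataset as a stand-in for MNIST; the RBF kernel, gamma, and the λ grid are illustrative assumptions, not the exact setup behind the plot.

```python
# Minimal sketch of the experiment behind the figure: kernel ridge regression
# on one digits pair, sweeping the penalty lambda toward 0 (the interpolating
# limit). Uses sklearn's small digits set as a stand-in for MNIST; the RBF
# kernel, gamma, and the lambda grid are illustrative choices.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
mask = np.isin(y, [3, 8])                              # one digits pair, e.g. [3, 8]
X, y = X[mask] / 16.0, np.where(y[mask] == 3, -1.0, 1.0)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.5, random_state=0)

for lam in [1.0, 0.1, 0.01, 1e-4, 1e-8]:               # lambda -> 0: the interpolant
    model = KernelRidge(alpha=lam, kernel="rbf", gamma=0.02).fit(Xtr, ytr)
    err = np.mean(np.sign(model.predict(Xte)) != yte)
    print(f"lambda = {lam:g}   test error = {err:.3f}")
```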

3 / 37

slide-6
SLIDE 6

Intro. Min-norm Interpolant Regression Classification

OVER-PARAMETRIZED REGIME OF STAT/ML

Model class complex enough to interpolate the training data.

Zhang, Bengio, Hardt, Recht, and Vinyals (2016)

In fact, many models behave the same on the training data, yet practical methods and algorithms favor certain functions. Principle: among the models that interpolate, algorithms favor a certain form of minimalism.

4 / 37

slide-7
SLIDE 7

Intro. Min-norm Interpolant Regression Classification

OVER-PARAMETRIZED REGIME OF STAT/ML

Principle: among the models that interpolate, algorithms favor a certain form of minimalism.

  • over-parametrized linear model and matrix factorization
  • kernel machines
  • support vector machines
  • boosting, AdaBoost
  • two-layer ReLU networks

4 / 37

slide-8
SLIDE 8

Intro. Min-norm Interpolant Regression Classification

OVER-PARAMETRIZED REGIME OF STAT/ML

Principle: among the models that interpolate, algorithms favor a certain form of minimalism.

  • over-parametrized linear model and matrix factorization
  • kernel machines
  • support vector machines
  • boosting, AdaBoost
  • two-layer ReLU networks

Minimalism is typically measured in terms of a norm, which motivates the study of min-norm interpolants.

4 / 37

slide-9
SLIDE 9

Intro. Min-norm Interpolant Regression Classification

MIN-NORM INTERPOLANTS

Minimalism is typically measured in terms of a norm, which motivates the study of min-norm interpolants.

Regression:  f̂ = argmin_f ∥f∥_norm,  s.t.  yi = f(xi) ∀i ∈ [n].

Classification:  f̂ = argmin_f ∥f∥_norm,  s.t.  yi ⋅ f(xi) ≥ 1 ∀i ∈ [n].

5 / 37

slide-10
SLIDE 10

Intro. Min-norm Interpolant Regression Classification

REGRESSION

Multiple Descent of Minimum-Norm Interpolants and Restricted Lower Isometry of Kernels with Sasha Rakhlin (MIT), Xiyu Zhai (MIT)

6 / 37

slide-11
SLIDE 11

Intro. Min-norm Interpolant Regression Classification

SHAPE OF RISK CURVE

Classic: U-shaped curve. Recent: double-descent curve.

Belkin, Hsu, Ma, and Mandal (2018); Hastie, Montanari, Rosset, and Tibshirani (2019)

Question: shape of the risk curve w.r.t. “over-parametrization”?

7 / 37

slide-12
SLIDE 12

Intro. Min-norm Interpolant Regression Classification

SHAPE OF RISK CURVE

Classic: U-shaped curve. Recent: double-descent curve.

Belkin, Hsu, Ma, and Mandal (2018); Hastie, Montanari, Rosset, and Tibshirani (2019)

Question: shape of the risk curve w.r.t. “over-parametrization”? We model the intrinsic dimension as d = n^α with α ∈ (0, 1), with feature covariance Σd = Id, and consider a non-linear kernel regression model.

7 / 37

slide-13
SLIDE 13

Intro. Min-norm Interpolant Regression Classification

SHAPE OF RISK CURVE

We consider the intrinsic dim. d = n^α with α ∈ (0, 1), and a non-linear kernel regression model.

DGP.

  • {xi}_{i=1}^n i.i.d. ∼ µ = P⊗d; the distribution of each coordinate x ∼ P satisfies the weak moment condition ∀t > 0, P(∣x∣ > t) ≤ C(1 + t)^{−ν}.
  • target f⋆(x) := E[Y ∣ X = x], with bounded Var[Y ∣ X = x].

Kernel.

  • h ∈ C^∞(R), h(t) = ∑_{i=0}^∞ αi t^i with αi ≥ 0.
  • inner product kernel k(x, z) = h(⟨x, z⟩/d).

Target Function.

  • Assume f⋆(x) = ∫ k(x, z) ρ⋆(z) µ(dz) with ∥ρ⋆∥µ ≤ C.

8 / 37

slide-14
SLIDE 14

Intro. Min-norm Interpolant Regression Classification

SHAPE OF RISK CURVE

We consider the intrinsic dim. d = n^α with α ∈ (0, 1), and a non-linear kernel regression model. Given n i.i.d. data pairs (xi, yi) ∼ P_{X,Y}, what is the risk curve for the minimum RKHS-norm ∥⋅∥H interpolant f̂?

f̂ = argmin_f ∥f∥H,  s.t.  yi = f(xi) ∀i ∈ [n].
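By the representer theorem, this minimum RKHS-norm interpolant has the closed form f̂(·) = k(·, X) K(X, X)^{−1} y (ridgeless kernel regression), assuming K(X, X) is invertible. Below is a minimal numerical sketch under the setup above; the coordinate law P, the choice h(t) = exp(t), and the target function are illustrative assumptions.

```python
# Minimal sketch of the min-RKHS-norm interpolant f_hat(x) = k(x, X) K^{-1} y
# (ridgeless kernel regression) under the setup d = n^alpha, k(x, z) = h(<x, z>/d).
# The coordinate law P, the choice h(t) = exp(t), and the target are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, alpha = 500, 0.66
d = int(n ** alpha)

def kernel(A, B):                              # inner-product kernel k(x, z) = h(<x, z>/d)
    return np.exp(A @ B.T / d)                 # h(t) = exp(t): all Taylor coeffs >= 0

X = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(n, d))     # i.i.d. coordinates, variance 1
f_star = lambda Z: np.sin(Z[:, 0]) + 0.5 * Z[:, 1]          # illustrative target
y = f_star(X) + 0.1 * rng.standard_normal(n)

coef = np.linalg.solve(kernel(X, X), y)        # lambda = 0: exact interpolation of (X, y)

X_test = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(2000, d))
risk = np.mean((kernel(X_test, X) @ coef - f_star(X_test)) ** 2)
print(f"n = {n}, d = {d}: estimated risk of the interpolant = {risk:.4f}")
```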

8 / 37

slide-15
SLIDE 15

Intro. Min-norm Interpolant Regression Classification

SHAPE OF RISK CURVE

Theorem (L., Rakhlin & Zhai, ’19). For any integer ι ≥ 1, consider d = n^α where α ∈ (1/(ι+1), 1/ι).

slide-16
SLIDE 16

Intro. Min-norm Interpolant Regression Classification

SHAPE OF RISK CURVE

Theorem (L., Rakhlin & Zhai, ’19). For any integer ι ≥ 1, consider d = n^α where α ∈ (1/(ι+1), 1/ι). With probability at least 1 − δ − e^{−n/d^ι} on the design X ∈ R^{n×d},

E[∥f̂ − f⋆∥²µ ∣ X] ≤ C ⋅ (d^ι/n + n/d^{ι+1}) ≍ n^{−β},  β := min{(ι + 1)α − 1, 1 − ια}.

Here the constant C(δ, ι, h, P) does not depend on d, n.

9 / 37

slide-17
SLIDE 17

Intro. Min-norm Interpolant Regression Classification

MULTIPLE DESCENT

[Figure: rate exponent β = min{(ι + 1)α − 1, 1 − ια} plotted against the scaling exponent α (d = n^α); x-axis ticks at 1/4, 1/3, 1/2, 1.]

  • multiple-descent behavior of the rates as the scaling d = n^α changes.

10 / 37

slide-18
SLIDE 18

Intro. Min-norm Interpolant Regression Classification

MULTIPLE DESCENT

[Figure: rate exponent β vs. scaling exponent α (d = n^α).]

  • multiple-descent behavior of the rates as the scaling d = n^α changes.
  • valley: a “valley” on the rate curve at d = n^{1/(ι+1/2)}, ι ∈ N

10 / 37

slide-19
SLIDE 19

Intro. Min-norm Interpolant Regression Classification

MULTIPLE DESCENT

[Figure: rate exponent β vs. scaling exponent α (d = n^α).]

  • multiple-descent behavior of the rates as the scaling d = n^α changes.
  • valley: a “valley” on the rate curve at d = n^{1/(ι+1/2)}, ι ∈ N
  • over-parametrization: towards the over-parametrized regime, the rate at the bottom of the valley improves

10 / 37

slide-20
SLIDE 20

Intro. Min-norm Interpolant Regression Classification

MULTIPLE DESCENT

[Figure: rate exponent β vs. scaling exponent α (d = n^α).]

  • multiple-descent behavior of the rates as the scaling d = n^α changes.
  • valley: a “valley” on the rate curve at d = n^{1/(ι+1/2)}, ι ∈ N (derivation sketched below)
  • over-parametrization: towards the over-parametrized regime, the rate at the bottom of the valley improves
  • empirical: preliminary empirical evidence of multiple descent
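Where the valley location and its rate come from: balancing the two terms in the exponent β from the theorem reproduces the d = n^{1/(ι+1/2)} location quoted above. The short derivation below is supplied here for completeness.

```latex
% Balance the two terms of \beta = \min\{(\iota+1)\alpha - 1,\; 1 - \iota\alpha\}
% over \alpha \in (1/(\iota+1), 1/\iota):
\[
(\iota+1)\alpha - 1 = 1 - \iota\alpha
\;\Longrightarrow\;
\alpha = \frac{2}{2\iota+1} = \frac{1}{\iota + 1/2},
\qquad
\beta = 1 - \iota\cdot\frac{2}{2\iota+1} = \frac{1}{2\iota+1}.
\]
% So the \iota-th valley sits at d = n^{1/(\iota+1/2)}, where the risk is
% \asymp n^{-1/(2\iota+1)}: n^{-1/3} for \iota = 1, n^{-1/5} for \iota = 2, etc.,
% which is why the best achievable rate improves toward the over-parametrized end.
```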

10 / 37

slide-21
SLIDE 21

Intro. Min-norm Interpolant Regression Classification

EMPIRICAL EVIDENCE

Empirical evidence of multiple-descent behavior as the scaling d = n^α changes.

11 / 37

slide-22
SLIDE 22

Intro. Min-norm Interpolant Regression Classification

MULTIPLE DESCENT

[Figure: rate exponent β vs. scaling exponent α (d = n^α) — theory curve overlaid with empirical estimates.]

12 / 37

slide-23
SLIDE 23

Intro. Min-norm Interpolant Regression Classification

MULTIPLE DESCENT

[Figure: rate exponent β vs. scaling exponent α (d = n^α).]

  • multiple-descent behavior of the rates as the scaling d = n^α changes.
  • α = 1: Liang and Rakhlin (2018)
  • α = 0: Rakhlin and Zhai (2018)
  • α = 1, double descent: Belkin, Hsu, Ma, and Mandal (2018); Hastie, Montanari, Rosset, and Tibshirani (2019); Bartlett, Long, Lugosi, and Tsigler (2019)
  • general α, stair-case rates, random Fourier features: Ghorbani, Mei, Misiakiewicz, and Montanari (2019)

13 / 37

slide-24
SLIDE 24

Intro. Min-norm Interpolant Regression Classification

APPLICATION TO WIDE NEURAL NETWORKS

Neural Tangent Kernel (NTK)

Jacot, Gabriel, and Hongler (2018); Du, Zhai, Poczos, and Singh (2018)......

k_NTK(x, x′) = (1/4π) U(⟨x, x′⟩ / (∥x∥∥x′∥)),  where U(t) = 3t(π − arccos(t)) + √(1 − t²)

14 / 37

slide-25
SLIDE 25

Intro. Min-norm Interpolant Regression Classification

APPLICATION TO WIDE NEURAL NETWORKS

Neural Tangent Kernel (NTK)

Jacot, Gabriel, and Hongler (2018); Du, Zhai, Poczos, and Singh (2018)......

k_NTK(x, x′) = (1/4π) U(⟨x, x′⟩ / (∥x∥∥x′∥)),  U(t) = 3t(π − arccos(t)) + √(1 − t²)

Our results can be generalized to kernels of the form

k(x, x′) = ∑_{i=0}^∞ αi ⋅ (⟨x, x′⟩ / (∥x∥∥x′∥))^i,

which include the NTK.

Corollary (L., Rakhlin & Zhai, ’19). Consider an integer ι that satisfies d^ι log d ≾ n ≾ d^{ι+1}/log d; then

Risk ≾ d^ι/n + (n log d)/d^{ι+1}.
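A minimal sketch of this kernel class in code, instantiated with the NTK formula quoted above; the Gram matrix is then used like any other inner-product kernel, e.g. for the ridgeless interpolant. The data and the tiny jitter added for numerical stability are illustrative assumptions.

```python
# Minimal sketch of this kernel class, instantiated with the NTK formula quoted
# above: k_NTK(x, x') = (1/4pi) U(t), U(t) = 3t(pi - arccos t) + sqrt(1 - t^2),
# where t = <x, x'>/(|x||x'|). Data and the tiny jitter are illustrative choices.
import numpy as np

def ntk_kernel(X, Z):
    """Gram matrix of the NTK as written on the slide (t = cosine similarity)."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Zn = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    t = np.clip(Xn @ Zn.T, -1.0, 1.0)
    return (3.0 * t * (np.pi - np.arccos(t)) + np.sqrt(1.0 - t * t)) / (4.0 * np.pi)

# Usage: ridgeless (min-RKHS-norm) interpolation with the NTK Gram matrix.
rng = np.random.default_rng(1)
n, d = 300, 50
X = rng.standard_normal((n, d))
y = np.sign(X[:, 0]) + 0.1 * rng.standard_normal(n)         # illustrative labels
coef = np.linalg.solve(ntk_kernel(X, X) + 1e-10 * np.eye(n), y)
x_new = rng.standard_normal((5, d))
print(ntk_kernel(x_new, X) @ coef)                           # predictions at new points
```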

14 / 37

slide-26
SLIDE 26

Intro. Min-norm Interpolant Regression Classification

IDEAS BEHIND THE PROOF

Proof idea: on a filtration of spaces indexed by the polynomial basis, establish restricted lower isometry of the empirical kernel.

Filtrated empirical kernel:

nK^{[≤ι]}_{ij} := ∑_{r1,⋯,rd ≥ 0, r1+⋯+rd ≤ ι} c_{r1⋯rd} α_{r1+⋯+rd} p_{r1⋯rd}(xi) p_{r1⋯rd}(xj) / d^{r1+⋯+rd}

nK^{[≤ι]} = Φ Φ⊺,  with Φ ∈ R^{n × C(ι+d, ι)}.

Filtrated sample covariance operator:

Θ^{[≤ι]} := (1/n) Φ⊺ Φ ∈ R^{C(ι+d, ι) × C(ι+d, ι)}

15 / 37

slide-27
SLIDE 27

Intro. Min-norm Interpolant Regression Classification

IDEAS BEHIND THE PROOF

Proof idea: on a filtration of spaces indexed by the polynomial basis, establish restricted lower isometry of the empirical kernel.

Filtrated empirical kernel:

nK^{[≤ι]}_{ij} := ∑_{r1,⋯,rd ≥ 0, r1+⋯+rd ≤ ι} c_{r1⋯rd} α_{r1+⋯+rd} p_{r1⋯rd}(xi) p_{r1⋯rd}(xj) / d^{r1+⋯+rd}

nK^{[≤ι]} = Φ Φ⊺,  with Φ ∈ R^{n × C(ι+d, ι)}.

Filtrated sample covariance operator:

Θ^{[≤ι]} := (1/n) Φ⊺ Φ ∈ R^{C(ι+d, ι) × C(ι+d, ι)}

Restricted Lower Isometry of Kernel: all non-zero eigenvalues of K^{[≤ι]} are lower bounded by d^{−ι}, i.e.,

λ_min(Θ^{[≤ι]}) ≿ d^{−ι}
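A minimal numerical sketch of this claim: build the degree-≤ι filtrated polynomial features Φ with the scaling above, form Θ = Φ⊺Φ/n, and compare its smallest eigenvalue with d^{−ι}. The sizes, the coordinate law P, and the Taylor coefficients αi used below are illustrative assumptions.

```python
# Minimal numerical sketch of the restricted-lower-isometry claim: build the
# degree <= iota filtrated polynomial features Phi (scaled as above), form
# Theta = Phi^T Phi / n, and compare lambda_min(Theta) with d^{-iota}.
# The sizes, the coordinate law P, and the Taylor coefficients are illustrative.
import numpy as np
from itertools import combinations_with_replacement
from math import factorial

rng = np.random.default_rng(0)
d, iota, n = 8, 2, 4000                      # want d^iota * log d = o(n)
X = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(n, d))   # i.i.d. coordinates, variance 1
alpha = [1.0, 1.0, 0.5]                      # Taylor coefficients alpha_0 .. alpha_iota of h

cols = []
for deg in range(iota + 1):
    for combo in combinations_with_replacement(range(d), deg):
        r = np.bincount(np.array(combo, dtype=int), minlength=d)      # multi-index (r1..rd)
        c = factorial(deg) / np.prod([factorial(int(k)) for k in r])  # multinomial coefficient
        mono = np.prod(X ** r, axis=1)                                # monomial p_r(x_i)
        cols.append(np.sqrt(c * alpha[deg]) * mono / d ** (deg / 2.0))
Phi = np.column_stack(cols)                  # n x C(iota + d, iota) feature matrix
Theta = Phi.T @ Phi / n                      # filtrated sample covariance operator
lam_min = np.linalg.eigvalsh(Theta).min()
print(f"lambda_min(Theta) = {lam_min:.2e}   vs   d^(-iota) = {d ** -iota:.2e}")
```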

15 / 37

slide-28
SLIDE 28

Intro. Min-norm Interpolant Regression Classification

IDEAS BEHIND THE PROOF

Small-ball approach rather than standard concentration.

Lower bounding λ_min((1/n) Ψ⊺Ψ) is equivalent to: for all u with ∥u∥ = 1, lower bound (1/n)∥Ψu∥².

Utilize non-negativity:

(1/n)∥Ψu∥² = (1/n) ∑_{i=1}^n ⟨Ψ(xi), u⟩² ≥ c1 E[⟨Ψ(X), u⟩²] ⋅ (1/n) ∑_{i=1}^n I{⟨Ψ(xi), u⟩² ≥ c1 E[⟨Ψ(X), u⟩²]}

Small-ball property: there exist constants c1, c2 such that

P(⟨Ψ(xi), u⟩² ≥ c1 E[⟨Ψ(X), u⟩²]) ≥ c2

Koltchinskii and Mendelson (2015); Mendelson (2014)

which implies, w.p. at least 1 − exp(−c ⋅ n),

(1/n) ∑_{i=1}^n I{⟨Ψ(xi), u⟩² ≥ c1 E[⟨Ψ(X), u⟩²]} ≥ c2/2

Non-trivial: verify the small-ball property for polynomials (weakly dependent) via Paley–Zygmund.
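For completeness, here is the Paley–Zygmund inequality that the argument invokes, and how it yields the small-ball constants; the moment-comparison constant C_ι below is an assumption that has to be verified for the (weakly dependent) polynomial features, which is exactly the non-trivial part.

```latex
% Paley–Zygmund: for a nonnegative random variable Z with E[Z^2] < \infty
% and any \theta \in (0,1),
\[
\mathbb{P}\bigl(Z \ge \theta\,\mathbb{E}[Z]\bigr)
\;\ge\; (1-\theta)^2\,\frac{(\mathbb{E}[Z])^2}{\mathbb{E}[Z^2]}.
\]
% Applied to Z = \langle \Psi(x), u\rangle^2: if a moment-comparison bound
% E[Z^2] \le C_\iota\,(E[Z])^2 holds uniformly over u, then taking
% c_1 = \theta and c_2 = (1-\theta)^2/C_\iota gives the small-ball property
%   P(\langle\Psi(x),u\rangle^2 \ge c_1\,E[\langle\Psi(X),u\rangle^2]) \ge c_2.
```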

16 / 37

slide-29
SLIDE 29

Intro. Min-norm Interpolant Regression Classification

CLASSIFICATION

Precise High-Dimensional Asymptotic Theory for Boosting and Min-L1-Norm Interpolated Classifiers with Pragya Sur (Harvard)

17 / 37

slide-30
SLIDE 30

Intro. Min-norm Interpolant Regression Classification

MIN-L1-NORM INTERPOLATED CLASSIFIER

Regression so far; what about classification?

Given n i.i.d. data pairs {(xi, yi)}_{i=1}^n with labels yi ∈ {±1} and feature vectors xi ∈ R^p, we consider the minimum-L1-norm interpolated classifier

θ̂ = argmin_θ ∥θ∥1,  s.t.  yi x⊺i θ ≥ 1,

when the data is separable.
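This optimization is a linear program (it is the “LP” curve in the theory-vs-empirical comparison later): split θ = u − v with u, v ≥ 0 and minimize the sum of the entries. A minimal scipy sketch, with synthetic separable data as an illustrative assumption:

```python
# Minimal LP sketch of the min-L1-norm interpolated classifier
#   min ||theta||_1  s.t.  y_i x_i^T theta >= 1,
# via the standard split theta = u - v with u, v >= 0 (scipy linprog).
# The synthetic separable data below is an illustrative assumption.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, p = 100, 300                              # over-parametrized: psi = p/n = 3
theta_star = np.zeros(p); theta_star[:10] = 1.0
X = rng.standard_normal((n, p))
y = np.sign(X @ theta_star + 1e-12)          # noiseless labels -> separable

# Variables z = [u; v]; objective sum(u) + sum(v) = ||theta||_1.
c = np.ones(2 * p)
A_ub = -np.hstack([y[:, None] * X, -(y[:, None] * X)])   # -y_i x_i^T (u - v) <= -1
b_ub = -np.ones(n)
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method="highs")
theta_hat = res.x[:p] - res.x[p:]
margin = (y * (X @ theta_hat)).min() / np.linalg.norm(theta_hat, 1)
print(f"||theta_hat||_1 = {np.linalg.norm(theta_hat, 1):.3f}, "
      f"normalized L1-margin = {margin:.4f}")
```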

18 / 37

slide-31
SLIDE 31

Intro. Min-norm Interpolant Regression Classification

MIN-L1-NORM INTERPOLATED CLASSIFIER

Regression so far; what about classification?

Given n i.i.d. data pairs {(xi, yi)}_{i=1}^n with labels yi ∈ {±1} and feature vectors xi ∈ R^p, we consider the minimum-L1-norm interpolated classifier

θ̂ = argmin_θ ∥θ∥1,  s.t.  yi x⊺i θ ≥ 1,

when the data is separable. The min-L1-norm interpolated classifier agrees with the max-L1-margin direction:

max_{∥θ∥1≤1} min_{1≤i≤n} yi x⊺i θ =: κℓ1(X, y).

18 / 37

slide-32
SLIDE 32

Intro. Min-norm Interpolant Regression Classification

WHY L1 MARGIN?

Algorithmic: on separable data, the Boosting iterate θ̂^{t,η}_boost with infinitesimal step-size η agrees with the min-L1-norm direction asymptotically:

lim_{η→0} lim_{t→∞} θ̂^{t,η}_boost / ∥θ̂^{t,η}_boost∥1 = θ̂ .

Freund and Schapire (1995); Zhang and Yu (2005)
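A minimal sketch of this connection: boosting viewed as coordinate descent on the exponential loss with a small constant step size η; on separable data the L1-normalized iterate tracks the max-L1-margin direction as η → 0 and t → ∞ (Zhang and Yu 2005). The data, step size, and iteration count below are illustrative assumptions, not the paper's exact algorithm.

```python
# Minimal sketch of boosting as coordinate descent on the exponential loss,
# with a small constant step size eta. On separable data the normalized
# iterate theta_t / ||theta_t||_1 tracks the max-L1-margin direction as
# eta -> 0 and t -> infinity. Illustrative settings only.
import numpy as np

rng = np.random.default_rng(0)
n, p, eta, T = 100, 300, 0.05, 5000
X = rng.standard_normal((n, p))
y = np.sign(X @ (np.arange(p) < 10).astype(float) + 1e-12)

theta = np.zeros(p)
for t in range(T):
    w = np.exp(-y * (X @ theta))            # exponential-loss weights
    grad = -(y * w) @ X                     # gradient of sum_i exp(-y_i x_i^T theta)
    j = np.argmax(np.abs(grad))             # weak learner = best single coordinate
    theta[j] -= eta * np.sign(grad[j])      # small fixed step on that coordinate

margin = (y * (X @ theta)).min() / np.linalg.norm(theta, 1)
print(f"normalized L1-margin after {T} boosting steps: {margin:.4f}")
```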

19 / 37

slide-33
SLIDE 33

Intro. Min-norm Interpolant Regression Classification

PRECISE HIGH-DIM ASYMPTOTIC THEORY FOR BOOSTING

  • DGP. xi ∼ N(0, Λ) i.i.d. with covariance Λ ∈ R^{p×p}, and yi generated with some f : R → [0, 1] via

    P(yi = +1 ∣ xi) = f(x⊺i θ⋆),

    for some θ⋆ ∈ R^p. Consider the high-dimensional asymptotic regime with over-parametrization ratio p/n → ψ ∈ (0, ∞) as p, n → ∞.
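A minimal sketch of this data-generating process, with Λ = I_p and a logistic link f as illustrative assumptions:

```python
# Minimal sketch of the DGP above: x_i ~ N(0, Lambda) i.i.d. and
# P(y_i = +1 | x_i) = f(x_i^T theta_star), in the regime p/n -> psi.
# Lambda = I_p and the logistic link f are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
psi, n = 2.0, 400
p = int(psi * n)
theta_star = rng.standard_normal(p)
theta_star /= np.linalg.norm(theta_star)               # fix the signal scale
f = lambda t: 1.0 / (1.0 + np.exp(-t))                 # link f: R -> [0, 1]

X = rng.standard_normal((n, p))                        # Lambda = I_p
y = np.where(rng.uniform(size=n) < f(X @ theta_star), 1.0, -1.0)
print(f"p/n = {p / n:.1f}, fraction of +1 labels = {np.mean(y == 1):.2f}")
```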

20 / 37

slide-34
SLIDE 34

Intro. Min-norm Interpolant Regression Classification

PRECISE HIGH-DIM ASYMPTOTIC THEORY FOR BOOSTING

Statistical.

  • how large is the empirical L1-margin?
  • angle between θ̂ (the min-L1-norm interpolated classifier) and the truth θ⋆?

  • generalization properties of Boosting?

Computational.

  • how many iterations of Boosting (precisely, as a function of the over-parametrization ratio p/n) are required for an ε-approximation to the max-L1-margin?

  • what proportion of features is activated by Boosting (with zero initialization) when the training error vanishes?

20 / 37

slide-35
SLIDE 35

Intro. Min-norm Interpolant Regression Classification

PRECISE HIGH-DIM ASYMPTOTIC THEORY FOR BOOSTING

Theorem (L. & Sur, ’20). Under mild conditions, for ψ ≥ ψ⋆(0), the following sharp asymptotic characterization holds:

lim_{n,p→∞} p^{1/2} ⋅ κℓ1(X, y) = κ⋆(ψ, µ),  a.s.

Generalization error:

lim_{n,p→∞} P_{x,y}(y ⋅ x⊺ θ̂ℓ1 < 0) = Err⋆(ψ, µ),  a.s.

Thrampoulidis et al. (2014, 2015, 2018); Gordon (1988) Montanari et al. (2019); Deng et al. (2019); Shcherbina and Tirozzi (2003); Gardner (1988)

20 / 37

slide-36
SLIDE 36

Intro. Min-norm Interpolant Regression Classification

PRECISE HIGH-DIM ASYMPTOTIC THEORY FOR BOOSTING

κ⋆(ψ, µ) enjoys the following analytic characterization [L. & Sur, ’20].

Define Fκ : R × R≥0 → R≥0,

Fκ(c1, c2) := (E[(κ − c1 Y Z1 − c2 Z2)²])^{1/2},

where Z2 ⊥ (Y, Z1),  Zi ∼ N(0, 1) for i = 1, 2,  and  P(Y = +1 ∣ Z1) = 1 − P(Y = −1 ∣ Z1) = f(ρ ⋅ Z1).

21 / 37

slide-37
SLIDE 37

Intro. Min-norm Interpolant Regression Classification

PRECISE HIGH-DIM ASYMPTOTIC THEORY FOR BOOSTING

κ⋆(ψ, µ) enjoys the following analytic characterization [L. & Sur, ’20].

Fixed-point equations for (c1, c2, s) ∈ R × R>0 × R>0 given ψ > 0, where the expectation is over (Λ, W, G) ∼ µ ⊗ N(0, 1) =: Q. Writing, for brevity,

A := Λ^{1/2}G + ψ^{−1/2}[∂1Fκ(c1, c2) − c1 c2^{−1} ∂2Fκ(c1, c2)] Λ^{1/2}W,  D := ψ^{−1/2} c2^{−1} ∂2Fκ(c1, c2),

the equations read

c1 = − E_Q[ Λ^{1/2}W ⋅ prox_s(A) / D ],

c1² + c2² = E_Q[ (prox_s(A) / D)² ],

1 = E_Q[ prox_s(A) / D ],

with prox_λ(t) = argmin_u { λ∣u∣ + (1/2)(u − t)² } = sgn(t)(∣t∣ − λ)₊.

21 / 37

slide-38
SLIDE 38

Intro. Min-norm Interpolant Regression Classification

PRECISE HIGH-DIM ASYMPTOTIC THEORY FOR BOOSTING

κ⋆(ψ, µ) enjoys the following analytic characterization [L. & Sur, ’20].

Fixed-point equations for (c1, c2, s) ∈ R × R>0 × R>0 given ψ > 0, where the expectation is over (Λ, W, G) ∼ µ ⊗ N(0, 1) =: Q. Writing, for brevity,

A := Λ^{1/2}G + ψ^{−1/2}[∂1Fκ(c1, c2) − c1 c2^{−1} ∂2Fκ(c1, c2)] Λ^{1/2}W,  D := ψ^{−1/2} c2^{−1} ∂2Fκ(c1, c2),

the equations read

c1 = − E_Q[ Λ^{1/2}W ⋅ prox_s(A) / D ],

c1² + c2² = E_Q[ (prox_s(A) / D)² ],

1 = E_Q[ prox_s(A) / D ],

with prox_λ(t) = argmin_u { λ∣u∣ + (1/2)(u − t)² } = sgn(t)(∣t∣ − λ)₊.

Define T(ψ, κ) := ψ^{−1/2}[Fκ(c1, c2) − c1 ∂1Fκ(c1, c2) − c2 ∂2Fκ(c1, c2)] − s, evaluated at the solution c1(ψ, κ), c2(ψ, κ), s(ψ, κ). Then

κ⋆(ψ, µ) := inf{κ ≥ 0 : T(ψ, κ) ≥ 0}.

21 / 37

slide-39
SLIDE 39

Intro. Min-norm Interpolant Regression Classification

THEORY VS. EMPIRICAL

[Figure: Max-L1-Margin — theory (CGMT) vs. empirical (LP), as a function of the over-parametrization ratio.]

[Figure: Generalization Error for the Min-L1-Interpolated Classifier — theory (CGMT) vs. empirical (LP).]

22 / 37

slide-40
SLIDE 40

Intro. Min-norm Interpolant Regression Classification

TECHNICAL REMARKS

Our results build upon the Convex Gaussian Minimax Theorem, Thrampoulidis et al. (2014, 2015, 2018); Gordon (1988), and the work on the L2-margin by Montanari et al. (2019).

The L1 case introduces some technical issues to overcome:

  • we prove a stronger uniform deviation result suited to the L1 case, via a self-normalization property.
  • different fixed-point equation systems.
  • the (normalized) max L1 margin is much larger than the max L2 margin.

23 / 37

slide-41
SLIDE 41

Intro. Min-norm Interpolant Regression Classification

ALGORITHMIC: BOOSTING

Theorem (L. & Sur, ’20). With a proper (non-vanishing) learning rate, the sequence {θ̂t}_{t=0}^∞ obtained by the Boosting algorithm satisfies: for any 0 < ε < 1, when the number of iterations t ≥ Tε(p) with

lim_{n,p→∞} Tε(p) / (p log² n) = 12 ε^{−2} / κ⋆²(ψ, µ),

the solution θ̂t/∥θ̂t∥1 is a (1 − ε)-approximation to the Min-L1-Interpolated Classifier:

p^{1/2} ⋅ min_{i∈[n]} yi x⊺i θ̂t / ∥θ̂t∥1 ∈ [(1 − ε) ⋅ κ⋆(ψ, µ), κ⋆(ψ, µ)].

24 / 37

slide-42
SLIDE 42

Intro. Min-norm Interpolant Regression Classification

ALGORITHMIC: ACTIVATED FEATURES BY BOOSTING

Theorem (L. & Sur, ’20). Let S0(p) be the number of features selected when Boosting (for the first time, at iteration t) attains zero training error with initialization θ̂0 = 0:

(1/n) ∑_{i=1}^n I{yi x⊺i θ̂t ≤ 0} = 0,  with  S0(p) := #{j ∈ [p] : θ̂t_j ≠ 0}.

We show

limsup_{n,p→∞} S0(p) / (p log² p) ≤ (12 / κ⋆²(ψ, µ)) ∧ 1.

25 / 37

slide-43
SLIDE 43

Intro. Min-norm Interpolant Regression Classification

PROOF SKETCH

Step 1: √p-rescaling of the L1 ball

ξ^{(n,p)}_{ψ,κ} := min_{∥θ∥1≤√p} max_{∥λ∥2≤1, λ≥0} (1/√p) λ⊺(κ1 − (y ⊙ X)θ)

It is not hard to see that

ξ^{(n,p)}_{ψ,κ} = 0  if and only if  κ ≤ p^{1/2} ⋅ κℓ1({xi, yi}_{i=1}^n),
ξ^{(n,p)}_{ψ,κ} > 0  if and only if  κ > p^{1/2} ⋅ κℓ1({xi, yi}_{i=1}^n).

26 / 37

slide-44
SLIDE 44

Intro. Min-norm Interpolant Regression Classification

PROOF SKETCH

Step 1: √p-rescaling of the L1 ball

ξ^{(n,p)}_{ψ,κ} := min_{∥θ∥1≤√p} max_{∥λ∥2≤1, λ≥0} (1/√p) λ⊺(κ1 − (y ⊙ X)θ)

equivalently,

ξ^{(n,p)}_{ψ,κ} = min_{∥θ∥1≤√p} max_{∥λ∥2≤1, λ≥0} (1/√p) λ⊺(κ1 − (y ⊙ z)⟨w, Λ^{1/2}θ⟩) − (1/√p) λ⊺ Z Π_{w⊥}(Λ^{1/2}θ)

Step 2: reduction via Gordon’s comparison (convex Gaussian min-max theorem)

Thrampoulidis et al. (2014, 2015, 2018); Gordon (1988)

ξ̂^{(n,p)}_{ψ,κ} := min_{∥θ∥1≤√p} max_{∥λ∥2≤1, λ≥0} (1/√p) λ⊺(κ1 − (y ⊙ z)⟨w, Λ^{1/2}θ⟩ − z̃ ∥Π_{w⊥}(Λ^{1/2}θ)∥2) + (1/√p) ∥λ∥2 ⟨g, Π_{w⊥}(Λ^{1/2}θ)⟩
              = min_{∥θ∥1≤√p} [ ψ^{−1/2} F̂κ(⟨w, Λ^{1/2}θ⟩, ∥Π_{w⊥}(Λ^{1/2}θ)∥2) + (1/√p) ⟨Π_{w⊥}(g), Λ^{1/2}θ⟩ ]

26 / 37

slide-45
SLIDE 45

Intro. Min-norm Interpolant Regression Classification

TECHNICAL CHALLENGES IN L1 CASE

Step 3: large n, p limit

The empirical problem (finite-dimensional optimization):

ξ̂^{(n,p)}_{ψ,κ} = min_{∥θ∥1≤√p} [ ψ^{−1/2} F̂κ(⟨w, Λ^{1/2}θ⟩, ∥Π_{w⊥}(Λ^{1/2}θ)∥2) + (1/√p) ⟨Π_{w⊥}(g), Λ^{1/2}θ⟩ ]

Let’s naively take the limit (infinite-dimensional optimization):

ξ̃^{(∞,∞)}_{ψ,κ} := min_{∥h∥_{L1(Q)}≤1} [ ψ^{−1/2} Fκ(⟨w, Λ^{1/2}h⟩_{L2(Q)}, ∥Π_{w⊥}(Λ^{1/2}h)∥_{L2(Q)}) + ⟨Π_{w⊥}(G), Λ^{1/2}h⟩_{L2(Q)} ]

One needs to show

lim_{p→∞, p/n(p)→ψ} ξ̂^{(n,p)}_{ψ,κ} = ξ̃^{(∞,∞)}_{ψ,κ}  a.s.

27 / 37

slide-46
SLIDE 46

Intro. Min-norm Interpolant Regression Classification

TECHNICAL CHALLENGES IN L1 CASE

Step 3: large n, p limit

The empirical problem (finite-dimensional optimization):

ξ̂^{(n,p)}_{ψ,κ} = min_{∥θ∥1≤√p} [ ψ^{−1/2} F̂κ(⟨w, Λ^{1/2}θ⟩, ∥Π_{w⊥}(Λ^{1/2}θ)∥2) + (1/√p) ⟨Π_{w⊥}(g), Λ^{1/2}θ⟩ ]

Let’s naively take the limit (infinite-dimensional optimization):

ξ̃^{(∞,∞)}_{ψ,κ} := min_{∥h∥_{L1(Q)}≤1} [ ψ^{−1/2} Fκ(⟨w, Λ^{1/2}h⟩_{L2(Q)}, ∥Π_{w⊥}(Λ^{1/2}h)∥_{L2(Q)}) + ⟨Π_{w⊥}(G), Λ^{1/2}h⟩_{L2(Q)} ]

One needs to show

lim_{p→∞, p/n(p)→ψ} ξ̂^{(n,p)}_{ψ,κ} = ξ̃^{(∞,∞)}_{ψ,κ}  a.s.

L1 vs. L2 geometry: for the constraint set ∥θ∥1 ≤ √p, define c1 = ⟨w, Λ^{1/2}θ⟩ and c2 = ∥Π_{w⊥}(Λ^{1/2}θ)∥2; c2 could be of order √p → ∞.

27 / 37

slide-47
SLIDE 47

Intro. Min-norm Interpolant Regression Classification

TECHNICAL CHALLENGES IN L1 CASE

Step 3: large n, p limit

The empirical problem (finite-dimensional optimization):

ξ̂^{(n,p)}_{ψ,κ} = min_{∥θ∥1≤√p} [ ψ^{−1/2} F̂κ(⟨w, Λ^{1/2}θ⟩, ∥Π_{w⊥}(Λ^{1/2}θ)∥2) + (1/√p) ⟨Π_{w⊥}(g), Λ^{1/2}θ⟩ ]

Let’s naively take the limit (infinite-dimensional optimization):

ξ̃^{(∞,∞)}_{ψ,κ} := min_{∥h∥_{L1(Q)}≤1} [ ψ^{−1/2} Fκ(⟨w, Λ^{1/2}h⟩_{L2(Q)}, ∥Π_{w⊥}(Λ^{1/2}h)∥_{L2(Q)}) + ⟨Π_{w⊥}(G), Λ^{1/2}h⟩_{L2(Q)} ]

One needs to show

lim_{p→∞, p/n(p)→ψ} ξ̂^{(n,p)}_{ψ,κ} = ξ̃^{(∞,∞)}_{ψ,κ}  a.s.

L1 vs. L2 geometry: for the constraint set ∥θ∥1 ≤ √p, define c1 = ⟨w, Λ^{1/2}θ⟩ and c2 = ∥Π_{w⊥}(Λ^{1/2}θ)∥2; c2 could be of order √p → ∞.

[L. & Sur ’20] shows uniform deviation over an unbounded domain for the fixed-point equations (KKT), using a key self-normalization property of ∂iFκ(c1, c2).

For i = 1, 2, we have w.p. at least 1 − n^{−2},

sup_{∣c1∣≤M, c2>0} ∣∂iF̂κ(c1, c2) − ∂iFκ(c1, c2)∣ ≤ C log n / √n

27 / 37

slide-48
SLIDE 48

Intro. Min-norm Interpolant Regression Classification

[BACKUP] CONVEX GAUSSIAN MINIMAX THEOREM

Let C1 ⊂ R^n, C2 ⊂ R^p be two compact sets and let R : C1 × C2 → R be a continuous function. Let X = (Xij) ∈ R^{n×p}, g ∼ N(0, In) and h ∼ N(0, Ip) be independent vectors and matrices with standard Gaussian entries. Define

Q1(X) = min_{w1∈C1} max_{w2∈C2} w1⊺ X w2 + R(w1, w2)

Q2(g, h) = min_{w1∈C1} max_{w2∈C2} ∥w2∥ g⊺w1 + ∥w1∥ h⊺w2 + R(w1, w2).

Then:

  1. For all t ∈ R, P(Q1(X) ≤ t) ≤ 2 P(Q2(g, h) ≤ t).
  2. Suppose C1 and C2 are both convex, and R is convex-concave in (w1, w2). Then, for all t ∈ R, P(Q1(X) ≥ t) ≤ 2 P(Q2(g, h) ≥ t).

28 / 37

slide-49
SLIDE 49

Intro. Min-norm Interpolant Regression Classification

SUMMARY

Research agenda: statistical/generalization theory for min-norm interpolants (naive use of Rademacher complexity or VC dimension does not explain this well).

  • Regression: [L. & Rakhlin ’18], [L. & Dou ’19], [L., Rakhlin & Zhai ’19]
  • Classification: [L. & Sur ’20]

29 / 37

slide-50
SLIDE 50

References

Thank you!

  • 1. Liang, T. & Sur, P. (2020). A Precise High-Dimensional Asymptotic Theory for Boosting and Min-L1-Norm Interpolated Classifiers. arXiv:2002.01586
  • 2. Liang, T., Rakhlin, A. & Zhai, X. (2019). On the Multiple Descent of Minimum-Norm Interpolants and Restricted Lower Isometry of Kernels. arXiv:1908.10292
  • 3. Liang, T. & Rakhlin, A. (2018). Just Interpolate: Kernel “Ridgeless” Regression Can Generalize. The Annals of Statistics, to appear.
  • 4. Dou, X. & Liang, T. (2019). Training Neural Networks as Learning Data-adaptive Kernels: Provable Representation and Approximation Benefits. Journal of the American Statistical Association, to appear.

Peter L. Bartlett, Philip M. Long, Gábor Lugosi, and Alexander Tsigler. Benign overfitting in linear regression. arXiv preprint arXiv:1906.11300, 2019.
Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine learning and the bias-variance trade-off. arXiv preprint arXiv:1812.11118, 2018.
Zeyu Deng, Abla Kammoun, and Christos Thrampoulidis. A model of double descent for high-dimensional binary linear classification. arXiv preprint arXiv:1911.05822, 2019.
Simon S. Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes over-parameterized neural networks. arXiv preprint arXiv:1810.02054, 2018.
Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In European Conference on Computational Learning Theory, pages 23–37. Springer, 1995.

30 / 37

slide-51
SLIDE 51

References

PROOF IDEA: RESTRICTED LOWER ISOMETRY

Proof idea: on a filtration of spaces, establish restricted lower isometry.

Koltchinskii and Mendelson (2015); Mendelson (2014)

31 / 37

slide-52
SLIDE 52

References

PROOF IDEA: RESTRICTED LOWER ISOMETRY

Proof idea: on a filtration of spaces indexed by the polynomial basis, establish restricted lower isometry of the empirical kernel.

Define nK := [k(xi, xj)]_{i,j∈[n]} ∈ R^{n×n}, so that

nKij = h(x⊺i xj / d) = ∑_{ι=0}^∞ αι (x⊺i xj / d)^ι = ∑_{r1,⋯,rd ≥ 0} c_{r1⋯rd} α_{r1+⋯+rd} p_{r1⋯rd}(xi) p_{r1⋯rd}(xj) / d^{r1+⋯+rd}

Define the filtrated empirical kernel

nK^{[≤ι]}_{ij} := ∑_{r1,⋯,rd ≥ 0, r1+⋯+rd ≤ ι} c_{r1⋯rd} α_{r1+⋯+rd} p_{r1⋯rd}(xi) p_{r1⋯rd}(xj) / d^{r1+⋯+rd}

with c_{r1⋯rd} = (r1+⋯+rd)! / (r1!⋯rd!) and p_{r1⋯rd}(xi) = (xi[1])^{r1}⋯(xi[d])^{rd}, the monomials with multi-index r1⋯rd.

31 / 37

slide-53
SLIDE 53

References

RESTRICTED LOWER ISOMETRY OF KERNEL

Filtrated empirical kernel:

nK^{[≤ι]}_{ij} := ∑_{r1,⋯,rd ≥ 0, r1+⋯+rd ≤ ι} c_{r1⋯rd} α_{r1+⋯+rd} p_{r1⋯rd}(xi) p_{r1⋯rd}(xj) / d^{r1+⋯+rd}

nK^{[≤ι]} = Φ Φ⊺,  with filtrated polynomial features Φ ∈ R^{n × C(ι+d, ι)},

Φ_{i,(r1⋯rd)} = (c_{r1⋯rd} α_{r1+⋯+rd})^{1/2} p_{r1⋯rd}(xi) / d^{(r1+⋯+rd)/2}

Filtrated sample covariance operator:

Θ^{[≤ι]} := (1/n) Φ⊺ Φ ∈ R^{C(ι+d, ι) × C(ι+d, ι)}

32 / 37

slide-54
SLIDE 54

References

RESTRICTED LOWER ISOMETRY OF KERNEL

Filtrated empirical kernel:

nK^{[≤ι]}_{ij} := ∑_{r1,⋯,rd ≥ 0, r1+⋯+rd ≤ ι} c_{r1⋯rd} α_{r1+⋯+rd} p_{r1⋯rd}(xi) p_{r1⋯rd}(xj) / d^{r1+⋯+rd}

nK^{[≤ι]} = Φ Φ⊺,  with filtrated polynomial features Φ ∈ R^{n × C(ι+d, ι)},

Φ_{i,(r1⋯rd)} = (c_{r1⋯rd} α_{r1+⋯+rd})^{1/2} p_{r1⋯rd}(xi) / d^{(r1+⋯+rd)/2}

Filtrated sample covariance operator:

Θ^{[≤ι]} := (1/n) Φ⊺ Φ ∈ R^{C(ι+d, ι) × C(ι+d, ι)}

Restricted Lower Isometry of Kernel: all non-zero eigenvalues of K^{[≤ι]} are lower bounded by d^{−ι}, i.e.,

λ_min(Θ^{[≤ι]}) ≿ d^{−ι}

32 / 37

slide-55
SLIDE 55

References

RESTRICTED LOWER ISOMETRY OF KERNEL

Lemma (L., Rakhlin & Zhai, ’19). Assume the Taylor coefficients of h satisfy αi > 0 for all i. Consider any positive integer ι that satisfies d^ι log d = o(n) and ι < ν, where ν is the tail-decay exponent of P. Then with probability at least 1 − exp(−C ⋅ n/d^ι), all non-zero eigenvalues of K^{[≤ι]} are ≥ C ⋅ d^{−ι}.

33 / 37

slide-56
SLIDE 56

References

RESTRICTED LOWER ISOMETRY OF KERNEL

Lemma (L., Rakhlin & Zhai, ’19). Assume the Taylor coefficients of h satisfy αi > 0 for all i. Consider any positive integer ι that satisfies d^ι log d = o(n) and ι < ν, where ν is the tail-decay exponent of P. Then with probability at least 1 − exp(−C ⋅ n/d^ι), all non-zero eigenvalues of K^{[≤ι]} are ≥ C ⋅ d^{−ι}.

Some wrong-but-useful intuition:

  • the non-zero eigenvalues of K^{[≤ι]} equal those of Θ^{[≤ι]}

33 / 37

slide-57
SLIDE 57

References

RESTRICTED LOWER ISOMETRY OF KERNEL

Lemma (L., Rakhlin & Zhai, ’19). Assume the Taylor coefficients of h satisfy αi > 0 for all i. Consider any positive integer ι that satisfies d^ι log d = o(n) and ι < ν, where ν is the tail-decay exponent of P. Then with probability at least 1 − exp(−C ⋅ n/d^ι), all non-zero eigenvalues of K^{[≤ι]} are ≥ C ⋅ d^{−ι}.

Some wrong-but-useful intuition:

  • the non-zero eigenvalues of K^{[≤ι]} equal those of Θ^{[≤ι]}
  • suppose the monomials ∏_{i=1}^d (x[i])^{ri} were orthogonal (wrong); then

    E[Θ^{[≤ι]}] = diag(C(0), ⋯, C(ι′)⋅d^{−ι′}, ⋯, C(ι)⋅d^{−ι}),  where the d^{−ι} block contains C(d+ι−1, d−1) such entries

33 / 37

slide-58
SLIDE 58

References

RESTRICTED LOWER ISOMETRY OF KERNEL

Lemma (L., Rakhlin & Zhai, ’19). Assume the Taylor coefficients of h satisfy αi > 0 for all i. Consider any positive integer ι that satisfies d^ι log d = o(n) and ι < ν, where ν is the tail-decay exponent of P. Then with probability at least 1 − exp(−C ⋅ n/d^ι), all non-zero eigenvalues of K^{[≤ι]} are ≥ C ⋅ d^{−ι}.

Some wrong-but-useful intuition:

  • the non-zero eigenvalues of K^{[≤ι]} equal those of Θ^{[≤ι]}
  • suppose the monomials ∏_{i=1}^d (x[i])^{ri} were orthogonal (wrong); then

    E[Θ^{[≤ι]}] = diag(C(0), ⋯, C(ι′)⋅d^{−ι′}, ⋯, C(ι)⋅d^{−ι}),  where the d^{−ι} block contains C(d+ι−1, d−1) such entries

  • even so, standard concentration fails (at least applied naively):

    sup_{u ∈ B2 in R^{C(d+ι, ι)}} u⊺(Θ^{[≤ι]} − E[Θ^{[≤ι]}]) u ≤ (1/√n) Var(⋯)

33 / 37

slide-59
SLIDE 59

References

RESTRICTED LOWER ISOMETRY OF KERNEL

Then, how to make it right? Two Ideas.

34 / 37

slide-60
SLIDE 60

References

RESTRICTED LOWER ISOMETRY OF KERNEL

Then, how to make it right? Two ideas.

Idea 1: Gram–Schmidt process on polynomials (weakly dependent):

{1, t, t², ⋯} → {1, q1(t), q2(t), ⋯},  an orthogonal polynomial basis in L²_P.

34 / 37

slide-61
SLIDE 61

References

RESTRICTED LOWER ISOMETRY OF KERNEL

Then, how to make it right? Two ideas.

Idea 1: Gram–Schmidt process on polynomials (weakly dependent):

{1, t, t², ⋯} → {1, q1(t), q2(t), ⋯},  an orthogonal polynomial basis in L²_P.

Φ_{i,(r1⋯rd)} → Ψ_{i,(r1⋯rd)} = (c_{r1⋯rd} α_{r1+⋯+rd})^{1/2} ∏_{j∈[d]} q_{rj}(xi[j]) / d^{(r1+⋯+rd)/2}

Φ = ΨΛ,  with Λ ∈ R^{C(ι+d, ι) × C(ι+d, ι)} upper-triangular.
34 / 37

slide-62
SLIDE 62

References

RESTRICTED LOWER ISOMETRY OF KERNEL

Then, how to make it right? Two ideas.

Idea 1: Gram–Schmidt process on polynomials (weakly dependent):

{1, t, t², ⋯} → {1, q1(t), q2(t), ⋯},  an orthogonal polynomial basis in L²_P.

Φ_{i,(r1⋯rd)} → Ψ_{i,(r1⋯rd)} = (c_{r1⋯rd} α_{r1+⋯+rd})^{1/2} ∏_{j∈[d]} q_{rj}(xi[j]) / d^{(r1+⋯+rd)/2}

Φ = ΨΛ,  with Λ ∈ R^{C(ι+d, ι) × C(ι+d, ι)} upper-triangular.

Claim: weak dependence ⇒ ∥Λ∥op, ∥Λ^{−1}∥op ≤ C(ι), so

u⊺ Θ^{[≤ι]} u = (1/n)∥Φu∥² = (1/n)∥ΨΛu∥² ≥ λ_min((1/n)Ψ⊺Ψ) ∥Λu∥² ≍ λ_min((1/n)Ψ⊺Ψ) ∥u∥²

34 / 37

slide-63
SLIDE 63

References

RESTRICTED LOWER ISOMETRY OF KERNEL

Then, how to make it right? Two Ideas. Idea 2: small-ball approach rather than standard concentration

35 / 37

slide-64
SLIDE 64

References

RESTRICTED LOWER ISOMETRY OF KERNEL

Then, how to make it right? Two ideas.

Idea 2: small-ball approach rather than standard concentration.

Lower bounding λ_min((1/n) Ψ⊺Ψ) is equivalent to: for all u with ∥u∥ = 1, lower bound (1/n)∥Ψu∥².

Utilize non-negativity:

(1/n)∥Ψu∥² = (1/n) ∑_{i=1}^n ⟨Ψ(xi), u⟩² ≥ c1 E[⟨Ψ(X), u⟩²] ⋅ (1/n) ∑_{i=1}^n I{⟨Ψ(xi), u⟩² ≥ c1 E[⟨Ψ(X), u⟩²]}

Small-ball property: there exist constants c1, c2 such that

P(⟨Ψ(xi), u⟩² ≥ c1 E[⟨Ψ(X), u⟩²]) ≥ c2

which implies, w.p. at least 1 − exp(−c ⋅ n),

(1/n) ∑_{i=1}^n I{⟨Ψ(xi), u⟩² ≥ c1 E[⟨Ψ(X), u⟩²]} ≥ c2/2

Non-trivial: verify the small-ball property for polynomials (weakly dependent) via Paley–Zygmund.

35 / 37

slide-65
SLIDE 65

References

RESTRICTED LOWER ISOMETRY OF KERNEL

Then, how to make it right? Two ideas.

Lemma (L., Rakhlin & Zhai, ’19). All non-zero eigenvalues of K^{[≤ι]} are ≥ C ⋅ d^{−ι}.

Mendelson (2014); Liang et al. (2019); Ghorbani et al. (2019)

35 / 37

slide-66
SLIDE 66

References

INTUITION: WEAKLY DEPENDENT

For any three distinct polynomial features indexed by (r1⋯rd), (r′1⋯r′d), (r′′1⋯r′′d),

∏_{j∈[d]} q_{rj}(x[j]),  ∏_{j∈[d]} q_{r′j}(x[j]),  ∏_{j∈[d]} q_{r′′j}(x[j]),

the third moment E[q_{r1⋯rd} q_{r′1⋯r′d} q_{r′′1⋯r′′d}] ≠ 0 only if ∀j ∈ [d], rj + r′j ≥ r′′j.

Among such triplets, at most a 3^{2ι}/d^ι = O(1/d^ι) fraction has a non-zero third moment.

36 / 37

slide-67
SLIDE 67

References

BACK TO MULTIPLE DESCENT PROOF: SKETCH

Decompose the risk into bias and variance. Surprisingly, both terms can be bounded by E_{x∼P⊗d} ∥k(X, X)^{−1} k(X, x)∥².

37 / 37

slide-68
SLIDE 68

References

BACK TO MULTIPLE DESCENT PROOF: SKETCH

Decompose the risk into bias and variance. Surprisingly, both terms can be bounded by E_{x∼P⊗d} ∥k(X, X)^{−1} k(X, x)∥².

Sketch:

Ex∥k(X, X)^{−1} k(X, x)∥²
  ≲ ∑_{i=0}^{ι} Ex∥K^{−1} (1/n)(Xx)^i/d^i∥² + Ex∥K^{−1} (1/n) ∑_{i=ι+1}^{∞} (Xx)^i/d^i∥²
  ≲ (1/n²) ∑_{i=0}^{ι} Ex∥K^{−1}(Xx)^i/d^i∥² + ∥(nK)^{−1}∥²_op ⋅ Ex∥∑_{i=ι+1}^{∞} (Xx)^i/d^i∥²
  ≲ (1/n²) ∑_{i=0}^{ι} Ex[∥(K^{[≤i]})^+∥²_op ⋅ ∥(Xx)^i/d^i∥²] + n/d^{ι+1}
  ≲ (1/n²) ∑_{i=0}^{ι} Ex[d^{2i} ⋅ ∥(Xx)^i/d^i∥²] + n/d^{ι+1}    (using restricted lower isometry)
  ≲ d^ι/n + n/d^{ι+1}.

37 / 37