Boosting, Min-Norm Interpolated Classifiers, and Overparametrization: a precise asymptotic theory
Tengyuan Liang, joint work with Pragya Sur (Harvard)


1. Boosting, Min-Norm Interpolated Classifiers, and Overparametrization: a precise asymptotic theory. Tengyuan Liang, joint work with Pragya Sur (Harvard).

2. OUTLINE
● Motivation: min-norm interpolants under the overparametrized regime
● Classification: boosting on separable data
● precise asymptotics of the margin
● fixed point of a non-linear system of equations
● statistical and algorithmic implications
● Proof sketch: Gaussian comparison and convex geometry tools

3.–4. OVERPARAMETRIZED REGIME OF STAT / ML
Model class complex enough to interpolate the training data.
Zhang, Bengio, Hardt, Recht, and Vinyals (2016); Belkin et al. (2018); Liang and Rakhlin (2018); Bartlett et al. (2019); Hastie et al. (2019)
[Figure: kernel regression on MNIST, digit pairs [i, j] with i ∈ {2, 3, 4} and j ∈ {5, ..., 9}; log(error) versus the ridge parameter λ ∈ [0, 1.2]. λ = 0 gives the interpolants on the training data. MNIST data from LeCun et al. (2010).]

5.–7. OVERPARAMETRIZED REGIME OF STAT / ML
In fact, many models behave the same on the training data; practical methods and algorithms favor certain functions.
Principle: among the models that interpolate, algorithms favor a certain form of minimalism.
● overparametrized linear models and matrix factorization
● kernel regression
● support vector machines, Perceptron
● boosting, AdaBoost
● two-layer ReLU networks, deep neural networks (?)
Minimalism is typically measured in the form of a certain norm, which motivates the study of min-norm interpolants.

8. MIN-NORM INTERPOLANTS
Regression:
$$ \hat f = \arg\min_{f} \|f\|_{\text{norm}}, \quad \text{s.t. } y_i = f(x_i) \ \forall i \in [n]. $$
Classification:
$$ \hat f = \arg\min_{f} \|f\|_{\text{norm}}, \quad \text{s.t. } y_i \cdot f(x_i) \ge 1 \ \forall i \in [n]. $$
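To make the regression case concrete, here is a minimal sketch of the simplest special instance: a linear model f(x) = x⊤θ with the Euclidean norm, where the minimum-norm interpolant has the closed form given by the pseudoinverse. The dimensions, data, and variable names below are illustrative assumptions, not from the talk.

```python
# Min-l2-norm interpolation in an overparametrized linear model (p > n):
# X theta = y has infinitely many solutions and the pseudoinverse picks
# the one with smallest Euclidean norm.
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 200                       # more parameters than samples
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

theta_hat = np.linalg.pinv(X) @ y    # argmin ||theta||_2  s.t.  X theta = y

assert np.allclose(X @ theta_hat, y)  # exact interpolation of the training data
```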

9. Precise High-Dimensional Asymptotic Theory for Boosting and Min-L1-Norm Interpolated Classifiers
tyliang.github.io/Tengyuan.Liang/pdf/Liang-Sur-20.pdf
Classification:
$$ \hat f = \arg\min_{f} \|f\|_{\text{norm}}, \quad \text{s.t. } y_i \cdot f(x_i) \ge 1 \ \forall i \in [n]. $$

10. PROBLEM FORMULATION
Given n i.i.d. data pairs $\{(x_i, y_i)\}_{1 \le i \le n}$ with $(x, y) \sim P$: $y_i \in \{\pm 1\}$ binary labels, $x_i \in \mathbb{R}^p$ feature vector (weak learners).
Consider the case where the data are linearly separable:
$$ P\big( \exists\, \theta \in \mathbb{R}^p, \ y_i x_i^\top \theta > 0 \ \text{for } 1 \le i \le n \big) \to 1. $$
It is natural to consider the overparametrized regime $p/n \to \psi \in (0, \infty)$.
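As a concrete check of the separability event, the sketch below generates data in the overparametrized regime and certifies linear separability by testing feasibility of the constraints $y_i x_i^\top \theta \ge 1$ with a linear program (equivalent to strict separability up to rescaling θ). The pure-noise design and all names are illustrative assumptions, not the paper's setup.

```python
# Certify linear separability of (X, y) via an LP feasibility problem.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)
n, psi = 100, 2.0
p = int(psi * n)                         # overparametrized regime p/n = psi
X = rng.standard_normal((n, p))
y = rng.choice([-1.0, 1.0], size=n)

# Find theta with y_i x_i^T theta >= 1 for all i, written as -(y_i x_i).theta <= -1.
A_ub = -(y[:, None] * X)
b_ub = -np.ones(n)
res = linprog(c=np.zeros(p), A_ub=A_ub, b_ub=b_ub,
              bounds=[(None, None)] * p, method="highs")
print("linearly separable:", res.status == 0)   # status 0: a feasible theta exists
```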

11. BOOSTING / ADABOOST
Initialize $\theta_0 = 0 \in \mathbb{R}^p$ and set data weights $\eta_0 = (1/n, \dots, 1/n) \in \Delta_n$; write $Z := y \circ X \in \mathbb{R}^{n \times p}$ (each row of X multiplied by its label). At time $t \ge 0$:
1. Learner / feature selection: $j^\star_t := \arg\max_{j \in [p]} |\eta_t^\top Z e_j|$, and set $\gamma_t = \eta_t^\top Z e_{j^\star_t}$;
2. Adaptive stepsize: $\alpha_t = \frac{1}{2} \log\big( \frac{1 + \gamma_t}{1 - \gamma_t} \big)$;
3. Coordinate update: $\theta_{t+1} = \theta_t + \alpha_t \cdot e_{j^\star_t}$;
4. Weight update: $\eta_{t+1}[i] \propto \eta_t[i] \exp\big( -\alpha_t\, y_i x_i^\top e_{j^\star_t} \big)$, normalized so that $\eta_{t+1} \in \Delta_n$.
Terminate after T steps and output the vector $\theta_T$.
Freund and Schapire (1995, 1996)
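A minimal runnable sketch of these four steps, treating the p coordinates of x as weak learners; this is illustrative code, not the authors' implementation, and it assumes the weak-learner outputs lie in [-1, 1] so that |γ_t| < 1 and the adaptive stepsize is well defined.

```python
# Coordinate-wise AdaBoost on a linear model, following steps 1-4 above.
import numpy as np

def adaboost(X, y, T):
    n, p = X.shape
    Z = y[:, None] * X                        # Z := y o X, shape (n, p)
    theta = np.zeros(p)                       # theta_0 = 0
    eta = np.full(n, 1.0 / n)                 # eta_0 = (1/n, ..., 1/n) in Delta_n
    for _ in range(T):
        corr = Z.T @ eta                      # eta_t^T Z e_j for every coordinate j
        j_star = np.argmax(np.abs(corr))      # 1. learner / feature selection
        gamma = corr[j_star]
        alpha = 0.5 * np.log((1 + gamma) / (1 - gamma))   # 2. adaptive stepsize
        theta[j_star] += alpha                # 3. coordinate update
        eta = eta * np.exp(-alpha * Z[:, j_star])          # 4. reweight the samples ...
        eta /= eta.sum()                      #    ... and renormalize onto Delta_n
    return theta
```

With ±1-valued weak learners (e.g. decision stumps encoded as the columns of X), the correlations γ_t indeed stay in (-1, 1).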

12. BOOSTING / ADABOOST
“... mystery of AdaBoost as the most important unsolved problem in Machine Learning.” Wald Lecture, Breiman (2004)

13. KEY: EMPIRICAL MARGIN
The empirical margin is key to both generalization and optimization.
Generalization: for all $f(x) = x^\top \theta / \|\theta\|_1$ and $\kappa > 0$, with probability $1 - \delta$,
$$ \underbrace{P\big( y f(x) < 0 \big)}_{\text{generalization error}} \;\le\; \underbrace{\frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{ y_i f(x_i) < \kappa \}}_{\text{empirical margin}} \;+\; \sqrt{\frac{\log n \,\log p}{\kappa^2 n}} \;+\; \sqrt{\frac{\log(1/\delta)}{n}}. $$
Schapire, Freund, Bartlett, and Lee (1998)
Choose the classifier f that maximizes the minimal margin:
$$ \kappa \;=\; \max_{\theta \in \mathbb{R}^p} \min_{1 \le i \le n} \; y_i x_i^\top \theta / \|\theta\|_1, \qquad \text{generalization error} \;<\; \frac{1}{\sqrt{n}\,\kappa} \cdot (\text{log factors, constants}). $$
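The max-min margin above can be computed exactly as a linear program by splitting θ = u − v with u, v ≥ 0; the sketch below is an illustrative implementation of this standard LP reformulation, not code from the paper.

```python
# Maximal l1-margin: max_{||theta||_1 <= 1} min_i y_i x_i^T theta, as an LP.
import numpy as np
from scipy.optimize import linprog

def max_l1_margin(X, y):
    n, p = X.shape
    Z = y[:, None] * X
    # variables: [u (p), v (p), kappa (1)]; maximize kappa = minimize -kappa
    c = np.concatenate([np.zeros(2 * p), [-1.0]])
    # margin constraints:  kappa - y_i x_i^T (u - v) <= 0
    A_margin = np.hstack([-Z, Z, np.ones((n, 1))])
    # l1-ball constraint:  sum(u) + sum(v) <= 1
    A_ball = np.concatenate([np.ones(2 * p), [0.0]])[None, :]
    A_ub = np.vstack([A_margin, A_ball])
    b_ub = np.concatenate([np.zeros(n), [1.0]])
    bounds = [(0, None)] * (2 * p) + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    kappa, theta = res.x[-1], res.x[:p] - res.x[p:2 * p]
    return kappa, theta
```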

14. KEY: EMPIRICAL MARGIN
(The same margin bound as on the previous slide.)
“An important open problem is to derive more careful and precise bounds which can be used for this purpose. Besides paying closer attention to constant factors, such an analysis might also involve the measurement of more sophisticated statistics.” Schapire, Freund, Bartlett, and Lee (1998)

15. KEY: EMPIRICAL MARGIN
Optimization: for AdaBoost with p weak learners and $Z := y \circ X \in \mathbb{R}^{n \times p}$,
$$ \sum_{i=1}^{n} \mathbf{1}\{ -y_i x_i^\top \theta_T > 0 \} \;\le\; ne \cdot \exp\Big( -2 \sum_{t=1}^{T} \gamma_t^2 \,(1 + o(\gamma_t)) \Big). $$
By the minimax theorem,
$$ |\gamma_t| = \|Z^\top \eta_t\|_\infty \;\ge\; \min_{\eta \in \Delta_n} \|Z^\top \eta\|_\infty \;=\; \min_{\eta \in \Delta_n} \max_{\|\theta\|_1 \le 1} \eta^\top Z \theta \;=\; \max_{\|\theta\|_1 \le 1} \min_{1 \le i \le n} e_i^\top Z \theta \;\ge\; \kappa. $$
Freund and Schapire (1995); Zhang and Yu (2005)
Stopping time (zero training error): optimization steps $< \frac{1}{\kappa^2} \cdot (\text{log factors, constants})$.
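Filling in the stopping-time step that the slide leaves implicit: combining the displayed bound with $|\gamma_t| \ge \kappa$ for every $t$ gives
$$ \sum_{i=1}^{n} \mathbf{1}\{ -y_i x_i^\top \theta_T > 0 \} \;\le\; ne \cdot \exp\big( -2\, T \kappa^2 \,(1 + o(1)) \big), $$
and since the left-hand side is a nonnegative integer, it equals zero (zero training error) as soon as $T > \frac{\log(ne)}{2 \kappa^2 (1 + o(1))}$, i.e. within on the order of $\kappa^{-2} \log n$ boosting steps.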

16. L1 GEOMETRY, MARGIN, AND INTERPOLATION
We consider the min-L1-norm interpolated classifier on separable data:
$$ \hat\theta_{\ell_1} = \arg\min_{\theta} \|\theta\|_1, \quad \text{s.t. } y_i x_i^\top \theta \ge 1, \ \forall i \in [n]. $$
Algorithmic: on separable data, the boosting iterate $\theta^{T,s}_{\text{boost}}$ with infinitesimal stepsize s agrees with the min-L1-norm interpolation asymptotically:
$$ \lim_{s \to 0} \lim_{T \to \infty} \theta^{T,s}_{\text{boost}} / \|\theta^{T,s}_{\text{boost}}\|_1 = \hat\theta_{\ell_1}. $$
Freund and Schapire (1995); Rosset et al. (2004); Zhang and Yu (2005)
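The min-L1-norm interpolated classifier is itself a linear program, using the same variable-splitting trick θ = u − v with u, v ≥ 0; the sketch below is illustrative (it assumes scipy is available and is not the authors' code).

```python
# Min-l1-norm interpolation on separable data:  min ||theta||_1  s.t.  y_i x_i^T theta >= 1.
import numpy as np
from scipy.optimize import linprog

def min_l1_interpolant(X, y):
    n, p = X.shape
    Z = y[:, None] * X
    c = np.ones(2 * p)                 # sum(u) + sum(v) = ||theta||_1 at the optimum
    A_ub = np.hstack([-Z, Z])          # y_i x_i^T theta >= 1  <=>  -Z u + Z v <= -1
    b_ub = -np.ones(n)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(0, None)] * (2 * p), method="highs")
    theta_hat = res.x[:p] - res.x[p:]
    return theta_hat                   # hat{theta}_{l1}; 1 / ||theta_hat||_1 is the max l1-margin
```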

17. L1 GEOMETRY, MARGIN, AND INTERPOLATION
Min-L1-norm interpolation is equivalent to max-L1-margin:
$$ \max_{\|\theta\|_1 \le 1} \min_{1 \le i \le n} y_i x_i^\top \theta \;=:\; \kappa_{\ell_1}(X, y). $$
Prior understanding:
$$ \text{generalization error} < \frac{1}{\sqrt{n}\,\kappa} \cdot (\text{log factors, constants}), \qquad \text{optimization steps} < \frac{1}{\kappa^2} \cdot (\text{log factors, constants}). $$
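A short scaling argument behind this equivalence (standard, not spelled out on the slide): any $\theta$ with $\|\theta\|_1 \le 1$ and margin $\min_i y_i x_i^\top \theta = \kappa_{\ell_1} > 0$ yields the feasible point $\theta / \kappa_{\ell_1}$ for the min-norm problem, so $\|\hat\theta_{\ell_1}\|_1 \le 1/\kappa_{\ell_1}$; conversely, $\hat\theta_{\ell_1} / \|\hat\theta_{\ell_1}\|_1$ lies in the unit $\ell_1$ ball and has margin at least $1/\|\hat\theta_{\ell_1}\|_1$, so $\kappa_{\ell_1} \ge 1/\|\hat\theta_{\ell_1}\|_1$. Hence
$$ \kappa_{\ell_1}(X, y) \;=\; \frac{1}{\|\hat\theta_{\ell_1}\|_1}. $$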
