SLIDE 1

Less is More: Computational Regularization by Subsampling

Lorenzo Rosasco
University of Genova - Istituto Italiano di Tecnologia
Massachusetts Institute of Technology
lcsl.mit.edu

joint work with Alessandro Rudi, Raffaello Camoriano

Paris

SLIDES 2-4

A Starting Point

Classically: statistics and optimization are distinct steps in algorithm design,
i.e. empirical process theory + optimization.

Large scale: consider the interplay between statistics and optimization!
(Bottou, Bousquet '08)

Computational regularization: computational "tricks" = regularization

SLIDES 5-7

Supervised Learning

Problem: estimate f* given S_n = {(x_1, y_1), ..., (x_n, y_n)}

[Figure: sample points (x_1, y_1), ..., (x_5, y_5) scattered around the target function f*]

The Setting
y_i = f*(x_i) + ε_i,  i ∈ {1, ..., n}

◮ ε_i ∈ R, x_i ∈ R^d random (bounded, but with unknown distribution)
◮ f* unknown

SLIDE 8

Outline

◮ Nonparametric Learning
◮ Data Dependent Subsampling
◮ Data Independent Subsampling

SLIDES 9-14

Non-linear/non-parametric learning

f(x) = Σ_{i=1}^M c_i q(x, w_i)

◮ q non-linear function
◮ w_i ∈ R^d centers
◮ c_i ∈ R coefficients
◮ M = M_n could/should grow with n

Question: how to choose w_i, c_i and M given S_n?

SLIDES 15-16

Learning with Positive Definite Kernels

There is an elegant answer if:

◮ q is symmetric
◮ all the matrices Q_{ij} = q(x_i, x_j) are positive semi-definite¹

Representer Theorem (Kimeldorf, Wahba '70; Schölkopf et al. '01)

◮ M = n
◮ w_i = x_i
◮ c_i by convex optimization!

¹ They have non-negative eigenvalues.

SLIDES 17-18

Kernel Ridge Regression (KRR), a.k.a. Tikhonov Regularization

f̂_λ = argmin_{f ∈ H} (1/n) Σ_{i=1}^n (y_i − f(x_i))² + λ‖f‖²

where²

H = {f | f(x) = Σ_{i=1}^M c_i q(x, w_i), c_i ∈ R, w_i ∈ R^d (any center!), M ∈ N (any length!)}

Solution

f̂_λ(x) = Σ_{i=1}^n c_i q(x, x_i)  with  c = (Q̂ + λnI)^{−1} ŷ

² The norm is induced by the inner product ⟨f, f′⟩ = Σ_{i,j} c_i c′_j q(x_i, x_j).
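A minimal numerical sketch of this solution, assuming a Gaussian kernel for q (the kernel choice and all names here are illustrative):

```python
import numpy as np

def gaussian_kernel(X1, X2, gamma=1.0):
    # Q_ij = exp(-gamma * ||x_i - x_j||^2): symmetric and positive semi-definite
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def krr_fit(X, y, lam):
    # Solve c = (Q + lam * n * I)^{-1} y
    n = X.shape[0]
    Q = gaussian_kernel(X, X)
    return np.linalg.solve(Q + lam * n * np.eye(n), y)

def krr_predict(X_train, c, X_test):
    # f_lam(x) = sum_i c_i q(x, x_i)
    return gaussian_kernel(X_test, X_train) @ c
```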

SLIDES 19-24

KRR: Statistics

Well-understood statistical properties.

Classical Theorem
If f* ∈ H, then the choice λ* = 1/√n gives

E(f̂_{λ*}(x) − f*(x))² ≲ 1/√n

Remarks

1. Optimal nonparametric bound
2. More refined results for smooth kernels:

   λ* = n^{−1/(2s+1)},   E(f̂_{λ*}(x) − f*(x))² ≲ n^{−2s/(2s+1)}

3. Adaptive tuning, e.g. via cross validation
4. Proofs: inverse problems results + random matrices
   (Smale and Zhou + Caponnetto, De Vito, R.)
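One way to read these rates is as a schematic bias-variance trade-off (a standard heuristic sketch, not the actual proof in the references above):

```latex
E\,(\hat f_\lambda(x) - f^*(x))^2
\;\lesssim\;
\underbrace{\lambda^{2s}}_{\text{bias}^2}
\;+\;
\underbrace{\tfrac{1}{n\lambda}}_{\text{variance}},
\qquad
\lambda^{2s} = \tfrac{1}{n\lambda}
\;\Longrightarrow\;
\lambda^* = n^{-\frac{1}{2s+1}},
\quad
E\,(\hat f_{\lambda^*}(x) - f^*(x))^2 \lesssim n^{-\frac{2s}{2s+1}}.
```

Setting s = 1/2 recovers the classical theorem: λ* = n^{−1/2} with rate n^{−1/2}.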

SLIDES 25-26

KRR: Optimization

f̂_λ(x) = Σ_{i=1}^n c_i q(x, x_i)  with  c = (Q̂ + λnI)^{−1} ŷ

Linear system: (Q̂ + λnI) c = ŷ, with Q̂ an n × n matrix.

Complexity
◮ Space O(n²)
◮ Time O(n³)

BIG DATA? Running out of time and space ... Can this be fixed?

SLIDES 27-30

Beyond Tikhonov: Spectral Filtering

(Q̂ + λI)^{−1} is an approximation of Q̂† controlled by λ.

Can we approximate Q̂† while saving computations?

Yes!

Spectral filtering (Engl '96, inverse problems; Rosasco et al. '05, ML):

g_λ(Q̂) ∼ Q̂†

The filter function g_λ defines the form of the approximation.

SLIDES 31-33

Spectral Filtering: Examples

◮ Tikhonov: ridge regression
◮ Truncated SVD: principal component regression
◮ Landweber iteration: GD / L2-boosting
◮ ν-method: accelerated GD / Chebyshev method
◮ ...

Landweber iteration (truncated power series) ...

c_t = g_t(Q̂) ŷ = γ Σ_{r=0}^{t−1} (I − γQ̂)^r ŷ

... it's GD for ERM! For r = 1, ..., t:

c_r = c_{r−1} − γ(Q̂ c_{r−1} − ŷ),   c_0 = 0
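A sketch of this recursion in code (illustrative; Q and y are the kernel matrix and label vector from the KRR sketch above):

```python
import numpy as np

def landweber(Q, y, t, gamma=None):
    # Gradient descent on ERM: c_r = c_{r-1} - gamma * (Q c_{r-1} - y), c_0 = 0,
    # equal to the truncated power series gamma * sum_{r=0}^{t-1} (I - gamma Q)^r y.
    if gamma is None:
        gamma = 1.0 / np.linalg.eigvalsh(Q)[-1]  # step size below 2 / ||Q||
    c = np.zeros(Q.shape[0])
    for _ in range(t):
        c = c - gamma * (Q @ c - y)
    return c
```

The iteration count t plays the role of 1/λ: stopping early regularizes.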

SLIDES 34-35

Statistics and Computations with Spectral Filtering

The different filters achieve essentially the same optimal statistical error!
The difference is in computations:

Filter          Time         Space
Tikhonov        n³           n²
GD              n² λ^{−1}    n²
Accelerated GD  n² λ^{−1/2}  n²
Truncated SVD   n² λ^{−γ}    n²

Note: λ^{−1} = t for iterative methods.

SLIDE 36

Semiconvergence

[Figure: empirical error decreases monotonically with the number of iterations, while the expected error first decreases and then increases]

◮ Iterations control statistics and time complexity
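A small synthetic simulation of this effect (everything here, data and kernel included, is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
k = lambda A, B: np.exp(-10.0 * ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))
f_star = lambda X: np.sin(3 * X[:, 0])

X = rng.uniform(-1, 1, (200, 1)); y = f_star(X) + 0.3 * rng.standard_normal(200)
X_te = rng.uniform(-1, 1, (1000, 1)); y_te = f_star(X_te)

Q = k(X, X)
step = 1.0 / np.linalg.eigvalsh(Q)[-1]
c = np.zeros(len(X))
for t in range(1, 2001):
    c -= step * (Q @ c - y)                 # one Landweber/GD step
    if t % 200 == 0:
        train = np.mean((Q @ c - y) ** 2)   # empirical error keeps decreasing
        test = np.mean((k(X_te, X) @ c - y_te) ** 2)  # typically U-shaped
        print(t, round(float(train), 4), round(float(test), 4))
```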

SLIDES 37-39

Computational Regularization

BIG DATA? Running out of ✘time and space✘ ...

Is there a principle to control statistics, time and space complexity?

SLIDE 40

Outline

◮ Nonparametric Learning
◮ Data Dependent Subsampling
◮ Data Independent Subsampling

SLIDES 41-45

Subsampling

1. Pick w_i at random ... from the training set (Smola, Schölkopf '00):

   w̃_1, ..., w̃_M ⊂ {x_1, ..., x_n},   M ≪ n

2. Perform KRR on (see the sketch below)

   H_M = {f | f(x) = Σ_{i=1}^M c_i q(x, w̃_i), c_i ∈ R, ✘w_i ∈ R^d✘, ✘M ∈ N✘}

Linear system: now involves the n × M matrix Q̂_M instead of the n × n matrix Q̂.

Complexity
◮ Space: ✘O(n²)✘ → O(nM)
◮ Time: ✘O(n³)✘ → O(nM²)

What about statistics? What's the price for efficient computations?
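A sketch of step 2, using the standard subset-of-regressors form of Nyström KRR (a common variant; I am assuming the plain KRR objective restricted to H_M, and all names are illustrative):

```python
import numpy as np

def nystrom_krr_fit(X, y, M, lam, kernel, seed=0):
    # 1. pick M centers uniformly at random from the training set
    rng = np.random.default_rng(seed)
    Xm = X[rng.choice(len(X), size=M, replace=False)]
    # 2. KRR restricted to H_M: minimizing over c in R^M
    #    (1/n) ||K_nm c - y||^2 + lam * c^T K_mm c
    # leads to the M x M system (K_nm^T K_nm + lam * n * K_mm) c = K_nm^T y
    n = len(X)
    K_nm = kernel(X, Xm)   # n x M
    K_mm = kernel(Xm, Xm)  # M x M
    c = np.linalg.solve(K_nm.T @ K_nm + lam * n * K_mm, K_nm.T @ y)
    return Xm, c           # f(x) = sum_i c_i q(x, w~_i)
```

Forming K_nm^T K_nm dominates: O(nM²) time and O(nM) space, matching the complexity above.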

SLIDES 46-48

Putting our Result in Context

◮ *Many* different subsampling schemes
  (Smola, Schölkopf '00; Williams, Seeger '01; ... 20+)

◮ Theoretical guarantees mainly on matrix approximation
  (Mahoney and Drineas '09; Cortes et al. '10; Kumar et al. '12; ... 10+)

  ‖Q − Q_M‖ ≲ 1/√M

◮ Statistical guarantees suboptimal or in restricted settings
  (Cortes et al. '10; Jin et al. '11; Bach '13; Alaoui, Mahoney '14)
slide-49
SLIDE 49

Main Result

(Rudi, Camoriano, Rosasco, ’15)

Theorem

If f ∗ ∈ H, then λ∗ = 1 √n , M∗ = 1 λ∗ , E ( fλ∗,M∗(x) − f ∗(x))2 1 √n

slide-50
SLIDE 50

Main Result

(Rudi, Camoriano, Rosasco, ’15)

Theorem

If f ∗ ∈ H, then λ∗ = 1 √n , M∗ = 1 λ∗ , E ( fλ∗,M∗(x) − f ∗(x))2 1 √n Remarks

slide-51
SLIDE 51

Main Result

(Rudi, Camoriano, Rosasco, ’15)

Theorem

If f ∗ ∈ H, then λ∗ = 1 √n , M∗ = 1 λ∗ , E ( fλ∗,M∗(x) − f ∗(x))2 1 √n Remarks

  • 1. Subsampling achives optimal bound. . .
slide-52
SLIDE 52

Main Result

(Rudi, Camoriano, Rosasco, ’15)

Theorem

If f ∗ ∈ H, then λ∗ = 1 √n , M∗ = 1 λ∗ , E ( fλ∗,M∗(x) − f ∗(x))2 1 √n Remarks

  • 1. Subsampling achives optimal bound. . .
  • 2. . . . with M∗ ∼ √n !!
slide-53
SLIDE 53

Main Result

(Rudi, Camoriano, Rosasco, ’15)

Theorem

If f ∗ ∈ H, then λ∗ = 1 √n , M∗ = 1 λ∗ , E ( fλ∗,M∗(x) − f ∗(x))2 1 √n Remarks

  • 1. Subsampling achives optimal bound. . .
  • 2. . . . with M∗ ∼ √n !!
  • 3. More generally,

λ∗ = n−

1 2s+1 ,

M∗ = 1 λ∗ , Ex ( fλ∗,M∗(x) − f ∗(x))2 n−

2s 2s+1

slide-54
SLIDE 54

Main Result

(Rudi, Camoriano, Rosasco, ’15)

Theorem

If f ∗ ∈ H, then λ∗ = 1 √n , M∗ = 1 λ∗ , E ( fλ∗,M∗(x) − f ∗(x))2 1 √n Remarks

  • 1. Subsampling achives optimal bound. . .
  • 2. . . . with M∗ ∼ √n !!
  • 3. More generally,

λ∗ = n−

1 2s+1 ,

M∗ = 1 λ∗ , Ex ( fλ∗,M∗(x) − f ∗(x))2 n−

2s 2s+1

Note: An interesting insight is obtained rewriting the result. . .

SLIDES 55-61

Computational Regularization by Subsampling
(Rudi, Camoriano, Rosasco '15)

A simple idea: "swap" the roles of λ and M ...

Theorem
If f* ∈ H with a smooth kernel, then the choices M* = n^{1/(2s+1)} and λ* = 1/M* give

E_x(f̂_{λ*,M*}(x) − f*(x))² ≲ n^{−2s/(2s+1)}

◮ λ and M play the same role ...
  ... new interpretation: subsampling regularizes!

◮ A new, natural incremental algorithm (sketched in code below):

Algorithm
1. Pick a center + compute solution
2. Pick another center + rank-one update
3. Pick another center ...
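An illustrative sketch of the incremental scheme. Assumption: for clarity this re-solves the M × M system at each step rather than performing the rank-one updates the algorithm actually uses; the kernel and all names are hypothetical:

```python
import numpy as np

def incremental_nystrom_path(X, y, X_val, y_val, lam, kernel, M_max, seed=0):
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))[:M_max]
    errors = []
    for M in range(1, M_max + 1):            # add one random center at a time
        Xm = X[order[:M]]
        K_nm, K_mm = kernel(X, Xm), kernel(Xm, Xm)
        c = np.linalg.solve(K_nm.T @ K_nm + lam * len(X) * K_mm, K_nm.T @ y)
        pred = kernel(X_val, Xm) @ c
        errors.append(np.mean((pred - y_val) ** 2))  # monitor validation error
    return errors  # stop where the error flattens: M itself regularizes
```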
slide-62
SLIDE 62

N¨ ystrom CoRe Illustrated

n, λ are fixed

50 100 150 200 250 300

Validation Error

0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.1 0.11

Computation controls stability! Time/space requirement tailored to generalization

SLIDE 63

Experiments

Comparable or better performance w.r.t. the state of the art:

Dataset     n_tr     d    Incremental CoRe    Standard KRLS  Standard Nyström  Random Features  Fastfood RF
Ins. Co.    5822     85   0.23180 ± 4 × 10⁻⁵  0.231          0.232             0.266            0.264
CPU         6554     21   2.8466 ± 0.0497     7.271          6.758             7.103            7.366
CT slices   42800    384  7.1106 ± 0.0772     NA             60.683            49.491           43.858
Year Pred.  463715   90   0.10470 ± 5 × 10⁻⁵  NA             0.113             0.123            0.115
Forest      522910   54   0.9638 ± 0.0186     NA             0.837             0.840            0.840

◮ Random Features (Rahimi, Recht '07)
◮ Fastfood (Le et al. '13)

SLIDES 64-66

Summary So Far

◮ Optimal learning with data dependent subsampling
◮ Computational regularization: subsampling regularizes!

A few more questions:

◮ Can one do better than uniform sampling?
  Yes: leverage score sampling ...
◮ What about data independent sampling?
slide-67
SLIDE 67

Outline

Nonparametric Learning Data Dependent Subsampling Data Independent Subsampling

SLIDES 68-71

Random Features

f(x) = Σ_{i=1}^M c_i q(x, w_i)

◮ q a general non-linear function
◮ pick w̃_i at random according to a distribution μ:  w̃_1, ..., w̃_M ∼ μ
◮ perform KRR on

  H_M = {f | f(x) = Σ_{i=1}^M c_i q(x, w̃_i), c_i ∈ R}
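With an explicit feature map, KRR on H_M reduces to linear ridge regression on the M features; a minimal sketch (names illustrative, plain ridge objective assumed):

```python
import numpy as np

def rf_ridge_fit(Z, y, lam):
    # Z[i, j] = q(x_i, w~_j) is the n x M feature matrix; solve the M x M system
    # (Z^T Z + lam * n * I) c = Z^T y, so that f(x) = sum_j c_j q(x, w~_j)
    n, M = Z.shape
    return np.linalg.solve(Z.T @ Z + lam * n * np.eye(M), Z.T @ y)
```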

SLIDES 72-75

Random Fourier Features
(Rahimi, Recht '07)

Consider q(x, w) = e^{i w^T x}, with w ∼ μ = N(0, I). Then

E_w[ q(x, w) q(x′, w)* ] = e^{−γ‖x − x′‖²} = K(x, x′)

(* denotes complex conjugation). By sampling w̃_1, ..., w̃_M we are considering the approximate kernel

(1/M) Σ_{i=1}^M q(x, w̃_i) q(x′, w̃_i)* = K_M(x, x′)
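A quick numerical check of this Monte Carlo identity, using the equivalent real-valued cosine features (for w ∼ N(0, I) the limiting kernel is the Gaussian with γ = 1/2; all names illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, M = 5, 20000
x, xp = rng.standard_normal(d), rng.standard_normal(d)

W = rng.standard_normal((M, d))     # frequencies w~_i ~ N(0, I)
b = rng.uniform(0, 2 * np.pi, M)    # random phases
z = lambda u: np.sqrt(2.0 / M) * np.cos(W @ u + b)   # feature map

K_M = z(x) @ z(xp)                          # approximate kernel K_M(x, x')
K = np.exp(-np.sum((x - xp) ** 2) / 2.0)    # exact Gaussian kernel
print(K_M, K)   # close for large M; the error decays like 1/sqrt(M)
```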

SLIDE 76

More Random Features

◮ translation invariant kernels K(x, x′) = H(x − x′):
  q(x, w) = e^{i w^T x},  w ∼ μ = F(H)  (the Fourier transform of H)
◮ infinite neural nets kernels (see the sketch below):
  q(x, w) = |w^T x + b|_+,  (w, b) ∼ μ = U[S^d]
◮ infinite dot product kernels
◮ homogeneous additive kernels
◮ group invariant kernels
◮ ...

Note: connections with hashing and sketching techniques.
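A sketch instantiating the neural-net features above: sample (w, b) uniformly on the sphere S^d ⊂ R^{d+1} and use ramp units (this gives a random-feature approximation of an arc-cosine-type kernel; treat the construction details as an assumption):

```python
import numpy as np

def relu_random_features(X, M, seed=0):
    # (w, b) uniform on the unit sphere S^d in R^{d+1}: normalize a Gaussian draw
    rng = np.random.default_rng(seed)
    wb = rng.standard_normal((M, X.shape[1] + 1))
    wb /= np.linalg.norm(wb, axis=1, keepdims=True)
    W, b = wb[:, :-1], wb[:, -1]
    # q(x, w) = |w^T x + b|_+ : a ramp / ReLU unit with random weights
    return np.maximum(X @ W.T + b, 0.0)
```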

SLIDES 77-79

Properties of Random Features

Optimization
◮ Time: ✘O(n³)✘ → O(nM²)
◮ Space: ✘O(n²)✘ → O(nM)

Statistics
As before: do we pay a price for efficient computations?

SLIDES 80-83

Previous Works

◮ *Many* different random features for different kernels
  (Rahimi, Recht '07; Vedaldi, Zisserman; ... 10+)

◮ Theoretical guarantees: mainly kernel approximation
  (Rahimi, Recht '07; ...; Sriperumbudur and Szabo '15)

  |K(x, x′) − K_M(x, x′)| ≲ 1/√M

◮ Statistical guarantees suboptimal or in restricted settings
  (Rahimi, Recht '09; Yang et al. '13; ...; Bach '15)

SLIDES 84-88

Main Result

Let q(x, w) = e^{i w^T x}, with w ∼ μ(w) = c_d (1 + ‖w‖²)^{−(d+1)/2}.

Theorem
If f* ∈ H^s, a Sobolev space, then the choices λ* = n^{−1/(2s+1)} and M* = (1/λ*)^{2s} give

E(f̂_{λ*,M*}(x) − f*(x))² ≲ n^{−2s/(2s+1)}

◮ Random features achieve the optimal bound!
◮ Efficient worst-case subsampling M* ∼ √n, but cannot exploit smoothness.
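For completeness, the density above is (up to the normalizing constant c_d) that of a multivariate Student-t with one degree of freedom, which is easy to sample; a sketch, with that identification stated as an assumption consistent with the displayed formula:

```python
import numpy as np

def sample_mu(M, d, seed=0):
    # A multivariate Student-t with 1 degree of freedom has density
    # proportional to (1 + ||w||^2)^{-(d+1)/2}: draw g ~ N(0, I_d), z ~ N(0, 1),
    # and set w = g / |z|.
    rng = np.random.default_rng(seed)
    g = rng.standard_normal((M, d))
    z = np.abs(rng.standard_normal((M, 1)))
    return g / z
```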

SLIDES 89-91

Remarks & Extensions

Nyström vs Random Features
◮ Both achieve optimal rates
◮ Nyström seems to need fewer samples (random centers)

How tight are the results?

[Figure: test error as a function of log λ and log M, showing that the two parameters trade off against each other]

SLIDES 92-94

Contributions

◮ Optimal bounds for data dependent/independent subsampling
◮ Subsampling: Nyström vs Random Features
◮ Beyond ridge regression: early stopping and multiple-pass SGD (see arXiv)

Some questions:
◮ The quest for the best sampling
◮ Regularization by projection: inverse problems and preconditioning
◮ Beyond randomization: non-convex neural net optimization?

Some perspectives:
◮ Computational regularization: subsampling regularizes
◮ Algorithm design: control stability for good statistics/computations