SLIDE 1

Iterative Convex Regularization

Lorenzo Rosasco

Universita’ di Genova, Istituto Italiano di Tecnologia, Massachusetts Institute of Technology

Optimization and Statistical Learning Workshop, Les Houches, January 14

  • Ongoing work with S. Villa (IIT-MIT), B.C. Vu (IIT-MIT)

SLIDE 2

Early Stopping

SLIDE 3

Plan

  • part I: introduction to iterative regularization
  • part II: iterative convex regularization: problem and results

[Diagram: Statistics/Estimation, Optimization & …]

SLIDE 4

Linear Inverse Problems

Φw = y,    Φ : H → G linear and bounded

w† = arg min_{Φw=y} R(w)    (Moore-Penrose solution; R strongly convex, lsc)

Examples: (endless list here)

SLIDE 5

Data:  Φw = y

Data Type I:   ‖y − ŷ‖ ≤ δ

Data Type II:  Φ̂ : H → Ĝ,   ‖Φ∗y − Φ̂∗ŷ‖ ≤ δ,   ‖Φ∗Φ − Φ̂∗Φ̂‖ ≤ η

  • Data type I: deterministic/stochastic noise […]
  • Data type II: stochastic noise, statistical learning [R. et al. ’05]; also econometrics, discretized PDEs (?)

SLIDE 6

Learning* as an Inverse Problem

Yi = ⟨w†, Xi⟩ + Ni,    i = 1, . . . , n

Φ∗Φ = E[XXᵀ],    Φ̂∗Φ̂ = (1/n) Σ_{i=1}^{n} Xi Xiᵀ

Φ∗y = E[XY],     Φ̂∗ŷ = (1/n) Σ_{i=1}^{n} Xi Yi

Can be shown to fit Data Type II with δ, η ∼ 1/√n

*Random design regression [De Vito et al. ’05]

Nonparametric extensions via RKHS theory: covariance operators become integral operators
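A minimal NumPy sketch of the empirical quantities above for random design regression; the dimensions, the standard Gaussian design, and the noise level are illustrative assumptions, not part of the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 10                       # sample size and dimension (illustrative)
w_true = rng.standard_normal(d)      # plays the role of w†

X = rng.standard_normal((n, d))                  # X_i, i = 1, ..., n
y = X @ w_true + 0.1 * rng.standard_normal(n)    # Y_i = <w†, X_i> + N_i

# Empirical versions of the Data Type II quantities
cov_hat = X.T @ X / n    # Φ̂*Φ̂ = (1/n) Σ X_i X_iᵀ,  population version E[XXᵀ] = I here
b_hat = X.T @ y / n      # Φ̂*ŷ  = (1/n) Σ X_i Y_i,   population version E[XY] = w† here

eta = np.linalg.norm(cov_hat - np.eye(d), 2)     # ‖Φ̂*Φ̂ − Φ*Φ‖
delta = np.linalg.norm(b_hat - w_true)           # ‖Φ̂*ŷ − Φ*y‖
print(f"eta = {eta:.3f}, delta = {delta:.3f}, 1/sqrt(n) = {1 / np.sqrt(n):.3f}")
```

With a standard Gaussian design, E[XXᵀ] = I and E[XY] = w†, so the printed perturbations can be compared directly with 1/√n.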

SLIDE 7

Tikhonov Regularization

w† = arg min_{Φw=y} R(w)

ŵλ = arg min_{w∈H} ‖Φ̂w − ŷ‖² + λR(w),    λ ≥ 0

wλ = arg min_{w∈H} ‖Φw − y‖² + λR(w)

  • New trade-offs (?): bias / variance / computations for ŵ_{t,λ}
  • Complexity of model selection?
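For R(w) = ‖w‖² this is ridge regression, with the closed form ŵλ = (Φ̂∗Φ̂ + λI)⁻¹Φ̂∗ŷ (see the next slide). A minimal sketch, on synthetic data with illustrative sizes, showing that every candidate λ costs a separate linear solve, which is the model-selection cost hinted at above:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 20
Phi = rng.standard_normal((n, d)) / np.sqrt(n)         # plays the role of Φ̂
w_true = rng.standard_normal(d)
y_hat = Phi @ w_true + 0.05 * rng.standard_normal(n)   # ŷ = Φ̂ w + noise

# Tikhonov / ridge: ŵλ = (Φ̂*Φ̂ + λI)^{-1} Φ̂*ŷ  -- one linear solve per candidate λ
A, b = Phi.T @ Phi, Phi.T @ y_hat
for lam in [1.0, 0.1, 0.01, 0.001]:
    w_lam = np.linalg.solve(A + lam * np.eye(d), b)
    print(f"lambda = {lam:6.3f}   error to w_true = {np.linalg.norm(w_lam - w_true):.3f}")
```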
SLIDE 8

From Tikhonov Regularization … to Landweber Regularization

R(w) = ‖w‖²:   wλ = (Φ∗Φ + λI)⁻¹Φ∗y,    w† = Φ†y

wt ≈ Σ_{j=0}^{t} (I − Φ∗Φ)^j Φ∗y

wt+1 = wt − Φ∗(Φwt − y)          (exact data)
ŵt+1 = ŵt − Φ̂∗(Φ̂ŵt − ŷ)         (noisy data)
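A minimal NumPy sketch of the Landweber / gradient descent iteration for R(w) = ‖w‖², with an explicit step size τ (the slide's update has no step size, which implicitly assumes ‖Φ‖ ≤ 1); it also checks numerically that t steps from w0 = 0 reproduce the truncated series Σ_{j<t} (I − τΦ∗Φ)^j τΦ∗y. All data are synthetic and illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 100, 15
Phi = rng.standard_normal((n, d))
y = Phi @ rng.standard_normal(d)

A, b = Phi.T @ Phi, Phi.T @ y
tau = 1.0 / np.linalg.norm(A, 2)     # step size so that the spectrum of I − τΦ*Φ lies in [0, 1]

# Landweber / gradient descent on ‖Φw − y‖², started at w_0 = 0:
#   w_{t+1} = w_t − τ Φ*(Φ w_t − y)
T = 50
w = np.zeros(d)
for _ in range(T):
    w = w - tau * (A @ w - b)

# Equivalently, the truncated series  Σ_{j=0}^{T-1} (I − τΦ*Φ)^j τΦ*y
M, series, term = np.eye(d) - tau * A, np.zeros(d), tau * b
for _ in range(T):
    series += term
    term = M @ term

print(np.allclose(w, series))        # expected: True
```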

SLIDES 9-12

Landweber Regularization aka Gradient Descent

R(w) = ‖w‖²,   wλ = (Φ∗Φ + λI)⁻¹Φ∗y,   w† = Φ†y ≈ Σ_{j=0}^{t} (I − Φ∗Φ)^j Φ∗y

wt+1 = wt − Φ∗(Φwt − y),    ŵt+1 = ŵt − Φ̂∗(Φ̂ŵt − ŷ)

[Plots: data (X, Y) and the fitted function as the iterations proceed]

SLIDES 13-14

Landweber Regularization aka Gradient Descent

R(w) = ‖w‖²,   wλ = (Φ∗Φ + λI)⁻¹Φ∗y,   w† = Φ†y ≈ Σ_{j=0}^{t} (I − Φ∗Φ)^j Φ∗y

wt+1 = wt − Φ∗(Φwt − y),    ŵt+1 = ŵt − Φ̂∗(Φ̂ŵt − ŷ)

[Plots: "Emp Err" and "Val Err" versus the iteration t (log scale); curves ‖ŵt − w†‖ and ‖ŵt − ŵ†‖, with ŵ† = Φ̂†ŷ]

Semi-Convergence
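A minimal sketch of semi-convergence and of early stopping with a hold-out set: run the noisy iteration ŵt+1 = ŵt − τΦ̂∗(Φ̂ŵt − ŷ), watch the validation error first decrease and then increase, and keep the iterate where it is smallest. The data-generating model, the split, and the step size are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 60, 200                                   # fewer samples than unknowns: ill-posed
w_true = np.zeros(d); w_true[:10] = 1.0
X = rng.standard_normal((n, d))
y = X @ w_true + 0.5 * rng.standard_normal(n)
Xtr, ytr, Xva, yva = X[:40], y[:40], X[40:], y[40:]   # train / hold-out split

A, b = Xtr.T @ Xtr / 40, Xtr.T @ ytr / 40
tau = 1.0 / np.linalg.norm(A, 2)

w = np.zeros(d)
best_val, best_t = np.inf, 0
for t in range(1, 2001):
    w = w - tau * (A @ w - b)                    # noisy Landweber step
    val = np.mean((Xva @ w - yva) ** 2)          # validation error: decreases, then increases
    if val < best_val:
        best_val, best_t = val, t
print("early-stopping iteration:", best_t, " validation error:", round(best_val, 3))
```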

SLIDE 15

Remarks I

  • History: iteration + semi-convergence [Landweber ’50] … […Nemirovski ’86…]
  • Other iterative approaches, some with acceleration: nu-method/Chebyshev method [Brakhage ’87, Nemirovski Polyak ’84], conjugate gradient [Nemirovski ’86…]…

R(w) = ‖w‖²

Data type I:

  • Deterministic noise [Engl et al. ’96], stochastic noise […, Buhlmann, Yu ’02 (L2 Boosting), Bissantz et al. ’07]
  • Extensions to noise in the operator [Nemirovski ’86, …]
  • Nonlinear problems [Kaltenbacher et al. ’08]
  • Banach spaces [Schuster et al. ’12]
SLIDE 16

Remarks II

R(w) = ‖w‖²

Data type II:

  • Deterministic noise: Landweber and nu-method [De Vito et al. ’06]
  • Stochastic noise/learning: Landweber and nu-method [Ong Canu ’04, R et al. ’04, Yao et al. ’05, Bauer et al. ’06, Caponetto Yao ’07, Raskutti et al. ’13]
  • …also conjugate gradient [Blanchard Cramer ’10]
  • …and incremental gradient, aka multiple passes of SGD [R et al. ’14]
  • …and (convex) loss, subgradient method [Lin, R, Zhou ’15]
  • Works really well in practice [Huang et al. ’14, Perronnin et al. ’13]
  • The regularization “path” comes for free
SLIDE 17

Remarks III

Take-home message: computations/iterations control stability/regularization. New trade-offs?

[Plots: "Emp Err" and "Val Err" versus the iteration t; curves ‖ŵt − w†‖ and ‖ŵt − ŵ†‖, with ŵ† = Φ̂†ŷ]

ŵt+1 = ŵt − Φ̂∗(Φ̂ŵt − ŷ)

Semi-Convergence

SLIDE 18

Can we derive iterative regularization for any (strongly) convex regularization?

SLIDE 19

Plan

  • part I: introduction to iterative regularization
  • part II: iterative convex regularization: problem and results

SLIDE 20

How can I tell the iteration which regularization I want to use?

ŵt+1 = ŵt − Φ̂∗(Φ̂ŵt − ŷ)

w† = arg min_{Φw=y} R(w)

SLIDE 21

Iterative Regularization and Early Stopping

wt = A(w0, . . . , wt−1, Φ, y)

Convergence (exact data):   ‖wt − w†‖ → 0  as  t → ∞

Convergence (noisy data):   ∃ t† = t†(w†, δ, η)  such that  ‖ŵt† − w†‖ → 0  as  (δ, η) → 0

Error bounds:   ∃ t† = t†(w†, δ, η)  such that  ‖ŵt† − w†‖ ≤ ε(w†, δ, η)

  • adaptivity, e.g. via discrepancy or Lepskii principles
SLIDE 22

Dual Forward Backward (DFB)

  • Analogous iteration for noisy data
  • Special case of dual forward-backward splitting [Combettes et al. ’10]…
  • …also a form of augmented Lagrangian method/ADMM [see Beck Teboulle ’14]
  • …also can be shown to be equivalent to linearized Bregmanized operator splitting [Burger, Osher et al. …]
  • Reduces to the Landweber iteration if we consider only the squared norm

R = F + (α/2)‖·‖²,   F convex, lsc,   α > 0

(∀t ∈ N)
    wt = prox_{α⁻¹F}(−α⁻¹Φ∗vt)
    vt+1 = vt + γt(Φwt − y),    γt = α

w† = arg min_{Φw=y} R(w)
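A minimal NumPy sketch of the DFB iteration above for the choice F(w) = λ‖w‖₁, so that R(w) = λ‖w‖₁ + (α/2)‖w‖²; then prox_{α⁻¹F} is componentwise soft-thresholding at level λ/α. The problem data, λ, α, and the normalization ‖Φ‖ ≤ 1 (which makes γt = α a safe step size) are illustrative assumptions:

```python
import numpy as np

def soft_threshold(z, kappa):
    """Prox of kappa * ||.||_1 : componentwise soft-thresholding."""
    return np.sign(z) * np.maximum(np.abs(z) - kappa, 0.0)

rng = np.random.default_rng(3)
n, d = 30, 60
Phi = rng.standard_normal((n, d))
Phi /= np.linalg.norm(Phi, 2)                # normalize so that ‖Φ‖ ≤ 1
w_star = np.zeros(d); w_star[:5] = 1.0       # a sparse solution (illustrative)
y = Phi @ w_star                             # exact data: Φw = y is feasible

lam, alpha = 0.1, 1.0                        # F = lam*||.||_1,  R = F + (alpha/2)*||.||^2
gamma = alpha                                # γt = α, as on the slide

# DFB:  w_t = prox_{F/α}(−Φ* v_t / α),   v_{t+1} = v_t + γ (Φ w_t − y)
v = np.zeros(n)
for _ in range(5000):
    w = soft_threshold(-Phi.T @ v / alpha, lam / alpha)
    v = v + gamma * (Phi @ w - y)

print("residual:", np.linalg.norm(Phi @ w - y), " nonzeros:", int(np.sum(np.abs(w) > 1e-3)))
```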

SLIDE 23

Analysis for Data Type I

[R., Villa, Vu et al. ’14]

‖ŵt − w†‖ ≤ ‖ŵt − wt‖ + ‖wt − w†‖,    ‖wt − w†‖ ≤ ‖v†‖/(α√t)

Theorem. If there exists v† ∈ G such that Φ∗v† ∈ ∂R(w†), then the DFB sequence (wt)t for v0 = 0 satisfies

    ‖wt − w†‖ ≤ ‖v†‖/(α√t)

Proof idea:   (α/2)‖wt − w†‖² ≤ D(vt) − D(v†)

SLIDE 24

Analysis for Data Type I

[R., Villa, Vu et al. ’14]

‖ŵt − w†‖ ≤ ‖ŵt − wt‖ + ‖wt − w†‖ ≤ c δ t + ‖v†‖/(α√t)

Theorem. Let (wt)t, (ŵt)t be the DFB sequences for v̂0 = v0 = 0. Then it holds

    ‖ŵt − wt‖ ≤ 2 t δ ‖Φ‖

SLIDE 25

Analysis for Data Type I

[R., Villa, Vu et al. ’14]

‖ŵt − w†‖ ≤ ‖ŵt − wt‖ + ‖wt − w†‖ ≤ c δ t + ‖v†‖/(α√t)

Balancing the two terms:   t† = c δ^{−2/3}   ⇒   ‖ŵt† − w†‖ ≤ c δ^{1/3}
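The balancing step behind t† can be filled in as follows (a sketch; all constants are absorbed into c):

```latex
\[
\min_{t > 0}\;\Big( c\,\delta\,t \;+\; \tfrac{\|v^\dagger\|}{\alpha\sqrt{t}} \Big):
\qquad
c\,\delta \;-\; \tfrac{\|v^\dagger\|}{2\alpha}\,t^{-3/2} = 0
\;\;\Longrightarrow\;\;
t^\dagger \,\propto\, \delta^{-2/3},
\]
\[
c\,\delta\,t^\dagger \,\sim\, \delta^{1/3},
\qquad
\tfrac{\|v^\dagger\|}{\alpha\sqrt{t^\dagger}} \,\sim\, \delta^{1/3},
\qquad\text{hence}\qquad
\|\hat w_{t^\dagger} - w^\dagger\| \,\le\, c\,\delta^{1/3}.
\]
```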
SLIDE 26

Analysis for Data Type II

[R., Villa, Vu et al. ’14]

‖ŵt − w†‖ ≤ ‖ŵt − wt‖ + ‖wt − w†‖ ≤ (δ + η)(1 + c)^t + ‖v†‖/(α√t)

t̂ = c log √(1/(δ + η))   ⇒   ‖ŵt̂ − w†‖ ≤ c / √(log(1/√(δ + η)))

SLIDE 27

Remarks

  • General convex setting: only weak convergence [Burger, Osher et al. ’09, ’10], no stability results, no strong convergence.
  • Sparsity-based regularization [Osher et al. ’14]
  • No previous results, either on convergence or on error bounds.
  • Directly give results for statistical learning.
  • Acceleration possible, but stability harder to prove (e.g. via dual FISTA, Chambolle Pock…)
  • Polynomial estimates of the variance under stronger conditions (satisfied in certain smooth cases, e.g. Landweber)
  • Connections to the regularization path, e.g. Lasso path/LARS results…

Data Type I / Data Type II

SLIDE 28

Work in Progress

  • Purely convex case: exact penalization result for atomic norms?
  • Analysis under partial smoothness
  • Sharper bounds (high but finite dimension)
  • Truly ill-posed problems
  • (more) Experiments
SLIDE 29

Conclusions

  • Iterative regularization: a viable alternative to Tikhonov regularization for large problems
  • Old (?) trade-offs in ML: computational regularization?
  • A whole new playground: loss, iterations, randomization