The importance of better models in stochastic optimization
John Duchi (based on joint work with Feng Ruan and Hilal Asi), Stanford University, Les Houches 2019


slide-1
SLIDE 1

The importance of better models in stochastic optimization

John Duchi (based on joint work with Feng Ruan and Hilal Asi) Stanford University Les Houches 2019

slide-2
SLIDE 2

Outline

◮ Motivating experiments
◮ Models in optimization
◮ Stochastic optimization
◮ Stability is better
◮ Nothing gets worse
◮ Beyond convexity
◮ Adaptivity in easy problems
◮ Revisiting experimental results


slide-4
SLIDE 4

Why robustness is important

How much ENERGY is spent in this paper? ≡ How many Toyota Camrys driving from SF to LA?

Guesses: 1? 10? 100? 1000? The answer: 4200.

slide-11
SLIDE 11

Stochastic gradient methods

The problem in this talk:

minimize_x  F(x) := E[f(x; S)] = ∫ f(x; s) dP(s)   subject to x ∈ X

slide-12
SLIDE 12

Stochastic gradient methods

The problem in this talk:

minimize_x  F(x) := E[f(x; S)] = ∫ f(x; s) dP(s)   subject to x ∈ X

Weakly convex functions: for each s, there is some ρ(s) such that

f(x; s) + (ρ(s)/2)‖x‖₂²  is convex in x

◮ add a big enough quadratic, and it becomes convex


slide-15
SLIDE 15

Stochastic gradient methods

The problem in this talk:

minimize_x  F(x) := E[f(x; S)] = ∫ f(x; s) dP(s)   subject to x ∈ X

Stochastic gradient method:  xk+1 = xk − αk gk,   gk ∈ ∂f(xk; Sk)

slide-16
SLIDE 16

Stochastic gradient methods

Stochastic gradient method:  xk+1 = xk − αk gk,   gk ∈ ∂f(xk; Sk)

Why do we use this?

◮ Easy to analyze?
◮ Default in software packages and simple to implement?
◮ It works?
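As a concrete sketch of the update (illustrative code, not from the talk; the least-squares data here are synthetic assumptions), the stochastic gradient method with stepsizes αk = α0/√k:

```python
import numpy as np

# synthetic least-squares data: f(x; (a, b)) = (a^T x - b)^2 / 2
rng = np.random.default_rng(0)
n, d = 1000, 10
A = rng.normal(size=(n, d))
x_star = rng.normal(size=d)
b = A @ x_star  # noiseless targets, so x_star minimizes F

x = np.zeros(d)
for k in range(1, 5001):
    i = rng.integers(n)               # sample S_k ~ P (uniform on the data)
    g = (A[i] @ x - b[i]) * A[i]      # g_k, a (sub)gradient of f(x_k; S_k)
    x = x - (0.1 / np.sqrt(k)) * g    # stepsize alpha_k = alpha_0 / sqrt(k)

print(np.linalg.norm(x - x_star))
```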

slide-17
SLIDE 17

Linear regression

F(x) = (1/(2m)) Σi (aiᵀ x − bi)²

[Figure: time to ǫ-accuracy vs. initial stepsize α0 for SGM, Truncated, and Prox]


slide-20
SLIDE 20

Outline

◮ Motivating experiments
◮ Models in optimization
◮ Stochastic optimization
◮ Stability is better
◮ Nothing gets worse
◮ Beyond convexity
◮ Adaptivity in easy problems
◮ Revisiting experimental results

slide-21
SLIDE 21

Optimization methods

How do we solve optimization problems?

  • 1. Build a “good” but simple local model of f
  • 2. Minimize the model (perhaps regularizing)
slide-22
SLIDE 22

Optimization methods

How do we solve optimization problems?

  • 1. Build a “good” but simple local model of f
  • 2. Minimize the model (perhaps regularizing)

Gradient descent: Taylor (first-order) model f(y) ≈ fx(y) := f(x) + ∇f(x)T (y − x)

slide-23
SLIDE 23

Optimization methods

How do we solve optimization problems?

  • 1. Build a “good” but simple local model of f
  • 2. Minimize the model (perhaps regularizing)

Newton’s method: Taylor (second-order) model f(y) ≈ fx(y) := f(x) + ∇f(x)T (y − x) + (1/2)(y − x)T ∇2f(x)(y − x)

slide-24
SLIDE 24

Composite optimization problems (other model-able structures)

The problem:

minimize_x  f(x) := h(c(x)),  where h : Rᵐ → R is convex and c : Rⁿ → Rᵐ is smooth

[Fletcher & Watson 80; Fletcher 82; Burke 85; Wright 87; Lewis & Wright 15; Drusvyatskiy & Lewis 16]

slide-25
SLIDE 25

Modeling composite problems

Now we make a convex model of f(x) = h(c(x)) by linearizing c inside h:

f(y) ≈ h(c(x) + ∇c(x)ᵀ(y − x)),   where c(x) + ∇c(x)ᵀ(y − x) = c(y) + O(‖x − y‖²)

slide-29
SLIDE 29

Modeling composite problems

Now we make a convex model  fx(y) := h(c(x) + ∇c(x)ᵀ(y − x))   [Burke 85; Drusvyatskiy, Ioffe, Lewis 16]
slide-31
SLIDE 31

Modeling composite problems

Now we make a convex model  fx(y) := h(c(x) + ∇c(x)ᵀ(y − x))

Example: f(x) = |x² − 1|, with h(z) = |z| and c(x) = x² − 1
slide-34
SLIDE 34

The prox-linear method [Burke, Drusvyatskiy et al.]

Iteratively (1) form a regularized convex model and (2) minimize it:

xk+1 = argmin_{x∈X} { fxk(x) + (1/(2α))‖x − xk‖₂² }
     = argmin_{x∈X} { h(c(xk) + ∇c(xk)ᵀ(x − xk)) + (1/(2α))‖x − xk‖₂² }
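For the scalar example f(x) = |x² − 1| (h(z) = |z|, c(x) = x² − 1), the subproblem min_d |c + c′·d| + d²/(2α) has a closed form, which makes the method's rapid convergence on this example easy to reproduce. A hedged sketch (the stepsize and starting point are made up):

```python
def prox_linear_step(x, alpha=1.0):
    """One prox-linear step for f(x) = |x^2 - 1|, i.e. h(z) = |z|, c(x) = x^2 - 1.

    Minimizes |c + b*d| + d^2 / (2*alpha) over d = y - x in closed form:
    step to the kink c + b*d = 0 if it is within reach (a guarded
    Newton-type step on c); otherwise take the full step of length alpha*|b|.
    """
    c = x * x - 1.0   # c(x)
    b = 2.0 * x       # c'(x), assumed nonzero along the iterates below
    if abs(c) <= alpha * b * b:
        d = -c / b
    else:
        d = -alpha * b if c > 0 else alpha * b
    return x + d

x = 1.3
for _ in range(6):
    x = prox_linear_step(x)
```

Starting from x = 1.3, the iterates approach x⋆ = 1 at a locally quadratic rate, mirroring the |xk − x⋆| values shown on the following slides.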

slide-35
SLIDE 35

The prox-linear method [Burke, Drusvyatskiy et al.]

On the example f(x) = |x² − 1|, successive prox-linear iterates satisfy |xk − x⋆| = 0.3, 0.024, 3·10⁻⁴, 4·10⁻⁸: roughly quadratic convergence.
slide-40
SLIDE 40

Generic(ish) optimization methods

Iterate  xk+1 = argmin_{x∈X} { fxk(x) + (1/(2αk))‖x − xk‖² }

slide-41
SLIDE 41

Generic(ish) optimization methods

Iterate  xk+1 = argmin_{x∈X} { fxk(x) + (1/(2αk))‖x − xk‖² }

◮ Proximal point method (fx = f) [Rockafellar 76]
◮ Gradient descent (fx(y) = f(x) + ⟨∇f(x), y − x⟩)
◮ Newton (fx(y) = f(x) + ⟨∇f(x), y − x⟩ + (1/2)(x − y)ᵀ∇²f(x)(x − y))
◮ Prox-linear (fx(y) = h(c(x) + ∇c(x)ᵀ(y − x)))

slide-42
SLIDE 42

The aProx family for stochastic optimization

Iterate:

◮ Sample Sk iid ∼ P
◮ Update by minimizing the model:  xk+1 = argmin_{x∈X} { fxk(x; Sk) + (1/(2αk))‖x − xk‖² }

slide-43
SLIDE 43

The aProx family for stochastic optimization

Iterate:

◮ Sample Sk iid ∼ P
◮ Update by minimizing the model:  xk+1 = argmin_{x∈X} { fxk(x; Sk) + (1/(2αk))‖x − xk‖² }

Examples:

◮ Stochastic gradient method
◮ Stochastic proximal-point (implicit gradient) method, fxk(x) = f(x) [Rockafellar 76; Kulis & Bartlett 10; Karampatziakis & Langford 11; Bertsekas 11; Toulis & Airoldi 17; Ryu & Boyd 16]
◮ Stochastic prox-linear methods [D. & Ruan 18; Davis & Drusvyatskiy 18; Asi & D. 19]

slide-44
SLIDE 44

Models in stochastic optimization

Conditions on our models (convex case):

  • i. Convex model: y ↦ fx(y; s) is convex
  • ii. Lower bound: fx(y; s) ≤ f(y; s)
  • iii. Local correctness: fx(x; s) = f(x; s) and ∂fx(x; s) ⊂ ∂f(x; s)

[D. & Ruan 17; Davis & Drusvyatskiy 18]

slide-45
SLIDE 45

Models in stochastic optimization

Conditions on our models (ρ-weakly convex case):

  • i. Convex model: y ↦ fx(y; s) is convex
  • ii. Lower bound: fx(y; s) ≤ f(y; s) + (ρ(s)/2)‖x − y‖₂²
  • iii. Local correctness: fx(x; s) = f(x; s) and ∂fx(x; s) ⊂ ∂f(x; s)

[D. & Ruan 17; Davis & Drusvyatskiy 18; Asi & D. 19]

slide-46
SLIDE 46

Modeling conditions

[Figure: models fx(y) of f near the point (x0, f(x0))]

◮ Linear model: fx0(y) = f(x0) + ∇f(x0)ᵀ(y − x0)
◮ Truncated model: fx0(y) = [f(x0) + ∇f(x0)ᵀ(y − x0)]₊
slide-49
SLIDE 49

Models in stochastic optimization

[Figure: linear and truncated models at x0, x1]

  • i. (Sub)gradient: fx(y) = f(x) + ⟨f′(x), y − x⟩
  • ii. Truncated: fx(y) = (f(x) + ⟨f′(x), y − x⟩) ∨ inf_x f(x)
  • iii. Bundle/multi-line: fx(y) = max_i {f(xi) + ⟨f′(xi), y − xi⟩}
  • iv. Prox-linear: fx(y) = h(c(x) + ∇c(x)ᵀ(y − x))
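For nonnegative losses (inf_x f(x; s) = 0), minimizing the truncated model plus the proximal term has a closed form: a gradient step whose stepsize is clipped at the Polyak-type value f/‖g‖². A sketch under these assumptions (the single least-squares sample is invented for illustration):

```python
import numpy as np

def truncated_step(x, f_val, g, alpha):
    """argmin_y [f_val + <g, y - x>]_+ + ||y - x||^2 / (2*alpha)
    = x - min(alpha, f_val / ||g||^2) * g: the step never crosses the
    model's zero level set, however large alpha is."""
    return x - min(alpha, f_val / float(g @ g)) * g

# one step on a single least-squares sample f(x) = (a^T x - b0)^2 / 2
rng = np.random.default_rng(1)
a, b0 = rng.normal(size=5), 0.7
x = rng.normal(size=5)
r = a @ x - b0
x_new = truncated_step(x, 0.5 * r * r, r * a, alpha=100.0)  # huge stepsize
# the residual is halved rather than blown up: a @ x_new - b0 == r / 2
```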
slide-50
SLIDE 50

The aProx family

Iterate:

◮ Sample Sk iid ∼ P
◮ Update by minimizing the model:  xk+1 = argmin_{x∈X} { fxk(x; Sk) + (1/(2αk))‖x − xk‖² }

slide-51
SLIDE 51

Outline

◮ Motivating experiments
◮ Models in optimization
◮ Stochastic optimization
◮ Stability is better
◮ Nothing gets worse
◮ Beyond convexity
◮ Adaptivity in easy problems
◮ Revisiting experimental results

slide-52
SLIDE 52

The aProx family

Iterate:

◮ Sample Sk iid ∼ P
◮ Update by minimizing the model:  xk+1 = argmin_{x∈X} { fxk(x; Sk) + (1/(2αk))‖x − xk‖² }

slide-53
SLIDE 53

Divergence of a gradient method

[Figure animation: gradient-method iterates diverging]

slide-63
SLIDE 63

Stability guarantees (convex)

Use the full stochastic proximal-point method,

xk+1 = argmin_{x∈X} { f(x; Sk) + (1/(2αk))‖x − xk‖² }.

Theorem (Asi & D. 18). Assume X⋆ = argmin_{x∈X} F(x) is non-empty and E[‖f′(x⋆; S)‖²] ≤ σ². Then

E[dist(xk, X⋆)²] ≤ dist(x0, X⋆)² + σ² (α1² + ··· + αk²)

slide-64
SLIDE 64

Stability guarantees (convex)

Use the full stochastic proximal-point method,

xk+1 = argmin_{x∈X} { f(x; Sk) + (1/(2αk))‖x − xk‖² }.

Theorem (Asi & D. 18). Assume X⋆ = argmin_{x∈X} F(x) is non-empty and E[‖f′(x⋆; S)‖²] ≤ σ². Then

E[dist(xk, X⋆)²] ≤ dist(x0, X⋆)² + σ² (α1² + ··· + αk²)

Theorem (Asi & D. 18). Under the same assumptions,

sup_k dist(xk, X⋆) < ∞   and   dist(xk, X⋆) →a.s. 0.

slide-65
SLIDE 65

Stability guarantees (convex)

Use any model with fx(y; s) ≥ inf_z f(z; s) (i.e. a good lower bound),

xk+1 = argmin_{x∈X} { fxk(x; Sk) + (1/(2αk))‖x − xk‖² }.

Theorem (Asi & D. 19). Assume X⋆ = argmin_{x∈X} F(x) is non-empty and there exists p < ∞ such that E[‖f′(x; S)‖²] ≤ C(1 + dist(x, X⋆)ᵖ). Then

sup_k dist(xk, X⋆) < ∞   and   dist(xk, X⋆) →a.s. 0.

slide-66
SLIDE 66

Example behaviors

On the least-squares objective F(x) = (1/(2m)) Σi (aiᵀ x − bi)²:

[Figure: optimality gap vs. iteration for SGM and Prox]
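For least squares, the stochastic proximal-point update has a closed form that exposes the stability seen in the plot: the effective stepsize α/(1 + α‖a‖²) is bounded by 1/‖a‖² no matter how large α is. An illustrative sketch with synthetic data (not the talk's experiment):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 500, 10
A = rng.normal(size=(n, d))
x_star = rng.normal(size=d)
b = A @ x_star  # noiseless: the interpolation ("easy") regime

x = np.zeros(d)
alpha = 1e6  # absurdly large stepsize; a plain gradient step would diverge
for _ in range(3000):
    i = rng.integers(n)
    a, r = A[i], A[i] @ x - b[i]
    # argmin_y (a^T y - b_i)^2 / 2 + ||y - x||^2 / (2*alpha)
    # = x - alpha / (1 + alpha * ||a||^2) * r * a
    x = x - (alpha / (1.0 + alpha * (a @ a))) * r * a

print(np.linalg.norm(x - x_star))
```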

slide-67
SLIDE 67

Classical asymptotic analysis

Theorem (Polyak & Juditsky 92). Let F be convex and strongly convex in a neighborhood of x⋆, and assume the f(·; S) are globally smooth. For xk generated by the stochastic gradient method,

(1/√k) Σi=1..k (xi − x⋆) →d N(0, ∇²F(x⋆)⁻¹ Cov(∇f(x⋆; S)) ∇²F(x⋆)⁻¹).

slide-68
SLIDE 68

New asymptotic analysis (convex case)

Theorem (Asi & D. 18). Let F be convex and strongly convex in a neighborhood of x⋆, and assume the f(·; S) are smooth near x⋆. If the xk remain bounded and the models fxk(·; Sk) satisfy our conditions, then

(1/√k) Σi=1..k (xi − x⋆) →d N(0, ∇²F(x⋆)⁻¹ Cov(∇f(x⋆; S)) ∇²F(x⋆)⁻¹).

[Figure: empirical confirmation for the truncated model]

slide-69
SLIDE 69

New asymptotic analysis (convex case)

Theorem (Asi & D. 18). Let F be convex and strongly convex in a neighborhood of x⋆, and assume the f(·; S) are smooth near x⋆. If the xk remain bounded and the models fxk(·; Sk) satisfy our conditions, then

(1/√k) Σi=1..k (xi − x⋆) →d N(0, ∇²F(x⋆)⁻¹ Cov(∇f(x⋆; S)) ∇²F(x⋆)⁻¹).

◮ Optimal by the local minimax theorem [Hájek 72; Le Cam 73; D. & Ruan 19]
◮ Key insight: subgradients of fxk(·; Sk) are close to ∇f(xk; Sk)

slide-70
SLIDE 70

Convergence to stationarity in weakly convex cases

Convergence analysis requires the Moreau envelope [Davis & Drusvyatskiy 18]:

Fλ(x) := inf_{y∈X} { F(y) + (λ/2)‖y − x‖₂² }

Important properties:

◮ Proximal mapping: xλ := prox_{F/λ}(x) := argmin_{y∈X} { F(y) + (λ/2)‖y − x‖₂² } satisfies ∇Fλ(x) = λ(x − xλ)
◮ Near-stationarity and decrease: F(xλ) ≤ F(x) and dist(0, ∂F(xλ)) ≤ ‖∇Fλ(x)‖₂

slide-71
SLIDE 71

Convergence to stationarity in weakly convex cases

Fλ(x) := inf_{y∈X} { F(y) + (λ/2)‖y − x‖₂² }

◮ Proximal mapping: xλ := prox_{F/λ}(x) satisfies ∇Fλ(x) = λ(x − xλ)
◮ Near-stationarity and decrease: F(xλ) ≤ F(x) and dist(0, ∂F(xλ)) ≤ ‖∇Fλ(x)‖₂

Convergence: we say the iterates xk converge if ∇Fλ(xk) → 0

slide-72
SLIDE 72

Moreau envelope of the absolute value

For F(x) = |x|,

Fλ(x) = (λ/2)x²  if |x| ≤ 1/λ,   and   Fλ(x) = |x| − 1/(2λ)  if |x| > 1/λ

◮ F′λ(x) = λx for |x| ≤ 1/λ, so |F′λ(x)| = λ dist(x, 0) there
◮ prox step: xλ = 0 if |x| ≤ 1/λ
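A quick numerical sanity check of this closed form (illustrative; the grid minimization is a brute-force stand-in for the infimum):

```python
import numpy as np

def moreau_abs(x, lam):
    """Closed-form Moreau envelope of F(y) = |y|."""
    if abs(x) <= 1.0 / lam:
        return 0.5 * lam * x * x
    return abs(x) - 1.0 / (2.0 * lam)

lam = 2.0
ys = np.linspace(-3.0, 3.0, 200001)  # fine grid that contains 0 exactly
max_gap = 0.0
for x in (0.1, 0.4, 1.0, -2.5):
    # brute-force inf_y |y| + (lam/2) * (y - x)^2 over the grid
    brute = np.min(np.abs(ys) + 0.5 * lam * (ys - x) ** 2)
    max_gap = max(max_gap, abs(brute - moreau_abs(x, lam)))
print(max_gap)  # tiny: the grid minimum matches the closed form
```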

slide-73
SLIDE 73

Convergence in weakly convex cases

Use the regularized stochastic proximal-point method,

xk+1 = argmin_{x∈X} { f(x; Sk) + (ρ(Sk)/2)‖x − xk‖₂² + (1/(2αk))‖x − xk‖₂² }.

Theorem (Asi & D. 19). Let the random f be ρ(s)-weakly convex with E[ρ²(S)] < ∞. With the proximal-point iteration, the iterates xk satisfy

Fλ(xk) →a.s. G   and   Σ_{k≥1} αk ‖∇Fλ(xk)‖₂² < ∞.

slide-74
SLIDE 74

Convergence in weakly convex cases

Theorem (Asi & D. 19). Let the random f be ρ(s)-weakly convex with E[ρ²(S)] < ∞. With the proximal-point iteration, the iterates xk satisfy

Fλ(xk) →a.s. G   and   Σ_{k≥1} αk ‖∇Fλ(xk)‖₂² < ∞.

Proposition (Asi & D. 19). If the iterates xk remain bounded and the image of the stationary points has measure zero, then ∇Fλ(xk) →a.s. 0.

slide-75
SLIDE 75

What is an easy problem?

◮ Interpolation problems [Belkin, Hsu, Mitra 18; Ma, Bassily, Belkin 18]
◮ Overparameterized linear systems (Kaczmarz algorithms) [Strohmer & Vershynin 09; Needell, Srebro, Ward 14; Needell & Tropp 14]
◮ Random projections for linear constraints [Leventhal & Lewis 10]

[Figure: training curves on (a) MNIST, (b) CIFAR-10, (c) SVHN (4 subsamples)]


slide-77
SLIDE 77

What is an easy problem?

minimize_x  F(x) := E[f(x; S)] = ∫ f(x; s) dP(s)

Definition: the problem is easy if there exists x⋆ such that f(x⋆; S) = inf_x f(x; S) with probability 1. [Schmidt & Le Roux 13; Ma, Bassily, Belkin 18; Belkin, Rakhlin, Tsybakov 18]

slide-78
SLIDE 78

What is an easy problem?

minimize_x  F(x) := E[f(x; S)] = ∫ f(x; s) dP(s)

Definition: the problem is easy if there exists x⋆ such that f(x⋆; S) = inf_x f(x; S) with probability 1. [Schmidt & Le Roux 13; Ma, Bassily, Belkin 18; Belkin, Rakhlin, Tsybakov 18]

One additional condition:

  • iv. The models fx satisfy fx(y; s) ≥ inf_{x⋆∈X} f(x⋆; s)

slide-79
SLIDE 79

Easy strongly convex problems

Theorem (Asi & D. 18). Let F satisfy the growth condition F(x) ≥ F(x⋆) + (λ/2) dist(x, X⋆)², where X⋆ = argmin_x F(x), and be easy. Then

E[dist(xk, X⋆)²] ≤ max{ exp(−c(α1 + ··· + αk)), exp(−ck) } · dist(x1, X⋆)².
slide-80
SLIDE 80

Easy strongly convex problems

Theorem (Asi & D. 18). Let F satisfy the growth condition F(x) ≥ F(x⋆) + (λ/2) dist(x, X⋆)², where X⋆ = argmin_x F(x), and be easy. Then

E[dist(xk, X⋆)²] ≤ max{ exp(−c(α1 + ··· + αk)), exp(−ck) } · dist(x1, X⋆)².

◮ Adaptive no matter the stepsizes
◮ Most other results (e.g. for SGM [Schmidt & Le Roux 13; Ma, Bassily, Belkin 18]) require careful stepsize choices

slide-81
SLIDE 81

Sharp problems

Definition: an objective F is sharp if F(x) ≥ F(x⋆) + λ dist(x, X⋆) for X⋆ = argmin F(x). [Ferris 88; Burke & Ferris 95]

◮ Piecewise linear objectives
◮ Hinge loss F(x) = (1/m) Σi [1 − aiᵀ x]₊
slide-84
SLIDE 84

Sharp convex problems

Definition: an objective F is sharp if F(x) ≥ F(x⋆) + λ dist(x, X⋆) for X⋆ = argmin F(x). [Ferris 88; Burke & Ferris 95]

◮ Piecewise linear objectives
◮ Hinge loss F(x) = (1/m) Σi [1 − aiᵀ x]₊
◮ Projection onto intersections: F(x) = (1/m) Σi dist(x, Ci)

slide-85
SLIDE 85

Sharp convex problems

Definition: an objective F is sharp if F(x) ≥ F(x⋆) + λ dist(x, X⋆) for X⋆ = argmin F(x). [Ferris 88; Burke & Ferris 95]

Theorem (Asi & D. 18). Let F have sharp growth and be easy. If F is convex,

E[dist(xk+1, X⋆)²] ≤ max{ exp(−ck), exp(−c(α1 + ··· + αk)) } · dist(x1, X⋆)².
slide-86
SLIDE 86

Sharp weakly convex problems

Definition: an objective F is sharp if F(x) ≥ F(x⋆) + λ dist(x, X⋆) for X⋆ = argmin F(x). [Ferris 88; Burke & Ferris 95]

◮ Phase retrieval: F(x) = (1/m) ‖(Ax)² − (Ax⋆)²‖₁
◮ Blind deconvolution [Charisopoulos et al. 19]

slide-87
SLIDE 87

Sharp weakly convex problems

Theorem (Asi & D. 19). Let F have sharp growth and be easy. There exists c ∈ (0, 1) such that on the event xk → X⋆,

lim sup_k dist(xk, X⋆) / (1 − c)ᵏ < ∞.

slide-88
SLIDE 88

Outline

◮ Motivating experiments
◮ Models in optimization
◮ Stochastic optimization
◮ Stability is better
◮ Nothing gets worse
◮ Beyond convexity
◮ Adaptivity in easy problems
◮ Revisiting experimental results

slide-89
SLIDE 89

Methods

Iterate  xk+1 = argmin_x { fxk(x; Sk) + (1/(2αk))‖x − xk‖₂² }

slide-90
SLIDE 90

Methods

Iterate  xk+1 = argmin_x { fxk(x; Sk) + (1/(2αk))‖x − xk‖₂² }

◮ Stochastic gradient: fxk(x; Sk) = f(xk; Sk) + ⟨f′(xk; Sk), x − xk⟩
◮ Truncated gradient (for f ≥ 0): fxk(x; Sk) = [f(xk; Sk) + ⟨f′(xk; Sk), x − xk⟩]₊
◮ (Stochastic) proximal point: fxk(x; Sk) = f(x; Sk)

slide-91
SLIDE 91

Linear regression with low noise

F(x) = (1/(2m)) Σi (aiᵀ x − bi)²

[Figure: time to ǫ-accuracy vs. initial stepsize α0 for SGM, Truncated, and Prox]

slide-92
SLIDE 92

Linear regression with no noise

F(x) = (1/(2m)) Σi (aiᵀ x − bi)²

[Figure: time to ǫ-accuracy vs. initial stepsize α0 for SGM, Truncated, and Prox]


slide-94
SLIDE 94

Linear regression with “poor” conditioning

[Figure: time to accuracy ǫ = 0.055 vs. initial stepsize α0 for Proximal, SGM, Truncated, and Bundle]

Poor conditioning? κ(A) = 15

slide-95
SLIDE 95

Multiclass hinge loss: no noise

f(x; (a, l)) = max_{i≠l} [1 + ⟨a, xi − xl⟩]₊

[Figure: time to ǫ-accuracy vs. initial stepsize α0 for SGM, Truncated, and Prox]

slide-96
SLIDE 96

Multiclass hinge loss: small label flipping

f(x; (a, l)) = max_{i≠l} [1 + ⟨a, xi − xl⟩]₊

[Figure: time to ǫ-accuracy vs. initial stepsize α0 for SGM, Truncated, and Prox]

slide-97
SLIDE 97

Multiclass hinge loss: substantial label flipping

f(x; (a, l)) = max_{i≠l} [1 + ⟨a, xi − xl⟩]₊

[Figure: time to ǫ-accuracy vs. initial stepsize α0 for SGM, Truncated, and Prox]


slide-99
SLIDE 99

(Robust) Phase retrieval [Candès, Li, Soltanolkotabi 15]

Observations (usually) bi = ⟨ai, x⋆⟩² yield the objective

f(x) = (1/m) Σi |⟨ai, x⟩² − bi|
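As an illustrative sketch (synthetic data, local initialization, and made-up parameters; not the talk's experiment), truncated-model steps on this objective, i.e. subgradient steps with the stepsize clipped at the Polyak-type value f/‖g‖², remain stable even with a large stepsize:

```python
import numpy as np

rng = np.random.default_rng(3)
m, d = 200, 5
A = rng.normal(size=(m, d))
x_star = rng.normal(size=d)
b = (A @ x_star) ** 2  # noiseless phase-retrieval observations

x = x_star + 0.05 * rng.normal(size=d)  # local initialization near x_star
for _ in range(4000):
    i = rng.integers(m)
    a = A[i]
    r = (a @ x) ** 2 - b[i]                # f(x; i) = |r|
    g = np.sign(r) * 2.0 * (a @ x) * a     # subgradient of |<a, x>^2 - b_i|
    g2 = g @ g
    if g2 > 0 and r != 0:
        x = x - min(1.0, abs(r) / g2) * g  # truncated step, alpha = 1

# error up to the global sign ambiguity x <-> -x
print(min(np.linalg.norm(x - x_star), np.linalg.norm(x + x_star)))
```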

slide-100
SLIDE 100

Phase retrieval without noise

F(x) = (1/m) Σi |⟨ai, x⟩² − bi|

[Figure: time to ǫ-accuracy vs. initial stepsize α0 for Proximal, SGM, and Truncated]

slide-101
SLIDE 101

Matrix completion without noise

F(x, y) = Σ_{i,j∈Ω} |⟨xi, yj⟩ − Mij|

[Figure: time to ǫ-accuracy vs. initial stepsize α0 for SGM and Truncated]

slide-102
SLIDE 102

Deep learning experiments

CIFAR-10 dataset: 10-class image classification

[Figure: time to error ǫ = 0.11 vs. initial stepsize α0 for SGM, Truncated, adam, and trunc-adagrad]

slide-103
SLIDE 103

Deep learning experiment: dog recognition

Stanford Dogs: 120-class dog breed classification

[Figure: time to error ǫ = 0.35 vs. initial stepsize α0 for SGM, Truncated, adam, and trunc-adagrad]

slide-104
SLIDE 104

Conclusions

◮ Perhaps blind application of stochastic gradient methods is not the right answer
◮ Care and better modeling can yield improved performance
◮ Computational efficiency is important in model choice

slide-105
SLIDE 105

Conclusions

◮ Perhaps blind application of stochastic gradient methods is not the right answer
◮ Care and better modeling can yield improved performance
◮ Computational efficiency is important in model choice

Questions

◮ Parallelism?
◮ The importance of better models in stochastic optimization. arXiv:1903.08619
◮ Stochastic (Approximate) Proximal Point Methods: Convergence, Optimality, and Adaptivity. arXiv:1810.05633