Stochastic optimization and sparse statistical recovery: An optimal algorithm for high dimensions
Alekh Agarwal, Microsoft Research
Joint work with Sahand Negahban and Martin Wainwright
Workshop on Optimization and Statistical Learning 2013, Les
Introduction

Sparse optimization:
    θ∗ = arg min_{θ ∈ R^d} E_P[ℓ(θ; z)] = arg min_θ L(θ),   such that θ∗ is s-sparse
- Loss function ℓ is convex
- P unknown, but we can sample from it
- High-dimensional setup: n ≪ d
- Want a linear-time and statistically (near-)optimal algorithm
Example 1: Computational genomics

[Figure: y = sign(X θ∗), where y ∈ {−1, +1}^n, X is the n × d matrix of genome features, and θ∗ is supported on a small set S]
- Predict disease susceptibility from the genome
- Susceptibility depends on very few genes, so θ∗ is sparse
- Sparse logistic regression:
    θ∗ = arg min_θ E_P[log(1 + exp(−y θᵀx))]
Example 2: Compressed sensing

[Figure: y = X θ∗ + w, with X an n × d measurement matrix, noise vector w, and θ∗ supported on a small set S]
- Recover an unknown signal θ∗ from noisy measurements
- Sparse linear regression (gradient sketch below):
    θ∗ = arg min_θ E_P[(y − θᵀx)²]
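To make the stochastic-gradient oracle used later concrete, here is a minimal sketch of single-sample gradient estimates for the two example losses; it is illustrative only (the function names and interfaces are assumptions, not code from the talk).

```python
import numpy as np

def grad_squared_loss(theta, x, y):
    """Gradient of the sparse-linear-regression loss (y - theta^T x)^2 at one sample."""
    return -2.0 * (y - x @ theta) * x

def grad_logistic_loss(theta, x, y):
    """Gradient of the logistic loss log(1 + exp(-y * theta^T x)), with y in {-1, +1}."""
    margin = y * (x @ theta)
    return -y * x / (1.0 + np.exp(margin))
```

Either function can serve as the unbiased gradient oracle g_t = ∇ℓ(θ_t; z_t) assumed by the stochastic methods discussed below.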
Approach 1: M-estimation (batch optimization)

- Draw n i.i.d. samples
- Obtain θ̂_n:
    θ̂_n = arg min_θ { (1/n) Σ_{i=1}^n ℓ(θ; z_i) + λ_n ‖θ‖₁ }
- Statistical arguments for consistency, θ̂_n → θ∗
- Convex optimization to compute θ̂_n (a sketch of one such solver appears below)
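For concreteness, here is a minimal proximal-gradient (ISTA-style) sketch of the batch M-estimation step for the squared loss; the step size and iteration count are illustrative assumptions, and this is not the particular batch solver analyzed in the talk.

```python
import numpy as np

def soft_threshold(v, tau):
    """Prox operator of tau * ||.||_1 (componentwise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def batch_lasso_ista(X, y, lam, num_iters=500):
    """Minimize (1/n) ||y - X theta||_2^2 + lam * ||theta||_1 by proximal gradient."""
    n, d = X.shape
    # Conservative step size from the Lipschitz constant of the smooth part.
    step = n / (2.0 * np.linalg.norm(X, 2) ** 2)
    theta = np.zeros(d)
    for _ in range(num_iters):
        grad = -2.0 * X.T @ (y - X @ theta) / n   # gradient of the smooth part; costs O(nd)
        theta = soft_threshold(theta - step * grad, step * lam)
    return theta
```

Each iteration touches the whole data matrix, which is exactly the O(nd) per-iteration cost discussed on the complexity slide below.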
Batch optimization

- Convergence depends on properties of (1/n) Σ_{i=1}^n ℓ(θ; z_i) + λ_n ‖θ‖₁
- Sample loss not (globally) strongly convex for n < d
- Poor smoothness when n ≪ d
[Figure: surface plot of the sample loss]
- But, smooth and strongly convex in sparse directions
- Example: least-squares loss with random design
Fast convergence of gradient descent

- We prove (global) linear convergence of gradient descent, based on the sparse condition number of (1/n) Σ_{i=1}^n ℓ(θ; z_i)
[Plots: log‖θ^t − θ̂‖ (rescaled) vs. iteration count for n = 2500 and p = 5000, 10000, 20000; second panel with α = 16.3069]
Computational complexity of batch optimization

- Convergence rate captures the number of iterations
- Each iteration has complexity O(nd)
- One pass over the data at each iteration
- But we wanted a linear-time algorithm!
Approach 2: Stochastic optimization

- Directly minimize E_P[ℓ(θ; z)]
- Use samples to obtain gradient estimates (sketch below):
    θ_{t+1} = θ_t − α_t ∇ℓ(θ_t; z_t)
- Stop after one pass over the data
- Statistically, often competitive with batch, that is, ‖θ_n − θ∗‖₂ ≈ ‖θ̂_n − θ∗‖₂ for the stochastic output θ_n and the batch estimate θ̂_n
- Precise rates depend on the problem structure
Structural assumptions

- θ∗ is s-sparse
- Make additional structural assumptions on L(θ) = E_P[ℓ(θ; z)]:
  - L is locally Lipschitz
  - L is locally strongly convex (LSC)
Locally Lipschitz functions

Definition (Locally Lipschitz function). L is locally G-Lipschitz in ℓ1-norm, meaning that
    |L(θ) − L(θ̃)| ≤ G ‖θ − θ̃‖₁
whenever ‖θ − θ∗‖₁ ≤ R and ‖θ̃ − θ∗‖₁ ≤ R.
[Figure: globally Lipschitz vs. locally Lipschitz functions]
Locally strongly convex functions

Definition (Locally strongly convex function). There is a constant γ > 0 such that
    L(θ̃) ≥ L(θ) + ⟨∇L(θ), θ̃ − θ⟩ + (γ/2) ‖θ − θ̃‖₂²
whenever ‖θ‖₁ ≤ R and ‖θ̃‖₁ ≤ R.
[Figure: locally strongly convex vs. globally strongly convex functions]
Stochastic optimization and structural conditions

Method                                 | Sparsity | LSC | Convergence
SGD                                    |    no    | yes | O(d / T)
Mirror descent / RDA / FOBOS / COMID   |   yes    |  no | O(s² log d / T)
Our method                             |   yes    | yes | O(s log d / T)
Some previous methods

- All methods based on observing g_t such that E[g_t] ∈ ∂L(θ_t)
- Stochastic gradient descent: based on ℓ2 distances, exploits LSC
    θ_{t+1} = arg min_θ { ⟨g_t, θ⟩ + (1/(2α_t)) ‖θ − θ_t‖₂² }
- Stochastic dual averaging: based on ℓ_p distances, exploits sparsity when p ≈ 1
    θ_{t+1} = arg min_θ { ⟨ Σ_{s=1}^t g_s, θ ⟩ + (1/(2α_t)) ‖θ‖_p² }
- Need to reconcile the geometries for exploiting both structures
RADAR algorithm: outline

- Based on Juditsky and Nesterov (2011)
- Recall the minimization problem: min_θ E[ℓ(θ; z)]
- Algorithm proceeds over K epochs
- At epoch i, solve the regularized problem
    min_{θ ∈ Ω_i} E[ℓ(θ; z)] + λ_i ‖θ‖₁,  where Ω_i = {θ ∈ R^d : ‖θ − y_i‖_p² ≤ R_i²}
RADAR algorithm: First epoch

- Require: R1 such that ‖θ∗‖₁ ≤ R1
- Perform stochastic dual averaging with p = 2 log d / (2 log d − 1) ≈ 1
- Initialize θ1 = 0, y1 = 0
- Observe g_t where E[g_t] ∈ ∂L(θ_t), and ν_t ∈ ∂‖θ_t‖₁
- Update (see the sketch after this slide):
    μ_{t+1} = μ_t + g_t + λ1 ν_t
    θ_{t+1} = arg min_{‖θ‖_p ≤ R1} { ⟨θ, μ_{t+1}⟩ + (1/(2α_t)) ‖θ‖_p² }
[Figure: θ∗ contained in the ball of radius R1 around y1 = 0; the next epoch shrinks to radius R2]
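A minimal sketch of this within-epoch dual averaging step (an assumed interface, not the authors' reference code): the unconstrained minimizer is the gradient of the conjugate (1/2)‖·‖_q² with 1/p + 1/q = 1, evaluated at −α_t μ_{t+1}, and for a ball constraint in the same norm it suffices to rescale that point back onto the ball.

```python
import numpy as np

def dual_norm_map(v, q):
    """Gradient of (1/2) * ||v||_q^2, the inverse link function for the p-norm prox."""
    nrm = np.sum(np.abs(v) ** q) ** (1.0 / q)
    if nrm == 0.0:
        return np.zeros_like(v)
    return (nrm ** (2.0 - q)) * np.sign(v) * np.abs(v) ** (q - 1.0)

def dual_averaging_step(mu, alpha_t, p, radius, center=None):
    """argmin over ||theta - center||_p <= radius of <mu, theta> + (1/(2 alpha_t)) ||theta - center||_p^2."""
    q = p / (p - 1.0)                         # conjugate exponent, 1/p + 1/q = 1
    delta = dual_norm_map(-alpha_t * mu, q)   # unconstrained minimizer, relative to the center
    nrm = np.sum(np.abs(delta) ** p) ** (1.0 / p)
    if nrm > radius:
        delta *= radius / nrm                 # rescale onto the boundary of the l_p ball
    return delta if center is None else center + delta
```

With p = 2 log d / (2 log d − 1), the conjugate exponent is q = 2 log d, and each call costs O(d), matching the per-step cost claimed on the next slide.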
Initializing next epoch

- Update y2 = θ̄_T
- Update R2² = R1² / 2
- Update λ2 = λ1 / √2
- Initialize θ1 = y2 for the next epoch
- Now use the updates
    μ_{t+1} = μ_t + g_t + λ2 ν_t
    θ_{t+1} = arg min_{‖θ − y2‖_p ≤ R2} { ⟨θ, μ_{t+1}⟩ + (1/(2α_t)) ‖θ − y2‖_p² }
- Each step still O(d) (see the epoch-loop sketch below)
[Figure: θ∗ contained in the ball of radius R2 around y2]
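Putting the last few slides together, here is a hedged sketch of the multi-epoch loop, reusing the dual_averaging_step helper sketched above; the epoch length, step-size schedule, and default parameters are illustrative assumptions rather than the exact settings analyzed in the talk, and sample_stream is assumed to be an iterator of samples.

```python
import numpy as np

def radar_sketch(grad_fn, sample_stream, d, R1, lam1, num_epochs, epoch_len, alpha0=1.0):
    """Multi-epoch dual averaging: re-center on the epoch average, then shrink R and lambda."""
    p = 2.0 * np.log(d) / (2.0 * np.log(d) - 1.0)   # p close to 1
    y, R, lam = np.zeros(d), float(R1), float(lam1)
    for _ in range(num_epochs):
        theta, mu, avg = y.copy(), np.zeros(d), np.zeros(d)
        for t in range(1, epoch_len + 1):
            g = grad_fn(theta, next(sample_stream))  # stochastic (sub)gradient of the expected loss
            nu = np.sign(theta)                      # a subgradient of ||theta||_1
            mu += g + lam * nu
            theta = dual_averaging_step(mu, alpha0 / np.sqrt(t), p, R, center=y)
            avg += (theta - avg) / t                 # running average of the epoch iterates
        # R_{i+1}^2 = R_i^2 / 2 and lambda_{i+1} = lambda_i / sqrt(2), as on the slide above
        y, R, lam = avg, R / np.sqrt(2.0), lam / np.sqrt(2.0)
    return y
```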
Convergence rate for exact sparsity

Theorem. Suppose the expected loss is G-Lipschitz and γ-strongly convex, and that θ∗ has at most s non-zero entries. Then with probability at least 1 − 6 exp(−δ log d / 12),
    ‖θ̄_T − θ∗‖₂² ≤ c (G² + σ²(1 + δ)) s log d / (γ² T).
- Logarithmic scaling in d
- Error decays as 1/T
- Results extend to approximately sparse problems
- Similar result for the method of Juditsky and Nesterov (2011) applied with a fixed λ
Optimality of results

- Error of O(s log d / (γ² T)) after T iterations
- Stochastic gradients computed with one sample, so T iterations ≡ T samples
- Information-theoretic limit: error Ω(s log d / (γ² T)) after observing T samples, for any possible method
- We obtain the best possible error in linear time
Simulation results

- Performed simulations for sparse linear regression (a data-generation sketch appears below)
- Compared to classical benchmarks: RDA, SGD
- Evaluated several versions: RADAR, EDA, RADAR-CONST
- Results averaged over 5 random trials
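For readers who want a toy reproduction of this setup, here is a hedged sketch of synthetic data generation for sparse linear regression; the dimensions, sparsity level, signal values, and noise scale are illustrative assumptions and not necessarily the settings used in the talk's experiments.

```python
import numpy as np

def make_sparse_regression(n=2000, d=5000, s=10, noise_std=0.5, seed=0):
    """Generate (X, y, theta_star) with an s-sparse ground truth."""
    rng = np.random.default_rng(seed)
    theta_star = np.zeros(d)
    support = rng.choice(d, size=s, replace=False)
    theta_star[support] = rng.choice([-1.0, 1.0], size=s)   # +/-1 signal on the support
    X = rng.standard_normal((n, d))                          # isotropic Gaussian design
    y = X @ theta_star + noise_std * rng.standard_normal(n)
    return X, y, theta_star
```

Streaming the rows (x_i, y_i) of such a dataset through sgd_one_pass or radar_sketch above gives a small-scale analogue of the error-versus-iterations comparisons shown next.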
Simulation results

[Plots: ‖θ^t − θ∗‖₂² vs. iterations (up to 2 × 10⁴) for RADAR, SGD, and RDA; left panel d = 20000, right panel d = 40000]
Simulation results

[Plots: ‖θ^t − θ∗‖₂² vs. iterations (up to 2 × 10⁴) for RADAR, EDA, and RADAR-CONST; two panels]