Stochastic Approximation in Hilbert Spaces
Aymeric Dieuleveut, supervised by Francis Bach
September 28, 2017
Outline
- 1. Introduction:
  ◮ Supervised Machine Learning
  ◮ Stochastic Approximation
- 2. Finite dimensional results
- 3. Infinite dimensional results
- 4. Beyond quadratic loss: interpretation as a Markov chain.
Supervised Machine Learning: definition & applications
Goal: predict a phenomenon from “explanatory variables”, given a set of observations.
Bio-informatics. Input: DNA/RNA sequence; Output: disease predisposition / drug responsiveness. n: 10 to 10⁴; d (e.g., number of bases): 10⁶.
Image classification. Input: handwritten digits / images; Output: digit. n: up to 10⁹; d (e.g., number of pixels): 10⁶.
“Large scale” learning framework: both the number of examples n and the number of explanatory variables d are large.
Supervised Machine Learning: mathematical framework
Consider an input/output pair (X, Y) ∈ X × Y, with (X, Y) ∼ ρ, an unknown distribution. Y = ℝ (regression) or {−1, 1} (classification).
Goal: find g : X → ℝ such that g(X) is a good prediction for Y.
Accuracy is measured with a loss function ℓ : Y × ℝ → ℝ₊: squared loss, logistic loss, ...
Risk (generalization error): R(g) := Eρ[ℓ(Y, g(X))].
Parametric case: prediction as a linear function gθ(X) = ⟨θ, Φ(X)⟩ of features Φ(X) ∈ ℝᵈ. Notation: R(θ) := R(gθ).
Non-parametric case: prediction as a function g ∈ H, for H an infinite-dimensional space.
Empirical Risk minimization (I) - Parametric case
◮ Data: n observations (xi, yi) ∈ X × Y, i = 1, . . . , n, i.i.d.
◮ Empirical risk (or training error): R̂(θ) = (1/n) Σᵢ₌₁ⁿ ℓ(yi, ⟨θ, Φ(xi)⟩).
◮ First approach: (regularized) empirical risk minimization:
  θ̂ := argmin_{θ∈ℝᵈ} R̂(θ) + µΩ(θ)   (data-fitting term + regularizer).
Empirical Risk minimization (II) - Parametric case
◮ For example, least-squares regression:
  min_{θ∈ℝᵈ} (1/(2n)) Σᵢ₌₁ⁿ (yi − ⟨θ, Φ(xi)⟩)² + µΩ(θ),
◮ and logistic regression:
  min_{θ∈ℝᵈ} (1/n) Σᵢ₌₁ⁿ log(1 + exp(−yi ⟨θ, Φ(xi)⟩)) + µΩ(θ).
◮ Two fundamental questions: (1) computing θ̂ and (2) analyzing θ̂.
Two important insights for ML [Bottou and Bousquet, 2008]:
- 1. No need to optimize below the statistical error,
- 2. The true risk is more important than the empirical risk.
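As an illustration (mine, not from the slides), a minimal numpy sketch of the two regularized empirical objectives above, with a ridge regularizer Ω(θ) = ‖θ‖²/2 and synthetic data standing in for the observations and the feature map:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 20
Phi = rng.standard_normal((n, d))                 # feature vectors Phi(x_i) (synthetic)
theta_gen = rng.standard_normal(d)
y_reg = Phi @ theta_gen + rng.standard_normal(n)  # real-valued outputs for regression
y_clf = np.sign(y_reg)                            # labels in {-1, 1} for classification
mu = 1e-2                                         # regularization parameter

def least_squares_objective(theta):
    # (1/2n) sum_i (y_i - <theta, Phi(x_i)>)^2 + mu * ||theta||^2 / 2
    res = y_reg - Phi @ theta
    return res @ res / (2 * n) + mu * theta @ theta / 2

def logistic_objective(theta):
    # (1/n) sum_i log(1 + exp(-y_i <theta, Phi(x_i)>)) + mu * ||theta||^2 / 2
    margins = y_clf * (Phi @ theta)
    return np.mean(np.logaddexp(0.0, -margins)) + mu * theta @ theta / 2

print(least_squares_objective(np.zeros(d)), logistic_objective(np.zeros(d)))
```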
Stochastic Approximation
◮ Goal: min_{θ∈ℝᵈ} f(θ), given unbiased gradient estimates f′ₙ.
◮ θ∗ := argmin_{ℝᵈ} f(θ).
◮ Key algorithm: Stochastic Gradient Descent (SGD) [Robbins and Monro, 1951]:
  θn = θn−1 − γn f′n(θn−1).
◮ E[f′n(θn−1) | Fn−1] = f′(θn−1) for a filtration (Fn)n≥0; θn is Fn-measurable.
(Figure: the iterates θ0, θ1, ..., θn progress towards the minimizer θ∗.)
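A minimal sketch (my own, not from the slides) of the Robbins–Monro recursion on a stochastic quadratic objective; the decaying schedule γn = γ0/√n is one common choice, assumed here only for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 5
Sigma = np.diag(np.linspace(1.0, 0.1, d))   # Hessian of f(θ) = ½ θᵀΣθ, minimized at θ* = 0
theta_star = np.zeros(d)

def stochastic_gradient(theta):
    # unbiased estimate f'_n(θ) = f'(θ) + zero-mean noise
    return Sigma @ (theta - theta_star) + rng.standard_normal(d)

theta = np.ones(d)                          # θ0
gamma0 = 0.5
for n in range(1, 10_000 + 1):
    gamma_n = gamma0 / np.sqrt(n)           # step size γn
    theta = theta - gamma_n * stochastic_gradient(theta)   # θn = θn−1 − γn f'_n(θn−1)

print("distance of the final iterate to θ*:", np.linalg.norm(theta - theta_star))
```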
Polyak–Ruppert averaging
Introduced by Polyak and Juditsky [1992] and Ruppert [1988]:
  θ̄n = (1/(n + 1)) Σ_{k=0}^{n} θk.
◮ Offline averaging reduces the noise effect.
(Figure: the raw iterates θ0, θ1, ..., θn oscillate around θ∗ while the averaged iterates θ̄1, θ̄2, ..., θ̄n concentrate near it.)
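Continuing in the same spirit (again my own sketch with synthetic data), the averaged iterate θ̄n can be maintained online at essentially no extra cost:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 5
Sigma = np.diag(np.linspace(1.0, 0.1, d))   # quadratic objective ½ θᵀΣθ, minimizer θ* = 0
theta = np.ones(d)                          # θ0
theta_bar = theta.copy()                    # θ̄0 = θ0
gamma = 0.05                                # constant step size, for illustration
for n in range(1, 10_000 + 1):
    grad = Sigma @ theta + rng.standard_normal(d)    # unbiased stochastic gradient
    theta = theta - gamma * grad                     # SGD iterate θn
    theta_bar += (theta - theta_bar) / (n + 1)       # θ̄n = θ̄n−1 + (θn − θ̄n−1)/(n+1)

print("last iterate:", np.linalg.norm(theta), "  averaged iterate:", np.linalg.norm(theta_bar))
```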
Stochastic Approximation (SA) in Machine Learning
Loss for a single pair of observations, for any k ≤ n: fk(θ) = ℓ(yk, ⟨θ, Φ(xk)⟩).
SA for the true risk R(θ) = E[ℓ(yk, ⟨θ, Φ(xk)⟩)]:
◮ For 0 ≤ k ≤ n, Fk = σ((xi, yi)1≤i≤k).
◮ At step 0 < k ≤ n, use a new point, independent of θk−1: f′k(θk−1) = ℓ′(yk, ⟨θk−1, Φ(xk)⟩) Φ(xk), so that E[f′k(θk−1) | Fk−1] = R′(θk−1).
Single pass through the data – “automatic” regularization. Central algorithm in the thesis.
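A sketch (mine, with a synthetic stream and the identity feature map Φ(x) = x) of this single-pass scheme for the logistic loss; every observation is used exactly once, so each stochastic gradient is an unbiased estimate of R′:

```python
import numpy as np

rng = np.random.default_rng(3)
d, n = 10, 100_000
theta_star = rng.standard_normal(d) / np.sqrt(d)

def sample():
    # one fresh observation (x_k, y_k) ~ ρ, here from a well-specified logistic model
    x = rng.standard_normal(d)
    y = 1.0 if rng.random() < 1.0 / (1.0 + np.exp(-x @ theta_star)) else -1.0
    return x, y

theta, theta_bar = np.zeros(d), np.zeros(d)
gamma = 0.1                                       # constant step size, for illustration
for k in range(1, n + 1):
    x, y = sample()                               # new point, independent of θ_{k−1}
    grad = -y * x / (1.0 + np.exp(y * (x @ theta)))   # gradient of log(1 + exp(−y⟨θ, x⟩))
    theta = theta - gamma * grad
    theta_bar += (theta - theta_bar) / (k + 1)    # Polyak–Ruppert average

print("distance of the averaged iterate to θ*:", np.linalg.norm(theta_bar - theta_star))
```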
Outline: bibliography
a) Non-parametric Stochastic Approximation with Large Step-sizes, A. Dieuleveut and F. Bach, in the Annals of Statistics.
b) Harder, Better, Faster, Stronger Convergence Rates for Least-squares Regression, A. Dieuleveut, N. Flammarion and F. Bach, in the Journal of Machine Learning Research.
c) Bridging the Gap between Constant Step Size Stochastic Gradient Descent and Markov Chains, A. Dieuleveut, A. Durmus and F. Bach, under submission.
These contributions fall along two axes (quadratic vs. smooth loss; finite-dimensional (FD) vs. non-parametric setting) and map onto the parts of the talk:
                    Quadratic loss    Smooth loss
  FD                b) – Part 1       c) – Part 3
  Non-parametric    a) – Part 2
Outline
- 1. Introduction.
- 2. A warm up! Results in finite dimension (d ≫ n)
  ◮ Averaged stochastic gradient descent: adaptivity
  ◮ Acceleration: two optimal rates
- 3. Non-parametric stochastic approximation
- 4. Stochastic approximation as a Markov chain: extension to non-quadratic loss functions.
Behavior of Stochastic Approximation in high dimension
Least-squares regression in finite dimension: R(θ) = Eρ[(⟨θ, Φ(X)⟩ − Y)²].
Let Σ = E[Φ(X)Φ(X)⊤] ∈ ℝ^{d×d}: for θ∗ the best linear predictor, R(θ) − R(θ∗) = ‖Σ^{1/2}(θ − θ∗)‖².
Let R² := E[‖Φ(X)‖²] and σ² := E[(Y − ⟨θ∗, Φ(X)⟩)²].
Consider stochastic gradient descent (a.k.a. Least-Mean-Squares) with averaging.
Theorem
For any γ ≤ 1/(4R²), any α > 1, any r ≥ 0 and any n ∈ ℕ,
  E[R(θ̄n)] − R(θ∗) ≤ 4σ²γ^{1/α} tr(Σ^{1/α}) / n^{1−1/α} + 4 ‖Σ^{1/2−r}(θ∗ − θ0)‖² / (γ^{2r} n^{min(2r,2)}).
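A sketch (my own, on synthetic Gaussian data with Σ = I) of averaged SGD / LMS with the constant step size γ = 1/(4R²) allowed by the theorem:

```python
import numpy as np

rng = np.random.default_rng(4)
d, n = 20, 50_000
theta_star = rng.standard_normal(d)
sigma_noise = 0.5

R2 = d                       # E‖Φ(X)‖² = d for Φ(X) = X ~ N(0, I_d)
gamma = 1.0 / (4 * R2)       # constant step size from the theorem

theta, theta_bar = np.zeros(d), np.zeros(d)
for k in range(1, n + 1):
    x = rng.standard_normal(d)
    y = x @ theta_star + sigma_noise * rng.standard_normal()
    theta = theta - gamma * (x @ theta - y) * x      # LMS recursion
    theta_bar += (theta - theta_bar) / (k + 1)       # averaging

# excess risk R(θ̄n) − R(θ*) = ‖Σ^{1/2}(θ̄n − θ*)‖², here Σ = I_d
print("excess risk of the averaged iterate:", np.sum((theta_bar - theta_star) ** 2))
```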
Theorem 1†, consequences
Theorem
For any γ ≤ 1/(4R²), any α > 1, any r ≥ 0 and any n ∈ ℕ,
  E[R(θ̄n)] − R(θ∗) ≤ 4σ²γ^{1/α} tr(Σ^{1/α}) / n^{1−1/α}   [Variance]   +   4 ‖Σ^{1/2−r}(θ∗ − θ0)‖² / (γ^{2r} n^{min(2r,2)})   [Bias].
Special cases:
  Variance term: γσ² tr(Σ) for α = 1; σ²d / n for α → ∞.
  Bias term: ‖θ∗ − θ0‖² / (γn) for r = 1/2; ‖Σ^{−1/2}(θ∗ − θ0)‖² / (γ²n²) for r = 1.
- Recovers Bach and Moulines [2013], and improves the asymptotic bias term.
†Dieuleveut and Bach [2015].
Theorem 1, consequences
Theorem
For any γ ≤ 1/(4R²) and any n ∈ ℕ,
  E[R(θ̄n)] − R(θ∗) ≤ inf_{α>1, r≥0} { 4σ²γ^{1/α} tr(Σ^{1/α}) / n^{1−1/α}   [Variance]   +   4 ‖Σ^{1/2−r}(θ∗ − θ0)‖² / (γ^{2r} n^{min(2r,2)})   [Bias] }.
Adaptivity: the bound holds for the best choice of α and r.
(Figure: upper bound γ^{1/α} tr(Σ^{1/α}) / n^{1−1/α} on the variance term as a function of α, in the regime d ≫ n.)
Limits to SA performance: two lower bounds
Stochastic Approximation in Supervised ML:
◮ builds an estimator given n observations → statistical lower bound: σ²d / n;
◮ approximates the minimum of an (L-smooth) function in n iterations, using first-order information (here the number of iterations t equals n) → optimization lower bound: L ‖θ0 − θ∗‖² / n².
Theorem 1, for averaged SGD, gives as upper bound:
  σ²d / n + min{ L ‖θ0 − θ∗‖² / n ; L² ‖Σ^{−1/2}(θ0 − θ∗)‖² / n² }.
Acceleration†
The optimal rate (for deterministic optimization) is achieved by accelerated gradient descent:
  θn = ηn−1 − γn f′(ηn−1),
  ηn = θn + δn(θn − θn−1).
Problem: acceleration is sensitive to noise [d’Aspremont, 2008].
Combining SGD, acceleration and averaging,
◮ using extra regularization,
◮ and for an “additive” noise model only,
we achieve both of the optimal rates.
Caveat: the LMS recursion does not provide an additive-noise oracle; a different recursion, with Σ known, is used.
†Dieuleveut, Flammarion, Bach [2016]
Acceleration and averaging
More precisely, we consider:
  θn = νn−1 − γ R′n(νn−1) − γλ(νn−1 − θ0),
  νn = θn + δ (θn − θn−1).
Theorem
For any γ ≤ 1/(2R²), for δ = 1 and λ = 0,
  E[R(θ̄n)] − R(θ∗) ≤ 8σ²d / (n + 1) + 36 ‖θ0 − θ∗‖² / (γ(n + 1)²).
This is the optimal rate from both the statistical and the optimization points of view.
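A sketch (mine, on the same synthetic least-squares stream, with Σ = I assumed known so that an additive-noise gradient oracle Σν − yₖxₖ can be formed, as the caveat above requires) of the accelerated and averaged recursion with δ = 1 and λ = 0:

```python
import numpy as np

rng = np.random.default_rng(5)
d, n = 20, 50_000
theta_star = rng.standard_normal(d)

Sigma = np.eye(d)            # feature covariance, assumed known (x ~ N(0, I_d))
R2 = d
gamma = 1.0 / (2 * R2)       # step size from the theorem
delta, lam = 1.0, 0.0        # momentum and regularization parameters from the theorem
theta0 = np.zeros(d)

theta, nu, theta_bar = theta0.copy(), theta0.copy(), theta0.copy()
for k in range(1, n + 1):
    x = rng.standard_normal(d)
    y = x @ theta_star + 0.5 * rng.standard_normal()
    grad = Sigma @ nu - y * x                           # additive-noise oracle: E[grad] = Σ(ν − θ*)
    theta_new = nu - gamma * grad - gamma * lam * (nu - theta0)
    nu = theta_new + delta * (theta_new - theta)        # momentum step
    theta = theta_new
    theta_bar += (theta - theta_bar) / (k + 1)          # averaging

print("excess risk of the averaged iterate:", np.sum((theta_bar - theta_star) ** 2))
```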
Outline
- 1. Introduction.
- 2. A warm up! Results in finite dimension (d ≫ n)
- 3. Non-parametric stochastic approximation
  ◮ Averaged stochastic gradient descent: statistical rate of convergence
  ◮ Acceleration: improving convergence in ill-conditioned regimes
- 4. Stochastic approximation as a Markov chain: extension to non-quadratic loss functions.
Non-parametric Random Design Least Squares Regression
Goal: min_g R(g) = Eρ[(Y − g(X))²], where
◮ ρX is the marginal distribution of X on X,
◮ L²_ρX is the set of square-integrable functions w.r.t. ρX.
The Bayes predictor minimizes the quadratic risk over L²_ρX: gρ(X) = E[Y | X].
Moreover, for any function g in L²_ρX, the excess risk is R(g) − R(gρ) = ‖g − gρ‖²_{L²_ρX}.
For H a space of functions, there exists gH in the closure of H in L²_ρX such that R(gH) = inf_{g∈H} R(g).
Reproducing Kernel Hilbert Space
Definition
A Reproducing Kernel Hilbert Space (RKHS) H is a space of functions from X into ℝ such that there exists a reproducing kernel K : X × X → ℝ satisfying:
◮ for any x ∈ X, H contains the function Kx defined by Kx : X → ℝ, z ↦ K(x, z);
◮ for any x ∈ X and f ∈ H, the reproducing property holds: ⟨Kx, f⟩_H = f(x).
Why are RKHS so nice?
◮ Computation:
  ◮ linear spaces of functions;
  ◮ existence of gradients (Hilbert structure);
  ◮ inner products can be computed thanks to the reproducing property;
  ◮ one only deals with functions in span{Kxi, i = 1, . . . , n} (representer theorem).
  The algebraic framework is preserved!
◮ Approximation: many kernels satisfy H̄^{L²_ρX} = L²_ρX, so there is no approximation error!
◮ Representation: the feature map X → H, x ↦ Kx, maps points from any set into a linear space, where a linear method can be applied.
Stochastic approximation in the RKHS
As R(g) = E[(⟨g, KX⟩_H − Y)²], for each pair of observations (xn, yn), (⟨g, Kxn⟩_H − yn) Kxn = (g(xn) − yn) Kxn is an unbiased stochastic gradient of R at g.
Consider the stochastic gradient recursion, starting from g0 ∈ H:
  gn = gn−1 − γ (⟨gn−1, Kxn⟩_H − yn) Kxn,
where γ is the step-size.
Thus gn = Σᵢ₌₁ⁿ ai Kxi, with an = −γ (gn−1(xn) − yn) for n ≥ 1.
With averaging, ḡn = (1/(n + 1)) Σ_{k=0}^{n} gk.
Total complexity: O(n²).
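A sketch (my own, with a Gaussian kernel and one-dimensional synthetic data) of this kernel SGD recursion, storing the coefficients ai; evaluating gn−1(xn) against all past points is what makes the total cost O(n²):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 2000
gamma = 0.25                                    # step size ≤ 1/(4 sup_x K(x, x)) here

def kernel(x, z):
    # Gaussian kernel, one possible choice of reproducing kernel
    return np.exp(-((x - z) ** 2) / (2 * 0.1 ** 2))

g_target = lambda t: np.sin(2 * np.pi * t)      # regression function E[Y | X = t]

xs = np.empty(n)                                # observed inputs x_1, ..., x_n
a = np.empty(n)                                 # coefficients: g_n = Σ_i a_i K_{x_i}
for k in range(n):
    x = rng.random()
    y = g_target(x) + 0.1 * rng.standard_normal()
    g_prev_at_x = a[:k] @ kernel(xs[:k], x)     # g_{k−1}(x_k): O(k) kernel evaluations
    xs[k], a[k] = x, -gamma * (g_prev_at_x - y) # new coefficient a_k

grid = np.linspace(0.0, 1.0, 5)                 # evaluate the (unaveraged) estimator g_n
g_n = np.array([a @ kernel(xs, t) for t in grid])
print(np.round(g_n, 2), np.round(g_target(grid), 2))
```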
Kernel regression: Analysis
Assume E[K(X, X)] and E[Y²] are finite. Define the covariance operator Σ = E[KX ⊗ KX].
We make two assumptions:
◮ Capacity condition: eigenvalue decay of Σ.
◮ Source condition: position of gH w.r.t. the kernel space H.
Σ is a trace-class operator that can be decomposed over its eigenspaces; its powers Στ, τ > 0, are thus well defined.
Capacity condition (CC)
CC(α): for some α > 1, we assume that tr(Σ^{1/α}) < ∞.
If we denote by (µi)i∈I the sequence of non-zero eigenvalues of the operator Σ, in decreasing order, then µi = O(i^{−α}).
(Figure: eigenvalue decay of the covariance operator, log10(µi) against log10(i). Left: first-order Sobolev (min) kernel with ρX = U[0, 1] → CC(α = 2). Right: Gaussian kernel with ρX = U[−1, 1] → CC(α) for all α ≥ 1.)
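A sketch (mine) that checks this decay empirically: the eigenvalues of the normalized Gram matrix K/n approximate the leading eigenvalues of Σ, so their log–log slope should be close to −α (about −2 for the min kernel with ρX = U[0, 1]):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 2000
x = rng.random(n)                           # X ~ U[0, 1]
K = np.minimum.outer(x, x)                  # min kernel (first-order Sobolev kernel)
eigvals = np.linalg.eigvalsh(K / n)[::-1]   # eigenvalues of K/n ≈ leading eigenvalues of Σ

i = np.arange(1, 51)
slope = np.polyfit(np.log(i), np.log(eigvals[:50]), 1)[0]
print("log-log slope of the 50 largest eigenvalues:", round(slope, 2))  # expect ≈ −2
```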
Source condition (SC)
Concerning the optimal function gH, we assume SC(r): for some r > 0, gH ∈ Σʳ(L²_ρX), i.e., ‖Σ^{−r} gH‖_{L²_ρX} < ∞.
(Figure: the position of gH relative to H for r < 0.5, r = 0.5 and r > 0.5; r ≥ 0.5 corresponds to gH ∈ H.)
NPSA with large step sizes
Theorem
Assume CC(α) and SC(r). Then for any γ ≤ 1/(4R²),
  E[R(ḡn)] − R(gH) ≤ 4σ²γ^{1/α} tr(Σ^{1/α}) / n^{1−1/α} + 4 ‖Σ^{−r}(gH − g0)‖²_{L²_ρX} / (γ^{2r} n^{min(2r,2)}).
For γ = γ0 n^{(−2αr−1+α)/(2αr+1)} and (α − 1)/(2α) ≤ r ≤ 1,
  E[R(ḡn)] − R(gH) ≤ n^{−2αr/(2αr+1)} ( 4σ² tr(Σ^{1/α}) + 4 ‖Σ^{−r}(gH − g0)‖²_{L²_ρX} ).
◮ Statistically optimal rate [Caponnetto and De Vito, 2007].
◮ Beyond: online, minimal assumptions, ...
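As a worked illustration (mine, combining the corollary with the min-kernel example from the capacity-condition slide): take α = 2 and r = 1/2, i.e., gH ∈ H.

```latex
% alpha = 2 (min kernel on U[0,1]), r = 1/2 (g_H in H):
\gamma \;=\; \gamma_0\, n^{\frac{-2\alpha r - 1 + \alpha}{2\alpha r + 1}}
       \;=\; \gamma_0\, n^{\frac{-2 - 1 + 2}{3}}
       \;=\; \gamma_0\, n^{-1/3},
\qquad
\mathbb{E}\, R(\bar g_n) - R(g_H) \;=\; O\!\big(n^{-\frac{2\alpha r}{2\alpha r + 1}}\big)
       \;=\; O\!\big(n^{-2/3}\big).
```

At the other end of the admissible range, r = (α − 1)/(2α) makes the exponent of n in the prescribed step size vanish, i.e., a constant step size is then the optimal choice.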
Optimality regions
(Figure: regions in the (α, r) plane, for α between 1 and 5, delimited by the lines r = 1/2, r = 1 and r = (α − 1)/(2α), together with the saturation region and the region where the bias dominates the variance (B > V); regimes r = 0.5, r > 0.5, r < 0.5 and r ≪ 0.5.)
The optimal rate in an RKHS can be achieved via a large step size and averaging in many situations.
Acceleration: Reproducing kernel Hilbert space setting
We consider the RKHS setting presented before.
Theorem
Assume CC(α) and SC(r). Then for γ = γ0 n^{−(4rα+2−α)/(2rα+1)}, λ = 1/(γn²) and r ≥ (α − 2)/(2α),
  E[R(ḡn)] − R(gH) ≤ C_{θ0,θ∗,Σ} n^{−2αr/(2αr+1)}.
(Figure: the optimality region in the (α, r) plane now extends down to the line r = (α − 2)/(2α), below the line r = (α − 1)/(2α) of the averaged method.)
Least squares: some conclusions
◮ Optimal rates of convergence for non-parametric regression in Hilbert spaces, under two assumptions, using large step sizes and averaging.
◮ Sheds some light on the finite-dimensional case.
◮ It is possible to attain simultaneously the optimal rates from the statistical and the optimization points of view.
Outline
- 1. Introduction.
- 2. Non-parametric stochastic approximation
- 3. Faster rates with acceleration
- 4. Stochastic approximation as a Markov chain: extension to non-quadratic loss functions.
  ◮ Motivation
  ◮ Assumptions
  ◮ Convergence in Wasserstein distance.
Motivation 1/2: Large step sizes!
(Figure: log10(R(θ̄n) − R(θ∗)) against log10(n) for logistic regression; final iterate (dashed) and averaged recursion (plain).)
Motivation 2/2: Difference between quadratic and logistic loss
With γ = 1/(4R²):
  Logistic regression:        E[R(θ̄n)] − R(θ∗) = O(γ²)
  Least-squares regression:   E[R(θ̄n)] − R(θ∗) = O(1/n)
SGD: a homogeneous Markov chain
Consider an L-smooth and µ-strongly convex function R.
SGD with a constant step-size γ > 0 is a homogeneous Markov chain:
  θ^γ_{k+1} = θ^γ_k − γ ( R′(θ^γ_k) + ε_{k+1}(θ^γ_k) ),
◮ it satisfies the Markov property,
◮ it is homogeneous, for γ constant and (εk)k∈ℕ i.i.d.
We also assume:
◮ R′_k = R′ + ε_{k+1} is almost surely L-co-coercive,
◮ bounded moments: E[‖εk(θ∗)‖⁴] < ∞.
Stochastic gradient descent as a Markov chain: analysis framework†
◮ Existence of a limit distribution πγ, and linear convergence to this distribution: θ^γ_n → πγ in distribution.
◮ Convergence of the second-order moments of the chain: θ̄n,γ → θ̄γ := Eπγ[θ] in L² as n → ∞.
◮ Behavior under the limit distribution (γ → 0): θ̄γ = θ∗ + ?
◮ Provable convergence improvement with extrapolation tricks.
†Dieuleveut, Durmus, Bach [2017].
Existence of a limit distribution
Goal: (θ^γ_n)n≥0 → πγ in distribution.
Theorem
For any γ < 1/L, the chain (θ^γ_n)n≥0 admits a unique stationary distribution πγ. In addition, for all θ0 ∈ ℝᵈ and n ∈ ℕ:
  W₂²(θ^γ_n, πγ) ≤ (1 − 2µγ(1 − γL))ⁿ ∫_{ℝᵈ} ‖θ0 − ϑ‖² dπγ(ϑ).
The Wasserstein metric W₂ is a distance between probability measures.
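For instance (a worked special case, not on the slide), with γ = 1/(2L) one has 2µγ(1 − γL) = µ/(2L), so the chain contracts geometrically towards πγ in the Wasserstein distance:

```latex
W_2^2\big(\theta^\gamma_n, \pi_\gamma\big)
  \;\le\; \Big(1 - \tfrac{\mu}{2L}\Big)^{\!n}
          \int_{\mathbb{R}^d} \|\theta_0 - \vartheta\|^2 \, \mathrm{d}\pi_\gamma(\vartheta),
\qquad \gamma = \tfrac{1}{2L}.
```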
Behavior under the limit distribution
Ergodic theorem: θ̄n → Eπγ[θ] =: θ̄γ. Where is θ̄γ?
If θ0 ∼ πγ, then θ1 ∼ πγ; since θ^γ_1 = θ^γ_0 − γ (R′(θ^γ_0) + ε1(θ^γ_0)), taking expectations gives Eπγ[R′(θ)] = 0.
In the quadratic case (linear gradients), this yields Σ Eπγ[θ − θ∗] = 0, i.e., θ̄γ = θ∗!
Constant learning rate SGD: convergence in the quadratic case
(Figure: the iterates θ0, θ1, ..., θn oscillate around θ∗, while the averaged iterates θ̄1, θ̄2, ..., θ̄n converge to θ∗.)
Behavior under the limit distribution (non-quadratic case)
In the general case, using Eπγ[‖θ − θ∗‖⁴] ≤ Cγ², a Taylor expansion of R′ around θ∗ and the same reasoning on the higher moments of the chain lead to
  θ̄γ − θ∗ = γ R″(θ∗)⁻¹ R‴(θ∗) ( R″(θ∗) ⊗ I + I ⊗ R″(θ∗) )⁻¹ Eπγ[ε(θ)⊗²] + O(γ²).
Overall, θ̄γ − θ∗ = γ∆ + O(γ²).
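A sketch of the first step of this expansion (my paraphrase of the argument, constants and remainders handled loosely): stationarity makes the mean gradient vanish, and a second-order Taylor expansion of R′ around θ∗ then isolates the leading term, because the second moment of θ − θ∗ under πγ is itself O(γ).

```latex
0 \;=\; \mathbb{E}_{\pi_\gamma}\!\big[R'(\theta)\big]
  \;=\; R''(\theta_*)\,\mathbb{E}_{\pi_\gamma}[\theta - \theta_*]
  \;+\; \tfrac{1}{2}\, R'''(\theta_*)\,\mathbb{E}_{\pi_\gamma}\!\big[(\theta - \theta_*)^{\otimes 2}\big]
  \;+\; O\!\big(\mathbb{E}_{\pi_\gamma}\|\theta - \theta_*\|^{3}\big),
\qquad
\mathbb{E}_{\pi_\gamma}\!\big[(\theta - \theta_*)^{\otimes 2}\big] \;=\; O(\gamma).
```

Solving the display for Eπγ[θ − θ∗] = θ̄γ − θ∗ shows that the drift is of order γ, which is the γ∆ term above.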
Constant learning rate SGD: convergence in the non-quadratic case
(Figure: the iterates θ0, θ1, ..., θn and the averaged iterates θ̄1, θ̄2, ..., θ̄n; in the non-quadratic case the averages converge to θ̄γ, which differs from θ∗ in general.)
Richardson extrapolation
(Figure: three scales around θ∗: the iterates satisfy θ^γ_n − θ̄γ = Op(γ^{1/2}), the averaged iterates satisfy θ̄n,γ − θ̄γ = Op(n^{−1/2}), and the limit itself satisfies θ∗ − θ̄γ = O(γ).)
Recovering convergence closer to θ∗ by Richardson extrapolation: 2 θ̄n,γ − θ̄n,2γ.
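A sketch (mine, on a synthetic logistic-regression stream) of Richardson extrapolation: run averaged constant-step SGD on the same sample stream with step sizes γ and 2γ, then combine the two averaged iterates; since θ̄γ ≈ θ∗ + γ∆, the combination 2θ̄n,γ − θ̄n,2γ cancels the first-order bias.

```python
import numpy as np

d, n = 10, 200_000
theta_star = np.random.default_rng(0).standard_normal(d) / np.sqrt(d)

def averaged_sgd(gamma, seed=1):
    # averaged constant-step SGD for logistic regression; same seed = same sample stream
    rng = np.random.default_rng(seed)
    theta, theta_bar = np.zeros(d), np.zeros(d)
    for k in range(1, n + 1):
        x = rng.standard_normal(d)
        y = 1.0 if rng.random() < 1.0 / (1.0 + np.exp(-x @ theta_star)) else -1.0
        grad = -y * x / (1.0 + np.exp(y * (x @ theta)))   # logistic-loss gradient
        theta = theta - gamma * grad
        theta_bar += (theta - theta_bar) / (k + 1)
    return theta_bar

gamma = 0.1
bar_g = averaged_sgd(gamma)
bar_2g = averaged_sgd(2 * gamma)
richardson = 2 * bar_g - bar_2g                 # 2 θ̄_{n,γ} − θ̄_{n,2γ}

for name, est in [("step g", bar_g), ("step 2g", bar_2g), ("Richardson", richardson)]:
    print(name, np.linalg.norm(est - theta_star))
```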
Experiments: smaller dimension
(Figure: log10(R(θ) − R(θ∗)) against log10(n); synthetic data, logistic regression, n = 8·10⁶.)
Experiments: Double Richardson
(Figure: log10(R(θ) − R(θ∗)) against log10(n); synthetic data, logistic regression, n = 8·10⁶.)
“Richardson 3γ”: estimator built using Richardson extrapolation on 3 different sequences: θ̃n = (8/3) θ̄n,γ − 2 θ̄n,2γ + (1/3) θ̄n,4γ.
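A quick check (mine) that these coefficients cancel both the γ and the γ² terms, assuming the bias expansion extends one order further, θ̄γ = θ∗ + γ∆ + γ²∆′ + O(γ³):

```latex
\tfrac{8}{3} - 2 + \tfrac{1}{3} \;=\; 1,
\qquad
\tfrac{8}{3}\,\gamma \;-\; 2\,(2\gamma) \;+\; \tfrac{1}{3}\,(4\gamma) \;=\; 0,
\qquad
\tfrac{8}{3}\,\gamma^{2} \;-\; 2\,(2\gamma)^{2} \;+\; \tfrac{1}{3}\,(4\gamma)^{2} \;=\; 0,
```

so θ̃n targets θ∗ with a bias of order γ³ only, at the price of the statistical error of the three averaged sequences.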
Conclusion: the Markov-chain viewpoint
Take-home messages:
◮ Precise description of the convergence in terms of the Wasserstein distance.
◮ Decomposition into three sources of error: variance, initial conditions, and “drift”.
◮ Detailed analysis of the position of the limit point: the direction does not depend on γ at first order.
◮ Extrapolation tricks can help.
◮ Beyond: new error decomposition (link with diffusions), ...
Open directions
◮ Markov chains beyond strong convexity.
◮ Adaptivity for non-parametric regression.
◮ Complexity of non-parametric regression: stochastic gradient descent and random features.
◮ Density estimation.
References
- F. Bach and E. Moulines. Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). Advances in Neural Information Processing Systems (NIPS), 2013.
- L. Bottou and O. Bousquet. The tradeoffs of large scale learning. Advances in Neural Information Processing Systems (NIPS), 2008.
- A. Caponnetto and E. De Vito. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331–368, 2007.
- A. d’Aspremont. Smooth optimization with approximate gradient. SIAM Journal on Optimization, 19(3):1171–1183, 2008.
- A. Dieuleveut and F. Bach. Non-parametric stochastic approximation with large step sizes. Annals of Statistics, 2015.
- S. Lacoste-Julien, M. Schmidt, and F. Bach. A simpler approach to obtaining an O(1/t) rate for the stochastic projected subgradient method. arXiv preprint 1212.2002, 2012.
- B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855, 1992.
- H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407, 1951.
- D. Ruppert. Efficient estimations from a slowly convergent Robbins–Monro process. Technical report, Cornell University Operations Research and Industrial Engineering, 1988.
- P. Tarrès and Y. Yao. Online learning as stochastic approximation of regularization paths. IEEE Transactions on Information Theory, (99):5716–5735, 2011.
- Y. Ying and M. Pontil. Online gradient descent learning algorithms. Foundations of Computational Mathematics, 2008.