Adaptive primal-dual stochastic gradient methods




  1. Adaptive primal-dual stochastic gradient methods. Yangyang Xu, Mathematical Sciences, Rensselaer Polytechnic Institute. October 26, 2019

  2. Stochastic gradient method
     stochastic program: $\min_{x \in X} f(x) = \mathbb{E}_{\xi}\big[F(x; \xi)\big]$
     • if $\xi$ is uniform on $\{\xi_1, \ldots, \xi_N\}$, then $f(x) = \frac{1}{N}\sum_{i=1}^{N} F(x; \xi_i)$
     • stochastic gradient update (requires samples of $\xi$): $x^{k+1} = \mathrm{Proj}_X\big(x^k - \alpha_k g^k\big)$, where $g^k$ is a stochastic approximation of $\nabla f(x^k)$
     • low per-update complexity compared to deterministic gradient descent
     • Literature: tons of works (e.g., [Robbins-Monro'51, Polyak-Juditsky'92, Nemirovski et al.'09, Ghadimi-Lan'13, Davis et al.'18])
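For concreteness, here is a minimal sketch of this projected stochastic gradient update. The finite-sum least-squares objective, the unit-ball projection, and the constant step sizes are illustrative assumptions, not part of the slides.

```python
import numpy as np

def projected_sgd(grad_sample, proj, x0, alphas):
    """Projected SGD sketch: x_{k+1} = Proj_X(x_k - alpha_k * g_k)."""
    x = x0
    for alpha in alphas:
        g = grad_sample(x)           # g_k: stochastic approximation of grad f(x_k)
        x = proj(x - alpha * g)      # gradient step, then projection onto X
    return x

# Toy usage: f(x) = (1/N) sum_i (a_i^T x - b_i)^2 with X the unit Euclidean ball.
rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 5))
b = rng.standard_normal(1000)

def grad_sample(x):
    i = rng.integers(A.shape[0])     # one uniformly sampled term F(x; xi_i)
    return 2.0 * (A[i] @ x - b[i]) * A[i]

proj_ball = lambda x: x / max(1.0, np.linalg.norm(x))
x_hat = projected_sgd(grad_sample, proj_ball, np.zeros(5), [1e-2] * 5000)
```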

  3. Adaptive learning
     • adaptive gradient (AdaGrad) [Duchi-Hazan-Singer'11]: $x^{k+1} = \mathrm{Proj}_X^{v^k}\big(x^k - \alpha_k \, g^k \oslash \sqrt{v^k}\big)$, where $v^k = \sum_{t=0}^{k} (g^t)^2$ (elementwise)
     • many other adaptive variants: Adam [Kingma-Ba'14], AMSGrad [Reddi-Kale-Kumar'18], and so on
     • extremely popular in training deep neural networks
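A minimal sketch of one AdaGrad step follows, with one stated simplification: it uses a plain Euclidean projection in place of the $v$-weighted projection $\mathrm{Proj}_X^{v^k}$ on the slide, and `eps` is an assumed numerical safeguard.

```python
import numpy as np

def adagrad_step(x, g, v, alpha, proj, eps=1e-8):
    """One AdaGrad update: accumulate squared gradients elementwise in v,
    then scale each coordinate of the step by 1/sqrt(v)."""
    v = v + g**2                                   # v_k = sum_{t<=k} (g_t)^2
    x = proj(x - alpha * g / (np.sqrt(v) + eps))   # x - alpha * (g "/" sqrt(v_k))
    return x, v
```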

  4. Adaptiveness improves convergence speed
     [Figure: objective value versus passes of data for AdaGrad, Adam, and tuned SGD]
     • test on training a neural network with one hidden layer
     Observation: the adaptive methods are much faster, and all methods have similar per-update cost

  5. Take a close look: $x^{k+1} = \mathrm{Proj}_X^{v^k}\big(x^k - \alpha_k \, g^k \oslash \sqrt{v^k}\big)$
     • $\mathrm{Proj}_X^{v^k}$ is assumed simple (holds if $X$ is simple)
     • not (easily) implementable if $X$ is complicated
     This talk: adaptive primal-dual stochastic gradient for problems with complicated constraints

  6. Outline
     1. Problem formulation and motivating examples
     2. Review of existing methods
     3. Proposed primal-dual stochastic gradient method
     4. Numerical results, convergence results, and conclusions

  7. Functional constrained stochastic program
     $$\min_{x \in X} \; f_0(x) = \mathbb{E}_{\xi_0}\big[F_0(x; \xi_0)\big], \quad \text{s.t.} \quad f_j(x) = \mathbb{E}_{\xi_j}\big[F_j(x; \xi_j)\big] \le 0, \; j = 1, \ldots, m \tag{P}$$
     • $X$ is a simple closed convex set (but the feasible set is complicated)
     • each $f_j$ is convex and possibly nondifferentiable
     • $m$ could be very big: expensive to access all $f_j$'s at every update
     Goal: design an efficient stochastic method, free of complicated projections, that guarantees (near) optimality and feasibility

  8. Example I: linear programming for a Markov decision process
     discounted Markov decision process $(\mathcal{S}, \mathcal{A}, P, r, \gamma)$:
     • state space $\mathcal{S} = \{s_1, \ldots, s_m\}$, action space $\mathcal{A} = \{a_1, \ldots, a_n\}$
     • transition probability $P = [P_a(s, s')]$, reward $r = [r_a(s, s')]$
     • discount factor $\gamma \in (0, 1]$
     Bellman optimality equation: $v(s) = \max_{a \in \mathcal{A}} \sum_{s' \in \mathcal{S}} P_a(s, s')\big[r_a(s, s') + \gamma v(s')\big]$ for all $s \in \mathcal{S}$,
     equivalent to the linear program [Puterman'14]: $\min_v e^\top v$, s.t. $(I - \gamma P_a)v - r_a \ge 0$ for all $a \in \mathcal{A}$
     • $r_a(s) = \sum_{s' \in \mathcal{S}} P_a(s, s') r_a(s, s')$
     • huge number of constraints if $m$ and/or $n$ is big
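As a sketch of how this LP's constraint data could be assembled (the list-of-matrices input format is an assumption, not the slides' notation):

```python
import numpy as np

def mdp_lp_data(P, r, gamma):
    """Assemble LP data for  min_v e^T v  s.t.  (I - gamma*P_a) v - r_a >= 0.

    P is a list of m-by-m transition matrices (one per action a); r is the
    matching list of m-by-m reward matrices. r_a(s) = sum_{s'} P_a(s,s') r_a(s,s').
    Returns (A, c) so that the constraints read  A v >= c.
    """
    m = P[0].shape[0]
    A = np.vstack([np.eye(m) - gamma * Pa for Pa in P])       # one block per action
    c = np.concatenate([(Pa * ra).sum(axis=1) for Pa, ra in zip(P, r)])
    return A, c
```

With $m$ states and $n$ actions, the stacked system has $mn$ rows, which is exactly the "huge number of constraints" the slide points to.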

  9. Example II: robust optimization by sampling
     robust optimization: $\min_{x \in X} f_0(x)$, s.t. $g(x; \xi) \le 0$ for all $\xi \in \Xi$
     sampled approximation [Calafiore-Campi'05]: $\min_{x \in X} f_0(x)$, s.t. $g(x; \xi_i) \le 0$, $i = 1, \ldots, m$
     • $\{\xi_1, \ldots, \xi_m\}$: $m$ independently drawn samples
     • the solution of the sampled approximation is a $(1-\tau)$-level robustly feasible solution with probability at least $1 - \varepsilon$ if $m \ge \frac{n}{\tau \varepsilon} - 1$, where $\tau \in (0, 1)$ and $\varepsilon \in (0, 1)$
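A tiny worked computation of the sample count, using the bound as reconstructed above (the original slide's formula was garbled in extraction, so treat the exact expression as an assumption):

```python
from math import ceil

def sample_size(n, tau, eps):
    """Smallest integer m satisfying the (reconstructed) Calafiore-Campi
    bound m >= n/(tau*eps) - 1."""
    return ceil(n / (tau * eps) - 1)

# e.g. n = 10 variables, tau = eps = 0.05  ->  m = 3999 sampled constraints
print(sample_size(10, 0.05, 0.05))
```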

  10. Literature
      Few works handle problems with functional constraints:
      • penalty method with stochastic approximation [Wang-Ma-Yuan'17]: uses exact function/gradient information of all constraint functions
      • stochastic mirror-prox descent for saddle-point problems [Baes-Bürgisser-Nemirovski'13]
      • cooperative stochastic approximation (CSA) for problems with an expectation constraint [Lan-Zhou'16]
      • level-set methods [Lin et al.'18]

  11. Stochastic mirror-prox method [Baes-Bürgisser-Nemirovski'13]
      for a saddle-point problem $\min_{x \in X} \max_{z \in Z} \mathcal{L}(x, z)$, the iterative update scheme is
      $$(\hat{x}^k, \hat{z}^k) = \mathrm{Proj}_{X \times Z}\big(x^k - \alpha_k g_x^k, \; z^k + \alpha_k g_z^k\big),$$
      $$(x^{k+1}, z^{k+1}) = \mathrm{Proj}_{X \times Z}\big(x^k - \alpha_k \hat{g}_x^k, \; z^k + \alpha_k \hat{g}_z^k\big)$$
      • $(g_x^k; g_z^k)$: a stochastic approximation of $\nabla \mathcal{L}(x^k, z^k)$
      • $(\hat{g}_x^k; \hat{g}_z^k)$: a stochastic approximation of $\nabla \mathcal{L}(\hat{x}^k, \hat{z}^k)$
      • $O(1/\sqrt{k})$ rate in terms of the primal-dual gap
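A sketch of this scheme in the Euclidean case, where the mirror-prox update reduces to a stochastic extragradient step (the general mirror map is simplified to plain projections; all function names are assumptions):

```python
def stoch_extragradient(grad_L, proj_x, proj_z, x, z, alphas, sample):
    """Euclidean stochastic mirror-prox sketch for min_x max_z L(x, z):
    a predictor step to (x_hat, z_hat), then a corrector step from the
    *original* point using fresh stochastic gradients at the predictor."""
    for alpha in alphas:
        gx, gz = grad_L(x, z, sample())        # stochastic grad at (x_k, z_k)
        xh = proj_x(x - alpha * gx)            # predictor x_hat
        zh = proj_z(z + alpha * gz)            # predictor z_hat
        gxh, gzh = grad_L(xh, zh, sample())    # fresh grad at the predictor
        x = proj_x(x - alpha * gxh)            # corrector, still from (x_k, z_k)
        z = proj_z(z + alpha * gzh)
    return x, z
```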

  12. Cooperative stochastic approximation [Lan-Zhou'16]
      for the problem with an expectation constraint: $\min_{x \in X} f(x) = \mathbb{E}_\xi[F(x, \xi)]$, s.t. $\mathbb{E}_\xi[G(x, \xi)] \le 0$
      For $k = 0, 1, \ldots$, do
      1. sample $\xi_k$;
      2. if $G(x^k, \xi_k) \le \eta_k$, set $g^k = \tilde{\nabla} F(x^k, \xi_k)$; otherwise, set $g^k = \tilde{\nabla} G(x^k, \xi_k)$;
      3. update $x$ by $x^{k+1} = \arg\min_{x \in X} \langle g^k, x \rangle + \frac{1}{2\gamma_k}\|x - x^k\|^2$
      • purely primal method
      • $O(1/\sqrt{k})$ rate for convex problems
      • $O(1/k)$ if both the objective and constraint functions are strongly convex
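A minimal sketch of CSA; note that the prox step in item 3 has the closed form $\mathrm{Proj}_X(x^k - \gamma_k g^k)$, which the code uses. The oracle names are assumptions.

```python
def csa(sgrad_F, sgrad_G, G_val, proj, x0, etas, gammas, sampler):
    """Cooperative stochastic approximation sketch: if the sampled constraint
    value is within tolerance eta_k, step along the objective subgradient;
    otherwise step along the constraint subgradient."""
    x = x0
    for eta, gamma in zip(etas, gammas):
        xi = sampler()
        if G_val(x, xi) <= eta:
            g = sgrad_F(x, xi)    # objective step
        else:
            g = sgrad_G(x, xi)    # constraint-reduction step
        # argmin_{x in X} <g, x> + ||x - x_k||^2 / (2 gamma)  ==  Proj_X(x_k - gamma*g)
        x = proj(x - gamma * g)
    return x
```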

  13. Proposed method, based on the augmented Lagrangian function

  14. Augmented Lagrangian function
      with slack variables $s \ge 0$, (P) is equivalent to
      $$\min_{x \in X, \, s \ge 0} f_0(x), \quad \text{s.t.} \quad f_i(x) + s_i = 0, \; i = 1, \ldots, m.$$
      with a quadratic penalty, the augmented Lagrangian function is
      $$\tilde{\mathcal{L}}_\beta(x, s, z) = f_0(x) + \sum_{i=1}^{m} z_i \big(f_i(x) + s_i\big) + \frac{\beta}{2}\sum_{i=1}^{m} \big(f_i(x) + s_i\big)^2.$$
      fixing $(x, z)$ and minimizing $\tilde{\mathcal{L}}_\beta$ over $s \ge 0$ (through solving $\nabla_s \tilde{\mathcal{L}}_\beta = 0$) gives
      $$s_i = \Big(-\frac{z_i}{\beta} - f_i(x)\Big)_+, \quad i = 1, \ldots, m.$$

  15. Augmented Lagrangian function
      eliminating $s$ gives the classic augmented Lagrangian function of (P):
      $$\mathcal{L}_\beta(x, z) = f_0(x) + \sum_{i=1}^{m} \psi_\beta\big(f_i(x), z_i\big),$$
      where
      $$\psi_\beta(u, v) = \begin{cases} uv + \frac{\beta}{2}u^2, & \text{if } \beta u + v \ge 0, \\ -\frac{v^2}{2\beta}, & \text{if } \beta u + v < 0. \end{cases}$$
      • $\psi_\beta(f_i(x), z_i)$ is convex in $x$ and concave in $z_i$ for each $i$
      • thus $\mathcal{L}_\beta$ is convex in $x$ and concave in $z$
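The function $\psi_\beta$ and its two partial derivatives (which the later primal and dual updates use) are simple to evaluate; here is a vectorized sketch. The derivative formulas $\partial_u \psi_\beta = (\beta u + v)_+$ and $\partial_v \psi_\beta = \max(u, -v/\beta)$ follow directly from the two branches above.

```python
import numpy as np

def psi(u, v, beta):
    """psi_beta(u, v) = u*v + (beta/2)*u^2  if beta*u + v >= 0,
    and -v^2/(2*beta) otherwise (vectorized over constraints)."""
    active = beta * u + v >= 0
    return np.where(active, u * v + 0.5 * beta * u**2, -v**2 / (2 * beta))

def psi_du(u, v, beta):
    """Partial derivative in u: (beta*u + v)_+, used in the x-update."""
    return np.maximum(beta * u + v, 0.0)

def psi_dv(u, v, beta):
    """Partial derivative in v: max(u, -v/beta), used in the z-update."""
    return np.maximum(u, -v / beta)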

  16. Augmented Lagrangian method
      choose $(x^1, z^1)$. For $k = 1, 2, \ldots$, iteratively do:
      $$x^{k+1} \in \operatorname*{Arg\,min}_{x \in X} \mathcal{L}_\beta(x, z^k), \qquad z^{k+1} = z^k + \rho \nabla_z \mathcal{L}_\beta(x^{k+1}, z^k)$$
      • if $\rho < 2\beta$, globally convergent with rate $O\big(\frac{1}{k\rho}\big)$
      • bigger $\rho$ and $\beta$ give faster convergence in terms of iteration count, but yield a harder $x$-subproblem
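A sketch of this classic loop, assuming an oracle `argmin_x_L` for the (possibly expensive) exact $x$-subproblem; the oracle names are assumptions.

```python
def augmented_lagrangian(argmin_x_L, grad_z_L, z0, rho, iters):
    """Classic ALM sketch: exactly minimize L_beta(., z_k) over X, then take
    a dual ascent step z_{k+1} = z_k + rho * grad_z L_beta(x_{k+1}, z_k)."""
    z = z0
    for _ in range(iters):
        x = argmin_x_L(z)              # x_{k+1} in Argmin_{x in X} L_beta(x, z_k)
        z = z + rho * grad_z_L(x, z)   # dual update; rho < 2*beta per the slide
    return x, z
```

The proposed method on the next slides replaces the exact subproblem solve with a single cheap stochastic gradient step, which is what makes large $m$ tractable.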

  17. Proposed primal-dual stochastic gradient method
      consider the case where:
      • exact $f_j$ and $\tilde{\nabla} f_j$ can be obtained for each $j = 1, \ldots, m$
      • $m$ is big: expensive to access all $f_j$'s at every update
      Examples: MDP, robust optimization by sampling, multi-class SVM
      Remarks:
      • if the $f_j$'s are stochastic, the AL function takes a compositional-expectation form
      • it is then difficult to obtain an unbiased stochastic estimate of $\tilde{\nabla}_x \mathcal{L}_\beta$
      • the ordinary Lagrangian function can be used to handle the most general case

  18. Proposed primal-dual stochastic gradient method
      For $k = 0, 1, \ldots$, do
      1. sample $\xi_k$ and pick $j_k \in [m]$ uniformly at random;
      2. let $g^k = \tilde{\nabla} F_0(x^k, \xi_k) + \tilde{\nabla}_x \psi_\beta\big(f_{j_k}(x^k), z^k_{j_k}\big)$;
      3. update the primal variable by $x^{k+1} = \mathrm{Proj}_X\big(x^k - D_k^{-1} g^k\big)$;
      4. let $z^{k+1}_j = z^k_j$ for $j \ne j_k$, and update $z^{k+1}_{j_k} = z^k_{j_k} + \rho_k \cdot \max\big(-z^k_{j_k}/\beta, \; f_{j_k}(x^k)\big)$
      • $g^k$ is an unbiased stochastic estimate of $\tilde{\nabla}_x \mathcal{L}_\beta$ at $x^k$
      • only $\tilde{\nabla} f_{j_k}(x^k)$ and $f_{j_k}(x^k)$ are needed for the updates
      • $D_k = I/\alpha_k + \eta \cdot \mathrm{diag}\Big(\big(\sum_{t=0}^{k} |\tilde{g}^t|^2\big)^{1/2}\Big)$, with $\tilde{g}^k$ a scaled version of $g^k$ (the AdaGrad-style scaling; see the sketch below)
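A sketch of the method as transcribed above, with caveats: `sgrad_f0` is assumed to draw its own $\xi_k$ internally; the raw $g^k$ is accumulated in place of the slide's scaled $\tilde{g}^k$; and, depending on how $\mathcal{L}_\beta$ is normalized, the constraint term may need an extra factor of $m$ for unbiasedness (the transcription shows none, so none is used). The constraint gradient uses $\tilde{\nabla}_x \psi_\beta(u, v) = (\beta u + v)_+ \tilde{\nabla} f_j(x)$, consistent with $\psi_\beta$ on slide 15. All oracle names are assumptions.

```python
import numpy as np

def pdsg_adapt(sgrad_f0, fval, sgrad_f, proj, x0, m, beta,
               alphas, rhos, eta=1.0, rng=np.random.default_rng(0)):
    """Adaptive primal-dual stochastic gradient sketch: one sampled constraint
    j_k per iteration, AdaGrad-style diagonal scaling D_k, and a
    single-coordinate dual update."""
    x, z = x0.astype(float), np.zeros(m)
    acc = np.zeros_like(x)                          # running sum of g_t^2
    for alpha, rho in zip(alphas, rhos):
        j = int(rng.integers(m))                    # j_k uniform on [m]
        fj = fval(j, x)
        # stochastic primal gradient: objective part + sampled constraint part
        g = sgrad_f0(x) + max(beta * fj + z[j], 0.0) * sgrad_f(j, x)
        acc += g**2                                 # AdaGrad-style accumulation
        D = 1.0 / alpha + eta * np.sqrt(acc)        # diagonal of D_k
        x = proj(x - g / D)                         # x_{k+1} = Proj_X(x_k - D_k^{-1} g_k)
        z[j] += rho * max(-z[j] / beta, fj)         # dual update on coordinate j_k
    return x, z
```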

  19. How the proposed method performs
      test on convex quadratically constrained quadratic programming:
      $$\min_{x \in X} \frac{1}{2N}\sum_{i=1}^{N} \|H_i x - c_i\|^2, \quad \text{s.t.} \quad \frac{1}{2} x^\top Q_j x + a_j^\top x \le b_j, \; j = 1, \ldots, m,$$
      where $N = m = 10{,}000$.
      [Figure: distance to optimality (left) and average feasibility residual (right) versus number of epochs, for PDSG-nonadp, PDSG-adp, CSA, and mirror-prox]
      Observations:
      • the proposed methods outperform mirror-prox and CSA
      • adaptiveness significantly improves convergence speed
      • all methods have roughly the same asymptotic convergence rate
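For readers who want to reproduce the problem's shape, here is a hypothetical instance generator; the dimensions, distributions, and right-hand sides are illustrative assumptions, not the experiment's actual setup.

```python
import numpy as np

def qcqp_instance(n=10, N=100, m=100, rng=np.random.default_rng(1)):
    """Hypothetical generator matching the *shape* of the test problem:
    objective (1/2N) sum_i ||H_i x - c_i||^2 and convex constraints
    (1/2) x^T Q_j x + a_j^T x <= b_j with Q_j = B_j^T B_j (PSD)."""
    H = rng.standard_normal((N, 5, n))
    c = rng.standard_normal((N, 5))
    B = rng.standard_normal((m, n, n))
    Q = np.einsum('jik,jil->jkl', B, B)    # Q_j = B_j^T B_j, PSD hence convex f_j
    a = rng.standard_normal((m, n))
    b = rng.uniform(1.0, 2.0, size=m)
    f0 = lambda x: 0.5 / N * np.sum((H @ x - c)**2)
    fj = lambda j, x: 0.5 * x @ Q[j] @ x + a[j] @ x - b[j]
    return f0, fj
```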

  20. Sublinear convergence result
      Assumptions:
      1. existence of a primal-dual solution $(x^*, z^*)$
      2. unbiased estimates with bounded variance
      3. bounded constraint functions and subgradients
      Theorem: given $K$, let $\alpha_k = \frac{\alpha}{\sqrt{K}}$, $\rho_k = \frac{\rho}{\sqrt{K}}$, and $\beta \ge \rho$. Then
      $$\max\Big\{\mathbb{E}\big|f_0(\bar{x}^K) - f_0(x^*)\big|, \; \mathbb{E}\big\|[f(\bar{x}^K)]_+\big\|\Big\} = O\Big(\frac{1}{K^{1/2}}\Big).$$
      If $f_0$ is strongly convex, let $\alpha_k = \frac{\alpha}{k}$, $\rho_k = \frac{\rho}{\log(K+1)}$, and $\beta \ge \frac{\rho}{\log 2}$. Then
      $$\mathbb{E}\|\bar{x}^K - x^*\|^2 = O\Big(\frac{\log(K+1)}{K}\Big).$$
      • $\bar{x}^K$ is a weighted average of $\{x^k\}_{k=1}^{K+1}$
      Remark: CSA [Lan-Zhou'16] requires strong convexity of both the objective and the constraint functions to achieve $O(\frac{1}{K})$
