

  1. An Accelerated Variance Reducing Stochastic Method with Douglas-Rachford Splitting
     Jingchang Liu
     November 12, 2018
     University of Science and Technology of China

  2. Table of Contents
     • Background
     • Moreau Envelope and Douglas-Rachford (DR) Splitting
     • Our methods
     • Theories
     • Experiments
     • Conclusions
     • Q & A

  3. Background

  4. Problem Formulation
     • Regularized ERM: min_{x ∈ R^d} f(x) + h(x) := (1/n) Σ_{i=1}^n f_i(x) + h(x).
     • f_i : R^d → R: empirical loss of the i-th sample, convex.
     • h: regularization term, convex but possibly non-smooth.
     • Examples: LASSO, sparse SVM, ℓ1-, ℓ2-Logistic Regression.
     Definitions
     • Proximal operator: prox_{γf}(x) = argmin_{y ∈ R^d} { f(y) + (1/(2γ)) ‖y − x‖² }.
     • Gradient mapping: (1/γ)(x − prox_{γf}(x)).
     • Subdifferential: ∂f(x) = { g | g^T(y − x) ≤ f(y) − f(x), ∀y ∈ dom f }.
     • µ-strongly convex: f(y) ≥ f(x) + ⟨g, y − x⟩ + (µ/2)‖y − x‖² for all g ∈ ∂f(x).
     • L-smooth: f(y) ≤ f(x) + ⟨∇f(x), y − x⟩ + (L/2)‖y − x‖².
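To make the proximal-operator definition concrete, here is a minimal NumPy sketch (not from the slides) of prox_{γh} for the ℓ1 regularizer h(x) = λ‖x‖₁, which reduces to elementwise soft thresholding; the function name and test values are purely illustrative.

```python
import numpy as np

def prox_l1(x, gamma, lam):
    """Proximal operator of h(x) = lam * ||x||_1 with step gamma:
    argmin_y  lam*||y||_1 + ||y - x||^2 / (2*gamma)  ->  soft thresholding."""
    return np.sign(x) * np.maximum(np.abs(x) - gamma * lam, 0.0)

# Entries smaller than gamma*lam in magnitude are set exactly to zero.
x = np.array([3.0, -0.2, 0.5, -1.5])
print(prox_l1(x, gamma=1.0, lam=0.4))   # -> [ 2.6  0.   0.1 -1.1]
```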

  5. Related Works
     Existing Algorithms
     Iterate x ← prox_{γh}(x − γ·v), where v can be obtained from:
     - GD: v = ∇f(x); more computation is needed in each iteration.
     - SGD: v = ∇f_i(x); the small stepsize leads to slow convergence.
     - Variance reduction (VR): v = ∇f_i(x) − ∇f_i(x̄) + ∇f(x̄), such as SVRG, SAGA, SDCA.
     Accelerated Technique
     • Ill-conditioning: L/µ, the condition number, is large.
     • Methods: Acc-SDCA, Catalyst, Mig, Point-SAGA.
     • Drawback: more parameters need to be tuned.
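As a rough illustration of the three choices of the stochastic direction v above, a small NumPy sketch for least-squares losses f_i(x) = ½(a_iᵀx − b_i)²; the data and the snapshot point x̄ (x_bar) are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 5
A, b = rng.normal(size=(n, d)), rng.normal(size=n)

def grad_i(x, i):                    # gradient of f_i(x) = 0.5*(a_i^T x - b_i)^2
    return (A[i] @ x - b[i]) * A[i]

def full_grad(x):                    # gradient of f(x) = (1/n) sum_i f_i(x)
    return A.T @ (A @ x - b) / n

x, x_bar = rng.normal(size=d), np.zeros(d)
i = rng.integers(n)

v_gd  = full_grad(x)                                         # GD: exact, n gradients per step
v_sgd = grad_i(x, i)                                         # SGD: cheap but high variance
v_vr  = grad_i(x, i) - grad_i(x_bar, i) + full_grad(x_bar)   # SVRG-style VR estimator

# v_sgd and v_vr are both unbiased over the random index i; the VR estimator's
# variance vanishes as x and x_bar approach the optimum.
```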

  6. Convergence Rate
     • VR stochastic methods: O((n + L/µ) log(1/ε)).
     • Acc-SDCA, Mig, Point-SAGA: O((n + √(nL/µ)) log(1/ε)).
     • When L/µ ≫ n, the accelerated technique makes convergence much faster.
     Aim
     Design a simpler accelerated VR stochastic method that achieves the fastest convergence rate.

  7. Moreau Envelope and Douglas-Rachford (DR) Splitting

  8. Moreau Envelope
     Formulation
     f_γ(x) = inf_y { f(y) + (1/(2γ)) ‖x − y‖² }.
     Properties
     • x* minimizes f(x) iff x* minimizes f_γ(x).
     • f_γ is continuously differentiable even when f is non-differentiable:
       ∇f_γ(x) = (x − prox_{γf}(x))/γ. Moreover, f_γ is (1/γ)-smooth.
     • If f is µ-strongly convex, then f_γ is µ/(µγ + 1)-strongly convex.
     • The condition number of f_γ is (µγ + 1)/(µγ), which may be better.
     Proximal Point Algorithm (PPA)
     x^{k+1} = prox_{γf}(x^k) = x^k − γ∇f_γ(x^k).
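A tiny sketch, assuming f(x) = |x| (whose prox is soft thresholding), showing that each PPA step prox_{γf}(x^k) coincides with the gradient step x^k − γ∇f_γ(x^k) on the Moreau envelope; everything here is illustrative rather than taken from the slides.

```python
import numpy as np

gamma = 0.5

def prox_abs(x, gamma):                  # prox of f(x) = |x|: soft thresholding
    return np.sign(x) * max(abs(x) - gamma, 0.0)

def grad_moreau(x, gamma):               # gradient of the Moreau envelope f_gamma
    return (x - prox_abs(x, gamma)) / gamma

x = 3.0
for k in range(8):
    step = x - gamma * grad_moreau(x, gamma)   # gradient step on f_gamma
    x = prox_abs(x, gamma)                     # PPA step
    assert abs(step - x) < 1e-12               # the two updates coincide
print(x)                                       # approaches the minimizer x* = 0
```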

  9. Point-SAGA
     Formulation
     Used when h is absent: min_{x ∈ R^d} f(x) := (1/n) Σ_{i=1}^n f_i(x).
     Iteration
     z_j^k = x^k + γ ( g_j^k − (1/n) Σ_{i=1}^n g_i^k ),
     x^{k+1} = prox_{γ f_j}(z_j^k),
     g_j^{k+1} = (z_j^k − x^{k+1}) / γ.
     Equivalence
     x^{k+1} = x^k − γ ( g_j^{k+1} − g_j^k + (1/n) Σ_{i=1}^n g_i^k ),
     where g_j^{k+1} is the gradient mapping of f_j at z_j^k.
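Below is a minimal sketch of the Point-SAGA iteration just stated, using least-squares losses f_i(x) = ½(a_iᵀx − b_i)² so that prox_{γ f_i} has a closed form (one Sherman-Morrison step); the data, stepsize, and iteration count are arbitrary choices for this toy example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, gamma = 20, 5, 0.1
A, b = rng.normal(size=(n, d)), rng.normal(size=n)

def prox_fi(z, i, gamma):
    """Closed-form prox of gamma*f_i with f_i(x) = 0.5*(a_i^T x - b_i)^2."""
    a = A[i]
    w = z + gamma * b[i] * a
    return w - (gamma * (a @ w) / (1.0 + gamma * a @ a)) * a

x = np.zeros(d)
g = np.zeros((n, d))                         # table of gradient mappings g_i^k
g_avg = g.mean(axis=0)

for k in range(5000):
    j = rng.integers(n)
    z = x + gamma * (g[j] - g_avg)           # z_j^k
    x_new = prox_fi(z, j, gamma)             # x^{k+1} = prox_{gamma f_j}(z_j^k)
    g_new = (z - x_new) / gamma              # g_j^{k+1}: gradient mapping at z_j^k
    g_avg += (g_new - g[j]) / n              # maintain the running average
    g[j], x = g_new, x_new

print(np.linalg.norm(A.T @ (A @ x - b) / n))   # optimality residual; should be small
```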

  10. Point-SAGA: Convergence rate
      • Strongly convex and smooth: O((n + √(nL/µ)) log(1/ε)).
      • Strongly convex and non-smooth: O(1/ε).

  11. Douglas-Rachford (DR) Splitting
      Formulation
      min_{x ∈ R^d} f(x) + h(x).
      Iteration
      y^{k+1} = y^k − x^k + prox_{γf}(2x^k − y^k),
      x^{k+1} = prox_{γh}(y^{k+1}).
      Convergence
      • F(y) = y + prox_{γh}(2 prox_{γf}(y) − y) − prox_{γf}(y).
      • y is a fixed point of F if and only if x = prox_{γf}(y) satisfies 0 ∈ ∂f(x) + ∂h(x):
        y = F(y)  ⇔  0 ∈ ∂f(prox_{γf}(y)) + ∂h(prox_{γf}(y)).
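A short sketch of this DR iteration on a toy composite problem with f(x) = ½‖Ax − b‖² and h(x) = λ‖x‖₁, where both prox operators are available in closed form; the problem data, γ, and λ are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, gamma, lam = 30, 10, 1.0, 0.1
A, b = rng.normal(size=(n, d)), rng.normal(size=n)

M = np.eye(d) + gamma * A.T @ A          # for prox of f(x) = 0.5*||Ax - b||^2

def prox_f(z):
    return np.linalg.solve(M, z + gamma * A.T @ b)

def prox_h(z):                           # prox of gamma*h, h(x) = lam*||x||_1
    return np.sign(z) * np.maximum(np.abs(z) - gamma * lam, 0.0)

y = np.zeros(d)
x = prox_h(y)
for k in range(200):
    y = y - x + prox_f(2 * x - y)        # y^{k+1}
    x = prox_h(y)                        # x^{k+1}
# x now approximately minimizes 0.5*||Ax - b||^2 + lam*||x||_1
```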

  12. Our methods

  13. Algorithm (Prox2-SAGA)

  14. Iterations
      Main iterations
      y^{k+1} = x^k − γ ( g_j^{k+1} − g_j^k + (1/n) Σ_{i=1}^n g_i^k ),
      x^{k+1} = prox_{γh}(y^{k+1}),
      where
      g_j^{k+1} = (1/γ) [ (z_j^k + x^k − y^k) − prox_{γ f_j}(z_j^k + x^k − y^k) ],
      the gradient mapping of f_j at z_j^k + x^k − y^k.
      Number of parameters
      Prox2-SAGA  Point-SAGA  Katyusha  Mig  Acc-SDCA  Catalyst
      1           1           3         2    2         several
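The sketch below follows the main iteration as reconstructed on this slide, with z_j^k maintained as in the Point-SAGA reduction on the next slide (z_j^k = x^k + γ(g_j^k − (1/n)Σ_i g_i^k)); that choice, the ℓ1-regularized least-squares problem, and the stepsize are assumptions made for illustration, not the authors' reference implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, gamma, lam = 20, 5, 0.1, 0.05
A, b = rng.normal(size=(n, d)), rng.normal(size=n)

def prox_fi(z, i, gamma):        # closed-form prox of gamma*f_i, f_i(x) = 0.5*(a_i^T x - b_i)^2
    a = A[i]
    w = z + gamma * b[i] * a
    return w - (gamma * (a @ w) / (1.0 + gamma * a @ a)) * a

def prox_h(z, gamma):            # prox of gamma*h, h(x) = lam*||x||_1
    return np.sign(z) * np.maximum(np.abs(z) - gamma * lam, 0.0)

def objective(x):
    return 0.5 * np.mean((A @ x - b) ** 2) + lam * np.sum(np.abs(x))

y = np.zeros(d)
x = prox_h(y, gamma)
g = np.zeros((n, d))                             # table of gradient mappings g_i^k
g_avg = g.mean(axis=0)

for k in range(5000):
    j = rng.integers(n)
    z = x + gamma * (g[j] - g_avg)               # z_j^k (assumed Point-SAGA-style definition)
    u = z + x - y                                # point where f_j's prox is evaluated
    g_new = (u - prox_fi(u, j, gamma)) / gamma   # g_j^{k+1}: gradient mapping of f_j at u
    y = x - gamma * (g_new - g[j] + g_avg)       # y^{k+1}
    x = prox_h(y, gamma)                         # x^{k+1} = prox_{gamma h}(y^{k+1})
    g_avg += (g_new - g[j]) / n
    g[j] = g_new

print(objective(x))    # should approach the minimum of f + h
```

With h = 0 the update collapses to the Point-SAGA loop above, and with n = 1 it collapses to the DR iteration, matching the reductions on the next slide.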

  15. Connections to other algorithms
      Point-SAGA
      When h = 0, we have x^k = y^k for Prox2-SAGA, and the iteration becomes
      z_j^k = x^k + γ ( g_j^k − (1/n) Σ_{i=1}^n g_i^k ),
      x^{k+1} = prox_{γ f_j}(z_j^k),
      g_j^{k+1} = (1/γ)(z_j^k − x^{k+1}).
      DR splitting
      When n = 1, since g_j^k = (1/n) Σ_{i=1}^n g_i^k in Prox2-SAGA, the iteration becomes
      y^{k+1} = y^k − x^k + prox_{γf}(2x^k − y^k),
      x^{k+1} = prox_{γh}(y^{k+1}).

  16. Theories

  17. Effectiveness
      Proposition
      Suppose that (y^∞, {g_i^∞}_{i=1,...,n}) is a fixed point of the Prox2-SAGA iteration.
      Then x^∞ = prox_{γh}(y^∞) is a minimizer of f + h.
      Proof.
      Since y^∞ = y^∞ − x^∞ + prox_{γ f_i}(z_i^∞ + x^∞ − y^∞), we have
      x^∞ = prox_{γ f_i}(z_i^∞ + x^∞ − y^∞), which implies
      (z_i^∞ − y^∞)/γ ∈ ∂f_i(x^∞),  i = 1, ..., n.   (1)
      Meanwhile, because x^∞ = prox_{γh}(y^∞), we have
      (y^∞ − x^∞)/γ ∈ ∂h(x^∞).   (2)
      Observing that
      (1/n) Σ_{i=1}^n (z_i^∞ − y^∞) + (y^∞ − x^∞) = (1/n) Σ_{i=1}^n z_i^∞ − x^∞ = 0,
      from (1) and (2) we have 0 ∈ ∂f(x^∞) + ∂h(x^∞).

  18. Convergence Rate
      Non-strongly convex case
      Suppose that each f_i is convex and L-smooth, and h is convex. Denote
      ḡ_j^k = (1/k) Σ_{t=1}^k g_j^t. Then for Prox2-SAGA with stepsize γ ≤ 1/L, at any
      time k > 0 it holds that
      E ‖ḡ_j^k − g_j^*‖² ≤ (1/k) [ Σ_{i=1}^n ‖g_i^0 − g_i^*‖² + ‖(y^0 − y^*)/γ‖² ].
      Strongly convex case
      Suppose that each f_i is µ-strongly convex and L-smooth, and h is convex. Then for
      Prox2-SAGA with stepsize γ = min{ 1/(µn), (√(9L² + 3µL) − 3L)/(2µL) }, for any
      time k > 0 it holds that
      E ‖x^k − x^*‖² ≤ (1 − µγ/(2µγ + 2))^k · (2 − µγ)/(2 − nµγ)
                        · [ Σ_{i=1}^n ‖γ(g_i^0 − g_i^*)‖² + ‖y^0 − y^*‖² ].

  19. Remarks
      - With the stepsize
        γ = min{ 1/(µn), (√(9L² + 3µL) − 3L)/(2µL) },
        O((n + L/µ) log(1/ε)) steps are required to achieve E ‖x^k − x^*‖² ≤ ε.
      - When f_i is ill-conditioned, a larger stepsize
        γ = min{ 1/(µn), (6L + √(36L² − 6(n − 2)µL)) / (2(n − 2)µL) }
        is possible, under which the required number of steps is O((n + √(nL/µ)) log(1/ε)).
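To make the two stepsize regimes concrete, a small numeric check using arbitrary values n = 1000, L = 1, µ = 10⁻⁴ (so L/µ ≫ n) and the stepsize expressions exactly as written above; it also compares the two iteration-count factors n + L/µ and n + √(nL/µ).

```python
import math

n, L, mu = 1000, 1.0, 1e-4        # illustrative ill-conditioned setting: L/mu >> n

gamma_small = min(1 / (mu * n), (math.sqrt(9 * L**2 + 3 * mu * L) - 3 * L) / (2 * mu * L))
gamma_large = min(1 / (mu * n),
                  (6 * L + math.sqrt(36 * L**2 - 6 * (n - 2) * mu * L)) / (2 * (n - 2) * mu * L))

print(gamma_small, gamma_large)                   # the second stepsize is much larger (~0.25 vs 10)
print(n + L / mu, n + math.sqrt(n * L / mu))      # iteration factors: ~11000 vs ~4162
```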

  20. Experiments

  21. Experiments
      Figure 2: Comparison of several algorithms with ℓ1, ℓ2-Logistic Regression.

  22. Experiments
      Figure 3: Comparison of several algorithms with ℓ1, ℓ2-Logistic Regression.

  23. Experiments
      Figure 4: Comparison of several algorithms with sparse SVMs (objective gap vs. epoch for
      Prox2-SAGA, Prox-SDCA, Prox-SAGA, and Prox-SGD on svmguide3, rcv1, covtype, and ijcnn1).

  24. Conclusions

  25. • Prox2-SAGA combines Point-SAGA and DR splitting.
      • Point-SAGA contributes the faster convergence rate of Prox2-SAGA.
      • DR splitting contributes the effectiveness: fixed points of the iteration are minimizers of f + h.

  26. Q & A
