
SDCA: Stochastic Dual Coordinate Ascent. Jingchang Liu, June 29, 2017



  1. SDCA: Stochastic Dual Coordinate Ascent. Jingchang Liu, June 29, 2017, University of Science and Technology of China

  2. Table of Contents: Lagrangian Duality; SDCA; Convergence Rate; Experiments; Asynchronous SDCA; Q & A

  3. Lagrangian Duality

  4. Dual Problem

     Primal problem:

     $$\min_x \; f_0(x) \quad \text{s.t.} \quad f_i(x) \le 0, \; i = 1, 2, \dots, m; \qquad h_i(x) = 0, \; i = 1, 2, \dots, p$$

     Lagrangian function:

     $$L(x, \lambda, v) = f_0(x) + \sum_{i=1}^{m} \lambda_i f_i(x) + \sum_{i=1}^{p} v_i h_i(x), \qquad \lambda_i \ge 0$$

     Dual function:

     $$g(\lambda, v) = \inf_{x \in D} L(x, \lambda, v)$$

     $g(\lambda, v)$ is a concave function, since it is a pointwise infimum of functions that are affine in $(\lambda, v)$.
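A small numerical sketch may make the dual function concrete. The toy problem below (minimize $x^2$ subject to $1 - x \le 0$) is an assumption chosen for illustration and does not appear in the slides. It evaluates $g(\lambda)$ on a grid and checks weak duality and concavity.

```python
# Toy problem (an assumption for illustration, not from the slides):
#   minimize f0(x) = x^2  subject to  f1(x) = 1 - x <= 0.
# The primal optimum is x* = 1 with value p* = 1.

def lagrangian(x, lam):
    """L(x, lam) = f0(x) + lam * f1(x), with lam >= 0."""
    return x * x + lam * (1.0 - x)

def dual_function(lam):
    """g(lam) = inf_x L(x, lam); the infimum is attained at x = lam / 2."""
    return lagrangian(lam / 2.0, lam)

p_star = 1.0
lams = [0.1 * k for k in range(0, 61)]  # grid on [0, 6]
g_vals = [dual_function(lam) for lam in lams]

# Weak duality: every dual value is a lower bound on p*.
weak_duality_holds = all(g <= p_star + 1e-12 for g in g_vals)

# Concavity on the grid: second differences are non-positive.
concave_on_grid = all(
    g_vals[k - 1] - 2.0 * g_vals[k] + g_vals[k + 1] <= 1e-12
    for k in range(1, len(g_vals) - 1)
)

# The grid point with the largest dual value (close to lam = 2).
best_lam = max(lams, key=dual_function)
```

For this toy problem the best dual value matches the primal optimum, so the duality gap is zero.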

  5. SDCA

  6. Reference: Shai Shalev-Shwartz and Tong Zhang, "Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization", JMLR, 2013.

  7. Optimization Objective

     Formulation:

     $$\min_{w \in \mathbb{R}^d} P(w), \qquad P(w) := \frac{1}{n} \sum_{i=1}^{n} \phi_i(w^T x_i) + \frac{\lambda}{2} \|w\|^2$$

     Parameters
     • $x_1, x_2, \dots, x_n \in \mathbb{R}^d$; $\phi_1, \phi_2, \dots, \phi_n$: scalar convex functions.
     • SGD: $O(1/n)$

     Examples
     • SVM: $\phi_i(w^T x_i) = \max\{0, \; 1 - y_i w^T x_i\}$
     • Logistic Regression: $\phi_i(w^T x_i) = \log(1 + \exp(-y_i w^T x_i))$
     • Ridge Regression: $\phi_i(w^T x_i) = (w^T x_i - y_i)^2$
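The three example losses and the regularized objective $P(w)$ can be written down directly. The following is a minimal sketch in plain Python; the function names and the tiny data set are hypothetical and serve only to show how the pieces fit together.

```python
import math

def hinge(z, y):
    """SVM loss: max(0, 1 - y z)."""
    return max(0.0, 1.0 - y * z)

def logistic(z, y):
    """log(1 + exp(-y z)), written to avoid overflow for large |y z|."""
    m = -y * z
    return max(m, 0.0) + math.log1p(math.exp(-abs(m)))

def squared(z, y):
    """Ridge regression loss: (z - y)^2."""
    return (z - y) ** 2

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def primal(w, X, Y, lam, loss):
    """P(w) = (1/n) sum_i loss(w^T x_i, y_i) + (lam / 2) ||w||^2."""
    n = len(X)
    avg_loss = sum(loss(dot(w, x), y) for x, y in zip(X, Y)) / n
    return avg_loss + 0.5 * lam * dot(w, w)

# Hypothetical toy data: three points in R^2.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Y = [1.0, -1.0, 1.0]
w = [0.5, -0.5]
lam = 0.1

p_hinge = primal(w, X, Y, lam, hinge)
p_ridge = primal(w, X, Y, lam, squared)
```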

  8. Dual Problem

     Dual problem:

     $$\max_{\alpha} D(\alpha), \qquad D(\alpha) = \frac{1}{n} \sum_{i=1}^{n} -\phi_i^*(-\alpha_i) - \frac{\lambda}{2} \Big\| \frac{1}{\lambda n} \sum_{i=1}^{n} \alpha_i x_i \Big\|^2$$

     Conjugate function: $\phi_i^*(u) = \max_z \, (zu - \phi_i(z))$

     Derivation: minimizing

     $$P(w) = \frac{1}{n} \sum_{i=1}^{n} \phi_i(w^T x_i) + \frac{\lambda}{2} \|w\|^2$$

     is equivalent to minimizing

     $$P(y, z) = \frac{1}{n} \sum_{i=1}^{n} \phi_i(z_i) + \frac{\lambda}{2} \|y\|^2 \quad \text{s.t.} \quad y^T x_i = z_i, \; i = 1, 2, \dots, n$$
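As a worked instance of the conjugate, consider the ridge regression loss $\phi_i(z) = (z - y_i)^2$. The steps below are a sketch of the standard calculation for this one loss.

```latex
\phi_i^*(u) = \max_z \big( zu - (z - y_i)^2 \big)
% setting the derivative u - 2 (z - y_i) to zero gives z = y_i + u/2
\phi_i^*(u) = \Big( y_i + \frac{u}{2} \Big) u - \frac{u^2}{4} = u \, y_i + \frac{u^2}{4}
% hence the term that enters D(\alpha) is
-\phi_i^*(-\alpha_i) = \alpha_i y_i - \frac{\alpha_i^2}{4}
```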

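  9. Derivation

     $$L(y, z, \alpha) = P(y, z) + \frac{1}{n} \sum_{i=1}^{n} \alpha_i \big( z_i - y^T x_i \big)$$

     $$D(\alpha) = \inf_{y, z} L(y, z, \alpha)$$

     $$= \frac{1}{n} \sum_{i=1}^{n} \inf_{z_i} \{ \phi_i(z_i) + \alpha_i z_i \} + \inf_{y} \Big\{ \frac{\lambda}{2} \|y\|^2 - \frac{1}{n} \sum_{i=1}^{n} \alpha_i y^T x_i \Big\}$$

     $$= \frac{1}{n} \sum_{i=1}^{n} -\phi_i^*(-\alpha_i) - \frac{\lambda}{2} \Big\| \frac{1}{\lambda n} \sum_{i=1}^{n} \alpha_i x_i \Big\|^2$$

     Relationship: the infimum over $y$ is attained at

     $$w(\alpha) = \frac{1}{\lambda n} \sum_{i=1}^{n} \alpha_i x_i$$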
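Weak duality, $D(\alpha) \le P(w)$, can be checked numerically. The sketch below assumes the ridge regression loss $\phi_i(z) = (z - y_i)^2$, for which $-\phi_i^*(-\alpha_i) = \alpha_i y_i - \alpha_i^2 / 4$. The random data and helper names are hypothetical.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def w_of_alpha(alpha, X, lam):
    """w(alpha) = (1 / (lam * n)) * sum_i alpha_i x_i."""
    n, d = len(X), len(X[0])
    return [sum(alpha[i] * X[i][j] for i in range(n)) / (lam * n) for j in range(d)]

def primal(w, X, Y, lam):
    """P(w) for the squared loss."""
    n = len(X)
    return sum((dot(w, x) - y) ** 2 for x, y in zip(X, Y)) / n + 0.5 * lam * dot(w, w)

def dual(alpha, X, Y, lam):
    """D(alpha) for the squared loss: -phi_i^*(-a) = a * y_i - a^2 / 4."""
    n = len(X)
    w = w_of_alpha(alpha, X, lam)
    return sum(a * y - a * a / 4.0 for a, y in zip(alpha, Y)) / n - 0.5 * lam * dot(w, w)

rng = random.Random(0)
X = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(10)]
Y = [rng.uniform(-1, 1) for _ in range(10)]
lam = 0.1

# Duality gap P(w(alpha)) - D(alpha) for 100 random dual vectors.
gaps = []
for _ in range(100):
    alpha = [rng.uniform(-2, 2) for _ in range(10)]
    w = w_of_alpha(alpha, X, lam)
    gaps.append(primal(w, X, Y, lam) - dual(alpha, X, Y, lam))
```

Every entry of `gaps` should be non-negative, and the gap only closes at the optimal $\alpha$.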

  10. Assumptions

      $L$-Lipschitz continuous: $|\phi_i(a) - \phi_i(b)| \le L \, |a - b|$

      $(1/\gamma)$-smooth: a function $\phi_i : \mathbb{R} \to \mathbb{R}$ is $(1/\gamma)$-smooth if it is differentiable and its derivative is $(1/\gamma)$-Lipschitz.

      Remark: if $\phi_i$ is $(1/\gamma)$-smooth, then $\phi_i^*$ is $\gamma$-strongly convex.
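The smoothed hinge-loss used in the experiments is a convenient example of a loss that satisfies both assumptions. The sketch below uses the usual piecewise definition with the label folded into the argument (so `a` stands for $y_i w^T x_i$), which is an assumption since the slides do not spell the formula out. It checks numerically that the derivative is $(1/\gamma)$-Lipschitz.

```python
def smoothed_hinge(a, gamma=1.0):
    """0 for a >= 1, linear for a <= 1 - gamma, quadratic in between."""
    if a >= 1.0:
        return 0.0
    if a <= 1.0 - gamma:
        return 1.0 - a - gamma / 2.0
    return (1.0 - a) ** 2 / (2.0 * gamma)

def smoothed_hinge_grad(a, gamma=1.0):
    """Derivative of smoothed_hinge with respect to a."""
    if a >= 1.0:
        return 0.0
    if a <= 1.0 - gamma:
        return -1.0
    return -(1.0 - a) / gamma

def max_grad_slope(gamma, lo=-2.0, hi=2.0, steps=400):
    """Largest |grad(a) - grad(b)| / |a - b| over adjacent grid points."""
    pts = [lo + (hi - lo) * k / steps for k in range(steps + 1)]
    return max(
        abs(smoothed_hinge_grad(b, gamma) - smoothed_hinge_grad(a, gamma)) / (b - a)
        for a, b in zip(pts, pts[1:])
    )

slope_gamma_1 = max_grad_slope(1.0)    # should not exceed 1 / gamma = 1
slope_gamma_half = max_grad_slope(0.5)  # should not exceed 1 / gamma = 2
```

The same loss is also 1-Lipschitz, because its derivative always lies in $[-1, 0]$.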

  11. Algorithms. Figure 1: Procedure SDCA
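Since the figure itself is not reproduced here, the following is a minimal sketch of the SDCA loop, assuming the ridge regression loss $\phi_i(z) = (z - y_i)^2$, for which the coordinate-wise maximizer of the dual has a closed form. It returns the last iterate rather than an averaged or randomly chosen one, which is a simplification. The data and function names are hypothetical.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sdca_ridge(X, Y, lam, epochs=200, seed=0):
    """SDCA sketch for phi_i(z) = (z - y_i)^2; returns the last iterate."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    alpha = [0.0] * n          # alpha^(0) = 0
    w = [0.0] * d              # w(alpha) = (1 / (lam n)) sum_i alpha_i x_i
    for _ in range(epochs * n):
        i = rng.randrange(n)   # pick a dual coordinate uniformly at random
        x, y = X[i], Y[i]
        # Closed-form maximizer of the dual along coordinate i (squared loss).
        delta = (y - dot(w, x) - 0.5 * alpha[i]) / (0.5 + dot(x, x) / (lam * n))
        alpha[i] += delta
        for j in range(d):     # keep w in sync with alpha
            w[j] += delta * x[j] / (lam * n)
    return w, alpha

def duality_gap(w, alpha, X, Y, lam):
    """P(w) - D(alpha) for the squared loss."""
    n = len(X)
    reg = 0.5 * lam * dot(w, w)
    primal = sum((dot(w, x) - y) ** 2 for x, y in zip(X, Y)) / n + reg
    dual = sum(a * y - a * a / 4.0 for a, y in zip(alpha, Y)) / n - reg
    return primal - dual

rng = random.Random(1)
X = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(20)]
Y = [rng.uniform(-1, 1) for _ in range(20)]
w, alpha = sdca_ridge(X, Y, lam=0.1)
gap = duality_gap(w, alpha, X, Y, lam=0.1)
```

Updating `w` incrementally keeps each iteration at $O(d)$ cost, and the duality gap gives a stopping criterion that does not require knowing the optimum.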

  12. Theorem

      Th1: Consider Procedure SDCA with $\alpha^{(0)} = 0$. Assume that $\phi_i$ is $L$-Lipschitz for all $i$. To obtain a duality gap of $\mathbb{E}[P(\bar{w}) - D(\bar{\alpha})] \le \varepsilon$, it suffices to have a total number of iterations of

      $$T \ge T_0 + n + \frac{4 L^2}{\lambda \varepsilon}$$

      Th2: Consider Procedure SDCA with $\alpha^{(0)} = 0$. Assume that $\phi_i$ is $(1/\gamma)$-smooth for all $i$. To obtain a duality gap of $\mathbb{E}[P(\bar{w}) - D(\bar{\alpha})] \le \varepsilon$, it suffices to have a total number of iterations of

      $$T \ge \Big( n + \frac{1}{\lambda \gamma} \Big) \log \Big( \Big( n + \frac{1}{\lambda \gamma} \Big) \cdot \frac{1}{\varepsilon} \Big)$$
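To get a feel for the two bounds, they can be evaluated for concrete values. The sketch below reads the Th1 bound as $T_0 + n + 4L^2/(\lambda\varepsilon)$ and computes only the part beyond $T_0$, since $T_0$ is not defined on the slide. The parameter values are hypothetical.

```python
import math

def smooth_iteration_bound(n, lam, gamma, eps):
    """Th2: (n + 1/(lam*gamma)) * log((n + 1/(lam*gamma)) / eps)."""
    c = n + 1.0 / (lam * gamma)
    return c * math.log(c / eps)

def lipschitz_extra_iterations(n, lam, L, eps):
    """Th1 without the T_0 term: n + 4 L^2 / (lam * eps)."""
    return n + 4.0 * L * L / (lam * eps)

# Hypothetical setting: n = 10^4 examples, lam = 10^-4, target gap eps = 10^-3.
n, lam, eps = 10_000, 1e-4, 1e-3
t_smooth = smooth_iteration_bound(n, lam, gamma=1.0, eps=eps)
t_lipschitz = lipschitz_extra_iterations(n, lam, L=1.0, eps=eps)
```

In this setting the smooth bound grows like $\log(1/\varepsilon)$ while the Lipschitz bound grows like $1/\varepsilon$, so the smooth bound is smaller by roughly two orders of magnitude.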

  13. Linear Convergence for Smooth Hinge-Loss. Figure 2: Experiments with the smoothed hinge-loss ($\gamma = 1$).

  14. Convergence for Non-smooth Hinge-Loss. Figure 3: Experiments with the hinge-loss (non-smooth).

  15. Effect of Smoothness Parameter. Figure 4: Duality gap as a function of the number of rounds for different values of $\gamma$.

  16. Comparison to SGD. Figure 5: Comparing the primal sub-optimality of SDCA and SGD for the smoothed hinge-loss ($\gamma = 1$).

  17. Asynchronous SDCA

  18. Introduction

      Reference: PASSCoDe: Parallel ASynchronous Stochastic dual Co-ordinate Descent.

      Primal problem:

      $$\min_{w \in \mathbb{R}^d} P(w) := \frac{1}{2} \|w\|^2 + \sum_{i=1}^{n} \ell_i(w^T x_i)$$

      Dual problem:

      $$\min_{\alpha \in \mathbb{R}^n} D(\alpha) := \frac{1}{2} \Big\| \sum_{i=1}^{n} \alpha_i x_i \Big\|^2 + \sum_{i=1}^{n} \ell_i^*(-\alpha_i)$$

  19. Algorithm. Figure 6: Parallel Asynchronous Stochastic dual Co-ordinate Descent (PASSCoDe)

  20. Operation

      PASSCoDe-Lock
      • Step 1.5: lock the variables in $N_i := \{ w_t \mid (x_i)_t \neq 0 \}$.
      • The locks are released after step 3.
      • May be equivalent to an inconsistent read.

      PASSCoDe-Atomic
      • Step 3: for each $j \in N(i)$, update $w_j \leftarrow w_j + \Delta\alpha_i (x_i)_j$ atomically.

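  21. Linear Convergence Rate of PASSCoDe-Atomic

      Theorem: If

      $$\frac{6 \tau (\tau + 1)^2 e M}{\sqrt{n}} \le 1 \quad \text{and} \quad 1 \ge \frac{2 L_{\max}}{R_{\min}^2} \Big( 1 + \frac{e \tau M}{\sqrt{n}} \Big) \Big( \frac{\tau^2 M^2 e^2}{n} \Big)$$

      then PASSCoDe-Atomic has a global linear convergence rate in expectation, that is,

      $$\mathbb{E}\big[ D(\alpha^{j+1}) \big] - D(\alpha^*) \le \eta \, \Big( \mathbb{E}\big[ D(\alpha^{j}) \big] - D(\alpha^*) \Big)$$

      where $\alpha^*$ is the optimal solution and

      $$\eta = 1 - \frac{\kappa}{L_{\max}} \Big( 1 - \frac{2 L_{\max}}{R_{\min}^2} \Big( 1 + \frac{e \tau M}{\sqrt{n}} \Big) \Big( \frac{\tau^2 M^2 e^2}{n} \Big) \Big)$$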

  22. Convergence and Efficiency. Figure 7: Convergence and Efficiency for news20, covtype, rcv1 datasets

  23. Speedup. Figure 8: Speedup for news20, covtype, rcv1 datasets

  24. Q & A
