
PASSCoDe: Parallel ASynchronous Stochastic dual Co-ordinate Descent

Cho-Jui Hsieh, Department of Computer Science, University of Texas at Austin. Joint work with H.-F. Yu and I. S. Dhillon. July 7, 2015.


  1. PASSCoDe: Parallel ASynchronous Stochastic dual Co-ordinate Descent (title slide). Cho-Jui Hsieh (UT Austin), joint work with H.-F. Yu and I. S. Dhillon.

  2. Outline: L2-regularized Empirical Risk Minimization; Dual Coordinate Descent (Hsieh et al., 2008); Parallel Dual Coordinate Descent (on multi-core machines); Theoretical Analysis; Experimental Results.

  3. L2-regularized ERM
     $w^* = \arg\min_{w \in \mathbb{R}^d} P(w) := \frac{1}{2}\|w\|^2 + \sum_{i=1}^{n} \ell_i(w^T x_i)$
     SVM with hinge loss: $\ell_i(z_i) = C \max(1 - z_i, 0)$
     SVM with squared hinge loss: $\ell_i(z_i) = C \max(1 - z_i, 0)^2$
     Logistic regression: $\ell_i(z_i) = C \log(1 + e^{-z_i})$
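The objective and the three losses on this slide can be sketched in a few lines of NumPy. This is a minimal illustration, not code from the talk: the function names are hypothetical, and the rows of `X` are assumed to be the instances $x_i$ with any labels already folded in, since the slide writes the loss as a function of $w^T x_i$ alone.

```python
import numpy as np

def hinge(z, C=1.0):
    # C * max(1 - z, 0)
    return C * np.maximum(1.0 - z, 0.0)

def squared_hinge(z, C=1.0):
    # C * max(1 - z, 0)^2
    return C * np.maximum(1.0 - z, 0.0) ** 2

def logistic(z, C=1.0):
    # C * log(1 + exp(-z)), evaluated stably via logaddexp
    return C * np.logaddexp(0.0, -z)

def primal_objective(w, X, loss, C=1.0):
    """P(w) = 0.5 * ||w||^2 + sum_i loss(w^T x_i); rows of X are the x_i."""
    z = X @ w
    return 0.5 * float(w @ w) + float(np.sum(loss(z, C)))

# Tiny hypothetical example: two instances in two dimensions.
X = np.array([[1.0, 0.0], [0.0, 2.0]])
w = np.array([0.5, 1.0])
p_hinge = primal_objective(w, X, hinge)
p_sq = primal_objective(w, X, squared_hinge)
p_log = primal_objective(w, X, logistic)
```

Here $w^T x_1 = 0.5$ and $w^T x_2 = 2$, so only the first instance contributes to the two hinge-type losses.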

  4. Primal and Dual Formulations
     Primal problem: $w^* = \arg\min_{w \in \mathbb{R}^d} P(w) := \frac{1}{2}\|w\|^2 + \sum_{i=1}^{n} \ell_i(w^T x_i)$
     Dual problem: $\alpha^* = \arg\min_{\alpha \in \mathbb{R}^n} D(\alpha) := \frac{1}{2}\left\|\sum_{i=1}^{n} \alpha_i x_i\right\|^2 + \sum_{i=1}^{n} \ell_i^*(-\alpha_i)$
     $\ell_i^*(\cdot)$: the conjugate of $\ell_i(\cdot)$
     Primal-dual relationship between $w^*$ and $\alpha^*$: $w^* = w(\alpha^*) := \sum_{i=1}^{n} \alpha_i^* x_i$
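As a worked instance of the conjugate term, consider the hinge loss from the previous slide. The derivation below is a sketch under one assumption not stated on the slides: labels are folded into the instances ($x_i \leftarrow y_i x_i$), which is what the label-free form $\ell_i(w^T x_i)$ suggests.

```latex
\begin{aligned}
\ell_i(z) &= C \max(1 - z, 0) \\
\ell_i^*(-\alpha_i) &= \sup_{z} \bigl( -\alpha_i z - C \max(1 - z, 0) \bigr)
  = \begin{cases} -\alpha_i, & 0 \le \alpha_i \le C, \\ +\infty, & \text{otherwise,} \end{cases} \\
D(\alpha) &= \frac{1}{2} \Bigl\| \sum_{i=1}^{n} \alpha_i x_i \Bigr\|^2 - \sum_{i=1}^{n} \alpha_i
  \quad \text{subject to } 0 \le \alpha_i \le C .
\end{aligned}
```

Under that assumption the dual is a box-constrained quadratic program, the familiar form of the linear SVM dual.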

  5. Coordinate Descent on the Dual Problem
     Randomly select an $i \in \{1, \ldots, n\}$ and update $\alpha_i \leftarrow \alpha_i + \delta^*$, where
     $\delta^* = \arg\min_{\delta} D(\alpha + \delta e_i)$

  6. Coordinate Descent on the Dual Problem (continued)
     $\delta^* = \arg\min_{\delta} \frac{1}{2}\left(\delta + \frac{\left(\sum_{j=1}^{n} \alpha_j x_j\right)^T x_i}{\|x_i\|^2}\right)^2 + \frac{1}{\|x_i\|^2}\,\ell_i^*(-(\alpha_i + \delta)) = T_i\left(\left(\sum_{j=1}^{n} \alpha_j x_j\right)^T x_i,\ \alpha_i\right)$
     Simple univariate problem, but $O(\mathrm{nnz})$ construction time.

  7. Coordinate Descent on the Dual Problem (continued)
     $O(\mathrm{nnz})$ construction time $\Rightarrow$ $O(n_i)$ with DCD [Hsieh et al., 2008]:
     Maintain the primal variable $w = \sum_{i=1}^{n} \alpha_i x_i$, so that $\delta^* = T_i(w^T x_i, \alpha_i)$.
     $O(n_i)$ construction time: $n_i$ = nnz of $x_i$.
     $O(n_i)$ maintenance cost: $w \leftarrow w + \delta^* x_i$.
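To make the map $T_i$ concrete, the univariate problem can be worked out for the hinge loss. This is a sketch, assuming labels are folded into the instances ($x_i \leftarrow y_i x_i$) and using the standard fact that the hinge conjugate satisfies $\ell_i^*(-\alpha_i) = -\alpha_i$ on $0 \le \alpha_i \le C$ and is $+\infty$ elsewhere; neither detail appears on the slides.

```latex
\begin{aligned}
D(\alpha + \delta e_i)
  &= \tfrac{1}{2} \| w + \delta x_i \|^2 + \ell_i^*(-(\alpha_i + \delta)) + \text{const},
  \qquad w = \sum_{j=1}^{n} \alpha_j x_j, \\
\frac{D(\alpha + \delta e_i)}{\|x_i\|^2}
  &= \frac{1}{2} \Bigl( \delta + \frac{w^T x_i}{\|x_i\|^2} \Bigr)^2
   + \frac{1}{\|x_i\|^2}\, \ell_i^*(-(\alpha_i + \delta)) + \text{const}, \\
\text{hinge loss:} \quad
\delta^* &= T_i(w^T x_i, \alpha_i)
  = \min\Bigl( \max\Bigl( \alpha_i + \frac{1 - w^T x_i}{\|x_i\|^2},\, 0 \Bigr),\, C \Bigr) - \alpha_i .
\end{aligned}
```

The unconstrained minimizer follows from setting the derivative $\delta + (w^T x_i - 1)/\|x_i\|^2$ to zero; the clipping enforces $0 \le \alpha_i + \delta \le C$. Evaluating it needs only $w^T x_i$ and $\|x_i\|^2$, which is where the $O(n_i)$ cost comes from.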

  8. Dual Coordinate Descent
     Stochastic Dual Coordinate Descent. For $t = 1, 2, \ldots$
       1. Randomly pick an index $i$
       2. Compute $w^T x_i$
       3. Update $\alpha_i \leftarrow \alpha_i + \delta^*$, where $\delta^* = T_i(w^T x_i, \alpha_i)$
       4. Update $w \leftarrow w + \delta^* x_i$
     Implemented in LIBLINEAR: linear SVM (Hsieh et al., 2008), multi-class SVM (Keerthi et al., 2008), logistic regression (Yu et al., 2011).
     Analysis: (Nesterov et al., 2012; Shalev-Shwartz et al., 2013)
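The four steps above can be sketched in NumPy for one specific loss. This is a minimal illustration, not LIBLINEAR's implementation: it assumes the hinge loss, labels folded into the rows of `X`, and a dense matrix (a sparse row format would be needed to get the $O(n_i)$ per-step cost). The names `dcd_hinge`, `primal`, `dual` and the toy data are hypothetical.

```python
import numpy as np

def dcd_hinge(X, C=1.0, n_iters=2000, seed=0):
    """Stochastic dual coordinate descent for the hinge-loss dual.

    Rows of X are the instances x_i with labels already folded in
    (x_i <- y_i * x_i); this convention is an assumption of the sketch.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    alpha = np.zeros(n)
    w = np.zeros(d)                          # invariant: w == X.T @ alpha
    sq_norms = np.einsum("ij,ij->i", X, X)   # ||x_i||^2 for every row
    for _ in range(n_iters):
        i = rng.integers(n)                  # 1. randomly pick an index i
        if sq_norms[i] == 0.0:
            continue                         # an all-zero row cannot change w
        margin = w @ X[i]                    # 2. compute w^T x_i
        # 3. delta* = T_i(w^T x_i, alpha_i): clipped closed form for hinge loss
        new_alpha = min(max(alpha[i] + (1.0 - margin) / sq_norms[i], 0.0), C)
        delta = new_alpha - alpha[i]
        alpha[i] = new_alpha
        w += delta * X[i]                    # 4. w <- w + delta* x_i
    return w, alpha

def primal(w, X, C=1.0):
    return 0.5 * float(w @ w) + C * float(np.maximum(1.0 - X @ w, 0.0).sum())

def dual(alpha, X):
    v = X.T @ alpha
    return 0.5 * float(v @ v) - float(alpha.sum())

X_toy = np.array([[1.0, 0.5], [0.8, -0.2], [-0.3, 1.0], [0.2, 0.2]])
w_hat, alpha_hat = dcd_hinge(X_toy, C=1.0)
gap = primal(w_hat, X_toy) + dual(alpha_hat, X_toy)   # duality gap, >= 0
```

Because each step minimizes $D$ exactly along one coordinate, the dual objective never increases, and `gap` gives a simple stopping criterion.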

  9. Dual Coordinate Descent. [Figure: shared memory holding the instances $x_1, \ldots, x_6$, the primal vector $w$, and $\alpha = (0, 0, 0, 0, 0, 0)$; two cores, CPU1 and CPU2, with registers R1 to R4 shown for the core processing $i = 3$. The same figure is animated over the following frames.] DCD step: compute $w^T x_i$. Registers: R1 = 0, R2 = 0, R3 = 0, R4 = 0.

  10. Dual Coordinate Descent. DCD step: compute $w^T x_i$. Operation: load $x_{32}$ to R2. Registers: R1 = 0, R2 = $x_{32}$ = 0.2, R3 = 0, R4 = 0.

  11. Dual Coordinate Descent. DCD step: compute $w^T x_i$. Operation: load $w_2$ to R1. Registers: R1 = $w_2$ = 0, R2 = $x_{32}$ = 0.2, R3 = 0, R4 = 0.

  12. Dual Coordinate Descent. DCD step: compute $w^T x_i$. Operation: R3 = R3 + R1 × R2. Registers: R1 = $w_2$ = 0, R2 = $x_{32}$ = 0.2, R3 = $w^T x_i$ = 0, R4 = 0.

  13. Dual Coordinate Descent. DCD step: compute $w^T x_i$. Operation: load $x_{34}$ to R2. Registers: R1 = $w_2$ = 0, R2 = $x_{34}$ = 0.4, R3 = $w^T x_i$ = 0, R4 = 0.

  14. Dual Coordinate Descent. DCD step: compute $w^T x_i$. Operation: load $w_4$ to R1. Registers: R1 = $w_4$ = 0, R2 = $x_{34}$ = 0.4, R3 = $w^T x_i$ = 0, R4 = 0.

  15. Dual Coordinate Descent. DCD step: compute $w^T x_i$. Operation: R3 = R3 + R1 × R2. Registers: R1 = $w_4$ = 0, R2 = $x_{34}$ = 0.4, R3 = $w^T x_i$ = 0, R4 = 0.

  16. Dual Coordinate Descent. DCD step: compute $\delta^* = T_i(w^T x_i, \alpha_i)$. Operation: load $\alpha_3$ to R2. Registers: R1 = $w_4$ = 0, R2 = $\alpha_3$ = 0, R3 = $w^T x_i$ = 0, R4 = 0.

  17. Dual Coordinate Descent. DCD step: compute $\delta^* = T_i(w^T x_i, \alpha_i)$. Operation: R4 = $T_i$(R2, R3). Registers: R1 = $w_4$ = 0, R2 = $\alpha_3$ = 0, R3 = $w^T x_i$ = 0, R4 = $\delta^*$ = 1.

  18. Dual Coordinate Descent. DCD step: update $\alpha_i = \alpha_i + \delta^*$. Operation: R2 = R2 + R4. Registers: R1 = $w_4$ = 0, R2 = $\alpha_3$ = 1, R3 = $w^T x_i$ = 0, R4 = $\delta^*$ = 1.

  19. Dual Coordinate Descent. DCD step: update $\alpha_i = \alpha_i + \delta^*$. Operation: save R2 to $\alpha_3$. Memory: $\alpha = (0, 0, 1, 0, 0, 0)$. Registers: R1 = $w_4$ = 0, R2 = $\alpha_3$ = 1, R3 = $w^T x_i$ = 0, R4 = $\delta^*$ = 1.

  20. Dual Coordinate Descent. DCD step: update $w = w + \delta^* x_i$. Operation: load $x_{32}$ to R2. Registers: R1 = $w_4$ = 0, R2 = $x_{32}$ = 0.2, R3 = $w^T x_i$ = 0, R4 = $\delta^*$ = 1.

  21. Dual Coordinate Descent. DCD step: update $w = w + \delta^* x_i$. Operation: load $w_2$ to R1. Registers: R1 = $w_2$ = 0, R2 = $x_{32}$ = 0.2, R3 = $w^T x_i$ = 0, R4 = $\delta^*$ = 1.

  22. Dual Coordinate Descent. DCD step: update $w = w + \delta^* x_i$. Operation: R1 = R1 + R2 × R4. Registers: R1 = $w_2$ = 0.2, R2 = $x_{32}$ = 0.2, R3 = $w^T x_i$ = 0, R4 = $\delta^*$ = 1.

  23. Dual Coordinate Descent. DCD step: update $w = w + \delta^* x_i$. Operation: save R1 to $w_2$. Memory: $w_2 = 0.2$. Registers: R1 = $w_2$ = 0.2, R2 = $x_{32}$ = 0.2, R3 = $w^T x_i$ = 0, R4 = $\delta^*$ = 1.

  24. Dual Coordinate Descent. DCD step: update $w = w + \delta^* x_i$. Operation: load $x_{34}$ to R2. Registers: R1 = $w_2$ = 0.2, R2 = $x_{34}$ = 0.4, R3 = $w^T x_i$ = 0, R4 = $\delta^*$ = 1.

  25. Dual Coordinate Descent. DCD step: update $w = w + \delta^* x_i$. Operation: load $w_4$ to R1. Registers: R1 = $w_4$ = 0, R2 = $x_{34}$ = 0.4, R3 = $w^T x_i$ = 0, R4 = $\delta^*$ = 1.
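The register-level frames for one DCD step can be replayed with a small simulation. This is a sketch under stated assumptions: $x_3$ has exactly the two nonzeros loaded in the frames ($x_{32} = 0.2$, $x_{34} = 0.4$), $w$ and $\alpha$ start at zero, and $T_i$ is the hinge-loss closed form with $C = 1$. The loss and the value of $C$ are not given on these slides; they were chosen here because they reproduce $\delta^* = 1$. The helper names are hypothetical.

```python
def t_hinge(alpha_i, margin, sq_norm, C):
    # Hinge-loss closed form for delta* (an assumed choice of T_i).
    return min(max(alpha_i + (1.0 - margin) / sq_norm, 0.0), C) - alpha_i

def dcd_step_trace(x_i, w, alpha, i, C=1.0):
    """Run one DCD step through registers R1..R4, logging each operation."""
    R = {"R1": 0.0, "R2": 0.0, "R3": 0.0, "R4": 0.0}
    trace = []
    # DCD step: compute w^T x_i, accumulating into R3.
    for j, x_ij in x_i.items():
        R["R2"] = x_ij
        trace.append(f"Load x_{i}{j} to R2")
        R["R1"] = w[j]
        trace.append(f"Load w_{j} to R1")
        R["R3"] += R["R1"] * R["R2"]
        trace.append("R3 = R3 + R1 x R2")
    # DCD step: compute delta* = T_i(w^T x_i, alpha_i).
    R["R2"] = alpha[i]
    trace.append(f"Load alpha_{i} to R2")
    sq_norm = sum(v * v for v in x_i.values())
    R["R4"] = t_hinge(R["R2"], R["R3"], sq_norm, C)
    trace.append("R4 = T_i(R2, R3)")
    # DCD step: update alpha_i.
    R["R2"] += R["R4"]
    trace.append("R2 = R2 + R4")
    alpha[i] = R["R2"]
    trace.append(f"Save R2 to alpha_{i}")
    # DCD step: update w, one read-modify-write per nonzero of x_i.
    for j, x_ij in x_i.items():
        R["R2"] = x_ij
        trace.append(f"Load x_{i}{j} to R2")
        R["R1"] = w[j]
        trace.append(f"Load w_{j} to R1")
        R["R1"] += R["R2"] * R["R4"]
        trace.append("R1 = R1 + R2 x R4")
        w[j] = R["R1"]
        trace.append(f"Save R1 to w_{j}")
    return trace

x3 = {2: 0.2, 4: 0.4}     # nonzeros of x_3, 1-based feature indices
w = {2: 0.0, 4: 0.0}      # only the coordinates touched by x_3
alpha = {3: 0.0}
trace = dcd_step_trace(x3, w, alpha, i=3)
```

The trace has 18 operations for this step. Each coordinate of $w$ is updated by a separate load, multiply-add, and save rather than by one indivisible operation.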
