CoCoA: Communication-Efficient Coordinate Ascent - Virginia Smith

CoCoA: Communication-Efficient Distributed Dual Coordinate Ascent
Virginia Smith, Martin Jaggi, Martin Takáč, Jonathan Terhorst, Sanjay Krishnan, Thomas Hofmann, & Michael I. Jordan
LARGE-SCALE OPTIMIZATION


  1. Mini-batch Limitations
     1. METHODS BEYOND SGD
     2. STALE UPDATES
     3. AVERAGE OVER BATCH SIZE

  2. Mini-batch Limitations
     1. METHODS BEYOND SGD → Use Primal-Dual Framework
     2. STALE UPDATES → Immediately apply local updates
     3. AVERAGE OVER BATCH SIZE → Average over K << batch size

  3. CoCoA: Communication-Efficient Distributed Dual Coordinate Ascent
     1. METHODS BEYOND SGD → Use Primal-Dual Framework
     2. STALE UPDATES → Immediately apply local updates
     3. AVERAGE OVER BATCH SIZE → Average over K << batch size

  4. 1. Primal-Dual Framework: PRIMAL ≥ DUAL

  5. 1. Primal-Dual Framework: PRIMAL ≥ DUAL
     Primal: $\min_{w \in \mathbb{R}^d} \; P(w) := \frac{\lambda}{2}\|w\|^2 + \frac{1}{n}\sum_{i=1}^{n} \ell_i(w^T x_i)$

  6. 1. Primal-Dual Framework: PRIMAL ≥ DUAL
     Primal: $\min_{w \in \mathbb{R}^d} \; P(w) := \frac{\lambda}{2}\|w\|^2 + \frac{1}{n}\sum_{i=1}^{n} \ell_i(w^T x_i)$
     Dual: $\max_{\alpha \in \mathbb{R}^n} \; D(\alpha) := -\frac{\lambda}{2}\|A\alpha\|^2 - \frac{1}{n}\sum_{i=1}^{n} \ell_i^*(-\alpha_i)$, where $A_i = \frac{1}{\lambda n} x_i$

  7. (cont.) Stopping criterion given by the duality gap

  8. (cont.) Good performance in practice

  9. (cont.) Default in software packages, e.g. liblinear
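Two details worth spelling out: the duality gap that serves as the stopping criterion, and, as one concrete instance of the pair above, the hinge loss behind liblinear's SVM solver. This worked example is ours, for illustration, not from the slides.

```latex
% Duality gap (stopping criterion), with w(\alpha) := A\alpha and w^* the primal optimum:
G(\alpha) := P\big(w(\alpha)\big) - D(\alpha) \ \ge\ P(w^*) - D(\alpha) \ \ge\ 0.

% Hinge-loss instance: \ell_i(a) = \max(0,\, 1 - y_i a) has conjugate
% \ell_i^*(-\alpha_i) = -y_i\alpha_i for y_i\alpha_i \in [0,1], so the dual
% above becomes the familiar SVM dual:
D(\alpha) = -\frac{\lambda}{2}\|A\alpha\|^2 + \frac{1}{n}\sum_{i=1}^{n} y_i\alpha_i,
\qquad \text{s.t. } y_i\alpha_i \in [0,1],\; A_i = \tfrac{1}{\lambda n} x_i.
```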

  10. 2. Immediately Apply Updates

  11. 2. Immediately Apply Updates
      STALE:
        for i ∈ b
          Δw ← Δw − α ∇_i P(w)
        end
        w ← w + Δw

  12. 2. Immediately Apply Updates
      STALE:
        for i ∈ b
          Δw ← Δw − α ∇_i P(w)
        end
        w ← w + Δw
      FRESH:
        for i ∈ b
          Δw ← Δw − α ∇_i P(w)
          w ← w + Δw
        end
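In code, STALE vs. FRESH is just a question of where the weight update lands. A minimal numpy sketch, assuming a regularized hinge-loss primal; grad_i is an illustrative helper we introduce here, not part of the original pseudocode:

```python
import numpy as np

def grad_i(w, X, y, lam, i):
    """Illustrative single-example (sub)gradient of the regularized
    hinge-loss primal at example i (assumed helper, not from the deck)."""
    margin = y[i] * X[i].dot(w)
    loss_grad = -y[i] * X[i] if margin < 1.0 else np.zeros_like(w)
    return lam * w + loss_grad

def stale_batch_step(w, X, y, lam, batch, step):
    """STALE: every gradient in the batch is computed at the same old w;
    the accumulated update is applied only after the loop."""
    dw = np.zeros_like(w)
    for i in batch:
        dw -= step * grad_i(w, X, y, lam, i)
    return w + dw

def fresh_batch_step(w, X, y, lam, batch, step):
    """FRESH: each update is applied immediately, so later gradients in the
    same batch already see the newer iterate."""
    w = w.copy()
    for i in batch:
        w -= step * grad_i(w, X, y, lam, i)
    return w
```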

  13. 3. Average over K

  14. 3. Average over K
      reduce: $w = w + \frac{1}{K}\sum_{k} \Delta w_k$

  15. CoCoA

      Algorithm 1: CoCoA
      Input: T ≥ 1, scaling parameter 1 ≤ β_K ≤ K (default: β_K := 1)
      Data: {(x_i, y_i)}_{i=1}^n distributed over K machines
      Initialize: α^(0)_[k] ← 0 for all machines k, and w^(0) ← 0
      for t = 1, 2, ..., T
        for all machines k = 1, 2, ..., K in parallel
          (Δα_[k], Δw_k) ← LocalDualMethod(α^(t-1)_[k], w^(t-1))
          α^(t)_[k] ← α^(t-1)_[k] + (β_K / K) Δα_[k]
        end
        reduce: w^(t) ← w^(t-1) + (β_K / K) Σ_{k=1}^{K} Δw_k
      end

      Procedure A: LocalDualMethod (dual algorithm on machine k)
      Input: local α_[k] ∈ R^{n_k}, and w ∈ R^d consistent with the other
             coordinate blocks of α, s.t. w = Aα
      Data: local {(x_i, y_i)}_{i=1}^{n_k}
      Output: Δα_[k] and Δw := A_[k] Δα_[k]

  16. (cont.) ✔ <10 lines of code in Spark

  17. (cont.) ✔ primal-dual framework allows for any internal optimization method

  18. (cont.) ✔ local updates applied immediately

  19. (cont.) ✔ average over K
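A single-process numpy sketch of Algorithm 1, simulating the K machines with an ordinary loop and using H steps of SDCA for the hinge loss as LocalDualMethod. The data layout, helper names, and default parameters are our assumptions for illustration, not the authors' (<10-line) Spark reference code:

```python
import numpy as np

def local_sdca(alpha_k, w, X_k, y_k, lam, n, H, rng):
    """LocalDualMethod sketch: H steps of SDCA for the hinge loss on machine
    k's coordinate block; returns (delta_alpha_k, delta_w_k). Local updates
    are applied immediately to local copies ("fresh")."""
    alpha = alpha_k.copy()
    w_loc = w.copy()
    for _ in range(H):
        i = rng.integers(len(y_k))
        q = X_k[i].dot(X_k[i])            # ||x_i||^2
        if q == 0.0:
            continue
        # closed-form hinge-loss coordinate step, keeping y_i * alpha_i in [0, 1]
        step = (1.0 - y_k[i] * X_k[i].dot(w_loc)) * lam * n / q
        a_new = y_k[i] * min(1.0, max(0.0, y_k[i] * alpha[i] + step))
        w_loc += ((a_new - alpha[i]) / (lam * n)) * X_k[i]   # keeps w = A @ alpha
        alpha[i] = a_new
    return alpha - alpha_k, w_loc - w

def cocoa(X_blocks, y_blocks, lam, T=50, H=1000, beta_K=1.0, seed=0):
    """Outer loop of Algorithm 1, simulating the K machines sequentially."""
    rng = np.random.default_rng(seed)
    K = len(X_blocks)
    n = sum(len(y) for y in y_blocks)
    w = np.zeros(X_blocks[0].shape[1])
    alphas = [np.zeros(len(y)) for y in y_blocks]
    for _ in range(T):
        deltas_w = []
        for k in range(K):                # runs in parallel on a real cluster
            d_alpha, d_w = local_sdca(alphas[k], w, X_blocks[k], y_blocks[k],
                                      lam, n, H, rng)
            alphas[k] += (beta_K / K) * d_alpha
            deltas_w.append(d_w)
        w += (beta_K / K) * sum(deltas_w)  # the single reduce per round
    return w, alphas
```

On a real cluster only the d-dimensional Δw_k vectors cross the network, once per outer round; the α blocks never leave their machines, which is where the communication efficiency comes from.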

  20. Convergence

  21. Convergence
      Assumptions: the losses ℓ_i are (1/γ)-smooth, and LocalDualMethod makes
      improvement Θ per step, e.g. for SDCA:
      $\Theta = \Big(1 - \frac{\lambda n \gamma}{1 + \lambda n \gamma}\,\frac{1}{\tilde{n}}\Big)^{H}$

  22. (cont.) Convergence theorem:
      $\mathbb{E}\big[D(\alpha^*) - D(\alpha^{(T)})\big] \le \Big(1 - (1-\Theta)\,\frac{1}{K}\,\frac{\lambda n \gamma}{\sigma + \lambda n \gamma}\Big)^{T}\,\big(D(\alpha^*) - D(\alpha^{(0)})\big)$

  23. (cont.) The same bound applies to the duality gap;
      $0 \le \sigma \le n/K$ is a measure of the difficulty of the data partition.

  24. (cont.) ✔ it converges!

  25. (cont.) ✔ inherits the convergence rate of the locally used method

  26. (cont.) ✔ convergence rate is linear for smooth losses
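For intuition about the geometric rate, a quick plug-in of the theorem's quantities. Every constant below is a made-up illustrative value; in particular we take ñ to be the local data size n/K and set σ to its worst case n/K:

```python
# Illustrative only: plug assumed values into Theta (SDCA improvement per
# local step count H) and the per-round contraction factor from the theorem.
lam, n, gamma = 1e-4, 500_000, 1.0   # assumed lambda, data size, smoothness
K, H = 8, 100_000                    # machines, local SDCA steps per round
n_tilde = n / K                      # assumed local data size per machine

theta = (1 - (lam * n * gamma) / (1 + lam * n * gamma) / n_tilde) ** H
sigma = n / K                        # worst-case partition difficulty
rate = 1 - (1 - theta) * (1 / K) * (lam * n * gamma) / (sigma + lam * n * gamma)

print(f"Theta = {theta:.4f}, per-round contraction = {rate:.8f}")
# After T outer rounds the expected dual suboptimality shrinks by rate**T.
```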

  27. LARGE-SCALE OPTIMIZATION: CoCoA RESULTS!

  29. Empirical Results

      Dataset    Training (n)   Features (d)   Sparsity   Workers (K)
      Cov        522,911        54             22.22%     4
      Rcv1       677,399        47,236         0.16%      8
      Imagenet   32,751         160,000        100%       32

  30. [Figure: Imagenet. Log primal suboptimality vs. time (s) for COCOA (H=1e3), mini-batch-CD (H=1), local-SGD (H=1e3), and mini-batch-SGD (H=10).]

  31. [Figure: three panels, log primal suboptimality vs. time (s). RCV1: COCOA (H=1e5), minibatch-CD (H=100), local-SGD (H=1e4), batch-SGD (H=100). Cov: COCOA (H=1e5), minibatch-CD (H=100), local-SGD (H=1e5), batch-SGD (H=1). Imagenet: COCOA (H=1e3), mini-batch-CD (H=1), local-SGD (H=1e3), mini-batch-SGD (H=10).]

  32. Effect of H on CoCoA. [Figure: log primal suboptimality vs. time (s) for H = 1e5, 1e4, 1e3, 100, 1.]

  33. CoCoA Take-Aways
