 
Mini-batch Limitations
1. METHODS BEYOND SGD
2. STALE UPDATES
3. AVERAGE OVER BATCH SIZE

Communication-Efficient Distributed Dual Coordinate Ascent (CoCoA)
1. METHODS BEYOND SGD: Use Primal-Dual Framework
2. STALE UPDATES: Immediately apply local updates
3. AVERAGE OVER BATCH SIZE: Average over K << batch size
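
1. Primal-Dual Framework

PRIMAL ≥ DUAL

Primal:
$$\min_{w \in \mathbb{R}^d} \left[ P(w) := \frac{\lambda}{2} \|w\|^2 + \frac{1}{n} \sum_{i=1}^{n} \ell_i(w^T x_i) \right]$$

Dual:
$$\max_{\alpha \in \mathbb{R}^n} \left[ D(\alpha) := -\frac{\lambda}{2} \|A\alpha\|^2 - \frac{1}{n} \sum_{i=1}^{n} \ell_i^*(-\alpha_i) \right], \qquad A_i = \frac{1}{\lambda n} x_i$$

Stopping criteria given by duality gap
Good performance in practice
Default in software packages, e.g. liblinear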
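
As a small illustration of the duality-gap stopping criterion, the sketch below evaluates P(w), D(α) and their difference for one concrete loss. The squared loss $\ell_i(a) = (a - y_i)^2$, with conjugate $\ell_i^*(-\alpha_i) = -\alpha_i y_i + \alpha_i^2/4$, is an assumption made for the example; the slide does not fix a loss. All function names are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def w_of_alpha(alpha, X, lam):
    """Primal point w = A alpha, where column i of A is x_i / (lam * n)."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for a, x in zip(alpha, X):
        for j in range(d):
            w[j] += a * x[j] / (lam * n)
    return w

def primal(w, X, y, lam):
    """P(w) = lam/2 ||w||^2 + (1/n) sum_i (w.x_i - y_i)^2 (squared loss assumed)."""
    n = len(X)
    loss = sum((dot(w, x) - yi) ** 2 for x, yi in zip(X, y)) / n
    return 0.5 * lam * dot(w, w) + loss

def dual(alpha, X, y, lam):
    """D(alpha) = -lam/2 ||A alpha||^2 - (1/n) sum_i l_i^*(-alpha_i)."""
    n = len(X)
    w = w_of_alpha(alpha, X, lam)
    conj = sum(-a * yi + a * a / 4.0 for a, yi in zip(alpha, y)) / n
    return -0.5 * lam * dot(w, w) - conj

def duality_gap(alpha, X, y, lam):
    """P(w(alpha)) - D(alpha); non-negative, and zero only at the optimum."""
    w = w_of_alpha(alpha, X, lam)
    return primal(w, X, y, lam) - dual(alpha, X, y, lam)
```

A solver can stop once `duality_gap(alpha, X, y, lam)` falls below a chosen tolerance, since the gap upper-bounds the primal suboptimality.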

2. Immediately Apply Updates

STALE:
    for i ∈ b
        Δw ← Δw − α ∇_i P(w)
    end
    w ← w + Δw

FRESH:
    for i ∈ b
        Δw ← Δw − α ∇_i P(w)
        w ← w + Δw
    end
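
The difference between the two loops can be seen on a toy problem. The sketch below is one possible reading, not the authors' code: FRESH is interpreted as applying each step's increment to w right away, the per-example objective (squared loss plus L2 term) is an assumption, and `grad_i`, `stale_pass`, `fresh_pass` and `step` are hypothetical names.

```python
def grad_i(w, x, yi, lam):
    """Gradient of lam/2 ||w||^2 + (w.x - y)^2 for a single example (assumed objective)."""
    r = sum(wj * xj for wj, xj in zip(w, x)) - yi
    return [lam * wj + 2.0 * r * xj for wj, xj in zip(w, x)]

def stale_pass(w, batch, step, lam):
    """STALE: every gradient is evaluated at the same old w; w changes once at the end."""
    delta = [0.0] * len(w)
    for x, yi in batch:
        g = grad_i(w, x, yi, lam)          # w is not updated inside the loop
        delta = [dj - step * gj for dj, gj in zip(delta, g)]
    return [wj + dj for wj, dj in zip(w, delta)]

def fresh_pass(w, batch, step, lam):
    """FRESH: each step is applied immediately, so the next gradient sees the new w."""
    w = list(w)
    for x, yi in batch:
        g = grad_i(w, x, yi, lam)          # evaluated at the freshly updated w
        w = [wj - step * gj for wj, gj in zip(w, g)]
    return w
```

On a batch of four copies of the example x = [1.0], y = 1.0 with step 0.4 and no regularization, the stale pass adds four identical steps and overshoots the target w = 1, while the fresh pass shrinks its steps as it approaches it.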

3. Average over K

reduce: $w = w + \frac{1}{K} \sum_k \Delta w_k$
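
A minimal sketch of this reduce step, assuming each of the K machines returns its Δw_k as a plain list of floats; `reduce_average` is a hypothetical name.

```python
def reduce_average(w, deltas):
    """Return w + (1/K) * sum_k deltas[k], where K = len(deltas)."""
    K = len(deltas)
    return [wj + sum(d[j] for d in deltas) / K for j, wj in enumerate(w)]
```

The divisor is the number of machines K, not the number of examples processed, so the scaling does not shrink as more local work is done per round.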

CoCoA

Algorithm 1: CoCoA
Input: $T \ge 1$, scaling parameter $1 \le \beta_K \le K$ (default: $\beta_K := 1$).
Data: $\{(x_i, y_i)\}_{i=1}^{n}$ distributed over K machines
Initialize: $\alpha^{(0)}_{[k]} \leftarrow 0$ for all machines k, and $w^{(0)} \leftarrow 0$
for t = 1, 2, ..., T
    for all machines k = 1, 2, ..., K in parallel
        $(\Delta\alpha_{[k]}, \Delta w_k) \leftarrow$ LocalDualMethod$(\alpha^{(t-1)}_{[k]}, w^{(t-1)})$
        $\alpha^{(t)}_{[k]} \leftarrow \alpha^{(t-1)}_{[k]} + \frac{\beta_K}{K} \Delta\alpha_{[k]}$
    end
    reduce $w^{(t)} \leftarrow w^{(t-1)} + \frac{\beta_K}{K} \sum_{k=1}^{K} \Delta w_k$
end

Procedure A: LocalDualMethod: Dual algorithm on machine k
Input: Local $\alpha_{[k]} \in \mathbb{R}^{n_k}$, and $w \in \mathbb{R}^d$ consistent with other coordinate blocks of $\alpha$ s.t. $w = A\alpha$
Data: Local $\{(x_i, y_i)\}_{i=1}^{n_k}$
Output: $\Delta\alpha_{[k]}$ and $\Delta w := A_{[k]} \Delta\alpha_{[k]}$

✔ <10 lines of code in Spark
✔ primal-dual framework allows for any internal optimization method
✔ local updates applied immediately
✔ average over K
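
To make the control flow of Algorithm 1 concrete, here is a single-process sketch rather than the Spark implementation the slide refers to. It assumes SDCA with squared loss $\ell_i(a) = (a - y_i)^2$ as the LocalDualMethod and runs the K machines one after another in a list comprehension. The names `local_sdca`, `cocoa`, `parts` and `seed` are hypothetical.

```python
import random

def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def local_sdca(alpha_k, w, X_k, y_k, lam, n, H, rng):
    """H SDCA steps on one machine's block (squared loss assumed).

    Returns (delta_alpha, delta_w) with delta_w = A_[k] delta_alpha.
    """
    w = list(w)                      # local copy, updated immediately after each step
    d_alpha = [0.0] * len(alpha_k)
    d_w = [0.0] * len(w)
    for _ in range(H):
        i = rng.randrange(len(X_k))
        x = X_k[i]
        a = alpha_k[i] + d_alpha[i]  # current local value of alpha_i
        # Closed-form coordinate maximizer of the dual for squared loss.
        step = (y_k[i] - dot(w, x) - a / 2.0) / (0.5 + dot(x, x) / (lam * n))
        d_alpha[i] += step
        for j, xj in enumerate(x):
            inc = step * xj / (lam * n)
            w[j] += inc
            d_w[j] += inc
    return d_alpha, d_w

def cocoa(parts, lam, T, H, beta_K=1.0, seed=0):
    """parts: list of (X_k, y_k) pairs, one per simulated machine."""
    rng = random.Random(seed)
    K = len(parts)
    n = sum(len(X_k) for X_k, _ in parts)
    d = len(parts[0][0][0])
    alphas = [[0.0] * len(X_k) for X_k, _ in parts]
    w = [0.0] * d
    for _ in range(T):
        # "In parallel": every machine starts the round from the same w.
        results = [local_sdca(alphas[k], w, X_k, y_k, lam, n, H, rng)
                   for k, (X_k, y_k) in enumerate(parts)]
        for k, (d_alpha, _) in enumerate(results):
            alphas[k] = [a + beta_K / K * da for a, da in zip(alphas[k], d_alpha)]
        # Reduce: a single d-dimensional vector per machine is communicated.
        for j in range(d):
            w[j] += beta_K / K * sum(d_w[j] for _, d_w in results)
    return alphas, w
```

Because α and w are scaled by the same factor β_K/K, the invariant w = Aα is preserved after every round, which is what allows the duality gap to be evaluated at any time.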

Convergence

Assumptions:
    $\ell_i$ are $1/\gamma$-smooth
    LocalDualMethod makes improvement $\Theta$ per step, e.g. for SDCA:
    $$\Theta = \left(1 - \frac{\lambda n \gamma}{1 + \lambda n \gamma} \cdot \frac{1}{\tilde{n}}\right)^{H}$$

$$E\left[D(\alpha^*) - D(\alpha^{(T)})\right] \le \left(1 - (1 - \Theta)\,\frac{1}{K}\,\frac{\lambda n \gamma}{\sigma + \lambda n \gamma}\right)^{T} \left(D(\alpha^*) - D(\alpha^{(0)})\right)$$

applies also to duality gap
$0 \le \sigma \le n/K$: measure of difficulty of data partition

✔ it converges!
✔ inherits convergence rate of locally used method
✔ convergence rate is linear for smooth losses
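
The bound can be evaluated numerically to see how H and K interact. The sketch below only transcribes the two formulas above. The function names are hypothetical, and `n_tilde` is read here as the size of the largest local data block, which is an assumption since the slide does not define it.

```python
import math

def sdca_theta(lam, n, gamma, n_tilde, H):
    """Theta for H local SDCA steps: (1 - s/(1+s) * 1/n_tilde)^H with s = lam*n*gamma."""
    s = lam * n * gamma
    return (1.0 - s / (1.0 + s) / n_tilde) ** H

def cocoa_rate(theta, lam, n, gamma, sigma, K):
    """Per-round contraction factor: 1 - (1 - theta) * (1/K) * s / (sigma + s)."""
    s = lam * n * gamma
    return 1.0 - (1.0 - theta) * (1.0 / K) * s / (sigma + s)

def rounds_needed(rate, eps):
    """Smallest T with rate**T <= eps, for 0 < rate < 1 and 0 < eps < 1."""
    return math.ceil(math.log(eps) / math.log(rate))
```

For instance, `rounds_needed(cocoa_rate(sdca_theta(0.01, 1000, 1.0, 250, 100), 0.01, 1000, 1.0, 250.0, 4), 1e-3)` gives the number of communication rounds the bound guarantees for a 1000-fold reduction in expected dual suboptimality under those illustrative parameter values. More local steps H lower Θ and therefore lower the required number of rounds.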

LARGE-SCALE OPTIMIZATION
CoCoA RESULTS!

Empirical Results

Dataset    Training (n)   Features (d)   Sparsity   Workers (K)
Cov        522,911        54             22.22%     4
Rcv1       677,399        47,236         0.16%      8
Imagenet   32,751         160,000        100%       32

[Figure: log primal suboptimality vs. time (s), one panel per dataset]
Cov: COCOA (H=1e5), minibatch-CD (H=100), local-SGD (H=1e5), batch-SGD (H=1)
RCV1: COCOA (H=1e5), minibatch-CD (H=100), local-SGD (H=1e4), batch-SGD (H=100)
Imagenet: COCOA (H=1e3), mini-batch-CD (H=1), local-SGD (H=1e3), mini-batch-SGD (H=10)

Effect of H on CoCoA
[Figure: log primal suboptimality vs. time (s) for H = 1e5, 1e4, 1e3, 100, 1]

CoCoA Take-Aways