CoCoA: Communication-Efficient Coordinate Ascent
Virginia Smith
Martin Jaggi, Martin Takáč, Jonathan Terhorst, Sanjay Krishnan, Thomas Hofmann, & Michael I. Jordan

LARGE-SCALE OPTIMIZATION
image/music/video tagging, document categorization, item recommendation, click-through rate prediction, sequence tagging, protein structure prediction, sensor data prediction, spam classification, fraud detection
DATA & PROBLEM
classification, regression, collaborative filtering, …

MACHINE LEARNING MODEL
logistic regression, lasso, support vector machines, …

OPTIMIZATION ALGORITHM
gradient descent, coordinate descent, Newton’s method, …
[Figure: linear SVM margin, with hyperplanes w·x = 1 and w·x = −1 and margin width 2/‖w‖]

$$\min_{w \in \mathbb{R}^d} \; \frac{1}{n} \sum_{i=1}^{n} \ell_{\mathrm{hinge}}\!\left(y_i\, w^\top x_i\right)$$
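A minimal sketch (not from the slides) of evaluating this hinge-loss objective; NumPy, the synthetic data, and the function name hinge_objective are assumptions for illustration.

```python
# Sketch: average hinge loss (1/n) * sum_i max(0, 1 - y_i * w^T x_i)
import numpy as np

def hinge_objective(w, X, y):
    """X has shape (n, d); y holds labels in {-1, +1}."""
    margins = y * (X @ w)                      # y_i * w^T x_i for every example
    return np.maximum(0.0, 1.0 - margins).mean()

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = np.sign(rng.standard_normal(100))
print(hinge_objective(np.zeros(5), X, y))      # equals 1.0 at w = 0
```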
Optimization algorithms: descent algorithms and line search methods; acceleration, momentum, and conjugate gradients; Newton and quasi-Newton methods; coordinate descent; stochastic and incremental gradient methods
Solvers: SMO, SVMlight, LIBLINEAR
$$\min_{w \in \mathbb{R}^d} \; \frac{1}{n} \sum_{i=1}^{n} \ell_i\!\left(w^\top x_i\right)$$
support vector machines, logistic regression, lasso regression, ridge regression, etc…
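The same template, with a different loss ℓ_i plugged in, gives each of these models. The hedged sketch below uses standard textbook loss formulas; the specific forms and synthetic data are assumptions, not taken from the slides, and lasso additionally changes the regularizer rather than only the loss.

```python
# Sketch: one objective template, (1/n) * sum_i loss(w^T x_i, y_i),
# instantiated with standard losses (assumed forms, not from the slides).
import numpy as np

def objective(w, X, y, loss):
    return loss(X @ w, y).mean()

losses = {
    "svm (hinge)":   lambda z, y: np.maximum(0.0, 1.0 - y * z),
    "logistic":      lambda z, y: np.log1p(np.exp(-y * z)),
    "least squares": lambda z, y: 0.5 * (z - y) ** 2,   # ridge adds a ||w||^2 term
}

rng = np.random.default_rng(1)
X, y = rng.standard_normal((50, 3)), np.sign(rng.standard_normal(50))
w = rng.standard_normal(3)
for name, loss in losses.items():
    print(f"{name}: {objective(w, X, y, loss):.3f}")
```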
SYSTEMS SETTING
multi-core, cluster, cloud, supercomputer, …
“always communicate”
reduce: w ← w − α Σ_k ∆w_k
✔ convergence guarantees
✗ high communication

“never communicate”
average: w := (1/K) Σ_k w_k   (ZDWJ, 2012)
✔ low communication
✗ convergence not guaranteed
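To make the two extremes concrete, here is a toy simulation sketch on synthetic least-squares data (the data, step size, and worker count are all assumptions); it illustrates only the communication patterns, not the convergence behavior of either extreme.

```python
# Sketch: "always communicate" (reduce a gradient every round) vs.
# "never communicate" (solve locally, average once). Synthetic least squares.
import numpy as np

rng = np.random.default_rng(0)
K, n_per, d = 4, 200, 10
w_true = rng.standard_normal(d)
Xs = [rng.standard_normal((n_per, d)) for _ in range(K)]
ys = [X @ w_true + 0.1 * rng.standard_normal(n_per) for X in Xs]

def grad(w, X, y):                       # gradient of (1/2n) * ||Xw - y||^2
    return X.T @ (X @ w - y) / len(y)

# "always communicate": every round, w = w - alpha * sum_k dw_k
w, alpha = np.zeros(d), 0.1
for _ in range(100):
    w = w - alpha * sum(grad(w, X, y) for X, y in zip(Xs, ys))

# "never communicate": each worker solves its local problem, average once
w_avg = sum(np.linalg.lstsq(X, y, rcond=None)[0] for X, y in zip(Xs, ys)) / K

print("reduce error:          ", np.linalg.norm(w - w_true))
print("one-shot average error:", np.linalg.norm(w_avg - w_true))
```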
a natural middle-ground
reduce: w ← w − (α/|b|) Σ_{i∈b} ∆w_i
✔ convergence guarantees
✔ tunable communication
CoCoA:
Use a primal-dual framework
Immediately apply local updates
Average over K << batch size
PRIMAL ↔ DUAL

$$\min_{w \in \mathbb{R}^d} \; \Big[\, P(w) := \frac{\lambda}{2}\|w\|^2 + \frac{1}{n}\sum_{i=1}^{n} \ell_i(w^\top x_i) \,\Big]$$

$$\max_{\alpha \in \mathbb{R}^n} \; \Big[\, D(\alpha) := -\frac{\lambda}{2}\|A\alpha\|^2 - \frac{1}{n}\sum_{i=1}^{n} \ell_i^*(-\alpha_i) \,\Big], \qquad A_i = \frac{1}{\lambda n}\, x_i$$

Stopping criteria given by the duality gap
Good performance in practice
Default in software packages, e.g. liblinear
STALE updates:
for i ∈ b:
    ∆w ← ∆w − α ∇_i P(w)
end
w ← w + ∆w

FRESH updates:
for i ∈ b:
    ∆w ← ∆w − α ∇_i P(w)
    w ← w + ∆w
end

reduce: w ← w + (1/K) Σ_k ∆w_k
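A hedged sketch of these two local-update patterns; grad_i, the block b, and the step size are placeholders, not CoCoA's actual local solver.

```python
# Sketch: "stale" accumulates updates against a fixed w, while "fresh" applies
# each update to the local copy of w immediately, as CoCoA's local solver does.
import numpy as np

def local_round_stale(w, b, grad_i, alpha):
    dw = np.zeros_like(w)
    for i in b:
        dw -= alpha * grad_i(w, i)        # every step sees the same stale w
    return dw                             # caller then applies w + dw

def local_round_fresh(w, b, grad_i, alpha):
    w, dw = w.copy(), np.zeros_like(w)
    for i in b:
        step = -alpha * grad_i(w, i)      # each step sees the latest local w
        dw += step
        w += step
    return dw                             # only dw is communicated and averaged
```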
Algorithm 1: CoCoA
Input: T ≥ 1, scaling parameter 1 ≤ β_K ≤ K (default: β_K := 1)
Data: {(x_i, y_i)}_{i=1}^n distributed over K machines
Initialize: α^(0)_[k] ← 0 for all machines k, and w^(0) ← 0
for t = 1, 2, …, T
    for all machines k = 1, 2, …, K in parallel
        (∆α_[k], ∆w_k) ← LocalDualMethod(α^(t−1)_[k], w^(t−1))
        α^(t)_[k] ← α^(t−1)_[k] + (β_K / K) ∆α_[k]
    end
    reduce: w^(t) ← w^(t−1) + (β_K / K) Σ_{k=1}^K ∆w_k
end

Procedure A: LocalDualMethod: dual algorithm on machine k
Input: local α_[k] ∈ R^{n_k}, and w ∈ R^d consistent with the other coordinate blocks of α, s.t. w = Aα
Data: local {(x_i, y_i)}_{i=1}^{n_k}
Output: ∆α_[k] and ∆w := A_[k] ∆α_[k]
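A hedged Python sketch of Algorithm 1 for the hinge-loss SVM, with randomized dual coordinate ascent (SDCA) as LocalDualMethod; the coordinate step below is the standard closed-form SDCA hinge update, and the data partitioning, H, and all constants are assumptions rather than the reference implementation.

```python
# Sketch of CoCoA (Algorithm 1) with SDCA as the local dual method, hinge loss.
import numpy as np

def local_sdca(alpha_k, w, X_k, y_k, lam, n, H, rng):
    """H randomized dual coordinate steps on machine k; returns (d_alpha, d_w)."""
    alpha_k, w = alpha_k.copy(), w.copy()
    d_alpha, d_w = np.zeros_like(alpha_k), np.zeros_like(w)
    for _ in range(H):
        i = rng.integers(len(y_k))
        x, yi = X_k[i], y_k[i]
        # standard closed-form SDCA maximization for the hinge loss
        q = (x @ x) / (lam * n)
        new_ai = yi * np.clip((1.0 - yi * (x @ w)) / q + yi * alpha_k[i], 0.0, 1.0)
        delta = new_ai - alpha_k[i]
        alpha_k[i] += delta
        d_alpha[i] += delta
        step = delta * x / (lam * n)      # keeps w = A @ alpha locally
        w += step                         # "fresh": apply the update immediately
        d_w += step
    return d_alpha, d_w

def cocoa(Xs, ys, lam, T=20, H=100, beta_K=1.0, seed=0):
    K, n = len(Xs), sum(len(y) for y in ys)
    w = np.zeros(Xs[0].shape[1])
    alphas = [np.zeros(len(y)) for y in ys]
    rng = np.random.default_rng(seed)
    for _ in range(T):
        # "for all machines k in parallel" (run sequentially in this sketch)
        updates = [local_sdca(alphas[k], w, Xs[k], ys[k], lam, n, H, rng)
                   for k in range(K)]
        for k, (d_alpha, _) in enumerate(updates):
            alphas[k] += (beta_K / K) * d_alpha
        w = w + (beta_K / K) * sum(d_w for _, d_w in updates)   # reduce step
    return w, alphas
```

With β_K = 1 (the default above), the reduce step averages the K local updates; the duality-gap sketch earlier could serve as the stopping criterion.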
Convergence

Assumptions: the ℓ_i are (1/γ)-smooth, and LocalDualMethod makes improvement Θ per step; e.g. for SDCA with H local iterations, $\Theta = \big(1 - \frac{\lambda n \gamma}{1 + \lambda n \gamma}\cdot\frac{1}{\tilde n}\big)^H$, where $\tilde n = n/K$ is the local problem size.

$$\mathbb{E}\big[D(\alpha^*) - D(\alpha^{(T)})\big] \;\le\; \Big(1 - (1 - \Theta)\,\frac{1}{K}\,\frac{\lambda n \gamma}{\sigma + \lambda n \gamma}\Big)^T \Big(D(\alpha^*) - D(\alpha^{(0)})\Big)$$

The same rate applies also to the duality gap.
σ is a measure of the difficulty of the data partition, with 0 ≤ σ ≤ n/K.
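A small illustrative evaluation of this bound (every constant below is an arbitrary assumption, and σ is set to its worst-case value n/K): the absolute round counts are conservative, but the trend is the point, since increasing the local work H drives Θ toward 0, which enlarges the per-round progress and cuts the required number of communication rounds T accordingly.

```python
# Sketch: how local work H trades off against outer rounds T in the bound.
import math

lam, n, gamma, K = 1e-4, 500_000, 1.0, 8      # arbitrary illustrative values
sigma = n / K                                 # worst-case partition difficulty
n_local = n / K                               # local problem size (n-tilde)
for H in (1, 100, 10_000):
    theta = (1 - (lam * n * gamma) / (1 + lam * n * gamma) / n_local) ** H
    progress = (1 - theta) / K * (lam * n * gamma) / (sigma + lam * n * gamma)
    T = math.log(1e-6) / math.log(1 - progress)   # rounds for a 1e-6 reduction factor
    print(f"H={H:>6}:  per-round progress={progress:.3e},  T ≈ {T:.3e}")
```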
Dataset    Training (n)   Features (d)   Sparsity   Workers (K)
Cov        522,911        54             22.22%     4
Rcv1       677,399        47,236         0.16%      8
Imagenet   32,751         160,000        100%       32
[Figure: log primal suboptimality vs. time (s) on Imagenet, comparing COCOA (H=1e3), mini-batch-CD (H=1), local-SGD (H=1e3), and mini-batch-SGD (H=10)]
[Figure: log primal suboptimality vs. time (s) on Cov, comparing COCOA (H=1e5), mini-batch-CD (H=100), local-SGD (H=1e5), and batch-SGD (H=1)]
[Figure: log primal suboptimality vs. time (s) on RCV1, comparing COCOA (H=1e5), mini-batch-CD (H=100), local-SGD (H=1e4), and batch-SGD (H=100)]
[Figure: log primal suboptimality vs. time (s) for COCOA with varying H: 1e5, 1e4, 1e3, 100, 1]