Randomized Primal-Dual Algorithms for Asynchronous Distributed Optimization


  1. Randomized Primal-Dual Algorithms for Asynchronous Distributed Optimization
     Lin Xiao, Microsoft Research
     Joint work with Adams Wei Yu (CMU), Qihang Lin (University of Iowa), and Weizhu Chen (Microsoft)
     Workshop on Large-Scale and Distributed Optimization, Lund Center for Control of Complex Engineering Systems, June 14-16, 2017

  2. Motivation
     Big-data optimization problems
     • the dataset cannot fit into the memory or storage of a single computer
     • distributed algorithms with inter-machine communication are required
     Origins
     • machine learning, data mining, . . .
     • industry: search, online advertising, social media analysis, . . .
     Goals
     • asynchronous distributed algorithms deployable in the cloud
     • nontrivial communication and/or computation complexity guarantees

  3. Outline
     • distributed empirical risk minimization
     • randomized primal-dual algorithms with parameter servers
     • variance reduction techniques
     • DSCOVR algorithms (Doubly Stochastic Coordinate Optimization with Variance Reduction)
     • preliminary experiments

  4. Empirical risk minimization (ERM)
     • popular formulation in supervised (linear) learning:
       minimize over w ∈ R^d:   P(w) := (1/N) ∑_{i=1}^N φ(x_i^T w, y_i) + λ g(w)
       – i.i.d. samples (x_1, y_1), . . . , (x_N, y_N) with x_i ∈ R^d, y_i ∈ R
       – loss function φ(·, y) convex for every y
       – g(w) strongly convex, e.g., g(w) = (1/2)‖w‖_2^2
       – regularization parameter λ ∼ 1/√N or smaller
     • linear regression: φ(x^T w, y) = (y − w^T x)^2
     • binary classification: y ∈ {±1}
       – logistic regression: φ(x^T w, y) = log(1 + exp(−y w^T x))
       – hinge loss (SVM): φ(x^T w, y) = max{0, 1 − y w^T x}
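
A minimal sketch (not from the slides) of evaluating this ERM objective in Python, assuming the logistic loss and g(w) = (1/2)‖w‖² as on the slide; the function names, the toy data, and the choice λ = 1/√N are illustrative only:

```python
import numpy as np

def logistic_loss(z, y):
    # phi(x^T w, y) = log(1 + exp(-y * x^T w)), written stably via logaddexp
    return np.logaddexp(0.0, -y * z)

def erm_objective(w, X, y, lam):
    # P(w) = (1/N) * sum_i phi(x_i^T w, y_i) + lam * g(w), with g(w) = 0.5 * ||w||^2
    z = X @ w                                   # all inner products x_i^T w at once
    return logistic_loss(z, y).mean() + lam * 0.5 * np.dot(w, w)

# toy usage
rng = np.random.default_rng(0)
N, d = 1000, 20
X = rng.standard_normal((N, d))
y = np.sign(X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(N))
lam = 1.0 / np.sqrt(N)                          # lambda ~ 1/sqrt(N), as on the slide
print(erm_objective(np.zeros(d), X, y, lam))
```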

  5. Distributed ERM
     When the dataset cannot fit into the memory of a single machine:
     • partition the data by rows over m machines:
       X = [X_1:; X_2:; . . . ; X_m:] ∈ R^{N×d},  where block X_i: stacks the rows x_j^T stored on machine i
     • rewrite the objective function:
       minimize over w ∈ R^d:   (1/N) ∑_{i=1}^m Φ_i(X_i: w) + g(w)
       where Φ_i(X_i: w) = ∑_{j ∈ I_i} φ_j(x_j^T w, y_j) and ∑_{i=1}^m |I_i| = N
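
A small illustrative sketch of this row-partitioned objective, again with the logistic loss; the helper names and the contiguous split into m blocks are assumptions for the example, not the talk's implementation:

```python
import numpy as np

def partition_rows(X, y, m):
    # split the N samples into m row blocks X_i:, one per machine
    idx = np.array_split(np.arange(X.shape[0]), m)
    return [(X[I], y[I]) for I in idx]

def Phi_i(Xi, yi, w):
    # local sum of losses on machine i: sum_{j in I_i} phi_j(x_j^T w, y_j)
    return np.logaddexp(0.0, -yi * (Xi @ w)).sum()

def distributed_objective(blocks, w, lam, N):
    # (1/N) * sum_i Phi_i(X_i: w) + g(w), taking g(w) = (lam/2) * ||w||^2 here
    return sum(Phi_i(Xi, yi, w) for Xi, yi in blocks) / N + 0.5 * lam * np.dot(w, w)

rng = np.random.default_rng(1)
N, d, m = 1200, 30, 4
X = rng.standard_normal((N, d))
y = np.sign(X @ rng.standard_normal(d))
blocks = partition_rows(X, y, m)
print(distributed_objective(blocks, np.zeros(d), 1.0 / np.sqrt(N), N))
```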

  6. Distributed optimization
     • distributed algorithms alternate between:
       – a local computation procedure at each machine
       – a communication round with simple map-reduce operations (e.g., broadcasting a vector in R^d to m machines, or computing the sum or average of m vectors in R^d)
     • bottleneck: high cost of inter-machine communication
       – speed/delay, synchronization
       – energy consumption
     • communication efficiency
       – number of communication rounds to find ŵ such that P(ŵ) − P(w⋆) ≤ ε
       – often can be measured by iteration complexity

  7. Iteration complexity
     • assumption: f : R^d → R is twice continuously differentiable with λI ⪯ f″(w) ⪯ LI for all w ∈ R^d;
       in other words, f is λ-strongly convex and L-smooth
     • condition number κ = L/λ; we focus on ill-conditioned problems, κ ≫ 1
     • iteration complexities of first-order methods:
       – gradient descent method: O(κ log(1/ε))
       – accelerated gradient method: O(√κ log(1/ε))
       – stochastic gradient method: O(κ/ε) (population loss)

  8. Distributed gradient methods
     Distributed implementation of gradient descent for
       minimize over w ∈ R^d:   P(w) = (1/N) ∑_{i=1}^m Φ_i(X_i: w)
     [figure: master node connected to machines 1, . . . , m; each link communicates O(d) bits per round]
     • master: w^(t+1) = w^(t) − α_t ∇P(w^(t)), then broadcasts w^(t+1) to the machines
     • machine i: computes ∇Φ_i(X_i: w^(t)) and sends it to the master
     • each iteration involves one round of communication
     • number of communication rounds: O(κ log(1/ε))
     • can use the accelerated gradient method instead: O(√κ log(1/ε))
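
The following sketch simulates one master and m workers in a single process to make the communication pattern concrete (broadcast w, then reduce the sum of the local gradients); it is an assumed toy implementation with logistic loss, not the speaker's code:

```python
import numpy as np
from scipy.special import expit

def local_gradient(Xi, yi, w):
    # worker i: gradient of Phi_i(X_i: w) = sum_j log(1 + exp(-y_j * x_j^T w))
    z = Xi @ w
    return Xi.T @ (-yi * expit(-yi * z))       # d/dz log(1+exp(-yz)) = -y * sigmoid(-yz)

def distributed_gradient_descent(blocks, d, N, step=1.0, iters=200):
    w = np.zeros(d)
    for _ in range(iters):
        # one communication round: broadcast w, reduce-sum the m local gradients
        grad = sum(local_gradient(Xi, yi, w) for Xi, yi in blocks) / N
        w = w - step * grad                    # master's gradient step
    return w

# toy usage with 4 simulated machines
rng = np.random.default_rng(2)
N, d, m = 2000, 10, 4
X = rng.standard_normal((N, d))
y = np.sign(X @ rng.standard_normal(d))
blocks = list(zip(np.array_split(X, m), np.array_split(y, m)))
w = distributed_gradient_descent(blocks, d, N)
```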

  9. ADMM
     • consensus reformulation: minimize (1/N) ∑_{i=1}^m f_i(u_i) subject to u_i = w, i = 1, . . . , m
     • augmented Lagrangian:
       L_ρ(u, v, w) = ∑_{i=1}^m [ f_i(u_i) + ⟨v_i, u_i − w⟩ + (ρ/2)‖u_i − w‖² ]
     [figure: master and machines 1, . . . , m exchange u_i^(t+1), v_i^(t), w^(t); O(d) bits per round]
     • machine i: u_i^(t+1) = argmin_{u_i} L_ρ(u_i, v^(t), w^(t)),  then v_i^(t+1) = v_i^(t) + ρ(u_i^(t+1) − w^(t+1))
     • master: w^(t+1) = argmin_w L_ρ(u^(t+1), v^(t), w)
     • number of communication rounds: O(κ log(1/ε)) or O(√κ log(1/ε))
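
As an illustration, here is a hedged sketch of consensus ADMM for the special case f_i(u) = (1/2)‖X_i u − y_i‖² with no regularizer on w, where both argmin steps have closed forms; the least-squares choice, ρ, and the iteration count are assumptions made only to keep the example short:

```python
import numpy as np

def consensus_admm(blocks, d, rho=1.0, iters=100):
    # minimize sum_i 0.5*||X_i u_i - y_i||^2  subject to  u_i = w  (consensus form)
    m = len(blocks)
    w = np.zeros(d)
    U = np.zeros((m, d))                       # local copies u_i
    V = np.zeros((m, d))                       # dual variables v_i
    for _ in range(iters):
        # machine i: closed-form u_i update (least-squares f_i makes the argmin explicit)
        for i, (Xi, yi) in enumerate(blocks):
            A = Xi.T @ Xi + rho * np.eye(d)
            U[i] = np.linalg.solve(A, Xi.T @ yi - V[i] + rho * w)
        # master: w update is the average of (u_i + v_i / rho), one O(d) reduce per machine
        w = (U + V / rho).mean(axis=0)
        # machine i: dual ascent step
        V += rho * (U - w)
    return w

# toy usage: 4 machines sharing a common least-squares solution
rng = np.random.default_rng(3)
d, m = 5, 4
w_true = rng.standard_normal(d)
blocks = []
for _ in range(m):
    Xi = rng.standard_normal((50, d))
    blocks.append((Xi, Xi @ w_true + 0.01 * rng.standard_normal(50)))
print(np.linalg.norm(consensus_admm(blocks, d) - w_true))
```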

  10. The dual ERM problem
     • primal problem:
       minimize over w ∈ R^d:   P(w) := (1/N) ∑_{i=1}^m Φ_i(X_i: w) + g(w)
     • dual problem:
       maximize over α ∈ R^N:   D(α) := −(1/N) ∑_{i=1}^m Φ_i*(α_i) − g*( −(1/N) ∑_{i=1}^m (X_i:)^T α_i )
       where g* and Φ_i* are the convex conjugate functions:
       – g*(v) = sup_{u ∈ R^d} { v^T u − g(u) }
       – Φ_i*(α_i) = sup_{z ∈ R^{n_i}} { α_i^T z − Φ_i(z) },  for i = 1, . . . , m
     • recover the primal variable from the dual:  w = ∇g*( −(1/N) ∑_{i=1}^m (X_i:)^T α_i )
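
A short worked example (not on the slides) may help: with the squared loss and the squared-norm regularizer, the conjugates and the primal-recovery map are explicit.

```latex
% Illustrative conjugates for squared loss and g(w) = (1/2)||w||^2,
% plugged into the primal-recovery formula from the slide.
\begin{align*}
\phi_j(z) = \tfrac{1}{2}(z - y_j)^2
  &\;\Longrightarrow\;
  \phi_j^*(\alpha) = \sup_z\big\{\alpha z - \tfrac{1}{2}(z - y_j)^2\big\}
                   = \alpha y_j + \tfrac{1}{2}\alpha^2, \\
g(w) = \tfrac{1}{2}\|w\|_2^2
  &\;\Longrightarrow\;
  g^*(v) = \tfrac{1}{2}\|v\|_2^2, \quad \nabla g^*(v) = v, \\
  &\;\Longrightarrow\;
  w = \nabla g^*\Big(-\tfrac{1}{N}\textstyle\sum_{i=1}^m (X_{i:})^T \alpha_i\Big)
    = -\tfrac{1}{N}\textstyle\sum_{i=1}^m (X_{i:})^T \alpha_i .
\end{align*}
```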

  11. The CoCoA(+) algorithm (Jaggi et al. 2014, Ma et al. 2015)
     maximize over α ∈ R^N:   D(α) := −(1/N) ∑_{i=1}^m Φ_i*(α_i) − g*( −(1/N) ∑_{i=1}^m (X_i:)^T α_i )
     [figure: master and machines 1, . . . , m exchange v^(t) and Δv_i^(t); O(d) bits per round]
     • machine i: α_i^(t+1) = argmax_{α_i} G_i(v^(t), α_i),  then Δv_i^(t) = (1/N) (X_i:)^T (α_i^(t+1) − α_i^(t))
     • master: v^(t+1) = v^(t) + ∑_{i=1}^m Δv_i^(t)
     • each iteration involves one round of communication
     • number of communication rounds: O(κ log(1/ε))
     • can be accelerated by the proximal point algorithm (Catalyst, Lin et al.): O(√κ log(1/ε))

  12. Primal and dual variables
     [figure: the primal vector w alongside the row blocks X_1:, X_2:, . . . , X_m: and the corresponding dual blocks α_1, α_2, . . . , α_m]
     w = ∇g*( −(1/N) ∑_{i=1}^m (X_i:)^T α_i )

  13. Can we do better?
     • asynchronous distributed algorithms?
     • better communication complexity?
     • better computation complexity?

  14. Outline
     • distributed empirical risk minimization
     • randomized primal-dual algorithms with parameter servers
     • variance reduction techniques
     • DSCOVR algorithms (Doubly Stochastic Coordinate Optimization with Variance Reduction)
     • preliminary experiments

  15. Asynchronism: Hogwild! style
     Idea: exploit sparsity to avoid simultaneous updates of the same coordinates (Niu et al. 2011)
     [figure: machines 1, . . . , m each read and update only the coordinates of w touched by their sparse rows X_i:]
     Problems:
     • too frequent communication (a bottleneck for a distributed system)
     • slow convergence (sublinear rate using stochastic gradients)
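
For concreteness, a toy shared-memory Hogwild!-style sketch (the original Niu et al. setting, not the distributed variant discussed here): several threads run SGD on a shared w without locks, each touching only the coordinates in its sample's support. The step size, thread count, and dense-numpy toy data are placeholder choices, and Python's GIL means this only illustrates the access pattern, not real parallel speedup.

```python
import numpy as np
from scipy.special import expit
from threading import Thread

def hogwild_logistic(X, y, lam, step=0.1, epochs=5, n_threads=4):
    N, d = X.shape
    w = np.zeros(d)                            # shared parameter vector, no locks

    def worker(seed):
        rng = np.random.default_rng(seed)
        for _ in range(epochs * N // n_threads):
            j = int(rng.integers(N))
            nz = np.nonzero(X[j])[0]           # support of this (sparse) sample
            z = X[j, nz] @ w[nz]
            g = -y[j] * expit(-y[j] * z)       # derivative of log(1 + exp(-y z)) in z
            # unsynchronized update touching only the support coordinates;
            # the regularizer is applied on those coordinates only, for simplicity
            w[nz] -= step * (g * X[j, nz] + lam * w[nz])

    threads = [Thread(target=worker, args=(s,)) for s in range(n_threads)]
    for t in threads: t.start()
    for t in threads: t.join()
    return w

# toy usage with sparse-ish data
rng = np.random.default_rng(4)
N, d = 5000, 100
X = rng.standard_normal((N, d)) * (rng.random((N, d)) < 0.05)
y = np.sign(X @ rng.standard_normal(d) + 1e-3)
w = hogwild_logistic(X, y, lam=1.0 / np.sqrt(N))
```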

  16. Tame the hog: forced separation
     [figure: w split into blocks w_1, w_2, . . . , w_K, with machines 1, . . . , m each working on a different block]
     • partition w into K blocks w_1, . . . , w_K
     • each machine updates a different block, using only the relevant columns of its local data
     • set K > m so that all machines can work all the time
     • event-driven asynchronism:
       – whenever it becomes free, a machine requests a new block to update (see the sketch after this list)
       – update orders can be intentionally randomized
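
A toy simulation of this event-driven scheme, with a shared pool of K > m blocks that idle machines pull from; the sleep stands in for the local block update, and all names are made up for illustration:

```python
import queue
import random
import threading
import time

def run_event_driven(K=8, m=3, updates_per_machine=10):
    # server-side pool of parameter blocks; K > m so no machine ever has to wait
    free_blocks = queue.Queue()
    for k in random.sample(range(K), K):       # randomized initial order
        free_blocks.put(k)

    log = []
    def machine(mid):
        for _ in range(updates_per_machine):
            k = free_blocks.get()              # request a block as soon as we are free
            time.sleep(random.uniform(0.001, 0.005))   # stand-in for the local block update
            log.append((mid, k))
            free_blocks.put(k)                 # hand the block back for another machine

    threads = [threading.Thread(target=machine, args=(i,)) for i in range(m)]
    for t in threads: t.start()
    for t in threads: t.join()
    return log

print(run_event_driven()[:10])                 # (machine id, block id) pairs, in completion order
```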

  17. Double separation via saddle-point formulation
     [figure: X partitioned into blocks X_ik both by rows (machines i = 1, . . . , m, dual blocks α_i) and by columns (parameter blocks w_k, k = 1, . . . , K)]
     min over w ∈ R^d, max over α ∈ R^N:
       (1/N) ∑_{i=1}^m ∑_{k=1}^K α_i^T X_ik w_k − (1/N) ∑_{i=1}^m Φ_i*(α_i) + ∑_{k=1}^K g_k(w_k)
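
One way to see where this saddle-point form comes from (a step left implicit on the slide): apply the conjugate representation Φ_i(X_i: w) = sup_{α_i} {α_i^T X_i: w − Φ_i*(α_i)} from slide 10, split X_i: w over the K column blocks, and assume g is block-separable, g(w) = ∑_k g_k(w_k).

```latex
\begin{align*}
\frac{1}{N}\sum_{i=1}^m \Phi_i(X_{i:}w) + g(w)
  &= \frac{1}{N}\sum_{i=1}^m \sup_{\alpha_i}\big\{\alpha_i^T X_{i:} w - \Phi_i^*(\alpha_i)\big\}
     + \sum_{k=1}^K g_k(w_k) \\
  &= \max_{\alpha \in \mathbb{R}^N}\;
     \frac{1}{N}\sum_{i=1}^m\sum_{k=1}^K \alpha_i^T X_{ik} w_k
     - \frac{1}{N}\sum_{i=1}^m \Phi_i^*(\alpha_i)
     + \sum_{k=1}^K g_k(w_k).
\end{align*}
```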

  18. A randomized primal-dual algorithm
     Algorithm 1: Doubly stochastic primal-dual coordinate update
     input: initial points w^(0) and α^(0)
     for t = 0, 1, 2, . . . , T − 1:
       1. pick j ∈ {1, . . . , m} and l ∈ {1, . . . , K} randomly with probabilities p_j and q_l
       2. compute stochastic gradients:
            u_j^(t+1) = (1/q_l) X_jl w_l^(t),      v_l^(t+1) = (1/p_j) (1/N) (X_jl)^T α_j^(t)
       3. update the primal and dual block coordinates:
            α_i^(t+1) = prox_{σ_j Φ_j*}( α_j^(t) + σ_j u_j^(t+1) )  if i = j,   and α_i^(t+1) = α_i^(t) otherwise
            w_k^(t+1) = prox_{τ_l g_l}( w_l^(t) − τ_l v_l^(t+1) )   if k = l,   and w_k^(t+1) = w_k^(t) otherwise
     end for
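
Below is a hedged transliteration of Algorithm 1 for the special case of squared loss with g_l(w_l) = (λ/2)‖w_l‖², where both prox maps have closed forms. The uniform sampling probabilities and the constant step sizes σ, τ are placeholder choices (the actual method prescribes them from the data), and, as the next slide notes, this basic version without variance reduction converges only sublinearly, so with fixed step sizes it settles in a neighborhood of the optimum.

```python
import numpy as np

def doubly_stochastic_pd(Xblocks, yblocks, lam, N, sigma=0.1, tau=0.1, iters=20000, seed=0):
    # Xblocks[i][k] is the block X_ik (rows of machine i, columns of parameter block k);
    # squared loss assumed, so prox of Phi_j* and of g_l are both closed-form.
    rng = np.random.default_rng(seed)
    m, K = len(Xblocks), len(Xblocks[0])
    alpha = [np.zeros(Xblocks[i][0].shape[0]) for i in range(m)]     # dual blocks alpha_i
    w = [np.zeros(Xblocks[0][k].shape[1]) for k in range(K)]         # primal blocks w_k
    p, q = 1.0 / m, 1.0 / K                                          # uniform sampling probs
    for _ in range(iters):
        j, l = rng.integers(m), rng.integers(K)                      # step 1
        u = (1.0 / q) * (Xblocks[j][l] @ w[l])                       # step 2
        v = (1.0 / p) * (Xblocks[j][l].T @ alpha[j]) / N
        a = alpha[j] + sigma * u                                     # step 3: dual prox for the
        alpha[j] = (a - sigma * yblocks[j]) / (1.0 + sigma)          #   squared-loss conjugate
        b = w[l] - tau * v                                           # step 3: primal prox for
        w[l] = b / (1.0 + tau * lam)                                 #   g_l = (lam/2)*||.||^2
    return np.concatenate(w)
```

Here Xblocks would be obtained by splitting X by rows across the m machines and by columns into the K parameter blocks (e.g., with nested np.array_split), and yblocks holds each machine's labels.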

  19. How good is this algorithm?
     • on the update order:
       – the sequence (i(t), k(t)) is not really i.i.d.
       – in practice, is it better than i.i.d. sampling?
     [figure: blocks w_1, . . . , w_K being assigned to machines 1, . . . , m]
     • bad news: only sublinear convergence, with complexity O(1/ε)

  20. Outline
     • distributed empirical risk minimization
     • randomized primal-dual algorithms with parameter servers
     • variance reduction techniques
     • DSCOVR algorithms (Doubly Stochastic Coordinate Optimization with Variance Reduction)
     • preliminary experiments
