

1. Asynchronous Algorithms for Conic Programs, including Optimal, Infeasible, and Unbounded Ones
   Wotao Yin; joint work with Fei Feng, Robert Hannah, Yanli Liu, Ernest Ryu (UCLA, Math)
   DIMACS: Distributed Optimization, Information Processing, and Learning, August 2017

2. Overview
   • conic programming problem (P): minimize c^T x subject to Ax = b, x ∈ K, where K is a closed convex cone
   • this talk: a first-order iteration
   • parallel: linear speedup, async
   • still works when the problem is unsolvable

6. Approach overview
   Douglas-Rachford¹ fixed-point iteration: z^{k+1} = T z^k
   T depends on A, b, c and has nice properties:
   • convergence guarantees and rates
   • coordinate friendly: break z into m blocks, cost(T_i) ~ (1/m) cost(T)
   • diverges nicely:
     • (P) has no primal-dual solution pair ⇔ ‖z^k‖ → ∞
     • z^{k+1} − z^k tells a whole lot
   ¹ equivalent to standard ADMM, but the different form is important

9. Douglas-Rachford splitting (Lions-Mercier '79)
   • proximal mapping of a closed function h:
     prox_{γh}(x) = argmin_z { h(z) + (1/(2γ)) ‖z − x‖² }
   • the Douglas-Rachford splitting (DRS) method solves minimize f(x) + g(x) by iterating z^{k+1} = T z^k, defined as:
     x^{k+1/2} = prox_{γg}(z^k)
     x^{k+1} = prox_{γf}(2 x^{k+1/2} − z^k)
     z^{k+1} = z^k + (x^{k+1} − x^{k+1/2})
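   For reference, a minimal Python sketch of the iteration above. The callables prox_f and prox_g are assumed given with γ baked in; the stopping rule and iteration cap are illustrative choices, not from the slides.

```python
import numpy as np

def drs(prox_f, prox_g, z0, max_iter=1000, tol=1e-8):
    """Douglas-Rachford splitting z^{k+1} = T z^k for: minimize f(x) + g(x).

    prox_f, prox_g: callables computing prox_{gamma*f} and prox_{gamma*g}
    (gamma is assumed to be baked into the callables).
    """
    z = z0.copy()
    for _ in range(max_iter):
        x_half = prox_g(z)             # x^{k+1/2} = prox_{gamma g}(z^k)
        x = prox_f(2 * x_half - z)     # x^{k+1}   = prox_{gamma f}(2 x^{k+1/2} - z^k)
        dz = x - x_half                # fixed-point residual direction
        z = z + dz                     # z^{k+1} = z^k + (x^{k+1} - x^{k+1/2})
        if np.linalg.norm(dz) < tol:   # small residual => near a fixed point of T
            break
    return x_half, z
```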

13. Apply DRS to conic programming
    minimize c^T x subject to Ax = b, x ∈ K
    ⇔ minimize [c^T x + δ_{Ax=b}(x)] + δ_K(x), with g(x) = c^T x + δ_{Ax=b}(x) and f(x) = δ_K(x)
    • cone K is nonempty, closed, and convex
    • each iteration: project onto K, then project onto {x : Ax = b}
    • per-iteration cost: O(n²) if x ∈ R^n (by pre-factorizing AA^T)
    • prior work: ADMM for SDP (Wen-Goldfarb-Y. '09)
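    A sketch of the two prox operators for this splitting, under the assumption that K is the nonnegative orthant (the LP case); the function names and the use of SciPy's Cholesky routines are illustrative, but the pre-factorization of AA^T is the trick the slide mentions.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def make_proxes(A, b, c, gamma):
    # Factor A A^T once; each later affine projection is a cheap triangular solve.
    AAt = cho_factor(A @ A.T)

    def proj_affine(v):
        # Projection onto {x : Ax = b}:  v - A^T (A A^T)^{-1} (A v - b)
        return v - A.T @ cho_solve(AAt, A @ v - b)

    def prox_g(z):
        # g(x) = c^T x + indicator of {Ax = b}; since the set is affine,
        # prox_{gamma g}(z) = proj_{Ax=b}(z - gamma c)
        return proj_affine(z - gamma * c)

    def prox_f(z):
        # f(x) = indicator of K; prox is projection onto K.
        # K = nonnegative orthant here (an assumption for this sketch).
        return np.maximum(z, 0.0)

    return prox_f, prox_g
```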

14. Other choices of splitting
    • linearized ADMM and primal-dual splitting: avoid inverting the full A
    • variants of Frank-Wolfe: avoid expensive projections onto the SDP cone
    • subgradient and bundle methods ...

15. Coordinate friendly² (CF)
    • (block) coordinate update is fast only if the subproblems are simple
    • definition: T : H → H is CF if, for any z and i ∈ [m], with
      z⁺ := (z_1, ..., (Tz)_i, ..., z_m),
      it holds that cost[{z, M(z)} ↦ {z⁺, M(z⁺)}] = O((1/m) cost[z ↦ Tz]),
      where M(z) is some quantity maintained in memory (a sketch of such a maintained quantity follows below)
    ² Peng-Wu-Xu-Yan-Y. AMSA '16
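    To make the definition concrete, here is a hypothetical CF operator with a maintained quantity M(z), not taken from the slides: a gradient step T(z) = z − η A^T(Az − b) for least squares. Caching the residual r = Az − b lets one coordinate of Tz be computed, and r refreshed, in O(m), roughly 1/n of the O(mn) full update.

```python
import numpy as np

class CoordinateFriendlyT:
    """T(z) = z - eta * A^T (A z - b); M(z) = A z - b is the cached residual."""

    def __init__(self, A, b, eta):
        self.A, self.b, self.eta = A, b, eta

    def full(self, z):
        # Full update: costs O(m n) for A of shape (m, n).
        return z - self.eta * self.A.T @ (self.A @ z - self.b)

    def coord_update(self, z, r, i):
        """Update z_i in place, given the maintained residual r = A z - b."""
        new_zi = z[i] - self.eta * (self.A[:, i] @ r)  # (Tz)_i in O(m)
        r += self.A[:, i] * (new_zi - z[i])            # refresh M(z) in O(m)
        z[i] = new_zi
```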

16. Composed operators
    • 9 rules³ for CF T₁ ∘ T₂ cover many examples
    • general principles:
      • T₁ ∘ T₂ inherits the (weaker) separability property
      • if T₁ is CF and T₂ is either cheap, easy-to-maintain, or directly CF, then T₁ ∘ T₂ is CF
      • if T₁ is separable or cheap, T₁ ∘ T₂ is easier to make CF
    ³ Peng-Wu-Xu-Yan-Y. AMSA '16

17. Lists of CF T₁ ∘ T₂
    • many convex image processing models
    • portfolio optimization
    • most sparse optimization problems
    • all LPs, all SOCPs, and SDPs without large cones
    • most ERM problems
    • ...

23. Example: DRS for SOCP
    • second-order cone: Q^n = {x ∈ R^n : x_1 ≥ ‖(x_2, ..., x_n)‖_2}
    • the DRS operator has the form T = linear ∘ proj_{Q^{n_1} × ... × Q^{n_p}}
    • CF is trivial if all cones are small
    • now consider a big cone; property:
      proj_{Q^n}(x) = (α x_1, β x_2, ..., β x_n),
      where α, β depend on x_1 and γ := ‖(x_2, ..., x_n)‖_2
    • given γ and updating x_i, refreshing γ costs O(1)
    • by maintaining γ, proj_{Q^n} is cheap, and T = linear ∘ cheap is CF
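    A sketch of the γ-maintenance trick for one big second-order cone, using the standard closed-form projection onto Q^n; the class and method names are made up for illustration.

```python
import numpy as np

class SOCProjection:
    """Projection onto Q^n = {x : x_1 >= ||(x_2,...,x_n)||_2} with cached gamma."""

    def __init__(self, x):
        self.x = x.copy()
        self.gamma_sq = float(np.sum(x[1:] ** 2))  # maintained quantity gamma^2

    def set_coord(self, i, value):
        """Update x_i and refresh gamma in O(1) (i >= 1 touches the tail)."""
        if i >= 1:
            self.gamma_sq += value**2 - self.x[i] ** 2
        self.x[i] = value

    def project(self):
        """proj_{Q^n}(x) = (alpha*x_1, beta*x_2, ..., beta*x_n)."""
        t, s = self.x[0], np.sqrt(self.gamma_sq)
        if s <= t:                    # already inside the cone
            return self.x.copy()
        if s <= -t:                   # inside the polar cone: project to 0
            return np.zeros_like(self.x)
        beta = (t + s) / (2 * s)      # scale factor for the tail
        out = beta * self.x
        out[0] = (t + s) / 2          # alpha * x_1, written directly
        return out
```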

26. Fixed-point iterations
    • full update: z^{k+1} = T z^k
    • (block) coordinate update (CU): choose i_k ∈ [m],
      z_i^{k+1} = z_i^k + η((Tz^k)_i − z_i^k) if i = i_k, and z_i^{k+1} = z_i^k otherwise
    • parallel CU: p agents choose I_k ⊂ [m],
      z_i^{k+1} = z_i^k + η((Tz^k)_i − z_i^k) if i ∈ I_k, and z_i^{k+1} = z_i^k otherwise
    • η depends on properties of T, i_k, and I_k
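    A minimal sketch of the randomized coordinate update above, assuming scalar blocks and a helper T_coord(z, i) that returns (Tz)_i cheaply (the CF property); η and the iteration budget are illustrative.

```python
import numpy as np

def coordinate_update(T_coord, z0, m, eta=0.9, max_iter=10000, rng=None):
    """Randomized coordinate fixed-point iteration on m scalar blocks."""
    rng = np.random.default_rng(rng)
    z = z0.copy()
    for _ in range(max_iter):
        i = rng.integers(m)                   # choose i_k in [m]
        z[i] += eta * (T_coord(z, i) - z[i])  # relaxed step toward (Tz)_i
    return z
```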

27. Sync-parallel versus async-parallel
    [Figure: execution timelines for three agents. Synchronous: faster agents must wait (idle time). Asynchronous: all agents run non-stop.]

28. ARock: async-parallel CU
    • p agents
    • every agent continuously does: pick i_k ∈ [m],
      z_i^{k+1} = z_i^k + η((T z^{k−d_k})_i − z_i^{k−d_k}) if i = i_k, and z_i^{k+1} = z_i^k otherwise
    new notation:
    • k increases after any agent completes an update
    • z^{k−d_k} = (z_1^{k−d_{k,1}}, ..., z_m^{k−d_{k,m}}) may be stale
    • allows inconsistent atomic read/write
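    A toy async-parallel sketch in the spirit of the update above, using Python threads: each agent reads a possibly stale snapshot of the shared iterate and writes one coordinate without waiting for the others. T_coord, η, and the step budget are assumptions of this sketch, and Python's GIL means it illustrates the access pattern rather than real speedup.

```python
import threading
import numpy as np

def arock_sketch(T_coord, z0, m, p=4, eta=0.5, steps_per_agent=5000):
    """p agents update a shared iterate with stale reads, no synchronization."""
    z = z0.copy()  # shared iterate

    def agent(seed):
        rng = np.random.default_rng(seed)
        for _ in range(steps_per_agent):
            z_hat = z.copy()         # snapshot: may mix coordinates of different ages
            i = rng.integers(m)      # pick i_k
            # z_i += eta * ((T z_hat)_i - (z_hat)_i): no locks, no waiting
            z[i] += eta * (T_coord(z_hat, i) - z_hat[i])

    threads = [threading.Thread(target=agent, args=(s,)) for s in range(p)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return z
```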

31. Various theories and meanings
    • 1969 – 90s: T is contractive in ‖·‖_{w,∞}; partially/totally async
    • recent in the ML community: async stochastic gradient (SG) and async BCD
      • early works: random i_k, bounded delays, E f has sufficient descent, treat delays as noise, delays independent of i_k
      • state of the art: allow essentially cyclic i_k, unbounded delays (t^{−4} or faster decay), Lyapunov analysis, delays as overdue progress, delays can depend on i_k
