Distributed Consensus Optimization
Ming Yan, Michigan State University, CMSE/Mathematics
September 14, 2018

Why do we need decentralized optimization? One motivating setting: decentralized vehicles/aircraft.


1. decentralized gradient descent

Consider the problem

    minimize_x f(x) = Σ_{i=1}^n f_i(x_i), s.t. x_1 = x_2 = ... = x_n.

Decentralized gradient descent (DGD) (Nedic-Ozdaglar '09):

    x^{k+1} = W x^k − λ ∇f(x^k).

• Rewrite it as x^{k+1} = x^k − ((I − W) x^k + λ ∇f(x^k)): DGD is gradient descent with stepsize one on

    minimize_x (1/2) ‖√(I − W) x‖² + λ f(x).

• The fixed point is generally not a consensus point, i.e., W x* = x* + λ ∇f(x*) ≠ x*.
• Remedy: a diminishing stepsize, i.e., decrease λ during the iterations.
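To make the recursion concrete, here is a minimal numpy sketch of DGD on a decentralized least-squares problem. The ring network, its weights, and the local data A[i], b[i] are illustrative assumptions, not part of the talk.

    import numpy as np

    def ring_mixing_matrix(n):
        # Symmetric doubly stochastic W for a ring graph:
        # self-weight 1/2, weight 1/4 on each of the two neighbors.
        W = np.eye(n) / 2
        for i in range(n):
            W[i, (i - 1) % n] += 0.25
            W[i, (i + 1) % n] += 0.25
        return W

    def dgd(A, b, W, steps=500):
        # DGD with diminishing stepsize; node i holds f_i(x) = 0.5*||A[i] x - b[i]||^2.
        n, d = len(A), A[0].shape[1]
        X = np.zeros((n, d))                     # row i is node i's local copy x_i
        for k in range(steps):
            lam = 1.0 / (k + 10)                 # diminishing stepsize, needed for exact consensus
            G = np.stack([A[i].T @ (A[i] @ X[i] - b[i]) for i in range(n)])
            X = W @ X - lam * G                  # x^{k+1} = W x^k - lambda * grad f(x^k)
        return X

With a connected graph and λ_k → 0, all rows of X approach the common minimizer of Σ_i f_i; with a fixed λ, they stop at the non-consensus fixed point described above.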

2. constant stepsize?

• alternating direction method of multipliers (ADMM) (Shi et al. '14, Chang-Hong-Wang '15, Hong-Chang '17)
• multi-consensus inner loops (Chen-Ozdaglar '12, Jakovetic-Xavier-Moura '14)
• EXTRA/PG-EXTRA (Shi et al. '15)

3. decentralized smooth optimization

Problem:

    minimize_x f(x), s.t. √(I − W) x = 0.

Since null(√(I − W)) = null(I − W) is spanned by the consensus vectors, this constraint is equivalent to x_1 = x_2 = ... = x_n.

• Lagrangian function: f(x) + ⟨√(I − W) x, s⟩, where s is the Lagrange multiplier.
• Optimality conditions (KKT):

    0 = ∇f(x*) + √(I − W) s*,
    0 = −√(I − W) x*.

• Equivalently, in matrix form,

    [[0, √(I − W)], [−√(I − W), 0]] [x*; s*] = −[∇f(x*); 0].
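The square root √(I − W) never has to be formed explicitly in the final algorithms, but it is easy to compute numerically. A sketch, assuming W is symmetric with eigenvalues in (−1, 1] (so I − W is positive semidefinite); it reuses ring_mixing_matrix from the DGD sketch:

    import numpy as np

    def sqrt_I_minus_W(W):
        # Principal square root of I - W via eigendecomposition.
        vals, vecs = np.linalg.eigh(np.eye(W.shape[0]) - W)
        vals = np.clip(vals, 0.0, None)          # guard tiny negatives from round-off
        return vecs @ np.diag(np.sqrt(vals)) @ vecs.T

    # Sanity check: sqrt(I - W) annihilates exactly the consensus direction.
    S = sqrt_I_minus_W(ring_mixing_matrix(5))
    assert np.allclose(S @ np.ones(5), 0.0)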

4. forward-backward

• The KKT system:

    [[0, √(I − W)], [−√(I − W), 0]] [x*; s*] = −[∇f(x*); 0].

• Using forward-backward on the KKT system:

    [[αI, −√(I − W)], [−√(I − W), βI]] [x^k; s^k] − [∇f(x^k); 0]
      = [[αI, −√(I − W)], [−√(I − W), βI]] [x^{k+1}; s^{k+1}] + [[0, √(I − W)], [−√(I − W), 0]] [x^{k+1}; s^{k+1}].

• It reduces to

    [[αI, −√(I − W)], [−√(I − W), βI]] [x^k; s^k] − [∇f(x^k); 0] = [[αI, 0], [−2√(I − W), βI]] [x^{k+1}; s^{k+1}].

• It is equivalent to

    α x^k − √(I − W) s^k − ∇f(x^k) = α x^{k+1},
    −√(I − W) x^k + β s^k = −2√(I − W) x^{k+1} + β s^{k+1}.

• For simplicity, let t = √(I − W) s, and we have

    α x^k − t^k − ∇f(x^k) = α x^{k+1},
    −(I − W) x^k + β t^k = −2(I − W) x^{k+1} + β t^{k+1}.
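The reduced (x, t) recursion can be run directly, since only I − W (one round of communication) appears. A sketch in the same assumed least-squares setup, with α and β left as free parameters:

    import numpy as np

    def fb_kkt(A, b, W, alpha, beta, steps=500):
        # Forward-backward on the KKT system in the reduced variables (x, t),
        # where t = sqrt(I - W) s.
        n, d = len(A), A[0].shape[1]
        L = np.eye(n) - W
        X = np.zeros((n, d))
        T = np.zeros((n, d))
        for _ in range(steps):
            G = np.stack([A[i].T @ (A[i] @ X[i] - b[i]) for i in range(n)])
            X_new = X - (T + G) / alpha              # alpha x^{k+1} = alpha x^k - t^k - grad f(x^k)
            T = T + (2 * L @ X_new - L @ X) / beta   # beta t^{k+1} = beta t^k + 2(I-W)x^{k+1} - (I-W)x^k
            X = X_new
        return X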

5. EXact firsT-ordeR Algorithm (EXTRA)

• From the previous slide:

    α x^k − t^k − ∇f(x^k) = α x^{k+1},
    −(I − W) x^k + β t^k = −2(I − W) x^{k+1} + β t^{k+1}.

• Substituting t^k = t^{k−1} + (1/β)(I − W)(2x^k − x^{k−1}) from the second equation, and then t^{k−1} = α x^{k−1} − ∇f(x^{k−1}) − α x^k from the first, we have

    α x^{k+1} = α x^k − t^k − ∇f(x^k)
              = α x^k − (1/β)(I − W)(2x^k − x^{k−1}) − t^{k−1} − ∇f(x^k)
              = α x^k − (1/β)(I − W)(2x^k − x^{k−1}) + α x^k + ∇f(x^{k−1}) − α x^{k−1} − ∇f(x^k)
              = (αI − (1/β)(I − W))(2x^k − x^{k−1}) + ∇f(x^{k−1}) − ∇f(x^k).

• Let αβ = 2 and we have EXTRA (Shi et al. '15):

    x^{k+1} = (I + W)/2 (2x^k − x^{k−1}) − (1/α)(∇f(x^k) − ∇f(x^{k−1})).
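A sketch of EXTRA in the same assumed least-squares setup; 1/α plays the role of the stepsize, and the initialization x^1 = x^0 − (1/α)∇f(x^0) is the one stated on the "conditions for general EXTRA" slide below.

    import numpy as np

    def extra(A, b, W, alpha, steps=500):
        # EXTRA: x^{k+1} = (I+W)/2 (2x^k - x^{k-1}) - (1/alpha)(grad f(x^k) - grad f(x^{k-1})).
        n, d = len(A), A[0].shape[1]
        Wt = (np.eye(n) + W) / 2
        grad = lambda X: np.stack([A[i].T @ (A[i] @ X[i] - b[i]) for i in range(n)])
        X_prev = np.zeros((n, d))
        G_prev = grad(X_prev)
        X = X_prev - G_prev / alpha              # x^1 = x^0 - (1/alpha) grad f(x^0)
        for _ in range(steps):
            G = grad(X)
            X_next = Wt @ (2 * X - X_prev) - (G - G_prev) / alpha
            X_prev, X, G_prev = X, X_next, G
        return X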

6. convergence conditions for EXTRA: I

EXTRA: x^{k+1} = (I + W)/2 (2x^k − x^{k−1}) − (1/α)(∇f(x^k) − ∇f(x^{k−1}))

• If f = 0:

    [x^{k+1}; x^k] = [[I + W, −(I + W)/2], [I, 0]] [x^k; x^{k−1}].

• Let I + W = U Σ U^⊤. Then

    [[I + W, −(I + W)/2], [I, 0]] = [[U, 0], [0, U]] [[Σ, −Σ/2], [I, 0]] [[U^⊤, 0], [0, U^⊤]].

• The iteration becomes

    [U^⊤ x^{k+1}; U^⊤ x^k] = [[Σ, −Σ/2], [I, 0]] [U^⊤ x^k; U^⊤ x^{k−1}].

• The condition on W is −2/3 < λ(Σ) = λ(W + I) ≤ 2, which is −5/3 < λ(W) ≤ 1.
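A quick numeric way to see the eigenvalue condition: after diagonalization the iteration decouples into 2×2 blocks, one per eigenvalue σ of I + W, so we can just check each block's spectral radius. A sketch; note that σ = 2 (the consensus direction) gives spectral radius exactly 1, which is expected and harmless.

    import numpy as np

    def extra_f0_spectral_radii(W):
        # For f = 0, each eigenvalue sigma of I + W yields the 2x2 block
        # [[sigma, -sigma/2], [1, 0]]; stability needs spectral radius <= 1.
        radii = []
        for s in np.linalg.eigvalsh(np.eye(W.shape[0]) + W):
            M = np.array([[s, -s / 2.0], [1.0, 0.0]])
            radii.append(np.max(np.abs(np.linalg.eigvals(M))))
        return radii   # all <= 1 exactly when -2/3 < sigma <= 2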

7. convergence conditions for EXTRA: II

EXTRA: x^{k+1} = (I + W)/2 (2x^k − x^{k−1}) − (1/α)(∇f(x^k) − ∇f(x^{k−1}))

• If ∇f(x^k) = x^k − b:

    [x^{k+1}; x^k] = [[I + W − (1/α)I, −(I + W)/2 + (1/α)I], [I, 0]] [x^k; x^{k−1}].

• Let I + W = U Σ U^⊤. Then

    [[I + W − (1/α)I, −(I + W)/2 + (1/α)I], [I, 0]] = [[U, 0], [0, U]] [[Σ − (1/α)I, −Σ/2 + (1/α)I], [I, 0]] [[U^⊤, 0], [0, U^⊤]].

• The condition on W is 4/(3α) − 2/3 < λ(Σ) = λ(W + I) ≤ 2, which is 4/(3α) − 5/3 < λ(W) ≤ 1. In addition, we have the stepsize condition 1/α < 2.
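The same per-block check works for this quadratic case; a sketch with the modified 2×2 blocks:

    import numpy as np

    def extra_quadratic_radii(W, alpha):
        # For grad f(x) = x - b, the blocks become
        # [[sigma - 1/alpha, -sigma/2 + 1/alpha], [1, 0]] per eigenvalue sigma of I + W.
        radii = []
        for s in np.linalg.eigvalsh(np.eye(W.shape[0]) + W):
            M = np.array([[s - 1 / alpha, -s / 2 + 1 / alpha], [1.0, 0.0]])
            radii.append(np.max(np.abs(np.linalg.eigvals(M))))
        # sigma = 2 (consensus direction) contributes an eigenvalue exactly 1;
        # the special initialization of x^1 places the iterates in the right affine set.
        return radii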

8. conditions for general EXTRA

EXTRA: x^{k+1} = (I + W)/2 (2x^k − x^{k−1}) − (1/α)(∇f(x^k) − ∇f(x^{k−1})).

Initialization (k = 0): x^1 = x^0 − (1/α)∇f(x^0).

Convergence condition (Li-Yan '17): 4/(3α) − 5/3 < λ(W) ≤ 1, 1/α < 2/L.

Linear convergence conditions:
• f(x) is strongly convex (Li-Yan '17).
• a weaker condition on f(x), but more restrictive conditions on both parameters (Shi et al. '15).
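The two inequalities translate into an upper bound on the stepsize 1/α that can be computed from λ_min(W) and L. A small helper, a sketch assuming W is symmetric:

    import numpy as np

    def extra_stepsize_bound(W, L):
        # 4/(3*alpha) - 5/3 < lambda_min(W)  <=>  1/alpha < (3*lambda_min(W) + 5)/4,
        # combined with 1/alpha < 2/L.
        lam_min = np.linalg.eigvalsh(W).min()
        return min((3 * lam_min + 5) / 4.0, 2.0 / L)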

9. large stepsizes, as in the centralized case?

10. decentralized smooth optimization (revisited)

Problem:

    minimize_x f(x), s.t. √(I − W) x = 0,

with the same Lagrangian f(x) + ⟨√(I − W) x, s⟩, Lagrange multiplier s, and KKT system

    [[0, √(I − W)], [−√(I − W), 0]] [x*; s*] = −[∇f(x*); 0]

as before.

11. forward-backward

• Using forward-backward on the KKT system again, now with the metric [[αI, 0], [0, βI − (1/α)(I − W)]]:

    [[αI, 0], [0, βI − (1/α)(I − W)]] [x^k; s^k] − [∇f(x^k); 0]
      = [[αI, 0], [0, βI − (1/α)(I − W)]] [x^{k+1}; s^{k+1}] + [[0, √(I − W)], [−√(I − W), 0]] [x^{k+1}; s^{k+1}].

• Combine the right-hand side:

    [[αI, 0], [0, βI − (1/α)(I − W)]] [x^k; s^k] − [∇f(x^k); 0] = [[αI, √(I − W)], [−√(I − W), βI − (1/α)(I − W)]] [x^{k+1}; s^{k+1}].

• Apply Gaussian elimination (add (1/α)√(I − W) times the first row to the second):

    [[αI, 0], [√(I − W), βI − (1/α)(I − W)]] [x^k; s^k] − [∇f(x^k); (1/α)√(I − W)∇f(x^k)] = [[αI, √(I − W)], [0, βI]] [x^{k+1}; s^{k+1}].

• It is equivalent to

    α x^k − ∇f(x^k) − √(I − W) s^{k+1} = α x^{k+1},
    √(I − W) x^k + β(I − (1/(αβ))(I − W)) s^k − (1/α)√(I − W) ∇f(x^k) = β s^{k+1}.

12. NIDS (Li-Shi-Yan '17)

From the previous slide:

    α x^k − ∇f(x^k) − √(I − W) s^{k+1} = α x^{k+1},
    √(I − W) x^k + β(I − (1/(αβ))(I − W)) s^k − (1/α)√(I − W) ∇f(x^k) = β s^{k+1}.

Let t = √(I − W) s:

    α x^k − ∇f(x^k) − t^{k+1} = α x^{k+1},
    (I − W) x^k + β(I − (1/(αβ))(I − W)) t^k − (1/α)(I − W) ∇f(x^k) = β t^{k+1}.

We have

    α x^{k+1} = α x^k − ∇f(x^k) − t^{k+1}
              = α x^k − ∇f(x^k) − (I − (1/(αβ))(I − W)) t^k − (1/β)(I − W) x^k + (1/(αβ))(I − W) ∇f(x^k)
              = (I − (1/(αβ))(I − W))(α x^k − t^k − ∇f(x^k))
              = (I − (1/(αβ))(I − W))(α x^k + α x^k − α x^{k−1} + ∇f(x^{k−1}) − ∇f(x^k)).

Thus

    x^{k+1} = (I − (1/(αβ))(I − W))(2x^k − x^{k−1} − (1/α)(∇f(x^k) − ∇f(x^{k−1}))).
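With αβ = 2 the mixing matrix becomes (I + W)/2, and NIDS can be sketched in the same assumed setup as the EXTRA sketch; the only change is that the gradient correction moves inside the mixing step.

    import numpy as np

    def nids(A, b, W, alpha, steps=500):
        # NIDS: x^{k+1} = (I+W)/2 (2x^k - x^{k-1} - (1/alpha)(grad f(x^k) - grad f(x^{k-1}))).
        n, d = len(A), A[0].shape[1]
        Wt = (np.eye(n) + W) / 2
        grad = lambda X: np.stack([A[i].T @ (A[i] @ X[i] - b[i]) for i in range(n)])
        X_prev = np.zeros((n, d))
        G_prev = grad(X_prev)
        X = X_prev - G_prev / alpha              # x^1 = x^0 - (1/alpha) grad f(x^0)
        for _ in range(steps):
            G = grad(X)
            X_next = Wt @ (2 * X - X_prev - (G - G_prev) / alpha)
            X_prev, X, G_prev = X, X_next, G
        return X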

13. convergence conditions for NIDS

NIDS (with αβ = 2):

    x^{k+1} = (I + W)/2 (2x^k − x^{k−1} − (1/α)(∇f(x^k) − ∇f(x^{k−1})))

• If f = 0 (same as EXTRA): the condition on W is −5/3 < λ(W) ≤ 1.
• If ∇f(x^k) = x^k − b:

    [x^{k+1}; x^k] = [[(2 − 1/α)(I + W)/2, −(1 − 1/α)(I + W)/2], [I, 0]] [x^k; x^{k−1}].

• Let I + W = U Σ U^⊤. Then

    [U^⊤ x^{k+1}; U^⊤ x^k] = [[(2 − 1/α)Σ/2, −(1 − 1/α)Σ/2], [I, 0]] [U^⊤ x^k; U^⊤ x^{k−1}].

• Therefore, one sufficient condition is −5/3 < λ(W) ≤ 1 and 1/α < 2.

14. conditions of NIDS for general smooth functions

NIDS (with αβ = 2):

    x^{k+1} = (I + W)/2 (2x^k − x^{k−1} − (1/α)(∇f(x^k) − ∇f(x^{k−1})))

Initialization (k = 0): x^1 = x^0 − (1/α)∇f(x^0).

Convergence condition (Li-Yan '17): −5/3 < λ(W) ≤ 1, 1/α < 2/L.

Linear convergence condition:
• f(x) is strongly convex and −1 < λ(W) ≤ 1 (Li-Shi-Yan '17), with rate

    O(max{1 − µ/L, 1 − (1 − λ_2(W))/(1 − λ_n(W))}).

15. NIDS vs EXTRA

EXTRA:

    x^{k+1} = (I + W)/2 (2x^k − x^{k−1}) − (1/α)(∇f(x^k) − ∇f(x^{k−1}))

NIDS:

    x^{k+1} = (I + W)/2 (2x^k − x^{k−1} − (1/α)(∇f(x^k) − ∇f(x^{k−1})))

• The difference is in the data to be communicated: NIDS mixes the gradient-corrected iterate, while EXTRA mixes the iterate alone (see the comparison below).
• But NIDS has a larger parameter range than EXTRA.
• NIDS is faster than EXTRA.
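In code, the distinction is simply where the mixing matrix is applied; both lines are taken from the sketches above.

    # EXTRA: mix the iterates, then apply the local gradient correction.
    X_next = Wt @ (2 * X - X_prev) - (G - G_prev) / alpha

    # NIDS: correct with the gradients first, then mix, so neighbors
    # exchange the gradient-adjusted iterate rather than the raw one.
    X_next = Wt @ (2 * X - X_prev - (G - G_prev) / alpha)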

16. advantages of NIDS

NIDS:

    x^{k+1} = (I + W)/2 (2x^k − x^{k−1} − (1/α)(∇f(x^k) − ∇f(x^{k−1})))

• The stepsize is large and does not depend on the network topology: 1/α < 2/L.
• Individual stepsizes can be included: 1/α_i < 2/L_i (see the sketch below).
• The linear convergence rate separates the contributions of the functions and the network:

    O(max{1 − µ/L, 1 − (1 − λ_2(W))/(1 − λ_n(W))}).

  It matches the results for gradient descent and decentralized averaging without acceleration.
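A hypothetical sketch of the individual-stepsize variant, obtained by replacing 1/α with the diagonal matrix diag(1/α_i); the paper's precise handling of heterogeneous stepsizes may modify the mixing step as well, so treat this as an illustration only.

    import numpy as np

    def nids_individual(A, b, W, alphas, steps=500):
        # NIDS-style update with per-node stepsizes 1/alpha_i < 2/L_i (illustrative).
        n, d = len(A), A[0].shape[1]
        Wt = (np.eye(n) + W) / 2
        Dinv = np.diag(1.0 / np.asarray(alphas, dtype=float))   # diag(1/alpha_i)
        grad = lambda X: np.stack([A[i].T @ (A[i] @ X[i] - b[i]) for i in range(n)])
        X_prev = np.zeros((n, d))
        G_prev = grad(X_prev)
        X = X_prev - Dinv @ G_prev
        for _ in range(steps):
            G = grad(X)
            X_next = Wt @ (2 * X - X_prev - Dinv @ (G - G_prev))
            X_prev, X, G_prev = X, X_next, G
        return X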

17. D²: stochastic NIDS (Huang et al. '18)

NIDS:

    x^{k+1} = (I + W)/2 (2x^k − x^{k−1} − (1/α)(∇f(x^k) − ∇f(x^{k−1})))

NIDS-stochastic (D²: Decentralized Training over Decentralized Data):

    x^{k+1} = (I + W)/2 (2x^k − x^{k−1} − (1/α)(∇f(x^k; ξ^k) − ∇f(x^{k−1}; ξ^{k−1})))

• ∇f(x^k; ξ^k) is a stochastic gradient obtained by sampling ξ^k from the distribution D.
• E_{ξ∼D} ∇f(x; ξ) = ∇f(x) for all x.
• E_{ξ∼D} ‖∇f(x; ξ) − ∇f(x)‖² ≤ σ² for all x.
• Convergence result: if the stepsize is small enough (of order (c + √(T/n))^{−1}), the convergence rate is

    O(σ/√(nT) + 1/T).
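A sketch of the stochastic recursion in the least-squares setup, sampling one local data row per step; the sampling scheme and the averaged per-row loss f_i(x) = (1/(2 m_i))‖A[i] x − b[i]‖² are illustrative assumptions chosen so the sampled gradient is unbiased.

    import numpy as np

    def d2(A, b, W, alpha, steps=2000, seed=0):
        # D^2 / stochastic NIDS: the NIDS recursion with sampled gradients;
        # the previous sample's gradient is reused in the correction term.
        rng = np.random.default_rng(seed)
        n, d = len(A), A[0].shape[1]
        Wt = (np.eye(n) + W) / 2
        def sgrad(X):
            G = np.zeros((n, d))
            for i in range(n):
                j = rng.integers(A[i].shape[0])           # sample xi^k at node i
                G[i] = A[i][j] * (A[i][j] @ X[i] - b[i][j])
            return G
        X_prev = np.zeros((n, d))
        G_prev = sgrad(X_prev)
        X = X_prev - G_prev / alpha
        for _ in range(steps):
            G = sgrad(X)
            X_next = Wt @ (2 * X - X_prev - (G - G_prev) / alpha)
            X_prev, X, G_prev = X, X_next, G
        return X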

18. numerical experiments

19. compared algorithms

• NIDS
• EXTRA/PG-EXTRA
• DIGing-ATC (Nedic et al. '16): x^{k+1} = W(x^k − α y^k), y^{k+1} = W(y^k + ∇f(x^{k+1}) − ∇f(x^k)) (a sketch follows below)
• accelerated distributed Nesterov gradient descent (Acc-DNGD-SC) (Qu-Li '17)
• dual friendly optimal algorithm (OA) for distributed optimization (Uribe et al. '17)
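The DIGing-ATC update is fully specified on the slide; a sketch in the same assumed least-squares setup (note that here α itself is the stepsize, unlike the 1/α convention in EXTRA/NIDS):

    import numpy as np

    def diging_atc(A, b, W, alpha, steps=500):
        # DIGing-ATC: x^{k+1} = W(x^k - alpha y^k), y^{k+1} = W(y^k + grad f(x^{k+1}) - grad f(x^k)),
        # where y tracks the average gradient across nodes.
        n, d = len(A), A[0].shape[1]
        grad = lambda X: np.stack([A[i].T @ (A[i] @ X[i] - b[i]) for i in range(n)])
        X = np.zeros((n, d))
        G = grad(X)
        Y = G.copy()                             # gradient-tracking variable, y^0 = grad f(x^0)
        for _ in range(steps):
            X_new = W @ (X - alpha * Y)
            G_new = grad(X_new)
            Y = W @ (Y + G_new - G)
            X, G = X_new, G_new
        return X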

20. strongly convex: same stepsize

[Figure: convergence curves, y-axis from 10^−14 to 10^−2 (log scale), x-axis 0–90 iterations.]

21. strongly convex: adaptive stepsize

[Figure: convergence curves, y-axis from 10^−14 to 10^−2 (log scale), x-axis 0–140 iterations.]

22. linear convergence rate bottleneck

[Figure: convergence curves, y-axis from 10^−20 to 10^5 (log scale), x-axis 0–450 iterations.]

23. nonsmooth functions

[Figure: convergence curves for NIDS with stepsizes 1/L, 1.5/L, 1.9/L and PG-EXTRA with stepsizes 1/L, 1.2/L, 1.3/L, 1.4/L; y-axis from 10^−8 to 10^2 (log scale), x-axis 0–2×10^4 iterations.]

24. stochastic case: shuffled

[Figure: training loss vs. epochs (0–100) for Decentralized, D², and Centralized methods; panels (a) Transfer Learning and (b) LeNet.]
