Averaging algorithms and distributed optimization


  1. Averaging algorithms and distributed optimization
     John N. Tsitsiklis, MIT
     NIPS 2010 Workshop on Learning on Cores, Clusters and Clouds
     December 2010

  2. Outline
     • Motivation and applications
     • Consensus/averaging in distributed optimization
     • Convergence times of consensus/averaging
       – time-invariant case
       – time-varying case

  3. The Setting
     • n agents, starting values x_i(0)
       – reach consensus on some x*, with either:
         • min_i x_i(0) ≤ x* ≤ max_i x_i(0)  (consensus)
         • x* = (x_1(0) + ··· + x_n(0)) / n  (averaging)
         • averaging when x_i ∈ {−1, +1}  (voting)
     • interested in:
       – genuinely distributed algorithms
       – no synchronization
       – no "infrastructure" such as spanning trees
       – simple updates, such as: x_i := (x_i + x_j) / 2
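The simple pairwise update above can be simulated directly: agents repeatedly average with one random neighbor, the sum (and hence the mean) is conserved, and disagreement shrinks. A minimal sketch; the `pairwise_gossip` helper, the ring topology, and all parameters below are illustrative, not from the slides.

```python
import random

def pairwise_gossip(values, edges, steps, seed=0):
    """Repeatedly pick a random edge (i, j) and set both endpoints
    to the midpoint (x_i + x_j) / 2; the sum is conserved."""
    rng = random.Random(seed)
    x = list(values)
    for _ in range(steps):
        i, j = rng.choice(edges)
        mid = (x[i] + x[j]) / 2
        x[i] = mid
        x[j] = mid
    return x

# Ring of 4 agents; the average 2.5 is preserved at every step.
x = pairwise_gossip([1.0, 2.0, 3.0, 4.0],
                    [(0, 1), (1, 2), (2, 3), (3, 0)], steps=200)
```

Note that no coordination or global structure is needed: each step touches only the two endpoints of one edge.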

  4. Social sciences
     • Merging of "expert" opinions
     • Evolution of public opinion
     • Evolution of reputation
     • Modeling of jurors
     • Language evolution
     • Preference for "simple" models
       – behavior described by "rules of thumb"
       – less complex than Bayesian updating
     • interested in modeling, analysis (descriptive theory)
       – . . . and narratives

  5. Engineering
     • Distributed computation and sensor networks
       – Fusion of individual estimates
       – Distributed Kalman filtering
       – Distributed optimization
       – Distributed reinforcement learning
     • Networking
       – Load balancing and resource allocation
       – Clock synchronization
       – Reputation management in ad hoc networks
       – Network monitoring
     • Multiagent coordination and control
       – Coverage control
       – Monitoring
       – Creating virtual coordinates for geographic routing
       – Decentralized task assignment
       – Flocking

  6. The DeGroot opinion pooling model (1974)
     x_i(t+1) = Σ_j a_ij x_j(t),   a_ij ≥ 0,  Σ_j a_ij = 1
     x(t+1) = A x(t),   A: stochastic matrix
     • Markov chain theory + "mixing conditions"
       → convergence of A^t to a matrix with equal rows
       → convergence of each x_i to Σ_j π_j x_j(0)
       → convergence rate estimates
     • Averaging algorithms
       – A doubly stochastic: 1'Ax = 1'x, where 1' = [1 1 . . . 1]
       – x_1 + ··· + x_n is conserved
       – convergence to x* = (x_1(0) + ··· + x_n(0)) / n
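The DeGroot iteration x(t+1) = A x(t) is easy to simulate; with a doubly stochastic A, the sum is conserved and every agent converges to the average of the initial values. A sketch under made-up data: the 3-agent matrix A below is our own illustrative choice.

```python
def degroot_step(A, x):
    """One DeGroot update: x_i(t+1) = sum_j a_ij * x_j(t)."""
    n = len(x)
    return [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]

# A doubly stochastic matrix on 3 agents (rows AND columns sum to 1):
A = [[0.50, 0.25, 0.25],
     [0.25, 0.50, 0.25],
     [0.25, 0.25, 0.50]]
x = [0.0, 3.0, 9.0]
for _ in range(50):
    x = degroot_step(A, x)
# Double stochasticity conserves x_1 + x_2 + x_3, so all agents
# approach the average (0 + 3 + 9) / 3 = 4.
```

With a merely (row-)stochastic A, the iteration would still reach consensus, but on the π-weighted combination Σ_j π_j x_j(0) rather than the average.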

  7. Part I: Distributed optimization

  8. Gradient-like methods
     min_x f(x);   special case: f(x) = Σ_i f_i(x)
     • f, f_i convex
     • f smooth: work with ∇f(x)
       – update: x := x − γ ∇f(x)
       – with noise: x := x − γ (∇f(x) + w)
         (stochastic approximation, γ_t → 0)
     • f nonsmooth: work with a subgradient ∂f(x)
       – update: x := x − γ ∂f(x)   (γ_t → 0)
       – with noise: x := x − γ (∂f(x) + w)
     • More sophisticated variants: dual averaging methods
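The nonsmooth update x := x − γ ∂f(x) with diminishing stepsize γ_t → 0 might be sketched as follows, on a toy objective f(x) = |x − 2| of our own choosing (the slides do not specify one); γ_t = 1/t is one standard diminishing schedule.

```python
def subgradient_descent(f_subgrad, x0, steps):
    """x := x - gamma_t * g with diminishing stepsize gamma_t = 1/t,
    where g is any subgradient of f at x."""
    x = x0
    for t in range(1, steps + 1):
        g = f_subgrad(x)
        x -= (1.0 / t) * g
    return x

# f(x) = |x - 2| is nonsmooth at the minimizer; sign(x - 2) is a
# valid subgradient everywhere (0 is a subgradient at x = 2).
sub = lambda x: 1.0 if x > 2 else (-1.0 if x < 2 else 0.0)
x_star = subgradient_descent(sub, 5.0, steps=500)
```

With a constant stepsize the iterate would keep oscillating around x = 2 with fixed amplitude; the 1/t schedule makes the oscillation shrink, which is why γ_t → 0 appears throughout the deck.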

  9. Smooth f; componentwise decentralization
     • x_j^i: agent i's copy of component j
       – update: x_i^i := x_i^i − γ (∂f/∂x_i)(x^i)
       – reconcile: x_j^i := x_j  (occasionally; at most B steps apart)
     • Analysis: track y = (x_1^1, . . . , x_n^n)
       ‖y − x^i‖ = O(Bγ)
       y := y − γ ∇f(y) + O(Bγ²)
     • Convergence theorem for the centralized gradient method remains
       valid [Bertsekas, JNT, Athans, 86]
       – need γ ~ 1/B
       – also for the stochastic approximation variant:
         x_i^i := x_i^i − γ ((∂f/∂x_i)(x^i) + w_i)

 10. Smooth f; overlap and cooperate
     • Assume (for simplicity) scalar x
       – subscript denotes an agent's value of x
     • x_i := x_i − γ ∇f(x_i): redundant/useless on its own
     • useful in the presence of noise:
       – update: x_i := x_i − γ (∇f(x_i) + w_i)
       – reconcile: x := x − γ · (1/n) Σ_i (∇f(x_i) + w_i)

 11. Smooth f; overlap and cooperate (ctd.)
     • Two-phase version
       – update: x_i := x_i − γ (∇f(x_i) + w_i)
       – reconcile: run consensus algorithm x := Ax
         converges: x_i → y, ∀i, where y = Σ_j π_j x_j,  π_j ≥ 0
       – net effect: y := y − γ Σ_j π_j (∇f(x_j) + w_j)
     • expected update direction is still a descent direction
     • classical convergence results for the centralized stochastic
       gradient method, with γ_t → 0, remain valid
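The two-phase scheme might be sketched as below, idealizing the reconcile phase as consensus run to completion with π_j = 1/n (so every agent ends the round holding the network average). The quadratic f, the noise level, and all parameters are illustrative assumptions, not from the slides.

```python
import random

def two_phase(grad, n, gamma, rounds, noise, seed=0):
    """Each agent takes a noisy gradient step on a shared f, then a
    reconcile phase replaces every x_i with the network average
    (an idealized, fully converged consensus)."""
    rng = random.Random(seed)
    x = [rng.uniform(-5.0, 5.0) for _ in range(n)]
    for _ in range(rounds):
        # update phase: x_i := x_i - gamma * (grad f(x_i) + w_i)
        x = [xi - gamma * (grad(xi) + rng.gauss(0.0, noise)) for xi in x]
        # reconcile phase: consensus drives all x_i to the average y
        y = sum(x) / n
        x = [y] * n
    return x

# f(x) = (x - 1)^2, grad f(x) = 2(x - 1); independent noise terms
# partially cancel in the average, as the slide's analysis suggests.
x = two_phase(lambda v: 2.0 * (v - 1.0), n=20, gamma=0.1,
              rounds=300, noise=0.5)
```

The point of the sketch: after each reconcile the agents agree exactly, and the effective update on the common value y is a noisy gradient step whose noise has been averaged over n agents.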

 12. Smooth f; overlap and cooperate (ctd.)
     • Interleaved version
       x_i := Σ_j a_ij x_j − γ (∇f(x_i) + w_i)
       – define y = Σ_i π_i x_i
       – note: Σ_i π_i Σ_j a_ij x_j = Σ_i π_i x_i
       – hence y := y − γ Σ_i π_i (∇f(x_i) + w_i)
     • |x_i − y| = O(γ T · |∇f(y)|)
       T: convergence time (time constant) of the consensus algorithm
       y := y − γ Σ_i π_i (∇f(y) + w_i) + O(γ² T · |∇f(y)|)
     • convergence theorem for the centralized stochastic gradient
       method, with γ_t → 0, remains valid [Bertsekas, JNT, Athans, 86]

 13. Smooth, additive f; overlap and cooperate
     f(x) = (1/n) Σ_i f_i(x);   optimality: Σ_i ∇f_i(x) = 0
     • Two-phase version
       – update: x_i := x_i − γ ∇f_i(x_i)
       – reconcile: run consensus algorithm x := Ax
         converges: x_i → y, ∀i, where y = Σ_i π_i x_i,  π_i ≥ 0
       – net effect: y := y − γ Σ_i π_i ∇f_i(x_i)
     • correctness requires π_i = 1/n
       – use an averaging algorithm (A: doubly stochastic)

 14. Additive f; overlap and cooperate (ctd.)
     • Interleaved version
       x_i := Σ_j a_ij x_j − γ (∇f_i(x_i) + w_i)
       – define y = (1/n) Σ_i x_i
         y := y − γ (1/n) Σ_i ∇f_i(x_i)
     • |x_i − y| = O(γ T · Σ_i |∇f_i(y)|)
       – T: convergence time (time constant) of the averaging algorithm
       – for constant γ, the error does not vanish at the optimum
       – optimality possible only with γ_t → 0 (even in the absence
         of noise)
       – hence studied for nonsmooth f or the stochastic case
         [Nedic & Ozdaglar, 09; Duchi, Agarwal & Wainwright, 10]
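The interleaved update for additive f might look like the sketch below: each agent mixes with its neighbors through a doubly stochastic A while taking a step on its own local f_i. The ring graph, the quadratic f_i(x) = (x − c_i)²/2, and all parameters are illustrative. Consistent with the slide, the constant stepsize leaves a small residual disagreement between agents.

```python
def distributed_gradient(A, grads, x, gamma, rounds):
    """Interleaved step: x_i := sum_j a_ij x_j - gamma * grad f_i(x_i),
    mixing via a doubly stochastic A while descending each local f_i."""
    n = len(x)
    for _ in range(rounds):
        mixed = [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        x = [mixed[i] - gamma * grads[i](x[i]) for i in range(n)]
    return x

# Local objectives f_i(x) = (x - c_i)^2 / 2 with c = [0, 2, 4, 6];
# the sum is minimized where sum_i (x - c_i) = 0, i.e. at the mean 3.
c = [0.0, 2.0, 4.0, 6.0]
# Ring of 4 agents, symmetric weights => doubly stochastic A.
A = [[0.5 if i == j else (0.25 if abs(i - j) in (1, 3) else 0.0)
     for j in range(4)] for i in range(4)]
x = distributed_gradient(A, [lambda v, ci=ci: v - ci for ci in c],
                         [0.0] * 4, gamma=0.01, rounds=5000)
# With constant gamma, the individual x_i settle near (but not exactly
# at) 3, retaining an O(gamma) spread, while their average hits 3.
```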

 15. Convergence times — the big picture
     • T_con(n, ε): time for the consensus/averaging algorithm
       to reduce disagreement from unity to ε
       – dependence on ε is generically a factor O(log(1/ε))
       – focus on T_con(n)
     • T_opt(n, ε): time for the centralized (sub)gradient algorithm
       to bring the cost gap to ε
       – hide dependence on other constants
         (bounds on first and second derivatives, stepsize details)
     • Two-phase version: O(T_con(n) · T_opt(n, ε))
     • Interleaved version: results have the same flavor
       [Nedic & Ozdaglar, 09; Duchi, Agarwal & Wainwright, 10]
       – is interleaving faster or "better" than the two-phase version?
     • Our mission: study and reduce T_con(n)
       – automatically better overall convergence time;
         e.g., [Nedic, Olshevsky, Ozdaglar & JNT, 08]

 16. Part II: Consensus and averaging

 17. Convergence time of consensus algorithms
     x_i(t+1) = Σ_j a_ij x_j(t),   a_ij ≥ 0,  Σ_j a_ij = 1
     x(t+1) = A x(t),   A: stochastic matrix
     • Convergence time: time to get close to "steady state"
     • Equal weight to all neighbors
       – undirected graphs: O(n³), tight
       – directed graphs: exponential in n (Landau and Odlyzko, 1981)
     • Better results for special graphs
       – Θ(n²) for line graphs
       – (Erdős–Rényi, geometric, small world)

 18. Averaging algorithms
     • A doubly stochastic: 1'Ax = 1'x
       – positive diagonal
       – nonzero entries are ≥ α > 0
       – convergence to x* = (x_1(0) + ··· + x_n(0)) / n
       – convergence time = O(n²/α)
     • V(x) = Σ_i (x_i − x*)² is a Lyapunov function
       (Nedic, Olshevsky, Ozdaglar & JNT, 09)
     • bidirectional graph, natural algorithm:
       x_i := x_i + (1/(2n)) Σ_{j ∈ neighbors} (x_j − x_i)
       α ~ 1/n  ⇒  convergence time = O(n³)
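The "natural algorithm" on this slide can be sketched directly. The 1/(2n) weight keeps every update matrix doubly stochastic with α ~ 1/n, matching the O(n³) bound; the 5-node line graph and initial values below are illustrative.

```python
def natural_averaging(adj, x, rounds):
    """x_i := x_i + (1/(2n)) * sum over neighbors j of (x_j - x_i).
    On a bidirectional graph the updates are symmetric, so the sum
    (hence the average) is conserved at every step."""
    n = len(x)
    for _ in range(rounds):
        # synchronous update: the comprehension reads the old x only
        x = [x[i] + sum(x[j] - x[i] for j in adj[i]) / (2 * n)
             for i in range(n)]
    return x

# Line graph on 5 agents; values should approach the average 2.0.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
x = natural_averaging(adj, [0.0, 0.0, 2.0, 4.0, 4.0], rounds=2000)
```

The small 1/(2n) stepsize is what makes the scheme safe on any bidirectional graph without knowing degrees globally, at the price of the slower O(n³) convergence time noted on the slide.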

 19. A critique
     • The consensus/averaging algorithm x := Ax
       assumes constant a_ij ⇒ fixed graph
       – might as well elect a leader, form a spanning tree,
         and accumulate on the tree
     • Want simplicity and robustness in dealing with
       changing topologies, failures, etc.

 20. Time-varying/chaotic environments
     • i.i.d. random graphs: same (in expectation) as fixed graphs;
       convergence rate ↔ "mixing times" (Boyd et al., 2005)
     • Fairly arbitrary sequence of graphs/matrices A(t):
       worst-case analysis
       x_i(t+1) = Σ_j a_ij(t) x_j(t)
       a_ij(t): nonzero whenever i receives a message from j
       x(t+1) = A(t) x(t)   (inhomogeneous Markov chain)

 21. Consensus convergence
     x_i(t+1) = Σ_j a_ij(t) x_j(t)
     • a_ii(t) > 0;   a_ij(t) > 0  ⇒  a_ij(t) ≥ α > 0
     • "strong connectivity in bounded time": over any B time steps,
       the "communication graph" is strongly connected
     • Convergence to consensus:
       x_i(t) → x* = convex combination of initial values, ∀i
       (JNT, Bertsekas, Athans, 86; Jadbabaie et al., 03)
     • "convergence time": exponential in n and B
       – even with a symmetric graph at each time and
         equal weight to each neighbor (Cao, Spielman, Morse, 05)

 22. Averaging in the time-varying setting
     x(t+1) = A(t) x(t)
     • (Nedic, Olshevsky, Ozdaglar & JNT, 09)
       – A(t) doubly stochastic, for all t
       – the O(n²/α) bound remains valid!
     • Improved convergence rate
       – exchange "load" with up to two neighbors at a time
       – can use α = O(1)
       – convergence time: O(n²)
     • Averaging in time-varying bidirectional graphs: O(n²)
       – no harder than consensus on fixed graphs
     • Various convergence proofs of optimization algorithms
       remain valid
       – improves the convergence time estimate for subgradient
         methods [Nedic, Olshevsky, Ozdaglar & JNT, 09]
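Averaging through a time-varying sequence of doubly stochastic A(t) can be illustrated with alternating pairwise exchanges: each step averages across a set of disjoint edges, which is a doubly stochastic matrix, so the sum is conserved even as the graph changes. The 4-agent schedule below is a made-up example, not from the slides.

```python
def time_varying_averaging(x, edge_schedule, rounds):
    """At step t, average across the (disjoint) edges active at time t.
    Each such step applies a doubly stochastic A(t), so the network
    sum is conserved under an arbitrary schedule."""
    x = list(x)
    for t in range(rounds):
        for (i, j) in edge_schedule[t % len(edge_schedule)]:
            mid = (x[i] + x[j]) / 2
            x[i], x[j] = mid, mid
    return x

# 4 agents on a line whose links alternate: {0-1, 2-3}, then {1-2}.
# The union over any 2 steps is connected, so values reach the
# initial average (0 + 0 + 4 + 8) / 4 = 3.
schedule = [[(0, 1), (2, 3)], [(1, 2)]]
x = time_varying_averaging([0.0, 0.0, 4.0, 8.0], schedule, rounds=200)
```

This mirrors the slide's point: double stochasticity of every A(t), not a fixed topology, is what preserves the average in a changing network.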
