Averaging algorithms and distributed optimization
John N. Tsitsiklis, MIT
NIPS 2010 Workshop on Learning on Cores, Clusters and Clouds
December 2010
Outline
– Motivation and applications
– Consensus/averaging in distributed optimization
– time-invariant case
– time-varying case
– starting values xi(0)
– consensus: mini xi(0) ≤ x∗ ≤ maxi xi(0)
– averaging: x∗ = (x1(0) + · · · + xn(0)) / n
– voting: averaging when xi ∈ {−1, +1}
– genuinely distributed algorithm
– no synchronization
– no “infrastructure” such as spanning trees
xi := (xi + xj) / 2
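The pairwise averaging update can be simulated directly; the sketch below (function name and the random pair-selection schedule are illustrative choices, not from the talk) shows that each midpoint exchange conserves the sum, so all values drift to the average.

```python
import random

def gossip_average(x, steps=20000, seed=0):
    """Repeatedly pick a random pair (i, j) and set both to their midpoint.

    Each update xi := xj := (xi + xj)/2 conserves xi + xj, hence the total
    sum; repeated exchanges drive all values to the average of x(0).
    """
    rng = random.Random(seed)
    x = list(x)
    n = len(x)
    for _ in range(steps):
        i, j = rng.sample(range(n), 2)
        mid = (x[i] + x[j]) / 2
        x[i] = x[j] = mid
    return x

values = gossip_average([1.0, 5.0, 3.0, 7.0])
```

On this 4-node example the values approach the average 4.0 while the sum stays 16.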
– behavior described by “rules of thumb”
– less complex than Bayesian updating
. . . and narratives
– Fusion of individual estimates
– Distributed Kalman filtering
– Distributed optimization
– Distributed reinforcement learning
– Load balancing and resource allocation
– Clock synchronization
– Reputation management in ad hoc networks
– Network monitoring
– Coverage control
– Monitoring
– Creating virtual coordinates for geographic routing
– Decentralized task assignment
– Flocking
– A doubly stochastic: 1′Ax = 1′x, where 1′ = [1 1 . . . 1]
– x1 + · · · + xn is conserved
– convergence to x∗ = (x1(0) + · · · + xn(0)) / n
xi(t + 1) = Σj aij xj(t),  aij ≥ 0,  Σj aij = 1
x(t + 1) = Ax(t),  A: stochastic matrix
→ convergence of A^t to a matrix with equal rows
→ convergence of xi to Σj πj xj(0)
→ convergence rate estimates
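A minimal numerical check of the iteration x(t + 1) = Ax(t): the 4-node chain below is a made-up example; the common limit Σj πj xj(0) is computed from the left eigenvector of A for eigenvalue 1.

```python
import numpy as np

# A row-stochastic matrix on 4 nodes (each row sums to 1).
A = np.array([
    [0.5, 0.5, 0.0, 0.0],
    [0.25, 0.5, 0.25, 0.0],
    [0.0, 0.25, 0.5, 0.25],
    [0.0, 0.0, 0.5, 0.5],
])

x0 = np.array([1.0, 2.0, 3.0, 4.0])
x = x0.copy()
for _ in range(1000):
    x = A @ x                      # x(t+1) = A x(t)

# A^t -> 1 pi', so every xi converges to sum_j pi_j x_j(0),
# where pi is the normalized left eigenvector for eigenvalue 1.
eigvals, eigvecs = np.linalg.eig(A.T)
pi = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
pi = pi / pi.sum()
limit = pi @ x0
```

For this chain pi works out to [1/6, 1/3, 1/3, 1/6], so the limit is 2.5 rather than the plain average: A is stochastic but not doubly stochastic.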
min_x f(x),  special case: f(x) = Σi fi(x)
– f, fi convex
– update: x := x − γ∇f(x)
– with noise: x := x − γ(∇f(x) + w)  (stochastic approximation, γt → 0)
– update: x := x − γ∂f(x)  (γt → 0)
– with noise: x := x − γ(∂f(x) + w)
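The noisy update x := x − γ(∇f(x) + w) with diminishing stepsizes can be sketched on a one-dimensional quadratic (the objective, seed, and stepsize schedule γt = 1/t are illustrative assumptions):

```python
import random

# Stochastic approximation for f(x) = (x - 3)^2: noisy gradient steps
# x := x - gamma_t * (grad f(x) + w) with gamma_t = 1/t, which satisfies
# the usual conditions sum gamma_t = inf, sum gamma_t^2 < inf.
rng = random.Random(1)
x = 0.0
for t in range(1, 200001):
    gamma = 1.0 / t
    noise = rng.gauss(0.0, 1.0)     # w: zero-mean gradient noise
    grad = 2.0 * (x - 3.0)          # exact gradient of f
    x = x - gamma * (grad + noise)
```

Despite unit-variance noise, the iterate settles near the minimizer x = 3 because the stepsizes vanish.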
– x^i_j: agent i’s value of component j
– update: x^i_i := x^i_i − γ ∂f/∂x_i (x^i)
– reconcile: x^i_j := x^j_j  (occasionally; upper bound B between reconciliations)
track y = (x^1_1, . . . , x^n_n):
– y − x^i = O(Bγ)
– y := y − γ∇f(y) + O(Bγ²)
– convergence results remain valid [Bertsekas, JNT, Athans, 86]
– need γ ∼ 1/B
– also for stochastic approximation variant:
  x^i_i := x^i_i − γ (∂f/∂x_i (x^i) + wi)
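The component-ownership scheme above can be sketched on a small quadratic: each agent keeps a full, possibly stale copy of x, updates only its own coordinate, and all copies are reconciled every B steps. The quadratic f = ½x′Qx − b′x, the data Q, b, the stepsize, and B are made-up illustration choices.

```python
import numpy as np

# Agent i owns coordinate i of x; copies[i] is agent i's full copy x^i.
# Between reconciliations, agent i descends only along its own coordinate
# using its (stale) copy; every B steps all agents adopt the owner values
# y = (x^1_1, ..., x^n_n).
n, B, gamma = 4, 5, 0.05
Q = np.array([[2.0, 0.5, 0.0, 0.0],
              [0.5, 2.0, 0.5, 0.0],
              [0.0, 0.5, 2.0, 0.5],
              [0.0, 0.0, 0.5, 2.0]])
b = np.array([1.0, 2.0, 3.0, 4.0])

copies = [np.zeros(n) for _ in range(n)]
for t in range(2000):
    for i in range(n):
        grad_i = Q[i] @ copies[i] - b[i]   # d f / d x_i at agent i's copy
        copies[i][i] -= gamma * grad_i
    if t % B == 0:
        y = np.array([copies[j][j] for j in range(n)])
        copies = [y.copy() for _ in range(n)]  # reconcile: x^i_j := x^j_j

y = np.array([copies[j][j] for j in range(n)])
x_star = np.linalg.solve(Q, b)
```

With no noise the scheme has the true minimizer as its fixed point, so y approaches the solution of Qx = b; staleness only slows convergence for small enough γ.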
– subscript denotes agent’s value of x
– having every agent run xi := xi − γ∇f(xi) on its own is redundant/useless
– update: xi := xi − γ (∇f(xi) + wi)
– reconcile: x := x − γ · (1/n) Σi (∇f(xi) + wi)
– convergence results for the gradient method, with γt → 0, remain valid
– update: xi := xi − γ (∇f(xi) + wi)
– reconcile: run consensus algorithm x := Ax
  converges: xi → y, ∀i,  y = Σj πj xj,  πj ≥ 0
– y := y − γ Σj πj (∇f(xj) + wj)
– convergence results, with γt → 0, remain valid [Bertsekas, JNT, Athans, 86]
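The two-phase scheme above can be sketched in the simplest setting: each agent takes a noisy gradient step on a common f, then the reconcile phase is run to convergence, which for a doubly stochastic A replaces every xi by the average (so πj = 1/n and the noise gets averaged). The objective, n, and seed are illustrative assumptions.

```python
import random

# Two-phase scheme for a common f(x) = (x - 2)^2 with n agents:
#   update:    xi := xi - gamma_t (grad f(xi) + wi)
#   reconcile: consensus to convergence -> every xi becomes the average y
rng = random.Random(0)
n = 10
x = [0.0] * n
for t in range(1, 50001):
    gamma = 1.0 / t
    x = [xi - gamma * (2.0 * (xi - 2.0) + rng.gauss(0.0, 1.0)) for xi in x]
    y = sum(x) / n                 # consensus limit y = sum_j pi_j x_j
    x = [y] * n                    # reconcile: all agents adopt y
```

The reconcile step averages the n independent noises, so y tracks the noiseless gradient method and converges to the minimizer x = 2.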
– define y = Σi πi xi
– note: Σi πi Σj aij xj = Σi πi xi  (since π′A = π′)
– y := y − γ Σi πi (∇f(xi) + wi)
– update: xi := Σj aij xj − γ (∇f(xi) + wi)
– T: convergence time (time constant) of consensus algorithm
– y := y − γ Σi πi (∇f(y) + wi) + O(γ²T · |∇f(y)|)
min_x (1/n) Σi fi(x)  ⇐⇒  Σi ∇fi(x) = 0
– update: xi := xi − γ ∇fi(xi)
– reconcile: run consensus algorithm x := Ax
  converges: xi → y, ∀i,  y = Σi πi xi,  πi ≥ 0
– y := y − γ Σi πi ∇fi(xi)
– Use averaging algorithm (A: doubly stochastic), so that πi = 1/n
– error term involves Σi |∇fi(y)|
– for constant γ, error does not vanish at optimum (even in the absence of noise)
– hence studied for nonsmooth f or stochastic case
[Nedic & Ozdaglar, 09; Duchi, Agarwal, & Wainwright, 10]
– xi := Σj aij xj − γ (∇fi(xi) + wi)
– define y = (1/n) Σi xi
– y := y − γ (1/n) Σi (∇fi(xi) + wi)
[Nedic & Ozdaglar, 09; Duchi, Agarwal, & Wainwright, 10]
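The interleaved update xi := Σj aij xj − γ∇fi(xi) can be sketched with quadratic local objectives on a ring (the graph, weights, data ci, and stepsize schedule are illustrative assumptions, not from the talk):

```python
import numpy as np

# Distributed (sub)gradient method on a ring of n agents with a doubly
# stochastic A; fi(x) = (x - c_i)^2, so the minimizer of sum_i fi is the
# mean of the c_i. Each step interleaves one consensus step with one
# local gradient step, with diminishing gamma_t = 1/t.
n = 8
c = np.arange(1.0, n + 1.0)                # c_i = 1..8, optimum = 4.5
A = np.zeros((n, n))
for i in range(n):
    A[i, i] = 0.5
    A[i, (i - 1) % n] = 0.25
    A[i, (i + 1) % n] = 0.25               # doubly stochastic by symmetry

x = np.zeros(n)
for t in range(1, 20001):
    gamma = 1.0 / t
    x = A @ x - gamma * 2.0 * (x - c)      # consensus + local gradient
```

Because A is doubly stochastic, the average (1/n)Σi xi follows a centralized gradient step on (1/n)Σi fi, and all agents agree on the global minimizer in the limit.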
– is interleaving faster or “better” than the two-phase version?
– faster consensus ⇒ automatically better overall convergence time
  e.g., [Nedic, Olshevsky, Ozdaglar & JNT, 08]
– time to reduce disagreement from unity to ǫ: generically scales as log(1/ǫ) – focus on Tcon(n)
– time to bring cost gap to ǫ: hide dependence on other constants (bounds on first, second derivatives, stepsize details)
Convergence time (time to get close to “steady-state”), equal weight to all neighbors:
– Directed graphs: exponential(n)
– Undirected graphs: O(n³), tight (Landau and Odlyzko, 1981)
– Θ(n²) for line graphs
– Better results for special graphs (Erdős–Rényi, geometric, small world)
xi(t + 1) = Σj aij xj(t),  aij ≥ 0,  Σj aij = 1
x(t + 1) = Ax(t),  A: stochastic matrix
xi := xi + (1/2n) Σj (xj − xi)
– α ∼ 1/n, convergence time = O(n³)
– positive diagonal
– nonzero entries are ≥ α > 0
– convergence to x∗ = (x1(0) + · · · + xn(0)) / n
– convergence time = O(n²/α)
– V = Σi (xi − x∗)² is a Lyapunov function
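A quick check of the Lyapunov property: the equal-neighbor style update xi := xi + (1/2n) Σj (xj − xi) on a line graph conserves the average and drives V = Σi (xi − x∗)² down monotonically (n = 10 and the initial condition are illustrative choices).

```python
import numpy as np

# Averaging on a line graph of n nodes with the safe weight 1/(2n) on
# each edge; the matrix is symmetric and doubly stochastic, so the sum
# is conserved and V = sum_i (xi - x*)^2 is nonincreasing.
n = 10
x = np.arange(float(n))                    # x_i(0) = 0..9
x_star = x.mean()                          # conserved average = 4.5
V_prev = float(np.sum((x - x_star) ** 2))
monotone = True
for _ in range(5000):
    new = x.copy()
    for i in range(n):
        nbrs = [j for j in (i - 1, i + 1) if 0 <= j < n]
        new[i] = x[i] + sum(x[j] - x[i] for j in nbrs) / (2 * n)
    x = new
    V = float(np.sum((x - x_star) ** 2))
    monotone = monotone and (V <= V_prev + 1e-12)
    V_prev = V
```

The O(n³) scaling for line graphs shows up here: with the 1/(2n) weight the spectral gap is Θ(1/n³), so thousands of steps are needed even for n = 10.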
(Nedic, Olshevsky, Ozdaglar & JNT, 09)
– assumes constant aij ⇒ fixed graph
– on a fixed graph could instead elect a leader, form a spanning tree, accumulate on the tree
– but interested in changing topologies, failures, etc.
– convergence rate ←→ “mixing times” (Boyd et al., 2005)
– worst-case analysis
– aij(t): nonzero whenever i receives a message from j
– x(t + 1) = A(t)x(t)  (inhomogeneous Markov chain)
– xi(t + 1) = Σj aij(t) xj(t)
– aij(t) > 0 ⇒ aij(t) ≥ α > 0
– graph of message exchanges (over bounded time intervals) is strongly connected
⇒ ∀i: xi(t) → x∗ = convex combination of initial values
(JNT, Bertsekas, Athans, 86; Jadbabaie et al., 03)
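The time-varying result can be sketched with a toy A(t): alternating between two stochastic matrices on a directed ring (the matrices, n, and the deterministic alternation are illustrative assumptions) still yields consensus on some convex combination of the initial values.

```python
import numpy as np

# Time-varying consensus x(t+1) = A(t) x(t): alternate two stochastic
# matrices, each with self-weight >= alpha = 0.1 and positive entries
# only on ring edges; jointly the interaction graph is strongly connected.
n = 5
A1 = np.zeros((n, n))
A2 = np.zeros((n, n))
for i in range(n):
    A1[i, i], A1[i, (i + 1) % n] = 0.5, 0.5
    A2[i, i], A2[i, (i - 1) % n] = 0.5, 0.5
A1[0, 0], A1[0, 1] = 0.9, 0.1          # node 0 weighs itself more at even t
                                       # (breaks double stochasticity)
x0 = np.array([3.0, -1.0, 4.0, 1.0, 5.0])
x = x0.copy()
for t in range(5000):
    A = A1 if t % 2 == 0 else A2
    x = A @ x
```

All agents agree in the limit, and the common value lies between the initial min and max, i.e., it is a convex combination of the x_i(0), but in general not their average.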
– convergence time can be bad even with: symmetric graph at each time, equal weight to each neighbor
  (Cao, Spielman, Morse, 05)
– no harder than consensus on fixed graphs
– exchange “load” with up to two neighbors at a time
– can use α = O(1)
– convergence time: O(n²)
– A(t) doubly stochastic, for all t ⇒ O(n²/α) bound remains valid!
(Nedic, Olshevsky, Ozdaglar & JNT, 09)
– Improves the convergence time estimate for subgradient methods [Nedic, Olshevsky, Ozdaglar, JNT, 09]
– consider smooth update functions xi := f(xj ; j ∈ neighbors of i) [Olshevsky & JNT, 10]
– Nonlinearity cannot help
– Playing with the coefficients of random walks on a line does not help
– to rule out fancy encoding of information in real numbers: work with discrete messages
– can only solve discrete problems, e.g., xi ∈ {−1, +1}
– Fixed but unknown bidirectional graph
– No randomization
– Anonymous nodes, all running same code
– Bounded message alphabet
– xi ∈ {−1, 1}; Is the average > 0?
(Hendrickx, Olshevsky & JNT, 10)
– pair up +1 and −1 values, cancel them when they meet
– see what is left
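A toy, centralized illustration of this cancellation idea (the actual distributed protocol of Hendrickx, Olshevsky & JNT runs over a graph with a bounded message alphabet; the function below only captures the arithmetic of "cancel and see what is left"):

```python
# Voting: xi in {-1, +1}; is the average > 0?  Cancel (+1, -1) pairs;
# the sign of whatever survives answers the question.
def majority_by_cancellation(votes):
    plus = votes.count(+1)
    minus = votes.count(-1)
    cancelled = min(plus, minus)   # each meeting removes one +1 and one -1
    plus -= cancelled
    minus -= cancelled
    if plus > 0:
        return +1                  # average > 0
    if minus > 0:
        return -1                  # average < 0
    return 0                       # exact tie
```

For example, [+1, +1, −1, +1, −1] leaves a single +1 after two cancellations, so the average is positive.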
– Can we get an Ω(n²) lower bound? (may be hard)
– Can we get O(n²) on directed static graphs?
– Can we get an O(n²) method for time-varying graphs? (under what connectivity assumptions?)