  1. Distributed Approaches to Mirror Descent for Stochastic Learning over Rate-Limited Networks Matthew Nokleby, Wayne State University, Detroit MI 
 (joint work with Waheed Bajwa, Rutgers)

  2. Motivation: Autonomous Driving
     • Network of autonomous automobiles + one human-driven car
     • Sensing for “anomalous” driving from the human driver
     • Want to jointly sense over communications links

  3. Motivation: Autonomous Driving
     • Network of autonomous automobiles + one human-driven car
     • Sensing for “anomalous” driving from the human driver
     • Want to jointly sense over communications links
     Challenges:
     • Need to detect/act quickly
     • Wireless links have limited rate; can’t exchange raw data

  4. Motivation: Autonomous Driving
     • Network of autonomous automobiles + one human-driven car
     • Sensing for “anomalous” driving from the human driver
     • Want to jointly sense over communications links
     Challenges:
     • Need to detect/act quickly
     • Wireless links have limited rate; can’t exchange raw data
     Questions:
     • How well can devices jointly learn when links are slow?
     • What are good strategies?

  5. Contributions of This Talk
     • Frame the problem as distributed stochastic optimization
     • Network of devices trying to minimize an objective function from streams of noisy data

  6. Contributions of This Talk
     • Frame the problem as distributed stochastic optimization
     • Network of devices trying to minimize an objective function from streams of noisy data
     • Focus on the communications aspect: how to collaborate when links have limited rates?
     • Define two time scales: one rate for data arrival, and one for message exchanges

  7. Contributions of This Talk
     • Frame the problem as distributed stochastic optimization
     • Network of devices trying to minimize an objective function from streams of noisy data
     • Focus on the communications aspect: how to collaborate when links have limited rates?
     • Define two time scales: one rate for data arrival, and one for message exchanges
     • Solution: distributed versions of stochastic mirror descent that carefully balance gradient averaging and mini-batching
     • Derive network/rate conditions for near-optimum convergence
     • Accelerated methods provide a substantial speedup

  8. Distributed Stochastic Learning
     • Network of m nodes, each with an i.i.d. data stream {ξ_i(t)} for sensor i at time t
     • Nodes communicate over wireless links, modeled by a graph
     [Figure: six-node network, node i holding the data stream (ξ_i(1), ξ_i(2), …)]

  9. Stochastic Optimization Model
     • Nodes want to solve the stochastic optimization problem:
           min_{x∈X} ψ(x) = min_{x∈X} E_ξ[ɸ(x,ξ)]
     • ɸ is convex, X ⊂ ℝ^d is compact and convex
     • ψ has Lipschitz gradients [composite optimization later!]:
           ||∇ψ(x) − ∇ψ(y)|| ≤ L||x − y||,  x, y ∈ X
     [Figure: six-node network with data streams, as before]

  10. Stochastic Optimization Model
     • Nodes want to solve the stochastic optimization problem:
           min_{x∈X} ψ(x) = min_{x∈X} E_ξ[ɸ(x,ξ)]
     • ɸ is convex, X ⊂ ℝ^d is compact and convex
     • ψ has Lipschitz gradients [composite optimization later!]:
           ||∇ψ(x) − ∇ψ(y)|| ≤ L||x − y||,  x, y ∈ X
     • Nodes have access to noisy gradients:
           g_i(t) := ∇ɸ(x_i(t), ξ_i(t))
           E_ξ[g_i(t)] = ∇ψ(x_i(t))
           E_ξ[||g_i(t) − ∇ψ(x_i(t))||²] ≤ σ²
     • Nodes keep search points x_i(t)
     [Figure: six-node network with data streams, as before]
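To make this model concrete, here is a minimal sketch of the ingredients above, assuming (as an illustration only, not something fixed by the talk) a least-squares loss ɸ(x, ξ) = ½(⟨a, x⟩ − b)² with ξ = (a, b) and X a Euclidean ball; the names sample_xi, noisy_gradient, and project_ball are hypothetical helpers that the later sketches reuse.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10                       # problem dimension (illustrative)
x_star = rng.normal(size=d)  # "true" parameter used only to generate synthetic data
sigma_obs = 0.5              # observation noise level

def sample_xi():
    """Draw one sample ξ = (a, b): a random feature vector and a noisy label."""
    a = rng.normal(size=d)
    b = a @ x_star + sigma_obs * rng.normal()
    return a, b

def noisy_gradient(x, xi):
    """g = ∇_x ɸ(x, ξ) for ɸ(x, (a, b)) = 0.5·(a·x − b)²; unbiased for ∇ψ(x)
    with bounded variance, matching the oracle assumptions on this slide."""
    a, b = xi
    return (a @ x - b) * a

def project_ball(x, radius=10.0):
    """Euclidean projection P_X onto the compact convex set X = {x : ||x|| ≤ radius}."""
    nrm = np.linalg.norm(x)
    return x if nrm <= radius else (radius / nrm) * x
```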

  11. Stochastic Mirror Descent
     • (Centralized) stochastic optimization (SO) is well understood
     • Optimum convergence via mirror descent

     Algorithm: Stochastic Mirror Descent
        Initialize x_i(0) ← 0
        for t = 1 to T:
           x_i(t) ← P_X[x_i(t−1) − γ_t g_i(t−1)]
           x_i^av(t) ← (1/t) Σ_{τ≤t} x_i(τ)
        end for

     [Xiao, “Dual averaging methods for regularized stochastic learning and online optimization”, 2010]
     [Lan, “An Optimal Method for Stochastic Composite Optimization”, 2012]

  12. Stochastic Mirror Descent
     • (Centralized) stochastic optimization (SO) is well understood
     • Optimum convergence via mirror descent

     Algorithm: Stochastic Mirror Descent
        Initialize x_i(0) ← 0
        for t = 1 to T:
           x_i(t) ← P_X[x_i(t−1) − γ_t g_i(t−1)]
           x_i^av(t) ← (1/t) Σ_{τ≤t} x_i(τ)
        end for

     • Extensions via Bregman divergences + prox mappings
     • After T rounds:
           E[ψ(x_i^av(T)) − ψ(x*)] ≤ O(1)·[ L/T + σ/√T ]

     [Xiao, “Dual averaging methods for regularized stochastic learning and online optimization”, 2010]
     [Lan, “An Optimal Method for Stochastic Composite Optimization”, 2012]
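Below is a minimal Euclidean instance of the loop on this slide (the prox mapping is a plain projection, so it reduces to projected stochastic gradient with iterate averaging), reusing the hypothetical oracle helpers from the earlier sketch; the step size γ_t ∝ 1/√t is one standard choice, not necessarily the schedule used in the talk.

```python
import numpy as np

def stochastic_mirror_descent(T, gamma0=0.1):
    """x(t) = P_X[x(t-1) − γ_t·g(t-1)] with running average x_av(t) = (1/t)·Σ_τ x(τ)."""
    x = np.zeros(d)      # x(0) ← 0
    x_av = np.zeros(d)
    for t in range(1, T + 1):
        g = noisy_gradient(x, sample_xi())   # noisy gradient at the current search point
        gamma_t = gamma0 / np.sqrt(t)        # illustrative diminishing step size
        x = project_ball(x - gamma_t * g)    # mirror-descent (projected SGD) step
        x_av += (x - x_av) / t               # online update of the iterate average
    return x_av                              # the point the E[ψ(x_av(T)) − ψ(x*)] bound refers to
```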

  13. Stochastic Mirror Descent
     • Can speed up convergence via accelerated stochastic mirror descent
     • Similar SGD steps, but more complex iterate averaging
     • After T rounds:
           E[ψ(x_i(T)) − ψ(x*)] ≤ O(1)·[ L/T² + σ/√T ]

     [Xiao, “Dual averaging methods for regularized stochastic learning and online optimization”, 2010]
     [Lan, “An Optimal Method for Stochastic Composite Optimization”, 2012]

  14. Stochastic Mirror Descent
     • Can speed up convergence via accelerated stochastic mirror descent
     • Similar SGD steps, but more complex iterate averaging
     • After T rounds:
           E[ψ(x_i(T)) − ψ(x*)] ≤ O(1)·[ L/T² + σ/√T ]
     • Optimum convergence order-wise
     • The noise term dominates in general, but accelerated SMD provides a universal solution to the SO problem
     • This will prove significant in distributed stochastic learning

     [Xiao, “Dual averaging methods for regularized stochastic learning and online optimization”, 2010]
     [Lan, “An Optimal Method for Stochastic Composite Optimization”, 2012]
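A sketch of the accelerated variant in the Euclidean setting, again reusing the hypothetical oracle helpers: the projected stochastic-gradient step is the same, but the gradient is sampled at an interpolated point and the iterates are combined into an aggregated sequence. The step-size constants below are placeholders; the exact schedules that attain the L/T² + σ/√T bound are given in Lan (2012).

```python
import numpy as np

def accelerated_smd(T, L_est=1.0, noise_cap=1e-3):
    """Accelerated stochastic mirror descent (Euclidean, AC-SA-style recursion)."""
    x = np.zeros(d)       # "prox" sequence x(t)
    x_ag = np.zeros(d)    # aggregated (output) sequence x_ag(t)
    gamma_base = min(1.0 / (4.0 * L_est), noise_cap)  # illustrative constants only
    for t in range(1, T + 1):
        alpha_t = 2.0 / (t + 1)
        x_md = (1 - alpha_t) * x_ag + alpha_t * x     # interpolated point where the gradient is sampled
        g = noisy_gradient(x_md, sample_xi())
        gamma_t = gamma_base * t                      # step sizes growing ~ t drive the L/T² term
        x = project_ball(x - gamma_t * g)             # same projected stochastic-gradient step
        x_ag = (1 - alpha_t) * x_ag + alpha_t * x     # accelerated iterate averaging
    return x_ag
```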

  15. Back to Distributed Stochastic Learning
     • With m nodes, after T rounds, the best possible performance is
           E[ψ(x_i(T)) − ψ(x*)] ≤ O(1)·[ L/(mT)² + σ/√(mT) ]

  16. Back to Distributed Stochastic Learning
     • With m nodes, after T rounds, the best possible performance is
           E[ψ(x_i(T)) − ψ(x*)] ≤ O(1)·[ L/(mT)² + σ/√(mT) ]
     • Achievable with sufficiently fast communications
     • In a distributed computing environment, the noise term is achievable via gradient averaging:
        1. Use AllReduce to average gradients over a spanning tree
        2. Take an SMD step
     • Upshot: averaging reduces gradient noise and provides a speedup
     • Perfect averages are difficult to compute over wireless networks
     • Approaches: average consensus, incremental methods, etc.

     [Dekel et al., “Optimal distributed online prediction using mini-batches”, 2012]
     [Duchi et al., “Dual averaging for distributed optimization…”, 2012]
     [Ram et al., “Incremental stochastic sub-gradient algorithms for convex optimization”, 2009]
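A sketch of the two-step recipe on this slide, with the AllReduce simulated by a plain mean over the m local gradients (which is what an AllReduce over a spanning tree would deliver to every node). Since all nodes share the same search point, averaging cuts the gradient variance roughly by a factor of m; the oracle helpers are the hypothetical ones from the earlier sketch.

```python
import numpy as np

def gradient_averaging_smd(T, m=6, gamma0=0.1):
    """Each round: every node draws a noisy gradient at the common search point,
    the m gradients are averaged (stand-in for AllReduce), then one SMD step is taken."""
    x = np.zeros(d)
    x_av = np.zeros(d)
    for t in range(1, T + 1):
        local_grads = [noisy_gradient(x, sample_xi()) for _ in range(m)]  # g_i(t) at shared x
        g_bar = np.mean(local_grads, axis=0)   # AllReduce result: variance shrinks ~ σ²/m
        gamma_t = gamma0 / np.sqrt(t)
        x = project_ball(x - gamma_t * g_bar)  # one SMD step with the averaged gradient
        x_av += (x - x_av) / t
    return x_av
```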

  17. Communications Model
     • Nodes are connected over an undirected graph G = (V, E)
     • In every communications round, each node broadcasts a single gradient-like message m_i(r) to its neighbors
     • Rate limitations are modeled by the communications ratio ρ: ρ communications rounds for every data sample that arrives
     [Figure: each node i broadcasts its message m_i(r) to its neighbors in the graph]

  18. Communications Model
     • Nodes are connected over an undirected graph G = (V, E)
     • In every communications round, each node broadcasts a single gradient-like message m_i(r) to its neighbors
     • Rate limitations are modeled by the communications ratio ρ: ρ communications rounds for every data sample that arrives
     [Figure: two example timelines.
        ρ = 1/2: data rounds ξ_i(t=1), ξ_i(t=2), ξ_i(t=3), ξ_i(t=4) vs. comms rounds m_i(r=1), m_i(r=2) (one message per two samples)
        ρ = 2: data rounds ξ_i(t=1), ξ_i(t=2) vs. comms rounds m_i(r=1), m_i(r=2), m_i(r=3), m_i(r=4) (two messages per sample)]
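One concrete way to read the communications ratio is as a schedule that interleaves sample arrivals and message exchanges; the helper below is purely illustrative and simply enumerates, for a given ρ, which event happens next.

```python
from fractions import Fraction

def schedule(rho, n_samples):
    """List data rounds and comms rounds so that there are rho comms rounds
    per data sample on average (rho may be fractional, e.g. 1/2)."""
    rho = Fraction(rho).limit_denominator()
    events, comms_due = [], Fraction(0)
    for t in range(1, n_samples + 1):
        events.append(f"data xi(t={t})")
        comms_due += rho
        while comms_due >= 1:  # emit the comms rounds owed so far
            r = len([e for e in events if e.startswith("comms")]) + 1
            events.append(f"comms m(r={r})")
            comms_due -= 1
    return events

# schedule(2, 2)   -> data, comms, comms, data, comms, comms   (ρ = 2)
# schedule(0.5, 4) -> data, data, comms, data, data, comms     (ρ = 1/2)
```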

  19. Distributed Mirror Descent Outline
     • Distribute stochastic MD via averaging consensus:
        1. Nodes obtain local gradients
        2. Compute distributed gradient averages via consensus
        3. Take an MD step using the averaged gradients
     [Figure, ρ = 2: data rounds ξ_i(t=1), ξ_i(t=2); consensus rounds m_i(r=1), …, m_i(r=4); search-point updates x_i(t=1), x_i(t=2)]
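A sketch of the outline above: instead of an exact AllReduce, the nodes run ρ rounds of average consensus per data sample using a doubly stochastic mixing matrix W over the graph, and then each node takes a local MD step with its approximately averaged gradient. The ring topology, Metropolis weights, and other constants are illustrative assumptions, and the oracle helpers are the hypothetical ones from the first sketch.

```python
import numpy as np

def metropolis_weights(adj):
    """Doubly stochastic mixing matrix W for an undirected graph given by a 0/1 adjacency matrix."""
    m = adj.shape[0]
    deg = adj.sum(axis=1)
    W = np.zeros((m, m))
    for i in range(m):
        for j in range(m):
            if adj[i, j]:
                W[i, j] = 1.0 / (1 + max(deg[i], deg[j]))
        W[i, i] = 1.0 - W[i].sum()
    return W

def consensus_mirror_descent(T, rho=2, m=6, gamma0=0.1):
    """Each data round: nodes draw local gradients, run rho consensus rounds G <- W G
    to approximate the network-wide average gradient, then take local MD steps."""
    ring = np.zeros((m, m), dtype=int)            # illustrative topology: a ring
    for i in range(m):
        ring[i, (i + 1) % m] = ring[(i + 1) % m, i] = 1
    W = metropolis_weights(ring)

    X = np.zeros((m, d))                          # row i = search point x_i(t) of node i
    for t in range(1, T + 1):
        G = np.stack([noisy_gradient(X[i], sample_xi()) for i in range(m)])
        for _ in range(rho):                      # rho comms rounds per data sample
            G = W @ G                             # one consensus round: mix with neighbors
        gamma_t = gamma0 / np.sqrt(t)
        X = np.stack([project_ball(X[i] - gamma_t * G[i]) for i in range(m)])
    return X.mean(axis=0)
```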
