SLIDE 1

Distributed Approaches to Mirror Descent for Stochastic Learning over Rate-Limited Networks

Matthew Nokleby, Wayne State University, Detroit MI
 (joint work with Waheed Bajwa, Rutgers)

SLIDES 2-4

Motivation: Autonomous Driving

  • Network of autonomous automobiles + one human-driven car
  • Sensing for “anomalous” driving from human
  • Want to jointly sense over communications links

Challenges:

  • Need to detect/act quickly
  • Wireless links have limited rate — can’t exchange raw data

Questions:

  • How well can devices jointly learn when links are slow?

  • What are good strategies?
SLIDES 5-7

Contributions of This Talk

  • Frame the problem as distributed stochastic optimization
  • Network of devices trying to minimize an objective function from streams of noisy data

  • Focus on communications aspect: how to collaborate when links have limited rates?
  • Define two time scales: one rate for data arrival, and one for message exchanges
  • Solution: distributed versions of stochastic mirror descent that carefully balance gradient averaging and mini-batching
  • Derive network/rate conditions for near-optimum convergence
  • Accelerated methods provide a substantial speedup
SLIDE 8

Distributed Stochastic Learning

  • Network of m nodes, each with an i.i.d. data stream {ξi(t)}, for sensor i at time t
  • Nodes communicate over wireless links, modeled by a graph

[Figure: network of six nodes, node i observing the stream (ξi(1), ξi(2), …)]

SLIDES 9-10

Stochastic Optimization Model


  • Nodes want to solve the stochastic optimization problem:

$$\min_{x \in X} \psi(x) = \min_{x \in X} \mathbb{E}_{\xi}[\phi(x, \xi)]$$

  • φ is convex, X ⊂ ℝᵈ is compact and convex
  • ψ has Lipschitz gradients: [composite optimization later!]

$$\|\nabla\psi(x) - \nabla\psi(y)\| \le L\|x - y\|, \qquad x, y \in X$$


  • Nodes have access to noisy gradients (a concrete oracle sketch follows below):

$$g_i(t) := \nabla\phi(x_i(t), \xi_i(t)), \qquad \mathbb{E}_{\xi}[g_i(t)] = \nabla\psi(x_i(t)), \qquad \mathbb{E}_{\xi}[\|g_i(t) - \nabla\psi(x_i(t))\|^2] \le \sigma^2$$

  • Nodes keep search points xi(t)
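
To make the oracle model concrete, here is a minimal sketch (not from the talk) of a noisy first-order oracle, using logistic loss as an illustrative choice of φ; the data model and all names are assumptions:

    import numpy as np

    def noisy_gradient(x, xi):
        # Noisy gradient g_i(t) = grad phi(x, xi) of the logistic loss
        # phi(x, (a, y)) = log(1 + exp(-y * a @ x)) at one sample xi = (a, y).
        # Its expectation over xi is grad psi(x), matching the oracle model.
        a, y = xi
        z = -y * (a @ x)
        s = np.exp(z - np.logaddexp(0.0, z))  # sigmoid(z), computed stably
        return -y * s * a

    # One oracle call on synthetic data (illustrative)
    rng = np.random.default_rng(0)
    d = 50
    x = np.zeros(d)
    xi = (rng.normal(size=d), rng.choice([-1.0, 1.0]))  # (features, label)
    g = noisy_gradient(x, xi)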
SLIDES 11-12

Algorithm: Stochastic Mirror Descent

  • (Centralized) SO is well understood
  • Optimum convergence via mirror descent

[Lan, “An Optimal Method for Stochastic Composite Optimization”, 2012]

    Initialize x_i(0) ← 0
    for t = 1 to T:
        x_i(t) ← P_X[x_i(t−1) − γ_t g_i(t−1)]
        x_i^av(t) ← (1/t) Σ_{τ≤t} x_i(τ)
    end for


  • Extensions via Bregman divergences + prox mappings
  • After T rounds:

$$\mathbb{E}[\psi(x_i^{\mathrm{av}}(T)) - \psi(x^*)] \le O(1)\left[\frac{L}{T} + \frac{\sigma}{\sqrt{T}}\right]$$

  • [Xiao, “Dual averaging methods for regularized stochastic learning and online optimization”, 2010]
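
A minimal sketch of the loop above in the Euclidean special case, where the prox mapping reduces to a projection; the domain, step sizes, and oracle interface are illustrative assumptions rather than the talk’s implementation:

    import numpy as np

    def project_ball(x, R=10.0):
        # P_X for X = Euclidean ball of radius R (an illustrative compact domain)
        n = np.linalg.norm(x)
        return x if n <= R else (R / n) * x

    def stochastic_mirror_descent(oracle, d, T, L=1.0, sigma=1.0):
        # oracle(x, t) returns a noisy gradient with mean grad psi(x)
        x = np.zeros(d)
        x_sum = np.zeros(d)
        for t in range(1, T + 1):
            gamma_t = 1.0 / (L + sigma * np.sqrt(t))  # O(1/sqrt(t)) step size
            x = project_ball(x - gamma_t * oracle(x, t))
            x_sum += x                                # accumulate for the average
        return x_sum / T                              # averaged iterate x_av(T)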
SLIDES 13-14

Stochastic Mirror Descent

  • Can speed up convergence via accelerated stochastic mirror descent:
  • Similar SGD steps, but more complex iterate averaging
  • After T rounds:

[Lan, “An Optimal Method for Stochastic Composite Optimization”, 2012]

$$\mathbb{E}[\psi(x_i(T)) - \psi(x^*)] \le O(1)\left[\frac{L}{T^2} + \frac{\sigma}{\sqrt{T}}\right]$$


  • Optimum convergence order-wise
  • Noise term dominates in general, but ASMD provides a universal solution to the SO problem

  • Will prove significant in distributed stochastic learning


  • [Xiao, “Dual averaging methods for regularized stochastic learning and online optimization”, 2010]
SLIDES 15-16

Back to Distributed Stochastic Learning

  • With m nodes, after T rounds, the best possible performance is

$$\mathbb{E}[\psi(x_i(T)) - \psi(x^*)] \le O(1)\left[\frac{L}{(mT)^2} + \frac{\sigma}{\sqrt{mT}}\right]$$

[Ram et al., “Incremental stochastic sub-gradient algorithms for convex optimization”, 2009]

  • Achievable with sufficiently fast communications
  • In a distributed computing environment, the noise term is achievable via gradient averaging:
  • 1. Use AllReduce to average gradients over a spanning tree
  • 2. Take an SMD step
  • Upshot: Averaging reduces gradient noise, provides speedup (see the variance argument below)
  • Perfect averages are difficult to compute over wireless networks
  • Approaches: average consensus, incremental methods, etc.

[Duchi et al., “Dual averaging for distributed optimization…”, 2012] [Dekel et al., “Optimal distributed online prediction using mini-batches”, 2012]
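
The speedup comes from a standard variance argument, spelled out here for completeness: averaging the m nodes’ independent gradients divides the noise variance by m, and substituting the reduced variance into the centralized rate gives the improved noise term:

$$\theta(t) = \frac{1}{m}\sum_{i=1}^{m} g_i(t), \qquad \mathbb{E}\big[\|\theta(t) - \nabla\psi(x(t))\|^2\big] \le \frac{\sigma^2}{m} \quad\Longrightarrow\quad \frac{\sqrt{\sigma^2/m}}{\sqrt{T}} = \sqrt{\frac{\sigma^2}{mT}}$$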

SLIDES 17-18

Communications Model

  • Nodes connected over an undirected graph G = (V, E)
  • Every communications round, each node broadcasts a single gradient-like message mi(r) to its neighbors
  • Rate limitations modeled by the communications ratio ρ
  • ρ communications rounds for every data sample that arrives

[Figure: example timelines of data rounds and comms rounds for ρ = 1/2 and ρ = 2]

SLIDES 19-20

Distributed Mirror Descent Outline

  • Distribute stochastic MD via averaging consensus:
  • 1. Nodes obtain local gradients
  • 2. Compute distributed gradient averages via consensus
  • 3. Take MD step using the average gradients

[Figure: timeline for ρ = 2 showing data rounds, consensus rounds, and search point updates]


  • If links are slow (ρ small), there isn’t much time for consensus
  • New data samples arrive before the network can process the previous one

SLIDES 21-23

Mini-batching Gradients

  • Solution: mini-batch together b gradients, batch size b ≥ 1
  • Hold search point constant for b rounds
  • Average together b gradient evaluations:

$$\theta_i(s) = \frac{1}{b} \sum_{t=(s-1)b+1}^{sb} g_i(t)$$

  • Reduces gradient noise: Eξ[||θi(s) − ∇ψ(xi(s))||²] ≤ σ²/b

  • Allows for more consensus rounds

[Figure: timeline for ρ = 1/2, b = 4 showing data rounds, consensus rounds, mini-batch rounds, and search points]


  • However, fewer search point updates (a numerical sketch of the variance reduction follows below)
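
A small numerical sketch (synthetic oracle, illustrative values) of the mini-batch average θi(s) and the σ²/b variance reduction:

    import numpy as np

    rng = np.random.default_rng(1)
    d, b, sigma = 50, 4, 1.0
    true_grad = rng.normal(size=d)   # stands in for grad psi(x_i(s))

    def g(t):
        # Noisy gradient: E[g] = true_grad, E||g - true_grad||^2 = sigma^2
        return true_grad + rng.normal(scale=sigma / np.sqrt(d), size=d)

    # theta_i(s) = (1/b) sum_{t=(s-1)b+1}^{sb} g_i(t), here for s = 1
    theta = np.mean([g(t) for t in range(1, b + 1)], axis=0)

    # Empirical check of the sigma^2 / b noise bound
    errs = [np.sum((np.mean([g(t) for t in range(b)], axis=0) - true_grad) ** 2)
            for _ in range(2000)]
    print(np.mean(errs), "vs", sigma**2 / b)   # both close to 0.25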
SLIDES 24-26

Gradient Averaging via Consensus

  • Averaging consensus: nodes compute local averages with neighbors, which converge on the global average
  • Choose a doubly-stochastic matrix W ∈ ℝ^{m×m} such that wij ≠ 0 only if nodes are connected, i.e. (i,j) ∈ E
  • At mini-batch round s and communications round r:

$$\theta_i^r(s) = \sum_j w_{ij}\,\theta_j^{r-1}(s)$$

  • For mini-batch size b and communications ratio ρ, nodes can carry out bρ consensus rounds per mini-batch
  • Iterates converge on the true average as the number of rounds → ∞

[Tsianos and Rabbat, “Efficient distributed online prediction and stochastic optimization”, 2016] [Duchi et al., “Dual averaging for distributed optimization…”, 2012]


Lemma: The equivalent gradient noise variance is bounded by

$$\sigma_{\mathrm{eq}}^2 := \mathbb{E}[\|\theta_i^{\rho b}(s) - \nabla\psi(x_i(s))\|^2] \le O(1)\left[\lambda_2^{2\rho b}(W)\,\|x_i(s) - x_j(s)\|^2 + \lambda_2^{2\rho b}(W)\,\frac{\sigma^2}{b} + \frac{\sigma^2}{mb}\right]$$


  • Noise components: gap in nodes’ search points, error due to imperfect consensus averaging, residual noise
  • For ρ or b large, noise converges on the perfect-average case (a consensus sketch follows below)

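A sketch of the consensus recursion θ_i^r(s) = Σ_j w_ij θ_j^{r−1}(s); the graph and the Metropolis construction of the doubly-stochastic W are illustrative assumptions:

    import numpy as np

    def metropolis_weights(A):
        # Doubly-stochastic W from adjacency matrix A:
        # w_ij = 1 / (1 + max(deg_i, deg_j)) on edges, w_ii = 1 - sum_j w_ij
        m = A.shape[0]
        deg = A.sum(axis=1)
        W = np.zeros((m, m))
        for i in range(m):
            for j in range(m):
                if A[i, j]:
                    W[i, j] = 1.0 / (1.0 + max(deg[i], deg[j]))
            W[i, i] = 1.0 - W[i].sum()
        return W

    A = np.array([[0, 1, 0, 0],   # path graph on 4 nodes (illustrative)
                  [1, 0, 1, 0],
                  [0, 1, 0, 1],
                  [0, 0, 1, 0]])
    W = metropolis_weights(A)
    lam2 = sorted(abs(np.linalg.eigvals(W)))[-2]  # governs the lambda_2^(2rho b) terms

    theta = np.array([4.0, 0.0, 0.0, 0.0])  # scalar stand-ins for local gradients
    for r in range(20):                     # rho * b consensus rounds
        theta = W @ theta                   # one round: theta^r = W theta^{r-1}
    print(theta)                            # every entry approaches the average, 1.0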

SLIDE 27

Distributed SA Mirror Descent

Algorithm: Distributed Stochastic Approximation Mirror Descent (D-SAMD)

    Initialize x_i(0) ← 0, for all i
    for s = 1 to T/b:                        [iterate over mini-batches]
        θ_i^0(s) ← θ_i(s)
        for r = 1 to ρb:                     [iterate over consensus rounds]
            θ_i^r(s) ← Σ_j w_ij θ_j^{r−1}(s), for all i
        end for
        x_i(sb+1) ← P_X[x_i(sb) − γ_s θ_i^{ρb}(s)]
        x_i^av(t) ← (1/s) Σ_τ x_i(τb)
    end for

  • Outer loop: nodes compute mini-batches, take MD steps
  • Inner loop: nodes engage in average consensus (a compact sketch follows below)
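
Putting the pieces together, a compact sketch of D-SAMD in the Euclidean case; the oracle interface, projection radius, and step sizes are illustrative assumptions:

    import numpy as np

    def d_samd(oracle, W, d, T, b, rho, gamma=0.1, R=10.0):
        # oracle(i, x, t) returns node i's noisy gradient at search point x
        m = W.shape[0]
        X = np.zeros((m, d))               # search points x_i
        X_sum = np.zeros((m, d))
        n_batches = T // b
        for s in range(1, n_batches + 1):
            # local mini-batch gradients theta_i^0(s)
            theta = np.array([np.mean([oracle(i, X[i], t) for t in range(b)], axis=0)
                              for i in range(m)])
            for r in range(int(rho * b)):  # consensus rounds
                theta = W @ theta          # theta_i^r = sum_j w_ij theta_j^{r-1}
            for i in range(m):             # projected MD step per node
                X[i] -= (gamma / np.sqrt(s)) * theta[i]
                n = np.linalg.norm(X[i])
                if n > R:
                    X[i] *= R / n
            X_sum += X
        return X_sum / n_batches           # averaged iterates x_i^av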
SLIDES 28-30

D-SAMD Convergence Analysis

  • Recall that Mirror Descent has convergence rate:

$$\mathbb{E}[\psi(x_i^{\mathrm{av}}(T)) - \psi(x^*)] \le O(1)\left[\frac{L}{T} + \frac{\sigma}{\sqrt{T}}\right]$$

  • With mini-batch size b and equivalent gradient noise σ²eq, D-SAMD has

$$\mathbb{E}[\psi(x_i^{\mathrm{av}}(T)) - \psi(x^*)] \le O(1)\left[\frac{Lb}{T} + \sqrt{\frac{\sigma_{\mathrm{eq}}^2\,b}{T}}\right], \qquad \sigma_{\mathrm{eq}}^2 = O(1)\left[\lambda_2^{2\rho b}(W)\,\|x_i(s) - x_j(s)\|^2 + \lambda_2^{2\rho b}(W)\,\frac{\sigma^2}{b} + \frac{\sigma^2}{mb}\right]$$

  • Need to choose b big enough to ensure:
  • 1. Nodes’ iterates don’t diverge
  • 2. Equivalent noise variance is on par with residual noise variance

SLIDES 31-32

D-SAMD Convergence Analysis

Lemma: D-SAMD iterates are guaranteed to converge provided

$$b \ge O(1)\left(1 + \frac{\log(mT)}{\rho\,\log(1/\lambda_2(W))}\right)$$

Furthermore, this condition is sufficient to ensure that

$$\sigma_{\mathrm{eq}}^2 \le O(1)\sqrt{\frac{\sigma^2}{mT}}$$

  • Results in the convergence rate

$$\mathbb{E}[\psi(x_i(T)) - \psi(x^*)] \le O(1)\left[\frac{L\,\log(mT)}{\rho\,\log(1/\lambda_2(W))\,T} + \sqrt{\frac{\sigma^2}{mT}}\right]$$

  • When is this order optimum?
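
As a quick numeric illustration of the lemma’s mini-batch condition (all values chosen arbitrarily, not from the talk):

    import numpy as np

    m, T, rho, lam2 = 20, 10_000, 1.0, 0.9   # illustrative values
    b_min = 1 + np.log(m * T) / (rho * np.log(1.0 / lam2))
    print(b_min)                  # about 117: the smallest admissible mini-batch

    # With b at this scale, lambda_2^(2 rho b) is vanishingly small, so the
    # consensus-error terms in sigma_eq^2 are negligible next to sigma^2/(m b)
    print(lam2 ** (2 * rho * b_min))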

SLIDES 33-34

D-SAMD Convergence Analysis

Theorem: If

$$\rho \ge O(1)\,\frac{m^{1/2}\,\log(mT)}{\sigma\,T^{1/2}\,\log(1/\lambda_2(W))}$$

then the conditions of the previous lemma ensure that

$$\mathbb{E}[\psi(x_i(T)) - \psi(x^*)] \le O(1)\sqrt{\frac{\sigma^2}{mT}}$$

  • Larger mini-batches decrease gradient noise, but also decrease the number of MD steps taken
  • Eventually, the deterministic term dominates the convergence rate
  • Natural idea: use accelerated mirror descent

SLIDE 35

Accelerated Distributed SA Mirror Descent

Algorithm: Accelerated Distributed Stochastic Approximation Mirror Descent (AD-SAMD) [simplified]

    for s = 1 to T/b:                        [iterate over mini-batches]
        compute mini-batch gradients
        for r = 1 to ρb:
            perform consensus iterations on gradients
        end for
        perform accelerated MD updates
    end for

  • Recall: accelerated MD takes similar projected gradient descent steps, uses a more complicated averaging scheme
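
For reference, a sketch of one common form such an accelerated update can take (an AC-SA-style scheme in the Euclidean case, in the spirit of Lan 2012); the step-size choices are illustrative, not the talk’s:

    import numpy as np

    def accelerated_smd(oracle, d, T, gamma0=0.01, R=10.0):
        # oracle(x, t) returns a noisy gradient at x
        x = np.zeros(d)      # search point
        x_ag = np.zeros(d)   # aggregate point (the returned solution)
        for t in range(1, T + 1):
            beta = 2.0 / (t + 1)
            x_md = beta * x + (1 - beta) * x_ag  # gradient taken at a midpoint
            g = oracle(x_md, t)
            x = x - gamma0 * t * g               # gamma_t growing ~ t (illustrative)
            n = np.linalg.norm(x)
            if n > R:
                x *= R / n                       # projection onto the ball X
            x_ag = beta * x + (1 - beta) * x_ag  # the more complex averaging
        return x_ag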

SLIDES 36-37

AD-SAMD Convergence Analysis

  • With mini-batch size b and equivalent gradient noise σ²eq, AD-SAMD has
  • The equivalent gradient noise has approximately the same variance:

$$\mathbb{E}[\psi(x_i(T)) - \psi(x^*)] \le O(1)\left[\frac{Lb^2}{T^2} + \sqrt{\frac{\sigma_{\mathrm{eq}}^2\,b}{T}}\right], \qquad \sigma_{\mathrm{eq}}^2 = O(1)\left[\lambda_2^{2\rho b}(W)\,\|x_i(s) - x_j(s)\|^2 + \lambda_2^{2\rho b}(W)\,\frac{\sigma^2}{b} + \frac{\sigma^2}{mb}\right]$$


  • Lemma: AD-SAMD iterates are guaranteed to converge, and σ²eq has optimum scaling, provided

$$b \ge O(1)\left(1 + \frac{\log(mT)}{\rho\,\log(1/\lambda_2(W))}\right)$$

SLIDES 38-40

AD-SAMD Convergence Analysis

  • Results in a convergence rate

$$\mathbb{E}[\psi(x_i(T)) - \psi(x^*)] \le O(1)\left[\frac{L\,\log^2(mT)}{\rho^2\,\log^2(1/\lambda_2(W))\,T^2} + \sqrt{\frac{\sigma^2}{mT}}\right]$$

Theorem: If

$$\rho \ge O(1)\,\frac{m^{1/4}\,\log(mT)}{\sigma\,T^{3/4}\,\log(1/\lambda_2(W))}$$

then the conditions of the previous lemma ensure that

$$\mathbb{E}[\psi(x_i(T)) - \psi(x^*)] \le O(1)\sqrt{\frac{\sigma^2}{mT}}$$

  • AD-SAMD permits more aggressive mini-batching
  • Improvement of 1/4 in the exponents of m and T

SLIDE 41

Numerical example: Logistic Regression

  • Logistic regression: learn a binary classifier from streams of input data
  • Measurements are Gaussian-distributed, unknown mean, d=50
  • Network drawn from the Erdős-Rényi model with m = 20
  • Log-loss cost function

[Figure: simulation results, panels (a) ρ = 1 and (b) ρ = 10]
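
A sketch of the experiment setup consistent with the slide’s parameters (m = 20, d = 50); the edge probability and the exact data model are assumptions:

    import numpy as np

    rng = np.random.default_rng(2)
    m, d, p_edge = 20, 50, 0.3            # p_edge is an assumed value

    # Erdos-Renyi graph: each edge present independently with prob p_edge
    U = np.triu((rng.random((m, m)) < p_edge).astype(int), 1)
    A = U + U.T                           # symmetric adjacency, no self-loops

    # Gaussian measurements with unknown class mean mu (illustrative)
    mu = rng.normal(size=d)
    def sample():
        y = rng.choice([-1.0, 1.0])       # binary label
        return y * mu + rng.normal(size=d), y

    # Log-loss objective: psi(x) = E[log(1 + exp(-y * a @ x))]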

SLIDE 42

Composite Optimization

  • What if objective is not smooth?
  • Composite convex optimization:

$$\psi(x) = f(x) + h(x)$$

  • f(x) has Lipschitz gradients, but h(x) is only Lipschitz:

$$\|\nabla f(x) - \nabla f(y)\| \le L\|x - y\|, \qquad \|h(x) - h(y)\| \le M\|x - y\|$$

  • Accelerated MD via subgradients gives the optimum convergence:

$$\mathbb{E}[\psi(x_i(T)) - \psi(x^*)] \le O(1)\left[\frac{L}{T^2} + \frac{M + \sigma}{\sqrt{T}}\right]$$
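
A standard concrete instance (added for illustration; not from the slides) is ℓ1-regularized learning, where the regularizer supplies the nonsmooth, M-Lipschitz part:

$$\psi(x) = \underbrace{\mathbb{E}_{\xi}\big[\log(1 + e^{-y\,a^{\top}x})\big]}_{f(x):\ \text{Lipschitz gradients}} \;+\; \underbrace{M\|x\|_1}_{h(x):\ M\text{-Lipschitz, nonsmooth}}$$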

SLIDE 43

Composite Optimization

  • Small perturbations lead to significant deviations in subgradients
  • Two new challenges:
  • 1. Mini-batching doesn’t help — gradient noise variance doesn’t matter!
  • 2. Imperfect average consensus results in a “noise floor”
  • Results in sub-optimum convergence rates:

$$\mathbb{E}[\psi(x_i(T)) - \psi(x^*)] \le O(1)\left[\frac{Lb^2}{T^2} + \frac{M + \sigma/\sqrt{mb}}{\sqrt{T/b}} + M\right]$$

SLIDE 44

Conclusions

Summary:

  • Investigated stochastic learning from the perspective of rate-limited wireless links
  • Developed two schemes, D-SAMD and AD-SAMD, that balance in-network gradient averaging and local mini-batching
  • Derived conditions for order-optimum convergence

Future work:

  • Optimum distributed SO for composite objectives
  • Can we improve the convergence rates of AD-SAMD?
  • Other communications issues: delay, quantization, etc.

Preprint available: https://arxiv.org/abs/1704.07888