
Decentralized Stochastic Approximation, Optimization, and Multi-Agent Reinforcement Learning. Justin Romberg, Georgia Tech ECE. CAMDA/TAMIDS Seminar, Texas A&M, College Station, Texas (streaming live from Atlanta, Georgia), March 16, 2020.


  1. Decentralized Stochastic Approximation, Optimization, and Multi-Agent Reinforcement Learning. Justin Romberg, Georgia Tech ECE. CAMDA/TAMIDS Seminar, Texas A&M, College Station, Texas (streaming live from Atlanta, Georgia), March 16, 2020.

  2. Collaborators: Thinh Doan (Virginia Tech, ECE), Siva Theja Maguluri (Georgia Tech, ISyE), Sihan Zeng (Georgia Tech, ECE).

  3. Reinforcement Learning

  4. Ingredients for Distributed RL. Distributed RL is a combination of: stochastic approximation, Markov decision processes, function representation, and network consensus.

  5. Ingredients for Distributed RL. Distributed RL is a combination of: stochastic approximation, Markov decision processes, function representation, and network consensus (complicated probabilistic analysis).

  6. Ingredients for Distributed RL. Distributed RL is a combination of: stochastic approximation, Markov decision processes, function representation, and network consensus (complicated probabilistic analysis).

  7. Fixed point iterations. Classical result (Banach fixed point theorem): when $H(\cdot): \mathbb{R}^N \to \mathbb{R}^N$ is a contraction, $\|H(u) - H(v)\| \le \delta \|u - v\|$ with $\delta < 1$, then there is a unique fixed point $x^\star$ such that $x^\star = H(x^\star)$, and the iteration $x_{k+1} = H(x_k)$ finds it: $\lim_{k\to\infty} x_k = x^\star$.
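
The contraction property is what everything later leans on, so a minimal numerical sketch may help (not from the slides; the map $H(x) = 0.5\cos x$ and all constants below are invented for illustration):

```python
import numpy as np

# Minimal sketch of a fixed point iteration for a contraction.
# H(x) = 0.5*cos(x) has |H'(x)| <= 0.5, so delta = 0.5 < 1.
def H(x):
    return 0.5 * np.cos(x)

x = 3.0                       # arbitrary starting point x_0
for k in range(30):
    x = H(x)                  # x_{k+1} = H(x_k)

print(x, abs(H(x) - x))       # H(x) ≈ x, so x is (numerically) the fixed point x*
```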

  8. Easy proof. Choose any point $x_0$ and take $x_{k+1} = H(x_k)$, so $x_{k+1} - x^\star = H(x_k) - x^\star = H(x_k) - H(x^\star)$ and $\|x_{k+1} - x^\star\| = \|H(x_k) - H(x^\star)\| \le \delta \|x_k - x^\star\| \le \delta^{k+1} \|x_0 - x^\star\|$, so the convergence is geometric.

  9. Relationship to optimization. Choose any point $x_0$ and take $x_{k+1} = H(x_k)$; then $\|x_{k+1} - x^\star\| = \|H(x_k) - H(x^\star)\| \le \delta^{k+1} \|x_0 - x^\star\|$. Gradient descent takes $H(x) = x - \alpha \nabla f(x)$ for some differentiable $f$.
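
To make the optimization connection concrete, here is a hedged sketch (the quadratic $f$, the matrix $A$, and the step size $\alpha$ are assumptions, not from the slides) of gradient descent viewed as the fixed point iteration $x_{k+1} = H(x_k)$ with $H(x) = x - \alpha \nabla f(x)$:

```python
import numpy as np

# Gradient descent on f(x) = 0.5 x^T A x - b^T x, viewed as a fixed point
# iteration; the fixed point of H solves grad f(x) = A x - b = 0.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])               # symmetric positive definite
b = np.array([1.0, -1.0])
alpha = 0.2                              # small enough that H is a contraction

def H(x):
    return x - alpha * (A @ x - b)       # H(x) = x - alpha * grad f(x)

x = np.zeros(2)
for k in range(200):
    x = H(x)

print(np.allclose(A @ x, b))             # True: the fixed point minimizes f
```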

  10. Fixed point iterations: variation. Take $x_{k+1} = x_k + \alpha (H(x_k) - x_k)$, $0 < \alpha \le 1$ (more conservative: a convex combination of the new iterate and the old). Then again $x_{k+1} = (1-\alpha) x_k + \alpha H(x_k)$ and $\|x_{k+1} - x^\star\| \le (1-\alpha)\|x_k - x^\star\| + \alpha \|H(x_k) - H(x^\star)\| \le (1 - \alpha(1-\delta)) \|x_k - x^\star\|$. It still converges, albeit a little more slowly for $\alpha < 1$.

  11. What if there is noise? If our observations of $H(\cdot)$ are noisy, $x_{k+1} = x_k + \alpha (H(x_k) - x_k + \eta_k)$ with $\mathbb{E}[\eta_k] = 0$, then we don't get convergence for a fixed $\alpha$, but we do converge to a “ball” around $x^\star$ at a geometric rate.

  12. Stochastic approximation. If our observations of $H(\cdot)$ are noisy, $x_{k+1} = x_k + \alpha_k (H(x_k) - x_k + \eta_k)$ with $\mathbb{E}[\eta_k] = 0$, then we need to take $\alpha_k \to 0$ as we approach the solution. If we take $\{\alpha_k\}$ such that $\sum_{k=0}^\infty \alpha_k^2 < \infty$ and $\sum_{k=0}^\infty \alpha_k = \infty$, then we do get (much slower) convergence. Example: $\alpha_k = C/(k+1)$.
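
A small sketch of this diminishing-step-size recipe (the map $H$, the noise level, and the constant $C$ are assumed for illustration, not taken from the talk):

```python
import numpy as np

# Stochastic approximation with alpha_k = C/(k+1), which satisfies
# sum alpha_k = infinity and sum alpha_k^2 < infinity.
# H(x) = 0.5*x + 1 is a contraction with fixed point x* = 2,
# but we only observe H(x_k) + eta_k with zero-mean noise eta_k.
rng = np.random.default_rng(0)
C = 1.0
x = 0.0
for k in range(100_000):
    alpha = C / (k + 1)
    noisy_H = 0.5 * x + 1.0 + rng.normal()   # H(x_k) + eta_k
    x = x + alpha * (noisy_H - x)

print(x)   # close to x* = 2; convergence is much slower than the noiseless case
```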

  13. Ingredients for Distributed RL. Distributed RL is a combination of: stochastic approximation, Markov decision processes, function representation, and network consensus (complicated probabilistic analysis).

  14. Markov decision process. At time $t$: (1) an agent finds itself in a state $s_t$; (2) it takes action $a_t = \mu(s_t)$; (3) it moves to state $s_{t+1}$ according to $P(s_{t+1} \mid s_t, a_t)$ ...; (4) ... and receives reward $R(s_t, a_t, s_{t+1})$.

  15. Markov decision process. At time $t$: (1) an agent finds itself in a state $s_t$; (2) it takes action $a_t = \mu(s_t)$; (3) it moves to state $s_{t+1}$ according to $P(s_{t+1} \mid s_t, a_t)$ ...; (4) ... and receives reward $R(s_t, a_t, s_{t+1})$. Long-term reward of policy $\mu$: $V^\mu(s) = \mathbb{E}\left[\sum_{t=0}^\infty \gamma^t R(s_t, \mu(s_t), s_{t+1}) \mid s_0 = s\right]$.

  16. Markov decision process. At time $t$: (1) an agent finds itself in a state $s_t$; (2) it takes action $a_t = \mu(s_t)$; (3) it moves to state $s_{t+1}$ according to $P(s_{t+1} \mid s_t, a_t)$ ...; (4) ... and receives reward $R(s_t, a_t, s_{t+1})$. Bellman equation: $V^\mu$ obeys $V^\mu(s) = \sum_{z \in \mathcal{S}} P(z \mid s, \mu(s)) \left[ R(s, \mu(s), z) + \gamma V^\mu(z) \right]$, i.e. $V^\mu = b_\mu + \gamma P_\mu V^\mu$ in vector form. This is a fixed point equation for $V^\mu$.
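
When the model $(P_\mu, b_\mu)$ is known, this fixed point equation can be solved directly by iterating $V \leftarrow b_\mu + \gamma P_\mu V$. A sketch on an assumed three-state toy chain (the numbers are invented for illustration):

```python
import numpy as np

# Model-based policy evaluation: iterate the Bellman operator
# H(V) = b_mu + gamma * P_mu V, a gamma-contraction in the sup norm.
gamma = 0.9
P_mu = np.array([[0.8, 0.2, 0.0],        # P(z | s, mu(s)) for a 3-state chain
                 [0.1, 0.6, 0.3],
                 [0.0, 0.5, 0.5]])
b_mu = np.array([1.0, 0.0, 2.0])         # expected one-step reward under mu

V = np.zeros(3)
for _ in range(500):
    V = b_mu + gamma * P_mu @ V          # V_{k+1} = H(V_k)

print(V)   # V^mu; it also solves the linear system (I - gamma * P_mu) V = b_mu
```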

  17. Markov decision process. At time $t$: (1) an agent finds itself in a state $s_t$; (2) it takes action $a_t = \mu(s_t)$; (3) it moves to state $s_{t+1}$ according to $P(s_{t+1} \mid s_t, a_t)$ ...; (4) ... and receives reward $R(s_t, a_t, s_{t+1})$. State-action value function ($Q$ function): $Q^\mu(s,a) = \mathbb{E}\left[\sum_{t=0}^\infty \gamma^t R(s_t, \mu(s_t), s_{t+1}) \mid s_0 = s, a_0 = a\right]$.

  18. Markov decision process. At time $t$: (1) an agent finds itself in a state $s_t$; (2) it takes action $a_t = \mu(s_t)$; (3) it moves to state $s_{t+1}$ according to $P(s_{t+1} \mid s_t, a_t)$ ...; (4) ... and receives reward $R(s_t, a_t, s_{t+1})$. The state-action value for the optimal policy obeys $Q^\star(s,a) = \mathbb{E}\left[ R(s,a,s') + \gamma \max_{a'} Q^\star(s',a') \mid s_0 = s, a_0 = a \right]$, and we take $\mu^\star(s) = \arg\max_a Q^\star(s,a)$ ... this is another fixed point equation.
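
Since this is again a fixed point equation, it can also be solved by fixed point iteration when the model is available (Q-value iteration). A sketch on an assumed two-action, three-state toy MDP (transition probabilities and rewards are made up):

```python
import numpy as np

# Q-value iteration: repeatedly apply the Bellman optimality operator
# (TQ)(s,a) = sum_z P(z|s,a) [ R(s,a,z) + gamma * max_a' Q(z,a') ].
gamma = 0.9
# P[a, s, z] = P(z | s, a); R[a, s, z] = R(s, a, z)
P = np.array([[[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.8]],
              [[0.2, 0.8, 0.0], [0.0, 0.3, 0.7], [0.5, 0.0, 0.5]]])
R = np.ones_like(P)                      # action 0: reward 1 per transition
R[1] = 0.5                               # action 1: reward 0.5 per transition

Q = np.zeros((2, 3))                     # Q[a, s]
for _ in range(500):
    V = Q.max(axis=0)                    # max_a' Q(z, a') for each next state z
    Q = (P * (R + gamma * V)).sum(axis=2)

mu_star = Q.argmax(axis=0)               # greedy policy mu*(s) = argmax_a Q*(s, a)
print(Q, mu_star)
```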

  19. Stochastic approximation for policy evaluation. Fixed point iteration for finding $V^\mu(s)$: $V_{t+1}(s) = V_t(s) + \alpha \left( \sum_z P(z \mid s) \left[ R(s,z) + \gamma V_t(z) \right] - V_t(s) \right)$, where the term in parentheses is $H(V_t) - V_t$.

  20. Stochastic approximation for policy evaluation. Fixed point iteration for finding $V^\mu(s)$: $V_{t+1}(s) = V_t(s) + \alpha \left( \sum_z P(z \mid s) \left[ R(s,z) + \gamma V_t(z) \right] - V_t(s) \right)$, where the term in parentheses is $H(V_t) - V_t$. In practice, we don't have the model $P(z \mid s)$, only observed data $\{(s_t, s_{t+1})\}$.

  21. Stochastic approximation for policy evaluation. Fixed point iteration for finding $V^\mu(s)$: $V_{t+1}(s) = V_t(s) + \alpha \left( \sum_z P(z \mid s) \left[ R(s,z) + \gamma V_t(z) \right] - V_t(s) \right)$, where the term in parentheses is $H(V_t) - V_t$. Stochastic approximation iteration: $V_{t+1}(s_t) = V_t(s_t) + \alpha_t \left( R(s_t, s_{t+1}) + \gamma V_t(s_{t+1}) - V_t(s_t) \right)$. The “noise” is that $s_{t+1}$ is sampled, rather than averaged over.

  22. Stochastic approximation for policy evaluation. Fixed point iteration for finding $V^\mu(s)$: $V_{t+1}(s) = V_t(s) + \alpha \left( \sum_z P(z \mid s) \left[ R(s,z) + \gamma V_t(z) \right] - V_t(s) \right)$, where the term in parentheses is $H(V_t) - V_t$. Stochastic approximation iteration: $V_{t+1}(s_t) = V_t(s_t) + \alpha_t \left( R(s_t, s_{t+1}) + \gamma V_t(s_{t+1}) - V_t(s_t) \right)$, where the increment is $H(V_t) - V_t + \eta_t$. The “noise” is that $s_{t+1}$ is sampled, rather than averaged over.

  23. Stochastic approximation for policy evaluation. Fixed point iteration for finding $V^\mu(s)$: $V_{t+1}(s) = V_t(s) + \alpha \left( \sum_z P(z \mid s) \left[ R(s,z) + \gamma V_t(z) \right] - V_t(s) \right)$, where the term in parentheses is $H(V_t) - V_t$. Stochastic approximation iteration: $V_{t+1}(s_t) = V_t(s_t) + \alpha_t \left( R(s_t, s_{t+1}) + \gamma V_t(s_{t+1}) - V_t(s_t) \right)$, where the increment is $H(V_t) - V_t + \eta_t$. The “noise” is that $s_{t+1}$ is sampled, rather than averaged over. This is different from stochastic gradient descent, since $H(\cdot)$ is in general not a gradient map.
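
A sketch of this stochastic approximation (tabular TD(0)) run on sampled transitions from an assumed three-state chain; the chain, rewards, and step-size schedule are invented for illustration, and the reward is taken to depend only on $s_t$ to keep the code short:

```python
import numpy as np

# Tabular TD(0): the stochastic approximation above, driven by sampled
# transitions (s_t, s_{t+1}) instead of the model P(z | s).
rng = np.random.default_rng(0)
gamma = 0.9
P = np.array([[0.8, 0.2, 0.0],           # P(z | s) under the fixed policy mu
              [0.1, 0.6, 0.3],
              [0.0, 0.5, 0.5]])
R = np.array([1.0, 0.0, 2.0])            # reward as a function of s_t only

V = np.zeros(3)
s = 0
for t in range(200_000):
    s_next = rng.choice(3, p=P[s])       # sample s_{t+1} ~ P(. | s_t)
    alpha = 10.0 / (t + 100)             # diminishing step sizes
    V[s] += alpha * (R[s] + gamma * V[s_next] - V[s])   # TD(0) update
    s = s_next

print(V)   # approaches V^mu for this chain
```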

  24. Ingredients for Distributed RL. Distributed RL is a combination of: stochastic approximation, Markov decision processes, function representation, and network consensus (complicated probabilistic analysis).

  25. Function approximation State space can be large (or even infinite) ... ... we need a natural way to parameterize/simplify

  26. Linear function approximation. Simple (but powerful) model: linear representation $V(s; \theta) = \sum_{k=1}^K \theta_k \phi_k(s) = \phi(s)^T \theta$, where $\phi(s) = [\phi_1(s), \ldots, \phi_K(s)]^T$.

  27. Linear function approximation. Simple (but powerful) model: linear representation $V(s; \theta) = \sum_{k=1}^K \theta_k \phi_k(s) = \phi(s)^T \theta$, where $\phi(s) = [\phi_1(s), \ldots, \phi_K(s)]^T$. [Two illustrative plots omitted.]

  28. Policy evaluation with function approximation. Bellman equation: $V(s) = \sum_{z \in \mathcal{S}} P(z \mid s) \left[ R(s, \mu(s), z) + \gamma V(z) \right]$. Linear approximation: $V(s; \theta) = \sum_{k=1}^K \theta_k \phi_k(s) = \phi(s)^T \theta$. These can conflict ...

  29. Policy evaluation with function approximation. Bellman equation: $V(s) = \sum_{z \in \mathcal{S}} P(z \mid s) \left[ R(s, \mu(s), z) + \gamma V(z) \right]$. Linear approximation: $V(s; \theta) = \sum_{k=1}^K \theta_k \phi_k(s) = \phi(s)^T \theta$. These can conflict ... but the following iterations $\theta_{t+1} = \theta_t + \alpha_t \left( R(s_t, s_{t+1}) + \gamma V(s_{t+1}; \theta_t) - V(s_t; \theta_t) \right) \nabla_\theta V(s_t; \theta_t) = \theta_t + \alpha_t \left( R(s_t, s_{t+1}) + \gamma \phi(s_{t+1})^T \theta_t - \phi(s_t)^T \theta_t \right) \phi(s_t)$ converge to a “near optimal” $\theta^\star$ (Tsitsiklis and Van Roy, '97).
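
A sketch of this iteration with an assumed three-state chain and $K = 2$ hand-picked features (all the numbers here are illustrations, not from the talk):

```python
import numpy as np

# TD(0) with linear function approximation V(s; theta) = phi(s)^T theta.
rng = np.random.default_rng(0)
gamma = 0.9
P = np.array([[0.8, 0.2, 0.0],           # P(z | s) under the fixed policy
              [0.1, 0.6, 0.3],
              [0.0, 0.5, 0.5]])
R = np.array([1.0, 0.0, 2.0])            # reward as a function of s_t only
Phi = np.array([[1.0, 0.0],              # rows are the feature vectors phi(s)
                [1.0, 1.0],
                [0.0, 1.0]])

theta = np.zeros(2)
s = 0
for t in range(200_000):
    s_next = rng.choice(3, p=P[s])
    alpha = 10.0 / (t + 100)
    td_error = R[s] + gamma * Phi[s_next] @ theta - Phi[s] @ theta
    theta += alpha * td_error * Phi[s]   # theta_{t+1} = theta_t + alpha_t * delta_t * phi(s_t)
    s = s_next

print(theta, Phi @ theta)                # fitted parameters and the fitted values phi(s)^T theta
```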

  30. Ingredients for Distributed RL. Distributed RL is a combination of: stochastic approximation, Markov decision processes, function representation, and network consensus (complicated probabilistic analysis).

  31. Network consensus. Each node in a network has a number $x(i)$. We want each node to agree on the average $\bar{x} = \frac{1}{N} \sum_{i=1}^N x(i) = \frac{1}{N} \mathbf{1}^T x$. Node $i$ communicates with its neighbors $\mathcal{N}_i$. Iterate: take $v_0 = x$, then $v_{k+1}(i) = \sum_{j \in \mathcal{N}_i} W_{ij} v_k(j)$, i.e. $v_{k+1} = W v_k$ with $W$ doubly stochastic.
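
A sketch of the consensus iteration on an assumed four-node ring (the weight matrix W below is one symmetric, doubly stochastic choice; it is not from the slides):

```python
import numpy as np

# Average consensus: v_{k+1} = W v_k with W doubly stochastic.
# Each node mixes its own value with those of its two ring neighbors.
W = np.array([[0.50, 0.25, 0.00, 0.25],
              [0.25, 0.50, 0.25, 0.00],
              [0.00, 0.25, 0.50, 0.25],
              [0.25, 0.00, 0.25, 0.50]])
x = np.array([1.0, 5.0, -2.0, 4.0])      # x(i) held at node i

v = x.copy()                             # v_0 = x
for k in range(100):
    v = W @ v                            # v_{k+1}(i) = sum_{j in N_i} W_ij v_k(j)

print(v, x.mean())                       # every entry of v approaches the average 2.0
```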

  32. Network consensus convergence. Nodes reach “consensus” quickly: $v_{k+1} = W v_k$, so $v_{k+1} - \bar{x}\mathbf{1} = W v_k - \bar{x}\mathbf{1} = W (v_k - \bar{x}\mathbf{1})$ (using $W\mathbf{1} = \mathbf{1}$), and $\|v_{k+1} - \bar{x}\mathbf{1}\| = \|W (v_k - \bar{x}\mathbf{1})\|$.
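
A standard way to finish this bound (an assumed completion, not spelled out on the slide): because $W$ is doubly stochastic, the average is preserved and $v_k - \bar{x}\mathbf{1}$ stays orthogonal to $\mathbf{1}$; if the weights are chosen so that the second-largest singular value $\sigma_2(W)$ is strictly less than $1$ (true, for example, for a connected network with $W$ symmetric and $W_{ii} > 0$), then $W$ contracts that subspace:

$$\|v_{k+1} - \bar{x}\mathbf{1}\| = \|W(v_k - \bar{x}\mathbf{1})\| \le \sigma_2(W)\,\|v_k - \bar{x}\mathbf{1}\| \le \sigma_2(W)^{k+1}\,\|v_0 - \bar{x}\mathbf{1}\|, \qquad \sigma_2(W) < 1,$$

so consensus is reached at a geometric rate, mirroring the fixed point iterations earlier in the talk.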
