

SLIDE 1

Decentralized Stochastic Approximation, Optimization, and Multi-Agent Reinforcement Learning

Justin Romberg, Georgia Tech ECE. CAMDA/TAMIDS Seminar, Texas A&M, College Station, Texas. Streaming live from Atlanta, Georgia, March 16, 2020 (slides dated October 30, 2020).

SLIDE 2

Collaborators

Thinh Doan (Virginia Tech, ECE), Siva Theja Maguluri (Georgia Tech, ISyE), Sihan Zeng (Georgia Tech, ECE)

SLIDE 3

Reinforcement Learning

SLIDE 4

Ingredients for Distributed RL

Distributed RL is a combination of:
  • stochastic approximation
  • Markov decision processes
  • function representation
  • network consensus

SLIDE 5

Ingredients for Distributed RL

Distributed RL is a combination of:
  • stochastic approximation
  • Markov decision processes
  • function representation
  • network consensus
(complicated probabilistic analysis)

SLIDE 6

Ingredients for Distributed RL

Distributed RL is a combination of:
  • stochastic approximation
  • Markov decision processes
  • function representation
  • network consensus
(complicated probabilistic analysis)

SLIDE 7

Fixed point iterations

Classical result (Banach fixed point theorem): when $H(\cdot) : \mathbb{R}^N \to \mathbb{R}^N$ is a contraction,
$$\|H(u) - H(v)\| \le \delta \|u - v\|, \qquad \delta < 1,$$
then there is a unique fixed point $x^\star$ such that $x^\star = H(x^\star)$, and the iteration $x_{k+1} = H(x_k)$ finds it:
$$\lim_{k \to \infty} x_k = x^\star.$$

SLIDE 8

Easy proof

Choose any point $x_0$, then take $x_{k+1} = H(x_k)$, so
$$x_{k+1} - x^\star = H(x_k) - x^\star = H(x_k) - H(x^\star),$$
and
$$\|x_{k+1} - x^\star\| = \|H(x_k) - H(x^\star)\| \le \delta \|x_k - x^\star\| \le \delta^{k+1} \|x_0 - x^\star\|,$$
so the convergence is geometric.
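To make the geometric rate concrete, here is a minimal NumPy sketch with an arbitrary affine contraction $H(x) = Ax + b$ standing in for $H$ (the matrix, dimension, and iteration count are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 5
A = 0.5 * np.eye(N) + 0.05 * rng.standard_normal((N, N))   # ||A|| < 1, so H is a contraction
b = rng.standard_normal(N)

H = lambda x: A @ x + b
x_star = np.linalg.solve(np.eye(N) - A, b)    # exact fixed point, for reference

x = np.zeros(N)
for k in range(30):
    x = H(x)                                  # x_{k+1} = H(x_k)
    print(k, np.linalg.norm(x - x_star))      # error shrinks roughly like delta^k
```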

SLIDE 9

Relationship to optimization

Choose any point $x_0$, then take $x_{k+1} = H(x_k)$; then
$$\|x_{k+1} - x^\star\| = \|H(x_k) - H(x^\star)\| \le \delta^{k+1} \|x_0 - x^\star\|.$$
Gradient descent takes $H(x) = x - \alpha \nabla f(x)$ for some differentiable $f$.

SLIDE 10

Fixed point iterations: Variation

Take $x_{k+1} = x_k + \alpha (H(x_k) - x_k)$, with $0 < \alpha \le 1$. (More conservative: a convex combination of the new iterate and the old.) Then again
$$x_{k+1} = (1 - \alpha) x_k + \alpha H(x_k)$$
and
$$\|x_{k+1} - x^\star\| \le (1 - \alpha)\|x_k - x^\star\| + \alpha \|H(x_k) - H(x^\star)\| \le (1 - \alpha + \delta\alpha)\|x_k - x^\star\|.$$
We still converge, albeit a little more slowly for $\alpha < 1$.

SLIDE 11

What if there is noise?

If our observations of $H(\cdot)$ are noisy,
$$x_{k+1} = x_k + \alpha \left( H(x_k) - x_k + \eta_k \right), \qquad \mathbb{E}[\eta_k] = 0,$$
then we don't get convergence for fixed $\alpha$, but we do converge to a "ball" around $x^\star$ at a geometric rate.

SLIDE 12

Stochastic approximation

If our observations of $H(\cdot)$ are noisy,
$$x_{k+1} = x_k + \alpha_k \left( H(x_k) - x_k + \eta_k \right), \qquad \mathbb{E}[\eta_k] = 0,$$
then we need to take $\alpha_k \to 0$ as we approach the solution. If we take $\{\alpha_k\}$ such that
$$\sum_{k=0}^{\infty} \alpha_k^2 < \infty, \qquad \sum_{k=0}^{\infty} \alpha_k = \infty,$$
then we do get (much slower) convergence. Example: $\alpha_k = C/(k+1)$.
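A minimal sketch of the two regimes, reusing the affine contraction from the earlier example and adding zero-mean noise to each evaluation of $H$ (the noise level and step-size schedules are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 5
A = 0.5 * np.eye(N)
b = rng.standard_normal(N)
H = lambda x: A @ x + b
x_star = np.linalg.solve(np.eye(N) - A, b)

def run(step, iters=50_000, sigma=1.0):
    x = np.zeros(N)
    for k in range(iters):
        eta = sigma * rng.standard_normal(N)        # zero-mean noise in the observation of H
        x = x + step(k) * (H(x) - x + eta)          # stochastic approximation update
    return np.linalg.norm(x - x_star)

print("fixed alpha = 0.1   :", run(lambda k: 0.1))            # stalls in an O(alpha) ball
print("alpha_k = 1/(k + 1) :", run(lambda k: 1.0 / (k + 1)))  # converges, but slowly
```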

SLIDE 13

Ingredients for Distributed RL

Distributed RL is a combination of:
  • stochastic approximation
  • Markov decision processes
  • function representation
  • network consensus
(complicated probabilistic analysis)

SLIDE 14

Markov decision process

At time $t$,
1. An agent finds itself in a state $s_t$.
2. It takes action $a_t = \mu(s_t)$.
3. It moves to state $s_{t+1}$ according to $P(s_{t+1} \mid s_t, a_t)$ ...
4. ... and receives reward $R(s_t, a_t, s_{t+1})$.

SLIDE 15

Markov decision process

At time $t$,
1. An agent finds itself in a state $s_t$.
2. It takes action $a_t = \mu(s_t)$.
3. It moves to state $s_{t+1}$ according to $P(s_{t+1} \mid s_t, a_t)$ ...
4. ... and receives reward $R(s_t, a_t, s_{t+1})$.

Long-term reward of policy $\mu$:
$$V_\mu(s) = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^t R(s_t, \mu(s_t), s_{t+1}) \;\middle|\; s_0 = s \right]$$

SLIDE 16

Markov decision process

At time $t$,
1. An agent finds itself in a state $s_t$.
2. It takes action $a_t = \mu(s_t)$.
3. It moves to state $s_{t+1}$ according to $P(s_{t+1} \mid s_t, a_t)$ ...
4. ... and receives reward $R(s_t, a_t, s_{t+1})$.

Bellman equation: $V_\mu$ obeys
$$V_\mu(s) = \sum_{z \in S} P(z \mid s, \mu(s)) \left[ R(s, \mu(s), z) + \gamma V_\mu(z) \right] \;=\; b_\mu(s) + \gamma (P_\mu V_\mu)(s).$$
This is a fixed point equation for $V_\mu$.
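Viewed this way, policy evaluation with a known model is just the fixed-point iteration from before. A small sketch on a made-up random chain (the transition matrix and rewards below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
S, gamma = 6, 0.9
P = rng.random((S, S)); P /= P.sum(axis=1, keepdims=True)   # P(z|s, mu(s)) under a fixed policy
R = rng.random((S, S))                                      # R(s, mu(s), z)
b = (P * R).sum(axis=1)                                     # b_mu(s) = expected one-step reward

V = np.zeros(S)
for _ in range(200):
    V = b + gamma * P @ V                                   # V <- H(V), a gamma-contraction
print(V)
print(np.linalg.solve(np.eye(S) - gamma * P, b))            # closed-form fixed point, for comparison
```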

SLIDE 17

Markov decision process

At time $t$,
1. An agent finds itself in a state $s_t$.
2. It takes action $a_t = \mu(s_t)$.
3. It moves to state $s_{t+1}$ according to $P(s_{t+1} \mid s_t, a_t)$ ...
4. ... and receives reward $R(s_t, a_t, s_{t+1})$.

State-action value function (Q function):
$$Q_\mu(s, a) = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^t R(s_t, \mu(s_t), s_{t+1}) \;\middle|\; s_0 = s,\ a_0 = a \right]$$

SLIDE 18

Markov decision process

At time $t$,
1. An agent finds itself in a state $s_t$.
2. It takes action $a_t = \mu(s_t)$.
3. It moves to state $s_{t+1}$ according to $P(s_{t+1} \mid s_t, a_t)$ ...
4. ... and receives reward $R(s_t, a_t, s_{t+1})$.

The state-action value for the optimal policy obeys
$$Q^\star(s, a) = \mathbb{E}\left[ R(s, a, s') + \gamma \max_{a'} Q^\star(s', a') \;\middle|\; s_0 = s,\ a_0 = a \right],$$
and we take $\mu^\star(s) = \arg\max_a Q^\star(s, a)$ ...
... this is another fixed point equation.
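The same fixed-point viewpoint applies to $Q^\star$. A small sketch of the corresponding Q-value iteration on a made-up random MDP (all quantities below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
S, A_n, gamma = 5, 3, 0.9
P = rng.random((S, A_n, S)); P /= P.sum(axis=2, keepdims=True)   # P(s'|s, a)
R = rng.random((S, A_n, S))                                      # R(s, a, s')

Q = np.zeros((S, A_n))
for _ in range(300):
    # Q(s, a) <- E[ R(s, a, s') + gamma * max_a' Q(s', a') ]
    Q = (P * (R + gamma * Q.max(axis=1)[None, None, :])).sum(axis=2)

mu_star = Q.argmax(axis=1)    # greedy policy mu*(s) = argmax_a Q*(s, a)
print(Q)
print(mu_star)
```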

SLIDE 19

Stochastic approximation for policy evaluation

Fixed point iteration for finding $V_\mu(s)$:
$$V_{t+1}(s) = V_t(s) + \alpha \Big( \underbrace{\textstyle\sum_{z} P(z \mid s)\,[R(s, z) + \gamma V_t(z)] - V_t(s)}_{H(V_t) - V_t} \Big)$$
SLIDE 20

Stochastic approximation for policy evaluation

Fixed point iteration for finding $V_\mu(s)$:
$$V_{t+1}(s) = V_t(s) + \alpha \Big( \underbrace{\textstyle\sum_{z} P(z \mid s)\,[R(s, z) + \gamma V_t(z)] - V_t(s)}_{H(V_t) - V_t} \Big)$$

In practice, we don't have the model $P(z \mid s)$, only observed data $\{(s_t, s_{t+1})\}$.

SLIDE 21

Stochastic approximation for policy evaluation

Fixed point iteration for finding $V_\mu(s)$:
$$V_{t+1}(s) = V_t(s) + \alpha \Big( \underbrace{\textstyle\sum_{z} P(z \mid s)\,[R(s, z) + \gamma V_t(z)] - V_t(s)}_{H(V_t) - V_t} \Big)$$

Stochastic approximation iteration:
$$V_{t+1}(s_t) = V_t(s_t) + \alpha_t \left( R(s_t, s_{t+1}) + \gamma V_t(s_{t+1}) - V_t(s_t) \right)$$
The "noise" is that $s_{t+1}$ is sampled, rather than averaged over.

SLIDE 22

Stochastic approximation for policy evaluation

Fixed point iteration for finding $V_\mu(s)$:
$$V_{t+1}(s) = V_t(s) + \alpha \Big( \underbrace{\textstyle\sum_{z} P(z \mid s)\,[R(s, z) + \gamma V_t(z)] - V_t(s)}_{H(V_t) - V_t} \Big)$$

Stochastic approximation iteration:
$$V_{t+1}(s_t) = V_t(s_t) + \alpha_t \big( \underbrace{R(s_t, s_{t+1}) + \gamma V_t(s_{t+1}) - V_t(s_t)}_{H(V_t) - V_t + \eta_t} \big)$$

The "noise" is that $s_{t+1}$ is sampled, rather than averaged over.

SLIDE 23

Stochastic approximation for policy evaluation

Fixed point iteration for finding $V_\mu(s)$:
$$V_{t+1}(s) = V_t(s) + \alpha \Big( \underbrace{\textstyle\sum_{z} P(z \mid s)\,[R(s, z) + \gamma V_t(z)] - V_t(s)}_{H(V_t) - V_t} \Big)$$

Stochastic approximation iteration:
$$V_{t+1}(s_t) = V_t(s_t) + \alpha_t \big( \underbrace{R(s_t, s_{t+1}) + \gamma V_t(s_{t+1}) - V_t(s_t)}_{H(V_t) - V_t + \eta_t} \big)$$

The "noise" is that $s_{t+1}$ is sampled, rather than averaged over. This is different from stochastic gradient descent, since $H(\cdot)$ is in general not a gradient map.
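A minimal tabular TD(0) sketch of this iteration on a made-up chain: the next state is sampled along a single trajectory rather than averaged over $P(z \mid s)$ (the step size, horizon, and the chain itself are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
S, gamma, alpha = 6, 0.9, 0.05
P = rng.random((S, S)); P /= P.sum(axis=1, keepdims=True)   # chain induced by a fixed policy
R = rng.random((S, S))

V = np.zeros(S)
s = 0
for t in range(200_000):
    s_next = rng.choice(S, p=P[s])                          # sample s_{t+1} ~ P(.|s_t)
    td = R[s, s_next] + gamma * V[s_next] - V[s]            # temporal-difference error
    V[s] += alpha * td                                      # V(s_t) <- V(s_t) + alpha * td
    s = s_next

print(V)
print(np.linalg.solve(np.eye(S) - gamma * P, (P * R).sum(axis=1)))   # model-based answer
```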

SLIDE 24

Ingredients for Distributed RL

Distributed RL is a combination of:
  • stochastic approximation
  • Markov decision processes
  • function representation
  • network consensus
(complicated probabilistic analysis)

SLIDE 25

Function approximation

The state space can be large (or even infinite) ... so we need a natural way to parameterize/simplify it.

SLIDE 26

Linear function approximation

Simple (but powerful) model: linear representation
$$V(s; \theta) = \sum_{k=1}^{K} \theta_k \phi_k(s) = \phi(s)^T \theta, \qquad \phi(s) = \begin{bmatrix} \phi_1(s) \\ \vdots \\ \phi_K(s) \end{bmatrix}$$

SLIDE 27

Linear function approximation

Simple (but powerful) model: linear representation
$$V(s; \theta) = \sum_{k=1}^{K} \theta_k \phi_k(s) = \phi(s)^T \theta, \qquad \phi(s) = \begin{bmatrix} \phi_1(s) \\ \vdots \\ \phi_K(s) \end{bmatrix}$$


SLIDE 28

Policy evaluation with function approximation

Bellman equation:
$$V(s) = \sum_{z \in S} P(z \mid s) \left[ R(s, \mu(s), z) + \gamma V(z) \right]$$
Linear approximation:
$$V(s; \theta) = \sum_{k=1}^{K} \theta_k \phi_k(s) = \phi(s)^T \theta$$
These can conflict ...

SLIDE 29

Policy evaluation with function approximation

Bellman equation:
$$V(s) = \sum_{z \in S} P(z \mid s) \left[ R(s, \mu(s), z) + \gamma V(z) \right]$$
Linear approximation:
$$V(s; \theta) = \sum_{k=1}^{K} \theta_k \phi_k(s) = \phi(s)^T \theta$$
These can conflict ...
... but the following iterations
$$\theta_{t+1} = \theta_t + \alpha_t \left( R(s_t, s_{t+1}) + \gamma V(s_{t+1}; \theta_t) - V(s_t; \theta_t) \right) \nabla_\theta V(s_t; \theta_t) = \theta_t + \alpha_t \left( R(s_t, s_{t+1}) + \gamma \phi(s_{t+1})^T \theta_t - \phi(s_t)^T \theta_t \right) \phi(s_t)$$
converge to a "near optimal" $\theta^\star$.

Tsitsiklis and Van Roy, '97
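A sketch of this iteration with random features standing in for $\phi$ (the MDP, features, and step size are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
S, K, gamma, alpha = 20, 5, 0.9, 0.01
P = rng.random((S, S)); P /= P.sum(axis=1, keepdims=True)
R = rng.random((S, S))
Phi = rng.standard_normal((S, K))            # phi(s) = Phi[s], K features per state

theta = np.zeros(K)
s = 0
for t in range(200_000):
    s_next = rng.choice(S, p=P[s])
    # TD error with V(s; theta) = phi(s)^T theta
    d = R[s, s_next] + gamma * Phi[s_next] @ theta - Phi[s] @ theta
    theta += alpha * d * Phi[s]              # theta <- theta + alpha_t * d_t * phi(s_t)
    s = s_next

print(theta)
```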

SLIDE 30

Ingredients for Distributed RL

Distributed RL is a combination of:
  • stochastic approximation
  • Markov decision processes
  • function representation
  • network consensus
(complicated probabilistic analysis)

SLIDE 31

Network consensus

Each node in a network has a number $x(i)$. We want each node to agree on the average
$$\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x(i) = \frac{1}{N} \mathbf{1}^T x.$$
Node $i$ communicates with its neighbors $\mathcal{N}_i$. Iterate: take $v_0 = x$, then
$$v_{k+1}(i) = \sum_{j \in \mathcal{N}_i} W_{ij} v_k(j), \qquad v_{k+1} = W v_k, \quad W \text{ doubly stochastic.}$$
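A small consensus sketch on a ring of $N$ nodes with lazy, doubly stochastic weights (the graph and weights are illustrative); each node's value contracts toward the average:

```python
import numpy as np

N = 8
W = np.zeros((N, N))
for i in range(N):                       # ring graph with "lazy" averaging weights
    W[i, i] = 0.5
    W[i, (i - 1) % N] = 0.25
    W[i, (i + 1) % N] = 0.25             # symmetric, rows and columns sum to 1 (doubly stochastic)

rng = np.random.default_rng(6)
x = rng.standard_normal(N)
xbar = x.mean()

v = x.copy()
for k in range(50):
    v = W @ v                            # v_{k+1} = W v_k
    print(k, np.linalg.norm(v - xbar))   # decays geometrically, at a rate set by sigma_2(W)
```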


SLIDE 32

Network consensus convergence

Nodes reach "consensus" quickly: with $v_{k+1} = W v_k$,
$$v_{k+1} - \bar{x}\mathbf{1} = W v_k - \bar{x}\mathbf{1} = W (v_k - \bar{x}\mathbf{1}).$$

SLIDE 33

Network consensus convergence

Nodes reach "consensus" quickly: with $v_{k+1} = W v_k$,
$$v_{k+1} - \bar{x}\mathbf{1} = W (v_k - \bar{x}\mathbf{1})$$
and
$$\|v_{k+1} - \bar{x}\mathbf{1}\| \le \sigma_2 \|v_k - \bar{x}\mathbf{1}\| \le \sigma_2^{k+1} \|v_0 - \bar{x}\mathbf{1}\|.$$

SLIDE 34

Network consensus convergence

Nodes reach "consensus" quickly: with $v_{k+1} = W v_k$,
$$v_{k+1} - \bar{x}\mathbf{1} = W (v_k - \bar{x}\mathbf{1})$$
and
$$\|v_{k+1} - \bar{x}\mathbf{1}\| \le \sigma_2 \|v_k - \bar{x}\mathbf{1}\| \le \sigma_2^{k+1} \|v_0 - \bar{x}\mathbf{1}\|.$$

[Figure: example networks with larger and smaller $\sigma$]

SLIDE 35

Multi-agent Reinforcement Learning, Scenario 1: Multiple agents in a single environment, with a common state but different rewards. What is the value of a particular policy?

SLIDE 36

Multi-agent reinforcement learning

  • $N$ agents, communicating on a network
  • One environment: common state $s_t \in S$, transition probabilities $P(s_{t+1} \mid s_t)$
  • Individual actions $a_t^i \in A_i$
  • Individual rewards $R_i(s_t, s_{t+1})$
  • Goal: evaluate policies $\mu_i : S \to A_i$

SLIDE 37

Multi-agent reinforcement learning

  • $N$ agents, communicating on a network
  • One environment: common state $s_t \in S$, transition probabilities $P(s_{t+1} \mid s_t)$
  • Individual actions $a_t^i \in A_i$
  • Individual rewards $R_i(s_t, s_{t+1})$
  • Goal: compute the average cumulative reward
$$V(s) = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^t \, \frac{1}{N} \sum_{i=1}^{N} R_i(s_t, s_{t+1}) \;\middle|\; s_0 = s \right]$$
SLIDE 38

Multi-agent reinforcement learning

  • $N$ agents, communicating on a network
  • One environment: common state $s_t \in S$, transition probabilities $P(s_{t+1} \mid s_t)$
  • Individual actions $a_t^i \in A_i$
  • Individual rewards $R_i(s_t, s_{t+1})$
  • Goal: find $V$ that satisfies
$$V(s) = \sum_{z \in S} P(z \mid s) \left[ \frac{1}{N} \sum_{i=1}^{N} R_i(s, z) + \gamma V(z) \right]$$
SLIDE 39

Distributed temporal difference learning

Initialize: each agent starts at $\theta_0^i$.
Iterations:
  • Observe: from $s_t$, take an action to go to $s_{t+1}$, and get reward $R_i(s_t, s_{t+1})$
  • Communicate: average estimates from neighbors, $y_t^i = \sum_{j \in \mathcal{N}_i} W_{ij} \theta_t^j$
  • Local update: $\theta_{t+1}^i = y_t^i + \alpha_t\, d_t^i\, \phi(s_t)$, where $d_t^i = R_i(s_t, s_{t+1}) + \gamma \phi(s_{t+1})^T \theta_t^i - \phi(s_t)^T \theta_t^i$
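A compact sketch of this distributed TD(0) update (the shared chain, agent rewards, features, and mixing matrix below are illustrative stand-ins):

```python
import numpy as np

rng = np.random.default_rng(7)
N_ag, S, K, gamma, alpha = 4, 20, 5, 0.9, 0.01
P = rng.random((S, S)); P /= P.sum(axis=1, keepdims=True)     # common environment
R = rng.random((N_ag, S, S))                                  # agent-specific rewards R_i
Phi = rng.standard_normal((S, K))

W = np.zeros((N_ag, N_ag))                                    # doubly stochastic mixing on a ring
for i in range(N_ag):
    W[i, i], W[i, (i - 1) % N_ag], W[i, (i + 1) % N_ag] = 0.5, 0.25, 0.25

Theta = np.zeros((N_ag, K))                                   # row i holds theta_i
s = 0
for t in range(100_000):
    s_next = rng.choice(S, p=P[s])
    Y = W @ Theta                                             # communicate: y_i = sum_j W_ij theta_j
    d = R[:, s, s_next] + gamma * Phi[s_next] @ Theta.T - Phi[s] @ Theta.T   # one TD error per agent
    Theta = Y + alpha * np.outer(d, Phi[s])                   # theta_i <- y_i + alpha * d_i * phi(s_t)
    s = s_next

print(Theta)    # rows should roughly agree, approximating the value of the average reward
```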

SLIDE 40

Ingredients for Distributed RL

Distributed RL is a combination of:
  • stochastic approximation
  • Markov decision processes
  • function representation
  • network consensus
(complicated probabilistic analysis)

SLIDE 41

Previous work

A subset of existing results:
  • Unified convergence theory: Borkar and Meyn '00
  • Convergence rates with "independent noise" (centralized): Thoppe and Borkar '19, Dalal et al. '18, Lakshminarayanan and Szepesvari '18
  • Convergence rates under Markovian noise (centralized): Bhandari et al. COLT '18, Srikant and Ying COLT '19
  • Multi-agent RL: Mathkar and Borkar '17, Zhang et al. '18, Kar et al. '13, Stankovic and Stankovic '16, Macua et al. '15

SLIDE 42

Rate of convergence for distributed TD

Fixed step size $\alpha_t = \alpha$: for small enough $\alpha$,
$$\mathbb{E}\left[ \|\theta_t^i - \theta^\star\| \right] \le O(\sigma^{t-\tau}) + O(\eta^{t-\tau}) + O(\alpha),$$
where $\sigma < 1$ reflects the network connectivity, $\eta < 1$ depends on problem parameters, and $\tau$ is the mixing time of the underlying Markov chain.

SLIDE 43

Rate of convergence for distributed TD

Fixed step size $\alpha_t = \alpha$: for small enough $\alpha$,
$$\mathbb{E}\left[ \|\theta_t^i - \theta^\star\| \right] \le O(\sigma^{t-\tau}) + O(\eta^{t-\tau}) + O(\alpha),$$
where $\sigma < 1$ reflects the network connectivity, $\eta < 1$ depends on problem parameters, and $\tau$ is the mixing time of the underlying Markov chain.

[Figure: example networks with larger and smaller $\sigma$]

SLIDE 44

Rate of convergence for distributed TD

Fixed step size $\alpha_t = \alpha$: for small enough $\alpha$,
$$\mathbb{E}\left[ \|\theta_t^i - \theta^\star\| \right] \le O(\sigma^{t-\tau}) + O(\eta^{t-\tau}) + O(\alpha),$$
where $\sigma < 1$ reflects the network connectivity, $\eta < 1$ depends on problem parameters, and $\tau$ is the mixing time of the underlying Markov chain.

SLIDE 45

Rate of convergence for distributed TD

Fixed step size $\alpha_t = \alpha$: for small enough $\alpha$,
$$\mathbb{E}\left[ \|\theta_t^i - \theta^\star\| \right] \le O(\sigma^{t-\tau}) + O(\eta^{t-\tau}) + O(\alpha),$$
where $\sigma < 1$ reflects the network connectivity, $\eta < 1$ depends on problem parameters, and $\tau$ is the mixing time of the underlying Markov chain.

Time-varying step size $\alpha_t \sim 1/(t+1)$:
$$\mathbb{E}\left[ \|\theta_t^i - \theta^\star\| \right] \le O(\sigma^{t-\tau}) + O\!\left( \frac{\tau}{(1 - \sigma_2)^2} \cdot \frac{\log(t+1)}{t+1} \right)$$

SLIDE 46

Rate of convergence for distributed TD

Fixed step size $\alpha_t = \alpha$: for small enough $\alpha$,
$$\mathbb{E}\left[ \|\theta_t^i - \theta^\star\| \right] \le O(\sigma^{t-\tau}) + O(\eta^{t-\tau}) + O(\alpha),$$
where $\sigma < 1$ reflects the network connectivity, $\eta < 1$ depends on problem parameters, and $\tau$ is the mixing time of the underlying Markov chain.

Time-varying step size $\alpha_t \sim 1/(t+1)$:
$$\mathbb{E}\left[ \|\theta_t^i - \theta^\star\| \right] \le O(\sigma^{t-\tau}) + O\!\left( \frac{\tau}{(1 - \sigma_2)^2} \cdot \frac{\log(t+1)}{t+1} \right)$$

SLIDE 47

Distributed Stochastic Approximation: General Case

Goal: Find $\theta^\star$ such that $\bar{F}(\theta^\star) = 0$, where
$$\bar{F}(\theta) = \sum_{i=1}^{N} \mathbb{E}[F_i(X_i; \theta)],$$
using decentralized communications between agents, where agent $i$ has access only to $F_i(X_i; \theta)$.

Using the iteration
$$\theta_i^{k+1} = \sum_{j \in \mathcal{N}(i)} W_{i,j} \theta_j^k + \epsilon\, F_i(X_i^k, \theta_i^k)$$
gives us
$$\max_i \; \mathbb{E}\left[ \|\theta_i^k - \theta^\star\|_2^2 \right] \;\to\; O\!\left( \frac{\epsilon \log(1/\epsilon)}{(1 - \sigma_2)^2} \right) \quad \text{at a linear rate,}$$
when the $F_i$ are Lipschitz, the $\bar{F}_i$ are strongly monotone, and the $\{X_i^k\}$ are Markov.
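An illustrative sketch of this iteration in which each $F_i$ is a noisy affine map whose summed mean map is strongly monotone (i.i.d. noise stands in for the Markovian samples $X_i^k$; all constants are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(8)
N_ag, dim, eps = 4, 3, 0.01

# F_i(X; theta) = b_i - A_i theta + noise; the summed mean map is strongly monotone
A = []
for _ in range(N_ag):
    G = rng.standard_normal((dim, dim))
    A.append(0.5 * (G + G.T) + dim * np.eye(dim))             # symmetric positive definite A_i
b = [rng.standard_normal(dim) for _ in range(N_ag)]

theta_star = np.linalg.solve(sum(A), sum(b))                  # root of sum_i E[F_i(.; theta)]

W = np.full((N_ag, N_ag), 1.0 / N_ag)                         # uniform averaging (complete graph)

Theta = np.zeros((N_ag, dim))
for k in range(20_000):
    noise = 0.1 * rng.standard_normal((N_ag, dim))            # stand-in for the samples X_i^k
    F = np.stack([b[i] - A[i] @ Theta[i] for i in range(N_ag)]) + noise
    Theta = W @ Theta + eps * F                                # theta_i <- sum_j W_ij theta_j + eps F_i

print(np.max(np.linalg.norm(Theta - theta_star, axis=1)))     # max_i ||theta_i - theta*||: an O(eps) ball
```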

SLIDE 48

Multi-agent Reinforcement Learning, Scenario 2: Multiple agents in different environments (different dynamics and rewards). Can we find a jointly optimal policy?

SLIDE 49

Policy Optimization, Framework

We will set this up as a distributed optimization program with decentralized communications:
  • One agent explores each environment
  • Agents collaborate by sharing their models
  • Performance guarantees: number of gradient iterations; sample complexity (future)

SLIDE 50

Policy Optimization, Framework

Environments $i = 1, \ldots, N$, each with similar state/action spaces. Key quantities:
  • $\pi(\cdot \mid s)$: policy that maps states into actions
  • $r_i(s, a)$: reward function in environment $i$
  • $\rho_i(s)$: initial state distribution in environment $i$
  • $L_i(\pi)$: long-term reward of $\pi$ in environment $i$,
$$L_i(\pi) = \mathbb{E}\left[ \sum_{k=0}^{\infty} \gamma^k r_i(s_k^i, a_k^i) \right], \qquad a_k^i \sim \pi(\cdot \mid s_{k-1}^i), \quad s_0^i \sim \rho_i$$

We want to solve
$$\underset{\pi}{\text{maximize}} \; \sum_{i=1}^{N} L_i(\pi)$$

SLIDE 51

Decentralized Policy Optimization, Challenges

$$\underset{\pi}{\text{maximize}} \; \sum_{i=1}^{N} L_i(\pi) \quad\longrightarrow\quad \underset{\theta}{\text{maximize}} \; \sum_{i=1}^{N} L_i(\theta), \qquad \pi_\theta(a \mid s) = \frac{e^{\theta_{s,a}}}{\sum_{a'} e^{\theta_{s,a'}}}$$

The natural parameterization (softmax) is ill-conditioned at the solution.
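For concreteness, the tabular softmax parameterization used above, as a tiny sketch (the state and action counts are arbitrary):

```python
import numpy as np

def softmax_policy(theta, s):
    """pi_theta(a|s) = exp(theta[s, a]) / sum_a' exp(theta[s, a']) for tabular theta of shape (S, A)."""
    z = theta[s] - theta[s].max()      # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()

theta = np.zeros((4, 3))               # 4 states, 3 actions; theta = 0 gives the uniform policy
print(softmax_policy(theta, 0))        # [1/3, 1/3, 1/3]
```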

SLIDE 52

Decentralized Policy Optimization, Challenges

$$\underset{\pi}{\text{maximize}} \; \sum_{i=1}^{N} L_i(\pi) \quad\longrightarrow\quad \underset{\theta}{\text{maximize}} \; \sum_{i=1}^{N} L_i(\theta) - \lambda\,\mathrm{RE}(\theta), \qquad \pi_\theta(a \mid s) = \frac{e^{\theta_{s,a}}}{\sum_{a'} e^{\theta_{s,a'}}}$$

The natural parameterization (softmax) is ill-conditioned at the solution.

SLIDE 53

Decentralized Policy Optimization, Challenges

$$\underset{\pi}{\text{maximize}} \; \sum_{i=1}^{N} L_i(\pi) \quad\longrightarrow\quad \underset{\theta}{\text{maximize}} \; \sum_{i=1}^{N} L_i(\theta) - \lambda\,\mathrm{RE}(\theta), \qquad \pi_\theta(a \mid s) = \frac{e^{\theta_{s,a}}}{\sum_{a'} e^{\theta_{s,a'}}}$$

The natural parameterization (softmax) is ill-conditioned at the solution.
Even for a single agent, this problem is nonconvex ...
... the ability to find the global optimum is tied to "exploration conditions"

(Agarwal et al ’19)

SLIDE 54

Decentralized Policy Optimization, Challenges

$$\underset{\pi}{\text{maximize}} \; \sum_{i=1}^{N} L_i(\pi) \quad\longrightarrow\quad \underset{\theta}{\text{maximize}} \; \sum_{i=1}^{N} L_i(\theta) - \lambda\,\mathrm{RE}(\theta), \qquad \pi_\theta(a \mid s) = \frac{e^{\theta_{s,a}}}{\sum_{a'} e^{\theta_{s,a'}}}$$

The natural parameterization (softmax) is ill-conditioned at the solution.
Even for a single agent, this problem is nonconvex ...
... the ability to find the global optimum is tied to "exploration conditions"

(Agarwal et al ’19)

Agents have competing interests (global solution suboptimal for every agent)

SLIDE 55

Decentralized Policy Optimization, Challenges

$$\underset{\pi}{\text{maximize}} \; \sum_{i=1}^{N} L_i(\pi) \quad\longrightarrow\quad \underset{\theta}{\text{maximize}} \; \sum_{i=1}^{N} L_i(\theta) - \lambda\,\mathrm{RE}(\theta), \qquad \pi_\theta(a \mid s) = \frac{e^{\theta_{s,a}}}{\sum_{a'} e^{\theta_{s,a'}}}$$

The natural parameterization (softmax) is ill-conditioned at the solution.
Even for a single agent, this problem is nonconvex ...
... the ability to find the global optimum is tied to "exploration conditions"

(Agarwal et al ’19)

Agents have competing interests (global solution suboptimal for every agent)
Gradients can only be computed imperfectly for large or partially specified problems

SLIDE 56

Algorithm: Decentralized Policy Optimization

$$\underset{\{\theta_i\}}{\text{maximize}} \; \sum_{i=1}^{N} L_i(\theta_i), \quad \text{subject to } \theta_i = \theta_j \text{ for all } (i, j) \in \mathcal{E}$$

Each agent stores a local version of the policy, $\theta_i$, initialized to $\theta_i^0$.
At each node, iterate from the current policy $\pi_{\theta_i^k}$:
  ◮ Compute the "advantage function" $A(s, a) = Q(s, a) - V(s)$
  ◮ Compute the gradient $\nabla L_i(\theta_i^k)$ = (a complicated function of $\pi_{\theta_i^k}$ and $A(s, a)$)
  ◮ Meanwhile, exchange $\theta_i^k$ with neighbors
  ◮ Update the policy:
$$\theta_i^{k+1} = \sum_{j \in \mathcal{N}(i)} W_{i,j} \theta_j^k + \alpha_k \nabla L_i(\theta_i^k)$$
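A structural sketch of this update. The local objectives below are toy concave quadratics standing in for the $L_i$; in the actual algorithm, grad_L would be the policy gradient computed from $\pi_{\theta_i^k}$ and the advantage function:

```python
import numpy as np

rng = np.random.default_rng(9)
N_ag, dim, alpha = 4, 6, 0.05

# Toy stand-ins for the L_i: concave quadratics L_i(theta) = -0.5 * ||theta - c_i||^2
centers = [rng.standard_normal(dim) for _ in range(N_ag)]
def grad_L(i, theta):
    return centers[i] - theta                       # gradient of the i-th toy objective

W = np.full((N_ag, N_ag), 1.0 / N_ag)               # uniform mixing (complete graph)

Theta = np.zeros((N_ag, dim))
for k in range(500):
    G = np.stack([grad_L(i, Theta[i]) for i in range(N_ag)])
    Theta = W @ Theta + alpha * G                   # theta_i <- sum_j W_ij theta_j + alpha_k grad L_i(theta_i)

print(Theta[0])                                     # each row approaches (up to O(alpha)) ...
print(np.mean(centers, axis=0))                     # ... the maximizer of sum_i L_i
```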

SLIDE 57

Algorithm: Mathematical Guarantees

$$\theta_i^{k+1} = \sum_{j \in \mathcal{N}(i)} W_{i,j} \theta_j^k + \alpha_k \nabla L_i(\theta_i^k)$$

For small enough step sizes $\alpha_k$, after $k$ iterations we have
$$\left\| \frac{1}{N} \sum_{i=1}^{N} \nabla L_i(\theta_i^k) \right\|^2 \le O\!\left( \frac{1}{\sqrt{k}} + \frac{C_g}{k} \right)$$
  • Convergence to a stationary point (not the global max)
  • Graph properties are expressed in $C_g$
  • Other constants come from $\lambda$, $N$, and MDP properties

SLIDE 58

Algorithm: Mathematical Guarantees

$$\theta_i^{k+1} = \sum_{j \in \mathcal{N}(i)} W_{i,j} \theta_j^k + \alpha_k \nabla L_i(\theta_i^k)$$

If common states are "equally explored" across environments, then after $k$ iterations,
$$\max_j \sum_{i=1}^{N} \left( L_i(\theta^\ast) - L_i(\theta_j^k) \right) \le \epsilon \qquad \text{when } k \ge \frac{C}{\epsilon^2}.$$
  • Convergence to the global optimum
  • Requires a careful choice of the regularization parameter $\lambda$
  • "Equal exploration" is hard to verify
  • Can be made stochastic, but not with a finite-sample guarantee

SLIDE 59

Simulation: GridWorld

SLIDE 60

Simulation: GridWorld

SLIDE 61

Simulation: GridWorld

SLIDE 62

Simulation: Drones in D-PEDRA

SLIDE 63

Simulation: Drones in D-PEDRA

Table 1: MSF of the learned policy

Policy   Env0   Env1   Env2   Env3    Sum
SA-0     15.9    4.5    4.1    3.6   28.1
SA-1      3.0   55.4    9.7    8.1   76.2
SA-2      1.5    0.8   21.1    2.0   25.4
SA-3      2.3    0.8    8.6   40.1   51.8
DCPG     25.2   67.9   40.5   61.8  195.4
Random    2.5    3.9    4.7    3.7   14.8

Interesting, unexplained result: Learning a joint policy is easier than learning individual policies

SLIDE 64

Thank you!

References:

  • T. T. Doan, S. T. Maguluri, and J. Romberg, "Finite-time performance of distributed temporal difference learning with linear function approximation," submitted January 2020, arXiv:1907.12530.
  • T. T. Doan, S. T. Maguluri, and J. Romberg, "Finite-time analysis of distributed TD(0) with linear function approximation for multi-agent reinforcement learning," ICML 2019.
  • S. Zeng, T. T. Doan, and J. Romberg, "Finite-time analysis of decentralized stochastic approximation with applications in multi-agent and multi-task learning," submitted October 2020, arXiv:2010.15088.
  • S. Zeng, et al., "A decentralized policy gradient approach to multi-task reinforcement learning," submitted October 2020, arXiv:2006.04338.