
SLIDE 1

Asynchronous RL

Deep Reinforcement Learning and Control Katerina Fragkiadaki

Carnegie Mellon School of Computer Science CMU 10703

SLIDE 2

Non-stationary data problem for Deep RL

  • Stability of training neural networks requires the gradient updates to be de-correlated
  • This is not the case if data arrives sequentially
  • Gradient updates computed from some part of the space can cause the value (Q) function approximator to oscillate
  • Our solution so far: experience buffers, where experience tuples are mixed and sampled from. The resulting sampled batches are more stationary than the ones encountered online (without a buffer); a minimal buffer sketch follows below
  • This limits deep RL to off-policy methods, since data from an older policy are used to update the weights of the value approximator
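A minimal sketch of such a buffer (illustrative Python; the class and method names are my own, and uniform sampling is assumed):

import random

class ReplayBuffer:
    """Fixed-size store of (s, a, r, s_next, done) tuples with uniform sampling."""

    def __init__(self, capacity=100_000):
        self.capacity = capacity
        self.storage = []
        self.position = 0

    def add(self, state, action, reward, next_state, done):
        transition = (state, action, reward, next_state, done)
        if len(self.storage) < self.capacity:
            self.storage.append(transition)
        else:
            self.storage[self.position] = transition  # overwrite the oldest tuple
        self.position = (self.position + 1) % self.capacity

    def sample(self, batch_size=32):
        # Uniform sampling mixes tuples produced by many past policies,
        # so a batch looks closer to i.i.d. than the raw online stream.
        return random.sample(self.storage, batch_size)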

SLIDE 3

Asynchronous Deep RL

  • Alternative: parallelize the collection of experience and stabilize training without experience buffers!
  • Multiple threads of experience, one per agent, each exploring a different part of the environment and contributing experience tuples (a toy multi-threaded sketch follows below)
  • Different exploration strategies (e.g., various ε values) in different threads increase diversity
  • It can be applied to both on-policy and off-policy methods; it has been applied to SARSA, DQN, and advantage actor-critic
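A toy sketch of the idea (illustrative Python threads, with a made-up one-state environment and a single scalar standing in for the network weights; a real implementation runs full environments and gradient updates):

import threading
import random

shared_params = {"w": 0.0}            # stands in for the shared network weights
epsilons = [0.5, 0.3, 0.1, 0.01]      # a different exploration rate per thread

def worker(epsilon, n_steps=1000):
    state = random.uniform(-1, 1)
    for _ in range(n_steps):
        # epsilon-greedy action in a toy two-action environment
        if random.random() < epsilon:
            action = random.choice([0, 1])
        else:
            action = int(state > 0)
        reward = 1.0 if action == int(state > 0) else -1.0
        state = random.uniform(-1, 1)            # toy transition
        shared_params["w"] += 0.01 * reward      # asynchronous update, no locking

threads = [threading.Thread(target=worker, args=(eps,)) for eps in epsilons]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(shared_params["w"])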

SLIDE 4

Distributed RL

SLIDE 5

Distributed Asynchronous RL

The actor-critic trained in this asynchronous way is known as A3C. Each worker may hold a slightly different (stale) copy of the policy/critic; updates are applied with no locking.

SLIDE 6

Distributed Synchronous RL

The actor-critic trained in this synchronous way is known as A2C.

  • Gradients of all workers are averaged and the central neural net weights are updated; all workers then hold the same actor/critic weights

SLIDE 7
  • Training stabilization without an experience buffer
  • Use of on-policy methods, e.g., SARSA and policy gradients
  • Reduction in training time, roughly linear in the number of threads

A3C: What is the approximation used for the advantage?

Given a partial rollout s1, s2, s3, s4 with rewards r1, r2, r3 and target critic parameters θ′v:

R3 = r3 + γ V(s4; θ′v)

R2 = r2 + γ r3 + γ² V(s4; θ′v)

A3 = R3 − V(s3; θ′v)

A2 = R2 − V(s2; θ′v)
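A small sketch of these bootstrapped returns and advantages (illustrative Python; the reward and value numbers in the example are made up):

def nstep_returns_and_advantages(rewards, values, bootstrap_value, gamma=0.99):
    """rewards = [r1, ..., rT], values = [V(s1), ..., V(sT)] from the target critic,
    bootstrap_value = V(s_{T+1}).  Implements R_t = r_t + gamma * R_{t+1} and
    A_t = R_t - V(s_t), as on the slide."""
    R = bootstrap_value
    returns, advantages = [], []
    for r, v in zip(reversed(rewards), reversed(values)):
        R = r + gamma * R
        returns.append(R)
        advantages.append(R - v)
    returns.reverse()
    advantages.reverse()
    return returns, advantages

# Example with the slide's setup: rewards r1, r2, r3 observed, bootstrap from V(s4).
rets, advs = nstep_returns_and_advantages([1.0, 0.0, 1.0], [0.5, 0.4, 0.6], bootstrap_value=0.3)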

SLIDE 8

Advantages of Asynchronous (multi-threaded) RL

SLIDE 9

Evolutionary Methods

Deep Reinforcement Learning and Control Katerina Fragkiadaki

Carnegie Mellon School of Computer Science CMU 10703

Part of the slides are borrowed from Xi Chen, Pieter Abbeel, John Schulman

SLIDE 10

Policy Optimization

maxθ J(θ) = maxθ 𝔽 [R(τ) | πθ, μ0(s0)],   where τ is a trajectory

SLIDE 11

Policy Optimization

maxθ J(θ) = maxθ 𝔽 [R(τ) | πθ, μ0(s0)],   where τ is a trajectory

SLIDE 12

Policy Optimization and RL

maxθ J(θ) = maxθ 𝔽 [R(τ) | πθ, μ0(s0)] = maxθ 𝔽 [ ∑_{t=0}^{T} R(st) | πθ, μ0(s0) ]

SLIDE 13

maxθ J(θ) = maxθ 𝔽 [R(τ) | πθ, μ0(s0)] = maxθ 𝔽 [ ∑_{t=0}^{T} R(st) | πθ, μ0(s0) ]

SLIDE 14

Evolutionary methods

maxθ J(θ) = maxθ 𝔽 [R(τ) | πθ, μ0(s0)]

SLIDE 15

Black-box Policy Optimization

maxθ J(θ) = maxθ 𝔽 [R(τ) | πθ, μ0(s0)]

The mapping θ ↦ 𝔽 [R(τ)] is treated as a black box: no information regarding the structure of the reward is used.

SLIDE 16

Evolutionary methods

maxθ J(θ) = maxθ 𝔽 [R(τ) | πθ, μ0(s0)]

General algorithm (a minimal hill-climbing sketch follows below):
Initialize a population of parameter vectors (genotypes)
1. Make random perturbations (mutations) to each parameter vector
2. Evaluate the perturbed parameter vector (fitness)
3. Keep the perturbed vector if the result improves (selection)
4. GOTO 1

Biologically plausible…
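A minimal hill-climbing instance of this loop (illustrative Python; the quadratic fitness stands in for a rollout return):

import numpy as np

def fitness(theta):
    # toy stand-in for the expected return R(tau) of a policy with parameters theta
    return -np.sum(theta ** 2)

rng = np.random.default_rng(0)
population = [rng.normal(size=5) for _ in range(20)]   # genotypes

for generation in range(100):
    next_population = []
    for theta in population:
        mutant = theta + 0.1 * rng.normal(size=theta.shape)   # mutation
        better = fitness(mutant) > fitness(theta)             # fitness evaluation
        next_population.append(mutant if better else theta)   # selection
    population = next_population

best = max(population, key=fitness)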

SLIDE 17

Cross-entropy method (CEM)

Let's consider our parameters to be sampled from a multivariate Gaussian with diagonal covariance. We will evolve this Gaussian towards the samples that have the highest fitness (a code sketch follows below).

CEM:
Initialize µ ∈ Rᵈ, σ ∈ Rᵈ (elementwise positive)
for iteration = 1, 2, …
    Sample n parameters θi ∼ N(µ, diag(σ²))
    For each θi, perform one rollout to get return R(τi)
    Select the top k% of the θi, and fit a new diagonal Gaussian to those samples; update µ, σ
endfor
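A compact CEM sketch under these assumptions (illustrative Python; the fitness function stands in for the return of one rollout):

import numpy as np

def cem(fitness, dim, n_samples=50, elite_frac=0.2, n_iters=100, seed=0):
    """Cross-entropy method over a diagonal Gaussian N(mu, diag(sigma^2))."""
    rng = np.random.default_rng(seed)
    mu, sigma = np.zeros(dim), np.ones(dim)
    n_elite = max(1, int(elite_frac * n_samples))
    for _ in range(n_iters):
        thetas = mu + sigma * rng.standard_normal((n_samples, dim))  # sample theta_i
        returns = np.array([fitness(t) for t in thetas])             # one rollout each
        elites = thetas[np.argsort(returns)[-n_elite:]]              # top k%
        mu = elites.mean(axis=0)                                     # refit the Gaussian
        sigma = elites.std(axis=0) + 1e-6
    return mu

# toy usage: maximize -||theta - 3||^2, optimum at theta = 3
best = cem(lambda th: -np.sum((th - 3.0) ** 2), dim=10)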

SLIDE 18
  • Sample
  • Select elites
  • Update mean
  • Update covariance
  • iterate

Covariance Matrix Adaptation

(Figure: the sampling Gaussian at the current iteration, with mean μi and covariance Ci)

Let's consider our parameters to be sampled from a full multivariate Gaussian. We will evolve this Gaussian towards the samples that have the highest fitness (a simplified code sketch follows below).
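A simplified sketch of one sample / select-elites / update cycle with a full covariance matrix (illustrative Python; this plain elite refit is not the full CMA-ES update, which also uses evolution paths and step-size adaptation):

import numpy as np

rng = np.random.default_rng(0)
dim, n_samples, n_elite = 5, 64, 12
mu, C = np.zeros(dim), np.eye(dim)
fitness = lambda th: -np.sum((th - 2.0) ** 2)     # toy stand-in for the return

for iteration in range(50):
    thetas = rng.multivariate_normal(mu, C, size=n_samples)               # sample
    elites = thetas[np.argsort([fitness(t) for t in thetas])[-n_elite:]]  # select elites
    mu = elites.mean(axis=0)                                              # update mean
    C = np.cov(elites, rowvar=False) + 1e-3 * np.eye(dim)                 # update covariance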

SLIDE 19
  • Sample
  • Select elites
  • Update mean
  • Update covariance
  • iterate

Covariance Matrix Adaptation

SLIDE 20
  • Sample
  • Select elites
  • Update mean
  • Update covariance
  • iterate

Covariance Matrix Adaptation

SLIDE 21
  • Sample
  • Select elites
  • Update mean
  • Update covariance
  • iterate

Covariance Matrix Adaptation

SLIDE 22
  • Sample
  • Select elites
  • Update mean
  • Update covariance
  • iterate

Covariance Matrix Adaptation

SLIDE 23
  • Sample
  • Select elites
  • Update mean
  • Update covariance
  • iterate

Covariance Matrix Adaptation

SLIDE 24
  • Sample
  • Select elites
  • Update mean
  • Update covariance
  • iterate

Covariance Matrix Adaptation

(Figure: the updated sampling Gaussian, with mean μi+1 and covariance Ci+1)

SLIDE 25

CMA-ES, CEM

Work embarrassingly well in low dimensions (e.g., µ ∈ R²², [NIPS 2013])

SLIDE 26
Question

  • Evolutionary methods work well on relatively low-dimensional problems
  • Can they be used to optimize deep network policies?

SLIDE 27

PG VS ES

We are sampling in both cases…

SLIDE 28

Policy Gradients Review

maxθ J(θ) = 𝔽τ∼Pθ(τ) [R(τ)]

∇θJ(θ) = ∇θ 𝔽τ∼Pθ(τ) [R(τ)]
        = ∇θ ∑τ Pθ(τ) R(τ)
        = ∑τ ∇θPθ(τ) R(τ)
        = ∑τ Pθ(τ) (∇θPθ(τ) / Pθ(τ)) R(τ)
        = ∑τ Pθ(τ) ∇θ log Pθ(τ) R(τ)
        = 𝔽τ∼Pθ(τ) [∇θ log Pθ(τ) R(τ)]

Sample estimate:  ∇θJ(θ) ≈ (1/N) ∑_{i=1}^{N} ∇θ log Pθ(τ(i)) R(τ(i))
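A tiny numeric check of this estimator (illustrative Python for a one-step, three-action "bandit" with a softmax policy; the rewards are made up):

import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(3)                       # logits of a softmax policy over 3 actions
mean_reward = np.array([1.0, 2.0, 3.0])   # made-up rewards

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

N = 10_000
grad = np.zeros_like(theta)
for _ in range(N):
    probs = softmax(theta)
    a = rng.choice(3, p=probs)               # one "trajectory" = one sampled action
    R = mean_reward[a] + rng.normal(scale=0.1)
    grad_logp = -probs                        # grad_theta log pi(a) for a softmax policy
    grad_logp[a] += 1.0                       # equals one_hot(a) - probs
    grad += grad_logp * R
grad /= N    # (1/N) sum_i grad_theta log P_theta(tau_i) R(tau_i)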

SLIDE 29

ES considers a distribution over policy parameters:

maxμ U(μ) = 𝔽θ∼Pμ(θ) [F(θ)]

∇μU(μ) = ∇μ 𝔽θ∼Pμ(θ) [F(θ)]
        = ∇μ ∫ Pμ(θ) F(θ) dθ
        = ∫ ∇μPμ(θ) F(θ) dθ
        = ∫ Pμ(θ) (∇μPμ(θ) / Pμ(θ)) F(θ) dθ
        = ∫ Pμ(θ) ∇μ log Pμ(θ) F(θ) dθ
        = 𝔽θ∼Pμ(θ) [∇μ log Pμ(θ) F(θ)]

SLIDE 30

ES considers a distribution over policy parameters:

maxμ U(μ) = 𝔽θ∼Pμ(θ) [F(θ)]

∇μU(μ) = ∇μ 𝔽θ∼Pμ(θ) [F(θ)] = ∫ Pμ(θ) ∇μ log Pμ(θ) F(θ) dθ = 𝔽θ∼Pμ(θ) [∇μ log Pμ(θ) F(θ)]

Sample estimate:  ∇μU(μ) ≈ (1/N) ∑_{i=1}^{N} ∇μ log Pμ(θ(i)) F(θ(i))

SLIDE 31

ES considers a distribution over policy parameters:

maxμ U(μ) = 𝔽θ∼Pμ(θ) [F(θ)]
∇μU(μ) = 𝔽θ∼Pμ(θ) [∇μ log Pμ(θ) F(θ)]
Sample estimate:  ∇μU(μ) ≈ (1/N) ∑_{i=1}^{N} ∇μ log Pμ(θ(i)) F(θ(i))

PG considers a distribution over actions:

maxθ J(θ) = 𝔽τ∼Pθ(τ) [R(τ)]
∇θJ(θ) = 𝔽τ∼Pθ(τ) [∇θ log Pθ(τ) R(τ)]
Sample estimate:  ∇θJ(θ) ≈ (1/N) ∑_{i=1}^{N} ∇θ log Pθ(τ(i)) R(τ(i))

SLIDE 32

From trajectories to actions

∇θ log P(τ(i); θ) = ∇θ log [ ∏_{t=0}^{T} P(s(i)_{t+1} | s(i)_t, a(i)_t) · πθ(a(i)_t | s(i)_t) ]   (dynamics · policy)
                  = ∇θ ∑_{t=0}^{T} [ log P(s(i)_{t+1} | s(i)_t, a(i)_t) + log πθ(a(i)_t | s(i)_t) ]
                  = ∇θ ∑_{t=0}^{T} log πθ(a(i)_t | s(i)_t)   (the dynamics terms do not depend on θ)
                  = ∑_{t=0}^{T} ∇θ log πθ(a(i)_t | s(i)_t)

Sample estimate over trajectories:  ∇θJ(θ) ≈ (1/N) ∑_{i=1}^{N} ∇θ log Pθ(τ(i)) R(τ(i))

Sample estimate over actions:  ∇θJ(θ) ≈ (1/N) ∑_{i=1}^{N} ∑_{t=1}^{T} ∇θ log πθ(a(i)_t | s(i)_t) R(s(i)_t, a(i)_t)

SLIDE 33

ES considers a distribution over policy parameters:

maxμ U(μ) = 𝔽θ∼Pμ(θ) [F(θ)]
∇μU(μ) = 𝔽θ∼Pμ(θ) [∇μ log Pμ(θ) F(θ)]
Sample estimate:  ∇μU(μ) ≈ (1/N) ∑_{i=1}^{N} ∇μ log Pμ(θ(i)) F(θ(i))

PG considers a distribution over actions:

maxθ J(θ) = 𝔽τ∼Pθ(τ) [R(τ)]
∇θJ(θ) = 𝔽τ∼Pθ(τ) [∇θ log Pθ(τ) R(τ)]
Sample estimate:  ∇θJ(θ) ≈ (1/N) ∑_{i=1}^{N} ∑_{t=1}^{T} ∇θ log πθ(a(i)_t | s(i)_t) R(s(i)_t, a(i)_t)

SLIDE 34

  • A concrete example

Suppose θ ∼ Pµ(θ) is a Gaussian distribution with mean µ and covariance matrix σ²I, so that

log Pµ(θ) = −||θ − µ||² / (2σ²) + const,    ∇µ log Pµ(θ) = (θ − µ) / σ²

If we draw two parameter samples θ1, θ2 and obtain two trajectories τ1, τ2:
SLIDE 35

  • A concrete example

Suppose θ ∼ Pµ(θ) is a Gaussian distribution with mean µ and covariance matrix σ²I, so that

log Pµ(θ) = −||θ − µ||² / (2σ²) + const,    ∇µ log Pµ(θ) = (θ − µ) / σ²

If we draw two parameter samples θ1, θ2 and obtain two trajectories τ1, τ2:

𝔽θ∼Pμ(θ) [∇μ log Pμ(θ) R(τ)] ≈ (1/2) [ R(τ1) (θ1 − μ)/σ² + R(τ2) (θ2 − μ)/σ² ]

SLIDE 36

  • Sampling parameter vectors

Suppose θ ∼ Pµ(θ) is a Gaussian distribution with mean µ and covariance matrix σ²I.

Imagine we have access to random vectors ϵ ∼ 𝒩(0, I). Then

θ1 = μ + σ · ϵ1,  ϵ1 ∼ 𝒩(0, I)
θ2 = μ + σ · ϵ2,  ϵ2 ∼ 𝒩(0, I)

The θ samples have the desired mean and variance.

SLIDE 37

  • A concrete example

Suppose θ ∼ Pµ(θ) is a Gaussian distribution with mean µ and covariance matrix σ²I, so that

log Pµ(θ) = −||θ − µ||² / (2σ²) + const,    ∇µ log Pµ(θ) = (θ − µ) / σ²

θ1 = μ + σ · ϵ1,  ϵ1 ∼ 𝒩(0, I)
θ2 = μ + σ · ϵ2,  ϵ2 ∼ 𝒩(0, I)

If we draw two parameter samples θ1, θ2 and obtain two trajectories τ1, τ2:

𝔽θ∼Pμ(θ) [∇μ log Pμ(θ) R(τ)] ≈ (1/2) [ R(τ1) (θ1 − μ)/σ² + R(τ2) (θ2 − μ)/σ² ] = (1/(2σ)) [ R(τ1) ϵ1 + R(τ2) ϵ2 ]
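The same two-sample estimate in code (illustrative Python; the quadratic F stands in for the rollout return R(τ)):

import numpy as np

rng = np.random.default_rng(0)
mu, sigma = np.zeros(5), 0.1
F = lambda theta: -np.sum((theta - 1.0) ** 2)     # toy stand-in for R(tau)

eps1, eps2 = rng.standard_normal(5), rng.standard_normal(5)
theta1, theta2 = mu + sigma * eps1, mu + sigma * eps2   # theta_i = mu + sigma * eps_i
# grad_mu E[R]  ~=  (1 / (2 sigma)) * ( R(tau1) * eps1 + R(tau2) * eps2 )
grad_mu = (F(theta1) * eps1 + F(theta2) * eps2) / (2 * sigma)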

SLIDE 38

Natural Evolutionary Strategies

SLIDE 39

Connection to Finite Differences

  • Antithetic sampling: sample a pair of policies with mirrored noise (θ+ = µ + ϵ, θ− = µ − ϵ)
  • Get a pair of rollouts from the environment

SLIDE 40

Connection to Finite Differences

  • Antithetic sampling: sample a pair of policies with mirrored noise (θ+ = µ + ϵ, θ− = µ − ϵ)
  • Get a pair of rollouts (τ+, τ−) from the environment
  • SPSA: finite differences along a random direction

SLIDE 41

Connection to Finite Differences

  • Antithetic sampling: sample a pair of policies with mirrored noise (θ+ = µ + ϵ, θ− = µ − ϵ)
  • Get a pair of rollouts (τ+, τ−) from the environment
  • SPSA: finite differences along a random direction

∇µ𝔽[R(τ)] ≈ (1/2) [ R(τ+) (θ+ − µ)/σ² + R(τ−) (θ− − µ)/σ² ]
          = (1/2) [ R(τ+) ϵ/σ² − R(τ−) ϵ/σ² ]
          = (ϵ/(2σ²)) [ R(τ+) − R(τ−) ]

vs finite differences

SLIDE 42

Connection to Finite Differences

  • Antithetic sampling: sample a pair of policies with mirrored noise (θ+ = µ + ϵ, θ− = µ − ϵ)
  • Get a pair of rollouts (τ+, τ−) from the environment
  • SPSA: finite differences along a random direction

∇µ𝔽[R(τ)] ≈ (1/2) [ R(τ+) (θ+ − µ)/σ² + R(τ−) (θ− − µ)/σ² ]
          = (1/2) [ R(τ+) ϵ/σ² − R(τ−) ϵ/σ² ]
          = (ϵ/(2σ²)) [ R(τ+) − R(τ−) ]

vs finite differences
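The antithetic estimate in code (illustrative Python with the same toy fitness; here ϵ is drawn with standard deviation σ):

import numpy as np

rng = np.random.default_rng(0)
mu, sigma = np.zeros(5), 0.1
F = lambda theta: -np.sum((theta - 1.0) ** 2)     # toy stand-in for R(tau)

eps = sigma * rng.standard_normal(5)              # eps ~ N(0, sigma^2 I)
theta_plus, theta_minus = mu + eps, mu - eps      # mirrored pair
# grad_mu E[R]  ~=  (eps / (2 sigma^2)) * ( R(tau_plus) - R(tau_minus) )
grad_mu = (eps / (2 * sigma ** 2)) * (F(theta_plus) - F(theta_minus))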

SLIDE 43

Finite Differences

SLIDE 44

Evolution methods vs Policy Gradients

PG: we add noise ε to our actions (ε-greedy)! ES: we add noise ξ to our policy (neural network) parameters!

Open question: policy gradient at the action level or at the parameter level?

Important to remember: both are doing finite differences / random search.

Sample estimate (ES):  ∇μU(μ) ≈ (1/N) ∑_{i=1}^{N} ∇μ log Pμ(θ(i)) F(θ(i))

Sample estimate (PG):  ∇θJ(θ) ≈ (1/N) ∑_{i=1}^{N} ∑_{t=1}^{T} ∇θ log πθ(a(i)_t | s(i)_t) R(s(i)_t, a(i)_t)

SLIDE 45

Neural net architectures that work well with (stochastic) gradient descent optimization do not work well with ES. Contributions:

  • Virtual batch norm
  • Discretization of continuous actions - better exploration during mutation!
  • Parallelization with a need for only tiny cross-worker communication!!
SLIDE 46

Neural net architectures that work well with (stochastic) gradient descent optimization do not work well with ES. Contributions:

  • Virtual batch norm
  • Discretization of continuous actions - better exploration during mutation!
  • Parallelization with a need for only tiny cross-worker communication!!
SLIDE 47

Neural net architectures that work well with (stochastic) gradient descent optimization do not work well with ES. Contributions:

  • Virtual batch norm
  • Discretization of continuous actions - better exploration during mutation!
  • Parallelization with a need for only tiny cross-worker communication!!
SLIDE 48

Distributed SGD

(Figure: Workers 1-6 exchanging gradients)

Used in Asynchronous RL!

SLIDE 49

Distributed SGD

(Figure: Workers 1-6 performing an ALL-REDUCE over gradients)

Each worker sends big gradient vectors

SLIDE 50

Distributed Evolution

(Figure: Workers 1-6)

What needs to be sent??

SLIDE 51

Distributed Evolution

(Figure: Workers 1-6)

What needs to be sent??

SLIDE 52

Distributed Evolution

(Figure: Workers 1-6)

Send θ and R(τ)? θ is big! But θ = µ + ϵ, and µ is the same for all workers, so each worker only needs to send the seed of its random number generator (a sketch of this trick follows below)!
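A minimal sketch of the seed trick (illustrative Python; in the real system each entry of `messages` would come from a different machine, and F stands in for a full rollout return):

import numpy as np

dim, sigma, lr, n_workers = 1000, 0.1, 0.01, 8
mu = np.zeros(dim)                                   # identical on every worker
F = lambda theta: -np.sum(theta ** 2)                # toy stand-in for a rollout return

# Each worker broadcasts only (seed, return): an int and a scalar.
messages = []
for seed in range(n_workers):
    eps = np.random.default_rng(seed).standard_normal(dim)
    messages.append((seed, F(mu + sigma * eps)))

# Any worker can rebuild every eps from the seed and form the full update locally.
grad = np.zeros(dim)
for seed, ret in messages:
    eps = np.random.default_rng(seed).standard_normal(dim)   # regenerate the noise
    grad += ret * eps
grad /= n_workers * sigma
mu = mu + lr * grad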

SLIDE 53

[Salimans, Ho, Chen, Sutskever, 2017]

Distributed Evolution

SLIDE 54

[Salimans, Ho, Chen, Sutskever, 2017]

Distributed Evolution

SLIDE 55

Distributed Evolution

(Figure: Workers 1-6)

Each worker broadcasts tiny scalars

SLIDE 56

Distributed Evolution

(Figure: Workers 1-6)

Each worker broadcasts tiny scalars

SLIDE 57

Distributed Evolution

(Figure: Workers 1-6)

Each worker broadcasts tiny scalars

SLIDE 58

[Salimans, Ho, Chen, Sutskever, 2017]

Distributed Evolution Scales Very Well :-)

SLIDE 59

[Salimans, Ho, Chen, Sutskever, 2017]

Distributed Evolution Requires More Samples :-(

SLIDE 60

Step-size: alternative derivations

Monte Carlo Policy Gradients (REINFORCE), gradient direction:

ĝ = Êt [ ∇θ log πθ(at | st) Ât ],  obtained by differentiating the loss  LPG(θ) = Êt [ log πθ(at | st) Ât ]

How would we implement this in TF or PyTorch? (see the sketch below)

Equivalently, as an importance-sampled objective:

LIS_θold(θ) = 𝔽τ∼θold [ (πθ(at|st) / πθold(at|st)) Ât ]
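One possible PyTorch sketch of the LPG surrogate (answering the slide's question; the toy network, placeholder tensors, and shapes are my own, not the course code):

import torch
import torch.nn as nn

T, obs_dim, n_actions = 128, 4, 2
policy_net = nn.Sequential(nn.Linear(obs_dim, 32), nn.Tanh(), nn.Linear(32, n_actions))
optimizer = torch.optim.Adam(policy_net.parameters(), lr=1e-3)

states = torch.randn(T, obs_dim)            # s_t from a rollout (placeholder data)
actions = torch.randint(n_actions, (T,))    # a_t actually taken
advantages = torch.randn(T)                 # A_hat_t, e.g. from a critic (placeholder)

# LPG(theta) = E_t[ log pi_theta(a_t|s_t) * A_hat_t ]; autograd of this scalar
# gives g_hat = E_t[ grad_theta log pi_theta(a_t|s_t) * A_hat_t ].
log_probs = torch.log_softmax(policy_net(states), dim=-1)
logp_taken = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
loss = -(logp_taken * advantages.detach()).mean()   # minimize the negative surrogate

optimizer.zero_grad()
loss.backward()
optimizer.step()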

SLIDE 61

Derivation of the likelihood-ratio gradient from importance sampling:

J(θ) = 𝔽τ∼Pθ(τ) [R(τ)] = ∑τ P(τ|θ) R(τ) = ∑τ P(τ|θold) (P(τ|θ) / P(τ|θold)) R(τ) = 𝔽τ∼θold [ (P(τ|θ) / P(τ|θold)) R(τ) ]

∇J(θ)|θold = 𝔽τ∼θold [ (∇θP(τ|θ)|θold / P(τ|θold)) R(τ) ] = 𝔽τ∼θold [ ∇θ log P(τ|θ)|θold R(τ) ]

Importance-sampled surrogate:  𝔽τ∼θold [ (πθ(at|st) / πθold(at|st)) Ât ]

SLIDE 62

Step-size: alternative derivations

Monte Carlo Policy Gradients (REINFORCE), gradient direction:

ĝ = Êt [ ∇θ log πθ(at | st) Ât ],  obtained by differentiating the loss  LPG(θ) = Êt [ log πθ(at | st) Ât ]

Equivalently, as an importance-sampled objective:

LIS_θold(θ) = 𝔽τ∼θold [ (πθ(at|st) / πθold(at|st)) Ât ]

Next: instead of leaving the step size to some heuristically chosen learning-rate schedule, make it part of the algorithm!

SLIDE 63

Hard to choose step sizes

  • Input data is nonstationary due to the changing policy: the observation and reward distributions change
  • A bad step is more damaging than in supervised learning, since it affects the visitation distribution
  • Step too big: bad policy -> data collected under a bad policy -> we cannot recover (in SL, the data does not depend on the neural network weights)
  • Step too small: not an efficient use of experience (in SL, data can be trivially re-used)

Policy gradients:  ĝ = Êt [ ∇θ log πθ(at | st) Ât ]

Can differentiate the following loss:  LPG(θ) = Êt [ log πθ(at | st) Ât ]

SGD:  θnew = θold + ϵ · ĝ

SLIDE 64

Natural Gradient Descent

Take a step in direction d, where d stays within a ball of radius ε.

SLIDE 65

Gradient Descent in Parameter Space

Take a step in direction d, where d stays within a ball of radius ε, measured with the Euclidean distance in parameter space. Imagine we are trying to minimize the negative log-likelihood and want to update the parameters of a distribution. The same distance in parameter space can result in very different distances in distribution space. Ideally, we want to take steps based on distance in distribution space: this is far more robust and needs less tweaking of the step size. What metric shall we use?

SLIDE 66

Gradient Descent in Distribution Space

Instead, we will use the Kullback-Leibler divergence (KL) to measure the distance between the distributions before and after the update.

SLIDE 67

Gradient Descent in Distribution Space

Instead, we will use the Kullback-Leibler divergence (KL) to measure the distance between the distributions before and after the update. Unconstrained penalized objective: take a first-order Taylor expansion of the loss and a second-order expansion of the KL around the old parameters.

SLIDE 68

Gradient Descent in Distribution Space

Instead, we will use the Kullback-Leibler divergence (KL) to measure the distance between the distributions before and after the update. Unconstrained penalized objective: take a first-order Taylor expansion of the loss and a second-order expansion of the KL around the old parameters.
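A tiny numeric sketch of the resulting step (illustrative Python with made-up numbers; g is the loss gradient and F the Fisher matrix, i.e. the Hessian of the KL at the old parameters; the step-length formula below is the standard one obtained from the quadratic KL constraint and is my addition, not from the slides):

import numpy as np

g = np.array([1.0, 0.5])                       # gradient of the objective at theta_old (made up)
F = np.array([[2.0, 0.0], [0.0, 0.5]])         # Fisher information / KL Hessian (made up)
natural_grad = np.linalg.solve(F, g)           # F^{-1} g, without forming the inverse

epsilon = 0.01                                 # allowed KL "radius"
# largest step along F^{-1} g satisfying (1/2) d^T F d <= epsilon
step = np.sqrt(2 * epsilon / (g @ natural_grad)) * natural_grad
theta_new = np.zeros(2) + step                 # theta_old taken to be 0 in this toy example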

SLIDE 69

Natural Gradient Descent

The natural gradient: d = F⁻¹ g, the analogue of Newton's method with the Fisher matrix in place of the Hessian.