SLIDE 1

Separating value functions across time-scales

Joshua Romoff*1,2, Peter Henderson*3, Ahmed Touati2,4, Emma Brunskill3, Joelle Pineau1,2, Yann Ollivier2

1MILA-McGill University, 2Facebook AI Research, 3Stanford University, 4MILA-Université de Montréal
*Equal Contribution

SLIDE 2

RL Background

  • Monte-Carlo return / target: $G_t := \sum_{j=0}^{\infty} \gamma^j r_{t+j}$
  • Value function: $V(s) := \mathbb{E}\left[ G_t \mid s_t = s, \pi \right]$
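The discounted return above can be computed directly from a trajectory of rewards. A minimal sketch (the reward list and γ value here are illustrative, not from the paper's experiments):

```python
def monte_carlo_return(rewards, gamma):
    """Discounted Monte-Carlo return G_t = sum_j gamma^j * r_{t+j}."""
    g = 0.0
    # Accumulate backwards so each step is a single multiply-add.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(monte_carlo_return([1.0, 1.0, 1.0], 0.5))  # 1 + 0.5 + 0.25 = 1.75
```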

SLIDE 3

RL Background – bootstrapping

  • Multi-step returns: $G_t^{(k)} := \sum_{j=0}^{k-1} \gamma^j r_{t+j} + \gamma^k V(s_{t+k})$
  • $\lambda$-returns: $G_t^{\lambda} := (1 - \lambda) \sum_{k=1}^{\infty} \lambda^{k-1} G_t^{(k)}$
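Both targets can be sketched in a few lines; the recursive form of the λ-return used below is the standard one, and the helper names are ours, not the paper's:

```python
def k_step_return(rewards, v_next, gamma, k):
    """G_t^(k): k discounted rewards plus the bootstrapped value of s_{t+k}."""
    g = sum(gamma**j * rewards[j] for j in range(k))
    return g + gamma**k * v_next

def lambda_return(rewards, values, gamma, lam):
    """Lambda-return over a finite trajectory, computed backwards via
    G_t^lambda = r_t + gamma * ((1 - lam) * V(s_{t+1}) + lam * G_{t+1}^lambda).
    values[j] approximates V(s_{t+j+1}); the last entry bootstraps the tail."""
    g = values[-1]
    for r, v in zip(reversed(rewards), reversed(values)):
        g = r + gamma * ((1 - lam) * v + lam * g)
    return g
```

With λ = 1 the λ-return collapses to the k-step return over the whole trajectory; with λ = 0 it collapses to the one-step TD target.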

SLIDE 4

Learning – problems with large $\gamma$

  • Part of the problem formulation: $G_t := \sum_{j=0}^{\infty} \gamma^j r_{t+j}$
  • When $\gamma \to 1$, training $V_\gamma$ is difficult

SLIDE 5

Our Solution: TD(Δ)

  • Define a sequence of $\gamma$s: $\Delta := (\gamma_0, \gamma_1, \ldots, \gamma_Z)$
  • with $\gamma_j \le \gamma_{j+1} \;\; \forall j$

SLIDE 6

Our Solution: TD(Δ)

  • Define a sequence of $\gamma$s: $\Delta := (\gamma_0, \gamma_1, \ldots, \gamma_Z)$
  • with $\gamma_j \le \gamma_{j+1} \;\; \forall j$
  • Learn $W_0 := V_{\gamma_0}$ and $W_j := V_{\gamma_j} - V_{\gamma_{j-1}}$

SLIDE 7

Our Solution: TD(Δ)

  • Define a sequence of $\gamma$s: $\Delta := (\gamma_0, \gamma_1, \ldots, \gamma_Z)$
  • with $\gamma_j \le \gamma_{j+1} \;\; \forall j$
  • Learn $W_0 := V_{\gamma_0}$ and $W_j := V_{\gamma_j} - V_{\gamma_{j-1}}$
  • Recompose: $\sum_{j=0}^{Z} W_j = V_{\gamma_Z}$
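For intuition on the telescoping recomposition: in a toy MRP with a constant reward $r$ at every step, $V_\gamma = r / (1 - \gamma)$ in closed form, so the delta estimators can be checked numerically. The γ sequence below is an illustrative choice, not one from the paper:

```python
gammas = [0.0, 0.5, 0.75, 0.875]  # example sequence with gamma_j <= gamma_{j+1}
r = 1.0                           # constant per-step reward

def v(gamma):
    """Closed-form value of a constant-reward MRP: V_gamma = r / (1 - gamma)."""
    return r / (1.0 - gamma)

# Delta estimators: W_0 = V_{gamma_0}, W_j = V_{gamma_j} - V_{gamma_{j-1}}
ws = [v(gammas[0])] + [v(gammas[j]) - v(gammas[j - 1]) for j in range(1, len(gammas))]

# Recomposition: the sum telescopes back to V_{gamma_Z}
print(ws)       # [1.0, 1.0, 2.0, 4.0]
print(sum(ws))  # 8.0 == 1 / (1 - 0.875)
```

Note how each $W_j$ captures the extra value contributed by the next, longer time-scale.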

SLIDE 8

Bellman Equation for TD(Δ)

  • We can use Bellman equations as targets:
    $W_0 :\; r_t + \gamma_0 W_0(s_{t+1})$
    $W_{j>0} :\; (\gamma_j - \gamma_{j-1})\, V_{\gamma_{j-1}}(s_{t+1}) + \gamma_j W_j(s_{t+1})$
  • We extend it to multi-step TD and TD($\lambda$)
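The per-estimator targets above can be sketched in the tabular one-step case; the function name is ours, and $V_{\gamma_{j-1}}(s_{t+1})$ is recovered as the running partial sum of the delta estimators at the next state:

```python
def td_delta_targets(r_t, gammas, w_next):
    """One-step TD(Delta) targets for each delta estimator.
    w_next[j] is W_j(s_{t+1})."""
    targets = [r_t + gammas[0] * w_next[0]]  # W_0 target
    v_prev = w_next[0]                       # running V_{gamma_{j-1}}(s_{t+1})
    for j in range(1, len(gammas)):
        targets.append((gammas[j] - gammas[j - 1]) * v_prev
                       + gammas[j] * w_next[j])
        v_prev += w_next[j]
    return targets

print(td_delta_targets(1.0, [0.5, 0.75], [2.0, 1.0]))  # [2.0, 1.25]
```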

SLIDE 9

Bellman Equation for TD(Δ)

SLIDE 10

Equivalence results for TD(Δ)

  • Equivalence to standard TD($\lambda$)

We did it! Wait…

SLIDE 11

Equivalence results for TD(Δ)

  • Equivalence to standard TD($\lambda$)

We did it! Wait…

SLIDE 12

Equivalence conditions for TD(Δ)

  • Linear function approximation
  • Same learning rates for each W
  • Same $k$-step / $\lambda$-return for each W

SLIDE 13

TD(Δ) – Benefits: more tuning

  • We don't have to be equivalent!
  • Change the learning rates
  • Change the $k$-step / $\lambda$-return per W

SLIDE 14

TD(Δ) – Benefits: more tuning

  • What will this get us?
  • Let's turn to a slightly different setting to get more insight.

SLIDE 15

TD(Δ) – Benefits: more tuning

  • "Phasic" updates for standard TD

SLIDE 16

TD(Δ) – Benefits: more tuning

We get an error bound using large-deviation analysis, with bias and variance components, dependent on the number of steps $k$, the discount $\gamma$, and the sample size.

Small note: Kearns & Singh have a slightly different variance-term constant; the proof was excluded from the 2000 paper, so we instead used Hoeffding's inequality to reach this constant (see our supplemental).

SLIDE 17

TD(Δ) – Benefits: more tuning

If we do the same for our method, we get a bias-variance trade-off.

SLIDE 18

TD(Δ) – Little tuning required

  • 1. Let an adaptive optimizer handle the learning rates
  • 2. Set $k_j = \frac{1}{1 - \gamma_j}$ and $\lambda_j = \min\!\left(\frac{\gamma_Z \lambda_Z}{\gamma_j},\, 1\right)$
  • 3. Set each $\gamma_j$ so that its horizon is double that of $\gamma_{j-1}$
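A doubling-horizon schedule like step 3 can be sketched as follows; the starting horizon and number of estimators are illustrative choices, not values prescribed by the paper:

```python
def gamma_schedule(num_estimators, h0=2.0):
    """Discount factors whose effective horizons 1/(1 - gamma) double per level,
    plus the matching k_j = 1 / (1 - gamma_j) step counts."""
    horizons = [h0 * 2**j for j in range(num_estimators)]
    gammas = [1.0 - 1.0 / h for h in horizons]
    ks = [round(1.0 / (1.0 - g)) for g in gammas]
    return gammas, ks

gammas, ks = gamma_schedule(4)
print(gammas)  # [0.5, 0.75, 0.875, 0.9375]
print(ks)      # [2, 4, 8, 16]
```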

SLIDE 19

TD(Δ) for actor-critic algorithms

  • 1. Train the Ws as described
  • 2. Use the sum of the Ws instead of V in the policy update

We apply it to PPO and test it on Atari.
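In the actor-critic setting, step 2 amounts to feeding the recomposed sum into the advantage instead of a single critic. A hedged sketch with a plain one-step advantage (the function names are ours, and PPO in practice uses a generalized advantage estimate rather than this simplified form):

```python
def critic_value(w_outputs):
    """Recomposed value V_{gamma_Z}(s) = sum_j W_j(s)."""
    return sum(w_outputs)

def advantage(r_t, gamma, w_s, w_s_next):
    """One-step advantage with the summed critic: r + gamma * V(s') - V(s)."""
    return r_t + gamma * critic_value(w_s_next) - critic_value(w_s)

# Each list holds the per-time-scale W outputs at a state.
adv = advantage(1.0, 0.9, [0.5, 0.5], [1.0, 1.0])
```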

SLIDE 20

TD(Δ) for actor-critic algorithms

SLIDE 21

TD(Δ) for actor-critic algorithms

SLIDE 22

TD(Δ) for actor-critic algorithms

SLIDE 23

TD(Δ) for actor-critic algorithms

SLIDE 24

TD(Δ) – Atari Experiments

SLIDE 25

TD(Δ) – Atari Experiments

SLIDE 26

TD(Δ) – Atari Experiments

SLIDE 27

TD(Δ) – What does it learn? (Atari)

SLIDE 28

TD(Δ) – What does it learn? (Atari)

SLIDE 29

TD(Δ) – Benefits

  • More knobs to tune the bias-variance trade-off! :)
  • More insight into the value of the policy at different time-scales!
  • A Bellman update for learning separated value functions, which allows for some theoretical insights
  • Natural splitting for distributed computing

SLIDE 30

TD(Δ) – Downsides

  • More knobs to tune the bias-variance trade-off! :(
  • Somewhat more compute intensive

SLIDE 31

TD(Δ) Meets Reward Estimation

Joshua Romoff*, Peter Henderson*, Alexandre Piche, Vincent Francois-Lavet, and Joelle Pineau. "Reward Estimation for Variance Reduction in Deep Reinforcement Learning." In Conference on Robot Learning, pp. 674-699. 2018.

We previously demonstrated a simple property: by using a learned estimate of the reward, we can reduce variance in learning, especially in noisy environments.

SLIDE 32

TD(Δ) Meets Reward Estimation

Here, we are using many estimators and looking at a similar bias-variance trade-off.

An interesting future investigation would look into whether separating value functions across many estimators has similar natural benefits in the case of noisy rewards, as in our reward estimation work.

SLIDE 33

Other Extensions

  • Adding more Ws to move the discount factor toward 1
  • Q-learning extension
  • Using the natural time-scale split for distributed-computing updates

SLIDE 34

Thanks! More Questions?