SLIDE 1

Separating value functions across time-scales

Joshua Romoff*1,2, Peter Henderson*3, Ahmed Touati2,4, Emma Brunskill3, Joelle Pineau1,2, Yann Ollivier2

1MILA-McGill University, 2Facebook AI Research, 3Stanford University, 4MILA-Université de Montréal
*Equal Contribution

SLIDE 2

RL Background

  • Monte-Carlo return / target: $G_t := \sum_{j=0}^{\infty} \gamma^j r_{t+j}$
  • Value function: $V(s) := \mathbb{E}\left[ G_t \mid s_t = s, \pi \right]$
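The discounted return above can be computed directly from a trajectory of rewards. A minimal sketch (the reward list and γ value here are illustrative, not from the paper's experiments):

```python
def monte_carlo_return(rewards, gamma):
    """Discounted Monte-Carlo return G_t = sum_j gamma^j * r_{t+j}."""
    g = 0.0
    # Accumulate backwards so each step is a single multiply-add.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(monte_carlo_return([1.0, 1.0, 1.0], 0.5))  # 1 + 0.5 + 0.25 = 1.75
```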

SLIDE 3

RL Background – bootstrapping

  • Multi-step returns: $G_t^{(k)} := \sum_{j=0}^{k-1} \gamma^j r_{t+j} + \gamma^k V(s_{t+k})$
  • $\lambda$-returns: $G_t^{\lambda} := (1 - \lambda) \sum_{k=1}^{\infty} \lambda^{k-1} G_t^{(k)}$
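Both targets can be sketched in a few lines; the recursive form of the λ-return used below is the standard one, and the helper names are ours, not the paper's:

```python
def k_step_return(rewards, v_next, gamma, k):
    """G_t^(k): k discounted rewards plus the bootstrapped value of s_{t+k}."""
    g = sum(gamma**j * rewards[j] for j in range(k))
    return g + gamma**k * v_next

def lambda_return(rewards, values, gamma, lam):
    """Lambda-return over a finite trajectory, computed backwards via
    G_t^lambda = r_t + gamma * ((1 - lam) * V(s_{t+1}) + lam * G_{t+1}^lambda).
    values[j] approximates V(s_{t+j+1}); the last entry bootstraps the tail."""
    g = values[-1]
    for r, v in zip(reversed(rewards), reversed(values)):
        g = r + gamma * ((1 - lam) * v + lam * g)
    return g
```

With λ = 1 the λ-return collapses to the k-step return over the whole trajectory; with λ = 0 it collapses to the one-step TD target.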

SLIDE 4

Learning – problems with large $\gamma$

  • Part of the problem formulation: $G_t := \sum_{j=0}^{\infty} \gamma^j r_{t+j}$
  • When $\gamma \to 1$, training $V_\gamma$ is difficult

SLIDE 5

Our Solution: TD(Δ)

  • Define a sequence of $\gamma$s: $\Delta := (\gamma_0, \gamma_1, \ldots, \gamma_Z)$
  • with $\gamma_j \le \gamma_{j+1} \;\; \forall j$

SLIDE 6

Our Solution: TD(Δ)

  • Define a sequence of $\gamma$s: $\Delta := (\gamma_0, \gamma_1, \ldots, \gamma_Z)$
  • with $\gamma_j \le \gamma_{j+1} \;\; \forall j$
  • Learn $W_0 := V_{\gamma_0}$ and $W_j := V_{\gamma_j} - V_{\gamma_{j-1}}$

SLIDE 7

Our Solution: TD(Δ)

  • Define a sequence of $\gamma$s: $\Delta := (\gamma_0, \gamma_1, \ldots, \gamma_Z)$
  • with $\gamma_j \le \gamma_{j+1} \;\; \forall j$
  • Learn $W_0 := V_{\gamma_0}$ and $W_j := V_{\gamma_j} - V_{\gamma_{j-1}}$
  • Recompose: $\sum_{j=0}^{Z} W_j = V_{\gamma_Z}$
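For intuition on the telescoping recomposition: in a toy MRP with a constant reward $r$ at every step, $V_\gamma = r / (1 - \gamma)$ in closed form, so the delta estimators can be checked numerically. The γ sequence below is an illustrative choice, not one from the paper:

```python
gammas = [0.0, 0.5, 0.75, 0.875]  # example sequence with gamma_j <= gamma_{j+1}
r = 1.0                           # constant per-step reward

def v(gamma):
    """Closed-form value of a constant-reward MRP: V_gamma = r / (1 - gamma)."""
    return r / (1.0 - gamma)

# Delta estimators: W_0 = V_{gamma_0}, W_j = V_{gamma_j} - V_{gamma_{j-1}}
ws = [v(gammas[0])] + [v(gammas[j]) - v(gammas[j - 1]) for j in range(1, len(gammas))]

# Recomposition: the sum telescopes back to V_{gamma_Z}
print(ws)       # [1.0, 1.0, 2.0, 4.0]
print(sum(ws))  # 8.0 == 1 / (1 - 0.875)
```

Note how each $W_j$ captures the extra value contributed by the next, longer time-scale.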

SLIDE 8

Bellman Equation for TD(Δ)

  • We can use Bellman equations as targets:
    $W_0 :\; r_t + \gamma_0 W_0(s_{t+1})$
    $W_{j>0} :\; (\gamma_j - \gamma_{j-1})\, V_{\gamma_{j-1}}(s_{t+1}) + \gamma_j W_j(s_{t+1})$
  • We extend it to multi-step TD and TD($\lambda$)
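The per-estimator targets above can be sketched in the tabular one-step case; the function name is ours, and $V_{\gamma_{j-1}}(s_{t+1})$ is recovered as the running partial sum of the delta estimators at the next state:

```python
def td_delta_targets(r_t, gammas, w_next):
    """One-step TD(Delta) targets for each delta estimator.
    w_next[j] is W_j(s_{t+1})."""
    targets = [r_t + gammas[0] * w_next[0]]  # W_0 target
    v_prev = w_next[0]                       # running V_{gamma_{j-1}}(s_{t+1})
    for j in range(1, len(gammas)):
        targets.append((gammas[j] - gammas[j - 1]) * v_prev
                       + gammas[j] * w_next[j])
        v_prev += w_next[j]
    return targets

print(td_delta_targets(1.0, [0.5, 0.75], [2.0, 1.0]))  # [2.0, 1.25]
```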

SLIDE 9

Bellman Equation for TD(Δ)

SLIDE 10

Equivalence results for TD(Δ)

  • Equivalence to standard TD($\lambda$)

We did it! Wait…

SLIDE 11

Equivalence results for TD(Δ)

  • Equivalence to standard TD($\lambda$)

We did it! Wait…

SLIDE 12

Equivalence conditions for TD(Δ)

  • Linear function approximation
  • Same learning rates for each W
  • Same $k$-step / $\lambda$-return for each W

SLIDE 13

TD(Δ) – Benefits: more tuning

  • We don't have to be equivalent!
  • Change the learning rates
  • Change the $k$-step / $\lambda$-return per W

SLIDE 14

TD(Δ) – Benefits: more tuning

  • What will this get us?
  • Let's turn to a slightly different setting to get more insight.

SLIDE 15

TD(Δ) – Benefits: more tuning

  • "Phasic" updates for standard TD

SLIDE 16

TD(Δ) – Benefits: more tuning

We get an error bound using large-deviation analysis, with bias and variance components, dependent on the number of steps $k$, the discount $\gamma$, and the sample size.

Small note: Kearns & Singh have a slightly different variance-term constant; the proof was excluded from the 2000 paper, so we instead used Hoeffding's inequality to reach this constant (see our supplemental).

SLIDE 17

TD(Δ) – Benefits: more tuning

If we do the same for our method, we get a bias-variance trade-off.

SLIDE 18

TD(Δ) – Little tuning required

  • 1. Let an adaptive optimizer handle the learning rates
  • 2. Set $k_j = \frac{1}{1 - \gamma_j}$ and $\lambda_j = \min\!\left(\frac{\gamma_Z \lambda_Z}{\gamma_j},\, 1\right)$
  • 3. Set each $\gamma_j$ so that its horizon is double that of $\gamma_{j-1}$
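A doubling-horizon schedule like step 3 can be sketched as follows; the starting horizon and number of estimators are illustrative choices, not values prescribed by the paper:

```python
def gamma_schedule(num_estimators, h0=2.0):
    """Discount factors whose effective horizons 1/(1 - gamma) double per level,
    plus the matching k_j = 1 / (1 - gamma_j) step counts."""
    horizons = [h0 * 2**j for j in range(num_estimators)]
    gammas = [1.0 - 1.0 / h for h in horizons]
    ks = [round(1.0 / (1.0 - g)) for g in gammas]
    return gammas, ks

gammas, ks = gamma_schedule(4)
print(gammas)  # [0.5, 0.75, 0.875, 0.9375]
print(ks)      # [2, 4, 8, 16]
```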

SLIDE 19

TD(Δ) for actor-critic algorithms

  • 1. Train the Ws as described
  • 2. Use the sum of the Ws instead of V in the policy update

We apply it to PPO and test it on Atari.
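In the actor-critic setting, step 2 amounts to feeding the recomposed sum into the advantage instead of a single critic. A hedged sketch with a plain one-step advantage (the function names are ours, and PPO in practice uses a generalized advantage estimate rather than this simplified form):

```python
def critic_value(w_outputs):
    """Recomposed value V_{gamma_Z}(s) = sum_j W_j(s)."""
    return sum(w_outputs)

def advantage(r_t, gamma, w_s, w_s_next):
    """One-step advantage with the summed critic: r + gamma * V(s') - V(s)."""
    return r_t + gamma * critic_value(w_s_next) - critic_value(w_s)

# Each list holds the per-time-scale W outputs at a state.
adv = advantage(1.0, 0.9, [0.5, 0.5], [1.0, 1.0])
```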

SLIDE 20

TD(Δ) for actor-critic algorithms

SLIDE 21

TD(Δ) for actor-critic algorithms

SLIDE 22

TD(Δ) for actor-critic algorithms

SLIDE 23

TD(Δ) for actor-critic algorithms

SLIDE 24

TD(Δ) – Atari Experiments

SLIDE 25

TD(Δ) – Atari Experiments

SLIDE 26

TD(Δ) – Atari Experiments

SLIDE 27

TD(Δ) – What does it learn? (Atari)

SLIDE 28

TD(Δ) – What does it learn? (Atari)

SLIDE 29

TD(Δ) – Benefits

  • More knobs to tune the bias-variance trade-off! :)
  • More insight into the value of the policy at different time-scales!
  • A Bellman update for learning separated value functions, which allows for some theoretical insights
  • Natural splitting for distributed computing

SLIDE 30

TD(Δ) – Downsides

  • More knobs to tune the bias-variance trade-off! :(
  • Somewhat more compute intensive

SLIDE 31

TD(Δ) Meets Reward Estimation

Joshua Romoff*, Peter Henderson*, Alexandre Piche, Vincent Francois-Lavet, and Joelle Pineau. "Reward Estimation for Variance Reduction in Deep Reinforcement Learning." In Conference on Robot Learning, pp. 674-699. 2018.

We previously demonstrated a simple property: by using a learned estimate of the reward, we can reduce variance in learning, especially in noisy environments.

SLIDE 32

TD(Δ) Meets Reward Estimation

Here, we are using many estimators and looking at a similar bias-variance trade-off.

An interesting future investigation would look into whether separating value functions across many estimators has similar natural benefits in the case of noisy rewards, as in our reward estimation work.

SLIDE 33

Other Extensions

  • Adding more Ws to move the discount factor toward 1
  • Q-learning extension
  • Using the natural time-scale split for distributed-computing updates

SLIDE 34

Thanks! More Questions?