SLIDE 1

Policy Gradient Methods for Reinforcement Learning with Function Approximation

NeurIPS 2000, Sutton, McAllester, Singh & Mansour. Presenter: Silviu Pitis. Date: January 21, 2020

SLIDE 2

Talk Outline

  • Problem statement, background & motivation
  • Topics:

– Statement of policy gradient theorem
– Derivation of policy gradient theorem
– Action-independent baselines
– Compatible value function approximation
– Convergence of policy iteration with compatible fn approx

SLIDE 3

Problem statement

We want to learn a parameterized behavioral policy $\pi_\theta(a|s)$ that optimizes the long-run sum of (discounted) rewards, $J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$. This is exactly the reinforcement learning problem!

note: the paper also considers the average reward formulation (same results apply)

SLIDE 4

Traditional approach: Greedy value-based methods

Traditional approaches (e.g., DP, Q-learning) learn a value function $Q(s, a)$. They then induce a policy using a greedy argmax: $\pi(s) = \arg\max_a Q(s, a)$.
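
A minimal sketch of this (my own illustration, not from the slides; the tabular Q-values are hypothetical):

```python
# Minimal sketch (my illustration; the Q-values are hypothetical): inducing
# a greedy policy from a learned tabular Q-function.
import numpy as np

Q = np.array([[0.1, 0.9],   # Q(s=0, a) for a in {0, 1}
              [0.4, 0.2]])  # Q(s=1, a)

def greedy_policy(s: int) -> int:
    """pi(s) = argmax_a Q(s, a)."""
    return int(np.argmax(Q[s]))

print([greedy_policy(s) for s in range(2)])  # -> [1, 0]
```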

SLIDE 5

Two problems with greedy, value-based methods

1) They can diverge when using function approximation, as small changes in the value function can cause large changes in the policy
2) Traditionally focused on deterministic actions, but the optimal policy may be stochastic when using function approximation (or when the environment is partially observed)

In the fully observed, tabular case, we are guaranteed to have an optimal deterministic policy.

SLIDE 6

Proposed approach: Policy gradient methods

  • Instead of acting greedily, policy gradient approaches parameterize the policy directly, and optimize it via gradient descent on the cost function $J(\theta)$.
  • NB1: the cost must be differentiable with respect to theta! Non-degenerate, stochastic policies ensure this.
  • NB2: Gradient descent converges to a local optimum of the cost function → so do policy gradient methods, but only if they are unbiased!
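
For concreteness (this update is standard background, not shown on the original slide), the generic ascent step on $J$ is

$$\theta_{k+1} = \theta_k + \alpha\, \nabla_\theta J(\theta_k),$$

where $\alpha$ is a step size; "descent on the cost" and "ascent on the reward" are the same update up to a sign.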

SLIDE 7

Stochastic Policy Value Function Visualization

Source: Me (2018)

SLIDE 8

Stochastic Policy Gradient Descent Visualization

Source: Dadashi et al. (ICLR 2019)

SLIDE 9

Unbiasedness is critical

  • Gradient descent converges → so do unbiased policy gradient methods!
  • Recall the definition of the bias of an estimator:

– An estimator $\hat{f}$ of a quantity $f$ has bias $\operatorname{Bias}(\hat{f}) = \mathbb{E}[\hat{f}] - f$.
– It is unbiased if its bias equals 0.

  • This is important to keep in mind: not all policy gradient algorithms are unbiased, so some may not converge to a local optimum of the cost function.

SLIDE 10

Recap

  • Traditional value-based methods may diverge when using function approximation, so we instead directly optimize the policy using gradient descent.
  • Let's now look at the paper's 3 contributions:

1) Policy gradient theorem: statement & derivation
2) Baselines & compatible value function approximation
3) Convergence of policy iteration with compatible function approx

SLIDE 11

Policy gradient theorem (2 forms)

Recall the objective $J(\theta)$. The theorem gives the gradient in two equivalent forms:

Sutton 2000: $\nabla_\theta J(\theta) = \sum_s d^\pi(s) \sum_a \nabla_\theta \pi_\theta(a|s)\, Q^\pi(s, a)$

Modern form: $\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a|s)\, Q^\pi(s, a)\right]$

NB: $Q^\pi$ here is the true future value of the policy, not an approximation!

SLIDE 12

The two forms are equivalent

They are linked by the likelihood-ratio identity $\nabla_\theta \pi_\theta(a|s) = \pi_\theta(a|s)\, \nabla_\theta \log \pi_\theta(a|s)$: substituting it into the Sutton 2000 form turns $\sum_s d^\pi(s) \sum_a \pi_\theta(a|s)\,(\cdot)$ into an expectation under the policy, which is exactly the modern form.

SLIDE 13

Trajectory Derivation: REINFORCE Estimator

The "score function gradient estimator," also known as the "REINFORCE gradient estimator":

$\nabla_\theta\, \mathbb{E}_{\tau \sim p_\theta}[R(\tau)] = \mathbb{E}_{\tau \sim p_\theta}\left[R(\tau)\, \nabla_\theta \log p_\theta(\tau)\right]$

  • Very generic, and very useful!

NB: $R(\tau)$ is arbitrary (i.e., it can be non-differentiable!)
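
A minimal runnable sketch of this estimator (my own illustration, not from the slides; the three-armed bandit and its reward means are assumptions):

```python
# Minimal sketch (my illustration): a Monte Carlo score-function / REINFORCE
# estimate of grad_theta E[R] for a hypothetical three-armed softmax bandit.
import numpy as np

rng = np.random.default_rng(0)
true_rewards = np.array([1.0, 2.0, 0.5])  # assumed mean reward per arm
theta = np.zeros(3)                       # policy logits

def softmax(z):
    z = z - z.max()                       # numerical stability
    e = np.exp(z)
    return e / e.sum()

def reinforce_grad(theta, n_samples=10_000):
    """Estimate grad_theta E_{a ~ pi_theta}[R(a)] via the score function."""
    probs = softmax(theta)
    grad = np.zeros_like(theta)
    for _ in range(n_samples):
        a = rng.choice(3, p=probs)             # sample an action
        r = true_rewards[a] + rng.normal()     # noisy reward R(tau)
        score = -probs.copy()                  # grad log pi(a) for a softmax
        score[a] += 1.0                        # policy is e_a - probs
        grad += r * score                      # R(tau) * grad log p(tau)
    return grad / n_samples

print(reinforce_grad(theta))  # points toward the higher-reward arm
```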

SLIDE 14

Intuition of Score function gradient estimator

Source: Emma Brunskill

SLIDE 15

Trajectory Derivation Continued

Expanding $\log p_\theta(\tau) = \sum_t \log \pi_\theta(a_t|s_t) + \sum_t \log P(s_{t+1}|s_t, a_t)$, the dynamics terms do not depend on $\theta$ and drop out, leaving $\nabla_\theta J = \mathbb{E}_\tau\left[R(\tau) \sum_t \nabla_\theta \log \pi_\theta(a_t|s_t)\right]$. Almost in modern form! Just one more step...

SLIDE 16

Trajectory Derivation, Final Step

Since earlier rewards do not depend on later actions, each $\nabla_\theta \log \pi_\theta(a_t|s_t)$ needs to be weighted only by the rewards from time $t$ onward, i.e., the return $G_t$. This gives $\nabla_\theta J \propto \mathbb{E}\left[\sum_t \nabla_\theta \log \pi_\theta(a_t|s_t)\, G_t\right]$, and this is now (proportional to) the modern form!

SLIDE 17

Variance Reduction

Source: Emma Brunskill

If f(x) is positive everywhere, we are always positively reinforcing the sampled actions! If we could somehow provide negative reinforcement for bad actions, we could reduce variance...
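
A small empirical check of this idea (my own sketch, continuing the hypothetical bandit from the REINFORCE example; the uniform policy and the mean-reward baseline are assumptions): subtracting an action-independent baseline leaves the mean gradient unchanged but shrinks its variance.

```python
# Sketch (my illustration): empirical variance of the score-function
# estimator with and without a mean-reward baseline, for a fixed policy.
import numpy as np

rng = np.random.default_rng(1)
true_rewards = np.array([1.0, 2.0, 0.5])  # assumed arm means
probs = np.full(3, 1.0 / 3.0)             # fixed uniform softmax policy

def grad_sample(baseline=0.0):
    a = rng.choice(3, p=probs)
    r = true_rewards[a] + rng.normal()    # noisy reward
    score = -probs.copy()
    score[a] += 1.0                       # grad log pi for a softmax
    return (r - baseline) * score         # (R - b) * grad log pi

no_b   = np.stack([grad_sample(0.0) for _ in range(20_000)])
with_b = np.stack([grad_sample(true_rewards.mean()) for _ in range(20_000)])
print("mean without baseline:", no_b.mean(0))    # the two means agree:
print("mean with baseline:   ", with_b.mean(0))  # the baseline adds no bias
print("variance without baseline:", no_b.var(0).sum())
print("variance with baseline:   ", with_b.var(0).sum())  # noticeably smaller
```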

SLIDE 19

Last step: Subtracting an Action-independent Baseline I

Source: Hado Van Hasselt

SLIDE 20

Last step: Subtracting an Action-independent Baseline II

Source: Hado Van Hasselt
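
The key fact behind both of these slides (stated here as standard background, since the slide equations were images): subtracting an action-independent baseline $b(s)$ does not bias the estimator, because

$$\mathbb{E}_{a \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(a|s)\, b(s)\right] = b(s) \sum_a \nabla_\theta \pi_\theta(a|s) = b(s)\, \nabla_\theta \sum_a \pi_\theta(a|s) = b(s)\, \nabla_\theta 1 = 0.$$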

SLIDE 21

Compatible Value Function Approximation

  • The policy gradient theorem uses an unbiased estimate of the future rewards, $Q^\pi(s, a)$ (e.g., the sampled return).
  • What if we use a value function $f_w(s, a)$ to approximate $Q^\pi(s, a)$? Does our convergence guarantee disappear?
  • In general, yes.
  • But not if we use a compatible function approximator: Sutton et al. provide a sufficient (but strong) condition for a function approximator to be compatible (i.e., to provide an unbiased policy gradient estimate).
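
For reference, the paper's compatibility condition (the notation $f_w$ follows the paper):

$$\nabla_w f_w(s, a) = \nabla_\theta \log \pi_\theta(a|s),$$

with $w$ at a local optimum of the mean squared error $\mathbb{E}_\pi\big[(Q^\pi(s, a) - f_w(s, a))^2\big]$. Under these conditions, substituting $f_w$ for $Q^\pi$ in the policy gradient theorem leaves the gradient exact.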

SLIDE 22

Source: Russ Salakhutdinov

SLIDE 23

Source: Russ Salakhutdinov

SLIDE 24

Recap: Compatible Value Function Approx.

  • If we approximate the true future reward $Q^\pi$ with an approximator $f_w$ such that the policy gradient estimator remains unbiased, gradient descent still converges to a local optimum.
  • Sutton uses this to prove the convergence of policy iteration when using a compatible value function approximator.

SLIDE 25

Critique I: Bias & Variance Tradeoffs

  • Monte Carlo returns provide high variance estimates, so we typically want to use a critic to estimate future returns.
  • But unless the critic is compatible, it will introduce bias.
  • "Tsitsiklis (personal communication) points out that [the critic] being linear in [the features $\nabla_\theta \log \pi_\theta(a|s)$] may be the only way to satisfy the [compatible value function approximation] condition."
  • Empirically speaking, we use non-compatible (biased) critics because they perform better.

SLIDE 26

Critique II: Policy Gradients are On-Policy

  • The policy gradient theorem is, by definition, on-policy.
  • Recall: on-policy methods learn from data that they themselves generate; off-policy methods (e.g., Q-learning) can learn from data produced by other (possibly unknown) policies.
  • To use off-policy data with policy gradients, we need to use importance sampling, which results in high variance.
  • This limits the ability to use data from previous iterates.
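
As standard background (not spelled out on the slide), the importance-sampled gradient reweights trajectories from a behavior policy $\beta$:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \beta}\left[\left(\prod_t \frac{\pi_\theta(a_t|s_t)}{\beta(a_t|s_t)}\right) R(\tau)\, \nabla_\theta \log p_\theta(\tau)\right],$$

and the product of per-step ratios is exactly what makes the variance explode with the horizon.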
SLIDE 27

Recap

  • Traditional value-based methods may diverge when using function approximation, so we instead directly optimize the policy using gradient descent.
  • We do this with the policy gradient theorem: $\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a|s)\, Q^\pi(s, a)\right]$
  • Some key takeaways:
  • The REINFORCE log-gradient trick is very useful (know it!)
  • We can reduce variance by using a baseline
  • There is a thing called compatible approximation, but to my knowledge it's not so practical
  • IMO, the main limitation of policy gradient methods is their on-policyness (but see DPG!)