10703 Deep Reinforcement Learning: Policy Gradient Methods – Part 3

SLIDE 1

10703 Deep Reinforcement Learning

Tom Mitchell October 8, 2018

Policy Gradient Methods – Part 3

Recommended readings: next slide. (not covered in Barto & Sutton)

SLIDE 2

Used Materials

  • Disclaimer: Much of the material and slides for this lecture were borrowed from Ruslan Salakhutdinov, who in turn borrowed from Rich Sutton’s RL class and David Silver’s Deep RL tutorial

SLIDE 3

Recommended Readings on Natural Policy Gradient and Convergence of Actor-Critic Learning

SLIDE 4

SLIDE 5

Actor-Critic

  • Monte-Carlo policy gradient still has high variance
  • We can use a critic to estimate the action-value function: Qw(s, a) ≈ Qπ(s, a)
  • Actor-critic algorithms maintain two sets of parameters:
  • Critic: updates action-value function parameters w
  • Actor: updates policy parameters θ, in the direction suggested by the critic
  • Actor-critic algorithms follow an approximate policy gradient
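The equations on this slide did not survive extraction; a standard reconstruction in the lecture's notation (a sketch, not a verbatim copy of the slide) would be:

Q_w(s, a) \approx Q^{\pi_\theta}(s, a)
\nabla_\theta J(\theta) \approx \mathbb{E}_{\pi_\theta}\left[ \nabla_\theta \log \pi_\theta(s, a) \, Q_w(s, a) \right]
\Delta \theta = \alpha \, \nabla_\theta \log \pi_\theta(s, a) \, Q_w(s, a)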
SLIDE 6

Reducing Variance Using a Baseline

  • We can subtract a baseline function B(s) from the policy gradient
  • This can reduce variance, without changing expectation!
  • A good baseline is the state value function Vπ(s)
  • So we can rewrite the policy gradient using the advantage function:
  • Note that it is the exact same policy gradient:
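The baseline and advantage equations are missing from the extracted text; the standard versions (assumed, not copied from the slide) are:

\mathbb{E}_{\pi_\theta}\left[ \nabla_\theta \log \pi_\theta(s, a) \, B(s) \right] = \sum_s d^{\pi_\theta}(s) \, B(s) \, \nabla_\theta \sum_a \pi_\theta(s, a) = 0
A^{\pi_\theta}(s, a) = Q^{\pi_\theta}(s, a) - V^{\pi_\theta}(s)
\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[ \nabla_\theta \log \pi_\theta(s, a) \, A^{\pi_\theta}(s, a) \right]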
SLIDE 7

Estimating the Advantage Function

  • For the true value function Vπ(s), the TD error:

is an unbiased estimate of the advantage function:

  • So we can use the TD error to compute the policy gradient
  • Remember the policy gradient
SLIDE 8

Estimating the Advantage Function

  • For the true value function Vπ(s), the TD error:

is an unbiased estimate of the advantage function

  • So we can use the TD error to compute the policy gradient
  • This approach only requires one set of critic parameters v
  • In practice we can use an approximate TD error
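The TD-error equations are missing from the extracted text; a reconstruction of the standard argument (not copied verbatim from the slide):

\delta^{\pi_\theta} = r + \gamma V^{\pi_\theta}(s') - V^{\pi_\theta}(s)
\mathbb{E}_{\pi_\theta}\left[ \delta^{\pi_\theta} \mid s, a \right] = Q^{\pi_\theta}(s, a) - V^{\pi_\theta}(s) = A^{\pi_\theta}(s, a)
\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[ \nabla_\theta \log \pi_\theta(s, a) \, \delta^{\pi_\theta} \right]
\delta_v = r + \gamma V_v(s') - V_v(s) \quad \text{(approximate TD error, using critic parameters } v\text{)}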
SLIDE 9

Dueling Networks

  • Split Q-network into two channels
  • Action-independent value function V(s,v)
  • Action-dependent advantage function A(s, a, w)

Wang et al., ICML 2016

  • Advantage function is defined as:
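The defining equations are not in the extracted text; the standard dueling decomposition (as in Wang et al., 2016, with the notation above) is:

A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s)
Q(s, a; v, w) = V(s; v) + A(s, a; w)

In the paper the advantage stream is additionally centered (its mean over actions is subtracted) so that V and A are identifiable.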
SLIDE 10

Advantage Actor-Critic Algorithm
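The algorithm box itself did not survive extraction. Below is a minimal, self-contained Python sketch of one-step advantage actor-critic; the toy chain MDP, step sizes, and episode count are illustrative assumptions, not taken from the slide.

import numpy as np

# One-step advantage actor-critic on a toy chain MDP (illustrative only).
n_states, n_actions = 5, 2
gamma, alpha_actor, alpha_critic = 0.99, 0.05, 0.1

theta = np.zeros((n_states, n_actions))  # actor: softmax policy parameters
v = np.zeros(n_states)                   # critic: tabular state-value estimates

def policy(s):
    prefs = theta[s] - theta[s].max()    # numerically stable softmax
    p = np.exp(prefs)
    return p / p.sum()

def env_step(s, a):
    # Toy dynamics: action 1 moves right, action 0 moves left;
    # reaching the right end gives reward +1 and ends the episode.
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    done = (s_next == n_states - 1)
    return s_next, (1.0 if done else 0.0), done

rng = np.random.default_rng(0)
for episode in range(500):
    s, done = 0, False
    while not done:
        p = policy(s)
        a = rng.choice(n_actions, p=p)
        s_next, r, done = env_step(s, a)

        # TD error serves as a one-sample estimate of the advantage A(s, a)
        td_target = r if done else r + gamma * v[s_next]
        delta = td_target - v[s]

        v[s] += alpha_critic * delta                    # critic update
        grad_log_pi = -p                                # gradient of log softmax w.r.t. theta[s]
        grad_log_pi[a] += 1.0
        theta[s] += alpha_actor * delta * grad_log_pi   # actor update

        s = s_next

The critic here is tabular for brevity; with function approximation, v and theta become parameterized models updated by the same TD error.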

SLIDE 11

So Far: Summary of PG Algorithms

  • The policy gradient has many equivalent forms (listed below)
  • Each leads to a stochastic gradient ascent algorithm
  • Critic uses policy evaluation (e.g. MC or TD learning) to estimate Qw(s, a) ≈ Qπ(s, a)
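The individual forms are not in the extracted text; the usual list (following David Silver's tutorial, from which this lecture borrows) is roughly:

\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[ \nabla_\theta \log \pi_\theta(s, a) \, v_t \right] \quad \text{(REINFORCE: Monte-Carlo return)}
\nabla_\theta J(\theta) \approx \mathbb{E}_{\pi_\theta}\left[ \nabla_\theta \log \pi_\theta(s, a) \, Q_w(s, a) \right] \quad \text{(Q actor-critic)}
\nabla_\theta J(\theta) \approx \mathbb{E}_{\pi_\theta}\left[ \nabla_\theta \log \pi_\theta(s, a) \, A_w(s, a) \right] \quad \text{(advantage actor-critic)}
\nabla_\theta J(\theta) \approx \mathbb{E}_{\pi_\theta}\left[ \nabla_\theta \log \pi_\theta(s, a) \, \delta \right] \quad \text{(TD actor-critic)}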
SLIDE 12

But will it converge if we use function approximation?? Under what conditions??

SLIDE 13

Bias in Actor-Critic Algorithms

  • Approximating the policy gradient introduces bias
  • A biased policy gradient may not find the right solution
  • Luckily, if we choose the value function approximation carefully, we can avoid introducing any bias
  • i.e. we can still follow the exact policy gradient
SLIDE 14

Compatible Function Approximation

  • If the following two conditions are satisfied:
  • 1. Value function approximator is compatible with the policy
  • 2. Value function parameters w minimize the mean-squared error
  • Then the policy gradient is exact
  • Remember the form of the policy gradient:
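The two conditions and the conclusion appeared as equations on the slide; the standard statement of the compatible function approximation theorem (Sutton et al., 1999) is:

1. \; \nabla_w Q_w(s, a) = \nabla_\theta \log \pi_\theta(s, a)
2. \; \varepsilon = \mathbb{E}_{\pi_\theta}\left[ \left( Q^{\pi_\theta}(s, a) - Q_w(s, a) \right)^2 \right] \text{ is minimized}
\Rightarrow \; \nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[ \nabla_\theta \log \pi_\theta(s, a) \, Q_w(s, a) \right]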
SLIDE 15

Proof

  • If w is chosen to minimize the mean-squared error ε, then the gradient of ε with respect to w must be zero
  • So Qw(s, a) can be substituted directly into the policy gradient
  • Remember:
SLIDE 16

Proof

  • If w is chosen to minimize the mean-squared error ε, then the gradient of ε with respect to w must be zero
  • So Qw(s, a) can be substituted directly into the policy gradient
  • Remember:

Note: the error ε need not be zero, it just needs to be minimized! Note: we only need this to within a constant!
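The proof steps were equations on the slide; a reconstruction of the standard argument:

\nabla_w \varepsilon = 0
\Rightarrow \; \mathbb{E}_{\pi_\theta}\left[ \left( Q^{\pi_\theta}(s, a) - Q_w(s, a) \right) \nabla_w Q_w(s, a) \right] = 0
\Rightarrow \; \mathbb{E}_{\pi_\theta}\left[ \left( Q^{\pi_\theta}(s, a) - Q_w(s, a) \right) \nabla_\theta \log \pi_\theta(s, a) \right] = 0 \quad \text{(by condition 1)}
\Rightarrow \; \mathbb{E}_{\pi_\theta}\left[ Q^{\pi_\theta}(s, a) \, \nabla_\theta \log \pi_\theta(s, a) \right] = \mathbb{E}_{\pi_\theta}\left[ Q_w(s, a) \, \nabla_\theta \log \pi_\theta(s, a) \right]
\Rightarrow \; \nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[ \nabla_\theta \log \pi_\theta(s, a) \, Q_w(s, a) \right]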

SLIDE 17

Compatible Function Approximation

  • If the following two conditions are satisfied:
  • 1. Value function approximator is compatible with the policy
  • 2. Value function parameters w minimize the mean-squared error
  • Then the policy gradient is exact
  • Remember the form of the policy gradient

How can we achieve this??

SLIDE 18

Compatible Function Approximation

  • If the following two conditions are satisfied:
  • 1. Value function approximator is compatible with the policy

How can we achieve this?? One way: make Qw and πθ both be linear functions of the same features of (s, a)

  • let Φ(s,a) be a vector of features describing the pair (s,a)
  • let Qw(s,a) = wᵀΦ(s,a), and let log πθ(s,a) = θᵀΦ(s,a)
  • then ∇w Qw(s,a) = Φ(s,a) = ∇θ log πθ(s,a), so condition 1 is satisfied
SLIDE 19

Compatible Function Approximation

How can we achieve this?? One way: make Qw and πθ both be linear functions of the same features of (s, a)

  • let Φ(s,a) be a vector of features describing the pair (s,a)
  • let Qw(s,a) = wᵀΦ(s,a), and let log πθ(s,a) = θᵀΦ(s,a)
  • then ∇w Qw(s,a) = Φ(s,a) = ∇θ log πθ(s,a), so condition 1 is satisfied

A variant using features Φ(s) of the state alone, with one weight vector per action:

  Qw(s,a) = waᵀ Φ(s)
  log πθ(s,a) = θaᵀ Φ(s)
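One way to relate this to the previous slide (an observation, not from the slides): the per-action form is the general linear form with block features,

\Phi(s, a) = e_a \otimes \Phi(s) \quad \Rightarrow \quad Q_w(s, a) = w^\top \Phi(s, a) = w_a^\top \Phi(s), \qquad \log \pi_\theta(s, a) = \theta^\top \Phi(s, a) = \theta_a^\top \Phi(s)

where w_a and θ_a are the blocks of w and θ corresponding to action a.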

SLIDE 20

Alternative Policy Gradient Directions

  • Generalized gradient ascent algorithms can follow any ascent direction
  • A good ascent direction can significantly speed convergence
  • Also, a policy can often be reparametrized without changing the action probabilities
  • For example, increasing the scores of all actions in a softmax policy by the same amount leaves the probabilities unchanged (see the sketch below)
  • The vanilla gradient is sensitive to these reparametrizations
  • but the natural gradient is not!
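A minimal sketch of the softmax reparameterization point (the notation h_\theta(s, a) for the action score is an assumption, not from the slide):

\pi_\theta(s, a) = \frac{\exp(h_\theta(s, a))}{\sum_b \exp(h_\theta(s, b))}, \qquad h_\theta(s, a) \to h_\theta(s, a) + c \;\Rightarrow\; \pi_\theta(s, a) \text{ is unchanged, since } e^{c} \text{ cancels}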
SLIDE 21

Natural Policy Gradient

  • The natural policy gradient is parameterization independent (i.e., not influenced by the set of parameters you use to define the policy)
  • it finds the ascent direction that is closest to the vanilla gradient, when changing the policy by a small, fixed amount
  • where Gθ is the Fisher information matrix
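The defining equations were on the slide; the standard forms are:

\nabla_\theta^{\mathrm{nat}} J(\theta) = G_\theta^{-1} \, \nabla_\theta J(\theta), \qquad G_\theta = \mathbb{E}_{\pi_\theta}\left[ \nabla_\theta \log \pi_\theta(s, a) \, \nabla_\theta \log \pi_\theta(s, a)^\top \right]

so the (i, j)th element of G_\theta is \mathbb{E}_{\pi_\theta}\left[ \frac{\partial \log \pi_\theta(s, a)}{\partial \theta_i} \, \frac{\partial \log \pi_\theta(s, a)}{\partial \theta_j} \right].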
SLIDE 22

Natural Policy Gradient

  • The natural policy gradient is parameterization independent (i.e., not influenced by the set of parameters you use to define the policy)
  • where Gθ is the Fisher information matrix
  • what is the (i, j)th element of Gθ?
  • what is Gθ if we have a parameterization that yields the natural gradient?

SLIDE 23

Natural Actor-Critic

  • Using compatible function approximation, the natural policy gradient simplifies
  • i.e. update actor parameters in the direction of the critic parameters!

Under the linear model:
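The simplification itself is missing from the extracted text; a reconstruction of the standard derivation, using the compatible (linear-in-score-features) critic Q_w(s, a) = \nabla_\theta \log \pi_\theta(s, a)^\top w:

\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[ \nabla_\theta \log \pi_\theta(s, a) \, Q_w(s, a) \right]
= \mathbb{E}_{\pi_\theta}\left[ \nabla_\theta \log \pi_\theta(s, a) \, \nabla_\theta \log \pi_\theta(s, a)^\top \right] w = G_\theta \, w
\Rightarrow \; G_\theta^{-1} \nabla_\theta J(\theta) = w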

SLIDE 24

from: Peters and Schaal

SLIDE 25

from: Kakade

SLIDE 26

Summary of Policy Gradient Algorithms

  • The policy gradient has many equivalent forms
  • Each leads to a stochastic gradient ascent algorithm
  • Critic uses policy evaluation (e.g. MC or TD learning) to estimate Qw(s, a) ≈ Qπ(s, a)