Deep Reinforcement Learning
Lecture 1
Sergey Levine
How do we build intelligent machines? Intelligent machines must be able to adapt. Deep learning helps us handle unstructured environments. Reinforcement learning provides a formalism for behavior.
decisions (actions) → consequences (observations, rewards)
Mnih et al. ‘13 Schulman et al. ’14 & ‘15 Levine*, Finn*, et al. ‘16
standard computer vision: features (e.g. HOG) → mid-level features (e.g. DPM) → classifier (e.g. SVM) [Felzenszwalb ‘08]
deep learning: end-to-end training
standard reinforcement learning: features → more features → linear policy
deep reinforcement learning: end-to-end training
perception → action (run away): the sensorimotor loop
robotic control pipeline: state estimation (e.g. vision) → modeling & prediction → planning → low-level control → controls
The reinforcement learning problem is the AI problem! decisions (actions), consequences, rewards.
Robotics: actions are motor current or torque; observations are camera images; rewards are a task success measure (e.g., running speed).
Inventory management: actions are what to purchase; observations are inventory levels; rewards are profit.
Machine translation: actions are words in French; observations are words in English; rewards are BLEU score.
You may not need RL when your system is making a single, isolated decision (e.g., classification or regression), and when that decision does not affect future decisions.
Common applications: robotics, autonomous driving, language & dialogue (structured prediction), business operations, finance.
Limited supervision: you know what you want, but not how to get it. Actions have consequences.
L.-J. Lin, “Reinforcement learning for robots using neural networks.” 1993 Tesauro, 1995
Atari games:
Q-learning: Mnih, Kavukcuoglu, Silver, Graves, Antonoglou, et al. “Playing Atari with Deep Reinforcement Learning”. (2013).
Policy gradients: Mnih et al. “Asynchronous methods for deep reinforcement learning”. (2016).
Real-world robots:
Guided policy search: Levine*, Finn*, et al. “End-to-end training of deep visuomotor policies”. (2015).
Q-learning: Gu*, Holly*, et al. “Deep Reinforcement Learning for Robotic Manipulation with Asynchronous Off-Policy Updates”. (2016).
Beating Go champions:
Supervised learning + policy gradients + value functions + Monte Carlo tree search: Silver et al. “Mastering the game of Go with deep neural networks and tree search”. Nature (2016).
Images: Bojarski et al. ‘16, NVIDIA
training data supervised learning
Andrey Markov, Richard Bellman
we’ll come back to partially observed later
infinite horizon case finite horizon case
a convenient identity
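The “convenient identity” here is the standard log-derivative trick, which lets us write the gradient of the RL objective as an expectation we can sample (a minimal statement in the usual trajectory notation):

\pi_\theta(\tau)\,\nabla_\theta \log \pi_\theta(\tau) = \pi_\theta(\tau)\,\frac{\nabla_\theta \pi_\theta(\tau)}{\pi_\theta(\tau)} = \nabla_\theta \pi_\theta(\tau)

so that

\nabla_\theta J(\theta) = \nabla_\theta\, \mathbb{E}_{\tau \sim \pi_\theta(\tau)}[r(\tau)] = \mathbb{E}_{\tau \sim \pi_\theta(\tau)}\big[\nabla_\theta \log \pi_\theta(\tau)\, r(\tau)\big]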
generate samples (i.e. run the policy) fit a model to estimate return improve the policy
training data supervised learning
good stuff is made more likely bad stuff is made less likely simply formalizes the notion of “trial and error”!
high variance
“reward to go”
but… are we allowed to do that?? subtracting a baseline is unbiased in expectation! average reward is not the best baseline, but it’s pretty good!
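A minimal NumPy sketch of the reward-to-go computation with an average-return baseline; the function name, discount value, and example rewards are illustrative, not from the slides:

import numpy as np

def reward_to_go(rewards, gamma=0.99):
    # discounted sum of future rewards at each time step of one trajectory
    rtg = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg

rtg = reward_to_go(np.array([0.0, 0.0, 1.0, 1.0]))
# subtracting a constant baseline (e.g., the average return over the batch)
# keeps the estimator unbiased in expectation while reducing variance
advantages = rtg - rtg.mean()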
a convenient identity
(image from Peters & Schaal 2008)
Essentially the same problem as this:
see Schulman, L., Moritz, Jordan, Abbeel (2015) Trust region policy optimization (figure from Peters & Schaal 2008)
Sergey Levine
deep networks + RL = end-to-end optimization of decision making and control
SGD is great
gradient
Mnih et al. ‘13 Schulman et al. ’14 & ‘15 Levine*, Finn*, et al. ‘16
generate samples (i.e. run the policy) fit a model to estimate return improve the policy
RL objective and follow gradient
practical
gradient/trust region
“reward to go”
Pseudocode example (with discrete actions): Maximum likelihood:
# Given:
# actions - (N*T) x Da tensor of (one-hot) actions
# states - (N*T) x Ds tensor of states
# Build the graph:
logits = policy.predictions(states)  # This should return (N*T) x Da tensor of action logits
negative_likelihoods = tf.nn.softmax_cross_entropy_with_logits(labels=actions, logits=logits)
loss = tf.reduce_mean(negative_likelihoods)
gradients = tf.gradients(loss, variables)
Pseudocode example (with discrete actions): Policy gradient:
# Given:
# actions - (N*T) x Da tensor of (one-hot) actions
# states - (N*T) x Ds tensor of states
# q_values - (N*T) x 1 tensor of estimated state-action values
# Build the graph:
logits = policy.predictions(states)  # This should return (N*T) x Da tensor of action logits
negative_likelihoods = tf.nn.softmax_cross_entropy_with_logits(labels=actions, logits=logits)
# squeeze q_values to shape (N*T,) so the elementwise weighting matches shapes
weighted_negative_likelihoods = tf.multiply(negative_likelihoods, tf.squeeze(q_values, axis=1))
loss = tf.reduce_mean(weighted_negative_likelihoods)
gradients = tf.gradients(loss, variables)
estimator as control variate for variance reduction
Schulman, Levine, Moritz, Jordan, Abbeel. ‘15
automatic step adjustment
continuous actions
(Duan et al. ‘16)
Williams (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning: introduces REINFORCE algorithm
Baxter & Bartlett (2001). Infinite-horizon policy-gradient estimation: temporally decomposed policy gradient (not the first paper on this! see actor-critic section later)
Peters & Schaal (2008). Reinforcement learning of motor skills with policy gradients: very accessible overview of optimal baselines and natural gradient
Schulman, Levine, Moritz, Jordan, Abbeel (2015). Trust region policy optimization: deep RL with natural policy gradient and adaptive step size
Schulman, Wolski, Dhariwal, Radford, Klimov (2017). Proximal policy optimization algorithms: deep RL with importance sampled policy gradient
generate samples (i.e. run the policy) fit a model to estimate return improve the policy
“reward to go”
“reward to go”
the better this estimate, the lower the variance unbiased, but high variance single-sample estimate
generate samples (i.e. run the policy) fit a model to estimate return improve the policy
generate samples (i.e. run the policy) fit a model to estimate return improve the policy
generate samples (i.e. run the policy) fit a model to estimate return improve the policy
the same function should fit multiple samples!
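In the TF pseudocode style used earlier, Monte Carlo evaluation of the critic is just regression onto observed returns (a sketch; value_net, states, returns, and value_variables are placeholders analogous to policy and q_values above):

import tensorflow as tf

# values: (N*T) x 1 tensor of V(s) predictions, returns: (N*T) x 1 Monte Carlo returns
values = value_net.predictions(states)
value_loss = tf.reduce_mean(tf.square(values - returns))
value_gradients = tf.gradients(value_loss, value_variables)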
generate samples (i.e. run the policy) fit a model to estimate return improve the policy
episodic tasks continuous/cyclical tasks
two network design + simple & stable
shared network design
works best with a batch (e.g., parallel workers) synchronized parallel actor-critic asynchronous parallel actor-critic
networks
generate samples (i.e. run the policy) fit a model to estimate return improve the policy
High-dimensional continuous control with generalized advantage estimation (Schulman, Moritz, L., Jordan, Abbeel ‘16)
return estimates and critic called generalized advantage estimation (GAE)
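For reference, the generalized advantage estimator from that paper combines one-step TD errors with an exponentially weighted sum (γ is the discount, λ trades bias against variance):

\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t), \qquad \hat{A}_t^{\mathrm{GAE}(\gamma,\lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l\, \delta_{t+l}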
Asynchronous methods for deep reinforcement learning (Mnih, Badia, Mirza, Graves, Lillicrap, Harley, Silver, Kavukcuoglu ‘16)
batch
Sutton, McAllester, Singh, Mansour (1999). Policy gradient methods for reinforcement learning with function approximation: actor-critic algorithms with value function approximation
Mnih, Badia, Mirza, Graves, Lillicrap, Harley, Silver, Kavukcuoglu (2016). Asynchronous methods for deep reinforcement learning: A3C -- parallel online actor-critic
Schulman, Moritz, Levine, Jordan, Abbeel (2016). High-dimensional continuous control using generalized advantage estimation: batch-mode actor-critic with blended Monte Carlo and function approximator returns
Gu, Lillicrap, Ghahramani, Turner, Levine (2017). Q-Prop: sample-efficient policy gradient with an off-policy critic: policy gradient with Q-function control variate
forget policies, let’s just do this!
generate samples (i.e. run the policy) fit a model to estimate return improve the policy
High level idea:
generate samples (i.e. run the policy) fit a model to estimate return improve the policy
how to do this?
just use the current estimate here
generate samples (i.e. run the policy) fit a model to estimate return improve the policy
approximates the new value!
generate samples (i.e. run the policy) fit a model to estimate return improve the policy
generate samples (i.e. run the policy) fit a model to estimate return improve the policy
curse of dimensionality
need to know outcomes for different actions! Back to policy iteration… can fit this using samples
doesn’t require simulation of actions!
+ works even for off-policy samples (unlike actor-critic) + only one network, no high-variance policy gradient
forget policy, compute value directly can we do this with Q-values also, without knowing the transitions?
dataset of transitions Fitted Q-iteration
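A schematic tabular sketch of the fitted Q-iteration loop; the toy dataset and sizes below are made up for illustration, and with function approximation the exact tabular assignment becomes a regression step:

import numpy as np

num_states, num_actions, gamma = 5, 2, 0.9
Q = np.zeros((num_states, num_actions))
# dataset of transitions (s, a, r, s'), collected by any policy with broad support
dataset = [(0, 1, 0.0, 1), (1, 0, 1.0, 2), (2, 1, 0.0, 3), (3, 0, 1.0, 4)]

for _ in range(50):  # K outer iterations
    for s, a, r, s_next in dataset:
        target = r + gamma * Q[s_next].max()  # y = r + gamma * max_a' Q(s', a')
        Q[s, a] = target                      # tabular case: "regression" sets Q exactly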
most guarantees are lost when we leave the tabular case (e.g., when we use neural network function approximation)
generate samples (i.e. run the policy) fit a model to estimate return improve the policy
“epsilon-greedy” final policy: why is this a bad idea for step 1? “Boltzmann exploration”
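Minimal sketches of the two exploration rules mentioned here (the epsilon and temperature values are illustrative):

import numpy as np

def epsilon_greedy(q_values, epsilon=0.1):
    # with probability epsilon pick a random action, otherwise the greedy one
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_values))
    return int(np.argmax(q_values))

def boltzmann(q_values, temperature=1.0):
    # sample actions with probability proportional to exp(Q / temperature)
    logits = np.asarray(q_values) / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))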
generate samples (i.e. run the policy) fit a model to estimate return improve the policy
have a policy
iteration
Q-learning is not gradient descent! no gradient through target value
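In the TF pseudocode style used earlier, “no gradient through target value” corresponds to stopping gradients on the bootstrapped target; q_net, states, next_states, rewards, gamma, actions_onehot, and variables are placeholders in this sketch:

import tensorflow as tf

next_q = q_net.predictions(next_states)          # (N) x Da tensor of Q(s', a') values
targets = rewards + gamma * tf.stop_gradient(tf.reduce_max(next_q, axis=1))
q_sa = tf.reduce_sum(q_net.predictions(states) * actions_onehot, axis=1)
loss = tf.reduce_mean(tf.square(q_sa - targets))
gradients = tf.gradients(loss, variables)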
synchronized parallel Q-learning asynchronous parallel Q-learning
special case with K = 1, and one gradient step any policy will work! (with broad support) just load data from a buffer here dataset of transitions Fitted Q-iteration still use one gradient step
dataset of transitions (“replay buffer”)
Q-learning + samples are no longer correlated + multiple samples in the batch (low-variance gradient) but where does the data come from? need to periodically feed the replay buffer…
K = 1 is common, though larger K more efficient dataset of transitions (“replay buffer”)
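A minimal replay buffer sketch (capacity and interface are illustrative); sampling uniformly from it is what decorrelates the samples in each batch:

import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100000):
        self.buffer = deque(maxlen=capacity)  # FIFO: old transitions fall off the end

    def add(self, transition):
        # transition is a (s, a, r, s', done) tuple
        self.buffer.append(transition)

    def sample(self, batch_size):
        # uniform sampling breaks temporal correlation between consecutive steps
        return random.sample(self.buffer, batch_size)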
Q-learning
Q-learning is not gradient descent! no gradient through target value
use replay buffer
This is still a problem!
perfectly well-defined, stable regression
targets don’t change in inner loop! supervised regression
Mnih et al. ‘13
just SGD
dataset of transitions (“replay buffer”) target parameters current parameters
dataset of transitions (“replay buffer”) target parameters current parameters
at the same speed
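Two common ways to update the target parameters, sketched with illustrative names (params and target_params as lists of arrays; tau and the update period are made up):

# (a) classic DQN style: hard copy every N gradient steps
if step % target_update_period == 0:
    target_params = [p.copy() for p in params]

# (b) Polyak averaging: every step, move the target slightly toward the current
# parameters, so all of them change smoothly at the same speed
tau = 0.005
target_params = [(1 - tau) * tp + tau * p for tp, p in zip(target_params, params)]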
loop of process 1
What’s the problem with continuous actions?
this max this max
How do we perform the max?
particularly problematic (inner loop of training)
Option 1: optimization
Simple solution:
+ dead simple + efficiently parallelizable
but… do we care? How good does the target need to be anyway?
More accurate solution:
works OK, for up to about 40 dimensions
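The “dead simple” option sketched in code: approximate the max over continuous actions by evaluating Q on a batch of sampled actions (q_fn, the bounds, and the sample count are placeholders):

import numpy as np

def approx_max_q(q_fn, state, action_dim, num_samples=100, low=-1.0, high=1.0):
    # evaluate Q(s, a) on uniformly sampled actions and keep the best one;
    # CEM refines this by iteratively refitting the sampling distribution
    candidates = np.random.uniform(low, high, size=(num_samples, action_dim))
    values = np.array([q_fn(state, a) for a in candidates])
    best = int(values.argmax())
    return candidates[best], values[best]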
Option 2: use function class that is easy to optimize
Gu, Lillicrap, Sutskever, L., ICML 2016
NAF: Normalized Advantage Functions
+ no change to algorithm + just as efficient as Q-learning
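The NAF parameterization makes the maximization trivial by construction (with P_\phi(s) positive definite):

Q_\phi(s, a) = V_\phi(s) - \tfrac{1}{2}\big(a - \mu_\phi(s)\big)^{\top} P_\phi(s)\, \big(a - \mu_\phi(s)\big), \qquad \arg\max_a Q_\phi(s, a) = \mu_\phi(s), \quad \max_a Q_\phi(s, a) = V_\phi(s)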
Option 3: learn an approximate maximizer DDPG (Lillicrap et al., ICLR 2016) “deterministic” actor-critic (really approximate Q-learning)
Option 3: learn an approximate maximizer
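In DDPG the actor is trained to be an approximate maximizer of the critic, and the target values use the target networks of both:

\theta \leftarrow \arg\max_\theta\; \mathbb{E}_s\big[Q_\phi(s, \mu_\theta(s))\big], \qquad y_j = r_j + \gamma\, Q_{\phi'}\big(s'_j,\, \mu_{\theta'}(s'_j)\big)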
Slide partly borrowed from J. Schulman
simple and no downsides
Adam optimizer can help too
Slide partly borrowed from J. Schulman
generate samples (i.e. run the policy) fit a model to estimate return improve the policy
actions
reinforcement learning from raw visual data,” Lange & Riedmiller ‘12
latent space learned with autoencoder
function approximation (but neural net for embedding)
through deep reinforcement learning,” Mnih et al. ‘13
convolutional networks
target network
with double Q-learning (and other tricks)
with deep reinforcement learning and …,” Gu*, Holly*, et al. ‘17
NAF (quadratic in actions)
target network
simulator step for efficiency
multiple robots
“Composable Deep Reinforcement Learning for Robotic Manipulation,” Haarnoja, et al. ’18
and “Reinforcement Learning with Deep Energy-Based Policies” (Haarnoja et al. ‘18 & ‘17)
maximum entropy Q-learning (we’ll cover this if time permits)
network
sim step
provides for robust policies
Lange & Riedmiller ‘12: image-based Q-learning method using autoencoders to construct embeddings.
Mnih et al. ‘13: Q-learning with convolutional networks for playing Atari.
Van Hasselt, Guez, Silver ‘15, “Deep reinforcement learning with double Q-learning”: effective trick to improve performance of deep Q-learning.
Lillicrap et al. ‘16, “Continuous control with deep reinforcement learning”: continuous Q-learning with actor network for approximate maximization.
Gu, Lillicrap, Sutskever, L. ‘16, “Continuous deep Q-learning with model-based acceleration”: continuous Q-learning with action-quadratic value functions.
Wang et al. ‘16, “Dueling network architectures for deep reinforcement learning”: separates value and advantage estimation in Q-function.
assume this is unknown don’t even attempt to learn it
1. Games (e.g., Go) 2. Easily modeled systems (e.g., navigating a car) 3. Simulated environments (e.g., simulated robots, video games)
1. System identification – fit unknown parameters of a known model 2. Learning – fit a general-purpose model to observed transition data
Does knowing the dynamics make things easier? Often, yes!
simplest method: guess & check “random shooting method”
can we do better? typically use Gaussian distribution see also: CMA-ES (sort of like CEM with momentum)
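A minimal cross-entropy method (CEM) planner over action sequences; cost_fn stands in for a rollout through the model or simulator, and all sizes are illustrative:

import numpy as np

def cem_plan(cost_fn, horizon, action_dim, iters=5, pop=500, elites=50):
    # iteratively refit a Gaussian over action sequences to the lowest-cost samples
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(iters):
        samples = mean + std * np.random.randn(pop, horizon, action_dim)
        costs = np.array([cost_fn(seq) for seq in samples])
        elite = samples[np.argsort(costs)[:elites]]
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mean  # the refined mean is the planned action sequence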
using our knowledge of physics, and fit just a few parameters
expressive model classes
go right to get higher!
every N steps
every N steps
each individual plan needs to be
well here!
backprop through the learned model into the policy: easy for deterministic policies, but also possible for stochastic policy (more on this later)
model-based RL provides data supervised learning provides reward term Levine*, Finn*, et al. End-to-end training of deep visuomotor policies. 2016.
supervised learning model-based RL
policy search”)
every N steps
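Putting the pieces together, a high-level sketch of the replanning (MPC) loop; every name here (collect_random_rollouts, fit_dynamics, rollout_cost, cem_plan, env.step) is a placeholder for the corresponding step, not an actual API:

data = collect_random_rollouts(env)                  # base policy, e.g. random
for iteration in range(num_iterations):
    model = fit_dynamics(data)                       # supervised learning on (s, a) -> s'
    s = env.reset()
    for t in range(episode_length):
        plan = cem_plan(lambda seq: rollout_cost(model, s, seq), horizon, action_dim)
        s_next, reward, done = env.step(plan[0])     # execute only the first planned action
        data.append((s, plan[0], s_next))            # aggregate the new transition
        s = s_next                                   # then replan from the new state
        if done:
            break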
Nagabandi, Yang, Asmar, Pandya, Kahn, L., Fearing. arxiv 2018
Nagabandi, Kahn, Fearing, L. ICRA 2018 pure model-based (about 10 minutes real time) model-free training (about 10 days…)
need to not overfit here… …but still have high capacity over here
exceeds performance of model-free after 40k steps (about 10 minutes of real time)
Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models Chua, Calandra, McAllister, L.
PETS: Probabilistic Ensembles with Trajectory Sampling
Ebert, Finn, Lee, Levine. 2017. Self-Supervised Visual Planning with Temporal Skip Connections. CoRL 2017.
Designated Pixel Goal Pixel
sample-efficiency spectrum: model-based deep RL (e.g. PETS, guided policy search); model-based “shallow” RL (e.g. PILCO); replay buffer value estimation methods (Q-learning, DDPG, NAF, SAC, etc.); policy gradient methods (e.g. TRPO); fully online methods (e.g. A3C); gradient-free methods (e.g. NES, CMA, etc.). Least efficient end: 100,000,000 steps (100,000 episodes) (~ 15 days real time)
Wang et al. ‘17 TRPO+GAE (Schulman et al. ‘16) half-cheetah (slightly different version)
10,000,000 steps (10,000 episodes) (~ 1.5 days real time)
half-cheetah Gu et al. ‘16
1,000,000 steps (1,000 episodes) (~3 hours real time)
Chebotar et al. ’17 (note log scale)
roughly a 10x gap between each category; about 20 minutes of experience on a real robot
Chua et al. ’18: Deep Reinforcement Learning in a Handful of Trials
30,000 steps (30 episodes) (~5 min real time)
are you learning in a simulator? is simulation cost negligible compared to training cost? how patient are you? Roughly: cheap simulation favors on-policy methods (TRPO, PPO, A3C); costlier samples favor off-policy value-based methods (DDPG, NAF, SQL, SAC, TD3); the least patient setting (e.g., real-world data) favors model-based RL (GPS, PETS). BUT: if you have a simulator, you can compute gradients through it – do you need model-free RL?
Lillicrap et al. “Continuous control…” Gu et al. “Continuous deep Q-learning…” Haarnoja et al. “Reinforcement learning with deep energy-based…” Haarnoja et al. “Soft actor-critic” Fujimoto et al. “Addressing function approximation error…”
estimators are typically not contractions, hence no guarantee of convergence
buffer size, clipping, sensitivity to learning rates, etc.
backpropagation through time
Henderson et al. ‘17, “Deep Reinforcement Learning that Matters.”
world
answer is “not very”
algorithm x number of runs to sweep
are less sensitive to hyperparameters?
viable tool for real-world problems
(many tasks, many situations, etc.)
homework to finish running
impractical
simulators
Pinto & Gupta, 2015 Levine et al. 2016
physics) is more task-agnostic
sense that we have “features” in computer vision?
learning (Sutton et al. ’99)
different tasks, you need to get those tasks somewhere!
demonstration (inverse reinforcement learning)
decision making
learn and represent complex input-output mappings
domains governed by simple, known rules
inputs, given enough experience
provided expert behavior
to learn more: see rail.eecs.berkeley.edu/deeprlcourse