Introduction to Reinforcement Learning
CS 294-112: Deep Reinforcement Learning
Sergey Levine

Class Notes
1. Homework 1 is due next Wednesday! Remember that Monday is a holiday, so there are no office hours.
2. Remember to start forming final project groups.
Images: Bojarski et al. ‘16, NVIDIA
[Figure: imitation learning: training data used to learn a driving policy via supervised learning]
Andrey Markov, Richard Bellman
we’ll come back to the partially observed case later
state-action marginal
stationary distribution (stationary = the same before and after transition)
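As a concrete illustration (not from the slides), the stationary distribution of a small, made-up Markov chain can be found by power iteration; "stationary" means applying the transition operator leaves the distribution unchanged:

```python
import numpy as np

# Hypothetical 3-state Markov chain: T[i, j] = p(s' = j | s = i).
T = np.array([
    [0.9, 0.1, 0.0],
    [0.2, 0.7, 0.1],
    [0.1, 0.0, 0.9],
])

# Power iteration: repeatedly push a distribution through the transition
# operator until it stops changing, i.e. until mu = T^T mu.
mu = np.ones(3) / 3.0
for _ in range(1000):
    mu = T.T @ mu

# mu is now (numerically) stationary: the same before and after a transition.
```

The fixed point is the eigenvector of the transposed transition matrix with eigenvalue 1, normalized to sum to 1.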
infinite horizon case / finite horizon case
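A quick numerical sketch of the two cases, using a toy reward stream of my own choosing: in the finite horizon case the return is a plain sum over T steps, while in the infinite horizon case the discount factor keeps the sum finite:

```python
# Toy constant reward stream (hypothetical values, just for illustration).
rewards = [1.0] * 5
gamma = 0.9

# Finite horizon case: sum rewards over T = 5 steps.
finite_return = sum(rewards)

# Infinite horizon case: discount future rewards so the sum stays finite.
discounted_return = sum(gamma**t * r for t, r in enumerate(rewards))

# With constant reward r forever, the discounted return converges to
# r / (1 - gamma), here 1 / 0.1 = 10.
limit = 1.0 / (1 - gamma)
```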
generate samples (i.e. run the policy) → fit a model / estimate the return → improve the policy → repeat
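The three-box loop above can be sketched end to end on a toy problem. Everything below is a made-up stand-in (a bandit-style environment with three actions), not the course's code; it only shows the generate / fit / improve cycle:

```python
import random

random.seed(0)
TRUE_MEANS = [0.1, 0.5, 0.9]  # hypothetical unknown reward for each action

def run_policy(policy, n=100):
    """Generate samples (i.e. run the policy)."""
    return [(a, TRUE_MEANS[a] + random.gauss(0, 0.1))
            for a in (policy() for _ in range(n))]

def estimate_returns(samples, n_actions=3):
    """Fit a model / estimate the return: average reward per action."""
    totals, counts = [0.0] * n_actions, [0] * n_actions
    for a, r in samples:
        totals[a] += r
        counts[a] += 1
    return [t / c if c else 0.0 for t, c in zip(totals, counts)]

def improve_policy(estimates, eps=0.1):
    """Improve the policy: epsilon-greedy w.r.t. the estimated returns."""
    best = max(range(len(estimates)), key=lambda a: estimates[a])
    return lambda: (best if random.random() > eps
                    else random.randrange(len(estimates)))

policy = lambda: random.randrange(3)  # start from a random policy
for _ in range(10):  # the loop from the diagram
    samples = run_policy(policy)
    estimates = estimate_returns(samples)
    policy = improve_policy(estimates)
```

After a few trips around the loop, the estimates single out the best action; real algorithms differ in what the "fit" and "improve" boxes contain, which is the subject of the rest of the lecture.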
a simple example: generate samples = collect data; fit a model = update the model f so that f(s, a) ≈ s'; improve the policy = update the policy with backprop through f
generate samples (i.e. run the policy): a real robot/car/power grid/whatever runs at 1x real time, until we invent time travel; the MuJoCo simulator runs at up to 10,000x real time
fit a model / estimate the return, and improve the policy: ranges from trivial and fast to expensive, depending on the algorithm
what if we knew the dynamics?
value function based methods: fit V(s) or Q(s, a) (no explicit policy), then use it to improve the policy
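A minimal tabular sketch of that idea (the MDP below is invented for illustration): fit Q(s, a) by repeated Bellman backups, and note that the policy only ever exists implicitly, as the greedy argmax over Q:

```python
# Hypothetical deterministic MDP: P[s][a] = (next_state, reward).
P = {
    0: {0: (0, 0.0), 1: (1, 0.0)},
    1: {0: (0, 1.0), 1: (1, 0.5)},
}
gamma = 0.9

# Fit Q(s, a) with tabular Q-iteration (repeated Bellman backups).
Q = {s: {a: 0.0 for a in P[s]} for s in P}
for _ in range(100):
    for s in P:
        for a in P[s]:
            s2, r = P[s][a]
            Q[s][a] = r + gamma * max(Q[s2].values())

# No explicit policy was ever represented; recover one by acting greedily.
policy = {s: max(Q[s], key=Q[s].get) for s in Q}
```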
model-based methods: fit a model of the dynamics, then use it to improve the policy (essentially backpropagation to optimize over actions)
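To make "backpropagation to optimize over actions" concrete, here is a one-step toy version with invented dynamics and reward: since the model s' = s + a and the reward r(s') = -(s' - goal)^2 are both differentiable, the action can be improved by gradient ascent (the gradient is written out by hand rather than with an autodiff library):

```python
# Made-up differentiable model: s' = s + a, reward r(s') = -(s' - goal)^2.
s, goal = 0.0, 3.0
a = 0.0          # initial action
lr = 0.1         # step size for gradient ascent

for _ in range(200):
    s_next = s + a
    # Chain rule ("backprop" through the model):
    # dr/da = dr/ds' * ds'/da = -2 * (s' - goal) * 1
    grad = -2.0 * (s_next - goal)
    a += lr * grad   # ascend the reward

# The optimized action drives the next state to the goal.
```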
tradeoffs between algorithms in different settings
sample efficiency = how many samples do we need to get a good policy?
most important question: is the algorithm off policy?
off policy: able to improve the policy without generating new samples from that policy
on policy: each time the policy is changed, even a little bit, we need to generate new samples
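A small sketch of what being off policy buys (the chain MDP below is invented for this example): Q-learning keeps improving from a fixed replay buffer gathered once by a random behavior policy, never generating new samples from the policy it is improving:

```python
import random

random.seed(0)

def step(s, a):
    """Toy 4-state chain: action 1 moves right, action 0 moves left;
    reward 1 for landing in state 3."""
    s2 = min(s + 1, 3) if a == 1 else max(s - 1, 0)
    return s2, (1.0 if s2 == 3 else 0.0)

# Collect a buffer ONCE, with a uniformly random behavior policy.
buffer = []
for _ in range(2000):
    s, a = random.randrange(4), random.randrange(2)
    s2, r = step(s, a)
    buffer.append((s, a, s2, r))

# Off-policy Q-learning: train entirely from the fixed buffer.
gamma, alpha = 0.9, 0.5
Q = [[0.0, 0.0] for _ in range(4)]
for _ in range(20):
    for s, a, s2, r in buffer:
        target = r + gamma * max(Q[s2])
        Q[s][a] += alpha * (target - Q[s][a])

# The improved (greedy) policy: always move right, toward the reward.
greedy = [max((0, 1), key=lambda a: Q[s][a]) for s in range(4)]
```

An on-policy method would instead have to interact with the environment again after every policy change, which is what makes the sample-efficiency gap below so large.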
(even just one gradient step counts as changing the policy)
sample efficiency spectrum, from less efficient (more samples) to more efficient (fewer samples):
evolutionary / gradient-free algorithms → on-policy policy gradient algorithms → actor-critic style methods → off-policy Q-function learning → model-based deep RL → model-based shallow RL
Why is any of this even a question???
stability: value function fitting at best minimizes the error of the fit, and often doesn't optimize anything in the nonlinear case
common assumptions: full observability (value function fitting methods); episodic learning (pure policy gradient methods, some model-based RL methods); continuity or smoothness (some continuous value function learning methods, some model-based RL methods)
Case studies:
Playing Atari with deep reinforcement learning, Mnih et al. ’13: Q-learning with convolutional neural networks
End-to-end training of deep visuomotor policies, Levine*, Finn* ’16: guided policy search (model-based RL) for image-based robotic manipulation
High-dimensional continuous control with generalized advantage estimation, Schulman et al.
Q-learning with function approximation for real-world robotic grasping