Deep Reinforcement Learning
Lecture 1
Sergey Levine
How do we build intelligent machines? Intelligent machines must be able to adapt. Deep learning helps us handle unstructured environments. Reinforcement learning provides a formalism for behavior.
decisions (actions) → consequences (observations, rewards)
Mnih et al. ‘13 Schulman et al. ’14 & ‘15 Levine*, Finn*, et al. ‘16
standard computer vision: features (e.g. HOG) → mid-level features (e.g. DPM) → classifier (e.g. SVM) [Felzenszwalb ‘08]
deep learning: end-to-end training
standard reinforcement learning: features → more features → linear policy
deep reinforcement learning: end-to-end training
perception → action (run away): the sensorimotor loop
robotic control pipeline: state estimation (e.g. vision) → modeling & prediction → planning → low-level control → controls
The reinforcement learning problem is the AI problem! decisions (actions), consequences, rewards.
Robotics: actions are motor current or torque; observations are camera images; rewards are a task success measure (e.g., running speed).
Inventory management: actions are what to purchase; observations are inventory levels; rewards are profit.
Machine translation: actions are words in French; observations are words in English; rewards are BLEU score.
You may not need RL when your system is making a single, isolated decision (e.g., classification or regression), and when that decision does not affect future decisions.
Common applications: robotics, autonomous driving, language & dialogue (structured prediction), business operations, finance.
Limited supervision: you know what you want, but not how to get it. Actions have consequences.
L.-J. Lin, “Reinforcement learning for robots using neural networks.” 1993 Tesauro, 1995
Atari games:
Q-learning: Mnih, Kavukcuoglu, Silver, Graves, Antonoglou, et al. “Playing Atari with Deep Reinforcement Learning”. (2013).
Policy gradients: Mnih et al. “Asynchronous methods for deep reinforcement learning”. (2016).
Real-world robots:
Guided policy search: Levine*, Finn*, et al. “End-to-end training of deep visuomotor policies”. (2015).
Q-learning: Gu*, Holly*, et al. “Deep Reinforcement Learning for Robotic Manipulation with Asynchronous Off-Policy Updates”. (2016).
Beating Go champions:
Supervised learning + policy gradients + value functions + Monte Carlo tree search: Silver et al. “Mastering the game of Go with deep neural networks and tree search”. Nature (2016).
Images: Bojarski et al. ‘16, NVIDIA
training data supervised learning
Andrey Markov, Richard Bellman
we’ll come back to partially observed later
infinite horizon case finite horizon case
a convenient identity
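The “convenient identity” here is the standard log-derivative trick, which lets us write the gradient of the RL objective as an expectation we can sample (a minimal statement in the usual trajectory notation):

\pi_\theta(\tau)\,\nabla_\theta \log \pi_\theta(\tau) = \pi_\theta(\tau)\,\frac{\nabla_\theta \pi_\theta(\tau)}{\pi_\theta(\tau)} = \nabla_\theta \pi_\theta(\tau)

so that

\nabla_\theta J(\theta) = \nabla_\theta\, \mathbb{E}_{\tau \sim \pi_\theta(\tau)}[r(\tau)] = \mathbb{E}_{\tau \sim \pi_\theta(\tau)}\big[\nabla_\theta \log \pi_\theta(\tau)\, r(\tau)\big]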
generate samples (i.e. run the policy) fit a model to estimate return improve the policy
training data supervised learning
good stuff is made more likely bad stuff is made less likely simply formalizes the notion of “trial and error”!
high variance
“reward to go”
but… are we allowed to do that?? subtracting a baseline is unbiased in expectation! average reward is not the best baseline, but it’s pretty good!
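A minimal NumPy sketch of the reward-to-go computation with an average-return baseline; the function name, discount value, and example rewards are illustrative, not from the slides:

import numpy as np

def reward_to_go(rewards, gamma=0.99):
    # discounted sum of future rewards at each time step of one trajectory
    rtg = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg

rtg = reward_to_go(np.array([0.0, 0.0, 1.0, 1.0]))
# subtracting a constant baseline (e.g., the average return over the batch)
# keeps the estimator unbiased in expectation while reducing variance
advantages = rtg - rtg.mean()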
a convenient identity
(image from Peters & Schaal 2008)
Essentially the same problem as this:
see Schulman, L., Moritz, Jordan, Abbeel (2015) Trust region policy optimization (figure from Peters & Schaal 2008)
Sergey Levine
deep networks + RL = end-to-end optimization of decision making and control
SGD is great
gradient
Mnih et al. ‘13 Schulman et al. ’14 & ‘15 Levine*, Finn*, et al. ‘16
generate samples (i.e. run the policy) fit a model to estimate return improve the policy
RL objective and follow gradient
practical
gradient/trust region
“reward to go”
Pseudocode example (with discrete actions): Maximum likelihood:
# Given:
# actions - (N*T) x Da tensor of (one-hot) actions
# states - (N*T) x Ds tensor of states
# Build the graph:
logits = policy.predictions(states)  # This should return (N*T) x Da tensor of action logits
negative_likelihoods = tf.nn.softmax_cross_entropy_with_logits(labels=actions, logits=logits)
loss = tf.reduce_mean(negative_likelihoods)
gradients = tf.gradients(loss, variables)
Pseudocode example (with discrete actions): Policy gradient:
# Given:
# actions - (N*T) x Da tensor of (one-hot) actions
# states - (N*T) x Ds tensor of states
# q_values - (N*T) x 1 tensor of estimated state-action values
# Build the graph:
logits = policy.predictions(states)  # This should return (N*T) x Da tensor of action logits
negative_likelihoods = tf.nn.softmax_cross_entropy_with_logits(labels=actions, logits=logits)
# squeeze q_values to shape (N*T,) so the elementwise weighting matches shapes
weighted_negative_likelihoods = tf.multiply(negative_likelihoods, tf.squeeze(q_values, axis=1))
loss = tf.reduce_mean(weighted_negative_likelihoods)
gradients = tf.gradients(loss, variables)
estimator as control variate for variance reduction
Schulman, Levine, Moritz, Jordan, Abbeel. ‘15
automatic step adjustment
continuous actions
(Duan et al. ‘16)
Williams (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning: introduces REINFORCE algorithm
Baxter & Bartlett (2001). Infinite-horizon policy-gradient estimation: temporally decomposed policy gradient (not the first paper on this! see actor-critic section later)
Peters & Schaal (2008). Reinforcement learning of motor skills with policy gradients: very accessible overview of optimal baselines and natural gradient
Schulman, Levine, Moritz, Jordan, Abbeel (2015). Trust region policy optimization: deep RL with natural policy gradient and adaptive step size
Schulman, Wolski, Dhariwal, Radford, Klimov (2017). Proximal policy optimization algorithms: deep RL with importance sampled policy gradient
generate samples (i.e. run the policy) fit a model to estimate return improve the policy
“reward to go”
“reward to go”
the better this estimate, the lower the variance unbiased, but high variance single-sample estimate
generate samples (i.e. run the policy) fit a model to estimate return improve the policy
generate samples (i.e. run the policy) fit a model to estimate return improve the policy
generate samples (i.e. run the policy) fit a model to estimate return improve the policy
the same function should fit multiple samples!
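In the TF pseudocode style used earlier, Monte Carlo evaluation of the critic is just regression onto observed returns (a sketch; value_net, states, returns, and value_variables are placeholders analogous to policy and q_values above):

import tensorflow as tf

# values: (N*T) x 1 tensor of V(s) predictions, returns: (N*T) x 1 Monte Carlo returns
values = value_net.predictions(states)
value_loss = tf.reduce_mean(tf.square(values - returns))
value_gradients = tf.gradients(value_loss, value_variables)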
generate samples (i.e. run the policy) fit a model to estimate return improve the policy
episodic tasks continuous/cyclical tasks
two network design + simple & stable
shared network design
works best with a batch (e.g., parallel workers) synchronized parallel actor-critic asynchronous parallel actor-critic
networks
generate samples (i.e. run the policy) fit a model to estimate return improve the policy
High-dimensional continuous control with generalized advantage estimation (Schulman, Moritz, L., Jordan, Abbeel ‘16)
return estimates and critic called generalized advantage estimation (GAE)
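For reference, the generalized advantage estimator from that paper combines one-step TD errors with an exponentially weighted sum (γ is the discount, λ trades bias against variance):

\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t), \qquad \hat{A}_t^{\mathrm{GAE}(\gamma,\lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l\, \delta_{t+l}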
Asynchronous methods for deep reinforcement learning (Mnih, Badia, Mirza, Graves, Lillicrap, Harley, Silver, Kavukcuoglu ‘16)
batch
Sutton, McAllester, Singh, Mansour (1999). Policy gradient methods for reinforcement learning with function approximation: actor-critic algorithms with value function approximation
Mnih, Badia, Mirza, Graves, Lillicrap, Harley, Silver, Kavukcuoglu (2016). Asynchronous methods for deep reinforcement learning: A3C -- parallel online actor-critic
Schulman, Moritz, Levine, Jordan, Abbeel (2016). High-dimensional continuous control using generalized advantage estimation: batch-mode actor-critic with blended Monte Carlo and function approximator returns
Gu, Lillicrap, Ghahramani, Turner, Levine (2017). Q-Prop: sample-efficient policy gradient with an off-policy critic: policy gradient with Q-function control variate
forget policies, let’s just do this!
generate samples (i.e. run the policy) fit a model to estimate return improve the policy
High level idea:
generate samples (i.e. run the policy) fit a model to estimate return improve the policy
how to do this?
just use the current estimate here
generate samples (i.e. run the policy) fit a model to estimate return improve the policy
approximates the new value!
generate samples (i.e. run the policy) fit a model to estimate return improve the policy
generate samples (i.e. run the policy) fit a model to estimate return improve the policy
curse of dimensionality
need to know outcomes for different actions! Back to policy iteration… can fit this using samples
doesn’t require simulation of actions!
+ works even for off-policy samples (unlike actor-critic) + only one network, no high-variance policy gradient
forget policy, compute value directly can we do this with Q-values also, without knowing the transitions?
dataset of transitions Fitted Q-iteration
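A schematic tabular sketch of the fitted Q-iteration loop; the toy dataset and sizes below are made up for illustration, and with function approximation the exact tabular assignment becomes a regression step:

import numpy as np

num_states, num_actions, gamma = 5, 2, 0.9
Q = np.zeros((num_states, num_actions))
# dataset of transitions (s, a, r, s'), collected by any policy with broad support
dataset = [(0, 1, 0.0, 1), (1, 0, 1.0, 2), (2, 1, 0.0, 3), (3, 0, 1.0, 4)]

for _ in range(50):  # K outer iterations
    for s, a, r, s_next in dataset:
        target = r + gamma * Q[s_next].max()  # y = r + gamma * max_a' Q(s', a')
        Q[s, a] = target                      # tabular case: "regression" sets Q exactly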
most guarantees are lost when we leave the tabular case (e.g., when we use neural network function approximation)
generate samples (i.e. run the policy) fit a model to estimate return improve the policy
“epsilon-greedy” final policy: why is this a bad idea for step 1? “Boltzmann exploration”
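Minimal sketches of the two exploration rules mentioned here (the epsilon and temperature values are illustrative):

import numpy as np

def epsilon_greedy(q_values, epsilon=0.1):
    # with probability epsilon pick a random action, otherwise the greedy one
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_values))
    return int(np.argmax(q_values))

def boltzmann(q_values, temperature=1.0):
    # sample actions with probability proportional to exp(Q / temperature)
    logits = np.asarray(q_values) / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))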
generate samples (i.e. run the policy) fit a model to estimate return improve the policy
have a policy
iteration
Q-learning is not gradient descent! no gradient through target value
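In the TF pseudocode style used earlier, “no gradient through target value” corresponds to stopping gradients on the bootstrapped target; q_net, states, next_states, rewards, gamma, actions_onehot, and variables are placeholders in this sketch:

import tensorflow as tf

next_q = q_net.predictions(next_states)          # (N) x Da tensor of Q(s', a') values
targets = rewards + gamma * tf.stop_gradient(tf.reduce_max(next_q, axis=1))
q_sa = tf.reduce_sum(q_net.predictions(states) * actions_onehot, axis=1)
loss = tf.reduce_mean(tf.square(q_sa - targets))
gradients = tf.gradients(loss, variables)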
synchronized parallel Q-learning asynchronous parallel Q-learning
special case with K = 1, and one gradient step any policy will work! (with broad support) just load data from a buffer here dataset of transitions Fitted Q-iteration still use one gradient step
dataset of transitions (“replay buffer”)
Q-learning + samples are no longer correlated + multiple samples in the batch (low-variance gradient) but where does the data come from? need to periodically feed the replay buffer…
K = 1 is common, though larger K more efficient dataset of transitions (“replay buffer”)
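A minimal replay buffer sketch (capacity and interface are illustrative); sampling uniformly from it is what decorrelates the samples in each batch:

import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100000):
        self.buffer = deque(maxlen=capacity)  # FIFO: old transitions fall off the end

    def add(self, transition):
        # transition is a (s, a, r, s', done) tuple
        self.buffer.append(transition)

    def sample(self, batch_size):
        # uniform sampling breaks temporal correlation between consecutive steps
        return random.sample(self.buffer, batch_size)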
Q-learning
Q-learning is not gradient descent! no gradient through target value
use replay buffer
This is still a problem!
perfectly well-defined, stable regression
targets don’t change in inner loop! supervised regression
Mnih et al. ‘13
just SGD
dataset of transitions (“replay buffer”) target parameters current parameters
dataset of transitions (“replay buffer”) target parameters current parameters
at the same speed
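Two common ways to update the target parameters, sketched with illustrative names (params and target_params as lists of arrays; tau and the update period are made up):

# (a) classic DQN style: hard copy every N gradient steps
if step % target_update_period == 0:
    target_params = [p.copy() for p in params]

# (b) Polyak averaging: every step, move the target slightly toward the current
# parameters, so all of them change smoothly at the same speed
tau = 0.005
target_params = [(1 - tau) * tp + tau * p for tp, p in zip(target_params, params)]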
loop of process 1
What’s the problem with continuous actions?
this max this max
How do we perform the max?
particularly problematic (inner loop of training)
Option 1: optimization
Simple solution:
+ dead simple + efficiently parallelizable
but… do we care? How good does the target need to be anyway?
More accurate solution:
works OK, for up to about 40 dimensions
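The “dead simple” option sketched in code: approximate the max over continuous actions by evaluating Q on a batch of sampled actions (q_fn, the bounds, and the sample count are placeholders):

import numpy as np

def approx_max_q(q_fn, state, action_dim, num_samples=100, low=-1.0, high=1.0):
    # evaluate Q(s, a) on uniformly sampled actions and keep the best one;
    # CEM refines this by iteratively refitting the sampling distribution
    candidates = np.random.uniform(low, high, size=(num_samples, action_dim))
    values = np.array([q_fn(state, a) for a in candidates])
    best = int(values.argmax())
    return candidates[best], values[best]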
Option 2: use function class that is easy to optimize
Gu, Lillicrap, Sutskever, L., ICML 2016
NAF: Normalized Advantage Functions
+ no change to algorithm + just as efficient as Q-learning
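The NAF parameterization makes the maximization trivial by construction (with P_\phi(s) positive definite):

Q_\phi(s, a) = V_\phi(s) - \tfrac{1}{2}\big(a - \mu_\phi(s)\big)^{\top} P_\phi(s)\, \big(a - \mu_\phi(s)\big), \qquad \arg\max_a Q_\phi(s, a) = \mu_\phi(s), \quad \max_a Q_\phi(s, a) = V_\phi(s)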
Option 3: learn an approximate maximizer DDPG (Lillicrap et al., ICLR 2016) “deterministic” actor-critic (really approximate Q-learning)
Option 3: learn an approximate maximizer
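In DDPG the actor is trained to be an approximate maximizer of the critic, and the target values use the target networks of both:

\theta \leftarrow \arg\max_\theta\; \mathbb{E}_s\big[Q_\phi(s, \mu_\theta(s))\big], \qquad y_j = r_j + \gamma\, Q_{\phi'}\big(s'_j,\, \mu_{\theta'}(s'_j)\big)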
Slide partly borrowed from J. Schulman
simple and no downsides
Adam optimizer can help too
Slide partly borrowed from J. Schulman
generate samples (i.e. run the policy) fit a model to estimate return improve the policy
actions
reinforcement learning from raw visual data,” Lange & Riedmiller ‘12
latent space learned with autoencoder
function approximation (but neural net for embedding)
through deep reinforcement learning,” Mnih et al. ‘13
convolutional networks
target network
with double Q-learning (and other tricks)
with deep reinforcement learning and …,” Gu*, Holly*, et al. ‘17
NAF (quadratic in actions)
target network
simulator step for efficiency
multiple robots
“Composable Deep Reinforcement Learning for Robotic Manipulation,” Haarnoja, et al. ’18
and “Reinforcement Learning with Deep Energy-Based Policies” (Haarnoja et al. ‘18 & ‘17)
maximum entropy Q-learning (we’ll cover this if time permits)
network
sim step
provides for robust policies
Lange & Riedmiller ‘12: image-based Q-learning method using autoencoders to construct embeddings.
Mnih et al. ‘13: Q-learning with convolutional networks for playing Atari.
Van Hasselt, Guez, Silver ‘15, “Deep reinforcement learning with double Q-learning”: effective trick to improve performance of deep Q-learning.
Lillicrap et al. ‘16, “Continuous control with deep reinforcement learning”: continuous Q-learning with actor network for approximate maximization.
Gu, Lillicrap, Sutskever, L. ‘16, “Continuous deep Q-learning with model-based acceleration”: continuous Q-learning with action-quadratic value functions.
Wang et al. ‘16, “Dueling network architectures for deep reinforcement learning”: separates value and advantage estimation in Q-function.
assume this is unknown don’t even attempt to learn it
1. Games (e.g., Go) 2. Easily modeled systems (e.g., navigating a car) 3. Simulated environments (e.g., simulated robots, video games)
1. System identification – fit unknown parameters of a known model 2. Learning – fit a general-purpose model to observed transition data
Does knowing the dynamics make things easier? Often, yes!
simplest method: guess & check “random shooting method”
can we do better? typically use Gaussian distribution see also: CMA-ES (sort of like CEM with momentum)
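A minimal cross-entropy method (CEM) planner over action sequences; cost_fn stands in for a rollout through the model or simulator, and all sizes are illustrative:

import numpy as np

def cem_plan(cost_fn, horizon, action_dim, iters=5, pop=500, elites=50):
    # iteratively refit a Gaussian over action sequences to the lowest-cost samples
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(iters):
        samples = mean + std * np.random.randn(pop, horizon, action_dim)
        costs = np.array([cost_fn(seq) for seq in samples])
        elite = samples[np.argsort(costs)[:elites]]
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mean  # the refined mean is the planned action sequence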
using our knowledge of physics, and fit just a few parameters
expressive model classes
go right to get higher!
every N steps
every N steps
each individual plan needs to be
well here!
backprop through the learned model into the policy: easy for deterministic policies, but also possible for stochastic policy (more on this later)
model-based RL provides data supervised learning provides reward term Levine*, Finn*, et al. End-to-end training of deep visuomotor policies. 2016.
supervised learning model-based RL
policy search”)
every N steps
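Putting the pieces together, a high-level sketch of the replanning (MPC) loop; every name here (collect_random_rollouts, fit_dynamics, rollout_cost, cem_plan, env.step) is a placeholder for the corresponding step, not an actual API:

data = collect_random_rollouts(env)                  # base policy, e.g. random
for iteration in range(num_iterations):
    model = fit_dynamics(data)                       # supervised learning on (s, a) -> s'
    s = env.reset()
    for t in range(episode_length):
        plan = cem_plan(lambda seq: rollout_cost(model, s, seq), horizon, action_dim)
        s_next, reward, done = env.step(plan[0])     # execute only the first planned action
        data.append((s, plan[0], s_next))            # aggregate the new transition
        s = s_next                                   # then replan from the new state
        if done:
            break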
Nagabandi, Yang, Asmar, Pandya, Kahn, L., Fearing. arxiv 2018
Nagabandi, Kahn, Fearing, L. ICRA 2018 pure model-based (about 10 minutes real time) model-free training (about 10 days…)
need to not overfit here… …but still have high capacity over here
exceeds performance of model-free after 40k steps (about 10 minutes of real time)
Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models Chua, Calandra, McAllister, L.
PETS: Probabilistic Ensembles with Trajectory Sampling
Ebert, Finn, Lee, Levine. 2017. Self-Supervised Visual Planning with Temporal Skip Connections. CoRL 2017.
Designated Pixel Goal Pixel
sample-efficiency spectrum: model-based deep RL (e.g. PETS, guided policy search); model-based “shallow” RL (e.g. PILCO); replay buffer value estimation methods (Q-learning, DDPG, NAF, SAC, etc.); policy gradient methods (e.g. TRPO); fully online methods (e.g. A3C); gradient-free methods (e.g. NES, CMA, etc.). Least efficient end: 100,000,000 steps (100,000 episodes) (~ 15 days real time)
Wang et al. ‘17 TRPO+GAE (Schulman et al. ‘16) half-cheetah (slightly different version)
10,000,000 steps (10,000 episodes) (~ 1.5 days real time)
half-cheetah Gu et al. ‘16
1,000,000 steps (1,000 episodes) (~3 hours real time)
Chebotar et al. ’17 (note log scale)
roughly a 10x gap between each category; about 20 minutes of experience on a real robot
Chua et al. ’18: Deep Reinforcement Learning in a Handful of Trials
30,000 steps (30 episodes) (~5 min real time)
are you learning in a simulator? is simulation cost negligible compared to training cost? how patient are you? Roughly: cheap simulation favors on-policy methods (TRPO, PPO, A3C); costlier samples favor off-policy value-based methods (DDPG, NAF, SQL, SAC, TD3); the least patient setting (e.g., real-world data) favors model-based RL (GPS, PETS). BUT: if you have a simulator, you can compute gradients through it – do you need model-free RL?
Lillicrap et al. “Continuous control…” Gu et al. “Continuous deep Q-learning…” Haarnoja et al. “Reinforcement learning with deep energy-based…” Haarnoja et al. “Soft actor-critic” Fujimoto et al. “Addressing function approximation error…”
estimators are typically not contractions, hence no guarantee of convergence
buffer size, clipping, sensitivity to learning rates, etc.
backpropagation through time
Henderson et al. ‘17, “Deep Reinforcement Learning that Matters.”
world
answer is “not very”
algorithm x number of runs to sweep
are less sensitive to hyperparameters?
viable tool for real-world problems
(many tasks, many situations, etc.)
homework to finish running
impractical
simulators
Pinto & Gupta, 2015 Levine et al. 2016
physics) is more task-agnostic
sense that we have “features” in computer vision?
learning (Sutton et al. ’99)
different tasks, you need to get those tasks somewhere!
demonstration (inverse reinforcement learning)
decision making
learn and represent complex input-output mappings
domains governed by simple, known rules
inputs, given enough experience
provided expert behavior
to learn more: see rail.eecs.berkeley.edu/deeprlcourse