CS 330
Reinforcement Learning: A Primer, Multi-Task, Goal-Conditioned
Introduction. Some background (Karol): I'm not a native English speaker, so please let me know if you don't understand something. And I like robots.
[Figure: training data and supervised learning (imitation learning). Images: Bojarski et al. '16, NVIDIA]
Slide adapted from Sergey Levine
The RL objective: finite horizon case vs. infinite horizon case.
Slide adapted from Sergey Levine
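A standard way to write these two objectives (a sketch in the notation used in the rest of this section, not copied from the slide figures):

```latex
% RL objective, finite horizon (episodic) case:
\theta^\star = \arg\max_{\theta}\; \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\sum_{t=1}^{T} r(s_t, a_t)\right]

% RL objective, infinite horizon case
% (expected reward under the stationary state-action distribution of the policy):
\theta^\star = \arg\max_{\theta}\; \mathbb{E}_{(s, a) \sim p_\theta(s, a)}\left[\, r(s, a)\,\right]
```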
Recall the supervised multi-task setting, where a task is defined by data-generating distributions and a loss:
T_j ≜ { p_j(x), p_j(y|x), L_j }
In RL, a task is a Markov decision process:
T_j ≜ { S_j, A_j, p_j(s_1), p_j(s'|s, a), r_j(s, a) }
state space, action space, initial state distribution, dynamics, reward: much more than the semantic meaning of "task"!
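As a concrete illustration of these five ingredients, here is a minimal sketch of a task represented as a Python container; the class and field names are hypothetical, not from the lecture:

```python
from dataclasses import dataclass
from typing import Callable

import numpy as np

@dataclass
class MDPTask:
    """One RL task T_j = {S_j, A_j, p_j(s1), p_j(s'|s,a), r_j(s,a)}; names are illustrative."""
    state_dim: int                                                      # stands in for the state space S_j
    action_dim: int                                                     # stands in for the action space A_j
    sample_initial_state: Callable[[], np.ndarray]                      # s1 ~ p_j(s1)
    sample_next_state: Callable[[np.ndarray, np.ndarray], np.ndarray]   # s' ~ p_j(s'|s, a)
    reward: Callable[[np.ndarray, np.ndarray], float]                   # r_j(s, a)

# toy example: a 1-D "move to the origin" task
reach_origin = MDPTask(
    state_dim=1,
    action_dim=1,
    sample_initial_state=lambda: np.random.uniform(-1.0, 1.0, size=1),
    sample_next_state=lambda s, a: s + 0.1 * a,
    reward=lambda s, a: -float(np.abs(s).sum()),
)
```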
T_j ≜ { S_j, A_j, p_j(s_1), p_j(s'|s, a), r_j(s, a) }
Examples: multi-robot RL; character animation across maneuvers; manipulation across garments & initial states. Depending on the example, the rewards r_j(s, a) vary, or the initial state distribution and dynamics p_j(s_1), p_j(s'|s, a) vary.
T_j ≜ { S_j, A_j, p_j(s_1), p_j(s'|s, a), r_j(s, a) }: state space, action space, initial state distribution, dynamics, reward.
An alternative view: make the task identifier part of the state, s̄ = (s, z_j). Then
T_j ≜ { S̄_j, A_j, p_j(s̄_1), p(s̄'|s̄, a), r(s̄, a) }
and the dynamics and reward no longer need a per-task subscript. It can be cast as a standard Markov decision process!
Combining all tasks into a single MDP:
T ≜ { ∪_j S̄_j, ∪_j A_j, (1/N) Σ_j p_j(s̄_1), p(s̄'|s̄, a), r(s̄, a) }
Multi-task RL: the same as before, except that a task identifier is part of the state: s̄ = (s, z_j).
The task identifier z_j can be, e.g., a one-hot task ID, a language description, or a desired goal state z_j = s_g.
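A minimal sketch of the "task identifier as part of the state" view, assuming a gym-style environment whose step() returns (obs, reward, done, info); the wrapper name and the one-hot choice are illustrative:

```python
import numpy as np

class TaskAugmentedEnv:
    """Wraps a single-task env so observations become s_bar = (s, z_j) with a one-hot task ID."""

    def __init__(self, env, task_index, num_tasks):
        self.env = env
        self.z = np.zeros(num_tasks, dtype=np.float32)
        self.z[task_index] = 1.0  # one-hot task identifier z_j

    def _augment(self, obs):
        # concatenate the task identifier onto the original observation
        return np.concatenate([np.asarray(obs, dtype=np.float32), self.z])

    def reset(self):
        return self._augment(self.env.reset())

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        return self._augment(obs), reward, done, info
```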
What is the reward? The same as before, or, for goal-conditioned RL:
r(s) = r(s, s_g) = −d(s, s_g)
for some distance function d; examples of the distance d are sketched below.
If it's still a standard Markov decision process, then why not apply standard RL algorithms? You can! You can often do better, though; this setting is called "goal-conditioned RL".
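A sketch of the goal-conditioned reward r(s, s_g) = −d(s, s_g). The two distance choices shown here (Euclidean, and a sparse 0/1-style indicator) are common options, not necessarily the ones on the slide:

```python
import numpy as np

def euclidean_reward(s, s_g):
    """r(s, s_g) = -||s - s_g||_2 : dense, decreases smoothly with distance to the goal."""
    return -float(np.linalg.norm(np.asarray(s) - np.asarray(s_g)))

def sparse_reward(s, s_g, eps=0.05):
    """r(s, s_g) = 0 if the goal is reached (within eps), -1 otherwise: sparse."""
    return 0.0 if np.linalg.norm(np.asarray(s) - np.asarray(s_g)) < eps else -1.0
```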
A convenient identity. Slide adapted from Sergey Levine
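The identity in question is presumably the log-derivative (likelihood-ratio) trick used in the policy gradient derivation; a sketch for reference:

```latex
% Log-derivative identity, as used in the REINFORCE derivation:
p_\theta(\tau)\,\nabla_\theta \log p_\theta(\tau)
  = p_\theta(\tau)\,\frac{\nabla_\theta p_\theta(\tau)}{p_\theta(\tau)}
  = \nabla_\theta p_\theta(\tau)
```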
Generate samples (i.e. run the policy); fit a model to estimate return; improve the policy. Slide adapted from Sergey Levine
Value function: V^π(s_t) = Σ_{t'=t}^{T} E_π[ r(s_{t'}, a_{t'}) | s_t ], the total reward starting from s_t and following π ("how good is a state").
Q-function: Q^π(s_t, a_t) = Σ_{t'=t}^{T} E_π[ r(s_{t'}, a_{t'}) | s_t, a_t ], the total reward starting from s_t, taking a_t, and then following π ("how good is a state-action pair").
Policy improvement: set the new policy to pick a = argmax_a Q^π(s, a). The new policy is at least as good as the old policy.
The optimal Q-function Q*(s, a) satisfies the Bellman equation:
Q*(s, a) = r(s, a) + γ E_{s' ~ p(s'|s, a)}[ max_{a'} Q*(s', a') ]
Example: reward = 1 if I can play it in a month, 0 otherwise. From state s_t, which action (a1, a2, or a3) should I take? Pick the one with the highest Q-value: argmax_a Q^π(s_t, a).
Slide adapted from Sergey Levine
Hyperparameters of the algorithm include how much data to collect, the policy used to collect it, and how many fitting iterations to run. This is not a gradient descent algorithm! Result: get a policy π(a|s) from argmax_a Q_φ(s, a).
Important notes:
- We can reuse data from previous policies (using replay buffers): it is an off-policy algorithm.
- It can be readily extended to multi-task / goal-conditioned RL.
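A minimal sketch of these points: tabular Q-learning fitted from a replay buffer. The env interface (integer states, gym-style step()) is assumed; this is a generic illustration, not the exact algorithm from the slides:

```python
import random
from collections import deque

import numpy as np

def q_learning_with_replay(env, num_states, num_actions, episodes=500,
                           gamma=0.99, lr=0.1, eps=0.1, batch_size=32):
    """Q-learning with a replay buffer: fit Q toward Bellman targets computed from
    stored transitions, which may have been collected by older policies (off-policy)."""
    Q = np.zeros((num_states, num_actions))
    buffer = deque(maxlen=10_000)  # replay buffer of (s, a, r, s', done)

    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy behavior policy
            a = random.randrange(num_actions) if random.random() < eps else int(np.argmax(Q[s]))
            s_next, r, done, _ = env.step(a)
            buffer.append((s, a, r, s_next, done))
            s = s_next

            # Bellman backup on a minibatch of (possibly old) transitions
            if len(buffer) >= batch_size:
                for (bs, ba, br, bs_next, bdone) in random.sample(buffer, batch_size):
                    target = br + (0.0 if bdone else gamma * np.max(Q[bs_next]))
                    Q[bs, ba] += lr * (target - Q[bs, ba])
    return Q  # greedy policy: pi(s) = argmax_a Q[s, a]
```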
Continuous action space? Use a simple optimization algorithm: the Cross-Entropy Method (CEM).
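A sketch of how CEM can approximately solve max_a Q(s, a) over a continuous action space, roughly in the spirit of QT-Opt; q_function(state, action) is an assumed callable:

```python
import numpy as np

def cem_argmax_q(q_function, state, action_dim, iterations=3, population=64,
                 elite_frac=0.1, action_low=-1.0, action_high=1.0):
    """Approximate argmax_a Q(s, a) by iteratively refitting a Gaussian to the best sampled actions."""
    mean = np.zeros(action_dim)
    std = np.ones(action_dim)
    num_elite = max(1, int(population * elite_frac))

    for _ in range(iterations):
        # sample a population of candidate actions from the current Gaussian
        actions = np.clip(np.random.randn(population, action_dim) * std + mean,
                          action_low, action_high)
        q_values = np.array([q_function(state, a) for a in actions])
        elite = actions[np.argsort(q_values)[-num_elite:]]  # keep the top-scoring actions
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mean  # approximate argmax_a Q(state, a)
```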
[QT-Opt system diagram: in-memory buffers, Bellman updaters, stored data from all past experiments, CEM optimization.]
QT-Opt: Kalashnikov et al. ‘18, Google Brain
Slide adapted from D. Kalashnikov
State: over-the-shoulder RGB camera image, no depth. Action: 4-DoF pose change in Cartesian space plus gripper control. Reward: binary reward at the end of the episode if the object was lifted; sparse, no shaping. Automatic success detection.
Slide adapted from D. Kalashnikov
7 robots collected 580k grasps. On unseen test objects: 96% test success rate!
Policy: π_θ(a|s) → π_θ(a|s, z_j)
Q-function: Q_φ(s, a) → Q_φ(s, a, z_j)
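A minimal sketch of this conditioning with a single shared network: just concatenate z_j with the inputs. The PyTorch module below is purely illustrative, not the architecture from the slides:

```python
import torch
import torch.nn as nn

class MultiTaskQFunction(nn.Module):
    """Q_phi(s, a, z_j): one network shared across tasks, conditioned on the task ID z_j."""

    def __init__(self, state_dim, action_dim, task_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim + task_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action, z):
        # the multi-task extension: concatenate the task identifier with (s, a)
        return self.net(torch.cat([state, action, z], dim=-1))
```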
What is different about reinforcement learning? The data distribution is controlled by the agent! Should we share data in addition to sharing weights?
Why mention it now?
Task 1: passing. Task 2: shooting goals. What if you accidentally perform a good pass when trying to shoot a goal? Store the experience as normal, *and* relabel the experience with the passing task's ID and reward and store that too. This is "hindsight relabeling", a.k.a. "hindsight experience replay" (HER).
Goal-conditioned hindsight relabeling (Andrychowicz et al. Hindsight Experience Replay. NeurIPS '17):
1. Collect data D_k = {(s_1:T, a_1:T, s_g, r_1:T)} using some policy, and add it to the replay buffer.
2. Relabel the trajectory using a state it actually reached as the goal, e.g. the last state (any state from the trajectory works):
   D_k' = {(s_1:T, a_1:T, s_T, r'_1:T)} where r'_t = −d(s_t, s_T)
   and add D_k' to the replay buffer as well.
3. Update the policy from the replay buffer, k++, and repeat.
Result: exploration challenges alleviated.
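A minimal sketch of the relabeling step above, assuming trajectories stored as (state, action) pairs and a distance function d as before; the names are illustrative:

```python
def relabel_with_final_state(trajectory, distance):
    """Given a trajectory [(s_t, a_t), ...] collected for some commanded goal, produce a
    relabeled copy that pretends the final state was the goal: r'_t = -d(s_t, s_T)."""
    states = [s for (s, a) in trajectory]
    new_goal = states[-1]  # last state of the trajectory (any visited state also works)
    return [(s, a, new_goal, -distance(s, new_goal)) for (s, a) in trajectory]
```

Both the original transitions (with the commanded goal) and the relabeled ones go into the replay buffer.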
Why mention it now? Task 1: close a drawer. Task 2: open a drawer. Can we use episodes from the drawer-opening task for the drawer-closing task? How does that answer change for Q-learning vs. policy gradient?
When can we apply relabeling?
Multi-task relabeling (Andrychowicz et al. Hindsight Experience Replay. NeurIPS '17):
1. Collect data D = {(s_1:T, a_1:T, z_j, r_1:T)} using some policy for task j, and add it to the replay buffer.
2. Relabel the trajectory for another task k:
   D' = {(s_1:T, a_1:T, z_k, r'_1:T)} where r'_t = r_k(s_t)
   and add D' to the replay buffer as well. Which k to choose? For example, a task for which the trajectory gets high reward.
3. Update the policy from the replay buffer and repeat.
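The same idea in code, assuming access to each task's reward function r_k (here a hypothetical dict reward_fns keyed by task ID):

```python
def relabel_with_task(trajectory, reward_fns, k):
    """Relabel a trajectory [(s_t, a_t), ...] as if it were collected for task k: r'_t = r_k(s_t)."""
    return [(s, a, k, reward_fns[k](s)) for (s, a) in trajectory]

def best_relabel_task(trajectory, reward_fns):
    """One heuristic for 'which k to choose': the task under which the trajectory gets the highest return."""
    return max(reward_fns, key=lambda k: sum(reward_fns[k](s) for (s, a) in trajectory))
```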
Eysenbach et al., Rewriting History with Inverse RL; Li et al., Generalized Hindsight for RL
Andrychowicz et al. Hindsight Experience Replay. NeurIPS ‘17
D' = {(s_1:T, a_1:T, s_T, r'_1:T)} where r'_t = −d(s_t, s_T)
Lynch, Khansari, Xiao, Kumar, Tompson, Levine, Sermanet. Learning Latent Plans from Play. ‘19