CS 330
Reinforcement Learning: A Primer, Multi-Task, Goal-Conditioned
Logistics: Homework 2 is due Wednesday. Homework 3 is out on Wednesday. The project proposal is due next Wednesday.

Why Reinforcement Learning? When do you not need sequential decision making?
[Imitation learning: training data, supervised learning. Images: Bojarski et al. '16, NVIDIA. Slide adapted from Sergey Levine.]

[The reinforcement learning setup and objective: infinite horizon case and finite horizon case. Slides adapted from Sergey Levine.]
What is a task? Previously, a task was defined by its data-generating distributions and a loss. In RL, a task is a Markov decision process: dynamics, action space, state space, initial state distribution, and reward. That is much more than the semantic meaning of "task"!
[Examples: personalized recommendations, varying across tasks; character animation, varying across maneuvers; multi-robot RL, varying across garments and initial states.]
The multi-task RL problem can be cast as a standard Markov decision process! Combining the per-task components (state space, action space, initial state distribution, dynamics, reward):

ℳ = { ⋃ᵢ 𝒮ᵢ, ⋃ᵢ 𝒜ᵢ, (1/N) ∑ᵢ pᵢ(s₁), p(s′ | s, a), r(s, a) }
The same as before, except that a task identifier is part of the state: s = (s̄, zᵢ). The identifier can be, e.g., a one-hot task ID, a language description, or a desired goal state, zᵢ = s_g (“goal-conditioned RL”).

If it's still a standard Markov decision process, then why not apply standard RL algorithms? You can! And you can often do better.

The reward is the same as before. Or, for goal-conditioned RL: r(s) = r(s̄, s_g) = −d(s̄, s_g) for some distance function d, e.g. the ℓ2 distance d(s̄, s_g) = ∥s̄ − s_g∥₂.
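To make the goal-conditioned reward concrete, here is a minimal sketch (my own illustration, not from the slides) using the ℓ2 distance:

```python
import numpy as np

def goal_conditioned_reward(s_bar, s_g):
    """Goal-conditioned reward: negative l2 distance from the state to the goal."""
    return -np.linalg.norm(s_bar - s_g)

# Example: a 2-D state and goal.
s_bar = np.array([0.0, 1.0])
s_g = np.array([3.0, 5.0])
print(goal_conditioned_reward(s_bar, s_g))  # -5.0
```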
The anatomy of an RL algorithm: generate samples (i.e. run the policy) → fit a model to estimate the return → improve the policy, and repeat.
Policy gradients (slides adapted from Sergey Levine). To differentiate the RL objective J(θ) = E_{τ∼π_θ(τ)}[r(τ)] directly, use a convenient identity: π_θ(τ) ∇_θ log π_θ(τ) = ∇_θ π_θ(τ). It gives ∇_θ J(θ) = E_{τ∼π_θ(τ)}[∇_θ log π_θ(τ) r(τ)], an expectation that can be estimated from sampled trajectories.

In the anatomy above, policy gradient methods generate samples by running the policy, estimate the return from those samples, and improve the policy with a gradient step. The resulting update looks like supervised learning on the training data by maximum likelihood, with each trajectory weighted by its reward.
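As a concrete sketch of this estimator (my own toy example, not from the slides): REINFORCE on a one-step, two-action problem, where ∇_θ log π_θ(a) for a softmax policy is one_hot(a) − π_θ.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_reward(a):
    # Toy one-step problem: action 1 pays 1.0 on average, action 0 pays 0.0.
    return rng.normal(loc=(0.0, 1.0)[a], scale=0.1)

def policy(theta):
    # Softmax policy over two actions.
    z = np.exp(theta - theta.max())
    return z / z.sum()

theta = np.zeros(2)
alpha = 0.1
for _ in range(500):
    probs = policy(theta)
    a = int(rng.choice(2, p=probs))
    r = sample_reward(a)
    # REINFORCE update: r * grad_theta log pi(a), ascending E[r].
    theta += alpha * r * (np.eye(2)[a] - probs)

print(policy(theta))  # most of the probability mass should sit on action 1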
Finn, Abbeel, Levine. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. ICML ‘17
Mishra, Rohaninejad, Chen, Abbeel. A Simple Neural Attentive Meta-Learner. ICLR '18
Value functions:

V^π(s) = ∑_{t′=t}^{T} E[r(s_{t′}, a_{t′}) | s_t = s]: the total reward starting from s and following π ("how good is a state").

Q^π(s, a) = ∑_{t′=t}^{T} E[r(s_{t′}, a_{t′}) | s_t = s, a_t = a]: the total reward starting from s, taking a, and then following π ("how good is a state-action pair").

The optimal Q-function satisfies the Bellman equation Q⋆(s, a) = E[r(s, a) + max_{a′} Q⋆(s′, a′)].
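These sums are just the reward-to-go from time t. A short sketch of the Monte Carlo estimate from a single rollout (undiscounted and finite-horizon, matching the definitions above):

```python
import numpy as np

def reward_to_go(rewards):
    """Monte Carlo estimate of sum_{t'=t}^{T} r_{t'} for every t, from one rollout."""
    return np.cumsum(rewards[::-1])[::-1]

print(reward_to_go(np.array([1.0, 0.0, 2.0])))  # [3. 2. 2.]
```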
Value-based RL: fit Q_ϕ to Bellman targets (slide adapted from Sergey Levine). This is not a gradient descent algorithm! The result: get a policy from Q_ϕ by taking a = arg maxₐ Q_ϕ(s, a), i.e. π(a|s) puts all its probability on the argmax action. We can reuse data from previous policies (stored in replay buffers): it is an off-policy algorithm. And it can be readily extended to multi-task/goal-conditioned RL.
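A minimal tabular sketch of this idea (my own toy example, not from the slides): Q-learning on a five-state chain, where the agent acts uniformly at random while Q is fit to Bellman targets, highlighting the off-policy property.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny deterministic chain MDP: states 0..4, actions {0: left, 1: right};
# reward 1 for reaching state 4, which is terminal.
n_states, n_actions, gamma, alpha = 5, 2, 0.9, 0.5

def step(s, a):
    s_next = min(s + 1, 4) if a == 1 else max(s - 1, 0)
    return s_next, float(s_next == 4), s_next == 4

Q = np.zeros((n_states, n_actions))
for episode in range(200):
    s = 0
    for t in range(50):
        # Off-policy: act uniformly at random, as if replaying old data.
        a = int(rng.integers(n_actions))
        s_next, r, done = step(s, a)
        # Q-learning backup toward the target r + gamma * max_a' Q(s', a').
        target = r + (0.0 if done else gamma * Q[s_next].max())
        Q[s, a] += alpha * (target - Q[s, a])
        s = s_next
        if done:
            break

# Greedy policy from Q: non-terminal states 0..3 should choose "right" (1).
print(Q[:4].argmax(axis=1))  # -> [1 1 1 1]
```

Because the backup bootstraps through max_{a′} Q(s′, a′), the greedy policy recovered from Q can be far better than the random policy that collected the data.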
Policy: π_θ(a | s̄) → π_θ(a | s̄, zᵢ)
Q-function: Q_ϕ(s̄, a) → Q_ϕ(s̄, a, zᵢ)
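One simple way to implement this conditioning (a sketch; the helper name is mine, and it assumes a one-hot task ID, though a goal state or language embedding would be concatenated the same way):

```python
import numpy as np

def task_conditioned_input(s_bar, task_id, num_tasks):
    """Append a one-hot task identifier z_i to the state, giving s = (s_bar, z_i)."""
    z = np.zeros(num_tasks)
    z[task_id] = 1.0
    return np.concatenate([s_bar, z])

# The same policy/Q-network weights then serve every task:
s = task_conditioned_input(np.array([0.2, -1.3]), task_id=1, num_tasks=3)
print(s)  # [ 0.2 -1.3  0.   1.   0. ]
```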
Hindsight relabeling (Andrychowicz et al. Hindsight Experience Replay. NeurIPS '17):
1. Collect data 𝒟ₖ = {(s_{1:T}, a_{1:T}, s_g, r_{1:T})} using some policy.
2. Store the data: 𝒟 ← 𝒟 ∪ 𝒟ₖ.
3. Relabel the experience using the last state as the goal: 𝒟′ₖ = {(s_{1:T}, a_{1:T}, s_T, r′_{1:T})}, where r′_t = −d(s_t, s_T). (Alternatively, use any state from the trajectory as the goal.)
4. Store the relabeled data: 𝒟 ← 𝒟 ∪ 𝒟′ₖ.
5. k++, and repeat.
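A minimal sketch of the relabeling step (hypothetical helper name; uses the slide's reward r′_t = −d(s_t, s_T) with an ℓ2 distance, whereas Andrychowicz et al. also consider sparse rewards):

```python
import numpy as np

def relabel_with_final_state(states, actions):
    """Hindsight relabeling: pretend the last state reached was the goal,
    and recompute rewards as r'_t = -d(s_t, s_T)."""
    goal = states[-1]
    rewards = np.array([-np.linalg.norm(s - goal) for s in states])
    return states, actions, goal, rewards

# Usage: even a failed trajectory yields useful supervision for goal s_T.
states = np.array([[0.0, 0.0], [0.5, 0.2], [1.0, 0.4]])
actions = np.array([[0.1], [0.2]])
_, _, goal, rewards = relabel_with_final_state(states, actions)
print(goal, rewards)  # goal = [1.  0.4]; rewards rise toward 0 at the end
```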
The same relabeling idea applies across tasks (Andrychowicz et al. Hindsight Experience Replay. NeurIPS '17):
1. Collect data 𝒟ₖ = {(s_{1:T}, a_{1:T}, zᵢ, r_{1:T})} using some policy.
2. Store the data: 𝒟 ← 𝒟 ∪ 𝒟ₖ.
3. Relabel the experience for task 𝒯ⱼ: 𝒟′ₖ = {(s_{1:T}, a_{1:T}, zⱼ, r′_{1:T})}, where r′_t = rⱼ(s_t). (For example, relabel with tasks for which the trajectory gets high reward.)
4. Store the relabeled data: 𝒟 ← 𝒟 ∪ 𝒟′ₖ.
5. k++, and repeat.
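The multi-task variant is the same mechanical step, swapping the distance for a task reward (again a sketch; reward_fns is a hypothetical list of per-task reward functions rⱼ):

```python
import numpy as np

def relabel_with_task(states, reward_fns, j):
    """Relabel a trajectory for task T_j: set z <- z_j and r'_t = r_j(s_t)."""
    return j, np.array([reward_fns[j](s) for s in states])

# Two hypothetical tasks: reach x = 0 vs. reach x = 1.
reward_fns = [lambda s: -abs(s[0] - 0.0), lambda s: -abs(s[0] - 1.0)]
states = np.array([[0.9], [1.0], [1.1]])
# Relabel with the task this trajectory happens to solve well (task 1).
z_j, rewards = relabel_with_task(states, reward_fns, j=1)
print(z_j, rewards)  # task 1; rewards ~ [-0.1, 0.0, -0.1]
```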
[Results from Andrychowicz et al. Hindsight Experience Replay. NeurIPS '17.]
The same trick applies to data collected with no goal or reward labels at all:
1. Collect data 𝒟ₖ = {(s_{1:T}, a_{1:T})} using some policy.
2. Relabel using the last state as the goal: 𝒟′ₖ = {(s_{1:T}, a_{1:T}, s_T, r′_{1:T})}, where r′_t = −d(s_t, s_T).
3. Store the relabeled data: 𝒟 ← 𝒟 ∪ 𝒟′ₖ.
Lynch, Khansari, Xiao, Kumar, Tompson, Levine, Sermanet. Learning Latent Plans from Play. ‘19
Srinivas, Jabri, Abbeel, Levine, Finn. Universal Planning Networks. ICML '18
Yu, Shevchuk, Sadigh, Finn. Unsupervised Visuomotor Control through Distributional Planning Networks. RSS '19