CS 330
Hierarchical RL and Skill Discovery
The Plan
Information-theoretic concepts
Skill discovery
Using discovered skills
Hierarchical RL
What if we want to discover interesting behaviors?
[The construction of movement by the spinal cord, Tresch et al., 1999] [Postural hand synergies for tool use, Santello, et al., 1998]
Coming up with tasks is tricky…
Write down task ideas for a tabletop manipulation scenario
[Meta-World, Yu, Quillen, He, et al., 2019]
Performing tasks at various levels of abstraction
Buy ingredients
Go to the store
Walk to the door
Take a step
Contract muscle X
Exploration
Information-theoretic concepts
Skill discovery
Using discovered skills
Hierarchical RL
Slide adapted from Sergey Levine
Distance between two distributions
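A standard way to quantify the "distance" between two distributions is the KL divergence. A minimal NumPy sketch for discrete distributions (the function name and example distributions are illustrative):

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; note that it is asymmetric."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0  # terms with p(x) = 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = [0.5, 0.5]
q = [0.9, 0.1]
print(kl_divergence(p, q))  # ≈ 0.511
print(kl_divergence(q, p))  # ≈ 0.368 — different, so KL is not a true metric
```

The asymmetry is why KL is called a divergence rather than a distance.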
High MI?
x: it rains tomorrow, y: streets are wet tomorrow → high MI
x: it rains tomorrow, y: we find life on Mars tomorrow → MI ≈ 0
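The two examples above can be made concrete with a small NumPy sketch; the joint probability tables here are invented purely for illustration:

```python
import numpy as np

def mutual_information(joint):
    """I(X; Y) computed from a joint probability table p(x, y)."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)  # marginal p(x)
    py = joint.sum(axis=0, keepdims=True)  # marginal p(y)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / (px * py)[mask])))

# Rain vs. wet streets: strongly dependent, so MI is high.
rain_wet = [[0.45, 0.05],   # rows: rain / no rain; cols: wet / dry
            [0.05, 0.45]]
# Rain vs. life on Mars: independent, so MI is zero.
rain_mars = [[0.005, 0.495],
             [0.005, 0.495]]
print(mutual_information(rain_wet))   # ≈ 0.368 (high)
print(mutual_information(rain_mars))  # ≈ 0
```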
Information-theoretic concepts
Skill discovery
Using discovered skills
Hierarchical RL
Objective: Value-, Q-functions, and the policy
Q-learning: π(a | s) = argmax_a Q_θ(s, a)
Soft Q-learning: π(a | s) ∝ exp Q_θ(s, a)
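The difference between the two policies above can be sketched numerically. Assuming discrete actions and given Q-values (the function names and temperature parameter are illustrative):

```python
import numpy as np

def greedy_policy(q_values):
    """Standard Q-learning: deterministic argmax over actions."""
    return int(np.argmax(q_values))

def soft_policy(q_values, temperature=1.0):
    """Soft Q-learning: pi(a|s) ∝ exp(Q(s,a)/T), a stochastic,
    entropy-preserving policy (softmax over Q-values)."""
    logits = np.asarray(q_values, dtype=float) / temperature
    logits -= logits.max()            # subtract max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

q = [1.0, 0.9, -2.0]
print(greedy_policy(q))  # always picks action 0
print(soft_policy(q))    # action 1 still gets substantial probability
```

The near-optimal action 1 keeps almost as much probability as action 0, which is exactly what gives MaxEnt policies their exploration behavior.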
Haarnoja et al. RL with Deep Energy-Based Policies, 2017
Benefits: robustness, exploration, fine-tunability
Why can't we just use MaxEnt RL?
A high-entropy agent can take very different actions but still land in similar states.
Intuitively, we want low diversity for a fixed z and high diversity across z's.
Intuition: different skills should visit different state-space regions.
Eysenbach, Gupta, Ibarz, Levine. Diversity is All You Need. Slide adapted from Sergey Levine
[Diagram: a skill z is sampled and fed to the policy (agent), which takes actions in the environment; the discriminator D observes the resulting states and tries to predict which skill z produced them.]
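The resulting DIAYN intrinsic reward has the form log q(z | s) − log p(z): the agent is rewarded when the discriminator can recover the skill from the state. A minimal sketch with a uniform skill prior (the helper name and toy discriminator outputs are invented):

```python
import numpy as np

def diayn_reward(disc_log_probs, z, num_skills):
    """DIAYN-style intrinsic reward: log q(z|s) - log p(z), assuming a
    uniform prior p(z) = 1/num_skills.

    disc_log_probs: the discriminator's log-probabilities over skills
    for the current state s (here just a toy array).
    """
    log_p_z = -np.log(num_skills)
    return disc_log_probs[z] - log_p_z

# If the discriminator confidently recovers the skill, reward is high;
# if it is at chance, the reward is zero.
confident = np.log([0.9, 0.05, 0.05])
confused = np.log([1 / 3, 1 / 3, 1 / 3])
print(diayn_reward(confident, z=0, num_skills=3))  # > 0
print(diayn_reward(confused, z=0, num_skills=3))   # ≈ 0
```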
Eysenbach, Gupta, Ibarz, Levine. Diversity is All You Need. Slide adapted from Sergey Levine
Eysenbach, Gupta, Ibarz, Levine. Diversity is All You Need.
[Videos: skills discovered on Cheetah, Ant, and Mountain Car]
Eysenbach, Gupta, Ibarz, Levine. Diversity is All You Need. See also: Gregor et al. Variational Intrinsic Control. 2016 Slide adapted from Sergey Levine
Information-theoretic concepts
Skill discovery
Using discovered skills
Hierarchical RL
Eysenbach, Gupta, Ibarz, Levine. Diversity is All You Need.
How can we use the learned skills to accomplish a task? Learn a policy that operates on z’s
Eysenbach, Gupta, Ibarz, Levine. Diversity is All You Need.
It's not very easy to use the learned skills.
Skills might not be particularly useful.
What makes a useful skill?
A skill whose consequences are hard to predict is less useful than one whose consequences are easy to predict.
For a given skill, the future should be predictable; across different skills, the futures should be hard to confuse with one another.
Sharma, Gu, Levine, Kumar, Hausman, DADS, 2019.
We are learning a skill-dynamics model. Skills are optimized specifically to make skill-conditioned dynamics easier to model than conventional global dynamics.
Sharma, Gu, Levine, Kumar, Hausman, DADS, 2019.
Repeat:
1. Sample skills z1, z2, z3, … from the prior p(z).
2. Roll out the policy π(a | s, z) under each skill to collect trajectories (s1, a1) … (sT, aT).
3. Update the skill-dynamics model q_φ(s' | s, z) on the collected transitions.
4. Label the trajectories with intrinsic rewards, (s1, a1, r1) … (sT, aT, rT), and update the policy π(a | s, z).
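The loop above hinges on an intrinsic reward that compares how predictable s' is under the current skill versus under skills drawn from the prior. A rough sketch, where the toy dynamics model and all helper names are invented and the actual DADS estimator differs in details:

```python
import numpy as np

def dads_intrinsic_reward(log_q, s, s_next, z, prior_samples):
    """DADS-style intrinsic reward (a sketch): how much more likely is the
    transition s -> s' under the current skill z than, on average, under
    skills sampled from the prior?

    log_q(s, s_next, z) -> log q(s'|s, z); it stands in for a learned
    skill-dynamics model."""
    log_num = log_q(s, s_next, z)
    log_denoms = np.array([log_q(s, s_next, zi) for zi in prior_samples])
    # log of the mean density over prior skills, via log-sum-exp
    log_denom = np.logaddexp.reduce(log_denoms) - np.log(len(prior_samples))
    return log_num - log_denom

# Toy skill-dynamics: skill z pushes the state by z, with Gaussian noise.
def toy_log_q(s, s_next, z, sigma=0.1):
    return -0.5 * ((s_next - (s + z)) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

zs = np.linspace(-1, 1, 11)  # samples from the skill prior
print(dads_intrinsic_reward(toy_log_q, 0.0, 0.5, 0.5, zs))   # positive: matching skill
print(dads_intrinsic_reward(toy_log_q, 0.0, 0.5, -0.5, zs))  # negative: wrong skill
```

A transition earns a high reward only when it is predictable given its own skill yet implausible under other skills, which is exactly the "low diversity within z, high diversity across z" intuition.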
Sharma, Gu, Levine, Kumar, Hausman, DADS, 2019.
[Comparison: skills discovered by DIAYN vs. DADS]
Given the learned skill-dynamics q_φ and policy π, iterate:
1. Propose candidate skill sequences p1: (z1, z2, …, zH), p2: (z1, z2, …, zH), p3: (z1, z2, …, zH).
2. Roll each out under the skill-dynamics model to get predicted trajectories (s0, a0, …, sH, aH).
3. Compute or estimate the cumulative reward ȓ1, ȓ2, ȓ3 of each plan.
4. Update the planner and iterate.
Use skill-dynamics for model-based planning: plan over skills, not actions. Tasks can be learned zero-shot.
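A random-shooting version of planning in skill space might look like the following. All function names and the toy task are invented, and DADS uses a more sophisticated MPC procedure, so treat this as a sketch only:

```python
import numpy as np

rng = np.random.default_rng(0)

def plan_over_skills(s0, skill_dynamics, reward_fn, horizon=5, n_candidates=64):
    """Random-shooting planner in skill space: sample candidate skill
    sequences, roll them out with the learned skill-dynamics model (no
    environment interaction), and keep the highest-return sequence.

    `skill_dynamics(s, z)` and `reward_fn(s)` are assumed models."""
    best_plan, best_return = None, -np.inf
    for _ in range(n_candidates):
        plan = rng.uniform(-1, 1, size=horizon)   # one continuous skill per step
        s, total = s0, 0.0
        for z in plan:
            s = skill_dynamics(s, z)              # predicted next state
            total += reward_fn(s)
        if total > best_return:
            best_plan, best_return = plan, total
    return best_plan, best_return

# Toy problem: skill z shifts the state by z; reward for being near s = 3.
plan, ret = plan_over_skills(0.0, lambda s, z: s + z, lambda s: -abs(s - 3.0))
print(plan, ret)
```

Because the reward function is only consulted at planning time, a new task can be attempted zero-shot with the same learned skills and dynamics model.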
Information-theoretic concepts
Skill discovery
Using discovered skills
Hierarchical RL
Performing tasks at various levels of abstraction
Buy ingredients
Go to the store
Walk to the door
Take a step
Contract muscle X
Exploration
Design choices:
[Diagram: a high-level policy π_h emits a command z every few steps; a low-level policy π_l maps each state s_t (and the current z) to an action a_t sent to the environment.]
Design choices:
Heess, Wayne, Tassa, Lillicrap, Riedmiller, Silver, Learning Locomotor Controllers, 2016. High-level controller Low-level controller Proprioceptive information Task-specific information Command updated every K steps
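The high-level/low-level split with a command updated every K steps can be sketched as a simple control loop; all the callables here are invented placeholders, not the paper's actual controllers:

```python
import numpy as np

def hierarchical_rollout(env_step, high_policy, low_policy, s0, T=20, K=5):
    """Two-level control loop: the high-level policy emits a command z
    every K steps; the low-level policy maps (s, z) to an action at
    every single step."""
    s, z = s0, None
    trajectory = []
    for t in range(T):
        if t % K == 0:
            z = high_policy(s)      # command updated every K steps
        a = low_policy(s, z)        # low level runs at every step
        s = env_step(s, a)
        trajectory.append((s, z, a))
    return trajectory

# Toy example: the high level picks a direction toward a target state;
# the low level takes small steps in that direction.
traj = hierarchical_rollout(
    env_step=lambda s, a: s + a,
    high_policy=lambda s: np.sign(10.0 - s),
    low_policy=lambda s, z: 0.5 * z,
    s0=0.0,
)
print(traj[-1][0])  # final state after 20 steps → 10.0
```

The timescale K is itself one of the design choices this part of the lecture enumerates: too small and the hierarchy collapses to flat RL, too large and the high level cannot react.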
Design choices:
Bacon, Harb, Precup, The Option-Critic Architecture, 2016.
Design choices:
Gupta, Kumar, Lynch, Levine, Hausman, Relay Policy Learning, 2019.
Design choices:
Gupta, Kumar, Lynch, Levine, Hausman, Relay Policy Learning, 2019.
Design choices:
Nachum, Gu, Lee, Levine HIRO, 2018.
Off-policy corrections
Design choices:
extended tasks
problems
Nachum, Lang, Lu, Gu, Lee, Levine, Why Does Hierarchy (Sometimes) Work? 2019.
Information-theoretic concepts
Skill discovery
Using discovered skills
Hierarchical RL
Lifelong learning – Nov 4: Can the agent learn continuously?