 
              Hierarchical RL and Skill Discovery CS 330 1
The Plan Information-theoretic concepts Skill discovery Using discovered skills Hierarchical RL 2
Why Skill Discovery? What if we want to discover interesting behaviors? [Postural hand synergies for tool use, Santello et al., 1998] [The construction of movement by the spinal cord, Tresch et al., 1999] 3
Why Skill Discovery? More practical version: coming up with tasks is tricky… Write down task ideas for a tabletop manipulation scenario [Meta-World, Yu, Quillen, He, et al., 2019] 4
Why Hierarchical RL? Performing tasks at various levels of abstraction. Exploration. Bake a cheesecake → Buy ingredients → Go to the store → Walk to the door → Take a step → Contract muscle X 5
The Plan Information-theoretic concepts Skill discovery Using discovered skills Hierarchical RL 6
Entropy Slide adapted from Sergey Levine 7
KL-divergence: a (non-symmetric) measure of distance between two distributions 8
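The two quantities named on the Entropy and KL-divergence slides can be made concrete for discrete distributions. The following is a minimal sketch using the standard definitions; the example distributions `uniform` and `peaked` are made up for illustration.

```python
import math

def entropy(p):
    """Shannon entropy H(p) = -sum_x p(x) log p(x), in nats."""
    return -sum(px * math.log(px) for px in p if px > 0)

def kl_divergence(p, q):
    """KL(p || q) = sum_x p(x) log(p(x) / q(x)), in nats.

    Assumes q(x) > 0 wherever p(x) > 0.
    """
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

# Made-up example distributions over four outcomes.
uniform = [0.25, 0.25, 0.25, 0.25]
peaked = [0.97, 0.01, 0.01, 0.01]

print("H(uniform) =", entropy(uniform))
print("H(peaked)  =", entropy(peaked))
print("KL(peaked || uniform) =", kl_divergence(peaked, uniform))
print("KL(uniform || peaked) =", kl_divergence(uniform, peaked))
```

The uniform distribution has the larger entropy, and swapping the arguments of `kl_divergence` changes the value, which is why KL is not a distance in the metric sense.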
Mutual information High MI? (1) x: it rains tomorrow, y: streets are wet tomorrow. (2) x: it rains tomorrow, y: we find life on Mars tomorrow. Slide adapted from Sergey Levine 9
Mutual information Slide adapted from Sergey Levine 10
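The rain examples from the mutual information slides can be checked numerically. This is a small sketch using the standard discrete definition of mutual information; every probability in the joint tables is a made-up illustrative number, not data from the lecture.

```python
import math

def mutual_information(joint):
    """I(x; y) = sum_{x,y} p(x,y) log( p(x,y) / (p(x) p(y)) ), in nats."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return sum(
        pxy * math.log(pxy / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, pxy in enumerate(row)
        if pxy > 0
    )

# Rows: x = rain / no rain. Columns: y = event happens / does not happen.
# All probabilities are hypothetical.
rain_and_wet_streets = [[0.28, 0.02],
                        [0.07, 0.63]]
# Independent by construction: p(x, y) = p(x) p(y).
rain_and_life_on_mars = [[0.3 * 0.01, 0.3 * 0.99],
                         [0.7 * 0.01, 0.7 * 0.99]]

print("I(rain; wet streets)  =", mutual_information(rain_and_wet_streets))
print("I(rain; life on Mars) =", mutual_information(rain_and_life_on_mars))
```

Dependent variables give a clearly positive value, while the independent pair gives a value that is zero up to floating-point error.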
The Plan Information-theoretic concepts Skill discovery Using discovered skills Hierarchical RL 11
Soft Q-learning Objective: value function, Q-function, and the policy. Q-learning vs. soft Q-learning. Q-learning policy: π(a | s) = argmax_a Q_φ(s, a) 12
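To contrast the greedy Q-learning policy with a soft one, here is a hedged sketch for a discrete action set. It assumes the energy-based form from Haarnoja et al. (2017), π(a | s) ∝ exp(Q(s, a) / α), with a temperature `alpha` that is a parameter of this sketch; that paper targets continuous actions, so the discrete setting and the Q-values below are illustrative assumptions.

```python
import math

def greedy_policy(q_values):
    """Q-learning: put all probability on argmax_a Q(s, a)."""
    best = max(range(len(q_values)), key=lambda a: q_values[a])
    return [1.0 if a == best else 0.0 for a in range(len(q_values))]

def soft_value(q_values, alpha=1.0):
    """Soft value V(s) = alpha * log sum_a exp(Q(s, a) / alpha)."""
    m = max(q_values)  # factor out the max for numerical stability
    return m + alpha * math.log(sum(math.exp((q - m) / alpha) for q in q_values))

def soft_policy(q_values, alpha=1.0):
    """Soft policy pi(a | s) = exp((Q(s, a) - V(s)) / alpha)."""
    v = soft_value(q_values, alpha)
    return [math.exp((q - v) / alpha) for q in q_values]

q = [1.0, 0.9, -2.0]  # hypothetical Q-values for three discrete actions

print("greedy:", greedy_policy(q))
print("soft:  ", soft_policy(q))
print("soft, low temperature:", soft_policy(q, alpha=0.01))
```

The soft policy keeps substantial probability on the near-optimal second action, which is one way to read the exploration and robustness points; as `alpha` shrinks it approaches the greedy policy.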
Soft Q-learning Exploration Fine-tunability Robustness Haarnoja et al. RL with Deep Energy-Based Policies, 2017 13
Learning diverse skills (z: task index)
Why can’t we just use MaxEnt RL?
1. Action entropy is not the same as state entropy: the agent can take very different actions, but land in similar states.
2. MaxEnt policies are stochastic, but not always controllable: intuitively, we want low diversity for a fixed z, high diversity across z’s.
Intuition: different skills should visit different state-space regions.
Slide adapted from Sergey Levine 15 Eysenbach, Gupta, Ibarz, Levine. Diversity is All You Need.
Diversity-promoting reward function [Diagram: Skill (z) → Policy (Agent) → Action → Environment → State → Discriminator (D) → Predict Skill] Slide adapted from Sergey Levine 16 Eysenbach, Gupta, Ibarz, Levine. Diversity is All You Need.
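The discriminator loop in the diagram can be turned into a per-state pseudo-reward. The sketch below assumes the DIAYN form r(s, z) = log q(z | s) − log p(z) with a uniform skill prior; `toy_discriminator_log_probs` and the skill `centers` are hypothetical stand-ins for a learned discriminator, chosen only so the example runs.

```python
import math

def log_softmax(logits):
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(l - m) for l in logits))
    return [l - log_norm for l in logits]

def toy_discriminator_log_probs(state, skill_centers):
    """Hypothetical stand-in for a learned discriminator q(z | s).

    Scores each skill by how close the 1-D state is to that skill's center.
    """
    return log_softmax([-(state - c) ** 2 for c in skill_centers])

def diayn_reward(state, skill, skill_centers):
    """Pseudo-reward log q(z | s) - log p(z), with a uniform prior p(z)."""
    log_q = toy_discriminator_log_probs(state, skill_centers)[skill]
    log_p = -math.log(len(skill_centers))
    return log_q - log_p

centers = [-2.0, 0.0, 2.0]  # made-up regions "owned" by skills 0, 1, 2

print("skill 2 in its own region:   ", diayn_reward(2.0, 2, centers))
print("skill 2 in skill 0's region: ", diayn_reward(-2.0, 2, centers))
```

A skill is rewarded when the discriminator can tell it apart from the others based on the state it reaches, and penalized when it wanders into a region that looks like another skill.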
Examples of learned tasks Cheetah Ant Mountain car 17 Eysenbach, Gupta, Ibarz, Levine. Diversity is All You Need.
A connection to mutual information Eysenbach, Gupta, Ibarz, Levine. Diversity is All You Need. Slide adapted from Sergey Levine See also: Gregor et al. Variational Intrinsic Control. 2016 18
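The connection can be written out explicitly. The following is a sketch of the standard variational argument, stated for the state-skill mutual information only (the full DIAYN objective also contains an action-entropy term); q_φ denotes the learned discriminator.

```latex
\begin{aligned}
I(s; z) &= H(z) - H(z \mid s) \\
        &= \mathbb{E}_{z, s}\big[\log p(z \mid s)\big] - \mathbb{E}_{z}\big[\log p(z)\big] \\
        &\ge \mathbb{E}_{z \sim p(z),\, s \sim \pi(z)}\big[\log q_\phi(z \mid s) - \log p(z)\big]
\end{aligned}
```

The inequality holds because replacing the true posterior p(z | s) with any q_φ(z | s) loses exactly a non-negative KL term. H(z) is maximized by a uniform prior over skills, and H(z | s) is reduced by making the skill easy to infer from the state.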
The Plan Information-theoretic concepts Skill discovery Using discovered skills Hierarchical RL 19
How to use learned skills? How can we use the learned skills to accomplish a task? Learn a policy that operates on z’s 20 Eysenbach, Gupta, Ibarz, Levine. Diversity is All You Need.
Results: hierarchical RL Can we do better? 21 Eysenbach, Gupta, Ibarz, Levine. Diversity is All You Need.
What’s the problem? Skills might not be particularly useful It’s not very easy to use the learned skills What makes a useful skill? 22
What’s the problem? Consequences are hard to predict vs. consequences are easy to predict 23
Slightly different mutual information. Future hard to predict for different skills; predictable future for a given skill. 24 Sharma, Gu, Levine, Kumar, Hausman, DADS, 2019.
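The two labels on this slide correspond to the two terms of a conditional mutual information between the next state s' and the skill z. A sketch of the decomposition, following the DADS formulation:

```latex
\begin{aligned}
I(s'; z \mid s) &= H(s' \mid s) - H(s' \mid s, z) \\
                &= H(z \mid s) - H(z \mid s', s)
\end{aligned}
```

In the first line, a large H(s' | s) means the future is hard to predict when the skill is unknown (diverse skills), and a small H(s' | s, z) means the future is predictable once the skill is given. The second line is the equivalent form that resembles the discriminator-based objective.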
Skill-dynamics model We are learning a skill-dynamics model, as opposed to a conventional global dynamics model. Skills are optimized specifically to make skill-dynamics easier to model. 25 Sharma, Gu, Levine, Kumar, Hausman, DADS, 2019.
DADS algorithm
1. Sample skills z_1, z_2, z_3, … from p(z)
2. Collect rollouts (s_1, a_1) … (s_T, a_T) for each skill
3. Update q_φ(s' | s, z)
4. Compute r_z(s, a, s'), giving (s_1, a_1, r_1) … (s_T, a_T, r_T)
5. Update π(a | s, z)
repeat
26 Sharma, Gu, Levine, Kumar, Hausman, DADS, 2019.
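The reward computation step can be sketched as follows. It assumes the DADS reward of the form log q(s' | s, z) − log((1/L) Σ_i q(s' | s, z_i)) with z_i drawn from p(z); `toy_skill_dynamics_log_prob` is a hypothetical Gaussian stand-in for the learned skill-dynamics model, and `prior_samples` is a fixed made-up set of skills.

```python
import math

def gaussian_log_prob(x, mean, std=0.1):
    return -0.5 * ((x - mean) / std) ** 2 - math.log(std * math.sqrt(2 * math.pi))

def toy_skill_dynamics_log_prob(s_next, s, z):
    """Hypothetical stand-in for a learned q(s' | s, z).

    In this toy model, skill z shifts the 1-D state by z.
    """
    return gaussian_log_prob(s_next, mean=s + z)

def dads_reward(s, z, s_next, prior_samples):
    """log q(s'|s,z) - log( (1/L) * sum_i q(s'|s,z_i) ), with z_i ~ p(z)."""
    log_q = toy_skill_dynamics_log_prob(s_next, s, z)
    logs = [toy_skill_dynamics_log_prob(s_next, s, zi) for zi in prior_samples]
    m = max(logs)  # log-sum-exp for numerical stability
    log_marginal = m + math.log(sum(math.exp(l - m) for l in logs)) - math.log(len(logs))
    return log_q - log_marginal

prior_samples = [-1.0, -0.5, 0.0, 0.5, 1.0]  # fixed samples standing in for z_i ~ p(z)

print("transition matches the skill:    ", dads_reward(0.0, 1.0, 1.0, prior_samples))
print("transition contradicts the skill:", dads_reward(0.0, 1.0, -1.0, prior_samples))
```

The reward is high when the transition is well explained by the current skill and poorly explained by other skills, so it pushes toward predictability and diversity at once.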
DADS results DIAYN DADS 27
Using learned skills
Use skill-dynamics for model-based planning
Plan for skills not actions
Tasks can be learned zero-shot
[Diagram: planning loop. Candidate plans p_1, p_2, p_3, each a skill sequence (z_1, z_2 … z_H), are rolled out with skill-dynamics q_φ and policy π into (s_0, a_0, … s_H, a_H); compute or estimate the cumulative reward r̂_1, r̂_2, r̂_3; update planner; iterate]
28
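Planning over skills rather than actions can be illustrated with a very small planner. This sketch swaps in plain random shooting for whatever planner the original method uses, and `toy_skill_dynamics`, the goal-distance reward, and all parameter values are hypothetical choices made for the example.

```python
import random

def toy_skill_dynamics(s, z):
    """Hypothetical stand-in for the mean prediction of a learned q(s' | s, z)."""
    return s + z

def rollout_return(s0, plan, goal):
    """Estimated cumulative reward of a skill plan (z_1 ... z_H) under the model."""
    s, total = s0, 0.0
    for z in plan:
        s = toy_skill_dynamics(s, z)
        total += -abs(s - goal)  # task reward is only needed at planning time
    return total

def plan_skills(s0, goal, horizon=5, num_plans=200, rng=None):
    """Random shooting: sample skill sequences and keep the best-scoring one."""
    if rng is None:
        rng = random.Random(0)
    plans = [[rng.uniform(-1.0, 1.0) for _ in range(horizon)]
             for _ in range(num_plans)]
    return max(plans, key=lambda p: rollout_return(s0, p, goal))

best_plan = plan_skills(s0=0.0, goal=3.0)
print("best skill sequence:", best_plan)
print("estimated return:", rollout_return(0.0, best_plan, 3.0))
```

No policy or model parameters are updated here, which is the sense in which a new task can be attempted zero-shot: only the search over skill sequences depends on the task reward.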
Summary
- Two skill discovery algorithms that use mutual information
- Predictability can be used as a proxy for “usefulness”
- Method that optimizes for both predictability and diversity
- Model-based planning in the skill space
- Opens new avenues such as unsupervised meta-RL
- Gupta et al. Unsupervised Meta-Learning for RL, 2018
29
The Plan Information-theoretic concepts Skill discovery Using discovered skills Hierarchical RL 33
Why Hierarchical RL? Performing tasks at various levels of abstraction. Exploration. Bake a cheesecake → Buy ingredients → Go to the store → Walk to the door → Take a step → Contract muscle X 34
Hierarchical RL – design choices
Design choices:
- goal-conditioned vs not
- pre-trained vs e2e
- self-terminating vs fixed rate
- on-policy vs off-policy
[Diagram: high-level policy π_h emits z_1 vs s_1, z_2 vs s_2; low-level policy π_l maps states s_0 … s_6 to actions a_0 … a_6, each applied to env]
35
Learning Locomotor Controllers
High-level controller: command updated every K steps
Low-level controller
[Diagram: π_h emits z_1 vs s_1, z_2 vs s_2; π_l maps s_0 … s_6 to a_0 … a_6 through env; labels: task-specific information, proprioceptive information]
- HL and LL trained separately
- Trained with policy gradients
- Hierarchical noise
Design choices:
- goal-conditioned vs not
- pre-trained vs e2e
- self-terminating vs fixed rate
- on-policy vs off-policy
36 Heess, Wayne, Tassa, Lillicrap, Riedmiller, Silver, Learning Locomotor Controllers, 2016.
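The fixed-rate structure, where the high-level command is refreshed only every K steps, can be sketched as a simple control loop. Everything here is a toy assumption for illustration: the 1-D environment, both hand-written controllers, and the values of `num_steps` and `k`.

```python
def high_level_policy(state, goal):
    """Hypothetical high-level controller: command the direction of the goal."""
    return 1.0 if goal > state else -1.0

def low_level_policy(state, command):
    """Hypothetical low-level controller: take a small step as commanded."""
    return 0.1 * command

def env_step(state, action):
    """Toy 1-D environment: the action is added to the state."""
    return state + action

def run_episode(s0, goal, num_steps=30, k=5):
    state, command, commands_issued = s0, None, 0
    for t in range(num_steps):
        if t % k == 0:  # the high level acts only every k steps
            command = high_level_policy(state, goal)
            commands_issued += 1
        action = low_level_policy(state, command)  # the low level acts every step
        state = env_step(state, action)
    return state, commands_issued

final_state, commands_issued = run_episode(s0=0.0, goal=2.0)
print("final state:", final_state, "high-level decisions:", commands_issued)
```

With 30 environment steps and k = 5, the high-level controller makes only 6 decisions, which is the temporal abstraction the hierarchy is meant to provide.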
Option Critic
[Diagram: π_h emits z_1 vs s_1, z_2 vs s_2; π_l maps s_0 … s_6 to a_0 … a_6 through env]
- Option is a self-terminating mini-policy
- Everything trained together with policy gradient
Design choices:
- goal-conditioned vs not
- pre-trained vs e2e
- self-terminating vs fixed rate
- on-policy vs off-policy
37 Bacon, Harb, Precup, The Option-Critic Architecture, 2016.
Relay Policy Learning
[Diagram: π_h emits z_1 vs s_1, z_2 vs s_2; π_l maps s_0 … s_6 to a_0 … a_6 through env]
Design choices:
- goal-conditioned vs not
- pre-trained vs e2e
- self-terminating vs fixed rate
- on-policy vs off-policy
38 Gupta, Kumar, Lynch, Levine, Hausman, Relay Policy Learning, 2019.
Relay Policy Learning
[Diagram: π_h emits z_1 vs s_1, z_2 vs s_2; π_l maps s_0 … s_6 to a_0 … a_6 through env]
- Goal-conditioned policies with relabeling
- Demonstrations to pre-train everything
- On-policy
Design choices:
- goal-conditioned vs not
- pre-trained vs e2e
- self-terminating vs fixed rate
- on-policy vs off-policy
39 Gupta, Kumar, Lynch, Levine, Hausman, Relay Policy Learning, 2019.
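Goal relabeling can be illustrated with a generic hindsight scheme: states that a trajectory actually reached are reused as goals for earlier steps. This is a sketch of the general idea and not the exact procedure from the Relay Policy Learning paper; the function name, the `window` parameter, and the toy trajectory are assumptions.

```python
def relabel_with_future_goals(states, actions, window):
    """Build goal-conditioned tuples (s_t, g, a_t), where g was actually reached.

    For every step t, each state up to `window` steps later is used as a goal.
    """
    examples = []
    for t in range(len(actions)):
        last = min(t + window, len(states) - 1)
        for future in range(t + 1, last + 1):
            examples.append((states[t], states[future], actions[t]))
    return examples

states = [0, 1, 2, 3]         # s_0 ... s_3 from one made-up trajectory
actions = ["a0", "a1", "a2"]  # a_t takes s_t to s_{t+1}
examples = relabel_with_future_goals(states, actions, window=2)

for example in examples:
    print(example)
```

One short trajectory yields several (state, goal, action) training tuples, which is why relabeling makes goal-conditioned training data-efficient.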
HIRO
[Diagram: π_h emits z_1 vs s_1, z_2 vs s_2; π_l maps s_0 … s_6 to a_0 … a_6 through env]
- Goal-conditioned policies with relabeling
- Off-policy training through off-policy corrections
Design choices:
- goal-conditioned vs not
- pre-trained vs e2e
- self-terminating vs fixed rate
- on-policy vs off-policy
40 Nachum, Gu, Lee, Levine, HIRO, 2018.