

SLIDE 1

CS 330

Hierarchical RL and Skill Discovery

SLIDE 2

The Plan


  • Information-theoretic concepts
  • Skill discovery
  • Using discovered skills
  • Hierarchical RL

SLIDE 3

Why Skill Discovery?

What if we want to discover interesting behaviors?


[The construction of movement by the spinal cord, Tresch et al., 1999] [Postural hand synergies for tool use, Santello et al., 1998]

SLIDE 4

Why Skill Discovery? More practical version

Coming up with tasks is tricky…


Write down task ideas for a tabletop manipulation scenario

[Meta-World, Yu, Quillen, He, et al., 2019]

SLIDE 5

Why Hierarchical RL?

Performing tasks at various levels of abstraction


Bake a cheesecake

Buy ingredients

Go to the store

Walk to the door

Take a step

Contract muscle X

Exploration

SLIDE 6

The Plan


  • Information-theoretic concepts
  • Skill discovery
  • Using discovered skills
  • Hierarchical RL

SLIDE 7

Entropy


Slide adapted from Sergey Levine
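For reference, the standard definition used throughout this lecture: entropy measures how spread out a distribution is, and is maximized by the uniform distribution.

```latex
\mathcal{H}(p) = -\mathbb{E}_{x \sim p(x)}\left[\log p(x)\right] = -\sum_x p(x) \log p(x)
```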

SLIDE 8

KL-divergence


A measure of distance between two distributions (not a true distance: it is asymmetric and violates the triangle inequality)
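For reference, the standard definition:

```latex
D_{\mathrm{KL}}\left(p \,\|\, q\right) = \mathbb{E}_{x \sim p(x)}\left[\log \frac{p(x)}{q(x)}\right] \ge 0
```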

SLIDE 9

Mutual information


Slide adapted from Sergey Levine

High MI or low?
  • x – it rains tomorrow, y – streets are wet tomorrow (high: one variable is very informative about the other)
  • x – it rains tomorrow, y – we find life on Mars tomorrow (low: the variables are independent)
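For reference, mutual information measures how much knowing y reduces the uncertainty about x, and can be written as a KL-divergence between the joint and the product of the marginals:

```latex
\mathcal{I}(x; y) = D_{\mathrm{KL}}\big(p(x, y) \,\|\, p(x)\,p(y)\big) = \mathcal{H}(x) - \mathcal{H}(x \mid y)
```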

SLIDE 10

Mutual information


Slide adapted from Sergey Levine

SLIDE 11

The Plan


  • Information-theoretic concepts
  • Skill discovery
  • Using discovered skills
  • Hierarchical RL

SLIDE 12

Soft Q-learning

Objective, value- and Q-functions, and the policy.

Q-learning: \( \pi(a \mid s) = \arg\max_a Q(s, a) \)

Soft Q-learning: \( \pi(a \mid s) \propto \exp Q(s, a) \)
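A minimal sketch of the contrast, assuming a small discrete action space and an array of Q-values Q(s, ·); the temperature alpha and all names are illustrative, not from the slide. The soft policy samples from a Boltzmann distribution over Q-values instead of taking the argmax.

```python
import numpy as np

def greedy_policy(q_values):
    """Q-learning: deterministically pick the argmax action."""
    return int(np.argmax(q_values))

def soft_policy_probs(q_values, alpha=1.0):
    """Soft Q-learning: pi(a|s) proportional to exp(Q(s,a)/alpha)."""
    logits = q_values / alpha
    logits = logits - logits.max()          # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

def soft_value(q_values, alpha=1.0):
    """Soft value function: V(s) = alpha * log sum_a exp(Q(s,a)/alpha)."""
    return alpha * np.logaddexp.reduce(q_values / alpha)

q = np.array([1.0, 2.0, 0.5])               # hypothetical Q(s, ·)
a = np.random.choice(len(q), p=soft_policy_probs(q))
```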


SLIDE 13

Soft Q-learning

Haarnoja et al. RL with Deep Energy-Based Policies, 2017

  • Robustness
  • Exploration
  • Fine-tunability


SLIDE 14

Learning diverse skills

Each skill is conditioned on a latent task index z.

Why can't we just use MaxEnt RL?

  • 1. Action entropy is not the same as state entropy: the agent can take very different actions, but land in similar states
  • 2. MaxEnt policies are stochastic, but not always controllable

Intuitively, we want low diversity for a fixed z, and high diversity across z's.

Intuition: different skills should visit different state-space regions

Eysenbach, Gupta, Ibarz, Levine. Diversity is All You Need. Slide adapted from Sergey Levine


SLIDE 15

Diversity-promoting reward function

[Diagram: a skill z is sampled and fed to the policy (agent); the policy takes actions in the environment; a discriminator (D) observes the resulting states and tries to predict the skill]

Eysenbach, Gupta, Ibarz, Levine. Diversity is All You Need. Slide adapted from Sergey Levine
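A minimal sketch of the resulting pseudo-reward, assuming a uniform prior over num_skills skills and a hypothetical discriminator_log_prob(state, z) that returns log q(z | s) from the learned discriminator: the agent is rewarded for visiting states from which its skill is easy to infer.

```python
import numpy as np

def diayn_reward(discriminator_log_prob, state, z, num_skills):
    """DIAYN intrinsic reward: r = log q(z|s) - log p(z), with p(z) uniform."""
    log_q_z_given_s = discriminator_log_prob(state, z)  # discriminator's confidence
    log_p_z = -np.log(num_skills)                       # uniform skill prior
    return log_q_z_given_s - log_p_z
```

Subtracting log p(z) makes the reward positive exactly when the discriminator beats chance.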


SLIDE 16

Examples of learned tasks

[Environments shown: Cheetah, Ant, Mountain car]

Eysenbach, Gupta, Ibarz, Levine. Diversity is All You Need.


SLIDE 17

A connection to mutual information

Eysenbach, Gupta, Ibarz, Levine. Diversity is All You Need. See also: Gregor et al. Variational Intrinsic Control. 2016 Slide adapted from Sergey Levine
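As derived in the DIAYN paper, the objective maximizes the mutual information between states and skills while keeping the skill-conditioned policy as random as possible:

```latex
\mathcal{F}(\theta)
  = \mathcal{I}(s; z) + \mathcal{H}[a \mid s] - \mathcal{I}(a; z \mid s)
  = \mathcal{H}[z] - \mathcal{H}[z \mid s] + \mathcal{H}[a \mid s, z]
```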


SLIDE 18

The Plan


  • Information-theoretic concepts
  • Skill discovery
  • Using discovered skills
  • Hierarchical RL

SLIDE 19

How to use learned skills?

How can we use the learned skills to accomplish a task? Learn a policy that operates on z's.

Eysenbach, Gupta, Ibarz, Levine. Diversity is All You Need.


SLIDE 20

Results: hierarchical RL

Can we do better?

Eysenbach, Gupta, Ibarz, Levine. Diversity is All You Need.


SLIDE 21

What’s the problem?

  • It's not very easy to use the learned skills
  • Skills might not be particularly useful


What makes a useful skill?

SLIDE 22

What's the problem?

[Comparison: skills whose consequences are hard to predict vs. skills whose consequences are easy to predict]


SLIDE 23

Slightly different mutual information

  • The future is hard to predict across different skills
  • The future is predictable for a given skill

Sharma, Gu, Levine, Kumar, Hausman, DADS, 2019.
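DADS conditions the mutual information on the current state: across skills the next state should be diverse, but given a skill it should be predictable.

```latex
\mathcal{I}(s'; z \mid s) = \mathcal{H}(s' \mid s) - \mathcal{H}(s' \mid s, z)
```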


SLIDE 24

Skill-dynamics model

We learn a skill-dynamics model q(s' | s, z). Skills are optimized specifically to make skill-dynamics easier to model than conventional global dynamics p(s' | s, a).

Sharma, Gu, Levine, Kumar, Hausman, DADS, 2019.


SLIDE 25

DADS algorithm

Repeat:
  • Sample skills z ~ p(z) and collect trajectories (s1, a1) … (sT, aT) with the skill policy π(a | s, z)
  • Update the skill-dynamics model qφ(s' | s, z) on the collected transitions
  • Compute the intrinsic reward rz(s, a, s') from the skill-dynamics model
  • Update the policy π(a | s, z) on the relabeled trajectories (s1, a1, r1) … (sT, aT, rT)

Sharma, Gu, Levine, Kumar, Hausman, DADS, 2019.
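A minimal sketch of this alternation, with the sampling and update routines passed in as hypothetical callables (none of these names come from the paper):

```python
import numpy as np

def dads_training_loop(env, policy, skill_dynamics, sample_trajectory,
                       update_dynamics, intrinsic_reward, update_policy,
                       num_skills=8, num_iterations=1000):
    """Alternate between fitting q(s'|s,z) and reinforcing pi(a|s,z)."""
    for _ in range(num_iterations):
        # Sample a skill from the prior p(z) and roll out the skill policy.
        z = np.random.randint(num_skills)
        states, actions, next_states = sample_trajectory(env, policy, z)

        # Fit the skill-dynamics model on the collected transitions.
        update_dynamics(skill_dynamics, states, z, next_states)

        # Relabel transitions with the intrinsic reward: how much better the
        # skill-conditioned model predicts s' than the average over skills.
        rewards = [intrinsic_reward(skill_dynamics, s, z, s2, num_skills)
                   for s, s2 in zip(states, next_states)]

        # RL update of the skill policy on the relabeled trajectory.
        update_policy(policy, states, actions, rewards, z)
```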


SLIDE 26

DADS results

[Comparison: DIAYN vs. DADS]


SLIDE 27

Using learned skills

[Diagram: the planner proposes candidate skill sequences p1, p2, p3 = (z1, z2, … zH); the skill-dynamics model qφ and policy π unroll each into a predicted trajectory (s0, a0, … sH, aH); the planner computes or estimates each plan's cumulative reward r̂, updates its proposals, and iterates]

  • Use skill-dynamics for model-based planning (a minimal sketch follows below)
  • Plan over skills, not actions
  • Tasks can be learned zero-shot
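A minimal sketch of planning in skill space with the cross-entropy method, assuming hypothetical helpers rollout_in_model (unrolls the skill-dynamics model from a state under a skill sequence) and task_reward (scores the predicted trajectory):

```python
import numpy as np

def plan_in_skill_space(state, rollout_in_model, task_reward, skill_dim,
                        horizon=10, num_candidates=64, num_elites=8, iterations=5):
    """Cross-entropy-method planner over skill sequences instead of raw actions."""
    mean = np.zeros((horizon, skill_dim))
    std = np.ones((horizon, skill_dim))
    for _ in range(iterations):
        # Sample candidate skill sequences from the current proposal distribution.
        plans = mean + std * np.random.randn(num_candidates, horizon, skill_dim)
        # Score each plan entirely inside the learned skill-dynamics model.
        returns = np.array([task_reward(rollout_in_model(state, p)) for p in plans])
        # Refit the proposal distribution to the top-scoring plans (the elites).
        elites = plans[np.argsort(returns)[-num_elites:]]
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mean[0]  # execute the first skill, then replan
```

Because the task reward is only queried inside the learned model, a new task can be attempted zero-shot, without further environment interaction for training.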


SLIDE 28

Summary

  • Two skill discovery algorithms that use mutual information
  • Predictability can be used as a proxy for “usefulness”
  • A method that optimizes for both predictability and diversity
  • Model-based planning in the skill space
  • Opens new avenues such as unsupervised meta-RL
  • Gupta et al. Unsupervised Meta-Learning for RL, 2018


SLIDE 29

The Plan


  • Information-theoretic concepts
  • Skill discovery
  • Using discovered skills
  • Hierarchical RL

SLIDE 30

Why Hierarchical RL?

Performing tasks at various levels of abstraction


Bake a cheesecake

Buy ingredients

Go to the store

Walk to the door

Take a step

Contract muscle X

Exploration

SLIDE 31

Hierarchical RL – design choices

Design choices:

  • goal-conditioned vs not
  • pre-trained vs e2e
  • self-terminating vs fixed rate
  • on-policy vs off-policy

[Diagram: a high-level policy πh emits a latent command z every few steps; a low-level policy πl maps each state st (and the current command z) to an action at in the environment]
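A minimal sketch of the two-level rollout these methods share, assuming a gym-style env whose step returns (state, reward, done) and a fixed-rate variant where the high level is queried every steps_per_command steps:

```python
def hierarchical_rollout(env, high_policy, low_policy,
                         steps_per_command=10, max_steps=1000):
    """Two-level control: pi_h picks a latent command z; pi_l acts every step."""
    state = env.reset()
    total_reward = 0.0
    for step in range(max_steps):
        if step % steps_per_command == 0:
            z = high_policy(state)           # high level: new command every K steps
        action = low_policy(state, z)        # low level: conditioned on the command
        state, reward, done = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward
```

Each design choice above changes one piece of this loop: what z means (a goal or an abstract latent), whether the levels are pre-trained or trained end-to-end, whether a command ends at a fixed rate or self-terminates, and how each level's experience is reused.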


SLIDE 32

Learning Locomotor Controllers


Design choices:

  • goal-conditioned vs not
  • pre-trained vs e2e
  • self-terminating vs fixed rate
  • on-policy vs off-policy

Heess, Wayne, Tassa, Lillicrap, Riedmiller, Silver, Learning Locomotor Controllers, 2016.

[Diagram: a high-level controller receives task-specific information and sends a command, updated every K steps, to a low-level controller that receives proprioceptive information]

  • HL and LL trained separately
  • Trained with policy gradients
  • Hierarchical noise


SLIDE 33

Option Critic


Design choices:

  • goal-conditioned vs not
  • pre-trained vs e2e
  • self-terminating vs fixed rate
  • on-policy vs off-policy

Bacon, Harb, Precup, The Option-Critic Architecture, 2016.

  • An option is a self-terminating mini-policy
  • Everything is trained together with policy gradient (execution sketched below)
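A minimal sketch of option execution, assuming each option exposes a hypothetical intra-option policy option.act(state) and termination probability option.beta(state); option-critic trains all three components (the policy over options, the intra-option policies, and the terminations) jointly with policy gradients.

```python
import random

def run_with_options(env, policy_over_options, max_steps=1000):
    """Execute options: each one runs until its termination condition fires."""
    state = env.reset()
    option = policy_over_options(state)           # pick an initial option
    for _ in range(max_steps):
        action = option.act(state)                # intra-option policy
        state, reward, done = env.step(action)
        if done:
            break
        if random.random() < option.beta(state):  # self-termination (vs fixed rate)
            option = policy_over_options(state)   # high level picks a new option
    return state
```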


SLIDE 34

Relay Policy Learning


Design choices:

  • goal-conditioned vs not
  • pre-trained vs e2e
  • self-terminating vs fixed rate
  • on-policy vs off-policy

Gupta, Kumar, Lynch, Levine, Hausman, Relay Policy Learning, 2019.


SLIDE 35

Relay Policy Learning


Design choices:

  • goal-conditioned vs not
  • pre-trained vs e2e
  • self-terminating vs fixed rate
  • on-policy vs off-policy

Gupta, Kumar, Lynch, Levine, Hausman, Relay Policy Learning, 2019.

  • Goal-conditioned policies with relabeling (a minimal sketch follows below)
  • Demonstrations to pre-train everything
  • On-policy
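A minimal sketch of hindsight goal relabeling on demonstration windows (the window size and tuple layout are illustrative): a state actually reached later in the trajectory is treated as the goal the low-level policy was pursuing, yielding supervision for a goal-conditioned policy pi(a | s, g).

```python
def relabel_goals(trajectory, window=50):
    """Hindsight relabeling: treat a future state as the pursued goal."""
    labeled = []
    for t, (state, action) in enumerate(trajectory):
        goal_index = min(t + window, len(trajectory) - 1)
        goal = trajectory[goal_index][0]          # a state reached soon after t
        labeled.append((state, goal, action))     # supervises pi(a | s, g)
    return labeled
```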


SLIDE 36

HIRO


Design choices:

  • goal-conditioned vs not
  • pre-trained vs e2e
  • self-terminating vs fixed rate
  • on-policy vs off-policy

Nachum, Gu, Lee, Levine, HIRO, 2018.

  • Goal-conditioned policies with relabeling
  • Off-policy training through off-policy corrections (a minimal sketch follows below)

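A minimal sketch of the off-policy correction, assuming numpy states and a hypothetical low_policy_log_prob(state, goal, action): the stored high-level goal is replaced by the candidate that best explains the low-level actions actually taken, so old experience stays valid as the low-level policy changes.

```python
import numpy as np

def correct_goal(states, actions, old_goal, low_policy_log_prob, num_candidates=10):
    """Relabel a high-level transition with the goal that maximizes the
    likelihood of the observed low-level actions under the current low level."""
    achieved = states[-1] - states[0]             # change in state over the segment
    candidates = [old_goal, achieved] + [
        achieved + np.random.randn(*achieved.shape)
        for _ in range(num_candidates - 2)
    ]
    scores = [sum(low_policy_log_prob(s, g, a) for s, a in zip(states, actions))
              for g in candidates]
    return candidates[int(np.argmax(scores))]
```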


SLIDE 37

HRL Summary


Design choices:

  • goal-conditioned vs not
  • pre-trained vs e2e
  • self-terminating vs fixed rate
  • on-policy vs off-policy

  • Multiple design choices and frameworks
  • Helps with exploration and temporally extended tasks
  • Can be difficult to get it to work
  • Seems like a natural direction for harder RL problems

Nachum, Tang, Lu, Gu, Lee, Levine, Why Does Hierarchy (Sometimes) Work? 2019.


SLIDE 38

The Plan


  • Information-theoretic concepts
  • Skill discovery
  • Using discovered skills
  • Hierarchical RL

SLIDE 39


Next Week

Lifelong learning – Nov 4: Can the agent learn continuously over its lifetime?