

SLIDE 1

CS 330

Hierarchical RL and Skill Discovery

SLIDE 2

The Plan


  • Information-theoretic concepts
  • Skill discovery
  • Using discovered skills
  • Hierarchical RL

SLIDE 3

Why Skill Discovery?

What if we want to discover interesting behaviors?


[The construction of movement by the spinal cord, Tresch et al., 1999] [Postural hand synergies for tool use, Santello et al., 1998]

SLIDE 4

Why Skill Discovery? More practical version

Coming up with tasks is tricky…


Write down task ideas for a tabletop manipulation scenario

[Meta-World, Yu, Quillen, He, et al., 2019]

SLIDE 5

Why Hierarchical RL?

Performing tasks at various levels of abstraction


Bake a cheesecake

Buy ingredients

Go to the store

Walk to the door

Take a step

Contract muscle X

Exploration

SLIDE 6

The Plan


  • Information-theoretic concepts
  • Skill discovery
  • Using discovered skills
  • Hierarchical RL

SLIDE 7

Entropy


Slide adapted from Sergey Levine
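For reference, the standard definition used throughout this lecture: entropy measures how spread out a distribution is, and is maximized by the uniform distribution.

```latex
\mathcal{H}(p) = -\mathbb{E}_{x \sim p(x)}\left[\log p(x)\right] = -\sum_x p(x) \log p(x)
```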

SLIDE 8

KL-divergence


A measure of distance between two distributions (not a true distance: it is asymmetric and violates the triangle inequality)
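For reference, the standard definition:

```latex
D_{\mathrm{KL}}\left(p \,\|\, q\right) = \mathbb{E}_{x \sim p(x)}\left[\log \frac{p(x)}{q(x)}\right] \ge 0
```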

SLIDE 9

Mutual information


Slide adapted from Sergey Levine

High MI or low?
  • x – it rains tomorrow, y – streets are wet tomorrow (high: one variable is very informative about the other)
  • x – it rains tomorrow, y – we find life on Mars tomorrow (low: the variables are independent)
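For reference, mutual information measures how much knowing y reduces the uncertainty about x, and can be written as a KL-divergence between the joint and the product of the marginals:

```latex
\mathcal{I}(x; y) = D_{\mathrm{KL}}\big(p(x, y) \,\|\, p(x)\,p(y)\big) = \mathcal{H}(x) - \mathcal{H}(x \mid y)
```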

SLIDE 10

Mutual information


Slide adapted from Sergey Levine

SLIDE 11

The Plan


  • Information-theoretic concepts
  • Skill discovery
  • Using discovered skills
  • Hierarchical RL

SLIDE 12

Soft Q-learning

Objective, value- and Q-functions, and the policy.

Q-learning: \( \pi(a \mid s) = \arg\max_a Q(s, a) \)

Soft Q-learning: \( \pi(a \mid s) \propto \exp Q(s, a) \)
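A minimal sketch of the contrast, assuming a small discrete action space and an array of Q-values Q(s, ·); the temperature alpha and all names are illustrative, not from the slide. The soft policy samples from a Boltzmann distribution over Q-values instead of taking the argmax.

```python
import numpy as np

def greedy_policy(q_values):
    """Q-learning: deterministically pick the argmax action."""
    return int(np.argmax(q_values))

def soft_policy_probs(q_values, alpha=1.0):
    """Soft Q-learning: pi(a|s) proportional to exp(Q(s,a)/alpha)."""
    logits = q_values / alpha
    logits = logits - logits.max()          # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

def soft_value(q_values, alpha=1.0):
    """Soft value function: V(s) = alpha * log sum_a exp(Q(s,a)/alpha)."""
    return alpha * np.logaddexp.reduce(q_values / alpha)

q = np.array([1.0, 2.0, 0.5])               # hypothetical Q(s, ·)
a = np.random.choice(len(q), p=soft_policy_probs(q))
```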


SLIDE 13

Soft Q-learning

Haarnoja et al. RL with Deep Energy-Based Policies, 2017

  • Robustness
  • Exploration
  • Fine-tunability


SLIDE 14

Learning diverse skills

Each skill is conditioned on a latent task index z.

Why can't we just use MaxEnt RL?

  • 1. Action entropy is not the same as state entropy: the agent can take very different actions, but land in similar states
  • 2. MaxEnt policies are stochastic, but not always controllable

Intuitively, we want low diversity for a fixed z, and high diversity across z's.

Intuition: different skills should visit different state-space regions

Eysenbach, Gupta, Ibarz, Levine. Diversity is All You Need. Slide adapted from Sergey Levine


SLIDE 15

Diversity-promoting reward function

[Diagram: a skill z is sampled and fed to the policy (agent); the policy takes actions in the environment; a discriminator (D) observes the resulting states and tries to predict the skill]

Eysenbach, Gupta, Ibarz, Levine. Diversity is All You Need. Slide adapted from Sergey Levine
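A minimal sketch of the resulting pseudo-reward, assuming a uniform prior over num_skills skills and a hypothetical discriminator_log_prob(state, z) that returns log q(z | s) from the learned discriminator: the agent is rewarded for visiting states from which its skill is easy to infer.

```python
import numpy as np

def diayn_reward(discriminator_log_prob, state, z, num_skills):
    """DIAYN intrinsic reward: r = log q(z|s) - log p(z), with p(z) uniform."""
    log_q_z_given_s = discriminator_log_prob(state, z)  # discriminator's confidence
    log_p_z = -np.log(num_skills)                       # uniform skill prior
    return log_q_z_given_s - log_p_z
```

Subtracting log p(z) makes the reward positive exactly when the discriminator beats chance.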


SLIDE 16

Examples of learned tasks

[Environments shown: Cheetah, Ant, Mountain car]

Eysenbach, Gupta, Ibarz, Levine. Diversity is All You Need.


SLIDE 17

A connection to mutual information

Eysenbach, Gupta, Ibarz, Levine. Diversity is All You Need. See also: Gregor et al. Variational Intrinsic Control. 2016 Slide adapted from Sergey Levine
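As derived in the DIAYN paper, the objective maximizes the mutual information between states and skills while keeping the skill-conditioned policy as random as possible:

```latex
\mathcal{F}(\theta)
  = \mathcal{I}(s; z) + \mathcal{H}[a \mid s] - \mathcal{I}(a; z \mid s)
  = \mathcal{H}[z] - \mathcal{H}[z \mid s] + \mathcal{H}[a \mid s, z]
```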


SLIDE 18

The Plan


  • Information-theoretic concepts
  • Skill discovery
  • Using discovered skills
  • Hierarchical RL

SLIDE 19

How to use learned skills?

How can we use the learned skills to accomplish a task? Learn a policy that operates on z's.

Eysenbach, Gupta, Ibarz, Levine. Diversity is All You Need.


SLIDE 20

Results: hierarchical RL

Can we do better?

Eysenbach, Gupta, Ibarz, Levine. Diversity is All You Need.


SLIDE 21

What’s the problem?

  • It's not very easy to use the learned skills
  • Skills might not be particularly useful


What makes a useful skill?

SLIDE 22

What's the problem?

[Comparison: skills whose consequences are hard to predict vs. skills whose consequences are easy to predict]


SLIDE 23

Slightly different mutual information

  • The future is hard to predict across different skills
  • The future is predictable for a given skill

Sharma, Gu, Levine, Kumar, Hausman, DADS, 2019.
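DADS conditions the mutual information on the current state: across skills the next state should be diverse, but given a skill it should be predictable.

```latex
\mathcal{I}(s'; z \mid s) = \mathcal{H}(s' \mid s) - \mathcal{H}(s' \mid s, z)
```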


SLIDE 24

Skill-dynamics model

We learn a skill-dynamics model q(s' | s, z). Skills are optimized specifically to make skill-dynamics easier to model than conventional global dynamics p(s' | s, a).

Sharma, Gu, Levine, Kumar, Hausman, DADS, 2019.


SLIDE 25

DADS algorithm

Repeat:
  • Sample skills z ~ p(z) and collect trajectories (s1, a1) … (sT, aT) with the skill policy π(a | s, z)
  • Update the skill-dynamics model qφ(s' | s, z) on the collected transitions
  • Compute the intrinsic reward rz(s, a, s') from the skill-dynamics model
  • Update the policy π(a | s, z) on the relabeled trajectories (s1, a1, r1) … (sT, aT, rT)

Sharma, Gu, Levine, Kumar, Hausman, DADS, 2019.
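A minimal sketch of this alternation, with the sampling and update routines passed in as hypothetical callables (none of these names come from the paper):

```python
import numpy as np

def dads_training_loop(env, policy, skill_dynamics, sample_trajectory,
                       update_dynamics, intrinsic_reward, update_policy,
                       num_skills=8, num_iterations=1000):
    """Alternate between fitting q(s'|s,z) and reinforcing pi(a|s,z)."""
    for _ in range(num_iterations):
        # Sample a skill from the prior p(z) and roll out the skill policy.
        z = np.random.randint(num_skills)
        states, actions, next_states = sample_trajectory(env, policy, z)

        # Fit the skill-dynamics model on the collected transitions.
        update_dynamics(skill_dynamics, states, z, next_states)

        # Relabel transitions with the intrinsic reward: how much better the
        # skill-conditioned model predicts s' than the average over skills.
        rewards = [intrinsic_reward(skill_dynamics, s, z, s2, num_skills)
                   for s, s2 in zip(states, next_states)]

        # RL update of the skill policy on the relabeled trajectory.
        update_policy(policy, states, actions, rewards, z)
```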


SLIDE 26

DADS results

[Comparison: DIAYN vs. DADS]


SLIDE 27

Using learned skills

[Diagram: the planner proposes candidate skill sequences p1, p2, p3 = (z1, z2, … zH); the skill-dynamics model qφ and policy π unroll each into a predicted trajectory (s0, a0, … sH, aH); the planner computes or estimates each plan's cumulative reward r̂, updates its proposals, and iterates]

  • Use skill-dynamics for model-based planning (a minimal sketch follows below)
  • Plan over skills, not actions
  • Tasks can be learned zero-shot
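A minimal sketch of planning in skill space with the cross-entropy method, assuming hypothetical helpers rollout_in_model (unrolls the skill-dynamics model from a state under a skill sequence) and task_reward (scores the predicted trajectory):

```python
import numpy as np

def plan_in_skill_space(state, rollout_in_model, task_reward, skill_dim,
                        horizon=10, num_candidates=64, num_elites=8, iterations=5):
    """Cross-entropy-method planner over skill sequences instead of raw actions."""
    mean = np.zeros((horizon, skill_dim))
    std = np.ones((horizon, skill_dim))
    for _ in range(iterations):
        # Sample candidate skill sequences from the current proposal distribution.
        plans = mean + std * np.random.randn(num_candidates, horizon, skill_dim)
        # Score each plan entirely inside the learned skill-dynamics model.
        returns = np.array([task_reward(rollout_in_model(state, p)) for p in plans])
        # Refit the proposal distribution to the top-scoring plans (the elites).
        elites = plans[np.argsort(returns)[-num_elites:]]
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mean[0]  # execute the first skill, then replan
```

Because the task reward is only queried inside the learned model, a new task can be attempted zero-shot, without further environment interaction for training.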


SLIDE 28

Summary

  • Two skill discovery algorithms that use mutual information
  • Predictability can be used as a proxy for “usefulness”
  • A method that optimizes for both predictability and diversity
  • Model-based planning in the skill space
  • Opens new avenues such as unsupervised meta-RL
  • Gupta et al. Unsupervised Meta-Learning for RL, 2018


SLIDE 29

The Plan


  • Information-theoretic concepts
  • Skill discovery
  • Using discovered skills
  • Hierarchical RL

SLIDE 30

Why Hierarchical RL?

Performing tasks at various levels of abstraction


Bake a cheesecake

Buy ingredients

Go to the store

Walk to the door

Take a step

Contract muscle X

Exploration

SLIDE 31

Hierarchical RL – design choices

Design choices:

  • goal-conditioned vs not
  • pre-trained vs e2e
  • self-terminating vs fixed rate
  • on-policy vs off-policy

[Diagram: a high-level policy πh emits a latent command z every few steps; a low-level policy πl maps each state st (and the current command z) to an action at in the environment]
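A minimal sketch of the two-level rollout these methods share, assuming a gym-style env whose step returns (state, reward, done) and a fixed-rate variant where the high level is queried every steps_per_command steps:

```python
def hierarchical_rollout(env, high_policy, low_policy,
                         steps_per_command=10, max_steps=1000):
    """Two-level control: pi_h picks a latent command z; pi_l acts every step."""
    state = env.reset()
    total_reward = 0.0
    for step in range(max_steps):
        if step % steps_per_command == 0:
            z = high_policy(state)           # high level: new command every K steps
        action = low_policy(state, z)        # low level: conditioned on the command
        state, reward, done = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward
```

Each design choice above changes one piece of this loop: what z means (a goal or an abstract latent), whether the levels are pre-trained or trained end-to-end, whether a command ends at a fixed rate or self-terminates, and how each level's experience is reused.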


SLIDE 32

Learning Locomotor Controllers


Design choices:

  • goal-conditioned vs not
  • pre-trained vs e2e
  • self-terminating vs fixed rate
  • on-policy vs off-policy

Heess, Wayne, Tassa, Lillicrap, Riedmiller, Silver, Learning Locomotor Controllers, 2016.

[Diagram: a high-level controller receives task-specific information and sends a command, updated every K steps, to a low-level controller that receives proprioceptive information]

  • HL and LL trained separately
  • Trained with policy gradients
  • Hierarchical noise


SLIDE 33

Option Critic


Design choices:

  • goal-conditioned vs not
  • pre-trained vs e2e
  • self-terminating vs fixed rate
  • on-policy vs off-policy

Bacon, Harb, Precup, The Option-Critic Architecture, 2016.

  • An option is a self-terminating mini-policy
  • Everything is trained together with policy gradient (execution sketched below)
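A minimal sketch of option execution, assuming each option exposes a hypothetical intra-option policy option.act(state) and termination probability option.beta(state); option-critic trains all three components (the policy over options, the intra-option policies, and the terminations) jointly with policy gradients.

```python
import random

def run_with_options(env, policy_over_options, max_steps=1000):
    """Execute options: each one runs until its termination condition fires."""
    state = env.reset()
    option = policy_over_options(state)           # pick an initial option
    for _ in range(max_steps):
        action = option.act(state)                # intra-option policy
        state, reward, done = env.step(action)
        if done:
            break
        if random.random() < option.beta(state):  # self-termination (vs fixed rate)
            option = policy_over_options(state)   # high level picks a new option
    return state
```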


SLIDE 34

Relay Policy Learning


Design choices:

  • goal-conditioned vs not
  • pre-trained vs e2e
  • self-terminating vs fixed rate
  • on-policy vs off-policy

Gupta, Kumar, Lynch, Levine, Hausman, Relay Policy Learning, 2019.


SLIDE 35

Relay Policy Learning


Design choices:

  • goal-conditioned vs not
  • pre-trained vs e2e
  • self-terminating vs fixed rate
  • on-policy vs off-policy

Gupta, Kumar, Lynch, Levine, Hausman, Relay Policy Learning, 2019.

  • Goal-conditioned policies with relabeling (a minimal sketch follows below)
  • Demonstrations to pre-train everything
  • On-policy
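A minimal sketch of hindsight goal relabeling on demonstration windows (the window size and tuple layout are illustrative): a state actually reached later in the trajectory is treated as the goal the low-level policy was pursuing, yielding supervision for a goal-conditioned policy pi(a | s, g).

```python
def relabel_goals(trajectory, window=50):
    """Hindsight relabeling: treat a future state as the pursued goal."""
    labeled = []
    for t, (state, action) in enumerate(trajectory):
        goal_index = min(t + window, len(trajectory) - 1)
        goal = trajectory[goal_index][0]          # a state reached soon after t
        labeled.append((state, goal, action))     # supervises pi(a | s, g)
    return labeled
```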


SLIDE 36

HIRO


Design choices:

  • goal-conditioned vs not
  • pre-trained vs e2e
  • self-terminating vs fixed rate
  • on-policy vs off-policy

Nachum, Gu, Lee, Levine, HIRO, 2018.

  • Goal-conditioned policies with relabeling
  • Off-policy training through off-policy corrections (a minimal sketch follows below)

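A minimal sketch of the off-policy correction, assuming numpy states and a hypothetical low_policy_log_prob(state, goal, action): the stored high-level goal is replaced by the candidate that best explains the low-level actions actually taken, so old experience stays valid as the low-level policy changes.

```python
import numpy as np

def correct_goal(states, actions, old_goal, low_policy_log_prob, num_candidates=10):
    """Relabel a high-level transition with the goal that maximizes the
    likelihood of the observed low-level actions under the current low level."""
    achieved = states[-1] - states[0]             # change in state over the segment
    candidates = [old_goal, achieved] + [
        achieved + np.random.randn(*achieved.shape)
        for _ in range(num_candidates - 2)
    ]
    scores = [sum(low_policy_log_prob(s, g, a) for s, a in zip(states, actions))
              for g in candidates]
    return candidates[int(np.argmax(scores))]
```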


SLIDE 37

HRL Summary


Design choices:

  • goal-conditioned vs not
  • pre-trained vs e2e
  • self-terminating vs fixed rate
  • on-policy vs off-policy

  • Multiple design choices and frameworks
  • Helps with exploration and temporally extended tasks
  • Can be difficult to get it to work
  • Seems like a natural direction for harder RL problems

Nachum, Tang, Lu, Gu, Lee, Levine, Why Does Hierarchy (Sometimes) Work? 2019.


SLIDE 38

The Plan


  • Information-theoretic concepts
  • Skill discovery
  • Using discovered skills
  • Hierarchical RL

SLIDE 39


Next Week

Lifelong learning – Nov 4: Can the agent learn continuously over its lifetime?