SLIDE 1

Skill discovery from unstructured demonstrations
Pravesh Ranchod
School of Computer Science, University of the Witwatersrand
pravesh.ranchod@wits.ac.za

SLIDE 2

Initial objective

  • We want agents that can feasibly learn to do things autonomously
  • Minimize the burden on an expert
    – Specify what, not how

SLIDE 3

Reinforcement Learning

  • Reinforcement Learning
    – Learn behaviour from experience
    – MDP = (S, A, T, R)
    – Take actions that maximise long-term reward
    – Expert burden is reduced to specifying the reward function

[Diagram: states s1 → s2 → s3 via actions a1, a2 under transition function T, with a reward received at each step]

SLIDE 4

Reinforcement Learning

  • Reinforcement Learning process
    – We specify the transition dynamics and reward function, and get a policy

[Diagram: system dynamics + reward function → Reinforcement Learning Algorithm → policy]
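To make this pipeline concrete, here is a minimal sketch, assuming a small tabular MDP: given transition dynamics T and a reward function R, value iteration returns a greedy policy. The toy problem and all names are illustrative, not from the talk.

```python
# A minimal sketch of the RL pipeline on a tabular MDP: given transition
# dynamics T and a reward function R, value iteration returns a policy.
# The toy MDP and all names here are illustrative.
import numpy as np

def value_iteration(T, R, gamma=0.95, tol=1e-6):
    """T: (S, A, S) transition probabilities, R: (S, A) rewards."""
    V = np.zeros(T.shape[0])
    while True:
        # Q(s, a) = R(s, a) + gamma * sum_s' T(s, a, s') V(s')
        Q = R + gamma * T @ V
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    return Q.argmax(axis=1)  # greedy policy: one action per state

# Two-state, two-action toy problem: action 1 yields reward 1 in every
# state, so the greedy policy should pick it everywhere.
T = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[1.0, 0.0], [0.0, 1.0]]])
R = np.array([[0.0, 1.0],
              [0.0, 1.0]])
print(value_iteration(T, R))  # [1 1]
```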

SLIDE 5

Reinforcement Learning

  • SARSA / Q-Learning
    – Observe state, take action, receive reward, observe new state
    – Keep track of the value of an action in a particular state
    – Estimate the value of a state as the immediate reward received plus the value of the new state
    – Update estimates by moving the estimate in the direction of the observation
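The update described above fits in a few lines. A minimal sketch of the tabular Q-learning rule, with illustrative values for the learning rate and discount:

```python
# A sketch of the tabular Q-learning update: observe (s, a, r, s'), then
# move the value estimate towards the immediate reward plus the
# discounted value of the new state. Parameter values are illustrative.
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.95):
    target = r + gamma * np.max(Q[s_next])   # reward + value of new state
    Q[s, a] += alpha * (target - Q[s, a])    # step towards the observation
    return Q

Q = np.zeros((5, 2))                         # 5 states, 2 actions
Q = q_update(Q, s=0, a=1, r=1.0, s_next=3)
```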

SLIDE 6

Skills

  • Problem: too many states and actions
    – Actions could be too low level (e.g. robot walking)
  • Potential solution: use the options framework to introduce high-level actions
    – Each option is an RL task of its own
    – We can then invoke an entire option as an action
    – Analogous to skills
    – Requires the expert to specify MANY RL tasks, hence many reward functions
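For concreteness, a sketch of the standard option triple (initiation set, policy, termination condition) from the options framework; the class and function names, and the environment `step` function, are ours, not from the talk.

```python
# A sketch of an option as the standard (I, pi, beta) triple: an option
# is a skill the agent can invoke as a single high-level action.
import random
from dataclasses import dataclass
from typing import Callable

@dataclass
class Option:
    can_start: Callable[[int], bool]     # initiation set I: where the option may begin
    policy: Callable[[int], int]         # pi: the option's own learned policy
    should_stop: Callable[[int], float]  # beta: termination probability per state

def run_option(opt: Option, state: int, step: Callable[[int, int], int]) -> int:
    """Execute the option until beta terminates it; return the final state."""
    assert opt.can_start(state)
    while True:
        state = step(state, opt.policy(state))
        if random.random() < opt.should_stop(state):
            return state
```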

SLIDE 7

Updated objective

  • We want agents that can feasibly learn to do things autonomously
  • Minimize the burden on an expert when many tasks are to be learned
    – Specify what, not how
    – Demonstrate what, not how

SLIDE 8

Inverse Reinforcement Learning

  • Reinforcement learning can produce action selections (a policy) from a reward function
  • Inverse Reinforcement Learning produces a reward function by observing action selections
  • Iteratively proposes and evaluates reward functions, attempting to match expert observations
SLIDE 9

Inverse Reinforcement Learning

  • Inverse Reinforcement Learning process
    – We provide trajectories and dynamics, and get a reward function (which, if optimised, would match expert behaviour)

[Diagram: system dynamics + expert behaviour → Inverse Reinforcement Learning Algorithm → reward function]
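A hedged sketch of the propose-and-evaluate loop from the previous slide. `solve_mdp` and `match_score` are hypothetical placeholders for a forward RL solver and an expert-similarity measure; real IRL algorithms search reward space far more cleverly than random proposals, so this only illustrates the loop's shape.

```python
# A caricature of the generic IRL loop: propose a reward function, solve
# the forward RL problem under it, and score how well the resulting
# policy matches the expert trajectories. Keep the best proposal.
import numpy as np

def irl(T, expert_trajectories, solve_mdp, match_score, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    n_states = T.shape[0]
    best_R, best_score = None, -np.inf
    for _ in range(n_iters):
        R = rng.normal(size=n_states)                     # propose a reward function
        policy = solve_mdp(T, R)                          # forward RL under the proposal
        score = match_score(policy, expert_trajectories)  # agreement with the expert
        if score > best_score:
            best_R, best_score = R, score
    return best_R  # optimising this should reproduce expert behaviour
```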

SLIDE 10

Inverse Reinforcement Learning

  • Well, how pointless was that?
    – Surprisingly pointful
    – Captures the goal of the demonstrator rather than just the actions
    – Allows action selection in situations the expert did not encounter
    – Allows robustness to changing environments and capabilities

SLIDE 11

Learning from demonstration

  • Must provide many demonstrations to learn many reward functions for many small tasks (options)
    – The demonstrator could demonstrate small tasks repetitively (annoying and time consuming)
    – Annotations could be provided indicating when each task begins and ends (still annoying, and difficult)

SLIDE 12

Objective

  • We want agents that can feasibly learn to do things autonomously
  • Minimize the burden on an expert when many tasks are to be learned
    – Specify what, not how
    – Demonstrate what, not how
    – Unstructured demonstrations

SLIDE 13

NPBRS

  • We introduce a technique called Nonparametric Bayesian Reward Segmentation (NPBRS)
    – Takes unstructured demonstrations and produces many reward functions, along with the policies that optimise them
    – Does this by segmenting trajectories into more likely pieces

[Diagram: one unstructured trajectory segmented into skills A, B, C, A]

SLIDE 14

Segmentation

  • What information do we have to segment on?
  • Reward-based segmentation
    – Performs IRL on each segment
    – Evaluates the quality of the IRL
    – Bad segmentation will lead to bad IRL

[Diagram: the same A B C A trajectory explained with one reward function (lousy fit) versus three reward functions (great fit)]
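A minimal sketch of reward-based segmentation scoring, assuming a hypothetical `irl_fit` routine that runs IRL on one segment and returns a fit score (e.g. a log likelihood). The function and variable names are ours.

```python
# A sketch of reward-based segmentation scoring: cut the trajectory at
# the proposed breakpoints, run IRL on each segment, and sum the fit
# scores. Cuts whose pieces are each well explained by a single reward
# function score higher; bad cuts yield poorly-fitting rewards.

def score_segmentation(trajectory, breakpoints, irl_fit):
    bounds = [0, *breakpoints, len(trajectory)]
    segments = [trajectory[a:b] for a, b in zip(bounds, bounds[1:])]
    return sum(irl_fit(seg) for seg in segments)

# e.g. score_segmentation(traj, breakpoints=[40, 90], irl_fit=my_irl)
```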

SLIDE 15

Our model

  • Assume separate skill sets per trajectory, generated from a Beta process
    – Allows for an infinitely sized skill set
    – Encourages shared skills across trajectories
    – Allows skill dynamics to change depending on the skill set
  • Within each skill set, model the skill transition dynamics as a sticky Hidden Markov Model
  • The skill sequence is drawn from the skill transition distribution
  • Within each skill, the observations are generated from a skill-specific MDP, where every skill shares transition dynamics but has a specific reward function
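To illustrate just the "sticky" ingredient (not the full Beta-process construction), here is a sketch of skill transition rows drawn from a Dirichlet prior with extra mass on self-transitions, so the model favours staying in the current skill. Hyperparameter values and names are illustrative.

```python
# A sketch of the sticky-HMM transition prior: each row of the skill
# transition matrix is Dirichlet-distributed, with extra concentration
# kappa on the self-transition, biasing the chain towards persisting in
# the current skill.
import numpy as np

def sticky_transition_matrix(n_skills, alpha=1.0, kappa=10.0, seed=0):
    rng = np.random.default_rng(seed)
    P = np.empty((n_skills, n_skills))
    for k in range(n_skills):
        conc = np.full(n_skills, alpha)
        conc[k] += kappa                 # stickiness: boost self-transition mass
        P[k] = rng.dirichlet(conc)
    return P

P = sticky_transition_matrix(3)
# Rows are distributions over the next skill; the diagonal dominates.
```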

SLIDE 16

Our model

  • Perform inference on this model using a Markov chain Monte Carlo sampler
    – Sample based on model likelihood, i.e. the probability of the data given the model
    – The observation log likelihood is the sum of the log likelihood of each transition
    – The likelihood of each transition is the probability of the action selection under the optimal policy for the reward function generated from IRL on all segments assigned to that skill

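The likelihood term above translates almost directly into code. A sketch, assuming `policy[s, a]` gives the probability of taking action a in state s under the optimal policy for the skill's IRL-derived reward function; names are ours.

```python
# A sketch of the segment likelihood: sum, over the segment's
# transitions, the log probability of the action actually taken under
# the (stochastic) optimal policy for that skill's reward function.
import numpy as np

def segment_log_likelihood(segment, policy):
    """segment: iterable of (state, action) pairs; policy: (S, A) probabilities."""
    return sum(np.log(policy[s, a]) for s, a in segment)
```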

SLIDE 17

Does it work?

  • Car domain
    – Skill A: hit every other car
    – Skill B: stay in the left lane, but switch to avoid collisions
    – Skill C: stay in the right lane, but switch to avoid collisions
  • Data generated by randomly switching between policies with probability 0.01
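A sketch of the stated data-generation process: follow one skill's policy and, at each step with probability 0.01, resample the active skill. `policies` and `step` are hypothetical stand-ins for the three driving policies and the car simulator.

```python
# A sketch of generating demonstrations by random policy switching:
# at each step, with probability p_switch, resample the active skill
# uniformly (possibly the same one), then act with that skill's policy.
import random

def generate_trajectory(policies, step, state, n_steps, p_switch=0.01, seed=0):
    rng = random.Random(seed)
    skill = rng.randrange(len(policies))
    trajectory = []
    for _ in range(n_steps):
        if rng.random() < p_switch:              # random skill switch
            skill = rng.randrange(len(policies))
        action = policies[skill](state)
        trajectory.append((state, action, skill))
        state = step(state, action)
    return trajectory
```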

SLIDE 18

Does it work?

SLIDE 19

Does it work?

  • Quadcopter domain
    – The car domain in 3D
    – Skill A: go through all hoops
    – Skill B: stay in the top row and avoid all hoops
    – Skill C: stay in the bottom row and avoid all hoops

SLIDE 20

Does it work?

SLIDE 21

Conclusion

  • We can now recover multiple skills from a set of unstructured trajectories
  • Now to see if the discovered skills are useful options in learning new tasks
  • Continuous domains