SLIDE 1

Skill discovery from unstructured demonstrations
Pravesh Ranchod
School of Computer Science, University of the Witwatersrand
pravesh.ranchod@wits.ac.za

SLIDE 2

Initial objective

  • We want agents that can feasibly learn to do things autonomously
  • Minimize the burden on an expert
    – Specify what, not how

SLIDE 3

Reinforcement Learning

  • Reinforcement Learning
    – Learn behaviour from experience
    – MDP = (S, A, T, R)
    – Take actions that maximise long-term reward
    – Expert burden is reduced to specifying the reward function

[Diagram: states s1 → s2 → s3 via actions a1, a2 under transition function T, with a reward received at each step]

SLIDE 4

Reinforcement Learning

  • Reinforcement Learning process
    – We specify the transition dynamics and reward function, and get a policy

[Diagram: system dynamics + reward function → Reinforcement Learning Algorithm → policy]
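To make this pipeline concrete, here is a minimal sketch, assuming a small tabular MDP: given transition dynamics T and a reward function R, value iteration returns a greedy policy. The toy problem and all names are illustrative, not from the talk.

```python
# A minimal sketch of the RL pipeline on a tabular MDP: given transition
# dynamics T and a reward function R, value iteration returns a policy.
# The toy MDP and all names here are illustrative.
import numpy as np

def value_iteration(T, R, gamma=0.95, tol=1e-6):
    """T: (S, A, S) transition probabilities, R: (S, A) rewards."""
    V = np.zeros(T.shape[0])
    while True:
        # Q(s, a) = R(s, a) + gamma * sum_s' T(s, a, s') V(s')
        Q = R + gamma * T @ V
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    return Q.argmax(axis=1)  # greedy policy: one action per state

# Two-state, two-action toy problem: action 1 yields reward 1 in every
# state, so the greedy policy should pick it everywhere.
T = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[1.0, 0.0], [0.0, 1.0]]])
R = np.array([[0.0, 1.0],
              [0.0, 1.0]])
print(value_iteration(T, R))  # [1 1]
```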

SLIDE 5

Reinforcement Learning

  • SARSA / Q-Learning
    – Observe state, take action, receive reward, observe new state
    – Keep track of the value of an action in a particular state
    – Estimate the value of a state as the immediate reward received plus the value of the new state
    – Update estimates by moving the estimate in the direction of the observation
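The update described above fits in a few lines. A minimal sketch of the tabular Q-learning rule, with illustrative values for the learning rate and discount:

```python
# A sketch of the tabular Q-learning update: observe (s, a, r, s'), then
# move the value estimate towards the immediate reward plus the
# discounted value of the new state. Parameter values are illustrative.
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.95):
    target = r + gamma * np.max(Q[s_next])   # reward + value of new state
    Q[s, a] += alpha * (target - Q[s, a])    # step towards the observation
    return Q

Q = np.zeros((5, 2))                         # 5 states, 2 actions
Q = q_update(Q, s=0, a=1, r=1.0, s_next=3)
```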

SLIDE 6

Skills

  • Problem: too many states and actions
    – Actions could be too low level (e.g. robot walking)
  • Potential solution: use the options framework to introduce high-level actions
    – Each option is an RL task of its own
    – We can then invoke an entire option as an action
    – Analogous to skills
    – Requires the expert to specify MANY RL tasks, hence many reward functions
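For concreteness, a sketch of the standard option triple (initiation set, policy, termination condition) from the options framework; the class and function names, and the environment `step` function, are ours, not from the talk.

```python
# A sketch of an option as the standard (I, pi, beta) triple: an option
# is a skill the agent can invoke as a single high-level action.
import random
from dataclasses import dataclass
from typing import Callable

@dataclass
class Option:
    can_start: Callable[[int], bool]     # initiation set I: where the option may begin
    policy: Callable[[int], int]         # pi: the option's own learned policy
    should_stop: Callable[[int], float]  # beta: termination probability per state

def run_option(opt: Option, state: int, step: Callable[[int, int], int]) -> int:
    """Execute the option until beta terminates it; return the final state."""
    assert opt.can_start(state)
    while True:
        state = step(state, opt.policy(state))
        if random.random() < opt.should_stop(state):
            return state
```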

SLIDE 7

Updated objective

  • We want agents that can feasibly learn to do things autonomously
  • Minimize the burden on an expert when many tasks are to be learned
    – Specify what, not how
    – Demonstrate what, not how

SLIDE 8

Inverse Reinforcement Learning

  • Reinforcement learning can produce action selections (a policy) from a reward function
  • Inverse Reinforcement Learning produces a reward function by observing action selections
  • Iteratively proposes and evaluates reward functions, attempting to match expert observations
SLIDE 9

Inverse Reinforcement Learning

  • Inverse Reinforcement Learning process
    – We provide trajectories and dynamics, and get a reward function (which, if optimised, would match expert behaviour)

[Diagram: system dynamics + expert behaviour → Inverse Reinforcement Learning Algorithm → reward function]
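A hedged sketch of the propose-and-evaluate loop from the previous slide. `solve_mdp` and `match_score` are hypothetical placeholders for a forward RL solver and an expert-similarity measure; real IRL algorithms search reward space far more cleverly than random proposals, so this only illustrates the loop's shape.

```python
# A caricature of the generic IRL loop: propose a reward function, solve
# the forward RL problem under it, and score how well the resulting
# policy matches the expert trajectories. Keep the best proposal.
import numpy as np

def irl(T, expert_trajectories, solve_mdp, match_score, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    n_states = T.shape[0]
    best_R, best_score = None, -np.inf
    for _ in range(n_iters):
        R = rng.normal(size=n_states)                     # propose a reward function
        policy = solve_mdp(T, R)                          # forward RL under the proposal
        score = match_score(policy, expert_trajectories)  # agreement with the expert
        if score > best_score:
            best_R, best_score = R, score
    return best_R  # optimising this should reproduce expert behaviour
```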

SLIDE 10

Inverse Reinforcement Learning

  • Well, how pointless was that?
    – Surprisingly pointful
    – Captures the goal of the demonstrator rather than just the actions
    – Allows action selection in situations the expert did not encounter
    – Allows robustness to changing environments and capabilities

SLIDE 11

Learning from demonstration

  • Must provide many demonstrations to learn many reward functions for many small tasks (options)
    – The demonstrator could demonstrate small tasks repetitively (annoying and time consuming)
    – Annotations could be provided indicating when each task begins and ends (still annoying, and difficult)

SLIDE 12

Objective

  • We want agents that can feasibly learn to do things autonomously
  • Minimize the burden on an expert when many tasks are to be learned
    – Specify what, not how
    – Demonstrate what, not how
    – Unstructured demonstrations

SLIDE 13

NPBRS

  • We introduce a technique called Nonparametric Bayesian Reward Segmentation (NPBRS)
    – Takes unstructured demonstrations and produces many reward functions, along with the policies that optimise them
    – Does this by segmenting trajectories into more likely pieces

[Diagram: one unstructured trajectory segmented into skills A, B, C, A]

SLIDE 14

Segmentation

  • What information do we have to segment on?
  • Reward-based segmentation
    – Performs IRL on each segment
    – Evaluates the quality of the IRL
    – Bad segmentation will lead to bad IRL

[Diagram: the same A B C A trajectory explained with one reward function (lousy fit) versus three reward functions (great fit)]
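A minimal sketch of reward-based segmentation scoring, assuming a hypothetical `irl_fit` routine that runs IRL on one segment and returns a fit score (e.g. a log likelihood). The function and variable names are ours.

```python
# A sketch of reward-based segmentation scoring: cut the trajectory at
# the proposed breakpoints, run IRL on each segment, and sum the fit
# scores. Cuts whose pieces are each well explained by a single reward
# function score higher; bad cuts yield poorly-fitting rewards.

def score_segmentation(trajectory, breakpoints, irl_fit):
    bounds = [0, *breakpoints, len(trajectory)]
    segments = [trajectory[a:b] for a, b in zip(bounds, bounds[1:])]
    return sum(irl_fit(seg) for seg in segments)

# e.g. score_segmentation(traj, breakpoints=[40, 90], irl_fit=my_irl)
```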

SLIDE 15

Our model

  • Assume separate skill sets per trajectory, generated from a Beta process
    – Allows for an infinitely sized skill set
    – Encourages shared skills across trajectories
    – Allows skill dynamics to change depending on the skill set
  • Within each skill set, model the skill transition dynamics as a sticky Hidden Markov Model
  • The skill sequence is drawn from the skill transition distribution
  • Within each skill, the observations are generated from a skill-specific MDP, where every skill shares transition dynamics but has a specific reward function
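To illustrate just the "sticky" ingredient (not the full Beta-process construction), here is a sketch of skill transition rows drawn from a Dirichlet prior with extra mass on self-transitions, so the model favours staying in the current skill. Hyperparameter values and names are illustrative.

```python
# A sketch of the sticky-HMM transition prior: each row of the skill
# transition matrix is Dirichlet-distributed, with extra concentration
# kappa on the self-transition, biasing the chain towards persisting in
# the current skill.
import numpy as np

def sticky_transition_matrix(n_skills, alpha=1.0, kappa=10.0, seed=0):
    rng = np.random.default_rng(seed)
    P = np.empty((n_skills, n_skills))
    for k in range(n_skills):
        conc = np.full(n_skills, alpha)
        conc[k] += kappa                 # stickiness: boost self-transition mass
        P[k] = rng.dirichlet(conc)
    return P

P = sticky_transition_matrix(3)
# Rows are distributions over the next skill; the diagonal dominates.
```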

SLIDE 16

Our model

  • Perform inference on this model using a Markov chain Monte Carlo sampler
    – Sample based on model likelihood, i.e. the probability of the data given the model
    – The observation log likelihood is the sum of the log likelihood of each transition
    – The likelihood of each transition is the probability of the action selection under the optimal policy for the reward function generated from IRL on all segments assigned to that skill

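The likelihood term above translates almost directly into code. A sketch, assuming `policy[s, a]` gives the probability of taking action a in state s under the optimal policy for the skill's IRL-derived reward function; names are ours.

```python
# A sketch of the segment likelihood: sum, over the segment's
# transitions, the log probability of the action actually taken under
# the (stochastic) optimal policy for that skill's reward function.
import numpy as np

def segment_log_likelihood(segment, policy):
    """segment: iterable of (state, action) pairs; policy: (S, A) probabilities."""
    return sum(np.log(policy[s, a]) for s, a in segment)
```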

SLIDE 17

Does it work?

  • Car domain
    – Skill A: hit every other car
    – Skill B: stay in the left lane, but switch to avoid collisions
    – Skill C: stay in the right lane, but switch to avoid collisions
  • Data generated by randomly switching between policies with probability 0.01
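A sketch of the stated data-generation process: follow one skill's policy and, at each step with probability 0.01, resample the active skill. `policies` and `step` are hypothetical stand-ins for the three driving policies and the car simulator.

```python
# A sketch of generating demonstrations by random policy switching:
# at each step, with probability p_switch, resample the active skill
# uniformly (possibly the same one), then act with that skill's policy.
import random

def generate_trajectory(policies, step, state, n_steps, p_switch=0.01, seed=0):
    rng = random.Random(seed)
    skill = rng.randrange(len(policies))
    trajectory = []
    for _ in range(n_steps):
        if rng.random() < p_switch:              # random skill switch
            skill = rng.randrange(len(policies))
        action = policies[skill](state)
        trajectory.append((state, action, skill))
        state = step(state, action)
    return trajectory
```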

SLIDE 18

Does it work?

SLIDE 19

Does it work?

  • Quadcopter domain
    – The car domain in 3D
    – Skill A: go through all hoops
    – Skill B: stay in the top row and avoid all hoops
    – Skill C: stay in the bottom row and avoid all hoops

SLIDE 20

Does it work?

SLIDE 21

Conclusion

  • We can now recover multiple skills from a set of unstructured trajectories
  • Now to see if the discovered skills are useful options in learning new tasks
  • Continuous domains