CSC2621 Topics in Robotics: Reinforcement Learning in Robotics, Week 11


SLIDE 1

CSC2621 Topics in Robotics

Reinforcement Learning in Robotics

Week 11: Hierarchical Reinforcement Learning Animesh Garg

SLIDE 2

Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning

Richard S. Sutton, Doina Precup, Satinder Singh. Topic: Hierarchical RL. Presenter: Panteha Naderian

SLIDE 3

Motivation: Temporal abstraction

  • Consider an activity such as cooking.
  • High-level: choose a recipe, make a grocery list.
  • Medium-level: get a pot, put the ingredients in the pot, stir until smooth.
  • Low-level: wrist and arm movement, muscle contraction.
  • All of these have to be seamlessly integrated.
SLIDE 4

Contributions

  • Temporal abstraction within the framework of RL, by introducing options.
  • Applying results from the theory of SMDPs to planning and learning in the context of options.
  • Changing and learning an option's internal structure:
  • Interrupting options
  • Subgoals
  • Intra-option learning
SLIDE 5

Background: MDP

An MDP consists of:

  • A set of actions
  • A set of states
  • Transition dynamics: p^a_{ss'} = Pr{ s_{t+1} = s' | s_t = s, a_t = a }
  • Expected reward: r^a_s = E{ r_{t+1} | s_t = s, a_t = a }

SLIDE 6

Background: MDP

  • Policy: π : S × A → [0, 1]
  • V^π(s) = E{ r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + ⋯ | s_t = s, π }
    = Σ_{a ∈ A_s} π(s, a) [ r^a_s + γ Σ_{s'} p^a_{ss'} V^π(s') ]
  • V*(s) = max_π V^π(s) = max_{a ∈ A_s} [ r^a_s + γ Σ_{s'} p^a_{ss'} V*(s') ]

SLIDE 7

Background: Semi-MDP

SLIDE 8

Options

  • Generalize actions to include temporally extended courses of action.
  • An option (I, π, β) has three components:
  • An initiation set I ⊆ S
  • A termination condition β : S → [0, 1]
  • A policy π : S × A → [0, 1]
  • If the option (I, π, β) is taken in state s ∈ I, then actions are selected according to π until the option terminates stochastically according to β.
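A minimal sketch of how an option (I, π, β) could be represented in code (Python; the `Option` class, its field names, and the `State` placeholder are illustrative assumptions, not from the paper):

```python
from dataclasses import dataclass
from typing import Any, Callable, Set

State = Any     # placeholder state type (e.g., a gridworld cell)
Action = Any    # placeholder primitive-action type

@dataclass(frozen=True)
class Option:
    """An option (I, pi, beta): initiation set, policy, termination condition."""
    initiation_set: Set[State]              # I subset of S: where the option may start
    policy: Callable[[State], Action]       # pi: state -> primitive action
    termination: Callable[[State], float]   # beta: state -> probability of stopping

    def can_initiate(self, s: State) -> bool:
        return s in self.initiation_set
```

Executing such an option means following `policy` and, after each step, stopping with probability `termination(s)`, mirroring the stochastic termination described above.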

SLIDE 9

Options: Example

  • Open-the-door
  • I: all states in which a closed door is within reach
  • π: a pre-defined controller for reaching, grasping, and turning the door knob
  • β: terminate when the door is open
SLIDE 10

Option: more definitions and details

  • Viewing simple actions as single-step options
  • Composing options
  • Policies over options: μ : S × O → [0, 1]
  • Theorem 1 (MDP + options = SMDP). For any MDP, and any set of options defined on that MDP, the decision process that selects only among those options, executing each to termination, is an SMDP.

SLIDE 11

Option models

  • Rewards:

r^o_s = E{ r_{t+1} + γ r_{t+2} + ⋯ + γ^{k−1} r_{t+k} | o initiated in state s at time t and lasting k steps }

  • Dynamics:

p^o_{ss'} = Σ_{k=1}^∞ γ^k p(s', k)

where p(s', k) is the probability that the option terminates in s' after k steps.

SLIDE 12

Rewriting Bellman Equations with Options

π‘Šj 𝑑 = 𝐹 𝑠

*+, + 𝛿𝑠 *+B + β‹― + 𝛿Z[,𝑠 Z+* +𝛿Z π‘Šj 𝑑*+Z

𝜁 𝜈, 𝑑, 𝑒

(k is the duration of the first option selected by 𝜈)

= f

Y∈eG

𝜈 𝑑, 𝑝 [𝑠

" Y + f "#

π‘ž""#

e π‘Šj 𝑑- ]

π‘Šβˆ— 𝑑 = max

Y∈eG[𝑠 " Y + βˆ‘"# π‘ž""# e π‘Šβˆ— 𝑑- ]
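These equations let us plan directly with the option models. A minimal sketch of synchronous value iteration over options (Python; the random toy models and their shapes are assumptions; note there is no extra γ factor because the discounting is already folded into p^o, so rows of `p_o` sum to less than 1):

```python
import numpy as np

# Toy option models (assumed shapes, for illustration):
# r_o[s, o]     = r^o_s, expected discounted reward while o runs from s
# p_o[s, o, s'] = p^o_{ss'}, discounted termination "probabilities"
rng = np.random.default_rng(1)
n_states, n_options = 4, 3
r_o = rng.random((n_states, n_options))
p_o = rng.random((n_states, n_options, n_states))
p_o *= 0.9 / p_o.sum(axis=2, keepdims=True)   # rows sum to 0.9 (< 1: gamma^k folded in)

V = np.zeros(n_states)
for _ in range(1000):
    # V*(s) = max_o [ r^o_s + sum_s' p^o_{ss'} V*(s') ]   (no separate gamma term)
    V_new = (r_o + p_o @ V).max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new
print(V)
```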

SLIDE 17

Options value learning

  • In state s, initiate option o and execute it until termination.
  • Observe the termination state s', the number of steps k, and the discounted reward r.

Q(s, o) ← Q(s, o) + α ( r + γ^k max_{o' ∈ O_{s'}} Q(s', o') − Q(s, o) )
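A hedged sketch of this SMDP Q-learning update as code (Python; the `options_at` helper and the dictionary layout of Q are assumptions for illustration):

```python
from collections import defaultdict

def smdp_q_update(Q, s, o, r, s2, k, options_at, alpha=0.1, gamma=0.9):
    """One SMDP Q-learning update after executing option o from state s:
    observed termination state s2, duration k steps, discounted reward r.

    Q maps (state, option) -> value; options_at(s) lists the options whose
    initiation set contains s (both names are assumptions, not from the paper).
    """
    target = r + gamma ** k * max(Q[(s2, o2)] for o2 in options_at(s2))
    Q[(s, o)] += alpha * (target - Q[(s, o)])

# Minimal usage with two dummy options on integer states:
Q = defaultdict(float)
smdp_q_update(Q, s=0, o="go_right", r=0.73, s2=4, k=4,
              options_at=lambda s: ["go_right", "stay"])
print(Q[(0, "go_right")])
```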

SLIDE 18

Between MDPs and semi-MDPs

  • 1. Interrupting options
  • 2. Intra-option model/value learning
  • 3. Subgoals
SLIDE 19

1. Interrupting options

  • We don't have to follow options until termination; we can re-evaluate our commitment at each step.
  • If the value of continuing option o, Q^μ(s, o), is less than the value of selecting a new option, V^μ(s) = Σ_q μ(s, q) Q^μ(s, q), then switch (see the sketch below).
  • Theorem 2. Let μ' be the interrupted policy of μ. Then:

I. For all s ∈ S: V^{μ'}(s) ≥ V^μ(s)
II. If from state s ∈ S there is a non-zero probability of encountering an interrupted history, then V^{μ'}(s) > V^μ(s)
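The switching rule from the second bullet, as a short sketch (Python; `mu`, `options_at`, and the dictionary Q are illustrative assumptions):

```python
from collections import defaultdict

def should_interrupt(Q, mu, s, o, options_at):
    """Interrupt option o in state s if continuing it is worth less than
    re-choosing according to mu: Q(s, o) < V^mu(s) = sum_q mu(s, q) Q(s, q)."""
    v_mu = sum(mu(s, q) * Q[(s, q)] for q in options_at(s))
    return Q[(s, o)] < v_mu

# Tiny usage: two options, uniform mu; continuing "a" is worse than switching.
Q = defaultdict(float, {(0, "a"): 0.2, (0, "b"): 1.0})
print(should_interrupt(Q, lambda s, q: 0.5, 0, "a", lambda s: ["a", "b"]))  # True
```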

SLIDE 20

Interrupting options: Example

SLIDE 21

2. Intra-option algorithms

  • Learning about one option at a time is very inefficient.
  • Instead, learn about all options consistent with the behavior.
  • Update every Markov option o whose policy could have selected a_t according to the same distribution π(s_t, ·):

Q(s_t, o) ← Q(s_t, o) + α ( r_{t+1} + γ U(s_{t+1}, o) − Q(s_t, o) )

  • where

U(s, o) = (1 − β(s)) Q(s, o) + β(s) max_{o' ∈ O} Q(s, o')

is an estimate of the value of the state-option pair (s, o) upon arrival in state s.
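A sketch of this intra-option update as code (Python; it reuses the option interface from the earlier sketch, and the deterministic-policy check `o.policy(s) == a` is an assumption matching the setting of Theorem 3 below):

```python
def intra_option_update(Q, s, a, r, s2, options, alpha=0.1, gamma=0.9):
    """One intra-option Q-learning step after observing (s, a, r, s2).

    Every Markov option whose policy would have chosen a in s is updated
    from this single primitive transition, not just the executing option.
    Q maps (state, option) -> value (e.g., a defaultdict(float));
    each option needs .policy(s) and .termination(s) as sketched earlier.
    """
    def U(state, o):
        # U(s, o) = (1 - beta(s)) Q(s, o) + beta(s) max_o' Q(s, o')
        best = max(Q[(state, o2)] for o2 in options)
        return (1 - o.termination(state)) * Q[(state, o)] \
            + o.termination(state) * best

    for o in options:
        if o.policy(s) == a:                 # consistent with the behavior
            target = r + gamma * U(s2, o)
            Q[(s, o)] += alpha * (target - Q[(s, o)])
```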

SLIDE 22

2. Intra-option algorithms

  • Theorem 3 (Convergence of intra-option Q-learning). For any set of Markov options O with deterministic policies, one-step intra-option Q-learning converges with probability 1 to the optimal Q-values, for every option, regardless of what options are executed during learning, provided that every action gets executed in every state infinitely often.

SLIDE 23
  • Proof.

Q(s, o) ← Q(s, o) + α ( r' + γ U(s', o) − Q(s, o) )

We prove that the operator E[ r' + γ U(s', o) ] is a contraction:

| E[ r' + γ U(s', o) ] − Q*(s, o) |
= | r^a_s + γ Σ_{s'} p^a_{ss'} U(s', o) − Q*(s, o) |
= | r^a_s + γ Σ_{s'} p^a_{ss'} U(s', o) − ( r^a_s + γ Σ_{s'} p^a_{ss'} U*(s', o) ) |
≤ γ | Σ_{s'} p^a_{ss'} [ (1 − β(s')) ( Q(s', o) − Q*(s', o) ) + β(s') ( max_{o'} Q(s', o') − max_{o'} Q*(s', o') ) ] |
≤ γ Σ_{s'} p^a_{ss'} max_{s'', o''} | Q(s'', o'') − Q*(s'', o'') |
= γ max_{s'', o''} | Q(s'', o'') − Q*(s'', o'') |

SLIDE 25

3. Subgoals for learning options

  • It is natural to think of options as achieving subgoals of some kind, and to adapt each option's policy to better achieve its subgoal.
  • A simple way to formulate a subgoal for an option is to assign a terminal subgoal value, g(s), to each state.
  • For example, to learn a hallway option in the rooms task, the target hallway might be assigned a subgoal value of +1, while the others get a subgoal value of zero.
  • Learn the option policies toward their subgoals independently, using an off-policy learning method such as Q-learning (a sketch follows below).
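One way this could look in code, as a hedged sketch (Python; the `terminated` flag, the `g` subgoal-value function, and the exact bootstrap form are assumptions consistent with the slide's description, not the paper's precise formulation):

```python
def subgoal_q_update(Qo, s, a, r, s2, terminated, actions, g,
                     alpha=0.1, gamma=0.9):
    """Off-policy Q-learning of one option's policy toward its subgoal.

    Qo maps (state, action) -> value for this option. If the option
    terminates in s2, bootstrap with the designer-assigned subgoal value
    g(s2) (e.g., +1 at the target hallway, 0 elsewhere); otherwise use
    the usual max over this option's own action values.
    """
    if terminated:
        target = r + gamma * g(s2)
    else:
        target = r + gamma * max(Qo[(s2, a2)] for a2 in actions)
    Qo[(s, a)] += alpha * (target - Qo[(s, a)])
```

Because the target bootstraps off the max and the fixed subgoal values, any behavior that visits the option's states can feed this update, which is why an off-policy method is the natural fit here.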

SLIDE 26

3. Subgoals for learning options

SLIDE 27

Contributions (Recap)

  • Problem: enable temporally abstract knowledge and action to be included in the reinforcement learning framework.
  • Introduced options, temporally extended courses of action.
  • Extended the theory of SMDPs to the context of options.
  • Introduced intra-option learning algorithms that can learn about options from fragments of their execution.
  • Proposed a notion of subgoals that can be used to improve the options themselves.

SLIDE 28

Limitations

  • Requires subgoals/options to be formalized.
  • Might necessitate a small state-action space.
  • The integration with state abstraction remains incompletely understood.

SLIDE 29

Questions

  • 1. Why should we use off-policy learning methods for learning the option policies using subgoals?
  • 2. In what cases can intra-option value learning improve upon the original option value learning?
  • 3. Is planning over options always going to speed up planning?