The Problem of Temporal Abstraction
How do we connect the high level to the low level?
❖ the human level to the physical level?
❖ the decision level to the action level?
MDPs are great, search is great, ...
Goal: Extend RL framework to temporally abstract action
- While minimizing changes to
❖ Value functions ❖ Bellman equations ❖ Models of the environment ❖ Planning methods ❖ Learning algorithms
- While maximizing generality
❖ General dynamics and rewards ❖ Ability to express all courses of behavior ❖ Minimal commitments to other choices
- Execution, e.g., hierarchy, interruption, intermixing with planning
- Planning, e.g., incremental, synchronous, trajectory-based, “utility” problems
- State abstraction and function approximation
- Creation/Constructivism
It’s a dimensional thing
Options – Temporally Abstract Actions
E.g., the docking option:
❖ πo : hand-crafted controller
❖ γo : terminate when docked or charger not visible
Execution is nominally hierarchical (call-and-return).
An option is a pair, o = ⟨πo, γo⟩, where
❖ πo : S × A → [0, 1] is the policy followed during o
❖ γo : S → [0, γ] is the probability of the option continuing (not terminating) in each state
...there are also “semi-Markov” options
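For concreteness, a Markov option o = ⟨πo, γo⟩ might be represented in code roughly as below. This is only a sketch: the class name MarkovOption, the dictionary-valued policy, and the hashable state/action types are illustrative assumptions, not part of the original formulation.

```python
import random
from dataclasses import dataclass
from typing import Callable, Hashable

State = Hashable
Action = Hashable

@dataclass
class MarkovOption:
    """A Markov option o = <pi_o, gamma_o> (illustrative sketch)."""
    # pi_o: maps a state to a probability distribution over actions, {action: prob}
    policy: Callable[[State], dict]
    # gamma_o: maps a state to the probability (in [0, gamma]) of *continuing* there
    continuation: Callable[[State], float]

    def act(self, s: State) -> Action:
        """Sample an action from pi_o(.|s)."""
        actions, probs = zip(*self.policy(s).items())
        return random.choices(actions, weights=probs, k=1)[0]

    def continues(self, s: State) -> bool:
        """Sample whether the option continues (True) or terminates (False) in s."""
        return random.random() < self.continuation(s)
```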
Options are like actions
- Just as a state has a set of actions A(s), it also has a set of options O(s)
- Just as we can have a flat policy over actions, π : S × A → [0, 1], we can have a hierarchical policy over options, h : O × S → [0, 1]
- To execute h in s: select option o with probability h(o|s), follow o until it terminates in s′, then choose a next option o′ with probability h(o′|s′), and so on (sketched in code below)
- Every hierarchical policy h determines a flat policy π = f(h)
- Even if all the options are Markov, f(h) is usually not Markov
- Actions are a special case of options
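Call-and-return execution of a hierarchical policy h over such options could look roughly like this sketch; env.step(s, a) returning (next state, reward) and h(s) returning a dictionary {option: probability} are assumptions, and the act/continues interface is the illustrative MarkovOption class above.

```python
import random

def run_hierarchical_policy(env, h, start_state, max_steps=1000):
    """Execute a hierarchical policy h in call-and-return fashion:
    pick an option with probability h(o|s), follow it until it terminates,
    then pick again from the termination state, and so on."""
    s, total_reward, t = start_state, 0.0, 0
    while t < max_steps:
        options, probs = zip(*h(s).items())          # h(.|s) given as {option: prob}
        o = random.choices(options, weights=probs, k=1)[0]
        while t < max_steps:                          # follow o until it terminates
            a = o.act(s)
            s, r = env.step(s, a)
            total_reward += r
            t += 1
            if not o.continues(s):
                break
    return total_reward
```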
Value Functions with Temporal Abstraction
A new set of optimization problems
Now consider a limited set of options O, and the hierarchical policies Π(O) that choose only from them.

Define value functions for hierarchical policies and options:

vO*(s) = max h∈Π(O) vh(s)
qO*(s, o) = max h∈Π(O) qh(s, o)

where, for h ∈ Π(O),

vh(s) = E[ Rt+1 + γRt+2 + γ²Rt+3 + ··· | St = s, At:∞ ∼ h ]
qh(s, o) = E[ Gt | St = s, At:t+k−1 ∼ πo, k ∼ γo, At+k:∞ ∼ h ]
Options define a Semi-Markov Decision Process (SMDP)
- overlaid on the MDP
❖ MDP: discrete time, homogeneous discount
❖ SMDP: continuous time, discrete events, interval-dependent discount
❖ Options over an MDP: discrete time, overlaid discrete events, interval-dependent discount

[Figure: state vs. time trajectories for an MDP, an SMDP, and options over an MDP]
A discrete-time SMDP overlaid on an MDP. Can be analyzed at either level.
Models of the Environment with Temporal Abstraction
Planning requires models of the consequences of action.

The model of an action has a reward part and a state-transition part:

r(s, a) = E[ Rt+1 | St = s, At = a ]
p(s′|s, a) = Pr{ St+1 = s′ | St = s, At = a }

As does the model of an option:

r(s, o) = E[ Rt+1 + ··· + γ^(k−1) Rt+k | St = s, At:t+k−1 ∼ πo, k ∼ γo ]
p(s′|s, o) = Σ k=1..∞  γ^k Pr{ St+k = s′, termination at t + k | St = s, At:t+k−1 ∼ πo }
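If an option can be executed in the environment (or a simulator), its model can be estimated by simple Monte Carlo averaging, as in the sketch below. The helper execute_option and the env.step interface are illustrative assumptions; note that p(s′|s, o) absorbs the discount γ^k, so it sums to at most 1.

```python
from collections import defaultdict

def execute_option(env, s, option, gamma):
    """Run a Markov option from s to termination.
    Returns (discounted reward accumulated along the way, duration k, termination state).
    Uses the act/continues interface sketched earlier; env.step(s, a) -> (s', r)."""
    disc_reward, discount, k = 0.0, 1.0, 0
    while True:
        a = option.act(s)
        s, r = env.step(s, a)
        disc_reward += discount * r
        discount *= gamma
        k += 1
        if not option.continues(s):
            return disc_reward, k, s

def estimate_option_model(env, states, option, gamma, n_runs=1000):
    """Monte Carlo estimates of r(s, o) and p(s'|s, o) for one option."""
    r = {}                                        # r[s]     ~ r(s, o)
    p = defaultdict(lambda: defaultdict(float))   # p[s][s'] ~ p(s'|s, o), discount included
    for s in states:
        total = 0.0
        for _ in range(n_runs):
            g, k, s_term = execute_option(env, s, option, gamma)
            total += g
            p[s][s_term] += (gamma ** k) / n_runs
        r[s] = total / n_runs
    return r, p
```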
Bellman Equations with Temporal Abstraction
For policy-specific value functions:
vh(s) = Σo h(o|s) [ r(s, o) + Σs′ p(s′|s, o) vh(s′) ]

qh(s, o) = r(s, o) + Σs′ p(s′|s, o) Σo′ h(o′|s′) qh(s′, o′)

[Backup diagrams for vh and qh: from s, an option o is chosen according to h and yields r(s, o); the option model p leads to s′, where the next option o′ is again chosen according to h]
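Given option models, these Bellman equations can be solved by simple iteration, just as in policy evaluation with primitive actions. The sketch below assumes dictionaries r[o][s] and p[o][s][s′] holding the models and a function h(s) returning {option: probability} (all illustrative names); because p already absorbs γ^k, no explicit discount appears.

```python
def evaluate_hierarchical_policy(states, h, r, p, n_sweeps=100):
    """Iterate the Bellman equation for v_h using option models r(s,o) and p(s'|s,o)."""
    v = {s: 0.0 for s in states}
    for _ in range(n_sweeps):
        v = {
            s: sum(
                prob * (r[o][s] + sum(pr * v[s2] for s2, pr in p[o][s].items()))
                for o, prob in h(s).items()
            )
            for s in states
        }
    return v
```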
Planning with Temporal Abstraction
Initialize:  V(s) ← 0,  ∀s ∈ S

Iterate:  V(s) ← max o∈O(s) [ r(s, o) + Σs′ p(s′|s, o) V(s′) ],  ∀s ∈ S

Then V → vO*, and the greedy policy is

hO*(s) = greedy(s, vO*) = arg max o∈O(s) [ r(s, o) + Σs′ p(s′|s, o) vO*(s′) ]

Reduces to conventional value iteration if O = A.
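Transcribed directly into code, the iteration might look as follows; option_sets[s] plays the role of O(s), and the model dictionaries r[o][s] and p[o][s][s′] are the same illustrative structures as above (with the discount folded into p).

```python
def option_value_iteration(states, option_sets, r, p, n_sweeps=100):
    """Value iteration over options:
    V(s) <- max_{o in O(s)} [ r(s,o) + sum_{s'} p(s'|s,o) V(s') ].
    Returns the approximate v_O* and the greedy option choice h_O*(s)."""
    def backup(s, o, v):
        return r[o][s] + sum(pr * v[s2] for s2, pr in p[o][s].items())

    v = {s: 0.0 for s in states}
    for _ in range(n_sweeps):
        v = {s: max(backup(s, o, v) for o in option_sets[s]) for s in states}
    h_star = {s: max(option_sets[s], key=lambda o: backup(s, o, v)) for s in states}
    return v, h_star
```

With O(s) = A(s) and one-step action models, this is exactly conventional value iteration.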
Rooms Example
[Figure: the four-rooms gridworld, with hallways, goals G1 and G2, and the target hallway; the policy of one hallway option is shown]
- 8 multi-step options (to each room's 2 hallways)
- 4 stochastic primitive actions: up, down, right, left; fail 33% of the time
- γ = .9; all rewards zero, except +1 into goal
(Sutton, Precup, & Singh, 1999)
Planning is much faster with Temporal Abstraction
[Figure: value iteration on the rooms task, V(goal) = 1. Without TA: iterations #0, #1, #2 with cell-to-cell primitive actions. With TA: iterations #0, #1, #2 with room-to-room options]
Temporal Abstraction helps even with Goal≠Subgoal given both primitive actions and options
[Figure: value iteration given both primitive actions and options; initial values and iterations #1 through #5]
why?
Temporal Abstraction helps even with Goal≠Subgoal given both primitive actions and options
Temporal abstraction also speeds learning of the path to the goal
SMDP Theory Provides a lot of this
- Hierarchical policies over options: h : O × S → [0, 1], i.e., h(o|s)
- Value functions over options: vh(s), qh(s, o), vO*(s), qO*(s, o)
- Learning methods: Bradtke & Duff (1995), Parr (1998)
- Models of options: r(s, o), p(s′|s, o)
- Planning methods: e.g., value iteration, policy iteration, Dyna, ...
- A coherent theory of learning and planning with courses of action at variable time scales, yet at the same level
But not all. The most interesting issues are beyond SMDPs...
Outline
- The RL (MDP) framework
- The extension to temporally abstract “options”
❖ Options and Semi-MDPs ❖ Hierarchical planning and learning
- Rooms example
- Between MDPs and Semi-MDPs
❖ Improvement by interruption (including Spy plane demo)
- A taste of:
❖ Intra-option learning
❖ Subgoals for learning options
❖ RoboCup soccer demo
Interruption
Idea: We can do better by sometimes interrupting ongoing options
- forcing them to terminate before γo says to
Theorem: For any hierarchical policy h : O × S → [0, 1], suppose we interrupt its options at one or more times t, namely whenever the option o we are currently following satisfies

qh(St, o) < qh(St, h(St)),

to obtain h′. Then h′ ≥ h (it attains more or equal reward everywhere).

Application: Suppose we have determined qO* and thus h = hO*. Then h′ is guaranteed better than hO*, and is available with no further computation.
Landmarks Task
[Figure: start S, goal G, the landmarks, and the range (input set) of each run-to-landmark controller]
Task: navigate from S to G as fast as possible.
- 4 primitive actions, for taking tiny steps up, down, left, right
- 7 controllers for going straight to each one of the landmarks, from within a circular region where the landmark is visible
In this task, planning at the level of primitive actions is computationally intractable; we need the controllers.
Termination Improvement for Landmarks Task
[Figure: SMDP solution (600 steps) vs. termination-improved solution (474 steps), from S to G]
Allowing early termination based on models improves the value function at no additional cost!
Spy Plane Example
- Mission: Fly over (observe) most valuable sites and return to base
- Stochastic weather affects observability (cloudy or clear) of sites
- Limited fuel
- Intractable with classical optimal control methods
- Temporal scales:
❖ Actions: which direction to fly now ❖ Options: which site to head for
- Options compress space and time
❖ Reduce steps from ~600 to ~6 ❖ Reduce states from ~10^10 to ~10^6
[Figure: the spy plane domain. Sites with observation rewards 10, 50, 50, 50, 100, 25, 15, 5, 25, 8, and a Base; options last ~100 decision steps (the mean time between weather changes); any state: ~10^10, sites only: ~10^6]

qO*(s, o) = r(s, o) + Σs′ p(s′|s, o) vO*(s′)
Spy Plane Example (Results)
- SMDP planner:
❖ Assumes options followed to completion
❖ Plans optimal SMDP solution
- SMDP planner with interruption:
❖ Plans as if options must be followed to completion
❖ But actually takes them for only one step
❖ Re-picks a new option on every step
- Static planner:
❖ Assumes weather will not change
❖ Plans optimal tour among clear sites
❖ Re-plans whenever weather changes

[Figure: expected reward per mission, low fuel and high fuel, for the SMDP planner, the static re-planner, and the SMDP planner with interruption (TI)]
Temporal abstraction finds a better approximation than the static planner, with little more computation than the SMDP planner
Outline
- The RL (MDP) framework
- The extension to temporally abstract “options”
❖ Options and Semi-MDPs ❖ Hierarchical planning and learning
- Rooms example
- Between MDPs and Semi-MDPs
❖ Improvement by interruption (including Spy plane demo)
- A taste of:
❖ Intra-option learning
❖ Subgoals for learning options
❖ RoboCup soccer demo
Intra-Option Learning Methods for Markov Options
Proven to converge to correct values, under same assumptions as 1-step Q-learning
Idea: take advantage of each fragment of experience
SMDP Q-learning:
- execute option to termination, keeping track of reward along the way
- at the end, update only the option taken, based on the reward and the value of the state in which the option terminates

Intra-option Q-learning (sketched below):
- after each primitive action, update all the options that could have taken that action, based on the reward and the expected value from the next state on
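A sketch of the intra-option Q-learning update for Markov options follows. After each primitive transition (s, a, r, s′), every option whose policy could have taken a in s is updated; γo(s′) ∈ [0, γ] is read here as the discounted continuation probability, so the backup target mixes continuing with option o and terminating and re-choosing greedily. The table q keyed by (state, option index), the option list, and the step size alpha are illustrative assumptions, not the original pseudocode.

```python
def intra_option_q_update(q, s, a, reward, s_next, options, gamma, alpha=0.1):
    """One intra-option Q-learning update after a primitive transition (s, a, r, s').
    `options` is a list of Markov options (policy/continuation interface from earlier);
    `q` is e.g. a collections.defaultdict(float) keyed by (state, option_index)."""
    v_next = max(q[(s_next, j)] for j in range(len(options)))   # max_o' Q(s', o')
    for i, o in enumerate(options):
        if o.policy(s).get(a, 0.0) > 0.0:      # option i could have taken action a in s
            cont = o.continuation(s_next)       # gamma_o(s') in [0, gamma]
            target = reward + cont * q[(s_next, i)] + (gamma - cont) * v_next
            q[(s, i)] += alpha * (target - q[(s, i)])
```

By contrast, SMDP Q-learning would perform a single update of the option actually executed, only when it terminates, using the discounted reward accumulated along the way.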
Returning to the rooms example…
[Figure: the four-rooms gridworld again, as in the Rooms Example above: 8 hallway options, 4 stochastic primitive actions (fail 33% of the time), γ = .9, all rewards zero except +1 into goal (Sutton, Precup, & Singh, 1999)]
Intra-Option Value Learning in the Rooms Example
Intra-option methods learn correct values without ever taking the options! SMDP methods are not applicable here.
Random start, goal in right hallway, random actions
[Figure: left panel (option values): learned value vs. true value for the upper-hallway and left-hallway options, over 6000 episodes; right panel: average value of the greedy policy over episodes, approaching the value of the optimal policy]
Intra-Option Model Learning
Intra-option methods work much faster than SMDP methods
Random start state, no goal, pick randomly among all options
[Figure: state-prediction error and reward-prediction error (max and average) vs. number of options executed (up to 100,000), for SMDP, SMDP 1/t, and intra-option model-learning methods]
Options Depend on Outcome Values
[Figure: option policies learned with small negative rewards on each step, for outcome values g = 10, g = 1, g = 0, g = 0. Panels contrast large outcome values with small outcome values; one learned policy takes shortest paths, the other avoids the negative rewards]
Summary: Benefits of Options
- Transfer of knowledge
❖ Solutions to sub-tasks can be saved and reused ❖ Domain knowledge can be provided as options and subgoals
- Potentially much faster learning and planning
❖ By representing action at an appropriate temporal scale
- Models of options are a form of knowledge representation
❖ Expressive ❖ Clear ❖ Suitable for learning and planning
- Much more to learn than just one policy, one set of values
❖ A framework for “constructivism” or “continual learning” – for finding models of the world that are useful for rapid planning and learning
Conclusions
- We have come a long way toward linking human-level choices to microscopic actions
❖ Temporally abstract facts, and estimates of them - knowledge! ❖ A theory of how to combine known subcontrollers (behaviors) ❖ Beginnings of how to learn them efficiently and without interference
- Resolution of the “subgoal credit-assignment” problem
- We have shown how the high level can mirror the low
❖ It’s all choices, states, and values ❖ A minimal extension of existing RL/MDP ideas
- The state assumption remains a problem
❖ Someday options may revolutionize our notion of state and of