CSC2621 Topics in Robotics: Reinforcement Learning in Robotics, Week 11


SLIDE 1

CSC2621 Topics in Robotics

Reinforcement Learning in Robotics

Week 11: Hierarchical Reinforcement Learning Animesh Garg

SLIDE 2

Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning

Richard S. Sutton, Doina Precup, Satinder Singh. Topic: Hierarchical RL. Presenter: Panteha Naderian

SLIDE 3

Motivation: Temporal abstraction

  • Consider an activity such as cooking.
  • High-level: choose a recipe, make a grocery list.
  • Medium-level: get a pot, put the ingredients in the pot, stir until smooth.
  • Low-level: wrist and arm movement, muscle contraction.
  • All of these have to be seamlessly integrated.
SLIDE 4

Contributions

  • Temporal abstraction within the framework of RL, by introducing options.
  • Applying results from the theory of SMDPs to planning and learning in the context of options.
  • Changing and learning an option's internal structure:
  • Interrupting options
  • Subgoals
  • Intra-option learning
SLIDE 5

Background: MDP

An MDP consists of:

  • A set of actions
  • A set of states
  • Transition dynamics: p^a_{ss'} = Pr{ s_{t+1} = s' | s_t = s, a_t = a }
  • Expected reward: r^a_s = E{ r_{t+1} | s_t = s, a_t = a }

SLIDE 6

Background: MDP

  • Policy: π : S × A → [0, 1]
  • V^π(s) = E{ r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + ⋯ | s_t = s, π }
    = Σ_{a ∈ A_s} π(s, a) [ r^a_s + γ Σ_{s'} p^a_{ss'} V^π(s') ]
  • V*(s) = max_π V^π(s) = max_{a ∈ A_s} [ r^a_s + γ Σ_{s'} p^a_{ss'} V*(s') ]

SLIDE 7

Background: Semi-MDP

SLIDE 8

Options

  • Generalize actions to include temporally extended courses of action.
  • An option (I, π, β) has three components:
  • An initiation set I ⊆ S
  • A termination condition β : S → [0, 1]
  • A policy π : S × A → [0, 1]
  • If the option (I, π, β) is taken in state s ∈ I, then actions are selected according to π until the option terminates stochastically according to β.
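A minimal sketch of how an option (I, π, β) could be represented in code (Python; the `Option` class, its field names, and the `State` placeholder are illustrative assumptions, not from the paper):

```python
from dataclasses import dataclass
from typing import Any, Callable, Set

State = Any     # placeholder state type (e.g., a gridworld cell)
Action = Any    # placeholder primitive-action type

@dataclass(frozen=True)
class Option:
    """An option (I, pi, beta): initiation set, policy, termination condition."""
    initiation_set: Set[State]              # I subset of S: where the option may start
    policy: Callable[[State], Action]       # pi: state -> primitive action
    termination: Callable[[State], float]   # beta: state -> probability of stopping

    def can_initiate(self, s: State) -> bool:
        return s in self.initiation_set
```

Executing such an option means following `policy` and, after each step, stopping with probability `termination(s)`, mirroring the stochastic termination described above.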

SLIDE 9

Options: Example

  • Open-the-door
  • I: all states in which a closed door is within reach
  • π: a pre-defined controller for reaching, grasping, and turning the door knob
  • β: terminate when the door is open
SLIDE 10

Option: more definitions and details

  • Viewing simple actions as single-step options
  • Composing options
  • Policies over options: μ : S × O → [0, 1]
  • Theorem 1 (MDP + options = SMDP). For any MDP, and any set of options defined on that MDP, the decision process that selects only among those options, executing each to termination, is an SMDP.

SLIDE 11

Option models

  • Rewards:

r^o_s = E{ r_{t+1} + γ r_{t+2} + ⋯ + γ^{k−1} r_{t+k} | o initiated in state s at time t and lasting k steps }

  • Dynamics:

p^o_{ss'} = Σ_{k=1}^∞ γ^k p(s', k)

where p(s', k) is the probability that the option terminates in s' after k steps.

SLIDE 12

Rewriting Bellman Equations with Options

π‘Šj 𝑑 = 𝐹 𝑠

*+, + 𝛿𝑠 *+B + β‹― + 𝛿Z[,𝑠 Z+* +𝛿Z π‘Šj 𝑑*+Z

𝜁 𝜈, 𝑑, 𝑒

(k is the duration of the first option selected by 𝜈)

= f

Y∈eG

𝜈 𝑑, 𝑝 [𝑠

" Y + f "#

π‘ž""#

e π‘Šj 𝑑- ]

π‘Šβˆ— 𝑑 = max

Y∈eG[𝑠 " Y + βˆ‘"# π‘ž""# e π‘Šβˆ— 𝑑- ]
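These equations let us plan directly with the option models. A minimal sketch of synchronous value iteration over options (Python; the random toy models and their shapes are assumptions; note there is no extra γ factor because the discounting is already folded into p^o, so rows of `p_o` sum to less than 1):

```python
import numpy as np

# Toy option models (assumed shapes, for illustration):
# r_o[s, o]     = r^o_s, expected discounted reward while o runs from s
# p_o[s, o, s'] = p^o_{ss'}, discounted termination "probabilities"
rng = np.random.default_rng(1)
n_states, n_options = 4, 3
r_o = rng.random((n_states, n_options))
p_o = rng.random((n_states, n_options, n_states))
p_o *= 0.9 / p_o.sum(axis=2, keepdims=True)   # rows sum to 0.9 (< 1: gamma^k folded in)

V = np.zeros(n_states)
for _ in range(1000):
    # V*(s) = max_o [ r^o_s + sum_s' p^o_{ss'} V*(s') ]   (no separate gamma term)
    V_new = (r_o + p_o @ V).max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new
print(V)
```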

SLIDE 17

Options value learning

  • In state s, initiate option o and execute it until termination.
  • Observe the termination state s', the number of steps k, and the discounted reward r.

Q(s, o) ← Q(s, o) + α ( r + γ^k max_{o' ∈ O_{s'}} Q(s', o') − Q(s, o) )
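A hedged sketch of this SMDP Q-learning update as code (Python; the `options_at` helper and the dictionary layout of Q are assumptions for illustration):

```python
from collections import defaultdict

def smdp_q_update(Q, s, o, r, s2, k, options_at, alpha=0.1, gamma=0.9):
    """One SMDP Q-learning update after executing option o from state s:
    observed termination state s2, duration k steps, discounted reward r.

    Q maps (state, option) -> value; options_at(s) lists the options whose
    initiation set contains s (both names are assumptions, not from the paper).
    """
    target = r + gamma ** k * max(Q[(s2, o2)] for o2 in options_at(s2))
    Q[(s, o)] += alpha * (target - Q[(s, o)])

# Minimal usage with two dummy options on integer states:
Q = defaultdict(float)
smdp_q_update(Q, s=0, o="go_right", r=0.73, s2=4, k=4,
              options_at=lambda s: ["go_right", "stay"])
print(Q[(0, "go_right")])
```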

SLIDE 18

Between MDPs and semi-MDPs

  • 1. Interrupting options
  • 2. Intra-option model/value learning
  • 3. Subgoals
SLIDE 19

1. Interrupting options

  • We don't have to follow options until termination; we can re-evaluate our commitment at each step.
  • If the value of continuing option o, Q^μ(s, o), is less than the value of selecting a new option, V^μ(s) = Σ_q μ(s, q) Q^μ(s, q), then switch (see the sketch below).
  • Theorem 2. Let μ' be the interrupted policy of μ. Then:

I. For all s ∈ S: V^{μ'}(s) ≥ V^μ(s)
II. If from state s ∈ S there is a non-zero probability of encountering an interrupted history, then V^{μ'}(s) > V^μ(s)
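The switching rule from the second bullet, as a short sketch (Python; `mu`, `options_at`, and the dictionary Q are illustrative assumptions):

```python
from collections import defaultdict

def should_interrupt(Q, mu, s, o, options_at):
    """Interrupt option o in state s if continuing it is worth less than
    re-choosing according to mu: Q(s, o) < V^mu(s) = sum_q mu(s, q) Q(s, q)."""
    v_mu = sum(mu(s, q) * Q[(s, q)] for q in options_at(s))
    return Q[(s, o)] < v_mu

# Tiny usage: two options, uniform mu; continuing "a" is worse than switching.
Q = defaultdict(float, {(0, "a"): 0.2, (0, "b"): 1.0})
print(should_interrupt(Q, lambda s, q: 0.5, 0, "a", lambda s: ["a", "b"]))  # True
```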

SLIDE 20

Interrupting options: Example

SLIDE 21

2. Intra-option algorithms

  • Learning about one option at a time is very inefficient.
  • Instead, learn about all options consistent with the behavior.
  • Update every Markov option o whose policy could have selected a_t according to the same distribution π(s_t, ·):

Q(s_t, o) ← Q(s_t, o) + α ( r_{t+1} + γ U(s_{t+1}, o) − Q(s_t, o) )

  • where

U(s, o) = (1 − β(s)) Q(s, o) + β(s) max_{o' ∈ O} Q(s, o')

is an estimate of the value of the state-option pair (s, o) upon arrival in state s.
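A sketch of this intra-option update as code (Python; it reuses the option interface from the earlier sketch, and the deterministic-policy check `o.policy(s) == a` is an assumption matching the setting of Theorem 3 below):

```python
def intra_option_update(Q, s, a, r, s2, options, alpha=0.1, gamma=0.9):
    """One intra-option Q-learning step after observing (s, a, r, s2).

    Every Markov option whose policy would have chosen a in s is updated
    from this single primitive transition, not just the executing option.
    Q maps (state, option) -> value (e.g., a defaultdict(float));
    each option needs .policy(s) and .termination(s) as sketched earlier.
    """
    def U(state, o):
        # U(s, o) = (1 - beta(s)) Q(s, o) + beta(s) max_o' Q(s, o')
        best = max(Q[(state, o2)] for o2 in options)
        return (1 - o.termination(state)) * Q[(state, o)] \
            + o.termination(state) * best

    for o in options:
        if o.policy(s) == a:                 # consistent with the behavior
            target = r + gamma * U(s2, o)
            Q[(s, o)] += alpha * (target - Q[(s, o)])
```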

SLIDE 22

2. Intra-option algorithms

  • Theorem 3 (Convergence of intra-option Q-learning). For any set of Markov options O with deterministic policies, one-step intra-option Q-learning converges with probability 1 to the optimal Q-values, for every option, regardless of what options are executed during learning, provided that every action gets executed in every state infinitely often.

SLIDE 23
  • Proof.

Q(s, o) ← Q(s, o) + α ( r' + γ U(s', o) − Q(s, o) )

We prove that the operator E[ r' + γ U(s', o) ] is a contraction:

| E[ r' + γ U(s', o) ] − Q*(s, o) |
= | r^a_s + γ Σ_{s'} p^a_{ss'} U(s', o) − Q*(s, o) |
= | r^a_s + γ Σ_{s'} p^a_{ss'} U(s', o) − ( r^a_s + γ Σ_{s'} p^a_{ss'} U*(s', o) ) |
≤ γ | Σ_{s'} p^a_{ss'} [ (1 − β(s')) ( Q(s', o) − Q*(s', o) ) + β(s') ( max_{o'} Q(s', o') − max_{o'} Q*(s', o') ) ] |
≤ γ Σ_{s'} p^a_{ss'} max_{s'', o''} | Q(s'', o'') − Q*(s'', o'') |
= γ max_{s'', o''} | Q(s'', o'') − Q*(s'', o'') |

SLIDE 25

3. Subgoals for learning options

  • It is natural to think of options as achieving subgoals of some kind, and to adapt each option's policy to better achieve its subgoal.
  • A simple way to formulate a subgoal for an option is to assign a terminal subgoal value, g(s), to each state.
  • For example, to learn a hallway option in the rooms task, the target hallway might be assigned a subgoal value of +1, while the others get a subgoal value of zero.
  • Learn the option policies toward their subgoals independently, using an off-policy learning method such as Q-learning (a sketch follows below).
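One way this could look in code, as a hedged sketch (Python; the `terminated` flag, the `g` subgoal-value function, and the exact bootstrap form are assumptions consistent with the slide's description, not the paper's precise formulation):

```python
def subgoal_q_update(Qo, s, a, r, s2, terminated, actions, g,
                     alpha=0.1, gamma=0.9):
    """Off-policy Q-learning of one option's policy toward its subgoal.

    Qo maps (state, action) -> value for this option. If the option
    terminates in s2, bootstrap with the designer-assigned subgoal value
    g(s2) (e.g., +1 at the target hallway, 0 elsewhere); otherwise use
    the usual max over this option's own action values.
    """
    if terminated:
        target = r + gamma * g(s2)
    else:
        target = r + gamma * max(Qo[(s2, a2)] for a2 in actions)
    Qo[(s, a)] += alpha * (target - Qo[(s, a)])
```

Because the target bootstraps off the max and the fixed subgoal values, any behavior that visits the option's states can feed this update, which is why an off-policy method is the natural fit here.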

SLIDE 26

3. Subgoals for learning options

SLIDE 27

Contributions (Recap)

  • Problem: enable temporally abstract knowledge and action to be included in the reinforcement learning framework.
  • Introduced options, temporally extended courses of action.
  • Extended the theory of SMDPs to the context of options.
  • Introduced intra-option learning algorithms that can learn about options from fragments of their execution.
  • Proposed a notion of subgoals that can be used to improve the options themselves.

SLIDE 28

Limitations

  • Requires subgoals/options to be formalized.
  • Might necessitate a small state-action space.
  • The integration with state abstraction remains incompletely understood.

SLIDE 29

Questions

  • 1. Why should we use off-policy learning methods for learning the option policies using subgoals?
  • 2. In what cases can intra-option value learning improve upon the original option value learning?
  • 3. Is planning over options always going to speed up planning?