CSC2621 Topics in Robotics
Reinforcement Learning in Robotics
Week 11: Hierarchical Reinforcement Learning
Instructor: Animesh Garg
Paper: Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning
Authors: Richard S. Sutton, Doina Precup, Satinder Singh
Topic: Hierarchical RL
Presenter: Panteha Naderian
Motivation: temporally extended actions. A single high-level step such as "pot, stir until smooth" abstracts away many low-level actions.
Outline: MDPs, options and semi-MDPs, value functions and learning in the context of options, a contraction argument for the convergence of intra-option Q-learning, and subgoals.
An MDP consists of:
- a set of states S and a set of actions A
- transition probabilities p_{ss'}^a = Pr\{ s_{t+1} = s' \mid s_t = s, a_t = a \}
- expected rewards r_s^a = E\{ r_{t+1} \mid s_t = s, a_t = a \}
The value function of a policy \pi:
V^\pi(s) = E\{ r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots \mid s_t = s, \pi \}
         = \sum_{a \in A_s} \pi(s, a) [ r_s^a + \gamma \sum_{s'} p_{ss'}^a V^\pi(s') ]
The optimal value function:
V^*(s) = \max_{a \in A_s} [ r_s^a + \gamma \sum_{s'} p_{ss'}^a V^*(s') ]
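To make the Bellman optimality equation concrete, here is a minimal tabular value-iteration sketch in Python; the dictionary layout (P, R) and the toy two-state MDP are assumptions made only for illustration, not something from the slides.

```python
# Toy tabular MDP (illustrative only): P[s][a] is a list of (prob, next_state),
# R[s][a] is the expected one-step reward r_s^a.
P = {0: {0: [(1.0, 0)], 1: [(0.9, 1), (0.1, 0)]},
     1: {0: [(1.0, 1)], 1: [(1.0, 0)]}}
R = {0: {0: 0.0, 1: 1.0},
     1: {0: 0.5, 1: 0.0}}
gamma = 0.9

V = {s: 0.0 for s in P}
for _ in range(1000):                 # sweep until approximately converged
    delta = 0.0
    for s in P:
        # Bellman optimality backup: V(s) = max_a [ r_s^a + gamma * sum_s' p_ss'^a V(s') ]
        v_new = max(R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a]) for a in P[s])
        delta = max(delta, abs(v_new - V[s]))
        V[s] = v_new
    if delta < 1e-8:
        break
print(V)
```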
An option o = <I, \pi, \beta> consists of an initiation set I \subseteq S, an intra-option policy \pi, and a termination condition \beta : S -> [0, 1]. Once an option is initiated, actions are selected according to \pi until the option terminates stochastically according to \beta.
The MDP, together with a fixed set of options O in which the agent chooses among those options, executing each to termination, is an SMDP.
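One way to represent an option o = <I, \pi, \beta> in code is sketched below; the Option class, the run_option helper, and the env_step interface are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Set
import random

@dataclass
class Option:
    initiation_set: Set[int]             # I: states where the option may be started
    policy: Callable[[int], int]         # pi: state -> primitive action
    termination: Callable[[int], float]  # beta: state -> probability of terminating

def run_option(env_step, s, option, gamma=0.9):
    """Execute `option` from state s until it terminates stochastically via beta.
    Returns (discounted cumulative reward, next state, duration k).
    env_step(s, a) -> (reward, next_state) is an assumed environment interface."""
    assert s in option.initiation_set
    total, discount, k = 0.0, 1.0, 0
    while True:
        a = option.policy(s)
        r, s = env_step(s, a)
        total += discount * r
        discount *= gamma
        k += 1
        if random.random() < option.termination(s):
            return total, s, k
```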
Option models:
r_s^o = E\{ r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{k-1} r_{t+k} \mid o initiated in state s at time t \},
where k is the random duration of o.
p_{ss'}^o = \sum_{k=1}^{\infty} \gamma^k p(s', k),
where p(s', k) is the probability that o terminates in s' after exactly k steps.
For a policy over options \mu:
V^\mu(s) = E\{ r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{k-1} r_{t+k} + \gamma^k V^\mu(s_{t+k}) \mid \varepsilon(\mu, s, t) \}
(k is the duration of the first option selected by \mu, and \varepsilon(\mu, s, t) is the event of \mu being initiated in s at time t)
         = \sum_{o \in O_s} \mu(s, o) [ r_s^o + \sum_{s'} p_{ss'}^o V^\mu(s') ]
V^*_O(s) = \max_{o \in O_s} [ r_s^o + \sum_{s'} p_{ss'}^o V^*_O(s') ]
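Given the option models r_s^o and p_{ss'}^o, this Bellman optimality equation over options can be solved by an SMDP analogue of value iteration. A minimal sketch, assuming tabular dictionaries R_opt[s][o] and P_opt[s][o][s'] (illustrative names) with the discounting already folded into P_opt, as in the equations above:

```python
def smdp_value_iteration(states, options_in, R_opt, P_opt, tol=1e-8, max_iter=1000):
    """V*_O(s) = max_{o in O_s} [ r_s^o + sum_{s'} p_ss'^o V*_O(s') ]  (sketch).
    R_opt[s][o]  : expected discounted reward accumulated while o runs from s.
    P_opt[s][o]  : dict s' -> discounted termination probability (gamma^k folded in).
    options_in[s]: the option set O_s available in state s."""
    V = {s: 0.0 for s in states}
    for _ in range(max_iter):
        delta = 0.0
        for s in states:
            v_new = max(R_opt[s][o] + sum(p * V[s2] for s2, p in P_opt[s][o].items())
                        for o in options_in[s])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            break
    return V
```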
SMDP Q-learning: after executing option o from state s for k steps, observing cumulative discounted reward r and arriving in state s',
Q(s, o) <- Q(s, o) + \alpha ( r + \gamma^k \max_{o' \in O_{s'}} Q(s', o') - Q(s, o) )
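A sketch of this SMDP Q-learning loop, executing each selected option to termination and then backing up with the accumulated discounted reward and the elapsed duration k. It reuses the hypothetical Option/run_option interface from the earlier sketch and keys the Q-table by (state, option index):

```python
import random

def smdp_q_learning(env_reset, env_step, options, n_episodes=500,
                    alpha=0.1, gamma=0.9, epsilon=0.1, max_steps=200):
    """Tabular SMDP Q-learning over options (sketch; the env_* interface is assumed).
    Update: Q(s,o) <- Q(s,o) + alpha * (r + gamma**k * max_{o'} Q(s',o') - Q(s,o))."""
    Q = {}  # keyed by (state, option index)
    q = lambda s, i: Q.get((s, i), 0.0)

    def available(s):
        return [i for i, o in enumerate(options) if s in o.initiation_set]

    for _ in range(n_episodes):
        s = env_reset()
        for _ in range(max_steps):
            avail = available(s)
            if not avail:
                break
            # epsilon-greedy choice among the options available in s
            i = random.choice(avail) if random.random() < epsilon \
                else max(avail, key=lambda j: q(s, j))
            r, s_next, k = run_option(env_step, s, options[i], gamma)  # run to termination
            target = r + gamma ** k * max((q(s_next, j) for j in available(s_next)),
                                          default=0.0)
            Q[(s, i)] = q(s, i) + alpha * (target - q(s, i))
            s = s_next
    return Q
```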
Interruption: while following \mu and executing option o, if in state s the value of continuing with o, Q^\mu(s, o), is less than the value of selecting a new option, V^\mu(s) = \sum_{o'} \mu(s, o') Q^\mu(s, o'), then interrupt o and switch.
Theorem (interruption): let \mu' be the interrupted policy obtained from \mu. Then
I. For all s \in S: V^{\mu'}(s) \ge V^\mu(s).
II. If from state s \in S there is a non-zero probability of encountering an interrupted history, then V^{\mu'}(s) > V^\mu(s).
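The interruption rule can be layered on top of a learned Q-table: at every primitive step, compare the value of continuing with the current option against the value of switching. The sketch below uses the greedy case, where V(s) reduces to max_{o'} Q(s, o'), and reuses the hypothetical interfaces and (state, option index) keying from the earlier sketches:

```python
import random

def act_with_interruption(env_reset, env_step, options, Q, max_steps=200):
    """Execute a greedy policy over options, interrupting the current option
    whenever continuing with it is worth less than switching (sketch)."""
    q = lambda s, i: Q.get((s, i), 0.0)
    available = lambda s: [i for i, o in enumerate(options) if s in o.initiation_set]

    s = env_reset()
    current = None
    for _ in range(max_steps):
        avail = available(s)
        if not avail:
            break
        best = max(avail, key=lambda j: q(s, j))   # argmax, so q(s, best) plays the role of V(s)
        # Start a new option if none is running, the previous one terminated,
        # or continuing with the current one is worth strictly less than switching.
        if current is None or q(s, current) < q(s, best):
            current = best
        a = options[current].policy(s)             # follow the option's policy one step
        r, s = env_step(s, a)
        if random.random() < options[current].termination(s):
            current = None                         # option terminated on its own
    return s
```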
Intra-option Q-learning: after each primitive transition (s_t, a_t, r_{t+1}, s_{t+1}), the update can be applied to every option whose policy would select actions in s_t according to the same distribution \pi(s_t, \cdot); for deterministic option policies, this means every option o with \pi^o(s_t) = a_t:
Q(s_t, o) <- Q(s_t, o) + \alpha ( r_{t+1} + \gamma U(s_{t+1}, o) - Q(s_t, o) )
where
U(s, o) = (1 - \beta(s)) Q(s, o) + \beta(s) \max_{o' \in O} Q(s, o')
is an estimate of the value of the state-option pair (s, o) upon arrival in state s: with probability 1 - \beta(s) the option continues, and with probability \beta(s) it terminates and the best available option is chosen.
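A sketch of the one-step intra-option update: after each primitive transition (s, a, r, s'), every deterministic option whose policy would also have chosen a in s is updated through U(s', o). The tabular layout and option interface are carried over from the earlier hypothetical sketches:

```python
def intra_option_update(Q, options, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One-step intra-option Q-learning update (sketch).
    U(s', o) = (1 - beta(s')) Q(s', o) + beta(s') max_{o'} Q(s', o')."""
    q = lambda st, i: Q.get((st, i), 0.0)
    all_idx = range(len(options))
    v_next = max(q(s_next, j) for j in all_idx)        # max_{o'} Q(s', o')
    for i, o in enumerate(options):
        if o.policy(s) != a:                            # only options consistent with a in s
            continue
        beta = o.termination(s_next)
        u = (1 - beta) * q(s_next, i) + beta * v_next   # U(s', o)
        Q[(s, i)] = q(s, i) + alpha * (r + gamma * u - q(s, i))
    return Q
```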
Theorem: for any set of Markov options O with deterministic policies, one-step intra-option Q-learning converges with probability 1 to the optimal Q-values for every option, regardless of which options are executed during learning, provided that every action gets executed in every state infinitely often.
Proof sketch (contraction): the intra-option update is Q(s, o) <- Q(s, o) + \alpha ( r' + \gamma U(s', o) - Q(s, o) ). We prove that the expected update operator E[ r' + \gamma U(s', o) ] is a contraction around Q^*_O. Let a = \pi^o(s) be the action taken by o in s, and let U^*(s, o) = (1 - \beta(s)) Q^*_O(s, o) + \beta(s) \max_{o'} Q^*_O(s, o'). Then

| E[ r' + \gamma U(s', o) ] - Q^*_O(s, o) |
  = | r_s^a + \gamma \sum_{s'} p_{ss'}^a U(s', o) - Q^*_O(s, o) |
  = | r_s^a + \gamma \sum_{s'} p_{ss'}^a U(s', o) - ( r_s^a + \gamma \sum_{s'} p_{ss'}^a U^*(s', o) ) |
  = \gamma | \sum_{s'} p_{ss'}^a [ (1 - \beta(s')) ( Q(s', o) - Q^*_O(s', o) )
                                  + \beta(s') ( \max_{o'} Q(s', o') - \max_{o'} Q^*_O(s', o') ) ] |
  \le \gamma \sum_{s'} p_{ss'}^a \max_{s'', o''} | Q(s'', o'') - Q^*_O(s'', o'') |
  = \gamma \max_{s'', o''} | Q(s'', o'') - Q^*_O(s'', o'') |,

since \sum_{s'} p_{ss'}^a = 1. Hence the expected backup moves Q toward Q^*_O by at least a factor \gamma.
Subgoals: options are naturally viewed as achieving subgoals, and each option's policy can be adapted to better achieve its subgoal. A subgoal is specified by assigning a terminal subgoal value, g(s), to each state in a set of subgoal states. In the rooms example, the target hallway might be assigned a subgoal value of +1, while the other subgoal states get a subgoal value of zero. Each option's policy can then be learned with respect to these subgoal values by a standard reinforcement learning method such as Q-learning, as sketched below.
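One way to realize this is to treat g(s) as a terminal reward and run ordinary Q-learning inside the option; the sketch below does exactly that, with a hypothetical tabular environment interface (env_reset, env_step) that is not part of the slides:

```python
import random

def learn_option_policy(env_reset, env_step, subgoal_value, terminal_states,
                        actions, n_episodes=2000, alpha=0.1, gamma=0.9,
                        epsilon=0.1, max_steps=500):
    """Learn an option's internal policy toward its subgoal with Q-learning (sketch).
    subgoal_value(s) returns g(s), e.g. +1 at the target hallway and 0 at the
    other subgoal states; terminal_states is the set where the option ends."""
    Q = {}
    q = lambda s, a: Q.get((s, a), 0.0)
    for _ in range(n_episodes):
        s = env_reset()
        for _ in range(max_steps):
            if s in terminal_states:
                break
            a = random.choice(actions) if random.random() < epsilon \
                else max(actions, key=lambda b: q(s, b))
            r, s_next = env_step(s, a)
            if s_next in terminal_states:
                target = r + gamma * subgoal_value(s_next)   # terminal subgoal value g(s')
            else:
                target = r + gamma * max(q(s_next, b) for b in actions)
            Q[(s, a)] = q(s, a) + alpha * (target - q(s, a))
            s = s_next
    # The option's internal policy is greedy with respect to the learned Q.
    return lambda s: max(actions, key=lambda a: q(s, a))
```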
Conclusions:
- Options allow temporally extended courses of action to be included in the reinforcement learning framework alongside primitive actions.
- Intra-option methods make it possible to learn about options from fragments of execution, rather than only from complete executions.
- Subgoals can be used to improve the options themselves.
Discussion:
- How good options and subgoals should be chosen or discovered is not yet well understood.
- Does intra-option learning improve upon the original option value learning?