Learning Curriculum Policies for Reinforcement Learning
Sanmit Narvekar and Peter Stone, Department of Computer Science, University of Texas at Austin, {sanmit, pstone}@cs.utexas.edu
Successes of Reinforcement Learning
Example curriculum: Empty task → Pawns only → Pawns + King → One piece per type → Target task
Reinforcement learning and transfer learning
[Figure: the RL loop. The Agent sends an Action to the Environment; the Environment returns a State and a Reward.]
Task = MDP (assumed given). This work considers 2 types of transfer.
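Since each task is an ordinary MDP, the learning agent's inner loop is standard RL. As a minimal sketch (the task interface with reset(), step(), and task.actions is a hypothetical stand-in, not from the slides):

```python
import random
from collections import defaultdict

def q_learning(task, episodes=100, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning on one task (an MDP). Hypothetical interface:
    task.reset() -> state, task.step(a) -> (state, reward, done),
    task.actions is a list of discrete actions."""
    Q = defaultdict(float)
    for _ in range(episodes):
        s, done = task.reset(), False
        while not done:
            # Epsilon-greedy action selection over Q(s, .)
            if random.random() < epsilon:
                a = random.choice(task.actions)
            else:
                a = max(task.actions, key=lambda act: Q[(s, act)])
            s2, r, done = task.step(a)
            # One-step TD update toward r + gamma * max_a' Q(s', a')
            best_next = 0.0 if done else max(Q[(s2, act)] for act in task.actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s2
    return Q
```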
Value function transfer: Q_source(s, a) learned on a source task is used to initialize learning on the target task.
Image credit: Taylor and Stone, JMLR 2009
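A sketch of what that transfer amounts to for a tabular agent (assuming, for simplicity, that source and target share a state-action space; this simplification is mine, not the slides'):

```python
from collections import defaultdict

def transfer_q_values(Q_source):
    """Value function transfer: start the target task's Q-table at
    Q_source(s, a) rather than zero, so behavior learned on the source
    task guides early exploration on the target. If the tasks do not
    share a state-action space, an inter-task mapping (Taylor and Stone,
    JMLR 2009) would translate target (s, a) pairs to source ones first."""
    return defaultdict(float, Q_source)
```

Learning on the target task then proceeds from this warm start, e.g. a variant of the q_learning sketch above that accepts an initial Q-table.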
Reward shaping: new reward = old reward + shaping reward, i.e. r'(s, a, s') = r(s, a, s') + f(s, s').
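One standard way to construct the shaping reward (not spelled out on the slide) is potential-based shaping, which provably preserves the task's optimal policy; for transfer, the potential can be a value function learned on a source task:

```python
def shaped_reward(r, s, s2, potential, gamma=0.99):
    """New reward = old reward + shaping reward, with the potential-based
    form F(s, s') = gamma * phi(s') - phi(s) (Ng et al., 1999). Choosing
    phi(s) = V_source(s), a value function from a source task, turns
    shaping into a transfer method."""
    return r + gamma * potential(s2) - potential(s)
```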
Previous approaches: generating source tasks for a target task, and using heuristics to select the next task.
[Figure: the curriculum MDP. A Curriculum Agent selects among Task 1, Task 2, ..., Task N; each curriculum action places the RL Agent in one task's environment, where the usual Action/State/Reward loop runs. At the outer level, the Curriculum Agent observes a Curriculum State, takes Curriculum Actions, and receives a Curriculum Reward.]
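Concretely, the curriculum agent runs an RL loop one level up: its actions choose tasks, its state summarizes the learning agent's current policy, and its reward is the negative cost of the training it commissions. A sketch under assumed interfaces (represent, train, solves, choose, and update are all hypothetical names):

```python
def run_curriculum_episode(curriculum_agent, learning_agent, tasks, target_task):
    """One curriculum episode: sequence source tasks until the learning
    agent can solve the target task. Hypothetical helpers: represent()
    maps the learning agent's policy to a curriculum (CMDP) state,
    train() trains on one task and returns the time steps spent, and
    solves() checks performance on the target task."""
    total_cost = 0
    c_state = represent(learning_agent.policy)
    while not solves(learning_agent, target_task):
        task = curriculum_agent.choose(c_state, tasks)   # curriculum action
        steps = train(learning_agent, task)              # inner RL loop runs here
        total_cost += steps
        c_next = represent(learning_agent.policy)
        # Curriculum reward is -steps, so faster curricula score higher.
        curriculum_agent.update(c_state, task, -steps, c_next)
        c_state = c_next
    return total_cost
```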
[Figure: the CMDP as a graph. Nodes are learning-agent policies π0, π1, ..., π5 and a terminal πf; edges are training actions on tasks M1-M4, labeled with rewards R0,1 through R5,4.]
An action trains the learning agent on a task, given its current policy πi.
Representing the CMDP state space
[Pipeline: Extract raw CMDP state variables → Extract features → Function approximation and learning (example vectors: [0,0,0,…0], [1,3,4,…0], [1,2,3,…9], [1,2,3,…0]).]
CMDP State 1          Left   Right   Policy
State 1               0.3    0.7     Right
State 2               0.1    0.9     Right
State 3               0.4    0.6     Right
State 4               0.0    1.0     Right

CMDP State 2          Left   Right   Policy
State 1               0.2    0.8     Right
State 2               0.2    0.8     Right
State 3               0.2    0.8     Right
State 4               0.3    0.7     Right

CMDP State 3          Left   Right   Policy
State 1               0.7    0.3     Left
State 2               0.9    0.1     Left
State 3               0.6    0.4     Left
State 4               0.0    1.0     Right
[Figure: a 4-state gridworld with Left/Right actions: State 1, State 2, State 3, State 4.]
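The tables make the key point: a CMDP state is the learning agent's policy itself, here a table of action-selection probabilities. As a tiny sketch (names are illustrative), the raw CMDP state variables are just that table flattened into a vector:

```python
def raw_cmdp_state(policy_table, states, actions):
    """Flatten a tabular policy into raw CMDP state variables: one
    action-selection probability per (primitive state, action) pair.
    E.g., CMDP State 1 above becomes
    [0.3, 0.7, 0.1, 0.9, 0.4, 0.6, 0.0, 1.0]."""
    return [policy_table[s][a] for s in states for a in actions]
```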
In the CMDP representation space, the more similar two policies are in each primitive state, the more common tiles are activated by their CMDP states.
[Figure: tile coding the CMDP state. For State 1, the axes are Normalized Q(State 1, Left) and Normalized Q(State 1, Right); for State 2, Normalized Q(State 2, Left) and Normalized Q(State 2, Right).]
With function approximation, the parameters (weights) are not local to a state. Rather than hand-designing a new representation for each domain, given the parameters of the function approximator, one can use tile coding over the parameters as before.
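A sketch of that idea, tiling each (normalized) parameter dimension independently so the number of features stays linear in the number of parameters (the dimension-wise tiling, offsets, and sizes here are my assumptions, for illustration):

```python
import numpy as np

def tile_code(params, n_tilings=8, n_tiles=10, low=0.0, high=1.0):
    """Tile-code a vector of normalized policy parameters into sparse
    binary features. Each tiling splits every parameter dimension into
    n_tiles bins, shifted slightly per tiling, so nearby parameter
    vectors (similar policies) activate many of the same tiles."""
    params = np.clip(np.asarray(params, dtype=float), low, high)
    scaled = (params - low) / (high - low) * n_tiles
    active = []
    for t in range(n_tilings):
        offset = t / n_tilings                      # shift this tiling
        bins = np.clip(np.floor(scaled + offset).astype(int), 0, n_tiles - 1)
        for dim, b in enumerate(bins):
            active.append((t, dim, int(b)))         # one active tile per (tiling, dim)
    return active
```

A value function over CMDP states can then be linear in these sparse features.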
Experiments vary:
Transferred knowledge (VF, shaping reward, etc.)
CMDP representations
Learning algorithms
Agent Types
CMDP Representations
Agent Representation
CMDP Representation
Transfer Methods
How long to train on a source task? (see the sketch below)
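The last question (how long to train) is varied in the experiments below: a small fixed episode budget versus a return-based rule that stops when performance on the source task plateaus. A sketch of the return-based rule (train_episode and evaluate are hypothetical helpers):

```python
def train_return_based(agent, task, patience=3, max_episodes=500):
    """Train on a source task until average return stops improving
    (return-based stopping) instead of a fixed episode budget.
    Hypothetical helpers: agent.train_episode(task) runs one learning
    episode; evaluate(agent, task) estimates current average return."""
    best, stale = float("-inf"), 0
    for episode in range(max_episodes):
        agent.train_episode(task)
        ret = evaluate(agent, task)
        if ret > best:
            best, stale = ret, 0
        else:
            stale += 1
            if stale >= patience:   # no improvement for `patience` evaluations
                break
    return episode + 1              # cost charged to the curriculum agent
```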
[Plot: Cost to Learn Target Task vs. CMDP Episodes. Curves: no curriculum; continuous state representation; naive length 1 representation; naive length 2 representation.]
[Plot: Cost to Learn Target Task vs. CMDP Episodes. Curves: no curriculum; Svetlik et al. (2017); continuous state representation; naive length 2 representation.]
[Plot: Cost to Learn Target Task vs. CMDP Episodes. Curves: reward shaping (return-based); reward shaping (small fixed); value function (return-based); value function (small fixed).]
Restrictions on source tasks
Heuristic-based sequencing
MDP/POMDP-based sequencing
CL for supervised learning
Formulated curriculum generation as an MDP. Showed a curriculum policy can be learned, and is robust to: agent types, CMDP representations, and transfer methods.