Source Task Creation for Curriculum Learning
Sanmit Narvekar, Jivko Sinapov, Matteo Leonetti, and Peter Stone
Department of Computer Science, University of Texas at Austin
{sanmit, jsinapov, matteo, pstone}@cs.utexas.edu
easier tasks
A good curriculum:
fundamentals of chess
Empty task, Pawns only, Pawns + King, One piece per type, Target task
and transfer learning
Task Creation, Sequencing, Transfer Learning
Environment, Agent, Action, State, Reward
Task = MDP
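As a rough illustration of "Task = MDP", a task can be written down as a container for the MDP's components. The sketch below is an assumption for illustration, not the authors' code: the class name `Task`, its field names, the `rollout` helper, and the toy three-state chain are all hypothetical, and transitions are made deterministic only to keep the example short.

```python
from dataclasses import dataclass
from typing import Callable, Hashable, Sequence

State = Hashable
Action = Hashable


@dataclass(frozen=True)
class Task:
    """A task viewed as an MDP (hypothetical container, for illustration only)."""
    states: Sequence[State]
    actions: Sequence[Action]
    transition: Callable[[State, Action], State]       # deterministic for simplicity
    reward: Callable[[State, Action, State], float]
    initial_states: Sequence[State]
    terminal_states: Sequence[State]


def rollout(task: Task, policy: Callable[[State], Action], max_steps: int = 100):
    """Collect (s, a, s', r) tuples by following a policy from the first initial state."""
    s = task.initial_states[0]
    samples = []
    for _ in range(max_steps):
        if s in task.terminal_states:
            break
        a = policy(s)
        s_next = task.transition(s, a)
        samples.append((s, a, s_next, task.reward(s, a, s_next)))
        s = s_next
    return samples


# Toy chain 0 -> 1 -> 2: state 2 is terminal and reaching it pays a reward of 1.
chain = Task(
    states=[0, 1, 2],
    actions=["right"],
    transition=lambda s, a: min(s + 1, 2),
    reward=lambda s, a, s2: 1.0 if s2 == 2 else 0.0,
    initial_states=[0],
    terminal_states=[2],
)
trajectory = rollout(chain, policy=lambda s: "right")
```

Changing any one field (the state or action space, the rewards, the transitions, or the initial and terminal states) yields a different task over the same domain.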
the agent's performance on a task
that create a source task M_s given a target task M_t and (s, a, s', r) trajectory tuples X from M_t
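The mapping from a target task M_t and trajectory tuples X to a source task M_s can be sketched as a function type. Everything named below is hypothetical: `Task` is a minimal stand-in that models only the initial states, and `start_where_reward_was_seen` is one made-up example of such a function, not one taken from the talk.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Tuple

Sample = Tuple[int, str, int, float]  # (s, a, s', r)


@dataclass(frozen=True)
class Task:
    """Hypothetical stand-in for an MDP; only the part edited here is modeled."""
    name: str
    initial_states: Tuple[int, ...]


# Shape of the mapping described above: (M_t, X) -> M_s.
SourceTaskCreator = Callable[[Task, List[Sample]], Task]


def start_where_reward_was_seen(target: Task, X: List[Sample]) -> Task:
    """Illustrative creator: start the source task in states that led to positive reward."""
    rewarding = tuple(sorted({s for (s, _a, _s2, r) in X if r > 0}))
    # Fall back to the target's own initial states if no reward was observed.
    return replace(
        target,
        name=target.name + "-source",
        initial_states=rewarding or target.initial_states,
    )


M_t = Task(name="target", initial_states=(0,))
X = [(0, "right", 1, 0.0), (1, "right", 2, 0.0), (2, "right", 3, 1.0)]
M_s = start_where_reward_was_seen(M_t, X)
```

Because the function receives X, the source task it returns depends on what the agent has actually experienced in the target task.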
source tasks
State/Action Space
Rewards
Reward for promotion
Transitions
Initial/Terminal State Distributions
1. Task Simplification
2. Promising Initializations
3. Mistake Learning
4. Action Simplification
5. Option-based Subgoals
6. Task-based Subgoals
7. Composite Subtasks
Observes the agent
Uses knowledge of domain
Ms. Pac-Man, Half Field Offense
simplify the task
Easier Harder
Number of steps away, Euclidean Distance, Number of "moves" away
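Two of these distance notions can give different answers for the same pair of states. The sketch below assumes a small 4-connected grid world with walls, which is not a domain from the talk; the function names and grid layout are hypothetical.

```python
import math
from collections import deque


def euclidean(p, q):
    """Straight-line distance between two (x, y) positions."""
    return math.dist(p, q)


def steps_away(start, goal, walls, width, height):
    """Shortest number of grid steps from start to goal (BFS); None if unreachable."""
    frontier = deque([(start, 0)])
    seen = {start}
    while frontier:
        (x, y), d = frontier.popleft()
        if (x, y) == goal:
            return d
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            nx, ny = nxt
            inside = 0 <= nx < width and 0 <= ny < height
            if inside and nxt not in walls and nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, d + 1))
    return None


# A 3x3 grid where a wall column forces a detour around the top row.
walls = {(1, 0), (1, 1)}
straight = euclidean((0, 0), (2, 0))
detour = steps_away((0, 0), (2, 0), walls, width=3, height=3)
```

Here the two cells are 2 units apart in a straight line but 6 steps apart along the grid, so which measure counts as "close" is a domain-specific choice.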
there
Rewind Revise MISTAKE Checkmate
MISTAKES: Getting eaten by ghost, Not eating edible ghost, Failing to score, Losing possession
How far back to rewind?
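The rewind idea can be sketched as picking, for each detected mistake, the state a fixed number of steps earlier as a candidate starting state for a source task. This is a minimal sketch under assumptions: `is_mistake` is a hypothetical domain-specific predicate, and the rewind depth `k` is a free parameter, since how far back to rewind is left open here.

```python
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[int, str, int, float]  # (s, a, s', r)


def rewind_start_states(
    episode: Sequence[Sample],
    is_mistake: Callable[[Sample], bool],
    k: int,
) -> List[int]:
    """Return, for each mistake in the episode, the state k steps before it."""
    starts = []
    for i, sample in enumerate(episode):
        if is_mistake(sample):
            j = max(0, i - k)  # clamp so we never rewind past the episode start
            starts.append(episode[j][0])
    return starts


# Toy episode: the last transition carries a negative reward, treated as the mistake.
episode = [(0, "a", 1, 0.0), (1, "a", 2, 0.0), (2, "b", 3, 0.0), (3, "b", 4, -1.0)]
starts = rewind_start_states(episode, is_mistake=lambda t: t[3] < 0, k=2)
```

A small `k` gives the agent a short window in which to revise the mistake; a large `k` approaches replaying the whole episode.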
2v2 Half Field Offense, Ms. Pac-Man
(results in paper)
Agent, Target Task, Empty Task, Promising Initializations, Mistake Learning, Task Simplification, X = {(s, a, s', r), …}
rare
goals
target exploration in this region
far away
the field while maintaining possession, until a shot is likely to score
Baseline, One step, Two step
Baseline, Two step, Three step
in a curriculum
performance
Key Differences
context of reinforcement learning
formalism to create source tasks, tailored to the performance of the agent
curriculum can improve learning speed