Source Task Creation for Curriculum Learning - Sanmit Narvekar


SLIDE 1

Source Task Creation for Curriculum Learning

Sanmit Narvekar, Jivko Sinapov, Matteo Leonetti, and Peter Stone
Department of Computer Science
University of Texas at Austin
{sanmit, jsinapov, matteo, pstone}@cs.utexas.edu

SLIDE 2

Introduction

  • Curricula widespread in human learning
  • Education, sports, games…
  • Why curricula?
  • Target task too hard to make progress
  • Faster to learn and combine skills from easier tasks

A good curriculum:

  • Breaks down the task
  • Lets the agent learn on its own
  • Adjusts to the progress of the agent

SLIDE 3

Example: Quick Chess

  • Quickly learn the fundamentals of chess
  • 5 x 6 board
  • Fewer pieces per type
  • No castling
  • No en passant

SLIDE 4

Example: Quick Chess

SLIDE 5

Task Space

[Figure labels: Empty task; Pawns only; Pawns + King; One piece per type; Target task]

  • Quick Chess is a curriculum designed for people
  • We want to do something similar for autonomous agents

SLIDE 6

Curriculum Learning

  • Curriculum learning is a complex problem that ties together task creation, sequencing, and transfer learning

[Figure labels: Task Creation; Sequencing; Transfer Learning]

[Figure labels: Environment; Agent; Action; State; Reward]

Task = MDP

SLIDE 7

Transfer Learning

  • Well-studied problem [Taylor 2009, Lazaric 2011]
  • Given a source and target task, how to transfer knowledge
  • We transfer value functions
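
As a minimal sketch of what transferring a value function could look like, assume a tabular Q-function stored as a dictionary and a hypothetical `map_state` function that projects each target-task state onto a source-task state. The slides do not specify the representation or the mapping, so every name below is illustrative.

```python
from collections import defaultdict


def transfer_q_values(source_q, target_states, actions, map_state):
    """Initialize a target-task Q-table from a source-task Q-table.

    source_q:  dict mapping (source_state, action) -> learned value.
    map_state: hypothetical projection from a target state to a source state.
    Pairs with no source estimate keep the default value 0.0.
    """
    target_q = defaultdict(float)
    for s in target_states:
        for a in actions:
            key = (map_state(s), a)
            if key in source_q:
                target_q[(s, a)] = source_q[key]
    return target_q


# Toy source task: values learned for a single abstract state.
source_q = {("near_goal", "shoot"): 1.0, ("near_goal", "pass"): 0.2}

# Toy target states carry an extra feature that the source task lacks.
target_states = [("near_goal", 1), ("near_goal", 2), ("midfield", 1)]

target_q = transfer_q_values(
    source_q, target_states, ["shoot", "pass"], map_state=lambda s: s[0]
)
```

Here the target learner starts from the source estimates wherever the mapping finds a match and from zero elsewhere, then continues learning on the target task.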

SLIDE 8

Task Creation

  • This talk will focus on task creation
  • Automatic sequencing is an important direction for future work
  • Show we can create a useful space of tasks to compose a curriculum

SLIDE 9

Task Creation

SLIDE 10

Formalism for Task Creation

  • Key idea: create tasks using both domain knowledge and by observing the agent’s performance on a task
  • We propose a formalism for task creation
  • Consists of a set of heuristic functions that create a source task M_s given a target task M_t and (s, a, s’, r) trajectory tuples X from M_t
  • Formalism is domain-independent (applicable to many domains)
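
The interface described above can be sketched as a function type that takes the target task M_t and the samples X and returns candidate source tasks M_s. The `MDP` container below is a simplified assumption (it omits transition dynamics and rewards), and `restrict_actions_to_observed` is a toy placeholder that only illustrates the interface; it is not one of the heuristics from the talk.

```python
from dataclasses import dataclass, replace
from typing import Callable, FrozenSet, Hashable, List, Sequence, Tuple

State = Hashable
Action = Hashable
Transition = Tuple[State, Action, State, float]  # one (s, a, s', r) tuple


@dataclass(frozen=True)
class MDP:
    """Simplified task description; dynamics and rewards are left out."""
    states: FrozenSet[State]
    actions: FrozenSet[Action]
    start_states: FrozenSet[State]
    terminal_states: FrozenSet[State]


# A heuristic function maps (target task M_t, samples X) to source tasks M_s.
TaskCreator = Callable[[MDP, Sequence[Transition]], List[MDP]]


def restrict_actions_to_observed(
    target: MDP, samples: Sequence[Transition]
) -> List[MDP]:
    """Toy heuristic: keep only the actions that appear in the samples."""
    used = frozenset(a for (_, a, _, _) in samples) & target.actions
    if not used or used == target.actions:
        return []  # nothing to simplify
    return [replace(target, actions=used)]


m_t = MDP(
    states=frozenset({0, 1, 2}),
    actions=frozenset({"left", "right", "stay"}),
    start_states=frozenset({0}),
    terminal_states=frozenset({2}),
)
X = [(0, "right", 1, 0.0), (1, "right", 2, 1.0)]

creator: TaskCreator = restrict_actions_to_observed
source_tasks = creator(m_t, X)
```

Because every heuristic shares one signature, a curriculum designer could apply several of them to the same target task and samples and pool the resulting source tasks.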

SLIDE 11

Formalism for Task Creation

  • Each function alters different parts of the MDP M to create source tasks

[Figure panels: State/Action Space; Rewards ("Reward for promotion"); Transitions; Initial/Terminal State Distributions]

SLIDE 12

Heuristic Functions

1. Task Simplification
2. Promising Initializations
3. Mistake Learning
4. Action Simplification
5. Option-based Subgoals
6. Task-based Subgoals
7. Composite Subtasks

[Legend: Observes the agent; Uses knowledge of domain]

SLIDE 13

Experimental Domains

[Figure labels: Ms. Pac-Man; Half Field Offense]

SLIDE 14

Task Simplification

  • Use knowledge of the domain encoded in degrees of freedom F to simplify the task
  • F = [F_1, F_2, …, F_n]: vector of features that parameterize the domain
  • Assumes ordering over each F_i corresponding to task complexity
  • Reduces the complexity of one degree of freedom at a time

[Figure labels: Easier; Harder]
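
A minimal sketch of this idea, assuming the task is described by a dictionary of degrees of freedom and that each degree of freedom comes with a list of its values ordered from easiest to hardest. The parameter names (`num_opponents`, `field_size`) are hypothetical and not taken from the talk.

```python
from typing import Dict, List, Sequence


def simplify_task(
    target: Dict[str, object], orderings: Dict[str, Sequence[object]]
) -> List[Dict[str, object]]:
    """Return parameter settings that lower exactly one degree of freedom.

    orderings[name] lists the values of that degree of freedom from easiest
    to hardest. Each returned setting differs from the target in one entry,
    moved one step toward the easy end.
    """
    sources = []
    for name, values in orderings.items():
        idx = values.index(target[name])
        if idx > 0:  # skip degrees of freedom already at their easiest value
            simpler = dict(target)
            simpler[name] = values[idx - 1]
            sources.append(simpler)
    return sources


# Hypothetical degrees of freedom, each ordered from easiest to hardest.
orderings = {
    "num_opponents": [0, 1, 2],
    "field_size": ["small", "medium", "large"],
}
target = {"num_opponents": 2, "field_size": "small"}

sources = simplify_task(target, orderings)
```

Applying the function repeatedly to its own outputs walks down toward the easiest setting one degree of freedom at a time.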

SLIDE 15

Promising Initializations

  • Positive outcomes can be rare at onset of learning
  • Explores regions of state space near positive outcomes/rewards
  • C(s_1, s_2): distance measure quantifying state proximity
  • Threshold on distance
  • Percentile threshold on which states/rewards in X are positive outcomes
  • Returns MDP that initializes start state distribution to these states
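
The steps above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: it treats the state s of a high-reward transition as the positive-outcome state, uses a nearest-rank percentile on the observed rewards, and returns the visited states within the distance threshold. The names `max_distance` and `percentile` are placeholders for the two thresholds.

```python
import math


def promising_start_states(samples, distance, max_distance, percentile):
    """Pick visited states that lie close to observed positive outcomes.

    samples:      list of (s, a, s_next, r) tuples collected on the target task.
    distance:     function C(s1, s2) quantifying state proximity.
    max_distance: threshold on the distance.
    percentile:   value in (0, 100]; rewards at or above this percentile of
                  the observed rewards count as positive outcomes.
    """
    rewards = sorted(r for (_, _, _, r) in samples)
    # Nearest-rank percentile over the observed rewards.
    rank = max(1, math.ceil(percentile / 100 * len(rewards)))
    cutoff = rewards[rank - 1]
    positive = [s for (s, _, _, r) in samples if r >= cutoff]

    starts = []
    for (s, _, _, _) in samples:
        near = any(distance(s, p) <= max_distance for p in positive)
        if near and s not in starts:
            starts.append(s)
    return starts


# Toy trajectory on a line: only the last transition is rewarded.
samples = [
    ((0, 0), "move", (1, 0), 0.0),
    ((1, 0), "move", (2, 0), 0.0),
    ((2, 0), "move", (3, 0), 0.0),
    ((3, 0), "shoot", (4, 0), 1.0),
]

starts = promising_start_states(
    samples, distance=math.dist, max_distance=1.5, percentile=90
)
```

The returned states would then define the start state distribution of the new source task, so the agent begins its episodes near where rewards were seen.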

SLIDE 16

Promising Initializations

[Figure labels: Number of steps away; Euclidean Distance; Number of “moves” away]

SLIDE 17

Mistake Learning

  • Create subtasks to avoid or correct mistakes
  • Specified by the domain
  • E.g. termination in non-goal state
  • Rewind the episode epsilon steps back, and learn a revised policy from there

[Figure labels: Rewind; Revise; MISTAKE; Checkmate]
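
A minimal sketch of the rewind step, assuming each episode is recorded as a list of visited states and that the domain supplies a predicate for mistakes. The names `is_mistake` and `steps_back` are hypothetical; `steps_back` plays the role of epsilon.

```python
def rewind_start_states(episodes, is_mistake, steps_back):
    """Collect start states for subtasks that revisit a mistake.

    episodes:   list of state sequences [s_0, ..., s_T], one per episode.
    is_mistake: domain-specified predicate on the final state of an episode.
    steps_back: how many steps before the end to restart from (epsilon).
    """
    starts = []
    for states in episodes:
        if states and is_mistake(states[-1]):
            # Clamp so short episodes restart from their first state.
            idx = max(0, len(states) - 1 - steps_back)
            starts.append(states[idx])
    return starts


episodes = [
    ["a", "b", "c", "lost_ball"],  # ends in a mistake
    ["a", "b", "goal"],            # successful episode, ignored
    ["x", "lost_ball"],            # shorter than the rewind distance
]

starts = rewind_start_states(
    episodes, is_mistake=lambda s: s == "lost_ball", steps_back=2
)
```

A source task that starts from these states lets the agent practice the moments just before a mistake instead of replaying the whole episode.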

SLIDE 18

Mistake Learning

MISTAKES

  • Getting eaten by ghost
  • Not eating edible ghost
  • Failing to score
  • Losing possession

How far back to rewind?

SLIDE 19

Results

  • 2v2 Half Field Offense
  • Ms. Pac-Man (results in paper)

SLIDE 20

2v2 HFO Baseline

SLIDE 21

Curriculum Generation

[Figure labels: Agent; Target Task; Empty Task; Promising Initializations; Mistake Learning; Task Simplification; X = {(s, a, s’, r), …}]

SLIDE 22

Shoot Task

  • Initially, goal scoring episodes are rare
  • We observe a few successful goals
  • Use PromisingInitializations to target exploration in this region
  • Agents learn to shoot on goal

SLIDE 23

Dribble Task

  • Agent takes too many shots from far away
  • Skill needed: move the ball up the field while maintaining possession, until a shot is likely to score

SLIDE 24

2v2 HFO Results

[Plot legend: Baseline]

SLIDE 25

2v2 HFO Results

[Plot legend: Baseline; One step]

SLIDE 26

2v2 HFO Results

[Plot legend: Baseline; One step; Two step]

SLIDE 27

2v3 HFO Results

[Plot legend: Baseline]

SLIDE 28

2v3 HFO Results

[Plot legend: Baseline; Two step]

SLIDE 29

2v3 HFO Results

[Plot legend: Baseline; Two step; Three step]

SLIDE 30

Experimental Recap

  • Tasks created by our formalism can be used as source tasks in a curriculum
  • Learning via a curriculum can improve learning speed or performance

SLIDE 31

Related Work

  • Curriculum learning in supervised learning [Bengio et al. 2009]
  • Multi-task reinforcement learning [Wilson et al. 2007]
  • Lifelong reinforcement learning [Ammar et al. 2014]
  • Learning task transferability [Sinapov et al. 2015]

Key Differences

  • Source tasks created solely to improve performance on target
  • Focus on task generation, not selection
  • Agent-tailored source tasks based on agent performance

SLIDE 32

Summary

  • Presented curriculum learning in the context of reinforcement learning
  • Defined a domain-independent formalism to create source tasks, tailored to the performance of the agent
  • Empirically demonstrated that using a curriculum can improve learning speed or performance