Autonomous Task Sequencing for Customized Curriculum Design in Reinforcement Learning
Sanmit Narvekar, Jivko Sinapov, and Peter Stone
Department of Computer Science, University of Texas at Austin
{sanmit, jsinapov, pstone}@cs.utexas.edu


slide-1
SLIDE 1

Autonomous Task Sequencing for Customized Curriculum Design in Reinforcement Learning

Sanmit Narvekar, Jivko Sinapov, and Peter Stone
Department of Computer Science, University of Texas at Austin
{sanmit, jsinapov, pstone}@cs.utexas.edu

slide-2
SLIDE 2

Successes of Reinforcement Learning

Approaching or passing human-level performance, BUT it can take millions of episodes! People learn this MUCH faster.

slide-3
SLIDE 3

People Learn via Curricula

People are able to learn many complex tasks very efficiently

slide-4
SLIDE 4

Example: Quick Chess

  • Quickly learn the fundamentals of chess
  • 5 x 6 board
  • Fewer pieces per type
  • No castling
  • No en passant

slide-5
SLIDE 5

Example: Quick Chess


slide-6
SLIDE 6

Task Space

[Figure: task space, with labels Empty task, Pawns only, Pawns + King, One piece per type, Target task]

  • Quick Chess is a curriculum designed for people
  • We want to do something similar automatically for autonomous agents

slide-7
SLIDE 7

Curriculum Learning

  • Curriculum learning is a complex problem that ties together task creation, sequencing, and transfer learning
  • Task creation: presented at AAMAS '16
  • Sequencing
  • Transfer learning: via value function transfer

[Figure: agent-environment loop, with labels Environment, Agent, Action, State, Reward. Task = MDP]
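
Value function transfer is only named on this slide. As a minimal sketch, assuming a tabular agent whose Q-values live in a dictionary, transfer can be illustrated by seeding the target task's table with values learned on a source task. The function name, state labels, and numbers are hypothetical and are not taken from the talk.

```python
def transfer_value_function(source_q, target_state_actions, default=0.0):
    """Initialize a target-task Q-table from a source-task Q-table.

    Entries the source task has already learned are copied over;
    every other (state, action) pair starts at `default`.
    """
    return {
        (state, action): source_q.get((state, action), default)
        for (state, action) in target_state_actions
    }


# Hypothetical Q-values learned on a small source task.
source_q = {("s0", "right"): 0.8, ("s0", "left"): 0.1}

# The target task contains one state the source task never visited.
target_pairs = [("s0", "right"), ("s0", "left"), ("s1", "right")]
target_q = transfer_value_function(source_q, target_pairs)
```

The agent then continues learning on the target task from `target_q` rather than from scratch.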

slide-8
SLIDE 8

Autonomous Task Sequencing

slide-9
SLIDE 9

Sequencing as an MDP

[Figure: curriculum MDP graph. States: policies π0 to π5 and the final policy πf. Actions: tasks M1 to M4. Edge rewards: R0,1, R0,2, R0,3, R1,3, R2,4, R3,3, R4,4, R5,4]

  • State space S_C: all policies π_i an agent can represent
  • Action space A_C: the different tasks M_j an agent can train on
  • Transition function p_C(s_C, a_C): learning task a_C transforms an agent's policy s_C
  • Reward function r_C(s_C, a_C): cost in time steps to learn task a_C given policy s_C
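
The four components above map onto a small data structure. The sketch below is illustrative only: `CurriculumMDP`, `toy_learn`, `toy_cost`, and the cost numbers are assumptions, and a policy is stood in for by the set of tasks already learned. In the real setting each transition means running an RL agent on a task.

```python
from dataclasses import dataclass
from typing import Callable, Hashable, Sequence

Policy = Hashable  # stand-in for an agent policy (a CMDP state s_C)
Task = Hashable    # stand-in for a task M_j (a CMDP action a_C)


@dataclass(frozen=True)
class CurriculumMDP:
    tasks: Sequence[Task]                    # action space A_C
    learn: Callable[[Policy, Task], Policy]  # one sampled transition p_C(s_C, a_C)
    cost: Callable[[Policy, Task], float]    # time steps spent, as in r_C(s_C, a_C)


def toy_learn(policy, task):
    # Toy transition: learning a task adds it to the set of learned tasks.
    return policy | {task}


def toy_cost(policy, task):
    # Hypothetical numbers: M2 is cheaper to learn once M1 is known.
    if task == "M2" and "M1" not in policy:
        return 50.0
    return 10.0


cmdp = CurriculumMDP(tasks=("M1", "M2"), learn=toy_learn, cost=toy_cost)
pi0 = frozenset()
pi1 = cmdp.learn(pi0, "M1")
direct_cost = cmdp.cost(pi0, "M2")                             # M2 from scratch
staged_cost = cmdp.cost(pi0, "M1") + cmdp.cost(pi1, "M2")      # M1, then M2
```

In this toy setup the staged route is cheaper than the direct one, which is the kind of saving a curriculum is meant to find.
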
slide-10
SLIDE 10

Sequencing as an MDP

[Figure: curriculum MDP graph over policies π0 to πf, tasks M1 to M4, and rewards Ri,j]

  • A policy π_C: S_C → A_C on this curriculum MDP (CMDP) specifies which task to train on given the learning agent's policy π_i
  • Learning a full policy π_C can be difficult!
  • Taking an action requires solving a full task MDP
  • Transitions are not deterministic
slide-11
SLIDE 11

Sequencing as an MDP

[Figure: curriculum MDP graph over policies π0 to πf, tasks M1 to M4, and rewards Ri,j, with the target task marked]

  • Instead, find one trace/execution in the CMDP of π_C*
  • Main idea: leverage the fact that we know the target task, and therefore what is relevant for the final state policy π_f, to guide the selection of tasks
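
A single trace is one path through the CMDP: start from the initial policy, pick a task, train on it, and repeat until the final policy is reached. The sketch below assumes hypothetical hooks (`choose_task`, `learn`, `cost`, `is_final`) and again stands in for a policy with the set of learned tasks. It shows the shape of a rollout, not the selection rule used in the talk.

```python
def run_trace(initial_policy, choose_task, learn, cost, is_final, max_steps=100):
    """Roll out one curriculum, i.e. a single path through the CMDP."""
    policy, curriculum, total_cost = initial_policy, [], 0.0
    for _ in range(max_steps):
        if is_final(policy):
            break
        task = choose_task(policy)        # which task to train on next
        total_cost += cost(policy, task)  # time steps spent learning it
        policy = learn(policy, task)      # training transforms the policy
        curriculum.append(task)
    return curriculum, total_cost, policy


# Toy instance: the final policy is the one that has learned all three tasks.
TARGET = frozenset({"M1", "M2", "M3"})
ORDER = ("M1", "M2", "M3")

curriculum, total_cost, final_policy = run_trace(
    initial_policy=frozenset(),
    choose_task=lambda policy: next(t for t in ORDER if t not in policy),
    learn=lambda policy, task: policy | {task},
    cost=lambda policy, task: 10.0,  # hypothetical flat cost per task
    is_final=lambda policy: policy == TARGET,
)
```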

slide-12
SLIDE 12

Autonomous Sequencing

  • Grid world domain
  • Objectives
  • Navigate the world
  • Pick up keys
  • Unlock locks
  • Avoid pits

[Figure: target task]

slide-13
SLIDE 13

Autonomous Sequencing

  • Recursive algorithm (6 steps)
  • Each iteration adds a source task to the curriculum
  • This in turn updates the policy
  • Terminates when performance on the target task is greater than the desired performance threshold

[Figure: algorithm diagram with steps 1 to 6 and groups labeled Solvable Tasks and Unsolvable Tasks]

slide-14
SLIDE 14

Autonomous Sequencing

Step 1

  • Assume learning budget β
  • Attempt to solve the target task directly in β steps. Save samples
  • Solvable?
  • Target task easy to learn
  • Started with policy that made it easy to learn. Done
  • Goal: incrementally learn subtasks to build a policy that can learn the target task

[Figure: step 1, target task]
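
Step 1 can be sketched as a budgeted attempt that records every transition it sees. The interface below (`reset`, `step`, `policy`, `update`) and the corridor task are assumptions for illustration. "Solved" is simplified here to reaching the goal within the budget, whereas the talk uses a performance threshold.

```python
def attempt(reset, step, policy, budget, update=None):
    """Run for at most `budget` steps; return (solved, samples)."""
    samples = []
    state = reset()
    for _ in range(budget):
        action = policy(state)
        next_state, reward, done = step(state, action)
        sample = (state, action, reward, next_state)
        samples.append(sample)  # saved for later source-task selection
        if update is not None:
            update(sample)      # optional learning update (not modeled here)
        if done:
            return True, samples
        state = next_state
    return False, samples


# Hypothetical corridor task: start at 0, goal at position 3.
GOAL = 3


def reset():
    return 0


def step(state, action):
    next_state = state + action
    done = next_state == GOAL
    return next_state, (1.0 if done else 0.0), done


def always_right(state):
    return 1


solved_small, samples_small = attempt(reset, step, always_right, budget=2)
solved_large, samples_large = attempt(reset, step, always_right, budget=5)
```

With a budget of 2 the attempt fails but still yields two samples; with a budget of 5 it succeeds after three steps.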

slide-15
SLIDE 15

Autonomous Sequencing

Step 2

  • Could not solve target
  • Create source tasks using methods from AAMAS '16

Step 3

  • Attempt to solve each source in β steps
  • Partition sources into solvable / unsolvable

[Figure: steps 1 to 3, with groups labeled Solvable Tasks and Unsolvable Tasks]

slide-16
SLIDE 16

Autonomous Sequencing

Step 4

  • If solvable tasks exist, select the one that updates the policy the most on samples drawn from the target task
  • Assumption
  • Source tasks that can be solved have policies that are relevant to the target task
  • Don't provide negative transfer

[Figure: step 4. Initial policy π0, policies π1 and π2, solvable tasks, and target-task samples s1, s2, s3, s4 … sβ]
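
The selection rule in Step 4 needs a way to measure how much a source task updates the policy on target-task samples. The talk does not spell out the measure on this slide, so the sketch below assumes a simple one: count the sampled target states where the chosen action changes. Policies are plain dictionaries from state to action, and the task names are hypothetical.

```python
def policy_change(old_policy, new_policy, target_states):
    """Assumed measure: number of target states whose action changed."""
    return sum(
        1 for s in target_states if old_policy.get(s) != new_policy.get(s)
    )


def select_source(old_policy, candidates, target_states):
    """Pick the solvable source whose learned policy differs most.

    `candidates` maps each solvable source task to the policy obtained
    after training on it, starting from `old_policy`.
    """
    return max(
        candidates,
        key=lambda name: policy_change(old_policy, candidates[name], target_states),
    )


old_policy = {"s1": "left", "s2": "left", "s3": "left"}
candidates = {
    "keys_only": {"s1": "right", "s2": "left", "s3": "left"},
    "keys_and_lock": {"s1": "right", "s2": "right", "s3": "left"},
}
target_states = ["s1", "s2", "s3", "s4"]  # states sampled in Step 1

best = select_source(old_policy, candidates, target_states)
```

Evaluating the change only on target-task states is what ties the choice to the target: a source that changes the policy elsewhere scores nothing.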

slide-17
SLIDE 17

Autonomous Sequencing

Step 4 (cont.)

  • Add source task to curriculum
  • Return to Step 1
  • (Re-evaluate on target task)
  • Policy has changed, so we will get a new set of samples
  • Samples biased towards agent's current set of experiences
  • This in turn guides selection of source tasks

[Figure: new policy π1 and target-task samples s1, s2, s3, s4 … sβ]

slide-18
SLIDE 18

Autonomous Sequencing

Step 5

  • No sources solvable
  • Sort tasks by sample relevance
  • Compare states experienced in the target task with those experienced in the sources
  • Recursively create sub-source tasks
  • Return to Step 2 with the current source task as the target task

[Figure: step 5. Sample sets s1, s2, s3 … sβ; s4, s5, s6 … sβ; and s1, s2, s3 … sβ, with groups labeled Solvable Tasks and Unsolvable Tasks]
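
Sorting by sample relevance requires a score that compares the states seen in the target task with those seen in each source. As a sketch under an assumed measure, the code below uses the fraction of target states that also appear in the source's samples. The talk does not give the exact formula here, and the source names are hypothetical.

```python
def sample_relevance(target_states, source_states):
    """Assumed measure: share of target states also seen in the source."""
    target = set(target_states)
    if not target:
        return 0.0
    return len(target & set(source_states)) / len(target)


def sort_by_relevance(target_states, source_samples):
    """Order source tasks from most to least relevant to the target."""
    return sorted(
        source_samples,
        key=lambda name: sample_relevance(target_states, source_samples[name]),
        reverse=True,
    )


target_states = ["s1", "s2", "s3", "s4"]
source_samples = {
    "source_b": ["s4", "s5", "s6"],
    "source_a": ["s1", "s2", "s3"],
}

ranking = sort_by_relevance(target_states, source_samples)
```

The most relevant unsolvable source is the first one to be treated as a new target when the algorithm returns to Step 2.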

slide-19
SLIDE 19

Autonomous Sequencing

Step 6

  • No sources usable after exhausting the tree
  • Increase budget, return to Step 1
  • Learning can be cached, so agent can pick up where it left off

[Figure: algorithm diagram with steps 1 to 6 and groups labeled Solvable Tasks and Unsolvable Tasks]
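
Putting the six steps together, the control flow can be sketched roughly as follows. This is not the authors' implementation. The hooks (`attempt`, `create_sources`, `policy_change`, `relevance`), the doubling budget schedule, and the rule of ignoring solvable sources that leave the policy unchanged are all assumptions, and caching of learning between budgets is not modeled. The toy domain treats a task as an integer difficulty and a policy as an integer skill level.

```python
def build_curriculum(target, policy, budget, attempt, create_sources,
                     policy_change, relevance, curriculum=None):
    """Return (policy, curriculum) once `target` is solved, else None."""
    curriculum = [] if curriculum is None else list(curriculum)
    while True:
        # Step 1: try the target within the budget and keep the samples.
        solved, learned, samples = attempt(target, policy, budget)
        if solved:
            return learned, curriculum + [target]
        # Steps 2-3: create source tasks and partition them.
        solvable, unsolvable = {}, []
        for source in create_sources(target):
            ok, source_policy, _ = attempt(source, policy, budget)
            if ok:
                solvable[source] = source_policy
            else:
                unsolvable.append(source)
        # Step 4: take the source that changes the policy most on target samples.
        scores = {s: policy_change(policy, p, samples) for s, p in solvable.items()}
        useful = [s for s in scores if scores[s] > 0]  # assumed progress guard
        if useful:
            best = max(useful, key=scores.get)
            curriculum.append(best)
            policy = solvable[best]
            continue  # back to Step 1
        # Step 5: recurse into unsolvable sources, most relevant first.
        ranked = sorted(unsolvable, key=lambda s: relevance(s, samples), reverse=True)
        for source in ranked:
            result = build_curriculum(source, policy, budget, attempt, create_sources,
                                      policy_change, relevance, curriculum)
            if result is not None:
                policy, curriculum = result
                break
        else:
            return None  # tree exhausted at this budget


def sequence_tasks(target, policy, budget, max_budget, **hooks):
    """Step 6: retry with a larger budget (positive, doubled each time)."""
    while budget <= max_budget:
        result = build_curriculum(target, policy, budget, **hooks)
        if result is not None:
            return result[1], budget
        budget *= 2
    return None, budget


# Toy hooks: a task is solvable if its difficulty exceeds skill by <= budget.
def toy_attempt(task, skill, budget):
    solved = task - skill <= budget
    return solved, (max(skill, task) if solved else skill), [task]


def toy_sources(task):
    return [t for t in (task - 1, task - 2) if t >= 1]


hooks = dict(
    attempt=toy_attempt,
    create_sources=toy_sources,
    policy_change=lambda old, new, samples: new - old,
    relevance=lambda source, samples: -abs(samples[0] - source),
)
curriculum, used_budget = sequence_tasks(5, 0, 1, 8, **hooks)
```

In the toy run, a target of difficulty 5 cannot be solved with a budget of 1, so the recursion descends to easier tasks and builds the curriculum from the bottom up without ever raising the budget.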

slide-20
SLIDE 20

Connection to CMDPs

  • An optimal path in the CMDP is one that reaches π_f with least cost
  • Selection in Step 4 picks tasks that update most towards π_f
  • Learning budget minimizes cost
  • Algorithm behaves greedily to balance updates and cost

[Figure: curriculum MDP graph over policies π0 to πf, tasks M1 to M4, and rewards Ri,j, shown next to the algorithm diagram with steps 1 to 6]

slide-21
SLIDE 21

Experimental Setup

  • Grid world domain presented previously

Create multiple agents

  • Multiple agents show that the algorithm is not dependent on the implementation of the RL agent
  • Evaluate whether different agents benefit from individualized curricula

slide-22
SLIDE 22

Experimental Setup

Agent Types

  • Basic Agent
  • State: sensors on 4 sides that measure distance to keys, locks, etc.
  • Actions: move in 4 directions, pick up key, unlock lock
  • Action-dependent Agent
  • State difference: weights on features are shared over 4 directions
  • Rope Agent
  • Action difference: like basic, but can use rope action to negate a pit

slide-23
SLIDE 23

Basic Agent Results

slide-24
SLIDE 24

Action-Dependent Agent Results

slide-25
SLIDE 25

Rope Agent Results

slide-26
SLIDE 26

Summary

  • Presented a novel formulation of curriculum generation as an MDP
  • Proposed an algorithm to approximate a trace in this MDP
  • Demonstrated that the proposed method can create curricula tailored to the sensing and action capabilities of agents

[Figure: curriculum MDP graph over policies π0 to πf, tasks M1 to M4, and rewards Ri,j, shown next to the algorithm diagram with steps 1 to 6]