Autonomous Task Sequencing for Customized Curriculum Design in Reinforcement Learning



  1. Autonomous Task Sequencing for Customized Curriculum Design in Reinforcement Learning
     Sanmit Narvekar, Jivko Sinapov, and Peter Stone
     Department of Computer Science, University of Texas at Austin
     {sanmit, jsinapov, pstone}@cs.utexas.edu

  2. Successes of Reinforcement Learning
     • Approaching or passing human-level performance
     • BUT can take millions of episodes! People learn this MUCH faster

  3. People Learn via Curricula
     • People are able to learn many complex tasks very efficiently

  4. Example: Quick Chess
     • Quickly learn the fundamentals of chess
     • 5 x 6 board
     • Fewer pieces per type
     • No castling
     • No en passant

  5. Example: Quick Chess [figure]

  6. Task Space
     [Figure: task space with labels Pawns + King, Pawns only, Target task, Empty task, One piece per type]
     • Quick Chess is a curriculum designed for people
     • We want to do something similar automatically for autonomous agents

  7. Curriculum Learning
     [Figure: Task = MDP (Agent, Environment, State, Action, Reward); components: Task Creation (presented at AAMAS '16), Transfer Learning (via Value Function Transfer), Sequencing]
     • Curriculum learning is a complex problem that ties together task creation, sequencing, and transfer learning

  8. Autonomous Task Sequencing

  9. Sequencing as an MDP
     [Figure: curriculum MDP graph. Nodes are policies π_0 … π_5 and π_f; edges are labeled with tasks M_1 … M_4 and rewards R_{i,j}]
     • State space S^C: all policies π_i an agent can represent
     • Action space A^C: different tasks M_j an agent can train on
     • Transition function p^C(s^C, a^C): learning task a^C transforms an agent's policy s^C
     • Reward function r^C(s^C, a^C): cost in time steps to learn task a^C given policy s^C
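
The four components above can be read as a small interface. The following is a minimal Python sketch under stated assumptions: `CurriculumMDP`, `toy_learn`, and the "set of learned tasks" policy representation are hypothetical stand-ins rather than the authors' implementation, and the reward is taken to be the negative step cost.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Sequence, Tuple

Policy = FrozenSet[str]  # stand-in for an agent policy pi_i (assumption)
Task = str               # stand-in for a task M_j (assumption)


@dataclass
class CurriculumMDP:
    tasks: Sequence[Task]                                # action space A^C
    learn: Callable[[Policy, Task], Tuple[Policy, int]]  # returns (new policy, steps used)

    def step(self, policy: Policy, task: Task) -> Tuple[Policy, int]:
        """Apply p^C and r^C: train on `task` starting from `policy`."""
        new_policy, steps = self.learn(policy, task)
        return new_policy, -steps  # reward = negative cost in time steps (assumption)


def toy_learn(policy: Policy, task: Task) -> Tuple[Policy, int]:
    """Hypothetical learner: an unseen task costs 10 steps, a known one costs 0."""
    if task in policy:
        return policy, 0
    return policy | {task}, 10


cmdp = CurriculumMDP(tasks=["M1", "M2"], learn=toy_learn)
pi_0: Policy = frozenset()
pi_1, reward = cmdp.step(pi_0, "M1")
```

In a real setting `learn` would run an RL agent on the task, which is why a single CMDP action is expensive.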

  10. Sequencing as an MDP
     [Figure: curriculum MDP graph over policies π_0 … π_5, π_f and tasks M_1 … M_4]
     • A policy π^C : S^C → A^C on this curriculum MDP (CMDP) specifies which task to train on given the learning agent's policy π_i
     • Learning the full policy π^C can be difficult!
       • Taking an action requires solving a full task MDP
       • Transitions are not deterministic

  11. Sequencing as an MDP
     [Figure: curriculum MDP graph over policies π_0 … π_5, π_f and tasks M_1 … M_4, with the target task labeled]
     • Instead, find one trace/execution in the CMDP of π^C*
     • Main idea: leverage the fact that we know the target task, and therefore what is relevant for the final state policy π_f, to guide selection of tasks

  12. Autonomous Sequencing
     Target Task
     • Grid world domain
     • Objectives
       • Navigate the world
       • Pick up keys
       • Unlock locks
       • Avoid pits

  13. Autonomous Sequencing
     [Figure: algorithm diagram with steps 1 to 6 and the Solvable Tasks / Unsolvable Tasks partition]
     • Recursive algorithm (6 steps)
     • Each iteration adds a source task to the curriculum
     • This in turn updates the policy
     • Terminates when performance on the target task is greater than the desired performance threshold

  14. Autonomous Sequencing
     Step 1 (target task)
     • Assume learning budget β
     • Attempt to solve the target task directly in β steps. Save samples
     • Solvable?
       • Target task easy to learn
       • Started with policy that made it easy to learn. Done
     • Goal: incrementally learn subtasks to build a policy that can learn the target task
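
Step 1 can be pictured as a budget-limited attempt that records every sample. The sketch below is a simplification under stated assumptions: it runs a fixed policy with no learning update, treats "solved" as reaching the goal within the budget (the slides define success by a performance threshold), and uses a hypothetical 1-D corridor; `attempt_task`, `corridor_step`, and `always_right` are illustrative names.

```python
from typing import Callable, List, Tuple

Sample = Tuple[int, int, float, int]  # (state, action, reward, next_state)


def attempt_task(
    start_state: int,
    step: Callable[[int, int], Tuple[int, float, bool]],
    policy: Callable[[int], int],
    budget: int,
) -> Tuple[bool, List[Sample]]:
    """Act for at most `budget` steps, saving every sample along the way."""
    state = start_state
    samples: List[Sample] = []
    for _ in range(budget):
        action = policy(state)
        next_state, reward, done = step(state, action)
        samples.append((state, action, reward, next_state))
        if done:
            return True, samples  # goal reached within the budget
        state = next_state
    return False, samples  # budget exhausted; samples are still kept


def corridor_step(state: int, action: int) -> Tuple[int, float, bool]:
    """Hypothetical 1-D corridor whose goal is cell 5."""
    next_state = state + action
    done = next_state == 5
    return next_state, (1.0 if done else 0.0), done


def always_right(state: int) -> int:
    return 1


solved_small, samples_small = attempt_task(0, corridor_step, always_right, budget=3)
solved_large, samples_large = attempt_task(0, corridor_step, always_right, budget=10)
```

The samples from the failed attempt are the ones later steps would use to judge which source tasks are relevant.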

  15. Autonomous Sequencing
     Step 2
     • Could not solve target
     • Create source tasks using methods from AAMAS '16
     Step 3
     • Attempt to solve each source in β steps
     • Partition sources into solvable / unsolvable
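
The partition in Step 3 can be sketched as follows. This is an illustration only: `partition_sources` and `toy_try_solve` are hypothetical names, and the step counts in `STEPS_NEEDED` are made-up numbers standing in for actually training an agent on each source task.

```python
from typing import Callable, Iterable, List, Tuple


def partition_sources(
    sources: Iterable[str],
    try_solve: Callable[[str, int], bool],
    budget: int,
) -> Tuple[List[str], List[str]]:
    """Split source tasks by whether each can be solved within `budget` steps."""
    solvable: List[str] = []
    unsolvable: List[str] = []
    for task in sources:
        if try_solve(task, budget):
            solvable.append(task)
        else:
            unsolvable.append(task)
    return solvable, unsolvable


# Hypothetical number of steps each source task needs (made-up values).
STEPS_NEEDED = {"empty_room": 50, "one_key": 400, "key_and_lock": 2500}


def toy_try_solve(task: str, budget: int) -> bool:
    return STEPS_NEEDED[task] <= budget


solvable, unsolvable = partition_sources(STEPS_NEEDED, toy_try_solve, budget=500)
```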

  16. Autonomous Sequencing
     Step 4
     [Figure: initial policy π_0 and target-task samples [s_1, s_2, s_3, s_4, …, s_β], compared with policies π_1 and π_2 learned on the solvable tasks]
     • If solvable tasks exist, select the one that updates the policy the most on samples drawn from the target task
     • Assumption
       • Source tasks that can be solved have policies that are relevant to the target task
       • Don't provide negative transfer

  17. Autonomous Sequencing
     Step 4 (cont.)
     [Figure: new policy π_1 and target-task samples [s_1, s_2, s_3, s_4, …, s_β]]
     • Add source task to curriculum
     • Return to Step 1 (re-evaluate on target task)
       • Policy has changed, so we will get a new set of samples
       • Samples are biased towards the agent's current set of experiences
       • This in turn guides selection of source tasks
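
One way to read the Step 4 selection rule is sketched below. The metric is an assumption: the slides say only that the chosen task "updates the policy the most" on target-task samples, and here that is approximated by counting sample states where the greedy action changes. `policy_change`, `select_source`, and the toy policies are hypothetical.

```python
from typing import Callable, Dict, Hashable, Sequence

Policy = Callable[[Hashable], Hashable]  # maps a state to an action


def policy_change(old: Policy, new: Policy, samples: Sequence[Hashable]) -> int:
    """Count sample states where the chosen action differs (assumed metric)."""
    return sum(1 for state in samples if old(state) != new(state))


def select_source(
    current: Policy, learned: Dict[str, Policy], samples: Sequence[Hashable]
) -> str:
    """Pick the solvable source whose learned policy differs most from `current`."""
    return max(learned, key=lambda task: policy_change(current, learned[task], samples))


def pi_0(state: Hashable) -> str:
    return "left"


target_samples = [0, 1, 2, 3]  # states sampled from the target task
learned = {
    "source_a": lambda s: "right" if s >= 3 else "left",  # differs on 1 sample
    "source_b": lambda s: "right" if s >= 1 else "left",  # differs on 3 samples
}
chosen = select_source(pi_0, learned, target_samples)
```

Because the samples are redrawn after every policy update, the same candidate set can rank differently on the next pass.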

  18. Autonomous Sequencing
     Step 5
     [Figure: target-task samples [s_1, s_2, s_3, …, s_β] compared with source-task samples, e.g. [s_4, s_5, s_6, …, s_β] and [s_1, s_2, s_3, …, s_β]; Solvable Tasks / Unsolvable Tasks partition]
     • No sources solvable
     • Sort tasks by sample relevance
       • Compare states experienced in the target task with those experienced in the sources
     • Recursively create sub-source tasks
       • Return to Step 2 with the current source task as the target task
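
The relevance ordering in Step 5 can be sketched with a simple state-overlap score. The score is an assumption (the slides say only that states experienced in the target are compared with those experienced in the sources), and `sample_relevance`, `sort_by_relevance`, and the grid coordinates are hypothetical.

```python
from typing import Dict, Hashable, Iterable, List


def sample_relevance(
    target_states: Iterable[Hashable], source_states: Iterable[Hashable]
) -> float:
    """Fraction of distinct target-task states also seen in the source (assumed metric)."""
    target = set(target_states)
    if not target:
        return 0.0
    return len(target & set(source_states)) / len(target)


def sort_by_relevance(
    target_states: Iterable[Hashable], sources: Dict[str, Iterable[Hashable]]
) -> List[str]:
    """Order source tasks from most to least relevant to the target samples."""
    target = list(target_states)  # materialize once so it can be reused
    return sorted(
        sources,
        key=lambda name: sample_relevance(target, sources[name]),
        reverse=True,
    )


target = [(0, 0), (0, 1), (1, 1), (2, 1)]
sources = {
    "far_corner": [(5, 5), (5, 4)],
    "near_key": [(0, 0), (0, 1), (1, 1)],
    "hallway": [(0, 1), (9, 9)],
}
ranking = sort_by_relevance(target, sources)
```

The most relevant unsolvable source would then become the target of the recursive call back into Step 2.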

  19. Autonomous Sequencing
     Step 6
     [Figure: algorithm diagram with steps 1 to 6 and the Solvable Tasks / Unsolvable Tasks partition]
     • No sources usable after exhausting the tree
     • Increase budget, return to Step 1
     • Learning can be cached, so the agent can pick up where it left off
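
The effect of caching in Step 6 can be illustrated with a toy learner that remembers how many steps it has already spent on a task. Everything here is an assumption for illustration: `CachedLearner`, the doubling schedule (`growth=2`), and the step counts are hypothetical, and the sketch retries a single task rather than re-running the full procedure from Step 1.

```python
from typing import Dict, Optional


class CachedLearner:
    def __init__(self, steps_needed: Dict[str, int]) -> None:
        self.steps_needed = steps_needed    # hypothetical: steps each task requires
        self.progress: Dict[str, int] = {}  # cached steps already spent per task
        self.total_steps = 0                # steps actually executed overall

    def attempt(self, task: str, budget: int) -> bool:
        """Train up to `budget` total steps on `task`, resuming from the cache."""
        done = self.progress.get(task, 0)
        goal = min(budget, self.steps_needed[task])
        extra = max(0, goal - done)  # only the steps not yet performed
        self.progress[task] = done + extra
        self.total_steps += extra
        return self.progress[task] >= self.steps_needed[task]


def solve_with_growing_budget(
    learner: CachedLearner,
    task: str,
    budget: int,
    growth: int = 2,
    max_budget: int = 10_000,
) -> Optional[int]:
    """Increase the budget until the task is solved or `max_budget` is exceeded."""
    while budget <= max_budget:
        if learner.attempt(task, budget):
            return budget
        budget *= growth
    return None


learner = CachedLearner({"target": 35})
final_budget = solve_with_growing_budget(learner, "target", budget=10)
```

In this toy run the budgets tried are 10, 20, and 40, but only 35 steps are executed in total; without the cache the same schedule would spend 10 + 20 + 35 = 65.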

  20. Connection to CMDPs
     [Figures: algorithm diagram with steps 1 to 6; curriculum MDP graph over policies π_0 … π_5, π_f and tasks M_1 … M_4]
     • An optimal path in the CMDP is one that reaches π_f with least cost
     • Selection in Step 4 picks tasks that update most towards π_f
     • Learning budget minimizes cost
     • Algorithm behaves greedily to balance updates and cost
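
To make "reaches π_f with least cost" concrete, the sketch below finds the cheapest trace in a small CMDP graph with a uniform-cost search. It is a simplification under stated assumptions: transitions are treated as deterministic (the slides note they are not), the graph and its costs are made-up numbers, and `least_cost_trace` is a hypothetical helper, not the greedy algorithm from the slides.

```python
import heapq
from typing import Dict, List, Optional, Tuple

# policy -> list of (task, next_policy, cost in steps); all values are hypothetical
Edges = Dict[str, List[Tuple[str, str, int]]]


def least_cost_trace(
    edges: Edges, start: str, goal: str
) -> Optional[Tuple[int, List[str]]]:
    """Uniform-cost search for the cheapest task sequence from `start` to `goal`."""
    frontier: List[Tuple[int, str, List[str]]] = [(0, start, [])]
    best: Dict[str, int] = {}
    while frontier:
        cost, policy, tasks = heapq.heappop(frontier)
        if policy == goal:
            return cost, tasks
        if policy in best and best[policy] <= cost:
            continue  # already expanded this policy at equal or lower cost
        best[policy] = cost
        for task, next_policy, step_cost in edges.get(policy, []):
            heapq.heappush(frontier, (cost + step_cost, next_policy, tasks + [task]))
    return None


edges: Edges = {
    "pi_0": [("M1", "pi_1", 30), ("M3", "pi_f", 100), ("M2", "pi_2", 20)],
    "pi_1": [("M3", "pi_f", 40)],
    "pi_2": [("M3", "pi_f", 60)],
}
result = least_cost_trace(edges, "pi_0", "pi_f")
```

Here training on M1 before the target M3 costs 70 steps in total, cheaper than learning M3 directly (100) or going through M2 (80). The full graph is not available in practice, which is why the slides describe a greedy procedure instead.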

  21. Experimental Setup
     • Grid world domain presented previously
     Create multiple agents
     • Using multiple agents shows that the algorithm does not depend on the implementation of the RL agent
     • Evaluate whether different agents benefit from individualized curricula

  22. Experimental Setup
     Agent Types
     • Basic Agent
       • State: sensors on 4 sides that measure distance to keys, locks, etc.
       • Actions: move in 4 directions, pick up key, unlock lock
     • Action-dependent Agent
       • State difference: weights on features are shared over 4 directions
     • Rope Agent
       • Action difference: like basic, but can use rope action to negate a pit

  23. Basic Agent Results

  24. Action-Dependent Agent Results

  25. Rope Agent Results

  26. Summary
     [Figures: curriculum MDP graph over policies π_0 … π_5, π_f and tasks M_1 … M_4; algorithm diagram with steps 1 to 6]
     • Presented a novel formulation of curriculum generation as an MDP
     • Proposed an algorithm to approximate a trace in this MDP
     • Demonstrated that the proposed method can create curricula tailored to the sensing and action capabilities of agents
