Learning Curriculum Policies for Reinforcement Learning




  1. Learning Curriculum Policies for Reinforcement Learning
  Sanmit Narvekar and Peter Stone
  Department of Computer Science, University of Texas at Austin
  {sanmit, pstone}@cs.utexas.edu

  2. Successes of Reinforcement Learning
  • Approaching or passing human-level performance
  • BUT it can take millions of episodes! People learn this MUCH faster

  3. People Learn via Curricula
  • People are able to learn a lot of complex tasks very efficiently

  4. Example: Quick Chess
  • Quickly learn the fundamentals of chess
  • 5 x 6 board
  • Fewer pieces per type
  • No castling
  • No en passant

  5. Example: Quick Chess

  6. Task Space
  [Figure: task space from the empty task, through pawns only, pawns + king, and one piece per type, to the target task]
  • Quick Chess is a curriculum designed for people
  • We want to do something similar automatically for autonomous agents

  7. Curriculum Learning
  [Diagram: a task is an MDP (agent-environment loop of state, action, reward); components: task creation (assumed given), transfer learning (this work uses 2 types), and sequencing]
  • Curriculum learning is a complex problem that ties together task creation, sequencing, and transfer learning

  8. Value Function Transfer
  • Initialize the Q function in the target task using values learned in a source task, Q_source(s, a)
  • Assumptions:
    • Tasks have overlapping state and action spaces
    • OR an inter-task mapping is provided
  • Existing related work on learning mappings
  Image credit: Taylor and Stone, JMLR 2009
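A minimal sketch of this transfer step for a tabular base agent; the names `q_source`, `target_state_actions`, and `mapping` are illustrative and not from the paper:

```python
from collections import defaultdict

def transfer_value_function(q_source, target_state_actions, mapping=None):
    """Initialize the target task's Q-table from values learned on a source task.

    q_source:             dict {(state, action): value} learned on the source task.
    target_state_actions: iterable of (state, action) pairs in the target task.
    mapping:              optional inter-task mapping from a target (state, action)
                          pair to its source counterpart; identity when the tasks
                          have overlapping state and action spaces.
    """
    q_target = defaultdict(float)
    for sa in target_state_actions:
        source_sa = mapping(sa) if mapping is not None else sa
        q_target[sa] = q_source.get(source_sa, 0.0)  # unseen pairs start at 0
    return q_target
```

The base learner then continues ordinary learning in the target task from this warm start.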

  9. Reward Shaping Transfer
  • The reward function in the target task is augmented with a shaping reward f:
    r'(s, a, s') = r(s, a, s') + f(s, s')   (new reward = old reward + shaping reward)
  • Potential-based advice restricts f to be a difference of potential functions:
    f(s, s') = γ Φ(s') − Φ(s)
  • Use the value function of the source task as the potential function:
    Φ(s) = V_source(s)
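A corresponding sketch of the shaped reward, under the assumption of a state-based potential Φ(s) = V_source(s) stored as a dictionary; `v_source` and `shaped_reward` are illustrative names:

```python
def shaped_reward(r, s, s_next, v_source, gamma=0.99):
    """Potential-based shaping: f(s, s') = gamma * phi(s') - phi(s),
    with the source task's value function used as the potential phi."""
    phi = lambda state: v_source.get(state, 0.0)  # zero potential for unseen states
    return r + gamma * phi(s_next) - phi(s)
```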

  10. The Problem: Autonomous Sequencing
  • Existing work is heuristic-based, e.g. examining performance on the target task and using heuristics to select the next task
  • In this work, we use learning to do sequencing

  11. Sequencing as an MDP
  [Diagram: a curriculum agent interacts with a curriculum task. Each curriculum action selects one of tasks 1..N, each a standard environment-RL agent loop (state, action, reward); in return the curriculum agent receives a curriculum state and curriculum reward]

  12. Sequencing as an MDP
  [Diagram: CMDP graph over base-agent policies π_0 ... π_f, connected by tasks M_j with rewards R_{i,j}]
  • State space S_C: all policies π_i an agent can represent
  • Action space A_C: different tasks M_j an agent can train on
  • Transition function p_C(s_C, a_C): learning task a_C transforms an agent's policy s_C
  • Reward function r_C(s_C, a_C): cost in time steps to learn task a_C given policy s_C
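A rough sketch of this CMDP as an environment, assuming a hypothetical `base_agent` object that exposes its parameters (`weights`), a `train(task)` method returning the time steps used, and a `solves(task)` check; none of these names come from the paper:

```python
class CurriculumMDP:
    """Curriculum MDP: states are base-agent policies, actions are source tasks,
    and rewards are the (negative) time-step cost of training on a task."""

    def __init__(self, base_agent, tasks, target_task):
        self.base_agent = base_agent
        self.tasks = tasks              # action space A_C: tasks to train on
        self.target_task = target_task

    def state(self):
        # CMDP state s_C: the base agent's current policy, via its parameters
        return tuple(self.base_agent.weights)

    def step(self, task_index):
        # CMDP action a_C: train the base agent on the selected task
        steps_used = self.base_agent.train(self.tasks[task_index])
        reward = -steps_used            # r_C: cost in time steps
        done = self.base_agent.solves(self.target_task)
        return self.state(), reward, done
```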

  13. Sequencing as an MDP
  [Diagram: same CMDP graph as the previous slide]
  • A policy π_C : S_C → A_C on this curriculum MDP (CMDP) specifies which task to train on given learning agent policy π_i
  • Essentially training a teacher
  • How to do learning over the CMDP?
  • How does the CMDP change when the transfer method changes?
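To make "training a teacher" concrete, here is a tabular ε-greedy Q-learning loop over the `CurriculumMDP` sketch above; the slides note that learning over the CMDP actually relies on features of the base agent's policy with function approximation, so this tabular version is only illustrative:

```python
import random
from collections import defaultdict

def learn_curriculum_policy(make_cmdp, num_episodes=100, epsilon=0.1,
                            alpha=0.5, gamma=1.0):
    """Learn a curriculum policy pi_C: CMDP state -> task index.

    make_cmdp: factory returning a fresh CurriculumMDP (with a freshly
               initialized base agent) for each curriculum episode.
    """
    q = defaultdict(float)                       # Q over (cmdp_state, task) pairs
    for _ in range(num_episodes):
        cmdp = make_cmdp()
        s, done = cmdp.state(), False
        while not done:
            tasks = range(len(cmdp.tasks))
            if random.random() < epsilon:
                a = random.choice(list(tasks))           # explore a task
            else:
                a = max(tasks, key=lambda t: q[(s, t)])  # greedy task choice
            s_next, r, done = cmdp.step(a)
            best_next = 0.0 if done else max(q[(s_next, t)] for t in tasks)
            q[(s, a)] += alpha * (r + gamma * best_next - q[(s, a)])
            s = s_next
    return q
```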

  14. Learning in Curriculum MDPs
  [Diagram: pipeline from the CMDP graph: extract raw CMDP state variables (the base agent's weight vectors), extract features, then apply function approximation and learning]
  • Express the raw CMDP state using the weights of the base agent's value function / policy
  • Extract features so that similar policies (CMDP states) are "close" in feature space

  15. Example: Discrete Representations

            CMDP State 1           CMDP State 2           CMDP State 3
            Left  Right  Policy    Left  Right  Policy    Left  Right  Policy
  State 1   0.3   0.7    →         0.2   0.8    →         0.7   0.3    ←
  State 2   0.1   0.9    →         0.2   0.8    →         0.9   0.1    ←
  State 3   0.4   0.6    →         0.2   0.8    →         0.6   0.4    ←
  State 4   0.0   1.0    →         0.3   0.7    →         0.0   1.0    →

  • CMDP states 1 and 2 encode very similar policies, and should be close in CMDP representation space

  16. Example: Discrete Representations
  [Figure: separate 2D tilings for State 1 and State 2, each over normalized Q(State, Left) vs. normalized Q(State, Right)]
  • One approach: use tile coding
  • Create a separate tiling on a state-by-state level
  • When comparing CMDP states, the more similar the policies are in a primitive state, the more common tiles will be activated
  • Each primitive state contributes equally towards the similarity of the CMDP state
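A simplified sketch of this per-state feature construction for a tabular base agent: each primitive state's action values are normalized and mapped into that state's own grid of tiles, so CMDP states with similar per-state policies share active tiles. A single grid per state stands in for a full set of offset tilings, and all names are illustrative:

```python
import numpy as np

def cmdp_features(q_table, states, actions, bins_per_dim=4):
    """Binary CMDP feature vector: one grid tiling per primitive state,
    laid over that state's normalized action values."""
    blocks = []
    n_cells = bins_per_dim ** len(actions)
    for s in states:
        qs = np.array([q_table.get((s, a), 0.0) for a in actions])
        span = qs.max() - qs.min()
        normed = (qs - qs.min()) / span if span > 0 else np.zeros_like(qs)
        # Locate the grid cell this state's normalized action values fall into.
        idx = np.minimum((normed * bins_per_dim).astype(int), bins_per_dim - 1)
        flat = 0
        for i in idx:                        # row-major cell index within the grid
            flat = flat * bins_per_dim + int(i)
        block = np.zeros(n_cells)
        block[flat] = 1.0                    # one active tile per primitive state
        blocks.append(block)
    return np.concatenate(blocks)
```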

  17. Continuous CMDP Representations
  • In continuous domains, weights are not local to a state
  • Needs to be done separately for each domain
    • Neural networks
    • Tile coding
    • Etc.
  • If the base agent uses a linear function approximator, one can use tile coding over the parameters as before

  18. Changes in Transfer Algorithm
  [Diagram: same CMDP graph over policies π_i, tasks M_j, and rewards R_{i,j}]
  • The transfer method directly affects the CMDP state representation and transition function
  • CMDP states represent "states of knowledge," where knowledge is represented as a value function, shaping reward, etc.
  • A similar process can be done whenever the knowledge is parameterizable

  19. Experimental Results
  • Evaluate whether curriculum policies can be learned
  • Grid world
    • Multiple base agents
    • Multiple CMDP state representations
  • Pacman
    • Multiple transfer learning algorithms
    • How long to train on sources?

  20. Grid World Setup
  Agent Types
  • Basic Agent
    • State: sensors on 4 sides that measure distance to keys, locks, etc.
    • Actions: move in 4 directions, pick up key, unlock lock
  • Action-dependent Agent
    • State difference: weights on features are shared over the 4 directions
  • Rope Agent
    • Action difference: like basic, but can use a rope action to negate a pit
  CMDP Representations
  • Finite State Representation
    • For discrete domains, groups and normalizes raw weights state-by-state to form CMDP features
  • Continuous State Representation
    • Directly uses the raw weights of the learning agent as features for the CMDP agent

  21. Basic Agent Results

  22. Action-Dependent Agent Results

  23. Rope Agent Results

  24. Pacman Setup
  Agent Representation
  • Action-dependent egocentric features
  CMDP Representation
  • Continuous State Representation: directly uses the raw weights of the learning agent as features for the CMDP agent
  Transfer Methods
  • Value Function Transfer
  • Reward Shaping Transfer
  How long to train on a source task?

  25. Pacman Value Function Transfer
  [Plot: cost to learn the target task vs. CMDP episodes, comparing no curriculum, the continuous state representation, and naive length-1 and length-2 representations]

  26. Pacman Reward Shaping Transfer
  [Plot: cost to learn the target task vs. CMDP episodes, comparing no curriculum, Svetlik et al. (2017), the continuous state representation, and a naive length-2 representation]

  27. How Long to Train?
  [Plot: cost to learn the target task vs. CMDP episodes for reward shaping and value function transfer, each with return-based and small fixed training durations]

  28. Related Work
  Restrictions on source tasks
  • Florensa et al. 2018, Riedmiller et al. 2018, Sukhbaatar et al. 2017
  Heuristic-based sequencing
  • Da Silva et al. 2018, Svetlik et al. 2017
  MDP/POMDP-based sequencing
  • Matiisen et al. 2017, Narvekar et al. 2017
  CL for supervised learning
  • Bengio et al. 2009, Fan et al. 2018, Graves et al. 2017
