Learning Curriculum Policies for Reinforcement Learning - PowerPoint PPT Presentation

SLIDE 1

Learning Curriculum Policies for Reinforcement Learning

Sanmit Narvekar and Peter Stone
Department of Computer Science
University of Texas at Austin
{sanmit, pstone}@cs.utexas.edu

SLIDE 2

Successes of Reinforcement Learning


Approaching or passing human-level performance. BUT it can take millions of episodes! People learn this MUCH faster.

SLIDE 3

People Learn via Curricula


People are able to learn many complex tasks very efficiently

SLIDE 4

Example: Quick Chess

  • Quickly learn the fundamentals of chess
  • 5 x 6 board
  • Fewer pieces per type
  • No castling
  • No en passant

SLIDE 5

Example: Quick Chess

SLIDE 6

Task Space

Empty task → Pawns only → Pawns + King → One piece per type → Target task

  • Quick Chess is a curriculum designed for people
  • We want to do something similar automatically for autonomous agents

SLIDE 7

Curriculum Learning


  • Curriculum learning is a complex problem that ties together task creation, sequencing, and transfer learning

Task Creation | Sequencing | Transfer Learning


[Diagram: the agent-environment RL loop (state, action, reward). Each task is an MDP; task creation is assumed given, and this work considers 2 types of transfer.]

SLIDE 8

Value Function Transfer

  • Initialize the Q function in the target task using the values Qsource(s, a) learned in a source task
  • Assumptions:
    • Tasks have overlapping state and action spaces
    • OR an inter-task mapping is provided
    • Existing related work on learning mappings

Image credit: Taylor and Stone, JMLR 2009
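A minimal sketch of this kind of transfer for a tabular learner, assuming the Q function is stored as a dict; the `mapping` argument stands in for an inter-task mapping, and all names here are illustrative rather than from the talk:

```python
from collections import defaultdict

def transfer_q_values(q_source, target_states, target_actions, mapping=None):
    """Initialize the target task's Q function from source-task values.

    `mapping` sends a target (state, action) pair to its source-task
    counterpart; the identity default covers tasks whose state and
    action spaces overlap directly.
    """
    if mapping is None:
        mapping = lambda s, a: (s, a)   # identity inter-task mapping
    q_target = defaultdict(float)       # pairs with no source value start at 0
    for s in target_states:
        for a in target_actions:
            src = mapping(s, a)
            if src in q_source:
                q_target[(s, a)] = q_source[src]
    return q_target
```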

SLIDE 9

Reward Shaping Transfer

  • The reward function in the target task is augmented with a shaping reward f:

    r'(s, a, s') = r(s, a, s') + f(s, a, s')
    (new reward = old reward + shaping reward)

  • Potential-based advice restricts f to be a difference of potential functions:

    f(s, a, s') = γ Φ(s') − Φ(s)

  • Use the value function of the source task as the potential function:

    Φ(s) = Vsource(s)
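A minimal sketch of that shaping step, using the state-potential form reconstructed above (Φ(s) = Vsource(s)); `make_shaped_reward` and the toy chain are illustrative, not an interface from the talk:

```python
def make_shaped_reward(reward_fn, v_source, gamma):
    """Wrap a target-task reward with a potential-based shaping term.

    f(s, a, s') = gamma * Phi(s') - Phi(s), with the potential Phi taken
    to be the source task's value function; this form leaves the optimal
    policy of the target task unchanged (Ng et al., 1999).
    """
    def shaped_reward(s, a, s_next):
        f = gamma * v_source(s_next) - v_source(s)
        return reward_fn(s, a, s_next) + f
    return shaped_reward

# Toy usage on a 4-state chain, with Phi from a pretend source task:
v_source = {0: 0.0, 1: 0.5, 2: 0.8, 3: 1.0}.get
base_reward = lambda s, a, s_next: 1.0 if s_next == 3 else 0.0
r_shaped = make_shaped_reward(base_reward, v_source, gamma=0.99)
print(r_shaped(2, "right", 3))  # 1.0 + 0.99 * 1.0 - 0.8 = 1.19
```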

SLIDE 10

The Problem: Autonomous Sequencing


  • Existing work is heuristic-based: e.g., examining performance on the target task and using heuristics to select the next task
  • In this work, we use learning to do the sequencing
SLIDE 11

Sequencing as an MDP


[Diagram: the standard RL loop replicated for Task 1, Task 2, …, Task N, each with its own environment, states, actions, and rewards. Wrapped around them, a Curriculum Agent interacts with a "Curriculum Task": its curriculum actions select which task the RL agent trains on next, yielding curriculum states and curriculum rewards.]

SLIDE 12

Sequencing as an MDP


[Diagram: curriculum MDP over base-agent policies π0, π1, π2, π3, π4, π5, πf; edges are tasks M1-M4 the agent can train on, labeled with rewards Ri,j (e.g., R0,1)]

  • State space SC: all policies πi an agent can represent
  • Action space AC: the different tasks Mj an agent can train on
  • Transition function pC(sC, aC): learning task aC transforms an agent's policy sC
  • Reward function rC(sC, aC): cost in time steps to learn task aC given policy sC
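Read as code, one CMDP step is an entire training run on one task: the curriculum agent picks a task, the base agent trains on it, and the time spent is the (negative) reward. A sketch of one CMDP episode, where `train`, `target_learned`, and `select_task` are hypothetical helpers, not names from the talk:

```python
def run_cmdp_episode(pi, tasks, target, select_task, max_steps=50):
    """One episode in the curriculum MDP.

    s_C: the base agent's current policy `pi`
    a_C: the task chosen by the curriculum policy `select_task`
    p_C: training on the task transforms `pi`
    r_C: negative time steps spent training (a cost)
    """
    total_reward = 0
    for _ in range(max_steps):
        task = select_task(pi, tasks)        # curriculum action a_C
        pi, steps = train(pi, task)          # hypothetical: run base RL on the task
        total_reward -= steps                # r_C: cost in time steps
        if target_learned(pi, target):       # terminal: target task is learned
            break
    return pi, total_reward
```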
SLIDE 13

Sequencing as an MDP



  • A policy πC: SC → AC on this curriculum MDP (CMDP) specifies which task to train on, given the learning agent's policy πi
  • Essentially, we are training a teacher
  • How do we do learning over the CMDP?
  • How does the CMDP change when the transfer method changes?
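One plausible way to answer the first question is ordinary epsilon-greedy Q-learning at the curriculum level, treating each source-task training run as a single CMDP transition. A sketch under that reading; `phi` is a feature extractor over the student's policy (next slide), and `fresh_base_agent`, `train`, `target_learned`, and `FEATURE_DIM` are hypothetical names, since the slides do not pin down this exact learner:

```python
import random

FEATURE_DIM = 8  # illustrative size of phi(pi)

def q_learning_teacher(phi, tasks, target, episodes, alpha=0.1, gamma=1.0, eps=0.1):
    """Learn a curriculum policy pi_C with epsilon-greedy Q-learning.

    Q_C(s_C, a_C) is linear in phi(pi), with one weight vector per task;
    each update uses one (s_C, a_C, r_C, s'_C) tuple, where a "step" is a
    full training run of the base agent on the chosen task.
    """
    w = {task: [0.0] * FEATURE_DIM for task in tasks}

    def q(feats, task):
        return sum(wi * xi for wi, xi in zip(w[task], feats))

    for _ in range(episodes):
        pi = fresh_base_agent()                       # reset the student
        feats = phi(pi)
        done = False
        while not done:
            if random.random() < eps:
                task = random.choice(tasks)
            else:
                task = max(tasks, key=lambda t: q(feats, t))
            pi, steps = train(pi, task)               # one CMDP transition
            reward = -steps                           # cost-to-train reward
            feats_next = phi(pi)
            done = target_learned(pi, target)
            best_next = 0.0 if done else max(q(feats_next, t) for t in tasks)
            td_error = reward + gamma * best_next - q(feats, task)
            w[task] = [wi + alpha * td_error * xi
                       for wi, xi in zip(w[task], feats)]
            feats = feats_next
    return w
```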
SLIDE 14

Learning in Curriculum MDPs



  • Express the raw CMDP state using the weights of the base agent's value function / policy
  • Extract features so that similar policies (CMDP states) are "close" in feature space

[Pipeline: extract raw CMDP state variables → extract features → function approximation and learning]
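A sketch of both steps for a tabular student whose knowledge is a dict-style Q-table; the state-by-state normalization anticipates the finite state representation described later, and the function names are illustrative:

```python
def raw_cmdp_state(q_table, states, actions):
    """Flatten the base agent's Q-values into one vector: the raw CMDP state."""
    return [q_table.get((s, a), 0.0) for s in states for a in actions]

def normalized_features(q_table, states, actions):
    """Normalize Q-values state-by-state, so that CMDP states encoding
    similar *policies* (rather than similar value magnitudes) end up
    close together in feature space."""
    feats = []
    for s in states:
        qs = [q_table.get((s, a), 0.0) for a in actions]
        lo, hi = min(qs), max(qs)
        span = (hi - lo) or 1.0          # guard against all-equal values
        feats.extend((q - lo) / span for q in qs)
    return feats
```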

SLIDE 15

Example: Discrete Representations

[Diagram: a primitive task with four states (State 1 - State 4) and actions Left/Right]

           CMDP State 1         CMDP State 2         CMDP State 3
           Left Right Policy    Left Right Policy    Left Right Policy
State 1    0.3  0.7   →         0.2  0.8   →         0.7  0.3   ←
State 2    0.1  0.9   →         0.2  0.8   →         0.9  0.1   ←
State 3    0.4  0.6   →         0.2  0.8   →         0.6  0.4   ←
State 4    0.0  1.0   →         0.3  0.7   →         0.0  1.0   →

  • CMDP states 1 and 2 encode very similar policies, and should be close in CMDP representation space

SLIDE 16

Example: Discrete Representations

  • One approach: use tile coding
    • Create a separate tiling on a state-by-state level
    • When comparing CMDP states, the more similar the policies are in a primitive state, the more common tiles will be activated
    • Each primitive state contributes equally towards the similarity of the CMDP state

[Figure: a separate tiling per primitive state, over normalized Q(State 1, Left) vs. Q(State 1, Right), normalized Q(State 2, Left) vs. Q(State 2, Right), …]
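A minimal single-tiling sketch of this idea (a real tile coder would overlay several offset tilings), reusing `normalized_features` from the previous sketch; the names and bin count are illustrative:

```python
def active_tile(norm_q, bins=4):
    """Discretize one primitive state's normalized Q-values into a tile index.

    Two CMDP states activate the same tile for a primitive state exactly
    when their normalized action values fall into the same bins.
    """
    return tuple(min(int(q * bins), bins - 1) for q in norm_q)

def cmdp_similarity(feats_a, feats_b, n_actions, bins=4):
    """Count the primitive states whose active tiles match: each primitive
    state contributes equally to the similarity of two CMDP states."""
    shared = 0
    for i in range(0, len(feats_a), n_actions):
        tile_a = active_tile(feats_a[i:i + n_actions], bins)
        tile_b = active_tile(feats_b[i:i + n_actions], bins)
        shared += int(tile_a == tile_b)
    return shared
```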

SLIDE 17

Continuous CMDP Representations

  • In continuous domains, weights are not local to a state
  • Feature extraction needs to be done separately for each domain
    • Neural networks
    • Tile coding
    • Etc.
  • If the base agent uses a linear function approximator, one can use tile coding over the parameters as before

SLIDE 18

Changes in Transfer Algorithm


"0 "3 "1 "2 "5 "4 "f M1 R0,1 M2 M3 M3 M3 M4 M4 M4 R0,3 R0,2 R1,3 R2,4 R3,3 R4,4 R5,4

  • The transfer method directly affects the CMDP state representation and transition function
  • CMDP states represent "states of knowledge," where the knowledge is represented as a value function, shaping reward, etc.
  • A similar process can be applied whenever the knowledge is parameterizable
SLIDE 19

Experimental Results

  • Evaluate whether curriculum policies can be learned
    • Grid world
      • Multiple base agents
      • Multiple CMDP state representations
    • Pacman
      • Multiple transfer learning algorithms
      • How long to train on sources?

SLIDE 20

Grid World Setup

Agent Types

  • Basic Agent
    • State: sensors on 4 sides that measure distance to keys, locks, etc.
    • Actions: move in 4 directions, pick up key, unlock lock
  • Action-Dependent Agent
    • State difference: weights on features are shared over the 4 directions
  • Rope Agent
    • Action difference: like basic, but can use a rope action to negate a pit

CMDP Representations

  • Finite State Representation
    • For discrete domains, groups and normalizes raw weights state-by-state to form CMDP features
  • Continuous State Representation
    • Directly uses the raw weights of the learning agent as features for the CMDP agent

SLIDE 21

Basic Agent Results

SLIDE 22

Action-Dependent Agent Results

SLIDE 23

Rope Agent Results

SLIDE 24

Pacman Setup

Agent Representation

  • Action-dependent egocentric features

CMDP Representation

  • Continuous State Representation
    • Directly uses the raw weights of the learning agent as features for the CMDP agent

Transfer Methods

  • Value Function Transfer
  • Reward Shaping Transfer

How long to train on a source task?

SLIDE 25

Pacman Value Function Transfer

[Plot: cost to learn the target task (y-axis, −250000 to −50000) vs. CMDP episodes (x-axis, 100-700); curves: no curriculum, continuous state representation, naive length-1 representation, naive length-2 representation]

SLIDE 26

Pacman Reward Shaping Transfer

[Plot: cost to learn the target task (y-axis, −3500 to −500) vs. CMDP episodes (x-axis, 100-700); curves: no curriculum, Svetlik et al. (2017), continuous state representation, naive length-2 representation]

SLIDE 27

How long to train?

[Plot: cost to learn the target task (y-axis, ticks 100000-500000) vs. CMDP episodes (x-axis, 100-500); curves: reward shaping (return-based), reward shaping (small fixed), value function (return-based), value function (small fixed)]
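The two schedules compared above can be read as stopping rules for source-task training. A sketch under that reading, assuming a hypothetical `train_one_episode(pi, task)` that returns the updated policy and the episode return; neither rule's exact parameters are given in the talk:

```python
def train_return_based(pi, task, threshold, patience=5):
    """Train on a source task until performance clears a threshold:
    stop once the last `patience` episode returns all reach it."""
    recent = []
    while True:
        pi, ret = train_one_episode(pi, task)   # hypothetical trainer
        recent = (recent + [ret])[-patience:]
        if len(recent) == patience and min(recent) >= threshold:
            return pi

def train_small_fixed(pi, task, episodes=10):
    """Train on a source task for a small fixed number of episodes:
    extract some useful knowledge quickly, then move on."""
    for _ in range(episodes):
        pi, _ = train_one_episode(pi, task)
    return pi
```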

SLIDE 28

Related Work

Restrictions on source tasks

  • Florensa et al. 2018, Riedmiller et al. 2018, Sukhbaatar et al. 2017

Heuristic-based sequencing

  • Da Silva et al. 2018, Svetlik et al. 2017

MDP/POMDP-based sequencing

  • Matiisen et al. 2017, Narvekar et al. 2017

CL for supervised learning

  • Bengio et al. 2009, Fan et al. 2018, Graves et al. 2017

SLIDE 29

Summary

  • Generalize/formulate curriculum generation as an MDP
  • Demonstrate that curriculum policies can be learned, and that the approach is robust to:
    • the learning agent's state/action representation
    • the CMDP representation
    • the transfer algorithm used
