Understanding Human Teaching Modalities in Reinforcement Learning - PowerPoint PPT Presentation


SLIDE 1

Understanding Human Teaching Modalities in Reinforcement Learning Environments

A Preliminary Report

W. Bradley Knox, Matthew E. Taylor, and Peter Stone

Slides available on the Program page of the ALIHT website.

SLIDE 2

Knowledge! Desires!

SLIDE 3

Current state of interactive learning evaluation

“Beats hand-coded!” “Better than RL!” “Nice demo!”

SLIDE 4

[Image: podium graphic showing 1st, 2nd, 3rd]

SLIDE 5

Reinforcement learning tasks

  • Learn from limited feedback
  • Delayed reward
  • Very general
  • Possibly slow learning
  • Human end-user cannot determine correct behavior

[Diagram: agent-environment loop - the agent receives state and reward and sends actions to the environment]
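The agent-environment loop on this slide can be sketched in a few lines. The toy environment below is a hypothetical stand-in (a hidden target action per state), not the Cart Pole or Mountain Car tasks used later in the talk:

```python
import random

def step(state, action):
    """Toy environment: reward +1 when the action matches a hidden target."""
    target = state % 3
    reward = 1.0 if action == target else 0.0
    next_state = (state + 1) % 10
    return next_state, reward

def run_episode(policy, horizon=10):
    """One pass of the loop: the agent sends actions, the environment
    returns the next state and a scalar (possibly delayed) reward."""
    state, total = 0, 0.0
    for _ in range(horizon):
        action = policy(state)               # agent acts on current state
        state, reward = step(state, action)  # environment responds
        total += reward
    return total

print(run_episode(lambda s: random.randrange(3)))
```

The slide's point lives in the loop body: the learner only sees the scalar reward, not the correct behavior itself.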

SLIDE 6

Learning from demonstration (LfD)

  • Goal: reproduce behavior / policy
  • generalizing effectively to unseen situations
  • Argall, Chernova, Veloso, and Browning. A Survey of Robot Learning from Demonstration. RAS, 2009.

Example work: Lockerd & Breazeal; Nicolescu & Matarić; Grollman & Jenkins; Argall, Browning & Veloso
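The LfD goal stated above (reproduce a policy that generalizes to unseen situations) is essentially supervised learning. A minimal sketch, using a 1-nearest-neighbor learner and made-up demonstrations; none of the cited systems uses this exact learner:

```python
def fit_lfd(demos):
    """Fit a policy to demonstrated (state, action) pairs by
    returning the action of the nearest demonstrated state."""
    def policy(state):
        nearest_state, nearest_action = min(
            demos, key=lambda sa: abs(sa[0] - state))
        return nearest_action
    return policy

# Hypothetical demonstrations recorded from a human teacher
demos = [(0.0, "left"), (1.0, "right"), (2.0, "right")]
policy = fit_lfd(demos)
print(policy(0.2))  # unseen state: generalizes from the nearest demo
```

The interesting part is exactly what the survey emphasizes: how well the fitted policy behaves on states the human never visited.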

SLIDE 7

Learning from feedback (interactive shaping)

TAMER

Knox and Stone, K-CAP 2009

Key insight: trainer evaluates behavior using a model of its long-term quality

SLIDE 8

Learning from feedback (interactive shaping)

TAMER

  • Learn a model of human reinforcement
  • Directly exploit the model to determine action

If greedy: a = argmax_a Ĥ(s, a)

Knox and Stone, K-CAP 2009
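The two steps on this slide can be sketched as follows. The tabular averaging model of human reinforcement is a deliberate simplification (TAMER proper uses function approximation), and the states, actions, and feedback values are made up:

```python
from collections import defaultdict

class TamerSketch:
    """Learn a model H_hat of human reinforcement, then (if greedy)
    act by a = argmax_a H_hat(s, a)."""

    def __init__(self, actions):
        self.actions = actions
        self.sums = defaultdict(float)  # total feedback per (s, a)
        self.counts = defaultdict(int)  # feedback count per (s, a)

    def update(self, state, action, feedback):
        # step 1: learn a model of human reinforcement
        self.sums[(state, action)] += feedback
        self.counts[(state, action)] += 1

    def h_hat(self, state, action):
        n = self.counts[(state, action)]
        return self.sums[(state, action)] / n if n else 0.0

    def act(self, state):
        # step 2: directly exploit the model to determine the action
        return max(self.actions, key=lambda a: self.h_hat(state, a))

agent = TamerSketch(["left", "right"])
agent.update("s0", "right", 1.0)   # positive trainer feedback
agent.update("s0", "left", -1.0)   # negative trainer feedback
print(agent.act("s0"))
```

Note what is absent: no discounting and no value backup; the model of human reinforcement is exploited directly rather than treated as an RL reward signal.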

SLIDE 9

Learning from feedback (interactive shaping)

Training:

SLIDE 10

Learning from feedback (interactive shaping)

After training:

SLIDE 11

Learning from feedback (interactive shaping)

Training:

SLIDE 12

LfD and LfF vs. RL

  • Noisy
  • Limited by human ability
  • Requires the human's time
  • Faster learning
  • Empowers humans to define the task
SLIDE 13

And out come the contenders!

Learning from Demonstration (LfD): “Just do as I do.”
vs.
Learning from Feedback (LfF): “Good robot!”

SLIDE 14

An a priori comparison

Interface

  • LfD interface may be familiar to video game players
  • LfF interface is simpler and task-independent

Demonstration more specifically points to the correct action

SLIDE 15

An a priori comparison

Expression of learned model during training:

LfF? yes. LfD? generally no.

  • LfD - better initial training performance
  • LfF - can observe and address model’s weaknesses
  • LfF - training and testing performance match up better

Painted with MLDemos software

SLIDE 16

An a priori comparison

Task expertise

  • LfF - easier to judge than to control
  • Easier for human to increase expertise while training with LfD

Cognitive load - less for LfF

SLIDE 17

An a priori comparison

General hypothesis: LfD generally performs better, but the outcome is situation-dependent.

SLIDE 18

Pilot study

SLIDE 19

Pilot study

16 undergraduates; Cart Pole first, then Mountain Car

  • Practice and test rounds
  • Randomized: LfF or LfD first
  • Unbalanced result: LfF was first for 87.5% of CP and 69% of MC

Keyboard interface

  • LfD: j, k, l
  • LfF: z, /
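The two interfaces amount to simple key maps. The key bindings are from the slide; which action or feedback sign each key carries is an assumption here:

```python
# LfD keys demonstrate actions directly (action names are assumed)
LFD_KEYS = {"j": "left", "k": "no-op", "l": "right"}

# LfF keys deliver scalar feedback (sign assignment is assumed)
LFF_KEYS = {"z": -1.0, "/": +1.0}

def handle_key(mode, key):
    """Route a keypress to a demonstrated action or a feedback value."""
    table = LFD_KEYS if mode == "LfD" else LFF_KEYS
    return table.get(key)  # None for unbound keys

print(handle_key("LfD", "j"), handle_key("LfF", "/"))
```

The contrast in the a priori comparison shows up even here: the LfF map is task-independent (two feedback keys work for any task), while the LfD map must be redefined per action set.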
SLIDE 20

Pilot study

Main result

[Figure: mean reward per episode vs. episode, comparing LfD and LfF. Four panels: online performance (CP), online performance (MC), policy performance (CP), policy performance (MC).]
SLIDE 21

Pilot study

Interaction effects

[Figure: mean reward per episode by teaching order (LfF first vs. LfD first), CP and MC.]

SLIDE 22

Pilot study

Interaction effects

[Figure: mean reward per episode, before vs. after the instruction change, CP and MC.]

Added a verbal instruction to give frequent feedback for LfF.

SLIDE 23

Pilot study

Interaction effects

[Figure: mean reward per episode across experimental setups, online LfF performance, CP.]

Previous experiment differed:

  • more subject preparation
  • announced high scores in progress
  • ...
SLIDE 24

Pilot study

Online vs. offline performance

[Figure: policy performance minus online performance per episode, for LfD and LfF, MC and CP.]

SLIDE 25

Pilot study

Tentative takeaways from performance comparisons

LfD was better in our experiments. But both were sensitive to the experimental setup.

SLIDE 26

Pilot study

Tentative takeaways from performance comparisons

Subjects need more preparation for LfF.

  • With zero task expertise, LfD still allows learning on the job

  • LfF vs. LfD interfaces
SLIDE 27

Pilot study

Tentative takeaways from performance comparisons

LfD’s offline, learned performance is generally worse than its training samples. LfF’s offline, learned performance is generally as good or better than during training.

SLIDE 28

To conclude,

Results

  • LfD was better.
  • But performance was situational.
  • LfF needed more subject preparation.
  • LfF models compared better to training performance.

SLIDE 29

To conclude,

Near future work

  • More subjects
  • More balanced conditions
  • More interesting manipulations (e.g., model representation and control interface quality)
  • Aim for crossover interactions
  • Learn from both LfD and LfF!

[Sketch: crossover interaction - performance curves of conditions A and B crossing.]
