Understanding Human Teaching Modalities in Reinforcement Learning Environments
A Preliminary Report
W. Bradley Knox, Matthew E. Taylor, and Peter Stone
Slides available on the Program page of the ALIHT website.
Knowledge! Desires!
Current state of interactive learning evaluation
Beats hand-coded! Better than RL! Nice demo!
Reinforcement learning tasks
behavior
[Diagram: agent-environment loop. The agent sends an action to the environment; the environment returns a state and a reward.]
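The agent-environment loop can be sketched in a few lines. This is a minimal illustration only: the `ToyCorridor` environment, its reward of -1 per step, and the `run_episode` helper are assumptions made up for this sketch, not the tasks used in the study.

```python
class ToyCorridor:
    """Hypothetical 1-D corridor: start at position 0, goal at `length`."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action is -1 (left) or +1 (right); position is clipped at 0.
        self.state = max(0, self.state + action)
        done = self.state >= self.length
        # Assumed reward: -1 on every step that does not reach the goal.
        reward = 0.0 if done else -1.0
        return self.state, reward, done


def run_episode(env, policy, max_steps=100):
    """Run one episode and return the summed reward."""
    state = env.reset()
    total = 0.0
    for _ in range(max_steps):
        action = policy(state)                   # agent picks an action
        state, reward, done = env.step(action)   # environment responds
        total += reward
        if done:
            break
    return total


def always_right(state):
    return 1


episode_return = run_episode(ToyCorridor(5), always_right)
```

Here the agent sees only the state and the reward signal; any human teaching (LfD or LfF) adds a second channel of information on top of this loop.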
Learning from demonstration (LfD)
Robot Learning from Demonstration. RAS, 2009.
Lockerd & Breazeal; Nicolescu & Matarić; Grollman & Jenkins; Argall, Browning & Veloso
Learning from feedback (interactive shaping)
Knox and Stone, K-CAP 2009
Key insight: trainer evaluates behavior using a model of its long-term quality
Learning from feedback (interactive shaping)
Learn a model of human reinforcement.
Directly exploit the model to determine action.
If greedy: choose the action the model predicts the trainer would reinforce most.
Knox and Stone, K-CAP 2009
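The two steps above (learn a model of human reinforcement, then exploit it greedily) might look roughly like the following. This is a sketch under stated assumptions, not the authors' implementation: the linear model, the per-action weight vectors, the step size, and the class name `HumanRewardModel` are all illustrative choices, and the sketch ignores practical issues such as the delay between an action and the trainer's response.

```python
class HumanRewardModel:
    """Assumed linear model of human reinforcement, one weight vector per action."""

    def __init__(self, n_features, n_actions, step_size=0.1):
        self.w = [[0.0] * n_features for _ in range(n_actions)]
        self.step_size = step_size
        self.n_actions = n_actions

    def predict(self, features, action):
        # Predicted human reinforcement for taking `action` in this state.
        return sum(w * x for w, x in zip(self.w[action], features))

    def update(self, features, action, human_reward):
        # One gradient step toward the trainer's feedback signal.
        error = human_reward - self.predict(features, action)
        self.w[action] = [
            w + self.step_size * error * x
            for w, x in zip(self.w[action], features)
        ]

    def greedy_action(self, features):
        # Greedy exploitation: the action with the highest predicted reinforcement.
        return max(range(self.n_actions), key=lambda a: self.predict(features, a))


model = HumanRewardModel(n_features=2, n_actions=2)
state_features = [1.0, 0.0]
for _ in range(20):
    model.update(state_features, action=1, human_reward=1.0)   # "Good robot!"
    model.update(state_features, action=0, human_reward=-1.0)  # negative feedback
chosen = model.greedy_action(state_features)
```

Because the agent acts on the model while it is still being trained, the trainer sees the effect of each piece of feedback in the agent's next choices.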
Learning from feedback (interactive shaping)
Training:
Learning from feedback (interactive shaping)
After training:
LfD and LfF vs. RL
And out come the contendas!!
Learning from Demonstration (LfD): "Just do as I do."
Learning from Feedback (LfF): "Good robot!"
An a priori comparison
Interface
familiar to video game players
and task-independent
Demonstration more specifically points to the correct action
An a priori comparison
Expression of learned model during training:
LfF? yes. LfD? generally no.
performance
address model’s weaknesses
performance match up better
Painted with MLDemos software
An a priori comparison
Task expertise
control
expertise while training with LfD
Cognitive load - less for LfF
An a priori comparison
General hypothesis: LfD generally performs better, but the outcome is situation-dependent.
Pilot study
16 undergraduates
Cart Pole first, then Mountain Car
Keyboard interface
Pilot study
Main result
[Figure: four panels, "Online performance: MC", "Online performance: CP", "Policy performance: MC", "Policy performance: CP". Y-axis: mean reward per episode. Series: LfD, LfF.]
[Figure: two panels, "Effects of teaching order: CP" and "Effects of teaching order: MC". Y-axis: mean reward per episode. Series: LfD, LfF.]
Pilot study
Interaction effects
[Figure: two panels, "Effects of instructions change: CP" and "Effects of instructions change: MC". Y-axis: mean reward per episode. X-axis categories: original, modified. Series: LfD, LfF.]
Pilot study
Interaction effects
Added a verbal instruction to give frequent feedback for LfF.
[Figure: mean reward per episode. Series: LfD and LfF on current data, plus one series from past data.]
Pilot study
Interaction effects
Previous experiment differed:
[Figure: two panels, "Policy performance - online performance: CP" and "Policy performance - online performance: MC". Axis label: mean reward per episode.]
Pilot study
Online vs. offline performance
Pilot study
Tentative takeaways from performance comparisons
LfD was better in our experiments. But both were sensitive to the experimental setup.
Subjects need more preparation for LfF.
learning on the job
LfD’s offline, learned performance is generally worse than its training samples.
LfF’s offline, learned performance is generally as good as or better than its performance during training.
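One way to make the online/offline comparison concrete is to subtract the mean episode return seen during training from the mean return of the learned policy run afterwards. The helper name `offline_minus_online` and every number below are invented for illustration; they are not results from the pilot study.

```python
from statistics import mean


def offline_minus_online(policy_returns, training_returns):
    """Mean return of the learned policy minus mean return during training.

    Negative: the learned policy does worse than what was seen in training.
    Positive: the learned policy does better than training-time behavior.
    """
    return mean(policy_returns) - mean(training_returns)


# Made-up episode returns, chosen only to show both signs of the gap.
lfd_gap = offline_minus_online(
    policy_returns=[-150.0, -160.0],    # learned policy, run after training
    training_returns=[-120.0, -130.0],  # the human's demonstrations
)
lff_gap = offline_minus_online(
    policy_returns=[-140.0, -130.0],    # learned policy, run after training
    training_returns=[-150.0, -160.0],  # agent behavior while being trained
)
```

With these invented inputs `lfd_gap` is negative and `lff_gap` is positive, which is the pattern the takeaway describes.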
To conclude,
Results
performance.
To conclude,
Near future work
representation and control interface quality)
[Figure: schematic plot of performance by condition, with points labeled A and B.]