SLIDE 1

Survey: Leveraging Human Guidance for Deep Reinforcement Learning Tasks

Ruohan Zhang, Faraz Torabi, Lin Guan, Dana H. Ballard, Peter Stone

University of Texas at Austin

Presented by Lin Guan

SLIDE 2

A Reinforcement Learning Problem: Montezuma’s Revenge

SLIDE 5

Learning Objective

Find an optimal policy, i.e., the action to take in an observed state that maximizes the expected long-term reward.
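
In standard notation (a textbook formulation, not taken from the slide), this objective is:

```latex
\pi^{*} = \arg\max_{\pi}\; \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t) \right], \qquad \gamma \in [0, 1)
```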

SLIDE 6

Montezuma’s Revenge: Imitation Learning

SLIDE 7

Survey Scope

64 papers, 5 types of human guidance that...

- Are beyond conventional step-by-step action demonstrations
- Have shown promising results in training agents to solve deep reinforcement learning tasks

SLIDE 10

Outline

1. Introduction
2. Learning from Human Evaluative Feedback
3. Learning from Human Preference
4. Hierarchical Imitation
5. Imitation from Observation
6. Learning Attention from Human
7. Conclusion

SLIDE 11

Montezuma’s Revenge: Evaluative Feedback

SLIDE 12

Motivation

While the true reward is delayed and sparse, human evaluative feedback is immediate and dense.

SLIDE 13

Representative Works

Interpreting human feedback as:

- Reward function, replacing the reward provided by the environment
  - TAMER: Training an Agent Manually via Evaluative Reinforcement [Knox and Stone, 2009, Warnell et al., 2018]
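
A minimal sketch of the TAMER idea (illustrative class and a linear model standing in for the deep networks of Deep TAMER [Warnell et al., 2018]; not the published implementation): regress a model of the human's reinforcement signal and act greedily on it, ignoring the environment reward.

```python
import numpy as np

class TamerSketch:
    """TAMER-style learner: regress a human-reinforcement model H(s, a)
    from feedback signals, then act greedily with respect to H."""

    def __init__(self, n_features, n_actions, lr=0.1):
        self.weights = np.zeros((n_actions, n_features))
        self.lr = lr

    def predict(self, state):
        # Estimated human reinforcement for each action in `state`.
        return self.weights @ state

    def act(self, state):
        # Greedy action w.r.t. the learned feedback model (no env reward used).
        return int(np.argmax(self.predict(state)))

    def update(self, state, action, feedback):
        # Squared-error regression of H(s, a) toward the scalar human feedback.
        error = feedback - self.weights[action] @ state
        self.weights[action] += self.lr * error * state
```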

SLIDE 14

Representative Works

Interpreting human feedback as:

- Direct policy labels
  - Advise [Griffith et al., 2013, Cederborg et al., 2015]
- Advantage function
  - COACH: Convergent Actor-Critic by Humans [MacGlashan et al., 2017]
  - This interpretation better explains human feedback behavior in several tasks
  - How to interpret feedback is still an unresolved issue that requires carefully designed human studies

SLIDE 17

Montezuma’s Revenge: Human Preference

SLIDE 18

Motivation

Ranking behaviors is easier than rating them, and sometimes a ranking can only be provided at the end of a behavior trajectory.

SLIDE 19

Representative Works

[Christiano et al., 2017]: Treat it as an inverse reinforcement learning problem, i.e., learn the human reward function from human preferences rather than from demonstrations

- Query selection? Preference elicitation [Zintgraf et al., 2018]
- Many good works on preference-based reinforcement learning [Wirth et al., 2017]
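
[Christiano et al., 2017] fit the reward model with a Bradley-Terry preference model over predicted segment returns; here is a minimal numpy sketch of that loss (the paper trains a deep reward network on many such comparisons):

```python
import numpy as np

def preference_loss(r_hat_1, r_hat_2, human_prefers_1):
    """Cross-entropy loss for the Bradley-Terry preference model
    P(sigma_1 > sigma_2) = exp(R_1) / (exp(R_1) + exp(R_2)),
    where R_i is the predicted return of trajectory segment sigma_i.

    r_hat_1, r_hat_2 : arrays of predicted per-step rewards for each segment
    human_prefers_1  : 1.0 if the human preferred segment 1, else 0.0
    """
    d = np.sum(r_hat_1) - np.sum(r_hat_2)   # predicted return difference
    # -log P(preferred) written stably via log(1 + e^x) = logaddexp(0, x)
    return (human_prefers_1 * np.logaddexp(0.0, -d)
            + (1.0 - human_prefers_1) * np.logaddexp(0.0, d))
```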

SLIDE 23

Montezuma’s Revenge: Hierarchical Imitation

SLIDE 24

Motivation

Humans are good at specifying high-level abstract goals, while agents are good at performing low-level, fine-grained control.

SLIDE 25

Representative Works

- High-level + low-level demonstrations [Le et al., 2018]
- High-level demonstrations only [Andreas et al., 2017]
- A promising combination (sketched below):
  - High-level: imitation learning, e.g., DAgger [Ross et al., 2011]
  - Low-level: reinforcement learning, e.g., DQN [Mnih et al., 2015]
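
A sketch of that combination, with assumed interfaces (`env`, `high_level_policy`, and `low_level_agent` are illustrative names, not from [Le et al., 2018]):

```python
def run_episode(env, high_level_policy, low_level_agent, max_subgoals=10):
    """Hybrid hierarchical loop (sketch).
    High level: imitation-trained subgoal selection (e.g., via DAgger).
    Low level: RL-trained subgoal reaching (e.g., via DQN)."""
    state = env.reset()
    for _ in range(max_subgoals):
        subgoal = high_level_policy.select_subgoal(state)  # imitation-learned
        reached = False
        while not reached:
            action = low_level_agent.act(state, subgoal)   # RL-learned
            next_state, env_done, reached = env.step(action)
            # The low level learns from an intrinsic "subgoal reached" reward,
            # so it needs no dense reward signal from the environment.
            low_level_agent.learn(state, subgoal, action,
                                  float(reached), next_state)
            state = next_state
            if env_done:
                return state
    return state
```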

SLIDE 29

Montezuma’s Revenge: Imitation from Observation

SLIDE 30

Motivation

To utilize the large amount of human demonstration data that lacks action labels, e.g., YouTube videos

SLIDE 31

Representative Works

Challenge 1: Perception

- Viewpoint [Liu et al., 2018, Stadie et al., 2017]
- Embodiment [Gupta et al., 2018, Sermanet et al., 2018]

Challenge 2: Control

- Model-based: infer the missing action for a state transition (s, s′) by learning an inverse dynamics model [Nair et al., 2017, Torabi et al., 2018a] (sketched below)
- Model-free: e.g., bring the state distribution of the imitator closer to that of the trainer using generative adversarial learning [Merel et al., 2017, Torabi et al., 2018b]

Please see paper #10945: Recent Advances in Imitation Learning from Observation [Torabi et al., 2019]
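
A minimal sketch of the model-based route (scikit-learn stands in for the deep models used in the cited papers; function names are illustrative):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def fit_inverse_dynamics(states, next_states, actions):
    """Learn a_t = f(s_t, s_{t+1}) from the agent's own experience,
    where actions ARE observed."""
    x = np.concatenate([states, next_states], axis=1)  # features: (s, s')
    model = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=500)
    model.fit(x, actions)
    return model

def label_demonstrations(model, demo_states, demo_next_states):
    """Infer the demonstrator's unobserved actions from state transitions,
    enabling ordinary behavioral cloning on the now-labeled demonstrations."""
    x = np.concatenate([demo_states, demo_next_states], axis=1)
    return model.predict(x)
```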

SLIDE 35

Montezuma’s Revenge: Human Attention

SLIDE 36

Motivation

Human visual attention provides additional information on why a particular decision is made, e.g., by indicating the current object of interest.

SLIDE 37

Representative Works

- AGIL: Attention-Guided Imitation Learning [Zhang et al., 2018]
  - Including attention does lead to higher accuracy in imitating human actions
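
A sketch of the attention-guided input idea (array shapes and the concatenation scheme are assumptions for illustration, not the paper's exact architecture): a separately trained gaze network predicts a human attention map, which is used to re-weight the frame the policy sees.

```python
import numpy as np

def attention_guided_features(frame, gaze_map):
    """Build the policy input from a frame and a predicted attention map.

    frame    : (H, W, C) image array
    gaze_map : (H, W) predicted human gaze density from a gaze network
    """
    gaze_map = gaze_map / (gaze_map.sum() + 1e-8)  # normalize to a distribution
    masked = frame * gaze_map[..., None]           # emphasize attended pixels
    # Give the policy both the raw frame and the attention-masked frame,
    # so attention guides but does not erase peripheral information.
    return np.concatenate([frame, masked], axis=-1)
```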

SLIDE 38

Representative Works

Figure panels: (a) Cooking [Li et al., 2018]; (b) Driving [Palazzi et al., 2018, Xia et al., 2019]

SLIDE 39

Survey Scope

An agent can learn...

- From human evaluative feedback
- From human preference
- From high-level goals specified by humans
- By observing humans performing the task
- From human visual attention

SLIDE 40

Future Directions

- Shared datasets and reproducibility
- Understanding human trainers' behaviors, e.g., [Thomaz and Breazeal, 2008]
- A unified lifelong learning framework [Abel et al., 2017]

SLIDE 41

Survey: Leveraging Human Guidance for Deep Reinforcement Learning Tasks

Ruohan Zhang, Faraz Torabi, Lin Guan, Dana H. Ballard, Peter Stone

University of Texas at Austin

Presented by Lin Guan

Thank You!

SLIDE 42

References

Abel, D., Salvatier, J., Stuhlmüller, A., and Evans, O. (2017). Agent-agnostic human-in-the-loop reinforcement learning. NeurIPS Workshop on the Future of Interactive Learning Machines.

Andreas, J., Klein, D., and Levine, S. (2017). Modular multitask reinforcement learning with policy sketches. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 166–175. JMLR.org.

Cederborg, T., Grover, I., Isbell, C. L., and Thomaz, A. L. (2015). Policy shaping with human teachers. In Twenty-Fourth International Joint Conference on Artificial Intelligence.

Christiano, P. F., Leike, J., Brown, T., Martic, M., Legg, S., and Amodei, D. (2017). Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems, pages 4299–4307.

Griffith, S., Subramanian, K., Scholz, J., Isbell, C. L., and Thomaz, A. L. (2013). Policy shaping: Integrating human feedback with reinforcement learning. In Advances in Neural Information Processing Systems, pages 2625–2633.

Gupta, A., Devin, C., Liu, Y., Abbeel, P., and Levine, S. (2018). Learning invariant feature spaces to transfer skills with reinforcement learning. In International Conference on Learning Representations.

Knox, W. B. and Stone, P. (2009). Interactively shaping agents via human reinforcement: The TAMER framework. In Proceedings of the Fifth International Conference on Knowledge Capture, pages 9–16. ACM.

Le, H., Jiang, N., Agarwal, A., Dudik, M., Yue, Y., and Daumé, H. (2018). Hierarchical imitation and reinforcement learning. In International Conference on Machine Learning, pages 2923–2932.

Li, Y., Liu, M., and Rehg, J. M. (2018). In the eye of beholder: Joint learning of gaze and actions in first person video.