SLIDE 1

Intrinsically Motivated Exploration for Reinforcement Learning in Robotics

University of Hamburg
Faculty of Mathematics, Informatics and Natural Sciences (MIN)
Department of Informatics, Technical Aspects of Multimodal Systems

10.12.18 | Anton Wiehe | 8WIEHE@INFORMATIK

SLIDE 2

Outline

  • Intro to RL: successes and problems
  • Directed exploration and why RL in robotics needs it
  • Three recent approaches:
    1. Intrinsic Curiosity Module (ICM)
    2. Random Network Distillation (RND)
    3. Episodic Curiosity Through Reachability (EC)
  • Discussion

SLIDE 3

Reinforcement Learning - Introduction

  • Algorithms maximize the discounted cumulative reward
  • Exploration is essential; the usual choice is epsilon-greedy (sketched below)

From: “Reinforcement Learning: An Introduction” by Sutton and Barto [1]
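For concreteness: the agent maximizes the discounted return G_t = sum_k gamma^k r_{t+k+1}, and epsilon-greedy explores by acting randomly a small fraction of the time. A minimal sketch in Python (function and variable names are illustrative, not from [1]):

    # Minimal epsilon-greedy action selection (illustrative sketch).
    import random

    def epsilon_greedy(q_values, epsilon=0.1):
        # With probability epsilon, explore with a uniformly random action;
        # otherwise exploit the action with the highest estimated value.
        if random.random() < epsilon:
            return random.randrange(len(q_values))
        return max(range(len(q_values)), key=lambda a: q_values[a])

    # Example: epsilon_greedy([0.1, 0.5, 0.3]) returns action 1 about 93% of the time.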

SLIDE 4

RL – Successes

AlphaGo | Learning Dexterous In-Hand Manipulation | OpenAI Five

Image sources: https://foreignpolicy.com/2016/03/18/china-go-chess-west-east-technology-artificial-intelligence-google/ and https://blog.openai.com/openai-five/

From the whitepaper [4]

SLIDE 5

RL – Limitations

  • AlphaGo: world known + self-play
  • In-Hand Manipulation: domain randomization + simulation
  • OpenAI Five: self-play + simulation

Image sources: https://foreignpolicy.com/2016/03/18/china-go-chess-west-east-technology-artificial-intelligence-google/ and https://blog.openai.com/openai-five/

From the whitepaper [4]

SLIDE 6

RL – Problems in Robotics

Ideally, we would learn without simulation, but:

  • Sparse rewards: necessary, but difficult to reach
  • Sample efficiency: hardware limits how much experience can be collected

https://blog.openai.com/faulty-reward-functions/

SLIDE 7

Directed Exploration - Introduction

  • Helps with sparse rewards
  • Makes exploration efficient

In general: total reward = intrinsic reward + extrinsic reward

  • Environment → extrinsic reward
  • Exploration algorithm → intrinsic reward
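Spelled out as a formula (the weighting coefficient β is an assumption for illustration; the slide only states the sum):

    % total reward as a weighted sum of extrinsic and intrinsic terms
    r_t = r_t^{\mathrm{extrinsic}} + \beta \, r_t^{\mathrm{intrinsic}}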

Comparison of TRPO+VIME (red) and TRPO (blue) on MountainCar: visited states until convergence. Source: “VIME: Variational Information Maximizing Exploration” [2]

SLIDE 8

Intrinsic Curiosity Module (ICM) - Overview

  • Train a world model that predicts the next state from the current state and action
  • The magnitude of this model's prediction error = intrinsic reward
  • The world model should predict only relevant features:
    • use the features that are necessary for inverse dynamics, i.e. predicting the action from two consecutive states (see the sketch below)
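A minimal sketch of the ICM idea in Python/PyTorch, assuming flat observations, a discrete action space, and small fully connected networks (all sizes and names are my choices, not the authors' architecture):

    # Minimal ICM sketch: forward-model error in a feature space that is
    # shaped by inverse dynamics (assumed sizes, not the paper's setup).
    import torch
    import torch.nn as nn

    class ICM(nn.Module):
        def __init__(self, obs_dim, n_actions, feat_dim=32):
            super().__init__()
            self.n_actions = n_actions
            # Encoder phi: only the inverse-dynamics loss trains it, so it
            # keeps features that are relevant to the agent's own actions.
            self.encoder = nn.Sequential(
                nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, feat_dim))
            # Inverse model: (phi(s_t), phi(s_t+1)) -> which action was taken?
            self.inverse = nn.Sequential(
                nn.Linear(2 * feat_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
            # Forward model: (phi(s_t), a_t) -> predicted phi(s_t+1)
            self.forward_model = nn.Sequential(
                nn.Linear(feat_dim + n_actions, 64), nn.ReLU(), nn.Linear(64, feat_dim))

        def forward(self, s, a, s_next):
            phi, phi_next = self.encoder(s), self.encoder(s_next)
            a_onehot = nn.functional.one_hot(a, self.n_actions).float()
            # Inputs are detached so the forward loss trains only the
            # forward model, while the inverse loss shapes the encoder.
            phi_pred = self.forward_model(
                torch.cat([phi.detach(), a_onehot], dim=-1))
            # Intrinsic reward = forward prediction error in feature space.
            fwd_error = 0.5 * (phi_pred - phi_next.detach()).pow(2).sum(dim=-1)
            logits = self.inverse(torch.cat([phi, phi_next], dim=-1))
            inverse_loss = nn.functional.cross_entropy(logits, a)
            return fwd_error.detach(), fwd_error.mean() + inverse_loss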
SLIDE 9

Intrinsic Curiosity Module (ICM) - Details

SLIDE 10

Intrinsic Curiosity Module (ICM) - Demo

See: https://pathak22.github.io/noreward-rl/

SLIDE 11

Intrinsic Curiosity Module (ICM) - Problems

  • Four factors influence the predictability of the next state:
    1. States similar to the next state have not yet been encountered often
    2. The environment is stochastic
    3. The world model is too weak
    4. Partial observability
  • Only the first one is a desired source of unpredictability
SLIDE 12

Intrinsic Curiosity Module (ICM) – Problems

These problems can be mitigated with large models, Bayesian networks, or LSTMs

https://blog.openai.com/reinforcement-learning-with-prediction-based-rewards/

“Montezuma’s Revenge” is a difficult Atari game:

SLIDE 13

Random Network Distillation (RND) - Motivation

Deals with three of the previous problems by using only the current state: the prediction target is a fixed, deterministic function of the current observation, so environment stochasticity, a too-weak world model, and partial observability can no longer cause irreducible prediction error

https://blog.openai.com/reinforcement-learning-with-prediction-based-rewards/

SLIDE 14

Random Network Distillation (RND) - Overview

  • Initialize a Random Network (RN) and a Predictor Network (PN) with random weights
  • PN and RN have the same architecture and map the state representation to a vector
  • PN is trained to predict the output of RN for the current state
  • The prediction error is the intrinsic reward (see the sketch below)
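A minimal RND sketch in Python/PyTorch (network sizes, learning rate, and names are assumptions, not the paper's configuration):

    # Minimal RND sketch: distill a fixed random network into a trained
    # predictor; the residual error acts as a novelty signal.
    import torch
    import torch.nn as nn

    def make_net(obs_dim, out_dim=64):
        return nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, out_dim))

    obs_dim = 16
    target = make_net(obs_dim)     # RN: random init, never trained
    predictor = make_net(obs_dim)  # PN: same architecture, trained
    for p in target.parameters():
        p.requires_grad_(False)
    opt = torch.optim.Adam(predictor.parameters(), lr=1e-4)

    def intrinsic_reward(obs):
        # Error is low on often-visited states (already distilled) and
        # high on novel ones, giving a pseudo-count-like bonus.
        err = (predictor(obs) - target(obs)).pow(2).mean(dim=-1)
        opt.zero_grad()
        err.mean().backward()
        opt.step()
        return err.detach()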
SLIDE 15

Random Network Distillation (RND) - Results

From the whitepaper [6]

SLIDE 16

Random Network Distillation (RND) - Drawbacks

  • Simple, but not flexible
  • No evidence of sample efficiency (trained for 1.6 billion frames)
  • No filtering of irrelevant state features
  • Does not return to states it has already seen within an episode
SLIDE 17

Episodic Curiosity Through Reachability (EC) - Idea

All figures from the whitepaper [7]

Incorporates acting into curiosity: a state only counts as novel if it cannot be reached from states already in episodic memory within a few actions (sketched below)
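A minimal sketch of the episodic bonus in Python/PyTorch, assuming an already-trained reachability classifier and a fixed bonus value rather than the paper's full reward formula (threshold, sizes, and names are my assumptions):

    # Minimal EC sketch: reward states that no state in episodic memory
    # can reach within a few actions, then add them to memory.
    import torch
    import torch.nn as nn

    class ReachabilityNet(nn.Module):
        # Trained elsewhere to classify whether state b is reachable
        # from state a within k environment steps.
        def __init__(self, obs_dim):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(2 * obs_dim, 64), nn.ReLU(),
                nn.Linear(64, 1), nn.Sigmoid())

        def forward(self, a, b):
            return self.net(torch.cat([a, b], dim=-1)).squeeze(-1)

    def episodic_bonus(obs, memory, reach, threshold=0.5, bonus=1.0):
        with torch.no_grad():
            if memory:
                mem = torch.stack(memory)
                # Highest reachability score from any remembered state.
                reachability = reach(mem, obs.expand_as(mem)).max().item()
            else:
                reachability = 0.0
        if reachability < threshold:
            memory.append(obs)  # novel: remember it and pay the bonus
            return bonus
        return 0.0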

SLIDE 18

Episodic Curiosity Through Reachability (EC) - Overview

All figures from the whitepaper [7]

SLIDE 19

Episodic Curiosity Through Reachability (EC) – How to Embed and Compare

All figures from the whitepaper [7]

SLIDE 20

Episodic Curiosity Through Reachability (EC) – Results on VizDoom

Results for three settings: very sparse rewards, sparse rewards, and dense rewards

All figures from the whitepaper [7]

SLIDE 21

Episodic Curiosity Through Reachability (EC) – Reward Visualization

https://www.youtube.com/watch?v=mphIRR6VsbM&feature=youtu.be

SLIDE 22

Conclusion of these approaches

  • ICM:
    • works on state predictability
    • requires a powerful world model
  • RND:
    • uses a form of pseudo-state-count
    • simple
    • not flexible
  • EC:
    • uses episodic memory to determine reachability
    • incorporates acting into curiosity
    • has many moving parts
SLIDE 23

Drawbacks of Intrinsic Motivation in Robotics in General

Safety: it might be “interesting” for the robot to destroy parts of the environment, itself, or possibly humans. Maybe fixable by:

  • letting robots experience pain on their extremities [3]
  • training a supervisor agent that identifies unsafe behavior

Complex intrinsic motivation might lead to unexpected behavior:

http://terminator.wikia.com/wiki/Skynet

SLIDE 24

Outlook and Final Conclusion

Intrinsic motivation is important for real intelligence, as obtaining extrinsic reward is “only” an optimization problem.

  • It is unclear which motivation is best! Combine motivation approaches?
  • What are your intrinsic motivations? Is there high-level and low-level curiosity?

SLIDE 25

Thank you for listening! Any Questions?

SLIDE 26

References

[1] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2017

[2] Rein Houthooft, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. VIME: Variational information maximizing exploration. In NIPS, 2016.

[3] J. Schmidhuber. A possibility for implementing curiosity and boredom in model-building neural controllers. In J. A. Meyer and S. W. Wilson, editors, Proc. of the International Conference on Simulation of Adaptive Behavior: From Animals to Animats, pages 222-227. MIT Press/Bradford Books, 1991.

[4] A. Gupta, C. Eppner, S. Levine, and P. Abbeel, “Learning dexterous manipulation for a soft robotic hand from human demonstrations,” in 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2016, Daejeon, South Korea, October 9-14, 2016, pp. 3786–3793, 2016

[5] Deepak Pathak, Pulkit Agrawal, Alexei A Efros, and Trevor Darrell. Curiosity-driven exploration by self- supervised prediction. In International Conference on Machine Learning (ICML), volume 2017, 2017.

[6] Burda, Y., Edwards, H., Storkey, A., and Klimov, O. (2018). Exploration by random network distillation. ArXiv preprint arXiv:1810.12894

[7] N. Savinov, A. Raichuk, R. Marinier, D. Vincent, M. Pollefeys, and T. Lillicrap. Episodic curiosity through reachability. ArXiv preprint arXiv:1810.02274, 2018.
