SLIDE 1

Intrinsically Motivated Exploration for Reinforcement Learning in Robotics

University of Hamburg
Faculty of Mathematics, Informatics and Natural Sciences (MIN)
Department of Informatics, Technical Aspects of Multimodal Systems

10.12.18 | Anton Wiehe | 8WIEHE@INFORMATIK

SLIDE 2

Outline

  • Intro to RL: successes and problems
  • Directed exploration and why RL in robotics needs it
  • Three recent approaches:
    1. Intrinsic Curiosity Module (ICM)
    2. Random Network Distillation (RND)
    3. Episodic Curiosity Through Reachability (EC)
  • Discussion

SLIDE 3

Reinforcement Learning - Introduction

  • Algorithms maximize the discounted cumulative reward
  • Exploration is essential; the usual choice is epsilon-greedy (sketched below)

From: “Reinforcement Learning: An Introduction” by Sutton and Barto [1]
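For concreteness: the agent maximizes the discounted return G_t = sum_k gamma^k r_{t+k+1}, and epsilon-greedy explores by acting randomly a small fraction of the time. A minimal sketch in Python (function and variable names are illustrative, not from [1]):

    # Minimal epsilon-greedy action selection (illustrative sketch).
    import random

    def epsilon_greedy(q_values, epsilon=0.1):
        # With probability epsilon, explore with a uniformly random action;
        # otherwise exploit the action with the highest estimated value.
        if random.random() < epsilon:
            return random.randrange(len(q_values))
        return max(range(len(q_values)), key=lambda a: q_values[a])

    # Example: epsilon_greedy([0.1, 0.5, 0.3]) returns action 1 about 93% of the time.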

SLIDE 4

RL – Successes

AlphaGo | Learning Dexterous In-Hand Manipulation | OpenAI Five

Image sources: https://foreignpolicy.com/2016/03/18/china-go-chess-west-east-technology-artificial-intelligence-google/ and https://blog.openai.com/openai-five/

From the whitepaper [4]

SLIDE 5

RL – Limitations

  • AlphaGo: world known + self-play
  • In-Hand Manipulation: domain randomization + simulation
  • OpenAI Five: self-play + simulation

Image sources: https://foreignpolicy.com/2016/03/18/china-go-chess-west-east-technology-artificial-intelligence-google/ and https://blog.openai.com/openai-five/

From the whitepaper [4]

SLIDE 6

RL – Problems in Robotics

Ideally, we would learn without simulation, but:

  • Sparse rewards: necessary, but difficult to reach
  • Sample efficiency: hardware limits how much experience can be collected

https://blog.openai.com/faulty-reward-functions/

SLIDE 7

Directed Exploration - Introduction

  • Helps with sparse rewards
  • Makes exploration efficient

In general: total reward = intrinsic reward + extrinsic reward

  • Environment → extrinsic reward
  • Exploration algorithm → intrinsic reward
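Spelled out as a formula (the weighting coefficient β is an assumption for illustration; the slide only states the sum):

    % total reward as a weighted sum of extrinsic and intrinsic terms
    r_t = r_t^{\mathrm{extrinsic}} + \beta \, r_t^{\mathrm{intrinsic}}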

Comparison of TRPO+VIME (red) and TRPO (blue) on MountainCar: visited states until convergence. Source: “VIME: Variational Information Maximizing Exploration” [2]

SLIDE 8

Intrinsic Curiosity Module (ICM) - Overview

  • Train a world model that predicts the next state from the current state and action
  • The magnitude of this model's prediction error = intrinsic reward
  • The world model should predict only relevant features:
    • use the features that are necessary for inverse dynamics, i.e. predicting the action from two consecutive states (see the sketch below)
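A minimal sketch of the ICM idea in Python/PyTorch, assuming flat observations, a discrete action space, and small fully connected networks (all sizes and names are my choices, not the authors' architecture):

    # Minimal ICM sketch: forward-model error in a feature space that is
    # shaped by inverse dynamics (assumed sizes, not the paper's setup).
    import torch
    import torch.nn as nn

    class ICM(nn.Module):
        def __init__(self, obs_dim, n_actions, feat_dim=32):
            super().__init__()
            self.n_actions = n_actions
            # Encoder phi: only the inverse-dynamics loss trains it, so it
            # keeps features that are relevant to the agent's own actions.
            self.encoder = nn.Sequential(
                nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, feat_dim))
            # Inverse model: (phi(s_t), phi(s_t+1)) -> which action was taken?
            self.inverse = nn.Sequential(
                nn.Linear(2 * feat_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
            # Forward model: (phi(s_t), a_t) -> predicted phi(s_t+1)
            self.forward_model = nn.Sequential(
                nn.Linear(feat_dim + n_actions, 64), nn.ReLU(), nn.Linear(64, feat_dim))

        def forward(self, s, a, s_next):
            phi, phi_next = self.encoder(s), self.encoder(s_next)
            a_onehot = nn.functional.one_hot(a, self.n_actions).float()
            # Inputs are detached so the forward loss trains only the
            # forward model, while the inverse loss shapes the encoder.
            phi_pred = self.forward_model(
                torch.cat([phi.detach(), a_onehot], dim=-1))
            # Intrinsic reward = forward prediction error in feature space.
            fwd_error = 0.5 * (phi_pred - phi_next.detach()).pow(2).sum(dim=-1)
            logits = self.inverse(torch.cat([phi, phi_next], dim=-1))
            inverse_loss = nn.functional.cross_entropy(logits, a)
            return fwd_error.detach(), fwd_error.mean() + inverse_loss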
SLIDE 9

Intrinsic Curiosity Module (ICM) - Details

SLIDE 10

Intrinsic Curiosity Module (ICM) - Demo

See: https://pathak22.github.io/noreward-rl/

SLIDE 11

Intrinsic Curiosity Module (ICM) - Problems

  • Four factors influence the predictability of the next state:
    1. States similar to the next state have not yet been encountered often
    2. The environment is stochastic
    3. The world model is too weak
    4. Partial observability
  • Only the first one is a desired source of unpredictability
SLIDE 12

Intrinsic Curiosity Module (ICM) – Problems

These problems can be mitigated with large models, Bayesian networks, or LSTMs

https://blog.openai.com/reinforcement-learning-with-prediction-based-rewards/

“Montezuma’s Revenge” is a difficult Atari game:

SLIDE 13

Random Network Distillation (RND) - Motivation

Deals with three of the previous problems by using only the current state: the prediction target is a fixed, deterministic function of the current observation, so environment stochasticity, a too-weak world model, and partial observability can no longer cause irreducible prediction error

https://blog.openai.com/reinforcement-learning-with-prediction-based-rewards/

SLIDE 14

Random Network Distillation (RND) - Overview

  • Initialize a Random Network (RN) and a Predictor Network (PN) with random weights
  • PN and RN have the same architecture and map the state representation to a vector
  • PN is trained to predict the output of RN for the current state
  • The prediction error is the intrinsic reward (see the sketch below)
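A minimal RND sketch in Python/PyTorch (network sizes, learning rate, and names are assumptions, not the paper's configuration):

    # Minimal RND sketch: distill a fixed random network into a trained
    # predictor; the residual error acts as a novelty signal.
    import torch
    import torch.nn as nn

    def make_net(obs_dim, out_dim=64):
        return nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, out_dim))

    obs_dim = 16
    target = make_net(obs_dim)     # RN: random init, never trained
    predictor = make_net(obs_dim)  # PN: same architecture, trained
    for p in target.parameters():
        p.requires_grad_(False)
    opt = torch.optim.Adam(predictor.parameters(), lr=1e-4)

    def intrinsic_reward(obs):
        # Error is low on often-visited states (already distilled) and
        # high on novel ones, giving a pseudo-count-like bonus.
        err = (predictor(obs) - target(obs)).pow(2).mean(dim=-1)
        opt.zero_grad()
        err.mean().backward()
        opt.step()
        return err.detach()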
SLIDE 15

Random Network Distillation (RND) - Results

From the whitepaper [6]

SLIDE 16

Random Network Distillation (RND) - Drawbacks

  • Simple, but not flexible
  • No evidence of sample efficiency (trained for 1.6 billion frames)
  • No filtering of irrelevant state features
  • Does not return to states it has already seen within an episode
SLIDE 17

Episodic Curiosity Through Reachability (EC) - Idea

All figures from the whitepaper [7]

Incorporates acting into curiosity: a state only counts as novel if it cannot be reached from states already in episodic memory within a few actions (sketched below)
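A minimal sketch of the episodic bonus in Python/PyTorch, assuming an already-trained reachability classifier and a fixed bonus value rather than the paper's full reward formula (threshold, sizes, and names are my assumptions):

    # Minimal EC sketch: reward states that no state in episodic memory
    # can reach within a few actions, then add them to memory.
    import torch
    import torch.nn as nn

    class ReachabilityNet(nn.Module):
        # Trained elsewhere to classify whether state b is reachable
        # from state a within k environment steps.
        def __init__(self, obs_dim):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(2 * obs_dim, 64), nn.ReLU(),
                nn.Linear(64, 1), nn.Sigmoid())

        def forward(self, a, b):
            return self.net(torch.cat([a, b], dim=-1)).squeeze(-1)

    def episodic_bonus(obs, memory, reach, threshold=0.5, bonus=1.0):
        with torch.no_grad():
            if memory:
                mem = torch.stack(memory)
                # Highest reachability score from any remembered state.
                reachability = reach(mem, obs.expand_as(mem)).max().item()
            else:
                reachability = 0.0
        if reachability < threshold:
            memory.append(obs)  # novel: remember it and pay the bonus
            return bonus
        return 0.0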

SLIDE 18

Episodic Curiosity Through Reachability (EC) - Overview

All figures from the whitepaper [7]

SLIDE 19

Episodic Curiosity Through Reachability (EC) – How to Embed and Compare

All figures from the whitepaper [7]

SLIDE 20

Episodic Curiosity Through Reachability (EC) – Results on VizDoom

Results for three settings: very sparse rewards, sparse rewards, and dense rewards

All figures from the whitepaper [7]

SLIDE 21

Episodic Curiosity Through Reachability (EC) – Reward Visualization

https://www.youtube.com/watch?v=mphIRR6VsbM&feature=youtu.be

SLIDE 22

Conclusion of these approaches

  • ICM:
    • works on state predictability
    • requires a powerful world model
  • RND:
    • uses a form of pseudo-state-count
    • simple
    • not flexible
  • EC:
    • uses episodic memory to determine reachability
    • incorporates acting into curiosity
    • has many moving parts
SLIDE 23

Drawbacks of Intrinsic Motivation in Robotics in General

Safety: it might be “interesting” for the robot to destroy parts of the environment, itself, or possibly humans. Maybe fixable by:

  • letting robots experience pain on their extremities [3]
  • training a supervisor agent that identifies unsafe behavior

Complex intrinsic motivation might lead to unexpected behavior:

http://terminator.wikia.com/wiki/Skynet

SLIDE 24

Outlook and Final Conclusion

Intrinsic motivation is important for real intelligence, as obtaining extrinsic reward is “only” an optimization problem.

  • It is unclear which motivation is best! Combine motivation approaches?
  • What are your intrinsic motivations? Is there high-level and low-level curiosity?

SLIDE 25

Thank you for listening! Any Questions?

SLIDE 26

References

[1] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2017

[2] Rein Houthooft, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. VIME: Variational information maximizing exploration. In NIPS, 2016.

[3] J. Schmidhuber. A possibility for implementing curiosity and boredom in model-building neural controllers. In J. A. Meyer and S. W. Wilson, editors, Proc. of the International Conference on Simulation of Adaptive Behavior: From Animals to Animats, pages 222-227. MIT Press/Bradford Books, 1991.

[4] A. Gupta, C. Eppner, S. Levine, and P. Abbeel, “Learning dexterous manipulation for a soft robotic hand from human demonstrations,” in 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2016, Daejeon, South Korea, October 9-14, 2016, pp. 3786–3793, 2016

[5] Deepak Pathak, Pulkit Agrawal, Alexei A Efros, and Trevor Darrell. Curiosity-driven exploration by self- supervised prediction. In International Conference on Machine Learning (ICML), volume 2017, 2017.

[6] Burda, Y., Edwards, H., Storkey, A., and Klimov, O. (2018). Exploration by random network distillation. ArXiv preprint arXiv:1810.12894

[7] N. Savinov, A. Raichuk, R. Marinier, D. Vincent, M. Pollefeys, and T. Lillicrap. Episodic curiosity through reachability. ArXiv preprint arXiv:1810.02274, 2018.
