CSC2547 Presentation: Curiosity-driven exploration
Count-based vs. info-gain-based
Sheng Jia, Tinglin Duan (first-year master's students)
Papers covered: 1. PLAN (2011), 2. VIME (NeurIPS 2016), 3. CTS (NeurIPS 2016)

Outline:
1. Motivation, Related Works and Demo
2. Planning to Be Surprised (PLAN)
3. Variational Information Maximizing Exploration (VIME)
4. Unifying Count-Based Exploration and Intrinsic Motivation (CTS)
5. Comparisons and Discussion

Section 1: Motivation, Related Works and Demo
[Diagram] Agent-environment loop: given the history, the agent takes an action; it observes the next state and the extrinsic reward, and an intrinsic reward / exploration bonus is added to the reward.
Intrinsic motivation (info-gain-based): [PLAN], [VIME]. Count-based: [CTS].
DEMO: our original plot & demo.
[Plot] X-axis: states s_1, s_2, s_3, ..., s_T; Y-axis: training timestep (the intrinsic reward function at that point in training); Z-axis: intrinsic reward.
The sparse-reward problem: Montezuma's Revenge, DQN vs. DQN + exploration bonus.
Timeline:
2010: Formal Theory of Creativity, Fun, and Intrinsic Motivation (1990-2010): the notion of intrinsic motivation.
2011: PLAN: Bayesian optimal exploration.
2015: Incentivizing Exploration in Reinforcement Learning with Deep Predictive Models: L2 prediction error using neural networks.
2016: VIME: approximate "PLAN". CTS: pseudo-count exploration.
2017: Count-Based Exploration with Neural Density Models: pseudo-count + PixelCNN.
2018: Exploration by Random Network Distillation: distillation error as a quantification of uncertainty.
2019: On Bonus-Based Exploration Methods in the Arcade Learning Environment: "the 2016 pseudo-count still achieves SOTA for Montezuma's Revenge."
Section 2: Planning to Be Surprised (PLAN)
Dynamics model: a Bayes update gives the posterior distribution over the dynamics model. Optimal Bayesian exploration is based on:
Recursion for the curiosity Q-value (a hedged reconstruction follows below):
expected cumulative info gain for τ steps if performing this action
= expected one-step info gain
+ expected cumulative info gain for τ-1 steps if performing the next action
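One way to write this recursion (a sketch in standard notation, which may differ from the paper's symbols: h is the history, a an action, s' the next state, θ the dynamics-model parameters):

```latex
q_{\tau}(h, a) \;=\; \mathbb{E}_{s' \sim p(\cdot \mid h, a)}
\Big[\, \underbrace{\mathrm{KL}\big(p(\theta \mid h, a, s') \,\|\, p(\theta \mid h)\big)}_{\text{one-step info gain}}
\;+\; \max_{a'} q_{\tau-1}(h\,a\,s', a') \,\Big], \qquad q_{0} \equiv 0
```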
NOTE: VIME uses this one-step expected info gain (the "expected immediate info gain") as the intrinsic reward; it equals the mutual information between the next-state distribution and the model parameters.
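As a worked identity (standard, not specific to these slides): the one-step expected info gain is the conditional mutual information between the model parameters and the next state,

```latex
I(\Theta; S' \mid h, a) \;=\; \mathbb{E}_{s' \sim p(\cdot \mid h, a)}
\big[\, \mathrm{KL}\big(p(\theta \mid h, a, s') \,\|\, p(\theta \mid h)\big) \,\big]
```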
Curious Q-value: perform an action now, then follow a policy afterwards.
This is "planning τ steps" because the agent is not actually executing these steps in the environment; the cumulative τ-step info gain is evaluated under the current model.
[Method 1] Compute the optimal curiosity Q-value backwards for τ steps (a toy sketch follows below).
[Method 2] Policy iteration: repeatedly apply policy evaluation and policy improvement.
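A toy sketch of Method 1 under simplifying assumptions: a tabular MDP with an independent Dirichlet posterior over next-state probabilities for each (state, action) pair, and, as a further approximation, the posterior is not rolled forward along imagined transitions. Sizes, the prior, and helper names are illustrative, not from the paper.

```python
# Backward induction on the curiosity Q-value for tau planning steps (toy tabular sketch).
import numpy as np
from scipy.special import digamma, gammaln

def dirichlet_kl(alpha_post, alpha_prior):
    """KL( Dir(alpha_post) || Dir(alpha_prior) ): info gain of one Bayes update."""
    a0, b0 = alpha_post.sum(), alpha_prior.sum()
    return (gammaln(a0) - gammaln(b0)
            - np.sum(gammaln(alpha_post) - gammaln(alpha_prior))
            + np.sum((alpha_post - alpha_prior) * (digamma(alpha_post) - digamma(a0))))

def curiosity_q(alpha, tau):
    """
    alpha[s, a] is the Dirichlet concentration vector over next states for (s, a).
    Returns q[s, a]: expected cumulative info gain over the next tau imagined steps,
    keeping the posterior fixed (a simplification of the full recursion).
    """
    n_states, n_actions, _ = alpha.shape
    q = np.zeros((n_states, n_actions))
    for _ in range(tau):                                   # q_1, q_2, ..., q_tau
        q_next = np.zeros_like(q)
        for s in range(n_states):
            for a in range(n_actions):
                probs = alpha[s, a] / alpha[s, a].sum()    # predictive next-state distribution
                total = 0.0
                for s_next, p in enumerate(probs):
                    alpha_post = alpha[s, a].copy()
                    alpha_post[s_next] += 1.0              # Bayes update after seeing s_next
                    one_step_gain = dirichlet_kl(alpha_post, alpha[s, a])
                    total += p * (one_step_gain + q[s_next].max())
                q_next[s, a] = total
        q = q_next
    return q

# Toy usage: 5 states, 2 actions, uniform Dirichlet(1) prior, plan 3 steps ahead.
alpha = np.ones((5, 2, 5))
print(curiosity_q(alpha, tau=3))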
Cumulative information gain fluctuates! The cumulative info gain is not the sum of the one-step gains.
However, info gain is additive in expectation (for a history h'' that extends h').
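One way to state the additivity-in-expectation identity (a standard consequence of the Bayesian posterior being a martingale; notation may differ from the paper): for h' extending h and h'' extending h',

```latex
\mathbb{E}_{h'' \mid h'}\!\big[ \mathrm{KL}\big(p(\theta \mid h'') \,\|\, p(\theta \mid h)\big) \big]
\;=\; \mathrm{KL}\big(p(\theta \mid h') \,\|\, p(\theta \mid h)\big)
\;+\; \mathbb{E}_{h'' \mid h'}\!\big[ \mathrm{KL}\big(p(\theta \mid h'') \,\|\, p(\theta \mid h')\big) \big]
```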
Compared exploration strategies:
- Random
- Greedy w.r.t. the expected one-step info gain
- Policy iteration (dynamic-programming approximation to optimal Bayesian exploration)
- Q-learning using the one-step info gain
[Environment diagram] Tabular environment with 50 states.
Section 3: Variational Information Maximizing Exploration (VIME)
Dynamics model: variational inference for the posterior distribution over the dynamics model.
VIME uses the 1-step info gain as an exploration bonus added to the extrinsic reward (see the form below).
Reminder: PLAN instead plans with the cumulative info gain.
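A sketch of the augmented reward, up to notation (q(θ; φ) is the variational posterior, φ_{t+1} the variational parameters after updating on the newly observed transition, η a scaling hyperparameter):

```latex
r'(s_t, a_t, s_{t+1}) \;=\; r(s_t, a_t) \;+\; \eta\, D_{\mathrm{KL}}\!\big[\, q(\theta; \phi_{t+1}) \,\|\, q(\theta; \phi_t) \,\big]
```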
What’s hard?
Computing the posterior is hard for highly parameterized models (e.g. neural networks). Approximate the posterior by minimizing the negative ELBO.
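In its standard variational-inference form (D is the observed transition data, p(θ) the prior, q(θ; φ) the variational posterior):

```latex
-\mathrm{ELBO}(\phi; D) \;=\; \mathrm{KL}\big(q(\theta; \phi) \,\|\, p(\theta)\big)
\;-\; \mathbb{E}_{q(\theta; \phi)}\big[ \log p(D \mid \theta) \big]
```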
How to minimize the negative ELBO? Take a single, efficient second-order (Newton) update step:
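In generic textbook form (writing ℓ(φ) for the negative ELBO; this is the plain Newton step, not the paper's exact derivation of an efficient variant):

```latex
\phi_{t+1} \;=\; \phi_t \;-\; H^{-1}\!\big(\ell(\phi_t)\big)\, \nabla_{\phi}\, \ell(\phi_t)
```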
What's hard? Computing the exact one-step expected info gain: with high-dimensional states, it is approximated by Monte Carlo estimation.
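A minimal sketch (not VIME's actual implementation) of the bonus for a single observed transition, assuming a fully factorized Gaussian posterior over the weights of a toy one-layer dynamics model; the architecture, loss, learning rate, and the first-order update (the paper uses the second-order step above) are all illustrative assumptions.

```python
# Info gain of one posterior update on an observed transition, used as intrinsic reward.
import torch
import torch.nn.functional as F

def diag_gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """Closed-form KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal Gaussians."""
    var_q, var_p = logvar_q.exp(), logvar_p.exp()
    return 0.5 * torch.sum(logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

# Variational parameters phi = (mu, logvar) of the weights of a toy one-layer dynamics
# model predicting s_{t+1} from (s_t, a_t); dimensions are toy values.
in_dim, out_dim = 6, 4
mu = torch.zeros(out_dim, in_dim, requires_grad=True)
logvar = torch.full((out_dim, in_dim), -3.0, requires_grad=True)

def transition_loss(s_a, s_next):
    """One-sample estimate of the data term of the negative ELBO for this transition."""
    eps = torch.randn_like(mu)
    w = mu + (0.5 * logvar).exp() * eps        # reparameterized weight sample
    pred = s_a @ w.t()                         # predicted next state
    return F.mse_loss(pred, s_next)            # stand-in for -log p(s' | s, a, theta)

s_a = torch.randn(1, in_dim)                   # observed (state, action)
s_next = torch.randn(1, out_dim)               # observed next state

mu_old, logvar_old = mu.detach().clone(), logvar.detach().clone()
transition_loss(s_a, s_next).backward()
with torch.no_grad():                          # single first-order step (simplification)
    mu -= 1e-2 * mu.grad
    logvar -= 1e-2 * logvar.grad

# Intrinsic reward = KL between the updated and the previous posterior.
r_int = diag_gaussian_kl(mu.detach(), logvar.detach(), mu_old, logvar_old)
print(float(r_int))
```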
[Results] Average extrinsic return on dense-reward tasks (RL algorithm: TRPO).
[Results] Average extrinsic return on sparse-reward tasks (RL algorithm: TRPO).
Section 4: Unifying Count-Based Exploration and Intrinsic Motivation (CTS)
Pipeline: states → density model → pseudo-count → 1-step exploration bonus.
Empirical distribution: the empirical count N_n(x) of state x among the first n observed states gives μ_n(x) = N_n(x) / n; e.g. for x_{1:3} = (s1, s2, s2), μ_3(s1) = 1/3.
Problem: two frames that differ only by a small pixel difference are different states, but we want visiting either one to increment the visitation count of both. Hence replace the empirical count with a density model that generalizes across states.
Check the "Context Tree Switching" paper! https://arxiv.org/abs/1111.3182 This was the difficulty of reading this paper, since it only shows a Bayes-rule update for mixtures of density models (e.g. CTS). Remark: for the PixelCNN density model in "Count-Based Exploration with Neural Density Models", the update is just backprop.
Two constraints give a linear system in the pseudo-count N̂_n(x) and pseudo-count total n̂:
ρ_n(x) = N̂_n(x) / n̂  (density assigned to x before observing it)
ρ'_n(x) = (N̂_n(x) + 1) / (n̂ + 1)  (recoding probability: density after observing x)
Solving the linear system, the pseudo-count is derived:
N̂_n(x) = ρ_n(x) (1 - ρ'_n(x)) / (ρ'_n(x) - ρ_n(x))
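A toy sketch of the resulting pseudo-count and bonus computation, assuming a trivial per-state density model in place of CTS/PixelCNN; the smoothing, bonus scale, and class names are illustrative.

```python
# Turn a density model's before/after probabilities into a pseudo-count and a bonus.
import math
from collections import Counter

class CountingDensityModel:
    """Toy density model: Laplace-smoothed empirical distribution over discrete states."""
    def __init__(self, n_symbols):
        self.counts = Counter()
        self.total = 0
        self.n_symbols = n_symbols

    def prob(self, x):
        return (self.counts[x] + 1) / (self.total + self.n_symbols)

    def update(self, x):
        self.counts[x] += 1
        self.total += 1

def pseudo_count(model, x):
    rho = model.prob(x)          # density before observing x
    model.update(x)
    rho_prime = model.prob(x)    # recoding probability: density after observing x
    # Solve rho = N/n, rho' = (N+1)/(n+1) for N (the linear system above).
    return rho * (1.0 - rho_prime) / (rho_prime - rho)

def exploration_bonus(n_hat, beta=0.05):
    # Bonus of the form beta * (N_hat + 0.01)^(-1/2); beta is a tunable scale.
    return beta / math.sqrt(n_hat + 0.01)

model = CountingDensityModel(n_symbols=10)
for state in [3, 3, 7, 3, 7, 1]:
    n_hat = pseudo_count(model, state)
    print(state, round(n_hat, 3), round(exploration_bonus(n_hat), 3))
```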
Experiment setup: state 84x84x4, 18 actions, RL algorithm: Double DQN.
Section 5: Comparisons and Discussion
Comparison:
[PLAN] Bayes-rule update of the dynamics-model posterior; explores by directly taking argmax over the curiosity Q-value.
[VIME] Variational inference for the dynamics-model posterior; 1-step information gain as intrinsic reward; the policy is trained with the reward augmented by the intrinsic reward.
[CTS] Bayes-rule update of the density model (a mixture model); pseudo-count bonus as intrinsic reward; the policy is trained with the augmented reward.
Hence the title: "Unifying Count-Based Exploration and Intrinsic Motivation"!
Discussion / limitations:
[PLAN] Intractable posterior, and the expectation requires the dynamics model → difficult to scale beyond tabular RL.
[VIME] Currently maximizes the sum of 1-step info gains (rather than the cumulative info gain).
[CTS] Which density model leads to better generalization over states?
General: learning rate of the policy network vs. the rate of updating the dynamics/density model.