CSC2547 Presentation: Curiosity-driven exploration (Count-based VS Info gain-based)


SLIDE 1

CSC2547 Presentation: Curiosity-driven exploration

Count-based VS Info gain-based

Sheng Jia, Tinglin Duan (first-year master's students)

SLIDE 2

  1. PLAN (2011)
  2. VIME (NeurIPS 2016)
  3. CTS (NeurIPS 2016)

SLIDE 3

Outline

  • Motivation, Related Works and Demo
  • Planning to Be Surprised
  • Variational Information Maximizing Exploration
  • Unifying Count-Based Exploration and Intrinsic Motivation
  • Comparisons and Discussion

SLIDE 4

Outline

  • Motivation, Related Works and Demo (this section)
  • Planning to Be Surprised
  • Variational Information Maximizing Exploration
  • Unifying Count-Based Exploration and Intrinsic Motivation
  • Comparisons and Discussion

SLIDE 5

Background

RL+Curiosity

[Diagram: agent-environment loop with action, next state, extrinsic reward, intrinsic reward / exploration bonus, and history]

SLIDE 6

What is exploration?

  • Intrinsic motivation [PLAN, VIME]: reducing the agent’s uncertainty over the environment’s dynamics.
  • Count-based [CTS]: use (pseudo) visitation counts to guide agents to unvisited states.
SLIDE 7

Why is exploration useful?

DEMO: our original plot & demo

[Plot: X-axis: states s_1, s_2, s_3, ..., s_T; Y-axis: training timestep; Z-axis: intrinsic reward]

Sparse-reward problem: Montezuma’s Revenge, DQN vs. DQN + exploration bonus

SLIDE 8

Related work (Timeline)

  • 1990-2010: Formal Theory of Creativity, Fun, and Intrinsic Motivation: the notion of intrinsic motivation
  • 2011: PLAN: Bayesian optimal exploration
  • 2015: Incentivizing Exploration In Reinforcement Learning With Deep Predictive Models: L2 prediction error using neural networks
  • 2016: VIME: approximate “PLAN”; CTS: pseudo-count exploration
  • 2017: Count-Based Exploration with Neural Density Models: pseudo-count + PixelCNN
  • 2018: Exploration by Random Network Distillation: distillation error as a quantification of uncertainty
  • 2019: On Bonus Based Exploration Methods In The Arcade Learning Environment: “pseudo-count from 2016 still achieves SOTA for Montezuma’s Revenge”

SLIDE 9

Outline

  • Motivation, Related Works and Demo
  • Planning to Be Surprised (this section)
  • Variational Information Maximizing Exploration
  • Unifying Count-Based Exploration and Intrinsic Motivation
  • Comparisons and Discussion

SLIDE 10

[PLAN] contribution

  • Dynamics model
  • Bayes update for the posterior distribution of the dynamics model
  • Optimal Bayesian Exploration based on: the expected cumulative info gain for τ steps if performing this action, decomposed into the expected one-step info gain plus the expected cumulative info gain for τ-1 steps if performing the next action (see the sketch below).
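A hedged sketch of this decomposition in LaTeX (our notation, not verbatim from the paper): writing $q_\tau(h, a)$ for the curious Q-value of action $a$ after history $h$ and $I(h, a)$ for the expected one-step info gain,

$$q_{\tau}(h, a) \;=\; I(h, a) \;+\; \mathbb{E}_{s' \sim p(\cdot \mid h, a)}\big[\, q_{\tau-1}(h\,a\,s', a') \,\big],$$

where $a'$ is the next action chosen by the exploratory policy (a max over actions in the optimal case), and $h\,a\,s'$ denotes the history extended by the imagined transition.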

SLIDE 11

[PLAN] Quantify “surprise” with info gain

Information gain from an observed transition = KL divergence between the posterior over the dynamics-model parameters after and before the observation:

$$\mathrm{IG}(h, a, s') \;=\; D_{\mathrm{KL}}\big(p(\theta \mid h, a, s') \,\big\|\, p(\theta \mid h)\big)$$

SLIDE 12

[PLAN] 1-step expected information gain

NOTE: VIME uses this as the intrinsic reward! Also called the “1-step expected info gain”, the “expected immediate info gain”, or the “mutual info between the next-state distribution and the model parameters”.
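As a sketch in LaTeX (our notation), the identity behind the last two names above: the expected KL between posterior and prior equals the mutual information between the next state and the model parameters,

$$\mathbb{E}_{s' \sim p(\cdot \mid h, a)}\Big[ D_{\mathrm{KL}}\big(p(\theta \mid h, a, s') \,\big\|\, p(\theta \mid h)\big) \Big] \;=\; I\big(S' ; \Theta \mid h, a\big).$$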

SLIDE 13

[PLAN] “Planning to be surprised”

  • Curious Q-value: the expected cumulative information gain over τ steps when performing an action and then following a policy.
  • It is called “planning τ steps” because those future transitions have not actually been observed yet.

SLIDE 14

[PLAN] Optimal Bayesian Exploration policy

[Method 1] Compute the optimal curiosity Q-value backwards for τ steps (see the Python sketch below).

[Method 2] Policy iteration: repeatedly apply policy evaluation and policy improvement.
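As an illustration, a minimal tabular sketch of [Method 1] in Python, assuming the posterior over dynamics is summarized by fixed predictive transition probabilities and a precomputed one-step info-gain table (a simplification: the full algorithm also propagates posterior updates along the imagined τ-step histories; all names below are hypothetical):

```python
import numpy as np

def curiosity_q_backward(trans_prob, one_step_info_gain, tau):
    """Hypothetical sketch: compute curious Q-values backwards for tau steps.

    trans_prob:          array (S, A, S'), posterior-predictive transition probabilities
    one_step_info_gain:  array (S, A), expected one-step information gain
    Returns q:           array (S, A), tau-step curious Q-values
    """
    n_states, n_actions, _ = trans_prob.shape
    q = np.zeros((n_states, n_actions))    # q_0 = 0: no steps left, nothing to learn
    for _ in range(tau):
        v = q.max(axis=1)                  # greedy curiosity value of the next state
        # q_{k+1}(s, a) = expected one-step gain + expected q_k at the imagined next state
        q = one_step_info_gain + trans_prob @ v
    return q

# Usage sketch on a toy 3-state, 2-action problem with a uniform dynamics guess
S, A = 3, 2
P = np.full((S, A, S), 1.0 / S)            # assumed predictive transitions
ig = np.random.rand(S, A) * 0.1            # assumed one-step info-gain table
print(curiosity_q_backward(P, ig, tau=5))
```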

SLIDE 15

[PLAN] Non-triviality of curious Q-value

Cumulative information gain fluctuates! Cumulative ≠ sum.

But info gain is additive in expectation!

SLIDE 16

[PLAN] Results

Compared exploration strategies:
  • Random
  • Greedy w.r.t. the expected one-step info gain
  • Q-learning using the one-step info gain
  • Policy iteration (dynamic programming approximation to optimal Bayesian exploration)

[Figure: environment with 50 states]

SLIDE 17

[PLAN] Results

SLIDE 18

Outline

  • Motivation, Related Works and Demo
  • Planning to Be Surprised
  • Variational Information Maximizing Exploration (this section)
  • Unifying Count-Based Exploration and Intrinsic Motivation
  • Comparisons and Discussion

SLIDE 19

[VIME] contribution

  • Dynamics model
  • Variational inference for the posterior distribution of the dynamics model
  • 1-step exploration bonus (sketched below)
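A sketch of the bonus in LaTeX (our notation, with $\xi_t$ the history up to time $t$): the reward for a transition is the information gained about the dynamics parameters $\theta$ from observing it,

$$r^{\text{int}}(s_t, a_t, s_{t+1}) \;=\; D_{\mathrm{KL}}\big(p(\theta \mid \xi_t, a_t, s_{t+1}) \,\big\|\, p(\theta \mid \xi_t)\big),$$

with the exact posteriors replaced by variational approximations in practice.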

SLIDE 20

[VIME] Quantify the information gained

Reminder: PLAN cumulative info gain

SLIDE 21

[VIME] Variational Bayes

What’s hard? Computing the posterior for highly parameterized models (e.g. neural networks).

Solution: approximate the posterior by minimizing the negative ELBO (see below).
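A sketch of why minimizing the negative ELBO is the right surrogate (our notation, with approximate posterior $q(\theta;\phi)$ and observed transitions $\mathcal{D}$):

$$D_{\mathrm{KL}}\big(q(\theta;\phi)\,\big\|\,p(\theta \mid \mathcal{D})\big) \;=\; -\,\mathrm{ELBO}(\phi) + \log p(\mathcal{D}), \qquad \mathrm{ELBO}(\phi) = \mathbb{E}_{q(\theta;\phi)}\big[\log p(\mathcal{D}\mid\theta)\big] - D_{\mathrm{KL}}\big(q(\theta;\phi)\,\big\|\,p(\theta)\big),$$

and $\log p(\mathcal{D})$ does not depend on $\phi$, so minimizing the KL to the true posterior is the same as minimizing the negative ELBO.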

SLIDE 22

[VIME] Optimization for Variational Bayes

How to minimize the negative ELBO efficiently? Take a single second-order (Newton) update step:
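A rough sketch of such a step (our notation; $\ell(\phi)$ is the negative ELBO, $H$ its Hessian, $\lambda$ a step size; the exact form used by VIME exploits its fully factorized Gaussian posterior):

$$\phi \;\leftarrow\; \phi \;-\; \lambda\, H^{-1} \nabla_{\phi}\, \ell(\phi).$$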

SLIDE 23

[VIME] Estimate 1-step expected info gain

What’s hard? Computing the exact one-step expected info gain. High-dimensional states → Monte Carlo estimation (sketched below).
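A generic sketch of the Monte Carlo estimate (our notation; $\phi'_{(i)}$ is the variational posterior after a hypothetical update on the $i$-th sampled next state; with $N = 1$ and the actually observed next state this reduces to a single-sample estimate):

$$\mathbb{E}_{s'}\big[\mathrm{IG}(s, a, s')\big] \;\approx\; \frac{1}{N} \sum_{i=1}^{N} D_{\mathrm{KL}}\big(q(\theta;\phi'_{(i)}) \,\big\|\, q(\theta;\phi)\big), \qquad s'_{(i)} \sim \text{learned dynamics model}.$$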

SLIDE 24

[VIME] Results (Walker-2D)

[Plot: average extrinsic return over training. Dense-reward task; RL algorithm: TRPO]

SLIDE 25

[VIME] Results (Swimmer-Gather)

[Plot: average extrinsic return over training. Sparse-reward task; RL algorithm: TRPO]

SLIDE 26

Outline

  • Motivation, Related Works and Demo
  • Planning to Be Surprised
  • Variational Information Maximizing Exploration
  • Unifying Count-Based Exploration and Intrinsic Motivation (this section)
  • Comparisons and Discussion

SLIDE 27

[CTS] contribution

States → density model → pseudo-count → 1-step exploration bonus (see the sketch below)
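As a sketch of the last arrow (constants and exact form as in the paper; restated here from memory, so treat as approximate), the intrinsic reward added to the extrinsic reward is a count-based bonus of the form

$$r^{+}(s) \;\propto\; \big(\hat{N}_n(s) + \epsilon\big)^{-1/2},$$

where $\hat{N}_n(s)$ is the pseudo-count of state $s$ after $n$ frames and $\epsilon$ is a small constant.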

SLIDE 28

[CTS] Count state visitation

  • Empirical distribution and empirical count over visited states (see below).
  • These two frames are different states (small pixel difference), but we want to increment the visitation count for both when visiting either one.
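In LaTeX (our notation): after observing states $s_1, \dots, s_n$, the empirical count and empirical distribution are

$$N_n(s) \;=\; \sum_{i=1}^{n} \mathbb{1}\{s_i = s\}, \qquad \mu_n(s) \;=\; \frac{N_n(s)}{n},$$

which is exactly what fails to generalize across near-duplicate frames, motivating a learned density model instead.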

SLIDE 29

[CTS] Introduce state density model

A density model p over states: after observing a sequence of states (e.g. s_1, s_2, s_2), it assigns a probability p(x) to any queried state x (e.g. x = s_1).

SLIDE 30

How to update the CTS density model?

Check the “Context Tree Switching” paper: https://arxiv.org/abs/1111.3182. This was the difficulty of reading this paper, as it only shows a Bayes-rule update for a mixture of density models (e.g. CTS). Remark: for the PixelCNN density model in “Count-Based Exploration with Neural Density Models”, just backprop.

SLIDE 31

[CTS] Derive pseudo-count from density model

Two constraints give a linear system; solving the linear system yields the pseudo-count (see the sketch below).
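A sketch of the two constraints and the resulting linear system (notation restated from the paper): require the density model to behave like an empirical distribution with pseudo-count $\hat{N}_n(x)$ and pseudo-total $\hat{n}$, before and after one more (hypothetical) observation of $x$:

$$\rho_n(x) \;=\; \frac{\hat{N}_n(x)}{\hat{n}}, \qquad \rho'_n(x) \;=\; \frac{\hat{N}_n(x) + 1}{\hat{n} + 1}.$$

Solving these two equations for $\hat{N}_n(x)$ gives

$$\hat{N}_n(x) \;=\; \frac{\rho_n(x)\,\big(1 - \rho'_n(x)\big)}{\rho'_n(x) - \rho_n(x)}.$$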

SLIDE 32

[CTS] Results (Montezuma’s Revenge)

  • State: 84×84×4
  • # Actions: 18
  • RL algorithm: Double DQN

SLIDE 33

Outline

  • Motivation, Related Works and Demo
  • Planning to Be Surprised
  • Variational Information Maximizing Exploration
  • Unifying Count-Based Exploration and Intrinsic Motivation
  • Summary, Comparisons and Discussion (this section)

SLIDE 34

Deriving the posterior dynamics model / density model

  • PLAN: Bayes rule
  • VIME: variational inference
  • CTS: Bayes rule (for the mixture of density models)

SLIDE 35

Derive exploratory policy

  • [PLAN]: directly argmax the curiosity Q-value.
  • [VIME]: 1-step information gain as the intrinsic reward; policy trained with the reward augmented by the intrinsic reward.
  • [CTS]: pseudo-count bonus as the intrinsic reward; policy trained with the reward augmented by the intrinsic reward.

SLIDE 36

Pseudo-count VS Intrinsic Motivation

Mixture model

Hence: “Unifying Count-Based Exploration and Intrinsic Motivation”! (See the sketch below.)
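A sketch of the connection (our restatement of the paper's analysis; the approximations hold when the probabilities involved are small): define the prediction gain $\mathrm{PG}_n(x) = \log \rho'_n(x) - \log \rho_n(x)$, a measure of how much the density model learns from one more observation of $x$. Plugging $\rho'_n(x) = \rho_n(x)\, e^{\mathrm{PG}_n(x)}$ into the pseudo-count formula gives

$$\hat{N}_n(x) \;=\; \frac{1 - \rho'_n(x)}{e^{\mathrm{PG}_n(x)} - 1} \;\approx\; \big(e^{\mathrm{PG}_n(x)} - 1\big)^{-1} \;\approx\; \mathrm{PG}_n(x)^{-1},$$

so a count-based bonus behaves like an (approximate) information-gain bonus: hence the unification.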

SLIDE 37

Limitations & Future Directions

  • PLAN: intractable posterior, and needs the dynamics model to compute expectations; difficult to scale outside tabular RL.
  • VIME: currently maximizes the sum of 1-step info gains (rather than the full cumulative info gain).
  • CTS: which density model leads to better generalization over states?
  • All three: learning rate of the policy network vs. the rate of updating the dynamics model / density model.

SLIDE 38

Thank you! (Appendix)

SLIDE 39

Our derivation for “Additive in expectation”

h’’ contains h’
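A sketch of one way to show the claim (our notation; histories $h \subseteq h' \subseteq h''$, so $h''$ contains $h'$; all expectations are under the agent's own predictive distribution over future observations): each expected KL equals a conditional mutual information between $\theta$ and the newly observed segment of history, so the chain rule of mutual information gives

$$\mathbb{E}\Big[D_{\mathrm{KL}}\big(p(\theta \mid h'') \,\|\, p(\theta \mid h)\big)\Big] \;=\; \mathbb{E}\Big[D_{\mathrm{KL}}\big(p(\theta \mid h'') \,\|\, p(\theta \mid h')\big)\Big] \;+\; \mathbb{E}\Big[D_{\mathrm{KL}}\big(p(\theta \mid h') \,\|\, p(\theta \mid h)\big)\Big].$$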
