Model-Based Active Exploration
Pranav Shyam, Wojciech Jaskowski, Faustino Gomez
arxiv.org/abs/1810.12162
Presentation by Danijar Hafner

Reinforcement Learning

[Diagram: a learning agent, equipped with an objective and an algorithm, receives sensor input from an unknown environment and sends back motor output.]

Intrinsic Motivation

[Diagram: the same loop, but the objective comes from inside the learning agent rather than being given externally.]
Many Intrinsic Objectives
Information gain, e.g. Lindley 1956, Sun 2011, Houthooft 2017
Prediction error, e.g. Schmidhuber 1991, Bellemare 2016, Pathak 2017
Empowerment, e.g. Klyubin 2005, Tishby 2011, Gregor 2016
Skill discovery, e.g. Eysenbach 2018, Sharma 2020, Co-Reyes 2018
Surprise minimization, e.g. Schrödinger 1944, Friston 2013, Berseth 2020
Bayes-adaptive RL, e.g. Gittins 1979, Duff 2002, Ross 2007
Information Gain

Without rewards, the agent can only learn about the environment. A model W represents our knowledge, e.g. an input density or a forward prediction model. We need to represent uncertainty about W to tell how much we have learned.

Data collection updates the belief over the model from the prior p(W) to the posterior p(W | X). To gain the most information, we aim to maximize the mutual information between future sensory inputs X and the model parameters W:

max_a I(X; W | A=a)

Both W and X are random variables.
Expected Infogain (e.g. MAX, PETS-ET, LD)

Need to search for actions that will lead to high information gain, without additional environment interaction. Learn a forward model of the environment to search for actions by planning or learning in imagination. Computing the expected information gain requires computing entropies of a model with uncertainty estimates.

Retrospective Infogain (e.g. VIME, ICM, RND)

Collect episodes, train the world model, record the improvement, and reward the controller by this improvement. The infogain depends on the agent's knowledge, which keeps changing, making it a non-stationary objective. The learned controller lags behind and goes to states that were previously novel but are not anymore.

I(X; W | A=a) = E_{X ~ p(X | A=a)} KL[ p(W | X, A=a) || p(W | A=a) ]
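The mutual information can be written equivalently as an expected KL over the posterior or as an entropy difference; the latter form is what the ensemble approximation later evaluates. This is a standard identity, using the symmetry of mutual information:

```latex
\begin{align}
I(X; W \mid A{=}a)
  &= \mathbb{E}_{X \sim p(X \mid A=a)}
     \big[ \mathrm{KL}\left[\, p(W \mid X, A{=}a) \,\|\, p(W \mid A{=}a) \,\right] \big] \\
  &= H(W \mid A{=}a) - H(W \mid X, A{=}a) \\
  &= H(X \mid A{=}a) - H(X \mid W, A{=}a).
\end{align}
```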
Retrospective Novelty

Episode 1: everything unknown; random behavior; high novelty; reinforce the behavior.
Episode 2: repeat the behavior; reach similar states; not surprising anymore; unlearn the behavior.
Episode 3: repeat the behavior; still not novel; unlearn the behavior.
Episode 4: back to random behavior.

The agent builds a map of where it was already and avoids those states.
Expected Novelty

Episode 1: everything unknown; consider options; execute the plan; observe new data.
Episode 2: consider options; execute the plan; observe new data.
Ensemble of Dynamics Models

Learn dynamics both to represent knowledge and to plan for expected infogain. Capture uncertainty as an ensemble of non-linear Gaussian predictors.

I(X; W | A=a) = H(X | A=a) − H(X | W, A=a)
epistemic uncertainty = total uncertainty − aleatoric uncertainty

Information gain targets uncertain trajectories with low expected noise:
- Wide predictions mean high expected noise; overlapping modes mean less total uncertainty.
- Narrow predictions mean low expected noise; distant modes mean large total uncertainty.
Expected Infogain Approximation

I(X; W | A=a) = H(X | A=a) − H(X | W, A=a)
epistemic uncertainty = total uncertainty − aleatoric uncertainty

Ensemble members: p(X | W=w_k, A=a)
Aggregate prediction: p(X | A=a) = 1/K Σ_k p(X | W=w_k, A=a)
Aleatoric uncertainty: H(X | W, A=a) ≈ 1/K Σ_k H(p(X | W=w_k, A=a))
Total uncertainty: H(X | A=a) ≈ H(1/K Σ_k p(X | W=w_k, A=a))

Gaussian entropy has a closed form, so we can compute the aleatoric uncertainty. The entropy of the Gaussian mixture does not; estimate it by sampling, or switch to the Rényi entropy, which has a closed form.
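The sampling-based variant of this estimate can be sketched for diagonal-Gaussian ensemble members: the aleatoric term uses the closed-form Gaussian entropy, while the mixture entropy is estimated by Monte Carlo. This is a minimal illustration, not the paper's implementation; all names and shapes are assumptions.

```python
# Sketch: expected infogain I(X; W | A=a) for an ensemble of K diagonal
# Gaussian predictors, as (mixture entropy) - (mean member entropy).
import numpy as np

def gaussian_entropy(stds):
    """Closed-form entropy of a diagonal Gaussian: sum_d 0.5*log(2*pi*e*sigma_d^2)."""
    return np.sum(0.5 * np.log(2 * np.pi * np.e * stds ** 2), axis=-1)

def expected_infogain(means, stds, num_samples=1000, seed=0):
    """means, stds: arrays of shape (K, D), one Gaussian per ensemble member.

    The mixture entropy has no closed form, so it is estimated by sampling
    from the uniform mixture over members.
    """
    rng = np.random.default_rng(seed)
    K, D = means.shape
    # Aleatoric uncertainty: average entropy of the individual members.
    aleatoric = np.mean(gaussian_entropy(stds))
    # Sample from the mixture: pick a member uniformly, then sample from it.
    ks = rng.integers(K, size=num_samples)
    xs = means[ks] + stds[ks] * rng.standard_normal((num_samples, D))
    # Log-density of each sample under each member, then under the mixture.
    diffs = (xs[:, None, :] - means[None]) / stds[None]            # (N, K, D)
    log_members = -0.5 * np.sum(
        diffs ** 2 + np.log(2 * np.pi * stds[None] ** 2), axis=-1)  # (N, K)
    log_mixture = np.logaddexp.reduce(log_members, axis=1) - np.log(K)
    # Total uncertainty: Monte Carlo estimate of the mixture entropy.
    total = -np.mean(log_mixture)
    return total - aleatoric
```

When the members agree, the mixture collapses to a single Gaussian and the estimate is near zero; when their modes are far apart, the estimate approaches log K, matching the "distant modes mean large total uncertainty" picture.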
Compared Algorithms
Learning from imagined trajectories (expected):
- MAX: JSD infogain
- TVAX: state variance

Learning from experience replay (retrospective):
- JDRX: JSD infogain
- PERX: prediction error
Exploration Chain Domain
[Diagram: a chain environment with a small reward (+0.001) at the near end and a large reward (+1) at the far end.]
State coverage of Ant Maze
[Plot: state coverage over time, compared against model-free exploration given 10x the data.]
Zero-Shot Adaptation
Learn an evaluation policy inside the learned model, given a known reward function.

[Plots: tasks where no exploration is needed vs. tasks where exploration is needed.]
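One generic way to obtain such an evaluation policy is to plan inside the learned dynamics model against the given reward function. This is a minimal cross-entropy-method sketch under assumed interfaces (`model`, `reward_fn` are illustrative stand-ins, not the paper's actual setup):

```python
# Sketch: zero-shot adaptation by planning action sequences in imagination
# with a known reward function, using the cross-entropy method (CEM).
import numpy as np

def plan_cem(model, reward_fn, state, horizon=10, action_dim=2,
             pop=64, elites=8, iters=5, seed=0):
    """Search over action sequences inside the learned model.

    model(s, a) -> next state (learned dynamics, assumed deterministic here).
    reward_fn(s) -> scalar reward (known at evaluation time).
    """
    rng = np.random.default_rng(seed)
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(iters):
        # Sample candidate action sequences from the current distribution.
        actions = mean + std * rng.standard_normal((pop, horizon, action_dim))
        returns = np.zeros(pop)
        for i in range(pop):
            s = state
            for t in range(horizon):
                s = model(s, actions[i, t])   # imagined transition
                returns[i] += reward_fn(s)    # evaluate with the known reward
        # Refit the sampling distribution to the highest-return sequences.
        elite = actions[np.argsort(returns)[-elites:]]
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mean[0]  # execute only the first action, then replan
```

No environment interaction is needed at this stage: the exploration phase has already produced the model, and the reward function is only consulted inside imagination.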
Conclusions
- Information gain is a principled, task-agnostic objective.
- As a non-stationary objective, it should be optimized in expectation rather than retrospectively.
- This requires a dynamics model for planning to explore.
- An ensemble of Gaussian dynamics models is a practical way to represent uncertainty.