
SLIDE 1

[Figure: a graphical model of agent-environment interaction over time, with states s_{t-1}, s_t, s_{t+1}, actions a_{t-1}, a_t, a_{t+1}, observed data x_{t-1}, x_t, x_{t+1}, and model variables m_{t-1}, m_t, m_{t+1}; shown alongside the agent architecture: an external environment, an internal environment, and a planner built from an option knowledge base (KB), a critic, and a state representation, exchanging observations and actions.]

A Case Against Generative Models for Reinforcement Learning?

Generative models for RL workshop, DALI 2018 | @shakir_za | shakir@deepmind.com

Shakir Mohamed

SLIDE 2

Perception and Action

[Figure: the perception-action loop in the brain. Observation/sensation from the environment enters the primary sensory cortex, flows through the sensory association, posterior association, prefrontal, premotor, and primary motor cortices, and returns to the environment as action.]

SLIDE 3

Perception and Action

[Figure: the brain loop above, mirrored by an agent. An external environment and an internal environment surround a planner built from an option knowledge base (KB), a critic, and a state representation; a state embedding links observations/sensations to options and actions.]
SLIDE 4

Generative RL

What makes something a generative model? Much emphasis is placed on generative models of images: a false sense of progress? We do not know how to usefully learn hierarchical models, so any hierarchical reasoning will be hard. Is anything ever truly unsupervised? There is always a context, and conditioning on contexts other than labels is hard. Inference is hard. Are we attempting to solve a more difficult problem first, instead of solving the task directly?
SLIDE 5

Generative Processes

[Figure: an action prior p(a) feeding the environment or model, which yields the return likelihood p(R(s,a)).]

The environment is the generative process: an unknown likelihood, not known analytically, whose outcomes we can only observe.

All the key inferential questions can now be asked in this simple framework, whose ingredients are a prior over actions, interaction with the environment only, and the long-term reward.
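A hedged reconstruction of this framing: treat the action as the latent variable and the return as the observation, so the basic inferential question is the posterior over actions.

```latex
% Hedged reconstruction of the slide's framing: actions are latent,
% the return is the observation, and the environment is the likelihood.
p(a \mid R) = \frac{p(R \mid a)\, p(a)}{p(R)},
\qquad
p(R) = \int p(R \mid a)\, p(a)\, \mathrm{d}a .
```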

SLIDE 6


Planning-as-Inference

Simplest question: what is the posterior distribution over actions? Equivalently, maximise the probability of the return, log p(R).

This recovers policy search methods: a uniform prior, continuous policy parameters, and an environment that can be evaluated but not differentiated through.


Variational inference in the hierarchical model:
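The equation that presumably followed on the slide is the standard variational bound; a reconstruction consistent with the prior p(a) and likelihood p(R | a) defined above:

```latex
% Free energy F(q): a lower bound on log p(R) for any variational
% policy q(a), recovering the entropy-regularised objective.
\log p(R)
  \;\ge\;
  \mathbb{E}_{q(a)}\!\left[\log p(R \mid a)\right]
  - \mathrm{KL}\!\left(q(a)\,\|\,p(a)\right)
  \;=\; -\mathcal{F}(q).
```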

SLIDE 7

Planning-as-Inference

The entropy penalty appears naturally, and alternative priors are easy to consider. Prior knowledge of the action space is easily incorporated. Any of the available tools of probabilistic inference can be used. Stochastic and deterministic policies are both handled easily.


Free energy: policy gradient using the score-function gradient.
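A minimal sketch of that estimator, assuming a Gaussian variational policy q(a) = N(mu, sigma^2 I), a standard-normal action prior, and a toy quadratic return; `environment_return` and all other names are hypothetical, not the talk's code:

```python
# Score-function (REINFORCE) gradient of the free energy
# F = -E_q[log p(R|a)] + KL(q || N(0, I)), for a black-box environment.
import numpy as np

def environment_return(a):
    """Toy black-box environment: evaluable but not differentiable."""
    return -np.sum((a - 1.5) ** 2)          # log p(R|a) up to a constant

def free_energy_gradient(mu, log_sigma, n_samples=256):
    sigma = np.exp(log_sigma)
    eps = np.random.randn(n_samples, mu.size)
    actions = mu + sigma * eps               # samples from q(a)
    log_p_R = np.array([environment_return(a) for a in actions])

    # Score functions of the Gaussian q w.r.t. mu and log_sigma.
    score_mu = eps / sigma
    score_ls = eps ** 2 - 1.0

    # REINFORCE term with a mean baseline, plus the analytic
    # gradient of KL(N(mu, sigma^2) || N(0, I)).
    w = log_p_R - log_p_R.mean()
    grad_mu = -(w[:, None] * score_mu).mean(0) + mu
    grad_ls = -(w[:, None] * score_ls).mean(0) + (sigma ** 2 - 1.0)
    return grad_mu, grad_ls

mu, log_sigma = np.zeros(2), np.zeros(2)
for _ in range(500):                         # gradient descent on F
    g_mu, g_ls = free_energy_gradient(mu, log_sigma)
    mu -= 0.05 * g_mu
    log_sigma -= 0.05 * g_ls
print(mu)  # pulled from the reward peak at 1.5 toward the prior: about [1.0, 1.0]
```

Note how the entropy/KL term enters analytically while the environment contributes only through sampled returns, which is exactly why the environment need not be differentiable.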

SLIDE 8

Planning-as-Inference

[Figure: the agent architecture (external environment, internal environment, planner with option KB, critic, and state representation) alongside the action prior p(a) and environment/model likelihood p(R(s,a)), as before.]

With a more realistic expansion as a graphical model: Bellman's equation can be derived as a rewriting of message passing; application of the EM algorithm to policy search becomes possible (a sketch follows); other variational methods, such as EP, are easy to consider; and both model-free and model-based methods emerge.
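One hedged reading of the EM claim, in the style of Toussaint and Storkey (2006): alternate between reweighting trajectories by their return probability and refitting the policy to the reweighted behaviour.

```latex
% E-step: reweight trajectories tau by how probable their return is;
% M-step: refit the policy parameters to the reweighted behaviour.
\text{E-step:}\quad
  q(\tau) \propto p(R \mid \tau)\, p_{\pi_{\text{old}}}(\tau),
\qquad
\text{M-step:}\quad
  \pi_{\text{new}} = \arg\max_{\pi}\ \mathbb{E}_{q(\tau)}\!\left[\log p_{\pi}(\tau)\right].
```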

SLIDE 9

Generative RL


Inference is already hard. Do we gain additional benefit?

Any hyperparameters can be learnt, but simpler and competitive methods exist. Quantification of uncertainty helps drive natural exploration, yet uncertainty is often not used and is easy to obtain in other ways.

Parameter inference is already difficult in non-RL settings, and the RL case can be computationally more demanding.

SLIDE 10

Model-based RL


SLIDE 11

Model-based RL

Learn a model of the environment and use it in all planning. An internal simulator limits interactions with the environment, for both safety and planning. Long-term predictions allow for better planning. Learning is data-efficient, especially when experience is expensive to obtain. Prior knowledge of the environment can be easily incorporated. (A minimal sketch of this loop follows the figure below.)

Chiappa et al. (2017)

[Figure: the environment-model graphical structure over time: data x_{t-1}, x_t, x_{t+1}, states s_{t-1}, s_t, s_{t+1}, actions a_{t-1}, a_t, a_{t+1}, and model variables m_{t-1}, m_t, m_{t+1}.]
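To make the loop concrete, here is a minimal sketch under toy assumptions: scalar linear dynamics, a least-squares one-step model, and random-shooting model-predictive control. `true_env_step`, `model_step`, and `plan` are hypothetical names; this is not the simulator of Chiappa et al. (2017).

```python
# Model-based loop: fit a dynamics model on logged transitions,
# then plan by rolling the learned model forward.
import numpy as np

rng = np.random.default_rng(0)

def true_env_step(s, a):
    return 0.9 * s + a + 0.01 * rng.standard_normal()   # unknown dynamics

# 1. Collect experience and fit s' ~ w0*s + w1*a by least squares.
S, A, S_next = [], [], []
s = 0.0
for _ in range(200):
    a = rng.uniform(-1, 1)
    s2 = true_env_step(s, a)
    S.append(s); A.append(a); S_next.append(s2)
    s = s2
X = np.column_stack([S, A])
w, *_ = np.linalg.lstsq(X, np.array(S_next), rcond=None)

def model_step(s, a):
    return w[0] * s + w[1] * a                          # internal simulator

# 2. Random-shooting planning: pick the action sequence whose
#    imagined rollout stays closest to a target state.
def plan(s0, target=1.0, horizon=10, n_candidates=500):
    best_seq, best_cost = None, np.inf
    for _ in range(n_candidates):
        seq = rng.uniform(-1, 1, size=horizon)
        s, cost = s0, 0.0
        for a in seq:
            s = model_step(s, a)
            cost += (s - target) ** 2
        if cost < best_cost:
            best_seq, best_cost = seq, cost
    return best_seq[0]                                  # MPC: execute first action

print(plan(s0=0.0))
```

All planning here touches only the learned model, which is the safety and data-efficiency argument; it is also why every model error feeds directly into the chosen actions.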

SLIDE 12


Model-based RL

[Figures: exploration; physical and temporal consistency.]

SLIDE 13


Generative RL

We need highly flexible models to account for different regimes, and it is difficult to develop a general-purpose model-learning approach. It is hard to specify the model that best captures the data. Results are for the most part limited to small domains, limited complexity, and limited consistency. Arguments in favour rely on linear models or low dimensions. Finding the best solution in an environment requires continuous exploration, continuous data collection, and continuous model updating. It is even harder to learn models in partially observed scenarios, and models must be learnt from data obtained under changing policies.

The agent is only as good as the model that is learnt. It is computationally more expensive than model-free methods. There are two sources of error: the model and any value estimate. The reward model must also be learnt, which is very hard.

SLIDE 14

Data-efficient Learning

The trend is to use lots of computation, coupled with environments that can be parallelised.

OpenAI evolution strategies (2017)
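A hedged sketch of the evolution-strategies estimator the slide cites (Salimans et al., 2017): perturb the parameters with Gaussian noise, evaluate returns (in parallel, in the real system), and step along the return-weighted noise. `fitness` and `es_step` are hypothetical names.

```python
# Evolution strategies: a score-function gradient estimate in
# parameter space, needing only black-box return evaluations.
import numpy as np

def es_step(theta, fitness, sigma=0.1, lr=0.02, n_workers=64):
    eps = np.random.randn(n_workers, theta.size)        # one perturbation per worker
    returns = np.array([fitness(theta + sigma * e) for e in eps])
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)
    grad = (returns[:, None] * eps).mean(0) / sigma     # noise-weighted average
    return theta + lr * grad

theta = np.zeros(5)
for _ in range(300):
    theta = es_step(theta, fitness=lambda p: -np.sum((p - 2.0) ** 2))
print(theta)                                            # approaches [2, 2, 2, 2, 2]
```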

Model error propagates: learning robust models and reducing uncertainty requires a lot of data, which works against the data-efficiency argument. When model learning succeeds, we often use model-free methods to train initially and provide good data.

SLIDE 15

Intrinsic Motivation


Mohamed and Rezende (2015)

Generative models can drive behaviour in the absence of rewards, via complex probabilistic quantities such as information gain or mutual information.
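The mutual-information quantity can be written down concretely. A sketch of the variational lower bound used in Mohamed and Rezende (2015), where a learned inverse model q(a | s', s) replaces the intractable posterior over actions:

```latex
% Variational lower bound on the mutual information between actions
% and future states, the basis of the empowerment intrinsic reward.
I(a; s' \mid s)
  \;\ge\;
  \mathbb{E}_{\pi(a \mid s)\, p(s' \mid s, a)}
  \left[ \log q(a \mid s', s) - \log \pi(a \mid s) \right].
```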
SLIDE 16

Generative RL

Using generative models to drive behaviour in the absence of rewards has costs: computing complex probabilistic quantities involves approximations that impact policy learning, data efficiency, and safety; environment models must themselves be learnt, with all the difficulty that entails; they add to the burden of data needed to learn the reward structure; and applications of these approaches are simplistic at present.

SLIDE 17

Valid Critique?

The arguments rely on the difficulty of using generative models and learning complex probabilistic quantities given current tools. There is stronger support for model-free methods, since they side-step many of the challenges of model learning. There are serious challenges to learning reliable, rapidly adapting, data-efficient, and general-purpose models for use in practice. Uncertainty is used in limited ways, but adds a great deal of complexity. The types of environments and problems being addressed matter. Integrated systems.

SLIDE 18

It is not possible to argue against probabilistic approaches to RL.

Our challenge is to show that the principles we develop have a rich theory, apply in practice, and can be deployed in flexible ways.

SLIDE 19


A Case Against Generative Models for Reinforcement Learning?

Generative models for RL workshop, DALI 2018 | @shakir_za | shakir@deepmind.com

Shakir Mohamed