
SLIDE 1

[Figure: a graphical model of agent-environment interaction over time, with states s_{t-1}, s_t, s_{t+1}, actions a_{t-1}, a_t, a_{t+1}, observed data x_{t-1}, x_t, x_{t+1}, and model variables m_{t-1}, m_t, m_{t+1}; shown alongside the agent architecture: an external environment, an internal environment, and a planner built from an option knowledge base (KB), a critic, and a state representation, exchanging observations and actions.]

A Case Against Generative Models for Reinforcement Learning?

Generative models for RL workshop, DALI 2018 | @shakir_za | shakir@deepmind.com

Shakir Mohamed

SLIDE 2

Perception and Action

[Figure: the perception-action loop in the brain. Observation/sensation from the environment enters the primary sensory cortex, flows through the sensory association, posterior association, prefrontal, premotor, and primary motor cortices, and returns to the environment as action.]

SLIDE 3

Perception and Action

[Figure: the brain loop above, mirrored by an agent. An external environment and an internal environment surround a planner built from an option knowledge base (KB), a critic, and a state representation; a state embedding links observations/sensations to options and actions.]
SLIDE 4

Generative RL

What makes something a generative model? Much emphasis is placed on generative models of images: a false sense of progress? We do not know how to usefully learn hierarchical models, so any hierarchical reasoning will be hard. Is anything ever truly unsupervised? There is always a context, and conditioning on contexts other than labels is hard. Inference is hard. Are we attempting to solve a more difficult problem first, instead of solving the task directly?
SLIDE 5

Generative Processes

[Figure: an action prior p(a) feeding the environment or model, which yields the return likelihood p(R(s,a)).]

The environment is the generative process: an unknown likelihood, not known analytically, whose outcomes we can only observe.

All the key inferential questions can now be asked in this simple framework, whose ingredients are a prior over actions, interaction with the environment only, and the long-term reward.
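A hedged reconstruction of this framing: treat the action as the latent variable and the return as the observation, so the basic inferential question is the posterior over actions.

```latex
% Hedged reconstruction of the slide's framing: actions are latent,
% the return is the observation, and the environment is the likelihood.
p(a \mid R) = \frac{p(R \mid a)\, p(a)}{p(R)},
\qquad
p(R) = \int p(R \mid a)\, p(a)\, \mathrm{d}a .
```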

SLIDE 6


Planning-as-Inference

Simplest question: what is the posterior distribution over actions? Equivalently, maximise the probability of the return, log p(R).

This recovers policy search methods: a uniform prior, continuous policy parameters, and an environment that can be evaluated but not differentiated through.


Variational inference in the hierarchical model:
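The equation that presumably followed on the slide is the standard variational bound; a reconstruction consistent with the prior p(a) and likelihood p(R | a) defined above:

```latex
% Free energy F(q): a lower bound on log p(R) for any variational
% policy q(a), recovering the entropy-regularised objective.
\log p(R)
  \;\ge\;
  \mathbb{E}_{q(a)}\!\left[\log p(R \mid a)\right]
  - \mathrm{KL}\!\left(q(a)\,\|\,p(a)\right)
  \;=\; -\mathcal{F}(q).
```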

SLIDE 7

Planning-as-Inference

The entropy penalty appears naturally, and alternative priors are easy to consider. Prior knowledge of the action space is easily incorporated. Any of the available tools of probabilistic inference can be used. Stochastic and deterministic policies are both handled easily.


Free energy: policy gradient using the score-function gradient.
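A minimal sketch of that estimator, assuming a Gaussian variational policy q(a) = N(mu, sigma^2 I), a standard-normal action prior, and a toy quadratic return; `environment_return` and all other names are hypothetical, not the talk's code:

```python
# Score-function (REINFORCE) gradient of the free energy
# F = -E_q[log p(R|a)] + KL(q || N(0, I)), for a black-box environment.
import numpy as np

def environment_return(a):
    """Toy black-box environment: evaluable but not differentiable."""
    return -np.sum((a - 1.5) ** 2)          # log p(R|a) up to a constant

def free_energy_gradient(mu, log_sigma, n_samples=256):
    sigma = np.exp(log_sigma)
    eps = np.random.randn(n_samples, mu.size)
    actions = mu + sigma * eps               # samples from q(a)
    log_p_R = np.array([environment_return(a) for a in actions])

    # Score functions of the Gaussian q w.r.t. mu and log_sigma.
    score_mu = eps / sigma
    score_ls = eps ** 2 - 1.0

    # REINFORCE term with a mean baseline, plus the analytic
    # gradient of KL(N(mu, sigma^2) || N(0, I)).
    w = log_p_R - log_p_R.mean()
    grad_mu = -(w[:, None] * score_mu).mean(0) + mu
    grad_ls = -(w[:, None] * score_ls).mean(0) + (sigma ** 2 - 1.0)
    return grad_mu, grad_ls

mu, log_sigma = np.zeros(2), np.zeros(2)
for _ in range(500):                         # gradient descent on F
    g_mu, g_ls = free_energy_gradient(mu, log_sigma)
    mu -= 0.05 * g_mu
    log_sigma -= 0.05 * g_ls
print(mu)  # pulled from the reward peak at 1.5 toward the prior: about [1.0, 1.0]
```

Note how the entropy/KL term enters analytically while the environment contributes only through sampled returns, which is exactly why the environment need not be differentiable.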

SLIDE 8

Planning-as-Inference

[Figure: the agent architecture (external environment, internal environment, planner with option KB, critic, and state representation) alongside the action prior p(a) and environment/model likelihood p(R(s,a)), as before.]

With a more realistic expansion as a graphical model: Bellman's equation can be derived as a rewriting of message passing; application of the EM algorithm to policy search becomes possible (a sketch follows); other variational methods, such as EP, are easy to consider; and both model-free and model-based methods emerge.
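One hedged reading of the EM claim, in the style of Toussaint and Storkey (2006): alternate between reweighting trajectories by their return probability and refitting the policy to the reweighted behaviour.

```latex
% E-step: reweight trajectories tau by how probable their return is;
% M-step: refit the policy parameters to the reweighted behaviour.
\text{E-step:}\quad
  q(\tau) \propto p(R \mid \tau)\, p_{\pi_{\text{old}}}(\tau),
\qquad
\text{M-step:}\quad
  \pi_{\text{new}} = \arg\max_{\pi}\ \mathbb{E}_{q(\tau)}\!\left[\log p_{\pi}(\tau)\right].
```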

SLIDE 9

Generative RL


Inference is already hard. Do we gain additional benefit?

Any hyperparameters can be learnt, but simpler and competitive methods exist. Quantification of uncertainty helps drive natural exploration, yet uncertainty is often not used and is easy to obtain in other ways.

Parameter inference is already difficult in non-RL settings, and the RL case can be computationally more demanding.

SLIDE 10

Model-based RL


SLIDE 11

Model-based RL

Learn a model of the environment and use it in all planning. An internal simulator limits interactions with the environment, for both safety and planning. Long-term predictions allow for better planning. Learning is data-efficient, especially when experience is expensive to obtain. Prior knowledge of the environment can be easily incorporated. (A minimal sketch of this loop follows the figure below.)

Chiappa et al. (2017)

[Figure: the environment-model graphical structure over time: data x_{t-1}, x_t, x_{t+1}, states s_{t-1}, s_t, s_{t+1}, actions a_{t-1}, a_t, a_{t+1}, and model variables m_{t-1}, m_t, m_{t+1}.]
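To make the loop concrete, here is a minimal sketch under toy assumptions: scalar linear dynamics, a least-squares one-step model, and random-shooting model-predictive control. `true_env_step`, `model_step`, and `plan` are hypothetical names; this is not the simulator of Chiappa et al. (2017).

```python
# Model-based loop: fit a dynamics model on logged transitions,
# then plan by rolling the learned model forward.
import numpy as np

rng = np.random.default_rng(0)

def true_env_step(s, a):
    return 0.9 * s + a + 0.01 * rng.standard_normal()   # unknown dynamics

# 1. Collect experience and fit s' ~ w0*s + w1*a by least squares.
S, A, S_next = [], [], []
s = 0.0
for _ in range(200):
    a = rng.uniform(-1, 1)
    s2 = true_env_step(s, a)
    S.append(s); A.append(a); S_next.append(s2)
    s = s2
X = np.column_stack([S, A])
w, *_ = np.linalg.lstsq(X, np.array(S_next), rcond=None)

def model_step(s, a):
    return w[0] * s + w[1] * a                          # internal simulator

# 2. Random-shooting planning: pick the action sequence whose
#    imagined rollout stays closest to a target state.
def plan(s0, target=1.0, horizon=10, n_candidates=500):
    best_seq, best_cost = None, np.inf
    for _ in range(n_candidates):
        seq = rng.uniform(-1, 1, size=horizon)
        s, cost = s0, 0.0
        for a in seq:
            s = model_step(s, a)
            cost += (s - target) ** 2
        if cost < best_cost:
            best_seq, best_cost = seq, cost
    return best_seq[0]                                  # MPC: execute first action

print(plan(s0=0.0))
```

All planning here touches only the learned model, which is the safety and data-efficiency argument; it is also why every model error feeds directly into the chosen actions.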

SLIDE 12


Model-based RL

[Figures: exploration; physical and temporal consistency.]

SLIDE 13


Generative RL

We need highly flexible models to account for different regimes, and it is difficult to develop a general-purpose model-learning approach. It is hard to specify the model that best captures the data. Results are for the most part limited to small domains, limited complexity, and limited consistency. Arguments in favour rely on linear models or low dimensions. Finding the best solution in an environment requires continuous exploration, continuous data collection, and continuous model updating. It is even harder to learn models in partially observed scenarios, and models must be learnt from data obtained under changing policies.

The agent is only as good as the model that is learnt. It is computationally more expensive than model-free methods. There are two sources of error: the model and any value estimate. The reward model must also be learnt, which is very hard.

SLIDE 14

Data-efficient Learning

The trend is to use lots of computation, coupled with environments that can be parallelised.

OpenAI evolution strategies (2017)
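A hedged sketch of the evolution-strategies estimator the slide cites (Salimans et al., 2017): perturb the parameters with Gaussian noise, evaluate returns (in parallel, in the real system), and step along the return-weighted noise. `fitness` and `es_step` are hypothetical names.

```python
# Evolution strategies: a score-function gradient estimate in
# parameter space, needing only black-box return evaluations.
import numpy as np

def es_step(theta, fitness, sigma=0.1, lr=0.02, n_workers=64):
    eps = np.random.randn(n_workers, theta.size)        # one perturbation per worker
    returns = np.array([fitness(theta + sigma * e) for e in eps])
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)
    grad = (returns[:, None] * eps).mean(0) / sigma     # noise-weighted average
    return theta + lr * grad

theta = np.zeros(5)
for _ in range(300):
    theta = es_step(theta, fitness=lambda p: -np.sum((p - 2.0) ** 2))
print(theta)                                            # approaches [2, 2, 2, 2, 2]
```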

Model error propagates: learning robust models and reducing uncertainty requires a lot of data, which works against the data-efficiency argument. When model learning succeeds, we often use model-free methods to train initially and provide good data.

SLIDE 15

Intrinsic Motivation


Mohamed and Rezende (2015)

Generative models can drive behaviour in the absence of rewards, via complex probabilistic quantities such as information gain or mutual information.
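The mutual-information quantity can be written down concretely. A sketch of the variational lower bound used in Mohamed and Rezende (2015), where a learned inverse model q(a | s', s) replaces the intractable posterior over actions:

```latex
% Variational lower bound on the mutual information between actions
% and future states, the basis of the empowerment intrinsic reward.
I(a; s' \mid s)
  \;\ge\;
  \mathbb{E}_{\pi(a \mid s)\, p(s' \mid s, a)}
  \left[ \log q(a \mid s', s) - \log \pi(a \mid s) \right].
```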
SLIDE 16

Generative RL

Using generative models to drive behaviour in the absence of rewards has costs: computing complex probabilistic quantities involves approximations that impact policy learning, data efficiency, and safety; environment models must themselves be learnt, with all the difficulty that entails; they add to the burden of data needed to learn the reward structure; and applications of these approaches are simplistic at present.

SLIDE 17

Valid Critique?

The arguments rely on the difficulty of using generative models and learning complex probabilistic quantities given current tools. There is stronger support for model-free methods, since they side-step many of the challenges of model learning. There are serious challenges to learning reliable, rapidly adapting, data-efficient, and general-purpose models for use in practice. Uncertainty is used in limited ways, but adds a great deal of complexity. The types of environments and problems being addressed matter. Integrated systems.

SLIDE 18

It is not possible to argue against probabilistic approaches to RL.

Our challenge is to show that the principles we develop have a rich theory, apply in practice, and can be deployed in flexible ways.

SLIDE 19


A Case Against Generative Models for Reinforcement Learning?

Generative models for RL workshop, DALI 2018 | @shakir_za | shakir@deepmind.com

Shakir Mohamed