Task-agnostic priors for reinforcement learning


SLIDE 1

Task-agnostic priors for reinforcement learning

Karthik Narasimhan (Princeton)
Collaborators: Yilun Du (MIT), Regina Barzilay (MIT), Tommi Jaakkola (MIT)

SLIDE 2

State of RL

[Chart: compute used by recent RL systems: ~1100 PF/s-days and ~800 PF/s-days, or 45,000 years of gameplay (source: OpenAI)]

Little to no transfer of knowledge.

Key challenges: sample efficiency and generalizability.

SLIDE 3

Current approaches

  • Multi-task policy learning (Parisotto et al., 2015; Rusu et al., 2016; Andreas et al., 2017; Espeholt et al., 2018, …)
  • Meta-learning (Duan et al., 2016; Wang et al., 2016; Finn et al., 2017; Al-Shedivat et al., 2018; Nagabandi et al., 2018, …)
  • Bayesian RL (Ghavamzadeh et al., 2015; …)
  • Successor representations (Dayan, 1993; Kulkarni et al., 2016; Barreto et al., 2017; …)
  • These policies are closely coupled with tasks and not well suited for transfer to new domains
  • Model-based RL?
SLIDE 4

Bootstrapping model learning with task-agnostic priors

  • A model of the environment is more transferable than a policy
  • Learning a model of the environment from scratch is expensive: use priors!
  • Task-agnostic priors for models:
    + Generalizable and easier to acquire
    − May be sub-optimal for a specific task

[Diagram: a shared model prior initializes per-task models and policies for Task 1, Task 2, …]

SLIDE 5

‘Universal’ priors

  • Physics [DN19]
  • Language [NBJ18]

SLIDE 6

Task-agnostic dynamics priors for RL

(Yilun Du and Karthik Narasimhan, ICML 2019)

Key questions:

  • Can we learn physics in a task-agnostic fashion?
  • Does it help the sample efficiency of RL?
  • Does it help transfer of learned representations to new tasks?

SLIDE 7

Dynamics model for RL

  • Frame prediction: Oh et al. (2015), Finn et al. (2016), Weber et al. (2017), …
    • Lacks generalization, since the model is learned for a specific task (i.e., action-conditioned)
  • Parameterized physics models: Cutler et al. (2014), Scholz et al. (2014), Zhu et al. (2018), …
    • Require manual specification
  • Our work: learn a prior from task-independent data and decouple the model from the policy
SLIDE 8

Overall approach

  • Pre-train a frame predictor (SpatialNet) on physics videos (minimal sketch below)
  • Initialize the dynamics model with it and use it to learn a policy that makes use of future state predictions
  • Simultaneously fine-tune the dynamics model on the target environment
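To make the first stage concrete, here is a minimal PyTorch sketch of task-agnostic pre-training on frame pairs, assuming a stand-in convolutional predictor (FramePredictor) with illustrative sizes, not the actual SpatialNet architecture:

```python
import torch
import torch.nn as nn

class FramePredictor(nn.Module):
    """Stand-in convolutional frame predictor (not the exact SpatialNet)."""
    def __init__(self, channels=3, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, channels, 3, padding=1),
        )

    def forward(self, frame):
        # Predict the next frame as the current frame plus a learned delta.
        return frame + self.net(frame)

def pretrain(model, frame_pairs, lr=1e-3):
    """Stage 1: task-agnostic pre-training on videos (no actions, no rewards)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for frame, next_frame in frame_pairs:
        loss = nn.functional.mse_loss(model(frame), next_frame)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model

# Dummy "video": consecutive frames, just to exercise the loop.
clip = torch.randn(8, 3, 32, 32)
pairs = [(clip[i:i+1], clip[i+1:i+2]) for i in range(7)]
predictor = pretrain(FramePredictor(), pairs)
# Stages 2 and 3 would initialize the agent's dynamics model with `predictor`
# and keep fine-tuning it on frames from the target environment.
```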

SLIDE 9

SpatialNet

  • Two key operations:
    • Isolation of the dynamics of each entity
    • Accurate modeling of the local space around each entity
  • Spatial memory: use convolutions and residual connections to better capture local dynamics, instead of the additive updates in the ConvLSTM model (Xingjian et al., 2015); see the sketch after this list
  • No action-conditioning

Learned dynamics prior: T̂(s′|s, a)
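Schematically, the spatial-memory update can be written as a convolutional recurrent cell with a residual hidden-state update. This is a minimal illustration with assumed layer sizes, not the exact SpatialNet architecture:

```python
import torch
import torch.nn as nn

class SpatialMemoryCell(nn.Module):
    def __init__(self, in_ch=3, mem_ch=32):
        super().__init__()
        # 3x3 convolutions keep updates local to each entity's neighborhood.
        self.update = nn.Sequential(
            nn.Conv2d(in_ch + mem_ch, mem_ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(mem_ch, mem_ch, 3, padding=1),
        )
        self.decode = nn.Conv2d(mem_ch, in_ch, 3, padding=1)

    def forward(self, frame, memory):
        # Residual update: the memory shifts by a learned local delta,
        # rather than being overwritten by a purely additive cell update.
        memory = memory + self.update(torch.cat([frame, memory], dim=1))
        next_frame = self.decode(memory)  # predicted next frame
        return next_frame, memory

# Usage: roll the cell forward over a short clip.
cell = SpatialMemoryCell()
frame = torch.randn(1, 3, 32, 32)
memory = torch.zeros(1, 32, 32, 32)
for _ in range(5):
    frame, memory = cell(frame, memory)
```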

SLIDE 10

Experimental setup

  • PhysVideos: 625k frames of video containing moving objects of various shapes and sizes (rendered with a physics engine)
  • PhysWorld: a collection of 2D physics-centric games with navigation, object-gathering, and shooting tasks
  • Atari: stochastic version with sticky actions
  • RL agent: predicted future frames are used as input to the policy network (see the sketch below)
  • The same pre-trained dynamics prior is used in all RL experiments

[Diagram: one dynamics model pre-trained on PhysVideos supplies predictions to the agent in both PhysWorld and Atari]
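A minimal sketch of how predicted frames can enter the policy, assuming a hypothetical frame-to-frame predictor interface (the function and names are illustrative, not the paper's code): the current observation and k imagined future frames are stacked along the channel dimension.

```python
import torch

def policy_input(frame, predictor, k=2):
    """Stack the current frame with k imagined future frames along channels."""
    frames = [frame]
    with torch.no_grad():  # the policy consumes predictions as fixed input
        for _ in range(k):
            frames.append(predictor(frames[-1]))
    return torch.cat(frames, dim=1)  # shape (B, (k+1)*C, H, W)

# With the FramePredictor from the earlier sketch (or any frame->frame model):
obs = torch.randn(1, 3, 32, 32)
x = policy_input(obs, predictor=lambda f: f)  # identity stand-in
assert x.shape == (1, 9, 32, 32)
```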

SLIDE 11

Frame predictions

[Figure: pixel prediction accuracy and example predicted frames on PhysShooter]

SLIDE 12

Predicting physical parameters

Predicted frames are indicative of the physical parameters of the environment.

[Figure: parameter-prediction results on PhysShooter]

SLIDE 13

Policy learning: PhysWorld

SLIDE 14

Policy learning: Atari

Same pre-trained model as before (from PhysVideos)

SLIDE 15

Transfer Learning

Model transfer > policy transfer > no transfer

[Bar chart: reward on the target environment PhysShooter under four conditions: no transfer, pre-trained dynamics predictor, policy transfer (PhysForage), and model transfer (PhysForage); rewards span 35.42 to 53.66]

SLIDE 16

Beyond physics

  • Not all environments are physical
  • How do we encode knowledge of the environment?
SLIDE 17
  • Knowledge of environment: transitions and rewards
  • Need some anchor to re-use acquired information
  • Incorrect mapping will lead to negative transfer

[Diagram: Environment 1 with states s1 … s5 and transition probabilities 0.4, 0.4, 0.2; Environment 2 with states u1 … u6]

SLIDE 18

[Diagram: Environment 1 (states s1 … s5, transition probabilities 0.4, 0.4, 0.2) aligned with Environment 2 (states u1 … u6, transition probabilities 0.35, 0.45, 0.2)]

s1 is similar to u1, s2 is similar to u3, …

SLIDE 19

Grounding language for transfer in RL

  • Text descriptions associated with objects/entities:
    "Scorpions chase you and kill you on touch."
    "Spiders are chasers and can be destroyed by an explosion."

SLIDE 20

Language as a bridge

  • Language as a task-invariant and accessible medium
  • Traditional approaches: direct transfer of policy (e.g., instruction following)
  • This work: transfer a ‘model’ of the environment using text descriptions

(Narasimhan, Barzilay, Jaakkola, JAIR 2018)

SLIDE 21

Model-based reinforcement learning

[Diagram: from state s, taking action a leads to possible next states s′1, …, s′n; example reward +10]

Transition distribution and reward function: T(s′|s, a) and R(s, a)
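As a toy illustration of what such a model stores (an invented example, not from the talk), a tabular version keeps T and R as explicit lookup tables:

```python
import random

# Toy tabular model (illustrative): T[s][a] is a distribution over next
# states, and R[s][a] is the reward for taking action a in state s.
T = {
    "s1": {"right": {"s2": 0.9, "s1": 0.1}},
    "s2": {"right": {"goal": 1.0}},
}
R = {"s1": {"right": 0.0}, "s2": {"right": 10.0}}  # +10 on reaching the goal

def sample_next(s, a, rng):
    """Draw s' ~ T(s'|s, a)."""
    states, probs = zip(*T[s][a].items())
    return rng.choices(states, weights=probs, k=1)[0]

print(sample_next("s1", "right", random.Random(0)))  # e.g. 's2'
```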

SLIDE 22

Model-based reinforcement learning with language

[Diagram: from state s, with text z ("Scorpions chase you and kill you on touch"), taking action a leads to possible next states s′1, …, s′n]

Text-conditioned transition distribution: T(s′|s, a, z)
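One simple way to realize such conditioning (a minimal sketch with an assumed architecture; the paper's actual model differs) is to embed the description and concatenate it with state and action features before predicting s′:

```python
import torch
import torch.nn as nn

class TextConditionedDynamics(nn.Module):
    def __init__(self, state_dim=16, n_actions=4, vocab=1000, text_dim=32):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab, text_dim)  # bag-of-words text encoder
        self.action = nn.Embedding(n_actions, text_dim)
        self.head = nn.Sequential(
            nn.Linear(state_dim + 2 * text_dim, 64), nn.ReLU(),
            nn.Linear(64, state_dim),                  # predicts next state s'
        )

    def forward(self, state, action, tokens):
        z = self.embed(tokens)   # (B, text_dim) description embedding
        a = self.action(action)  # (B, text_dim) action embedding
        return self.head(torch.cat([state, a, z], dim=-1))

model = TextConditionedDynamics()
s = torch.randn(2, 16)
a = torch.tensor([0, 3])
tokens = torch.randint(0, 1000, (2, 7))  # tokenized descriptions
s_next = model(s, a, tokens)
```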

SLIDE 23

Bootstrap learning through text

  • Appropriate representation to incorporate language
  • Partial text descriptions

Source: z1 = "Scorpions chase you and kill you on touch", with model T1(s′|s, a, z1) and R1(s, a, z1)

    ↓ Transfer

Target: z2 = "Spiders are chasers and can be destroyed", with estimated model T̂2(u′|u, a, z2) and R̂2(u, a, z2)

SLIDE 24

Differentiable value iteration

[Architecture: a convolutional neural network (CNN) encodes observations + descriptions into φ(s); learned reward R and transition T modules feed a k-step value-iteration recurrence over Q and V]

Q(s, a) = R(s, a) + γ Σ_{s′} T(s′|s, a) V(s′)
V(s) = max_a Q(s, a)

(Value Iteration Network, Tamar et al., 2016)
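The recurrence is just k alternations of a Bellman backup and a max, every step of which is differentiable. A minimal tabular sketch of that recurrence (tensor shapes are illustrative assumptions; a full VIN implements the sum as a convolution over a spatial state space):

```python
import torch

def value_iteration(R, T, gamma=0.9, k=10):
    """R: (S, A) rewards; T: (S, A, S) transition probabilities."""
    S, A = R.shape
    V = torch.zeros(S)
    for _ in range(k):  # k-step recurrence
        # Q(s,a) = R(s,a) + gamma * sum_{s'} T(s'|s,a) V(s')
        Q = R + gamma * torch.einsum("xay,y->xa", T, V)
        V = Q.max(dim=1).values  # V(s) = max_a Q(s,a)
    return Q, V

# Tiny 3-state, 2-action example. Every operation is differentiable, so R and
# T could themselves be network outputs trained end-to-end, as in a VIN.
R = torch.tensor([[0., 0.], [0., 1.], [0., 0.]])
T = torch.softmax(torch.randn(3, 2, 3), dim=-1)
Q, V = value_iteration(R, T)
```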

SLIDE 25

Experiments

  • 2-D game environments from the GVGAI framework (each with different layouts, different entity sets, etc.)
  • Text descriptions collected via Amazon Mechanical Turk
  • Transfer setup: train on multiple source tasks, and use the learned parameters to initialize for target tasks
  • Baselines: DQN (Mnih et al., 2015), text-DQN, Actor-Mimic (Parisotto et al., 2016)
  • Evaluation: jumpstart, average, and asymptotic reward

[Table: environment statistics; source and target game instances for transfer]

SLIDE 26

Average reward

[Bar chart: average reward for transfer from F&E-1 to Freeway under No transfer, DQN, Actor Mimic, text-DQN, and text-VIN; rewards span 0.08 to 0.73]

SLIDE 27

Transfer results

SLIDE 28

Conclusions

  • Model-based RL is sample efficient, but learning a model is expensive
  • Task-agnostic priors over models provide a solution for both sample efficiency and generalization
  • Two common priors applicable to a variety of tasks: classical mechanics and language

Questions?

SLIDE 29

Challenger: Joshua Zhanson <jzhanson@andrew.cmu.edu>

  • After the success of deep learning, we are now seeing a push into middle-level intelligence, such as:
    • cross-domain reasoning, e.g., visual question-answering or language grounding,
    • using knowledge from different tasks and domains to aid learning, e.g., learning skills from video or demonstration, or learning to learn in general.
  • What do you see as the end goal of such mid-level intelligence, especially since the space of mid-level tasks is so much more complex and varied?
  • What are the greatest obstacles on the path to mid-level intelligence?