Task-agnostic priors for reinforcement learning
Karthik Narasimhan (Princeton)
Collaborators: Yilun Du (MIT), Regina Barzilay (MIT), Tommi Jaakkola (MIT)
State of RL
[Chart: compute used to train modern RL systems, e.g. ~1100 PF/s-days and ~800 PF/s-days, or roughly 45,000 years of game experience (source: OpenAI)]
- Little to no transfer of knowledge
- Key challenges: sample efficiency, generalizability
Current approaches
- Multi-task policy learning (Parisotto et al., 2015; Rusu et al., 2016; Andreas et al., 2017; Espeholt et al., 2018, …)
- Meta-learning (Duan et al., 2016; Wang et al., 2016; Finn et al., 2017; Al-Shedivat et al., 2018; Nagabandi et al., 2018, …)
- Bayesian RL (Ghavamzadeh et al., 2015; …)
- Successor representations (Dayan, 1993; Kulkarni et al., 2016; Barreto et al., 2017; …)
- Policies are closely coupled with tasks & not well suited for transfer to new domains
- Model-based RL?
Bootstrapping model learning with task-agnostic priors
- A model of the environment is more transferable than a policy
- Learning a model of the environment from scratch is expensive: use priors!
- Task-agnostic priors for models:
  + Generalizable, easier to acquire
  - May be sub-optimal for a specific task
[Diagram: a shared model prior initializes per-task models and policies for Task 1, Task 2, …]
‘Universal’ priors
- Physics [DN19]
- Language [NBJ18]
Task-agnostic dynamics priors for RL
(Yilun Du and Karthik Narasimhan, ICML 2019)
Key questions:
- Can we learn physics in a task-agnostic fashion?
- Does it help the sample efficiency of RL?
- Does it help transfer of learned representations to new tasks?
Dynamics model for RL
- Frame prediction: Oh et al. (2015), Finn et al. (2016), Weber et al. (2017), …
  - Lacks generalization, since the model is learned for a specific task (i.e., action-conditioned)
- Parameterized physics models: Cutler et al. (2014), Scholz et al. (2014), Zhu et al. (2018), …
  - Require manual specification
- Our work: learn the prior from task-independent data; decouple model and policy
Overall approach
- Pre-train a frame predictor on physics videos
- Initialize the dynamics model and use it to learn a policy that makes use of future state predictions
- Simultaneously fine-tune the dynamics model on the target environment (a training-loop sketch follows)
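A minimal sketch of this pipeline in PyTorch; the tooling, the function name, and the tiny stand-in network are assumptions for illustration, not the released code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in dynamics prior: a tiny conv net predicting the next frame from
# the current one (a placeholder for the actual SpatialNet model).
dynamics = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 3, 3, padding=1),
)
dyn_opt = torch.optim.Adam(dynamics.parameters(), lr=1e-4)

def dynamics_step(frames, next_frames):
    """One gradient step of next-frame prediction. Used both for
    task-agnostic pre-training on physics videos (no actions involved)
    and for fine-tuning on frames from the target environment."""
    loss = F.mse_loss(dynamics(frames), next_frames)
    dyn_opt.zero_grad()
    loss.backward()
    dyn_opt.step()
    return loss.item()
```

The same update is reused in both phases: only the data source changes, from physics videos to target-environment rollouts.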
SpatialNet
- Two key operations:
  - Isolate the dynamics of each entity
  - Accurately model the local space around each entity
- Spatial memory: use convolutions and residual connections to better capture local dynamics, instead of the additive updates in the ConvLSTM model (Xingjian et al., 2015); see the sketch below
- No action-conditioning
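A possible shape of the spatial-memory update, sketched in PyTorch under assumed architectural details (channel counts, block depth, and the decode head are illustrative):

```python
import torch
import torch.nn as nn

class SpatialMemory(nn.Module):
    """Sketch of a SpatialNet-style spatial memory cell (details assumed).

    The hidden state is a spatial feature map; each step updates it with a
    convolutional block plus a residual connection, so local dynamics around
    each entity are captured by local filters rather than the additive gated
    update of a ConvLSTM. Note there is no action input.
    """
    def __init__(self, in_ch=3, hid_ch=64):
        super().__init__()
        self.encode = nn.Conv2d(in_ch + hid_ch, hid_ch, 3, padding=1)
        self.update = nn.Sequential(
            nn.ReLU(),
            nn.Conv2d(hid_ch, hid_ch, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(hid_ch, hid_ch, 3, padding=1),
        )
        self.decode = nn.Conv2d(hid_ch, in_ch, 3, padding=1)

    def forward(self, frame, hidden):
        z = self.encode(torch.cat([frame, hidden], dim=1))
        hidden = hidden + self.update(z)          # residual update
        return self.decode(hidden), hidden        # next-frame prediction

# usage: h = torch.zeros(1, 64, 84, 84)
#        pred, h = SpatialMemory()(torch.rand(1, 3, 84, 84), h)
```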
SpatialNet
$\hat{T}(s' \mid s, a)$
Experimental setup
- PhysVideos: 625k frames of video containing moving objects of various shapes and sizes (rendered with a physics engine)
- PhysWorld: a collection of 2D physics-centric games with navigation, object-gathering, and shooting tasks
- Atari: stochastic version with sticky actions
- RL agent: use predicted future frames as input to the policy network (sketched below)
- Same pre-trained dynamics prior used in all RL experiments
[Diagram: dynamics model pre-trained on PhysVideos; its frame predictions feed the agent in PhysWorld and Atari]
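One plausible way to wire the prior into the agent, assuming the predicted next frame is simply stacked channel-wise with the observation (the class name and layer sizes are hypothetical):

```python
import torch
import torch.nn as nn

class AugmentedPolicy(nn.Module):
    """Hypothetical policy network that conditions on the dynamics prior's
    predicted next frame, concatenated channel-wise with the current
    observation (the exact interface is an assumption)."""
    def __init__(self, n_actions, in_ch=6):   # 3 observed + 3 predicted
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(in_ch, 32, 8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        self.head = nn.LazyLinear(n_actions)   # action logits

    def forward(self, obs, predicted_next):
        x = torch.cat([obs, predicted_next], dim=1)
        return self.head(self.trunk(x))
```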
Pixel prediction accuracy
[Figure: frame prediction results on PhysShooter]
Predicting physical parameters
Predicted frames are indicative of the environment's physical parameters
[Figure: parameter-prediction results on PhysShooter]
Policy learning: PhysWorld
Policy learning: Atari
Same pre-trained model as before (from PhysVideos)
Transfer Learning
Model transfer > policy transfer > no transfer
[Bar chart: reward for no transfer, a pre-trained dynamics predictor, policy transfer (PhysForage), and model transfer (PhysForage)]
Target env: PhysShooter
Beyond physics
- Not all environments are physical
- How do we encode knowledge of the environment?
- Knowledge of environment: transitions and rewards
- Need some anchor to re-use acquired information
- Incorrect mapping will lead to negative transfer
[Diagram: transition graphs for two environments, with states s1 … s5 in Environment 1 and u1 … u6 in Environment 2 and example transition probabilities (0.4, 0.4, 0.2 vs. 0.35, 0.45, 0.2); an alignment maps similar states across environments, e.g. s1 is similar to u1, s2 is similar to u3, …]
- Text descriptions associated with objects/entities, e.g. 'Scorpions chase you and kill you on touch', 'Spiders are chasers and can be destroyed by an explosion' (a toy alignment sketch follows)
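A toy sketch of how descriptions could serve as anchors for aligning states across environments; the bag-of-embeddings encoder and helper names are hypothetical, not from the paper:

```python
import torch
import torch.nn.functional as F

# Hypothetical illustration: align entities/states across environments by
# comparing the embeddings of their text descriptions.
emb = torch.nn.Embedding(10000, 64)   # stand-in word embeddings

def encode(token_ids):
    """Bag-of-embeddings sentence encoder: (T,) token ids -> (64,) vector."""
    return emb(token_ids).mean(dim=0)

def best_match(desc_env1, descs_env2):
    """Map a state in environment 1 to its most similar state in
    environment 2, by cosine similarity of description embeddings."""
    q = encode(desc_env1)
    sims = torch.stack([F.cosine_similarity(q, encode(d), dim=0)
                        for d in descs_env2])
    return int(sims.argmax())         # e.g. s2 -> u3
```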
Grounding language for transfer in RL
Language as a bridge
- Language is a task-invariant and accessible medium
- Traditional approaches: direct transfer of a policy (e.g., instruction following)
- This work: transfer a 'model' of the environment using text descriptions
(Narasimhan, Barzilay, Jaakkola, JAIR 2018)
Model-based reinforcement learning
State $s$, action $a$, successor states $s'_1, \dots, s'_n$
Transition distribution $T(s' \mid s, a)$ and reward function $R(s, a)$ (e.g., $+10$)
Text-conditioned transition distribution: state $s$, action $a$, text $z$ ('Scorpions chase you and kill you on touch'), successor states $s'_1, \dots, s'_n$
$T(s' \mid s, a, z)$
Bootstrap learning through text
$z_1$: 'Scorpions chase you and kill you on touch'
- Appropriate representation to incorporate language
- Partial text descriptions
$T_1(s' \mid s, a, z_1)$, $R_1(s, a, z_1)$
Transfer with $z_2$: 'Spiders are chasers and can be destroyed by an explosion'
$\hat{T}_2(u' \mid u, a, z_2)$, $\hat{R}_2(u, a, z_2)$
(a sketch of a text-conditioned transition model follows)
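A hedged sketch of what a text-conditioned transition model might look like over a discrete state set; the LSTM encoder and fusion layers are assumptions, not the paper's exact parameterization:

```python
import torch
import torch.nn as nn

class TextConditionedTransition(nn.Module):
    """Sketch of a text-conditioned transition model T(s' | s, a, z).

    A description z (e.g. 'Spiders are chasers and can be destroyed by an
    explosion') is encoded with an LSTM and fused with state and action
    embeddings to produce a distribution over next states. All details
    here are illustrative assumptions.
    """
    def __init__(self, vocab_size, n_states, n_actions, emb=64, hid=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        self.text_enc = nn.LSTM(emb, hid, batch_first=True)
        self.state_emb = nn.Embedding(n_states, hid)
        self.action_emb = nn.Embedding(n_actions, hid)
        self.out = nn.Sequential(
            nn.Linear(3 * hid, hid), nn.ReLU(),
            nn.Linear(hid, n_states),          # logits over next states s'
        )

    def forward(self, state, action, tokens):
        _, (h, _) = self.text_enc(self.embed(tokens))   # h: (1, B, hid)
        x = torch.cat([self.state_emb(state),
                       self.action_emb(action), h[-1]], dim=-1)
        return torch.log_softmax(self.out(x), dim=-1)   # log T(s' | s, a, z)
```

Transfer then amounts to reusing the same network with a new description $z_2$ in the target environment.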
[Diagram: value iteration module in which a convolutional neural network (CNN) computes reward $R$, transition $T$, $Q$, and value $V$ maps over state features $\phi(s)$, with a $k$-step recurrence and a max over actions]
Differentiable value iteration
$$Q(s, a) = R(s, a) + \gamma \sum_{s'} T(s' \mid s, a)\, V(s'), \qquad V(s) = \max_a Q(s, a)$$
(Value Iteration Network, Tamar et al., 2016)
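A compact sketch of the k-step recurrence, following the published VIN formulation; `q_conv` is a caller-supplied convolution whose filters play the role of $\gamma \, T(s' \mid s, a)$ acting on the stacked reward and value maps:

```python
import torch
import torch.nn as nn

def differentiable_value_iteration(reward_map, q_conv, k=10):
    """k-step differentiable value iteration (after Tamar et al., 2016).

    reward_map: (batch, 1, H, W) reward estimates over a spatial state grid
    q_conv:     conv layer mapping the 2-channel [R; V] stack to one Q
                channel per action
    """
    v = torch.zeros_like(reward_map)
    for _ in range(k):
        # Q(s, a) = R(s, a) + gamma * sum_{s'} T(s' | s, a) V(s')
        q = q_conv(torch.cat([reward_map, v], dim=1))
        # V(s) = max_a Q(s, a)
        v, _ = q.max(dim=1, keepdim=True)
    return v

# Example usage (shapes only): 8 actions over a 10x10 state grid.
q_conv = nn.Conv2d(2, 8, 3, padding=1)
v = differentiable_value_iteration(torch.zeros(1, 1, 10, 10), q_conv)
```

Because every step is differentiable, the transition filters are learned end-to-end from the RL signal.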
Observations + Descriptions
Experiments
- 2D game environments from the GVGAI framework (each with different layouts, different entity sets, etc.)
- Text descriptions collected from Amazon Mechanical Turk
- Transfer setup: train on multiple source tasks, and use the learned parameters to initialize for target tasks
- Baselines: DQN (Mnih et al., 2015), text-DQN, Actor-Mimic (Parisotto et al., 2016)
- Evaluation: jumpstart, average, and asymptotic reward
[Table: environment statistics; source and target game instances for transfer]
Average reward
[Bar chart: average reward for transfer from F&E-1 to Freeway, comparing no transfer, DQN, Actor-Mimic, text-DQN, and text-VIN]
Transfer results
Conclusions
- Model-based RL is sample-efficient, but learning a model is expensive
- Task-agnostic priors over models provide a solution for both sample efficiency and generalization
- Two common priors applicable to a variety of tasks: classical mechanics and language

Questions?
Challenger: Joshua Zhanson <jzhanson@andrew.cmu.edu>
- After the success of deep learning, we are now seeing a push into middle-level intelligence, such as
  - cross-domain reasoning, e.g., visual question answering or language grounding,
  - using knowledge from different tasks and domains to aid learning, e.g., learning skills from video or demonstration, or learning to learn in general.
- What do you see as the end goal of such mid-level intelligence, especially since the space of mid-level tasks is so much more complex and varied?
- What are the greatest obstacles on the path to mid-level intelligence?