Task-agnostic priors for reinforcement learning - Karthik Narasimhan
  1. Task-agnostic priors for reinforcement learning Karthik Narasimhan Princeton Collaborators: Yilun Du (MIT), Regina Barzilay (MIT), Tommi Jaakkola (MIT)

  2. State of RL
  • ~1100 PF/s-days; ~800 PF/s-days, or 45,000 years (source: OpenAI)
  • Little to no transfer of knowledge
  • Sample efficiency
  • Generalizability

  3. Current approaches
  • Multi-task policy learning (Parisotto et al., 2015; Rusu et al., 2016; Andreas et al., 2017; Espeholt et al., 2018, …)
  • Meta-learning (Duan et al., 2016; Wang et al., 2016; Finn et al., 2017; Al-Shedivat et al., 2018; Nagabandi et al., 2018, …)
  • Bayesian RL (Ghavamzadeh et al., 2015; …)
  • Successor representations (Dayan, 1993; Kulkarni et al., 2016; Barreto et al., 2017; …)
  • Policies are closely coupled with tasks and not well suited for transfer to new domains
  • Model-based RL?

  4. Bootstrapping model learning with task-agnostic priors
  • A model of the environment is more transferable than a policy
  • Learning a model of the environment from scratch is expensive - use priors!
  • Task-agnostic priors for models: (+) generalizable, easier to acquire; (-) may be sub-optimal for a specific task
  [Figure: a shared prior feeds a per-task model, which in turn drives a per-task policy (Task 1, Task 2, …)]

  5. ‘Universal’ priors: Language [NBJ18], Physics [DN19]

  6. Task-agnostic dynamics priors for RL (Yilun Du and Karthik Narasimhan, ICML 2019)
  Key questions:
  • Can we learn physics in a task-agnostic fashion?
  • Does it help the sample efficiency of RL?
  • Does it help transfer of learned representations to new tasks?

  7. Dynamics model for RL
  • Frame prediction: Oh et al. (2015), Finn et al. (2016), Weber et al. (2017), …
    - Lack generalization since the model is learned for a specific task (i.e. action-conditioned)
  • Parameterized physics models: Cutler et al. (2014), Scholz et al. (2014), Zhu et al. (2018), …
    - Require manual specification
  • Our work: learn the prior from task-independent data; decouple model and policy

  8. Overall approach
  • Pre-train a frame predictor (SpatialNet) on physics videos
  • Initialize the dynamics model with it and use it to learn a policy that makes use of future state predictions
  • Simultaneously fine-tune the dynamics model on the target environment
  (a code sketch of this pipeline follows)
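A minimal sketch of the pipeline described above, assuming Gym-style environments that return image tensors; the helper names (`pretrain_dynamics`, `run_rl`, `policy_update`) and interfaces are illustrative assumptions, not the released code:

```python
import torch
import torch.nn.functional as F

def pretrain_dynamics(model, physics_videos, optimizer, epochs=10):
    """Stage 1: train the frame predictor on task-independent physics videos."""
    for _ in range(epochs):
        for frames, next_frames in physics_videos:        # batches of (B, C, H, W) tensors
            loss = F.mse_loss(model(frames), next_frames)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

def run_rl(env, dynamics, policy, dyn_optimizer, policy_update, steps=100_000):
    """Stages 2-3: act on [observation, predicted next frame]; keep fine-tuning the dynamics model."""
    obs = env.reset()                                      # assumed to return an image tensor (1, C, H, W)
    for _ in range(steps):
        pred_next = dynamics(obs)                          # task-agnostic prediction of the next frame
        action = policy(torch.cat([obs, pred_next], dim=1))
        next_obs, reward, done, _ = env.step(action)
        # Fine-tune the dynamics prior on transitions from the target environment.
        dyn_loss = F.mse_loss(dynamics(obs), next_obs)
        dyn_optimizer.zero_grad()
        dyn_loss.backward()
        dyn_optimizer.step()
        policy_update(obs, action, reward, next_obs)       # e.g. an on-policy PPO/A2C step
        obs = env.reset() if done else next_obs
```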

  9. SpatialNet
  • Two key operations:
    - Isolation of the dynamics of each entity
    - Accurate modeling of the local space around each entity
  • Spatial memory: use convolutions and residual connections to better capture local dynamics (instead of the additive updates in the ConvLSTM model, Xingjian et al., 2015)
  • No action-conditioning, unlike the action-conditioned \hat{T}(s' | s, a) used in prior frame-prediction work
  (a sketch of this update follows)
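For the spatial-memory idea, a rough convolutional update with a residual connection (as opposed to ConvLSTM's gated additive cell update) might look like the following; the module name, channel sizes, and kernel widths are assumptions for illustration, not the paper's released architecture:

```python
import torch
import torch.nn as nn

class SpatialMemoryCell(nn.Module):
    """Sketch of a convolutional spatial memory: the hidden map is updated with
    convolutions and a residual connection, keeping each entity's dynamics local.
    All layer sizes here are illustrative assumptions."""
    def __init__(self, in_ch=3, hid_ch=64):
        super().__init__()
        self.encode = nn.Conv2d(in_ch, hid_ch, kernel_size=3, padding=1)
        self.update = nn.Sequential(
            nn.Conv2d(2 * hid_ch, hid_ch, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(hid_ch, hid_ch, kernel_size=3, padding=1),
        )
        self.decode = nn.Conv2d(hid_ch, in_ch, kernel_size=3, padding=1)

    def forward(self, frame, memory):
        x = torch.relu(self.encode(frame))
        # Residual update of the spatial memory (instead of ConvLSTM's additive update).
        memory = memory + self.update(torch.cat([x, memory], dim=1))
        next_frame = self.decode(memory)   # note: no action-conditioning anywhere
        return next_frame, memory
```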

  10. Experimental setup
  • PhysVideos: 625k frames of video containing moving objects of various shapes and sizes (rendered with a physics engine)
  • PhysWorld: collection of 2D physics-centric games - navigation, object gathering, and shooting tasks
  • Atari: stochastic version with sticky actions
  • RL agent: use predicted future frames as input to the policy network
  • The same pre-trained dynamics prior is used in all RL experiments
  [Figure: PhysVideos → pre-trained dynamics model → agent network, evaluated on PhysWorld and Atari]

  11. Frame predictions
  [Figure: predicted frames on PhysShooter; pixel prediction accuracy]

  12. Predicting physical parameters
  • Predicted frames are indicative of the physical parameters of the environment
  [Figure: PhysShooter example]

  13. Policy learning: PhysWorld

  14. Policy learning: Atari
  • Same pre-trained model as before (from PhysVideos)

  15. Transfer learning (target env: PhysShooter)
  • Reward: no transfer 35.42; pre-trained dynamics predictor 42.27; policy transfer (PhysForage) 40.40; model transfer (PhysForage) 53.66
  • Model transfer > policy transfer > no transfer

  16. Beyond physics
  • Not all environments are physical
  • How do we encode knowledge of the environment?

  17. [Figure: two environments shown as transition graphs - Environment 1 with states s1…s5, Environment 2 with states u1…u6, edges labeled with transition probabilities]
  • Knowledge of the environment: transitions and rewards
  • Need some anchor to re-use acquired information
  • An incorrect mapping will lead to negative transfer

  18. [Figure: the same two environments, now with a mapping between their states]
  • s1 is similar to u1, s2 is similar to u3, …

  19. Grounding language for transfer in RL
  • Text descriptions associated with objects/entities, e.g. "Scorpions chase you and kill you on touch", "Spiders are chasers and can be destroyed by an explosion"

  20. Language as a bridge
  • Language as a task-invariant and accessible medium
  • Traditional approaches: direct transfer of policy (e.g. instruction following)
  • This work: transfer a 'model' of the environment using text descriptions
  (Narasimhan, Barzilay, Jaakkola, JAIR 2018)

  21. Model-based reinforcement learning
  • Transition distribution and reward function: T(s' | s, a), R(s, a)
  [Figure: from state s and action a, possible next states s'_1 … s'_n, with an associated reward (+10)]

  22. Model-based reinforcement learning
  • Text-conditioned transition distribution: T(s' | s, a, z)
  [Figure: state s, action a, and text z ("Scorpions chase you and kill you on touch") determine the next states s'_1 … s'_n]
  (a sketch of one way to realize such a model follows)
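One hedged way to realize a text-conditioned transition model T(s' | s, a, z) is to embed the description z with a recurrent encoder and fuse it with the state-action representation. Everything below (dimensions, fusion by concatenation, the residual output) is an illustrative assumption rather than the JAIR 2018 architecture:

```python
import torch
import torch.nn as nn

class TextConditionedTransition(nn.Module):
    """Sketch of T(s' | s, a, z): predicts the next state representation from the
    current state, the chosen action, and an embedded text description z."""
    def __init__(self, state_dim=128, n_actions=8, vocab_size=1000, text_dim=64):
        super().__init__()
        self.action_emb = nn.Embedding(n_actions, 32)
        self.word_emb = nn.Embedding(vocab_size, text_dim)
        self.text_enc = nn.LSTM(text_dim, text_dim, batch_first=True)
        self.predict = nn.Sequential(
            nn.Linear(state_dim + 32 + text_dim, 256), nn.ReLU(),
            nn.Linear(256, state_dim),
        )

    def forward(self, state, action, z_tokens):
        # z_tokens: (B, T) word indices of the entity description.
        _, (h, _) = self.text_enc(self.word_emb(z_tokens))
        z = h[-1]                                   # (B, text_dim) summary of the text
        x = torch.cat([state, self.action_emb(action), z], dim=-1)
        return state + self.predict(x)              # residual prediction of s'
```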

  23. Bootstrap learning through text
  • Source: \hat{T}_1(s' | s, a, z_1), R_1(s, a, z_1) with z_1 = "Scorpions chase you and kill you on touch"
  • Transfer to target: \hat{T}_2(u' | u, a, z_2), R_2(u, a, z_2) with z_2 = "Spiders are chasers and can be destroyed"
  • Appropriate representation to incorporate language
  • Partial text descriptions

  24. Differentiable value iteration X T ( s 0 | s, a ) V ( s 0 ) V ( s ) = max Q ( s, a ) Q ( s, a ) = R ( s, a ) + γ a s 0 Convolu)onal neural network (CNN) Reward R max Observations 
 + 
 Descriptions T Q V φ ( s ) V Value k step recurrence (Value Iteration Network, Tamar et al., 2016)
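For concreteness, the k-step recurrence can be written as a small differentiable module in the spirit of Tamar et al. (2016); the channel counts and the way the reward map is produced upstream are assumptions, and this sketch omits the text-conditioning details:

```python
import torch
import torch.nn as nn

class ValueIterationModule(nn.Module):
    """Differentiable value iteration over a 2-D state space (VIN-style sketch).
    A convolution stands in for sum_{s'} T(s'|s,a) V(s'); taking the max over the
    action channel gives V; k recurrent steps approximate the fixed point."""
    def __init__(self, n_actions=8, k=20):
        super().__init__()
        self.k = k
        # Maps the stacked [reward map, value map] to one Q channel per action.
        self.q_conv = nn.Conv2d(2, n_actions, kernel_size=3, padding=1, bias=False)

    def forward(self, reward_map):
        # reward_map: (B, 1, H, W), e.g. produced by a CNN over observations + descriptions.
        value = torch.zeros_like(reward_map)
        for _ in range(self.k):
            q = self.q_conv(torch.cat([reward_map, value], dim=1))  # Q(s, a)
            value, _ = q.max(dim=1, keepdim=True)                   # V(s) = max_a Q(s, a)
        return q, value
```

Because the convolution weights act like learned local transition probabilities (with the discount folded into them), backpropagating through the k iterations trains the whole module end-to-end from the downstream policy loss.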

  25. Experiments
  • 2-D game environments from the GVGAI framework (each with different layouts, different entity sets, etc.)
  • Text descriptions collected from Amazon Mechanical Turk
  • Transfer setup: train on multiple source tasks, and use the learned parameters to initialize for target tasks
  • Baselines: DQN (Mnih et al., 2015), text-DQN, Actor-Mimic (Parisotto et al., 2016)
  • Evaluation: jumpstart, average, and asymptotic reward
  [Table: environment statistics; source and target game instances for transfer]

  26. Average reward (F&E-1 to Freeway)
  [Bar chart comparing No transfer, DQN, Actor-Mimic, text-DQN, and text-VIN; rewards range from 0.08 to 0.73]

  27. Transfer results

  28. Conclusions
  • Model-based RL is sample efficient, but learning a model is expensive
  • Task-agnostic priors over models provide a solution for both sample efficiency and generalization
  • Two common priors applicable to a variety of tasks: classical mechanics and language
  Questions?

  29. Challenger: Joshua Zhanson <jzhanson@andrew.cmu.edu>
  • After the success of deep learning, we are now seeing a push into middle-level intelligence, such as:
    - cross-domain reasoning, e.g., visual question-answering or language grounding,
    - using knowledge from different tasks and domains to aid learning, e.g., learning skills from video or demonstration, or learning to learn in general.
  • What do you see as the end goal of such mid-level intelligence, especially since the space of mid-level tasks is so much more complex and varied?
  • What are the greatest obstacles on the path to mid-level intelligence?
