Task-agnostic priors for reinforcement learning
Karthik Narasimhan (Princeton)
Collaborators: Yilun Du (MIT), Regina Barzilay (MIT), Tommi Jaakkola (MIT)
State of RL
[Chart: compute used to train modern RL systems, e.g. ~1100 PF/s-days and ~800 PF/s-days, or roughly 45,000 years of game experience (source: OpenAI)]
- Little to no transfer of knowledge
- Key challenges: sample efficiency, generalizability
Current approaches
- Multi-task policy learning (Parisotto et al., 2015; Rusu et al., 2016; Andreas et al., 2017; Espeholt et al., 2018, …)
- Meta-learning (Duan et al., 2016; Wang et al., 2016; Finn et al., 2017; Al-Shedivat et al., 2018; Nagabandi et al., 2018, …)
- Bayesian RL (Ghavamzadeh et al., 2015; …)
- Successor representations (Dayan, 1993; Kulkarni et al., 2016; Barreto et al., 2017; …)
- Policies are closely coupled with tasks & not well suited for transfer to new domains
- Model-based RL?
Bootstrapping model learning with task-agnostic priors
- A model of the environment is more transferable than a policy
- Learning a model of the environment from scratch is expensive: use priors!
- Task-agnostic priors for models:
  + Generalizable, easier to acquire
  - May be sub-optimal for a specific task
[Diagram: a shared model prior initializes per-task models and policies for Task 1, Task 2, …]
‘Universal’ priors
- Physics [DN19]
- Language [NBJ18]
Task-agnostic dynamics priors for RL
(Yilun Du and Karthik Narasimhan, ICML 2019)
Key questions:
- Can we learn physics in a task-agnostic fashion?
- Does it help the sample efficiency of RL?
- Does it help transfer of learned representations to new tasks?
Dynamics model for RL
- Frame prediction: Oh et al. (2015), Finn et al. (2016), Weber et al. (2017), …
  - Lacks generalization, since the model is learned for a specific task (i.e., action-conditioned)
- Parameterized physics models: Cutler et al. (2014), Scholz et al. (2014), Zhu et al. (2018), …
  - Require manual specification
- Our work: learn the prior from task-independent data; decouple model and policy
Overall approach
- Pre-train a frame predictor on physics videos
- Initialize the dynamics model and use it to learn a policy that makes use of future state predictions
- Simultaneously fine-tune the dynamics model on the target environment (a training-loop sketch follows)
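A minimal sketch of this pipeline in PyTorch; the tooling, the function name, and the tiny stand-in network are assumptions for illustration, not the released code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in dynamics prior: a tiny conv net predicting the next frame from
# the current one (a placeholder for the actual SpatialNet model).
dynamics = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 3, 3, padding=1),
)
dyn_opt = torch.optim.Adam(dynamics.parameters(), lr=1e-4)

def dynamics_step(frames, next_frames):
    """One gradient step of next-frame prediction. Used both for
    task-agnostic pre-training on physics videos (no actions involved)
    and for fine-tuning on frames from the target environment."""
    loss = F.mse_loss(dynamics(frames), next_frames)
    dyn_opt.zero_grad()
    loss.backward()
    dyn_opt.step()
    return loss.item()
```

The same update is reused in both phases: only the data source changes, from physics videos to target-environment rollouts.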
SpatialNet
- Two key operations:
  - Isolate the dynamics of each entity
  - Accurately model the local space around each entity
- Spatial memory: use convolutions and residual connections to better capture local dynamics, instead of the additive updates in the ConvLSTM model (Xingjian et al., 2015); see the sketch below
- No action-conditioning
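A possible shape of the spatial-memory update, sketched in PyTorch under assumed architectural details (channel counts, block depth, and the decode head are illustrative):

```python
import torch
import torch.nn as nn

class SpatialMemory(nn.Module):
    """Sketch of a SpatialNet-style spatial memory cell (details assumed).

    The hidden state is a spatial feature map; each step updates it with a
    convolutional block plus a residual connection, so local dynamics around
    each entity are captured by local filters rather than the additive gated
    update of a ConvLSTM. Note there is no action input.
    """
    def __init__(self, in_ch=3, hid_ch=64):
        super().__init__()
        self.encode = nn.Conv2d(in_ch + hid_ch, hid_ch, 3, padding=1)
        self.update = nn.Sequential(
            nn.ReLU(),
            nn.Conv2d(hid_ch, hid_ch, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(hid_ch, hid_ch, 3, padding=1),
        )
        self.decode = nn.Conv2d(hid_ch, in_ch, 3, padding=1)

    def forward(self, frame, hidden):
        z = self.encode(torch.cat([frame, hidden], dim=1))
        hidden = hidden + self.update(z)          # residual update
        return self.decode(hidden), hidden        # next-frame prediction

# usage: h = torch.zeros(1, 64, 84, 84)
#        pred, h = SpatialMemory()(torch.rand(1, 3, 84, 84), h)
```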
SpatialNet
$\hat{T}(s' \mid s, a)$
Experimental setup
- PhysVideos: 625k frames of video containing moving objects of various shapes and sizes (rendered with a physics engine)
- PhysWorld: a collection of 2D physics-centric games with navigation, object-gathering, and shooting tasks
- Atari: stochastic version with sticky actions
- RL agent: use predicted future frames as input to the policy network (sketched below)
- Same pre-trained dynamics prior used in all RL experiments
[Diagram: dynamics model pre-trained on PhysVideos; its frame predictions feed the agent in PhysWorld and Atari]
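One plausible way to wire the prior into the agent, assuming the predicted next frame is simply stacked channel-wise with the observation (the class name and layer sizes are hypothetical):

```python
import torch
import torch.nn as nn

class AugmentedPolicy(nn.Module):
    """Hypothetical policy network that conditions on the dynamics prior's
    predicted next frame, concatenated channel-wise with the current
    observation (the exact interface is an assumption)."""
    def __init__(self, n_actions, in_ch=6):   # 3 observed + 3 predicted
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(in_ch, 32, 8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        self.head = nn.LazyLinear(n_actions)   # action logits

    def forward(self, obs, predicted_next):
        x = torch.cat([obs, predicted_next], dim=1)
        return self.head(self.trunk(x))
```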
Pixel prediction accuracy
[Figure: frame prediction results on PhysShooter]
Predicting physical parameters
Predicted frames are indicative of the environment's physical parameters
[Figure: parameter-prediction results on PhysShooter]
Policy learning: PhysWorld
Policy learning: Atari
Same pre-trained model as before (from PhysVideos)
Transfer Learning
Model transfer > policy transfer > no transfer
[Bar chart: reward for no transfer, a pre-trained dynamics predictor, policy transfer (PhysForage), and model transfer (PhysForage)]
Target env: PhysShooter
Beyond physics
- Not all environments are physical
- How do we encode knowledge of the environment?
- Knowledge of environment: transitions and rewards
- Need some anchor to re-use acquired information
- Incorrect mapping will lead to negative transfer
[Diagram: transition graphs for two environments, with states s1 … s5 in Environment 1 and u1 … u6 in Environment 2 and example transition probabilities (0.4, 0.4, 0.2 vs. 0.35, 0.45, 0.2); an alignment maps similar states across environments, e.g. s1 is similar to u1, s2 is similar to u3, …]
- Text descriptions associated with objects/entities, e.g. 'Scorpions chase you and kill you on touch', 'Spiders are chasers and can be destroyed by an explosion' (a toy alignment sketch follows)
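A toy sketch of how descriptions could serve as anchors for aligning states across environments; the bag-of-embeddings encoder and helper names are hypothetical, not from the paper:

```python
import torch
import torch.nn.functional as F

# Hypothetical illustration: align entities/states across environments by
# comparing the embeddings of their text descriptions.
emb = torch.nn.Embedding(10000, 64)   # stand-in word embeddings

def encode(token_ids):
    """Bag-of-embeddings sentence encoder: (T,) token ids -> (64,) vector."""
    return emb(token_ids).mean(dim=0)

def best_match(desc_env1, descs_env2):
    """Map a state in environment 1 to its most similar state in
    environment 2, by cosine similarity of description embeddings."""
    q = encode(desc_env1)
    sims = torch.stack([F.cosine_similarity(q, encode(d), dim=0)
                        for d in descs_env2])
    return int(sims.argmax())         # e.g. s2 -> u3
```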
Grounding language for transfer in RL
Language as a bridge
- Language is a task-invariant and accessible medium
- Traditional approaches: direct transfer of a policy (e.g., instruction following)
- This work: transfer a 'model' of the environment using text descriptions
(Narasimhan, Barzilay, Jaakkola, JAIR 2018)
Model-based reinforcement learning
State $s$, action $a$, successor states $s'_1, \dots, s'_n$
Transition distribution $T(s' \mid s, a)$ and reward function $R(s, a)$ (e.g., $+10$)
Text-conditioned transition distribution: state $s$, action $a$, text $z$ ('Scorpions chase you and kill you on touch'), successor states $s'_1, \dots, s'_n$
$T(s' \mid s, a, z)$
Bootstrap learning through text
$z_1$: 'Scorpions chase you and kill you on touch'
- Appropriate representation to incorporate language
- Partial text descriptions
$T_1(s' \mid s, a, z_1)$, $R_1(s, a, z_1)$
Transfer with $z_2$: 'Spiders are chasers and can be destroyed by an explosion'
$\hat{T}_2(u' \mid u, a, z_2)$, $\hat{R}_2(u, a, z_2)$
(a sketch of a text-conditioned transition model follows)
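A hedged sketch of what a text-conditioned transition model might look like over a discrete state set; the LSTM encoder and fusion layers are assumptions, not the paper's exact parameterization:

```python
import torch
import torch.nn as nn

class TextConditionedTransition(nn.Module):
    """Sketch of a text-conditioned transition model T(s' | s, a, z).

    A description z (e.g. 'Spiders are chasers and can be destroyed by an
    explosion') is encoded with an LSTM and fused with state and action
    embeddings to produce a distribution over next states. All details
    here are illustrative assumptions.
    """
    def __init__(self, vocab_size, n_states, n_actions, emb=64, hid=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        self.text_enc = nn.LSTM(emb, hid, batch_first=True)
        self.state_emb = nn.Embedding(n_states, hid)
        self.action_emb = nn.Embedding(n_actions, hid)
        self.out = nn.Sequential(
            nn.Linear(3 * hid, hid), nn.ReLU(),
            nn.Linear(hid, n_states),          # logits over next states s'
        )

    def forward(self, state, action, tokens):
        _, (h, _) = self.text_enc(self.embed(tokens))   # h: (1, B, hid)
        x = torch.cat([self.state_emb(state),
                       self.action_emb(action), h[-1]], dim=-1)
        return torch.log_softmax(self.out(x), dim=-1)   # log T(s' | s, a, z)
```

Transfer then amounts to reusing the same network with a new description $z_2$ in the target environment.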
[Diagram: value iteration module in which a convolutional neural network (CNN) computes reward $R$, transition $T$, $Q$, and value $V$ maps over state features $\phi(s)$, with a $k$-step recurrence and a max over actions]
Differentiable value iteration
$$Q(s, a) = R(s, a) + \gamma \sum_{s'} T(s' \mid s, a)\, V(s'), \qquad V(s) = \max_a Q(s, a)$$
(Value Iteration Network, Tamar et al., 2016)
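A compact sketch of the k-step recurrence, following the published VIN formulation; `q_conv` is a caller-supplied convolution whose filters play the role of $\gamma \, T(s' \mid s, a)$ acting on the stacked reward and value maps:

```python
import torch
import torch.nn as nn

def differentiable_value_iteration(reward_map, q_conv, k=10):
    """k-step differentiable value iteration (after Tamar et al., 2016).

    reward_map: (batch, 1, H, W) reward estimates over a spatial state grid
    q_conv:     conv layer mapping the 2-channel [R; V] stack to one Q
                channel per action
    """
    v = torch.zeros_like(reward_map)
    for _ in range(k):
        # Q(s, a) = R(s, a) + gamma * sum_{s'} T(s' | s, a) V(s')
        q = q_conv(torch.cat([reward_map, v], dim=1))
        # V(s) = max_a Q(s, a)
        v, _ = q.max(dim=1, keepdim=True)
    return v

# Example usage (shapes only): 8 actions over a 10x10 state grid.
q_conv = nn.Conv2d(2, 8, 3, padding=1)
v = differentiable_value_iteration(torch.zeros(1, 1, 10, 10), q_conv)
```

Because every step is differentiable, the transition filters are learned end-to-end from the RL signal.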
Observations + Descriptions
Experiments
- 2D game environments from the GVGAI framework (each with different layouts, different entity sets, etc.)
- Text descriptions collected from Amazon Mechanical Turk
- Transfer setup: train on multiple source tasks, and use the learned parameters to initialize for target tasks
- Baselines: DQN (Mnih et al., 2015), text-DQN, Actor-Mimic (Parisotto et al., 2016)
- Evaluation: jumpstart, average, and asymptotic reward
[Table: environment statistics; source and target game instances for transfer]
Average reward
[Bar chart: average reward for transfer from F&E-1 to Freeway, comparing no transfer, DQN, Actor-Mimic, text-DQN, and text-VIN]
Transfer results
Conclusions
- Model-based RL is sample-efficient, but learning a model is expensive
- Task-agnostic priors over models provide a solution for both sample efficiency and generalization
- Two common priors applicable to a variety of tasks: classical mechanics and language

Questions?
Challenger: Joshua Zhanson <jzhanson@andrew.cmu.edu>
- After the success of deep learning, we are now seeing a push into middle-level intelligence, such as
  - cross-domain reasoning, e.g., visual question answering or language grounding,
  - using knowledge from different tasks and domains to aid learning, e.g., learning skills from video or demonstration, or learning to learn in general.
- What do you see as the end goal of such mid-level intelligence, especially since the space of mid-level tasks is so much more complex and varied?
- What are the greatest obstacles on the path to mid-level intelligence?