

SLIDE 1

Differentiable Tree Planning for Deep RL

Greg Farquhar

1

SLIDE 2

Greg Farquhar

In Collaboration With

Tim Rocktäschel, Maximilian Igl, & Shimon Whiteson

2 / 35

SLIDE 3

Greg Farquhar

Overview

  • Reinforcement learning
  • Model-based RL and online planning
  • TreeQN and ATreeC (ICLR 2018)
  • Results
  • Future work

3 / 35

SLIDE 4

Greg Farquhar

Planning and Learning for Control

4 / 35

SLIDE 5

Greg Farquhar

Reinforcement Learning

5 / 35

SLIDE 6

Greg Farquhar

Reinforcement Learning

  • Specify the reward, learn the solution
  • Very general framework
  • Problem is hard:

○ Rewards are sparse
○ Credit assignment
○ Exploration and exploitation
○ Large state/action spaces
○ Approximation and generalisation

6 / 35

SLIDE 7

Greg Farquhar

RL Key Concepts

  • State (Observation)
  • Action
  • Transition
  • Reward
  • Policy: states → actions
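
To make these concepts concrete, here is a minimal sketch (not from the slides) of the agent-environment loop they describe, using a toy hand-written environment; the environment and the random placeholder policy are illustrative assumptions only.

```python
import random

# Toy environment: a 1-D chain of states 0..4; reaching state 4 ends the
# episode with reward +1. This stands in for the state/action/transition/reward
# interface described above.
class ChainEnv:
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):                       # action: 0 = left, 1 = right
        self.state = max(0, min(4, self.state + (1 if action == 1 else -1)))
        reward = 1.0 if self.state == 4 else 0.0
        done = self.state == 4
        return self.state, reward, done           # transition + reward

def policy(state):
    # Placeholder policy: states -> actions (here, uniformly random).
    return random.choice([0, 1])

env = ChainEnv()
state, done, episode_return = env.reset(), False, 0.0
while not done:
    action = policy(state)
    state, reward, done = env.step(action)
    episode_return += reward
print("episode return:", episode_return)
```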

7 / 35

SLIDE 8

Greg Farquhar

Model-free RL: Value Functions

  • Learn without a model of the environment
  • Value function
  • Optimal value function
  • Policy evaluation + improvement
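
For reference, the standard definitions behind these bullets (not reproduced from the slides) can be written as:

```latex
% Discounted return, policy value, action value, and the optimal value functions.
\begin{align}
  G_t &= \sum_{k=0}^{\infty} \gamma^k \, r_{t+k}, \qquad 0 \le \gamma < 1 \\
  V^{\pi}(s) &= \mathbb{E}_{\pi}\!\left[\, G_t \mid s_t = s \,\right] \\
  Q^{\pi}(s,a) &= \mathbb{E}_{\pi}\!\left[\, G_t \mid s_t = s,\; a_t = a \,\right] \\
  V^{*}(s) &= \max_{\pi} V^{\pi}(s), \qquad Q^{*}(s,a) = \max_{\pi} Q^{\pi}(s,a)
\end{align}
```

Policy evaluation estimates V^π or Q^π for the current policy; policy improvement then acts greedily with respect to those estimates.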

8 / 35

SLIDE 9

Greg Farquhar

The Bellman Equation

  • Temporal (Markov) structure
  • Bellman optimality equation
  • Q-learning
  • Backups
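
A minimal sketch of the tabular Q-learning backup, which applies the Bellman optimality equation as a sample-based update (standard textbook form, not code from the talk):

```python
import random
from collections import defaultdict

# Q-learning backup: Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)).
# This is a sampled version of the Bellman optimality equation.
def q_learning_update(Q, s, a, r, s_next, done, actions, alpha=0.1, gamma=0.99):
    bootstrap = 0.0 if done else max(Q[(s_next, a2)] for a2 in actions)
    target = r + gamma * bootstrap
    Q[(s, a)] += alpha * (target - Q[(s, a)])

# Epsilon-greedy action selection from the current estimates.
def epsilon_greedy(Q, s, actions, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

Q = defaultdict(float)   # tabular action-value estimates, default 0
```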

9 / 35

SLIDE 10

Greg Farquhar

Deep RL

  • Q → deep neural network
  • Q-learning as regression
  • Stability is hard

○ Target networks
○ Replay memory
○ Parallel environment threads
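
A sketch of Q-learning as regression with a deep network, including the target-network and replay-memory stabilisers listed above; this is a generic DQN-style loss in PyTorch with illustrative sizes and names, not the talk's code.

```python
import random
from collections import deque
import torch
import torch.nn as nn
import torch.nn.functional as F

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))        # Q -> deep network
target_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net.load_state_dict(q_net.state_dict())        # target network: periodically refreshed copy
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay = deque(maxlen=100_000)                        # replay memory of (s, a, r, s', done) tuples

def td_loss(batch, gamma=0.99):
    s, a, r, s2, done = map(torch.as_tensor, zip(*batch))
    q = q_net(s.float()).gather(1, a.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():                             # no gradient through the target network
        target = r.float() + gamma * (1 - done.float()) * target_net(s2.float()).max(1).values
    return F.mse_loss(q, target)                      # Q-learning as regression to a moving target

# One training step, once the replay memory holds enough transitions:
# loss = td_loss(random.sample(replay, 32)); optimizer.zero_grad(); loss.backward(); optimizer.step()
```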

10 / 35

SLIDE 11

Greg Farquhar

Deep RL - Encode and Evaluate?

11 / 35

SLIDE 12

Greg Farquhar

Model-based RL

12 / 35

SLIDE 13

Greg Farquhar

Online Planning with Tree Search

13 / 35

SLIDE 14

Greg Farquhar

Environment Models

  • State transition + reward
  • Can be hard to learn

○ Complex
○ Generalise poorly to new parts of the state space

  • Need very good fidelity for planning
  • Standard approach: predictive error on observations
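
To make the standard approach concrete, here is a sketch of learning an environment model by regressing predicted next observations (and rewards) onto the true ones, in the spirit of observation-prediction models; the architecture and names are illustrative assumptions, not any particular paper's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ObservationModel(nn.Module):
    """Predicts the next observation and reward from (observation, one-hot action)."""
    def __init__(self, obs_dim, n_actions, hidden=256):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim + n_actions, hidden), nn.ReLU())
        self.next_obs = nn.Linear(hidden, obs_dim)
        self.reward = nn.Linear(hidden, 1)

    def forward(self, obs, action_onehot):
        h = self.body(torch.cat([obs, action_onehot], dim=-1))
        return self.next_obs(h), self.reward(h).squeeze(-1)

def model_loss(model, obs, action_onehot, next_obs, reward):
    # Trained purely on predictive error in observation space: small pixel-level
    # errors can still translate into large errors when planning many steps ahead.
    pred_obs, pred_r = model(obs, action_onehot)
    return F.mse_loss(pred_obs, next_obs) + F.mse_loss(pred_r, reward)
```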

14 / 35

SLIDE 15

Greg Farquhar

Model fidelity in complex visual spaces is too low for effective planning

Action-conditional video prediction using deep networks in Atari games (Oh et al., 2015)

15 / 35

SLIDE 16

Greg Farquhar

Model fidelity in complex visual spaces is too low for effective planning

Action-conditional video prediction using deep networks in Atari games (Oh et al., 2015)

16 / 35

SLIDE 17

Greg Farquhar

Another Way to Learn Models

  • Optimise the true objective downstream of the model

○ Value prediction
○ Performance on the real task

  • Our approach: integrate a differentiable model into a differentiable planner and learn end-to-end (sketched below)
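
As a contrast with the observation-prediction sketch above, a minimal illustration of optimising the model for the true objective downstream of it: the model is only ever used to predict values, and only that error is back-propagated through the (differentiable) model. The encoder/transition/value_head callables are illustrative assumptions.

```python
import torch.nn.functional as F

def downstream_value_loss(encoder, transition, value_head, obs, action_onehot, n_step_return):
    """Train the model so that values computed THROUGH it match real returns,
    rather than making it match observations pixel-by-pixel."""
    z = encoder(obs)                          # latent state
    z_next = transition(z, action_onehot)     # latent transition (never decoded to pixels)
    v = value_head(z_next).squeeze(-1)        # value predicted downstream of the model
    return F.mse_loss(v, n_step_return)       # gradients flow end-to-end into the model
```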

17 / 35

SLIDE 18

Greg Farquhar

TreeQN: Encode

18 / 35

SLIDE 19

Greg Farquhar

TreeQN: Tree Expansion

19 / 35

SLIDE 20

Greg Farquhar

TreeQN: Evaluation

20 / 35

SLIDE 21

Greg Farquhar

TreeQN: Tree Backup

21 / 35

SLIDE 22

Greg Farquhar

TreeQN
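
A compact sketch of the full TreeQN forward pass shown on the preceding slides (encode, tree expansion, per-node evaluation, tree backup), following the paper's description; the module sizes, names, and the hard-max backup are my own simplifications, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TreeQNSketch(nn.Module):
    def __init__(self, obs_dim, n_actions, latent=128, depth=2, gamma=0.99):
        super().__init__()
        self.n_actions, self.depth, self.gamma = n_actions, depth, gamma
        self.encoder = nn.Sequential(nn.Linear(obs_dim, latent), nn.ReLU())   # encode
        # One transition per action (action-conditional), plus reward and value heads.
        self.transitions = nn.ModuleList([nn.Linear(latent, latent) for _ in range(n_actions)])
        self.reward_head = nn.Linear(latent, n_actions)
        self.value_head = nn.Linear(latent, 1)

    def q_values(self, z, d):
        """Q(z, a) for every action, backed up through a tree of remaining depth d."""
        rewards = self.reward_head(z)                       # r(z, a), shape (batch, n_actions)
        q = []
        for a, trans in enumerate(self.transitions):        # tree expansion: try every action
            z_next = F.relu(trans(z))                       # simplified transition function
            if d == 1:                                      # leaf: evaluate with the value head
                backed_up = self.value_head(z_next).squeeze(-1)
            else:                                           # internal node: recurse and back up
                backed_up = self.q_values(z_next, d - 1).max(dim=-1).values
            q.append(rewards[:, a] + self.gamma * backed_up)
        return torch.stack(q, dim=-1)                       # (batch, n_actions)

    def forward(self, obs):
        z = self.encoder(obs)                 # encode the observation into a latent state
        return self.q_values(z, self.depth)   # the whole tree is differentiable end-to-end
```

The hard max over children is used here only for brevity; the architecture-details slide below replaces it with a soft backup.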

22 / 35

SLIDE 23

Greg Farquhar

Architecture Details

  • Two-step transition function
  • Residual connections
  • State normalisation
  • Soft backups
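
A sketch of the choices listed above, based on my reading of the TreeQN paper: a transition with a shared first step and an action-conditional second step, a residual connection, unit-norm state normalisation, and a softmax-weighted ("soft") backup instead of a hard max. Names and sizes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoStepTransition(nn.Module):
    """Two-step transition: a shared layer, then one layer per action,
    with a residual connection and L2 normalisation of the latent state."""
    def __init__(self, latent, n_actions):
        super().__init__()
        self.shared = nn.Linear(latent, latent)
        self.per_action = nn.ModuleList([nn.Linear(latent, latent) for _ in range(n_actions)])

    def forward(self, z, action):
        h = F.relu(self.shared(z))               # step 1: shared across all actions
        dz = self.per_action[action](h)          # step 2: action-conditional
        z_next = z + dz                          # residual connection
        return z_next / z_next.norm(dim=-1, keepdim=True).clamp(min=1e-8)   # state normalisation

def soft_backup(q_children, temperature=1.0):
    """Soft backup: a softmax-weighted average of the children's Q-values rather
    than a hard max, which keeps gradients flowing into every branch of the tree."""
    weights = F.softmax(q_children / temperature, dim=-1)
    return (weights * q_children).sum(dim=-1)
```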

23 / 35

[Figure: transition diagram showing a shared layer, state normalisation, and action-conditional branches for actions a1, a2, a3]

SLIDE 24

Greg Farquhar

Training

  • Optimise end-to-end with primary RL objective
  • Parameter sharing
  • N-step Q-learning with parallel environment threads
  • Batch thread data together for GPU
  • Increase virtual batch size during tree expansion for efficient computation
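
A sketch of the n-step Q-learning objective used to train everything end-to-end; the batch layout (rollouts from parallel environment threads stacked together for the GPU) and the omission of mid-rollout episode terminations are simplifications.

```python
import torch
import torch.nn.functional as F

def n_step_q_loss(q_net, obs, actions, rewards, last_obs, last_done, gamma=0.99):
    """obs, actions, rewards have shape (n_steps, n_threads, ...), collected from
    parallel environment threads; q_net maps a batch of observations to Q-values."""
    with torch.no_grad():                                   # bootstrap from the final state
        bootstrap = q_net(last_obs).max(dim=-1).values * (1.0 - last_done.float())
    n_steps = rewards.shape[0]
    returns, loss = bootstrap, 0.0
    for t in reversed(range(n_steps)):                      # build n-step returns backwards
        returns = rewards[t] + gamma * returns
        q_taken = q_net(obs[t]).gather(1, actions[t].unsqueeze(1)).squeeze(1)
        loss = loss + F.mse_loss(q_taken, returns)          # regress Q(s_t, a_t) onto the return
    return loss / n_steps
```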

24 / 35

SLIDE 25

Greg Farquhar

Grounding the Transition Model

  • Observations
  • Latent states
  • Rewards

○ These already appear inside the true training targets
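
As one concrete grounding option (an illustrative sketch, not the talk's code), the tree's predicted immediate rewards can be regressed onto the rewards actually observed for the actions taken; the results slide later suggests this weak, reward-only grounding works best.

```python
import torch.nn.functional as F

def reward_grounding_loss(predicted_rewards, taken_actions, true_rewards):
    """predicted_rewards: (batch, n_actions) rewards predicted at the tree's first level.
    The auxiliary loss only grounds the entries for the actions actually taken."""
    pred_taken = predicted_rewards.gather(1, taken_actions.unsqueeze(1)).squeeze(1)
    return F.mse_loss(pred_taken, true_rewards)
```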

25 / 35

SLIDE 26

Greg Farquhar

ATreeC

  • Use tree architecture for policy
  • Linear critic
  • Train with policy gradient
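
A sketch of how the tree becomes an actor-critic (ATreeC): the tree's backed-up per-action scores are used as policy logits, a linear critic estimates the state value from the encoded state, and both are trained with a standard policy-gradient loss. This reuses the TreeQNSketch class from the earlier sketch; all names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ATreeCSketch(nn.Module):
    def __init__(self, tree_q, latent_dim):
        super().__init__()
        self.tree_q = tree_q                        # tree-structured network, as in TreeQNSketch
        self.critic = nn.Linear(latent_dim, 1)      # linear critic on the encoded state

    def forward(self, obs):
        z = self.tree_q.encoder(obs)
        logits = self.tree_q.q_values(z, self.tree_q.depth)   # per-action scores from the tree
        return F.softmax(logits, dim=-1), self.critic(z).squeeze(-1)

def actor_critic_loss(probs, value, action, n_step_return, value_coef=0.5, entropy_coef=0.01):
    log_probs = probs.clamp(min=1e-8).log()
    advantage = (n_step_return - value).detach()              # advantage drives the actor only
    policy_loss = -(log_probs.gather(1, action.unsqueeze(1)).squeeze(1) * advantage).mean()
    value_loss = F.mse_loss(value, n_step_return)             # critic regression
    entropy = -(probs * log_probs).sum(dim=-1).mean()         # entropy bonus for exploration
    return policy_loss + value_coef * value_loss - entropy_coef * entropy
```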

26 / 35

SLIDE 27

Greg Farquhar

Results: Grounding

  • Grounding weakly (just reward function) works best
  • Perhaps jointly training with auxiliary objectives is the wrong approach

27 / 35

SLIDE 28

Greg Farquhar

Results: Box Pushing

  • TreeQN helps!
  • Extra depth can help in some situations

28 / 35

SLIDE 29

Greg Farquhar

Results: Atari

  • Good performance
  • Makes use of depth (vs. DQN-Deep)
  • Main benefit comes from depth-1:

○ Reward + value
○ Auxiliary loss
○ Parameter sharing

29 / 35

SLIDE 30

Greg Farquhar

Results: ATreeC

  • Works: easy to use as a drop-in replacement
  • Smaller benefits than TreeQN
  • Limited by the quality of the critic?

30 / 35

SLIDE 31

Greg Farquhar

Just for fun

31 / 35

SLIDE 32

Greg Farquhar

Interpretability

  • Sometimes (?)
  • Firmly on model-free end of spectrum
  • Grounding is an open question

○ Better auxiliary tasks?
○ Pre-training?
○ Different environments?

32 / 35

SLIDE 33

Greg Farquhar

Future Work

  • Lessons learnt for model-free RL:

○ Depth
○ Structure
○ Auxiliary tasks

  • Online planning:

○ Need more grounded models to use more refined planning algorithms

33 / 35

SLIDE 34

Greg Farquhar

Summary

  • Combining online planning with deep RL is a key challenge
  • We can use a differentiable model inside a differentiable planner and train end-to-end

  • Tree-structured models can encode a valuable inductive bias
  • More work is needed to effectively learn and use grounded models

34 / 35

SLIDE 35

Greg Farquhar

Thank you!

35 / 35