  1. Differentiable Tree Planning for Deep RL
     Greg Farquhar

  2. In Collaboration With Tim Rocktäschel, Maximilian Igl, & Shimon Whiteson

  3. Overview
     ● Reinforcement learning
     ● Model-based RL and online planning
     ● TreeQN and ATreeC (ICLR 2018)
     ● Results
     ● Future work

  4. Planning and Learning for Control

  5. Reinforcement Learning

  6. Reinforcement Learning
     ● Specify the reward, learn the solution
     ● Very general framework
     ● Problem is hard:
       ○ Rewards are sparse
       ○ Credit assignment
       ○ Exploration and exploitation
       ○ Large state/action spaces
       ○ Approximation and generalisation

  7. RL Key Concepts
     ● State (observation)
     ● Action
     ● Transition
     ● Reward
     ● Policy: states → actions
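
A minimal sketch of how these concepts fit together, assuming a Gym-style environment `env` and a `policy` function (both hypothetical names, not from the talk):

    # One episode of the agent-environment loop: the policy maps states to
    # actions; the environment applies the transition and emits a reward.
    state = env.reset()
    done = False
    while not done:
        action = policy(state)                        # policy: states -> actions
        state, reward, done, info = env.step(action)  # transition + reward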

  8. Model-free RL: Value Functions
     ● Learn without a model of the environment
     ● Value function
     ● Optimal value function
     ● Policy evaluation + improvement
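
For reference, the standard definitions behind these bullets, in the usual notation (policy π, discount factor γ, per-step rewards r_t):

    % Value of a state, and of a state-action pair, under policy \pi:
    V^{\pi}(s)   = \mathbb{E}_{\pi}\left[ \sum_{t=0}^{\infty} \gamma^{t} r_{t} \,\big|\, s_{0}=s \right]
    Q^{\pi}(s,a) = \mathbb{E}_{\pi}\left[ \sum_{t=0}^{\infty} \gamma^{t} r_{t} \,\big|\, s_{0}=s,\ a_{0}=a \right]
    % The optimal value function is the best achievable over all policies:
    Q^{*}(s,a) = \max_{\pi} Q^{\pi}(s,a)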

  9. The Bellman Equation
     ● Temporal (Markov) structure
     ● Bellman optimality equation
     ● Q-learning
     ● Backups
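
The equation and update behind these bullets, both standard:

    % Bellman optimality equation: the Markov structure lets Q* be written
    % recursively over one-step transitions.
    Q^{*}(s,a) = \mathbb{E}\left[ r + \gamma \max_{a'} Q^{*}(s',a') \,\big|\, s,a \right]
    % Q-learning backs this up from sampled transitions with step size \alpha:
    Q(s,a) \leftarrow Q(s,a) + \alpha \left( r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right)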

  10. Deep RL
      ● Q → deep neural network
      ● Q-learning as regression
      ● Stability is hard:
        ○ Target networks
        ○ Replay memory
        ○ Parallel environment threads
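
A minimal sketch (not the talk's code) of Q-learning as regression with a target network, assuming PyTorch networks `q_net` and `target_net` and a replay batch of tensors s, a, r, s2, done:

    import torch
    import torch.nn.functional as F

    def dqn_loss(q_net, target_net, s, a, r, s2, done, gamma=0.99):
        q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)  # Q(s, a) for taken actions
        with torch.no_grad():  # hold the target network fixed for stability
            target = r + gamma * (1 - done) * target_net(s2).max(dim=1).values
        return F.mse_loss(q, target)  # regress Q(s, a) onto the bootstrapped target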

  11. Deep RL - Encode and Evaluate?

  12. Model-based RL

  13. Online Planning with Tree Search

  14. Environment Models
      ● State transition + reward
      ● Can be hard to learn:
        ○ Complex
        ○ Generalise poorly to new parts of the state space
      ● Need very good fidelity for planning
      ● Standard approach: predictive error on observations
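
A sketch of that standard approach, assuming a hypothetical `model` that predicts the next observation from the current observation and an action:

    # Train the model by predictive error on raw observations.
    def pixel_prediction_loss(model, obs, action, next_obs):
        pred = model(obs, action)               # predicted next observation
        return ((pred - next_obs) ** 2).mean()  # squared error in pixel space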

  15. Model fidelity in complex visual spaces is too low for effective planning
      [Figure from "Action-Conditional Video Prediction using Deep Networks in Atari Games" (Oh et al., 2015)]

  16. Model fidelity in complex visual spaces is too low for effective planning
      [Figure from "Action-Conditional Video Prediction using Deep Networks in Atari Games" (Oh et al., 2015)]

  17. Another Way to Learn Models
      ● Optimise the true objective downstream of the model:
        ○ Value prediction
        ○ Performance on the real task
      ● Our approach: integrate a differentiable model into a differentiable planner, and learn end-to-end.

  18. TreeQN: Encode

  19. TreeQN: Tree Expansion

  20. TreeQN: Evaluation

  21. TreeQN: Tree Backup

  22. TreeQN
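
Putting slides 18-21 together, a hedged sketch of the TreeQN computation: encode the observation into a latent state, expand a tree of latent states with a learned transition and reward function, evaluate the leaves, and back values up the tree. The module names (`encode`, `transition`, `reward_fn`, `value_fn`) are placeholders, not the authors' identifiers, and a hard max stands in for the soft backup described on the next slide:

    ACTIONS = range(NUM_ACTIONS)  # assumed discrete action set

    def tree_q(obs, depth):
        z = encode(obs)  # encode: observation -> latent state
        return [q_value(z, a, depth) for a in ACTIONS]

    def q_value(z, a, depth):
        r = reward_fn(z, a)    # predicted immediate reward
        z2 = transition(z, a)  # latent transition model
        if depth == 1:
            return r + GAMMA * value_fn(z2)  # evaluate the leaf state
        # tree backup: expand the children and back up the best child Q-value
        return r + GAMMA * max(q_value(z2, b, depth - 1) for b in ACTIONS)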

  23. Architecture Details
      ● Two-step transition function
      ● Residual connections
      ● State normalisation
      ● Soft backups
      [Figure: transition diagram labelled with actions a1, a2, a3 and "shared", "action-conditional", "normalise" steps]
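
A hedged PyTorch sketch of these details: a two-step residual transition (a shared step, then an action-conditional step) with unit-L2 state normalisation, and a softmax-weighted "soft" backup in place of a hard max, so every branch receives gradient. The layer names and temperature are assumptions:

    import torch.nn.functional as F

    def transition_step(z, shared_layer, action_layer):
        h = z + shared_layer(z)  # shared residual step
        h = h + action_layer(h)  # action-conditional residual step
        return h / h.norm(dim=-1, keepdim=True)  # state normalisation

    def soft_backup(q_children, temperature=1.0):
        w = F.softmax(q_children / temperature, dim=-1)  # weights over children
        return (w * q_children).sum(dim=-1)              # soft maximum of Q-values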

  24. Training
      ● Optimise end-to-end with the primary RL objective
      ● Parameter sharing
      ● N-step Q-learning with parallel environment threads
      ● Batch thread data together for GPU
      ● Increase virtual batch size during tree expansion for efficient computation
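
A minimal sketch of the n-step Q-learning targets over a length-n rollout gathered from the parallel environment threads (argument names are assumptions):

    def nstep_targets(rewards, dones, bootstrap_q, gamma=0.99):
        # rewards, dones: per-step batches; bootstrap_q: max_a Q(s_n, a)
        targets, ret = [], bootstrap_q
        for r, d in zip(reversed(rewards), reversed(dones)):
            ret = r + gamma * (1 - d) * ret  # cut the bootstrap at episode ends
            targets.append(ret)
        targets.reverse()  # align the targets with steps 0 .. n-1
        return targets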

  25. Grounding the Transition Model
      ● Observations
      ● Latent states
      ● Rewards
        ○ Inside true targets
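
One way to ground the reward function, sketched with the same placeholder modules as before: regress the rewards predicted along the executed trajectory onto the rewards actually observed. (The "inside true targets" variant instead folds the predicted rewards into the n-step targets themselves; this sketch shows only the direct regression.)

    def reward_grounding_loss(obs, actions, rewards):
        z = encode(obs)
        loss = 0.0
        for a, r in zip(actions, rewards):
            loss = loss + (reward_fn(z, a) - r) ** 2  # predicted vs. observed
            z = transition(z, a)  # follow the action that was actually taken
        return loss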

  26. ATreeC
      ● Use tree architecture for policy
      ● Linear critic
      ● Train with policy gradient
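
A minimal sketch of the ATreeC idea: reuse the tree's per-action scores as policy logits, pair them with a separately parameterised (here linear) critic, and train with a policy-gradient loss. PyTorch is assumed; the names and loss weighting are illustrative, not the paper's:

    import torch.nn.functional as F

    def atreec_loss(tree_scores, critic_value, action, ret):
        logp = F.log_softmax(tree_scores, dim=-1)[action]  # tree output as policy
        advantage = (ret - critic_value).detach()          # critic as baseline
        policy_loss = -advantage * logp                    # policy gradient term
        critic_loss = F.mse_loss(critic_value, ret)        # fit the critic
        return policy_loss + 0.5 * critic_loss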

  27. Results: Grounding
      ● Grounding weakly (just the reward function) works best
      ● Perhaps jointly training the auxiliary objectives is the wrong approach

  28. Results: Box Pushing
      ● TreeQN helps!
      ● Extra depth can help in some situations

  29. Results: Atari
      ● Good performance
      ● Makes use of depth (vs. DQN-Deep)
      ● Main benefit from depth-1:
        ○ Reward + value
        ○ Auxiliary loss
        ○ Parameter sharing

  30. Results: ATreeC
      ● Works, and is easy to use as a drop-in replacement
      ● Smaller benefits than TreeQN
      ● Limited by the quality of the critic?

  31. Just for fun

  32. Interpretability
      ● Sometimes (?)
      ● Firmly on the model-free end of the spectrum
      ● Grounding is an open question:
        ○ Better auxiliary tasks?
        ○ Pre-training?
        ○ Different environments?

  33. Future Work
      ● Lessons learnt for model-free RL:
        ○ Depth
        ○ Structure
        ○ Auxiliary tasks
      ● Online planning:
        ○ Need more grounded models to use more refined planning algorithms

  34. Summary
      ● Combining online planning with deep RL is a key challenge
      ● We can use a differentiable model inside a differentiable planner and train end-to-end
      ● Tree-structured models can encode a valuable inductive bias
      ● More work is needed to effectively learn and use grounded models

  35. Thank you!
