Deep Reinforcement Learning: Introduction and State-of-the-art

  1. Deep Reinforcement Learning: Introduction and State-of-the-art. Arjun Chandra, Research Scientist, Telenor Research / Telenor-NTNU AI Lab. arjun.chandra@telenor.com, @boelger. 24 October 2017. Slack: https://join.slack.com/t/deep-rl-tutorial/signup

  2. The Plan • Some history • RL and Deep RL in a nutshell • Deep RL Toolbox • Challenges and State-of-the-art: • Data Efficiency • Exploration • Temporal Abstractions • Generalisation

  3. https://vimeo.com/20042665

  4. Brief History: Rich Sutton et al., late 1980s; L-J Lin, RL for robots using NNs, PhD 1993, CMU; Gerald Tesauro, 1995; Stanford helicopter project, 2004, http://heli.stanford.edu/; Vlad Mnih et al., Google DeepMind, 2013 onwards; David Silver et al., Google DeepMind, 2015 onwards

  5. Problem Characteristics: dynamic; uncertainty/volatility; uncharted/unimagined/exception-laden; delayed consequences; requires strategy. Image credit: http://wonderfulengineering.com/inside-the-data-center-where-google-stores-all-its-data-pictures/

  6. Solution: a machine with agency which learns, plans, and acts to find a strategy for solving the problem; autonomous to some extent; probes and learns from feedback; focuses on the long-term objective; explores and exploits.

  7. Reinforcement Learning: the Agent interacts with the Problem/Environment; actions go out, observations and feedback on actions come back. Inside the agent: Model (dynamics model), π/Q (policy/value function), and a Goal (maximise return E{R}).

  8. The MDP game! Interact with the Problem/Environment to maximise long-term reward: observations and feedback on actions in, actions out, with Model, π/Q, and Goal (maximise return E{R}) inside the agent. Inspired by Prof. Rich Sutton's tutorial: https://www.youtube.com/watch?v=ggqnxyjaKe4

  9. The MDP (S, A, P, R, γ). R: immediate reward function R(s, a); P: state transition probability P(s'|s, a). The slide's example has two states (1, 2) and two actions (A, B): transitions are stochastic (e.g. P=0.99 one way, P=0.01 the other) and rewards are noisy (R = -10±3, 10±3, 20±3, 40±3 depending on state and action). Code: https://github.com/traai/basic-rl
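
As a concrete illustration, here is a minimal Python sketch of a two-state MDP like the one above. The exact wiring of the slide's probabilities and noisy rewards is ambiguous in the diagram, so the numbers below are one plausible reading; the basic-rl repo linked above has the authoritative version.

    import random

    # P[(s, a)] -> list of (next_state, prob); R[(s, a)] -> (mean, std) of reward
    # (the assignment of numbers to (state, action) pairs is a guess from the slide)
    P = {
        (1, 'A'): [(1, 0.99), (2, 0.01)],
        (1, 'B'): [(1, 0.01), (2, 0.99)],
        (2, 'A'): [(2, 0.99), (1, 0.01)],
        (2, 'B'): [(2, 0.01), (1, 0.99)],
    }
    R = {
        (1, 'A'): (-10, 3), (1, 'B'): (10, 3),
        (2, 'A'): (40, 3),  (2, 'B'): (20, 3),
    }

    def step(s, a):
        """Sample (next_state, reward) from the MDP dynamics."""
        states, probs = zip(*P[(s, a)])
        s_next = random.choices(states, weights=probs)[0]
        mu, sigma = R[(s, a)]
        return s_next, random.gauss(mu, sigma)

    s, total = 1, 0.0
    for t in range(100):                      # random-policy rollout
        s, r = step(s, random.choice('AB'))
        total += r
    print('return over 100 steps:', total)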

  10. Terminology: state or action, value function, policy, dynamics model, reward, home, goal.

  11. Terminology: value function. Q(s, a) is the value of taking action a in state s; V(s) is the value of being in state s.
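
Concretely, for a policy π and discount γ, the standard definitions (implied but not spelled out on the slide) are:

    V^{\pi}(s)   = \mathbb{E}_{\pi}\big[\textstyle\sum_{t \ge 0} \gamma^{t} r_{t+1} \mid s_0 = s\big]
    Q^{\pi}(s,a) = \mathbb{E}_{\pi}\big[\textstyle\sum_{t \ge 0} \gamma^{t} r_{t+1} \mid s_0 = s,\; a_0 = a\big]

so the return E{R} that the agent's Goal maximises is exactly V^π at the start state.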

  12. Terminology: policy, π(a|s) (stochastic) or π(s) (deterministic), mapping states to actions.

  13. Terminology: dynamics model, e.g. "If I go South, I will meet …".

  14. Terminology: reward, e.g. a reward of 0 for a regular step.

  15. Terminology: reward, 0 for regular steps and 10 for reaching the goal.

  16. Deep Reinforcement Learning: the same agent-environment loop, but observations and actions now pass through deep neural networks inside the agent; Model (dynamics model), π/Q (policy/value function), Goal (maximise return E{R}).

  17. Deep Reinforcement Learning. The classical pipeline goes World, Sensors, Perception, Planning, Control, Action: pixels to vision/detection, a motion planner over a prediction/physics model, and a low-level controller (sim/kinematics) setting torques. Its hand-crafted abstractions lose information. Deep neural networks instead go straight from Sensors to Action, with abstractions/representations adapted to the task.

  18. End-to-end learning (SL + RL) for steering a car: Explaining How a Deep Neural Network Trained with End-to-End Learning Steers a Car, Bojarski et al., 2017, https://arxiv.org/pdf/1704.07911.pdf. Videos: https://www.youtube.com/watch?v=KnPiP9PkLAs, https://www.youtube.com/watch?v=NJU9ULQUwng. Caveat: data mismatch.

  19. Toolbox: standard algorithms, to give you a flavour of the norm!

  20. DQN: the Agent's NN sees the image and gets the score change as feedback on each action; transitions are stored in a replay Buffer. Human-level control through deep reinforcement learning, Mnih et al., Nature 518, Feb 2015

  21. Experience replay buffer: save each transition (s_t, a_t, r_t+1, s_t+1) in memory; randomly sample from memory for training, so samples are approximately i.i.d.
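
A minimal sketch of such a buffer (capacity and batch size are illustrative, not the paper's settings):

    import random
    from collections import deque

    class ReplayBuffer:
        def __init__(self, capacity=100_000):
            self.memory = deque(maxlen=capacity)   # oldest transitions fall off

        def save(self, s, a, r, s_next, done):
            """Store transition (s_t, a_t, r_t+1, s_t+1)."""
            self.memory.append((s, a, r, s_next, done))

        def sample(self, batch_size=32):
            """Uniform random sampling breaks temporal correlation (~ i.i.d.)."""
            return random.sample(self.memory, batch_size)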

  22. Freeze the target: keep a separate target network whose parameters are frozen between periodic syncs, so the bootstrap target does not chase the network being trained.
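
A sketch of the frozen target in code (PyTorch; the stand-in network, shapes, and sync period are illustrative):

    import copy
    import torch

    q_net = torch.nn.Linear(4, 2)         # stand-in Q-network: 4 features, 2 actions
    target_net = copy.deepcopy(q_net)     # frozen copy used only for targets
    for p in target_net.parameters():
        p.requires_grad_(False)

    def td_target(r, s_next, done, gamma=0.99):
        """Bootstrap from the frozen network, not the one being trained."""
        with torch.no_grad():
            q_next = target_net(s_next).max(dim=1).values
        return r + gamma * (1.0 - done) * q_next

    def sync():
        """Periodically (every C steps) copy online weights into the target."""
        target_net.load_state_dict(q_net.state_dict())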

  23. Human-level control through deep reinforcement learning, Mnih et al., Nature 518, Feb 2015. https://storage.googleapis.com/deepmind-media/dqn/DQNNaturePaper.pdf

  24. Prioritised experience replay: sample from memory based on surprise. Prioritised Experience Replay, Schaul et al., ICLR 2016
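
The core sampling rule, sketched in numpy. Schaul et al. use a sum-tree for efficiency and importance-sampling weights to correct the bias; this linear-scan version just shows the idea:

    import numpy as np

    def sample_prioritised(priorities, batch_size=32, alpha=0.6):
        """Sample indices with probability p_i^alpha / sum_j p_j^alpha,
        where p_i is the 'surprise', e.g. |TD error| + epsilon."""
        p = np.asarray(priorities, dtype=float) ** alpha
        probs = p / p.sum()
        return np.random.choice(len(priorities), size=batch_size, p=probs)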

  25. Dueling architecture: separate streams estimate V(s) and A(s, a), combined as Q(s, a) = V(s) + A(s, a). Dueling Network Architectures for Deep RL, Wang et al., ICML 2016
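
A sketch of the dueling head in PyTorch. Note that Wang et al. subtract the mean advantage so that V and A are identifiable; layer sizes here are illustrative:

    import torch
    import torch.nn as nn

    class DuelingHead(nn.Module):
        def __init__(self, n_features=512, n_actions=4):
            super().__init__()
            self.value = nn.Linear(n_features, 1)              # V(s)
            self.advantage = nn.Linear(n_features, n_actions)  # A(s, a)

        def forward(self, x):
            v, a = self.value(x), self.advantage(x)
            # Q(s, a) = V(s) + A(s, a) - mean_a' A(s, a')
            return v + a - a.mean(dim=1, keepdim=True)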

  26. However, training is SLOOOOOW…

  27. Parallel Asynchronous Training: value- and policy-based methods; parallel agents, shared parameters, lock-free updates. Videos: https://youtu.be/0xo1Ldx3L5Q, https://youtu.be/Ajjc08-iPx8, https://youtu.be/nMR5mjCFZCw. Asynchronous Methods for Deep Reinforcement Learning, Mnih et al., ICML 2016

  28. HOGWILD! updates: parallel learners (Agents), each with its own Copy of the network, push lock-free updates into shared params. https://github.com/traai/async-deep-rl
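
A minimal PyTorch sketch of the HOGWILD! pattern, with a toy model and loss (the async-deep-rl repo above is the real thing). Workers write gradient updates into shared parameters with no locking:

    import torch
    import torch.multiprocessing as mp

    def worker(shared_model):
        opt = torch.optim.SGD(shared_model.parameters(), lr=1e-2)
        for _ in range(100):
            x = torch.randn(8, 4)
            loss = shared_model(x).pow(2).mean()   # stand-in loss
            opt.zero_grad()
            loss.backward()
            opt.step()                             # lock-free write to shared params

    if __name__ == '__main__':
        model = torch.nn.Linear(4, 2)
        model.share_memory()                       # params live in shared memory
        procs = [mp.Process(target=worker, args=(model,)) for _ in range(4)]
        for p in procs: p.start()
        for p in procs: p.join()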

  29. So 2016… Can we train even faster?

  30. PAAC (Parallel Advantage Actor-Critic): one GPU/CPU, reduced training time, SOTA performance. https://github.com/alfredvc/paac. Efficient Parallel Methods for Deep Reinforcement Learning, Alfredo V. Clemente, Humberto N. Castejón, and Arjun Chandra, RLDM 2017
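
The core idea in a sketch: step a batch of environments in lockstep so that one forward pass of the network produces actions for all of them at once. The environment choice, the random stand-in policy, and the classic (pre-0.26) gym reset/step API are assumptions for illustration:

    import gym
    import numpy as np

    n_envs = 16
    envs = [gym.make('CartPole-v1') for _ in range(n_envs)]
    obs = np.stack([env.reset() for env in envs])   # classic gym API assumed

    def policy(batch_obs):
        # stand-in for one *batched* forward pass of the actor network;
        # in PAAC a single GPU call serves every environment
        return np.random.randint(0, 2, size=len(batch_obs))

    for t in range(5):
        actions = policy(obs)
        results = [env.step(int(a)) for env, a in zip(envs, actions)]
        obs = np.stack([r[0] for r in results])     # done-resets omitted for brevity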

  31. Challenges and SOTA: Data Efficiency, Exploration, Temporal Abstractions, Generalisation

  32. Data Efficiency

  33. Demonstrations: the Buffer is seeded with past observations, actions, and feedback from a demonstrator, and the Agent's NN learns from these alongside its own experience. Learning from Demonstrations for Real World Reinforcement Learning, Hester et al., arXiv e-print, Jul 2017

  34. https://www.youtube.com/watch?v=JR6wmLaYuu4

  35. https://www.youtube.com/watch?v=1wsCZk0Im54

  36. https://www.youtube.com/watch?v=B3pf7NJFtHE

  37. Deep RL with Unsupervised Auxiliary Tasks: use the observations, the feedback on actions, and the replay Buffer wisely. Reinforcement Learning with Unsupervised Auxiliary Tasks, Jaderberg et al., ICLR 2017

  38. Reinforcement Learning with Unsupervised Auxiliary Tasks, Jaderberg et al., ICLR 2017

  39. Learn to act to affect pixels, e.g. if grabbing fruit makes it disappear, the agent would do it.

  40. Predict short-term reward, e.g. from a replayed series of frames that ends with picking up a key.

  41. Predict long-term reward (value function replay).
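
Putting slides 39-41 together: the agent's objective is the base A3C loss plus the three auxiliary losses. A hedged sketch (the four loss functions and the weights are hypothetical stand-ins, not the paper's exact formulation):

    def unreal_loss(batch, a3c_loss, pixel_control_loss,
                    reward_prediction_loss, value_replay_loss,
                    lam_pc=1.0, lam_rp=1.0, lam_vr=1.0):
        """Combine the base RL loss with the unsupervised auxiliary losses."""
        return (a3c_loss(batch)                           # base policy/value loss
                + lam_pc * pixel_control_loss(batch)      # act to affect pixels
                + lam_rp * reward_prediction_loss(batch)  # short-term reward
                + lam_vr * value_replay_loss(batch))      # long-term reward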

  42. 10x less data! Reinforcement Learning with Unsupervised Auxiliary Tasks, Jaderberg et al., ICLR 2017

  43. https://deepmind.com/blog/reinforcement-learning-unsupervised-auxiliary-tasks/

  44. Distributional RL: the usual agent-environment loop with a replay Buffer, but instead of a scalar Q(s, a) the agent learns the distribution of returns. A Distributional Perspective on Reinforcement Learning, Bellemare et al., ICML 2017

  45. The normal DQN target is a scalar: [sampled reward after the step + discounted return estimate from then on]. Distributional RL instead uses: [fuse R with the discounted distribution of subsequent returns]. A Distributional Perspective on Reinforcement Learning, Bellemare et al., ICML 2017
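
A rough numpy sketch of that fusing, in the categorical (C51-style) form: shift the next-state return distribution by the sampled reward, shrink it by γ, and project it back onto a fixed support of atoms. Support bounds and atom count are illustrative, and terminal-state handling is omitted:

    import numpy as np

    def categorical_target(reward, next_dist, gamma=0.99,
                           v_min=-10.0, v_max=10.0, n_atoms=51):
        """Project (reward + gamma * Z') onto the fixed support."""
        z = np.linspace(v_min, v_max, n_atoms)           # support atoms
        tz = np.clip(reward + gamma * z, v_min, v_max)   # shifted, shrunk atoms
        dz = (v_max - v_min) / (n_atoms - 1)
        b = (tz - v_min) / dz                            # fractional atom index
        lo, hi = np.floor(b).astype(int), np.ceil(b).astype(int)
        target = np.zeros(n_atoms)
        for j in range(n_atoms):                         # split mass to neighbours
            if lo[j] == hi[j]:
                target[lo[j]] += next_dist[j]
            else:
                target[lo[j]] += next_dist[j] * (hi[j] - b[j])
                target[hi[j]] += next_dist[j] * (b[j] - lo[j])
        return target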

  46. "If I shoot now, it is game over for me." A Distributional Perspective on Reinforcement Learning, Bellemare et al., ICML 2017

  47. A Distributional Perspective on Reinforcement Learning, Bellemare et al., ICML 2017

  48. Under pressure the return distribution becomes bimodal: wrong actions are fatal. A Distributional Perspective on Reinforcement Learning, Bellemare et al., ICML 2017

  49. Exploration

  50. Curiosity-Driven Exploration: the Agent (an NN plus a learned Model) gets observations and feedback on actions from the environment and acts; curiosity comes from the Model's prediction errors.

  51. Curiosity-Driven Exploration: curiosity as next-state prediction error. A forward model predicts the next state from the current state and action; an inverse model predicts the action from consecutive states, so the learned features focus only on the parts of the state the agent can affect. Curiosity-driven Exploration by Self-supervised Prediction, Pathak, Agrawal et al., ICML 2017
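
The curiosity bonus in a sketch: the intrinsic reward is the forward model's error at predicting the next state's learned features. Here phi and forward_model are hypothetical trained networks passed in by the caller, and the inverse model that shapes phi is omitted:

    import torch

    def intrinsic_reward(s, a, s_next, phi, forward_model, eta=0.01):
        """Curiosity bonus for a single transition: forward-model error in feature space."""
        with torch.no_grad():
            feat_next = phi(s_next)               # features of the actual next state
            feat_pred = forward_model(phi(s), a)  # predicted next-state features
        return eta / 2 * (feat_pred - feat_next).pow(2).sum().item()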

  52. https://github.com/pathak22/noreward-rl https://pathak22.github.io/noreward-rl/

  53. Temporal Abstractions

  54. HRL with pre-set goals: a meta-controller (MC) observes the state and selects goals; a controller (C) selects primitive actions to achieve the currently selected goal. Hierarchical Deep Reinforcement Learning: Integrating Temporal Abstraction and Intrinsic Motivation, T. D. Kulkarni, K. R. Narasimhan et al., NIPS 2016
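
The two-level control loop, sketched with hypothetical stand-ins for the networks, the goal-attainment check, and a simplified env.step signature:

    def run_episode(env, meta_controller, controller, goal_reached):
        """One episode of the two-level h-DQN loop (simplified)."""
        s, done = env.reset(), False
        while not done:
            g = meta_controller.select_goal(s)       # e.g. "reach the key"
            f_ext = 0.0                              # extrinsic reward earned under g
            while not done and not goal_reached(s, g):
                a = controller.select_action(s, g)   # primitive action, given the goal
                s, r_ext, done = env.step(a)         # simplified step signature
                f_ext += r_ext
                controller.learn(s, g)               # trained on intrinsic reward: reached g?
            meta_controller.learn(s, g, f_ext)       # trained on extrinsic reward over goals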

  55. Hierarchical Deep Reinforcement Learning: Integrating Temporal Abstraction and Intrinsic Motivation, T. D. Kulkarni, K. R. Narasimhan et al., NIPS 2016

  56. A pre-defined goal is selected by the meta-controller. Hierarchical Deep Reinforcement Learning: Integrating Temporal Abstraction and Intrinsic Motivation, T. D. Kulkarni, K. R. Narasimhan et al., NIPS 2016

  57. FeUdal Networks for HRL: the manager (M) tries to find good directions in state space and sets a direction; the worker (W) takes primitive actions, trying to move the state in that direction. FeUdal Networks for Hierarchical Reinforcement Learning, Vezhnevets et al., ICML 2017
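
A sketch of the worker's intrinsic reward: cosine similarity between how the latent state actually moved and the manager's direction g. Vezhnevets et al. average this over a horizon of c steps, and the latent embedding is a learned network; both are simplified away here:

    import torch
    import torch.nn.functional as F

    def worker_intrinsic_reward(z_t, z_t_minus_c, g):
        """Reward the worker for moving the latent state in the manager's direction."""
        movement = z_t - z_t_minus_c          # actual displacement in latent space
        return F.cosine_similarity(movement, g, dim=-1)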

  58. FeUdal Networks for Hierarchical Reinforcement Learning, Vezhnevets et al., ICML 2017

  59. Generalisation

  60. Meta-learning (learning to learn): versatile agents! Transfer learning works with images; do good features for decision making transfer too? http://www.derinogrenme.com/2015/07/29/makale-imagenet-large-scale-visual-recognition-challenge/

  61. Learn to learn: e.g. having learned to go East, cut down the time needed to learn to go to X.
