
Deep Reinforcement Learning Building Blocks
Arjun Chandra, Research Scientist, Telenor Research / Telenor-NTNU AI Lab
arjun.chandra@telenor.com, @boelger
8 November 2017
https://join.slack.com/t/deep-rl-tutorial/signup

The Plan
The Problem


  1. our toy problem: a lookup table with one row per grid cell (1-9) and one column per action (N, S, E, W); every entry starts at 0 ('home' is the goal marked on the grid itself)

         N  S  E  W
     1   0  0  0  0
     2   0  0  0  0
     3   0  0  0  0
     4   0  0  0  0
     5   0  0  0  0
     6   0  0  0  0
     7   0  0  0  0
     8   0  0  0  0
     9   0  0  0  0
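     As a minimal sketch (not from the slides), this lookup table could be stored as a NumPy array with one row per state and one column per action; the constant names below are illustrative assumptions.

         import numpy as np

         # Hypothetical encoding of the toy problem's lookup table:
         # 9 states (grid cells 1-9) x 4 actions (N, S, E, W), all values start at 0.
         ACTIONS = ["N", "S", "E", "W"]
         N_STATES = 9

         Q = np.zeros((N_STATES, len(ACTIONS)))

         # Q[state - 1, ACTIONS.index("E")] would hold the value of moving East from that state.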

  2. our toy problem: the same lookup table drawn onto the 3 x 3 grid, with the states laid out as rows 1-2-3, 4-5-6, 7-8-9 and the 'home' goal at the grid's edge; every action value around every cell is 0

  3. reward structure? for a move...
       into 7 / home:                  +10
       out of bounds:                   -5
       into 5:                         -10
       into any cell except 5 and 7:    -1
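     A minimal sketch of the grid dynamics these rewards imply, assuming (these details are not on the slides) that cells are numbered 1-9 row by row, that an out-of-bounds move leaves the agent in place, and that entering cell 7 / home ends the episode; the function and constant names are illustrative.

         REWARD_HOME, REWARD_OOB, REWARD_CELL5, REWARD_STEP = 10, -5, -10, -1
         MOVES = {"N": (-1, 0), "S": (1, 0), "E": (0, 1), "W": (0, -1)}

         def step(state, action):
             """Return (next_state, reward, done) for one move on the 3x3 grid."""
             row, col = divmod(state - 1, 3)            # cells 1-9, numbered row by row
             drow, dcol = MOVES[action]
             nrow, ncol = row + drow, col + dcol
             if not (0 <= nrow < 3 and 0 <= ncol < 3):  # out of bounds: stay put (assumption)
                 return state, REWARD_OOB, False
             nxt = nrow * 3 + ncol + 1
             if nxt == 7:                               # reaching 7 / home ends the episode (assumption)
                 return nxt, REWARD_HOME, True
             if nxt == 5:                               # the penalised cell
                 return nxt, REWARD_CELL5, False
             return nxt, REWARD_STEP, False             # any other cell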

  4. let's fix α = 0.1 (the learning rate) and γ = 0.5 (the discount); every entry in the table is still 0

  5. the Q-learning update rule

     Q(s, a) ← Q(s, a) + α ( r_s^a + γ max_{a'} Q(s', a') - Q(s, a) )

     with α = 0.1, γ = 0.5, every table entry 0, and, say, an ε-greedy behaviour policy… episode 1 begins...

  6. the first move earns reward -1; the entry Q(s, a) for the state-action pair just taken is marked '?' pending its update

  7. that entry becomes 0 + 0.1 * (-1 + 0.5 * 0 - 0) = -0.1

  8. the table is unchanged as the agent picks its next action

  9. the next move goes out of bounds, earning reward -5; again the entry to update is marked '?'

  10. that entry becomes 0 + 0.1 * (-5 + 0.5 * 0 - 0) = -0.5

  11. the table is unchanged as the agent picks its next action

  12. the next move earns reward -1

  13. the corresponding entry becomes 0 + 0.1 * (-1 + 0.5 * 0 - 0) = -0.1

  14. the table is unchanged as the agent picks its next action

  15. the next move lands in cell 5, earning reward -10

  16. the corresponding entry becomes 0 + 0.1 * (-10 + 0.5 * 0 - 0) = -1

  17. the table is unchanged as the agent picks its next action

  18. the next move earns reward -1

  19. the corresponding entry becomes 0 + 0.1 * (-1 + 0.5 * 0 - 0) = -0.1

  20. the table is unchanged as the agent picks its next action

  21. the final move reaches 7/home, earning reward +10

  22. the corresponding entry becomes 0 + 0.1 * (10 + 0.5 * 0 - 0) = 1; episode 1 ends.
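     To make the arithmetic of episode 1 concrete, here is a small sketch of the tabular update with α = 0.1 and γ = 0.5; the state and action labels in the example calls are assumptions, chosen only to reproduce the numbers above.

         ALPHA, GAMMA = 0.1, 0.5
         ACTIONS = ("N", "S", "E", "W")

         def q_update(Q, s, a, r, s_next):
             """One Q-learning backup: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
             best_next = max(Q[(s_next, b)] for b in ACTIONS)
             Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])
             return Q[(s, a)]

         Q = {(s, a): 0.0 for s in range(1, 10) for a in ACTIONS}
         print(q_update(Q, 3, "W", -1, 2))   # -0.1, as on slide 7 (labels assumed)
         print(q_update(Q, 1, "N", -5, 1))   # -0.5, an out-of-bounds move (slide 10)
         print(q_update(Q, 8, "W", 10, 7))   # 1.0, reaching 7/home (slide 22); a terminal
                                             # transition would normally drop the bootstrap
                                             # term, which is zero here anyway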

  23. let's work out the next episode, starting at state 4: go WEST and then SOUTH. how does the table change?

  24. the WEST move from 4 goes out of bounds, so its entry becomes 0 + 0.1 * (-5 + 0.5 * 0 - 0) = -0.5; the SOUTH move from 4 reaches 7/home, so its entry becomes 0 + 0.1 * (10 + 0.5 * 0 - 0) = 1

  25. and the next episode, starting at state 3: go WEST -> SOUTH -> WEST -> SOUTH

  26. working through the four moves: WEST from 3 gives -0.1, SOUTH from 2 into cell 5 gives 0 + 0.1 * (-10 + 0.5 * 0 - 0) = -1, WEST from 5 into 4 gives 0 + 0.1 * (-1 + 0.5 * 1 - 0) = -0.05 (it bootstraps off 4's best value from the previous episode), and SOUTH from 4 rises to 1 + 0.1 * (10 + 0.5 * 0 - 1) = 1.9; over time, the values will converge to optimal!

  27. what we just saw was some episodes of Q-learning: values update towards the value of the optimal policy; the target comes from the value of the assumed next best action (off-policy learning)
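     A compact sketch of the full tabular Q-learning loop on the toy grid; it reuses the hypothetical step() from the earlier environment sketch, and the episode count, step cap, and ε value are arbitrary choices rather than values from the slides.

         import random

         ALPHA, GAMMA, EPSILON = 0.1, 0.5, 0.1
         ACTIONS = ("N", "S", "E", "W")

         def epsilon_greedy(Q, s):
             """Explore with probability epsilon, otherwise pick the greedy action."""
             if random.random() < EPSILON:
                 return random.choice(ACTIONS)
             return max(ACTIONS, key=lambda a: Q[(s, a)])

         def q_learning(episodes=500, max_steps=100):
             Q = {(s, a): 0.0 for s in range(1, 10) for a in ACTIONS}
             for _ in range(episodes):
                 s = random.choice([c for c in range(1, 10) if c != 7])  # any non-terminal start
                 for _ in range(max_steps):
                     a = epsilon_greedy(Q, s)
                     s_next, r, done = step(s, a)               # step() from the earlier sketch
                     target = r if done else r + GAMMA * max(Q[(s_next, b)] for b in ACTIONS)
                     Q[(s, a)] += ALPHA * (target - Q[(s, a)])  # off-policy: backs up the max
                     s = s_next
                     if done:
                         break
             return Q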

  28. SARSA-learning? values update towards the value of the current policy; the target comes from the value of the actual next action (on-policy learning)
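     For contrast, a SARSA sketch under the same assumptions (it reuses epsilon_greedy() and step() from the previous sketches): the only change is that the target bootstraps from the action the behaviour policy actually takes next, not from the max.

         def sarsa(episodes=500, max_steps=100):
             Q = {(s, a): 0.0 for s in range(1, 10) for a in ACTIONS}
             for _ in range(episodes):
                 s = random.choice([c for c in range(1, 10) if c != 7])
                 a = epsilon_greedy(Q, s)
                 for _ in range(max_steps):
                     s_next, r, done = step(s, a)
                     a_next = epsilon_greedy(Q, s_next)
                     target = r if done else r + GAMMA * Q[(s_next, a_next)]  # actual next action
                     Q[(s, a)] += ALPHA * (target - Q[(s, a)])                # on-policy update
                     s, a = s_next, a_next
                     if done:
                         break
             return Q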

  29. Q vs SARSA on a small gridworld, with ε = 0.1 and γ = 1.0: Q-learning's target uses data not generated by the target policy, while SARSA's target uses data generated by the target policy. Image by Andreas Tille (own work) [GFDL (www.gnu.org/copyleft/fdl.html) or CC-BY-SA-3.0-2.5-2.0-1.0 (www.creativecommons.org/licenses/by-sa/3.0)], via Wikimedia Commons. Example credit Travis DeWolf: https://studywolf.wordpress.com/ and https://git.io/vFBvv

  30. Problem Decomposition: nested sub-problems; the solution to a sub-problem informs the solution to the whole problem

  31. Bellman Expectation Backup: a system of linear equations whose solution is the value of the policy. In the backup diagrams (s, v(s) -> a -> r, s' and s, a, q(s, a) -> r, s' -> a', q(s', a')), the value of the root node is the sum over one-step paths of P(path) * Value(path):

      v_π(s) = Σ_a π(a|s) ( r_s^a + γ Σ_{s'} P^a_{ss'} v_π(s') )

      q_π(s, a) = r_s^a + γ Σ_{s'} P^a_{ss'} Σ_{a'} π(a'|s') q_π(s', a')

      the Bellman expectation equations under a given policy
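     As a sketch of how the expectation equations are used in practice, iterative policy evaluation for a generic finite MDP; the P, R, and pi array layouts are assumed conventions, not something defined on the slides.

         import numpy as np

         def policy_evaluation(P, R, pi, gamma=0.5, tol=1e-8):
             """Iterate v(s) = sum_a pi(a|s) * (r_s^a + gamma * sum_s' P^a_ss' * v(s')).

             P  : array [S, A, S] of transition probabilities P^a_ss'
             R  : array [S, A] of expected immediate rewards r_s^a
             pi : array [S, A] of action probabilities pi(a|s)
             """
             v = np.zeros(R.shape[0])
             while True:
                 q = R + gamma * P @ v           # Bellman expectation backup for q, shape [S, A]
                 v_new = (pi * q).sum(axis=1)    # average over the policy's action choices
                 if np.max(np.abs(v_new - v)) < tol:
                     return v_new
                 v = v_new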

  32. Bellman Optimality Backup: a system of non-linear equations whose solution is the value of the optimal policy. Again the value of the root node is built from P(path) * Value(path), but the agent's own choice is a max rather than an expectation:

      v_*(s) = max_a ( r_s^a + γ Σ_{s'} P^a_{ss'} v_*(s') )

      q_*(s, a) = r_s^a + γ Σ_{s'} P^a_{ss'} max_{a'} q_*(s', a')

      the Bellman optimality equations under the optimal policy

  33. Value Based

  34. Dynamic Programming: using the Bellman equations as iterative updates. (figure: a small 2 x 2 grid of states 1-4 with a 'home' exit, actions N, S, E, W, and move rewards of -1, -5, -10 and +10 on the transitions) what's best to do?
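     A matching sketch of the dynamic-programming idea, value iteration with the optimality backup, under the same assumed P and R array layout (and numpy import) as the previous sketch; it also returns the greedy action per state, which answers "what's best to do?".

         def value_iteration(P, R, gamma=0.5, tol=1e-8):
             """Iterate v(s) = max_a (r_s^a + gamma * sum_s' P^a_ss' * v(s'))."""
             v = np.zeros(R.shape[0])
             while True:
                 q = R + gamma * P @ v           # Bellman optimality backup, shape [S, A]
                 v_new = q.max(axis=1)
                 if np.max(np.abs(v_new - v)) < tol:
                     break
                 v = v_new
             greedy = q.argmax(axis=1)           # index of the best action in each state
             return v_new, greedy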

