
Near-Optimal Exploration-Exploitation in Non-Communicating Markov Decision Processes. Ronan Fruit (SequeL, INRIA Lille), Matteo Pirotta (FAIR, Facebook Paris), Alessandro Lazaric (FAIR, Facebook Paris). NeurIPS 2018, Montreal, December 5th.


  1. Near-Optimal Exploration-Exploitation in Non-Communicating Markov Decision Processes. Ronan Fruit (SequeL, INRIA Lille), Matteo Pirotta (FAIR, Facebook Paris), Alessandro Lazaric (FAIR, Facebook Paris). NeurIPS 2018, Montreal, December 5th.

  2. Exploration-Exploitation in RL with a Misspecified State Space. Ronan Fruit (SequeL, INRIA Lille), Matteo Pirotta (FAIR, Facebook Paris), Alessandro Lazaric (FAIR, Facebook Paris). NeurIPS 2018, Montreal, December 5th.

  3-9. TUCRL: Misspecified states, an example. Breakout [Mnih et al., 2015]. The intuitive state space is the set of plausible configurations of wall, ball and paddle. Starting from the initial state s1, some plausible configurations are reached after some time, but others are not reachable from s1 and therefore can never be observed. Misspecified state space = there exist states that are not observable from the initial state and are difficult to exclude explicitly from the state space. (Exploration-exploitation in RL with Misspecified State Space - R. Fruit, SequeL - 1/5)
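The definition above can be made concrete with a toy sketch. The dynamics below (`successors`, integer states) are invented purely for illustration, not Breakout: we enumerate an "intuitive" state space, compute the set actually reachable from the initial state by breadth-first search, and observe that the remainder is plausible yet never observable.

```python
from collections import deque

# Toy illustration of a misspecified state space (hypothetical dynamics,
# not Breakout): the "intuitive" state space is the integers 0..9, but
# transitions can only decrease the state, so states above the initial
# one are plausible yet can never be observed.
plausible = set(range(10))
initial = 5

def successors(s):
    # invented dynamics: from s you can stay or move to s - 1
    return {s, max(s - 1, 0)}

# Reachable set = breadth-first closure of the initial state
reachable, frontier = {initial}, deque([initial])
while frontier:
    s = frontier.popleft()
    for nxt in successors(s):
        if nxt not in reachable:
            reachable.add(nxt)
            frontier.append(nxt)

unobservable = plausible - reachable
print(sorted(unobservable))   # [6, 7, 8, 9]: in the model, never seen
```

An agent that plans over `plausible` rather than `reachable` carries four states it can never visit, which is exactly the situation the slide describes and which is hard to rule out a priori.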

  10-22. TUCRL: Why is exploration more challenging with a misspecified state space? All existing methods known to efficiently balance exploration and exploitation in RL with theoretical guarantees rely on the optimism-in-the-face-of-uncertainty principle, and all of them fail to learn when the state space is misspecified. [Diagram: Example 1 of Ortner [2008]. From state s, action a1 yields reward r1 = 1/2 and stays in s; action a0 yields reward r0 = 0. With a correctly specified state space, optimism (UCB, etc.) recovers the optimal strategy. If the state space also contains a state s' that is not reachable from s, optimism assigns the never-observed transition (s, a0) -> s' the optimistic reward r0+ = 1 = rmax.] Problem: the action played keeps changing, a0 half of the time and a1 the other half, which implies linear regret! Why not simply ignore s'? Because ignoring it yields linear regret if s' is in fact reachable. (Exploration-exploitation in RL with Misspecified State Space - R. Fruit, SequeL - 2/5)
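The failure mode on Ortner's example can be sketched in a few lines. This is a deliberately simplified illustration, not UCRL2 itself: in the actual analysis the learner alternates between a0 and a1, while the caricature below simply never rules out the phantom transition; all variable names and constants are invented for the sketch. The point it demonstrates is the same: as long as the confidence interval on p(s' | s, a0) admits any positive probability, an optimistic planner credits a0 with average reward rmax = 1 > 1/2, keeps paying for the 0-reward action, and suffers regret linear in T.

```python
import math

# Sketch (not UCRL2) of why optimism fails on Example 1 of Ortner [2008].
# True MDP: a single state s; action a1 yields reward 1/2, action a0
# yields reward 0.  The learner's misspecified state space also contains
# s', never reachable from s, with optimistic reward r_max = 1.
r_max, T = 1.0, 100_000
n0, reward = 1, 0.0          # visits to (s, a0); total reward collected

for t in range(1, T + 1):
    # The confidence interval on p(s' | s, a0) has width ~ sqrt(log t / n0),
    # which never reaches 0, so a transition to s' is never ruled out.
    plausible_p = min(1.0, math.sqrt(math.log(t + 1) / n0))
    # Optimistically, s' is absorbing with reward r_max, so any plausible
    # positive probability of reaching it makes the a0-policy's optimistic
    # average reward equal to r_max = 1 > 1/2.
    optimistic_gain_a0 = r_max if plausible_p > 0 else 0.0
    if optimistic_gain_a0 > 0.5:   # always true: play a0, earn 0
        n0 += 1
    else:                          # never happens in this sketch
        reward += 0.5

regret = 0.5 * T - reward          # benchmark: always play a1
print(regret / T)                  # -> 0.5, a constant: linear regret
```

Conversely, hard-coding `plausible_p = 0` (i.e. ignoring s' outright) is no safer: if s' happens to be reachable with reward above 1/2, the agent never discovers it and again incurs linear regret, which is the dilemma the slide highlights.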
