Practical Open-Loop Optimistic Planning


  1. Practical Open-Loop Optimistic Planning. Edouard Leurent (1, 2), Odalric-Ambrym Maillard (1). 1: SequeL, Inria Lille – Nord Europe; 2: Renault Group. ECML PKDD 2019, Würzburg, September 2019.

  2. Motivation — Sequential Decision Making. The agent sends an action to the environment and receives a state and a reward in return: a Markov Decision Process.
     1. Observe the state s ∈ S;
     2. Pick a discrete action a ∈ A;
     3. Transition to a next state s′ ∼ P(s′ | s, a);
     4. Receive a bounded reward r ∈ [0, 1] drawn from P(r | s, a).
     Objective: maximise the return V = E[Σ_{t=0}^{∞} γ^t r_t].
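
     A minimal sketch of this interaction loop, assuming a classic gym-style environment with discrete actions (the environment name, discount factor and API version are placeholder assumptions, not part of the talk):

     ```python
     import gym  # assumption: a classic (pre-0.26) gym-style environment exposes the MDP

     env = gym.make("CartPole-v1")  # placeholder MDP with a discrete action space
     gamma = 0.95                   # discount factor

     state = env.reset()            # 1. observe the initial state s
     done, value, t = False, 0.0, 0
     while not done:
         action = env.action_space.sample()            # 2. pick a discrete action a ∈ A
         state, reward, done, info = env.step(action)  # 3.-4. next state s′ and reward r
         value += gamma ** t * reward                  # accumulate the discounted return
         t += 1
     ```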

  3. Motivation — Example: the highway-env environment. We want to handle stochasticity.
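
     For reference, a sketch of how such an environment is typically instantiated, assuming the highway-env package and the classic gym API:

     ```python
     import gym
     import highway_env  # importing the package registers the highway environments with gym

     env = gym.make("highway-v0")
     obs = env.reset()
     obs, reward, done, info = env.step(env.action_space.sample())
     ```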

  4. Motivation — How to solve MDPs? Online Planning.
     - We have access to a generative model: it yields samples of s′, r ∼ P(s′, r | s, a) when queried.
     - At each step, the agent passes the current state to a planner, the planner returns a recommendation, and the agent executes the corresponding action in the environment, which replies with the next state and reward.

  5. Motivation — How to solve MDPs? Online Planning.
     - Fixed budget: the model can only be queried n times.
     - Objective: minimise the simple regret r_n = E[V* - V(a_n)], the value lost by following the recommended action a_n instead of an optimal one.
     An exploration-exploitation problem.
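
     To make this protocol concrete, here is a naive fixed-budget planner (uniform random shooting, not the authors' method): it queries a copy of the environment as a generative model at most budget_n times and recommends the first action of the best sequence found. The copyability of the environment and all names below are illustrative assumptions.

     ```python
     import copy

     def plan(env, budget_n, horizon=5, gamma=0.95):
         """Query the generative model at most budget_n times, then recommend an action a_n.

         The recommendation is judged by its simple regret r_n = E[V* - V(a_n)].
         """
         best_return, best_action, queries = -float("inf"), None, 0
         while queries + horizon <= budget_n:
             model = copy.deepcopy(env)    # generative model: simulate without acting for real
             sequence = [model.action_space.sample() for _ in range(horizon)]
             ret, done = 0.0, False
             for t, action in enumerate(sequence):
                 if done:
                     break
                 _, reward, done, _ = model.step(action)   # one query of the model
                 ret += gamma ** t * reward
                 queries += 1
             if ret > best_return:
                 best_return, best_action = ret, sequence[0]
         return best_action
     ```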

  6. Optimistic Planning: Optimism in the Face of Uncertainty. Given a set of options a ∈ A with uncertain outcomes, try the one with the highest possible outcome.
     - Either you performed well;
     - or you learned something.
     Instances:
     - Monte-Carlo Tree Search (MCTS) [Coulom 2006]: CrazyStone;
     - reframed in the bandit setting as UCT [Kocsis and Szepesvári 2006], still very popular (e.g. AlphaGo);
     - proved asymptotically consistent, but with no regret bound.
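
     The optimistic rule applied by UCT at every node can be sketched as a UCB1-style selection (a simplified illustration, not the original implementations; the Child container and the exploration constant c are assumptions):

     ```python
     import math
     from dataclasses import dataclass

     @dataclass
     class Child:
         visits: int = 0          # how many times this action was tried from the node
         mean_value: float = 0.0  # empirical mean of the returns observed after it

     def uct_select(children, c=1.0):
         """Pick the child maximising an optimistic (upper-confidence) score."""
         total = sum(child.visits for child in children)
         def ucb(child):
             if child.visits == 0:
                 return float("inf")  # unvisited actions are maximally optimistic
             return child.mean_value + c * math.sqrt(math.log(total) / child.visits)
         return max(children, key=ucb)
     ```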

  7. Analysis of UCT. UCT was analysed in [Coquelin and Munos 2007]: its sample complexity is lower-bounded by Ω(exp(exp(D))).

  8. Failing cases of UCT. Not just a theoretical counter-example.

  9. Can we get better guarantees?
     OPD: Optimistic Planning for Deterministic systems
     - introduced by [Hren and Munos 2008];
     - another optimistic algorithm;
     - only for deterministic MDPs.
     Theorem (OPD sample complexity): E[r_n] = O(n^(-log(1/γ) / log κ)), if κ > 1 (a sketch of the OPD expansion rule is given below).
     OLOP: Open-Loop Optimistic Planning
     - introduced by [Bubeck and Munos 2010];
     - extends OPD to the stochastic setting;
     - only considers open-loop policies, i.e. sequences of actions.
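
     A sketch of that expansion rule, under stated assumptions (deterministic rewards in [0, 1], a copyable gym-style model; the Leaf container and function names are ours, not the original implementation): OPD repeatedly expands the leaf whose optimistic bound, i.e. past rewards plus γ^(d+1)/(1-γ) for the unknown future, is largest.

     ```python
     import copy
     from dataclasses import dataclass

     @dataclass
     class Leaf:
         env: object               # deterministic generative model at this node (assumed copyable)
         reward_sum: float = 0.0   # Σ_{t<=d} γ^t r_t collected along the path to this leaf
         depth: int = 0            # d, the length of the path

     def opd_expand_once(leaves, actions, gamma):
         """One OPD iteration: expand the leaf with the highest optimistic bound (b-value)."""
         def b_value(leaf):
             # known past rewards + γ^(d+1)/(1-γ), an optimistic bound on all future rewards
             return leaf.reward_sum + gamma ** (leaf.depth + 1) / (1 - gamma)
         best = max(leaves, key=b_value)
         leaves.remove(best)
         for action in actions:                      # one model query per action
             child_env = copy.deepcopy(best.env)
             _, reward, _, _ = child_env.step(action)
             leaves.append(Leaf(child_env,
                                best.reward_sum + gamma ** (best.depth + 1) * reward,
                                best.depth + 1))
         return leaves
     ```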

  10. The idea behind OLOP: a direct application of Optimism in the Face of Uncertainty.
     1. We want max_a V(a);
     2. form upper confidence bounds on the values of action sequences: V(a) ≤ U_a w.h.p.;
     3. sample the sequence with the highest UCB: argmax_a U_a.
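
     A skeleton of this loop, in a hedged form: the helpers passed as arguments (a rollout sampler, the upper confidence bound U_a of the next slides, and a set of candidate sequences) are assumptions, and the most-played-first-action recommendation is one common choice rather than a statement of the exact algorithm.

     ```python
     from collections import Counter

     def optimistic_open_loop_planning(sample_rollout, upper_confidence_bound,
                                       candidate_sequences, n_episodes):
         """Repeatedly sample the action sequence with the highest upper confidence bound."""
         history = []  # (sequence, rewards) pairs observed so far
         for _ in range(n_episodes):
             # steps 2-3: form a UCB U_a for every candidate sequence, play the most optimistic one
             sequence = max(candidate_sequences,
                            key=lambda a: upper_confidence_bound(a, history))
             rewards = sample_rollout(sequence)   # one generative-model query per action
             history.append((sequence, rewards))
         # recommend the first action that was played most often
         counts = Counter(sequence[0] for sequence, _ in history)
         return counts.most_common(1)[0][0]
     ```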

  11. Under the hood: Upper-bounding the value of sequences. The value of a sequence a of length h splits into following the sequence, then acting optimally:
     V(a) = Σ_{t=1}^{h} γ^t μ_{a_{1:t}} + Σ_{t≥h+1} γ^t μ_{a*_{1:t}},
     where each mean reward μ_{a_{1:t}} along the sequence is bounded by an upper confidence bound U^μ, and each mean reward μ_{a*_{1:t}} of the optimal continuation is bounded by 1.

  12. Under the hood: OLOP's main tool is the Chernoff-Hoeffding deviation inequality, which gives an upper confidence bound on the mean reward of a sequence a after episode m:
     U^μ_a(m) := μ̂_a(m) + √(2 log M / T_a(m)),
     i.e. the empirical mean plus a confidence interval, where M is the total number of episodes and T_a(m) is the number of times a has been sampled so far.
     As in OPD, all rewards beyond the horizon h are upper-bounded by 1:
     U_a(m) := Σ_{t=1}^{h} γ^t U^μ_{a_{1:t}}(m) + γ^{h+1}/(1-γ),
     i.e. bounds on the past rewards plus an optimistic bound on the future rewards.
     Bounds sharpening: since a sequence cannot be worth more than any of its prefixes allows,
     B_a(m) := inf_{1≤t≤L} U_{a_{1:t}}(m).
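
     These three quantities translate into a short sketch (variable names are ours; M is the number of episodes and T_a(m) the sample count of the prefix, as above):

     ```python
     import math

     def reward_ucb(empirical_mean, samples, n_episodes):
         """U^μ_a(m): Chernoff-Hoeffding upper confidence bound on the mean reward of a prefix."""
         if samples == 0:
             return 1.0  # rewards lie in [0, 1], so 1 is always a valid optimistic bound
         return empirical_mean + math.sqrt(2 * math.log(n_episodes) / samples)

     def sequence_ucb(prefix_reward_ucbs, gamma):
         """U_a(m): discounted sum of the reward UCBs along the sequence of length h,
         plus γ^(h+1)/(1-γ) as an optimistic bound on every reward beyond the horizon."""
         h = len(prefix_reward_ucbs)
         past = sum(gamma ** t * u for t, u in enumerate(prefix_reward_ucbs, start=1))
         return past + gamma ** (h + 1) / (1 - gamma)

     def sharpened_bound(prefix_sequence_ucbs):
         """B_a(m): a sequence is never worth more than the bound of any of its prefixes."""
         return min(prefix_sequence_ucbs)
     ```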
