Tighter Problem-Dependent Regret Bounds in Reinforcement Learning without Domain Knowledge using Value Function Bounds

  1. Tighter Problem-Dependent Regret Bounds in Reinforcement Learning without Domain Knowledge using Value Function Bounds. Andrea Zanette* and Emma Brunskill (zanette@stanford.edu, ebrun@cs.stanford.edu). Institute for Computational and Mathematical Engineering and Department of Computer Science, Stanford University.

  2. Exploration in RL = learn quickly how to play near-optimally. Setting: episodic tabular RL. Goal: automatically inherit instance-dependent regret bounds.

  3–10. State of the Art Regret Bounds for Episodic Tabular MDPs
  - No intelligent exploration (naive greedy): $\tilde O(T)$
  - Efficient exploration: $\tilde O(HS\sqrt{AT})$ (UCRL2; Jaksch et al., 2010), $\tilde O(H\sqrt{SAT})$ (Dann et al., 2015), $\tilde O(S\sqrt{HAT})$ (Dann et al., 2017), $\tilde O(\sqrt{HSAT})$ (Azar et al., 2017)
  - Lower bound: $\Omega(\sqrt{HSAT})$
  - Problem-dependent analysis (our work): $\tilde O(\sqrt{\mathbb{Q}^\star SAT})$

  11–16. Main Result: setup. From a state–action pair $(s, a)$ at time $t$, the process moves to a random successor state $s^+$ at time $t+1$ (the slide figure shows three possible successors with optimal values $V^\star(s^+_1)$, $V^\star(s^+_2)$, $V^\star(s^+_3)$). Define the maximum conditional variance of the optimal value function
  $\mathbb{Q}^\star = \max_{s,a} \operatorname{Var}_{s^+ \sim p(\cdot \mid s,a)} V^\star(s^+)$
  and let $\mathcal{H}$ be an upper bound on the total return: $r_1 + r_2 + \dots + r_H \le \mathcal{H}$.
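  To make the definition of $\mathbb{Q}^\star$ concrete, here is a minimal sketch (illustrative only, not the paper's code; all names and numbers are hypothetical) that computes it for a small tabular MDP from a transition tensor and the optimal values:

```python
import numpy as np

def max_conditional_variance(P, V_star):
    """Q* = max_{s,a} Var_{s+ ~ P[s,a]} V*(s+).

    P[s, a, s+] holds transition probabilities; V_star[s+] the optimal values.
    """
    mean = P @ V_star                  # E[V*(s+)] for each (s, a)
    second_moment = P @ (V_star ** 2)  # E[V*(s+)^2] for each (s, a)
    return np.max(second_moment - mean ** 2)

# Toy MDP: 3 states, 2 actions (hypothetical numbers).
P = np.array([[[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]],
              [[0.0, 0.5, 0.5], [1.0, 0.0, 0.0]],
              [[0.3, 0.3, 0.4], [0.2, 0.2, 0.6]]])
V_star = np.array([0.0, 0.5, 1.0])
print(max_conditional_variance(P, V_star))
```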

  17–18. Main Result. An algorithm with a (high-probability) regret bound of
  $\min\{\, \tilde O(\sqrt{\mathbb{Q}^\star SAT}) + [\mathrm{const}],\ \tilde O(\sqrt{\mathcal{H}^2 SAT}) + [\mathrm{const}] \,\}$.
  Technique: an exploration bonus which is adaptively adjusted as a function of the problem difficulty.
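  The slides do not spell the bonus out; the following is a hedged sketch of the general Bernstein-style idea (a bonus that shrinks when the estimated variance of successor values is small), with hypothetical names and constants, not the paper's actual algorithm:

```python
import numpy as np

def exploration_bonus(var_next_value, n_visits, range_bound, delta=0.05):
    """Bernstein-style bonus for one (s, a) pair (hypothetical sketch).

    var_next_value: estimated variance of the next-state value at (s, a)
    n_visits:       number of visits to (s, a)
    range_bound:    known upper bound on the value range (e.g. H or 1)
    """
    n = max(n_visits, 1)
    log_term = np.log(1.0 / delta)
    # Dominant variance term plus a lower-order range/n correction; the
    # bonus adapts to the problem difficulty through var_next_value.
    return np.sqrt(2.0 * var_next_value * log_term / n) + range_bound * log_term / n
```

  In easy instances (small $\mathbb{Q}^\star$) the dominant square-root term stays small, which is the intuition behind regret scaling with $\sqrt{\mathbb{Q}^\star SAT}$ instead of the worst case.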

  19–23. Long Horizon MDPs.
  - Standard setting: every per-step reward satisfies $r \in [0, 1]$.
  - Goal MDP setting*: $r \ge 0$ and $\sum_{t=1}^{H} r_t \le 1$ (*this is a more general setting).
  - COLT conjecture of Jiang & Agarwal (2018): any algorithm must suffer ~$H$ dependence in terms of sample complexity and regret in the Goal MDP setting.
  - Our algorithm yields no horizon dependence in the regret bound for the setting of the COLT conjecture, without being informed of the setting (a short calculation follows the list).
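  A one-line calculation, consistent with the definitions above (my gloss, not from the slides), shows where the horizon independence comes from: in the Goal MDP setting the total return is at most 1, so $V^\star(s) \in [0, 1]$ for every state, and

```latex
% The variance of a [0,1]-valued random variable is at most 1/4, so
\mathbb{Q}^\star
  = \max_{s,a} \operatorname{Var}_{s^+ \sim p(\cdot \mid s,a)} V^\star(s^+)
  \le \tfrac{1}{4}
\quad\Longrightarrow\quad
\tilde O\bigl(\sqrt{\mathbb{Q}^\star S A T}\bigr)
  \le \tilde O\bigl(\sqrt{S A T}\bigr),
% i.e., no H appears in the dominant term of the regret bound.
```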

  24–29. Effect of MDP Stochasticity (stochasticity in the transition dynamics).
  - Deterministic MDP: $\tilde O(SAH^2)$
  - Bandit-like structure: $\tilde O(\sqrt{SAT} + [\dots])$
  - Hard instances of the lower bound: $\tilde O(\sqrt{HSAT} + [\dots])$
  Our algorithm matches, in the dominant terms, the best performance for each setting.
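  The deterministic row is consistent with the definition of $\mathbb{Q}^\star$: when every $p(\cdot \mid s, a)$ is a point mass the conditional variance is zero, so the dominant $\sqrt{\mathbb{Q}^\star SAT}$ term vanishes and only lower-order terms remain. A quick self-contained check (hypothetical numbers):

```python
import numpy as np

# Deterministic transitions: each p(.|s,a) is a point mass, so
# Var V*(s+) = 0 for every (s, a) and hence Q* = 0.
P_det = np.zeros((3, 2, 3))
P_det[0, 0, 1] = P_det[0, 1, 2] = 1.0
P_det[1, 0, 2] = P_det[1, 1, 0] = 1.0
P_det[2, 0, 0] = P_det[2, 1, 1] = 1.0
V_star = np.array([0.0, 0.5, 1.0])

mean = P_det @ V_star
Q_star = np.max(P_det @ (V_star ** 2) - mean ** 2)
print(Q_star)  # 0.0
```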

  30–31. Related Work (infinite horizon).
  - In mixing domains: (Talebi et al., 2018), (Ortner, 2018)
  - May not improve over worst-case: (Maillard et al., 2014)
  - With domain knowledge: [REGAL] (Bartlett et al., 2010), [SCAL] (Fruit et al., 2018)

  32–33. Conclusion.
  - An instance-dependent bound for episodic tabular MDPs, without knowledge of the environment.
  - Insights into the hardness of RL; provable improvements in many settings of interest: near-deterministic MDPs, bandit-like structure, long-horizon MDPs, limited range of the optimal value function, and limited variability in the value function among successor states.
