SLIDE 1 Tighter Problem-Dependent Regret Bounds in Reinforcement Learning without Domain Knowledge using Value Function Bounds
Andrea Zanette*, Emma Brunskill
Institute for Computational and Mathematical Engineering and Department of Computer Science, Stanford University
zanette@stanford.edu ebrun@cs.stanford.edu
SLIDE 2
Setting: episodic tabular RL
Goal: automatically inherit instance-dependent regret bounds
Exploration in RL = learn quickly how to play near optimally
SLIDE 10
State of the Art Regret Bounds for Episodic Tabular MDPs

Õ(T): No Intelligent Exploration (naive greedy)
Õ(HS√(AT)): Efficient Exploration (UCRL2, Jaksch 2010)
Õ(H√(SAT)): (Dann 2015)
Õ(S√(HAT)): (Dann 2017)
Õ(√(HSAT)): (Azar 2017)
Ω(√(HSAT)): Lower Bound
Õ(√(ℚ⋆SAT)): Problem-Dependent Analysis (Our work)
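Ignoring constants and log factors, the leading terms above can be compared numerically. A minimal sketch with illustrative problem sizes (the values of H, S, A, T below are made up for this example, not taken from the talk):

```python
import math

# Illustrative problem size (hypothetical values).
H, S, A, T = 10, 20, 5, 10**6

# Leading terms of the regret bounds, ignoring constants and log factors.
bounds = {
    "naive greedy        O(T)":           T,
    "UCRL2 (Jaksch 2010) O(HS*sqrt(AT))": H * S * math.sqrt(A * T),
    "Dann 2015           O(H*sqrt(SAT))": H * math.sqrt(S * A * T),
    "Dann 2017           O(S*sqrt(HAT))": S * math.sqrt(H * A * T),
    "Azar 2017           O(sqrt(HSAT))":  math.sqrt(H * S * A * T),
}

# Print from tightest to loosest leading term.
for name, value in sorted(bounds.items(), key=lambda kv: kv[1]):
    print(f"{name:40s} ~ {value:12.0f}")
```

Even at this modest size, the √(HSAT) rate is an order of magnitude below the UCRL2-style HS√(AT) rate.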
SLIDE 18
Main Result

ℚ⋆ = max_{s,a} Var_{s⁺ ∼ p(s,a)} V⋆(s⁺)

[Figure: a transition from (s, a) at step t to successor states at step t+1, with optimal values V⋆(s⁺₁), V⋆(s⁺₂), V⋆(s⁺₃); the episode's rewards satisfy r₁ + r₂ + … + r_H ≤ …]

Main Result: An algorithm with a (high-probability) regret bound:

min { Õ(√(ℚ⋆SAT)) + [const], Õ(√(H²SAT)) + [const] }

Technique: an exploration bonus that is adaptively adjusted as a function of the problem difficulty
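ℚ⋆ as defined above is the largest one-step variance of the optimal value function over successor states. A minimal sketch computing it for a toy tabular MDP (the transition probabilities and V⋆ values below are made up purely for illustration):

```python
import numpy as np

# Hypothetical toy MDP: 3 states, 2 actions (all values illustrative only).
S, A = 3, 2
rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(S), size=(S, A))   # p[s, a] = distribution over successors s+
v_star = np.array([0.0, 0.5, 1.0])           # optimal values V*(s+) of the successors

# Q* = max_{s,a} Var_{s+ ~ p(s,a)} V*(s+), via Var X = E[X^2] - (E[X])^2
mean = p @ v_star                            # E[V*(s+)] for each (s, a)
second = p @ v_star**2                       # E[V*(s+)^2] for each (s, a)
variance = second - mean**2                  # shape (S, A)
q_star = variance.max()
print(f"Q* = {q_star:.4f}")
```

In the talk's setting p and V⋆ are of course unknown; the adaptive exploration bonus is what lets the algorithm benefit from small ℚ⋆ without ever computing it explicitly.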
SLIDE 23
Long Horizon MDPs

Standard Setting: r ∈ [0,1]
Goal MDP Setting*: r ≥ 0 and ∑_{t=1}^{H} r_t ≤ 1
* this is a more general setting

COLT Conjecture of Jiang & Agarwal, 2018: any algorithm must suffer ~H dependence in terms of sample complexity and regret in the Goal MDP setting.

Our algorithm yields no horizon dependence in the regret bound for the Goal MDP setting, disproving the COLT conjecture without being informed of the setting.
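Why the horizon drops out: in the Goal MDP setting the total reward is at most 1, so V⋆ lies in [0, 1] for every state, and the variance of V⋆ over successors, hence ℚ⋆, is at most 1/4 regardless of H (Popoviciu's inequality). A quick numerical check (random distributions and values, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# In the Goal MDP setting r >= 0 and r_1 + ... + r_H <= 1, so V*(s) is in [0, 1]
# for every state, independently of the horizon H.
for H in (10, 100, 10_000):
    n_succ = 5
    p = rng.dirichlet(np.ones(n_succ))           # some successor distribution
    v_star = rng.uniform(0.0, 1.0, size=n_succ)  # V* values bounded in [0, 1]
    variance = p @ v_star**2 - (p @ v_star)**2
    assert variance <= 0.25                      # Popoviciu: Var <= (1 - 0)^2 / 4
    print(f"H = {H:6d}: Var V* = {variance:.4f} (<= 0.25, no H dependence)")
```

Since ℚ⋆ ≤ 1/4 here, the Õ(√(ℚ⋆SAT)) bound has no H factor in its dominant term.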
SLIDE 29
Effect of MDP Stochasticity

Stochasticity in the Transition Dynamics:
- Deterministic MDP: Õ(SAH²)
- Bandit-Like Structure: Õ(√(SAT) + [...])
- Hard Instances of the Lower Bound: Õ(√(HSAT) + [...])

Our algorithm matches, in the dominant terms, the best performance for each setting.
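The left end of this spectrum can be checked directly: with deterministic transitions each p(s, a) is a point mass, so the variance of V⋆ over successors is exactly zero and ℚ⋆ = 0, while spread-out transitions make ℚ⋆ positive. A small sketch (the V⋆ values are made up for illustration):

```python
import numpy as np

v_star = np.array([0.0, 2.0, 5.0])  # illustrative optimal values of 3 successor states

# Deterministic MDP: each state-action pair maps to a single successor (point mass),
# so Var V*(s+) = 0 for every (s, a) and Q* = 0.
p_det = np.eye(3)
var_det = p_det @ v_star**2 - (p_det @ v_star)**2
print("deterministic Q* =", var_det.max())   # 0.0

# Stochastic MDP: successors are spread out, so V* varies under p(s, a) and Q* > 0.
p_sto = np.full((3, 3), 1.0 / 3.0)
var_sto = p_sto @ v_star**2 - (p_sto @ v_star)**2
print("stochastic Q* =", var_sto.max())
```

When ℚ⋆ = 0 the √(ℚ⋆SAT) term vanishes entirely, which is how the bound collapses to the horizon-only Õ(SAH²) rate in the deterministic case.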
SLIDE 33
Conclusion

- Episodic tabular MDP instance-dependent bound without knowledge of the environment
- Insights into the hardness of RL; provable improvements in many settings of interest: bandit-like structure, limited range of rewards, limited variability in the value function among successor states, near-deterministic MDPs, long-horizon MDPs

Related Work (infinite horizon)
In mixing domains:
- (Talebi et al, 2018)
- (Ortner, 2018)
May not improve over worst-case.
With domain knowledge:
- [REGAL] (Bartlett et al, 2010)
- [SCAL] (Fruit et al, 2018)