Tighter Problem-Dependent Regret Bounds in Reinforcement Learning without Domain Knowledge using Value Function Bounds


SLIDE 1

Tighter Problem-Dependent Regret Bounds in Reinforcement Learning without Domain Knowledge using Value Function Bounds

Andrea Zanette*, Emma Brunskill

Institute for Computational and Mathematical Engineering and Department of Computer Science, Stanford University

zanette@stanford.edu ebrun@cs.stanford.edu

SLIDE 2

  • Setting: episodic tabular RL
  • Goal: automatically inherit instance-dependent regret bounds
  • Exploration in RL = learn quickly how to play near-optimally
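
For concreteness, one minimal way to represent the objects the rest of the deck refers to is sketched below; the class and field names are illustrative choices of this transcript, not identifiers from the paper.

from dataclasses import dataclass

import numpy as np

@dataclass
class TabularMDP:
    """Episodic tabular MDP with S states, A actions, horizon H."""
    H: int          # episode length (horizon)
    p: np.ndarray   # transition kernel, shape (S, A, S)
    r: np.ndarray   # mean rewards in [0, 1], shape (S, A)

    @property
    def S(self) -> int:
        return self.p.shape[0]

    @property
    def A(self) -> int:
        return self.p.shape[1]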

SLIDES 3–10

State of the Art Regret Bounds for Episodic Tabular MDPs

  • No intelligent exploration (naive greedy): Õ(T)
  • Efficient exploration, UCRL2 (Jaksch 2010): Õ(HS√(AT))
  • (Dann 2015): Õ(H√(SAT))
  • (Dann 2017): Õ(S√(HAT))
  • (Azar 2017): Õ(√(HSAT))
  • Lower bound: Ω(√(HSAT))
  • Problem-dependent analysis (our work): Õ(√(ℚ⋆SAT))

SLIDES 11–18

Main Result

[Figure: from a state-action pair (s, a) at step t, the MDP transitions to a successor state s⁺ at step t+1; the branches are annotated with the optimal values V⋆(s⁺₁), V⋆(s⁺₂), V⋆(s⁺₃)]

ℚ⋆ = max_{s,a} Var_{s⁺∼p(s,a)} V⋆(s⁺)

r₁ + r₂ + … + r_H ≤ 𝒣

Main Result: an algorithm with a (high-probability) regret bound of

min { Õ(√(ℚ⋆SAT)) + [const], Õ(√(𝒣²H·SAT)) + [const] }

Technique: an exploration bonus that is adaptively adjusted as a function of the problem difficulty.
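
To make the definition concrete, here is a minimal sketch of how ℚ⋆ could be computed when p and r are known (for intuition only: the learning algorithm never sees these quantities and must bound them from data); the function name and the NumPy representation are assumptions of this transcript, not the paper's code.

import numpy as np

def q_star(p: np.ndarray, r: np.ndarray, H: int) -> float:
    """ℚ⋆ = max_{s,a} Var_{s⁺∼p(s,a)} V⋆(s⁺) for a tabular MDP with
    stationary dynamics p (S, A, S) and mean rewards r (S, A)."""
    S, A, _ = p.shape
    # Optimal values by backward induction; V[h][s] = V⋆_h(s).
    V = np.zeros((H + 1, S))
    for h in range(H - 1, -1, -1):
        Q = r + p @ V[h + 1]            # one-step lookahead, shape (S, A)
        V[h] = Q.max(axis=1)
    # Conditional variance of V⋆(s⁺) given each (s, a), maximized.
    best = 0.0
    for h in range(H):
        ev = p @ V[h + 1]               # E[V⋆(s⁺)],   shape (S, A)
        ev2 = p @ V[h + 1] ** 2         # E[V⋆(s⁺)²],  shape (S, A)
        best = max(best, float((ev2 - ev ** 2).max()))
    return best

Because V⋆ is stage-dependent in finite-horizon problems, the sketch maximizes the conditional variance over timesteps as well as over (s, a).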

SLIDES 19–23

Long Horizon MDPs

  • Standard setting: r ∈ [0, 1]
  • Goal MDP setting*: r_t ≥ 0 and Σ_{t=1}^H r_t ≤ 1

* this is a more general setting

COLT Conjecture of Jiang & Agarwal, 2018: any algorithm must suffer ~H dependence in its sample complexity and regret in the Goal MDP setting.

Our algorithm yields no horizon dependence in the regret bound for the setting of the COLT conjecture, without being informed of the setting.
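
One way to see why this is plausible, sketched from the definitions on the earlier slides (this derivation is the transcript's addition, not a slide):

    r_t ≥ 0 and Σ_{t=1}^H r_t ≤ 1   ⟹   0 ≤ V⋆(s) ≤ 1 for every state s
    ⟹   ℚ⋆ = max_{s,a} Var_{s⁺∼p(s,a)} V⋆(s⁺) ≤ 1/4   (a [0,1]-valued variable has variance at most 1/4)
    ⟹   the dominant regret term Õ(√(ℚ⋆SAT)) ≤ Õ(√(SAT)), which carries no H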

SLIDES 24–29

Effect of MDP Stochasticity

[Figure: a spectrum of stochasticity in the transition dynamics, ranging from deterministic MDPs through bandit-like structure to the hard instances of the lower bound]

  • Deterministic MDP: Õ(SAH²)
  • Bandit-like structure: Õ(√(SAT) + [...])
  • Hard instances of the lower bound: Õ(√(HSAT) + [...])

Our algorithm matches, in the dominant terms, the best performance in each setting.
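
The mechanism behind this adaptivity can be illustrated with a generic Bernstein-style bonus: the confidence width scales with the empirical variance of the estimated next-step values, so it collapses on deterministic transitions and stays small in bandit-like instances. This is a sketch under that assumption, not the paper's exact bonus (which includes further correction terms); all names are mine.

import numpy as np

def adaptive_bonus(p_hat: np.ndarray, v_next: np.ndarray,
                   n: np.ndarray, delta: float = 0.05) -> np.ndarray:
    """Bernstein-style exploration bonus for every (s, a).

    p_hat:  empirical transition kernel, shape (S, A, S)
    v_next: current estimate of the next-step value, shape (S,)
    n:      visit counts per (s, a), shape (S, A)
    """
    log_term = np.log(2.0 / delta)
    # Empirical variance of v_next under p_hat, per (s, a).
    var = p_hat @ v_next ** 2 - (p_hat @ v_next) ** 2
    var = np.maximum(var, 0.0)          # guard tiny negative values from floating point
    n = np.maximum(n, 1)                # avoid division by zero
    range_v = float(v_next.max() - v_next.min())
    return np.sqrt(2.0 * var * log_term / n) + 7.0 * range_v * log_term / (3.0 * n)

Plugged into optimistic value iteration, Q̃(s, a) = r̂(s, a) + p̂(s, a)·Ṽ + bonus(s, a); on a deterministic MDP the variance term vanishes and only the O(1/n) term remains, consistent with the Õ(SAH²) entry above.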

SLIDES 30–31

Related Work (infinite horizon)

In mixing domains:

  • (Talebi et al., 2018)
  • (Ortner, 2018)

May not improve over the worst case:

  • (Maillard et al., 2014)

With domain knowledge:

  • [REGAL] (Bartlett et al., 2010)
  • [SCAL] (Fruit et al., 2018)
SLIDES 32–33

Conclusion

  • Episodic tabular MDPs: an instance-dependent bound without knowledge of the environment
  • Insights into the hardness of RL; provable improvements in many settings of interest:
      - bandit-like structure (limited range of the optimal value function)
      - limited variability in the value function among successor states
      - near-deterministic MDPs
      - long-horizon MDPs