Tighter Problem-Dependent Regret Bounds in Reinforcement Learning without Domain Knowledge using Value Function Bounds


SLIDE 1

Tighter Problem-Dependent Regret Bounds in Reinforcement Learning without Domain Knowledge using Value Function Bounds

Andrea Zanette*, Emma Brunskill

Institute for Computational and Mathematical Engineering and Department of Computer Science, Stanford University

zanette@stanford.edu ebrun@cs.stanford.edu

SLIDE 2

  • Setting: episodic tabular RL
  • Goal: automatically inherit instance-dependent regret bounds
  • Exploration in RL = learn quickly how to play near-optimally
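
For concreteness, one minimal way to represent the objects the rest of the deck refers to is sketched below; the class and field names are illustrative choices of this transcript, not identifiers from the paper.

from dataclasses import dataclass

import numpy as np

@dataclass
class TabularMDP:
    """Episodic tabular MDP with S states, A actions, horizon H."""
    H: int          # episode length (horizon)
    p: np.ndarray   # transition kernel, shape (S, A, S)
    r: np.ndarray   # mean rewards in [0, 1], shape (S, A)

    @property
    def S(self) -> int:
        return self.p.shape[0]

    @property
    def A(self) -> int:
        return self.p.shape[1]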

SLIDES 3–10

State of the Art Regret Bounds for Episodic Tabular MDPs

  • No intelligent exploration (naive greedy): Õ(T)
  • Efficient exploration, UCRL2 (Jaksch 2010): Õ(HS√(AT))
  • (Dann 2015): Õ(H√(SAT))
  • (Dann 2017): Õ(S√(HAT))
  • (Azar 2017): Õ(√(HSAT))
  • Lower bound: Ω(√(HSAT))
  • Problem-dependent analysis (our work): Õ(√(ℚ⋆SAT))

SLIDES 11–18

Main Result

[Figure: from a state-action pair (s, a) at step t, the MDP transitions to a successor state s⁺ at step t+1; the branches are annotated with the optimal values V⋆(s⁺₁), V⋆(s⁺₂), V⋆(s⁺₃)]

ℚ⋆ = max_{s,a} Var_{s⁺∼p(s,a)} V⋆(s⁺)

r₁ + r₂ + … + r_H ≤ 𝒣

Main Result: an algorithm with a (high-probability) regret bound of

min { Õ(√(ℚ⋆SAT)) + [const], Õ(√(𝒣²H·SAT)) + [const] }

Technique: an exploration bonus that is adaptively adjusted as a function of the problem difficulty.
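
To make the definition concrete, here is a minimal sketch of how ℚ⋆ could be computed when p and r are known (for intuition only: the learning algorithm never sees these quantities and must bound them from data); the function name and the NumPy representation are assumptions of this transcript, not the paper's code.

import numpy as np

def q_star(p: np.ndarray, r: np.ndarray, H: int) -> float:
    """ℚ⋆ = max_{s,a} Var_{s⁺∼p(s,a)} V⋆(s⁺) for a tabular MDP with
    stationary dynamics p (S, A, S) and mean rewards r (S, A)."""
    S, A, _ = p.shape
    # Optimal values by backward induction; V[h][s] = V⋆_h(s).
    V = np.zeros((H + 1, S))
    for h in range(H - 1, -1, -1):
        Q = r + p @ V[h + 1]            # one-step lookahead, shape (S, A)
        V[h] = Q.max(axis=1)
    # Conditional variance of V⋆(s⁺) given each (s, a), maximized.
    best = 0.0
    for h in range(H):
        ev = p @ V[h + 1]               # E[V⋆(s⁺)],   shape (S, A)
        ev2 = p @ V[h + 1] ** 2         # E[V⋆(s⁺)²],  shape (S, A)
        best = max(best, float((ev2 - ev ** 2).max()))
    return best

Because V⋆ is stage-dependent in finite-horizon problems, the sketch maximizes the conditional variance over timesteps as well as over (s, a).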

SLIDES 19–23

Long Horizon MDPs

  • Standard setting: r ∈ [0, 1]
  • Goal MDP setting*: r_t ≥ 0 and Σ_{t=1}^H r_t ≤ 1

* this is a more general setting

COLT Conjecture of Jiang & Agarwal, 2018: any algorithm must suffer ~H dependence in its sample complexity and regret in the Goal MDP setting.

Our algorithm yields no horizon dependence in the regret bound for the setting of the COLT conjecture, without being informed of the setting.
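
One way to see why this is plausible, sketched from the definitions on the earlier slides (this derivation is the transcript's addition, not a slide):

    r_t ≥ 0 and Σ_{t=1}^H r_t ≤ 1   ⟹   0 ≤ V⋆(s) ≤ 1 for every state s
    ⟹   ℚ⋆ = max_{s,a} Var_{s⁺∼p(s,a)} V⋆(s⁺) ≤ 1/4   (a [0,1]-valued variable has variance at most 1/4)
    ⟹   the dominant regret term Õ(√(ℚ⋆SAT)) ≤ Õ(√(SAT)), which carries no H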

SLIDES 24–29

Effect of MDP Stochasticity

[Figure: a spectrum of stochasticity in the transition dynamics, ranging from deterministic MDPs through bandit-like structure to the hard instances of the lower bound]

  • Deterministic MDP: Õ(SAH²)
  • Bandit-like structure: Õ(√(SAT) + [...])
  • Hard instances of the lower bound: Õ(√(HSAT) + [...])

Our algorithm matches, in the dominant terms, the best performance in each setting.
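
The mechanism behind this adaptivity can be illustrated with a generic Bernstein-style bonus: the confidence width scales with the empirical variance of the estimated next-step values, so it collapses on deterministic transitions and stays small in bandit-like instances. This is a sketch under that assumption, not the paper's exact bonus (which includes further correction terms); all names are mine.

import numpy as np

def adaptive_bonus(p_hat: np.ndarray, v_next: np.ndarray,
                   n: np.ndarray, delta: float = 0.05) -> np.ndarray:
    """Bernstein-style exploration bonus for every (s, a).

    p_hat:  empirical transition kernel, shape (S, A, S)
    v_next: current estimate of the next-step value, shape (S,)
    n:      visit counts per (s, a), shape (S, A)
    """
    log_term = np.log(2.0 / delta)
    # Empirical variance of v_next under p_hat, per (s, a).
    var = p_hat @ v_next ** 2 - (p_hat @ v_next) ** 2
    var = np.maximum(var, 0.0)          # guard tiny negative values from floating point
    n = np.maximum(n, 1)                # avoid division by zero
    range_v = float(v_next.max() - v_next.min())
    return np.sqrt(2.0 * var * log_term / n) + 7.0 * range_v * log_term / (3.0 * n)

Plugged into optimistic value iteration, Q̃(s, a) = r̂(s, a) + p̂(s, a)·Ṽ + bonus(s, a); on a deterministic MDP the variance term vanishes and only the O(1/n) term remains, consistent with the Õ(SAH²) entry above.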

SLIDES 30–31

Related Work (infinite horizon)

In mixing domains:

  • (Talebi et al., 2018)
  • (Ortner, 2018)

May not improve over the worst case:

  • (Maillard et al., 2014)

With domain knowledge:

  • [REGAL] (Bartlett et al., 2010)
  • [SCAL] (Fruit et al., 2018)
SLIDES 32–33

Conclusion

  • Episodic tabular MDPs: an instance-dependent bound without knowledge of the environment
  • Insights into the hardness of RL; provable improvements in many settings of interest:
      - bandit-like structure (limited range of the optimal value function)
      - limited variability in the value function among successor states
      - near-deterministic MDPs
      - long-horizon MDPs