Some Theoretical Aspects of Reinforcement Learning (CS 285) - PowerPoint PPT Presentation


SLIDE 1

Some Theoretical Aspects of Reinforcement Learning
 CS 285

Instructor: Aviral Kumar, UC Berkeley

SLIDE 2

What Will We Discuss Today?

  • Notions of Convergence in RL, Assumptions and Preliminaries

  • Optimization Error in RL and Analyses of Fitted Q-Iteration Algorithms

  • Regret Analyses of RL Algorithms: An Introduction

  • RL with Function Approximation: When can we still obtain convergent algorithms?

A brief introduction to some theoretical aspects of RL: in particular, error/suboptimality analysis of RL algorithms, understanding of regret, and function approximation. This is not at all an exhaustive coverage of topics in RL theory; check out the various resources on the last slide of this lecture.
SLIDE 3

Metrics used to evaluate RL methods

Sample complexity: How many transitions/episodes do I need to obtain a good policy? Typically used for measuring how easy it is to infer the optimal policy assuming no exploration bottlenecks (e.g., in offline RL). A typical guarantee: if

  N = O(poly(|S|, |A|, 1/(1 − γ)))

then

  max_{s,a} |Q^π(s, a) − Q̂^π(s, a)| ≤ ε

Regret: Typically used for measuring how good an exploration scheme is. For a sequence of policies π0, π1, π2, · · · , πN, the regret is

  Reg(N) = Σ_{i=1}^{N} ( E_{s0∼ρ}[V∗(s0)] − E_{s0∼ρ}[V^{πi}(s0)] )

and a good exploration algorithm achieves sublinear regret, e.g., Reg(N) = O(√N).

SLIDE 4

Assumptions used in RL Analyses

We can break down RL into two parts:

  • the exploration part
  • the learning part: given data from the exploration policy, we should be able to learn from it

Can we analyze these separately?

To remove the exploration aspect, perform the analysis under the “generative model” assumption: access to sampling s′ ∼ P(·|s, a) for any (s, a) pair.

Suppose we can query the true dynamics model of the MDP for each (s, a) pair N times and construct an empirical dynamics model:

  P̂(s′|s, a) = #(s, a, s′) / N

How does the approximation error of this model translate to errors in the value function?

Goal: Approximate the Q-function or the value function.
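The empirical-model construction above can be sketched in a few lines. This is a toy illustration, not from the slides: the MDP sizes and the Dirichlet-sampled “true” dynamics are made up, and we play the generative-model oracle ourselves.

```python
import numpy as np

# Sketch: build the empirical dynamics model P_hat(s'|s,a) = #(s,a,s') / N
# under the generative-model assumption, i.e., we may sample s' ~ P(.|s,a)
# for every (s,a) pair. The "true" model below is made up for illustration.

rng = np.random.default_rng(0)
S, A, N = 4, 2, 5000                          # states, actions, samples per (s,a)

P = rng.dirichlet(np.ones(S), size=(S, A))    # P[s, a] is a distribution over s'

counts = np.zeros((S, A, S))
for s in range(S):
    for a in range(A):
        draws = rng.choice(S, size=N, p=P[s, a])   # N queries of the oracle
        counts[s, a] = np.bincount(draws, minlength=S)

P_hat = counts / N                            # empirical model

# Per-(s,a) model error: ||P_hat(.|s,a) - P(.|s,a)||_1
l1_err = np.abs(P_hat - P).sum(axis=-1)
print("worst-case L1 model error:", l1_err.max())
```

As N grows, the worst-case L1 error shrinks, which is exactly what the concentration argument on the next slides quantifies.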

SLIDE 5

Preliminaries

Concentration (Hoeffding-style): the average over samples gets closer to the true mean as the number of samples grows. More complex variants exist; we will use this version to obtain a worst-case bound under the generative model.

Lemmas from RL Theory Textbook (Draft). Agarwal, Jiang, Kakade, Sun. https://rltheorybook.github.io/
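As a quick numerical sanity check on the concentration statement, here is a sketch of Hoeffding's inequality for bounded samples; the Bernoulli distribution, m, and δ are made-up illustrative choices, not from the slides.

```python
import numpy as np

# Sketch: Hoeffding's inequality for i.i.d. samples X_i in [0, 1] says that
# with probability at least 1 - delta,
#   |empirical mean - true mean| <= sqrt(log(2/delta) / (2 m)).
# We check the empirical failure rate against delta on a toy distribution.

rng = np.random.default_rng(1)
m, delta, trials = 200, 0.1, 2000
p = 0.3                                        # Bernoulli(p) samples, mean = p

bound = np.sqrt(np.log(2 / delta) / (2 * m))
samples = rng.binomial(1, p, size=(trials, m))
deviations = np.abs(samples.mean(axis=1) - p)
failure_rate = (deviations > bound).mean()
print(failure_rate)                            # should be well below delta
```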

SLIDE 6

Part 1: Sampling/Optimization Error in RL

Goal: How does error in training translate to error in the value-function?

We will analyze this optimization error in two settings: (1) generative model, (2) fitted Q-iteration. We want results of the form:

  if ||P̂(s′|s, a) − P(s′|s, a)||_1 ≤ ε then ||Q(s, a) − Q̂(s, a)||_1 ≤ δ

  if ||Q(s, a) − T̂Q(s, a)||_∞ ≤ ε then ||Q(s, a) − Q̂(s, a)||_∞ ≤ δ

Here T is the Bellman operator and T̂ is the “empirical” Bellman operator, constructed using transition samples observed by sampling the MDP:

  TQ(s, a) = r(s, a) + γ E_{s′∼P(·|s,a)}[max_{a′} Q(s′, a′)]

  T̂Q(s, a) = r̂(s, a) + γ E_{s′∼P̂(·|s,a)}[max_{a′} Q(s′, a′)]
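The two operators can be compared directly on a toy tabular MDP. The sketch below is illustrative only (made-up MDP, count-based empirical model as on the previous slides; the true reward is reused, so only dynamics error appears):

```python
import numpy as np

# Sketch of the two backups:
#   (T Q)(s,a)     = r(s,a) + gamma * E_{s'~P}    [max_a' Q(s',a')]
#   (T_hat Q)(s,a) = r(s,a) + gamma * E_{s'~P_hat}[max_a' Q(s',a')]
# and of how close they are when P_hat uses m samples per (s,a).

rng = np.random.default_rng(2)
S, A, gamma, m = 5, 3, 0.9, 10_000

P = rng.dirichlet(np.ones(S), size=(S, A))   # made-up true dynamics
r = rng.uniform(0, 1, size=(S, A))           # made-up rewards in [0, 1]

def backup(Q, P_model):
    # (T Q)(s,a) = r(s,a) + gamma * sum_s' P_model(s'|s,a) * max_a' Q(s',a')
    return r + gamma * P_model @ Q.max(axis=1)

# Empirical model from m generative-model queries per (s,a).
counts = np.zeros((S, A, S))
for s in range(S):
    for a in range(A):
        draws = rng.choice(S, size=m, p=P[s, a])
        counts[s, a] = np.bincount(draws, minlength=S)
P_hat = counts / m

Q = rng.uniform(0, 1, size=(S, A))           # an arbitrary Q to back up
gap = np.abs(backup(Q, P) - backup(Q, P_hat)).max()
print("||T_hat Q - T Q||_inf:", gap)         # shrinks as m grows
```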

SLIDE 7

Sampling Error with Generative Model

  1. Estimate the dynamics model P̂(s′|s, a) = #(s, a, s′) / N
  2. For a given policy π, plan under this dynamics model to obtain the Q-function Q̂^π

First Step: Bound the difference between the learned and true dynamics model, using concentration inequalities. With probability at least 1 − δ, the empirical dynamics model and the actual dynamics model are close:

  ||P̂(·|s, a) − P(·|s, a)||_1 ≤ c √(|S| log(1/δ) / m)

where m is the number of samples used to estimate P(s′|s, a), and c is a constant.

SLIDE 8

Sampling Error with Generative Model

Second step: Compute how the dynamics model affects the Q-function. The Q-function depends on the dynamics model P(s′|s, a) via a non-linear transformation:

  1. Express Q^π in vector form: Q^π = r + γ P^π Q^π, i.e., Q^π = (I − γP^π)^{-1} r
  2. Express the difference between the two vectors in closed form to obtain (P̂ − P) in the expression:

  Q̂^π − Q^π = γ (I − γP̂^π)^{-1} (P̂^π − P^π) Q^π
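The vector-form identity in step 2 can be checked numerically. A sketch on a randomly generated toy model (the additive perturbation used to build P̂ is just an illustrative stand-in for sampling error):

```python
import numpy as np

# Sketch of the vector form: for a fixed policy pi,
#   Q^pi = r + gamma * P^pi Q^pi  =>  Q^pi = (I - gamma P^pi)^{-1} r,
# and the difference between the true and empirical Q-functions exposes
# (P_hat - P):  Q_hat - Q = gamma (I - gamma P_hat)^{-1} (P_hat - P) Q.

rng = np.random.default_rng(3)
n, gamma = 6, 0.9                       # n = |S||A| state-action pairs

P_pi = rng.dirichlet(np.ones(n), size=n)     # true P^pi (row-stochastic)
r = rng.uniform(0, 1, size=n)

# A slightly perturbed "empirical" model, renormalized to stay stochastic.
P_hat = P_pi + 0.01 * rng.normal(size=(n, n))
P_hat = np.abs(P_hat) / np.abs(P_hat).sum(axis=1, keepdims=True)

I = np.eye(n)
Q = np.linalg.solve(I - gamma * P_pi, r)
Q_hat = np.linalg.solve(I - gamma * P_hat, r)

# Verify: Q_hat - Q = gamma * (I - gamma P_hat)^{-1} (P_hat - P) Q
rhs = gamma * np.linalg.solve(I - gamma * P_hat, (P_hat - P_pi) @ Q)
print(np.abs((Q_hat - Q) - rhs).max())  # ~ 0 up to float error
```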

SLIDE 9

Sampling Error with Generative Model

Third step: Understand how error in the Q-function depends on error in the model.

Define w = (I − γP^π)^{-1} v = Σ_{t≥0} γ^t (P^π)^t v. By the triangle inequality, and since P^π is row-stochastic, ||P^π||_∞ ≤ 1. Thus,

  ||w||_∞ ≤ Σ_{t≥0} γ^t ||v||_∞ = ||v||_∞ / (1 − γ)

SLIDE 10

Sampling Error with Generative Model

Final step: Completing the proof. Bound the max element of the product by the product of max elements, then use the previous relation (assume Rmax = 1):

  ||Q^π − Q̂^π||_∞ ≤ (γ / (1 − γ)²) · c √(|S| log(1/δ) / m)

If we want at most ε error in Q̂^π, we can invert this bound to compute the minimum number of samples m needed.
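Inverting the bound for m gives a sample-complexity expression. A sketch, treating the unspecified constant c as a placeholder (c = 1 below is an assumption, not from the slides):

```python
import math

# Sketch: invert  ||Q^pi - Q_hat^pi||_inf <= gamma/(1-gamma)^2
#                                            * c * sqrt(|S| log(1/delta) / m)
# for the number of samples m per (s,a) guaranteeing error at most eps:
#   m >= c^2 gamma^2 |S| log(1/delta) / ((1-gamma)^4 eps^2)

def samples_needed(S, gamma, eps, delta, c=1.0):
    return math.ceil(c**2 * gamma**2 * S * math.log(1 / delta)
                     / ((1 - gamma) ** 4 * eps**2))

m = samples_needed(S=100, gamma=0.99, eps=0.1, delta=0.01)
print(m)   # note the 1/(1-gamma)^4 blow-up as gamma -> 1
```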

SLIDE 11

Proof Takeaways and Summary

  • A small error in estimating the dynamics model implies a small error in the Q-function.

  • However, error “compounds”: note the (1 − γ)² factor in the denominator of the bound:

  ||Q^π − Q̂^π||_∞ ≤ (γ / (1 − γ)²) · c √(|S| log(1/δ) / m)

  • The more samples we collect, the better our estimate will be, but sadly samples aren’t free!

How does optimization error manifest in model-free variants (e.g., fitted Q-iteration)?

SLIDE 12

Part 2: Optimization Error in FQI

Which sources of error are we considering here?

Fitted Q-iteration runs a sequence of backups by minimizing a mean-squared error, starting from an initial Q-value Q0:

  Q_{k+1} ← arg min_Q ||Q − T̂Q_k||²_2

If we use T instead of T̂ and ||Q_{k+1} − TQ_k|| = 0 at every step, then FQI converges to the optimal Q-function Q∗. Recall the two operators:

  TQ(s, a) = r(s, a) + γ E_{s′∼P(·|s,a)}[max_{a′} Q(s′, a′)]

  T̂Q(s, a) = r̂(s, a) + γ E_{s′∼P̂(·|s,a)}[max_{a′} Q(s′, a′)]

The sources of error:

  • T̂ is inexact: “sampling error” due to limited samples
  • Bellman errors ||Q_{k+1} − TQ_k|| that may not be 0
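In the idealized case (exact T, perfect fit at every iteration), FQI is just value iteration and does converge to Q∗. A minimal tabular sketch on a made-up MDP:

```python
import numpy as np

# Sketch of fitted Q-iteration in the exact tabular case: the true Bellman
# operator T is used and each "fit" is perfect (||Q_{k+1} - T Q_k|| = 0),
# so Q_k converges to the fixed point Q* = T Q*. MDP is made up.

rng = np.random.default_rng(4)
S, A, gamma = 5, 3, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))
r = rng.uniform(0, 1, size=(S, A))

def T(Q):
    # Optimal Bellman backup: (T Q)(s,a) = r(s,a) + gamma E[max_a' Q(s',a')]
    return r + gamma * P @ Q.max(axis=1)

Q = np.zeros((S, A))              # Q_0
for k in range(500):
    Q = T(Q)                      # "perfect fit" each iteration

# Q is now (numerically) the fixed point Q* = T Q*.
print(np.abs(T(Q) - Q).max())
```

The two error sources listed above are exactly what breaks this ideal picture: replacing T with T̂, and fits that leave residual Bellman error.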

SLIDE 13

Optimization Error in Fitted Q-Iteration

First Step: Bound the difference between the empirical and actual Bellman backup. Triangle inequality, then bound each term separately:

  |T̂Q(s, a) − TQ(s, a)| ≤ |r̂(s, a) − r(s, a)| + γ |E_{s′∼P̂(·|s,a)}[max_{a′} Q(s′, a′)] − E_{s′∼P(·|s,a)}[max_{a′} Q(s′, a′)]|

Concentration of the reward: directly apply Hoeffding’s inequality,

  |r̂(s, a) − r(s, a)| ≤ 2 Rmax √(log(1/δ) / (2m))

Concentration of the dynamics: write the second term in vector form; a sum of products is at most the sum of the products of absolute values, and Q-values are bounded by the ∞-norm:

  |Σ_{s′} (P̂(s′|s, a) − P(s′|s, a)) max_{a′} Q(s′, a′)| ≤ ||P̂(·|s, a) − P(·|s, a)||_1 ||Q||_∞

SLIDE 14

Optimization Error in Fitted Q-Iteration

Combining the bounds on the previous slide, and taking a max over (s, a), we get:

  ||T̂Q − TQ||_∞ ≤ 2 Rmax c1 √(log(|S||A|/δ) / m) + c2 ||Q||_∞ √(|S| log(1/δ) / m)

Second step: How does error in each fitting iteration affect optimality? Let’s say we incur error εk in each fitting step of FQI, i.e., ||Q_{k+1} − TQ_k||_∞ ≤ εk. Then what can we say about ||Q_k − Q∗||_∞? Using TQ∗ = Q∗:

  ||Q_k − Q∗||_∞ = ||TQ_{k−1} + (Q_k − TQ_{k−1}) − TQ∗||_∞
               ≤ ||TQ_{k−1} − TQ∗||_∞ + ||Q_k − TQ_{k−1}||_∞
               ≤ γ ||Q_{k−1} − Q∗||_∞ + εk

SLIDE 15

Optimization Error in Fitted Q-Iteration

  ||Q_k − Q∗||_∞ ≤ γ ||Q_{k−1} − Q∗||_∞ + εk
               ≤ γ² ||Q_{k−2} − Q∗||_∞ + γ ε_{k−1} + εk
               ≤ γ^k ||Q_0 − Q∗||_∞ + Σ_j γ^j ε_{k−j}

Error from previous iterations “compounds”, “propagates”, etc. Let’s consider a large number of fitting iterations in FQI (so k tends to ∞):

  lim_{k→∞} ||Q_k − Q∗||_∞ ≤ 0 + lim_{k→∞} Σ_j γ^j ε_{k−j} ≤ (Σ_{j=0}^{∞} γ^j) ||ε||_∞ = ||ε||_∞ / (1 − γ)

We pay a price for each error term, and the total error in the worst case is scaled by the 1/(1 − γ) factor.
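The recursion can be iterated numerically to see this limit. A sketch with a constant per-step error ε and a made-up starting error:

```python
# Sketch: iterate  e_k = gamma * e_{k-1} + eps  (the recursion above, with
# a constant per-step fitting error eps) and compare against the worst-case
# limit  ||eps||_inf / (1 - gamma).

gamma, eps = 0.9, 0.05
e = 1.0                         # e_0 = ||Q_0 - Q*||_inf (made-up start)
for k in range(1000):
    e = gamma * e + eps

limit = eps / (1 - gamma)
print(e, limit)                 # e converges to eps / (1 - gamma) = 0.5
```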

SLIDE 16

Optimization Error in Fitted Q-Iteration

Completing the Proof. So far, we have seen how errors in the Bellman backups accumulate into error against Q∗:

  lim_{k→∞} ||Q_k − Q∗||_∞ ≤ (1 / (1 − γ)) max_k ||Q_k − TQ_{k−1}||_∞ ≤ · · ·

What is the total error εk in each Bellman backup? Decompose it by the triangle inequality:

  ||Q_k − TQ_{k−1}||_∞ = ||Q_k − T̂Q_{k−1} + T̂Q_{k−1} − TQ_{k−1}||_∞
                      ≤ ||Q_k − T̂Q_{k−1}||_∞ + ||T̂Q_{k−1} − TQ_{k−1}||_∞

  • The first term is optimization error: how easily can we minimize the Bellman error?
  • The second term is “sampling error” due to limited data: it depends on the number of times we see each (s, a).

SLIDE 17

Proof Takeaways and Summary

  • Error compounds with FQI or DQN-style methods: this is especially a problem in offline RL settings, where the “sampling error” component is also quite high.

  • A stringent requirement of these bounds is that they directly bound the ∞-norm of the error in the Q-function: but can we ever practically bound the error at the worst state-action pair? Mostly not, since we can’t even enumerate the state or action space!

Can we remove the dependency on the ∞-norm? Yes! Similar results can be derived for other data distributions µ and Lp norms:

  ||Q_k − Q∗||_{µ,p} = ( E_{s,a∼µ(s,a)}[|Q_k(s, a) − Q∗(s, a)|^p] )^{1/p}

  • So far we’ve looked at the generative model setting, where we have oracle MDP access to compute an approximate dynamics model. What happens in the substantially harder setting without this access, where we need exploration strategies? Coming up next…

SLIDE 18

Part 3: Analysis of Exploration Strategies

So far, we have analyzed RL algorithms in terms of optimization error and sampling error when the algorithm is provided with data, but we haven’t seen where this data comes from. In the next part, we evaluate these algorithms on the cost of collecting data.

Multi-Armed Bandits: “1-step” RL

  1. N possible arms/actions a1, a2, · · · , aN
  2. Pull the i-th arm in round t and observe the corresponding (sampled) reward rt(ai) ∼ D(ai), where E[rt(ai)] = r̄(ai)
  3. The agent observes the resulting sampled reward and records it

Cumulative regret measures how much we lose by not picking the best arm in hindsight, on the actual expected reward (not the sampled reward):

  Reg(T) = T r̄(a∗) − Σ_{t=1}^{T} r̄(at)

If the regret grows sublinearly, then we are converging to the optimal action in the limit and thus learning “efficiently”.
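A minimal sketch of the protocol and the regret definition, with made-up arm means and a uniformly random player (which, unlike a good explorer, suffers linear regret):

```python
import numpy as np

# Sketch of cumulative regret:  Reg(T) = T * r_bar(a*) - sum_t r_bar(a_t),
# measured on the *expected* rewards, for a uniformly random arm-puller.

rng = np.random.default_rng(5)
r_bar = np.array([0.2, 0.5, 0.8])            # made-up mean rewards, N = 3 arms
T = 1000

arms = rng.integers(0, len(r_bar), size=T)   # uniformly random arm choices
regret = T * r_bar.max() - r_bar[arms].sum()
print(regret / T)   # per-round regret does not vanish for random play
```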

SLIDE 19

Exploration in Multi-Armed Bandits

UCB Algorithm / Optimistic exploration: in round t, pick arm at such that

  at := arg max_{i=1,··· ,N}  r̃t(ai) + √(log(2NT/δ) / (2 nt(ai)))

where r̃t(ai) (the mean-reward term) is the average of the observed sample rewards, and nt(ai) (inside the reward-bonus term) is the number of times arm ai was pulled.

Where does this reward bonus come from? Hoeffding’s inequality: with high probability, at least 1 − δ,

  ∀ i ∈ [1, · · · , N], t ∈ [1, · · · , T]: |r̃t(ai) − r̄(ai)| ≤ bt(ai)
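The UCB rule above can be sketched on a made-up Bernoulli bandit; pulling each arm once to initialize is an implementation choice, not from the slide.

```python
import numpy as np

# Sketch of UCB: pick a_t = argmax_i  r_tilde_t(a_i)
#                                     + sqrt(log(2 N T / delta) / (2 n_t(a_i)))
# on a toy 3-armed Bernoulli bandit with made-up means.

rng = np.random.default_rng(6)
r_bar = np.array([0.2, 0.5, 0.8])
N, T, delta = len(r_bar), 5000, 0.05

sums = np.zeros(N)                   # sum of observed rewards per arm
n = np.zeros(N)                      # pull counts n_t(a_i)
picks = []

for t in range(T):
    if t < N:
        a = t                        # initialization: pull each arm once
    else:
        bonus = np.sqrt(np.log(2 * N * T / delta) / (2 * n))
        a = int(np.argmax(sums / n + bonus))
    reward = rng.binomial(1, r_bar[a])   # sampled reward r_t(a) ~ D(a)
    sums[a] += reward
    n[a] += 1
    picks.append(a)

regret = T * r_bar.max() - r_bar[np.array(picks)].sum()
print("Reg(T) / T:", regret / T)     # small: most pulls go to the best arm
```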

SLIDE 20

Exploration in Multi-Armed Bandits

With high probability, the true reward for any arm lies in the interval defined by the bonus:

  r̃t(ai) − bt(ai) ≤ r̄(ai) ≤ r̃t(ai) + bt(ai)

How can we use this fact to obtain a bound on the regret?

  Reg(T) = Σ_{t=1}^{T} (r̄(a∗) − r̄(at))
         ≤ Σ_{t=1}^{T} [r̃t(a∗) + bt(a∗)] − [r̃t(at) − bt(at)] + δT
         ≤ Σ_{t=1}^{T} [r̃t(at) + bt(at)] − [r̃t(at) − bt(at)] + δT     (the chosen arm at maximizes the upper confidence bound!)
         = 2 Σ_{t=1}^{T} bt(at) + δT = O(√(T · N · log(NT/δ))) + δT

Hint: Write down the expression for the bonus, and try to re-organize terms to bound the sum.

SLIDE 21

Proof Takeaways and Summary

  Reg(T) = O(√(T · N · log(NT/δ))) + δ · T

The first term is sublinear (√T); the second term appears linear, though we can set δ small (e.g., δ ∝ 1/T) so that it is negligible.

  • By ensuring we are optimistic (i.e., adding bonuses such that suboptimal arms look more optimal) and that the optimism decays over time at the right rate, we can get good performance!

  • A similar analysis also works for RL (with r̃ → Ṽ and T → # episodes), though it is more complicated; the skeleton is quite similar, but the analysis techniques are definitely more complex.

SLIDE 22

Part 4: RL with Function Approximation

We have seen that when function approximation is used to represent the Q-function or the policy, there are no general convergence guarantees, and divergence can happen.

Under which special cases would RL work with function approximation? Consider a linear Q-function:

  Q(s, a) ≈ wᵀφ(s, a)

  • Policy evaluation using TD-learning: under nice data distributions, if the linear function class can represent the desired Q-function (realizability), then TD converges:

  Qπ(s, a) = r(s, a) + γ E_{s′∼P(·|s,a), a′∼π}[Qπ(s′, a′)],   ∃ w∗: Qπ(s, a) = w∗ᵀφ(s, a)

  • If the Q-function for the policy is not expressible in the linear function class, then divergence generally occurs.

Remember: this is not saying anything about neural networks.
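A sketch of the realizable case: TD(0) policy evaluation with one-hot features (so realizability holds trivially) on a made-up tabular MDP. The uniform behavior policy, the expected-SARSA-style target for the evaluated policy π, and the decaying step size are all illustrative choices, not from the slide.

```python
import numpy as np

# Sketch: linear TD policy evaluation, Q(s,a) ~ w^T phi(s,a), with one-hot
# features. Data comes from a uniform exploratory behavior policy; the target
# evaluates a fixed stochastic policy pi (expected-SARSA-style backup).

rng = np.random.default_rng(7)
S, A, gamma = 4, 2, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))   # made-up dynamics
r = rng.uniform(0, 1, size=(S, A))           # made-up rewards
pi = rng.dirichlet(np.ones(A), size=S)       # fixed policy to evaluate

def phi(s, a):
    f = np.zeros(S * A)
    f[s * A + a] = 1.0                       # one-hot => realizability holds
    return f

w = np.zeros(S * A)
s = 0
for step in range(200_000):
    a = int(rng.integers(A))                 # uniform exploratory behavior
    s_next = int(rng.choice(S, p=P[s, a]))
    q_next = sum(pi[s_next, b] * (w @ phi(s_next, b)) for b in range(A))
    alpha = 0.5 * 1000 / (1000 + step)       # decaying step size (a choice)
    td_error = r[s, a] + gamma * q_next - w @ phi(s, a)
    w += alpha * td_error * phi(s, a)
    s = s_next

# Exact Q^pi from the vector form Q^pi = (I - gamma P^pi)^{-1} r.
P_pi = np.einsum('saz,zb->sazb', P, pi).reshape(S * A, S * A)
Q_exact = np.linalg.solve(np.eye(S * A) - gamma * P_pi, r.reshape(-1))
print(np.abs(w - Q_exact).max())             # gap to the exact Q^pi
```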

SLIDE 23

RL with Function Approximation

What about actual online RL?

  • Deterministic MDP + linear optimal Q-function, i.e., ∃ w∗: Q∗(s, a) = w∗ᵀφ(s, a): works (Wen & Van Roy, 2013).

  • Approximately linear Qπ for all π + a data distribution that “covers” all policies (see the concentrability assumption in Munos 2005, Antos et al. 2008): polynomial samples suffice with “wide” initial state distributions or a generative model.

  • Approximately linear Q∗: No! See Du et al. 2020 for counterexamples. But when the feature representation is “informative” and “compressed” enough, this works (see Van Roy and Dong, 2019).

  • …and many more: under “structural assumptions” on the MDP, we can get convergent and efficient algorithms!

Collective table at: Du, Kakade, Wang, Yang. Is a Good Representation Sufficient for Sample Efficient RL? ICLR 2020.

SLIDE 24

Suggested Readings

  • Material taken from the RL Theory Book (Agarwal, Jiang, Kakade, Sun), 2020. https://rltheorybook.github.io/ (one place to find a lot of RL theory material)

  • Nan Jiang’s statistical RL class at UIUC: https://nanjiang.cs.illinois.edu/cs598/

  • Wen Sun’s Foundations of RL class at Cornell: https://wensun.github.io/CS6789.html

  • Fitted Q-Iteration:

  • Munos, 2003. Error Bounds for Approximate Policy Iteration.

  • Munos, 2005. Error Bounds for Approximate Value Iteration.

  • Chen and Jiang, 2019. Information Theoretic Considerations in Batch RL.
  • Generative Model:

  • Azar, Munos, Kappen, 2012. On the Sample Complexity of RL with a Generative Model.
  • Exploration:

  • Jaksch, Ortner, Auer, 2010. Near-Optimal Regret Bounds for Reinforcement Learning

  • Osband and Van Roy, 2015. Why is Posterior Sampling Better than Optimism for RL?

(aims to answer why posterior sampling (lecture 13) is more desirable)


  • Azar, Osband, Munos, 2017. Minimax Regret Bounds for RL (UCB-value iteration)