Function Approximation for On-Policy Prediction and Control



  1. Carnegie Mellon School of Computer Science, Deep Reinforcement Learning and Control. Function Approximation for (On-Policy) Prediction and Control. Lecture 8, CMU 10-403. Katerina Fragkiadaki.

  2. Used Materials
  ‣ Disclaimer: Much of the material and slides for this lecture were borrowed from Russ Salakhutdinov, Rich Sutton's class, and David Silver's class on Reinforcement Learning.

  3. Large-Scale Reinforcement Learning
  ‣ Reinforcement learning has been used to solve large problems, e.g.:
    - Backgammon: 10^20 states
    - Computer Go: 10^170 states
    - Helicopter: continuous state space
  ‣ Tabular methods clearly do not work.

  4. Value Function Approximation (VFA)
  ‣ So far we have represented the value function by a lookup table:
    - Every state s has an entry V(s), or
    - Every state-action pair (s,a) has an entry Q(s,a)
  ‣ Problem with large MDPs:
    - There are too many states and/or actions to store in memory
    - It is too slow to learn the value of each state individually
  ‣ Solution for large MDPs:
    - Estimate the value function with function approximation
    - Generalize from seen states to unseen states

  5. Value Function Approximation (VFA)
  ‣ Value function approximation (VFA) replaces the table with a general parameterized form:
    v̂(s, w) ≈ v_π(s),    q̂(s, a, w) ≈ q_π(s, a)

  6. Value Function Approximation (VFA)
  ‣ Value function approximation (VFA) replaces the table with a general parameterized form:
    v̂(s, w) ≈ v_π(s),    q̂(s, a, w) ≈ q_π(s, a),    π(A_t | S_t, θ)

  7. Value Function Approximation (VFA)
  ‣ Value function approximation (VFA) replaces the table with a general parameterized form, where |θ| << |𝒮|.
  ‣ When we update the parameters θ, the values of many states change simultaneously!

  8. Which Function Approximation?
  ‣ There are many function approximators, e.g.:
    - Linear combinations of features
    - Neural networks
    - Decision trees
    - Nearest neighbour
    - Fourier / wavelet bases
    - ...

  9. Which Function Approximation?
  ‣ There are many function approximators, e.g.:
    - Linear combinations of features
    - Neural networks
    - Decision trees
    - Nearest neighbour
    - Fourier / wavelet bases
    - ...
  ‣ We will focus on differentiable function approximators.

  10. Gradient Descent
  ‣ Let J(w) be a differentiable function of the parameter vector w.
  ‣ Define the gradient of J(w) to be:
    ∇_w J(w) = ( ∂J(w)/∂w_1, ..., ∂J(w)/∂w_n )^T

  11. Gradient Descent
  ‣ Let J(w) be a differentiable function of the parameter vector w.
  ‣ Define the gradient of J(w) to be:
    ∇_w J(w) = ( ∂J(w)/∂w_1, ..., ∂J(w)/∂w_n )^T
  ‣ To find a local minimum of J(w), adjust w in the direction of the negative gradient:
    Δw = −(1/2) α ∇_w J(w),    where α is the step-size

  12. Gradient Descent
  ‣ Let J(w) be a differentiable function of the parameter vector w.
  ‣ Define the gradient of J(w) to be:
    ∇_w J(w) = ( ∂J(w)/∂w_1, ..., ∂J(w)/∂w_n )^T
  ‣ Starting from a guess w_0, we consider the sequence w_0, w_1, w_2, ... such that:
    w_{n+1} = w_n − (1/2) α ∇_w J(w_n)
  ‣ We then have J(w_0) ≥ J(w_1) ≥ J(w_2) ≥ ...
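To make the update rule concrete, here is a minimal NumPy sketch of gradient descent on a simple quadratic objective. The objective `J`, its gradient, the minimizer `w_star`, and the step size are illustrative assumptions, not part of the slides.

```python
import numpy as np

# Illustrative quadratic objective J(w) = ||w - w*||^2 and its gradient.
w_star = np.array([1.0, -2.0])          # hypothetical minimizer
J      = lambda w: np.sum((w - w_star) ** 2)
grad_J = lambda w: 2.0 * (w - w_star)

alpha = 0.1                              # step size
w = np.zeros(2)                          # initial guess w_0
for n in range(100):
    w = w - 0.5 * alpha * grad_J(w)      # w_{n+1} = w_n - (1/2) * alpha * grad J(w_n)

print(J(w))  # J decreases monotonically toward the (local) minimum
```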

  13. Our objective
  ‣ Goal: find the parameter vector w minimizing the mean-squared error between the true value function v_π(S) and its approximation v̂(S, w).

  14. Our objective
  ‣ Goal: find the parameter vector w minimizing the mean-squared error between the true value function v_π(S) and its approximation v̂(S, w).

  15. Our objective
  ‣ Goal: find the parameter vector w minimizing the mean-squared error between the true value function v_π(S) and its approximation v̂(S, w).
  ‣ Let μ(s) denote how much time we spend in each state s under policy π. Then:
    J(w) = Σ_{s ∈ 𝒮} μ(s) [ v_π(s) − v̂(s, w) ]²,    with Σ_{s ∈ 𝒮} μ(s) = 1
  ‣ This weighting is a very important choice: it is OK if we cannot learn the values of states we visit very rarely. There are too many states, so we should focus on the ones that matter: the RL way of approximating the Bellman equations!

  16. Our objective
  ‣ Goal: find the parameter vector w minimizing the mean-squared error between the true value function v_π(S) and its approximation v̂(S, w).
  ‣ Let μ(s) denote how much time we spend in each state s under policy π. Then:
    J(w) = Σ_{s ∈ 𝒮} μ(s) [ v_π(s) − v̂(s, w) ]²,    with Σ_{s ∈ 𝒮} μ(s) = 1
  ‣ In contrast to the unweighted objective:
    J_2(w) = (1 / |𝒮|) Σ_{s ∈ 𝒮} [ v_π(s) − v̂(s, w) ]²

  17. On-policy state distribution
  ‣ Let h(s) be the initial state distribution, i.e., the probability that an episode starts at state s. Then:
    η(s) = h(s) + Σ_{s̄} η(s̄) Σ_a π(a | s̄) p(s | s̄, a),    ∀ s ∈ 𝒮
    μ(s) = η(s) / Σ_{s'} η(s'),    ∀ s ∈ 𝒮
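As a sanity check of these two equations, here is a small NumPy sketch that solves for η and μ in a toy episodic MDP. The transition matrix, policy, and start distribution are made-up numbers for illustration only.

```python
import numpy as np

# Toy episodic MDP with 3 non-terminal states. Transitions into the terminal
# state show up as "missing" probability mass, so rows of P_pi may sum to < 1.
# P_pi[s_bar, s] = sum_a pi(a | s_bar) * p(s | s_bar, a)   (hypothetical numbers)
P_pi = np.array([[0.1, 0.6, 0.2],
                 [0.0, 0.3, 0.5],
                 [0.2, 0.0, 0.3]])
h = np.array([0.5, 0.3, 0.2])             # initial state distribution h(s)

# eta(s) = h(s) + sum_{s_bar} eta(s_bar) P_pi[s_bar, s]  =>  eta = (I - P_pi^T)^-1 h
eta = np.linalg.solve(np.eye(3) - P_pi.T, h)
mu = eta / eta.sum()                      # normalize: on-policy distribution mu(s)
print(mu)
```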

  18. Gradient Descent
  ‣ Goal: find the parameter vector w minimizing the mean-squared error between the true value function v_π(S) and its approximation v̂(S, w):
    J(w) = E_π[ ( v_π(S) − v̂(S, w) )² ]

  19. Gradient Descent
  ‣ Goal: find the parameter vector w minimizing the mean-squared error between the true value function v_π(S) and its approximation v̂(S, w):
    J(w) = E_π[ ( v_π(S) − v̂(S, w) )² ]
  ‣ Starting from a guess w_0.

  20. Gradient Descent
  ‣ Goal: find the parameter vector w minimizing the mean-squared error between the true value function v_π(S) and its approximation v̂(S, w):
    J(w) = E_π[ ( v_π(S) − v̂(S, w) )² ]
  ‣ Starting from a guess w_0, we consider the sequence w_0, w_1, w_2, ... such that:
    w_{n+1} = w_n − (1/2) α ∇_w J(w_n)
  ‣ We then have J(w_0) ≥ J(w_1) ≥ J(w_2) ≥ ...

  21. Gradient Descent
  ‣ Goal: find the parameter vector w minimizing the mean-squared error between the true value function v_π(S) and its approximation v̂(S, w):
    J(w) = E_π[ ( v_π(S) − v̂(S, w) )² ]
  ‣ Gradient descent finds a local minimum:
    Δw = −(1/2) α ∇_w J(w) = α E_π[ ( v_π(S) − v̂(S, w) ) ∇_w v̂(S, w) ]

  22. Stochastic Gradient Descent
  ‣ Goal: find the parameter vector w minimizing the mean-squared error between the true value function v_π(S) and its approximation v̂(S, w):
    J(w) = E_π[ ( v_π(S) − v̂(S, w) )² ]
  ‣ Gradient descent finds a local minimum:
    Δw = −(1/2) α ∇_w J(w) = α E_π[ ( v_π(S) − v̂(S, w) ) ∇_w v̂(S, w) ]
  ‣ Stochastic gradient descent (SGD) samples the gradient:
    Δw = α ( v_π(S) − v̂(S, w) ) ∇_w v̂(S, w)
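A minimal sketch of the sampled update, written as a generic helper. The function name `sgd_vfa_update` and the callables `v_hat` / `grad_v_hat` are assumptions introduced here for illustration; the target stands in for v_π(S), which a supervisor (or, later, a return or TD target) would supply.

```python
import numpy as np

def sgd_vfa_update(w, s, v_target, v_hat, grad_v_hat, alpha=0.01):
    """One stochastic gradient step toward a sampled target for state s.

    v_hat(s, w) and grad_v_hat(s, w) are user-supplied; v_target plays the
    role of v_pi(s) in the update  Δw = α (v_pi(S) − v̂(S,w)) ∇_w v̂(S,w).
    """
    error = v_target - v_hat(s, w)                 # prediction error
    return w + alpha * error * grad_v_hat(s, w)    # step-size × error × gradient
```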

  23. Least Squares Prediction
  ‣ Given a value function approximation v̂(s, w) ≈ v_π(s)
  ‣ And experience D consisting of ⟨state, value⟩ pairs:
    D = { ⟨S_1, v_1^π⟩, ⟨S_2, v_2^π⟩, ..., ⟨S_T, v_T^π⟩ }
  ‣ Which parameters w give the best fitting value function v̂(s, w)?
  ‣ Least squares algorithms find the parameter vector w minimizing the sum-squared error between v̂(S_t, w) and the target values v_t^π:
    LS(w) = Σ_{t=1}^{T} ( v_t^π − v̂(S_t, w) )²

  24. SGD with Experience Replay
  ‣ Given experience D consisting of ⟨state, value⟩ pairs
  ‣ Repeat:
    - Sample a ⟨state, value⟩ pair ⟨S, v^π⟩ from the experience
    - Apply a stochastic gradient descent update: Δw = α ( v^π − v̂(S, w) ) ∇_w v̂(S, w)
  ‣ Converges to the least squares solution.
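A runnable sketch of this loop. The tiny hard-coded replay buffer and the linear approximator (v̂(s, w) = s·w, so ∇_w v̂ = s) are assumptions chosen so the least squares solution is easy to verify.

```python
import numpy as np

# Hypothetical replay buffer of <state features, target value> pairs.
replay = [(np.array([1.0, 0.0]),  2.0),
          (np.array([0.0, 1.0]), -1.0),
          (np.array([1.0, 1.0]),  1.0)]

# Linear approximator just for illustration: v̂(s, w) = s·w, ∇_w v̂ = s.
v_hat      = lambda s, w: s @ w
grad_v_hat = lambda s, w: s

rng = np.random.default_rng(0)
w = np.zeros(2)
for _ in range(5000):
    s, v_target = replay[rng.integers(len(replay))]            # sample from experience
    w += 0.01 * (v_target - v_hat(s, w)) * grad_v_hat(s, w)    # SGD update

print(w)  # approaches the least squares solution ([2, -1] for this data)
```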

  25. Feature Vectors
  ‣ Represent state by a feature vector:
    x(S) = ( x_1(S), ..., x_n(S) )^T
  ‣ For example:
    - Distance of robot from landmarks
    - Trends in the stock market
    - Piece and pawn configurations in chess

  26. Linear Value Function Approximation (VFA)
  ‣ Represent the value function by a linear combination of features:
    v̂(S, w) = x(S)^T w = Σ_{j=1}^{n} x_j(S) w_j
  ‣ The objective function is quadratic in the parameters w.
  ‣ The update rule is particularly simple:
    Δw = α ( v_π(S) − v̂(S, w) ) x(S)
  ‣ Update = step-size × prediction error × feature value
  ‣ Later, we will look at neural networks as function approximators.
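A minimal sketch of linear VFA with exactly this update. The class name `LinearVFA` and the one-hot feature map for a small chain of states are illustrative assumptions; with linear v̂, the gradient ∇_w v̂(S, w) is just the feature vector x(S), which is why the update is so simple.

```python
import numpy as np

class LinearVFA:
    """Linear value function v̂(s, w) = x(s)·w with the simple SGD update."""

    def __init__(self, n_features, alpha=0.1):
        self.w = np.zeros(n_features)
        self.alpha = alpha

    def value(self, x):
        return x @ self.w

    def update(self, x, target):
        # Update = step-size × prediction error × feature value
        self.w += self.alpha * (target - self.value(x)) * x

# Hypothetical one-hot features for a 5-state chain, just for illustration.
def one_hot(s, n_states=5):
    x = np.zeros(n_states)
    x[s] = 1.0
    return x

vfa = LinearVFA(n_features=5)
vfa.update(one_hot(2), target=1.5)   # one supervised-style update for state 2
print(vfa.value(one_hot(2)))
```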

  27. Incremental Prediction Algorithms
  ‣ We have assumed the true value function v_π(s) is given by a supervisor.
  ‣ But in RL there is no supervisor, only rewards.
  ‣ In practice, we substitute a target for v_π(s):
    - For MC, the target is the return G_t
    - For TD(0), the target is the TD target: R_{t+1} + γ v̂(S_{t+1}, w)
  ‣ Remember: the return is G_t = R_{t+1} + γ G_{t+1}, so the TD target simply replaces γ G_{t+1} with the bootstrapped estimate γ v̂(S_{t+1}, w).

  28. Monte Carlo with VFA
  ‣ The return G_t is an unbiased, noisy sample of the true value v_π(S_t).
  ‣ Can therefore apply supervised learning to "training data" of the form ⟨S_t, G_t⟩.
  ‣ For example, using linear Monte Carlo policy evaluation:
    Δw = α ( G_t − v̂(S_t, w) ) x(S_t)
  ‣ Monte Carlo evaluation converges to a local optimum.

  29. Monte Carlo with VFA
  Gradient Monte Carlo Algorithm for Approximating v̂ ≈ v_π
    Input: the policy π to be evaluated
    Input: a differentiable function v̂ : S × R^n → R
    Initialize value-function weights θ as appropriate (e.g., θ = 0)
    Repeat forever:
      Generate an episode S_0, A_0, R_1, S_1, A_1, ..., R_T, S_T using π
      For t = 0, 1, ..., T − 1:
        θ ← θ + α [ G_t − v̂(S_t, θ) ] ∇ v̂(S_t, θ)
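A runnable sketch of the gradient Monte Carlo box above, under two assumptions not in the slides: v̂ is linear with one-hot features (so v̂(s, θ) = θ[s] and the gradient is the one-hot vector), and a hypothetical `generate_episode()` helper runs the policy π and returns the visited (state, reward) pairs.

```python
import numpy as np

def gradient_mc(generate_episode, n_states, alpha=0.05, gamma=1.0, n_episodes=1000):
    """Gradient Monte Carlo prediction with linear v̂ and one-hot features.

    `generate_episode` is an assumed helper that follows π and returns
    [(S_0, R_1), (S_1, R_2), ..., (S_{T-1}, R_T)].
    """
    theta = np.zeros(n_states)
    for _ in range(n_episodes):
        episode = generate_episode()
        G = 0.0
        # Work backwards so the return G_t can be accumulated incrementally.
        for s, r in reversed(episode):
            G = r + gamma * G
            theta[s] += alpha * (G - theta[s])   # θ ← θ + α [G_t − v̂(S_t,θ)] ∇v̂(S_t,θ)
    return theta
```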

  30. TD Learning with VFA
  ‣ The TD target R_{t+1} + γ v̂(S_{t+1}, w) is a biased sample of the true value v_π(S_t).
  ‣ Can still apply supervised learning to "training data" of the form ⟨S_t, R_{t+1} + γ v̂(S_{t+1}, w)⟩.
  ‣ For example, using linear TD(0):
    Δw = α ( R_{t+1} + γ v̂(S_{t+1}, w) − v̂(S_t, w) ) x(S_t)
  ‣ We ignore the dependence of the target on w! That is why these are called semi-gradient methods.

  31. TD Learning with VFA
  Semi-gradient TD(0) for estimating v̂ ≈ v_π
    Input: the policy π to be evaluated
    Input: a differentiable function v̂ : S⁺ × R^n → R such that v̂(terminal, ·) = 0
    Initialize value-function weights θ arbitrarily (e.g., θ = 0)
    Repeat (for each episode):
      Initialize S
      Repeat (for each step of episode):
        Choose A ~ π(·|S)
        Take action A, observe R, S'
        θ ← θ + α [ R + γ v̂(S', θ) − v̂(S, θ) ] ∇ v̂(S, θ)
        S ← S'
      until S' is terminal
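A runnable sketch of semi-gradient TD(0) with a linear v̂. The Gym-style `env` (reset()/step(a)), the `policy(s)` sampler, and the `features(s)` extractor are assumptions introduced here, not part of the slides.

```python
import numpy as np

def semi_gradient_td0(env, policy, features, n_features,
                      alpha=0.05, gamma=0.99, n_episodes=500):
    """Semi-gradient TD(0) prediction with linear v̂(s, θ) = features(s)·θ.

    Assumes env.reset() -> state and env.step(a) -> (state, reward, done, info),
    policy(s) samples an action, and features(s) returns an n_features vector.
    """
    theta = np.zeros(n_features)
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done, _ = env.step(a)
            x, x_next = features(s), features(s_next)
            v_next = 0.0 if done else x_next @ theta   # enforce v̂(terminal, ·) = 0
            td_error = r + gamma * v_next - x @ theta
            theta += alpha * td_error * x              # semi-gradient: ∇ only through v̂(S, θ)
            s = s_next
    return theta
```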

  32. Control with VFA
  ‣ Policy evaluation: approximate policy evaluation, q̂(·, ·, w) ≈ q_π
  ‣ Policy improvement: ε-greedy policy improvement

  33. Action-Value Function Approximation
  ‣ Approximate the action-value function:
    q̂(S, A, w) ≈ q_π(S, A)
  ‣ Minimize the mean-squared error between the true action-value function q_π(S, A) and the approximate action-value function:
    J(w) = E_π[ ( q_π(S, A) − q̂(S, A, w) )² ]
  ‣ Use stochastic gradient descent to find a local minimum:
    Δw = α ( q_π(S, A) − q̂(S, A, w) ) ∇_w q̂(S, A, w)

  34. Linear Action-Value Function Approximation
  ‣ Represent state and action by a feature vector:
    x(S, A) = ( x_1(S, A), ..., x_n(S, A) )^T
  ‣ Represent the action-value function by a linear combination of features:
    q̂(S, A, w) = x(S, A)^T w = Σ_{j=1}^{n} x_j(S, A) w_j
  ‣ Stochastic gradient descent update:
    Δw = α ( q_π(S, A) − q̂(S, A, w) ) x(S, A)
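Putting the last three slides together, here is a minimal sketch of on-policy control with a linear q̂ and ε-greedy action selection, in the style of episodic semi-gradient SARSA. The Gym-style `env`, the `features(s, a)` extractor, and all hyperparameter values are assumptions for illustration; the true q_π target of the slide is replaced by the sampled SARSA target.

```python
import numpy as np

def semi_gradient_sarsa(env, features, n_features, n_actions,
                        alpha=0.05, gamma=0.99, epsilon=0.1, n_episodes=500):
    """Episodic semi-gradient SARSA with linear q̂(s, a, w) = features(s, a)·w.

    Assumes env.reset() -> state, env.step(a) -> (state, reward, done, info),
    and features(s, a) returning an n_features vector.
    """
    rng = np.random.default_rng(0)
    w = np.zeros(n_features)
    q = lambda s, a: features(s, a) @ w

    def eps_greedy(s):
        # ε-greedy policy improvement over the current q̂.
        if rng.random() < epsilon:
            return int(rng.integers(n_actions))
        return int(np.argmax([q(s, a) for a in range(n_actions)]))

    for _ in range(n_episodes):
        s = env.reset()
        a = eps_greedy(s)
        done = False
        while not done:
            s_next, r, done, _ = env.step(a)
            if done:
                target = r                              # q̂(terminal, ·, w) = 0
            else:
                a_next = eps_greedy(s_next)
                target = r + gamma * q(s_next, a_next)  # SARSA target
            w += alpha * (target - q(s, a)) * features(s, a)
            if not done:
                s, a = s_next, a_next
    return w
```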
