Function Approximation for On-Policy Prediction and Control



  1. Carnegie Mellon School of Computer Science, Deep Reinforcement Learning and Control. Function Approximation for (On-Policy) Prediction and Control. Lecture 8, CMU 10-403. Katerina Fragkiadaki.

  2. Used Materials
  ‣ Disclaimer: Much of the material and slides for this lecture were borrowed from Russ Salakhutdinov, Rich Sutton's class, and David Silver's class on Reinforcement Learning.

  3. Large-Scale Reinforcement Learning
  ‣ Reinforcement learning has been used to solve large problems, e.g.:
    - Backgammon: 10^20 states
    - Computer Go: 10^170 states
    - Helicopter: continuous state space
  ‣ Tabular methods clearly do not work.

  4. Value Function Approximation (VFA)
  ‣ So far we have represented the value function by a lookup table:
    - Every state s has an entry V(s), or
    - Every state-action pair (s,a) has an entry Q(s,a)
  ‣ Problem with large MDPs:
    - There are too many states and/or actions to store in memory
    - It is too slow to learn the value of each state individually
  ‣ Solution for large MDPs:
    - Estimate the value function with function approximation
    - Generalize from seen states to unseen states

  5. Value Function Approximation (VFA)
  ‣ Value function approximation (VFA) replaces the table with a general parameterized form:
    v̂(s, w) ≈ v_π(s),    q̂(s, a, w) ≈ q_π(s, a)

  6. Value Function Approximation (VFA)
  ‣ Value function approximation (VFA) replaces the table with a general parameterized form:
    v̂(s, w) ≈ v_π(s),    q̂(s, a, w) ≈ q_π(s, a),    π(A_t | S_t, θ)

  7. Value Function Approximation (VFA)
  ‣ Value function approximation (VFA) replaces the table with a general parameterized form, where |θ| << |𝒮|.
  ‣ When we update the parameters θ, the values of many states change simultaneously!

  8. Which Function Approximation?
  ‣ There are many function approximators, e.g.:
    - Linear combinations of features
    - Neural networks
    - Decision trees
    - Nearest neighbour
    - Fourier / wavelet bases
    - ...

  9. Which Function Approximation?
  ‣ There are many function approximators, e.g.:
    - Linear combinations of features
    - Neural networks
    - Decision trees
    - Nearest neighbour
    - Fourier / wavelet bases
    - ...
  ‣ We will focus on differentiable function approximators.

  10. Gradient Descent
  ‣ Let J(w) be a differentiable function of the parameter vector w.
  ‣ Define the gradient of J(w) to be:
    ∇_w J(w) = ( ∂J(w)/∂w_1, ..., ∂J(w)/∂w_n )^T

  11. Gradient Descent
  ‣ Let J(w) be a differentiable function of the parameter vector w.
  ‣ Define the gradient of J(w) to be:
    ∇_w J(w) = ( ∂J(w)/∂w_1, ..., ∂J(w)/∂w_n )^T
  ‣ To find a local minimum of J(w), adjust w in the direction of the negative gradient:
    Δw = −(1/2) α ∇_w J(w),    where α is the step-size

  12. Gradient Descent
  ‣ Let J(w) be a differentiable function of the parameter vector w.
  ‣ Define the gradient of J(w) to be:
    ∇_w J(w) = ( ∂J(w)/∂w_1, ..., ∂J(w)/∂w_n )^T
  ‣ Starting from a guess w_0, we consider the sequence w_0, w_1, w_2, ... such that:
    w_{n+1} = w_n − (1/2) α ∇_w J(w_n)
  ‣ We then have J(w_0) ≥ J(w_1) ≥ J(w_2) ≥ ...
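To make the update rule concrete, here is a minimal NumPy sketch of gradient descent on a simple quadratic objective. The objective `J`, its gradient, the minimizer `w_star`, and the step size are illustrative assumptions, not part of the slides.

```python
import numpy as np

# Illustrative quadratic objective J(w) = ||w - w*||^2 and its gradient.
w_star = np.array([1.0, -2.0])          # hypothetical minimizer
J      = lambda w: np.sum((w - w_star) ** 2)
grad_J = lambda w: 2.0 * (w - w_star)

alpha = 0.1                              # step size
w = np.zeros(2)                          # initial guess w_0
for n in range(100):
    w = w - 0.5 * alpha * grad_J(w)      # w_{n+1} = w_n - (1/2) * alpha * grad J(w_n)

print(J(w))  # J decreases monotonically toward the (local) minimum
```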

  13. Our objective
  ‣ Goal: find the parameter vector w minimizing the mean-squared error between the true value function v_π(S) and its approximation v̂(S, w).

  14. Our objective
  ‣ Goal: find the parameter vector w minimizing the mean-squared error between the true value function v_π(S) and its approximation v̂(S, w).

  15. Our objective
  ‣ Goal: find the parameter vector w minimizing the mean-squared error between the true value function v_π(S) and its approximation v̂(S, w).
  ‣ Let μ(s) denote how much time we spend in each state s under policy π. Then:
    J(w) = Σ_{s ∈ 𝒮} μ(s) [ v_π(s) − v̂(s, w) ]²,    with Σ_{s ∈ 𝒮} μ(s) = 1
  ‣ This weighting is a very important choice: it is OK if we cannot learn the values of states we visit very rarely. There are too many states, so we should focus on the ones that matter: the RL way of approximating the Bellman equations!

  16. Our objective
  ‣ Goal: find the parameter vector w minimizing the mean-squared error between the true value function v_π(S) and its approximation v̂(S, w).
  ‣ Let μ(s) denote how much time we spend in each state s under policy π. Then:
    J(w) = Σ_{s ∈ 𝒮} μ(s) [ v_π(s) − v̂(s, w) ]²,    with Σ_{s ∈ 𝒮} μ(s) = 1
  ‣ In contrast to the unweighted objective:
    J_2(w) = (1 / |𝒮|) Σ_{s ∈ 𝒮} [ v_π(s) − v̂(s, w) ]²

  17. On-policy state distribution
  ‣ Let h(s) be the initial state distribution, i.e., the probability that an episode starts at state s. Then:
    η(s) = h(s) + Σ_{s̄} η(s̄) Σ_a π(a | s̄) p(s | s̄, a),    ∀ s ∈ 𝒮
    μ(s) = η(s) / Σ_{s'} η(s'),    ∀ s ∈ 𝒮
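As a sanity check of these two equations, here is a small NumPy sketch that solves for η and μ in a toy episodic MDP. The transition matrix, policy, and start distribution are made-up numbers for illustration only.

```python
import numpy as np

# Toy episodic MDP with 3 non-terminal states. Transitions into the terminal
# state show up as "missing" probability mass, so rows of P_pi may sum to < 1.
# P_pi[s_bar, s] = sum_a pi(a | s_bar) * p(s | s_bar, a)   (hypothetical numbers)
P_pi = np.array([[0.1, 0.6, 0.2],
                 [0.0, 0.3, 0.5],
                 [0.2, 0.0, 0.3]])
h = np.array([0.5, 0.3, 0.2])             # initial state distribution h(s)

# eta(s) = h(s) + sum_{s_bar} eta(s_bar) P_pi[s_bar, s]  =>  eta = (I - P_pi^T)^-1 h
eta = np.linalg.solve(np.eye(3) - P_pi.T, h)
mu = eta / eta.sum()                      # normalize: on-policy distribution mu(s)
print(mu)
```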

  18. Gradient Descent
  ‣ Goal: find the parameter vector w minimizing the mean-squared error between the true value function v_π(S) and its approximation v̂(S, w):
    J(w) = E_π[ ( v_π(S) − v̂(S, w) )² ]

  19. Gradient Descent
  ‣ Goal: find the parameter vector w minimizing the mean-squared error between the true value function v_π(S) and its approximation v̂(S, w):
    J(w) = E_π[ ( v_π(S) − v̂(S, w) )² ]
  ‣ Starting from a guess w_0.

  20. Gradient Descent
  ‣ Goal: find the parameter vector w minimizing the mean-squared error between the true value function v_π(S) and its approximation v̂(S, w):
    J(w) = E_π[ ( v_π(S) − v̂(S, w) )² ]
  ‣ Starting from a guess w_0, we consider the sequence w_0, w_1, w_2, ... such that:
    w_{n+1} = w_n − (1/2) α ∇_w J(w_n)
  ‣ We then have J(w_0) ≥ J(w_1) ≥ J(w_2) ≥ ...

  21. Gradient Descent
  ‣ Goal: find the parameter vector w minimizing the mean-squared error between the true value function v_π(S) and its approximation v̂(S, w):
    J(w) = E_π[ ( v_π(S) − v̂(S, w) )² ]
  ‣ Gradient descent finds a local minimum:
    Δw = −(1/2) α ∇_w J(w) = α E_π[ ( v_π(S) − v̂(S, w) ) ∇_w v̂(S, w) ]

  22. Stochastic Gradient Descent
  ‣ Goal: find the parameter vector w minimizing the mean-squared error between the true value function v_π(S) and its approximation v̂(S, w):
    J(w) = E_π[ ( v_π(S) − v̂(S, w) )² ]
  ‣ Gradient descent finds a local minimum:
    Δw = −(1/2) α ∇_w J(w) = α E_π[ ( v_π(S) − v̂(S, w) ) ∇_w v̂(S, w) ]
  ‣ Stochastic gradient descent (SGD) samples the gradient:
    Δw = α ( v_π(S) − v̂(S, w) ) ∇_w v̂(S, w)
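A minimal sketch of the sampled update, written as a generic helper. The function name `sgd_vfa_update` and the callables `v_hat` / `grad_v_hat` are assumptions introduced here for illustration; the target stands in for v_π(S), which a supervisor (or, later, a return or TD target) would supply.

```python
import numpy as np

def sgd_vfa_update(w, s, v_target, v_hat, grad_v_hat, alpha=0.01):
    """One stochastic gradient step toward a sampled target for state s.

    v_hat(s, w) and grad_v_hat(s, w) are user-supplied; v_target plays the
    role of v_pi(s) in the update  Δw = α (v_pi(S) − v̂(S,w)) ∇_w v̂(S,w).
    """
    error = v_target - v_hat(s, w)                 # prediction error
    return w + alpha * error * grad_v_hat(s, w)    # step-size × error × gradient
```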

  23. Least Squares Prediction
  ‣ Given a value function approximation v̂(s, w) ≈ v_π(s)
  ‣ And experience D consisting of ⟨state, value⟩ pairs:
    D = { ⟨S_1, v_1^π⟩, ⟨S_2, v_2^π⟩, ..., ⟨S_T, v_T^π⟩ }
  ‣ Which parameters w give the best fitting value function v̂(s, w)?
  ‣ Least squares algorithms find the parameter vector w minimizing the sum-squared error between v̂(S_t, w) and the target values v_t^π:
    LS(w) = Σ_{t=1}^{T} ( v_t^π − v̂(S_t, w) )²

  24. SGD with Experience Replay
  ‣ Given experience D consisting of ⟨state, value⟩ pairs
  ‣ Repeat:
    - Sample a ⟨state, value⟩ pair ⟨S, v^π⟩ from the experience
    - Apply a stochastic gradient descent update: Δw = α ( v^π − v̂(S, w) ) ∇_w v̂(S, w)
  ‣ Converges to the least squares solution.
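A runnable sketch of this loop. The tiny hard-coded replay buffer and the linear approximator (v̂(s, w) = s·w, so ∇_w v̂ = s) are assumptions chosen so the least squares solution is easy to verify.

```python
import numpy as np

# Hypothetical replay buffer of <state features, target value> pairs.
replay = [(np.array([1.0, 0.0]),  2.0),
          (np.array([0.0, 1.0]), -1.0),
          (np.array([1.0, 1.0]),  1.0)]

# Linear approximator just for illustration: v̂(s, w) = s·w, ∇_w v̂ = s.
v_hat      = lambda s, w: s @ w
grad_v_hat = lambda s, w: s

rng = np.random.default_rng(0)
w = np.zeros(2)
for _ in range(5000):
    s, v_target = replay[rng.integers(len(replay))]            # sample from experience
    w += 0.01 * (v_target - v_hat(s, w)) * grad_v_hat(s, w)    # SGD update

print(w)  # approaches the least squares solution ([2, -1] for this data)
```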

  25. Feature Vectors
  ‣ Represent state by a feature vector:
    x(S) = ( x_1(S), ..., x_n(S) )^T
  ‣ For example:
    - Distance of robot from landmarks
    - Trends in the stock market
    - Piece and pawn configurations in chess

  26. Linear Value Function Approximation (VFA)
  ‣ Represent the value function by a linear combination of features:
    v̂(S, w) = x(S)^T w = Σ_{j=1}^{n} x_j(S) w_j
  ‣ The objective function is quadratic in the parameters w.
  ‣ The update rule is particularly simple:
    Δw = α ( v_π(S) − v̂(S, w) ) x(S)
  ‣ Update = step-size × prediction error × feature value
  ‣ Later, we will look at neural networks as function approximators.
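A minimal sketch of linear VFA with exactly this update. The class name `LinearVFA` and the one-hot feature map for a small chain of states are illustrative assumptions; with linear v̂, the gradient ∇_w v̂(S, w) is just the feature vector x(S), which is why the update is so simple.

```python
import numpy as np

class LinearVFA:
    """Linear value function v̂(s, w) = x(s)·w with the simple SGD update."""

    def __init__(self, n_features, alpha=0.1):
        self.w = np.zeros(n_features)
        self.alpha = alpha

    def value(self, x):
        return x @ self.w

    def update(self, x, target):
        # Update = step-size × prediction error × feature value
        self.w += self.alpha * (target - self.value(x)) * x

# Hypothetical one-hot features for a 5-state chain, just for illustration.
def one_hot(s, n_states=5):
    x = np.zeros(n_states)
    x[s] = 1.0
    return x

vfa = LinearVFA(n_features=5)
vfa.update(one_hot(2), target=1.5)   # one supervised-style update for state 2
print(vfa.value(one_hot(2)))
```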

  27. Incremental Prediction Algorithms
  ‣ We have assumed the true value function v_π(s) is given by a supervisor.
  ‣ But in RL there is no supervisor, only rewards.
  ‣ In practice, we substitute a target for v_π(s):
    - For MC, the target is the return G_t
    - For TD(0), the target is the TD target: R_{t+1} + γ v̂(S_{t+1}, w)
  ‣ Remember: the return is G_t = R_{t+1} + γ G_{t+1}, so the TD target simply replaces γ G_{t+1} with the bootstrapped estimate γ v̂(S_{t+1}, w).

  28. Monte Carlo with VFA
  ‣ The return G_t is an unbiased, noisy sample of the true value v_π(S_t).
  ‣ Can therefore apply supervised learning to "training data" of the form ⟨S_t, G_t⟩.
  ‣ For example, using linear Monte Carlo policy evaluation:
    Δw = α ( G_t − v̂(S_t, w) ) x(S_t)
  ‣ Monte Carlo evaluation converges to a local optimum.

  29. Monte Carlo with VFA
  Gradient Monte Carlo Algorithm for Approximating v̂ ≈ v_π
    Input: the policy π to be evaluated
    Input: a differentiable function v̂ : S × R^n → R
    Initialize value-function weights θ as appropriate (e.g., θ = 0)
    Repeat forever:
      Generate an episode S_0, A_0, R_1, S_1, A_1, ..., R_T, S_T using π
      For t = 0, 1, ..., T − 1:
        θ ← θ + α [ G_t − v̂(S_t, θ) ] ∇ v̂(S_t, θ)
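A runnable sketch of the gradient Monte Carlo box above, under two assumptions not in the slides: v̂ is linear with one-hot features (so v̂(s, θ) = θ[s] and the gradient is the one-hot vector), and a hypothetical `generate_episode()` helper runs the policy π and returns the visited (state, reward) pairs.

```python
import numpy as np

def gradient_mc(generate_episode, n_states, alpha=0.05, gamma=1.0, n_episodes=1000):
    """Gradient Monte Carlo prediction with linear v̂ and one-hot features.

    `generate_episode` is an assumed helper that follows π and returns
    [(S_0, R_1), (S_1, R_2), ..., (S_{T-1}, R_T)].
    """
    theta = np.zeros(n_states)
    for _ in range(n_episodes):
        episode = generate_episode()
        G = 0.0
        # Work backwards so the return G_t can be accumulated incrementally.
        for s, r in reversed(episode):
            G = r + gamma * G
            theta[s] += alpha * (G - theta[s])   # θ ← θ + α [G_t − v̂(S_t,θ)] ∇v̂(S_t,θ)
    return theta
```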

  30. TD Learning with VFA
  ‣ The TD target R_{t+1} + γ v̂(S_{t+1}, w) is a biased sample of the true value v_π(S_t).
  ‣ Can still apply supervised learning to "training data" of the form ⟨S_t, R_{t+1} + γ v̂(S_{t+1}, w)⟩.
  ‣ For example, using linear TD(0):
    Δw = α ( R_{t+1} + γ v̂(S_{t+1}, w) − v̂(S_t, w) ) x(S_t)
  ‣ We ignore the dependence of the target on w! That is why these are called semi-gradient methods.

  31. TD Learning with VFA
  Semi-gradient TD(0) for estimating v̂ ≈ v_π
    Input: the policy π to be evaluated
    Input: a differentiable function v̂ : S⁺ × R^n → R such that v̂(terminal, ·) = 0
    Initialize value-function weights θ arbitrarily (e.g., θ = 0)
    Repeat (for each episode):
      Initialize S
      Repeat (for each step of episode):
        Choose A ~ π(·|S)
        Take action A, observe R, S'
        θ ← θ + α [ R + γ v̂(S', θ) − v̂(S, θ) ] ∇ v̂(S, θ)
        S ← S'
      until S' is terminal
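A runnable sketch of semi-gradient TD(0) with a linear v̂. The Gym-style `env` (reset()/step(a)), the `policy(s)` sampler, and the `features(s)` extractor are assumptions introduced here, not part of the slides.

```python
import numpy as np

def semi_gradient_td0(env, policy, features, n_features,
                      alpha=0.05, gamma=0.99, n_episodes=500):
    """Semi-gradient TD(0) prediction with linear v̂(s, θ) = features(s)·θ.

    Assumes env.reset() -> state and env.step(a) -> (state, reward, done, info),
    policy(s) samples an action, and features(s) returns an n_features vector.
    """
    theta = np.zeros(n_features)
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done, _ = env.step(a)
            x, x_next = features(s), features(s_next)
            v_next = 0.0 if done else x_next @ theta   # enforce v̂(terminal, ·) = 0
            td_error = r + gamma * v_next - x @ theta
            theta += alpha * td_error * x              # semi-gradient: ∇ only through v̂(S, θ)
            s = s_next
    return theta
```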

  32. Control with VFA
  ‣ Policy evaluation: approximate policy evaluation, q̂(·, ·, w) ≈ q_π
  ‣ Policy improvement: ε-greedy policy improvement

  33. Action-Value Function Approximation
  ‣ Approximate the action-value function:
    q̂(S, A, w) ≈ q_π(S, A)
  ‣ Minimize the mean-squared error between the true action-value function q_π(S, A) and the approximate action-value function:
    J(w) = E_π[ ( q_π(S, A) − q̂(S, A, w) )² ]
  ‣ Use stochastic gradient descent to find a local minimum:
    Δw = α ( q_π(S, A) − q̂(S, A, w) ) ∇_w q̂(S, A, w)

  34. Linear Action-Value Function Approximation
  ‣ Represent state and action by a feature vector:
    x(S, A) = ( x_1(S, A), ..., x_n(S, A) )^T
  ‣ Represent the action-value function by a linear combination of features:
    q̂(S, A, w) = x(S, A)^T w = Σ_{j=1}^{n} x_j(S, A) w_j
  ‣ Stochastic gradient descent update:
    Δw = α ( q_π(S, A) − q̂(S, A, w) ) x(S, A)
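Putting the last three slides together, here is a minimal sketch of on-policy control with a linear q̂ and ε-greedy action selection, in the style of episodic semi-gradient SARSA. The Gym-style `env`, the `features(s, a)` extractor, and all hyperparameter values are assumptions for illustration; the true q_π target of the slide is replaced by the sampled SARSA target.

```python
import numpy as np

def semi_gradient_sarsa(env, features, n_features, n_actions,
                        alpha=0.05, gamma=0.99, epsilon=0.1, n_episodes=500):
    """Episodic semi-gradient SARSA with linear q̂(s, a, w) = features(s, a)·w.

    Assumes env.reset() -> state, env.step(a) -> (state, reward, done, info),
    and features(s, a) returning an n_features vector.
    """
    rng = np.random.default_rng(0)
    w = np.zeros(n_features)
    q = lambda s, a: features(s, a) @ w

    def eps_greedy(s):
        # ε-greedy policy improvement over the current q̂.
        if rng.random() < epsilon:
            return int(rng.integers(n_actions))
        return int(np.argmax([q(s, a) for a in range(n_actions)]))

    for _ in range(n_episodes):
        s = env.reset()
        a = eps_greedy(s)
        done = False
        while not done:
            s_next, r, done, _ = env.step(a)
            if done:
                target = r                              # q̂(terminal, ·, w) = 0
            else:
                a_next = eps_greedy(s_next)
                target = r + gamma * q(s_next, a_next)  # SARSA target
            w += alpha * (target - q(s, a)) * features(s, a)
            if not done:
                s, a = s_next, a_next
    return w
```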
