NPFL122, Lecture 5
Function Approximation, Deep Q Network
Milan Straka
November 12, 2018
Charles University in Prague, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics
n-step Methods
Figure 7.1 of "Reinforcement Learning: An Introduction, Second Edition": backup diagrams of 1-step TD (i.e., TD(0)), 2-step TD, 3-step TD, up to n-step TD and ∞-step TD (i.e., Monte Carlo).
Full return is
$$G_t = \sum_{k=t}^\infty \gamma^{k-t} R_{k+1},$$
while the one-step return is
$$G_{t:t+1} \doteq R_{t+1} + \gamma V(S_{t+1}).$$
We can generalize both into $n$-step returns:
$$G_{t:t+n} \doteq \left(\sum_{k=t}^{t+n-1} \gamma^{k-t} R_{k+1}\right) + \gamma^n V(S_{t+n}),$$
with $G_{t:t+n} \doteq G_t$ if $t+n \geq T$.
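As a concrete illustration, here is a minimal Python sketch of computing the $n$-step return from a recorded trajectory; the function name and the tabular value array `V` are illustrative assumptions, not part of the lecture.

```python
def n_step_return(rewards, states, t, n, T, V, gamma):
    """Compute G_{t:t+n} from recorded rewards/states and value estimates V.

    rewards[k] stores R_{k+1}, states[k] stores S_k, and T is the episode length.
    If t + n >= T, this reduces to the full (Monte Carlo) return G_t.
    """
    horizon = min(t + n, T)
    G = sum(gamma ** (k - t) * rewards[k] for k in range(t, horizon))
    if t + n < T:                       # bootstrap with the value estimate of S_{t+n}
        G += gamma ** n * V[states[t + n]]
    return G
```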
Defining the $n$-step return to utilize the action-value function as
$$G_{t:t+n} \doteq \left(\sum_{k=t}^{t+n-1} \gamma^{k-t} R_{k+1}\right) + \gamma^n Q(S_{t+n}, A_{t+n}),$$
with $G_{t:t+n} \doteq G_t$ if $t+n \geq T$, we get the following straightforward update rule:
$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\left[G_{t:t+n} - Q(S_t, A_t)\right].$$

Figure 7.4 of "Reinforcement Learning: An Introduction, Second Edition": a gridworld path taken by the agent, together with the action values increased by one-step Sarsa and by 10-step Sarsa.
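To make the update concrete, a minimal tabular sketch in the same notation might look as follows; `Q` (indexable by state–action pairs) is a hypothetical structure used only for illustration.

```python
def n_step_sarsa_update(Q, states, actions, rewards, t, n, T, alpha, gamma):
    """Apply the n-step Sarsa update to Q(S_t, A_t).

    G_{t:t+n} bootstraps with Q(S_{t+n}, A_{t+n}) unless t + n >= T.
    """
    horizon = min(t + n, T)
    G = sum(gamma ** (k - t) * rewards[k] for k in range(t, horizon))
    if t + n < T:
        G += gamma ** n * Q[states[t + n], actions[t + n]]
    Q[states[t], actions[t]] += alpha * (G - Q[states[t], actions[t]])
```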
Recall the relative probability of a trajectory under the target and behaviour policies, which we now generalize as
$$\rho_{t:t+n} \doteq \prod_{k=t}^{\min(t+n,\,T-1)} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}.$$
Then a simple off-policy $n$-step TD can be computed as
$$V(S_t) \leftarrow V(S_t) + \alpha\rho_{t:t+n-1}\left[G_{t:t+n} - V(S_t)\right].$$
Similarly, $n$-step Sarsa becomes
$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\rho_{t+1:t+n}\left[G_{t:t+n} - Q(S_t, A_t)\right].$$
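A small sketch of the importance-sampling ratio and of the off-policy $n$-step TD update is given below; `pi` and `b` are assumed to be callables returning action probabilities, and `n_step_return` is the sketch from the previous slide — all names are illustrative, not from the lecture.

```python
def importance_ratio(pi, b, states, actions, t, end, T):
    """rho_{t:end} = prod_{k=t}^{min(end, T-1)} pi(A_k|S_k) / b(A_k|S_k)."""
    rho = 1.0
    for k in range(t, min(end, T - 1) + 1):
        rho *= pi(actions[k], states[k]) / b(actions[k], states[k])
    return rho

def off_policy_n_step_td(V, states, actions, rewards, t, n, T, alpha, gamma, pi, b):
    rho = importance_ratio(pi, b, states, actions, t, t + n - 1, T)   # rho_{t:t+n-1}
    G = n_step_return(rewards, states, t, n, T, V, gamma)             # earlier sketch
    V[states[t]] += alpha * rho * (G - V[states[t]])
```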
Example in Section 7.5 of "Reinforcement Learning: An Introduction, Second Edition": the backup diagram of the 3-step tree-backup update, with states $S_t, S_{t+1}, S_{t+2}, S_{t+3}$, actions $A_t, A_{t+1}, A_{t+2}$ and rewards $R_{t+1}, R_{t+2}, R_{t+3}$.
We now derive the $n$-step reward, starting from one-step:
$$G_{t:t+1} \doteq R_{t+1} + \gamma\sum_a \pi(a|S_{t+1})\, Q(S_{t+1}, a).$$
For two-step, we get:
$$G_{t:t+2} \doteq R_{t+1} + \gamma\sum_{a \neq A_{t+1}} \pi(a|S_{t+1})\, Q(S_{t+1}, a) + \gamma\pi(A_{t+1}|S_{t+1})\, G_{t+1:t+2}.$$
Therefore, we can generalize to:
$$G_{t:t+n} \doteq R_{t+1} + \gamma\sum_{a \neq A_{t+1}} \pi(a|S_{t+1})\, Q(S_{t+1}, a) + \gamma\pi(A_{t+1}|S_{t+1})\, G_{t+1:t+n}.$$
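A recursive sketch of this tree-backup return follows; `pi(a, s)` is assumed to return the target-policy probability and `Q` to be indexable by (state, action) — hypothetical names for illustration only.

```python
def tree_backup_return(Q, pi, action_space, states, actions, rewards, t, n, T, gamma):
    """Compute the n-step tree-backup return G_{t:t+n} recursively."""
    if t + 1 >= T:
        return rewards[t]                          # S_{t+1} is terminal, so G = R_{t+1}
    s_next = states[t + 1]
    if n == 1:                                     # G_{t:t+1}: expectation over all actions
        expected = sum(pi(a, s_next) * Q[s_next, a] for a in action_space)
        return rewards[t] + gamma * expected
    a_next = actions[t + 1]
    off_branch = sum(pi(a, s_next) * Q[s_next, a] for a in action_space if a != a_next)
    on_branch = pi(a_next, s_next) * tree_backup_return(
        Q, pi, action_space, states, actions, rewards, t + 1, n - 1, T, gamma)
    return rewards[t] + gamma * off_branch + gamma * on_branch
```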
We will approximate the state-value function $v$ and/or the action-value function $q$, choosing from a family of functions parametrized by a weight vector $w \in \mathbb{R}^d$. We denote the approximations as
$$\hat v(s, w), \quad \hat q(s, a, w).$$
We utilize the Mean Squared Value Error objective, denoted $\overline{VE}$:
$$\overline{VE}(w) \doteq \sum_{s\in\mathcal{S}} \mu(s)\left[v_\pi(s) - \hat v(s, w)\right]^2,$$
where the state distribution $\mu(s)$ is usually the on-policy distribution.
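For instance, with a tabular $\mu$ and a parametric approximator, $\overline{VE}$ could be evaluated as in this minimal numpy sketch (the arrays and the callable `v_hat` are hypothetical):

```python
import numpy as np

def mean_squared_value_error(mu, v_pi, v_hat, w):
    """VE(w) = sum_s mu(s) * (v_pi(s) - v_hat(s, w))**2 over all states 0..len(mu)-1."""
    approx = np.array([v_hat(s, w) for s in range(len(mu))])
    return np.sum(mu * (v_pi - approx) ** 2)
```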
The functional approximation (i.e., the weight vector $w$) is usually optimized using gradient methods, for example as
$$\begin{aligned}
w_{t+1} &\leftarrow w_t - \tfrac{1}{2}\alpha\nabla\left[v_\pi(S_t) - \hat v(S_t, w_t)\right]^2 \\
        &\leftarrow w_t + \alpha\left[v_\pi(S_t) - \hat v(S_t, w_t)\right]\nabla\hat v(S_t, w_t).
\end{aligned}$$
As usual, the $v_\pi(S_t)$ is estimated by a suitable sample. For example, in Monte Carlo methods we use the episodic return $G_t$, and in temporal difference methods, we employ bootstrapping and use $R_{t+1} + \gamma\hat v(S_{t+1}, w)$.
Gradient Monte Carlo Algorithm for Estimating v̂ ≈ v_π
  Input: the policy π to be evaluated
  Input: a differentiable function v̂ : S × R^d → R
  Algorithm parameter: step size α > 0
  Initialize value-function weights w ∈ R^d arbitrarily (e.g., w = 0)
  Loop forever (for each episode):
    Generate an episode S_0, A_0, R_1, S_1, A_1, ..., R_T, S_T using π
    Loop for each step of episode, t = 0, 1, ..., T−1:
      w ← w + α [G_t − v̂(S_t, w)] ∇v̂(S_t, w)
Algorithm 9.3 of "Reinforcement Learning: An Introduction, Second Edition".
A simple special case of function approximation are linear methods, where
$$\hat v(x(s), w) \doteq x(s)^T w = \sum_i x(s)_i\, w_i.$$
The $x(s)$ is a representation of the state $s$, which is a vector of the same size as $w$. It is sometimes called a feature vector.
The SGD update rule then becomes
$$w_{t+1} \leftarrow w_t + \alpha\left[v_\pi(S_t) - \hat v(x(S_t), w_t)\right] x(S_t).$$
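A minimal gradient Monte Carlo sketch with a linear approximator, combining the previous two slides; the feature function `x(s)` and the episode format are assumptions made only for this illustration.

```python
import numpy as np

def gradient_mc_linear(episodes, x, d, alpha, gamma):
    """Gradient Monte Carlo policy evaluation with v_hat(s, w) = x(s) @ w.

    episodes is a list of episodes, each a list of (S_t, R_{t+1}) pairs generated by pi.
    """
    w = np.zeros(d)
    for episode in episodes:
        G = 0.0
        # Processing the episode backwards accumulates the return G_t incrementally;
        # the book's algorithm instead goes forward with precomputed returns.
        for S_t, R_next in reversed(episode):
            G = R_next + gamma * G
            w += alpha * (G - x(S_t) @ w) * x(S_t)
    return w
```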
Many feature-construction methods were developed in the past:
- state aggregation,
- polynomials,
- Fourier basis,
- tile coding,
- radial basis functions.
But of course, nowadays we use deep neural networks, which construct a suitable feature vector automatically as a latent variable (the last hidden layer).
A simple way of generating a feature vector is state aggregation, where several neighboring states are grouped together. For example, consider a 1000-state random walk, where transitions lead uniformly randomly to any of the 100 neighboring states on the left or on the right. Using state aggregation, we can partition the 1000 states into 10 groups of 100 states. Monte Carlo policy evaluation then computes the following:
Figure 9.1 of "Reinforcement Learning: An Introduction, Second Edition": on the 1000-state random walk, the true value $v_\pi$ and the approximate MC value $\hat v$ under state aggregation (constant within each group), together with the state distribution $\mu$ (ranging roughly from 0.0017 to 0.0137).
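State aggregation corresponds to a one-hot feature vector over groups; a sketch for the 1000-state random walk with 10 groups of 100 states (the helper name and 0-based state indexing are assumptions for illustration):

```python
import numpy as np

def aggregate_features(state, n_states=1000, n_groups=10):
    """One-hot feature vector: x(s) has a single 1 at the index of s's group."""
    x = np.zeros(n_groups)
    x[state * n_groups // n_states] = 1.0   # state is assumed to be 0-based
    return x
```

With a linear approximator, this makes $\hat v$ piecewise constant over the groups, producing exactly the staircase shape seen in the figure.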
Figure 9.9 of "Reinforcement Learning: An Introduction, Second Edition": a point in a 2-D state space covered by four overlapping tilings (Tiling 1–4); one tile from each tiling contains the point, so four active tiles/features are used to represent it.
If $t$ overlapping tilings are used, the learning rate is usually normalized as $\alpha/t$.
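A simple 1-D tile-coding sketch illustrating overlapping, mutually offset tilings; real implementations (e.g., Sutton's tiles3) hash the tile indices, which this illustration omits, and all names here are hypothetical.

```python
def tile_indices(value, n_tilings=8, tiles_per_tiling=10, low=0.0, high=1.0):
    """Return one active tile index per tiling for a scalar state in [low, high]."""
    indices = []
    tile_width = (high - low) / (tiles_per_tiling - 1)
    for tiling in range(n_tilings):
        offset = tiling / n_tilings * tile_width      # each tiling is shifted slightly
        tile = int((value - low + offset) / tile_width)
        indices.append(tiling * tiles_per_tiling + tile)
    return indices   # the binary feature vector has ones exactly at these indices
```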
For example, on the 1000-state random walk, the performance of tile coding surpasses state aggregation:
Figure 9.10 of "Reinforcement Learning: An Introduction, Second Edition": $\sqrt{\overline{VE}}$ averaged over runs during the first 5000 episodes of the 1000-state random walk, comparing state aggregation (one tiling) with tile coding (50 tilings).
Figure 9.10 of "Reinforcement Learning: An Introduction, Second Edition".
In higher dimensions, the tiles should have asymmetrical offsets, with a sequence of $(1, 3, 5, \ldots, 2d-1)$ being a good choice.

Figure 9.11 of "Reinforcement Learning: An Introduction, Second Edition": possible generalizations of a point for uniformly offset tilings versus asymmetrically offset tilings.
In TD methods, we again use bootstrapping to estimate $v_\pi(S_t)$ as $R_{t+1} + \gamma\hat v(S_{t+1}, w)$.

Semi-gradient TD(0) for estimating v̂ ≈ v_π
  Input: the policy π to be evaluated
  Input: a differentiable function v̂ : S⁺ × R^d → R such that v̂(terminal, ·) = 0
  Algorithm parameter: step size α > 0
  Initialize value-function weights w ∈ R^d arbitrarily (e.g., w = 0)
  Loop for each episode:
    Initialize S
    Loop for each step of episode:
      Choose A ∼ π(·|S)
      Take action A, observe R, S′
      w ← w + α [R + γ v̂(S′, w) − v̂(S, w)] ∇v̂(S, w)
      S ← S′
    until S is terminal

Algorithm 9.3 of "Reinforcement Learning: An Introduction, Second Edition".

Note that such an algorithm is called semi-gradient, because it does not backpropagate through $\hat v(S', w)$.
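A corresponding semi-gradient TD(0) sketch with a linear approximator; the environment interface (`reset()` and `step(a)` returning `(next_state, reward, done)`) is a hypothetical assumption, not part of the lecture.

```python
import numpy as np

def semi_gradient_td0(env, policy, x, d, alpha, gamma, episodes):
    """Semi-gradient TD(0) with v_hat(s, w) = x(s) @ w; the gradient flows only
    through v_hat(S, w), not through the bootstrap target v_hat(S', w)."""
    w = np.zeros(d)
    for _ in range(episodes):
        S, done = env.reset(), False
        while not done:
            A = policy(S)
            S_next, R, done = env.step(A)
            target = R if done else R + gamma * (x(S_next) @ w)
            w += alpha * (target - x(S) @ w) * x(S)
            S = S_next
    return w
```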
An important fact is that linear semi-gradient TD methods do not converge to the minimum of $\overline{VE}$. Instead, they converge to a different TD fixed point $w_{TD}$. It can be proven that
$$\overline{VE}(w_{TD}) \leq \frac{1}{1-\gamma}\min_w \overline{VE}(w).$$
However, when $\gamma$ is close to one, the multiplication factor in the above bound is quite large.
As before, we can utilize $n$-step TD methods.

n-step semi-gradient TD for estimating v̂ ≈ v_π
  Input: the policy π to be evaluated
  Input: a differentiable function v̂ : S⁺ × R^d → R such that v̂(terminal, ·) = 0
  Algorithm parameters: step size α > 0, a positive integer n
  Initialize value-function weights w arbitrarily (e.g., w = 0)
  All store and access operations (S_t and R_t) can take their index mod n+1
  Loop for each episode:
    Initialize and store S_0 ≠ terminal
    T ← ∞
    Loop for t = 0, 1, 2, ...:
      If t < T, then:
        Take an action according to π(·|S_t)
        Observe and store the next reward as R_{t+1} and the next state as S_{t+1}
        If S_{t+1} is terminal, then T ← t + 1
      τ ← t − n + 1 (τ is the time whose state's estimate is being updated)
      If τ ≥ 0:
        G ← Σ_{i=τ+1}^{min(τ+n,T)} γ^{i−τ−1} R_i
        If τ + n < T, then: G ← G + γ^n v̂(S_{τ+n}, w)   (G_{τ:τ+n})
        w ← w + α [G − v̂(S_τ, w)] ∇v̂(S_τ, w)
    Until τ = T − 1

Algorithm 9.5 of "Reinforcement Learning: An Introduction, Second Edition".
Figure 9.2 of "Reinforcement Learning: An Introduction, Second Edition": n-step semi-gradient TD on the 1000-state random walk — one panel compares the true value $v_\pi$ with the approximate TD value $\hat v$ over the states, and the other shows the average RMS error over 1000 states and the first 10 episodes as a function of $\alpha$, for $n = 1, 2, 4, \ldots, 512$.
Until now, we have talked only about policy evaluation. Naturally, we can extend it to a full Sarsa algorithm:
Episodic Semi-gradient Sarsa for Estimating q̂ ≈ q_*
  Input: a differentiable action-value function parameterization q̂ : S × A × R^d → R
  Algorithm parameters: step size α > 0, small ε > 0
  Initialize value-function weights w ∈ R^d arbitrarily (e.g., w = 0)
  Loop for each episode:
    S, A ← initial state and action of episode (e.g., ε-greedy)
    Loop for each step of episode:
      Take action A, observe R, S′
      If S′ is terminal:
        w ← w + α [R − q̂(S, A, w)] ∇q̂(S, A, w)
        Go to next episode
      Choose A′ as a function of q̂(S′, ·, w) (e.g., ε-greedy)
      w ← w + α [R + γ q̂(S′, A′, w) − q̂(S, A, w)] ∇q̂(S, A, w)
      S ← S′
      A ← A′
Algorithm 10.1 of "Reinforcement Learning: An Introduction, Second Edition".
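The control variant differs from policy evaluation mainly in using $\hat q$ and an ε-greedy action choice; a linear sketch follows, where the feature function `x(s, a)` and the environment interface are hypothetical assumptions.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def semi_gradient_sarsa(env, x, n_actions, d, alpha, gamma, epsilon, episodes, seed=0):
    """Episodic semi-gradient Sarsa with q_hat(s, a, w) = x(s, a) @ w."""
    rng = np.random.default_rng(seed)
    w = np.zeros(d)
    for _ in range(episodes):
        S = env.reset()
        A = epsilon_greedy([x(S, a) @ w for a in range(n_actions)], epsilon, rng)
        done = False
        while not done:
            S_next, R, done = env.step(A)
            if done:
                w += alpha * (R - x(S, A) @ w) * x(S, A)
                break
            A_next = epsilon_greedy([x(S_next, a) @ w for a in range(n_actions)], epsilon, rng)
            w += alpha * (R + gamma * (x(S_next, A_next) @ w) - x(S, A) @ w) * x(S, A)
            S, A = S_next, A_next
    return w
```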
Additionally, we can incorporate $n$-step returns:

Episodic semi-gradient n-step Sarsa for estimating q̂ ≈ q_* or q_π
  Input: a differentiable action-value function parameterization q̂ : S × A × R^d → R
  Input: a policy π (if estimating q_π)
  Algorithm parameters: step size α > 0, small ε > 0, a positive integer n
  Initialize value-function weights w ∈ R^d arbitrarily (e.g., w = 0)
  All store and access operations (S_t, A_t, and R_t) can take their index mod n+1
  Loop for each episode:
    Initialize and store S_0 ≠ terminal
    Select and store an action A_0 ∼ π(·|S_0) or ε-greedy wrt q̂(S_0, ·, w)
    T ← ∞
    Loop for t = 0, 1, 2, ...:
      If t < T, then:
        Take action A_t
        Observe and store the next reward as R_{t+1} and the next state as S_{t+1}
        If S_{t+1} is terminal, then:
          T ← t + 1
        else:
          Select and store A_{t+1} ∼ π(·|S_{t+1}) or ε-greedy wrt q̂(S_{t+1}, ·, w)
      τ ← t − n + 1 (τ is the time whose estimate is being updated)
      If τ ≥ 0:
        G ← Σ_{i=τ+1}^{min(τ+n,T)} γ^{i−τ−1} R_i
        If τ + n < T, then G ← G + γ^n q̂(S_{τ+n}, A_{τ+n}, w)   (G_{τ:τ+n})
        w ← w + α [G − q̂(S_τ, A_τ, w)] ∇q̂(S_τ, A_τ, w)
    Until τ = T − 1

Algorithm 10.2 of "Reinforcement Learning: An Introduction, Second Edition".
Figure 10.1 of "Reinforcement Learning: An Introduction, Second Edition".
The performance shown is for the semi-gradient Sarsa(λ) algorithm (which we have not talked about yet), with tile coding of 8 overlapping tilings covering position and velocity, with offsets of $(1, 3)$.
Figure 10.3 of "Reinforcement Learning: An Introduction, Second Edition": Mountain Car steps per episode (log scale, averaged over 100 runs) as a function of the episode number, for n-step Sarsa with n = 1 and n = 8.

Figure 10.4 of "Reinforcement Learning: An Introduction, Second Edition": Mountain Car steps per episode, averaged over the first 50 episodes and 100 runs, as a function of α × number of tilings (8), for n = 1, 2, 4, 8, 16.
Consider a deterministic transition between two states whose values are computed using the same weight:

Figure from Section 11.2 of "Reinforcement Learning: An Introduction, Second Edition": a transition from a state with value $w$ to a state with value $2w$.

If initially $w = 10$, the TD error will also be 10 (or nearly 10 if $\gamma < 1$). If, for example, $\alpha = 0.1$, $w$ will be increased to 11 (by 10%). This process can continue indefinitely. However, the problem arises only in the off-policy setting, where we do not decrease the value of the second state from further observations.
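The divergence can be verified numerically with a few lines; this is an illustrative simulation under the stated assumptions (w = 10, α = 0.1, only the w → 2w transition is ever observed), not code from the lecture.

```python
# Repeatedly apply the semi-gradient TD update to the transition
# "state with value w" -> "state with value 2w" (reward 0, gamma close to 1).
w, alpha, gamma = 10.0, 0.1, 0.99
for step in range(10):
    td_error = 0.0 + gamma * 2 * w - w     # R + gamma * v(s') - v(s) = (2*gamma - 1) * w
    w += alpha * td_error * 1.0            # gradient of v(s) = w with respect to w is 1
    print(step, w)                         # w grows geometrically whenever gamma > 0.5
```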
The previous idea is realized, for instance, by the following example (Baird's counterexample).
Figure 11.1 of "Reinforcement Learning: An Introduction, Second Edition": the seven states have estimated values $2w_1+w_8, \ldots, 2w_6+w_8$ and $w_7+2w_8$; the behaviour policy takes the dashed action with probability $b(\text{dashed}|\cdot) = 6/7$ and the solid action with probability $b(\text{solid}|\cdot) = 1/7$, while the target policy always takes the solid action, $\pi(\text{solid}|\cdot) = 1$; the discount factor is $\gamma = 0.99$.
Figure 11.2 of "Reinforcement Learning: An Introduction, Second Edition": on the example above, the weight components $w_1$–$w_6$, $w_7$ and $w_8$ diverge during the first 1000 steps of semi-gradient off-policy TD and during the first 1000 sweeps of semi-gradient DP.
Volodymyr Mnih et al.: Playing Atari with Deep Reinforcement Learning (Dec 2013 on arXiv); accepted in 2015 in Nature as Human-level control through deep reinforcement learning.
An off-policy Q-learning algorithm with a convolutional neural network approximating the action-value function.
Training can be extremely brittle (and can even diverge, as shown earlier).
Figure 1 of the paper "Human-level control through deep reinforcement learning" by Volodymyr Mnih et al.: schematic of the network — convolutional layers followed by fully connected layers, applied to the stacked input frames.
Figure 3 of the paper "Human-level control through deep reinforcement learning" by Volodymyr Mnih et al.: comparison of DQN with the best linear learner across individual Atari games (from Montezuma's Revenge and Private Eye up to Breakout, Boxing and Video Pinball), measured relative to human-level performance.
Extended Data Figure 2a of the paper "Human-level control through deep reinforcement learning" by Volodymyr Mnih et al.
Extended Data Figure 2b of the paper "Human-level control through deep reinforcement learning" by Volodymyr Mnih et al.
Preprocessing: the 210 × 160 128-color images are converted to grayscale and then resized to 84 × 84.
The frame-skipping technique is used, i.e., only every 4th frame (out of 60 per second) is considered, and the selected action is repeated on the other frames.
The input to the network consists of the last 4 frames (considering only the frames kept by frame skipping), i.e., an image with 4 channels.
The network is fairly standard, consisting of
- 32 filters of size 8 × 8 with stride 4 and ReLU,
- 64 filters of size 4 × 4 with stride 2 and ReLU,
- 64 filters of size 3 × 3 with stride 1 and ReLU,
- a fully connected layer with 512 units and ReLU,
- a fully connected output layer with one output per valid action.
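A sketch of this architecture in PyTorch; the framework choice and layer naming are mine rather than the lecture's, and the final linear layer with one output per action follows the Nature paper.

```python
import torch.nn as nn

def dqn_network(n_actions):
    """The DQN convolutional tower for 4 stacked 84x84 grayscale frames."""
    return nn.Sequential(
        nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
        nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
        nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        nn.Flatten(),
        nn.Linear(64 * 7 * 7, 512), nn.ReLU(),   # 84 -> 20 -> 9 -> 7 spatial size
        nn.Linear(512, n_actions),
    )
```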
The network is trained with RMSProp to minimize the following loss:
$$\mathcal{L} \doteq \mathbb{E}_{(s,a,r,s')\sim\text{data}}\left[\left(r + \gamma \max_{a'} Q(s', a'; \bar\theta) - Q(s, a; \theta)\right)^2\right].$$
An $\varepsilon$-greedy behaviour policy is utilized.

Important improvements:
- experience replay: the generated episodes are stored in a buffer as $(s, a, r, s')$ quadruples, and for training a transition is sampled uniformly;
- separate target network $\bar\theta$: to prevent instabilities, a separate target network is used to estimate the action-value function in the target. Its weights are not trained, but copied from the trained network once in a while;
- reward clipping of $\left(r + \gamma \max_{a'} Q(s', a'; \bar\theta) - Q(s, a; \theta)\right)$ to $[-1, 1]$.
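A numpy-flavoured sketch of how a training step could assemble the clipped targets from a replay-buffer minibatch; `q_net` and `target_net` are assumed to be callables returning Q-values for a batch of states (a hypothetical interface, not the lecture's code).

```python
import numpy as np

def dqn_td_errors(batch, q_net, target_net, gamma):
    """Compute clipped TD errors for a minibatch of (s, a, r, s', done) arrays."""
    states, actions, rewards, next_states, dones = batch
    q = q_net(states)[np.arange(len(actions)), actions]       # Q(s, a; theta)
    bootstrap = target_net(next_states).max(axis=1)           # max_a' Q(s', a'; theta_bar)
    targets = rewards + gamma * (1.0 - dones) * bootstrap
    return np.clip(targets - q, -1.0, 1.0)                    # used to form the squared loss
```

The target network's parameters $\bar\theta$ would then be copied from $\theta$ according to the update frequency in the table below.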
Hyperparameter                              Value
minibatch size                              32
replay buffer size                          1M
target network update frequency             10k
discount factor                             0.99
training frames                             50M
RMSProp learning rate and momentum          0.00025, 0.95
initial ε, final ε and frame of final ε     1.0, 0.1, 1M
replay start size                           50k
no-op max                                   30