NPFL122, Lecture 5
Function Approximation, Deep Q Network
Milan Straka
November 12, 2018
Charles University in Prague, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics
n-step Methods
Figure 7.1 of "Reinforcement Learning: An Introduction, Second Edition": backup diagrams of 1-step TD (i.e., TD(0)), 2-step TD, 3-step TD, up to n-step TD and ∞-step TD (i.e., Monte Carlo).
Full return is
$$G_t = \sum_{k=t}^\infty \gamma^{k-t} R_{k+1},$$
while the one-step return is
$$G_{t:t+1} \doteq R_{t+1} + \gamma V(S_{t+1}).$$
We can generalize both into $n$-step returns:
$$G_{t:t+n} \doteq \left(\sum_{k=t}^{t+n-1} \gamma^{k-t} R_{k+1}\right) + \gamma^n V(S_{t+n}),$$
with $G_{t:t+n} \doteq G_t$ if $t+n \geq T$.
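As a concrete illustration, here is a minimal Python sketch of computing the $n$-step return from a recorded trajectory; the function name and the tabular value array `V` are illustrative assumptions, not part of the lecture.

```python
def n_step_return(rewards, states, t, n, T, V, gamma):
    """Compute G_{t:t+n} from recorded rewards/states and value estimates V.

    rewards[k] stores R_{k+1}, states[k] stores S_k, and T is the episode length.
    If t + n >= T, this reduces to the full (Monte Carlo) return G_t.
    """
    horizon = min(t + n, T)
    G = sum(gamma ** (k - t) * rewards[k] for k in range(t, horizon))
    if t + n < T:                       # bootstrap with the value estimate of S_{t+n}
        G += gamma ** n * V[states[t + n]]
    return G
```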
Defining the $n$-step return to utilize the action-value function as
$$G_{t:t+n} \doteq \left(\sum_{k=t}^{t+n-1} \gamma^{k-t} R_{k+1}\right) + \gamma^n Q(S_{t+n}, A_{t+n}),$$
with $G_{t:t+n} \doteq G_t$ if $t+n \geq T$, we get the following straightforward update rule:
$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\left[G_{t:t+n} - Q(S_t, A_t)\right].$$

Figure 7.4 of "Reinforcement Learning: An Introduction, Second Edition": a gridworld path taken by the agent, together with the action values increased by one-step Sarsa and by 10-step Sarsa.
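To make the update concrete, a minimal tabular sketch in the same notation might look as follows; `Q` (indexable by state–action pairs) is a hypothetical structure used only for illustration.

```python
def n_step_sarsa_update(Q, states, actions, rewards, t, n, T, alpha, gamma):
    """Apply the n-step Sarsa update to Q(S_t, A_t).

    G_{t:t+n} bootstraps with Q(S_{t+n}, A_{t+n}) unless t + n >= T.
    """
    horizon = min(t + n, T)
    G = sum(gamma ** (k - t) * rewards[k] for k in range(t, horizon))
    if t + n < T:
        G += gamma ** n * Q[states[t + n], actions[t + n]]
    Q[states[t], actions[t]] += alpha * (G - Q[states[t], actions[t]])
```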
Recall the relative probability of a trajectory under the target and behaviour policies, which we now generalize as
$$\rho_{t:t+n} \doteq \prod_{k=t}^{\min(t+n,\,T-1)} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}.$$
Then a simple off-policy $n$-step TD can be computed as
$$V(S_t) \leftarrow V(S_t) + \alpha\rho_{t:t+n-1}\left[G_{t:t+n} - V(S_t)\right].$$
Similarly, $n$-step Sarsa becomes
$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\rho_{t+1:t+n}\left[G_{t:t+n} - Q(S_t, A_t)\right].$$
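A small sketch of the importance-sampling ratio and of the off-policy $n$-step TD update is given below; `pi` and `b` are assumed to be callables returning action probabilities, and `n_step_return` is the sketch from the previous slide — all names are illustrative, not from the lecture.

```python
def importance_ratio(pi, b, states, actions, t, end, T):
    """rho_{t:end} = prod_{k=t}^{min(end, T-1)} pi(A_k|S_k) / b(A_k|S_k)."""
    rho = 1.0
    for k in range(t, min(end, T - 1) + 1):
        rho *= pi(actions[k], states[k]) / b(actions[k], states[k])
    return rho

def off_policy_n_step_td(V, states, actions, rewards, t, n, T, alpha, gamma, pi, b):
    rho = importance_ratio(pi, b, states, actions, t, t + n - 1, T)   # rho_{t:t+n-1}
    G = n_step_return(rewards, states, t, n, T, V, gamma)             # earlier sketch
    V[states[t]] += alpha * rho * (G - V[states[t]])
```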
Example in Section 7.5 of "Reinforcement Learning: An Introduction, Second Edition": the backup diagram of the 3-step tree-backup update, with states $S_t, S_{t+1}, S_{t+2}, S_{t+3}$, actions $A_t, A_{t+1}, A_{t+2}$ and rewards $R_{t+1}, R_{t+2}, R_{t+3}$.
We now derive the $n$-step reward, starting from one-step:
$$G_{t:t+1} \doteq R_{t+1} + \gamma\sum_a \pi(a|S_{t+1})\, Q(S_{t+1}, a).$$
For two-step, we get:
$$G_{t:t+2} \doteq R_{t+1} + \gamma\sum_{a \neq A_{t+1}} \pi(a|S_{t+1})\, Q(S_{t+1}, a) + \gamma\pi(A_{t+1}|S_{t+1})\, G_{t+1:t+2}.$$
Therefore, we can generalize to:
$$G_{t:t+n} \doteq R_{t+1} + \gamma\sum_{a \neq A_{t+1}} \pi(a|S_{t+1})\, Q(S_{t+1}, a) + \gamma\pi(A_{t+1}|S_{t+1})\, G_{t+1:t+n}.$$
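A recursive sketch of this tree-backup return follows; `pi(a, s)` is assumed to return the target-policy probability and `Q` to be indexable by (state, action) — hypothetical names for illustration only.

```python
def tree_backup_return(Q, pi, action_space, states, actions, rewards, t, n, T, gamma):
    """Compute the n-step tree-backup return G_{t:t+n} recursively."""
    if t + 1 >= T:
        return rewards[t]                          # S_{t+1} is terminal, so G = R_{t+1}
    s_next = states[t + 1]
    if n == 1:                                     # G_{t:t+1}: expectation over all actions
        expected = sum(pi(a, s_next) * Q[s_next, a] for a in action_space)
        return rewards[t] + gamma * expected
    a_next = actions[t + 1]
    off_branch = sum(pi(a, s_next) * Q[s_next, a] for a in action_space if a != a_next)
    on_branch = pi(a_next, s_next) * tree_backup_return(
        Q, pi, action_space, states, actions, rewards, t + 1, n - 1, T, gamma)
    return rewards[t] + gamma * off_branch + gamma * on_branch
```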
We will approximate the state-value function $v$ and/or the action-value function $q$, choosing from a family of functions parametrized by a weight vector $w \in \mathbb{R}^d$. We denote the approximations as
$$\hat v(s, w), \quad \hat q(s, a, w).$$
We utilize the Mean Squared Value Error objective, denoted $\overline{VE}$:
$$\overline{VE}(w) \doteq \sum_{s\in\mathcal{S}} \mu(s)\left[v_\pi(s) - \hat v(s, w)\right]^2,$$
where the state distribution $\mu(s)$ is usually the on-policy distribution.
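For instance, with a tabular $\mu$ and a parametric approximator, $\overline{VE}$ could be evaluated as in this minimal numpy sketch (the arrays and the callable `v_hat` are hypothetical):

```python
import numpy as np

def mean_squared_value_error(mu, v_pi, v_hat, w):
    """VE(w) = sum_s mu(s) * (v_pi(s) - v_hat(s, w))**2 over all states 0..len(mu)-1."""
    approx = np.array([v_hat(s, w) for s in range(len(mu))])
    return np.sum(mu * (v_pi - approx) ** 2)
```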
The functional approximation (i.e., the weight vector $w$) is usually optimized using gradient methods, for example as
$$\begin{aligned}
w_{t+1} &\leftarrow w_t - \tfrac{1}{2}\alpha\nabla\left[v_\pi(S_t) - \hat v(S_t, w_t)\right]^2 \\
        &\leftarrow w_t + \alpha\left[v_\pi(S_t) - \hat v(S_t, w_t)\right]\nabla\hat v(S_t, w_t).
\end{aligned}$$
As usual, the $v_\pi(S_t)$ is estimated by a suitable sample. For example, in Monte Carlo methods we use the episodic return $G_t$, and in temporal difference methods, we employ bootstrapping and use $R_{t+1} + \gamma\hat v(S_{t+1}, w)$.
Gradient Monte Carlo Algorithm for Estimating v̂ ≈ v_π
  Input: the policy π to be evaluated
  Input: a differentiable function v̂ : S × R^d → R
  Algorithm parameter: step size α > 0
  Initialize value-function weights w ∈ R^d arbitrarily (e.g., w = 0)
  Loop forever (for each episode):
    Generate an episode S_0, A_0, R_1, S_1, A_1, ..., R_T, S_T using π
    Loop for each step of episode, t = 0, 1, ..., T−1:
      w ← w + α [G_t − v̂(S_t, w)] ∇v̂(S_t, w)
Algorithm 9.3 of "Reinforcement Learning: An Introduction, Second Edition".
A simple special case of function approximation are linear methods, where
$$\hat v(x(s), w) \doteq x(s)^T w = \sum_i x(s)_i\, w_i.$$
The $x(s)$ is a representation of the state $s$, which is a vector of the same size as $w$. It is sometimes called a feature vector.
The SGD update rule then becomes
$$w_{t+1} \leftarrow w_t + \alpha\left[v_\pi(S_t) - \hat v(x(S_t), w_t)\right] x(S_t).$$
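A minimal gradient Monte Carlo sketch with a linear approximator, combining the previous two slides; the feature function `x(s)` and the episode format are assumptions made only for this illustration.

```python
import numpy as np

def gradient_mc_linear(episodes, x, d, alpha, gamma):
    """Gradient Monte Carlo policy evaluation with v_hat(s, w) = x(s) @ w.

    episodes is a list of episodes, each a list of (S_t, R_{t+1}) pairs generated by pi.
    """
    w = np.zeros(d)
    for episode in episodes:
        G = 0.0
        # Processing the episode backwards accumulates the return G_t incrementally;
        # the book's algorithm instead goes forward with precomputed returns.
        for S_t, R_next in reversed(episode):
            G = R_next + gamma * G
            w += alpha * (G - x(S_t) @ w) * x(S_t)
    return w
```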
Many feature-construction methods were developed in the past:
- state aggregation,
- polynomials,
- Fourier basis,
- tile coding,
- radial basis functions.
But of course, nowadays we use deep neural networks, which construct a suitable feature vector automatically as a latent variable (the last hidden layer).
A simple way of generating a feature vector is state aggregation, where several neighboring states are grouped together. For example, consider a 1000-state random walk, where transitions lead uniformly randomly to any of the 100 neighboring states on the left or on the right. Using state aggregation, we can partition the 1000 states into 10 groups of 100 states. Monte Carlo policy evaluation then computes the following:
Figure 9.1 of "Reinforcement Learning: An Introduction, Second Edition": on the 1000-state random walk, the true value $v_\pi$ and the approximate MC value $\hat v$ under state aggregation (constant within each group), together with the state distribution $\mu$ (ranging roughly from 0.0017 to 0.0137).
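State aggregation corresponds to a one-hot feature vector over groups; a sketch for the 1000-state random walk with 10 groups of 100 states (the helper name and 0-based state indexing are assumptions for illustration):

```python
import numpy as np

def aggregate_features(state, n_states=1000, n_groups=10):
    """One-hot feature vector: x(s) has a single 1 at the index of s's group."""
    x = np.zeros(n_groups)
    x[state * n_groups // n_states] = 1.0   # state is assumed to be 0-based
    return x
```

With a linear approximator, this makes $\hat v$ piecewise constant over the groups, producing exactly the staircase shape seen in the figure.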
Figure 9.9 of "Reinforcement Learning: An Introduction, Second Edition": a point in a 2-D state space covered by four overlapping tilings (Tiling 1–4); one tile from each tiling contains the point, so four active tiles/features are used to represent it.
If $t$ overlapping tilings are used, the learning rate is usually normalized as $\alpha/t$.
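A simple 1-D tile-coding sketch illustrating overlapping, mutually offset tilings; real implementations (e.g., Sutton's tiles3) hash the tile indices, which this illustration omits, and all names here are hypothetical.

```python
def tile_indices(value, n_tilings=8, tiles_per_tiling=10, low=0.0, high=1.0):
    """Return one active tile index per tiling for a scalar state in [low, high]."""
    indices = []
    tile_width = (high - low) / (tiles_per_tiling - 1)
    for tiling in range(n_tilings):
        offset = tiling / n_tilings * tile_width      # each tiling is shifted slightly
        tile = int((value - low + offset) / tile_width)
        indices.append(tiling * tiles_per_tiling + tile)
    return indices   # the binary feature vector has ones exactly at these indices
```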
For example, on the 1000-state random walk, the performance of tile coding surpasses state aggregation:
Figure 9.10 of "Reinforcement Learning: An Introduction, Second Edition": $\sqrt{\overline{VE}}$ averaged over runs during the first 5000 episodes of the 1000-state random walk, comparing state aggregation (one tiling) with tile coding (50 tilings).
Figure 9.10 of "Reinforcement Learning: An Introduction, Second Edition".
In higher dimensions, the tiles should have asymmetrical offsets, with a sequence of $(1, 3, 5, \ldots, 2d-1)$ being a good choice.

Figure 9.11 of "Reinforcement Learning: An Introduction, Second Edition": possible generalizations of a point for uniformly offset tilings versus asymmetrically offset tilings.
In TD methods, we again use bootstrapping to estimate $v_\pi(S_t)$ as $R_{t+1} + \gamma\hat v(S_{t+1}, w)$.

Semi-gradient TD(0) for estimating v̂ ≈ v_π
  Input: the policy π to be evaluated
  Input: a differentiable function v̂ : S⁺ × R^d → R such that v̂(terminal, ·) = 0
  Algorithm parameter: step size α > 0
  Initialize value-function weights w ∈ R^d arbitrarily (e.g., w = 0)
  Loop for each episode:
    Initialize S
    Loop for each step of episode:
      Choose A ∼ π(·|S)
      Take action A, observe R, S′
      w ← w + α [R + γ v̂(S′, w) − v̂(S, w)] ∇v̂(S, w)
      S ← S′
    until S is terminal

Algorithm 9.3 of "Reinforcement Learning: An Introduction, Second Edition".

Note that such an algorithm is called semi-gradient, because it does not backpropagate through $\hat v(S', w)$.
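A corresponding semi-gradient TD(0) sketch with a linear approximator; the environment interface (`reset()` and `step(a)` returning `(next_state, reward, done)`) is a hypothetical assumption, not part of the lecture.

```python
import numpy as np

def semi_gradient_td0(env, policy, x, d, alpha, gamma, episodes):
    """Semi-gradient TD(0) with v_hat(s, w) = x(s) @ w; the gradient flows only
    through v_hat(S, w), not through the bootstrap target v_hat(S', w)."""
    w = np.zeros(d)
    for _ in range(episodes):
        S, done = env.reset(), False
        while not done:
            A = policy(S)
            S_next, R, done = env.step(A)
            target = R if done else R + gamma * (x(S_next) @ w)
            w += alpha * (target - x(S) @ w) * x(S)
            S = S_next
    return w
```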
An important fact is that linear semi-gradient TD methods do not converge to the minimum of $\overline{VE}$. Instead, they converge to a different TD fixed point $w_{TD}$. It can be proven that
$$\overline{VE}(w_{TD}) \leq \frac{1}{1-\gamma}\min_w \overline{VE}(w).$$
However, when $\gamma$ is close to one, the multiplication factor in the above bound is quite large.
As before, we can utilize $n$-step TD methods.

n-step semi-gradient TD for estimating v̂ ≈ v_π
  Input: the policy π to be evaluated
  Input: a differentiable function v̂ : S⁺ × R^d → R such that v̂(terminal, ·) = 0
  Algorithm parameters: step size α > 0, a positive integer n
  Initialize value-function weights w arbitrarily (e.g., w = 0)
  All store and access operations (S_t and R_t) can take their index mod n+1
  Loop for each episode:
    Initialize and store S_0 ≠ terminal
    T ← ∞
    Loop for t = 0, 1, 2, ...:
      If t < T, then:
        Take an action according to π(·|S_t)
        Observe and store the next reward as R_{t+1} and the next state as S_{t+1}
        If S_{t+1} is terminal, then T ← t + 1
      τ ← t − n + 1 (τ is the time whose state's estimate is being updated)
      If τ ≥ 0:
        G ← Σ_{i=τ+1}^{min(τ+n,T)} γ^{i−τ−1} R_i
        If τ + n < T, then: G ← G + γ^n v̂(S_{τ+n}, w)   (G_{τ:τ+n})
        w ← w + α [G − v̂(S_τ, w)] ∇v̂(S_τ, w)
    Until τ = T − 1

Algorithm 9.5 of "Reinforcement Learning: An Introduction, Second Edition".
Figure 9.2 of "Reinforcement Learning: An Introduction, Second Edition": n-step semi-gradient TD on the 1000-state random walk — one panel compares the true value $v_\pi$ with the approximate TD value $\hat v$ over the states, and the other shows the average RMS error over 1000 states and the first 10 episodes as a function of $\alpha$, for $n = 1, 2, 4, \ldots, 512$.
Until now, we have talked only about policy evaluation. Naturally, we can extend it to a full Sarsa algorithm:
Episodic Semi-gradient Sarsa for Estimating q̂ ≈ q_*
  Input: a differentiable action-value function parameterization q̂ : S × A × R^d → R
  Algorithm parameters: step size α > 0, small ε > 0
  Initialize value-function weights w ∈ R^d arbitrarily (e.g., w = 0)
  Loop for each episode:
    S, A ← initial state and action of episode (e.g., ε-greedy)
    Loop for each step of episode:
      Take action A, observe R, S′
      If S′ is terminal:
        w ← w + α [R − q̂(S, A, w)] ∇q̂(S, A, w)
        Go to next episode
      Choose A′ as a function of q̂(S′, ·, w) (e.g., ε-greedy)
      w ← w + α [R + γ q̂(S′, A′, w) − q̂(S, A, w)] ∇q̂(S, A, w)
      S ← S′
      A ← A′
Algorithm 10.1 of "Reinforcement Learning: An Introduction, Second Edition".
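The control variant differs from policy evaluation mainly in using $\hat q$ and an ε-greedy action choice; a linear sketch follows, where the feature function `x(s, a)` and the environment interface are hypothetical assumptions.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def semi_gradient_sarsa(env, x, n_actions, d, alpha, gamma, epsilon, episodes, seed=0):
    """Episodic semi-gradient Sarsa with q_hat(s, a, w) = x(s, a) @ w."""
    rng = np.random.default_rng(seed)
    w = np.zeros(d)
    for _ in range(episodes):
        S = env.reset()
        A = epsilon_greedy([x(S, a) @ w for a in range(n_actions)], epsilon, rng)
        done = False
        while not done:
            S_next, R, done = env.step(A)
            if done:
                w += alpha * (R - x(S, A) @ w) * x(S, A)
                break
            A_next = epsilon_greedy([x(S_next, a) @ w for a in range(n_actions)], epsilon, rng)
            w += alpha * (R + gamma * (x(S_next, A_next) @ w) - x(S, A) @ w) * x(S, A)
            S, A = S_next, A_next
    return w
```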
Additionally, we can incorporate $n$-step returns:

Episodic semi-gradient n-step Sarsa for estimating q̂ ≈ q_* or q_π
  Input: a differentiable action-value function parameterization q̂ : S × A × R^d → R
  Input: a policy π (if estimating q_π)
  Algorithm parameters: step size α > 0, small ε > 0, a positive integer n
  Initialize value-function weights w ∈ R^d arbitrarily (e.g., w = 0)
  All store and access operations (S_t, A_t, and R_t) can take their index mod n+1
  Loop for each episode:
    Initialize and store S_0 ≠ terminal
    Select and store an action A_0 ∼ π(·|S_0) or ε-greedy wrt q̂(S_0, ·, w)
    T ← ∞
    Loop for t = 0, 1, 2, ...:
      If t < T, then:
        Take action A_t
        Observe and store the next reward as R_{t+1} and the next state as S_{t+1}
        If S_{t+1} is terminal, then:
          T ← t + 1
        else:
          Select and store A_{t+1} ∼ π(·|S_{t+1}) or ε-greedy wrt q̂(S_{t+1}, ·, w)
      τ ← t − n + 1 (τ is the time whose estimate is being updated)
      If τ ≥ 0:
        G ← Σ_{i=τ+1}^{min(τ+n,T)} γ^{i−τ−1} R_i
        If τ + n < T, then G ← G + γ^n q̂(S_{τ+n}, A_{τ+n}, w)   (G_{τ:τ+n})
        w ← w + α [G − q̂(S_τ, A_τ, w)] ∇q̂(S_τ, A_τ, w)
    Until τ = T − 1

Algorithm 10.2 of "Reinforcement Learning: An Introduction, Second Edition".
Figure 10.1 of "Reinforcement Learning: An Introduction, Second Edition".
The performance shown is for the semi-gradient Sarsa(λ) algorithm (which we have not talked about yet), with tile coding of 8 overlapping tilings covering position and velocity, with offsets of $(1, 3)$.
Figure 10.3 of "Reinforcement Learning: An Introduction, Second Edition": Mountain Car steps per episode (log scale, averaged over 100 runs) as a function of the episode number, for n-step Sarsa with n = 1 and n = 8.

Figure 10.4 of "Reinforcement Learning: An Introduction, Second Edition": Mountain Car steps per episode, averaged over the first 50 episodes and 100 runs, as a function of α × number of tilings (8), for n = 1, 2, 4, 8, 16.
Consider a deterministic transition between two states whose values are computed using the same weight:

Figure from Section 11.2 of "Reinforcement Learning: An Introduction, Second Edition": a transition from a state with value $w$ to a state with value $2w$.

If initially $w = 10$, the TD error will also be 10 (or nearly 10 if $\gamma < 1$). If, for example, $\alpha = 0.1$, $w$ will be increased to 11 (by 10%). This process can continue indefinitely. However, the problem arises only in the off-policy setting, where we do not decrease the value of the second state from further observations.
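The divergence can be verified numerically with a few lines; this is an illustrative simulation under the stated assumptions (w = 10, α = 0.1, only the w → 2w transition is ever observed), not code from the lecture.

```python
# Repeatedly apply the semi-gradient TD update to the transition
# "state with value w" -> "state with value 2w" (reward 0, gamma close to 1).
w, alpha, gamma = 10.0, 0.1, 0.99
for step in range(10):
    td_error = 0.0 + gamma * 2 * w - w     # R + gamma * v(s') - v(s) = (2*gamma - 1) * w
    w += alpha * td_error * 1.0            # gradient of v(s) = w with respect to w is 1
    print(step, w)                         # w grows geometrically whenever gamma > 0.5
```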
The previous idea is realized, for instance, by the following example (Baird's counterexample).
Figure 11.1 of "Reinforcement Learning: An Introduction, Second Edition": the seven states have estimated values $2w_1+w_8, \ldots, 2w_6+w_8$ and $w_7+2w_8$; the behaviour policy takes the dashed action with probability $b(\text{dashed}|\cdot) = 6/7$ and the solid action with probability $b(\text{solid}|\cdot) = 1/7$, while the target policy always takes the solid action, $\pi(\text{solid}|\cdot) = 1$; the discount factor is $\gamma = 0.99$.
Figure 11.2 of "Reinforcement Learning: An Introduction, Second Edition": on the example above, the weight components $w_1$–$w_6$, $w_7$ and $w_8$ diverge during the first 1000 steps of semi-gradient off-policy TD and during the first 1000 sweeps of semi-gradient DP.
Volodymyr Mnih et al.: Playing Atari with Deep Reinforcement Learning (Dec 2013 on arXiv); accepted in 2015 in Nature as Human-level control through deep reinforcement learning.
An off-policy Q-learning algorithm with a convolutional neural network approximating the action-value function.
Training can be extremely brittle (and can even diverge, as shown earlier).
Figure 1 of the paper "Human-level control through deep reinforcement learning" by Volodymyr Mnih et al.: schematic of the network — convolutional layers followed by fully connected layers, applied to the stacked input frames.
Figure 3 of the paper "Human-level control through deep reinforcement learning" by Volodymyr Mnih et al.: comparison of DQN with the best linear learner across individual Atari games (from Montezuma's Revenge and Private Eye up to Breakout, Boxing and Video Pinball), measured relative to human-level performance.
Extended Data Figure 2a of the paper "Human-level control through deep reinforcement learning" by Volodymyr Mnih et al.
Extended Data Figure 2b of the paper "Human-level control through deep reinforcement learning" by Volodymyr Mnih et al.
Preprocessing: the 210 × 160 128-color images are converted to grayscale and then resized to 84 × 84.
The frame-skipping technique is used, i.e., only every 4th frame (out of 60 per second) is considered, and the selected action is repeated on the other frames.
The input to the network consists of the last 4 frames (considering only the frames kept by frame skipping), i.e., an image with 4 channels.
The network is fairly standard, consisting of
- 32 filters of size 8 × 8 with stride 4 and ReLU,
- 64 filters of size 4 × 4 with stride 2 and ReLU,
- 64 filters of size 3 × 3 with stride 1 and ReLU,
- a fully connected layer with 512 units and ReLU,
- a fully connected output layer with one output per valid action.
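A sketch of this architecture in PyTorch; the framework choice and layer naming are mine rather than the lecture's, and the final linear layer with one output per action follows the Nature paper.

```python
import torch.nn as nn

def dqn_network(n_actions):
    """The DQN convolutional tower for 4 stacked 84x84 grayscale frames."""
    return nn.Sequential(
        nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
        nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
        nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        nn.Flatten(),
        nn.Linear(64 * 7 * 7, 512), nn.ReLU(),   # 84 -> 20 -> 9 -> 7 spatial size
        nn.Linear(512, n_actions),
    )
```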
The network is trained with RMSProp to minimize the following loss:
$$\mathcal{L} \doteq \mathbb{E}_{(s,a,r,s')\sim\text{data}}\left[\left(r + \gamma \max_{a'} Q(s', a'; \bar\theta) - Q(s, a; \theta)\right)^2\right].$$
An $\varepsilon$-greedy behaviour policy is utilized.

Important improvements:
- experience replay: the generated episodes are stored in a buffer as $(s, a, r, s')$ quadruples, and for training a transition is sampled uniformly;
- separate target network $\bar\theta$: to prevent instabilities, a separate target network is used to estimate the action-value function in the target. Its weights are not trained, but copied from the trained network once in a while;
- reward clipping of $\left(r + \gamma \max_{a'} Q(s', a'; \bar\theta) - Q(s, a; \theta)\right)$ to $[-1, 1]$.
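A numpy-flavoured sketch of how a training step could assemble the clipped targets from a replay-buffer minibatch; `q_net` and `target_net` are assumed to be callables returning Q-values for a batch of states (a hypothetical interface, not the lecture's code).

```python
import numpy as np

def dqn_td_errors(batch, q_net, target_net, gamma):
    """Compute clipped TD errors for a minibatch of (s, a, r, s', done) arrays."""
    states, actions, rewards, next_states, dones = batch
    q = q_net(states)[np.arange(len(actions)), actions]       # Q(s, a; theta)
    bootstrap = target_net(next_states).max(axis=1)           # max_a' Q(s', a'; theta_bar)
    targets = rewards + gamma * (1.0 - dones) * bootstrap
    return np.clip(targets - q, -1.0, 1.0)                    # used to form the squared loss
```

The target network's parameters $\bar\theta$ would then be copied from $\theta$ according to the update frequency in the table below.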
Hyperparameter                              Value
minibatch size                              32
replay buffer size                          1M
target network update frequency             10k
discount factor                             0.99
training frames                             50M
RMSProp learning rate and momentum          0.00025, 0.95
initial ε, final ε and frame of final ε     1.0, 0.1, 1M
replay start size                           50k
no-op max                                   30