SLIDE 1

NPFL122, Lecture 5

Function Approximation, Deep Q Network

Milan Straka

November 12, 2018

Charles University in Prague, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics

SLIDE 2
n-step Methods

[Backup diagrams of 1-step TD (i.e., TD(0)), 2-step TD, 3-step TD, n-step TD, up to ∞-step TD (i.e., Monte Carlo).]

Figure 7.1 of "Reinforcement Learning: An Introduction, Second Edition".

Full return is
$$G_t = \sum_{k=t}^{\infty} \gamma^{k-t} R_{k+1},$$
while the one-step return is
$$G_{t:t+1} = R_{t+1} + \gamma V(S_{t+1}).$$

We can generalize both into n-step returns:
$$G_{t:t+n} \stackrel{\mathrm{def}}{=} \left(\sum_{k=t}^{t+n-1} \gamma^{k-t} R_{k+1}\right) + \gamma^n V(S_{t+n}),$$
with $G_{t:t+n} \stackrel{\mathrm{def}}{=} G_t$ if $t + n \geq T$.
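To make the definition concrete, here is a minimal sketch (not from the slides) that computes $G_{t:t+n}$ from a finished episode; the containers `rewards`, `states` and the value estimate `V` are hypothetical placeholders.

    def n_step_return(rewards, states, t, n, V, gamma):
        """n-step return G_{t:t+n} for a finished episode.

        rewards[k] holds R_{k+1} (the reward received after leaving states[k]),
        V(s) is the current value estimate, T = len(rewards) is the episode length.
        """
        T = len(rewards)
        horizon = min(t + n, T)
        # discounted sum of (up to) n rewards
        G = sum(gamma ** (k - t) * rewards[k] for k in range(t, horizon))
        # bootstrap with V(S_{t+n}) only if the episode has not ended by then
        if t + n < T:
            G += gamma ** n * V(states[t + n])
        return G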

SLIDE 3
n-step Sarsa

[Backup diagrams of 1-step TD (i.e., TD(0)), 2-step TD, 3-step TD, n-step TD, up to ∞-step TD (i.e., Monte Carlo).]

Figure 7.1 of "Reinforcement Learning: An Introduction, Second Edition".

Defining the n-step return to utilize the action-value function as
$$G_{t:t+n} \stackrel{\mathrm{def}}{=} \left(\sum_{k=t}^{t+n-1} \gamma^{k-t} R_{k+1}\right) + \gamma^n Q(S_{t+n}, A_{t+n}),$$
with $G_{t:t+n} \stackrel{\mathrm{def}}{=} G_t$ if $t + n \geq T$, we get the following straightforward update rule:
$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\left[G_{t:t+n} - Q(S_t, A_t)\right].$$

[Gridworld illustration: the path taken, the action values increased by one-step Sarsa, and the action values increased by 10-step Sarsa. Figure 7.4 of "Reinforcement Learning: An Introduction, Second Edition".]
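As an illustration only, the update for a tabular Q could look as in the following sketch; the dictionary-style `Q` and the lists `states`, `actions`, `rewards` are hypothetical.

    def n_step_sarsa_update(Q, states, actions, rewards, t, n, alpha, gamma):
        """One n-step Sarsa update of Q[S_t, A_t]; the episode bookkeeping is omitted."""
        T = len(rewards)
        horizon = min(t + n, T)
        G = sum(gamma ** (k - t) * rewards[k] for k in range(t, horizon))
        # bootstrap with Q(S_{t+n}, A_{t+n}) unless the episode ended before t+n
        if t + n < T:
            G += gamma ** n * Q[states[t + n], actions[t + n]]
        Q[states[t], actions[t]] += alpha * (G - Q[states[t], actions[t]])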

SLIDE 4

Off-policy n-step Sarsa

Recall the relative probability of a trajectory under the target and behaviour policies, which we now generalize as
$$\rho_{t:t+n} \stackrel{\mathrm{def}}{=} \prod_{k=t}^{\min(t+n,\, T-1)} \frac{\pi(A_k|S_k)}{b(A_k|S_k)}.$$
Then a simple off-policy n-step TD can be computed as
$$V(S_t) \leftarrow V(S_t) + \alpha\rho_{t:t+n-1}\left[G_{t:t+n} - V(S_t)\right].$$
Similarly, n-step Sarsa becomes
$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\rho_{t+1:t+n}\left[G_{t:t+n} - Q(S_t, A_t)\right].$$
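A small sketch of the importance-sampling ratio; `pi` and `b` are hypothetical callables returning $\pi(a|s)$ and $b(a|s)$.

    def importance_ratio(pi, b, states, actions, t, n, T):
        """rho_{t:t+n}: product of pi(A_k|S_k) / b(A_k|S_k) for k = t, ..., min(t+n, T-1)."""
        rho = 1.0
        for k in range(t, min(t + n, T - 1) + 1):
            rho *= pi(actions[k], states[k]) / b(actions[k], states[k])
        return rho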

SLIDE 5

Off-policy n-step Without Importance Sampling

[Backup diagram of the 3-step tree-backup update, branching over the actions not taken.]

Example in Section 7.5 of "Reinforcement Learning: An Introduction, Second Edition".

We now derive the n-step return, starting from the one-step case:
$$G_{t:t+1} \stackrel{\mathrm{def}}{=} R_{t+1} + \gamma\sum_a \pi(a|S_{t+1})\, Q(S_{t+1}, a).$$
For two steps, we get:
$$G_{t:t+2} \stackrel{\mathrm{def}}{=} R_{t+1} + \gamma\sum_{a \neq A_{t+1}} \pi(a|S_{t+1})\, Q(S_{t+1}, a) + \gamma\,\pi(A_{t+1}|S_{t+1})\, G_{t+1:t+2}.$$
Therefore, we can generalize to:
$$G_{t:t+n} \stackrel{\mathrm{def}}{=} R_{t+1} + \gamma\sum_{a \neq A_{t+1}} \pi(a|S_{t+1})\, Q(S_{t+1}, a) + \gamma\,\pi(A_{t+1}|S_{t+1})\, G_{t+1:t+n}.$$
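The recursion translates directly into code; the following sketch assumes hypothetical containers (`rewards[k]` is $R_{k+1}$, `Q[(s, a)]` the current estimate, `pi(a, s)` the target-policy probability).

    def tree_backup_return(rewards, states, actions, Q, pi, actions_list, t, n, gamma, T):
        """Recursive n-step tree-backup return G_{t:t+n} (illustrative sketch)."""
        if t + 1 >= T:                      # S_{t+1} is terminal: only the last reward remains
            return rewards[t]
        s_next = states[t + 1]
        expected = sum(pi(a, s_next) * Q[(s_next, a)] for a in actions_list)
        if n == 1:
            return rewards[t] + gamma * expected
        a_next = actions[t + 1]
        # keep the expectation over actions NOT taken ...
        off_action = expected - pi(a_next, s_next) * Q[(s_next, a_next)]
        # ... and replace the taken action's term by the longer, recursively computed return
        longer = tree_backup_return(rewards, states, actions, Q, pi, actions_list,
                                    t + 1, n - 1, gamma, T)
        return rewards[t] + gamma * off_action + gamma * pi(a_next, s_next) * longer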

SLIDE 6

Function Approximation

We will approximate the value function $v$ and/or the action-value function $q$, choosing from a family of functions parametrized by a weight vector $w \in \mathbb{R}^d$. We denote the approximations as
$$\hat{v}(s, w), \quad \hat{q}(s, a, w).$$

We utilize the Mean Squared Value Error objective, denoted $\overline{VE}$:
$$\overline{VE}(w) \stackrel{\mathrm{def}}{=} \sum_{s \in \mathcal{S}} \mu(s)\left[v_\pi(s) - \hat{v}(s, w)\right]^2,$$
where the state distribution $\mu(s)$ is usually the on-policy distribution.
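As a worked example (with hypothetical dictionaries `mu` and `v_pi` and an approximator `v_hat`), the objective is just a weighted sum of squared errors.

    def msve(mu, v_pi, v_hat, w, states):
        """Mean Squared Value Error: sum over states of mu(s) * (v_pi(s) - v_hat(s, w))^2."""
        return sum(mu[s] * (v_pi[s] - v_hat(s, w)) ** 2 for s in states)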

SLIDE 7

Gradient and Semi-Gradient Methods

The functional approximation (i.e., the weight vector $w$) is usually optimized using gradient methods, for example as
$$w_{t+1} \leftarrow w_t - \tfrac{1}{2}\alpha\nabla\left[v_\pi(S_t) - \hat{v}(S_t, w_t)\right]^2$$
$$\phantom{w_{t+1}} \leftarrow w_t + \alpha\left[v_\pi(S_t) - \hat{v}(S_t, w_t)\right]\nabla\hat{v}(S_t, w_t).$$

As usual, the $v_\pi(S_t)$ is estimated by a suitable sample. For example in Monte Carlo methods, we use the episodic return $G_t$, and in temporal difference methods, we employ bootstrapping and use $R_{t+1} + \gamma\hat{v}(S_{t+1}, w)$.

SLIDE 8

Monte Carlo Gradient Policy Evaluation

Gradient Monte Carlo Algorithm for Estimating v̂ ≈ v_π
  Input: the policy π to be evaluated
  Input: a differentiable function v̂ : S × ℝ^d → ℝ
  Algorithm parameter: step size α > 0
  Initialize value-function weights w ∈ ℝ^d arbitrarily (e.g., w = 0)
  Loop forever (for each episode):
    Generate an episode S_0, A_0, R_1, S_1, A_1, ..., R_T, S_T using π
    Loop for each step of episode, t = 0, 1, ..., T−1:
      w ← w + α [G_t − v̂(S_t, w)] ∇v̂(S_t, w)

Algorithm 9.3 of "Reinforcement Learning: An Introduction, Second Edition".
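The same algorithm as a small Python sketch for a linear v̂; the `episodes` iterable and the `features(s)` helper are hypothetical.

    import numpy as np

    def gradient_mc_evaluation(episodes, features, d, alpha, gamma):
        """Gradient Monte Carlo evaluation with a linear v_hat (illustrative sketch).

        `episodes` yields finished episodes, each a list of (state, reward) pairs
        where the reward is R_{t+1}; `features(s)` returns the d-dimensional x(s).
        """
        w = np.zeros(d)
        for episode in episodes:
            G = 0.0
            # process the episode backwards to accumulate the return G_t
            for state, reward in reversed(episode):
                G = reward + gamma * G
                x = features(state)
                w += alpha * (G - x @ w) * x
        return w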

SLIDE 9

Linear Methods

A simple special case of function approximation are linear methods, where
$$\hat{v}(x(s), w) \stackrel{\mathrm{def}}{=} x(s)^T w = \sum_i x(s)_i\, w_i.$$
The $x(s)$ is a representation of state $s$, which is a vector of the same size as $w$. It is sometimes called a feature vector.

The SGD update rule then becomes
$$w_{t+1} \leftarrow w_t + \alpha\left[v_\pi(S_t) - \hat{v}(x(S_t), w_t)\right] x(S_t).$$

SLIDE 10

Feature Construction for Linear Methods

Many methods developed in the past: state aggregation, polynomials, Fourier basis, tile coding, radial basis functions. But of course, nowadays we use deep neural networks, which construct a suitable feature vector automatically as a latent variable (the last hidden layer).

SLIDE 11

State Aggregation

A simple way of generating a feature vector is state aggregation, where several neighboring states are grouped together. For example, consider a 1000-state random walk, where transitions go uniformly randomly to any of the 100 neighboring states on the left or on the right. Using state aggregation, we can partition the 1000 states into 10 groups of 100 states. Monte Carlo policy evaluation then computes the following:

[True value v_π, the approximate MC value v̂, and the state distribution μ over the 1000 states.]

Figure 9.1 of "Reinforcement Learning: An Introduction, Second Edition".
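For this random walk, a state-aggregation feature vector can be built as in the following sketch; the helper name and the 0-based state indexing are assumptions.

    import numpy as np

    def aggregation_features(state, n_states=1000, n_groups=10):
        """One-hot vector with a single 1 for the group containing the state (0 <= state < 1000)."""
        x = np.zeros(n_groups)
        x[state * n_groups // n_states] = 1.0
        return x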

SLIDE 12

Tile Coding

[A point in a continuous 2D state space covered by four overlapping tilings; the four tiles that overlap the point (one per tiling) are active and are used to represent it.]

Figure 9.9 of "Reinforcement Learning: An Introduction, Second Edition".

If $t$ overlapping tiles are used, the learning rate is usually normalized as $\alpha/t$.
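A deliberately simplified sketch of tile coding for a point in $[0, 1)^2$; real tile coders (e.g., hashing-based ones) differ, this only illustrates how several offset tilings each contribute one active feature.

    def tile_features(x, y, n_tilings=4, tiles_per_dim=8):
        """Indices of active tiles for (x, y): each tiling is an 8x8 grid, shifted
        by a different fraction of a tile width (wrap-around is a simplification)."""
        active = []
        for t in range(n_tilings):
            offset = t / n_tilings / tiles_per_dim
            col = int((x + offset) * tiles_per_dim) % tiles_per_dim
            row = int((y + offset) * tiles_per_dim) % tiles_per_dim
            active.append(t * tiles_per_dim ** 2 + row * tiles_per_dim + col)
        return active  # one active tile per tiling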

SLIDE 13

Tile Coding

For example, on the 1000-state random walk example, the performance of tile coding surpasses state aggregation:

[√VE averaged over 30 runs across 5000 episodes: state aggregation (one tiling) vs. tile coding (50 tilings).]

Figure 9.10 of "Reinforcement Learning: An Introduction, Second Edition".

SLIDE 14

Asymmetrical Tile Coding

In higher dimensions, the tiles should have asymmetrical offsets, with a sequence of $(1, 3, 5, \dots, 2d-1)$ being a good choice.

[Possible generalizations for uniformly offset tilings vs. asymmetrically offset tilings. Figure 9.11 of "Reinforcement Learning: An Introduction, Second Edition".]

SLIDE 15

Temporal Difference Semi-Gradient Policy Evaluation

In TD methods, we again use bootstrapping to estimate $v_\pi(S_t)$ as $R_{t+1} + \gamma\hat{v}(S_{t+1}, w)$.

Semi-gradient TD(0) for estimating v̂ ≈ v_π
  Input: the policy π to be evaluated
  Input: a differentiable function v̂ : S⁺ × ℝ^d → ℝ such that v̂(terminal, ·) = 0
  Algorithm parameter: step size α > 0
  Initialize value-function weights w ∈ ℝ^d arbitrarily (e.g., w = 0)
  Loop for each episode:
    Initialize S
    Loop for each step of episode:
      Choose A ∼ π(·|S)
      Take action A, observe R, S′
      w ← w + α [R + γ v̂(S′, w) − v̂(S, w)] ∇v̂(S, w)
      S ← S′
    until S is terminal

Algorithm 9.3 of "Reinforcement Learning: An Introduction, Second Edition".

Note that such an algorithm is called semi-gradient, because it does not backpropagate through $\hat{v}(S_{t+1}, w)$.
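For a linear v̂, one semi-gradient TD(0) step might look like this sketch; the bootstrap target is treated as a constant, which is exactly the "semi" part.

    def semi_gradient_td0_step(w, x_s, r, x_next, done, alpha, gamma):
        """One semi-gradient TD(0) update for v_hat(s, w) = x(s)^T w."""
        # the target is NOT differentiated through -- it is a fixed bootstrap estimate
        target = r if done else r + gamma * (x_next @ w)
        return w + alpha * (target - x_s @ w) * x_s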

SLIDE 16

Temporal Difference Semi-Gradient Policy Evaluation

An important fact is that linear semi-gradient TD methods do not converge to the minimum of $\overline{VE}$. Instead, they converge to a different TD fixed point $w_{TD}$. It can be proven that
$$\overline{VE}(w_{TD}) \leq \frac{1}{1-\gamma}\,\min_w \overline{VE}(w).$$
However, when $\gamma$ is close to one, the multiplication factor in the above bound is quite large.

SLIDE 17

Temporal Difference Semi-Gradient Policy Evaluation

As before, we can utilize n-step TD methods.

n-step semi-gradient TD for estimating v̂ ≈ v_π
  Input: the policy π to be evaluated
  Input: a differentiable function v̂ : S⁺ × ℝ^d → ℝ such that v̂(terminal, ·) = 0
  Algorithm parameters: step size α > 0, a positive integer n
  Initialize value-function weights w arbitrarily (e.g., w = 0)
  All store and access operations (S_t and R_t) can take their index mod n + 1
  Loop for each episode:
    Initialize and store S_0 ≠ terminal
    T ← ∞
    Loop for t = 0, 1, 2, ...:
      If t < T, then:
        Take an action according to π(·|S_t)
        Observe and store the next reward as R_{t+1} and the next state as S_{t+1}
        If S_{t+1} is terminal, then T ← t + 1
      τ ← t − n + 1   (τ is the time whose state's estimate is being updated)
      If τ ≥ 0:
        G ← Σ_{i=τ+1}^{min(τ+n, T)} γ^{i−τ−1} R_i
        If τ + n < T, then: G ← G + γ^n v̂(S_{τ+n}, w)   (G_{τ:τ+n})
        w ← w + α [G − v̂(S_τ, w)] ∇v̂(S_τ, w)
    Until τ = T − 1

Algorithm 9.5 of "Reinforcement Learning: An Introduction, Second Edition".


SLIDE 18

Temporal Difference Semi-Gradient Policy Evaluation

[Average RMS error over 1000 states and the first 10 episodes as a function of α for n-step semi-gradient TD with n = 1, 2, 4, ..., 512, together with the true value v_π and the approximate TD value on the 1000-state random walk.]

Figure 9.2 of "Reinforcement Learning: An Introduction, Second Edition".

SLIDE 19

Sarsa with Function Approximation

Until now, we talked only about policy evaluation. Naturally, we can extend it to a full Sarsa algorithm:

Episodic Semi-gradient Sarsa for Estimating q̂ ≈ q_*
  Input: a differentiable action-value function parameterization q̂ : S × A × ℝ^d → ℝ
  Algorithm parameters: step size α > 0, small ε > 0
  Initialize value-function weights w ∈ ℝ^d arbitrarily (e.g., w = 0)
  Loop for each episode:
    S, A ← initial state and action of episode (e.g., ε-greedy)
    Loop for each step of episode:
      Take action A, observe R, S′
      If S′ is terminal:
        w ← w + α [R − q̂(S, A, w)] ∇q̂(S, A, w)
        Go to next episode
      Choose A′ as a function of q̂(S′, ·, w) (e.g., ε-greedy)
      w ← w + α [R + γ q̂(S′, A′, w) − q̂(S, A, w)] ∇q̂(S, A, w)
      S ← S′
      A ← A′

Algorithm 10.1 of "Reinforcement Learning: An Introduction, Second Edition".
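A compact Python sketch of the same control loop; `env`, `q_hat` and `grad_q_hat` are hypothetical stand-ins for an environment (here assumed to return (next_state, reward, done) from `step`) and a differentiable action-value approximator.

    import numpy as np

    def episodic_semi_gradient_sarsa(env, q_hat, grad_q_hat, w, alpha, gamma, epsilon, n_actions):
        """Run one episode of semi-gradient Sarsa and update w in place (illustrative sketch)."""
        def epsilon_greedy(s):
            if np.random.rand() < epsilon:
                return np.random.randint(n_actions)
            return int(np.argmax([q_hat(s, a, w) for a in range(n_actions)]))

        s = env.reset()
        a = epsilon_greedy(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            if done:
                w += alpha * (r - q_hat(s, a, w)) * grad_q_hat(s, a, w)
            else:
                a_next = epsilon_greedy(s_next)
                td_error = r + gamma * q_hat(s_next, a_next, w) - q_hat(s, a, w)
                w += alpha * td_error * grad_q_hat(s, a, w)
                s, a = s_next, a_next
        return w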

SLIDE 20

Sarsa with Function Approximation

Additionally, we can incorporate n-step returns:

Episodic semi-gradient n-step Sarsa for estimating q̂ ≈ q_* or q_π
  Input: a differentiable action-value function parameterization q̂ : S × A × ℝ^d → ℝ
  Input: a policy π (if estimating q_π)
  Algorithm parameters: step size α > 0, small ε > 0, a positive integer n
  Initialize value-function weights w ∈ ℝ^d arbitrarily (e.g., w = 0)
  All store and access operations (S_t, A_t, and R_t) can take their index mod n + 1
  Loop for each episode:
    Initialize and store S_0 ≠ terminal
    Select and store an action A_0 ∼ π(·|S_0) or ε-greedy wrt q̂(S_0, ·, w)
    T ← ∞
    Loop for t = 0, 1, 2, ...:
      If t < T, then:
        Take action A_t
        Observe and store the next reward as R_{t+1} and the next state as S_{t+1}
        If S_{t+1} is terminal, then:
          T ← t + 1
        else:
          Select and store A_{t+1} ∼ π(·|S_{t+1}) or ε-greedy wrt q̂(S_{t+1}, ·, w)
      τ ← t − n + 1   (τ is the time whose estimate is being updated)
      If τ ≥ 0:
        G ← Σ_{i=τ+1}^{min(τ+n, T)} γ^{i−τ−1} R_i
        If τ + n < T, then G ← G + γ^n q̂(S_{τ+n}, A_{τ+n}, w)   (G_{τ:τ+n})
        w ← w + α [G − q̂(S_τ, A_τ, w)] ∇q̂(S_τ, A_τ, w)
    Until τ = T − 1

Algorithm 10.2 of "Reinforcement Learning: An Introduction, Second Edition".


SLIDE 21

Mountain Car Example

Figure 10.1 of "Reinforcement Learning: An Introduction, Second Edition".

The performances are for the semi-gradient Sarsa(λ) algorithm (which we have not talked about yet) with tile coding of 8 overlapping tiles covering position and velocity, with offsets of (1, 3).

SLIDE 22

Mountain Car Example

[Mountain Car results: steps per episode (log scale, averaged over 100 runs) for n = 1 and n = 8 (Figure 10.3), and steps per episode averaged over the first 50 episodes and 100 runs as a function of α × number of tilings (8), for n = 1, 2, 4, 8, 16 (Figure 10.4). Both from "Reinforcement Learning: An Introduction, Second Edition".]

SLIDE 23

Off-policy Divergence With Function Approximation

Consider a deterministic transition between two states whose values are computed using the same weight:

[A deterministic, zero-reward transition from a state whose estimated value is w to a state whose estimated value is 2w.]

Figure from Section 11.2 of "Reinforcement Learning: An Introduction, Second Edition".

If initially $w = 10$, the TD error will be also 10 (or nearly 10 if $\gamma < 1$). If for example $\alpha = 0.1$, $w$ will be increased by 1 (i.e., by 10%). This process can continue indefinitely. However, the problem arises only in the off-policy setting, where we do not decrease the value of the second state from further observations.
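The growth can be verified with a few lines of arithmetic mirroring the numbers above ($w = 10$, $\alpha = 0.1$, $\gamma = 0.99$, zero reward, and $\nabla\hat v = 1$ for the first state).

    w, alpha, gamma = 10.0, 0.1, 0.99
    for step in range(5):
        td_error = 0.0 + gamma * 2 * w - w   # approximately w, i.e. ~10 at first
        w += alpha * td_error * 1.0          # gradient of v_hat(first state) = w is 1
        print(round(w, 3))                   # prints 10.98, 12.056, 13.238, ...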

SLIDE 24

Off-policy Divergence With Function Approximation

The previous idea can be realized for instance by the following example.

[Seven states whose values are parametrized as 2w_1 + w_8, ..., 2w_6 + w_8 and w_7 + 2w_8; the behaviour policy is b(dashed|·) = 6/7, b(solid|·) = 1/7, the target policy is π(solid|·) = 1, and γ = 0.99.]

Figure 11.1 of "Reinforcement Learning: An Introduction, Second Edition".

SLIDE 25

Off-policy Divergence With Function Approximation

[The same example as on the previous slide (Figure 11.1). With both semi-gradient off-policy TD and semi-gradient DP, the weight components w_1–w_6, w_7 and w_8 diverge as the number of steps/sweeps grows. Figure 11.2 of "Reinforcement Learning: An Introduction, Second Edition".]

SLIDE 26

Deep Q Networks

Volodymyr Mnih et al.: Playing Atari with Deep Reinforcement Learning (Dec 2013 on arXiv). In 2015, it was accepted in Nature as Human-level control through deep reinforcement learning. DQN is an off-policy Q-learning algorithm with a convolutional neural network approximating the action-value function. Training can be extremely brittle (and can even diverge, as shown earlier).

SLIDE 27

Deep Q Network

[Schematic of the network: the stacked input frames are processed by convolution, convolution, fully connected and fully connected layers.]

Figure 1 of the paper "Human-level control through deep reinforcement learning" by Volodymyr Mnih et al.

SLIDE 28

Deep Q Network

[Relative performance of DQN and the best linear learner across Atari games (from Montezuma's Revenge and Private Eye at the bottom up to Breakout, Boxing and Video Pinball at the top), shown as a percentage of human-level performance and split into games below and at-or-above human level.]

Figure 3 of the paper "Human-level control through deep reinforcement learning" by Volodymyr Mnih et al.

SLIDE 29

Deep Q Network

Extended Data Figure 2a of the paper "Human-level control through deep reinforcement learning" by Volodymyr Mnih et al.

SLIDE 30

Deep Q Network

Extended Data Figure 2b of the paper "Human-level control through deep reinforcement learning" by Volodymyr Mnih et al.

SLIDE 31

Deep Q Networks

Preprocessing: the 210 × 160 128-color images are converted to grayscale and then resized to 84 × 84. A frame skipping technique is used, i.e., only every 4th frame (out of 60 per second) is considered, and the selected action is repeated on the other frames. Input to the network are the last 4 frames (considering only the frames kept by frame skipping), i.e., an image with 4 channels. The network is fairly standard, performing 32 filters of size 8 × 8 with stride 4 and ReLU, 64 filters of size 4 × 4 with stride 2 and ReLU, 64 filters of size 3 × 3 with stride 1 and ReLU, a fully connected layer with 512 units and ReLU, and an output layer with 18 output units (one for each action).
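A sketch of that architecture in PyTorch (the original work did not use PyTorch; the layer sizes follow the description above, and the 64·7·7 flattened size is what the three convolutions produce from an 84 × 84 input).

    import torch.nn as nn

    # 4 stacked 84x84 grayscale frames in, 18 action values out
    dqn = nn.Sequential(
        nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
        nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
        nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        nn.Flatten(),
        nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
        nn.Linear(512, 18),
    )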

SLIDE 32

Deep Q Networks

The network is trained with RMSProp to minimize the following loss:
$$L \stackrel{\mathrm{def}}{=} \mathbb{E}_{(s,a,r,s')\sim\text{data}}\left[\left(r + \gamma \max_{a'} Q(s', a'; \bar\theta) - Q(s, a; \theta)\right)^2\right].$$
An ε-greedy behavior policy is utilized.

Important improvements:
experience replay: the generated episodes are stored in a buffer as $(s, a, r, s')$ quadruples, and for training, a transition is sampled uniformly;
separate target network $\bar\theta$: to prevent instabilities, a separate target network is used to estimate the action-value function. Its weights are not trained, but copied from the trained network once in a while;
reward clipping of $\left(r + \gamma \max_{a'} Q(s', a'; \bar\theta) - Q(s, a; \theta)\right)$ to $[-1, 1]$.
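A minimal sketch of the loss on a replayed minibatch, assuming `q_net` and `target_net` are two copies of the network above and `batch` is a hypothetical tuple of tensors sampled from the replay buffer; the clipping of the error term is omitted here.

    import torch
    import torch.nn.functional as F

    def dqn_loss(batch, q_net, target_net, gamma):
        """DQN loss on a minibatch (states, actions, rewards, next_states, dones)."""
        states, actions, rewards, next_states, dones = batch
        q_values = q_net(states).gather(1, actions.long().unsqueeze(1)).squeeze(1)
        with torch.no_grad():  # do not backpropagate into the target network
            next_q = target_net(next_states).max(dim=1).values
            targets = rewards + gamma * next_q * (1.0 - dones.float())
        return F.mse_loss(q_values, targets)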

SLIDE 33

Deep Q Networks Hyperparameters

Hyperparameter                              Value
minibatch size                              32
replay buffer size                          1M
target network update frequency             10k
discount factor                             0.99
training frames                             50M
RMSProp learning rate and momentum          0.00025, 0.95
initial ε, final ε and frame of final ε     1.0, 0.1, 1M
replay start size                           50k
no-op max                                   30
