Rainbow Milan Straka November 19, 2018 Charles University in - - PowerPoint PPT Presentation

SLIDE 1

NPFL122, Lecture 6

Rainbow

Milan Straka

November 19, 2018

Charles University in Prague Faculty of Mathematics and Physics Institute of Formal and Applied Linguistics unless otherwise stated

SLIDE 2

Function Approximation

We will approximate the value function $v$ and/or the action-value function $q$, choosing from a family of functions parametrized by a weight vector $w \in \mathbb{R}^d$. We denote the approximations as $\hat v(s, w)$ and $\hat q(s, a, w)$.

We utilize the Mean Squared Value Error objective, denoted $\overline{VE}$:

$$\overline{VE}(w) \stackrel{\text{def}}{=} \sum_{s \in \mathcal{S}} \mu(s) \big[v_\pi(s) - \hat v(s, w)\big]^2,$$

where the state distribution $\mu(s)$ is usually the on-policy distribution.

2/37 NPFL122, Lecture 6

Refresh DQN DDQN PriRep Duelling NoisyNets DistRL Rainbow

SLIDE 3

Gradient and Semi-Gradient Methods

The functional approximation (i.e., the weight vector $w$) is usually optimized using gradient methods, for example as

$$\begin{aligned} w_{t+1} &\leftarrow w_t - \tfrac{1}{2}\alpha\nabla\big[v_\pi(S_t) - \hat v(S_t, w_t)\big]^2 \\ &\leftarrow w_t + \alpha\big[v_\pi(S_t) - \hat v(S_t, w_t)\big]\nabla\hat v(S_t, w_t). \end{aligned}$$

As usual, $v_\pi(S_t)$ is estimated by a suitable sample. For example, in Monte Carlo methods we use the episodic return $G_t$, and in temporal difference methods we employ bootstrapping and use $R_{t+1} + \gamma \hat v(S_{t+1}, w)$.
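The bootstrapped update can be sketched in a few lines. This is a minimal illustration that assumes a linear approximator $\hat v(s, w) = w^\top x(s)$, so that $\nabla \hat v(S_t, w) = x(S_t)$; the function name, the feature vectors and the `done` flag are hypothetical and not part of the lecture.

```python
import numpy as np

def semi_gradient_td0_update(w, x_t, r_next, x_next, alpha, gamma, done=False):
    """One semi-gradient TD(0) step for a linear v_hat(s, w) = w . x(s)."""
    v_t = w @ x_t
    # Bootstrapped target R_{t+1} + gamma * v_hat(S_{t+1}, w); it is treated
    # as a constant, which is what makes the method "semi-gradient".
    target = r_next + (0.0 if done else gamma * (w @ x_next))
    # For a linear approximator the gradient of v_hat w.r.t. w is x_t.
    return w + alpha * (target - v_t) * x_t

# Hypothetical one-hot features of two states.
w = np.zeros(2)
x_t, x_next = np.array([1.0, 0.0]), np.array([0.0, 1.0])
w = semi_gradient_td0_update(w, x_t, r_next=1.0, x_next=x_next, alpha=0.5, gamma=0.9)
```

Replacing `target` with the episodic return $G_t$ would give the Monte Carlo variant of the same update.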

SLIDE 4

Deep Q Network

Off-policy Q-learning algorithm with a convolutional neural network function approximation of action-value function. Training can be extremely brittle (and can even diverge as shown earlier).

[Figure labels: Convolution, Convolution, Fully connected, Fully connected, No input]

Figure 1 of the paper "Human-level control through deep reinforcement learning" by Volodymyr Mnih et al.

SLIDE 5

Deep Q Networks

Preprocessing: $210 \times 160$ 128-color images are converted to grayscale and then resized to $84 \times 84$.

Frame skipping technique is used, i.e., only every $4^\text{th}$ frame (out of 60 per second) is considered, and the selected action is repeated on the other frames.

Input to the network are the last $4$ frames (considering only the frames kept by frame skipping), i.e., an image with $4$ channels.

The network is fairly standard, performing
  • 32 filters of size $8 \times 8$ with stride 4 and ReLU,
  • 64 filters of size $4 \times 4$ with stride 2 and ReLU,
  • 64 filters of size $3 \times 3$ with stride 1 and ReLU,
  • fully connected layer with 512 units and ReLU,
  • output layer with 18 output units (one for each action).
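To see how large the input to the fully connected layer is, the spatial sizes can be traced through the three convolutions. The sketch below assumes convolutions without padding, which the slide does not state explicitly; `conv_output_size` is a hypothetical helper.

```python
def conv_output_size(size, kernel, stride):
    """Spatial size after a convolution without padding (an assumption here)."""
    return (size - kernel) // stride + 1

size = 84  # the preprocessed frames are 84 x 84
for kernel, stride in [(8, 4), (4, 2), (3, 1)]:
    size = conv_output_size(size, kernel, stride)

# The last convolution has 64 filters, so this many values are flattened
# before the fully connected layer with 512 units.
flattened = size * size * 64
```

Under this assumption the feature maps shrink from 84 to 20, 9 and finally 7 pixels per side.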

SLIDE 6

Deep Q Networks

Network is trained with RMSProp to minimize the following loss:

$$L \stackrel{\text{def}}{=} \mathbb{E}_{(s, a, r, s') \sim \text{data}}\Big[\big(r + \gamma \max_{a'} Q(s', a'; \bar\theta) - Q(s, a; \theta)\big)^2\Big].$$

An $\varepsilon$-greedy behavior policy is utilized.

Important improvements:
  • experience replay: the generated episodes are stored in a buffer as $(s, a, r, s')$ quadruples, and for training a transition is sampled uniformly;
  • separate target network $\bar\theta$: to prevent instabilities, a separate target network is used to estimate state-value function. The weights are not trained, but copied from the trained network once in a while;
  • reward clipping of $\big(r + \gamma \max_{a'} Q(s', a'; \bar\theta) - Q(s, a; \theta)\big)$ to $[-1, 1]$.
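The loss can be written down directly for a sampled minibatch. This is a sketch under stated assumptions: the two Q-value arrays stand in for the outputs of the trained and the target network, and the `dones` mask for terminal transitions is an addition not mentioned on the slide.

```python
import numpy as np

def dqn_loss(q_online, q_target_next, actions, rewards, dones, gamma=0.99):
    """Mean squared TD error of a minibatch.

    q_online:      Q(s, .; theta),      shape [batch, actions]
    q_target_next: Q(s', .; theta_bar), shape [batch, actions]
    """
    batch = np.arange(len(actions))
    # Bootstrap from the target network; terminal transitions do not bootstrap.
    targets = rewards + gamma * (1.0 - dones) * q_target_next.max(axis=1)
    td_errors = targets - q_online[batch, actions]
    return np.mean(td_errors ** 2)

# Hypothetical network outputs for a batch of two transitions.
q_online = np.array([[1.0, 2.0], [0.5, 0.0]])
q_target_next = np.array([[0.0, 1.0], [3.0, 2.0]])
loss = dqn_loss(q_online, q_target_next,
                actions=np.array([1, 0]), rewards=np.array([1.0, 0.0]),
                dones=np.array([0.0, 1.0]), gamma=0.5)
```

In a real agent only `q_online` would receive gradients; the target network weights $\bar\theta$ are copied from $\theta$ once in a while.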

SLIDE 7

Deep Q Networks Hyperparameters

| Hyperparameter | Value |
|---|---|
| minibatch size | 32 |
| replay buffer size | 1M |
| target network update frequency | 10k |
| discount factor | 0.99 |
| training frames | 50M |
| RMSProp learning rate and momentum | 0.00025, 0.95 |
| initial $\varepsilon$, final $\varepsilon$ and frame of final $\varepsilon$ | 1.0, 0.1, 1M |
| replay start size | 50k |
| no-op max | 30 |

SLIDE 8

Rainbow

There have been many suggested improvements to the DQN architecture. At the end of 2017, the paper Rainbow: Combining Improvements in Deep Reinforcement Learning combined six of them into a single architecture called Rainbow.

Figure 1 of the paper "Rainbow: Combining Improvements in Deep Reinforcement Learning" by Matteo Hessel et al.

SLIDE 9

Rainbow DQN Extensions

Double Q-learning

Similarly to double Q-learning, instead of

$$r + \gamma \max_{a'} Q(s', a'; \bar\theta) - Q(s, a; \theta),$$

we minimize

$$r + \gamma Q\big(s', \arg\max_{a'} Q(s', a'; \theta); \bar\theta\big) - Q(s, a; \theta).$$

[Figure: error as a function of the number of actions, from 2 to 1024.]

Figure 1: The orange bars show the bias in a single Q-learning update when the action values are $Q(s, a) = V_*(s) + \epsilon_a$ and the errors $\{\epsilon_a\}_{a=1}^m$ are independent standard normal random variables. The second set of action values $Q'$, used for the blue bars, was generated identically and independently. All bars are the average of 100 repetitions.

Figure 1 of the paper "Deep Reinforcement Learning with Double Q-learning" by Hado van Hasselt et al.
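The difference between the two targets is only in how the next action is chosen and evaluated. The following sketch contrasts them on made-up Q-values; the array names are hypothetical stand-ins for the outputs of the trained network ($\theta$) and the target network ($\bar\theta$) in state $s'$.

```python
import numpy as np

def dqn_targets(q_target_next, rewards, gamma):
    # The target network both selects and evaluates the next action.
    return rewards + gamma * q_target_next.max(axis=1)

def double_dqn_targets(q_online_next, q_target_next, rewards, gamma):
    batch = np.arange(len(rewards))
    best_actions = q_online_next.argmax(axis=1)                  # selection with theta
    return rewards + gamma * q_target_next[batch, best_actions]  # evaluation with theta_bar

q_online_next = np.array([[1.0, 3.0, 2.0]])
q_target_next = np.array([[5.0, 0.5, 1.0]])
rewards = np.array([1.0])
dqn = dqn_targets(q_target_next, rewards, gamma=0.5)
ddqn = double_dqn_targets(q_online_next, q_target_next, rewards, gamma=0.5)
```

Here the target network overestimates the first action; the double variant ignores it because the trained network prefers a different action.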

SLIDE 10

Rainbow DQN Extensions

Double Q-learning

[Figure: three rows of panels plotted over the state. Columns: "True value and an estimate" ($Q_t(s, a)$ and $Q_*(s, a)$), "All estimates and max" ($\max_a Q_t(s, a)$), and "Bias as function of state" ($\max_a Q_t(s, a) - \max_a Q_*(s, a)$ and the Double-Q estimate). Average errors shown in the three rows: +0.61 and −0.02, +0.47 and +0.02, +3.35 and −0.02.]

Figure 2 of the paper "Deep Reinforcement Learning with Double Q-learning" by Hado van Hasselt et al.

SLIDE 11

Rainbow DQN Extensions

Double Q-learning

[Figure: value estimates over training steps (in millions) on Alien, Space Invaders, Time Pilot and Zaxxon, showing the DQN estimate, Double DQN estimate, DQN true value and Double DQN true value; value estimates (log scale) and scores of DQN and Double DQN on Wizard of Wor and Asterix.]

Figure 3 of the paper "Deep Reinforcement Learning with Double Q-learning" by Hado van Hasselt et al.

SLIDE 12

Rainbow DQN Extensions

Double Q-learning

Table 1 of the paper "Deep Reinforcement Learning with Double Q-learning" by Hado van Hasselt et al. Table 2 of the paper "Deep Reinforcement Learning with Double Q-learning" by Hado van Hasselt et al.

SLIDE 13

Rainbow DQN Extensions

Prioritized Replay

Instead of sampling the transitions uniformly from the replay buffer, we prefer those with a large TD error. Therefore, we sample transitions according to their probability

$$p_t \propto \Big|r + \gamma \max_{a'} Q(s', a'; \bar\theta) - Q(s, a; \theta)\Big|^\omega,$$

where $\omega$ controls the shape of the distribution (which is uniform for $\omega = 0$ and corresponds to TD error for $\omega = 1$).

New transitions are inserted into the replay buffer with maximum probability to support exploration of all encountered transitions.
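A small sketch of how the exponent $\omega$ shapes the sampling distribution. It recomputes the probabilities from a plain array of TD errors, which is an illustrative simplification; the function name and the data are hypothetical, and practical implementations usually keep the priorities in a tree structure so that sampling stays fast.

```python
import numpy as np

def sampling_probabilities(td_errors, omega):
    """Probabilities proportional to |TD error|^omega."""
    priorities = np.abs(td_errors) ** omega
    return priorities / priorities.sum()

td_errors = np.array([0.5, -1.0, 2.0, 0.5])
uniform = sampling_probabilities(td_errors, omega=0.0)       # omega = 0
proportional = sampling_probabilities(td_errors, omega=1.0)  # omega = 1

# Draw a minibatch of two transition indices according to the priorities.
rng = np.random.default_rng(0)
indices = rng.choice(len(td_errors), size=2, p=proportional)
```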

SLIDE 14

Rainbow DQN Extensions

Prioritized Replay

Because we now sample transitions according to $p_t$ instead of uniformly, on-policy distribution and sampling distribution differ. To compensate, we therefore utilize importance sampling with ratio

$$\rho_t = \left(\frac{1/N}{p_t}\right)^\beta.$$

The authors in fact utilize $\rho_t / \max_i \rho_i$ “for stability reasons”.
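The normalized ratios can be computed as follows. This sketch assumes the maximum is taken over all $N$ stored transitions (taking it over the sampled minibatch only would be another option); the function name and the probabilities are illustrative.

```python
import numpy as np

def importance_weights(probabilities, sampled, beta):
    """Normalized importance-sampling ratios of the sampled transitions."""
    n = len(probabilities)
    rho = ((1.0 / n) / probabilities) ** beta
    # Dividing by the largest ratio keeps all weights in (0, 1].
    return rho[sampled] / rho.max()

probabilities = np.array([0.125, 0.25, 0.5, 0.125])
weights = importance_weights(probabilities, sampled=np.array([2, 0]), beta=1.0)
```

Frequently sampled transitions get a small weight, while the rarest ones get weight 1, so the weights only ever scale updates down.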

SLIDE 15

Rainbow DQN Extensions

Prioritized Replay

Algorithm 1 Double DQN with proportional prioritization

Input: minibatch $k$, step-size $\eta$, replay period $K$ and size $N$, exponents $\alpha$ and $\beta$, budget $T$.
Initialize replay memory $H = \emptyset$, $\Delta = 0$, $p_1 = 1$
Observe $S_0$ and choose $A_0 \sim \pi_\theta(S_0)$
for $t = 1$ to $T$ do
    Observe $S_t, R_t, \gamma_t$
    Store transition $(S_{t-1}, A_{t-1}, R_t, \gamma_t, S_t)$ in $H$ with maximal priority $p_t = \max_{i<t} p_i$
    if $t \equiv 0 \bmod K$ then
        for $j = 1$ to $k$ do
            Sample transition $j \sim P(j) = p_j^\alpha / \sum_i p_i^\alpha$
            Compute importance-sampling weight $w_j = (N \cdot P(j))^{-\beta} / \max_i w_i$
            Compute TD-error $\delta_j = R_j + \gamma_j Q_\text{target}(S_j, \arg\max_a Q(S_j, a)) - Q(S_{j-1}, A_{j-1})$
            Update transition priority $p_j \leftarrow |\delta_j|$
            Accumulate weight-change $\Delta \leftarrow \Delta + w_j \cdot \delta_j \cdot \nabla_\theta Q(S_{j-1}, A_{j-1})$
        end for
        Update weights $\theta \leftarrow \theta + \eta \cdot \Delta$, reset $\Delta = 0$
        From time to time copy weights into target network $\theta_\text{target} \leftarrow \theta$
    end if
    Choose action $A_t \sim \pi_\theta(S_t)$
end for

Algorithm 1 of the paper "Prioritized Experience Replay" by Tom Schaul et al.

SLIDE 16

Rainbow DQN Extensions

Duelling Networks

Instead of computing directly $Q(s, a; \theta)$, we compose it from the following quantities:
  • value function for a given state $s$,
  • advantage function computing an advantage of using action $a$ in state $s$.

$$Q(s, a) \stackrel{\text{def}}{=} V\big(f(s; \zeta); \eta\big) + A\big(f(s; \zeta), a; \psi\big) - \frac{\sum_{a' \in \mathcal{A}} A\big(f(s; \zeta), a'; \psi\big)}{|\mathcal{A}|}$$

Figure 1 of the paper "Dueling Network Architectures for Deep Reinforcement Learning" by Ziyu Wang et al.
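The aggregation of the two streams is a one-liner once both outputs are available. In this sketch `value` and `advantages` are hypothetical outputs of the value and advantage streams for a batch of states.

```python
import numpy as np

def dueling_q_values(value, advantages):
    """Combine V [batch, 1] and A [batch, actions] into Q [batch, actions]."""
    # Subtracting the mean advantage makes the decomposition identifiable:
    # the average of Q over actions then equals V.
    return value + advantages - advantages.mean(axis=1, keepdims=True)

value = np.array([[2.0]])
advantages = np.array([[1.0, 0.0, -4.0]])
q_values = dueling_q_values(value, advantages)
```

Without the subtraction, adding a constant to $V$ and removing it from every $A$ would leave $Q$ unchanged, so the two streams could not be learned separately.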

SLIDE 17

Rainbow DQN Extensions

Duelling Networks

Figure 3 of the paper "Dueling Network Architectures for Deep Reinforcement Learning" by Ziyu Wang et al.

SLIDE 18

Rainbow DQN Extensions

Duelling Networks

[Figure panels: VALUE, ADVANTAGE, VALUE, ADVANTAGE]

Figure 2 of the paper "Dueling Network Architectures for Deep Reinforcement Learning" by Ziyu Wang et al.

SLIDE 19

Rainbow DQN Extensions

Duelling Networks

| | 30 no-ops, Mean | 30 no-ops, Median | Human Starts, Mean | Human Starts, Median |
|---|---|---|---|---|
| Prior. Duel Clip | 591.9% | 172.1% | 567.0% | 115.3% |
| Prior. Single | 434.6% | 123.7% | 386.7% | 112.9% |
| Duel Clip | 373.1% | 151.5% | 343.8% | 117.1% |
| Single Clip | 341.2% | 132.6% | 302.8% | 114.1% |
| Single | 307.3% | 117.8% | 332.9% | 110.9% |
| Nature DQN | 227.9% | 79.1% | 219.6% | 68.5% |

Table 1 of the paper "Dueling Network Architectures for Deep Reinforcement Learning" by Ziyu Wang et al.

SLIDE 20

Rainbow DQN Extensions

Multi-step Learning

Instead of Q-learning, we use an $n$-step variant of Q-learning (to be exact, we use $n$-step Expected Sarsa) to minimize

$$\sum_{i=1}^n \gamma^{i-1} r_i + \gamma^n \max_{a'} Q(s', a'; \bar\theta) - Q(s, a; \theta),$$

where $s'$ is the state reached after $n$ steps.

This changes the off-policy algorithm to on-policy, but it is not discussed in any way by the authors.
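The $n$-step target is a discounted sum of the next $n$ rewards plus a bootstrapped tail. A minimal sketch with hypothetical inputs; `q_target_after_n` stands in for the target-network outputs in the state reached after $n$ steps.

```python
import numpy as np

def n_step_target(rewards, q_target_after_n, gamma):
    """rewards: r_1..r_n; q_target_after_n: Q(s', .; theta_bar) after n steps."""
    n = len(rewards)
    discounts = gamma ** np.arange(n)  # gamma^(i-1) for i = 1..n
    return discounts @ rewards + gamma ** n * q_target_after_n.max()

target = n_step_target(np.array([1.0, 0.0, 2.0]), np.array([4.0, 8.0]), gamma=0.5)
```

With a single reward the function reduces to the one-step Q-learning target.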

SLIDE 21

Rainbow DQN Extensions

Noisy Nets

Noisy Nets are neural networks whose weights and biases are perturbed by a parametric function of a noise.

The parameters $\theta$ are represented as

$$\theta \stackrel{\text{def}}{=} \mu + \sigma \odot \varepsilon,$$

where $\varepsilon$ is zero-mean noise with fixed statistics. We therefore learn the parameters $\zeta \stackrel{\text{def}}{=} (\mu, \sigma)$.

Therefore, a fully connected layer $y = wx + b$ is represented in the following way in Noisy Nets:

$$y = (\mu_w + \sigma_w \odot \varepsilon_w) x + (\mu_b + \sigma_b \odot \varepsilon_b).$$
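A noisy fully connected layer can be sketched as follows, here with independent Gaussian noise for every weight and bias. The function name, shapes and values are illustrative assumptions; in training, $\mu$ and $\sigma$ would be the learned parameters.

```python
import numpy as np

def noisy_linear(x, mu_w, sigma_w, mu_b, sigma_b, rng):
    """y = (mu_w + sigma_w * eps_w) x + (mu_b + sigma_b * eps_b)."""
    eps_w = rng.standard_normal(mu_w.shape)  # zero-mean noise for weights
    eps_b = rng.standard_normal(mu_b.shape)  # zero-mean noise for biases
    w = mu_w + sigma_w * eps_w               # element-wise product
    b = mu_b + sigma_b * eps_b
    return w @ x + b

rng = np.random.default_rng(42)
mu_w, mu_b = np.ones((2, 3)), np.zeros(2)
x = np.array([1.0, 2.0, 3.0])
# With sigma = 0 the layer degenerates to an ordinary fully connected layer.
y_deterministic = noisy_linear(x, mu_w, np.zeros((2, 3)), mu_b, np.zeros(2), rng)
y_noisy = noisy_linear(x, mu_w, np.full((2, 3), 0.1), mu_b, np.full(2, 0.1), rng)
```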

SLIDE 22

Rainbow DQN Extensions

Noisy Nets

The noise $\varepsilon$ can be for example independent Gaussian noise. However, for performance reasons, factorized Gaussian noise is used to generate a matrix of noise. If $\varepsilon_{i,j}$ is noise corresponding to a layer with $i$ inputs and $j$ outputs, we generate independent noise $\varepsilon_i$ for input neurons, independent noise $\varepsilon_j$ for output neurons, and set

$$\varepsilon_{i,j} = f(\varepsilon_i) f(\varepsilon_j)$$

for $f(x) = \operatorname{sign}(x) \sqrt{|x|}$.

The authors generate noise samples for every batch, sharing the noise for all batch instances.

Deep Q Networks

When training a DQN, $\varepsilon$-greedy is no longer used and all policies are greedy, and all fully connected layers are parametrized as noisy nets.
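The factorized construction needs only one noise sample per input and per output neuron instead of one per weight. A minimal sketch; the helper names and the layer size are hypothetical.

```python
import numpy as np

def f(x):
    return np.sign(x) * np.sqrt(np.abs(x))

def factorized_noise(n_inputs, n_outputs, rng):
    """Noise matrix built from n_inputs + n_outputs Gaussian samples."""
    eps_in = f(rng.standard_normal(n_inputs))
    eps_out = f(rng.standard_normal(n_outputs))
    # Entry [j, i] is f(eps_i) * f(eps_j); the matrix has shape [outputs, inputs].
    eps_w = np.outer(eps_out, eps_in)
    return eps_w, eps_in, eps_out

rng = np.random.default_rng(0)
eps_w, eps_in, eps_out = factorized_noise(n_inputs=4, n_outputs=3, rng=rng)
```

The resulting matrix is an outer product, so its entries are no longer independent; that is the price of the cheaper sampling.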

SLIDE 23

Rainbow DQN Extensions

Noisy Nets

Table 1 of the paper "Noisy Networks for Exploration" by Meire Fortunato et al. Figure 2 of the paper "Noisy Networks for Exploration" by Meire Fortunato et al.

SLIDE 24

Rainbow DQN Extensions

Noisy Nets

Figure 3: Comparison of the learning curves of the average noise parameter $\bar\Sigma$ across five Atari games in NoisyNet-DQN. The results are averaged across 3 seeds and error bars (+/- standard deviation) are plotted.

Figure 3 of the paper "Noisy Networks for Exploration" by Meire Fortunato et al.

SLIDE 25

Rainbow DQN Extensions

Distributional RL

Instead of an expected return $Q(s, a)$, we could estimate the distribution of returns $Z(s, a)$. These distributions satisfy a distributional Bellman equation:

$$Z(s, a) = R(s, a) + \gamma Z(s', a').$$

The authors of the paper prove similar properties of the distributional Bellman operator compared to the regular Bellman operator, mainly being a contraction under a suitable metric (Wasserstein metric).

SLIDE 26

Rainbow DQN Extensions

Distributional RL

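The distribution of returns is modeled as a discrete distribution parametrized by the number of atoms $N \in \mathbb{N}$ and by $V_\text{MIN}, V_\text{MAX} \in \mathbb{R}$. Support of the distribution are atoms

$$\{z_i \stackrel{\text{def}}{=} V_\text{MIN} + i \Delta z : 0 \le i < N\} \quad \text{for } \Delta z \stackrel{\text{def}}{=} \frac{V_\text{MAX} - V_\text{MIN}}{N - 1}.$$

The atom probabilities are predicted using a $\operatorname{softmax}$ distribution as

$$Z_\theta(s, a) = \left\{ z_i \text{ with probability } p_i = \frac{e^{f_i(s, a)}}{\sum_j e^{f_j(s, a)}} \right\}.$$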
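The support and the softmax parametrization are easy to check numerically. The values $N = 51$, $V_\text{MIN} = -10$ and $V_\text{MAX} = 10$ below are illustrative choices for this sketch, and `logits` is a hypothetical stand-in for the network outputs $f_i(s, a)$.

```python
import numpy as np

def support_atoms(v_min, v_max, n_atoms):
    delta_z = (v_max - v_min) / (n_atoms - 1)
    return v_min + delta_z * np.arange(n_atoms), delta_z

def atom_probabilities(logits):
    """Softmax over the atom dimension (last axis)."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

atoms, delta_z = support_atoms(v_min=-10.0, v_max=10.0, n_atoms=51)
logits = np.zeros((2, 51))          # two actions, all atoms equally likely
probabilities = atom_probabilities(logits)
q_values = probabilities @ atoms    # expected return of every action
```

The expected return recovered in the last line is what a greedy policy would compare across actions.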

SLIDE 27

Rainbow DQN Extensions

[Figure, panels (a) to (d), with labels $P^\pi Z$, $R + \gamma P^\pi Z$ and $T^\pi Z$.]

Figure 1 of the paper "A Distributional Perspective on Reinforcement Learning" by Marc G. Bellemare et al.

Distributional RL

After the Bellman update, the support of the distribution $R(s, a) + \gamma Z(s', a')$ is not the same as the original support. We therefore project it to the original support by proportionally mapping each atom of the Bellman update to immediate neighbors in the original support:

$$\Phi\big(R(s, a) + \gamma Z(s', a')\big)_i \stackrel{\text{def}}{=} \sum_{j=0}^{N-1} \left[ 1 - \frac{\Big| [r + \gamma z_j]_{V_\text{MIN}}^{V_\text{MAX}} - z_i \Big|}{\Delta z} \right]_0^1 p_j(s', a').$$

The network is trained to minimize the Kullback-Leibler divergence between the current distribution and the (mapped) distribution of the one-step update

$$D_\text{KL}\Big(\Phi\big(R + \gamma \max_{a'} Z(s', a')\big) \Big\| Z(s, a)\Big).$$

SLIDE 28

Rainbow DQN Extensions

Distributional RL

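Algorithm 1 Categorical Algorithm

input A transition $x_t, a_t, r_t, x_{t+1}, \gamma_t \in [0, 1]$
    $Q(x_{t+1}, a) := \sum_i z_i p_i(x_{t+1}, a)$
    $a^* \leftarrow \arg\max_a Q(x_{t+1}, a)$
    $m_i = 0, \quad i \in 0, \ldots, N - 1$
    for $j \in 0, \ldots, N - 1$ do
        # Compute the projection of $\hat{T} z_j$ onto the support $\{z_i\}$
        $\hat{T} z_j \leftarrow [r_t + \gamma_t z_j]_{V_\text{MIN}}^{V_\text{MAX}}$
        $b_j \leftarrow (\hat{T} z_j - V_\text{MIN}) / \Delta z$   # $b_j \in [0, N - 1]$
        $l \leftarrow \lfloor b_j \rfloor, \quad u \leftarrow \lceil b_j \rceil$
        # Distribute probability of $\hat{T} z_j$
        $m_l \leftarrow m_l + p_j(x_{t+1}, a^*)(u - b_j)$
        $m_u \leftarrow m_u + p_j(x_{t+1}, a^*)(b_j - l)$
    end for
output $-\sum_i m_i \log p_i(x_t, a_t)$   # Cross-entropy loss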
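The projection step can be sketched in NumPy for a single transition. One deviation from the pseudocode is labeled in the code: when $b_j$ is an integer, $l = u$ and both increments would be zero, so this sketch assigns the whole probability to that atom. The function names and the tiny three-atom support are hypothetical.

```python
import numpy as np

def categorical_projection(next_probs, reward, gamma, atoms):
    """Project the distribution of r + gamma * z_j onto the fixed atoms."""
    n = len(atoms)
    v_min, v_max = atoms[0], atoms[-1]
    delta_z = (v_max - v_min) / (n - 1)
    m = np.zeros(n)
    tz = np.clip(reward + gamma * atoms, v_min, v_max)
    b = (tz - v_min) / delta_z
    lower = np.floor(b).astype(int)
    upper = np.ceil(b).astype(int)
    for j in range(n):
        if lower[j] == upper[j]:
            # Assumption of this sketch: b_j hits an atom exactly, so the
            # whole mass goes there (the pseudocode would add zero twice).
            m[lower[j]] += next_probs[j]
        else:
            m[lower[j]] += next_probs[j] * (upper[j] - b[j])
            m[upper[j]] += next_probs[j] * (b[j] - lower[j])
    return m

def cross_entropy(m, current_probs):
    return -np.sum(m * np.log(current_probs))

atoms = np.array([0.0, 1.0, 2.0])
next_probs = np.array([0.0, 1.0, 0.0])   # p_j(x_{t+1}, a*)
m = categorical_projection(next_probs, reward=0.5, gamma=1.0, atoms=atoms)
loss = cross_entropy(m, current_probs=np.array([0.25, 0.5, 0.25]))
```

In the example the shifted atom lands halfway between two atoms of the support, so its probability is split evenly between them.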

Algorithm 1 of the paper "A Distributional Perspective on Reinforcement Learning" by Marc G. Bellemare et al.

SLIDE 29

Rainbow DQN Extensions

Distributional RL

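| | Mean | Median | > H.B. | > DQN |
|---|---|---|---|---|
| DQN | 228% | 79% | 24 | |
| DDQN | 307% | 118% | 33 | 43 |
| DUEL. | 373% | 151% | 37 | 50 |
| PRIOR. | 434% | 124% | 39 | 48 |
| PR. DUEL. | 592% | 172% | 39 | 44 |
| C51 | 701% | 178% | 40 | 50 |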

Figure 6 of the paper "A Distributional Perspective on Reinforcement Learning" by Marc G. Bellemare et al.

[Figure: probability over return for the actions Right, Left, Right+Laser, Left+Laser, Laser and Noop.]

Figure 4. Learned value distribution during an episode of SPACE INVADERS. Different actions are shaded different colours. Returns below 0 (which do not occur in SPACE INVADERS) are not shown here as the agent assigns virtually no probability to them.

Figure 4 of the paper "A Distributional Perspective on Reinforcement Learning" by Marc G. Bellemare et al.

SLIDE 30

Rainbow DQN Extensions

Distributional RL

Figure 18. SPACE INVADERS: Top-Left: Multi-modal distribution with high uncertainty. Top-Right: Subsequent frame, a more certain demise. Bottom-Left: Clear difference between actions. Bottom-Middle: Uncertain survival. Bottom-Right: Certain success.

Figure 18 of the paper "A Distributional Perspective on Reinforcement Learning" by Marc G. Bellemare et al.

SLIDE 31

Rainbow DQN Extensions

Distributional RL

[Figure: average score over training frames (millions) on ASTERIX, Q*BERT, BREAKOUT, PONG and SEAQUEST. Legend: Categorical DQN with 5, 11, 21 and 51 returns, DQN, Bernoulli, Dueling Arch.]

Figure 3. Categorical DQN: Varying number of atoms in the discrete distribution. Scores are moving averages over 5 million frames.

Figure 3 of the paper "A Distributional Perspective on Reinforcement Learning" by Marc G. Bellemare et al.

SLIDE 32

Rainbow Architecture

Rainbow combines all described DQN extensions. Instead of $1$-step updates, $n$-step updates are utilized, and KL divergence of the current and target return distribution is minimized:

$$D_\text{KL}\Big(\Phi\big(G_{t:t+n} + \gamma^n \max_{a'} Z(s', a')\big) \Big\| Z(s, a)\Big).$$

The prioritized replay chooses transitions according to the probability

$$p_t \propto \Big(D_\text{KL}\Big(\Phi\big(G_{t:t+n} + \gamma^n \max_{a'} Z(s', a')\big) \Big\| Z(s, a)\Big)\Big)^\omega.$$

Network utilizes duelling architecture feeding the shared representation $f(s; \zeta)$ into value computation $V_i(f(s; \zeta); \eta)$ and advantage computation $A_i(f(s; \zeta), a; \psi)$ for atom $z_i$, and the final probability of atom $z_i$ in state $s$ and action $a$ is computed as

$$p_i(s, a) \stackrel{\text{def}}{=} \frac{e^{V_i(f(s; \zeta); \eta) + A_i(f(s; \zeta), a; \psi) - \sum_{a' \in \mathcal{A}} A_i(f(s; \zeta), a'; \psi) / |\mathcal{A}|}}{\sum_j e^{V_j(f(s; \zeta); \eta) + A_j(f(s; \zeta), a; \psi) - \sum_{a' \in \mathcal{A}} A_j(f(s; \zeta), a'; \psi) / |\mathcal{A}|}}.$$
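The final atom probabilities combine the duelling aggregation with a softmax over atoms. The sketch below assumes that the value stream produces one output per atom (otherwise it would cancel in the softmax); the random stream outputs, the sizes and the support are hypothetical stand-ins for a real network.

```python
import numpy as np

def rainbow_atom_probabilities(value_logits, advantage_logits):
    """value_logits: [atoms]; advantage_logits: [actions, atoms]."""
    # Duelling aggregation per atom: subtract the mean advantage over actions.
    logits = value_logits + advantage_logits - advantage_logits.mean(axis=0, keepdims=True)
    # Softmax over atoms, separately for every action.
    logits = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
n_actions, n_atoms = 3, 5
probabilities = rainbow_atom_probabilities(
    rng.standard_normal(n_atoms), rng.standard_normal((n_actions, n_atoms)))

atoms = np.linspace(-1.0, 1.0, n_atoms)  # illustrative support
q_values = probabilities @ atoms         # expected return per action
greedy_action = int(q_values.argmax())
```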

SLIDE 33

Rainbow Hyperparameters

Table 1 of the paper "Rainbow: Combining Improvements in Deep Reinforcement Learning" by Matteo Hessel et al.

SLIDE 34

Rainbow Results

Figure 1 of the paper "Rainbow: Combining Improvements in Deep Reinforcement Learning" by Matteo Hessel et al.

| Agent | no-ops | human starts |
|---|---|---|
| DQN | 79% | 68% |
| DDQN (*) | 117% | 110% |
| Prioritized DDQN (*) | 140% | 128% |
| Dueling DDQN (*) | 151% | 117% |
| A3C (*) | - | 116% |
| Noisy DQN | 118% | 102% |
| Distributional DQN | 164% | 125% |
| Rainbow | 223% | 153% |

Table 2 of the paper "Rainbow: Combining Improvements in Deep Reinforcement Learning" by Matteo Hessel et al.

SLIDE 35

Rainbow Results

Figure 1 of the paper "Rainbow: Combining Improvements in Deep Reinforcement Learning" by Matteo Hessel et al. Figure 3 of the paper "Rainbow: Combining Improvements in Deep Reinforcement Learning" by Matteo Hessel et al.

SLIDE 36

Rainbow Ablations

Figure 2 of the paper "Rainbow: Combining Improvements in Deep Reinforcement Learning" by Matteo Hessel et al.

SLIDE 37

Rainbow Ablations

Figure 4 of the paper "Rainbow: Combining Improvements in Deep Reinforcement Learning" by Matteo Hessel et al.
