NPFL122, Lecture 11
V-trace, PopArt Normalization, Partially Observable MDPs
Milan Straka
January 7, 2019
Charles University in Prague, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics
IMPALA (Importance Weighted Actor-Learner Architecture) was introduced in a Feb 2018 paper and allows a massively distributed implementation of an actor-critic-like learning algorithm. Compared to A3C-based agents, in which workers communicate gradients with respect to the policy parameters to a central parameter server, IMPALA actors send complete trajectories of experience to a centralized learner.
[Diagram: IMPALA actors send observations/trajectories to one or more learners and receive updated parameters, whereas in the A3C-style setup workers send gradients to a master.]
Figure 1 of the paper "IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures" by Lasse Espeholt et al.
[Timeline diagrams of environment steps, forward passes and backward passes over 4 time steps, comparing (a) Batched A2C (sync step), (b) Batched A2C (sync traj.) and (c) IMPALA.]
Figure 2 of the paper "IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures" by Lasse Espeholt et al.
If many actors are used, the policy used to generate a trajectory can lag behind the latest policy of the learner; the V-trace off-policy correction described next accounts for this lag.
Consider a trajectory $(S_t, A_t, R_{t+1})_{t=s}^{t=s+n}$ generated by a behaviour policy $b$. The $n$-step V-trace target for $S_s$ is defined as
$$v_s \stackrel{\text{def}}{=} V(S_s) + \sum_{t=s}^{s+n-1} \gamma^{t-s} \left(\prod_{i=s}^{t-1} c_i\right) \delta_t V,$$
where $\delta_t V$ is the temporal difference for V
$$\delta_t V \stackrel{\text{def}}{=} \rho_t \big(R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\big),$$
and $\rho_t$ and $c_i$ are truncated importance sampling ratios with $\bar\rho \geq \bar c$:
$$\rho_t \stackrel{\text{def}}{=} \min\left(\bar\rho, \frac{\pi(A_t \mid S_t)}{b(A_t \mid S_t)}\right), \qquad c_i \stackrel{\text{def}}{=} \min\left(\bar c, \frac{\pi(A_i \mid S_i)}{b(A_i \mid S_i)}\right).$$
Note that if $b = \pi$ and assuming $\bar c \geq 1$, $v_s$ reduces to the $n$-step Bellman target.
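To make the target concrete, here is a minimal NumPy sketch of the V-trace computation for a single trajectory; the function name, array layout and default coefficients are assumptions for illustration, not taken from the paper.

```python
import numpy as np

def vtrace_targets(values, next_values, rewards, is_ratios,
                   gamma=0.99, rho_bar=1.0, c_bar=1.0):
    """Compute V-trace targets v_s for one trajectory of length n.

    values      -- V(S_s), ..., V(S_{s+n-1})
    next_values -- V(S_{s+1}), ..., V(S_{s+n})
    rewards     -- R_{s+1}, ..., R_{s+n}
    is_ratios   -- pi(A_t|S_t) / b(A_t|S_t) under the behaviour policy b
    """
    rho = np.minimum(rho_bar, is_ratios)   # truncated rho_t
    c = np.minimum(c_bar, is_ratios)       # truncated c_t
    deltas = rho * (rewards + gamma * next_values - values)  # delta_t V

    # v_s = V(S_s) + sum_t gamma^{t-s} (prod_i c_i) delta_t V, accumulated
    # backwards via acc_s = delta_s + gamma * c_s * acc_{s+1}.
    targets = np.empty_like(values, dtype=float)
    acc = 0.0
    for t in reversed(range(len(values))):
        acc = deltas[t] + gamma * c[t] * acc
        targets[t] = values[t] + acc
    return targets
```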
Note that the truncated IS weights $\rho_t$ and $c_i$ play different roles:

- The $\rho_t$ appears in the definition of $\delta_t V$ and defines the fixed point of the update rule. For $\bar\rho = \infty$, the target is the value function $v_\pi$; if $\bar\rho < \infty$, the fixed point is somewhere between $v_\pi$ and $v_b$. Notice that we do not compute a product of these coefficients.

- The $c_i$ impacts the speed of convergence (the contraction rate of the Bellman operator), not the sought policy. Because a product of the ratios is computed, it plays an important role in variance reduction.

The paper utilizes $\bar c = 1$ and, out of $\bar\rho \in \{1, 10, 100\}$, $\bar\rho = 1$ works empirically the best.
Consider parametrized functions computing $v(s; \theta)$ and $\pi(a \mid s; \omega)$. Assuming the defined $n$-step V-trace target
$$v_s \stackrel{\text{def}}{=} V(S_s) + \sum_{t=s}^{s+n-1} \gamma^{t-s} \left(\prod_{i=s}^{t-1} c_i\right) \delta_t V,$$
we update the critic in the direction of
$$\big(v_s - v(S_s; \theta)\big) \nabla_\theta v(S_s; \theta)$$
and the actor in the direction of the policy gradient
$$\rho_s \nabla_\omega \log \pi(A_s \mid S_s; \omega) \big(R_{s+1} + \gamma v_{s+1} - v(S_s; \theta)\big).$$
Finally, we again add the entropy regularization term $H(\pi(\cdot \mid S_s; \omega))$ to the loss function.
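A rough sketch of how these terms can be combined into a single scalar loss (in practice differentiated by an autodiff framework) follows; the function and argument names are illustrative assumptions, and the V-trace targets and advantages are treated as constants, so no gradient would flow through them.

```python
import numpy as np

def impala_loss(vs, next_vs, values, rewards, rho_s, log_pi_a, entropy,
                gamma=0.99, value_coef=0.5, entropy_coef=0.01):
    """Scalar IMPALA-style loss over a batch of transitions.

    vs, next_vs -- V-trace targets v_s and v_{s+1} (treated as constants)
    values      -- current critic outputs v(S_s; theta)
    rho_s       -- truncated importance ratios rho_s
    log_pi_a    -- log pi(A_s | S_s; omega) of the taken actions
    entropy     -- entropy H(pi(.|S_s; omega)) of the current policy
    """
    # Critic regression towards the V-trace target.
    critic_loss = value_coef * np.mean((vs - values) ** 2)

    # Policy gradient with the V-trace-corrected advantage.
    advantages = rho_s * (rewards + gamma * next_vs - values)
    actor_loss = -np.mean(advantages * log_pi_a)

    # Entropy bonus (subtracted from the loss) encourages exploration.
    return critic_loss + actor_loss - entropy_coef * np.mean(entropy)
```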
| Architecture                     | CPUs | GPUs¹ | FPS² Task 1 | FPS² Task 2 |
|----------------------------------|------|-------|-------------|-------------|
| Single-Machine                   |      |       |             |             |
| A3C 32 workers                   | 64   | –     | 6.5K        | 9K          |
| Batched A2C (sync step)          | 48   | –     | 9K          | 5K          |
| Batched A2C (sync step)          | 48   | 1     | 13K         | 5.5K        |
| Batched A2C (sync traj.)         | 48   | –     | 16K         | 17.5K       |
| Batched A2C (dyn. batch)         | 48   | 1     | 16K         | 13K         |
| IMPALA 48 actors                 | 48   | –     | 17K         | 20.5K       |
| IMPALA (dyn. batch) 48 actors³   | 48   | 1     | 21K         | 24K         |
| Distributed                      |      |       |             |             |
| A3C                              | 200  | –     | 46K         | 50K         |
| IMPALA                           | 150  | 1     | 80K         |             |
| IMPALA (optimised)               | 375  | 1     | 200K        |             |
| IMPALA (optimised) batch 128     | 500  | 1     | 250K        |             |

¹ Nvidia P100. ² In frames/sec (4 times the agent steps due to action repeat). ³ Limited by the amount of rendering possible on a single machine.

Table 1 of the paper "IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures" by Lasse Espeholt et al.
For Atari experiments, population based training with a population of 24 agents is used to adapt entropy regularization, learning rate, RMSProp $\varepsilon$ and the global gradient norm clipping threshold.

[Diagram contrasting (a) Sequential Optimisation, (b) Parallel Random/Grid Search and (c) Population Based Training, which exploits well-performing hyperparameters and weights and explores perturbed ones.]
Figure 1 of the paper "Population Based Training of Neural Networks" by Max Jaderberg et al.
For Atari experiments, population based training with a population of 24 agents is used to adapt entropy regularization, learning rate, RMSProp $\varepsilon$ and the global gradient norm clipping threshold.

In population based training, several agents are trained in parallel. When an agent is ready (after 5000 episodes):
- it may be overwritten by the parameters and hyperparameters of another agent, if that agent is sufficiently better (its 5000-episode mean capped human normalized score is at least 5% better); and
- independently, each hyperparameter may undergo a change (being multiplied by either 1.2 or 1/1.2 with 33% chance).
[Learning curves (return vs. environment frames, up to 1e9) and final returns across 24 hyperparameter combinations on the DeepMind Lab tasks rooms_watermaze, rooms_keys_doors_puzzle, lasertag_three_opponents_small, explore_goal_locations_small and seekavoid_arena_01, comparing IMPALA (1 GPU, 200 actors), Batched A2C (single machine, 32 workers), A3C (single machine, 32 workers) and distributed A3C (200 workers).]
Figure 4 of the paper "IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures" by Lasse Espeholt et al.
Figures 5, 6 of the paper "IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures" by Lasse Espeholt et al.
Table 4 of the paper "IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures" by Lasse Espeholt et al.
|                | Task 1 | Task 2 | Task 3 | Task 4 | Task 5 |
|----------------|--------|--------|--------|--------|--------|
| Without Replay |        |        |        |        |        |
| V-trace        | 46.8   | 32.9   | 31.3   | 229.2  | 43.8   |
| 1-Step         | 51.8   | 35.9   | 25.4   | 215.8  | 43.7   |
| ε-correction   | 44.2   | 27.3   | 4.3    | 107.7  | 41.5   |
| No-correction  | 40.3   | 29.1   | 5.0    | 94.9   | 16.1   |
| With Replay    |        |        |        |        |        |
| V-trace        | 47.1   | 35.8   | 34.5   | 250.8  | 46.9   |
| 1-Step         | 54.7   | 34.4   | 26.4   | 204.8  | 41.6   |
| ε-correction   | 30.4   | 30.2   | 3.9    | 101.5  | 37.6   |
| No-correction  | 35.0   | 21.1   | 2.8    | 85.0   | 11.2   |

Tasks: Task 1 = rooms_watermaze, Task 2 = rooms_keys_doors_puzzle, Task 3 = lasertag_three_opponents_small, Task 4 = explore_goal_locations_small, Task 5 = seekavoid_arena_01.

Table 2 of the paper "IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures" by Lasse Espeholt et al.
Figure E.1 of the paper "IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures" by Lasse Espeholt et al.
An improvement of IMPALA from Sep 2018, which performs normalization of task rewards instead of just reward clipping. PopArt stands for Preserving Outputs Precisely, while Adaptively Rescaling Targets.

Assume the value estimate $v(s; \theta, \sigma, \mu)$ is computed using a normalized value predictor $n(s; \theta)$
$$v(s; \theta, \sigma, \mu) \stackrel{\text{def}}{=} \sigma n(s; \theta) + \mu,$$
and further assume that $n(s; \theta)$ is an output of a linear function
$$n(s; \theta) \stackrel{\text{def}}{=} \omega^T f(s; \theta \setminus \{\omega, b\}) + b.$$
We can update the $\sigma$ and $\mu$ using an exponentially moving average with decay rate $\beta$ (in the paper, the first moment $\mu$ and the second moment $\upsilon$ are tracked, and the standard deviation is computed as $\sigma = \sqrt{\upsilon - \mu^2}$; decay rate $\beta = 3 \cdot 10^{-4}$ is employed).
Utilizing the parameters $\mu$ and $\sigma$, we can normalize the observed (unnormalized) returns as $(G - \mu)/\sigma$ and use an actor-critic algorithm with advantage $(G - \mu)/\sigma - n(S; \theta)$.

However, in order to make sure the value function estimate does not change when the normalization parameters change, the parameters $\omega, b$ computing the unnormalized value estimate are updated under any change $\mu \rightarrow \mu'$ and $\sigma \rightarrow \sigma'$ as
$$\omega' \stackrel{\text{def}}{=} \frac{\sigma}{\sigma'} \omega, \qquad b' \stackrel{\text{def}}{=} \frac{\sigma b + \mu - \mu'}{\sigma'}.$$

In multi-task settings, we train a task-agnostic policy and task-specific value functions (therefore, $\mu$, $\sigma$ and $n(s; \theta)$ are vectors).
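A minimal NumPy sketch of the PopArt bookkeeping follows, using a scalar value head for simplicity (in the multi-task setting, `mu`, `sigma`, `w` and `b` would be per-task vectors); the class layout and default coefficients are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

class PopArt:
    """Tracks return statistics and rescales the last linear layer so that
    sigma * n(s) + mu is preserved when mu and sigma change."""

    def __init__(self, beta=3e-4):
        self.beta = beta
        self.mu, self.nu = 0.0, 1.0   # first and second moment estimates
        self.sigma = 1.0
        self.w, self.b = 1.0, 0.0     # last linear layer of the value head

    def update(self, returns):
        old_mu, old_sigma = self.mu, self.sigma

        # Exponentially moving first and second moments of the returns.
        self.mu = (1 - self.beta) * self.mu + self.beta * np.mean(returns)
        self.nu = (1 - self.beta) * self.nu + self.beta * np.mean(returns ** 2)
        self.sigma = np.sqrt(self.nu - self.mu ** 2)

        # Preserve outputs precisely: rescale w and b so the unnormalized
        # value estimate sigma' * n(s) + mu' does not change.
        self.w *= old_sigma / self.sigma
        self.b = (old_sigma * self.b + old_mu - self.mu) / self.sigma

    def normalize(self, returns):
        """Normalized targets (G - mu) / sigma used for the critic update."""
        return (returns - self.mu) / self.sigma
```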
| Agent         | Atari-57 Random | Atari-57 Human | Atari-57 (unclipped) Random | Atari-57 (unclipped) Human | DmLab-30 Train | DmLab-30 Test |
|---------------|-----------------|----------------|-----------------------------|----------------------------|----------------|---------------|
| IMPALA        | 59.7%           | 28.5%          | 0.3%                        | 1.0%                       | 60.6%          | 58.4%         |
| PopArt-IMPALA | 110.7%          | 101.5%         | 107.0%                      | 93.7%                      | 73.5%          | 72.8%         |

Table 1 of the paper "Multi-task Deep Reinforcement Learning with PopArt" by Matteo Hessel et al.
[Median human normalised score vs. environment frames on Atari-57 (clipped) and Atari-57 (unclipped), comparing PopArt-IMPALA, MultiHead-IMPALA and IMPALA.]
Figures 1, 2 of the paper "Multi-task Deep Reinforcement Learning with PopArt" by Matteo Hessel et al.
[Undiscounted return (μ ± σ) vs. environment frames on breakout, crazy_climber, qbert and seaquest.]
Figure 3 of the paper "Multi-task Deep Reinforcement Learning with PopArt" by Matteo Hessel et al.
[Mean capped human normalised score vs. environment frames on DmLab-30, comparing PopArt-IMPALA, IMPALA and IMPALA-original, and comparing PopArt-IMPALA with Pixel-PopArt-IMPALA.]
Figures 4, 5 of the paper "Multi-task Deep Reinforcement Learning with PopArt" by Matteo Hessel et al.
Recall that a Markov decision process (MDP) is a quadruple $(\mathcal S, \mathcal A, p, \gamma)$, where:
- $\mathcal S$ is a set of states,
- $\mathcal A$ is a set of actions,
- $p(S_{t+1} = s', R_{t+1} = r \mid S_t = s, A_t = a)$ is a probability that action $a \in \mathcal A$ will lead from state $s \in \mathcal S$ to $s' \in \mathcal S$, producing a reward $r \in \mathbb R$,
- $\gamma \in [0, 1]$ is a discount factor.

A partially observable Markov decision process extends the Markov decision process to a sextuple $(\mathcal S, \mathcal A, p, \gamma, \mathcal O, o)$, where in addition to an MDP:
- $\mathcal O$ is a set of observations,
- $o(O_t \mid S_t, A_{t-1})$ is an observation model.

In robotics (out of the domain of this course), several approaches are used to handle POMDPs, to model uncertainty, imprecise mechanisms and inaccurate sensors.
In Deep RL, partially observable MDPs are usually handled using recurrent networks. After suitable encoding of the input observation $O_t$ and the previous action $A_{t-1}$, an RNN (usually LSTM) unit is used to model the current $S_t$ (or a suitable latent representation of it), which is in turn utilized to produce $A_t$.

Figure 1a of the paper "Unsupervised Predictive Memory in a Goal-Directed Agent" by Greg Wayne et al.
However, keeping all information in the RNN state is substantially limiting. Therefore, memory-augmented networks can be used to store suitable information in an external memory.

We now describe the approach used by the MERLIN architecture, from the Mar 2018 DeepMind paper "Unsupervised Predictive Memory in a Goal-Directed Agent".
Figure 1b of paper "Unsupervised Predictive Memory in a Goal-Directed Agent" by Greg Wayne et al.
Figure 1b of the paper "Unsupervised Predictive Memory in a Goal-Directed Agent" by Greg Wayne et al.

Let $M$ be a memory matrix of size $N_{\mathit{mem}} \times 2|z|$.

Assume we have already encoded observations as $e_t$ and previous action $a_{t-1}$. We concatenate them with previously read vectors and process them by a deep LSTM (two layers are used in the paper) to compute $h_t$. Then, we apply a linear layer to $h_t$, computing $K$ key vectors $k_1, \ldots, k_K$ of length $2|z|$ and $K$ positive scalars $\beta_1, \ldots, \beta_K$.

Reading: For each $i$, we compute the cosine similarity of $k_i$ and all memory rows $M_j$, multiply the similarities by $\beta_i$ and pass them through a softmax to obtain weights $\omega_i$. The read vector is then computed as $M^T \omega_i$.

Writing: We find the one-hot write index $v_{\mathit{wr}}$ to be the least used memory row (we keep usage indicators and add read weights to them). We then compute
$$v_{\mathit{ret}} \leftarrow \gamma v_{\mathit{ret}} + (1 - \gamma) v_{\mathit{wr}},$$
and update the memory matrix using
$$M \leftarrow M + v_{\mathit{wr}}[e_t, 0] + v_{\mathit{ret}}[0, e_t].$$
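The following NumPy sketch illustrates the read and write operations under the above conventions; the helper names, the usage-update rule and the retroactive decay handling are simplifying assumptions, not the exact formulation of the paper.

```python
import numpy as np

def memory_read(M, keys, betas, eps=1e-8):
    """Content-based reading from memory M of shape (N_mem, 2*z_dim).

    keys  -- (K, 2*z_dim) read key vectors k_1, ..., k_K
    betas -- (K,) positive sharpness scalars beta_1, ..., beta_K
    Returns read vectors (K, 2*z_dim) and read weights (K, N_mem).
    """
    M_norm = M / (np.linalg.norm(M, axis=1, keepdims=True) + eps)
    k_norm = keys / (np.linalg.norm(keys, axis=1, keepdims=True) + eps)
    logits = betas[:, None] * (k_norm @ M_norm.T)   # scaled cosine similarities
    logits -= logits.max(axis=1, keepdims=True)     # numerically stable softmax
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ M, weights

def memory_write(M, usage, v_ret, e_t, read_weights, gamma=0.9):
    """Write e_t into the least-used row and retroactively into older rows."""
    usage = usage + read_weights.sum(axis=0)        # accumulate read weights
    v_wr = np.zeros(M.shape[0])
    v_wr[np.argmin(usage)] = 1.0                    # one-hot least-used row
    v_ret = gamma * v_ret + (1 - gamma) * v_wr      # retroactive write weights
    z_dim = M.shape[1] // 2
    row_new = np.concatenate([e_t, np.zeros(z_dim)])   # [e_t, 0]
    row_ret = np.concatenate([np.zeros(z_dim), e_t])   # [0, e_t]
    M = M + np.outer(v_wr, row_new) + np.outer(v_ret, row_ret)
    return M, usage, v_ret
```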
However, updating the encoder and the memory content purely using RL is inefficient. Therefore, MERLIN includes a memory-based predictor (MBP) in addition to the policy. The goal of the MBP is to compress observations into low-dimensional state representations $z$ and to store them in memory.

According to the paper, the idea of unsupervised and predictive modeling has been entertained for decades, and recent discussions have proposed such modeling to be connected to hippocampal memory.

We want the state variables not only to faithfully represent the data, but also to emphasise rewarding elements of the environment above irrelevant ones. To accomplish this, the authors follow the hippocampal representation theory of Gluck and Myers, who proposed that hippocampal representations pass through a compressive bottleneck and then reconstruct input stimuli together with task reward.

In MERLIN, a prior distribution $p(z_t \mid z_{t-1}, a_{t-1}, \ldots, z_1, a_1)$ predicts the next state variable conditioned on the history of state variables and actions, and a posterior $q(z_t \mid o_t, z_{t-1}, a_{t-1}, \ldots, z_1, a_1)$ corrects the prior using the new observation $o_t$, forming a better estimate of the state variable $z_t$.
To achieve the mentioned goals, we add two terms to the loss. First, we reconstruct the input stimuli, action, reward and return using a sample from the state variable posterior, and add the difference between the reconstruction and the ground truth to the loss. Second, we add the KL divergence of the prior and the posterior to the loss, to ensure consistency between the prior and the posterior.
Figure 1c of paper "Unsupervised Predictive Memory in a Goal-Directed Agent" by Greg Wayne et al.
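As a sketch of the consistency term, the KL divergence between two diagonal Gaussians (posterior and prior) has a closed form; the helpers below are generic NumPy functions, not MERLIN's actual implementation, and the combined loss is only a schematic of the two added terms.

```python
import numpy as np

def diag_gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) for diagonal Gaussians given means and log-variances."""
    return 0.5 * np.sum(
        logvar_p - logvar_q
        + (np.exp(logvar_q) + (mu_q - mu_p) ** 2) / np.exp(logvar_p)
        - 1.0
    )

def mbp_loss(reconstruction, target, mu_q, logvar_q, mu_p, logvar_p,
             kl_weight=1.0):
    """Reconstruction error of the stimuli/action/reward/return targets
    plus the prior-posterior consistency term."""
    reconstruction_loss = np.sum((reconstruction - target) ** 2)
    kl = diag_gaussian_kl(mu_q, logvar_q, mu_p, logvar_p)
    return reconstruction_loss + kl_weight * kl
```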
Algorithm 1 (MERLIN Worker Pseudocode):

    // Assume global shared parameter vectors θ for the policy network and χ for the
    // memory-based predictor; global shared counter T := 0
    // Assume thread-specific parameter vectors θ′, χ′
    // Assume discount factor γ ∈ (0, 1] and bootstrapping parameter λ ∈ [0, 1]
    Initialize thread step counter t := 1
    repeat
        Synchronize thread-specific parameters θ′ := θ; χ′ := χ
        Zero the model's memory & recurrent state if a new episode begins
        t_start := t
        repeat
            Prior N(μᵖ_t, log Σᵖ_t) = p(h_{t−1}, m_{t−1})
            e_t = enc(o_t)
            Posterior N(μᵠ_t, log Σᵠ_t) = q(e_t, h_{t−1}, m_{t−1}, μᵖ_t, log Σᵖ_t)
            Sample z_t ∼ N(μᵠ_t, log Σᵠ_t)
            Policy network update h̃_t = rec(h̃_{t−1}, m̃_t, StopGradient(z_t))
            Policy distribution π_t = π(h̃_t, StopGradient(z_t))
            Sample a_t ∼ π_t
            h_t = rec(h_{t−1}, m_t, z_t)
            Update memory with z_t by Methods Eq. 2
            R_t, oʳ_t = dec(z_t, π_t, a_t)
            Apply a_t to environment and receive reward r_t and observation o_{t+1}
            t := t + 1; T := T + 1
        until environment termination or t − t_start == τ_window
        If not terminated, run an additional step to compute V^π_ν(z_{t+1}, log π_{t+1})
            and set R_{t+1} := V^π(z_{t+1}, log π_{t+1})   // (but don't increment counters)
        Reset performance accumulators A := 0; L := 0; H := 0
        for k from t down to t_start do
            γ_t := 0 if k is an environment termination, otherwise γ
            R_k := r_k + γ_t R_{k+1}
            δ_k := r_k + γ_t V^π(z_{k+1}, log π_{k+1}) − V^π(z_k, log π_k)
            A_k := δ_k + (γλ) A_{k+1}
            L := L + L_k   (Eq. 7)
            A := A + A_k log π_k[a_k]
            H := H − α_entropy Σ_i π_k[i] log π_k[i]   (Entropy loss)
        end for
        dχ′ := ∇_{χ′} L
        dθ′ := ∇_{θ′}(A + H)
        Asynchronously update θ using dθ′ and χ using dχ′ via gradient ascent
    until T > T_max
Algorithm 1 of paper "Unsupervised Predictive Memory in a Goal-Directed Agent" by Greg Wayne et al.
Figure 2 of paper "Unsupervised Predictive Memory in a Goal-Directed Agent" by Greg Wayne et al.
Figure 3 of paper "Unsupervised Predictive Memory in a Goal-Directed Agent" by Greg Wayne et al.
Extended Figure 3 of paper "Unsupervised Predictive Memory in a Goal-Directed Agent" by Greg Wayne et al.
[Panel (a): FTW agent architecture with a fast RNN and a slow RNN, a sampled latent variable, an internal reward w_t derived from game points, and the policy producing action a_t from observation x_t and the winning signal. Panel (b): progression of agent Elo during training together with internal timescale, KL weighting and learning rate, compared to strong human, average human, self-play + reward shaping, self-play and random-agent baselines.]
Figure 2 of the paper "Human-level performance in first-person multiplayer games with population-based deep reinforcement learning" by Max Jaderberg et al.
The FTW agent is an extension of the MERLIN architecture:
- a hierarchical RNN with two timescales;
- population based training controlling the KL divergence penalty weights, the slow-ticking RNN speed and the gradient flow factor from the fast to the slow RNN.
Figure S10 of the paper "Human-level performance in first-person multiplayer games with population-based deep reinforcement learning" by Max Jaderberg et al.
[Progression over games played (roughly 0K to 450K) of agent behaviours (teammate following, opponent base camping, home base defence), agent strength (beating weak bots, beating average humans, beating strong humans), relative internal reward magnitudes (opponent captured flag, agent picked up flag, agent tagged opponent), knowledge and memory usage, divided into three phases: Phase 1 — learning the basics of the game; Phase 2 — increasing navigation, tagging, and coordination skills; Phase 3 — perfecting strategy and memory. Single-neuron responses, visitation maps and top memory read locations are shown for situations such as "I have the flag", "I am respawning", "My flag is taken" and "Teammate has the flag".]
Figure 4 of the paper "Human-level performance in first-person multiplayer games with population-based deep reinforcement learning" by Max Jaderberg et al.