NPFL122, Lecture 11
V-trace, PopArt Normalization, Partially Observable MDPs
Milan Straka
January 7, 2019
Charles University in Prague, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics
IMPALA (Importance Weighted Actor-Learner Architecture) was introduced in a Feb 2018 paper and allows a massively distributed implementation of an actor-critic-like learning algorithm. Compared to A3C-based agents, in which workers communicate gradients with respect to the policy parameters to a central parameter server, IMPALA actors send complete trajectories of experience to a centralized learner.
[Diagram: IMPALA actors send observations/trajectories to one or more learners and receive updated parameters, whereas in the A3C-style setup workers send gradients to a master.]
Figure 1 of the paper "IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures" by Lasse Espeholt et al.
[Timeline diagrams of environment steps, forward passes and backward passes over 4 time steps, comparing (a) Batched A2C (sync step), (b) Batched A2C (sync traj.) and (c) IMPALA.]
Figure 2 of the paper "IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures" by Lasse Espeholt et al.
If many actors are used, the policy used to generate a trajectory can lag behind the latest policy of the learner; the V-trace off-policy correction described next accounts for this lag.
Consider a trajectory $(S_t, A_t, R_{t+1})_{t=s}^{t=s+n}$ generated by a behaviour policy $b$. The $n$-step V-trace target for $S_s$ is defined as
$$v_s \stackrel{\text{def}}{=} V(S_s) + \sum_{t=s}^{s+n-1} \gamma^{t-s} \left(\prod_{i=s}^{t-1} c_i\right) \delta_t V,$$
where $\delta_t V$ is the temporal difference for V
$$\delta_t V \stackrel{\text{def}}{=} \rho_t \big(R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\big),$$
and $\rho_t$ and $c_i$ are truncated importance sampling ratios with $\bar\rho \geq \bar c$:
$$\rho_t \stackrel{\text{def}}{=} \min\left(\bar\rho, \frac{\pi(A_t \mid S_t)}{b(A_t \mid S_t)}\right), \qquad c_i \stackrel{\text{def}}{=} \min\left(\bar c, \frac{\pi(A_i \mid S_i)}{b(A_i \mid S_i)}\right).$$
Note that if $b = \pi$ and assuming $\bar c \geq 1$, $v_s$ reduces to the $n$-step Bellman target.
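To make the target concrete, here is a minimal NumPy sketch of the V-trace computation for a single trajectory; the function name, array layout and default coefficients are assumptions for illustration, not taken from the paper.

```python
import numpy as np

def vtrace_targets(values, next_values, rewards, is_ratios,
                   gamma=0.99, rho_bar=1.0, c_bar=1.0):
    """Compute V-trace targets v_s for one trajectory of length n.

    values      -- V(S_s), ..., V(S_{s+n-1})
    next_values -- V(S_{s+1}), ..., V(S_{s+n})
    rewards     -- R_{s+1}, ..., R_{s+n}
    is_ratios   -- pi(A_t|S_t) / b(A_t|S_t) under the behaviour policy b
    """
    rho = np.minimum(rho_bar, is_ratios)   # truncated rho_t
    c = np.minimum(c_bar, is_ratios)       # truncated c_t
    deltas = rho * (rewards + gamma * next_values - values)  # delta_t V

    # v_s = V(S_s) + sum_t gamma^{t-s} (prod_i c_i) delta_t V, accumulated
    # backwards via acc_s = delta_s + gamma * c_s * acc_{s+1}.
    targets = np.empty_like(values, dtype=float)
    acc = 0.0
    for t in reversed(range(len(values))):
        acc = deltas[t] + gamma * c[t] * acc
        targets[t] = values[t] + acc
    return targets
```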
Note that the truncated IS weights $\rho_t$ and $c_i$ play different roles:

- The $\rho_t$ appears in the definition of $\delta_t V$ and defines the fixed point of the update rule. For $\bar\rho = \infty$, the target is the value function $v_\pi$; if $\bar\rho < \infty$, the fixed point is somewhere between $v_\pi$ and $v_b$. Notice that we do not compute a product of these coefficients.

- The $c_i$ impacts the speed of convergence (the contraction rate of the Bellman operator), not the sought policy. Because a product of the ratios is computed, it plays an important role in variance reduction.

The paper utilizes $\bar c = 1$ and, out of $\bar\rho \in \{1, 10, 100\}$, $\bar\rho = 1$ works empirically the best.
Consider parametrized functions computing $v(s; \theta)$ and $\pi(a \mid s; \omega)$. Assuming the defined $n$-step V-trace target
$$v_s \stackrel{\text{def}}{=} V(S_s) + \sum_{t=s}^{s+n-1} \gamma^{t-s} \left(\prod_{i=s}^{t-1} c_i\right) \delta_t V,$$
we update the critic in the direction of
$$\big(v_s - v(S_s; \theta)\big) \nabla_\theta v(S_s; \theta)$$
and the actor in the direction of the policy gradient
$$\rho_s \nabla_\omega \log \pi(A_s \mid S_s; \omega) \big(R_{s+1} + \gamma v_{s+1} - v(S_s; \theta)\big).$$
Finally, we again add the entropy regularization term $H(\pi(\cdot \mid S_s; \omega))$ to the loss function.
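A rough sketch of how these terms can be combined into a single scalar loss (in practice differentiated by an autodiff framework) follows; the function and argument names are illustrative assumptions, and the V-trace targets and advantages are treated as constants, so no gradient would flow through them.

```python
import numpy as np

def impala_loss(vs, next_vs, values, rewards, rho_s, log_pi_a, entropy,
                gamma=0.99, value_coef=0.5, entropy_coef=0.01):
    """Scalar IMPALA-style loss over a batch of transitions.

    vs, next_vs -- V-trace targets v_s and v_{s+1} (treated as constants)
    values      -- current critic outputs v(S_s; theta)
    rho_s       -- truncated importance ratios rho_s
    log_pi_a    -- log pi(A_s | S_s; omega) of the taken actions
    entropy     -- entropy H(pi(.|S_s; omega)) of the current policy
    """
    # Critic regression towards the V-trace target.
    critic_loss = value_coef * np.mean((vs - values) ** 2)

    # Policy gradient with the V-trace-corrected advantage.
    advantages = rho_s * (rewards + gamma * next_vs - values)
    actor_loss = -np.mean(advantages * log_pi_a)

    # Entropy bonus (subtracted from the loss) encourages exploration.
    return critic_loss + actor_loss - entropy_coef * np.mean(entropy)
```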
| Architecture                     | CPUs | GPUs¹ | FPS² Task 1 | FPS² Task 2 |
|----------------------------------|------|-------|-------------|-------------|
| Single-Machine                   |      |       |             |             |
| A3C 32 workers                   | 64   | –     | 6.5K        | 9K          |
| Batched A2C (sync step)          | 48   | –     | 9K          | 5K          |
| Batched A2C (sync step)          | 48   | 1     | 13K         | 5.5K        |
| Batched A2C (sync traj.)         | 48   | –     | 16K         | 17.5K       |
| Batched A2C (dyn. batch)         | 48   | 1     | 16K         | 13K         |
| IMPALA 48 actors                 | 48   | –     | 17K         | 20.5K       |
| IMPALA (dyn. batch) 48 actors³   | 48   | 1     | 21K         | 24K         |
| Distributed                      |      |       |             |             |
| A3C                              | 200  | –     | 46K         | 50K         |
| IMPALA                           | 150  | 1     | 80K         |             |
| IMPALA (optimised)               | 375  | 1     | 200K        |             |
| IMPALA (optimised) batch 128     | 500  | 1     | 250K        |             |

¹ Nvidia P100. ² In frames/sec (4 times the agent steps due to action repeat). ³ Limited by the amount of rendering possible on a single machine.

Table 1 of the paper "IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures" by Lasse Espeholt et al.
For Atari experiments, population based training with a population of 24 agents is used to adapt entropy regularization, learning rate, RMSProp $\varepsilon$ and the global gradient norm clipping threshold.

[Diagram contrasting (a) Sequential Optimisation, (b) Parallel Random/Grid Search and (c) Population Based Training, which exploits well-performing hyperparameters and weights and explores perturbed ones.]
Figure 1 of the paper "Population Based Training of Neural Networks" by Max Jaderberg et al.
For Atari experiments, population based training with a population of 24 agents is used to adapt entropy regularization, learning rate, RMSProp $\varepsilon$ and the global gradient norm clipping threshold.

In population based training, several agents are trained in parallel. When an agent is ready (after 5000 episodes):
- it may be overwritten by the parameters and hyperparameters of another agent, if that agent is sufficiently better (its 5000-episode mean capped human normalized score is at least 5% better); and
- independently, each hyperparameter may undergo a change (being multiplied by either 1.2 or 1/1.2 with 33% chance).
[Learning curves (return vs. environment frames, up to 1e9) and final returns across 24 hyperparameter combinations on the DeepMind Lab tasks rooms_watermaze, rooms_keys_doors_puzzle, lasertag_three_opponents_small, explore_goal_locations_small and seekavoid_arena_01, comparing IMPALA (1 GPU, 200 actors), Batched A2C (single machine, 32 workers), A3C (single machine, 32 workers) and distributed A3C (200 workers).]
Figure 4 of the paper "IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures" by Lasse Espeholt et al.
Figures 5, 6 of the paper "IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures" by Lasse Espeholt et al.
Table 4 of the paper "IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures" by Lasse Espeholt et al.
|                | Task 1 | Task 2 | Task 3 | Task 4 | Task 5 |
|----------------|--------|--------|--------|--------|--------|
| Without Replay |        |        |        |        |        |
| V-trace        | 46.8   | 32.9   | 31.3   | 229.2  | 43.8   |
| 1-Step         | 51.8   | 35.9   | 25.4   | 215.8  | 43.7   |
| ε-correction   | 44.2   | 27.3   | 4.3    | 107.7  | 41.5   |
| No-correction  | 40.3   | 29.1   | 5.0    | 94.9   | 16.1   |
| With Replay    |        |        |        |        |        |
| V-trace        | 47.1   | 35.8   | 34.5   | 250.8  | 46.9   |
| 1-Step         | 54.7   | 34.4   | 26.4   | 204.8  | 41.6   |
| ε-correction   | 30.4   | 30.2   | 3.9    | 101.5  | 37.6   |
| No-correction  | 35.0   | 21.1   | 2.8    | 85.0   | 11.2   |

Tasks: Task 1 = rooms_watermaze, Task 2 = rooms_keys_doors_puzzle, Task 3 = lasertag_three_opponents_small, Task 4 = explore_goal_locations_small, Task 5 = seekavoid_arena_01.

Table 2 of the paper "IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures" by Lasse Espeholt et al.
Figure E.1 of the paper "IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures" by Lasse Espeholt et al.
An improvement of IMPALA from Sep 2018, which performs normalization of task rewards instead of just reward clipping. PopArt stands for Preserving Outputs Precisely, while Adaptively Rescaling Targets.

Assume the value estimate $v(s; \theta, \sigma, \mu)$ is computed using a normalized value predictor $n(s; \theta)$
$$v(s; \theta, \sigma, \mu) \stackrel{\text{def}}{=} \sigma n(s; \theta) + \mu,$$
and further assume that $n(s; \theta)$ is an output of a linear function
$$n(s; \theta) \stackrel{\text{def}}{=} \omega^T f(s; \theta \setminus \{\omega, b\}) + b.$$
We can update the $\sigma$ and $\mu$ using an exponentially moving average with decay rate $\beta$ (in the paper, the first moment $\mu$ and the second moment $\upsilon$ are tracked, and the standard deviation is computed as $\sigma = \sqrt{\upsilon - \mu^2}$; decay rate $\beta = 3 \cdot 10^{-4}$ is employed).
Utilizing the parameters $\mu$ and $\sigma$, we can normalize the observed (unnormalized) returns as $(G - \mu)/\sigma$ and use an actor-critic algorithm with advantage $(G - \mu)/\sigma - n(S; \theta)$.

However, in order to make sure the value function estimate does not change when the normalization parameters change, the parameters $\omega, b$ computing the unnormalized value estimate are updated under any change $\mu \rightarrow \mu'$ and $\sigma \rightarrow \sigma'$ as
$$\omega' \stackrel{\text{def}}{=} \frac{\sigma}{\sigma'} \omega, \qquad b' \stackrel{\text{def}}{=} \frac{\sigma b + \mu - \mu'}{\sigma'}.$$

In multi-task settings, we train a task-agnostic policy and task-specific value functions (therefore, $\mu$, $\sigma$ and $n(s; \theta)$ are vectors).
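A minimal NumPy sketch of the PopArt bookkeeping follows, using a scalar value head for simplicity (in the multi-task setting, `mu`, `sigma`, `w` and `b` would be per-task vectors); the class layout and default coefficients are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

class PopArt:
    """Tracks return statistics and rescales the last linear layer so that
    sigma * n(s) + mu is preserved when mu and sigma change."""

    def __init__(self, beta=3e-4):
        self.beta = beta
        self.mu, self.nu = 0.0, 1.0   # first and second moment estimates
        self.sigma = 1.0
        self.w, self.b = 1.0, 0.0     # last linear layer of the value head

    def update(self, returns):
        old_mu, old_sigma = self.mu, self.sigma

        # Exponentially moving first and second moments of the returns.
        self.mu = (1 - self.beta) * self.mu + self.beta * np.mean(returns)
        self.nu = (1 - self.beta) * self.nu + self.beta * np.mean(returns ** 2)
        self.sigma = np.sqrt(self.nu - self.mu ** 2)

        # Preserve outputs precisely: rescale w and b so the unnormalized
        # value estimate sigma' * n(s) + mu' does not change.
        self.w *= old_sigma / self.sigma
        self.b = (old_sigma * self.b + old_mu - self.mu) / self.sigma

    def normalize(self, returns):
        """Normalized targets (G - mu) / sigma used for the critic update."""
        return (returns - self.mu) / self.sigma
```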
| Agent         | Atari-57 Random | Atari-57 Human | Atari-57 (unclipped) Random | Atari-57 (unclipped) Human | DmLab-30 Train | DmLab-30 Test |
|---------------|-----------------|----------------|-----------------------------|----------------------------|----------------|---------------|
| IMPALA        | 59.7%           | 28.5%          | 0.3%                        | 1.0%                       | 60.6%          | 58.4%         |
| PopArt-IMPALA | 110.7%          | 101.5%         | 107.0%                      | 93.7%                      | 73.5%          | 72.8%         |

Table 1 of the paper "Multi-task Deep Reinforcement Learning with PopArt" by Matteo Hessel et al.
[Median human normalised score vs. environment frames on Atari-57 (clipped) and Atari-57 (unclipped), comparing PopArt-IMPALA, MultiHead-IMPALA and IMPALA.]
Figures 1, 2 of the paper "Multi-task Deep Reinforcement Learning with PopArt" by Matteo Hessel et al.
[Undiscounted return (μ ± σ) vs. environment frames on breakout, crazy_climber, qbert and seaquest.]
Figure 3 of the paper "Multi-task Deep Reinforcement Learning with PopArt" by Matteo Hessel et al.
[Mean capped human normalised score vs. environment frames on DmLab-30, comparing PopArt-IMPALA, IMPALA and IMPALA-original, and comparing PopArt-IMPALA with Pixel-PopArt-IMPALA.]
Figures 4, 5 of the paper "Multi-task Deep Reinforcement Learning with PopArt" by Matteo Hessel et al.
Recall that a Markov decision process (MDP) is a quadruple $(\mathcal S, \mathcal A, p, \gamma)$, where:
- $\mathcal S$ is a set of states,
- $\mathcal A$ is a set of actions,
- $p(S_{t+1} = s', R_{t+1} = r \mid S_t = s, A_t = a)$ is a probability that action $a \in \mathcal A$ will lead from state $s \in \mathcal S$ to $s' \in \mathcal S$, producing a reward $r \in \mathbb R$,
- $\gamma \in [0, 1]$ is a discount factor.

A partially observable Markov decision process extends the Markov decision process to a sextuple $(\mathcal S, \mathcal A, p, \gamma, \mathcal O, o)$, where in addition to an MDP:
- $\mathcal O$ is a set of observations,
- $o(O_t \mid S_t, A_{t-1})$ is an observation model.

In robotics (out of the domain of this course), several approaches are used to handle POMDPs, to model uncertainty, imprecise mechanisms and inaccurate sensors.
In Deep RL, partially observable MDPs are usually handled using recurrent networks. After suitable encoding of the input observation $O_t$ and the previous action $A_{t-1}$, an RNN (usually LSTM) unit is used to model the current $S_t$ (or a suitable latent representation of it), which is in turn utilized to produce $A_t$.

Figure 1a of the paper "Unsupervised Predictive Memory in a Goal-Directed Agent" by Greg Wayne et al.
However, keeping all information in the RNN state is substantially limiting. Therefore, memory-augmented networks can be used to store suitable information in an external memory.

We now describe the approach used by the MERLIN architecture, from the Mar 2018 DeepMind paper "Unsupervised Predictive Memory in a Goal-Directed Agent".
Figure 1b of paper "Unsupervised Predictive Memory in a Goal-Directed Agent" by Greg Wayne et al.
Figure 1b of the paper "Unsupervised Predictive Memory in a Goal-Directed Agent" by Greg Wayne et al.

Let $M$ be a memory matrix of size $N_{\mathit{mem}} \times 2|z|$.

Assume we have already encoded observations as $e_t$ and previous action $a_{t-1}$. We concatenate them with previously read vectors and process them by a deep LSTM (two layers are used in the paper) to compute $h_t$. Then, we apply a linear layer to $h_t$, computing $K$ key vectors $k_1, \ldots, k_K$ of length $2|z|$ and $K$ positive scalars $\beta_1, \ldots, \beta_K$.

Reading: For each $i$, we compute the cosine similarity of $k_i$ and all memory rows $M_j$, multiply the similarities by $\beta_i$ and pass them through a softmax to obtain weights $\omega_i$. The read vector is then computed as $M^T \omega_i$.

Writing: We find the one-hot write index $v_{\mathit{wr}}$ to be the least used memory row (we keep usage indicators and add read weights to them). We then compute
$$v_{\mathit{ret}} \leftarrow \gamma v_{\mathit{ret}} + (1 - \gamma) v_{\mathit{wr}},$$
and update the memory matrix using
$$M \leftarrow M + v_{\mathit{wr}}[e_t, 0] + v_{\mathit{ret}}[0, e_t].$$
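The following NumPy sketch illustrates the read and write operations under the above conventions; the helper names, the usage-update rule and the retroactive decay handling are simplifying assumptions, not the exact formulation of the paper.

```python
import numpy as np

def memory_read(M, keys, betas, eps=1e-8):
    """Content-based reading from memory M of shape (N_mem, 2*z_dim).

    keys  -- (K, 2*z_dim) read key vectors k_1, ..., k_K
    betas -- (K,) positive sharpness scalars beta_1, ..., beta_K
    Returns read vectors (K, 2*z_dim) and read weights (K, N_mem).
    """
    M_norm = M / (np.linalg.norm(M, axis=1, keepdims=True) + eps)
    k_norm = keys / (np.linalg.norm(keys, axis=1, keepdims=True) + eps)
    logits = betas[:, None] * (k_norm @ M_norm.T)   # scaled cosine similarities
    logits -= logits.max(axis=1, keepdims=True)     # numerically stable softmax
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ M, weights

def memory_write(M, usage, v_ret, e_t, read_weights, gamma=0.9):
    """Write e_t into the least-used row and retroactively into older rows."""
    usage = usage + read_weights.sum(axis=0)        # accumulate read weights
    v_wr = np.zeros(M.shape[0])
    v_wr[np.argmin(usage)] = 1.0                    # one-hot least-used row
    v_ret = gamma * v_ret + (1 - gamma) * v_wr      # retroactive write weights
    z_dim = M.shape[1] // 2
    row_new = np.concatenate([e_t, np.zeros(z_dim)])   # [e_t, 0]
    row_ret = np.concatenate([np.zeros(z_dim), e_t])   # [0, e_t]
    M = M + np.outer(v_wr, row_new) + np.outer(v_ret, row_ret)
    return M, usage, v_ret
```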
However, updating the encoder and the memory content purely using RL is inefficient. Therefore, MERLIN includes a memory-based predictor (MBP) in addition to the policy. The goal of the MBP is to compress observations into low-dimensional state representations $z$ and to store them in memory.

According to the paper, the idea of unsupervised and predictive modeling has been entertained for decades, and recent discussions have proposed such modeling to be connected to hippocampal memory.

We want the state variables not only to faithfully represent the data, but also to emphasise rewarding elements of the environment above irrelevant ones. To accomplish this, the authors follow the hippocampal representation theory of Gluck and Myers, who proposed that hippocampal representations pass through a compressive bottleneck and then reconstruct input stimuli together with task reward.

In MERLIN, a prior distribution $p(z_t \mid z_{t-1}, a_{t-1}, \ldots, z_1, a_1)$ predicts the next state variable conditioned on the history of state variables and actions, and a posterior $q(z_t \mid o_t, z_{t-1}, a_{t-1}, \ldots, z_1, a_1)$ corrects the prior using the new observation $o_t$, forming a better estimate of the state variable $z_t$.
To achieve the mentioned goals, we add two terms to the loss. First, we reconstruct the input stimuli, action, reward and return using a sample from the state variable posterior, and add the difference between the reconstruction and the ground truth to the loss. Second, we add the KL divergence of the prior and the posterior to the loss, to ensure consistency between the prior and the posterior.
Figure 1c of paper "Unsupervised Predictive Memory in a Goal-Directed Agent" by Greg Wayne et al.
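As a sketch of the consistency term, the KL divergence between two diagonal Gaussians (posterior and prior) has a closed form; the helpers below are generic NumPy functions, not MERLIN's actual implementation, and the combined loss is only a schematic of the two added terms.

```python
import numpy as np

def diag_gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) for diagonal Gaussians given means and log-variances."""
    return 0.5 * np.sum(
        logvar_p - logvar_q
        + (np.exp(logvar_q) + (mu_q - mu_p) ** 2) / np.exp(logvar_p)
        - 1.0
    )

def mbp_loss(reconstruction, target, mu_q, logvar_q, mu_p, logvar_p,
             kl_weight=1.0):
    """Reconstruction error of the stimuli/action/reward/return targets
    plus the prior-posterior consistency term."""
    reconstruction_loss = np.sum((reconstruction - target) ** 2)
    kl = diag_gaussian_kl(mu_q, logvar_q, mu_p, logvar_p)
    return reconstruction_loss + kl_weight * kl
```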
Algorithm 1 (MERLIN Worker Pseudocode):

    // Assume global shared parameter vectors θ for the policy network and χ for the
    // memory-based predictor; global shared counter T := 0
    // Assume thread-specific parameter vectors θ′, χ′
    // Assume discount factor γ ∈ (0, 1] and bootstrapping parameter λ ∈ [0, 1]
    Initialize thread step counter t := 1
    repeat
        Synchronize thread-specific parameters θ′ := θ; χ′ := χ
        Zero the model's memory & recurrent state if a new episode begins
        t_start := t
        repeat
            Prior N(μᵖ_t, log Σᵖ_t) = p(h_{t−1}, m_{t−1})
            e_t = enc(o_t)
            Posterior N(μᵠ_t, log Σᵠ_t) = q(e_t, h_{t−1}, m_{t−1}, μᵖ_t, log Σᵖ_t)
            Sample z_t ∼ N(μᵠ_t, log Σᵠ_t)
            Policy network update h̃_t = rec(h̃_{t−1}, m̃_t, StopGradient(z_t))
            Policy distribution π_t = π(h̃_t, StopGradient(z_t))
            Sample a_t ∼ π_t
            h_t = rec(h_{t−1}, m_t, z_t)
            Update memory with z_t by Methods Eq. 2
            R_t, oʳ_t = dec(z_t, π_t, a_t)
            Apply a_t to environment and receive reward r_t and observation o_{t+1}
            t := t + 1; T := T + 1
        until environment termination or t − t_start == τ_window
        If not terminated, run an additional step to compute V^π_ν(z_{t+1}, log π_{t+1})
            and set R_{t+1} := V^π(z_{t+1}, log π_{t+1})   // (but don't increment counters)
        Reset performance accumulators A := 0; L := 0; H := 0
        for k from t down to t_start do
            γ_t := 0 if k is an environment termination, otherwise γ
            R_k := r_k + γ_t R_{k+1}
            δ_k := r_k + γ_t V^π(z_{k+1}, log π_{k+1}) − V^π(z_k, log π_k)
            A_k := δ_k + (γλ) A_{k+1}
            L := L + L_k   (Eq. 7)
            A := A + A_k log π_k[a_k]
            H := H − α_entropy Σ_i π_k[i] log π_k[i]   (Entropy loss)
        end for
        dχ′ := ∇_{χ′} L
        dθ′ := ∇_{θ′}(A + H)
        Asynchronously update θ using dθ′ and χ using dχ′ via gradient ascent
    until T > T_max
Algorithm 1 of paper "Unsupervised Predictive Memory in a Goal-Directed Agent" by Greg Wayne et al.
Figure 2 of paper "Unsupervised Predictive Memory in a Goal-Directed Agent" by Greg Wayne et al.
Figure 3 of paper "Unsupervised Predictive Memory in a Goal-Directed Agent" by Greg Wayne et al.
Extended Figure 3 of paper "Unsupervised Predictive Memory in a Goal-Directed Agent" by Greg Wayne et al.
[Panel (a): FTW agent architecture with a fast RNN and a slow RNN, a sampled latent variable, an internal reward w_t derived from game points, and the policy producing action a_t from observation x_t and the winning signal. Panel (b): progression of agent Elo during training together with internal timescale, KL weighting and learning rate, compared to strong human, average human, self-play + reward shaping, self-play and random-agent baselines.]
Figure 2 of the paper "Human-level performance in first-person multiplayer games with population-based deep reinforcement learning" by Max Jaderberg et al.
The FTW agent is an extension of the MERLIN architecture:
- a hierarchical RNN with two timescales;
- population based training controlling the KL divergence penalty weights, the slow-ticking RNN speed and the gradient flow factor from the fast to the slow RNN.
Figure S10 of the paper "Human-level performance in first-person multiplayer games with population-based deep reinforcement learning" by Max Jaderberg et al.
[Progression over games played (roughly 0K to 450K) of agent behaviours (teammate following, opponent base camping, home base defence), agent strength (beating weak bots, beating average humans, beating strong humans), relative internal reward magnitudes (opponent captured flag, agent picked up flag, agent tagged opponent), knowledge and memory usage, divided into three phases: Phase 1 — learning the basics of the game; Phase 2 — increasing navigation, tagging, and coordination skills; Phase 3 — perfecting strategy and memory. Single-neuron responses, visitation maps and top memory read locations are shown for situations such as "I have the flag", "I am respawning", "My flag is taken" and "Teammate has the flag".]
Figure 4 of the paper "Human-level performance in first-person multiplayer games with population-based deep reinforcement learning" by Max Jaderberg et al.