  1. Provably Convergent Two-Timescale Off-Policy Actor-Critic with Function Approximation
     Shangtong Zhang¹, Bo Liu², Hengshuai Yao³, Shimon Whiteson¹
     ¹University of Oxford, ²Auburn University, ³Huawei

  2. Preview
     • Off-policy control under the excursion objective $\sum_s d_\mu(s) v_\pi(s)$
     • The first provably convergent two-timescale off-policy actor-critic algorithm with function approximation
     • A new perspective on Emphatic TD (Sutton et al., 2016)
     • Convergence of regularized GTD-style algorithms under a changing target policy

  3. The excursion objective is commonly used for off-policy control
     • $J(\pi) = \sum_s d_\mu(s)\, i(s)\, v_\pi(s)$
     • $d_\mu$: stationary distribution of the behaviour policy
     • $v_\pi$: value function of the target policy
     • $i: \mathcal{S} \to [0, \infty)$: the interest function (Sutton et al., 2016)
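A minimal numerical sketch of the excursion objective, assuming a hypothetical 3-state example with made-up $d_\mu$, $i$, and $v_\pi$ (none of these numbers come from the talk):

    import numpy as np

    # Hypothetical 3-state example: stationary distribution of the behaviour
    # policy, interest function, and target-policy value function.
    d_mu = np.array([0.5, 0.3, 0.2])   # d_mu(s), sums to 1
    i = np.array([1.0, 1.0, 0.0])      # interest i(s) >= 0
    v_pi = np.array([2.0, -1.0, 4.0])  # v_pi(s)

    # Excursion objective J(pi) = sum_s d_mu(s) * i(s) * v_pi(s)
    J = np.sum(d_mu * i * v_pi)
    print(J)  # 0.7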

  4. The off-policy policy gradient theorem gives the exact gradient (Imani et al., 2018)
     • $\nabla J(\pi) = \sum_s \bar{m}(s) \sum_a q_\pi(s,a)\, \nabla \pi(a|s)$
     • $\bar{m} \doteq (I - \gamma P_\pi^\top)^{-1} D i \in \mathbb{R}^{N_s}$, where $D = \mathrm{diag}(d_\mu)$

  5. Rewriting the gradient gives a taxonomy of previous algorithms
     • $\nabla_\theta J(\pi) = \mathbb{E}_{s \sim d_\mu,\, a \sim \mu(\cdot|s)}\big[\, m_\pi(s)\, \rho_\pi(s,a)\, q_\pi(s,a)\, \nabla_\theta \log \pi(a|s)\, \big]$
     • $m_\pi \doteq D^{-1}(I - \gamma P_\pi^\top)^{-1} D i$ (the emphasis)
     1. Ignore $m_\pi(s)$ (Degris et al., 2012)
     2. Use the followon trace to approximate $m_\pi(s)$ (Imani et al., 2018)
     3. Learn $m_\pi(s)$ with function approximation (ours)
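A sketch of how $\bar{m}$ and the emphasis $m_\pi$ could be computed exactly when the model is known, using a hypothetical random 3-state target-policy transition matrix (an assumption for illustration only); it also checks the identity $\bar{m}(s) = d_\mu(s)\, m_\pi(s)$ that turns the sum over states into the expectation under $d_\mu$:

    import numpy as np

    np.random.seed(0)
    n, gamma = 3, 0.9

    # Hypothetical model quantities (not from the talk).
    P_pi = np.random.rand(n, n); P_pi /= P_pi.sum(axis=1, keepdims=True)  # target-policy transitions
    d_mu = np.array([0.5, 0.3, 0.2])   # behaviour stationary distribution
    i = np.array([1.0, 1.0, 0.0])      # interest
    D = np.diag(d_mu)
    I = np.eye(n)

    # bar_m = (I - gamma P_pi^T)^{-1} D i            (slide 4)
    bar_m = np.linalg.solve(I - gamma * P_pi.T, D @ i)
    # m_pi = D^{-1} (I - gamma P_pi^T)^{-1} D i       (emphasis, slide 5)
    m_pi = np.linalg.solve(D, bar_m)

    # The rewrite on slide 5 relies on bar_m(s) = d_mu(s) * m_pi(s).
    assert np.allclose(bar_m, d_mu * m_pi)
    print(m_pi)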

  6. Ignoring emphasis is theoretically justified only in the tabular setting
     • Gradient estimator (Degris et al., 2012): $\rho_\pi(S_t, A_t)\, q_\pi(S_t, A_t)\, \nabla_\theta \log \pi(A_t|S_t)$
     • Off-Policy Actor-Critic (Off-PAC)
     • Extensions: off-policy DPG, DDPG, ACER, off-policy EPG, TD3, IMPALA
     • Off-PAC is biased even with linear function approximation (Degris et al., 2012; Imani et al., 2018; Maei et al., 2018; Liu et al., 2019)
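A per-sample sketch of the Off-PAC gradient estimator, assuming a tabular softmax target policy and made-up behaviour policy and critic estimates (all names and numbers here are illustrative assumptions, not from the talk):

    import numpy as np

    def softmax(z):
        z = z - z.max()
        e = np.exp(z)
        return e / e.sum()

    # Hypothetical tabular setup: theta[s, a] parameterises pi(a|s) = softmax(theta[s]).
    n_states, n_actions = 3, 2
    theta = np.zeros((n_states, n_actions))
    mu = np.full((n_states, n_actions), 1.0 / n_actions)  # behaviour policy
    q_hat = np.ones((n_states, n_actions))                # a critic's estimate of q_pi

    def off_pac_estimator(s, a):
        """rho(s,a) * q(s,a) * grad_theta log pi(a|s) -- note: no emphasis term."""
        pi_s = softmax(theta[s])
        rho = pi_s[a] / mu[s, a]
        grad = np.zeros_like(theta)
        grad[s] = -pi_s
        grad[s, a] += 1.0          # gradient of log softmax w.r.t. theta[s, :]
        return rho * q_hat[s, a] * grad

    print(off_pac_estimator(s=1, a=0))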

  7. The followon trace is unbiased only in a limiting sense
     • Gradient estimator (Imani et al., 2018): $M_t\, \rho_\pi(S_t, A_t)\, q_\pi(S_t, A_t)\, \nabla_\theta \log \pi(A_t|S_t)$
     • Followon trace: $M_t \doteq i(S_t) + \gamma \rho_{t-1} M_{t-1}$
     • Assuming $\pi$ is FIXED, $\lim_{t \to \infty} \mathbb{E}_\mu[M_t \mid S_t = s] = m_\pi(s)$
     • $M_t$ is a scalar, but $m_\pi$ is a vector!
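A sketch of the followon-trace recursion along an off-policy stream; the trajectory below is synthetic (an assumption for illustration) and only shows that $M_t$ is a single scalar carried along the trajectory, whereas $m_\pi$ assigns one value per state:

    import numpy as np

    np.random.seed(1)
    gamma, T = 0.9, 10

    # Hypothetical per-step quantities (not from the talk).
    states = np.random.randint(0, 3, size=T)
    interest = np.ones(3)                        # i(s)
    rho = np.random.uniform(0.5, 1.5, size=T)    # importance sampling ratios

    M = 0.0
    rho_prev = 0.0   # convention: rho_{-1} M_{-1} = 0
    for t in range(T):
        M = interest[states[t]] + gamma * rho_prev * M   # M_t = i(S_t) + gamma rho_{t-1} M_{t-1}
        rho_prev = rho[t]
        print(t, states[t], M)   # one scalar M_t per time step, not one value per state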

  8. Emphasis is the fixed point of a Bellman-like operator
     • $\hat{\mathcal{U}} y \doteq i + \gamma D^{-1} P_\pi^\top D y$
     • $\hat{\mathcal{U}}$ is a contraction mapping w.r.t. some weighted maximum norm (for any $\gamma < 1$)
     • The emphasis $m_\pi$ is its fixed point
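A sketch verifying, on a hypothetical 3-state model (assumed for illustration), that repeatedly applying $\hat{\mathcal{U}}$ converges and that $m_\pi = D^{-1}(I - \gamma P_\pi^\top)^{-1} D i$ is its fixed point:

    import numpy as np

    np.random.seed(2)
    n, gamma = 3, 0.9
    P_pi = np.random.rand(n, n); P_pi /= P_pi.sum(axis=1, keepdims=True)
    d_mu = np.array([0.5, 0.3, 0.2])
    i = np.array([1.0, 1.0, 0.0])
    D, D_inv = np.diag(d_mu), np.diag(1.0 / d_mu)

    U_hat = lambda y: i + gamma * D_inv @ P_pi.T @ D @ y   # the Bellman-like operator

    y = np.zeros(n)
    for _ in range(500):         # fixed-point iteration converges since U_hat is a contraction
        y = U_hat(y)

    m_pi = D_inv @ np.linalg.solve(np.eye(n) - gamma * P_pi.T, D @ i)
    assert np.allclose(y, m_pi)            # the emphasis is the fixed point
    assert np.allclose(U_hat(m_pi), m_pi)
    print(m_pi)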

  9. We propose to learn emphasis based on $\hat{\mathcal{U}}$
     • A semi-gradient update based on $\hat{\mathcal{U}}$
     • Gradient Temporal Difference learning (GTD) minimises the MSPBE $L(\nu) \doteq \|\Pi \mathcal{U} v - v\|_D^2$ with $v = X\nu$ (here $\mathcal{U}$ is the Bellman operator for $v_\pi$)
     • Gradient Emphasis Learning (GEM) minimises $L(w) \doteq \|\Pi \hat{\mathcal{U}} m - m\|_D^2$ with $m = Xw$
     • The learned emphasis then plugs into $\nabla_\theta J(\pi) = \mathbb{E}_{s \sim d_\mu,\, a \sim \mu(\cdot|s)}\big[\, m_\pi(s)\, \rho_\pi(s,a)\, q_\pi(s,a)\, \nabla_\theta \log \pi(a|s)\, \big]$
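One way to see what GEM's objective asks for: with a known model and linear features $X$, the projected objective $\|\Pi \hat{\mathcal{U}} Xw - Xw\|_D^2$ can be written as $(b - Aw)^\top C^{-1}(b - Aw)$ with $A = X^\top (I - \gamma P_\pi^\top) D X$, $b = X^\top D i$, $C = X^\top D X$, mirroring the standard MSPBE behind GTD. The sketch below solves it directly; the features and model are hypothetical, and this is the model-based objective, not the paper's sampled algorithm:

    import numpy as np

    np.random.seed(3)
    n, k, gamma = 5, 3, 0.9       # 5 states, 3 features (hypothetical)
    P_pi = np.random.rand(n, n); P_pi /= P_pi.sum(axis=1, keepdims=True)
    d_mu = np.random.dirichlet(np.ones(n))
    i = np.ones(n)
    X = np.random.randn(n, k)     # feature matrix, one row per state
    D = np.diag(d_mu)

    # GEM objective in matrix form: ||Pi U_hat Xw - Xw||_D^2 = (b - Aw)^T C^{-1} (b - Aw)
    A = X.T @ (np.eye(n) - gamma * P_pi.T) @ D @ X
    b = X.T @ D @ i
    C = X.T @ D @ X

    w_star = np.linalg.solve(A, b)   # minimiser when A is invertible
    m_hat = X @ w_star               # GEM's approximation of the emphasis

    m_pi = np.linalg.solve(D, np.linalg.solve(np.eye(n) - gamma * P_pi.T, D @ i))
    print(np.round(m_hat, 3))
    print(np.round(m_pi, 3))         # the two match only if m_pi lies in the span of X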

  10. Regularized GTD-style algorithms converge under a changing policy
     • TD converges under a changing policy (Konda's thesis), but those arguments can NOT be used to show the convergence of GTD
     • Regularization has to be used for GTD-style algorithms:
       GTD: $L(\nu) \doteq \|\Pi \mathcal{U} X\nu - X\nu\|_D^2 + \|\nu\|^2$
       GEM: $L(w) \doteq \|\Pi \hat{\mathcal{U}} Xw - Xw\|_D^2 + \|w\|^2$
     • Regularization in GTD has been studied from an optimization perspective under a fixed $\pi$: Mahadevan et al. (2014), Liu et al. (2015), Macua et al. (2015), Yu (2017), Du et al. (2017)
     • Here: a stochastic approximation perspective under a changing $\pi$
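With the ridge term added, the minimiser no longer drives $b - Aw$ to zero but solves a strongly convex problem that is well-posed even as $\pi$ (and hence $A$, $b$) changes. A small sketch of the closed-form minimiser of $(b - Aw)^\top C^{-1}(b - Aw) + \|w\|^2$; the matrices in the example are arbitrary stand-ins, not quantities from the paper:

    import numpy as np

    def regularized_gtd_solution(A, b, C):
        """Minimise (b - Aw)^T C^{-1} (b - Aw) + ||w||^2 in closed form.

        Setting the gradient to zero gives (A^T C^{-1} A + I) w = A^T C^{-1} b,
        which has a unique solution even when A is singular or drifting over time.
        """
        k = A.shape[1]
        C_inv = np.linalg.inv(C)
        return np.linalg.solve(A.T @ C_inv @ A + np.eye(k), A.T @ C_inv @ b)

    # Example with hypothetical matrices of the same shapes as the GEM sketch above.
    rng = np.random.default_rng(4)
    A, b = rng.normal(size=(3, 3)), rng.normal(size=3)
    C0 = rng.normal(size=(5, 3))
    C = C0.T @ C0 + 1e-3 * np.eye(3)   # any positive-definite C
    print(regularized_gtd_solution(A, b, C))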

  11. The Convergent Off-Policy Actor-Critic (COF-PAC) algorithm
     • $\nabla_\theta J(\pi) = \mathbb{E}_{s \sim d_\mu,\, a \sim \mu(\cdot|s)}\big[\, m_\pi(s)\, \rho_\pi(s,a)\, q_\pi(s,a)\, \nabla_\theta \log \pi(a|s)\, \big]$
     • Critic objectives: $L(\nu) \doteq \|\Pi \mathcal{U} X\nu - X\nu\|_D^2 + \|\nu\|^2$ and $L(w) \doteq \|\Pi \hat{\mathcal{U}} Xw - Xw\|_D^2 + \|w\|^2$
     • Two-timescale updates instead of bi-level optimization as in SBEED
     • COF-PAC visits a neighbourhood of a stationary point of $J(\pi)$ infinitely many times
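A sketch of how the two timescales and the emphasis-weighted actor step could fit together. The tabular softmax parameterisation, the stepsize schedules, and the stand-in estimates m_hat and q_hat are placeholder assumptions for illustration, not the paper's exact COF-PAC update rules:

    import numpy as np

    def softmax(z):
        z = z - z.max()
        e = np.exp(z)
        return e / e.sum()

    # Hypothetical tabular setup (placeholders, not the paper's exact algorithm).
    n_states, n_actions, T = 3, 2, 1000
    theta = np.zeros((n_states, n_actions))     # actor: pi(a|s) = softmax(theta[s])
    mu = np.full((n_states, n_actions), 0.5)    # behaviour policy
    m_hat = np.ones(n_states)                   # stand-in for the GEM emphasis estimate w^T x(s)
    q_hat = np.ones((n_states, n_actions))      # stand-in for the critic's q estimate
    rng = np.random.default_rng(5)

    s = 0
    for t in range(T):
        alpha = (t + 1) ** -0.6    # fast stepsize: the GEM and value critics would use this
        beta = (t + 1) ** -0.9     # slow stepsize for the actor, so beta/alpha -> 0

        a = rng.integers(n_actions)             # act with the behaviour policy mu
        pi_s = softmax(theta[s])
        rho = pi_s[a] / mu[s, a]

        # (critic and GEM updates with stepsize alpha would go here)

        # Actor ascent step following the gradient expression above:
        # theta <- theta + beta * m_hat(s) * rho * q_hat(s,a) * grad_theta log pi(a|s)
        grad = -pi_s
        grad[a] += 1.0
        theta[s] += beta * m_hat[s] * rho * q_hat[s, a] * grad

        s = rng.integers(n_states)              # synthetic next state (assumption)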

  12. GEM approximates the emphasis better than the followon trace in Baird's counterexample (averaged over 30 runs, mean ± std)

  13. GEM-ETD does better policy evaluation than ETD in Baird's counterexample
     • ETD: $\nu_{t+1} \leftarrow \nu_t + \alpha\, M_t\, \rho_t\, (R_{t+1} + \gamma x_{t+1}^\top \nu_t - x_t^\top \nu_t)\, x_t$
     • GEM-ETD: $\nu_{t+1} \leftarrow \nu_t + \alpha_2\, (w_t^\top x_t)\, \rho_t\, (R_{t+1} + \gamma x_{t+1}^\top \nu_t - x_t^\top \nu_t)\, x_t$
     (Averaged over 30 runs, mean ± std)
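A per-transition sketch of the two updates on this slide, written as plain functions; x_t and x_tp1 are feature vectors, and the single synthetic transition at the bottom is only an assumption for illustration:

    import numpy as np

    def etd_step(nu, M_prev, rho_prev, x_t, x_tp1, r_tp1, rho_t, i_t, alpha, gamma):
        """One ETD update: nu += alpha * M_t * rho_t * delta_t * x_t, with followon trace M_t."""
        M_t = i_t + gamma * rho_prev * M_prev
        delta = r_tp1 + gamma * x_tp1 @ nu - x_t @ nu
        return nu + alpha * M_t * rho_t * delta * x_t, M_t

    def gem_etd_step(nu, w, x_t, x_tp1, r_tp1, rho_t, alpha2, gamma):
        """One GEM-ETD update: the learned emphasis w^T x_t replaces the followon trace."""
        delta = r_tp1 + gamma * x_tp1 @ nu - x_t @ nu
        return nu + alpha2 * (w @ x_t) * rho_t * delta * x_t

    # Synthetic single transition (hypothetical numbers).
    k = 4
    nu, w = np.zeros(k), np.ones(k)
    x_t, x_tp1 = np.eye(k)[0], np.eye(k)[1]
    nu_etd, M = etd_step(nu, M_prev=0.0, rho_prev=0.0, x_t=x_t, x_tp1=x_tp1,
                         r_tp1=1.0, rho_t=1.2, i_t=1.0, alpha=0.1, gamma=0.9)
    nu_gem = gem_etd_step(nu, w, x_t, x_tp1, r_tp1=1.0, rho_t=1.2, alpha2=0.1, gamma=0.9)
    print(nu_etd, nu_gem)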

  14. COF-PAC does better control than ACE in Reacher (averaged over 30 runs, mean ± std)

  15. Thanks
     • Code and Dockerfile are available at https://github.com/ShangtongZhang/DeepRL
