  1. Provably Convergent Two-Timescale Off-Policy Actor-Critic with Function Approximation
     Shangtong Zhang¹, Bo Liu², Hengshuai Yao³, Shimon Whiteson¹
     ¹University of Oxford, ²Auburn University, ³Huawei

  2. Preview
     • Off-policy control under the excursion objective $\sum_s d_\mu(s) v_\pi(s)$
     • The first provably convergent two-timescale off-policy actor-critic algorithm with function approximation
     • A new perspective on Emphatic TD (Sutton et al., 2016)
     • Convergence of regularized GTD-style algorithms under a changing target policy

  3. The excursion objective is commonly used for off-policy control
     • $J(\pi) = \sum_s d_\mu(s)\, i(s)\, v_\pi(s)$
     • $d_\mu$: stationary distribution of the behaviour policy
     • $v_\pi$: value function of the target policy
     • $i: \mathcal{S} \to [0, \infty)$: the interest function (Sutton et al., 2016)
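A minimal numerical sketch of the excursion objective, assuming a hypothetical 3-state example with made-up $d_\mu$, $i$, and $v_\pi$ (none of these numbers come from the talk):

    import numpy as np

    # Hypothetical 3-state example: stationary distribution of the behaviour
    # policy, interest function, and target-policy value function.
    d_mu = np.array([0.5, 0.3, 0.2])   # d_mu(s), sums to 1
    i = np.array([1.0, 1.0, 0.0])      # interest i(s) >= 0
    v_pi = np.array([2.0, -1.0, 4.0])  # v_pi(s)

    # Excursion objective J(pi) = sum_s d_mu(s) * i(s) * v_pi(s)
    J = np.sum(d_mu * i * v_pi)
    print(J)  # 0.7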

  4. The off-policy policy gradient theorem gives the exact gradient (Imani et al., 2018)
     • $\nabla J(\pi) = \sum_s \bar{m}(s) \sum_a q_\pi(s,a)\, \nabla \pi(a|s)$
     • $\bar{m} \doteq (I - \gamma P_\pi^\top)^{-1} D i \in \mathbb{R}^{N_s}$, where $D = \mathrm{diag}(d_\mu)$

  5. Rewriting the gradient gives a taxonomy of previous algorithms
     • $\nabla_\theta J(\pi) = \mathbb{E}_{s \sim d_\mu,\, a \sim \mu(\cdot|s)}\big[\, m_\pi(s)\, \rho_\pi(s,a)\, q_\pi(s,a)\, \nabla_\theta \log \pi(a|s)\, \big]$
     • $m_\pi \doteq D^{-1}(I - \gamma P_\pi^\top)^{-1} D i$ (the emphasis)
     1. Ignore $m_\pi(s)$ (Degris et al., 2012)
     2. Use the followon trace to approximate $m_\pi(s)$ (Imani et al., 2018)
     3. Learn $m_\pi(s)$ with function approximation (ours)
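A sketch of how $\bar{m}$ and the emphasis $m_\pi$ could be computed exactly when the model is known, using a hypothetical random 3-state target-policy transition matrix (an assumption for illustration only); it also checks the identity $\bar{m}(s) = d_\mu(s)\, m_\pi(s)$ that turns the sum over states into the expectation under $d_\mu$:

    import numpy as np

    np.random.seed(0)
    n, gamma = 3, 0.9

    # Hypothetical model quantities (not from the talk).
    P_pi = np.random.rand(n, n); P_pi /= P_pi.sum(axis=1, keepdims=True)  # target-policy transitions
    d_mu = np.array([0.5, 0.3, 0.2])   # behaviour stationary distribution
    i = np.array([1.0, 1.0, 0.0])      # interest
    D = np.diag(d_mu)
    I = np.eye(n)

    # bar_m = (I - gamma P_pi^T)^{-1} D i            (slide 4)
    bar_m = np.linalg.solve(I - gamma * P_pi.T, D @ i)
    # m_pi = D^{-1} (I - gamma P_pi^T)^{-1} D i       (emphasis, slide 5)
    m_pi = np.linalg.solve(D, bar_m)

    # The rewrite on slide 5 relies on bar_m(s) = d_mu(s) * m_pi(s).
    assert np.allclose(bar_m, d_mu * m_pi)
    print(m_pi)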

  6. Ignoring emphasis is theoretically justified only in the tabular setting
     • Gradient estimator (Degris et al., 2012): $\rho_\pi(S_t, A_t)\, q_\pi(S_t, A_t)\, \nabla_\theta \log \pi(A_t|S_t)$
     • Off-Policy Actor-Critic (Off-PAC)
     • Extensions: off-policy DPG, DDPG, ACER, off-policy EPG, TD3, IMPALA
     • Off-PAC is biased even with linear function approximation (Degris et al., 2012; Imani et al., 2018; Maei et al., 2018; Liu et al., 2019)
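A per-sample sketch of the Off-PAC gradient estimator, assuming a tabular softmax target policy and made-up behaviour policy and critic estimates (all names and numbers here are illustrative assumptions, not from the talk):

    import numpy as np

    def softmax(z):
        z = z - z.max()
        e = np.exp(z)
        return e / e.sum()

    # Hypothetical tabular setup: theta[s, a] parameterises pi(a|s) = softmax(theta[s]).
    n_states, n_actions = 3, 2
    theta = np.zeros((n_states, n_actions))
    mu = np.full((n_states, n_actions), 1.0 / n_actions)  # behaviour policy
    q_hat = np.ones((n_states, n_actions))                # a critic's estimate of q_pi

    def off_pac_estimator(s, a):
        """rho(s,a) * q(s,a) * grad_theta log pi(a|s) -- note: no emphasis term."""
        pi_s = softmax(theta[s])
        rho = pi_s[a] / mu[s, a]
        grad = np.zeros_like(theta)
        grad[s] = -pi_s
        grad[s, a] += 1.0          # gradient of log softmax w.r.t. theta[s, :]
        return rho * q_hat[s, a] * grad

    print(off_pac_estimator(s=1, a=0))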

  7. The followon trace is unbiased only in a limiting sense
     • Gradient estimator (Imani et al., 2018): $M_t\, \rho_\pi(S_t, A_t)\, q_\pi(S_t, A_t)\, \nabla_\theta \log \pi(A_t|S_t)$
     • Followon trace: $M_t \doteq i(S_t) + \gamma \rho_{t-1} M_{t-1}$
     • Assuming $\pi$ is FIXED, $\lim_{t \to \infty} \mathbb{E}_\mu[M_t \mid S_t = s] = m_\pi(s)$
     • $M_t$ is a scalar, but $m_\pi$ is a vector!
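A sketch of the followon-trace recursion along an off-policy stream; the trajectory below is synthetic (an assumption for illustration) and only shows that $M_t$ is a single scalar carried along the trajectory, whereas $m_\pi$ assigns one value per state:

    import numpy as np

    np.random.seed(1)
    gamma, T = 0.9, 10

    # Hypothetical per-step quantities (not from the talk).
    states = np.random.randint(0, 3, size=T)
    interest = np.ones(3)                        # i(s)
    rho = np.random.uniform(0.5, 1.5, size=T)    # importance sampling ratios

    M = 0.0
    rho_prev = 0.0   # convention: rho_{-1} M_{-1} = 0
    for t in range(T):
        M = interest[states[t]] + gamma * rho_prev * M   # M_t = i(S_t) + gamma rho_{t-1} M_{t-1}
        rho_prev = rho[t]
        print(t, states[t], M)   # one scalar M_t per time step, not one value per state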

  8. Emphasis is the fixed point of a Bellman-like operator
     • $\hat{\mathcal{U}} y \doteq i + \gamma D^{-1} P_\pi^\top D y$
     • $\hat{\mathcal{U}}$ is a contraction mapping w.r.t. some weighted maximum norm (for any $\gamma < 1$)
     • The emphasis $m_\pi$ is its fixed point
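A sketch verifying, on a hypothetical 3-state model (assumed for illustration), that repeatedly applying $\hat{\mathcal{U}}$ converges and that $m_\pi = D^{-1}(I - \gamma P_\pi^\top)^{-1} D i$ is its fixed point:

    import numpy as np

    np.random.seed(2)
    n, gamma = 3, 0.9
    P_pi = np.random.rand(n, n); P_pi /= P_pi.sum(axis=1, keepdims=True)
    d_mu = np.array([0.5, 0.3, 0.2])
    i = np.array([1.0, 1.0, 0.0])
    D, D_inv = np.diag(d_mu), np.diag(1.0 / d_mu)

    U_hat = lambda y: i + gamma * D_inv @ P_pi.T @ D @ y   # the Bellman-like operator

    y = np.zeros(n)
    for _ in range(500):         # fixed-point iteration converges since U_hat is a contraction
        y = U_hat(y)

    m_pi = D_inv @ np.linalg.solve(np.eye(n) - gamma * P_pi.T, D @ i)
    assert np.allclose(y, m_pi)            # the emphasis is the fixed point
    assert np.allclose(U_hat(m_pi), m_pi)
    print(m_pi)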

  9. We propose to learn emphasis based on $\hat{\mathcal{U}}$
     • A semi-gradient update based on $\hat{\mathcal{U}}$
     • Gradient Temporal Difference learning (GTD) minimises the MSPBE $L(\nu) \doteq \|\Pi \mathcal{U} v - v\|_D^2$ with $v = X\nu$ (here $\mathcal{U}$ is the Bellman operator for $v_\pi$)
     • Gradient Emphasis Learning (GEM) minimises $L(w) \doteq \|\Pi \hat{\mathcal{U}} m - m\|_D^2$ with $m = Xw$
     • The learned emphasis then plugs into $\nabla_\theta J(\pi) = \mathbb{E}_{s \sim d_\mu,\, a \sim \mu(\cdot|s)}\big[\, m_\pi(s)\, \rho_\pi(s,a)\, q_\pi(s,a)\, \nabla_\theta \log \pi(a|s)\, \big]$
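One way to see what GEM's objective asks for: with a known model and linear features $X$, the projected objective $\|\Pi \hat{\mathcal{U}} Xw - Xw\|_D^2$ can be written as $(b - Aw)^\top C^{-1}(b - Aw)$ with $A = X^\top (I - \gamma P_\pi^\top) D X$, $b = X^\top D i$, $C = X^\top D X$, mirroring the standard MSPBE behind GTD. The sketch below solves it directly; the features and model are hypothetical, and this is the model-based objective, not the paper's sampled algorithm:

    import numpy as np

    np.random.seed(3)
    n, k, gamma = 5, 3, 0.9       # 5 states, 3 features (hypothetical)
    P_pi = np.random.rand(n, n); P_pi /= P_pi.sum(axis=1, keepdims=True)
    d_mu = np.random.dirichlet(np.ones(n))
    i = np.ones(n)
    X = np.random.randn(n, k)     # feature matrix, one row per state
    D = np.diag(d_mu)

    # GEM objective in matrix form: ||Pi U_hat Xw - Xw||_D^2 = (b - Aw)^T C^{-1} (b - Aw)
    A = X.T @ (np.eye(n) - gamma * P_pi.T) @ D @ X
    b = X.T @ D @ i
    C = X.T @ D @ X

    w_star = np.linalg.solve(A, b)   # minimiser when A is invertible
    m_hat = X @ w_star               # GEM's approximation of the emphasis

    m_pi = np.linalg.solve(D, np.linalg.solve(np.eye(n) - gamma * P_pi.T, D @ i))
    print(np.round(m_hat, 3))
    print(np.round(m_pi, 3))         # the two match only if m_pi lies in the span of X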

  10. Regularized GTD-style algorithms converge under a changing policy
     • TD converges under a changing policy (Konda's thesis), but those arguments can NOT be used to show the convergence of GTD
     • Regularization has to be used for GTD-style algorithms:
       GTD: $L(\nu) \doteq \|\Pi \mathcal{U} X\nu - X\nu\|_D^2 + \|\nu\|^2$
       GEM: $L(w) \doteq \|\Pi \hat{\mathcal{U}} Xw - Xw\|_D^2 + \|w\|^2$
     • Regularization in GTD has been studied from an optimization perspective under a fixed $\pi$: Mahadevan et al. (2014), Liu et al. (2015), Macua et al. (2015), Yu (2017), Du et al. (2017)
     • Here: a stochastic approximation perspective under a changing $\pi$
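With the ridge term added, the minimiser no longer drives $b - Aw$ to zero but solves a strongly convex problem that is well-posed even as $\pi$ (and hence $A$, $b$) changes. A small sketch of the closed-form minimiser of $(b - Aw)^\top C^{-1}(b - Aw) + \|w\|^2$; the matrices in the example are arbitrary stand-ins, not quantities from the paper:

    import numpy as np

    def regularized_gtd_solution(A, b, C):
        """Minimise (b - Aw)^T C^{-1} (b - Aw) + ||w||^2 in closed form.

        Setting the gradient to zero gives (A^T C^{-1} A + I) w = A^T C^{-1} b,
        which has a unique solution even when A is singular or drifting over time.
        """
        k = A.shape[1]
        C_inv = np.linalg.inv(C)
        return np.linalg.solve(A.T @ C_inv @ A + np.eye(k), A.T @ C_inv @ b)

    # Example with hypothetical matrices of the same shapes as the GEM sketch above.
    rng = np.random.default_rng(4)
    A, b = rng.normal(size=(3, 3)), rng.normal(size=3)
    C0 = rng.normal(size=(5, 3))
    C = C0.T @ C0 + 1e-3 * np.eye(3)   # any positive-definite C
    print(regularized_gtd_solution(A, b, C))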

  11. The Convergent Off-Policy Actor-Critic (COF-PAC) algorithm
     • $\nabla_\theta J(\pi) = \mathbb{E}_{s \sim d_\mu,\, a \sim \mu(\cdot|s)}\big[\, m_\pi(s)\, \rho_\pi(s,a)\, q_\pi(s,a)\, \nabla_\theta \log \pi(a|s)\, \big]$
     • Critic objectives: $L(\nu) \doteq \|\Pi \mathcal{U} X\nu - X\nu\|_D^2 + \|\nu\|^2$ and $L(w) \doteq \|\Pi \hat{\mathcal{U}} Xw - Xw\|_D^2 + \|w\|^2$
     • Two-timescale updates instead of bi-level optimization as in SBEED
     • COF-PAC visits a neighbourhood of a stationary point of $J(\pi)$ infinitely many times
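A sketch of how the two timescales and the emphasis-weighted actor step could fit together. The tabular softmax parameterisation, the stepsize schedules, and the stand-in estimates m_hat and q_hat are placeholder assumptions for illustration, not the paper's exact COF-PAC update rules:

    import numpy as np

    def softmax(z):
        z = z - z.max()
        e = np.exp(z)
        return e / e.sum()

    # Hypothetical tabular setup (placeholders, not the paper's exact algorithm).
    n_states, n_actions, T = 3, 2, 1000
    theta = np.zeros((n_states, n_actions))     # actor: pi(a|s) = softmax(theta[s])
    mu = np.full((n_states, n_actions), 0.5)    # behaviour policy
    m_hat = np.ones(n_states)                   # stand-in for the GEM emphasis estimate w^T x(s)
    q_hat = np.ones((n_states, n_actions))      # stand-in for the critic's q estimate
    rng = np.random.default_rng(5)

    s = 0
    for t in range(T):
        alpha = (t + 1) ** -0.6    # fast stepsize: the GEM and value critics would use this
        beta = (t + 1) ** -0.9     # slow stepsize for the actor, so beta/alpha -> 0

        a = rng.integers(n_actions)             # act with the behaviour policy mu
        pi_s = softmax(theta[s])
        rho = pi_s[a] / mu[s, a]

        # (critic and GEM updates with stepsize alpha would go here)

        # Actor ascent step following the gradient expression above:
        # theta <- theta + beta * m_hat(s) * rho * q_hat(s,a) * grad_theta log pi(a|s)
        grad = -pi_s
        grad[a] += 1.0
        theta[s] += beta * m_hat[s] * rho * q_hat[s, a] * grad

        s = rng.integers(n_states)              # synthetic next state (assumption)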

  12. GEM approximates the emphasis better than the followon trace in Baird's counterexample (averaged over 30 runs, mean ± std)

  13. GEM-ETD does better policy evaluation than ETD in Baird's counterexample
     • ETD: $\nu_{t+1} \leftarrow \nu_t + \alpha\, M_t\, \rho_t\, (R_{t+1} + \gamma x_{t+1}^\top \nu_t - x_t^\top \nu_t)\, x_t$
     • GEM-ETD: $\nu_{t+1} \leftarrow \nu_t + \alpha_2\, (w_t^\top x_t)\, \rho_t\, (R_{t+1} + \gamma x_{t+1}^\top \nu_t - x_t^\top \nu_t)\, x_t$
     (Averaged over 30 runs, mean ± std)
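A per-transition sketch of the two updates on this slide, written as plain functions; x_t and x_tp1 are feature vectors, and the single synthetic transition at the bottom is only an assumption for illustration:

    import numpy as np

    def etd_step(nu, M_prev, rho_prev, x_t, x_tp1, r_tp1, rho_t, i_t, alpha, gamma):
        """One ETD update: nu += alpha * M_t * rho_t * delta_t * x_t, with followon trace M_t."""
        M_t = i_t + gamma * rho_prev * M_prev
        delta = r_tp1 + gamma * x_tp1 @ nu - x_t @ nu
        return nu + alpha * M_t * rho_t * delta * x_t, M_t

    def gem_etd_step(nu, w, x_t, x_tp1, r_tp1, rho_t, alpha2, gamma):
        """One GEM-ETD update: the learned emphasis w^T x_t replaces the followon trace."""
        delta = r_tp1 + gamma * x_tp1 @ nu - x_t @ nu
        return nu + alpha2 * (w @ x_t) * rho_t * delta * x_t

    # Synthetic single transition (hypothetical numbers).
    k = 4
    nu, w = np.zeros(k), np.ones(k)
    x_t, x_tp1 = np.eye(k)[0], np.eye(k)[1]
    nu_etd, M = etd_step(nu, M_prev=0.0, rho_prev=0.0, x_t=x_t, x_tp1=x_tp1,
                         r_tp1=1.0, rho_t=1.2, i_t=1.0, alpha=0.1, gamma=0.9)
    nu_gem = gem_etd_step(nu, w, x_t, x_tp1, r_tp1=1.0, rho_t=1.2, alpha2=0.1, gamma=0.9)
    print(nu_etd, nu_gem)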

  14. COF-PAC does better control than ACE in Reacher (averaged over 30 runs, mean ± std)

  15. Thanks
     • Code and Dockerfile are available at https://github.com/ShangtongZhang/DeepRL
