GradientDICE: Rethinking Generalized Offline Estimation of Stationary Values


  1. GradientDICE: Rethinking Generalized Offline Estimation of Stationary Values. Shangtong Zhang¹, Bo Liu², Shimon Whiteson¹. ¹University of Oxford, ²Auburn University

  2. Preview
     • Off-policy evaluation with density ratio learning
     • Use the Perron-Frobenius theorem to reduce the constraints from 3 to 2, removing the positivity constraint and making the problem convex in both the tabular and linear settings
     • A special weighted $L_2$ norm
     • Improvements over DualDICE and GenDICE in the tabular, linear, and neural-network settings

  3. Off-policy evaluation estimates the performance of a policy with off-policy data.
     • The target policy $\pi$
     • A data set $\{s_i, a_i, r_i, s'_i\}_{i=1,\dots,N}$
     • $s_i, a_i \sim d_\mu(s, a)$, $r_i = r(s_i, a_i)$, $s'_i \sim p(\cdot \mid s_i, a_i)$
     • The performance metric $\rho_\gamma(\pi) \doteq \sum_{s,a} d_\gamma(s, a) r(s, a)$
     • $d_\gamma(s, a) \doteq (1 - \gamma) \sum_{t=0}^{\infty} \gamma^t \Pr(S_t = s, A_t = a \mid \pi, p)$ for $\gamma < 1$
     • $d_\gamma(s, a) \doteq \lim_{t \to \infty} \Pr(S_t = s, A_t = a \mid \pi, p)$ for $\gamma = 1$
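
For $\gamma < 1$, the discounted occupancy has a closed form, $d_\gamma = (1 - \gamma)(I - \gamma P_\pi^\top)^{-1} \mu_0$ (the matrix $P_\pi$ is defined on slide 5), so $\rho_\gamma(\pi)$ can be computed exactly in small tabular problems. Below is a minimal sketch on a made-up 2-state, 2-action MDP; all inputs ($P$, $\pi$, $r$, $\mu_0$) are random placeholders, not from the paper.

```python
import numpy as np

n_s, n_a, gamma = 2, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))   # p(s'|s,a)
pi = rng.dirichlet(np.ones(n_a), size=n_s)         # pi(a|s)
r = rng.standard_normal((n_s, n_a))                # r(s, a)
mu0 = (rng.dirichlet(np.ones(n_s))[:, None] * pi).reshape(-1)  # mu0(s,a) = mu0(s) pi(a|s)

# P_pi[(s,a), (s',a')] = p(s'|s,a) * pi(a'|s'), flattened over (s, a) pairs
P_pi = (P[:, :, :, None] * pi[None, None, :, :]).reshape(n_s * n_a, n_s * n_a)

# d_gamma = (1-gamma) sum_t gamma^t (P_pi^T)^t mu0 = (1-gamma) (I - gamma P_pi^T)^{-1} mu0
d_gamma = (1 - gamma) * np.linalg.solve(np.eye(n_s * n_a) - gamma * P_pi.T, mu0)
print(d_gamma @ r.reshape(-1))                     # rho_gamma(pi)
```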

  4. Density ratio learning is promising for off-policy evaluation (Liu et al., 2018).
     • Learn $\tau_*(s, a) \doteq \frac{d_\gamma(s, a)}{d_\mu(s, a)}$ with function approximation
     • $\rho_\gamma(\pi) = \sum_{s,a} d_\mu(s, a) \tau_*(s, a) r(s, a) \approx \frac{1}{N} \sum_{i=1}^{N} \tau_*(s_i, a_i) r_i$
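
Given the ratio, the estimator on this slide is a one-line sample average. A minimal sketch, assuming the exact ratio is available; the tabular quantities (d_gamma, d_mu, r) are made-up placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
n_sa, N = 4, 100_000
d_gamma = rng.dirichlet(np.ones(n_sa))    # target occupancy, flattened over (s, a)
d_mu = rng.dirichlet(np.ones(n_sa))       # behavior distribution
r = rng.standard_normal(n_sa)             # r(s, a), flattened
tau_star = d_gamma / d_mu                 # exact density ratio

# Sample (s_i, a_i) ~ d_mu and average tau*(s_i, a_i) * r_i:
idx = rng.choice(n_sa, size=N, p=d_mu)
print((tau_star[idx] * r[idx]).mean(), d_gamma @ r)   # estimate vs. exact value
```

In practice $\tau_*$ is unknown and must be learned; the rest of the talk is about how.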

  5. The density ratio satisfies a Bellman-like equation (Zhang et al., 2020): $D \tau_* = (1 - \gamma) \mu_0 + \gamma P_\pi^\top D \tau_*$
     • $D \in \mathbb{R}^{N_{sa} \times N_{sa}}$, $D \doteq \mathrm{diag}(d_\mu)$
     • $\tau_* \in \mathbb{R}^{N_{sa}}$
     • $\mu_0 \in \mathbb{R}^{N_{sa}}$, $\mu_0(s, a) \doteq \mu_0(s) \pi(a \mid s)$
     • $P_\pi \in \mathbb{R}^{N_{sa} \times N_{sa}}$, $P_\pi((s, a), (s', a')) \doteq p(s' \mid s, a) \pi(a' \mid s')$
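
Since $D \tau_* = d_\gamma$, the equation is just the fixed-point identity for the discounted occupancy, which is easy to verify numerically. A quick check on the same kind of made-up MDP as in the earlier sketch:

```python
import numpy as np

n_s, n_a, gamma = 2, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))   # p(s'|s,a)
pi = rng.dirichlet(np.ones(n_a), size=n_s)         # pi(a|s)
mu0 = (rng.dirichlet(np.ones(n_s))[:, None] * pi).reshape(-1)
P_pi = (P[:, :, :, None] * pi[None, None, :, :]).reshape(n_s * n_a, n_s * n_a)

# D tau* = d_gamma, so check d_gamma = (1-gamma) mu0 + gamma P_pi^T d_gamma:
d_gamma = (1 - gamma) * np.linalg.solve(np.eye(n_s * n_a) - gamma * P_pi.T, mu0)
assert np.allclose(d_gamma, (1 - gamma) * mu0 + gamma * P_pi.T @ d_gamma)
print("Bellman-like equation holds")
```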

  6. $\gamma < 1$ is easy, as it implies a unique solution:
     • $D \tau = (1 - \gamma) \mu_0 + \gamma P_\pi^\top D \tau$
     • $(I - \gamma P_\pi^\top)^{-1}$ exists
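
Concretely, invertibility gives a closed form: $\tau = D^{-1} (I - \gamma P_\pi^\top)^{-1} (1 - \gamma) \mu_0$. A minimal sketch with made-up inputs ($d_\mu$ here is a hypothetical behavior distribution):

```python
import numpy as np

n_s, n_a, gamma = 2, 2, 0.9
n_sa = n_s * n_a
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))   # p(s'|s,a)
pi = rng.dirichlet(np.ones(n_a), size=n_s)         # pi(a|s)
mu0 = (rng.dirichlet(np.ones(n_s))[:, None] * pi).reshape(-1)
d_mu = rng.dirichlet(np.ones(n_sa))                # hypothetical behavior distribution
P_pi = (P[:, :, :, None] * pi[None, None, :, :]).reshape(n_sa, n_sa)

# (I - gamma P_pi^T) D tau = (1 - gamma) mu0, and the inverse exists for gamma < 1:
D_tau = np.linalg.solve(np.eye(n_sa) - gamma * P_pi.T, (1 - gamma) * mu0)
tau = D_tau / d_mu                                 # tau = D^{-1} (D tau)
print(tau)
```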

  7. Previous work requires three constraints for $\gamma = 1$:
     1. $D \tau = P_\pi^\top D \tau$
     2. $D \tau \succ 0$
     3. $\mathbf{1}^\top D \tau = 1$
     GenDICE (Zhang et al., 2020) considers 1 & 3 explicitly, $L(\tau) \doteq \mathrm{divergence}(D \tau, P_\pi^\top D \tau) + (1 - \mathbf{1}^\top D \tau)^2$, and implements 2 with positive function approximation (e.g., $\tau^2$, $e^\tau$), projected SGD, or stochastic mirror descent. Mousavi et al. (2020) implement 3 with self-normalization over all state-action pairs. (A small sketch of these devices follows.)
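
For intuition, a minimal sketch of the two constraint-handling devices mentioned above, with hypothetical tabular parameters theta and a made-up d_mu; composing the loss with either map is what costs convexity, as the next slide notes:

```python
import numpy as np

theta = np.zeros(8)               # hypothetical tabular parameters, one per (s, a)
d_mu = np.full(8, 1 / 8)          # made-up behavior distribution

# Constraint 2 via positive function approximation:
tau_exp = np.exp(theta)           # tau > 0 for every theta
tau_sq = theta ** 2               # tau >= 0 for every theta

# Constraint 3 via self-normalization over all state-action pairs:
w = np.exp(theta)
tau_norm = w / (d_mu @ w)         # guarantees sum_{s,a} d_mu(s,a) tau(s,a) = 1
```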

  8. Previous work requires three constraints for $\gamma = 1$:
     1. $D \tau = P_\pi^\top D \tau$
     2. $D \tau \succ 0$
     3. $\mathbf{1}^\top D \tau = 1$
     The objective becomes non-convex with positive function approximation or self-normalization, even in the tabular or linear setting. Projected SGD is computationally infeasible. Stochastic mirror descent significantly reduces the capacity of the (linear) function class.

  9. We actually need only two constraints!
     1. $D \tau = P_\pi^\top D \tau$
     2. $D \tau \succ 0$
     3. $\mathbf{1}^\top D \tau = 1$
     Perron-Frobenius theorem: the solution space of 1 is one-dimensional, so either 2 or 3 alone is enough to guarantee a unique solution.
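
A quick numerical illustration on a made-up ergodic chain: writing $x \doteq D \tau$, constraint 1 reads $x = P_\pi^\top x$, and by Perron-Frobenius the eigenspace of $P_\pi^\top$ at eigenvalue 1 is one-dimensional, so normalizing (constraint 3) already yields the unique positive solution (constraint 2):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
P_pi = rng.dirichlet(np.ones(n), size=n)         # row-stochastic; ergodic for generic draws
w, V = np.linalg.eig(P_pi.T)
unit = np.isclose(w, 1.0)
print(unit.sum())                                # 1: eigenvalue 1 is simple

x = np.real(V[:, unit][:, 0])
x /= x.sum()                                     # impose constraint 3
print(x.min() > 0, np.allclose(P_pi.T @ x, x))   # constraint 2 then holds for free
```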

  10. GradientDICE considers a special $L_2$ norm for the loss.
     • GenDICE: $L(\tau) \doteq \mathrm{divergence}((1 - \gamma) \mu_0 + \gamma P_\pi^\top D \tau, D \tau) + (1 - \mathbf{1}^\top D \tau)^2$ subject to $D \tau \succ 0$
     • GradientDICE: $L(\tau) \doteq \|(1 - \gamma) \mu_0 + \gamma P_\pi^\top D \tau - D \tau\|_{D^{-1}}^2 + (1 - \mathbf{1}^\top D \tau)^2$
     • Compare the GradientTD loss, which uses $\| \cdot \|_D$

  11. GradientDICE considers a special $L_2$ norm for the loss.
     • $L(\tau) \doteq \|(1 - \gamma) \mu_0 + \gamma P_\pi^\top D \tau - D \tau\|_{D^{-1}}^2 + (1 - \mathbf{1}^\top D \tau)^2$
     • Equivalent saddle-point formulation:
       $\min_{\tau \in \mathbb{R}^{N_{sa}}} \max_{f \in \mathbb{R}^{N_{sa}}, \eta \in \mathbb{R}} L(\tau, \eta, f) \doteq (1 - \gamma) \mathbb{E}_{\mu_0}[f(s, a)] + \gamma \mathbb{E}_{p}[\tau(s, a) f(s', a')] - \mathbb{E}_{d_\mu}[\tau(s, a) f(s, a)] - \frac{1}{2} \mathbb{E}_{d_\mu}[f(s, a)^2] + \lambda \left( \mathbb{E}_{d_\mu}[\eta \tau(s, a) - \eta] - \frac{\eta^2}{2} \right)$
     • Convergence in both the tabular and linear settings for $\gamma \in [0, 1]$
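
A tabular sketch of the resulting updates: gradient ascent on $(f, \eta)$ and descent on $\tau$, with per-transition stochastic gradients read off the objective above. The MDP, sampler, step size, and $\lambda$ are made-up placeholders, and the constant step size is a simplification (the analysis uses decaying step sizes); this illustrates the update rules, not the authors' implementation (see the repo linked on the last slide).

```python
import numpy as np

rng = np.random.default_rng(0)
n_s, n_a, gamma, lam, lr = 2, 2, 0.9, 1.0, 0.05
n_sa = n_s * n_a
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))   # p(s'|s,a)
pi = rng.dirichlet(np.ones(n_a), size=n_s)         # target policy pi(a|s)
d_mu = rng.dirichlet(np.ones(n_sa))                # behavior distribution over (s, a)
mu0 = (rng.dirichlet(np.ones(n_s))[:, None] * pi).reshape(-1)

tau, f, eta = np.zeros(n_sa), np.zeros(n_sa), 0.0
for _ in range(300_000):
    sa = rng.choice(n_sa, p=d_mu)                  # (s, a) ~ d_mu
    s, a = divmod(sa, n_a)
    s2 = rng.choice(n_s, p=P[s, a])                # s' ~ p(.|s, a)
    sa2 = s2 * n_a + rng.choice(n_a, p=pi[s2])     # a' ~ pi(.|s')
    sa0 = rng.choice(n_sa, p=mu0)                  # (s0, a0) ~ mu0

    g_f = np.zeros(n_sa)
    g_f[sa0] += 1 - gamma                          # from (1-gamma) E_mu0[f]
    g_f[sa2] += gamma * tau[sa]                    # from gamma E_p[tau(s,a) f(s',a')]
    g_f[sa] += -tau[sa] - f[sa]                    # from -E[tau f] - (1/2) E[f^2]
    g_eta = lam * (tau[sa] - 1 - eta)              # from lam (E[eta tau - eta] - eta^2/2)
    g_tau = gamma * f[sa2] - f[sa] + lam * eta     # dL/dtau at the sampled (s, a)

    f += lr * g_f                                  # ascent on f
    eta += lr * g_eta                              # ascent on eta
    tau[sa] -= lr * g_tau                          # descent on tau

# Compare with the ground-truth ratio d_gamma / d_mu:
P_pi = (P[:, :, :, None] * pi[None, None, :, :]).reshape(n_sa, n_sa)
d_gamma = (1 - gamma) * np.linalg.solve(np.eye(n_sa) - gamma * P_pi.T, mu0)
print(tau, d_gamma / d_mu)
```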

  12. GradientDICE outperforms baselines in Boyan's Chain (tabular).
     • 30 runs (mean + standard errors)
     • Grid search for hyperparameters, e.g., learning rates from $\{4^{-6}, 4^{-5}, \dots, 4^{-1}\}$
     • Tuned to minimize the final prediction error

  13. GradientDICE outperforms baselines in Boyan's Chain (linear).
     • 30 runs (mean + standard errors)
     • Grid search for hyperparameters, e.g., learning rates from $\{4^{-6}, 4^{-5}, \dots, 4^{-1}\}$
     • Tuned to minimize the final prediction error

  14. GradientDICE outperforms baselines in Reacher-v2 (neural network).
     • 30 runs (mean + standard errors)
     • Grid search for hyperparameters, e.g., learning rates from $\{0.01, 0.005, 0.001\}$ and the penalty coefficient from $\{0.1, 1\}$
     • Tuned to minimize the final prediction error

  15. Thanks! Code and a Dockerfile are available at https://github.com/ShangtongZhang/DeepRL
