From Importance Sampling to Doubly Robust Policy Gradient


  1. From Importance Sampling to Doubly Robust Policy Gradient. Jiawei Huang (UIUC), Nan Jiang (UIUC).

  2. Basic Idea: connecting Policy Gradient Estimators and Off-Policy Evaluation Estimators.

  3. Basic Idea: the two families are linked through the finite-difference definition of the policy gradient,
     $$\nabla_\theta J(\pi_\theta) = \lim_{\Delta\theta \to 0} \frac{J(\pi_{\theta+\Delta\theta}) - J(\pi_\theta)}{\Delta\theta},$$
     where the return difference on the right is estimated by off-policy evaluation (OPE) estimators and the limit yields a policy gradient estimator.

  4. Basic Idea: differentiating classical importance sampling (IS) OPE estimators recovers classical policy gradient estimators. Writing $\rho_{[0:t]} = \prod_{t'=0}^{t} \pi_{\theta+\Delta\theta}^{t'} / \pi_\theta^{t'}$ for the cumulative importance weight:
     - Trajectory-wise IS, $\rho_{[0:T]} \sum_{t=0}^{T} \gamma^t r_t$, differentiates into REINFORCE, $\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta^t \sum_{t'=0}^{T} \gamma^{t'} r_{t'}$.
     - Step-wise IS, $\sum_{t=0}^{T} \gamma^t \rho_{[0:t]} r_t$, differentiates into the standard PG (Tang and Abbeel, 2010), $\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta^t \sum_{t'=t}^{T} \gamma^{t'} r_{t'}$.
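A minimal numpy sketch of these two IS-based OPE estimators and their PG counterparts, computed for a single recorded trajectory; the input arrays (rewards, per-step behavior/target log-probabilities logp_b and logp_t, and per-step score vectors grad_logp) are hypothetical, and the code is an illustration rather than the authors' implementation:

    import numpy as np

    def traj_wise_is(rewards, logp_b, logp_t, gamma):
        # Trajectory-wise IS: rho_{[0:T]} * sum_t gamma^t r_t
        rho = np.exp(np.sum(logp_t - logp_b))            # full-trajectory importance weight
        disc = gamma ** np.arange(len(rewards))
        return rho * np.sum(disc * rewards)

    def step_wise_is(rewards, logp_b, logp_t, gamma):
        # Step-wise IS: sum_t gamma^t rho_{[0:t]} r_t
        rho_0t = np.exp(np.cumsum(logp_t - logp_b))      # cumulative weights rho_{[0:t]}
        disc = gamma ** np.arange(len(rewards))
        return np.sum(disc * rho_0t * rewards)

    def reinforce(rewards, grad_logp, gamma):
        # REINFORCE: sum_t grad log pi_t * (sum_{t'} gamma^{t'} r_{t'}); grad_logp has shape (n, d)
        disc_return = np.sum((gamma ** np.arange(len(rewards))) * rewards)
        return disc_return * grad_logp.sum(axis=0)

    def standard_pg(rewards, grad_logp, gamma):
        # Standard PG: sum_t grad log pi_t * (sum_{t'>=t} gamma^{t'} r_{t'})
        disc_rewards = (gamma ** np.arange(len(rewards))) * rewards
        rewards_to_go = np.cumsum(disc_rewards[::-1])[::-1]   # sum_{t'>=t} gamma^{t'} r_{t'}
        return np.sum(grad_logp * rewards_to_go[:, None], axis=0)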

  5. Basic Idea: the correspondence extends to baselines. OPE with state baselines,
     $$b_0 + \sum_{t=0}^{T} \gamma^t \rho_{[0:t]} \big(r_t + \gamma b_{t+1} - b_t\big),$$
     differentiates into PG with state baselines,
     $$\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta^t \Big(\sum_{t'=t}^{T} \gamma^{t'} r_{t'} - \gamma^t b_t\Big).$$
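A matching sketch for this baseline-augmented pair, under the same assumptions as above, with a hypothetical per-step state baseline array b (one entry per visited state; the baseline after the final step is taken as zero):

    import numpy as np

    def ope_with_state_baselines(rewards, logp_b, logp_t, b, gamma):
        # b_0 + sum_t gamma^t rho_{[0:t]} (r_t + gamma * b_{t+1} - b_t), with the baseline
        # beyond the horizon taken as 0
        n = len(rewards)
        b_ext = np.append(b, 0.0)                        # b_0, ..., b_{n-1}, then 0
        rho_0t = np.exp(np.cumsum(logp_t - logp_b))
        disc = gamma ** np.arange(n)
        return b_ext[0] + np.sum(disc * rho_0t * (rewards + gamma * b_ext[1:] - b_ext[:-1]))

    def pg_with_state_baselines(rewards, grad_logp, b, gamma):
        # sum_t grad log pi_t * (sum_{t'>=t} gamma^{t'} r_{t'} - gamma^t b_t)
        disc = gamma ** np.arange(len(rewards))
        rewards_to_go = np.cumsum((disc * rewards)[::-1])[::-1]
        return np.sum(grad_logp * (rewards_to_go - disc * b)[:, None], axis=0)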

  6. Basic Idea: the same recipe applied to the doubly robust OPE estimator,
     $$\hat{V}^{\pi'}_0 + \sum_{t=0}^{T} \gamma^t \rho_{[0:t]} \big(r_t + \gamma \hat{V}^{\pi'}_{t+1} - \hat{Q}^{\pi'}_t\big),$$
     yields the trajectory-wise control variate estimator (Cheng et al., 2019),
     $$\sum_{t=0}^{T} \Big[\nabla_\theta \log \pi_\theta^t \Big(\sum_{t'=t}^{T} \gamma^{t'} r_{t'} + \sum_{t'=t+1}^{T} \gamma^{t'} \big(\hat{V}^{\pi_\theta}_{t'} - \hat{Q}^{\pi_\theta}_{t'}\big)\Big) + \gamma^t \big(\nabla_\theta \hat{V}^{\pi_\theta}_t - \hat{Q}^{\pi_\theta}_t \nabla_\theta \log \pi_\theta^t\big)\Big],$$
     and DR-PG (ours),
     $$\sum_{t=0}^{T} \Big[\nabla_\theta \log \pi_\theta^t \Big(\sum_{t'=t}^{T} \gamma^{t'} r_{t'} + \sum_{t'=t+1}^{T} \gamma^{t'} \big(\hat{V}^{\pi_\theta}_{t'} - \hat{Q}^{\pi_\theta}_{t'}\big)\Big) + \gamma^t \big(\nabla_\theta \hat{V}^{\pi_\theta}_t - \nabla_\theta \hat{Q}^{\pi_\theta}_t - \hat{Q}^{\pi_\theta}_t \nabla_\theta \log \pi_\theta^t\big)\Big],$$
     depending on whether $\hat{Q}$ is held fixed or varies with the policy.

  7. Preliminaries. MDP setting:
     - Episodic RL with discount factor $\gamma$ and maximum episode length $T$;
     - Fixed initial state distribution;
     - A trajectory is defined as $s_0, a_0, r_0, s_1, \ldots, s_T, a_T, r_T$.
     Frequently used notations:
     - $\pi_\theta$: policy parameterized by $\theta$;
     - $J(\pi_\theta) = \mathbb{E}_{\pi_\theta}\big[\sum_{t=0}^{T} \gamma^t r(s_t, a_t)\big]$: expected discounted return of $\pi_\theta$.
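For reference, a minimal sketch of the on-policy Monte Carlo estimate of $J(\pi_\theta)$ under these conventions; the trajectories argument (a list of per-step reward arrays) is an assumed input format:

    import numpy as np

    def discounted_return(rewards, gamma):
        # sum_{t=0}^{T} gamma^t r_t for one trajectory
        return float(np.sum((gamma ** np.arange(len(rewards))) * rewards))

    def estimate_J(trajectories, gamma):
        # Monte Carlo estimate of J(pi_theta): average return over trajectories sampled from pi_theta
        return float(np.mean([discounted_return(r, gamma) for r in trajectories]))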

  8. A Concrete and Simple Example: from step-wise IS OPE to the standard PG. Treat $\pi_\theta$ as the behavior policy and $\pi_{\theta+\Delta\theta}$ as the target policy, and abbreviate $r_t = r(s_t, a_t)$ and $\pi_\theta^t = \pi_\theta(a_t \mid s_t)$. The step-wise IS estimate of the target policy's return, computed on a trajectory drawn from $\pi_\theta$, is
     $$\hat{J}(\pi_{\theta+\Delta\theta}) = \sum_{t=0}^{T} \gamma^t r_t \prod_{t'=0}^{t} \frac{\pi_{\theta+\Delta\theta}^{t'}}{\pi_\theta^{t'}}.$$

  9. A Concrete and Simple Example (cont.): expanding each importance ratio to first order in $\Delta\theta$,
     $$\hat{J}(\pi_{\theta+\Delta\theta}) = \sum_{t=0}^{T} \gamma^t r_t \prod_{t'=0}^{t} \Big(1 + \frac{\nabla_\theta \pi_\theta^{t'}}{\pi_\theta^{t'}} \Delta\theta + o(\Delta\theta)\Big).$$

  10. A Concrete and Simple Example (cont.): multiplying out the product and using $\nabla_\theta \log \pi_\theta^{t'} = \nabla_\theta \pi_\theta^{t'} / \pi_\theta^{t'}$,
     $$\hat{J}(\pi_{\theta+\Delta\theta}) = \hat{J}(\pi_\theta) + \Big(\sum_{t=0}^{T} \gamma^t r_t \sum_{t'=0}^{t} \nabla_\theta \log \pi_\theta^{t'}\Big) \Delta\theta + o(\Delta\theta).$$

  11. A Concrete and Simple Example (cont.): taking the limit then gives
     $$\lim_{\Delta\theta \to 0} \frac{\hat{J}(\pi_{\theta+\Delta\theta}) - \hat{J}(\pi_\theta)}{\Delta\theta} = \sum_{t=0}^{T} \gamma^t r_t \sum_{t'=0}^{t} \nabla_\theta \log \pi_\theta^{t'},$$
     which is known to be the standard PG estimator.
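A small numerical check of this derivation, assuming a toy two-action softmax policy and a hard-coded trajectory (both illustrative): the finite-difference derivative of the step-wise IS estimate matches the standard PG expression.

    import numpy as np

    def log_pi(theta, a):
        # log pi_theta(a) for a state-independent 2-action softmax with logits (theta, 0)
        logits = np.array([theta, 0.0])
        return logits[a] - np.log(np.exp(logits).sum())

    def grad_log_pi(theta, a):
        # d/dtheta log pi_theta(a)
        p0 = np.exp(theta) / (np.exp(theta) + 1.0)
        return (1.0 - p0) if a == 0 else -p0

    theta, gamma = 0.3, 0.9
    actions = [0, 1, 0, 0]                       # a fixed recorded trajectory (assumed)
    rewards = np.array([1.0, 0.5, -0.2, 2.0])
    disc = gamma ** np.arange(len(rewards))

    def step_wise_is(dtheta):
        # sum_t gamma^t r_t prod_{t'<=t} pi_{theta+dtheta}(a_{t'}) / pi_theta(a_{t'})
        log_ratio = np.array([log_pi(theta + dtheta, a) - log_pi(theta, a) for a in actions])
        return np.sum(disc * np.exp(np.cumsum(log_ratio)) * rewards)

    eps = 1e-5
    finite_diff = (step_wise_is(eps) - step_wise_is(-eps)) / (2 * eps)
    standard_pg = np.sum(disc * rewards * np.cumsum([grad_log_pi(theta, a) for a in actions]))
    print(finite_diff, standard_pg)              # the two values agree up to finite-difference error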

  12. Doubly-Robust Policy Gradient (DR-PG). Definition (Jiang and Li, 2016): the doubly robust OPE estimator, which is unbiased, is
     $$\hat{J}(\pi_{\theta+\Delta\theta}) = \hat{V}^{\pi_{\theta+\Delta\theta}}_0 + \sum_{t=0}^{T} \gamma^t \prod_{t'=0}^{t} \frac{\pi^{t'}_{\theta+\Delta\theta}}{\pi^{t'}_\theta} \Big(r_t + \gamma \hat{V}^{\pi_{\theta+\Delta\theta}}_{t+1} - \hat{Q}^{\pi_{\theta+\Delta\theta}}_t\Big),$$
     where $\hat{V}^{\pi_{\theta+\Delta\theta}} = \mathbb{E}_{a \sim \pi_{\theta+\Delta\theta}}\big[\hat{Q}^{\pi_{\theta+\Delta\theta}}\big]$.
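A minimal sketch of this DR-OPE estimator for one trajectory; Q_hat and V_hat are hypothetical arrays holding $\hat{Q}^{\pi'}(s_t, a_t)$ and $\hat{V}^{\pi'}(s_t)$ along the trajectory, with the value beyond the horizon taken as zero:

    import numpy as np

    def dr_ope(rewards, logp_b, logp_t, Q_hat, V_hat, gamma):
        # Doubly robust OPE (Jiang and Li, 2016) for a single trajectory:
        # V_hat_0 + sum_t gamma^t rho_{[0:t]} (r_t + gamma * V_hat_{t+1} - Q_hat_t)
        n = len(rewards)
        rho_0t = np.exp(np.cumsum(logp_t - logp_b))      # cumulative importance weights
        disc = gamma ** np.arange(n)
        V_next = np.append(V_hat[1:], 0.0)               # V_hat_{t+1}, zero after the horizon
        return V_hat[0] + np.sum(disc * rho_0t * (rewards + gamma * V_next - Q_hat))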

  13. Doubly-Robust Policy Gradient (DR-PG). Theorem: given the DR-OPE estimator above, we can derive two unbiased policy gradient estimators, depending on how $\hat{Q}^{\pi_{\theta+\Delta\theta}}$ behaves as $\Delta\theta$ varies:
     - If $\hat{Q}^{\pi_{\theta+\Delta\theta}} = \hat{Q}^{\pi_\theta}$ for arbitrary $\Delta\theta$ [Traj-CV (Cheng, Yan, and Boots, 2019)]:
       $$\sum_{t=0}^{T} \Big[\nabla_\theta \log \pi_\theta^t \Big(\sum_{t_1=t}^{T} \gamma^{t_1} r_{t_1} + \sum_{t_2=t+1}^{T} \gamma^{t_2} \big(\hat{V}^{\pi_\theta}_{t_2} - \hat{Q}^{\pi_\theta}_{t_2}\big)\Big) + \gamma^t \big(\nabla_\theta \hat{V}^{\pi_\theta}_t - \hat{Q}^{\pi_\theta}_t \nabla_\theta \log \pi_\theta^t\big)\Big].$$
     - Otherwise [DR-PG (ours)]:
       $$\sum_{t=0}^{T} \Big[\nabla_\theta \log \pi_\theta^t \Big(\sum_{t_1=t}^{T} \gamma^{t_1} r_{t_1} + \sum_{t_2=t+1}^{T} \gamma^{t_2} \big(\hat{V}^{\pi_\theta}_{t_2} - \hat{Q}^{\pi_\theta}_{t_2}\big)\Big) + \gamma^t \big(\nabla_\theta \hat{V}^{\pi_\theta}_t - \nabla_\theta \hat{Q}^{\pi_\theta}_t - \hat{Q}^{\pi_\theta}_t \nabla_\theta \log \pi_\theta^t\big)\Big].$$

  14. Doubly-Robust Policy Gradient (DR-PG): remarks on the two estimators above.
     Remark 1: the definitions of $\nabla_\theta \hat{V}$ differ. In Traj-CV, $\nabla_\theta \hat{V} = \mathbb{E}_{\pi_\theta}\big[\hat{Q}^{\pi_\theta} \nabla_\theta \log \pi_\theta\big]$, while in DR-PG, $\nabla_\theta \hat{V} = \mathbb{E}_{\pi_\theta}\big[\hat{Q}^{\pi_\theta} \nabla_\theta \log \pi_\theta + \nabla_\theta \hat{Q}^{\pi_\theta}\big]$.
     Remark 2: $\nabla_\theta \hat{Q}^{\pi_\theta}$ is not necessarily an exact gradient; it is just an approximation of $\nabla_\theta Q^{\pi_\theta}$.
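A minimal single-trajectory sketch of the DR-PG estimator above; all inputs (per-step scores grad_logp, value estimates Q_hat / V_hat, and their parameter gradients grad_Q_hat / grad_V_hat) are hypothetical arrays, and per Remark 1 the caller is responsible for supplying the appropriate grad_V_hat:

    import numpy as np

    def dr_pg(rewards, grad_logp, Q_hat, V_hat, grad_Q_hat, grad_V_hat, gamma):
        # grad_logp, grad_Q_hat, grad_V_hat have shape (n, d); rewards, Q_hat, V_hat have shape (n,).
        # Passing zeros for grad_Q_hat (with the Traj-CV definition of grad_V_hat) recovers Traj-CV.
        n, d = grad_logp.shape
        disc = gamma ** np.arange(n)
        # sum_{t1 >= t} gamma^{t1} r_{t1}
        rewards_to_go = np.cumsum((disc * rewards)[::-1])[::-1]
        # sum_{t2 >= t+1} gamma^{t2} (V_hat_{t2} - Q_hat_{t2})
        corr = disc * (V_hat - Q_hat)
        corr_to_go = np.append(np.cumsum(corr[::-1])[::-1][1:], 0.0)
        grad = np.zeros(d)
        for t in range(n):
            grad += grad_logp[t] * (rewards_to_go[t] + corr_to_go[t])
            grad += disc[t] * (grad_V_hat[t] - grad_Q_hat[t] - Q_hat[t] * grad_logp[t])
        return grad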

  15. Special Cases of DR-PG. The DR-PG estimator:
     $$\sum_{t=0}^{T} \Big[\nabla_\theta \log \pi_\theta^t \Big(\sum_{t_1=t}^{T} \gamma^{t_1} r_{t_1} + \sum_{t_2=t+1}^{T} \gamma^{t_2} \big(\hat{V}^{\pi_\theta}_{t_2} - \hat{Q}^{\pi_\theta}_{t_2}\big)\Big) + \gamma^t \big(\nabla_\theta \hat{V}^{\pi_\theta}_t - \nabla_\theta \hat{Q}^{\pi_\theta}_t - \hat{Q}^{\pi_\theta}_t \nabla_\theta \log \pi_\theta^t\big)\Big].$$

  16. Special Cases of DR-PG: using a $\hat{Q}^{\pi'}$ that is invariant to $\pi'$ removes the $\nabla_\theta \hat{Q}^{\pi_\theta}_t$ term and recovers Traj-CV:
     $$\sum_{t=0}^{T} \Big[\nabla_\theta \log \pi_\theta^t \Big(\sum_{t_1=t}^{T} \gamma^{t_1} r_{t_1} + \sum_{t_2=t+1}^{T} \gamma^{t_2} \big(\hat{V}^{\pi_\theta}_{t_2} - \hat{Q}^{\pi_\theta}_{t_2}\big)\Big) + \gamma^t \big(\nabla_\theta \hat{V}^{\pi_\theta}_t - \hat{Q}^{\pi_\theta}_t \nabla_\theta \log \pi_\theta^t\big)\Big].$$

  17. Special Cases of DR-PG: furthermore, since $\mathbb{E}\big[\sum_{t_2=t+1}^{T} \gamma^{t_2} \big(\hat{V}^{\pi_\theta}_{t_2} - \hat{Q}^{\pi_\theta}_{t_2}\big) \,\big|\, s_{t+1}\big] = 0$, that correction term can be dropped, recovering PG with state-action baselines:
     $$\sum_{t=0}^{T} \Big[\nabla_\theta \log \pi_\theta^t \sum_{t_1=t}^{T} \gamma^{t_1} r_{t_1} + \gamma^t \big(\nabla_\theta \hat{V}^{\pi_\theta}_t - \hat{Q}^{\pi_\theta}_t \nabla_\theta \log \pi_\theta^t\big)\Big].$$
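A tiny numerical check of the identity that justifies dropping the correction term here, assuming a random Q_hat table over a 4-action space (illustrative values only):

    import numpy as np

    # If V_hat(s) = E_{a ~ pi}[Q_hat(s, a)], then E_{a ~ pi}[V_hat(s) - Q_hat(s, a)] = 0,
    # so the future correction terms have zero conditional mean and can be dropped.
    rng = np.random.default_rng(0)
    n_actions = 4
    Q_hat = rng.normal(size=n_actions)               # Q_hat(s, .) at some fixed state s
    pi = rng.dirichlet(np.ones(n_actions))           # pi_theta(. | s)
    V_hat = np.dot(pi, Q_hat)                        # V_hat(s) = E_{a ~ pi}[Q_hat(s, a)]
    print(np.dot(pi, V_hat - Q_hat))                 # 0 up to floating-point error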
