From Importance Sampling to Doubly Robust Policy Gradient
Jiawei Huang (UIUC) Nan Jiang (UIUC)
Basic Idea

Every off-policy evaluation (OPE) estimator induces a policy gradient (PG) estimator: treat the current policy $\pi_\theta$ as the behavior policy, use OPE to evaluate a perturbed target policy $\pi' = \pi_{\theta+\Delta\theta}$, and take the limit $\Delta\theta \to 0$. Writing $\rho_{[0:t]} := \prod_{t'=0}^{t}\pi'(a_{t'}|s_{t'})/\pi_\theta(a_{t'}|s_{t'})$ for the cumulative importance weight, known PG estimators (left) arise from known OPE estimators (right) in exactly this way (a code sketch of the OPE side follows this list):

REINFORCE  ←  Trajectory-wise IS (Tang and Abbeel, 2010)
$\sum_{t=0}^{T}\nabla_\theta\log\pi^{t}_{\theta}\,\sum_{t'=0}^{T}\gamma^{t'}r_{t'}$  ←  $\rho_{[0:T]}\sum_{t=0}^{T}\gamma^{t}r_{t}$

Standard PG  ←  Step-wise IS
$\sum_{t=0}^{T}\nabla_\theta\log\pi^{t}_{\theta}\,\sum_{t'=t}^{T}\gamma^{t'}r_{t'}$  ←  $\sum_{t=0}^{T}\gamma^{t}\rho_{[0:t]}\,r_{t}$

PG with State Baselines  ←  OPE with State Baselines
$\sum_{t=0}^{T}\nabla_\theta\log\pi^{t}_{\theta}\Big(\sum_{t'=t}^{T}\gamma^{t'}r_{t'}-\gamma^{t}b_{t}\Big)$  ←  $\sum_{t=0}^{T}\gamma^{t}\big[\rho_{[0:t]}(r_{t}-b_{t})+\rho_{[0:t-1]}\,b_{t}\big]$

Trajectory-wise CV (Cheng et al., 2019)  ←  Doubly Robust OPE (Jiang and Li, 2016)
$\sum_{t=0}^{T}\Big[\nabla_\theta\log\pi^{t}_{\theta}\Big(\sum_{t'=t}^{T}\gamma^{t'}r_{t'}+\sum_{t'=t+1}^{T}\gamma^{t'}\big(\hat{V}^{\pi_\theta}_{t'}-\hat{Q}^{\pi_\theta}_{t'}\big)\Big)+\gamma^{t}\Big(\nabla_\theta\hat{V}^{\pi_\theta}_{t}-\hat{Q}^{\pi_\theta}_{t}\nabla_\theta\log\pi^{t}_{\theta}\Big)\Big]$  ←  $\hat{V}^{\pi'}_{0}+\sum_{t=0}^{T}\gamma^{t}\rho_{[0:t]}\big(r_{t}+\gamma\hat{V}^{\pi'}_{t+1}-\hat{Q}^{\pi'}_{t}\big)$

Doubly Robust PG (this work)  ←  Doubly Robust OPE (Jiang and Li, 2016)
$\sum_{t=0}^{T}\Big[\nabla_\theta\log\pi^{t}_{\theta}\Big(\sum_{t'=t}^{T}\gamma^{t'}r_{t'}+\sum_{t'=t+1}^{T}\gamma^{t'}\big(\hat{V}^{\pi_\theta}_{t'}-\hat{Q}^{\pi_\theta}_{t'}\big)\Big)+\gamma^{t}\Big(\nabla_\theta\hat{V}^{\pi_\theta}_{t}-\nabla_\theta\hat{Q}^{\pi_\theta}_{t}-\hat{Q}^{\pi_\theta}_{t}\nabla_\theta\log\pi^{t}_{\theta}\Big)\Big]$
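To make the OPE side of the correspondence concrete, here is a minimal NumPy sketch of the trajectory-wise IS, step-wise IS, and doubly robust estimates computed from one logged trajectory. It follows the formulas above; the function and argument names are illustrative, not from the paper.

```python
import numpy as np

def ope_estimates(rewards, behavior_probs, target_probs, gamma,
                  q_hat=None, v_hat=None):
    """Per-trajectory OPE estimates of the target policy's return.

    rewards[t]        : r_t logged under the behavior policy
    behavior_probs[t] : pi_theta(a_t | s_t)
    target_probs[t]   : pi'(a_t | s_t)
    q_hat[t], v_hat[t]: optional \hat{Q}^{pi'}(s_t, a_t), \hat{V}^{pi'}(s_t) for the DR estimate
    """
    rewards = np.asarray(rewards, dtype=float)
    T = len(rewards)
    disc = gamma ** np.arange(T)
    rho = np.cumprod(np.asarray(target_probs) / np.asarray(behavior_probs))  # rho_{[0:t]}

    traj_is = rho[-1] * np.sum(disc * rewards)   # rho_{[0:T]} * sum_t gamma^t r_t
    step_is = np.sum(disc * rho * rewards)       # sum_t gamma^t rho_{[0:t]} r_t

    dr = None
    if q_hat is not None and v_hat is not None:
        q_hat = np.asarray(q_hat, dtype=float)
        v_hat = np.asarray(v_hat, dtype=float)
        v_next = np.append(v_hat[1:], 0.0)       # \hat{V}_{t+1}, taken as zero past the horizon
        dr = v_hat[0] + np.sum(disc * rho * (rewards + gamma * v_next - q_hat))

    return traj_is, step_is, dr
```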
Preliminaries

MDP setting and frequently used notation: $J(\pi_\theta) = \mathbb{E}\big[\sum_{t=0}^{T}\gamma^{t}r(s_{t},a_{t})\big]$ is the expected discounted return of $\pi_\theta$.
A Concrete and Simple Example: From Step-wise IS OPE to Standard PG

Let $\pi_\theta$ be the behavior policy and $\pi_{\theta+\Delta\theta}$ the target policy, and write $r_{t} = r(s_{t},a_{t})$ and $\pi^{t}_{\theta} = \pi_\theta(a_{t}|s_{t})$. The step-wise IS estimate of the target policy's return is

$$\hat{J}(\pi_{\theta+\Delta\theta}) = \sum_{t=0}^{T}\gamma^{t}r_{t}\prod_{t'=0}^{t}\frac{\pi^{t'}_{\theta+\Delta\theta}}{\pi^{t'}_{\theta}} = \sum_{t=0}^{T}\gamma^{t}r_{t}\prod_{t'=0}^{t}\Big(1+\frac{\Delta\theta^{\top}\nabla_\theta\pi^{t'}_{\theta}}{\pi^{t'}_{\theta}}+o(\|\Delta\theta\|)\Big) = \hat{J}(\pi_\theta)+\Delta\theta^{\top}\sum_{t=0}^{T}\gamma^{t}r_{t}\sum_{t'=0}^{t}\nabla_\theta\log\pi^{t'}_{\theta}+o(\|\Delta\theta\|).$$

Then

$$\lim_{\Delta\theta\to 0}\frac{\hat{J}(\pi_{\theta+\Delta\theta})-\hat{J}(\pi_\theta)}{\Delta\theta} = \sum_{t=0}^{T}\gamma^{t}r_{t}\sum_{t'=0}^{t}\nabla_\theta\log\pi^{t'}_{\theta},$$

which is known to be the standard PG estimator (swap the order of summation to obtain $\sum_{t'=0}^{T}\nabla_\theta\log\pi^{t'}_{\theta}\sum_{t=t'}^{T}\gamma^{t}r_{t}$).
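The derivation can also be checked numerically on a single fixed trajectory: the finite-difference gradient of the step-wise IS estimate at $\Delta\theta = 0$ matches the standard PG expression. The toy sketch below uses a two-action softmax policy that ignores the state and made-up rewards; it is purely illustrative and not from the paper.

```python
import numpy as np

gamma = 0.9
theta = np.array([0.3, -0.2])      # logits of a 2-action softmax policy (state-independent)
actions = [0, 1, 0]                # one fixed logged trajectory
rewards = np.array([1.0, 0.5, 2.0])

def pi(th):                        # action probabilities under the softmax policy
    e = np.exp(th - th.max())
    return e / e.sum()

def grad_log_pi(a):                # grad_theta log pi_theta(a)
    return np.eye(2)[a] - pi(theta)

def stepwise_is(delta):            # step-wise IS estimate of J(pi_{theta + delta})
    ratios = np.array([pi(theta + delta)[a] / pi(theta)[a] for a in actions])
    disc = gamma ** np.arange(len(actions))
    return np.sum(disc * np.cumprod(ratios) * rewards)

# Finite-difference gradient of the OPE estimate at delta = 0
eps = 1e-6
fd_grad = np.array([(stepwise_is(eps * e) - stepwise_is(-eps * e)) / (2 * eps)
                    for e in np.eye(2)])

# Standard PG expression on the same trajectory: sum_t gamma^t r_t sum_{t'<=t} grad log pi_{t'}
disc = gamma ** np.arange(len(actions))
pg = sum(disc[t] * rewards[t] * sum(grad_log_pi(actions[t2]) for t2 in range(t + 1))
         for t in range(len(actions)))

print(fd_grad, pg)                 # the two agree up to finite-difference error
```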
Doubly-Robust Policy Gradient (DR-PG)

Definition (doubly-robust OPE estimator, unbiased; Jiang and Li, 2016):

$$\hat{J}_{\mathrm{DR}}(\pi_{\theta+\Delta\theta}) = \hat{V}^{\pi_{\theta+\Delta\theta}}_{0}+\sum_{t=0}^{T}\gamma^{t}\prod_{t'=0}^{t}\frac{\pi^{t'}_{\theta+\Delta\theta}}{\pi^{t'}_{\theta}}\Big(r_{t}+\gamma\hat{V}^{\pi_{\theta+\Delta\theta}}_{t+1}-\hat{Q}^{\pi_{\theta+\Delta\theta}}_{t}\Big),$$

where $\hat{V}^{\pi_{\theta+\Delta\theta}}_{t} = \mathbb{E}_{a\sim\pi_{\theta+\Delta\theta}}\big[\hat{Q}^{\pi_{\theta+\Delta\theta}}_{t}\big]$.

Theorem: Taking $\Delta\theta\to 0$ in the DR-OPE estimator above yields two unbiased PG estimators.

If $\hat{Q}^{\pi_{\theta+\Delta\theta}} = \hat{Q}^{\pi_\theta}$ for arbitrary $\Delta\theta$, we recover Traj-CV (Cheng, Yan, and Boots, 2019):

$$\sum_{t=0}^{T}\Big[\nabla_\theta\log\pi^{t}_{\theta}\Big(\sum_{t_1=t}^{T}\gamma^{t_1}r_{t_1}+\sum_{t_2=t+1}^{T}\gamma^{t_2}\big(\hat{V}^{\pi_\theta}_{t_2}-\hat{Q}^{\pi_\theta}_{t_2}\big)\Big)+\gamma^{t}\Big(\nabla_\theta\hat{V}^{\pi_\theta}_{t}-\hat{Q}^{\pi_\theta}_{t}\nabla_\theta\log\pi^{t}_{\theta}\Big)\Big].$$

If $\hat{Q}^{\pi_{\theta+\Delta\theta}}$ instead varies with the target policy, we obtain the new DR-PG estimator:

$$\sum_{t=0}^{T}\Big[\nabla_\theta\log\pi^{t}_{\theta}\Big(\sum_{t_1=t}^{T}\gamma^{t_1}r_{t_1}+\sum_{t_2=t+1}^{T}\gamma^{t_2}\big(\hat{V}^{\pi_\theta}_{t_2}-\hat{Q}^{\pi_\theta}_{t_2}\big)\Big)+\gamma^{t}\Big(\nabla_\theta\hat{V}^{\pi_\theta}_{t}-\nabla_\theta\hat{Q}^{\pi_\theta}_{t}-\hat{Q}^{\pi_\theta}_{t}\nabla_\theta\log\pi^{t}_{\theta}\Big)\Big].$$

Remark 1: The definitions of $\nabla_\theta\hat{V}$ differ. In Traj-CV, $\nabla_\theta\hat{V} = \mathbb{E}_{\pi_\theta}\big[\hat{Q}^{\pi_\theta}\nabla_\theta\log\pi_\theta\big]$, while in DR-PG, $\nabla_\theta\hat{V} = \mathbb{E}_{\pi_\theta}\big[\hat{Q}^{\pi_\theta}\nabla_\theta\log\pi_\theta+\nabla_\theta\hat{Q}^{\pi_\theta}\big]$.

Remark 2: $\nabla_\theta\hat{Q}^{\pi_\theta}$ need not be an exact gradient; it is just an approximation of $\nabla_\theta Q^{\pi_\theta}$.
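A minimal per-trajectory sketch of the two estimators in the theorem, assuming the value estimates $\hat{Q}$, $\hat{V}$ and their gradient approximations are supplied by the caller (per Remark 1, how the $\nabla_\theta\hat{V}$ input is constructed differs between Traj-CV and DR-PG). The function and argument names are illustrative, not from the paper.

```python
import numpy as np

def dr_pg_estimate(rewards, gamma, glogpi, q_hat, v_hat,
                   grad_q_hat, grad_v_hat, include_grad_q=True):
    """Per-trajectory gradient estimate: DR-PG (include_grad_q=True) or Traj-CV (False).

    rewards, q_hat, v_hat         : length-T arrays of r_t, \hat{Q}_t, \hat{V}_t
    glogpi, grad_q_hat, grad_v_hat: (T, d) arrays of grad log pi_theta(a_t|s_t), the
                                    approximation of grad \hat{Q}_t, and grad \hat{V}_t
    """
    rewards, q_hat, v_hat = map(np.asarray, (rewards, q_hat, v_hat))
    T = len(rewards)
    disc = gamma ** np.arange(T)
    grad = np.zeros(glogpi.shape[1])
    for t in range(T):
        # reward-to-go plus the control-variate correction for later steps
        coeff = np.sum(disc[t:] * rewards[t:])
        coeff += np.sum(disc[t + 1:] * (v_hat[t + 1:] - q_hat[t + 1:]))
        grad += coeff * glogpi[t]
        # gamma^t ( grad V_t [- grad Q_t] - Q_t * grad log pi_t )
        correction = grad_v_hat[t] - q_hat[t] * glogpi[t]
        if include_grad_q:
            correction = correction - grad_q_hat[t]
        grad += disc[t] * correction
    return grad
```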
Special Cases of DR-PG

DR-PG:

$$\sum_{t=0}^{T}\Big[\nabla_\theta\log\pi^{t}_{\theta}\Big(\sum_{t_1=t}^{T}\gamma^{t_1}r_{t_1}+\sum_{t_2=t+1}^{T}\gamma^{t_2}\big(\hat{V}^{\pi_\theta}_{t_2}-\hat{Q}^{\pi_\theta}_{t_2}\big)\Big)+\gamma^{t}\Big(\nabla_\theta\hat{V}^{\pi_\theta}_{t}-\nabla_\theta\hat{Q}^{\pi_\theta}_{t}-\hat{Q}^{\pi_\theta}_{t}\nabla_\theta\log\pi^{t}_{\theta}\Big)\Big]$$

Use $\hat{Q}^{\pi'}$ invariant to $\pi'$  ↓  Traj-CV:

$$\sum_{t=0}^{T}\Big[\nabla_\theta\log\pi^{t}_{\theta}\Big(\sum_{t_1=t}^{T}\gamma^{t_1}r_{t_1}+\sum_{t_2=t+1}^{T}\gamma^{t_2}\big(\hat{V}^{\pi_\theta}_{t_2}-\hat{Q}^{\pi_\theta}_{t_2}\big)\Big)+\gamma^{t}\Big(\nabla_\theta\hat{V}^{\pi_\theta}_{t}-\hat{Q}^{\pi_\theta}_{t}\nabla_\theta\log\pi^{t}_{\theta}\Big)\Big]$$

The control-variate terms have zero mean, since $\mathbb{E}_{a_{t_2}\sim\pi_\theta}\big[\hat{V}^{\pi_\theta}_{t_2}-\hat{Q}^{\pi_\theta}_{t_2}\,\big|\,s_{t_2}\big]=0$ and, under Traj-CV's definition of $\nabla_\theta\hat{V}$, $\mathbb{E}_{a_{t}\sim\pi_\theta}\big[\nabla_\theta\hat{V}^{\pi_\theta}_{t}-\hat{Q}^{\pi_\theta}_{t}\nabla_\theta\log\pi^{t}_{\theta}\,\big|\,s_{t}\big]=0$; in expectation the estimator therefore reduces to $\mathbb{E}\big[\sum_{t=0}^{T}\nabla_\theta\log\pi^{t}_{\theta}\sum_{t_1=t}^{T}\gamma^{t_1}r_{t_1}\big]$.

Use $\hat{V}$ as $\hat{Q}$  ↓  PG with state baselines:

$$\sum_{t=0}^{T}\nabla_\theta\log\pi^{t}_{\theta}\Big(\sum_{t_1=t}^{T}\gamma^{t_1}r_{t_1}-\gamma^{t}\hat{V}^{\pi_\theta}_{t}\Big)$$
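To see the last reduction explicitly (a worked check, not a slide from the deck): substitute a state-only estimate $\hat{Q}^{\pi_\theta}(s,a) := \hat{V}^{\pi_\theta}(s)$ into Traj-CV. Then $\hat{V}^{\pi_\theta}_{t_2}-\hat{Q}^{\pi_\theta}_{t_2}=0$ for every $t_2$, $\nabla_\theta\hat{V}^{\pi_\theta}_{t}=\mathbb{E}_{a\sim\pi_\theta}\big[\hat{V}^{\pi_\theta}_{t}\nabla_\theta\log\pi^{t}_{\theta}\big]=0$ under Traj-CV's definition of $\nabla_\theta\hat{V}$, and $\hat{Q}^{\pi_\theta}_{t}\nabla_\theta\log\pi^{t}_{\theta}=\hat{V}^{\pi_\theta}_{t}\nabla_\theta\log\pi^{t}_{\theta}$, so

$$\sum_{t=0}^{T}\Big[\nabla_\theta\log\pi^{t}_{\theta}\sum_{t_1=t}^{T}\gamma^{t_1}r_{t_1}+\gamma^{t}\big(0-\hat{V}^{\pi_\theta}_{t}\nabla_\theta\log\pi^{t}_{\theta}\big)\Big]=\sum_{t=0}^{T}\nabla_\theta\log\pi^{t}_{\theta}\Big(\sum_{t_1=t}^{T}\gamma^{t_1}r_{t_1}-\gamma^{t}\hat{V}^{\pi_\theta}_{t}\Big),$$

which is exactly PG with state baselines.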
Variance Analysis

Theorem: The covariance matrix of the DR-PG estimator decomposes into the expectation of a sum over time steps $n = 0,\dots,T$ of three terms: a $\gamma^{2n}$-weighted term proportional to the outer product $\nabla_\theta\log\pi^{n}_{\theta}\,\nabla_\theta\log\pi^{n\,\top}_{\theta}$, a conditional covariance $\mathrm{Cov}_n[\cdot]$ of a quantity carrying the accumulated scores $\sum_{t=0}^{n-1}\nabla_\theta\log\pi^{t}_{\theta}$, and a conditional covariance of a quantity involving the approximation terms $\nabla_\theta\hat{Q}^{\pi_\theta}_{n}$ and $\nabla_\theta\log\pi^{n}_{\theta}\,\hat{Q}^{\pi_\theta}_{n}$, where $\mathbb{V}_n[\cdot] := \mathbb{V}[\cdot\,|\,s_0,a_0,\dots,s_{n-1},a_{n-1}]$, $\mathbb{E}_n[\cdot] := \mathbb{E}[\cdot\,|\,s_0,a_0,\dots,s_{n-1},a_{n-1}]$, and $\mathrm{Cov}_n[v] := \mathbb{E}_n[vv^{\top}]-\mathbb{E}_n[v]\mathbb{E}_n[v]^{\top}$.
Cramér-Rao Lower Bound

Theorem: For tree-structured MDPs (i.e., each state appears only at a unique time step and can be reached by only a unique trajectory), the Cramér-Rao lower bound of PG estimation, a $\gamma^{2t}$-weighted expression in the per-parameter scores $\partial\log\pi^{t}_{\theta}/\partial\theta_{i}$ and the value gradients $\partial V^{\pi_\theta}_{t}/\partial\theta_{i}$, coincides with the variance of DR-PG when $\hat{Q}^{\pi_\theta}\equiv Q^{\pi_\theta}$ and $\nabla_\theta\hat{Q}^{\pi_\theta}\equiv\nabla_\theta Q^{\pi_\theta}$.
Variance Analysis: Covariance Comparison in a Special Case

In a deterministic environment with perfect value function estimation, the covariance matrices of the three estimators are nested expressions of the form $\mathbb{E}\big[\sum_{n}\mathrm{Cov}_n[\cdot]\big]$: relative to DR-PG, the conditional covariance of Traj-CV carries an additional term weighted by the accumulated scores $\sum_{t=0}^{n-1}\nabla_\theta\log\pi^{t}_{\theta}$, and PG with state baselines carries, on top of that, an advantage term $\nabla_\theta\log\pi^{n}_{\theta}\,A^{\pi_\theta}_{n}$.
Experiments (Variance Reduction)
Figure 1: Variance reduction ratio. $V_G$ denotes the sum of estimator $G$'s variance over all parameters of the neural network.
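The quantity in Figure 1 can be computed mechanically from repeated per-trajectory gradient estimates: $V_G$ is the per-parameter variance of estimator $G$, summed over all network parameters. A minimal sketch follows; the exact form of the reduction ratio plotted in the figure is not specified here, and the version below, $(V_{\text{baseline}} - V_{\text{new}})/V_{\text{baseline}}$, is only one plausible choice.

```python
import numpy as np

def total_variance(grad_samples):
    """V_G: per-parameter variance of the gradient estimates, summed over all parameters.

    grad_samples: array of shape (num_trajectories, num_parameters), one per-trajectory
    gradient estimate per row.
    """
    return np.var(grad_samples, axis=0, ddof=1).sum()

def variance_reduction_ratio(grad_samples_baseline, grad_samples_new):
    """Fraction of the baseline estimator's total variance removed by the new estimator."""
    v_base = total_variance(grad_samples_baseline)
    v_new = total_variance(grad_samples_new)
    return 1.0 - v_new / v_base
```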
Experiments (Algorithm Performance)
Figure 2: Performance on the CartPole task, averaged over 150 trials; we plot twice the standard error.
Experiments (Algorithm Time Complexity)
Figure 3: Comparison of GPU/CPU usage.