Trust Region Policy Optimization
John Schulman, Sergey Levine, Philipp Moritz, Michael I. Jordan, Pieter Abbeel @ ICML 2015
Presenter: Shivam Kalra (Shivam.kalra@uwaterloo.ca)
CS 885 (Reinforcement Learning) - Prof. Pascal Poupart
June 20th, 2018
(Figure: landscape of RL algorithms: action-value function / Q-Learning, policy gradients, TRPO, actor-critic, PPO, A3C, ACKTR.)
Ref: https://www.youtube.com/watch?v=CKaN5PgkSBc
Vanilla policy gradient:
For i = 1, 2, …
  Collect N trajectories for policy π_θ
  Estimate advantage function A
  Compute policy gradient g
  Update policy parameters: θ = θ_old + α·g
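A minimal sketch of one iteration of this loop in PyTorch (not the presenter's code; `policy` is assumed to map a batch of states to a torch action distribution, and `advantages` to be estimated beforehand):

```python
import torch

def vanilla_pg_update(policy, optimizer, states, actions, advantages):
    """One vanilla policy-gradient step: theta = theta_old + alpha * g."""
    dist = policy(states)                      # assumed: returns a torch distribution
    log_probs = dist.log_prob(actions)         # log pi_theta(a|s) for the sampled actions
    loss = -(log_probs * advantages).mean()    # minimize the negative PG objective
    optimizer.zero_grad()
    loss.backward()                            # g = gradient of the objective
    optimizer.step()                           # gradient-ascent step via the optimizer
```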
One problem: the input data are non-stationary, because the state and reward distributions change as the policy changes.
Another problem: the advantage estimates are very random (noisy) early in training, so single gradient steps can be misleading. (Cartoon on the slide: the Advantage telling the Policy "You're bad".)
We therefore need a more carefully crafted policy update: we want improvement, not degradation. Idea: update the old policy π_old to a new policy π̃ such that the two stay a "trusted" distance apart. Such a conservative policy update allows improvement instead of degradation.
TRPO updates the policy π based on data sampled from π itself (on-policy data).
Ref: https://www.youtube.com/watch?v=xvRrgxcpaHY (6:40)
Expected discounted return of a policy π:

$$\eta(\pi) = \mathbb{E}_{s_0 \sim \rho_0,\; a_t \sim \pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_t\right]$$

Expected return of a new policy π̃ in terms of the old policy (Kakade & Langford [1]):

$$\eta(\tilde{\pi}) = \eta(\pi_{\mathrm{old}}) + \mathbb{E}_{\tau \sim \tilde{\pi}}\!\left[\sum_{t=0}^{\infty} \gamma^{t} A_{\pi_{\mathrm{old}}}(s_t, a_t)\right]$$
[1] Kakade, Sham, and John Langford. "Approximately optimal approximate reinforcement learning." ICML. Vol. 2. 2002.
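One way to see why this identity holds (the standard telescoping argument from Kakade & Langford, expanding the advantage with the old value function):

$$\begin{aligned}
\mathbb{E}_{\tau \sim \tilde{\pi}}\!\left[\sum_{t=0}^{\infty} \gamma^{t} A_{\pi_{\mathrm{old}}}(s_t, a_t)\right]
&= \mathbb{E}_{\tau \sim \tilde{\pi}}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\left(r_t + \gamma V_{\pi_{\mathrm{old}}}(s_{t+1}) - V_{\pi_{\mathrm{old}}}(s_t)\right)\right] \\
&= \mathbb{E}_{\tau \sim \tilde{\pi}}\!\left[-V_{\pi_{\mathrm{old}}}(s_0) + \sum_{t=0}^{\infty} \gamma^{t} r_t\right]
= \eta(\tilde{\pi}) - \eta(\pi_{\mathrm{old}}),
\end{aligned}$$

using that the sum telescopes and that $\mathbb{E}_{s_0 \sim \rho_0}[V_{\pi_{\mathrm{old}}}(s_0)] = \eta(\pi_{\mathrm{old}})$.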
Rewriting the identity as a sum over states and actions (slide annotations: "expected return of new policy", "expected return of old policy", and the samples come from the new policy):

$$\eta(\tilde{\pi}) = \eta(\pi_{\mathrm{old}}) + \sum_{s} \rho_{\tilde{\pi}}(s) \sum_{a} \tilde{\pi}(a \mid s)\, A_{\pi_{\mathrm{old}}}(s, a)$$
[1] Schulman, John, et al. "Trust region policy optimization." International Conference on Machine Learning. 2015.
Here ρ_π is the (unnormalized) discounted state-visitation frequency:

$$\rho_{\pi}(s) = P(s_0 = s) + \gamma\, P(s_1 = s) + \gamma^{2} P(s_2 = s) + \cdots$$
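For intuition, ρ_π can be computed exactly in a small tabular MDP; a minimal NumPy sketch (array shapes are illustrative assumptions, not from the slides):

```python
import numpy as np

def discounted_visitation(P, pi, rho0, gamma, horizon=1000):
    """rho_pi(s) = sum_t gamma^t P(s_t = s) for a tabular MDP.
    P[s, a, s']: transition probabilities, pi[s, a]: policy, rho0[s]: start distribution."""
    # State-to-state transition matrix induced by the policy
    P_pi = np.einsum('sa,sat->st', pi, P)
    d_t = rho0.copy()                 # P(s_t = s), starting at t = 0
    rho = np.zeros_like(rho0)
    for t in range(horizon):
        rho += (gamma ** t) * d_t     # accumulate the discounted visitation
        d_t = d_t @ P_pi              # propagate one step to get P(s_{t+1} = s)
    return rho
```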
In this identity the visitation frequencies $\rho_{\tilde{\pi}}(s)$ are nonnegative. So if the new policy has nonnegative expected advantage in every state,

$$\sum_{a} \tilde{\pi}(a \mid s)\, A_{\pi_{\mathrm{old}}}(s, a) \;\ge\; 0 \quad \text{for all } s,$$

then the new expected return is at least the old expected return, $\eta(\tilde{\pi}) \ge \eta(\pi_{\mathrm{old}})$: guaranteed improvement from $\pi_{\mathrm{old}} \to \tilde{\pi}$.
However, the state visitation frequency $\rho_{\tilde{\pi}}(s)$ is based on the new policy: this complex dependency of $\rho_{\tilde{\pi}}(s)$ on $\tilde{\pi}$ makes the equation difficult to optimize directly.
Local approximation of $\eta(\tilde{\pi})$: replace the new policy's visitation frequency $\rho_{\tilde{\pi}}$ with the old policy's $\rho_{\pi_{\mathrm{old}}}$:

$$L_{\pi_{\mathrm{old}}}(\tilde{\pi}) = \eta(\pi_{\mathrm{old}}) + \sum_{s} \rho_{\pi_{\mathrm{old}}}(s) \sum_{a} \tilde{\pi}(a \mid s)\, A_{\pi_{\mathrm{old}}}(s, a)$$
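In the sample-based setting, $L_{\pi_{\mathrm{old}}}$ is estimated from states and actions drawn from π_old with an importance ratio (as in the paper's practical algorithm); a minimal PyTorch sketch:

```python
import torch

def surrogate_loss(new_log_probs, old_log_probs, advantages):
    """Sample-based estimate of L_{pi_old}(pi_new):
    E_{s,a ~ pi_old}[ pi_new(a|s) / pi_old(a|s) * A_pi_old(s,a) ].
    old_log_probs should be detached (they come from the fixed old policy)."""
    ratio = torch.exp(new_log_probs - old_log_probs)   # importance weight
    return (ratio * advantages).mean()
```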
The approximation is accurate within a small step size (a trust region), i.e. as long as the visitation frequencies ρ do not change dramatically; within this region, monotonic improvement can be guaranteed.
Lower bound (Schulman et al. [1]):

$$\eta(\tilde{\pi}) \;\ge\; L_{\pi}(\tilde{\pi}) - C\, D_{\mathrm{KL}}^{\max}(\pi, \tilde{\pi}), \qquad \text{where } C = \frac{4\epsilon\gamma}{(1-\gamma)^{2}}$$
This suggests a minorization-maximization (MM) update:

$$\pi_{\mathrm{new}} = \arg\max_{\tilde{\pi}} \left[\, L_{\pi}(\tilde{\pi}) - C\, D_{\mathrm{KL}}^{\max}(\pi, \tilde{\pi}) \,\right], \qquad \text{where } C = \frac{4\epsilon\gamma}{(1-\gamma)^{2}}$$

Maximizing this lower bound at each iteration guarantees that the true objective η is non-decreasing.
(Figure: MM illustration, showing the actual function $\eta(\theta)$ and the surrogate function $L(\theta) - C\, D_{\mathrm{KL}}^{\max}(\theta_{\mathrm{old}}, \theta)$, which lower-bounds η and matches it at $\theta_{\mathrm{old}}$.)
In practice, the penalty coefficient $C$ results in very small step sizes. One way to take larger steps is to constrain the KL divergence between the new policy and the old policy, i.e., a trust-region constraint:

$$\max_{\theta}\; L(\theta) \quad \text{subject to} \quad D_{\mathrm{KL}}^{\max}(\theta_{\mathrm{old}}, \theta) \le \delta$$

instead of the penalized objective $\max_{\theta}\, \left[\, L(\theta) - C\, D_{\mathrm{KL}}^{\max}(\theta_{\mathrm{old}}, \theta) \,\right]$.
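In practice the max-KL is replaced by the average KL over states visited by the old policy. A minimal sketch of that average KL for a categorical (discrete-action) policy, assuming the policy network outputs logits (an assumption for illustration):

```python
import torch
import torch.nn.functional as F

def mean_kl(old_logits, new_logits):
    """Average KL( pi_old(.|s) || pi_theta(.|s) ) over a batch of sampled states."""
    old_log_p = F.log_softmax(old_logits, dim=-1)
    new_log_p = F.log_softmax(new_logits, dim=-1)
    kl = (old_log_p.exp() * (old_log_p - new_log_p)).sum(dim=-1)  # per-state KL
    return kl.mean()
```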
Approximate the objective to first order and the KL term to second order around $\theta_{\mathrm{old}}$:

$$\max_{\theta}\; g^{\top}(\theta - \theta_{\mathrm{old}}) - \frac{C}{2}\, (\theta - \theta_{\mathrm{old}})^{\top} F\, (\theta - \theta_{\mathrm{old}})$$

where

$$g = \left.\frac{\partial L(\theta)}{\partial \theta}\right|_{\theta = \theta_{\mathrm{old}}}, \qquad F = \left.\frac{\partial^{2} D_{\mathrm{KL}}(\theta_{\mathrm{old}}, \theta)}{\partial \theta^{2}}\right|_{\theta = \theta_{\mathrm{old}}}$$

The maximizer is $\theta = \theta_{\mathrm{old}} + \frac{1}{C} F^{-1} g$, but we don't want to form the full Hessian matrix $F = \left.\frac{\partial^{2} D_{\mathrm{KL}}(\theta_{\mathrm{old}}, \theta)}{\partial \theta^{2}}\right|_{\theta = \theta_{\mathrm{old}}}$.
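Instead of forming $F$, most TRPO implementations compute Fisher-vector products $Fv$ with two backward passes through the average KL. A minimal PyTorch sketch (the `damping` term is a common implementation detail, not on the slides; `mean_kl` is assumed to be computed against a detached copy of the old policy's distribution):

```python
import torch

def fisher_vector_product(mean_kl, params, v, damping=1e-2):
    """Compute F @ v via double backprop, without ever forming F."""
    grads = torch.autograd.grad(mean_kl, params, create_graph=True)
    flat_grad = torch.cat([g.reshape(-1) for g in grads])
    grad_v = (flat_grad * v).sum()              # scalar: (dKL/dtheta) . v
    hvp = torch.autograd.grad(grad_v, params)   # gradient of that scalar = F v
    flat_hvp = torch.cat([h.reshape(-1) for h in hvp])
    return flat_hvp + damping * v               # damping for numerical stability
```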
Computing $F^{-1} g$ is done with the conjugate gradient (CG) method: CG minimizes a quadratic of the form $\frac{1}{2} x^{\top} A x - b^{\top} x$ (here $A = F$, $b = g$), i.e. solves $F x = g$, using only matrix-vector products $F v$ rather than the full matrix.
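A minimal NumPy sketch of conjugate gradient that solves $F x = g$ using only a Fisher-vector product callable `fvp(v)`, never forming $F$:

```python
import numpy as np

def conjugate_gradient(fvp, b, iters=10, tol=1e-10):
    """Approximately solve F x = b, i.e. minimize 0.5 x^T F x - b^T x,
    given only fvp(v) = F @ v."""
    x = np.zeros_like(b)
    r = b.copy()                  # residual b - F x (x = 0 initially)
    p = b.copy()                  # search direction
    rs_old = r @ r
    for _ in range(iters):
        Fp = fvp(p)
        alpha = rs_old / (p @ Fp)
        x += alpha * p
        r -= alpha * Fp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x
```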
These approximations stand in for the original penalized problem $\max_{\theta}\,[\, L(\theta) - C\, D_{\mathrm{KL}}(\theta_{\mathrm{old}}, \theta)\,]$ and the constrained problem $\max_{\theta} L(\theta)$ subject to $D_{\mathrm{KL}}(\theta_{\mathrm{old}}, \theta) \le \delta$; after computing the step direction, a line search on the exact surrogate and KL is used to get the correct KL.
The constrained version after the same approximations:

$$\max_{\theta}\; g^{\top}(\theta - \theta_{\mathrm{old}}) \quad \text{subject to} \quad \frac{1}{2}(\theta - \theta_{\mathrm{old}})^{\top} F\, (\theta - \theta_{\mathrm{old}}) \le \delta$$

Lagrangian: $g^{\top}(\theta - \theta_{\mathrm{old}}) - \frac{\lambda}{2}\left[(\theta - \theta_{\mathrm{old}})^{\top} F (\theta - \theta_{\mathrm{old}}) - \delta\right]$.

Setting its gradient to zero gives a step in the direction $F^{-1} g$. Rescale the unscaled direction $s_{\mathrm{unscaled}} = F^{-1} g$ so the quadratic constraint is tight, $\frac{1}{2} s^{\top} F s = \delta$, which gives

$$s = \sqrt{\frac{2\delta}{s_{\mathrm{unscaled}}^{\top}\,(F\, s_{\mathrm{unscaled}})}}\; s_{\mathrm{unscaled}}$$
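A minimal NumPy sketch of the rescaling and of a backtracking line search (the `surrogate` and `kl` callables are hypothetical sample-based estimators, not the presenter's code):

```python
import numpy as np

def rescale_step(s_unscaled, fvp, delta):
    """Scale the CG direction so 0.5 s^T F s = delta."""
    sFs = s_unscaled @ fvp(s_unscaled)            # s_u^T (F s_u) via Fisher-vector product
    beta = np.sqrt(2.0 * delta / (sFs + 1e-8))
    return beta * s_unscaled

def line_search(theta_old, full_step, surrogate, kl, delta, max_backtracks=10):
    """Accept the largest fraction of the step that improves the surrogate
    while keeping the exact KL below delta."""
    L_old = surrogate(theta_old)
    for frac in 0.5 ** np.arange(max_backtracks):  # 1, 1/2, 1/4, ...
        theta_new = theta_old + frac * full_step
        if surrogate(theta_new) > L_old and kl(theta_old, theta_new) <= delta:
            return theta_new
    return theta_old                               # no acceptable step found
```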
TRPO algorithm:
For i = 1, 2, …
  Collect N trajectories for policy π_θ
  Estimate advantage function A
  Compute policy gradient g
  Use CG to compute F⁻¹g
  Compute the rescaled step s = β·F⁻¹g, with rescaling and line search
  Apply update: θ = θ_old + β·F⁻¹g

Each update approximately solves max_θ L(θ) subject to D_KL(θ_old, θ) ≤ δ.
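Putting the pieces together, one TRPO parameter update might look like the following sketch (it reuses the `conjugate_gradient`, `rescale_step`, and `line_search` sketches above; δ = 0.01 is only an illustrative value):

```python
def trpo_update(theta_old, g, fvp, surrogate, kl, delta=0.01, cg_iters=10):
    """One TRPO policy update.
    g: policy gradient of the surrogate at theta_old; fvp(v) = F @ v;
    surrogate(theta) and kl(theta_old, theta): sample-based estimates."""
    s_unscaled = conjugate_gradient(fvp, g, iters=cg_iters)   # approx F^-1 g
    full_step = rescale_step(s_unscaled, fvp, delta)          # make 0.5 s^T F s = delta
    return line_search(theta_old, full_step, surrogate, kl, delta)
```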