Policy Gradients for CVaR-Constrained MDPs
Prashanth L.A.
INRIA Lille – Team SequeL
Motivation: Risk is like fire: If controlled it will help you; if uncontrolled it will rise up and destroy you.
1convex, monotone, positively homogeneous, and translation equivariant (the defining properties of a coherent risk measure)
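CVaR satisfies all four coherence properties listed in the footnote. Two of them, positive homogeneity and translation equivariance, are easy to check numerically; the sketch below uses an empirical tail-mean CVaR estimator (an illustrative construction, not taken from the slides):

```python
import numpy as np

def cvar(x, alpha):
    """Empirical CVaR_alpha: mean of the worst (1 - alpha) fraction of costs."""
    x = np.sort(np.asarray(x, dtype=float))
    k = int(np.ceil(alpha * len(x)))  # index of the empirical VaR
    return x[k:].mean()

rng = np.random.default_rng(0)
c = rng.normal(size=10_000)

# Positive homogeneity: CVaR(t * C) = t * CVaR(C) for t > 0
print(np.isclose(cvar(3 * c, 0.95), 3 * cvar(c, 0.95)))

# Translation equivariance: CVaR(C + b) = CVaR(C) + b
print(np.isclose(cvar(c + 5, 0.95), cvar(c, 0.95) + 5))
```

Both checks hold exactly for the empirical estimator, since scaling by t > 0 and shifting preserve the sorted order of the sample.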
[Figure: Current vs. Target]
Lagrangian relaxation: max_λ min_θ Lθ,λ(s0)

1Note: ∇θLθ,λ(s0) = ∇θGθ(s0) + λ∇θCVaRα(Cθ(s0)),  ∇λLθ,λ(s0) = CVaRα(Cθ(s0)) − Kα
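The gradients in the note suggest a primal-dual scheme: descend in θ along ∇θL and ascend in λ along ∇λL, on two timescales. A minimal sketch on a hypothetical deterministic surrogate problem (G(θ) = θ² with constraint 2 − θ ≤ 0 standing in for CVaRα(Cθ(s0)) ≤ Kα; the step-size schedules are illustrative choices):

```python
# Toy saddle-point problem: minimize G(theta) = theta**2 s.t. 2 - theta <= 0.
# The Lagrangian L(theta, lam) = theta**2 + lam * (2 - theta) has its
# saddle point at theta* = 2, lam* = 4.
theta, lam = 0.0, 0.0
for n in range(1, 200_001):
    a_n = 1.0 / n ** 0.6   # faster timescale for theta (descent)
    b_n = 1.0 / n          # slower timescale for lambda (ascent)
    grad_theta = 2 * theta - lam   # ∇θ L
    grad_lam = 2 - theta           # ∇λ L (the constraint value)
    theta -= a_n * grad_theta
    lam = max(0.0, lam + b_n * grad_lam)  # project lambda onto [0, ∞)
print(theta, lam)  # approaches the saddle point (2, 4)
```

The λ-iterate is projected onto [0, ∞) because Lagrange multipliers for inequality constraints are nonnegative; the faster θ-timescale lets θ track the minimizer for the current λ.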
1converge to a (local) saddle point of Lθ,λ(s0), i.e., to a tuple (θ∗, λ∗) that is a local minimum w.r.t. θ and a local maximum w.r.t. λ of Lθ,λ(s0)
Algorithm flow (episode n):
1. Simulation: using policy πθn, simulate an SSP episode
2. Policy gradient: estimate ∇θGθ(s0)
3. CVaR estimation: estimate CVaRα(Cθ(s0))
4. CVaR gradient: estimate ∇θCVaRα(Cθ(s0))
5. Policy update: θn → θn+1
2Rockafellar, R.T., Uryasev, S. (2000), "Optimization of conditional value-at-risk". In: Journal of Risk
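The Rockafellar–Uryasev result represents CVaR variationally: CVaRα(C) = min_v { v + E[(C − v)+]/(1 − α) }, with the minimizing v equal to VaRα(C). A quick numerical sanity check of this representation (grid minimization and estimator names are illustrative simplifications):

```python
import numpy as np

rng = np.random.default_rng(1)
c = rng.exponential(size=100_000)   # sample costs C ~ Exp(1)
alpha = 0.95

# Rockafellar-Uryasev objective: v + E[(C - v)^+] / (1 - alpha)
def ru_objective(v):
    return v + np.mean(np.maximum(c - v, 0.0)) / (1 - alpha)

# Minimize over a grid of candidate v's (the minimizer approximates VaR_alpha)
vs = np.linspace(0.0, 10.0, 2001)
ru_cvar = min(ru_objective(v) for v in vs)

# Direct estimate: mean of costs beyond the empirical alpha-quantile
var_emp = np.quantile(c, alpha)
tail_cvar = c[c >= var_emp].mean()

print(ru_cvar, tail_cvar)  # the two estimates agree closely
```

For Exp(1) costs both estimates are near the true value CVaR₀.₉₅ = ln 20 + 1 ≈ 4.0. The variational form is what makes CVaR amenable to stochastic approximation: the inner objective is convex in v and its subgradient is a simple indicator expression.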
3Bardou, O., Frikha, N., Pagès, G. (2009), "Computing VaR and CVaR using stochastic approximation and adaptive importance sampling." In: Monte Carlo Methods and Applications
Likelihood-ratio policy gradient: ∇θ log Pθ(s0, a0, . . . , sτ−1) = Σk=0..τ−1 ∇θ log πθ(ak|sk)
5Bartlett, P.L., Baxter, J. (2011) "Infinite-horizon policy-gradient estimation."
6Tamar, A. et al. (2014) "Policy Gradients Beyond Expectations: Conditional Value-at-Risk." In: arXiv:1404.3862
Likelihood derivative: zn = Σk=0..τn−1 ∇θ log πθ(ak|sk)
VaR estimate update: ξn = ξn−1 − ζn (1 − (1/(1−α)) 1{Cn ≥ ξn−1})
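The indicator expression 1 − (1/(1−α)) 1{Cn ≥ ξn−1} is the Rockafellar–Uryasev subgradient, which gives a stochastic approximation scheme for estimating VaR and CVaR from streaming cost samples. The sketch below applies it to standard-normal costs, where the true values are known; the step-size schedule and the running-average CVaR tracker are illustrative choices, not the slides' exact constants:

```python
import numpy as np

rng = np.random.default_rng(2)
alpha = 0.95

# For C ~ N(0, 1): VaR_0.95 ≈ 1.645 and CVaR_0.95 ≈ 2.063.
xi, psi = 0.0, 0.0   # running VaR and CVaR estimates
for n in range(1, 500_001):
    c_n = rng.normal()                 # fresh cost sample
    step = 1.0 / n ** 0.8
    ind = 1.0 if c_n >= xi else 0.0
    # VaR update: descend along the Rockafellar-Uryasev subgradient
    xi -= step * (1.0 - ind / (1.0 - alpha))
    # CVaR update: running mean of v + (C - v)^+ / (1 - alpha) at v = xi
    psi += (1.0 / n) * (xi + ind * (c_n - xi) / (1.0 - alpha) - psi)
print(xi, psi)
```

The VaR iterate ξ drifts until the fraction of samples exceeding it settles at 1 − α; the CVaR iterate then averages the R–U objective evaluated at ξ.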
Mini-batch algorithm flow (iteration n):
1. Simulation: using policy πθn−1, simulate mn episodes
2. Cost/likelihood estimates: obtain {Gn,j, Cn,j, zn,j}, j = 1, . . . , mn
3. Averaging: compute CVaRα(Cθ(s0)) and the gradients ∇θCVaRα(Cθ(s0)), ∇θGθ(s0), weighting tail episodes by the indicator 1{Cn,j ≥ ξn}
4. Policy update: θn−1 → θn
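The averaged CVaR-gradient estimate combines the per-episode likelihood derivatives zn,j with the tail indicator 1{Cn,j ≥ ξn}, in the likelihood-ratio form of Tamar et al. The sketch below checks this estimator on a hypothetical one-parameter family, costs C ~ N(θ, 1), chosen because there the true gradient d CVaRα/dθ is exactly 1 (CVaR of a location family shifts one-for-one with θ):

```python
import numpy as np

rng = np.random.default_rng(3)
alpha, theta, m = 0.95, 0.5, 500_000

# "Episode costs" from a toy parametrized distribution C ~ N(theta, 1)
c = rng.normal(loc=theta, size=m)
z = c - theta                 # likelihood derivative d/dtheta log p(C; theta)

xi = np.quantile(c, alpha)    # empirical VaR_alpha plays the role of xi_n
# Likelihood-ratio CVaR-gradient estimate:
#   (1/m) * sum_j z_j * (C_j - xi) * 1{C_j >= xi} / (1 - alpha)
grad = np.mean(z * (c - xi) * (c >= xi)) / (1 - alpha)
print(grad)  # close to the true gradient, 1
```

Only the tail episodes (those with Cj ≥ ξ) contribute, which is why the estimator's variance grows as α → 1; this is the motivation for the importance-sampling variants cited earlier.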
1Borkar, V. (2010) "Risk-constrained Markov decision processes". In: CDC
2Tamar, A. et al. (2014) "Policy Gradients Beyond Expectations: Conditional Value-at-Risk". In: arXiv:1404.3862