  1. Policy Gradients for CVaR-Constrained MDPs Prashanth L.A. INRIA Lille – Team SequeL Prashanth L.A. (INRIA) Policy Gradients for CVaR-Constrained MDPs 1 / 26

  2. Motivation

"Risk is like fire: If controlled it will help you; if uncontrolled it will rise up and destroy you." (Theodore Roosevelt)

"The major difference between a thing that might go wrong and a thing that cannot possibly go wrong is that when a thing that cannot possibly go wrong goes wrong it usually turns out to be impossible to get at or repair." (Douglas Adams)


  4. Risk-Sensitive Sequential Decision-Making

Risk-neutral objective:

    min_{θ ∈ Θ} G^θ(s^0) = E[ Σ_{m=0}^{τ−1} g(s_m, a_m) | s_0 = s^0, θ ]

Here the inner sum is the total cost, g(s_m, a_m) is the per-step cost, and θ parameterizes the policy. A risk-sensitive criterion instead penalizes the variability induced by a given policy: minimize some measure of risk as well as the usual optimization criterion.


  9. A brief history of risk measures

Risk measures considered in the literature:
- expected exponential utility (Howard & Matheson 1972)
- variance-related measures (Sobel 1982; Filar et al. 1989)
- percentile performance (Filar et al. 1995)

Open question: construct conceptually meaningful and computationally tractable criteria. So far, mainly negative results (e.g., Sobel 1982; Filar et al. 1989; Mannor & Tsitsiklis 2011).


  12. Conditional Value-at-Risk (CVaR)

    VaR_α(X) := inf { ξ | P(X ≤ ξ) ≥ α }
    CVaR_α(X) := E[ X | X ≥ VaR_α(X) ]

Unlike VaR, CVaR is a coherent risk measure (convex, monotone, positive homogeneous, and translation equivariant).
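As a concrete illustration, the two quantities above can be estimated from i.i.d. loss samples by an empirical quantile and a tail average (a minimal sketch; the function name and quantile convention are my own, and this is the plain batch estimator, not the slides' procedure):

```python
import numpy as np

def var_cvar(samples, alpha=0.95):
    """Empirical VaR and CVaR of a loss distribution from i.i.d. samples.

    VaR_alpha is the empirical alpha-quantile of the losses; CVaR_alpha
    is the mean of the losses at or above VaR_alpha.
    """
    x = np.sort(np.asarray(samples, dtype=float))
    var = x[int(np.ceil(alpha * len(x))) - 1]  # empirical alpha-quantile
    cvar = x[x >= var].mean()                  # average of the upper tail
    return var, cvar
```

For losses 1, 2, ..., 100 and α = 0.95, this returns VaR = 95 and CVaR = 97.5, the mean of the worst 5% of outcomes.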

  13. Practical Motivation: Portfolio Re-allocation

[Figure: simplex of portfolio weights over Stock 1, Stock 2, Stock 3, marking the current and the target allocation.]

A portfolio is composed of assets (e.g., stocks), and buying/selling assets yields stochastic gains. Aim: find an investment strategy that achieves a targeted asset allocation. A risk-averse investor would prefer a strategy that (1) quickly achieves the target asset allocation and (2) minimizes the worst-case losses incurred.


  15. Our Contributions
- define a CVaR-constrained stochastic shortest path (SSP) problem
- derive CVaR estimation procedures using stochastic approximation
- propose policy gradient algorithms to optimize the CVaR-constrained SSP
- establish the asymptotic convergence of the algorithms
- adapt the proposed algorithms to incorporate importance sampling (IS)
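The CVaR-estimation bullet can be sketched via the Rockafellar-Uryasev representation CVaR_α(X) = min_ξ { ξ + E[(X − ξ)^+] / (1 − α) }: a Robbins-Monro recursion descends in ξ (tracking VaR) while averaging the objective (tracking CVaR). This is a generic scheme in that spirit; the step sizes and the paper's exact recursion may differ:

```python
import numpy as np

def cvar_sa(samples, alpha=0.95, step=lambda n: 1.0 / (n + 1) ** 0.7):
    """Stochastic-approximation estimates of (VaR, CVaR) from a stream
    of loss samples, using the Rockafellar-Uryasev representation
    CVaR_alpha(X) = min_xi { xi + E[(X - xi)^+] / (1 - alpha) }.
    A generic sketch, not necessarily the paper's estimator.
    """
    xi, psi = 0.0, 0.0  # running VaR and CVaR estimates
    for n, x in enumerate(samples):
        g = step(n)
        # subgradient step on xi for the Rockafellar-Uryasev objective
        xi -= g * (1.0 - (x >= xi) / (1.0 - alpha))
        # Robbins-Monro average of the objective value at the current xi
        psi += g * (xi + max(x - xi, 0.0) / (1.0 - alpha) - psi)
    return xi, psi
```

On a long stream of Uniform(0, 1) losses with α = 0.9, the estimates settle near the true values VaR = 0.9 and CVaR = 0.95.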


  20. CVaR-Constrained SSP

  21. Stochastic Shortest Path

States: S = { 0, 1, ..., r }
Actions: A(s) = { feasible actions in state s }
Costs: g(s, a), used in the objective, and c(s, a), used in the constraint.


  23. CVaR-Constrained SSP

Minimize the total cost

    G^θ(s^0) := E[ Σ_{m=0}^{τ−1} g(s_m, a_m) | s_0 = s^0 ]

subject to the CVaR constraint on the total constraint-cost

    C^θ(s^0) := Σ_{m=0}^{τ−1} c(s_m, a_m),   i.e.,   CVaR_α(C^θ(s^0)) ≤ K_α.
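The two random totals above can be sampled with a simple episode simulator. The sketch below assumes, as is standard for SSPs though the slide does not say so, that state 0 is the absorbing terminal state; the table names (P, g, c, policy) are toy stand-ins, not from the slides:

```python
import numpy as np

def rollout(P, g, c, policy, s0, rng, horizon=1000):
    """Simulate one SSP episode; return the sampled totals
    (sum of g-costs, sum of c-costs) up to absorption.

    P[s][a] is the next-state distribution for action a in state s,
    g[s][a] / c[s][a] are the two cost tables, and policy[s] is the
    action distribution in state s.
    """
    s, G, C = s0, 0.0, 0.0
    for _ in range(horizon):  # tau truncated for safety
        if s == 0:            # state 0 assumed terminal
            break
        a = rng.choice(len(P[s]), p=policy[s])
        G += g[s][a]
        C += c[s][a]
        s = rng.choice(len(P[s][a]), p=P[s][a])
    return G, C
```

Averaging G over many rollouts estimates G^θ(s^0); feeding the sampled C values into a CVaR estimator handles the constraint side.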


  25. Lagrangian Relaxation

Constrained problem:

    min_θ G^θ(s^0)   s.t.   CVaR_α(C^θ(s^0)) ≤ K_α

Relaxed saddle-point problem:

    max_λ min_θ L^{θ,λ}(s^0) := G^θ(s^0) + λ ( CVaR_α(C^θ(s^0)) − K_α )
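In code, the relaxed objective and the outer maximization over λ might look as follows (a sketch of a projected dual-ascent step with an assumed step size, not the paper's two-timescale update):

```python
def lagrangian(G, cvar_C, lam, k_alpha):
    # L^{theta,lambda}(s0) = G^theta(s0) + lambda * (CVaR_alpha(C^theta(s0)) - K_alpha)
    return G + lam * (cvar_C - k_alpha)

def dual_ascent_step(lam, cvar_C, k_alpha, step=0.01):
    # Gradient ascent in lambda, projected onto [0, inf):
    # lambda grows while the CVaR constraint is violated, shrinks otherwise.
    return max(0.0, lam + step * (cvar_C - k_alpha))
```

Alternating this dual step with (approximate) minimization of the Lagrangian over θ is the standard primal-dual scheme that the saddle-point formulation suggests.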
