  1. Safe Reinforcement Learning Philip S. Thomas Stanford CS234: Reinforcement Learning, Guest Lecture May 24, 2017

  2. Lecture overview • What makes a reinforcement learning algorithm safe? • Notation • Creating a safe reinforcement learning algorithm • Off-policy policy evaluation (OPE) • High-confidence off-policy policy evaluation (HCOPE) • Safe policy improvement (SPI) • Empirical results • Research directions

  3. What does it mean for a reinforcement learning algorithm to be safe?

  4. Changing the objective • Policy 1 reward sequence: +20, +20, −50, +0, +20, +20, +20 • Policy 2 reward sequence: +0, +0, +0, +0, +0, +0, +20

  5. Changing the objective • Policy 1: • Reward = 0 with probability 0.999999 • Reward = 10^9 with probability 1 − 0.999999 • Expected reward approximately 1000 • Policy 2: • Reward = 999 with probability 0.5 • Reward = 1000 with probability 0.5 • Expected reward 999.5

  6. Another notion of safety

  7. Another notion of safety (Munos et al.)

  8. Another notion of safety

  9. The Problem • If you apply an existing method, do you have confidence that it will work?

  10. Reinforcement learning successes

  11. A property of many real applications • Deploying “bad” policies can be costly or dangerous.

  12. Deploying bad policies can be costly

  13. Deploying bad policies can be dangerous

  14. What property should a safe algorithm have? • Guaranteed to work on the first try • “I guarantee that with probability at least 1 − δ, I will not change your policy to one that is worse than the current policy.” • You get to choose δ • This guarantee is not contingent on the tuning of any hyperparameters

  15. Lecture overview • What makes a reinforcement learning algorithm safe? • Notation • Creating a safe reinforcement learning algorithm • Off-policy policy evaluation (OPE) • High-confidence off-policy policy evaluation (HCOPE) • Safe policy improvement (SPI) • Empirical results • Research directions

  16. Notation • Policy, π: π(a|s) = Pr(A_t = a | S_t = s) • Agent-environment loop: the agent observes state s, takes action a, and receives reward r • History: H = (s_1, a_1, r_1, s_2, a_2, r_2, …, s_L, a_L, r_L) • Historical data: D = {H_1, H_2, …, H_n}, generated by the behavior policy, π_b • Objective: J(π) = E[ ∑_{t=1}^{L} γ^t R_t | π ]
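
To make the notation concrete, here is a minimal sketch of one way these objects could be represented in code; the (s, a, r) tuple layout, the type choices, and all names are illustrative assumptions, not anything from the lecture.

```python
from typing import List, Tuple

# Illustrative representation of the lecture's notation (names are assumptions).
History = List[Tuple[int, int, float]]   # [(s_1, a_1, r_1), ..., (s_L, a_L, r_L)]
Dataset = List[History]                  # D = {H_1, ..., H_n}, generated by pi_b

def discounted_return(history: History, gamma: float) -> float:
    """The discounted return whose expectation under pi defines J(pi)."""
    # The slides index time from t = 1 to L, so the first reward gets gamma**1.
    return sum(gamma ** t * r for t, (s, a, r) in enumerate(history, start=1))
```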

  17. Safe reinforcement learning algorithm • Reinforcement learning algorithm, a • Historical data, D, which is a random variable • Policy produced by the algorithm, a(D), which is a random variable • A safe reinforcement learning algorithm, a, satisfies: Pr( J(a(D)) ≥ J(π_b) ) ≥ 1 − δ, or, in general: Pr( J(a(D)) ≥ J_min ) ≥ 1 − δ
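
Read operationally, the guarantee says: over repeated draws of the historical data D, the policy a(D) falls below J_min at most a δ fraction of the time. The sketch below checks that property empirically in simulation; `algorithm`, `sample_dataset`, and `true_J` are hypothetical stand-ins usable only when the true performance of a policy is known.

```python
import numpy as np

def empirical_safety_check(algorithm, sample_dataset, true_J, J_min,
                           delta, num_trials=1000, seed=0):
    """Estimate Pr( J(a(D)) >= J_min ) by resampling datasets (simulation only)."""
    rng = np.random.default_rng(seed)
    failures = 0
    for _ in range(num_trials):
        D = sample_dataset(rng)      # D is the random quantity
        policy = algorithm(D)        # so a(D) is a random quantity too
        if true_J(policy) < J_min:   # the event the guarantee bounds
            failures += 1
    # A safe algorithm should fail at most (roughly) a delta fraction of trials.
    return failures / num_trials <= delta
```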

  18. Lecture overview • What makes a reinforcement learning algorithm safe? • Notation • Creating a safe reinforcement learning algorithm • Off-policy policy evaluation (OPE) • High-confidence off-policy policy evaluation (HCOPE) • Safe policy improvement (SPI) • Empirical results • Research directions

  19. Creating a safe reinforcement learning algorithm • Off-policy policy evaluation (OPE): for any evaluation policy, π_e, convert the historical data, D, into n independent and unbiased estimates of J(π_e) • High-confidence off-policy policy evaluation (HCOPE): use a concentration inequality to convert the n independent and unbiased estimates of J(π_e) into a 1 − δ confidence lower bound on J(π_e) • Safe policy improvement (SPI): use the HCOPE method to create a safe reinforcement learning algorithm, a

  20. Off-policy policy evaluation (OPE) • Inputs: historical data, D, and a proposed policy, π_e • Output: an estimate of J(π_e)

  21. Importance sampling (Intuition) • Reminder: history, H = (s_1, a_1, r_1, s_2, a_2, r_2, …, s_L, a_L, r_L); objective, J(π_e) = E[ ∑_{t=1}^{L} γ^t R_t | π_e ] • Importance sampling estimate: Ĵ(π_e) = (1/n) ∑_{i=1}^{n} w_i ∑_{t=1}^{L} γ^t R_t^i • Importance weight: w_i = Pr(H_i | π_e) / Pr(H_i | π_b) = ∏_{t=1}^{L} π_e(a_t^i | s_t^i) / π_b(a_t^i | s_t^i), the probability of the history under the evaluation policy, π_e, divided by its probability under the behavior policy, π_b

  22. Importance sampling (History) • Kahn, H., Marshall, A. W. (1953). Methods of reducing sample size in Monte Carlo computations. Journal of the Operations Research Society of America, 1(5):263–278 • Let X = 0 with probability 1 − 10^{-10} and X = 10^{10} with probability 10^{-10} • E[X] = 1 • A Monte Carlo estimate from n ≪ 10^{10} samples of X is almost always zero • Idea: sample X from some other distribution and use importance sampling to “correct” the estimate • This can also produce lower-variance estimates (Josiah Hanna et al., ICML 2017, to appear)
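
A quick simulation of this rare-event example may help; the 0.5 proposal probability is an arbitrary illustrative choice, and the code is a sketch, not anything from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
p_rare, value, n = 1e-10, 1e10, 10_000

# Naive Monte Carlo: with n << 1e10 samples the rare outcome essentially
# never appears, so the estimate of E[X] = 1 is almost always 0.
naive = np.where(rng.random(n) < p_rare, value, 0.0)
print("naive Monte Carlo estimate:", naive.mean())

# Importance sampling: draw the rare outcome with probability q = 0.5 and
# reweight each sample by p(x) / q(x) to "correct" the estimate.
q_rare = 0.5
is_rare = rng.random(n) < q_rare
x = np.where(is_rare, value, 0.0)
w = np.where(is_rare, p_rare / q_rare, (1 - p_rare) / (1 - q_rare))
print("importance sampling estimate:", (w * x).mean())   # close to 1
```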

  23. Importance sampling (History, continued) • Precup, D., Sutton, R. S., Singh, S. (2000). Eligibility traces for off-policy policy evaluation. In Proceedings of the 17th International Conference on Machine Learning, pp. 759–766. Morgan Kaufmann

  24. Importance sampling (Proof) • Goal: estimate E_p[f(X)] given a sample of X ~ q • Let P = supp(p), Q = supp(q), and F = supp(f) • Importance sampling estimate: (p(X)/q(X)) f(X) • E_q[ (p(X)/q(X)) f(X) ] = ∑_{x∈Q} q(x) (p(x)/q(x)) f(x) = ∑_{x∈P∩Q} p(x) f(x) = ∑_{x∈P} p(x) f(x) − ∑_{x∈P∖Q} p(x) f(x)

  25. Importance sampling (Proof, continued) • Assume P ⊆ Q (the assumption can be relaxed: it is enough that f(x) = 0 for every x ∈ P∖Q) • Then E_q[ (p(X)/q(X)) f(X) ] = ∑_{x∈P} p(x) f(x) − ∑_{x∈P∖Q} p(x) f(x) = ∑_{x∈P} p(x) f(x) = E_p[ f(X) ] • Importance sampling is an unbiased estimator of E_p[ f(X) ]
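
A small numerical check of the unbiasedness claim, on an arbitrary discrete example (the distributions and test function below are illustrative assumptions, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(1)
xs = np.array([0, 1, 2, 3])
p = np.array([0.1, 0.2, 0.3, 0.4])        # target distribution p
q = np.array([0.25, 0.25, 0.25, 0.25])    # sampling distribution q, supp(p) ⊆ supp(q)
f = lambda x: x ** 2

exact = np.sum(p * f(xs))                  # E_p[f(X)]
samples = rng.choice(xs, size=200_000, p=q)
weights = p[samples] / q[samples]          # p(X) / q(X); values double as indices here
estimate = np.mean(weights * f(samples))   # importance sampling estimate
print(exact, estimate)                     # agree up to Monte Carlo error
```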

  26. Importance sampling (Proof, continued) • Now drop the support assumption and instead assume only that f(x) ≥ 0 for all x • E_q[ (p(X)/q(X)) f(X) ] = ∑_{x∈P} p(x) f(x) − ∑_{x∈P∖Q} p(x) f(x) ≤ ∑_{x∈P} p(x) f(x) = E_p[ f(X) ] • Importance sampling is then a negative-bias estimator of E_p[ f(X) ]: in expectation it does not overestimate
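
A matching check of the negative-bias case, again with arbitrary illustrative distributions: q never samples x = 3, f is nonnegative, and the estimator underestimates E_p[f(X)] in expectation.

```python
import numpy as np

rng = np.random.default_rng(2)
xs = np.array([0, 1, 2, 3])
p = np.array([0.1, 0.2, 0.3, 0.4])
q = np.array([1/3, 1/3, 1/3, 0.0])   # supp(p) is NOT contained in supp(q)
f = lambda x: x ** 2

exact = np.sum(p * f(xs))                      # E_p[f(X)] = 5.0
samples = rng.choice(xs, size=200_000, p=q)    # x = 3 is never drawn
estimate = np.mean(p[samples] / q[samples] * f(samples))
print(exact, estimate)   # estimate converges to 5.0 - 0.4 * 9 = 1.4 < 5.0
```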

  27. Importance sampling (reminder) • IS(D) = (1/n) ∑_{i=1}^{n} ( ∏_{t=1}^{L} π_e(a_t^i | s_t^i) / π_b(a_t^i | s_t^i) ) ∑_{t=1}^{L} γ^t R_t^i • E[ IS(D) ] = J(π_e)
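
A sketch of this ordinary importance sampling estimator. A history is assumed to be a list of (s, a, r) tuples, and `pi_e` / `pi_b` are hypothetical callables returning π(a | s); these conventions follow the slide's notation, not any code from the lecture.

```python
import numpy as np

def is_estimate(histories, pi_e, pi_b, gamma=1.0):
    """Unbiased off-policy estimate of J(pi_e) from data generated by pi_b."""
    per_history = []
    for history in histories:
        weight = 1.0
        ret = 0.0
        for t, (s, a, r) in enumerate(history, start=1):  # slides index t = 1..L
            weight *= pi_e(a, s) / pi_b(a, s)   # product of per-step ratios
            ret += gamma ** t * r               # discounted return of this history
        per_history.append(weight * ret)        # w_i * sum_t gamma^t R_t^i
    # The n per-history terms are independent and unbiased; IS(D) is their mean.
    return float(np.mean(per_history)), np.asarray(per_history)
```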

  28. Creating a safe reinforcement learning algorithm • Off-policy policy evaluation (OPE): for any evaluation policy, π_e, convert the historical data, D, into n independent and unbiased estimates of J(π_e) • High-confidence off-policy policy evaluation (HCOPE): use a concentration inequality to convert the n independent and unbiased estimates of J(π_e) into a 1 − δ confidence lower bound on J(π_e) • Safe policy improvement (SPI): use the HCOPE method to create a safe reinforcement learning algorithm, a

  29. High-confidence off-policy policy evaluation (HCOPE) • Inputs: historical data, D, and a proposed policy, π_e • Output: a 1 − δ confidence lower bound on J(π_e)

  30. Hoeffding’s inequality • Let X_1, …, X_n be n independent, identically distributed random variables such that X_i ∈ [0, b] • Then, with probability at least 1 − δ: E[X_i] ≥ (1/n) ∑_{i=1}^{n} X_i − b √( ln(1/δ) / (2n) ) • Applied here with X_i = w_i ∑_{t=1}^{L} γ^t R_t^i, the i-th importance-sampled return, this gives a 1 − δ confidence lower bound on J(π_e)
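
A minimal sketch of the resulting HCOPE lower bound; `is_returns` stands for the n importance-sampled returns (the X_i above) and `b` for an upper bound on their range, both hypothetical names. For importance sampling the worst-case b can be extremely large, which is one reason this naive combination can be very loose in practice (see the "WON'T WORK" note on the last slide of this section).

```python
import numpy as np

def hoeffding_lower_bound(is_returns, b, delta):
    """1 - delta confidence lower bound on E[X_i] = J(pi_e), assuming X_i in [0, b]."""
    x = np.asarray(is_returns, dtype=float)
    n = len(x)
    return x.mean() - b * np.sqrt(np.log(1.0 / delta) / (2.0 * n))
```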

  31. Creating a safe reinforcement learning algorithm • Off-policy policy evaluation (OPE): for any evaluation policy, π_e, convert the historical data, D, into n independent and unbiased estimates of J(π_e) • High-confidence off-policy policy evaluation (HCOPE): use a concentration inequality to convert the n independent and unbiased estimates of J(π_e) into a 1 − δ confidence lower bound on J(π_e) • Safe policy improvement (SPI): use the HCOPE method to create a safe reinforcement learning algorithm, a

  32. Safe policy improvement (SPI) • Input: historical data, D • Output: a new policy, π, or “No Solution Found” • The guarantee must hold with probability at least 1 − δ

  33. Safe policy improvement (SPI) • Split the historical data into a training set (20%) and a testing set (80%) • Training set: select a candidate policy, π • Testing set: run the safety test, i.e., is the 1 − δ confidence lower bound on J(π) larger than J(π_cur)?
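
A sketch of this split-and-test procedure. The helpers `candidate_search` (choose a candidate policy from the training set), `is_returns` (per-history importance-sampled returns of a policy), and `j_current` (the current policy's performance) are hypothetical stand-ins, not functions from the lecture.

```python
import numpy as np

def safe_policy_improvement(histories, candidate_search, is_returns,
                            j_current, b, delta, train_frac=0.2, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(histories))
    n_train = int(train_frac * len(histories))
    train = [histories[i] for i in idx[:n_train]]   # 20%: choose a candidate
    test = [histories[i] for i in idx[n_train:]]    # 80%: held out for the safety test

    candidate = candidate_search(train)

    # Safety test: a 1 - delta confidence (Hoeffding) lower bound on J(candidate),
    # computed on held-out data, must beat J(pi_cur).
    x = np.asarray(is_returns(test, candidate), dtype=float)
    lower_bound = x.mean() - b * np.sqrt(np.log(1.0 / delta) / (2.0 * len(x)))
    return candidate if lower_bound > j_current else "No Solution Found"
```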

  34. Creating a safe reinforcement learning algorithm • Off-policy policy evaluation (OPE): for any evaluation policy, π_e, convert the historical data, D, into n independent and unbiased estimates of J(π_e) • High-confidence off-policy policy evaluation (HCOPE): use a concentration inequality to convert the n independent and unbiased estimates of J(π_e) into a 1 − δ confidence lower bound on J(π_e) • Safe policy improvement (SPI): use the HCOPE method to create a safe reinforcement learning algorithm, a • WON’T WORK
