  1. Safe Reinforcement Learning Philip S. Thomas Stanford CS234: Reinforcement Learning, Guest Lecture May 24, 2017

  2. Lecture overview • What makes a reinforcement learning algorithm safe? • Notation • Creating a safe reinforcement learning algorithm • Off-policy policy evaluation (OPE) • High-confidence off-policy policy evaluation (HCOPE) • Safe policy improvement (SPI) • Empirical results • Research directions

  3. What does it mean for a reinforcement learning algorithm to be safe?

  4. Changing the objective • Policy 1 reward sequence: +20, +20, −50, +0, +20, +20, +20 • Policy 2 reward sequence: +0, +0, +0, +0, +0, +0, +20

  5. Changing the objective • Policy 1: • Reward = 0 with probability 0.999999 • Reward = 10^9 with probability 1 − 0.999999 • Expected reward approximately 1000 • Policy 2: • Reward = 999 with probability 0.5 • Reward = 1000 with probability 0.5 • Expected reward 999.5

  6. Another notion of safety

  7. Another notion of safety (Munos et al.)

  8. Another notion of safety

  9. The Problem • If you apply an existing method, do you have confidence that it will work?

  10. Reinforcement learning successes

  11. A property of many real applications • Deploying “bad” policies can be costly or dangerous.

  12. Deploying bad policies can be costly

  13. Deploying bad policies can be dangerous

  14. What property should a safe algorithm have? • Guaranteed to work on the first try • “I guarantee that with probability at least 1 − δ, I will not change your policy to one that is worse than the current policy.” • You get to choose δ • This guarantee is not contingent on the tuning of any hyperparameters

  15. Lecture overview • What makes a reinforcement learning algorithm safe? • Notation • Creating a safe reinforcement learning algorithm • Off-policy policy evaluation (OPE) • High-confidence off-policy policy evaluation (HCOPE) • Safe policy improvement (SPI) • Empirical results • Research directions

  16. Notation • Policy, π: π(a|s) = Pr(A_t = a | S_t = s) • Agent-environment loop: the agent observes state s, takes action a, and receives reward r • History: H = (s_1, a_1, r_1, s_2, a_2, r_2, …, s_L, a_L, r_L) • Historical data: D = {H_1, H_2, …, H_n}, generated by the behavior policy, π_b • Objective: J(π) = E[ ∑_{t=1}^{L} γ^t R_t | π ]
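
To make the notation concrete, here is a minimal sketch of one way these objects could be represented in code; the (s, a, r) tuple layout, the type choices, and all names are illustrative assumptions, not anything from the lecture.

```python
from typing import List, Tuple

# Illustrative representation of the lecture's notation (names are assumptions).
History = List[Tuple[int, int, float]]   # [(s_1, a_1, r_1), ..., (s_L, a_L, r_L)]
Dataset = List[History]                  # D = {H_1, ..., H_n}, generated by pi_b

def discounted_return(history: History, gamma: float) -> float:
    """The discounted return whose expectation under pi defines J(pi)."""
    # The slides index time from t = 1 to L, so the first reward gets gamma**1.
    return sum(gamma ** t * r for t, (s, a, r) in enumerate(history, start=1))
```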

  17. Safe reinforcement learning algorithm • Reinforcement learning algorithm, a • Historical data, D, which is a random variable • Policy produced by the algorithm, a(D), which is a random variable • A safe reinforcement learning algorithm, a, satisfies: Pr( J(a(D)) ≥ J(π_b) ) ≥ 1 − δ, or, in general: Pr( J(a(D)) ≥ J_min ) ≥ 1 − δ
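
Read operationally, the guarantee says: over repeated draws of the historical data D, the policy a(D) falls below J_min at most a δ fraction of the time. The sketch below checks that property empirically in simulation; `algorithm`, `sample_dataset`, and `true_J` are hypothetical stand-ins usable only when the true performance of a policy is known.

```python
import numpy as np

def empirical_safety_check(algorithm, sample_dataset, true_J, J_min,
                           delta, num_trials=1000, seed=0):
    """Estimate Pr( J(a(D)) >= J_min ) by resampling datasets (simulation only)."""
    rng = np.random.default_rng(seed)
    failures = 0
    for _ in range(num_trials):
        D = sample_dataset(rng)      # D is the random quantity
        policy = algorithm(D)        # so a(D) is a random quantity too
        if true_J(policy) < J_min:   # the event the guarantee bounds
            failures += 1
    # A safe algorithm should fail at most (roughly) a delta fraction of trials.
    return failures / num_trials <= delta
```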

  18. Lecture overview • What makes a reinforcement learning algorithm safe? • Notation • Creating a safe reinforcement learning algorithm • Off-policy policy evaluation (OPE) • High-confidence off-policy policy evaluation (HCOPE) • Safe policy improvement (SPI) • Empirical results • Research directions

  19. Creating a safe reinforcement learning algorithm • Off-policy policy evaluation (OPE): for any evaluation policy, π_e, convert the historical data, D, into n independent and unbiased estimates of J(π_e) • High-confidence off-policy policy evaluation (HCOPE): use a concentration inequality to convert the n independent and unbiased estimates of J(π_e) into a 1 − δ confidence lower bound on J(π_e) • Safe policy improvement (SPI): use the HCOPE method to create a safe reinforcement learning algorithm, a

  20. Off-policy policy evaluation (OPE) • Inputs: historical data, D, and a proposed policy, π_e • Output: an estimate of J(π_e)

  21. Importance sampling (Intuition) • Reminder: history, H = (s_1, a_1, r_1, s_2, a_2, r_2, …, s_L, a_L, r_L); objective, J(π_e) = E[ ∑_{t=1}^{L} γ^t R_t | π_e ] • Importance sampling estimate: Ĵ(π_e) = (1/n) ∑_{i=1}^{n} w_i ∑_{t=1}^{L} γ^t R_t^i • Importance weight: w_i = Pr(H_i | π_e) / Pr(H_i | π_b) = ∏_{t=1}^{L} π_e(a_t^i | s_t^i) / π_b(a_t^i | s_t^i), the probability of the history under the evaluation policy, π_e, divided by its probability under the behavior policy, π_b

  22. Importance sampling (History) • Kahn, H., Marshall, A. W. (1953). Methods of reducing sample size in Monte Carlo computations. Journal of the Operations Research Society of America, 1(5):263–278 • Let X = 0 with probability 1 − 10^{-10} and X = 10^{10} with probability 10^{-10} • E[X] = 1 • A Monte Carlo estimate from n ≪ 10^{10} samples of X is almost always zero • Idea: sample X from some other distribution and use importance sampling to “correct” the estimate • This can also produce lower-variance estimates (Josiah Hanna et al., ICML 2017, to appear)
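
A quick simulation of this rare-event example may help; the 0.5 proposal probability is an arbitrary illustrative choice, and the code is a sketch, not anything from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
p_rare, value, n = 1e-10, 1e10, 10_000

# Naive Monte Carlo: with n << 1e10 samples the rare outcome essentially
# never appears, so the estimate of E[X] = 1 is almost always 0.
naive = np.where(rng.random(n) < p_rare, value, 0.0)
print("naive Monte Carlo estimate:", naive.mean())

# Importance sampling: draw the rare outcome with probability q = 0.5 and
# reweight each sample by p(x) / q(x) to "correct" the estimate.
q_rare = 0.5
is_rare = rng.random(n) < q_rare
x = np.where(is_rare, value, 0.0)
w = np.where(is_rare, p_rare / q_rare, (1 - p_rare) / (1 - q_rare))
print("importance sampling estimate:", (w * x).mean())   # close to 1
```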

  23. Importance sampling (History, continued) • Precup, D., Sutton, R. S., Singh, S. (2000). Eligibility traces for off-policy policy evaluation. In Proceedings of the 17th International Conference on Machine Learning, pp. 759–766. Morgan Kaufmann

  24. Importance sampling (Proof) • Goal: estimate E_p[f(X)] given a sample of X ~ q • Let P = supp(p), Q = supp(q), and F = supp(f) • Importance sampling estimate: (p(X)/q(X)) f(X) • E_q[ (p(X)/q(X)) f(X) ] = ∑_{x∈Q} q(x) (p(x)/q(x)) f(x) = ∑_{x∈P∩Q} p(x) f(x) = ∑_{x∈P} p(x) f(x) − ∑_{x∈P∖Q} p(x) f(x)

  25. Importance sampling (Proof, continued) • Assume P ⊆ Q (the assumption can be relaxed: it is enough that f(x) = 0 for every x ∈ P∖Q) • Then E_q[ (p(X)/q(X)) f(X) ] = ∑_{x∈P} p(x) f(x) − ∑_{x∈P∖Q} p(x) f(x) = ∑_{x∈P} p(x) f(x) = E_p[ f(X) ] • Importance sampling is an unbiased estimator of E_p[ f(X) ]
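
A small numerical check of the unbiasedness claim, on an arbitrary discrete example (the distributions and test function below are illustrative assumptions, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(1)
xs = np.array([0, 1, 2, 3])
p = np.array([0.1, 0.2, 0.3, 0.4])        # target distribution p
q = np.array([0.25, 0.25, 0.25, 0.25])    # sampling distribution q, supp(p) ⊆ supp(q)
f = lambda x: x ** 2

exact = np.sum(p * f(xs))                  # E_p[f(X)]
samples = rng.choice(xs, size=200_000, p=q)
weights = p[samples] / q[samples]          # p(X) / q(X); values double as indices here
estimate = np.mean(weights * f(samples))   # importance sampling estimate
print(exact, estimate)                     # agree up to Monte Carlo error
```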

  26. Importance sampling (Proof, continued) • Now drop the support assumption and instead assume only that f(x) ≥ 0 for all x • E_q[ (p(X)/q(X)) f(X) ] = ∑_{x∈P} p(x) f(x) − ∑_{x∈P∖Q} p(x) f(x) ≤ ∑_{x∈P} p(x) f(x) = E_p[ f(X) ] • Importance sampling is then a negative-bias estimator of E_p[ f(X) ]: in expectation it does not overestimate
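
A matching check of the negative-bias case, again with arbitrary illustrative distributions: q never samples x = 3, f is nonnegative, and the estimator underestimates E_p[f(X)] in expectation.

```python
import numpy as np

rng = np.random.default_rng(2)
xs = np.array([0, 1, 2, 3])
p = np.array([0.1, 0.2, 0.3, 0.4])
q = np.array([1/3, 1/3, 1/3, 0.0])   # supp(p) is NOT contained in supp(q)
f = lambda x: x ** 2

exact = np.sum(p * f(xs))                      # E_p[f(X)] = 5.0
samples = rng.choice(xs, size=200_000, p=q)    # x = 3 is never drawn
estimate = np.mean(p[samples] / q[samples] * f(samples))
print(exact, estimate)   # estimate converges to 5.0 - 0.4 * 9 = 1.4 < 5.0
```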

  27. Importance sampling (reminder) • IS(D) = (1/n) ∑_{i=1}^{n} ( ∏_{t=1}^{L} π_e(a_t^i | s_t^i) / π_b(a_t^i | s_t^i) ) ∑_{t=1}^{L} γ^t R_t^i • E[ IS(D) ] = J(π_e)
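
A sketch of this ordinary importance sampling estimator. A history is assumed to be a list of (s, a, r) tuples, and `pi_e` / `pi_b` are hypothetical callables returning π(a | s); these conventions follow the slide's notation, not any code from the lecture.

```python
import numpy as np

def is_estimate(histories, pi_e, pi_b, gamma=1.0):
    """Unbiased off-policy estimate of J(pi_e) from data generated by pi_b."""
    per_history = []
    for history in histories:
        weight = 1.0
        ret = 0.0
        for t, (s, a, r) in enumerate(history, start=1):  # slides index t = 1..L
            weight *= pi_e(a, s) / pi_b(a, s)   # product of per-step ratios
            ret += gamma ** t * r               # discounted return of this history
        per_history.append(weight * ret)        # w_i * sum_t gamma^t R_t^i
    # The n per-history terms are independent and unbiased; IS(D) is their mean.
    return float(np.mean(per_history)), np.asarray(per_history)
```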

  28. Creating a safe reinforcement learning algorithm • Off-policy policy evaluation (OPE): for any evaluation policy, π_e, convert the historical data, D, into n independent and unbiased estimates of J(π_e) • High-confidence off-policy policy evaluation (HCOPE): use a concentration inequality to convert the n independent and unbiased estimates of J(π_e) into a 1 − δ confidence lower bound on J(π_e) • Safe policy improvement (SPI): use the HCOPE method to create a safe reinforcement learning algorithm, a

  29. High-confidence off-policy policy evaluation (HCOPE) • Inputs: historical data, D, and a proposed policy, π_e • Output: a 1 − δ confidence lower bound on J(π_e)

  30. Hoeffding’s inequality • Let X_1, …, X_n be n independent, identically distributed random variables such that X_i ∈ [0, b] • Then, with probability at least 1 − δ: E[X_i] ≥ (1/n) ∑_{i=1}^{n} X_i − b √( ln(1/δ) / (2n) ) • Applied here with X_i = w_i ∑_{t=1}^{L} γ^t R_t^i, the i-th importance-sampled return, this gives a 1 − δ confidence lower bound on J(π_e)
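
A minimal sketch of the resulting HCOPE lower bound; `is_returns` stands for the n importance-sampled returns (the X_i above) and `b` for an upper bound on their range, both hypothetical names. For importance sampling the worst-case b can be extremely large, which is one reason this naive combination can be very loose in practice (see the "WON'T WORK" note on the last slide of this section).

```python
import numpy as np

def hoeffding_lower_bound(is_returns, b, delta):
    """1 - delta confidence lower bound on E[X_i] = J(pi_e), assuming X_i in [0, b]."""
    x = np.asarray(is_returns, dtype=float)
    n = len(x)
    return x.mean() - b * np.sqrt(np.log(1.0 / delta) / (2.0 * n))
```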

  31. Creating a safe reinforcement learning algorithm • Off-policy policy evaluation (OPE): for any evaluation policy, π_e, convert the historical data, D, into n independent and unbiased estimates of J(π_e) • High-confidence off-policy policy evaluation (HCOPE): use a concentration inequality to convert the n independent and unbiased estimates of J(π_e) into a 1 − δ confidence lower bound on J(π_e) • Safe policy improvement (SPI): use the HCOPE method to create a safe reinforcement learning algorithm, a

  32. Safe policy improvement (SPI) • Input: historical data, D • Output: a new policy, π, or “No Solution Found” • The guarantee must hold with probability at least 1 − δ

  33. Safe policy improvement (SPI) • Split the historical data into a training set (20%) and a testing set (80%) • Training set: select a candidate policy, π • Testing set: run the safety test, i.e., is the 1 − δ confidence lower bound on J(π) larger than J(π_cur)?
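
A sketch of this split-and-test procedure. The helpers `candidate_search` (choose a candidate policy from the training set), `is_returns` (per-history importance-sampled returns of a policy), and `j_current` (the current policy's performance) are hypothetical stand-ins, not functions from the lecture.

```python
import numpy as np

def safe_policy_improvement(histories, candidate_search, is_returns,
                            j_current, b, delta, train_frac=0.2, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(histories))
    n_train = int(train_frac * len(histories))
    train = [histories[i] for i in idx[:n_train]]   # 20%: choose a candidate
    test = [histories[i] for i in idx[n_train:]]    # 80%: held out for the safety test

    candidate = candidate_search(train)

    # Safety test: a 1 - delta confidence (Hoeffding) lower bound on J(candidate),
    # computed on held-out data, must beat J(pi_cur).
    x = np.asarray(is_returns(test, candidate), dtype=float)
    lower_bound = x.mean() - b * np.sqrt(np.log(1.0 / delta) / (2.0 * len(x)))
    return candidate if lower_bound > j_current else "No Solution Found"
```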

  34. Creating a safe reinforcement learning algorithm • Off-policy policy evaluation (OPE): for any evaluation policy, π_e, convert the historical data, D, into n independent and unbiased estimates of J(π_e) • High-confidence off-policy policy evaluation (HCOPE): use a concentration inequality to convert the n independent and unbiased estimates of J(π_e) into a 1 − δ confidence lower bound on J(π_e) • Safe policy improvement (SPI): use the HCOPE method to create a safe reinforcement learning algorithm, a • WON’T WORK
