SLIDE 1 Safe Reinforcement Learning via Formal Methods
Nathan Fulton and André Platzer
Carnegie Mellon University
SLIDE 2 Safe Reinforcement Learning via Formal Methods
Nathan Fulton and André Platzer
Carnegie Mellon University
SLIDE 3
Safety-Critical Systems
"How can we provide people with cyber-physical systems they can bet their lives on?" - Jeannette Wing
SLIDE 4
Autonomous Safety-Critical Systems
How can we provide people with autonomous cyber-physical systems they can bet their lives on?
SLIDE 5
Model-Based Verification
φ
Reinforcement Learning
SLIDE 6 Model-Based Verification
pos < stopSign
Reinforcement Learning
SLIDE 7 Model-Based Verification
pos < stopSign
Reinforcement Learning
ctrl
SLIDE 8 Approach: prove that control software achieves a specification with respect to a model of the physical system.
Model-Based Verification
pos < stopSign
Reinforcement Learning
ctrl
SLIDE 9 Approach: prove that control software achieves a specification with respect to a model of the physical system.
Model-Based Verification
pos < stopSign
Reinforcement Learning
ctrl
SLIDE 10 Benefits:
- Strong safety guarantees
- Automated analysis
Model-Based Verification
φ
Reinforcement Learning
SLIDE 11 Benefits:
- Strong safety guarantees
- Automated analysis
Drawbacks:
- Control policies are typically
non-deterministic: they answer “what is safe”, not “what is useful”
Model-Based Verification
φ
Reinforcement Learning
SLIDE 12 Benefits:
- Strong safety guarantees
- Automated analysis
Drawbacks:
- Control policies are typically
non-deterministic: they answer “what is safe”, not “what is useful”
Model-Based Verification
φ
Reinforcement Learning
SLIDE 13 Benefits:
- Strong safety guarantees
- Automated analysis
Drawbacks:
- Control policies are typically
non-deterministic: they answer “what is safe”, not “what is useful”
Model-Based Verification
φ
Reinforcement Learning
Observe Act
SLIDE 14 Benefits:
- Strong safety guarantees
- Automated analysis
Drawbacks:
- Control policies are typically
non-deterministic: they answer “what is safe”, not “what is useful”
Model-Based Verification
φ
Reinforcement Learning
Observe Act
Benefits:
- No need for complete model
- Optimal (effective) policies
SLIDE 15 Benefits:
- Strong safety guarantees
- Automated analysis
Drawbacks:
- Control policies are typically
non-deterministic: they answer “what is safe”, not “what is useful”
Model-Based Verification
φ
Reinforcement Learning
Observe Act
Benefits:
- No need for complete model
- Optimal (effective) policies
Drawbacks:
- No strong safety guarantees
- Proofs are obtained and
checked by hand
- Formal proofs = decades-long
proof development
SLIDE 16 Benefits:
- Strong safety guarantees
- Computational aids (ATP)
Drawbacks:
- Control policies are typically
non-deterministic: they answer “what is safe”, not “what is useful”
Model-Based Verification
φ
Reinforcement Learning
Observe Act
Benefits:
- No need for complete model
- Optimal (effective) policies
Drawbacks:
- No strong safety guarantees
- Proofs are obtained and
checked by hand
- Formal proofs = decades-long
proof development
Goal: Provably correct reinforcement learning
SLIDE 17 Benefits:
- Strong safety guarantees
- Computational aids (ATP)
Drawbacks:
- Control policies are typically
non-deterministic: they answer “what is safe”, not “what is useful”
Model-Based Verification
φ
Reinforcement Learning
Observe Act
Benefits:
- No need for complete model
- Optimal (effective) policies
Drawbacks:
- No strong safety guarantees
- Proofs are obtained and
checked by hand
- Formal proofs = decades-long
proof development
Goal: Provably correct reinforcement learning
- 1. Learn Safety
- 2. Learn a Safe Policy
- 3. Justify claims of safety
SLIDE 18
Model-Based Verification
Accurate, analyzable models often exist!
{ {?safeAccel; accel ∪ brake ∪ ?safeTurn; turn}; {pos’ = vel, vel’ = acc} }*
SLIDE 19
Model-Based Verification
Accurate, analyzable models often exist!
{ {?safeAccel; accel ∪ brake ∪ ?safeTurn; turn}; {pos’ = vel, vel’ = acc} }*
discrete control
continuous motion
SLIDE 20
Model-Based Verification
Accurate, analyzable models often exist!
{ {?safeAccel; accel ∪ brake ∪ ?safeTurn; turn}; {pos’ = vel, vel’ = acc} }*
discrete, non-deterministic control
continuous motion
SLIDE 21
Model-Based Verification
Accurate, analyzable models often exist!
init → [{ {?safeAccel; accel ∪ brake ∪ ?safeTurn; turn}; {pos’ = vel, vel’ = acc} }*](pos < stopSign)
SLIDE 22
Model-Based Verification
Accurate, analyzable models often exist!
Formal verification gives strong safety guarantees
init → [{ {?safeAccel; accel ∪ brake ∪ ?safeTurn; turn}; {pos’ = vel, vel’ = acc} }*](pos < stopSign)
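For readers reconstructing the specification, here is the same dL formula typeset in LaTeX. Notation follows the slides: ∪ is nondeterministic choice, ?Q is a test, α* repeats α, and the box modality [α]P asserts that P holds after every run of α.

```latex
\[
\mathit{init} \rightarrow
\big[ \big\{ \{ ?\mathit{safeAccel};\ \mathit{accel}
  \;\cup\; \mathit{brake}
  \;\cup\; ?\mathit{safeTurn};\ \mathit{turn} \};\
  \{ \mathit{pos}' = \mathit{vel},\ \mathit{vel}' = \mathit{acc} \}
\big\}^{*} \big]\ \mathit{pos} < \mathit{stopSign}
\]
```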
SLIDE 23 Model-Based Verification
Accurate, analyzable models often exist!
Formal verification gives strong safety guarantees
=
- Computer-checked proofs of the safety specification.
SLIDE 24 Model-Based Verification
Accurate, analyzable models often exist!
Formal verification gives strong safety guarantees
=
- Computer-checked proofs of the safety specification
- Formal proofs mapping the model to runtime monitors
SLIDE 25
Model-Based Verification Isn’t Enough
Perfect, analyzable models don’t exist!
SLIDE 26
Model-Based Verification Isn’t Enough
Perfect, analyzable models don’t exist!
{ {?safeAccel; accel ∪ brake ∪ ?safeTurn; turn}; {pos’ = vel, vel’ = acc} }*
How to implement?
Only accurate sometimes
SLIDE 27
Model-Based Verification Isn’t Enough
Perfect, analyzable models don’t exist!
{ {?safeAccel; accel ∪ brake ∪ ?safeTurn; turn}; {dx’=w*y, dy’=-w*x, ...} }*
How to implement?
Only accurate sometimes
SLIDE 28 Our Contribution
Justified Speculative Control is an approach toward provably safe reinforcement learning that:
- 1. learns to resolve non-determinism without
sacrificing formal safety results
SLIDE 29 Our Contribution
Justified Speculative Control is an approach toward provably safe reinforcement learning that:
- 1. learns to resolve non-determinism without
sacrificing formal safety results
- 2. allows and directs speculation whenever
model mismatches occur
SLIDE 30 Learning to Resolve Non-determinism
Observe & compute reward
Act
SLIDE 31 Learning to Resolve Non-determinism
Observe & compute reward
accel ∪ brake ∪ turn
SLIDE 32 Learning to Resolve Non-determinism
Observe & compute reward
{accel,brake,turn}
SLIDE 33 Learning to Resolve Non-determinism
⇨
Observe & compute reward
Policy
{accel,brake,turn}
SLIDE 34 Learning to Resolve Non-determinism
⇨
Observe & compute reward
(safe?) Policy
{accel,brake,turn}
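To make the learning loop on these slides concrete, here is a minimal tabular Q-learning sketch over the deck’s three-action set. The environment interface (`env.reset`, `env.step`) and all hyperparameters are illustrative assumptions, not the authors’ implementation.

```python
import random
from collections import defaultdict

ACTIONS = ["accel", "brake", "turn"]  # the action set from the slides

def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    """Learn a deterministic policy by resolving the model's non-determinism."""
    Q = defaultdict(float)  # maps (state, action) to an estimated value
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # Act: epsilon-greedy over the three slide actions.
            if random.random() < eps:
                action = random.choice(ACTIONS)
            else:
                action = max(ACTIONS, key=lambda a: Q[(state, a)])
            # Observe & compute reward.
            next_state, reward, done = env.step(action)
            best_next = max(Q[(next_state, a)] for a in ACTIONS)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q
```

The learned policy is then state ↦ argmaxₐ Q[(state, a)], which is exactly the “(safe?) Policy” the next slides interrogate.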
SLIDE 35 Learning to Safely Resolve Non-determinism
⇨
Observe & compute reward
(safe?) Policy
Safety Monitor
SLIDE 36 Learning to Safely Resolve Non-determinism
⇨
Observe & compute reward
(safe?) Policy
Safety Monitor
≠ “Trust Me”
SLIDE 37 Learning to Safely Resolve Non-determinism
⇨
Observe & compute reward
(safe?) Policy
φ
Use a theorem prover to prove: (init→[{{accel∪brake};ODEs}*](safe)) ↔ φ
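The “Safety Monitor” box can be read as a runtime guard: the prover-derived condition φ is checked before every action, so safety never reduces to “trust me”. A minimal sketch, where the hypothetical `controller_monitor` stands in for the formula φ exported from the proof.

```python
def monitored_action(state, proposed_action, controller_monitor, fallback="brake"):
    """Guard the learner's proposed action with the prover-derived condition phi.

    controller_monitor(state, action) -> bool is assumed to implement the
    condition phi that the theorem prover showed equivalent to the dL safety
    property; any action it rejects is replaced by a verified-safe fallback.
    """
    if controller_monitor(state, proposed_action):
        return proposed_action  # provably stays inside the model's safe envelope
    return fallback             # e.g. brake, the conservative default
```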
SLIDE 38 Learning to Safely Resolve Non-determinism
⇨
Observe & compute reward
(safe?) Policy
φ
Use a theorem prover to prove: (init→[{{accel∪brake};ODEs}*](safe)) ↔ φ
SLIDE 39 (safe?) Policy
Learning to Safely Resolve Non-determinism
⇨
Observe & compute reward
φ
Main Theorem: If the ODEs are accurate, then
- our formal proofs transfer from the
non-deterministic model to the learned (deterministic) policy
Use a theorem prover to prove: (init→[{{accel∪brake};ODEs}*](safe)) ↔ φ
SLIDE 40 (safe?) Policy
Learning to Safely Resolve Non-determinism
⇨
Observe & compute reward
φ
Main Theorem: If the ODEs are accurate, then
- our formal proofs transfer from the
non-deterministic model to the learned (deterministic) policy via the model monitor.
Use a theorem prover to prove: (init→[{{accel∪brake};ODEs}*](safe)) ↔ φ
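The “model monitor” in the theorem compares each observed transition against what the ODEs {pos’ = vel, vel’ = acc} predict. A rough sketch assuming piecewise-constant acceleration between samples; the state layout and tolerance `eps` are illustrative assumptions.

```python
def model_monitor(prev, curr, dt, eps=1e-2):
    """Check one observed transition against pos' = vel, vel' = acc.

    prev and curr are (pos, vel, acc) samples taken dt seconds apart; the
    monitor accepts when the observation matches the closed-form ODE
    solution up to the tolerance eps.
    """
    pos0, vel0, acc0 = prev
    pos1, vel1, _ = curr
    pred_pos = pos0 + vel0 * dt + 0.5 * acc0 * dt * dt  # integrate pos' = vel
    pred_vel = vel0 + acc0 * dt                          # integrate vel' = acc
    return abs(pos1 - pred_pos) <= eps and abs(vel1 - pred_vel) <= eps
```

While this check passes, the main theorem says the safety proof for the non-deterministic model carries over to the learned policy.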
SLIDE 41 What about the physical model?
⇨
Observe & compute reward
φ
Use a theorem prover to prove: (init→[{{accel∪brake};ODEs}*](safe)) ↔ φ
{pos’=vel,vel’=acc} ≠ (safe?) Policy
SLIDE 42 What About the Physical Model?
Observe & compute reward {brake, accel, turn}
SLIDE 43 What About the Physical Model?
Observe & compute reward {brake, accel, turn}
Model is accurate.
SLIDE 44 What About the Physical Model?
Observe & compute reward {brake, accel, turn}
Model is accurate.
SLIDE 45 What About the Physical Model?
Observe & compute reward {brake, accel, turn}
Model is accurate.
Model is inaccurate.
SLIDE 46 What About the Physical Model?
Observe & compute reward {brake, accel, turn}
Model is accurate.
Model is inaccurate. Obstacle!
SLIDE 47 What About the Physical Model?
Observe & compute reward {brake, accel, turn}
Expected
Reality
SLIDE 48 Speculation is Justified
Observe & compute reward {brake, accel, turn}
Expected (safe)
Reality (crash!)
SLIDE 49 Leveraging Verification Results to Learn Better
Observe & compute reward {brake, accel, turn}
Use a real-valued version of the model monitor as a reward signal
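A sketch of this slide’s idea: turn the boolean model monitor into a real-valued margin and subtract it from the task reward, so deviation from the verified model is penalized and the learner is steered back toward modeled state space. The state layout and weighting are illustrative assumptions.

```python
def monitor_margin(prev, curr, dt):
    """Real-valued model monitor: 0.0 when reality matches the ODE
    prediction exactly, growing with the size of the deviation."""
    pos0, vel0, acc0 = prev
    pos1, vel1, _ = curr
    pred_pos = pos0 + vel0 * dt + 0.5 * acc0 * dt * dt
    pred_vel = vel0 + acc0 * dt
    return abs(pos1 - pred_pos) + abs(vel1 - pred_vel)

def shaped_reward(task_reward, prev, curr, dt, weight=1.0):
    """Augment the task reward with a push back toward state space
    where the safety proof applies."""
    return task_reward - weight * monitor_margin(prev, curr, dt)
```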
SLIDE 50 Conclusion
Justified Speculative Control provides the best of logic and learning:
⇨
Policy
φ
SLIDE 51 Conclusion
Justified Speculative Control provides the best of logic and learning:
- Formally model the control system (control + physics)
⇨
Policy
φ
SLIDE 52 Conclusion
Justified Speculative Control provides the best of logic and learning:
- Formally model the control system (control + physics)
- Learn how to resolve non-determinism in models.
⇨
Policy
φ
SLIDE 53 Conclusion
Justified Speculative Control provides the best of logic and learning:
- Formally model the control system (control + physics)
- Learn how to resolve non-determinism in models.
- Leverage theorem proving to transfer proofs to learned policies.
⇨
Policy
φ
SLIDE 54 Conclusion
Justified Speculative Control provides the best of logic and learning:
- Formally model the control system (control + physics)
- Learn how to resolve non-determinism in models.
- Leverage theorem proving to transfer proofs to learned policies.
- Unsafe speculation is justified when model deviates from reality
⇨
Policy
φ
SLIDE 55 Conclusion
Justified Speculative Control provides the best of logic and learning:
- Formally model the control system (control + physics)
- Learn how to resolve non-determinism in models
- Leverage theorem proving to transfer proofs to learned policies
- Unsafe speculation is justified when model deviates from reality,
but verification results can still be helpful!
⇨
Policy
φ
SLIDE 56 Conclusion
Justified Speculative Control provides the best of logic and learning:
- Formally model the control system (control + physics)
- Learn how to resolve non-determinism in models
- Leverage theorem proving to transfer proofs to learned policies
- Unsafe speculation is justified when model deviates from reality,
but verification results can still be helpful!
⇨
Policy
φ
SLIDE 57
SLIDE 58
SLIDE 59
SLIDE 60
Justified Speculative Control
≈
Learn over a constrained action space
≠
SLIDE 61
Justified Speculative Control
≈
Learn over a constrained action space
≠
SLIDE 62 Safe Reinforcement Learning?
⇨
Observe & compute reward
unverified Policy
Policy deviates from model:
- 1. Policy is deterministic, verification result is
set-valued.
{accel,brake,turn}
SLIDE 63 Some Actions Aren’t Always Safe
⇨
Observe & compute reward
unverified Policy
Policy deviates from model:
- 1. Policy is deterministic, verification result is
set-valued. {accel,brake,turn} ≠ ?safeAccel; accel ∪ brake
SLIDE 64 Some Actions Aren’t Always Safe
⇨
Observe & compute reward
unverified Policy
Policy deviates from model:
- 1. Policy is deterministic, verification result is
set-valued. {accel,brake,turn} ≠ ?safeAccel; accel ∪ brake
SLIDE 65 Safe Reinforcement Learning?
⇨
unverified Policy
Policy deviates from model:
- 1. Policy is deterministic, verification result is
set-valued.
Observe & compute reward
?safeAccel; accel ∪ brake ≠
SLIDE 66 Physical Models are Approximations
Policy deviates from model:
- 1. Policy is deterministic, verification result is
set-valued.
- 2. Environment may not be accurately modeled.
⇨
Observe & compute reward
unverified Policy
{accel,brake,turn}
≠ pos’=vel, vel’=acc
SLIDE 67
Safely resolving non-determinism
unverified Policy
?safeAccel; accel ∪ brake ≠
SLIDE 68 Sandboxing Reinforcement Learning
≈
“Accurate modulo determinism”
init → [{ {accel ∪ brake}; t:=0; continuousMotion }*](safe)
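A minimal sketch of the sandbox: during learning, the agent may only choose among actions the monitor admits in the current state, so exploration itself stays inside the verified envelope. `controller_monitor` and the brake fallback are assumptions carried over from the earlier sketches.

```python
import random

def sandboxed_action(state, Q, actions, controller_monitor, eps=0.1, fallback="brake"):
    """Epsilon-greedy action selection restricted to monitor-approved actions."""
    admissible = [a for a in actions if controller_monitor(state, a)]
    if not admissible:
        # In modeled states the proof guarantees some safe action exists;
        # braking is used here as the conservative default.
        admissible = [fallback]
    if random.random() < eps:
        return random.choice(admissible)                 # explore inside the sandbox
    return max(admissible, key=lambda a: Q[(state, a)])  # exploit inside the sandbox
```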
SLIDE 69
Sandboxing Reinforcement Learning
≈
Learn over a constrained action space
“Accurate modulo determinism”
SLIDE 70
Sandboxing Reinforcement Learning
≈
Learn over a constrained action space
“Accurate modulo determinism”
SLIDE 71 Sandboxing Reinforcement Learning
Theorem: If the physical model is accurate then verification results are preserved during learning and by learned policies.
⇨
Policy
Constrained Actions Observe & compute reward
SLIDE 72 Sandboxing Reinforcement Learning
Theorem: If the physical model is accurate then verification results are preserved during learning and by learned policies.
⇨
Observe & compute reward
Policy
Constrained Actions
init → [{ {accel ∪ brake}; t:=0; continuousMotion }*](safe)
SLIDE 73 Sandboxing Reinforcement Learning
Theorem: If the physical model is accurate then verification results are preserved during learning and by learned policies.
⇨
Observe & compute reward
Policy
Constrained Actions
init → [{ {accel ∪ brake}; t:=0; continuousMotion }*](safe)
SLIDE 74 Sandboxing Safe Reinforcement Learning
Theorem: If the physical model is accurate then verification results are preserved by learned policies.
⇨
Observe & compute reward
Policy
Constrained Actions
init → [{ {accel ∪ brake}; t:=0; continuousMotion }*](safe)
SLIDE 75 What About the Physical Model?
Observe & compute reward {brake, accel, turn}
SLIDE 76 What About the Physical Model?
Observe & compute reward {brake, accel, turn}
Model is accurate.
SLIDE 77 What About the Physical Model?
Observe & compute reward {brake, accel, turn}
Model is accurate.
SLIDE 78 What About the Physical Model?
Observe & compute reward {brake, accel, turn}
Model is accurate.
Model is inaccurate.
SLIDE 79 What About the Physical Model?
Observe & compute reward {brake, accel, turn}
Model is accurate.
Model is inaccurate. Obstacle!
SLIDE 80 What About the Physical Model?
Observe & compute reward {brake, accel, turn}
Expected
Reality
SLIDE 81 What About the Physical Model?
Observe & compute reward {brake, accel, turn}
Expected (safe)
Reality (crash!)
SLIDE 82
Justified Speculative Control
≈
Learn over a constrained action space
≠
SLIDE 83
Justified Speculative Control
≈
Learn over a constrained action space
≠
SLIDE 84 Justified Speculative Control
Some Questions:
- 1. How do we know when we’re in unmodeled state space?
- 2. What do we do when we are in unmodeled state space?
Learn over a constrained action space
Learn
SLIDE 85 Justified Speculative Control
Some Questions:
- 1. How do we know when we’re in unmodeled state space?
- 2. What do we do when we are in unmodeled state space?
Learn over a constrained action space
Learn
SLIDE 86 Justified Speculative Control
Theorem: Verification results are preserved outside of the red (unmodeled) region
☒ How do we know when we’re in unmodeled state space?
☐ What do we do when we are in unmodeled state space?
Learn over a constrained action space
Learn
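Putting the two questions together, justified speculative control can be sketched as: use the model monitor to detect unmodeled state space, learn over the constrained action space while the model holds, and speculate over the full action set only once the model demonstrably fails. All names below are carried over from the earlier hypothetical sketches.

```python
import random

def jsc_action(state, prev_obs, curr_obs, dt, Q, actions,
               controller_monitor, model_monitor, eps=0.1):
    """Constrained learning in modeled state space; justified speculation outside."""
    if model_monitor(prev_obs, curr_obs, dt):
        # Modeled state space: the proof's premise holds, so verification
        # results transfer and only monitor-approved actions are allowed.
        candidates = [a for a in actions if controller_monitor(state, a)] or ["brake"]
    else:
        # Unmodeled state space: the premise of the safety proof has failed,
        # so speculating over the full action set is justified.
        candidates = list(actions)
    if random.random() < eps:
        return random.choice(candidates)
    return max(candidates, key=lambda a: Q[(state, a)])
```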
SLIDE 87
What do we do in unmodeled state-space?
SLIDE 88
What do we do in unmodeled state-space?
SLIDE 89
What do we do in unmodeled state-space?
SLIDE 90
What do we do in unmodeled state-space?
Get from here...
SLIDE 91
What do we do in unmodeled state-space?
Get from here...
...to here
SLIDE 92 Leveraging Formal Methods during Learning
Leader
Own Car
SLIDE 93 Leveraging Formal Methods during Learning
Perturbation | “Don’t hit the leader” | “Get back to modeled state space”
5%           | 3                      | 2
25%          | 18                     | 16
50%          | 41                     | 24
Leader
Own Car
SLIDE 94 Conclusion
KeYmaera X + Justified Speculative Control:
- 1. Transfers formal verification results for
non-deterministic control policies to policies obtained via a generic reinforcement learning algorithm.
SLIDE 95 Conclusion
KeYmaera X + Justified Speculative Control:
- 1. Transfers formal verification results for
non-deterministic control policies to policies obtained via a generic reinforcement learning algorithm.
- 2. Leverages insights obtained during verification to direct
future learning.
≠
SLIDE 96 init → [{ {?safeAccel; accel ∪ brake}; t:=0; {pos’=vel,vel’=acc} }*](pos < stopSign)
Model-Based Verification
pos < stopSign
Reinforcement Learning
ctrl