SLIDE 1 Safe Reinforcement Learning via Formal Methods
Nathan Fulton and André Platzer Carnegie Mellon University
SLIDE 2
Safety-Critical Systems
"How can we provide people with cyber-physical systems they can bet their lives on?" - Jeannette Wing
SLIDE 3
Autonomous Safety-Critical Systems
How can we provide people with autonomous cyber-physical systems they can bet their lives on?
SLIDE 4
Model-Based Verification
φ
Reinforcement Learning
SLIDE 5 Model-Based Verification
pos < stopSign
Reinforcement Learning
SLIDE 6 Model-Based Verification
pos < stopSign
Reinforcement Learning
ctrl
SLIDE 7 Approach: prove that control software achieves a specification with respect to a model of the physical system.
Model-Based Verification
pos < stopSign
Reinforcement Learning
ctrl
SLIDE 9 Benefits:
- Strong safety guarantees
- Automated analysis
Model-Based Verification
φ
Reinforcement Learning
SLIDE 10 Benefits:
- Strong safety guarantees
- Automated analysis
Drawbacks:
- Control policies are typically non-deterministic: verification answers “what is safe”, not “what is useful”
Model-Based Verification
φ
Reinforcement Learning
SLIDE 12 Benefits:
- Strong safety guarantees
- Automated analysis
Drawbacks:
- Control policies are typically non-deterministic: verification answers “what is safe”, not “what is useful”
Model-Based Verification
φ
Reinforcement Learning
Observe | Act
SLIDE 13 Benefits:
- Strong safety guarantees
- Automated analysis
Drawbacks:
- Control policies are typically non-deterministic: verification answers “what is safe”, not “what is useful”
Model-Based Verification
φ
Reinforcement Learning
Observe | Act
Benefits:
- No need for a complete model
- Optimal (effective) policies
SLIDE 14 Benefits:
- Strong safety guarantees
- Automated analysis
Drawbacks:
- Control policies are typically non-deterministic: verification answers “what is safe”, not “what is useful”
Model-Based Verification
φ
Reinforcement Learning
Observe | Act
Benefits:
- No need for a complete model
- Optimal (effective) policies
Drawbacks:
- No strong safety guarantees
- Proofs are obtained and checked by hand
- Formal proofs = decades-long proof development
SLIDE 15 Benefits:
- Strong safety guarantees
- Computational aids (ATP)
Drawbacks:
- Control policies are typically non-deterministic: verification answers “what is safe”, not “what is useful”
Model-Based Verification
φ
Reinforcement Learning
Observe | Act
Benefits:
- No need for a complete model
- Optimal (effective) policies
Drawbacks:
- No strong safety guarantees
- Proofs are obtained and checked by hand
- Formal proofs = decades-long proof development
Goal: Provably correct reinforcement learning
SLIDE 16 Benefits:
- Strong safety guarantees
- Computational aids (ATP)
Drawbacks:
- Control policies are typically non-deterministic: verification answers “what is safe”, not “what is useful”
Model-Based Verification
φ
Reinforcement Learning
Observe | Act
Benefits:
- No need for a complete model
- Optimal (effective) policies
Drawbacks:
- No strong safety guarantees
- Proofs are obtained and checked by hand
- Formal proofs = decades-long proof development
Goal: Provably correct reinforcement learning
- 1. Learn safety
- 2. Learn a safe policy
- 3. Justify claims of safety
SLIDE 17
Model-Based Verification
Accurate, analyzable models often exist!
{ {?safeAccel; accel ∪ brake ∪ ?safeTurn; turn}; {pos’ = vel, vel’ = acc} }*
SLIDE 18
Model-Based Verification
Accurate, analyzable models often exist!
{ {?safeAccel; accel ∪ brake ∪ ?safeTurn; turn}; {pos’ = vel, vel’ = acc} }*
discrete control | continuous motion
SLIDE 19
Model-Based Verification
Accurate, analyzable models often exist!
{ {?safeAccel; accel ∪ brake ∪ ?safeTurn; turn}; {pos’ = vel, vel’ = acc} }*
discrete, non-deterministic control | continuous motion
SLIDE 20
Model-Based Verification
Accurate, analyzable models often exist!
init → [{ {?safeAccel; accel ∪ brake ∪ ?safeTurn; turn}; {pos’ = vel, vel’ = acc} }*] pos < stopSign
SLIDE 21
Model-Based Verification
Accurate, analyzable models often exist! Formal verification gives strong safety guarantees.
init → [{ {?safeAccel; accel ∪ brake ∪ ?safeTurn; turn}; {pos’ = vel, vel’ = acc} }*] pos < stopSign
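To make the shape of this model concrete, here is a minimal Python sketch, not the authors' code: the braking-distance guard safe_accel, the parameters A and B, and the Euler step are illustrative assumptions. It shows the loop's structure: non-deterministic discrete control followed by continuous motion, repeated.

```python
import random

def safe_accel(state, A=1.0, B=1.0):
    # Hypothetical guard ?safeAccel: even after accelerating for one
    # control cycle, braking at rate B still stops before the stop sign.
    v = state["vel"] + A
    return state["pos"] + v * v / (2 * B) < state["stopSign"]

def hybrid_step(state, dt=0.1, A=1.0, B=1.0):
    # Discrete control: nondeterministically take any enabled branch
    # of ?safeAccel; accel ∪ brake.
    branches = [-B]                 # brake is always allowed
    if safe_accel(state, A, B):
        branches.append(A)          # accel only behind its guard
    acc = random.choice(branches)
    # Continuous motion: one Euler step of pos' = vel, vel' = acc.
    state["vel"] = max(0.0, state["vel"] + acc * dt)
    state["pos"] += state["vel"] * dt
    return state
```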
SLIDE 22 Model-Based Verification
Accurate, analyzable models often exist! Formal verification gives strong safety guarantees =
- Computer-checked proofs of the safety specification
SLIDE 23 Model-Based Verification
Accurate, analyzable models often exist! Formal verification gives strong safety guarantees =
- Computer-checked proofs of the safety specification
- Formal proofs mapping the model to runtime monitors
SLIDE 24
Model-Based Verification Isn’t Enough
Perfect, analyzable models don’t exist!
SLIDE 25
Model-Based Verification Isn’t Enough
Perfect, analyzable models don’t exist!
{ {?safeAccel; accel ∪ brake ∪ ?safeTurn; turn}; {pos’ = vel, vel’ = acc} }*
How to implement? | Only accurate sometimes
SLIDE 26
Model-Based Verification Isn’t Enough
Perfect, analyzable models don’t exist!
{ {?safeAccel; accel ∪ brake ∪ ?safeTurn; turn}; {dx’ = w*y, dy’ = -w*x, ...} }*
How to implement? | Only accurate sometimes
SLIDE 27 Our Contribution
Justified Speculative Control is an approach toward provably safe reinforcement learning that:
- 1. learns to resolve non-determinism without sacrificing formal safety results
SLIDE 28 Our Contribution
Justified Speculative Control is an approach toward provably safe reinforcement learning that:
- 1. learns to resolve non-determinism without sacrificing formal safety results
- 2. allows and directs speculation whenever model mismatches occur
SLIDE 29 Learning to Resolve Non-determinism
Observe & compute reward | Act
SLIDE 30 Learning to Resolve Non-determinism
Observe & compute reward
accel ∪ brake ∪ turn
SLIDE 31 Learning to Resolve Non-determinism
Observe & compute reward
{accel,brake,turn}
SLIDE 32 Learning to Resolve Non-determinism
⇨
Observe & compute reward
Policy
{accel,brake,turn}
SLIDE 33 Learning to Resolve Non-determinism
⇨
Observe & compute reward
(safe?) Policy
{accel,brake,turn}
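For reference, an ordinary (unmonitored) learning step over this action set might look like the sketch below; the Q-table and the env.step interface are assumptions for illustration, not the paper's implementation.

```python
import random

ACTIONS = ["accel", "brake", "turn"]

def q_step(Q, env, state, alpha=0.1, gamma=0.99, eps=0.1):
    # Epsilon-greedy choice among the branches accel ∪ brake ∪ turn.
    if random.random() < eps:
        action = random.choice(ACTIONS)
    else:
        action = max(ACTIONS, key=lambda a: Q.get((state, a), 0.0))
    next_state, reward = env.step(state, action)  # observe & compute reward
    best_next = max(Q.get((next_state, a), 0.0) for a in ACTIONS)
    old = Q.get((state, action), 0.0)
    Q[(state, action)] = old + alpha * (reward + gamma * best_next - old)
    return next_state
```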
SLIDE 34 Learning to Safely Resolve Non-determinism
⇨
Observe & compute reward
(safe?) Policy
Safety Monitor
SLIDE 35 Learning to Safely Resolve Non-determinism
⇨
Observe & compute reward
(safe?) Policy
Safety Monitor
≠ “Trust Me”
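A minimal sketch of what monitor-gated action selection could look like, assuming a boolean monitor(state, action) extracted from the proof; the fall-back to brake is a hypothetical fail-safe choice.

```python
def monitored_action(Q, monitor, state, actions):
    # Keep only the actions the verified controller monitor approves;
    # the learned policy then optimizes within this provably safe set.
    safe = [a for a in actions if monitor(state, a)]
    if not safe:
        return "brake"  # hypothetical fail-safe fallback action
    return max(safe, key=lambda a: Q.get((state, a), 0.0))
```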
SLIDE 36 Learning to Safely Resolve Non-determinism
⇨
Observe & compute reward
(safe?) Policy
φ
Use a theorem prover to prove: (init→[{{accel∪brake};ODEs}*](safe)) ↔ φ
SLIDE 38 (safe?) Policy
Learning to Safely Resolve Non-determinism
⇨
Observe & compute reward
φ
Main Theorem: If the ODEs are accurate, then our formal proofs transfer from the non-deterministic model to the learned (deterministic) policy.
Use a theorem prover to prove: (init→[{{accel∪brake};ODEs}*](safe)) ↔ φ
SLIDE 39 (safe?) Policy
Learning to Safely Resolve Non-determinism
⇨
Observe & compute reward
φ
Main Theorem: If the ODEs are accurate, then our formal proofs transfer from the non-deterministic model to the learned (deterministic) policy via the model monitor.
Use a theorem prover to prove: (init→[{{accel∪brake};ODEs}*](safe)) ↔ φ
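A sketch of the model monitor's job under illustrative assumptions (a single Euler step of pos' = vel, vel' = acc, and a hypothetical tolerance tol): it checks whether the observed transition could have been produced by the verified model.

```python
def model_monitor(prev, curr, acc, dt, tol=1e-2):
    # Compare the observed next state against the ODE's one-step
    # prediction; True means the transition is explained by the model.
    pred_vel = prev["vel"] + acc * dt           # from vel' = acc
    pred_pos = prev["pos"] + prev["vel"] * dt   # from pos' = vel
    return (abs(curr["pos"] - pred_pos) <= tol and
            abs(curr["vel"] - pred_vel) <= tol)
```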
SLIDE 40 What About the Physical Model?
⇨
Observe & compute reward
φ
Use a theorem prover to prove: (init→[{{accel∪brake};ODEs}*](safe)) ↔ φ
(safe?) Policy: {pos’ = vel, vel’ = acc} ≠ reality?
SLIDE 41 What About the Physical Model?
Observe & compute reward | {brake, accel, turn}
SLIDE 42 What About the Physical Model?
Observe & compute reward | {brake, accel, turn}
Model is accurate.
SLIDE 44 What About the Physical Model?
Observe & compute reward | {brake, accel, turn}
Model is accurate.
Model is inaccurate.
SLIDE 45 What About the Physical Model?
Observe & compute reward | {brake, accel, turn}
Model is accurate.
Model is inaccurate. Obstacle!
SLIDE 46 What About the Physical Model?
Observe & compute reward | {brake, accel, turn}
Expected | Reality
SLIDE 47 Speculation is Justified
Observe & compute reward | {brake, accel, turn}
Expected (safe) | Reality (crash!)
SLIDE 48 Leveraging Verification Results to Learn Better
Observe & compute reward | {brake, accel, turn}
Use a real-valued version of the model monitor as a reward signal
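A sketch of this idea, reusing the illustrative one-step prediction error from the model-monitor sketch above; the additive shaping term and its weight are assumptions, not the paper's exact construction.

```python
def monitor_margin(prev, curr, acc, dt):
    # Real-valued analogue of the boolean model monitor: how far the
    # observed transition strays from the model's one-step prediction.
    err_pos = abs(curr["pos"] - (prev["pos"] + prev["vel"] * dt))
    err_vel = abs(curr["vel"] - (prev["vel"] + acc * dt))
    return err_pos + err_vel

def shaped_reward(task_reward, prev, curr, acc, dt, weight=1.0):
    # Penalize off-model behavior so a speculating learner is steered
    # back toward states where the safety proof still applies.
    return task_reward - weight * monitor_margin(prev, curr, acc, dt)
```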
SLIDE 49 Conclusion
Justified Speculative Control provides the best of logic and learning:
⇨
Policy
φ
SLIDE 50 Conclusion
Justified Speculative Control provides the best of logic and learning:
- Formally model the control system (control + physics)
⇨
Policy
φ
SLIDE 51 Conclusion
Justified Speculative Control provides the best of logic and learning:
- Formally model the control system (control + physics)
- Learn how to resolve non-determinism in models.
⇨
Policy
φ
SLIDE 52 Conclusion
Justified Speculative Control provides the best of logic and learning:
- Formally model the control system (control + physics)
- Learn how to resolve non-determinism in models.
- Leverage theorem proving to transfer proofs to learned policies.
⇨
Policy
φ
SLIDE 53 Conclusion
Justified Speculative Control provides the best of logic and learning:
- Formally model the control system (control + physics)
- Learn how to resolve non-determinism in models.
- Leverage theorem proving to transfer proofs to learned policies.
- Unsafe speculation is justified when the model deviates from reality
⇨
Policy
φ
SLIDE 54 Conclusion
Justified Speculative Control provides the best of logic and learning:
- Formally model the control system (control + physics)
- Learn how to resolve non-determinism in models
- Leverage theorem proving to transfer proofs to learned policies
- Unsafe speculation is justified when the model deviates from reality,
but verification results can still be helpful!
⇨
Policy
φ