SLIDE 1 Safe AI for CPS
André Platzer, Carnegie Mellon University. Joint work with Nathan Fulton.
SLIDE 2
Safety-Critical Systems
"How can we provide people with cyber-physical systems they can bet their lives on?" - Jeannette Wing
SLIDE 3
Safety-Critical Systems
"How can we provide people with cyber-physical systems they can bet their lives on?" - Jeannette Wing
SLIDE 4 This Talk
Ensure the safety of autonomous cyber-physical systems. Best of both worlds: learning together with CPS safety.
- Flexibility of learning
- Guarantees of CPS formal methods
Diametrically opposed: flexibility + adaptability versus predictability + simplicity
- 1. Cyber-Physical Systems with Differential Dynamic Logic
- 2. Sandboxed reinforcement learning is provably safe
- 3. Model-update learning addresses uncertainty with multiple models
SLIDE 5 Airborne Collision Avoidance System ACAS X
- Developed by the FAA to replace the current TCAS in aircraft
- Approximately optimizes an MDP on a grid
- Advisory from lookup tables with 5D interpolation regions
- Identified a safe region per advisory and proved it in KeYmaera X
[Diagram: sequence of advisories with pilot delay]
STTT'17
SLIDE 6 Comparison: ACAS X issues DNC
But CL1500 or no change would not lead to a collision.
[Plot: altitude h (ft) vs. time to crossing (s); intruder path, own path with DNC (following the advisory) vs. path without DNC (no change)]
SLIDE 7
Model-Based Verification
φ
Reinforcement Learning
SLIDE 8 Model-Based Verification
pos < stopSign
Reinforcement Learning
SLIDE 9 Model-Based Verification
pos < stopSign
Reinforcement Learning
ctrl
SLIDE 10 Approach: prove that control software achieves a specification with respect to a model of the physical system.
Model-Based Verification
pos < stopSign
Reinforcement Learning
ctrl
SLIDE 11 Approach: prove that control software achieves a specification with respect to a model of the physical system.
Model-Based Verification
pos < stopSign
Reinforcement Learning
ctrl
SLIDE 12 Benefits:
- Strong safety guarantees
- Automated analysis
Model-Based Verification
φ
Reinforcement Learning
SLIDE 13 Benefits:
- Strong safety guarantees
- Automated analysis
Drawbacks:
- Control policies are typically non-deterministic: they answer “what is safe”, not “what is useful”
Model-Based Verification
φ
Reinforcement Learning
SLIDE 14 Benefits:
- Strong safety guarantees
- Automated analysis
Drawbacks:
- Control policies are typically non-deterministic: they answer “what is safe”, not “what is useful”
Model-Based Verification
φ
Reinforcement Learning
SLIDE 15 Benefits:
- Strong safety guarantees
- Automated analysis
Drawbacks:
- Control policies are typically non-deterministic: they answer “what is safe”, not “what is useful”
Model-Based Verification
φ
Reinforcement Learning
Observe Act
SLIDE 16 Benefits:
- Strong safety guarantees
- Automated analysis
Drawbacks:
- Control policies are typically non-deterministic: they answer “what is safe”, not “what is useful”
Model-Based Verification
φ
Reinforcement Learning
Observe Act
Benefits:
- No need for complete model
- Optimal (effective) policies
SLIDE 17 Benefits:
- Strong safety guarantees
- Automated analysis
Drawbacks:
- Control policies are typically non-deterministic: they answer “what is safe”, not “what is useful”
Model-Based Verification
φ
Reinforcement Learning
Observe Act
Benefits:
- No need for complete model
- Optimal (effective) policies
Drawbacks:
- No strong safety guarantees
- Proofs are obtained and checked by hand
- Formal proofs = decades-long proof development
SLIDE 18 Benefits:
- Strong safety guarantees
- Computational aids (ATP)
Drawbacks:
- Control policies are typically non-deterministic: they answer “what is safe”, not “what is useful”
Model-Based Verification
φ
Reinforcement Learning
Observe Act
Benefits:
- No need for complete model
- Optimal (effective) policies
Drawbacks:
- No strong safety guarantees
- Proofs are obtained and checked by hand
- Formal proofs = decades-long proof development
Goal: Provably correct reinforcement learning
SLIDE 19 Benefits:
- Strong safety guarantees
- Computational aids (ATP)
Drawbacks:
- Control policies are typically non-deterministic: they answer “what is safe”, not “what is useful”
Model-Based Verification
φ
Reinforcement Learning
Observe Act
Benefits:
- No need for complete model
- Optimal (effective) policies
Drawbacks:
- No strong safety guarantees
- Proofs are obtained and checked by hand
- Formal proofs = decades-long proof development
Goal: Provably correct reinforcement learning
- 1. Learn Safety
- 2. Learn a Safe Policy
- 3. Justify claims of safety
SLIDE 20
Part I: Differential Dynamic Logic
Trustworthy Proofs for Hybrid Systems
SLIDE 21 Hybrid Programs
x := t (changes the state x=x0, y=y0, z=z0, … to x=t, y=y0, z=z0, …)
SLIDE 22 Hybrid Programs
a;b (runs a, then b)
x := t (changes the state x=x0, y=y0, z=z0, … to x=t, y=y0, z=z0, …)
SLIDE 23 Hybrid Programs
?P (if P is true: no change; if P is false: terminate)
a;b (runs a, then b)
x := t (changes the state x=x0, y=y0, z=z0, … to x=t, y=y0, z=z0, …)
SLIDE 24 Hybrid Programs
?P (if P is true: no change; if P is false: terminate)
a;b (runs a, then b)
a* (repeats a any number of times)
x := t (changes the state x=x0, y=y0, z=z0, … to x=t, y=y0, z=z0, …)
SLIDE 25 Hybrid Programs
a∪b (nondeterministic choice between a and b)
?P (if P is true: no change; if P is false: terminate)
a;b (runs a, then b)
a* (repeats a any number of times)
x := t (changes the state x=x0, y=y0, z=z0, … to x=t, y=y0, z=z0, …)
SLIDE 26 Hybrid Programs
a∪b (nondeterministic choice between a and b)
?P (if P is true: no change; if P is false: terminate)
a;b (runs a, then b)
a* (repeats a any number of times)
x’=f(x) (continuous evolution: x follows the ODE solution F, passing x=F(0), …, x=F(T), for some duration T)
x := t (changes the state x=x0, y=y0, z=z0, … to x=t, y=y0, z=z0, …)
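Collected in one place, the hybrid program constructs from this build-up (the evolution-domain constraint & Q appears in the models and axioms on later slides):

```latex
a, b \;::=\; x := t \;\mid\; ?P \;\mid\; a;\,b \;\mid\; a \cup b \;\mid\; a^{*} \;\mid\; \{x' = f(x) \;\&\; Q\}
```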
SLIDE 27 Approaching a Stopped Car
Is this property true?
Stopped Car Own Car
[ { {accel ∪ brake}; t:=0; {pos’=vel,vel’=accel,t’=1 & vel≥0 & t≤T} }* ](pos ≤ stoppedCarPos)
SLIDE 28 Approaching a Stopped Car
Assuming we only accelerate when it’s safe to do so, is this property true?
Stopped Car Own Car
[ { {accel ∪ brake}; t:=0; {pos’=vel,vel’=accel,t’=1 & vel≥0 & t≤T} }* ](pos ≤ stoppedCarPos)
SLIDE 29 Approaching a Stopped Car
Stopped Car Own Car safeDistance(pos,vel,stoppedCarPos,B)
If we also assume the system is safe initially:
safeDistance(pos,vel,stoppedCarPos,B) → [ { {accel ∪ brake}; t:=0; {pos’=vel,vel’=accel,t’=1 & vel≥0 & t≤T} }* ](pos ≤ stoppedCarPos)
SLIDE 30 safeDistance(pos,vel,stoppedCarPos,B) → [ { {accel ∪ brake}; t:=0; {pos’=vel,vel’=accel,t’=1 & vel≥0 & t≤T} }* ](pos ≤ stoppedCarPos)
Approaching a Stopped Car
Stopped Car Own Car safeDistance(pos,vel,stoppedCarPos,B)
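For concreteness, a plausible instantiation of safeDistance (an assumed instantiation; the slides keep it symbolic): the car must still be able to brake to a standstill before the stopped car, given braking rate B > 0:

```latex
\mathit{safeDistance}(\mathit{pos}, \mathit{vel}, \mathit{stoppedCarPos}, B)
\;\equiv\;
\mathit{pos} + \frac{\mathit{vel}^2}{2B} \;\le\; \mathit{stoppedCarPos}
```

The “only accelerate when it’s safe” assumption from the previous slide would then additionally budget for up to T seconds of acceleration before braking can begin.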
SLIDE 31
The Fundamental Question
Proofs give strong mathematical evidence of safety. Why would our program not work if we have a proof?
SLIDE 32 The Fundamental Question
Why would our program not work if we have a proof?
- 1. Was the proof correct?
SLIDE 33 The Fundamental Question
Why would our program not work if we have a proof?
- 1. Was the proof correct?
- 2. Was the model accurate enough? (model ≠ reality)
SLIDE 34 The Fundamental Question
Why would our program not work if we have a proof?
- 1. Was the proof correct? KeYmaera X
- 2. Was the model accurate enough?
DI Axiom: [{x'=f&Q}]P ↔ ([?Q]P ← (Q → [{x'=f&Q}](P)'))
Example: [v'=r·p·v²-g, t'=1] v ≥ v₀ - g·t ↔ … ↔ [v':=r·p·v²-g][t':=1] v' ≥ -g·t' ↔ r·p·v²-g ≥ -g ↔ H → r·p ≥ 0
Side derivation: (v ≥ v₀ - g·t)' ↔ … ↔ … ↔ …
dI Tactic: H = r·p ≥ 0 & r·a ≥ 0 & g > 0 & …
[Diagram: Axioms at the base, the KeYmaera X (KyX) core, and QED, supported by ODE & controls tooling and clever Bellerophon programs]
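Spelled out, the differential-induction step in the example above (a reconstruction of the slide fragments):

```latex
\begin{aligned}
&[\{v' = r\,p\,v^2 - g,\ t' = 1\}]\ (v \ge v_0 - g\,t) \\
&\quad \xleftarrow{\ \mathrm{dI}\ }\ [v' := r\,p\,v^2 - g][t' := 1]\ (v' \ge -g\,t')
\ \leftrightarrow\ r\,p\,v^2 - g \ge -g
\ \leftrightarrow\ r\,p\,v^2 \ge 0
\end{aligned}
```

The final arithmetic fact follows from the assumptions H, in particular r·p ≥ 0, since v² is never negative.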
SLIDE 35 The Fundamental Question
Why would our program not work if we have a proof?
- 1. Was the proof correct? KeYmaera X
- 2. Was the model accurate enough? Safe RL
DI Axiom: [{x'=f&Q}]P ↔ ([?Q]P ← (Q → [{x'=f&Q}](P)'))
Example: [v'=r·p·v²-g, t'=1] v ≥ v₀ - g·t ↔ … ↔ [v':=r·p·v²-g][t':=1] v' ≥ -g·t' ↔ r·p·v²-g ≥ -g ↔ H → r·p ≥ 0
Side derivation: (v ≥ v₀ - g·t)' ↔ … ↔ … ↔ …
dI Tactic: H = r·p ≥ 0 & r·a ≥ 0 & g > 0 & …
[Diagram: Axioms at the base, the KeYmaera X (KyX) core, and QED, supported by ODE & controls tooling and clever Bellerophon programs]
SLIDE 36 Part II: Justified Speculative Control
Safe reinforcement learning in partially modeled environments
AAAI 2018
SLIDE 37
Model-Based Verification
Accurate, analyzable models often exist!
{ {?safeAccel;accel ∪ brake ∪ ?safeTurn; turn}; {pos’ = vel, vel’ = acc} }*
SLIDE 38
Model-Based Verification
Accurate, analyzable models often exist!
{ {?safeAccel;accel ∪ brake ∪ ?safeTurn; turn}; {pos’ = vel, vel’ = acc} }*
discrete control; continuous motion
SLIDE 39
Model-Based Verification
Accurate, analyzable models often exist!
{ {?safeAccel;accel ∪ brake ∪ ?safeTurn; turn}; {pos’ = vel, vel’ = acc} }*
discrete, non-deterministic control; continuous motion
SLIDE 40
Model-Based Verification
Accurate, analyzable models often exist!
init → [{ { ?safeAccel;accel ∪ brake ∪ ?safeTurn; turn}; {pos’ = vel, vel’ = acc} }*]pos < stopSign
SLIDE 41
Model-Based Verification
Accurate, analyzable models often exist! Formal verification gives strong safety guarantees.
init → [{ { ?safeAccel;accel ∪ brake ∪ ?safeTurn; turn}; {pos’ = vel, vel’ = acc} }*]pos < stopSign
SLIDE 42 Model-Based Verification
Accurate, analyzable models often exist! Formal verification gives strong safety guarantees.
- Computer-checked proofs of safety specifications
SLIDE 43 Model-Based Verification
Accurate, analyzable models often exist! Formal verification gives strong safety guarantees.
- Computer-checked proofs of safety specifications
- Formal proofs mapping the model to runtime monitors
SLIDE 44
Model-Based Verification Isn’t Enough
Perfect, analyzable models don’t exist!
SLIDE 45
Model-Based Verification Isn’t Enough
Perfect, analyzable models don’t exist!
{ { ?safeAccel;accel ∪ brake ∪ ?safeTurn; turn}; {pos’ = vel, vel’ = acc} }*
How to implement? Only accurate sometimes
SLIDE 46
Model-Based Verification Isn’t Enough
Perfect, analyzable models don’t exist!
{ { ?safeAccel;accel ∪ brake ∪ ?safeTurn; turn}; {dx’=w*y, dy’=-w*x, ...} }*
How to implement? Only accurate sometimes
SLIDE 47 Safe RL Contribution
Justified Speculative Control is an approach toward provably safe reinforcement learning that:
- 1. learns to resolve nondeterminism without sacrificing formal safety results
SLIDE 48 Safe RL Contribution
Justified Speculative Control is an approach toward provably safe reinforcement learning that:
- 1. learns to resolve nondeterminism without sacrificing formal safety results
- 2. allows and directs speculation whenever model mismatches occur
SLIDE 49 Learning to Resolve Non-determinism
Observe & compute reward Act
SLIDE 50 Learning to Resolve Non-determinism
Observe & compute reward
accel ∪ brake ∪ turn
SLIDE 51 Learning to Resolve Non-determinism
Observe & compute reward
{accel,brake,turn}
SLIDE 52 Learning to Resolve Non-determinism
⇨
Observe & compute reward
Policy
{accel,brake,turn}
SLIDE 53 Learning to Resolve Non-determinism
⇨
Observe & compute reward
(safe?) Policy
{accel,brake,turn}
SLIDE 54 Learning to Safely Resolve Non-determinism
⇨
Observe & compute reward
(safe?) Policy
Safety Monitor
Useful to stay safe during learning. Crucial after deployment.
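A minimal sketch of what the safety monitor does to the learner's choice (names such as safety_monitor and the braking fallback are illustrative assumptions, not the paper's API):

```python
def sandboxed_action(state, actions, q_value, safety_monitor):
    """Pick the best-valued action among those the verified monitor accepts.

    safety_monitor(state, action) stands in for the controller monitor
    extracted from the proof; q_value(state, action) is the learner's estimate.
    """
    safe_actions = [a for a in actions if safety_monitor(state, a)]
    if not safe_actions:
        # Assumed verified fallback: in the car models, braking is always safe.
        return "brake"
    return max(safe_actions, key=lambda a: q_value(state, a))
```

For instance, sandboxed_action(s, ["accel", "brake", "turn"], Q, ctrl_monitor) lets the learner optimize freely inside the set of provably safe actions.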
SLIDE 55 Learning to Safely Resolve Non-determinism
⇨
Observe & compute reward
(safe?) Policy
Safety Monitor ≠ “Trust Me”
SLIDE 56 Learning to Safely Resolve Non-determinism
⇨
Observe & compute reward
(safe?) Policy
φ
Use a theorem prover to extract: (init→[{{accel∪brake};ODEs}*](safe))
φ
SLIDE 57 Learning to Safely Resolve Non-determinism
⇨
Observe & compute reward
(safe?) Policy
φ
Use a theorem prover to extract: (init→[{{accel∪brake};ODEs}*](safe))
φ
SLIDE 58 (safe?) Policy
Learning to Safely Resolve Non-determinism
⇨
Observe & compute reward
φ
Main Theorem: If the ODEs are accurate, then
- Our formal proofs transfer from the non-deterministic model to the learned (deterministic) policy
Use a theorem prover to extract: (init→[{{accel∪brake};ODEs}*](safe))
φ
SLIDE 59 (safe?) Policy
Learning to Safely Resolve Non-determinism
⇨
Observe & compute reward
φ
Main Theorem: If the ODEs are accurate, then
- Our formal proofs transfer from the non-deterministic model to the learned (deterministic) policy via the model monitor.
Use a theorem prover to extract: (init→[{{accel∪brake};ODEs}*](safe))
φ
SLIDE 60 What about the physical model?
⇨
Observe & compute reward
φ
Use a theorem prover to extract: (init→[{{accel∪brake};ODEs}*](safe))
φ
(safe?) Policy
{pos’=vel, vel’=acc} ≠ the real physical system
SLIDE 61 What About the Physical Model?
Observe & compute reward {brake, accel, turn}
SLIDE 62 What About the Physical Model?
Observe & compute reward {brake, accel, turn}
Model is accurate.
SLIDE 63 What About the Physical Model?
Observe & compute reward {brake, accel, turn}
Model is accurate.
SLIDE 64 What About the Physical Model?
Observe & compute reward {brake, accel, turn}
Model is accurate.
Model is inaccurate.
SLIDE 65 What About the Physical Model?
Observe & compute reward {brake, accel, turn}
Model is accurate.
Model is inaccurate: obstacle!
SLIDE 66 What About the Physical Model?
Observe & compute reward {brake, accel, turn}
Expected vs. Reality
SLIDE 67 Speculation is Justified
Observe & compute reward {brake, accel, turn}
Expected (safe) vs. Reality (crash!)
SLIDE 68 Leveraging Verification Results to Learn Better
Observe & compute reward {brake, accel, turn}
Use a real-valued version of the model monitor as a reward signal
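One way to read this as code (a sketch; the combination rule and the model_margin name are assumptions): keep the ordinary task reward while the observed transition fits the verified model, and otherwise use the real-valued monitor margin so speculation is steered back toward modeled states.

```python
def jsc_reward(task_reward, model_margin):
    """Combine the environment reward with the quantitative model monitor.

    model_margin >= 0 means the observed transition is explained by the
    verified model; negative values measure how far off-model we are.
    """
    if model_margin >= 0:
        return task_reward   # on-model: learn the task as usual
    return model_margin      # off-model: larger margin = closer to the model
```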
SLIDE 69
Safe RL: How?
Details: ☐ Detect modeled vs unmodeled state space correctly at runtime. ☐ Convert monitors into reward signals
SLIDE 70 Detecting Unmodeled State Space
The ModelPlex algorithm, implemented using Bellerophon, generates verified runtime monitors.
[x:=t]f(x) ↔ f(t)
[a;b]P ↔ [a][b]P
[a∪b]P ↔ ([a]P & [b]P)
[x’=f&Q]P → (Q → P)
...
[Stack diagram: Axiom Base; KeYmaera X Core; Q.E.D.; Programming Languages; Standard Library; ModelPlex]
SLIDE 71 Detecting Unmodeled State Space
oldPos := read_sensor(GPS)
actuate(accel)
newPos := read_sensor(GPS)
if (∃t. model_after(t) == newPos):
    # No model deviation.
else:
    # Model deviation…?
SLIDE 72 Detecting Unmodeled State Space
oldPos := read_sensor(GPS)
actuate(accel)
newPos := read_sensor(GPS)
if (∃t. model_after(t) == newPos):
    # No model deviation.
else:
    # Model deviation…?
SLIDE 73 Detecting Unmodeled State Space
oldPos := read_sensor(GPS)
actuate(accel)
newPos := read_sensor(GPS)
if (QE(∃t. model_after(t) == newPos)):
    # No model deviation.
else:
    # Model deviation…?
SLIDE 74
Safe RL: How?
Details: Runtime monitoring separates modeled from unmodeled state space. ☐ Convert monitors into reward signals
SLIDE 75
Safe RL: How?
Details: Runtime monitoring separates modeled from unmodeled state space. ☐ Convert monitors into reward signals: (ℝⁿ→𝔹) → (ℝⁿ→ℝ)!?
SLIDE 76
An Example
init → [{ {?safeAccel;accel ∪ brake ∪ ?safeMaintain; maintainVel}; {pos’ = vel, vel’ = acc, t’=1} }*]safe
SLIDE 77 An Example Monitor
init → [{ {?safeAccel;accel ∪ brake ∪ ?safeMaintain; maintainVel}; {pos’ = vel, vel’ = acc, t’=1} }*]safe
(t_post ≥ 0 ∧ a_post = acc ∧ v_post = acc·t_post + v ∧ p_post = acc·t_post²/2 + v·t_post + p) ∨ (t_post ≥ 0 ∧ a_post = 0 ∧ v_post = v ∧ p_post = v·t_post + p) ∨ etc.
SLIDE 78 An Example Monitor
init → [{ {?safeAccel;accel ∪ brake ∪ ?safeMaintain; maintainVel}; {pos’ = vel, vel’ = acc, t’=1} }*]safe
(t_post ≥ 0 ∧ a_post = acc ∧ v_post = acc·t_post + v ∧ p_post = acc·t_post²/2 + v·t_post + p) ∨ (t_post ≥ 0 ∧ a_post = 0 ∧ v_post = v ∧ p_post = v·t_post + p) ∨ etc.
SLIDE 79 An Example Monitor
init → [{ {?safeAccel;accel ∪ brake ∪ ?safeMaintain; maintainVel}; {pos’ = vel, vel’ = acc, t’=1} }*]safe
(t_post ≥ 0 ∧ a_post = acc ∧ v_post = acc·t_post + v ∧ p_post = acc·t_post²/2 + v·t_post + p) ∨ (t_post ≥ 0 ∧ a_post = 0 ∧ v_post = v ∧ p_post = v·t_post + p) ∨ etc.
SLIDE 80 An Example Monitor
init → [{ {?safeAccel;accel ∪ brake ∪ ?safeMaintain; maintainVel}; {pos’ = vel, vel’ = acc, t’=1} }*]safe
(t_post ≥ 0 ∧ a_post = acc ∧ v_post = acc·t_post + v ∧ p_post = acc·t_post²/2 + v·t_post + p) ∨ (t_post ≥ 0 ∧ a_post = 0 ∧ v_post = v ∧ p_post = v·t_post + p) ∨ etc.
- QE for real-closed fields
- ODE solutions backed by proofs
Quantitative monitor as reward signal
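As a sketch of the 𝔹-to-ℝ conversion on the monitor above (a rendering of the idea, not ModelPlex's actual output): conjunction becomes min, disjunction becomes max, s ≥ t becomes the margin s - t, and an equation contributes -|lhs - rhs|, so the result is nonnegative exactly when the boolean monitor holds.

```python
def monitor_margin(t_post, a_post, v_post, p_post, acc, v, p):
    """Real-valued version of the two monitor branches shown above."""
    def eq(lhs, rhs):
        return -abs(lhs - rhs)  # 0 iff equal, negative otherwise

    accel_branch = min(
        t_post,                                           # t_post >= 0
        eq(a_post, acc),                                  # a_post = acc
        eq(v_post, acc * t_post + v),                     # v_post = acc*t_post + v
        eq(p_post, acc * t_post**2 / 2 + v * t_post + p)  # position branch
    )
    maintain_branch = min(
        t_post,
        eq(a_post, 0.0),
        eq(v_post, v),
        eq(p_post, v * t_post + p)
    )
    return max(accel_branch, maintain_branch)
```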
SLIDE 81
Safe RL: How?
Details: Runtime monitoring separates modeled from unmodeled state space. Convert monitors into gradients: (ℝⁿ→𝔹) → (ℝⁿ→ℝ)
SLIDE 82
Safe RL: How?
Details: Runtime monitoring separates modeled from unmodeled state space. Convert models into gradients: ModelPlex (ℝⁿ→𝔹) → (ℝⁿ→ℝ)
SLIDE 83 Learning to Safely Resolve Non-determinism
⇨
Observe & compute reward
(safe?) Policy
φ
Use a theorem prover to extract: (init→[{{accel∪brake};ODEs}*](safe))
φ
SLIDE 84 Learning to Safely Handle Multiple Models
⇨
Observe & compute reward
(safe?) Policy
(init→[{ctrlᵢ;ODEᵢ}*](safe)) φᵢ
SLIDE 85 Learning to Safely Handle Multiple Models
⇨
Observe & compute reward
(safe?) Policy
φ₁, φ₂, …, φₙ
(init→[{ctrlᵢ;ODEᵢ}*](safe)) φᵢ
SLIDE 86 Learning to Safely Handle Multiple Models
⇨
Observe & compute reward
(safe?) Policy
φ₁, φ₂, …, φₙ
(init→[{ctrlᵢ;ODEᵢ}*](safe)) φᵢ
Monitor Conjunction
SLIDE 87 Learning to Safely Handle Multiple Models
⇨
Observe & compute reward
(safe?) Policy
φ₁, φ₂, …, φₙ
(init→[{ctrlᵢ;ODEᵢ}*](safe)) φᵢ
Differentiating Experiment
SLIDE 88 Learning to Safely Handle Multiple Models
⇨
Observe & compute reward
(safe?) Policy
(init→[{ctrlᵢ;ODEᵢ}*](safe)) φᵢ
φ₂, …, φₙ
Differentiating Experiment
SLIDE 89 Learning to Safely Handle Multiple Models
⇨
Observe & compute reward
(safe?) Policy
(init→[{ctrlᵢ;ODEᵢ}*](safe)) φᵢ
φₙ
Differentiating Experiment
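A sketch of the multiple-model loop these slides describe (a reading of the slides; the dictionary layout and monitor signatures are illustrative): an action counts as safe only if every surviving model's monitor accepts it, and observed transitions act as differentiating experiments that eliminate models whose monitors they falsify.

```python
def action_is_safe(models, state, action):
    """Monitor conjunction: safe only if all candidate models agree."""
    return all(m["ctrl_monitor"](state, action) for m in models.values())

def eliminate_models(models, state, action, next_state):
    """Keep only models whose model monitor explains the observed transition."""
    return {mid: m for mid, m in models.items()
            if m["model_monitor"](state, action, next_state)}
```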
SLIDE 90 Conclusion
KeYmaera X + Justified Speculative Control provide strong safety guarantees for learning-enabled CPS.
- 1. Was the proof correct?
- 2. Was the model accurate enough? (model ≠ reality)
SLIDE 91 Conclusion
DI Axiom: [{x'=f&Q}]P ↔ ([?Q]P ← (Q → [{x'=f&Q}](P)'))
Example: [v'=r·p·v²-g, t'=1] v ≥ v₀ - g·t ↔ … ↔ [v':=r·p·v²-g][t':=1] v' ≥ -g·t' ↔ r·p·v²-g ≥ -g ↔ H → r·p ≥ 0
Side derivation: (v ≥ v₀ - g·t)' ↔ … ↔ … ↔ …
dI Tactic: H = r·p ≥ 0 & r·a ≥ 0 & g > 0 & …
[Diagram: Axioms at the base, the KeYmaera X (KyX) core, and QED, supported by ODE & controls tooling and clever Bellerophon programs]
KeYmaera X + Justified Speculative Control provide strong safety guarantees for learning-enabled CPS.
- 1. Was the proof correct? KeYmaera X
- 2. Was the model accurate enough?
SLIDE 92 Conclusion
KeYmaera X + Justified Speculative Control provide strong safety guarantees for learning-enabled CPS.
- 1. Was the proof correct? KeYmaera X
- 2. Was the model accurate enough? Justified Speculation
SLIDE 93 Conclusion
KeYmaera X + Justified Speculative Control provide strong safety guarantees for learning-enabled CPS.
- 1. Was the proof correct? KeYmaera X
- 2. Was the model accurate enough? Justified Speculation
- 3. With multiple possible models? µ-learning
- 4. When off-model? Verification-preserving model update
SLIDE 94
Acknowledgments
SLIDE 95