SLIDE 1 Safe Reinforcement Learning via Formal Methods
Nathan Fulton and André Platzer Carnegie Mellon University
SLIDE 2
Safety-Critical Systems
"How can we provide people with cyber-physical systems they can bet their lives on?" - Jeannette Wing
SLIDE 3
Autonomous Safety-Critical Systems
How can we provide people with autonomous cyber-physical systems they can bet their lives on?
SLIDE 4
Model-Based Verification
φ
Reinforcement Learning
SLIDE 5 Model-Based Verification
pos < stopSign
Reinforcement Learning
SLIDE 6 Model-Based Verification
pos < stopSign
Reinforcement Learning
ctrl
SLIDE 7 Approach: prove that control software achieves a specification with respect to a model of the physical system.
Model-Based Verification
pos < stopSign
Reinforcement Learning
ctrl
SLIDE 9 Benefits:
- Strong safety guarantees
- Automated analysis
Model-Based Verification
φ
Reinforcement Learning
SLIDE 10 Benefits:
- Strong safety guarantees
- Automated analysis
Drawbacks:
- Control policies are typically non-deterministic: verification answers “what is safe”, not “what is useful”
Model-Based Verification
φ
Reinforcement Learning
SLIDE 12 Benefits:
- Strong safety guarantees
- Automated analysis
Drawbacks:
- Control policies are typically non-deterministic: verification answers “what is safe”, not “what is useful”
Model-Based Verification
φ
Reinforcement Learning
Observe | Act
SLIDE 13 Benefits:
- Strong safety guarantees
- Automated analysis
Drawbacks:
- Control policies are typically non-deterministic: verification answers “what is safe”, not “what is useful”
Model-Based Verification
φ
Reinforcement Learning
Observe | Act
Benefits:
- No need for a complete model
- Optimal (effective) policies
SLIDE 14 Benefits:
- Strong safety guarantees
- Automated analysis
Drawbacks:
- Control policies are typically non-deterministic: verification answers “what is safe”, not “what is useful”
Model-Based Verification
φ
Reinforcement Learning
Observe | Act
Benefits:
- No need for a complete model
- Optimal (effective) policies
Drawbacks:
- No strong safety guarantees
- Proofs are obtained and checked by hand
- Formal proofs = decades-long proof development
SLIDE 15 Benefits:
- Strong safety guarantees
- Computational aids (ATP)
Drawbacks:
- Control policies are typically non-deterministic: verification answers “what is safe”, not “what is useful”
Model-Based Verification
φ
Reinforcement Learning
Observe | Act
Benefits:
- No need for a complete model
- Optimal (effective) policies
Drawbacks:
- No strong safety guarantees
- Proofs are obtained and checked by hand
- Formal proofs = decades-long proof development
Goal: Provably correct reinforcement learning
SLIDE 16 Benefits:
- Strong safety guarantees
- Computational aids (ATP)
Drawbacks:
- Control policies are typically non-deterministic: verification answers “what is safe”, not “what is useful”
Model-Based Verification
φ
Reinforcement Learning
Observe | Act
Benefits:
- No need for a complete model
- Optimal (effective) policies
Drawbacks:
- No strong safety guarantees
- Proofs are obtained and checked by hand
- Formal proofs = decades-long proof development
Goal: Provably correct reinforcement learning
- 1. Learn safety
- 2. Learn a safe policy
- 3. Justify claims of safety
SLIDE 17
Model-Based Verification
Accurate, analyzable models often exist!
{ {?safeAccel; accel ∪ brake ∪ ?safeTurn; turn}; {pos’ = vel, vel’ = acc} }*
SLIDE 18
Model-Based Verification
Accurate, analyzable models often exist!
{ {?safeAccel; accel ∪ brake ∪ ?safeTurn; turn}; {pos’ = vel, vel’ = acc} }*
discrete control | continuous motion
SLIDE 19
Model-Based Verification
Accurate, analyzable models often exist!
{ {?safeAccel; accel ∪ brake ∪ ?safeTurn; turn}; {pos’ = vel, vel’ = acc} }*
discrete, non-deterministic control | continuous motion
SLIDE 20
Model-Based Verification
Accurate, analyzable models often exist!
init → [{ {?safeAccel; accel ∪ brake ∪ ?safeTurn; turn}; {pos’ = vel, vel’ = acc} }*] pos < stopSign
SLIDE 21
Model-Based Verification
Accurate, analyzable models often exist! Formal verification gives strong safety guarantees.
init → [{ {?safeAccel; accel ∪ brake ∪ ?safeTurn; turn}; {pos’ = vel, vel’ = acc} }*] pos < stopSign
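To make the shape of this model concrete, here is a minimal Python sketch, not the authors' code: the braking-distance guard safe_accel, the parameters A and B, and the Euler step are illustrative assumptions. It shows the loop's structure: non-deterministic discrete control followed by continuous motion, repeated.

```python
import random

def safe_accel(state, A=1.0, B=1.0):
    # Hypothetical guard ?safeAccel: even after accelerating for one
    # control cycle, braking at rate B still stops before the stop sign.
    v = state["vel"] + A
    return state["pos"] + v * v / (2 * B) < state["stopSign"]

def hybrid_step(state, dt=0.1, A=1.0, B=1.0):
    # Discrete control: nondeterministically take any enabled branch
    # of ?safeAccel; accel ∪ brake.
    branches = [-B]                 # brake is always allowed
    if safe_accel(state, A, B):
        branches.append(A)          # accel only behind its guard
    acc = random.choice(branches)
    # Continuous motion: one Euler step of pos' = vel, vel' = acc.
    state["vel"] = max(0.0, state["vel"] + acc * dt)
    state["pos"] += state["vel"] * dt
    return state
```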
SLIDE 22 Model-Based Verification
Accurate, analyzable models often exist! Formal verification gives strong safety guarantees =
- Computer-checked proofs of the safety specification
SLIDE 23 Model-Based Verification
Accurate, analyzable models often exist! Formal verification gives strong safety guarantees =
- Computer-checked proofs of the safety specification
- Formal proofs mapping the model to runtime monitors
SLIDE 24
Model-Based Verification Isn’t Enough
Perfect, analyzable models don’t exist!
SLIDE 25
Model-Based Verification Isn’t Enough
Perfect, analyzable models don’t exist!
{ {?safeAccel; accel ∪ brake ∪ ?safeTurn; turn}; {pos’ = vel, vel’ = acc} }*
How to implement? | Only accurate sometimes
SLIDE 26
Model-Based Verification Isn’t Enough
Perfect, analyzable models don’t exist!
{ {?safeAccel; accel ∪ brake ∪ ?safeTurn; turn}; {dx’ = w*y, dy’ = -w*x, ...} }*
How to implement? | Only accurate sometimes
SLIDE 27 Our Contribution
Justified Speculative Control is an approach toward provably safe reinforcement learning that:
- 1. learns to resolve non-determinism without sacrificing formal safety results
SLIDE 28 Our Contribution
Justified Speculative Control is an approach toward provably safe reinforcement learning that:
- 1. learns to resolve non-determinism without sacrificing formal safety results
- 2. allows and directs speculation whenever model mismatches occur
SLIDE 29 Learning to Resolve Non-determinism
Observe & compute reward | Act
SLIDE 30 Learning to Resolve Non-determinism
Observe & compute reward
accel ∪ brake ∪ turn
SLIDE 31 Learning to Resolve Non-determinism
Observe & compute reward
{accel,brake,turn}
SLIDE 32 Learning to Resolve Non-determinism
⇨
Observe & compute reward
Policy
{accel,brake,turn}
SLIDE 33 Learning to Resolve Non-determinism
⇨
Observe & compute reward
(safe?) Policy
{accel,brake,turn}
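For reference, an ordinary (unmonitored) learning step over this action set might look like the sketch below; the Q-table and the env.step interface are assumptions for illustration, not the paper's implementation.

```python
import random

ACTIONS = ["accel", "brake", "turn"]

def q_step(Q, env, state, alpha=0.1, gamma=0.99, eps=0.1):
    # Epsilon-greedy choice among the branches accel ∪ brake ∪ turn.
    if random.random() < eps:
        action = random.choice(ACTIONS)
    else:
        action = max(ACTIONS, key=lambda a: Q.get((state, a), 0.0))
    next_state, reward = env.step(state, action)  # observe & compute reward
    best_next = max(Q.get((next_state, a), 0.0) for a in ACTIONS)
    old = Q.get((state, action), 0.0)
    Q[(state, action)] = old + alpha * (reward + gamma * best_next - old)
    return next_state
```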
SLIDE 34 Learning to Safely Resolve Non-determinism
⇨
Observe & compute reward
(safe?) Policy
Safety Monitor
SLIDE 35 Learning to Safely Resolve Non-determinism
⇨
Observe & compute reward
(safe?) Policy
Safety Monitor
≠ “Trust Me”
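A minimal sketch of what monitor-gated action selection could look like, assuming a boolean monitor(state, action) extracted from the proof; the fall-back to brake is a hypothetical fail-safe choice.

```python
def monitored_action(Q, monitor, state, actions):
    # Keep only the actions the verified controller monitor approves;
    # the learned policy then optimizes within this provably safe set.
    safe = [a for a in actions if monitor(state, a)]
    if not safe:
        return "brake"  # hypothetical fail-safe fallback action
    return max(safe, key=lambda a: Q.get((state, a), 0.0))
```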
SLIDE 36 Learning to Safely Resolve Non-determinism
⇨
Observe & compute reward
(safe?) Policy
φ
Use a theorem prover to prove: (init→[{{accel∪brake};ODEs}*](safe)) ↔ φ
SLIDE 38 (safe?) Policy
Learning to Safely Resolve Non-determinism
⇨
Observe & compute reward
φ
Main Theorem: If the ODEs are accurate, then our formal proofs transfer from the non-deterministic model to the learned (deterministic) policy.
Use a theorem prover to prove: (init→[{{accel∪brake};ODEs}*](safe)) ↔ φ
SLIDE 39 (safe?) Policy
Learning to Safely Resolve Non-determinism
⇨
Observe & compute reward
φ
Main Theorem: If the ODEs are accurate, then our formal proofs transfer from the non-deterministic model to the learned (deterministic) policy via the model monitor.
Use a theorem prover to prove: (init→[{{accel∪brake};ODEs}*](safe)) ↔ φ
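A sketch of the model monitor's job under illustrative assumptions (a single Euler step of pos' = vel, vel' = acc, and a hypothetical tolerance tol): it checks whether the observed transition could have been produced by the verified model.

```python
def model_monitor(prev, curr, acc, dt, tol=1e-2):
    # Compare the observed next state against the ODE's one-step
    # prediction; True means the transition is explained by the model.
    pred_vel = prev["vel"] + acc * dt           # from vel' = acc
    pred_pos = prev["pos"] + prev["vel"] * dt   # from pos' = vel
    return (abs(curr["pos"] - pred_pos) <= tol and
            abs(curr["vel"] - pred_vel) <= tol)
```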
SLIDE 40 What About the Physical Model?
⇨
Observe & compute reward
φ
Use a theorem prover to prove: (init→[{{accel∪brake};ODEs}*](safe)) ↔ φ
(safe?) Policy: {pos’ = vel, vel’ = acc} ≠ reality?
SLIDE 41 What About the Physical Model?
Observe & compute reward | {brake, accel, turn}
SLIDE 42 What About the Physical Model?
Observe & compute reward | {brake, accel, turn}
Model is accurate.
SLIDE 44 What About the Physical Model?
Observe & compute reward | {brake, accel, turn}
Model is accurate.
Model is inaccurate.
SLIDE 45 What About the Physical Model?
Observe & compute reward | {brake, accel, turn}
Model is accurate.
Model is inaccurate. Obstacle!
SLIDE 46 What About the Physical Model?
Observe & compute reward | {brake, accel, turn}
Expected | Reality
SLIDE 47 Speculation is Justified
Observe & compute reward | {brake, accel, turn}
Expected (safe) | Reality (crash!)
SLIDE 48 Leveraging Verification Results to Learn Better
Observe & compute reward | {brake, accel, turn}
Use a real-valued version of the model monitor as a reward signal
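A sketch of this idea, reusing the illustrative one-step prediction error from the model-monitor sketch above; the additive shaping term and its weight are assumptions, not the paper's exact construction.

```python
def monitor_margin(prev, curr, acc, dt):
    # Real-valued analogue of the boolean model monitor: how far the
    # observed transition strays from the model's one-step prediction.
    err_pos = abs(curr["pos"] - (prev["pos"] + prev["vel"] * dt))
    err_vel = abs(curr["vel"] - (prev["vel"] + acc * dt))
    return err_pos + err_vel

def shaped_reward(task_reward, prev, curr, acc, dt, weight=1.0):
    # Penalize off-model behavior so a speculating learner is steered
    # back toward states where the safety proof still applies.
    return task_reward - weight * monitor_margin(prev, curr, acc, dt)
```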
SLIDE 49 Conclusion
Justified Speculative Control provides the best of logic and learning:
⇨
Policy
φ
SLIDE 50 Conclusion
Justified Speculative Control provides the best of logic and learning:
- Formally model the control system (control + physics)
⇨
Policy
φ
SLIDE 51 Conclusion
Justified Speculative Control provides the best of logic and learning:
- Formally model the control system (control + physics)
- Learn how to resolve non-determinism in models.
⇨
Policy
φ
SLIDE 52 Conclusion
Justified Speculative Control provides the best of logic and learning:
- Formally model the control system (control + physics)
- Learn how to resolve non-determinism in models.
- Leverage theorem proving to transfer proofs to learned policies.
⇨
Policy
φ
SLIDE 53 Conclusion
Justified Speculative Control provides the best of logic and learning:
- Formally model the control system (control + physics)
- Learn how to resolve non-determinism in models.
- Leverage theorem proving to transfer proofs to learned policies.
- Unsafe speculation is justified when the model deviates from reality
⇨
Policy
φ
SLIDE 54 Conclusion
Justified Speculative Control provides the best of logic and learning:
- Formally model the control system (control + physics)
- Learn how to resolve non-determinism in models
- Leverage theorem proving to transfer proofs to learned policies
- Unsafe speculation is justified when the model deviates from reality,
but verification results can still be helpful!
⇨
Policy
φ