SLIDE 1

Safe Reinforcement Learning via Formal Methods

Nathan Fulton and André Platzer, Carnegie Mellon University

SLIDE 2

Safety-Critical Systems

"How can we provide people with cyber-physical systems they can bet their lives on?" - Jeannette Wing

SLIDE 3

Autonomous Safety-Critical Systems

How can we provide people with autonomous cyber-physical systems they can bet their lives on?

SLIDES 4–8

Model-Based Verification vs. Reinforcement Learning

Approach: prove that the control software (ctrl) achieves a specification (e.g., pos < stopSign) with respect to a model of the physical system.
SLIDES 9–16

Model-Based Verification

Benefits:

  • Strong safety guarantees
  • Automated analysis and computational aids (ATP)

Drawbacks:

  • Control policies are typically non-deterministic: they answer “what is safe”, not “what is useful”
  • Assumes an accurate model

Reinforcement Learning (observe, then act)

Benefits:

  • No need for a complete model
  • Optimal (effective) policies

Drawbacks:

  • No strong safety guarantees
  • Proofs are obtained and checked by hand
  • Formal proofs = decades-long proof development

Goal: Provably correct reinforcement learning

  1. Learn safety
  2. Learn a safe policy
  3. Justify claims of safety
SLIDES 17–19

Model-Based Verification

Accurate, analyzable models often exist!

{ {?safeAccel; accel ∪ brake ∪ ?safeTurn; turn}; {pos’ = vel, vel’ = acc} }*

discrete, non-deterministic control; continuous motion
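For readers unfamiliar with the braces: this is standard differential dynamic logic (dL) hybrid-program notation. The rendering below spells out each construct; the underbrace labels are ours, everything else is from the slide.

% dL hybrid-program notation used above:
%   ?P            test: the run continues only if P holds now
%   a; b          sequential composition
%   a ∪ b         non-deterministic choice between a and b
%   {x' = f(x)}   continuous evolution along the ODE for some duration
%   a*            non-deterministic repetition
\[
  \Big\{\ \underbrace{(?\mathit{safeAccel};\ \mathit{accel})
      \ \cup\ \mathit{brake}
      \ \cup\ (?\mathit{safeTurn};\ \mathit{turn})}_{\text{discrete, non-deterministic control}}
  \ ;\ \underbrace{\{\mathit{pos}' = \mathit{vel},\ \mathit{vel}' = \mathit{acc}\}}_{\text{continuous motion}}
  \ \Big\}^{*}
\]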

SLIDES 20–21

Model-Based Verification

Accurate, analyzable models often exist! Formal verification gives strong safety guarantees:

init → [{ {?safeAccel; accel ∪ brake ∪ ?safeTurn; turn}; {pos’ = vel, vel’ = acc} }*] pos < stopSign
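Spelled out, this is a dL box formula; a LaTeX rendering (ours, matching the slide) follows. Informally: from any state satisfying init, every run of the repeated control-plus-physics loop keeps the car before the stop sign.

% [a]P ("box") means: after EVERY run of hybrid program a, P holds.
\[
  \mathit{init} \ \rightarrow\
  \big[\ \big\{ (?\mathit{safeAccel};\ \mathit{accel})
      \ \cup\ \mathit{brake}
      \ \cup\ (?\mathit{safeTurn};\ \mathit{turn})
      \ ;\ \{\mathit{pos}' = \mathit{vel},\ \mathit{vel}' = \mathit{acc}\} \big\}^{*}\ \big]\
  \mathit{pos} < \mathit{stopSign}
\]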

SLIDES 22–23

Model-Based Verification

Accurate, analyzable models often exist! Formal verification gives strong safety guarantees:

  • Computer-checked proofs of a safety specification
  • Formal proofs mapping the model to runtime monitors

SLIDE 24

Model-Based Verification Isn’t Enough

Perfect, analyzable models don’t exist!

SLIDE 25

Model-Based Verification Isn’t Enough

Perfect, analyzable models don’t exist!

{ { ?safeAccel;accel ∪ brake ∪ ?safeTurn; turn}; {pos’ = vel, vel’ = acc} }*

How to implement the non-deterministic control? The ODE model is only accurate sometimes.

SLIDE 26

Model-Based Verification Isn't Enough

Perfect, analyzable models don't exist!

{ {?safeAccel; accel ∪ brake ∪ ?safeTurn; turn}; {dx’ = w*y, dy’ = -w*x, ...} }*

Even this richer dynamics model is only accurate sometimes, and the non-deterministic control still has to be implemented.

SLIDES 27–28

Our Contribution

Justified Speculative Control is an approach toward provably safe reinforcement learning that:

  1. learns to resolve non-determinism without sacrificing formal safety results
  2. allows and directs speculation whenever model mismatches occur

SLIDES 29–33

Learning to Resolve Non-determinism

Standard RL loop: observe & compute reward, then act.

The model's non-deterministic choice accel ∪ brake ∪ turn becomes the agent's action set {accel, brake, turn}, and reinforcement learning resolves the choice by learning a (safe?) policy over those actions (a minimal learning sketch follows below).
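To make the loop concrete, here is a minimal sketch of tabular Q-learning over that action set. This is our illustration, not the paper's implementation; the environment interface (reset()/step()) and hashable states are assumptions.

# Hypothetical sketch: tabular Q-learning over the action set induced by
# the model's non-deterministic choice accel ∪ brake ∪ turn.
import random
from collections import defaultdict

ACTIONS = ["accel", "brake", "turn"]

def q_learning(env, episodes=1000, alpha=0.1, gamma=0.99, epsilon=0.1):
    """env is assumed to expose reset() -> state and step(action) -> (state, reward, done)."""
    Q = defaultdict(float)  # (state, action) -> estimated return; states must be hashable
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # epsilon-greedy exploration over the model's action set
            if random.random() < epsilon:
                action = random.choice(ACTIONS)
            else:
                action = max(ACTIONS, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            # standard one-step Q-learning update
            best_next = max(Q[(next_state, a)] for a in ACTIONS)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q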

SLIDES 34–37

Learning to Safely Resolve Non-determinism

Observe & compute reward; the learned (safe?) policy is guarded by a safety monitor φ.

A safety monitor ≠ “Trust Me”. Use a theorem prover to prove:

(init → [{{accel ∪ brake}; ODEs}*](safe)) ↔ φ

A sketch of gating the learner's actions with such a monitor follows below.
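Here is a minimal sketch of how a proved monitor can sandbox the learner. The concrete braking-distance condition, the constants B and STOP, and the brake fallback are our illustrative assumptions; in the paper's setting, the monitor is the formula φ extracted by the theorem prover, not this simplification.

# Hypothetical sketch: only execute learner actions the monitor certifies.
B = 1.0       # assumed maximum braking deceleration
STOP = 100.0  # assumed stop-sign position

def controller_monitor(pos, vel, action):
    """Illustrative stand-in for the proved monitor φ: anything other than
    braking is permitted only while the car can still stop before the sign."""
    if action == "brake":
        return True                        # braking is always allowed by the model
    stopping_distance = vel * vel / (2 * B)
    return pos + stopping_distance < STOP  # simplified ?safeAccel-style test

def safe_action(pos, vel, ranked_actions):
    """Take the learner's most-preferred action that the monitor certifies;
    fall back to the always-safe choice (brake)."""
    for action in ranked_actions:          # ordered by learned preference
        if controller_monitor(pos, vel, action):
            return action
    return "brake"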

SLIDES 38–39

Learning to Safely Resolve Non-determinism

Main Theorem: If the ODEs are accurate, then our formal proofs transfer from the non-deterministic model to the learned (deterministic) policy via the model monitor.

Use a theorem prover to prove: (init → [{{accel ∪ brake}; ODEs}*](safe)) ↔ φ

SLIDE 40

What About the Physical Model?

The guarantee is conditional. The theorem prover establishes (init → [{{accel ∪ brake}; ODEs}*](safe)) ↔ φ, but what if the real dynamics ≠ {pos’ = vel, vel’ = acc}?

SLIDES 41–45

What About the Physical Model?

Observe & compute reward; act with {brake, accel, turn}.

While the model is accurate, the transferred proofs apply. When the model is inaccurate (e.g., an unexpected obstacle appears), the verified guarantees no longer directly apply. Detecting the mismatch is the model monitor's job; a sketch follows below.
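A minimal sketch of a model monitor, under our own assumptions: compare a one-step prediction of the verified dynamics {pos’ = vel, vel’ = acc} against the observed next state. The Euler-step prediction and the tolerance are illustrative; monitors generated by tools such as ModelPlex instead check a proved characterization of the model's reachable states.

# Hypothetical sketch: flag transitions the verified model cannot explain.
from dataclasses import dataclass

@dataclass(frozen=True)
class State:
    pos: float
    vel: float

def predict(state: State, acc: float, dt: float) -> State:
    # one explicit Euler step of {pos' = vel, vel' = acc}
    return State(pos=state.pos + state.vel * dt, vel=state.vel + acc * dt)

def model_monitor(prev: State, acc: float, observed: State, dt: float,
                  tol: float = 1e-2) -> bool:
    """True iff the observed transition is explained by the model up to tol.
    False signals a model mismatch: the regime where Justified Speculative
    Control permits (and directs) speculation."""
    expected = predict(prev, acc, dt)
    return (abs(expected.pos - observed.pos) <= tol and
            abs(expected.vel - observed.vel) <= tol)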

SLIDES 46–47

Speculation is Justified

Observe & compute reward; act with {brake, accel, turn}.

Expected (safe) vs. reality (crash!): once the model and reality disagree, the safety proof's assumptions no longer hold and no action is provably safe, so speculation beyond the verified model is justified.

SLIDE 48

Leveraging Verification Results to Learn Better

Observe & compute reward; act with {brake, accel, turn}.

Even while speculating, use a real-valued version of the model monitor as a reward signal (a sketch follows below).
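A minimal sketch of that idea, under our own assumptions: the Boolean check above becomes a margin, so transitions the verified model explains earn higher reward. The Euler prediction, distance measure, and weight are illustrative choices, not the paper's definitions.

# Hypothetical sketch: real-valued model monitor as a reward-shaping term.
def predict(pos, vel, acc, dt):
    # one Euler step of the verified dynamics {pos' = vel, vel' = acc}
    return pos + vel * dt, vel + acc * dt

def model_margin(pos, vel, acc, obs_pos, obs_vel, dt):
    """0 when the observed transition matches the model exactly; increasingly
    negative as reality deviates from the verified dynamics."""
    exp_pos, exp_vel = predict(pos, vel, acc, dt)
    return -(abs(exp_pos - obs_pos) + abs(exp_vel - obs_vel))

def shaped_reward(task_reward, pos, vel, acc, obs_pos, obs_vel, dt, weight=1.0):
    """Combine the task reward with the monitor margin, steering speculation
    toward behavior the verified model can still explain."""
    return task_reward + weight * model_margin(pos, vel, acc, obs_pos, obs_vel, dt)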

SLIDES 49–55

Conclusion

Justified Speculative Control provides the best of logic and learning:

  • Formally model the control system (control + physics)
  • Learn how to resolve non-determinism in models
  • Leverage theorem proving to transfer proofs to learned policies
  • Unsafe speculation is justified when the model deviates from reality, but verification results can still be helpful!