

SLIDE 1

Safe Reinforcement Learning via Formal Methods

Nathan Fulton and André Platzer, Carnegie Mellon University


SLIDE 3

Safety-Critical Systems

"How can we provide people with cyber-physical systems they can bet their lives on?" - Jeannette Wing

SLIDE 4

Autonomous Safety-Critical Systems

How can we provide people with autonomous cyber-physical systems they can bet their lives on?

SLIDES 5–8

Approach: prove that control software achieves a specification with respect to a model of the physical system.

Model-Based Verification: the safety specification φ, e.g. pos < stopSign
Reinforcement Learning: the controller ctrl

SLIDES 10–17

Model-Based Verification (φ)

Benefits:

  • Strong safety guarantees
  • Automated analysis: computational aids (ATP)

Drawbacks:

  • Control policies are typically non-deterministic: answers “what is safe”, not “what is useful”
  • Assumes an accurate model

Reinforcement Learning (Observe, Act)

Benefits:

  • No need for a complete model
  • Optimal (effective) policies

Drawbacks:

  • No strong safety guarantees
  • Proofs are obtained and checked by hand
  • Formal proofs = decades-long proof development

Goal: Provably correct reinforcement learning

  1. Learn safely
  2. Learn a safe policy
  3. Justify claims of safety
SLIDES 18–21

Model-Based Verification

Accurate, analyzable models often exist!

{ {?safeAccel; accel ∪ brake ∪ ?safeTurn; turn}; {pos’ = vel, vel’ = acc} }*

(discrete, non-deterministic control; continuous motion)

init → [{ {?safeAccel; accel ∪ brake ∪ ?safeTurn; turn}; {pos’ = vel, vel’ = acc} }*] pos < stopSign
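The program above is a dL model, not executable code, but a small simulation can illustrate what its runs look like. The sketch below is illustrative only: the constants, the concrete braking-distance form of the ?safeAccel guard, the safety margin, and the Euler integration step are all assumptions added here, and the turn branch is omitted for brevity.

```python
import random

# Minimal simulation of the hybrid program: discrete, non-deterministic
# control followed by continuous motion, repeated (the trailing *).
STOP_SIGN, A, B, DT, MARGIN = 100.0, 2.0, 4.0, 0.1, 2.0

def safe_accel(pos, vel):
    # Assumed form of ?safeAccel: after one accel step, maximum braking B
    # must still stop the car before the stop sign. MARGIN absorbs the
    # discretization error that the continuous dL model does not have.
    p, v = pos + vel * DT, vel + A * DT
    return p + v * v / (2 * B) < STOP_SIGN - MARGIN

def step(pos, vel):
    # {?safeAccel; accel ∪ brake}: any permitted branch may be taken.
    choices = [-B]                   # brake is always permitted
    if safe_accel(pos, vel):
        choices.append(A)            # accel only behind its guard
    acc = random.choice(choices)     # non-determinism = arbitrary choice
    # Continuous motion {pos' = vel, vel' = acc}, one Euler step.
    return pos + vel * DT, max(0.0, vel + acc * DT)

pos, vel = 0.0, 10.0
for _ in range(500):
    pos, vel = step(pos, vel)
    assert pos < STOP_SIGN           # postcondition pos < stopSign holds
```

However the non-determinism is resolved, every run stays behind the stop sign; that is exactly the property the dL proof establishes once and for all.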

SLIDES 22–24

Model-Based Verification

Accurate, analyzable models often exist! Formal verification gives strong safety guarantees:

  • Computer-checked proofs of the safety specification
  • Formal proofs mapping the model to runtime monitors

SLIDES 25–27

Model-Based Verification Isn’t Enough

Perfect, analyzable models don’t exist!

{ {?safeAccel; accel ∪ brake ∪ ?safeTurn; turn}; {pos’ = vel, vel’ = acc} }*
{ {?safeAccel; accel ∪ brake ∪ ?safeTurn; turn}; {dx’ = w*y, dy’ = -w*x, ...} }*

How to implement? Only accurate sometimes.
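To make “only accurate sometimes” concrete, the sketch below compares the straight-line model against a hypothetically curving ground truth; the prediction error is small at first and grows with time. Both dynamics and all constants are illustrative assumptions, not the talk’s models.

```python
import math

# Straight-line model {pos' = vel} vs. an assumed curving ground truth:
# the model is accurate early on and inaccurate later.
DT, VEL = 0.1, 10.0

def model_pos(t):
    return VEL * t                       # model prediction: straight line

def true_pos(t, w=0.2):
    return (VEL / w) * math.sin(w * t)   # hypothetical curved reality

for t in (0.1, 1.0, 5.0):
    err = abs(model_pos(t) - true_pos(t))
    print(f"t={t:>4}: |model - reality| = {err:.3f}")
```

Any monitor that compares model predictions against observations will eventually fire on such a mismatch, which is the situation the rest of the talk addresses.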

SLIDES 28–29

Our Contribution

Justified Speculative Control is an approach toward provably safe reinforcement learning that:

  1. learns to resolve non-determinism without sacrificing formal safety results
  2. allows and directs speculation whenever model mismatches occur

SLIDES 30–34

Learning to Resolve Non-determinism

Observe & compute reward; act.

The model’s non-deterministic choice accel ∪ brake ∪ turn becomes the action set {accel, brake, turn}, from which learning produces a (safe?) policy.

SLIDES 35–38

Learning to Safely Resolve Non-determinism

Observe & compute reward; (safe?) Policy; Safety Monitor.

Safety Monitor ≠ “Trust Me”: use a theorem prover to prove (init → [{{accel ∪ brake}; ODEs}*](safe)) ↔ φ
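Operationally, the proved equivalence lets φ serve as a runtime check wrapped around the learned policy. Below is a minimal sketch of that wrapping; it reuses safe_accel and the constants from the simulation sketch above, and the hand-written phi is a stand-in for the monitor that the authors synthesize from the proof (ModelPlex-style) rather than write by hand.

```python
def phi(state, action):
    """Controller monitor: is `action` a transition that the verified
    non-deterministic model {?safeAccel; accel ∪ brake} allows here?"""
    pos, vel = state
    if action == "brake":
        return True                  # brake is unconditionally modeled
    if action == "accel":
        return safe_accel(pos, vel)  # accel only behind its guard
    return False                     # anything else is outside the model

def monitored_action(state, policy):
    """Sandbox the (possibly unsafe) learned policy behind the monitor."""
    proposed = policy(state)
    if phi(state, proposed):
        return proposed              # provably within the verified model
    return "brake"                   # verified fallback, not "trust me"
```

The point of the ↔ proof is that passing this check is equivalent to staying inside the verified model, so no trust in the learner itself is required.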

SLIDES 39–40

Learning to Safely Resolve Non-determinism

Observe & compute reward; (safe?) Policy; φ.

Main Theorem: If the ODEs are accurate, then our formal proofs transfer from the non-deterministic model to the learned (deterministic) policy via the model monitor.

Use a theorem prover to prove: (init → [{{accel ∪ brake}; ODEs}*](safe)) ↔ φ

SLIDE 41

What About the Physical Model?

Observe & compute reward; (safe?) Policy; φ.

Use a theorem prover to prove: (init → [{{accel ∪ brake}; ODEs}*](safe)) ↔ φ

But does reality actually follow {pos’ = vel, vel’ = acc}?

SLIDES 42–47

What About the Physical Model?

Observe & compute reward; actions {brake, accel, turn}.

Sometimes the model is accurate. Sometimes it is inaccurate: an obstacle appears, and what we expected diverges from reality.

SLIDE 48

Speculation is Justified

Observe & compute reward; actions {brake, accel, turn}.

Expected: safe. Reality: crash!

SLIDE 49

Leveraging Verification Results to Learn Better

Observe & compute reward; actions {brake, accel, turn}.

Use a real-valued version of the model monitor as a reward signal.
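One way to read this slide: when observations match the model, give the task reward; when they do not, reward shrinking the mismatch. The sketch below is a hypothetical realization; monitor_distance stands in for the quantity a ModelPlex-derived monitor would compute and is not the paper’s exact construction.

```python
def predicted_next(state, acc, dt=0.1):
    pos, vel = state                       # {pos' = vel, vel' = acc}
    return pos + vel * dt, vel + acc * dt

def monitor_distance(prev_state, acc, observed):
    """0.0 when the observation matches the model; grows with deviation."""
    ppos, pvel = predicted_next(prev_state, acc)
    opos, ovel = observed
    return abs(ppos - opos) + abs(pvel - ovel)

def reward(prev_state, acc, observed, task_reward):
    d = monitor_distance(prev_state, acc, observed)
    if d < 1e-6:
        return task_reward   # modeled regime: optimize the original task
    return -d                # unmodeled regime: reward shrinking the gap
```

Because the Boolean monitor is derived from a proof, its real-valued relaxation inherits a precise meaning: it measures distance from the state space in which the safety theorem applies.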

SLIDES 50–56

Conclusion

Justified Speculative Control provides the best of logic and learning:

  • Formally model the control system (control + physics)
  • Learn how to resolve non-determinism in models
  • Leverage theorem proving to transfer proofs to learned policies
  • Unsafe speculation is justified when the model deviates from reality, but verification results can still be helpful!

SLIDES 60–61

Justified Speculative Control

Learn over a constrained action space.

SLIDES 62–66

Safe Reinforcement Learning?

Observe & compute reward; unverified Policy.

The unverified policy deviates from the verified model in two ways:

  1. Some actions aren’t always safe: the policy is deterministic, but the verification result is set-valued ({accel, brake, turn} ≠ ?safeAccel; accel ∪ brake).
  2. Physical models are approximations: the environment may not be accurately modeled ({accel, brake, turn} ≠ {pos’ = vel, vel’ = acc}).

SLIDE 67

Safely Resolving Non-determinism

unverified Policy ≠ ?safeAccel; accel ∪ brake

SLIDES 68–74

Sandboxing Reinforcement Learning

“Accurate modulo determinism”: learn over a constrained action space.

init → [{ {accel ∪ brake}; t:=0; continuousMotion }*](safe)

Theorem: If the physical model is accurate, then verification results are preserved during learning and by learned policies.

Observe & compute reward; Policy; Constrained Actions.
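The theorem holds because the learner never executes an action outside the verified, non-deterministic model. A generic algorithm needs no internal changes, only a per-state filter on its action space. Below is a minimal tabular Q-learning sketch of that sandboxing; phi is the hypothetical monitor from the earlier sketch, and the environment interface in the usage comment is likewise an assumption.

```python
import random
from collections import defaultdict

# Generic tabular Q-learning whose action space is constrained, state by
# state, to monitor-approved actions. States are assumed hashable tuples.
Q = defaultdict(float)
ALPHA, GAMMA, EPS = 0.1, 0.99, 0.1
ACTIONS = ("accel", "brake")

def safe_actions(state):
    acts = [a for a in ACTIONS if phi(state, a)]
    return acts or ["brake"]         # verified fallback is always available

def choose(state):
    acts = safe_actions(state)       # the sandbox: unsafe actions never
    if random.random() < EPS:        # occur, even during epsilon-greedy
        return random.choice(acts)   # exploration
    return max(acts, key=lambda a: Q[(state, a)])

def update(s, a, r, s2):
    best = max(Q[(s2, a2)] for a2 in safe_actions(s2))
    Q[(s, a)] += ALPHA * (r + GAMMA * best - Q[(s, a)])

# Usage against a stub environment (assumed interface):
#   s = env.reset()
#   while learning:
#       a = choose(s); s2, r, done = env.step(a)
#       update(s, a, r, s2); s = s2
```

Since every explored and every greedy action passes the monitor, the safety proof for the non-deterministic model covers the whole learning process, not just the final policy.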


SLIDES 82–86

Justified Speculative Control

Learn over a constrained action space. Some questions:

  1. How do we know when we’re in unmodeled state space?
  2. What do we do when we are in unmodeled state space?

Theorem: Verification results are preserved outside of the red region (the unmodeled part of the state space). But:

  ☒ How do we know when we’re in unmodeled state space? (The model monitor; see the sketch below.)
  ☐ What do we do when we are in unmodeled state space?
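A minimal sketch of the checked box, i.e., of detecting unmodeled state space at runtime: compare each observed transition against the model’s prediction. The tolerance and the one-step Euler prediction are assumptions added here; the authors derive the actual check from the proof rather than writing it by hand.

```python
TOL = 1e-3

def in_modeled_space(prev_state, acc, observed, dt=0.1):
    # Predict one step of {pos' = vel, vel' = acc} and compare with what
    # actually happened; disagreement means we left the modeled region.
    pos, vel = prev_state
    predicted = (pos + vel * dt, vel + acc * dt)
    return all(abs(p - o) <= TOL for p, o in zip(predicted, observed))

# When this returns False (say, an obstacle perturbed the dynamics), the
# agent leaves the sandbox and speculates, guided by the distance-based
# reward from the earlier sketch, until it is back in modeled space.
```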

SLIDES 87–91

What do we do in unmodeled state-space?

Get from here (unmodeled state space)... to here (back inside the modeled state space).

SLIDES 92–93

Leveraging Formal Methods during Learning

Scenario: own car following a leader. Results under 5%, 25%, and 50% model perturbation for the two reward signals:

  Perturbation   “Don’t hit the leader”   “Get back to modeled state space”
  5%             3                        2
  25%            18                       16
  50%            41                       24

SLIDES 94–95

Conclusion

KeYmaera X + Justified Speculative Control:

  1. Transfers formal verification results for non-deterministic control policies to policies obtained via a generic reinforcement learning algorithm.
  2. Leverages insights obtained during verification to direct future learning.

SLIDE 96

init → [{ {?safeAccel; accel ∪ brake}; t:=0; {pos’=vel, vel’=acc} }*](pos < stopSign)

Model-Based Verification: pos < stopSign
Reinforcement Learning: ctrl