
Slide 1

Learning Small Strategies Fast

Jan Křetínský

Technical University of Munich, Germany

joint work with P. Ashok, E. Kelmendi, J. Krämer, T. Meggendorfer, M. Weininger (TUM)

  • T. Brázdil (Masaryk University Brno)
  • K. Chatterjee, M. Chmelík, P. Daca, A. Fellner, T. Henzinger, T. Petrov, V. Toman (IST Austria)
  • V. Forejt, M. Kwiatkowska, M. Ujma (Oxford University)
  • D. Parker (University of Birmingham)

Logic and Learning, The Alan Turing Institute, January 12, 2018

Slide 2

Controller synthesis and verification


Slide 4

Formal methods and machine learning

Formal methods
  + precise
  – scalability issues (MEM-OUT on large models)
  – can be hard to use

(Machine) learning
  – weaker guarantees
  – different objectives
  + scalable
  + simpler solutions

Aim of the combination: precise computation that focuses on the important parts of the system.


Slide 10

Examples

◮ Reinforcement learning for efficient strategy synthesis
    ◮ MDP with functional spec (reachability, LTL) [1, 2]
    ◮ MDP with performance spec (mean payoff / average reward) [3, 4]
    ◮ Simple stochastic games (reachability) [5]
◮ Decision tree learning for efficient strategy representation
    ◮ MDP [6]
    ◮ Games [7]

[1] Brázdil, Chatterjee, Chmelík, Forejt, K., Kwiatkowska, Parker, Ujma: Verification of Markov Decision Processes Using Learning Algorithms. ATVA 2014
[2] Daca, Henzinger, K., Petrov: Faster Statistical Model Checking for Unbounded Temporal Properties. TACAS 2016
[3] Ashok, Chatterjee, Daca, K., Meggendorfer: Value Iteration for Long-run Average Reward in Markov Decision Processes. CAV 2017
[4] K., Meggendorfer: Efficient Strategy Iteration for Mean Payoff in Markov Decision Processes. ATVA 2017
[5] draft
[6] Brázdil, Chatterjee, Chmelík, Fellner, K.: Counterexample Explanation by Learning Small Strategies in Markov Decision Processes. CAV 2015
[7] Brázdil, Chatterjee, K., Toman: Strategy Representation by Decision Trees in Reactive Synthesis. TACAS 2018

Slide 11

Example: Markov decision processes

[Figure: an MDP with states including init, s, ..., v1, t and the target goal; its actions include up (prob. 1), down (probs. 0.01/0.99), a (prob. 1), b (probs. 0.5/0.5) and c (prob. 1).]

Task: find a strategy σ maximising Pσ[◊goal], the probability of eventually reaching goal.

Even here a strategy can be represented more compactly than as a table, e.g. by the one-node decision tree shown on the slide:

    ACTION = down?
      Y       N
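To make these objects concrete in code: below is a minimal Python encoding of an MDP as a transition function Δ and of a memoryless strategy as a map σ : S → A. Since the figure itself did not survive extraction, the state names `v1`, `t`, `sink` and all numbers below are illustrative assumptions that merely reuse the labels visible on the slide.

    # Hypothetical stand-in for the slide's MDP (the original figure is garbled).
    # Delta[s][a] = list of (successor, probability) pairs.
    Delta = {
        "init": {"up": [("v1", 1.0)], "down": [("t", 0.01), ("sink", 0.99)]},
        "v1":   {"b":  [("goal", 0.5), ("sink", 0.5)]},
        "t":    {"c":  [("goal", 1.0)]},
    }

    # A memoryless strategy is a map sigma: S -> A; the task is to pick
    # sigma maximising P_sigma[eventually goal]. The one-node decision tree
    # "ACTION = down?" above is a compact representation of such a map.
    sigma = {"init": "up", "v1": "b", "t": "c"}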

Slide 18

Example 1: Computing strategies faster

Value iteration maintaining upper and lower bounds (a runnable sketch follows):

    1: repeat
    2:     for all transitions s -a-> do
    3:         Update(s -a->)
    4: until UpBound(s_init) − LoBound(s_init) < ε

    1: procedure Update(s -a->)
    2:     UpBound(s, a) := Σ_{s'∈S} Δ(s, a, s') · UpBound(s')
    3:     LoBound(s, a) := Σ_{s'∈S} Δ(s, a, s') · LoBound(s')
    4:     UpBound(s) := max_{a∈A} UpBound(s, a)
    5:     LoBound(s) := max_{a∈A} LoBound(s, a)
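A minimal runnable sketch of this interval value iteration in Python, reusing the illustrative toy MDP from above (the topology and numbers are assumptions, not the talk's benchmarks). On MDPs with end components the upper bound needs extra care, e.g. collapsing, which the cited papers address and this sketch omits:

    # Interval value iteration for max reachability of "goal": sweep Bellman
    # updates of both bounds until they meet at the initial state.
    EPS = 1e-6

    Delta = {  # Delta[s][a] = list of (successor, probability) pairs
        "init": {"up": [("v1", 1.0)], "down": [("t", 0.01), ("sink", 0.99)]},
        "v1":   {"b":  [("goal", 0.5), ("sink", 0.5)]},
        "t":    {"c":  [("goal", 1.0)]},
    }
    up = {"init": 1.0, "v1": 1.0, "t": 1.0, "goal": 1.0, "sink": 0.0}  # UpBound
    lo = {"init": 0.0, "v1": 0.0, "t": 0.0, "goal": 1.0, "sink": 0.0}  # LoBound

    def update(s):
        """Procedure Update: Bellman backup of both bounds in state s."""
        up[s] = max(sum(p * up[t] for t, p in Delta[s][a]) for a in Delta[s])
        lo[s] = max(sum(p * lo[t] for t, p in Delta[s][a]) for a in Delta[s])

    while up["init"] - lo["init"] >= EPS:    # until UpBound - LoBound < eps
        for s in Delta:                      # for all controllable states
            update(s)

    print(lo["init"], up["init"])            # both converge to the value 0.5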

Slide 19

Example 1: Computing strategies faster

Idea: update more frequently what is visited more frequently by reasonably good strategies.

    1: repeat
    2:     sample a path from s_init        ⊲ pick action arg max_a UpBound(s, a)
    3:     for all visited transitions s -a-> do
    4:         Update(s -a->)
    5: until UpBound(s_init) − LoBound(s_init) < ε

Result: faster, yet sure, updates of the important parts of the system. A runnable sketch follows.
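The following is a hedged sketch of this sampling loop in the spirit of BRTDP, again on the illustrative toy MDP (details such as end-component handling and non-terminating samples are omitted). Note how the rarely relevant state t is never even visited, which is exactly the intended focusing effect:

    import random

    # BRTDP-style loop: sample a path from init guided by the upper bound,
    # then update the bounds along the visited transitions only.
    random.seed(0)
    EPS = 1e-6

    Delta = {  # Delta[s][a] = list of (successor, probability) pairs
        "init": {"up": [("v1", 1.0)], "down": [("t", 0.01), ("sink", 0.99)]},
        "v1":   {"b":  [("goal", 0.5), ("sink", 0.5)]},
        "t":    {"c":  [("goal", 1.0)]},
    }
    up = {"init": 1.0, "v1": 1.0, "t": 1.0, "goal": 1.0, "sink": 0.0}
    lo = {"init": 0.0, "v1": 0.0, "t": 0.0, "goal": 1.0, "sink": 0.0}

    def bound(bnd, s, a):
        """Expected bound value of playing action a in state s."""
        return sum(p * bnd[t] for t, p in Delta[s][a])

    while up["init"] - lo["init"] >= EPS:
        path, s = [], "init"
        while s not in ("goal", "sink"):                # sample a path from s_init
            a = max(Delta[s], key=lambda a: bound(up, s, a))  # greedy on UpBound
            path.append((s, a))
            succ, prob = zip(*Delta[s][a])
            s = random.choices(succ, prob)[0]
        for s, a in reversed(path):                     # update visited transitions
            up[s] = max(bound(up, s, a) for a in Delta[s])
            lo[s] = max(bound(lo, s, a) for a in Delta[s])

    print(lo["init"], up["init"])   # 0.5 0.5; state t was never touched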

Slide 25

Example 1: Experimental results

    Example     Visited states (PRISM)    Visited states (with RL)
    zeroconf             4,427,159                   977
    wlan                 5,007,548                 1,995
    firewire            19,213,802                32,214
    mer                 26,583,064                 1,950

Slide 26

Example 2: Computing small strategies

How to represent a memoryless strategy (a learned-DT sketch follows below):

◮ explicit map σ : S → A
◮ BDD (binary decision diagram) encoding its bit representation
◮ DT (decision tree)
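As an illustration of the DT option (an assumed toy example, not the talk's tooling), an off-the-shelf decision-tree learner such as scikit-learn's DecisionTreeClassifier can compress an explicit strategy over bit-vector encoded states:

    # Illustrative sketch: learn a decision-tree representation of an explicit
    # strategy sigma given as a table over bit-vector encoded states.
    from sklearn.tree import DecisionTreeClassifier, export_text

    states  = [[0, 0], [0, 1], [1, 0], [1, 1]]   # states encoded by two bits
    actions = ["down", "down", "up", "up"]       # sigma's choice in each state

    tree = DecisionTreeClassifier().fit(states, actions)
    print(export_text(tree, feature_names=["bit0", "bit1"]))
    # A single test on bit0 suffices here; for strategies with such structure
    # the tree can be far smaller than the explicit table.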

Slide 28

Example 2: Computing small strategies

From precise decisions to a DT, guided by the importance of the decisions:

◮ Cut off states with zero importance (unreachable or useless).
◮ Cut off states with low importance (small error: ε-optimal strategy).
◮ How to make use of the exact quantities?

Importance of a decision in s with respect to ◊goal and strategy σ (estimated by simulation in the sketch below):

    Imp(s) = Pσ[◊s | ◊goal]
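One way to obtain these quantities is Monte-Carlo estimation: the fraction of goal-reaching runs under σ that pass through s. The toy MDP and strategy below are illustrative assumptions; the talk's tools may compute the values differently.

    import random

    # Estimate Imp(s) = P_sigma[<>s | <>goal] by simulating sigma.
    random.seed(0)

    Delta = {  # Delta[s][a] = list of (successor, probability) pairs
        "init": {"up": [("v1", 1.0)], "down": [("t", 0.01), ("sink", 0.99)]},
        "v1":   {"b":  [("goal", 0.5), ("sink", 0.5)]},
        "t":    {"c":  [("goal", 1.0)]},
    }
    sigma = {"init": "up", "v1": "b", "t": "c"}

    counts = {s: 0 for s in Delta}
    goal_runs, RUNS = 0, 10_000
    for _ in range(RUNS):
        s, visited = "init", set()
        while s not in ("goal", "sink"):
            visited.add(s)
            succ, prob = zip(*Delta[s][sigma[s]])
            s = random.choices(succ, prob)[0]
        if s == "goal":
            goal_runs += 1
            for v in visited:            # count each state once per run
                counts[v] += 1

    imp = {s: counts[s] / max(goal_runs, 1) for s in Delta}
    print(imp)  # t never lies on a goal-reaching run under sigma: importance 0

Zero-importance states (t here) can be cut off outright; low-importance states can be cut at the cost of a small error, and the remaining values can serve as sample weights for the tree learner.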

Slide 30

Example 2: Experimental results

    Example     #States      Value      Explicit     BDD    DT    Rel. err (DT) %
    firewire      481,136    1.0         479,834    4,233    1    0.0
    investor       35,893    0.958        28,151      783   27    0.886
    mer         1,773,664    0.200016    --- MEM-OUT ---          *
    zeroconf       89,586    0.00863      60,463      409    7    0.106

    * MEM-OUT in PRISM, whereas RL yields: Explicit 1,887, BDD 619, DT 13, Rel. err 0.00014 %.

Slide 32

Some related work

Reinforcement learning in verification:
◮ Junges, Jansen, Dehnert, Topcu, Katoen: Safety-Constrained Reinforcement Learning for MDPs. TACAS 2016
◮ David, Jensen, Larsen, Legay, Lime, Sørensen, Taankvist: On Time with Minimal Expected Cost! ATVA 2014

Strategy representation learning:
◮ Neider, Topcu: An Automaton Learning Approach to Solving Safety Games over Infinite Graphs. TACAS 2016

Also: invariant generation, theorem-prover guidance, ...

Slide 33

Summary

Machine learning in verification:
◮ Scalable heuristics
◮ Example 1: Speeding up value iteration
    ◮ technique: reinforcement learning, BRTDP
    ◮ idea: focus on updating the "most important parts",
      i.e. those most often visited by good strategies
◮ Example 2: Small and readable strategies
    ◮ technique: decision tree learning
    ◮ idea: based on the importance of states,
      feed the decisions to the learning algorithm
◮ Learning in Verification (LiVe) at ETAPS

Slide 35

Discussion

Verification using machine learning:
◮ How far do we want to compromise?
◮ Do we have to compromise?
    ◮ BRTDP, invariant generation, and strategy representation don't
◮ Don't we want more than ML?
    ◮ (ε-)optimal controllers?
    ◮ arbitrary controllers – is it still verification?
◮ What do we actually want?
    ◮ scalability shouldn't overrule guarantees?
    ◮ oracle usage seems fine
    ◮ when is PAC enough?

Thank you