Model Checking Regular Expressions, Arlen Cox, 5-9 May 2019, IDA



slide-1
SLIDE 1

Model Checking Regular Expressions

Arlen Cox 5-9 May 2019

IDA – Center for Computing Sciences

slide-2
SLIDE 2

Managing a corpus of regular expressions

Does the language of the corpus grow?


slide-4
SLIDE 4

Managing a corpus of regular expressions

∃s. s ∈ L(R) ∧ s ∉ L(C)

How do different solvers perform on this problem?

R = ^[01]*1[01]{n}$
C = ^[01]*0[01]{n − 1}$

Adapted from Hooimeijer, Weimer 2010
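For small n the query above can be checked by brute force, which is useful as a sanity oracle when comparing solver answers. The sketch below is not how any solver in this talk works: it simply enumerates binary strings in order of length using Python's `re` module. The helper name `difference_witness` and the length bound `max_len` are assumptions made for illustration.

```python
import re
from itertools import product


def difference_witness(r, c, alphabet="01", max_len=8):
    """Return a shortest string in L(r) but not in L(c).

    Returns None if no such string exists among strings of length
    <= max_len. This is a bounded search, not a proof of emptiness.
    """
    r_re, c_re = re.compile(r), re.compile(c)
    for length in range(max_len + 1):
        for letters in product(alphabet, repeat=length):
            s = "".join(letters)
            if r_re.fullmatch(s) and not c_re.fullmatch(s):
                return s
    return None


n = 2
R = r"^[01]*1[01]{%d}$" % n        # a 1 at position n+1 from the end
C = r"^[01]*0[01]{%d}$" % (n - 1)  # a 0 at position n from the end
witness = difference_witness(R, C)
print(witness)
```

The enumeration is exponential in the string length, so it only scales to very small n.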

slide-5
SLIDE 5

Regular expression difference

[Plot: time (s) against parameter n from 5 to 30, for Qzy, CVC4, Z3, Ostrich, and Sloth]

slide-6
SLIDE 6

Qzy has quadratic scaling in n

[Plot: time (s) against parameter n up to 4000, for Qzy, CVC4, Z3, Ostrich, and Sloth]

slide-8
SLIDE 8

Existing solvers are too slow

C is really a corpus of regular expressions.

∃s. s ∈ L(R) ∧ s ∉ L(C1) ∧ ⋯ ∧ s ∉ L(Cn)

It only gets worse...

I built Qzy to solve this

slide-10
SLIDE 10

Email address corpus

129 email address regular expressions from Regexlib
R = one regular expression from corpus
C = remaining 128 regular expressions

Solver    Result
CVC4      Can’t encode (non-printable character ranges)
Z3        Time out after 24 hours (1 core)
Ostrich   Time out after 24 hours (44 cores!)
Sloth     Memory out (2G) after 10 minutes

slide-11
SLIDE 11

Qzy is fast for email address corpus

[Plot: count (log scale, 10^0 to 10^4) against time (s), 0 to 500 s]

slide-12
SLIDE 12

Qzy is fast for email address corpus

Running the whole suite of 128 cases takes:

  • 15m 2s using 1 core.
  • 97s using 32 cores of a 36-core computer.

slide-13
SLIDE 13

Overview

  1. Encoding regular expression constraints for model checking
  2. Implementation and optimization
  3. Ongoing project: Capture groups

slide-14
SLIDE 14

Encoding regular expression constraints for model checking

slide-15
SLIDE 15

Tabakov/Vardi universality encoding [2]

Regex → NFA → TS

  • Universality is encoded as a safety property of the transition system.
  • Use an off-the-shelf model checker to check that property.
  • Equivalent to a backward BFA encoding [1].

[1] Cox, Leasure. Model Checking Regular Language Constraints. 2017
[2] Tabakov, Vardi. Experimental Evaluation of Classical Automata Constructions. 2005

slide-16
SLIDE 16

Tabakov/Vardi universality encoding example

Example regular expression: aa|[ab]*

[Figure: NFA with start state q0 and states q1, q2, q3; two edges labeled a and four edges labeled a|b]

slide-17
SLIDE 17

One bit per NFA state transition system

I(q0, q1, q2, q3) = q0 ∧ ¬q1 ∧ ¬q2 ∧ ¬q3

T(q0, q1, q2, q3, q′0, q′1, q′2, q′3, x) =
    ¬q′0
  ∧ (q′1 = (q0 ∧ x ∈ {a}))
  ∧ (q′2 = ((q0 ∨ q2) ∧ x ∈ {a, b}))
  ∧ (q′3 = ((q1 ∧ x ∈ {a}) ∨ ((q0 ∨ q2) ∧ x ∈ {a, b})))

P(q0, q1, q2, q3) = q0 ∨ q3
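The transition system on this slide can be executed directly as a sanity check. The following is a minimal sketch, not Qzy's implementation: it keeps one Boolean per NFA state, applies T one character at a time, and evaluates P at the end. The function names `step` and `accepts` are chosen here for illustration. The loop at the bottom cross-checks the result against Python's `re` module for all short words over {a, b}.

```python
import re
from itertools import product

INIT = (True, False, False, False)  # I: only q0 is set


def step(state, x):
    """Apply T: compute the next state bits from the current bits and input x."""
    q0, q1, q2, q3 = state
    in_a = x == "a"
    in_ab = x in ("a", "b")
    return (
        False,                                     # q0' is always false
        q0 and in_a,                               # q1'
        (q0 or q2) and in_ab,                      # q2'
        (q1 and in_a) or ((q0 or q2) and in_ab),   # q3'
    )


def accepts(word):
    """Run the system on word and evaluate P = q0 or q3 on the final state."""
    state = INIT
    for x in word:
        state = step(state, x)
    q0, _, _, q3 = state
    return q0 or q3


# Cross-check against the regular expression for all words up to length 4.
for length in range(5):
    for letters in product("ab", repeat=length):
        w = "".join(letters)
        assert accepts(w) == bool(re.fullmatch(r"aa|[ab]*", w))
```

Because every state bit is a function of the previous bits and the input, each input word leads to exactly one state vector.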

slide-18
SLIDE 18

Emptiness and universality

Emptiness can be checked with a model checker

  • If P is satisfied with input string x̄, x̄ is in the language.
  • If no input string satisfies P, the language is empty.

T is really a transition function, so

  • If ¬P is satisfied with input string x̄, x̄ is not in the language.
  • If no input string satisfies ¬P, the language is universal.
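Both checks reduce to reachability: is there an input word that drives the system into a state satisfying P (non-emptiness) or ¬P (non-universality)? The sketch below uses explicit breadth-first search as a stand-in for a model checker. It is an illustration under that assumption, not the IC3-based procedure Qzy uses, and the names `find_word`, `step`, and `step_astar` are hypothetical. The first system is the aa|[ab]* example; the second is a hand-written one-bit system for a*.

```python
from collections import deque


def find_word(init, step, target, alphabet):
    """Return a shortest word that reaches a state satisfying target, or None."""
    seen = {init}
    queue = deque([(init, "")])
    while queue:
        state, word = queue.popleft()
        if target(state):
            return word
        for x in alphabet:
            nxt = step(state, x)
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, word + x))
    return None  # target is unreachable


# Transition system for aa|[ab]* (one bit per NFA state).
INIT = (True, False, False, False)


def step(state, x):
    q0, q1, q2, q3 = state
    in_a, in_ab = x == "a", x in ("a", "b")
    return (False, q0 and in_a, (q0 or q2) and in_ab,
            (q1 and in_a) or ((q0 or q2) and in_ab))


def prop(state):
    return state[0] or state[3]  # P = q0 or q3


# One-bit system for a*: the bit stays set while every input is 'a'.
def step_astar(state, x):
    return (state[0] and x == "a",)


nonempty = find_word(INIT, step, prop, "ab")                           # word in the language
non_universal = find_word(INIT, step, lambda s: not prop(s), "ab")     # word outside it
astar_cex = find_word((True,), step_astar, lambda s: not s[0], "ab")
print(repr(nonempty), non_universal, astar_cex)
```

The search terminates because there are finitely many state vectors, but the number of reachable vectors can be exponential in the number of bits, which is why a symbolic model checker is used in practice.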

slide-19
SLIDE 19

With determinism, language combinators follow

With a transition function, given an input, the set of state bits (the state set) is deterministic. Consequently the following equivalences hold:

L1 \ L2 ⇔ P1 ∧ ¬P2
L1 ∪ L2 ⇔ P1 ∨ P2
L1 ∩ L2 ⇔ P1 ∧ P2
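These equivalences can be exercised by running two transition systems in lockstep on the same input and combining their properties. The sketch below uses two small hand-written systems (for [ab]*a and a*) that are assumptions made for illustration, and compares each combined property against Python's `re` module on all words up to length 2.

```python
import re
from itertools import product

# System 1 for [ab]*a: s0 = "input so far is over {a, b}", s1 = "last character was a".
INIT1 = (True, False)


def step1(state, x):
    s0, _ = state
    return (s0 and x in ("a", "b"), s0 and x == "a")


# System 2 for a*: the single bit stays set while every input is 'a'.
INIT2 = (True,)


def step2(state, x):
    return (state[0] and x == "a",)


def run(init, step, word):
    """Apply the transition function to each character of word in turn."""
    state = init
    for x in word:
        state = step(state, x)
    return state


difference = []
for length in range(3):
    for letters in product("ab", repeat=length):
        w = "".join(letters)
        p1 = run(INIT1, step1, w)[1]   # P1
        p2 = run(INIT2, step2, w)[0]   # P2
        in1 = bool(re.fullmatch(r"[ab]*a", w))
        in2 = bool(re.fullmatch(r"a*", w))
        assert (p1 and not p2) == (in1 and not in2)   # L1 \ L2
        assert (p1 or p2) == (in1 or in2)             # L1 ∪ L2
        assert (p1 and p2) == (in1 and in2)           # L1 ∩ L2
        if p1 and not p2:
            difference.append(w)

print(difference)
```

Negation is sound here because each word determines a single state vector, so ¬P2 holds exactly when the word is rejected.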

slide-20
SLIDE 20

SMT solving with regular expressions

Using these Boolean combinators, I built Qzy, an SMT solver for regular expressions.

slide-21
SLIDE 21

Implementation and optimization

slide-22
SLIDE 22

Implementation

Built as a C++ library with Python and C++ APIs.

API similar to SMT solvers:

  • Multiple variables
  • Arbitrary Boolean combinators

Goal: feature compatible with RE2:

  • UTF-8 character classes
  • Begin/end of string/line markers
  • Word boundaries
  • Capture groups (working on it – more later)
  • Back references (not supported by RE2)
  • Look ahead (not supported by RE2)


slide-23
SLIDE 23

Start and end tags

Extend alphabet with special start and end characters

^ is (start|\n|\r|\r\n) (depending on matching mode)
$ is (end|\n|\r|\r\n) (depending on matching mode)

Enables:

  • Unanchored regular expressions
  • Begin/end of string/line markers
  • Multiple variables


slide-24
SLIDE 24

Multiple variables

Use a wide encoding: if a character is 8 bits wide, input for two variables is 16 bits.

Strings for different variables can have different lengths.

Start and end characters pad out strings so that all have the same length.

Start and end characters reveal the start and end of strings within counterexamples.
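A minimal sketch of the padding idea follows. The sentinel characters, the choice to pad only on the right with end characters, and the names `widen` and `narrow` are all assumptions for illustration, not Qzy's actual layout. Each wide symbol is a tuple holding one character per variable, and stripping the sentinels recovers the individual strings from a counterexample.

```python
START, END = "\x02", "\x03"  # hypothetical sentinel characters


def widen(*strings):
    """Pad every string to a common length and zip them into wide symbols."""
    width = max(len(s) for s in strings) + 2   # room for one START and one END
    padded = [START + s + END * (width - 1 - len(s)) for s in strings]
    return list(zip(*padded))


def narrow(symbols):
    """Recover the per-variable strings by dropping the sentinel characters."""
    return ["".join(ch for ch in column if ch not in (START, END))
            for column in zip(*symbols)]


wide = widen("ab", "a")
print(wide)
print(narrow(wide))
```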

slide-25
SLIDE 25

Optimizations

  • Alphabet compression
  • Regex structural hashing
  • Transition system structural hashing
  • SAT-simplification
  • Preprocessing-free IC3


slide-26
SLIDE 26

Ongoing project: Capture groups

slide-29
SLIDE 29

Capture group example

Anchored regular expression: (aa)|(([ab])*)

Input   Group 1   Group 2   Group 3
a       –         a         a
aa      aa        –         –
ba      –         ba        a

Rules:

  • Left gets priority: prioritized state vector
  • Last gets priority: most-recent tag policy
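The table can be reproduced with an existing matcher. The sketch below uses Python's `re` module, which is a backtracking engine rather than RE2, but it applies the same two rules on this example: the left alternative wins, and a repeated group reports its last iteration. The helper name `groups_of` is an assumption for illustration.

```python
import re

PATTERN = re.compile(r"(aa)|(([ab])*)")


def groups_of(s):
    """Return the three capture groups for an anchored match (None = did not participate)."""
    match = PATTERN.fullmatch(s)
    return match.groups() if match else None


for s in ["a", "aa", "ba"]:
    print(s, groups_of(s))
```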

slide-31
SLIDE 31

Configuration is a prioritized state set

Almost identical encoding. Before:

  • Configuration is a set of states

After:

  • Configuration is a sequence of states/tags
  • Each group has a start/end tag
  • Each tag is a bit encoding when the group starts/ends
  • Sequence encodes priority of a particular state


slide-32
SLIDE 32

Encoding is non-trivial in bits

Before: n states use n bits.
Now: n states and m groups use n² · 2m bits.

I plan on implementing this naive encoding. It is likely that lazy instantiation of these bits will be required for efficiency. This requires a more custom model checker.

slide-33
SLIDE 33

Conclusions

Qzy is an efficient (in practice!) and complete procedure for Boolean combinations of regular expression constraints.

It supports all features of RE2 except for capture groups (for now): UTF-8, case folding, complex character classes, anchors, word boundaries, etc.

It uses a linear time encoding to transition systems.

It uses IC3 to solve the resulting transition systems.

slide-34
SLIDE 34

Extra Slides

slide-35
SLIDE 35

Regular expression difference (unsat)

R = ^[01]*11[01]{n}$
C = ^[01]*1[01]{n + 1}$

slide-36
SLIDE 36

Regular expression difference (unsat)

[Plot: time (s) against parameter n from 5 to 30, for Qzy, Z3, Ostrich, and Sloth]

slide-37
SLIDE 37

Regular expression difference (unsat)

[Plot: time (s) against parameter n up to 2500, for Qzy, Z3, Ostrich, and Sloth]

slide-38
SLIDE 38

Regular expression intersection (sat)

∃x. x ∈ L(R) ∧ x ∈ L(C)

R = ^[01]*1[01]{n}$
C = ^[01]*0[01]{n − 1}$

slide-39
SLIDE 39

Regular expression intersection (sat)

[Plot: time (s) against parameter n from 5 to 30, for Qzy, CVC4, Z3, Ostrich, and Sloth]

slide-40
SLIDE 40

Regular expression intersection (sat)

[Plot: time (s) against parameter n up to 4000, for Qzy, CVC4, Z3, Ostrich, and Sloth]

slide-41
SLIDE 41

Regular expression intersection (unsat)

∃x. x ∈ L(R) ∧ x ∈ L(C)

R = ^[01]*1[01]{n}$
C = ^[01]*0[01]{n}$

slide-42
SLIDE 42

Regular expression intersection (unsat)

[Plot: time (s) against parameter n from 5 to 30, for Qzy, CVC4, Z3, Ostrich, and Sloth]

slide-43
SLIDE 43

Regular expression intersection (unsat)

[Plot: time (s) against parameter n up to 2500, for Qzy, CVC4, Z3, Ostrich, and Sloth]