slide-1
SLIDE 1

parsing with regular expressions and extensions to kleene algebra

Niels Bjørn Bugge Grathwohl DIKU, November 4th 2015

PhD Thesis defense

Kleene Meets Church

slide-2
SLIDE 2

string rewriting

Input:
1,John,john@gmail.com,male,123456,DK
2,Benny,benny@hotmail.com,male,98234,UK

→ Output:
John 123456
Benny 98234

Want:

  • Streaming – i.e., output while reading input.
  • Fast – several Gbps throughput per core.
  • Linear running time in the size of the input.
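
The rewriting task can be sketched in a few lines of Python. This is only an illustration of the input/output behaviour, not the approach of the talk: the field layout (id, name, email, gender, number, country) is assumed from the two sample records, and the function name `rewrite` is hypothetical.

```python
from typing import Iterable, Iterator

def rewrite(lines: Iterable[str]) -> Iterator[str]:
    """Yield 'name number' for each record as soon as the record has been read."""
    for line in lines:
        fields = line.rstrip("\n").split(",")
        # Assumed layout: id,name,email,gender,number,country
        if len(fields) == 6:
            yield f"{fields[1]} {fields[4]}"

records = [
    "1,John,john@gmail.com,male,123456,DK",
    "2,Benny,benny@hotmail.com,male,98234,UK",
]
output = list(rewrite(records))
print("\n".join(output))
```

Because `rewrite` is a generator, each output line is produced before the next input line is read, which is the streaming behaviour asked for above (at line granularity only).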
slide-3
SLIDE 3

regular expressions

Program is essentially a regular expression with outputs.

Regular expression syntax
E ::= 0 | 1 | a | E1 + E2 | E1E2 | E⋆    (a ∈ Σ)

Examples (Σ = {a, b})
a
(ab)⋆ + (a + b)⋆
(a + b)⋆

slide-4
SLIDE 4

what is regular expression “matching”?

Expression: (ab)⋆ + (a + b)⋆
Input: s = ababab

  • acceptance testing: is the input string a member of the language?
    Answer: “Yes!”
  • subgroup matching: substrings in the input for subterms in the expression.
    Answer: [0, 6], [4, 2]
  • parsing: what is the parse tree of the input?
    [Figure: parse tree with leaves ab, ab, ab and ()]

slide-5
SLIDE 5

acceptance testing

Input s matches E iff s ∈ L[[E]].

Language interpretation
L[[0]] = ∅
L[[1]] = {ϵ}
L[[a]] = {a}
L[[E + F]] = {s | s ∈ L[[E]]} ∪ {t | t ∈ L[[F]]}
L[[EF]] = {s · t | s ∈ L[[E]], t ∈ L[[F]]}
L[[E⋆]] = L[[E]]⋆
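
A minimal sketch of acceptance testing in Python. It uses Brzozowski derivatives, a standard technique that is not taken from these slides, and a hypothetical tuple encoding of expressions (`("alt", e, f)`, `("seq", e, f)`, `("star", e)`, and so on).

```python
ZERO, ONE = ("zero",), ("one",)

def nullable(e):
    """True iff the empty string is in the language of e."""
    tag = e[0]
    if tag in ("zero", "sym"):
        return False
    if tag == "alt":
        return nullable(e[1]) or nullable(e[2])
    if tag == "seq":
        return nullable(e[1]) and nullable(e[2])
    return True  # "one" and "star"

def deriv(e, c):
    """Expression for { w | c·w is in the language of e }."""
    tag = e[0]
    if tag in ("zero", "one"):
        return ZERO
    if tag == "sym":
        return ONE if e[1] == c else ZERO
    if tag == "alt":
        return ("alt", deriv(e[1], c), deriv(e[2], c))
    if tag == "seq":
        left = ("seq", deriv(e[1], c), e[2])
        return ("alt", left, deriv(e[2], c)) if nullable(e[1]) else left
    return ("seq", deriv(e[1], c), e)  # star

def matches(e, s):
    for c in s:
        e = deriv(e, c)
    return nullable(e)

a, b = ("sym", "a"), ("sym", "b")
E = ("alt", ("star", ("seq", a, b)), ("star", ("alt", a, b)))  # (ab)⋆ + (a + b)⋆
print(matches(E, "ababab"))  # True
```

The sketch does not simplify derivatives, so the expression grows with the input; it is meant to mirror the language interpretation, not to be fast.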

slide-6
SLIDE 6

acceptance testing

Example
L[[(ab)⋆ + (a + b)⋆]]
  = L[[(ab)⋆]] ∪ L[[(a + b)⋆]]
  = L[[ab]]⋆ ∪ L[[a + b]]⋆
  = {ab}⋆ ∪ {a, b}⋆
  = {ϵ, ab, abab, . . .} ∪ {ϵ, a, b, ab, ba, aba, . . .}
  = {ϵ, a, b, aa, ab, aaa, aab, . . .}

slide-7
SLIDE 7

parsing

Construct parse tree from input s such that flattening of parse tree is s.

Type interpretation [FC’04; HN’11]
T[[0]] = ∅
T[[1]] = {()}
T[[a]] = {a}
T[[E + F]] = {inl v | v ∈ T[[E]]} ∪ {inr w | w ∈ T[[F]]}
T[[EF]] = T[[E]] × T[[F]]
T[[E⋆]] = {[v1, . . . , vn] | n ≥ 0, vi ∈ T[[E]]}

Values in T[[E]] are parse trees.

slide-8
SLIDE 8

parsing

Example
T[[(ab)⋆ + (a + b)⋆]] contains the parse trees:

  • inl [(a, b), (a, b), (a, b)]
  • inr [inl a, inr b, inl a, inr b, inl a, inr b]

which are not in T[[(a + b)⋆]]!

So T[[(ab)⋆ + (a + b)⋆]] ≠ T[[(a + b)⋆]], whereas L[[(ab)⋆ + (a + b)⋆]] = L[[(a + b)⋆]].
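
To make "flattening" concrete, here is a small Python sketch. The value encoding is an assumption made for illustration: symbols are one-character strings, `()` is `("unit",)`, pairs are `("pair", v, w)`, injections are `("inl", v)` / `("inr", v)`, and star values are Python lists.

```python
def flatten(v):
    """Return the string obtained by reading the leaves of parse tree v left to right."""
    if isinstance(v, str):            # a symbol
        return v
    if isinstance(v, list):           # value of a starred expression
        return "".join(flatten(x) for x in v)
    tag = v[0]
    if tag == "unit":
        return ""
    if tag in ("inl", "inr"):
        return flatten(v[1])
    if tag == "pair":
        return flatten(v[1]) + flatten(v[2])
    raise ValueError(f"not a parse tree: {v!r}")

left_tree = ("inl", [("pair", "a", "b")] * 3)
right_tree = ("inr", [("inl", "a"), ("inr", "b")] * 3)
print(flatten(left_tree), flatten(right_tree))  # ababab ababab
```

Both trees flatten to the same string although they are different values, which is the distinction between the type interpretation and the language interpretation.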

slide-9
SLIDE 9

ambiguity

One input string can be parsed in multiple ways: ababab under E = (ab)⋆ + (a + b)⋆ can be parsed both as

inl [(a, b), (a, b), (a, b)]

and

inr [inl a, inr b, inl a, inr b, inl a, inr b]

Disambiguation policy: the left-most option is always prioritized. “Greedy parsing.”
slide-11
SLIDE 11

bit-coding

Bit-coded parse trees: only store choices. Parse tree as stream of bits; meaningless without expression!

Example: E = (ab)⋆ + (a + b)⋆, input ababab:

inl [(a, b), (a, b), (a, b)]                      00001
inr [inl a, inr b, inl a, inr b, inl a, inr b]    10001000100011
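
The two codes in the example can be reproduced with a short Python sketch. The convention is read off the example (0 for inl and for "one more iteration", 1 for inr and for "end of iteration"); the tuple encodings of expressions and values are hypothetical.

```python
def encode(e, v):
    """Bit-code parse tree v of expression e: record only the choices made."""
    tag = e[0]
    if tag in ("one", "sym"):
        return ""                                   # no choice to record
    if tag == "alt":
        side, inner = v                             # ("inl", w) or ("inr", w)
        if side == "inl":
            return "0" + encode(e[1], inner)
        return "1" + encode(e[2], inner)
    if tag == "seq":
        _, left, right = v                          # ("pair", w1, w2)
        return encode(e[1], left) + encode(e[2], right)
    if tag == "star":                               # v is a list of iterations
        return "".join("0" + encode(e[1], w) for w in v) + "1"
    raise ValueError(f"expression {e!r} has no parse trees")

a, b = ("sym", "a"), ("sym", "b")
E = ("alt", ("star", ("seq", a, b)), ("star", ("alt", a, b)))  # (ab)⋆ + (a + b)⋆

left_tree = ("inl", [("pair", "a", "b")] * 3)
right_tree = ("inr", [("inl", "a"), ("inr", "b")] * 3)
print(encode(E, left_tree))   # 00001
print(encode(E, right_tree))  # 10001000100011
```

Note that `encode` needs the expression to interpret the value, which is the sense in which the bit stream is meaningless without it.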

slide-12
SLIDE 12

finite state transducers

  • Thompson FSTs with input alphabet Σ, output alphabet {0, 1}.
  • Construction:

E      N(E, qs, qf)
0      start state qs and final state qf, no transition
1      start state qs only (qf = qs)
a      qs --a/ϵ--> qf

slide-13
SLIDE 13

finite state transducers

E          N(E, qs, qf)
E1E2       N(E1, qs, q′) followed by N(E2, q′, qf)
E1 + E2    qs --ϵ/0--> qs1 and qs --ϵ/1--> qs2, then N(E1, qs1, qf1) and N(E2, qs2, qf2),
           then qf1 --ϵ/ϵ--> qf and qf2 --ϵ/ϵ--> qf
E0⋆        [Figure: states qs, q′ and qf around N(E0, qs0, qf0); transitions labeled ϵ/0, ϵ/1, ϵ/ϵ, ϵ/ϵ]

slide-14
SLIDE 14

parse trees as paths

Theorem (Brüggemann-Klein 1993, GHNR 2013) 1-to-1 correspondence between

  • parse trees for E,
  • paths in Thompson FST for E,
  • bit-coded parse trees.

Constructing the parse tree corresponds to finding a path through the FST.
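
A sketch of the correspondence in Python: build a Thompson-style FST and search it depth-first in priority order, so that the first accepting path yields the greedy bit-code. The wiring for E⋆ and the ϵ/ϵ edge used for the expression 1 are assumptions of this sketch (the slides identify qs and qf for 1), and all names are hypothetical. The backtracking search is exponential in the worst case and is not the algorithm of the talk.

```python
from collections import defaultdict
from itertools import count

def sym(c):
    return ("sym", c)

def seq(*es):
    """Right-nested concatenation E1 E2 ... En."""
    return es[0] if len(es) == 1 else ("seq", es[0], seq(*es[1:]))

def build(e, qs, qf, trans, fresh):
    """Add transitions (input, output, target) for N(e, qs, qf), in priority order."""
    tag = e[0]
    if tag == "zero":
        return
    if tag == "one":
        trans[qs].append((None, "", qf))          # assumption: ϵ/ϵ edge instead of qf = qs
    elif tag == "sym":
        trans[qs].append((e[1], "", qf))
    elif tag == "seq":
        mid = next(fresh)
        build(e[1], qs, mid, trans, fresh)
        build(e[2], mid, qf, trans, fresh)
    elif tag == "alt":
        s1, f1, s2, f2 = (next(fresh) for _ in range(4))
        trans[qs] += [(None, "0", s1), (None, "1", s2)]
        build(e[1], s1, f1, trans, fresh)
        build(e[2], s2, f2, trans, fresh)
        trans[f1].append((None, "", qf))
        trans[f2].append((None, "", qf))
    elif tag == "star":
        loop, s0, f0 = (next(fresh) for _ in range(3))
        trans[qs].append((None, "", loop))
        trans[loop] += [(None, "0", s0), (None, "1", qf)]
        build(e[1], s0, f0, trans, fresh)
        trans[f0].append((None, "", loop))
    else:
        raise ValueError(f"unknown expression {e!r}")

def greedy_bits(e, s):
    """Bit-code of the first accepting path in priority order, or None."""
    trans = defaultdict(list)
    build(e, 0, 1, trans, count(2))               # state 0 is the start, state 1 the final state
    seen = set()                                  # (state, position) pairs already explored

    def walk(q, i):
        if (q, i) in seen:
            return None
        seen.add((q, i))
        if q == 1 and i == len(s):
            return ""
        for inp, out, target in trans[q]:
            if inp is None:
                rest = walk(target, i)
            elif i < len(s) and s[i] == inp:
                rest = walk(target, i + 1)
            else:
                rest = None
            if rest is not None:
                return out + rest
        return None

    return walk(0, 0)

a, b = sym("a"), sym("b")
E1 = ("alt", ("star", seq(a, b)), ("star", ("alt", a, b)))   # (ab)⋆ + (a + b)⋆
E2 = ("star", ("alt", seq(a, a, a), seq(a, a)))              # (aaa + aa)⋆
print(greedy_bits(E1, "ababab"))  # 00001
print(greedy_bits(E2, "aaaa"))    # 01011
```

The output labels along the chosen path are exactly the bit-coded parse tree, so decoding them against the expression recovers the tree itself.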

slide-15
SLIDE 15
optimal streaming

Optimally streaming parsing: output the longest common prefix of the possible parse trees after reading each input symbol.

Example: E = (aaa + aa)⋆
Possible parse tree prefixes after aaaa: {01011, 000 . . .}
Possible parse tree prefixes after aaaaa: {00011, 0000 . . .}
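
The output rule can be illustrated directly on the two candidate sets from the example. This Python sketch treats the elided prefixes "000 . . ." and "0000 . . ." as the strings "000" and "0000", which is an assumption; computing the candidate sets themselves is the hard part and is not shown.

```python
from os.path import commonprefix

def newly_committed(candidates, already_emitted):
    """Bits that every candidate agrees on and that have not been output yet."""
    prefix = commonprefix(list(candidates))
    return prefix[len(already_emitted):]

emitted = ""
steps = []
for candidates in (["01011", "000"], ["00011", "0000"]):   # after aaaa, after aaaaa
    new_bits = newly_committed(candidates, emitted)
    emitted += new_bits
    steps.append(new_bits)

print(steps)  # ['0', '00']
```

`os.path.commonprefix` compares character by character, so it works on arbitrary strings and not only on paths.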

slide-16
SLIDE 16

greedy parsing

                   Time                     Space   Aux    Answer
Parse (3-p) [1]    O(mn)                    O(m)    O(n)   greedy parse
Parse (2-p) [2]    O(mn)                    O(m)    O(n)   greedy parse
Parse (str.) [3]   O(mn + 2^(m log m))      O(m)    O(n)   greedy parse

(n: size of input, m: size of expression)

[1] Frisch, Cardelli (2004)
[2] Grathwohl, Henglein, Nielsen, Rasmussen (2013)
[3] Grathwohl, Henglein, Rasmussen (2014)

slide-18
SLIDE 18

fst simulation

Optimally streaming algorithm

  • Preprocessing step of FST: compute coverage of state sets.
  • Maintain a path tree during FST simulation, recording the path taken to each state in the FST.
  • Prune states that are covered by higher-prioritized states.
  • Output on the stem of the path tree is the longest common prefix of any succeeding parse.

Theorem (GHR’14)
The optimally streaming algorithm computes the optimally streaming parsing function in time O(mn + 2^(m log m)).

slide-19
SLIDE 19

path tree example: (aaa + aa)⋆

[Figure: Thompson FST for (aaa + aa)⋆ with states 1 to 11; transitions labeled ϵ/ϵ, ϵ/0, ϵ/1 and a/ϵ.]

slide-20
SLIDE 20

path tree example: (aaa + aa)⋆

[Figure: Thompson FST for (aaa + aa)⋆ (states 1 to 11) and the path tree before any input is read (ϵ), over states 1, 2, 3, 7, 11.]

slide-21
SLIDE 21

path tree example: (aaa + aa)⋆

[Figure: Thompson FST for (aaa + aa)⋆ and the path tree after the first a; states 4 and 8 are added below 1, 2, 3, 7, 11.]

slide-22
SLIDE 22

path tree example: (aaa + aa)⋆

[Figure: Thompson FST for (aaa + aa)⋆ and the path tree after the second a; states 9, 10, 1, 2, 7, 11 and 5 are added.]

slide-23
SLIDE 23

path tree example: (aaa + aa)⋆

[Figure: Thompson FST for (aaa + aa)⋆ and the path tree after the third a; states 6, 10, 1, 2, 3, 7, 11 and 8 are added.]

slide-24
SLIDE 24

path tree example: (aaa + aa)⋆

[Figure: Thompson FST for (aaa + aa)⋆ and the path tree after the fourth a; states 4, 8, 9, 10, 1 and 11 are added.]

slide-25
SLIDE 25

path tree example: (aaa + aa)⋆

[Figure: Thompson FST for (aaa + aa)⋆ and the path tree after the fifth a; states 5, 9, 10, 1, 2, 7 and 11 are added, and the figure shows the bit strings 00 and 00.]

slide-26
SLIDE 26

kleenex

Observation: the approach is not limited to Thompson FSTs outputting bit-coded parse trees.

Kleenex is a surface language for specifying FSTs and their output:

  • grammar with greedy disambiguation;
  • embedded output actions.
  • Essentially optimally streaming behaviour.
  • Linear running time in size of input string.
  • Fast. >1 Gbps common.
slide-27
SLIDE 27

kleenex

“100000000000” → “100,000,000,000”

  • Problem: need to read entire number; no bounded lookahead!
  • But: each newline ends a number, so output.
  • Optimal streaming gives this for free!
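
The lookahead problem shows up even in a plain Python sketch of the transformation: the position of the first comma depends on the total number of digits, so nothing can be output before the end of the number is seen. The function names are hypothetical, and this is ordinary Python rather than Kleenex.

```python
def group_digits(number: str) -> str:
    """Insert a comma before every group of three digits, counted from the right."""
    head = len(number) % 3 or 3          # length of the leading group (1 to 3 digits)
    groups = [number[:head]]
    groups += [number[i:i + 3] for i in range(head, len(number), 3)]
    return ",".join(groups)

def separate_thousands(lines):
    """Process one number per line; each newline ends a number, so output then."""
    for line in lines:
        yield group_digits(line.rstrip("\n"))

result = list(separate_thousands(["100000000000\n", "1234\n", "12\n"]))
print(result)  # ['100,000,000,000', '1,234', '12']
```

Here the buffering per line is explicit in the code; the point of the slide is that an optimally streaming transducer arrives at the same behaviour without it being programmed by hand.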
slide-28
SLIDE 28

determinization

Path tree algorithm is “NFA simulation with path trees as state sets.”

Compilation of FSTs? Analogy to NFA-DFA determinization with subset construction?

Problem: Infinite number of path trees!

Solution: contract unary paths in path trees and store output in registers.

slide-31
SLIDE 31

determinization

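[Figure: path tree for (aaa + aa)⋆ after the fourth a, with leaves 4, 8 and 11, next to its contraction: unary paths are collapsed and the edges are named by registers x, x0, x00, x01 and x1, holding among others the bit strings 00 (x0), 1011 (x1) and 1 (x01).]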

slide-33
SLIDE 33

determinization

[Figure: path tree for (aaa + aa)⋆ after the fourth a and its contraction with leaves 4, 8 and 11; edges are named by registers xϵ, x0, x00, x01, x1.]

Register contents:
xϵ := 0
x0 := 00
x1 := 1011
x00 := 0
x01 := 1

slide-34
SLIDE 34

determinization

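[Figure: contracted path tree with leaves 4, 8, 11 and registers xϵ := 0, x0 := 00, x1 := 1011, x00 := 0, x01 := 1, next to the following contracted path tree with leaves 5, 7, 11, whose edges are expressed in terms of the registers x, x0, x00, x1, x01.]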

slide-35
SLIDE 35

determinization

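[Figure: contracted path tree with leaves 4, 8, 11 and registers xϵ := 0, x0 := 00, x1 := 1011, x00 := 0, x01 := 1, next to the following contracted path tree with leaves 5, 7, 11.]

Register updates on the transition:
x′ϵ := xϵ · x0
x′0 := x00
x′1 := x01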

slide-37
SLIDE 37

determinization

  • Streaming string transducer:
      • deterministic finite automaton,
      • each state equipped with a fixed number of registers containing strings,
      • registers updated on transition by affine function;
      • Alur, D’Antoni, Raghothaman (2015).

Theorem
FSTs with greedy order semantics correspond to SSTs.

  • States are contracted path trees.
  • Edges in contracted path trees ≅ registers in SST.
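
A toy streaming string transducer in Python, to make the register mechanics concrete. The example transducer (move all a's in front of all b's) is a textbook-style illustration chosen for this sketch and is not one of the SSTs generated from an FST; the class `R` and the function names are hypothetical. Every update uses each register at most once, which is the affine (copyless) restriction.

```python
from typing import NamedTuple

class R(NamedTuple):
    """Reference to a register inside an update expression."""
    name: str

def concat(pieces, regs):
    """Evaluate a sequence of literals and register references to a string."""
    return "".join(regs[p.name] if isinstance(p, R) else p for p in pieces)

def run_sst(trans, out, start, reg_names, text):
    state, regs = start, dict.fromkeys(reg_names, "")
    for ch in text:
        state, update = trans[(state, ch)]
        old = regs                                   # all right-hand sides read old values
        regs = {r: concat(update.get(r, [R(r)]), old) for r in reg_names}
    return concat(out[state], regs)

# One state q, two registers: x collects the a's, y collects the b's.
trans = {
    ("q", "a"): ("q", {"x": [R("x"), "a"]}),
    ("q", "b"): ("q", {"y": [R("y"), "b"]}),
}
out = {"q": [R("x"), R("y")]}                        # final output: x · y

print(run_sst(trans, out, "q", ["x", "y"], "babab"))  # aabbb
```

The control is a deterministic automaton with one transition per input symbol; all unbounded information lives in the registers, which play the role of the edges of the contracted path trees.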

slide-38
SLIDE 38

determinization

[Figure: SST obtained by determinization; register updates on its transitions, in extraction order:]

     x0, x00, x10, x100 := 0;  x01, x1, x11, x101 := 1
a/   x0 := (x0)(x00);  x1 := (x1)(x10)(x100);  x00, x100, x10 := 0;  x01, x101, x11 := 1
b/   x0 := (x0)(x01);  x1 := (x1)(x10)(x101)0;  x10 := 0;  x11 := 1
b/   xϵ := (xϵ)(x1)(x11);  x0, x00 := 0;  x1, x01 := 1
a/   xϵ := (xϵ)(x1)(x10);  x0, x00 := 0;  x1, x01 := 1
a/   xϵ := (xϵ)(x0)(x00);  x0, x00 := 0;  x1, x01 := 1
b/   x0, x00 := 0;  x1, x01 := 1;  xϵ := (xϵ)(x0)(x01)

slide-39
SLIDE 39

implementation

Haskell implementation
Kleenex source → FST → SST → C
C code compiled with GCC/clang

Performance comparison with regular expression libraries:

  • AWK, Perl, Python, Sed, Tcl
  • RE2/RE2j
  • Ragel state machine compiler

https://github.com/diku-kmc/kleenexlang

slide-40
SLIDE 40

performance

slide-41
SLIDE 41

future work

  • Program fragments as output actions.
  • Memoization techniques à la NFA/DFA memoization in RE2.
  • Applications: bioinformatics, finance, log digging, ...
  • Parallel processing: read >8 bits in parallel.
  • Approximate matching, necessary in biological applications.
  • Expressiveness, visibly pushdown automata.
  • Automatically generate interfaces for various programming languages.

slide-42
SLIDE 42

Kleene algebra

slide-43
SLIDE 43

kleene algebra

Kleene algebra A structure (K, +, ·, ⋆, 0, 1):

  • A set of elements K,
  • binary operators + and ·,
  • unary operator ⋆,
  • special elements 0 and 1,

that satisfies the Kleene algebra axioms.

slide-44
SLIDE 44

kleene algebra

Semiring
x · (y · z) = (x · y) · z
x + (y + z) = (x + y) + z
1 · x = x = x · 1
0 + x = x = x + 0
x + y = y + x
x · (y + z) = x · y + x · z
(x + y) · z = x · z + y · z
0 · x = 0 = x · 0

  • idempotence: x + x = x
  • partial order: x ≤ y ⇔ x + y = y
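
As a sanity check, the axioms can be tested on a concrete idempotent semiring. The Python sketch below uses the tropical semiring (min as +, ordinary addition as ·, ∞ as 0 and the number 0 as 1) over a handful of sample values; checking samples is of course no proof, and the function names are hypothetical.

```python
import itertools
import math

ZERO, ONE = math.inf, 0           # units of the tropical semiring

def add(x, y):                    # semiring +
    return min(x, y)

def mul(x, y):                    # semiring ·
    return x + y

def leq(x, y):                    # x ≤ y  ⇔  x + y = y
    return add(x, y) == y

def is_idempotent_semiring(samples):
    for x, y, z in itertools.product(samples, repeat=3):
        ok = (
            mul(x, mul(y, z)) == mul(mul(x, y), z)
            and add(x, add(y, z)) == add(add(x, y), z)
            and mul(ONE, x) == x == mul(x, ONE)
            and add(ZERO, x) == x == add(x, ZERO)
            and add(x, y) == add(y, x)
            and mul(x, add(y, z)) == add(mul(x, y), mul(x, z))
            and mul(add(x, y), z) == add(mul(x, z), mul(y, z))
            and mul(ZERO, x) == ZERO == mul(x, ZERO)
            and add(x, x) == x
        )
        if not ok:
            return False
    return True

samples = [ZERO, ONE, 1, 2, 7]
print(is_idempotent_semiring(samples))  # True
print(leq(3, 1), leq(1, 3))             # True False
```

The last line shows that the induced partial order is the reverse of the numeric order: 3 ≤ 1 holds in the semiring because min(3, 1) = 1.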

slide-45
SLIDE 45

kleene algebra

Kleene algebra
Idempotent semiring with ⋆ operator:

1 + x · x⋆ ≤ x⋆        b + a · x ≤ x ⟹ a⋆ · b ≤ x
1 + x⋆ · x ≤ x⋆        b + x · a ≤ x ⟹ b · a⋆ ≤ x

slide-46
SLIDE 46

kleene algebra models

Any structure with these operators that satisfies the axioms is a Kleene algebra.

  • Languages: (L, ∪, ·, ⋆, ∅, {ϵ}).
      • L is the set of languages (sets of strings) over an alphabet,
      • ∪ is set union,
      • · is concatenation of languages,
      • ⋆ is repetition (Kleene star),
      • partial order ≤ is subset inclusion ⊆.

Language interpretation of regular expressions from before.

  • Relation model, tropical semiring, ...
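
The relation model is easy to experiment with. In this Python sketch, elements are binary relations on a small finite set, + is union, · is relational composition, ⋆ is reflexive-transitive closure, 0 is the empty relation and 1 is the identity; the sample relations `a` and `b` are arbitrary choices made for illustration.

```python
def compose(r, s):
    """Relational composition r · s."""
    return frozenset((p, q2) for (p, q1) in r for (p2, q2) in s if q1 == p2)

def star(r, universe):
    """Reflexive-transitive closure of r over the given universe."""
    closure = frozenset((u, u) for u in universe)      # the element 1
    while True:
        bigger = closure | compose(closure, r)
        if bigger == closure:
            return closure
        closure = bigger

universe = range(3)
one = frozenset((u, u) for u in universe)
a = frozenset({(0, 1), (1, 2)})
b = frozenset({(2, 2)})
a_star = star(a, universe)

# Unfolding axioms: 1 + a · a⋆ ≤ a⋆ and 1 + a⋆ · a ≤ a⋆ (≤ is ⊆ here).
unfold_left = (one | compose(a, a_star)) <= a_star
unfold_right = (one | compose(a_star, a)) <= a_star

# x = a⋆ · b satisfies the premise b + a · x ≤ x of the induction axiom.
x = compose(a_star, b)
premise = (b | compose(a, x)) <= x

print(unfold_left, unfold_right, premise)  # True True True
```

The induction axiom says that a⋆ · b is the least such x; the sketch only checks that it is one solution, not that it is the least.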
slide-47
SLIDE 47

kleene algebra

“Regular expressions:” syntax to describe elements in a Kleene algebra.

Canonical interpretation
The canonical interpretation of a term E is the regular language interpretation:

LΣ(x) = {x}
LΣ(0) = ∅
LΣ(1) = {ϵ}
LΣ(e0 + e1) = LΣ(e0) ∪ LΣ(e1)
LΣ(e0e1) = {vw | v ∈ LΣ(e0), w ∈ LΣ(e1)}
LΣ(e⋆) = ⋃_{n≥0} LΣ(e^n).

slide-48
SLIDE 48

polynomials

Polynomials
Given an idempotent semiring C and a set of variables X, form polynomials over C and X:

a
ax² + bxy³ + 1
1 + a + ax + by

System of polynomial inequalities
1 + aB + bA ≤ S
A + aS + bAA ≤ A
bS + aBB ≤ B

slide-49
SLIDE 49

chomsky algebra

Solution: valuation of variables in X such that the inequalities are satisfied.

A semiring C is algebraically closed if all finite systems of polynomial inequalities have least solutions.

Definition
A Chomsky algebra is an algebraically closed idempotent semiring.

slide-50
SLIDE 50

chomsky algebra

Context-free languages over symbols from X are not Kleene algebras, but they are Chomsky algebras.

A context-free grammar corresponds to a system of polynomial inequalities:

S → ϵ | aB | bA        1 + aB + bA ≤ S
A → aS | bAA           aS + bAA ≤ A
B → bS | aBB           bS + aBB ≤ B
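
The least solution of this system can be approximated by fixed-point iteration from the empty languages. The Python sketch below truncates every language to strings of at most `limit` symbols so that the iteration terminates; the truncation and the helper names are assumptions of the sketch.

```python
from itertools import product

def cat(*langs, limit):
    """Concatenate languages left to right, dropping strings longer than limit."""
    result = {""}
    for lang in langs:
        result = {u + v for u in result for v in lang if len(u + v) <= limit}
    return result

def least_solution(limit):
    """Iterate the right-hand sides of the system until nothing changes; return S."""
    S, A, B = set(), set(), set()
    a, b = {"a"}, {"b"}
    while True:
        S2 = {""} | cat(a, B, limit=limit) | cat(b, A, limit=limit)   # 1 + aB + bA
        A2 = cat(a, S, limit=limit) | cat(b, A, A, limit=limit)       # aS + bAA
        B2 = cat(b, S, limit=limit) | cat(a, B, B, limit=limit)       # bS + aBB
        if (S2, A2, B2) == (S, A, B):
            return S
        S, A, B = S2, A2, B2

# Reference set: all strings of length at most 4 with equally many a's and b's.
balanced = {
    "".join(w)
    for n in range(5)
    for w in product("ab", repeat=n)
    if w.count("a") == w.count("b")
}
print(sorted(least_solution(4)) == sorted(balanced))  # True
```

Up to the length bound, the value computed for S is the set of strings with equally many a's and b's, which is the language of the grammar on the left.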

slide-51
SLIDE 51

µ-terms

  • Regular expressions: denote elements in Kleene algebra.
  • µ-terms: denote elements in Chomsky algebra.

µ-terms
T X are µ-terms over an alphabet X:

t ::= 0 | 1 | x | t + t | t · t | µx.t    (x ∈ X)

slide-52
SLIDE 52

µ-terms

n-fold composition
0x.t ≡ 0
(n + 1)x.t ≡ t[x/nx.t]

Examples
0x.axb + 1 = 0
1x.axb + 1 = a(0x.axb + 1)b + 1 = a0b + 1 = 1
2x.axb + 1 = a(1x.axb + 1)b + 1 = ab + 1
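
The same unfoldings can be computed as languages. This Python sketch is specialised to the single term t = axb + 1 from the examples; the function name `unfold` is hypothetical.

```python
def unfold(n):
    """Language of nx.(axb + 1): start from 0 and substitute n times."""
    lang = set()                                        # 0x.t ≡ 0, the empty language
    for _ in range(n):
        lang = {"a" + w + "b" for w in lang} | {""}     # t[x/lang] for t = axb + 1
    return lang

print(unfold(0))          # set()
print(unfold(1))          # {''}
print(sorted(unfold(3)))  # ['', 'aabb', 'ab']
```

Each further unfolding adds the next string of the form aⁿbⁿ, so the union over all n is the language { aⁿbⁿ | n ≥ 0 }.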

slide-53
SLIDE 53

µ-terms, interpretation

Given an interpretation of literals σ : X → C, the interpretation of µ-terms over a Chomsky algebra C is the function σ : T X → C where:

σ(0) = 0
σ(1) = 1
σ(a + b) = σ(a) + σ(b)
σ(a · b) = σ(a) · σ(b)
σ(µx.t) = least a ∈ C such that σ[x/a](t) ≤ a

slide-54
SLIDE 54

µ-terms, interpretation

Canonical interpretation
Canonical interpretation as context-free languages:

LX(x) = {x}
LX(0) = ∅
LX(1) = {ϵ}
LX(t0 + t1) = LX(t0) ∪ LX(t1)
LX(t0 · t1) = {vw | v ∈ LX(t0), w ∈ LX(t1)}
LX(µx.t) = ⋃_{n≥0} LX(nx.t).

slide-55
SLIDE 55

µ-continuity

∑_{n≥0} t_n denotes the supremum with respect to the partial order ≤.

µ-continuity
A Chomsky algebra C is µ-continuous if

σ(a(µx.t)b) = ∑_{n≥0} σ(a(nx.t)b)

for any interpretation σ over C.

The canonical interpretation as context-free languages is µ-continuous:

LX(µx.t) = ⋃_{n≥0} LX(nx.t).

slide-56
SLIDE 56

main result

Theorem
The following are equivalent:
(i) s = t holds in all µ-continuous Chomsky algebras,
(ii) LX(s) = LX(t) holds in the canonical interpretation as a context-free language over variables X.

slide-57
SLIDE 57

axiomatization

Two context-free languages LX(s) and LX(t) are equivalent if and only if s = t is provable from the axioms of µ-continuous Chomsky algebra:

Axioms
x · (y · z) = (x · y) · z
x + (y + z) = (x + y) + z
1 · x = x = x · 1
0 + x = x = x + 0
x · (y + z) = x · y + x · z
(x + y) · z = x · z + y · z
x + y = y + x
x + x = x
0 · x = 0 = x · 0
a(µx.t)b = ∑_{n≥0} a(nx.t)b

slide-58
SLIDE 58

axiomatization

µ-continuity axiom is infinitary:

a(nx.t)b ≤ a(µx.t)b,  n ≥ 0
( ⋀_{n≥0} (a(nx.t)b ≤ w) ) ⟹ a(µx.t)b ≤ w

  • Equivalence of context-free languages is undecidable.
  • To use the inference, one must establish infinitely many premises.

slide-59
SLIDE 59

summary, further directions

  • Extend Chomsky algebra with test symbols, analogously to Kleene algebra with tests.
  • Coalgebraic treatment of Chomsky algebra?
  • Applications to program verification, like Kleene algebra?
  • “Visibly pushdown” Chomsky algebra?
  • KAT+B! is an extension to Kleene algebra with tests adding mutable state:
      • elements correspond to square matrices with regular language entries.
      • extend Kleenex with mutable state?
slide-60
SLIDE 60

Thank you