parsing with regular expressions and extensions to kleene algebra

parsing with regular expressions and extensions to kleene algebra
Niels Bjørn Bugge Grathwohl
DIKU, November 4th 2015
PhD Thesis defense
Kleene Meets Church

string rewriting
1,John,john@gmail.com,male,123456,DK
2,Benny,benny@hotmail.com,male,98234,UK
→
John 123456
Benny 98234
Want:
• Streaming, i.e., output while reading input.
• Fast: several Gbps throughput per core.
• Linear running time in the size of the input.

regular expressions
Program is essentially a regular expression with outputs.
Regular expression syntax:
$E ::= 0 \mid 1 \mid a \mid E_1 + E_2 \mid E_1 E_2 \mid E^{\star} \quad (a \in \Sigma)$
Examples ($\Sigma = \{a, b\}$):
• $a$
• $(ab)^{\star} + (a+b)^{\star}$
• $(a+b)^{\star}$

what is regular expression “matching”?
Expression $(ab)^{\star} + (a+b)^{\star}$, input $s = ababab$.
• acceptance testing: is the input string a member of the language? Answer: “Yes!”
• subgroup matching: substrings in input for subterms in expression. Answer: [0, 6], [4, 2]
• parsing: what is the parse tree of the input? Answer: (parse tree figure: ab ab ab ())

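acceptance testing
Language interpretation: input $s$ matches $E$ iff $s \in \mathcal{L}[\![E]\!]$.
$\mathcal{L}[\![0]\!] = \emptyset$
$\mathcal{L}[\![1]\!] = \{\epsilon\}$
$\mathcal{L}[\![a]\!] = \{a\}$
$\mathcal{L}[\![E + F]\!] = \{s \mid s \in \mathcal{L}[\![E]\!]\} \cup \{t \mid t \in \mathcal{L}[\![F]\!]\}$
$\mathcal{L}[\![EF]\!] = \{s \cdot t \mid s \in \mathcal{L}[\![E]\!],\ t \in \mathcal{L}[\![F]\!]\}$
$\mathcal{L}[\![E^{\star}]\!] = \mathcal{L}[\![E]\!]^{\star}$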
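
The language interpretation above can be turned into a naive membership test almost clause by clause. The following is a minimal sketch, not the algorithm from the talk: it takes exponential time in the worst case, and the nested-tuple encoding of expressions and the name `matches` are assumptions made for illustration.

```python
# Hypothetical encoding of regular expressions as nested tuples:
#   ("0",)  ("1",)  ("sym", "a")  ("alt", E, F)  ("seq", E, F)  ("star", E)

def matches(e, s):
    """Return True iff s is in L[[e]], following the set definitions directly."""
    tag = e[0]
    if tag == "0":
        return False                      # L[[0]] is empty
    if tag == "1":
        return s == ""                    # L[[1]] = {epsilon}
    if tag == "sym":
        return s == e[1]                  # L[[a]] = {a}
    if tag == "alt":
        return matches(e[1], s) or matches(e[2], s)
    if tag == "seq":
        # try every way of splitting s into s1 . s2
        return any(matches(e[1], s[:i]) and matches(e[2], s[i:])
                   for i in range(len(s) + 1))
    if tag == "star":
        # s is empty, or s = u . v with u non-empty in L[[E]] and v in L[[E*]]
        return s == "" or any(matches(e[1], s[:i]) and matches(e, s[i:])
                              for i in range(1, len(s) + 1))
    raise ValueError(f"unknown node {tag!r}")

A, B = ("sym", "a"), ("sym", "b")
AB_STAR = ("star", ("seq", A, B))
E = ("alt", AB_STAR, ("star", ("alt", A, B)))   # (ab)* + (a+b)*

print(matches(E, "ababab"))        # True
print(matches(AB_STAR, "aba"))     # False
```

Requiring a non-empty first iteration in the star case keeps the recursion terminating even when the starred expression accepts the empty string.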

acceptance testing
Example:
$\mathcal{L}[\![(ab)^{\star} + (a+b)^{\star}]\!]$
$= \mathcal{L}[\![(ab)^{\star}]\!] \cup \mathcal{L}[\![(a+b)^{\star}]\!]$
$= \mathcal{L}[\![ab]\!]^{\star} \cup \mathcal{L}[\![a+b]\!]^{\star}$
$= \{ab\}^{\star} \cup \{a, b\}^{\star}$
$= \{\epsilon, ab, abab, \ldots\} \cup \{\epsilon, a, b, ab, ba, aba, \ldots\}$
$= \{\epsilon, a, b, aa, ab, aaa, aab, \ldots\}$

parsing
Construct parse tree from input $s$ such that flattening of parse tree is $s$.
Type interpretation [FC’04; HN’11]:
$\mathcal{T}[\![0]\!] = \emptyset$
$\mathcal{T}[\![1]\!] = \{()\}$
$\mathcal{T}[\![a]\!] = \{a\}$
$\mathcal{T}[\![E + F]\!] = \{\mathrm{inl}\ v \mid v \in \mathcal{T}[\![E]\!]\} \cup \{\mathrm{inr}\ w \mid w \in \mathcal{T}[\![F]\!]\}$
$\mathcal{T}[\![EF]\!] = \mathcal{T}[\![E]\!] \times \mathcal{T}[\![F]\!]$
$\mathcal{T}[\![E^{\star}]\!] = \{[v_1, \ldots, v_n] \mid n \geq 0,\ v_i \in \mathcal{T}[\![E]\!]\}$
Values in $\mathcal{T}[\![E]\!]$ are parse trees.

parsing
Example: $\mathcal{T}[\![(ab)^{\star} + (a+b)^{\star}]\!]$ contains the parse trees:
• inl [(a, b), (a, b), (a, b)]
• inr [inl a, inr b, inl a, inr b, inl a, inr b]
which are not in $\mathcal{T}[\![(a+b)^{\star}]\!]$!
So $\mathcal{T}[\![(ab)^{\star} + (a+b)^{\star}]\!] \neq \mathcal{T}[\![(a+b)^{\star}]\!]$,
whereas $\mathcal{L}[\![(ab)^{\star} + (a+b)^{\star}]\!] = \mathcal{L}[\![(a+b)^{\star}]\!]$.
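
To make the type interpretation concrete, here is a small sketch that represents parse trees as tagged tuples, flattens them back to strings, and checks whether a value belongs to $\mathcal{T}[\![E]\!]$. The tuple encodings and the names `flatten` and `in_T` are assumptions chosen for illustration; they are not notation from the talk.

```python
# Hypothetical encodings (nested tuples):
#   expressions: ("0",) ("1",) ("sym", "a") ("alt", E, F) ("seq", E, F) ("star", E)
#   parse trees: ("unit",) ("chr", "a") ("inl", v) ("inr", w) ("pair", v, w) ("list", [v1, ...])

def flatten(v):
    """Concatenate the characters at the leaves of a parse tree."""
    tag = v[0]
    if tag == "unit":
        return ""
    if tag == "chr":
        return v[1]
    if tag in ("inl", "inr"):
        return flatten(v[1])
    if tag == "pair":
        return flatten(v[1]) + flatten(v[2])
    if tag == "list":
        return "".join(flatten(x) for x in v[1])
    raise ValueError(f"unknown tree node {tag!r}")

def in_T(e, v):
    """Return True iff the value v is a parse tree in T[[e]]."""
    et, vt = e[0], v[0]
    if et == "1":
        return vt == "unit"
    if et == "sym":
        return vt == "chr" and v[1] == e[1]
    if et == "alt":
        return ((vt == "inl" and in_T(e[1], v[1])) or
                (vt == "inr" and in_T(e[2], v[1])))
    if et == "seq":
        return vt == "pair" and in_T(e[1], v[1]) and in_T(e[2], v[2])
    if et == "star":
        return vt == "list" and all(in_T(e[1], x) for x in v[1])
    return False                          # et == "0": T[[0]] is empty

A, B = ("sym", "a"), ("sym", "b")
A_OR_B_STAR = ("star", ("alt", A, B))
E = ("alt", ("star", ("seq", A, B)), A_OR_B_STAR)   # (ab)* + (a+b)*

# inl [(a, b), (a, b), (a, b)]
T1 = ("inl", ("list", [("pair", ("chr", "a"), ("chr", "b"))] * 3))

print(flatten(T1))               # ababab
print(in_T(E, T1))               # True
print(in_T(A_OR_B_STAR, T1))     # False
```

The last two lines mirror the slide: the tree is a value of the first expression but not of $(a+b)^{\star}$, even though both expressions denote the same language.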

ambiguity
One input string can be parsed in multiple ways: $ababab$ under $E = (ab)^{\star} + (a+b)^{\star}$ can be parsed both as
• inl [(a, b), (a, b), (a, b)]
and
• inr [inl a, inr b, inl a, inr b, inl a, inr b]
Disambiguation policy: the left-most option is always prioritized. “Greedy parsing.”
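
The ambiguity can be observed by brute force. The sketch below enumerates every parse tree of a string by trying all split points; it is exponential and only meant as an illustration. The encodings and the name `parses` are assumptions. Star iterations are required to consume at least one character so that the enumeration stays finite, and the order in which trees are produced is not claimed to be the greedy priority order in general.

```python
# Hypothetical encodings (nested tuples):
#   expressions: ("0",) ("1",) ("sym", "a") ("alt", E, F) ("seq", E, F) ("star", E)
#   parse trees: ("unit",) ("chr", "a") ("inl", v) ("inr", w) ("pair", v, w) ("list", [v1, ...])

def parses(e, s):
    """Yield every parse tree of s under e (star iterations must be non-empty)."""
    tag = e[0]
    if tag == "1":
        if s == "":
            yield ("unit",)
    elif tag == "sym":
        if s == e[1]:
            yield ("chr", s)
    elif tag == "alt":
        for v in parses(e[1], s):
            yield ("inl", v)
        for w in parses(e[2], s):
            yield ("inr", w)
    elif tag == "seq":
        for i in range(len(s) + 1):
            for v in parses(e[1], s[:i]):
                for w in parses(e[2], s[i:]):
                    yield ("pair", v, w)
    elif tag == "star":
        if s == "":
            yield ("list", [])
        for i in range(1, len(s) + 1):
            for v in parses(e[1], s[:i]):
                for rest in parses(e, s[i:]):
                    yield ("list", [v] + rest[1])
    # tag == "0": no parse trees at all

A, B = ("sym", "a"), ("sym", "b")
E = ("alt", ("star", ("seq", A, B)), ("star", ("alt", A, B)))   # (ab)* + (a+b)*

TREES = list(parses(E, "ababab"))
print(len(TREES))                  # 2
print([t[0] for t in TREES])       # ['inl', 'inr']
```

For this expression the left alternative is tried first, so the first tree listed is the one the greedy policy selects.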

bit-coding
Bit-coded parse trees: only store choices.
Example: $E = (ab)^{\star} + (a+b)^{\star}$, input $ababab$:
• inl [(a, b), (a, b), (a, b)] is coded as 00001
• inr [inl a, inr b, inl a, inr b, inl a, inr b] is coded as 10001000100011
Parse tree as stream of bits; meaningless without expression!
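
A sketch of the coding, assuming the convention that reproduces the two codes above: inl emits 0, inr emits 1, each star iteration is preceded by 0, and the end of a star is marked by 1. The encodings and the names `encode` and `decode` are illustrative assumptions. Decoding needs the expression to know how to read each bit.

```python
# Hypothetical encodings (nested tuples):
#   expressions: ("0",) ("1",) ("sym", "a") ("alt", E, F) ("seq", E, F) ("star", E)
#   parse trees: ("unit",) ("chr", "a") ("inl", v) ("inr", w) ("pair", v, w) ("list", [v1, ...])

def encode(e, v):
    """Bit-code of parse tree v under expression e: only the choices are stored."""
    tag = e[0]
    if tag in ("1", "sym"):
        return ""                                   # no choice to record
    if tag == "alt":
        if v[0] == "inl":
            return "0" + encode(e[1], v[1])
        return "1" + encode(e[2], v[1])
    if tag == "seq":
        return encode(e[1], v[1]) + encode(e[2], v[2])
    if tag == "star":
        return "".join("0" + encode(e[1], x) for x in v[1]) + "1"
    raise ValueError(f"no parse trees for node {tag!r}")

def decode(e, bits, pos=0):
    """Rebuild a parse tree from bits; returns (tree, position after the tree)."""
    tag = e[0]
    if tag == "1":
        return ("unit",), pos
    if tag == "sym":
        return ("chr", e[1]), pos
    if tag == "alt":
        if bits[pos] == "0":
            v, pos = decode(e[1], bits, pos + 1)
            return ("inl", v), pos
        w, pos = decode(e[2], bits, pos + 1)
        return ("inr", w), pos
    if tag == "seq":
        v, pos = decode(e[1], bits, pos)
        w, pos = decode(e[2], bits, pos)
        return ("pair", v, w), pos
    if tag == "star":
        items = []
        while bits[pos] == "0":                     # 0: one more iteration
            v, pos = decode(e[1], bits, pos + 1)
            items.append(v)
        return ("list", items), pos + 1             # skip the terminating 1
    raise ValueError(f"no parse trees for node {tag!r}")

A, B = ("sym", "a"), ("sym", "b")
E = ("alt", ("star", ("seq", A, B)), ("star", ("alt", A, B)))   # (ab)* + (a+b)*
T1 = ("inl", ("list", [("pair", ("chr", "a"), ("chr", "b"))] * 3))
T2 = ("inr", ("list", [("inl", ("chr", "a")), ("inr", ("chr", "b"))] * 3))

print(encode(E, T1))    # 00001
print(encode(E, T2))    # 10001000100011
```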

finite state transducers
• Thompson FSTs with input alphabet $\Sigma$, output alphabet $\{0, 1\}$.
• Construction: $N(E, q^s, q^f)$
  $E = 0$: start state $q^s$ and final state $q^f$, no transitions.
  $E = 1$: start state $q^s$, with $q^f = q^s$.
  $E = a$: $q^s \xrightarrow{a/\epsilon} q^f$.

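finite state transducers
Construction of $N(E, q^s, q^f)$, continued:
  $E = E_1 E_2$: $N(E_1, q^s, q')$ followed by $N(E_2, q', q^f)$.
  $E = E_1 + E_2$: $q^s \xrightarrow{\epsilon/0} q_1^s$, $N(E_1, q_1^s, q_1^f)$, $q_1^f \xrightarrow{\epsilon/\epsilon} q^f$; and $q^s \xrightarrow{\epsilon/1} q_2^s$, $N(E_2, q_2^s, q_2^f)$, $q_2^f \xrightarrow{\epsilon/\epsilon} q^f$.
  $E = E_0^{\star}$: $q^s \xrightarrow{\epsilon/\epsilon} q'$, $q' \xrightarrow{\epsilon/0} q_0^s$, $N(E_0, q_0^s, q_0^f)$, $q_0^f \xrightarrow{\epsilon/\epsilon} q'$, $q' \xrightarrow{\epsilon/1} q^f$.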
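
The construction can be sketched as a recursive function that allocates fresh state numbers and collects labelled edges. This is an illustration under assumptions: the tuple encoding, the name `thompson`, the edge format and the order in which states are numbered are all choices made here and need not match the figures in the slides.

```python
import itertools

# Hypothetical encoding of expressions as nested tuples:
#   ("0",) ("1",) ("sym", "a") ("alt", E, F) ("seq", E, F) ("star", E)

def thompson(e):
    """Build a Thompson-style FST for e; returns (start, final, edges).

    Each edge is (source, input, output, target); input "" stands for epsilon.
    Edges leaving the same state are appended in priority order (0 before 1).
    """
    fresh = itertools.count()
    edges = []

    def build(e, qs):
        """Add N(e, qs, qf) to edges and return qf."""
        tag = e[0]
        if tag == "0":
            return next(fresh)                    # final state is unreachable
        if tag == "1":
            return qs                             # qf = qs
        if tag == "sym":
            qf = next(fresh)
            edges.append((qs, e[1], "", qf))      # a / epsilon
            return qf
        if tag == "seq":
            return build(e[2], build(e[1], qs))
        if tag == "alt":
            q1s = next(fresh)
            edges.append((qs, "", "0", q1s))      # epsilon / 0: left branch
            q1f = build(e[1], q1s)
            q2s = next(fresh)
            edges.append((qs, "", "1", q2s))      # epsilon / 1: right branch
            q2f = build(e[2], q2s)
            qf = next(fresh)
            edges.append((q1f, "", "", qf))
            edges.append((q2f, "", "", qf))
            return qf
        if tag == "star":
            loop = next(fresh)
            edges.append((qs, "", "", loop))
            q0s = next(fresh)
            edges.append((loop, "", "0", q0s))    # epsilon / 0: iterate
            q0f = build(e[1], q0s)
            edges.append((q0f, "", "", loop))
            qf = next(fresh)
            edges.append((loop, "", "1", qf))     # epsilon / 1: stop
            return qf
        raise ValueError(f"unknown node {tag!r}")

    start = next(fresh)
    final = build(e, start)
    return start, final, edges

A = ("sym", "a")
R = ("star", ("alt", ("seq", A, ("seq", A, A)), ("seq", A, A)))   # (aaa + aa)*
START, FINAL, EDGES = thompson(R)
STATES = {q for (src, _, _, dst) in EDGES for q in (src, dst)}
print(len(STATES), len(EDGES))    # 12 13
```

For $(aaa + aa)^{\star}$ this yields twelve states and thirteen edges: five labelled a/ϵ, four ϵ/ϵ, two ϵ/0 and two ϵ/1.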

parse trees as paths
Theorem (Brüggemann-Klein 1993, GHNR 2013): 1-to-1 correspondence between
• parse trees for $E$,
• paths in Thompson FST for $E$,
• bit-coded parse trees.
Constructing the parse tree corresponds to finding a path through the FST.

optimal streaming
Optimally streaming parsing: output the longest common prefix of possible parse trees after reading each input symbol.
Example: $E = (aaa + aa)^{\star}$
• Possible parse tree prefixes after $aaaa$: {01011, 000...}
• Possible parse tree prefixes after $aaaaa$: {00011, 0000...}
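
As a worked instance of the definition, the following sketch computes the longest common prefix of the two candidate sets from the example. The unfinished candidates (written with "..." on the slide) are represented by the bits known so far; the function name is a hypothetical choice.

```python
def longest_common_prefix(strings):
    """Longest string that is a prefix of every element of strings."""
    strings = list(strings)
    if not strings:
        return ""
    prefix = strings[0]
    for s in strings[1:]:
        i = 0
        while i < len(prefix) and i < len(s) and prefix[i] == s[i]:
            i += 1
        prefix = prefix[:i]               # shrink to what s agrees with
    return prefix

AFTER_AAAA = ["01011", "000"]      # {01011, 000...}
AFTER_AAAAA = ["00011", "0000"]    # {00011, 0000...}

print(longest_common_prefix(AFTER_AAAA))     # 0
print(longest_common_prefix(AFTER_AAAAA))    # 000
```

So after the fourth symbol only the bit 0 is certain, while the fifth symbol makes the prefix 000 certain, which lets two further bits be output.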

greedy parsing
($n$ size of input, $m$ size of expression)

                   Time                         Space     Aux       Answer
Parse (3-p) [1]    $O(mn)$                      $O(m)$    $O(n)$    greedy parse
Parse (2-p) [2]    $O(mn)$                      $O(m)$    $O(n)$    greedy parse
Parse (str.) [3]   $O(mn + 2^{m \log m})$       $O(m)$    $O(n)$    greedy parse

[1] Frisch, Cardelli (2004)
[2] Grathwohl, Henglein, Nielsen, Rasmussen (2013)
[3] Grathwohl, Henglein, Rasmussen (2014)

fst simulation
Optimally streaming algorithm
• Preprocessing step of FST: compute coverage of state sets.
• Maintain a path tree during FST simulation, recording the path taken to each state in the FST.
• Prune states that are covered by higher-prioritized states.
• Output on the stem of the path tree is the longest common prefix of any succeeding parse.
Theorem (GHR’14): the optimally streaming algorithm computes the optimally streaming parsing function in time $O(mn + 2^{m \log m})$.
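
The following is a deliberately simplified sketch of a streaming greedy simulation, not the algorithm of the theorem. It omits the coverage preprocessing, so it can emit bits later than an optimally streaming parser would, and it stores one full output string per live state instead of a shared path tree. The FST for $(aaa + aa)^{\star}$ is written out by hand; its state numbering and all names are assumptions.

```python
# Hand-written FST for (aaa + aa)*: state -> [(input, output, target)] in
# priority order; input "" is epsilon. The numbering is an assumption.
FST = {
    0: [("", "", 1)],
    1: [("", "0", 2), ("", "1", 11)],     # iterate (0) before stopping (1)
    2: [("", "0", 3), ("", "1", 7)],      # aaa branch (0) before aa branch (1)
    3: [("a", "", 4)], 4: [("a", "", 5)], 5: [("a", "", 6)],
    6: [("", "", 10)],
    7: [("a", "", 8)], 8: [("a", "", 9)],
    9: [("", "", 10)],
    10: [("", "", 1)],
    11: [],
}
START, FINAL = 0, 11

def closure(paths):
    """Follow epsilon edges depth-first in priority order.

    paths is a list of (state, output) in priority order. The first path to
    reach a state wins; later (lower-priority) arrivals are pruned.
    """
    seen, result = set(), []

    def visit(q, out):
        if q in seen:
            return
        seen.add(q)
        eps = [(o, d) for (i, o, d) in FST[q] if i == ""]
        if not eps:                       # symbol state or final state
            result.append((q, out))
        for o, d in eps:
            visit(d, out + o)

    for q, out in paths:
        visit(q, out)
    return result

def common_prefix(strings):
    """Longest common prefix of a non-empty list of strings."""
    lo, hi = min(strings), max(strings)
    i = 0
    while i < len(lo) and i < len(hi) and lo[i] == hi[i]:
        i += 1
    return lo[:i]

def stream_parse(s):
    """Yield, after each input symbol, the newly determined output bits."""
    emitted = ""
    paths = closure([(START, "")])
    for c in s:
        stepped = [(d, out + o) for q, out in paths
                   for (i, o, d) in FST[q] if i == c]
        paths = closure(stepped)
        if not paths:
            raise ValueError("input rejected")
        common = common_prefix([out for _, out in paths])
        yield common[len(emitted):]
        emitted = common
    final = [out for q, out in paths if q == FINAL]
    if not final:
        raise ValueError("input rejected")
    yield final[0][len(emitted):]         # remaining bits of the greedy parse

CHUNKS = list(stream_parse("aaaaa"))
print("".join(CHUNKS))    # 00011
```

On the input aaaaa the chunks concatenate to 00011, the code of the greedy parse aaa followed by aa. This sketch commits only the first bit before the end of input, whereas the slide's optimally streaming example already has 000 determined after the fifth symbol; the difference comes from the missing coverage pruning.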

path tree example: $(aaa + aa)^{\star}$
[Figure sequence: the Thompson FST for $(aaa + aa)^{\star}$, with states 0 to 11 and edges labelled a/ϵ, ϵ/ϵ, ϵ/0 and ϵ/1, shown next to the path tree as it grows while the input symbols a are read one at a time.]

kleenex
Observation: the approach is not limited to Thompson FSTs outputting bit-coded parse trees.
Kleenex is a surface language for specifying FSTs and their output:
• grammar with greedy disambiguation;
• embedded output actions.
• Essentially optimally streaming behaviour.
• Linear running time in size of input string.
• Fast: >1 Gbps common.

kleenex
"100000000000" → "100,000,000,000"
• Problem: need to read entire number; no bounded lookahead!
• But: each newline ends a number, so output.
• Optimal streaming gives this for free!
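
The buffering that this example forces can be illustrated in plain Python (this is not Kleenex syntax, and the function names are hypothetical). Where the first comma goes depends on the total number of digits, so nothing can be output for a number until the newline that ends it has been read.

```python
def group_thousands(number):
    """Insert a comma between groups of three digits, counted from the right."""
    head = len(number) % 3 or 3           # length of the leading group
    parts = [number[:head]]
    parts += [number[i:i + 3] for i in range(head, len(number), 3)]
    return ",".join(parts)

def add_separators(chars):
    """Read characters one at a time; emit each number when its newline arrives."""
    digits = []
    for c in chars:
        if c == "\n":
            yield group_thousands("".join(digits)) + "\n"
            digits = []
        else:
            digits.append(c)              # must buffer: length not yet known
    if digits:                            # last line without trailing newline
        yield group_thousands("".join(digits))

print("".join(add_separators("100000000000\n1234\n")), end="")
# 100,000,000,000
# 1,234
```

The memory needed is bounded by the longest line, not by the whole input, which is the behaviour the slide attributes to optimal streaming.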

determinization
Path tree algorithm is “NFA simulation with path trees as state sets.”
Compilation of FSTs? Analogy to NFA-DFA determinization with subset construction?
Problem: infinite number of path trees!
Solution: contract unary paths in path trees and store output in registers.

determinization
[Figure sequence: path trees over the FST states with unary paths contracted; the bit strings on the contracted edges are held in registers ($x_\epsilon$, $x_0$, $x_1$, $x_{00}$, $x_{01}$) that are updated by assignments of the form $x := \ldots$ on each transition.]
