Formalising Boost POSIX Regular Expression Matching

Martin Berglund, Willem Bester & Brink van der Merwe
Formalising Boost POSIX Regular Expression Matching
15th International Colloquium on Theoretical Aspects of Computing
18 October 2018, Stellenbosch, South Africa

What we’ve been doing

We’ve been thinking about
◮ regular expression matching semantics
◮ Perl-Compatible Regular Expression (PCRE) engines
◮ POSIX-compliant engines
◮ ambiguity: “more than one way to match”
◮ capture groups

Why Boost?
◮ “very powerful” C++ library
◮ mature (1999– )
◮ online peer-reviewed QA process
◮ regular expression engine that has a POSIX mode

Berglund, Bester & Van der Merwe: Formalising Boost POSIX Regular Expression Matching, ICTAC 2018 2 / 16

Leftmost-greedy vs leftmost-longest matching

Match “aba” with E₁ = (ab|ba|a)*
  ambiguous:         [ab][a]   [a][ba]
  Leftmost-greedy:   [ab][a]
  Leftmost-longest:  [ab][a]

Match “aba” with E₂ = (a|ab|ba)*
  ambiguous:         [ab][a]   [a][ba]
  Leftmost-greedy:   [a][ba]
  Leftmost-longest:  [ab][a]

◮ E₂ defines the same language as E₁, but subexpression order differs
◮ Compare E₁ = (ab|ba|a)* to E₂ = (a|ab|ba)*
◮ Leftmost-longest: matcher seemingly considers all possible matches for subexpressions [more on this later]
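The difference between the two policies can be reproduced with a small Python sketch. It only handles a starred alternation of literal strings, which is all this slide needs. The function names are hypothetical, and reading "leftmost-longest" as "lexicographically largest sequence of piece lengths" is an assumption that fits this example, not a general POSIX implementation.

```python
from typing import Iterator, List, Optional, Sequence


def decompositions(alts: Sequence[str], w: str) -> Iterator[List[str]]:
    """Yield every way to write w as a concatenation of alternatives,
    in the order a depth-first backtracking matcher would find them."""
    if not w:
        yield []
        return
    for alt in alts:  # alternatives are tried in the order they are written
        if alt and w.startswith(alt):
            for rest in decompositions(alts, w[len(alt):]):
                yield [alt] + rest


def leftmost_greedy(alts: Sequence[str], w: str) -> Optional[List[str]]:
    """The first decomposition found by the depth-first search."""
    return next(decompositions(alts, w), None)


def leftmost_longest(alts: Sequence[str], w: str) -> Optional[List[str]]:
    """The decomposition whose pieces are as long as possible, left to right."""
    return max(decompositions(alts, w),
               key=lambda parts: [len(p) for p in parts],
               default=None)


E1 = ("ab", "ba", "a")
E2 = ("a", "ab", "ba")
print(leftmost_greedy(E1, "aba"), leftmost_greedy(E2, "aba"))
print(leftmost_longest(E1, "aba"), leftmost_longest(E2, "aba"))
```

Reordering the alternatives changes the greedy result but not the longest one, which is the point of comparing E₁ with E₂.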

The POSIX regular expression specification

POSIX specifies leftmost-longest matching:

“The search for a matching sequence starts at the beginning of a string and stops when the first sequence matching the expression is found, where ‘first’ is defined to mean ‘begins earliest in the string’. If the pattern permits a variable number of matching characters and thus there is more than one such sequence starting at that point, the longest such sequence is matched.... Consistent with the whole match being the longest of the leftmost matches, each subpattern from left to right shall match the longest possible string.”

Fowler’s complaint: “Subpattern” only used here; elsewhere it’s “subexpression” (always in the context of grouping).

Note: We only consider full matching in this work.

An eccentric reading of the POSIX standard?

POSIX
◮ Full matching with submatch addressing
◮ Position and extent of substrings matched by subexpressions must be available

Boost POSIX Mode
◮ Maximises what is reported for marked subexpressions (those surrounded by parentheses)
◮ Essentially, reading POSIX with: s/subpattern/marked subexpression/

Match “aba” with (ab|ba|a)*
  Regex-TDFA: [ab][a]
  Boost:      [a][ba]

Match “aba” with (a|ab|ba)*
  Regex-TDFA: [ab][a]
  Boost:      [a][ba]

Regex-TDFA written in Haskell. Boost written in C++.

More examples

Match “aa” with (₀ (₁ a*)₁ (₂ a*)₂ )₀

  Captures                  Boost   RTDFA
  [₀ [₁ aa ]₁ [₂ ]₂ ]₀        ✓       ✓
  [₀ [₁ a ]₁ [₂ a ]₂ ]₀
  [₀ [₁ ]₁ [₂ aa ]₂ ]₀

Match “aa” with (₀ a*(₁ a*)₁ )₀

  Captures                  Boost   RTDFA
  [₀ aa [₁ ]₁ ]₀                      ✓
  [₀ a [₁ a ]₁ ]₀
  [₀ [₁ aa ]₁ ]₀              ✓

Note: All non-atomic subexpressions are parenthesised.

◮ Regex-TDFA maximises lengths of all subexpressions in the order they occur in the regular expression
◮ Boost maximises lengths of (capture) groups in the order they occur in the regular expression

Capturing regular expressions and forests

Capturing Regular Expressions
Over a finite alphabet Σ and an index set I:
  ∅            ATOM: empty language
  ε            ATOM: empty string
  a            ATOM: symbols a ∈ Σ
  (r₀ · r₁)    concatenation of capturing regular expressions r₀, r₁
  (r₀ + r₁)    alternation of capturing regular expressions r₀, r₁
  (r*)         closure of capturing regular expression r
  (ᵢ r )ᵢ      capture group i ∈ I of capturing regular expression r

Set of Forests
Over a finite alphabet Σ and an index set I:
• every element of Σ ∪ {ε} is a forest
• so is f₁f₂ for forests f₁ and f₂
• and [ᵢ f ]ᵢ for forest f and i ∈ I

Note
If I is non-empty: the strings over Σ are properly contained in the set of forests.
If I is empty: they are equal.

Forest and String Languages

Forest Language
ℒ(r) for a capturing regular expression r:
  ℒ(∅) = ∅
  ℒ(ε) = {ε}
  ℒ(a) = {a}
  ℒ(r₀ · r₁) = ℒ(r₀) · ℒ(r₁)
  ℒ(r₀ + r₁) = ℒ(r₀) ∪ ℒ(r₁)
  ℒ(r*) = ℒ(r)*
  ℒ((ᵢ r )ᵢ) = {[ᵢ} · ℒ(r) · {]ᵢ}

◮ For string w over Σ, and Σ′ ⊆ Σ: π_Σ′(w) is the maximal subsequence of w that contains only symbols from Σ′.
◮ The string language described by the capturing regular expression r over Σ is the set π_Σ(ℒ(r)).

Also: By extension, we also handle
  r⁺ = r r*,
  r? = (r + ε), and
  r^{m,n} = r ··· r (m times) followed by (r + ε) ··· (r + ε) (n times)
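As a rough illustration of the forest language, the following Python sketch enumerates the forests of a capturing regular expression whose projection onto Σ equals a given string. The nested-tuple encoding, the token spelling ("[1", "]1") and the function names are all hypothetical. One deliberate restriction: each iteration of a closure must consume at least one symbol, so the enumeration stays finite. The forest language itself has no such restriction.

```python
from typing import Iterator, Tuple

# Hypothetical encoding of capturing regular expressions as nested tuples:
#   ("empty",)  ("eps",)  ("sym", a)  ("cat", r0, r1)  ("alt", r0, r1)
#   ("star", r)  ("grp", i, r)
Forest = Tuple[str, ...]  # symbols plus bracket tokens such as "[1" and "]1"


def forests(r: tuple, w: str) -> Iterator[Forest]:
    """Yield forests in the forest language of r whose projection is w."""
    kind = r[0]
    if kind == "empty":
        return
    elif kind == "eps":
        if w == "":
            yield ()
    elif kind == "sym":
        if w == r[1]:
            yield (w,)
    elif kind == "cat":
        for k in range(len(w) + 1):  # every split point of w
            for f0 in forests(r[1], w[:k]):
                for f1 in forests(r[2], w[k:]):
                    yield f0 + f1
    elif kind == "alt":
        yield from forests(r[1], w)
        yield from forests(r[2], w)
    elif kind == "star":
        if w == "":
            yield ()
        for k in range(1, len(w) + 1):  # assumption: non-empty iterations only
            for f0 in forests(r[1], w[:k]):
                for f1 in forests(r, w[k:]):
                    yield f0 + f1
    elif kind == "grp":
        for f in forests(r[2], w):
            yield (f"[{r[1]}",) + f + (f"]{r[1]}",)


def project(f: Forest) -> str:
    """Projection onto the alphabet: drop bracket tokens, keep symbols.
    Assumes '[' and ']' are not alphabet symbols."""
    return "".join(t for t in f if t[0] not in "[]")


a_star = ("star", ("sym", "a"))
r = ("grp", 0, ("cat", ("grp", 1, a_star), ("grp", 2, a_star)))
for f in forests(r, "aa"):
    print(" ".join(f))
```

For the expression with two starred groups and the input "aa", this prints the three bracketings in which group 1 takes zero, one or two of the symbols.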

From forest to captures

Strategy to compute capture information
1. collect the matching forests
2. determine the capture history C(f) and final capture history C_fin(f) for each forest f
3. order forests by Boost partial order ≺_B on C_fin values
4. return the greatest C_fin value as determined by ≺_B

Capture history
• informally, a function C(f, i) for forest f and group i
• returns a pair (s, ℓ) for each substring captured by group i
• s ← substring start index, ℓ ← substring length

Final capture history
• C_last(f, i) is the pair (s, ℓ) in C(f, i) with the greatest s
• C_fin(f) is the set {(j, C_last(f, j)) | j ∈ I}
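The capture history and final capture history can be computed in one pass over a forest. The Python sketch below assumes a forest is given as a sequence of tokens, with hypothetical bracket tokens "[i" and "]i" for group i. A group that captures nothing is represented by None here, standing in for the pair (⊤, ⊥) used in the worked examples. How ties on the start index are resolved is not stated on the slide, so the sketch simply keeps the first such pair.

```python
from typing import Dict, Iterable, List, Optional, Sequence, Tuple

Span = Tuple[int, int]  # (start index s, length l)


def capture_history(f: Sequence[str]) -> Dict[int, List[Span]]:
    """C(f, i) for every group i that occurs in f, as a dict i -> list of (s, l)."""
    history: Dict[int, List[Span]] = {}
    open_starts: Dict[int, List[int]] = {}  # per group: stack of open start positions
    pos = 0                                 # number of alphabet symbols read so far
    for tok in f:
        if tok.startswith("["):
            open_starts.setdefault(int(tok[1:]), []).append(pos)
        elif tok.startswith("]"):
            i = int(tok[1:])
            s = open_starts[i].pop()
            history.setdefault(i, []).append((s, pos - s))
        else:
            pos += len(tok)
    return history


def final_capture_history(f: Sequence[str],
                          groups: Iterable[int]) -> Dict[int, Optional[Span]]:
    """C_fin(f) as a dict j -> C_last(f, j); None if group j captured nothing."""
    hist = capture_history(f)
    return {j: max(hist[j], key=lambda p: p[0]) if j in hist else None
            for j in groups}


f3 = ("[0", "[1", "a", "]1", "[3", "b", "]3", "]0")
print(final_capture_history(f3, range(4)))
```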

Boost partial order and captures

Boost partial order
• denote as ≺_B
• assume π_Σ(f₁) = π_Σ(f₂)
Then C_fin(f₁) ≺_B C_fin(f₂) if for the smallest j ∈ I such that (j, s₁, ℓ₁) ≠ (j, s₂, ℓ₂), where (j, sᵢ, ℓᵢ) ∈ C_fin(fᵢ), we have
1. s₁ > s₂, or
2. s₁ = s₂ but ℓ₁ < ℓ₂

Boost captures
• capturing regular expression r
• w ∈ π_Σ(ℒ(r))
• the Boost captures of matching w with r: the largest element in {C_fin(f) | f ∈ ℒ(r), π_Σ(f) = w} determined by ≺_B
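The comparison itself is short enough to sketch in Python. Final capture histories are assumed to be dicts from group index to an (s, ℓ) pair, with None for a group that captured nothing. Treating None as (⊤, ⊥), with ⊤ above every start index and ⊥ below every length, is an assumption read off the worked examples in these slides. The function name is hypothetical.

```python
import math
from typing import Dict, Optional, Tuple

Span = Tuple[int, int]
CFin = Dict[int, Optional[Span]]


def precedes(c1: CFin, c2: CFin) -> bool:
    """True if c1 precedes c2 in the Boost order (c1 ≺_B c2).

    Both arguments must cover the same group indices."""
    for j in sorted(c1):  # find the smallest group index where the two differ
        if c1[j] == c2[j]:
            continue
        # Assumption: an uncaptured group behaves like (top, bottom).
        s1, l1 = c1[j] if c1[j] is not None else (math.inf, -math.inf)
        s2, l2 = c2[j] if c2[j] is not None else (math.inf, -math.inf)
        return s1 > s2 or (s1 == s2 and l1 < l2)
    return False  # identical histories: neither precedes the other


# "aba" against a starred group 1: last capture (2, 1) versus (1, 2)
ab_a = {0: (0, 3), 1: (2, 1)}
a_ba = {0: (0, 3), 1: (1, 2)}
print(precedes(ab_a, a_ba))
```

Under this order an earlier start wins first and a longer length second, so for "aba" the history whose last group-1 capture starts at index 1 is the greater one.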

Examples

Match w = “ab” with a?(₁ ab)₁?b?
Forests: f₁ = [₀ ab ]₀ and f₂ = [₀ [₁ ab ]₁ ]₀
C(f₁, 0) = {(0, 2)}, C(f₁, 1) = ∅, C(f₂, 0) = {(0, 2)}, C(f₂, 1) = {(0, 2)}
C_fin(f₁) = {(0, 0, 2), (1, ⊤, ⊥)}, C_fin(f₂) = {(0, 0, 2), (1, 0, 2)}
At j = 1, we find s₁ = ⊤ and s₂ = 0, so that s₁ > s₂. Therefore, C_fin(f₁) ≺_B C_fin(f₂).

Match w with (₁ a?)₁(₂ ab)₂?(₃ b?)₃
Forests: f₃ = [₀ [₁ a ]₁ [₃ b ]₃ ]₀ and f₄ = [₀ [₁ ]₁ [₂ ab ]₂ [₃ ]₃ ]₀
C_fin(f₃) = {(0, 0, 2), (1, 0, 1), (2, ⊤, ⊥), (3, 1, 1)}
C_fin(f₄) = {(0, 0, 2), (1, 0, 0), (2, 0, 2), (3, 2, 0)}
At j = 1, we find s₃ = s₄ = 0, ℓ₃ = 1, and ℓ₄ = 0, so that ℓ₄ < ℓ₃. Therefore, C_fin(f₄) ≺_B C_fin(f₃).

POSIX matching algorithm in Boost

Inside Boost:
◮ complete Perl-Compatible Regular Expression (PCRE) engine
◮ implemented by depth-first backtracking

POSIX matching algorithm:
1. • apply the PCRE-style matching engine to the input
   • record the resulting parse tree t
   • if the engine rejects, then reject the string
2. • apply the PCRE-style matching engine to the input
   • each time it would accept on parse tree t′
     • if C_fin(t) ≺_B C_fin(t′), then t ← t′
     • reject, causing the engine to backtrack
3. output t as POSIX-style result

Theorem
Boost captures can be computed in time O(k |w| |r| log |w|) when matching input string w with regular expression r, where k is the number of distinct capturing indices.
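The control flow of the three steps can be sketched in Python, with the backtracking engine abstracted away. Here `run_engine` is a hypothetical callable that returns an iterator over the parse trees on which a depth-first engine would accept, in discovery order. Advancing the iterator plays the role of "reject, causing the engine to backtrack". This models only the driver loop, not Boost's actual code, and it does not achieve the time bound of the theorem.

```python
from typing import Callable, Iterator, Optional, TypeVar

T = TypeVar("T")  # a parse tree, in whatever form the engine produces
C = TypeVar("C")  # a final capture history


def posix_style_result(run_engine: Callable[[], Iterator[T]],
                       c_fin: Callable[[T], C],
                       precedes: Callable[[C, C], bool]) -> Optional[T]:
    """Return the accepting parse whose C_fin is greatest; None if none exists."""
    # Step 1: ordinary run; the first accepting parse becomes t.
    t = next(run_engine(), None)
    if t is None:
        return None  # the engine rejects, so the string is rejected
    # Step 2: second run; every would-be accept is compared with t and then
    # rejected, which here means asking the iterator for the next parse.
    for t_new in run_engine():
        if precedes(c_fin(t), c_fin(t_new)):
            t = t_new
    # Step 3: t is the POSIX-style result.
    return t


# Toy usage: a "parse" is just the last (s, l) capture of a single group.
parses = [(2, 1), (1, 2)]
earlier_then_longer = lambda a, b: a[0] > b[0] or (a[0] == b[0] and a[1] < b[1])
print(posix_style_result(lambda: iter(parses), lambda t: t, earlier_then_longer))
```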

Experimental results

Two testing frameworks in Python
◮ small one for existing matchers
◮ larger, extensible one for exploring different disambiguation policies

Sanity check: almost 3 000 000 generated test cases
◮ over the atoms a, b, . and the operators |, *, +, ?
◮ input strings over Σ = {a, b, c}.

Fowler’s test cases
◮ 93 examples to test POSIX compliance
◮ 47 ERE; 37 without partial matching + 19 of our own
◮ use a Boost runner as oracle
◮ our formalism passed all but 2
