C++0x Regular Expressions

SLIDE 1

Regular Expressions C++0x Sources

C++0x Regular Expressions

Simon Andreas Frimann Lund

Datalogisk Institut, Københavns Universitet

May 16, 2008

SLIDE 3

Regular Expressions

Regular Expression, regex or regexp for short: "A set of characters, metacharacters, and operators that define a string or group of strings in a search pattern."

• "regex"

(simple regex matching the text "regex")

• "[-+]?([0-9]*\.[0-9]+|[0-9]+)"

(simple regular expression matching... what?)

The set of metacharacters, operators and other features is usually called a regex flavor.
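One way to answer the question is to try the pattern out. The following is a minimal sketch using the `<regex>` header as it was later standardized in C++11. It assumes the dot in the pattern is meant as a literal decimal point and is therefore escaped; the function name `looks_like_number` is made up for this illustration.

```cpp
#include <regex>
#include <string>

// Returns true when the whole of `text` matches the number pattern.
// Assumption: the dot is escaped (\.) so it only matches a decimal point.
bool looks_like_number(const std::string& text) {
    static const std::regex number("[-+]?([0-9]*\\.[0-9]+|[0-9]+)");
    return std::regex_match(text, number);
}
```

Under that assumption the pattern accepts optionally signed integers and decimal numbers such as "+42", "3.14" and "-.5", and rejects "abc" as well as "1." (digits are required after the decimal point).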

SLIDE 9

Flavors

There exist 15+ popular regex flavors in various languages and tools, of which only two are standardized:

• The POSIX standard Basic Regex (BRE) / Extended Regex (ERE).
• GNU BRE / ERE, GNU extensions of the standard used in GNU tools such as grep.
• The languages D, Haskell, .NET, Java, ECMAScript (JavaScript), Python and Ruby all have their own flavors.
• The languages Perl and Tcl have their own flavors as built-in language constructs.
• Libraries such as PCRE (used in PHP), Boost.Regex, Boost.Xpressive and Qt's QRegExp each have their own flavor.
• And the list goes on...

How are these tasty flavors implemented?

SLIDE 10

Implementations

Basically all the different flavors are implemented with an NFA (non-deterministic finite automaton) or a DFA (deterministic finite automaton).

Machine size for an expression of M characters, and pattern recognition complexity for a sequence of N characters with S states:

Algorithm                            Machine size    Complexity
DFA                                  O(2^M)          O(N)
Bit-parallel non-backtracking NFA    O(M) ∨ (2M)     O((1 + S/B)N)
Non-backtracking NFA                 O(M) ∨ (2M)     O(SN)
Backtracking NFA                     O(M)            O(2^N)

Currently many different implementations for C++ exist, some procedural, others object-oriented. They support various flavors, but most are simply object-oriented wrappers for C libraries.
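To make the bit-parallel row more concrete, here is a minimal sketch of the classic Shift-And technique. It simulates the NFA for a plain literal pattern (not a full regex) with one bit per state, so every active state advances in a single word operation. It assumes that B in the table stands for the machine word width (64 bits here), which the slide does not state; the function name `shift_and_contains` is made up for this illustration.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Returns true when `pattern` occurs somewhere in `text`.
// Sketch only: the pattern is a literal string of at most 64 characters,
// so all S NFA states fit in one machine word (S <= B).
bool shift_and_contains(const std::string& text, const std::string& pattern) {
    if (pattern.empty()) return true;
    if (pattern.size() > 64) throw std::length_error("pattern longer than one word");

    // mask[c] has bit i set when pattern[i] == c.
    std::array<std::uint64_t, 256> mask{};
    for (std::size_t i = 0; i < pattern.size(); ++i)
        mask[static_cast<unsigned char>(pattern[i])] |= std::uint64_t{1} << i;

    const std::uint64_t accept = std::uint64_t{1} << (pattern.size() - 1);
    std::uint64_t state = 0;  // bit i set: pattern[0..i] matches, ending at the current character
    for (unsigned char c : text) {
        // Advance every active state by one and restart a match at this position.
        state = ((state << 1) | 1) & mask[c];
        if (state & accept) return true;
    }
    return false;
}
```

Each input character costs a constant number of word operations while the states fit in one word, which is where the O((1 + S/B)N) bound comes from; there is no backtracking, so the exponential worst case of the last row cannot occur.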

SLIDE 17

Flavor

The regex support as of TR1 is an extension of std based on Boost.Regex, with the following proposed changes/consequences:

• Default ECMAScript syntax.
• Optional support for POSIX BRE/ERE/awk/grep/egrep/sed syntax.
• Localization features of POSIX are required, since ECMAScript is not capable of localization.
• Performance is low, due to rich expression features.
• There are NO performance guarantees.
• Boost has a way to monitor the runtime complexity of expressions and stop them.
• The expression syntax can be customized with traits classes. Nice!
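The choice between the default and the optional syntaxes can be sketched as follows. This assumes the flag names as they were later standardized in C++11 (`std::regex::ECMAScript`, `std::regex::extended`, `std::regex::basic`); TR1 and Boost spell some of these differently, and the helper name `matches` is made up for this illustration.

```cpp
#include <regex>
#include <string>

// Returns true when `text` as a whole matches `pattern` under the given grammar.
// The grammar defaults to ECMAScript, mirroring the library default.
bool matches(const std::string& text, const std::string& pattern,
             std::regex::flag_type grammar = std::regex::ECMAScript) {
    const std::regex re(pattern, grammar);
    return std::regex_match(text, re);
}
```

With this helper, `matches("abab", "(ab){2}")` and `matches("abab", "(ab){2}", std::regex::extended)` use the same pattern text, while POSIX BRE needs the grouping and interval operators escaped: `matches("abab", "\\(ab\\)\\{2\\}", std::regex::basic)`.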

SLIDE 19

Implementation

A full implementation in C++, not a wrapper! Available in the header file <regex>.

Representation:

• basic_regex, holder of expressions, looks like a basic_string.
• match_results, holder of the sub-matches found by a match or search.

Methods:

• bool regex_match(basic_string, basic_regex)
• bool regex_search(basic_string, match_results, basic_regex)
• basic_string regex_replace(basic_string, basic_regex, basic_string)
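The three functions can be sketched side by side like this, using the `std::string` specializations (`std::regex`, `std::smatch`) from the C++11 `<regex>` header. The helper names and the digit pattern are made up for this illustration.

```cpp
#include <regex>
#include <string>

// regex_match: the whole string must match the expression.
bool is_number(const std::string& text) {
    static const std::regex number("[0-9]+");
    return std::regex_match(text, number);
}

// regex_search: finds the first match anywhere in the string and
// reports it through a match_results object (here std::smatch).
std::string first_number(const std::string& text) {
    static const std::regex number("[0-9]+");
    std::smatch what;
    if (std::regex_search(text, what, number))
        return what[0].str();
    return std::string();
}

// regex_replace: returns a copy with every match replaced.
std::string mask_numbers(const std::string& text) {
    static const std::regex number("[0-9]+");
    return std::regex_replace(text, number, "#");
}
```

For the input "May 16, 2008", `is_number` is false because the whole string is not a number, `first_number` returns "16", and `mask_numbers` returns "May #, #".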

SLIDE 20

C++0x Example

#include <cstdlib>
#include <regex>
#include <string>
#include <iostream>

using namespace std;

regex expression("([0-9]+)(\\-| |$)(.*)");

// process_ftp:
// on success returns the ftp response code, and fills
// msg with the ftp response message.
int process_ftp(const char* response, std::string* msg)
{
    cmatch what;
    if (regex_match(response, what, expression))
    {
        // what[0] contains the whole string
        // what[1] contains the response code
        // what[2] contains the separator character
        // what[3] contains the text message.
        if (msg)
            msg->assign(what[3].first, what[3].second);
        return std::atoi(what[1].first);
    }
    // failure did not match
    if (msg)
        msg->erase();
    return -1;
}

How is C++0x different from C++?

SLIDE 21

C++ Example

#include <cstdlib>
#include <boost/regex.hpp>
#include <string>
#include <iostream>

using namespace boost;

regex expression("([0-9]+)(\\-| |$)(.*)");

// process_ftp:
// on success returns the ftp response code, and fills
// msg with the ftp response message.
int process_ftp(const char* response, std::string* msg)
{
    cmatch what;
    if (regex_match(response, what, expression))
    {
        // what[0] contains the whole string
        // what[1] contains the response code
        // what[2] contains the separator character
        // what[3] contains the text message.
        if (msg)
            msg->assign(what[3].first, what[3].second);
        return std::atoi(what[1].first);
    }
    // failure did not match
    if (msg)
        msg->erase();
    return -1;
}

It's regex_replace(sourceCode, regex("std"), "boost") different.

SLIDE 22

Sources

The C++ Standards Committee, n1429

http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n1429.pdf

Wikipedia, C++0x

http://en.wikipedia.org/wiki/C++0x

Regular Expressions

http://www.regular-expressions.info