slide-1
SLIDE 1

REGULAR LANGUAGES, EXPRESSIONS AND APPLICATIONS

Automata and Formal Languages, #course [15103]
Dr. Vadim Zaytsev a.k.a. @grammarware

slide-2
SLIDE 2

ROADMAP

  • Chomsky hierarchy revisited
  • How to see if the language is regular?
  • The class of regular languages
  • Tools to work with regular languages
  • Advanced methods

source is given at the bottom of each slide

slide-3
SLIDE 3

CHOMSKY HIERARCHY

Duncan Rawlinson, Chomsky.jpg, 2004, CC-BY.

slide-4
SLIDE 4

CHOMSKY HIERARCHY

languages ⊃ recursively enumerable ⊃ context-sensitive ⊃ context-free ⊃ regular ⊃ finite

Noam Chomsky. On Certain Formal Properties of Grammars, Information & Control 2(2):137–167, 1959.

slide-5
SLIDE 5

CHOMSKY: AUTOMATA

languages ⊃ recursively enumerable ⊃ context-sensitive ⊃ context-free ⊃ regular ⊃ finite

  • recursively enumerable: Turing machine
  • context-sensitive: linear bounded automaton
  • context-free: pushdown automaton
  • regular: finite state automaton

(too many to list)

slide-6
SLIDE 6

CHOMSKY: TOOLS

languages ⊃ recursively enumerable ⊃ context-sensitive ⊃ context-free ⊃ regular ⊃ finite

imaginary, grammarware, regexp, computer

slide-7
SLIDE 7

CHOMSKY: REWRITING

languages ⊃ recursively enumerable ⊃ context-sensitive ⊃ context-free ⊃ regular ⊃ finite

  • recursively enumerable: α → β
  • context-sensitive: α X β → α γ β
  • context-free: X → γ
  • regular: X → a, X → a B

Axel Thue. Probleme über Veränderungen von Zeichenreihen nach gegebenen Regeln, 1914. http://arxiv.org/abs/1308.5858

slide-8
SLIDE 8

REGEXPS REVISITED

  • Regular sets by Stephen Kleene in 1956

  • ∅, ε, letters from Σ
  • concatenation
  • iteration
  • alternation
  • Precisely fit the regular class
  • S. C. Kleene, Representation of Events in Nerve Nets and Finite Automata. In Automata Studies, pp. 3–42, 1956.

photo from: Konrad Jacobs, S. C. Kleene, 1978, MFO.
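
The three operations listed above can be tried out directly. The following is a minimal sketch using Python's `re` module; the alphabet {a, b}, the sample patterns and the helper name `matches` are illustrative assumptions, not part of the slides.

```python
import re

# Building blocks: single letters from the assumed alphabet {a, b}.
A = "a"
B = "b"

concatenation = A + B                # the word "ab"
alternation = f"(?:{A}|{B})"         # either "a" or "b"
iteration = f"(?:{concatenation})*"  # zero or more copies of "ab"


def matches(pattern: str, word: str) -> bool:
    """Return True if the whole word belongs to the language of the pattern."""
    return re.fullmatch(pattern, word) is not None
```

Note that `re.fullmatch` is used on purpose: membership in a language is about the whole word, whereas `re.search` would only look for a matching substring.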

slide-9
SLIDE 9

DETERMINISTIC FINITE AUTOMATON

  • C. E. Shannon, W. Weaver, The Mathematical Theory of Communication, University of Illinois Press, Urbana, 1949.

(finite state grammars and finite diagrams and finite state Markov processes)

slide-10
SLIDE 10

TO WHICH CLASS DO LANGUAGES BELONG?

  • {ε}
  • {ε} in a non-empty alphabet
  • {x, y, z}
  • {0ⁿ | n > 1}
  • decimal numbers
  • {0ⁿ1ⁿ | n > 1}
  • {0ⁿ1 ⁿ | n > 1}
  • {0ⁿ1ⁿ2ⁿ | n > 1}

interactive

all ⊃ recursively enumerable ⊃ context-sensitive ⊃ context-free ⊃ regular ⊃ finite

²

REGULAR, C-FREE, C-SENSITIVE, FINITE, FINITE, FINITE, FINITE, REGULAR, C-FREE

slide-11
SLIDE 11

IS A TASK SOLVABLE BY REGULAR MEANS?

  • Substring search
  • grep, contains(), find(), substring(), …
  • Substring replacement
  • sed, awk, perl, vim, replace(), replaceAll(), …
  • Pretty-printing
  • VS.NET, Sublime, TextMate, …

interactive

✓ ✗ ✓

slide-12
SLIDE 12

IS A TASK SOLVABLE BY REGULAR MEANS?

  • Counting [non-empty] lines in a file
  • wc -l, grep -c ""
  • grep -v "^$" | wc -l, sed -n /./p | wc -l, …
  • Parsing HTML
  • <BODY><TABLE><P><A HREF=…
  • Parsing a postcode
  • 1098 XG, …

interactive

✓ ✗
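
As an illustration of the postcode task, here is a minimal sketch in Python. It assumes the Dutch format suggested by the example "1098 XG": four digits (the first one non-zero), an optional space, and two capital letters. The names `POSTCODE` and `is_postcode` are hypothetical.

```python
import re

# Assumed format: four digits (first one non-zero), optional space, two capital letters.
POSTCODE = re.compile(r"[1-9][0-9]{3} ?[A-Z]{2}")


def is_postcode(text: str) -> bool:
    """Return True if the whole text has the assumed postcode shape."""
    return POSTCODE.fullmatch(text) is not None
```

The pattern only checks the shape of the string; whether such a postcode actually exists is not something a regular expression can decide.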

slide-13
SLIDE 13

HOW TO PROVE WHICH CLASS A LANGUAGE BELONGS TO

slide-14
SLIDE 14

PUMPING LEMMA FOR REGULAR LANGUAGES

  • In simple terms
  • sufficiently long words have repeatable parts
  • (works for all infinite regular languages)
  • L is regular ⇒ formula holds
  • Formula does not hold ⇒ L is finite or not regular

all ⊃ recursively enumerable ⊃ context-sensitive ⊃ context-free ⊃ regular ⊃ finite

Jos C.M. Baeten, Models of Computation: Automata, Formal Languages and Communicating Processes, §2.9, p.58.
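
The slide refers to "the formula" without spelling it out. One common textbook formulation is sketched below; it is an assumption that this matches the cited source, and p (the pumping length) is a symbol introduced here.

```latex
L \text{ regular} \;\Longrightarrow\;
\exists p \ge 1 \;\; \forall w \in L \text{ with } |w| \ge p \;\; \exists x, y, z :\quad
w = xyz, \quad |xy| \le p, \quad |y| \ge 1, \quad \forall i \ge 0 :\; x y^{i} z \in L
```

Worked example: for L = {0ⁿ1ⁿ | n > 1}, take w = 0^(p+1) 1^(p+1). Since |xy| ≤ p, the part y consists of k ≥ 1 zeros only, so pumping once gives xy²z = 0^(p+1+k) 1^(p+1), which is not in L. The formula fails, hence L is not regular.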

slide-15
SLIDE 15

JOHN MYHILL AND ANIL NERODE

Cornell, Faculty and Senior Researcher Profiles. Who's That Mathematician? Paul R. Halmos Collection - Page 36.

slide-16
SLIDE 16

MYHILL–NERODE THEOREM

  • Myhill-Nerode equivalence
  • u~v ⟺ ∀w: (uw∈L ∧ vw∈L) ∨ (uw∉L ∧ vw∉L)
  • Theorem: L is regular iff the number of Myhill-Nerode equivalence classes is finite.

  • In simple terms
  • few groups of forgettable prefixes
  • Works both ways

Anil Nerode, Linear Automaton Transformations, Proceedings of the AMS 9, 1958.
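
The equivalence above can be explored experimentally. The sketch below groups prefixes by how they behave under all suffixes up to a bounded length; because of that bound it only approximates the true Myhill-Nerode classes (it can merge classes, never split them). All function names are hypothetical.

```python
from itertools import product


def words(alphabet, max_len):
    """Yield every word over the alphabet with length 0..max_len."""
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            yield "".join(letters)


def count_classes(in_language, alphabet, prefix_len, suffix_len):
    """Count groups of prefixes that no suffix of length <= suffix_len tells apart."""
    suffixes = list(words(alphabet, suffix_len))
    signatures = {
        tuple(in_language(u + w) for w in suffixes)
        for u in words(alphabet, prefix_len)
    }
    return len(signatures)


def even_ones(word):
    """A regular language: words with an even number of 1s."""
    return word.count("1") % 2 == 0


def zeros_then_ones(word):
    """A non-regular language: 0^n 1^n with n > 1."""
    n = len(word) // 2
    return n > 1 and word == "0" * n + "1" * n
```

For `even_ones` the count stays at 2 however far the bounds are pushed, while for `zeros_then_ones` it keeps growing as `prefix_len` and `suffix_len` increase, which is what the theorem predicts for a non-regular language.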

slide-17
SLIDE 17

LIMITED MEMORY

  • Advice from teh internetz:
  • how many characters must you remember from the stream?

  • bounded ⇒ regular
  • unbounded ⇒ ?
  • Correct or not?

Brian M. Scott, http://math.stackexchange.com/questions/282216/determine-if-a-language-is-regular-from-the-first-sight

correct!

memory is limited, alphabet is limited ⇒ prefixes are limited
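
A small sketch of what "bounded memory" means in practice: the recogniser below reads a decimal number one character at a time and remembers only one of three values (the remainder modulo 3), so it behaves like a three-state finite automaton. The language and the function name `divisible_by_3` are illustrative assumptions, not taken from the slides.

```python
def divisible_by_3(digits: str) -> bool:
    """Decide divisibility by 3 while remembering a single state in {0, 1, 2}."""
    state = 0  # remainder of the number read so far
    for ch in digits:
        # Appending a digit multiplies the number by 10 and adds the digit.
        state = (state * 10 + int(ch)) % 3
    return state == 0
```

The amount of memory does not depend on the length of the input, which is the informal criterion from the slide. As written, the empty string is accepted because the start state is also the accepting state.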

slide-18
SLIDE 18

NUMBER OF COUNTERS

  • {0ⁱ1ⁿ…}
  • no relation between i and n ⇒ regular
  • 1 counter ⇒ context-free
  • n counters ⇒ context-sensitive
  • ∞ counters ⇒ recursively enumerable

Himanshu Saikia, http://math.stackexchange.com/questions/282216/determine-if-a-language-is-regular-from-the-first-sight

slide-19
SLIDE 19

DISASSEMBLE / MASSAGE

  • {0ⁿ1ⁿ | n > 1}
  • {0ⁱ1ⁿ | n > 1, i > 1, i ≠ n}
  • the matching brackets language is not regular
  • ⇒ no language of matching pairs is regular
  • Many combinations of regular languages are regular
  • Proving by decomposition is valid
slide-20
SLIDE 20

THE CLASS OF REGULAR LANGUAGES

slide-21
SLIDE 21

CLASS CLOSED UNDER COMPLEMENT

  • If A is a regular language, then
  • Ā is regular
  • Meaning…
  • grep -v "123" file.txt
  • (Must know the alphabet Σ)
  • (Actually stronger: any finite number of errors)
  • J. E. Hopcroft, R. Motwani, J. D. Ullman, Introduction to Automata Theory, Languages and Computation, Chapter 4.
  • E. Stearns, J. Hartmanis, Regularity Preserving Modifications of Regular Expressions, Information & Control 6:55–69, 1963.
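
A minimal sketch of why complement preserves regularity: given a deterministic automaton with a total transition function, swapping accepting and non-accepting states yields an automaton for the complement. The dictionary layout and the example automaton `CONTAINS_11` are assumptions made for illustration.

```python
def accepts(dfa, word):
    """Run the DFA on the word and report whether it ends in an accepting state."""
    state = dfa["start"]
    for symbol in word:
        state = dfa["delta"][(state, symbol)]
    return state in dfa["accepting"]


def complement(dfa):
    """Same automaton, with accepting and non-accepting states swapped."""
    return {
        "states": dfa["states"],
        "start": dfa["start"],
        "delta": dfa["delta"],
        "accepting": dfa["states"] - dfa["accepting"],
    }


# Example DFA over {0, 1}: accepts words containing the substring "11".
CONTAINS_11 = {
    "states": {"q0", "q1", "q2"},
    "start": "q0",
    "delta": {
        ("q0", "0"): "q0", ("q0", "1"): "q1",
        ("q1", "0"): "q0", ("q1", "1"): "q2",
        ("q2", "0"): "q2", ("q2", "1"): "q2",
    },
    "accepting": {"q2"},
}
```

The construction relies on the transition table covering every symbol in every state, which is one way to read the remark that the alphabet Σ must be known.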
slide-22
SLIDE 22

CLASS CLOSED UNDER SET UNION

  • If A and B are regular languages, then
  • A∪B is regular
  • Meaning…
  • [a-z]
  • x | y | z (in some notations)
  • J. E. Hopcroft, R. Motwani, J. D. Ullman, Introduction to Automata Theory, Languages and Computation, Chapter 4.
slide-23
SLIDE 23

CLASS CLOSED UNDER INTERSECTION

  • If A and B are regular languages, then
  • A∩B is regular
  • Meaning…
  • cat file.txt | grep "abc" | grep "xyz"
  • (Not true for context-free languages!)
  • J. E. Hopcroft, R. Motwani, J. D. Ullman, Introduction to Automata Theory, Languages and Computation, Chapter 4.
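
One standard way to see closure under intersection is the product construction: run both automata in lockstep and accept only when both accept. The sketch below assumes DFAs stored as dictionaries with total transition tables; the two example automata and all names are hypothetical.

```python
from itertools import product


def accepts(dfa, word):
    """Run the DFA on the word and report whether it ends in an accepting state."""
    state = dfa["start"]
    for symbol in word:
        state = dfa["delta"][(state, symbol)]
    return state in dfa["accepting"]


def intersection(a, b, alphabet):
    """Product automaton: each state is a pair (state of a, state of b)."""
    states = set(product(a["states"], b["states"]))
    delta = {
        ((p, q), s): (a["delta"][(p, s)], b["delta"][(q, s)])
        for (p, q) in states
        for s in alphabet
    }
    return {
        "states": states,
        "start": (a["start"], b["start"]),
        "delta": delta,
        "accepting": set(product(a["accepting"], b["accepting"])),
    }


# Words over {0, 1} with an even number of 0s.
EVEN_ZEROS = {
    "states": {"even", "odd"},
    "start": "even",
    "delta": {
        ("even", "0"): "odd", ("even", "1"): "even",
        ("odd", "0"): "even", ("odd", "1"): "odd",
    },
    "accepting": {"even"},
}

# Words over {0, 1} that end in 1.
ENDS_IN_1 = {
    "states": {"no", "yes"},
    "start": "no",
    "delta": {
        ("no", "0"): "no", ("no", "1"): "yes",
        ("yes", "0"): "no", ("yes", "1"): "yes",
    },
    "accepting": {"yes"},
}

BOTH = intersection(EVEN_ZEROS, ENDS_IN_1, "01")
```

The product has at most |Q_A| × |Q_B| states, so it is still finite. Pushdown automata cannot be combined this way because two stacks would be needed, which fits the remark that the property fails for context-free languages.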
slide-24
SLIDE 24

CLASS CLOSED UNDER DIFFERENCE

  • If A and B are regular languages, then
  • A∖B is regular
  • Meaning…
  • cat file.txt | grep "abc" | grep -v "123"
  • (Not true for context-free languages!)
  • J. E. Hopcroft, R. Motwani, J. D. Ullman, Introduction to Automata Theory, Languages and Computation, Chapter 4.
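
The grep pipeline above can be mirrored in a few lines of Python, which makes the identity A∖B = A ∩ B̄ visible: keep a line if it matches the first pattern and does not match the second. This is a sketch; the function name `difference_filter` is hypothetical.

```python
import re


def difference_filter(lines, keep, drop):
    """Keep lines that match `keep` and do not match `drop` (like grep | grep -v)."""
    keep_re = re.compile(keep)
    drop_re = re.compile(drop)
    return [
        line for line in lines
        if keep_re.search(line) and not drop_re.search(line)
    ]
```

For example, `difference_filter(["abc", "abc123", "xyz"], "abc", "123")` keeps only the first line.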
slide-25
SLIDE 25

CLASS CLOSED UNDER ITERATION

  • If A is a regular language, then
  • A* and A⁺ are regular
  • Meaning…
  • [a]*
  • [a]⁺
  • J. E. Hopcroft, R. Motwani, J. D. Ullman, Introduction to Automata Theory, Languages and Computation, Chapter 4.
slide-26
SLIDE 26

CLASS CLOSED UNDER CONCATENATION

  • If A and B are regular languages, then
  • AB is regular
  • Meaning…
  • [Bb][Oo][Dd][Yy]
  • (Just glue regexps; in practice, watch out for subgroups)
  • J. E. Hopcroft, R. Motwani, J. D. Ullman, Introduction to Automata Theory, Languages and Computation, Chapter 4.
slide-27
SLIDE 27

CLASS [SOMETIMES] CLOSED UNDER DECOMPOSITION

  • If A is a regular language, then
  • “front halves” is regular
  • “tail halves” is regular
  • “middle thirds” is regular
  • “arbitrary halves/thirds” is regular
  • NB: glued side thirds is NOT regular
  • E. Stearns, J. Hartmanis, Regularity Preserving Modifications of Regular Expressions, Information & Control 6:55–69, 1963.
slide-28
SLIDE 28

CLASS CLOSED UNDER HOMOMORPHISM

  • If A is a regular language and
  • h : Σ → Σ*
  • then
  • h(A) is regular
  • Meaning that debugging is feasible
  • (Even better for context-free languages: substitutions)
  • J. E. Hopcroft, R. Motwani, J. D. Ullman, Introduction to Automata Theory, Languages and Computation, Chapter 4.
slide-29
SLIDE 29

var whitelist = @"</?p>|<br\s?/?>|</?b>|</?strong>|</?i>|</?em>| </?s>|</?strike>|</?blockquote>|</?sub>|</?super>| </?h(1|2|3)>|</?pre>|<hr\s?/?>|</?code>|</?ul>| </?ol>|</?li>|</a>|<a[^>]+>|<img[^>]+/?>";

REGEXPS YOU NEED TO DEBUG

Jeff Atwood, If You Like Regular Expressions So Much, Why Don't You Marry Them?, 22 Mar 2005. Jeff Atwood, Regular Expressions: Now You Have Two Problems, 27 Jun 2008.
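
One way to make such a pattern easier to debug is to write one alternative per line with a comment. The sketch below does this in Python with `re.VERBOSE` for a simplified subset of the whitelist above; the subset, the constant `WHITELIST` and the helper `is_whitelisted` are assumptions for illustration, not the original C# code.

```python
import re

# Simplified subset of the whitelist, one alternative per line.
WHITELIST = re.compile(
    r"""
      </?p>          # paragraph open/close
    | <br\s?/?>      # line break: <br>, <br/>, <br />
    | </?b>          # bold open/close
    | </?h(1|2|3)>   # headings h1..h3
    | </a>           # closing anchor
    | <a[^>]+>       # opening anchor with attributes
    """,
    re.VERBOSE,
)


def is_whitelisted(tag: str) -> bool:
    """Return True if the whole tag matches one of the whitelisted forms."""
    return WHITELIST.fullmatch(tag) is not None
```

Laid out like this, a weakness becomes easier to spot: the alternative `<a[^>]+>` also accepts `<abbr>`, because nothing forces whitespace after the `a`.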

slide-30
SLIDE 30

(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r \n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\ [\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)? [ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?: (?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r \n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;: \\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)? [ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)? [ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)* \](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\ [\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\". 
\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?: [^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(?:\r \n)?[ \t])*)|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)? [ \t]))*"(?:(?:\r\n)?[ \t])*)*:(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\ [\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)? [ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?: [^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?: [^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)? [ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)* \](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\ [\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\". 
\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\ ["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)? [ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\ [\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r \n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*) (?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r \n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)(?:,\s*(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^ \"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(? =[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+ (?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000- \031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)? 
[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*) (?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r \n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\ \.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\ [([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\ ["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+ (?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?: [^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r \n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)? [ \t])*))*\>(?:(?:\r\n)?[ \t])*))*)?;\s*)

“somewhat pushes the limits of what it is sensible to do with regular expressions”

Jeff Atwood, Regex use vs. Regex abuse, 16 Feb 2005. RFC822. Paul Warren, Mail::RFC822::Address: regexp-based address validation, 17/09/2012.

slide-31
SLIDE 31

TOOLS OVERVIEW

slide-32
SLIDE 32

ALL TOOLS ARE DIFFERENT

POSIX standard since 1993

(who the hell uses [[:digit:]] anyway?)

slide-33
SLIDE 33

GREP

  • Ken Thompson: qed, ed, grep
  • grep "abc" program.c
  • grep '[0-9]' file.txt (or grep -P '\d' file.txt with GNU grep; plain grep has no \d)
  • grep '^#* \w' README.md

photo from: Archetypal hackers ken (left) and dmr (right).

slide-34
SLIDE 34

SED

  • sed 's/Finite/Regular/g' oldfile >newfile
  • sed -n 12,18p myfile
  • sed 12,18d myfile
  • sed 12q myfile
  • sed 12,/foo/d myfile
  • sed '$d' myfile
  • sed -n '/[0-9]\{2\}/p' myfile
  • sed '5!s/Finite/Regular/g' oldfile >newfile
  • sed '/Automaton/!s/Finite/Regular/g' oldfile >newfile

Lee E. McMahon, sed, Stream EDitor, 1973 or 1974, http://www.columbia.edu/~rh120/ch106.x09 photo from: http://merdivengo.blogspot.com/2012/03/turnuva-sistemleri-uzerine.html
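
The last sed command combines an address with negation: substitute on every line except those matching the guard pattern. A rough Python equivalent is sketched below to make that reading explicit; the function name `substitute_unless` is hypothetical.

```python
import re


def substitute_unless(lines, guard, pattern, replacement):
    """Apply the substitution to every line that does NOT match the guard."""
    guard_re = re.compile(guard)
    return [
        line if guard_re.search(line) else re.sub(pattern, replacement, line)
        for line in lines
    ]
```

With the arguments `"Automaton"`, `"Finite"`, `"Regular"`, a line such as "Finite Automaton" is left alone while "Finite set" becomes "Regular set".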

slide-35
SLIDE 35

ORIGINS OF AWK

slide-36
SLIDE 36

AWK

  • Turing-complete one-liner language with regexps
  • Built-in variables
  • $0, NF, $1, $2, $3, …, $NF
  • FILENAME, NR, FS, OFS, RS, ORS
  • Built-in operators
  • print, printf, length, $
  • Can define own functions & variables
  • A. V. Aho, B. W. Kernighan, P. J. Weinberger, AWK - A Pattern Scanning and Processing Language. SPE, 9(4): 267-279, 1979.

slide-37
SLIDE 37

AWK IN ACTION

slide-38
SLIDE 38

AWK EXAMPLES

  • {
        w += NF
        c += length + 1
    }
    END { print NR, w, c }

  • yes Wikipedia | awk 'NR % 4 == 1 { printf "%6d %s\n", NR, $0 }' | sed 5q

  • #!/usr/bin/awk -f
    BEGIN { print "Hello, world!" }

https://en.wikipedia.org/wiki/AWK
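
For readers less familiar with awk, the first example (a word-count program) can be restated in Python. This is a sketch of the same bookkeeping, assuming newline-terminated input; the function name `wc` is hypothetical.

```python
def wc(text: str):
    """Return (lines, words, characters), mirroring NR, w and c in the awk program."""
    lines = text.splitlines()
    words = sum(len(line.split()) for line in lines)  # NF: whitespace-separated fields
    chars = sum(len(line) + 1 for line in lines)      # length + 1 counts the newline
    return len(lines), words, chars
```

Calling `wc("hello world\nfoo\n")` returns `(2, 3, 16)`.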

slide-39
SLIDE 39

AWK OUTPUT

slide-40
SLIDE 40

LEX

  • Regexps used for the first phase of parsing since 1968.
  • Wikipedia explains why it is used together with yacc/bison:
  • Collection of regexp patterns with actions
  • [a-zA-Z]+ { printf("Word: %s\n", yytext); }
  • .|\n {}
  • Easy to write a tokeniser
  • W. L. Johnson, J. H. Porter, S. I. Ackley, D. T. Ross, Automatic Generation of Efficient Lexical Processors Using Finite State Techniques. Communications of the ACM 11 (12): 805–813, 1968.
  • M. E. Lesk, LEX - A Lexical Analyzer Generator, CSTR 39, Bell Laboratories, 1975.
slide-41
SLIDE 41

LEX EXAMPLE

%{
int lineno=1;
%}
letter  [a-zA-Z]
digit   [0-9]
id      {letter}({letter}|{digit})*
number  {digit}+
%%
        printf("\nTokeniser running -- ^D to exit\n");
^{id}                 {line();printf("<id>");}
{id}                  printf("<id>");
^{number}             {line();printf("<number>");}
{number}              printf("<number>");
^[ \t]+               line();
[ \t]+                printf(" ");
[\n]                  ECHO;
^[^a-zA-Z0-9 \t\n]+   {line();printf("\\%s\\",yytext);}
[^a-zA-Z0-9 \t\n]+    printf("\\%s\\",yytext);
%%
line()
{ printf("%4d: ",lineno++); }

  • M. G. Roth, CS 631, Lex example, https://www.cs.uaf.edu/~cs631/lex_token.txt

character-level grammar: the definitions section reads as letter ::= [a-zA-Z], digit ::= [0-9], id ::= {letter}({letter}|{digit})*, number ::= {digit}+

slide-42
SLIDE 42

PERL [, TCL, PYTHON, …]

  • Henry Spencer made advanced regex in 1986
  • his DFA/NFA-based TCL version is faster!
  • Can be used as sed:
  • perl -pi -w -e 's/Perl/Python/g;' *
  • Or, in programs:
  • $bar =~ /foo/
  • Redesigned in Perl 6 (merged with PEG)
slide-43
SLIDE 43

PERL REGEX EXAMPLES

  • Match
  • my ($hs, $ms, $ss) = ($time =~ m/(\d+):(\d+):(\d+)/);
  • Substitute
  • $s =~ s/dog/cat/;
  • Transliterate
  • $uc =~ tr/a-z/A-Z/;

Tutorialspoint, PERL Regular Expressions, http://www.tutorialspoint.com/perl/perl_regular_expression.htm
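
Since the slide title also mentions Python, here is a rough Python counterpart of the three Perl operations. It is a sketch with made-up sample strings; the variable names are chosen to echo the Perl snippets.

```python
import re
import string

# Match: capture hours, minutes and seconds, like m/(\d+):(\d+):(\d+)/.
time_text = "12:34:56"
hs, ms, ss = re.search(r"(\d+):(\d+):(\d+)", time_text).groups()

# Substitute: Perl's s/dog/cat/ without /g replaces only the first match.
s = re.sub(r"dog", "cat", "hot dog, cold dog", count=1)

# Transliterate: tr/a-z/A-Z/ maps characters one to one.
table = str.maketrans(string.ascii_lowercase, string.ascii_uppercase)
uc = "regular".translate(table)
```

Note that transliteration is a character mapping rather than a regular expression, in Perl as well as in Python.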

slide-44
SLIDE 44

PCRE

  • “Perl Compatible Regular Expressions”
  • P.S.: not compatible with Perl
  • P.P.S.: not regular
  • C library by Philip Hazel (stable release Dec. 2013)
  • PCREs are used in other languages
  • PHP, Ruby, JavaScript, …
  • Way beyond regular: backrefs, recursion, assertions, …
  • <(\w+)>.*<\/\1>, \((?R)*\)

http://www.pcre.org
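
A backreference is enough to leave the regular class, because the pattern has to remember an arbitrary earlier substring. The sketch below tries the first pattern from the slide with Python's standard `re` module; the recursive pattern is left out because that module has no `(?R)`. The names `PAIRED_TAG` and `has_paired_tag` are hypothetical.

```python
import re

# \1 must repeat exactly what the first group captured.
PAIRED_TAG = re.compile(r"<(\w+)>.*</\1>")


def has_paired_tag(text: str) -> bool:
    """Return True if the text contains an opening tag with a matching closing tag."""
    return PAIRED_TAG.search(text) is not None
```

Here `has_paired_tag("<b>bold</b>")` succeeds while `has_paired_tag("<b>bold</i>")` fails, a distinction that no finite automaton can make for an unbounded set of tag names.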

slide-45
SLIDE 45

RASCAL (METAPROGRAMMING)

  • Java Regex
  • /xyz/ := "xyz"
  • if (/xyz/ := s) {…}
  • if (/x<m:y+>z/ := s) println(m);
  • /[0-9]+ \w*/ := "1098 XG"
  • Lexical grammars
  • lexical Number = [1-9][0-9]*;
  • parse(#Number, file);

http://rascal-mpl.org/

slide-46
SLIDE 46

CONCLUSION

  • Benefits of regular languages:
  • lexical tools are fast & always applicable
  • (relatively) easy to develop
  • Drawbacks:
  • very limited context
  • (usually) many false positives, requires tweaking
slide-47
SLIDE 47

SUMMARY

all ⊃ recursively enumerable ⊃ context-sensitive ⊃ context-free ⊃ regular ⊃ finite

  • Chomsky hierarchy
  • languages, automata, algorithms, rewriting systems, hardware
  • Judging if regular
  • pumping lemma, Myhill-Nerode, memory, counters, disassembly
  • Class closed under
  • complement, union, intersection,
  • difference, concatenation,
  • homomorphism
  • Tools
  • grep, perl, sed, awk, rsc
  • To read: Jeffrey E. F. Friedl, Mastering Regular Expressions: Understand Your Data and Be More Productive, O’Reilly, 2006.

slide-48
SLIDE 48

THANKS FOR YOUR ATTENTION!

  • This was Dr. Vadim Zaytsev a.k.a. grammarware
  • grammarware.net, twitter.com/grammarware, grammarware.github.com, …
  • I usually teach at Master Software Engineering
  • and do research on grammars and software languages
  • Affiliations
  • UvA (2013–2014), CWI (2000, 2010–2013), Uni Koblenz (2008–2010), VU (2004–2008), UTwente (2002–2004), Rostov State Transport University (1999–2008), Rostov State University (1998–2003), Desk.nl (1999, 2001)

  • Slides are CC-BY-SA: grammarware.net/slides/2014/regular.pdf