Algorithms (Robert Sedgewick | Kevin Wayne): 5.4 Regular Expressions

  1. Algorithms, Fourth Edition. Robert Sedgewick | Kevin Wayne. http://algs4.cs.princeton.edu
     5.4 Regular Expressions
     ‣ regular expressions
     ‣ REs and NFAs
     ‣ NFA simulation
     ‣ NFA construction
     ‣ applications

  2. 5.4 Regular Expressions ‣ regular expressions ‣ REs and NFAs ‣ NFA simulation ‣ NFA construction ‣ applications

  3. Pattern matching
     Substring search. Find a single string in text.
     Pattern matching. Find one of a specified set of strings in text.
     Ex. [genomics]
     ・ Fragile X syndrome is a common cause of mental retardation.
     ・ A human's genome is a string.
     ・ It contains triplet repeats of CGG or AGG, bracketed by GCG at the beginning and CTG at the end.
     ・ Number of repeats is variable and is correlated to syndrome.
     pattern  GCG(CGG|AGG)*CTG
     text     GCGGCGTGTGTGCGAGAGAGTGGGTTTAAAGCTGGCGCGGAGGCGGCTGGCGCGGAGGCTG
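
The genomics example can be tried directly with Java's built-in `java.util.regex` package. The sketch below is not the NFA implementation this section builds toward; the class name `TripletRepeatScan` and its helper methods are hypothetical names chosen for illustration. It reports each non-overlapping occurrence of the pattern in the slide's text and how many CGG/AGG triplets it contains.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class (not from the slides): scans a genome string
// for the triplet-repeat pattern using java.util.regex.
final class TripletRepeatScan {
    // Pattern from the slide: GCG, then any number of CGG or AGG, then CTG.
    static final Pattern FRAGILE_X = Pattern.compile("GCG(CGG|AGG)*CTG");

    // Returns every non-overlapping match, scanning left to right.
    static List<String> findAll(String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = FRAGILE_X.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    // Number of triplets between the GCG and CTG brackets of a match:
    // the two brackets account for 6 characters, each repeat for 3.
    static int repeatCount(String match) {
        return (match.length() - 6) / 3;
    }

    public static void main(String[] args) {
        String text = "GCGGCGTGTGTGCGAGAGAGTGGGTTTAAAGCTGGCGCGGAGGCGGCTGGCGCGGAGGCTG";
        for (String hit : findAll(text)) {
            System.out.println(hit + " repeats=" + repeatCount(hit));
        }
    }
}
```

On the slide's text this finds two occurrences, `GCGCGGAGGCGGCTG` (3 repeats) and `GCGCGGAGGCTG` (2 repeats).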

  4. Syntax highlighting (GNU source-highlight 3.1.4)
     input: Ada, Asm, Applescript, Awk, Bat, Bib, Bison, C/C++, C#, Cobol, Caml, Changelog, Css, D, Erlang, Flex, Fortran, GLSL, Haskell, Html, Java, Javalog, Javascript, Latex, Lisp, Lua, ⋮
     output: HTML, XHTML, LATEX, MediaWiki, ODF, TEXINFO, ANSI, DocBook
     Example input:

         /*************************************************************************
          *  Compilation:  javac NFA.java
          *  Execution:    java NFA regexp text
          *  Dependencies: Stack.java Bag.java Digraph.java DirectedDFS.java
          *
          *  % java NFA "(A*B|AC)D" AAAABD
          *  true
          *
          *  % java NFA "(A*B|AC)D" AAAAC
          *  false
          *
          *************************************************************************/

         public class NFA
         {
             private Digraph G;      // digraph of epsilon transitions
             private String regexp;  // regular expression
             private int M;          // number of characters in regular expression

             // Create the NFA for the given RE
             public NFA(String regexp)
             {
                 this.regexp = regexp;
                 M = regexp.length();
                 Stack<Integer> ops = new Stack<Integer>();
                 G = new Digraph(M+1);
                 ...

  5. Google code search http://code.google.com/p/chromium/source/search

  6. Pattern matching: applications
     Test if a string matches some pattern.
     ・ Scan for virus signatures.
     ・ Process natural language.
     ・ Specify a programming language.
     ・ Access information in digital libraries.
     ・ Search genome using PROSITE patterns.
     ・ Filter text (spam, NetNanny, Carnivore, malware).
     ・ Validate data-entry fields (dates, email, URL, credit card).
     ...
     Parse text files.
     ・ Compile a Java program.
     ・ Crawl and index the Web.
     ・ Read in data stored in ad hoc input file format.
     ・ Create Java documentation from Javadoc comments.
     ...

  7. Regular expressions
     A regular expression is a notation to specify a (possibly infinite) set of strings.

     operation       precedence order   example RE    matches            does not match
     concatenation   3                  AABAAB        AABAAB             every other string
     or              4                  AA | BAAB     AA, BAAB           every other string
     closure         2                  AB*A          AA, ABBBBBBBBA     AB, ABABA
     parentheses     1                  A(A|B)AAB     AAAAB, ABAAB       every other string
                                        (AB)*A        A, ABABABABABA     AA, ABBA
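
The precedence column is easiest to see by comparing two REs that differ only in parentheses. The sketch below uses `java.util.regex.Pattern`, which applies the same precedence to these four operations; the class name `ReOperations` is a hypothetical name for illustration. Spaces are literal characters in Java regular expressions, so the table's `AA | BAAB` is written without them.

```java
import java.util.regex.Pattern;

// Hypothetical demo class (not from the slides).
final class ReOperations {
    // True if the entire string s belongs to the set specified by re.
    static boolean matches(String re, String s) {
        return Pattern.matches(re, s);
    }

    public static void main(String[] args) {
        // Closure binds tighter than concatenation: AB*A means A(B*)A.
        System.out.println(matches("AB*A", "ABBA"));       // true
        System.out.println(matches("(AB)*A", "ABBA"));     // false
        // Concatenation binds tighter than or: AA|BAAB means (AA)|(BAAB).
        System.out.println(matches("AA|BAAB", "AAAAB"));   // false
        System.out.println(matches("A(A|B)AAB", "AAAAB")); // true
    }
}
```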

  8. Regular expression shortcuts
     Additional operations are often added for convenience.

     operation         example RE           matches                   does not match
     wildcard          .U.U.U.              CUMULUS, JUGULUM          SUCCUBUS, TUMULTUOUS
     character class   [A-Za-z][a-z]*       word, Capitalized         camelCase, 4illegal
     at least 1        A(BC)+DE             ABCDE, ABCBCDE            ADE, BCDE
     exactly k         [0-9]{5}-[0-9]{4}    08540-1321, 19072-5541    111111111, 166-54-111

     Ex. [A-E]+ is shorthand for (A|B|C|D|E)(A|B|C|D|E)*
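
The claim that a shortcut is only shorthand can be checked mechanically. The sketch below expands a character-class-plus into the basic operations and then compares the two REs on every string up to a given length. All names (`ShorthandExpansion`, `alternation`, `expandPlus`, `agree`) are hypothetical, and the expansion assumes the range contains only letters or digits, with no regex metacharacters. Agreement on short strings is evidence for the equivalence, not a proof.

```java
import java.util.regex.Pattern;

// Hypothetical demo class (not from the slides).
final class ShorthandExpansion {
    // Builds "(A|B|...)" for a character range such as A-E.
    // Assumes the range holds only letters or digits (no metacharacters).
    static String alternation(char lo, char hi) {
        StringBuilder sb = new StringBuilder("(");
        for (char c = lo; c <= hi; c++) {
            if (c > lo) sb.append('|');
            sb.append(c);
        }
        return sb.append(')').toString();
    }

    // Rewrites [lo-hi]+ using only or, concatenation, and closure.
    static String expandPlus(char lo, char hi) {
        String alt = alternation(lo, hi);
        return alt + alt + "*";
    }

    // True if both REs accept exactly the same strings among all strings
    // over the alphabet with length at most maxLen.
    static boolean agree(String re1, String re2, String alphabet, int maxLen) {
        return agree(Pattern.compile(re1), Pattern.compile(re2), alphabet, "", maxLen);
    }

    private static boolean agree(Pattern p1, Pattern p2, String alphabet,
                                 String prefix, int remaining) {
        if (p1.matcher(prefix).matches() != p2.matcher(prefix).matches()) return false;
        if (remaining == 0) return true;
        for (char c : alphabet.toCharArray()) {
            if (!agree(p1, p2, alphabet, prefix + c, remaining - 1)) return false;
        }
        return true;
    }
}
```

For example, `expandPlus('A', 'E')` yields `(A|B|C|D|E)(A|B|C|D|E)*`, and `agree("[A-E]+", expandPlus('A', 'E'), "ABCDEF", 3)` compares the two on all 259 strings of length at most 3 over the letters A to F.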

  9. Regular expression examples
     RE notation is surprisingly expressive.

     regular expression                                           matches                                  does not match
     .*SPB.* (substring search)                                   RASPBERRY, CRISPBREAD                    SUBSPACE, SUBSPECIES
     [0-9]{3}-[0-9]{2}-[0-9]{4} (U.S. Social Security numbers)    166-11-4433, 166-45-1111                 11-55555555, 8675309
     [a-z]+@([a-z]+\.)+(edu|com) (simplified email addresses)     wayne@princeton.edu, rs@princeton.edu    spam@nowhere
     [$_A-Za-z][$_A-Za-z0-9]* (Java identifiers)                  ident3, PatternMatcher                   3a, ident#3

     REs play a well-understood role in the theory of computation.
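
These four REs can be used as whole-string validators with `java.util.regex`. In the sketch below, the class name `FieldValidators` and the constant names are hypothetical; the patterns themselves are copied from the table, with the backslash doubled as Java string literals require.

```java
import java.util.regex.Pattern;

// Hypothetical demo class (not from the slides): the table's REs,
// compiled once and reused as whole-string validators.
final class FieldValidators {
    static final Pattern CONTAINS_SPB = Pattern.compile(".*SPB.*");
    static final Pattern SSN = Pattern.compile("[0-9]{3}-[0-9]{2}-[0-9]{4}");
    static final Pattern EMAIL = Pattern.compile("[a-z]+@([a-z]+\\.)+(edu|com)");
    static final Pattern JAVA_IDENTIFIER = Pattern.compile("[$_A-Za-z][$_A-Za-z0-9]*");

    // True only if the pattern matches the entire input, not just a part of it.
    static boolean isValid(Pattern p, String input) {
        return p.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid(EMAIL, "wayne@princeton.edu")); // true
        System.out.println(isValid(EMAIL, "spam@nowhere"));        // false
    }
}
```

Using `matches()` rather than `find()` matters here: `find()` would accept `ident#3` as containing a Java identifier, whereas the table's claim is about the whole string.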

  10. Regular expression golf (http://xkcd.com/1313)
      Ex. Match elected presidents but not opponents (unless they later won).
      yes: obama, bush, clinton, reagan, …, washington
      no: romney, mccain, kerry, gore, ..., clinton
      RE. bu|[rn]t|[coy]e|[mtg]a|j|iso|n[hl]|[ae]d|lev|sh|[lnd]i|[po]o|ls
      The alternative iso, for example, matches madison and harrison.
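
In regex golf a name counts as matched when the RE matches anywhere inside it, so a check needs `find()` rather than a whole-string match. The sketch below tests the slide's RE against a few of the listed names; the class name `RegexGolf` and its methods are hypothetical, and only a handful of names from the two lists are used.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical demo class (not from the slides).
final class RegexGolf {
    // The RE from the slide, unchanged.
    static final Pattern WINNERS = Pattern.compile(
        "bu|[rn]t|[coy]e|[mtg]a|j|iso|n[hl]|[ae]d|lev|sh|[lnd]i|[po]o|ls");

    // A name is hit if the RE matches some substring of it.
    static boolean hit(String name) {
        return WINNERS.matcher(name).find();
    }

    // True if the RE hits every name in yes and no name in no.
    static boolean separates(List<String> yes, List<String> no) {
        return yes.stream().allMatch(RegexGolf::hit)
            && no.stream().noneMatch(RegexGolf::hit);
    }
}
```

Because the RE hits `clinton`, a "no" list that includes that name cannot be separated, which is why the exercise exempts opponents who later won.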

  11. Illegally screening a job candidate
      “[First name of a candidate]! and pre/2 [last name of a candidate] w/7 bush or gore or republican! or democrat! or charg! or accus! or criticiz! or blam! or defend! or iran contra or clinton or spotted owl or florida recount or sex! or controvers! or racis! or fraud! or investigat! or bankrupt! or layoff! or downsiz! or PNTR or NAFTA or outsourc! or indict! or enron or kerry or iraq or wmd! or arrest! or intox! or fired or sex! or racis! or intox! or slur! or arrest! or fired or controvers! or abortion! or gay! or homosexual! or gun! or firearm!”
      Source: LexisNexis search string used by Monica Goodling to illegally screen candidates for DOJ positions. http://www.justice.gov/oig/special/s0807/final.pdf

  12. Can the average web surfer learn to use REs? Google. Supports * for full word wildcard and | for union.

  13. Regular expressions to the rescue http://xkcd.com/208
