Regular Expressions Principles of Programming Languages Colorado - PowerPoint PPT Presentation

Regular Expressions Principles of Programming Languages Colorado School of Mines https://lambda.mines.edu CSCI-400

Learning Group Activity
You should have researched one of these topics on the LGA:
  - Reference Counting
  - Smart Pointers
  - Valgrind
Explain to your group!

Regular Expressions
Regular expression languages describe a search pattern on a string.
  - They are called regular since they implement a regular language: a language which can be described using a finite state machine.
  - They are typically used for determining if a string matches a pattern, replacing a pattern in a string, or extracting information from a string.
  - Regular expression languages are a family of languages, rather than just a single language.
  - Many modern regular expression languages were inspired by Perl's regular expression syntax.

Python's Regular Expressions
Python's regular expression language can be accessed using the re module:

    >>> import re

Regular expressions can be compiled using re.compile. This returns a regular expression object:

    >>> p = re.compile(r'ab[cd]')

There are a number of things we might want to do with p here:
  - p.match: Match the beginning of a string
  - p.fullmatch: Match the whole string, without allowing characters at the end
  - p.search: Match anywhere in the string
  - p.finditer: Iterate over all of the matches in the string
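The difference between the four methods is easiest to see side by side. The following is a minimal sketch using the pattern compiled above; the sample strings are made up for illustration.

```python
import re

p = re.compile(r'ab[cd]')

# match: anchored at the start, but trailing characters are allowed
at_start = p.match('abcxyz')
# fullmatch: the entire string must match
whole = p.fullmatch('abcxyz')
# search: the first match anywhere in the string
anywhere = p.search('xxabdxx')
# finditer: every non-overlapping match, left to right
found = [m.group(0) for m in p.finditer('abc abd abe')]

print(at_start is not None)   # True
print(whole is None)          # True
print(anywhere.span())        # (2, 5)
print(found)                  # ['abc', 'abd']
```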

Character Sets
[abcd] is a character set. It matches a single a, b, c, or d, only once.
Character sets also support a shorthand for ranges of characters, for example:
  - [0-9] matches a single digit
  - [a-z] matches a lowercase letter
  - [A-Z] matches an uppercase letter
These can even be combined:
  - [a-zA-Z2] will match a single lowercase letter, uppercase letter, or the digit 2.
A ^ (caret) at the beginning of a character set negates the set:
  - [^0-9] will match a single character that is not a digit.
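As a quick check of the character sets above, here is a small sketch. The helper name matches is hypothetical and simply wraps re.fullmatch so that the whole test string has to match.

```python
import re

def matches(pattern, text):
    """Hypothetical helper: True if the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None

print(matches(r'[abcd]', 'c'), matches(r'[abcd]', 'e'))          # True False
print(matches(r'[a-zA-Z2]', '2'), matches(r'[a-zA-Z2]', '3'))    # True False
print(matches(r'[^0-9]', 'x'), matches(r'[^0-9]', '7'))          # True False
print(matches(r'[abcd]', 'ab'))   # False: a set matches exactly one character
```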

Special Character Sets
As a convenience, Python gives us access to a few nice character sets:
  - \s matches any whitespace character
  - \S matches any non-whitespace character
  - \d matches any digit
  - \D matches any non-digit
  - \w matches any "word" character (capital letters, lowercase letters, digits, and underscores)
  - \W matches any non-word character
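To see what each shorthand picks out, the sketch below runs re.findall over one made-up sample string that contains a space and a tab.

```python
import re

text = 'room_42 is\topen'   # made-up sample containing a space and a tab

digits = re.findall(r'\d', text)
spaces = re.findall(r'\s', text)
words = re.findall(r'\w+', text)
non_words = re.findall(r'\W', text)

print(digits)      # ['4', '2']
print(spaces)      # [' ', '\t']
print(words)       # ['room_42', 'is', 'open']
print(non_words)   # [' ', '\t']
```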

Any character
The . matches any character, exactly once. In Python, this excludes the newline character unless re.DOTALL is passed to re.compile.
  - t.ck will match tick, tock, and tuck, but not truck.
  - To match a literal period, write "\.".
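The t.ck example can be checked directly. This sketch uses fullmatch so that each word has to match in its entirety; the 3.14 strings are made up to illustrate the escaped period.

```python
import re

p = re.compile(r't.ck')
candidates = ['tick', 'tock', 'tuck', 'truck']
hits = [w for w in candidates if p.fullmatch(w)]
print(hits)   # ['tick', 'tock', 'tuck']

# An escaped period only matches a literal '.'
literal_ok = re.fullmatch(r'3\.14', '3.14') is not None
literal_bad = re.fullmatch(r'3\.14', '3x14') is not None
print(literal_ok, literal_bad)   # True False
```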

Match Objects
When we call match, fullmatch, or search, we get back a match object, or None if it did not match. When we iterate over finditer, we iterate over all of the match objects found.

    >>> p = re.compile(r'[cd][ao][tg]')
    >>> for word in 'cat', 'dog', 'cog', 'dat', 'datt':
    ...     print(bool(p.match(word)))
    True
    True
    True
    True
    True
    >>> for word in 'orange', 'apple', 'datum':
    ...     print(bool(p.match(word)))
    False
    False
    True

Note that datt and datum match because p.match only requires the pattern to match at the beginning of the string.

How Many?
Often, we want to match the preceding character, set, or group a certain number of times:
  - ? will match 0 or 1 times
  - + will match 1 or more times
  - * will match 0 or more times
  - {n} will match n times, exactly
  - {m,n} will match between m and n times
For example:
  - a?b matches ab as well as b
  - [A-Z]* matches any amount of capital letters, including none at all
  - [0-9]+ matches one or more digits
  - .* matches any character, zero or more times
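The repetition operators can be exercised with a few made-up strings. The helper matches below is hypothetical; it wraps re.fullmatch so that the repetition counts apply to the whole string.

```python
import re

def matches(pattern, text):
    """Hypothetical helper: True if the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None

print(matches(r'a?b', 'ab'), matches(r'a?b', 'b'), matches(r'a?b', 'aab'))   # True True False
print(matches(r'[A-Z]*', ''), matches(r'[A-Z]*', 'ABC'))                     # True True
print(matches(r'[0-9]+', ''), matches(r'[0-9]+', '2024'))                    # False True
print(matches(r'[0-9]{3}', '12'), matches(r'[0-9]{3}', '123'))               # False True
print(matches(r'[0-9]{2,4}', '12345'), matches(r'[0-9]{2,4}', '1234'))       # False True
```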

Grouping
Grouping allows us to:
  - Specify groups of characters to repeat
  - Alternate on different sets of characters
  - Capture the matched group and retrieve it in our match object
Groups are written in parentheses, and alternation is specified using a vertical bar ( | ):
  - Thanks?( you)? matches: Thank, Thanks, Thank you, Thanks you
  - Thank(s| you) matches: Thanks, Thank you
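The two Thank patterns above can be compared on the same candidate phrases. This is a small sketch; the variable names are chosen for illustration only.

```python
import re

phrases = ['Thank', 'Thanks', 'Thank you', 'Thanks you']

optional = re.compile(r'Thanks?( you)?')   # optional s, optional " you"
either = re.compile(r'Thank(s| you)')      # exactly one of "s" or " you"

optional_hits = [s for s in phrases if optional.fullmatch(s)]
either_hits = [s for s in phrases if either.fullmatch(s)]

print(optional_hits)   # ['Thank', 'Thanks', 'Thank you', 'Thanks you']
print(either_hits)     # ['Thanks', 'Thank you']
```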

Grouping: Using Captures
On our match objects, we can obtain the result of a capture by calling .group:

    >>> p = re.compile(r'My name is (\w+) and I like (\w+)')
    >>> m = p.match('My name is Jack and I like computers')
    >>> m.group(1)
    'Jack'
    >>> m.group(2)
    'computers'
    >>> m.group(0)  # the whole match
    'My name is Jack and I like computers'
    >>> m.groups()  # a tuple containing all of the groups > 0
    ('Jack', 'computers')

Non-capturing Groups
Groups which begin with ?: are non-capturing groups. This means that they will not provide any visible group in the match object:

    >>> p = re.compile(r'My name is (\w+)(?:,| and) I like (\w+)')
    >>> m = p.match('My name is Jack and I like computers')
    >>> m.group(1)
    'Jack'
    >>> m.group(2)
    'computers'
    >>> m = p.match('My name is Jack, I like computers')
    >>> m.group(1)
    'Jack'
    >>> m.group(2)
    'computers'

Greediness
+, *, and ? are called greedy operators since they will try to match as many characters as possible. This may lead to undesired results:

    >>> p = re.compile(r'#(.*)#')
    >>> for m in p.finditer('#hello# a b c #world#'):
    ...     print(m.group(1))
    hello# a b c #world

If we want to match as little as possible, we can use the non-greedy version of the operator, which would be +?, *?, or ??.

    >>> p = re.compile(r'#(.*?)#')
    >>> for m in p.finditer('#hello# a b c #world#'):
    ...     print(m.group(1))
    hello
    world

Anchors
Anchors match a certain kind of occurrence in a string, but not necessarily any characters.
  - ^ anchors to the beginning of a string, or to the beginning of a line when re.MULTILINE is passed to re.compile
  - $ anchors to the end of a string (or just before a newline at the very end of the string), or to the end of a line when re.MULTILINE is passed to re.compile
  - \b anchors to the boundary of a word: the transition from a \w to a \W, or vice versa. Also anchors to the beginning or end of a string when a \w character is adjacent there.
Examples:
  - foo\b.* matches foo and foo-dle, but not foodle
  - ^$ matches the empty string
  - //.*(\n$|$) matches // hello and // hello\n, but not // hello\n\n (when the whole string is required to match)
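The anchor examples can be verified with a short sketch. The multi-line sample text is made up, and the comment pattern is applied with fullmatch, which is an assumption about how the slide intends it to be tested.

```python
import re

# \b: "foo" must be followed by a word boundary
p = re.compile(r'foo\b.*')
boundary_hits = [w for w in ['foo', 'foo-dle', 'foodle'] if p.match(w)]
print(boundary_hits)   # ['foo', 'foo-dle']

# ^ and $ match at every line only when re.MULTILINE is given
text = 'one\ntwo\nthree'
single = re.findall(r'^\w+$', text)
multi = re.findall(r'^\w+$', text, re.MULTILINE)
print(single)   # []
print(multi)    # ['one', 'two', 'three']

# The comment pattern, required to cover the whole string
c = re.compile(r'//.*(\n$|$)')
comment_results = [bool(c.fullmatch(s))
                   for s in ['// hello', '// hello\n', '// hello\n\n']]
print(comment_results)   # [True, True, False]
```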

Tip: Making Long REs Readable
Sometimes, when regular expressions get long, you need a way to comment them and break up sections to let other programmers (or yourself) know what's going on.
When you pass re.VERBOSE to re.compile, whitespace in the pattern is ignored, and # starts a comment until the end of the line:

    p = re.compile(r'''
        (\w+)                           # first name
        \s+
        (\w+)                           # last name
        \s+
        ([2-9]\d{2}-[2-9]\d{2}-\d{4})   # phone number
    ''', re.VERBOSE)
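A compiled verbose pattern is used like any other. The sketch below repeats the pattern so that it runs on its own and applies it to a made-up record; the name and phone number are invented for illustration.

```python
import re

p = re.compile(r'''
    (\w+)                           # first name
    \s+
    (\w+)                           # last name
    \s+
    ([2-9]\d{2}-[2-9]\d{2}-\d{4})   # phone number
''', re.VERBOSE)

# Made-up input record
m = p.match('Ada Lovelace 303-555-0123')
print(m.groups())   # ('Ada', 'Lovelace', '303-555-0123')
```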

RE Examples, and any Questions?
  - Matching a decimal number: [0-9]+\.?[0-9]*
  - Matching a C/C++ identifier: [A-Za-z_][A-Za-z0-9_]*
  - Matching a Mines Email address: ([A-Za-z0-9.+-]+)@(mymail\.)?mines\.edu
Tip: If you want to test a regular expression, RegExr.com is a great resource.
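These three patterns can be tried out in Python as well. The following sketch uses made-up test strings (including the email address) and fullmatch so that each string must match completely.

```python
import re

decimal = re.compile(r'[0-9]+\.?[0-9]*')
identifier = re.compile(r'[A-Za-z_][A-Za-z0-9_]*')
email = re.compile(r'([A-Za-z0-9.+-]+)@(mymail\.)?mines\.edu')

decimal_hits = [s for s in ['42', '3.14', '7.', '.5'] if decimal.fullmatch(s)]
identifier_hits = [s for s in ['_tmp', 'x1', '1x'] if identifier.fullmatch(s)]
print(decimal_hits)      # ['42', '3.14', '7.']
print(identifier_hits)   # ['_tmp', 'x1']

m = email.fullmatch('jdoe@mymail.mines.edu')   # made-up address
print(m.group(1), m.group(2))                  # jdoe mymail.
other = email.fullmatch('jdoe@example.com')
print(other)                                   # None
```

Note that the decimal pattern requires at least one digit before the optional period, so .5 is rejected while 7. is accepted.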

Finite State Machines
A finite state machine is any machine which has a finite number of states, and can only be in one state at a time. The machine has transitions that move it from one state to another.
[Figure: A state diagram for your home phone, with states s0 through s5 and the labels Phone Rings, Machine Picks Up, For Family, For You, Wrong Number, Left Message, Grabs Phone, Not Home, Talk, Goodbye, and Hangup.]

Regular Expressions as Finite State Machines
Regular expressions can be represented as finite state machines as well. Consider the following regular expression:

    ^fr?ee$

This matches both free and fee. We can write this in a state diagram like this:
[Figure: a state diagram with states s0 through s4 and transitions labeled f, r, and e.]
Required Formalisms
  - The transitions have the letters on them. The states do not.
  - Transitions correspond to only a single character, so repetition and groups must be encoded using the FSA.
  - Any state which could be a terminating state should be placed in double circles.
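One way to make the ^fr?ee$ machine concrete is to write it as a transition table and step through the input one character at a time. This is a minimal sketch: the state numbering and the exact edges are assumptions reconstructed from the regular expression, not necessarily the layout drawn on the slide, and the names TRANSITIONS, ACCEPTING, and accepts are hypothetical.

```python
# Transition table: (current state, input character) -> next state.
# The state numbering is an assumption made for this sketch.
TRANSITIONS = {
    ('s0', 'f'): 's1',
    ('s1', 'r'): 's2',   # taking the optional r
    ('s1', 'e'): 's3',   # skipping the optional r
    ('s2', 'e'): 's3',
    ('s3', 'e'): 's4',
}
ACCEPTING = {'s4'}       # the double-circled state

def accepts(text):
    """Run the machine over text; accept only if it ends in an accepting state."""
    state = 's0'
    for ch in text:
        state = TRANSITIONS.get((state, ch))
        if state is None:    # no transition for this character: reject
            return False
    return state in ACCEPTING

accepted = [w for w in ['free', 'fee', 'fe', 'freee', 'tree'] if accepts(w)]
print(accepted)   # ['free', 'fee']
```

Because each transition consumes exactly one character, the optional r shows up as two separate paths from s1 to s3 rather than as a single edge.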

Another Example: C/C++ identifiers
Recall the regular expression for C and C++ identifiers:

    [A-Za-z_][A-Za-z0-9_]*

[Figure: a state diagram with states s0 and s1 and transitions labeled A-Za-z_ and A-Za-z0-9_.]
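The identifier machine can be sketched in code in the same spirit. This assumes the usual reading of the diagram: s0 moves to s1 on A-Za-z_, s1 loops on A-Za-z0-9_, and s1 is the only accepting state. The names FIRST, REST, and is_identifier are hypothetical.

```python
import string

FIRST = set(string.ascii_letters + '_')                   # A-Za-z_
REST = set(string.ascii_letters + string.digits + '_')    # A-Za-z0-9_

def is_identifier(text):
    """Simulate the two-state machine; s1 is assumed to be the accepting state."""
    state = 's0'
    for ch in text:
        if state == 's0' and ch in FIRST:
            state = 's1'          # first character consumed
        elif state == 's1' and ch in REST:
            state = 's1'          # the * becomes a loop back to s1
        else:
            return False          # no transition for this character
    return state == 's1'

results = [is_identifier(s) for s in ['_tmp', 'x1', '1x', '', 'a-b']]
print(results)   # [True, True, False, False, False]
```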

Regess!
This is an open source tool developed by Sam Sartor (who took CSCI-400 in Spring 2018) to help you visualize regular expressions using finite state graphs:
http://gh.samsartor.com/regess/

Translating REs to State Diagrams
With your learning group, translate each of these REs to a state diagram:
  - [A-Z]+
  - [A-Z]?x (try using ϵ for the "no character" transition)
  - ([A-Z][1-5])+ (hint: draw a transition going backwards)
Write your names on your paper and turn it in for bonus learning group participation points.
