Regular Expressions (CS 2110) - PowerPoint PPT Presentation


  1. Regular Expressions CS 2110

  2. What is a regular expression?
     - A special string for describing a pattern of characters.
     - Examples:

       Regular expression   Description
       [abc]                One of three characters (a, b, OR c)
       [a-z]                A single lowercase letter
       [a-z0-9]             A single lowercase letter OR digit (not both)
       .                    Any one character (except a newline, by default)
       \.                   A period (".")
       *                    0 to many of the preceding item
       ?                    0 or 1 of the preceding item
       +                    1 or many of the preceding item
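The table above can be tried out directly in Python. This is a minimal sketch; the sample strings are made up for illustration and are not from the slides.

```python
import re

# Character class: exactly one of a, b, or c per match.
one_of_abc = re.findall(r"[abc]", "cab ride")

# Quantifiers apply to the item just before them (here, the letter b).
sample = "a ab abbb"
zero_or_more = re.findall(r"ab*", sample)   # a followed by 0 to many b
zero_or_one = re.findall(r"ab?", sample)    # a followed by 0 or 1 b
one_or_more = re.findall(r"ab+", sample)    # a followed by 1 or many b

# An unescaped dot matches any one character; an escaped dot matches only ".".
any_char = re.findall(r"a.c", "abc a-c a.c ac")
literal_dot = re.findall(r"a\.c", "abc a-c a.c ac")

print(one_of_abc, zero_or_more, zero_or_one, one_or_more, any_char, literal_dot)
```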

  3. Regex String
     - Mark regular expressions as raw strings
       - Starts with r"
     - Use square brackets for "any character from inside the brackets"
       - r"[bce]" matches "b", or "c", or "e" (but not "be" or "bc")
     - Use ranges or classes of characters
       - r"[A-Z]" matches any uppercase letter
       - r"[a-z]" matches any lowercase letter
       - r"[0-9]" matches any digit
     - Searching for hyphens: include - right after the [ or right before the ]
       - r"[-a-z]" matches a hyphen OR any lowercase letter
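To see the bracket rules in action, here is a small sketch that runs each class from this slide against one made-up string (the string itself is an assumption for illustration).

```python
import re

text = "be-Cool 42"

bce = re.findall(r"[bce]", text)               # single characters b, c, or e
uppercase = re.findall(r"[A-Z]", text)         # any uppercase letter
lowercase = re.findall(r"[a-z]", text)         # any lowercase letter
digits = re.findall(r"[0-9]", text)            # any digit
hyphen_or_lower = re.findall(r"[-a-z]", text)  # leading - is a literal hyphen

print(bce, uppercase, lowercase, digits, hyphen_or_lower)
```

Note that each match is a single character: a bracket expression never matches "be" as one unit.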

  4. Regex String
     - r"[bce]at"
       - Matches "bat", "cat", "eat"
     - r".at"
       - Matches any single character followed by "at" (for example, 3-letter words that end in "at")
     - r"at\."
       - Matches "at."
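A quick sketch of these three patterns on an invented sentence. It also shows that the dot is not limited to letters: it will happily match a space.

```python
import re

text = "bat cat eat mat at."

bce_at = re.findall(r"[bce]at", text)   # only b, c, or e before "at"
dot_at = re.findall(r".at", text)       # any one character before "at"
at_period = re.findall(r"at\.", text)   # "at" followed by a literal period

print(bce_at)
print(dot_at)
print(at_period)
```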

  5. Regex in Python
     - Import statement
       - import re
     - Compiling the regex
       - regex = re.compile(regular_expression_string)
       - regex is now a regular expression tool we can use
     - Using regex
       - results = regex.search(text)
       - results = regex.findall(text)
       - results = regex.finditer(text)
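Putting the three calls side by side may help. This is a minimal sketch with a made-up pattern and text; only the `re` calls named on the slide are used.

```python
import re

text = "bat, cat, and rat"
regex = re.compile(r"[bc]at")

# search: the first match only, as a match object (or None).
first = regex.search(text)

# findall: every matching string, as a list.
all_matches = regex.findall(text)

# finditer: one match object per match, so positions are available too.
positions = [(m.group(), m.start()) for m in regex.finditer(text)]

print(first.group(), all_matches, positions)
```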

  6. Regular Expression Examples
     - Use "^" at the start of a [] for negation:
       - r"[^a-z]" matches any character except lowercase letters
       - r"[^0-9]" matches any character except decimal digits
     - Use ^ at the start of the expression (not inside []) to mean "the start of the string" (i.e., match at the beginning of the string only)
       - e.g., if searching through a list of strings, only match strings that start with the expression
     - Use $ for the end of the string.
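The two meanings of ^ are easy to mix up, so here is a short sketch of both, plus $. The list of names is invented for illustration.

```python
import re

# ^ inside brackets negates the class.
not_lowercase = re.findall(r"[^a-z]", "abc-123 X")
not_digits = re.findall(r"[^0-9]", "a1b2")

# ^ outside brackets anchors to the start of the string; $ anchors to the end.
names = ["Frodo Baggins", "Bilbo Baggins", "Mr. Frodo"]
starts_with_frodo = [n for n in names if re.search(r"^Frodo", n)]
ends_with_baggins = [n for n in names if re.search(r"Baggins$", n)]

print(not_lowercase, not_digits)
print(starts_with_frodo, ends_with_baggins)
```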

  7. Predefined characters:

       Character   Meaning
       \d          Any digit; means the same as [0-9]
       \D          Anything EXCEPT digits; means the same as [^0-9]
       \s          Any whitespace character (" ", "\t", "\n", etc.), e.g. [ \t\n]
       \S          Any NON-whitespace character
       \\          Match a literal backslash
       \w          Matches ANY alphanumeric character or underscore: [a-zA-Z0-9_]
       \W          Matches any character that is not alphanumeric or underscore: [^a-zA-Z0-9_]

     These equivalences hold for ASCII text; with Python 3 string patterns, \d, \s, and \w also match the corresponding Unicode characters.
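Each predefined class can be checked against a small string. The sample text below is an assumption chosen to contain a digit, an underscore, a hyphen, a space, and a tab.

```python
import re

text = "a_1 b-2\tc"

digits = re.findall(r"\d", text)       # 1 and 2
non_digits = re.findall(r"\D", text)   # everything else
whitespace = re.findall(r"\s", text)   # the space and the tab
non_space = re.findall(r"\S", text)    # everything that is not whitespace
word_chars = re.findall(r"\w", text)   # letters, digits, and underscore
non_word = re.findall(r"\W", text)     # space, hyphen, and tab

# A literal backslash is written \\ inside the (raw) pattern.
backslashes = re.findall(r"\\", r"C:\temp\file")

print(digits, whitespace, word_chars, non_word, len(backslashes))
```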

  8. Regular Expression Examples
     - r"[0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]"
       - Phone number written as "123-456-7890"
     - Except, that's a little redundant, right?
     - We can write the same pattern as
       - r"[0-9]{3}-[0-9]{3}-[0-9]{4}"
     - {x} means the previous pattern must repeat exactly x times
       - "[abn]{6}" would match "banana", for example (or "nnnaaa")
     - {x,y} means the previous pattern must repeat between x and y times
       - "[abn]{3,6}" would match "ban", "nan", "abba", "banana", etc.
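As a sanity check, the long and short phone patterns can be compared on the same input, and the counted repetitions tested on the words from the slide. This sketch uses `re.fullmatch` so the whole string has to fit the pattern; that choice is mine, not the slide's.

```python
import re

long_pattern = r"[0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]"
short_pattern = r"[0-9]{3}-[0-9]{3}-[0-9]{4}"

phone = "123-456-7890"
long_ok = re.fullmatch(long_pattern, phone) is not None
short_ok = re.fullmatch(short_pattern, phone) is not None

# {6}: exactly six characters drawn from a, b, n.
exactly_six = [w for w in ["banana", "nnnaaa", "ban"] if re.fullmatch(r"[abn]{6}", w)]

# {3,6}: between three and six such characters.
three_to_six = [w for w in ["ba", "ban", "nan", "abba", "banana", "bananas"]
                if re.fullmatch(r"[abn]{3,6}", w)]

print(long_ok, short_ok, exactly_six, three_to_six)
```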

  9. Regex Examples
     - Most English first names:
       - r"[A-Z][a-z]+"
     - Dates:
       - [0-9]{2}[/-][0-9]{2}[/-][0-9]{4} OR
       - [0-9]{4}[/-][0-9]{2}[/-][0-9]{2}
     - SSN:
       - [0-9]{3}-[0-9]{2}-[0-9]{4}
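Here is a rough sketch that runs each of these patterns over invented sample text. The names, dates, and number below are placeholders, not real data.

```python
import re

names = re.findall(r"[A-Z][a-z]+", "Alice met Bob")

date_text = "Filed 03/14/2021, due 2021-04-15."
day_first_style = re.findall(r"[0-9]{2}[/-][0-9]{2}[/-][0-9]{4}", date_text)
year_first_style = re.findall(r"[0-9]{4}[/-][0-9]{2}[/-][0-9]{2}", date_text)

ssn = re.findall(r"[0-9]{3}-[0-9]{2}-[0-9]{4}", "SSN: 123-45-6789")

print(names, day_first_style, year_first_style, ssn)
```

These patterns only check the shape of the text. For example, the first date pattern would also accept "99/99/9999", and it can match inside a longer digit sequence such as an SSN, so real validation needs more than the regex alone.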

  10. Regex findall
     - findall returns a list of all the strings that match the regex.
     - For example, let's consider this pattern for emails:
       - r"[a-z0-9]+@[a-z]+\.[a-z]+"
     - Using that, let's find all the emails at:
       - https://engineering.virginia.edu/departments/computer-science/faculty
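Before fetching a live page, the email pattern can be tried offline. This sketch uses a made-up string with placeholder addresses (example.edu and example.org are assumptions, not addresses from the faculty page).

```python
import re

email_pattern = r"[a-z0-9]+@[a-z]+\.[a-z]+"
sample_text = "Contact abc1d@example.edu or xyz@example.org today."

# findall returns every non-overlapping match as a plain string.
emails = re.findall(email_pattern, sample_text)
print(emails)
```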

  11. Practice
     - Use this webpage:
       - https://storage.googleapis.com/cs1111/practice/simpsons_phone_book.txt
     - Find all the phone numbers using regular expressions! (Not text parsing)
     - Now:
       - Get the name and phone number of everyone whose first name starts with "J" and whose last name starts with "Neu"
       - USE REGULAR EXPRESSIONS

  12. Next time
     - Groups
     - Using them
     - Getting individual groups
     - The match object
     - More practice

  13. Match Object
     - Returned by search() and finditer(). Example: <_sre.SRE_Match object; span=(0, 5), match='Frodo'> (Python 3.7 and later print this as <re.Match object; span=(0, 5), match='Frodo'>)
     - This match object can be used as follows:
       - match.span() returns (0, 5)
       - match.start() returns 0
       - match.end() returns 5
       - match.group() returns "Frodo"
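The 'Frodo' match from this slide can be reproduced with a short sketch. The sentence being searched is an assumption; any text starting with "Frodo" gives the same span.

```python
import re

match = re.search(r"[A-Z][a-z]+", "Frodo lives in the Shire")

# span, start, end, and group are methods, so they need parentheses.
span = match.span()
start = match.start()
end = match.end()
matched_text = match.group()

print(span, start, end, matched_text)
```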

  14. search() and finditer() functions
     - regex.search(text): searches through text, finds the first match for regex, and returns a match object
       - Returns None if no match is found
       - Often used as a "does this pattern exist in the text" test
       - Can also be written as re.search(regular_expression, text)
     - finditer returns an iterator of match objects (that is, you can loop through it)
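The sketch below shows the "does this pattern exist" test, the None result, and a loop over finditer. The phone-like pattern and the sample strings are invented for illustration.

```python
import re

pattern = r"[0-9]{3}-[0-9]{4}"
regex = re.compile(pattern)
text = "Call 555-1234 or 555-9876"

# search as an existence test: a match object is truthy, None is not.
has_number = regex.search(text) is not None

# The module-level form takes the pattern as its first argument.
no_match = re.search(pattern, "no numbers here")

# finditer yields one match object per match, in order.
spans = []
for m in regex.finditer(text):
    spans.append((m.group(), m.span()))

print(has_number, no_match, spans)
```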

  15. Pulling down emails of CS Faculty

       import re
       import urllib.request

       url = "https://engineering.virginia.edu/departments/computer-science/faculty"
       email_pattern = r"[a-z0-9]+@[a-z]+\.[a-z]+"

       # Download the page and decode the bytes into a string
       req = urllib.request.urlopen(url)
       html = req.read().decode("UTF-8")

       # Compile the pattern and collect every match
       regex = re.compile(email_pattern)
       emails = regex.findall(html)
       print(emails)

  16. But wait…
     - We get the result: albertm@darden.virginia
     - That doesn't seem right… shouldn't emails end in com, edu, or org?
     - Let's try this pattern:
       - r"([a-z0-9]+@[a-z\.]+\.(com|edu|org))"
     - That gives us tuples like:
       - ('albertm@darden.virginia.edu', 'edu')
     - Wait…why tuples?
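Both results can be reproduced offline. This sketch runs the two patterns over a short string; the first address is the one quoted on the slide and the second is a made-up placeholder.

```python
import re

text = "albertm@darden.virginia.edu, someone@example.com"

# Original pattern: only one dot is allowed after the @, so the match stops early.
truncated = re.findall(r"[a-z0-9]+@[a-z]+\.[a-z]+", text)

# Revised pattern: two sets of parentheses, so findall returns one tuple per match.
with_groups = re.findall(r"([a-z0-9]+@[a-z\.]+\.(com|edu|org))", text)

print(truncated)
print(with_groups)
```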

  17. Groups
     - Parentheses can be used to isolate "groups" in the regular expression.
     - Example: in this string r"[a-z0-9]+@[a-z\.]+\.(com|edu|org)"
       - group(0): the overall match
       - group(1): specifically the match in parentheses (com, edu, or org)
       - .group() returns the same as group(0)
       - .groups() returns the matching SUB-groups (not the overall match)
     - When a pattern contains more than one group, findall returns a tuple of the groups for each match, which is why the previous pattern produced tuples.
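A minimal sketch of the group accessors, using the pattern from this slide on the address quoted earlier in the deck.

```python
import re

pattern = r"[a-z0-9]+@[a-z\.]+\.(com|edu|org)"
match = re.search(pattern, "albertm@darden.virginia.edu")

overall = match.group(0)     # the whole match
suffix = match.group(1)      # only the text inside the parentheses
default = match.group()      # same as group(0)
subgroups = match.groups()   # tuple of the parenthesized groups only

print(overall, suffix, default, subgroups)
```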
