Regular Expressions
CS 2110
What is a regular expression?
A special string for describing a pattern of characters. Examples:
Regular expression   Description
[abc]                One of three characters (a, b, OR c)
[a-z]                A single lowercase letter
[a-z0-9]             A single lowercase letter OR digit (one character, not both)
.                    Any one character (except a newline, by default)
\.                   A period (“.”)
*                    0 or more of the preceding item
?                    0 or 1 of the preceding item
+                    1 or more of the preceding item
Mark regular expressions as raw strings
Prefix the string literal with r, as in r"...", so Python passes backslashes through to the regex unchanged
Use square brackets for “any character from inside the bracket”
r"[bce]" – matches “b”, or “c”, or “e” (but not “be” or “bc”)
Use ranges or classes of characters
r"[A-Z]" – matches any uppercase letter
r"[a-z]" – matches any lowercase letter
r"[0-9]" – matches any digit
Searching for hyphens: include - right after the [ or right before the ]
r"[-a-z]" – matches a hyphen OR any lowercase letter
r"[bce]at"
Matches “bat”, “cat”, “eat”
r".at"
Matches any single character followed by “at”, such as the 3-letter words “bat”, “hat”, and “sat”
r"at\."
Matches “at.”
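The patterns above can be tried out with a short sketch like the one below. The sample strings are made up for illustration, and re.findall is used here only as a convenient way to list every match.

```python
import re

# Made-up sample text for illustration
words = "bat cat eat hat at. be"

# [bce]at: exactly one of b, c, or e, followed by "at"
bce_matches = re.findall(r"[bce]at", words)
print(bce_matches)     # ['bat', 'cat', 'eat']

# .at: ANY single character (even a space) followed by "at"
dot_matches = re.findall(r".at", words)
print(dot_matches)     # ['bat', 'cat', 'eat', 'hat', ' at']

# at\.: the letters "at" followed by a literal period
period_matches = re.findall(r"at\.", words)
print(period_matches)  # ['at.']

# [-a-z]+: one or more hyphens or lowercase letters
hyphen_matches = re.findall(r"[-a-z]+", "well-known FACT")
print(hyphen_matches)  # ['well-known']
```

Note that r".at" also picks up " at" (a space followed by "at"), because the dot is not limited to letters.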
Import statement
import re
Compiling the regex
regex = re.compile(regular_expression_string)
regex is now a regular expression tool we can use
Using regex
results = regex.search(text)
results = regex.findall(text)
results = regex.finditer(text)
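As a rough sketch of how these three calls differ, the example below runs all of them with one compiled pattern. The sample sentence and the pattern r"[bcm]at" are assumptions chosen for illustration.

```python
import re

# Made-up sample text for illustration
text = "The cat sat on the mat near a bat."

# Compile once, then reuse the compiled pattern object
regex = re.compile(r"[bcm]at")

# search: the first match object, or None if nothing matches
first = regex.search(text)
print(first.group(), first.span())  # cat (4, 7)

# findall: a list of the matched strings
all_strings = regex.findall(text)
print(all_strings)                  # ['cat', 'mat', 'bat']

# finditer: an iterator of match objects, one per match
all_spans = [m.span() for m in regex.finditer(text)]
print(all_spans)                    # [(4, 7), (19, 22), (30, 33)]
```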
Use “^” at the start of a [] for negation:
r"[^a-z]" – match anything except lowercase letters
r"[^0-9]" – match anything except decimal digits
Use ^ at the start of the expression (not inside []) to mean “the start of the string”
i.e., if searching through a list of strings, only match strings that start with the expression
Use $ for the end of the string.
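The two meanings of ^ and the use of $ can be seen side by side in a small sketch. The strings in the list are invented for the example.

```python
import re

# Inside []: ^ negates the class (anything that is NOT a lowercase letter)
non_lower = re.findall(r"[^a-z]", "ab3 Cd!")
print(non_lower)        # ['3', ' ', 'C', '!']

# Made-up list of strings to filter
lines = ["cat food", "black cat", "cat"]

# Outside []: ^ anchors the pattern to the start of the string
starts_with_cat = [s for s in lines if re.search(r"^cat", s)]
print(starts_with_cat)  # ['cat food', 'cat']

# $ anchors the pattern to the end of the string
ends_with_cat = [s for s in lines if re.search(r"cat$", s)]
print(ends_with_cat)    # ['black cat', 'cat']

# Both anchors together: the whole string must be exactly "cat"
only_cat = [s for s in lines if re.search(r"^cat$", s)]
print(only_cat)         # ['cat']
```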
Character   Meaning
\d          Any digit – means the same as [0-9]
\D          Anything EXCEPT digits – means the same as [^0-9]
\s          Any whitespace character (“ ”, “\t”, “\n”, etc.) – [ \t\n]
\S          Any NON-whitespace character
\\          Match a literal backslash
\w          Matches ANY alphanumeric character and underscore – [a-zA-Z0-9_]
\W          Matches any non-alphanumeric character – [^a-zA-Z0-9_]
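A quick way to get a feel for these shorthand classes is to run each one over a short string. The sample strings below are assumptions made up for this sketch.

```python
import re

# \d and \D: digits versus everything else
digits = re.findall(r"\d+", "Call 555-1234")
non_digits = re.findall(r"\D+", "555-1234")
print(digits)       # ['555', '1234']
print(non_digits)   # ['-']

# \s and \S: whitespace versus non-whitespace
spaces = re.findall(r"\s", "a b\tc\n")
non_spaces = re.findall(r"\S+", "a b\tc\n")
print(spaces)       # [' ', '\t', '\n']
print(non_spaces)   # ['a', 'b', 'c']

# \w and \W: letters, digits, underscore versus everything else
word_chars = re.findall(r"\w+", "user_1 says: hi!")
non_word = re.findall(r"\W", "hi, you!")
print(word_chars)   # ['user_1', 'says', 'hi']
print(non_word)     # [',', ' ', '!']

# \\: a literal backslash (the raw string keeps both pattern backslashes)
backslashes = re.findall(r"\\", r"C:\temp\file")
print(len(backslashes))  # 2
```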
r"[0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]"
Phone number written as “123-456-7890”
Except, that’s a little redundant, right?
We can write the same pattern as r"[0-9]{3}-[0-9]{3}-[0-9]{4}"
{x} means the previous pattern must repeat exactly x times; {x,y} means between x and y times
"[abn]{6}" would match “banana”, for example (or “nnnaaa”)
"[abn]{3,6}" would match “ban”, “nan”, “abba”, “banana”, etc.
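The sketch below checks that the long and short phone patterns find the same things, and tries the {6} and {3,6} examples. The sample text and candidate words are made up, and re.fullmatch is used here (an assumption of this sketch, not something the slides cover) so that the whole word has to fit the pattern.

```python
import re

long_form = re.compile(r"[0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]")
short_form = re.compile(r"[0-9]{3}-[0-9]{3}-[0-9]{4}")

# Made-up sample text for illustration
text = "Call 123-456-7890 or 555-867-5309, not 12-34-5678."
long_results = long_form.findall(text)
short_results = short_form.findall(text)
print(short_results)  # ['123-456-7890', '555-867-5309']

# Made-up candidate words for illustration
candidates = ["ban", "nan", "abba", "banana", "nnnaaa", "ab", "bananas"]

# {6}: exactly six characters, each one of a, b, or n
exactly_six = [w for w in candidates if re.fullmatch(r"[abn]{6}", w)]
print(exactly_six)    # ['banana', 'nnnaaa']

# {3,6}: between three and six such characters
three_to_six = [w for w in candidates if re.fullmatch(r"[abn]{3,6}", w)]
print(three_to_six)   # ['ban', 'nan', 'abba', 'banana', 'nnnaaa']
```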
Most English first names:
r"[A-Z][a-z]+"
Dates:
[0-9]{2}[/-][0-9]{2}[/-][0-9]{4}
OR
[0-9]{4}[/-][0-9]{2}[/-][0-9]{2}
SSN
[0-9]{3}-[0-9]{2}-[0-9]{4}
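These patterns can be tried on a few invented strings, as in the sketch below. Joining the two date patterns with | (“or”) into one regex is a choice made for this example; the names and numbers are made up.

```python
import re

name_re = re.compile(r"[A-Z][a-z]+")
# The two date layouts joined with | so either one can match
date_re = re.compile(
    r"[0-9]{2}[/-][0-9]{2}[/-][0-9]{4}|[0-9]{4}[/-][0-9]{2}[/-][0-9]{2}"
)
ssn_re = re.compile(r"[0-9]{3}-[0-9]{2}-[0-9]{4}")

# Made-up sample strings for illustration
names = name_re.findall("Homer and Marge live in Springfield")
print(names)  # ['Homer', 'Marge', 'Springfield']

dates = date_re.findall("Born 05/12/1956, married 1980-06-14.")
print(dates)  # ['05/12/1956', '1980-06-14']

ssns = ssn_re.findall("SSN on file: 123-45-6789")
print(ssns)   # ['123-45-6789']
```

The name pattern matches any capitalized word, so “Springfield” is picked up too; the pattern describes the shape of a name, not whether the word really is one.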
findall returns a list of all the strings that match the regex. For example, let’s consider this pattern for emails:
r"[a-z0-9]+@[a-z]+\.[a-z]+"
Using that, let’s find all the emails at:
https://engineering.virginia.edu/departments/computer-science/faculty
Use this webpage:
https://storage.googleapis.com/cs1111/practice/simpsons_phone_book.txt
Find all the phone numbers using regular expressions! (Not text parsing)
Now: get the name and phone number of everyone whose first name starts with “J” and whose last name starts with “Neu”
USE REGULAR EXPRESSIONS
Groups
Using them
Getting individual groups
The match object
More practice
Returned by search() and finditer(). Example:
<_sre.SRE_Match object; span=(0, 5), match='Frodo'>
This match object can be used as follows:
match.span() – (0, 5)
match.start() – 0
match.end() – 5
match.group() – “Frodo”
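A minimal sketch of producing and inspecting such a match object follows. The text “Frodo Baggins” and the pattern are assumptions chosen so the result lines up with the example above.

```python
import re

# Search for a capitalized word in made-up sample text
match = re.search(r"[A-Z][a-z]+", "Frodo Baggins")

# span, start, end, and group are methods, so they need parentheses
print(match.span())   # (0, 5)
print(match.start())  # 0
print(match.end())    # 5
print(match.group())  # Frodo
```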
regex.search(text) – Search through text, find the first instance of a match to regex, and return a match object
Returns None if no match is found
Often used as a “does this pattern exist in the text” test
Can also be written as
re.search(regular_expression, text)
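One way to write the “does this pattern exist” test is sketched below, using both the compiled form and the re.search shortcut. The helper name contains_phone_number and the sample strings are hypothetical.

```python
import re

PHONE = r"[0-9]{3}-[0-9]{3}-[0-9]{4}"

def contains_phone_number(text):
    """Return True if text contains something shaped like 123-456-7890."""
    # search returns a match object or None, so compare against None
    return re.search(PHONE, text) is not None

has_number = contains_phone_number("Call 123-456-7890 today")
no_number = contains_phone_number("No number here")
print(has_number)  # True
print(no_number)   # False

# The compiled form finds the same first match
compiled_match = re.compile(PHONE).search("Call 123-456-7890 today")
print(compiled_match.group())  # 123-456-7890
```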
finditer returns an iterator of match objects (that is, you can loop through it)
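Looping over finditer might look like the sketch below, which records where each match starts along with its text. The sample sentence is made up for illustration.

```python
import re

# Made-up sample text for illustration
text = "Call 123-456-7890 or 555-867-5309."
regex = re.compile(r"[0-9]{3}-[0-9]{3}-[0-9]{4}")

found = []
for match in regex.finditer(text):
    # Each item is a full match object, not just a string
    found.append((match.start(), match.group()))

print(found)  # [(5, '123-456-7890'), (21, '555-867-5309')]
```

Compared with findall, this keeps the position of every match, which is useful when the location in the text matters.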
import re
import urllib.request

url = "https://engineering.virginia.edu/departments/computer-science/faculty"
email_pattern = r"[a-z0-9]+@[a-z]+\.[a-z]+"

# Download the page and decode the bytes into text
req = urllib.request.urlopen(url)
html = req.read().decode("UTF-8")

# Compile the pattern and collect every matching string
regex = re.compile(email_pattern)
emails = regex.findall(html)
print(emails)
We get the result:
albertm@darden.virginia
That doesn’t seem right… shouldn’t emails end in com, edu, or org? The pattern allows only one “.” after the @, so the match stops early.
Let’s try this pattern:
r"([a-z0-9]+@[a-z\.]+\.(com|edu|org))"
That gives us tuples like:
('albertm@darden.virginia.edu', 'edu')
Wait…why tuples?
Parentheses can be used to isolate “groups” in the regular expression.
Example: in this pattern
r"[a-z0-9]+@[a-z\.]+\.(com|edu|org)"
.group(0) – the overall match
.group(1) – specifically the match in parentheses (com, edu, or org)
.group() – returns the same as group(0)
.groups() – returns the matching SUB-groups (not the overall match)
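The sketch below shows the group calls on a match object, and why findall hands back tuples once a pattern has more than one group. The sample text is made up, and homer@example.com is a placeholder address.

```python
import re

# Made-up sample text for illustration
text = "Contact albertm@darden.virginia.edu or homer@example.com."

one_group = re.compile(r"[a-z0-9]+@[a-z\.]+\.(com|edu|org)")
match = one_group.search(text)
print(match.group(0))  # albertm@darden.virginia.edu
print(match.group(1))  # edu
print(match.group())   # same as group(0)
print(match.groups())  # ('edu',)

# With ONE group, findall returns just that group's text
endings = one_group.findall(text)
print(endings)         # ['edu', 'com']

# With TWO groups, findall returns one tuple per match
two_groups = re.compile(r"([a-z0-9]+@[a-z\.]+\.(com|edu|org))")
pairs = two_groups.findall(text)
print(pairs)
# [('albertm@darden.virginia.edu', 'edu'), ('homer@example.com', 'com')]
```

Wrapping the whole pattern in an outer pair of parentheses is what puts the full address into the tuple; without it, findall would return only the com, edu, or org part.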