Regular Expressions (CS 2110)



slide-1
SLIDE 1

Regular Expressions

CS 2110

slide-2
SLIDE 2

What is a regular expression?

• A special string for describing a pattern of characters.
• Examples:

Regular expression    Description
[abc]                 One of three characters (a, b, OR c)
[a-z]                 A single lowercase letter
[a-z0-9]              A single lowercase letter OR digit (one character, not both)
.                     Any one character (by default, anything except a newline)
\.                    A period (".")
*                     0 or more of the preceding item
?                     0 or 1 of the preceding item
+                     1 or more of the preceding item
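A minimal sketch of the table above in Python. The sample strings are made up for illustration, and re.fullmatch (not covered in these slides) is used only as a convenient "does the whole string fit the pattern" test.

```python
import re

# . matches any one character, while \. matches only a literal period
any_char = re.findall(r"a.", "a1 a. ab")
literal_dot = re.findall(r"a\.", "a1 a. ab")

# *, +, and ? control how many times the preceding item may repeat
star = re.fullmatch(r"ab*", "a") is not None              # * accepts zero b's
plus = re.fullmatch(r"ab+", "a") is not None              # + needs at least one b
optional = re.fullmatch(r"colou?r", "color") is not None  # ? accepts 0 or 1 u

print(any_char, literal_dot, star, plus, optional)
```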

slide-3
SLIDE 3

Regex String

• Mark regular expressions as raw strings
  • A raw string starts with r" (the letter r directly before the opening quote), so backslashes reach the regex engine unchanged
• Use square brackets for "any one character from inside the brackets"
  • r"[bce]" – matches "b", or "c", or "e" (but not "be" or "bc" as a single match)
• Use ranges or classes of characters
  • r"[A-Z]" – matches any uppercase letter
  • r"[a-z]" – matches any lowercase letter
  • r"[0-9]" – matches any digit
• Searching for hyphens: put the - right after the [ or right before the ]
  • r"[-a-z]" – matches a hyphen OR any lowercase letter
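As a rough illustration of why the r prefix matters and how bracket expressions behave, here is a small sketch. The test strings are arbitrary examples, not taken from the course.

```python
import re

# Without the r prefix, Python turns "\b" into a single backspace character;
# with it, the two characters \ and b are passed through to the regex engine
plain_length = len("\b")
raw_length = len(r"\b")

# A bracket expression matches exactly one character, so "be" yields two matches
single = re.findall(r"[bce]", "be")

# A hyphen placed first inside the brackets is a literal hyphen, not a range
hyphen_or_lower = re.findall(r"[-a-z]", "A-b C_d")

print(plain_length, raw_length, single, hyphen_or_lower)
```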

slide-4
SLIDE 4

Regex String

• r"[bce]at"
  • Matches "bat", "cat", "eat"
• r".at"
  • Matches any three-character string that ends in "at" (the . accepts any character, not only letters)
• r"at\."
  • Matches "at."
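The three patterns above can be tried on one made-up sentence. Note how r".at" also picks up "@at" and even " at" (a space followed by "at"), because the dot is not limited to letters.

```python
import re

words = "bat cat eat fat @at at."

bce = re.findall(r"[bce]at", words)   # only b, c, or e before "at"
dot = re.findall(r".at", words)       # any single character before "at"
period = re.findall(r"at\.", words)   # "at" followed by a literal period

print(bce)
print(dot)
print(period)
```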

slide-5
SLIDE 5

Regex in Python

• Import statement
  • import re
• Compiling the regex
  • regex = re.compile(regular_expression_string)
  • regex is now a compiled regular expression object we can use
• Using regex
  • results = regex.search(text)
  • results = regex.findall(text)
  • results = regex.finditer(text)
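A short sketch of the compile-then-use workflow, showing what each of the three calls hands back. The text and pattern are placeholders chosen for illustration.

```python
import re

text = "cat bat rat"
regex = re.compile(r"[bc]at")

first = regex.search(text)          # a match object for the first hit, or None
all_strings = regex.findall(text)   # a list of the matched strings
positions = [m.start() for m in regex.finditer(text)]  # finditer yields match objects

print(first.group(), all_strings, positions)
```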

slide-6
SLIDE 6

Regular Expression Examples

• Use "^" at the start of a [] for negation:
  • r"[^a-z]" – match any one character except a lowercase letter
  • r"[^0-9]" – match any one character except a decimal digit
• Use ^ at the start of the expression (not inside []) to mean "the start of the string" (i.e., the match must begin at the beginning of the string)
  • e.g., if searching through a list of strings, only match strings that start with the expression
• Use $ for the end of the string.
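The two meanings of ^ and the role of $ can be sketched as follows. The list of names is invented for the example.

```python
import re

# Inside brackets, ^ negates the set
non_digits = re.findall(r"[^0-9]", "a1b2")

# Outside brackets, ^ anchors to the start of the string and $ to the end
names = ["Homer", "Marge", "Home Depot", "At Home"]
starts_with_home = [n for n in names if re.search(r"^Home", n)]
ends_with_home = [n for n in names if re.search(r"Home$", n)]

print(non_digits, starts_with_home, ends_with_home)
```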

slide-7
SLIDE 7

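Predefined character classes:

Character    Meaning
\d           Any digit; means the same as [0-9]
\D           Anything EXCEPT a digit; means the same as [^0-9]
\s           Any whitespace character (" ", "\t", "\n", etc.), such as [ \t\n]
\S           Any NON-whitespace character
\\           A literal backslash
\w           Any alphanumeric character or underscore; [a-zA-Z0-9_]
\W           Any character that is not alphanumeric or underscore; [^a-zA-Z0-9_]

The bracket equivalents hold for ASCII text. With ordinary Python 3 string patterns, \d, \w, and \s also match Unicode digits, letters, and whitespace unless the re.ASCII flag is set.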
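A quick sketch of the predefined classes on a made-up string that contains a space, a tab, an underscore, and one literal backslash.

```python
import re

# The Python literal "\\" is a single backslash character in the text
text = "Room 42,\tfloor_3\\B"

digits = re.findall(r"\d", text)
words = re.findall(r"\w+", text)       # runs of letters, digits, or underscores
whitespace = re.findall(r"\s", text)
backslashes = re.findall(r"\\", text)  # the regex \\ matches one backslash

print(digits, words, whitespace, backslashes)
```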

slide-8
SLIDE 8

Regular Expression Examples

• r"[0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]"
  • Phone number written as "123-456-7890"
• Except, that's a little redundant, right?
• We can write the same pattern as above as
  • r"[0-9]{3}-[0-9]{3}-[0-9]{4}"
• {x} means the previous pattern must repeat exactly x times; {x,y} means between x and y times
  • "[abn]{6}" would match "banana", for example (or "nnnaaa")
  • "[abn]{3,6}" would match "ban", "nan", "abba", "banana", etc.
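To check that the long and short phone patterns behave the same, and to try the {x} and {x,y} forms, a sketch along these lines may help. The sample text is made up, and re.fullmatch is used here (beyond what the slides cover) to test whole strings.

```python
import re

long_form = r"[0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]"
short_form = r"[0-9]{3}-[0-9]{3}-[0-9]{4}"
text = "Call 123-456-7890 or 555-1234."

phones = re.findall(short_form, text)
same = re.findall(long_form, text) == phones   # both forms find the same numbers

# {6} demands exactly six characters from the set; {3,6} allows three to six
banana6 = re.fullmatch(r"[abn]{6}", "banana") is not None
candidates = ["ba", "ban", "abba", "banana", "bananas"]
in_range = [w for w in candidates if re.fullmatch(r"[abn]{3,6}", w)]

print(phones, same, banana6, in_range)
```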

slide-9
SLIDE 9

Regex Examples

• Most English first names:
  • r"[A-Z][a-z]+"
• Dates:
  • r"[0-9]{2}[/-][0-9]{2}[/-][0-9]{4}" OR
  • r"[0-9]{4}[/-][0-9]{2}[/-][0-9]{2}"
• SSN:
  • r"[0-9]{3}-[0-9]{2}-[0-9]{4}"

slide-10
SLIDE 10

Regex findall

• findall returns a list of all the strings that match the regex.
• For example, let's consider this pattern for emails:
  • r"[a-z0-9]+@[a-z]+\.[a-z]+"
• Using that, let's find all the emails at:
  • https://engineering.virginia.edu/departments/computer-science/faculty
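Before pointing the pattern at a live page, it can be tried offline. The sketch below runs it on a made-up string (all addresses are placeholders). Note that the pattern only knows lowercase letters and digits, so "Jane.Doe@example.org" is only partly matched.

```python
import re

# Hypothetical page text; none of these addresses are real
sample = "Contact abc1de@example.edu, xyz9@example.com, or Jane.Doe@example.org"

emails = re.findall(r"[a-z0-9]+@[a-z]+\.[a-z]+", sample)
print(emails)
```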

slide-11
SLIDE 11

Practice

• Use this webpage:
  • https://storage.googleapis.com/cs1111/practice/simpsons_phone_book.txt
• Find all the phone numbers using regular expressions! (Not text parsing)
• Now:
  • Get the name and phone number of everyone whose first name starts with "J" and whose last name starts with "Neu"
  • USE REGULAR EXPRESSIONS

slide-12
SLIDE 12

Next time

• Groups
  • Using them
  • Getting individual groups
• The match object
• More practice

slide-13
SLIDE 13

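Match Object

• Returned by search() and finditer(). Example:

<_sre.SRE_Match object; span=(0, 5), match='Frodo'>

(Python 3.7 and later print this as <re.Match object; span=(0, 5), match='Frodo'>.)

• This match object can be used as follows (these are methods, so call them with parentheses):
  • match.span() – (0, 5)
  • match.start() – 0
  • match.end() – 5
  • match.group() – "Frodo"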
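A small sketch that produces a match object like the one shown and reads its parts. The input string and pattern are assumptions chosen so the result lines up with the "Frodo" example.

```python
import re

match = re.search(r"[A-Z][a-z]+", "Frodo Baggins")

span = match.span()     # (start, end) positions of the match
start = match.start()   # index of the first matched character
end = match.end()       # index just past the last matched character
text = match.group()    # the matched string itself

print(span, start, end, text)
```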

slide-14
SLIDE 14

search() and finditer() functions

• regex.search(text) – Search through text, find the first instance of a match to regex, and return a match object
  • Returns None if no match is found
  • Often used as a "does this pattern exist in the text" test
• Can also be written as
  • re.search(regular_expression, text)
• finditer returns an iterator of match objects (that is, you can loop through it)
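The behaviors listed above can be sketched like this, using invented names and numbers: search() as an existence test, the None result when nothing matches, the module-level re.search form, and a loop over finditer().

```python
import re

regex = re.compile(r"[0-9]{3}-[0-9]{4}")
text = "Moe: 555-1239, Ned: 555-8904"   # hypothetical data

has_number = regex.search(text) is not None   # "does this pattern exist?" test
missing = regex.search("no digits here")      # None when nothing matches

# Equivalent module-level form: pattern first, then the text
first = re.search(r"[0-9]{3}-[0-9]{4}", text).group()

# finditer yields one match object per hit, in order
found = [(m.group(), m.span()) for m in regex.finditer(text)]

print(has_number, missing, first, found)
```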

slide-15
SLIDE 15

Pulling down emails of CS Faculty

import re
import urllib.request

url = "https://engineering.virginia.edu/departments/computer-science/faculty"
email_pattern = r"[a-z0-9]+@[a-z]+\.[a-z]+"

# Download the page and decode the raw bytes into a string
req = urllib.request.urlopen(url)
html = req.read().decode("UTF-8")

# Compile the pattern, then collect every match in the page
regex = re.compile(email_pattern)
emails = regex.findall(html)
print(emails)

slide-16
SLIDE 16

But wait…

• We get the result:

albertm@darden.virginia

• That doesn't seem right… shouldn't emails end in com, edu, or org?
  • The pattern allows only one dot after the @, so the match stops at "darden.virginia"
• Let's try this pattern:
  • r"([a-z0-9]+@[a-z\.]+\.(com|edu|org))"
• That gives us tuples like:
  • ('albertm@darden.virginia.edu', 'edu')
• Wait…why tuples?
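One way to see where the tuples come from: when a pattern contains more than one parenthesized group, findall returns one tuple per match, with one entry per group. The sketch below compares the patterns on the address from this slide. The last variant uses (?:...), a non-capturing group, which is not covered in these slides and is shown only as one possible way to get plain strings back.

```python
import re

text = "albertm@darden.virginia.edu"

# No groups: findall returns plain strings (and stops after one dot)
no_groups = re.findall(r"[a-z0-9]+@[a-z]+\.[a-z]+", text)

# Two groups (outer and inner parentheses): findall returns 2-tuples
two_groups = re.findall(r"([a-z0-9]+@[a-z\.]+\.(com|edu|org))", text)

# Non-capturing group: parentheses for alternation only, so strings again
non_capturing = re.findall(r"[a-z0-9]+@[a-z\.]+\.(?:com|edu|org)", text)

print(no_groups, two_groups, non_capturing)
```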

slide-17
SLIDE 17

Groups

• Parentheses can be used to isolate "groups" in the regular expression.
• Example: in this string

r"[a-z0-9]+@[a-z\.]+\.(com|edu|org)"

  • group(0) – The overall match
  • group(1) – Specifically the match in parentheses (com, edu, or org)
  • .group() – returns the same as group(0)
  • .groups() – returns the matching SUB-groups (not the overall match)
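A brief sketch of the group accessors on a match object, using the pattern above. The surrounding sentence is made up; the address is the one from the earlier slide.

```python
import re

pattern = r"[a-z0-9]+@[a-z\.]+\.(com|edu|org)"
match = re.search(pattern, "Email albertm@darden.virginia.edu today")

whole = match.group(0)       # the overall match
suffix = match.group(1)      # only the text captured by the parentheses
default = match.group()      # no argument behaves like group(0)
subgroups = match.groups()   # tuple of captured groups, without the overall match

print(whole, suffix, default, subgroups)
```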