Regular Expressions Devin J. Pohly <djpohly@cse.psu.edu> - - PowerPoint PPT Presentation

SLIDE 1

CMPSC 311: Introduction to Systems Programming

Systems and Internet Infrastructure Security
Institute for Networking and Security Research
Department of Computer Science and Engineering
Pennsylvania State University, University Park, PA

Regular Expressions

Devin J. Pohly <djpohly@cse.psu.edu>

SLIDE 2

Regular expressions

  • Often shortened to “regex” or “regexp”
  • Regular expressions are a language for matching patterns
  • Super-powerful find-and-replace tool
  • Can be used on the CLI, in shell scripts, as a text editor feature, or as part of a program

SLIDE 3

What are they good for?

  • Searching for specifically formatted text
      • Email address
      • Phone number
      • Anything that follows a pattern
  • Validating input
      • Same idea
  • Powerful find-and-replace
      • E.g. change “X and Y” to “Y and X” for any X, Y

SLIDE 4

Regex “flavors”

  • Many languages support regular expressions
      • Perl
      • JavaScript
      • Python
      • PHP
      • Java, Ruby, .NET, etc.
  • Today we will learn standard Unix “extended regular expressions”

SLIDE 5

On the command line

  • The grep command is a regex filter
      • That’s what the “re” in the middle stands for
  • We have seen fgrep, which looks for literal (“fixed”) strings
  • Today we will use egrep
      • E for “extended” regular expressions
      • Very close to other languages’ flavors

SLIDE 6

grep command syntax

  • To find matches in files:

egrep regex file(s)

  • To filter standard input:

egrep regex

  • where regex is a regular expression, and file(s) are the files to search
  • Options:
      • -i: ignore case
      • -v: find non-matching lines
      • -r: search entire directories
      • man grep for more
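
A minimal sketch of the filter and two of the options above, written with `grep -E` (the same filter as `egrep`). The three input lines are invented for illustration.

```shell
# Sample input invented for illustration.
input=$(printf '%s\n' 'Hello world' 'hello again' 'goodbye')

# Filter standard input: keep only lines that match the regex.
plain=$(printf '%s\n' "$input" | grep -E 'hello')

# -i ignores case, so "Hello" matches as well.
nocase=$(printf '%s\n' "$input" | grep -E -i 'hello')

# -v inverts the filter and keeps the non-matching lines.
inverted=$(printf '%s\n' "$input" | grep -E -i -v 'hello')

printf 'plain: %s\n' "$plain"
printf 'ignore case:\n%s\n' "$nocase"
printf 'inverted: %s\n' "$inverted"
```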

SLIDE 7

Okay, let’s begin!

$ cd /usr/share/dict
$ egrep hello words
...
$ cat words | egrep hello
...

SLIDE 8

First lesson

  • Letters, numbers, and a few other things match literally
  • Find all the words that match “fgh”
  • Find all the words that match “lmn”
  • Note: a regex can match anywhere in the string
      • Doesn’t have to match the whole string
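
A small sketch of literal matching anywhere in the line, using `grep -E` (equivalent to `egrep`) on an invented four-word list instead of the dictionary file.

```shell
# Word list invented for illustration; one word per line.
words=$(printf '%s\n' hello shell help yellow)

# "ell" can sit at the start, middle, or end of the word.
matches=$(printf '%s\n' "$words" | grep -E 'ell')

printf '%s\n' "$matches"
```

Here "help" is filtered out because it never contains the three letters "ell" in a row.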

SLIDE 9

Anchors

  • Caret ^ matches at the beginning of a line
  • Dollar sign $ matches at the end of a line
  • Use '…' to protect special characters from the shell!
  • Try it
      • Find words ending in “gry”
      • Find words starting with “ah”
      • What happens if we use both ^ and $?
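
A sketch of the two anchors on an invented word list, with `grep -E` standing in for `egrep`. The single quotes keep the shell from interpreting `^` and `$`.

```shell
# Word list invented for illustration.
words=$(printf '%s\n' cat catalog bobcat concatenate)

# ^ pins the match to the beginning of the line.
starts=$(printf '%s\n' "$words" | grep -E '^cat')

# $ pins the match to the end of the line.
ends=$(printf '%s\n' "$words" | grep -E 'cat$')

# Both anchors together: the whole line must be exactly "cat".
exact=$(printf '%s\n' "$words" | grep -E '^cat$')

printf 'starts with cat:\n%s\n' "$starts"
printf 'ends with cat:\n%s\n' "$ends"
printf 'exactly cat: %s\n' "$exact"
```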

SLIDE 10

Single-character wildcard

  • Dot . matches any single character (exactly one)
  • Find a 6-letter word where the second, fourth, and sixth letters are “o”
  • Find any words that have at least 22 characters
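
A sketch of the single-character wildcard on an invented word list (`grep -E` is the same filter as `egrep`). Combining dots with anchors fixes the length of the whole line.

```shell
# Word list invented for illustration.
words=$(printf '%s\n' cat cot cut coat ct)

# c, then exactly one character of any kind, then t.
three=$(printf '%s\n' "$words" | grep -E '^c.t$')

# Four dots between anchors: lines of exactly four characters.
four=$(printf '%s\n' "$words" | grep -E '^....$')

printf 'c.t:\n%s\n' "$three"
printf 'four characters: %s\n' "$four"
```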

SLIDE 11

Multi-character wildcard

  • Dot-star .* will match 0 or more characters
      • We’ll see why on the next slide
  • Find all the words that contain a, e, i, o, u in that order (with anything in between)
      • How about u, o, i, e, a?
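
A sketch of dot-star as "anything in between", using `grep -E` (equivalent to `egrep`) on an invented list. Note that "bk" matches too, since zero characters between b and k is allowed.

```shell
# Word list invented for illustration ("bk" is not a real word).
words=$(printf '%s\n' book bk black kebab)

# A b, then any run of characters (possibly empty), then a k.
matches=$(printf '%s\n' "$words" | grep -E 'b.*k')

printf '%s\n' "$matches"
```

"kebab" is rejected because no k appears after either of its b's: order matters.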

SLIDE 12

Quantifiers

  • How many repetitions of the previous thing to match?
      • Star *: 0 or more
      • Plus +: at least 1
      • Question mark ?: 0 or 1 (i.e., optional)
  • Try it out
      • Spell check: necc?ess?ary
      • Global awareness: colou?r
      • Find words with u, o, i, e, a in that order and at least one letter in between each
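
A side-by-side sketch of the three quantifiers, each applied to the letter b, with `grep -E` in place of `egrep`. The input strings are invented for illustration.

```shell
# Invented input: a, then zero, one, or two b's, then c.
lines=$(printf '%s\n' ac abc abbc)

# b* : zero or more b's.
star=$(printf '%s\n' "$lines" | grep -E '^ab*c$')

# b+ : at least one b.
plus=$(printf '%s\n' "$lines" | grep -E '^ab+c$')

# b? : zero or one b (the b is optional).
opt=$(printf '%s\n' "$lines" | grep -E '^ab?c$')

printf 'star:\n%s\n' "$star"
printf 'plus:\n%s\n' "$plus"
printf 'question mark:\n%s\n' "$opt"
```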

SLIDE 13

Careful!

  • Guess before you try: what happens if you search for z*?
  • Now try searching for the empty string
      • Use '' to give the shell an empty argument
  • Conclusion: make sure your regex always tries to match something!
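
A sketch that counts matching lines with `grep -c` rather than printing them, on three invented lines (`grep -E` is equivalent to `egrep`). A pattern that can match nothing at all accepts every line.

```shell
# Invented input; only two of the three lines contain a z.
lines=$(printf '%s\n' apple zebra pizza)

# z* is satisfied by zero z's, so every line counts as a match.
star_count=$(printf '%s\n' "$lines" | grep -E -c 'z*')

# The empty regex behaves the same way.
empty_count=$(printf '%s\n' "$lines" | grep -E -c '')

# z+ insists on at least one z.
plus_count=$(printf '%s\n' "$lines" | grep -E -c 'z+')

printf 'z*: %s  empty: %s  z+: %s\n' "$star_count" "$empty_count" "$plus_count"
```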

SLIDE 14

Character classes

  • Square brackets [abc] will match any one of the enclosed characters
  • What will [chs]andy match?
  • You can use quantifiers on character classes
      ◾ Find words starting with b where all the rest of the letters are s, a, or n
      ◾ Find all the words you can type with only ASDFJKL
      ◾ Find all the words you can type with AOEUHTNS!
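
A sketch of a character class with and without a quantifier, using `grep -E` (the same filter as `egrep`). The word list is invented and deliberately avoids the exercises above.

```shell
# Word list invented for illustration.
words=$(printf '%s\n' bat pat cat abba baa abc)

# [bp] matches exactly one character, either b or p.
one=$(printf '%s\n' "$words" | grep -E '^[bp]at$')

# [ab]+ repeats the class: lines made only of a's and b's.
many=$(printf '%s\n' "$words" | grep -E '^[ab]+$')

printf 'one of b or p, then at:\n%s\n' "$one"
printf 'only a and b:\n%s\n' "$many"
```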

SLIDE 15

Ranges

  • Part of character classes
  • You can specify a range of characters with [a-j]
      • One hex digit: [0-9a-f]
      • Consonants: [b-df-hj-np-tv-z]
  • Find all the words you can make with A through E
      ◾ … that are at least 5 letters long (hint: pipe the output to another egrep!)
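
A sketch of the two range examples from this slide on invented input, with `grep -E` in place of `egrep`. Setting `LC_ALL=C` is an assumption added here: what a range like `a-f` covers can depend on the locale, and the C locale makes it plain ASCII order.

```shell
# Invented input strings.
lines=$(printf '%s\n' cafe 12ab xyz face9 rhythm)

# Lines made entirely of lowercase hex digits.
hex=$(printf '%s\n' "$lines" | LC_ALL=C grep -E '^[0-9a-f]+$')

# Lines made entirely of the slide's "consonant" ranges.
cons=$(printf '%s\n' "$lines" | LC_ALL=C grep -E '^[b-df-hj-np-tv-z]+$')

printf 'hex:\n%s\n' "$hex"
printf 'consonants only:\n%s\n' "$cons"
```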

SLIDE 16

Negative character classes

  • If the first character is a caret, matches anything except these characters
  • Consonants: [^aeiou]
      ◾ Not quite – why?
  • Find words that contain a q, followed by something other than u
  • Can be combined with ranges
      ◾ Any character that isn’t a digit: [^0-9]
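
A sketch of the negated digit class from this slide on three invented lines (`grep -E` is equivalent to `egrep`). The unanchored form asks "is there any non-digit?", while the anchored form asks "is every character a non-digit?".

```shell
# Invented input strings.
lines=$(printf '%s\n' 12345 12a45 abc)

# At least one character somewhere that is not a digit.
some=$(printf '%s\n' "$lines" | LC_ALL=C grep -E '[^0-9]')

# Every character on the line is a non-digit.
all=$(printf '%s\n' "$lines" | LC_ALL=C grep -E '^[^0-9]+$')

printf 'contains a non-digit:\n%s\n' "$some"
printf 'no digits at all: %s\n' "$all"
```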

SLIDE 17

Groups

  • Parentheses (…) create groups within a regex
  • Quantifiers operate on the entire group
  • Find words with an m, followed by “ach” one or more times, followed by e
  • Find words where every other character, starting with the first, is an e
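
A sketch contrasting a quantifier on a group with the same quantifier on a single character, using `grep -E` (the same filter as `egrep`). The strings are invented.

```shell
# Invented input strings.
lines=$(printf '%s\n' ab abab aba abb)

# (ab)+ repeats the whole group "ab".
grouped=$(printf '%s\n' "$lines" | grep -E '^(ab)+$')

# ab+ repeats only the b.
ungrouped=$(printf '%s\n' "$lines" | grep -E '^ab+$')

printf 'grouped:\n%s\n' "$grouped"
printf 'ungrouped:\n%s\n' "$ungrouped"
```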

SLIDE 18

Branches

  • The pipe | denotes that either the left or right side matches
      • It’s the “or” operator
      • Useful inside parentheses
  • Guess before you try:
      • book(worm|end)
      • ^(out|lay)+$
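
A sketch of why the pipe is useful inside parentheses, run with `grep -E` (equivalent to `egrep`) on an invented word list. Without a group, each side of the pipe keeps only its own anchor.

```shell
# Word list invented for illustration.
words=$(printf '%s\n' cat dogs catdog bird hotdog)

# Grouped: the whole line is "cat" or "dog", optionally followed by s.
grouped=$(printf '%s\n' "$words" | grep -E '^(cat|dog)s?$')

# Ungrouped: lines that start with "cat" OR end with "dog".
ungrouped=$(printf '%s\n' "$words" | grep -E '^cat|dog$')

printf 'grouped:\n%s\n' "$grouped"
printf 'ungrouped:\n%s\n' "$ungrouped"
```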

SLIDE 19

Special characters

  • We’ve seen a lot already
      • ^$.*+?[]()|\
  • Backslash \ will escape a special character to search for it literally
  • For example, you could search your code for the expression ‘int *\*’ to find integer pointers
      • What is the difference between the two *s?
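
A sketch of escaping with the dot, a different special character from the one in the question above, using `grep -E` (the same filter as `egrep`). The three input strings are invented.

```shell
# Invented input strings.
lines=$(printf '%s\n' 3.14 3014 3-14)

# Unescaped, the dot is a wildcard and accepts ".", "0", and "-".
wild=$(printf '%s\n' "$lines" | grep -E '3.14')

# Escaped, the dot matches only a literal ".".
literal=$(printf '%s\n' "$lines" | grep -E '3\.14')

printf 'unescaped:\n%s\n' "$wild"
printf 'escaped: %s\n' "$literal"
```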

SLIDE 20

Backreferences

  • Groups in () can be referred to later
      • Must match exactly the same characters again
  • Numbered \1, \2, \3 from the start of the regex
  • Try it: (...)to\1
  • Find words that have a four-character sequence repeated immediately
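
A sketch of a backreference on an invented word list. It assumes a grep that accepts `\1` in extended mode, as GNU grep does; backreferences are not part of the POSIX definition of extended regular expressions, so other implementations may differ.

```shell
# Word list invented for illustration.
words=$(printf '%s\n' murmur banana bonbon)

# (...) captures three characters; \1 must repeat those same three.
doubled=$(printf '%s\n' "$words" | grep -E '^(...)\1$')

printf '%s\n' "$doubled"
```

"banana" is rejected because its second half, "ana", is not the same text as the captured "ban".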

SLIDE 21

Substituting – a demo

  • The sed program has a lot of functions for modifying text
  • Most useful is the s///g (“substitute”) command: regular expression find-and-replace

  • Also available in Vim by typing :%s/regex/replacement/g
  • Try it: run this command and type things

$ sed -r 's/([a-z]+) and ([a-z]+)/\2 and \1/g'
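
A non-interactive sketch of the same substitution, fed one invented line instead of typed input. It uses `-E`, which both GNU and BSD sed accept for extended regular expressions; the slide's `-r` is the GNU spelling.

```shell
# Input line invented for illustration.
line='cats and dogs, salt and pepper'

# \1 and \2 refer to the first and second groups; the g flag
# repeats the swap for every match on the line.
swapped=$(printf '%s\n' "$line" | sed -E 's/([a-z]+) and ([a-z]+)/\2 and \1/g')

printf '%s\n' "$swapped"
```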

SLIDE 22

Enjoy puzzles?

  • regexcrossword.com
      • Great way to practice your regex-fu
      • Starts with simpler tutorial puzzles and works up