Regular Expressions
Devin J. Pohly <djpohly@cse.psu.edu>


  1. Regular Expressions
     Devin J. Pohly <djpohly@cse.psu.edu>
     Systems and Internet Infrastructure Security, Institute for Networking and Security Research
     Department of Computer Science and Engineering, Pennsylvania State University, University Park, PA
     CMPSC 311: Introduction to Systems Programming

  2. Regular expressions
     • Often shortened to “regex” or “regexp”
     • Regular expressions are a language for matching patterns
       ‣ Super-powerful find and replace tool
       ‣ Can be used on the CLI, in shell scripts, as a text editor feature, or as part of a program

  3. What are they good for?
     • Searching for specifically formatted text
       ‣ Email address
       ‣ Phone number
       ‣ Anything that follows a pattern
     • Validating input
       ‣ Same idea
     • Powerful find-and-replace
       ‣ E.g. change “X and Y” to “Y and X” for any X, Y

  4. Regex “flavors”
     • Many languages support regular expressions
       ‣ Perl
       ‣ JavaScript
       ‣ Python
       ‣ PHP
       ‣ Java, Ruby, .NET, etc.
     • Today we will learn standard Unix “extended regular expressions”

  5. On the command line
     • The grep command is a regex filter
       ‣ That’s what the “re” in the middle stands for
       ‣ We have seen fgrep, which looks for literal (“fixed”) strings
     • Today we will use egrep
       ‣ E for “extended” regular expressions
       ‣ Very close to other languages’ flavors

  6. grep command syntax
     • To find matches in files: egrep regex file(s)
     • To filter standard input: egrep regex
       ‣ where regex is a regular expression, and file(s) are the files to search
     • Options:
       ‣ -i : ignore case
       ‣ -v : find non-matching lines
       ‣ -r : search entire directories
       ‣ man grep for more
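
The options above can be tried without a dictionary file. This is a minimal sketch, assuming a hypothetical three-line sample fed through a pipe. It uses `grep -E`, which is equivalent to `egrep`; recent GNU grep releases print an obsolescence warning for the `egrep` name.

```shell
# Hypothetical sample input; any file or pipe behaves the same way.
sample=$(printf '%s\n' 'Hello world' 'goodbye' 'say hello')

# Default: case-sensitive, prints the lines that contain a match.
matches=$(printf '%s\n' "$sample" | grep -E 'hello')

# -i ignores case, so "Hello world" matches as well.
matches_i=$(printf '%s\n' "$sample" | grep -E -i 'hello')

# -v inverts the filter: only lines that do NOT match are printed.
matches_v=$(printf '%s\n' "$sample" | grep -E -i -v 'hello')

printf '%s\n' "$matches" '---' "$matches_i" '---' "$matches_v"
```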

  7. Okay, let’s begin!
     $ cd /usr/share/dict
     $ egrep hello words
     ...
     $ cat words | egrep hello
     ...

  8. First lesson
     • Letters, numbers, and a few other things match literally
       ‣ Find all the words that match “fgh”
       ‣ Find all the words that match “lmn”
     • Note: a regex can match anywhere in the string
       ‣ Doesn’t have to match the whole string

  9. Anchors
     • Caret ^ matches at the beginning of a line
     • Dollar sign $ matches at the end of a line
       ‣ Use '…' to protect special characters from the shell!
     • Try it
       ‣ Find words ending in “gry”
       ‣ Find words starting with “ah”
     • What happens if we use both ^ and $ ?
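
A small sketch of the anchors, assuming a hypothetical six-word list in place of /usr/share/dict/words and `grep -E` in place of `egrep`. The patterns are single-quoted so the shell does not interpret `$` or `^`.

```shell
# Hypothetical word list standing in for the dictionary.
words=$(printf '%s\n' hungry angry gryphon ahead yeah ah)

# $ anchors at the end of the line: "gryphon" is excluded.
ends_gry=$(printf '%s\n' "$words" | grep -E 'gry$')

# ^ anchors at the start of the line: "yeah" is excluded.
starts_ah=$(printf '%s\n' "$words" | grep -E '^ah')

# Both anchors together force the regex to match the whole line.
exactly_ah=$(printf '%s\n' "$words" | grep -E '^ah$')

printf '%s\n' "$ends_gry" '---' "$starts_ah" '---' "$exactly_ah"
```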

  10. Single-character wildcard
      • Dot . matches any single character (exactly one)
        ‣ Find a 6-letter word where the second, fourth, and sixth letters are “o”
        ‣ Find any words that have at least 22 characters
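
One way to approach the dot exercises, sketched on a hypothetical four-word list. The length check uses five dots rather than 22 to keep the sample small; the idea is the same.

```shell
# Hypothetical word list standing in for the dictionary.
words=$(printf '%s\n' rococo potato robot go)

# Six characters in total, with "o" in positions 2, 4 and 6.
o_pattern=$(printf '%s\n' "$words" | grep -E '^.o.o.o$')

# Five dots with no anchors: any line with at least five characters.
at_least_5=$(printf '%s\n' "$words" | grep -E '.....')

printf '%s\n' "$o_pattern" '---' "$at_least_5"
```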

  11. Multi-character wildcard
      • Dot-star .* will match 0 or more characters
        ‣ We’ll see why on the next slide
        ‣ Find all the words that contain a, e, i, o, u in that order (with anything in between)
        ‣ How about u, o, i, e, a?
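
A sketch of the vowel-order searches, assuming a hypothetical four-word sample. Each `.*` allows any run of characters, including none, between the letters.

```shell
# Hypothetical word list standing in for the dictionary.
words=$(printf '%s\n' facetious abstemious education subcontinental)

# a, e, i, o, u in that order, with anything (or nothing) in between.
forward=$(printf '%s\n' "$words" | grep -E 'a.*e.*i.*o.*u')

# The same idea in reverse order: u, o, i, e, a.
backward=$(printf '%s\n' "$words" | grep -E 'u.*o.*i.*e.*a')

printf '%s\n' "$forward" '---' "$backward"
```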

  12. Quantifiers
      • How many repetitions of the previous thing to match?
        ‣ Star * : 0 or more
        ‣ Plus + : at least 1
        ‣ Question mark ? : 0 or 1 (i.e., optional)
      • Try it out
        ‣ Spell check: necc?ess?ary
        ‣ Global awareness: colou?r
        ‣ Find words with u, o, i, e, a in that order and at least one letter in between each
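
The three quantifiers can be compared side by side. This sketch assumes a hypothetical list that includes deliberately misspelled words, and anchors each pattern so the whole line has to match.

```shell
# Hypothetical list, including misspellings on purpose.
words=$(printf '%s\n' necessary neccessary necesary color colour colouur)

# ? makes the preceding character optional (0 or 1 times).
spell=$(printf '%s\n' "$words" | grep -E '^necc?ess?ary$')
opt_u=$(printf '%s\n' "$words" | grep -E '^colou?r$')

# + requires at least one "u"; * accepts any number, including none.
plus_u=$(printf '%s\n' "$words" | grep -E '^colou+r$')
star_u=$(printf '%s\n' "$words" | grep -E '^colou*r$')

printf '%s\n' "$spell" '---' "$opt_u" '---' "$plus_u" '---' "$star_u"
```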

  13. Careful!
      • Guess before you try: what happens if you search for z* ?
      • Now try searching for the empty string
        ‣ Use '' to give the shell an empty argument
      • Conclusion: make sure your regex always tries to match something!
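
The pitfall is easy to see by counting matching lines with `-c`. A minimal sketch on a hypothetical three-word list: `z*` is satisfied by zero occurrences of "z", so it matches every line, exactly like the empty regex.

```shell
# Hypothetical word list; only two of the three words contain a "z".
words=$(printf '%s\n' apple zebra pizza)

# The empty regex matches every line.
empty_count=$(printf '%s\n' "$words" | grep -E -c '')

# z* also matches every line, because zero repetitions are allowed.
zstar_count=$(printf '%s\n' "$words" | grep -E -c 'z*')

# z+ needs at least one "z", so it actually filters.
zplus_count=$(printf '%s\n' "$words" | grep -E -c 'z+')

echo "empty=$empty_count zstar=$zstar_count zplus=$zplus_count"
```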

  14. Character classes
      • Square brackets [abc] will match any one of the enclosed characters
        ‣ What will [chs]andy match?
        ‣ You can use quantifiers on character classes
          ◾ Find words starting with b where all the rest of the letters are s, a, or n
          ◾ Find all the words you can type with only ASDFJKL
          ◾ Find all the words you can type with AOEUHTNS!
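
A sketch of character classes with and without a quantifier, assuming a hypothetical word list in place of the dictionary.

```shell
# Hypothetical word list standing in for the dictionary.
words=$(printf '%s\n' candy handy sandy brandy dandy ban banana band bass)

# Exactly one of c, h or s, followed by "andy".
andy=$(printf '%s\n' "$words" | grep -E '[chs]andy')

# "b", then one or more letters drawn only from s, a, n, up to end of line.
b_san=$(printf '%s\n' "$words" | grep -E '^b[san]+$')

printf '%s\n' "$andy" '---' "$b_san"
```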

  15. Ranges
      • Part of character classes
      • You can specify a range of characters with [a-j]
        ‣ One hex digit: [0-9a-f]
        ‣ Consonants: [b-df-hj-np-tv-z]
        ‣ Find all the words you can make with A through E
          ◾ … that are at least 5 letters long (hint: pipe the output to another egrep!)
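
The piping hint could look like the sketch below, which assumes a hypothetical lowercase word list. Setting `LC_ALL=C` is an added precaution rather than something the slide asks for: the meaning of a range such as `[a-e]` can depend on the locale, and the C locale makes it plain ASCII order.

```shell
# Hypothetical word list standing in for the dictionary.
words=$(printf '%s\n' added decade bead cab cabbage)

# Words made only of the letters a through e.
a_to_e=$(printf '%s\n' "$words" | LC_ALL=C grep -E '^[a-e]+$')

# Pipe into a second filter to keep those with at least five characters.
a_to_e_long=$(printf '%s\n' "$words" | LC_ALL=C grep -E '^[a-e]+$' | grep -E '.....')

# Lines consisting only of lowercase hex digits.
hex=$(printf '%s\n' ff00 beef xyz | LC_ALL=C grep -E '^[0-9a-f]+$')

printf '%s\n' "$a_to_e" '---' "$a_to_e_long" '---' "$hex"
```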

  16. Negative character classes
      • If the first character inside the brackets is a caret, the class matches any character except those listed
        ‣ Consonants: [^aeiou]
          ◾ Not quite – why?
        ‣ Find words that contain a q, followed by something other than u
        ‣ Can be combined with ranges
          ◾ Any character that isn’t a digit: [^0-9]
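
A sketch of negated classes on a hypothetical sample. It also hints at the "not quite" question: `[^aeiou]` accepts anything that is not a lowercase vowel, which includes digits and uppercase letters as well as consonants.

```shell
# Hypothetical sample mixing words and a number.
words=$(printf '%s\n' quiz Iraqi qwerty Iraq rhythm 1234 queen)

# A "q" followed by one character that is not "u".
# "Iraq" is not matched: the class still needs a character to consume.
q_not_u=$(printf '%s\n' "$words" | grep -E 'q[^u]')

# Lines with no lowercase vowels at all; note that "1234" qualifies.
no_vowels=$(printf '%s\n' "$words" | grep -E '^[^aeiou]+$')

printf '%s\n' "$q_not_u" '---' "$no_vowels"
```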

  17. Groups
      • Parentheses (…) create groups within a regex
        ‣ Quantifiers operate on the entire group
        ‣ Find words with an m, followed by “ach” one or more times, followed by e
        ‣ Find words where every other character, starting with the first, is an e
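
One possible reading of the two group exercises, sketched on a hypothetical word list. The second pattern is an assumption about what "every other character" means: characters 1, 3, 5, … must all be "e".

```shell
# Hypothetical word list standing in for the dictionary.
words=$(printf '%s\n' machete machine eleven elephant eye)

# "m", then the group "ach" one or more times, then "e".
m_ach_e=$(printf '%s\n' "$words" | grep -E 'm(ach)+e')

# Repeated pairs of "e" plus any character, with an optional final "e"
# to allow words of odd length.
every_other_e=$(printf '%s\n' "$words" | grep -E '^(e.)*e?$')

printf '%s\n' "$m_ach_e" '---' "$every_other_e"
```

As written, the second pattern would also match an empty line; the sample contains none.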

  18. Branches
      • The pipe | denotes that either the left or right side matches
        ‣ It’s the “or” operator
        ‣ Useful inside parentheses
      • Guess before you try:
        ‣ book(worm|end)
        ‣ ^(out|lay)+$
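
The two patterns can be checked against a small list. A sketch assuming hypothetical sample words; the quotes keep the shell from treating `|` as a pipeline.

```shell
# Hypothetical word list standing in for the dictionary.
words=$(printf '%s\n' bookworm bookends bookmark layout outlay outlaw lay)

# "book" followed by either "worm" or "end"; unanchored, so "bookends" matches.
book=$(printf '%s\n' "$words" | grep -E 'book(worm|end)')

# The whole line must be built from one or more copies of "out" or "lay".
out_lay=$(printf '%s\n' "$words" | grep -E '^(out|lay)+$')

printf '%s\n' "$book" '---' "$out_lay"
```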

  19. Special characters
      • We’ve seen a lot already
        ‣ ^$.*+?[]()|\
      • Backslash \ will escape a special character to search for it literally
        ‣ For example, you could search your code for the expression ‘int *\*’ to find integer pointers
        ‣ What is the difference between the two *s?
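
A sketch of the pointer search on a few hypothetical lines of C. In `int *\*` the first `*` is a quantifier applied to the space before it (zero or more spaces), while `\*` is a literal asterisk.

```shell
# Hypothetical lines of C code to search.
code=$(printf '%s\n' 'int *p;' 'int* q;' 'int x;' 'int  **pp;')

# "int", any number of spaces (including none), then a literal "*".
pointers=$(printf '%s\n' "$code" | grep -E 'int *\*')

# The same escaping applies to the dot: \. matches only a real ".".
literal_dot=$(printf '%s\n' 'a.b' 'axb' | grep -E 'a\.b')

printf '%s\n' "$pointers" '---' "$literal_dot"
```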

  20. Backreferences
      • Groups in () can be referred to later
        ‣ Must match exactly the same characters again
        ‣ Numbered \1, \2, \3 from the start of the regex
        ‣ Try it: (...)to\1
        ‣ Find words that have a four-character sequence repeated immediately
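
A sketch of immediate repetition using backreferences, assuming a hypothetical word list. Backreferences in extended regular expressions are an extension rather than part of POSIX ERE, so this assumes a grep that supports them, as GNU grep does.

```shell
# Hypothetical word list standing in for the dictionary.
words=$(printf '%s\n' couscous beriberi murmur banana)

# Any four characters, immediately followed by the same four again.
four_twice=$(printf '%s\n' "$words" | grep -E '(....)\1')

# The same idea with a three-character group.
three_twice=$(printf '%s\n' "$words" | grep -E '(...)\1')

printf '%s\n' "$four_twice" '---' "$three_twice"
```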

  21. Substituting – a demo
      • The sed program has a lot of functions for modifying text
      • Most useful is the s///g (“substitute”) command: regular expression find-and-replace
        ‣ Also available in Vim by typing :%s/regex/replacement/g
      • Try it: run this command and type things
        $ sed -r 's/([a-z]+) and ([a-z]+)/\2 and \1/g'
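
A non-interactive version of the demo, feeding one hypothetical line through a pipe instead of typing it. It uses `sed -E`, which selects extended regular expressions like the slide's `-r` and is also accepted by BSD sed.

```shell
# Swap the words on either side of " and "; the trailing g repeats the
# substitution for every match on the line.
swapped=$(printf '%s\n' 'cats and dogs, salt and pepper' |
  sed -E 's/([a-z]+) and ([a-z]+)/\2 and \1/g')

echo "$swapped"
```

Because the groups are `[a-z]+`, capitalized words are not swapped whole; widening the class to `[A-Za-z]+` would be one way to handle them.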

  22. Enjoy puzzles?
      • regexcrossword.com
        ‣ Great way to practice your regex-fu
        ‣ Starts with simpler tutorial puzzles and works up
