CSCI 2133 Rapid Programming Techniques for Innovation Week 2 - - PowerPoint PPT Presentation



slide-1
SLIDE 1

CSCI 2133

CSCI 2133 – Rapid Programming Techniques for Innovation

Week 2 – Regular Expression and Implementation

slide-2
SLIDE 2

CSCI 2133

Previous Lecture

  • About the course project
  • Choosing a topic, defining scope, and forming a team
  • Preparing design document (P0)
  • Planning the process
  • Interfaces, choices, tools
  • About background knowledge: Java, C, Linux command line

slide-3
SLIDE 3

CSCI 2133

Regular Expressions

  • Covered in CSCI 2132
  • We will now look at a basic implementation
  • They illustrate a good general tool used in software development, and one that has several different implementations
  • An implementation is described in the textbook [Kernighan and Pike, sec. 9.2, p. 222]
  • Regular expressions are pervasive in Unix environments, but less so on other platforms
  • There are different flavours of regexes; we can start with grep

slide-4
SLIDE 4

CSCI 2133

What Is a Regular Expression?

  • A regular expression (regex) describes a set of possible input strings.
  • Regular expressions descend from a fundamental concept in Computer Science called finite automata theory

  • Regular expressions are endemic to Unix
  • vi, ed, sed, and emacs
  • awk, tcl, perl and Python
  • grep, egrep, fgrep
  • compilers
slide-5
SLIDE 5

CSCI 2133

What is a regular expression?

/[a-zA-Z_\-]+@(([a-zA-Z_\-])+\.)+[a-zA-Z]{2,4}/

  • regular expression ("regex"): describes a pattern of text

  • can test whether a string matches the expr's pattern
  • can use a regex to search/replace characters in a string
  • very powerful, but tough to read
  • regular expressions occur in many places:
  • text editors (TextPad) allow regexes in search/replace
  • languages: JavaScript; Java Scanner, String split
  • Unix/Linux/Mac shell commands (grep, sed, find, etc.)
slide-6
SLIDE 6

CSCI 2133

regular expression: cks

UNIX Tools rocks.       match
UNIX Tools sucks.       match
UNIX Tools is okay.     no match

slide-7
SLIDE 7

CSCI 2133

Regular Expressions

  • A regular expression can match a string in more than one place.

regular expression: apple

Scrapple from the apple.
match 1: the "apple" inside "Scrapple"; match 2: the word "apple"

slide-8
SLIDE 8

CSCI 2133

Regular Expressions

  • The . regular expression can be used to match any character.

regular expression: o.

For me to poop on.
match 1, match 2: places where "o" is followed by any one character
slide-9
SLIDE 9

CSCI 2133

Character Classes

  • Character classes [] can be used to match any specific set of characters.

regular expression: b[eor]at

beat a brat on a boat
match 1: beat; match 2: brat; match 3: boat

slide-10
SLIDE 10

CSCI 2133

Negated Character Classes

  • Character classes can be negated with the [^] syntax.

regular expression: b[^eo]at

beat a brat on a boat
match: brat

slide-11
SLIDE 11

CSCI 2133

More About Character Classes

  • [aeiou] will match any of the characters a, e, i, o, or u
  • [kK]orn will match korn or Korn
  • Ranges can also be specified in character classes
  • [1-9] is the same as [123456789]
  • [abcde] is equivalent to [a-e]
  • You can also combine multiple ranges
  • [abcde123456789] is equivalent to [a-e1-9]
  • Note that the - character has a special meaning in a character class only when it is used within a range; [-123] would match the characters -, 1, 2, or 3

slide-12
SLIDE 12

CSCI 2133

Named Character Classes

  • Commonly used character classes can be referred to by name (alpha, lower, upper, alnum, digit, punct, cntrl)
  • Syntax [:name:]
  • [a-zA-Z] is the same as [[:alpha:]]
  • [a-zA-Z0-9] is the same as [[:alnum:]]
  • [45a-z] is the same as [45[:lower:]]
  • Important for portability across languages
slide-13
SLIDE 13

CSCI 2133

Anchors

  • Anchors are used to match at the beginning or end of a line (or both).
  • ^ means beginning of the line
  • $ means end of the line
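
The character classes and anchors described so far can be tried from C through the POSIX regcomp/regexec interface, which uses grep-style basic regular expressions unless told otherwise. The following is a minimal sketch under that assumption; bre_search is a hypothetical helper name, not part of the course code.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: 1 if the basic regular expression `pattern`
   matches somewhere in `line`, 0 if not, -1 if it does not compile. */
int bre_search(const char *pattern, const char *line)
{
    regex_t re;
    int rc;

    /* No REG_EXTENDED flag: the pattern is read as a BRE, as grep does.
       REG_NOSUB: we only want a yes/no answer, not match positions. */
    if (regcomp(&re, pattern, REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, line, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

With the sample line "beat a brat on a boat", bre_search returns 1 for "b[^eo]at" (because of "brat") and 0 for "^brat" (the line does not start with "brat"). A named class such as "[45[:lower:]]" finds the 4 in "ABC4" but nothing in "ABC6".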
slide-14
SLIDE 14

CSCI 2133

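regular expression: ^b[eor]at

beat a brat on a boat
match: beat (at the beginning of the line)

regular expression: b[eor]at$

beat a brat on a boat
match: boat (at the end of the line)

  • ^$ matches an empty line
  • ^word$ matches a line consisting only of "word"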

slide-15
SLIDE 15

CSCI 2133

Repetition

  • The * is used to define zero or more occurrences of the single regular expression preceding it.

slide-16
SLIDE 16

CSCI 2133

regular expression: ya*y

I got mail, yaaaaaaaaaay!
match: yaaaaaaaaaay

regular expression: oa*o

For me to poop on.
match: oo (in "poop")

regular expression: .* (matches the whole line)

slide-17
SLIDE 17

CSCI 2133

Match length

regular expression: a.*e

Scrapple from the apple.
yes: "apple from the apple" (the longest candidate); no: shorter candidates such as "apple"

  • A match will be the longest string that satisfies the regular expression.
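
To see which part of a line a pattern actually matched, the POSIX regexec call can report the match offsets. The sketch below assumes a POSIX system with regex.h; match_span is a hypothetical helper name introduced only for this illustration.

```c
#include <regex.h>

/* Hypothetical helper: stores the start and end offsets of the match of
   the basic regular expression `pattern` in `line`.
   Returns 1 on a match, 0 on no match, -1 if the pattern does not compile. */
int match_span(const char *pattern, const char *line, int *start, int *end)
{
    regex_t re;
    regmatch_t m[1];
    int rc;

    if (regcomp(&re, pattern, 0) != 0)
        return -1;
    rc = regexec(&re, line, 1, m, 0);   /* m[0] describes the whole match */
    regfree(&re);
    if (rc != 0)
        return 0;
    *start = (int)m[0].rm_so;   /* offset of the first matched character */
    *end = (int)m[0].rm_eo;     /* offset just past the last matched character */
    return 1;
}
```

For "a.*e" on "Scrapple from the apple." this reports offsets 3 and 23, that is "apple from the apple", while the plain pattern "apple" reports 3 and 8.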

slide-18
SLIDE 18

CSCI 2133

Question

  • Using what we have learned so far, can you write one regex that matches phone numbers such as 123-234-3455, (123)-234-2344, (123)123-2345, and ( 212) 123 - 23445?

slide-19
SLIDE 19

CSCI 2133

Repetition Ranges

  • Ranges can also be specified
  • { } notation can specify a range of repetitions for the immediately preceding regex
  • {n} means exactly n occurrences
  • {n,} means at least n occurrences
  • {n,m} means at least n occurrences but no more than m occurrences

  • Example:
  • .{0,} same as .*
  • a{2,} same as aaa*
slide-20
SLIDE 20

CSCI 2133

Subexpressions

  • If you want to group part of an expression so that * or { } applies to more than just the previous character, use ( ) notation
  • Subexpressions are treated like a single character
  • a* matches 0 or more occurrences of a
  • abc* matches ab, abc, abcc, abccc, …
  • (abc)* matches the empty string, abc, abcabc, abcabcabc, …
  • (abc){2,3} matches abcabc or abcabcabc
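
The repetition ranges and subexpressions above can be checked with a few lines of C. This sketch assumes the POSIX regex.h interface with REG_EXTENDED, so that { } and ( ) can be written without backslashes; ere_search is a hypothetical helper name. The ^ and $ anchors are written in the patterns so that only whole-line matches count.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: 1 if the extended regular expression `pattern`
   matches somewhere in `line`, 0 if not, -1 if it does not compile. */
int ere_search(const char *pattern, const char *line)
{
    regex_t re;
    int rc;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, line, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

For example, "^a{2,}$" and "^aaa*$" both reject "a" and accept "aaaa". "^abc*$" accepts "abccc" but not "abcabc", whereas "^(abc){2,3}$" accepts "abcabc" and rejects both "abc" and "abcabcabcabc".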
slide-21
SLIDE 21

CSCI 2133

grep

  • grep comes from the ed (Unix text editor) search command “global regular expression print”, or g/re/p
  • This was such a useful command that it was written as a standalone utility
  • There are two other variants, egrep and fgrep, that together with grep comprise the grep family
  • grep is the answer to those moments when you know you want the file that contains a specific phrase but you can’t remember its name

slide-22
SLIDE 22

CSCI 2133

Family Differences

  • grep - uses regular expressions for pattern matching
  • fgrep - fixed-string grep, does not use regular expressions, only matches fixed strings but can get search strings from a file
  • egrep - extended grep, uses a more powerful set of regular expressions but does not support backreferencing, generally the fastest member of the grep family
  • agrep - approximate grep; not standard
slide-23
SLIDE 23

CSCI 2133

Syntax

  • Regular expression concepts we have seen so far are common to grep and egrep.

  • grep and egrep have slightly different syntax
  • grep: BREs
  • egrep: EREs (enhanced features we will discuss)
  • Major syntax differences:
  • grep: \( and \), \{ and \}
  • egrep: ( and ), { and }
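
The BRE/ERE syntax difference can be observed directly in C, because POSIX regcomp compiles a pattern as a BRE by default and as an ERE when REG_EXTENDED is given. This is a sketch under that assumption; flavour_search is a hypothetical helper name.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: search `line` for `pattern`.
   extended == 0 compiles the pattern as a BRE (grep style),
   nonzero compiles it as an ERE (egrep style).
   Returns 1 on a match, 0 on no match, -1 if the pattern does not compile. */
int flavour_search(const char *pattern, const char *line, int extended)
{
    regex_t re;
    int rc;
    int cflags = REG_NOSUB | (extended ? REG_EXTENDED : 0);

    if (regcomp(&re, pattern, cflags) != 0)
        return -1;
    rc = regexec(&re, line, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

The BRE \(ab\)\{2\} and the ERE (ab){2} both find "abab" in "xababx". Compiled as a BRE, the unescaped (ab){2} is just literal text: it does not match "xababx" but does match a line that contains the characters "(ab){2}".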
slide-24
SLIDE 24

CSCI 2133

Protecting Regex Metacharacters

  • Since many of the special characters used in regexes also have special meaning to the shell, it’s a good idea to get in the habit of single quoting your regexes
  • This will protect any special characters from being operated on by the shell
  • If you habitually do it, you won’t have to worry about when it is necessary

slide-25
SLIDE 25

CSCI 2133

Escaping Special Characters

  • Even though we are single quoting our regexes so the shell won’t interpret the special characters, some characters are special to grep (e.g. * and .)
  • To get literal characters, we escape the character with a \ (backslash)
  • Suppose we want to search for the character sequence a*b*
  • Unless we do something special, this will match zero or more ‘a’s followed by zero or more ‘b’s, not what we want
  • a\*b\* will fix this - now the asterisks are treated as regular characters
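
Escaping can also be done by a program. The sketch below is one possible approach, not part of the course code: escape_bre and literal_search are hypothetical helpers that put a backslash in front of the characters . [ * ^ $ \ and then search with POSIX regcomp/regexec in BRE mode.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: copy `s` into `out`, putting a backslash in front
   of the characters . [ * ^ $ \ so that a BRE treats them literally.
   Returns 0 on success, -1 if `out` (of size `outsize`) is too small. */
int escape_bre(const char *s, char *out, size_t outsize)
{
    size_t j = 0;

    if (outsize == 0)
        return -1;
    for (; *s != '\0'; s++) {
        int special = (*s == '.' || *s == '[' || *s == '*' ||
                       *s == '^' || *s == '$' || *s == '\\');
        /* need room for this character, its backslash, and the final NUL */
        if (j + (special ? 2 : 1) >= outsize)
            return -1;
        if (special)
            out[j++] = '\\';
        out[j++] = *s;
    }
    out[j] = '\0';
    return 0;
}

/* 1 if `needle` occurs literally in `line`, 0 if not, -1 on error. */
int literal_search(const char *needle, const char *line)
{
    char pattern[256];
    regex_t re;
    int rc;

    if (escape_bre(needle, pattern, sizeof pattern) != 0)
        return -1;
    if (regcomp(&re, pattern, REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, line, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

With this, escape_bre turns a*b* into a\*b\*, so literal_search("a*b*", ...) finds the four characters a*b* in a line but does not match "aabb".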

slide-26
SLIDE 26

CSCI 2133

Egrep: Alternation

  • Regex also provides an alternation character | for matching one or another subexpression
  • (T|Fl)an will match ‘Tan’ or ‘Flan’
  • ^(From|Subject): will match the From and Subject lines of a typical email message
  • It matches a beginning of line followed by either the characters ‘From’ or ‘Subject’ followed by a ‘:’
  • Subexpressions are used to limit the scope of the alternation
  • At(ten|nine)tion then matches “Attention” or “Atninetion”, not “Atten” or “ninetion” as would happen without the parentheses - Atten|ninetion
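
The email-header pattern above can be applied from C as follows. This is a small sketch that assumes the POSIX regex.h interface; is_from_or_subject is a hypothetical function name.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: 1 if `line` starts with "From:" or "Subject:",
   0 if not, -1 if the pattern fails to compile. */
int is_from_or_subject(const char *line)
{
    regex_t re;
    int rc;

    /* REG_EXTENDED selects egrep-style syntax, where | and ( ) are operators */
    if (regcomp(&re, "^(From|Subject):", REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, line, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

The parentheses keep the ^ anchor and the trailing colon outside the alternation, so a line such as "Fromage: cheese" or one with leading spaces is not accepted.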

slide-27
SLIDE 27

CSCI 2133

Egrep: Repetition Shorthands

  • The * (star) has already been seen to specify zero or more occurrences of the immediately preceding character
  • + (plus) means “one or more”
  • abc+d will match ‘abcd’, ‘abccd’, or ‘abccccccd’ but will not match ‘abd’

  • Equivalent to {1,}
slide-28
SLIDE 28

CSCI 2133

Egrep: Repetition Shorthands cont

  • The ‘?’ (question mark) specifies an optional character, the single character that immediately precedes it
  • July? will match ‘Jul’ or ‘July’
  • Equivalent to {0,1}
  • Also equivalent to (Jul|July)
  • The *, ?, and + are known as quantifiers because they specify the quantity of a match
  • Quantifiers can also be used with subexpressions
  • (a*c)+ will match ‘c’, ‘ac’, ‘aac’ or ‘aacaacac’ but will not match ‘a’ or a blank line
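
The equivalences claimed for + and ? can be spot-checked by running two patterns over the same sample strings. The sketch below assumes POSIX regex.h; agree_on_samples is a hypothetical helper, and agreeing on a handful of samples is only evidence, not a proof, that two expressions are equivalent.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if the two EREs give the same answer on
   every one of the n sample strings, 0 if they disagree on some sample,
   -1 if either pattern does not compile. */
int agree_on_samples(const char *p1, const char *p2,
                     const char *const samples[], size_t n)
{
    regex_t r1, r2;
    size_t i;
    int same = 1;

    if (regcomp(&r1, p1, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    if (regcomp(&r2, p2, REG_EXTENDED | REG_NOSUB) != 0) {
        regfree(&r1);
        return -1;
    }
    for (i = 0; i < n && same; i++) {
        int m1 = regexec(&r1, samples[i], 0, NULL, 0) == 0;
        int m2 = regexec(&r2, samples[i], 0, NULL, 0) == 0;
        same = (m1 == m2);
    }
    regfree(&r1);
    regfree(&r2);
    return same;
}
```

On the samples "Jul", "July", "Ju", "Julyy" and the empty string, ^July?$ agrees with both ^(Jul|July)$ and ^July{0,1}$. On "abd", "abcd", "abccd", "abccccccd", ^abc+d$ agrees with ^abc{1,}d$ but not with ^abc*d$, which also accepts "abd".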

slide-29
SLIDE 29

CSCI 2133

Grep: Backreferences

  • Sometimes it is handy to be able to refer to a match that was made earlier in a regex
  • This is done using backreferences
  • \n is the backreference specifier, where n is a number
  • It matches the same text that the nth subexpression matched
  • For example, to find if the first word of a line is the same as the last:
  • ^\([[:alpha:]]\{1,\}\) .* \1$
  • The \([[:alpha:]]\{1,\}\) matches 1 or more letters
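
The backreference pattern above can be run from C with POSIX regcomp in BRE mode. This is a sketch under that assumption; first_word_repeats is a hypothetical function name.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: 1 if the first and last words of `line` are the
   same according to the slide's pattern, 0 if not, -1 on compile error. */
int first_word_repeats(const char *line)
{
    regex_t re;
    regmatch_t m[2];    /* m[0]: whole match, m[1]: the tagged first word */
    int rc;

    /* BRE (no REG_EXTENDED): \( \) tags the first word and \1 refers back
       to it. Backslashes are doubled because this is a C string literal. */
    if (regcomp(&re, "^\\([[:alpha:]]\\{1,\\}\\) .* \\1$", 0) != 0)
        return -1;
    rc = regexec(&re, line, 2, m, 0);
    regfree(&re);
    return rc == 0;
}
```

The function accepts "hello big hello" and rejects "hello big world". It also rejects "hello hello": the pattern contains two literal spaces around .*, so there must be at least two spaces between the first and last word.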

slide-30
SLIDE 30

CSCI 2133

Practical Regex Examples

  • Variable names in C
  • [a-zA-Z_][a-zA-Z_0-9]*
  • Dollar amount with optional cents
  • \$[0-9]+(\.[0-9][0-9])?
  • Time of day
  • (1[012]|[1-9]):[0-5][0-9] (am|pm)
  • HTML headers <h1> <H1> <h2> …
  • <[hH][1-4]>
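
The practical patterns above are meant to describe a whole item, so when validating input they should be anchored at both ends. The following sketch assumes POSIX regex.h; ere_whole is a hypothetical helper that wraps a pattern as ^(pattern)$ before compiling it as an ERE.

```c
#include <regex.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: 1 if the extended regular expression `pattern`
   matches the WHOLE of `text`, 0 if not, -1 on a bad or overlong pattern. */
int ere_whole(const char *pattern, const char *text)
{
    char anchored[256];
    regex_t re;
    int n, rc;

    /* wrap as ^(pattern)$ so that partial matches do not count */
    n = snprintf(anchored, sizeof anchored, "^(%s)$", pattern);
    if (n < 0 || (size_t)n >= sizeof anchored)
        return -1;
    if (regcomp(&re, anchored, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

The anchoring matters: as a plain search, the time-of-day pattern would find "3:00 pm" inside "13:00 pm", while ere_whole rejects "13:00 pm". Likewise the dollar pattern accepts "$7" and "$12.50" as whole strings but not "$12.5".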
slide-31
SLIDE 31

CSCI 2133

Javascript Regex functions

  • .match(regexp): returns first match for this string against the given regular expression; if global /g flag is used, returns array of all matches
  • .replace(regexp, text): replaces first occurrence of the regular expression with the given text; if global /g flag is used, replaces all occurrences
  • .search(regexp): returns first index where the given regular expression occurs
  • .split(delimiter[,limit]): breaks apart a string into an array of strings using the given regular expression as the delimiter; returns the array of tokens

slide-32
SLIDE 32

CSCI 2133

grep Family

  • Syntax

grep [-hilnv] [-e expression] [filename]
egrep [-hilnv] [-e expression] [-f filename] [expression] [filename]
fgrep [-hilnxv] [-e string] [-f filename] [string] [filename]

  • -h: Do not display filenames
  • -i: Ignore case
  • -l: List only filenames containing matching lines
  • -n: Precede each matching line with its line number
  • -v: Negate matches
  • -x: Match whole line only (fgrep only)
  • -e expression: Specify expression as option
  • -f filename: Take the regular expression (egrep) or a list of strings (fgrep) from filename

slide-33
SLIDE 33

CSCI 2133

grep Examples

  • grep 'men' GrepMe
  • grep 'fo*' GrepMe
  • egrep 'fo+' GrepMe
  • egrep -n '[Tt]he' GrepMe
  • fgrep 'The' GrepMe
  • egrep 'NC+[0-9]*A?' GrepMe
  • fgrep -f expfile GrepMe
  • Find all lines with signed numbers

$ egrep '[-+][0-9]+\.?[0-9]*' *.c
bsearch.c: return -1;
compile.c: strchr("+1-2*3", t->op)[1] - '0', dst,
convert.c: Print integers in a given base 2-16 (default 10)
convert.c: sscanf(argv[i+1], "%d", &base);
strcmp.c: return -1;
strcmp.c: return +1;
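
To see why each of the lines above is selected, it helps to extract the text that the signed-number pattern actually matches. This sketch assumes POSIX regex.h; first_signed_number is a hypothetical helper name.

```c
#include <regex.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy the first substring of `line` matched by the
   signed-number pattern into `out` (of size `outsize`).
   Returns 1 if found, 0 if there is no match, -1 on error. */
int first_signed_number(const char *line, char *out, size_t outsize)
{
    regex_t re;
    regmatch_t m[1];
    size_t len;
    int rc;

    if (regcomp(&re, "[-+][0-9]+\\.?[0-9]*", REG_EXTENDED) != 0)
        return -1;
    rc = regexec(&re, line, 1, m, 0);
    regfree(&re);
    if (rc != 0)
        return 0;
    len = (size_t)(m[0].rm_eo - m[0].rm_so);
    if (len >= outsize)
        return -1;              /* matched text does not fit in out */
    memcpy(out, line + m[0].rm_so, len);
    out[len] = '\0';
    return 1;
}
```

For "return -1;" the matched text is "-1", as intended. For the comment line about "base 2-16" it is "-16", and for the sscanf line it is the "+1" in argv[i+1]: the pattern cannot tell a sign from a subtraction, a range, or an addition, which explains the extra lines in the output.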
slide-34
SLIDE 34

CSCI 2133

Fun with the Dictionary

  • /usr/dict/words contains about 25,000 words
  • egrep hh /usr/dict/words
      beachhead
      highhanded
      withheld
      withhold
  • egrep as a simple spelling checker: specify plausible alternatives you know
  • egrep "n(ie|ei)ther" /usr/dict/words
      neither
  • How many words have 3 a’s one letter apart?
  • egrep a.a.a /usr/dict/words | wc -l
      54
  • egrep u.u.u /usr/dict/words
      cumulus
slide-35
SLIDE 35

CSCI 2133

Other Notes

  • Use /dev/null as an extra file name
  • With more than one file argument, grep will print the name of the file that matched
  • grep test bigfile
      This is a test.
  • grep test /dev/null bigfile
      bigfile:This is a test.
  • Return code of grep is useful
  • grep fred filename > /dev/null && rm filename
slide-36
SLIDE 36

CSCI 2133

Quick Reference

fgrep, grep, egrep:
  x            Ordinary characters match themselves (NEWLINES and metacharacters excluded)
  xyz          Ordinary strings match themselves

grep, egrep:
  \m           Matches literal character m
  ^            Start of line
  $            End of line
  .            Any single character
  [xy^$z]      Any of x, y, ^, $, or z
  [^xy^$z]     Any one character other than x, y, ^, $, or z
  [a-z]        Any single character in given range
  r*           Zero or more occurrences of regex r
  r1r2         Matches r1 followed by r2

grep:
  \(r\)        Tagged regular expression, matches r
  \n           Set to what matched the nth tagged expression (n = 1-9)
  \{n,m\}      Repetition

egrep:
  r+           One or more occurrences of r
  r?           Zero or one occurrences of r
  r1|r2        Either r1 or r2
  (r1|r2)r3    Either r1r3 or r2r3
  (r1|r2)*     Zero or more occurrences of r1|r2, e.g., r1, r1r1, r2r1, r1r1r2r1, …
  {n,m}        Repetition

input line: This is one line of text
regular expression: o.*o

slide-37
SLIDE 37

CSCI 2133

Grep Regular Expressions

  • literal characters match themselves, except the metacharacters: . [ ] * ^ $ \
  • . matches any single character
  • [...] matches a character from a set, includes ranges
  • [^...] matches any character not from a set, includes ranges
  • * is used for repetition, 0 or more times, after another regular expression
  • \c matches exactly character c unless c is ( ) or a digit
  • \( \) is used for tagging, and \1 ... matches tagged regular expressions

slide-38
SLIDE 38

CSCI 2133

egrep — Extended Regular Expressions

  • allows: + ?
  • alternation: | and grouping (...)
slide-39
SLIDE 39

CSCI 2133

Variations of Regular Expressions in Practice

  • grep - basic regular expressions
  • egrep - extended regular expressions
  • fgrep - a fast grep with one or more fixed strings (parallel search for many fixed strings)
  • agrep - “approximate” grep; search with errors; not available by default
  • many languages use regular expressions; Perl-style regular expressions
  • simple versions: “wild cards” in shells
  • LIKE operator in SQL, Visual Basic, and similar
slide-40
SLIDE 40

CSCI 2133

Implementing Regular Expressions

  • A basic implementation is relatively simple
  • A more efficient implementation would rely on automata theory and significantly more programming
  • If we were to implement the program ‘grep’, what would be the pseudocode of the main algorithm?

slide-41
SLIDE 41

CSCI 2133

/* Basic grep algorithm */
while (get a line)
    if match(regexpr, line)
        print line

/* Some ideas:
   - The regular expression could be a string, or translated to an
     internal representation suitable for efficient matching
   - match should try different starting positions in the line */

slide-42
SLIDE 42

CSCI 2133

Implementation of ‘match’

  • What would be the function prototype of ‘match’ in C?
  • How would we implement it if we assume that we want to handle the anchor ‘^’ as well?

slide-43
SLIDE 43

CSCI 2133

/* match: search for regexp anywhere in text */
int match(char *regexp, char *text)
{
    if (regexp[0] == '^')
        return matchhere(regexp+1, text);
    do {    /* must look even if string is empty */
        if (matchhere(regexp, text))
            return 1;
    } while (*text++ != '\0');
    return 0;
}

slide-44
SLIDE 44

CSCI 2133

Matching Exactly at the Start

  • How could we implement the function to match exactly at a specific string position?

  • We want to handle a very simple Kleene star.
slide-45
SLIDE 45

CSCI 2133

/* matchhere: search for regexp at beginning of text */
int matchhere(char *regexp, char *text)
{
    if (regexp[0] == '\0')
        return 1;
    if (regexp[1] == '*')
        return matchstar(regexp[0], regexp+2, text);
    if (regexp[0] == '$' && regexp[1] == '\0')
        return *text == '\0';
    if (*text != '\0' && (regexp[0] == '.' || regexp[0] == *text))
        return matchhere(regexp+1, text+1);
    return 0;
}

slide-46
SLIDE 46

CSCI 2133

Matching with Simple Star Operator

  • How could we implement a function to match with the simple star operator?

slide-47
SLIDE 47

CSCI 2133

/* matchstar: search for c*regexp at beginning of text */
int matchstar(int c, char *regexp, char *text)
{
    do {    /* a * matches zero or more instances */
        if (matchhere(regexp, text))
            return 1;
    } while (*text != '\0' && (*text++ == c || c == '.'));
    return 0;
}

slide-48
SLIDE 48

CSCI 2133

Matching with Simple Star Operator

  • Does this function make the shortest or longest match?

slide-49
SLIDE 49

CSCI 2133

Matching with Simple Star Operator

  • Does this function make the shortest or longest match?
  • It is the shortest match. The longest would be as follows...

slide-50
SLIDE 50

CSCI 2133

/* matchstar: leftmost longest search for c*regexp */
int matchstar(int c, char *regexp, char *text)
{
    char *t;

    for (t = text; *t != '\0' && (*t == c || c == '.'); t++)
        ;
    do {    /* * matches zero or more */
        if (matchhere(regexp, t))
            return 1;
    } while (t-- > text);
    return 0;
}
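
Because match only returns 0 or 1, both versions of matchstar accept exactly the same lines; what differs is how much text the star consumes. To make that visible, the sketch below is an instrumented variant that reports where a match ends. The names match_end, star_end and match_length, and the greedy parameter, are hypothetical and not from the textbook code.

```c
#include <stddef.h>

const char *match_end(const char *regexp, const char *text, int greedy);

/* star_end: match c*regexp at text; return the end of the match or NULL */
static const char *star_end(int c, const char *regexp, const char *text,
                            int greedy)
{
    const char *t, *end;

    if (!greedy) {
        /* shortest first: try zero repetitions of c, then one, two, ... */
        t = text;
        do {
            if ((end = match_end(regexp, t, greedy)) != NULL)
                return end;
        } while (*t != '\0' && (*t++ == c || c == '.'));
        return NULL;
    }
    /* longest first: consume as many c as possible, then back off */
    for (t = text; *t != '\0' && (*t == c || c == '.'); t++)
        ;
    for (;;) {
        if ((end = match_end(regexp, t, greedy)) != NULL)
            return end;
        if (t == text)
            return NULL;
        t--;
    }
}

/* match_end: like matchhere, but returns a pointer just past the match */
const char *match_end(const char *regexp, const char *text, int greedy)
{
    if (regexp[0] == '\0')
        return text;
    if (regexp[1] == '*')
        return star_end(regexp[0], regexp + 2, text, greedy);
    if (regexp[0] == '$' && regexp[1] == '\0')
        return *text == '\0' ? text : NULL;
    if (*text != '\0' && (regexp[0] == '.' || regexp[0] == *text))
        return match_end(regexp + 1, text + 1, greedy);
    return NULL;
}

/* Length of the match that starts exactly at text, or -1 if there is none. */
long match_length(const char *regexp, const char *text, int greedy)
{
    const char *end = match_end(regexp, text, greedy);
    return end == NULL ? -1 : (long)(end - text);
}
```

For the pattern a.*e on the text "apple from the apple.", match_length gives 5 with greedy set to 0 (just "apple") and 20 with greedy set to 1 ("apple from the apple"). A pattern such as ya*y gives the same length either way, because only one number of repetitions can succeed.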

slide-51
SLIDE 51

CSCI 2133

Putting a ‘grep’ together

  • How could we put the final grep-like program together?
  • How does grep behave given
  • standard input?
  • one file?
  • multiple files?

slide-52
SLIDE 52

CSCI 2133

int main(int argc, char *argv[])
{
    int i, nmatch;
    FILE *f;

    if (argc < 2) {
        fprintf(stderr, "usage: grep regexp [file ...]\n");
        return 2;    /* cannot continue without a regexp */
    }
    nmatch = 0;
    if (argc == 2) {
        /* no file arguments: search standard input */
        if (grep(argv[1], stdin, NULL))
            nmatch++;
    } else {
        for (i = 2; i < argc; i++) {
            f = fopen(argv[i], "r");
            if (f == NULL) {
                fprintf(stderr, "can't open %s\n", argv[i]);
                continue;
            }

slide-53
SLIDE 53

CSCI 2133

            /* print file names only when there is more than one file */
            if (grep(argv[1], f, argc>3 ? argv[i] : NULL) > 0)
                nmatch++;
            fclose(f);
        }
    }
    return nmatch == 0;
}

slide-54
SLIDE 54

CSCI 2133

Previous Lecture

  • Review of regular expressions
  • grep style regular expressions
  • egrep and other variations of regular expressions
  • an approach to implementing basic regular expressions


slide-55
SLIDE 55

CSCI 2133

Question

  • Match vs strstr, which one is faster?


slide-56
SLIDE 56

CSCI 2133

Regular Expressions grep and egrep

slide-57
SLIDE 57

CSCI 2133

Previously

  • Basic UNIX Commands
  • Files: rm, cp, mv, ls, ln
  • Processes: ps, kill
  • Unix Filters
  • cat, head, tail, tee, wc
  • cut, paste
  • find
  • sort, uniq
  • tr
slide-58
SLIDE 58

CSCI 2133

Subtleties of commands

  • Executing commands with find
  • Specification of columns in cut
  • Specification of columns in sort
  • Methods of input
  • Standard in
  • File name arguments
  • Special "-" filename
  • Options for uniq
slide-59
SLIDE 59

CSCI 2133

Today

  • Regular Expressions
  • Allow you to search for text in files
  • grep command
  • Stream manipulation:
  • sed
slide-60
SLIDE 60

CSCI 2133

/* grep: search for regexp in file */
int grep(char *regexp, FILE *f, char *name)
{
    int n, nmatch;
    char buf[BUFSIZ];

    nmatch = 0;
    while (fgets(buf, sizeof buf, f) != NULL) {
        n = strlen(buf);
        if (n > 0 && buf[n-1] == '\n')
            buf[n-1] = '\0';
        if (match(regexp, buf)) {
            nmatch++;
            if (name != NULL)
                printf("%s:", name);
            printf("%s\n", buf);
        }
    }
    return nmatch;
}

slide-61
SLIDE 61

CSCI 2133

Some Important Lessons and Ideas

  • Use of tools (grep)
  • Notation (regex)
  • Don’t worry about performance if it does not matter
  • “Hacking” can improve performance but also introduce complexity and errors
  • A better algorithm can produce a large gain in performance
  • Code profiling (tuning) can help later

slide-62
SLIDE 62

CSCI 2133

Assignment Tasks

  • This will be described as a part of Assignment 1
  • Put together the C program based on provided code and prepare a testing package
  • Copy and modify code to make a version with operators ? and +
  • Write a Java class by translating C code with some minor additional functionality (likely for A2)
  • More details will be posted

slide-63
SLIDE 63

CSCI 2133