CSCI 2133
CSCI 2133 – Rapid Programming Techniques for Innovation
Week 2 – Regular Expression and Implementation
Previous Lecture
– About the course project
– Choosing a topic, defining scope, and forming a team
– Preparing design
command line
Regular Expressions
development and different implementations
[Kernighan and Pike, sec. 9.2, p. 222]
but not so much in other platforms
with grep
possible input strings.
concept in Computer Science called finite automata theory
/[a-zA-Z_\-]+@(([a-zA-Z_\-])+\.)+[a-zA-Z]{2,4}/
text
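As a rough illustration, the e-mail pattern above can be tried from C through the POSIX regcomp/regexec interface. This is only a sketch under assumptions: the helper name contains_email is made up here, the \- escapes inside the brackets are rewritten as a trailing - because POSIX bracket expressions treat a backslash literally, and the pattern is left unanchored, so it reports whether an address-like substring occurs anywhere in the text.

```c
#include <regex.h>
#include <stddef.h>

/* contains_email: 1 if text contains an address-like substring,
 * 0 if it does not, -1 if the pattern fails to compile.
 * (Hypothetical helper, not part of the lecture code.) */
int contains_email(const char *text)
{
    /* POSIX extended-regex spelling of the slide's pattern. */
    static const char pattern[] =
        "[a-zA-Z_-]+@(([a-zA-Z_-])+\\.)+[a-zA-Z]{2,4}";
    regex_t re;
    int status;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    status = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return status == 0;
}
```

Because the character classes contain no digits, a local part that ends in a digit directly before the @ is not accepted by this pattern.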
than one place.
character.
specific set of characters.
syntax.
character class but only if it is used within a range, [-123] would match the characters -, 1, 2, or 3
to by name (alpha, lower, upper, alnum, digit, punct, cntrl)
[[:alpha:]]
[[:alnum:]]
[45[:lower:]]
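The bracket-expression rules above (a hyphen that is not part of a range is literal, and named classes such as [:lower:] can be mixed with ordinary characters) can be checked with a small C sketch built on POSIX regcomp/regexec. The helper names ere_match, is_dash_or_123 and is_45_or_lower are hypothetical, and the ^ and $ anchors are added here so that each test applies to the whole string.

```c
#include <regex.h>
#include <stddef.h>

/* ere_match: 1 if the POSIX extended regex matches somewhere in text,
 * 0 if not, -1 if the pattern does not compile.
 * (Hypothetical helper for illustration.) */
int ere_match(const char *pattern, const char *text)
{
    regex_t re;
    int status;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    status = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return status == 0;
}

/* is_dash_or_123: the whole string is one of '-', '1', '2', '3'.
 * The leading '-' in [-123] is a literal, not a range operator. */
int is_dash_or_123(const char *s)
{
    return ere_match("^[-123]$", s);
}

/* is_45_or_lower: the whole string is '4', '5', or one lowercase letter. */
int is_45_or_lower(const char *s)
{
    return ere_match("^[45[:lower:]]$", s);
}
```

For instance, is_dash_or_123("-") and is_45_or_lower("q") both return 1, while is_dash_or_123("4") and is_45_or_lower("Q") return 0 in the default C locale.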
^$ ^word$
the single regular expression preceding it.
.*
regular expression.
specify one regex to represent phone numbers: 123-234-3455, (123)-234-2344, (123)123-2345, ( 212) 123 - 23445?
immediately preceding regex
m occurrences
character, use ( ) notation
command “global regular expression print” or g/re/p
as a standalone utility
comprise the grep family
know you want the file that contains a specific phrase but you can’t remember its name
matching
strings from a file
regular expressions but does not support backreferencing, generally the fastest member of the grep family
common to grep and egrep.
also have special meaning to the shell, so it’s a good idea to get in the habit of single-quoting your regexes
when it is necessary
shell won’t interpret the special characters, some characters are special to grep (e.g. * and .)
a \ (backslash)
sequence a*b*
more ‘a’s followed by zero or more ‘b’s, not what we want
regular characters
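The difference between the escaped and unescaped forms of a*b* can be sketched in C with POSIX regcomp/regexec. The helper names below are hypothetical, and the ^ and $ anchors are an addition so that the whole string is tested. Note that in C source each regex backslash has to be written as two backslashes.

```c
#include <regex.h>
#include <stddef.h>

/* ere_match: 1 if the POSIX extended regex matches somewhere in text,
 * 0 if not, -1 if the pattern does not compile.
 * (Hypothetical helper for illustration.) */
int ere_match(const char *pattern, const char *text)
{
    regex_t re;
    int status;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    status = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return status == 0;
}

/* Unescaped: zero or more 'a's followed by zero or more 'b's. */
int is_as_then_bs(const char *s)
{
    return ere_match("^a*b*$", s);
}

/* Escaped: the literal four-character sequence a*b*.
 * The regex is a\*b\*; the doubled backslashes are C string syntax. */
int is_literal_a_star_b_star(const char *s)
{
    return ere_match("^a\\*b\\*$", s);
}
```

With these definitions, is_as_then_bs("aabb") returns 1 but is_literal_a_star_b_star("aabb") returns 0; only the string a*b* itself satisfies the escaped pattern.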
matching one or another subexpression
lines of a typical email message
‘From’ or ‘Subject’ followed by a ‘:’
alternation
“Atninetion”, not “Atten” or “ninetion” as would happen without the parentheses: Atten|ninetion
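The effect of the parentheses on alternation can be checked with a short C sketch using POSIX regcomp/regexec. The grouped pattern At(ten|nine)tion is assumed from the strings named above; the helper names and the ^ and $ anchors on the grouped pattern are hypothetical additions.

```c
#include <regex.h>
#include <stddef.h>

/* ere_match: 1 if the POSIX extended regex matches somewhere in text,
 * 0 if not, -1 if the pattern does not compile.
 * (Hypothetical helper for illustration.) */
int ere_match(const char *pattern, const char *text)
{
    regex_t re;
    int status;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    status = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return status == 0;
}

/* Grouped: "At", then "ten" or "nine", then "tion". */
int is_grouped_match(const char *s)
{
    return ere_match("^At(ten|nine)tion$", s);
}

/* Ungrouped: the | splits the whole pattern into "Atten" or "ninetion". */
int has_ungrouped_match(const char *s)
{
    return ere_match("Atten|ninetion", s);
}
```

Here is_grouped_match accepts "Attention" and "Atninetion" and rejects "Atten", whereas has_ungrouped_match accepts both "Atten" and "ninetion".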
character
not match ‘abd’
single character that immediately precedes it
specify the quantity of a match
‘a’ or a blank line
match that was made earlier in a regex
same as the last:
letters
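A backreference can be tried from C with a POSIX basic regular expression, where \( \) tags a subexpression and \1 refers back to what it matched. The sketch below assumes the example is about lines whose first character is the same as the last; the pattern ^\(.\).*\1$ and the helper name first_equals_last are illustrative choices, not taken from the slides.

```c
#include <regex.h>
#include <stddef.h>

/* first_equals_last: 1 if the line has at least two characters and
 * its first character equals its last, 0 if not, -1 on compile error.
 * (Hypothetical helper for illustration.) */
int first_equals_last(const char *line)
{
    regex_t re;
    regmatch_t pm[2];   /* whole match plus the one tagged group */
    int status;

    /* Basic RE (no REG_EXTENDED): \(.\) tags one character and
     * \1 must match that same character again at the end. */
    if (regcomp(&re, "^\\(.\\).*\\1$", 0) != 0)
        return -1;
    status = regexec(&re, line, 2, pm, 0);
    regfree(&re);
    return status == 0;
}
```

For example, first_equals_last("abca") returns 1 and first_equals_last("abcd") returns 0.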
.match(regexp): returns first match for this string against the given regular expression; if global /g flag is used, returns array of all matches
.replace(regexp, text): replaces first occurrence of the regular expression with the given text; if global /g flag is used, replaces all occurrences
.search(regexp): returns first index where the given regular expression occurs
.split(delimiter[,limit]): breaks apart a string into an array
as the delimiter; returns the array of tokens
grep [-hilnv] [-e expression] [filename]
egrep [-hilnv] [-e expression] [-f filename] [expression] [filename]
fgrep [-hilnxv] [-e string] [-f filename] [string] [filename]

-h             Do not display filenames
-i             Ignore case
-l             List only filenames containing matching lines
-n             Precede each matching line with its line number
-v             Negate matches
-x             Match whole line only (fgrep only)
-e expression  Specify expression as option
-f filename    Take the regular expression (egrep) or a list of strings (fgrep) from filename
$ egrep '[-+][0-9]+\.?[0-9]*' *.c
alternatives you know
$ egrep "n(ie|ei)ther" /usr/dict/words
neither
Pattern     Meaning                                                  Supported by
x           Ordinary characters match themselves                     fgrep, grep, egrep
            (NEWLINES and metacharacters excluded)
xyz         Ordinary strings match themselves                        fgrep, grep, egrep
\m          Matches literal character m                              grep, egrep
^           Start of line                                            grep, egrep
$           End of line                                              grep, egrep
.           Any single character                                     grep, egrep
[xy^$z]     Any of x, y, ^, $, or z                                  grep, egrep
[^xy^$z]    Any one character other than x, y, ^, $, or z            grep, egrep
[a-z]       Any single character in given range                      grep, egrep
r*          Zero or more occurrences of regex r                      grep, egrep
r1r2        Matches r1 followed by r2                                grep, egrep
\(r\)       Tagged regular expression, matches r                     grep
\n          Set to what matched the nth tagged expression (n = 1-9)  grep
\{n,m\}     Repetition                                               grep
r+          One or more occurrences of r                             egrep
r?          Zero or one occurrences of r                             egrep
r1|r2       Either r1 or r2                                          egrep
(r1|r2)r3   Either r1r3 or r2r3                                      egrep
(r1|r2)*    Zero or more occurrences of r1|r2                        egrep
            (e.g., r1, r1r1, r2r1, r1r1r2r1, …)
{n,m}       Repetition                                               egrep
Grep Regular Expressions
metacharacters: . [ ] * ^ $ \
ranges
regular expression
regular expressions
egrep — Extended Regular Expressions
Variations of Regular Expressions in Practice
strings (parallel search for many fixed strings)
errors; not available by default
Perl-style regular expressions
automata theory and significantly more programming
what would be the pseudocode of the main algorithm?
/* Basic grep algorithm */
while (get a line)
    if match(regexpr, line)
        print line

/* Some ideas:
   - Regular expression could be a string, or be translated to an
     internal representation suitable for efficient matching
   - match should try different starting positions in the line */
C?
we want to handle the anchor ‘^’ as well?
/* match: search for regexp anywhere in text */
int match(char *regexp, char *text)
{
    if (regexp[0] == '^')
        return matchhere(regexp+1, text);
    do {    /* must look even if string is empty */
        if (matchhere(regexp, text))
            return 1;
    } while (*text++ != '\0');
    return 0;
}
exactly at a specific string position?
/* matchhere: search for regexp at beginning of text */
int matchhere(char *regexp, char *text)
{
    if (regexp[0] == '\0')
        return 1;       /* end of pattern: everything matched */
    if (regexp[1] == '*')
        return matchstar(regexp[0], regexp+2, text);
    if (regexp[0] == '$' && regexp[1] == '\0')
        return *text == '\0';
    if (*text != '\0' && (regexp[0] == '.' || regexp[0] == *text))
        return matchhere(regexp+1, text+1);
    return 0;
}
with the simple star operator?
/* matchstar: search for c*regexp at beginning of text */
int matchstar(int c, char *regexp, char *text)
{
    do {    /* a * matches zero or more instances */
        if (matchhere(regexp, text))
            return 1;
    } while (*text != '\0' && (*text++ == c || c == '.'));
    return 0;
}
match?
match?
as follows. ..
/* matchstar: leftmost longest search for c*regexp */
int matchstar(int c, char *regexp, char *text)
{
    char *t;

    /* advance t past the longest run of characters matching c */
    for (t = text; *t != '\0' && (*t == c || c == '.'); t++)
        ;
    do {    /* * matches zero or more */
        if (matchhere(regexp, t))
            return 1;
    } while (t-- > text);
    return 0;
}
together?
– standard input?
– one file?
– multiple files?
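#include <stdio.h>

/* grep main: search for regexp in files */
int main(int argc, char *argv[])
{
    int i, nmatch;
    FILE *f;

    if (argc < 2) {
        fprintf(stderr, "usage: grep regexp [file ...]\n");
        return 2;       /* no pattern given: do not touch argv[1] */
    }
    nmatch = 0;
    if (argc == 2) {    /* no file arguments: read standard input */
        if (grep(argv[1], stdin, NULL))
            nmatch++;
    } else {
        for (i = 2; i < argc; i++) {
            f = fopen(argv[i], "r");
            if (f == NULL) {
                fprintf(stderr, "can't open %s\n", argv[i]);
                continue;
            }
            /* print file names only when several files are given */
            if (grep(argv[1], f, argc>3 ? argv[i] : NULL) > 0)
                nmatch++;
            fclose(f);
        }
    }
    return nmatch == 0; /* exit status 0 if anything matched */
}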
#include <stdio.h>
#include <string.h>

/* grep: search for regexp in file */
int grep(char *regexp, FILE *f, char *name)
{
    int n, nmatch;
    char buf[BUFSIZ];

    nmatch = 0;
    while (fgets(buf, sizeof buf, f) != NULL) {
        n = strlen(buf);
        if (n > 0 && buf[n-1] == '\n')
            buf[n-1] = '\0';    /* strip the trailing newline */
        if (match(regexp, buf)) {
            nmatch++;
            if (name != NULL)
                printf("%s:", name);
            printf("%s\n", buf);
        }
    }
    return nmatch;
}
matter
introduce complexity and errors
performance
prepare a testing package
and +
minor additional functionality (likely for A2)
minor additional functionality
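For the testing package mentioned above, one possible starting point is a table-driven harness around match. The sketch below restates the lecture's matcher (with const added so it compiles on its own, and using the simple matchstar); the testcase struct, the specific cases, and run_tests are hypothetical and only cover the operators ^ $ . and *.

```c
#include <stdio.h>

int matchhere(const char *regexp, const char *text);

/* matchstar: search for c*regexp at beginning of text */
int matchstar(int c, const char *regexp, const char *text)
{
    do {    /* a * matches zero or more instances */
        if (matchhere(regexp, text))
            return 1;
    } while (*text != '\0' && (*text++ == c || c == '.'));
    return 0;
}

/* matchhere: search for regexp at beginning of text */
int matchhere(const char *regexp, const char *text)
{
    if (regexp[0] == '\0')
        return 1;
    if (regexp[1] == '*')
        return matchstar(regexp[0], regexp + 2, text);
    if (regexp[0] == '$' && regexp[1] == '\0')
        return *text == '\0';
    if (*text != '\0' && (regexp[0] == '.' || regexp[0] == *text))
        return matchhere(regexp + 1, text + 1);
    return 0;
}

/* match: search for regexp anywhere in text */
int match(const char *regexp, const char *text)
{
    if (regexp[0] == '^')
        return matchhere(regexp + 1, text);
    do {    /* must look even if string is empty */
        if (matchhere(regexp, text))
            return 1;
    } while (*text++ != '\0');
    return 0;
}

/* One test case: pattern, input line, expected result of match(). */
struct testcase {
    const char *regexp;
    const char *text;
    int expected;
};

/* Hypothetical cases; extend this table as operators are added. */
static const struct testcase tests[] = {
    { "abc",    "xxabcxx", 1 },
    { "^abc",   "xxabc",   0 },
    { "abc$",   "xxabc",   1 },
    { "a*b",    "b",       1 },
    { "a.c",    "abc",     1 },
    { "a.c",    "ac",      0 },
    { "^$",     "",        1 },
    { "x*",     "",        1 },
    { "^a.*z$", "abcz",    1 },
    { "^a.*z$", "abczy",   0 },
};

/* run_tests: run every case, report mismatches, return failure count */
int run_tests(void)
{
    size_t i, n = sizeof tests / sizeof tests[0];
    int failures = 0;

    for (i = 0; i < n; i++) {
        int got = match(tests[i].regexp, tests[i].text);
        if (got != tests[i].expected) {
            printf("FAIL: /%s/ on \"%s\": got %d, expected %d\n",
                   tests[i].regexp, tests[i].text,
                   got, tests[i].expected);
            failures++;
        }
    }
    return failures;
}
```

Calling run_tests() from a small main and using its return value as the exit status would let the cases run as part of a build; keeping the cases in a table means new operators only need new rows.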