Introduction to lex (or flex)
Some slides borrowed from M Scherger
Lex/Flex: A Scanner Generator in C
Fall 2012 Introduction to lex (or flex) 2
Regular Expression
-> Nondeterministic Finite Automaton (Thompson's Construction)
-> Deterministic Finite Automaton ("Subset" Construction)
-> Table-driven Scanner
So why not do this with a tool?
Lex
Lex is one such tool for creating lexical analyzers
M. E. Lesk and E. Schmidt, 1975
Lexical analyzers tokenize input streams
Regular expressions define tokens
Tokens are the terminals of a language
Lex converts regular expressions into DFAs
DFAs are implemented as table-driven state machines
Some versions of Lex are proprietary, so not all versions of *nix come with an open source version
flex (Fast Lexical Analyzer) is an open source version
Vern Paxson
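To make "DFAs are implemented as table-driven state machines" concrete, here is a minimal hand-written sketch in C. It is not what Lex actually generates; the pattern `[0-9]+(\.[0-9]+)?`, the state names and the function `dfa_accepts` are all assumptions chosen for illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical DFA for the pattern [0-9]+(\.[0-9]+)? */
enum { S_START, S_INT, S_DOT, S_FRAC, S_DEAD, N_STATES };
enum { C_DIGIT, C_DOT, C_OTHER, N_CLASSES };

/* Transition table: next_state[current state][character class] */
static const int next_state[N_STATES][N_CLASSES] = {
    /* S_START */ { S_INT,  S_DEAD, S_DEAD },
    /* S_INT   */ { S_INT,  S_DOT,  S_DEAD },
    /* S_DOT   */ { S_FRAC, S_DEAD, S_DEAD },
    /* S_FRAC  */ { S_FRAC, S_DEAD, S_DEAD },
    /* S_DEAD  */ { S_DEAD, S_DEAD, S_DEAD },
};

/* Only S_INT and S_FRAC are accepting states */
static const bool accepting[N_STATES] = { false, true, false, true, false };

static int char_class(char c) {
    if (isdigit((unsigned char)c)) return C_DIGIT;
    if (c == '.') return C_DOT;
    return C_OTHER;
}

/* Returns true if the whole string s is in the language of the pattern. */
bool dfa_accepts(const char *s) {
    assert(s != NULL);
    int state = S_START;
    for (; *s != '\0'; s++)
        state = next_state[state][char_class(*s)];
    return accepting[state];
}
```

The scanning loop never changes; only the tables depend on the regular expression, which is why a tool can produce them mechanically.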
The Basic Process
Lex source program (any.l) -> Lex compiler -> lex.yy.c
lex.yy.c -> C compiler -> a.out
Input stream -> a.out -> Sequence of tokens
Format of a lex File
Definitions
%%
Rules
%%
User code

1st section holds declarations of simple name definitions and start conditions
2nd section holds pattern-action pairs
3rd section is copied directly to lex.yy.c (C code and comments)
Typical file extensions: .l .lex .flex
Compiling and Running
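> flex linenos.flex
> gcc lexyy.c -lfl
> a.out < infile > outfile

The generated file is named lex.yy.c on most Unix systems (lexyy.c on some others)
yywrap() issue: linking with -lfl supplies a default yywrap(); without the library, define yywrap() yourself or use %option noyywrap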
Regular Expressions and Lex
A regular expression is an expression that matches a set of strings (the "language" of the regular expression).
In its basic form, a regular expression is built up out of basic expressions (individual symbols) and the operations choice (|), concatenation (no operator), and repetition (*).
A regular expression may also contain certain other metasymbols:
parentheses for grouping (to change precedence, just as in arithmetic)
others as needed to extend the operator set in useful ways
Regular Expressions in Lex
c - c is a single character
Matches the character c
\c - c is a single character
Use this to escape special characters
"str" - str is a string
Matches entire string str
[str] - str is a string
Matches any single character from str

RE        Matches
A         A
x         x
d         d
\.        .
\n        Newline
\t        Tab
"Abc"     Abc
"The"     The
[aeiou]   Lowercase vowels
[abcde]   The letters a to e
Regular Expressions – Character Classes
[x-y] - x and y are characters
All characters in the range x-y
These can be combined
[^str] - str is a string
Matches any single character not in str

RE            Matches
[a-z]         All lowercase characters
[0-9]         All digits
[a-df-z]      Lowercase characters except e
[a-z0-9A-Z]   Alphanumeric characters
[A-Zaeiou]    Upper case letters and lowercase vowels
[^ \n\t]      All non-whitespace characters
[^aeiou]      Anything but lowercase vowels
Regular Expressions
p* - p is a pattern
Zero or more occurrences of p
p+ - p is a pattern
One or more occurrences of p

RE      Matches
A*      (empty) A AA AAA ...
r*      (empty) r rr ...
ab*c*   a ab ac abb abc acc abbb abbc abcc accc ...
A+      A AA AAA AAAA ...
ab+     ab abb abbb ...
a*b+    b ab bb aab abb bbb ...
Regular Expressions
p? - p is a pattern
Zero or one occurrences of p
p{m,n} - p is a pattern, m and n are ints
Matches m through n occurrences of p
If ",n" is missing, n = m; if just n is missing, n = ∞

RE       Matches
A?       (empty) A
ab?c?    a ab ac abc
a{1,3}   a aa aaa
a{1,1}   a
a{1}     a
a{3,}    aaa aaaa aaaaa ...
Regular Expressions
p1p2 - p1 and p2 are patterns
Matches p1 followed by p2
(p) - p is a pattern
Used to override precedence (group things)
p1|p2 - p1 and p2 are patterns
Matches either p1 or p2
Notice precedence

RE         Matches
ab         ab
a+b+       ab aab abb ...
(abc)+     abc abcabc abcabcabc ...
abc+       abc abcc abccc ...
a|an|the   a an the
ba|ed      ba ed
b(a|e)d    bed bad
Regular Expression - Extra Things
p1/p2 - p1 and p2 are patterns
Matches p1 only if it's followed by p2
p2 is not part of yytext
RE: a+/bc
Input: aaabc bc aaaad
Matches the first aaa only
^p - p is a pattern
Matches p only if it is at the start of a line
p$ - p is a pattern
Matches p only if it is at the end of a line
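The trailing-context operator p1/p2 can be pictured as "match p1, peek ahead for p2, but report only p1". The sketch below hand-codes the `a+/bc` example in C; the function name `match_a_plus_before_bc` is hypothetical, and a real Lex implementation does this inside its generated automaton, not with code like this.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hand-written analogue of the lex pattern a+/bc.
 * Returns the length of the token (the run of a's only) when the run is
 * immediately followed by "bc"; returns 0 when the pattern does not match. */
size_t match_a_plus_before_bc(const char *s) {
    assert(s != NULL);
    size_t n = 0;
    while (s[n] == 'a')                 /* a+ : consume the run of a's */
        n++;
    if (n == 0)
        return 0;                       /* need at least one a */
    if (strncmp(s + n, "bc", 2) != 0)
        return 0;                       /* lookahead "bc" is required */
    return n;                           /* "bc" is not part of the token */
}
```

For the slide's inputs this gives 3 for `aaabc` and 0 for both `bc` and `aaaad`; the `bc` remains in the input to be scanned as part of the next token.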
Two more complex examples
[-+]?[0-9]+(\.[0-9]+)?([Ee][-+]?[0-9]+)?
or:
nat = [0-9]+
signedNat = [-+]? nat
number = signedNat(\. nat)?([Ee] signedNat)?
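To see what the number pattern accepts, it can help to walk it by hand. The C sketch below follows the nat / signedNat / number decomposition step by step; the helper names `scan_digits` and `match_number` are assumptions made for this illustration and are not part of Lex.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* nat = [0-9]+ : returns how many leading digits s has (0 if none) */
static size_t scan_digits(const char *s) {
    size_t n = 0;
    while (isdigit((unsigned char)s[n]))
        n++;
    return n;
}

/* Returns the length of the longest prefix of s that matches
 * [-+]?[0-9]+(\.[0-9]+)?([Ee][-+]?[0-9]+)?  or 0 if there is no match. */
size_t match_number(const char *s) {
    assert(s != NULL);
    size_t i = 0;
    if (s[i] == '+' || s[i] == '-')     /* [-+]? */
        i++;
    size_t d = scan_digits(s + i);      /* [0-9]+ is mandatory */
    if (d == 0)
        return 0;
    i += d;
    if (s[i] == '.') {                  /* (\.[0-9]+)? */
        size_t f = scan_digits(s + i + 1);
        if (f > 0)
            i += 1 + f;
    }
    if (s[i] == 'E' || s[i] == 'e') {   /* ([Ee][-+]?[0-9]+)? */
        size_t j = i + 1;
        if (s[j] == '+' || s[j] == '-')
            j++;
        size_t e = scan_digits(s + j);
        if (e > 0)
            i = j + e;
    }
    return i;
}
```

Note that the optional parts are only consumed when they match completely: `1.` yields a match of length 1, leaving the dot for the next token.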
C comments
/\*/*(\**[^/*]/*)*\**\*/
Pattern Matching Examples
Definitions
Definitions are of the form:
name definition
A name begins with a letter or underscore followed by 0 or more letters, digits, '-', or '_'.
You access it with {name}
Example definitions:
Digit          [0-9]
Char           [A-Z]
AlphaNum       [a-zA-Z0-9]
ws             [ \n\t]
IntegerConst   [0-9]+
Definitions Example
Digit      [0-9]
Char       [a-zA-Z]
AlphaNum   [a-zA-Z0-9]
%%
{Digit}+"."{Digit}+
({Char}|_)({AlphaNum}|[_-])*   {printf("A name '%s'\n", yytext);}
%%
Rules
Rules are of the form:
pattern action
pattern is the RE to match and action is what to do when it is matched
Default rule is to echo the input
Lex matches the longest string possible
If there is a tie, it matches the 1st rule in the spec
Actions can be empty (do nothing)
Actions can be complex
Use {} if multi-lined
Don't forget ';'s
yytext contains the string matched
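The two disambiguation rules above (longest match first, then earliest rule) can be sketched as a loop over candidate matchers. This is only an illustration in plain C: the two "rules" (a keyword `if` and an identifier `[a-z]+`) and the names `match_kw_if`, `match_ident` and `pick_rule` are hypothetical, and real Lex resolves this inside a single DFA instead of trying rules one by one.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

typedef size_t (*matcher_fn)(const char *);

/* Rule 0: the keyword "if" */
static size_t match_kw_if(const char *s) {
    return strncmp(s, "if", 2) == 0 ? 2 : 0;
}

/* Rule 1: an identifier, [a-z]+ */
static size_t match_ident(const char *s) {
    size_t n = 0;
    while (islower((unsigned char)s[n]))
        n++;
    return n;
}

/* Rules in the order they would appear in the spec */
static const matcher_fn rules[] = { match_kw_if, match_ident };

/* Returns the index of the winning rule (or -1 if nothing matches)
 * and stores the length of the match in *len. */
int pick_rule(const char *s, size_t *len) {
    assert(s != NULL && len != NULL);
    int best = -1;
    *len = 0;
    for (int i = 0; i < (int)(sizeof rules / sizeof rules[0]); i++) {
        size_t n = rules[i](s);
        if (n > *len) {     /* strictly longer only, so the earlier rule wins ties */
            *len = n;
            best = i;
        }
    }
    return best;
}
```

On the input `if` both rules match two characters and rule 0 wins the tie; on `iffy` the identifier rule wins because its match is longer.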
Example Rules
\n         linecount++;
[0-9]+     sum+=atoi(yytext);
{ws}+
a|an|the   printf("found an article\n");
[aeiou]+   { printf("A string of vowels\n"); vcnt++; }
Predefined Rules
ECHO
Copy yytext to output
[a-z]+   ECHO;
REJECT
Go to the next alternative: the second-choice rule is selected and its action taken
she   s++;
he    h++;
Won't count the embedded he
But this will:
she   {s++; REJECT;}
he    {h++; REJECT;}
\n
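The effect of the REJECT version is that every occurrence of "he" is counted, including the one inside "she". The C sketch below computes the same kind of overlapping counts by checking both words at every position; it does not reproduce how REJECT works internally, and the function name `count_she_he` is an assumption for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Counts every occurrence of "she" and of "he" in text,
 * including the "he" embedded in each "she". */
void count_she_he(const char *text, int *she, int *he) {
    assert(text != NULL && she != NULL && he != NULL);
    *she = 0;
    *he = 0;
    for (const char *p = text; *p != '\0'; p++) {
        if (strncmp(p, "she", 3) == 0)
            (*she)++;
        if (strncmp(p, "he", 2) == 0)
            (*he)++;
    }
}
```

For the text `she said he` this gives one "she" and two "he"; the rules without REJECT would report only one "he", because the characters of "she" are consumed by the first rule.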
Rules Example
ex1.l
%%
a*b   printf("Token 1 found\n");
c+    printf("Token 2 found\n");
%%
int main() { yylex(); return 0; }

The commands
lex ex1.l
produces lex.yy.c
cc -o ex1 lex.yy.c -ll
create executable
May need -lfl if using flex
./ex1
to execute

Default is stdin and stdout so type aaaaaaabbccd <return>
aaaaaaabbccd
Token 1 found
Token 1 found
Token 2 found
d
An Example Count chars, words, lines
%{
unsigned ccnt = 0, wcnt = 0, lcnt = 0;
%}
word   [^ \t\n]+
eol    \n
%%
{word}   { wcnt++; ccnt += yyleng; }
{eol}    { ccnt++; lcnt++; }
.        ccnt++;
%%
int main() { yylex(); return 0; }

The %{ %} pair allows you to make declarations for your lexer
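For comparison, here is roughly what the same counting looks like when written by hand in C, without Lex. It is a sketch under the same definition of a word (a maximal run of characters other than space, tab and newline); the struct `counts` and the function `count_text` are names invented for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    unsigned ccnt;  /* characters */
    unsigned wcnt;  /* words      */
    unsigned lcnt;  /* lines      */
} counts;

/* Counts characters, words and lines in the NUL-terminated string s. */
counts count_text(const char *s) {
    assert(s != NULL);
    counts c = { 0, 0, 0 };
    bool in_word = false;
    for (; *s != '\0'; s++) {
        c.ccnt++;
        if (*s == '\n')
            c.lcnt++;
        bool ws = (*s == ' ' || *s == '\t' || *s == '\n');
        if (!ws && !in_word)
            c.wcnt++;       /* first character of a new word */
        in_word = !ws;
    }
    return c;
}
```

The hand-written version needs an explicit `in_word` flag to find word boundaries; in the Lex spec the longest-match rule on `{word}` does that work.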
About lex
Lex uses some predefined functions stored in the lex library (link with -ll or -lfl)
By default lex copies input to output
By default lex reads stdin, writes stdout
Lex reads its input (a lex script) and produces lex.yy.c
Use %{ and %} in the definitions section to declare globals and put #includes
You can use flex instead
Not all 'lex'es are equal!
Man page has more info!
Example 1: The Simplest Example
The simplest example of a lex program is a scanner that acts like the UNIX `cat` program
%%
.|\n   ECHO;
%%
Or it could be written as...
%%
.    ECHO;
\n   ECHO;
%%
Lex Predefined Variables
Flex Internal Names
Lex internal name     Meaning/Use
lex.yy.c or lexyy.c   Lex output file name
yylex                 Lex scanning routine
yytext                String matched on current action
yyleng                Length of yytext
yyin                  Lex input file (default: stdin)
yyout                 Lex output file (default: stdout)
input                 Lex buffered input routine
ECHO                  Lex default action (print yytext to yyout)
See the Flex documentation for others
Flex Operational Conventions
yylex() runs until it is stopped by a return
Ambiguity is resolved by longest match, then by rule order
Any text not explicitly matched is echoed to stdout
EOF is automatically matched and returns 0 from yylex() (unless yywrap() is suitably defined)
yylex() returns an int which can be a token
Example 2: wc
Here is a scanner that is similar to the UNIX `wc` command
%{
#include <stdio.h>
unsigned charCount = 0, wordCount = 0, lineCount = 0;
%}
%%
[^ \t\n]+   { wordCount++; charCount += yyleng; }
\n          { charCount++; lineCount++; }
.           charCount++;
%%
int main() {
  yylex();
  printf("%u %u %u\n", charCount, wordCount, lineCount);
  return 0;
}
Example 3: Line Numbers (p. 84)
%{
/* a Lex program that adds line numbers
   to lines of stdin, printing to stdout */
#include <stdio.h>
int lineno = 1;
%}
line   .*\n
%%
{line}   { printf("%5d %s", lineno++, yytext); }
%%
int main() { yylex(); return 0; }
Example 4: (pp. 86-87)
%{
/* Selects only lines that end or begin with the letter 'a'. */
#include <stdio.h>
%}
ends_with_a     .*a\n
begins_with_a   a.*\n
%%
{ends_with_a}     ECHO;
{begins_with_a}   ECHO;
.*\n              ;
%%
int main() { yylex(); return 0; }
Example 5: wc again!
%{
#include <stdio.h>
#include <stdlib.h>
unsigned charCount = 0, wordCount = 0, lineCount = 0;
%}
word   [^ \t\n]+
eol    \n
%%
{word}   { wordCount++; charCount += yyleng; }
{eol}    { charCount++; lineCount++; }
.        charCount++;
Example 5: wc again! (cont.)
%%
int main(int argc, char *argv[]) {
  if (argc > 1) {
    FILE *file;
    file = fopen(argv[1], "r");
    if (!file) {
      fprintf(stderr, "could not open %s\n", argv[1]);
      exit(1);
    }
    yyin = file;   /* scan the named file instead of stdin */
  }
  yylex();
  printf("%u %u %u\n", charCount, wordCount, lineCount);
  return 0;
}
Example 6: html (not in book)
%{
/* a Lex program that produces html,
   making all C comments italic */
#include <stdio.h>
%}
%%
"/*"   { printf("<i><font color=\"blue\">/*"); }
"*/"   { printf("*/</font></i>"); }
\n     { printf("<br>\n"); }
%%
int main() {
  printf("<html><tt><b>\n");
  yylex();
  printf("</b></tt></html>");
  return 0;
}
Example 7: A Scanner to Recognize Specific Tokens
%{
/*
 * We expand upon the first example by adding
 * recognition of some other parts of speech.
 */
%}
Example 7: A Scanner to Recognize Specific Tokens (cont.)
%%
[\t ]+   /* ignore white space */ ;
is |
am |
are |
were |
was |
be |
being |
been |
do |
does |
did |
will |
would |
should |
can |
could |
has |
have |
had |
go   { printf("%s: is a verb\n", yytext); }
Example 7: A Scanner to Recognize Specific Tokens (cont.)
very |
simply |
gently |
quietly |
calmly |
angrily   { printf("%s: is an adverb\n", yytext); }
to |
from |
behind |
above |
below |
between   { printf("%s: is a preposition\n", yytext); }
Example 7: A Scanner to Recognize Specific Tokens (cont.)
if |
then |
and |
but |
or   { printf("%s: is a conjunction\n", yytext); }
their |
my |
your |
his |
her |
its   { printf("%s: is an adjective\n", yytext); }
Example 7: A Scanner to Recognize Specific Tokens (cont.)
I |
you |
he |
she |
we |
they   { printf("%s: is a pronoun\n", yytext); }
[a-zA-Z]+   { printf("%s: don't recognize, might be a noun\n", yytext); }
.|\n   { ECHO; /* normal default anyway */ }
%%
int main() { yylex(); return 0; }
But What About Those Pesky C Comments?
Match with \/\*\/*(\**[^/*]\/*)*\**\*\/
Or with "/*""/"*("*"*[^/*]"/"*)*"*"*"*/"
But what if we want to process stuff inside a comment (like \n, for example)?
Do it by hand matching (Ex 2.23, pp. 87-88 and tiny.l)
Use a new feature of flex that allows explicit state management (start conditions)
Final Example (flex documentation)
%x comment
%%
        int line_num = 1;

"/*"                    BEGIN(comment);

<comment>[^*\n]*        /* eat anything that's not a '*' */
<comment>"*"+[^*/\n]*   /* eat up '*'s not followed by '/'s */
<comment>\n             ++line_num;
<comment>"*"+"/"        BEGIN(INITIAL);
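The start-condition example above amounts to a two-state machine: outside a comment and inside one. A hand-written C analogue might look like the sketch below. It is an illustration, not the code flex generates; the names `scan_state`, `INITIAL_STATE`, `COMMENT_STATE` and `scan_comments` are assumptions. Like the excerpt, it only counts newlines that occur inside comments.

```c
#include <assert.h>
#include <stddef.h>

enum scan_state { INITIAL_STATE, COMMENT_STATE };

/* Scans s and returns line_num, which starts at 1 and is incremented
 * for every newline found inside a C comment. */
int scan_comments(const char *s) {
    assert(s != NULL);
    enum scan_state state = INITIAL_STATE;
    int line_num = 1;
    while (*s != '\0') {
        if (state == INITIAL_STATE) {
            if (s[0] == '/' && s[1] == '*') {   /* "/*" enters the comment state */
                state = COMMENT_STATE;
                s += 2;
            } else {
                s++;
            }
        } else {
            if (s[0] == '*' && s[1] == '/') {   /* "*" followed by "/" leaves it */
                state = INITIAL_STATE;
                s += 2;
            } else {
                if (*s == '\n')
                    ++line_num;
                s++;
            }
        }
    }
    return line_num;
}
```

The explicit `state` variable plays the role of the exclusive start condition: the comment-only checks are simply never tried while the scanner is in the initial state.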
Beware