Introduction to lex (or flex). Some slides borrowed from M Scherger.



SLIDE 1

Introduction to lex (or flex)

Some slides borrowed from M Scherger

SLIDE 2

Lex/Flex: A Scanner Generator in C

Fall 2012 Introduction to lex (or flex) 2

• Regular Expression
    → Nondeterministic Finite Automaton (Thompson's Construction)
    → Deterministic Finite Automaton ("Subset" Construction)
    → Table-driven Scanner
• So why not do this with a tool?
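To make the last stage of the pipeline concrete, here is a minimal hand-written sketch of a table-driven scanner core in C. It is not output from lex; the pattern [0-9]+(\.[0-9]+)?, the state names, and the function dfa_accepts are all assumptions chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Character classes: the columns of the transition table. */
enum { C_DIGIT, C_DOT, C_OTHER, NUM_CLASSES };

/* States: the rows of the transition table. */
enum { S_START, S_INT, S_DOT, S_FRAC, S_DEAD, NUM_STATES };

/* next_state[state][class] for a DFA recognizing digits with an
 * optional fractional part (hypothetical example pattern). */
static const int next_state[NUM_STATES][NUM_CLASSES] = {
    /* S_START */ { S_INT,  S_DEAD, S_DEAD },
    /* S_INT   */ { S_INT,  S_DOT,  S_DEAD },
    /* S_DOT   */ { S_FRAC, S_DEAD, S_DEAD },
    /* S_FRAC  */ { S_FRAC, S_DEAD, S_DEAD },
    /* S_DEAD  */ { S_DEAD, S_DEAD, S_DEAD },
};

/* Accepting states: after the integer part or after the fraction. */
static const int accepting[NUM_STATES] = { 0, 1, 0, 1, 0 };

/* Maps an input character to its column in the table. */
static int char_class(char c)
{
    if (c >= '0' && c <= '9') return C_DIGIT;
    if (c == '.') return C_DOT;
    return C_OTHER;
}

/* Returns 1 if the whole string s is accepted by the DFA, else 0. */
int dfa_accepts(const char *s)
{
    int state = S_START;

    assert(s != NULL);
    for (; *s != '\0'; s++)
        state = next_state[state][char_class(*s)];
    return accepting[state];
}
```

The matching loop never changes; only the tables depend on the regular expression, which is why a tool can generate them.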

SLIDE 3

Lex

• Lex is one such tool for creating lexical analyzers
    ◦ M. E. Lesk and E. Schmidt, 1975
• Lexical analyzers tokenize input streams
• Regular expressions define tokens
• Tokens are the terminals of a language
• Lex converts regular expressions into DFAs
• DFAs are implemented as table-driven state machines
• Some versions of Lex are proprietary, so not all versions of *nix come with an open source version
• flex (Fast Lexical Analyzer) is an open source version
    ◦ Vern Paxson

SLIDE 4

The Basic Process

Lex source program (any.l) → Lex compiler → lex.yy.c
lex.yy.c → C compiler → a.out
Input stream → a.out → Sequence of tokens

SLIDE 5

Format of a lex File

    Definitions
    %%
    Rules
    %%
    User code

• 1st section holds declarations of simple name definitions and start conditions
• 2nd section holds pattern-action pairs
• 3rd section is copied directly to lex.yy.c
    ◦ C code and comments
• Typical file extensions: .l .lex .flex

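SLIDE 6

Compiling and Running

    > flex linenos.flex
    > gcc lex.yy.c -lfl          (the generated file is named lexyy.c on some systems)
    > a.out < infile > outfile

yywrap() issue: the generated scanner calls yywrap() at end of input, and linking with -lfl supplies a default definition.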
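If the lex library is not available at link time, one common workaround is to define yywrap() yourself in the user code section. The following is a minimal sketch under that assumption, not the library's own source:

```c
/* Called by the generated scanner when it reaches end of input.
 * Returning 1 tells yylex() there is no more input, so it returns 0.
 * Returning 0 would mean yyin now points at another file to scan. */
int yywrap(void)
{
    return 1;
}
```

With flex, placing %option noyywrap in the definitions section is another way to avoid needing the function at all.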

SLIDE 7

Regular Expressions and Lex

• A regular expression is an expression that matches a set of strings (the "language" of the regular expression).
• In its basic form, a regular expression is built up out of basic expressions (individual symbols) and the operations
    ◦ choice (|),
    ◦ concatenation (no operator),
    ◦ and repetition (*).
• A regular expression may also contain certain other metasymbols:
    ◦ parentheses for grouping (to change precedence, just as in arithmetic)
    ◦ others as needed to extend the operator set in useful ways

SLIDE 8

Regular Expressions in Lex

• c - c is a single character
    ◦ Matches the character c
• \c - c is a single character
    ◦ Use this to escape special characters
• "str" - str is a string
    ◦ Matches the entire string str
• [str] - str is a string
    ◦ Matches any single character from str

    RE          Matches
    A           A
    x           x
    d           d
    \.          .
    \n          newline
    \t          tab
    "Abc"       Abc
    "The"       The
    [aeiou]     Lowercase vowels
    [abcde]     The letters a to e

SLIDE 9

Regular Expressions - Character Classes

• [x-y] - x and y are characters
    ◦ All characters in the range x-y
• These can be combined
• [^str] - str is a string
    ◦ Matches any single character not in str

    RE            Matches
    [a-z]         All lowercase letters
    [0-9]         All digits
    [a-df-z]      Lowercase letters except e
    [a-z0-9A-Z]   Alphanumeric characters
    [A-Zaeiou]    Uppercase letters and lowercase vowels
    [^ \n\t]      All non-whitespace characters
    [^aeiou]      Anything but lowercase vowels

SLIDE 10

Regular Expressions

• p* - p is a pattern
    ◦ Zero or more occurrences of p
• p+ - p is a pattern
    ◦ One or more occurrences of p

    RE       Matches
    A*       ε (empty string) A AA AAA ...
    r*       ε (empty string) r rr ...
    ab*c*    a ab ac abb abc acc abbb abbc abcc accc ...
    A+       A AA AAA AAAA ...
    ab+      ab abb abbb ...
    a*b+     b ab bb aab abb bbb ...

SLIDE 11

Regular Expressions

• p? - p is a pattern
    ◦ Zero or one occurrences of p
• p{m,n} - p is a pattern, m and n are ints
    ◦ Matches m through n occurrences of p
    ◦ If ",n" is omitted (p{m}), n = m; if only n is omitted (p{m,}), n = ∞

    RE        Matches
    A?        ε (empty string) A
    ab?c?     a ab ac abc
    a{1,3}    a aa aaa
    a{1,1}    a
    a{1}      a
    a{3,}     aaa aaaa aaaaa …

SLIDE 12

Regular Expressions

• p1p2 - p1 and p2 are patterns
    ◦ Matches p1 followed by p2
• (p) - p is a pattern
    ◦ Used to override precedence (group things)
• p1|p2 - p1 and p2 are patterns
    ◦ Matches either p1 or p2
• Notice precedence

    RE          Matches
    ab          ab
    a+b+        ab aab abb …
    (abc)+      abc abcabc abcabcabc …
    abc+        abc abcc abccc …
    a|an|the    a an the
    ba|ed       ba ed
    b(a|e)d     bed bad

SLIDE 13

Regular Expression - Extra Things

• p1/p2 - p1 and p2 are patterns
    ◦ Matches p1 only if it's followed by p2
    ◦ p2 is not part of yytext

    RE:    a+/bc
    Input: aaabc bc aaaad
    Matches the first aaa only (the bc that follows stays in the input).

• ^p - p is a pattern
    ◦ Matches p only if it is at the start of a line
• p$ - p is a pattern
    ◦ Matches p only if it is at the end of a line

SLIDE 14

Two more complex examples

• [-+]?[0-9]+(\.[0-9]+)?([Ee][-+]?[0-9]+)?
    ◦ or:
        nat = [0-9]+
        signedNat = [-+]? nat
        number = signedNat(\. nat)?([Ee] signedNat)?
• C comments
        /\*/*(\**[^/*]/*)*\**\*/
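One way to try the number pattern outside of lex is with the POSIX regex API. The sketch below is an assumption-laden illustration: it uses regcomp() and regexec() from <regex.h> rather than a lex-generated scanner, adds ^ and $ anchors so the whole string must match, and the name is_number is made up for this example.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* The number pattern from the slide, anchored at both ends. */
static const char *NUMBER_RE =
    "^[-+]?[0-9]+(\\.[0-9]+)?([Ee][-+]?[0-9]+)?$";

/* Returns 1 if s is a complete number according to NUMBER_RE, else 0. */
int is_number(const char *s)
{
    regex_t re;
    int matched;

    assert(s != NULL);
    if (regcomp(&re, NUMBER_RE, REG_EXTENDED | REG_NOSUB) != 0)
        return 0;                      /* pattern failed to compile */
    matched = (regexec(&re, s, 0, NULL, 0) == 0);
    regfree(&re);
    return matched;
}
```

Strings such as 42, -3.14 and 6.02E+23 fit the pattern, while 1. and .5 do not, because both the integer part and any fractional part require at least one digit.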

SLIDE 15

Pattern Matching Examples

SLIDE 16

Format of a lex File

    Definitions
    %%
    Rules
    %%
    User code

• 1st section holds declarations of simple name definitions and start conditions
• 2nd section holds pattern-action pairs
• 3rd section is copied directly to lex.yy.c
    ◦ C code and comments

SLIDE 17

Definitions

• Definitions are of the form:

    name definition

• A name begins with a letter or underscore followed by 0 or more letters, digits, '-', or '_'.
• You access it with {name}
• Example definitions:

    Digit         [0-9]
    Char          [A-Z]
    AlphaNum      [a-zA-Z0-9]
    ws            [ \n\t]
    IntegerConst  [0-9]+

SLIDE 18

Definitions Example

    Digit     [0-9]
    Char      [a-zA-Z]
    AlphaNum  [a-zA-Z0-9]
    %%
    {Digit}+"."{Digit}+             ;
    ({Char}|_)({AlphaNum}|[_-])*    {printf("A name '%s'\n", yytext);}
    %%

SLIDE 19

Rules

• Rules are of the form:

    pattern action

• pattern is the RE to match and action is what to do when it is matched
• Default rule is to echo the input
• Lex matches the longest string possible
• If there is a tie, it matches the 1st rule in the spec
• Actions can be empty - do nothing
• Actions can be complex
• Use {} if multi-lined
    ◦ don't forget ';'s
• yytext contains the string matched

SLIDE 20

Example Rules

    \n          linecount++;
    [0-9]+      sum += atoi(yytext);
    {ws}+       /* empty action: whitespace is discarded */
    a|an|the    printf("found an article\n");
    [aeiou]+    { printf("A string of vowels\n");
                  vcnt++; }

SLIDE 21

Predefined Rules

• ECHO
    ◦ Copy yytext to output

    [a-z]+   ECHO;

• REJECT
    ◦ Go to the next alternative; that is, the second-choice rule is selected and its action taken

    she   s++;
    he    h++;

    ◦ Won't count the embedded he

    she   {s++; REJECT;}
    he    {h++; REJECT;}
    \n

    ◦ But this will

SLIDE 22

Rules Example

ex1.l:

    %%
    a*b   printf("Token 1 found\n");
    c+    printf("Token 2 found\n");
    %%
    int main() { yylex(); return 0; }

The commands:

• lex ex1.l
    ◦ produces lex.yy.c
• cc -o ex1 lex.yy.c -ll
    ◦ creates the executable
    ◦ May need -lfl if using flex
• ./ex1
    ◦ to execute

Default is stdin and stdout, so type aaaaaaabbccd <return>:

    aaaaaaabbccd
    Token 1 found
    Token 1 found
    Token 2 found
    d
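To see why that input produces two "Token 1" lines, one "Token 2" line, and an echoed d, here is a hand-written C sketch of the matching policy: take the longest match, prefer the earlier rule on a tie, and echo one character when nothing matches. It only imitates the two rules of ex1.l; the function names and the output buffer interface are assumptions, not anything lex generates.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Length of the longest prefix of s matching a*b, or 0 if none. */
static size_t match_a_star_b(const char *s)
{
    size_t n = 0;
    while (s[n] == 'a') n++;
    return s[n] == 'b' ? n + 1 : 0;
}

/* Length of the longest prefix of s matching c+, or 0 if none. */
static size_t match_c_plus(const char *s)
{
    size_t n = 0;
    while (s[n] == 'c') n++;
    return n;
}

/* Scans input the way the two ex1.l rules would, appending the
 * resulting text to out. Returns the number of tokens found. */
int scan_ex1(const char *input, char *out, size_t out_size)
{
    size_t pos = 0, used = 0;
    int tokens = 0;

    assert(input != NULL && out != NULL && out_size > 0);
    out[0] = '\0';
    while (input[pos] != '\0' && used + 1 < out_size) {
        size_t len1 = match_a_star_b(input + pos);   /* rule 1 */
        size_t len2 = match_c_plus(input + pos);     /* rule 2 */
        int n;

        if (len1 > 0 && len1 >= len2) {   /* longest wins; tie goes to rule 1 */
            n = snprintf(out + used, out_size - used, "Token 1 found\n");
            pos += len1;
            tokens++;
        } else if (len2 > 0) {
            n = snprintf(out + used, out_size - used, "Token 2 found\n");
            pos += len2;
            tokens++;
        } else {                          /* default rule: echo one character */
            n = snprintf(out + used, out_size - used, "%c", input[pos]);
            pos += 1;
        }
        if (n < 0 || (size_t)n >= out_size - used)
            break;                        /* output buffer is full */
        used += (size_t)n;
    }
    return tokens;
}
```

On aaaaaaabbccd the first rule consumes aaaaaaab, then matches the lone b with zero a's, the second rule consumes cc, and d falls through to the default echo.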

SLIDE 23

An Example: Count chars, words, lines

    %{
    unsigned ccnt = 0, wcnt = 0, lcnt = 0;
    %}
    word  [^ \t\n]+
    eol   \n
    %%
    {word}  {wcnt++; ccnt += yyleng;}
    {eol}   {ccnt++; lcnt++;}
    .       ccnt++;
    %%
    int main() { yylex(); return 0; }

The %{ %} pair allows you to make declarations for your lexer

SLIDE 24

About lex

• Lex uses some predefined functions stored in the lex library (link with -ll or -lfl)
• By default the generated scanner copies unmatched input to output
• By default it reads stdin and writes stdout
• Lex reads its input (a lex script) and produces lex.yy.c
• Use %{ and %} in the definitions section to declare globals and put #includes
• You can use flex instead
• Not all 'lex'es are equal!
• Man page has more info!

SLIDE 25

Example 1: The Simplest Example

• The simplest example of a lex program is a scanner that acts like the UNIX `cat` program

    %%
    .|\n   ECHO;
    %%

• Or it could be written as…

    %%
    .    ECHO;
    \n   ECHO;
    %%

SLIDE 26

Lex Predefined Variables

SLIDE 27

Flex Internal Names

    Lex internal name      Meaning/Use
    lex.yy.c or lexyy.c    Lex output file name
    yylex                  Lex scanning routine
    yytext                 string matched on current action
    yyleng                 length of yytext
    yyin                   Lex input file (default: stdin)
    yyout                  Lex output file (default: stdout)
    input                  Lex buffered input routine
    ECHO                   Lex default action (print yytext to yyout)

See the Flex documentation for others

SLIDE 28

Flex Operational Conventions

• yylex() runs until it is stopped by a return
• ambiguity between rules that match the same length is resolved by rule order
• any text not explicitly matched is echoed to stdout
• EOF is automatically matched and returns 0 from yylex() (unless yywrap() is suitably defined)
• yylex() returns an int which can be a token

SLIDE 29

Example 2: wc

• Here is a scanner that is similar to the UNIX `wc` command

    %{
    #include <stdio.h>
    unsigned charCount = 0, wordCount = 0, lineCount = 0;
    %}
    %%
    [^ \t\n]+  { wordCount++; charCount += yyleng; }
    \n         { charCount++; lineCount++; }
    .          charCount++;
    %%
    int main() {
        yylex();
        printf("%u %u %u\n", charCount, wordCount, lineCount);
        return 0;
    }
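For comparison, the same counting can be sketched by hand in C. This is an illustration, not generated code: it assumes a word is a maximal run of characters other than space, tab, and newline (the pattern [^ \t\n]+ used in the other wc examples), and the struct counts type and count_text function are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

struct counts { unsigned chars, words, lines; };

/* Counts characters, words, and newlines in text, with one branch
 * per rule of the lex specification. */
struct counts count_text(const char *text)
{
    struct counts c = { 0, 0, 0 };
    size_t i = 0;

    assert(text != NULL);
    while (text[i] != '\0') {
        if (text[i] == '\n') {                           /* rule: \n */
            c.chars++;
            c.lines++;
            i++;
        } else if (text[i] == ' ' || text[i] == '\t') {  /* rule: .  */
            c.chars++;
            i++;
        } else {                                         /* rule: word */
            size_t start = i;
            while (text[i] != '\0' && text[i] != ' ' &&
                   text[i] != '\t' && text[i] != '\n')
                i++;
            c.words++;
            c.chars += (unsigned)(i - start);
        }
    }
    return c;
}
```

The lex version is shorter because the longest-match loop and the input handling come with the generated scanner.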

SLIDE 30

Example 3: Line Numbers (p. 84)

    %{
    /* a Lex program that adds line numbers
       to lines of stdin, printing to stdout */
    #include <stdio.h>
    int lineno = 1;
    %}
    line .*\n
    %%
    {line}  { printf("%5d %s", lineno++, yytext); }
    %%
    int main() { yylex(); return 0; }

SLIDE 31

Example 4: (pp. 86-87)

    %{
    /* Selects only lines that end or begin with the letter 'a'. */
    #include <stdio.h>
    %}
    ends_with_a    .*a\n
    begins_with_a  a.*\n
    %%
    {ends_with_a}    ECHO;
    {begins_with_a}  ECHO;
    .*\n             ;
    %%
    int main() { yylex(); return 0; }

SLIDE 32

Example 5: wc again!

    %{
    #include <stdio.h>
    #include <stdlib.h>
    unsigned charCount = 0, wordCount = 0, lineCount = 0;
    %}
    word  [^ \t\n]+
    eol   \n
    %%
    {word}  { wordCount++; charCount += yyleng; }
    {eol}   { charCount++; lineCount++; }
    .       charCount++;

SLIDE 33

Example 5: wc again! (cont.)

    %%
    int main(int argc, char *argv[]) {
        if (argc > 1) {
            FILE *file;
            file = fopen(argv[1], "r");
            if (!file) {
                fprintf(stderr, "could not open %s\n", argv[1]);
                exit(1);
            }
            yyin = file;    /* scan the named file instead of stdin */
        }
        yylex();
        printf("%u %u %u\n", charCount, wordCount, lineCount);
        return 0;
    }

SLIDE 34

Example 6: html (not in book)

    %{
    /* a Lex program that produces html, making all C comments italic */
    #include <stdio.h>
    %}
    %%
    "/*"  { printf("<i><font color=\"blue\">/*"); }
    "*/"  { printf("*/</font></i>"); }
    \n    { printf("<br>\n"); }
    %%
    int main() {
        printf("<html><tt><b>\n");
        yylex();
        printf("</b></tt></html>");
        return 0;
    }

SLIDE 35

Example 7: A Scanner to Recognize Specific Tokens

    %{
    /*
     * We expand upon the first example by adding
     * recognition of some other parts of speech.
     */
    %}

SLIDE 36

Example 7: A Scanner to Recognize Specific Tokens (cont.)

    %%
    [\t ]+    /* ignore white space */ ;

    is |
    am |
    are |
    were |
    was |
    be |
    being |
    been |
    do |
    does |
    did |
    will |
    would |
    should |
    can |
    could |
    has |
    have |
    had |
    go        { printf("%s: is a verb\n", yytext); }

SLIDE 37

Example 7: A Scanner to Recognize Specific Tokens (cont.)

    very |
    simply |
    gently |
    quietly |
    calmly |
    angrily   { printf("%s: is an adverb\n", yytext); }

    to |
    from |
    behind |
    above |
    below |
    between   { printf("%s: is a preposition\n", yytext); }

SLIDE 38

Example 7: A Scanner to Recognize Specific Tokens (cont.)

    if |
    then |
    and |
    but |
    or        { printf("%s: is a conjunction\n", yytext); }

    their |
    my |
    your |
    his |
    her |
    its       { printf("%s: is an adjective\n", yytext); }

SLIDE 39

Example 7: A Scanner to Recognize Specific Tokens (cont.)

    I |
    you |
    he |
    she |
    we |
    they      { printf("%s: is a pronoun\n", yytext); }

    [a-zA-Z]+ { printf("%s: don't recognize, might be a noun\n", yytext); }

    .|\n      { ECHO; /* normal default anyway */ }
    %%
    int main() { yylex(); return 0; }

SLIDE 40

But What About Those Pesky C Comments?

• Match with \/\*\/*(\**[^/*]\/*)*\**\*\/
• Or with "/*""/"*("*"*[^/*]"/"*)*"*"*"*/"
• But what if we want to process stuff inside a comment (like \n, for example)?
    ◦ Do it by hand matching (Ex 2.23, pp. 87-88 and tiny.l)
    ◦ Use a new feature of flex that allows explicit state management

SLIDE 41

Final Example (flex documentation)

    %x comment
    %%
            int line_num = 1;

    "/*"                    BEGIN(comment);

    <comment>[^*\n]*        /* eat anything that's not a '*' */
    <comment>"*"+[^*/\n]*   /* eat up '*'s not followed by '/'s */
    <comment>\n             ++line_num;
    <comment>"*"+"/"        BEGIN(INITIAL);
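The explicit state management above can be pictured as an ordinary state variable. The following C sketch is a hand-written analogue, not what flex generates: it switches between two states on /* and */ and counts newlines only while inside a comment. The enum and the function comment_newlines are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

enum scan_state { INITIAL_STATE, COMMENT_STATE };

/* Returns the number of newlines that occur inside C comments in src. */
int comment_newlines(const char *src)
{
    enum scan_state state = INITIAL_STATE;
    int newlines = 0;
    size_t i = 0;

    assert(src != NULL);
    while (src[i] != '\0') {
        if (state == INITIAL_STATE) {
            if (src[i] == '/' && src[i + 1] == '*') {
                state = COMMENT_STATE;      /* like BEGIN(comment) */
                i += 2;
            } else {
                i++;                        /* outside a comment: skip */
            }
        } else {
            if (src[i] == '*' && src[i + 1] == '/') {
                state = INITIAL_STATE;      /* like BEGIN(INITIAL) */
                i += 2;
            } else {
                if (src[i] == '\n')
                    newlines++;             /* like ++line_num */
                i++;
            }
        }
    }
    return newlines;
}
```

In the flex version the state test disappears from the code: rules prefixed with <comment> are simply inactive unless that start condition is in effect.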

SLIDE 42

Beware

• '\.' - matches '.' (tick period tick)
• '.' - matches 'x' for any character x (tick anything tick)
• "." - matches a period