Lesson 2 Lexical Analysis, CS 226/326, Spring 2003

SLIDE 1

CS 226/326 Spring 2003

Lesson 2 Lexical Analysis

SLIDE 2

Lexical Analysis

  • Transform source program (a sequence of characters) into a sequence of tokens.
  • Lexical structure is specified using regular expressions
  • Secondary tasks
      1. discard white space and comments
      2. record positional attributes (e.g. char positions, line numbers)

[Figure: source program → lexical analyzer → parser → parse tree; the parser sends "get token" requests to the lexical analyzer, which returns a token each time]

SLIDE 3

Example Program

let function g(a:int) = a in g(2,"str") end

A sample source program in Tiger

What are the tokens?

LET FUNCTION ID "g" LPAREN ID "a" COLON ID "int" RPAREN EQ ID "a" IN ID "g" LPAREN INT "2" COMMA STRING "str" RPAREN END
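The token sequence for the sample program can be reproduced with a small hand-written scanner. The following is a minimal sketch in Python, not the course's ML-Lex lexer: the names `TOKEN_RE`, `tokenize`, and `show` are hypothetical, and the rules cover only the small subset of Tiger that the example uses (no string escapes, no comments).

```python
import re

KEYWORDS = {"let": "LET", "function": "FUNCTION", "in": "IN", "end": "END"}
PUNCTUATION = {"(": "LPAREN", ")": "RPAREN", ":": "COLON",
               ",": "COMMA", "=": "EQ", "+": "PLUS"}

# One alternative per token class (an assumed subset of Tiger's lexical rules).
TOKEN_RE = re.compile(r'''
    (?P<ws>\s+)                       # white space is discarded
  | (?P<word>[A-Za-z][A-Za-z0-9_]*)   # keyword or identifier
  | (?P<int>[0-9]+)                   # integer literal
  | (?P<string>"[^"\n]*")             # string literal without escapes
  | (?P<punct>[():,=+])               # punctuation and operators
''', re.VERBOSE)


def tokenize(source):
    """Return a list of (token_class, text, start_position) triples."""
    pos, result = 0, []
    while pos < len(source):
        m = TOKEN_RE.match(source, pos)
        if m is None:
            raise ValueError(f"illegal character at position {pos}")
        kind, text = m.lastgroup, m.group()
        if kind == "word":
            # A word is a keyword if it is in the table, otherwise an ID.
            result.append((KEYWORDS.get(text, "ID"), text, pos))
        elif kind == "int":
            result.append(("INT", text, pos))
        elif kind == "string":
            result.append(("STRING", text[1:-1], pos))  # drop the quotes
        elif kind == "punct":
            result.append((PUNCTUATION[text], text, pos))
        pos = m.end()  # white space falls through and is skipped
    return result


def show(tokens):
    """Render tokens in the style used on the slide."""
    valued = {"ID", "INT", "STRING"}
    return " ".join(f'{k} "{t}"' if k in valued else k for k, t, _ in tokens)


print(show(tokenize('let function g(a:int) = a in g(2,"str") end')))
```

Each triple also carries the start position of the token, which corresponds to the secondary task of recording positional attributes.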

SLIDE 4

Tokens

Token     Text         Description
LET       let          keyword LET
END       end          keyword END
PLUS      +            arithmetic operator
LPAREN    (            punctuation
COLON     :            punctuation
STRING    "str"        string
RPAREN    )            punctuation
INT       46           integer literal
ID        g, a, int    variables, types
EQ        =
EOF                    end of file

SLIDE 5

Strings

  • Alphabet: Σ - a set of basic characters or symbols
      - finite or infinite, but we will only be concerned with finite Σ
      - e.g. printable ASCII characters
  • Strings: Σ∗ - finite sequences of symbols from Σ
      - e.g. ε (the empty string), abc, *?x_2
  • Language: L ⊆ Σ∗ - a set of strings
      - e.g. L = {ε, a, aa, aaa, ...}
  • Concatenation: s ⋅ t - concatenation of strings s and t
      - e.g. abc ⋅ xy = abcxy
      - 〈Σ∗, ⋅, ε〉 is a monoid (a semigroup with identity element ε)
  • Product of languages: L1 ⋅ L2 = { s ⋅ t | s ∈ L1 and t ∈ L2 }
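For finite languages, these definitions can be tried out directly. The sketch below models strings as Python `str` values and languages as Python sets; `EPSILON` and `product` are hypothetical names chosen for the illustration.

```python
EPSILON = ""  # the empty string ε


def product(l1, l2):
    """Product of languages: every s in l1 concatenated with every t in l2."""
    return {s + t for s in l1 for t in l2}


L1 = {"a", "ab"}
L2 = {EPSILON, "b"}

print("abc" + "xy")                   # concatenation of two strings
print(sorted(product(L1, L2)))        # product of two finite languages
print(product(L1, {EPSILON}) == L1)   # {ε} acts as an identity for the product
```

Note that "a" ⋅ "b" and "ab" ⋅ ε both yield "ab", so the product here has three elements rather than four.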
SLIDE 6

Regular Expressions

Regular expressions are a small language for describing languages (i.e. subsets of Σ∗). Regular expressions are defined by the following grammar:

M ::=
  a           - a single symbol (a ∈ Σ)
  M1 | M2     - alternation
  M1 ⋅ M2     - concatenation (also M1M2)
  ε           - epsilon
  M∗          - repetition (0 or more times)

Examples:

  (a ⋅ b) | ε
  (0 | 1)∗ ⋅ 0
  b∗(abb∗)∗(a|ε)

SLIDE 7

Regular Expressions

The previous forms of regular expressions are adequate, but for convenience we add some redundant forms that could be defined in terms of the basic ones.

M ::= ...
  M+          - repetition (1 or more times)
  M?          - 0 or 1 occurrence of M
  [a-z]       - ranges of characters (alternation)
  .           - any character other than newline (\n)
  "abc"       - literal sequence of characters

Defs:

  M+ = M ⋅ M∗
  M? = M | ε
  [a-z] = (a | b | c | ... | z)
  "abc" = a ⋅ b ⋅ c
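The definitions of the derived forms can be spot-checked by brute force. This sketch uses Python's `re` module (whose syntax differs slightly from ML-Lex, for example it has no quoted literals) and compares each derived form with its expansion on every string up to a small length. The helper names `strings_up_to` and `same_language` are hypothetical, and agreement on short strings is evidence rather than proof.

```python
import re
from itertools import product


def strings_up_to(alphabet, max_len):
    """Yield every string over `alphabet` with length 0..max_len."""
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            yield "".join(chars)


def same_language(pattern_a, pattern_b, alphabet="abc", max_len=4):
    """True if both patterns accept exactly the same strings up to max_len."""
    return all(
        bool(re.fullmatch(pattern_a, s)) == bool(re.fullmatch(pattern_b, s))
        for s in strings_up_to(alphabet, max_len)
    )


# (derived form, expansion into the basic forms)
DERIVED = [
    ("(ab)+", "(ab)(ab)*"),   # M+ = M M*
    ("(ab)?", "(ab|)"),       # M? = M | ε  (empty alternative stands for ε)
    ("[a-c]", "(a|b|c)"),     # a range is an alternation of its members
]

for derived, basic in DERIVED:
    print(derived, basic, same_language(derived, basic))
```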

SLIDE 8

Meaning of Regular Expressions

The meaning of regular expressions is given by a function L from regular expressions (r.e.'s) to languages (subsets of Σ∗). L is defined by the equations:

  L(a) = {a}
  L(M1 | M2) = L(M1) ∪ L(M2)
  L(M1 ⋅ M2) = L(M1) ⋅ L(M2)
  L(ε) = {ε}
  L(M∗) = {ε} ∪ (L(M) ⋅ L(M∗))

Examples:

  L((a ⋅ b) | ε) = {ε, ab}
  L((0 | 1)∗ ⋅ 0) = even binary numbers
  L(b∗(abb∗)∗(a|ε)) = strings of a, b with no consecutive a's
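The equations for L translate almost line by line into a program, provided the result is cut off at a maximum string length (L(M∗) is infinite in general). The sketch below is one possible encoding, assumed for illustration: regular expressions are nested tuples, and `lang(m, max_len)` returns the strings of L(m) of length at most `max_len`. It checks the three examples, writing the second one as (0 | 1)∗ ⋅ 0.

```python
from itertools import product

# Hypothetical encoding of regular expressions as nested tuples:
#   ("sym", "a")  ("eps",)  ("alt", M1, M2)  ("cat", M1, M2)  ("star", M)


def lang(m, max_len):
    """Return the strings of L(m) whose length is at most max_len."""
    kind = m[0]
    if kind == "sym":                       # L(a) = {a}
        return {m[1]} if max_len >= 1 else set()
    if kind == "eps":                       # L(ε) = {ε}
        return {""}
    if kind == "alt":                       # union of the two languages
        return lang(m[1], max_len) | lang(m[2], max_len)
    if kind == "cat":                       # product of the two languages
        return {s + t
                for s in lang(m[1], max_len)
                for t in lang(m[2], max_len)
                if len(s) + len(t) <= max_len}
    if kind == "star":                      # least fixed point of the M* equation
        base = lang(m[1], max_len)
        result, frontier = {""}, {""}
        while frontier:
            frontier = {s + t for s in frontier for t in base
                        if len(s) + len(t) <= max_len} - result
            result |= frontier
        return result
    raise ValueError(f"unknown form: {kind}")


def all_strings(alphabet, max_len):
    """Every string over `alphabet` with length 0..max_len."""
    return {"".join(p) for n in range(max_len + 1)
            for p in product(alphabet, repeat=n)}


A, B = ("sym", "a"), ("sym", "b")
ZERO, ONE = ("sym", "0"), ("sym", "1")

AB_OR_EPS = ("alt", ("cat", A, B), ("eps",))
EVEN = ("cat", ("star", ("alt", ZERO, ONE)), ZERO)
NO_AA = ("cat",
         ("cat", ("star", B), ("star", ("cat", ("cat", A, B), ("star", B)))),
         ("alt", A, ("eps",)))

print(lang(AB_OR_EPS, 4) == {"", "ab"})
print(lang(EVEN, 4) == {s for s in all_strings("01", 4) if s.endswith("0")})
print(lang(NO_AA, 6) == {s for s in all_strings("ab", 6) if "aa" not in s})
```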

SLIDE 9

Using R.E.s to Define Tokens

Regular expressions are used to define token classes in a specification of lexical structure:

  if                                     (IF)                   - if keyword
  [a-z][a-z0-9]*                         (ID(str))              - identifier
  [0-9]+                                 (NUM(str))             - integer const
  ([0-9]+"."[0-9]*)|([0-9]*"."[0-9]+)    (REAL(str))            - real const
  ("--"[a-z]*"\n")                       (continue())           - comment
  (" "|"\t"|"\n")+                       (continue())           - white space
  .                                      (error();continue())   - error

Patterns are matched "top-down", and the longest match is preferred.

SLIDE 10

Choosing among Multiple Matches

  if                 (IF)         - if keyword
  [a-z][a-z0-9]*     (ID(str))    - identifier

Consider the string "if8". The initial segment "if" matches the first r.e., while the whole string matches the second r.e. In this case we choose the longest possible match, recognizing the string as an identifier.

Consider "if 8". Both the first and second r.e.'s match the initial segment "if", and no r.e. matches the entire string (or "if " for that matter). In this case we choose the first matching r.e. and recognize the if keyword.

Summary: the longest match is preferred, and ties are resolved in favor of the earliest matching r.e.
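The two disambiguation rules can be sketched as a loop that tries every rule at the current position and keeps the longest match, breaking ties by rule order. This is an illustrative Python model of the rule table, not ML-Lex itself: `RULES`, `DISCARDED`, and `tokenize` are hypothetical names, and the error rule is modeled by emitting an `ERROR` token instead of calling an error function.

```python
import re

# Rules in priority order, mirroring the token specification.
RULES = [
    ("IF",      re.compile(r"if")),
    ("ID",      re.compile(r"[a-z][a-z0-9]*")),
    ("NUM",     re.compile(r"[0-9]+")),
    ("REAL",    re.compile(r"[0-9]+\.[0-9]*|[0-9]*\.[0-9]+")),
    ("COMMENT", re.compile(r"--[a-z]*\n")),
    ("WS",      re.compile(r"[ \t\n]+")),
    ("ERROR",   re.compile(r".")),
]
DISCARDED = {"COMMENT", "WS"}  # these rules "continue()" without a token


def tokenize(text):
    """Return (token_class, lexeme) pairs using longest match, then rule order."""
    pos, tokens = 0, []
    while pos < len(text):
        best_kind, best_end = None, pos
        for kind, pattern in RULES:
            m = pattern.match(text, pos)
            # Strict '>' keeps the earlier rule when two matches tie in length.
            if m and m.end() > best_end:
                best_kind, best_end = kind, m.end()
        if best_kind is None:
            raise ValueError(f"no rule matches at position {pos}")
        if best_kind not in DISCARDED:
            tokens.append((best_kind, text[pos:best_end]))
        pos = best_end
    return tokens


print(tokenize("if8"))    # the longer ID match beats the IF match
print(tokenize("if 8"))   # IF and ID tie on "if", so the earlier rule wins
```

Trying each rule separately is simple but slow; the finite-automaton construction in the following slides achieves the same result with a single scan.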

SLIDE 11

Homework Assignment 1

  • 1. Program 1 (p. 10)

file: prog1.sml

  • 2. Exercise 1.1(a,b,c) (p. 12)

file: ex1_1.sml

SLIDE 12

Finite State Machines

The r.e. recognition problem: for an r.e. M we want to build a machine that scans a string and tells us whether it belongs to L(M). Alternatively, in lexical analysis we want to scan a string and find a (longest) initial segment of the string that belongs to L(M).

r.e.
  ⇒ nondeterministic finite automaton (NFA)
  ⇒ deterministic finite automaton (DFA)
  ⇒ optimization/simplification of the DFA
  ⇒ transition table + matching engine
  ⇒ code for a lexical analyzer

SLIDE 13

Finite State Machines

A finite state machine (finite automaton or FA) over alphabet Σ is a quadruple M = 〈S, T, i, F〉 where

  S = a finite set of states (usually represented by numbers)
  T = a transition relation: T ⊆ S × Σ × S
  i = an initial state: i ∈ S
  F = a set of final states: F ⊆ S

Graphical representations:

[Figure: drawing conventions for a state m ∈ S, a transition 〈m,a,n〉 ∈ T (an edge labeled a from m to n), the initial state i ∈ S, and a final state f ∈ F]

SLIDE 14

Deterministic and Nondeterministic FA

A finite automaton M = 〈S, T, i, F〉 is deterministic (a DFA) if for each m ∈ S and a ∈ Σ there is at most one n ∈ S such that 〈m,a,n〉 ∈ T.

Graphically, in a DFA we don't have any situations of the form:

[Figure: a state with two outgoing transitions labeled with the same symbol a]

If an FA is not deterministic, it is a nondeterministic FA (an NFA). Nondeterministic automata are also formed by introducing ε-transitions: silent transitions that can be taken without consuming an input symbol.

[Figure: an ε-transition between two states]
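The quadruple and the determinism condition can be written down directly. In this sketch (an assumed representation, not taken from the slides) the transition relation is a set of triples, and an ε-transition is encoded with the label `None`.

```python
from typing import NamedTuple


class FA(NamedTuple):
    """A finite automaton 〈S, T, i, F〉."""
    states: frozenset
    transitions: frozenset   # triples (m, a, n); a is None for an ε-transition
    initial: int
    finals: frozenset


def is_deterministic(fa):
    """True if no state has an ε-transition or two targets for one symbol."""
    seen = set()
    for m, a, n in fa.transitions:
        if a is None:          # ε-transitions make the automaton an NFA
            return False
        if (m, a) in seen:     # a second, different target for (m, a)
            return False
        seen.add((m, a))
    return True


# DFA for the keyword "if": 1 --i--> 2 --f--> 3
IF_DFA = FA(frozenset({1, 2, 3}),
            frozenset({(1, "i", 2), (2, "f", 3)}),
            1, frozenset({3}))

# Two a-transitions leave state 1, so this automaton is nondeterministic.
FORK = FA(frozenset({1, 2, 3}),
          frozenset({(1, "a", 2), (1, "a", 3)}),
          1, frozenset({2}))

print(is_deterministic(IF_DFA), is_deterministic(FORK))
```

Because the relation is a set, the same triple cannot occur twice, so seeing a `(m, a)` pair a second time always means a different target state.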

SLIDE 15

DFAs for Token Classes

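if (IF)
[Figure: DFA with states 1, 2, 3; an edge labeled i from 1 to 2 and an edge labeled f from 2 to 3]

[a-z][a-z0-9]* (ID(str))
[Figure: DFA with states 1, 2; an edge labeled a-z from 1 to 2 and a loop on 2 labeled a-z, 0-9]

[0-9]+ (NUM(str))
[Figure: DFA with states 1, 2; an edge labeled 0-9 from 1 to 2 and a loop on 2 labeled 0-9]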
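A DFA like these becomes a "transition table + matching engine" once the edges are stored in a table. The sketch below is a hedged illustration: the dictionaries `ID_DFA` and `NUM_DFA`, the helper `char_class`, and the choice of state 2 as the final state are assumptions consistent with the regular expressions, and `longest_prefix` reports the length of the longest initial segment the DFA accepts.

```python
import string


def char_class(c):
    """Map a character to the edge label used in the tables (or None)."""
    if c in string.ascii_lowercase:
        return "a-z"
    if c in string.digits:
        return "0-9"
    return None


# [a-z][a-z0-9]*
ID_DFA = {"start": 1, "finals": {2},
          "delta": {(1, "a-z"): 2, (2, "a-z"): 2, (2, "0-9"): 2}}

# [0-9]+
NUM_DFA = {"start": 1, "finals": {2},
           "delta": {(1, "0-9"): 2, (2, "0-9"): 2}}


def longest_prefix(dfa, text):
    """Length of the longest accepted prefix of text, or None if there is none."""
    state = dfa["start"]
    best = 0 if state in dfa["finals"] else None
    for i, c in enumerate(text):
        state = dfa["delta"].get((state, char_class(c)))
        if state is None:          # no transition: the DFA is stuck
            break
        if state in dfa["finals"]:
            best = i + 1           # remember the last final state seen
    return best


print(longest_prefix(ID_DFA, "if8 x"))
print(longest_prefix(NUM_DFA, "42abc"))
print(longest_prefix(ID_DFA, "8if"))
```

Remembering the last final state reached, rather than stopping at the first one, is what yields the longest match.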

SLIDE 16

DFAs for Token Classes

([0-9]+"."[0-9]*)|([0-9]*"."[0-9]+) (REAL(str))
[Figure: DFA for real constants, with edges labeled 0-9 and .]

("--"[a-z]*"\n") (continue()) -- comment
[Figure: DFA for comments, with edges labeled -, a-z, and \n]
SLIDE 17

DFAs for Token Classes

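(" "|"\t"|"\n")+ (continue()) -- white space
[Figure: DFA with states 1, 2; an edge labeled ws from 1 to 2 and a loop on 2 labeled ws, where ws is (" "|"\t"|"\n")]

. (error();continue()) -- error
[Figure: DFA with states 1, 2; an edge labeled "any but \n" from 1 to 2]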

SLIDE 18

Combined DFA

[Figure: the combined DFA, with states numbered 1 to 13; final states are labeled with the token classes ID, IF, NUM, REAL, ws, comment, and error, and edges with character sets such as i, f, a-h, j-z, a-e, g-z, 0-9, ., \n, and other]

SLIDE 19

R.E. to NFA

[Figure: the NFA fragment for each form of r.e.: a, ε, M | N, M ⋅ N, and M∗; the fragments for M and N are connected with ε-transitions]

SLIDE 20

RE to NFA Example

b∗(abb∗)∗(a|ε)

[Figure: the NFA constructed for this r.e., with edges labeled a, b, and ε]
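The translation from an r.e. to an NFA can be sketched with a Thompson-style construction, in which every form of r.e. becomes a fragment with one start state and one accept state, glued together with ε-transitions. The exact fragments in the slide figures cannot be recovered from this transcript, so the shapes below are the textbook ones, and the tuple encoding and function names are assumptions. The example builds the NFA for b∗(abb∗)∗(a|ε) and simulates it by tracking the set of reachable states.

```python
from itertools import count, product


def build(m, edges, fresh):
    """Append an NFA fragment for r.e. m to edges; return (start, accept).

    Edges are triples (source, label, target); label None means ε.
    """
    kind = m[0]
    s, f = next(fresh), next(fresh)
    if kind == "sym":
        edges.append((s, m[1], f))
    elif kind == "eps":
        edges.append((s, None, f))
    elif kind == "alt":                 # branch into either sub-fragment
        for sub in m[1:]:
            a, b = build(sub, edges, fresh)
            edges += [(s, None, a), (b, None, f)]
    elif kind == "cat":                 # run the fragments one after the other
        a1, b1 = build(m[1], edges, fresh)
        a2, b2 = build(m[2], edges, fresh)
        edges += [(s, None, a1), (b1, None, a2), (b2, None, f)]
    elif kind == "star":                # skip the fragment or loop through it
        a, b = build(m[1], edges, fresh)
        edges += [(s, None, a), (b, None, a), (s, None, f), (b, None, f)]
    else:
        raise ValueError(f"unknown form: {kind}")
    return s, f


def closure(states, edges):
    """All states reachable from `states` using only ε-transitions."""
    stack, seen = list(states), set(states)
    while stack:
        q = stack.pop()
        for p, label, r in edges:
            if p == q and label is None and r not in seen:
                seen.add(r)
                stack.append(r)
    return seen


def accepts(m, text):
    """Build the NFA for m and test whether it accepts the whole of text."""
    edges = []
    start, accept = build(m, edges, count(1))
    current = closure({start}, edges)
    for c in text:
        moved = {r for p, label, r in edges if p in current and label == c}
        current = closure(moved, edges)
    return accept in current


A, B = ("sym", "a"), ("sym", "b")
NO_AA = ("cat",
         ("cat", ("star", B), ("star", ("cat", ("cat", A, B), ("star", B)))),
         ("alt", A, ("eps",)))

print(accepts(NO_AA, "babba"), accepts(NO_AA, "baab"))
```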

SLIDE 21

NFA to DFA

[Figure: an NFA with states 1 to 7; edges are labeled ε, x, y, and z]

SLIDE 22

NFA to DFA

[Figure: the same NFA; the construction starts from NFA state 1]

SLIDE 23

NFA to DFA

[Figure: the same NFA; the first DFA state is the ε-closure of 1, the set {1, 2, 3, 4}]

SLIDE 24

NFA to DFA

[Figure: the same NFA; from {1, 2, 3, 4} an x transition reaches NFA state 5]

SLIDE 25

NFA to DFA

[Figure: the same NFA; the x transition leads to the ε-closure of 5, the DFA state {5, 6, 7}]

SLIDE 26

NFA to DFA

[Figure: the same NFA; a y transition leads to the ε-closure of 6, the DFA state {6, 7}]

SLIDE 27

NFA to DFA

[Figure: the same NFA; the DFA states {1, 2, 3, 4}, {5, 6, 7}, and {6, 7} with transitions labeled x, y, and z]

SLIDE 28

NFA to DFA

[Figure: the same NFA; the resulting DFA with its states renumbered 1, 2, 3 and transitions labeled x, y, and z]
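The steps animated in these slides (take the ε-closure of the start state, follow each symbol, take the ε-closure of the result, and repeat until no new sets appear) are the subset construction. The sketch below applies it to a small hypothetical ε-NFA for x?y∗z, not the seven-state NFA of the slides, whose edges cannot be recovered from this transcript. All names are chosen for the illustration.

```python
from collections import deque

# Hypothetical ε-NFA: (state, label) -> set of targets; label None means ε.
NFA = {
    (1, None): {2, 4},
    (2, "x"): {3},
    (3, None): {4},
    (4, "y"): {4},
    (4, "z"): {5},
}
START, FINALS, ALPHABET = 1, {5}, "xyz"


def eps_closure(states, nfa):
    """All NFA states reachable from `states` using only ε-transitions."""
    stack, closure = list(states), set(states)
    while stack:
        q = stack.pop()
        for r in nfa.get((q, None), ()):
            if r not in closure:
                closure.add(r)
                stack.append(r)
    return frozenset(closure)


def subset_construction(nfa, start, finals, alphabet):
    """Return (start, delta, finals) of a DFA whose states are sets of NFA states."""
    d_start = eps_closure({start}, nfa)
    delta, seen, work = {}, {d_start}, deque([d_start])
    while work:
        current = work.popleft()
        for a in alphabet:
            moved = {r for q in current for r in nfa.get((q, a), ())}
            if not moved:
                continue                    # no transition on a: leave it out
            target = eps_closure(moved, nfa)
            delta[(current, a)] = target
            if target not in seen:          # a DFA state not met before
                seen.add(target)
                work.append(target)
    d_finals = {s for s in seen if s & finals}
    return d_start, delta, d_finals


def dfa_accepts(dfa, text):
    """Run the DFA on text; a missing transition means rejection."""
    state, delta, finals = dfa
    for c in text:
        state = delta.get((state, c))
        if state is None:
            return False
    return state in finals


DFA = subset_construction(NFA, START, FINALS, ALPHABET)
print(sorted(DFA[0]))                 # ε-closure of the NFA start state
print(dfa_accepts(DFA, "xyyz"), dfa_accepts(DFA, "xx"))
```

Each DFA state is a frozen set of NFA states, so the sets can serve as dictionary keys; renumbering them 1, 2, 3, ... is a separate cosmetic step.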

SLIDE 29

ML-Lex

foo.lex (lexer specification) → ML-Lex → foo.lex.sml (SML code for lexer)

Specification for token values has to be supplied externally, usually in the form of a Tokens module that defines a token type and a set of functions for building tokens of various classes.
SLIDE 30

An ML-Lex specification

ML Declarations:

type lexresult = Tokens.token
fun eof() = Tokens.EOF(0,0)
%%

Lex definitions:

digits=[0-9]+;
%%

Regular Expressions and Actions:

if                    => (Tokens.IF(yypos,yypos+2));
[a-z][a-z0-9]*        => (Tokens.ID(yytext,yypos,yypos+size yytext));
{digits}              => (Tokens.NUM(valOf(Int.fromString yytext),
                                     yypos,yypos+size yytext));
({digits}"."[0-9]*)|([0-9]*"."{digits})
                      => (Tokens.REAL(valOf(Real.fromString yytext),
                                      yypos,yypos+size yytext));
("--"[a-z]*"\n")      => (continue());
(" "|"\n"|"\t")       => (continue());
.                     => (ErrorMsg.error yypos "illegal character";
                          continue());

Int.fromString and Real.fromString return option values, so valOf is applied to obtain the int and real that the NUM and REAL constructors expect.

SLIDE 31

Variables Defined by ML-Lex

ML-Lex defines several variables:

  lex()         recursively call the lexer
  continue()    same, but with %arg
  yytext        the string matched by the current r.e.
  yypos         character position at start of current r.e. match
  yylineno      line number at start of match (if command %count given)

SLIDE 32

Defining Tokens

(* ML declaration of a Tokens module (called a structure in ML): *)

structure Tokens =
struct
  type pos = int
  datatype token
    = EOF of pos * pos
    | IF of pos * pos
    | ID of string * pos * pos
    | NUM of int * pos * pos
    | REAL of real * pos * pos
    ...
end (* structure Tokens *)

SLIDE 33

Start States

Several different lexing automata can be set up using start states. Additional start states are commonly used for handling comments and strings.

ML decls...
%%
Lex decls...
%s COMMENT
%%
<INITIAL>if        => (Tokens.IF(yypos,yypos+2));
<INITIAL>[a-z]+    => (Tokens.ID(yytext,yypos,yypos+size yytext));
<INITIAL>"(*"      => (YYBEGIN COMMENT; continue());
<COMMENT>"*)"      => (YYBEGIN INITIAL; continue());
<COMMENT>.         => (continue());
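The effect of start states can be modeled as a lexer that keeps a current state and only tries the rules registered for that state. The following Python sketch mirrors the five rules above. It is an illustration rather than a description of how ML-Lex is implemented: the action names are hypothetical, and two assumptions are added so that ordinary input can be scanned, namely a white-space rule in INITIAL and a COMMENT rule that also skips newlines.

```python
import re

RULES = {
    "INITIAL": [
        (re.compile(r"if"), "IF"),
        (re.compile(r"[a-z]+"), "ID"),
        (re.compile(r"\(\*"), "BEGIN_COMMENT"),
        (re.compile(r"[ \t\n]+"), "SKIP"),       # assumption: not in the spec
    ],
    "COMMENT": [
        (re.compile(r"\*\)"), "END_COMMENT"),
        (re.compile(r".", re.DOTALL), "SKIP"),   # assumption: "." also skips \n
    ],
}


def tokenize(text):
    """Return (token_class, lexeme, position) triples, skipping comments."""
    state, pos, tokens = "INITIAL", 0, []
    while pos < len(text):
        best = None
        for pattern, action in RULES[state]:     # only rules of the current state
            m = pattern.match(text, pos)
            if m and (best is None or m.end() > best[0].end()):
                best = (m, action)               # longest match, earliest rule
        if best is None:
            raise ValueError(f"illegal character at position {pos}")
        m, action = best
        if action == "BEGIN_COMMENT":
            state = "COMMENT"                    # plays the role of YYBEGIN COMMENT
        elif action == "END_COMMENT":
            state = "INITIAL"                    # plays the role of YYBEGIN INITIAL
        elif action in ("IF", "ID"):
            tokens.append((action, m.group(), pos))
        pos = m.end()
    return tokens


print(tokenize("if (* if x *) foo"))
```

The `if` inside the comment produces no token because the IF rule is not active in the COMMENT state.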