Lesson 2: Lexical Analysis (CS 226/326, Spring 2003)

Lexical Analysis
• Transform the source program (a sequence of characters) into a sequence of tokens.
  [Diagram: source program → lexical analyzer → token → parser → parse tree; the parser requests each token with "get token"]
• Lexical structure is specified using regular expressions.
• Secondary tasks:
  1. discard white space and comments
  2. record positional attributes (e.g. char positions, line numbers)

Example Program
A sample source program in Tiger:
    let function g(a:int) = a in g(2,"str") end
What are the tokens?
    LET FUNCTION ID "g" LPAREN ID "a" COLON ID "int" RPAREN EQ ID "a"
    IN ID "g" LPAREN INT "2" COMMA STRING "str" RPAREN END

Tokens
    Text         Description            Token
    let          keyword LET            LET
    end          keyword END            END
    +            arithmetic operator    PLUS
    (            punctuation            LPAREN
    :            punctuation            COLON
    "str"        string                 STRING
    )            punctuation            RPAREN
    46           integer literal        INT
    g, a, int    variables, types       ID
    =                                   EQ
                 end of file            EOF

Strings
• Alphabet: Σ - a set of basic characters or symbols
  • finite or infinite, but we will only be concerned with finite Σ
  • e.g. printable ASCII characters
• Strings: Σ∗ - finite sequences of symbols from Σ
  • e.g. ε (the empty string), abc, *?x_2
• Language: L ⊆ Σ∗ - a set of strings
  • e.g. L = {ε, a, aa, aaa, ...}
• Concatenation: s ⋅ t - concatenation of strings s and t
  • e.g. abc ⋅ xy = abcxy
  • 〈Σ∗, ⋅, ε〉 is a monoid (a semigroup with identity element ε)
• Product of languages: L₁ ⋅ L₂ = { s ⋅ t | s ∈ L₁ and t ∈ L₂ }
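The definitions of concatenation and product can be tried out with ordinary Python strings and sets. This is only a small sketch: the helper name product and the constant EPSILON are made up for illustration, and languages are restricted to finite sets.

```python
EPSILON = ""  # the empty string plays the role of epsilon


def product(l1, l2):
    """Product of two finite languages: every s . t with s in l1 and t in l2."""
    return {s + t for s in l1 for t in l2}


# Concatenation of strings: abc . xy = abcxy
assert "abc" + "xy" == "abcxy"

# epsilon is the identity element for concatenation
assert EPSILON + "abc" == "abc" + EPSILON == "abc"

l1 = {"a", "ab"}
l2 = {EPSILON, "b"}
print(sorted(product(l1, l2)))  # ['a', 'ab', 'abb']
```

Note that a . b and ab . epsilon both yield ab, so the product has three elements rather than four.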

Regular Expressions
Regular expressions are a small language for describing languages (i.e. subsets of Σ∗). Regular expressions are defined by the following grammar:
    M ::= a            -- a single symbol (a ∈ Σ)
        | M₁ | M₂      -- alternation
        | M₁ ⋅ M₂      -- concatenation (also written M₁ M₂)
        | ε            -- epsilon
        | M∗           -- repetition (0 or more times)
Examples:
    (a ⋅ b) | ε
    (0 | 1)∗ ⋅ 0
    b∗(abb∗)∗(a | ε)

Regular Expressions
The previous forms of regular expressions are adequate, but for convenience we add some redundant forms that could be defined in terms of the basic ones.
    M ::= ...
        | M+           -- repetition (1 or more times)
        | M?           -- 0 or 1 occurrence of M
        | [a-z]        -- ranges of characters (alternation)
        | .            -- any character other than newline (\n)
        | "abc"        -- literal sequence of characters
Defs:
    M+     = M M∗
    M?     = M | ε
    [a-z]  = (a | b | c | ... | z)
    "abc"  = a ⋅ b ⋅ c

Meaning of Regular Expressions
The meaning of regular expressions is given by a function L from regular expressions (r.e.'s) to languages (subsets of Σ∗). L is defined by the equations:
    L(a)        = {a}
    L(M₁ | M₂)  = L(M₁) ∪ L(M₂)
    L(M₁ ⋅ M₂)  = L(M₁) ⋅ L(M₂)
    L(ε)        = {ε}
    L(M∗)       = {ε} ∪ (L(M) ⋅ L(M∗))
Examples:
    L((a ⋅ b) | ε)        = {ε, ab}
    L((0 | 1)∗ ⋅ 0)       = even binary numbers
    L(b∗(abb∗)∗(a | ε))   = strings of a, b with no consecutive a's
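The three example languages can be checked by brute force over all short strings. The sketch below uses Python's re module as a stand-in for the function L, reading the second example as (0 | 1)∗ ⋅ 0. The helper names strings_up_to and in_language are invented for this illustration, and testing strings up to a fixed length is evidence, not a proof.

```python
import re
from itertools import product


def strings_up_to(alphabet, max_len):
    """Yield every string over `alphabet` of length 0..max_len."""
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            yield "".join(chars)


def in_language(pattern, s):
    """True if the whole string s belongs to L(pattern)."""
    return re.fullmatch(pattern, s) is not None


# L((a.b) | epsilon) = {epsilon, ab}; an empty alternative stands for epsilon
assert [s for s in strings_up_to("ab", 3) if in_language(r"(ab|)", s)] == ["", "ab"]

# L((0|1)*.0) = even binary numbers (non-empty strings ending in 0)
for s in strings_up_to("01", 6):
    if s:
        assert in_language(r"(0|1)*0", s) == (int(s, 2) % 2 == 0)

# L(b*(abb*)*(a | epsilon)) = strings of a, b with no consecutive a's
for s in strings_up_to("ab", 8):
    assert in_language(r"b*(abb*)*(a|)", s) == ("aa" not in s)
```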

Using R.E.s to Define Tokens
Regular expressions are used to define token classes in a specification of lexical structure:
    if                                     (IF)                    -- if keyword
    [a-z][a-z0-9]*                         (ID(str))               -- identifier
    [0-9]+                                 (NUM(str))              -- integer const
    ([0-9]+"."[0-9]*)|([0-9]*"."[0-9]+)    (REAL(str))             -- real const
    ("--"[a-z]*"\n")                       (continue())            -- comment
    (" "|"\t"|"\n")+                       (continue())            -- white space
    .                                      (error();continue())    -- error
Patterns are matched "top-down", and the longest match is preferred.

Choosing among Multiple Matches
    if                (IF)         -- if keyword
    [a-z][a-z0-9]*    (ID(str))    -- identifier
Consider the string "if8". The initial segment "if" matches the first r.e., while the whole string matches the second r.e. In this case we choose the longest possible match, recognizing the string as an identifier.
Consider "if 8". Both the first and the second r.e. match the initial segment "if", and no r.e. matches the entire string (or "if " including the space, for that matter). In this case we choose the first matching r.e. and recognize the if keyword.
Summary: the longest match is preferred, and ties are resolved in favor of the earliest matching r.e.
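The longest-match rule with ties going to the earliest r.e. can be sketched in a few lines of Python. This is an illustrative approximation, not how ML-Lex works internally: it tries each pattern with the re module instead of running a combined DFA, and the names RULES, next_token and tokenize are invented here. Python's alternation picks the first alternative that succeeds, not the longest one, which happens to be harmless for the REAL pattern below.

```python
import re

# (token class, pattern) pairs in priority order, following the slide's rules.
RULES = [
    ("IF",   r"if"),
    ("ID",   r"[a-z][a-z0-9]*"),
    ("NUM",  r"[0-9]+"),
    ("REAL", r"([0-9]+\.[0-9]*)|([0-9]*\.[0-9]+)"),
    ("WS",   r"[ \t\n]+"),
]
COMPILED = [(name, re.compile(pat)) for name, pat in RULES]


def next_token(text, pos):
    """Return (class, lexeme) for the longest match at pos; earliest rule wins ties."""
    best_name, best_len = None, 0
    for name, rx in COMPILED:
        m = rx.match(text, pos)
        # Strict '>' keeps the earlier rule when two matches have equal length.
        if m and len(m.group()) > best_len:
            best_name, best_len = name, len(m.group())
    if best_name is None:
        return ("ERROR", text[pos])  # no rule matched: report one bad character
    return (best_name, text[pos:pos + best_len])


def tokenize(text):
    """Split text into (class, lexeme) pairs, discarding white space."""
    pos, out = 0, []
    while pos < len(text):
        name, lexeme = next_token(text, pos)
        pos += len(lexeme)
        if name != "WS":
            out.append((name, lexeme))
    return out


print(tokenize("if8"))   # [('ID', 'if8')]
print(tokenize("if 8"))  # [('IF', 'if'), ('NUM', '8')]
```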

Homework Assignment 1
1. Program 1 (p. 10), file: prog1.sml
2. Exercise 1.1(a,b,c) (p. 12), file: ex1_1.sml

Finite State Machines
The r.e. recognition problem: for an r.e. M we want to build a machine that scans a string and tells us whether it belongs to L(M). Alternatively, in lexical analysis we want to scan a string and find a (longest) initial segment of the string that belongs to L(M).
    r.e.
    ⇒ nondeterministic finite automaton (NFA)
    ⇒ deterministic finite automaton (DFA)
    ⇒ optimization/simplification of the DFA
    ⇒ transition table + matching engine
    ⇒ code for a lexical analyzer

Finite State Machines
A finite state machine (finite automaton, or FA) over alphabet Σ is a quadruple M = 〈S, T, i, F〉 where
    S = a finite set of states (usually represented by numbers)
    T = a transition relation: T ⊆ S × Σ × S
    i = an initial state: i ∈ S
    F = a set of final states: F ⊆ S
[Diagram: graphical representations of a state m ∈ S, a transition 〈m,a,n〉 ∈ T drawn as an edge labeled a from m to n, the initial state i, and a final state f ∈ F]

Deterministic and Nondeterministic FA
A finite automaton M = 〈S, T, i, F〉 is deterministic (a DFA) if for each m ∈ S and a ∈ Σ there is at most one n ∈ S such that 〈m,a,n〉 ∈ T.
Graphically, in a DFA we don't have any situations of the form:
[Diagram: state m with two edges labeled a, one to p and one to q]
If an FA is not deterministic, it is a nondeterministic FA (an NFA). Nondeterministic automata are also formed by introducing ε transitions: silent transitions that can be taken without consuming an input symbol.
[Diagram: an edge labeled ε from m to n]
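The determinism condition translates directly into a check over the transition relation T. In this sketch T is a Python set of (m, a, n) triples and an ε transition is marked with the label None; both are representation choices made for the example, not something the slides prescribe.

```python
from collections import defaultdict

EPS = None  # label used here for epsilon transitions (an assumption of this sketch)


def is_deterministic(transitions):
    """transitions is a set of (m, a, n) triples, i.e. the relation T."""
    targets = defaultdict(set)
    for m, a, n in transitions:
        if a is EPS:
            return False  # epsilon transitions make the FA an NFA
        targets[(m, a)].add(n)
    # Deterministic: at most one n for each state m and symbol a.
    return all(len(ns) <= 1 for ns in targets.values())


# FA for the keyword "if": 1 -i-> 2 -f-> 3
dfa_if = {(1, "i", 2), (2, "f", 3)}
# The forbidden shape: m -a-> p and m -a-> q
nfa_fork = {("m", "a", "p"), ("m", "a", "q")}
# A single epsilon transition from m to n
nfa_eps = {("m", EPS, "n")}

print(is_deterministic(dfa_if))    # True
print(is_deterministic(nfa_fork))  # False
print(is_deterministic(nfa_eps))   # False
```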

DFAs for Token Classes
    if    (IF)
    [Diagram: 1 -i-> 2 -f-> 3]
    [a-z][a-z0-9]*    (ID(str))
    [Diagram: 1 -a-z-> 2, with a loop on 2 labeled a-z, 0-9]
    [0-9]+    (NUM(str))
    [Diagram: 1 -0-9-> 2, with a loop on 2 labeled 0-9]

DFAs for Token Classes
    ([0-9]+"."[0-9]*)|([0-9]*"."[0-9]+)    (REAL(str))
    [Diagram: DFA for real constants, with edges labeled 0-9 and "."]
    ("--"[a-z]*"\n")    (continue())    -- comment
    [Diagram: 1 -"-"-> 2 -"-"-> 3 -\n-> 4, with a loop on 3 labeled a-z]

DFAs for Token Classes
    (" "|"\t"|"\n")+    (continue())    -- white space
    [Diagram: 1 -ws-> 2, with a loop on 2 labeled ws, where ws is (" "|"\t"|"\n")]
    .    (error();continue())    -- error
    [Diagram: 1 -any but \n-> 2]

Combined DFA
[Diagram: the DFAs above merged into a single DFA with states 1-13; final states are labeled with their token classes: IF, ID, NUM, REAL, comment, ws (white space), and error]
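A combined DFA is typically run as a transition table plus a small matching engine that remembers the last final state it passed through. The sketch below does this for a reduced combined DFA covering only IF, ID and NUM. The state numbering, the table layout and the names build_table, FINAL and longest_prefix are assumptions of this example and do not reproduce the 13-state diagram.

```python
import string


def build_table():
    """Transition table: state -> {char -> next state}. A missing entry means 'stuck'."""
    letters, digits = string.ascii_lowercase, string.digits
    t = {1: {}, 2: {}, 3: {}, 4: {}, 5: {}}
    for c in letters:
        # Any letter starts or continues an identifier (state 4).
        t[1][c] = t[2][c] = t[3][c] = t[4][c] = 4
    t[1]["i"] = 2  # 'i' may begin the keyword
    t[2]["f"] = 3  # "if" seen
    for c in digits:
        t[1][c] = t[5][c] = 5              # digits start or continue a number
        t[2][c] = t[3][c] = t[4][c] = 4    # digits continue an identifier
    return t


FINAL = {2: "ID", 3: "IF", 4: "ID", 5: "NUM"}  # token class of each final state
TABLE = build_table()


def longest_prefix(text, start=0):
    """Run the DFA from `start`; return (class, lexeme) of the longest accepted prefix."""
    state, pos = 1, start
    last_class, last_end = None, start
    while pos < len(text) and text[pos] in TABLE[state]:
        state = TABLE[state][text[pos]]
        pos += 1
        if state in FINAL:  # remember the most recent final state
            last_class, last_end = FINAL[state], pos
    return last_class, text[start:last_end]


print(longest_prefix("if8 "))  # ('ID', 'if8')
print(longest_prefix("if 8"))  # ('IF', 'if')
```

In this reduced machine every state other than the start state is final, so remembering the last final state looks redundant. With the REAL and comment patterns included, the scanner can run past a final state and must fall back to it.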

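R.E. to NFA
[Diagrams: the NFA fragment built for each form of r.e.: a single symbol a, ε, M | N, M ⋅ N, and M∗; the fragments for compound forms connect the sub-machines for M and N with ε transitions]

RE to NFA Example
    b∗(abb∗)∗(a | ε)
[Diagram: the NFA obtained for this r.e., with edges labeled a, b, and ε]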
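The construction can be written as one small function per r.e. form, each returning a fragment (start, final, transitions). The code below is one common variant of this construction and may differ in detail from the diagrams; the names sym, alt, cat, star and accepts, and the use of None as the ε label, are assumptions of this sketch. It builds the NFA for the example b∗(abb∗)∗(a | ε) and simulates it by tracking sets of states.

```python
from itertools import count

EPS = None         # epsilon label (an assumption of this sketch)
_fresh = count(1)  # source of fresh state numbers


def sym(a):
    """Fragment for a single symbol a (or for epsilon when a is EPS)."""
    s, f = next(_fresh), next(_fresh)
    return s, f, {(s, a, f)}


def alt(m, n):
    """M | N: new start and final states joined to both fragments by epsilon edges."""
    s, f = next(_fresh), next(_fresh)
    (ms, mf, mt), (ns, nf, nt) = m, n
    return s, f, mt | nt | {(s, EPS, ms), (s, EPS, ns), (mf, EPS, f), (nf, EPS, f)}


def cat(m, n):
    """M . N: epsilon edge from the final state of M to the start state of N."""
    (ms, mf, mt), (ns, nf, nt) = m, n
    return ms, nf, mt | nt | {(mf, EPS, ns)}


def star(m):
    """M*: one new state, both start and final, with an epsilon loop through M."""
    ms, mf, mt = m
    s = next(_fresh)
    return s, s, mt | {(s, EPS, ms), (mf, EPS, s)}


def accepts(nfa, text):
    """Simulate the NFA by tracking the set of states it could be in."""
    start, final, trans = nfa

    def closure(states):
        seen, stack = set(states), list(states)
        while stack:
            m = stack.pop()
            for (p, a, q) in trans:
                if p == m and a is EPS and q not in seen:
                    seen.add(q)
                    stack.append(q)
        return seen

    current = closure({start})
    for ch in text:
        current = closure({q for (p, a, q) in trans if p in current and a == ch})
    return final in current


def b_star():
    """A fresh fragment for b* (each use needs its own states)."""
    return star(sym("b"))


# b*(abb*)*(a | epsilon)
nfa = cat(cat(b_star(), star(cat(cat(sym("a"), sym("b")), b_star()))),
          alt(sym("a"), sym(EPS)))

print(accepts(nfa, "bab"))  # True
print(accepts(nfa, "baa"))  # False
```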

NFA to DFA
[Figure sequence: the subset construction applied step by step to an NFA with states 1-7, ε transitions, and transitions labeled x, y, z]
1. Start with the NFA's initial state 1.
2. Take the ε-closure of 1, giving the initial DFA state {1, 2, 3, 4}.
3. On x the NFA moves to 5; the ε-closure of 5 gives the DFA state {5, 6, 7}.
4. On y the NFA moves to 6; the ε-closure of 6 gives the DFA state {6, 7}.
5. The z transition is added in the same way.
6. The resulting DFA has three states, renumbered 1, 2, 3, with transitions labeled x, y, z.
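The steps illustrated above (take an ε-closure, follow one symbol, take the ε-closure again, repeat until no new state sets appear) are the subset construction. A minimal sketch follows. The NFA it uses is a small hypothetical one for (a|b)∗a, not the 7-state machine from the figure, and the function names and the None label for ε are choices made for this example.

```python
from collections import deque

EPS = None  # epsilon label (an assumption of this sketch)


def eps_closure(states, trans):
    """All states reachable from `states` using only epsilon transitions."""
    seen, stack = set(states), list(states)
    while stack:
        m = stack.pop()
        for (p, a, q) in trans:
            if p == m and a is EPS and q not in seen:
                seen.add(q)
                stack.append(q)
    return frozenset(seen)


def subset_construction(start, finals, trans):
    """Return (dfa_start, dfa_finals, dfa_trans); DFA states are frozensets of NFA states."""
    alphabet = {a for (_, a, _) in trans if a is not EPS}
    d_start = eps_closure({start}, trans)
    d_trans, d_finals = {}, set()
    seen, work = {d_start}, deque([d_start])
    while work:
        S = work.popleft()
        if S & finals:  # a DFA state is final if it contains an NFA final state
            d_finals.add(S)
        for a in alphabet:
            moved = {q for (p, b, q) in trans if p in S and b == a}
            if not moved:
                continue  # no transition on a: leave it undefined
            T = eps_closure(moved, trans)
            d_trans[(S, a)] = T
            if T not in seen:
                seen.add(T)
                work.append(T)
    return d_start, d_finals, d_trans


# Hypothetical NFA for (a|b)*a: 1 -eps-> 2, loops on 2 for a and b, 2 -a-> 3 (final)
NFA_TRANS = {(1, EPS, 2), (2, "a", 2), (2, "b", 2), (2, "a", 3)}
d_start, d_finals, d_trans = subset_construction(1, {3}, NFA_TRANS)

print(sorted(d_start))                  # [1, 2]
print(sorted(d_trans[(d_start, "a")]))  # [2, 3]
```

Representing each DFA state as a frozenset keeps the correspondence with NFA states visible; renumbering the sets 1, 2, 3, ... afterwards gives the compact form used for a transition table.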

ML-Lex
    foo.lex (lexer specification) → ML-Lex → foo.lex.sml (SML code for lexer)
The specification for token values has to be supplied externally, usually in the form of a Tokens module that defines a token type and a set of functions for building tokens of various classes.

An ML-Lex specification
ML Declarations:
    type lexresult = Tokens.token
    fun eof() = Tokens.EOF(0,0)
    %%
Lex definitions:
    digits=[0-9]+;
    %%
Regular Expressions and Actions:
    if                => (Tokens.IF(yypos,yypos+2));
    [a-z][a-z0-9]*    => (Tokens.ID(yytext,yypos,yypos+size yytext));
    {digits}          => (Tokens.NUM(valOf (Int.fromString yytext),yypos,
                                     yypos+size yytext));
    ({digits}"."[0-9]*)|([0-9]*"."{digits})
                      => (Tokens.REAL(valOf (Real.fromString yytext),yypos,
                                      yypos+size yytext));
    ("--"[a-z]*"\n")  => (continue());
    (" "|"\n"|"\t")   => (continue());
    .                 => (ErrorMsg.error yypos "illegal character";
                          continue());

Variables Defined by ML-Lex
ML-Lex defines several variables:
    lex()         recursively call the lexer
    continue()    same, but with %arg
    yytext        the string matched by the current r.e.
    yypos         character position at start of current r.e. match
    yylineno      line number at start of match (if command %count given)

Defining Tokens
    (* ML Declaration of a Tokens module (called a structure in ML): *)
    structure Tokens =
    struct
      type pos = int
      datatype token
        = EOF of pos * pos
        | IF of pos * pos
        | ID of string * pos * pos
        | NUM of int * pos * pos
        | REAL of real * pos * pos
        ...
    end (* structure Tokens *)
