

SLIDE 1

Compilation 2016

Lexical Analysis

Aslan Askarov aslan@cs.au.dk

acknowledgments: E. Ernst

SLIDE 2

Lexical analysis

High-level source code → Lexing → Parsing → Elaboration → … → Low-level target code

SLIDE 3

Lexical analysis

Input (a stream of characters, including whitespace such as \n and \t):

  i f ( x > 0 \n t h e n 1 \t e l s e 0 ) \n \t

Output (a stream of tokens in our language):

  IF LPAREN ID("x") GE INT(0) THEN INT(1) ELSE INT(0) RPAREN

The lexer discards comments, whitespace, newline and tab characters, and preprocessor directives. Lexical analysis is the first phase in the compilation.

SLIDE 4

Tokens

Type    Examples
ID      foo   n14   a’   my-fun
INT     73   0   070
REAL    0.0   .5   10.
IF      if
COMMA   ,
LPAREN  (
ASGMT   :=

SLIDE 5

Non-tokens

Type                      Examples
comments                  /* dead code */   // comment   (* nest (*ed*) *)
preprocessor directives   #define N 10   #include <stdio.h>
whitespace

SLIDE 6

Token data structure

  • Many tokens need no associated data, e.g.:


IF, COMMA, LPAREN, RPAREN, ASGMT

  • Some tokens carry an associated string:


ID (“my-fun”)

  • Some tokens carry associated data of other types:


INT (73), INT (1), FLOAT (IEEE754, 1001111100…)

  • Tokens may include useful additional information: 


start/end pos in input file (line number + column, or charpos)

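As a concrete illustration, a token value carrying a kind tag, an optional payload, and its source position might look like this. This is a Python sketch; the names `Token`, `kind`, `value` are assumptions for the example, not the course's SML types.

```python
from dataclasses import dataclass
from typing import Union

# Illustrative token record: a kind tag, an optional payload, and the
# start position in the input file (names are assumptions, not course API).
@dataclass(frozen=True)
class Token:
    kind: str                                    # e.g. "IF", "ID", "INT"
    value: Union[str, int, float, None] = None   # payload for ID/INT/REAL
    line: int = 0
    col: int = 0

t1 = Token("IF", line=1, col=1)    # keyword tokens need no payload
t2 = Token("ID", "my-fun", 1, 4)   # identifiers carry their spelling
t3 = Token("INT", 73, 2, 1)        # literals carry their decoded value
```

Keeping the position on every token lets later phases report errors against the original source file.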
SLIDE 7

Q/A

  • Consider the source program:

 var δ := 0.0

  • The language is case-sensitive and ASCII-only
  • How do we report the error of using δ?

FileName:Line.Col: Illegal character δ

SLIDE 8

Regular expressions

  • We can use regular expressions to specify programming-language tokens
  • Regular expressions R
  • Expected to be well-known (dRegAut)
  • Syntax:
  • character a
  • choice R1 | R2
  • concatenation R1 · R2 (also written R1 R2)
  • empty string ε
  • repetition R*
SLIDE 9

Regular expressions used for scanning

Examples:

  if                                    (IF);
  [a-z][a-z0-9]*                        (ID);
  [0-9]+                                (NUM);
  ([0-9]+"."[0-9]*)|([0-9]*"."[0-9]+)   (REAL);
  ("--"[a-z]*"\n")|(" "|"\t")+          (continue());
  .                                     (error(); continue());

SLIDE 10

Resolving ambiguities

  • Rule: when a string can match multiple tokens, the longest matching token wins
  • We also need to specify priorities when several tokens match the same length
  • Usual rule: the earliest declaration wins

Given the rules

  if             (IF);
  [a-z][a-z0-9]* (ID);

  Input  i f x > 0  →  ID("ifx")   (longest match: ID beats IF)
  Input  i f        →  IF          (equal length: IF is declared first, so IF beats ID("if"))

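The two disambiguation rules can be sketched as a "maximal munch" loop. This Python sketch is illustrative (the `RULES` table and `scan` helper are invented for the example, not ML-Lex's actual machinery):

```python
import re

# Maximal-munch sketch of the two disambiguation rules: keep the longest
# match; on equal length, prefer the earlier rule (illustrative names).
RULES = [
    ("IF",   re.compile(r"if")),
    ("ID",   re.compile(r"[a-z][a-z0-9]*")),
    ("NUM",  re.compile(r"[0-9]+")),
    ("REAL", re.compile(r"[0-9]+\.[0-9]*|[0-9]*\.[0-9]+")),
    ("SKIP", re.compile(r"[ \t\n]+")),
]

def scan(src):
    tokens, pos = [], 0
    while pos < len(src):
        best_kind, best_end = None, pos
        for kind, rx in RULES:
            m = rx.match(src, pos)
            # strict '>' means an equally long later rule cannot displace
            # an earlier one: priority = declaration order
            if m and m.end() > best_end:
                best_kind, best_end = kind, m.end()
        if best_kind is None:
            raise ValueError(f"Illegal character {src[pos]!r} at {pos}")
        if best_kind != "SKIP":
            tokens.append((best_kind, src[pos:best_end]))
        pos = best_end
    return tokens
```

On `"ifx"` the ID rule's longer match wins over IF; on `"if"` both rules match two characters, and IF wins because it is declared first.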
SLIDE 11

Lexical analysis

Specification:   tokens as regular expressions
Formalism:       NFA → DFA
Implementation:  simulate NFA / simulate DFA
Output:          a program that translates raw text into a stream of tokens

+ longest-matching rule, + priorities; linear complexity

The "classical" approach: from RegEx to NFA to DFA

SLIDE 12

Total NFA for ID,IF,NUM,REAL

[NFA diagram: states 1–13 combining the automata for ID, IF, NUM, REAL, whitespace, and error; edges labeled with characters such as i, f, a-z, 0-9, ".", \n, and blanks; accepting states marked ID, IF, NUM, REAL, and error]

SLIDE 13

ML-Lex

  • Lexer generator, “built-in” part of SML/NJ
  • Accepts lexical specification, produces a scanner
  • Example specification

(* SML declarations *)
type lexresult = Tokens.token
fun eof() = Tokens.EOF(0,0)
%%
(* Lex definitions *)
digits=[0-9]+
%%
(* Regular expressions and actions *)
if             => (Tokens.IF(yypos, yypos+2));
[a-z][a-z0-9]* => (Tokens.ID(yytext, yypos, yypos + size yytext));
{digits}       => (Tokens.NUM(Int.fromString yytext, yypos, yypos + size yytext));
({digits}"."[0-9]*)|([0-9]*"."{digits}) =>
               (Tokens.REAL(Real.fromString yytext, yypos, yypos + size yytext));
("--"[a-z]*"\n")|(" "|"\n"|"\t")+ => (continue());
.              => (ErrorMsg.error yypos "Illegal character"; continue());
SLIDE 14

Lexer states

  • Helpful when handling different "kinds" of tokens
  • For example, use the state:
  • INITIAL for general lexing (the automatic default)
  • STRING when scanning the contents of a string
  • COMMENT when scanning a comment
  • Point: keep different concerns apart; simpler!
  • Syntax:

...
(* Regular expressions and actions *)
<INITIAL>if             => (Tokens.IF(yypos, yypos+2));
<INITIAL>[a-z][a-z0-9]* => (Tokens.ID(yytext, yypos, yypos + size yytext));
...
<INITIAL>"\"" => (YYBEGIN STRING; continue());
...
<STRING>.     => (continue());
...

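The same idea can be sketched without a lexer generator: an explicit mode variable plays the role of ML-Lex start states. This Python sketch is illustrative (the function name and token shapes are assumptions):

```python
import re

# Sketch of lexer states: a mode variable selects which rules apply,
# mirroring ML-Lex's <INITIAL>/<STRING> start states (illustrative only).
def scan_with_states(src):
    tokens, buf, pos, state = [], [], 0, "INITIAL"
    while pos < len(src):
        c = src[pos]
        if state == "INITIAL":
            if c == '"':              # like <INITIAL>"\"" => (YYBEGIN STRING; ...)
                state, buf = "STRING", []
            elif c.isspace():
                pass                  # skip whitespace
            else:
                m = re.match(r"[a-z][a-z0-9]*", src[pos:])
                if not m:
                    raise ValueError(f"Illegal character {c!r} at {pos}")
                tokens.append(("ID", m.group()))
                pos += m.end() - 1
        else:                         # STRING: accumulate until the closing quote
            if c == '"':
                tokens.append(("STRING", "".join(buf)))
                state = "INITIAL"
            else:
                buf.append(c)
        pos += 1
    return tokens
```

Keeping the STRING rules separate from the INITIAL rules is exactly the "keep different concerns apart" point: neither set of rules has to mention the other's cases.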
SLIDE 15

Lexical analysis

Specification:   tokens as regular expressions
Formalism:       NFA → DFA
Implementation:  simulate NFA / simulate DFA
Output:          a program that translates raw text into a stream of tokens

+ longest-matching rule, + priorities; linear complexity

An alternative, purely algebraic approach: from RegEx directly to DFA using regexp derivatives

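A minimal sketch of the derivative-based route, assuming Brzozowski's standard rules for nullability and derivatives (the class and function names are illustrative): a regex matches a string iff deriving it by each character in turn leaves a regex whose language contains the empty string.

```python
from dataclasses import dataclass

# Brzozowski derivatives: match by repeatedly deriving the regex.
class Re: pass
@dataclass(frozen=True)
class Empty(Re): pass          # matches nothing
@dataclass(frozen=True)
class Eps(Re): pass            # matches only ""
@dataclass(frozen=True)
class Chr(Re): c: str          # matches one character
@dataclass(frozen=True)
class Alt(Re): l: Re; r: Re    # choice R1 | R2
@dataclass(frozen=True)
class Seq(Re): l: Re; r: Re    # concatenation R1 R2
@dataclass(frozen=True)
class Star(Re): r: Re          # repetition R*

def nullable(r):
    if isinstance(r, (Eps, Star)): return True
    if isinstance(r, Alt): return nullable(r.l) or nullable(r.r)
    if isinstance(r, Seq): return nullable(r.l) and nullable(r.r)
    return False               # Empty, Chr

def deriv(r, c):
    if isinstance(r, Chr): return Eps() if r.c == c else Empty()
    if isinstance(r, Alt): return Alt(deriv(r.l, c), deriv(r.r, c))
    if isinstance(r, Seq):
        d = Seq(deriv(r.l, c), r.r)
        return Alt(d, deriv(r.r, c)) if nullable(r.l) else d
    if isinstance(r, Star): return Seq(deriv(r.r, c), r)
    return Empty()             # Empty, Eps

def matches(r, s):
    for c in s:
        r = deriv(r, c)
    return nullable(r)

# An identifier-like pattern over a tiny alphabet, for illustration
letter = Alt(Chr("a"), Alt(Chr("b"), Chr("c")))
digit  = Alt(Chr("0"), Chr("1"))
ident  = Seq(letter, Star(Alt(letter, digit)))
```

The DFA construction then takes each distinct derivative (up to simplification) as a state, with `nullable` marking the accepting states.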
SLIDE 16

More on SML

[online demo]

SLIDE 17

Warmup project

SLIDE 18

Straight-line Programming Language

  • Toy programming language: no branching, no loops
  • Skip lexing and parsing issues
  • Focus on the “meaning” – interpretation
  • Syntax

Stm → Stm ; Stm            (CompoundStm)
Stm → id := Exp            (AssignStm)
Stm → print ( ExpList )    (PrintStm)
Exp → id                   (IdExp)
Exp → num                  (NumExp)
Exp → Exp Binop Exp        (OpExp)
Exp → ( Stm , Exp )        (EseqExp)
ExpList → Exp , ExpList    (PairExpList)
ExpList → Exp              (LastExpList)
Binop → +                  (Plus)
Binop → -                  (Minus)
Binop → ×                  (Times)
Binop → /                  (Div)

SLIDE 19

Straight-line program

  • Source:

    a := 5 + 3;
    b := (print(a, a - 1), 10 * a);
    print(b)

  • Corresponding syntax tree:

    CompoundStm
    ├─ AssignStm "a"
    │    └─ OpExp (NumExp 5, Plus, NumExp 3)
    └─ CompoundStm
         ├─ AssignStm "b"
         │    └─ EseqExp
         │         ├─ PrintStm [IdExp "a", OpExp (IdExp "a", Minus, NumExp 1)]
         │         └─ OpExp (NumExp 10, Times, IdExp "a")
         └─ PrintStm [IdExp "b"]

SLIDE 20

SLP syntax representation datatype

  • SML declaration

Stm → Stm ; Stm            (CompoundStm)
Stm → id := Exp            (AssignStm)
Stm → print ( ExpList )    (PrintStm)
Exp → id                   (IdExp)
Exp → num                  (NumExp)
Exp → Exp Binop Exp        (OpExp)
Exp → ( Stm , Exp )        (EseqExp)
ExpList → Exp , ExpList    (PairExpList)
ExpList → Exp              (LastExpList)
Binop → +                  (Plus)
Binop → -                  (Minus)
Binop → ×                  (Times)
Binop → /                  (Div)

type id = string

datatype binop = Plus | Minus | Times | Div

datatype stm = CompoundStm of stm * stm
             | AssignStm of id * exp
             | PrintStm of exp list

     and exp = IdExp of id
             | NumExp of int
             | OpExp of exp * binop * exp
             | EseqExp of stm * exp

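For comparison, the same abstract syntax can be sketched with Python dataclasses (illustrative only; the project itself uses the SML datatype, and these class names simply mirror its constructors):

```python
from dataclasses import dataclass
from typing import List, Union

# The SLP abstract syntax transcribed as Python classes (a sketch).
@dataclass
class CompoundStm: stm1: "Stm"; stm2: "Stm"
@dataclass
class AssignStm: ident: str; exp: "Exp"
@dataclass
class PrintStm: exps: List["Exp"]
@dataclass
class IdExp: ident: str
@dataclass
class NumExp: num: int
@dataclass
class OpExp: left: "Exp"; op: str; right: "Exp"  # op in {"Plus","Minus","Times","Div"}
@dataclass
class EseqExp: stm: "Stm"; exp: "Exp"

Stm = Union[CompoundStm, AssignStm, PrintStm]
Exp = Union[IdExp, NumExp, OpExp, EseqExp]

# a := 5 + 3; print(a)
prog = CompoundStm(AssignStm("a", OpExp(NumExp(5), "Plus", NumExp(3))),
                   PrintStm([IdExp("a")]))
```

One class per constructor keeps the tree shape explicit, just as the SML `datatype` does with its `|` alternatives.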
SLIDE 21

SLP syntax representation

  • Source program:

    a := 5 + 3;
    b := (print(a, a - 1), 10 * a);
    print(b)

  • SML value:

    val prog =
      CompoundStm (
        AssignStm ("a", OpExp (NumExp 5, Plus, NumExp 3)),
        CompoundStm (
          AssignStm ("b", EseqExp (
            PrintStm [IdExp "a", OpExp (…)],
            OpExp (NumExp 10, …))),
          PrintStm [IdExp "b"]))

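To preview what "interpretation" means for SLP, here is a self-contained Python sketch that evaluates a tuple-encoded version of the example program. The encoding, function names, and the use of integer division for Div are assumptions for this sketch, not the project's SML interface:

```python
# SLP interpretation sketch: statements update an environment and append
# to an output log; expressions compute values (tuple encoding is illustrative).
def interp_stm(stm, env, out):
    tag = stm[0]
    if tag == "compound":
        interp_stm(stm[1], env, out); interp_stm(stm[2], env, out)
    elif tag == "assign":
        env[stm[1]] = interp_exp(stm[2], env, out)
    elif tag == "print":
        out.append(" ".join(str(interp_exp(e, env, out)) for e in stm[1]))

def interp_exp(exp, env, out):
    tag = exp[0]
    if tag == "id":  return env[exp[1]]
    if tag == "num": return exp[1]
    if tag == "op":
        a, b = interp_exp(exp[1], env, out), interp_exp(exp[3], env, out)
        return {"+": a + b, "-": a - b, "*": a * b, "/": a // b}[exp[2]]
    if tag == "eseq":                    # run the statement, then evaluate
        interp_stm(exp[1], env, out)
        return interp_exp(exp[2], env, out)

# a := 5 + 3; b := (print(a, a - 1), 10 * a); print(b)
prog = ("compound",
        ("assign", "a", ("op", ("num", 5), "+", ("num", 3))),
        ("compound",
         ("assign", "b", ("eseq",
             ("print", [("id", "a"), ("op", ("id", "a"), "-", ("num", 1))]),
             ("op", ("num", 10), "*", ("id", "a")))),
         ("print", [("id", "b")])))

env, out = {}, []
interp_stm(prog, env, out)
print(out)   # ['8 7', '80']
```

Note how EseqExp forces the embedded statement's side effects (the inner print) to happen before its expression's value is produced.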
SLIDE 22

Project assignment

  • Follow the descriptions on pp. 10–12 in MCIML
  • "Modularity principles" (pp. 9–10): discussed on Friday, may be ignored at first

  • Clarification:
  • Let bindings are OK
  • References, arrays, and ref update (:=) are not OK
SLIDE 23

Summary

  • Warm-up project: program in SML!
  • Straight-line programming language, no lexing/parsing involved
  • Express programs: use the abstract syntax tree datatype
  • Project specified on the website, essentially as in the book
  • Lexical analysis
  • Avoid complexity in the grammar: use a lexer
  • Based on regular expressions
  • Implementation via RE → NFA → DFA (theory assumed known)
  • Alternative: via RE derivatives → DFA
  • Tools: ML-Lex
  • Scanner generator, outputs SML code from a specification
  • Note lexer states