

SLIDE 1

Compilation 2016

Lexical Analysis

Aslan Askarov aslan@cs.au.dk

acknowledgments: E. Ernst

SLIDE 2

Lexical analysis

High-level source code → Lexing → Parsing → Elaboration → … → Low-level target code

SLIDE 3

Lexical analysis

Input (a stream of characters, including whitespace such as \n and \t):

  i f ( x > 0 \n t h e n 1 \t e l s e 0 ) \n \t

Output (a stream of tokens in our language):

  IF LPAREN ID("x") GE INT(0) THEN INT(1) ELSE INT(0) RPAREN

The lexer discards comments, whitespace, newline and tab characters, and preprocessor directives. Lexical analysis is the first phase in the compilation.

SLIDE 4

Tokens

Type    Examples
ID      foo   n14   a’   my-fun
INT     73   0   070
REAL    0.0   .5   10.
IF      if
COMMA   ,
LPAREN  (
ASGMT   :=

SLIDE 5

Non-tokens

Type                      Examples
comments                  /* dead code */   // comment   (* nest (*ed*) *)
preprocessor directives   #define N 10   #include <stdio.h>
whitespace

SLIDE 6

Token data structure

  • Many tokens need no associated data, e.g.:


IF, COMMA, LPAREN, RPAREN, ASGMT

  • Some tokens carry an associated string:


ID (“my-fun”)

  • Some tokens carry associated data of other types:


INT (73), INT (1), FLOAT (IEEE754, 1001111100…)

  • Tokens may include useful additional information: 


start/end pos in input file (line number + column, or charpos)

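As a concrete illustration, a token value carrying a kind tag, an optional payload, and its source position might look like this. This is a Python sketch; the names `Token`, `kind`, `value` are assumptions for the example, not the course's SML types.

```python
from dataclasses import dataclass
from typing import Union

# Illustrative token record: a kind tag, an optional payload, and the
# start position in the input file (names are assumptions, not course API).
@dataclass(frozen=True)
class Token:
    kind: str                                    # e.g. "IF", "ID", "INT"
    value: Union[str, int, float, None] = None   # payload for ID/INT/REAL
    line: int = 0
    col: int = 0

t1 = Token("IF", line=1, col=1)    # keyword tokens need no payload
t2 = Token("ID", "my-fun", 1, 4)   # identifiers carry their spelling
t3 = Token("INT", 73, 2, 1)        # literals carry their decoded value
```

Keeping the position on every token lets later phases report errors against the original source file.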
SLIDE 7

Q/A

  • Consider the source program:

 var δ := 0.0

  • The language is case-sensitive and ASCII-only
  • How do we report the error of using δ?

FileName:Line.Col: Illegal character δ

SLIDE 8

Regular expressions

  • We can use regular expressions to specify programming-language tokens
  • Regular expressions R
  • Expected to be well-known (dRegAut)
  • Syntax:
  • character a
  • choice R1 | R2
  • concatenation R1 · R2 (also written R1 R2)
  • empty string ε
  • repetition R*
SLIDE 9

Regular expressions used for scanning

Examples:

  if                                    (IF);
  [a-z][a-z0-9]*                        (ID);
  [0-9]+                                (NUM);
  ([0-9]+"."[0-9]*)|([0-9]*"."[0-9]+)   (REAL);
  ("--"[a-z]*"\n")|(" "|"\t")+          (continue());
  .                                     (error(); continue());

SLIDE 10

Resolving ambiguities

  • Rule: when a string can match multiple tokens, the longest matching token wins
  • We also need to specify priorities when several tokens match the same length
  • Usual rule: the earliest declaration wins

Given the rules

  if             (IF);
  [a-z][a-z0-9]* (ID);

  Input  i f x > 0  →  ID("ifx")   (longest match: ID beats IF)
  Input  i f        →  IF          (equal length: IF is declared first, so IF beats ID("if"))

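The two disambiguation rules can be sketched as a "maximal munch" loop. This Python sketch is illustrative (the `RULES` table and `scan` helper are invented for the example, not ML-Lex's actual machinery):

```python
import re

# Maximal-munch sketch of the two disambiguation rules: keep the longest
# match; on equal length, prefer the earlier rule (illustrative names).
RULES = [
    ("IF",   re.compile(r"if")),
    ("ID",   re.compile(r"[a-z][a-z0-9]*")),
    ("NUM",  re.compile(r"[0-9]+")),
    ("REAL", re.compile(r"[0-9]+\.[0-9]*|[0-9]*\.[0-9]+")),
    ("SKIP", re.compile(r"[ \t\n]+")),
]

def scan(src):
    tokens, pos = [], 0
    while pos < len(src):
        best_kind, best_end = None, pos
        for kind, rx in RULES:
            m = rx.match(src, pos)
            # strict '>' means an equally long later rule cannot displace
            # an earlier one: priority = declaration order
            if m and m.end() > best_end:
                best_kind, best_end = kind, m.end()
        if best_kind is None:
            raise ValueError(f"Illegal character {src[pos]!r} at {pos}")
        if best_kind != "SKIP":
            tokens.append((best_kind, src[pos:best_end]))
        pos = best_end
    return tokens
```

On `"ifx"` the ID rule's longer match wins over IF; on `"if"` both rules match two characters, and IF wins because it is declared first.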
SLIDE 11

Lexical analysis

Specification:   tokens as regular expressions
Formalism:       NFA → DFA
Implementation:  simulate NFA / simulate DFA
Output:          a program that translates raw text into a stream of tokens

+ longest-matching rule, + priorities; linear complexity

The "classical" approach: from RegEx to NFA to DFA

SLIDE 12

Total NFA for ID,IF,NUM,REAL

[NFA diagram: states 1–13 combining the automata for ID, IF, NUM, REAL, whitespace, and error; edges labeled with characters such as i, f, a-z, 0-9, ".", \n, and blanks; accepting states marked ID, IF, NUM, REAL, and error]

SLIDE 13

ML-Lex

  • Lexer generator, “built-in” part of SML/NJ
  • Accepts lexical specification, produces a scanner
  • Example specification

(* SML declarations *)
type lexresult = Tokens.token
fun eof() = Tokens.EOF(0,0)
%%
(* Lex definitions *)
digits=[0-9]+
%%
(* Regular expressions and actions *)
if             => (Tokens.IF(yypos, yypos+2));
[a-z][a-z0-9]* => (Tokens.ID(yytext, yypos, yypos + size yytext));
{digits}       => (Tokens.NUM(Int.fromString yytext, yypos, yypos + size yytext));
({digits}"."[0-9]*)|([0-9]*"."{digits}) =>
               (Tokens.REAL(Real.fromString yytext, yypos, yypos + size yytext));
("--"[a-z]*"\n")|(" "|"\n"|"\t")+ => (continue());
.              => (ErrorMsg.error yypos "Illegal character"; continue());
SLIDE 14

Lexer states

  • Helpful when handling different "kinds" of tokens
  • For example, use the state:
  • INITIAL for general lexing (the automatic default)
  • STRING when scanning the contents of a string
  • COMMENT when scanning a comment
  • Point: keep different concerns apart; simpler!
  • Syntax:

...
(* Regular expressions and actions *)
<INITIAL>if             => (Tokens.IF(yypos, yypos+2));
<INITIAL>[a-z][a-z0-9]* => (Tokens.ID(yytext, yypos, yypos + size yytext));
...
<INITIAL>"\"" => (YYBEGIN STRING; continue());
...
<STRING>.     => (continue());
...

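The same idea can be sketched without a lexer generator: an explicit mode variable plays the role of ML-Lex start states. This Python sketch is illustrative (the function name and token shapes are assumptions):

```python
import re

# Sketch of lexer states: a mode variable selects which rules apply,
# mirroring ML-Lex's <INITIAL>/<STRING> start states (illustrative only).
def scan_with_states(src):
    tokens, buf, pos, state = [], [], 0, "INITIAL"
    while pos < len(src):
        c = src[pos]
        if state == "INITIAL":
            if c == '"':              # like <INITIAL>"\"" => (YYBEGIN STRING; ...)
                state, buf = "STRING", []
            elif c.isspace():
                pass                  # skip whitespace
            else:
                m = re.match(r"[a-z][a-z0-9]*", src[pos:])
                if not m:
                    raise ValueError(f"Illegal character {c!r} at {pos}")
                tokens.append(("ID", m.group()))
                pos += m.end() - 1
        else:                         # STRING: accumulate until the closing quote
            if c == '"':
                tokens.append(("STRING", "".join(buf)))
                state = "INITIAL"
            else:
                buf.append(c)
        pos += 1
    return tokens
```

Keeping the STRING rules separate from the INITIAL rules is exactly the "keep different concerns apart" point: neither set of rules has to mention the other's cases.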
SLIDE 15

Lexical analysis

Specification:   tokens as regular expressions
Formalism:       NFA → DFA
Implementation:  simulate NFA / simulate DFA
Output:          a program that translates raw text into a stream of tokens

+ longest-matching rule, + priorities; linear complexity

An alternative, purely algebraic approach: from RegEx directly to DFA using regexp derivatives

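A minimal sketch of the derivative-based route, assuming Brzozowski's standard rules for nullability and derivatives (the class and function names are illustrative): a regex matches a string iff deriving it by each character in turn leaves a regex whose language contains the empty string.

```python
from dataclasses import dataclass

# Brzozowski derivatives: match by repeatedly deriving the regex.
class Re: pass
@dataclass(frozen=True)
class Empty(Re): pass          # matches nothing
@dataclass(frozen=True)
class Eps(Re): pass            # matches only ""
@dataclass(frozen=True)
class Chr(Re): c: str          # matches one character
@dataclass(frozen=True)
class Alt(Re): l: Re; r: Re    # choice R1 | R2
@dataclass(frozen=True)
class Seq(Re): l: Re; r: Re    # concatenation R1 R2
@dataclass(frozen=True)
class Star(Re): r: Re          # repetition R*

def nullable(r):
    if isinstance(r, (Eps, Star)): return True
    if isinstance(r, Alt): return nullable(r.l) or nullable(r.r)
    if isinstance(r, Seq): return nullable(r.l) and nullable(r.r)
    return False               # Empty, Chr

def deriv(r, c):
    if isinstance(r, Chr): return Eps() if r.c == c else Empty()
    if isinstance(r, Alt): return Alt(deriv(r.l, c), deriv(r.r, c))
    if isinstance(r, Seq):
        d = Seq(deriv(r.l, c), r.r)
        return Alt(d, deriv(r.r, c)) if nullable(r.l) else d
    if isinstance(r, Star): return Seq(deriv(r.r, c), r)
    return Empty()             # Empty, Eps

def matches(r, s):
    for c in s:
        r = deriv(r, c)
    return nullable(r)

# An identifier-like pattern over a tiny alphabet, for illustration
letter = Alt(Chr("a"), Alt(Chr("b"), Chr("c")))
digit  = Alt(Chr("0"), Chr("1"))
ident  = Seq(letter, Star(Alt(letter, digit)))
```

The DFA construction then takes each distinct derivative (up to simplification) as a state, with `nullable` marking the accepting states.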
SLIDE 16

More on SML

[online demo]

SLIDE 17

Warmup project

SLIDE 18

Straight-line Programming Language

  • Toy programming language: no branching, no loops
  • Skip lexing and parsing issues
  • Focus on the “meaning” – interpretation
  • Syntax

Stm → Stm ; Stm            (CompoundStm)
Stm → id := Exp            (AssignStm)
Stm → print ( ExpList )    (PrintStm)
Exp → id                   (IdExp)
Exp → num                  (NumExp)
Exp → Exp Binop Exp        (OpExp)
Exp → ( Stm , Exp )        (EseqExp)
ExpList → Exp , ExpList    (PairExpList)
ExpList → Exp              (LastExpList)
Binop → +                  (Plus)
Binop → -                  (Minus)
Binop → ×                  (Times)
Binop → /                  (Div)

SLIDE 19

Straight-line program

  • Source:

    a := 5 + 3;
    b := (print(a, a - 1), 10 * a);
    print(b)

  • Corresponding syntax tree:

    CompoundStm
    ├─ AssignStm "a"
    │    └─ OpExp (NumExp 5, Plus, NumExp 3)
    └─ CompoundStm
         ├─ AssignStm "b"
         │    └─ EseqExp
         │         ├─ PrintStm [IdExp "a", OpExp (IdExp "a", Minus, NumExp 1)]
         │         └─ OpExp (NumExp 10, Times, IdExp "a")
         └─ PrintStm [IdExp "b"]

SLIDE 20

SLP syntax representation datatype

  • SML declaration

Stm → Stm ; Stm            (CompoundStm)
Stm → id := Exp            (AssignStm)
Stm → print ( ExpList )    (PrintStm)
Exp → id                   (IdExp)
Exp → num                  (NumExp)
Exp → Exp Binop Exp        (OpExp)
Exp → ( Stm , Exp )        (EseqExp)
ExpList → Exp , ExpList    (PairExpList)
ExpList → Exp              (LastExpList)
Binop → +                  (Plus)
Binop → -                  (Minus)
Binop → ×                  (Times)
Binop → /                  (Div)

type id = string

datatype binop = Plus | Minus | Times | Div

datatype stm = CompoundStm of stm * stm
             | AssignStm of id * exp
             | PrintStm of exp list

     and exp = IdExp of id
             | NumExp of int
             | OpExp of exp * binop * exp
             | EseqExp of stm * exp

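For comparison, the same abstract syntax can be sketched with Python dataclasses (illustrative only; the project itself uses the SML datatype, and these class names simply mirror its constructors):

```python
from dataclasses import dataclass
from typing import List, Union

# The SLP abstract syntax transcribed as Python classes (a sketch).
@dataclass
class CompoundStm: stm1: "Stm"; stm2: "Stm"
@dataclass
class AssignStm: ident: str; exp: "Exp"
@dataclass
class PrintStm: exps: List["Exp"]
@dataclass
class IdExp: ident: str
@dataclass
class NumExp: num: int
@dataclass
class OpExp: left: "Exp"; op: str; right: "Exp"  # op in {"Plus","Minus","Times","Div"}
@dataclass
class EseqExp: stm: "Stm"; exp: "Exp"

Stm = Union[CompoundStm, AssignStm, PrintStm]
Exp = Union[IdExp, NumExp, OpExp, EseqExp]

# a := 5 + 3; print(a)
prog = CompoundStm(AssignStm("a", OpExp(NumExp(5), "Plus", NumExp(3))),
                   PrintStm([IdExp("a")]))
```

One class per constructor keeps the tree shape explicit, just as the SML `datatype` does with its `|` alternatives.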
SLIDE 21

SLP syntax representation

  • Source program:

    a := 5 + 3;
    b := (print(a, a - 1), 10 * a);
    print(b)

  • SML value:

    val prog =
      CompoundStm (
        AssignStm ("a", OpExp (NumExp 5, Plus, NumExp 3)),
        CompoundStm (
          AssignStm ("b", EseqExp (
            PrintStm [IdExp "a", OpExp (…)],
            OpExp (NumExp 10, …))),
          PrintStm [IdExp "b"]))

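To preview what "interpretation" means for SLP, here is a self-contained Python sketch that evaluates a tuple-encoded version of the example program. The encoding, function names, and the use of integer division for Div are assumptions for this sketch, not the project's SML interface:

```python
# SLP interpretation sketch: statements update an environment and append
# to an output log; expressions compute values (tuple encoding is illustrative).
def interp_stm(stm, env, out):
    tag = stm[0]
    if tag == "compound":
        interp_stm(stm[1], env, out); interp_stm(stm[2], env, out)
    elif tag == "assign":
        env[stm[1]] = interp_exp(stm[2], env, out)
    elif tag == "print":
        out.append(" ".join(str(interp_exp(e, env, out)) for e in stm[1]))

def interp_exp(exp, env, out):
    tag = exp[0]
    if tag == "id":  return env[exp[1]]
    if tag == "num": return exp[1]
    if tag == "op":
        a, b = interp_exp(exp[1], env, out), interp_exp(exp[3], env, out)
        return {"+": a + b, "-": a - b, "*": a * b, "/": a // b}[exp[2]]
    if tag == "eseq":                    # run the statement, then evaluate
        interp_stm(exp[1], env, out)
        return interp_exp(exp[2], env, out)

# a := 5 + 3; b := (print(a, a - 1), 10 * a); print(b)
prog = ("compound",
        ("assign", "a", ("op", ("num", 5), "+", ("num", 3))),
        ("compound",
         ("assign", "b", ("eseq",
             ("print", [("id", "a"), ("op", ("id", "a"), "-", ("num", 1))]),
             ("op", ("num", 10), "*", ("id", "a")))),
         ("print", [("id", "b")])))

env, out = {}, []
interp_stm(prog, env, out)
print(out)   # ['8 7', '80']
```

Note how EseqExp forces the embedded statement's side effects (the inner print) to happen before its expression's value is produced.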
SLIDE 22

Project assignment

  • Follow the descriptions on pp. 10–12 in MCIML
  • "Modularity principles" (pp. 9–10): discussed on Friday, may be ignored at first

  • Clarification:
  • Let bindings are OK
  • References, arrays, and ref update (:=) are not OK
SLIDE 23

Summary

  • Warm-up project: program in SML!
  • Straight-line programming language, no lexing/parsing involved
  • Express programs: use the abstract syntax tree datatype
  • Project specified on the website, essentially as in the book
  • Lexical analysis
  • Avoid complexity in the grammar: use a lexer
  • Based on regular expressions
  • Implementation via RE → NFA → DFA (theory assumed known)
  • Alternative: via RE derivatives → DFA
  • Tools: ML-Lex
  • Scanner generator, outputs SML code from a specification
  • Note lexer states