lexical analysis

Lexical Analysis. Aslan Askarov, aslan@cs.au.dk. Acknowledgments: E. Ernst



  1. Compilation 2016 Lexical Analysis Aslan Askarov aslan@cs.au.dk acknowledgments: E. Ernst

  2. Lexical analysis
     Compilation pipeline: High-level source code → Lexing → Parsing → Elaboration → … → Low-level target code

  3. Lexical analysis
     First phase in the compilation.
     Input: stream of characters
         if (x > 0)\n\tthen 1\n\telse 0
     Output: stream of tokens in our language
         IF LPAREN ID("x") GT INT(0) RPAREN THEN INT(1) ELSE INT(0)
     Discards comments, whitespace, newline, tab characters, preprocessor directives.

  4. Tokens
     Type      Examples
     ID        foo  n14  a'  my-fun
     INT       73  0  070
     REAL      0.0  .5  10.
     IF        if
     COMMA     ,
     LPAREN    (
     ASGMT     :=

  5. Non-tokens
     Type                      Examples
     comments                  /* dead code */   // comment   (* nest (*ed*) *)
     preprocessor directives   #define N 10   #include <stdio.h>
     whitespace

  6. Token data structure
     • Many tokens need no associated data, e.g.: IF, COMMA, LPAREN, RPAREN, ASGMT
     • Some tokens carry an associated string: ID("my-fun")
     • Some tokens carry associated data of other types: INT(73), INT(1), FLOAT(IEEE754, 1001111100…)
     • Tokens may include useful additional information: start/end position in the input file (line number + column, or charpos)
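The bullets above can be sketched as an SML datatype. This is only an illustration under stated assumptions: the type name `token`, the alias `pos`, and the helper `toString` are hypothetical, positions are taken to be character offsets, and the real-valued token is called `REAL` as on the Tokens slide.

```sml
(* Hypothetical token representation; not the one generated by any tool. *)
type pos = int  (* assumed: character offset into the input *)

datatype token
  = IF of pos * pos                 (* no data, only start/end position *)
  | COMMA of pos * pos
  | LPAREN of pos * pos
  | RPAREN of pos * pos
  | ASGMT of pos * pos
  | ID of string * pos * pos        (* carries the identifier text *)
  | INT of int * pos * pos          (* carries the integer value *)
  | REAL of real * pos * pos        (* carries the floating-point value *)

(* Render a token without its positions, e.g. for debugging output. *)
fun toString (IF _) = "IF"
  | toString (COMMA _) = "COMMA"
  | toString (LPAREN _) = "LPAREN"
  | toString (RPAREN _) = "RPAREN"
  | toString (ASGMT _) = "ASGMT"
  | toString (ID (s, _, _)) = "ID(\"" ^ s ^ "\")"
  | toString (INT (n, _, _)) = "INT(" ^ Int.toString n ^ ")"
  | toString (REAL (r, _, _)) = "REAL(" ^ Real.toString r ^ ")"
```

Keeping the positions inside every constructor means later phases can report errors at the right place without a separate lookup table.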

  7. Q/A
     • Consider source program: var δ := 0.0
     • Language: case sensitive, ASCII
     • How to report the error of using δ?
         FileName:Line.Col: Illegal character δ
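One way to produce a message in the `FileName:Line.Col` shape is to convert a character offset into a line and column. The sketch below assumes 0-based offsets, 1-based lines and columns, and byte-sized characters (so a non-ASCII character such as δ may occupy more than one offset); the names `lineCol` and `illegalChar` and the file name in the usage note are hypothetical.

```sml
(* Convert a 0-based character offset into a 1-based (line, column) pair. *)
fun lineCol (src : string, charpos : int) : int * int =
  let
    (* Walk the input up to charpos, counting newlines and resetting columns. *)
    fun go (i, line, col) =
      if i >= charpos orelse i >= size src then (line, col)
      else if String.sub (src, i) = #"\n" then go (i + 1, line + 1, 1)
      else go (i + 1, line, col + 1)
  in
    go (0, 1, 1)
  end

(* Build an error message of the form FileName:Line.Col: Illegal character. *)
fun illegalChar (fileName : string, src : string, charpos : int) : string =
  let
    val (l, c) = lineCol (src, charpos)
  in
    fileName ^ ":" ^ Int.toString l ^ "." ^ Int.toString c ^ ": Illegal character"
  end
```

For example, `illegalChar ("prog.tig", "x\nvar ? := 0.0", 6)` points at the `?` on line 2, column 5.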

  8. Regular expressions
     • We can use regular expressions to specify programming language tokens
     • Regular expressions R
     • Expected to be well-known (dRegAut)
     • Syntax
         • character       a
         • choice          R1 | R2
         • concat          R1 · R2   (also sometimes written R1 R2)
         • empty string    ε
         • repeat          R*

  9. Regular expressions used for scanning
     Examples
         if                                    (IF);
         [a-z][a-z0-9]*                        (ID);
         [0-9]+                                (NUM);
         ([0-9]+"."[0-9]*)|([0-9]*"."[0-9]+)   (REAL);
         ("--"[a-z]*"\n")|(" "|"\t")           (continue());
         .                                     (error(); continue());

  10. Resolving ambiguities
      • Rule: when a string can match multiple tokens, the longest matching token wins.
        With the rules if (IF); and [a-z][a-z0-9]* (ID);, the input ifx > 0 starts with ID("ifx"), not IF.
      • We also need to specify priorities if we match several tokens of the same length.
      • Usual rule: earliest declaration wins.
        The input if matches both ID("if") and IF; the IF rule is declared first, so the token is IF.
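The two disambiguation rules can be sketched directly in SML. This is a minimal illustration, not how a generated scanner works internally: each rule is modelled as a hypothetical function that returns the length of the longest prefix it matches, and the names `rule`, `spanFrom`, `keyword`, `ident`, `number`, and `best` are all assumptions made for the example.

```sml
(* A rule pairs a token name with a function returning the length of the
   longest prefix of the input that the rule matches, if any. *)
type rule = string * (string -> int option)

(* Index of the first character at or after i that does not satisfy p. *)
fun spanFrom p (s, i) =
  if i < size s andalso p (String.sub (s, i)) then spanFrom p (s, i + 1) else i

fun isIdChar c = Char.isLower c orelse Char.isDigit c

(* Matches exactly the given keyword at the start of the input. *)
fun keyword kw s = if String.isPrefix kw s then SOME (size kw) else NONE

(* [a-z][a-z0-9]* *)
fun ident s =
  if size s > 0 andalso Char.isLower (String.sub (s, 0))
  then SOME (spanFrom isIdChar (s, 1))
  else NONE

(* [0-9]+ *)
fun number s =
  case spanFrom Char.isDigit (s, 0) of
    0 => NONE
  | n => SOME n

(* Declaration order matters: IF comes before ID. *)
val rules : rule list = [("IF", keyword "if"), ("ID", ident), ("NUM", number)]

(* Longest match wins; on a tie the earlier rule wins, because a later
   rule replaces the current best only when it is strictly longer. *)
fun best (rs : rule list) (s : string) : (string * int) option =
  List.foldl
    (fn ((name, m), acc) =>
        case (m s, acc) of
          (NONE, _) => acc
        | (SOME n, NONE) => SOME (name, n)
        | (SOME n, SOME (_, k)) => if n > k then SOME (name, n) else acc)
    NONE rs
```

With these rules, `best rules "ifx > 0"` selects `ID` with length 3, while `best rules "if (x"` selects `IF` because both candidates have length 2 and `IF` is declared first.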

  11. Lexical analysis ("classical" approach: from RegEx to NFA to DFA)
      Specification: tokens as regular expressions + longest-matching rule + priorities
      Formalism: NFA, DFA
      Implementation: simulate NFA; simulate DFA (linear complexity)
      Output: program that translates raw text into stream of tokens

  12. Total NFA for ID, IF, NUM, REAL
      [State diagram with states 1 to 13. Accepting states are labeled ID, IF, NUM, REAL, whitespace, or error. Edge labels include a-z, 0-9, i, f, ".", "-", \n, blank, and other.]
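Simulating a DFA takes one transition per input character, which is where the linear complexity comes from. The sketch below is a hand-written DFA for only the IF and ID rules; the state numbering, the dead state 0, and the names `step`, `accept`, and `run` are assumptions for illustration and are not a transcription of the diagram above.

```sml
fun isIdChar c = Char.isLower c orelse Char.isDigit c

(* States: 1 = start, 2 = seen "i", 3 = seen "if", 4 = any other identifier,
   0 = dead state (no rule can match any more). *)
fun step (1, c) = if c = #"i" then 2 else if Char.isLower c then 4 else 0
  | step (2, c) = if c = #"f" then 3 else if isIdChar c then 4 else 0
  | step (3, c) = if isIdChar c then 4 else 0
  | step (4, c) = if isIdChar c then 4 else 0
  | step (_, _) = 0

(* Token recognized in each accepting state. *)
fun accept 2 = SOME "ID"
  | accept 3 = SOME "IF"
  | accept 4 = SOME "ID"
  | accept _ = NONE

(* Run the DFA over the whole string: one call to step per character. *)
fun run (s : string) : string option =
  accept (List.foldl (fn (c, q) => step (q, c)) 1 (String.explode s))
```

A real scanner would additionally remember the last accepting state it passed through, so that it can back up to the longest match when the automaton dies.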

  13. ML-Lex
      • Lexer generator, "built-in" part of SML/NJ
      • Accepts lexical specification, produces a scanner
      • Example specification

          (* SML declarations *)
          type lexresult = Tokens.token
          fun eof() = Tokens.EOF(0,0)
          %%
          (* Lex definitions *)
          digits=[0-9]+
          %%
          (* Regular Expressions and Actions *)
          if             => (Tokens.IF(yypos,yypos+2));
          [a-z][a-z0-9]* => (Tokens.ID(yytext,yypos,yypos + size yytext));
          {digits}       => (Tokens.NUM(Int.fromString yytext, yypos, yypos + size yytext));
          ({digits}"."[0-9]*)|([0-9]*"."{digits})
                         => (Tokens.REAL(Real.fromString yytext, yypos, yypos + size yytext));
          ("--"[a-z]*"\n")|(" "|"\n"|"\t")+
                         => (continue());
          .              => (ErrorMsg.error yypos "Illegal character"; continue());

  14. Lexer states
      • Helpful when handling different "kinds" of tokens
      • For example, use state
          • INITIAL in general lexing (automatic)
          • STRING when scanning the contents of a string
          • COMMENT when scanning a comment
      • Point: keep different concerns apart. Simpler!
      • Syntax:

          ...
          (* Regular Expressions and Actions *)
          <INITIAL>if             => (Tokens.IF(yypos,yypos+2));
          <INITIAL>[a-z][a-z0-9]* => (Tokens.ID(yytext,yypos,yypos + size yytext));
          ...
          <INITIAL>"\""           => (YYBEGIN STRING; continue());
          ...
          <STRING>.               => (continue());
          ...

  15. Lexical analysis (alternative, purely algebraic approach: from RegEx to DFA using regexp derivatives)
      Specification: tokens as regular expressions + longest-matching rule + priorities
      Formalism: NFA, DFA
      Implementation: simulate NFA; simulate DFA (linear complexity)
      Output: program that translates raw text into stream of tokens
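The idea behind regular expression derivatives can be sketched in a few lines of SML. The derivative of R with respect to a character c matches exactly those strings w for which R matches c followed by w. The datatype `re` and the functions `nullable`, `deriv`, and `matches` below are names chosen for this example; the sketch matches a string directly and does not build the DFA, which would additionally need simplification of derivatives so that only finitely many distinct states arise.

```sml
datatype re
  = Empty              (* matches no string at all *)
  | Eps                (* matches only the empty string *)
  | Chr of char        (* matches one given character *)
  | Alt of re * re     (* choice R1 | R2 *)
  | Seq of re * re     (* concatenation R1 R2 *)
  | Star of re         (* repetition R* *)

(* nullable r: does r match the empty string? *)
fun nullable Empty = false
  | nullable Eps = true
  | nullable (Chr _) = false
  | nullable (Alt (r, s)) = nullable r orelse nullable s
  | nullable (Seq (r, s)) = nullable r andalso nullable s
  | nullable (Star _) = true

(* deriv c r: what remains to be matched after r has consumed c. *)
fun deriv c Empty = Empty
  | deriv c Eps = Empty
  | deriv c (Chr d) = if c = d then Eps else Empty
  | deriv c (Alt (r, s)) = Alt (deriv c r, deriv c s)
  | deriv c (Seq (r, s)) =
      if nullable r
      then Alt (Seq (deriv c r, s), deriv c s)
      else Seq (deriv c r, s)
  | deriv c (Star r) = Seq (deriv c r, Star r)

(* A string matches if deriving by each character in turn leaves a
   nullable expression. *)
fun matches (r : re) (s : string) : bool =
  nullable (List.foldl (fn (c, acc) => deriv c acc) r (String.explode s))
```

For instance, with `val ab = Seq (Chr #"a", Star (Chr #"b"))`, the call `matches ab "abbb"` holds and `matches ab "ba"` does not.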

  16. More on SML [online demo]

  17. Warmup project

  18. Straight-line Programming Language
      • Toy programming language: no branching, no loops
      • Skip lexing and parsing issues
      • Focus on the "meaning": interpretation
      • Syntax

          Stm     → Stm ; Stm            (CompoundStm)
          Stm     → id := Exp            (AssignStm)
          Stm     → print ( ExpList )    (PrintStm)
          Exp     → id                   (IdExp)
          Exp     → num                  (NumExp)
          Exp     → Exp Binop Exp        (OpExp)
          Exp     → ( Stm , Exp )        (EseqExp)
          ExpList → Exp , ExpList        (PairExpList)
          ExpList → Exp                  (LastExpList)
          Binop   → +                    (Plus)
          Binop   → −                    (Minus)
          Binop   → ×                    (Times)
          Binop   → /                    (Div)

  19. Straight-line program
      • Source:
          a := 5 + 3; b := (print (a, a - 1), 10 * a); print (b)
      • Corresponding syntax tree:
          CompoundStm
            AssignStm
              a
              OpExp
                NumExp 5
                BinOp Plus
                NumExp 3
            CompoundStm
              AssignStm
                b
                EseqExp
                  PrintStm
                    PairExpList
                      IdExp a
                      LastExpList
                        OpExp
                          IdExp a
                          BinOp Minus
                          NumExp 1
                  OpExp
                    NumExp 10
                    BinOp Times
                    IdExp a
              PrintStm
                LastExpList
                  IdExp b

  20. SLP syntax representation
      • Grammar:

          Stm     → Stm ; Stm            (CompoundStm)
          Stm     → id := Exp            (AssignStm)
          Stm     → print ( ExpList )    (PrintStm)
          Exp     → id                   (IdExp)
          Exp     → num                  (NumExp)
          Exp     → Exp Binop Exp        (OpExp)
          Exp     → ( Stm , Exp )        (EseqExp)
          ExpList → Exp , ExpList        (PairExpList)
          ExpList → Exp                  (LastExpList)
          Binop   → +                    (Plus)
          Binop   → −                    (Minus)
          Binop   → ×                    (Times)
          Binop   → /                    (Div)

      • SML datatype declaration:

          type id = string

          datatype binop
            = Plus | Minus | Times | Div

          datatype stm
            = CompoundStm of stm * stm
            | AssignStm of id * exp
            | PrintStm of exp list
          and exp
            = IdExp of id
            | NumExp of int
            | OpExp of exp * binop * exp
            | EseqExp of stm * exp

  21. SLP syntax representation
      • Source program:
          a := 5 + 3; b := (print (a, a - 1), 10 * a); print (b)
        (syntax tree as on slide 19)
      • SML value:

          val prog =
            CompoundStm (
              AssignStm ("a", OpExp (NumExp 5, Plus, NumExp 3)),
              CompoundStm (
                AssignStm ("b",
                  EseqExp (PrintStm [IdExp "a", OpExp (…)],
                           OpExp (NumExp 10, …))),
                PrintStm [IdExp "b"]))
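To show how such a value is traversed, here is a small sketch that turns an SLP syntax tree back into source text using two mutually recursive functions. The function names `opStr`, `stmStr`, and `expStr` are hypothetical. The elided parts of `prog` are filled in from the source program shown on the slide, which is an assumption of this example rather than something the slide spells out. Nested `OpExp` values are printed without parentheses, so grouping is not preserved in general.

```sml
type id = string

datatype binop = Plus | Minus | Times | Div

datatype stm
  = CompoundStm of stm * stm
  | AssignStm of id * exp
  | PrintStm of exp list
and exp
  = IdExp of id
  | NumExp of int
  | OpExp of exp * binop * exp
  | EseqExp of stm * exp

fun opStr Plus = "+"
  | opStr Minus = "-"
  | opStr Times = "*"
  | opStr Div = "/"

(* One function per syntactic category, mirroring the datatype. *)
fun stmStr (CompoundStm (s1, s2)) = stmStr s1 ^ "; " ^ stmStr s2
  | stmStr (AssignStm (x, e)) = x ^ " := " ^ expStr e
  | stmStr (PrintStm es) =
      "print (" ^ String.concatWith ", " (map expStr es) ^ ")"
and expStr (IdExp x) = x
  | expStr (NumExp n) = Int.toString n
  | expStr (OpExp (l, b, r)) = expStr l ^ " " ^ opStr b ^ " " ^ expStr r
  | expStr (EseqExp (s, e)) = "(" ^ stmStr s ^ ", " ^ expStr e ^ ")"

(* The example program, with the elided subexpressions reconstructed
   from the source text (assumption). *)
val prog =
  CompoundStm (
    AssignStm ("a", OpExp (NumExp 5, Plus, NumExp 3)),
    CompoundStm (
      AssignStm ("b",
        EseqExp (PrintStm [IdExp "a", OpExp (IdExp "a", Minus, NumExp 1)],
                 OpExp (NumExp 10, Times, IdExp "a"))),
      PrintStm [IdExp "b"]))
```

The `and` keyword is what allows `stmStr` and `expStr` to call each other, just as `stm` and `exp` refer to each other in the datatype.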

  22. Project assignment
      • Follow descriptions p10-12 in MCIML
      • "Modularity principles" p9-10: discussed on Friday, may be ignored at first
      • Clarification:
          • Let bindings are OK
          • References, arrays, and ref update (:=) are not OK

  23. Summary
      • Warm-up project: Program in SML!
          • Straight-line programming language, no lexing/parsing involved
          • Express programs: use abstract syntax tree datatype
          • Project specified on website, essentially as in the book
      • Lexical analysis
          • Avoid complexity in grammar. Use lexer
          • Based on regular expressions
          • Implementation via RE → NFA → DFA (theory assumed known)
          • Alternatives: via RE derivatives → DFA
      • Tools: ML-Lex
          • Scanner generator, outputs SML code from spec
          • Note lexer states
