Warm-up project (Aslan Askarov, aslan@cs.au.dk)

SLIDE 1

Compilation 2014

Warm-up project

Aslan Askarov aslan@cs.au.dk

Revised from slides by E. Ernst

SLIDE 2

Straight-line Programming Language

  • Toy programming language: no branching, no loops
  • Skip lexing and parsing issues
  • Focus on the “meaning” – interpretation
  • Syntax

    Stm → Stm ; Stm               (CompoundStm)
    Stm → id := Exp               (AssignStm)
    Stm → print ( ExpList )       (PrintStm)
    Exp → id                      (IdExp)
    Exp → num                     (NumExp)
    Exp → Exp BinOp Exp           (OpExp)
    Exp → ( Stm , Exp )           (EseqExp)
    ExpList → Exp , ExpList       (PairExpList)
    ExpList → Exp                 (LastExpList)
    BinOp → +                     (Plus)
    BinOp → –                     (Minus)
    BinOp → ×                     (Times)
    BinOp → /                     (Div)

SLIDE 3

Straight-line program

  • Source:

    a := 5 + 3; b := (print (a, a - 1), 10 * a); print (b)

  • Corresponding syntax tree:

    CompoundStm
      AssignStm
        a
        OpExp
          NumExp 5
          BinOp Plus
          NumExp 3
      CompoundStm
        AssignStm
          b
          EseqExp
            PrintStm
              PairExpList
                IdExp a
                LastExpList
                  OpExp
                    IdExp a
                    BinOp Minus
                    NumExp 1
            OpExp
              NumExp 10
              BinOp Times
              IdExp a
        PrintStm
          LastExpList
            IdExp b

SLIDE 4

SLP syntax representation datatype

  • SML declaration

    Stm → Stm ; Stm               (CompoundStm)
    Stm → id := Exp               (AssignStm)
    Stm → print ( ExpList )       (PrintStm)
    Exp → id                      (IdExp)
    Exp → num                     (NumExp)
    Exp → Exp BinOp Exp           (OpExp)
    Exp → ( Stm , Exp )           (EseqExp)
    ExpList → Exp , ExpList       (PairExpList)
    ExpList → Exp                 (LastExpList)
    BinOp → +                     (Plus)
    BinOp → –                     (Minus)
    BinOp → ×                     (Times)
    BinOp → /                     (Div)

    type id = string

    datatype binop = Plus | Minus | Times | Div

    datatype stm = CompoundStm of stm * stm
                 | AssignStm of id * exp
                 | PrintStm of exp list
         and exp = IdExp of id
                 | NumExp of int
                 | OpExp of exp * binop * exp
                 | EseqExp of stm * exp

SLIDE 5

SLP syntax representation

  • Source program:

    a := 5 + 3; b := (print (a, a - 1), 10 * a); print (b)

  • SML value:

    val prog =
      CompoundStm (
        AssignStm ("a", OpExp (NumExp 5, Plus, NumExp 3)),
        CompoundStm (
          AssignStm ("b",
            EseqExp (PrintStm [IdExp "a", OpExp (…)],
                     OpExp (NumExp 10, …))),
          PrintStm [IdExp "b"]))

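To make the correspondence between the source text and the tree concrete, here is a minimal sketch that writes the example program out in full and turns it back into concrete syntax. The helper names `opToString`, `stmToString` and `expToString` are assumptions made for this illustration, not part of the project description, and the printer does not add parentheses around nested operator expressions.

```sml
(* Abstract syntax, repeated from the SML declaration slide. *)
type id = string
datatype binop = Plus | Minus | Times | Div
datatype stm = CompoundStm of stm * stm
             | AssignStm of id * exp
             | PrintStm of exp list
     and exp = IdExp of id
             | NumExp of int
             | OpExp of exp * binop * exp
             | EseqExp of stm * exp

(* The example program, with the subexpressions a - 1 and 10 * a written out. *)
val prog =
  CompoundStm (
    AssignStm ("a", OpExp (NumExp 5, Plus, NumExp 3)),
    CompoundStm (
      AssignStm ("b",
        EseqExp (PrintStm [IdExp "a", OpExp (IdExp "a", Minus, NumExp 1)],
                 OpExp (NumExp 10, Times, IdExp "a"))),
      PrintStm [IdExp "b"]))

(* Hypothetical helpers: print a tree as straight-line source text. *)
fun opToString Plus  = "+"
  | opToString Minus = "-"
  | opToString Times = "*"
  | opToString Div   = "/"

fun stmToString (CompoundStm (s1, s2)) = stmToString s1 ^ "; " ^ stmToString s2
  | stmToString (AssignStm (x, e))     = x ^ " := " ^ expToString e
  | stmToString (PrintStm es) =
      "print (" ^ String.concatWith ", " (map expToString es) ^ ")"
and expToString (IdExp x)  = x
  | expToString (NumExp n) = Int.toString n
  | expToString (OpExp (l, oper, r)) =
      expToString l ^ " " ^ opToString oper ^ " " ^ expToString r
  | expToString (EseqExp (s, e)) =
      "(" ^ stmToString s ^ ", " ^ expToString e ^ ")"

val source = stmToString prog
```

The two mutually recursive functions mirror the two mutually recursive datatypes: one clause per constructor, which is the shape most functions over this syntax tree will take.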

SLIDE 6

Project assignment

  • Follow the description on pp. 10-12 of MCIML (Modern Compiler Implementation in ML)
  • “Modularity principles”, pp. 9-10: discussed on Friday, may be ignored at first

SLIDE 7

Lexical analysis

SLIDE 8

Lexical analysis

High-level source code → Lexing → Parsing → Elaboration → … → Low-level target code

SLIDE 9

Lexical analysis

Input: stream of characters

    i f ( x > 0 ) \n \t t h e n 1 \n \t e l s e 0

Output: stream of tokens in our language

    IF LPAREN ID ("x") GT INT (0) RPAREN THEN INT (1) ELSE INT (0)

  • Discards comments, whitespace, newline and tab characters, and preprocessor directives
  • First phase in the compilation

SLIDE 10

Tokens

    Type      Examples
    ID        foo   n14   a’   my-fun
    INT       73   0   070
    REAL      0.0   .5   10.
    IF        if
    COMMA     ,
    LPAREN    (
    ASGMT     :=

SLIDE 11

Non-tokens

    Type                       Examples
    comments                   /* dead code */   // comment   (* nest (*ed*) *)
    preprocessor directives    #define N 10   #include <stdio.h>
    whitespace

SLIDE 12

Token data structure

  • Many tokens need no associated data, e.g.:
    IF, COMMA, LPAREN, RPAREN, ASGMT
  • Some tokens carry an associated string:
    ID ("my-fun")
  • Some tokens carry associated data of other types:
    INT (73), INT (1), FLOAT (IEEE754, 1001111100…)
  • Tokens may include useful additional information:
    start/end position in the input file (line number + column, or character position)
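One way to express the points above is a single SML datatype in which every token carries its start and end position and some tokens carry extra data. This is only a sketch: the type name `pos`, the use of character offsets, and the helpers `tokenToString` and `span` are assumptions made for illustration.

```sml
(* Assumption: positions are character offsets, start inclusive, end exclusive. *)
type pos = int

datatype token
  = IF     of pos * pos                 (* no associated data *)
  | COMMA  of pos * pos
  | LPAREN of pos * pos
  | RPAREN of pos * pos
  | ASGMT  of pos * pos
  | ID     of string * pos * pos        (* associated string *)
  | INT    of int * pos * pos           (* associated data of other types *)
  | FLOAT  of real * pos * pos

(* Hypothetical helper: readable name of a token, without positions. *)
fun tokenToString (IF _)            = "IF"
  | tokenToString (COMMA _)         = "COMMA"
  | tokenToString (LPAREN _)        = "LPAREN"
  | tokenToString (RPAREN _)        = "RPAREN"
  | tokenToString (ASGMT _)         = "ASGMT"
  | tokenToString (ID (s, _, _))    = "ID(" ^ s ^ ")"
  | tokenToString (INT (n, _, _))   = "INT(" ^ Int.toString n ^ ")"
  | tokenToString (FLOAT (r, _, _)) = "FLOAT(" ^ Real.toString r ^ ")"

(* Hypothetical helper: the (start, end) position pair of a token. *)
fun span (IF p)              = p
  | span (COMMA p)           = p
  | span (LPAREN p)          = p
  | span (RPAREN p)          = p
  | span (ASGMT p)           = p
  | span (ID (_, l, r))      = (l, r)
  | span (INT (_, l, r))     = (l, r)
  | span (FLOAT (_, l, r))   = (l, r)

(* Tokens for the input text "my-fun := 73". *)
val toks = [ID ("my-fun", 0, 6), ASGMT (7, 9), INT (73, 10, 12)]
```

Keeping the positions inside the token lets later phases report errors against the original source without consulting the lexer again.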

SLIDE 13

Q/A

  • Consider the source program

    var δ := 0.0

  • Language: case sensitive, ASCII
  • How should the error of using δ be reported?

    FileName:Line.Col: Illegal character δ
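If the lexer only tracks a character offset, the line and column for such a message have to be recovered from the source text. The following is a minimal sketch under that assumption; `lineCol`, `illegalChar` and the file name `demo.tig` are made up for the example, and `#` stands in for δ because a non-ASCII character occupies more than one byte in a UTF-8 encoded SML string.

```sml
(* Hypothetical helper: convert a 0-based character offset into a
   1-based (line, column) pair by scanning the text before it. *)
fun lineCol (src : string, pos : int) : int * int =
  let
    fun go (i, line, col) =
      if i >= pos orelse i >= size src then (line, col)
      else if String.sub (src, i) = #"\n" then go (i + 1, line + 1, 1)
      else go (i + 1, line, col + 1)
  in
    go (0, 1, 1)
  end

(* Hypothetical helper: build a message in the FileName:Line.Col format. *)
fun illegalChar (file : string, src : string, pos : int) : string =
  let
    val (line, col) = lineCol (src, pos)
  in
    file ^ ":" ^ Int.toString line ^ "." ^ Int.toString col
    ^ ": Illegal character " ^ String.str (String.sub (src, pos))
  end

val src = "var x := 0.0\nvar # := 1"
val msg = illegalChar ("demo.tig", src, 17)   (* offset 17 is the '#' *)
```

Rescanning from the start of the file on every error is wasteful; a lexer would more likely record the offset of each newline as it goes and look the line up from that list.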

SLIDE 14

Regular expressions

  • We can use regular expressions to specify programming language tokens
  • Regular expressions:
    • Expected to be well-known
    • Syntax:
      • symbol   a
      • choice   x | y
      • concat   x y
      • empty    ε
      • repeat   x*
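The five syntactic forms above map directly onto an SML datatype. As a hedged illustration of what such expressions mean, the sketch below matches strings using Brzozowski derivatives; this is a different technique from the NFA/DFA route taken later in these slides, chosen here only because it fits in a few lines. The constructor `Void` (the expression that matches nothing) and all function names are assumptions of this sketch.

```sml
datatype re
  = Sym of char          (* symbol  a     *)
  | Alt of re * re       (* choice  x | y *)
  | Cat of re * re       (* concat  x y   *)
  | Eps                  (* empty   ε     *)
  | Star of re           (* repeat  x*    *)
  | Void                 (* matches nothing; only produced by deriv *)

(* Does the expression match the empty string? *)
fun nullable (Sym _)      = false
  | nullable (Alt (x, y)) = nullable x orelse nullable y
  | nullable (Cat (x, y)) = nullable x andalso nullable y
  | nullable Eps          = true
  | nullable (Star _)     = true
  | nullable Void         = false

(* deriv c r matches exactly the strings s such that r matches c followed by s. *)
fun deriv c (Sym a)      = if a = c then Eps else Void
  | deriv c (Alt (x, y)) = Alt (deriv c x, deriv c y)
  | deriv c (Cat (x, y)) =
      if nullable x then Alt (Cat (deriv c x, y), deriv c y)
      else Cat (deriv c x, y)
  | deriv c Eps          = Void
  | deriv c (Star x)     = Cat (deriv c x, Star x)
  | deriv c Void         = Void

(* A string matches if the expression left after consuming it is nullable. *)
fun matches r s = nullable (foldl (fn (c, r') => deriv c r') r (explode s))

(* Example: a (a | b)* *)
val example = Cat (Sym #"a", Star (Alt (Sym #"a", Sym #"b")))
```

The derivative expressions are not simplified here, so they can grow with the input length; a scanner generator avoids that by compiling the expressions to an automaton once.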
SLIDE 15

Regular expressions used for scanning

  • Examples
    • if                                       (IF);
    • [a-z][a-z0-9]*                           (ID);
    • [0-9]+                                   (NUM);
    • ([0-9]+"."[0-9]*) | ([0-9]*"."[0-9]+)    (REAL);
    • ("--"[a-z]*"\n") | (" "|"\t")            (continue());
    • .                                        (error (); continue());
SLIDE 16

Resolving ambiguities

  • Rule: when a string can match multiple tokens, the longest matching token wins.

    The input  i f x > 0  starts with ID ("ifx"), not with IF followed by ID ("x").

  • We also need to specify priorities if we match several tokens of the same length.
  • Usual rule: earliest declaration wins.

    With the rules

      • if (IF);
      • [a-z][a-z0-9]* (ID);

    the input  i f  yields IF, not ID ("if").
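The two disambiguation rules can be sketched with hand-written recognizers instead of generated automata. Everything named below (`matchIf`, `matchId`, `matchNum`, `best`, `lex`, and the reduced token type) is an assumption made for this illustration. Each recognizer returns the length of the longest prefix it accepts, and the rule list is ordered by priority.

```sml
datatype token = IF | ID of string | NUM of int

(* Length of the longest prefix of cs whose characters all satisfy p. *)
fun spanLen p cs =
  let
    fun go (n, c :: rest) = if p c then go (n + 1, rest) else n
      | go (n, [])        = n
  in
    go (0, cs)
  end

fun isIdChar c = Char.isLower c orelse Char.isDigit c

(* One recognizer per rule; 0 means "no match". *)
fun matchIf (#"i" :: #"f" :: _) = 2                       (* if             *)
  | matchIf _                   = 0
fun matchId (c :: rest) =                                 (* [a-z][a-z0-9]* *)
      if Char.isLower c then 1 + spanLen isIdChar rest else 0
  | matchId [] = 0
fun matchNum cs = spanLen Char.isDigit cs                 (* [0-9]+         *)

(* Rules in declaration order: earlier entries have higher priority. *)
val rules =
  [ (matchIf,  fn _ => IF),
    (matchId,  fn s => ID s),
    (matchNum, fn s => NUM (valOf (Int.fromString s))) ]

(* Longest match wins; a later rule replaces the best so far only when it
   is strictly longer, so ties go to the earliest declaration. *)
fun best cs =
  foldl (fn ((m, act), (len, chosen)) =>
           let val n = m cs
           in if n > len then (n, SOME act) else (len, chosen) end)
        (0, NONE) rules

fun tokenize [] = []
  | tokenize (cs as c :: rest) =
      if Char.isSpace c then tokenize rest
      else
        case best cs of
          (n, SOME act) =>
            act (implode (List.take (cs, n))) :: tokenize (List.drop (cs, n))
        | (_, NONE) => raise Fail ("illegal character " ^ String.str c)

val lex = tokenize o explode
```

With these rules, `lex "ifx"` produces a single identifier because the ID rule matches three characters while the IF rule matches two, and `lex "if"` produces IF because both rules match two characters and IF is declared first.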

SLIDE 17

Lexical analysis

Specification:   tokens as regular expressions, + longest-matching rule, + priorities
Formalism:       NFA, DFA
Implementation:  simulate NFA or simulate DFA; linear complexity
Output:          program that translates raw text into a stream of tokens
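As a sketch of the "simulate DFA" step, the code below runs a small hand-built DFA for just the IF and ID tokens and applies the longest-matching rule by remembering the last accepting state it passed through. The state numbering, the dead state 0, and the names `step`, `final` and `longest` are assumptions of this example and do not reproduce the automaton on the next slide.

```sml
datatype class = IF | ID

fun isIdChar c = Char.isLower c orelse Char.isDigit c

(* Transition function. State 1 is the start state, state 0 is the dead state. *)
fun step (1, c) = if c = #"i" then 2 else if Char.isLower c then 4 else 0
  | step (2, c) = if c = #"f" then 3 else if isIdChar c then 4 else 0
  | step (3, c) = if isIdChar c then 4 else 0
  | step (4, c) = if isIdChar c then 4 else 0
  | step (_, _) = 0

(* Accepting states and the token class each one reports. State 3 is reached
   only by "if", where IF has priority over ID. *)
fun final 2 = SOME ID
  | final 3 = SOME IF
  | final 4 = SOME ID
  | final _ = NONE

(* Longest match at the start of s: the class and length of the lexeme,
   or NONE if no prefix of s is a token. *)
fun longest (s : string) : (class * int) option =
  let
    fun run (state, i, last) =
      let
        (* Remember the most recent accepting state and how far we had read. *)
        val last' = case final state of
                      SOME cls => SOME (cls, i)
                    | NONE     => last
      in
        if i >= size s then last'
        else
          case step (state, String.sub (s, i)) of
            0    => last'
          | next => run (next, i + 1, last')
      end
  in
    run (1, 0, NONE)
  end
```

Each input character causes exactly one transition, which is where the linear running time of DFA simulation comes from; the priority between IF and ID is resolved once, when the accepting states are labelled, rather than at run time.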

SLIDE 18

Total NFA for ID,IF,NUM,REAL

[Figure: combined automaton with states 1 to 13. Accepting states are labelled ID, IF, NUM, REAL and error; edges are labelled with characters and character classes: i, f, a-z, a-h,j-z, a-e,g-z,0-9, 0-9,a-z, 0-9, ".", \n, "blank etc.", whitespace and "other".]

SLIDE 19

ML-Lex

  • Lexer generator, “built-in” part of SML/NJ
  • Accepts lexical specification, produces a scanner
  • Example specification

    (* SML declarations *)
    type lexresult = Tokens.token
    fun eof() = Tokens.EOF(0,0)
    %%
    (* Lex definitions *)
    digits=[0-9]+;
    %%
    (* Regular Expressions and Actions *)
    if                  => (Tokens.IF(yypos, yypos+2));
    [a-z][a-z0-9]*      => (Tokens.ID(yytext, yypos, yypos + size yytext));
    {digits}            => (Tokens.NUM(valOf (Int.fromString yytext),
                                       yypos, yypos + size yytext));
    ({digits}"."[0-9]*)|([0-9]*"."{digits})
                        => (Tokens.REAL(valOf (Real.fromString yytext),
                                        yypos, yypos + size yytext));
    ("--"[a-z]*"\n")|(" "|"\n"|"\t")+
                        => (continue());
    .                   => (ErrorMsg.error yypos "Illegal character"; continue());
SLIDE 20

Lexer states

  • Helpful when handling different “kinds” of tokens
  • For example, use state
    • INITIAL in general lexing (automatic)
    • STRING when scanning the contents of a string
    • COMMENT when scanning a comment
  • Point: keep different concerns apart – simpler!
  • Syntax:

    ...
    (* Regular Expressions and Actions *)
    <INITIAL>if             => (Tokens.IF(yypos, yypos+2));
    <INITIAL>[a-z][a-z0-9]* => (Tokens.ID(yytext, yypos, yypos + size yytext));
    ...
    <INITIAL>"\""           => (YYBEGIN STRING; continue());
    ...
    <STRING>.               => (continue());
    ...
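The idea behind lexer states can be imitated in plain SML by passing the current state along while scanning. The sketch below is not ML-Lex code; it is a hand-written illustration, with the made-up function `strings`, that collects the contents of double-quoted string literals and ignores everything else. Escape sequences are not handled, and an unterminated string is silently dropped.

```sml
datatype state = INITIAL | STRING

(* Collect the contents of all "..." literals in the input. *)
fun strings (s : string) : string list =
  let
    (* go (state, remaining input, buffer for the current string, results) *)
    fun go (_, [], _, acc) = rev acc
        (* INITIAL: an opening quote switches to STRING, like YYBEGIN STRING. *)
      | go (INITIAL, #"\"" :: cs, _, acc) = go (STRING, cs, [], acc)
      | go (INITIAL, _ :: cs, buf, acc)   = go (INITIAL, cs, buf, acc)
        (* STRING: a closing quote emits the buffer and returns to INITIAL. *)
      | go (STRING, #"\"" :: cs, buf, acc) =
          go (INITIAL, cs, [], implode (rev buf) :: acc)
      | go (STRING, c :: cs, buf, acc)     = go (STRING, cs, c :: buf, acc)
  in
    go (INITIAL, explode s, [], [])
  end
```

The same character, the double quote, means different things depending on the state, which is exactly the separation of concerns that lexer states provide.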

SLIDE 21

Summary

  • Warm-up project: program in SML!
    • Straight-line programming language, no lexing/parsing involved
    • Express programs: use abstract syntax tree datatype
    • Project specified on website, essentially as in the book
  • Lexical analysis
    • Avoid complexity in grammar. Use lexer
    • Based on regular expressions. Implementation via NFA/DFA
    • Theory assumed known
  • Tools: ML-Lex
    • Scanner generator, outputs SML code from spec
    • Note lexer states