Compiling Techniques Lecture 3: Introduction to Lexical Analysis

Languages and Syntax Lexical Analysis Compiling Techniques Lecture 3: Introduction to Lexical Analysis Christophe Dubach 22 September 2017 Christophe Dubach Compiling Techniques

Reminder
Action: create an account and subscribe to the course on Piazza.

Coursework
Starts this afternoon (14.10 - 16.00).
The coursework description is updated regularly; check frequently or "watch" http://bitbucket.org/cdubach/ct-17-18/
Register for a Bitbucket account and fill in the Google form (instructions online): https://docs.google.com/forms/d/1z2EthflazoU2bvfnJlrCWB_-AqB4ZxIgsJW-8SWiXyM

The Lexer
[Figure: front-end pipeline. Source code → Scanner → (char) → Tokeniser → (token) → Parser → (AST) → Semantic Analyser → (AST) → IR Generator → (IR). The Scanner and Tokeniser together form the Lexer; the stages report errors.]
Maps the character stream into words, the basic units of syntax.
Assigns a syntactic category (part of speech) to each word.
    x = x + y ; becomes ID(x) EQ ID(x) PLUS ID(y) SC
    word ≅ lexeme
    syntactic category ≅ part of speech
In casual speech, we call the pair a token.
Typical tokens: number, identifier, +, −, new, while, if, ...
The scanner eliminates white space (including comments).

Table of contents
1 Languages and Syntax
    Context-free Language
    Regular Expression
    Regular Languages
2 Lexical Analysis
    Building a Lexer
    Ambiguous Grammar

Context-free Language
Context-free syntax is specified with a grammar:
    SheepNoise → SheepNoise baa
               | baa
This grammar defines the set of noises that a sheep makes under normal circumstances.
It is written in a variant of Backus-Naur Form (BNF).
Formally, a grammar G = (S, N, T, P), where:
    S is the start symbol
    N is a set of non-terminal symbols
    T is a set of terminal symbols or words
    P is a set of productions or rewrite rules (P : N → (N ∪ T)*)
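To make the two SheepNoise productions concrete, here is a minimal Java sketch that derives a sentence by applying the recursive rule a chosen number of times and then the base rule once. The class name `SheepNoise` and the method `derive` are hypothetical, introduced only for illustration.

```java
final class SheepNoise {
    // Derives a sentence by applying "SheepNoise -> SheepNoise baa"
    // (n - 1) times and then "SheepNoise -> baa" once.
    static String derive(int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        StringBuilder form = new StringBuilder("SheepNoise");
        for (int i = 1; i < n; i++) {
            form.append(" baa"); // SheepNoise -> SheepNoise baa
        }
        // SheepNoise -> baa removes the last non-terminal.
        return form.toString().replace("SheepNoise", "baa");
    }

    public static void main(String[] args) {
        System.out.println(derive(1));
        System.out.println(derive(3));
    }
}
```

Every sentence of the language is one or more repetitions of baa, which is why the grammar needs both a recursive production and a base case.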

Example
    1  goal → expr
    2  expr → expr op term
    3       | term
    4  term → number
    5       | id
    6  op   → +
    7       | −

    S = goal
    T = { number, id, +, − }
    N = { goal, expr, term, op }
    P = { 1, 2, 3, 4, 5, 6, 7 }
This grammar defines simple expressions with addition and subtraction over "number" and "id".
This grammar, like many, falls in a class called "context-free grammars", abbreviated CFG.

Regular Expression
Grammars can often be simplified and shortened using an augmented BNF notation where:
    x* is the Kleene closure: zero or more occurrences of x
    x+ is the positive closure: one or more occurrences of x
    [x] is an option: zero or one occurrence of x
Example: identifier syntax
    identifier ::= letter (letter | digit)*
    digit      ::= "0" | ... | "9"
    letter     ::= "a" | ... | "z" | "A" | ... | "Z"
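The identifier rule maps almost symbol for symbol onto the regular expression syntax of common libraries. A minimal sketch using `java.util.regex`; the class name `IdentifierPattern` and the helper `isIdentifier` are illustrative and not part of the lecture.

```java
import java.util.regex.Pattern;

final class IdentifierPattern {
    // identifier ::= letter (letter | digit)*
    private static final Pattern IDENTIFIER =
            Pattern.compile("[a-zA-Z][a-zA-Z0-9]*");

    // True if the whole string is a single identifier.
    static boolean isIdentifier(String s) {
        return IDENTIFIER.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isIdentifier("cnt1"));
        System.out.println(isIdentifier("1cnt"));
    }
}
```

The character class `[a-zA-Z]` plays the role of letter, and the trailing `*` is the Kleene closure from the notation above.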

Exercise: write the grammar of a signed natural number.

Regular Language
Definition: a language is regular if it can be expressed with a single regular expression or with multiple non-recursive regular expressions.
Regular languages can be used to specify the words to be translated to tokens by the lexer.
Regular languages can be recognised by a finite state machine.
Using results from automata theory and the theory of algorithms, we can automatically build recognisers from regular expressions.
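As an illustration of recognising a regular language with a finite state machine, the sketch below hand-codes a three-state automaton for the identifier rule letter (letter | digit)*. The state names and the class `IdentifierDfa` are assumptions made for this example; a generated recogniser would typically use a transition table instead.

```java
final class IdentifierDfa {
    // Hypothetical state numbering for this sketch.
    private static final int START = 0;
    private static final int IN_IDENT = 1; // the only accepting state
    private static final int REJECT = 2;   // dead state

    private static boolean isLetter(char ch) {
        return (ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z');
    }

    private static boolean isDigit(char ch) {
        return ch >= '0' && ch <= '9';
    }

    // Transition function: one step of the machine.
    private static int step(int state, char ch) {
        switch (state) {
            case START:
                return isLetter(ch) ? IN_IDENT : REJECT;
            case IN_IDENT:
                return (isLetter(ch) || isDigit(ch)) ? IN_IDENT : REJECT;
            default:
                return REJECT;
        }
    }

    // Runs the machine over the whole input and checks the final state.
    static boolean accepts(String input) {
        int state = START;
        for (int i = 0; i < input.length(); i++) {
            state = step(state, input.charAt(i));
        }
        return state == IN_IDENT;
    }

    public static void main(String[] args) {
        System.out.println(accepts("foo42"));
        System.out.println(accepts("42foo"));
    }
}
```

The machine needs no memory beyond its current state, which is the property that separates regular languages from context-free ones.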

Regular language to program
Given the following:
    c is a lookahead character;
    next() consumes the next character;
    error() quits with an error message; and
    first(exp) is the set of initial characters of exp.

Regular language to program
Then we can build a program to recognise a regular language if the grammar is left-parsable.

    RE                     pr(RE)
    "x"                    if (c == 'x') next(); else error();
    (exp)                  pr(exp);
    [exp]                  if (c in first(exp)) pr(exp);
    exp*                   while (c in first(exp)) pr(exp);
    exp+                   pr(exp); while (c in first(exp)) pr(exp);
    fact1 ... factn        pr(fact1); ...; pr(factn);
    term1 | ... | termn    switch (c) {
                             case c in first(term1): pr(term1); break;
                             case ...: ...; break;
                             case c in first(termn): pr(termn); break;
                             default: error();
                           }
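The translation scheme can be applied by hand to a small rule. The sketch below does so for a hypothetical rule `decimal ::= digit+ [ "." digit+ ]`, which is not from the lecture. The names `c`, `next()` and `error()` follow the slide; the end-of-input marker `'\0'`, the exception type and the `recognises` wrapper are assumptions of this example.

```java
final class DecimalRecogniser {
    private final String input;
    private int pos = 0;
    private char c; // lookahead character; '\0' marks end of input (assumption)

    DecimalRecogniser(String input) {
        this.input = input;
        this.c = input.isEmpty() ? '\0' : input.charAt(0);
    }

    // Consumes the current character and loads the next lookahead.
    private void next() {
        pos++;
        c = pos < input.length() ? input.charAt(pos) : '\0';
    }

    private void error() {
        throw new IllegalStateException("unexpected character at position " + pos);
    }

    // c in first(digit)
    private boolean inFirstDigit() {
        return c >= '0' && c <= '9';
    }

    // digit ::= "0" | ... | "9"
    private void digit() {
        if (inFirstDigit()) next(); else error();
    }

    // digit+  becomes  pr(digit); while (c in first(digit)) pr(digit);
    private void digits() {
        digit();
        while (inFirstDigit()) digit();
    }

    // decimal ::= digit+ [ "." digit+ ]
    private void decimal() {
        digits();
        if (c == '.') { // [exp] becomes if (c in first(exp)) pr(exp);
            next();
            digits();
        }
    }

    // True if the whole string matches the decimal rule.
    static boolean recognises(String s) {
        DecimalRecogniser r = new DecimalRecogniser(s);
        try {
            r.decimal();
            return r.c == '\0'; // all input must be consumed
        } catch (IllegalStateException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(recognises("12.5"));
        System.out.println(recognises("12."));
    }
}
```

Each method corresponds to one row of the table, so the structure of the program mirrors the structure of the regular expression.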

Definition: left-parsable
A grammar is left-parsable if:
    term1 | ... | termn    The terms do not share any initial symbols.
    fact1 ... factn        If fact i contains the empty symbol, then fact i and fact i+1 do not share any common initial symbols.
    [exp], exp*            The initial symbols of exp cannot contain a symbol which belongs to the first set of an expression following exp.
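The first condition, that the alternatives of a choice share no initial symbols, can be checked mechanically once the first sets are known. A minimal sketch, assuming each first set is given as a `Set<Character>`; the class `FirstSets` and the method `pairwiseDisjoint` are hypothetical names.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class FirstSets {
    // True if no symbol appears in more than one of the given first sets.
    static boolean pairwiseDisjoint(List<Set<Character>> firstSets) {
        Set<Character> seen = new HashSet<>();
        for (Set<Character> first : firstSets) {
            for (char ch : first) {
                if (!seen.add(ch)) {
                    return false; // ch starts two different alternatives
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Set<Character> letters = new HashSet<>(Arrays.asList('a', 'b'));
        Set<Character> digits = new HashSet<>(Arrays.asList('0', '1'));
        System.out.println(pairwiseDisjoint(Arrays.asList(letters, digits)));
    }
}
```

When the check succeeds, a single lookahead character is enough to select the right alternative in the generated switch.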

Example: Recognising identifiers
    void ident() {
        if (c is in [a-zA-Z]) letter();
        else error();
        while (c is in [a-zA-Z0-9]) {
            switch (c) {
                case c is in [a-zA-Z]: letter(); break;
                case c is in [0-9]:    digit();  break;
                default:               error();
            }
        }
    }
    void letter() { ... }
    void digit() { ... }

Example: Simplified Java version
    void ident() {
        if (Character.isLetter(c)) next();
        else error();
        while (Character.isLetterOrDigit(c)) next();
    }

Role of the lexical analyser
The main role of the lexical analyser (or lexer) is to read a bit of the input and return a lexeme (or token).
    class Lexer {
        public Token nextToken() {
            // return the next token, ignoring white spaces
        }
        ...
    }
White spaces are usually ignored by the lexer. White spaces are:
    white characters (tabulation, newline, ...)
    comments (any character following "//" or enclosed between "/*" and "*/")
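The white space rules can be sketched as a helper that a lexer would call before reading each token. This is an illustrative sketch, not the course's implementation: the class `WhitespaceSkipper`, the method `skip` and the choice to throw on an unterminated block comment are all assumptions.

```java
final class WhitespaceSkipper {
    // Returns the index of the first character at or after 'start' that is
    // neither a white character nor part of a comment.
    static int skip(String src, int start) {
        int i = start;
        while (i < src.length()) {
            if (Character.isWhitespace(src.charAt(i))) {
                i++;
            } else if (src.startsWith("//", i)) {
                // Line comment: runs up to (not including) the newline.
                while (i < src.length() && src.charAt(i) != '\n') {
                    i++;
                }
            } else if (src.startsWith("/*", i)) {
                // Block comment: runs up to and including the closing "*/".
                int end = src.indexOf("*/", i + 2);
                if (end < 0) {
                    throw new IllegalStateException("unterminated comment");
                }
                i = end + 2;
            } else {
                break; // start of a real token
            }
        }
        return i;
    }

    public static void main(String[] args) {
        String src = "  // note\n /* more */ x";
        System.out.println(src.charAt(skip(src, 0)));
    }
}
```

Looping until nothing more can be skipped matters, because comments and white characters may alternate any number of times before the next token.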

What is a token?
A token consists of a token class and other additional information.
Example: some token classes
    IDENTIFIER     → foo, main, cnt, ...
    NUMBER         → 0, -12, 1000, ...
    STRING_LITERAL → "Hello world!", "a", ...
    EQ             → ==
    ASSIGN         → =
    PLUS           → +
    LPAR           → (
    ...            → ...

    class Token {
        TokenClass tokenClass; // Java enumeration
        String data;           // stores number or string
        Position pos;          // line/column number in source
    }

Example
Given the following C program:
    int foo(int i) {
        return i+2;
    }
the lexer will return:
    INT IDENTIFIER("foo") LPAR INT IDENTIFIER("i") RPAR LBRA
    RETURN IDENTIFIER("i") PLUS NUMBER("2") SEMICOLON RBRA

A Lexer for Simple Arithmetic Expressions
Example: BNF syntax
    identifier ::= letter (letter | digit)*
    digit      ::= "0" | ... | "9"
    letter     ::= "a" | ... | "z" | "A" | ... | "Z"
    number     ::= digit+
    plus       ::= "+"
    minus      ::= "-"

Example: token definition
    class Token {
        enum TokenClass {
            IDENTIFIER,
            NUMBER,
            PLUS,
            MINUS,
        }
        // fields
        final TokenClass tokenClass;
        final String data;
        final Position position;
        // constructors
        Token(TokenClass tc) { ... }
        Token(TokenClass tc, String data) { ... }
        ...
    }
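Putting the token definition and the BNF rules for identifier, number, plus and minus together, a complete lexer for this small language might look like the sketch below. It is one possible implementation under stated assumptions: the `Position` field is omitted, an `EOF` token class is added to signal end of input, and the names `ArithLexer` and `tokenise` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class ArithLexer {
    enum TokenClass { IDENTIFIER, NUMBER, PLUS, MINUS, EOF } // EOF is an addition

    static final class Token {
        final TokenClass tokenClass;
        final String data; // lexeme for IDENTIFIER and NUMBER, "" otherwise

        Token(TokenClass tc) { this(tc, ""); }
        Token(TokenClass tc, String data) { this.tokenClass = tc; this.data = data; }

        @Override public String toString() {
            return data.isEmpty() ? tokenClass.toString() : tokenClass + "(" + data + ")";
        }
    }

    private final String src;
    private int pos = 0;

    ArithLexer(String src) { this.src = src; }

    private static boolean isLetter(char ch) {
        return (ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z');
    }

    private static boolean isDigit(char ch) { return ch >= '0' && ch <= '9'; }

    // Returns the next token, ignoring white characters.
    Token nextToken() {
        while (pos < src.length() && Character.isWhitespace(src.charAt(pos))) pos++;
        if (pos >= src.length()) return new Token(TokenClass.EOF);
        char ch = src.charAt(pos);
        if (ch == '+') { pos++; return new Token(TokenClass.PLUS); }
        if (ch == '-') { pos++; return new Token(TokenClass.MINUS); }
        int start = pos;
        if (isLetter(ch)) { // identifier ::= letter (letter | digit)*
            while (pos < src.length()
                    && (isLetter(src.charAt(pos)) || isDigit(src.charAt(pos)))) pos++;
            return new Token(TokenClass.IDENTIFIER, src.substring(start, pos));
        }
        if (isDigit(ch)) { // number ::= digit+
            while (pos < src.length() && isDigit(src.charAt(pos))) pos++;
            return new Token(TokenClass.NUMBER, src.substring(start, pos));
        }
        throw new IllegalStateException("unexpected character '" + ch + "' at offset " + pos);
    }

    // Convenience wrapper: the printed form of every token up to EOF.
    static List<String> tokenise(String src) {
        ArithLexer lexer = new ArithLexer(src);
        List<String> out = new ArrayList<>();
        for (Token t = lexer.nextToken(); t.tokenClass != TokenClass.EOF; t = lexer.nextToken()) {
            out.add(t.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenise("x1 + 42 - y"));
    }
}
```

Because the first characters of the four token classes are disjoint, a single lookahead character is enough to decide which rule applies.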
