Introduction to Lexical Analysis Outline Informal sketch of - PowerPoint PPT Presentation

Introduction to Lexical Analysis

Outline • Informal sketch of lexical analysis – Identifies tokens in input string • Issues in lexical analysis – Lookahead – Ambiguities • Specifying lexers – Regular expressions – Examples of regular expressions Compiler Design 1 (2011) 2

Lexical Analysis • What do we want to do? Example: if (i == j) then Z = 0; else Z = 1; • The input is just a string of characters: \tif (i == j)\nthen\n\tz = 0;\n\telse\n\t\tz = 1; • Goal: Partition input string into substrings – Where the substrings are tokens Compiler Design 1 (2011) 3

What’s a Token? • A syntactic category – In English: noun, verb, adjective, … – In a programming language: Identifier, Integer, Keyword, Whitespace, … Compiler Design 1 (2011) 4

Tokens • Tokens correspond to sets of strings. • Identifier: strings of letters or digits, starting with a letter • Integer: a non-empty string of digits • Keyword: “else” or “if” or “begin” or … • Whitespace: a non-empty sequence of blanks, newlines, and tabs Compiler Design 1 (2011) 5

What are Tokens used for? • Classify program substrings according to role • Output of lexical analysis is a stream of tokens . . . • . . . which is input to the parser • Parser relies on token distinctions – An identifier is treated differently than a keyword Compiler Design 1 (2011) 6

Designing a Lexical Analyzer: Step 1 • Define a finite set of tokens – Tokens describe all items of interest – Choice of tokens depends on language, design of parser • Recall \tif (i == j)\nthen\n\tz = 0;\n\telse\n\t\tz = 1; • Useful tokens for this expression: Integer, Keyword, Relation, Identifier, Whitespace, (, ), =, ; Compiler Design 1 (2011) 7

Designing a Lexical Analyzer: Step 2 • Describe which strings belong to each token • Recall: – Identifier: strings of letters or digits, starting with a letter – Integer: a non-empty string of digits – Keyword: “else” or “if” or “begin” or … – Whitespace: a non-empty sequence of blanks, newlines, and tabs Compiler Design 1 (2011) 8

Lexical Analyzer: Implementation An implementation must do two things: 1. Recognize substrings corresponding to tokens 2. Return the value or lexeme of the token – The lexeme is the substring Compiler Design 1 (2011) 9

Example • Recall: \tif (i == j)\nthen\n\tz = 0;\n\telse\n\t\tz = 1; • Token-lexeme groupings: – Identifier: i, j, z – Keyword: if, then, else – Relation: == – Integer: 0, 1 – (, ), =, ; single character of the same name Compiler Design 1 (2011) 10

Why do Lexical Analysis? • Dramatically simplify parsing – The lexer usually discards “uninteresting” tokens that don’t contribute to parsing • E.g. Whitespace, Comments – Converts data early • Separate out logic to read source files – Potentially an issue on multiple platforms – Can optimize reading code independently of parser Compiler Design 1 (2011) 11

True Crimes of Lexical Analysis • Is it as easy as it sounds? • Not quite! • Look at some programming language history . . . Compiler Design 1 (2011) 12

Lexical Analysis in FORTRAN • FORTRAN rule: Whitespace is insignificant • E.g., VAR1 is the same as VA R1 • Footnote: FORTRAN whitespace rule was motivated by inaccuracy of punch card operators Compiler Design 1 (2011) 13

A terrible design! Example • Consider – DO 5 I = 1,25 – DO 5 I = 1.25 • The first is DO 5 I = 1 , 25 • The second is DO 5I = 1.25 • Reading left-to-right, cannot tell if DO 5I is a variable or DO stmt. until after “,” is reached Compiler Design 1 (2011) 14

Lexical Analysis in FORTRAN. Lookahead. Two important points: 1. The goal is to partition the string. This is implemented by reading left-to-write, recognizing one token at a time 2. “Lookahead” may be required to decide where one token ends and the next token begins – Even our simple example has lookahead issues i vs. if = vs. == Compiler Design 1 (2011) 15

Another Great Moment in Scanning • PL/1: Keywords can be used as identifiers: I F T HEN T HEN T HEN = EL SE; EL SE EL SE = I F can be difficult to determine how to label lexemes Compiler Design 1 (2011) 16

More Modern True Crimes in Scanning • Nested template declarations in C++ ve c to r<ve c to r<int>> myVe c to r ve c to r < ve c to r < int >> myVe c to r (ve c to r < (ve c to r < (int >> myVe c to r))) Compiler Design 1 (2011) 17

Review • The goal of lexical analysis is to – Partition the input string into lexemes (the smallest program units that are individually meaningful) – Identify the token of each lexeme • Left-to-right scan ⇒ lookahead sometimes required Compiler Design 1 (2011) 18

Next • We still need – A way to describe the lexemes of each token – A way to resolve ambiguities • Is if two variables i and f ? • Is == two equal signs = = ? Compiler Design 1 (2011) 19

Regular Languages • There are several formalisms for specifying tokens • Regular languages are the most popular – Simple and useful theory – Easy to understand – Efficient implementations Compiler Design 1 (2011) 20

Languages Def. Let Σ be a set of characters. A language Λ over Σ is a set of strings of characters drawn from Σ ( Σ is called the alphabet of Λ ) Compiler Design 1 (2011) 21

Examples of Languages • Alphabet = English • Alphabet = ASCII characters • Language = English • Language = C programs sentences • Not every string on • Note: ASCII character English characters is an set is different from English sentence English character set Compiler Design 1 (2011) 22

Notation • Languages are sets of strings • Need some notation for specifying which sets of strings we want our language to contain • The standard notation for regular languages is regular expressions Compiler Design 1 (2011) 23

Atomic Regular Expressions • Single character { } = ' ' " " c c • Epsilon { } ε = "" Compiler Design 1 (2011) 24

Compound Regular Expressions • Union { } + = ∈ ∈ | or A B s s A s B • Concatenation { } = ∈ ∈ | and AB ab a A b B • Iteration = = U * i i where ... times ... A A A A i A ≥ 0 i Compiler Design 1 (2011) 25

Regular Expressions • Def. The regular expressions over Σ are the smallest set of expressions including ε ∈∑ ' ' where c c + ∑ where , are rexp over A B A B " " " AB ∑ * where is a rexp over A A Compiler Design 1 (2011) 26

Syntax vs. Semantics • To be careful, we should distinguish syntax and semantics (meaning) of regular expressions { } ε = ( ) "" L = (' ') {" "} L c c + = ∪ ( ) ( ) ( ) L A B L A L B = ∈ ∈ ( ) { | ( ) and ( )} L AB ab a L A b L B = U * i ( ) ( ) L A L A ≥ 0 i Compiler Design 1 (2011) 27

Example: Keyword Keyword: “else” or “if” or “begin” or … n' + L ' else' + 'if' + 'begi Note: 'else' abbrev iates 'e''l''s ''e' Compiler Design 1 (2011) 28

Example: Integers Integer: a non-empty string of digits = + + + + + + + + + digit '0' '1' '2' '3' '4' '5' '6' '7' '8' '9' * integer = digit digit + = * Abbreviation: A AA Compiler Design 1 (2011) 29

Example: Identifier Identifier: strings of letters or digits, starting with a letter + + + + + K K letter = 'A' 'Z' 'a' 'z' + * identifier = letter (letter digit) * * Is (letter + di git ) the s ame? Compiler Design 1 (2011) 30

Example: Whitespace Whitespace: a non-empty sequence of blanks, newlines, and tabs ( ) + ' ' + '\n' + '\t' Compiler Design 1 (2011) 31

Example 1: Phone Numbers • Regular expressions are all around you! • Consider +46(0)18-471-1056 Σ = digits ∪ { + , − , ( , ) } country = digit digit city = digit digit univ = digit digit digit extension = digit digit digit digit phone_num = ‘ + ’country’ ( ’0‘ ) ’city’ − ’univ’ − ’extension Compiler Design 1 (2011) 32

Example 2: Email Addresses • Consider kostis@it.uu.se { } ∑ = ∪ letters .,@ + name = letter address = name '@' name '.' name '. ' name Compiler Design 1 (2011) 33

Summary • Regular expressions describe many useful languages • Regular languages are a language specification – We still need an implementation • Next time: Given a string s and a regular expression R , is ∈ ( )? s L R Compiler Design 1 (2011) 34

Introduction to Lexical Analysis Outline Informal sketch of - PowerPoint PPT Presentation

Introduction to Lexical Analysis Outline Informal sketch of lexical analysis Identifies tokens in input string Issues in lexical analysis Lookahead Ambiguities Specifying lexers Regular expressions

Compilers Lexical Analysis Alex Aiken Lexical Analysis 1. Lexical Analysis 2. Parsing 3.

Heterogeneous Lexical Resources MultiJEDI ERC 259234 Lexical Resource Lexical Resource Lexical

Introduction to Lexical Analysis Outline Informal sketch of lexical analysis

Lexical analysis Lexical analysis Lexical analysis checks the correctness of program words and

LEXICAL TYPOLOGY Peter Koch (Part I) Koch, Lexical typology, 2010-8-24 A. General introduction

Lesson 2 Lexical Analysis CS 226/326 Spring 2003 Lexical Analysis Transform source program

Introduction to Lexical Analysis Identifies tokens in input string Issues in lexical

Lexical Analysis Aslan Askarov aslan@cs.au.dk acknowledgments: E. Ernst Lexical analysis

LEXICAL TYPOLOGY LEXICAL TYPOLOGY Peter Koch (Part II) Department of Romance Studies, Tbingen

LEXICAL SEMANTICS LEXICAL SEMANTICS CS 224N 2011 Gerald Penn Slides largely adapted from

Lexical Analysis Therefore an implementation of a lexical analyser must do two things: Recognise

Lexical Analysis (2) Sukree Sinthupinyo 1 1 Department of Computer Engineering Chulalongkorn

Compiling Techniques Lecture 3: Introduction to Lexical Analysis Christophe Dubach 22 September

Lexical Analysis Lexical analysis is the first phase of compilation: The file is converted from

Lexical Databases Like a dictionary Lexical properties of interest to psycholinguists

LEXICAL TYPOLOGY LEXICAL TYPOLOGY Peter Koch (Part III) Department of Romance Studies, Tbingen

Lexical Analysis / Scanning Why separate lexical from syntactic analysis? Purpose: turn character

Lexical Analysis Scanners, Regular expressions, and Automata cs4713 1 Phases of compilation

Understanding Joe Landry, Transnational PhD October 16, 2019 Terrorism 1 After peaking in

Communication in Science / 2 April 2015 Samo Stani University of Nova Gorica Samo Stani,

Lexical and Syntax Analysis Part I 1 Introduction Every implementation of Programming

Outline Informal sketch of lexical

Compiler Design and Construction Semantic Analysis: Type Checking Slides modified from Louden

Compiler Development (CMPSC 401) Lexical Analysis Janyl Jumadinova January 24, 2019 Janyl

Sambuz

Useful Links

Newsletter

Mail Us