Introduction to Lexical Analysis

Outline
• Informal sketch of lexical analysis
  – Identifies tokens in input string
• Issues in lexical analysis
  – Lookahead
  – Ambiguities
• Specifying lexers
  – Regular expressions
  – Examples of regular expressions

Lexical Analysis
• What do we want to do? Example:
    if (i == j)
    then
        z = 0;
        else
            z = 1;
• The input is just a string of characters:
    if (i == j)\nthen\n\tz = 0;\n\telse\n\t\tz = 1;
• Goal: Partition input string into substrings
  – Where the substrings are tokens

What’s a Token?
• A syntactic category
  – In English: noun, verb, adjective, …
  – In a programming language: Identifier, Integer, Keyword, Whitespace, …

Tokens
• Tokens correspond to sets of strings
  – these sets depend on the programming language
• Identifier: strings of letters or digits, starting with a letter
• Integer: a non-empty string of digits
• Keyword: “else” or “if” or “begin” or …
• Whitespace: a non-empty sequence of blanks, newlines, and tabs

What are Tokens used for?
• Classify program substrings according to role
• Output of lexical analysis is a stream of tokens . . .
• . . . which is input to the parser
• Parser relies on token distinctions
  – An identifier is treated differently than a keyword

Designing a Lexical Analyzer: Step 1
• Define a finite set of tokens
  – Tokens describe all items of interest
  – Choice of tokens depends on language, design of parser
• Recall
    if (i == j)\nthen\n\tz = 0;\n\telse\n\t\tz = 1;
• Useful tokens for this expression:
    Integer, Keyword, Relation, Identifier, Whitespace, (, ), =, ;

Designing a Lexical Analyzer: Step 2
• Describe which strings belong to each token
• Recall:
  – Identifier: strings of letters or digits, starting with a letter
  – Integer: a non-empty string of digits
  – Keyword: “else” or “if” or “begin” or …
  – Whitespace: a non-empty sequence of blanks, newlines, and tabs

Lexical Analyzer: Implementation
An implementation must do two things:
1. Recognize substrings corresponding to tokens
2. Return the value or lexeme of the token
  – The lexeme is the substring

Example
• Recall:
    if (i == j)\nthen\n\tz = 0;\n\telse\n\t\tz = 1;
• Token-lexeme groupings:
  – Identifier: i, j, z
  – Keyword: if, then, else
  – Relation: ==
  – Integer: 0, 1
  – (, ), =, ; single character of the same name

Why do Lexical Analysis?
• Dramatically simplify parsing
  – The lexer usually discards “uninteresting” tokens that don’t contribute to parsing
    • E.g. Whitespace, Comments
  – Converts data early
• Separate out logic to read source files
  – Potentially an issue on multiple platforms
  – Can optimize reading code independently of parser

True Crimes of Lexical Analysis
• Is it as easy as it sounds?
• Not quite!
• Look at some programming language history . . .
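The two implementation duties (recognize the substring, return its lexeme) can be sketched for the running example as follows. This is a minimal illustration, not the lecture's implementation: the names `TOKEN_SPEC`, `MASTER`, `tokenize`, and the internal class `Punct` are assumptions, and Python's `re` module stands in for a generated lexer.

```python
import re

# Hypothetical token classes for the running example (names are assumptions).
# Order matters: Keyword is tried before Identifier, Relation before "=".
TOKEN_SPEC = [
    ("Keyword",    r"\b(?:if|then|else)\b"),
    ("Identifier", r"[A-Za-z][A-Za-z0-9]*"),
    ("Integer",    r"[0-9]+"),
    ("Relation",   r"=="),
    ("Whitespace", r"[ \n\t]+"),
    ("Punct",      r"[()=;]"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(text):
    """Return a list of (token, lexeme) pairs, discarding whitespace."""
    pos, out = 0, []
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        kind, lexeme = m.lastgroup, m.group()
        if kind == "Punct":
            kind = lexeme  # single-character tokens are named after themselves
        if kind != "Whitespace":
            out.append((kind, lexeme))
        pos = m.end()
    return out


SOURCE = "if (i == j)\nthen\n\tz = 0;\n\telse\n\t\tz = 1;"
for token, lexeme in tokenize(SOURCE):
    print(token, repr(lexeme))
```

Dropping `Whitespace` inside `tokenize` mirrors the point that the lexer usually discards tokens the parser does not need.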

Lexical Analysis in FORTRAN
• FORTRAN rule: Whitespace is insignificant
• E.g., VAR1 is the same as VA R1
• Footnote: FORTRAN whitespace rule was motivated by inaccuracy of punch card operators

A terrible design! Example
• Consider
  – DO 5 I = 1,25
  – DO 5 I = 1.25
• The first is DO 5 I = 1 , 25 (a DO statement)
• The second is DO5I = 1.25 (an assignment to the variable DO5I)
• Reading left-to-right, cannot tell if DO5I is a variable or DO stmt. until after “,” is reached

Lexical Analysis in FORTRAN. Lookahead.
Two important points:
1. The goal is to partition the string. This is implemented by reading left-to-right, recognizing one token at a time
2. “Lookahead” may be required to decide where one token ends and the next token begins
  – Even our simple example has lookahead issues
      i vs. if
      = vs. ==

Another Great Moment in Scanning
• PL/1: Keywords can be used as identifiers:
    IF THEN THEN THEN = ELSE; ELSE ELSE = IF
• It can be difficult to determine how to label lexemes
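The two lookahead issues named above (i vs. if, = vs. ==) can be illustrated with a small hand-written scanner. This is a sketch under assumptions: the function name `scan`, the keyword set, and the token names are illustrative, and `str.isalpha`/`str.isalnum` also accept non-ASCII characters, which a real lexer would probably restrict.

```python
KEYWORDS = {"if", "then", "else"}  # assumed keyword set for the running example


def scan(text):
    """Left-to-right scan that peeks ahead before committing to a token."""
    tokens, pos = [], 0
    while pos < len(text):
        ch = text[pos]
        if ch in " \n\t":
            pos += 1  # skip whitespace
        elif ch.isalpha():
            # Keep reading while the lexeme can still grow, then decide:
            # "i" stays an identifier, "if" becomes a keyword.
            end = pos
            while end < len(text) and text[end].isalnum():
                end += 1
            lexeme = text[pos:end]
            kind = "Keyword" if lexeme in KEYWORDS else "Identifier"
            tokens.append((kind, lexeme))
            pos = end
        elif ch == "=":
            # One character of lookahead separates "=" from "==".
            if pos + 1 < len(text) and text[pos + 1] == "=":
                tokens.append(("Relation", "=="))
                pos += 2
            else:
                tokens.append(("=", "="))
                pos += 1
        else:
            raise ValueError(f"unexpected character {ch!r} at offset {pos}")
    return tokens


print(scan("i == if"))
print(scan("iff = i"))
```

In both branches the scanner prefers the longest lexeme it can form, which is one common way to settle where a token ends.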

More Modern True Crimes in Scanning
• Nested template declarations in C++
    vector<vector<int>> myVector
    vector < vector < int >> myVector
    (vector < (vector < (int >> myVector)))

Review
• The goal of lexical analysis is to
  – Partition the input string into lexemes (the smallest program units that are individually meaningful)
  – Identify the token of each lexeme
• Left-to-right scan ⇒ lookahead sometimes required

Next
• We still need
  – A way to describe the lexemes of each token
  – A way to resolve ambiguities
    • Is if two variables i and f ?
    • Is == two equal signs = = ?

Regular Languages
• There are several formalisms for specifying tokens
• Regular languages are the most popular
  – Simple and useful theory
  – Easy to understand
  – Efficient implementations
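The nested-template problem follows directly from longest-match scanning, and a few lines of Python reproduce it. The pattern below is an assumption made for illustration (it covers only identifiers and the angle-bracket operators) and is not a C++ lexer.

```python
import re

# Alternatives are tried left to right, so ">>" is listed before ">"
# to mimic a scanner that always prefers the longest operator.
TOKEN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*|>>|<<|[<>]")


def lex(text):
    """Return the lexemes found in text; characters such as blanks are skipped."""
    return TOKEN.findall(text)


print(lex("vector<vector<int>> myVector"))
print(lex("vector<vector<int> > myVector"))
```

In the first call the two closing brackets come out as the single lexeme `>>`, which a parser would read as a shift operator. The second call shows the traditional workaround of separating them with a space; C++11 later changed the language rules so that `>>` can close two template argument lists.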

Languages
Def. Let $\Sigma$ be a set of characters. A language $\Lambda$ over $\Sigma$ is a set of strings of characters drawn from $\Sigma$
($\Sigma$ is called the alphabet of $\Lambda$)

Examples of Languages
• Alphabet = English characters
• Language = English sentences
• Not every string on English characters is an English sentence

• Alphabet = ASCII
• Language = C programs
• Note: ASCII character set is different from English character set

Notation
• Languages are sets of strings
• Need some notation for specifying which sets of strings we want our language to contain
• The standard notation for regular languages is regular expressions

Atomic Regular Expressions
• Single character: $\texttt{'c'} = \{\texttt{"c"}\}$
• Epsilon: $\varepsilon = \{\texttt{""}\}$

Compound Regular Expressions
• Union: $A + B = \{\, s \mid s \in A \text{ or } s \in B \,\}$
• Concatenation: $AB = \{\, ab \mid a \in A \text{ and } b \in B \,\}$
• Iteration: $A^* = \bigcup_{i \ge 0} A^i$, where $A^i = A \cdots A$ ($i$ times)

Regular Expressions
• Def. The regular expressions over $\Sigma$ are the smallest set of expressions including
  – $\varepsilon$
  – $\texttt{'c'}$ where $c \in \Sigma$
  – $A + B$ where $A, B$ are rexp over $\Sigma$
  – $AB$ where $A, B$ are rexp over $\Sigma$
  – $A^*$ where $A$ is a rexp over $\Sigma$

Syntax vs. Semantics
• To be careful, we should distinguish syntax and semantics (meaning) of regular expressions
    $L(\varepsilon) = \{\texttt{""}\}$
    $L(\texttt{'c'}) = \{\texttt{"c"}\}$
    $L(A + B) = L(A) \cup L(B)$
    $L(AB) = \{\, ab \mid a \in L(A) \text{ and } b \in L(B) \,\}$
    $L(A^*) = \bigcup_{i \ge 0} L(A^i)$

Example: Keyword
Keyword: “else” or “if” or “begin” or …
    $\texttt{'else'} + \texttt{'if'} + \texttt{'begin'} + \cdots$
Note: 'else' abbreviates 'e''l''s''e'
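The semantic equations for union, concatenation, and iteration can be tried out directly on finite sets of strings. The sketch below is an illustration under assumptions: the function names are made up, and because L(A*) is usually infinite, `star` takes a hypothetical `max_len` bound and returns only the strings up to that length.

```python
from itertools import product


def union(a, b):
    """L(A + B) = L(A) ∪ L(B)."""
    return a | b


def concat(a, b):
    """L(AB) = { ab | a in L(A) and b in L(B) }."""
    return {x + y for x, y in product(a, b)}


def star(a, max_len):
    """The strings of L(A*) whose length is at most max_len."""
    result = {""}      # A^0 contains only the empty string
    frontier = {""}
    while frontier:
        # Extend the newest strings by one more copy of A, keep the short new ones.
        frontier = {s for s in concat(frontier, a) if len(s) <= max_len} - result
        result |= frontier
    return result


print(sorted(union({"else"}, {"if"})))
print(sorted(concat({"a", "b"}, {"c"})))
print(sorted(star({"0", "1"}, 2)))
```

Representing a language as a Python `set` only works for finite languages, which is why the bound on `star` is needed; regular expressions exist precisely to describe the infinite case finitely.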

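Example: Integers
Integer: a non-empty string of digits
    $\text{digit} = \texttt{'0'} + \texttt{'1'} + \texttt{'2'} + \texttt{'3'} + \texttt{'4'} + \texttt{'5'} + \texttt{'6'} + \texttt{'7'} + \texttt{'8'} + \texttt{'9'}$
    $\text{integer} = \text{digit}\ \text{digit}^*$
Abbreviation: $A^+ = AA^*$

Example: Identifier
Identifier: strings of letters or digits, starting with a letter
    $\text{letter} = \texttt{'A'} + \cdots + \texttt{'Z'} + \texttt{'a'} + \cdots + \texttt{'z'}$
    $\text{identifier} = \text{letter}\ (\text{letter} + \text{digit})^*$
Is $(\text{letter}^* + \text{digit}^*)$ the same?

Example: Whitespace
Whitespace: a non-empty sequence of blanks, newlines, and tabs
    $(\texttt{' '} + \texttt{'\n'} + \texttt{'\t'})^+$

Example 1: Phone Numbers
• Regular expressions are all around you!
• Consider +46(0)18-471-1056
    $\Sigma = \text{digits} \cup \{\texttt{+}, \texttt{-}, \texttt{(}, \texttt{)}\}$
    $\text{country} = \text{digit}\ \text{digit}$
    $\text{city} = \text{digit}\ \text{digit}$
    $\text{univ} = \text{digit}\ \text{digit}\ \text{digit}$
    $\text{extension} = \text{digit}\ \text{digit}\ \text{digit}\ \text{digit}$
    $\text{phone\_num} = \texttt{'+'}\ \text{country}\ \texttt{'('}\ \texttt{'0'}\ \texttt{')'}\ \text{city}\ \texttt{'-'}\ \text{univ}\ \texttt{'-'}\ \text{extension}$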
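These definitions translate almost symbol for symbol into Python's `re` syntax, where `|` plays the role of the slides' `+` (union) and the postfix `+` is the same abbreviation for one-or-more. The constant names below are assumptions chosen to match the slides; `ALT` is the candidate from the question on the Identifier slide.

```python
import re

DIGIT = r"[0-9]"      # '0' + '1' + ... + '9'
LETTER = r"[A-Za-z]"  # 'A' + ... + 'Z' + 'a' + ... + 'z'

INTEGER = re.compile(rf"{DIGIT}{DIGIT}*")                    # digit digit*
IDENTIFIER = re.compile(rf"{LETTER}(?:{LETTER}|{DIGIT})*")   # letter (letter + digit)*
WHITESPACE = re.compile(r"(?: |\n|\t)+")                     # (' ' + '\n' + '\t')+
PHONE = re.compile(
    rf"\+{DIGIT}{{2}}\(0\){DIGIT}{{2}}-{DIGIT}{{3}}-{DIGIT}{{4}}"
)

# The candidate (letter* + digit*): all letters, or all digits, or empty.
ALT = re.compile(rf"(?:{LETTER}*|{DIGIT}*)")

print(bool(INTEGER.fullmatch("2024")))
print(bool(IDENTIFIER.fullmatch("z1")), bool(IDENTIFIER.fullmatch("1z")))
print(bool(PHONE.fullmatch("+46(0)18-471-1056")))
print(bool(ALT.fullmatch("z1")), bool(ALT.fullmatch("123")), bool(ALT.fullmatch("")))
```

The last line answers the slide's question: the two expressions are not the same. `ALT` rejects the mixed identifier `z1` yet accepts the all-digit string `123` and the empty string, neither of which is an identifier.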

Example 2: Email Addresses
• Consider kostis@it.uu.se
    $\Sigma = \text{letters} \cup \{\texttt{.}, \texttt{@}\}$
    $\text{name} = \text{letter}^+$
    $\text{address} = \text{name}\ \texttt{'@'}\ \text{name}\ \texttt{'.'}\ \text{name}\ \texttt{'.'}\ \text{name}$

Summary
• Regular expressions describe many useful languages
• Regular languages are a language specification
  – We still need an implementation
• Next time: Given a string $s$ and a regular expression $R$, is $s \in L(R)$?
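The membership question from the summary can already be asked of the email example by borrowing an existing matcher. The sketch assumes Python's `re.fullmatch` as a stand-in for the implementation the slides defer to next time; `in_language` is a hypothetical helper name, and the pattern accepts only the simplified address shape defined on the slide, not real-world email addresses.

```python
import re

NAME = r"[A-Za-z]+"                                      # name = letter+
ADDRESS = re.compile(rf"{NAME}@{NAME}\.{NAME}\.{NAME}")  # name '@' name '.' name '.' name


def in_language(s, pattern):
    """Decide whether s is in L(R) for a compiled pattern R (whole-string match)."""
    return pattern.fullmatch(s) is not None


print(in_language("kostis@it.uu.se", ADDRESS))
print(in_language("kostis@uu.se", ADDRESS))
```

`fullmatch` is used rather than `search` or `match` because membership in L(R) is a statement about the entire string, not about some substring or prefix of it.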
