CS502: Compiler Design
Lexical Analysis
Manas Thakur
Fall 2020
Manas Thakur CS502: Compiler Design 2
Let’s get started
- Frontend:
– Character stream → Lexical Analyzer → Token stream
– Token stream → Syntax Analyzer → Syntax tree
– Syntax tree → Semantic Analyzer → Syntax tree
– Syntax tree → Intermediate Code Generator → Intermediate representation
- Backend:
– Intermediate representation → Machine-Independent Code Optimizer → Intermediate representation
– Intermediate representation → Code Generator → Target machine code
– Target machine code → Machine-Dependent Code Optimizer → Target machine code
- Symbol Table: used across the phases
Lexical Analysis
- Also called scanning
- Corresponding component called lexical analyzer or scanner
- Roles:
– Read input characters
– Group them into tokens (the matched character sequences are called lexemes)
– Return a stream of tokens
- To whom?
– Usually the parser
- Sometimes also:
– Remove whitespace
– Remove comments
– Record information (such as line numbers) into the symbol table
– Report errors
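The roles listed above can be sketched as a small Python generator. This is an illustrative sketch, not the course's scanner: the names `Token`, `TOKEN_SPEC` and `scan`, the C-style `//` comment syntax, and the choice to attach the line number to each token (instead of writing it into a symbol table) are all assumptions made here.

```python
import re
from typing import Iterator, NamedTuple


class Token(NamedTuple):
    kind: str    # token type, e.g. "word" or "number"
    text: str    # the matched characters
    line: int    # line on which the token appears


# Hypothetical token classes; alternatives are tried in this order.
TOKEN_SPEC = [
    ("comment", r"//[^\n]*"),            # assumed C-style line comments
    ("newline", r"\n"),
    ("space",   r"[ \t]+"),
    ("number",  r"[0-9]+"),
    ("word",    r"[A-Za-z][A-Za-z0-9]*"),
    ("symbol",  r"[^\sA-Za-z0-9]"),      # any other single visible character
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def scan(source: str) -> Iterator[Token]:
    """Read characters, group them into tokens, and yield them one by one."""
    line = 1
    for m in MASTER.finditer(source):
        kind = m.lastgroup
        if kind == "newline":
            line += 1        # keep the line number for later error messages
        elif kind in ("space", "comment"):
            continue         # whitespace and comments never reach the parser
        else:
            yield Token(kind, m.group(), line)


for token in scan("x = 1; // set x\ny = 22;"):
    print(token)
```

A generator fits the "return stream of tokens" role: the parser can pull one token at a time without the scanner having to read the whole input first.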
Characters to tokens
- Input program:
    if (a>b)
        x = 0;
    else
        x = 1;
– Basically a sequence of characters
- Actual input:
– \tif (a>b)\n\t\tx = 0;\n\telse\n\t\tx = 1;
- Goal of lexical analyzer:
– Partition input stream into substrings (tokens) and classify
them according to their roles (types).
Identifying and classifying tokens: Example
- Input:
– \tif (a>b)\n\t\tx = 0;\n\telse\n\t\tx = 1;
- Say we have the following token types:
– keywords, operators, identifiers, literals (constants), special symbols, white space
- How many tokens are there in this string?
- Example output (excluding white spaces):
– <keyword, ‘if’>
– <special_symbol, ‘(’>
– <identifier, ‘a’>
– ...
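One way to make this example concrete is a small regex-driven tokenizer in Python. It is a sketch under assumptions: the keyword set, the pattern chosen for each token type, and the names `PATTERNS` and `tokenize` are picked here for illustration and do not come from the slides.

```python
import re

KEYWORDS = {"if", "else"}            # assumed keyword set for this example

# Assumed patterns for the token types named above.
PATTERNS = [
    ("white_space",    r"[ \t\n]+"),
    ("identifier",     r"[A-Za-z][A-Za-z0-9]*"),
    ("literal",        r"[0-9]+"),
    ("operator",       r"[<>=]"),
    ("special_symbol", r"[();]"),
]
SCANNER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in PATTERNS))


def tokenize(text, keep_white_space=False):
    """Return a list of (type, lexeme) pairs for the whole input."""
    tokens, pos = [], 0
    while pos < len(text):
        m = SCANNER.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        kind, lexeme = m.lastgroup, m.group()
        if kind == "identifier" and lexeme in KEYWORDS:
            kind = "keyword"         # reserved words look like identifiers
        if kind != "white_space" or keep_white_space:
            tokens.append((kind, lexeme))
        pos = m.end()
    return tokens


source = "\tif (a>b)\n\t\tx = 0;\n\telse\n\t\tx = 1;"
for kind, lexeme in tokenize(source):
    print(f"<{kind}, '{lexeme}'>")
```

Under these assumptions the input yields 15 tokens when white space is dropped, and 24 if every run of white space is kept as one token. A different choice of token types (for example, treating `\n` and `\t` separately) would change the count.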
Patterns for lexical analysis
- Keywords can be represented directly
– ‘break’, ‘int’, ‘while’
- And similarly punctuation symbols
- What about the ones that are too many?
– Numbers
– Identifiers
- Specified (or modelled) using
– Regular expressions
– The set of strings represented by a regular expression r forms a regular language L(r).
Regex Primer
- An alphabet Σ is the set of symbols from which strings are built
– Our first names are strings over the alphabet Σ = {a, b, ..., z}
- * denotes zero or more occurrences
- ε denotes the empty string
- + denotes one or more occurrences
- ? denotes zero or one occurrence
- | (or sometimes +) denotes choice
– a*b | a*c
- Many ways to express the same language:
– a*b | a*c can also be written as a*(b|c)
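The claim that the two expressions describe the same language can be spot-checked by brute force. The sketch below uses Python's `re` module, where choice is written `|`. The helper name `language` and the length bound of 6 are arbitrary choices made here, so this is a bounded check and not a proof.

```python
import re
from itertools import product


def language(pattern, alphabet, max_len):
    """All strings over `alphabet` of length <= max_len that match `pattern`."""
    regex = re.compile(pattern)
    return {
        "".join(chars)
        for n in range(max_len + 1)
        for chars in product(alphabet, repeat=n)
        if regex.fullmatch("".join(chars))
    }


left = language(r"a*b|a*c", "abc", 6)
right = language(r"a*(b|c)", "abc", 6)
print(left == right)      # True: the two sets agree up to the bound
```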
Classwork
- Write a regex that represents strings over alphabet {a, b} that
start and end with a.
– (a(a|b)*a) | a
- Strings whose third-last letter is a.
– (a*|b*)*a(a|b)(a|b)
- Strings with exactly three bs.
– a*ba*ba*ba*
- Strings over Σ = {0,1} with odd number of 1s:
– HW
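The worked answers can be sanity-checked by comparing each regex against a direct Python predicate on every short string over {a, b}. This is a bounded check under assumptions made here: choice is written `|` as Python's `re` module requires, the third-last-letter answer is written in the equivalent simpler form `(a|b)*a(a|b)(a|b)`, and the names `CASES` and `agrees` are hypothetical. The homework item is deliberately left out.

```python
import re
from itertools import product

# (description, regex, reference predicate) for each worked answer.
CASES = [
    ("starts and ends with a",
     r"a(a|b)*a|a",
     lambda s: s.startswith("a") and s.endswith("a")),
    ("third-last letter is a",
     r"(a|b)*a(a|b)(a|b)",
     lambda s: len(s) >= 3 and s[-3] == "a"),
    ("exactly three bs",
     r"a*ba*ba*ba*",
     lambda s: s.count("b") == 3),
]


def agrees(pattern, predicate, alphabet="ab", max_len=8):
    """True if regex and predicate accept the same strings up to max_len."""
    regex = re.compile(pattern)
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if bool(regex.fullmatch(s)) != predicate(s):
                return False
    return True


for description, pattern, predicate in CASES:
    print(description, agrees(pattern, predicate))
```

The same helper also shows why the first answer needs the extra alternative: `agrees(r"a(a|b)*a", CASES[0][2])` is false, because the one-letter string `a` starts and ends with a but is not matched.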
More Regex
- Identifiers that begin only with a letter and may have letters or digits afterwards:
– letter: (a|b|c| ... |z|A|B|C| ... |Z)
– digit: (0|1|2| ... |9)
– identifier: letter(letter|digit)*
- HWOT: Write a regular expression for representing valid
email ids. (You are free to choose your alphabet.)
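The identifier definition translates almost directly into Python's `re` syntax, where a character class such as `[A-Za-z]` abbreviates the long alternation. A minimal sketch: the names `LETTER`, `DIGIT`, `IDENTIFIER` and `is_identifier` are chosen here, and the pattern covers only ASCII letters and digits.

```python
import re

LETTER = r"[A-Za-z]"                 # (a|b|...|z|A|B|...|Z)
DIGIT = r"[0-9]"                     # (0|1|...|9)
IDENTIFIER = re.compile(f"{LETTER}({LETTER}|{DIGIT})*")


def is_identifier(text):
    """True if the whole string has the form letter(letter|digit)*."""
    return IDENTIFIER.fullmatch(text) is not None


for candidate in ["x", "thenVar", "a1b2", "9lives", "", "my_var"]:
    print(repr(candidate), is_identifier(candidate))
```

The first three candidates are accepted. `9lives` starts with a digit, the empty string has no leading letter, and `my_var` contains `_`, which is outside the alphabet used here, so those three are rejected.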
Some considerations
- How to distinguish between patterns with common prefixes:
– <, <=, <<
– Need to “look ahead” before taking a decision
- Clashes between token types (e.g., then versus thenVar)
– Assign priorities while checking (e.g., keywords before identifiers)
– Start with an identifier and, if the value matches a reserved word, change its type
- Detecting and recovering from errors
– Next class
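The first two considerations can be sketched in Python as one step of a hand-written scanner. It reads one character ahead to choose among `<`, `<=` and `<<`, and it scans a whole identifier before checking it against a reserved-word table. The function name `next_token`, the token type names, and the reserved-word set are assumptions made for this sketch. It handles only these two cases and raises an error on any other character.

```python
RESERVED = {"if", "then", "else"}        # assumed reserved words


def next_token(text, pos):
    """Return ((type, lexeme), new_pos) for the token starting at text[pos]."""
    ch = text[pos]
    if ch == "<":
        # Look ahead one character before deciding which operator this is.
        following = text[pos + 1] if pos + 1 < len(text) else ""
        if following in ("=", "<"):
            return ("operator", ch + following), pos + 2
        return ("operator", ch), pos + 1
    if ch.isalpha():
        end = pos
        while end < len(text) and text[end].isalnum():
            end += 1                     # consume the longest letter/digit run
        lexeme = text[pos:end]
        # Start as an identifier; switch the type if it is a reserved word.
        kind = "keyword" if lexeme in RESERVED else "identifier"
        return (kind, lexeme), end
    raise ValueError(f"unexpected character {ch!r} at offset {pos}")


print(next_token("<=b", 0))      # (('operator', '<='), 2)
print(next_token("then x", 0))   # (('keyword', 'then'), 4)
print(next_token("thenVar", 0))  # (('identifier', 'thenVar'), 7)
```

Because the identifier is consumed to its full length before the table lookup, `thenVar` is never split into the keyword `then` followed by `Var`.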