CS502: Compiler Design Lexical Analysis Manas Thakur Fall 2020



SLIDE 1

CS502: Compiler Design Lexical Analysis Manas Thakur

Fall 2020

SLIDE 2

Manas Thakur CS502: Compiler Design 2

Let’s get started

Compiler phases, with the data passed between them:

– Character stream → Lexical Analyzer → Token stream
– Token stream → Syntax Analyzer → Syntax tree
– Syntax tree → Semantic Analyzer → Syntax tree
– Syntax tree → Intermediate Code Generator → Intermediate representation
– Intermediate representation → Machine-Independent Code Optimizer → Intermediate representation
– Intermediate representation → Code Generator → Target machine code
– Target machine code → Machine-Dependent Code Optimizer → Target machine code

Symbol Table (shared by the phases)

The phases are grouped into a Frontend and a Backend.

SLIDE 3

Lexical Analysis

  • Also called scanning
  • Corresponding component called lexical analyzer or scanner
  • Roles:

– Read input characters
– Group them into tokens (the matched character sequences are called lexemes)
– Return a stream of tokens, usually to the parser
– Sometimes also:

    • Remove whitespace
    • Remove comments
    • Record information (such as line number) into symbol table
    • Report errors
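
The roles listed above can be sketched as a small hand-written scanner. This is only an illustration, not the course's implementation: the `Token` type, the `scan` function, and the `//` comment syntax are assumptions made for the example, and the line number is attached to each token instead of being stored in a symbol table.

```python
from typing import Iterator, NamedTuple


class Token(NamedTuple):
    text: str   # the characters that were grouped together
    line: int   # line number on which the token starts


def scan(source: str) -> Iterator[Token]:
    """Yield tokens one at a time, skipping whitespace and // comments."""
    i, line = 0, 1
    while i < len(source):
        ch = source[i]
        if ch == "\n":                      # keep track of line numbers
            line += 1
            i += 1
        elif ch.isspace():                  # remove other whitespace
            i += 1
        elif source.startswith("//", i):    # remove a comment (assumed syntax)
            while i < len(source) and source[i] != "\n":
                i += 1
        elif ch.isalnum() or ch == "_":     # group letters and digits
            start = i
            while i < len(source) and (source[i].isalnum() or source[i] == "_"):
                i += 1
            yield Token(source[start:i], line)
        else:                               # any other character stands alone
            yield Token(ch, line)
            i += 1


# A stand-in for the parser: it simply pulls tokens from the stream.
tokens = list(scan("x = 0; // reset\ny = x;"))
```

Because `scan` is a generator, the consumer (normally the parser) asks for one token at a time, which mirrors the "return stream of tokens" role.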
SLIDE 4

Characters to tokens

  • Input program:

    if (a>b)
        x = 0;
    else
        x = 1;

– Basically a sequence of characters

  • Actual input:

– \tif (a>b)\n\t\tx = 0;\n\telse\n\t\tx = 1;

  • Goal of lexical analyzer:

– Partition input stream into substrings (tokens) and classify them according to their roles (types).

SLIDE 5

Identifying and classifying tokens: Example

  • Input:

– \tif (a>b)\n\t\tx = 0;\n\telse\n\t\tx = 1;

  • Say we have the following token types:

– keywords, operators, identifiers, literals (constants), special symbols, white space

  • How many tokens are there in this string?
  • Example output (excluding white spaces):

– <keyword, ‘if’>
– <special_symbol, ‘(’>
– <identifier, ‘a’>
– ...
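
The classification above can be tried out with Python's `re` module. This is a minimal sketch: the patterns chosen for each token type (for example, treating `=` as an operator and `;` as a special symbol) are assumptions for this example, not definitions from the slides.

```python
import re

# Assumed token classes for this example. Order matters: the first
# alternative that matches wins, so keywords come before identifiers.
TOKEN_SPEC = [
    ("white_space",    r"\s+"),
    ("keyword",        r"\b(?:if|else)\b"),
    ("identifier",     r"[A-Za-z_][A-Za-z0-9_]*"),
    ("literal",        r"[0-9]+"),
    ("operator",       r"[<>=]"),
    ("special_symbol", r"[();]"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(text):
    """Return (type, lexeme) pairs, dropping white space."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        if m.lastgroup != "white_space":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


source = "\tif (a>b)\n\t\tx = 0;\n\telse\n\t\tx = 1;"
tokens = tokenize(source)
print(len(tokens))   # 15 tokens once white space is excluded
for token in tokens[:3]:
    print(token)
```

Under these assumed classes, the input string yields 15 tokens excluding white space; a different choice of token types could give a different count.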

SLIDE 6

Patterns for lexical analysis

  • Keywords can be represented directly

– ‘break’, ‘int’, ‘while’

  • And similarly punctuation symbols
  • What about token types with too many members to list?

– Numbers
– Identifiers

  • Specified (or modelled) using regular expressions

– The set of strings represented by a regular expression r forms a regular language L(r).

SLIDE 7

Regex Primer

  • An alphabet Σ is a finite set of symbols

– Our first names are strings over the alphabet Σ = {a, ..., z}, i.e., elements of Σ*

  • * denotes zero or more occurrences
  • ε denotes an empty string
  • + denotes one or more occurrences
  • ? denotes zero or one occurrence
  • | (or sometimes +) is used to denote choice

– a*b | a*c

  • Many ways to express the same language:

– a*b | a*c can also be written as: a*(b+c)
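
The claimed equivalence can be checked mechanically on short strings. The sketch below uses Python's `re` module, where choice is always written `|` (a `+` in Python means "one or more"); the helper `all_strings` is a name introduced for this example. It is a bounded check over strings of length at most 6, not a proof.

```python
import re
from itertools import product

# Python notation: choice is '|'; the slide's a*(b+c) becomes a*(b|c).
r1 = re.compile(r"a*b|a*c")
r2 = re.compile(r"a*(b|c)")


def all_strings(alphabet, max_len):
    """Generate every string over `alphabet` with length 0..max_len."""
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            yield "".join(chars)


# Strings on which the two expressions disagree (expected to be none).
mismatches = [s for s in all_strings("abc", 6)
              if bool(r1.fullmatch(s)) != bool(r2.fullmatch(s))]

# Strings of length at most 2 that belong to the language.
accepted = [s for s in all_strings("abc", 2) if r1.fullmatch(s)]
print(mismatches)   # []
print(accepted)     # ['b', 'c', 'ab', 'ac']
```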

SLIDE 8

Classwork

  • Write a regex that represents strings over alphabet {a, b} that start and end with a.

– (a(a+b)*a) + a

  • Strings whose third-last letter is a.

– (a*+b*)*a(a+b)(a+b)

  • Strings with exactly three bs.

– a*ba*ba*ba*

  • Strings over Σ = {0,1} with odd number of 1s:

– HW
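
One way to gain confidence in the three worked answers above is to compare each regex with a direct description of the language on all short strings. The sketch below does this in Python, translating the slide's `+` (choice) into `|`; `CASES`, `all_strings`, and `counterexamples` are names introduced for this example, and the check only covers strings of length at most 7.

```python
import re
from itertools import product

# Each entry: description -> (regex in Python notation, reference predicate).
CASES = {
    "starts and ends with a": (r"(a(a|b)*a)|a",
                               lambda s: s.startswith("a") and s.endswith("a")),
    "third-last letter is a": (r"(a*|b*)*a(a|b)(a|b)",
                               lambda s: len(s) >= 3 and s[-3] == "a"),
    "exactly three bs":       (r"a*ba*ba*ba*",
                               lambda s: s.count("b") == 3),
}


def all_strings(alphabet, max_len):
    """Generate every string over `alphabet` with length 0..max_len."""
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            yield "".join(chars)


def counterexamples(pattern, predicate, max_len=7):
    """Strings on which the regex and the predicate disagree."""
    regex = re.compile(pattern)
    return [s for s in all_strings("ab", max_len)
            if bool(regex.fullmatch(s)) != bool(predicate(s))]


results = {name: counterexamples(pattern, predicate)
           for name, (pattern, predicate) in CASES.items()}
for name, bad in results.items():
    print(name, "->", "no counterexamples" if not bad else bad)
```

The same harness can be reused to test an attempt at the homework question by adding an entry with alphabet-appropriate strings.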

SLIDE 9

More Regex

  • Identifiers that must begin with a letter and may contain letters or digits afterwards:

– letter: (a|b|c| ... |z|A|B|C| ... |Z)
– digit: (0|1|2| ... |9)
– identifier: letter(letter|digit)*

  • HWOT: Write a regular expression for representing valid email ids. (You are free to choose your alphabet.)
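
The identifier definition above translates almost directly into code. A minimal sketch in Python follows; the character classes `[A-Za-z]` and `[0-9]` stand in for the long alternations, and `is_identifier` is a helper name introduced here.

```python
import re

LETTER = r"[A-Za-z]"   # letter: a|b|...|z|A|B|...|Z
DIGIT = r"[0-9]"       # digit: 0|1|...|9
IDENTIFIER = re.compile(rf"{LETTER}({LETTER}|{DIGIT})*")


def is_identifier(s):
    """True if the whole string matches letter(letter|digit)*."""
    return IDENTIFIER.fullmatch(s) is not None


checks = {s: is_identifier(s) for s in ["x", "thenVar", "a1b2", "9lives", "", "x_y"]}
print(checks)
```

Note that `x_y` is rejected here because the underscore is not part of the letter or digit sets as defined on the slide; many real languages extend the alphabet to allow it.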

SLIDE 10

Some considerations

  • How to distinguish between patterns with common prefixes:

– <, <=, <<
– Need to “look ahead” before taking a decision

  • Clashes between token types (e.g., then versus thenVar)

– Assign priorities while checking (e.g., keywords before identifiers)
– Start with an identifier and if the value matches a reserved word, then change its type

  • Detecting and recovering from errors

– Next class
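
The first two considerations above (common prefixes and keyword/identifier clashes) can be illustrated with a small hand-written scanner. This is a sketch under assumptions: the `KEYWORDS` and `OPERATORS` sets and the function names are made up for the example. It uses one character of lookahead to prefer the longest operator, and it reads a whole identifier first and only then checks whether it is a reserved word.

```python
KEYWORDS = {"if", "then", "else", "while"}   # assumed reserved words
OPERATORS = {"<", "<=", "<<", ">", "="}      # assumed operator set


def next_token(text, pos):
    """Return (type, lexeme, new_pos) for the token starting at text[pos]."""
    ch = text[pos]
    if ch.isalpha():
        end = pos
        while end < len(text) and text[end].isalnum():
            end += 1
        lexeme = text[pos:end]
        # Start with an identifier; change its type if it is a reserved word.
        kind = "keyword" if lexeme in KEYWORDS else "identifier"
        return kind, lexeme, end
    # Look ahead one character and prefer the longer operator if it exists.
    two = text[pos:pos + 2]
    if two in OPERATORS:
        return "operator", two, pos + 2
    if ch in OPERATORS:
        return "operator", ch, pos + 1
    raise ValueError(f"unexpected character {ch!r} at offset {pos}")


def tokenize(text):
    """Scan the whole input, skipping white space between tokens."""
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        kind, lexeme, pos = next_token(text, pos)
        tokens.append((kind, lexeme))
    return tokens


tokens = tokenize("then thenVar a<b a<=b a<<b")
for token in tokens:
    print(token)
```

Because the identifier loop consumes as many letters and digits as possible before the keyword check, `thenVar` is never split into the keyword `then` followed by `Var`.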