CS502: Compiler Design, Lexical Analysis (Cont.), Manas Thakur, Fall 2020



SLIDE 1

CS502: Compiler Design Lexical Analysis (Cont.) Manas Thakur

Fall 2020

SLIDE 2

Manas Thakur CS502: Compiler Design 2

Recognizing strings in regular languages

  • Done using Finite Automata
  • A kind of state machine

– A state denotes a remembrance of the string read so far.
– An arrow from state A to state B over a character c denotes a transition (state change).

  • Formally, a finite automaton M consists of 5 components:

– Q: Set of states
– Σ: Set of input symbols (alphabet)
– q: Initial state
– F: Set of final (accept) states
– δ: Transition function

  • M accepts a string w if reading w takes M to an accept state.
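
The acceptance condition can be written out with the usual extended transition function. The symbol \hat{\delta} does not appear on the slide and is introduced here only for illustration, assuming M is deterministic (δ maps a state and a symbol to exactly one state):

```latex
\hat{\delta}(p, \varepsilon) = p,
\qquad
\hat{\delta}(p, xa) = \delta\bigl(\hat{\delta}(p, x),\, a\bigr)
\quad (x \in \Sigma^{*},\ a \in \Sigma),
\qquad
M \text{ accepts } w \iff \hat{\delta}(q, w) \in F .
```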
SLIDE 3

Regular expression to finite automata

  • Exercise: Write a regular expression that accepts your first name and nothing else!

  • Finite automaton that accepts “manas”:

– Q: {q0, q1, q2, q3, q4, q5}
– Σ: {a..z}
– q: q0
– F: {q5}
– δ: {<q0,m,q1>, <q1,a,q2>, <q2,n,q3>, <q3,a,q4>, <q4,s,q5>}

  • “Invisible” arrows take to an error state.

[Figure: q0 --m--> q1 --a--> q2 --n--> q3 --a--> q4 --s--> q5]
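
The five components above translate almost directly into code. The following is a minimal sketch, not the lecture's own implementation; the names `ERROR` and `accepts` are illustrative, and the explicit `ERROR` state stands in for the "invisible" arrows.

```python
# Sketch of the five components listed on the slide.
ERROR = "err"  # explicit sink state replacing the "invisible" arrows (assumed name)

Q = {"q0", "q1", "q2", "q3", "q4", "q5", ERROR}
SIGMA = set("abcdefghijklmnopqrstuvwxyz")
START = "q0"
FINAL = {"q5"}
DELTA = {
    ("q0", "m"): "q1",
    ("q1", "a"): "q2",
    ("q2", "n"): "q3",
    ("q3", "a"): "q4",
    ("q4", "s"): "q5",
}


def accepts(w: str) -> bool:
    """Run the automaton on w; any transition missing from DELTA leads to ERROR."""
    state = START
    for ch in w:
        state = DELTA.get((state, ch), ERROR)
    return state in FINAL
```

Once the automaton enters `ERROR` it never leaves, so a string with an extra character after "manas" is rejected as well.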

SLIDE 4

Finite automaton to recognize identifiers

Regular expression: letter(letter|digit)*

[Figure: q0 --letter--> q1; q1 --letter, digit--> q1; q1 --other--> q2; q0 --digit, other--> q3]
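
As a quick check of the regular expression itself, it can be written with Python's `re` module. This is only a sketch: treating letter as `[A-Za-z]` and digit as `[0-9]` is an assumption, since the slide does not say which characters belong to each class.

```python
import re

# letter(letter|digit)* with letter = [A-Za-z] and digit = [0-9] (assumed classes).
IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z0-9]*")


def is_identifier(s: str) -> bool:
    # fullmatch requires the whole string to match, not just a prefix.
    return IDENTIFIER.fullmatch(s) is not None
```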
SLIDE 5

Tables for the recognizer

  • Two tables control the recognizer:
  • To change languages, we can just change tables.
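
The tables themselves did not survive in this transcript. The sketch below shows one plausible shape for them, assuming a character-class table and a next-state table; all names are hypothetical. Unlike the diagram, this variant checks a whole string instead of stopping at a delimiter.

```python
import string

# Table 1: character -> character class (anything not listed counts as "other").
CHAR_CLASS = {
    **{c: "letter" for c in string.ascii_letters},
    **{c: "digit" for c in string.digits},
}

# Table 2: state -> {class -> next state}. State 1 accepts; a missing entry is an error.
IDENTIFIER_TABLE = {0: {"letter": 1}, 1: {"letter": 1, "digit": 1}}
NUMBER_TABLE = {0: {"digit": 1}, 1: {"digit": 1}}  # digit digit*


def recognize(text, next_state, accept=frozenset({1})):
    """Generic driver: the language is determined entirely by the tables."""
    state = 0
    for ch in text:
        cls = CHAR_CLASS.get(ch, "other")
        state = next_state.get(state, {}).get(cls)
        if state is None:  # no table entry: reject
            return False
    return state in accept
```

Passing `NUMBER_TABLE` instead of `IDENTIFIER_TABLE` switches the recognized language without touching `recognize`.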
SLIDE 6

Code for the recognizer

  • Given an automaton, can we write a recognizer for a token?
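
The slide's code is missing from this transcript. One possible answer, sketched here under the assumption that the token is an identifier, is to hand-code the automaton's transitions directly; `scan_identifier` is a hypothetical name, and the state labels in the comments follow the identifier automaton as far as its diagram can be read.

```python
def scan_identifier(src: str, pos: int = 0):
    """Return (("id", lexeme), next_pos), or None if no identifier starts at pos."""

    def is_letter(c):
        return ("a" <= c <= "z") or ("A" <= c <= "Z")

    def is_digit(c):
        return "0" <= c <= "9"

    if pos >= len(src) or not is_letter(src[pos]):
        return None  # q0 on digit/other (or end of input): no identifier here
    end = pos + 1  # q0 --letter--> q1
    while end < len(src) and (is_letter(src[end]) or is_digit(src[end])):
        end += 1  # q1 loops on letter/digit
    # An "other" character (or end of input) ends the token; it is not consumed.
    return ("id", src[pos:end]), end
```

Returning the position after the lexeme lets the caller continue scanning from the character that ended the token.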
SLIDE 7

Classwork

  • Draw an FA that recognizes strings over alphabet {a, b} with exactly three bs.

– a*ba*ba*ba*

  • Strings of length>1 starting and ending with a.

– a(a+b)*a

  • Strings with third last letter as a.

– (a+b)*a(a+b)(a+b)

  • Do you see non-determinism in the above two FAs?

– We have actually constructed a non-deterministic FA (the first one being a deterministic FA)!
  • Not for CS502:

– Conversion of NFA to DFA, minimization of DFA, ...
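
To see the non-determinism concretely, here is a sketch of an NFA for (a+b)*a(a+b)(a+b), run by tracking every state it could be in after each character. The state numbering and names are assumptions made for illustration.

```python
# Transition relation: (state, symbol) -> set of possible next states.
NFA = {
    (0, "a"): {0, 1},  # non-deterministic: stay in 0, or guess this a is third from the end
    (0, "b"): {0},
    (1, "a"): {2},
    (1, "b"): {2},
    (2, "a"): {3},
    (2, "b"): {3},
}
START, ACCEPT = 0, {3}


def nfa_accepts(w: str) -> bool:
    """Accept if at least one sequence of choices ends in an accept state."""
    current = {START}
    for ch in w:
        current = set().union(*(NFA.get((s, ch), set()) for s in current))
    return bool(current & ACCEPT)
```

In state 0 on input a there are two possible moves, which is exactly what a deterministic FA does not allow.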

SLIDE 8

Automatic construction of lexers

  • JavaCC: Popular lexical analyzer (and also parser) generator

[Figure: file.jj → javacc → Lexer in Java → javac → Java Bytecode → java, which reads the input stream and produces tokens]

  • Takes regular expressions as input.
  • Constructs equivalent finite automata.
  • Emits code for the scanner.
  • Lex/Flex: another popular lexer generator written in C.
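
The idea of going from regular expressions to a scanner can be imitated in a few lines of Python. This is an analogy only, not JavaCC: Python's `re` engine stands in for the automaton construction step, and `TOKEN_SPEC` and `make_lexer` are hypothetical names.

```python
import re

# Hypothetical token specification: (token name, regular expression) pairs.
TOKEN_SPEC = [
    ("NUM", r"[0-9]+"),
    ("ID", r"[A-Za-z][A-Za-z0-9]*"),
    ("OP", r"[=+\-*/<>();]"),
    ("SKIP", r"\s+"),
]


def make_lexer(spec):
    """Combine the regexes into one pattern and return a tokenizing function."""
    master = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in spec))

    def lex(text):
        pos = 0
        while pos < len(text):
            m = master.match(text, pos)
            if m is None:
                raise SyntaxError(f"illegal character {text[pos]!r} at offset {pos}")
            if m.lastgroup != "SKIP":  # whitespace is matched but not emitted
                yield (m.lastgroup, m.group())
            pos = m.end()

    return lex
```

For example, `list(make_lexer(TOKEN_SPEC)("x1 = 42;"))` yields the tokens for `x1`, `=`, `42` and `;` in order.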

SLIDE 9

JavaCC regexes in action

  • BREAKING NEWS:

– You can start doing A1 today (spec on Moodle before eod).

SLIDE 10

Errors in lexical analysis

  • It is difficult for a lexer to identify errors.

– Limited resources: e.g., no context information.

  • fi (a = f(x))

– Is fi a misspelling for if, or a function identifier?

  • As fi is a valid lexeme for the token identifier, the lexer must return the token <id, fi>.
  • A later phase (parser or semantic analyzer) may be able to catch the error.

  • But some errors can be caught by a lexer:

– int %x;
– if (a < b);$
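
Both cases can be reproduced with a small lexer sketch. The keyword set, the legal punctuation and the names `tokens` and `LexError` are assumptions made for this example, not part of the lecture.

```python
import re

KEYWORDS = {"if", "int"}  # assumed keyword set
WORD = re.compile(r"[A-Za-z][A-Za-z0-9]*")
PUNCT = set("=<>();")  # assumed legal punctuation


class LexError(Exception):
    pass


def tokens(src):
    """Return a list of (kind, lexeme) pairs; raise LexError on an illegal character."""
    out, i = [], 0
    while i < len(src):
        ch = src[i]
        m = WORD.match(src, i)
        if ch.isspace():
            i += 1
        elif m:
            word = m.group()
            out.append(("keyword" if word in KEYWORDS else "id", word))
            i = m.end()
        elif ch in PUNCT:
            out.append(("punct", ch))
            i += 1
        else:
            raise LexError(f"illegal character {ch!r} at offset {i}")
    return out
```

Here `fi` comes back as an ordinary identifier because nothing at the lexical level marks it as wrong, while `%` and `$` match no token pattern and are reported.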

What should a lexer do on detecting an error?

SLIDE 11

Error handling in lexical analysis

  • Panic and exit(1).
  • Try to recover from the error and proceed.

Why?

  • We are a compiler; not an interpreter!
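
One reading of this answer is that a compiler should report as many errors as it can in a single run instead of stopping at the first. The sketch below contrasts the two strategies; the legal character set and the function name are assumptions.

```python
import string

# Assumed set of characters that may appear in a valid program.
LEGAL = set(string.ascii_letters + string.digits + " \t\n=<>();")


def find_lexical_errors(src, recover=True):
    """Return (offset, character) pairs for the illegal characters found."""
    errors = []
    for offset, ch in enumerate(src):
        if ch not in LEGAL:
            errors.append((offset, ch))
            if not recover:
                break  # "panic": give up at the first error
            # Otherwise skip the offending character and keep scanning.
    return errors
```

With recovery enabled, both illegal characters in `int %x; if (a < b);$` are reported in one pass; without it, only the `%` is.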
SLIDE 12

Error recovery in lexical analysis

  • Delete one character from the input.
  • Insert a missing character into the remaining input.

– Which one?

  • Replace a character by another character.
  • Transpose two adjacent characters.
  • Theoretical problem: Find the smallest number of transformations (add, delete, replace) needed to convert a source program to one that consists only of valid lexemes.

– Too expensive in practice.

  • In practice, most lexical errors involve a single character.
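
Since most errors involve a single character, a cheap approximation is to try every single-character transformation of a suspicious lexeme and keep those that yield a known keyword. This is a sketch under that assumption; the keyword set and function names are hypothetical.

```python
import string

KEYWORDS = {"if", "int", "else", "while"}  # assumed keyword set


def single_edits(lexeme, alphabet=string.ascii_lowercase):
    """All strings reachable by one delete, insert, replace, or adjacent transpose."""
    splits = [(lexeme[:i], lexeme[i:]) for i in range(len(lexeme) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    inserts = {a + c + b for a, b in splits for c in alphabet}
    replaces = {a + c + b[1:] for a, b in splits if b for c in alphabet}
    transposes = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    return deletes | inserts | replaces | transposes


def keyword_repairs(lexeme):
    """Keywords that are exactly one transformation away from the lexeme."""
    return sorted(single_edits(lexeme) & KEYWORDS)
```

For instance, `keyword_repairs("fi")` finds `if` through a transposition, though the lexer still cannot know whether the programmer meant the keyword.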
SLIDE 13

Limits of Regular Languages

  • Not all languages are regular.
  • Try constructing an FA for the following languages:

– L = {0^n 1^n}
– L = {w c w^r | w ∈ Σ*}

Note: neither of these is a regular expression!

  • FAs cannot count properly!
  • However, this is a little subtle. One can construct FAs for:

– Alternating 0s and 1s

  • (ε | 1)(01)*(ε | 0)

– Sets of pairs of 0s and 1s

  • (01 | 10)+
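
The contrast can be tried out directly. In this sketch the two regular expressions are transcribed for Python's `re` module, while {0^n 1^n} is checked by a function that compares lengths, something no fixed number of states can do for unbounded n. Allowing n = 0 is an assumption, as the slide does not give the range of n.

```python
import re

ALTERNATING = re.compile(r"1?(01)*0?")  # (ε | 1)(01)*(ε | 0)
PAIRS = re.compile(r"(01|10)+")


def is_0n1n(s: str) -> bool:
    """Recognize {0^n 1^n}, assuming n >= 0; relies on an unbounded count n."""
    n = len(s) // 2
    return len(s) % 2 == 0 and s == "0" * n + "1" * n
```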
SLIDE 14

What next?

  • Learn a language that could recognize L = {0^n 1^n} and L = {w c w^r | w ∈ Σ*}!

  • Why do we care?

– Do you find any similarity between the above and recognizing:

  • Matching parentheses/blocks?
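
One way to see the similarity: a string such as ((())) has the shape 0^n 1^n with "(" in place of 0 and ")" in place of 1. The sketch below checks matching parentheses with a depth counter that can grow without bound, which is what a finite automaton lacks; the function name is illustrative.

```python
def balanced(s: str) -> bool:
    """True if every ")" closes an earlier "(" and no "(" is left open."""
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a ")" with nothing to close
                return False
    return depth == 0
```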