CS502: Compiler Design
Lexical Analysis (Cont.)
Manas Thakur
Fall 2020
Manas Thakur CS502: Compiler Design 2
Recognizing strings in regular languages
- Done using Finite Automata
- A kind of state machine
– A state denotes a remembrance of the string read so far.
– An arrow from state A to state B over a character c denotes a transition (state change).
- Formally, a finite automaton M consists of 5 components:
– Q: Set of states
– Σ: Set of input symbols (alphabet)
– q: Initial state
– F: Set of final (accept) states
– δ: Transition function
- M accepts a string w if reading w takes M to an accept state.
Regular expression to finite automata
- Exercise: Write a regular expression that accepts your first
name and nothing else!
- Finite automaton that accepts “manas”:
– Q: {q0, q1, q2, q3, q4, q5}
– Σ: {a..z}
– q: q0
– F: {q5}
– δ: {<q0,m,q1>, <q1,a,q2>, <q2,n,q3>, <q3,a,q4>, <q4,s,q5>}
- “Invisible” arrows (transitions not drawn) lead to an error state.
q0 --m--> q1 --a--> q2 --n--> q3 --a--> q4 --s--> q5
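The five components above can be written down almost literally as data. The following is a minimal Python sketch, not the course's code: the names `DELTA`, `ERROR` and `accepts` are chosen here for illustration, and the explicit error state `qerr` is an assumption standing in for the "invisible" arrows.

```python
# Components of the automaton for "manas", copied from the slide.
Q = {"q0", "q1", "q2", "q3", "q4", "q5"}
SIGMA = set("abcdefghijklmnopqrstuvwxyz")
START = "q0"
F = {"q5"}
DELTA = {
    ("q0", "m"): "q1",
    ("q1", "a"): "q2",
    ("q2", "n"): "q3",
    ("q3", "a"): "q4",
    ("q4", "s"): "q5",
}
ERROR = "qerr"  # hypothetical name for the implicit error state


def accepts(w):
    """Return True if reading w from START ends in an accept state."""
    state = START
    for c in w:
        # Any transition missing from DELTA is an "invisible" arrow to ERROR;
        # ERROR has no outgoing entries, so the automaton stays there.
        state = DELTA.get((state, c), ERROR)
    return state in F
```

For example, `accepts("manas")` holds, while `accepts("mana")` and `accepts("manass")` do not.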
Finite automaton to recognize identifiers
Regular expression: letter(letter|digit)*
[Figure: state diagram over states q0, q1, q2, q3 with edge labels "letter", "letter, digit", "other", and "digit, other"]
Tables for the recognizer
- Two tables control the recognizer:
- To change languages, we can just change tables.
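The tables themselves did not survive in this text, so the following Python sketch only assumes a common layout: one table maps characters to character classes, the other maps (state, class) pairs to the next state. The state roles (q2 as accept, q3 as error), the helper names `make_classifier` and `recognize`, and the second table `INT_DELTA` are all assumptions made for illustration.

```python
import string


def make_classifier(classes):
    """Table 1: map each character to a class name ("other" if unlisted)."""
    table = {c: name for name, chars in classes.items() for c in chars}
    return lambda c: table.get(c, "other")


def recognize(text, classify, delta, start="q0", accept="q2", error="q3"):
    """Drive the automaton from the tables; return the lexeme or None."""
    state, i = start, 0
    while state not in (accept, error):
        # Past the end of the input, behave as if an "other" character follows.
        cls = classify(text[i]) if i < len(text) else "other"
        state = delta[(state, cls)]
        i += 1
    # The character that triggered the final transition is not part of the lexeme.
    return text[:i - 1] if state == accept else None


classify = make_classifier({"letter": string.ascii_letters, "digit": string.digits})

# Table 2 for identifiers: letter(letter|digit)*
IDENT_DELTA = {
    ("q0", "letter"): "q1", ("q0", "digit"): "q3", ("q0", "other"): "q3",
    ("q1", "letter"): "q1", ("q1", "digit"): "q1", ("q1", "other"): "q2",
}

# Swapping only table 2 gives a recognizer for unsigned integers: digit digit*
INT_DELTA = {
    ("q0", "digit"): "q1", ("q0", "letter"): "q3", ("q0", "other"): "q3",
    ("q1", "digit"): "q1", ("q1", "letter"): "q2", ("q1", "other"): "q2",
}
```

The driver `recognize` never changes; `recognize("count1 = 0", classify, IDENT_DELTA)` and `recognize("42;", classify, INT_DELTA)` differ only in the table passed in.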
Code for the recognizer
- Given an automaton, can we write a recognizer for a token?
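The slide's own code is not present in this text. One possible answer, sketched in Python, is to encode the states of the identifier automaton directly in control flow instead of in tables. The function name `scan_identifier` and the restriction to ASCII letters and digits are assumptions.

```python
def is_letter(c):
    return c.isascii() and c.isalpha()


def is_digit(c):
    return c.isascii() and c.isdigit()


def scan_identifier(text, pos=0):
    """Return (lexeme, next_pos) for an identifier starting at pos, or None."""
    # Start state: only a letter may begin an identifier.
    if pos >= len(text) or not is_letter(text[pos]):
        return None
    end = pos + 1
    # Loop state: stay here while letters or digits keep arriving.
    while end < len(text) and (is_letter(text[end]) or is_digit(text[end])):
        end += 1
    # Any other character (or end of input) ends the lexeme without consuming it.
    return text[pos:end], end
```

For instance, `scan_identifier("x1y2+3")` yields the lexeme `x1y2` and the position of the `+`, from which the next token can be scanned.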
Classwork
- Draw an FA that recognizes strings over alphabet {a, b} with
exactly three bs.
– a*ba*ba*ba*
- Strings of length>1 starting and ending with a.
– a(a+b)*a
- Strings whose third-last letter is a.
– (a+b)*a(a+b)(a+b)
- Do you see non-determinism in the above two FAs?
– We have actually constructed a non-deterministic FA (the first
one being a deterministic FA)!
- Not for CS502:
– Conversion of NFA to DFA, minimization of DFA, ...
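To see the non-determinism concretely, here is a small Python sketch of one possible NFA for (a+b)*a(a+b)(a+b). The state names s0..s3 and the function `nfa_accepts` are made up for this example. On reading an a in s0, the automaton may either stay in s0 or guess that this a is the third-last letter; the simulation simply tracks every state the NFA could be in, without building a DFA.

```python
# (state, symbol) -> set of possible next states
NFA = {
    ("s0", "a"): {"s0", "s1"},  # two choices on 'a': this is the non-determinism
    ("s0", "b"): {"s0"},
    ("s1", "a"): {"s2"},
    ("s1", "b"): {"s2"},
    ("s2", "a"): {"s3"},
    ("s2", "b"): {"s3"},
}
ACCEPT = {"s3"}


def nfa_accepts(w):
    """Accept if at least one sequence of choices ends in an accept state."""
    current = {"s0"}
    for c in w:
        nxt = set()
        for s in current:
            nxt |= NFA.get((s, c), set())
        current = nxt
    return bool(current & ACCEPT)
```

For example, `nfa_accepts("bbabb")` holds because the third-last letter is a, while `nfa_accepts("bab")` does not.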
Automatic construction of lexers
- JavaCC: Popular lexical analyzer (and also parser) generator
[Figure: file.jj → javacc → Lexer in Java → javac → Java Bytecode → java, which reads the input stream and produces tokens]
- Takes regular expressions as input.
- Constructs equivalent finite automata.
- Emits code for the scanner.
- Lex/Flex: another popular lexer generator written in C.
JavaCC regexes in action
- BREAKING NEWS:
– You can start doing A1 today (spec on Moodle before eod).
Errors in lexical analysis
- It is difficult for a lexer to identify errors.
– Limited resources: e.g., no context information.
- fi (a = f(x))
– Is fi a misspelling for if, or a function identifier?
- As fi is a valid lexeme for the token identifier, the lexer must
return the token <id, fi>.
- A later phase (parser or semantic analyzer) may be able to catch
the error.
- But some errors can be caught by a lexer:
– int %x;
– if (a < b);$
What should a lexer do on detecting an error?
Error handling in lexical analysis
- Panic and exit(1).
- Try to recover from the error and proceed.
Why?
- We are a compiler; not an interpreter!
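The two options can be contrasted in a toy lexer. This is a Python sketch under assumptions: the token set (identifiers, integers, a few symbols), the names `lex` and `LexError`, and the `recover` flag are invented for the example, and raising an exception stands in for `exit(1)`.

```python
import re

# Hypothetical token set for the demo: identifiers, integers, a few symbols.
TOKEN = re.compile(r"(?P<id>[A-Za-z][A-Za-z0-9]*)|(?P<num>[0-9]+)|(?P<sym>[()<>=;])")


class LexError(Exception):
    pass


def lex(text, recover=True):
    """Return (tokens, errors). With recover=False, stop at the first error."""
    tokens, errors, pos = [], [], 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        m = TOKEN.match(text, pos)
        if m:
            tokens.append((m.lastgroup, m.group()))
            pos = m.end()
            continue
        message = f"invalid character {text[pos]!r} at offset {pos}"
        if not recover:
            raise LexError(message)  # "panic": give up on the first error
        errors.append(message)
        pos += 1  # recover by skipping the offending character
    return tokens, errors
```

With recovery, `lex("int %x;")` still returns the tokens for `int`, `x` and `;` along with one recorded error, so later phases can keep going and report further problems in the same run.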
Error recovery in lexical analysis
- Delete one character from the input.
- Insert a missing character into the remaining input.
– Which one?
- Replace a character by another character.
- Transpose two adjacent characters.
- Theoretical problem: Find the smallest number of
transformations (add, delete, replace) needed to convert a source program to one that consists only of valid lexemes.
– Too expensive in practice.
- In practice, most lexical errors involve a single character.
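Because most errors involve a single character, one can enumerate every string that is one transformation away from a bad lexeme and keep the valid ones. The Python sketch below does this for the four transformations listed above. The names `single_edits` and `repairs`, the chosen alphabet, and the identifier pattern used as the validity check are assumptions.

```python
import re
import string

ALPHABET = string.ascii_lowercase + string.digits  # assumed repair alphabet


def single_edits(lexeme, alphabet=ALPHABET):
    """All strings one delete, insert, replace, or transpose away from lexeme."""
    splits = [(lexeme[:i], lexeme[i:]) for i in range(len(lexeme) + 1)]
    deletes = {l + r[1:] for l, r in splits if r}
    inserts = {l + c + r for l, r in splits for c in alphabet}
    replaces = {l + c + r[1:] for l, r in splits if r for c in alphabet}
    transposes = {l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1}
    return deletes | inserts | replaces | transposes


def is_identifier(s):
    """Hypothetical validity check: lowercase identifiers only."""
    return re.fullmatch(r"[a-z][a-z0-9]*", s) is not None


def repairs(lexeme, is_valid=is_identifier):
    """Single-character repairs of lexeme that yield a valid lexeme."""
    return sorted(s for s in single_edits(lexeme) if is_valid(s))
```

Calling `repairs("%x")` returns the deletion `x` together with every replacement of `%` by a letter, which illustrates the "Which one?" question: even a single-character repair is rarely unique.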
Limits of Regular Languages
- Not all languages are regular.
- Try constructing an FA for the following languages:
– L = {0^n 1^n}
– L = {w c w^r | w ∈ Σ*}
Note: neither of these is a regular expression!
- FAs cannot count properly!
- However, this is a little subtle. One can construct FAs for:
– Alternating 0s and 1s
- (ε | 1)(01)*(ε | 0)
– Sets of pairs of 0s and 1s
- (01 | 10)+
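The two regular languages can be checked directly with Python's `re` module, writing ε as an empty alternative. The function names are chosen here for illustration. For contrast, the sketch also includes a membership test for {0^n 1^n} (assuming n ≥ 0): it has to compare two lengths that can grow without bound, which is exactly what a fixed, finite set of states cannot remember.

```python
import re

ALTERNATING = re.compile(r"(?:|1)(?:01)*(?:|0)")  # (ε | 1)(01)*(ε | 0)
PAIRS = re.compile(r"(?:01|10)+")                 # (01 | 10)+


def is_alternating(s):
    return ALTERNATING.fullmatch(s) is not None


def is_pairs(s):
    return PAIRS.fullmatch(s) is not None


def is_0n1n(s):
    """Membership in {0^n 1^n}, n >= 0: needs an unbounded count, so not regular."""
    n = len(s) // 2
    return len(s) % 2 == 0 and s == "0" * n + "1" * n
```

For example, `is_alternating("1010")` and `is_pairs("0110")` hold, while `is_alternating("0110")` does not.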
What next?
- Learn a language that could recognize L = {0^n 1^n} and
L = {w c w^r | w ∈ Σ*}!
- Why do we care?
– Do you find any similarity between the above and recognizing:
- Matching parentheses/blocks?
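The similarity can be made concrete: checking that parentheses match requires counting how many are currently open, just as {0^n 1^n} requires counting 0s. A minimal Python sketch follows; the function name `balanced` and its parameters are chosen for illustration.

```python
def balanced(text, open_ch="(", close_ch=")"):
    """Return True if every open_ch is matched by a later close_ch."""
    depth = 0  # number of currently unmatched opening characters
    for c in text:
        if c == open_ch:
            depth += 1
        elif c == close_ch:
            depth -= 1
            if depth < 0:
                return False  # a closer appeared with nothing open
    return depth == 0
```

The counter `depth` has no fixed upper bound, which is why this check is beyond a finite automaton. The same function handles blocks, e.g. `balanced("{{}}", "{", "}")`.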