
CS502: Compiler Design, Lexical Analysis (Cont.), Manas Thakur, Fall 2020



  1. CS502: Compiler Design Lexical Analysis (Cont.) Manas Thakur Fall 2020

  2. Recognizing strings in regular languages
     ● Done using finite automata.
     ● A finite automaton is a kind of state machine.
       – A state denotes a remembrance of the string read so far.
       – An arrow from state A to state B over a character c denotes a transition (state change).
     ● Formally, a finite automaton M consists of 5 components:
       – Q : Set of states
       – Σ : Set of input symbols (alphabet)
       – q : Initial state
       – F : Set of final (accept) states
       – δ : Transition function
     ● M accepts a string w if reading w takes M from the initial state to an accept state.

  3. Regular expression to finite automata
     ● Exercise: Write a regular expression that accepts your first name and nothing else!
     ● Finite automaton that accepts “manas”:
       q0 --m--> q1 --a--> q2 --n--> q3 --a--> q4 --s--> q5
       – Q: {q0, q1, q2, q3, q4, q5}
       – Σ : {a..z}
       – q: q0
       – F: {q5}
       – δ : {<q0,m,q1>, <q1,a,q2>, <q2,n,q3>, <q3,a,q4>, <q4,s,q5>}
     ● “Invisible” arrows lead to an error state.
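The 5-tuple for “manas” can be written down almost verbatim as data. The following is a minimal Python sketch, not code from the course: the state name `qe` for the implicit error state and the function name `accepts` are assumptions made for illustration.

```python
# Finite automaton for "manas", transcribed from the 5-tuple on the slide.
Q = {"q0", "q1", "q2", "q3", "q4", "q5"}
SIGMA = set("abcdefghijklmnopqrstuvwxyz")
START = "q0"
FINAL = {"q5"}
DELTA = {
    ("q0", "m"): "q1",
    ("q1", "a"): "q2",
    ("q2", "n"): "q3",
    ("q3", "a"): "q4",
    ("q4", "s"): "q5",
}
ERROR = "qe"  # hypothetical name for the error state the slide leaves implicit


def accepts(w: str) -> bool:
    """Return True if reading w takes the automaton from START to a final state."""
    state = START
    for c in w:
        # Every transition missing from DELTA is an "invisible" arrow to ERROR.
        # ERROR has no outgoing entries, so the automaton stays there.
        state = DELTA.get((state, c), ERROR)
    return state in FINAL


if __name__ == "__main__":
    for w in ["manas", "mana", "manasa", ""]:
        print(repr(w), accepts(w))
```

Only the exact string is accepted: a proper prefix stops short of q5, and any extra character falls into the error state.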

  4. Finite automaton to recognize identifiers
     Regular expression: letter(letter|digit)*
     [State diagram: states q0, q1, q2, q3; edge labels “letter”, “letter, digit”, “other”, and “digit, other”.]

  5. Tables for the recognizer
     ● Two tables control the recognizer:
     ● To change languages, we can just change tables.

  6. Code for the recognizer
     ● Given an automaton, can we write a recognizer for a token?
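One possible answer, sketched in Python. The slide's own tables and code are not part of this transcript, so everything here is an assumption: the two tables are taken to be a character-classification table and a state-transition table, and the identifier automaton is read in the usual way (q1 loops on letters and digits, q2 is the accept state reached on “other”, q3 is the error state). The names `char_class`, `NEXT_STATE` and `scan_identifier` are hypothetical.

```python
import string


# Table 1 (assumed): classify each input character.
def char_class(c: str) -> str:
    if c in string.ascii_letters:
        return "letter"
    if c in string.digits:
        return "digit"
    return "other"


# Table 2 (assumed): transition table for letter(letter|digit)*.
NEXT_STATE = {
    "q0": {"letter": "q1", "digit": "q3", "other": "q3"},
    "q1": {"letter": "q1", "digit": "q1", "other": "q2"},
}
ACCEPT = "q2"  # assumed accept state
ERROR = "q3"   # assumed error state


def scan_identifier(text: str, pos: int = 0):
    """Return (lexeme, next_pos) if an identifier starts at pos, else None."""
    state = "q0"
    i = pos
    while state not in (ACCEPT, ERROR):
        # End of input is treated like a character of class "other".
        cls = char_class(text[i]) if i < len(text) else "other"
        state = NEXT_STATE[state][cls]
        if state == "q1":
            i += 1  # consume only characters that belong to the identifier
    if state == ACCEPT:
        return text[pos:i], i
    return None


if __name__ == "__main__":
    print(scan_identifier("x1 = 5"))
    print(scan_identifier("9lives"))
```

The driver loop never mentions letters or digits directly, which is the point of the table-driven design: recognizing a different token class means swapping the two tables and leaving the loop unchanged.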

  7. Classwork
     ● Draw an FA that recognizes strings over the alphabet {a, b} with exactly three b's.
       – a*ba*ba*ba*
     ● Strings of length > 1 starting and ending with a.
       – a(a+b)*a
     ● Strings with a as the third-last letter.
       – (a+b)*a(a+b)(a+b)
     ● Do you see non-determinism in the last two FAs?
       – We have actually constructed non-deterministic FAs (the first one being a deterministic FA)!
     ● Not for CS502:
       – Conversion of NFA to DFA, minimization of DFA, ...
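The non-determinism in the third exercise shows up clearly when the automaton is simulated. The sketch below is one way to run an NFA for (a+b)*a(a+b)(a+b) by tracking the set of states it could be in. The state names s0 to s3 and the function name `nfa_accepts` are assumptions, since the classwork diagrams are not in the transcript.

```python
# NFA for (a+b)*a(a+b)(a+b): on reading 'a' in s0 the machine may either stay
# in s0 or guess that this 'a' is the third-last letter and move to s1.
NFA_DELTA = {
    ("s0", "a"): {"s0", "s1"},  # the non-deterministic choice
    ("s0", "b"): {"s0"},
    ("s1", "a"): {"s2"},
    ("s1", "b"): {"s2"},
    ("s2", "a"): {"s3"},
    ("s2", "b"): {"s3"},
}
START = "s0"
FINAL = {"s3"}


def nfa_accepts(w: str) -> bool:
    """Simulate the NFA by following all possible transitions at once."""
    current = {START}
    for c in w:
        following = set()
        for s in current:
            following |= NFA_DELTA.get((s, c), set())
        current = following
    # Accept if at least one possible run ends in a final state.
    return bool(current & FINAL)


if __name__ == "__main__":
    for w in ["abb", "babb", "bab", "ab"]:
        print(w, nfa_accepts(w))
```

Tracking sets of states like this is the idea that the NFA-to-DFA conversion makes explicit, although that construction is outside the scope of the course.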

  8. Automatic construction of lexers
     ● JavaCC: Popular lexical analyzer (and also parser) generator
       [Pipeline diagram: file.jj → javacc → lexer in Java → javac → Java bytecode → java; input stream → tokens]
     ● Takes regular expressions as input.
     ● Constructs equivalent finite automata.
     ● Emits code for the scanner.
     ● Lex/Flex: another popular lexer generator written in C.

  9. JavaCC regexes in action
     ● BREAKING NEWS:
       – You can start doing A1 today (spec on Moodle before eod).

  10. Errors in lexical analysis
      ● It is difficult for a lexer to identify errors.
        – Limited resources: e.g., no context information.
      ● fi (a = f(x))
        – Is fi a misspelling of if, or a function identifier?
      ● As fi is a valid lexeme for the token identifier, the lexer must return the token <id, fi>.
      ● A later phase (parser or semantic analyzer) may be able to catch the error.
      ● But some errors can be caught by a lexer:
        – int %x;
        – if (a < b);$
      ● What should a lexer do on detecting an error?
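Both situations can be reproduced with a toy lexer. This is a minimal Python sketch under stated assumptions: the token classes, the tiny keyword set (`if`, `int`) and the names `tokenize` and `LexicalError` are all invented for illustration and are not taken from the course or from JavaCC.

```python
import re

# Hypothetical token classes for a tiny C-like fragment.
TOKEN_SPEC = [
    ("keyword", r"\b(?:if|int)\b"),
    ("id", r"[A-Za-z][A-Za-z0-9]*"),
    ("num", r"[0-9]+"),
    ("op", r"[=<]"),
    ("punct", r"[();]"),
    ("ws", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


class LexicalError(Exception):
    """Raised when no token pattern matches at the current position."""


def tokenize(src: str):
    tokens = []
    pos = 0
    while pos < len(src):
        m = MASTER.match(src, pos)
        if m is None:
            raise LexicalError(f"unexpected character {src[pos]!r} at offset {pos}")
        if m.lastgroup != "ws":  # whitespace separates tokens but is not one
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


if __name__ == "__main__":
    # 'fi' is a perfectly valid identifier, so the lexer cannot complain.
    print(tokenize("fi (a = f(x))")[0])
    for bad in ["int %x;", "if (a < b);$"]:
        try:
            tokenize(bad)
        except LexicalError as e:
            print("lexical error:", e)
```

The lexer reports `%` and `$` because no token pattern can start with them, while `fi` passes through as an identifier and is left for a later phase to question.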

  11. Error handling in lexical analysis
      ● Panic and exit(1).
      ● Try to recover from the error and proceed. Why?
      ● We are a compiler; not an interpreter!

  12. Error recovery in lexical analysis
      ● Delete one character from the input.
      ● Insert a missing character into the remaining input.
        – Which one?
      ● Replace a character by another character.
      ● Transpose two adjacent characters.
      ● Theoretical problem: Find the smallest number of transformations (add, delete, replace) needed to convert a source program to one that consists only of valid lexemes.
        – Too expensive in practice.
      ● In practice, most lexical errors involve a single character.
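The first strategy, deleting one character, is the simplest to implement. The sketch below assumes a hypothetical set of valid lexemes (identifiers, numbers and a few operators and punctuation marks); the names `VALID` and `tokenize_with_recovery` are illustrative only. On an unmatched character it records the error, drops that character and keeps scanning.

```python
import re

# Hypothetical set of valid lexemes: whitespace, identifiers, numbers,
# and a few single-character operators and punctuation marks.
VALID = re.compile(r"\s+|[A-Za-z][A-Za-z0-9]*|[0-9]+|[=<();]")


def tokenize_with_recovery(src: str):
    """Return (tokens, errors); errors is a list of (offset, character)."""
    tokens, errors = [], []
    pos = 0
    while pos < len(src):
        m = VALID.match(src, pos)
        if m is None:
            # Recovery: report the error, delete one character, and proceed.
            errors.append((pos, src[pos]))
            pos += 1
            continue
        if not m.group().isspace():
            tokens.append(m.group())
        pos = m.end()
    return tokens, errors


if __name__ == "__main__":
    print(tokenize_with_recovery("int %x;"))
    print(tokenize_with_recovery("if (a < b);$"))
```

Because the lexer continues after each error, a single run can report every stray character in the input instead of stopping at the first one.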

  13. Limits of Regular Languages
      ● Not all languages are regular.
      ● Try constructing an FA for the following languages:
        – L = {0^n 1^n}
        – L = {w c w^r | w ∈ Σ*}
        Note: neither of these is a regular expression!
      ● FAs cannot count properly!
      ● However, this is a little subtle. One can construct FAs for:
        – Alternating 0s and 1s
          ● (ε | 1)(01)*(ε | 0)
        – Sets of pairs of 0s and 1s
          ● (01 | 10)+
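The contrast can be tried out directly. In the sketch below, the two regular languages are written with Python's `re` module (ε becomes an optional item), while 0^n 1^n is checked with an explicit counter, which is exactly the unbounded memory a finite automaton lacks. The function names are hypothetical, and n ≥ 0 is assumed, so the empty string belongs to the language.

```python
import re

# (ε | 1)(01)*(ε | 0): an optional item plays the role of ε.
ALTERNATING = re.compile(r"1?(?:01)*0?")
# (01 | 10)+
PAIRS = re.compile(r"(?:01|10)+")


def is_alternating(w: str) -> bool:
    return ALTERNATING.fullmatch(w) is not None


def is_pairs(w: str) -> bool:
    return PAIRS.fullmatch(w) is not None


def is_0n1n(w: str) -> bool:
    """Check membership in {0^n 1^n} using a counter (assumes n >= 0)."""
    count = 0
    i = 0
    while i < len(w) and w[i] == "0":  # count the leading 0s
        count += 1
        i += 1
    while i < len(w) and w[i] == "1":  # cancel one 0 per following 1
        count -= 1
        i += 1
    # All input must be consumed and the counts must balance.
    return i == len(w) and count == 0


if __name__ == "__main__":
    print(is_alternating("1010"), is_alternating("0110"))
    print(is_pairs("0110"), is_pairs("011"))
    print(is_0n1n("000111"), is_0n1n("0011011"))
```

The counter in `is_0n1n` can grow without bound, whereas the two regular expressions only ever need to remember the last symbol or two.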

  14. What next?
      ● Learn a language that could recognize L = {0^n 1^n} and L = {w c w^r | w ∈ Σ*}!
      ● Why do we care?
        – Do you find any similarity between the above and recognizing:
          ● Matching parentheses/blocks?
