
Lexical Analyzer Scanner: ASU Textbook Chapter 3.1, 3.3, 3.4, 3.6, 3.7, 3.5



  1. Lexical Analyzer: Scanner
     ASU Textbook Chapter 3.1, 3.3, 3.4, 3.6, 3.7, 3.5
     Tsan-sheng Hsu
     tshsu@iis.sinica.edu.tw
     http://www.iis.sinica.edu.tw/~tshsu

  2. Main tasks
     Read the input characters and produce as output a sequence of tokens that the parser uses for syntax analysis.
     Lexeme: a sequence of characters matched by a given pattern for a token.
     • Example:
         Lexeme:  pi   =        3.1416      ;
         Token:   ID   ASSIGN   FLOAT-LIT   SEMI-COL
     • Patterns:
       ⊲ an identifier (variable) starts with a letter and is followed by letters, digits, or "_";
       ⊲ a floating point number is a string of digits, then a dot, then another string of digits.
     Compiler notes #2, Tsan-sheng Hsu, IIS
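
The lexeme-to-token mapping above can be sketched with Python's `re` module. This is a minimal illustration, not the notes' own scanner: the token names come from the slide, while `TOKEN_SPEC`, `tokenize`, the `SKIP` rule for whitespace, and the exact regular expressions (including the underscore in identifiers) are assumptions made for this sketch.

```python
import re

# Token names follow the slide; the patterns are an assumed encoding of the
# two informal patterns (identifier, floating point number) plus punctuation.
TOKEN_SPEC = [
    ("FLOAT-LIT", r"[0-9]+\.[0-9]+"),         # digits, a dot, digits
    ("ID",        r"[A-Za-z][A-Za-z0-9_]*"),  # letter, then letters/digits/_
    ("ASSIGN",    r"="),
    ("SEMI-COL",  r";"),
    ("SKIP",      r"\s+"),                    # whitespace between lexemes
]

def tokenize(text):
    """Return a list of (token, lexeme) pairs for the input text."""
    tokens = []
    pos = 0
    while pos < len(text):
        for name, pattern in TOKEN_SPEC:
            match = re.compile(pattern).match(text, pos)
            if match:
                if name != "SKIP":            # whitespace produces no token
                    tokens.append((name, match.group()))
                pos = match.end()
                break
        else:
            # No pattern matched at this position.
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
    return tokens

print(tokenize("pi = 3.1416 ;"))
```

Patterns are tried in list order, so `FLOAT-LIT` is listed before `ID`; a production scanner would typically also apply a longest-match rule.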

  3. Strings
     Definitions and operations.
     • alphabet: a finite set of characters (symbols);
     • string: a finite sequence of characters from the alphabet;
     • $|S|$: length of a string $S$;
     • empty string: $\epsilon$;
     • $xy$: concatenation of strings $x$ and $y$, with $\epsilon x \equiv x \epsilon \equiv x$;
     • exponentiation:
       ⊲ $s^0 \equiv \epsilon$;
       ⊲ $s^i \equiv s^{i-1} s$, $i > 0$.

  4. Parts of a string
     Parts of a string: example string "necessary"
     • prefix: obtained by deleting zero or more trailing characters; e.g. "nece"
     • suffix: obtained by deleting zero or more leading characters; e.g. "ssary"
     • substring: obtained by deleting a prefix and a suffix; e.g. "ssa"
     • subsequence: obtained by deleting zero or more not necessarily contiguous symbols; e.g. "ncsay"
     • Proper prefix, suffix, substring or subsequence: one that is not equal to the original string.
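
The four notions above can be checked mechanically. The following is a small sketch in Python; the helper names (`is_prefix`, `is_suffix`, `is_substring`, `is_subsequence`, `is_proper`) are hypothetical and chosen only for this illustration.

```python
def is_prefix(p, s):
    """p is s with zero or more trailing characters deleted."""
    return s.startswith(p)

def is_suffix(p, s):
    """p is s with zero or more leading characters deleted."""
    return s.endswith(p)

def is_substring(p, s):
    """p is s with a prefix and a suffix deleted."""
    return p in s

def is_subsequence(p, s):
    """p is s with zero or more (not necessarily contiguous) symbols deleted."""
    remaining = iter(s)
    # Each symbol of p must be found, in order, in what is left of s;
    # the membership test consumes the iterator up to the match.
    return all(ch in remaining for ch in p)

def is_proper(p, s):
    """A proper part is one that differs from the original string."""
    return p != s

word = "necessary"
print(is_prefix("nece", word), is_suffix("ssary", word),
      is_substring("ssa", word), is_subsequence("ncsay", word))
```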

  5. Language
     Language: any set of strings over an alphabet.
     Operations on languages:
     • union: $L \cup M = \{ s \mid s \in L \text{ or } s \in M \}$;
     • concatenation: $LM = \{ st \mid s \in L \text{ and } t \in M \}$;
     • $L^0 = \{ \epsilon \}$;
     • Kleene closure: $L^* = \bigcup_{i=0}^{\infty} L^i$;
     • Positive closure: $L^+ = \bigcup_{i=1}^{\infty} L^i$;
     • $L^* = L^+ \cup \{ \epsilon \}$.
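
For finite languages these operations can be tried out directly with Python sets. This is only a sketch: the function names are hypothetical, and because the Kleene and positive closures are infinite in general, `closure_up_to` truncates the union at a chosen bound `n`, which is an assumption of this example rather than part of the definition.

```python
def concat(L, M):
    """LM = { st | s in L and t in M }."""
    return {s + t for s in L for t in M}

def power(L, i):
    """L^i, built from L^0 = {""} (the empty string stands for epsilon)."""
    result = {""}
    for _ in range(i):
        result = concat(result, L)
    return result

def closure_up_to(L, n, positive=False):
    """Union of L^i for i = 0..n, or i = 1..n for the positive closure."""
    first = 1 if positive else 0
    result = set()
    for i in range(first, n + 1):
        result |= power(L, i)
    return result

L = {"a", "b"}
M = {"c"}
print(sorted(L | M))                   # union
print(sorted(concat(L, M)))            # concatenation
print(sorted(closure_up_to({"a"}, 3))) # truncated Kleene closure of {a}
```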

  6. Regular expressions
     A regular expression r denotes a language L(r), also called a regular set.
     Operations on regular expressions:

       regular expression   language
       ∅                    empty set {}
       ε                    the set containing the empty string {ε}
       a                    {a}, where a is a legal symbol
       r|s                  L(r) ∪ L(s) (union)
       rs                   L(r)L(s) (concatenation)
       r*                   L(r)* (Kleene closure)

     Example:

       a|b                  {a, b}
       (a|b)(a|b)           {aa, ab, ba, bb}
       a*                   {ε, a, aa, aaa, ...}
       a|a*b                {a, b, ab, aab, ...}
       C identifier         (A|B|···)((A|B|···)|(0|1|···)|"_")*

  7. Regular definitions
     For simplicity, give names to regular expressions.
     • format: name → regular expression.
     • example 1: digit → 0 | 1 | 2 | ··· | 9.
     • example 2: letter → a | b | c | ··· | z | A | B | ···.
     Notational standards:

       notation   meaning
       r*         r+ | ε
       r+         rr*
       r?         r | ε
       [abc]      a | b | c
       [a-z]      a | b | c | ··· | z

     Example: C variable name: [A-Za-z][A-Za-z0-9]*
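
Regular definitions map naturally onto named pattern strings. The sketch below uses Python's `re` module, whose syntax happens to share the `|`, `*`, `+`, `?` and `[a-z]` notation; note that `re` also supports features beyond regular languages, so it is used here only as a convenient checker. The names `DIGIT`, `LETTER`, `C_NAME` and `matches` are assumptions of this example.

```python
import re

# Regular definitions: name -> regular expression.
DIGIT = r"[0-9]"
LETTER = r"[A-Za-z]"
# letter (letter | digit)*, the same language as [A-Za-z][A-Za-z0-9]*
C_NAME = rf"{LETTER}({LETTER}|{DIGIT})*"

def matches(pattern, text):
    """True if the whole text belongs to the language of the pattern."""
    return re.fullmatch(pattern, text) is not None

print(matches(C_NAME, "count1"))   # a letter followed by letters and digits
print(matches(C_NAME, "1count"))   # starts with a digit
print(matches(r"a|a*b", "aab"))    # aab is in L(a|a*b)
```

`re.fullmatch` is used rather than `re.match` so that the entire string, not just a prefix, must belong to the language.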

  8. Non-regular sets
     Balanced or nested constructs:
     • Example: if ··· then ··· else
     • Recognized by a context free grammar.
     Matching strings:
     • {wcw}, where w is a string of a's and b's and c is a legal symbol.
     Remark: anything that needs to "memorize" an unbounded amount of what happened in the past is not regular.

  9. Finite state automata (FA)
     FA is a mechanism used to recognize tokens specified by a regular expression.
     Definition:
     • A finite set of states.
     • A set of transitions, labeled by characters.
     • A starting state.
     • A set of final (accepting) states.
     Example: transition graph for the regular expression (abc+)+
     [Figure: transition graph with states 0 to 3 and start state 0; edges 0 -a-> 1, 1 -b-> 2, 2 -c-> 3, 3 -c-> 3, 3 -a-> 1.]

  10. Transition graph and table for FA
     Transition graph:
     [Figure: transition graph for (abc+)+ with states 0 to 3 and start state 0.]

     Transition table:
            a    b    c
       0    1
       1         2
       2              3
       3    1         3

     • Rows are current states.
     • Columns are input symbols.
     • Entries are resulting states.
     • Along with the table, a start state and a set of accepting states are also given.
     This is also called a GOTO table.

  11. Types of FA's
     Deterministic FA (DFA):
     • has a unique next state for each transition;
     • does not contain ε-transitions, that is, transitions that take ε as the input symbol.
     Nondeterministic FA (NFA):
     • may have more than one next state for a transition;
     • may contain ε-transitions.
     • Example: aa*|bb*.
     [Figure: NFA for aa*|bb* with states 0 to 4; start state 0 has ε-transitions into an a branch (states 1, 2) and a b branch (states 3, 4).]

  12. How to execute a DFA
     Algorithm:
         s ← starting state;
         while there are inputs do
             s ← Table[s, input]
         end while
         if s ∈ accepting states then ACCEPT else REJECT
     Example: input "abccabc". The accepting path:
         0 -a-> 1 -b-> 2 -c-> 3 -c-> 3 -a-> 1 -b-> 2 -c-> 3
     [Figure: transition graph for (abc+)+ with states 0 to 3 and start state 0.]
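
The table-driven loop above can be sketched in Python for the (abc+)+ example. The dictionary layout, the names `TABLE`, `START`, `ACCEPTING` and `run_dfa`, and the early rejection when the table has no entry are assumptions of this sketch; the accepting set {3} is read off the accepting path in the example.

```python
# Transition table for (abc+)+ keyed by (current state, input symbol).
# A missing key means there is no transition for that pair.
TABLE = {
    (0, "a"): 1,
    (1, "b"): 2,
    (2, "c"): 3,
    (3, "a"): 1,
    (3, "c"): 3,
}
START = 0
ACCEPTING = {3}

def run_dfa(text):
    """Return True (ACCEPT) or False (REJECT) for the input string."""
    state = START
    for symbol in text:
        if (state, symbol) not in TABLE:
            return False           # empty table entry: reject immediately
        state = TABLE[(state, symbol)]
    return state in ACCEPTING

print(run_dfa("abccabc"))
```

Rejecting on a missing entry plays the role of an implicit dead state; an alternative design is to add that state to the table explicitly so every entry is defined.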

  13. How to execute an NFA (informally)
     An NFA accepts an input string x if and only if there is some path in the transition graph from the starting state to some accepting state such that the edge labels along the path spell out x.
     There could be more than one path. (Note that a DFA has at most one.)
     Example: regular expression (a|b)*abb; input aabb.
     [Figure: NFA with states 0 to 3 and start state 0, matching the transition table below.]

     Transition table:
            a        b
       0    {0,1}    {0}
       1             {2}
       2             {3}

     Paths:
       0 -a-> 0 -a-> 1 -b-> 2 -b-> 3   Accept!
       0 -a-> 0 -a-> 0 -b-> 0 -b-> 0   Reject!
     The second path alone does not accept, but the input is accepted because the first path does.
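
One common way to make "there is some path" executable is to track the set of all states the NFA could be in after each symbol. The sketch below does this for the (a|b)*abb table on this slide; the names `NFA`, `START`, `ACCEPTING` and `nfa_accepts` are hypothetical, and the accepting set {3} is taken from the accepting path shown in the example.

```python
# NFA for (a|b)*abb: (state, symbol) -> set of possible next states.
NFA = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
    (2, "b"): {3},
}
START = 0
ACCEPTING = {3}

def nfa_accepts(text):
    """Follow all paths at once by keeping the set of reachable states."""
    current = {START}
    for symbol in text:
        following = set()
        for state in current:
            following |= NFA.get((state, symbol), set())
        current = following
    # Accept if at least one path ends in an accepting state.
    return bool(current & ACCEPTING)

print(nfa_accepts("aabb"))
```

This NFA has no ε-transitions, so no closure step is needed here.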

  14. From regular expressions to NFA's
     Structural decomposition:
     • atomic items: ∅, ε and a legal symbol.
     • r|s: [Figure: a new start state with ε-transitions to the start state for r and the start state for s.]
     • r*: [Figure: the NFA for r with added ε-transitions involving its start state and its accepting states.]
     • rs: [Figure: the NFA for r followed by the NFA for s.] Convert all accepting states in r into non-accepting states and add ε-transitions to the start state for s.

  15. Example: (a|b)*abb
     [Figure: NFA for (a|b)*abb with states 0 to 12, built by the construction for r|s, r* and rs.]
     This construction introduces nondeterminism only through ε-transitions, never through multiple transitions for an input symbol.
     It is possible to remove all ε-transitions from an NFA and replace them with multiple transitions for an input symbol, and vice versa.

  16. Construction theorems
     Theorem #1:
     • Any regular expression can be expressed by an NFA.
     • Any NFA can be converted into a DFA.
     That is, any regular expression can be expressed by a DFA.
     How to convert an NFA to a DFA:
     • Find the set of states that can be reached from an NFA state using ε-transitions.
     • Find the set of states that can be reached from an NFA state on an input symbol.
     Theorem #2:
     • Every DFA can be expressed as a regular expression.
     • Every regular expression can be expressed as a DFA.
     • DFAs and regular expressions have the same expressive power.
     How about the power of DFA and NFA?

  17. Converting an NFA to a DFA
     Definitions: let T be a set of states and a be an input symbol.
     • ε-closure(T): the set of NFA states reachable from some state s ∈ T using ε-transitions.
     • move(T, a): the set of NFA states to which there is a transition on the input symbol a from some state s ∈ T.
     • Both can be computed using standard graph algorithms.
     • ε-closure(move(T, a)): the set of states reachable from a state in T for the input a.
     Example: NFA for (a|b)*abb
     [Figure: NFA for (a|b)*abb with states 0 to 12.]
     • ε-closure({0}) = {0, 1, 2, 4, 6, 7}, that is, the set of all possible start states
     • move({2, 7}, a) = {3, 8}
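
Both operations are short graph traversals. The sketch below is illustrative only: because the 13-state figure is not reproduced in this text, it uses a smaller NFA for aa*|bb* whose edge list is an assumed reading of that earlier example, and the names `EPS`, `NFA`, `epsilon_closure` and `move` are hypothetical.

```python
EPS = ""  # label used in this sketch for epsilon-transitions (an assumption)

# Assumed NFA for aa*|bb*:
#   0 -eps-> 1, 0 -eps-> 3, 1 -a-> 2, 2 -a-> 2, 3 -b-> 4, 4 -b-> 4
NFA = {
    (0, EPS): {1, 3},
    (1, "a"): {2},
    (2, "a"): {2},
    (3, "b"): {4},
    (4, "b"): {4},
}

def epsilon_closure(states):
    """States reachable from any state in `states` using only eps-transitions."""
    closure = set(states)
    stack = list(states)
    while stack:                      # depth-first search over eps-edges
        state = stack.pop()
        for nxt in NFA.get((state, EPS), set()):
            if nxt not in closure:
                closure.add(nxt)
                stack.append(nxt)
    return closure

def move(states, symbol):
    """States reachable from any state in `states` by one `symbol` edge."""
    result = set()
    for state in states:
        result |= NFA.get((state, symbol), set())
    return result

start = epsilon_closure({0})
print(sorted(start))
print(sorted(epsilon_closure(move(start, "a"))))
```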

  18. Subset construction algorithm
     In the converted DFA, each state represents a subset of NFA states.
     • T -a-> ε-closure(move(T, a))
     Subset construction algorithm:
         initially, we have an unmarked state labeled with ε-closure({s0}), where s0 is the starting state.
         while there is an unmarked state with the label T do
           ⊲ mark the state with the label T
           ⊲ for each input symbol a do
           ⊲     U ← ε-closure(move(T, a))
           ⊲     if U is a subset of states that has never been seen before
           ⊲     then add an unmarked state with the label U
           ⊲ end for
         end while
     New accepting states: those that contain an original accepting state.
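
The algorithm above can be sketched in Python as a worklist loop. To stay within what the text states, the example uses the 4-state NFA for (a|b)*abb given earlier as a transition table (it has no ε-transitions, although the code handles them). All names are hypothetical, and skipping the empty subset instead of adding it as a dead state is a simplification made in this sketch.

```python
EPS = ""  # label for epsilon-transitions in this sketch (an assumption)

# NFA for (a|b)*abb: (state, symbol) -> set of next states.
NFA = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}, (2, "b"): {3}}
NFA_START = 0
NFA_ACCEPTING = {3}
ALPHABET = ["a", "b"]

def epsilon_closure(states):
    closure, stack = set(states), list(states)
    while stack:
        state = stack.pop()
        for nxt in NFA.get((state, EPS), set()):
            if nxt not in closure:
                closure.add(nxt)
                stack.append(nxt)
    return closure

def move(states, symbol):
    result = set()
    for state in states:
        result |= NFA.get((state, symbol), set())
    return result

def subset_construction():
    """Return (start state, transition dict, accepting states) of the DFA."""
    start = frozenset(epsilon_closure({NFA_START}))
    dfa = {}
    seen = {start}
    unmarked = [start]
    while unmarked:
        T = unmarked.pop()                 # mark T by removing it from the list
        for a in ALPHABET:
            U = frozenset(epsilon_closure(move(T, a)))
            if not U:
                continue                   # simplification: no dead state
            dfa[(T, a)] = U
            if U not in seen:              # never seen before: add unmarked
                seen.add(U)
                unmarked.append(U)
    accepting = {S for S in seen if S & NFA_ACCEPTING}
    return start, dfa, accepting

start, dfa, accepting = subset_construction()
for (T, a), U in sorted(dfa.items(), key=lambda kv: (sorted(kv[0][0]), kv[0][1])):
    print(sorted(T), a, sorted(U))
```

DFA states are stored as `frozenset` values so they can serve as dictionary keys and set members.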

  19. Example
     [Figure: NFA for (a|b)*abb with states 0 to 12.]
     First step:
     • ε-closure({0}) = {0,1,2,4,6,7}
     • move({0,1,2,4,6,7}, a) = {3,8}
     • ε-closure({3,8}) = {1,2,3,4,6,7,8}
     • move({0,1,2,4,6,7}, b) = {5}
     • ε-closure({5}) = {1,2,4,5,6,7}
     [Figure: partial DFA. From the state labeled 0,1,2,4,6,7, an a-edge leads to the state labeled 0,1,2,3,4,6,7,8,9 and a b-edge leads to the state labeled 0,1,2,4,5,6,7.]
