lexical analysis



  1. Lexical analysis CS440/540

  2. Lexical Analysis • Process: converting input string (source program) into substrings (tokens) • Input: source program • Output: a sequence of tokens • Also called: lexer, tokenizer, scanner

  3. Token and Lexeme • Token: a syntactic category • Lexeme: an instance of the token

     Token        Sample lexemes
     keyword      if, else, for, while, …
     whitespace   ‘ ’, ‘\t’, ‘\n’, …
     comparison   <, >, ==, !=, …
     identifier   total, score, name, …
     number       1, 3.14159, 0, …
     literal      “Super nice cool compiler”, “ComS”, …

  4. Basic design 1. Define a finite set of tokens. • Keyword, whitespace, identifier, … 2. Describe which strings belong to each token • Keyword: “if” or “else” or “for” or … • whitespace: non-empty sequence of blanks, newlines, and tabs • identifier: strings of letters or digits, starting with a letter

  5. Analysis example • Source: if (i == j) z = 0; else z = 1; • As a character string: \tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1; • Identifier: ? • Keyword: ? • Comparison: ? • Number: ? • Whitespace: ?

  6. Analysis example • Source: if (i == j) z = 0; else z = 1; • As a character string: \tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1; • Identifier: i, j, z • Keyword: if, else • Comparison: == • Number: 0, 1 • Whitespace: ‘ ’, \t, \n

  7. What would you do? • Foo<Bar<Bazz>> • These are nested templates in C++. • However, do you see any conflict?

  8. What would you do? • Foo<Bar<Bazz>> • These are nested templates in C++. • However, do you see any conflict? • Foo<Bar<Bazz >> • cin >> var
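The conflict can be reproduced with a small sketch. It assumes a scanner that prefers the two-character operator `>>` over `>`, which is a common rule but is not stated on this slide; the pattern and the function name `lexemes` are hypothetical.

```python
import re

# Hypothetical token pattern. Python tries alternatives left to right,
# so '>>' is listed before '>' to make the longer operator win.
TOKEN = re.compile(r"\s*([A-Za-z_][A-Za-z0-9_]*|>>|<|>)")

def lexemes(text):
    """Split text into lexemes, skipping whitespace between them."""
    text = text.rstrip()
    pos, out = 0, []
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character at position {pos}")
        out.append(m.group(1))
        pos = m.end()
    return out

print(lexemes("Foo<Bar<Bazz>>"))  # ['Foo', '<', 'Bar', '<', 'Bazz', '>>']
print(lexemes("cin >> var"))      # ['cin', '>>', 'var']
```

Under this rule the two closing angle brackets of the template come out as one `>>` lexeme, the same lexeme the stream expression produces, so the scanner alone cannot tell the two uses apart.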

  9. Alphabet, String, and Language • Alphabet (Σ) • Any finite set of symbols. • String over an alphabet • A finite sequence of symbols drawn from that alphabet. • Language (L) • Any countable set of strings over some fixed alphabet. • Formally: let S be a set of characters. A language over S is a set of strings of characters drawn from S.

     Alphabet             Language
     English characters   English sentences
     ASCII                C programs

  10. Operations on Languages • Single character • ‘c’ = {“c”} • Epsilon • ε = {“”} • Union • A + B = {s | s ∈ A or s ∈ B} • Concatenation • AB = {ab | a ∈ A and b ∈ B} • Iteration • A* = ∪_{i≥0} A^i, where A^i = A … A (i times)
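A minimal sketch of union, concatenation, and iteration on small finite languages, using Python sets of strings. The function names are illustrative, and because A* is infinite whenever A contains a non-empty string, `star_up_to` only builds the finite union of A^0 through A^n.

```python
from itertools import product

def union(a, b):
    """A + B: strings that are in A or in B."""
    return a | b

def concat(a, b):
    """AB: every string of A followed by every string of B."""
    return {x + y for x, y in product(a, b)}

def power(a, i):
    """A^i: A concatenated with itself i times; A^0 = {""}."""
    result = {""}
    for _ in range(i):
        result = concat(result, a)
    return result

def star_up_to(a, n):
    """Finite approximation of A*: the union of A^0 .. A^n."""
    result = set()
    for i in range(n + 1):
        result |= power(a, i)
    return result

A = {"a", "b"}
B = {"0", "1"}
print(sorted(union(A, B)))       # ['0', '1', 'a', 'b']
print(sorted(concat(A, B)))      # ['a0', 'a1', 'b0', 'b1']
print(sorted(power(A, 2)))       # ['aa', 'ab', 'ba', 'bb']
print(sorted(star_up_to(A, 1)))  # ['', 'a', 'b']
```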

  11. Example • L = {A, B, …, Z, a, b, …, z}, D = {0, 1, …, 9} • L + D • set of letters and digits; each string is either one letter or one digit • A, g, 1, … • LD • set of strings of length two, each consisting of one letter followed by one digit • c4, j8, y6, … • D^4 • set of all 4-digit strings • 1234, 7416, 2592, …

  12. Regular Expressions • Describe a language by combining language operations over some alphabet.

  13. Example • Keyword • “if” or “else” or “for” or … • keyword = ?

  14. Example • Keyword • “if” or “else” or “for” or … • keyword = ‘if’ + ‘else’ + ‘for’ + …

  15. Examples • Integer • non-empty string of digits • digit = ‘0’ + ‘1’ + … + ‘9’ • integer = ?

  16. Examples • Integer • non-empty string of digits • digit = ‘0’ + ‘1’ + … + ‘9’ • integer = digit digit* • Definition • A*: zero or more occurrences of A • A^+ = AA*: one or more occurrences of A • integer = digit^+ • A?: zero or one occurrence of A

  17. Examples • Identifier • Strings of letters or digits, starting with a letter • letter = ‘A’ + … + ‘Z’ + ‘a’ + … + ‘z’ • digit = ‘0’ + ‘1’ + … + ‘9’ • identifier = ?

  18. Examples • Identifier • Strings of letters or digits, starting with a letter • letter = ‘A’ + … + ‘Z’ + ‘a’ + … + ‘z’ • digit = ‘0’ + ‘1’ + … + ‘9’ • identifier = letter (letter + digit)*
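The keyword, integer, and identifier definitions can be tried out with Python's `re` module, which writes union as `|` instead of `+`. This is only a sketch: the keyword list contains just the three keywords the slides name, and the helper names are illustrative.

```python
import re

DIGIT = "[0-9]"        # digit  = '0' + '1' + ... + '9'
LETTER = "[A-Za-z]"    # letter = 'A' + ... + 'Z' + 'a' + ... + 'z'

KEYWORD = re.compile("if|else|for")                       # only the keywords named above
INTEGER = re.compile(DIGIT + DIGIT + "*")                 # integer = digit digit*
IDENTIFIER = re.compile(LETTER + "(" + LETTER + "|" + DIGIT + ")*")

def is_keyword(s):
    return KEYWORD.fullmatch(s) is not None

def is_integer(s):
    return INTEGER.fullmatch(s) is not None

def is_identifier(s):
    return IDENTIFIER.fullmatch(s) is not None

print(is_integer("2024"), is_integer(""))       # True False
print(is_identifier("x1"), is_identifier("1x")) # True False
print(is_keyword("else"), is_identifier("else")) # True True
```

The last line shows that a keyword such as `else` also matches the identifier pattern, so a lexer needs a rule for choosing between the two.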

  19. More Examples • Phone number • (515)-294-8813 • Σ = ? • area = ? • exchange = ? • phone = ? • phone number = ?

  20. More Examples • Phone number • (515)-294-8813 • Σ = digits ∪ {−, (, )} • area = digit^3 • exchange = digit^3 • phone = digit^4 • phone number = ‘(’ area ‘)-’ exchange ‘-’ phone

  21. More Examples • email address • weile@iastate.edu • Σ = ? • name = ? • address = ?

  22. More Examples • email address • weile@iastate.edu • Σ = letters ∪ {., @} • name = letter^+ • address = name ‘@’ name ‘.’ name
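Both "More Examples" patterns translate directly into Python regular expressions. The sketch below follows the slide definitions literally, so the address pattern is the simplified one from the slide and not a general email validator; the helper names are illustrative.

```python
import re

DIGIT = "[0-9]"
LETTER = "[A-Za-z]"

# phone number = '(' area ')-' exchange '-' phone
AREA = DIGIT + "{3}"
EXCHANGE = DIGIT + "{3}"
PHONE = DIGIT + "{4}"
PHONE_NUMBER = re.compile(r"\(" + AREA + r"\)-" + EXCHANGE + "-" + PHONE)

# address = name '@' name '.' name, with name = letter+
NAME = LETTER + "+"
ADDRESS = re.compile(NAME + "@" + NAME + r"\." + NAME)

def is_phone_number(s):
    return PHONE_NUMBER.fullmatch(s) is not None

def is_address(s):
    return ADDRESS.fullmatch(s) is not None

print(is_phone_number("(515)-294-8813"))  # True
print(is_phone_number("515-294-8813"))    # False
print(is_address("weile@iastate.edu"))    # True
print(is_address("weile@iastate"))        # False
```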

  23. An algorithm of lexical analysis • Transition diagram • A flowchart with states and edges; each edge is labelled with characters, and a subset of the states is marked as “final states.” • Transitions from state to state proceed along edges according to the next input character. • Every string that ends in a final state is accepted. • If the scanner gets “stuck” because there is no transition for the next character, it is an error. • Transition diagrams can be easily translated to programs using if or case statements.

  24. Implementation

      state0: c = getchar();
              if (isalpha(c)) { token += c; goto state1; }   /* first character must be a letter */
              error();                                        /* assumed not to return */
      state1: c = getchar();
              if (isalpha(c) || isdigit(c)) { token += c; goto state1; }
              if (isdelimiter(c)) goto state2;                /* a delimiter ends the identifier */
              error();
      state2: return(token);
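The same three-state machine can be sketched in Python with an explicit state variable in place of goto. The delimiter set and the choice to treat end of input as a delimiter are assumptions made for this sketch; the slide does not define `isdelimiter`.

```python
def scan_identifier(text, delimiters=" \t\n;()="):
    """Scan one identifier from the start of text (states 0, 1, 2 as on the slide)."""
    state = 0
    token = ""
    for c in text:
        is_letter = c.isascii() and c.isalpha()
        is_digit = c.isascii() and c.isdigit()
        if state == 0:
            if is_letter:
                token += c
                state = 1
            else:
                raise ValueError(f"identifier cannot start with {c!r}")
        elif state == 1:
            if is_letter or is_digit:
                token += c          # stay in state 1
            elif c in delimiters:
                return token        # state 2: accept
            else:
                raise ValueError(f"unexpected character {c!r}")
    if state == 1:
        return token                # assumption: end of input acts as a delimiter
    raise ValueError("empty input")

print(scan_identifier("total = 0"))  # total
print(scan_identifier("x1;"))        # x1
```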

  25. Finite automata • Finite automata • Deterministic Finite Automata (DFAs) • Non-deterministic Finite Automata (NFAs)

  26. Notation • Given a string s and a regexp R, is s ∈ L(R)? • There is variation in regular expression notation • Union: A + B ≡ A | B • Option: A + ε ≡ A? • Range: ‘a’ + ‘b’ + … + ‘z’ ≡ [a-z] • Excluded range: complement of [a-z] ≡ [^a-z]
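Python's `re` module uses the right-hand forms of these equivalences, so they can be checked directly. In this sketch `re.fullmatch` plays the role of the membership question s ∈ L(R); the signed-integer pattern is an illustrative example that does not come from the slides.

```python
import re
import string

def in_language(regexp, s):
    """Membership test s ∈ L(R): the whole string must match."""
    return re.fullmatch(regexp, s) is not None

# Range: 'a' + 'b' + ... + 'z' written out as an explicit union a|b|...|z.
explicit_union = "|".join(string.ascii_lowercase)

print(in_language("if|else", "else"))   # True   (union)
print(in_language("-?[0-9]+", "42"))    # True   (option: sign absent)
print(in_language("-?[0-9]+", "-42"))   # True   (option: sign present)
print(all(in_language("[a-z]", c) == in_language(explicit_union, c)
          for c in string.printable))   # True   (range equals the union)
print(in_language("[^a-z]", "Q"))       # True   (excluded range)
print(in_language("[^a-z]", "q"))       # False
```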

  27. Lexical Spec → Regular Expressions (1) 1. Write a rexp for the lexemes of each token • Number = digit^+ • Keyword = ‘if’ + ‘else’ + … • Identifier = letter (letter + digit)* • OpenPar = ‘(’ • ClosePar = ‘)’ 2. Construct R, matching all lexemes for all tokens • R = Keyword + Identifier + Number + … = R_1 + R_2 + R_3 + …

  28. Lexical Spec → Regular Expressions (2) 3. Let input be x_1 … x_n • For 1 ≤ i ≤ n check • x_1 … x_i ∈ L(R) 4. If success, then we know that • x_1 … x_i ∈ L(R_j) for some j 5. Remove x_1 … x_i from input and go to (3) • Example input: \tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1;

  29. Ambiguities • What if x_1 … x_i ∈ L(R) and also x_1 … x_j ∈ L(R)? • note that i ≠ j • Possible rule • pick the longest possible string in L(R) • What if x_1 … x_i ∈ L(R_j) and also x_1 … x_i ∈ L(R_k)? • note that j ≠ k • Possible rule • use the rule listed first
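The matching loop and both tie-breaking rules can be sketched as follows: at each position every rule is tried, the longest match wins, and on a tie the rule listed first wins. The rule set is an assumption built from the token examples in these slides, and the `punctuation` category is hypothetical, added only so that the sample input scans completely.

```python
import re

# Rules in priority order: earlier rules win ties.
# Within one rule, longer alternatives are listed first because
# Python's '|' takes the first alternative that matches.
RULES = [
    ("keyword",     r"if|else"),
    ("identifier",  r"[A-Za-z][A-Za-z0-9]*"),
    ("number",      r"[0-9]+"),
    ("comparison",  r"==|!=|<|>"),
    ("whitespace",  r"[ \t\n]+"),
    ("punctuation", r"[();=]"),      # hypothetical catch-all category
]
COMPILED = [(name, re.compile(pattern)) for name, pattern in RULES]

def tokenize(text):
    """Return (token, lexeme) pairs using longest match, then first-listed rule."""
    pos, tokens = 0, []
    while pos < len(text):
        best_name, best_end = None, pos
        for name, rx in COMPILED:
            m = rx.match(text, pos)
            # Only a strictly longer match replaces the current best,
            # so ties are resolved in favour of the earlier rule.
            if m and m.end() > best_end:
                best_name, best_end = name, m.end()
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best_name, text[pos:best_end]))
        pos = best_end
    return tokens

source = "\tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1;"
print([t for t in tokenize(source) if t[0] != "whitespace"][:5])
# [('keyword', 'if'), ('punctuation', '('), ('identifier', 'i'), ('comparison', '=='), ('identifier', 'j')]
print(tokenize("iffy"))  # [('identifier', 'iffy')]
```

The input `iffy` shows the longest-match rule (identifier beats the shorter keyword match `if`), while `if` on its own shows the priority rule (keyword and identifier tie, and keyword is listed first).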

  30. Finite Automata • A finite automaton consists of • An input alphabet Σ • A set of states S • A start state n • A set of accepting states F ⊆ S • A set of transitions state →^input state

  31. Finite Automata • Transition • s_1 →^a s_2 • Is read: • In state s_1 on input “a” go to state s_2 • If end of input and in accepting state ⇒ accept • Otherwise ⇒ reject

  32. Finite Automata State Graphs

  33. Simple examples • A finite automaton that accepts only “1” • A finite automaton accepting any number of 1’s followed by a single 0
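Both simple automata can be written as transition tables and run with a few lines of Python. The state names A and B are assumptions, since the state graphs are not part of this transcript; a missing table entry means the automaton is stuck and rejects.

```python
def run_dfa(transitions, start, accepting, text):
    """Run a DFA given as {(state, symbol): next_state}; reject when stuck."""
    state = start
    for ch in text:
        if (state, ch) not in transitions:
            return False             # no transition for this input: reject
        state = transitions[(state, ch)]
    return state in accepting        # accept only if input ends in an accepting state

# Accepts only "1".
ONLY_ONE = ({("A", "1"): "B"}, "A", {"B"})

# Accepts any number of 1's followed by a single 0.
ONES_THEN_ZERO = ({("A", "1"): "A", ("A", "0"): "B"}, "A", {"B"})

print(run_dfa(*ONLY_ONE, "1"), run_dfa(*ONLY_ONE, "11"))                # True False
print(run_dfa(*ONES_THEN_ZERO, "110"), run_dfa(*ONES_THEN_ZERO, "1100"))  # True False
```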

  34. And Another Example • Alphabet {0,1} • What language does this recognize?

  35. And Another Example • Alphabet {0,1} • What language does this recognize? • (1*0(0 + 1?|1)) +

  36. Epsilon Moves • Machine can move from state A to state B without reading input

  37. Deterministic and Nondeterministic Automata • Deterministic Finite Automata (DFA) • One transition per input per state • No ε-moves • Nondeterministic Finite Automata (NFA) • Can have multiple transitions for one input in a given state • Can have ε-moves

  38. Execution of Finite Automata • A DFA can take only one path through the state graph • Completely determined by input • NFAs can choose • Whether to make ε-moves • Which of multiple transitions for a single input to take

  39. Acceptance of NFAs • An NFA can get into multiple states • Rule: NFA accepts if it can get to a final state • Input: 100
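One way to apply the acceptance rule is to track the set of all states the NFA could be in after each input character. The NFA below is a hypothetical one (strings over {0,1} ending in 00), chosen because the slide's own diagram is not in this transcript; it happens to accept the input 100.

```python
def epsilon_closure(states, eps):
    """All states reachable from `states` using only ε-moves."""
    closure, stack = set(states), list(states)
    while stack:
        s = stack.pop()
        for t in eps.get(s, ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure

def nfa_accepts(delta, eps, start, accepting, text):
    """Accept if at least one possible path ends in an accepting state."""
    current = epsilon_closure({start}, eps)
    for ch in text:
        moved = set()
        for s in current:
            moved |= delta.get((s, ch), set())
        current = epsilon_closure(moved, eps)
    return bool(current & accepting)

# Hypothetical NFA: in state A on input 0 it may stay in A or move to B.
DELTA = {
    ("A", "0"): {"A", "B"},
    ("A", "1"): {"A"},
    ("B", "0"): {"C"},
}
EPS = {}  # this example has no ε-moves

print(nfa_accepts(DELTA, EPS, "A", {"C"}, "100"))  # True
print(nfa_accepts(DELTA, EPS, "A", {"C"}, "10"))   # False
```

On input 100 the tracked sets are {A}, then {A}, then {A, B}, then {A, B, C}; the last set contains the accepting state C.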

  40. NFA vs. DFA • NFAs and DFAs recognize the same set of languages (regular languages) • DFAs are faster to execute • DFA can be exponentially larger than NFA • For a given language NFA can be simpler than DFA (1*0(0|1)0*1?) +

  41. Regular Expressions to NFA (1) • For each kind of rexp, define an NFA • Notation: NFA for rexp M • For ε • For input a

  42. Regular Expressions to NFA (2) • For AB • For A | B

  43. Regular Expressions to NFA (3) • For A*

  44. Example: RegExp → NFA conversion • Consider the regular expression • (1|0)*1 • The NFA is
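Since the diagram is not part of this transcript, here is a sketch that builds an NFA for (1|0)*1 from one fragment per kind of regular expression and then runs it. It uses the standard Thompson-style constructions, which may differ in detail from the ones drawn on the slides; all names are illustrative.

```python
from itertools import count

_ids = count()   # fresh state numbers
EPS = None       # label used for ε-moves

class Fragment:
    """An NFA fragment with one start state and one accepting state."""
    def __init__(self, start, accept, edges):
        self.start, self.accept, self.edges = start, accept, edges

def symbol(a):
    s, f = next(_ids), next(_ids)
    return Fragment(s, f, [(s, a, f)])

def concat(m, n):   # AB: link M's accepting state to N's start
    return Fragment(m.start, n.accept,
                    m.edges + n.edges + [(m.accept, EPS, n.start)])

def union(m, n):    # A | B: new start and accept states around both
    s, f = next(_ids), next(_ids)
    extra = [(s, EPS, m.start), (s, EPS, n.start),
             (m.accept, EPS, f), (n.accept, EPS, f)]
    return Fragment(s, f, m.edges + n.edges + extra)

def star(m):        # A*: allow skipping M or repeating it
    s, f = next(_ids), next(_ids)
    extra = [(s, EPS, m.start), (s, EPS, f),
             (m.accept, EPS, m.start), (m.accept, EPS, f)]
    return Fragment(s, f, m.edges + extra)

def closure(states, edges):
    result, stack = set(states), list(states)
    while stack:
        s = stack.pop()
        for p, label, q in edges:
            if p == s and label is EPS and q not in result:
                result.add(q)
                stack.append(q)
    return result

def accepts(nfa, text):
    current = closure({nfa.start}, nfa.edges)
    for ch in text:
        moved = {q for p, label, q in nfa.edges if p in current and label == ch}
        current = closure(moved, nfa.edges)
    return nfa.accept in current

# (1|0)*1
NFA = concat(star(union(symbol("1"), symbol("0"))), symbol("1"))
print(accepts(NFA, "1101"), accepts(NFA, "10"))  # True False
```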

