Lexical analysis CS440/540 Lexical Analysis Process: converting - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Lexical analysis

CS440/540

slide-2
SLIDE 2

Lexical Analysis

  • Process: converting input string (source program) into substrings (tokens)
  • Input: source program
  • Output: a sequence of tokens
  • Also called: lexer, tokenizer, scanner
slide-3
SLIDE 3

Token and Lexeme

  • Token: a syntactic category
  • Lexeme: instance of the token

Token        Sample lexemes
keyword      if, else, for, while, …
whitespace   ‘ ’, ‘\t’, ‘\n’, …
comparison   <, >, ==, !=, …
identifier   total, score, name, …
number       1, 3.14159, 0, …
literal      “Super nice cool compiler”, “ComS”, …

slide-4
SLIDE 4

Basic design

  • 1. Define a finite set of tokens.
  • Keyword, whitespace, identifier, …
  • 2. Describe which strings belong to each token
  • Keyword: “if” or “else” or “for” or …
  • whitespace: non-empty sequence of blanks, newlines, and tabs
  • identifier: strings of letters or digits, starting with a letter
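
The two design steps can be sketched in Python as a small table of patterns. This is only an illustration: the table name `TOKEN_SPEC`, the helper `tokens_matching`, and the three-word keyword list are assumptions, not part of the slides.

```python
import re

# Hypothetical token table: step 1 is the set of keys, step 2 is the pattern
# that says which strings belong to each token.
TOKEN_SPEC = {
    "keyword": r"if|else|for",              # a fixed list of reserved words
    "whitespace": r"[ \n\t]+",              # non-empty run of blanks, newlines, tabs
    "identifier": r"[A-Za-z][A-Za-z0-9]*",  # a letter, then letters or digits
}

def tokens_matching(s):
    """Return the names of every token whose pattern matches the whole string s."""
    return [name for name, pattern in TOKEN_SPEC.items() if re.fullmatch(pattern, s)]
```

Note that `tokens_matching("if")` reports both `keyword` and `identifier`: a string can belong to more than one token, so a lexer needs a tie-breaking rule.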
slide-5
SLIDE 5

Analysis example

Source program:
if (i == j) z = 0; else z = 1;

The same program as the character string the lexer reads:
\tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1;

  • Identifier: ?
  • Keyword: ?
  • Comparison: ?
  • Number: ?
  • Whitespace: ?
slide-6
SLIDE 6

Analysis example

Source program:
if (i == j) z = 0; else z = 1;

The same program as the character string the lexer reads:
\tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1;

  • Identifier: i, j, z
  • Keyword: if, else
  • Comparison: ==
  • Number: 0, 1
  • Whitespace: ‘ ’, \t, \n
slide-7
SLIDE 7

What would you do?

  • Foo<Bar<Bazz>>
  • This is nested templates in C++.
  • However, do you see any conflict?
slide-8
SLIDE 8

What would you do?

  • Foo<Bar<Bazz>>
  • This is nested templates in C++.
  • However, do you see any conflict?
  • The closing >> in Foo<Bar<Bazz>> is the same character sequence as the operator in cin >> var
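
A minimal sketch of why this is a conflict, assuming a toy token set of identifiers plus the operators `<`, `>`, and `>>` (the pattern and the function name `lex` are illustrative). Because the lexer prefers the longer operator, both inputs produce a single `>>` token, and the lexer alone cannot tell a shift/stream operator from two closing template brackets.

```python
import re

# Assumed toy token set. '>>' is listed before '>' so the longer operator wins.
TOKEN_RE = re.compile(r"\s*(>>|<|>|[A-Za-z_]\w*)")

def lex(text):
    """Split text into lexemes, skipping whitespace between them."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character at position {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens
```

Here `lex("Foo<Bar<Bazz>>")` ends with one `>>` lexeme, exactly as `lex("cin >> var")` contains one. C++ compilers before C++11 required writing `> >` for this reason; C++11 changed the language rules so that `>>` can close two template argument lists.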
slide-9
SLIDE 9

Alphabet, String, and Language

  • Alphabet (Σ)
  • Any finite set of symbols.
  • String over an alphabet
  • A finite sequence of symbols drawn from that alphabet.
  • Language (L)
  • Any countable set of strings over some fixed alphabet.
  • Formally, let S be a set of characters. A language over S is a set of strings of characters drawn from S.

Alphabet             Language
English characters   English sentences
ASCII                C programs

slide-10
SLIDE 10

Operations on Languages

  • Single character
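  • $\text{'c'} = \{\text{"c"}\}$
  • Epsilon
  • $\varepsilon = \{\text{""}\}$
  • Union
  • $A + B = \{\, s \mid s \in A \text{ or } s \in B \,\}$
  • Concatenation
  • $AB = \{\, ab \mid a \in A \text{ and } b \in B \,\}$
  • Iteration
  • $A^* = \bigcup_{i \ge 0} A^i$, where $A^i = A \cdots A$ ($i$ times)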
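
For finite languages these operations can be tried out directly on Python sets of strings. The sketch below is an illustration with assumed function names; since A* is infinite whenever A contains a non-empty string, `star_up_to` only builds the finite union of A^0 through A^n.

```python
from itertools import product

def union(a, b):
    """A + B: strings that are in A or in B."""
    return a | b

def concat(a, b):
    """AB: every string of A followed by every string of B."""
    return {x + y for x, y in product(a, b)}

def power(a, i):
    """A^i: A concatenated with itself i times; A^0 is {""}."""
    result = {""}
    for _ in range(i):
        result = concat(result, a)
    return result

def star_up_to(a, n):
    """Finite approximation of A*: the union of A^0 .. A^n."""
    result = set()
    for i in range(n + 1):
        result |= power(a, i)
    return result
```

For example, `concat({"a", "b"}, {"0"})` gives `{"a0", "b0"}`, and `star_up_to({"a"}, 3)` gives the empty string plus `a`, `aa`, and `aaa`.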
slide-11
SLIDE 11

Example

  • $L = \{A, B, \ldots, Z, a, b, \ldots, z\}$, $D = \{0, 1, \ldots, 9\}$
  • $L + D$
  • set of strings each consisting of either one letter or one digit
  • A, g, 1, …
  • $LD$
  • set of strings of length two, each consisting of one letter followed by one digit
  • c4, j8, y6, …
  • $L^4$
  • set of all 4-letter strings
  • abcd, Scan, …; likewise $D^4$ is the set of all 4-digit strings: 1234, 7416, 2592, …
slide-12
SLIDE 12

Regular Expressions

  • A regular expression describes a language by combining language operations over some alphabet.

slide-13
SLIDE 13

Example

  • Keyword
  • “if” or “else” or “for” or …
  • keyword = ?
slide-14
SLIDE 14

Example

  • Keyword
  • “if” or “else” or “for” or …
  • keyword = ‘if’ + ‘else’ + ‘for’ + …
slide-15
SLIDE 15

Examples

  • Integer
  • non-empty string of digits
  • digit = ‘0’ + ‘1’ + … + ‘9’
  • integer = ?
slide-16
SLIDE 16

Examples

  • Integer
  • non-empty string of digits
  • digit = ‘0’ + ‘1’ + … + ‘9’
  • integer = digit digit*
  • Definition
  • A*: zero or more of the preceding element
  • A+ = AA*: one or more of the preceding element
  • integer = digit+
  • A?: zero or one of the preceding element
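
The same definitions can be checked with Python's `re` module, which uses `*`, `+`, and `?` with these meanings. This is a small sketch; the names are illustrative, and the optional minus sign in `SIGNED` is an assumption added only to show `A?`.

```python
import re

DIGIT = r"[0-9]"                    # digit = '0' + '1' + ... + '9'
INTEGER_STAR = f"{DIGIT}{DIGIT}*"   # integer = digit digit*
INTEGER_PLUS = f"{DIGIT}+"          # integer = digit+
SIGNED = f"-?{DIGIT}+"              # hypothetical: optional sign, to illustrate A?

def is_match(pattern, s):
    """True if the whole string s belongs to the language of pattern."""
    return re.fullmatch(pattern, s) is not None
```

`INTEGER_STAR` and `INTEGER_PLUS` accept exactly the same strings, and both reject the empty string because at least one digit is required.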
slide-17
SLIDE 17

Examples

  • Identifier
  • Strings of letters or digits, starting with a letter
  • letter = ‘A’ + … + ‘Z’ + ‘a’ + … + ‘z’
  • digit = ‘0’ + ‘1’ + … + ‘9’
  • identifier = ?
slide-18
SLIDE 18

Examples

  • Identifier
  • Strings of letters or digits, starting with a letter
  • letter = ‘A’ + … + ‘Z’ + ‘a’ + … + ‘z’
  • digit = ‘0’ + ‘1’ + … + ‘9’
  • identifier = letter (letter + digit)*
slide-19
SLIDE 19

More Examples

  • Phone number
  • (515)-294-8813
  • Σ = ?
  • area = ?
  • exchange = ?
  • phone = ?
  • phone number = ?
slide-20
SLIDE 20

More Examples

  • Phone number
  • (515)-294-8813
  • $\Sigma = \text{digits} \cup \{-, (, )\}$
  • $\text{area} = \text{digit}^3$
  • $\text{exchange} = \text{digit}^3$
  • $\text{phone} = \text{digit}^4$
  • phone number = ‘(’area ‘)-’ exchange ‘-’ phone
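
As a sketch, the same definition written with Python's `re` module, where `{3}` plays the role of the exponent 3. The constant names are illustrative.

```python
import re

AREA = r"[0-9]{3}"       # area = digit^3
EXCHANGE = r"[0-9]{3}"   # exchange = digit^3
PHONE = r"[0-9]{4}"      # phone = digit^4

# phone number = '(' area ')-' exchange '-' phone
PHONE_NUMBER = re.compile(rf"\({AREA}\)-{EXCHANGE}-{PHONE}")

def is_phone_number(s):
    return PHONE_NUMBER.fullmatch(s) is not None
```

`is_phone_number("(515)-294-8813")` is true, while a string without the parentheses or with a short last group is rejected.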
slide-21
SLIDE 21

More Examples

  • email address
  • weile@iastate.edu
  • Σ = ?
  • name = ?
  • address = ?
slide-22
SLIDE 22

More Examples

  • email address
  • weile@iastate.edu
  • Σ = letters ∪ {., @}
  • name = letter+
  • address = name ‘@’ name ‘.’ name
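
The slide's definition translates directly into a Python pattern. This sketch follows the simplified alphabet above (letters, `.`, and `@` only), so it is far stricter than real email address syntax; the names are illustrative.

```python
import re

NAME = r"[A-Za-z]+"   # name = letter+

# address = name '@' name '.' name
ADDRESS = re.compile(rf"{NAME}@{NAME}\.{NAME}")

def is_address(s):
    return ADDRESS.fullmatch(s) is not None
```

`is_address("weile@iastate.edu")` is true; an address with no dot after the `@`, or with a dot before it, is rejected by this simplified definition.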
slide-23
SLIDE 23

An algorithm of lexical analysis

  • Transition diagram
  • Flowchart with states and edges; each edge is labelled with characters; a certain subset of states is marked as “final states.”
  • Transition from state to state proceeds along edges according to the next input character.
  • Every string that ends up at a final state is accepted.
  • If the scanner gets “stuck” because there is no transition for a given character, it is an error.
  • Transition diagrams can be easily translated to programs using if or case statements.

slide-24
SLIDE 24

Implementation

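/* Identifier scanner: a letter, then letters or digits, ended by a delimiter.
   token is a string buffer; isdelimiter() and error() are helpers defined elsewhere. */
state0: c = getchar();
        if (isalpha(c)) { token += c; goto state1; }
        error();   /* first character is not a letter */
state1: c = getchar();
        if (isalpha(c) || isdigit(c)) { token += c; goto state1; }
        if (isdelimiter(c)) goto state2;
        error();   /* neither a letter, a digit, nor a delimiter */
state2: return(token);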
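
The same three-state diagram can be sketched in Python with an explicit state variable instead of goto. The delimiter set in `is_delimiter` and the choice to leave the delimiter unconsumed are assumptions made for this illustration.

```python
def is_delimiter(c):
    # Assumption: end of input (""), whitespace, or common punctuation ends an identifier.
    return c == "" or c.isspace() or c in "();=<>!,+-*/"

def scan_identifier(text, pos=0):
    """Return (lexeme, next_pos) for the identifier starting at text[pos]."""
    token, state = "", 0
    while True:
        c = text[pos] if pos < len(text) else ""   # "" stands for end of input
        if state == 0:                  # expect a letter
            if not c.isalpha():
                raise ValueError(f"identifier must start with a letter at {pos}")
            token += c
            pos += 1
            state = 1
        elif state == 1:                # letters or digits, until a delimiter
            if c.isalnum():
                token += c
                pos += 1
            elif is_delimiter(c):
                state = 2               # the delimiter itself is not consumed
            else:
                raise ValueError(f"unexpected character {c!r} at {pos}")
        else:                           # state 2: final state
            return token, pos
```

For example, `scan_identifier("total = 0")` returns `("total", 5)`. Note that `str.isalpha` also accepts non-ASCII letters, which is broader than the slide's letter definition.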

slide-25
SLIDE 25

Finite automata

  • Finite automata
  • Deterministic Finite Automata (DFAs)
  • Non-deterministic Finite Automata (NFAs)
slide-26
SLIDE 26

Notation

  • Given a string s and a regexp R, is s ∈ L(R)?
  • There is variation in regular expression notation
  • Union: A + B ≡ A | B
  • Option: A + ε ≡ A?
  • Range: ‘a’ + ‘b’ + … + ‘z’ ≡ [a-z]
  • Excluded range: complement of [a-z] ≡ [^a-z]
slide-27
SLIDE 27

Lexical Spec → Regular Expressions (1)

  • 1. Write a rexp for the lexemes of each token
  • Number = digit+
  • Keyword = ‘if’ + ‘else’ + …
  • Identifier = letter (letter + digit)*
  • OpenPar = ‘(’
  • ClosePar = ‘)’
  • 2. Construct R, matching all lexemes for all tokens
  • R = Keyword + Identifier + Number + …
  • = R1 + R2 + R3 + …
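
One way to sketch the combined expression R in Python is to join the per-token patterns with `|` and give each one a named group, so a match also reveals which Rj it came from. The rule list below is an illustrative subset and `which_token` is a hypothetical helper name.

```python
import re

# Each pair is (token name, regexp for its lexemes).
TOKEN_RULES = [
    ("Number", r"[0-9]+"),
    ("Keyword", r"if|else"),
    ("Identifier", r"[A-Za-z][A-Za-z0-9]*"),
    ("OpenPar", r"\("),
    ("ClosePar", r"\)"),
]

# R = R1 + R2 + R3 + ... written with '|' and one named group per token.
R = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_RULES))

def which_token(lexeme):
    """Return the name of the token whose pattern matches the whole lexeme, or None."""
    m = R.fullmatch(lexeme)
    return m.lastgroup if m else None
```

Python tries the alternatives in the order listed, so `which_token("if")` reports `Keyword` even though the identifier pattern would also match.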
slide-28
SLIDE 28

Lexical Spec → Regular Expressions (2)

  • 3. Let the input be x1…xn
  • For 1 ≤ i ≤ n, check whether
  • x1…xi ∈ L(R)
  • 4. If the check succeeds, then we know that
  • x1…xi ∈ L(Rj) for some j
  • 5. Remove x1…xi from the input and go to (3)

\tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1;

slide-29
SLIDE 29

Ambiguities

  • What if x1…xi ∈ L(R) and also x1…xj ∈ L(R)?
  • note that i ≠ j
  • Possible rule
  • pick the longest possible string in L(R)
  • What if x1…xi ∈ L(Rj) and also x1…xi ∈ L(Rk)?
  • note that j ≠ k
  • Possible rule
  • use the rule listed first
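
Both rules can be sketched together in a small Python lexer. It follows the prefix-checking procedure literally (testing whether each prefix is in L(Rj)), which is simple but slow; a real scanner would use an automaton instead. The rule list, including the `Comparison`, `Punct`, and `Whitespace` patterns, is an assumption made for the example.

```python
import re

# Order matters: on a tie in length, the rule listed first wins.
RULES = [
    ("Keyword", re.compile(r"if|else")),
    ("Identifier", re.compile(r"[A-Za-z][A-Za-z0-9]*")),
    ("Number", re.compile(r"[0-9]+")),
    ("Comparison", re.compile(r"==|!=|<|>")),
    ("Punct", re.compile(r"[();=]")),
    ("Whitespace", re.compile(r"[ \t\n]+")),
]

def longest_prefix(pattern, text, pos):
    """Largest end such that text[pos:end] is in L(pattern), or None."""
    for end in range(len(text), pos, -1):
        if pattern.fullmatch(text, pos, end):
            return end
    return None

def tokenize(text):
    pos, out = 0, []
    while pos < len(text):
        best = None                       # (token name, end index)
        for name, pattern in RULES:
            end = longest_prefix(pattern, text, pos)
            # Strictly longer replaces the current best; equal length keeps the earlier rule.
            if end is not None and (best is None or end > best[1]):
                best = (name, end)
        if best is None:
            raise ValueError(f"no token matches at position {pos}")
        name, end = best
        out.append((name, text[pos:end]))
        pos = end                         # remove x1..xi and repeat
    return out
```

With these rules, `tokenize("if")` yields a keyword (tie broken by rule order), while `tokenize("ifx")` and `tokenize("else1")` each yield a single identifier (longest match).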
slide-30
SLIDE 30

Finite Automata

  • A finite automaton consists of
  • An input alphabet Σ
  • A set of states S
  • A start state n
  • A set of accepting states F ⊆ S
  • A set of transitions $\text{state} \xrightarrow{\text{input}} \text{state}$
slide-31
SLIDE 31

Finite Automata

  • Transition
  • $s_1 \xrightarrow{a} s_2$
  • Is read:
  • In state s1 on input “a” go to state s2
  • If end of input and in accepting state → accept
  • Otherwise → reject
slide-32
SLIDE 32

Finite Automata State Graphs

slide-33
SLIDE 33

Simple examples

  • A finite automaton that accepts only “1”
  • A finite automaton accepting any number of 1’s followed by a single 0
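
Since the state graphs are not reproduced here, the two automata can be sketched as Python transition tables. The state names `A` and `B` are assumptions; the `accepts` function follows the acceptance rule stated earlier (accept only if the input ends in an accepting state, reject if stuck).

```python
def accepts(transitions, start, accepting, text):
    """Run a deterministic automaton given as {(state, symbol): next_state}."""
    state = start
    for symbol in text:
        if (state, symbol) not in transitions:
            return False              # stuck: no transition for this input
        state = transitions[(state, symbol)]
    return state in accepting

# Accepts only "1": start state A, accepting state B.
ONLY_ONE = {("A", "1"): "B"}

# Accepts any number of 1's followed by a single 0.
ONES_THEN_ZERO = {("A", "1"): "A", ("A", "0"): "B"}
```

For example, `accepts(ONES_THEN_ZERO, "A", {"B"}, "1110")` is true, while `"100"` is rejected because state B has no outgoing transition.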
slide-34
SLIDE 34

And Another Example

  • Alphabet {0,1}
  • What language does this recognize?
slide-35
SLIDE 35

And Another Example

  • Alphabet {0,1}
  • What language does this recognize?
  • (1*0(0+1?|1))+
slide-36
SLIDE 36

Epsilon Moves

  • Machine can move from state A to state B without reading input
slide-37
SLIDE 37

Deterministic and Nondeterministic Automata

  • Deterministic Finite Automata (DFA)
  • One transition per input per state
  • No ε-moves
  • Nondeterministic Finite Automata (NFA)
  • Can have multiple transitions for one input in a given state
  • Can have ε-moves
slide-38
SLIDE 38

Execution of Finite Automata

  • A DFA can take only one path through the state graph
  • Completely determined by input
  • NFAs can choose
  • Whether to make ε-moves
  • Which of multiple transitions for a single input to take
slide-39
SLIDE 39

Acceptance of NFAs

  • An NFA can get into multiple states
  • Rule: NFA accepts if it can get to a final state
  • Input: 100
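
A common way to run an NFA is to track the set of states it could be in after each input symbol. The sketch below assumes a hypothetical NFA (not necessarily the one drawn on the slide) that accepts strings over {0,1} ending in 00; ε-moves are stored under the empty-string label `EPS`.

```python
EPS = ""  # label used in this sketch for ε-moves

def eps_closure(nfa, states):
    """All states reachable from `states` using only ε-moves."""
    stack, closure = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in nfa.get((s, EPS), set()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure

def nfa_accepts(nfa, start, accepting, text):
    current = eps_closure(nfa, {start})
    for symbol in text:
        moved = set()
        for s in current:
            moved |= nfa.get((s, symbol), set())
        current = eps_closure(nfa, moved)
    # Accept if at least one of the possible states is final.
    return bool(current & accepting)

# Hypothetical NFA: A loops on 0 and 1, and may guess that the last two 0's start now.
ENDS_IN_00 = {
    ("A", "1"): {"A"},
    ("A", "0"): {"A", "B"},
    ("B", "0"): {"C"},
}
```

On input 100 this NFA is in {A} after 1, in {A, B} after the first 0, and in {A, B, C} after the second 0; since C is final, the input is accepted.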
slide-40
SLIDE 40

NFA vs. DFA

  • NFAs and DFAs recognize the same set of languages (regular languages)

  • DFAs are faster to execute
  • DFA can be exponentially larger than NFA
  • For a given language NFA can be simpler than DFA

(1*0(0|1)0*1?)+

slide-41
SLIDE 41

Regular Expressions to NFA (1)

  • For each kind of rexp, define an NFA
  • Notation: NFA for rexp M
  • For ε
  • For input a
slide-42
SLIDE 42

Regular Expressions to NFA (2)

  • For AB
  • For A | B
slide-43
SLIDE 43

Regular Expressions to NFA (3)

  • For A*
slide-44
SLIDE 44

Example: RegExp → NFA conversion

  • Consider the regular expression
  • (1|0)*1
  • The NFA is
slide-45
SLIDE 45

Example: RegExp → NFA conversion

  • Consider the regular expression
  • (1|0)*1
  • The NFA is
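
The diagram is not reproduced here, but the construction can be sketched in Python. The code below uses one common variant of the per-operator construction (Thompson's construction), with numbered states and function names chosen for this illustration; the NFA on the slide may be laid out differently.

```python
from itertools import count

EPS = ""          # label for ε-moves
_ids = count()    # fresh state numbers

def _merge(*tables):
    """Combine transition tables of the form {(state, label): set_of_states}."""
    out = {}
    for table in tables:
        for key, targets in table.items():
            out.setdefault(key, set()).update(targets)
    return out

def symbol(a):
    """NFA for a single input a: start --a--> final."""
    s, f = next(_ids), next(_ids)
    return s, f, {(s, a): {f}}

def concat(m, n):
    """NFA for AB: final state of A has an ε-move to the start of B."""
    (s1, f1, t1), (s2, f2, t2) = m, n
    return s1, f2, _merge(t1, t2, {(f1, EPS): {s2}})

def union(m, n):
    """NFA for A | B: new start branches to both, both join a new final state."""
    (s1, f1, t1), (s2, f2, t2) = m, n
    s, f = next(_ids), next(_ids)
    return s, f, _merge(t1, t2, {(s, EPS): {s1, s2}, (f1, EPS): {f}, (f2, EPS): {f}})

def star(m):
    """NFA for A*: skip A entirely, or run it and loop back."""
    s1, f1, t1 = m
    s, f = next(_ids), next(_ids)
    return s, f, _merge(t1, {(s, EPS): {s1, f}, (f1, EPS): {s1, f}})

def accepts(nfa, text):
    start, final, table = nfa
    def closure(states):
        stack, seen = list(states), set(states)
        while stack:
            for t in table.get((stack.pop(), EPS), set()):
                if t not in seen:
                    seen.add(t)
                    stack.append(t)
        return seen
    current = closure({start})
    for ch in text:
        step = set()
        for s in current:
            step |= table.get((s, ch), set())
        current = closure(step)
    return final in current

# (1|0)*1
NFA = concat(star(union(symbol("1"), symbol("0"))), symbol("1"))
```

The resulting NFA accepts exactly the strings over {0,1} that end in 1, for example `accepts(NFA, "0101")` is true and `accepts(NFA, "10")` is false.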
slide-46
SLIDE 46

NFA → DFA

  • Simulate the NFA
  • Each state of DFA
  • a non-empty subset of states of the NFA
  • Start state
  • the set of NFA states reachable through ε-moves from NFA start state
  • Add a transition $S \xrightarrow{a} S'$ to DFA iff
  • S’ is the set of NFA states reachable from any state in S after seeing the input a, considering ε-moves as well
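
The construction can be sketched in Python as a worklist over sets of NFA states. The NFA below, with states A to I for (1|0)*1, is a reconstruction: its edges are an assumption chosen to be consistent with the state sets S, T, and U listed in the example that follows, since the original diagram is not available.

```python
EPS = ""  # label for ε-moves

# Assumed NFA for (1|0)*1; start state A, final state I.
NFA = {
    ("A", EPS): {"B", "H"},
    ("B", EPS): {"C", "D"},
    ("C", "1"): {"E"},
    ("D", "0"): {"F"},
    ("E", EPS): {"G"},
    ("F", EPS): {"G"},
    ("G", EPS): {"A", "H"},
    ("H", "1"): {"I"},
}

def closure(nfa, states):
    """ε-closure of a set of NFA states, as a hashable frozenset."""
    stack, seen = list(states), set(states)
    while stack:
        for t in nfa.get((stack.pop(), EPS), set()):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return frozenset(seen)

def subset_construction(nfa, start, alphabet):
    """Return (DFA start state, DFA transitions); DFA states are sets of NFA states."""
    start_set = closure(nfa, {start})
    dfa, seen, todo = {}, {start_set}, [start_set]
    while todo:
        S = todo.pop()
        for a in alphabet:
            moved = set()
            for s in S:
                moved |= nfa.get((s, a), set())
            if not moved:
                continue                  # DFA states are non-empty subsets only
            T = closure(nfa, moved)
            dfa[(S, a)] = T
            if T not in seen:
                seen.add(T)
                todo.append(T)
    return start_set, dfa

START, DFA = subset_construction(NFA, "A", "01")
```

With this NFA the construction produces three DFA states: {A,B,C,D,H} as the start state, {A,B,C,D,F,G,H} after a 0, and {A,B,C,D,E,G,H,I} after a 1. Only the last contains the NFA's final state I, so it is the DFA's accepting state.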

slide-47
SLIDE 47

NFA → DFA: Example

slide-48
SLIDE 48

NFA → DFA: Example

S = {A, B, C, D, H}, T = {A, B, C, D, F, G, H}, U = {A, B, C, D, E, G, H, I}

slide-49
SLIDE 49

Implementation

  • A DFA can be implemented by a 2D table T
  • One dimension is “states”
  • Other dimension is “input symbol”
  • For every transition $S_i \xrightarrow{a} S_k$ define T[i,a] = k
  • DFA “execution”
  • If in state Si and input a, read T[i,a] = k and move to state Sk
  • Very efficient
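
A minimal table-driven sketch in Python, assuming a three-state DFA for (1|0)*1 (strings ending in 1) with states S, T, U numbered 0, 1, 2 and U accepting. The numbering and names are illustrative.

```python
# Column index for each input symbol.
SYMBOLS = {"0": 0, "1": 1}

# TABLE[i][a] = k means: in state i on input a, move to state k.
TABLE = [
    [1, 2],   # S (0): on 0 -> T, on 1 -> U
    [1, 2],   # T (1): on 0 -> T, on 1 -> U
    [1, 2],   # U (2): on 0 -> T, on 1 -> U
]
ACCEPTING = {2}

def run(text):
    """Execute the DFA: one table lookup per input character."""
    state = 0
    for ch in text:
        state = TABLE[state][SYMBOLS[ch]]   # KeyError for symbols outside {0, 1}
    return state in ACCEPTING
```

Each input character costs one lookup regardless of how large the automaton is, which is why the table form is efficient. Here `run("0101")` is true and `run("10")` is false.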
slide-50
SLIDE 50

Table Implementation of a DFA

slide-51
SLIDE 51

Table Implementation of a DFA

slide-52
SLIDE 52