Lexical Analysis: The Scanner
CSC 4181 Compiler Construction

Introduction

  • A scanner, sometimes called a lexical analyzer
  • A scanner:
      – gets a stream of characters (the source program)
      – divides it into tokens
  • Tokens are units that are meaningful in the source language.
  • Lexemes are strings that match the patterns of tokens.

Examples of Tokens in C

Tokens              Lexemes
identifier          Age, grade, Temp, zone, q1
number              3.1416, -498127, 987.76412097
string              "A cat sat on a mat.", "90183654"
open parenthesis    (
close parenthesis   )
semicolon           ;
reserved word if    IF, if, If, iF
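
One way a scanner might represent the token/lexeme pairs in the table is sketched below. The names `TokenKind`, `Token`, and `token_kind_name` are illustrative assumptions, not part of the course material.

```c
/* Token classes taken from the table; the enum names are hypothetical. */
typedef enum {
    TOK_IDENTIFIER,
    TOK_NUMBER,
    TOK_STRING,
    TOK_OPEN_PAREN,
    TOK_CLOSE_PAREN,
    TOK_SEMICOLON,
    TOK_IF
} TokenKind;

/* A token pairs its class with the lexeme that matched the pattern. */
typedef struct {
    TokenKind kind;
    const char *lexeme;
} Token;

/* Human-readable name of a token class, e.g. for debugging output. */
const char *token_kind_name(TokenKind kind)
{
    switch (kind) {
    case TOK_IDENTIFIER:  return "identifier";
    case TOK_NUMBER:      return "number";
    case TOK_STRING:      return "string";
    case TOK_OPEN_PAREN:  return "open parenthesis";
    case TOK_CLOSE_PAREN: return "close parenthesis";
    case TOK_SEMICOLON:   return "semicolon";
    case TOK_IF:          return "reserved word if";
    }
    return "unknown";
}
```

With this layout, the lexemes `Age` and `q1` would both carry the kind `TOK_IDENTIFIER` and differ only in the `lexeme` field.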

Scanning

  • When a token is found:

      – It is passed to the next phase of the compiler.
      – Sometimes values associated with the token, called attributes, need to be calculated.
      – Some tokens, together with their attributes, must be stored in the symbol/literal table.
          • It is necessary to check whether the token is already in the table.
  • Examples of attributes:
      – Attributes of a variable are name, address, type, etc.
      – An attribute of a numeric constant is its value.
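
The check-before-insert step can be sketched with a small fixed-size table. This is only an illustration: the names `Symbol`, `SymbolTable`, and `symtab_intern`, the size limits, and the single `address` attribute are all assumptions, and a real compiler would more likely use a hash table.

```c
#include <string.h>

#define MAX_SYMBOLS 64
#define MAX_NAME    32

typedef struct {
    char name[MAX_NAME];
    int  address;            /* one example attribute; hypothetical */
} Symbol;

typedef struct {
    Symbol entries[MAX_SYMBOLS];
    int    count;
} SymbolTable;

/* Returns the index of `name`, inserting it only if it is not already
   in the table. Returns -1 if the table is full or the name is too long. */
int symtab_intern(SymbolTable *t, const char *name)
{
    for (int i = 0; i < t->count; i++)
        if (strcmp(t->entries[i].name, name) == 0)
            return i;                        /* already in the table */

    if (t->count == MAX_SYMBOLS || strlen(name) >= MAX_NAME)
        return -1;

    strcpy(t->entries[t->count].name, name);
    t->entries[t->count].address = 0;
    return t->count++;
}
```

Calling `symtab_intern` twice with the same identifier returns the same index, so every occurrence of a variable refers to one entry and one set of attributes.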

How to construct a scanner

  • Define tokens in the source language.
  • Describe the patterns allowed for tokens.
  • Write regular expressions describing the patterns.
  • Construct a finite automaton (FA) for each pattern.
  • Combine all the FAs, which results in an NFA.
  • Convert the NFA into a DFA.
  • Write a program simulating the DFA.

Regular Expression

  • a character or symbol in the alphabet
  • an empty string
  • an empty set
  • if r and s are regular expressions, then so are:
      – r | s (alternation)
      – r s (concatenation)
      – r * (repetition, zero or more times)
      – (r) (grouping)

Extensions of regular expressions

  • [a-z]
      – any character in the range from a to z
  • .
      – any character
  • r+
      – one or more repetitions
  • r?
      – optional subexpression
  • ~(a | b | c), [^abc]
      – any single character NOT in the set

Examples of Patterns

  • (a | A) = the set {a, A}
  • [0-9]+ = (0 | 1 | ... | 9) (0 | 1 | ... | 9)*
  • [0-9]? = (0 | 1 | ... | 9 | ε)
  • [A-Za-z] = (A | B | ... | Z | a | b | ... | z)
  • A . = the string consisting of A followed by any one symbol
  • ~[0-9] = [^0123456789] = any character that is not 0, 1, ..., 9

Describing Patterns of Tokens

  • reservedIF = (IF | if | If | iF) = (I|i)(F|f)
  • letter = [a-zA-Z]
  • digit = [0-9]
  • identifier = letter (letter | digit)*
  • numeric = ("+" | "-")? digit+ ("." digit+)? (E ("+" | "-")? digit+)?
      – quoted characters such as "+" and "." stand for themselves
  • Comments
      – { (~})* }   // from the tiny C grammar
      – "/*" ([^*] | "*"+ [^*/])* "*"+ "/"   // C-style comments
      – ; (~newline)* newline   // assembly-language comments
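
To make the numeric pattern concrete, here is a minimal hand-written matcher for it: optional sign, one or more digits, optional fraction, optional exponent introduced by E. The function name `match_numeric` and the choice to return the length of the matched prefix are assumptions made for this sketch.

```c
#include <ctype.h>
#include <stddef.h>

/* Returns how many leading characters of s form a numeric lexeme,
   or 0 if s does not start with one. */
size_t match_numeric(const char *s)
{
    size_t i = 0;

    if (s[i] == '+' || s[i] == '-')          /* optional sign */
        i++;

    if (!isdigit((unsigned char)s[i]))       /* digit+ is mandatory */
        return 0;
    while (isdigit((unsigned char)s[i]))
        i++;

    /* Optional fraction: taken only if a digit follows the dot. */
    if (s[i] == '.' && isdigit((unsigned char)s[i + 1])) {
        i++;
        while (isdigit((unsigned char)s[i]))
            i++;
    }

    /* Optional exponent: taken only if digits follow E and the sign. */
    if (s[i] == 'E') {
        size_t j = i + 1;
        if (s[j] == '+' || s[j] == '-')
            j++;
        if (isdigit((unsigned char)s[j])) {
            while (isdigit((unsigned char)s[j]))
                j++;
            i = j;
        }
    }
    return i;
}
```

For an input such as `12.`, only `12` is matched, because the fraction part requires at least one digit after the dot.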

Disambiguating Rules

  • Is IF an identifier or a reserved word?
      – A reserved word cannot be used as an identifier.
      – A keyword can also be an identifier.
  • Is a string that begins with < one token or a sequence of tokens?
      – Principle of longest substring: when a string can be either a single token or a sequence of tokens, the single-token interpretation is preferred.
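
The principle of longest substring can be illustrated with a tiny operator scanner. The operator set (`<`, `<=`, `=`) and the names `OpToken` and `scan_operator` are assumptions chosen for this sketch; the key point is that the two-character candidate is tried before the one-character ones.

```c
#include <stddef.h>

typedef enum { OP_NONE, OP_LT, OP_LE, OP_ASSIGN } OpToken;

/* Scans one operator at the start of s and stores its length in *len.
   The longest candidate is checked first, so "<=" is never split
   into "<" followed by "=". */
OpToken scan_operator(const char *s, size_t *len)
{
    if (s[0] == '<' && s[1] == '=') { *len = 2; return OP_LE; }
    if (s[0] == '<')                { *len = 1; return OP_LT; }
    if (s[0] == '=')                { *len = 1; return OP_ASSIGN; }
    *len = 0;
    return OP_NONE;
}
```

Reversing the first two tests would make the scanner report `<` for the input `<=`, which is the sequence-of-tokens reading that the rule excludes.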

Nondeterministic Finite Automata

A nondeterministic finite automaton (NFA) is a mathematical model that consists of:

  1. A set of states S
  2. A set of input symbols Σ
  3. A transition function that maps state/symbol pairs to a set of states: S × (Σ ∪ {ε}) → subsets of S
  4. A special state s0 called the start state
  5. A set of states F (subset of S) of final states

INPUT: string
OUTPUT: yes or no

Example NFA

S = {0,1,2,3}   s0 = 0   Σ = {a,b}   F = {3}

Transition table:

  STATE   a      b      ε
  0       0,1    0      3
  1       -      2      -
  2       -      3      -
  3       -      -      -

Diagram edges: 0 → 0 on a,b; 0 → 1 on a; 1 → 2 on b; 2 → 3 on b; 0 → 3 on ε.

NFA Execution

An NFA says ‘yes’ for an input string if there is some path from the start state to some final state where all input has been processed.

NFA(int s0, int input_element) {
    if (all input processed and s0 is a final state) return Yes;
    if (not all input processed)
        for all states s1 where transition(s0, table[input_element]) = s1
            if (NFA(s1, input_element+1) == Yes) return Yes;
    /* ε-moves must be tried even after all input has been processed */
    for all states s1 where transition(s0, ε) = s1
        if (NFA(s1, input_element) == Yes) return Yes;
    return No;
}

Uses backtracking to search all possible paths
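
A runnable version of this backtracking search is sketched below. It assumes the example NFA has the moves 0 → {0,1} on a, 0 → {0} on b, 0 → {3} on ε, 1 → {2} on b, and 2 → {3} on b, with final state 3. The bitmask encoding and the names `delta`, `column`, and `nfa_accepts` are choices made for this sketch only.

```c
#include <stdbool.h>

#define NUM_STATES 4
#define EPS 2                      /* column index used for ε-moves */

/* delta[state][column] is a bitmask of target states (bit i = state i).
   Columns: 0 = 'a', 1 = 'b', EPS = ε. Assumed example NFA. */
static const unsigned delta[NUM_STATES][3] = {
    /* state 0 */ { (1u << 0) | (1u << 1), 1u << 0, 1u << 3 },
    /* state 1 */ { 0,                     1u << 2, 0       },
    /* state 2 */ { 0,                     1u << 3, 0       },
    /* state 3 */ { 0,                     0,       0       },
};
static const unsigned final_states = 1u << 3;

static int column(char c)
{
    return c == 'a' ? 0 : c == 'b' ? 1 : -1;
}

/* Backtracking search over all paths. This sketch assumes the NFA has
   no cycle of ε-moves; such a cycle would recurse forever. */
bool nfa_accepts(int state, const char *input)
{
    if (*input == '\0' && ((final_states >> state) & 1u))
        return true;

    if (*input != '\0') {
        int col = column(*input);
        if (col >= 0)
            for (int next = 0; next < NUM_STATES; next++)
                if (((delta[state][col] >> next) & 1u) &&
                    nfa_accepts(next, input + 1))
                    return true;
    }

    /* ε-moves consume no input. */
    for (int next = 0; next < NUM_STATES; next++)
        if (((delta[state][EPS] >> next) & 1u) && nfa_accepts(next, input))
            return true;

    return false;
}
```

Under the assumed table, state 0 loops on both symbols and has an ε-move to the final state, so `nfa_accepts(0, s)` is true for every string over {a, b}, including the empty string, and false as soon as `s` contains any other character.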

Deterministic Finite Automata

A deterministic finite automaton (DFA) is a mathematical model that consists of:

  1. A set of states S
  2. A set of input symbols Σ
  3. A transition function that maps state/symbol pairs to a state: S × Σ → S
  4. A special state s0 called the start state
  5. A set of states F (subset of S) of final states

INPUT: string
OUTPUT: yes or no
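
The slides list "convert NFA into DFA" as a step without spelling out a method. A common technique is the subset construction, in which each DFA state stands for a set of NFA states. The sketch below applies it to a small NFA assumed to have the moves 0 → {0,1} on a, 0 → {0} on b, 0 → {3} on ε, 1 → {2} on b, and 2 → {3} on b. All names and the bitmask encoding are illustrative assumptions.

```c
#define NFA_STATES 4
#define SYMBOLS    2                       /* column 0 = 'a', 1 = 'b' */
#define MAX_DFA    (1 << NFA_STATES)       /* at most one DFA state per subset */

/* NFA moves as bitmasks of target states (bit i = NFA state i). */
static const unsigned nfa_move[NFA_STATES][SYMBOLS] = {
    { (1u << 0) | (1u << 1), 1u << 0 },
    { 0,                     1u << 2 },
    { 0,                     1u << 3 },
    { 0,                     0       },
};
static const unsigned nfa_eps[NFA_STATES] = { 1u << 3, 0, 0, 0 };

unsigned dfa_sets[MAX_DFA];                /* NFA-state set behind each DFA state */
int      dfa_next[MAX_DFA][SYMBOLS];       /* DFA transition table */
int      dfa_count;

/* ε-closure: every state reachable from `set` using ε-moves only. */
static unsigned eps_closure(unsigned set)
{
    unsigned prev;
    do {
        prev = set;
        for (int s = 0; s < NFA_STATES; s++)
            if ((set >> s) & 1u)
                set |= nfa_eps[s];
    } while (set != prev);
    return set;
}

/* Index of the DFA state for `set`, creating it if it is new. */
static int find_or_add(unsigned set)
{
    for (int i = 0; i < dfa_count; i++)
        if (dfa_sets[i] == set)
            return i;
    dfa_sets[dfa_count] = set;
    return dfa_count++;
}

/* Builds the DFA and returns its number of states. */
int build_dfa(void)
{
    dfa_count = 0;
    find_or_add(eps_closure(1u << 0));     /* DFA start = closure of NFA start */

    for (int d = 0; d < dfa_count; d++) {  /* dfa_count grows as states are found */
        for (int c = 0; c < SYMBOLS; c++) {
            unsigned target = 0;
            for (int s = 0; s < NFA_STATES; s++)
                if ((dfa_sets[d] >> s) & 1u)
                    target |= nfa_move[s][c];
            dfa_next[d][c] = find_or_add(eps_closure(target));
        }
    }
    return dfa_count;
}
```

A DFA state is final when its set contains a final NFA state. For the assumed NFA the construction yields three DFA states, for the sets {0,3}, {0,1,3}, and {0,2,3}.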

FA Recognizing Tokens

  • Identifier
  • Numeric
  • Comment

[Figure: finite automata for the identifier, numeric, and comment patterns. Surviving edge labels: letter; letter, digit; digit; E; e.]

Example

  • identifier = letter(letter|digit)*
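
The identifier pattern corresponds to a two-state DFA: a start state that accepts only a letter, and an accepting state that loops on letters and digits. A minimal sketch, with the function name `is_identifier` and the state numbering chosen for illustration:

```c
#include <ctype.h>
#include <stdbool.h>

/* Returns true if the whole string s matches letter (letter | digit)*. */
bool is_identifier(const char *s)
{
    int state = 0;                         /* 0 = start, 1 = accepting */

    for (; *s != '\0'; s++) {
        unsigned char c = (unsigned char)*s;
        if (state == 0 && isalpha(c))
            state = 1;                     /* first character: a letter */
        else if (state == 1 && isalnum(c))
            state = 1;                     /* loop on letter or digit */
        else
            return false;                  /* no transition: reject */
    }
    return state == 1;
}
```

Lexemes such as `q1` and `Temp` end in the accepting state, while `1q` is rejected on its first character and the empty string never leaves the start state.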

Combining FAs

  • Identifiers
  • Reserved words
  • Combined

[Figure: an FA for identifiers (edges letter; letter, digit), FAs for the reserved words IF and ELSE (edges I,i F,f and E,e L,l S,s E,e), and the combined FA, which adds an "other letter" edge.]

Lookahead

[Figure: lookahead FA with edges I,i; F,f; letter, digit; and [other], ending in Return ID or Return IF.]
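
Lookahead means the scanner inspects the character after a candidate lexeme without consuming it: `if` followed by a letter or digit is still part of an identifier. The sketch below shows the idea, but it swaps in a simpler technique than the combined automaton: it scans the whole word first and then compares it against the reserved words. The names `WordToken`, `same_word`, and `scan_word` are assumptions.

```c
#include <ctype.h>
#include <stddef.h>

typedef enum { TOK_NONE, TOK_ID, TOK_IF, TOK_ELSE } WordToken;

/* Case-insensitive test: do the first n characters of s spell the
   lower-case keyword kw exactly? */
static int same_word(const char *s, size_t n, const char *kw)
{
    size_t i;
    for (i = 0; i < n && kw[i] != '\0'; i++)
        if (tolower((unsigned char)s[i]) != kw[i])
            return 0;
    return i == n && kw[i] == '\0';
}

/* Scans one word at the start of s and stores its length in *len. */
WordToken scan_word(const char *s, size_t *len)
{
    size_t i = 0;

    if (!isalpha((unsigned char)s[0])) {
        *len = 0;
        return TOK_NONE;
    }
    /* s[i] is the lookahead character: it is tested but, when it is not
       a letter or digit, it is left for the next token. */
    while (isalnum((unsigned char)s[i]))
        i++;
    *len = i;

    if (same_word(s, i, "if"))   return TOK_IF;
    if (same_word(s, i, "else")) return TOK_ELSE;
    return TOK_ID;
}
```

For the input `ifx=`, the lookahead after `if` is the letter `x`, so scanning continues and the result is an identifier of length 3. For `if(`, the lookahead `(` ends the word and the reserved word is returned.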

Implementing DFA

  • nested‐if
  • transition table

[Figure: combined DFA with edges I,i F,f; E,e L,l S,s E,e; a letter, digit loop; and three "other" edges, with results Return IF, Return ID, and Return ELSE.]

Nested IF

switch (state) {
    case 0:
        if (isletter(nxt)) state = 1;
        else if (isdigit(nxt)) state = 2;
        else state = 3;
        break;
    case 1:
        if (isletVdig(nxt)) state = 1;   /* letter or digit */
        else state = 4;
        break;
    …
}
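
A runnable sketch of the nested-if approach follows. It implements only the two cases shown above and assumes that states 2, 3, and 4 simply stop the scan, since the remaining cases are elided. The standard `isalpha` and `isalnum` stand in for `isletter` and `isletVdig`, and the names `next_state` and `run_dfa` are assumptions.

```c
#include <ctype.h>

/* One DFA step, written as nested ifs inside a switch on the state. */
int next_state(int state, int nxt)
{
    switch (state) {
    case 0:
        if (isalpha(nxt))      state = 1;
        else if (isdigit(nxt)) state = 2;
        else                   state = 3;
        break;
    case 1:
        if (isalnum(nxt)) state = 1;
        else              state = 4;
        break;
    default:
        break;          /* states 2, 3, 4: left unchanged in this sketch */
    }
    return state;
}

/* Runs the DFA from state 0 until it leaves states 0 and 1, and returns
   the state reached. The terminating '\0' counts as an "other" character. */
int run_dfa(const char *s)
{
    int state = 0;
    while (state == 0 || state == 1) {
        state = next_state(state, (unsigned char)*s);
        if (*s != '\0')
            s++;
    }
    return state;
}
```

Every new state or character class means editing the control flow itself, which is the main drawback of this style compared with a transition table.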

[Figure: state diagram for the switch statement, with states 1, 2, 3, and 4 and edges labeled digit; letter, digit; and other.]

Transition table

  ch \ St   0    1    2    3    …
  letter    1    1    ..   ..
  digit     2    1    ..   ..
  …         3    4    ..
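
The table-driven alternative keeps the same transitions as data. The sketch below fills in only the entries that are visible in the table (columns for states 0 and 1) and assumes that the unlabeled third row means "any other character". States 2, 3, and 4 are treated as stopping states, and all names are illustrative.

```c
#include <ctype.h>

enum { LETTER, DIGIT, OTHER, NUM_CLASSES };
enum { NUM_ROW_STATES = 2 };   /* only states 0 and 1 have entries here */

/* table[character class][current state] = next state */
static const int table[NUM_CLASSES][NUM_ROW_STATES] = {
    /* state:      0  1 */
    /* letter */ { 1, 1 },
    /* digit  */ { 2, 1 },
    /* other  */ { 3, 4 },     /* assumed meaning of the third row */
};

static int char_class(int c)
{
    if (isalpha(c)) return LETTER;
    if (isdigit(c)) return DIGIT;
    return OTHER;
}

/* Runs the DFA from state 0 until it reaches a state with no table
   column, and returns that state. */
int run_table_dfa(const char *s)
{
    int state = 0;
    while (state < NUM_ROW_STATES) {
        state = table[char_class((unsigned char)*s)][state];
        if (*s != '\0')
            s++;
    }
    return state;
}
```

The driver loop never changes when the automaton does; only the table is regenerated, which is why scanner generators typically emit this form.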

[Figure: the same state diagram, with states 1, 2, 3, and 4 and edges labeled digit; letter, digit; and other.]