slide-1
SLIDE 1

Lexical Analysis

April 3, 2013

Wednesday, April 3, 13

slide-2
SLIDE 2

Previously on CSE 131b...

Structure of a modern compiler

Source Code → Lexical Analysis → Syntax Analysis → Semantic Analysis → IR Generation → IR Optimization → Code Generation → Optimization → Machine Code

slide-3
SLIDE 3

Where are we?

Source Code → Lexical Analysis → Syntax Analysis → Semantic Analysis → IR Generation → IR Optimization → Code Generation → Optimization → Machine Code

slide-4
SLIDE 4

Goal of Lexical Analysis

Breaking the program down into words or “tokens”

Input: code (character stream)

while (ip < z)
	++ip;

w h i l e ( i p < z ) \n \t + + i p ;

slide-5
SLIDE 5

Goal of Lexical Analysis

while (ip < z)
	++ip;

w h i l e ( i p < z ) \n \t + + i p ;

Output: Token Stream

T_While ( T_Ident(ip) < T_Ident(z) ) ++ T_Ident(ip)

slide-6
SLIDE 6
w h i l e ( i p < z ) \n \t + + i p ;

T_While ( T_Ident(ip) < T_Ident(z) ) ++ T_Ident(ip)

[Figure: syntax tree with the nodes While, <, ++, and three Ident leaves (ip, z, ip)]

The token stream is then used as input to the parser (syntax analysis).

slide-7
SLIDE 7

What’s a token?

  • What’s a lexical unit of code?


slide-8
SLIDE 8

w h i l e ( 1 3 7 < i ) \n \t + + i ;

slide-9
SLIDE 9

w h i l e ( 1 3 7 < i ) \n \t + + i ;

What is my name?

slide-31
SLIDE 31

w h i l e ( 1 3 7 < i ) \n \t + + i ;

Token Type

  • Keyword: for int if else while
  • Punctuation: ( ) { } ;
  • Operator: + - ++
  • Relation: < > =
  • Identifier (variable name, function name): foo foo_2
  • Integer, floating point, string: 2345 2.0 “hello world”
  • Whitespace, comment: /* this code is awesome */

slide-32
SLIDE 32

Scanning a Source File

w h i l e ( 1 3 7 < i ) \n \t + + i ;

slide-41
SLIDE 41

Scanning a Source File

w h i l e ( 1 3 7 < i ) \n \t + + i ;

Token: T_While

Lexeme: the piece of the original program from which we made the token (here, the characters w h i l e)

slide-43
SLIDE 43

Scanning a Source File

w h i l e ( 1 3 7 < i ) \n \t + + i ;

T_While ( T_IntConst(137)

Some tokens can have attributes that store extra information about the token. Here we store which integer is represented.

slide-44
SLIDE 44

Lexical Analyzer

  • Recognize substrings that correspond to tokens: lexemes
  • Lexeme: actual text of the token
  • For each lexeme, identify token type
  • <token type, attribute>
  • Attribute: optional, extra information, often a numeric value
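The <token type, attribute> pair described above can be sketched as a small record. This is only an illustration: the class name `Token` and the field names `kind`, `lexeme`, and `attribute` are assumptions made for this example, not names used in the course.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class Token:
    kind: str                                           # token type, e.g. "T_While"
    lexeme: str                                         # actual text from the source
    attribute: Optional[Union[int, float, str]] = None  # optional extra information


# Tokens for the beginning of `while (137 < i)`.
tokens = [
    Token("T_While", "while"),
    Token("(", "("),
    Token("T_IntConst", "137", attribute=int("137")),
]

for tok in tokens:
    print(tok)
```

Keeping the lexeme next to the attribute is a design choice for the sketch: the lexeme is the raw text, while the attribute is the value a later compiler phase would actually use.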

slide-45
SLIDE 45

Why is this process hard?


slide-46
SLIDE 46

Scanning is Hard

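  • FORTRAN: Whitespace is irrelevant

DO 5 I = 1,25     (a loop: I runs from 1 to 25)
DO 5 I = 1.25     (an assignment: DO5I = 1.25)

The scanner cannot tell which of the two it is reading until it reaches the comma or the period.

Thanks to Prof. Alex Aiken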

slide-47
SLIDE 47

Scanning is Hard

  • C++: Nested template declarations

vector<vector<int>> myVector

Thanks to Prof. Alex Aiken


slide-48
SLIDE 48

Scanning is Hard

  • C++: Nested template declarations

vector < vector < int >> myVector

Thanks to Prof. Alex Aiken


slide-49
SLIDE 49

Scanning is Hard

  • C++: Nested template declarations

(vector < (vector < (int >> myVector)))

  • Again, can be difficult to determine where to split: here >> is read as a single right-shift operator instead of two closing angle brackets.

Thanks to Prof. Alex Aiken

slide-50
SLIDE 50

Challenges for Lexical Analyzer

  • How do we determine which lexemes are associated with each token?
  • When there are multiple ways we could scan the input, how do we know which one to pick?
      • if  if1
  • How do we address these concerns efficiently?

slide-51
SLIDE 51

Associate Lexemes to Tokens

  • Tokens: categorize lexemes by what information they provide.
  • Associate lexemes to token: pattern matching
  • How to describe patterns?

slide-54
SLIDE 54

Token: Lexemes

Finite possible lexemes

  • Keyword: for int if else while
  • Punctuation: ( ) { } ;
  • Operator: + - ++
  • Relation: < > =

Infinite possible lexemes

  • Identifier (variable name, function name): foo foo_2
  • Integer, floating point, string: 2345 2.0 “hello world”
  • Whitespace, comment: /* this code is awesome */

slide-55
SLIDE 55
  • How do we describe which (potentially infinite) set of lexemes is associated with each token type?

slide-56
SLIDE 56

Formal Languages

  • A formal language is a set of strings.
  • Many infinite languages have finite descriptions:
      • Define the language using an automaton.
      • Define the language using a grammar.
      • Define the language using a regular expression.
  • We can use these compact descriptions of the language to define sets of strings.
  • Over the course of this class, we will use all of these approaches.

slide-57
SLIDE 57
  • What type of formal language should we use to describe tokens?

slide-58
SLIDE 58

Regular Expressions

  • Regular expressions are a family of descriptions that can be used to capture certain languages (the regular languages).
  • Often provide a compact and human-readable description of the language.
  • Used as the basis for numerous software systems, including the flex tool we will use in this course.

slide-59
SLIDE 59

Atomic Regular Expressions

  • The regular expressions we will use in this course begin with two simple building blocks.
  • The symbol ε is a regular expression that matches the empty string.
  • For any symbol a, the symbol a is a regular expression that just matches a.

slide-60
SLIDE 60

Compound Regular Expressions

  • If R1 and R2 are regular expressions, R1R2 is a regular expression representing the concatenation of the languages of R1 and R2.
  • If R1 and R2 are regular expressions, R1 | R2 is a regular expression representing the union of R1 and R2.
  • If R is a regular expression, R* is a regular expression for the Kleene closure of R.
  • If R is a regular expression, (R) is a regular expression with the same meaning as R.
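The atomic and compound constructions can be written down directly as a tiny data structure with a matcher. What follows is a minimal sketch, not how flex works internally: the class names (`Eps`, `Sym`, `Cat`, `Alt`, `Star`) and the helper functions `ends` and `matches` are invented for this example. Parentheses need no class of their own because nesting the objects already expresses grouping.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Eps:          # ε: matches only the empty string
    pass


@dataclass(frozen=True)
class Sym:          # a: matches the single symbol a
    ch: str


@dataclass(frozen=True)
class Cat:          # R1R2: concatenation
    left: object
    right: object


@dataclass(frozen=True)
class Alt:          # R1 | R2: union
    left: object
    right: object


@dataclass(frozen=True)
class Star:         # R*: Kleene closure
    inner: object


def ends(r, s, i):
    """Return every position j such that s[i:j] is in the language of r."""
    if isinstance(r, Eps):
        return {i}
    if isinstance(r, Sym):
        return {i + 1} if i < len(s) and s[i] == r.ch else set()
    if isinstance(r, Cat):
        return {k for j in ends(r.left, s, i) for k in ends(r.right, s, j)}
    if isinstance(r, Alt):
        return ends(r.left, s, i) | ends(r.right, s, i)
    if isinstance(r, Star):
        # Zero repetitions end at i; keep extending until nothing new appears.
        seen, frontier = {i}, {i}
        while frontier:
            frontier = {k for j in frontier for k in ends(r.inner, s, j)} - seen
            seen |= frontier
        return seen
    raise TypeError(f"not a regular expression: {r!r}")


def matches(r, s):
    """True if the whole string s is in the language of r."""
    return len(s) in ends(r, s, 0)


# (0 | 1)*00(0 | 1)*
any_bits = Star(Alt(Sym("0"), Sym("1")))
has_00 = Cat(any_bits, Cat(Sym("0"), Cat(Sym("0"), any_bits)))
print(matches(has_00, "1001"), matches(has_00, "0101"))
```

The matcher tracks sets of end positions rather than backtracking, which keeps each case a direct reading of the corresponding definition.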

slide-63
SLIDE 63

Simple Regular Expressions

  • Suppose the only characters are 0 and 1.
  • Here is a regular expression for strings containing 00 as a substring:

(0 | 1)*00(0 | 1)*

11011100101
0000
11111011110011111
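The expression and the three example strings can be checked with Python's `re` module. This is a quick sanity check rather than part of the lecture; the name `HAS_00` and the extra non-matching string `0101` are additions for the example.

```python
import re

# Same expression as on the slide, without the spaces around "|".
HAS_00 = re.compile(r"(0|1)*00(0|1)*")

examples = ["11011100101", "0000", "11111011110011111", "0101"]
results = {s: HAS_00.fullmatch(s) is not None for s in examples}

for s, ok in results.items():
    print(s, ok)
```

`fullmatch` is used so that the whole string has to belong to the language, which mirrors how a regular expression defines a set of strings.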

slide-67
SLIDE 67

Applied Regular Expressions

  • Suppose that our alphabet is all ASCII characters.
  • A regular expression for even numbers is

(+|-)?(0|1|2|3|4|5|6|7|8|9)*(0|2|4|6|8)

42
+1370
-3248
-9999912
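A short check of the even-number expression, assuming Python's `re` syntax: the only change from the slide is that `+` has to be escaped as `\+`, because an unescaped `+` means "one or more" in Python. The `?` means "optional", that is, R? is shorthand for (R | ε). The function name `is_even_literal` is made up for this sketch.

```python
import re

EVEN = re.compile(r"(\+|-)?(0|1|2|3|4|5|6|7|8|9)*(0|2|4|6|8)")


def is_even_literal(text):
    """True if the whole string is an optionally signed even number."""
    return EVEN.fullmatch(text) is not None


for s in ["42", "+1370", "-3248", "-9999912", "7", "+"]:
    print(repr(s), is_even_literal(s))
```

Only the last digit is constrained, which is why the expression ends in (0|2|4|6|8) while every earlier digit may be anything.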

slide-70
SLIDE 70
  • More examples
      • Whitespace: [ \t\n]+
      • Integers: [+\-]?[0-9]+
      • Hex numbers: 0x[0-9a-f]+
      • Identifier: [A-Za-z]([A-Za-z]|[0-9])*
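These four patterns happen to be valid Python `re` syntax as written, so they can be tried directly. The sketch below assumes nothing beyond that; the dictionary `PATTERNS` and the helper `classify` are names invented here.

```python
import re

PATTERNS = {
    "whitespace": r"[ \t\n]+",
    "integer": r"[+\-]?[0-9]+",
    "hex": r"0x[0-9a-f]+",
    "identifier": r"[A-Za-z]([A-Za-z]|[0-9])*",
}


def classify(text):
    """Return the names of all patterns that match the whole string."""
    return [name for name, pattern in PATTERNS.items()
            if re.fullmatch(pattern, text)]


for s in [" \t\n", "-42", "0x1f", "foo2", "foo_2", "0xFF"]:
    print(repr(s), classify(s))
```

Two details show up when running it: `foo_2` matches nothing because this identifier pattern has no underscore, and `0xFF` matches nothing because the hex pattern only lists lowercase digits a-f.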

slide-71
SLIDE 71
  • Use regular expressions to describe token types
  • How do we match regular expressions?

slide-73
SLIDE 73

Recognizing Regular Languages

What is the machine that recognizes regular languages?

  • Finite Automata
      • DFA (Deterministic Finite Automaton)
      • NFA (Non-deterministic Finite Automaton)

slide-74
SLIDE 74

A Simple Automaton

[Figure: automaton with a start arrow and transitions labeled " and A,B,C,...,Z]

slide-75
SLIDE 75

A Simple Automaton

[Figure: automaton with a start arrow and transitions labeled " and A,B,C,...,Z]

Each circle is a state of the automaton. The automaton's configuration is determined by what state(s) it is in.

slide-76
SLIDE 76

A Simple Automaton

[Figure: automaton with a start arrow and transitions labeled " and A,B,C,...,Z]

These arrows are called transitions. The automaton changes which state(s) it is in by following transitions.

slide-77
SLIDE 77

A Simple Automaton

[Figure: automaton with a start arrow and transitions labeled " and A,B,C,...,Z]

Input: " H E Y A "

Finite automata: take an input string and determine whether it is a valid sentence of a language (accept or reject).

slide-78
SLIDE 78

A Simple Automaton

[Figure: automaton with a start arrow and transitions labeled " and A,B,C,...,Z]

Input: " H E Y A "

The automaton takes a string as input and decides whether to accept or reject the string.

slide-89
SLIDE 89

A Simple Automaton

[Figure: automaton with a start arrow and transitions labeled " and A,B,C,...,Z]

Input: " H E Y A "

The double circle indicates that this state is an accepting state. The automaton accepts the string if it ends in an accepting state.
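The walkthrough above can be imitated with a table-driven simulation. The figure itself did not survive in this transcript, so the automaton below is an assumption reconstructed from its labels: an opening quote, any number of uppercase letters, and a closing quote leading to the accepting state. The state names are invented.

```python
import string

START, IN_STRING, DONE = "start", "in_string", "done"
ACCEPTING = {DONE}

# (current state, input character) -> next state
TRANSITIONS = {
    (START, '"'): IN_STRING,
    (IN_STRING, '"'): DONE,
}
for letter in string.ascii_uppercase:
    TRANSITIONS[(IN_STRING, letter)] = IN_STRING   # loop on A,B,C,...,Z


def accepts(text):
    """Follow one transition per character; accept if we end in an accepting state."""
    state = START
    for ch in text:
        state = TRANSITIONS.get((state, ch))
        if state is None:        # no transition defined: reject
            return False
    return state in ACCEPTING


for s in ['"HEYA"', '"HEYA', 'HEYA"', '""', '"heya"']:
    print(s, accepts(s))
```

Because the automaton is deterministic, a single `state` variable is enough: there is never more than one transition to follow.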

slide-90
SLIDE 90

An Even More Complex Automaton

[Figure: automaton with a start state, three ε-transitions, loops labeled a, b / a, c / b, c, and transitions labeled c, b, a]

slide-91
SLIDE 91

An Even More Complex Automaton

[Figure: automaton with a start state, three ε-transitions, loops labeled a, b / a, c / b, c, and transitions labeled c, b, a]

These are called ε-transitions. These transitions are followed automatically and without consuming any input.
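An automaton with ε-transitions can be simulated by tracking a set of states and following every ε-transition before and after each input symbol. The transition table below is a guess at the figure, which is not recoverable from this transcript: it assumes three branches reached by ε, each looping on two letters and finishing on the third. All state names are hypothetical.

```python
EPS = ""   # label used in this sketch for ε-transitions

# (state, symbol) -> set of next states
NFA = {
    ("start", EPS): {"ab", "ac", "bc"},
    ("ab", "a"): {"ab"}, ("ab", "b"): {"ab"}, ("ab", "c"): {"done"},
    ("ac", "a"): {"ac"}, ("ac", "c"): {"ac"}, ("ac", "b"): {"done"},
    ("bc", "b"): {"bc"}, ("bc", "c"): {"bc"}, ("bc", "a"): {"done"},
}
ACCEPTING = {"done"}


def eps_closure(states):
    """All states reachable from `states` using only ε-transitions."""
    stack, closure = list(states), set(states)
    while stack:
        q = stack.pop()
        for nxt in NFA.get((q, EPS), ()):
            if nxt not in closure:
                closure.add(nxt)
                stack.append(nxt)
    return closure


def accepts(text):
    current = eps_closure({"start"})
    for ch in text:
        moved = set()
        for q in current:
            moved |= NFA.get((q, ch), set())
        current = eps_closure(moved)
    return bool(current & ACCEPTING)


print(accepts("bcba"), accepts("abca"))
```

Under this assumed table the automaton accepts exactly the strings whose last letter does not occur earlier, so `bcba` is accepted and `abca` is not.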

slide-92
SLIDE 92

An Even More Complex Automaton

[Figure: automaton with a start state, three ε-transitions, loops labeled a, b / a, c / b, c, and transitions labeled c, b, a]

Input: b c b a

slide-102
SLIDE 102

Lexer Generator

  • Given regular expressions to describe the language (token types),
  • Generates an NFA that can recognize the regular language defined
      • existing algorithms
  • Transforms the NFA to a DFA
      • existing algorithms
  • Tools: lex, flex
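One of the "existing algorithms" for the NFA-to-DFA step is the subset construction, in which every DFA state is a set of NFA states. The sketch below is a textbook version under assumed data structures (a dictionary from `(state, symbol)` to a set of states); it is not claimed to be what lex or flex do internally, and the example NFA for (a|b)*ab is made up.

```python
from collections import deque

EPS = ""   # label used in this sketch for ε-transitions


def eps_closure(nfa, states):
    """All NFA states reachable from `states` by ε-transitions alone."""
    stack, closure = list(states), set(states)
    while stack:
        q = stack.pop()
        for nxt in nfa.get((q, EPS), ()):
            if nxt not in closure:
                closure.add(nxt)
                stack.append(nxt)
    return frozenset(closure)


def subset_construction(nfa, start, accepting, alphabet):
    """Return (dfa_transitions, dfa_start, dfa_accepting)."""
    dfa_start = eps_closure(nfa, {start})
    dfa, seen, queue = {}, {dfa_start}, deque([dfa_start])
    while queue:
        current = queue.popleft()
        for symbol in alphabet:
            moved = set()
            for q in current:
                moved |= nfa.get((q, symbol), set())
            if not moved:
                continue                 # no transition on this symbol
            target = eps_closure(nfa, moved)
            dfa[(current, symbol)] = target
            if target not in seen:
                seen.add(target)
                queue.append(target)
    dfa_accepting = {s for s in seen if s & accepting}
    return dfa, dfa_start, dfa_accepting


def dfa_accepts(dfa, start, accepting, text):
    state = start
    for ch in text:
        state = dfa.get((state, ch))
        if state is None:
            return False
    return state in accepting


# Hypothetical NFA for (a|b)*ab: state 0 loops, then guesses where "ab" starts.
NFA = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}}
dfa, d0, d_acc = subset_construction(NFA, 0, {2}, "ab")
dfa_states = {state for state, _ in dfa}
print(len(dfa_states), dfa_accepts(dfa, d0, d_acc, "abab"))
```

Only reachable subsets are built, so the result here has three DFA states rather than all eight subsets of the three NFA states.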

slide-103
SLIDE 103

Challenges for Lexical Analyzer

  • How do we determine which lexemes are associated with each token?
      • Regular expression to describe token type
  • When there are multiple ways we could scan the input, how do we know which one to pick?
  • How do we address these concerns efficiently?

slide-104
SLIDE 104

Lexing Ambiguities

T_For          for
T_Identifier   [A-Za-z_][A-Za-z0-9_]*

slide-105
SLIDE 105

Lexing Ambiguities

T_For          for
T_Identifier   [A-Za-z_][A-Za-z0-9_]*

Input: f o r t

slide-106
SLIDE 106

Lexing Ambiguities

T_For          for
T_Identifier   [A-Za-z_][A-Za-z0-9_]*

Input: f o r t

[Figure: the input f o r t shown repeatedly, illustrating the different ways it could be split into tokens]

slide-107
SLIDE 107

Conflict Resolution

  • Assume all tokens are specified as regular expressions.
  • Algorithm: Left-to-right scan.
  • Tiebreaking rule one: Maximal munch.
      • Always match the longest possible prefix of the remaining text.

slide-108
SLIDE 108

Lexing Ambiguities

T_For          for
T_Identifier   [A-Za-z_][A-Za-z0-9_]*

Input: f o r t

slide-109
SLIDE 109

Implementing Maximal Munch

  • Given a set of regular expressions, how can we use them to implement maximal munch?
  • Idea:
      • Convert expressions to NFAs.
      • Run all NFAs in parallel, keeping track of the last match.
      • When all automata get stuck, report the last match and restart the search at that point.
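The idea above can be approximated in a few lines. Note the substitution: instead of converting expressions to NFAs and running them in parallel, this sketch tries each rule with Python's `re` module at the current position and keeps the longest match. For simple patterns like these the result is the same, but `re` is a backtracking matcher and does not guarantee the longest match for every pattern. The rule names come from the lecture's example; `RULES` and `tokenize` are names invented here.

```python
import re

RULES = [
    ("T_Do", re.compile(r"do")),
    ("T_Double", re.compile(r"double")),
    ("T_Mystery", re.compile(r"[A-Za-z]")),
]


def tokenize(text):
    """Split text into (token type, lexeme) pairs using maximal munch."""
    pos, tokens = 0, []
    while pos < len(text):
        best = None                      # (length, token type) of the longest match
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            if m and (best is None or m.end() - pos > best[0]):
                best = (m.end() - pos, name)
        if best is None:
            raise ValueError(f"no token matches at position {pos}")
        length, name = best
        tokens.append((name, text[pos:pos + length]))
        pos += length                    # restart the search after the match
    return tokens


print(tokenize("doubledoub"))
print(tokenize("doubdouble"))
```

In the second call the scanner cannot take `double` at the start because the text continues with `d`, so it falls back to `do` followed by two single-letter tokens before finding `double`.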

slide-111
SLIDE 111
  • Example


slide-112
SLIDE 112

Implementing Maximal Munch

T_Do        do
T_Double    double
T_Mystery   [A-Za-z]

slide-113
SLIDE 113

Implementing Maximal Munch

T_Do        do
T_Double    double
T_Mystery   [A-Za-z]

[Figure: three automata, each with its own start state: one reading d o, one reading d o u b l e, and one reading a single symbol from Σ]

slide-114
SLIDE 114

Implementing Maximal Munch

T_Do        do
T_Double    double
T_Mystery   [A-Za-z]

[Figure: three automata, each with its own start state: one reading d o, one reading d o u b l e, and one reading a single symbol from Σ]

Input: D O U B L E D O U B

slide-158
SLIDE 158

A Minor Simplification

[Figure: a single start state with three ε-transitions into the automata for d o, d o u b l e, and Σ]

slide-159
SLIDE 159

Other Conflicts

T_Do           do
T_Double       double
T_Identifier   [A-Za-z_][A-Za-z0-9_]*

Input: d o u b l e

slide-160
SLIDE 160

More Tiebreaking

  • When two regular expressions apply, choose the one with the greater “priority.”
  • Simple priority system: pick the rule that was defined first.

slide-162
SLIDE 162

Other Conflicts

T_Do           do
T_Double       double
T_Identifier   [A-Za-z_][A-Za-z0-9_]*

Input: d o u b l e
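Both tiebreaking rules fit in one small function: prefer the longest match, and among matches of equal length prefer the rule defined first. As a caveat, this sketch again uses Python's `re` module per rule rather than automata, and the function name `longest_match` is an assumption made for illustration.

```python
import re

RULES = [
    ("T_Do", r"do"),
    ("T_Double", r"double"),
    ("T_Identifier", r"[A-Za-z_][A-Za-z0-9_]*"),
]


def longest_match(rules, text, pos=0):
    """Return (token type, lexeme) for the best match at pos, or None."""
    best = None                              # (token type, end position)
    for name, pattern in rules:
        m = re.compile(pattern).match(text, pos)
        # Strict ">" keeps the earlier rule when two matches have equal length.
        if m and m.end() > pos and (best is None or m.end() > best[1]):
            best = (name, m.end())
    if best is None:
        return None
    return best[0], text[pos:best[1]]


print(longest_match(RULES, "double"))                   # tie: first rule wins
print(longest_match(RULES, "doubles"))                  # longer match wins
print(longest_match(list(reversed(RULES)), "double"))   # order changes the tie
```

This is why keyword rules are usually listed before the identifier rule: with the order reversed, `double` would be reported as an identifier.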

slide-163
SLIDE 163

Implement a lexical analyzer

  • Use regular expressions to describe token types (keyword, identifier, integer constant, ...)
  • Use a DFA/NFA to recognize the regular language
  • But... good news: you don’t need to implement the algorithms that transform your regular expressions into a DFA/NFA to recognize them
  • flex: given regular expressions -> outputs C code that does lexical analysis (it internally

slide-164
SLIDE 164

Summary

  • Lexical Analysis
  • Tokens
  • Lexemes
  • Regular expressions
  • DFA
  • NFA
  • Flex: a tool for you to build a lexical analyzer
