Plan for Today and Beginning Next week (Lexical Analysis)



slide-1
SLIDE 1

CS453 Lecture Regular Expressions and Transition Diagrams 1

Plan for Today and Beginning Next week (Lexical Analysis)

Regular Expressions
Finite State Machines
DFAs: Deterministic Finite Automata
Complications
NFAs: Non-Deterministic Finite State Automata
From Regular Expressions to NFAs
From NFAs to DFAs
slide-2
SLIDE 2


Structure of a Typical Compiler

[Figure: compiler pipeline.]
Analysis: character stream → lexical analysis → tokens (“words”) → syntactic analysis → AST (“sentences”) → semantic analysis → annotated AST
Synthesis: annotated AST → IR code generation → IR → optimization → IR → code generation → target language
(The figure also shows an interpreter.)

slide-3
SLIDE 3


Tokens for Example MeggyJava program

import meggy.Meggy;
class PA3Flower {
    public static void main(String[] whatever){
        {
            // Upper left petal, clockwise
            Meggy.setPixel( (byte)2, (byte)4, Meggy.Color.VIOLET );
            Meggy.setPixel( (byte)2, (byte)1, Meggy.Color.VIOLET);
            …
        }
    }

Tokens:
Symbol(IMPORT,null), Symbol(MEGGY,null), Symbol(SEMI,null), Symbol(CLASS,null), Symbol(ID,"PA3Flower"), Symbol(LBRACE,null), …

slide-4
SLIDE 4

About The Slides on Languages and Finite Automata

Slides originally developed by Prof. Costas Busch (2004)
  – Many thanks to Prof. Busch for developing the original slide set.
Adapted with permission by Prof. Dan Massey (Spring 2007)
  – Subsequent modifications, many thanks to Prof. Massey for CS 301 slides
Adapted with permission by Prof. Michelle Strout (Spring 2011)
  – Adapted for use in CS 453
  – Adapted by Wim Bohm (added regular expr → NFA → DFA, Spr2012)

slide-5
SLIDE 5

Languages

A language is a set of strings (sometimes called sentences).
String: a finite sequence of letters
Examples: “cat”, “dog”, “house”, …
Defined over a fixed alphabet: Σ = {a, b, c, …, z}

slide-6
SLIDE 6

Empty String

A string with no letters: ε (sometimes λ is used)

Observations:
|ε| = 0
εw = wε = w
εabba = abbaε = abba

slide-7
SLIDE 7

Regular Expressions

Regular expressions describe regular languages.
You have probably seen them in operating systems and editors.

Example: (a | (b)(c))* describes the language
L((a | (b)(c))*) = {ε, a, bc, aa, abc, bca, ...}
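As a quick check of the example above, the same expression can be tried in Python's `re` module. This is only a sketch: `in_language` and the sample lists are names chosen here for illustration, and `re` is a practical regex engine rather than the formal notation of the slides.

```python
import re

# The slide's expression (a | (b)(c))* written in Python's re syntax.
PATTERN = re.compile(r"(a|(b)(c))*")

def in_language(s: str) -> bool:
    """Return True if the WHOLE string s is in L((a|(b)(c))*)."""
    return PATTERN.fullmatch(s) is not None

# Strings listed on the slide; "" plays the role of the empty string ε.
members = ["", "a", "bc", "aa", "abc", "bca"]
# A few strings that cannot be built from the pieces "a" and "bc".
non_members = ["b", "c", "ab", "cb"]

print([in_language(s) for s in members])
print([in_language(s) for s in non_members])
```

`fullmatch` is used instead of `match` because membership in a language is a statement about the entire string, not about a prefix.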

slide-8
SLIDE 8

Recursive Definition for Specifying Regular Expressions

Primitive regular expressions: ∅, ε, α   where α ∈ Σ, some alphabet

Given regular expressions r1 and r2,

r1 | r2
r1 r2
r1 *
( r1 )

are regular expressions.

slide-9
SLIDE 9

Regular operators

choice:         A | B    a string from L(A) or from L(B)
concatenation:  A B      a string from L(A) followed by a string from L(B)
repetition:     A*       0 or more concatenations of strings from L(A)
                A+       1 or more
grouping:       ( A )

Concatenation has precedence over choice: A|B C means A|(B C), vs. (A|B)C

More syntactic sugar, used in scanner generators:
[abc] means a or b or c
[\t\n ] means tab, newline, or space
[a-z] means a,b,c, …, or z
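The operator rules above can be tried out directly. The sketch below uses Python's `re` module as a stand-in for a scanner generator's notation (an assumption; the slides do not name a tool here), and `full` is a hypothetical helper.

```python
import re

def full(pattern: str, s: str) -> bool:
    """True if pattern matches the whole string s."""
    return re.fullmatch(pattern, s) is not None

checks = {
    # Concatenation binds tighter than choice: a|bc means a|(bc).
    "a|bc accepts a": full(r"a|bc", "a"),
    "a|bc accepts bc": full(r"a|bc", "bc"),
    "a|bc rejects ac": not full(r"a|bc", "ac"),
    # Grouping changes the meaning: (a|b)c is ac or bc.
    "(a|b)c accepts ac": full(r"(a|b)c", "ac"),
    "(a|b)c rejects a": not full(r"(a|b)c", "a"),
    # Repetition: * allows zero occurrences, + needs at least one.
    "a* accepts empty": full(r"a*", ""),
    "a+ rejects empty": not full(r"a+", ""),
    # Character classes.
    "[\\t\\n ] accepts tab": full(r"[\t\n ]", "\t"),
    "[a-z] accepts q": full(r"[a-z]", "q"),
    "[a-z] rejects Q": not full(r"[a-z]", "Q"),
}

for description, ok in checks.items():
    print(description, ok)
```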

slide-10
SLIDE 10

Example Regular Expressions and Regular Definitions

Regular definition:
name : regular expression
name can then be used in other regular expressions

Keywords: “print”, “while”
Operations: “+”, “-”, “*”
Identifiers:
let : [a-zA-Z]    // choose from a to z or A to Z
dig : [0-9]
id : let (let | dig)*
Numbers: dig+ = dig dig*
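One way to mimic regular definitions is to build larger patterns out of named smaller ones. The sketch below does this with Python string interpolation; the constant names are chosen here for illustration, and a real scanner generator would provide its own syntax for the same idea.

```python
import re

# Named building blocks, mirroring the regular definitions on the slide.
LET = r"[a-zA-Z]"
DIG = r"[0-9]"

# id : let (let | dig)*
ID = rf"{LET}(?:{LET}|{DIG})*"

# Numbers: dig+ and its expansion dig dig* describe the same language.
NUMBER = rf"{DIG}+"
NUMBER_EXPANDED = rf"{DIG}{DIG}*"

def is_id(s: str) -> bool:
    return re.fullmatch(ID, s) is not None

def is_number(s: str) -> bool:
    return re.fullmatch(NUMBER, s) is not None

samples = ["x", "x1", "PA3Flower", "1x", "", "0", "1234", "12a"]
for s in samples:
    print(repr(s), is_id(s), is_number(s))
```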

slide-11
SLIDE 11

Finite Automaton

[Figure: Input: String → Finite Automaton → Output: String]

slide-12
SLIDE 12

Finite Accepter

[Figure: Input: String → Finite Automaton → Output: “Accept” or “Reject”]

slide-13
SLIDE 13

State Transition Graph

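abba - Finite Accepter

[Figure: state transition graph with states q0 … q5. q0 is the initial state and q4 is the final (“accept”) state. The path q0 -a-> q1 -b-> q2 -b-> q3 -a-> q4 spells abba; every other transition leads to q5, which loops on a, b.]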

slide-14
SLIDE 14

Initial Configuration

[Figure: the abba accepter in its initial state q0, with the input string a b b a still unread.]

slide-15
SLIDE 15

Reading the Input

[Figure: animation frame of the abba accepter reading the input a b b a.]

slide-16
SLIDE 16

[Figure: animation frame of the abba accepter reading the input a b b a.]

slide-17
SLIDE 17

[Figure: animation frame of the abba accepter reading the input a b b a.]

slide-18
SLIDE 18

[Figure: animation frame of the abba accepter reading the input a b b a.]

slide-19
SLIDE 19

[Figure: final frame; the input a b b a has been read and the accepter is in the final state q4.]

Input finished
Output: “accept”

slide-20
SLIDE 20

String Rejection

[Figure: the abba accepter in its initial state q0, with the input string a b a still unread.]

slide-21
SLIDE 21

[Figure: animation frame of the abba accepter reading the input a b a.]

slide-22
SLIDE 22

[Figure: animation frame of the abba accepter reading the input a b a.]

slide-23
SLIDE 23

[Figure: animation frame of the abba accepter reading the input a b a.]

slide-24
SLIDE 24

[Figure: final frame; the input a b a has been read and the accepter is in the non-final state q5.]

Input finished
Output: “reject”

slide-25
SLIDE 25

The Empty String

[Figure: the abba accepter in its initial state q0, with the empty string ε as input.]

slide-26
SLIDE 26

[Figure: with input ε the accepter stays in the initial state q0, which is not a final state.]

Output: “reject”

Would it be possible to accept the empty string?

slide-27
SLIDE 27

Another Example

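[Figure: a second finite accepter with three states q0, q1, q2 over the alphabet {a, b}, shown with a three-symbol input string.]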

slide-28
SLIDE 28

[Figure: animation frame of the second accepter reading its input.]

slide-29
SLIDE 29

[Figure: animation frame of the second accepter reading its input.]

slide-30
SLIDE 30

[Figure: animation frame of the second accepter reading its input.]

slide-31
SLIDE 31

[Figure: final frame of the second accepter; the input has been read.]

Input finished
Output: “accept”

slide-32
SLIDE 32

Rejection

[Figure: the second accepter in its initial state, with a different three-symbol input string.]

slide-33
SLIDE 33

[Figure: animation frame of the second accepter reading its input.]

slide-34
SLIDE 34

[Figure: animation frame of the second accepter reading its input.]

slide-35
SLIDE 35

[Figure: animation frame of the second accepter reading its input.]

slide-36
SLIDE 36

[Figure: final frame of the second accepter; the input has been read.]

Input finished
Output: “reject”

Which strings are accepted?

slide-37
SLIDE 37

Formalities

Deterministic Finite Automaton (DFA)

M = (Q, Σ, δ, q0, F)

Q  : set of states
Σ  : input alphabet
δ  : transition function
q0 : initial state
F  : set of final (accepting) states

slide-38
SLIDE 38

Input Alphabet

Σ = {a, b}

[Figure: the abba accepter; every transition is labeled with symbols from Σ.]

slide-39
SLIDE 39

Set of States

Q = {q0, q1, q2, q3, q4, q5}

[Figure: the abba accepter with its six states.]

slide-40
SLIDE 40

Initial State

q0

[Figure: the abba accepter with the initial state q0 marked.]

slide-41
SLIDE 41

Set of Final States

F = {q4}

[Figure: the abba accepter with the final state q4 marked.]

slide-42
SLIDE 42

Transition Function

δ : Q × Σ → Q

[Figure: the abba accepter; the labeled edges define δ.]

slide-43
SLIDE 43

δ(q0, a) = q1

[Figure: the abba accepter with the edge from q0 to q1 labeled a highlighted.]

slide-44
SLIDE 44

δ(q0, b) = q5

[Figure: the abba accepter with the edge from q0 to q5 labeled b highlighted.]

slide-45
SLIDE 45

δ(q2, b) = q3

[Figure: the abba accepter with the edge from q2 to q3 labeled b highlighted.]

slide-46
SLIDE 46

Transition Function / table

[Figure: the abba accepter next to its transition table.]

δ     a     b
q0    q1    q5
q1    q5    q2
q2    q5    q3
q3    q4    q5
q4    q5    q5
q5    q5    q5
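The formal definition M = (Q, Σ, δ, q0, F) and the transition table translate almost directly into a table-driven simulator. The following is a minimal sketch for the abba accepter; the Python names (`DELTA`, `accepts`, and so on) are chosen here for illustration and are not part of the slides.

```python
from typing import Dict, Set, Tuple

# M = (Q, Sigma, delta, q0, F) for the abba accepter.
STATES: Set[str] = {"q0", "q1", "q2", "q3", "q4", "q5"}
ALPHABET: Set[str] = {"a", "b"}
DELTA: Dict[Tuple[str, str], str] = {
    ("q0", "a"): "q1", ("q0", "b"): "q5",
    ("q1", "a"): "q5", ("q1", "b"): "q2",
    ("q2", "a"): "q5", ("q2", "b"): "q3",
    ("q3", "a"): "q4", ("q3", "b"): "q5",
    ("q4", "a"): "q5", ("q4", "b"): "q5",
    ("q5", "a"): "q5", ("q5", "b"): "q5",
}
START = "q0"
FINAL: Set[str] = {"q4"}

def accepts(word: str) -> bool:
    """Run the DFA on word; accept iff it ends in a final state."""
    state = START
    for ch in word:
        if ch not in ALPHABET:
            raise ValueError(f"symbol {ch!r} is not in the alphabet")
        state = DELTA[(state, ch)]  # exactly one next state: deterministic
    return state in FINAL

for w in ["abba", "aba", "", "abbab"]:
    print(repr(w), "accept" if accepts(w) else "reject")
```

Because δ is total (every state has an entry for both a and b), the lookup never fails for symbols in Σ; q5 acts as the trap state for every string other than abba.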

slide-47
SLIDE 47

Complications

  • 1. "1234" is a NUMBER, but what about the "123" in "1234", or the "23", etc.? Also, the scanner must recognize many tokens, not one, only stopping at end of file.
  • 2. "if" is a keyword or reserved word IF, but "if" is also defined by the reg. exp. for identifier ID. We want to recognize IF.
  • 3. We want to discard white space and comments.
  • 4. "123" is a NUMBER but so is "235" and so is "0", just as "a" is an ID and so is "bcd". We want to recognize a token, but add attributes to it.

slide-48
SLIDE 48

Complication 1

  • 1. "1234" is a NUMBER, but what about the "123" in "1234", or the "23", etc.? Also, the scanner must recognize many tokens, not one, only stopping at end of file.

So: recognize the largest string defined by some regular expression; only stop getting more input if there is no more match. This introduces the need to reconsider a character, as it is the first character of the next token.

E.g. fname(a,bcd ); would be scanned as
ID OPEN ID COMMA ID CLOSE SEMI EOF
Scanning fname would consume (, which would be put back and then recognized as OPEN.
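The longest-match rule with a put-back can be sketched as a small hand-written scanner. Everything named here (`CharStream`, `scan`, the `PUNCT` table) is hypothetical, only IDs and four punctuation tokens are handled, and white space is skipped so that the slide's example runs.

```python
from typing import List

PUNCT = {"(": "OPEN", ")": "CLOSE", ",": "COMMA", ";": "SEMI"}

class CharStream:
    """Character source that allows the last character to be put back."""

    def __init__(self, text: str) -> None:
        self.text = text
        self.pos = 0

    def get(self) -> str:
        """Return the next character, or "" at end of input."""
        ch = self.text[self.pos:self.pos + 1]
        self.pos += 1
        return ch

    def put_back(self) -> None:
        self.pos -= 1

def is_letter(ch: str) -> bool:
    return ch.isascii() and ch.isalpha()

def scan(text: str) -> List[str]:
    stream = CharStream(text)
    tokens: List[str] = []
    while True:
        ch = stream.get()
        if ch == "":
            tokens.append("EOF")
            return tokens
        if ch in " \t\n":
            continue  # white space produces no token
        if is_letter(ch):
            # Longest match: keep reading while the ID can still grow.
            while True:
                ch = stream.get()
                if not (is_letter(ch) or (ch.isascii() and ch.isdigit())):
                    break
            stream.put_back()  # ch starts the NEXT token, so un-read it
            tokens.append("ID")
        elif ch in PUNCT:
            tokens.append(PUNCT[ch])
        else:
            raise ValueError(f"unexpected character {ch!r}")

print(scan("fname(a,bcd );"))
```

The scanner only learns that `fname` has ended by reading the `(`, which is why that character has to be returned to the input before the next token is scanned.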

slide-49
SLIDE 49

Complication 2

  • 2. "if" is a keyword or reserved word IF, but "if" is also defined by the reg. exp. for identifier ID. We want to recognize IF.

So: have some way of determining which token (IF or ID) is recognized. This can be done using priority, e.g. in scanner generators an earlier definition has a higher priority than a later one. By putting the definition for IF before the definition for ID in the input for the scanner generator, we get the desired result.

What about the string “ifyouleavemenow”?
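A sketch of how priority can combine with the longest-match rule: every rule is tried, the longest lexeme wins, and a tie goes to the rule listed first. The `RULES` list and `next_token` are illustrative names, not the interface of any particular scanner generator.

```python
import re
from typing import Optional, Tuple

# Order matters: IF is listed before ID, so IF wins when both match equally.
RULES = [
    ("IF", re.compile(r"if")),
    ("ID", re.compile(r"[a-zA-Z][a-zA-Z0-9]*")),
]

def next_token(text: str, pos: int = 0) -> Tuple[str, str]:
    """Return (token, lexeme) for the longest match starting at pos."""
    best_name: Optional[str] = None
    best_len = 0
    for name, regex in RULES:
        m = regex.match(text, pos)
        # Strictly longer only: on a tie the earlier rule is kept.
        if m is not None and len(m.group()) > best_len:
            best_name, best_len = name, len(m.group())
    if best_name is None:
        raise ValueError(f"no token matches at position {pos}")
    return best_name, text[pos:pos + best_len]

print(next_token("if"))
print(next_token("ifyouleavemenow"))
```

Under these two rules, "ifyouleavemenow" comes out as a single ID, because the ID rule matches a longer string than IF does; priority only decides between matches of equal length.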

slide-50
SLIDE 50

Complication 3

  • 3. We want to discard white space and comments and not bother the parser with these.

So: in scanner generators, we can
specify white space using a regular expression, e.g. [\t\n ], and return no token, i.e. move to the next;
specify comments using a (NASTY) regular expression and again return no token, move to the next.

slide-51
SLIDE 51

Complication 4

  • 4. "123" is a NUMBER but so is "235" and so is "0", just as "a" is an ID and so is "bcd". We want to recognize a token, but add attributes to it.

So: scanners return Symbols, not tokens. A Symbol is a (token, tokenValue) pair, e.g. (NUMBER,123) or (ID,"a").
Often more information is added to a symbol, e.g. line number and position (as we will do in MeggyJava).
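Complications 3 and 4 can be illustrated together: rules for white space and comments return no token, while NUMBER and ID produce a Symbol carrying a value, a line, and a position. This is a sketch under assumptions: the `Symbol` fields, the `//` line-comment rule, and the 1-based line and column convention are choices made here, not the MeggyJava definitions.

```python
import re
from typing import Iterator, NamedTuple, Union

class Symbol(NamedTuple):
    token: str
    value: Union[int, str]
    line: int   # 1-based line of the first character
    pos: int    # 1-based column of the first character

# A token name of None means: match it, but return no token.
RULES = [
    (None, re.compile(r"[\t\n ]+")),                  # white space
    (None, re.compile(r"//[^\n]*")),                  # line comment (simplified)
    ("NUMBER", re.compile(r"[0-9]+")),
    ("ID", re.compile(r"[a-zA-Z][a-zA-Z0-9]*")),
]

def scan(text: str) -> Iterator[Symbol]:
    i, line, col = 0, 1, 1
    while i < len(text):
        for token, regex in RULES:
            m = regex.match(text, i)
            if m:
                break
        else:
            raise ValueError(f"unexpected character {text[i]!r}")
        lexeme = m.group()
        if token == "NUMBER":
            yield Symbol(token, int(lexeme), line, col)
        elif token == "ID":
            yield Symbol(token, lexeme, line, col)
        # Advance the line and column counters past the lexeme.
        newlines = lexeme.count("\n")
        if newlines:
            line += newlines
            col = len(lexeme) - lexeme.rfind("\n")
        else:
            col += len(lexeme)
        i = m.end()

symbols = list(scan("123 a // note\n0 bcd"))
for sym in symbols:
    print(sym)
```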

slide-52
SLIDE 52

(Non) Deterministic Finite State Automata

A Deterministic Finite State Automaton (DFA) has disjoint character sets on its edges, i.e. the choice “which state is next” is deterministic.

A Non-deterministic Finite State Automaton (NFA) does NOT: the character sets on the edges leaving a state can overlap (non-empty intersection), and some edges can be labeled ε, meaning they are followed without reading a character.

NFAs are used in the translation from regular expressions to FSAs. E.g. when we combine the reg. exp. for IF with the reg. exp. for ID by just merging the two transition graphs, we get an NFA.

NFAs are a first step in creating a DFA for a scanner. The NFA is then transformed into a DFA.

slide-53
SLIDE 53

From regular expressions to NFAs

regexp               NFA
simple letter “a”    an edge labeled a
empty string         an edge labeled ε
AB                   concat the NFAs
A|B                  split, then merge them
A*                   build a loop

[Figure: NFA fragments for each case, connected with ε edges. In the loop for A*, an ε edge leaves the accept state of the NFA for A.]
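The five cases above can be sketched as a recursive function over a small regular-expression tree. This follows the general shape of Thompson's construction; the tuple encoding of expressions, the extra start and accept states added for A|B and A*, and the helper `accepts` are assumptions made for this example and may differ in detail from the slide's diagrams.

```python
from itertools import count
from typing import Iterator, List, Optional, Set, Tuple

Edge = Tuple[int, Optional[str], int]  # (source, label, target); None = ε

def build(regex: tuple, edges: List[Edge], fresh: Iterator[int]) -> Tuple[int, int]:
    """Add an NFA fragment for regex to edges; return (start, accept)."""
    kind = regex[0]
    if kind in ("chr", "eps"):                 # simple letter / empty string
        start, accept = next(fresh), next(fresh)
        edges.append((start, regex[1] if kind == "chr" else None, accept))
        return start, accept
    if kind == "cat":                          # AB: concat the NFAs
        s1, a1 = build(regex[1], edges, fresh)
        s2, a2 = build(regex[2], edges, fresh)
        edges.append((a1, None, s2))
        return s1, a2
    if kind == "alt":                          # A|B: split, then merge
        start, accept = next(fresh), next(fresh)
        for sub in regex[1:]:
            s, a = build(sub, edges, fresh)
            edges.append((start, None, s))
            edges.append((a, None, accept))
        return start, accept
    if kind == "star":                         # A*: build a loop
        s, a = build(regex[1], edges, fresh)
        start, accept = next(fresh), next(fresh)
        edges.extend([(start, None, s), (a, None, s),
                      (start, None, accept), (a, None, accept)])
        return start, accept
    raise ValueError(f"unknown regex node {kind!r}")

def accepts(regex: tuple, word: str) -> bool:
    """Build the NFA, then simulate it on sets of states."""
    edges: List[Edge] = []
    start, accept = build(regex, edges, count())

    def closure(states: Set[int]) -> Set[int]:
        todo, seen = list(states), set(states)
        while todo:
            s = todo.pop()
            for src, label, dst in edges:
                if src == s and label is None and dst not in seen:
                    seen.add(dst)
                    todo.append(dst)
        return seen

    current = closure({start})
    for ch in word:
        current = closure({dst for src, label, dst in edges
                           if src in current and label == ch})
    return accept in current

# (a | (b)(c))* from the earlier slide.
EXAMPLE = ("star", ("alt", ("chr", "a"), ("cat", ("chr", "b"), ("chr", "c"))))
print([accepts(EXAMPLE, w) for w in ["", "a", "bc", "abc", "bca", "b", "cb"]])
```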

slide-54
SLIDE 54

The Problem

DFAs are easy to execute (table-driven interpretation).
NFAs are easy to build from reg. exps, but hard to execute: we would need some form of guessing, implemented by backtracking.

To build a DFA from an NFA we avoid the backtracking by taking all choices in the NFA at once: a move with a character or ε gets us to a set of states in the NFA, which will become one state in the DFA.

We keep doing this until we have exhausted all possibilities. This mechanism is called transitive closure. (This ends because there is only a finite set of subsets of NFA states. How many are there?)

slide-55
SLIDE 55

Example IF and ID

let : [a-z]
dig : [0-9]
tok : if | id
if : “i” “f”
id : let (let | dig)*

slide-56
SLIDE 56

Example: NFA for IF and ID

[Figure: NFA with states 0 to 8. From the start state 0, ε edges lead to states 1 and 4. IF branch: 1 -i-> 2 -f-> 3, where 3 accepts IF. ID branch: 4 -a-z-> 5, followed by ε edges and an edge labeled a-z, 0-9 among states 6, 7, 8, where 8 accepts ID.]

let : [a-z]
dig : [0-9]
tok : if | id
if : “i” “f”
id : let (let | dig)*

IF has priority over ID.
From 0, with ε we can get to states 1 and 4; this is called an ε-closure.
We can now simulate the behavior of the NFA and build a table for the DFA, making character moves plus ε-closures.

slide-57
SLIDE 57

NFA simulation scanning “in”

[Figure: the NFA for IF and ID from the previous slide.]

DFAstate   NFAstates   Move   Next
0          0,1,4       i      2,5,6,8
1          2,5,6,8     n      6,7,8

Only one of the states in 6,7,8 is an accepting state, an ID accepting state, so “in” is an ID.

slide-58
SLIDE 58

NFA simulation scanning “if”

[Figure: the NFA for IF and ID.]

DFAstate   NFAstates   Move   Next
0          0,1,4       i      2,5,6,8
1          2,5,6,8     f      3,6,7,8

Two of the states in 3,6,7,8 are accepting: an IF accepting state (3) and an ID accepting state (8). IF has priority over ID, so “if” is an IF.

slide-59
SLIDE 59

Definitions: edge(s,c) and closure

edge(s,c): the set of all NFA states reachable from state s following an edge with character c

closure(S): the set of all states reachable from S with no chars or ε, i.e. the smallest set T such that

closure(S) = T = S ∪ ( ⋃_{s∈T} edge(s,ε) )

T = S
repeat
    T' = T
    T = T' ∪ ( ⋃_{s∈T'} edge(s,ε) )
until T' == T

This transitive closure algorithm terminates because there is a finite number of states in the NFA.

slide-60
SLIDE 60

DFAedge and NFA Simulation

Suppose we are in DFA state d = {si, sk, sl}.
By moving with character c from d we reach a set of new NFA states; call these DFAedge(d,c), a new or already existing DFA state:

DFAedge(d,c) = closure( ⋃_{s∈d} edge(s,c) )

NFA simulation: let the input string be c1…ck
d = closure({s1})    // s1 the start state of the NFA
for i from 1 to k
    d = DFAedge(d,ci)
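The definitions of edge, closure, and DFAedge can be sketched in Python for the IF and ID example. The NFA edge set below is an assumption: it uses the 0-based state numbering of the simulation tables and was chosen so that the closures match those tables (for instance closure({0}) = {0,1,4}); the exact ε edges in the slide's figure may be drawn differently.

```python
import string
from typing import FrozenSet, Iterable, Optional, Set

LETTERS = set(string.ascii_lowercase)          # let : [a-z]
ALNUM = LETTERS | set(string.digits)           # let | dig

# Assumed ε edges: 0 splits into the IF and ID branches; 5, 6, 7 form the ID loop.
EPS = {0: {1, 4}, 5: {6}, 6: {8}, 7: {6}}
# Accepting NFA states, listed in priority order (IF before ID).
ACCEPTING = {3: "IF", 8: "ID"}
START = 0

def edge(s: int, c: Optional[str]) -> Set[int]:
    """NFA states reachable from s on character c (None means ε)."""
    if c is None:
        return set(EPS.get(s, ()))
    out: Set[int] = set()
    if s == 1 and c == "i":
        out.add(2)
    if s == 2 and c == "f":
        out.add(3)
    if s == 4 and c in LETTERS:
        out.add(5)
    if s == 6 and c in ALNUM:
        out.add(7)
    return out

def closure(S: Iterable[int]) -> FrozenSet[int]:
    """All states reachable from S using only ε edges."""
    T = set(S)
    while True:
        T_prev = set(T)
        for s in T_prev:
            T |= edge(s, None)
        if T == T_prev:
            return frozenset(T)

def dfa_edge(d: Iterable[int], c: str) -> FrozenSet[int]:
    moved: Set[int] = set()
    for s in d:
        moved |= edge(s, c)
    return closure(moved)

def simulate(text: str) -> Optional[str]:
    """Return the token accepted for the whole text, or None."""
    d = closure({START})
    for c in text:
        d = dfa_edge(d, c)
    for state, token in ACCEPTING.items():     # first hit has priority
        if state in d:
            return token
    return None

print(simulate("in"), simulate("if"), simulate("ifyouleavemenow"))
```

No backtracking is needed: the simulation tracks the whole set of NFA states that could be active, which is exactly the idea the DFA construction turns into a table.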

slide-61
SLIDE 61

Constructing a DFA with closure and DFAEdge

State d1 = closure({s1}), the closure of the start state of the NFA.
Make new states by moving from existing states with a character c, using DFAedge(d,c); record these in the transition table.
Mark accepts in the transition table: if there is an accepting NFA state in d, then d is accepting; decide by priority if there is more than one accept state.
Instead of characters we use non-overlapping (DFA) character classes to keep the table manageable.
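The construction steps can be sketched as a worklist loop. As before, the NFA edge set is an assumption; here it uses the numbering of the NFA-to-DFA slides, where the start state is 1 with an ε edge to 4. For simplicity the sketch iterates over individual characters rather than character classes, so its table has one column per character.

```python
import string
from typing import Dict, FrozenSet, Iterable, Optional, Set, Tuple

LETTERS = set(string.ascii_lowercase)
ALPHABET = LETTERS | set(string.digits)
EPS = {1: {4}, 5: {6}, 6: {8}, 7: {6}}         # assumed ε edges
ACCEPTING = {3: "IF", 8: "ID"}                 # priority order: IF first

def edge(s: int, c: Optional[str]) -> Set[int]:
    if c is None:
        return set(EPS.get(s, ()))
    out: Set[int] = set()
    if s == 1 and c == "i":
        out.add(2)
    if s == 2 and c == "f":
        out.add(3)
    if s == 4 and c in LETTERS:
        out.add(5)
    if s == 6 and c in ALPHABET:
        out.add(7)
    return out

def closure(S: Iterable[int]) -> FrozenSet[int]:
    T = set(S)
    while True:
        T_prev = set(T)
        for s in T_prev:
            T |= edge(s, None)
        if T == T_prev:
            return frozenset(T)

def dfa_edge(d: Iterable[int], c: str) -> FrozenSet[int]:
    moved: Set[int] = set()
    for s in d:
        moved |= edge(s, c)
    return closure(moved)

DFAState = FrozenSet[int]

def build_dfa() -> Tuple[DFAState, Dict[DFAState, Dict[str, DFAState]]]:
    """Subset construction: each DFA state is a set of NFA states."""
    start = closure({1})
    table: Dict[DFAState, Dict[str, DFAState]] = {}
    todo = [start]
    while todo:
        d = todo.pop()
        if d in table:
            continue                            # already existing DFA state
        table[d] = {}
        for c in sorted(ALPHABET):
            nxt = dfa_edge(d, c)
            if nxt:                             # empty set: no transition
                table[d][c] = nxt
                todo.append(nxt)
    return start, table

def accept_label(d: DFAState) -> Optional[str]:
    for state, token in ACCEPTING.items():
        if state in d:
            return token
    return None

start, table = build_dfa()
for d in sorted(table, key=sorted):
    print(sorted(d), accept_label(d))
```

With this edge set the loop finds five DFA states, the same sets of NFA states that appear in the transition table for IF and ID.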

slide-62
SLIDE 62

NFA to DFA (let’s build it)

[Figure: the NFA for IF and ID, with states 1 to 8. State 3 accepts IF and state 8 accepts ID.]

slide-63
SLIDE 63

NFA to DFA

[Figure: the NFA for IF and ID (top) and the DFA built from it (bottom).]

DFA states:
1: 1,4
2: 2,5,6,8   ID
3: 3,6,7,8   IF
4: 6,7,8     ID
5: 5,6,8     ID

DFA edges:
1 -i-> 2
1 -a-h, j-z-> 5
2 -f-> 3
2 -a-e, g-z, 0-9-> 4
3 -a-z, 0-9-> 4
4 -a-z, 0-9-> 4
5 -a-z, 0-9-> 4

slide-64
SLIDE 64

The transition table for IF ID

p   NFAstates(p)   i           f           a-h,j-z   a-e,g-z,0-9   a-z,0-9   ACPT
1   {1,4}          {2,5,6,8}               {5,6,8}
2   {2,5,6,8}                  {3,6,7,8}             {6,7,8}                 ID
3   {3,6,7,8}                                                      {6,7,8}   IF
4   {6,7,8}                                                        {6,7,8}   ID
5   {5,6,8}                                                        {6,7,8}   ID

slide-65
SLIDE 65

Suggested Exercise

Build an NFA and a DFA for integer and float literals

dot: “.”
dig: [0-9]
int-lit: dig+
float-lit: dig* dot dig+