Parsing CSCI 3130 Formal Languages and Automata Theory Siu On CHAN - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Parsing

CSCI 3130 Formal Languages and Automata Theory

Siu On CHAN Fall 2018

Chinese University of Hong Kong 1/28

slide-2
SLIDE 2

Context-free versus regular

Write a CFG for the language (0 + 1)∗111:
S → U111
U → 0U | 1U | ε
Can you do so for every regular language?
Every regular language is context-free.
(Diagram: regular expression, NFA, DFA)

2/28

slide-4
SLIDE 4

From regular to context-free

regular expression ⇒ CFG
∅ ⇒ grammar with no rules
ε ⇒ S → ε
a (alphabet symbol) ⇒ S → a
E1 + E2 ⇒ S → S1 | S2
E1E2 ⇒ S → S1S2
E1∗ ⇒ S → SS1 | ε
(S1 and S2 are the start variables of the grammars for E1 and E2.)
S becomes the new start variable
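
The translation table above can be sketched as a short recursive function. This is only an illustration under stated assumptions: the nested-tuple encoding of regular expressions, the function name `regex_to_cfg`, and the generated variable names `S1`, `S2`, ... are all made up for this example and do not come from the slides.

```python
from itertools import count


def regex_to_cfg(expr, rules=None, fresh=None):
    """Translate a regular expression (nested tuples) into CFG rules.

    Returns (start_variable, rules). rules maps each variable to a list of
    right-hand sides; each right-hand side is a tuple of symbols.
    """
    if rules is None:
        rules, fresh = {}, count(1)
    start = f"S{next(fresh)}"      # fresh variable for this subexpression
    rules[start] = []              # for "empty" the list stays empty: no rules
    kind = expr[0]
    if kind == "eps":
        rules[start].append(())                    # S -> ε
    elif kind == "sym":
        rules[start].append((expr[1],))            # S -> a
    elif kind in ("union", "concat"):
        s1, _ = regex_to_cfg(expr[1], rules, fresh)
        s2, _ = regex_to_cfg(expr[2], rules, fresh)
        if kind == "union":
            rules[start] += [(s1,), (s2,)]         # S -> S1 | S2
        else:
            rules[start].append((s1, s2))          # S -> S1 S2
    elif kind == "star":
        s1, _ = regex_to_cfg(expr[1], rules, fresh)
        rules[start] += [(start, s1), ()]          # S -> S S1 | ε
    elif kind != "empty":
        raise ValueError(f"unknown regex node: {kind}")
    return start, rules


# (0 + 1)*111, the language from the earlier slide
bit = ("union", ("sym", "0"), ("sym", "1"))
one = ("sym", "1")
regex = ("concat", ("star", bit), ("concat", one, ("concat", one, one)))
start, rules = regex_to_cfg(regex)
```

The resulting grammar is larger than the hand-written S → U111, U → 0U | 1U | ε, but it is produced mechanically, which is the point of the construction.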

3/28

slide-6
SLIDE 6

Context-free versus regular

Is every context-free language regular?
S → 0S1 | ε
L = {0^n 1^n | n ≥ 0} is context-free but not regular.
(Diagram: the regular languages sit inside the context-free languages)

4/28

slide-7
SLIDE 7

Ambiguity

slide-8
SLIDE 8

Ambiguity

E → E+E | E*E | (E) | N
N → 1 | 2
The string 1+2*2 has two parse trees:
  root *, with subtrees 1+2 and 2: (1+2)*2 = 6
  root +, with subtrees 1 and 2*2: 1+(2*2) = 5
A CFG is ambiguous if some string has more than one parse tree

5/28

slide-9
SLIDE 9

Example

Is S → SS | x ambiguous?
Yes, because there are two ways to derive xxx (parse trees in bracket notation):
  S( S( S(x) S(x) ) S(x) )
  S( S(x) S( S(x) S(x) ) )
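
To see how quickly the ambiguity grows, one can count parse trees instead of drawing them. The sketch below is an illustration only; the function name `count_trees` is hypothetical. It uses the fact that a tree for a string of n x's is either the single rule S → x (when n = 1) or an application of S → SS that splits the string into a left part of k symbols and a right part of n − k symbols.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_trees(n):
    """Number of parse trees for the string of n x's under S -> SS | x."""
    if n == 1:
        return 1                      # only S -> x
    # S -> SS: the left S derives the first k symbols, the right S the rest
    return sum(count_trees(k) * count_trees(n - k) for k in range(1, n))


print(count_trees(3))   # the two trees shown on the slide
print(count_trees(4))
```

For n = 3 the function returns 2, matching the two trees above; any value greater than 1 witnesses that the grammar is ambiguous.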

6/28

slide-11
SLIDE 11

Disambiguation

S → SS | x   ⇒   S → Sx | x
The only parse tree for xxx is now S( S( S(x) x ) x )
Sometimes we can rewrite the grammar to remove ambiguity

7/28

slide-12
SLIDE 12

Disambiguation

E → E+E | E*E | (E) | N
N → 1 | 2
+ and * have the same precedence!
Decompose expression into terms and factors
(Figure: the expression 2 * ( 1 + 2 * 2 ) with its parts labeled as factors F and terms T)

8/28

slide-13
SLIDE 13

Disambiguation

E → E+E | E*E | (E) | N
N → 1 | 2
An expression is a sum of one or more terms: E → T | E+T
Each term is a product of one or more factors: T → F | T*F
Each factor is a parenthesized expression or a number: F → (E) | 1 | 2

9/28

slide-14
SLIDE 14

Parsing example

E → T | E+T
T → F | T*F
F → (E) | 1 | 2
Parse tree for 2+(1+1+2*2)+1, in bracket notation:
E(
  E(
    E( T( F(2) ) )
    +
    T( F( "(" E( E( E( T( F(1) ) ) + T( F(1) ) ) + T( T( F(2) ) * F(2) ) ) ")" ) )
  )
  +
  T( F(1) )
)
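
Because the grammar separates expressions, terms and factors, a small evaluator can follow its structure directly. The sketch below is one possible illustration, not the method of the slides: the name `evaluate` is hypothetical, and the left-recursive rules E → E+T and T → T*F are implemented as loops, which is a common adaptation when writing such a parser by hand.

```python
def evaluate(text):
    """Evaluate a string over 1, 2, +, *, ( ) using E/T/F precedence."""
    pos = 0

    def peek():
        return text[pos] if pos < len(text) else None

    def factor():                     # F -> (E) | 1 | 2
        nonlocal pos
        ch = peek()
        if ch == "(":
            pos += 1
            value = expr()
            if peek() != ")":
                raise ValueError("expected ')'")
            pos += 1
            return value
        if ch in ("1", "2"):
            pos += 1
            return int(ch)
        raise ValueError(f"unexpected symbol at position {pos}")

    def term():                       # T -> F | T*F, written as a loop
        nonlocal pos
        value = factor()
        while peek() == "*":
            pos += 1
            value *= factor()
        return value

    def expr():                       # E -> T | E+T, written as a loop
        nonlocal pos
        value = term()
        while peek() == "+":
            pos += 1
            value += term()
        return value

    result = expr()
    if pos != len(text):
        raise ValueError(f"unexpected symbol at position {pos}")
    return result


print(evaluate("1+2*2"))            # * binds tighter than +
print(evaluate("2+(1+1+2*2)+1"))    # the string from the parse tree above
```

With this grammar the string 1+2*2 can only be read as 1+(2*2), so the evaluator returns 5 rather than 6.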

10/28

slide-15
SLIDE 15

Disambiguation

Disambiguation is not always possible because
  • There exist inherently ambiguous languages
  • There is no general procedure for disambiguation
In programming languages, ambiguity comes from the precedence rules, and we can resolve it as in the example
In English, ambiguity is sometimes a problem: I look at the dog with one eye

11/28

slide-17
SLIDE 17

Parsing

S → 0S1 | 1S0S | T
T → S | ε
input: 0011
Is 0011 ∈ L? If so, how to build a parse tree with a program?

12/28

slide-18
SLIDE 18

Parsing

S → 0S1 | 1S0S | T
T → S | ε
input: 0011
Try all derivations?
S
  T
    ε
    S …
  1S0S
    …
    10S10S …
  0S1
    0T1 …
    01S0S1 …
    00S11
      00T11
        0011 ✓
        00S11 …
This is (part of) the tree of all derivations, not the parse tree

13/28

slide-22
SLIDE 22

Problems

  • 1. Trying all derivations may take too long
  • 2. If input is not in the language, parsing will never stop

Let’s tackle the 2nd problem

14/28

SLIDE 25

When to stop

S → 0S1 | 1S0S | T
T → S | ε
Idea: Stop when |derived string| > |input|
Problems:
  S ⇒ 0S1 ⇒ 0T1 ⇒ 01: the derived string may shrink because of “ε-productions”
  S ⇒ T ⇒ S ⇒ T ⇒ . . . : the derivation may loop because of “unit productions”
Remove ε and unit productions

15/28

slide-31
SLIDE 31

Removing ε-productions

Goal: remove all A → ε rules for every non-start variable A
If S is the start variable and the rule S → ε exists
  Add a new start variable T
  Add the rule T → S
For every rule A → ε where A is not the (new) start variable

  • 1. Remove the rule A → ε
  • 2. If you see B → αAβ, add a new rule B → αβ

Example:
S → ACD
A → a
B → ε
C → ED | ε
D → BC | b
E → b

Removing B → ε adds D → C
Removing C → ε adds D → B, S → AD and D → ε
Removing D → ε adds S → AC, C → E and S → A

16/28

slide-33
SLIDE 33

Eliminating ε-productions

For every A → ε rule where A is not the start variable

  • 1. Remove the rule A → ε
  • 2. If you see B → αAβ, add a new rule B → αβ

Do 2. every time A appears:
B → αAβAγ yields B → αβAγ, B → αAβγ and B → αβγ
B → A becomes B → ε
If B → ε was removed earlier, don’t add it back
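
The procedure above can be sketched in code as follows. This is a minimal illustration under stated assumptions: the grammar encoding (a dict from variables to sets of tuples), the names `remove_epsilon` and `drop_occurrences`, and the assumption that the start variable never appears on a right-hand side (otherwise a fresh start variable should be added first, as described earlier) are choices made for this example.

```python
from itertools import product


def drop_occurrences(rhs, var):
    """All right-hand sides obtained by deleting one or more copies of var."""
    positions = [i for i, s in enumerate(rhs) if s == var]
    results = set()
    for keep in product([True, False], repeat=len(positions)):
        if all(keep):
            continue                  # must delete at least one occurrence
        dropped = {p for p, k in zip(positions, keep) if not k}
        results.add(tuple(s for i, s in enumerate(rhs) if i not in dropped))
    return results


def remove_epsilon(rules, start):
    """Remove A -> ε for every non-start A. () stands for ε.

    Assumes (for this sketch) that start never occurs on a right-hand side.
    """
    rules = {a: set(bodies) for a, bodies in rules.items()}
    removed = set()                   # variables whose ε-rule was removed
    while True:
        a = next((v for v in rules if v != start and () in rules[v]), None)
        if a is None:
            return rules
        rules[a].discard(())          # step 1: remove A -> ε
        removed.add(a)
        for b in rules:               # step 2: B -> αAβ yields B -> αβ
            for rhs in list(rules[b]):
                for variant in drop_occurrences(rhs, a):
                    if variant == () and b in removed:
                        continue      # don't add a removed ε-rule back
                    rules[b].add(variant)


# Example grammar from the slides
grammar = {
    "S": {("A", "C", "D")},
    "A": {("a",)},
    "B": {()},
    "C": {("E", "D"), ()},
    "D": {("B", "C"), ("b",)},
    "E": {("b",)},
}
result = remove_epsilon(grammar, "S")
```

On the example grammar, `result["S"]` contains ACD, AD, AC and A, and `result["D"]` contains BC, b, C and B. Each variable loses its ε-rule at most once, so the loop runs at most once per variable.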

17/28

slide-34
SLIDE 34

Eliminating unit productions

A unit production is a production of the form A → B
Grammar:
S → 0S1 | 1S0S | T
T → S | R | ε
R → 0SR
Unit production graph: S → T, T → S, T → R

18/28

slide-36
SLIDE 36

Removing unit productions

1 If there is a cycle of unit productions A → B → · · · → C → A, delete it and replace everything with A (any variable in the cycle)
Before:
S → 0S1 | 1S0S | T
T → S | R | ε
R → 0SR
After (replace T by S; the cycle is S → T → S):
S → 0S1 | 1S0S
S → R | ε
R → 0SR

19/28

slide-38
SLIDE 38

Removal of unit productions

2 Replace any chain A → B → · · · → C → α by A → α, B → α, . . . , C → α
Before:
S → 0S1 | 1S0S | R | ε
R → 0SR
After (replace S → R → 0SR by S → 0SR, R → 0SR):
S → 0S1 | 1S0S | 0SR | ε
R → 0SR
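
A compact way to get the same effect in code is the unit-closure formulation: for each variable A, find every variable reachable from A through unit productions, and give A all of their non-unit right-hand sides. Note that this differs slightly from the two steps on the slides, since it does not merge the variables of a cycle into one; it leaves them in place with identical rules. The function name `remove_unit` and the grammar encoding are assumptions made for this sketch.

```python
def remove_unit(rules):
    """Return an equivalent grammar without unit productions.

    rules maps each variable to a set of right-hand sides (tuples);
    () stands for ε.
    """
    variables = set(rules)

    def is_unit(rhs):
        return len(rhs) == 1 and rhs[0] in variables

    result = {}
    for a in rules:
        # Variables reachable from a by following unit productions only
        reach, stack = {a}, [a]
        while stack:
            b = stack.pop()
            for rhs in rules[b]:
                if is_unit(rhs) and rhs[0] not in reach:
                    reach.add(rhs[0])
                    stack.append(rhs[0])
        # a inherits every non-unit right-hand side along those chains
        result[a] = {rhs for b in reach for rhs in rules[b] if not is_unit(rhs)}
    return result


# Grammar from the slides: S -> 0S1 | 1S0S | T, T -> S | R | ε, R -> 0SR
grammar = {
    "S": {("0", "S", "1"), ("1", "S", "0", "S"), ("T",)},
    "T": {("S",), ("R",), ()},
    "R": {("0", "S", "R")},
}
cleaned = remove_unit(grammar)
```

Here `cleaned["S"]` is 0S1 | 1S0S | 0SR | ε and `cleaned["R"]` is 0SR, in agreement with the slides. T is still present with the same rules as S, but it is no longer reachable from S.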

20/28

slide-39
SLIDE 39

Recap

Problems:

  • 1. Trying all derivations may take too long
  • 2. If input is not in the language, parsing will never stop

✓ Solution to problem 2:

  • 1. Eliminate ε productions
  • 2. Eliminate unit productions

Try all possible derivations but stop parsing when |derived string| > |input|
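
The recap can be turned into a small brute-force membership test. The sketch below is an illustration only, assuming a grammar with no ε-productions and no unit productions, single-character symbols, and sentential forms stored as plain strings; the name `derives` is hypothetical. It expands the leftmost variable at each step and discards any form longer than the input.

```python
from collections import deque


def derives(rules, start, target):
    """Return True if target can be derived from start.

    rules maps each variable (one character) to a list of right-hand
    sides (strings). Assumes no ε-productions and no unit productions,
    so sentential forms never shrink.
    """
    seen = {start}
    queue = deque([start])
    while queue:
        form = queue.popleft()
        # Position of the leftmost variable, if any
        idx = next((i for i, s in enumerate(form) if s in rules), None)
        if idx is None:               # no variables left: a terminal string
            if form == target:
                return True
            continue
        for rhs in rules[form[idx]]:
            new = form[:idx] + rhs + form[idx + 1:]
            # Stop when |derived string| > |input|
            if len(new) <= len(target) and new not in seen:
                seen.add(new)
                queue.append(new)
    return False


rules = {"S": ["0S1", "0S0S", "0"]}
print(derives(rules, "S", "0011"))
print(derives(rules, "S", "001"))
```

The search always terminates because only finitely many strings have length at most |input|. It can still take exponential time, which is the first problem on the list.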

21/28

slide-42
SLIDE 42

Example

S → 0S1 | 0S0S | T
T → S | 0
⇒
S → 0S1 | 0S0S | 0
input: 0011
Tree of derivations:
S
  0S0S
    00S0S0S   too long
    00S10S    too long
    000S
      0000S0S   too long
      0000S1    too long
      0000      ✗
  0S1
    00S0S1    too long
    00S11     too long
    001       ✗
  0           ✗
Conclusion: 0011 ∉ L

22/28

slide-43
SLIDE 43

Problems

  • 1. Trying all derivations may take too long
  • 2. If input is not in the language, parsing will never stop

23/28

slide-44
SLIDE 44

Preparations

A faster way to parse: Cocke–Younger–Kasami algorithm
To use it we must preprocess the CFG:
  • Eliminate ε productions
  • Eliminate unit productions
  • Convert CFG to Chomsky Normal Form

24/28

slide-45
SLIDE 45

Chomsky Normal Form

A CFG is in Chomsky Normal Form if every production has the form
A → BC  or  A → a
where neither B nor C is the start variable
but we also allow S → ε for start variable S
(Photo: Noam Chomsky)

Convert to Chomsky Normal Form:
A → BcDE
⇒ (replace terminals with new variables)
A → BCDE, C → c
⇒ (break up sequences with new variables)
A → BX, X → CY, Y → DE, C → c
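
The two conversion steps can be sketched as follows. This is a minimal illustration, assuming that ε-productions and unit productions have already been eliminated and that the generated names `T_c`, `X1`, `X2`, ... do not clash with existing variables; the function name `to_cnf` and the naming scheme are made up for this example. The sketch does not handle the requirement that the start variable stay off right-hand sides.

```python
def to_cnf(rules):
    """Convert rules (dict: variable -> set of tuples) to A -> BC | a form.

    Assumes ε- and unit productions are already gone, so any right-hand
    side of length 1 is a single terminal.
    """
    variables = set(rules)
    out = {}
    fresh = 0
    for head, bodies in rules.items():
        for body in bodies:
            if len(body) <= 1:
                out.setdefault(head, set()).add(body)   # already A -> a
                continue
            # Step 1: replace terminals with new variables (c becomes T_c)
            symbols = []
            for s in body:
                if s in variables:
                    symbols.append(s)
                else:
                    out.setdefault(f"T_{s}", set()).add((s,))
                    symbols.append(f"T_{s}")
            # Step 2: break up long sequences with new variables X1, X2, ...
            current = head
            while len(symbols) > 2:
                fresh += 1
                nxt = f"X{fresh}"
                out.setdefault(current, set()).add((symbols[0], nxt))
                current, symbols = nxt, symbols[1:]
            out.setdefault(current, set()).add(tuple(symbols))
    return out


# The rule A -> BcDE from the slide; B, D, E get placeholder rules
grammar = {
    "A": {("B", "c", "D", "E")},
    "B": {("b",)},
    "D": {("d",)},
    "E": {("e",)},
}
cnf = to_cnf(grammar)
```

For A → BcDE this produces A → B X1, X1 → T_c X2, X2 → D E and T_c → c, which is the slide's result with X, Y, C renamed to X1, X2, T_c.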

25/28

slide-50
SLIDE 50

Cocke–Younger–Kasami algorithm

S → AB | BC
A → BA | a
B → CC | b
C → AB | a
Input: x = baaba
Let x[i, ℓ] = x_i x_{i+1} . . . x_{i+ℓ−1} (the substring of length ℓ starting at position i)

       i=1     i=2     i=3     i=4     i=5
x      b       a       a       b       a
ℓ=1    B       A|C     A|C     B       A|C
ℓ=2    S|A     B       S|C     S|A

For every substring x[i, ℓ], remember all variables R that derive x[i, ℓ]
Store in a table T[i, ℓ]

26/28

slide-53
SLIDE 53

Computing T[i, ℓ] for ℓ ≥ 2

Example: to compute T[2, 4]
Try all possible ways to split x[2, 4] = aaba into two substrings:
  split 1: a | aba     T[2, 1] = A|C    T[3, 3] = B
  split 2: aa | ba     T[2, 2] = B      T[4, 2] = S|A
  split 3: aab | a     T[2, 3] = B      T[5, 1] = A|C
Look up entries regarding shorter substrings previously computed
S → AB | BC
A → BA | a
B → CC | b
C → AB | a
T[2, 4] = S|A|C

27/28

slide-54
SLIDE 54

Cocke–Younger–Kasami algorithm

S → AB | BC
A → BA | a
B → CC | b
C → AB | a
Input: x = baaba

       i=1     i=2     i=3     i=4     i=5
x      b       a       a       b       a
ℓ=1    B       A|C     A|C     B       A|C
ℓ=2    S|A     B       S|C     S|A
ℓ=3    ∅       B       B
ℓ=4    ∅       S|A|C
ℓ=5    S|A|C

Get parse tree by tracing back derivations
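
The table-filling procedure can be sketched in a few lines. This is an illustration under stated assumptions: the grammar is already in Chomsky Normal Form, rules are stored as a dict from variables to sets of tuples, and the function name `cyk` is hypothetical. Indices follow the slides: T[i, ℓ] holds the variables that derive the substring of length ℓ starting at position i (1-based).

```python
def cyk(rules, x):
    """Fill the CYK table for string x; returns a dict keyed by (i, length)."""
    n = len(x)
    T = {}
    # Length 1: variables with a rule A -> a for the symbol at position i
    for i in range(1, n + 1):
        T[i, 1] = {a for a, bodies in rules.items() if (x[i - 1],) in bodies}
    # Longer substrings: try every split into two shorter pieces
    for length in range(2, n + 1):
        for i in range(1, n - length + 2):
            cell = set()
            for k in range(1, length):
                left, right = T[i, k], T[i + k, length - k]
                for a, bodies in rules.items():
                    for body in bodies:
                        # A -> BC applies if B derives the left piece
                        # and C derives the right piece
                        if len(body) == 2 and body[0] in left and body[1] in right:
                            cell.add(a)
            T[i, length] = cell
    return T


rules = {
    "S": {("A", "B"), ("B", "C")},
    "A": {("B", "A"), ("a",)},
    "B": {("C", "C"), ("b",)},
    "C": {("A", "B"), ("a",)},
}
table = cyk(rules, "baaba")
print("S" in table[1, 5])   # True exactly when baaba is in the language
```

The string is in the language when the start variable appears in the cell for the whole input, here `table[1, 5]`. Recording, for each variable added to a cell, which rule and split produced it would allow a parse tree to be traced back; that bookkeeping is left out of this sketch.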

28/28