CS502: Compiler Design Syntax Analysis Manas Thakur Fall 2020



slide-1
SLIDE 1

CS502: Compiler Design Syntax Analysis Manas Thakur

Fall 2020

slide-2
SLIDE 2

Manas Thakur CS502: Compiler Design 2

Where are we?

[Figure: phases of a compiler, grouped into a Frontend and a Backend, all sharing the Symbol Table]

Character stream → Lexical Analyzer → Token stream → Syntax Analyzer → Syntax tree → Semantic Analyzer → Syntax tree → Intermediate Code Generator → Intermediate representation → Machine-Independent Code Optimizer → Intermediate representation → Code Generator → Target machine code → Machine-Dependent Code Optimizer → Target machine code

slide-3
SLIDE 3

Roles of Parsing / Syntax analysis

  • Read the specification given by the language implementor.
  • Get help from lexer to collect tokens.
  • Check if the sequence of tokens matches the specification.
  • Declare successful program structure or report errors in a useful manner.

  • Later: Also identify some semantic errors.
slide-4
SLIDE 4

Specifying the syntax

  • Regular expressions are mostly not capable enough.
  • Syntactic constructs are specified using context-free grammars (CFGs).
  • The corresponding language is called a context-free language (CFL).
  • CFGs subsume REs.
      – Then why did we use REs for scanning?
  • Right tool for the right job!
slide-5
SLIDE 5

Context-Free Grammar (CFG)

  • 1. A set of terminals, called tokens.
      – Terminals are the elementary symbols of the language being parsed.
  • 2. A set of non-terminals, called variables.
      – A non-terminal represents a set of strings of terminals.
  • 3. A set of productions.
      – They define the syntactic rules.
  • 4. A start symbol, which is one of the non-terminals.

list → list + digit
list → list - digit
list → digit
digit → 0 | 1 | ... | 8 | 9

slide-6
SLIDE 6

Productions

All of the below are productions (or rules):

list → list + digit
list → list - digit
list → digit
digit → 0 | 1 | ... | 8 | 9

In each production, the part to the left of the arrow is the left side or head; the part to the right is the right side or body.

slide-7
SLIDE 7

Derivations

  • A grammar derives strings by beginning with the start symbol and repeatedly replacing a non-terminal by the body of a production for that non-terminal.
  • The grammar below derives sentences like:
      – 3+1-0+8-2+0+1+5
      – 0
  • The set of all such strings forms the language specified by the CFG.

list → list + digit
list → list - digit
list → digit
digit → 0 | 1 | ... | 8 | 9

slide-8
SLIDE 8

Practice

  • Write a CFG to generate strings of the form 0^n 1^n.
      – S → 0S1
      – S → ε
      – Can also be written as: S → 0S1 | ε
  • Homework:
      – w c w^R (a string w, then c, then the reverse of w)
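The grammar S → 0S1 | ε can be read directly as a recursive procedure. The following is a minimal sketch under that reading; the class name ZeroOneGrammar and its method names are hypothetical and not part of the course material.

```java
// Hypothetical helper class illustrating S → 0S1 | ε.
class ZeroOneGrammar {
    // Returns true iff s can be derived from S using S → 0S1 | ε.
    static boolean derives(String s) {
        if (s.isEmpty()) {
            return true;                                     // S → ε
        }
        if (s.length() >= 2 && s.charAt(0) == '0' && s.charAt(s.length() - 1) == '1') {
            return derives(s.substring(1, s.length() - 1));  // S → 0S1
        }
        return false;
    }

    // Builds the sentence obtained by applying S → 0S1 n times, then S → ε.
    static String generate(int n) {
        return n == 0 ? "" : "0" + generate(n - 1) + "1";
    }
}
```

Each recursive call peels off one matching 0/1 pair, which is exactly the counting that a regular expression cannot express.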

slide-9
SLIDE 9

Derivations (cont.)

  • Given a CFG, we can derive strings in the associated CFL by successively replacing the non-terminals based on productions.

goal → expr
expr → expr op expr | num | id
id → a | b | ... | z
num → 0 | 1 | ... | 9
op → + | - | * | /

Example derivation (x + 2 * y):

goal → expr
     → expr op expr
     → id op expr
     → x op expr
     → x + expr
     → x + expr op expr
     → x + num op expr
     → x + 2 op expr
     → x + 2 * expr
     → x + 2 * id
     → x + 2 * y

slide-10
SLIDE 10

Leftmost derivations

  • What did we do at each step in the previous derivation?
      – Replaced the leftmost non-terminal.
      – This is called a leftmost derivation.
      – expr, expr op expr, id op expr, etc. are the leftmost sentential forms.

Example derivation (x + 2 * y):

goal → expr
     → expr op expr
     → id op expr
     → x op expr
     → x + expr
     → x + expr op expr
     → x + num op expr
     → x + 2 op expr
     → x + 2 * expr
     → x + 2 * id
     → x + 2 * y

goal → expr
expr → expr op expr | num | id
id → a | b | ... | z
num → 0 | 1 | ... | 9
op → + | - | * | /
slide-11
SLIDE 11

Rightmost derivations

  • Replace the rightmost non-terminal at each step.
      – This is called a rightmost derivation.
      – expr, expr op expr, expr op id, etc. are the rightmost sentential forms.

Example derivation (x + 2 * y):

goal → expr
     → expr op expr
     → expr op id
     → expr op y
     → expr * y
     → expr op expr * y
     → expr op num * y
     → expr op 2 * y
     → expr + 2 * y
     → id + 2 * y
     → x + 2 * y

goal → expr
expr → expr op expr | num | id
id → a | b | ... | z
num → 0 | 1 | ... | 9
op → + | - | * | /
slide-12
SLIDE 12

Formally

  • →* denotes a derivation of zero or more steps
  • →+ denotes a derivation of one or more steps
  • If S →* β, then β is a sentential form of the associated grammar G.
  • L(G) = {w | S →+ w and w consists only of terminals}; w ∈ L(G) is called a sentence of G.

  • The process of discovering a derivation is called parsing.
  • The output is a parse tree, which we shall see tomorrow.
slide-13
SLIDE 13

CS502: Compiler Design Syntax Analysis (Cont.) Manas Thakur

Fall 2020

slide-14
SLIDE 14

Parse Tree

  • A pictorial representation of program derivation.
  • A parse tree for x + y * z:

expr → expr + expr | expr * expr | id
id → a | b | ... | z

expr
├── expr
│   └── id
│       └── x
├── +
└── expr
    ├── expr
    │   └── id
    │       └── y
    ├── *
    └── expr
        └── id
            └── z

slide-15
SLIDE 15

Precedence

  • Another parse tree for x+y*z:

expr
├── expr
│   ├── expr
│   │   └── id
│   │       └── x
│   ├── +
│   └── expr
│       └── id
│           └── y
├── *
└── expr
    └── id
        └── z

  • Operator evaluation in a left-to-right tree walk gives: (x+y)*z
      – Wrong answer!
      – Should have been: x+(y*z)

slide-16
SLIDE 16

The precedence problem

  • Our grammar has no notion of precedence or an implied order of evaluation.
  • Ideally, multiplication should be enforced before addition.
  • Will the second grammar below generate all the strings that could be generated by the first?
  • Does it solve the problem?

First grammar (original):

expr → expr + expr | expr * expr | id
id → a | b | ... | z

Second grammar (one non-terminal per precedence level):

expr → expr + term | term
term → term * factor | factor
factor → id
id → a | b | ... | z

slide-17
SLIDE 17

New derivation and parse tree

expr → expr + term | term
term → term * factor | factor
factor → id
id → a | b | ... | z

Rightmost derivation of x + y * z:

expr → expr + term
     → expr + term * factor
     → expr + term * id
     → expr + term * z
     → expr + factor * z
     → expr + id * z
     → expr + y * z
     → term + y * z
     → factor + y * z
     → id + y * z
     → x + y * z

Parse tree:

expr
├── expr
│   └── term
│       └── factor
│           └── id
│               └── x
├── +
└── term
    ├── term
    │   └── factor
    │       └── id
    │           └── y
    ├── *
    └── factor
        └── id
            └── z

Correct tree walk!

slide-18
SLIDE 18

Ambiguity

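  • रोको मत जाने दो (Hindi; without punctuation it can be read either as "Stop, don't let go" or as "Don't stop, let go")
      – Whether to stop or let go.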

  • Sarah gave a bath to her dog wearing a pink t-shirt.

– Who was wearing the pink t-shirt?

slide-19
SLIDE 19

Ambiguity in grammars

  • If a grammar has more than one leftmost or rightmost derivation for a single sentential form, then it is ambiguous.
  • Example:

<stmt> → if <expr> then <stmt>
       | if <expr> then <stmt> else <stmt>
       | <other stmts>

  • Try deriving the sentential form:
      – if E1 then if E2 then S1 else S2
  • It has two parse trees, depending on which then the else attaches to:
      – if E1 then (if E2 then S1 else S2)
      – if E1 then (if E2 then S1) else S2

Ambiguous grammar!

slide-20
SLIDE 20

Resolving ambiguity

  • Need to re-arrange the grammar.
  • Match an else with the closest unmatched then:

<stmt> → <matched> | <unmatched>
<matched> → if <expr> then <matched> else <matched>
          | <other stmts>
<unmatched> → if <expr> then <stmt>
            | if <expr> then <matched> else <unmatched>

  • Check: if E1 then if E2 then S1 else S2
  • Not a trivial task, but comes with practice.

slide-21
SLIDE 21

Parsing

  • Given a string and a grammar, how do we check whether the string follows the grammar?
  • In other words, how do compilers parse input programs?
  • Homework:
      – Look up “C grammar”.
      – Find out the number of productions.
      – Try to understand the grammar.
slide-22
SLIDE 22

Different ways of parsing

  • Top-down parsing
      – Start with the root production.
      – Go down towards the leaves, trying to obtain the string.
  • Bottom-up parsing
      – Start with the leaves.
      – Go up and try to reach the root production.

slide-23
SLIDE 23

Top-down parsing

  • Start with the root (recall the <goal>?)
  • Keep expanding productions till you get the string
  • Finds leftmost derivation
  • Problem?
      – Backtracking!
  • General method: Recursive descent
  • Special method: Predictive (aka LL(k))
      – Avoids backtracking
      – Fixed lookahead (k)

slide-24
SLIDE 24

Left recursion

  • Top-down parsers cannot handle left recursion
  • A grammar is left recursive if:
      – ∃ a non-terminal A such that A →+ Aα for some string α
  • Have we seen an example of a left-recursive grammar?

expr → expr + term | term
term → term * factor | factor
factor → id
id → a | b | ... | z

slide-25
SLIDE 25

Eliminating left recursion

  • Consider the grammar:

A → Aα | β

      – where α and β do not start with A.
  • We can rewrite this as:

A → βA'
A' → αA' | ε

  • The new grammar does not contain left recursion.
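The rewrite above is mechanical, so it can be sketched as a small routine. This is an illustrative sketch only, handling immediate left recursion for a single non-terminal; the class name LeftRecursion, the method eliminate, and the encoding of a production body as a list of symbols (with the empty list standing for ε) are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: removes immediate left recursion for one non-terminal.
class LeftRecursion {
    // Rewrites A → Aα1 | ... | β1 | ... as A → βi A' and A' → αi A' | ε.
    static Map<String, List<List<String>>> eliminate(String a, List<List<String>> bodies) {
        String aPrime = a + "'";
        List<List<String>> forA = new ArrayList<>();
        List<List<String>> forAPrime = new ArrayList<>();
        for (List<String> body : bodies) {
            if (!body.isEmpty() && body.get(0).equals(a)) {
                // A → Aα becomes A' → αA'
                List<String> alpha = new ArrayList<>(body.subList(1, body.size()));
                alpha.add(aPrime);
                forAPrime.add(alpha);
            } else {
                // A → β becomes A → βA'
                List<String> beta = new ArrayList<>(body);
                beta.add(aPrime);
                forA.add(beta);
            }
        }
        Map<String, List<List<String>>> result = new LinkedHashMap<>();
        if (forAPrime.isEmpty()) {
            result.put(a, bodies);            // no left recursion: leave unchanged
            return result;
        }
        forAPrime.add(new ArrayList<>());     // A' → ε
        result.put(a, forA);
        result.put(aPrime, forAPrime);
        return result;
    }
}
```

For example, calling eliminate on E with the bodies [E, +, T] and [T] yields E → T E' and E' → + T E' | ε.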

slide-26
SLIDE 26

Classwork

  • Eliminate left recursion from our favorite grammar:

expr → expr + term | term
term → term * factor | factor
factor → id

    Abbreviated as:

E → E + T | T
T → T * F | F
F → id

  • Answer:

E → TE'
E' → +TE' | ε
T → FT'
T' → *FT' | ε
F → id

slide-27
SLIDE 27

Recursive descent parsing

int A() {
    foreach (production of the form A → X1 X2 X3 ... Xk) {
        for (i = 1 to k) {
            if (Xi is a nonterminal) {
                if (Xi() != 0) {
                    backtrack();
                    break;          // Try next production
                }
            } else if (Xi == next_symbol) {
                advance_input();
            } else {
                backtrack();
                break;              // Try next production
            }
        }
        if (i == k+1)
            return 0;               // Success
    }
    return 1;                       // Failure: no production matched
}

S → c A d
A → a b | a

Input string: cad

  1. Start with S and expand S → c A d; c matches the input.
  2. Try A → a b: a matches, but b does not match d, so backtrack.
  3. Try A → a: a matches, then d matches, and the whole input cad is consumed.
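As a concrete instance of the scheme above, here is a hand-written backtracking parser for the grammar S → c A d, A → a b | a. It is a sketch under simplifying assumptions: input symbols are single characters, and the class name BacktrackingParser and its methods are hypothetical. Backtracking is done by saving and restoring the input position.

```java
// Hypothetical recursive-descent parser with backtracking for:
//   S → c A d
//   A → a b | a
class BacktrackingParser {
    private final String input;
    private int pos = 0;   // index of the next unread symbol

    BacktrackingParser(String input) {
        this.input = input;
    }

    // Consumes c if it is the next input symbol.
    private boolean match(char c) {
        if (pos < input.length() && input.charAt(pos) == c) {
            pos++;
            return true;
        }
        return false;
    }

    // S → c A d
    private boolean s() {
        int saved = pos;
        if (match('c') && a() && match('d')) {
            return true;
        }
        pos = saved;       // backtrack
        return false;
    }

    // A → a b | a   (alternatives are tried in this order)
    private boolean a() {
        int saved = pos;
        if (match('a') && match('b')) {
            return true;
        }
        pos = saved;       // backtrack, then try the next production
        if (match('a')) {
            return true;
        }
        pos = saved;
        return false;
    }

    // Accepts only if S matches and the whole input is consumed.
    static boolean parse(String input) {
        BacktrackingParser p = new BacktrackingParser(input);
        return p.s() && p.pos == input.length();
    }
}
```

Note that once a() returns successfully it is never re-entered to try a later alternative, so the order of alternatives matters in this simple form; with the order used here, both cad and cabd are accepted.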

slide-28
SLIDE 28

CS502: Compiler Design Syntax Analysis (Cont.) Manas Thakur

Fall 2020

slide-29
SLIDE 29

Predictive parsing

  • Avoids backtracking
  • Needs to look ahead

      – A → α | β; which production to choose?
  • Basic idea with lookahead of k:
      – Read k extra characters and see which production to choose.
      – What if reading k characters is not enough?
          A → αβ1 | αβ2
  • We can left factor the grammar.
slide-30
SLIDE 30

Left factoring

  • When the choice between two alternative productions is not clear, rewrite the grammar to defer the decision until enough input is seen.
  • A → α β1 | α β2
  • Here, the common prefix α can be left factored:

A → α A'
A' → β1 | β2

  • Note: Left factoring doesn't change ambiguity.
      – e.g., it doesn't solve the matching else problem.

slide-31
SLIDE 31

FIRST and FOLLOW

  • Aid (top-down) predictive parsing.
      – Also bottom-up parsing (a few classes later)
  • Allow a parser to choose which production to apply, based on lookahead.
  • Informally:
      – FIRST(α) gives the set of terminals that can occur at the first position on expanding α.
      – FOLLOW(α) gives the set of terminals that could occur immediately after expanding α.

slide-32
SLIDE 32

Computation of FIRST

  • For a string of grammar symbols α, FIRST(α) is computed as:
      – The set of terminals that begin strings derived from α: {a | α →* aβ}
      – If α →* ε, then ε is also in FIRST(α)
  • Find the FIRST sets of all the non-terminals in our (slightly expanded) favorite grammar:

E → TE'
E' → +TE' | ε
T → FT'
T' → *FT' | ε
F → (E) | id

FIRST(E) = {(,id}
FIRST(E') = {+,ε}
FIRST(T) = {(,id}
FIRST(T') = {*,ε}
FIRST(F) = {(,id}
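The FIRST sets above can be reproduced with a fixed-point iteration: keep applying the definition until no set grows. The sketch below is one possible way to do this, not the course's implementation; the class name FirstSets, the map-based grammar encoding, and the use of an empty body for ε are assumptions of the example.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: FIRST sets by fixed-point iteration.
class FirstSets {
    static final String EPS = "\u03b5";   // the symbol ε

    // The slide's grammar; an empty body stands for ε.
    static Map<String, List<List<String>>> grammar() {
        Map<String, List<List<String>>> g = new LinkedHashMap<>();
        g.put("E", List.of(List.of("T", "E'")));
        g.put("E'", List.of(List.of("+", "T", "E'"), List.<String>of()));
        g.put("T", List.of(List.of("F", "T'")));
        g.put("T'", List.of(List.of("*", "F", "T'"), List.<String>of()));
        g.put("F", List.of(List.of("(", "E", ")"), List.of("id")));
        return g;
    }

    // FIRST of a string of symbols, using the FIRST sets known so far.
    static Set<String> firstOf(List<String> body, Map<String, Set<String>> first) {
        Set<String> out = new TreeSet<>();
        for (String x : body) {
            if (!first.containsKey(x)) {        // terminal: it begins the string
                out.add(x);
                return out;
            }
            Set<String> fx = first.get(x);
            for (String t : fx) {
                if (!t.equals(EPS)) out.add(t);
            }
            if (!fx.contains(EPS)) return out;  // x cannot vanish, so stop here
        }
        out.add(EPS);                           // every symbol can derive ε
        return out;
    }

    // Repeats until no FIRST set grows.
    static Map<String, Set<String>> compute(Map<String, List<List<String>>> g) {
        Map<String, Set<String>> first = new LinkedHashMap<>();
        for (String nt : g.keySet()) first.put(nt, new TreeSet<>());
        boolean changed = true;
        while (changed) {
            changed = false;
            for (Map.Entry<String, List<List<String>>> e : g.entrySet()) {
                for (List<String> body : e.getValue()) {
                    if (first.get(e.getKey()).addAll(firstOf(body, first))) changed = true;
                }
            }
        }
        return first;
    }
}
```

Calling FirstSets.compute(FirstSets.grammar()) gives the same five sets as listed on the slide.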

slide-33
SLIDE 33

Computation of FOLLOW

  • For a non-terminal A, FOLLOW(A) is computed as:
      – If S →* αAaβ, then FOLLOW(A) contains a (basically FIRST(aβ)).
      – If S →* αABaβ and B →* ε, then FOLLOW(A) contains a.
      – If S →* αA, then FOLLOW(A) contains FOLLOW(S).
      – FOLLOW(S) always contains $.
  • Find the FOLLOW sets of all the non-terminals in our new favorite grammar:

E → TE'
E' → +TE' | ε
T → FT'
T' → *FT' | ε
F → (E) | id

FOLLOW(E) = {),$}
FOLLOW(E') = {),$}
FOLLOW(T) = {+,),$}
FOLLOW(T') = {+,),$}
FOLLOW(F) = {*,+,),$}
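The FOLLOW sets can be computed in a similar iterative way. Note that the sketch below uses the common production-by-production formulation (for A → αXβ, add FIRST(β) minus ε to FOLLOW(X), and add FOLLOW(A) when β can vanish) rather than the sentential-form rules stated above. The FIRST sets are copied from the earlier slide; the class name FollowSets and the grammar encoding are assumptions of the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: FOLLOW sets by fixed-point iteration.
class FollowSets {
    static final String EPS = "\u03b5";   // the symbol ε

    static Map<String, Set<String>> compute() {
        // Grammar from the slide; an empty body stands for ε.
        Map<String, List<List<String>>> g = new LinkedHashMap<>();
        g.put("E", List.of(List.of("T", "E'")));
        g.put("E'", List.of(List.of("+", "T", "E'"), List.<String>of()));
        g.put("T", List.of(List.of("F", "T'")));
        g.put("T'", List.of(List.of("*", "F", "T'"), List.<String>of()));
        g.put("F", List.of(List.of("(", "E", ")"), List.of("id")));

        // FIRST sets as given on the FIRST slide.
        Map<String, Set<String>> first = new HashMap<>();
        first.put("E", Set.of("(", "id"));
        first.put("E'", Set.of("+", EPS));
        first.put("T", Set.of("(", "id"));
        first.put("T'", Set.of("*", EPS));
        first.put("F", Set.of("(", "id"));

        Map<String, Set<String>> follow = new LinkedHashMap<>();
        for (String nt : g.keySet()) follow.put(nt, new TreeSet<>());
        follow.get("E").add("$");             // FOLLOW(start symbol) contains $

        boolean changed = true;
        while (changed) {
            changed = false;
            for (Map.Entry<String, List<List<String>>> e : g.entrySet()) {
                String a = e.getKey();
                for (List<String> body : e.getValue()) {
                    for (int i = 0; i < body.size(); i++) {
                        String x = body.get(i);
                        if (!g.containsKey(x)) continue;   // only non-terminals have FOLLOW
                        boolean restVanishes = true;
                        for (int j = i + 1; j < body.size() && restVanishes; j++) {
                            String y = body.get(j);
                            if (!g.containsKey(y)) {       // terminal right after x
                                changed |= follow.get(x).add(y);
                                restVanishes = false;
                            } else {
                                for (String t : first.get(y)) {
                                    if (!t.equals(EPS)) changed |= follow.get(x).add(t);
                                }
                                restVanishes = first.get(y).contains(EPS);
                            }
                        }
                        if (restVanishes) {
                            // Copy first, because x and a may be the same non-terminal.
                            changed |= follow.get(x).addAll(new ArrayList<>(follow.get(a)));
                        }
                    }
                }
            }
        }
        return follow;
    }
}
```

The result agrees with the FOLLOW sets listed on the slide.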

slide-34
SLIDE 34

Predictive parsing scheme

[Figure: source code → scanner → tokens → table-driven parser → IR; the parser consults the parsing tables and maintains a stack]

  • Rather than writing recursive code, predictive parsers (and even bottom-up parsers) are table-driven.

slide-35
SLIDE 35

Buildup to predictive parsing

[Figure: source code → scanner → tokens → table-driven parser → IR; the parser consults the parsing tables and maintains a stack]

  • Removal of ambiguity.
  • Elimination of left recursion.
  • Left factoring.

E → TE'
E' → +TE' | ε
T → FT'
T' → *FT' | ε
F → (E) | id

FIRST(E) = {(,id}
FIRST(E') = {+,ε}
FIRST(T) = {(,id}
FIRST(T') = {*,ε}
FIRST(F) = {(,id}

FOLLOW(E) = {),$}
FOLLOW(E') = {),$}
FOLLOW(T) = {+,),$}
FOLLOW(T') = {+,),$}
FOLLOW(F) = {+,*,),$}
slide-36
SLIDE 36

Predictive parsing table

  • A cell against non-terminal α and terminal t tells which production to pick for deriving t.
  • We need to populate this table using the FIRST and the FOLLOW sets, and the algorithm on the next slide.

Non-terminal   id   +   *   (   )   $
E
E'
T
T'
F

slide-37
SLIDE 37

Table (M) construction

  • ∀ productions A → α:
      – ∀ terminals a ∈ FIRST(α), add A → α to M[A, a]
      – If ε ∈ FIRST(α):
          • ∀ b ∈ FOLLOW(A), add A → α to M[A, b]
          • If $ ∈ FOLLOW(A), add A → α to M[A, $]
  • Set each undefined entry of M to error

E → TE'
E' → +TE' | ε
T → FT'
T' → *FT' | ε
F → (E) | id

FIRST(E) = {(,id}
FIRST(E') = {+,ε}
FIRST(T) = {(,id}
FIRST(T') = {*,ε}
FIRST(F) = {(,id}

FOLLOW(E) = {),$}
FOLLOW(E') = {),$}
FOLLOW(T) = {+,),$}
FOLLOW(T') = {+,),$}
FOLLOW(F) = {+,*,),$}
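The construction rules translate almost line by line into code. The following sketch hard-codes the grammar and the FIRST and FOLLOW sets from this slide and fills the table M; the class name Ll1Table, the nested-map representation of M, and the choice to signal a doubly filled cell with an exception are all assumptions of the example. Since the FOLLOW sets here already contain $, the separate $ rule needs no extra code.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch: building the predictive parsing table M.
class Ll1Table {
    static final String EPS = "\u03b5";   // the symbol ε
    static final Map<String, List<List<String>>> G = new LinkedHashMap<>();
    static final Map<String, Set<String>> FIRST = new HashMap<>();
    static final Map<String, Set<String>> FOLLOW = new HashMap<>();
    static {
        // Grammar from the slide; an empty body stands for ε.
        G.put("E", List.of(List.of("T", "E'")));
        G.put("E'", List.of(List.of("+", "T", "E'"), List.<String>of()));
        G.put("T", List.of(List.of("F", "T'")));
        G.put("T'", List.of(List.of("*", "F", "T'"), List.<String>of()));
        G.put("F", List.of(List.of("(", "E", ")"), List.of("id")));
        FIRST.put("E", Set.of("(", "id"));
        FIRST.put("E'", Set.of("+", EPS));
        FIRST.put("T", Set.of("(", "id"));
        FIRST.put("T'", Set.of("*", EPS));
        FIRST.put("F", Set.of("(", "id"));
        FOLLOW.put("E", Set.of(")", "$"));
        FOLLOW.put("E'", Set.of(")", "$"));
        FOLLOW.put("T", Set.of("+", ")", "$"));
        FOLLOW.put("T'", Set.of("+", ")", "$"));
        FOLLOW.put("F", Set.of("+", "*", ")", "$"));
    }

    // FIRST(α) for a production body α.
    static Set<String> firstOf(List<String> body) {
        Set<String> out = new TreeSet<>();
        for (String x : body) {
            if (!G.containsKey(x)) {            // terminal
                out.add(x);
                return out;
            }
            for (String t : FIRST.get(x)) {
                if (!t.equals(EPS)) out.add(t);
            }
            if (!FIRST.get(x).contains(EPS)) return out;
        }
        out.add(EPS);
        return out;
    }

    // M[A][a] is the body of the production to apply; a missing entry means error.
    static Map<String, Map<String, List<String>>> build() {
        Map<String, Map<String, List<String>>> m = new LinkedHashMap<>();
        for (Map.Entry<String, List<List<String>>> e : G.entrySet()) {
            Map<String, List<String>> row = new TreeMap<>();
            for (List<String> body : e.getValue()) {
                Set<String> f = firstOf(body);
                for (String a : f) {
                    if (!a.equals(EPS)) put(row, a, body);
                }
                if (f.contains(EPS)) {
                    for (String b : FOLLOW.get(e.getKey())) put(row, b, body);
                }
            }
            m.put(e.getKey(), row);
        }
        return m;
    }

    // A cell that would receive two productions indicates a non-LL(1) grammar.
    static void put(Map<String, List<String>> row, String terminal, List<String> body) {
        if (row.putIfAbsent(terminal, body) != null) {
            throw new IllegalStateException("conflict in column " + terminal);
        }
    }
}
```

For this grammar no cell is filled twice, so build() completes without raising a conflict.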

slide-38
SLIDE 38

Homework

  • Construct the predictive parsing table for the following grammar, and post a picture on Teams/Moodle:

S → i E t S S' | a
S' → e S | ε
E → b

slide-39
SLIDE 39

CS502: Compiler Design Syntax Analysis (Cont.) Manas Thakur

Fall 2020

slide-40
SLIDE 40

Predictive Parsing

  • Let's use the table to derive:
      – id + id
      – +id
      – id+

Non-terminal   id          +            *             (           )         $
E              E → T E'                               E → T E'              Accept
E'                         E' → +TE'                              E' → ε    E' → ε
T              T → F T'                               T → F T'
T'                         T' → ε       T' → *FT'                 T' → ε    T' → ε
F              F → id                                 F → (E)
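A table-driven parser replaces recursion with an explicit stack. The sketch below hard-codes the production entries of the table above and runs the usual loop: pop the top of the stack, expand it using the table if it is a non-terminal, or match it against the input if it is a terminal. The class name PredictiveParser, the token representation as strings, and treating acceptance as "stack emptied with all input consumed" are assumptions of the example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: stack-based predictive parser driven by the table.
class PredictiveParser {
    static final List<String> EMPTY = List.of();   // body of an ε-production
    // TABLE.get(nonTerminal).get(lookahead) is the production body to push.
    static final Map<String, Map<String, List<String>>> TABLE = new HashMap<>();
    static {
        TABLE.put("E", Map.of("id", List.of("T", "E'"), "(", List.of("T", "E'")));
        TABLE.put("E'", Map.of("+", List.of("+", "T", "E'"), ")", EMPTY, "$", EMPTY));
        TABLE.put("T", Map.of("id", List.of("F", "T'"), "(", List.of("F", "T'")));
        TABLE.put("T'", Map.of("+", EMPTY, "*", List.of("*", "F", "T'"), ")", EMPTY, "$", EMPTY));
        TABLE.put("F", Map.of("id", List.of("id"), "(", List.of("(", "E", ")")));
    }

    static boolean parse(List<String> tokens) {
        List<String> input = new ArrayList<>(tokens);
        input.add("$");                            // end-of-input marker
        Deque<String> stack = new ArrayDeque<>();
        stack.push("$");
        stack.push("E");                           // start symbol on top
        int pos = 0;
        while (!stack.isEmpty()) {
            String top = stack.pop();
            String look = input.get(pos);
            if (TABLE.containsKey(top)) {          // non-terminal: consult the table
                List<String> body = TABLE.get(top).get(look);
                if (body == null) return false;    // empty cell: syntax error
                for (int i = body.size() - 1; i >= 0; i--) {
                    stack.push(body.get(i));       // push body right to left
                }
            } else if (top.equals(look)) {         // terminal (or $): match it
                pos++;
            } else {
                return false;
            }
        }
        return pos == input.size();
    }
}
```

With this sketch, the token sequence id + id is accepted, while + id fails immediately at M[E, +] and id + fails at M[T, $].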

slide-41
SLIDE 41

LL(1) Grammars

  • A grammar G is LL(1) iff for each set of productions A → α1 | α2 | ... | αn:
      – FIRST(α1), FIRST(α2), ..., FIRST(αn) are all disjoint.
      – If αi can derive ε, then FIRST(αj) and FOLLOW(A) are disjoint ∀ j ≠ i.
  • When is the first condition sufficient?
  • Was our flagship grammar LL(1)?

E → TE'
E' → +TE' | ε
T → FT'
T' → *FT' | ε
F → (E) | id

FIRST(E) = {(,id}
FIRST(E') = {+,ε}
FIRST(T) = {(,id}
FIRST(T') = {*,ε}
FIRST(F) = {(,id}

FOLLOW(E) = {),$}
FOLLOW(E') = {),$}
FOLLOW(T) = {+,),$}
FOLLOW(T') = {+,),$}
FOLLOW(F) = {+,*,),$}

slide-42
SLIDE 42

LL(1) Grammars

LL(k):
      – First L: left-to-right scanning
      – Second L: leftmost derivation
      – k: maximum lookahead

  • A non-LL(1) grammar:
      – S → aS | a; because FIRST(aS) = FIRST(a) = {a}.
      – S → aS'; S' → aS' | ε accepts the same language and is LL(1).
  • Some facts:
      – No left-recursive grammar is LL(1).
      – No ambiguous grammar is LL(1).
      – Some languages have no LL(1) grammar!

slide-43
SLIDE 43

Classwork

  • Construct predictive parsing table for the following grammar:

S → i E t S S' | a
S' → e S | ε
E → b

  • What was this grammar?
  • Homework: Try constructing the table for its LL(1) equivalent.

Non-terminal   FIRST    FOLLOW
S              {i,a}    {e,$}
S'             {e,ε}    {e,$}
E              {b}      {t}

Non-terminal   i                  t    a        e                    b        $
S              S → i E t S S'          S → a                                  Accept
S'                                              S' → e S, S' → ε              S' → ε
E                                                                    E → b

slide-44
SLIDE 44

Abstract Syntax Tree (AST)

  • Parse tree contains a lot of information
      – Eliminate intermediate nodes
      – Move operators up to parent nodes
  • ASTs will be useful in Assignment 1 (and also in the rest).

Parse tree for x + y * z:

expr
├── expr
│   └── id
│       └── x
├── +
└── expr
    ├── expr
    │   └── id
    │       └── y
    ├── *
    └── expr
        └── id
            └── z

AST for x + y * z:

+
├── x
└── *
    ├── y
    └── z

slide-45
SLIDE 45

Visitor Pattern: Motivation

  • Problem: We want to sum an integer list:

interface List {}

class Nil implements List {}

class Cons implements List {
    int head;
    List tail;
}

slide-46
SLIDE 46

Approach 1: instanceof and typecasts

  • Good: No need to touch Nil and Cons.
  • Bad: Typecasts and instanceof checks.

List l;   // refers to the list being summed
int sum = 0;
boolean proceed = true;
while (proceed) {
    if (l instanceof Nil) {
        proceed = false;
    } else if (l instanceof Cons) {
        sum = sum + ((Cons) l).head;
        l = ((Cons) l).tail;
    }
}

slide-47
SLIDE 47

Approach 2: Use the power of OO

  • Good: No typecasts and instanceofs.
  • Bad: Original classes to be recompiled for each new operation.

interface List {
    int sum();
}

class Nil implements List {
    public int sum() {
        return 0;
    }
}

class Cons implements List {
    int head;
    List tail;

    public int sum() {
        return head + tail.sum();
    }
}

slide-48
SLIDE 48

Approach 3: Visitor Pattern

  • Divide code into an object structure and a Visitor.
  • Insert an accept method in each class. Each accept method takes a Visitor as an argument.
  • A Visitor contains a visit method for each class.

interface List {
    void accept(Visitor v);
}

interface Visitor {
    void visit(Nil x);
    void visit(Cons x);
}

slide-49
SLIDE 49

Approach 3: Visitor Pattern (Cont.)

  • The purpose of accept methods is to invoke the visit method in the Visitor that can handle the current object.

class Nil implements List {
    public void accept(Visitor v) {
        v.visit(this);
    }
}

class Cons implements List {
    int head;
    List tail;

    public void accept(Visitor v) {
        v.visit(this);
    }
}

slide-50
SLIDE 50

Approach 3: Visitor Pattern (Cont.)

  • The control flows back and forth between the visit methods in the Visitor and the accept methods in the object structure.

class SumVisitor implements Visitor {
    int sum = 0;

    public void visit(Nil x) {}

    public void visit(Cons x) {
        sum = sum + x.head;
        x.tail.accept(this);
    }
}

List l;
...
SumVisitor sv = new SumVisitor();
l.accept(sv);
System.out.println(sv.sum);

slide-51
SLIDE 51

ASTs, JavaCC, JTB, Visitor Pattern, and A1