CS502: Compiler Design Syntax Analysis Manas Thakur Fall 2020
Manas Thakur CS502: Compiler Design 2
Where are we?
Frontend:
Character stream → Lexical Analyzer → Token stream → Syntax Analyzer → Syntax tree → Semantic Analyzer → Syntax tree → Intermediate Code Generator → Intermediate representation
Backend:
Intermediate representation → Machine-Independent Code Optimizer → Intermediate representation → Code Generator → Target machine code → Machine-Dependent Code Optimizer → Target machine code
The Symbol Table is shared by the phases.
Roles of Parsing / Syntax analysis
- Read the specification given by the language implementor.
- Get help from lexer to collect tokens.
- Check if the sequence of tokens matches the specification.
- Declare successful program structure or report errors in a useful
manner.
- Later: Also identify some semantic errors.
Specifying the syntax
- Regular expressions are mostly
not capable enough.
- Syntactic constructs specified using
context-free grammars.
- The corresponding language is
called a context-free language.
- CFGs subsume REs.
– Then why did we use REs for scanning?
- Right tool for the right job!
Context-Free Grammar (CFG)
- 1. A set of terminals called tokens.
Terminals are the elementary symbols of the language being parsed.
- 2. A set of non-terminals called variables.
A non-terminal represents a set of strings of terminals.
- 3. A set of productions.
– They define the syntactic rules.
- 4. A start symbol, which is one designated non-terminal.
Example:
list → list + digit
list → list - digit
list → digit
digit → 0 | 1 | ... | 8 | 9
Productions
All of the below are productions (or rules):
list → list + digit
list → list - digit
list → digit
digit → 0 | 1 | ... | 8 | 9
In each production, the symbol to the left of the arrow is the left side (or head), and the string to the right of the arrow is the right side (or body).
Derivations
- A grammar derives strings by beginning with the start symbol and repeatedly replacing a non-terminal by the body of a production for that non-terminal.
- The above grammar derives sentences like
– 3+1-0+8-2+0+1+5
– 0
- The set of all such strings forms the language specified by the above CFG.
Practice
- Write a CFG to generate strings of the form 0^n 1^n.
– S → 0S1
– S → ε
– Can also be written as: S → 0S1 | ε
- Homework:
– w c w^R (a string w, then c, then the reverse of w)
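As a rough illustration of how the two productions for 0^n 1^n drive recognition, the following Java sketch mirrors them directly: the empty string corresponds to S → ε, and peeling a leading 0 and a trailing 1 corresponds to S → 0S1. The class name `ZeroNOneN` and method name `derives` are made up for this example.

```java
final class ZeroNOneN {
    // Returns true if w can be derived from S, where S -> 0 S 1 | ε.
    static boolean derives(String w) {
        if (w.isEmpty()) {
            return true;                                   // S -> ε
        }
        return w.length() >= 2
                && w.charAt(0) == '0'
                && w.charAt(w.length() - 1) == '1'
                && derives(w.substring(1, w.length() - 1)); // S -> 0 S 1
    }

    public static void main(String[] args) {
        for (String w : new String[] {"", "01", "0011", "001", "10"}) {
            System.out.println("\"" + w + "\" in L(G): " + derives(w));
        }
    }
}
```

The same shape (match the outer symbols, recurse on the middle) is a reasonable starting point for the homework grammar.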
Derivations (cont.)
- Given a CFG, we can derive strings in the associated CFL by successively replacing the non-terminals based on productions.
goal → expr
expr → expr op expr | num | id
id → a | b | ... | z
num → 0 | 1 | ... | 9
op → + | - | * | /
Example derivation (x + 2 * y):
goal → expr
     → expr op expr
     → id op expr
     → x op expr
     → x + expr
     → x + expr op expr
     → x + num op expr
     → x + 2 op expr
     → x + 2 * expr
     → x + 2 * id
     → x + 2 * y
Leftmost derivations
- What did we do at each step in the previous derivation?
– Replaced the leftmost non-terminal
– Called a leftmost derivation
– expr, expr op expr, id op expr, etc. are the leftmost sentential forms
Example derivation (x + 2 * y):
goal → expr
     → expr op expr
     → id op expr
     → x op expr
     → x + expr
     → x + expr op expr
     → x + num op expr
     → x + 2 op expr
     → x + 2 * expr
     → x + 2 * id
     → x + 2 * y
Rightmost derivations
- Replace the rightmost non-terminal at each step
– Called a rightmost derivation
– expr, expr op expr, expr op id, etc. are the rightmost sentential forms
Example derivation (x + 2 * y):
goal → expr
     → expr op expr
     → expr op id
     → expr op y
     → expr * y
     → expr op expr * y
     → expr op num * y
     → expr op 2 * y
     → expr + 2 * y
     → id + 2 * y
     → x + 2 * y
Formally
- →* denotes a derivation of zero or more steps
- →+ denotes a derivation of one or more steps
- If S →* β, then β is a sentential form of the associated grammar G.
- L(G) = {w | S →+ w and w consists only of terminals}; w ∈ L(G) is called a sentence of G.
- The process of discovering a derivation is called parsing.
- The output is a parse tree, which we shall see tomorrow.
Parse Tree
- A pictorial representation of program derivation.
- A parse tree for x + y * z:
Grammar:
expr → expr + expr | expr * expr | id
id → a | b | ... | z
Parse tree:
expr
├── expr
│   └── id
│       └── x
├── +
└── expr
    ├── expr
    │   └── id
    │       └── y
    ├── *
    └── expr
        └── id
            └── z
Precedence
- Another parse tree for x+y*z:
expr
├── expr
│   ├── expr
│   │   └── id
│   │       └── x
│   ├── +
│   └── expr
│       └── id
│           └── y
├── *
└── expr
    └── id
        └── z
- Operator evaluation in a left-to-right tree walk gives: (x+y)*z
– Wrong answer!
– Should have been: x+(y*z)
The precedence problem
- Our grammar has no notion of precedence or an implied order of evaluation.
- Ideally, multiplication should be enforced before addition.
Original grammar:
expr → expr + expr | expr * expr | id
id → a | b | ... | z
Rewritten grammar, with one non-terminal per precedence level:
expr → expr + term | term
term → term * factor | factor
factor → id
id → a | b | ... | z
- Will the rewritten grammar generate all the strings that could be generated by the original grammar?
- Does it solve the problem?
New derivation and parse tree
expr → expr + term | term
term → term * factor | factor
factor → id
id → a | b | ... | z
Rightmost derivation of x + y * z:
expr → expr + term
     → expr + term * factor
     → expr + term * id
     → expr + term * z
     → expr + factor * z
     → expr + id * z
     → expr + y * z
     → term + y * z
     → factor + y * z
     → id + y * z
     → x + y * z
Parse tree:
expr
├── expr
│   └── term
│       └── factor
│           └── id
│               └── x
├── +
└── term
    ├── term
    │   └── factor
    │       └── id
    │           └── y
    ├── *
    └── factor
        └── id
            └── z
Correct tree-walk!
Ambiguity
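- रोको मत जाने दो (Hindi: "roko mat jaane do")
– Can be read as "Stop him, don't let him go" or as "Don't stop him, let him go", depending on where the pause falls: whether to stop or let go.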
- Sarah gave a bath to her dog wearing a pink t-shirt.
– Who was wearing the pink t-shirt?
Ambiguity in grammars
- If a grammar has more than one leftmost (or more than one rightmost) derivation for a single sentential form, then it is ambiguous.
- Example:
<stmt> → if <expr> then <stmt>
       | if <expr> then <stmt> else <stmt>
       | <other stmts>
- Try deriving the sentential form:
– if E1 then if E2 then S1 else S2
– The else can attach to the inner if: if E1 then (if E2 then S1 else S2)
– or to the outer if: if E1 then (if E2 then S1) else S2
Ambiguous grammar!
Resolving ambiguity
- Need to re-arrange the grammar.
- Match an else with the closest unmatched then:
<stmt> → <matched> | <unmatched>
<matched> → if <expr> then <matched> else <matched>
          | <other stmts>
<unmatched> → if <expr> then <stmt>
            | if <expr> then <matched> else <unmatched>
- Check: if E1 then if E2 then S1 else S2
- Not a trivial task, but comes with practice.
Parsing
- Given a string and a grammar, how do we check whether the
string follows the grammar?
- In other words, how do compilers parse input programs?
- Homework:
– Look up “C grammar”
- Find out the number of productions.
- Try to understand the grammar.
Different ways of parsing
- Top-down parsing
– Start with the root production
– Go down towards the leaves trying to obtain the string
- Bottom-up parsing
– Start with the leaves
– Go up and try to reach the root production
Top-down parsing
- Start with the root (recall the <goal>?)
- Keep expanding productions till you get the string
- Finds leftmost derivation
- Problem?
– Backtracking!
- General method: Recursive descent
- Special method: Predictive (aka LL(k))
– Avoids backtracking
– Fixed lookahead (k)
Left recursion
- Top-down parsers cannot handle left recursion
- A grammar is left recursive if:
∃ a non-terminal A such that A →+ Aα for some string α
- Have we seen an example of a left-recursive grammar?
expr → expr + term | term
term → term * factor | factor
factor → id
id → a | b | ... | z
Eliminating left recursion
- Consider the grammar:
A → Aα | β
– where α and β do not start with A.
- We can rewrite this as:
A → βA'
A' → αA' | ε
- The new grammar does not contain left recursion.
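The rewrite above can be sketched in Java for a single non-terminal with immediate left recursion. This is only an illustration under some assumptions: productions are represented as lists of symbol strings, an empty list stands for ε, the fresh non-terminal is named by appending `'`, and the class and method names (`LeftRecursion`, `eliminate`) are made up for the example. Indirect left recursion is not handled.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class LeftRecursion {
    // Rewrites A -> A α1 | ... | β1 | ... into
    //   A  -> β1 A' | ...
    //   A' -> α1 A' | ... | ε
    static Map<String, List<List<String>>> eliminate(String a, List<List<String>> alts) {
        List<List<String>> alphas = new ArrayList<>(); // tails of left-recursive alternatives
        List<List<String>> betas = new ArrayList<>();  // alternatives that do not start with A
        for (List<String> alt : alts) {
            if (!alt.isEmpty() && alt.get(0).equals(a)) {
                alphas.add(alt.subList(1, alt.size()));
            } else {
                betas.add(alt);
            }
        }
        Map<String, List<List<String>>> out = new LinkedHashMap<>();
        if (alphas.isEmpty()) {            // no immediate left recursion: keep as is
            out.put(a, alts);
            return out;
        }
        String fresh = a + "'";            // assumed not to clash with an existing name
        List<List<String>> newA = new ArrayList<>();
        for (List<String> beta : betas) {
            List<String> body = new ArrayList<>(beta);
            body.add(fresh);               // A -> β A'
            newA.add(body);
        }
        List<List<String>> newTail = new ArrayList<>();
        for (List<String> alpha : alphas) {
            List<String> body = new ArrayList<>(alpha);
            body.add(fresh);               // A' -> α A'
            newTail.add(body);
        }
        newTail.add(new ArrayList<>());    // A' -> ε
        out.put(a, newA);
        out.put(fresh, newTail);
        return out;
    }

    public static void main(String[] args) {
        // expr -> expr + term | term
        System.out.println(eliminate("E", List.of(List.of("E", "+", "T"), List.of("T"))));
    }
}
```

Applied to E → E + T | T, the sketch produces E → T E' and E' → + T E' | ε, with the empty list printed for the ε alternative.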
Classwork
- Eliminate left recursion from our favorite grammar:
expr → expr + term | term
term → term * factor | factor
factor → id
Abbreviated as:
E → E + T | T
T → T * F | F
F → id
- Answer:
E → TE'
E' → +TE' | ε
T → FT'
T' → *FT' | ε
F → id
Recursive descent parsing
int A() {
    foreach (production of the form A → X1 X2 X3 ... Xk) {
        for (i = 1 to k) {
            if (Xi is a nonterminal) {
                if (Xi() != 0) {
                    backtrack();
                    break;          // Try next production
                }
            } else if (Xi == next_symbol) {
                advance_input();
            } else {
                backtrack();
                break;              // Try next production
            }
        }
        if (i == k+1) return 0;     // Success
    }
    return 1;                       // Failure: no production matched
}
Example grammar:
S → c A d
A → a b | a
Input string: cad
1. Expand S → c A d and match c.
2. Try A → a b: a matches, but b does not match d, so backtrack.
3. Try A → a: a matches; then d matches, and cad is accepted.
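The cad example can be made concrete with a small Java sketch of a backtracking recursive-descent parser for S → c A d, A → a b | a. The class and method names (`BacktrackingParser`, `accepts`, `parseS`, `parseA`) are made up here, and the input is assumed to be a string of single-character tokens. Like the pseudocode, it tries alternatives in order and does not revisit a choice once a non-terminal has returned success, so it is not a fully general backtracking parser.

```java
final class BacktrackingParser {
    private final String input;
    private int pos = 0;                 // index of the next unread symbol

    private BacktrackingParser(String input) {
        this.input = input;
    }

    // Consumes c if it is the next input symbol.
    private boolean match(char c) {
        if (pos < input.length() && input.charAt(pos) == c) {
            pos++;
            return true;
        }
        return false;
    }

    // S -> c A d
    private boolean parseS() {
        int saved = pos;
        if (match('c') && parseA() && match('d')) {
            return true;
        }
        pos = saved;                     // backtrack
        return false;
    }

    // A -> a b | a
    private boolean parseA() {
        int saved = pos;
        if (match('a') && match('b')) {  // first alternative: a b
            return true;
        }
        pos = saved;                     // backtrack, then try the next production
        if (match('a')) {                // second alternative: a
            return true;
        }
        pos = saved;
        return false;
    }

    // True if the whole input is derived from S.
    static boolean accepts(String s) {
        BacktrackingParser p = new BacktrackingParser(s);
        return p.parseS() && p.pos == s.length();
    }

    public static void main(String[] args) {
        System.out.println(accepts("cad"));   // A -> a b fails on d, A -> a succeeds
        System.out.println(accepts("cabd"));
        System.out.println(accepts("cd"));
    }
}
```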
Predictive parsing
- Avoids backtracking
- Needs to look ahead
– A → α | β; which production to choose?
- Basic idea with lookahead of k:
– Read k tokens ahead and see which production to choose
– What if reading k tokens is not enough?
   e.g., A → αβ1 | αβ2
- We can left factor the grammar
Left factoring
- When the choice between two alternative productions is not
clear, rewrite the grammar to defer the decision until enough input is seen.
- A → α β1 | α β2
- Here, common prefix α can be left factored:
A → α A'
A' → β1 | β2
- Note: Left factoring doesn't change ambiguity.
– e.g., it doesn’t solve the matching else problem.
FIRST and FOLLOW
- Aid (top-down) predictive parsing.
– Also bottom-up parsing (a few classes later)
- Allow a parser to choose which production to apply, based on
lookahead.
- Informally:
– FIRST(α) gives the set of terminals that can occur at the first
position on expanding α.
– FOLLOW(A) gives the set of terminals that can appear immediately after the non-terminal A in some sentential form.
Computation of FIRST
- For a string of grammar symbols α, FIRST(α) is computed as:
– The set of terminals that begin strings derived from α:
{a | α →* aβ }
– If α →* ε, then ε is also in FIRST(α)
- Find the FIRST sets of all the non-terminals in our (slightly
expanded) favorite grammar:
E → TE'
E' → +TE' | ε
T → FT'
T' → *FT' | ε
F → (E) | id
FIRST(E) = {(, id}
FIRST(E') = {+, ε}
FIRST(T) = {(, id}
FIRST(T') = {*, ε}
FIRST(F) = {(, id}
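One common way to compute these sets mechanically is a fixed-point iteration over the productions. The Java sketch below is an illustration under stated assumptions, not a prescribed implementation: the grammar is a map from non-terminal to a list of bodies, an empty body stands for ε, any symbol that is not a key of the map is treated as a terminal, and the names `FirstSets`, `firstOf`, `compute`, and `exampleGrammar` are made up.

```java
import java.util.*;

final class FirstSets {
    static final String EPS = "ε";

    // The expression grammar from the slide; an empty body stands for ε.
    static Map<String, List<List<String>>> exampleGrammar() {
        Map<String, List<List<String>>> g = new LinkedHashMap<>();
        g.put("E",  List.of(List.of("T", "E'")));
        g.put("E'", List.of(List.of("+", "T", "E'"), List.<String>of()));
        g.put("T",  List.of(List.of("F", "T'")));
        g.put("T'", List.of(List.of("*", "F", "T'"), List.<String>of()));
        g.put("F",  List.of(List.of("(", "E", ")"), List.of("id")));
        return g;
    }

    // FIRST of a string of symbols, using the FIRST sets known so far.
    static Set<String> firstOf(List<String> symbols, Map<String, Set<String>> first) {
        Set<String> result = new LinkedHashSet<>();
        for (String x : symbols) {
            // For a terminal x, FIRST(x) = {x}.
            Set<String> fx = first.containsKey(x) ? first.get(x) : Set.of(x);
            for (String t : fx) {
                if (!t.equals(EPS)) result.add(t);
            }
            if (!fx.contains(EPS)) return result;  // x cannot vanish: stop here
        }
        result.add(EPS);                           // every symbol can derive ε
        return result;
    }

    // Repeats until no FIRST set grows any further.
    static Map<String, Set<String>> compute(Map<String, List<List<String>>> g) {
        Map<String, Set<String>> first = new LinkedHashMap<>();
        for (String nt : g.keySet()) first.put(nt, new LinkedHashSet<>());
        boolean changed = true;
        while (changed) {
            changed = false;
            for (Map.Entry<String, List<List<String>>> e : g.entrySet()) {
                for (List<String> body : e.getValue()) {
                    if (first.get(e.getKey()).addAll(firstOf(body, first))) changed = true;
                }
            }
        }
        return first;
    }

    public static void main(String[] args) {
        compute(exampleGrammar()).forEach((nt, set) -> System.out.println("FIRST(" + nt + ") = " + set));
    }
}
```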
Computation of FOLLOW
- For a non-terminal A, FOLLOW(A) is computed as:
– If S →* αAaβ, then FOLLOW(A) contains a -- basically FIRST(aβ).
– If S →* αABaβ and B →* ε, then FOLLOW(A) contains a.
– If S →* αA, then FOLLOW(A) contains FOLLOW(S).
– FOLLOW(S) always contains $.
- Find the FOLLOW sets of all the non-terminals in our new
favorite grammar:
E → TE'
E' → +TE' | ε
T → FT'
T' → *FT' | ε
F → (E) | id
FOLLOW(E) = {), $}
FOLLOW(E') = {), $}
FOLLOW(T) = {+, ), $}
FOLLOW(T') = {+, ), $}
FOLLOW(F) = {*, +, ), $}
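The FOLLOW sets can also be computed by iterating to a fixed point. The Java sketch below uses the usual production-by-production formulation (for A → αBβ, add FIRST(β) without ε to FOLLOW(B), and add FOLLOW(A) when β can derive ε) rather than the derivation-based rules as stated on the slide. It assumes the FIRST sets from the previous slide are given, and the names `FollowSets`, `exampleGrammar`, `exampleFirst`, `firstOf`, and `compute` are made up for the example.

```java
import java.util.*;

final class FollowSets {
    static final String EPS = "ε", END = "$";

    // The expression grammar from the slide; an empty body stands for ε.
    static Map<String, List<List<String>>> exampleGrammar() {
        Map<String, List<List<String>>> g = new LinkedHashMap<>();
        g.put("E",  List.of(List.of("T", "E'")));
        g.put("E'", List.of(List.of("+", "T", "E'"), List.<String>of()));
        g.put("T",  List.of(List.of("F", "T'")));
        g.put("T'", List.of(List.of("*", "F", "T'"), List.<String>of()));
        g.put("F",  List.of(List.of("(", "E", ")"), List.of("id")));
        return g;
    }

    // FIRST sets copied from the slide (taken as given here).
    static Map<String, Set<String>> exampleFirst() {
        return Map.of("E", Set.of("(", "id"), "E'", Set.of("+", EPS), "T", Set.of("(", "id"),
                      "T'", Set.of("*", EPS), "F", Set.of("(", "id"));
    }

    // FIRST of a string of symbols; symbols without a FIRST entry are terminals.
    static Set<String> firstOf(List<String> symbols, Map<String, Set<String>> first) {
        Set<String> result = new LinkedHashSet<>();
        for (String x : symbols) {
            Set<String> fx = first.containsKey(x) ? first.get(x) : Set.of(x);
            for (String t : fx) if (!t.equals(EPS)) result.add(t);
            if (!fx.contains(EPS)) return result;
        }
        result.add(EPS);
        return result;
    }

    static Map<String, Set<String>> compute(Map<String, List<List<String>>> g, String start,
                                            Map<String, Set<String>> first) {
        Map<String, Set<String>> follow = new LinkedHashMap<>();
        for (String nt : g.keySet()) follow.put(nt, new LinkedHashSet<>());
        follow.get(start).add(END);                    // FOLLOW(S) always contains $
        boolean changed = true;
        while (changed) {                              // iterate to a fixed point
            changed = false;
            for (Map.Entry<String, List<List<String>>> e : g.entrySet()) {
                for (List<String> body : e.getValue()) {
                    for (int i = 0; i < body.size(); i++) {
                        String x = body.get(i);
                        if (!g.containsKey(x)) continue;   // FOLLOW is for non-terminals only
                        Set<String> rest = firstOf(body.subList(i + 1, body.size()), first);
                        Set<String> toAdd = new LinkedHashSet<>(rest);
                        toAdd.remove(EPS);
                        // If the rest can vanish, whatever follows the head follows x too.
                        if (rest.contains(EPS)) toAdd.addAll(follow.get(e.getKey()));
                        if (follow.get(x).addAll(toAdd)) changed = true;
                    }
                }
            }
        }
        return follow;
    }

    public static void main(String[] args) {
        compute(exampleGrammar(), "E", exampleFirst())
            .forEach((nt, set) -> System.out.println("FOLLOW(" + nt + ") = " + set));
    }
}
```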
Predictive parsing scheme
source code → scanner → tokens → table-driven parser → IR
(the table-driven parser uses a stack and the parsing tables)
- Rather than writing recursive code, predictive parsers (and even bottom-up parsers) are table-driven.
Buildup to predictive parsing
- Removal of ambiguity.
- Elimination of left recursion.
- Left factoring.
Predictive parsing table
- The cell for non-terminal A and terminal t tells which production of A to apply when the next input token is t.
- We need to populate this table using the FIRST and the FOLLOW sets, and the algorithm on the next slide.
Non-terminal | id | + | * | ( | ) | $
E            |    |   |   |   |   |
E'           |    |   |   |   |   |
T            |    |   |   |   |   |
T'           |    |   |   |   |   |
F            |    |   |   |   |   |
Table (M) construction
- ∀ productions A → α:
– ∀ terminals a ∈ FIRST(α), add A → α to M[A, a]
– If ε ∈ FIRST(α):
   - ∀ b ∈ FOLLOW(A), add A → α to M[A, b]
   - If $ ∈ FOLLOW(A), add A → α to M[A, $]
- Set each undefined entry of M to error
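The construction rules can be sketched in Java as follows. The sketch assumes the FIRST and FOLLOW sets from the earlier slides are given (they are hard-coded below), represents M as a nested map in which a missing entry means error and an empty body means ε, and throws if two different productions land in the same cell. All names (`Ll1Table`, `build`, `GRAMMAR`, `FIRST`, `FOLLOW`) are made up for this illustration.

```java
import java.util.*;

final class Ll1Table {
    static final String EPS = "ε";

    // Grammar, FIRST and FOLLOW sets as given on the slides; an empty body stands for ε.
    static final Map<String, List<List<String>>> GRAMMAR = new LinkedHashMap<>();
    static {
        GRAMMAR.put("E",  List.of(List.of("T", "E'")));
        GRAMMAR.put("E'", List.of(List.of("+", "T", "E'"), List.<String>of()));
        GRAMMAR.put("T",  List.of(List.of("F", "T'")));
        GRAMMAR.put("T'", List.of(List.of("*", "F", "T'"), List.<String>of()));
        GRAMMAR.put("F",  List.of(List.of("(", "E", ")"), List.of("id")));
    }
    static final Map<String, Set<String>> FIRST = Map.of(
        "E", Set.of("(", "id"), "E'", Set.of("+", EPS), "T", Set.of("(", "id"),
        "T'", Set.of("*", EPS), "F", Set.of("(", "id"));
    static final Map<String, Set<String>> FOLLOW = Map.of(
        "E", Set.of(")", "$"), "E'", Set.of(")", "$"), "T", Set.of("+", ")", "$"),
        "T'", Set.of("+", ")", "$"), "F", Set.of("*", "+", ")", "$"));

    // FIRST of a production body; symbols without a FIRST entry are terminals.
    static Set<String> firstOf(List<String> symbols) {
        Set<String> result = new LinkedHashSet<>();
        for (String x : symbols) {
            Set<String> fx = FIRST.containsKey(x) ? FIRST.get(x) : Set.of(x);
            for (String t : fx) if (!t.equals(EPS)) result.add(t);
            if (!fx.contains(EPS)) return result;
        }
        result.add(EPS);
        return result;
    }

    // M[A][a] = body of the production to apply; absent entries mean "error".
    static Map<String, Map<String, List<String>>> build() {
        Map<String, Map<String, List<String>>> m = new LinkedHashMap<>();
        for (Map.Entry<String, List<List<String>>> e : GRAMMAR.entrySet()) {
            String a = e.getKey();
            Map<String, List<String>> row = new LinkedHashMap<>();
            m.put(a, row);
            for (List<String> body : e.getValue()) {
                Set<String> f = firstOf(body);
                Set<String> lookaheads = new LinkedHashSet<>(f);
                lookaheads.remove(EPS);
                // FOLLOW(A) already contains $ where applicable, covering the $ rule.
                if (f.contains(EPS)) lookaheads.addAll(FOLLOW.get(a));
                for (String t : lookaheads) {
                    List<String> old = row.put(t, body);
                    if (old != null && !old.equals(body)) {
                        throw new IllegalStateException("conflict at M[" + a + ", " + t + "]");
                    }
                }
            }
        }
        return m;
    }

    public static void main(String[] args) {
        build().forEach((nt, row) -> System.out.println(nt + ": " + row));
    }
}
```

A conflict reported by this sketch means some cell would hold two productions, which is one way to notice that a grammar is not LL(1).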
Homework
- Construct the predictive parsing table for the following grammar, and post a picture on Teams/Moodle:
S → i E t S S' | a
S' → e S | ε
E → b
Predictive Parsing
- Let’s use the table to derive:
– id + id
– +id
– id+
Non-terminal | id       | +          | *          | (        | )      | $
E            | E → TE'  |            |            | E → TE'  |        | Accept
E'           |          | E' → +TE'  |            |          | E' → ε | E' → ε
T            | T → FT'  |            |            | T → FT'  |        |
T'           |          | T' → ε     | T' → *FT'  |          | T' → ε | T' → ε
F            | F → id   |            |            | F → (E)  |        |
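To try the three inputs mechanically, here is a minimal Java sketch of a stack-based, table-driven predictive parser that uses the production entries of the table. It assumes the input is already a list of token strings, treats a missing table entry as a syntax error, and signals acceptance when the stack empties on $ rather than through a table entry. The names `PredictiveParser`, `rule`, and `parse` are made up for the example.

```java
import java.util.*;

final class PredictiveParser {
    // M[non-terminal][lookahead] = production body; an empty body stands for ε.
    static final Map<String, Map<String, List<String>>> M = new HashMap<>();

    static void rule(String nt, String lookahead, String... body) {
        M.computeIfAbsent(nt, k -> new HashMap<>()).put(lookahead, Arrays.asList(body));
    }

    static {
        rule("E", "id", "T", "E'");       rule("E", "(", "T", "E'");
        rule("E'", "+", "+", "T", "E'");  rule("E'", ")");  rule("E'", "$");
        rule("T", "id", "F", "T'");       rule("T", "(", "F", "T'");
        rule("T'", "*", "*", "F", "T'");  rule("T'", "+");  rule("T'", ")");  rule("T'", "$");
        rule("F", "id", "id");            rule("F", "(", "(", "E", ")");
    }

    static boolean parse(List<String> tokens) {
        List<String> input = new ArrayList<>(tokens);
        input.add("$");                                  // end-of-input marker
        Deque<String> stack = new ArrayDeque<>();
        stack.push("$");
        stack.push("E");                                 // start symbol on top
        int pos = 0;
        while (!stack.isEmpty()) {
            String top = stack.pop();
            String look = input.get(pos);
            if (M.containsKey(top)) {                    // non-terminal: consult the table
                List<String> body = M.get(top).get(look);
                if (body == null) return false;          // empty cell: syntax error
                for (int i = body.size() - 1; i >= 0; i--) {
                    stack.push(body.get(i));             // push body in reverse order
                }
            } else if (top.equals(look)) {               // terminal or $: must match input
                pos++;
            } else {
                return false;
            }
        }
        return pos == input.size();                      // all input, including $, consumed
    }

    public static void main(String[] args) {
        System.out.println(parse(List.of("id", "+", "id")));
        System.out.println(parse(List.of("+", "id")));
        System.out.println(parse(List.of("id", "+")));
    }
}
```

With this table, id + id is accepted, while +id fails at M[E, +] and id+ fails at M[T, $].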
LL(1) Grammars
- A grammar G is LL(1) iff for each set of productions
A → α1 | α2 | · · · | αn:
– FIRST(α1), FIRST(α2), . . . , FIRST(αn) are all disjoint
– If αi can derive ε, then FIRST(αj) and FOLLOW(A) are disjoint ∀ j ≠ i
- When is the first condition sufficient?
- Was our flagship grammar LL(1)?
LL(1) Grammars
LL(k):
– First L: left-to-right scanning
– Second L: leftmost derivation
– k: maximum lookahead
- A non-LL(1) grammar:
– S → aS | a ; because FIRST(aS) = FIRST(a) = {a}.
– S → aS' ; S' → aS' | ε accepts the same language and is LL(1).
- Some facts:
– No left-recursive grammar is LL(1)
– No ambiguous grammar is LL(1)
– Some languages have no LL(1) grammar!
Classwork
- Construct predictive parsing table for the following grammar:
- What was this grammar?
- Homework: Try constructing the table for its LL(1) equivalent.
S → i E t S S' | a
S' → e S | ε
E → b
Non-terminal | FIRST  | FOLLOW
S            | {i, a} | {e, $}
S'           | {e, ε} | {e, $}
E            | {b}    | {t}
Non-terminal | i               | t | a     | e                 | b     | $
S            | S → i E t S S'  |   | S → a |                   |       | Accept
S'           |                 |   |       | S' → e S, S' → ε  |       | S' → ε
E            |                 |   |       |                   | E → b |
Abstract Syntax Tree (AST)
- A parse tree contains a lot of information that later phases do not need.
– Eliminate intermediate nodes
– Move operators up to parent nodes
- ASTs will be useful in Assignment 1 (and also in the rest).
Parse tree for x + y * z:
expr
├── expr
│   └── id
│       └── x
├── +
└── expr
    ├── expr
    │   └── id
    │       └── y
    ├── *
    └── expr
        └── id
            └── z
Corresponding AST:
+
├── x
└── *
    ├── y
    └── z
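As a small illustration of what such a tree might look like in code, the Java sketch below builds the AST for x + y * z by hand, with operators as interior nodes and identifiers as leaves. The types `AstNode`, `AstVar`, and `AstBinOp` are hypothetical and are not the classes used in the course assignments; the evaluation method is added only to show that the tree shape encodes precedence.

```java
import java.util.Map;

interface AstNode {
    int eval(Map<String, Integer> env);   // evaluate under a variable binding
}

final class AstVar implements AstNode {
    final String name;

    AstVar(String name) { this.name = name; }

    public int eval(Map<String, Integer> env) { return env.get(name); }

    @Override public String toString() { return name; }
}

final class AstBinOp implements AstNode {
    final char op;                         // '+' or '*'
    final AstNode left, right;

    AstBinOp(char op, AstNode left, AstNode right) {
        this.op = op;
        this.left = left;
        this.right = right;
    }

    public int eval(Map<String, Integer> env) {
        int l = left.eval(env), r = right.eval(env);
        return op == '+' ? l + r : l * r;
    }

    @Override public String toString() { return "(" + left + " " + op + " " + right + ")"; }
}

final class AstDemo {
    // The AST for x + y * z: '+' at the root, '*' as its right child.
    static AstNode example() {
        return new AstBinOp('+', new AstVar("x"),
                new AstBinOp('*', new AstVar("y"), new AstVar("z")));
    }

    public static void main(String[] args) {
        AstNode ast = example();
        System.out.println(ast);                                    // (x + (y * z))
        System.out.println(ast.eval(Map.of("x", 1, "y", 2, "z", 3))); // 7
    }
}
```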
Visitor Pattern: Motivation
- Problem: We want to sum an integer list:
interface List {}

class Nil implements List {}

class Cons implements List {
    int head;
    List tail;
}
Approach 1: instanceof and typecasts
- Good: No need to touch Nil and Cons.
- Bad: Typecasts and instanceof checks.
List l; // assumed to be initialized elsewhere
int sum = 0;
boolean proceed = true;
while (proceed) {
    if (l instanceof Nil) {
        proceed = false;
    } else if (l instanceof Cons) {
        sum = sum + ((Cons) l).head;
        l = ((Cons) l).tail;
    }
}
Approach 2: Use the power of OO
- Good: No typecasts and instanceofs.
- Bad: Original classes to be recompiled for each new operation.
interface List {
    int sum();
}

class Nil implements List {
    public int sum() {
        return 0;
    }
}

class Cons implements List {
    int head;
    List tail;

    public int sum() {
        return head + tail.sum();
    }
}
Approach 3: Visitor Pattern
- Divide code into an object structure and a Visitor.
- Insert an accept method in each class. Each accept method
takes a Visitor as an argument.
- A Visitor contains a visit method for each class.
interface List {
    void accept(Visitor v);
}

interface Visitor {
    void visit(Nil x);
    void visit(Cons x);
}
Approach 3: Visitor Pattern (Cont.)
- The purpose of accept methods is to invoke the visit method in
the Visitor that can handle the current object.
class Nil implements List {
    public void accept(Visitor v) {
        v.visit(this);
    }
}

class Cons implements List {
    int head;
    List tail;

    public void accept(Visitor v) {
        v.visit(this);
    }
}
Approach 3: Visitor Pattern (Cont.)
- The control flows back and forth between the visit methods in the
Visitor and the accept methods in the object structure.
class SumVisitor implements Visitor {
    int sum = 0;

    public void visit(Nil x) {}

    public void visit(Cons x) {
        sum = sum + x.head;
        x.tail.accept(this);
    }
}

List l;
...
SumVisitor sv = new SumVisitor();
l.accept(sv);
System.out.println(sv.sum);