Compiling Techniques
Lecture 5: Introduction to Parsing
Christophe Dubach
Overview
Context Free Grammars
Derivations and Parse Trees
Ambiguity
Top-Down Parsing
Left Recursion
Front End: Parser
Checks the stream of words and their parts of speech (produced by the scanner) for grammatical correctness
Determines if the input is syntactically well formed
Guides checking at deeper levels than syntax
Builds an IR representation of the code
Think of this as the mathematics of diagramming sentences
[Figure: Source code → Scanner → tokens → Parser → IR, with errors reported along the way]
Parsing is the process of discovering a derivation for some sentence
Need a mathematical model of syntax: a grammar G
Need an algorithm for testing membership in L(G)
Need to keep in mind that our goal is building parsers, not studying the mathematics of arbitrary languages
Roadmap
Context-free grammars and derivations
Top-down parsing: LL(1) == Left-to-right scan, Leftmost derivation, 1 token of lookahead
Bottom-up parsing: operator precedence parser; LR(1) == Left-to-right scan, Rightmost derivation (in reverse), 1 token of lookahead
Context-free syntax is specified with a grammar
The SheepNoise grammar below defines the set of noises that a sheep makes under normal circumstances
It is written in a variant of Backus–Naur Form (BNF)
Formally, a grammar G = (S, N, T, P)
S is the start symbol
N is a set of non-terminal symbols
T is a set of terminal symbols or words
P is a set of productions or rewrite rules (P : N → (N ∪ T)*)
1  SheepNoise → SheepNoise baa
2             | baa
We can use the SheepNoise grammar to create sentences: use the productions as rewriting rules
And so on ...
While it is cute, this example quickly runs out of intellectual steam ...
Such a sequence of rewrites is called a derivation
The process of discovering a derivation is called parsing
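The rewriting process can be sketched in a few lines of Python. This is only an illustration: the PRODUCTIONS table and the derive function are hypothetical names, and the encoding (a list of symbols per right-hand side) is an assumption, not something taken from the slides.

```python
# Hypothetical encoding of the SheepNoise grammar: each production is a
# (left-hand side, right-hand side) pair, numbered as in the grammar.
PRODUCTIONS = {
    1: ("SheepNoise", ["SheepNoise", "baa"]),
    2: ("SheepNoise", ["baa"]),
}


def derive(start, rule_numbers):
    """Apply the numbered productions in order, always rewriting the
    leftmost occurrence of the production's left-hand side.
    Returns every sentential form of the derivation."""
    form = [start]
    steps = [form]
    for number in rule_numbers:
        lhs, rhs = PRODUCTIONS[number]
        i = form.index(lhs)  # raises ValueError if lhs is not in the form
        form = form[:i] + rhs + form[i + 1:]
        steps.append(form)
    return steps


# Rules 1, 1, 2 rewrite SheepNoise into three baa's.
for step in derive("SheepNoise", [1, 1, 2]):
    print(" ".join(step))
```

Applying rule 1 n times and then rule 2 once yields n + 1 copies of baa, which is why the grammar describes every non-empty sequence of baa's.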
1  Expr → Expr Op Expr
2       | num
3       | id
4  Op   → +
5       | -
6       | *
7       | /
this derivation represents x - 2 * y
At each step, we choose a non-terminal to replace
Different choices can lead to different derivations
Two derivations are of interest
Leftmost derivation: replace leftmost NT at each step
Rightmost derivation: replace rightmost NT at each step
These are the two systematic derivations (we don’t care about randomly-ordered derivations!)
The example on the preceding slide was a leftmost derivation
Of course, there is also a rightmost derivation
Interestingly, it turns out to be different
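As a rough sketch of the two systematic derivations, the Python below replays a sequence of production numbers against the expression grammar, rewriting either the leftmost or the rightmost non-terminal. The names PRODUCTIONS, NON_TERMINALS and derive are illustrative, and the two rule sequences are one reconstruction that yields id - num * id; they are an assumption, not copied from the slides.

```python
# The expression grammar, numbered 1-7 (assumed encoding).
PRODUCTIONS = {
    1: ("Expr", ["Expr", "Op", "Expr"]),
    2: ("Expr", ["num"]),
    3: ("Expr", ["id"]),
    4: ("Op", ["+"]),
    5: ("Op", ["-"]),
    6: ("Op", ["*"]),
    7: ("Op", ["/"]),
}
NON_TERMINALS = {"Expr", "Op"}


def derive(rule_numbers, leftmost=True):
    """Replay the productions, rewriting the leftmost (or rightmost)
    non-terminal at each step. Returns the list of sentential forms."""
    form = ["Expr"]
    steps = [" ".join(form)]
    for number in rule_numbers:
        lhs, rhs = PRODUCTIONS[number]
        positions = [i for i, sym in enumerate(form) if sym in NON_TERMINALS]
        i = positions[0] if leftmost else positions[-1]
        if form[i] != lhs:
            raise ValueError(f"rule {number} does not apply to {form[i]}")
        form = form[:i] + rhs + form[i + 1:]
        steps.append(" ".join(form))
    return steps


left = derive([1, 3, 5, 1, 2, 6, 3], leftmost=True)
right = derive([1, 3, 6, 1, 2, 5, 3], leftmost=False)
print("\n".join(left))
print()
print("\n".join(right))
```

Both runs end in the same sentence, but the intermediate sentential forms differ, and so does the shape of the tree each derivation builds.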
[Tables: the leftmost derivation and the rightmost derivation, side by side]
In both cases, the result is id – num * id
The two derivations produce different parse trees
The parse trees imply different evaluation orders!
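Leftmost derivation
[Parse tree: G → E; E → E Op E, where the left E is x, Op is –, and the right E expands to E Op E (2 * y)]
This evaluates as x – ( 2 * y )
Rightmost derivation
[Parse tree: G → E; E → E Op E, where the left E expands to E Op E (x – 2), Op is *, and the right E is y]
This evaluates as ( x – 2 ) * y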
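To make the difference concrete, here is a small Python sketch that evaluates both tree shapes bottom-up. The nested-tuple encoding of the trees and the sample values x = 10, y = 3 are assumptions chosen for illustration.

```python
import operator

OPS = {"-": operator.sub, "*": operator.mul}


def evaluate(tree, env):
    """Evaluate a parse tree bottom-up. A tree is either a
    (left, op, right) tuple, a variable name, or a number."""
    if isinstance(tree, tuple):
        left, op, right = tree
        return OPS[op](evaluate(left, env), evaluate(right, env))
    if isinstance(tree, str):
        return env[tree]
    return tree


leftmost_tree = ("x", "-", (2, "*", "y"))    # x - (2 * y)
rightmost_tree = (("x", "-", 2), "*", "y")   # (x - 2) * y

env = {"x": 10, "y": 3}
print(evaluate(leftmost_tree, env))   # 10 - (2 * 3) = 4
print(evaluate(rightmost_tree, env))  # (10 - 2) * 3 = 24
```

The same token sequence gives two different values, so a compiler cannot leave the choice of tree to chance.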
These two derivations point out a problem with the grammar: it has no notion of precedence, or implied order of evaluation
To add precedence:
Create a non-terminal for each level of precedence
Isolate the corresponding part of the grammar
Force the parser to recognise high precedence subexpressions first
For algebraic expressions:
Multiplication and division first (level one)
Subtraction and addition next (level two)
The resulting grammar, shown below, is slightly larger
Takes more rewriting to reach some of the terminal symbols
Produces the same parse tree under leftmost & rightmost derivations
Let’s see how it parses x - 2 * y
1  Goal   → Expr
2  Expr   → Expr + Term        (level two)
3         | Expr - Term
4         | Term
5  Term   → Term * Factor      (level one)
6         | Term / Factor
7         | Factor
8  Factor → number
9         | id
The rightmost derivation and its parse tree
[Parse tree: G → E; E → E – T; the left E → T → F → <id,x>; the right T → T * F, with T → F → <num,2> and F → <id,y>]
This produces x – ( 2 * y ), along with an appropriate parse tree.
Both the leftmost and rightmost derivations give the same expression, because the grammar directly encodes the desired precedence.
Our original expression grammar had other problems
A leftmost derivation can make a different choice than the first time
The difference:
Different productions chosen on the second step (original choice vs. new choice)
Both derivations succeed in producing x - 2 * y
If a grammar has more than one leftmost derivation for a single sentential form, the grammar is ambiguous
If a grammar has more than one rightmost derivation for a single sentential form, the grammar is ambiguous
The leftmost and rightmost derivations for a sentential form may differ, even in an unambiguous grammar
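Each parse tree corresponds to exactly one leftmost derivation, so counting parse trees is one way to test a sentence for ambiguity. The Python sketch below does this for the original grammar (Expr → Expr Op Expr | num | id); the function name count_parse_trees and the token spelling are assumptions made for the example.

```python
from functools import lru_cache

OPERANDS = {"num", "id"}
OPERATORS = {"+", "-", "*", "/"}


def count_parse_trees(tokens):
    """Count the parse trees of `tokens` under
    Expr -> Expr Op Expr | num | id."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(lo, hi):
        # Number of ways Expr can derive tokens[lo:hi].
        if hi - lo == 1:
            return 1 if tokens[lo] in OPERANDS else 0
        total = 0
        # Try every operator as the root of Expr -> Expr Op Expr.
        for k in range(lo + 1, hi - 1):
            if tokens[k] in OPERATORS:
                total += count(lo, k) * count(k + 1, hi)
        return total

    return count(0, len(tokens))


print(count_parse_trees(["id", "-", "num", "*", "id"]))
```

Any count above one means the sentence has more than one leftmost derivation, so the grammar is ambiguous; for id - num * id the two trees are the ones discussed earlier.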
Classic example: the if-then-else problem
This ambiguity is entirely grammatical in nature
1  Stmt → if Expr then Stmt
2       | if Expr then Stmt else Stmt
3       | OtherStmt
This sentential form has two derivations:
if E1 then if E2 then S1 else S2
Production 2, then production 1: the else belongs to the outer if, as in if E1 then ( if E2 then S1 ) else S2
Production 1, then production 2: the else belongs to the inner if, as in if E1 then ( if E2 then S1 else S2 )
Removing the ambiguity
Must rewrite the grammar to avoid generating the problem
Match each else to the innermost unmatched if (common sense rule)
Intuition: in a NoElse, the last if of the cascaded else-if chain never has an else
With this grammar, the example has only one derivation
1  Stmt     → WithElse
2           | NoElse
3  WithElse → if Expr then WithElse else WithElse
4           | OtherStmt
5  NoElse   → if Expr then Stmt
6           | if Expr then WithElse else NoElse
if E1 then if E2 then S1 else S2
This binds the else controlling S2 to the inner if
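The innermost-if rule can also be sketched as a small hand-written parser. Note that this Python example does not implement the WithElse/NoElse grammar directly; it is a greedy recursive-descent routine (a swapped-in technique) that attaches each else to the nearest unmatched if, which gives the same binding. The function name parse_stmt and the single-token Expr and OtherStmt are simplifying assumptions.

```python
def parse_stmt(tokens, pos=0):
    """Parse one statement starting at tokens[pos].
    Returns (tree, next_pos). An if-statement becomes
    ("if", cond, then_branch, else_branch_or_None)."""
    if tokens[pos] != "if":
        return tokens[pos], pos + 1          # OtherStmt: a single token here
    cond = tokens[pos + 1]                   # Expr: a single token here
    if tokens[pos + 2] != "then":
        raise SyntaxError("expected 'then'")
    then_branch, pos = parse_stmt(tokens, pos + 3)
    else_branch = None
    # Greedy choice: the innermost open if takes the else first.
    if pos < len(tokens) and tokens[pos] == "else":
        else_branch, pos = parse_stmt(tokens, pos + 1)
    return ("if", cond, then_branch, else_branch), pos


tree, _ = parse_stmt("if E1 then if E2 then S1 else S2".split())
print(tree)
```

The outer if ends up with no else branch, and S2 sits inside the inner if, matching the single derivation allowed by the rewritten grammar.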
Ambiguity usually refers to confusion in the CFG (Context-Free Grammar)
Consider the following case: a = f(17)
In Algol-like languages, f could be either a function or an array
In such cases, context is required
Need to track declarations
Really a type issue, not context-free syntax
Requires an extra-grammatical solution (not in the CFG)
Must handle these with a different mechanism
Step outside the grammar rather than making it more complex
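Ambiguity arises from two distinct sources
Confusion in the context-free syntax (the if-then-else problem)
Confusion that requires context to resolve (a = f(17))
Resolving ambiguity
To remove context-free ambiguity, rewrite the grammar
Handling context-sensitive ambiguity takes more than the grammar:
→ Knowledge of declarations, types, …
→ Accept a superset of L(G) & check it by other means
→ This is a language design problem
Sometimes, the compiler writer accepts an ambiguous grammar
→ Parsing techniques that “do the right thing”
→ i.e., always select the same derivation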
Top-down parsers (LL(1), recursive descent)
Start at the root of the parse tree and grow toward leaves
Pick a production & try to match the input
Bad “pick” ⇒ may need to backtrack
Some grammars are backtrack-free (predictive parsing)
Bottom-up parsers (LR(1), operator precedence)
Start at the leaves and grow toward root
As input is consumed, encode possibilities in an internal state
Start in a state valid for legal first tokens
Bottom-up parsers handle a large class of grammars
A top-down parser starts with the root of the parse tree
The root node is labelled with the goal symbol of the grammar
Top-down parsing algorithm:
Construct the root node of the parse tree
Repeat until the fringe of the parse tree matches the input string
1. At a node labelled A, select a production with A on its lhs and, for each symbol on its rhs, construct the appropriate child
2. When a terminal symbol is added to the fringe and it doesn’t match the input, backtrack
3. Find the next node to be expanded (label ∈ NT)
→ The choice of production in step 1 should be guided by the input string
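The loop above can be sketched as a backtracking recogniser in Python. Several things here are assumptions rather than part of the lecture: the dictionary encoding of the grammar, the function name parse, and in particular the length check, which is added so that the search terminates on this left-recursive grammar. The check is sound only because no production in this grammar derives the empty string, so a fringe with more symbols than remaining tokens can never match.

```python
# Expression grammar with precedence (assumed encoding).
GRAMMAR = {
    "Goal": [["Expr"]],
    "Expr": [["Expr", "+", "Term"], ["Expr", "-", "Term"], ["Term"]],
    "Term": [["Term", "*", "Factor"], ["Term", "/", "Factor"], ["Factor"]],
    "Factor": [["number"], ["id"]],
}


def parse(fringe, tokens):
    """Return True if the symbols in `fringe` can derive exactly `tokens`.
    Productions are tried in order; a failed attempt returns False,
    which backtracks to the most recent untried choice."""
    if not fringe:
        return not tokens
    if len(fringe) > len(tokens):
        return False  # pruning assumption: every symbol yields >= 1 token
    head, rest = fringe[0], fringe[1:]
    if head not in GRAMMAR:
        # Terminal on the fringe: it must match the next input token.
        return head == tokens[0] and parse(rest, tokens[1:])
    # Non-terminal: expand it with each production in turn.
    return any(parse(body + rest, tokens) for body in GRAMMAR[head])


print(parse(["Goal"], ["id", "-", "number", "*", "id"]))
```

Because productions are tried in a fixed order with no regard to the input, the recogniser explores many dead ends before it succeeds, which is exactly the cost that guiding the choice by the input is meant to avoid.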
Let’s try x – 2 * y, using a leftmost derivation and choosing productions in an order that exposes problems
[Parse tree so far: Goal → Expr; Expr → Expr + Term; the inner Expr → Term → Fact. → <id,x>]
1  Goal   → Expr
2  Expr   → Expr + Term
3         | Expr - Term
4         | Term
5  Term   → Term * Factor
6         | Term / Factor
7         | Factor
8  Factor → number
9         | id
This worked well, except that “–” doesn’t match “+”
The parser must backtrack to the point where it expanded Expr with Expr + Term
Continuing with x – 2 * y:
[Parse tree so far: Goal → Expr; Expr → Expr – Term; the inner Expr → Term → Fact. → <id,x>]
This time, “–” and “–” matched
We can advance past “–” to look at “2”
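Trying to match the “2” in x – 2 * y:
[Parse tree so far: Goal → Expr; Expr → Expr – Term; the inner Expr → Term → Fact. → <id,x>; the right Term → Fact. → <num,2>]
Where are we?
“2” matches, but input remains (* y) and there are no non-terminals left to expand
⇒ Need to backtrack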
Trying again with “2” in x – 2 * y:
[Parse tree: Goal → Expr; Expr → Expr – Term; the inner Expr → Term → Fact. → <id,x>; the right Term → Term * Fact., with Term → Fact. → <num,2> and Fact. → <id,y>]
This time, we matched & consumed all the input ⇒ Success!
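Left Recursion
Formally, a grammar is left recursive if ∃ A ∈ NT such that ∃ a derivation A ⇒+ Aα, for some string α ∈ (NT ∪ T)+
Our expression grammar is left recursive
A top-down parser that keeps expanding the leftmost non-terminal with a left-recursive production never consumes input, so it may not terminate
Non-termination is a bad property in any part of a compiler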
To remove left recursion, we can transform the grammar
Consider a grammar fragment of the form
Fee → Fee β
    | α
where neither α nor β start with Fee
We can rewrite this as
Fee → α Faa
Faa → β Faa
    | ε
where Faa is a new non-terminal
This accepts the same language, but uses only right recursion
Exercise: eliminate left recursion from the previous grammar
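The rewrite can be expressed as a short Python function. This is a minimal sketch that handles only immediate left recursion (a production whose right-hand side starts with its own left-hand side); the function name, the list-of-symbols encoding, the empty list standing for ε, and the new name Expr' are all assumptions made for illustration.

```python
def eliminate_immediate_left_recursion(name, alternatives, new_name):
    """Rewrite  name -> name beta | alpha  as
         name     -> alpha new_name
         new_name -> beta new_name | epsilon
    Each alternative is a list of symbols; [] stands for epsilon."""
    betas = [alt[1:] for alt in alternatives if alt and alt[0] == name]
    alphas = [alt for alt in alternatives if not alt or alt[0] != name]
    if not betas:
        return {name: alternatives}  # nothing to do
    return {
        name: [alpha + [new_name] for alpha in alphas],
        new_name: [beta + [new_name] for beta in betas] + [[]],
    }


expr = [["Expr", "+", "Term"], ["Expr", "-", "Term"], ["Term"]]
rewritten = eliminate_immediate_left_recursion("Expr", expr, "Expr'")
for lhs, alts in rewritten.items():
    for alt in alts:
        print(lhs, "->", " ".join(alt) if alt else "ε")
```

Applied to the Expr productions, the result uses only right recursion; the same call with Term and a new name such as Term' would treat the second level of the grammar.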
LL(1) Property
Table-driven LL(1) parsers