Compiler Design Spring 2018 3.0 Frontend, Thomas R. Gross, Computer Science Department, ETH Zurich, Switzerland (PowerPoint PPT presentation)



SLIDE 1

Compiler Design

Spring 2018

3.0 Frontend


Thomas R. Gross Computer Science Department ETH Zurich, Switzerland

SLIDE 2

Admin issues

§ Recitation sessions take place only when announced

§ In the lecture / on course website / on the mailing list

§ No recitation session this week
§ Next recitation session

§ March 15, 2018 @ 15:00
§ ETF E1 (tentative)

SLIDE 3

Compiler model


[Diagram: Source program → “Front-end” → IR → Optimizer → “Back-end” → ASM file]

Question: How to build IR (tree)?

SLIDE 4

Overview

§ 3.1 Introduction
§ 3.2 Lexical analysis
§ 3.3 “Top down” parsing
§ 3.4 “Bottom up” parsing

SLIDE 5

3.1 Introduction

§ Frontend responsible to turn input program into IR

§ Input: Usually a string of ASCII or Unicode characters
§ IR: As required by later stages of the compiler

§ Frontend divided into

§ Lexical analysis – deals with reading the input program

§ Also known as scanning
§ Scanner, Lexer

§ Syntactic analysis – understand structure of the input program

§ Also known as parsing
§ Parser

SLIDE 6

3.1 Introduction (cont’d)

§ Good news: Syntactic and lexical analysis well understood

§ Good theory and books, e.g., Aho et al., Chapters 2 (in part), 3, and 4
§ Good tools

§ Bad news: Even good tools may be painful to use

§ Good == powerful
§ Many options
§ Still can’t handle all possible languages
§ May give cryptic error messages

SLIDE 7

3.1 Introduction (cont’d)

§ Need to understand theory to use tool

§ Same theory that allows building tool
§ Tools made hand-crafted frontends obsolete
§ Frontend tools used for other domains

SLIDE 8

Languages

§ Frontend processes input program
§ Need a way to describe what input is allowed

§ Formal languages
§ Well-researched area
§ First part of compilers supported by tools

§ In this lecture: brief review

§ Aho et al. covers topic in more depth
§ Focus on essentials

§ (Speed an issue in real life)

§ Theory behind tools

SLIDE 9

Languages: Grammar

§ Grammars provide a set of rules to generate “strings”
§ A grammar consists of

§ Terminals: a, b, c, …
§ Non-terminals: X, Y, Z, …
§ Set of productions
§ Start symbol: S

§ Some terminology

§ Terminal symbols: Sometimes called characters or tokens
§ Non-terminal symbols: Also called syntactic variables
§ String: Sequence of symbols from some alphabet

§ Other terms: Word, sentence

SLIDE 10

Productions

§ General form

§ Left-hand side → Right-hand side
§ LHS → RHS (for short)
§ LHS, RHS: Strings over alphabets of terminal and non-terminal symbols

§ Example: Grammar G1

S → aBa
S → aXa
Xb → Xbc | c
Ba → aBa | b

§ How does a grammar generate a language (known as L(G))?

§ Using the grammar G1 as an example

SLIDE 11

L(G)

§ From production to derivation

Given
§ w -- a word over (T ∪ NT)
§ α, β, γ -- words over (T ∪ NT)
§ (α, β, γ may be empty)

s.t. w = α β γ and P a production β → δ. We say that w’ = α δ γ is derived from w, i.e., w ⇒ w’.

§ Example derivation (with G1)

§ S ⇒ aBa ⇒ aaBa ⇒ aab

§ L(G1) = { a^n b | n ≥ 1 }


S → aBa
S → aXa
Xb → Xbc | c
Ba → aBa | b
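The rewriting relation w ⇒ w’ can be tried out directly. A minimal sketch (my own code, not from the slides): breadth-first string rewriting over G1’s productions, which confirms that only the words a^n b (n ≥ 1) survive, up to a length bound.

```python
# String rewriting for grammar G1 (uppercase letters = non-terminals).
# Note that Xb -> ... and Ba -> ... have more than one symbol on the
# LHS, so plain string rewriting (not a CFG parser) is the right model.
from collections import deque

PRODUCTIONS = [
    ("S", "aBa"), ("S", "aXa"),
    ("Xb", "Xbc"), ("Xb", "c"),
    ("Ba", "aBa"), ("Ba", "b"),
]

def derive_words(start="S", max_len=5):
    """All terminal words of length <= max_len derivable from start."""
    words, seen, todo = set(), {start}, deque([start])
    while todo:
        w = todo.popleft()
        if w.islower():                  # no non-terminal left: w is a word of L(G1)
            if len(w) <= max_len:
                words.add(w)
            continue
        for lhs, rhs in PRODUCTIONS:     # rewrite every occurrence of lhs
            i = w.find(lhs)
            while i != -1:
                w2 = w[:i] + rhs + w[i + len(lhs):]
                if len(w2) <= max_len + 2 and w2 not in seen:
                    seen.add(w2)
                    todo.append(w2)
                i = w.find(lhs, i + 1)
    return sorted(words, key=len)

print(derive_words())   # ['ab', 'aab', 'aaab', 'aaaab']
```

Running it also shows the dead end: aXa is reachable but contains a non-terminal that can never be rewritten, so it contributes nothing to L(G1).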

SLIDE 12

L(G)

§ L(G) = set of strings w such that

§ w consists only of symbols from the set of terminals
§ There exists a sequence of productions P1, P2, … Pn such that S ⇒ RHS1 (by P1) ⇒ … ⇒ w (by Pn)
§ In other words: there exists a derivation S ⇒ … ⇒ w, written S ⇒* w for short

SLIDE 13

Productions, 2nd look

§ No constraints on LHS, RHS
§ Some RHS could be dead-end street

S → aXa
Xb → …

§ Remove dead-end streets
§ Updated grammar G1’

S → aBa
Ba → aBa | b

SLIDE 14

Productions, 3rd look

§ We care about L(G) – prune productions that do not contribute
§ Restrictions on LHS

§ Only a single non-terminal is allowed on the left-hand side
§ For example: A → α
§ “Context-free” grammar or Type-2 grammar

§ Context-free grammars important

§ Efficient analysis techniques known
§ From now on only context-free grammars unless noted

SLIDE 15

Regular and linear grammars

§ Linear grammar: Context-free, at most 1 NT in the RHS
§ Left-linear grammar: Linear, NT appears at left end of RHS
§ Right-linear grammar: Linear, NT appears at the right end of RHS
§ Regular grammar: Either right-linear or left-linear
§ Regular grammars generate regular languages

§ Could also be described by regular expressions
§ Can be recognized by a Deterministic Finite Automaton
§ Type-3 grammar

SLIDE 16

Special cases

§ ∅ – a language (but not an interesting one)
§ ε – the empty string

§ Must use a symbol so that we can see it

§ Can be the RHS

§ A → ε

SLIDE 17

3.1 Introduction

§ So far: Brief summary of grammars
§ Using multiple grammars to save work
§ Properties of derivations
§ Parse trees
§ Properties of grammars

§ Detect ambiguity
§ Avoid ambiguity

SLIDE 18

3.1.1 Example grammar G2

§ Start symbol: S
§ Terminals: { a, b, …, z, +, -, *, /, ( , ) }
§ Non-Terminals: { S, E, Op, Id, L, M, N }
§ Productions

S → E
E → E Op E | - E | ( E ) | Id
Op → + | - | * | /
Id → L M
L → a | b | ... | z
M → L M | N M | ε
N → 0 | 1 | ... | 9


Note: ε-production allows us to make M “disappear”

S ⇒ E ⇒ Id ⇒ L M ⇒ L L M ⇒ a L M ⇒ ap M ⇒ ap

SLIDE 19

Parsing

§ Given G and a word w ∈ T*: we want to know if “w ∈ L(G)?”
§ Analysis problem

§ Answer is either YES or NO
§ ap ∈ L(G2)
§ ap + bp ∈ L(G2)
§ ap++ ∉ L(G2)

§ For YES we need to find a sequence of productions so that S ⇒ … … ⇒ w

§ (or S ⇒* w for short)

SLIDE 20

w = a3 + b

§ Derivation

S ⇒ E ⇒ E Op E ⇒ E Op Id ⇒ E + Id ⇒ Id + Id ⇒ Id + LM ⇒ Id + L ⇒ Id + b ⇒ LM + b ⇒ a M + b ⇒ a N M + b ⇒ a3 M + b ⇒ a3 + b

SLIDE 21

Comments

§ If a string w contains multiple non-terminals we have a choice when expanding w ⇒ w’

§ Grammars that are context-free and without useless non-terminals: must have a production for each non-terminal in w
§ Assume A, B ∈ NT, A → α, B → β are productions P1, P2
§ w = δ A τ B γ
§ Choice #1: w1 = δ α τ B γ
§ Choice #2: w2 = δ A τ β γ
§ (Both w ⇒ w1 and w ⇒ w2 are possible)

SLIDE 22

More comments

§ Question: Does the choice influence L(G)?

§ Or, is (w1 ⇒* x ∈ L(G)) ⇔ (w2 ⇒* x ∈ L(G))
§ Answer: choice does not matter for context-free grammars

§ How to decide which production to pick?

§ Everything worked out in the example

§ We’ve always picked the right production
§ Found w = a3 + b

§ Later more…

SLIDE 23

More comments

§ Part of the derivation is pretty boring

§ Do we care about exact steps to generate identifier “a3”?
§ Details (not always) important

SLIDE 24

3.1.1 Example grammar G2

§ Start symbol: S
§ Terminals: { a, b, …, z, +, -, *, /, ( , ) }
§ Non-Terminals: { S, E, Op, Id, L, M, N }
§ Productions

S → E
E → E Op E | - E | ( E ) | Id
Op → + | - | * | /
Id → L M
L → a | b | ... | z
M → L M | N M | ε
N → 0 | 1 | ... | 9

SLIDE 25

More comments

§ Part of the derivation is pretty boring

§ Do we care about exact steps to generate identifier “a3”?
§ Details (not always) important

§ Can we find a better way to deal with this aspect?

§ Better: Simpler
§ Better: Maybe also more efficient

SLIDE 26

SLIDE 27

Regular expressions

§ Idea: Use regular expression to capture “uninteresting” part of a grammar

§ Here: Exact rules for identifier names
§ Replace part of grammar G2

…
Id → L M
L → a | b | ... | z
M → L M | N M | ε
N → 0 | 1 | ... | 9

§ Regular expressions recognized by Finite State Machines

§ Either a Deterministic Finite Automaton (DFA)
§ Or a Nondeterministic Finite Automaton (NFA)

SLIDE 28

Token

§ Idea: Introduce grammar symbol that represents string described by regular expression

§ Terminal for the grammar
§ Rules/production to generate regular expression string

§ When looking for a derivation identify strings that can be described by regular expression

§ “Token”
§ Example: a3 + b Tokens: Id (“a3”) + Id (“b”)
§ Chunks of the input stream
§ More in 3.2 Lexical analysis

SLIDE 29

Examples

§ a3 + b … really … Id(“a3”) + Id(“b”)
§ z * u + x … really … Id(“z”) * Id(“u”) + Id(“x”)

§ Id * Id + Id ∈ L(G2)

§ Treat terminals the same way

§ Id(“z”) Term(“*”) Id(“u”) Term(“+”) Id(“x”)

SLIDE 30

3.1.2 Simplified grammar G3

§ Start symbol: S
§ Terminals: { a, b, …, z, +, -, *, /, ( , ), Id }
§ Non-Terminals: { S, E, Op, Id, L, M, N }
§ Productions and regular definitions

S → E
E → E Op E | - E | ( E ) | Id
Op → + | - | * | /
Id: L { L | N }*


L = { a | b | c | … | z } N = { 0 | 1 | 2 | … | 9 }
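A sketch of how a scanner could realize these regular definitions (my own code; the token names Id/Op/Paren are assumptions, and Python’s re module stands in for the DFA the slides mention):

```python
# Tokenizer for grammar G3's regular definitions:
#   Id: L { L | N }*   with L = a..z, N = 0..9
#   Op: single-character operators + - * /
import re

TOKEN_RE = re.compile(r"""
      (?P<Id>[a-z][a-z0-9]*)     # Id: L { L | N }*
    | (?P<Op>[-+*/])             # Op: + - * /
    | (?P<Paren>[()])
    | (?P<Skip>\s+)              # whitespace between tokens
""", re.VERBOSE)

def tokenize(text):
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if not m:
            raise SyntaxError(f"bad character at {pos}: {text[pos]!r}")
        if m.lastgroup != "Skip":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

print(tokenize("a3 + b"))  # [('Id', 'a3'), ('Op', '+'), ('Id', 'b')]
```

The scanner hands the parser the token stream Id + Id; the spelling “a3” stays attached to the Id token, exactly the division of labor the next slides describe.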

SLIDE 31

More simplifications?

§ Can grammar G3 be simplified even further?
§ Are there other productions we can replace with a regular expression?
§ Productions

S → E
E → E Op E | - E | ( E ) | Id
Id: L { L | N }*
L = { a | b | c | … | z }
N = { 0 | 1 | 2 | … | 9 }
Op → + | - | * | /

§ Could treat Op the same way

Op: { + | - | * | / }

SLIDE 32

Simplified grammar G4

§ Start symbol: S
§ Terminals: { a, b, …, z, +, -, *, /, ( , ), Id }
§ Non-Terminals: { S, E, Op }

§ Productions and regular definitions

S → E (1)
E → E Op E (2)
  | - E (3)
  | ( E ) (4)
  | Id (5)
Op → + | - | * | / (6)
Id: L { L | N }*
L = { a | b | c | … | z }, N = { 0 | 1 | 2 | … | 9 }

SLIDE 33

w = a 3 + b

Please take a piece of paper and find a derivation for “a 3 + b”. (Raise your hand when you are done.) Compare your solution with your neighbor’s solution.

§ Do you start with the same production?
§ Do you use the same production in the 2nd step?

SLIDE 34

w = a3 + b

§ Some example derivations
§ Derivation #1

S ⇒ E ⇒ E Op E ⇒ Id Op E ⇒ Id + E ⇒ Id + Id

§ Derivation #2

S ⇒ E ⇒ E Op E ⇒ E Op Id ⇒ E + Id ⇒ Id + Id

§ More?

SLIDE 35

Looking at derivations

§ a3 + b ∈ L(G4), i.e., a3 + b is a legal program

§ At least according to grammar G4
§ Are we done?

§ More analysis is needed

§ Looking at derivation helps start analysis
§ Derivations may provide information on structure

SLIDE 36

Questions for derivations

§ Does the order of applying productions matter?
§ Are derivations unique?

§ How do we compare derivations?

SLIDE 37

Choice of non-terminal in derivation step

§ Given w = δ A τ B γ (with A, B ∈ NT and productions A → α, B → β)
§ Two choices

§ w ⇒ δ α τ B γ
§ w ⇒ δ A τ β γ

SLIDE 38

Many options

§ In Derivation #1: Always the left-most non-terminal is picked for replacement

§ “Left-most” derivation

§ In Derivation #2: Always the right-most non-terminal is picked for replacement

§ “Right-most” derivation

§ No influence on L(G)

§ But useful to distinguish “different” derivations
§ Intuitively: Different derivations might convey different “meaning”

SLIDE 39

Derivations

§ Given a grammar G with productions Pi
§ Consider two derivations D1 = Pa Pb Pc … Pn and D2 = P’a P’b P’c … P’n

§ Pj, P’k productions, applied as intended

§ Are D1 and D2 the same?

§ Again (intuitively): Do they ”mean” the same?

SLIDE 40

Derivations

§ Question: Are D1 and D2 both right-most derivations (or both left-most derivations)?

§ YES: if Pm = P’m for all 1 ≤ m ≤ n
§ NO: We can’t easily compare

§ Later more (parse trees)

§ Looking at right-most (or left-most) derivations allows us to compare derivations

§ Different derivations don’t always matter
§ … but sometimes they do (more later)

SLIDE 41

Parse tree

§ Want to identify structure expressed by derivation

§ Compare two derivations that are not both right-most (or both left-most) derivations

§ Summary of derivation

§ Ignore the order of applying productions
§ Leaves: Terminals
§ Interior nodes: Application of a production

SLIDE 42

Parse tree construction

§ How to construct parse tree?
§ Induction

§ Given derivation α1 ⇒ α2 ⇒ … ⇒ αi ⇒ αi+1 ⇒ … ⇒ αn

§ Step 1: Construct tree for α1

§ Really tree for A = α1
§ Single node labeled A

§ Step i: Assume tree for α1 ⇒ α2 ⇒ … ⇒ αi already constructed

§ αi = X1 X2 … Xj … Xk
§ Assume Xj → β = Y1 … Ym leads to αi ⇒ αi+1
§ Take tree built for α1 ⇒ α2 ⇒ … ⇒ αi
§ Find the j-th leaf in this tree – this is labeled Xj
§ Add m new children (all leaves), labeled Y1 … Ym
§ Special case: m = 0, i.e. β = ε
§ Add one child with label ε

SLIDE 43

Example: Constructing a parse tree

§ Derivation #1 (left-most derivation)

S ⇒ E ⇒ E Op E ⇒ Id Op E ⇒ Id + E ⇒ Id + Id

[parse tree so far: a single node labeled S]

SLIDE 44

Example: Constructing a parse tree

§ Derivation #1 (left-most derivation)

S ⇒ E ⇒ E Op E ⇒ Id Op E ⇒ Id + E ⇒ Id + Id

[parse tree so far: S with child E]

S → E

SLIDE 45

Example: Constructing a parse tree

§ Derivation #1 (left-most derivation)

S ⇒ E ⇒ E Op E ⇒ Id Op E ⇒ Id + E ⇒ Id + Id

[parse tree so far: S, child E, with children E Op E]

S → E
E → E Op E

SLIDE 46

Example: Constructing a parse tree

§ Derivation #1 (left-most derivation)

S ⇒ E ⇒ E Op E ⇒ Id Op E ⇒ Id + E ⇒ Id + Id

[parse tree so far: S, child E, children E Op E; leftmost E has child Id]

S → E
E → E Op E
E → Id

SLIDE 47

Example: Constructing a parse tree

§ Derivation #1 (left-most derivation)

S ⇒ E ⇒ E Op E ⇒ Id Op E ⇒ Id + E ⇒ Id + Id

[parse tree so far: S, child E, children E Op E; leftmost E → Id, Op → +]

S → E
E → E Op E
E → Id
Op → +

SLIDE 48

Example: Constructing a parse tree

§ Derivation #1 (left-most derivation)

S ⇒ E ⇒ E Op E ⇒ Id Op E ⇒ Id + E ⇒ Id + Id

[parse tree: S, child E, children E Op E; left E → Id, Op → +, right E → Id]

S → E
E → E Op E
E → Id
Op → +
E → Id

SLIDE 49

Example: Constructing a parse tree

§ Derivation #2 (right-most derivation)

S ⇒ E ⇒ E Op E ⇒ E Op Id ⇒ E + Id ⇒ Id + Id

§ Same tree!

§ Parse tree summarizes derivation (you can find the productions used)
§ No statement regarding the right-most or left-most derivation

[parse tree: identical to the one built from Derivation #1]

SLIDE 50

a + b * c

Talk to your neighbor and find a derivation for “a + b * c”

(Hint: right-most or left-most)

Construct the parse tree for your derivation. Compare your tree with the result obtained by your neighbor team.

SLIDE 51

Derivations for a + b * c

§ Tree #1 § Tree #2

[Tree #1: + at the root; its right operand expands to Id * Id. Tree #2: * at the root; its left operand expands to Id + Id]

Note: Each tree can be obtained using both a left-most and a right-most derivation.

SLIDE 52

Derivations and parse trees

§ Derivations with different parse trees

§ For the same string w

§ What was intended by the programmer?

§ Tree #1 means: a + (b * c)
§ Tree #2 means: (a + b) * c

SLIDE 53

Derivations and parse trees

§ Derivations with different parse trees

§ For the same string w

§ What was intended by the programmer?

§ Tree #1 means: a + (b * c)
§ Tree #2 means: (a + b) * c

§ Should we allow grammars with different parse trees for w?

§ Probably not for programming languages (if derivations capture structure)
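A two-line sketch (operand values chosen by me) of why the distinction matters: evaluating a + b * c under each grouping gives different results.

```python
# The two parse trees assign different meanings to the same string.
a, b, c = 2, 3, 4
tree1 = a + (b * c)   # Tree #1: * sits below +, so it binds tighter
tree2 = (a + b) * c   # Tree #2: + sits below *, so it binds tighter
print(tree1, tree2)   # 14 20
```

If the grammar admits both trees, the compiler has no basis for choosing between 14 and 20.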

SLIDE 54

Different parse trees

§ There are grammars that allow more than one right-most derivation for w ∈ L(G)

§ (Or more than one left-most derivation)

§ Different right-most (left-most) derivations result in different parse trees

§ Capture different structure

SLIDE 55

Different parse trees

§ There are grammars that allow more than one right-most derivation for w ∈ L(G)

§ (Or more than one left-most derivation)

§ Example (right-most)

§ Derivation #1: S ⇒ E ⇒ E Op E ⇒ E Op E Op E ⇒ E Op E Op Id ⇒ E Op E * Id ⇒ E Op Id * Id ⇒ E + Id * Id ⇒ Id + Id * Id
§ Derivation #2: S ⇒ E ⇒ E Op E ⇒ E Op Id ⇒ E * Id ⇒ E Op E * Id ⇒ E Op Id * Id ⇒ E + Id * Id ⇒ Id + Id * Id

SLIDE 56

Derivations and parse trees

Tree #1 Tree #2

[Tree #1: * at the root, with Id + Id below its left operand, i.e. (a + b) * c. Tree #2: + at the root, with Id * Id below its right operand, i.e. a + (b * c)]

SLIDE 57

3.1.3 Ambiguity

§ A grammar that allows more than one parse tree for at least one w ∈ L(G) is called ambiguous

§ Note: Ambiguity is property of the grammar

§ We give later a non-ambiguous grammar for expressions

§ We need to compare parse trees (and derivations)

§ Comparing derivations easy if only left-most (right-most) used

§ Alternative definition: A grammar that allows more than one (right | left)-most derivation for at least one w ∈ L(G) is called ambiguous

SLIDE 58

Problems w/ ambiguity

§ Compiler does not know how to interpret “a + b * c”

§ Is it Tree #1? I.e., (a + b) * c
§ Or is it Tree #2? I.e., a + (b * c)

§ What can we do?

SLIDE 59

Addressing ambiguity

§ Change the grammar

§ See later for better grammar
§ May not always be possible

§ Change language
§ Add rules that “*” binds more strongly than “+”

§ Precedence
§ Resolves conflicts

§ Bad idea: Let the compiler (writer) decide

§ Or let the user worry
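The grammar-change option can be sketched with the standard layered expression grammar (textbook construction, not yet introduced in these slides; the code and token handling are my own, and tokens are single characters for simplicity). One non-terminal per precedence level forces * below + in every tree:

```python
# Unambiguous layered grammar, with left recursion turned into loops:
#   E -> E + T | E - T | T        (lowest precedence)
#   T -> T * F | T / F | F
#   F -> Id | ( E )               (highest precedence; unary minus omitted)

def parse_expr(tokens):
    """Parse E; tokens is a mutable list like ['a', '+', 'b', '*', 'c']."""
    def parse(level):                        # level 0 = E, level 1 = T
        ops = ["+-", "*/"][level]
        node = parse(1) if level == 0 else parse_factor()
        while tokens and tokens[0] in ops:   # left recursion as iteration
            op = tokens.pop(0)
            rhs = parse(1) if level == 0 else parse_factor()
            node = (op, node, rhs)           # left-associative grouping
        return node
    def parse_factor():
        if tokens[0] == "(":
            tokens.pop(0)                    # "("
            node = parse(0)                  # E
            tokens.pop(0)                    # ")"
            return node
        return tokens.pop(0)                 # Id
    return parse(0)

print(parse_expr(list("a+b*c")))  # ('+', 'a', ('*', 'b', 'c'))
```

Because a * can only appear inside the T level, the only tree for a + b * c groups as a + (b * c); the ambiguity is gone without changing the language.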

SLIDE 60

Another example

§ “If” statement
§ Two forms

§ if (Condition) then (Body)
§ if (Condition) then (Body) else (Body)

SLIDE 61

Another example – G5

§ Start symbol: S
§ Productions

S → stmt-list S | stmt-list
stmt-list → …. | if-stmt
if-stmt → if cond-expr then S | if cond-expr then S else S
cond-expr → …

§ Other statements (assign, function call, …) and expression details omitted (abbreviate Body, Cond)

SLIDE 62

Please construct with your neighbor a program fragment that shows that this grammar is ambiguous. (Find an example with two parse trees)

SLIDE 63

Another example

if (Cond) then if (Cond) then (Body) else (Body)

Body: some other stmts Cond: some condition expression

What did the programmer mean?

[Two readings: the else attaches to the inner if, or the else attaches to the outer if]

SLIDE 64

SLIDE 65

Grammar G6

§ S → SS | (S) | () | ε
§ Is G6 ambiguous?
§ What is L(G6)? Find two right-most (left-most) derivations for some w.
§ Find a grammar G6’ such that L(G6) == L(G6’) and G6’ is not ambiguous.
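A sketch of a membership test for L(G6) (my own solution attempt; the slide leaves the characterization as an exercise). A depth counter is one way to check a candidate answer without consulting the grammar at all:

```python
# Membership test for the language generated by S -> SS | (S) | () | ε,
# implemented with a nesting-depth counter rather than the grammar.

def in_L_G6(w):
    depth = 0
    for ch in w:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:          # closing with nothing open
                return False
        else:
            return False           # only ( and ) are terminals
    return depth == 0              # everything opened was closed

print([in_L_G6(w) for w in ["", "()", "(())()", ")("]])
# [True, True, True, False]
```

Comparing this checker against strings you derive by hand from G6 is a quick sanity test for whatever description of L(G6) you come up with.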

SLIDE 66

SLIDE 67

Ambiguous languages

§ Ambiguity is a property of the grammar

§ One word is enough to show ambiguity
§ How do you show that a grammar is not ambiguous?

§ Proof (for one grammar)
§ Some kinds of grammars are certified unambiguous

§ We will look at those in compiler design

§ Unfortunately there are languages that are inherently ambiguous

§ All grammars that generate such a language are ambiguous
§ Even for Type-2 (context-free) grammars

SLIDE 68

SLIDE 69

Transition from parse tree to IR

§ Parse tree

§ Sometimes called concrete syntax tree
§ Interior nodes represent non-terminals

§ Our tree-based IR: Abstract-syntax tree

§ Interior nodes represent programming constructs
§ Non-terminals not (directly) preserved
§ Structure close to that of the parse tree

§ Building IR: Via derivations or separate transformation step

SLIDE 70

Parse tree vs IR

Concrete syntax tree Abstract syntax tree (IR)

[Concrete syntax tree: S, E, with children E (→ Id “a7”), Op (→ +), E (→ Id “b”). Abstract syntax tree: + node with children VAR a7 and VAR b]
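The transition can be sketched with two node types (my own class names Var and BinOp, mirroring the VAR and + nodes in the figure):

```python
# From concrete to abstract syntax: keep the operator and the variable
# references, drop the S/E/Op bookkeeping non-terminals.
from dataclasses import dataclass

@dataclass
class Var:
    name: str

@dataclass
class BinOp:
    op: str
    left: object
    right: object

# Parse tree leaves Id("a7") "+" Id("b")  -->  AST
ast = BinOp("+", Var("a7"), Var("b"))
print(ast)   # BinOp(op='+', left=Var(name='a7'), right=Var(name='b'))
```

The AST has the same shape as the parse tree minus one chain of interior nodes per leaf, which is why the transformation step is small for a tree-based IR.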

SLIDE 71

Parsing

§ Given G and a word w ∈ T*: we want to know if “w ∈ L(G)?”
§ Analysis problem

§ Answer is either YES or NO

§ If (and only if) we find a sequence of productions so that S ⇒* w then w ∈ L(G)

SLIDE 72

Summary

§ Frontend performs two tasks

§ Break input into tokens
§ Check that the sequence of tokens is legal input

§ Find derivation S ⇒* w

§ Goal: produce IR
§ Parse trees capture derivations

§ Information about structure – needed for IR

§ Our IR is tree-based, so step from parse tree to IR tree not that large
