How to Write Compilers and solve data transformation problems



slide-1
SLIDE 1

This slide contains no jokes.

slide-2
SLIDE 2

How to Write Compilers and solve data transformation problems.

Shevek shevek@anarres.org shevek@nebula.com

slide-3
SLIDE 3

Introduction

  • I have written a lot of compilers.
  • Sequential and parallel languages.
  • Machine code and interpreted.
  • Some of them didn't work!
  • This presentation contains practical knowledge.
  • Almost nothing in it is researched.
  • The gaps are stuff I don't know, or don't need.
  • Some basic material is omitted.
  • Read a book.
  • Ask questions! Already!
slide-4
SLIDE 4

Nebula

http://www.nebula.com/careers/

slide-5
SLIDE 5

Example of a Compiler

slide-6
SLIDE 6

The Mini-language Philosophy

  • Describe your problem in some language.
  • Implement the description in code.
  • 1 month
  • Write a compiler for that language.
  • 1 month
  • Update the problem description.
  • Reimplement the description in code.
  • 1 month
  • Run your compiler again.
  • 1 week!
slide-7
SLIDE 7

Why Write a Compiler?

  • Save time over the life of a project.
  • Simplify the runtime.
  • Detect errors earlier.
slide-8
SLIDE 8

Errors in Development

[Figure: when errors are detected. Labels: Code leaves developer's desktop; Runtime errors detected; Kind of expensive; Say prayers; IDE highlights bug; From static analyzer]

slide-9
SLIDE 9

Why Not Write a Compiler?

  • Reinventing world, plus dog.
  • Usually an expression parser.
  • Write a library for an existing language.
  • Use fluent APIs.
  • Cognitive load.
  • This applies to annotations as well.

– Note to self: Rag on python a bit.

  • Confusing education with production.
slide-10
SLIDE 10

General Engineering

  • TIP: Write type-safe code.
  • TIP: Write synchronous code.
  • TIP: Write code with a well-defined thread contract.

  • TIP: Use monads or immutable structures.
slide-11
SLIDE 11

Overview of a Compiler

  • Treat the stages independently.
  • Give each stage a contract.
  • TIP: Make each stage check its input.
  • TIP: Treat the input as a read-only structure.
slide-12
SLIDE 12

Example Stage: The Front End

  • Each phase:
  • Consumes some previous outputs.
  • Produces one immutable output.
  • The last phase
  • Produces a consolidated output structure, which is also immutable.
  • TIP: Use Context and InstanceMap patterns.
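The slides name the Context and InstanceMap patterns without defining them. The sketch below is one possible reading: a type-keyed map that holds each phase's immutable output exactly once. All class names here (`InstanceMap`, `Tokens`, `WordCount`, `FrontEndSketch`) are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: the slides name the pattern but do not define it.
final class InstanceMap {
    private final Map<Class<?>, Object> outputs = new HashMap<>();

    // Each phase output is stored once under its own type and never replaced.
    <T> void put(Class<T> type, T value) {
        if (outputs.putIfAbsent(type, value) != null)
            throw new IllegalStateException("Output already produced: " + type.getSimpleName());
    }

    // A phase checks its input: a missing dependency is a contract violation.
    <T> T get(Class<T> type) {
        Object value = outputs.get(type);
        if (value == null)
            throw new IllegalStateException("Missing phase output: " + type.getSimpleName());
        return type.cast(value);
    }
}

// Two immutable phase outputs.
final class Tokens {
    final List<String> words;
    Tokens(List<String> words) { this.words = List.copyOf(words); }
}

final class WordCount {
    final int count;
    WordCount(int count) { this.count = count; }
}

final class FrontEndSketch {
    // Phase 1: produces Tokens.
    static void lex(InstanceMap ctx, String source) {
        String trimmed = source.trim();
        List<String> words = trimmed.isEmpty() ? List.of() : List.of(trimmed.split("\\s+"));
        ctx.put(Tokens.class, new Tokens(words));
    }

    // Phase 2: consumes Tokens, produces WordCount.
    static void count(InstanceMap ctx) {
        Tokens tokens = ctx.get(Tokens.class);
        ctx.put(WordCount.class, new WordCount(tokens.words.size()));
    }
}
```

Running `count` before `lex` fails immediately with a message naming the missing output, which is the point of giving each phase a checked contract.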
slide-13
SLIDE 13

Mistakes

  • Writing an ugly monolith.
  • Cannot debug or modify.
  • Modifying data structures.
  • Cannot localize bugs.
  • No clear contract for the structure.
  • Attaching metadata to the parse tree.
  • It looks like a christmas tree.
slide-14
SLIDE 14

Assembling Compilers

  • A compiler S → T might consist of:
  • S → X, X → Y, Y → T
  • TIP: Write the data structures first.
  • TIP: Use service discovery.
  • TIP: Declare phase dependencies.
  • TIP: Use jgrapht.
  • TIP: Dijkstra's shortest path.
  • TIP: Print everything.
  • TIP: Use graphviz.
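A minimal sketch of declared phase dependencies, assuming each phase is an edge from one intermediate form to another. It uses plain breadth-first search rather than jgrapht or Dijkstra: with every phase costing the same, the result is the same chain. `PhaseGraph` and the phase names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: each phase is an edge from an input form to an output form.
final class PhaseGraph {
    private final Map<String, List<String[]>> edges = new HashMap<>();

    void declare(String phase, String from, String to) {
        edges.computeIfAbsent(from, k -> new ArrayList<>()).add(new String[] { phase, to });
    }

    // Returns the phase names in execution order; the list is empty when the
    // target is unreachable (or when source and target are the same form).
    List<String> plan(String source, String target) {
        Map<String, String[]> cameFrom = new HashMap<>(); // form -> {phase, previous form}
        Deque<String> queue = new ArrayDeque<>();
        queue.add(source);
        cameFrom.put(source, null);
        while (!queue.isEmpty()) {
            String form = queue.remove();
            if (form.equals(target)) break;
            for (String[] edge : edges.getOrDefault(form, List.of())) {
                if (!cameFrom.containsKey(edge[1])) {
                    cameFrom.put(edge[1], new String[] { edge[0], form });
                    queue.add(edge[1]);
                }
            }
        }
        List<String> phases = new ArrayList<>();
        if (!cameFrom.containsKey(target)) return phases;
        // Walk back from the target to the source, then reverse.
        for (String form = target; cameFrom.get(form) != null; form = cameFrom.get(form)[1])
            phases.add(cameFrom.get(form)[0]);
        Collections.reverse(phases);
        return phases;
    }
}
```

Declaring `parse: S → X`, `analyse: X → Y` and `emit: Y → T` and asking for `plan("S", "T")` yields the three phases in that order, even if longer detours are also declared.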
slide-15
SLIDE 15

Language Design

  • Languages should have low information content.
  • Low information content (or high redundancy) increases the ability of the compiler to detect errors.
  • An arbitrary (randomly generated) input should have a high probability of being invalid.
  • Note to self: Rag on Perl a bit, but not as much as python.
  • See @Override in Java, not present in C++.
  • The output of a compiler will probably have very high information content.

slide-16
SLIDE 16

More Language Design

  • The programmer has better things to do than:
  • Format code.

– Make your code auto-formattable.
– Do not make whitespace significant.

  • Work out the valid options for …

– Parameter types to an overloaded function.
– Available symbols, variables or types.

  • Keep track of whether a variable exists.

– Grrrrr.

  • You might not need a syntax.
  • Just add semantics to an existing syntax, e.g. XML.
slide-17
SLIDE 17

Lexer and Parser

  • The lexer turns a sequence of characters into a sequence of tokens.
  • The parser turns the sequence of tokens into a parse tree.

  • They are generally rules-based.
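As an illustration of a rules-based lexer, here is a minimal sketch built on `java.util.regex`. The token kinds and rules are invented for the example and are not taken from any grammar in this talk.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: token names and rules are invented for illustration.
final class TinyLexer {
    // One named group per rule; alternatives are tried in order.
    private static final Pattern RULES = Pattern.compile(
        "(?<WS>\\s+)|(?<NUMBER>\\d+)|(?<IDENT>[A-Za-z_]\\w*)|(?<OP>[-+*/()<>=,])");
    private static final String[] KINDS = { "WS", "NUMBER", "IDENT", "OP" };

    static List<String> lex(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = RULES.matcher(input);
        int pos = 0;
        while (pos < input.length()) {
            // lookingAt() anchors the match at the start of the region.
            if (!m.region(pos, input.length()).lookingAt())
                throw new IllegalArgumentException("Bad character at offset " + pos);
            for (String kind : KINDS) {
                // Whitespace is matched but not emitted as a token.
                if (m.group(kind) != null && !kind.equals("WS"))
                    tokens.add(kind + ":" + m.group(kind));
            }
            pos = m.end();
        }
        return tokens;
    }
}
```

For the input `a0 < 5` this produces the three tokens `IDENT:a0`, `OP:<` and `NUMBER:5`, which is the sequence a parser would then consume.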
slide-18
SLIDE 18

Parsers: Ambiguity

  • Sentence = NounPhrase Verb NounPhrase .
  • NounPhrase = Article Adjective? Noun
  • The old man the boats.
  • What happened?
slide-19
SLIDE 19

Parsers: A Parse Tree

slide-20
SLIDE 20

Parsers: LL

public Statement parse_statement() {
    Token t = token();
    switch (t.type) {
        case SELECT:
            return parse_select();
        ...
    }
}

public Select parse_select() {
    Select s = new Select();
    s.expressions = parse_expression_list();
    s.tables = parse_from_list();
    s.where = parse_where_clause();
    return s;
}

  • An example parser for SQL
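The SQL fragment above is not runnable on its own, so here is a self-contained sketch of the same recursive-descent (LL) style on a much smaller, hypothetical grammar for integer arithmetic. It evaluates as it parses instead of building a tree; the grammar and the `CalcParser` name are assumptions made for the example.

```java
// Hypothetical sketch of recursive descent: one method per grammar rule.
//   expr   = term (('+' | '-') term)*
//   term   = factor (('*' | '/') factor)*
//   factor = NUMBER | '(' expr ')'
final class CalcParser {
    private final String src;
    private int pos = 0;

    // Spaces are simply dropped to keep the sketch short.
    CalcParser(String src) { this.src = src.replace(" ", ""); }

    static int evaluate(String src) {
        CalcParser p = new CalcParser(src);
        int value = p.expr();
        if (p.pos != p.src.length())
            throw new IllegalArgumentException("Trailing input at " + p.pos);
        return value;
    }

    // One character of lookahead decides which rule to apply.
    private char peek() { return pos < src.length() ? src.charAt(pos) : '\0'; }

    private int expr() {
        int value = term();
        while (peek() == '+' || peek() == '-') {
            char op = src.charAt(pos++);
            int rhs = term();
            value = (op == '+') ? value + rhs : value - rhs;
        }
        return value;
    }

    private int term() {
        int value = factor();
        while (peek() == '*' || peek() == '/') {
            char op = src.charAt(pos++);
            int rhs = factor();
            value = (op == '*') ? value * rhs : value / rhs;
        }
        return value;
    }

    private int factor() {
        if (peek() == '(') {
            pos++;
            int value = expr();
            if (peek() != ')') throw new IllegalArgumentException("Expected ')' at " + pos);
            pos++;
            return value;
        }
        int start = pos;
        while (Character.isDigit(peek())) pos++;
        if (start == pos) throw new IllegalArgumentException("Expected number at " + pos);
        return Integer.parseInt(src.substring(start, pos));
    }
}
```

The call structure mirrors the grammar directly, which is what makes LL parsers easy to write by hand.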
slide-21
SLIDE 21

Parsers: LL

s.expressions = parse_expression_list();
s.tables = parse_from_list();
s.where = parse_where_clause();

slide-24
SLIDE 24

Parsers: LR

  • An LR parser looks at the top few tokens on the stack, and decides one of two things:

  • Push the new token onto the stack.
  • Reduce the top of the stack to a compound token.
  • LR parsers tend to be automatically generated.
  • TIP: Always use LR parsers.
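The two actions can be sketched with a hand-rolled shift-reduce loop. This is not how a generated LR parser is implemented (that is table-driven and uses lookahead); it is a naive illustration over pre-classified tokens, with hypothetical token names `ID`, `CMP`, `NUM` and `AND`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: a naive shift-reduce loop. A generated LR parser
// drives the same two actions from a table instead of scanning rules.
final class ShiftReduceSketch {
    // Each rule: the right-hand side to match on top of the stack, then the result.
    private static final String[][] RULES = {
        { "ID", "CMP", "NUM", "expr" },     // a0 < 5        → expr
        { "expr", "AND", "expr", "expr" },  // expr and expr → expr
    };

    // Returns the sequence of actions taken; throws if the input is not one expr.
    static List<String> parse(List<String> tokens) {
        List<String> stack = new ArrayList<>();
        List<String> trace = new ArrayList<>();
        for (String token : tokens) {
            stack.add(token);                      // shift
            trace.add("shift " + token);
            while (reduceOnce(stack, trace)) { }   // reduce while any rule matches
        }
        if (!stack.equals(List.of("expr")))
            throw new IllegalArgumentException("Parse error, stack: " + stack);
        return trace;
    }

    private static boolean reduceOnce(List<String> stack, List<String> trace) {
        for (String[] rule : RULES) {
            int n = rule.length - 1;               // length of the right-hand side
            if (stack.size() < n) continue;
            List<String> top = stack.subList(stack.size() - n, stack.size());
            if (top.equals(List.of(rule).subList(0, n))) {
                top.clear();                       // pop the right-hand side
                stack.add(rule[n]);                // push the compound token
                trace.add("reduce to " + rule[n]);
                return true;
            }
        }
        return false;
    }
}
```

Feeding it the tokens for `a0 < 5 and b1 > 7` gives seven shifts and three reductions, ending with a single `expr` on the stack.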
slide-25
SLIDE 25

Parsers: LR

Stack: select expression_list from_list where a0 < 5 ...
Rule: a0 < 5 → expr

slide-26
SLIDE 26

Parsers: LR

Stack: select expression_list from_list where expr and b1 > 7 …
Rule: b1 > 7 → expr

slide-27
SLIDE 27

Parsers: LR

Stack: select expression_list from_list where expr and expr …
Rule: expr and expr → expr

slide-28
SLIDE 28

Parsers: LR

Stack: select expression_list from_list where expr ...
Rule: where expr → where_clause

slide-29
SLIDE 29

Parsers: Ambiguity

  • Expressions:
  • 5 + 6
  • 3 * (5 + 6)
  • (5 + 6)
  • (6)
  • Lists:
  • (6, 7, 8)
  • (6, 7)
  • ()
  • (6)
  • F^HOops!
  • Perl has +{}, @x = (6), $x = (6)
  • Look for Meredith Patterson's 28c3 talk.
slide-30
SLIDE 30

Parsers: Shift-Reduce Conflict

statement:
    {ifelse} if ( condition ) then statement else statement |
    {if} if (condition) then statement ;

if (a) if (bar) foo(); else baz();

if (a)
    if (bar) foo();
    else baz();

if (a)
    if (bar) foo();
else baz();

slide-31
SLIDE 31

Parsers: Shift-Reduce Conflict

statement:
    {ifelse} if ( condition ) then statement else statement |
    {if} if (condition) then statement ;

if (a) if (bar) foo(); else baz();

Initial stack: if[0] condition if[1] condition statement else …
Rule: … if condition statement → statement
Result: if[0] condition statement[1]
Rule: … if condition statement [else statement] → (shift else)
Result: if[0] condition if[1] condition statement else

slide-32
SLIDE 32

Parsers: Factoring in LR

statement =
    {no_dangling} no_dangling_statement { -> no_dangling_statement.statement } |
    {dangling} dangling_statement { -> dangling_statement.statement } ;

/* productions NOT ending in 'statement' */
no_dangling_statement { -> statement } =
    {comp} compound_statement { -> compound_statement.statement } |
    {exp} expression_statement { -> expression_statement.statement } |
    {jmp} jump_statement { -> jump_statement.statement } |
    {if_else} kw_if tok_lpar expression tok_rpar no_dangling_statement kw_else [other]:no_dangling_statement
        { -> New statement.if_else(expression, no_dangling_statement.statement, other.statement) } |
    {catch} catch_statement { -> catch_statement.statement } ;

/* productions ending in 'statement' */
dangling_statement { -> statement } =
    {label} labeled_statement { -> labeled_statement.statement } |
    {select} selection_statement { -> selection_statement.statement } |
    {iter} iteration_statement { -> iteration_statement.statement } ;

slide-33
SLIDE 33

Parsers: Priority in Bison

  • For reference only:

...
%nonassoc LOWER_THAN_ELSE
%nonassoc L_ELSE
...
%%
...
statement : ...
    | L_IF '(' nv_list_exp ')' statement opt_else
    | ...
    ;

opt_else : %prec LOWER_THAN_ELSE
    | L_ELSE statement
    ;

slide-34
SLIDE 34

Parsers: PEGs

  • Like an LR CFG but where rules have priority.

statement:
    {ifelse} if ( condition ) then statement else statement |
    {if} if (condition) then statement ;

  • PEG makes a deterministic decision in case of ambiguity.
  • This also means your language design is a dog's breakfast.

slide-35
SLIDE 35

SableCC

  • A beautiful Java LR parser generator.
  • Somewhat under-documented.
  • TIP: Use SableCC.
slide-36
SLIDE 36

SableCC: Example Grammar

  • A fragment of a SableCC grammar:

/* 6.5.9 equality-expression */
equality_expression { -> expression } =
    {no} relational_expression { -> relational_expression.expression } |
    {eq} equality_expression tok_eq_eq relational_expression
        { -> New expression.eq(equality_expression.expression, relational_expression.expression) } |
    {ne} equality_expression tok_ne relational_expression
        { -> New expression.ne(equality_expression.expression, relational_expression.expression) } ;

/* 6.5.10 AND-expression */
and_expression { -> expression } =
    {no} equality_expression { -> equality_expression.expression } |
    {and} and_expression tok_and equality_expression
        { -> New expression.bitwise_and(and_expression.expression, equality_expression.expression) } ;
slide-37
SLIDE 37

Concrete Syntax Trees

  • Lists are parsed using recursion.

/* 6.5.2 argument-expression-list */
argument_expression_list =
    {single} assignment_expression |
    {list} argument_expression_list tok_comma assignment_expression ;

Ugly, and doesn't even fit on the slide.

argument_list :
    argument { $$ = newAV(); av_push($$, $1); } |
    argument_list ',' argument { av_push($1, $3); $$ = $1; } ;

  • We can embed code to disentangle it.

slide-38
SLIDE 38

SableCC: Abstract Syntax Tree

  • We want it to be flat.

/* 6.5.2 argument-expression-list */
argument_expression_list { -> expression* } =
    {single} assignment_expression
        { -> [assignment_expression.expression] } |
    {list} argument_expression_list tok_comma assignment_expression
        { -> [argument_expression_list.expression, assignment_expression.expression] } ;

  • Ahhh, beauty.
slide-39
SLIDE 39

SableCC: Visitors

  • We want a type-safe visitor pattern.

Output:

public class AAddExpression extends PExpression {
    private PExpression left;
    private PExpression right;
    ...
    public void apply(Visitor v) {
        v.inAddExpression(this);
        left.apply(v);
        right.apply(v);
        v.outAddExpression(this);
    }
}

Input:

expression = ...
    {sub} [left]:expression [right]:expression |
    {add} [left]:expression [right]:expression |
    {rem} [left]:expression [right]:expression |
    {div} [left]:expression [right]:expression |
    {mul} [left]:expression [right]:expression |
    ... ;
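For readers who have not seen the pattern, here is a self-contained sketch of a type-safe visitor. It is not SableCC's generated code: it uses a simpler one-method-per-node shape with a return value, and every name in it is hypothetical.

```java
// Hypothetical sketch: one visitor method per node class, so adding a node
// type without handling it is a compile error rather than a runtime surprise.
interface ExprVisitor<R> {
    R visitNumber(NumberExpr node);
    R visitAdd(AddExpr node);
    R visitMul(MulExpr node);
}

abstract class Expr {
    abstract <R> R apply(ExprVisitor<R> v);
}

final class NumberExpr extends Expr {
    final int value;
    NumberExpr(int value) { this.value = value; }
    @Override <R> R apply(ExprVisitor<R> v) { return v.visitNumber(this); }
}

final class AddExpr extends Expr {
    final Expr left, right;
    AddExpr(Expr left, Expr right) { this.left = left; this.right = right; }
    @Override <R> R apply(ExprVisitor<R> v) { return v.visitAdd(this); }
}

final class MulExpr extends Expr {
    final Expr left, right;
    MulExpr(Expr left, Expr right) { this.left = left; this.right = right; }
    @Override <R> R apply(ExprVisitor<R> v) { return v.visitMul(this); }
}

// One analysis is one visitor; the tree classes never change.
final class Evaluator implements ExprVisitor<Integer> {
    @Override public Integer visitNumber(NumberExpr node) { return node.value; }
    @Override public Integer visitAdd(AddExpr node) {
        return node.left.apply(this) + node.right.apply(this);
    }
    @Override public Integer visitMul(MulExpr node) {
        return node.left.apply(this) * node.right.apply(this);
    }
}
```

No casts or `instanceof` checks are needed: the node's `apply` selects the right visitor method at compile time.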

slide-40
SLIDE 40

SableCC: String Concatenation

  • In C, constant strings concatenate.

public class StringConcatenationAnalysis extends DepthFirstAdapter {
    @Override
    public void outAMultiStringConstant(AMultiStringConstant node) {
        // Join the adjacent literals into a single constant.
        StringBuilder buf = new StringBuilder();
        for (TStringLiteral sl : node.getStringLiteral())
            buf.append(constants.getConstant(sl));
        String value = buf.toString();
        AStringConstant sc = new AStringConstant(new TStringLiteral(value));
        constants.addConstant(sc, value);
        node.replaceBy(sc);
    }
}

I didn't even need all the space on this slide.

slide-41
SLIDE 41

SableCC: Symbol Table

  • inABlockStatement()
  • push(new Scope());
  • outABlockStatement()
  • pop();
  • caseADeclarator(ADeclarator declarator)
  • peek().add(declarator);
  • caseAIdentifier(AIdentifier identifier)
  • setSymbol(identifier, peek().get(identifier));

@Override
public void caseAIdentifierExpression(AIdentifierExpression node) {
    String name = node.getIdentifier().getText();
    ExpressionSymbol symbol = relation.getSymbol(name);
    if (symbol == null)
        errors.addError("No such field " + name + " in " + relation);
    else
        setSymbol(node, symbol);
}
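The push/pop/peek scheme in the bullets can be sketched as a stack of scopes. This is a stand-alone illustration, not SableCC API: `ScopeStack` is hypothetical, symbols are reduced to a name and a type string, and lookup walks outwards through enclosing scopes, which goes slightly beyond the `peek().get(identifier)` shown above.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the push/pop/peek scheme; names are not SableCC's.
final class ScopeStack {
    private final Deque<Map<String, String>> scopes = new ArrayDeque<>();

    void push() { scopes.push(new HashMap<>()); }  // entering a block
    void pop() { scopes.pop(); }                   // leaving a block

    // Declarations go into the innermost scope only.
    void declare(String name, String type) { scopes.peek().put(name, type); }

    // Lookup walks from the innermost scope outwards; null means "no such symbol".
    String lookup(String name) {
        for (Map<String, String> scope : scopes) {
            String type = scope.get(name);
            if (type != null) return type;
        }
        return null;
    }
}
```

An inner declaration shadows an outer one until its block is popped, after which the outer declaration is visible again.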

slide-42
SLIDE 42

SableCC: Managing an AST

  • Simplify the AST.
  • TIP: Eliminate special cases early.
  • TIP: Write assertions.
  • TIP: Overload getOut() to make type-safe and @Nonnull
  • TIP: Make defaultCase() throw an exception.
  • Now you can detect unhandled terminals.
  • If you need to detect unhandled nonterminals, implement the interface instead of extending the adapter.

@Nonnull
public Type getType(Node expr) {
    Type t = (Type) getOut(expr);
    assert t != null : "No type for " + StringVisitor.toString(expr);
    return t;
}

slide-43
SLIDE 43

Handling Errors

  • Detect errors early; report them late.
  • You might not have hit the root cause yet.
  • The programmer would like to fix multiple errors at a time.
  • Make them accurate, with useful context.
  • TIP: Use an ErrorHandler:

– pushContext(...)
– popContext()
– addMessage(..., Throwable t, boolean fatal);

  • Preserve error metadata as a structure.
  • An IDE or framework can use them without parsing an exception message.

– Reminder: Rag on antlr a bit, if I didn't rag on it enough already for being LL.

  • The + operator at line 5 character 32
  • With left argument from L3C42 to L5C30
  • With …
  • E_CANNOTPROMOTE_INT_LONG: Cannot promote int to long.
  • Insert stubs and continue?
  • TIP: Use exceptions for internal and unexpected errors only!
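A minimal sketch of such an ErrorHandler. Only the three method names come from the slide; the elided leading parameters of `addMessage` are filled in here with a hypothetical error code and message text, and the `Message` class is likewise an assumption.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch: only pushContext, popContext and addMessage are named
// on the slide; the parameters and the Message structure are assumptions.
final class ErrorHandler {
    static final class Message {
        final String code;           // stable identifier, e.g. E_CANNOTPROMOTE_INT_LONG
        final String text;
        final List<String> context;  // outermost context first
        final Throwable cause;       // may be null
        final boolean fatal;

        Message(String code, String text, List<String> context, Throwable cause, boolean fatal) {
            this.code = code;
            this.text = text;
            this.context = context;
            this.cause = cause;
            this.fatal = fatal;
        }
    }

    private final Deque<String> context = new ArrayDeque<>();
    private final List<Message> messages = new ArrayList<>();

    void pushContext(String description) { context.addLast(description); }
    void popContext() { context.removeLast(); }

    // Errors are recorded as structures, not thrown, so compilation can continue.
    void addMessage(String code, String text, Throwable t, boolean fatal) {
        messages.add(new Message(code, text, List.copyOf(context), t, fatal));
    }

    boolean hasFatal() {
        for (Message m : messages)
            if (m.fatal) return true;
        return false;
    }

    // Reported late: the caller decides when and how to print them.
    List<Message> getMessages() { return List.copyOf(messages); }
}
```

Because each message carries its code and context as data, an IDE can place markers or group errors without parsing message strings.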
slide-44
SLIDE 44

Error Reporting and Recovery

  • Common mistakes are easy to cater for.

bad_octal_constant = octal_constant ['8'..'9'] digit*;
bad_constant = bad_octal_constant;
bad_string_literal = 'L'? '"' s_char_seq?;
bad_character_constant = 'L'? ''' c_char_seq?;
bad_identifier = digit identifier_nondigit+;
bad_token = all;

error_expression =
    {bad_constant} bad_constant { -> New error_expression.bad_constant(bad_constant) } ;

primary_expression { -> expression } =
    ...
    {error} error_expression { -> New expression.error(error_expression) } |
    ...

  • From most of my C-like grammars.
  • The parser now throws 'Unexpected “unterminated_string_literal”'!
slide-45
SLIDE 45

Error Reporting and Recovery

  • Or find a synchronization point.

private Token string(char open, char close) {
    for (char c : ...) {
        ...
        else if (isLineSeparator(c)) {
            unread(c);
            // error("Unterminated string literal after " + buf);
            return new Token(INVALID, text.toString(), "Unterminated string literal after " + buf);
        }
        ...
    }
}

  • From jcpp, the synchronization point is end of line.
  • This allows preprocessing to continue even if a token is semantically bad.
  • Other grammars may have synchronization points, even in LR.
slide-46
SLIDE 46

More Mistakes

  • Do not tie everything to a successful compile.
  • Invalid code is the norm.
  • Pay more attention to the utility of your compiler for invalid input than for valid input.
  • Rag on eclipse a bit, because you can't even reformat code which doesn't compile.
  • NetBeans can even autocomplete within invalid code!

slide-47
SLIDE 47

The Middle End

slide-48
SLIDE 48

What does your language allow?

  • Loops
  • Flow languages like guaranteed termination of loops, and therefore tend to disallow them.
  • Types
  • We can have “more” or “less” typing, and there are interesting consequences.
  • Minor features
  • Polymorphism, overloading, garbage collection, threads, … (minor due to conceptually simpler implementation).

slide-49
SLIDE 49

Types: When to Handle Them

  • Compile time
  • Detects errors earlier.
  • This is C++.
  • Run time
  • More runtime complexity.
  • Space and time overhead.
  • Reminder: Rag on python some more.
  • Don't handle
  • Let the kernel handle segmentation faults.
  • Hard to build a method dispatch mechanism.
slide-50
SLIDE 50

Types: Partial Handling

  • Partial handling at compile-time
  • Detects some errors earlier.
  • Does not prevent errors.

– Emitter gets partial data structures.

  • Does not avoid runtime costs.
  • Pike (iengine/JVM)
  • Two VMs: One for typed and one for untyped runtime.

  • Perl
  • “known” and “unknown” comparison ops.
slide-51
SLIDE 51

Creating Safety with Types

  • Consider a language with type-safe indexes.
  • char[8] a;
  • a[5] is legal. a[9] is illegal.
  • a[i] is of unknown legality unless we can prove something about i.
  • SPARK/Ada can do this.

– The runtime can rely on the correctness.

  • Coverity and Sparse can do this sometimes.

– The runtime cannot rely on the correctness.

  • How do we compile memcpy(char[?], char[?])
  • Either we expand a template, or we allow unsafe code.
  • Other rat-holes include intrusive lists, which are:
  • Possible in C.
  • Typesafe in C++.
  • Impossible in Java.
slide-52
SLIDE 52

Instruction Ordering

  • Consider *a++ + *a++
  • Adds two consecutive chars in a string.
  • The order of operations is left to right.
  • Read, increment, write.
  • Read, increment, write.
  • Add the reads.
slide-53
SLIDE 53

Instruction Ordering in C

  • Consider *a++ + *a++
  • C doesn't guarantee that the LHS of + is evaluated first.

  • The order of operations is not well defined.
  • We might not even add 2 to a!
slide-54
SLIDE 54

Instruction Ordering in Java

  • Consider *a++ + *a++
  • Java does guarantee that the LHS of + is evaluated first, including all side-effects.
  • The VM still has freedom to reorder the operations if there is no memory-visible difference.

  • TIP: Use graphviz.
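Java has no pointer arithmetic, so the closest analogue of `*a++ + *a++` uses an array index. The sketch below (with a hypothetical `OrderDemo` class) shows the guaranteed left-to-right behaviour: the left operand, including its increment, is finished before the right operand starts.

```java
// Java evaluates the left operand of + completely, side effects included,
// before it starts on the right operand.
final class OrderDemo {
    // Returns { sum of the first two chars, final value of the index }.
    static int[] sumOfFirstTwo(char[] a) {
        int i = 0;
        int sum = a[i++] + a[i++];  // Java's analogue of *a++ + *a++
        return new int[] { sum, i };
    }
}
```

For the input `"AB"` the sum is always `'A' + 'B'` and the index always ends at 2; the equivalent C expression carries no such guarantee.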
slide-55
SLIDE 55

Optimizers

  • We aim to minimize some objective function.
  • Usually time.
  • Occasionally space.
  • Three categories:
  • Basic hill-climbing
  • Algebraic
  • NP-Hard / nonalgebraic / superoptimizers
slide-56
SLIDE 56

Hill Climbing Optimization: TAC

  • The cost function is time.
  • Each change improves the code.
  • Often used for register machines.
  • The subsequent register-colouring problem is Hard.
slide-57
SLIDE 57

Algebraic Optimization

  • Consider a game of chess.
  • We understand the valid moves.
  • The objective function is victory/defeat in a leaf node.
  • We estimate the objective at each non-leaf node.
  • Moves in the game-tree may immediately lose material (incur cost), but still improve our overall likelihood of winning.
  • Therefore, this is not hill-climbing.
slide-58
SLIDE 58

Algebraic Optimization: SQL

  • Join A, B, C, D.
  • These are the objective plans.
  • They are leaf nodes in the algebraic tree.
  • They are themselves trees!
slide-59
SLIDE 59

Algebraic Optimization: SQL

  • A partial result is (A x B).
  • Is this more likely to lead to a fast solution than (C x D)?
  • Yes, because it is smaller?
  • Multiple ways to reach a partial result.

  • A: 2 records
  • B: 10 records
  • C: 100 records
  • D: 10 records
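To make the numbers concrete, here is a small sketch of a join-order search over these four tables. It rests on strong assumptions that are not in the slides: every join is a plain cross product, and the cost of a plan is the total number of records in all intermediate and final results. The `JoinOrderSketch` class is hypothetical.

```java
import java.util.Arrays;

// Hypothetical sketch: joins are plain cross products, and the cost of a plan
// is the total number of records in every intermediate and final result.
final class JoinOrderSketch {
    static long bestCost(long[] sizes) {
        int n = sizes.length;
        int full = (1 << n) - 1;
        long[] records = new long[full + 1];  // records[s]: size of the product of the tables in set s
        long[] cost = new long[full + 1];     // cost[s]: cheapest known plan producing set s
        Arrays.fill(cost, Long.MAX_VALUE);
        for (int s = 1; s <= full; s++) {
            int low = Integer.numberOfTrailingZeros(s);
            int rest = s & (s - 1);           // s without its lowest table
            records[s] = (rest == 0) ? sizes[low] : records[rest] * sizes[low];
            if (rest == 0) {                  // a base table costs nothing
                cost[s] = 0;
                continue;
            }
            // Try every split of s into two non-empty halves and keep only the
            // best: every dominated way of reaching this partial result is pruned.
            for (int left = (s - 1) & s; left > 0; left = (left - 1) & s) {
                int right = s ^ left;
                cost[s] = Math.min(cost[s], cost[left] + cost[right] + records[s]);
            }
        }
        return cost[full];
    }
}
```

With A = 2, B = 10, C = 100, D = 10, the partial result (A x B) has 20 records against 1000 for (C x D), and under these assumptions the cheapest plan joins A, B and D first and brings in C last, for a total cost of 20220.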
slide-60
SLIDE 60

Algebraic Optimization: Example

This is really a DAG of trees, where the parents of any node are the partial results combined to compute it. Dominated partial results have been pruned.

slide-61
SLIDE 61

Algebraic Optimization: Hints

  • Unlike the TAC optimization where every stage was a valid answer, algebraic methods only generate an answer at the leaf node.
  • You may want to prove that:
  • At least one leaf is reachable.
  • All leaves are reachable (for optimality).
  • The algorithm finishes in reasonable polynomial time.

  • Dominated partial results are pruned early.
  • TIP: Breadth first. RAM is cheap, but prune lots.
slide-62
SLIDE 62

NP-Hard Optimization

  • In one cycle, a CPU can execute:
  • Ten adds
  • Four multiplies
  • Two divides
  • Schedule this:
slide-63
SLIDE 63

NP-Hard Optimization

  • We use a general purpose constraint solver.
  • Each operation becomes a variable.
  • The value of the variable will be the instruction index.

slide-64
SLIDE 64

NP-Hard Optimization

  • Each relationship becomes a constraint.
  • m0 < a1, m0 < c3, etc.
  • We add the other constraints to the system.
  • a1 != c3, as we cannot schedule add and multiply together

slide-65
SLIDE 65

NP-Hard Optimization

  • We solve it.
  • The solution sucks.
slide-66
SLIDE 66

CSP: Symmetry

  • The eight-queens problem has symmetric solutions.
  • We can defeat symmetry and double the performance of our solver if we tell it never to put a queen in a5, a6, a7, a8.
  • If one solution has a queen in a[5..8] then its mirror has not.
slide-67
SLIDE 67

Using symmetry

  • We add a new constraint:

If there are no instructions in time slot k then there must be no instructions in time slot k+1.

  • This causes all instructions to be scheduled densely at 0.
  • And now we optimize:

There are no instructions in time slot N.

  • If the solver succeeds, we reduce N and try again.
  • If it fails, the last discovered solution must have been optimal.
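The shrink-N-until-failure loop can be sketched with a brute-force backtracking search standing in for a general purpose constraint solver. The operations and constraints are invented for the example, the bound "all slots below N" plays the role of "no instructions in time slot N", and the density constraint is left out for brevity. `ScheduleSketch` is a hypothetical name.

```java
import java.util.Arrays;

// Hypothetical sketch: a brute-force backtracking search standing in for a
// real constraint solver. Each operation is a variable; its value is a time slot.
final class ScheduleSketch {
    private final int ops;
    private final int[][] before;     // {x, y} means slot[x] < slot[y]
    private final int[][] different;  // {x, y} means slot[x] != slot[y]

    ScheduleSketch(int ops, int[][] before, int[][] different) {
        this.ops = ops;
        this.before = before;
        this.different = different;
    }

    // Try to place every operation in slots 0..slots-1; null if impossible.
    int[] solve(int slots) {
        int[] slot = new int[ops];
        Arrays.fill(slot, -1);
        return assign(0, slots, slot) ? slot : null;
    }

    private boolean assign(int op, int slots, int[] slot) {
        if (op == ops) return true;
        for (int s = 0; s < slots; s++) {
            slot[op] = s;
            if (consistent(slot) && assign(op + 1, slots, slot)) return true;
        }
        slot[op] = -1;  // undo before backtracking
        return false;
    }

    // Only constraints whose two operations are both placed can be violated.
    private boolean consistent(int[] slot) {
        for (int[] c : before)
            if (slot[c[0]] >= 0 && slot[c[1]] >= 0 && slot[c[0]] >= slot[c[1]]) return false;
        for (int[] c : different)
            if (slot[c[0]] >= 0 && slot[c[1]] >= 0 && slot[c[0]] == slot[c[1]]) return false;
        return true;
    }

    // Shrink the bound until the solver fails; the last success is optimal.
    int[] minimize() {
        int[] best = null;
        for (int slots = ops; slots >= 1; slots--) {
            int[] found = solve(slots);
            if (found == null) break;
            best = found;
        }
        return best;
    }
}
```

With four operations, the constraints `m0 < a1`, `m0 < c3` and `a1 != c3` force a schedule of three slots: two slots fail, so the three-slot solution found just before is the optimum.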
slide-68
SLIDE 68

Revisiting Scheduling

  • k is empty → k+1 is empty
  • 9 is empty (later, 8 is empty, etc)
slide-69
SLIDE 69

Applications of SAT/CSP

  • Modelling a CPU for classic superoptimization.
  • Hindley-Milner type unification.
  • Cluster scheduling and placement.
  • Cases where you just can't be bothered to write

the code or derive the algebra.

  • Beware: It might go out to lunch, or fail totally.
slide-70
SLIDE 70

Consequences of Optimization

  • Optimization destroys some of the naivety of the AST or basic block graph we are used to.
  • For example, TAC optimizations make it harder (but not impossible) to emit for a stack machine.
  • Because sometimes we need to load a partial result which is not at the top of the stack.

slide-71
SLIDE 71

Backend

  • Isn't actually magic after all.
slide-72
SLIDE 72

Example: jcpp

  • The C Preprocessor can only be implemented by hand.

  • There are lots of nearly-correct implementations.
  • There are very few correct implementations.
slide-73
SLIDE 73

Example: iengine

slide-74
SLIDE 74

Example: SQL

slide-75
SLIDE 75

Example: Apache Pig

slide-76
SLIDE 76

Party Tricks in C

  • Some C compilers accept nested comments.
  • /* /* something */ still a comment */
  • Some do not.
  • /* /* something */ error */
  • Write a program which prints which type of compiler you are using.

  • It must compile on all compilers.
  • It can be done elegantly in a couple of lines.
  • No preprocessor tricks.
  • No pragmas or system hacks.
slide-77
SLIDE 77

Party Tricks in C++

  • C++ was originally a preprocessor called cfront.
  • It got standardized by ISO.
  • Write a program which tells you which type of compiler you are using.

  • It must compile on all compilers.
  • It can be done elegantly in a couple of lines.
  • You know the rest.
slide-78
SLIDE 78

Party Tricks with VMs

slide-79
SLIDE 79

Office Hours

slide-80
SLIDE 80

Thank you

“Begin at the beginning,” the King said gravely, “and go on till you come to the end: then stop.”