How to Write Compilers and solve data transformation problems.
This slide contains no jokes.
Shevek shevek@anarres.org shevek@nebula.com
Introduction
- I have written a lot of compilers.
- Sequential and parallel languages.
- Machine code and interpreted.
- Some of them didn't work!
- This presentation contains practical knowledge.
- Almost nothing in it is researched.
- The gaps are stuff I don't know, or don't need.
- Some basic material is omitted.
- Read a book.
- Ask questions! Already!
Nebula
http://www.nebula.com/careers/
Example of a Compiler
The Mini-language Philosophy
- Describe your problem in some language.
- Implement the description in code.
- 1 month
- Write a compiler for that language.
- 1 month
- Update the problem description.
- Reimplement the description in code.
- 1 month
- Run your compiler again.
- 1 week!
Why Write a Compiler?
- Save time over the life of a project.
- Simplify the runtime.
- Detect errors earlier.
Errors in Development
[Figure labels: Code leaves developer's desktop; Runtime errors detected; Kind of expensive; Say prayers; IDE highlights bug; From static analyzer]
Why Not Write a Compiler?
- Reinventing world, plus dog.
- Usually an expression parser.
- Write a library for an existing language.
- Use fluent APIs.
- Cognitive load.
- This applies to annotations as well.
– Note to self: Rag on python a bit.
- Confusing education with production.
General Engineering
- TIP: Write type-safe code.
- TIP: Write synchronous code.
- TIP: Write code with a well-defined thread contract.
- TIP: Use monads or immutable structures.
Overview of a Compiler
- Treat the stages independently.
- Give each stage a contract.
- TIP: Make each stage check its input.
- TIP: Treat the input as a read-only structure.
Example Stage: The Front End
- Each phase:
- Consumes some previous outputs.
- Produces one immutable output.
- The last phase
- Produces a consolidated output structure, which is also immutable.
- TIP: Use Context and InstanceMap patterns.
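The slide does not define the InstanceMap pattern, so the following is only a minimal sketch under one assumption: that it means a class-keyed map in which each phase stores exactly one immutable output and later phases look up earlier outputs by type. The names InstanceMap, SymbolTable, CheckResult and FrontEnd are hypothetical.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical class-keyed map: one output per phase, looked up by its type. */
final class InstanceMap {
    private final Map<Class<?>, Object> instances = new HashMap<>();

    /** Stores a phase output; refuses to overwrite, so outputs are write-once. */
    <T> void put(Class<T> type, T value) {
        if (instances.containsKey(type))
            throw new IllegalStateException("Output already produced: " + type.getSimpleName());
        instances.put(type, type.cast(value));
    }

    /** Fetches an earlier phase's output, failing loudly if that phase has not run. */
    <T> T get(Class<T> type) {
        Object value = instances.get(type);
        if (value == null)
            throw new IllegalStateException("Missing input: " + type.getSimpleName());
        return type.cast(value);
    }
}

/** Output of a hypothetical symbol-collection phase: an unmodifiable name-to-type table. */
final class SymbolTable {
    final Map<String, String> types;

    SymbolTable(Map<String, String> types) {
        this.types = Collections.unmodifiableMap(new HashMap<>(types));
    }
}

/** Output of a hypothetical checking phase that consumes the SymbolTable. */
final class CheckResult {
    final int symbolCount;

    CheckResult(int symbolCount) {
        this.symbolCount = symbolCount;
    }
}

final class FrontEnd {
    static InstanceMap run() {
        InstanceMap context = new InstanceMap();
        // Phase 1 produces one immutable output.
        Map<String, String> raw = new HashMap<>();
        raw.put("x", "int");
        raw.put("y", "long");
        context.put(SymbolTable.class, new SymbolTable(raw));
        // Phase 2 consumes phase 1's output and produces its own.
        SymbolTable symbols = context.get(SymbolTable.class);
        context.put(CheckResult.class, new CheckResult(symbols.types.size()));
        return context;
    }
}
```

Because a phase can only add a new entry, a bug in one phase cannot silently corrupt the output of an earlier one.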
Mistakes
- Writing an ugly monolith.
- Cannot debug or modify.
- Modifying data structures.
- Cannot localize bugs.
- No clear contract for the structure.
- Attaching metadata to the parse tree.
- It looks like a christmas tree.
Assembling Compilers
- A compiler S → T might consist of:
- S → X, X → Y, Y → T
- TIP: Write the data structures first.
- TIP: Use service discovery.
- TIP: Declare phase dependencies.
- TIP: Use jgrapht.
- TIP: Dijkstra's shortest path.
- TIP: Print everything.
- TIP: Use graphviz.
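As a rough illustration of declaring phase dependencies and searching for a chain S → X → Y → T, here is a sketch in plain Java. It uses breadth-first search, which finds a shortest chain when every phase costs the same; it is a stand-in for, not an example of, jgrapht or Dijkstra's algorithm. The class PhasePlanner and its methods are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class PhasePlanner {
    // Declared phases: source representation -> representations it can be compiled to.
    private final Map<String, List<String>> edges = new LinkedHashMap<>();

    /** Declares that a phase exists which turns 'from' into 'to'. */
    void declare(String from, String to) {
        edges.computeIfAbsent(from, k -> new ArrayList<>()).add(to);
    }

    /** Returns the representations visited from source to target, or an empty list if no chain exists. */
    List<String> plan(String source, String target) {
        Map<String, String> cameFrom = new HashMap<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(source);
        cameFrom.put(source, null);
        while (!queue.isEmpty()) {
            String current = queue.remove();
            if (current.equals(target)) {
                // Walk the predecessor links back to the source.
                List<String> path = new ArrayList<>();
                for (String at = target; at != null; at = cameFrom.get(at))
                    path.add(at);
                Collections.reverse(path);
                return path;
            }
            for (String next : edges.getOrDefault(current, Collections.<String>emptyList())) {
                if (!cameFrom.containsKey(next)) {
                    cameFrom.put(next, current);
                    queue.add(next);
                }
            }
        }
        return Collections.emptyList();
    }
}
```

Declaring S → X, X → Y and Y → T and then calling plan("S", "T") yields the chain S, X, Y, T; printing that list is a cheap way to follow the "print everything" tip.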
Language Design
- Languages should have low information content.
- Low information content (or high redundancy) increases the ability of the compiler to detect errors.
- An arbitrary (randomly generated) input should have a high probability of being invalid.
- Note to self: Rag on Perl a bit, but not as much as Python.
- See @Override in Java; C++ had no equivalent until the C++11 override specifier.
- The output of a compiler will probably have very high information content.
More Language Design
- The programmer has better things to do than:
- Format code.
– Make your code auto-formattable.
– Do not make whitespace significant.
- Work out the valid options for …
– Parameter types to an overloaded function.
– Available symbols, variables or types.
- Keep track of whether a variable exists.
– Grrrrr.
- You might not need a syntax.
- Just add semantics to an existing syntax, e.g. XML.
Lexer and Parser
- The lexer turns a sequence of characters into a sequence of tokens.
- The parser turns the sequence of tokens into a parse tree.
- They are generally rules-based.
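To make the lexer half concrete, here is a small hand-written sketch. The three token types and the class names Token and Lexer are assumptions chosen for illustration; a generated lexer would derive the same behaviour from rules.

```java
import java.util.ArrayList;
import java.util.List;

final class Token {
    enum Type { IDENT, NUMBER, SYMBOL }

    final Type type;
    final String text;

    Token(Type type, String text) {
        this.type = type;
        this.text = text;
    }

    @Override
    public String toString() {
        return type + "(" + text + ")";
    }
}

final class Lexer {
    /** Turns a sequence of characters into a sequence of tokens, skipping whitespace. */
    static List<Token> lex(String input) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
                continue;
            }
            int start = i;
            if (Character.isLetter(c)) {
                // Rule: identifier = letter (letter | digit)*
                while (i < input.length() && Character.isLetterOrDigit(input.charAt(i))) i++;
                tokens.add(new Token(Token.Type.IDENT, input.substring(start, i)));
            } else if (Character.isDigit(c)) {
                // Rule: number = digit+
                while (i < input.length() && Character.isDigit(input.charAt(i))) i++;
                tokens.add(new Token(Token.Type.NUMBER, input.substring(start, i)));
            } else {
                // Rule: any other single character is a symbol token.
                i++;
                tokens.add(new Token(Token.Type.SYMBOL, String.valueOf(c)));
            }
        }
        return tokens;
    }
}
```

For the input "a0 < 5" this produces IDENT(a0), SYMBOL(<), NUMBER(5), which is the kind of token sequence a parser then consumes.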
Parsers: Ambiguity
- Sentence = NounPhrase Verb NounPhrase .
- NounPhrase = Article Adjective? Noun
- The old man the boats.
- What happened?
Parsers: A Parse Tree
Parsers: LL
- An example parser for SQL
public Statement parse_statement() {
    Token t = token();
    switch (t.type) {
        case SELECT:
            return parse_select();
        ...
    }
}
public Select parse_select() {
    Select s = new Select();
    s.expressions = parse_expression_list();
    s.tables = parse_from_list();
    s.where = parse_where_clause();
    return s;
}
Parsers: LR
- An LR parser looks at the top few tokens on the stack, and decides one of two things:
- Push the new token onto the stack.
- Reduce the top of the stack to a compound token.
- LR parsers tend to be automatically generated.
- TIP: Always use LR parsers.
Parsers: LR
Stack: select expression_list from_list where a0 < 5 ...
Rule: a0 < 5 → expr
Parsers: LR
Stack: select expression_list from_list where expr and b1 > 7 …
Rule: b1 > 7 → expr
Parsers: LR
Stack: select expression_list from_list where expr and expr …
Rule: expr and expr → expr
Parsers: LR
Stack: select expression_list from_list where expr ...
Rule: where expr → where_clause
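The sequence of stack states above can be replayed with a hand-written shift-reduce loop. This is only a sketch: a generated LR parser drives the same two decisions from a table, whereas here the three rules are hard-coded, tokens are plain strings, and the class ShiftReduce with its RESERVED set is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class ShiftReduce {
    // Keywords and nonterminal names, so they are never mistaken for identifiers.
    private static final Set<String> RESERVED = new HashSet<>(Arrays.asList(
            "select", "where", "and", "expr", "where_clause", "expression_list", "from_list"));

    /** Runs the shift/reduce loop and returns the final stack. */
    static List<String> parse(List<String> input) {
        List<String> stack = new ArrayList<>();
        int next = 0;
        while (true) {
            String lookahead = next < input.size() ? input.get(next) : null;
            if (reduce(stack, lookahead)) continue;  // Reduce while any rule matches the top of the stack.
            if (lookahead == null) return stack;     // Nothing left to shift.
            stack.add(lookahead);                    // Shift the new token onto the stack.
            next++;
        }
    }

    private static boolean reduce(List<String> stack, String lookahead) {
        int n = stack.size();
        // Rule: identifier comparison number -> expr
        if (n >= 3 && isIdent(stack.get(n - 3)) && isCompare(stack.get(n - 2)) && isNumber(stack.get(n - 1)))
            return replaceTop(stack, 3, "expr");
        // Rule: expr and expr -> expr
        if (n >= 3 && stack.get(n - 3).equals("expr") && stack.get(n - 2).equals("and")
                && stack.get(n - 1).equals("expr"))
            return replaceTop(stack, 3, "expr");
        // Rule: where expr -> where_clause, applied only once the input is exhausted.
        if (n >= 2 && stack.get(n - 2).equals("where") && stack.get(n - 1).equals("expr") && lookahead == null)
            return replaceTop(stack, 2, "where_clause");
        return false;
    }

    private static boolean replaceTop(List<String> stack, int count, String symbol) {
        for (int i = 0; i < count; i++) stack.remove(stack.size() - 1);
        stack.add(symbol);
        return true;
    }

    private static boolean isIdent(String s) {
        return s.matches("[a-z_][a-z0-9_]*") && !RESERVED.contains(s);
    }

    private static boolean isCompare(String s) {
        return s.equals("<") || s.equals(">");
    }

    private static boolean isNumber(String s) {
        return s.matches("[0-9]+");
    }
}
```

The lookahead check on the last rule is what stops the parser from closing the where clause while an "and" is still waiting to be shifted.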
Parsers: Ambiguity
- Expressions:
- 5 + 6
- 3 * (5 + 6)
- (5 + 6)
- (6)
- Lists:
- (6, 7, 8)
- (6, 7)
- ()
- (6)
- F^HOops!
- Perl has +{}, @x = (6), $x = (6)
- Look for Meredith Patterson's 28c3 talk.
Parsers: Shift-Reduce Conflict
statement:
    {ifelse} if ( condition ) then statement else statement |
    {if} if (condition) then statement ;
if (a) if (bar) foo(); else baz();
if (a)
    if (bar) foo();
    else baz();
if (a)
    if (bar) foo();
else baz();
Parsers: Shift-Reduce Conflict
statement:
    {ifelse} if ( condition ) then statement else statement |
    {if} if (condition) then statement ;
if (a) if (bar) foo(); else baz();
Initial stack: if[0] condition if[1] condition statement else …
Rule: … if condition statement → statement
Result: if[0] condition statement[1]
Rule: … if condition statement [else statement] → (shift else)
Result: if[0] condition if[1] condition statement else
Parsers: Factoring in LR
statement =
    {no_dangling} no_dangling_statement
        { -> no_dangling_statement.statement } |
    {dangling} dangling_statement
        { -> dangling_statement.statement } ;
/* productions NOT ending in 'statement' */
no_dangling_statement { -> statement } =
    {comp} compound_statement
        { -> compound_statement.statement } |
    {exp} expression_statement
        { -> expression_statement.statement } |
    {jmp} jump_statement
        { -> jump_statement.statement } |
    {if_else} kw_if tok_lpar expression tok_rpar
        no_dangling_statement kw_else
        [other]:no_dangling_statement
        { -> New statement.if_else(expression,
             no_dangling_statement.statement,
             other.statement) } |
    {catch} catch_statement
        { -> catch_statement.statement } ;
/* productions ending in 'statement' */
dangling_statement { -> statement } =
    {label} labeled_statement
        { -> labeled_statement.statement } |
    {select} selection_statement
        { -> selection_statement.statement } |
    {iter} iteration_statement
        { -> iteration_statement.statement } ;
Parsers: Priority in Bison
- For reference only:
...
%nonassoc LOWER_THAN_ELSE
%nonassoc L_ELSE
...
%%
...
statement : ...
    | L_IF '(' nv_list_exp ')' statement opt_else
    | ...
    ;
opt_else : %prec LOWER_THAN_ELSE
    | L_ELSE statement
    ;
Parsers: PEGs
- Like an LR CFG but where rules have priority.
statement:
    {ifelse} if ( condition ) then statement else statement |
    {if} if (condition) then statement ;
- PEG makes a deterministic decision in case of ambiguity.
- This also means your language design is a dog's breakfast.
SableCC
- A beautiful Java LR parser generator.
- Somewhat under-documented.
- TIP: Use SableCC.
SableCC: Example Grammar
- A fragment of a SableCC grammar:
/* 6.5.9 equality-expression */
equality_expression { -> expression } =
    {no} relational_expression
        { -> relational_expression.expression } |
    {eq} equality_expression tok_eq_eq relational_expression
        { -> New expression.eq(
             equality_expression.expression,
             relational_expression.expression) } |
    {ne} equality_expression tok_ne relational_expression
        { -> New expression.ne(
             equality_expression.expression,
             relational_expression.expression) } ;
/* 6.5.10 AND-expression */
and_expression { -> expression } =
    {no} equality_expression
        { -> equality_expression.expression } |
    {and} and_expression tok_and equality_expression
        { -> New expression.bitwise_and(
             and_expression.expression,
             equality_expression.expression) } ;
Concrete Syntax Trees
- Lists are parsed using recursion.
/* 6.5.2 argument-expression-list */
argument_expression_list =
    {single} assignment_expression |
    {list} argument_expression_list tok_comma assignment_expression ;
- We can embed code to disentangle it.
argument_list :
    argument
        { $$ = newAV(); av_push($$, $1); }
    | argument_list ',' argument
        { av_push($1, $3); $$ = $1; }
    ;
Ugly, and doesn't even fit on the slide.
SableCC: Abstract Syntax Tree
- We want it to be flat.
/* 6.5.2 argument-expression-list */
argument_expression_list { -> expression* } =
    {single} assignment_expression
        { -> [assignment_expression.expression] } |
    {list} argument_expression_list tok_comma assignment_expression
        { -> [argument_expression_list.expression,
              assignment_expression.expression] } ;
- Ahhh, beauty.
SableCC: Visitors
- We want a type-safe visitor pattern.
Output:
public class AAddExpression extends PExpression {
    private PExpression left;
    private PExpression right;
    ...
    public void apply(Visitor v) {
        v.inAddExpression(this);
        left.apply(v);
        right.apply(v);
        v.outAddExpression(this);
    }
}
Input:
expression = ...
    {sub} [left]:expression [right]:expression |
    {add} [left]:expression [right]:expression |
    {rem} [left]:expression [right]:expression |
    {div} [left]:expression [right]:expression |
    {mul} [left]:expression [right]:expression |
    ... ;
SableCC: String Concatenation
- In C, constant strings concatenate.
public class StringConcatenationAnalysis extends DepthFirstAdapter {
    @Override
    public void outAMultiStringConstant(AMultiStringConstant node) {
        // Concatenate the values of all adjacent string literals.
        StringBuilder buf = new StringBuilder();
        for (TStringLiteral sl : node.getStringLiteral())
            buf.append(constants.getConstant(sl));
        String value = buf.toString();
        // Replace the multi-literal node with a single constant node.
        AStringConstant sc = new AStringConstant(new TStringLiteral(value));
        constants.addConstant(sc, value);
        node.replaceBy(sc);
    }
}
I didn't even need all the space on this slide.
SableCC: Symbol Table
- inABlockStatement()
- push(new Scope());
- outABlockStatement()
- pop();
- caseADeclarator(ADeclarator declarator)
- peek().add(declarator);
- caseAIdentifier(AIdentifier identifier)
- setSymbol(identifier, peek().get(identifier));
@Override
public void caseAIdentifierExpression(AIdentifierExpression node) {
    String name = node.getIdentifier().getText();
    ExpressionSymbol symbol = relation.getSymbol(name);
    if (symbol == null)
        errors.addError("No such field " + name + " in " + relation);
    else
        setSymbol(node, symbol);
}
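The push, pop, add and get calls listed on this slide imply a stack of scopes. A minimal sketch of such a stack follows; the class ScopeStack is hypothetical, symbols are reduced to a name and a type string, and lookup walks outward through enclosing scopes, which is an assumption that goes beyond the peek().get(identifier) shown on the slide.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

final class ScopeStack {
    private final Deque<Map<String, String>> scopes = new ArrayDeque<>();

    /** Called when entering a block statement. */
    void push() {
        scopes.push(new HashMap<String, String>());
    }

    /** Called when leaving a block statement. */
    void pop() {
        scopes.pop();
    }

    /** Called for each declarator: records the name in the innermost scope. */
    void declare(String name, String type) {
        scopes.peek().put(name, type);
    }

    /** Called for each identifier: searches innermost scope first; returns null if undeclared. */
    String lookup(String name) {
        // ArrayDeque iterates from the most recently pushed scope to the oldest.
        for (Map<String, String> scope : scopes) {
            String type = scope.get(name);
            if (type != null) return type;
        }
        return null;
    }
}
```

An inner declaration of x shadows an outer one until the inner block is popped, after which the outer declaration is visible again.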
SableCC: Managing an AST
- Simplify the AST.
- TIP: Eliminate special cases early.
- TIP: Write assertions.
- TIP: Overload getOut() to make type-safe and @Nonnull
- TIP: Make defaultCase() throw an exception.
- Now you can detect unhandled terminals.
- If you need to detect unhandled nonterminals, implement the interface instead of extending the adapter.
@Nonnull
public Type getType(Node expr) {
    Type t = (Type) getOut(expr);
    assert t != null : "No type for " + StringVisitor.toString(expr);
    return t;
}
Handling Errors
- Detect errors early; report them late.
- You might not have hit the root cause yet.
- The programmer would like to fix multiple errors at a time.
- Make them accurate, with useful context.
- TIP: Use an ErrorHandler:
– pushContext(...)
– popContext()
– addMessage(..., Throwable t, boolean fatal);
- Preserve error metadata as a structure.
- An IDE or framework can use it without parsing an exception message.
– Reminder: Rag on ANTLR a bit, if I didn't rag on it enough already for being LL.
- The + operator at line 5 character 32
- With left argument from L3C42 to L5C30
- With …
- E_CANNOTPROMOTE_INT_LONG: Cannot promote int to long.
- Insert stubs and continue?
- TIP: Use exceptions for internal and unexpected errors only!
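One possible shape for such an ErrorHandler is sketched below. Only the method names pushContext, popContext and addMessage come from the slide; the Message class, the code and text parameters standing in for the elided arguments, and the hasFatal helper are assumptions for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.List;

/** Hypothetical structured message: tools can read the fields instead of parsing text. */
final class Message {
    final String code;
    final String text;
    final List<String> context;
    final Throwable cause;
    final boolean fatal;

    Message(String code, String text, List<String> context, Throwable cause, boolean fatal) {
        this.code = code;
        this.text = text;
        this.context = Collections.unmodifiableList(context);
        this.cause = cause;
        this.fatal = fatal;
    }
}

final class ErrorHandler {
    private final Deque<String> context = new ArrayDeque<>();
    private final List<Message> messages = new ArrayList<>();

    /** Describes what the compiler is currently looking at, outermost first. */
    void pushContext(String description) {
        context.addLast(description);
    }

    void popContext() {
        context.removeLast();
    }

    /** Records a message with a snapshot of the current context; nothing is thrown. */
    void addMessage(String code, String text, Throwable t, boolean fatal) {
        messages.add(new Message(code, text, new ArrayList<>(context), t, fatal));
    }

    /** Errors are detected early but reported late, all at once. */
    List<Message> getMessages() {
        return Collections.unmodifiableList(messages);
    }

    boolean hasFatal() {
        for (Message m : messages)
            if (m.fatal) return true;
        return false;
    }
}
```

A caller would push "The + operator at line 5 character 32", push the description of the left argument, add E_CANNOTPROMOTE_INT_LONG, then pop both contexts and carry on compiling.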
Error Reporting and Recovery
- Common mistakes are easy to cater for.
bad_octal_constant = octal_constant ['8'..'9'] digit*;
bad_constant = bad_octal_constant;
bad_string_literal = 'L'? '"' s_char_seq?;
bad_character_constant = 'L'? ''' c_char_seq?;
bad_identifier = digit identifier_nondigit+;
bad_token = all;
error_expression =
    {bad_constant} bad_constant
        { -> New error_expression.bad_constant(bad_constant) } ;
primary_expression { -> expression } = ...
    {error} error_expression
        { -> New expression.error(error_expression) } |
    ...
- From most of my C-like grammars.
- The parser now throws 'Unexpected “unterminated_string_literal”'!
Error Reporting and Recovery
- Or find a synchronization point.
private Token string(char open, char close) {
    for (char c : ...) {
        ...
        else if (isLineSeparator(c)) {
            unread(c);
            // error("Unterminated string literal after " + buf);
            return new Token(INVALID, text.toString(),
                "Unterminated string literal after " + buf);
        }
        ...
    }
}
- From jcpp, the synchronization point is end of line.
- This allows preprocessing to continue even if a token is semantically bad.
- Other grammars may have synchronization points, even in LR.
More Mistakes
- Do not tie everything to a successful compile.
- Invalid code is the norm.
- Pay more attention to the utility of your compiler for invalid input than for valid input.
- Rag on Eclipse a bit, because you can't even reformat code which doesn't compile.
- NetBeans can even autocomplete within invalid code!
The Middle End
What does your language allow?
- Loops
- Flow languages like guaranteed termination of loops, and therefore tend to disallow them.
- Types
- We can have “more” or “less” typing, and there are interesting consequences.
- Minor features
- Polymorphism, overloading, garbage collection, threads, … (minor due to conceptually simpler implementation).
Types: When to Handle Them
- Compile time
- Detects errors earlier.
- This is C++.
- Run time
- More runtime complexity.
- Space and time overhead.
- Reminder: Rag on python some more.
- Don't handle
- Let the kernel handle segmentation faults.
- Hard to build a method dispatch mechanism.
Types: Partial Handling
- Partial handling at compile-time
- Detects some errors earlier.
- Does not prevent errors.
– Emitter gets partial data structures.
- Does not avoid runtime costs.
- Pike (iengine/JVM)
- Two VMs: One for typed and one for untyped runtime.
- Perl
- “known” and “unknown” comparison ops.
Creating Safety with Types
- Consider a language with type-safe indexes.
- char[8] a;
- a[5] is legal. a[9] is illegal.
- a[i] is of unknown legality unless we can prove something about i.
- SPARK/Ada can do this.
– The runtime can rely on the correctness.
- Coverity and Sparse can do this sometimes.
– The runtime cannot rely on the correctness.
- How do we compile memcpy(char[?], char[?])
- Either we expand a template, or we allow unsafe code.
- Other rat-holes include intrusive lists, which are:
- Possible in C.
- Typesafe in C++.
- Impossible in Java.
Instruction Ordering
- Consider *a++ + *a++
- Adds two consecutive chars in a string.
- The order of operations is left to right.
- Read, increment, write.
- Read, increment, write.
- Add the reads.
Instruction Ordering in C
- Consider *a++ + *a++
- C doesn't guarantee that the LHS of + is evaluated first.
- The order of operations is not well defined.
- We might not even add 2 to a!
Instruction Ordering in Java
- Consider *a++ + *a++
- Java does guarantee that the LHS of + is evaluated first, including all side-effects.
- The VM still has freedom to reorder the operations if there is no memory-visible difference.
- TIP: Use graphviz.
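Java's left-to-right guarantee can be observed directly. The sketch below mirrors *a++ + *a++ with an array index, since Java has no pointer arithmetic; the class EvaluationOrder and its helper methods are hypothetical.

```java
final class EvaluationOrder {
    /** Returns {sum, final index} for a[i++] + a[i++] over {10, 20, 30}. */
    static int[] demo() {
        int[] a = {10, 20, 30};
        int i = 0;
        // The left operand, including its increment, is evaluated before the right operand.
        int sum = a[i++] + a[i++];
        return new int[] {sum, i};
    }

    /** Logs the order in which the two operands of '-' are evaluated. */
    static String orderOfCalls() {
        StringBuilder log = new StringBuilder();
        int result = record(log, "L", 1) - record(log, "R", 2);
        return log + "=" + result;
    }

    private static int record(StringBuilder log, String name, int value) {
        log.append(name);
        return value;
    }
}
```

Here demo() reads a[0] then a[1], giving a sum of 30 with the index advanced by exactly 2, and orderOfCalls() returns "LR=-1". The equivalent C expression carries no such guarantee.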
Optimizers
- We aim to minimize some objective function.
- Usually time.
- Occasionally space.
- Three categories:
- Basic hill-climbing
- Algebraic
- NP-Hard / nonalgebraic / superoptimizers
Hill Climbing Optimization: TAC
- The cost function is time.
- Each change improves the code.
- Often used for register machines.
- The subsequent register-colouring problem is Hard.
Algebraic Optimization
- Consider a game of chess.
- We understand the valid moves.
- The objective function is victory/defeat in a leaf
node.
- We estimate the objective at each non-leaf node.
- Moves in the game-tree may immediately lose material (incur cost), but still improve our overall likelihood of winning.
- Therefore, this is not hill-climbing.
Algebraic Optimization: SQL
- Join A, B, C, D.
- These are the objective plans.
- They are leaf nodes in the algebraic tree.
- They are themselves trees!
Algebraic Optimization: SQL
- A partial result is (A x B).
- Is this more likely to lead to a fast solution than (C x D)?
- Yes, because it is smaller?
- Multiple ways to reach a partial result.
- A: 2 records
- B: 10 records
- C: 100 records
- D: 10 records
Algebraic Optimization: Example
This is really a DAG of trees, where the parents of any node are the partial results combined to compute it. Dominated partial results have been pruned.
Algebraic Optimization: Hints
- Unlike the TAC optimization where every stage was a valid answer, algebraic methods only generate an answer at the leaf node.
- You may want to prove that:
- At least one leaf is reachable.
- All leaves are reachable (for optimality).
- The algorithm finishes in reasonable polynomial time.
- Dominated partial results are pruned early.
- TIP: Breadth first. RAM is cheap, but prune lots.
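The join example with A: 2, B: 10, C: 100, D: 10 records can be worked through with a small search over partial results. This sketch rests on an assumed cost model that is not in the slides: every join is a cross product whose size is the product of its inputs, and the cost of a plan is the sum of the sizes of all intermediate results. It keeps only the cheapest plan per set of tables, which is one way of pruning dominated partial results. It enumerates every subset, so it is exponential in the number of tables and only suitable for a handful. The class JoinPlanner is hypothetical.

```java
final class JoinPlanner {
    final long[] bestCost;    // Cheapest known cost per set of tables (bitmask).
    final String[] bestPlan;  // The plan achieving that cost.

    JoinPlanner(String[] names, long[] rows) {
        int n = names.length;
        int full = 1 << n;
        bestCost = new long[full];
        bestPlan = new String[full];
        // Every proper subset of a mask is numerically smaller, so ascending order
        // guarantees that smaller partial results are solved before larger ones.
        for (int set = 1; set < full; set++) {
            if (Integer.bitCount(set) == 1) {
                // A base table costs nothing to "produce".
                bestPlan[set] = names[Integer.numberOfTrailingZeros(set)];
                continue;
            }
            long size = 1;
            for (int t = 0; t < n; t++)
                if ((set & (1 << t)) != 0) size *= rows[t];
            bestCost[set] = Long.MAX_VALUE;
            // Try every way to reach this partial result from two smaller ones.
            for (int left = (set - 1) & set; left > 0; left = (left - 1) & set) {
                int right = set & ~left;
                if (left < right) continue;  // Consider each unordered split once.
                long cost = bestCost[left] + bestCost[right] + size;
                if (cost < bestCost[set]) {
                    // Any more expensive plan for the same tables is dominated and dropped.
                    bestCost[set] = cost;
                    bestPlan[set] = "(" + bestPlan[left] + " x " + bestPlan[right] + ")";
                }
            }
        }
    }

    long cost() {
        return bestCost[bestCost.length - 1];
    }

    String plan() {
        return bestPlan[bestPlan.length - 1];
    }
}
```

Under this model the cheapest plan joins the three small tables first and brings in C last, for a total cost of 20 + 200 + 20000 = 20220. Starting from (C x D) costs at least 1000 for the first step alone, which supports the guess that the smaller partial result is the better bet.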
NP-Hard Optimization
- In one cycle, a CPU can execute:
- Ten adds
- Four multiplies
- Two divides
- Schedule this:
NP-Hard Optimization
- We use a general purpose constraint solver.
- Each operation becomes a variable.
- The value of the variable will be the instruction
index.
NP-Hard Optimization
- Each relationship becomes a constraint.
- m0 < a1, m0 < c3, etc.
- We add the other constraints to the system.
- a1 != c3, as we cannot schedule add and multiply together
NP-Hard Optimization
- We solve it.
- The solution sucks.
CSP: Symmetry
- The eight-queens problem has symmetric solutions.
- We can defeat symmetry and double the performance of our solver if we tell it never to put a queen in a5, a6, a7, a8.
- If one solution has a queen in a[5..8] then its mirror has not.
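The effect of the symmetry-breaking constraint can be checked with a plain backtracking counter. This is a sketch rather than a constraint solver; the class Queens is hypothetical, and rows are numbered from 0, so squares a5 to a8 correspond to rows 4 to 7 of the first column.

```java
final class Queens {
    /**
     * Counts n-queens solutions with one queen per column, where the queen in the
     * first column is restricted to rows [0, firstColumnRowLimit).
     */
    static int count(int n, int firstColumnRowLimit) {
        return place(new int[n], 0, n, firstColumnRowLimit);
    }

    private static int place(int[] rows, int col, int n, int limit) {
        if (col == n) return 1;
        int max = (col == 0) ? limit : n;  // The extra constraint applies to column a only.
        int total = 0;
        for (int r = 0; r < max; r++) {
            if (safe(rows, col, r)) {
                rows[col] = r;
                total += place(rows, col + 1, n, limit);
            }
        }
        return total;
    }

    /** True if a queen at (col, r) is not attacked by any queen in an earlier column. */
    private static boolean safe(int[] rows, int col, int r) {
        for (int c = 0; c < col; c++) {
            if (rows[c] == r || Math.abs(rows[c] - r) == col - c) return false;
        }
        return true;
    }
}
```

With no restriction, count(8, 8) finds all 92 solutions. With the restriction, count(8, 4) finds 46 and explores only half of the top-level branches; the other 46 are the mirror images.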
Using symmetry
- We add a new constraint:
If there are no instructions in time slot k then there must be no instructions in time slot k+1.
- This causes all instructions to be scheduled densely at 0.
- And now we optimize:
There are no instructions in time slot N.
- If the solver succeeds, we reduce N and try again.
- If it fails, the last discovered solution must have been optimal.
Revisiting Scheduling
- k is empty → k+1 is empty
- 9 is empty (later, 8 is empty, etc)
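The shrink-N loop can be sketched with a tiny backtracking solver in place of a general-purpose one. Everything concrete here is an assumption: the four operations, their dependencies, and the single resource rule that an add and a multiply may not share a slot. The dense-packing constraint is omitted because restricting slots to 0..N-1 already states that slot N and beyond are empty. The class Scheduler is hypothetical.

```java
import java.util.Arrays;

final class Scheduler {
    private final String[] kinds;  // "add" or "mul" for each operation.
    private final int[][] before;  // Pairs {x, y}: x must be scheduled strictly before y.

    Scheduler(String[] kinds, int[][] before) {
        this.kinds = kinds;
        this.before = before;
    }

    /** Each operation is a variable whose value is a slot in [0, slots); returns null if unsatisfiable. */
    int[] solve(int slots) {
        int[] slot = new int[kinds.length];
        Arrays.fill(slot, -1);
        return assign(slot, 0, slots) ? slot : null;
    }

    private boolean assign(int[] slot, int op, int slots) {
        if (op == slot.length) return true;
        for (int s = 0; s < slots; s++) {
            slot[op] = s;
            if (consistent(slot, op) && assign(slot, op + 1, slots)) return true;
        }
        slot[op] = -1;  // Undo before backtracking.
        return false;
    }

    private boolean consistent(int[] slot, int op) {
        // Precedence constraints, checked once both ends are assigned.
        for (int[] pair : before) {
            int x = pair[0], y = pair[1];
            if (slot[x] >= 0 && slot[y] >= 0 && slot[x] >= slot[y]) return false;
        }
        // Resource constraint: an add and a multiply cannot share a slot.
        for (int other = 0; other < slot.length; other++) {
            if (other != op && slot[other] == slot[op] && !kinds[other].equals(kinds[op])) return false;
        }
        return true;
    }

    /** Reduces N until the solver fails; the last solution found is then optimal. */
    int[] optimize() {
        int[] best = null;
        for (int n = kinds.length; n >= 1; n--) {
            int[] found = solve(n);
            if (found == null) break;
            best = found;
        }
        return best;
    }
}
```

For operations mul, add, add, mul with dependencies 0 before 1, 0 before 3 and 2 before 3, the solver succeeds with three slots and fails with two, so the three-slot schedule is optimal under these assumed constraints.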
Applications of SAT/CSP
- Modelling a CPU for classic superoptimization.
- Hindley-Milner type unification.
- Cluster scheduling and placement.
- Cases where you just can't be bothered to write
the code or derive the algebra.
- Beware: It might go out to lunch, or fail totally.
Consequences of Optimization
- Optimization destroys some of the naivety of the AST or basic block graph we are used to.
- For example, TAC optimizations make it harder (but not impossible) to emit for a stack machine.
- Because sometimes we need to load a partial result which is not at the top of the stack.
Backend
- Isn't actually magic after all.
Example: jcpp
- The C Preprocessor can only be implemented by hand.
- There are lots of nearly-correct implementations.
- There are very few correct implementations.
Example: iengine
Example: SQL
Example: Apache Pig
Party Tricks in C
- Some C compilers accept nested comments.
- /* /* something */ still a comment */
- Some do not.
- /* /* something */ error */
- Write a program which prints which type of compiler you are using.
- It must compile on all compilers.
- It can be done elegantly in a couple of lines.
- No preprocessor tricks.
- No pragmas or system hacks.
Party Tricks in C++
- C++ was originally a preprocessor called cfront.
- It got standardized by ISO.
- Write a program which tells you which type of compiler you are using.
- It must compile on all compilers.
- It can be done elegantly in a couple of lines.
- You know the rest.
Party Tricks with VMs
Office Hours
Thank you
“Begin at the beginning,” the King said gravely, “and go on till you come to the end: then stop.”