 
              This slide contains no jokes.
How to Write Compilers and solve data transformation problems. Shevek shevek@anarres.org shevek@nebula.com
Introduction ● I have written a lot of compilers. ● Sequential and parallel languages. ● Machine code and interpreted. ● Some of them didn't work! ● This presentation contains practical knowledge. ● Almost nothing in it is researched. ● The gaps are stuff I don't know, or don't need. ● Some basic material is omitted. ● Read a book. ● Ask questions! Already!
Nebula http://www.nebula.com/careers/
Example of a Compiler
The Mini-language Philosophy ● Describe your problem in some language. ● Either implement the description in code (1 month), or write a compiler for that language (1 month). ● Update the problem description. ● Either reimplement the description in code (1 month), or run your compiler again (1 week!).
Why Write a Compiler? ● Save time over the life of a project. ● Simplify the runtime. ● Detect errors earlier.
Errors in Development ● IDE highlights bug. ● From static analyzer. ● Code leaves developer's desktop. ● Runtime errors detected. ● Say prayers. ● Kind of expensive.
Why Not Write a Compiler? ● Reinventing world, plus dog. ● Usually an expression parser. ● Write a library for an existing language. ● Use fluent APIs. ● Cognitive load. ● This applies to annotations as well. – Note to self: Rag on python a bit. ● Confusing education with production.
General Engineering ● TIP: Write type-safe code. ● TIP: Write synchronous code. ● TIP: Write code with a well-defined thread contract. ● TIP: Use monads or immutable structures.
Overview of a Compiler ● Treat the stages independently. ● Give each stage a contract. ● TIP: Make each stage check its input. ● TIP: Treat the input as a read-only structure.
Example Stage: The Front End ● Each phase: ● Consumes some previous outputs. ● Produces one immutable output. ● The last phase ● Produces a consolidated output structure, which is also immutable. ● TIP: Use Context and InstanceMap patterns.
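A minimal sketch of the phase idea above: each phase checks its input, treats it as read-only, and returns one new immutable value. All names here (PhaseSketch, Tokens, Counts, TOKENIZE, COUNT) are hypothetical and are not taken from the slides or from any library.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical names throughout; the slides do not define these types.
final class PhaseSketch {

    /** Immutable output of a hypothetical tokenizing phase. */
    static final class Tokens {
        final List<String> words;
        Tokens(List<String> words) {
            // Defensive copy plus unmodifiable view: later phases cannot change it.
            this.words = Collections.unmodifiableList(new ArrayList<>(words));
        }
    }

    /** Immutable output of a hypothetical counting phase. */
    static final class Counts {
        final int wordCount;
        Counts(int wordCount) { this.wordCount = wordCount; }
    }

    /** A phase checks its input, reads it, and returns a new immutable value. */
    interface Phase<I, O> {
        O run(I input);
    }

    static final Phase<String, Tokens> TOKENIZE = input -> {
        if (input == null) throw new IllegalArgumentException("source text is required");
        List<String> words = new ArrayList<>();
        for (String w : input.trim().split("\\s+")) {
            if (!w.isEmpty()) words.add(w);
        }
        return new Tokens(words);
    };

    static final Phase<Tokens, Counts> COUNT = input -> {
        if (input == null) throw new IllegalArgumentException("tokens are required");
        return new Counts(input.words.size());
    };

    /** The phases are chained; no phase modifies what an earlier phase produced. */
    static Counts compile(String source) {
        return COUNT.run(TOKENIZE.run(source));
    }

    public static void main(String[] args) {
        System.out.println(compile("select a from t").wordCount); // prints 4
    }
}
```

Because every output is immutable, a bug in one phase cannot corrupt the data an earlier phase produced, which is what makes bugs easier to localize.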
Mistakes ● Writing an ugly monolith. ● Cannot debug or modify. ● Modifying data structures. ● Cannot localize bugs. ● No clear contract for the structure. ● Attaching metadata to the parse tree. ● It looks like a christmas tree.
Assembling Compilers ● A compiler S → T might consist of: ● S → X, X → Y, Y → T ● TIP: Write the data structures first. ● TIP: Use service discovery. ● TIP: Declare phase dependencies. ● TIP: Use jgrapht. ● TIP: Dijkstra's shortest path. ● TIP: Print everything. ● TIP: Use graphviz.
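A sketch of how declared phase dependencies could be turned into a compiler chain. It uses a plain breadth-first search, which is what Dijkstra's shortest path reduces to when every phase costs the same; it does not use jgrapht. The Phase class and the plan method are hypothetical names for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class PhaseChain {

    /** A hypothetical phase that translates representation `from` into `to`. */
    static final class Phase {
        final String from, to;
        Phase(String from, String to) { this.from = from; this.to = to; }
        @Override public String toString() { return from + " -> " + to; }
    }

    /** Returns the shortest chain of phases from source to target, or an empty list if none exists. */
    static List<Phase> plan(List<Phase> phases, String source, String target) {
        Map<String, Phase> reachedBy = new HashMap<>(); // representation -> phase that first produced it
        Deque<String> queue = new ArrayDeque<>();
        queue.add(source);
        while (!queue.isEmpty()) {
            String current = queue.remove();
            if (current.equals(target)) break;
            for (Phase p : phases) {
                boolean unseen = !p.to.equals(source) && !reachedBy.containsKey(p.to);
                if (p.from.equals(current) && unseen) {
                    reachedBy.put(p.to, p);
                    queue.add(p.to);
                }
            }
        }
        // Walk backwards from the target to recover the chain.
        List<Phase> chain = new ArrayList<>();
        String at = target;
        while (!at.equals(source)) {
            Phase p = reachedBy.get(at);
            if (p == null) return Collections.emptyList(); // target is unreachable
            chain.add(p);
            at = p.from;
        }
        Collections.reverse(chain);
        return chain;
    }

    public static void main(String[] args) {
        List<Phase> phases = Arrays.asList(
            new Phase("S", "X"), new Phase("X", "Y"), new Phase("Y", "T"), new Phase("S", "Y"));
        System.out.println(plan(phases, "S", "T")); // prints [S -> Y, Y -> T]
    }
}
```

Printing the chosen chain, as in main, is in the spirit of the "print everything" tip: it shows which phases were actually assembled.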
Language Design ● Languages should have low information content. ● Low information content (or high redundancy) increases the ability of the compiler to detect errors. ● An arbitrary (randomly generated) input should have a high probability of being invalid. ● Note to self: Rag on Perl a bit, but not as much as python. ● See @Override in Java; C++ had no equivalent until the override specifier arrived in C++11. ● The output of a compiler will probably have very high information content.
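The @Override point can be seen in a few lines. The annotation adds no new information, which is exactly why the compiler can use it to catch a mistake. Shape and Square are hypothetical classes for illustration.

```java
final class OverrideSketch {

    static class Shape {
        double area() { return 0.0; }
    }

    static class Square extends Shape {
        final double side;
        Square(double side) { this.side = side; }

        // @Override is redundant on purpose: the compiler verifies that this
        // method really overrides Shape.area(). Misspelling it as aera() would
        // be a compile-time error instead of silently adding a new method.
        @Override
        double area() { return side * side; }
    }

    public static void main(String[] args) {
        Shape s = new Square(3.0);
        System.out.println(s.area()); // prints 9.0
    }
}
```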
More Language Design ● The programmer has better things to do than: ● Format code. – Make your code auto-formattable. – Do not make whitespace significant. ● Work out the valid options for … – Parameter types to an overloaded function. – Available symbols, variables or types. ● Keep track of whether a variable exists. – Grrrrr. ● You might not need a syntax. ● Just add semantics to an existing syntax, e.g. XML.
Lexer and Parser ● The lexer turns a sequence of characters into a sequence of tokens. ● The parser turns the sequence of tokens into a parse tree. ● They are generally rules-based.
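A sketch of a rules-based lexer, assuming a tiny SQL-like token set chosen only for illustration. Each rule is a named group in one regular expression; the token types, the keyword list and the class name LexerSketch are all assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LexerSketch {

    static final class Token {
        final String type, text;
        Token(String type, String text) { this.type = type; this.text = text; }
        @Override public String toString() { return type + "(" + text + ")"; }
    }

    // One named group per rule. Order matters: KEYWORD is tried before IDENT.
    private static final Pattern RULES = Pattern.compile(
        "(?<WS>\\s+)"
        + "|(?<KEYWORD>(?:select|from|where|and)\\b)"
        + "|(?<IDENT>[A-Za-z_][A-Za-z0-9_]*)"
        + "|(?<NUMBER>[0-9]+)"
        + "|(?<OP>[<>=,*])");

    // Whitespace is matched but never emitted, so WS is not listed here.
    private static final String[] TYPES = {"KEYWORD", "IDENT", "NUMBER", "OP"};

    /** Turns a sequence of characters into a sequence of tokens. */
    static List<Token> lex(String input) {
        List<Token> tokens = new ArrayList<>();
        Matcher m = RULES.matcher(input);
        int pos = 0;
        while (pos < input.length()) {
            m.region(pos, input.length());
            if (!m.lookingAt()) {
                throw new IllegalArgumentException("unexpected character at offset " + pos);
            }
            for (String type : TYPES) {
                if (m.group(type) != null) tokens.add(new Token(type, m.group(type)));
            }
            pos = m.end();
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(lex("select a0 from t where a0 < 5"));
    }
}
```

The parser then only sees tokens such as KEYWORD(select) and never has to think about whitespace or character classes.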
Parsers: Ambiguity ● Sentence = NounPhrase Verb NounPhrase . ● NounPhrase = Article Adjective? Noun ● Input: The old man the boats. ● What happened?
Parsers: A Parse Tree
Parsers: LL ● An example parser for SQL
    public Statement parse_statement() {
        Token t = token();
        switch (t.type) {
            case SELECT:
                return parse_select();
            ...
        }
    }

    public Select parse_select() {
        Select s = new Select();
        s.expressions = parse_expression_list();
        s.tables = parse_from_list();
        s.where = parse_where_clause();
        return s;
    }
Parsers: LL s.expressions = parse_expression_list(); s.tables = parse_from_list(); s.where = parse_where_clause();
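A runnable sketch in the same recursive-descent style, assuming a much smaller grammar than real SQL: select name_list from name_list, with an optional where clause of the form name operator operand. Tokens are assumed to be separated by whitespace so that no lexer is needed. LlSketch and its helper methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class LlSketch {

    static final class Select {
        final List<String> expressions = new ArrayList<>();
        final List<String> tables = new ArrayList<>();
        String where; // null when there is no where clause
    }

    private final List<String> tokens;
    private int pos = 0;

    LlSketch(String source) {
        // Assumption: tokens are whitespace-separated, e.g. "select a , b from t".
        this.tokens = Arrays.asList(source.trim().split("\\s+"));
    }

    private String peek() { return pos < tokens.size() ? tokens.get(pos) : "<eof>"; }

    private String next() {
        String t = peek();
        pos++;
        return t;
    }

    private void expect(String wanted) {
        String got = next();
        if (!got.equals(wanted)) {
            throw new IllegalStateException("expected " + wanted + " but found " + got);
        }
    }

    private String name() {
        String t = next();
        if (!t.matches("[A-Za-z_][A-Za-z0-9_]*")) {
            throw new IllegalStateException("expected a name but found " + t);
        }
        return t;
    }

    // select = "select" name_list "from" name_list [ "where" name op operand ]
    Select parseSelect() {
        expect("select");
        Select s = new Select();
        parseNameList(s.expressions);
        expect("from");
        parseNameList(s.tables);
        if (peek().equals("where")) {
            next();
            s.where = parseComparison();
        }
        expect("<eof>");
        return s;
    }

    // name_list = name ( "," name )*
    private void parseNameList(List<String> out) {
        out.add(name());
        while (peek().equals(",")) {
            next();
            out.add(name());
        }
    }

    private String parseComparison() {
        String left = name();
        String op = next();
        if (!op.matches("[<>=]")) throw new IllegalStateException("expected an operator but found " + op);
        String right = next();
        if (!right.matches("[A-Za-z0-9_]+")) throw new IllegalStateException("expected an operand but found " + right);
        return left + " " + op + " " + right;
    }
}
```

Each grammar rule becomes one method, and the method decides what to do by looking only at the next token, which is the defining property of this style of parser.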
Parsers: LR ● An LR parser looks at the top few tokens on the stack, and decides one of two things: ● Push the new token onto the stack. ● Reduce the top of the stack to a compound token. ● LR parsers tend to be automatically generated. ● TIP: Always use LR parsers.
Parsers: LR Stack: select expression_list from_list where a0 < 5 ... Rule: a0 < 5 → expr
Parsers: LR Stack: select expression_list from_list where expr and b1 > 7 … Rule: b1 > 7 → expr
Parsers: LR Stack: select expression_list from_list where expr and expr … Rule: expr and expr → expr
Parsers: LR Stack: select expression_list from_list where expr ... Rule: where expr → where_clause
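The four steps above can be replayed with a hand-written shift-reduce loop. This is only a toy: a real LR parser is driven by generated tables, whereas this sketch hard-codes the three rules from the trace and uses one token of lookahead to decide between shifting "and" and reducing the where clause. ShiftReduceSketch is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ShiftReduceSketch {

    /** Tries one reduction on the top of the stack; returns true if one was applied. */
    static boolean reduce(List<String> stack, String lookahead) {
        int n = stack.size();
        // Rule: operand (< | >) operand -> expr
        if (n >= 3 && (stack.get(n - 2).equals("<") || stack.get(n - 2).equals(">"))) {
            replaceTop(stack, 3, "expr");
            return true;
        }
        // Rule: expr and expr -> expr
        if (n >= 3 && stack.get(n - 3).equals("expr") && stack.get(n - 2).equals("and")
                && stack.get(n - 1).equals("expr")) {
            replaceTop(stack, 3, "expr");
            return true;
        }
        // Rule: where expr -> where_clause, only if the next token is not "and".
        if (n >= 2 && stack.get(n - 2).equals("where") && stack.get(n - 1).equals("expr")
                && !"and".equals(lookahead)) {
            replaceTop(stack, 2, "where_clause");
            return true;
        }
        return false;
    }

    private static void replaceTop(List<String> stack, int count, String symbol) {
        for (int i = 0; i < count; i++) stack.remove(stack.size() - 1);
        stack.add(symbol);
    }

    static List<String> parse(List<String> tokens) {
        List<String> stack = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            stack.add(tokens.get(i)); // shift
            String lookahead = i + 1 < tokens.size() ? tokens.get(i + 1) : null;
            while (reduce(stack, lookahead)) {
                // keep reducing until no rule matches the top of the stack
            }
        }
        return stack;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("where", "a0", "<", "5", "and", "b1", ">", "7");
        System.out.println(parse(tokens)); // prints [where_clause]
    }
}
```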
Parsers: Ambiguity ● Expressions: 5 + 6; 3 * (5 + 6); (5 + 6); (6) ● Lists: (6, 7, 8); (6, 7); (); (6) ● F^HOops! ● Perl has +{}, @x = (6), $x = (6) ● Look for Meredith Patterson's 28c3 talk.
Parsers: Shift-Reduce Conflict
    statement:
        {ifelse} if ( condition ) then statement else statement |
        {if} if ( condition ) then statement ;

    if (a) if (bar) foo(); else baz();

Two possible readings:
    if (a)
        if (bar)
            foo();
        else
            baz();

    if (a)
        if (bar)
            foo();
    else
        baz();
Parsers: Shift-Reduce Conflict
    statement:
        {ifelse} if ( condition ) then statement else statement |
        {if} if ( condition ) then statement ;

    if (a) if (bar) foo(); else baz();

Initial stack: if[0] condition if[1] condition statement else …
Rule: … if condition statement → statement
Result: if[0] condition statement[1]
Rule: … if condition statement [else statement] → (shift else)
Result: if[0] condition if[1] condition statement else
Parsers: Factoring in LR
    statement =
        {no_dangling} no_dangling_statement { -> no_dangling_statement.statement } |
        {dangling} dangling_statement { -> dangling_statement.statement } ;

    /* productions NOT ending in 'statement' */
    no_dangling_statement { -> statement } =
        {comp} compound_statement { -> compound_statement.statement } |
        {exp} expression_statement { -> expression_statement.statement } |
        {jmp} jump_statement { -> jump_statement.statement } |
        {if_else} kw_if tok_lpar expression tok_rpar no_dangling_statement kw_else [other]:no_dangling_statement
            { -> New statement.if_else(expression, no_dangling_statement.statement, other.statement) } |
        {catch} catch_statement { -> catch_statement.statement } ;

    /* productions ending in 'statement' */
    dangling_statement { -> statement } =
        {label} labeled_statement { -> labeled_statement.statement } |
        {select} selection_statement { -> selection_statement.statement } |
        {iter} iteration_statement { -> iteration_statement.statement } ;
Parsers: Priority in Bison ● For reference only:
    ...
    %nonassoc LOWER_THAN_ELSE
    %nonassoc L_ELSE
    ...
    %%
    ...
    statement : ...
        | L_IF '(' nv_list_exp ')' statement opt_else
        | ...
        ;
    opt_else : %prec LOWER_THAN_ELSE
        | L_ELSE statement
        ;
Parsers: PEGs ● Like an LR CFG but where rules have priority.
    statement:
        {ifelse} if ( condition ) then statement else statement |
        {if} if ( condition ) then statement ;
● PEG makes a deterministic decision in case of ambiguity. ● This also means your language design is a dog's breakfast.
SableCC ● A beautiful Java LR parser generator. ● Somewhat under-documented. ● TIP: Use SableCC.
SableCC: Example Grammar ● A fragment of a SableCC grammar:
    /* 6.5.9 equality-expression */
    equality_expression { -> expression } =
        {no} relational_expression { -> relational_expression.expression } |
        {eq} equality_expression tok_eq_eq relational_expression
            { -> New expression.eq(
                equality_expression.expression,
                relational_expression.expression) } |
        {ne} equality_expression tok_ne relational_expression
            { -> New expression.ne(
                equality_expression.expression,
                relational_expression.expression) } ;

    /* 6.5.10 AND-expression */
    and_expression { -> expression } =
        {no} equality_expression { -> equality_expression.expression } |
        {and} and_expression tok_and equality_expression
            { -> New expression.bitwise_and(
                and_expression.expression,
                equality_expression.expression) } ;
Concrete Syntax Trees ● Lists are parsed using recursion.
    /* 6.5.2 argument-expression-list */
    argument_expression_list =
        {single} assignment_expression |
        {list} argument_expression_list tok_comma assignment_expression ;
Ugly, and doesn't even fit on the slide. ● We can embed code to disentangle it.
    argument_list
        : argument { $$ = newAV(); av_push($$, $1); }
        | argument_list ',' argument { av_push($1, $3); $$ = $1; }
        ;
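If the parser generator cannot embed code, the same disentangling can be done afterwards by walking the recursive list nodes. A sketch, assuming a hypothetical ListNode class that mirrors the {single} and {list} alternatives above; it is not the node type SableCC would generate.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class ListFlattenSketch {

    /** Hypothetical CST node for: list = {single} item | {list} list ',' item */
    static final class ListNode {
        final ListNode rest; // null for the {single} alternative
        final String item;
        ListNode(ListNode rest, String item) { this.rest = rest; this.item = item; }
    }

    /** Turns the left-recursive chain into a flat list in source order. */
    static List<String> flatten(ListNode node) {
        List<String> items = new ArrayList<>();
        // The outermost node holds the LAST item, so collect and then reverse.
        for (ListNode n = node; n != null; n = n.rest) {
            items.add(n.item);
        }
        Collections.reverse(items);
        return items;
    }

    public static void main(String[] args) {
        // CST for the argument list "a, b, c"
        ListNode cst = new ListNode(new ListNode(new ListNode(null, "a"), "b"), "c");
        System.out.println(flatten(cst)); // prints [a, b, c]
    }
}
```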