Compiler Construction
Mayer Goldberg \ Ben-Gurion University Saturday 2nd November, 2019
Mayer Goldberg \ Ben-Gurion University Compiler Construction 1 / 177
▶ The pipeline of the compiler ▶ Introduction to syntactic analysis ▶ Further steps in ocaml
▶ The pipeline
▶ Syntactic analysis ▶ Semantic analysis ▶ Code generation
▶ The compiler for the course ▶ The language of S-expressions ▶ More ocaml
▶ The interpreter as an evaluation function ▶ The compiler as a translator & optimizer ▶ We explored the relations between interpretation & compilation
▶ Understanding the syntax of the program
▶ What kinds of statements & expressions there are ▶ What are the various parts of these statements & expressions ▶ Are they syntactically correct
▶ Understanding the meaning of the program
▶ Do the operations make sense? ▶ What are their types? ▶ Are they used in accordance with their types? ▶ On what data is the program acting? ▶ What is returned?
▶ Once we understand the syntax and meaning of a sentence, we
▶ Syntactic analysis
▶ Scanning ▶ Parsing ▶ Reading ▶ Tag-Parsing
▶ Semantic analysis ▶ Code generation
[Figure: the compiler pipeline. chars → Scanner → tokens → Reader → sexprs → Tag-Parser → ASTs → Semantic Analyser → ASTs → Code Generator → asm / mach lang. Parser: Reader + Tag-Parser.]
▶ Function: What they do ▶ Dependencies: Which stages depend on which others ▶ Complexity: How difficult it is to perform a stage
▶ Understanding syntax is relatively straightforward (unlike in
▶ Understanding meaning is much harder than understanding
▶ Meaning is built upon syntax (in natural languages, syntax &
▶ Code generation is relatively straightforward (template-based)
▶ We distinguish [at least] two levels of optimizations:
▶ High-level optimizations (closer to the source language) would
▶ Low-level optimizations (closer to assembly language) would go
▶ The test during run-time has been eliminated ▶ The code is shorter ▶ Possibly lead to further, cascading optimizations
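A minimal sketch of this kind of high-level optimization, assuming a hypothetical toy AST (the type `expr` and the function `simplify` are illustrative names, not part of the course compiler): when the test of an `if` is a known constant, the branch can be chosen at compile time, so no test remains at run time.

```ocaml
(* Hypothetical toy AST, for illustration only. *)
type expr =
  | Const of int
  | Bool of bool
  | Var of string
  | If of expr * expr * expr

(* Replace an `if` whose test is a known boolean by the chosen branch. *)
let rec simplify = function
  | If (test, dit, dif) ->
    (match simplify test with
     | Bool true -> simplify dit
     | Bool false -> simplify dif
     | test' -> If (test', simplify dit, simplify dif))
  | e -> e
```

Simplifying the test first is what allows one elimination to cascade into another.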
▶ Saved 1 cycle ▶ Made the code smaller ▶ If this code appears within a loop, gains shall be multiplied…
▶ Concrete syntax ▶ Abstract syntax ▶ Abstract Syntax-Tree (AST) ▶ Token ▶ Delimiter ▶ Whitespace
▶ It’s one-dimensional ▶ Lacking in structure
▶ No nesting ▶ No sub-expressions
▶ Difficult to work with
▶ Difficult to access parts ▶ Difficult to determine correctness
▶ Contains redundancies (spaces, comments, etc)
▶ A text file ▶ Characters typed at the prompt
▶ Abstract syntax ▶ Abstract Syntax-Tree (AST) ▶ Token ▶ Delimiter ▶ Whitespace
▶ Multi-dimensional ▶ Conveys structure
▶ Nested ▶ Recursive (following the inductive definition of the grammar)
▶ Easier to work with than the concrete syntax
▶ Easier to access parts ▶ Easier to verify correctness ▶ Some syntactic correctness issues have already been decided
▶ Abstract Syntax-Tree (AST) ▶ Token ▶ Delimiter ▶ Whitespace
▶ The AST is a tree ▶ A data-structure that
▶ Follows the abstract
▶ No text, parenthesis,
▶ The structure is evident ▶ Easy to find
▶ Easier to analyze,
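As a small illustration of an AST as a data structure, here is a sketch that assumes a hypothetical arithmetic language (the type `expr` and its constructors are not taken from the course compiler). The tree carries no text, parentheses, or whitespace, and its sub-expressions are reached by pattern matching.

```ocaml
(* Hypothetical abstract syntax for arithmetic expressions. *)
type expr =
  | Num of int
  | Add of expr * expr
  | Mul of expr * expr

(* The AST for the concrete syntax "1 + 2 * 3". *)
let ast = Add (Num 1, Mul (Num 2, Num 3))

(* Analysis follows the recursive structure of the tree. *)
let rec eval = function
  | Num n -> n
  | Add (e1, e2) -> eval e1 + eval e2
  | Mul (e1, e2) -> eval e1 * eval e2
```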
▶ Parsing: going from concrete syntax to abstract syntax ▶ Parser: the tool that performs parsing, constructing an AST
▶ Lacks structure ▶ Prone to errors ▶ Hard to delimit
▶ Inefficient to work with ▶ Concrete Syntax can be
▶ Visual languages ▶ Structure/syntax editors
▶ Has structure ▶ Many kinds of errors are
▶ Sub-Expressions are readily
▶ Efficient to work with
▶ Token ▶ Delimiter ▶ Whitespace
▶ The smallest, meaningful, lexical unit in a language ▶ Described using regular expressions ▶ Identified using DFA (a very simple model of computation) ▶ Examples
▶ Numbers ▶ [Non-nested] Strings ▶ Names (variables, functions) ▶ Punctuation
▶ Cannot handle nesting of any kind:
▶ Parenthesized expressions ▶ Nested comments ▶ etc.
▶ Scanning: going from characters into tokens ▶ Scanner: the tool that performs scanning ▶ Scanner generator: the tool that takes definitions for tokens,
▶ Examples of scanner-generators: lex, flex
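To make the idea of scanning concrete, here is a hand-written sketch (not the output of a scanner generator). The `token` type, the helper names, and the choice of delimiters are assumptions made for illustration; the sketch groups characters into tokens and drops whitespace.

```ocaml
(* Hypothetical token type for a tiny parenthesized language. *)
type token = LParen | RParen | Int of int | Name of string

let is_digit c = '0' <= c && c <= '9'
let is_space c = c = ' ' || c = '\n' || c = '\t'
let is_delim c = is_space c || c = '(' || c = ')'

let scan (input : string) : token list =
  let n = String.length input in
  (* Index of the first character at or after i that fails pred. *)
  let rec take_while pred i =
    if i < n && pred input.[i] then take_while pred (i + 1) else i in
  let rec loop i acc =
    if i >= n then List.rev acc
    else
      match input.[i] with
      | c when is_space c -> loop (i + 1) acc
      | '(' -> loop (i + 1) (LParen :: acc)
      | ')' -> loop (i + 1) (RParen :: acc)
      | c when is_digit c ->
        let j = take_while is_digit i in
        loop j (Int (int_of_string (String.sub input i (j - i))) :: acc)
      | _ ->
        let j = take_while (fun c -> not (is_delim c)) i in
        loop j (Name (String.sub input i (j - i)) :: acc)
  in
  loop 0 []
```

Note that nothing here tracks nesting: the parentheses come out as flat tokens, and matching them up is left to the parser.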
▶ Delimiter ▶ Whitespace
▶ Delimiters are characters that separate tokens ▶ In most languages spaces, parentheses, commas, semicolons,
▶ Some tokens must be separated by delimiters
▶ Two consecutive numbers, two consecutive symbols, etc.
▶ Some tokens do not need to be separated by delimiters
▶ Two consecutive strings, an open parenthesis followed by a
▶ Delimiters are language-dependent
▶ Whitespace
▶ Whitespace refers to characters that
▶ Have no graphical representation ▶ Occur before or after tokens ▶ Spaces within strings are not whitespaces… ▶ Serve no syntactic purpose other than as delimiters and for
▶ Whitespace is language-dependent
▶ Delimiters & whitespaces ▶ Parentheses, brackets, braces, and other grouping, nesting, and
▶ Grammars with which to express the syntax of the language
▶ There are different kinds of grammars (CFG, CSG, two-level,
▶ There are different languages for expressing the grammar (e.g.,
▶ Algorithms for parsing programs as per kind of grammar ▶ Techniques (e.g., parsing combinators, DCGs)
▶ Going from characters to tokens ▶ Identifying & grouping characters into tokens for words,
▶ Parsing over tokens is more efficient than parsing over
▶ In LISP/Prolog, the parser is split into two components:
▶ The reader, or the parser for the data language ▶ The tag-parser, or the parser for the source code
▶ In LISP/Scheme/Racket/Clojure/etc, the abstract syntax for
▶ In Prolog, the abstract syntax for the data is the abstract syntax
▶ Prolog is the programming language with the most powerful
▶ In programming languages in which the syntax of code is not a
▶ In programming languages in which the syntax of code is part of
▶ The concrete syntax of data is a stream of characters ▶ The concrete language of code is the abstract syntax of the
▶ In Scheme, the language of data is called S-expressions (sexprs,
▶ The tag-parser takes sexprs and returns [ASTs for] exprs ▶ Languages other than from the LISP & Prolog families do not
▶ In such languages, parsing goes directly from tokens to [ASTs
▶ Annotate the ASTs ▶ Compute addresses ▶ Annotate tail-calls ▶ Type-check code ▶ Perform optimizations
▶ Generate a stream of instructions in
▶ assembly language ▶ machine language ▶ Build executable ▶ some other target language…
▶ Perform low-level optimizations
▶ Written in ocaml ▶ Supports a subset of Scheme + extensions ▶ Supports two, simple optimizations ▶ Compiles to x86/64 ▶ Runs on linux
▶ Support for the full language of Scheme ▶ Support for garbage collection ▶ The ability to compile itself
▶ We’re going to learn about syntax by studying the syntax of
▶ After all, we’re writing a Scheme compiler… ▶ It’s relatively simple, compared to the syntax of C, Java,
▶ It comes with some interesting twists
▶ Scheme comes with two languages:
▶ A language for code ▶ A language for data
▶ The key to understanding the syntax of Scheme, is to think
▶ Describe arbitrarily-complex data
▶ Possibly multi-dimensional, deeply nested ▶ Polymorphic ▶ Possibly circular
▶ Access components easily and efficiently
▶ S-expressions (the first: 1959) ▶ Functors (1972) ▶ Datalog (1977) ▶ SGML (1986) ▶ MS DDE (1987) ▶ CORBA (1991) ▶ MS COM (1993) ▶ MS DCOM (1996) ▶ XML (1996) ▶ JSON (2001)
▶ They’re the first… 😊 ▶ They’re supported natively, as part of specific programming
▶ S-expressions are supported by LISP-based languages, including
▶ Functors are supported by Prolog-based languages
▶ It’s not supported natively by any programming language ▶ Most modern languages (Java, Python, etc) support it via
▶ No programming language has XML for its concrete syntax:
▶ Supported XML as its data language ▶ Were itself written in XML
▶ Writing interpreters, compilers, and other language-tools would
▶ Reflection (code examining code) would be simple
▶ They are the data language for LISP-based languages, including
▶ LISP-based languages are written using S-expressions ▶ Writing interpreters and compilers in LISP-based languages is
▶ Computational reflection was invented in LISP! ▶ This is the real reason behind all these parentheses in Scheme:
▶ A very simple language ▶ Supports core types: pairs, vectors, symbols, strings, numbers,
▶ A syntactic compromise that is great for expressing both code
▶ S-expressions were invented along with LISP, in 1959 ▶ S-expressions stand for Symbolic Expressions ▶ The term is intended to distinguish itself from numerical
▶ Before LISP (and long after it was invented), most computation
▶ Computer languages were great at “crunching numbers”, but
▶ String libraries were non-standard and uncommon ▶ Polymorphic data was unheard of ▶ Nested data structures needed to be implemented from scratch,
▶ Working with data structures became considerably simpler
▶ Trivially allocated (no pointer-arithmetic) ▶ Polymorphic (lists of lists of numbers and strings and vectors of
▶ Easy to access sub-structures (no pointer arithmetic) ▶ Easy to modify (in an easy-going, functional style) ▶ Easy to examine (they’re just made up of primitive types) ▶ Easy to redefine ▶ Automatically deallocated (garbage collection)
▶ Treating code as data became considerably simpler
▶ Symbolic Mathematics (Macsyma, a precursor to Wolfram
▶ Artificial Intelligence ▶ Computer adventure-game generation-languages (MDL, ZIL)
▶ The empty list: () ▶ Booleans: #f, #t ▶ Characters: #\a, #\Z, #\space, #\return, #\x05d0, etc ▶ Strings: "abc", "Hello\nWorld\t\x05d0;hi!", etc ▶ Numbers: -23, #x41, 2/3, 2-3i, 2.34, -2.34+3.5i ▶ Symbols: abc, lambda, define, fact, list->string ▶ Pairs: (a . b), (a b c), (a (2 . #f) "moshe") ▶ Vectors: #(), #(a b ((1 . 2) #f) "moshe")
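One way to model these kinds of S-expressions as an OCaml data type is sketched below. The type `sexpr`, its constructor names, and the helper `proper_list` are assumptions made for illustration, and numbers are simplified to integers (no fractions, floats, or complex numbers).

```ocaml
(* Hypothetical model of S-expressions; numbers simplified to ints. *)
type sexpr =
  | Nil                      (* the empty list *)
  | Bool of bool
  | Char of char
  | String of string
  | Number of int
  | Symbol of string
  | Pair of sexpr * sexpr    (* an ordered pair: car and cdr *)
  | Vector of sexpr list

(* Build a proper list: nested pairs whose rightmost cdr is Nil. *)
let rec proper_list = function
  | [] -> Nil
  | x :: xs -> Pair (x, proper_list xs)

(* The sexpr (a (2 . #f) "moshe") from the list of examples. *)
let example =
  proper_list [Symbol "a"; Pair (Number 2, Bool false); String "moshe"]
```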
▶ The name LISP comes from LISt Processing. ▶ In fact, LISP has no direct support for lists. ▶ LISP has ordered pairs
▶ Ordered pairs are created using cons ▶ The first and second projections over ordered pairs are car and cdr
▶ (car (cons x y)) ≡ x ▶ (cdr (cons x y)) ≡ y ▶ The ordered pair of x and y can be written as (x . y)
▶ Rule 1: For any E, the ordered pair (E . ()) is printed as (E),
▶ Rule 2: For any E1, E2, …, the ordered pair (E1 . (E2 … )) is
▶ These rules just affect how pairs are printed ▶ These rules give us a canonical representation for pairs
▶ The pair (a . (b . c)) is printed as (a b . c)
[Figure: box-and-pointer diagram of (a . (b . c)): a PAIR whose CAR is SYMBOL a and whose CDR is a PAIR with CAR SYMBOL b and CDR SYMBOL c]
▶ The pair ((a . (b . ())) . ((c . (d . ())))) is
[Figure: box-and-pointer diagram of the same structure, built from PAIR nodes (each with a CAR and a CDR), the symbols a, b, c, d, and NIL]
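The two printing rules can be sketched as a pair of mutually recursive functions. This assumes a hypothetical, cut-down `sexpr` type with only the constructors needed here; the names `string_of_sexpr` and `string_of_tail` are illustrative.

```ocaml
(* Hypothetical, cut-down model of S-expressions. *)
type sexpr =
  | Nil
  | Number of int
  | Symbol of string
  | Pair of sexpr * sexpr

let rec string_of_sexpr = function
  | Nil -> "()"
  | Number n -> string_of_int n
  | Symbol name -> name
  | Pair (car, cdr) -> "(" ^ string_of_sexpr car ^ string_of_tail cdr

(* Prints whatever follows the car of a pair, up to the closing paren. *)
and string_of_tail = function
  | Nil -> ")"                                  (* Rule 1: drop the dot and the nil *)
  | Pair (car, cdr) ->                          (* Rule 2: drop the dot and the parens *)
    " " ^ string_of_sexpr car ^ string_of_tail cdr
  | other -> " . " ^ string_of_sexpr other ^ ")"
```

Applied to the two pairs above, this prints `(a b . c)` and `((a b) (c d))` respectively.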
▶ Lists in Scheme can come in two forms, proper lists and
▶ When we just speak of lists, we usually mean proper lists. ▶ Most of the list processing functions (length, map, etc) take
▶ Proper lists are nested ordered pairs the rightmost cdr of which
▶ Testing for pairs is cheap, and is done by means of the builtin
▶ Testing for lists is expensive, since it traverses nested, ordered
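The difference in cost between the two tests can be seen in a sketch over a hypothetical, cut-down `sexpr` type (the names `is_pair` and `is_proper_list` are illustrative, not Scheme builtins): the first inspects one constructor, the second walks the whole chain of cdrs.

```ocaml
(* Hypothetical, cut-down model of S-expressions. *)
type sexpr = Nil | Symbol of string | Pair of sexpr * sexpr

(* Constant time: look only at the outermost constructor. *)
let is_pair = function
  | Pair _ -> true
  | _ -> false

(* Linear time: follow the cdrs until the rightmost one is reached. *)
let rec is_proper_list = function
  | Nil -> true
  | Pair (_, cdr) -> is_proper_list cdr
  | _ -> false
```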
▶ Pairs that are not proper lists are improper lists. ▶ Improper lists end with a rightmost cdr that is not nil ▶ List-processing procedures such as length, map, etc., do not
▶ There is no builtin procedure for testing improper lists, but it
▶ Entering an empty list or a vector or an improper list at the
▶ Entering a symbol at the prompt causes Scheme to attempt to
▶ Entering a proper list, that is not the empty list, at the prompt
▶ The special form quote can be written in two ways:
▶ '<sexpr> ▶ (quote <sexpr>)
▶ When you type abc at the Scheme prompt, you’re evaluating
▶ When you type 'abc at the Scheme prompt, you’re evaluating
▶ The value of the literal symbol abc is just itself, which is why
▶ When you type () at the Scheme prompt, you’re evaluating an
▶ When you type '() at the Scheme prompt, you’re evaluating a
▶ The value of the literal empty list is just itself, which is why
▶ When you type (a b c) at the Scheme prompt, you’re
▶ When you type '(a b c) at the Scheme prompt, you’re
▶ The value of the literal list (a b c) is just (a b c), which is
▶ Quoting a self-evaluating S-expression is possible, and
▶ The quote form does nothing
▶ It is not a procedure ▶ It doesn’t take an argument ▶ It delimits a constant, literal S-expressions
▶ The syntactic function of quote in Scheme is the same as the
▶ Similarly to quote, the form quasiquote can be written in two
▶ `<sexpr> ▶ (quasiquote <sexpr>)
▶ quasiquote is also used to define data:
▶ `abc is the same as 'abc ▶ `(a b c) is the same as '(a b c)
▶ But quasiquote has two neat tricks!
▶ The following two forms may occur within a
▶ The unquote form: ▶ ,<sexpr> ▶ (unquote <sexpr>)
▶ The unquote-splicing form: ▶ ,@<sexpr> ▶ (unquote-splicing <sexpr>)
▶ Both unquote & unquote-splicing are used within
▶ The expression `(a ,(append '(x y) '(z w)) b) is
▶ The expression `(a ,@(append '(x y) '(z w)) b) is
▶ The difference between unquote & unquote-splicing is that
▶ unquote mixes in an expression using cons ▶ unquote-splicing mixes in an expression using append
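The cons-versus-append distinction can be sketched in OCaml over a hypothetical, cut-down `sexpr` type. All names here (`proper_list`, `append`, `syms`, and the three values) are illustrative; the sketch builds by hand what the two quasiquoted expressions above would evaluate to.

```ocaml
(* Hypothetical, cut-down model of S-expressions. *)
type sexpr = Nil | Symbol of string | Pair of sexpr * sexpr

let rec proper_list = function
  | [] -> Nil
  | x :: xs -> Pair (x, proper_list xs)

let rec append s1 s2 =
  match s1 with
  | Nil -> s2
  | Pair (car, cdr) -> Pair (car, append cdr s2)
  | Symbol _ -> invalid_arg "append: not a proper list"

(* A proper list of symbols. *)
let syms names = proper_list (List.map (fun n -> Symbol n) names)

(* The value of the unquoted expression: (x y z w). *)
let value = append (syms ["x"; "y"]) (syms ["z"; "w"])

(* unquote: cons the value in as a single element, giving (a (x y z w) b). *)
let with_unquote = Pair (Symbol "a", Pair (value, syms ["b"]))

(* unquote-splicing: append the value in place, giving (a x y z w b). *)
let with_splicing = Pair (Symbol "a", append value (syms ["b"]))
```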
▶ Together, quasiquote, unquote, & unquote-splicing are
▶ The quasiquote mechanism allows us to create data by
▶ In Scheme, convenient ways to create data translate
▶ Therefore we expect the quasiquote mechanism to have useful
▶ We can turn code that computes something into code that
▶ The 2nd and 3rd ribs of the cond overlap [we could have
▶ All atoms are left unchanged ▶ All pairs are duplicated, while recursing over the car and cdr of
▶ Using the quasiquote mechanism, we got foo to describe how
▶ We should really add support for proper lists and vectors! ▶ In fact, the name describe is far more appropriate than foo…
▶ '<sexpr> ≡ (quote <sexpr>) ▶ `<sexpr> ≡ (quasiquote <sexpr>) ▶ ,<sexpr> ≡ (unquote <sexpr>) ▶ ,@<sexpr> ≡ (unquote-splicing <sexpr>)
▶ The first element of the list is the symbol quote ▶ The second element of the list is '''''''''''''''moshe
▶ In a previous slide, we made the claims that in all descendants of
▶ We can now show you some examples
▶ (if if if if) is a list
▶ (if (zero? n) 'zero
▶ (if if if if) is not a
▶ (if (zero? n) 'zero
▶ Types ▶ References ▶ Modules & signatures ▶ Functional programming in ocaml
▶ Defining new data types ▶ Assignments, side-effects,
▶ References are derived types. For any type α, we can have a
▶ References are records with a single field contents ▶ References have a special syntax ! to dereference the field:
▶ References have a special syntax := for assignment ▶ This is how assignments are managed in ocaml
▶ It is not possible to perform assignments on variables ▶ It is only possible to change the fields of reference types
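A short sketch of references in use. The names `counter`, `next`, and `peek` are illustrative; `ref`, `!`, `:=`, and the `contents` field are standard OCaml.

```ocaml
(* counter is an immutable variable bound to a mutable reference cell. *)
let counter : int ref = ref 0

let next () =
  counter := !counter + 1;   (* := assigns to the contents field *)
  !counter                   (* ! dereferences the contents field *)

(* The field can also be read directly, since a ref is a record. *)
let peek () = counter.contents
```

The variable `counter` itself is never reassigned; only the field of the record it refers to changes.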
▶ You can define a reference type of any other type, including
▶ A module is a way of packaging functions, classes, variables, &
▶ A signature is the type of a module
▶ Visibility of a module can be restricted through the signature
▶ Functors are functions from functors/modules to
▶ Learn to work with existing modules ▶ Learn to write your own modules
▶ M.hyp is visible from outside M ▶ M.square is not visible from outside M ▶ Functions visible from outside may use functions visible from
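A module with this visibility might look like the sketch below. The names `M`, `hyp`, and `square` come from the bullets above; the function bodies and types are an assumption made for illustration.

```ocaml
module M : sig
  (* Only hyp is listed in the signature, so only hyp is exported. *)
  val hyp : float -> float -> float
end = struct
  (* square is absent from the signature: invisible outside M. *)
  let square x = x *. x

  (* A visible function may still use a hidden one. *)
  let hyp a b = sqrt (square a +. square b)
end
```

Outside the module, `M.hyp 3.0 4.0` type-checks, while a reference to `M.square` is rejected by the compiler.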
▶ Modules & signatures are the way to package functions &
▶ Convenient, super-efficient, safe ▶ No need to use local, nested functions to manage visibility ▶ Always use signatures to control visibility!
▶ Modules can contain types too, and be used to parameterize
▶ Simpler & better than generics & templates
▶ Functors map modules/functors =
▶ Parsing algorithms are tailored to a specific kind of grammar
▶ Different kinds of grammars can be parsed by different
▶ Different kinds of grammars have different levels of complexity
▶ Most programming languages can be described using
▶ Some older languages can only be described using
▶ V is a set of non-terminals ▶ Σ is a set of terminals, or tokens ▶ R is a relation in V × (V ∪ Σ)∗
▶ Members of R are called production rules or rewrite rules
▶ S is the initial non-terminal
▶ We abbreviate the two productions ⟨A, X⟩ , ⟨A, Y⟩ ∈ R with
▶ We abbreviate the three productions ⟨A, X⟩ , ⟨X, ε⟩ , ⟨X, BX⟩ ∈ R,
▶ We abbreviate the three productions
▶ We abbreviate the two productions ⟨A, ε⟩ , ⟨A, B⟩ ∈ R, with
▶ Start with the initial non-terminal ▶ Rewrite the LHS of a non-terminal with its RHS, matching the
▶ Keep rewriting until the entire input stream is matched
▶ Start with the input stream of tokens ▶ Find a rewrite rule where the RHS matches sequences in the
▶ Keep rewriting until the entire input stream has been reduced to
▶ Describe the grammar of the language using a DSL for some
▶ Example: Backus-Naur Form (BNF)
▶ Associate actions with each production rule:
▶ How to build the AST when a specific rule is matched
▶ A parser generator (e.g., yacc, bison, antlr, etc) compiles the
▶ Performing various optimizations ▶ Generating code in some language (C, Java, ocaml, etc) ▶ This code is the parser
▶ Calling the parser on some input returns an AST
▶ Minimal restrictions on the grammar ▶ Avoid backtracking as much as possible ▶ Maximum optimizations of the parser
▶ Parsers for larger languages are composed from parsers for
▶ The grammar can be written & debugged bottom-up ▶ The parsers are first-class objects:
▶ We get to use abstraction to create complex parsers quickly &
▶ Effectively re-use common sub-languages
▶ Simple to understand & implement ▶ Very rapid development
▶ The grammar is embedded as-is:
▶ As much backtracking as implied by the grammar: Rewrite
▶ No optimizations or transformations are performed on it!
▶ ε-productions & left-recursion result in infinite loops
▶ We need to eliminate these manually!
▶ Can produce inefficient parsers rather efficiently! 😊
▶ Parsing combinators are a very simple way to learn about grammars:
▶ No complex algorithms are necessary! ▶ The easiest way to design complex grammars & their parsers:
▶ shortens & simplifies the code ▶ encourages re-use & consistency
▶ Optimizations can always be done manually!
▶ We build parsers of large languages by combining parsers for
▶ The procedures that combine parsers are called parsing
▶ But we must start by being able to parse single characters
▶ All other parsers are built on top of such simple parsers for
▶ …takes a list of characters ▶ …returns a pair of what it matched, and the remaining characters
▶ We only match the head of the input ▶ Obviously, ntA fails on an empty list
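A minimal sketch of such a single-character parser. The name `ntA` comes from the text and `const` from the list of combinators given later in the deck; the exception name `X_no_match` and the exact definitions are assumptions made for illustration.

```ocaml
(* Assumed name for the exception that signals a failed match. *)
exception X_no_match

(* const pred: a parser that matches one character satisfying pred.
   It returns the matched character paired with the remaining input. *)
let const pred = function
  | [] -> raise X_no_match
  | ch :: rest -> if pred ch then (ch, rest) else raise X_no_match

(* A parser for the single character a. *)
let ntA = const (fun ch -> ch = 'a')
```

Only the head of the list is examined, so `ntA` fails both on an empty list and on any list that does not start with the character a.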
▶ Testing our parsers by applying them to lists is no fun
▶ It’s a pain to type lists of characters!
▶ Let’s automate things a bit:
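One possible set of helpers for this, sketched under the assumption that parsers have the type `char list -> 'a * char list` and signal failure with an exception called `X_no_match`. The names `string_to_list`, `list_to_string`, and `test_string` are illustrative.

```ocaml
(* Assumed name for the exception that signals a failed match. *)
exception X_no_match

let const pred = function
  | [] -> raise X_no_match
  | ch :: rest -> if pred ch then (ch, rest) else raise X_no_match

let ntA = const (fun ch -> ch = 'a')

(* Conversions between strings and lists of characters. *)
let string_to_list str = List.init (String.length str) (String.get str)
let list_to_string chars = String.concat "" (List.map (String.make 1) chars)

(* Run a parser on a string; show the unconsumed input as a string. *)
let test_string nt str =
  let (e, rest) = nt (string_to_list str) in
  (e, list_to_string rest)
```

With these in place, a test is written as `test_string ntA "abc"` rather than as an application to a hand-typed list of characters.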
▶ We try to parse the head of s using nt1
▶ If we succeed, we get e1 and the remaining chars s ▶ We try to parse the head of s (what remained after nt1) using nt2
▶ If we succeed, we get e2 and the remaining chars s ▶ We return the pair of e1 & e2, as well as the remaining chars
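The steps above can be sketched as follows. The name `caten` comes from the deck; the definition shown, the exception name `X_no_match`, and the helper `nt_char` are assumptions made for illustration.

```ocaml
(* Assumed name for the exception that signals a failed match. *)
exception X_no_match

let const pred = function
  | [] -> raise X_no_match
  | ch :: rest -> if pred ch then (ch, rest) else raise X_no_match

(* Hypothetical helper: a parser for one specific character. *)
let nt_char ch = const (fun c -> c = ch)

(* Catenation: run nt1, then run nt2 on what nt1 left over.
   Each let rebinds s to the remaining characters. *)
let caten nt1 nt2 s =
  let (e1, s) = nt1 s in
  let (e2, s) = nt2 s in
  ((e1, e2), s)
```

If either parser raises `X_no_match`, the exception propagates, so the catenation as a whole fails.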
▶ We try to parse the head of s using nt1
▶ If we succeed, then the call to nt1 returns normally ▶ If we fail we try to parse the head of s using nt2
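A sketch of disjunction along these lines. The name `disj` comes from the deck; the definition, the exception name `X_no_match`, and the helper `nt_char` are assumptions made for illustration.

```ocaml
(* Assumed name for the exception that signals a failed match. *)
exception X_no_match

let const pred = function
  | [] -> raise X_no_match
  | ch :: rest -> if pred ch then (ch, rest) else raise X_no_match

(* Hypothetical helper: a parser for one specific character. *)
let nt_char ch = const (fun c -> c = ch)

(* Disjunction: try nt1; if it fails, try nt2 on the same input s. *)
let disj nt1 nt2 s =
  try nt1 s
  with X_no_match -> nt2 s
```

Because the second parser restarts from the original input, this is where backtracking enters the picture.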
▶ Some simple parsers ▶ Learn about the algebra of PCs ▶ Learn of new PC operators ▶ Learn how to use abstraction to make our life simpler
▶ nt_epsilon is the parser that recognizes ε-productions ▶ nt_none is the parser that always fails ▶ nt_end_of_input is the parser that recognizes the end of the
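These three parsers might be defined as in the sketch below. The three names come from the deck; the definitions and the exception name `X_no_match` are assumptions. Returning the empty list as the matched value is a design choice that makes these parsers convenient base cases when building lists.

```ocaml
(* Assumed name for the exception that signals a failed match. *)
exception X_no_match

(* Always succeeds, consumes nothing, and matches the empty list. *)
let nt_epsilon s = ([], s)

(* Always fails, whatever the input. *)
let nt_none _ = raise X_no_match

(* Succeeds only when no input is left. *)
let nt_end_of_input = function
  | [] -> ([], [])
  | _ -> raise X_no_match
```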
▶ Learn about the algebra of PCs ▶ Learn of new PC operators ▶ Learn how to use abstraction to make our life simpler
▶ What is the unit element of catenation?
▶ Answer: r = ε ▶ We’re looking for a non-terminal r such that for any s, we have
▶ This means that nt_epsilon is the unit element for caten: ▶ caten nt_epsilon nt ≡ caten nt nt_epsilon ≡ nt ▶ Both nt_epsilon & nt_end_of_input are used ’til the end of
▶ The natural operation is to create a list of all things until ε or
▶ The unit element for append on lists is the empty list ▶ Ergo, it is natural to match [] when either condition is
▶ Learn of new PC operators ▶ Learn how to use abstraction to make our life simpler
▶ We want to be able to create an AST for that piece of syntax ▶ We do this by specifying postprocessing or callback functions
▶ In our package, the PC that performs this is called pack
▶ pack takes a non-terminal nt and a function f ▶ returns a parser that recognizes the same language as nt ▶ …but which applies f to whatever was matched
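A sketch of `pack` that matches this description. The definition, the exception name `X_no_match`, and the example parser `nt_digit_value` are assumptions made for illustration.

```ocaml
(* Assumed name for the exception that signals a failed match. *)
exception X_no_match

let const pred = function
  | [] -> raise X_no_match
  | ch :: rest -> if pred ch then (ch, rest) else raise X_no_match

(* pack nt f: recognize what nt recognizes, then apply f to the match. *)
let pack nt f s =
  let (e, s) = nt s in
  (f e, s)

(* Example: turn a matched digit character into its numeric value. *)
let nt_digit_value =
  pack (const (fun ch -> '0' <= ch && ch <= '9'))
       (fun ch -> Char.code ch - Char.code '0')
```

The remaining input is passed through untouched; only the matched value changes, which is how a parser comes to return an AST rather than raw characters.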
▶ Grammars are often recursive or mutually-recursive:
▶ The non-terminal on the LHS of a production often appears on
▶ The non-terminal on the LHS of a production often appears in
▶ Currently, we are unable to describe recursive rules using PCs
▶ The non-terminal A ▶ The open-parenthesis token ▶ The close-parenthesis token ▶ Never mind that we don’t yet have star… ▶ We can’t use nt_A before it’s defined!
▶ Never mind that we don’t yet have star… ▶ We can’t use nt_A before it’s defined!
▶ The problem is not specific to parsing combinators.
▶ For example, you couldn’t define in Scheme:
▶ So how are recursive definitions possible at all?
▶ When you define a recursive function you are not using the
▶ You are using the address of the function before the function is
▶ Recursive functions are circular data structures:
▶ The language definition permits you to define these particular
▶ “Wrap it in a lambda…”
▶ A thunk is a procedure that takes zero arguments ▶ Thunks are used to delay evaluation
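A sketch of how a thunk makes a recursive parser definable. The name `delayed` comes from the deck's list of combinators; its definition here, the exception name `X_no_match`, the helper `nt_char`, and the toy grammar A → ( A ) | ε (standing in for the deck's own example) are assumptions.

```ocaml
(* Assumed name for the exception that signals a failed match. *)
exception X_no_match

let const pred = function
  | [] -> raise X_no_match
  | ch :: rest -> if pred ch then (ch, rest) else raise X_no_match
let nt_char ch = const (fun c -> c = ch)
let nt_epsilon s = ([], s)
let caten nt1 nt2 s =
  let (e1, s) = nt1 s in
  let (e2, s) = nt2 s in
  ((e1, e2), s)
let disj nt1 nt2 s = try nt1 s with X_no_match -> nt2 s
let pack nt f s = let (e, s) = nt s in (f e, s)

(* delayed: call the thunk only when input actually arrives. *)
let delayed thunk s = thunk () s

(* Toy grammar: A -> ( A ) | epsilon. The result is the nesting depth.
   nt_nested is itself a thunk, so the recursive reference is legal. *)
let rec nt_nested () =
  disj
    (pack
       (caten (nt_char '(') (caten (delayed nt_nested) (nt_char ')')))
       (fun (_, (depth, _)) -> depth + 1))
    (pack nt_epsilon (fun _ -> 0))
```

Writing `nt_nested ()` in place of `delayed nt_nested` would rebuild the parser eagerly and never terminate; the delay postpones that call until there is input to parse.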
▶ Notice the packing function (function (a, s) -> a :: s)
▶ We got a list of digits, as opposed to a list of chars!
▶ Notice the type of the parser: char list -> int * char list
▶ By now, our toolset of parsing combinators consists of
▶ const ▶ caten ▶ disj ▶ pack ▶ delayed
▶ We can handle recursive grammars ▶ We can create ASTs ▶ In principle, we can implement parsers for any language
▶ For any NT P, P∗ stands for the
▶ The point of the Kleene-star is
▶ For any NT P, P+ stands for the rule Pplus defined as follows:
▶ The point of the Kleene-plus is to recognize the catenation of
▶ Kleene didn’t really invent the Kleene-plus
▶ Rather, Kleene-plus is a natural extension of Kleene-star
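Kleene-star and Kleene-plus can be sketched as combinators as follows. The names `star` and `plus`, the definitions, the exception name `X_no_match`, and the example parsers `digit` and `nat` are assumptions made for illustration. The example yields a parser of type `char list -> int * char list`.

```ocaml
(* Assumed name for the exception that signals a failed match. *)
exception X_no_match

let const pred = function
  | [] -> raise X_no_match
  | ch :: rest -> if pred ch then (ch, rest) else raise X_no_match
let caten nt1 nt2 s =
  let (e1, s) = nt1 s in
  let (e2, s) = nt2 s in
  ((e1, e2), s)
let pack nt f s = let (e, s) = nt s in (f e, s)

(* Zero or more matches of nt, collected into a list. *)
let rec star nt s =
  try
    let (e, s') = nt s in
    let (es, s'') = star nt s' in
    (e :: es, s'')
  with X_no_match -> ([], s)

(* One or more matches: one nt followed by star nt. *)
let plus nt = pack (caten nt (star nt)) (fun (e, es) -> e :: es)

(* Example: natural numbers as one or more digits. *)
let digit =
  pack (const (fun ch -> '0' <= ch && ch <= '9'))
       (fun ch -> Char.code ch - Char.code '0')
let nat = pack (plus digit) (List.fold_left (fun acc d -> 10 * acc + d) 0)
```

Applying `star` to a parser that can succeed without consuming input would loop forever, which is one form of the ε-production problem mentioned earlier.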
▶ Learn how to use abstraction to make our life simpler
▶ Take a string of chars, and convert it to a list ▶ Map over each character in the list, creating a parser that
▶ Perform a right fold over that list using the caten operation
▶ The unit element is the unit element of catenation, namely
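The recipe above can be sketched as follows. The name `word` comes from the deck; the definition, the exception name `X_no_match`, and the helpers `nt_char` and `string_to_list` are assumptions. Each catenation is packed so that the result is a flat list of characters rather than nested pairs.

```ocaml
(* Assumed name for the exception that signals a failed match. *)
exception X_no_match

let const pred = function
  | [] -> raise X_no_match
  | ch :: rest -> if pred ch then (ch, rest) else raise X_no_match
let nt_char ch = const (fun c -> c = ch)
let nt_epsilon s = ([], s)
let caten nt1 nt2 s =
  let (e1, s) = nt1 s in
  let (e2, s) = nt2 s in
  ((e1, e2), s)
let pack nt f s = let (e, s) = nt s in (f e, s)
let string_to_list str = List.init (String.length str) (String.get str)

(* word str: a parser that matches exactly the characters of str. *)
let word str =
  List.fold_right
    (fun nt acc -> pack (caten nt acc) (fun (e, es) -> e :: es))
    (List.map nt_char (string_to_list str))
    nt_epsilon
```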
▶ Very similar to word:
▶ We use disj rather than caten ▶ The unit element for disj is nt_none
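The same fold, with `disj` and `nt_none` in place of `caten` and `nt_epsilon`, gives a parser that accepts any one character from a given string. The name `one_of`, the definitions, the exception name `X_no_match`, and the helpers are assumptions made for illustration.

```ocaml
(* Assumed name for the exception that signals a failed match. *)
exception X_no_match

let const pred = function
  | [] -> raise X_no_match
  | ch :: rest -> if pred ch then (ch, rest) else raise X_no_match
let nt_char ch = const (fun c -> c = ch)
let nt_none _ = raise X_no_match
let disj nt1 nt2 s = try nt1 s with X_no_match -> nt2 s
let string_to_list str = List.init (String.length str) (String.get str)

(* one_of str: match any single character that occurs in str. *)
let one_of str =
  List.fold_right disj (List.map nt_char (string_to_list str)) nt_none
```

No packing is needed here, because a disjunction returns the result of whichever alternative matched.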
▶ The PC trace_pc is a wrapper (using the decorator pattern)
▶ The trace_pc PC takes a documentation string and a parser,
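A tracing wrapper in this spirit might look like the sketch below. The name `trace_pc` and its two arguments come from the text; the message format, the definition, the exception name `X_no_match`, and the helper `list_to_string` are assumptions.

```ocaml
(* Assumed name for the exception that signals a failed match. *)
exception X_no_match

let const pred = function
  | [] -> raise X_no_match
  | ch :: rest -> if pred ch then (ch, rest) else raise X_no_match
let list_to_string chars = String.concat "" (List.map (String.make 1) chars)

(* trace_pc desc nt: behaves like nt, but reports each success or failure. *)
let trace_pc desc nt s =
  try
    let (_, rest) as result = nt s in
    Printf.printf ";;; %s matched the head of \"%s\", leaving \"%s\"\n"
      desc (list_to_string s) (list_to_string rest);
    result
  with X_no_match ->
    Printf.printf ";;; %s failed on \"%s\"\n" desc (list_to_string s);
    raise X_no_match
```

The wrapped parser returns and fails exactly as the original does, so a trace can be added to or removed from any non-terminal without changing the grammar.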