SLIDE 1

Compiler Construction

Mayer Goldberg \ Ben-Gurion University Saturday 2nd November, 2019

Mayer Goldberg \ Ben-Gurion University Compiler Construction 1 / 177

SLIDE 2

Chapter 2

Goals

▶ The pipeline of the compiler
▶ Introduction to syntactic analysis
▶ Further steps in OCaml

Agenda

▶ The pipeline
  ▶ Syntactic analysis
  ▶ Semantic analysis
  ▶ Code generation
▶ The compiler for the course
▶ The language of S-expressions
▶ More OCaml

SLIDE 3

Refresher

Last week, we discussed

▶ The interpreter as an evaluation function
▶ The compiler as a translator & optimizer
▶ The relations between interpretation & compilation

This was a rather high-level view of the area. We now wish to consider compilation as a large software project.

SLIDE 4

Compilation as translation

A compiler translates between languages:

▶ Understanding the syntax of the program
  ▶ What kinds of statements & expressions there are
  ▶ What are the various parts of these statements & expressions
  ▶ Are they syntactically correct?
▶ Understanding the meaning of the program
  ▶ Do the operations make sense?
  ▶ What are their types?
  ▶ Are they used in accordance with their types?
  ▶ On what data is the program acting?
  ▶ What is returned?
▶ Once we understand the syntax and meaning of a sentence, we can render it in another language

SLIDE 5

The pipeline of the compiler

Since the 1950’s, the standard architecture for compilers has been a pipeline:

▶ Syntactic analysis
  ▶ Scanning
  ▶ Parsing
    ▶ Reading
    ▶ Tag-Parsing
▶ Semantic analysis
▶ Code generation

[Figure: the compiler pipeline: chars → Scanner → tokens → Reader → sexprs → Tag-Parser → ASTs → Semantic Analyser → ASTs → Code Generator → asm / mach lang. The Reader and Tag-Parser together form the Parser.]

SLIDE 6

The pipeline of the compiler

The stages in the compiler pipeline are distinguished by

▶ Function: What they do
▶ Dependencies: Which stages depend on which others
▶ Complexity: How difficult it is to perform a stage

In programming languages:

▶ Understanding syntax is relatively straightforward (unlike in natural languages)
▶ Understanding meaning is much harder than understanding syntax
▶ Meaning is built upon syntax (in natural languages, syntax & meaning can be inter-dependent)
▶ Code generation is relatively straightforward (template-based)

SLIDE 7

The pipeline of the compiler

Optimizations

How optimizations fit into the pipeline of the compiler:

▶ We distinguish [at least] two levels of optimizations:
  ▶ High-level optimizations (closer to the source language) would go into the semantic analysis phase
  ▶ Low-level optimizations (closer to assembly language) would go into the code generation phase

This distinction can be fuzzy. Some make it fuzzier with intermediate-level optimizations.

SLIDE 8

An example of a high-level optimization

Suppose the compiler can know that the value of n is 0 when reaching the following statement:

    if (n == 0) { foo(); }
    else { goo(n); }

Then an obvious optimization to perform would be to eliminate the if-statement, replacing it with:

    foo();

SLIDE 9

How has the code improved?

Before

    if (n == 0) { foo(); }
    else { goo(n); }

After

    foo();

What was gained

▶ The test during run-time has been eliminated
▶ The code is shorter
▶ This may lead to further, cascading optimizations
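
A minimal sketch of this optimization in Python, under stated assumptions: the AST is modelled as tagged tuples, and the tags ("const", "var", "eq", "call", "if"), the function fold, and the table known (variables whose values the compiler has proved) are all hypothetical names chosen for illustration, not the course's representation.

```python
def fold(node, known):
    """Return node with every test that is decidable at compile time removed.

    known maps variable names to values the compiler has proved they hold.
    """
    tag = node[0]
    if tag == "var" and node[1] in known:
        return ("const", known[node[1]])
    if tag == "eq":
        left, right = fold(node[1], known), fold(node[2], known)
        if left[0] == "const" and right[0] == "const":
            return ("const", left[1] == right[1])
        return ("eq", left, right)
    if tag == "call":
        return ("call", node[1], [fold(arg, known) for arg in node[2]])
    if tag == "if":
        test = fold(node[1], known)
        if test[0] == "const":
            # The test is decided: keep only the branch that would run.
            return fold(node[2] if test[1] else node[3], known)
        return ("if", test, fold(node[2], known), fold(node[3], known))
    return node


# if (n == 0) { foo(); } else { goo(n); }
stmt = ("if",
        ("eq", ("var", "n"), ("const", 0)),
        ("call", "foo", []),
        ("call", "goo", [("var", "n")]))

print(fold(stmt, {"n": 0}))   # the if-statement collapses to the call to foo
print(fold(stmt, {}) == stmt)  # nothing is known about n: nothing changes
```

Note that the fold only fires when the value of n is actually known; otherwise the statement is left as it was.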

SLIDE 10

An example of a low-level optimization

Before:

    mov rax, 1
    mov rax, 2

After:

    mov rax, 2

SLIDE 11

How has the code improved?

Before

    mov rax, 1
    mov rax, 2

After

    mov rax, 2

What was gained

▶ Saved 1 cycle
▶ Made the code smaller
▶ If this code appears within a loop, the gains are multiplied…
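
A peephole pass of this kind can be sketched in Python. This is an illustration under assumptions, not the course's code generator: instructions are hypothetical (opcode, destination, source) triples, and operands are limited to register names and integer immediates.

```python
def drop_dead_movs(instrs):
    """Remove a mov whose destination is overwritten by the very next mov.

    The second mov must not read the register it overwrites; if it does,
    the first mov is still needed and is kept.
    """
    out = []
    for ins in instrs:
        if out:
            prev = out[-1]
            same_target = prev[0] == "mov" and ins[0] == "mov" and prev[1] == ins[1]
            if same_target and ins[2] != prev[1]:
                out.pop()  # the previous value is never observed
        out.append(ins)
    return out


print(drop_dead_movs([("mov", "rax", 1), ("mov", "rax", 2)]))
```

A real pass would also have to reason about memory operands and about jumps that land between the two instructions; the sketch ignores both.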

SLIDE 12

The pipeline of the compiler

Basic concepts

▶ Concrete syntax
▶ Abstract syntax
▶ Abstract Syntax-Tree (AST)
▶ Token
▶ Delimiter
▶ Whitespace

SLIDE 13

Concrete syntax (continued)

The concrete syntax of a programming language is a specification of the syntax of programs in that language in terms of a stream of characters:

▶ It’s one-dimensional
▶ Lacking in structure
  ▶ No nesting
  ▶ No sub-expressions
▶ Difficult to work with
  ▶ Difficult to access parts
  ▶ Difficult to determine correctness
▶ Contains redundancies (spaces, comments, etc)

    (define fact
      (lambda (n)
        (if (zero? n)
            1
            (* n (fact (- n 1))))))

Think of

▶ A text file
▶ Characters typed at the prompt

SLIDE 14

The pipeline of the compiler

Basic concepts

🗹 Concrete syntax
▶ Abstract syntax
▶ Abstract Syntax-Tree (AST)
▶ Token
▶ Delimiter
▶ Whitespace

SLIDE 15

Abstract syntax

The abstract syntax of a programming language is a set of mutually-recursive definitions of abstract data-structures:

▶ Multi-dimensional
▶ Conveys structure
  ▶ Nested
  ▶ Recursive (following the inductive definition of the grammar)
▶ Easier to work with than the concrete syntax
  ▶ Easier to access parts
  ▶ Easier to verify correctness
  ▶ Some syntactic correctness issues have already been decided
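
To make "mutually-recursive definitions of abstract data-structures" concrete, here is a small Python sketch. The node names Const, Var, If, and Applic and the helper size are hypothetical choices for illustration; they are not the course's definitions, which are written in OCaml.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Const:
    value: int


@dataclass(frozen=True)
class Var:
    name: str


@dataclass(frozen=True)
class If:
    test: "Expr"
    then: "Expr"
    otherwise: "Expr"


@dataclass(frozen=True)
class Applic:
    proc: "Expr"
    args: Tuple["Expr", ...]


# An expression is any of the node kinds above; If and Applic refer back to Expr.
Expr = Union[Const, Var, If, Applic]


def size(e: Expr) -> int:
    """Count the nodes of an expression by following its structure."""
    if isinstance(e, If):
        return 1 + size(e.test) + size(e.then) + size(e.otherwise)
    if isinstance(e, Applic):
        return 1 + size(e.proc) + sum(size(a) for a in e.args)
    return 1


# (if (zero? n) 1 (* n (fact (- n 1))))
FACT_BODY = If(
    Applic(Var("zero?"), (Var("n"),)),
    Const(1),
    Applic(Var("*"), (Var("n"),
                      Applic(Var("fact"), (Applic(Var("-"), (Var("n"), Const(1))),)))),
)

print(FACT_BODY.then)   # sub-expressions are reached by field access
print(size(FACT_BODY))
```

No parentheses, spaces, or comments survive in this representation; only the structure does.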

SLIDE 16

The pipeline of the compiler

Basic concepts

🗹 Concrete syntax
🗹 Abstract syntax
▶ Abstract Syntax-Tree (AST)
▶ Token
▶ Delimiter
▶ Whitespace

SLIDE 17

Abstract Syntax-Tree (AST)

Notice

▶ The AST is a tree
▶ A data-structure that represents code
▶ Follows the abstract syntax of the language
▶ No text, parentheses, spaces, tabs, newlines
▶ The structure is evident
▶ Easy to find sub-expressions
▶ Easier to analyze, transform, and compile

[Figure: the AST of fact]

SLIDE 18

Concrete vs Abstract Syntax

▶ Parsing: going from concrete syntax to abstract syntax
▶ Parser: the tool that performs parsing, constructing an AST

Concrete Syntax

▶ Lacks structure
▶ Prone to errors
▶ Hard to delimit sub-expressions
▶ Inefficient to work with
▶ Concrete Syntax can be avoided
  ▶ Visual languages
  ▶ Structure/syntax editors

Abstract Syntax

▶ Has structure
▶ Many kinds of errors are avoided
▶ Sub-Expressions are readily accessible
▶ Efficient to work with

SLIDE 19

The pipeline of the compiler

Basic concepts

🗹 Concrete syntax
🗹 Abstract syntax
🗹 Abstract Syntax-Tree (AST)
▶ Token
▶ Delimiter
▶ Whitespace

SLIDE 20

Tokens

▶ The smallest, meaningful, lexical unit in a language
▶ Described using regular expressions
▶ Identified using a DFA (a very simple model of computation)
▶ Examples
  ▶ Numbers
  ▶ [Non-nested] Strings
  ▶ Names (variables, functions)
  ▶ Punctuation
▶ Cannot handle nesting of any kind:
  ▶ Parenthesized expressions
  ▶ Nested comments
  ▶ etc.

SLIDE 21

Tokens (continued)

▶ Scanning: going from characters into tokens
▶ Scanner: the tool that performs scanning
▶ Scanner generator: the tool that takes definitions for tokens, using regular expressions (and callback functions), and returns a scanner
▶ Examples of scanner-generators: lex, flex
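
A hand-rolled scanner for a tiny subset of Scheme tokens can be sketched with Python's re module. The token classes below (WHITESPACE, LPAREN, RPAREN, NUMBER, STRING, SYMBOL) and their regular expressions are assumptions made for illustration; they cover far less than real Scheme, and the sketch does not check that adjacent tokens are properly delimited.

```python
import re

# Each token class is described by a regular expression, tried in this order.
TOKEN_SPEC = [
    ("WHITESPACE", r"\s+"),
    ("LPAREN", r"\("),
    ("RPAREN", r"\)"),
    ("NUMBER", r"-?\d+"),
    ("STRING", r'"[^"\\]*"'),
    ("SYMBOL", r"[A-Za-z!$%&*/:<=>?^_~+\-][A-Za-z0-9!$%&*/:<=>?^_~+\-]*"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in TOKEN_SPEC))


def scan(text):
    """Group the characters of text into (kind, lexeme) tokens."""
    tokens, pos = [], 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at offset {pos}")
        if m.lastgroup != "WHITESPACE":  # whitespace only delimits; it is dropped
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


for token in scan("(fact (- n 1))"):
    print(token)
```

The output is a flat list: the scanner recognizes the parentheses as tokens but knows nothing about how they nest.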

SLIDE 22

The pipeline of the compiler

Basic concepts

🗹 Concrete syntax
🗹 Abstract syntax
🗹 Abstract Syntax-Tree (AST)
🗹 Token
▶ Delimiter
▶ Whitespace

SLIDE 23

Delimiters

▶ Delimiters are characters that separate tokens
▶ In most languages spaces, parentheses, commas, semicolons, etc., are all delimiters
▶ Some tokens must be separated by delimiters
  ▶ Two consecutive numbers, two consecutive symbols, etc.
▶ Some tokens do not need to be separated by delimiters
  ▶ Two consecutive strings, an open parenthesis followed by a number, etc.
▶ Delimiters are language-dependent

SLIDE 24

The pipeline of the compiler

Basic concepts

🗹 Concrete syntax
🗹 Abstract syntax
🗹 Abstract Syntax-Tree (AST)
🗹 Token
🗹 Delimiter
▶ Whitespace

SLIDE 25

Whitespace

▶ Whitespace refers to characters that
  ▶ Have no graphical representation
  ▶ Occur before or after tokens
    ▶ Spaces within strings are not whitespace…
  ▶ Serve no syntactic purpose other than as delimiters and for indentation
▶ Whitespace is language-dependent

SLIDE 26

Whitespace in various languages

C & Scheme

Spaces, tabs, newlines, carriage returns, and form feeds are examples of whitespace

Java

Literal newline characters may not occur inside a literal string (must use \n). Otherwise, similar to C & Scheme.

Python

Leading tabs are not whitespaces because they have a clear syntactic function: They denote nesting level.

SLIDE 27

Concrete vs Abstract syntax

Artifacts of the Concrete Syntax

▶ Delimiters & whitespace
▶ Parentheses, brackets, braces, and other grouping, nesting, and structuring mechanisms (e.g., begin...end)

☞ Re-examine the concrete and abstract syntax for the factorial function, and notice what’s gone!

SLIDE 28

Concrete vs Abstract syntax (continued)

The concrete syntax

    (define fact
      (lambda (n)
        (if (zero? n)
            1
            (* n (fact (- n 1))))))

The abstract syntax

[Figure: the AST of fact]

SLIDE 29

The pipeline of the compiler (continued)

Basic concepts

🗹 Concrete syntax
🗹 Abstract syntax
🗹 Abstract Syntax-Tree (AST)
🗹 Token
🗹 Delimiter
🗹 Whitespace

SLIDE 30

The pipeline of the compiler (continued)

Question

Which of the following statements is correct?

✗ Every token becomes a vertex in the AST
✗ Every AST is a binary tree
✗ ASTs can contain cycles
✗ Comments are a part of the abstract syntax
✓ ASTs contain type tags

SLIDE 31

More on parsing

To parse computer programs in a given language, we rely on:

▶ Grammars with which to express the syntax of the language
  ▶ There are different kinds of grammars (CFG, CSG, two-level, etc)
  ▶ There are different languages for expressing the grammar (e.g., BNF, EBNF, etc.)
▶ Algorithms for parsing programs as per kind of grammar
▶ Techniques (e.g., parsing combinators, DCGs)

Parser generator: Takes a description of the grammar for a language L, and generates a parser for L. For example, yacc, bison, nearley, etc.

SLIDE 32

The pipeline of the compiler (continued)

Scanning

▶ Going from characters to tokens
▶ Identifying & grouping characters into tokens for words, numbers, strings, etc.
▶ Parsing over tokens is more efficient than parsing over characters

☞ As the parser examines various ways to parse the code, the parser can avoid re-identifying and re-building complex tokens such as numbers, strings, etc

SLIDE 33

The pipeline of the compiler (continued)

Reading

▶ In LISP/Prolog, the parser is split into two components:
  ▶ The reader, or the parser for the data language
  ▶ The tag-parser, or the parser for the source code
▶ In LISP/Scheme/Racket/Clojure/etc, the abstract syntax for the data is the concrete syntax for the code
▶ In Prolog, the abstract syntax for the data is the abstract syntax for the code
▶ Prolog is the programming language with the most powerful capabilities of reflection, i.e., code examining and working with itself.

SLIDE 34

The pipeline of the compiler (continued)

Reading: Summary

▶ In programming languages in which the syntax of code is not a part of the syntax of data, concrete syntax is given as a stream of characters
▶ In programming languages in which the syntax of code is part of the syntax of data, things are a bit more complex:
  ▶ The concrete syntax of data is a stream of characters
  ▶ The concrete syntax of code is the abstract syntax of the data
▶ In Scheme, the language of data is called S-expressions (sexprs, more on this later)
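
A reader can be sketched in a few lines of Python. This is an assumption-laden illustration, not the course's reader: it handles only parentheses, integers, and symbols, maps proper lists to Python lists and symbols to Python strings, and the name read_sexpr is hypothetical.

```python
import re


def read_sexpr(text):
    """Read one S-expression from text into nested Python data."""
    tokens = re.findall(r"[()]|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        if pos >= len(tokens):
            raise SyntaxError("unexpected end of input")
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            items = []
            while pos < len(tokens) and tokens[pos] != ")":
                items.append(read())
            if pos >= len(tokens):
                raise SyntaxError("missing ')'")
            pos += 1  # consume the closing parenthesis
            return items
        if tok == ")":
            raise SyntaxError("unexpected ')'")
        return int(tok) if re.fullmatch(r"-?\d+", tok) else tok

    result = read()
    if pos != len(tokens):
        raise SyntaxError("trailing input after the first S-expression")
    return result


print(read_sexpr("(define x (+ 1 2))"))
print(read_sexpr("(if)"))  # readable as data, even though it is not a valid program
```

The reader only checks that the text is well-formed data. Whether that data is also a well-formed program is a separate question, answered later in the pipeline.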

SLIDE 35

The pipeline of the compiler (continued)

Tag-Parsing

▶ The tag-parser takes sexprs and returns [ASTs for] exprs
▶ Languages outside the LISP & Prolog families do not split parsing into a reader & tag-parser
▶ In such languages, parsing goes directly from tokens to [ASTs for] expressions

☞ Every valid program “used to be” [i.e., before tag-parsing] a valid sexpr

☞ Not every valid sexpr is a valid program!
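
The last two points can be illustrated with a toy tag-parser in Python. Everything here is an assumption for the sake of the sketch: S-expressions are nested Python lists (symbols as strings, numbers as ints), the tags "Const", "Var", "If", and "Applic" are hypothetical, and only a two-armed if is accepted.

```python
def tag_parse(sexpr):
    """Turn a nested-list S-expression into a tagged tuple standing in for an AST."""
    if isinstance(sexpr, int):
        return ("Const", sexpr)
    if isinstance(sexpr, str):
        return ("Var", sexpr)
    if not sexpr:
        raise SyntaxError("() is not an expression")
    if sexpr[0] == "if":
        if len(sexpr) != 4:
            raise SyntaxError(f"malformed if: {sexpr!r}")
        test, then_branch, else_branch = (tag_parse(e) for e in sexpr[1:])
        return ("If", test, then_branch, else_branch)
    # Anything else is treated as an application: (proc arg ...)
    return ("Applic", tag_parse(sexpr[0]), [tag_parse(e) for e in sexpr[1:]])


print(tag_parse(["if", ["zero?", "n"], 1, "x"]))

try:
    tag_parse(["if"])  # a perfectly valid sexpr ...
except SyntaxError as err:
    print("rejected:", err)  # ... that is not a valid program
```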

SLIDE 36

The pipeline of the compiler (continued)

Question

A parser should:

✗ Perform optimizations
✗ Evaluate expressions
✗ Raise type-mismatch errors
✗ Find potential runtime errors (null-pointer dereferences, array-index errors, etc.)
✓ Validate the structure of input programs against a syntactic specification

SLIDE 37

The pipeline of the compiler (continued)

Question

Using an AST, it is impossible to:

✗ Perform code reformatting/beautification/style-checking
✗ Perform optimizations
✗ Output a new program which is semantically equivalent to the input program (code generation)
✗ Refactor the input program
✓ Generate a list of all the comments in the code

SLIDE 38

The pipeline of the compiler (continued)

Semantic Analysis

▶ Annotate the ASTs
▶ Compute addresses
▶ Annotate tail-calls
▶ Type-check code
▶ Perform optimizations

SLIDE 39

The pipeline of the compiler (continued)

Code Generation

▶ Generate a stream of instructions in
  ▶ assembly language
  ▶ machine language
  ▶ Build executable
  ▶ some other target language…
▶ Perform low-level optimizations

SLIDE 40

The compiler for the course

Our compiler project

▶ Written in OCaml
▶ Supports a subset of Scheme + extensions
▶ Supports two simple optimizations
▶ Compiles to x86/64
▶ Runs on Linux

What our project shall lack

▶ Support for the full language of Scheme
▶ Support for garbage collection
▶ The ability to compile itself

SLIDE 41

S-expressions

▶ We’re going to learn about syntax by studying the syntax of Scheme
  ▶ After all, we’re writing a Scheme compiler…
  ▶ It’s relatively simple, compared to the syntax of C, Java, Python, and many other languages
  ▶ It comes with some interesting twists
▶ Scheme comes with two languages:
  ▶ A language for code
  ▶ A language for data
  and there’s a tricky relationship between the two.
▶ The key to understanding the syntax of Scheme is to think about data

SLIDE 42

The Language of Data

What is a language of data? A language in which to

▶ Describe arbitrarily-complex data
  ▶ Possibly multi-dimensional, deeply nested
  ▶ Polymorphic
  ▶ Possibly circular
▶ Access components easily and efficiently

SLIDE 43

The Language of Data (continued)

Today many languages of data are known:

▶ S-expressions (the first: 1959)
▶ Functors (1972)
▶ Datalog (1977)
▶ SGML (1986)
▶ MS DDE (1987)
▶ CORBA (1991)
▶ MS COM (1993)
▶ MS DCOM (1996)
▶ XML (1996)
▶ JSON (2001)

SLIDE 44

The Language of Data (continued)

What makes S-expressions and Functors unique?

▶ They’re the first… 😊
▶ They’re supported natively, as part of specific programming languages
  ▶ S-expressions are supported by LISP-based languages, including Scheme & Racket
  ▶ Functors are supported by Prolog-based languages

☞ The language of programming is a [strict] subset of the language of data

SLIDE 45

The Language of Data (continued)

Think for a moment about the language of XML: <something>...</something>, etc

▶ It’s not supported natively by any programming language
▶ Most modern languages (Java, Python, etc) support it via libraries
▶ No programming language has XML for its concrete syntax:

    <package name="Foo">
      <class name="Foo">
        <method name="goo"> ... </method>
      </class>
    </package>

This would be cumbersome, and weird!

SLIDE 46

The Language of Data (continued)

However, if some programming language both

▶ Supported XML as its data language
▶ Were itself written in XML

Then a parser for XML could also read programs written in that language:

▶ Writing interpreters, compilers, and other language-tools would be much simpler!
▶ Reflection (code examining code) would be simple

SLIDE 47

The Language of Data (continued)

This is the case with S-expressions:

▶ They are the data language for LISP-based languages, including Scheme
▶ LISP-based languages are written using S-expressions
▶ Writing interpreters and compilers in LISP-based languages is much simpler than in other languages
▶ Computational reflection was invented in LISP!
▶ This is the real reason behind all these parentheses in Scheme:
  ▶ A very simple language
  ▶ Supports core types: pairs, vectors, symbols, strings, numbers, booleans, the empty list, etc.
  ▶ A syntactic compromise that is great for expressing both code and data

SLIDE 48

S-expressions (continued)

Back to S-expressions

▶ S-expressions were invented along with LISP, in 1959
▶ S-expression stands for Symbolic Expression
▶ The term is intended to distinguish them from numerical expressions
▶ Before LISP (and long after it was invented), most computation concerned itself with numbers
▶ Computer languages were great at “crunching numbers”, but working with non-numeric data types was difficult
  ▶ String libraries were non-standard and uncommon
  ▶ Polymorphic data was unheard of
  ▶ Nested data structures needed to be implemented from scratch, usually with arrays of characters and/or integers…

SLIDE 49

S-expressions (continued)

Back to S-expressions

Then S-expressions were invented as part of a very dynamic programming language (LISP):

▶ Working with data structures became considerably simpler
  ▶ Trivially allocated (no pointer-arithmetic)
  ▶ Polymorphic (lists of lists of numbers and strings and vectors of booleans and…)
  ▶ Easy to access sub-structures (no pointer arithmetic)
  ▶ Easy to modify (in an easy-going, functional style)
  ▶ Easy to examine (they’re just made up of primitive types)
  ▶ Easy to redefine
  ▶ Automatically deallocated (garbage collection)
▶ Treating code as data became considerably simpler

SLIDE 50

S-expressions (continued)

Several fields were invented using LISP and its tools:

▶ Symbolic Mathematics (Macsyma, a precursor to Wolfram Mathematica)
▶ Artificial Intelligence
▶ Computer adventure-game generation-languages (MDL, ZIL)

SLIDE 51

S-expressions (continued)

Definition: S-expressions

The language is made up of

▶ The empty list: ()
▶ Booleans: #f, #t
▶ Characters: #\a, #\Z, #\space, #\return, #\x05d0, etc
▶ Strings: "abc", "Hello\nWorld\t\x05d0;hi!", etc
▶ Numbers: -23, #x41, 2/3, 2-3i, 2.34, -2.34+3.5i
▶ Symbols: abc, lambda, define, fact, list->string
▶ Pairs: (a . b), (a b c), (a (2 . #f) "moshe")
▶ Vectors: #(), #(a b ((1 . 2) #f) "moshe")

Traditionally, non-pairs are known as atoms.

SLIDE 52

S-expressions (continued)

Proper & improper lists

▶ The name LISP comes from LISt Processing.
▶ In fact, LISP has no direct support for lists.
▶ LISP has ordered pairs
  ▶ Ordered pairs are created using cons
  ▶ The first and second projections over ordered pairs are car and cdr. For all x, y:
    ▶ (car (cons x y)) ≡ x
    ▶ (cdr (cons x y)) ≡ y
  ▶ The ordered pair of x and y can be written as (x . y)

SLIDE 53

S-expressions (continued)

The dot rules

Two rules govern how ordered pairs are printed:

▶ Rule 1: For any E, the ordered pair (E . ()) is printed as (E), which looks like a list of 1 element.
▶ Rule 2: For any E1, E2, …, the ordered pair (E1 . (E2 ...)) is printed as (E1 E2 ...)
▶ These rules only affect how pairs are printed
▶ These rules give us a canonical representation for pairs
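
The two rules can be sketched as a small printer in Python. The encoding is an assumption made for this sketch: a pair is a Python 2-tuple (car, cdr), the empty list is the empty tuple (), symbols are Python strings, and show is a hypothetical name.

```python
def show(e):
    """Render an S-expression built from 2-tuples, applying the two dot rules."""
    if e == ():
        return "()"
    if not isinstance(e, tuple):
        return str(e)
    car, cdr = e
    parts = [show(car)]
    while isinstance(cdr, tuple) and cdr != ():
        parts.append(show(cdr[0]))  # Rule 2: drop the dot and the inner parentheses
        cdr = cdr[1]
    if cdr != ():
        parts += [".", show(cdr)]   # a non-empty, non-pair tail keeps its dot
    # Rule 1: a tail of () is simply not printed
    return "(" + " ".join(parts) + ")"


print(show(("a", ())))            # (a . ())
print(show(("a", ("b", "c"))))    # (a . (b . c))
print(show((("a", ("b", ())), (("c", ("d", ())), ()))))
```

The data itself is untouched; the rules govern only how it is rendered.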

SLIDE 54

S-expressions (continued)

Example

▶ The pair (a . (b . c)) is printed as (a b . c)

[Figure: box diagram of (a . (b . c)): a PAIR whose CAR is SYMBOL a and whose CDR is a second PAIR, with CAR SYMBOL b and CDR SYMBOL c]

SLIDE 55

S-expressions (continued)

Example

▶ The pair ((a . (b . ())) . ((c . (d . ())))) is printed as ((a b) (c d))

[Figure: box diagram of ((a b) (c d)): nested PAIR nodes with CAR and CDR fields, whose leaves are SYMBOL a, SYMBOL b, SYMBOL c, SYMBOL d, and NIL at the end of each list]

SLIDE 56

S-expressions (continued)

▶ Lists in Scheme can come in two forms, proper lists and improper lists.
▶ When we just speak of lists, we usually mean proper lists.
▶ Most of the list processing functions (length, map, etc) take only proper lists:

    > (length '(a b . c))
    Exception in length: (a b . c) is not a proper list
    Type (debug) to enter the debugger.

SLIDE 57

S-expressions (continued)

Proper lists

▶ Proper lists are nested ordered pairs, the rightmost cdr of which is the empty list (aka nil)
▶ Testing for pairs is cheap, and is done by means of the builtin predicate pair?
▶ Testing for lists is expensive, since it traverses nested, ordered pairs, until it reaches their rightmost cdr. This is done by means of the builtin predicate list?

SLIDE 58

S-expressions (continued)

Proper lists

Here’s a definition for list?:

    (define list?
      (lambda (e)
        (or (null? e)
            (and (pair? e)
                 (list? (cdr e))))))

SLIDE 59

S-expressions (continued)

Improper lists

▶ Pairs that are not proper lists are improper lists.
▶ Improper lists end with a rightmost cdr that is not nil
▶ List-processing procedures such as length, map, etc., do not work over improper lists
▶ There is no builtin procedure for testing improper lists, but it could be written as follows:

    (define improper-list?
      (lambda (e)
        (and (pair? e)
             (not (list? (cdr e))))))

SLIDE 60

S-expressions (continued)

Self-evaluating forms

Booleans, numbers, characters, and strings are self-evaluating forms. You can evaluate them directly at the prompt:

    > 123
    123
    > "abc"
    "abc"
    > #t
    #t
    > #\m
    #\m

SLIDE 61

S-expressions (continued)

Other forms

The empty list, pairs, and vectors cannot be evaluated directly at the prompt:

▶ Entering an empty list or a vector or an improper list at the prompt generates a run-time error.
▶ Entering a symbol at the prompt causes Scheme to attempt to evaluate a variable by the same name
▶ Entering a proper list, that is not the empty list, at the prompt causes Scheme to attempt to evaluate an application:

    > (a b c)
    Exception: variable b is not bound
    Type (debug) to enter the debugger.

SLIDE 62

S-expressions: quote & friends

To evaluate S-expressions that are not self-evaluating, we must use the form quote:

▶ The special form quote can be written in two ways:
  ▶ '<sexpr>
  ▶ (quote <sexpr>)
  Both forms are equivalent, but Scheme will convert the first into the second
▶ When you type abc at the Scheme prompt, you’re evaluating the variable abc
▶ When you type 'abc at the Scheme prompt, you’re evaluating the literal symbol abc
▶ The value of the literal symbol abc is just itself, which is why when you type 'abc at the Scheme prompt, you get back abc

SLIDE 63

S-expressions: quote & friends

▶ When you type () at the Scheme prompt, you’re evaluating an application with no function and no arguments! This is a syntax-error!
▶ When you type '() at the Scheme prompt, you’re evaluating a literal empty list
▶ The value of the literal empty list is just itself, which is why when you type '() at the Scheme prompt, you get back ()

SLIDE 64

S-expressions: quote & friends

▶ When you type (a b c) at the Scheme prompt, you’re evaluating the application of the procedure a to the arguments b and c, which are variables
▶ When you type '(a b c) at the Scheme prompt, you’re evaluating the literal list (a b c)
▶ The value of the literal list (a b c) is just (a b c), which is why when you type '(a b c) at the Scheme prompt, you get back (a b c).
▶ Quoting a self-evaluating S-expression is possible, but redundant:

    > '2
    2
    > (+ '2 '3)
    5

SLIDE 65

S-expressions: quote & friends

So what does quote do?

▶ The quote form does nothing
  ▶ It is not a procedure
  ▶ It doesn’t take an argument
  ▶ It delimits a constant, literal S-expression
▶ The syntactic function of quote in Scheme is the same as the syntactic function of braces { ... } in C in defining literal data:

    const int A[] = {4, 9, 6, 3, 5, 1};

SLIDE 66

S-expressions: quote & friends

Meet quasiquote

▶ Similarly to quote, the form quasiquote can be written in two ways:
  ▶ `<sexpr>
  ▶ (quasiquote <sexpr>)
  Both forms are equivalent, but Scheme will convert the first into the second
▶ quasiquote is also used to define data:
  ▶ `abc is the same as 'abc
  ▶ `(a b c) is the same as '(a b c)
▶ But quasiquote has two neat tricks!

SLIDE 67

S-expressions: quote & friends

Meet quasiquote

▶ The following two forms may occur within a quasiquote-expression:
  ▶ The unquote form:
    ▶ ,<sexpr>
    ▶ (unquote <sexpr>)
    Both forms are equivalent, but Scheme will convert the first into the second
  ▶ The unquote-splicing form:
    ▶ ,@<sexpr>
    ▶ (unquote-splicing <sexpr>)
    Both forms are equivalent, but Scheme will convert the first into the second
▶ Both unquote & unquote-splicing are used within quasiquote-expressions, to mix dynamic and static data

SLIDE 68

S-expressions: quote & friends

Meet quasiquote

    > '(a (+ 1 2 3) b)
    (a (+ 1 2 3) b)
    > '(a ,(+ 1 2 3) b)
    (a ,(+ 1 2 3) b)
    > `(a (+ 1 2 3) b)
    (a (+ 1 2 3) b)
    > `(a ,(+ 1 2 3) b)
    (a 6 b)
    > `(a ,(append '(x y) '(z w)) b)
    (a (x y z w) b)
    > `(a ,@(append '(x y) '(z w)) b)
    (a x y z w b)

SLIDE 69

S-expressions: quote & friends

Meet quasiquote

▶ The expression `(a ,(append '(x y) '(z w)) b) is equivalent to

    (cons 'a (cons (append '(x y) '(z w)) '(b)))

▶ The expression `(a ,@(append '(x y) '(z w)) b) is equivalent to

    (cons 'a (append (append '(x y) '(z w)) '(b)))

▶ The difference between unquote & unquote-splicing is that
  ▶ unquote mixes in an expression using cons
  ▶ unquote-splicing mixes in an expression using append
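
One way to see the cons/append distinction is to write a quasiquote expander. The Python sketch below is one common expansion strategy under assumptions, not necessarily what any particular Scheme implementation does: templates are nested Python lists, symbols are strings, and qq_expand is a hypothetical name. It builds the tail element by element, so it emits (cons 'b '()) where the text above writes the equivalent '(b).

```python
def qq_expand(t):
    """Expand a quasiquoted template into an expression built from cons and append."""
    if isinstance(t, str):
        return ["quote", t]          # a symbol stays literal
    if not isinstance(t, list):
        return t                     # numbers are self-evaluating
    if not t:
        return ["quote", []]         # the empty list
    if t[0] == "unquote":
        return t[1]                  # ,e is replaced by the expression e itself
    head, rest = t[0], t[1:]
    if isinstance(head, list) and head[:1] == ["unquote-splicing"]:
        return ["append", head[1], qq_expand(rest)]   # ,@e is mixed in with append
    return ["cons", qq_expand(head), qq_expand(rest)]  # everything else with cons


expr = ["append", ["quote", ["x", "y"]], ["quote", ["z", "w"]]]
print(qq_expand(["a", ["unquote", expr], "b"]))
print(qq_expand(["a", ["unquote-splicing", expr], "b"]))
```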

SLIDE 70

S-expressions: quote & friends

Meet quasiquote

▶ Together, quasiquote, unquote, & unquote-splicing are known as the quasiquote mechanism or the backquote mechanism
▶ The quasiquote mechanism allows us to create data by template, that is, by specifying the shape of the data
▶ In Scheme, convenient ways to create data translate immediately into convenient ways to create code
▶ Therefore we expect the quasiquote mechanism to have useful applications within programming languages
▶ We can turn code that computes something into code that shows us a computation…

SLIDE 71

S-expressions: quote & friends

Consider the familiar factorial function:

    (define fact
      (lambda (n)
        (if (zero? n)
            1
            (* n (fact (- n 1))))))

SLIDE 72

S-expressions: quote & friends

We use the quasiquote mechanism to convert the application (* n (fact (- n 1))) into code that describes what factorial does:

(define fact
  (lambda (n)
    (if (zero? n)
        1
        `(* ,n ,(fact (- n 1))))))

Running (fact 5) now gives:

> (fact 5)
(* 5 (* 4 (* 3 (* 2 (* 1 1)))))

As you can see, factorial now returns a trace of the computation, as an S-expression, rather than a number.


slide-73
SLIDE 73

S-expressions: quote & friends

We are now going to use the quasiquote mechanism to get Scheme to teach us about the structure of S-expressions. Consider the following code:

(define foo
  (lambda (e)
    (cond ((pair? e)
           (cons (foo (car e))
                 (foo (cdr e))))
          ((or (null? e) (symbol? e)) e)
          (else e))))

What does this program do?


slide-74
SLIDE 74

S-expressions: quote & friends

Let’s call foo with some arguments:

> (foo 'a)
a
> (foo 123)
123
> (foo '())
()
> (foo '(a b c))
(a b c)


slide-75
SLIDE 75

S-expressions: quote & friends

Looking over the code again

(define foo
  (lambda (e)
    (cond ((pair? e)
           (cons (foo (car e))
                 (foo (cdr e))))
          ((or (null? e) (symbol? e)) e)
          (else e))))

we notice that:

▶ The 2nd and 3rd ribs of the cond overlap [we could have

removed the 2nd]

▶ All atoms are left unchanged
▶ All pairs are duplicated, while recursing over the car and cdr of the pair

So foo does nothing, though it does it recursively! ☺


slide-76
SLIDE 76

S-expressions: quote & friends

We now use the quasiquote mechanism to cause foo to generate a trace:

(define foo
  (lambda (e)
    (cond ((pair? e)
           `(cons ,(foo (car e))
                  ,(foo (cdr e))))
          ((or (null? e) (symbol? e)) `',e)
          (else e))))


slide-77
SLIDE 77

S-expressions: quote & friends

Running foo now gives us some interesting data:

> (foo 'a)
'a
> (foo '(a b c))
(cons 'a (cons 'b (cons 'c '())))
> (foo '(a 1 b 2))
(cons 'a (cons 1 (cons 'b (cons 2 '()))))
> (foo 123)
123
> (foo '((a b) (c d)))
(cons (cons 'a (cons 'b '())) (cons (cons 'c (cons 'd '())) '()))


slide-78
SLIDE 78

S-expressions: quote & friends

▶ Using the quasiquote mechanism, we got foo to describe how

S-expressions are created using the most basic API

▶ We should really add support for proper lists and vectors!
▶ In fact, the name describe is far more appropriate than foo…

Let’s rewrite foo…


slide-79
SLIDE 79

S-expressions: quote & friends

(define describe
  (lambda (e)
    (cond ((list? e) `(list ,@(map describe e)))
          ((pair? e)
           `(cons ,(describe (car e))
                  ,(describe (cdr e))))
          ((vector? e)
           `(vector ,@(map describe (vector->list e))))
          ((or (null? e) (symbol? e)) `',e)
          (else e))))
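For readers following along in ocaml, the same idea can be sketched over a small S-expression type. This is only an illustration under stated assumptions: the type sexpr, its constructors, and the helper proper_list are made up for this example and are not the course's definitions, vectors are left out, and the description is rendered as a string rather than built as an S-expression.

```ocaml
(* Hypothetical S-expression type, assumed for this sketch only. *)
type sexpr =
  | Nil
  | Number of int
  | Symbol of string
  | Pair of sexpr * sexpr

(* Returns Some [e1; ...; en] when e is a proper list, None otherwise. *)
let rec proper_list = function
  | Nil -> Some []
  | Pair (car, cdr) ->
     (match proper_list cdr with
      | Some rest -> Some (car :: rest)
      | None -> None)
  | _ -> None

(* Renders the Scheme expression that would construct e. *)
let rec describe e =
  match e, proper_list e with
  | Nil, _ -> "'()"
  | Symbol s, _ -> "'" ^ s
  | Number n, _ -> string_of_int n
  | Pair _, Some items ->
     "(list " ^ String.concat " " (List.map describe items) ^ ")"
  | Pair (car, cdr), None ->
     "(cons " ^ describe car ^ " " ^ describe cdr ^ ")"
```

One deliberate difference from the Scheme version: this sketch describes the empty list as '() instead of (list), because Nil is matched before the proper-list case.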


slide-80
SLIDE 80

S-expressions: quote & friends

Running describe on various S-expressions is very instructive:

> (describe '(a b c))
(list 'a 'b 'c)
> (describe '#(a b c))
(vector 'a 'b 'c)
> (describe '(a b . c))
(cons 'a (cons 'b 'c))
> (describe ''a)
(list 'quote 'a)

Wait! What’s with the last example?!


slide-81
SLIDE 81

S-expressions: quote & friends

Recall what we said about quote, quasiquote, unquote, & unquote-splicing:

▶ '<sexpr> ≡ (quote <sexpr>)
▶ `<sexpr> ≡ (quasiquote <sexpr>)
▶ ,<sexpr> ≡ (unquote <sexpr>)
▶ ,@<sexpr> ≡ (unquote-splicing <sexpr>)

Now we get to see this happen…


slide-82
SLIDE 82

S-expressions: quote & friends

Now we get to see this happen:

> (describe ''<sexpr>)
(list 'quote '<sexpr>)
> (describe '`<sexpr>)
(list 'quasiquote '<sexpr>)
> (describe ',<sexpr>)
(list 'unquote '<sexpr>)
> (describe ',@<sexpr>)
(list 'unquote-splicing '<sexpr>)

Rule: Every Scheme expression used to be an S-expression when it was little! 👶


slide-83
SLIDE 83

S-expressions: quote & friends

Question

What is (length '''''''''''''''''moshe) ?

👎 17
👎 16
👎 Generates an error message!
👎 1
👍 2


slide-84
SLIDE 84

S-expressions: quote & friends

Explanation

(length '''''''''''''''''moshe) is the same as (length '(quote <something>)), where <something> is '''''''''''''''moshe, but that really doesn’t matter! We are still computing the length of a list of size 2:

▶ The first element of the list is the symbol quote
▶ The second element of the list is '''''''''''''''moshe
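The same argument can be checked mechanically. The sketch below models the reader's expansion of ' in ocaml; the type sexpr and the helpers quote, quote_n, and length are hypothetical names introduced for this example, not part of the course code. However many quotes are stacked (at least one), the outermost datum is always the two-element list (quote <something>).

```ocaml
(* Hypothetical minimal S-expression type, assumed for this sketch. *)
type sexpr = Nil | Symbol of string | Pair of sexpr * sexpr

(* 'x is read as the two-element proper list (quote x). *)
let quote x = Pair (Symbol "quote", Pair (x, Nil))

(* Applies quote n times to x. *)
let rec quote_n n x = if n = 0 then x else quote (quote_n (n - 1) x)

(* Length of a proper list; anything else is rejected. *)
let rec length = function
  | Nil -> 0
  | Pair (_, cdr) -> 1 + length cdr
  | Symbol _ -> invalid_arg "length: not a proper list"

(* The answer does not depend on how many quotes are nested inside. *)
let lengths = List.map (fun n -> length (quote_n n (Symbol "moshe"))) [1; 2; 5; 16]
```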


slide-85
SLIDE 85

S-expressions: quote & friends (continued)

Question

The structure of the S-expression ''a in Scheme is:

👎 Just the symbol a
👎 The proper list (quote . (a . ()))
👎 The proper list (quote . (quote . (a . ())))
👎 An invalid S-expression
👍 The nested proper list (quote . ((quote . (a . ())) . ()))


slide-86
SLIDE 86

Tag-Parsing (continued)

▶ In a previous slide, we made the claims that in all descendants of

LISP (including Scheme):

☞ Every valid program “used to be” [i.e., before tag-parsing] a

valid sexpr

☞ Not every valid sexpr is a valid program!

▶ We can now show you some examples

As data (S-expressions)

▶ (if if if if) is a list of size 4

▶ (if (zero? n) 'zero 'non-zero) is also a list of size 4

As code

▶ (if if if if) is not a valid if-expression

▶ (if (zero? n) 'zero 'non-zero) is a valid if-expression


slide-87
SLIDE 87

Further reading

🕮 The Dragon Book (2nd edition): Chapter 1.2 - The structure of a compiler, pages 4–11

🔗 Recursive Functions of Symbolic Expressions and Their Computation by Machine, Part I (by John McCarthy, 1960)


slide-88
SLIDE 88

Chapter 2

Goals

🗸 The pipeline of the compiler
🗸 Introduction to syntactic analysis
☞ Further steps in ocaml

Agenda

☞ Ocaml

▶ Types
▶ References
▶ Modules & signatures
▶ Functional programming in ocaml

slide-89
SLIDE 89

Introduction to ocaml (2)

Still need to cover

To program in ocaml effectively in this course, we still need to learn some additional topics:

▶ Defining new data types
▶ Assignments, side-effects

What we shan’t cover

Object Orientation: Once you’re comfortable with ocaml, you might like to pick up the object-oriented layer. As object-orientation goes, you should find it to be sophisticated and expressive.


slide-90
SLIDE 90

Types

New types are defined using the type statement:

type fraction = {numerator : int; denominator : int};;

The above statement defines a new type fraction as a record consisting of two fields: numerator & denominator, both of type int.


slide-91
SLIDE 91

Types (continued)

Once fraction has been defined, the underlying system recognizes it for all records with these fields & types:

# {numerator = 2; denominator = 3};;
- : fraction = {numerator = 2; denominator = 3}
# {denominator = 3; numerator = 2};;
- : fraction = {numerator = 2; denominator = 3}

Notice that the order of the fields in a record is immaterial, because the fields are accessed through their names, which are converted consistently into offsets.
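A small check of this point, as a sketch: the names a and b below are introduced only for the example. Two records written with their fields in different orders compare as equal, and a field is reached by name whichever order was used to write the record.

```ocaml
type fraction = {numerator : int; denominator : int}

(* The same record, written with its fields in two different orders. *)
let a = {numerator = 2; denominator = 3}
let b = {denominator = 3; numerator = 2}

(* Structural equality compares field by field, so the order is irrelevant. *)
let same = (a = b)

(* Access goes through the field name, not through a position. *)
let n = b.numerator
```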


slide-92
SLIDE 92

Types (continued)

The type-inference engine in ocaml will correctly infer newly-defined types:

let add_fractions f1 f2 =
  match f1, f2 with
  | {numerator = n1; denominator = d1},
    {numerator = n2; denominator = d2} ->
     {numerator = n1 * d2 + n2 * d1;
      denominator = d1 * d2};;

And of course:

# add_fractions {numerator = 2; denominator = 3} {numerator = 4; denominator = 5};;
- : fraction = {numerator = 22; denominator = 15}


slide-93
SLIDE 93

Types (continued)

We can define disjoint types as follows:

type number =
  | Int of int
  | Frac of fraction
  | Float of float;;

Think of the | as disjunction. The initial | is optional in ocaml.


slide-94
SLIDE 94

Types (continued)

We can now define a list of numbers as follows:

# [Int 3; Frac {numerator = 3; denominator = 4}; Float (4.0 *. atan(1.0))];;
- : number list =
[Int 3; Frac {numerator = 3; denominator = 4}; Float 3.14159265358979312]

Notice that ocaml had no trouble identifying each of the three elements of the list as belonging to type number.


slide-95
SLIDE 95

Types (continued)

Working with disjoint types

Use match to dispatch over the corresponding type constructor, and make sure you handle each and every possibility!

let number_to_string x =
  match x with
  | Int n -> Format.sprintf "%d" n
  | Frac {numerator = num; denominator = den} ->
     Format.sprintf "%d/%d" num den
  | Float x -> Format.sprintf "%f" x;;


slide-96
SLIDE 96

Types (continued)

Working with disjoint types (continued)

And here’s how it looks:

# number_to_string (Int 234);;
- : string = "234"
# number_to_string (Frac {numerator = 2; denominator = 5});;
- : string = "2/5"
# number_to_string (Float 234.234);;
- : string = "234.234000"
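The same dispatch pattern applies to any function over number. As a further sketch, here is a conversion to float; the function name number_to_float is an assumption made for this example and does not appear in the course code. Each constructor gets exactly one case, so the match is exhaustive.

```ocaml
type fraction = {numerator : int; denominator : int}

type number =
  | Int of int
  | Frac of fraction
  | Float of float

(* Hypothetical helper: one case per constructor of number. *)
let number_to_float x =
  match x with
  | Int n -> float_of_int n
  | Frac {numerator = num; denominator = den} ->
     float_of_int num /. float_of_int den
  | Float f -> f
```

If one of the three cases were left out, the compiler would warn that the pattern matching is not exhaustive, which is the check the advice above relies on.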


slide-97
SLIDE 97

References

Let us take another look at the record-type. Recall the definition of fraction:

# type fraction = {numerator : int; denominator : int};;
type fraction = { numerator : int; denominator : int; }

In the function add_fractions we used pattern-matching to access the record-fields.


slide-98
SLIDE 98

References (continued)

Ocaml lets you access fields directly, using the dot-notation that is familiar from object-oriented programming:

# {numerator = 3; denominator = 5}.numerator;;
- : int = 3
# {numerator = 3; denominator = 5}.denominator;;
- : int = 5


slide-99
SLIDE 99

References (continued)

Ocaml offers a special record-type known as a reference.

▶ References are derived types. For any type α, we can have a type α ref.

▶ References are records with a single field contents
▶ References have a special syntax ! to dereference the field:

# {contents = 1234};;
- : int ref = {contents = 1234}
# {contents = 1234}.contents;;
- : int = 1234
# ! {contents = 1234};;
- : int = 1234


slide-100
SLIDE 100

References (continued)

▶ References have a special syntax := for assignment
▶ This is how assignments are managed in ocaml

# let x = ref 1234;;
val x : int ref = {contents = 1234}
# x;;
- : int ref = {contents = 1234}
# !x;;
- : int = 1234
# x := 4567;;
- : unit = ()
# x;;
- : int ref = {contents = 4567}
# !x;;
- : int = 4567
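A typical use of ! and := together is a piece of private, mutable state captured by a closure. The following is a minimal sketch; make_counter and next are names invented for this example.

```ocaml
(* Hypothetical example: each call to make_counter creates a fresh reference. *)
let make_counter () =
  let count = ref 0 in
  fun () ->
    count := !count + 1;   (* assignment through the reference *)
    !count                 (* dereference to return the new value *)

let next = make_counter ()
let first = next ()
let second = next ()

(* A second counter has its own, independent reference. *)
let other = make_counter ()
let other_first = other ()
```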


slide-101
SLIDE 101

References (continued)

▶ It is not possible to perform assignments on variables
▶ It is only possible to change the fields of reference types

# let x = "abc";;
val x : string = "abc"
# x := "def";;
Characters 0-1:
  x := "def";;
  ^
Error: This expression has type string but an expression was expected of type
         'a ref


slide-102
SLIDE 102

References (continued)

▶ You can define a reference type of any other type, including other reference types:

# let x = ref (ref 1234);;
val x : int ref ref = {contents = {contents = 1234}}
# x := ref 5678;;
- : unit = ()
# x;;
- : int ref ref = {contents = {contents = 5678}}
# !x := 9876;;
- : unit = ()
# x;;
- : int ref ref = {contents = {contents = 9876}}


slide-103
SLIDE 103

Modules, signatures, functors

Modules

▶ A module is a way of packaging functions, classes, variables, &

types

▶ A signature is the type of a module

▶ Visibility of a module can be restricted through the signature

▶ Functors are functions from functors/modules to

functors/modules

Goals

▶ Learn to work with existing modules ▶ Learn to write your own modules


slide-104
SLIDE 104

Modules, signatures, functors (continued)

We define the function hyp to compute the third side of a triangle (the hypotenuse, when the angle is a right angle), given two sides and the angle between them (cosine law). We use the auxiliary function square:

# module M = struct
    let square x = x *. x
    let hyp a b theta =
      sqrt((square a) +. (square b) -. 2.0 *. a *. b *. (cos theta))
  end;;
module M :
  sig
    val square : float -> float
    val hyp : float -> float -> float -> float
  end


slide-105
SLIDE 105

Modules, signatures, functors (continued)

Both M.square and M.hyp are visible:

# M.hyp;;
- : float -> float -> float -> float = <fun>
# M.square;;
- : float -> float = <fun>
# M.square 2.0;;
- : float = 4.
# M.hyp 3.5 5.6 0.645771823239;;
- : float = 3.50763282088818817


slide-106
SLIDE 106

Modules, signatures, functors (continued)

We define the module type based on the returned signature of M, but with the square function removed:

# module type SigHyp = sig
    val hyp : float -> float -> float -> float
  end;;
module type SigHyp = sig val hyp : float -> float -> float -> float end
# module M : SigHyp = struct
    let square x = x *. x
    let hyp a b theta =
      sqrt((square a) +. (square b) -. 2.0 *. a *. b *. (cos theta))
  end;;
module M : SigHyp


slide-107
SLIDE 107

Modules, signatures, functors (continued)

Visibility is now restricted:

▶ M.hyp is visible from outside M
▶ M.square is not visible from outside M
▶ Functions visible from outside may use functions visible only from inside

# M.hyp;;
- : float -> float -> float -> float = <fun>
# M.square;;
Characters 0-8:
  M.square;;
  ^^^^^^^^
Error: Unbound value M.square
# M.hyp 3.5 5.6 0.645771823239;;
- : float = 3.50763282088818817


slide-108
SLIDE 108

Modules, signatures, functors (continued)

Summary

▶ Modules & signatures are the way to package functions &

control visibility

▶ Convenient, super-efficient, safe
▶ No need to use local, nested functions to manage visibility
▶ Always use signatures to control visibility!

Learn on your own

▶ Modules can contain types too, and be used to parameterize

code with types

▶ Simpler & better than generics & templates

▶ Functors map modules/functors ⇒ modules/functors
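As a starting point for the "learn on your own" items, here is a minimal sketch of a module that contains a type, and of a functor parameterized by such a module. The names Ordered, MakeMax, IntMax, and StringMax are assumptions introduced for this example only.

```ocaml
(* A signature containing a type and a comparison over it. *)
module type Ordered = sig
  type t
  val compare : t -> t -> int
end

(* A functor: maps any module matching Ordered to a new module. *)
module MakeMax (O : Ordered) = struct
  let max a b = if O.compare a b >= 0 then a else b
end

(* Instantiating the functor with an anonymous module for int... *)
module IntMax = MakeMax (struct
  type t = int
  let compare : int -> int -> int = compare
end)

(* ...and with the standard String module, which already matches Ordered. *)
module StringMax = MakeMax (String)

let biggest_int = IntMax.max 3 7
let biggest_string = StringMax.max "abc" "abd"
```

The code of max is written once and parameterized by the type O.t, which is the role generics or templates play in other languages.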


slide-109
SLIDE 109

Further reading

🕮 The Objective Caml Programming Language, Chapter 12
🔗 An online tutorial on ocaml modules


slide-110
SLIDE 110

Parsing Techniques

Dozens of parsing algorithms are known:

▶ Parsing algorithms are tailored to a specific kind of grammar

▶ Different kinds of grammars can be parsed by different algorithms

▶ Different kinds of grammars have different levels of complexity on the Chomsky Hierarchy

▶ Most programming languages can be described using

context-free grammars

▶ Some older languages can only be described using

context-sensitive grammars


slide-111
SLIDE 111

Parsing Techniques (continued)

Context-free Grammars (CFGs)

A CFG is a structure of the form G = ⟨V, Σ, R, S⟩:

▶ V is a set of non-terminals ▶ Σ is a set of terminals, or tokens ▶ R is a relation in V × (V ∪ Σ)∗

▶ Members of R are called production rules or rewrite rules

▶ S ∈ V is the initial non-terminal
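To make the four components concrete, the definition can be written down as data. This is a sketch under stated assumptions: the types symbol and grammar, the field names, the example grammar parens, and the helper productions_of are all invented for illustration, with non-terminals named by strings and terminals taken to be characters.

```ocaml
(* A symbol on the right-hand side of a production. *)
type symbol =
  | NT of string   (* a non-terminal, a member of V *)
  | T of char      (* a terminal (token), a member of Σ *)

(* G = ⟨V, Σ, R, S⟩ as a record. *)
type grammar = {
  non_terminals : string list;            (* V *)
  terminals : char list;                  (* Σ *)
  rules : (string * symbol list) list;    (* R, a relation in V × (V ∪ Σ)* *)
  start : string;                         (* S *)
}

(* Balanced parentheses: S → ( S ) S and S → ε *)
let parens = {
  non_terminals = ["S"];
  terminals = ['('; ')'];
  rules = [("S", [T '('; NT "S"; T ')'; NT "S"]); ("S", [])];
  start = "S";
}

(* All right-hand sides of the productions of a given non-terminal. *)
let productions_of g nt =
  List.map snd (List.filter (fun (lhs, _) -> lhs = nt) g.rules)
```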


slide-112
SLIDE 112

Parsing Techniques (continued)

Context-free Grammars (conveniences)

▶ We abbreviate the two productions ⟨A, X⟩ , ⟨A, Y⟩ ∈ R with

⟨A, X | Y⟩ (disjunction)

▶ We abbreviate the three productions ⟨A, X⟩ , ⟨X, ε⟩ , ⟨X, BX⟩ ∈ R,

where X has no other productions, with ⟨A, B∗⟩, (Kleene-star)

▶ We abbreviate the three productions

⟨A, X⟩ , ⟨X, B⟩ , ⟨X, BX⟩ ∈ R, where X has no other productions, with ⟨A, B+⟩, (Kleene-plus)

▶ We abbreviate the two productions ⟨A, ε⟩ , ⟨A, B⟩ ∈ R, with

⟨ A, B?⟩ (maybe)


slide-113
SLIDE 113

Parsing Techniques (continued)

The two basic approaches to parsing CFG are top-down & bottom-up:

Top-down algorithms

▶ Start with the initial non-terminal
▶ Rewrite a non-terminal (the LHS of a production) with the corresponding RHS, matching the input stream of tokens

▶ Keep rewriting until the entire input stream is matched


slide-114
SLIDE 114

Parsing Techniques (continued)

The two basic approaches to parsing CFG are top-down & bottom-up:

Bottom-up algorithms

▶ Start with the input stream of tokens ▶ Find a rewrite rule where the RHS matches sequences in the

input, and rewrite them to the LHS, reducing several items to a single non-terminal

▶ Keep rewriting until the entire input stream has been reduced to

the initial non-terminal


slide-115
SLIDE 115

Parsing Techniques (continued)

How most parsing algorithms are used

▶ Describe the grammar of the language using a DSL for some

restricted CFG

▶ Example: Backus-Naur Form (BNF)

▶ Associate actions with each production rule:

▶ How to build the AST when a specific rule is matched

▶ A parser generator (e.g., yacc, bison, antlr, etc) compiles the

grammar:

▶ Performing various optimizations
▶ Generating code in some language (C, Java, ocaml, etc)
▶ This code is the parser

▶ Calling the parser on some input returns an AST


slide-116
SLIDE 116

Parsing Techniques (continued)

Goals of parsing algorithms

▶ Minimal restrictions on the grammar
▶ Avoid backtracking as much as possible
▶ Maximum optimizations of the parser


slide-117
SLIDE 117

Parsing Combinators

A technique for embedding a specification of a grammar into a programming language:

▶ Parsers for larger languages are composed from parsers for

smaller languages

▶ The grammar can be written & debugged bottom-up
▶ The parsers are first-class objects:

▶ We get to use abstraction to create complex parsers quickly & simply

▶ Effective re-use of common sub-languages

▶ Simple to understand & implement
▶ Very rapid development


slide-118
SLIDE 118

Parsing Combinators (continued)

Parsing combinators do have some disadvantages:

▶ The grammar is embedded as-is:

▶ As much backtracking as implied by the grammar: Rewrite rules that have large common prefixes are going to require plenty of backtracking:
A → xByCzDt
A → xByCzDw
· · ·

▶ No optimizations or transformations are performed on it!

▶ ε-productions & left-recursion result in infinite loops

▶ We need to eliminate these manually!

▶ Can produce inefficient parsers rather efficiently! 😉


slide-119
SLIDE 119

Parsing Combinators (continued)

Nevertheless:

▶ Parsing combinators are a very simple way to learn about grammars:

▶ No complex algorithms are necessary!
▶ The easiest way to design complex grammars & their parsers: Abstraction

▶ shortens & simplifies the code
▶ encourages re-use & consistency

▶ Optimizations can always be done manually!


slide-120
SLIDE 120

Parsing Combinators (continued)

Our parsing combinators take lists of characters for input, and return an AST. We start with code to convert strings to lists of characters:

let string_to_list str =
  let rec loop i limit =
    if i = limit then []
    else (String.get str i) :: (loop (i + 1) limit)
  in
  loop 0 (String.length str);;


slide-121
SLIDE 121

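Parsing Combinators (continued)

We shall also want to generate a string from a list of characters. Strings are immutable in current versions of ocaml, so the result is built in a Bytes buffer and converted to a string at the end:

let list_to_string s =
  let rec loop s n =
    match s with
    | [] -> Bytes.make n '?'
    | car :: cdr ->
       let result = loop cdr (n + 1) in
       Bytes.set result n car;
       result
  in
  Bytes.to_string (loop s 0);;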


slide-122
SLIDE 122

Parsing Combinators (continued)

Sometimes our parsers must fail on their input. When this happens, we raise an exception (which in other languages is called throwing an exception). We should therefore define an exception:

exception X_no_match;;


slide-123
SLIDE 123

Parsing Combinators (continued)

Parsing combinators are compositional. This means

▶ We build parsers of large languages by combining parsers for

smaller [sub-]languages

▶ The procedures that combine parsers are called parsing

combinators (PCs)

▶ But we must start by being able to parse single characters

▶ All other parsers are built on top of such simple parsers for

single characters


slide-124
SLIDE 124

Parsing Combinators (continued)

The const PC takes a predicate (char -> bool), and returns a parser that recognizes any single character satisfying the predicate:

let const pred =
  function
  | [] -> raise X_no_match
  | e :: s ->
     if (pred e) then (e, s)
     else raise X_no_match;;


slide-125
SLIDE 125

Parsing Combinators (continued)

We define the non-terminal that recognizes the capital letter 'A' by calling const with a predicate that returns true if its argument is equal to 'A':

# let ntA = const (fun ch -> ch = 'A');;
val ntA : char list -> char * char list = <fun>

Notice that ntA

▶ …takes a list of characters
▶ …returns a pair of what it matched, and the remaining characters

This is the structure of all parsers written using PCs
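Since const is parameterized by a predicate, other single-character parsers follow the same pattern. The sketch below repeats the definitions of X_no_match and const from the slides so that it is self-contained; the helpers nt_char and nt_range are hypothetical names introduced here and are not claimed to be part of the course's package.

```ocaml
exception X_no_match

(* const, as defined above. *)
let const pred = function
  | [] -> raise X_no_match
  | e :: s -> if pred e then (e, s) else raise X_no_match

(* Hypothetical helper: recognizes exactly the character c. *)
let nt_char c = const (fun ch -> ch = c)

(* Hypothetical helper: recognizes any character between lo and hi, inclusive. *)
let nt_range lo hi = const (fun ch -> lo <= ch && ch <= hi)

let ntA = nt_char 'A'
let nt_lowercase = nt_range 'a' 'z'
```

Both helpers have the same shape as ntA: they take a list of characters and return the matched character paired with the remaining input.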


slide-126
SLIDE 126

Parsing Combinators (continued)

Using ntA

# ntA ['A'; 'B'; 'C'];;
- : char * char list = ('A', ['B'; 'C'])
# ntA [];;
Exception: PC.X_no_match.
# ntA ['a'; 'A'];;
Exception: PC.X_no_match.

▶ We only match the head of the input
▶ Obviously, ntA fails on an empty list


slide-127
SLIDE 127

Parsing Combinators (continued)

▶ Testing our parsers by applying them to lists is no fun

▶ It’s a pain to type lists of characters!

▶ Let’s automate things a bit:

let test_string nt str =
  let (e, s) = (nt (string_to_list str)) in
  (e, (Printf.sprintf "->[%s]" (list_to_string s)));;


slide-128
SLIDE 128

Parsing Combinators (continued)

We can now test more easily:

# test_string ntA "";;
Exception: PC.X_no_match.
# test_string ntA "Abc";;
- : char * string = ('A', "->[bc]")

This is only for testing! When we deploy our parser, we’ll call it directly.


slide-129
SLIDE 129

Parsing Combinators (continued)

Constant parsers are not very useful! Let’s consider catenation:

let caten nt1 nt2 =
  fun s ->
  let (e1, s) = (nt1 s) in
  let (e2, s) = (nt2 s) in
  ((e1, e2), s);;

▶ We try to parse the head of s using nt1

▶ If we succeed, we get e1 and the remaining chars s
▶ We try to parse the head of s (what remained after nt1) using nt2

▶ If we succeed, we get e2 and the remaining chars s
▶ We return the pair of e1 & e2, as well as the remaining chars

slide-130
SLIDE 130

Parsing Combinators (continued)

We define and test the parser for A followed by B:

# let ntAB = caten (const (fun ch -> ch = 'A'))
                   (const (fun ch -> ch = 'B'));;
val ntAB : char list -> (char * char) * char list = <fun>
# test_string ntAB "ABC";;
- : (char * char) * string = (('A', 'B'), "->[C]")
# test_string ntAB "abc";;
Exception: PC.X_no_match.
# test_string ntAB "Abc";;
Exception: PC.X_no_match.
# test_string ntAB "AB";;
- : (char * char) * string = (('A', 'B'), "->[]")
# test_string ntAB "A Bcdef";;
Exception: PC.X_no_match.


slide-131
SLIDE 131

Parsing Combinators (continued)

We now consider disjunction of two parsers:

let disj nt1 nt2 =
  fun s ->
  try (nt1 s)
  with X_no_match -> (nt2 s);;

▶ We try to parse the head of s using nt1

▶ If we succeed, then the call to nt1 returns normally
▶ If we fail, we try to parse the head of s using nt2

slide-132
SLIDE 132

Parsing Combinators (continued)

We define and test the parser for either A or a:

# let ntA_or_a = disj (const (fun ch -> ch = 'A'))
                      (const (fun ch -> ch = 'a'));;
val ntA_or_a : char list -> char * char list = <fun>
# test_string ntA_or_a "";;
Exception: PC.X_no_match.
# test_string ntA_or_a "this won't work either";;
Exception: PC.X_no_match.
# test_string ntA_or_a "A nice example";;
- : char * string = ('A', "->[ nice example]")
# test_string ntA_or_a "a nice example";;
- : char * string = ('a', "->[ nice example]")


slide-133
SLIDE 133

Parsing Combinators (continued)

What next?

▶ Some simple parsers
▶ Learn about the algebra of PCs
▶ Learn of new PC operators
▶ Learn how to use abstraction to make our life simpler


slide-134
SLIDE 134

Some simple parsers

let nt_epsilon s = ([], s);;

let nt_none _ = raise X_no_match;;

let nt_end_of_input = function
  | [] -> ([], [])
  | _ -> raise X_no_match;;

▶ nt_epsilon is the parser that recognizes ε-productions
▶ nt_none is the parser that always fails
▶ nt_end_of_input is the parser that recognizes the end of the input stream (and fails otherwise)


slide-135
SLIDE 135

Parsing Combinators (continued)

What next?

🗸 Some simple parsers

▶ Learn about the algebra of PCs
▶ Learn of new PC operators
▶ Learn how to use abstraction to make our life simpler


slide-136
SLIDE 136

The Algebra of PCs

Why do nt_epsilon & nt_end_of_input match with the empty list []? This has to do with the Algebra of parsing combinators:

▶ What is the unit element of catenation?

▶ Answer: r = ε
▶ We’re looking for a non-terminal r such that for any s, we have rs = sr = s…

▶ This means that nt_epsilon is the unit element for caten (with respect to the language recognized):
▶ caten nt_epsilon nt ≡ caten nt nt_epsilon ≡ nt
▶ Both nt_epsilon & nt_end_of_input are used ’til the end of something

▶ The natural operation is to create a list of all things until ε or the end-of-input are reached

▶ The unit element for append on lists is the empty list
▶ Ergo, it is natural to match [] when either condition is encountered


slide-137
SLIDE 137

The Algebra of PCs (continued)

Similarly, nt_none is the unit element in the algebra of disjunction:

disj nt nt_none ≡ disj nt_none nt ≡ nt

☞ Later on, we shall use the algebra of PCs together with folding operations to create complex parsers easily
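One way the unit elements combine with folding can be sketched as follows. The definitions of X_no_match, const, caten, disj, nt_epsilon, and nt_none are repeated from the slides so the block is self-contained; nt_char, disj_list, and caten_list are names assumed for this illustration, and the course's own versions may differ.

```ocaml
exception X_no_match

let const pred = function
  | [] -> raise X_no_match
  | e :: s -> if pred e then (e, s) else raise X_no_match

let caten nt1 nt2 s =
  let (e1, s) = nt1 s in
  let (e2, s) = nt2 s in
  ((e1, e2), s)

let disj nt1 nt2 s = try nt1 s with X_no_match -> nt2 s

let nt_epsilon s = ([], s)
let nt_none _ = raise X_no_match

(* Hypothetical helper: recognizes exactly the character c. *)
let nt_char c = const (fun ch -> ch = c)

(* Disjunction of a list of parsers: fold disj, starting from its unit nt_none. *)
let disj_list nts = List.fold_right disj nts nt_none

(* Catenation of a list of parsers: fold caten, starting from its unit
   nt_epsilon, and collect the matches into a list with cons. *)
let caten_list nts =
  List.fold_right
    (fun nt rest s ->
      let ((e, es), s) = caten nt rest s in
      (e :: es, s))
    nts
    nt_epsilon
```

The empty cases fall out of the algebra: disj_list [] is nt_none and always fails, while caten_list [] is nt_epsilon and matches [] without consuming input.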


slide-138
SLIDE 138

Parsing Combinators (continued)

What next?

🗸 Some simple parsers
🗸 Learn about the algebra of PCs

▶ Learn of new PC operators
▶ Learn how to use abstraction to make our life simpler


slide-139
SLIDE 139

New PC Operators

Identifying the characters, or pairs of characters, etc that match a grammar is often not enough:

▶ We want to be able to create an AST for that piece of syntax
▶ We do this by specifying postprocessing or callback functions over the expression that was matched.

▶ In our package, the PC that performs this is called pack

let pack nt f =
  fun s ->
  let (e, s) = (nt s) in
  ((f e), s);;

▶ pack takes a non-terminal nt and a function f
▶ returns a parser that recognizes the same language as nt
▶ …but which applies f to whatever was matched

slide-140
SLIDE 140

Parsing combinators (continued)

Example: Identifying digits

# let nt_digit_0_to_9 =
    const (fun ch -> '0' <= ch && ch <= '9');;
val nt_digit_0_to_9 : char list -> char * char list = <fun>
# test_string nt_digit_0_to_9 "234";;
- : char * string = ('2', "->[34]")

The packed version below relies on ascii_0, the character code of '0', which has to be defined first:

# let ascii_0 = int_of_char '0';;
val ascii_0 : int = 48
# let nt_digit_0_to_9 =
    pack (const (fun ch -> '0' <= ch && ch <= '9'))
         (fun ch -> (int_of_char ch) - ascii_0);;
val nt_digit_0_to_9 : char list -> int * char list = <fun>
# test_string nt_digit_0_to_9 "234";;
- : int * string = (2, "->[34]")


slide-141
SLIDE 141

Recursive productions

▶ Grammars are often recursive or mutually-recursive:

▶ The non-terminal on the LHS of a production often appears on the RHS (recursion)

▶ The non-terminal on the LHS of a production often appears in one of the RHSs of the transitive-reflexive closure of the relation (mutual recursion)

▶ Currently, we are unable to describe recursive rules using PCs


slide-142
SLIDE 142

Recursive productions (continued)

We are unable to describe recursive rules using PCs:

⟨A⟩ → ( (⟨A⟩∗|ε) )

▶ The non-terminal A
▶ The open-parenthesis token
▶ The close-parenthesis token
▶ Nevermind that we don’t yet have star…
▶ We can’t use nt_A before it’s defined!


slide-143
SLIDE 143

Recursive productions (continued)

We are unable to describe recursive rules using PCs:

let nt_A =
  caten (const (fun ch -> ch = '('))
        (caten (disj (star nt_A) nt_epsilon)
               (const (fun ch -> ch = ')')));;

▶ Nevermind that we don’t yet have star…
▶ We can’t use nt_A before it’s defined!


slide-144
SLIDE 144

Recursive productions (continued)

We are unable to describe recursive rules using PCs:

▶ The problem is not specific to parsing combinators.

▶ For example, you couldn’t define in Scheme:

(define f (g (h f)))

because you can’t use something before it’s defined! (Ok, in some languages you can!)

▶ So how are recursive definitions possible at all?

▶ When you define a recursive function you are not using the function before it’s defined

▶ You are using the address of the function before the function is defined

▶ Recursive functions are circular data structures:

▶ The language definition permits you to define these particular circular structures statically, rather than at run-time
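
A minimal OCaml sketch of the same point. The names ones and length are illustrative and not taken from the course code. In both definitions, let rec lets the right-hand side mention the name being defined, because only its address is needed at definition time:

```ocaml
(* A circular value: the tail of ones points back at ones itself.
   OCaml accepts this because the right-hand side is a constructor
   application, so nothing has to be computed from ones yet. *)
let rec ones = 1 :: ones

(* A recursive function: the body mentions length, but length is only
   called later, once the closure already exists. *)
let rec length = function
  | [] -> 0
  | _ :: rest -> 1 + length rest
```

Taking the tail of ones any number of times always yields a list whose head is 1, which is what a circular structure looks like from the outside.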


slide-145
SLIDE 145

Parsing combinators (continued)

To implement recursive parsers, we need to delay the evaluation of the recursive non-terminal

▶ “Wrap it in a lambda…”

let delayed thunk = fun s -> thunk() s;;

▶ A thunk is a procedure that takes zero arguments ▶ Thunks are used to delay evaluation
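
To see delayed at work outside the course library, here is a self-contained sketch. The combinators below are minimal stand-ins written for this example; they are assumed to behave like the PC library's const, caten, disj, pack and nt_epsilon, whose actual source is not shown here. The parser nt_depth is hypothetical: it recognizes ⟨D⟩ → ( ⟨D⟩ ) | ε and returns the nesting depth.

```ocaml
exception X_no_match

(* A parser maps a char list to (value, remaining chars),
   or raises X_no_match. *)
let const pred = function
  | ch :: rest when pred ch -> (ch, rest)
  | _ -> raise X_no_match

let caten nt1 nt2 s =
  let (e1, s) = nt1 s in
  let (e2, s) = nt2 s in
  ((e1, e2), s)

let pack nt f s =
  let (e, s) = nt s in
  (f e, s)

let disj nt1 nt2 s =
  try nt1 s with X_no_match -> nt2 s

let nt_epsilon s = ([], s)

(* The thunk is called only when input arrives, not while the
   parser is being constructed. *)
let delayed thunk s = thunk () s

let string_to_list str = List.init (String.length str) (String.get str)

let lparen = const (fun ch -> ch = '(')
let rparen = const (fun ch -> ch = ')')

(* D -> ( D ) | epsilon, where the value is the nesting depth. *)
let rec make_nt_depth () =
  disj
    (pack (caten lparen (caten (delayed make_nt_depth) rparen))
          (fun (_, (depth, _)) -> depth + 1))
    (pack nt_epsilon (fun _ -> 0))

let nt_depth = make_nt_depth ()
```

Without delayed, building the parser would call make_nt_depth () again while still inside make_nt_depth (), and construction would never terminate.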


slide-146
SLIDE 146

Recursive productions (continued)

Example: Identifying digits (continued)

# let nt_natural =
    let rec make_nt_natural () =
      pack (caten nt_digit_0_to_9
                  (disj (delayed make_nt_natural)
                        nt_epsilon))
           (function (a, s) -> a :: s) in
    make_nt_natural();;
val nt_natural : char list -> int list * char list = <fun>

▶ Notice the packing function (function (a, s) -> a :: s)


slide-147
SLIDE 147

Recursive productions (continued)

Example: Identifying digits (continued)

# test_string nt_natural "1234";;
- : int list * string = ([1; 2; 3; 4], "->[]")

We are not done yet:

▶ We got a list of digits, as opposed to a list of chars!

☞ We want to left-fold these digits into a number in base 10


slide-148
SLIDE 148

Parsing combinators (continued)

We pack the list of digits using a left-fold, starting from an accumulator of 0:

# let nt_natural =
    let rec make_nt_natural () =
      pack (caten nt_digit_0_to_9
                  (disj (delayed make_nt_natural)
                        nt_epsilon))
           (function (a, s) -> a :: s) in
    pack (make_nt_natural())
         (fun s -> (List.fold_left (fun a b -> 10 * a + b) 0 s));;
val nt_natural : char list -> int * char list = <fun>

▶ Notice the type of the parser: char list -> int * char list
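
The fold itself can be checked in isolation. This is a small sketch; the helper name digits_to_int is made up for the example and is not part of the course code:

```ocaml
(* Fold a list of decimal digits into a number, most significant
   digit first. The accumulator starts at 0. *)
let digits_to_int digits =
  List.fold_left (fun acc d -> 10 * acc + d) 0 digits

(* ((((0 * 10 + 1) * 10 + 2) * 10 + 3) * 10 + 4) = 1234 *)
let n = digits_to_int [1; 2; 3; 4]
```

A left fold is the natural choice here because each new digit multiplies everything read so far by 10.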


slide-149
SLIDE 149

Recursive productions (continued)

Testing it:

# test_string nt_natural "1234";;
- : int * string = (1234, "->[]")


slide-150
SLIDE 150

Recursive productions (continued)

The parser ntParen expresses the grammar of one set of arbitrarily-nested parentheses:

# let rec ntParen s =
    pack (caten (const (fun ch -> ch = '('))
                (caten (disj (delayed (fun _ -> ntParen))
                             (pack nt_epsilon (fun _ -> "ntParen")))
                       (const (fun ch -> ch = ')'))))
         (fun _ -> "ntParen")
         s;;
val ntParen : char list -> string * char list = <fun>


slide-151
SLIDE 151

Recursive productions (continued)

Testing ntParen on various inputs:

# test_string ntParen "()";;
- : string * string = ("ntParen", "->[]")
# test_string ntParen "";;
Exception: PC.X_no_match.
# test_string ntParen "((()))";;
- : string * string = ("ntParen", "->[]")
# test_string ntParen "((())())";;
Exception: PC.X_no_match.
# test_string ntParen "((()))ABC";;
- : string * string = ("ntParen", "->[ABC]")


slide-152
SLIDE 152

Parsing combinators (continued)

▶ By now, our toolset of parsing combinators consists of

▶ const
▶ caten
▶ disj
▶ pack
▶ delayed

▶ We can handle recursive grammars
▶ We can create ASTs
▶ In principle, we can implement parsers for any language

☞ We now wish to add additional PCs to simplify the task of writing parsers


slide-153
SLIDE 153

New PC Operators (continued)

The Kleene Star

The Kleene-star is a meta-production-rule, or a rule-schema, or a “macro” over production-rules.

▶ For any NT P, P∗ stands for the rule Pstar defined as follows:

Pstar → P Pstar | ε

▶ The point of the Kleene-star is to recognize the catenation of zero or more expressions in P.

Stephen Cole Kleene


slide-154
SLIDE 154

New PC Operators (continued)

Here is our support for the Kleene-star:

let rec star nt =
  fun s ->
  try let (e, s) = (nt s) in
      let (es, s) = (star nt s) in
      (e :: es, s)
  with X_no_match -> ([], s);;

Notice how we match ε implicitly.
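
With star available, the grammar ⟨A⟩ → ( ⟨A⟩∗ ) from the earlier slide can be written directly. The sketch below is self-contained: const, caten and pack are minimal stand-ins assumed to match the PC library's behaviour, and the type paren and the parser nt_paren are hypothetical names introduced only for this example. Unlike ntParen, this parser accepts several sibling groups, such as "((())())".

```ocaml
exception X_no_match

let const pred = function
  | ch :: rest when pred ch -> (ch, rest)
  | _ -> raise X_no_match

let caten nt1 nt2 s =
  let (e1, s) = nt1 s in
  let (e2, s) = nt2 s in
  ((e1, e2), s)

let pack nt f s =
  let (e, s) = nt s in
  (f e, s)

(* Zero or more repetitions of nt; failure of nt ends the repetition
   and yields the empty list, which is how epsilon is matched. *)
let rec star nt s =
  try
    let (e, s) = nt s in
    let (es, s) = star nt s in
    (e :: es, s)
  with X_no_match -> ([], s)

let string_to_list str = List.init (String.length str) (String.get str)

(* A tiny AST: a parenthesized group holding its child groups. *)
type paren = Paren of paren list

let lparen = const (fun ch -> ch = '(')
let rparen = const (fun ch -> ch = ')')

(* A -> ( A* ). The explicit argument s delays the recursive
   reference, so no call to delayed is needed here. *)
let rec nt_paren s =
  pack (caten lparen (caten (star nt_paren) rparen))
       (fun (_, (children, _)) -> Paren children)
       s
```

Note that star loops forever on a parser that succeeds without consuming input; nt_paren always consumes at least an open parenthesis, so it is safe here.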


slide-155
SLIDE 155

New PC Operators (continued)

The Kleene-plus

▶ For any NT P, P+ stands for the rule Pplus defined as follows:

Pplus → P Pplus | P

▶ The point of the Kleene-plus is to recognize the catenation of one or more expressions in P.

▶ Kleene didn’t really invent the Kleene-plus

▶ Rather, Kleene-plus is a natural extension of Kleene-star

slide-156
SLIDE 156

New PC Operators (continued)

Here is our support for the Kleene-plus:

let plus nt =
  pack (caten nt (star nt))
       (fun (e, es) -> (e :: es));;

Notice how we define the Kleene-plus as the catenation of the original NT and its Kleene-star.


slide-157
SLIDE 157

New PC Operators (continued)

Let’s test star and plus:

# let star_star = star (const (fun ch -> ch = '*'));;
val star_star : char list -> char list * char list = <fun>
# let star_plus = plus (const (fun ch -> ch = '*'));;
val star_plus : char list -> char list * char list = <fun>
# test_string star_star "****the end!";;
- : char list * string = (['*'; '*'; '*'; '*'], "->[the end!]")
# test_string star_plus "****the end!";;
- : char list * string = (['*'; '*'; '*'; '*'], "->[the end!]")
# test_string star_star "the end!";;
- : char list * string = ([], "->[the end!]")
# test_string star_plus "the end!";;
Exception: PC.X_no_match.


slide-158
SLIDE 158

New PC Operators (continued)

Ocaml provides the polymorphic type α option = None | Some of α as a way of dealing with situations where a value may or may not exist. We’re going to use α option to implement maybe, which takes a parser r, and returns a parser r? that recognizes zero or one occurrences of whatever is recognized by r.


slide-159
SLIDE 159

New PC Operators (continued)

let maybe nt =
  fun s ->
  try let (e, s) = (nt s) in
      (Some(e), s)
  with X_no_match -> (None, s);;


slide-160
SLIDE 160

New PC Operators (continued)

Assume you have the parser nt_integer, that recognizes integers. Here is how we might use maybe:

# test_string nt_integer "1234";;
- : int * string = (1234, "->[]")
# test_string (maybe nt_integer) "1234";;
- : int option * string = (Some 1234, "->[]")
# test_string (maybe nt_integer) "moshe";;
- : int option * string = (None, "->[moshe]")

You would use pattern matching (via match) to handle both cases (None/Some)
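
As an illustration of matching on the option, here is a self-contained sketch of an optionally signed integer. The combinators are minimal stand-ins assumed to behave like the PC library's, and nt_signed is a hypothetical parser that is not part of the course code (the slides' nt_integer is not shown, so it is not reused here):

```ocaml
exception X_no_match

let const pred = function
  | ch :: rest when pred ch -> (ch, rest)
  | _ -> raise X_no_match

let caten nt1 nt2 s =
  let (e1, s) = nt1 s in
  let (e2, s) = nt2 s in
  ((e1, e2), s)

let pack nt f s =
  let (e, s) = nt s in
  (f e, s)

let rec star nt s =
  try
    let (e, s) = nt s in
    let (es, s) = star nt s in
    (e :: es, s)
  with X_no_match -> ([], s)

let plus nt = pack (caten nt (star nt)) (fun (e, es) -> e :: es)

let maybe nt s =
  try
    let (e, s) = nt s in
    (Some e, s)
  with X_no_match -> (None, s)

let string_to_list str = List.init (String.length str) (String.get str)

let nt_digit =
  pack (const (fun ch -> '0' <= ch && ch <= '9'))
       (fun ch -> Char.code ch - Char.code '0')

let nt_natural =
  pack (plus nt_digit) (List.fold_left (fun a d -> 10 * a + d) 0)

(* An optional '-' followed by a natural number. The match handles
   both outcomes of maybe. *)
let nt_signed =
  pack (caten (maybe (const (fun ch -> ch = '-'))) nt_natural)
       (fun (sign, n) ->
          match sign with
          | Some _ -> -n
          | None -> n)
```

Because maybe never fails, the sign is simply absent (None) for unsigned input, and the packing function decides what that means.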


slide-161
SLIDE 161

New PC Operators (continued)

We might want to attach an arbitrary predicate to serve as a guard for a parser, so that the parser succeeds only if the matched object satisfies the guard. This is what the guard PC does:

let guard nt pred =
  fun s ->
  let ((e, s) as result) = (nt s) in
  if (pred e) then result
  else raise X_no_match;;


slide-162
SLIDE 162

New PC Operators (continued)

Let’s use guard to identify only even numbers:

# test_string (guard nt_integer (fun n -> n land 1 = 0)) "12345";;
Exception: PC.X_no_match.
# test_string (guard nt_integer (fun n -> n land 1 = 0)) "123456";;
- : int * string = (123456, "->[]")

Because the guard may be an arbitrary predicate, this exceeds the expressive power of CFGs!
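
Another use of guard, sketched under the same assumptions as before (minimal stand-in combinators, hypothetical parser names): restricting a number to the range of a byte, and requiring that a star match at least one item.

```ocaml
exception X_no_match

let const pred = function
  | ch :: rest when pred ch -> (ch, rest)
  | _ -> raise X_no_match

let pack nt f s =
  let (e, s) = nt s in
  (f e, s)

let rec star nt s =
  try
    let (e, s) = nt s in
    let (es, s) = star nt s in
    (e :: es, s)
  with X_no_match -> ([], s)

(* Succeed only if the value produced by nt satisfies pred. *)
let guard nt pred s =
  let ((e, _) as result) = nt s in
  if pred e then result else raise X_no_match

let string_to_list str = List.init (String.length str) (String.get str)

let nt_digit =
  pack (const (fun ch -> '0' <= ch && ch <= '9'))
       (fun ch -> Char.code ch - Char.code '0')

(* At least one digit: a guard on the list returned by star. *)
let nt_natural =
  pack (guard (star nt_digit) (fun digits -> digits <> []))
       (List.fold_left (fun a d -> 10 * a + d) 0)

(* A natural number that fits in one byte. *)
let nt_byte = guard nt_natural (fun n -> n <= 255)
```

The predicate sees the packed value (an int here), not the characters, so the check is written in terms of the parsed result.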


slide-163
SLIDE 163

Parsing Combinators (continued)

What next?

🗹 Some simple parsers
🗹 Learn about the algebra of PCs
🗹 Learn of new PC operators

▶ Learn how to use abstraction to make our life simpler


slide-164
SLIDE 164

Functional abstraction in PCs

We now wish to demonstrate some examples of using functional abstraction to write parsers in a general, consistent, and convenient way. Up to now we used to define single-character parsers using const:

let nt_A = const (fun ch -> ch = 'A');;

This is kind of clumsy. Let’s see how we can do this better!


slide-165
SLIDE 165

Functional abstraction in PCs (continued)

let make_char equal ch1 = const (fun ch2 -> equal ch1 ch2);;

let char = make_char (fun ch1 ch2 -> ch1 = ch2);;

let char_ci =
  make_char (fun ch1 ch2 ->
              (Char.lowercase_ascii ch1) =
                (Char.lowercase_ascii ch2));;

The use of make_char allows us to define parser-generating functions for characters, in a case-sensitive or case-insensitive way.

☞ Warning: The version of ocaml installed in the labs uses Char.lowercase, which is now deprecated. It’ll be upgraded [next year].


slide-166
SLIDE 166

Functional abstraction in PCs (continued)

# test_string (char 'a') "abc";;
- : char * string = ('a', "->[bc]")
# test_string (char 'a') "ABC";;
Exception: PC.X_no_match.
# test_string (char_ci 'a') "abc";;
- : char * string = ('a', "->[bc]")
# test_string (char_ci 'a') "ABC";;
- : char * string = ('A', "->[BC]")


slide-167
SLIDE 167

Functional abstraction in PCs (continued)

If we wish to recognize entire words, this is still very cumbersome. We can put the algebra of catenation to good use and do better. To identify a word, we:

▶ Take a string of chars, and convert it to a list
▶ Map over each character in the list, creating a parser that recognizes that character
▶ Perform a right fold over that list using the caten operation (with an appropriate pack)
▶ The unit element is the unit element of catenation, namely epsilon

By abstracting over char we can get both case-sensitive and case-insensitive variants!


slide-168
SLIDE 168

Functional abstraction in PCs (continued)

Here is the code:

let make_word char str =
  List.fold_right
    (fun nt1 nt2 -> pack (caten nt1 nt2)
                         (fun (a, b) -> a :: b))
    (List.map char (string_to_list str))
    nt_epsilon;;

let word = make_word char;;

let word_ci = make_word char_ci;;


slide-169
SLIDE 169

Functional abstraction in PCs (continued)

# test_string (word "moshe") "moshe is a nice guy!";;
- : char list * string = (['m'; 'o'; 's'; 'h'; 'e'], "->[ is a nice guy!]")
# test_string (word_ci "moshe") "Moshe is a nice guy!";;
- : char list * string = (['M'; 'o'; 's'; 'h'; 'e'], "->[ is a nice guy!]")


slide-170
SLIDE 170

Functional abstraction in PCs (continued)

We might want to pick any single character in a string. Rather than specifying long disjunctions, we can use one_of to do this for us.

▶ Very similar to word:

▶ We use disj rather than caten
▶ The unit element for disj is nt_none

Such is the power of abstraction!


slide-171
SLIDE 171

Functional abstraction in PCs (continued)

let make_one_of char str =
  List.fold_right
    disj
    (List.map char (string_to_list str))
    nt_none;;

let one_of = make_one_of char;;

let one_of_ci = make_one_of char_ci;;

As usual, we generate both the case-sensitive and case-insensitive versions!


slide-172
SLIDE 172

Functional abstraction in PCs (continued)

Let’s try out one_of:

# test_string (one_of "abcdef") "moshe!";;
Exception: PC.X_no_match.
# test_string (one_of "abcdef") "be moshe!";;
- : char * string = ('b', "->[e moshe!]")


slide-173
SLIDE 173

Functional abstraction in PCs (continued)

When we wanted to recognize a range of characters, we, once again, used the const PC. We can do better using abstraction:

let make_range leq ch1 ch2 (s : char list) =
  const (fun ch -> (leq ch1 ch) && (leq ch ch2)) s;;

let range = make_range (fun ch1 ch2 -> ch1 <= ch2);;

let range_ci =
  make_range (fun ch1 ch2 ->
               (Char.lowercase_ascii ch1) <=
                 (Char.lowercase_ascii ch2));;


slide-174
SLIDE 174

Functional abstraction in PCs (continued)

And here is how we can test range:

# test_string (star (range 'a' 'z')) "hello world!";;
- : char list * string = (['h'; 'e'; 'l'; 'l'; 'o'], "->[ world!]")
# test_string (star (range 'a' 'z')) "HELLO WORLD!";;
- : char list * string = ([], "->[HELLO WORLD!]")
# test_string (star (range_ci 'a' 'z')) "Hello World!";;
- : char list * string = (['H'; 'e'; 'l'; 'l'; 'o'], "->[ World!]")


slide-175
SLIDE 175

Functional abstraction in PCs (continued)

How might you debug parsers written using PCs?

▶ The PC trace_pc is a wrapper (using the decorator pattern) that can be used to trace any parser

▶ The trace_pc PC takes a documentation string and a parser, and returns a tracing parser.

# test_string (trace_pc "The word \"hi\"" (word "hi")) "high";;
;;; The word "hi" matched the head of "high", and the remaining string is "gh"
- : char list * string = (['h'; 'i'], "->[gh]")
# test_string (trace_pc "The word \"hi\"" (word "hi")) "bye";;
;;; The word "hi" failed on "bye"
Exception: PC.X_no_match.
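
The slides do not show how trace_pc is implemented. The following is only a guess at one possible shape, written to reproduce the messages above: it wraps a parser, prints what happened, and otherwise behaves exactly like the wrapped parser. The helper list_to_string and the stand-in const are assumptions made for this sketch.

```ocaml
exception X_no_match

let const pred = function
  | ch :: rest when pred ch -> (ch, rest)
  | _ -> raise X_no_match

let string_to_list str = List.init (String.length str) (String.get str)

let list_to_string chars =
  String.concat "" (List.map (String.make 1) chars)

(* Hypothetical sketch of a tracing wrapper: report success or failure,
   then return the same result or re-raise the same exception. *)
let trace_pc desc nt s =
  try
    let ((_, rest) as result) = nt s in
    Printf.printf
      ";;; %s matched the head of \"%s\", and the remaining string is \"%s\"\n"
      desc (list_to_string s) (list_to_string rest);
    result
  with X_no_match ->
    Printf.printf ";;; %s failed on \"%s\"\n" desc (list_to_string s);
    raise X_no_match
```

Since the wrapper has the same type as the parser it wraps, it can be inserted around any sub-parser of a larger grammar and removed again without other changes.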


slide-176
SLIDE 176

Parsing Combinators (continued)

What next?

🗹 Some simple parsers
🗹 Learn about the algebra of PCs
🗹 Learn of new PC operators
🗹 Learn how to use abstraction to make our life simpler


slide-177
SLIDE 177

Further reading

🔘 Parsing Combinators
