

SLIDE 1

From parsing to interpretation

Let’s build a language

SLIDE 2

Lots of code, if you’d like to follow along:

https://minond.xyz/pti-talk

SLIDE 3

Who am I?

My name is Marcos Minond, and I’m a Software Engineer. My biggest area of interest in CS is language design and implementation.

SLIDE 4

What are we talking about?

We’re going to be talking about programming languages.

SLIDE 5

What are we talking about?

More specifically, we’re going to be talking about interpreters.

SLIDE 6

What are we talking about?

And even more specifically than that, we’re going to talk about how one can take a sequence of characters that only a human could understand and make a computer understand them.

SLIDE 7

And why would we talk about that?

Well, we’re Software Engineers and as Software Engineers we write a lot of code.

SLIDE 8

And why would we talk about that?

And how do we write that code? Well, with programming languages.

SLIDE 9

And why would we talk about that?

Programming languages are tools. Can you think of a tool that you use more often? Most likely not.

SLIDE 10

And why would we talk about that?

An understanding of programming languages and their implementation, even at a high level, will help you improve as a developer. Even if these skills are not used every day, the knowledge will stay with you and help you throughout your career.

SLIDE 11

So what are we going to do about it?

SLIDE 12

Let’s build an interpreter

SLIDE 13

What’s that?

SLIDE 14

A program that can analyze a program

SLIDE 15

A program that can analyze a program

SLIDE 16

Where do we start?

SLIDE 17

How about with fancy buzzwords?

SLIDE 18

Ohh, fancy.

Grammars, BNF/EBNF, lexers, parsers, parser generators, recursive descent parsers, scope, and evaluation.

SLIDE 19

Where do we really start?

1 - We parse
2 - And then we evaluate

SLIDE 20

This is where we start

1 - Define what our language looks like.
2 - Tokenize the input into a stream of valid tokens.
3 - Take the stream of tokens and compose them into complete expressions.
4 - Evaluate the expressions.
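As a rough preview, the four steps above can be wired together end-to-end for a tiny toy subset: numbers and two-argument addition only. Everything in this sketch (names, shapes, the `run` helper) is illustrative and is not the code built later in the talk.

```scala
// A minimal end-to-end sketch of tokenize -> parse -> evaluate for a
// toy subset of the language. Illustrative only.
object TinyPipeline {
  sealed trait Expr
  case class Num(value: Double) extends Expr
  case class Add(left: Expr, right: Expr) extends Expr

  // Step 2: tokenize the input into a stream of "words".
  def tokenize(src: String): List[String] =
    src.replace("(", " ( ").replace(")", " ) ")
       .split("\\s+").filter(_.nonEmpty).toList

  // Step 3: compose tokens into a complete expression, returning the
  // parsed expression and the tokens left over.
  def parse(tokens: List[String]): (Expr, List[String]) = tokens match {
    case "(" :: "+" :: rest =>
      val (l, rest1) = parse(rest)
      val (r, rest2) = parse(rest1)
      (Add(l, r), rest2.drop(1)) // drop the closing ")"
    case num :: rest => (Num(num.toDouble), rest)
    case Nil => sys.error("unexpected end of input")
  }

  // Step 4: evaluate the expression.
  def evaluate(expr: Expr): Double = expr match {
    case Num(n)    => n
    case Add(l, r) => evaluate(l) + evaluate(r)
  }

  def run(src: String): Double = evaluate(parse(tokenize(src))._1)
}
```

With this, `TinyPipeline.run("(+ 21 (+ 1 20))")` walks all four steps and produces `42.0`.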

SLIDE 21

Let’s define a language

SLIDE 22

First, what can our language do?

SLIDE 23

It can understand numbers

7

SLIDE 24

It can understand strings

"Hello, world."

SLIDE 25

It can understand something is true

#t

SLIDE 26

It can understand something is false

#f

SLIDE 27

It can run code conditionally

(cond (condition1 expression1)
      (condition2 expression2)
      (condition3 expression3)
      (condition4 expression4)
      (else default-expression))

SLIDE 28

It can express arithmetic operations

(* 21 2)

SLIDE 29

It can define functions

(lambda (n) (* n 2))

SLIDE 30

It can apply functions to parameters

(double 21)

SLIDE 31

It can store all of those values

(define cool #t)
(define age 99)
(define name "Marcos")
(define double (lambda (n) (* n 2)))

SLIDE 32

Does it look familiar?

Yes, it looks like a Lisp. Notice all of those parenthesized lists? Those are s-expressions and we’ll be talking about them again soon.

SLIDE 33

Let’s get a little more specific

SLIDE 34

Let’s build a BNF grammar

SLIDE 35

What’s BNF?

Think of BNF as a language for languages. It’s used to define the structure of a computer language (not just programming languages).

SLIDE 36

What’s BNF?

BNF is made up of rules and their expansions, such as:

<expr> ::= <digit> "+" <digit>

where <expr> and <digit> are non-terminal symbols and "+" is a terminal symbol. Rules can also expand to only terminal symbols:

<digit> ::= "1" | "2" | "3"

SLIDE 37

Let’s build an EBNF grammar

SLIDE 38

What’s EBNF?

EBNF is a set of extensions and modifications placed on top of BNF. Differences include dropping the angled brackets, ::= becoming =, and adding semicolons at the end of rules. Other improvements include the ability to repeat expressions with {}, group expressions with (), add optional expressions with [], and explicit concatenation with ,.

SLIDE 39

Some examples?

SLIDE 40

Numbers

number = [ "-" ] , ( digit , { digit } ) ;
digit  = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" ;

SLIDE 41

Strings

string = '"' { chars } '"' ;
chars  = letter | not-quote ;
letter = "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" | "J" | "K" | "L" | "M"
       | "N" | "O" | "P" | "Q" | "R" | "S" | "T" | "U" | "V" | "W" | "X" | "Y" | "Z"
       | "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" | "i" | "j" | "k" | "l" | "m"
       | "n" | "o" | "p" | "q" | "r" | "s" | "t" | "u" | "v" | "w" | "x" | "y" | "z" ;

SLIDE 42

Booleans

boolean = "#t" | "#f" ;

SLIDE 43

Identifiers

symbol     = "<" | ">" | "*" | "+" | "-" | "=" | "_" | "/" | "%" | "?" ;
identifier = ( letter | symbol ) , { letter | symbol | digit } ;

SLIDE 44

S-expressions

sexpr = "(" { exprs } ")" ;
exprs = [ "'" ] , ( atom | sexpr | exprs ) ;
atom  = identifier | number | boolean | string ;

SLIDE 45

All together now. I present to you our Lisp.

main       = { exprs } ;
number     = [ "-" ] , ( digit , { digit } ) ;
digit      = "0" | ... | "9" ;
string     = '"' { chars } '"' ;
chars      = letter | not-quote ;
letter     = "A" | ... | "z" ;
boolean    = "#t" | "#f" ;
identifier = ( letter | symbol ) , { letter | symbol | digit } ;
symbol     = "<" | ">" | "*" | "+" | "-" | "=" | "_" | "/" | "%" | "?" ;
atom       = identifier | number | boolean | string ;
exprs      = [ "'" ] , ( atom | sexpr | exprs ) ;
sexpr      = "(" { exprs } ")" ;

SLIDE 46

What does this give us?

A reference for ourselves or for a tool. A parser generator (like Yacc, GNU Bison, ANTLR, etc.) could take our EBNF grammar and generate all of the code we need in order to parse our language. But that’s not what we’re here for.

SLIDE 47

Let’s build a parser

SLIDE 48

But wait!

Actually, let’s take a step back. Characters are hard, but what if we had ‘words’ instead? We need a lexer.

SLIDE 49

What’s a lexer?

Lexers analyze a string, character by character, and turn it into a series of tokens that can be used in the later steps of parsing.

(+ 21 43)

OPAREN ID(+) NUM(21) NUM(43) CPAREN

SLIDE 50

Token types

sealed trait Token
case object SingleQuote extends Token
case object OpenParen extends Token
case object CloseParen extends Token
case object True extends Token
case object False extends Token
case class Number(value: Double) extends Token
case class Str(value: String) extends Token

SLIDE 51

And even more tokens

case class InvalidToken(lexeme: String) extends Token
case class Identifier(value: String) extends Token
case class SExpr(values: List[Token]) extends Token

SLIDE 52

Tokenizer function

def tokenize(str: String): Iterator[Token] = {
  val src = str.toList.toIterator.buffered
  for (c <- src if !c.isWhitespace) yield c match {
    // ...
  }
}

SLIDE 53

Tokenizer function

def tokenize(str: String): Iterator[Token] = {
  val src = str.toList.toIterator.buffered
  for (c <- src if !c.isWhitespace) yield c match {
    case '('  => OpenParen
    case ')'  => CloseParen
    case '\'' => SingleQuote
    // ...
  }
}

SLIDE 54

Tokenizer function

def tokenize(str: String): Iterator[Token] = {
  val src = str.toList.toIterator.buffered
  for (c <- src if !c.isWhitespace) yield c match {
    case '('  => OpenParen
    case ')'  => CloseParen
    case '\'' => SingleQuote
    case '"'  => ???
    case n if isDigit(n) => ???
    case c if isIdentifier(c) => ???
    case '#'  => ???
    case c    => ???
  }
}

SLIDE 55

Tokenizing strings

val src = str.toList.toIterator.buffered

yield c match {
  case '"' =>
    Str(src.takeWhile(c => c != '"').mkString)
}

SLIDE 56

Tokenizing numbers

val src = str.toList.toIterator.buffered

yield c match {
  case n if isDigit(n) ||
      (n == '-' && src.hasNext && isDigit(src.head)) =>
    val num = (n + consumeWhile(src, isDigit).mkString)
    Number(num.toDouble)
}

SLIDE 57

Helper definitions

def isDigit(c: Char): Boolean =
  c >= '0' && c <= '9'

def consumeWhile[T](
  src: BufferedIterator[T],
  predicate: T => Boolean
): Iterator[T] = {
  def aux(buff: List[T]): List[T] =
    if (src.hasNext && predicate(src.head)) {
      val curr = src.head
      src.next ; aux(buff :+ curr)
    } else buff

  aux(List.empty).toIterator
}

SLIDE 58

Tokenizing identifiers

val src = str.toList.toIterator.buffered

yield c match {
  case c if isIdentifierStart(c) =>
    val name = c + consumeWhile(src, isIdentifier).mkString
    Identifier(name)
}

SLIDE 59

Helper definitions

def isIdentifierStart(c: Char): Boolean =
  isLetter(c) || isSymbol(c)

def isIdentifier(c: Char): Boolean =
  isDigit(c) || isLetter(c) || isSymbol(c)

def isLetter(c: Char): Boolean =
  (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')

def isSymbol(c: Char): Boolean =
  Set('<', '>', '*', '+', '-', '=', '_', '/', '%', '?').contains(c)

SLIDE 60

Tokenizing booleans

val src = str.toList.toIterator.buffered

yield c match {
  case '#' => src.headOption match {
    case None      => InvalidToken("unexpected <eof>")
    case Some('f') => src.next; False
    case Some('t') => src.next; True
    case Some(c)   => src.next; InvalidToken(s"#$c")
  }
}

SLIDE 61

Tokenizing everything else

val src = str.toList.toIterator.buffered

yield c match {
  case c =>
    val word = c + consumeWhile(src, isWord).mkString
    InvalidToken(word)
}

SLIDE 62

Helper definitions

def isParen(c: Char): Boolean =
  c == '(' || c == ')'

def isWord(c: Char): Boolean =
  !c.isWhitespace && !isParen(c)

SLIDE 63

And now we have tokens

tokenize("(+ 21 43)").toList

List(
  OpenParen,
  Identifier(+),
  Number(21.0),
  Number(43.0),
  CloseParen
)

SLIDE 64

Getting there

We nearly have a full representation of our grammar. So far we’ve covered the following cases: numbers, strings, booleans, and identifiers. But we’re still missing the structured expressions: s-expressions.

SLIDE 65

We need these

sexpr = "(" { exprs } ")" ;
exprs = [ "'" ] , ( atom | sexpr | exprs ) ;
atom  = identifier | number | boolean | string ;

SLIDE 66

We need this

(+ 21 43)

OPAREN ID(+) NUM(21) NUM(43) CPAREN

SEXPR(ID(+), NUM(21), NUM(43))

SLIDE 67

ASTs

An abstract syntax tree is a tree representation of source code structure. ASTs represent some tokens explicitly, like numbers, booleans, etc., and others implicitly, like parentheses and semicolons.
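To make the explicit/implicit split concrete, here is a small standalone sketch. The node names are simplified stand-ins for illustration, not the types the talk defines: the parentheses in "(+ 21 43)" produce OPAREN/CPAREN tokens, but in the tree they survive only as nesting.

```scala
// Simplified, illustrative AST node types (stand-ins, not the talk's own).
object AstSketch {
  sealed trait Node
  case class NumberNode(value: Double) extends Node
  case class IdentifierNode(name: String) extends Node
  case class SExprNode(values: List[Node]) extends Node

  // "(+ 21 43)" tokenizes to OPAREN ID(+) NUM(21) NUM(43) CPAREN,
  // but the parens are gone from the tree; only the nesting remains.
  val ast: Node = SExprNode(List(
    IdentifierNode("+"),
    NumberNode(21),
    NumberNode(43)
  ))
}
```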

SLIDE 68

Let’s extend our data structures to match that

SLIDE 69

Implicit data

sealed trait Token
case object SingleQuote extends Token
case object OpenParen extends Token
case object CloseParen extends Token
case class InvalidToken(lexeme: String) extends Token

SLIDE 70

Explicit data

sealed trait Expr extends Token
case object True extends Expr
case object False extends Expr
case class Number(value: Double) extends Expr
case class Str(value: String) extends Expr
case class Identifier(value: String) extends Expr
case class SExpr(values: List[Expr]) extends Expr

SLIDE 71

More expressions

case class Err(message: String) extends Expr
case class Quote(value: Expr) extends Expr
case class Lambda(args: List[Identifier], body: Expr) extends Expr
case class Proc(f: (List[Expr], Env) => (Expr, Env)) extends Expr
case class Builtin(f: (List[Expr], Env) => (Expr, Env)) extends Expr

SLIDE 72

Parser function

def parse(ts: Iterator[Token]): Expr = {
  val tokens = ts.buffered
  tokens.next match {
    // ...
  }
}

SLIDE 73

Parser function

def parse(ts: Iterator[Token]): Expr = {
  val tokens = ts.buffered
  tokens.next match {
    case SingleQuote => ???
    case OpenParen => ???
    case CloseParen => ???
    case InvalidToken(lexeme) => ???
    case expr => expr
  }
}

SLIDE 74

Handling SingleQuote

tokens.next match {
  case SingleQuote =>
    if (tokens.hasNext) Quote(parse(tokens))
    else Err("unexpected <eof>")
}

SLIDE 75

Handling OpenParen

tokens.next match {
  case OpenParen =>
    val values = parseExprs(tokens)
    if (tokens.hasNext) {
      tokens.next
      SExpr(values)
    } else Err("missing ')'")
}

SLIDE 76

Helper definitions

def parseExprs(
  tokens: BufferedIterator[Token]
): List[Expr] =
  if (tokens.hasNext && tokens.head != CloseParen)
    parse(tokens) :: parseExprs(tokens)
  else List.empty

SLIDE 77

Handling CloseParen, InvalidToken, and everything else

tokens.next match {
  case InvalidToken(lexeme) => Err(s"unexpected '$lexeme'")
  case CloseParen => Err("unexpected ')'")

  // True, False, Str, Number,
  // Identifier, SExpr, Quote,
  // Lambda, Builtin, Proc, Err
  case expr => expr
}

SLIDE 78

And now we have an AST

parse(tokenize("(((a)))"))

List(OpenParen, OpenParen, OpenParen,
  Identifier(a), CloseParen, CloseParen,
  CloseParen)

SExpr(List(
  SExpr(List(
    SExpr(List(
      Identifier(a)))))))

SLIDE 79

Hey what about Lambda, Proc, and Builtin?

You may have noticed that our parser never returns Lambdas, Procs, or Builtins. There is a simple answer as to why neither Procs nor Builtins are returned: those are expressions that are only meant to be created programmatically, so the parser doesn’t have to know how to parse them. That is not the case for Lambdas.

SLIDE 80

This is what is happening right now

val code = "(lambda (x) (+ x x))"
parse(tokenize(code))

SExpr(List(
  Identifier(lambda),
  SExpr(List(Identifier(x))),
  SExpr(List(Identifier(+), Identifier(x), Identifier(x)))))

SLIDE 81

But this is what we need

val code = "(lambda (x) (+ x x))"
parse(tokenize(code))

Lambda(List(Identifier(x)),
  SExpr(List(Identifier(+), Identifier(x), Identifier(x))))

SLIDE 82

From this to that

SExpr(List(
  Identifier(lambda),
  SExpr(List(Identifier(x))),
  SExpr(List(Identifier(+), Identifier(x), Identifier(x)))))

Lambda(List(Identifier(x)),
  SExpr(List(Identifier(+), Identifier(x), Identifier(x))))

SLIDE 83

def passLambdas

def passLambdas(expr: Expr): Expr =
  expr match {
    // ...
  }

SLIDE 84

def passLambdas

expr match {
  case SExpr(Identifier("lambda") ::
      SExpr(args) :: body :: Nil) => ???
  case expr => expr
}

SLIDE 85

def passLambdas

val (params, errs) = ???

if (!errs.isEmpty) errs(0)
else Lambda(params, body)

SLIDE 86

def passLambdas

args.foldRight(
  (List[Identifier](), List[Err]())
) { case (curr, (params, errs)) =>
  curr match {
    case id @ Identifier(_) => (id :: params, errs)
    case x => (params, Err("bad argument") :: errs)
  }
}

SLIDE 87

calling passLambdas

def parse(ts: Iterator[Token]): Expr = {
  val tokens = ts.buffered
  passLambdas(tokens.next match {
    // ...
  })
}

SLIDE 88

Lambdas!

val code = "(lambda (x) (+ x x))"
parse(tokenize(code))

Lambda(List(Identifier(x)),
  SExpr(List(Identifier(+), Identifier(x), Identifier(x))))

SLIDE 89

Multiple passes

We could employ this method of checking and manipulating an expression after it is parsed and before being executed to do many things. In our case we are adding a new feature, Lambda expressions, but one could also do optimizations, type checking, and other static analysis checks.
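As one sketch of such an extra pass, here is a constant-folding pass in the same spirit as passLambdas. The types below are simplified stand-ins for the interpreter’s real ones, and the name "add" is assumed to be the addition primitive; this is an illustration, not part of the talk’s code.

```scala
object Folding {
  // Simplified stand-ins for the interpreter's real Expr types.
  sealed trait Expr
  case class Number(value: Double) extends Expr
  case class Identifier(name: String) extends Expr
  case class SExpr(values: List[Expr]) extends Expr

  // Rewrite (add <number> <number>) nodes into plain numbers before
  // evaluation. Children are folded first so nested additions collapse.
  def passConstantFold(expr: Expr): Expr = expr match {
    case SExpr(values) =>
      SExpr(values.map(passConstantFold)) match {
        case SExpr(Identifier("add") :: Number(a) :: Number(b) :: Nil) =>
          Number(a + b)
        case other => other
      }
    case other => other
  }
}
```

Run over `(add 1 (add 2 3))`, the pass collapses the whole tree to `Number(6.0)` without ever touching the evaluator.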

SLIDE 90

So close

So far our interpreter can do a lot. It can parse numbers, booleans, strings, s-expressions, and it even knows about lambdas! But still, it doesn’t run any code.

SLIDE 91

Let’s build an evaluator

SLIDE 92

Eval

In its simplest form, an evaluator is a function that takes an expression and returns another expression. The returned expression can be thought of as the simplified version of the original.

SLIDE 93

Evaluate this!

324 → 324
#t → #t
"Hello, world." → "Hello, world."
(+ 21 43) → 64
((lambda (x) (add x 20)) 22) → 42

SLIDE 94

def evaluate

def evaluate(expr: Expr, env: Env): (Expr, Env) =
  expr match {
    // ...
  }

SLIDE 95

def evaluate

def evaluate(expr: Expr, env: Env): (Expr, Env) =
  expr match {
    case expr @ (True | False | _: Str |
        _: Number | _: Quote | _: Lambda |
        _: Builtin | _: Proc | _: Err) =>
      (expr, env)
  }

SLIDE 96

def evaluate

def evaluate(expr: Expr, env: Env): (Expr, Env) =
  expr match {
    case id @ Identifier(name) =>
      val err = Err(s"unbound variable: $name")
      (env.getOrElse(id, err), env)
  }

SLIDE 97

def evaluate

def evaluate(expr: Expr, env: Env): (Expr, Env) =
  expr match {
    case SExpr(Nil) =>
      (Err("empty expression"), env)
  }

SLIDE 98

def evaluate

def evaluate(expr: Expr, env: Env): (Expr, Env) =
  expr match {
    case SExpr((id @ Identifier(_)) :: body) =>
      val (head, _) = evaluate(id, env)
      evaluate(SExpr(head :: body), env)
  }

SLIDE 99

def evaluate

case SExpr(Lambda(args, body) :: values) =>
  val scope = args.zip(values)
    .foldLeft(env) { case (_env, (arg, value)) =>
      _env ++ Map(arg -> evaluate(value, env)._1)
    }

  val (ret, _) = evaluate(body, scope)
  (ret, env)

SLIDE 100

def evaluate

def evaluate(expr: Expr, env: Env): (Expr, Env) =
  expr match {
    case SExpr(Proc(fn) :: args) =>
      val evaled = args.map { arg => evaluate(arg, env)._1 }
      fn(evaled, env)
  }

SLIDE 101

def evaluate

def evaluate(expr: Expr, env: Env): (Expr, Env) =
  expr match {
    case SExpr(Builtin(fn) :: args) =>
      fn(args, env)
  }

SLIDE 102

def evaluate

def evaluate(expr: Expr, env: Env): (Expr, Env) =
  expr match {
    case SExpr(head :: _) =>
      val err = Err(s"cannot call $head")
      (err, env)
  }

SLIDE 103

That’s all for evaluate

You may have noticed our evaluate function was missing some functionality. What happened to conditionals? What about variable bindings?

SLIDE 104

This is what Proc and Builtin are for

SLIDE 105

Builtin: define

Builtin((args, env) => args match {
  case (id @ Identifier(_)) :: expr :: Nil =>
    evaluate(expr, env)._1 match {
      case err: Err => (err, env)
      case value =>
        val update = env ++ Map(id -> value)
        (value, update)
    }

  case _ => (Err("bad call to define"), env)
})

SLIDE 106

Builtin: cond

Builtin((args, env) => {
  def aux(conds: List[Expr]): Expr =
    // ...

  (aux(args), env)
})

SLIDE 107

Builtin: cond

def aux(conds: List[Expr]): Expr =
  conds match {
    case SExpr(check :: body :: Nil) :: rest => ???
    case Nil => SExpr(List.empty)
    case _ => Err("bad syntax: cond")
  }

SLIDE 108

Builtin: cond

def aux(conds: List[Expr]): Expr =
  conds match {
    case SExpr(check :: body :: Nil) :: rest =>
      evaluate(check, env)._1 match {
        case False => aux(rest)
        case _ => evaluate(body, env)._1
      }

    case Nil => SExpr(List.empty)
    case _ => Err("bad syntax: cond")
  }

SLIDE 109

Builtin: add

Proc((args, env) => (args match {
  case Number(a) :: Number(b) :: Nil => Number(a + b)
  case _ => Err("bad call to add")
}, env))

SLIDE 110

Let’s test it out

SLIDE 111

val code = """
  ((lambda (x) (add x 20)) 22)
"""

val env = Map(
  Identifier("add") -> builtinAdd
)

evaluate(parse(tokenize(code)), env)

Number(42.0)

SLIDE 112

From parsing to interpretation

We’ve built a language