One-and-a-Half Simple Differential Programming Languages Gordon - - PowerPoint PPT Presentation



slide-1
SLIDE 1

One-and-a-Half Simple Differential Programming Languages

Gordon Plotkin Calgary, 2019

~ Joint work at Google with Martín Abadi ~

slide-2
SLIDE 2

Talk Synopsis

  • Review of neural nets
  • Review of differentiation
  • A minilanguage
  • Differentiating conditionals and loops
  • Language semantics: operational and denotational
  • Beyond powers of R
  • Conclusion and future work

slide-3
SLIDE 3

Neural networks: a very brief introduction

Neural networks

Deep learning is based on neural networks:

  • loosely inspired by the brain;
  • built from simple, trainable functions.

[Diagram: a network mapping inputs to an output]
slide-4
SLIDE 4

Primitives: the programmer’s neuron

Primitives: the neuron y = F(w1 x1 + ... + wn xn + b)

[Diagram: inputs x1, ..., xn feeding into the neuron]

  • w1, ..., wn are weights,
  • b is a bias,
  • weights and biases are parameters,
  • F is a “differentiable” non-linear function, e.g., the “ReLU” F(x) = max(0, x)
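A minimal Python sketch of this neuron (the names `neuron` and `relu` are mine, not the talk's):

```python
# The slide's neuron: y = F(w1*x1 + ... + wn*xn + b),
# with the ReLU F(x) = max(0, x) as the non-linearity.
def relu(x):
    return max(0.0, x)

def neuron(weights, bias, inputs, F=relu):
    # weighted sum of inputs plus bias, passed through the activation F
    s = sum(w * x for w, x in zip(weights, inputs)) + bias
    return F(s)
```

For example, `neuron([1.0, 2.0], 0.5, [1.0, 1.0])` computes F(1 + 2 + 0.5) = 3.5.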

slide-5
SLIDE 5

Two activation functions: ReLU and Swish

ReLU max(x, 0)


Swish x · σ(βx), where σ(z) = (1 + e−z)−1


(Ramachandran, Zoph, and Le, 2017)
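The two activations can be written directly (a small sketch; the default `beta=1.0` is my choice of illustration):

```python
import math

def relu(x):
    return max(x, 0.0)

def sigma(z):
    # logistic sigmoid: (1 + e^(-z))^(-1)
    return 1.0 / (1.0 + math.exp(-z))

def swish(x, beta=1.0):
    # Swish: x * sigma(beta * x)
    return x * sigma(beta * x)
```

Note that for large x, swish(x) approaches x, while for very negative x it approaches 0, like ReLU but smoothly.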

slide-6
SLIDE 6

2d convolutional network

[Figure: a 2D convolutional neural network, from “Conv Nets: A Modular Perspective”, colah's blog, July 8, 2014. A convolutional network uses many identical copies of the same neuron, expressing large models while keeping the number of learned parameters small.]
With thanks to C. Olah

slide-7
SLIDE 7

Some tensors

© Knoldus Inc

slide-8
SLIDE 8

Convolutional image classification

Inception-ResNet-v1 architecture

[Figure: schema of the Inception-ResNet-v1 network]

Szegedy et al, arxiv.org/abs/1602.07261

slide-9
SLIDE 9

Recurrent neural networks (RNNs)

[Diagram: a chain of RNN cells with shared parameters]

There are many variants, e.g., LSTMs.

slide-10
SLIDE 10

Mixture of experts

A model MoE architecture with a conditional and a loop: With thanks to Yu et al.

slide-11
SLIDE 11

Neural Turing Machines

Neural Turing Machines combine a RNN with an external memory bank:

Since vectors are the natural language of neural networks, the memory is an array of vectors. But how do reading and writing work? The challenge is to make them differentiable, in particular with respect to the location we read from or write to, so that we can learn where to read and write. This is tricky because memory addresses seem to be fundamentally discrete. NTMs take a clever approach: at every step, they read and write everywhere, just to different extents. For reading, instead of specifying a single location, the RNN outputs an “attention distribution” describing how much we care about each memory position, and the read result is the correspondingly weighted sum. Similarly, we write everywhere at once to different extents: the new value of a memory position is a convex combination of the old content and the write value, with the mix decided by the attention weight.

With thanks to C. Olah

slide-12
SLIDE 12

Supervised learning

Given a training dataset of (input, output) pairs, e.g., a set of images with labels:

While not done:
  Pick a pair (x, y)
  Run the neural network on x to get Net(x, b, . . .)
  Compare this to y to calculate the loss (= error = cost): Loss(b, . . .) = |y − Net(x, b, . . .)|
  Adjust parameters b, . . . to reduce the loss

More generally, pick a “mini-batch” (x1, y1), . . . , (xn, yn) and minimise the loss:

Loss(b, . . .) = Σ_{i=1..n} (yi − Net(xi, b, . . .))²
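The training loop above can be sketched as follows. The one-parameter model `net(x, b) = b * x`, the learning rate, and the finite-difference gradient (standing in for automatic differentiation) are illustrative choices of mine, not the talk's:

```python
# Mini-batch training sketch: squared loss over a batch, gradient descent
# on a single parameter b, gradient by central finite differences.
def net(x, b):
    return b * x

def loss(batch, b):
    return sum((y - net(x, b)) ** 2 for x, y in batch)

def train(batch, b=0.0, r=0.01, steps=200, eps=1e-6):
    for _ in range(steps):
        # finite-difference estimate of dLoss/db
        grad = (loss(batch, b + eps) - loss(batch, b - eps)) / (2 * eps)
        b = b - r * grad   # gradient-descent parameter update
    return b

batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # data from y = 2x, so b should approach 2
```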

slide-13
SLIDE 13

Slope of a line

slope = (change in y)/(change in x) = ∆y/∆x

So ∆y = slope × ∆x

So x′ = x − r·slope ⟹ y′ ≈ y − r·slope²

slide-14
SLIDE 14

Gradient descent

Follow the gradient of the loss function: compute partial derivatives along paths in the neural network, and follow the gradient of the loss with respect to the parameters.

Thus: x′ := x − r·(slope of Loss at x) = x − r·dLoss(x)/dx
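A one-dimensional instance of the update x′ := x − r·dLoss(x)/dx, for the illustrative loss Loss(x) = (x − 3)² whose derivative I write out by hand:

```python
def d_loss(x):
    # derivative of Loss(x) = (x - 3)**2
    return 2 * (x - 3)

def descend(x, r=0.1, steps=100):
    for _ in range(steps):
        x = x - r * d_loss(x)   # x' := x - r * dLoss(x)/dx
    return x
```

Starting from x = 0, the iterates converge to the minimiser x = 3.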

slide-15
SLIDE 15

Multi-dimensional gradient descent

x′ := x − r·∂L(x,y)/∂x and y′ := y − r·∂L(x,y)/∂y

i.e., (x′, y′) := (x, y) − r·(∂L(x,y)/∂x, ∂L(x,y)/∂y)

i.e., v′ := v − r·∇L

slide-16
SLIDE 16

Looking at differentiation

Expressions with several variables:

∂e[x,y]/∂x |_{x,y = a,b}

Gradient of functions f : R² → R of two arguments:

∇(f) : R² → R²,    ∇(f)(u, v) = (∂f(u,v)/∂u, ∂f(u,v)/∂v)

Chain rule:

∂f(g(x,y,z), h(x,y,z))/∂x = ∂f(u,v)/∂u · ∂g(x,y,z)/∂x + ∂f(u,v)/∂v · ∂h(x,y,z)/∂x

where u, v = g(x, y, z), h(x, y, z).
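The chain rule above can be checked numerically for concrete choices of f, g, h (mine, purely for illustration): f(u,v) = u·v, g(x,y,z) = x + y, h(x,y,z) = x·z.

```python
# Check: df(g,h)/dx = df/du * dg/dx + df/dv * dh/dx
def df_du(u, v): return v       # f(u,v) = u*v
def df_dv(u, v): return u
def dg_dx(x, y, z): return 1.0  # g(x,y,z) = x + y
def dh_dx(x, y, z): return z    # h(x,y,z) = x*z

def chain_rule_dx(x, y, z):
    u, v = x + y, x * z
    return df_du(u, v) * dg_dx(x, y, z) + df_dv(u, v) * dh_dx(x, y, z)

def direct_dx(x, y, z, eps=1e-6):
    # differentiate the hand-composed f(g(x,y,z), h(x,y,z)) numerically
    F = lambda t: (t + y) * (t * z)
    return (F(x + eps) - F(x - eps)) / (2 * eps)
```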

slide-17
SLIDE 17

A matrix view of the multiargument chain rule.

We have:

∂f(g,h)/∂x = ∂f(u,v)/∂u · ∂g/∂x + ∂f(u,v)/∂v · ∂h/∂x
∂f(g,h)/∂y = ∂f(u,v)/∂u · ∂g/∂y + ∂f(u,v)/∂v · ∂h/∂y
∂f(g,h)/∂z = ∂f(u,v)/∂u · ∂g/∂z + ∂f(u,v)/∂v · ∂h/∂z

writing g, h for g(x,y,z), h(x,y,z), and u, v = g(x,y,z), h(x,y,z).

Set k = g, h : R³ → R² and define its Jacobian to be the 2 × 3 matrix:

Jk = ( ∂g/∂x  ∂g/∂y  ∂g/∂z )
     ( ∂h/∂x  ∂h/∂y  ∂h/∂z )

Then the gradient of the composition f ∘ k is given by the vector-Jacobian product:

∇(f ∘ k)(x, y, z) = ∇f(u, v) · Jk(x, y, z)
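The vector-Jacobian product can be written out in a few lines of Python, for illustrative choices (mine) of k = g, h with g(x,y,z) = x·y and h(x,y,z) = y + z, and f(u,v) = u·v:

```python
# grad(f o k) = grad f(u, v) . Jk(x, y, z)  (row vector times 2x3 matrix)
def jacobian_k(x, y, z):
    return [[y,   x,   0.0],   # dg/dx, dg/dy, dg/dz for g = x*y
            [0.0, 1.0, 1.0]]   # dh/dx, dh/dy, dh/dz for h = y + z

def grad_f(u, v):
    return [v, u]              # gradient of f(u,v) = u*v

def vjp(row, J):
    # row vector times matrix
    return [sum(row[i] * J[i][j] for i in range(len(row)))
            for j in range(len(J[0]))]

def grad_composition(x, y, z):
    u, v = x * y, y + z
    return vjp(grad_f(u, v), jacobian_k(x, y, z))
```

At (1, 2, 3) this yields the gradient of (xy)(y + z), namely (y(y+z), x(y+z) + xy, xy).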

slide-18
SLIDE 18

Differentials: A functional view of differentiation

Jacobians    For f : R^m → R^n we have: Jf : R^m → Mat_{n,m}

Chain rule for Jacobians    For f : R^l → R^m and g : R^m → R^n we have:

J_x(g ∘ f) = J_{f(x)}(g) · J_x(f)

Differentials, aka (forward) derivatives    For f : R^m → R^n we define df : R^m × R^m → R^n by:

(d_x f)(y) = (J_x f) · y

Chain rule for differentials    For f : R^l → R^m and g : R^m → R^n we have:

d_x(g ∘ f) = d_{f(x)}(g) ∘ d_x(f)
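The differential (d_x f)(y) can be computed by forward-mode automatic differentiation with dual numbers; here is a minimal sketch (the class and its operator set are my own illustration, supporting just + and ·):

```python
# Dual numbers carry (value, tangent); arithmetic propagates the
# forward derivative alongside the value.
class Dual:
    def __init__(self, val, tan):
        self.val, self.tan = val, tan
    def __add__(self, other):
        return Dual(self.val + other.val, self.tan + other.tan)
    def __mul__(self, other):
        # product rule: (uv)' = u'v + uv'
        return Dual(self.val * other.val,
                    self.tan * other.val + self.val * other.tan)

def f(x):
    return x * x + x   # f(x) = x^2 + x, so f'(x) = 2x + 1

d = f(Dual(3.0, 1.0))  # evaluate at x = 3 along the tangent direction y = 1
```

Here `d.val` is f(3) = 12 and `d.tan` is (d_3 f)(1) = f′(3) = 7.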

slide-19
SLIDE 19

Reverse derivatives

For f : R^m → R^n we have dR f : R^m × R^n → R^m where:

(dR_x f)(y) = y · (J_x f)    (= (d_x f)†(y))

Chain rule    For f : R^l → R^m and g : R^m → R^n we have:

dR_x(g ∘ f) = dR_x(f) ∘ dR_{f(x)}(g)

as:

dR_x(g ∘ f)(z) = z · J_x(g ∘ f)
               = z · (J_{f(x)}(g) · J_x(f))
               = (z · J_{f(x)}(g)) · J_x(f)
               = dR_x(f)(dR_{f(x)}(g)(z))

Gradients    For the case n = 1, where f : R^m → R, we have:

dR f : R^m × R → R^m

and then: ∇_x f = (dR_x f)(1)
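Reverse derivatives are what tape-based automatic differentiation computes. A tiny sketch (the `Var` class and its global tape are my own illustration, not the talk's formalism): record each operation forward, then sweep the tape backwards accumulating adjoints, seeding the output with 1 as in ∇_x f = (dR_x f)(1).

```python
# Minimal reverse-mode AD: a global tape records nodes in creation order
# (which is a topological order), and backward() sweeps it in reverse.
class Var:
    tape = []
    def __init__(self, val, parents=()):
        self.val, self.parents, self.adj = val, parents, 0.0
        Var.tape.append(self)
    def __add__(self, other):
        return Var(self.val + other.val, [(self, 1.0), (other, 1.0)])
    def __mul__(self, other):
        return Var(self.val * other.val, [(self, other.val), (other, self.val)])

def backward(out):
    out.adj = 1.0                        # seed the output adjoint with 1
    for node in reversed(Var.tape):      # reverse topological order
        for parent, local_deriv in node.parents:
            parent.adj += node.adj * local_deriv

x, y = Var(3.0), Var(4.0)
z = x * y + x          # z = x*y + x, so dz/dx = y + 1 and dz/dy = x
backward(z)
```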

slide-20
SLIDE 20

Takeaway on differentiation

For f : R^m → R^n we have

  • df : R^m × R^m → R^n
  • dR f : R^m × R^n → R^m

For f : R^m → R we have: ∇_x f = (dR_x f)(1)

slide-21
SLIDE 21

ONNX: Open Neural Network Exchange

ONNX is an open exchange format to represent deep learning models. Here is some of an ONNX graph:

[Figure: fragment of an ONNX graph. © ONNX]

slide-22
SLIDE 22

Deep Learning: Differentiable Programming Languages

Deep Learning Graphical Frameworks

  • Caffe, CNTK, MXNet, Theano, TensorFlow, . . .

TF graphs can have conditionals, iterations, and function calls.

Automatic Differentiation (dates back to 1965!)

  • Autograd, which works as a Python package, adding a first-class gradient operation.
  • Similarly: Python/TF Eager mode, Gluon, PyTorch. Also F#/DiffSharp.
  • VLAD, a functional language with first-class forward and reverse differentiation (Pearlmutter, Siskind).

Foundational studies

  • Differential Lambda Calculi (Ehrhard, Regnier, Manzyuk);
  • Language for Diff. Functions (Edalat, Gianantonio);
  • Differential/Tangent Categories (Blute, Cockett, . . .).

Functional Programming

  • Efficient Differentiable Programming in a Functional Array-Processing Language (Shaikhha, Fitzgibbon et al)
  • Demystifying Differentiable Programming: Shift/Reset the Penultimate Backpropagator (Wang et al)

slide-23
SLIDE 23

Core differentiable programming language desiderata

As many programmable functions f : T → U differentiable as possible, for as many types T, U as possible.

A gradient operation; more generally, a reverse derivative one; even higher-order (= iterated) derivatives.

Tensors (aka multidimensional arrays). These have ranks k and shapes d0, . . . , dk−1. The set of such real tensors is: R^{[d0]×...×[dk−1]}

Execution:

  • Learning: optimising neural net parameters against data.
  • Inference: using optimised neural nets.
slide-24
SLIDE 24

How are we going to do prog language theory?

Study a small functional programming language with relevant features:

  • Products of reals as datatypes, but:
  • No tensor datatypes (∃ APL + 21 other array languages; functional programming: Steuwer et al; Gibbons; Haskell).
  • Reverse differentiation as a language primitive.
  • Control structures: conditionals/loops/recursion.
  • More, later.

Give it a semantics. Use the semantics to justify an operational semantics including the differentiation constructs. We also have source code transformations eliminating all differentiation constructs, not given here, but summarised.

slide-25
SLIDE 25

Previous foundational work

Ehrhard and Regnier’s differential lambda calculus. I originally thought this was the way to go. It is a typed lambda calculus with product and function types and (forward) differentiation as a primitive. It is based on a general notion of a differential category (which has linear features: tensors). Example: the convenient vector spaces of Frölicher (other examples exist too). Main issue: it does not support partial functions. There is, however, a non-higher-order notion of a differential restriction category (Cockett et al) which has smooth partial functions over powers of the reals as a model.

slide-26
SLIDE 26

Previous work

Automatic (aka algorithmic) differentiation: given a program, produce a program that calculates its derivative. Originally for scientific computing, not machine learning. Huge literature + large community: www.autodiff.org. Very concerned with efficiency. As far as I could find out, largely not focused on semantics and its associated language theory — the focus of this talk — though there is functional programming work (VLAD).

References

  • A simple automatic derivative evaluation program, Wengert, 1964.
  • Compiling fast partial derivatives of functions given by algorithms, Speelpenning, 1980.
  • Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation, Griewank & Walther.
  • Automatic Differentiation in Machine Learning, Baydin et al, 2017.

slide-27
SLIDE 27

A minilanguage: syntax

Types    T ::= real | unit | T × U

Terms    M ::= x | r (r ∈ R) | M + N | op(M) | M.rdL(x : T. N)
             | let x : T = M in N | ∗ | M, N | fst(M) | snd(M)
             | if B then M else N | letrec f (x : T) : U = M in N | f (M)

Boolean terms    B ::= true | false | P(M)

slide-28
SLIDE 28

Minilanguage: Typing

(Ordinary) Environments    Γ = x0 : T0, . . . , xn−1 : Tn−1

Function Environments    Φ = f0 : T0 → U0, . . . , fn−1 : Tn−1 → Un−1

Judgements    Φ | Γ ⊢ M : T    and    Φ | Γ ⊢ B

slide-29
SLIDE 29

Typing (cntnd)

Operations

  Φ | Γ ⊢ M : T
  ─────────────────  (op : T → U)
  Φ | Γ ⊢ op(M) : U

Reverse derivatives

  Φ | Γ ⊢ L : T    Φ | Γ[x : T] ⊢ N : U    Φ | Γ ⊢ M : U
  ──────────────────────────────────────────────────────
  Φ | Γ ⊢ M.rdL(x : T. N) : T

slide-30
SLIDE 30

Differentiating sequences of operations

Consider differentiating k(x) = h(g(f(x))) at x = a.

Trace (or tape) method

  1. Compute the trace, the list [a, b, c] = [a, f(a), g(f(a))].
  2. Using the trace, play the tape h(g(f(x))) backwards, applying the chain rule: k′(a) = h′(c) · g′(b) · f′(a)

Source code transformation (SCT)

  1. Using the chain rule, transform the code to M = let y = f(x) in let z = g(y) in h′(z) · g′(y) · f′(x)
  2. Evaluate the transformed code with x = a.

Much of the automatic differentiation literature considers how to do reverse-mode differentiation efficiently, e.g., by first translating to A-normal form; this produces PL versions of the backprop algorithm (see: Griewank, Who Invented the Reverse Mode of Differentiation?, 2012).
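The trace method above can be sketched in a few lines, for illustrative choices (mine) of f(x) = x², g(x) = x + 1, h(x) = 3x, whose derivatives I write out by hand:

```python
# Trace (tape) method for k(x) = h(g(f(x))): record intermediate
# values forward, then apply the chain rule along the tape backwards.
f, df = (lambda x: x * x), (lambda x: 2 * x)
g, dg = (lambda x: x + 1), (lambda x: 1.0)
h, dh = (lambda x: 3 * x), (lambda x: 3.0)

def k_prime(a):
    trace = [a, f(a), g(f(a))]          # the tape [a, b, c]
    # reverse sweep: k'(a) = h'(c) * g'(b) * f'(a)
    return dh(trace[2]) * dg(trace[1]) * df(trace[0])
```

Here k(x) = 3(x² + 1), so k′(a) = 6a; e.g. k′(2) = 12.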

slide-31
SLIDE 31

Differentiating conditionals

Consider: h(x) = if b(x) then f(x) else g(x)

The rule people use: dh/dx = if b(x) then df/dx else dg/dx

However consider: h(x) = if x = 0 then −x else x

We have h(x) = x, so dh/dx = 1. But the rule gives: dh/dx = if x = 0 then −1 else 1

Another example: ReLU(x) = if x ≤ 0 then 0 else x
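The counterexample can be run directly: the branch-wise rule reports −1 at x = 0, while the true derivative (here estimated by central differences, my choice of check) is 1:

```python
# h(x) = if x == 0 then -x else x, which is just h(x) = x.
def h(x):
    return -x if x == 0 else x

def naive_dh(x):
    # differentiate each branch separately, as the naive rule does
    return -1.0 if x == 0 else 1.0

def true_dh(x, eps=1e-6):
    # central finite difference around x
    return (h(x + eps) - h(x - eps)) / (2 * eps)
```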

slide-32
SLIDE 32

A way around the difficulty

Note b : R → T. Switch to continuous partial b : R ⇀ T, meaning that b⁻¹(tt) and b⁻¹(ff) are open (e.g., (−∞, 0) and (0, ∞)). Write f : R ⇀ R to mean that f is partial, with open domain of definition.

Proposition For continuous b : R ⇀ T and differentiable f, g : R ⇀ R, the conditional h(x) ≃ if b(x) then f(x) else g(x) is differentiable and, for all x ∈ R, we have:

dh/dx ≃ if b(x) then df/dx else dg/dx

Reference Thomas Beck, Herbert Fischer, The if-problem in automatic differentiation, 1994.

slide-33
SLIDE 33

Proof of proposition

Proposition For continuous b : R ⇀ T and differentiable f, g : R ⇀ R, the conditional h(x) ≃ if b(x) then f(x) else g(x) is differentiable and, for all x ∈ R, we have:

dh/dx ≃ if b(x) then df/dx else dg/dx

Proof. Suppose b(x) = tt.

  • Then there is an open interval (a, b) containing x such that b(x′) = tt for all x′ in (a, b).
  • So h(x′) ≃ f(x′) for all x′ ∈ (a, b) (not just x!)
  • So h and f have the same derivative at x, if any.
slide-34
SLIDE 34

Another example: swapping

Consider swap(x, y) = if x > y then (x, y) else (y, x)

When is ∂swap/∂x = if x > y then (1, 0) else (0, 1) OK?

By which I mean: at what points is > continuous? Equivalently, what is the maximum continuous restriction of >?

slide-35
SLIDE 35

How While loops work

We wish to compute the derivative at x of h(x) ≃ while b(x) do f(x)

Suppose h(x)↓, and the computation goes round the loop n times. Then h(x) = f^n(x) and the rule for this x is: dh/dx = df^n/dx

Potential proof: assuming b continuous and f differentiable, we even have h(x′) = f^n(x′) for all x′ in an open interval containing x.

slide-36
SLIDE 36

Computing reverse derivatives of while loops

Trace method

Run the loop (interpreter or compiler) till it terminates, producing a trace, being a sequence of intermediate values. Evaluate the reverse derivative along the tape, here the corresponding iterated loop body, using the chain rule.

Source code transformation Translate the code to code which consists of two while loops in sequence:

The first is the original while loop, but it also keeps copies of “checkpoint” intermediate values, and maintains a loop counter. The second counts down from the final value of the loop counter, computing individual reverse derivatives on the way using the relevant intermediate values.
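The two-loop transformation can be sketched concretely. The loop guard b(x): x > 1 and body f(x) = x/2 (with its hand-written derivative) are illustrative choices of mine:

```python
# Two-loop sketch for the reverse derivative of w = while b do f:
# a forward loop that checkpoints, then a backward loop applying the
# chain rule through the recorded iterations.
b = lambda x: x > 1
f = lambda x: x / 2
df = lambda x: 0.5            # derivative of the loop body f(x) = x/2

def reverse_deriv_while(x, y):
    # First loop: the original while loop, keeping checkpoint values.
    checkpoints = []
    while b(x):
        checkpoints.append(x)
        x = f(x)
    # Second loop: count down, accumulating reverse derivatives.
    for x_i in reversed(checkpoints):
        y = df(x_i) * y
    return y
```

From x = 8 the loop runs three times (8 → 4 → 2 → 1), so the derivative is (1/2)³ = 0.125.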

slide-37
SLIDE 37

Vector-valued differentiable functions

A function f : R^m ⇀ R is continuously differentiable if its gradient ∇_x f exists and is continuous at every x ∈ Dom(f). A function f : R^m ⇀ R^n is continuously differentiable iff each of its n component functions R^m ⇀ R is. Equivalently: a function f : R^m ⇀ R^n is continuously differentiable if its Jacobian Jf : R^m ⇀ Mat_{n,m} exists and is continuous at every point in Dom(f). Equivalently: a function f : R^m ⇀ R^n is continuously differentiable if its differential df : R^m × R^m ⇀ R^n exists and is continuous at every point in its domain, which is Dom(f) × R^m. We need continuity to make the chain rule work.

slide-38
SLIDE 38

Ordering partial functions

Partial functions R^m ⇀ R^n with open domain are partially ordered by their graphs:

f ≤ g ⟺ graph(f) ⊆ graph(g)

equivalently: f ≤ g ⟺ ∀x ∈ R^m. f(x)↓ ⟹ f(x) = g(x)

This makes R^m ⇀ R^n a cppo with ⊥ = ∅ and union of graphs as sup:

⋁_n f_n = ⋃_n f_n

This makes the conditional construction

if − then − else − : (R^m ⇀ T) × (R^m ⇀ R^n) × (R^m ⇀ R^n) → (R^m ⇀ R^n)

continuous.

slide-39
SLIDE 39

Differentiable functions and ordering and conditionals

Monotonicity    Suppose f, g : R^m ⇀ R^n are continuously differentiable. Then:

f ≤ g ⟹ dR f ≤ dR g

Continuity    Suppose f_n is an increasing sequence of continuously differentiable functions. Then so is its sup, and we have:

dR(⋁_n f_n) = ⋁_n dR(f_n)

so:

dR_x(⋁_n f_n)(y) = x′ ⟺ ∃n. dR_x(f_n)(y) = x′

Conditionals    Suppose b : R^m ⇀ T is continuous, and f, g : R^m ⇀ R^n are continuously differentiable. Then so is their conditional and we have:

dR(if b then f else g) = if b then dR(f) else dR(g)

slide-40
SLIDE 40

While loops

Iterates

while_0 b do f = ⊥
while_{n+1} b do f = if b then (while_n b do f) ∘ f else id

Loops

while b do f = ⋁_n (while_n b do f)

Theorem    (while_n b do f)(x)↓ ⟹ dR_x(while b do f) = dR_x(f^n)

slide-41
SLIDE 41

Loop source code transformation

A while loop w = while b do f : R^m ⇀ R^m has a recursive definition:

w = if b then w ∘ f else id

equivalently: w(x) ≃ if b(x) then w(f(x)) else x

Reverse differentiating we get:

dR_x w(y) ≃ if b(x) then dR_x f(dR_{f(x)} w(y)) else y

which suggests making the recursive definition of a function g : R^m × R^m ⇀ R^m by:

g(x, y) ≃ if b(x) then (dR f)(x, g(f(x), y)) else y

slide-42
SLIDE 42

Loop source code transformation (cntnd)

Theorem    Suppose w = while b do f. Then dR w is the least function g : R^m × R^m ⇀ R^m s.t.:

g(x, y) ≃ if b(x) then (dR f)(x, g(f(x), y)) else y

Proof. Writing w^(n) and g^(n) for the iterates of the two recursive definitions, by induction we have: dR(w^(n)) = g^(n). So:

dR(w) = dR(⋁_n w^(n)) = ⋁_n dR(w^(n)) = ⋁_n g^(n) = g
slide-43
SLIDE 43

Minilanguage reminder

Types    T ::= real | unit | T × U

Terms    M ::= x | r (r ∈ R) | M + N | op(M) | M.rdL(x : T. N)
             | let x : T = M in N | ∗ | M, N | fst(M) | snd(M)
             | if B then M else N | letrec f (x : T) : U = M in N | f (M)

Boolean terms    B ::= true | false | P(M)

slide-44
SLIDE 44

Minilanguage semantics: types

[[real]] = R
[[unit]] = 1
[[T × U]] = [[T]] × [[U]]

slide-45
SLIDE 45

Flattening types and functions

Flattening Types    ϕ : [[T]] ≅ R^{|T|} where:

|real| = 1    |unit| = 0    |T × U| = |T| + |U|

Flattening Functions    For f : [[T]] ⇀ [[U]], define ϕ(f) : R^{|T|} ⇀ R^{|U|} by:

ϕ(f) = ϕ ∘ f ∘ ϕ⁻¹

slide-46
SLIDE 46

Smooth functions

A function f : R^m ⇀ R is smooth (or of class C^∞) if all its m partial derivatives ∂f/∂x_i : R^m ⇀ R (i = 1, . . . , m) are defined on its domain and they too are all smooth. (It is of class C⁰ if it is continuous, and of class C^{k+1} if all its partial derivatives are defined on its domain and are of class C^k.)

(Equivalently) A function f : R^m ⇀ R^n is smooth if, for all y ∈ R^m, df(−, y) exists and is smooth.

A function f : [[T]] ⇀ [[U]] is smooth if ϕ(f) is smooth. We write S[[[T]], [[U]]] for the collection of all such functions. It forms a subcppo of the continuous such functions.

slide-47
SLIDE 47

Semantics of the language

Operations    [[op]] ∈ S[[[T]], [[U]]]    (op : T → U)

Environments    [[x0 : T0, . . . , xn−1 : Tn−1]] = [[T0]] × . . . × [[Tn−1]]

Function environments    [[f0 : T0 → U0, . . . , fn−1 : Tn−1 → Un−1]] = S[[[T0]], [[U0]]] × . . . × S[[[Tn−1]], [[Un−1]]]

Terms    For Φ | Γ ⊢ M : T:    [[M]] : [[Φ]] → S[[[Γ]], [[T]]]
         For Φ | Γ ⊢ B:        [[B]] : [[Φ]] → C[[[Γ]], T]

slide-48
SLIDE 48

Example denotational semantics

Operations    [[op(M)]](ϕ, γ) ≃ [[op]]([[M]](ϕ, γ))

Reverse derivatives

[[M.rdL(x : T. N)]](ϕ, γ) ≃ dR_{[[L]](ϕ,γ)}(λa : [[T]]. [[N]](ϕ, γ[a/x]))([[M]](ϕ, γ))

where for any differentiable f : [[T]] ⇀ [[U]] we set:

dR(f) = ϕ⁻¹_{T×U,T}(dR(ϕ_{T,U}(f)))

slide-49
SLIDE 49

Example denotational semantics

Operations    [[op(M)]](ϕ, γ) ≃ [[op]]([[M]](ϕ, γ))

Reverse derivatives

[[M.rdL(x : T. N)]] ≃ dR_{[[L]]}(λa : [[T]]. [[N]][a/x])([[M]])

slide-50
SLIDE 50

Operational semantics: basics

Value Environments    Any finite function ρ : Variables →fin ClosedValues

Values    These are terms V, W, . . .:    V ::= x | r (r ∈ R) | ∗ | V, W

Boolean values    These are terms V_bool:    V_bool ::= true | false

Function Environments    Any finite function ϕ : FunctionVariables →fin Closures

Closures    These are structures clo_{ρ,ϕ}(f(x : T) : U. M) where:

(1) ρ is a value environment with FV(M)\x ⊆ Dom(ρ)
(2) ϕ is a function environment with FFV(M)\f ⊆ Dom(ϕ)

slide-51
SLIDE 51

Evaluation relations

(Ordinary) Evaluation Relation    These relations have the form

ϕ | ρ ⊢ M ⇒ V        ϕ | ρ ⊢ B ⇒ V_bool

with V closed.

Symbolic Evaluation Relation    These relations have the form

ϕ | ρ ⊢ M ⇝ C

Tape terms    These are terms C, D, . . . with no control constructs. More specifically, they contain no: function variables; conditionals; function definitions; or function applications:

C ::= x | r (r ∈ R) | C + D | op(C) | let x : T = C in D | ∗ | C, D

slide-52
SLIDE 52

Example ordinary evaluation rules

Operations

  ϕ | ρ ⊢ M ⇒ V
  ─────────────────────  (ev(op, V) ≃ W)
  ϕ | ρ ⊢ op(M) ⇒ W

Local Definitions

  ϕ | ρ ⊢ M ⇒ V    ϕ | ρ[V/x] ⊢ N ⇒ W
  ─────────────────────────────────────
  ϕ | ρ ⊢ let x : T = M in N ⇒ W

Conditionals

  ϕ | ρ ⊢ B ⇒ true    ϕ | ρ ⊢ M ⇒ W
  ───────────────────────────────────
  ϕ | ρ ⊢ if B then M else N ⇒ W

Reverse Derivatives

  ϕ | ρ ⊢ M.rdL(x : T. N) ⇝ C    ϕ | ρ ⊢ C ⇒ V
  ──────────────────────────────────────────────
  ϕ | ρ ⊢ M.rdL(x : T. N) ⇒ V

slide-53
SLIDE 53

Symbolic evaluation rules

Variables

  ϕ | ρ ⊢ x ⇝ x

Operations

  ϕ | ρ ⊢ M ⇝ C
  ─────────────────────
  ϕ | ρ ⊢ op(M) ⇝ op(C)

Local Definitions

  ϕ | ρ ⊢ M ⇝ C    ϕ | ρ ⊢ C ⇒ V    ϕ | ρ[V/x] ⊢ N ⇝ D
  ──────────────────────────────────────────────────────
  ϕ | ρ ⊢ let x : T = M in N ⇝ let x : T = C in D

Conditionals

  ϕ | ρ ⊢ B ⇒ true    ϕ | ρ ⊢ M ⇝ C
  ───────────────────────────────────
  ϕ | ρ ⊢ if B then M else N ⇝ C

slide-54
SLIDE 54

Symbolic evaluation rules (cntnd)

Function Definition

  ϕ[clo_{ρ,ϕ}(f(x : T) : U. M)/f] | ρ ⊢ N ⇝ C
  ─────────────────────────────────────────────
  ϕ | ρ ⊢ letrec f(x : T) : U = M in N ⇝ C

Function Application

  ϕ | ρ ⊢ M ⇝ C    ϕ | ρ ⊢ C ⇒ V    ϕ′[ϕ(f)/f] | ρ′[V/x] ⊢ N ⇝ D
  ────────────────────────────────────────────────────────────────  (ϕ(f) = clo_{ρ′,ϕ′}(f(x : T) : U. N))
  ϕ | ρ ⊢ f(M) ⇝ let x : T = C in Dρ′

Reverse Derivatives

  ϕ | ρ ⊢ L ⇝ C    ϕ | ρ ⊢ M ⇝ D    ϕ | ρ ⊢ C ⇒ V    ϕ | ρ[V/x] ⊢ N ⇝ E
  ───────────────────────────────────────────────────────────────────────  (x, y ∉ Dom(ρ), Γρ ⊢ E : U)
  ϕ | ρ ⊢ M.rdL(x : T. N) ⇝ let x : T, y : U = C, D in y.Rx(x : T. E)

slide-55
SLIDE 55

Symbolic differentiation: W.R_V(x : T. C)

W.R_V(x : T. y) = W        (y = x)
W.R_V(x : T. y) = 0_T      (y ≠ x)

W.R_V(x : T. D + E) = W.R_V(x : T. D) + W.R_V(x : T. E)

W.R_V(x : T. op(D[x])) = W.opʳ(D[V]).R_V(x : T. D)

W.R_V(x : T. let y : U = C[x] in D[x, y]) =
  let y : U = C[V] in
  let z : T × U = W.R_{V,y}(z : T × U. D[fst(z), snd(z)]) in
  fst(z) + snd(z).R_V(x : T. C[x])

W.R_V(x : T. D, E) = fst(W).R_V(x : T. D) + snd(W).R_V(x : T. E)

W.R_V(x : T. fst(D[x])) = let x : T = D[V] in W, 0 .R_V(x : T. D)    (x ∉ FV(D))

slide-56
SLIDE 56

Typing environments

We give rules for judgements ρ : Γ, Cl : T → U, and ϕ : Φ as follows:

  V_i : T_i (i = 0, . . . , n − 1)
  ──────────────────────────────────────────────────────────────
  {x0 ↦ V0, . . . , xn−1 ↦ Vn−1} : x0 : T0, . . . , xn−1 : Tn−1

  Cl_i : T_i → U_i (i = 0, . . . , n − 1)
  ───────────────────────────────────────────────────────────────────────────
  {f0 ↦ Cl0, . . . , fn−1 ↦ Cln−1} : f0 : T0 → U0, . . . , fn−1 : Tn−1 → Un−1

  ϕ′ : Φ    ρ′ : Γ    Φ | Γ[x : T] ⊢ M : U
  ─────────────────────────────────────────  (ϕ′ = ϕ ↾ FFV(M)\f, ρ′ = ρ ↾ FV(M)\x)
  clo_{ρ,ϕ}(f(x : T) : U. M) : T → U

slide-57
SLIDE 57

Correctness theorems

Theorem (Formal reverse-mode differentiation correctness)    Suppose Γ[x : T] ⊢ E : U, Γ ⊢ C : T, and Γ ⊢ D : U (and so Γ ⊢ D.rdC(x : T. E) : T). Then, for any γ ∈ [[Γ]], we have:

[[D.rdC(x : T. E)]](γ) ≃ [[D.RC(x : T. E)]](γ)

slide-58
SLIDE 58

Correctness theorems (cntnd)

Two conditions:

NGV: No recursive function definitions in M have global free variables.
NGFD: No recursive function definitions in M contain the function variable within a derivative expression occurring within the function body.

Theorem (Operational correctness)

1. Operational semantics. Suppose Φ | Γ ⊢ M : T, ϕ : Φ, and ρ : Γ. Then:

   ϕ | ρ ⊢ M ⇒ V ⟹ [[M]][[ϕ]][[ρ]] = [[V]]

2. Symbolic operational semantics. Suppose Φ | Γ ⊢ M : T, Φ | Γ ⊢ C : T, ϕ : Φ, and ρ : Γ. Then:

   ϕ | ρ ⊢ M ⇝ C ⟹ ∃ O ⊆_open [[Γ]]. [[ρ]] ∈ O ∧ ∀γ ∈ O. [[M]][[ϕ]]γ ≃ [[C]]γ

slide-59
SLIDE 59

Operational completeness

Theorem (Operational completeness)    The following hold:

1. Operational semantics. Suppose Φ | Γ ⊢ M : T, ϕ : Φ, and ρ : Γ. Then:

   [[M]][[ϕ]][[ρ]] = [[V]] ⟹ ϕ | ρ ⊢ M ⇒ V

2. Symbolic operational semantics. Suppose Φ | Γ ⊢ M : T, ϕ : Φ, and ρ : Γ. Then:

   ϕ | ρ ⊢ M ⇒ V ⟹ ∃C. ϕ | ρ ⊢ M ⇝ C

slide-60
SLIDE 60

Derivative Elimination Theorem

Theorem    Let M be a closed well-typed NGV and NGFD term over a given alphabet of function variables. Then there is a unique derivative-free term D, possibly containing additional primed function variables, such that: M ⊲ D. Further: [[M]] = [[D]]

slide-61
SLIDE 61

Decomposing a partial function into its components

[Diagram: a partial function f : A + B ⇀ C + D decomposed into components f_AC : A ⇀ C, f_AD : A ⇀ D, f_BC : B ⇀ C, f_BD : B ⇀ D]

slide-62
SLIDE 62

The reverse derivative of a decomposed function

We want: dR f : (A + B) × (C + D) ⇀ A + B

Identifying f with its composition with the distributive expansion of its domain, look for:

dR f : (A × C) + (A × D) + (B × C) + (B × D) ⇀ A + B

Given by:

(dR f)_{(A×C),A} = dR f_AC
(dR f)_{(A×D),A} = dR f_AD
(dR f)_{(B×C),B} = dR f_BC
(dR f)_{(B×D),B} = dR f_BD

and taking the other components, such as (dR f)_{(A×C),B}, to be undefined.

slide-63
SLIDE 63

Sums: injections

For inl : A ⇀ A + B we wish: dR(inl) : A × (A + B) ⇀ A

Define:

dR_x(inl)(z) ≃ y           (z = inl(y))
dR_x(inl)(z) undefined     (otherwise)

Differentiation equivalence

Q.rdP(x : T. inl(M)) = let x : T, y : U + V = P, Q in
                       cases y of inl(u : U) ⇒ u.rdx(x : T. M)
                                | inr(v : V) ⇒ UNDEF

slide-64
SLIDE 64

Sums: cotupling

For f : A ⇀ C and g : B ⇀ C:    [f, g] : A + B ⇀ C

Wish: dR([f, g]) : (A + B) × C ⇀ A + B

Define:

dR_z([f, g])(u) = inl((dR_x f)(u))    (z = inl(x))
dR_z([f, g])(u) = inr((dR_y g)(u))    (z = inr(y))

slide-65
SLIDE 65

Sums: cotupling (cntnd)

Differentiation equivalence

Q.rdP(x : T. cases L[x] of inl(u : U) ⇒ M[x, u] | inr(v : V) ⇒ N[x, v]) =
  let x : T, z : W = P, Q in
  cases L[x] of
    inl(u : U) ⇒ let x′ : T, u′ : U = z.rd_{x,u}(x : T, u : U. M[x, u]) in
                 x′ + inl(u′).rdx(x : T. L[x])
  | inr(v : V) ⇒ . . .

slide-66
SLIDE 66

Symbolic operational semantics for sums

An abbreviation:

castl_{T,U}(M) ≡ cases M of x : T ⇒ x | y : U ⇒ UNDEF

Redex:

  ϕ | ρ ⊢ V ⇒ inl_{T,U}(W)    ϕ | ρ[W/x] ⊢ M ⇝ C
  ──────────────────────────────────────────────────────────────────────────
  ϕ | ρ ⊢ cases V of x : T ⇒ M | y : U ⇒ N ⇝ let x : T = castl_{T,U}(V) in C

slide-67
SLIDE 67

Differentiating functions on lists of reals

Is reverse : List(R) → List(R) differentiable? It can be considered as a collection of functions reverse_{n,n} : List(R, n) → List(R, n). As List(R, n) = R^n, we say it is differentiable everywhere, as each of its components reverse_{n,n} is.

Example    At which lists is sort : List(R) → List(R) differentiable?
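For the sort example: at a list with pairwise-distinct entries, sort applies a fixed permutation on a neighbourhood, so locally its Jacobian is a permutation matrix and its reverse derivative just un-permutes the incoming cotangent. A small sketch of that vector-Jacobian product (the function names are mine):

```python
# Reverse derivative of sort at a list xs with distinct entries:
# sort(xs)[j] = xs[perm[j]] for a locally constant permutation perm,
# so the VJP scatters each cotangent ys[j] back to position perm[j].
def sort_vjp(xs, ys):
    perm = sorted(range(len(xs)), key=lambda i: xs[i])
    out = [0.0] * len(xs)
    for j, i in enumerate(perm):
        out[i] += ys[j]
    return out
```

For xs = [3, 1, 2], sort sends positions (1, 2, 0) to (0, 1, 2), so cotangent [10, 20, 30] on the sorted output pulls back to [30, 10, 20] on the input. At lists with repeated entries the permutation is not locally constant, which is where differentiability fails.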

slide-68
SLIDE 68

Differentiating functions between lists, in general

Any function f : List(R) → List(R) decomposes into a collection of components f_{nm} : List(R, n) ⇀ List(R, m) where, for l ∈ List(R, n):

f_{nm}(l) ≃ f(l)         (f(l) has length m)
f_{nm}(l) undefined      (otherwise)

We say f is differentiable at l if f_{nm} is at l (where n = |l| and m = |f(l)|). We say f is differentiable with open domain if, and only if, each of its components f_{nm} is.

slide-69
SLIDE 69

Reverse derivatives of functions on lists

As f_{nm} : List(R, n) ⇀ List(R, m), we have:

dR f_{nm} : List(R, n) × List(R, m) ⇀ List(R, n)

So one might expect a dependent type:

dR f : Π_{l ∈ List(R, n)} List(R, |f(l)|) ⇀ List(R, n)

but we instead use a simple type:

dR f : List(R) × List(R) ⇀ List(R)

slide-70
SLIDE 70

Differentiable shapely datatypes

Given a container, viz:

  • A set S of shapes.
  • For each shape s ∈ S, a finite set P_s of places.

Shapely differentiable datatypes have the form:

D_{S,P} = Σ_{s ∈ S} R^{P_s} = {(s, x) | s ∈ S, x : P_s → R}

slide-71
SLIDE 71

Examples of differentiable shapely datatypes

Sets    X ≅ Σ_{s ∈ X} R^∅

Finite products of R    R^n ≅ Σ_{s ∈ {∗}} R^{[n]}

Lists of reals    List(R) ≅ Σ_{n ∈ N} R^{[n]}

Tensors of rank k ≥ 0 of reals    Tensor_k(R) ≅ Σ_{(d0,...,dk−1) ∈ N_{>0}^k} R^{[d0]×...×[dk−1]}

Binary trees of reals    BinaryTrees(R) ≅ Σ_{s ∈ BinaryTrees} R^{Branches(s)}

slide-72
SLIDE 72

Shapely differentiable datatypes are manifolds

A differentiable manifold of varying dimension is: a Hausdorff topological space X, plus an atlas on X, i.e., a collection of open subsets U_i covering X, each with a specified homeomorphism ϕ_i : U_i → R^n (a coordinate chart) to an open subset of some R^n, subject to some axioms.

For shapely differentiable datatypes we have the charts:

U_s = {s} × R^{P_s} → R^{|P_s|}    (s ∈ S)

This connects shapely differentiable datatypes with standard notions of differentiable functions. Manifolds figure commonly in learning theory (Pymanopt, Townsend et al, 2016). Should (a suitable version of) manifolds be datatypes of differentiable programming languages? (Pearlmutter, Automatic Differentiation: History and Headroom, NIPS Autodiff Workshop, 2016.)

slide-73
SLIDE 73

Future work

More language features in either external or internal mode, according to whether they cannot or can be differentiated. Examples:

  • Higher-order functions. External: Autograd; internal: cf. convenient vector spaces.
  • Effects: exceptions, global state, I/O. All available with shapely differential datatypes.
  • Probability. Current work restricted to a graphical model with a mixture of discrete distributions and those with a density.

Connect traditional semantic frameworks with differentiation:

  • Domain theory: could do streams and higher-order functions.
  • Metric spaces: could relate ideal computation with reals with approximate computation; differentiation of iteration-scheme computations.

slide-74
SLIDE 74

Future work (cntnd)

Make less sweeping, more realistic, assumptions about smoothness:

  • Work with functions (and hence programs) in smoothness classes C^k.
  • Allow weaker forms of differentiability, e.g., some form of Clarke generalised derivative, as in the work of Edalat and Gianantonio (who use domain theory).

Look at the theory of work in automatic differentiation, to establish correctness of their techniques for efficiency. Investigate use of dependent types to track shape analysis of tensor computations. Consider how to program with manifolds as types.