One-and-a-Half Simple Differential Programming Languages
Gordon Plotkin Calgary, 2019
~ Joint work at Google with Martín Abadi ~
Talk Synopsis
- Review of neural nets
- Review of Differentiation
- A minilanguage
- Differentiating conditionals and loops
- Language semantics: operational and denotational
- Beyond powers of R
- Conclusion and future work
Deep learning is based on neural networks:
[Diagram: a neural network with inputs x1, . . . , xn, parameters, and a non-linear activation function, e.g., the "ReLU" F(x) = max(0, x)]
[Plot of ReLU(x) = max(x, 0)]
Swish: x · σ(βx), where σ(z) = (1 + e^{−z})^{−1}
(Ramachandran, Zoph, and Le, 2017)
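For concreteness, these two activations can be written in a few lines of Python with JAX; the function names and the default β are my own illustrative choices:

```python
import jax.numpy as jnp

def relu(x):
    # ReLU: F(x) = max(0, x)
    return jnp.maximum(0.0, x)

def swish(x, beta=1.0):
    # Swish: x * sigma(beta * x), where sigma(z) = 1 / (1 + exp(-z))
    return x * (1.0 / (1.0 + jnp.exp(-beta * x)))
```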
From "Conv Nets: A Modular Perspective", colah's blog.
Introduction
In the last few years, deep neural networks have led to breakthrough results on a variety of pattern recognition problems, such as computer vision and voice recognition. One of the essential components leading to these results has been a special kind of neural network called a convolutional neural network. At its most basic, a convolutional neural network can be thought of as a kind of neural network that uses many identical copies of the same neuron. This allows the network to have lots of neurons and express computationally large models while keeping the number of actual parameters – the values describing how neurons behave – that need to be learned fairly small.
[Figure: A 2D Convolutional Neural Network]
This trick of having multiple copies of the same neuron is roughly analogous to the abstraction of functions in mathematics and computer science. When programming, we write a function once and use it in many places – not writing the same code a hundred times in different places makes it faster to program, and results in fewer bugs. Similarly, a convolutional neural network can learn a neuron once and use it in many places, making it easier to learn the model and reducing error.
Structure of Convolutional Neural Networks
Suppose you want a neural network to look at audio samples and predict whether a human is speaking or not. Maybe you want to do more analysis if someone is speaking. You get audio samples at different points in time. The samples are evenly spaced.
With thanks to C. Olah
[Diagram: Inception-ResNet-v1 architecture (Szegedy et al, arxiv.org/abs/1602.07261)]
[Diagram: a chain of cells, Cell → Cell → Cell → Cell, ≅ with shared parameters]
There are many variants, e.g., LSTMs.
A model MoE (mixture-of-experts) architecture with a conditional and a loop. With thanks to Yu et al.
Neural Turing Machines combine an RNN with an external memory bank:
Neural Turing Machines
Neural Turing Machines combine an RNN with an external memory bank. Since vectors are the natural language of neural networks, the memory is an array of vectors. But how does reading and writing work? The challenge is that we want to make them differentiable. In particular, we want to make them differentiable with respect to the location we read from or write to, so that we can learn where to read and write. This is tricky because memory addresses seem to be fundamentally discrete. NTMs take a very clever solution to this: every step, they read and write everywhere, just to different extents. As an example, let's focus on reading. Instead of specifying a single location, the RNN outputs an "attention distribution" that describes how we spread out the amount we care about different memory positions. As such, the result of the read operation is a weighted sum.
Similarly, we write everywhere at once to different extents. Again, an attention distribution describes how much we write at every location. We do this by having the new value of a position in memory be a convex combination of the old memory content and the write value, with the position between the two decided by the attention weight.
[Figure: memory is an array of vectors; network A writes to and reads from this memory at each step, (x0, y0), (x1, y1), (x2, y2), (x3, y3). The RNN gives an attention distribution which describes how we spread out the amount we care about different memory positions. The read result is a weighted sum.]
With thanks to C. Olah
Given a training dataset of (input, output) pairs, e.g., a set of images with labels:
While not done:
- Pick a pair (x, y)
- Run the neural network on x to get Net(x, b, . . .)
- Compare this to y to calculate the loss (= error = cost): Loss(b, . . .) = |y − Net(x, b, . . .)|
- Adjust parameters b, . . . to reduce the loss
More generally, pick a "mini-batch" (x1, y1), . . . , (xn, yn) and minimise the loss
Loss(b, . . .) = Σ_{i=1}^{n} (y_i − Net(x_i, b, . . .))²
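As a rough illustration of this loop, here is a minimal Python/JAX sketch with a made-up one-parameter model net(x, b) = max(0, b·x); the data, learning rate r, and step count are arbitrary:

```python
import jax
import jax.numpy as jnp

def net(x, b):
    # toy "network": one parameter b and a ReLU
    return jnp.maximum(0.0, b * x)

def loss(b, xs, ys):
    # mini-batch loss: sum_i (y_i - Net(x_i, b))^2
    return jnp.sum((ys - net(xs, b)) ** 2)

xs = jnp.array([1.0, 2.0, 3.0])   # inputs
ys = jnp.array([2.0, 4.0, 6.0])   # labels
b, r = 0.5, 0.01                  # parameter and learning rate

for step in range(100):
    g = jax.grad(loss)(b, xs, ys)  # gradient of the loss wrt the parameter b
    b = b - r * g                  # follow the (negative) gradient
```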
slope = (change in y)/(change in x) = Δy/Δx. So Δy = slope × Δx.
So x′ = x − r · slope ⇒ y′ = y − r · slope²
Follow the gradient of the loss function
Compute partial derivatives along paths in the neural network. Follow the gradient of the loss with respect to the parameters.
Thus: x′ := x − r · (slope of Loss at x) = x − r · dLoss(x)/dx
x′ := x − r · ∂L(x, y)/∂x and y′ := y − r · ∂L(x, y)/∂y
(x′, y′) := (x, y) − r · (∂L(x, y)/∂x, ∂L(x, y)/∂y)
v′ := v − r · ∇L
Expressions with several variables: ∂e[x, y]/∂x
Gradient of functions f : R² → R of two arguments: ∇(f) : R² → R²
∇(f)(u, v) = (∂f(u, v)/∂u, ∂f(u, v)/∂v)
∂f(g(x, y, z), h(x, y, z))/∂x = ∂f(u, v)/∂u · ∂g(x, y, z)/∂x + ∂f(u, v)/∂v · ∂h(x, y, z)/∂x
where u, v = g(x, y, z), h(x, y, z).
We have:
∂f(g(x, y, z), h(x, y, z))/∂x = ∂f(u, v)/∂u · ∂g(x, y, z)/∂x + ∂f(u, v)/∂v · ∂h(x, y, z)/∂x
∂f(g(x, y, z), h(x, y, z))/∂y = ∂f(u, v)/∂u · ∂g(x, y, z)/∂y + ∂f(u, v)/∂v · ∂h(x, y, z)/∂y
∂f(g(x, y, z), h(x, y, z))/∂z = ∂f(u, v)/∂u · ∂g(x, y, z)/∂z + ∂f(u, v)/∂v · ∂h(x, y, z)/∂z
Set k = ⟨g, h⟩ : R³ → R² and define its Jacobian to be the 2 × 3 matrix:
Jk = [ ∂g/∂x  ∂g/∂y  ∂g/∂z ]
     [ ∂h/∂x  ∂h/∂y  ∂h/∂z ]
Then the gradient of the composition f ∘ k is given by the vector–Jacobian product:
∇f(g(x, y, z), h(x, y, z)) = ∇f(u, v) · Jk(x, y, z)
Jacobians: For f : R^m → R^n we have: Jf : R^m → Mat_{n,m}
Chain rule for Jacobians: For f : R^l → R^m and g : R^m → R^n we have: J_x(g ∘ f) = J_{f(x)}(g) · J_x(f)
Differentials aka (forward) derivatives: For f : R^m → R^n we define df : R^m × R^m → R^n by: (d_x f)y = (J_x f) · y
Chain rule for differentials: For f : R^l → R^m and g : R^m → R^n we have: d_x(g ∘ f) = d_{f(x)}(g) ∘ d_x(f)
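A quick numerical check of this chain rule using JAX's jax.jvp; the particular f : R² → R³ and g : R³ → R² below are my own illustrative choices:

```python
import jax
import jax.numpy as jnp

f = lambda x: jnp.array([x[0] * x[1], jnp.sin(x[0]), x[1] ** 2])   # f : R^2 -> R^3
g = lambda y: jnp.array([y[0] + y[2], y[1] * y[2]])                # g : R^3 -> R^2

x = jnp.array([1.5, -0.5])
v = jnp.array([1.0, 2.0])                        # a tangent vector at x

# d_x(g o f) applied to v ...
_, d_composed = jax.jvp(lambda x: g(f(x)), (x,), (v,))
# ... equals d_{f(x)}(g) applied to d_x(f)(v)
fx, dfv = jax.jvp(f, (x,), (v,))
_, dgdfv = jax.jvp(g, (fx,), (dfv,))

assert jnp.allclose(d_composed, dgdfv)
```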
Reverse derivatives: For f : R^m → R^n we have: dRf : R^m × R^n → R^m where: (dR_x f)y = y · (J_x f) (= (d_x f)†y)
Chain rule: For f : R^l → R^m and g : R^m → R^n we have: dR_x(g ∘ f) = dR_x(f) ∘ dR_{f(x)}(g)
as: dR_x(g ∘ f)(z) = z · J_x(g ∘ f) = z · (J_{f(x)}(g) · J_x(f)) = (z · J_{f(x)}(g)) · J_x(f) = dR_x(f)(dR_{f(x)}(g)(z))
Gradients: For the case n = 1, where f : R^m → R, we have: dR_x f : R^m × R → R^m and then: ∇_x f = (dR_x f)1
So: for f : R^m → R^n we have dRf : R^m × R^n → R^m, and for f : R^m → R we have: ∇_x f = (dR_x f)1
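Correspondingly, reverse derivatives are what jax.vjp computes, and applying dR_x f to the cotangent 1 recovers the gradient, matching jax.grad; the f below is an illustrative choice:

```python
import jax
import jax.numpy as jnp

f = lambda x: jnp.sin(x[0]) * x[1] + x[2] ** 2   # f : R^3 -> R

x = jnp.array([0.3, 2.0, -1.0])

_, f_vjp = jax.vjp(f, x)            # the reverse derivative dR_x f : R -> R^3
(grad_via_vjp,) = f_vjp(1.0)        # apply it to the cotangent 1
grad_via_grad = jax.grad(f)(x)      # the gradient at x

assert jnp.allclose(grad_via_vjp, grad_via_grad)
```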
ONNX is an open exchange format to represent deep learning models.
Deep Learning Graphical Frameworks
TF graphs can have conditionals, iterations, and function calls.
Automatic Differentiation (dates back to 1965!)
- first-class gradient operation
- F#/DiffSharp
- reverse differentiation (Pearlmutter, Siskind)
Foundational studies:
- Functional Programming
- Array-Processing Language (Shaikhha, Fitzgibbon et al)
- Penultimate Backpropagator (Wang et al)
- As many programmable functions f : T → U differentiable as possible, for as many types T, U as possible.
- A gradient operation; more generally, a reverse derivative one; even higher-order (= iterated) derivatives.
- Tensors (aka multidimensional arrays). These have ranks k and shapes d0, . . . , dk−1. The set of such real tensors is: R^{[d0]×...×[dk−1]}
- Execution:
Study a small functional programming language with relevant features:
(functional programming: Steuwer et al; Gibbons; Haskell).
Give it a semantics. Use the semantics to justify an operational semantics including the differentiation constructs. We also have source code transformations eliminating all differentiation constructs, not given here, but summarised.
Ehrhard and Regnier's differential lambda calculus. I originally thought this was the way to go. It is a typed lambda calculus with product and function types and (forward) differentiation as a primitive. It is based on a general notion of a differential category (which has linear features – tensors). Example: convenient vector spaces of Frölicher (other examples exist too). Main issue: it does not support partial functions. There is, however, a non-higher-order notion of a differential restriction category (Cockett et al) which has smooth partial functions over powers of the reals as a model.
Automatic (aka algorithmic) differentiation: given a program, produce a program that calculates its derivative. Originally for scientific computing, not machine learning. Huge literature + large community: www.autodiff.org. Very concerned with efficiency. As far as I could find out, largely not focused on semantics and its associated language theory — the focus of this talk — though there is functional programming work (VLAD).
References
- A Simple Automatic Derivative Evaluation Program, Wengert, 1964.
- Compiling Fast Partial Derivatives of Functions Given by Algorithms, Speelpenning, 1980.
- Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation, Griewank & Walther.
- Automatic Differentiation in Machine Learning, Baydin et al, 2017.
Types    T ::= real | unit | T × U
Terms    M ::= x | r (r ∈ R) | M + N | op(M) | M.rd_L(x : T. N) | let x : T = M in N | ∗ | M, N | fst(M) | snd(M) | if B then M else N | letrec f(x : T) : U = M in N | f(M)
Boolean terms    B ::= true | false | P(M)
(Ordinary) Environments    Γ = x0 : T0, . . . , xn−1 : Tn−1
Function Environments    Φ = f0 : T0 → U0, . . . , fn−1 : Tn−1 → Un−1
Judgements    Φ | Γ ⊢ M : T    and    Φ | Γ ⊢ B
Operations
  Φ | Γ ⊢ M : T
  ─────────────────────  (op : T → U)
  Φ | Γ ⊢ op(M) : U

Reverse derivatives
  Φ | Γ ⊢ L : T    Φ | Γ[x : T] ⊢ N : U    Φ | Γ ⊢ M : U
  ─────────────────────────────────────────────────────────
  Φ | Γ ⊢ M.rd_L(x : T. N) : T
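For concreteness, one could render (part of) this term syntax as a small Python AST; this is only an illustrative sketch, with constructor and field names of my own choosing, not the talk's formalism:

```python
from dataclasses import dataclass
from typing import Union

# Types: T ::= real | unit | T x U
@dataclass
class Real: pass

@dataclass
class Unit: pass

@dataclass
class Prod:
    left: "Ty"
    right: "Ty"

Ty = Union[Real, Unit, Prod]

# A few representative term constructors (not the full grammar)
@dataclass
class Var:
    name: str

@dataclass
class Op:              # op(M)
    name: str
    arg: "Tm"

@dataclass
class Let:             # let x : T = M in N
    var: str
    ty: Ty
    rhs: "Tm"
    body: "Tm"

@dataclass
class Rd:              # M.rd_L(x : T. N)
    seed: "Tm"         # M
    at: "Tm"           # L
    var: str
    var_ty: Ty
    body: "Tm"         # N

Tm = Union[Var, Op, Let, Rd]
```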
Consider differentiating k(x) = h(g(f(x))) at x = a.
Trace (or tape) method
1. Compute the trace, the list [a, b, c] = [a, f(a), g(f(a))]
2. Using the trace, play the tape for h(g(f(x))) by applying the chain rule: k′(a) = h′(c) · g′(b) · f′(a)
Source code transformation (SCT)
1. Using the chain rule, transform the code to M = let y = f(x) in let z = g(y) in h′(z) · g′(y) · f′(x)
2. Evaluate the transformed code with x = a.
Both methods are sketched below. Much of the automatic differentiation literature considers how to do reverse-mode differentiation efficiently, eg by first translating to A-normal form; this produces PL versions of the backprop algorithm (see: Griewank, Who Invented the Reverse Mode of Differentiation?, 2012).
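A tiny Python sketch of both methods for concrete choices of f, g, h (my own examples) with hand-written derivatives; the trace version records [a, b, c] and then plays the tape, while the SCT version is the transformed code itself:

```python
import math

# Illustrative f, g, h and their derivatives
f,  g,  h  = (lambda x: x * x), math.sin, math.exp
df, dg, dh = (lambda x: 2 * x), math.cos, math.exp

def k_prime_trace(a):
    # 1. forward pass: record the trace [a, b, c] = [a, f(a), g(f(a))]
    b = f(a)
    c = g(b)
    # 2. play the tape, applying the chain rule
    return dh(c) * dg(b) * df(a)

def k_prime_sct(x):
    # transformed code: let y = f(x) in let z = g(y) in h'(z) * g'(y) * f'(x)
    y = f(x)
    z = g(y)
    return dh(z) * dg(y) * df(x)

assert abs(k_prime_trace(1.3) - k_prime_sct(1.3)) < 1e-12
```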
Consider: h(x) = if b(x) then f(x) else g(x)
The rule people use: dh/dx = if b(x) then df/dx else dg/dx
However consider: h(x) = if x = 0 then −x else x
We have h(x) = x, so dh/dx = 1. But the rule gives dh/dx = if x = 0 then −1 else 1.
Another example: ReLU(x) = if x ≤ 0 then 0 else x
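This is visible in JAX, where differentiating through jnp.where applies the rule to the branch actually selected; the example function is mine and the exact behaviour may vary across versions:

```python
import jax
import jax.numpy as jnp

# h(x) = if x == 0 then -x else x, which equals x everywhere, so dh/dx = 1
h = lambda x: jnp.where(x == 0, -x, x)

print(jax.grad(h)(0.0))   # the branch rule gives -1.0 here, not the true derivative 1
print(jax.grad(h)(2.0))   # away from 0 the rule agrees with the true derivative: 1.0
```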
Note b : R → T. Switch to continuous partial b : R ⇀ T, meaning that b⁻¹(tt) and b⁻¹(ff) are open (eg (−∞, 0) and (0, ∞)). Write f : R ⇀ R to mean that f is partial, with open domain.
Proposition. For continuous b : R ⇀ T and differentiable f, g : R ⇀ R the conditional h(x) ≃ if b(x) then f(x) else g(x) is differentiable and, for all x ∈ R, we have:
dh/dx ≃ if b(x) then df/dx else dg/dx
Reference Thomas Beck, Herbert Fischer, The if-problem in automatic differentiation, 1994.
Proposition. For continuous b : R ⇀ T and differentiable f, g : R ⇀ R the conditional h(x) ≃ if b(x) then f(x) else g(x) is differentiable and, for all x ∈ R, we have:
dh/dx ≃ if b(x) then df/dx else dg/dx
Proof. Suppose b(x) = tt. Since b⁻¹(tt) is open, there is an open interval (a, b) containing x with b(x′) = tt for all x′ in (a, b). On (a, b) we have h ≃ f, so dh/dx = df/dx there. The case b(x) = ff is symmetric.
Consider swap(x, y) = if x > y then (x, y) else (y, x)
When is ∂swap/∂x = if x > y then (1, 0) else (0, 1) OK?
By which I mean: at what points is > continuous? Equivalently, what is the maximum continuous restriction of >?
We wish to compute the derivative at x of h(x) ≃ while b(x) do f(x).
Suppose h(x) ↓, and the computation goes round the loop n times. Then h(x) = fⁿ(x) and the rule for this x is: dh/dx = dfⁿ/dx
Potential proof: assuming b continuous and f differentiable, we even have: h(x′) = fⁿ(x′) for all x′ in an open interval containing x.
Trace method
Run the loop (interpreter or compiler) till it terminates, producing a trace, being a sequence of intermediate values. Evaluate the reverse derivative along the tape, here the corresponding iterated loop body, using the chain rule.
Source code transformation
Translate the code to code which consists of two while loops in sequence:
The first is the original while loop, but it also keeps copies of “checkpoint" intermediate values, and maintains a loop counter. The second counts down from the final value of the loop counter, computing individual reverse derivatives on the way using the relevant intermediate values.
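A sketch of that two-loop transformation in Python, for a scalar loop x := f(x) while b(x); b, f and the hand-written derivative df are illustrative, and every iterate is checkpointed for simplicity:

```python
def rev_deriv_while(b, f, df, x, y):
    # First loop: run the original while loop, checkpointing intermediate values
    # and maintaining a loop counter.
    checkpoints, n = [], 0
    while b(x):
        checkpoints.append(x)
        x = f(x)
        n += 1
    # Second loop: count down from the final counter value, computing individual
    # reverse derivatives using the stored intermediate values.
    while n > 0:
        n -= 1
        y = df(checkpoints[n]) * y
    return y   # dR_x(while b do f) applied to the incoming cotangent y

# Example: f(x) = x/2 while x > 1, so f'(x) = 1/2 and three iterations from x = 5
b, f, df = (lambda x: x > 1), (lambda x: x / 2), (lambda x: 0.5)
print(rev_deriv_while(b, f, df, 5.0, 1.0))   # 0.125 = (1/2)^3, the derivative of x/8
```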
A function f : R^m ⇀ R is continuously differentiable if its gradient ∇_x f exists and is continuous at every x ∈ Dom(f). A function f : R^m ⇀ R^n is continuously differentiable iff each of its components R^m ⇀ R is.
Equivalently: A function f : R^m ⇀ R^n is continuously differentiable if its Jacobian J : R^m ⇀ Mat_{n,m} exists and is continuous at every point in Dom(f).
Equivalently: A function f : R^m ⇀ R^n is continuously differentiable if its differential d : R^m × R^m ⇀ R^n exists and is continuous at every point in its domain, which is Dom(f) × R^m.
We need continuity to make the chain rule work.
Partial functions R^m ⇀ R^n with open domain are partially ordered by:
f ≤ g ⟺ f ⊆ g, equivalently: f ≤ g ⟺ ∀x ∈ R^m. f(x) ⊑ g(x) (if f(x) is defined then g(x) is defined and equal)
This makes R^m ⇀ R^n a cppo with ⊥ = ∅ and union of graphs as sup: ⊔_n f_n = ⋃_n f_n
This makes the conditional construction if −then−else− : (R^m ⇀ T) × (R^m ⇀ R^n) × (R^m ⇀ R^n) → (R^m ⇀ R^n) continuous.
Monotonicity: Suppose f, g : R^m ⇀ R^n are continuously differentiable. Then: f ≤ g ⟹ dRf ≤ dRg
Continuity: Suppose f_n is an increasing sequence of continuously differentiable functions. Then so is its sup, and we have: dR(⊔_n f_n) = ⊔_n dR(f_n)
so: dR_x(⊔_n f_n)(y) = x′ ⟺ ∃n. dR_x(f_n)(y) = x′
Conditionals: Suppose b : R^m ⇀ T is continuous, and f, g : R^m ⇀ R^n are continuously differentiable. Then so is their conditional and we have: dR(if b then f else g) = if b then dR(f) else dR(g)
Iterates:
while⁰ b do f = ⊥
while^{n+1} b do f = if b then (whileⁿ b do f) ∘ f else id
Loops: while b do f = ⊔_n whileⁿ b do f
Theorem: (whileⁿ b do f)(x) ↓ ⟹ dR_x(while b do f) = dR_x(fⁿ)
A while loop w = while b do f : R^m ⇀ R^m has a recursive definition:
w = if b then w ∘ f else id, equivalently: w(x) ≃ if b(x) then w(f(x)) else x
Reverse differentiating we get:
dR_x w(y) ≃ if b(x) then dR_x f(dR_{f(x)} w(y)) else y
which suggests making the recursive definition of a function g : R^m × R^m ⇀ R^m by:
g(x, y) ≃ if b(x) then (dRf)(x, g(f(x), y)) else y
Theorem. Suppose w = while b do f. Then dRw is the least function g : R^m × R^m ⇀ R^m s.t.: g(x, y) ≃ if b(x) then (dRf)(x, g(f(x), y)) else y
Proof. By induction we have: dR(f^(n)) = g^(n). So: dR(w) = dR(⊔_n f^(n)) = ⊔_n dR(f^(n)) = ⊔_n g^(n).
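The same reverse derivative can be written directly from this recursive characterisation, again for a scalar loop with illustrative b, f, df:

```python
def g(b, f, df, x, y):
    # g(x, y) ~ if b(x) then dRf(x, g(f(x), y)) else y
    if b(x):
        return df(x) * g(b, f, df, f(x), y)
    return y

b, f, df = (lambda x: x > 1), (lambda x: x / 2), (lambda x: 0.5)
print(g(b, f, df, 5.0, 1.0))   # 0.125, agreeing with the two-loop version above
```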
Types    T ::= real | unit | T × U
Terms    M ::= x | r (r ∈ R) | M + N | op(M) | M.rd_L(x : T. N) | let x : T = M in N | ∗ | M, N | fst(M) | snd(M) | if B then M else N | letrec f(x : T) : U = M in N | f(M)
Boolean terms    B ::= true | false | P(M)
⟦real⟧ = R    ⟦unit⟧ = ✶    ⟦T × U⟧ = ⟦T⟧ × ⟦U⟧
Flattening Types    ϕ : ⟦T⟧ ≅ R^|T| where: |real| = 1, |unit| = 0, |T × U| = |T| + |U|
Flattening Functions    For f : ⟦T⟧ ⇀ ⟦U⟧ define ϕ(f) : R^|T| ⇀ R^|U| by ϕ(f) = ϕ ∘ f ∘ ϕ⁻¹, so that the square ⟦T⟧ → ⟦U⟧ over R^|T| → R^|U| commutes.
A function f : R^m ⇀ R is smooth (or of class C^∞) if all its m partial derivatives ∂f/∂x_i : R^m ⇀ R (i = 1, . . . , m) are defined on its domain and they too are all smooth. (It is of class C^0 if it is continuous, and of class C^{k+1} if all its partial derivatives are defined on its domain and are of class C^k.)
(Equivalently) A function f : R^m ⇀ R^n is smooth if, for all y ∈ R^m, df(−, y) exists and is smooth.
A function f : ⟦T⟧ ⇀ ⟦U⟧ is smooth if ϕ(f) is smooth. We write S[⟦T⟧, ⟦U⟧] for the collection of all such functions. It forms a subcppo of the continuous such functions.
Operations    ⟦op⟧ ∈ S[⟦T⟧, ⟦U⟧]  (op : T → U)
Environments    ⟦x0 : T0, . . . , xn−1 : Tn−1⟧ = ⟦T0⟧ × . . . × ⟦Tn−1⟧
Function environments    ⟦f0 : T0 → U0, . . . , fn−1 : Tn−1 → Un−1⟧ = S[⟦T0⟧, ⟦U0⟧] × . . . × S[⟦Tn−1⟧, ⟦Un−1⟧]
Terms    Φ | Γ ⊢ M : T gives ⟦M⟧ : ⟦Φ⟧ → S[⟦Γ⟧, ⟦T⟧];    Φ | Γ ⊢ B gives ⟦B⟧ : ⟦Φ⟧ → C[⟦Γ⟧, T]
Operations    ⟦op(M)⟧(ϕ, γ) ≃ ⟦op⟧(⟦M⟧(ϕ, γ))
Reverse derivatives    ⟦M.rd_L(x : T. N)⟧(ϕ, γ) ≃ dR_{⟦L⟧(ϕ,γ)}(λa : ⟦T⟧. ⟦N⟧(ϕ, γ[a/x]))(⟦M⟧(ϕ, γ))
where for any differentiable f : ⟦T⟧ ⇀ ⟦U⟧ we set: dR(f) = ϕ⁻¹_{T×U,T}(dR(ϕ_{T,U}(f)))
Operations    ⟦op(M)⟧(ϕ, γ) ≃ ⟦op⟧(⟦M⟧(ϕ, γ))
Reverse derivatives    ⟦M.rd_L(x : T. N)⟧ ≃ dR_{⟦L⟧}(λa : ⟦T⟧. ⟦N⟧[a/x])⟦M⟧
Value Environments    Any finite function γ : Variables →_fin ClosedValues
Values    These are terms V, W, . . .:  V ::= x | r (r ∈ R) | ∗ | V, W
Boolean values    These are terms V_bool:  V_bool ::= true | false
Function Environments    Any finite function ϕ : FunctionVariables →_fin Closures
Closures    These are structures clo_{ρ,ϕ}(f(x : T) : U. M) where:
(1) ρ is a value environment with FV(M)\x ⊆ Dom(ρ)
(2) ϕ is a function environment with FFV(M)\f ⊆ Dom(ϕ)
(Ordinary) Evaluation Relations    These relations have the form ϕ | ρ ⊢ M ⇒ V and ϕ | ρ ⊢ B ⇒ V_bool, with V closed.
Symbolic Evaluation Relation    These relations have the form ϕ | ρ ⊢ M ⇝ C
Tape terms    These are terms C, D, . . . with no control constructs (conditionals, function definitions, or function applications):
C ::= x | r (r ∈ R) | C + D | op(C) | let x : T = C in D | ∗ | C, D
Operations
  ϕ | ρ ⊢ M ⇒ V
  ───────────────────────  (ev(op, V) ≃ W)
  ϕ | ρ ⊢ op(M) ⇒ W

Local Definitions
  ϕ | ρ ⊢ M ⇒ V    ϕ | ρ[V/x] ⊢ N ⇒ W
  ─────────────────────────────────────
  ϕ | ρ ⊢ let x : T = M in N ⇒ W

Conditionals
  ϕ | ρ ⊢ B ⇒ true    ϕ | ρ ⊢ M ⇒ W
  ─────────────────────────────────────
  ϕ | ρ ⊢ if B then M else N ⇒ W

Reverse Derivatives
  ϕ | ρ ⊢ M.rd_L(x : T. N) ⇝ C    ϕ | ρ ⊢ C ⇒ V
  ────────────────────────────────────────────────
  ϕ | ρ ⊢ M.rd_L(x : T. N) ⇒ V
Variables
  ϕ | ρ ⊢ x ⇝ x

Operations
  ϕ | ρ ⊢ M ⇝ C
  ─────────────────────────
  ϕ | ρ ⊢ op(M) ⇝ op(C)

Local Definitions
  ϕ | ρ ⊢ M ⇝ C    ϕ | ρ ⊢ C ⇒ V    ϕ | ρ[V/x] ⊢ N ⇝ D
  ───────────────────────────────────────────────────────
  ϕ | ρ ⊢ let x : T = M in N ⇝ let x : T = C in D

Conditionals
  ϕ | ρ ⊢ B ⇒ true    ϕ | ρ ⊢ M ⇝ C
  ─────────────────────────────────────
  ϕ | ρ ⊢ if B then M else N ⇝ C
Function Definition
  ϕ[clo_{ρ,ϕ}(f(x : T) : U. M)/f] | ρ ⊢ N ⇝ C
  ──────────────────────────────────────────────
  ϕ | ρ ⊢ letrec f(x : T) : U = M in N ⇝ C

Function Application
  ϕ | ρ ⊢ M ⇝ C    ϕ | ρ ⊢ C ⇒ V    ϕ′[ϕ(f)/f] | ρ′[V/x] ⊢ N ⇝ D
  ──────────────────────────────────────────────────────────────────  (ϕ(f) = clo_{ρ′,ϕ′}(f(x : T) : U. N))
  ϕ | ρ ⊢ f(M) ⇝ let x : T = C in Dρ′

Reverse Derivatives
  ϕ | ρ ⊢ L ⇝ C    ϕ | ρ ⊢ M ⇝ D    ϕ | ρ ⊢ C ⇒ V    ϕ | ρ[V/x] ⊢ N ⇝ E
  ──────────────────────────────────────────────────────────────────────────  (x, y ∉ Dom(ρ), Γ_ρ ⊢ E : U)
  ϕ | ρ ⊢ M.rd_L(x : T. N) ⇝ let x : T, y : U = C, D in y.R_x(x : T. E)
W.R_V(x : T. y) = W (if y = x), 0_T (if y ≠ x)
W.R_V(x : T. D + E) = W.R_V(x : T. D) + W.R_V(x : T. E)
W.R_V(x : T. op(D[x])) = W.op^r(D[V]).R_V(x : T. D)
W.R_V(x : T. let y : U = C[x] in D[x, y]) = let y : U = C[V] in let z : T × U = W.R_{V,y}(z : T × U. D[fst(z), snd(z)]) in fst(z) + snd(z).R_V(x : T. C[x])
W.R_V(x : T. D, E) = fst(W).R_V(x : T. D) + snd(W).R_V(x : T. E)
W.R_V(x : T. fst(D[x])) = let x : T = D[V] in (W, 0).R_V(x : T. D)   (x ∉ FV(D))
We give rules for judgements ρ : Γ, Cl : T → U, and ϕ : Φ, as follows:
  V_i : T_i  (i = 0, . . . , n−1)
  ──────────────────────────────────────────────────────────
  {x0 ↦ V0, . . . , xn−1 ↦ Vn−1} : x0 : T0, . . . , xn−1 : Tn−1

  Cl_i : T_i → U_i  (i = 0, . . . , n−1)
  ──────────────────────────────────────────────────────────────────────
  {f0 ↦ Cl0, . . . , fn−1 ↦ Cln−1} : f0 : T0 → U0, . . . , fn−1 : Tn−1 → Un−1

  ϕ′ : Φ    ρ′ : Γ    Φ | Γ[x : T] ⊢ M : U
  ──────────────────────────────────────────  (ϕ′ = ϕ ↾ FFV(M)\f,  ρ′ = ρ ↾ FV(M)\x)
  clo_{ρ,ϕ}(f(x : T) : U. M) : T → U
Theorem (Formal reverse-mode differentiation correctness). Suppose Γ[x : T] ⊢ E : U, Γ ⊢ C : T, and Γ ⊢ D : U (and so Γ ⊢ D.rd_C(x : T. E) : T). Then, for any γ ∈ ⟦Γ⟧, we have:
⟦D.rd_C(x : T. E)⟧(γ) ≃ ⟦D.R_C(x : T. E)⟧(γ)
Two conditions:
NGV: No recursive function definitions in M have global free variables.
NGFD: No recursive function definitions in M contain the function variable within a derivative expression occurring within the function body.
Theorem (Operational correctness)
1. Operational semantics. Suppose Φ | Γ ⊢ M : T, ϕ : Φ, and ρ : Γ. Then:
   ϕ | ρ ⊢ M ⇒ V ⟹ ⟦M⟧⟦ϕ⟧⟦ρ⟧ = ⟦V⟧
2. Symbolic operational semantics. Suppose Φ | Γ ⊢ M : T, Φ | Γ ⊢ C : T, ϕ : Φ, and ρ : Γ. Then:
   ϕ | ρ ⊢ M ⇝ C ⟹ ∃ O ⊆_open ⟦Γ⟧. ⟦ρ⟧ ∈ O ∧ ∀γ ∈ O. ⟦M⟧⟦ϕ⟧γ ≃ ⟦C⟧γ
Theorem (Operational completeness). The following hold:
1. Operational semantics. Suppose Φ | Γ ⊢ M : T, ϕ : Φ, and ρ : Γ. Then:
   ⟦M⟧⟦ϕ⟧⟦ρ⟧ = ⟦V⟧ ⟹ ϕ | ρ ⊢ M ⇒ V
2. Symbolic operational semantics. Suppose Φ | Γ ⊢ M : T, ϕ : Φ, and ρ : Γ. Then:
   ϕ | ρ ⊢ M ⇒ V ⟹ ∃C. ϕ | ρ ⊢ M ⇝ C
Theorem. Let M be a closed well-typed NGV and NGFD term over a given alphabet of function variables. Then there is a unique derivative-free term D, possibly containing additional primed function variables, such that: M ⊲ D. Further: ⟦M⟧ = ⟦D⟧
We want: dRf : (A + B) × (C + D) ⇀ A + B
Identifying f with its composition with the distributive expansion of its domain, look for dRf : (A × C) + (A × D) + (B × C) + (B × D) ⇀ A + B, given by:
(dRf)_{(A×C),A} = dR(f_{AC})    (dRf)_{(A×D),A} = dR(f_{AD})
(dRf)_{(B×C),B} = dR(f_{BC})    (dRf)_{(B×D),B} = dR(f_{BD})
and taking the other components, such as (dRf)_{(A×C),B}, to be undefined.
For inl : A ⇀ A + B we wish: dR(inl) : A × (A + B) ⇀ A. Define:
dR_x(inl)(z) ≃ y (if z = inl(y)), undefined (otherwise)
Differentiation equivalence
Q.rd_P(x : T. inl(M)) = let x : T, y : U + V = P, Q in cases y of inl(u : U) ⇒ u.rd_x(x : T. M) | inr(v : V) ⇒ UNDEF
For f : A ⇀ C and g : B ⇀ C, [f, g] : A + B ⇀ C. We wish: dR([f, g]) : (A + B) × C ⇀ A + B. Define:
dR_z([f, g])(u) = inl((dR_x f)u) (if z = inl(x)), inr((dR_y g)u) (if z = inr(y))
Differentiation equivalence
Q.rd_P(x : T. cases L[x] of inl(u : U) ⇒ M[x, u] | inr(v : V) ⇒ N[x, v]) =
  let x : T, z : W = P, Q in
  cases L[x] of
    inl(u : U) ⇒ let x′ : T, u′ : U = z.rd_{x,u}(x : T, u : U. M[x, u]) in x′ + inl(u′).rd_x(x : T. L[x])
  | inr(v : V) ⇒ . . .
An abbreviation: castl_{T,U}(M) ≡ cases M of x : T ⇒ x | y : U ⇒ UNDEF
Redex
  ϕ | ρ ⊢ V ⇒ inl_{T,U}(W)    ϕ | ρ[W/x] ⊢ M ⇝ C
  ─────────────────────────────────────────────────────────────────────
  ϕ | ρ ⊢ cases V of x : T ⇒ M | y : U ⇒ N ⇝ let x : T = castl_{T,U}(V) in C
Is reverse : List(R) → List(R) differentiable? It can be considered as a collection of functions reverse_{n,n} : List(R, n) → List(R, n). As List(R, n) = Rⁿ, we say it is differentiable everywhere, as each of its components reverse_{n,n} is.
Example: At which lists is sort : List(R) → List(R) differentiable?
Any function f : List(R) → List(R) decomposes into a collection of components f_{nm} : List(R, n) ⇀ List(R, m) where, for l ∈ List(R, n):
f_{nm}(l) ≃ f(l) (if f(l) has length m), undefined (otherwise)
We say f is differentiable at l if f_{nm} is at l (where n = |l| and m = |f(l)|). We say f is differentiable with open domain if, and only if, each of its components f_{nm} is.
As f_{nm} : List(R, n) ⇀ List(R, m) we have dR(f_{n,m}) : List(R, n) × List(R, m) ⇀ List(R, n). So one might expect a dependent type for dRf, taking l : List(R) and a cotangent in List(R, |f(l)|) to List(R, |l|); but we instead use a simple type:
dRf : List(R) × List(R) ⇀ List(R)
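For instance, in JAX the reverse derivative of array reversal is itself reversal of the cotangent, and its Jacobian at a length-4 input is the 4 × 4 anti-diagonal permutation matrix:

```python
import jax
import jax.numpy as jnp

l = jnp.array([1.0, 2.0, 3.0, 4.0])

_, rev_vjp = jax.vjp(jnp.flip, l)        # dR at l : List(R, 4) -> List(R, 4)
(cotangent_in,) = rev_vjp(jnp.array([10.0, 20.0, 30.0, 40.0]))
print(cotangent_in)                      # [40. 30. 20. 10.]

print(jax.jacrev(jnp.flip)(l))           # the anti-diagonal permutation matrix
```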
Given a container, viz:
- A set S of shapes.
- For each shape s ∈ S, a finite set P_s of places.
Shapely differentiable datatypes have the form:
D_{S,P} = Σ_{s∈S} R^{P_s} = {⟨s, x⟩ | s ∈ S, x : P_s → R}
Sets: X ≅ Σ_{x∈X} R^∅
Finite products of R: Rⁿ ≅ R^{[n]}
Lists of reals: List(R) ≅ Σ_{n≥0} R^{[n]}
Tensors of rank k ≥ 0 of reals: Tensor_k(R) ≅ Σ_{d0,...,dk−1 > 0} R^{[d0]×...×[dk−1]}
Binary trees of reals: BinaryTrees(R) ≅ Σ_s R^{Branches(s)} (s ranging over binary tree shapes)
A differentiable manifold of varying dimension is: a Hausdorff topological space X, plus an atlas on X, ie a collection of open subsets U_i covering X, each with a specified homeomorphism ϕ_i : U_i → Rⁿ (a coordinate chart) onto an open subset of some Rⁿ, subject to some axioms.
For shapely differentiable datatypes we have the charts: U_s = {s} × R^{P_s} → R^{|P_s|} (s ∈ S)
This connects shapely differentiable datatypes with standard notions of differentiable functions.
Manifolds figure commonly in learning theory. Pymanopt, Townsend et al, 2016.
Should (a suitable version of) manifolds be datatypes of differentiable programming languages? Pearlmutter, Automatic Differentiation: History and Headroom, NIPS Autodiff Workshop, 2016.
More language features in either external or internal mode, according to whether they cannot or can be differentiated. Examples:
- Higher-order functions. External: Autograd; internal: cf. convenient vector spaces.
- Effects: exceptions, global state, I/O. All available with shapely differentiable datatypes.
- A mixture of discrete distributions and those with a density.
Connect traditional semantic frameworks with differentiation:
- Domain theory: could do streams and higher-order functions.
- Metric spaces: could relate ideal computation with reals with approximate computation; differentiation of iteration scheme computations.
Make less sweeping, more realistic, assumptions about smoothness
- Work with functions (and hence programs) in smoothness classes C^k.
- Allow weaker forms of differentiability, eg some form of Clarke generalised derivative, as in the work of Edalat and Di Gianantonio (who use domain theory).
- Look at the theory of work in automatic differentiation, to establish the correctness of its techniques for efficiency.
- Investigate the use of dependent types to track shape analysis of tensor computations.
- Consider how to program with manifolds as types.