One-and-a-Half Simple Differential Programming Languages Gordon - - PowerPoint PPT Presentation



slide-1
SLIDE 1

One-and-a-Half Simple Differential Programming Languages

Gordon Plotkin Calgary, 2019

~ Joint work at Google with Martín Abadi ~

slide-2
SLIDE 2

Talk Synopsis

  • Review of neural nets
  • Review of differentiation
  • A minilanguage
  • Differentiating conditionals and loops
  • Language semantics: operational and denotational
  • Beyond powers of R
  • Conclusion and future work

slide-3
SLIDE 3

Neural networks: a very brief introduction

Neural networks

Deep learning is based on neural networks:

  • loosely inspired by the brain;
  • built from simple, trainable functions.

[Diagram: a network mapping inputs to an output]
slide-4
SLIDE 4

Primitives: the programmer’s neuron

Primitives: the neuron y = F(w1 x1 + ... + wn xn + b)

[Diagram: inputs x1, ..., xn feeding into the neuron]

  • w1, ..., wn are weights,
  • b is a bias,
  • weights and biases are parameters,
  • F is a “differentiable” non-linear function, e.g., the “ReLU” F(x) = max(0, x)
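A minimal Python sketch of this neuron (the names `neuron` and `relu` are mine, not the talk's):

```python
# The slide's neuron: y = F(w1*x1 + ... + wn*xn + b),
# with the ReLU F(x) = max(0, x) as the non-linearity.
def relu(x):
    return max(0.0, x)

def neuron(weights, bias, inputs, F=relu):
    # weighted sum of inputs plus bias, passed through the activation F
    s = sum(w * x for w, x in zip(weights, inputs)) + bias
    return F(s)
```

For example, `neuron([1.0, 2.0], 0.5, [1.0, 1.0])` computes F(1 + 2 + 0.5) = 3.5.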

slide-5
SLIDE 5

Two activation functions: ReLU and Swish

ReLU max(x, 0)


Swish x · σ(βx), where σ(z) = (1 + e−z)−1


(Ramachandran, Zoph, and Le, 2017)
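The two activations can be written directly (a small sketch; the default `beta=1.0` is my choice of illustration):

```python
import math

def relu(x):
    return max(x, 0.0)

def sigma(z):
    # logistic sigmoid: (1 + e^(-z))^(-1)
    return 1.0 / (1.0 + math.exp(-z))

def swish(x, beta=1.0):
    # Swish: x * sigma(beta * x)
    return x * sigma(beta * x)
```

Note that for large x, swish(x) approaches x, while for very negative x it approaches 0, like ReLU but smoothly.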

slide-6
SLIDE 6

2d convolutional network

[Figure: a 2D convolutional neural network, from “Conv Nets: A Modular Perspective”, colah's blog, July 8, 2014. A convolutional network uses many identical copies of the same neuron, expressing large models while keeping the number of learned parameters small.]
With thanks to C. Olah

slide-7
SLIDE 7

Some tensors

© Knoldus Inc

slide-8
SLIDE 8

Convolutional image classification

Inception-ResNet-v1 architecture

[Figure: schema of the Inception-ResNet-v1 network]

Szegedy et al, arxiv.org/abs/1602.07261

slide-9
SLIDE 9

Recurrent neural networks (RNNs)

[Diagram: a chain of RNN cells with shared parameters]

There are many variants, e.g., LSTMs.

slide-10
SLIDE 10

Mixture of experts

A model MoE architecture with a conditional and a loop: With thanks to Yu et al.

slide-11
SLIDE 11

Neural Turing Machines

Neural Turing Machines combine a RNN with an external memory bank:

Since vectors are the natural language of neural networks, the memory is an array of vectors. But how do reading and writing work? The challenge is to make them differentiable, in particular with respect to the location we read from or write to, so that we can learn where to read and write. This is tricky because memory addresses seem to be fundamentally discrete. NTMs take a clever approach: at every step, they read and write everywhere, just to different extents. For reading, instead of specifying a single location, the RNN outputs an “attention distribution” describing how much we care about each memory position, and the read result is the correspondingly weighted sum. Similarly, we write everywhere at once to different extents: the new value of a memory position is a convex combination of the old content and the write value, with the mix decided by the attention weight.

With thanks to C. Olah

slide-12
SLIDE 12

Supervised learning

Given a training dataset of (input, output) pairs, e.g., a set of images with labels:

While not done:
  Pick a pair (x, y)
  Run the neural network on x to get Net(x, b, . . .)
  Compare this to y to calculate the loss (= error = cost): Loss(b, . . .) = |y − Net(x, b, . . .)|
  Adjust parameters b, . . . to reduce the loss

More generally, pick a “mini-batch” (x1, y1), . . . , (xn, yn) and minimise the loss:

Loss(b, . . .) = Σ_{i=1..n} (yi − Net(xi, b, . . .))²
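The training loop above can be sketched as follows. The one-parameter model `net(x, b) = b * x`, the learning rate, and the finite-difference gradient (standing in for automatic differentiation) are illustrative choices of mine, not the talk's:

```python
# Mini-batch training sketch: squared loss over a batch, gradient descent
# on a single parameter b, gradient by central finite differences.
def net(x, b):
    return b * x

def loss(batch, b):
    return sum((y - net(x, b)) ** 2 for x, y in batch)

def train(batch, b=0.0, r=0.01, steps=200, eps=1e-6):
    for _ in range(steps):
        # finite-difference estimate of dLoss/db
        grad = (loss(batch, b + eps) - loss(batch, b - eps)) / (2 * eps)
        b = b - r * grad   # gradient-descent parameter update
    return b

batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # data from y = 2x, so b should approach 2
```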

slide-13
SLIDE 13

Slope of a line

slope = (change in y)/(change in x) = ∆y/∆x

So ∆y = slope × ∆x

So x′ = x − r·slope ⟹ y′ ≈ y − r·slope²

slide-14
SLIDE 14

Gradient descent

Follow the gradient of the loss function: compute partial derivatives along paths in the neural network, and follow the gradient of the loss with respect to the parameters.

Thus: x′ := x − r·(slope of Loss at x) = x − r·dLoss(x)/dx
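A one-dimensional instance of the update x′ := x − r·dLoss(x)/dx, for the illustrative loss Loss(x) = (x − 3)² whose derivative I write out by hand:

```python
def d_loss(x):
    # derivative of Loss(x) = (x - 3)**2
    return 2 * (x - 3)

def descend(x, r=0.1, steps=100):
    for _ in range(steps):
        x = x - r * d_loss(x)   # x' := x - r * dLoss(x)/dx
    return x
```

Starting from x = 0, the iterates converge to the minimiser x = 3.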

slide-15
SLIDE 15

Multi-dimensional gradient descent

x′ := x − r·∂L(x,y)/∂x and y′ := y − r·∂L(x,y)/∂y

i.e., (x′, y′) := (x, y) − r·(∂L(x,y)/∂x, ∂L(x,y)/∂y)

i.e., v′ := v − r·∇L

slide-16
SLIDE 16

Looking at differentiation

Expressions with several variables:

∂e[x,y]/∂x |_{x,y = a,b}

Gradient of functions f : R² → R of two arguments:

∇(f) : R² → R²,    ∇(f)(u, v) = (∂f(u,v)/∂u, ∂f(u,v)/∂v)

Chain rule:

∂f(g(x,y,z), h(x,y,z))/∂x = ∂f(u,v)/∂u · ∂g(x,y,z)/∂x + ∂f(u,v)/∂v · ∂h(x,y,z)/∂x

where u, v = g(x, y, z), h(x, y, z).
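The chain rule above can be checked numerically for concrete choices of f, g, h (mine, purely for illustration): f(u,v) = u·v, g(x,y,z) = x + y, h(x,y,z) = x·z.

```python
# Check: df(g,h)/dx = df/du * dg/dx + df/dv * dh/dx
def df_du(u, v): return v       # f(u,v) = u*v
def df_dv(u, v): return u
def dg_dx(x, y, z): return 1.0  # g(x,y,z) = x + y
def dh_dx(x, y, z): return z    # h(x,y,z) = x*z

def chain_rule_dx(x, y, z):
    u, v = x + y, x * z
    return df_du(u, v) * dg_dx(x, y, z) + df_dv(u, v) * dh_dx(x, y, z)

def direct_dx(x, y, z, eps=1e-6):
    # differentiate the hand-composed f(g(x,y,z), h(x,y,z)) numerically
    F = lambda t: (t + y) * (t * z)
    return (F(x + eps) - F(x - eps)) / (2 * eps)
```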

slide-17
SLIDE 17

A matrix view of the multiargument chain rule.

We have:

∂f(g,h)/∂x = ∂f(u,v)/∂u · ∂g/∂x + ∂f(u,v)/∂v · ∂h/∂x
∂f(g,h)/∂y = ∂f(u,v)/∂u · ∂g/∂y + ∂f(u,v)/∂v · ∂h/∂y
∂f(g,h)/∂z = ∂f(u,v)/∂u · ∂g/∂z + ∂f(u,v)/∂v · ∂h/∂z

writing g, h for g(x,y,z), h(x,y,z), and u, v = g(x,y,z), h(x,y,z).

Set k = g, h : R³ → R² and define its Jacobian to be the 2 × 3 matrix:

Jk = ( ∂g/∂x  ∂g/∂y  ∂g/∂z )
     ( ∂h/∂x  ∂h/∂y  ∂h/∂z )

Then the gradient of the composition f ∘ k is given by the vector-Jacobian product:

∇(f ∘ k)(x, y, z) = ∇f(u, v) · Jk(x, y, z)
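The vector-Jacobian product can be written out in a few lines of Python, for illustrative choices (mine) of k = g, h with g(x,y,z) = x·y and h(x,y,z) = y + z, and f(u,v) = u·v:

```python
# grad(f o k) = grad f(u, v) . Jk(x, y, z)  (row vector times 2x3 matrix)
def jacobian_k(x, y, z):
    return [[y,   x,   0.0],   # dg/dx, dg/dy, dg/dz for g = x*y
            [0.0, 1.0, 1.0]]   # dh/dx, dh/dy, dh/dz for h = y + z

def grad_f(u, v):
    return [v, u]              # gradient of f(u,v) = u*v

def vjp(row, J):
    # row vector times matrix
    return [sum(row[i] * J[i][j] for i in range(len(row)))
            for j in range(len(J[0]))]

def grad_composition(x, y, z):
    u, v = x * y, y + z
    return vjp(grad_f(u, v), jacobian_k(x, y, z))
```

At (1, 2, 3) this yields the gradient of (xy)(y + z), namely (y(y+z), x(y+z) + xy, xy).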

slide-18
SLIDE 18

Differentials: A functional view of differentiation

Jacobians    For f : R^m → R^n we have: Jf : R^m → Mat_{n,m}

Chain rule for Jacobians    For f : R^l → R^m and g : R^m → R^n we have:

J_x(g ∘ f) = J_{f(x)}(g) · J_x(f)

Differentials, aka (forward) derivatives    For f : R^m → R^n we define df : R^m × R^m → R^n by:

(d_x f)(y) = (J_x f) · y

Chain rule for differentials    For f : R^l → R^m and g : R^m → R^n we have:

d_x(g ∘ f) = d_{f(x)}(g) ∘ d_x(f)
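The differential (d_x f)(y) can be computed by forward-mode automatic differentiation with dual numbers; here is a minimal sketch (the class and its operator set are my own illustration, supporting just + and ·):

```python
# Dual numbers carry (value, tangent); arithmetic propagates the
# forward derivative alongside the value.
class Dual:
    def __init__(self, val, tan):
        self.val, self.tan = val, tan
    def __add__(self, other):
        return Dual(self.val + other.val, self.tan + other.tan)
    def __mul__(self, other):
        # product rule: (uv)' = u'v + uv'
        return Dual(self.val * other.val,
                    self.tan * other.val + self.val * other.tan)

def f(x):
    return x * x + x   # f(x) = x^2 + x, so f'(x) = 2x + 1

d = f(Dual(3.0, 1.0))  # evaluate at x = 3 along the tangent direction y = 1
```

Here `d.val` is f(3) = 12 and `d.tan` is (d_3 f)(1) = f′(3) = 7.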

slide-19
SLIDE 19

Reverse derivatives

For f : R^m → R^n we have dR f : R^m × R^n → R^m where:

(dR_x f)(y) = y · (J_x f)    (= (d_x f)†(y))

Chain rule    For f : R^l → R^m and g : R^m → R^n we have:

dR_x(g ∘ f) = dR_x(f) ∘ dR_{f(x)}(g)

as:

dR_x(g ∘ f)(z) = z · J_x(g ∘ f)
               = z · (J_{f(x)}(g) · J_x(f))
               = (z · J_{f(x)}(g)) · J_x(f)
               = dR_x(f)(dR_{f(x)}(g)(z))

Gradients    For the case n = 1, where f : R^m → R, we have:

dR f : R^m × R → R^m

and then: ∇_x f = (dR_x f)(1)
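Reverse derivatives are what tape-based automatic differentiation computes. A tiny sketch (the `Var` class and its global tape are my own illustration, not the talk's formalism): record each operation forward, then sweep the tape backwards accumulating adjoints, seeding the output with 1 as in ∇_x f = (dR_x f)(1).

```python
# Minimal reverse-mode AD: a global tape records nodes in creation order
# (which is a topological order), and backward() sweeps it in reverse.
class Var:
    tape = []
    def __init__(self, val, parents=()):
        self.val, self.parents, self.adj = val, parents, 0.0
        Var.tape.append(self)
    def __add__(self, other):
        return Var(self.val + other.val, [(self, 1.0), (other, 1.0)])
    def __mul__(self, other):
        return Var(self.val * other.val, [(self, other.val), (other, self.val)])

def backward(out):
    out.adj = 1.0                        # seed the output adjoint with 1
    for node in reversed(Var.tape):      # reverse topological order
        for parent, local_deriv in node.parents:
            parent.adj += node.adj * local_deriv

x, y = Var(3.0), Var(4.0)
z = x * y + x          # z = x*y + x, so dz/dx = y + 1 and dz/dy = x
backward(z)
```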

slide-20
SLIDE 20

Takeaway on differentiation

For f : R^m → R^n we have

  • df : R^m × R^m → R^n
  • dR f : R^m × R^n → R^m

For f : R^m → R we have: ∇_x f = (dR_x f)(1)

slide-21
SLIDE 21

ONNX: Open Neural Network Exchange

ONNX is an open exchange format to represent deep learning models. Here is some of an ONNX graph:

[Figure: fragment of an ONNX graph. © ONNX]

slide-22
SLIDE 22

Deep Learning: Differentiable Programming Languages

Deep Learning Graphical Frameworks

  • Caffe, CNTK, MXNet, Theano, TensorFlow, . . .

TF graphs can have conditionals, iterations, and function calls.

Automatic Differentiation (dates back to 1965!)

  • Autograd, which works as a Python package, adding a first-class gradient operation.
  • Similarly: Python/TF Eager mode, Gluon, PyTorch. Also F#/DiffSharp.
  • VLAD, a functional language with first-class forward and reverse differentiation (Pearlmutter, Siskind).

Foundational studies

  • Differential Lambda Calculi (Ehrhard, Regnier, Manzyuk);
  • Language for Diff. Functions (Edalat, Gianantonio);
  • Differential/Tangent Categories (Blute, Cockett, . . .).

Functional Programming

  • Efficient Differentiable Programming in a Functional Array-Processing Language (Shaikhha, Fitzgibbon et al)
  • Demystifying Differentiable Programming: Shift/Reset the Penultimate Backpropagator (Wang et al)

slide-23
SLIDE 23

Core differentiable programming language desiderata

As many programmable functions f : T → U differentiable as possible, for as many types T, U as possible.

A gradient operation; more generally, a reverse derivative one; even higher-order (= iterated) derivatives.

Tensors (aka multidimensional arrays). These have ranks k and shapes d0, . . . , dk−1. The set of such real tensors is: R^{[d0]×...×[dk−1]}

Execution:

  • Learning: optimising neural net parameters against data.
  • Inference: using optimised neural nets.
slide-24
SLIDE 24

How are we going to do prog language theory?

Study a small functional programming language with relevant features:

  • Products of reals as datatypes, but:
  • No tensor datatypes (∃ APL + 21 other array languages; functional programming: Steuwer et al; Gibbons; Haskell).
  • Reverse differentiation as a language primitive.
  • Control structures: conditionals/loops/recursion.
  • More, later.

Give it a semantics. Use the semantics to justify an operational semantics including the differentiation constructs. We also have source code transformations eliminating all differentiation constructs, not given here, but summarised.

slide-25
SLIDE 25

Previous foundational work

Ehrhard and Regnier’s differential lambda calculus. I originally thought this was the way to go. It is a typed lambda calculus with product and function types and (forward) differentiation as a primitive. It is based on a general notion of a differential category (which has linear features: tensors). Example: the convenient vector spaces of Frölicher (other examples exist too). Main issue: it does not support partial functions. There is, however, a non-higher-order notion of a differential restriction category (Cockett et al) which has smooth partial functions over powers of the reals as a model.

slide-26
SLIDE 26

Previous work

Automatic (aka algorithmic) differentiation: given a program, produce a program that calculates its derivative. Originally for scientific computing, not machine learning. Huge literature + large community: www.autodiff.org. Very concerned with efficiency. As far as I could find out, largely not focused on semantics and its associated language theory — the focus of this talk — though there is functional programming work (VLAD).

References

  • A simple automatic derivative evaluation program, Wengert, 1964.
  • Compiling fast partial derivatives of functions given by algorithms, Speelpenning, 1980.
  • Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation, Griewank & Walther.
  • Automatic Differentiation in Machine Learning, Baydin et al, 2017.

slide-27
SLIDE 27

A minilanguage: syntax

Types    T ::= real | unit | T × U

Terms    M ::= x | r (r ∈ R) | M + N | op(M) | M.rdL(x : T. N)
             | let x : T = M in N | ∗ | M, N | fst(M) | snd(M)
             | if B then M else N | letrec f (x : T) : U = M in N | f (M)

Boolean terms    B ::= true | false | P(M)

slide-28
SLIDE 28

Minilanguage: Typing

(Ordinary) Environments    Γ = x0 : T0, . . . , xn−1 : Tn−1

Function Environments    Φ = f0 : T0 → U0, . . . , fn−1 : Tn−1 → Un−1

Judgements    Φ | Γ ⊢ M : T    and    Φ | Γ ⊢ B

slide-29
SLIDE 29

Typing (cntnd)

Operations

  Φ | Γ ⊢ M : T
  ─────────────────  (op : T → U)
  Φ | Γ ⊢ op(M) : U

Reverse derivatives

  Φ | Γ ⊢ L : T    Φ | Γ[x : T] ⊢ N : U    Φ | Γ ⊢ M : U
  ──────────────────────────────────────────────────────
  Φ | Γ ⊢ M.rdL(x : T. N) : T

slide-30
SLIDE 30

Differentiating sequences of operations

Consider differentiating k(x) = h(g(f(x))) at x = a.

Trace (or tape) method

  1. Compute the trace, the list [a, b, c] = [a, f(a), g(f(a))].
  2. Using the trace, play the tape h(g(f(x))) backwards, applying the chain rule: k′(a) = h′(c) · g′(b) · f′(a)

Source code transformation (SCT)

  1. Using the chain rule, transform the code to M = let y = f(x) in let z = g(y) in h′(z) · g′(y) · f′(x)
  2. Evaluate the transformed code with x = a.

Much of the automatic differentiation literature considers how to do reverse-mode differentiation efficiently, e.g., by first translating to A-normal form; this produces PL versions of the backprop algorithm (see: Griewank, Who Invented the Reverse Mode of Differentiation?, 2012).
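The trace method above can be sketched in a few lines, for illustrative choices (mine) of f(x) = x², g(x) = x + 1, h(x) = 3x, whose derivatives I write out by hand:

```python
# Trace (tape) method for k(x) = h(g(f(x))): record intermediate
# values forward, then apply the chain rule along the tape backwards.
f, df = (lambda x: x * x), (lambda x: 2 * x)
g, dg = (lambda x: x + 1), (lambda x: 1.0)
h, dh = (lambda x: 3 * x), (lambda x: 3.0)

def k_prime(a):
    trace = [a, f(a), g(f(a))]          # the tape [a, b, c]
    # reverse sweep: k'(a) = h'(c) * g'(b) * f'(a)
    return dh(trace[2]) * dg(trace[1]) * df(trace[0])
```

Here k(x) = 3(x² + 1), so k′(a) = 6a; e.g. k′(2) = 12.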

slide-31
SLIDE 31

Differentiating conditionals

Consider: h(x) = if b(x) then f(x) else g(x)

The rule people use: dh/dx = if b(x) then df/dx else dg/dx

However consider: h(x) = if x = 0 then −x else x

We have h(x) = x, so dh/dx = 1. But the rule gives: dh/dx = if x = 0 then −1 else 1

Another example: ReLU(x) = if x ≤ 0 then 0 else x
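The counterexample can be run directly: the branch-wise rule reports −1 at x = 0, while the true derivative (here estimated by central differences, my choice of check) is 1:

```python
# h(x) = if x == 0 then -x else x, which is just h(x) = x.
def h(x):
    return -x if x == 0 else x

def naive_dh(x):
    # differentiate each branch separately, as the naive rule does
    return -1.0 if x == 0 else 1.0

def true_dh(x, eps=1e-6):
    # central finite difference around x
    return (h(x + eps) - h(x - eps)) / (2 * eps)
```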

slide-32
SLIDE 32

A way around the difficulty

Note b : R → T. Switch to continuous partial b : R ⇀ T, meaning that b⁻¹(tt) and b⁻¹(ff) are open (e.g., (−∞, 0) and (0, ∞)). Write f : R ⇀ R to mean that f is partial, with open domain of definition.

Proposition For continuous b : R ⇀ T and differentiable f, g : R ⇀ R, the conditional h(x) ≃ if b(x) then f(x) else g(x) is differentiable and, for all x ∈ R, we have:

dh/dx ≃ if b(x) then df/dx else dg/dx

Reference Thomas Beck, Herbert Fischer, The if-problem in automatic differentiation, 1994.

slide-33
SLIDE 33

Proof of proposition

Proposition For continuous b : R ⇀ T and differentiable f, g : R ⇀ R, the conditional h(x) ≃ if b(x) then f(x) else g(x) is differentiable and, for all x ∈ R, we have:

dh/dx ≃ if b(x) then df/dx else dg/dx

Proof. Suppose b(x) = tt.

  • Then there is an open interval (a, b) containing x such that b(x′) = tt for all x′ in (a, b).
  • So h(x′) ≃ f(x′) for all x′ ∈ (a, b) (not just x!)
  • So h and f have the same derivative at x, if any.
slide-34
SLIDE 34

Another example: swapping

Consider swap(x, y) = if x > y then (x, y) else (y, x)

When is ∂swap/∂x = if x > y then (1, 0) else (0, 1) OK?

By which I mean: at what points is > continuous? Equivalently, what is the maximum continuous restriction of >?

slide-35
SLIDE 35

How While loops work

We wish to compute the derivative at x of h(x) ≃ while b(x) do f(x)

Suppose h(x)↓, and the computation goes round the loop n times. Then h(x) = f^n(x) and the rule for this x is: dh/dx = df^n/dx

Potential proof: assuming b continuous and f differentiable, we even have h(x′) = f^n(x′) for all x′ in an open interval containing x.

slide-36
SLIDE 36

Computing reverse derivatives of while loops

Trace method

Run the loop (interpreter or compiler) till it terminates, producing a trace, being a sequence of intermediate values. Evaluate the reverse derivative along the tape, here the corresponding iterated loop body, using the chain rule.

Source code transformation Translate the code to code which consists of two while loops in sequence:

The first is the original while loop, but it also keeps copies of “checkpoint” intermediate values, and maintains a loop counter. The second counts down from the final value of the loop counter, computing individual reverse derivatives on the way using the relevant intermediate values.
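The two-loop transformation can be sketched concretely. The loop guard b(x): x > 1 and body f(x) = x/2 (with its hand-written derivative) are illustrative choices of mine:

```python
# Two-loop sketch for the reverse derivative of w = while b do f:
# a forward loop that checkpoints, then a backward loop applying the
# chain rule through the recorded iterations.
b = lambda x: x > 1
f = lambda x: x / 2
df = lambda x: 0.5            # derivative of the loop body f(x) = x/2

def reverse_deriv_while(x, y):
    # First loop: the original while loop, keeping checkpoint values.
    checkpoints = []
    while b(x):
        checkpoints.append(x)
        x = f(x)
    # Second loop: count down, accumulating reverse derivatives.
    for x_i in reversed(checkpoints):
        y = df(x_i) * y
    return y
```

From x = 8 the loop runs three times (8 → 4 → 2 → 1), so the derivative is (1/2)³ = 0.125.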

slide-37
SLIDE 37

Vector-valued differentiable functions

A function f : R^m ⇀ R is continuously differentiable if its gradient ∇_x f exists and is continuous at every x ∈ Dom(f). A function f : R^m ⇀ R^n is continuously differentiable iff each of its n component functions R^m ⇀ R is. Equivalently: a function f : R^m ⇀ R^n is continuously differentiable if its Jacobian Jf : R^m ⇀ Mat_{n,m} exists and is continuous at every point in Dom(f). Equivalently: a function f : R^m ⇀ R^n is continuously differentiable if its differential df : R^m × R^m ⇀ R^n exists and is continuous at every point in its domain, which is Dom(f) × R^m. We need continuity to make the chain rule work.

slide-38
SLIDE 38

Ordering partial functions

Partial functions R^m ⇀ R^n with open domain are partially ordered by their graphs:

f ≤ g ⟺ graph(f) ⊆ graph(g)

equivalently: f ≤ g ⟺ ∀x ∈ R^m. f(x)↓ ⟹ f(x) = g(x)

This makes R^m ⇀ R^n a cppo with ⊥ = ∅ and union of graphs as sup:

⋁_n f_n = ⋃_n f_n

This makes the conditional construction

if − then − else − : (R^m ⇀ T) × (R^m ⇀ R^n) × (R^m ⇀ R^n) → (R^m ⇀ R^n)

continuous.

slide-39
SLIDE 39

Differentiable functions and ordering and conditionals

Monotonicity    Suppose f, g : R^m ⇀ R^n are continuously differentiable. Then:

f ≤ g ⟹ dR f ≤ dR g

Continuity    Suppose f_n is an increasing sequence of continuously differentiable functions. Then so is its sup, and we have:

dR(⋁_n f_n) = ⋁_n dR(f_n)

so:

dR_x(⋁_n f_n)(y) = x′ ⟺ ∃n. dR_x(f_n)(y) = x′

Conditionals    Suppose b : R^m ⇀ T is continuous, and f, g : R^m ⇀ R^n are continuously differentiable. Then so is their conditional and we have:

dR(if b then f else g) = if b then dR(f) else dR(g)

slide-40
SLIDE 40

While loops

Iterates

while_0 b do f = ⊥
while_{n+1} b do f = if b then (while_n b do f) ∘ f else id

Loops

while b do f = ⋁_n (while_n b do f)

Theorem    (while_n b do f)(x)↓ ⟹ dR_x(while b do f) = dR_x(f^n)

slide-41
SLIDE 41

Loop source code transformation

A while loop w = while b do f : R^m ⇀ R^m has a recursive definition:

w = if b then w ∘ f else id

equivalently: w(x) ≃ if b(x) then w(f(x)) else x

Reverse differentiating we get:

dR_x w(y) ≃ if b(x) then dR_x f(dR_{f(x)} w(y)) else y

which suggests making the recursive definition of a function g : R^m × R^m ⇀ R^m by:

g(x, y) ≃ if b(x) then (dR f)(x, g(f(x), y)) else y

slide-42
SLIDE 42

Loop source code transformation (cntnd)

Theorem    Suppose w = while b do f. Then dR w is the least function g : R^m × R^m ⇀ R^m s.t.:

g(x, y) ≃ if b(x) then (dR f)(x, g(f(x), y)) else y

Proof. Writing w^(n) and g^(n) for the iterates of the two recursive definitions, by induction we have: dR(w^(n)) = g^(n). So:

dR(w) = dR(⋁_n w^(n)) = ⋁_n dR(w^(n)) = ⋁_n g^(n) = g
slide-43
SLIDE 43

Minilanguage reminder

Types    T ::= real | unit | T × U

Terms    M ::= x | r (r ∈ R) | M + N | op(M) | M.rdL(x : T. N)
             | let x : T = M in N | ∗ | M, N | fst(M) | snd(M)
             | if B then M else N | letrec f (x : T) : U = M in N | f (M)

Boolean terms    B ::= true | false | P(M)

slide-44
SLIDE 44

Minilanguage semantics: types

[[real]] = R
[[unit]] = 1
[[T × U]] = [[T]] × [[U]]

slide-45
SLIDE 45

Flattening types and functions

Flattening Types    ϕ : [[T]] ≅ R^{|T|} where:

|real| = 1    |unit| = 0    |T × U| = |T| + |U|

Flattening Functions    For f : [[T]] ⇀ [[U]], define ϕ(f) : R^{|T|} ⇀ R^{|U|} by:

ϕ(f) = ϕ ∘ f ∘ ϕ⁻¹

slide-46
SLIDE 46

Smooth functions

A function f : R^m ⇀ R is smooth (or of class C^∞) if all its m partial derivatives ∂f/∂x_i : R^m ⇀ R (i = 1, . . . , m) are defined on its domain and they too are all smooth. (It is of class C⁰ if it is continuous, and of class C^{k+1} if all its partial derivatives are defined on its domain and are of class C^k.)

(Equivalently) A function f : R^m ⇀ R^n is smooth if, for all y ∈ R^m, df(−, y) exists and is smooth.

A function f : [[T]] ⇀ [[U]] is smooth if ϕ(f) is smooth. We write S[[[T]], [[U]]] for the collection of all such functions. It forms a subcppo of the continuous such functions.

slide-47
SLIDE 47

Semantics of the language

Operations    [[op]] ∈ S[[[T]], [[U]]]    (op : T → U)

Environments    [[x0 : T0, . . . , xn−1 : Tn−1]] = [[T0]] × . . . × [[Tn−1]]

Function environments    [[f0 : T0 → U0, . . . , fn−1 : Tn−1 → Un−1]] = S[[[T0]], [[U0]]] × . . . × S[[[Tn−1]], [[Un−1]]]

Terms    For Φ | Γ ⊢ M : T:    [[M]] : [[Φ]] → S[[[Γ]], [[T]]]
         For Φ | Γ ⊢ B:        [[B]] : [[Φ]] → C[[[Γ]], T]

slide-48
SLIDE 48

Example denotational semantics

Operations    [[op(M)]](ϕ, γ) ≃ [[op]]([[M]](ϕ, γ))

Reverse derivatives

[[M.rdL(x : T. N)]](ϕ, γ) ≃ dR_{[[L]](ϕ,γ)}(λa : [[T]]. [[N]](ϕ, γ[a/x]))([[M]](ϕ, γ))

where for any differentiable f : [[T]] ⇀ [[U]] we set:

dR(f) = ϕ⁻¹_{T×U,T}(dR(ϕ_{T,U}(f)))

slide-49
SLIDE 49

Example denotational semantics

Operations    [[op(M)]](ϕ, γ) ≃ [[op]]([[M]](ϕ, γ))

Reverse derivatives

[[M.rdL(x : T. N)]] ≃ dR_{[[L]]}(λa : [[T]]. [[N]][a/x])([[M]])

slide-50
SLIDE 50

Operational semantics: basics

Value Environments    Any finite function ρ : Variables →fin ClosedValues

Values    These are terms V, W, . . .:    V ::= x | r (r ∈ R) | ∗ | V, W

Boolean values    These are terms V_bool:    V_bool ::= true | false

Function Environments    Any finite function ϕ : FunctionVariables →fin Closures

Closures    These are structures clo_{ρ,ϕ}(f(x : T) : U. M) where:

(1) ρ is a value environment with FV(M)\x ⊆ Dom(ρ)
(2) ϕ is a function environment with FFV(M)\f ⊆ Dom(ϕ)

slide-51
SLIDE 51

Evaluation relations

(Ordinary) Evaluation Relation    These relations have the form

ϕ | ρ ⊢ M ⇒ V        ϕ | ρ ⊢ B ⇒ V_bool

with V closed.

Symbolic Evaluation Relation    These relations have the form

ϕ | ρ ⊢ M ⇝ C

Tape terms    These are terms C, D, . . . with no control constructs. More specifically, they contain no: function variables; conditionals; function definitions; or function applications:

C ::= x | r (r ∈ R) | C + D | op(C) | let x : T = C in D | ∗ | C, D

slide-52
SLIDE 52

Example ordinary evaluation rules

Operations

  ϕ | ρ ⊢ M ⇒ V
  ─────────────────────  (ev(op, V) ≃ W)
  ϕ | ρ ⊢ op(M) ⇒ W

Local Definitions

  ϕ | ρ ⊢ M ⇒ V    ϕ | ρ[V/x] ⊢ N ⇒ W
  ─────────────────────────────────────
  ϕ | ρ ⊢ let x : T = M in N ⇒ W

Conditionals

  ϕ | ρ ⊢ B ⇒ true    ϕ | ρ ⊢ M ⇒ W
  ───────────────────────────────────
  ϕ | ρ ⊢ if B then M else N ⇒ W

Reverse Derivatives

  ϕ | ρ ⊢ M.rdL(x : T. N) ⇝ C    ϕ | ρ ⊢ C ⇒ V
  ──────────────────────────────────────────────
  ϕ | ρ ⊢ M.rdL(x : T. N) ⇒ V

slide-53
SLIDE 53

Symbolic evaluation rules

Variables

  ϕ | ρ ⊢ x ⇝ x

Operations

  ϕ | ρ ⊢ M ⇝ C
  ─────────────────────
  ϕ | ρ ⊢ op(M) ⇝ op(C)

Local Definitions

  ϕ | ρ ⊢ M ⇝ C    ϕ | ρ ⊢ C ⇒ V    ϕ | ρ[V/x] ⊢ N ⇝ D
  ──────────────────────────────────────────────────────
  ϕ | ρ ⊢ let x : T = M in N ⇝ let x : T = C in D

Conditionals

  ϕ | ρ ⊢ B ⇒ true    ϕ | ρ ⊢ M ⇝ C
  ───────────────────────────────────
  ϕ | ρ ⊢ if B then M else N ⇝ C

slide-54
SLIDE 54

Symbolic evaluation rules (cntnd)

Function Definition

  ϕ[clo_{ρ,ϕ}(f(x : T) : U. M)/f] | ρ ⊢ N ⇝ C
  ─────────────────────────────────────────────
  ϕ | ρ ⊢ letrec f(x : T) : U = M in N ⇝ C

Function Application

  ϕ | ρ ⊢ M ⇝ C    ϕ | ρ ⊢ C ⇒ V    ϕ′[ϕ(f)/f] | ρ′[V/x] ⊢ N ⇝ D
  ────────────────────────────────────────────────────────────────  (ϕ(f) = clo_{ρ′,ϕ′}(f(x : T) : U. N))
  ϕ | ρ ⊢ f(M) ⇝ let x : T = C in Dρ′

Reverse Derivatives

  ϕ | ρ ⊢ L ⇝ C    ϕ | ρ ⊢ M ⇝ D    ϕ | ρ ⊢ C ⇒ V    ϕ | ρ[V/x] ⊢ N ⇝ E
  ───────────────────────────────────────────────────────────────────────  (x, y ∉ Dom(ρ), Γρ ⊢ E : U)
  ϕ | ρ ⊢ M.rdL(x : T. N) ⇝ let x : T, y : U = C, D in y.Rx(x : T. E)

slide-55
SLIDE 55

Symbolic differentiation: W.R_V(x : T. C)

W.R_V(x : T. y) = W        (y = x)
W.R_V(x : T. y) = 0_T      (y ≠ x)

W.R_V(x : T. D + E) = W.R_V(x : T. D) + W.R_V(x : T. E)

W.R_V(x : T. op(D[x])) = W.opʳ(D[V]).R_V(x : T. D)

W.R_V(x : T. let y : U = C[x] in D[x, y]) =
  let y : U = C[V] in
  let z : T × U = W.R_{V,y}(z : T × U. D[fst(z), snd(z)]) in
  fst(z) + snd(z).R_V(x : T. C[x])

W.R_V(x : T. D, E) = fst(W).R_V(x : T. D) + snd(W).R_V(x : T. E)

W.R_V(x : T. fst(D[x])) = let x : T = D[V] in W, 0 .R_V(x : T. D)    (x ∉ FV(D))

slide-56
SLIDE 56

Typing environments

We give rules for judgements ρ : Γ, Cl : T → U, and ϕ : Φ as follows:

  V_i : T_i (i = 0, . . . , n − 1)
  ──────────────────────────────────────────────────────────────
  {x0 ↦ V0, . . . , xn−1 ↦ Vn−1} : x0 : T0, . . . , xn−1 : Tn−1

  Cl_i : T_i → U_i (i = 0, . . . , n − 1)
  ───────────────────────────────────────────────────────────────────────────
  {f0 ↦ Cl0, . . . , fn−1 ↦ Cln−1} : f0 : T0 → U0, . . . , fn−1 : Tn−1 → Un−1

  ϕ′ : Φ    ρ′ : Γ    Φ | Γ[x : T] ⊢ M : U
  ─────────────────────────────────────────  (ϕ′ = ϕ ↾ FFV(M)\f, ρ′ = ρ ↾ FV(M)\x)
  clo_{ρ,ϕ}(f(x : T) : U. M) : T → U

slide-57
SLIDE 57

Correctness theorems

Theorem (Formal reverse-mode differentiation correctness)    Suppose Γ[x : T] ⊢ E : U, Γ ⊢ C : T, and Γ ⊢ D : U (and so Γ ⊢ D.rdC(x : T. E) : T). Then, for any γ ∈ [[Γ]], we have:

[[D.rdC(x : T. E)]](γ) ≃ [[D.RC(x : T. E)]](γ)

slide-58
SLIDE 58

Correctness theorems (cntnd)

Two conditions:

NGV: No recursive function definitions in M have global free variables.
NGFD: No recursive function definitions in M contain the function variable within a derivative expression occurring within the function body.

Theorem (Operational correctness)

1. Operational semantics. Suppose Φ | Γ ⊢ M : T, ϕ : Φ, and ρ : Γ. Then:

   ϕ | ρ ⊢ M ⇒ V ⟹ [[M]][[ϕ]][[ρ]] = [[V]]

2. Symbolic operational semantics. Suppose Φ | Γ ⊢ M : T, Φ | Γ ⊢ C : T, ϕ : Φ, and ρ : Γ. Then:

   ϕ | ρ ⊢ M ⇝ C ⟹ ∃ O ⊆_open [[Γ]]. [[ρ]] ∈ O ∧ ∀γ ∈ O. [[M]][[ϕ]]γ ≃ [[C]]γ

slide-59
SLIDE 59

Operational completeness

Theorem (Operational completeness)    The following hold:

1. Operational semantics. Suppose Φ | Γ ⊢ M : T, ϕ : Φ, and ρ : Γ. Then:

   [[M]][[ϕ]][[ρ]] = [[V]] ⟹ ϕ | ρ ⊢ M ⇒ V

2. Symbolic operational semantics. Suppose Φ | Γ ⊢ M : T, ϕ : Φ, and ρ : Γ. Then:

   ϕ | ρ ⊢ M ⇒ V ⟹ ∃C. ϕ | ρ ⊢ M ⇝ C

slide-60
SLIDE 60

Derivative Elimination Theorem

Theorem    Let M be a closed well-typed NGV and NGFD term over a given alphabet of function variables. Then there is a unique derivative-free term D, possibly containing additional primed function variables, such that: M ⊲ D. Further: [[M]] = [[D]]

slide-61
SLIDE 61

Decomposing a partial function into its components

[Diagram: a partial function f : A + B ⇀ C + D decomposed into components f_AC : A ⇀ C, f_AD : A ⇀ D, f_BC : B ⇀ C, f_BD : B ⇀ D]

slide-62
SLIDE 62

The reverse derivative of a decomposed function

We want: dR f : (A + B) × (C + D) ⇀ A + B

Identifying f with its composition with the distributive expansion of its domain, look for:

dR f : (A × C) + (A × D) + (B × C) + (B × D) ⇀ A + B

Given by:

(dR f)_{(A×C),A} = dR f_AC
(dR f)_{(A×D),A} = dR f_AD
(dR f)_{(B×C),B} = dR f_BC
(dR f)_{(B×D),B} = dR f_BD

and taking the other components, such as (dR f)_{(A×C),B}, to be undefined.

slide-63
SLIDE 63

Sums: injections

For inl : A ⇀ A + B we wish: dR(inl) : A × (A + B) ⇀ A

Define:

dR_x(inl)(z) ≃ y           (z = inl(y))
dR_x(inl)(z) undefined     (otherwise)

Differentiation equivalence

Q.rdP(x : T. inl(M)) = let x : T, y : U + V = P, Q in
                       cases y of inl(u : U) ⇒ u.rdx(x : T. M)
                                | inr(v : V) ⇒ UNDEF

slide-64
SLIDE 64

Sums: cotupling

For f : A ⇀ C and g : B ⇀ C:    [f, g] : A + B ⇀ C

Wish: dR([f, g]) : (A + B) × C ⇀ A + B

Define:

dR_z([f, g])(u) = inl((dR_x f)(u))    (z = inl(x))
dR_z([f, g])(u) = inr((dR_y g)(u))    (z = inr(y))

slide-65
SLIDE 65

Sums: cotupling (cntnd)

Differentiation equivalence

Q.rdP(x : T. cases L[x] of inl(u : U) ⇒ M[x, u] | inr(v : V) ⇒ N[x, v]) =
  let x : T, z : W = P, Q in
  cases L[x] of
    inl(u : U) ⇒ let x′ : T, u′ : U = z.rd_{x,u}(x : T, u : U. M[x, u]) in
                 x′ + inl(u′).rdx(x : T. L[x])
  | inr(v : V) ⇒ . . .

slide-66
SLIDE 66

Symbolic operational semantics for sums

An abbreviation:

castl_{T,U}(M) ≡ cases M of x : T ⇒ x | y : U ⇒ UNDEF

Redex:

  ϕ | ρ ⊢ V ⇒ inl_{T,U}(W)    ϕ | ρ[W/x] ⊢ M ⇝ C
  ──────────────────────────────────────────────────────────────────────────
  ϕ | ρ ⊢ cases V of x : T ⇒ M | y : U ⇒ N ⇝ let x : T = castl_{T,U}(V) in C

slide-67
SLIDE 67

Differentiating functions on lists of reals

Is reverse : List(R) → List(R) differentiable? It can be considered as a collection of functions reverse_{n,n} : List(R, n) → List(R, n). As List(R, n) = R^n, we say it is differentiable everywhere, as each of its components reverse_{n,n} is.

Example    At which lists is sort : List(R) → List(R) differentiable?
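For the sort example: at a list with pairwise-distinct entries, sort applies a fixed permutation on a neighbourhood, so locally its Jacobian is a permutation matrix and its reverse derivative just un-permutes the incoming cotangent. A small sketch of that vector-Jacobian product (the function names are mine):

```python
# Reverse derivative of sort at a list xs with distinct entries:
# sort(xs)[j] = xs[perm[j]] for a locally constant permutation perm,
# so the VJP scatters each cotangent ys[j] back to position perm[j].
def sort_vjp(xs, ys):
    perm = sorted(range(len(xs)), key=lambda i: xs[i])
    out = [0.0] * len(xs)
    for j, i in enumerate(perm):
        out[i] += ys[j]
    return out
```

For xs = [3, 1, 2], sort sends positions (1, 2, 0) to (0, 1, 2), so cotangent [10, 20, 30] on the sorted output pulls back to [30, 10, 20] on the input. At lists with repeated entries the permutation is not locally constant, which is where differentiability fails.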

slide-68
SLIDE 68

Differentiating functions between lists, in general

Any function f : List(R) → List(R) decomposes into a collection of components f_{nm} : List(R, n) ⇀ List(R, m) where, for l ∈ List(R, n):

f_{nm}(l) ≃ f(l)         (f(l) has length m)
f_{nm}(l) undefined      (otherwise)

We say f is differentiable at l if f_{nm} is at l (where n = |l| and m = |f(l)|). We say f is differentiable with open domain if, and only if, each of its components f_{nm} is.

slide-69
SLIDE 69

Reverse derivatives of functions on lists

As f_{nm} : List(R, n) ⇀ List(R, m), we have:

dR f_{nm} : List(R, n) × List(R, m) ⇀ List(R, n)

So one might expect a dependent type:

dR f : Π_{l ∈ List(R, n)} List(R, |f(l)|) ⇀ List(R, n)

but we instead use a simple type:

dR f : List(R) × List(R) ⇀ List(R)

slide-70
SLIDE 70

Differentiable shapely datatypes

Given a container, viz:

  • A set S of shapes.
  • For each shape s ∈ S, a finite set P_s of places.

Shapely differentiable datatypes have the form:

D_{S,P} = Σ_{s ∈ S} R^{P_s} = {(s, x) | s ∈ S, x : P_s → R}

slide-71
SLIDE 71

Examples of differentiable shapely datatypes

Sets    X ≅ Σ_{s ∈ X} R^∅

Finite products of R    R^n ≅ Σ_{s ∈ {∗}} R^{[n]}

Lists of reals    List(R) ≅ Σ_{n ∈ N} R^{[n]}

Tensors of rank k ≥ 0 of reals    Tensor_k(R) ≅ Σ_{(d0,...,dk−1) ∈ N_{>0}^k} R^{[d0]×...×[dk−1]}

Binary trees of reals    BinaryTrees(R) ≅ Σ_{s ∈ BinaryTrees} R^{Branches(s)}

slide-72
SLIDE 72

Shapely differentiable datatypes are manifolds

A differentiable manifold of varying dimension is: a Hausdorff topological space X, plus an atlas on X, i.e., a collection of open subsets U_i covering X, each with a specified homeomorphism ϕ_i : U_i → R^n (a coordinate chart) to an open subset of some R^n, subject to some axioms.

For shapely differentiable datatypes we have the charts:

U_s = {s} × R^{P_s} → R^{|P_s|}    (s ∈ S)

This connects shapely differentiable datatypes with standard notions of differentiable functions. Manifolds figure commonly in learning theory (Pymanopt, Townsend et al, 2016). Should (a suitable version of) manifolds be datatypes of differentiable programming languages? (Pearlmutter, Automatic Differentiation: History and Headroom, NIPS Autodiff Workshop, 2016.)

slide-73
SLIDE 73

Future work

More language features in either external or internal mode, according to whether they cannot or can be differentiated. Examples:

  • Higher-order functions. External: Autograd; internal: cf. convenient vector spaces.
  • Effects: exceptions, global state, I/O. All available with shapely differential datatypes.
  • Probability. Current work restricted to a graphical model with a mixture of discrete distributions and those with a density.

Connect traditional semantic frameworks with differentiation:

  • Domain theory: could do streams and higher-order functions.
  • Metric spaces: could relate ideal computation with reals with approximate computation; differentiation of iteration-scheme computations.

slide-74
SLIDE 74

Future work (cntnd)

Make less sweeping, more realistic, assumptions about smoothness:

  • Work with functions (and hence programs) in smoothness classes C^k.
  • Allow weaker forms of differentiability, e.g., some form of Clarke generalised derivative, as in the work of Edalat and Gianantonio (who use domain theory).

Look at the theory of work in automatic differentiation, to establish correctness of their techniques for efficiency. Investigate use of dependent types to track shape analysis of tensor computations. Consider how to program with manifolds as types.