Differentiable Programming


SLIDE 1

Differentiable Programming

Atılım Güneş Baydin

National University of Ireland Maynooth (Based on joint work with Barak Pearlmutter)

Microsoft Research Cambridge, February 1, 2016

SLIDE 2

Deep learning layouts

Neural network models are assembled from building blocks and trained with backpropagation


SLIDE 3

Deep learning layouts

Neural network models are assembled from building blocks and trained with backpropagation

Traditional: feedforward, convolutional, recurrent


SLIDE 4

Deep learning layouts

Newer additions: make algorithmic elements continuous and differentiable → enables use in deep learning

NTM on copy task (Graves et al. 2014)

• Neural Turing Machine (Graves et al., 2014) → can infer algorithms: copy, sort, recall
• Stack-augmented RNN (Joulin & Mikolov, 2015)
• End-to-end memory network (Sukhbaatar et al., 2015)
• Stack, queue, deque (Grefenstette et al., 2015)
• Discrete interfaces (Zaremba & Sutskever, 2015)


SLIDE 5

Deep learning layouts

Stacking of many layers, trained through backpropagation

AlexNet, 8 layers (ILSVRC 2012)
[architecture figure: 11x11 conv 96 /4 pool/2 → 5x5 conv 256 pool/2 → 3x3 conv 384 → 3x3 conv 384 → 3x3 conv 256 pool/2 → fc 4096 → fc 4096 → fc 1000]

VGG, 19 layers (ILSVRC 2014)
[architecture figure: stacks of 3x3 conv layers (64, 128, 256, 512 channels) interleaved with pooling, followed by fc 4096 → fc 4096 → fc 1000]

ResNet, 152 layers (deep residual learning) (ILSVRC 2015)
[architecture figure: a 7x7 conv stem followed by repeated 1x1 → 3x3 → 1x1 bottleneck blocks (64 up to 512 channels), average pooling, and fc 1000]

(He, Zhang, Ren, Sun. “Deep Residual Learning for Image Recognition.” 2015. arXiv:1512.03385)

SLIDE 6

The bigger picture

One way of viewing deep learning systems is “differentiable functional programming”

Two main characteristics:
• Differentiability → optimization
• Chained function composition → successive transformations → successive levels of distributed representations (Bengio, 2013) → the chain rule of calculus propagates derivatives


SLIDE 7

The bigger picture

In a functional interpretation:
• Weight-tying or multiple applications of the same neuron (e.g., ConvNets and RNNs) resemble function abstraction
• Structural patterns of composition resemble higher-order functions (e.g., map, fold, unfold, zip); a short F# sketch follows
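To make the analogy concrete, here is a small sketch (illustrative, not from the slides; the scalar step function and its weights are invented for the example): an RNN unrolled over a sequence is literally a fold of its weight-tied step function.

// An RNN over a sequence is a fold of its (weight-tied) step function.
let step (w, u) h x = tanh (w * h + u * x)           // one recurrent application
let rnn (w, u) h0 xs = List.fold (step (w, u)) h0 xs

// Toy usage with scalar weights; real models would use vectors and matrices.
let h = rnn (0.5, 1.0) 0.0 [0.1; 0.2; 0.3]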


SLIDE 8

The bigger picture

Even when you have complex compositions, differentiability ensures that they can be trained end-to-end with backpropagation

(Vinyals, Toshev, Bengio, Erhan. “Show and tell: a neural image caption generator.” 2014. arXiv:1411.4555)

SLIDE 9

The bigger picture

These insights were clearly put into words in Christopher Olah’s blog post (September 3, 2015):
http://colah.github.io/posts/2015-09-NN-Types-FP/
“The field does not (yet) have a unifying insight or narrative”

and reiterated in David Dalrymple’s essay (January 2016):
http://edge.org/response-detail/26794
“The most natural playground ... would be a new language that can run back-propagation directly on functional programs.”


SLIDES 10-11

In this talk

Vision: Functional languages with deeply embedded, general-purpose differentiation capability, i.e., differentiable programming

Automatic (algorithmic) differentiation (AD) in a functional framework is a manifestation of this vision.



SLIDE 12

In this talk

I will talk about:
• Mainstream frameworks
• What AD research can contribute
• My ongoing work


SLIDE 13

Mainstream Frameworks

SLIDE 14

Frameworks

“Theano-like”: fine-grained
• Define computational graphs in a symbolic way
• Graph analysis and optimizations
Examples: Theano, Computation Graph Toolkit (CGT), TensorFlow, Computational Network Toolkit (CNTK)

(Kenneth Tran. “Evaluation of Deep Learning Toolkits.” https://github.com/zer0n/deepframeworks)


SLIDE 15

Frameworks

“Torch-like”: coarse-grained
• Build models by combining pre-specified modules
• Each module is manually implemented and hand-tuned
Examples: Torch7, Caffe


SLIDES 16-18

Frameworks

Common in both:
• Define models using the framework’s (constrained) symbolic language
• The framework handles backpropagation → you don’t have to code derivatives (unless adding new modules)
• Because derivatives are “automatic”, some call it “autodiff” or “automatic differentiation”
• This is NOT the traditional meaning of automatic differentiation (AD) (Griewank & Walther, 2008)
• Because “automatic” is a generic (and bad) term, algorithmic differentiation is a better name


SLIDES 19-20

“But, how is AD different from Theano?”


In Theano, you:
• express all math relations using symbolic placeholders
• use a mini-language with very limited control flow (e.g., scan)
• end up designing a symbolic graph for your algorithm, which Theano then optimizes


SLIDE 21

“But, how is AD different from Theano?”

Theano gives you automatic derivatives:
• Transforms your graph into a derivative graph
• Applies optimizations: identical subgraph elimination, simplifications, stability improvements (http://deeplearning.net/software/theano/optimizations.html)
• Compiles to a highly optimized form


SLIDES 22-24

“But, how is AD different from Theano?”

You are limited to symbolic graph building, with the mini-language

For example, instead of this in pure Python (for A^k): [code figure not reproduced in the transcript]

You build this symbolic graph: [graph figure not reproduced in the transcript]


SLIDES 25-27

“But, how is AD different from Theano?”

AD allows you to just fully use your host language and gives you exact and efficient derivatives

So, you just do this: [code figure not reproduced in the transcript; an illustrative F# sketch follows below]

For Python: autograd https://github.com/HIPS/autograd (Harvard Intelligent Probabilistic Systems Group)

(Dougal Maclaurin, David Duvenaud, Ryan P. Adams. “Autograd: effortless gradients in Numpy.” 2015)

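To make the contrast concrete, here is a minimal F# sketch (illustrative, not the Python/autograd code shown in the original slides; the dual-number type D is hand-rolled for the example): an ordinary host-language loop computing x^k, differentiated exactly by forward-mode AD.

// Forward-mode AD via a minimal dual number: primal P and tangent T.
type D =
    { P: float; T: float }
    static member (*) (a: D, b: D) = { P = a.P * b.P; T = a.T * b.P + a.P * b.T }

// An ordinary loop with ordinary control flow; no symbolic graph is built.
let pow k (x: D) =
    let mutable r = { P = 1.0; T = 0.0 }
    for _ in 1 .. k do r <- r * x
    r

// Seed the tangent of x with 1 to get d/dx x^5 at x = 2 (exactly 80.0).
let d = (pow 5 { P = 2.0; T = 1.0 }).T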

SLIDE 28

Here is the difference

AD does not use symbolic graphs; it gives numeric code that computes the function AND its derivatives at a given point

f(a, b):
  c = a * b
  d = sin c
  return d

f’(a, a’, b, b’):
  (c, c’) = (a*b, a’*b + a*b’)
  (d, d’) = (sin c, c’ * cos c)
  return (d, d’)

Derivatives propagated at the elementary operation level, as a side effect, at the same time the function itself is computed
→ prevents the “expression swell” of symbolic derivatives
Full expressive capability of the host language
→ including conditionals, looping, branching


SLIDES 29-33

Function evaluation traces

All numeric evaluations are sequences of elementary operations: a “trace,” also called a “Wengert list” (Wengert, 1964)

f(a, b):
  c = a * b
  if c > 0:
    d = log c
  else:
    d = sin c
  return d

f(2, 3)

a = 2
b = 3
c = a * b = 6
d = log c = 1.791
return 1.791

(primal)

a = 2    a’ = 1
b = 3    b’ = 0
c = a * b = 6    c’ = a’ * b + a * b’ = 3
d = log c = 1.791    d’ = c’ * (1 / c) = 0.5
return 1.791, 0.5

(tangent)

i.e., a Jacobian-vector product: Jf (1, 0)|(2,3) = ∂/∂a f(a, b)|(2,3) = 0.5

This is called the forward (tangent) mode of AD

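As a sketch of how the forward mode can be implemented (illustrative, not from the slides; it extends the dual-number idea above with log and sin), every value carries its tangent alongside its primal, reproducing the trace exactly:

// Dual numbers with the operations used by the slide’s f(a, b).
type D =
    { P: float; T: float }
    static member (*) (a: D, b: D) = { P = a.P * b.P; T = a.T * b.P + a.P * b.T }

let log' (a: D) = { P = log a.P; T = a.T / a.P }       // d’ = c’ * (1 / c)
let sin' (a: D) = { P = sin a.P; T = a.T * cos a.P }   // d’ = c’ * cos c

// The slide’s f, with control flow branching on the primal value.
let f (a: D) (b: D) =
    let c = a * b
    if c.P > 0.0 then log' c else sin' c

// Seeds a’ = 1, b’ = 0 reproduce the tangent trace: returns (1.791, 0.5).
let r = f { P = 2.0; T = 1.0 } { P = 3.0; T = 0.0 }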


SLIDES 34-37

Function evaluation traces

f(a, b):
  c = a * b
  if c > 0:
    d = log c
  else:
    d = sin c
  return d

f(2, 3)

a = 2
b = 3
c = a * b = 6
d = log c = 1.791
return 1.791

(primal)

a = 2
b = 3
c = a * b = 6
d = log c = 1.791
d’ = 1
c’ = d’ * (1 / c) = 0.166
b’ = c’ * a = 0.333
a’ = c’ * b = 0.5
return 1.791, 0.5, 0.333

(adjoint)

i.e., a transposed Jacobian-vector product: JfT (1)|(2,3) = ∇f|(2,3) = (0.5, 0.333)

This is called the reverse (adjoint) mode of AD

Backpropagation is just a special case of the reverse mode: code your neural network objective computation, apply reverse AD

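And a corresponding sketch of the reverse mode (illustrative, not from the slides): record each operation’s parents and local partial derivatives on a tape during the primal evaluation, then accumulate adjoints in one reverse sweep.

// A tape node: parent indices with local partials ∂self/∂parent.
type Node = { Parents: (int * float) [] }

type Tape() =
    let nodes = System.Collections.Generic.List<Node>()
    member _.Var() = nodes.Add({ Parents = [||] }); nodes.Count - 1
    member _.Push ps = nodes.Add({ Parents = ps }); nodes.Count - 1
    member _.Adjoints output =
        let adj : float [] = Array.zeroCreate nodes.Count
        adj.[output] <- 1.0                        // seed d’ = 1
        for i in nodes.Count - 1 .. -1 .. 0 do     // reverse sweep
            for (p, dp) in nodes.[i].Parents do
                adj.[p] <- adj.[p] + adj.[i] * dp
        adj

// The adjoint trace above: f(2, 3) with c = a * b and d = log c.
let t = Tape()
let a, b = t.Var(), t.Var()
let av, bv = 2.0, 3.0
let c = t.Push [| (a, bv); (b, av) |]              // ∂c/∂a = b, ∂c/∂b = a
let d = t.Push [| (c, 1.0 / (av * bv)) |]          // ∂d/∂c = 1 / c
let adj = t.Adjoints d                             // adj.[a] = 0.5, adj.[b] = 0.333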


SLIDES 38-39

Torch-autograd

There are signs that this type of generalized AD will become mainstream in machine learning

A very recent development (November 2015): Torch-autograd by Twitter Cortex (inspired by Python autograd)
https://blog.twitter.com/2015/autograd-for-torch
“autograd has dramatically sped up our model building ... extremely easy to try and test out new ideas”


SLIDE 40

A cool functional DSL for Torch and Caffe

A side note about the functional interpretation of deep learning: dnngraph by Andrew Tulloch
http://ajtulloch.github.io/dnngraph/
Specify neural network layouts in Haskell; it gives you Torch and Caffe scripts


SLIDE 41

What Can AD Research Contribute?

SLIDES 42-44

The ambition

Deeply embedded AD:
• Derivatives (forward and/or reverse) as part of the language infrastructure
• Rich API of differentiation operations as higher-order functions
• High-performance matrix operations for deep learning (GPU support, model and data parallelism)
• The embodiment of the “differentiable programming” paradigm

I have been working on these issues with Barak Pearlmutter and created DiffSharp (later in the talk)


SLIDE 45

AD in a functional framework

AD has been around since the 1960s (Wengert, 1964; Speelpenning, 1980; Griewank, 1989)

The foundations for AD in a functional framework (Siskind and Pearlmutter, 2008; Pearlmutter and Siskind, 2008), with research implementations:
• R6RS-AD https://github.com/qobi/R6RS-AD
• Stalingrad http://www.bcl.hamilton.ie/~qobi/stalingrad/
• Alexey Radul’s DVL https://github.com/axch/dysvunctional-language
• Recently, my DiffSharp library http://diffsharp.github.io/DiffSharp/



SLIDES 46-47

AD in a functional framework

“Generalized AD as a first-class function in an augmented λ-calculus” (Pearlmutter and Siskind, 2008)

Forward, reverse, and any nested combination thereof, instantiated according to usage scenario

Nested lambda expressions with free-variable references:
min (λx . (f x) + min (λy . g x y))    (min: gradient descent)

Must handle “perturbation confusion” (Manzyuk et al., 2012):
D (λx . x × (D (λy . x + y) 1)) 1
i.e., d/dx [ x × ( d/dy (x + y) |y=1 ) ] |x=1 ?= 1

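A sketch of the shape of such nested optimization (illustrative; grad is a crude numeric stand-in so the fragment is self-contained, whereas the point of this line of work is to do the same with exact, nested AD):

// Numeric stand-in for an AD gradient operator.
let grad f x = let e = 1e-6 in (f (x + e) - f (x - e)) / (2.0 * e)

// min as a higher-order function: gradient descent, returning the minimum value.
let minval f x0 =
    let mutable x = x0
    for _ in 1 .. 2000 do x <- x - 0.1 * grad f x
    f x

// min (λx . (f x) + min (λy . g x y)): the inner min runs inside the
// objective that the outer min differentiates.
let f x = (x - 1.0) ** 2.0
let g x y = (y - x) ** 2.0 + x * x
let xBest = minval (fun x -> f x + minval (g x) 0.0) 0.0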

SLIDE 48

Tricks of the trade

Many methods from AD research:
• Hessian-vector products (Pearlmutter, 1994)
• Tape reduction and elimination (Naumann, 2004)
• Context-aware source-to-source transformation (Utke, 2004)
• Utilizing sparsity by matrix coloring (Gebremedhin et al., 2013)
• Reverse AD checkpointing (Dauvergne & Hascoët, 2006)


SLIDE 49

My Ongoing Work


SLIDES 50-51

DiffSharp

http://diffsharp.github.io/DiffSharp/

AD with linear algebra primitives:
• arbitrary nesting of forward/reverse AD
• a comprehensive higher-order API: gradients, Hessians, Jacobians, directional derivatives, matrix-free Hessian- and Jacobian-vector products

Implemented in F# → the best tool for this job
→ cross-platform (Linux, Mac OS, Windows)
→ easy deployment with nuget
→ the immense .NET user base of C# and F# users
→ implicit quotations in F# 4.0 are a “killer feature” for deeply embedding transformation-based AD


SLIDE 52

DiffSharp

Higher-order differentiation API

Op.            | Value              | Type signature                      | AD     | Num. | Sym.

f : R → R
diff           | f′                 | (R → R) → R → R                     | X, F   | A    | X
diff’          | (f, f′)            | (R → R) → R → (R × R)               | X, F   | A    | X
diff2          | f′′                | (R → R) → R → R                     | X, F   | A    | X
diff2’         | (f, f′′)           | (R → R) → R → (R × R)               | X, F   | A    | X
diff2’’        | (f, f′, f′′)       | (R → R) → R → (R × R × R)           | X, F   | A    | X
diffn          | f(n)               | N → (R → R) → R → R                 | X, F   |      | X
diffn’         | (f, f(n))          | N → (R → R) → R → (R × R)           | X, F   |      | X

f : Rn → R
grad           | ∇f                 | (Rn → R) → Rn → Rn                  | X, R   | A    | X
grad’          | (f, ∇f)            | (Rn → R) → Rn → (R × Rn)            | X, R   | A    | X
gradv          | ∇f · v             | (Rn → R) → Rn → Rn → R              | X, F   | A    |
gradv’         | (f, ∇f · v)        | (Rn → R) → Rn → Rn → (R × R)        | X, F   | A    |
hessian        | Hf                 | (Rn → R) → Rn → Rn×n                | X, R-F | A    | X
hessian’       | (f, Hf)            | (Rn → R) → Rn → (R × Rn×n)          | X, R-F | A    | X
hessianv       | Hf v               | (Rn → R) → Rn → Rn → Rn             | X, F-R | A    |
hessianv’      | (f, Hf v)          | (Rn → R) → Rn → Rn → (R × Rn)       | X, F-R | A    |
gradhessian    | (∇f, Hf)           | (Rn → R) → Rn → (Rn × Rn×n)         | X, R-F | A    | X
gradhessian’   | (f, ∇f, Hf)        | (Rn → R) → Rn → (R × Rn × Rn×n)     | X, R-F | A    | X
gradhessianv   | (∇f · v, Hf v)     | (Rn → R) → Rn → Rn → (R × Rn)       | X, F-R | A    |
gradhessianv’  | (f, ∇f · v, Hf v)  | (Rn → R) → Rn → Rn → (R × R × Rn)   | X, F-R | A    |
laplacian      | tr(Hf)             | (Rn → R) → Rn → R                   | X, R-F | A    | X
laplacian’     | (f, tr(Hf))        | (Rn → R) → Rn → (R × R)             | X, R-F | A    | X

f : Rn → Rm
jacobian       | Jf                 | (Rn → Rm) → Rn → Rm×n               | X, F/R | A    | X
jacobian’      | (f, Jf)            | (Rn → Rm) → Rn → (Rm × Rm×n)        | X, F/R | A    | X
jacobianv      | Jf v               | (Rn → Rm) → Rn → Rn → Rm            | X, F   | A    |
jacobianv’     | (f, Jf v)          | (Rn → Rm) → Rn → Rn → (Rm × Rm)     | X, F   | A    |
jacobianT      | JfT                | (Rn → Rm) → Rn → Rn×m               | X, F/R | A    | X
jacobianT’     | (f, JfT)           | (Rn → Rm) → Rn → (Rm × Rn×m)        | X, F/R | A    | X
jacobianTv     | JfT v              | (Rn → Rm) → Rn → Rm → Rn            | X, R   |      |
jacobianTv’    | (f, JfT v)         | (Rn → Rm) → Rn → Rm → (Rm × Rn)     | X, R   |      |
jacobianTv’’   | (f, JfT (·))       | (Rn → Rm) → Rn → (Rm × (Rm → Rn))   | X, R   |      |
curl           | ∇ × f              | (R3 → R3) → R3 → R3                 | X, F   | A    | X
curl’          | (f, ∇ × f)         | (R3 → R3) → R3 → (R3 × R3)          | X, F   | A    | X
div            | ∇ · f              | (Rn → Rn) → Rn → R                  | X, F   | A    | X
div’           | (f, ∇ · f)         | (Rn → Rn) → Rn → (Rn × R)           | X, F   | A    | X
curldiv        | (∇ × f, ∇ · f)     | (R3 → R3) → R3 → (R3 × R)           | X, F   | A    | X
curldiv’       | (f, ∇ × f, ∇ · f)  | (R3 → R3) → R3 → (R3 × R3 × R)      | X, F   | A    | X

(In the AD column, F and R denote forward and reverse mode, with compositions such as R-F; A marks a numerical approximation; the Sym. column marks symbolic differentiation support.)
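A usage sketch for a few of these operators (assuming the DiffSharp 0.x API of the time, with the DiffSharp.AD.Float64 module and its D/DV types; names may differ across versions):

open DiffSharp.AD.Float64

let f (x: DV) = sin (x.[0] * x.[1])                      // f : R^n → R
let g = grad f (toDV [2.0; 3.0])                         // ∇f, via reverse AD
let hv = hessianv f (toDV [2.0; 3.0]) (toDV [1.0; 0.0])  // matrix-free Hf·v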

SLIDE 53

DiffSharp

Matrix operations: http://diffsharp.github.io/DiffSharp/api-overview.html
• High-performance OpenBLAS backend by default; work on a CUDA-based GPU backend underway
• Support for 64- and 32-bit floats (faster on many systems)
• Benchmarking tool: http://diffsharp.github.io/DiffSharp/benchmarks.html
• A growing collection of tutorials: gradient-based optimization algorithms, clustering, Hamiltonian Monte Carlo, neural networks, inverse kinematics


SLIDE 54

Hype

http://hypelib.github.io/Hype/

An experimental library for “compositional machine learning and hyperparameter optimization”, built on DiffSharp
• A robust optimization core with highly configurable functional modules: SGD, conjugate gradient, Nesterov, AdaGrad, RMSProp, Newton’s method
• Uses nested AD for gradient-based hyperparameter optimization (Maclaurin et al., 2015)
• Researching the differentiable functional programming paradigm for machine learning

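As a sketch of what gradient-based hyperparameter optimization means (illustrative, not Hype’s code; the inner training gradient is hand-coded so the fragment stands alone): differentiate the loss after a whole training run with respect to the learning rate, here with a forward-mode dual number carrying the rate’s tangent through the loop.

// Dual numbers: primal P and tangent T (here, sensitivity to the rate).
type D =
    { P: float; T: float }
    static member (-) (a: D, b: D) = { P = a.P - b.P; T = a.T - b.T }
    static member (*) (a: D, b: D) = { P = a.P * b.P; T = a.T * b.P + a.P * b.T }

let c x = { P = x; T = 0.0 }                      // lift a constant

// Hand-coded gradient of the toy training loss (w - 3)^2; a real system
// would obtain this inner gradient with reverse AD.
let dLoss w = c 2.0 * (w - c 3.0)

let trainedLoss rate =
    let mutable w = c 0.0
    for _ in 1 .. 100 do w <- w - rate * dLoss w  // SGD steps
    let e = w - c 3.0 in e * e

// Seed the rate’s tangent with 1: T is d(final loss)/d(learning rate).
let hyper = trainedLoss { P = 0.01; T = 1.0 }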

SLIDE 55

Hype

Extracts from Hype neural network code: use higher-order functions; don’t think about gradients or backpropagation

https://github.com/hypelib/Hype/blob/master/src/Hype/Neural.fs


SLIDE 56

Hype

Extracts from Hype optimization code

https://github.com/hypelib/Hype/blob/master/src/Hype/Optimize.fs

Optimization and training as higher-order functions
→ works with any function that you want to describe your data
→ can be composed, curried, nested
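A sketch of that shape (illustrative, not Hype’s actual signatures): an optimizer and a model are both just functions, so training simply composes them.

// train : optimizer -> model -> data -> initial weight -> trained weight
let train (optimize: (float -> float) -> float -> float)
          (model: float -> float -> float)           // weight -> input -> output
          (data: (float * float) list) w0 =
    let loss w = data |> List.sumBy (fun (x, y) -> let e = model w x - y in e * e)
    optimize loss w0

// Any gradient-based optimizer fits; a crude stand-in for illustration:
let gd loss w0 =
    let grad f x = let e = 1e-6 in (f (x + e) - f (x - e)) / (2.0 * e)
    let mutable w = w0
    for _ in 1 .. 1000 do w <- w - 0.01 * grad loss w
    w

let w = train gd (fun w x -> w * x) [ (1.0, 2.0); (2.0, 4.0) ] 0.0   // w ≈ 2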


SLIDE 57

Hype

The user doesn’t need to think about derivatives; they are instantiated within the optimization code



SLIDES 58-59

Hype

But they can use derivatives within their models, if needed
→ input sensitivities
→ complex objective functions
→ adaptive PID controllers
→ integrating differential equations

Thanks to nested generalized AD, you can optimize components that are internally using differentiation; the resulting higher-order derivatives propagate via forward/reverse AD as needed



SLIDES 60-61

Hype

We also provide a Torch-like API for neural networks

A cool thing: thanks to AD, we can freely code any F# function as a layer, and it just works (an illustrative sketch follows)

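For instance (an illustrative sketch, not Hype’s actual Layer API): any plain F# function from parameters and inputs to outputs can play the role of a layer, because AD differentiates straight through whatever it computes.

// An arbitrary F# function acting as a “layer”.
let funkyLayer (w: float []) (x: float []) =
    Array.map2 (fun wi xi -> tanh (wi * xi) + sin xi) w x

let y = funkyLayer [| 0.1; 0.2 |] [| 1.0; 2.0 |]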

SLIDE 62

Hype

http://hypelib.github.io/Hype/feedforwardnets.html

We also have some nice additions for F# Interactive


SLIDE 63

Roadmap

• Transformation-based, context-aware AD: F# quotations (Syme, 2006) give us a direct path for deeply embedding AD
• Currently experimenting with GPU backends (CUDA, ArrayFire, Magma)
• Generalizing to tensors (for elegant implementations of, e.g., ConvNets)


SLIDE 64

Roadmap

I would like to see this work integrated with tools in other languages (C++, Python) and frameworks (Torch, CNTK)


SLIDE 65

Conclusion

SLIDE 66

Conclusion

An exciting research area at the intersection of programming languages, functional programming, and machine learning


SLIDE 67

Beyond deep learning

Applications in probabilistic programming

(Wingate, Goodman, Stuhlmüller, Siskind. “Nonstandard interpretations of probabilistic programs for efficient inference.” 2011)

• Hamiltonian Monte Carlo http://diffsharp.github.io/DiffSharp/examples-hamiltonianmontecarlo.html (sketch below)
• No-U-Turn sampler
• Gradient-based maximum a posteriori estimates

For example, Stan is built on AD: http://mc-stan.org/ (Carpenter et al., 2015)
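To sketch the AD connection (illustrative; the gradient here is a numeric stand-in where the linked example uses DiffSharp), HMC’s leapfrog integrator needs the gradient of the log-density at every step:

// One leapfrog step for HMC; gradLogp supplies ∇ log p, which AD provides.
let leapfrog (gradLogp: float -> float) eps (x0, p0) =
    let p1 = p0 + 0.5 * eps * gradLogp x0     // half step on momentum
    let x1 = x0 + eps * p1                    // full step on position
    let p2 = p1 + 0.5 * eps * gradLogp x1     // half step on momentum
    (x1, p2)

// Numeric stand-in for a standard normal log-density gradient.
let logp x = -0.5 * x * x
let gradLogp x = let e = 1e-6 in (logp (x + e) - logp (x - e)) / (2.0 * e)
let (x1, p1) = leapfrog gradLogp 0.1 (0.0, 1.0)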


SLIDE 68

Other areas

Any work in AD remains applicable to the traditional application domains of AD in industry and academia (Corliss et al., 2002):
• Computational fluid dynamics
• Atmospheric chemistry
• Engineering design optimization
• Computational finance


SLIDE 69

Thank You!

References

• Baydin AG, Pearlmutter BA, Radul AA, Siskind JM (submitted) Automatic differentiation in machine learning: a survey [arXiv:1502.05767]
• Baydin AG, Pearlmutter BA, Siskind JM (submitted) DiffSharp: automatic differentiation library [arXiv:1511.07727]
• Bengio Y (2013) Deep learning of representations: looking forward. Statistical Language and Speech Processing. LNCS 7978:1–37 [arXiv:1404.7456]
• Graves A, Wayne G, Danihelka I (2014) Neural Turing machines [arXiv:1410.5401]
• Grefenstette E, Hermann KM, Suleyman M, Blunsom P (2015) Learning to transduce with unbounded memory [arXiv:1506.02516]
• Griewank A, Walther A (2008) Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation. Society for Industrial and Applied Mathematics, Philadelphia [DOI 10.1137/1.9780898717761]
• He K, Zhang X, Ren S, Sun J (2015) Deep residual learning for image recognition [arXiv:1512.03385]
• Joulin A, Mikolov T (2015) Inferring algorithmic patterns with stack-augmented recurrent nets [arXiv:1503.01007]
• Maclaurin D, Duvenaud D, Adams RP (2015) Gradient-based hyperparameter optimization through reversible learning [arXiv:1502.03492]
• Manzyuk O, Pearlmutter BA, Radul AA, Rush DR, Siskind JM (2012) Confusion of tagged perturbations in forward automatic differentiation of higher-order functions [arXiv:1211.4892]
• Pearlmutter BA, Siskind JM (2008) Reverse-mode AD in a functional framework: Lambda the ultimate backpropagator. ACM TOPLAS 30(2):7 [DOI 10.1145/1330017.1330018]
• Siskind JM, Pearlmutter BA (2008) Nesting forward-mode AD in a functional framework. Higher-Order and Symbolic Computation 21(4):361–76 [DOI 10.1007/s10990-008-9037-1]
• Sukhbaatar S, Szlam A, Weston J, Fergus R (2015) Weakly supervised memory networks [arXiv:1503.08895]
• Syme D (2006) Leveraging .NET meta-programming components from F#: integrated queries and interoperable heterogeneous execution. 2006 Workshop on ML. ACM
• Vinyals O, Toshev A, Bengio S, Erhan D (2014) Show and tell: a neural image caption generator [arXiv:1411.4555]
• Wengert RE (1964) A simple automatic derivative evaluation program. Communications of the ACM 7(8):463–4
• Zaremba W, Sutskever I (2015) Reinforcement learning neural Turing machines [arXiv:1505.00521]