SLIDE 1

CSC421/2516 Lecture 3: Automatic Differentiation & Distributed Representations

Jimmy Ba

SLIDE 2

Overview

Lecture 2 covered the algebraic view of backprop. This lecture focuses on how to implement an automatic differentiation library:

build the computation graph
vector-Jacobian products (VJP) for primitive ops
the backwards pass

We’ll cover Autograd, a lightweight autodiff tool. PyTorch’s implementation is very similar.

You will probably never have to implement autodiff yourself, but it is good to know its inner workings.

SLIDE 3

Confusing Terminology

Automatic differentiation (autodiff) refers to a general way of taking a program which computes a value, and automatically constructing a procedure for computing derivatives of that value. Backpropagation is the special case of autodiff applied to neural nets.

But in machine learning, we often use backprop synonymously with autodiff.

Autograd is the name of a particular autodiff library we will cover in this lecture. There are many others, e.g. PyTorch, TensorFlow.

SLIDE 4

What Autodiff Is Not: Finite Differences

We often use finite differences to check our gradient calculations. One-sided version:

∂/∂x_i f(x_1, …, x_N) ≈ [ f(x_1, …, x_i + h, …, x_N) − f(x_1, …, x_i, …, x_N) ] / h

Two-sided version:

∂/∂x_i f(x_1, …, x_N) ≈ [ f(x_1, …, x_i + h, …, x_N) − f(x_1, …, x_i − h, …, x_N) ] / (2h)

SLIDE 5

What Autodiff Is Not: Finite Differences

Autodiff is not finite differences.

Finite differences are expensive, since you need to do a forward pass for each derivative. It also induces huge numerical error. Normally, we only use it for testing.
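As a concrete illustration (not part of Autograd or the slides), here is a minimal gradient-checking sketch using the two-sided formula; the function check_grad, its tolerance, and the test function are all made up for this example:

import numpy as np

def check_grad(f, grad_f, x, h=1e-5, tol=1e-4):
    """Compare an analytic gradient against two-sided finite differences."""
    x = np.asarray(x, dtype=float)
    fd = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e.flat[i] = h
        fd.flat[i] = (f(x + e) - f(x - e)) / (2 * h)   # two-sided estimate
    return np.max(np.abs(fd - grad_f(x))) < tol

# f(x) = 0.5 * ||x||^2 has gradient x, so the check should pass.
f = lambda x: 0.5 * np.sum(x ** 2)
grad_f = lambda x: x
print(check_grad(f, grad_f, np.array([1.0, -2.0, 3.0])))   # True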

Autodiff is both efficient (linear in the cost of computing the value) and numerically stable.

SLIDE 6

What Autodiff Is

An autodiff system will convert the program into a sequence of primitive operations (ops) which have specified routines for computing derivatives.

In this representation, backprop can be done in a completely mechanical way.

Original program:

z = wx + b
y = 1 / (1 + exp(−z))
L = ½ (y − t)²

Sequence of primitive operations:

t1 = wx
z = t1 + b
t3 = −z
t4 = exp(t3)
t5 = 1 + t4
y = 1 / t5
t6 = y − t
t7 = t6²
L = t7 / 2
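Written out as plain NumPy (a direct transcription of the decomposition above, with made-up input values), the forward pass is:

import numpy as np

def forward(w, x, b, t):
    t1 = w * x
    z = t1 + b
    t3 = -z
    t4 = np.exp(t3)
    t5 = 1 + t4
    y = 1 / t5            # y = sigmoid(w*x + b)
    t6 = y - t
    t7 = t6 ** 2
    L = t7 / 2            # L = 0.5 * (y - t)^2
    return L

print(forward(w=2.0, x=1.0, b=-1.0, t=0.0))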

SLIDE 7

What Autodiff Is

SLIDE 8

Autograd

The rest of this lecture covers how Autograd is implemented. Source code for the original Autograd package:

https://github.com/HIPS/autograd

Autodidact, a pedagogical implementation of Autograd — you are encouraged to read the code.

https://github.com/mattjj/autodidact

Thanks to Matt Johnson for providing this!

SLIDE 9

Building the Computation Graph

Most autodiff systems, including Autograd, explicitly construct the computation graph.

Some frameworks like TensorFlow provide mini-languages for building computation graphs directly. Disadvantage: need to learn a totally new API. Autograd instead builds them by tracing the forward pass computation, allowing for an interface nearly indistinguishable from NumPy.

The Node class (defined in tracer.py) represents a node of the computation graph. It has attributes:

value, the actual value computed on a particular set of inputs
fun, the primitive operation defining the node
args and kwargs, the arguments the op was called with
parents, the parent Nodes
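A simplified sketch of such a class (the attribute names follow the slide; the real tracer.py differs in detail):

class Node:
    """Sketch of a computation-graph node (cf. tracer.py; not the actual code)."""
    def __init__(self, value, fun, args, kwargs, parents):
        self.value = value                     # value computed on this set of inputs
        self.fun = fun                         # primitive op that produced this node
        self.args, self.kwargs = args, kwargs  # arguments the op was called with
        self.parents = parents                 # parent Nodes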

SLIDE 10

Building the Computation Graph

Autograd’s fake NumPy module provides primitive ops which look and feel like NumPy functions, but secretly build the computation graph. They wrap around NumPy functions:
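A hypothetical sketch of such a wrapper, reusing the Node sketch from Slide 9 (the name primitive and the tracing logic here are simplified assumptions, not Autograd's actual code):

import numpy as onp   # the ordinary NumPy being wrapped

def primitive(fun):
    def wrapped(*args, **kwargs):
        parents = [a for a in args if isinstance(a, Node)]
        if not parents:                      # no traced inputs: plain NumPy call
            return fun(*args, **kwargs)
        argvals = [a.value if isinstance(a, Node) else a for a in args]
        value = fun(*argvals, **kwargs)      # compute the actual value
        return Node(value, fun, argvals, kwargs, parents)   # and record a graph node
    return wrapped

exp = primitive(onp.exp)              # looks and feels like np.exp, but builds the graph
multiply = primitive(onp.multiply)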

SLIDE 11

Building the Computation Graph

Example:

SLIDE 12

Recap: Vector-Jacobian Products

Recall: the Jacobian is the matrix of partial derivatives:

J = ∂y/∂x = [ ∂y_1/∂x_1  ⋯  ∂y_1/∂x_n ]
            [     ⋮      ⋱      ⋮     ]
            [ ∂y_m/∂x_1  ⋯  ∂y_m/∂x_n ]

The backprop equation (single child node) can be written as a vector-Jacobian product (VJP):

x̄_j = Σ_i ȳ_i ∂y_i/∂x_j        x̄ = ȳ⊤J

That gives a row vector. We can treat it as a column vector by taking

x̄ = J⊤ȳ

SLIDE 13

Recap: Vector-Jacobian Products

Examples

Matrix-vector product:

z = Wx        J = W        x̄ = W⊤z̄

Elementwise operations:

y = exp(z)        J = diag(exp(z))        z̄ = exp(z) ∘ ȳ

Note: we never explicitly construct the Jacobian. It’s usually simpler and more efficient to compute the VJP directly.
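A small numerical illustration of these two VJPs (toy values chosen arbitrarily):

import numpy as np

# Matrix-vector product z = Wx: the VJP with respect to x is x_bar = W.T @ z_bar.
W = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
x = np.array([0.5, -1.0])
z_bar = np.ones(3)                 # incoming error signal
x_bar = W.T @ z_bar                # no need to form the Jacobian explicitly

# Elementwise y = exp(z): the VJP is z_bar = exp(z) * y_bar.
z = np.array([0.0, 1.0, -2.0])
y_bar = np.ones(3)
z_bar_elem = np.exp(z) * y_bar
print(x_bar, z_bar_elem)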

SLIDE 14

Vector-Jacobian Products

For each primitive operation, we must specify VJPs for each of its arguments. Consider y = exp(x).

This is a function which takes in the output gradient (i.e. ȳ), the answer (y), and the arguments (x), and returns the input gradient (x̄).

defvjp (defined in core.py) is a convenience routine for registering VJPs. It just adds them to a dict.

Examples are given in numpy/numpy_vjps.py.
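In the same spirit as numpy/numpy_vjps.py, a hedged sketch of how such VJPs could be registered (the dict name and the exact lambda signatures are simplifications, not the library's verbatim code):

import numpy as np

primitive_vjps = {}

def defvjp(fun, *vjps):
    """Register one VJP per argument of a primitive op."""
    primitive_vjps[fun] = dict(enumerate(vjps))

# Each VJP maps (output gradient g, answer ans, inputs...) to an input gradient.
defvjp(np.exp, lambda g, ans, x: g * ans)              # d/dx exp(x) = exp(x) = ans
defvjp(np.multiply, lambda g, ans, x, y: g * y,        # d/dx (x*y) = y
                    lambda g, ans, x, y: g * x)        # d/dy (x*y) = x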

SLIDE 15

Backprop as Message Passing

Consider a naïve backprop implementation where the z module needs to compute z̄ using the formula:

z̄ = (∂r/∂z) r̄ + (∂s/∂z) s̄ + (∂t/∂z) t̄

This breaks modularity, since z needs to know how it’s used in the network in order to compute partial derivatives of r, s, and t.

SLIDE 16

Backprop as Message Passing

Backprop as message passing: each node receives a bunch of messages from its children, which it aggregates to get its error signal. It then passes messages to its parents.

Each of these messages is a VJP.

This formulation provides modularity: each node needs to know how to compute its outgoing messages, i.e. the VJPs corresponding to each of its parents (arguments to the function). The implementation of z̄ doesn’t need to know where z̄ came from.

SLIDE 17

Backward Pass

The backwards pass is defined in core.py. The argument g is the error signal for the end node; for us this is always L̄ = 1.
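A simplified sketch of what such a backward pass could look like, building on the Node, primitive and defvjp sketches above (the real core.py differs; toposort here is an assumed helper, not the library's function):

def toposort(end_node):
    """Visit nodes reachable from end_node, children before parents."""
    order, visited = [], set()
    def visit(node):
        if id(node) not in visited:
            visited.add(id(node))
            for p in node.parents:
                visit(p)
            order.append(node)      # post-order: parents appended first
    visit(end_node)
    return reversed(order)          # reversed: children before parents

def backward_pass(g, end_node):
    outgrads = {end_node: g}                             # error signal at the output
    for node in toposort(end_node):
        g = outgrads.pop(node)                           # aggregated messages from children
        for argnum, parent in enumerate(node.parents):   # assumes every argument is a Node
            vjp = primitive_vjps[node.fun][argnum]       # registered VJP for this input
            parent_grad = vjp(g, node.value, *node.args) # message passed to the parent
            outgrads[parent] = outgrads.get(parent, 0) + parent_grad
    return g                                             # gradient at the input node (visited last)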

SLIDE 18

Backward Pass

grad (in differential_operators.py) is just a wrapper around make_vjp (in core.py), which builds the computation graph and feeds it to backward_pass.

grad itself is viewed as a VJP, if we treat L̄ as the 1 × 1 matrix with entry 1:

∂L/∂w = (∂L/∂w) L̄
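A sketch of how grad could be written on top of a make_vjp helper and the backward_pass sketch above (a hypothetical simplification; the actual files differ):

import numpy as np

def make_vjp(fun, argnum=0):
    def vjp_maker(*args, **kwargs):
        args = list(args)
        args[argnum] = Node(args[argnum], None, (), {}, [])   # box the chosen input
        end = fun(*args, **kwargs)                            # traced forward pass
        return (lambda g: backward_pass(g, end)), end.value
    return vjp_maker

def grad(fun, argnum=0):
    def gradfun(*args, **kwargs):
        vjp, ans = make_vjp(fun, argnum)(*args, **kwargs)
        return vjp(np.ones_like(ans))             # seed the backward pass with 1
    return gradfun

print(grad(exp)(2.0))   # derivative of exp at 2.0, i.e. exp(2.0), using the sketches above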

SLIDE 19

Recap

We saw three main parts to the code:

tracing the forward pass to build the computation graph
vector-Jacobian products for primitive ops
the backwards pass

Building the computation graph requires fancy NumPy gymnastics, but the other two items are basically what I showed you. You’re encouraged to read the full code (< 200 lines!) at:

https://github.com/mattjj/autodidact/tree/master/autograd

SLIDE 20

Learning to learn by gradient descent by gradient descent

https://arxiv.org/pdf/1606.04474.pdf

SLIDE 21

Gradient-Based Hyperparameter Optimization

https://arxiv.org/abs/1502.03492

SLIDE 22

After the break

After the break: Distributed Representations

SLIDE 23

Overview

Let’s now take a break from backpropagation and see a real example of a neural net that learns feature representations of words.

We’ll see a lot more neural net architectures later in the course.

We’ll also introduce the models used in Programming Assignment 1.

SLIDE 24-25

Review: Probability and Bayes’ Rule

Suppose we want to build a speech recognition system. We’d like to be able to infer a likely sentence s given the observed speech signal a. The generative approach is to build two components:

An observation model, represented as p(a | s), which tells us how likely the sentence s is to lead to the acoustic signal a.

A prior, represented as p(s), which tells us how likely a given sentence s is. E.g., it should know that “recognize speech” is more likely than “wreck a nice beach.”

Given these components, we can use Bayes’ Rule to infer a posterior distribution over sentences given the speech signal:

p(s | a) = p(s) p(a | s) / Σ_{s′} p(s′) p(a | s′)

SLIDE 26

Language Modeling

From here on, we will focus on learning a good distribution p(s) of sentences. This problem is known as language modeling.

Assume we have a corpus of sentences s^(1), …, s^(N). The maximum likelihood criterion says we want our model to maximize the probability our model assigns to the observed sentences. We assume the sentences are independent, so that their probabilities multiply:

max ∏_{i=1}^{N} p(s^(i))

SLIDE 27

Language Modeling

In maximum likelihood training, we want to maximize ∏_{i=1}^{N} p(s^(i)).

The probability of generating the whole training corpus is vanishingly small, like monkeys typing all of Shakespeare.

The log probability is something we can work with more easily. It also conveniently decomposes as a sum:

log ∏_{i=1}^{N} p(s^(i)) = Σ_{i=1}^{N} log p(s^(i))

This is equivalent to the cross-entropy loss.
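A quick numerical illustration (toy numbers, not from the slides) of why we work with log probabilities:

import numpy as np

probs = np.full(10000, 0.1)        # pretend each of 10,000 sentences has probability 0.1
print(np.prod(probs))              # 0.0: the product underflows to zero
print(np.sum(np.log(probs)))       # about -23025.9: the log-probability is easy to represent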

SLIDE 28-29

Language Modeling

Probability of a sentence? What does that even mean?

A sentence is a sequence of words w_1, w_2, …, w_T. Using the chain rule of conditional probability, we can decompose the probability as

p(s) = p(w_1, …, w_T) = p(w_1) p(w_2 | w_1) ⋯ p(w_T | w_1, …, w_{T−1})

Therefore, the language modeling problem is equivalent to being able to predict the next word!

We typically make a Markov assumption, i.e. that the distribution over the next word only depends on the preceding few words. I.e., if we use a context of length 3,

p(w_t | w_1, …, w_{t−1}) = p(w_t | w_{t−3}, w_{t−2}, w_{t−1})

Such a model is called memoryless. Now it’s basically a supervised prediction problem: we need to predict the conditional distribution of each word given the previous K. When we decompose it into separate prediction problems this way, it’s called an autoregressive model.

SLIDE 30

N-Gram Language Models

One sort of Markov model we can learn uses a conditional probability table, i.e.

                cat       and       city     ⋯
the fat         0.21      0.003     0.01
four score      0.0001    0.55      0.0001   ⋯
New York        0.002     0.0001    0.48
  ⋮

Maybe the simplest way to estimate the probabilities is from the empirical distribution:

p(w_3 = cat | w_1 = the, w_2 = fat) = p(w_1 = the, w_2 = fat, w_3 = cat) / p(w_1 = the, w_2 = fat)
                                    ≈ count(the fat cat) / count(the fat)

The phrases we’re counting are called n-grams (where n is the length), so this is an n-gram language model. Note: the above example is considered a 3-gram model, not a 2-gram model!
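A toy sketch of this counting estimate on a made-up eight-word corpus (the function and variable names are invented for illustration):

from collections import Counter

corpus = "the fat cat sat on the fat mat".split()
trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
bigrams = Counter(zip(corpus, corpus[1:]))

def p_next(w1, w2, w3):
    """Empirical estimate of p(w3 | w1, w2) = count(w1 w2 w3) / count(w1 w2)."""
    return trigrams[(w1, w2, w3)] / bigrams[(w1, w2)]

print(p_next("the", "fat", "cat"))   # 0.5: "the fat" occurs twice, once followed by "cat"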

SLIDE 31

N-Gram Language Models

Shakespeare:

Jurafsky and Martin, Speech and Language Processing
SLIDE 32

N-Gram Language Models

Wall Street Journal:

Jurafsky and Martin, Speech and Language Processing

SLIDE 33-36

N-Gram Language Models

Problems with n-gram language models:

The number of entries in the conditional probability table is exponential in the context length.
Data sparsity: most n-grams never appear in the corpus, even if they are possible.

Traditional ways to deal with data sparsity:

Use a short context (but this means the model is less powerful).
Smooth the probabilities, e.g. by adding imaginary counts.
Make predictions using an ensemble of n-gram models with different n.
SLIDE 37

Distributed Representations

Conditional probability tables are a kind of localist representation: all the information about a particular word is stored in one place, i.e. a column of the table. But different words are related, so we ought to be able to share information between them. For instance, consider this matrix of word attributes:

              academic   politics   plural   person   building
students          1                   1        1
colleges          1                   1                   1
legislators                  1        1        1
schoolhouse       1                                        1

And this matrix of how each attribute influences the next word:

              bill    is    are    papers    built    standing
academic       −                      +
politics       +                      −
plural                 −      +
person                 +
building                                        +         +

SLIDE 38

Imagine these matrices are layers in an MLP. (One-hot representations of words, softmax over next word.)

Here, the information about a given word is distributed throughout the representation. We call this a distributed representation.

In general, when we train an MLP with backprop, the hidden units won’t have intuitive meanings like in this cartoon. But this is a useful intuition pump for what MLPs can represent.

SLIDE 39

Distributed Representations

We would like to be able to share information between related words. E.g., suppose we’ve seen the sentence The cat got squashed in the garden on Friday. This should help us predict the words in the sentence The dog got flattened in the yard on Monday. An n-gram model can’t generalize this way, but a distributed representation might let us do so.

SLIDE 40

Neural Language Model

Predicting the distribution of the next word given the previous K is just a multiway classification problem.

Inputs: previous K words
Target: next word
Loss: cross-entropy. Recall that this is equivalent to maximum likelihood:

−log p(s) = −log ∏_{t=1}^{T} p(w_t | w_1, …, w_{t−1})
          = −Σ_{t=1}^{T} log p(w_t | w_1, …, w_{t−1})
          = −Σ_{t=1}^{T} Σ_{v=1}^{V} t_{tv} log y_{tv}

where t_{tv} is the one-hot encoding of the t-th target word and y_{tv} is the predicted probability of the t-th word being index v.
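Because the targets are one-hot, the inner sum just picks out the log-probability of the correct word; a minimal NumPy sketch (array names assumed for illustration):

import numpy as np

def neg_log_likelihood(Y, targets):
    """-sum_t log y_{t, target_t} for predicted probabilities Y of shape (T, V)."""
    T = Y.shape[0]
    return -np.sum(np.log(Y[np.arange(T), targets]))

Y = np.array([[0.7, 0.2, 0.1],     # predicted next-word distributions for T = 2 steps
              [0.1, 0.8, 0.1]])
print(neg_log_likelihood(Y, np.array([0, 1])))   # -log 0.7 - log 0.8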

SLIDE 41

Bengio’s Neural Language Model

Here is a classic neural probabilistic language model, or just neural language model:

[Architecture diagram: table look-ups map the indices of the words at t−2 and t−1 to learned distributed encodings; these feed units that learn to predict the output word from features of the input words, with skip-layer connections into “softmax” units (one per possible next word).]

http://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf

SLIDE 42

Neural Language Model

If we use a 1-of-K encoding for the words, the first layer can be thought of as a linear layer with tied weights. The weight matrix basically acts like a lookup table. Each column is the representation of a word, also called an embedding, feature vector, or encoding.

“Embedding” emphasizes that it’s a location in a high-dimensional space; words that are closer together are more semantically similar.

“Feature vector” emphasizes that it’s a vector that can be used for making predictions, just like other feature mappings we’ve looked at (e.g. polynomials).
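A tiny sketch of the lookup-table view (toy sizes; here the embeddings are stored as rows of R):

import numpy as np

V, D = 5, 3                      # vocabulary size and embedding dimension (toy values)
R = np.random.randn(V, D)        # embedding matrix: one row per word

word = 2
one_hot = np.eye(V)[word]
print(np.allclose(one_hot @ R, R[word]))   # True: multiplying by a one-hot vector is a table lookup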

SLIDE 43

Neural Language Model

We can measure the similarity or dissimilarity of two words using:

the dot product r_1⊤r_2
the Euclidean distance ‖r_1 − r_2‖

If the vectors have unit norm, the two are equivalent:

‖r_1 − r_2‖² = (r_1 − r_2)⊤(r_1 − r_2) = r_1⊤r_1 − 2 r_1⊤r_2 + r_2⊤r_2 = 2 − 2 r_1⊤r_2

In this case, the dot product is called cosine similarity.
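A quick check of this identity with random unit vectors (a toy illustration, not from the slides):

import numpy as np

r1 = np.random.randn(30); r1 /= np.linalg.norm(r1)
r2 = np.random.randn(30); r2 /= np.linalg.norm(r2)

cosine_similarity = r1 @ r2 / (np.linalg.norm(r1) * np.linalg.norm(r2))
print(np.allclose(np.sum((r1 - r2) ** 2), 2 - 2 * r1 @ r2))   # True for unit vectors
print(np.allclose(cosine_similarity, r1 @ r2))                # unit norm, so these agree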

SLIDE 44

Neural Language Model

This model is very compact: the number of parameters is linear in the context size, compared with exponential for n-gram models.

[Same architecture diagram as Slide 41: table look-up embeddings of the previous words feed hidden units that predict the next word via a softmax, with skip-layer connections.]
SLIDE 45

Neural Language Model

What do these word embeddings look like?

It’s hard to visualize an n-dimensional space, but there are algorithms for mapping the embeddings to two dimensions. The following 2-D embeddings are done with an algorithm called t-SNE, which tries to make distances in the 2-D embedding match the original 30-D distances as closely as possible.

Note: the visualizations are from a slightly different model.

SLIDE 46-48

Neural Language Model

[2-D t-SNE visualizations of the learned word embeddings.]
SLIDE 49

Neural Language Model

Thinking about high-dimensional embeddings:

Most vectors are nearly orthogonal (i.e. their dot product is close to 0).
Most points are far away from each other.
“In a 30-dimensional grocery store, anchovies can be next to fish and next to pizza toppings.” – Geoff Hinton

The 2-D embeddings might be fairly misleading, since they can’t preserve the distance relationships from a higher-dimensional embedding. (I.e., unrelated words might be close together in 2-D, but far apart in 30-D.)

SLIDE 50

GloVe

Fitting language models is really hard:

It’s really important to make good predictions about relative probabilities of rare words.
Computing the predictive distribution requires a large softmax.

Maybe this is overkill if all you want is word representations. Global Vectors (GloVe) embeddings are a simpler and faster approach based on a matrix factorization similar to principal component analysis (PCA).

First fit the distributed word representations using GloVe, then plug them into a neural net that does some other task (e.g. language modeling, translation).

SLIDE 51

GloVe

Distributional hypothesis: words with similar distributions have similar meanings (“judge a word by the company it keeps”).

Consider a co-occurrence matrix X, which counts the number of times two words appear nearby (say, less than 5 positions apart). This is a V × V matrix, where V is the vocabulary size (very large).

Intuition pump: suppose we fit a rank-K approximation X ≈ RR̃⊤, where R and R̃ are V × K matrices.

Each row r_i of R is the K-dimensional representation of a word.
Each entry is approximated as x_{ij} ≈ r_i⊤r̃_j.
Hence, more similar words are more likely to co-occur.
Minimizing the squared Frobenius norm ‖X − RR̃⊤‖²_F = Σ_{i,j} (x_{ij} − r_i⊤r̃_j)² is basically PCA.
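A toy sketch of this intuition pump using a truncated SVD on made-up counts (this is not how GloVe is actually fit; the variable names are assumptions):

import numpy as np

V, K = 6, 2
X = np.random.rand(V, V) * 10          # made-up co-occurrence counts

U, S, Vt = np.linalg.svd(X)            # rank-K approximation X ~= R @ R_tilde.T
R = U[:, :K] * np.sqrt(S[:K])
R_tilde = Vt[:K].T * np.sqrt(S[:K])
print(np.linalg.norm(X - R @ R_tilde.T, "fro"))   # error of the best rank-K approximation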

SLIDE 52-57

GloVe

Problem 1: X is extremely large, so fitting the above factorization using least squares is infeasible.
Solution: Reweight the entries so that only nonzero counts matter.

Problem 2: Word counts follow a heavy-tailed distribution, so the most common words will dominate the cost function.
Solution: Approximate log x_{ij} instead of x_{ij}.

Global Vectors (GloVe) embedding cost function:

J(R) = Σ_{i,j} f(x_{ij}) (r_i⊤r̃_j + b_i + b̃_j − log x_{ij})²

f(x_{ij}) = (x_{ij} / 100)^{3/4}   if x_{ij} < 100
            1                      if x_{ij} ≥ 100

b_i and b̃_j are bias parameters.

We can avoid computing log 0 since f(0) = 0, so we only need to consider the nonzero entries of X. This gives a big computational savings, since X is extremely sparse!
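A hedged sketch of this cost function in NumPy (the function names and the dense toy matrix are assumptions for illustration; real implementations work with sparse counts):

import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss(X, R, R_tilde, b, b_tilde):
    """Sum over nonzero (i, j) of f(x_ij) * (r_i . r~_j + b_i + b~_j - log x_ij)^2."""
    i, j = X.nonzero()                                   # only nonzero counts contribute
    pred = np.sum(R[i] * R_tilde[j], axis=1) + b[i] + b_tilde[j]
    return np.sum(glove_weight(X[i, j]) * (pred - np.log(X[i, j])) ** 2)

V, K = 4, 2
X = np.array([[0., 2., 1., 0.],
              [2., 0., 5., 0.],
              [1., 5., 0., 3.],
              [0., 0., 3., 0.]])
R, R_tilde = np.random.randn(V, K), np.random.randn(V, K)
print(glove_loss(X, R, R_tilde, np.zeros(V), np.zeros(V)))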

SLIDE 58

Word Analogies

Here’s a linear projection of word representations for cities and capitals into 2 dimensions. The mapping city → capital corresponds roughly to a single direction in the vector space.

Note: this figure actually comes from skip-grams, a predecessor to GloVe.

SLIDE 59

Word Analogies

In other words, vector(Paris) − vector(France) ≈ vector(London) − vector(England).

This means we can do analogies by doing arithmetic on word vectors:

e.g. “Paris is to France as London is to ___”
Find the word whose vector is closest to vector(France) − vector(Paris) + vector(London).
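A toy sketch of this nearest-neighbour lookup (the 2-D vectors below are chosen by hand so the analogy works; real embeddings are learned and much higher-dimensional):

import numpy as np

emb = {"Paris": np.array([1.0, 3.0]), "France": np.array([1.0, 1.0]),
       "London": np.array([4.0, 3.0]), "England": np.array([4.0, 1.0]),
       "cat": np.array([-2.0, 0.0])}

query = emb["France"] - emb["Paris"] + emb["London"]
best = min((w for w in emb if w != "London"),
           key=lambda w: np.linalg.norm(emb[w] - query))
print(best)   # England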

Example analogies:
