

SLIDE 1

Backpropagation and Gradients

SLIDE 2

Agenda

  • Motivation
  • Backprop Tips & Tricks
  • Matrix calculus primer
  • Example: 2-layer Neural Network
SLIDE 3

Motivation

Recall: the optimization objective is to minimize the loss.

Goal: how should we tweak the parameters to decrease the loss slightly?

Plotted on WolframAlpha

SLIDE 4

Approach #1: Random search

Intuition: the way we tweak the parameters is the direction we step in during optimization. What if we randomly choose a direction?

SLIDE 5

Approach #2: Numerical gradient

Intuition: the gradient describes the rate of change of a function with respect to a variable within an infinitesimally small region around the current point.

Finite differences: approximate each partial derivative by evaluating the function at slightly perturbed inputs.

Challenge: how do we compute the gradient independently of each input?
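A minimal numpy sketch of the finite-difference idea; the function f, the input x, and the step size h are placeholders chosen for illustration, and a centered difference is used here (the slide's exact formula did not survive extraction). Note that each input must be perturbed separately:

import numpy as np

def numerical_gradient(f, x, h=1e-5):
    # approximate df/dx at x by perturbing one input at a time
    grad = np.zeros_like(x)
    for i in range(x.size):
        old = x.flat[i]
        x.flat[i] = old + h
        f_plus = f(x)
        x.flat[i] = old - h
        f_minus = f(x)
        x.flat[i] = old                              # restore the input
        grad.flat[i] = (f_plus - f_minus) / (2 * h)
    return grad

# example: the gradient of f(x) = sum(x**2) is 2x
x = np.array([1.0, -2.0, 3.0])
print(numerical_gradient(lambda v: np.sum(v**2), x))   # approx. [ 2. -4.  6.]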

SLIDE 6

Approach #3: Analytical gradient

Recall: the chain rule. Assuming we know the structure of the computational graph beforehand…

Intuition: upstream gradient values propagate backwards -- we can reuse them!

SLIDE 7

What about autograd?

  • Deep learning frameworks can automatically perform backprop!
  • Problems related to the underlying gradients might surface when debugging your models; see “Yes You Should Understand Backprop”:

https://medium.com/@karpathy/yes-you-should-understand-backprop-e2f06eab496b

SLIDE 8

Problem Statement

Given a function f with inputs x, labels y, and parameters θ, compute the gradient of the loss with respect to the parameters θ.

SLIDE 9

Backpropagation

An algorithm for computing the gradient of a compound function as a series of local, intermediate gradients

SLIDE 10

Backpropagation

1. Identify intermediate functions (forward prop)
2. Compute local gradients
3. Combine with the upstream error signal to get the full gradient

SLIDE 11

Modularity - Simple Example

Compound function, decomposed into intermediate variables (forward propagation)
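As an illustrative sketch only (the slide's actual formulas did not survive extraction), here is a simple compound function f(x, y, z) = (x + y) * z broken into intermediate variables, following the three steps above: identify intermediate functions, compute local gradients, and combine them with the upstream signal:

# forward propagation: decompose f(x, y, z) = (x + y) * z
x, y, z = -2.0, 5.0, -4.0
q = x + y              # intermediate variable
f = q * z

# backward propagation: local gradients combined with the upstream signal
df_dq = z              # local gradient of f = q * z w.r.t. q
df_dz = q              # local gradient of f = q * z w.r.t. z
df_dx = df_dq * 1.0    # chain rule: df/dq * dq/dx
df_dy = df_dq * 1.0    # chain rule: df/dq * dq/dy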

SLIDE 12

Modularity - Neural Network Example

The neural network as a compound function, decomposed into intermediate variables (forward propagation)
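For concreteness, the 2-layer network used at the end of this deck (ReLU hidden layer, squared loss) decomposes into these intermediate variables:

\[
z_1 = X W_1, \qquad h_1 = \max(z_1, 0), \qquad \hat{y} = h_1 W_2, \qquad L = \sum_{i,j} \hat{y}_{ij}^{\,2}
\]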

SLIDE 13

Intermediate Variables (forward propagation)

Intermediate Gradients (backward propagation)

SLIDE 14

Chain Rule Behavior

Key chain rule intuition: Slopes multiply
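In symbols, for a single path x → z → L (a standard statement of the chain rule):

\[
\frac{\partial L}{\partial x} = \frac{\partial L}{\partial z} \cdot \frac{\partial z}{\partial x}
\]

The upstream factor ∂L/∂z is computed once and reused for every input that feeds into z.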

SLIDE 15

Circuit Intuition

SLIDE 16

Matrix Calculus Primer

Scalar-by-vector and vector-by-vector derivatives
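The slide's formulas did not survive extraction; the standard definitions are that a scalar-by-vector derivative is the gradient and a vector-by-vector derivative is the Jacobian:

\[
\left(\frac{\partial y}{\partial \mathbf{x}}\right)_i = \frac{\partial y}{\partial x_i},
\qquad
\left(\frac{\partial \mathbf{y}}{\partial \mathbf{x}}\right)_{ij} = \frac{\partial y_i}{\partial x_j}.
\]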

SLIDE 17

Matrix Calculus Primer

Vector-by-matrix and scalar-by-matrix derivatives
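Likewise, the standard scalar-by-matrix derivative collects one partial per matrix entry and has the same shape as the matrix:

\[
\left(\frac{\partial L}{\partial W}\right)_{ij} = \frac{\partial L}{\partial W_{ij}}.
\]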

SLIDE 18

Vector-by-Matrix Gradients


SLIDE 19

SLIDE 20

Backpropagation Shape Rule

When you take gradients against a scalar, the gradient at each intermediate step has the shape of the denominator. For example, if the loss L is a scalar and W has shape (n, m), then ∂L/∂W also has shape (n, m).

SLIDE 21

Dimension Balancing

SLIDE 22

Dimension Balancing

SLIDE 23

Dimension Balancing

Dimension balancing is the “cheap” but efficient approach to gradient calculations in most practical settings. Read the gradient computation notes to understand how to derive matrix expressions for gradients from first principles.
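A minimal numpy sketch of dimension balancing for a linear layer Y = XW; the shapes (N, D, M) and the random inputs are assumed purely for illustration:

import numpy as np

N, D, M = 4, 3, 2
X = np.random.randn(N, D)
W = np.random.randn(D, M)
Y = X.dot(W)                   # shape (N, M)

dY = np.random.randn(N, M)     # upstream gradient dL/dY, same shape as Y

# the only arrangements whose shapes match the denominators:
dW = X.T.dot(dY)               # (D, N)·(N, M) -> (D, M), matches W
dX = dY.dot(W.T)               # (N, M)·(M, D) -> (N, D), matches X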

SLIDE 24

Activation Function Gradients

The activation function is applied element-wise to each index of h (scalar-to-scalar). Formally, its Jacobian with respect to h is a diagonal matrix: the off-diagonal entries are zero because output i and input j have no dependence if i ≠ j.

SLIDE 25

Activation Function Gradients

Element-wise multiplication (Hadamard product) corresponds to a matrix product with a diagonal matrix.
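A quick numpy check of this equivalence (the vectors a and b are arbitrary stand-ins for an element-wise activation gradient and an upstream gradient):

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

# matrix product with diag(a) equals the Hadamard product a * b
assert np.allclose(np.diag(a).dot(b), a * b)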

SLIDE 26

Backprop Menu for Success

1. Write down the variable graph
2. Compute the derivative of the cost function
3. Keep track of error signals
4. Enforce the shape rule on error signals
5. Use dimension balancing when deriving over a linear transformation

SLIDE 27

Slide credit: Fei-Fei Li, Justin Johnson & Serena Yeung, Lecture 4, April 13, 2017

As promised: A matrix example...

What are the gradients of the loss with respect to the weights?

SLIDE 28

import numpy as np

# example inputs with assumed shapes (not specified in the extracted text)
X = np.random.randn(64, 100)
W_1 = np.random.randn(100, 50)
W_2 = np.random.randn(50, 10)

# forward prop
z_1 = np.dot(X, W_1)         # first linear layer
h_1 = np.maximum(z_1, 0)     # ReLU
y_hat = np.dot(h_1, W_2)     # second linear layer
L = np.sum(y_hat**2)         # squared loss

# backward prop
dy_hat = 2.0 * y_hat         # dL/dy_hat
dW2 = h_1.T.dot(dy_hat)      # dL/dW_2
dh1 = dy_hat.dot(W_2.T)      # dL/dh_1
dz1 = dh1.copy()
dz1[z_1 < 0] = 0             # ReLU gradient: zero where z_1 < 0
dW1 = X.T.dot(dz1)           # dL/dW_1
