SLIDE 1

Backpropagation

Slides credits: Barak Oshri, Vincent Chen, Nish Khandwala, Yi Wen

TA: Yi Wen

April 17, 2020 CS231n Discussion Section

SLIDE 2

Agenda

  • Motivation
  • Backprop Tips & Tricks
  • Matrix calculus primer
SLIDE 4

Motivation

Recall: the optimization objective is to minimize the loss.

SLIDE 5

Motivation

Recall: the optimization objective is to minimize the loss.

Goal: how should we tweak the parameters to decrease the loss?

SLIDE 6

Agenda

  • Motivation
  • Backprop Tips & Tricks
  • Matrix calculus primer
SLIDE 7

A Simple Example

Goal: tweak the parameters to minimize the loss => minimize a multivariable function in parameter space

SLIDE 8

A Simple Example

=> minimize a multivariable function

Plotted on WolframAlpha

SLIDE 9

Approach #1: Random Search

Intuition: take random steps in the domain of the function and keep whichever parameters give the lowest loss seen so far
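Not on the slide: a minimal sketch of random search over parameters, assuming a stand-in loss function f:

```python
import numpy as np

def f(w):
    return np.sum(w ** 2)  # stand-in loss; any scalar function of w works

best_w, best_loss = None, float("inf")
for _ in range(1000):
    w = np.random.randn(10) * 0.001   # try a random point in parameter space
    loss = f(w)
    if loss < best_loss:              # keep the best parameters seen so far
        best_loss, best_w = loss, w
```

This works, but it explores parameter space blindly; the next two approaches use the gradient to choose the step direction.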

SLIDE 10

Approach #2: Numerical Gradient

Intuition: the rate of change of a function with respect to a variable, measured over a small surrounding region

SLIDE 11

Approach #2: Numerical Gradient

Intuition: the rate of change of a function with respect to a variable, measured over a small surrounding region.

Finite differences:
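The formula itself is an image on the slide; a standard choice is the centered finite difference (often recommended for gradient checks), where e_i is the i-th standard basis vector:

```latex
\frac{\partial f}{\partial x_i} \approx \frac{f(\mathbf{x} + h\,\mathbf{e}_i) - f(\mathbf{x} - h\,\mathbf{e}_i)}{2h}, \qquad h \approx 10^{-5}
```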

SLIDE 12

Approach #3: Analytical Gradient

Recall: partial derivative by limit definition
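The slide's formula is an image; the standard limit definition it recalls is:

```latex
\frac{\partial f}{\partial x} = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}
```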

SLIDE 13

Approach #3: Analytical Gradient

Recall: chain rule
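As a refresher (standard statement, not transcribed from the slide): for a composition f(g(x)),

```latex
\frac{\partial f}{\partial x} = \frac{\partial f}{\partial g} \cdot \frac{\partial g}{\partial x}
```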

SLIDE 14

Approach #3: Analytical Gradient

Recall: chain rule. E.g.:
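The slide's example is an image; a representative example in the spirit of the lecture, assuming f(x, y, z) = (x + y)z with intermediate variable q = x + y:

```latex
q = x + y, \quad f = qz
\quad\Longrightarrow\quad
\frac{\partial f}{\partial x} = \frac{\partial f}{\partial q} \cdot \frac{\partial q}{\partial x} = z \cdot 1 = z
```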


SLIDE 16

Approach #3: Analytical Gradient

Recall: chain rule. Intuition: upstream gradient values propagate backwards, and we can reuse them!
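A sketch of that reuse for the example f(x, y, z) = (x + y)z from above (the input values are assumptions, chosen for illustration):

```python
x, y, z = -2.0, 5.0, -4.0

# forward pass: record intermediate values
q = x + y           # q = 3
f = q * z           # f = -12

# backward pass: upstream gradients flow back and are reused
df = 1.0            # gradient of f w.r.t. itself
dq = z * df         # local gradient (z) times upstream gradient
dz = q * df
dx = 1.0 * dq       # dq is computed once and reused for both dx and dy
dy = 1.0 * dq
```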

SLIDE 17

Gradient

The gradient is the “direction and rate of fastest increase.”

Numerical gradient vs. analytical gradient: the numerical gradient is easy to compute but approximate and slow; the analytical gradient is exact and fast but error-prone to derive. In practice, derive the gradient analytically and verify it numerically.
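Not on the slide: a minimal gradient check, assuming the example function f(x) = ||x||^2 so that the analytical gradient is 2x:

```python
import numpy as np

def f(x):
    return np.sum(x ** 2)   # example function; its analytical gradient is 2x

def numerical_gradient(f, x, h=1e-5):
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e.flat[i] = h
        grad.flat[i] = (f(x + e) - f(x - e)) / (2 * h)  # centered difference
    return grad

x = np.random.randn(5)
print(np.max(np.abs(numerical_gradient(f, x) - 2 * x)))  # should be tiny (~1e-10)
```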

SLIDE 18

What about Autograd?

  • Deep learning frameworks can automatically perform backprop!
  • Problems related to the underlying gradients can still surface when debugging your models; see Karpathy's “Yes You Should Understand Backprop”:

https://medium.com/@karpathy/yes-you-should-understand-backprop-e2f06eab496b
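Not on the slide: a minimal autograd sketch, assuming PyTorch as the framework:

```python
import torch

# Autograd records the computation graph during the forward pass...
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
w = torch.tensor([0.5, -1.0, 2.0], requires_grad=True)
loss = ((w * x).sum() - 1.0) ** 2   # scalar loss

# ...and backprop through the recorded graph is a single call
loss.backward()
print(x.grad)   # dloss/dx, computed automatically
print(w.grad)   # dloss/dw
```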

SLIDE 19

Problem Statement: Backpropagation

Given a function f of inputs x, labels y, and parameters θ, compute the gradient of the loss with respect to θ.

SLIDE 20

Problem Statement: Backpropagation

An algorithm for computing the gradient of a compound function as a series of local, intermediate gradients:

  1. Identify intermediate functions (forward prop)
  2. Compute local gradients (chain rule)
  3. Combine with the upstream error signal to get the full gradient

The slide's diagram shows one module: the forward pass maps inputs x (with parameters W, b) to an output y, local(x, W, b) => y, and the backward pass maps the upstream gradient dy to the local gradients, dx, dW, db <= grad_local(dy, x, W, b).
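A minimal NumPy sketch of this module interface, assuming the module is the affine map y = Wx + b (the slide does not pin down the module's contents here):

```python
import numpy as np

def local(x, W, b):
    """Forward pass of an affine module: y = W x + b."""
    return W @ x + b

def grad_local(dy, x, W, b):
    """Backward pass: map the upstream gradient dy to local gradients."""
    dx = W.T @ dy          # (d,)   gradient w.r.t. the input
    dW = np.outer(dy, x)   # (n, d) gradient w.r.t. the weights
    db = dy                # (n,)   gradient w.r.t. the bias
    return dx, dW, db

W, b, x = np.random.randn(3, 4), np.random.randn(3), np.random.randn(4)
y = local(x, W, b)                              # forward
dx, dW, db = grad_local(np.ones(3), x, W, b)    # backward, dummy upstream dy
```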

SLIDE 21

Modularity: Previous Example

Compound function => intermediate variables (forward propagation)

SLIDE 22

Modularity: 2-Layer Neural Network

Compound function => intermediate variables (forward propagation)

The loss is the squared Euclidean distance between the network's output and the label.
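A sketch of such a forward pass with named intermediate variables, assuming a 2-layer network with a ReLU nonlinearity (the exact architecture on the slide is an image):

```python
import numpy as np

def forward(x, y, W1, b1, W2, b2):
    # one intermediate variable per module
    z1 = W1 @ x + b1                # affine layer 1
    h1 = np.maximum(0, z1)          # ReLU
    z2 = W2 @ h1 + b2               # affine layer 2
    loss = np.sum((z2 - y) ** 2)    # squared Euclidean distance to the label
    return loss, (z1, h1, z2)       # cache intermediates for backprop
```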

SLIDE 23

How do we backprop through f(x; W, b) = Wx + b?

Intermediate Variables

(forward propagation)

The lecture notes feed in one feature vector at a time; here we feed in a batch of data (a matrix).
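For concreteness (these shapes are assumptions, not from the slide): a single feature vector versus a batch,

```latex
y = Wx + b,\quad x \in \mathbb{R}^{d},\; W \in \mathbb{R}^{n \times d}
\qquad\text{vs.}\qquad
Y = XW^{\top} + b,\quad X \in \mathbb{R}^{N \times d},\; Y \in \mathbb{R}^{N \times n}
```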

SLIDE 24

Intermediate Variables

(forward propagation)

Intermediate Gradients

(backward propagation)

  1. intermediate functions
  2. local gradients
  3. full gradients

(The three local gradients are left as ??? on the slide, to be derived.)

SLIDE 25

Agenda

  • Motivation
  • Backprop Tips & Tricks
  • Matrix calculus primer
SLIDE 26

Derivative w.r.t. Vector

Scalar-by-vector and vector-by-vector
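The slide's definitions are images; the standard forms (a gradient vector and a Jacobian matrix, respectively) are:

```latex
\left(\frac{\partial y}{\partial \mathbf{x}}\right)_{i} = \frac{\partial y}{\partial x_i}
\qquad\qquad
\left(\frac{\partial \mathbf{y}}{\partial \mathbf{x}}\right)_{ij} = \frac{\partial y_i}{\partial x_j}
```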

SLIDE 27

Derivative w.r.t. Vector: Chain Rule

  1. intermediate functions
  2. local gradients
  3. full gradients
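The slide's formula is an image; the standard vector chain rule (Jacobians multiply), assuming z = g(y) and y = f(x):

```latex
\frac{\partial \mathbf{z}}{\partial \mathbf{x}} = \frac{\partial \mathbf{z}}{\partial \mathbf{y}}\,\frac{\partial \mathbf{y}}{\partial \mathbf{x}}
```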

SLIDE 28

Derivative w.r.t. Vector: Takeaway

SLIDE 29

Derivative w.r.t. Matrix

Vector-by-matrix? Scalar-by-matrix. (A vector-by-matrix derivative would be a 3-D array of partials, so in practice we work with scalar-by-matrix gradients of the loss.)
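The scalar-by-matrix definition (standard, not transcribed from the slide):

```latex
\left(\frac{\partial L}{\partial W}\right)_{ij} = \frac{\partial L}{\partial W_{ij}}
```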

SLIDE 30

Derivative w.r.t. Matrix: Dimension Balancing

When you take scalar-by-matrix gradients, the gradient has the shape of the denominator, i.e., the same shape as the matrix you differentiate with respect to.

  • Dimension balancing is the “cheap” but efficient approach to gradient calculations in most practical settings
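An illustration of dimension balancing (the shapes are assumptions for the example): for Y = XW with X ∈ ℝ^{N×d}, W ∈ ℝ^{d×n}, and upstream gradient ∂L/∂Y ∈ ℝ^{N×n}, the only shape-consistent ways to combine the factors are

```latex
\frac{\partial L}{\partial W} = X^{\top}\frac{\partial L}{\partial Y} \in \mathbb{R}^{d \times n},
\qquad
\frac{\partial L}{\partial X} = \frac{\partial L}{\partial Y}\,W^{\top} \in \mathbb{R}^{N \times d}
```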

SLIDE 31

Derivative w.r.t. Matrix: Takeaway

SLIDE 32

Intermediate Variables

(forward propagation)

Intermediate Gradients

(backward propagation)

  1. intermediate functions
  2. local gradients
  3. full gradients
SLIDE 33

Backprop Menu for Success

  1. Write down the variable graph
  2. Keep track of error signals
  3. Compute the derivative of the loss function
  4. Enforce the shape rule on error signals, especially when deriving over a linear transformation
SLIDE 34

Vector-by-vector

(worked example shown as an image on the slide)

SLIDE 35

Vector-by-vector

(worked example shown as an image on the slide)

SLIDE 36

Vector-by-vector

(worked example shown as an image on the slide)

SLIDE 37

Vector-by-vector

(worked example shown as an image on the slide)
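The four worked examples above are images; common vector-by-vector identities that such slides typically derive (these particular ones are assumptions, not read off the slides) include:

```latex
\frac{\partial}{\partial \mathbf{x}}(W\mathbf{x}) = W,
\qquad
\frac{\partial}{\partial \mathbf{x}}\,\mathbf{x} = I,
\qquad
\frac{\partial}{\partial \mathbf{x}}\,\sigma(\mathbf{x}) = \operatorname{diag}\!\left(\sigma'(\mathbf{x})\right)
```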

SLIDE 38

Matrix multiplication [Backprop]

(derivation shown as an image on the slide)
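The standard result for Y = XW is dX = dY Wᵀ and dW = Xᵀ dY; a quick numerical sanity check of the dX rule, assuming the loss L = sum(Y):

```python
import numpy as np

X = np.random.randn(4, 3)
W = np.random.randn(3, 5)
dY = np.ones((4, 5))        # dL/dY for L = sum(X @ W)

dX = dY @ W.T               # backprop rule for matrix multiplication
dW = X.T @ dY

# compare one entry of dX against a finite difference
h, i, j = 1e-5, 1, 2
Xp = X.copy()
Xp[i, j] += h
num = (np.sum(Xp @ W) - np.sum(X @ W)) / h
print(num, dX[i, j])        # should match to ~1e-5
```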

SLIDE 39

Elementwise function [Backprop]

(derivation shown as an image on the slide)
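The standard rule for an elementwise function y = f(x) (standard, not transcribed from the slide), with ⊙ denoting the elementwise product:

```latex
\frac{\partial L}{\partial \mathbf{x}} = f'(\mathbf{x}) \odot \frac{\partial L}{\partial \mathbf{y}}
```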