SLIDE 1

Backpropagation

Slides credits: Barak Oshri, Vincent Chen, Nish Khandwala, Yi Wen

TA: Yi Wen

April 17, 2020 CS231n Discussion Section

SLIDE 2

Agenda

  • Motivation
  • Backprop Tips & Tricks
  • Matrix calculus primer
SLIDE 4

Motivation

Recall: the optimization objective is to minimize the loss.

SLIDE 5

Motivation

Recall: the optimization objective is to minimize the loss.

Goal: how should we tweak the parameters to decrease the loss?

SLIDE 6

Agenda

  • Motivation
  • Backprop Tips & Tricks
  • Matrix calculus primer
SLIDE 7

A Simple Example

Goal: tweak the parameters to minimize the loss => minimize a multivariable function in parameter space

SLIDE 8

A Simple Example

=> minimize a multivariable function

Plotted on WolframAlpha

SLIDE 9

Approach #1: Random Search

Intuition: take random steps in the domain of the function and keep whichever parameters give the lowest loss seen so far
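Not on the slide: a minimal sketch of random search over parameters, assuming a stand-in loss function f:

```python
import numpy as np

def f(w):
    return np.sum(w ** 2)  # stand-in loss; any scalar function of w works

best_w, best_loss = None, float("inf")
for _ in range(1000):
    w = np.random.randn(10) * 0.001   # try a random point in parameter space
    loss = f(w)
    if loss < best_loss:              # keep the best parameters seen so far
        best_loss, best_w = loss, w
```

This works, but it explores parameter space blindly; the next two approaches use the gradient to choose the step direction.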

SLIDE 10

Approach #2: Numerical Gradient

Intuition: the rate of change of a function with respect to a variable, measured over a small surrounding region

SLIDE 11

Approach #2: Numerical Gradient

Intuition: the rate of change of a function with respect to a variable, measured over a small surrounding region.

Finite differences:
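The formula itself is an image on the slide; a standard choice is the centered finite difference (often recommended for gradient checks), where e_i is the i-th standard basis vector:

```latex
\frac{\partial f}{\partial x_i} \approx \frac{f(\mathbf{x} + h\,\mathbf{e}_i) - f(\mathbf{x} - h\,\mathbf{e}_i)}{2h}, \qquad h \approx 10^{-5}
```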

SLIDE 12

Approach #3: Analytical Gradient

Recall: partial derivative by limit definition
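The slide's formula is an image; the standard limit definition it recalls is:

```latex
\frac{\partial f}{\partial x} = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}
```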

SLIDE 13

Approach #3: Analytical Gradient

Recall: chain rule
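As a refresher (standard statement, not transcribed from the slide): for a composition f(g(x)),

```latex
\frac{\partial f}{\partial x} = \frac{\partial f}{\partial g} \cdot \frac{\partial g}{\partial x}
```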

SLIDE 14

Approach #3: Analytical Gradient

Recall: chain rule. E.g.:
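The slide's example is an image; a representative example in the spirit of the lecture, assuming f(x, y, z) = (x + y)z with intermediate variable q = x + y:

```latex
q = x + y, \quad f = qz
\quad\Longrightarrow\quad
\frac{\partial f}{\partial x} = \frac{\partial f}{\partial q} \cdot \frac{\partial q}{\partial x} = z \cdot 1 = z
```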


SLIDE 16

Approach #3: Analytical Gradient

Recall: chain rule. Intuition: upstream gradient values propagate backwards, and we can reuse them!
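A sketch of that reuse for the example f(x, y, z) = (x + y)z from above (the input values are assumptions, chosen for illustration):

```python
x, y, z = -2.0, 5.0, -4.0

# forward pass: record intermediate values
q = x + y           # q = 3
f = q * z           # f = -12

# backward pass: upstream gradients flow back and are reused
df = 1.0            # gradient of f w.r.t. itself
dq = z * df         # local gradient (z) times upstream gradient
dz = q * df
dx = 1.0 * dq       # dq is computed once and reused for both dx and dy
dy = 1.0 * dq
```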

SLIDE 17

Gradient

The gradient is the “direction and rate of fastest increase.”

Numerical gradient vs. analytical gradient: the numerical gradient is easy to compute but approximate and slow; the analytical gradient is exact and fast but error-prone to derive. In practice, derive the gradient analytically and verify it numerically.
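Not on the slide: a minimal gradient check, assuming the example function f(x) = ||x||^2 so that the analytical gradient is 2x:

```python
import numpy as np

def f(x):
    return np.sum(x ** 2)   # example function; its analytical gradient is 2x

def numerical_gradient(f, x, h=1e-5):
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e.flat[i] = h
        grad.flat[i] = (f(x + e) - f(x - e)) / (2 * h)  # centered difference
    return grad

x = np.random.randn(5)
print(np.max(np.abs(numerical_gradient(f, x) - 2 * x)))  # should be tiny (~1e-10)
```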

SLIDE 18

What about Autograd?

  • Deep learning frameworks can automatically perform backprop!
  • Problems related to the underlying gradients can still surface when debugging your models; see Karpathy's “Yes You Should Understand Backprop”:

https://medium.com/@karpathy/yes-you-should-understand-backprop-e2f06eab496b
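Not on the slide: a minimal autograd sketch, assuming PyTorch as the framework:

```python
import torch

# Autograd records the computation graph during the forward pass...
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
w = torch.tensor([0.5, -1.0, 2.0], requires_grad=True)
loss = ((w * x).sum() - 1.0) ** 2   # scalar loss

# ...and backprop through the recorded graph is a single call
loss.backward()
print(x.grad)   # dloss/dx, computed automatically
print(w.grad)   # dloss/dw
```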

SLIDE 19

Problem Statement: Backpropagation

Given a function f of inputs x, labels y, and parameters θ, compute the gradient of the loss with respect to θ.

SLIDE 20

Problem Statement: Backpropagation

An algorithm for computing the gradient of a compound function as a series of local, intermediate gradients:

  1. Identify intermediate functions (forward prop)
  2. Compute local gradients (chain rule)
  3. Combine with the upstream error signal to get the full gradient

The slide's diagram shows one module: the forward pass maps inputs x (with parameters W, b) to an output y, local(x, W, b) => y, and the backward pass maps the upstream gradient dy to the local gradients, dx, dW, db <= grad_local(dy, x, W, b).
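A minimal NumPy sketch of this module interface, assuming the module is the affine map y = Wx + b (the slide does not pin down the module's contents here):

```python
import numpy as np

def local(x, W, b):
    """Forward pass of an affine module: y = W x + b."""
    return W @ x + b

def grad_local(dy, x, W, b):
    """Backward pass: map the upstream gradient dy to local gradients."""
    dx = W.T @ dy          # (d,)   gradient w.r.t. the input
    dW = np.outer(dy, x)   # (n, d) gradient w.r.t. the weights
    db = dy                # (n,)   gradient w.r.t. the bias
    return dx, dW, db

W, b, x = np.random.randn(3, 4), np.random.randn(3), np.random.randn(4)
y = local(x, W, b)                              # forward
dx, dW, db = grad_local(np.ones(3), x, W, b)    # backward, dummy upstream dy
```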

SLIDE 21

Modularity: Previous Example

Compound function => intermediate variables (forward propagation)

SLIDE 22

Modularity: 2-Layer Neural Network

Compound function => intermediate variables (forward propagation)

The loss is the squared Euclidean distance between the network's output and the label.
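A sketch of such a forward pass with named intermediate variables, assuming a 2-layer network with a ReLU nonlinearity (the exact architecture on the slide is an image):

```python
import numpy as np

def forward(x, y, W1, b1, W2, b2):
    # one intermediate variable per module
    z1 = W1 @ x + b1                # affine layer 1
    h1 = np.maximum(0, z1)          # ReLU
    z2 = W2 @ h1 + b2               # affine layer 2
    loss = np.sum((z2 - y) ** 2)    # squared Euclidean distance to the label
    return loss, (z1, h1, z2)       # cache intermediates for backprop
```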

SLIDE 23

How do we backprop through f(x; W, b) = Wx + b?

Intermediate Variables

(forward propagation)

The lecture notes feed in one feature vector at a time; here we feed in a batch of data (a matrix).
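For concreteness (these shapes are assumptions, not from the slide): a single feature vector versus a batch,

```latex
y = Wx + b,\quad x \in \mathbb{R}^{d},\; W \in \mathbb{R}^{n \times d}
\qquad\text{vs.}\qquad
Y = XW^{\top} + b,\quad X \in \mathbb{R}^{N \times d},\; Y \in \mathbb{R}^{N \times n}
```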

SLIDE 24

Intermediate Variables

(forward propagation)

Intermediate Gradients

(backward propagation)

  1. intermediate functions
  2. local gradients
  3. full gradients

(The three local gradients are left as ??? on the slide, to be derived.)

SLIDE 25

Agenda

  • Motivation
  • Backprop Tips & Tricks
  • Matrix calculus primer
SLIDE 26

Derivative w.r.t. Vector

Scalar-by-vector and vector-by-vector
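The slide's definitions are images; the standard forms (a gradient vector and a Jacobian matrix, respectively) are:

```latex
\left(\frac{\partial y}{\partial \mathbf{x}}\right)_{i} = \frac{\partial y}{\partial x_i}
\qquad\qquad
\left(\frac{\partial \mathbf{y}}{\partial \mathbf{x}}\right)_{ij} = \frac{\partial y_i}{\partial x_j}
```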

SLIDE 27

Derivative w.r.t. Vector: Chain Rule

  1. intermediate functions
  2. local gradients
  3. full gradients
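The slide's formula is an image; the standard vector chain rule (Jacobians multiply), assuming z = g(y) and y = f(x):

```latex
\frac{\partial \mathbf{z}}{\partial \mathbf{x}} = \frac{\partial \mathbf{z}}{\partial \mathbf{y}}\,\frac{\partial \mathbf{y}}{\partial \mathbf{x}}
```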

SLIDE 28

Derivative w.r.t. Vector: Takeaway

SLIDE 29

Derivative w.r.t. Matrix

Vector-by-matrix? Scalar-by-matrix. (A vector-by-matrix derivative would be a 3-D array of partials, so in practice we work with scalar-by-matrix gradients of the loss.)
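The scalar-by-matrix definition (standard, not transcribed from the slide):

```latex
\left(\frac{\partial L}{\partial W}\right)_{ij} = \frac{\partial L}{\partial W_{ij}}
```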

SLIDE 30

Derivative w.r.t. Matrix: Dimension Balancing

When you take scalar-by-matrix gradients, the gradient has the shape of the denominator, i.e., the same shape as the matrix you differentiate with respect to.

  • Dimension balancing is the “cheap” but efficient approach to gradient calculations in most practical settings
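An illustration of dimension balancing (the shapes are assumptions for the example): for Y = XW with X ∈ ℝ^{N×d}, W ∈ ℝ^{d×n}, and upstream gradient ∂L/∂Y ∈ ℝ^{N×n}, the only shape-consistent ways to combine the factors are

```latex
\frac{\partial L}{\partial W} = X^{\top}\frac{\partial L}{\partial Y} \in \mathbb{R}^{d \times n},
\qquad
\frac{\partial L}{\partial X} = \frac{\partial L}{\partial Y}\,W^{\top} \in \mathbb{R}^{N \times d}
```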

SLIDE 31

Derivative w.r.t. Matrix: Takeaway

SLIDE 32

Intermediate Variables

(forward propagation)

Intermediate Gradients

(backward propagation)

  1. intermediate functions
  2. local gradients
  3. full gradients
SLIDE 33

Backprop Menu for Success

  1. Write down the variable graph
  2. Keep track of error signals
  3. Compute the derivative of the loss function
  4. Enforce the shape rule on error signals, especially when deriving over a linear transformation
SLIDE 34

Vector-by-vector

(worked example shown as an image on the slide)

SLIDE 35

Vector-by-vector

(worked example shown as an image on the slide)

SLIDE 36

Vector-by-vector

(worked example shown as an image on the slide)

SLIDE 37

Vector-by-vector

(worked example shown as an image on the slide)
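The four worked examples above are images; common vector-by-vector identities that such slides typically derive (these particular ones are assumptions, not read off the slides) include:

```latex
\frac{\partial}{\partial \mathbf{x}}(W\mathbf{x}) = W,
\qquad
\frac{\partial}{\partial \mathbf{x}}\,\mathbf{x} = I,
\qquad
\frac{\partial}{\partial \mathbf{x}}\,\sigma(\mathbf{x}) = \operatorname{diag}\!\left(\sigma'(\mathbf{x})\right)
```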

SLIDE 38

Matrix multiplication [Backprop]

(derivation shown as an image on the slide)
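The standard result for Y = XW is dX = dY Wᵀ and dW = Xᵀ dY; a quick numerical sanity check of the dX rule, assuming the loss L = sum(Y):

```python
import numpy as np

X = np.random.randn(4, 3)
W = np.random.randn(3, 5)
dY = np.ones((4, 5))        # dL/dY for L = sum(X @ W)

dX = dY @ W.T               # backprop rule for matrix multiplication
dW = X.T @ dY

# compare one entry of dX against a finite difference
h, i, j = 1e-5, 1, 2
Xp = X.copy()
Xp[i, j] += h
num = (np.sum(Xp @ W) - np.sum(X @ W)) / h
print(num, dX[i, j])        # should match to ~1e-5
```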

SLIDE 39

Elementwise function [Backprop]

(derivation shown as an image on the slide)
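The standard rule for an elementwise function y = f(x) (standard, not transcribed from the slide), with ⊙ denoting the elementwise product:

```latex
\frac{\partial L}{\partial \mathbf{x}} = f'(\mathbf{x}) \odot \frac{\partial L}{\partial \mathbf{y}}
```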