CS 4803 / 7643: Deep Learning, Topics: Backpropagation - PowerPoint PPT Presentation



SLIDE 1

CS 4803 / 7643: Deep Learning

Zsolt Kira Georgia Tech

Topics:

– Backpropagation
– Vector/Matrix/Tensor math
– Deriving vectorized gradients for ReLU

SLIDE 2

Administrivia

  • PS1/HW1 out
  • Start thinking about project topics/teams

(C) Dhruv Batra & Zsolt Kira

SLIDE 3

Do the Readings!


SLIDE 4

Recap from last time


SLIDE 5

Gradient Descent Pseudocode

for i in {0, …, num_epochs}:
    for (x, y) in data:
        compute loss L(f(x; w), y)        [forward pass]
        compute gradient ∇_w L            [backward pass]
        update w ≔ w − α ∇_w L

  • Some design decisions (a runnable sketch of this loop follows):
  • How many examples to use to calculate the gradient per iteration?
  • What should alpha (the learning rate) be?
  • Should it be constant throughout?
  • How many epochs to run?
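In code, the loop above might look like the following minimal sketch. The loss_and_grad helper and the toy least-squares model are illustrative assumptions, not course code:

    def sgd(data, w, loss_and_grad, alpha=0.1, num_epochs=20):
        # Plain stochastic gradient descent: one update per example.
        for epoch in range(num_epochs):
            for x, y in data:
                loss, grad = loss_and_grad(w, x, y)
                w = w - alpha * grad          # step against the gradient
        return w

    # Toy usage: fit w for a 1-D least-squares model y ≈ w * x.
    def loss_and_grad(w, x, y):
        err = w * x - y
        return 0.5 * err ** 2, err * x        # d/dw of 0.5 * (w*x - y)^2

    data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]            # samples from y = 2x
    print(sgd(data, w=0.0, loss_and_grad=loss_and_grad))   # converges toward 2.0

Using all of the data per update gives batch gradient descent; the one-example-at-a-time loop above is the stochastic extreme, and mini-batches sit in between.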
SLIDE 6

How to Simplify?

  • Calculating gradients for large functions is

complicated

  • Step 1: Decompose the function and compute local

gradients for each part!

  • Step 2: Apply generic algorithm that computes

gradients locally and uses chain rule to propagate across computation graph
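To make Steps 1 and 2 concrete, here is a small sketch (not from the slides) that decomposes a sigmoid-of-linear function into primitive operations, then chains the local gradients backward, multiplying each by the upstream gradient:

    import math

    # f(w, x, b) = 1 / (1 + exp(-(w*x + b))), broken into primitives.
    w, x, b = 2.0, -1.0, -3.0

    # Forward pass: record every intermediate value.
    m = w * x                  # multiply
    s = m + b                  # add
    e = math.exp(-s)           # exp of the negation
    denom = 1.0 + e            # add constant
    f = 1.0 / denom            # reciprocal; f = sigmoid(w*x + b)

    # Backward pass: upstream gradient times local gradient at each node.
    df = 1.0                               # df/df
    ddenom = df * (-1.0 / denom ** 2)      # local: d(1/denom)/ddenom
    de = ddenom * 1.0                      # local: d(1 + e)/de = 1
    ds = de * (-math.exp(-s))              # local: d(exp(-s))/ds = -exp(-s)
    dm = ds * 1.0                          # local: d(m + b)/dm = 1
    db = ds * 1.0                          # local: d(m + b)/db = 1
    dw = dm * x                            # local: d(w*x)/dw = x
    dx = dm * w                            # local: d(w*x)/dx = w

    print(dw, dx, db)                      # gradients of f w.r.t. w, x, b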


SLIDE 7

Computational Graph

Any DAG of differentiable modules is allowed!

Slide Credit: Marc'Aurelio Ranzato

SLIDE 8

Key Computation: Forward-Prop


Slide Credit: Marc'Aurelio Ranzato, Yann LeCun

SLIDE 9

Key Computation: Back-Prop


Slide Credit: Marc'Aurelio Ranzato, Yann LeCun
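A minimal sketch of this module view, assuming nothing beyond NumPy (class names and shapes are illustrative, not the instructors' code). Each module caches what it needs during forward-prop and, during back-prop, computes its parameter gradient and passes the input gradient upstream:

    import numpy as np

    class Linear:
        # y = x @ W; caches its input so backward can form local gradients.
        def __init__(self, in_dim, out_dim, rng):
            self.W = rng.standard_normal((in_dim, out_dim)) * 0.1
        def forward(self, x):
            self.x = x                     # cache for the backward pass
            return x @ self.W
        def backward(self, dy):
            self.dW = self.x.T @ dy        # gradient w.r.t. parameters
            return dy @ self.W.T           # gradient w.r.t. input (sent upstream)

    class ReLU:
        def forward(self, x):
            self.mask = x > 0
            return x * self.mask
        def backward(self, dy):
            return dy * self.mask          # pass gradient only where input was > 0

    rng = np.random.default_rng(0)
    layers = [Linear(4, 8, rng), ReLU(), Linear(8, 2, rng)]

    x = rng.standard_normal((5, 4))        # mini-batch of 5 examples
    h = x
    for layer in layers:                   # forward-prop: left to right
        h = layer.forward(h)

    dh = np.ones_like(h)                   # pretend dLoss/dOutput = 1
    for layer in reversed(layers):         # back-prop: right to left
        dh = layer.backward(dh)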

SLIDE 10

Neural Network Training

  • Step 1: Compute Loss on mini-batch [F-Pass]


Slide Credit: Marc'Aurelio Ranzato, Yann LeCun


SLIDE 13

Neural Network Training

  • Step 1: Compute Loss on mini-batch [F-Pass]
  • Step 2: Compute gradients wrt parameters [B-Pass]


Slide Credit: Marc'Aurelio Ranzato, Yann LeCun


SLIDE 16

Neural Network Training

  • Step 1: Compute Loss on mini-batch [F-Pass]
  • Step 2: Compute gradients wrt parameters [B-Pass]
  • Step 3: Use gradient to update parameters (sketched in code below)


Slide Credit: Marc'Aurelio Ranzato, Yann LeCun
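Putting the three steps together in one loop, a toy NumPy sketch (the linear model and data are illustrative, not course code):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.standard_normal((100, 3))             # toy inputs
    y = X @ np.array([1.0, -2.0, 0.5])            # linear ground truth
    W = np.zeros(3)
    alpha, batch_size = 0.1, 20

    for epoch in range(50):
        for i in range(0, len(X), batch_size):
            xb, yb = X[i:i + batch_size], y[i:i + batch_size]
            # Step 1: compute loss on mini-batch [F-Pass]
            pred = xb @ W
            loss = 0.5 * np.mean((pred - yb) ** 2)
            # Step 2: compute gradients wrt parameters [B-Pass]
            dW = xb.T @ (pred - yb) / len(xb)
            # Step 3: use gradient to update parameters
            W -= alpha * dW

    print(W)   # approaches [1.0, -2.0, 0.5]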

SLIDE 17

Backpropagation: a simple example

e.g. x = −2, y = 5, z = −4
Want: ∂f/∂x, ∂f/∂y, ∂f/∂z
Chain rule: downstream gradient = upstream gradient × local gradient

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
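The worked figure for this slide did not survive extraction. In the CS231n version of this example the function is f(x, y, z) = (x + y)z; assuming that here, the forward and backward passes work out as:

    # Assumption: the slide's function is f(x, y, z) = (x + y) * z.
    x, y, z = -2.0, 5.0, -4.0

    # Forward pass
    q = x + y          # q = 3
    f = q * z          # f = -12

    # Backward pass: upstream gradient * local gradient at each node
    df = 1.0           # df/df
    dq = df * z        # local: d(q*z)/dq = z  -> -4
    dz = df * q        # local: d(q*z)/dz = q  ->  3
    dx = dq * 1.0      # local: d(x+y)/dx = 1  -> -4
    dy = dq * 1.0      # local: d(x+y)/dy = 1  -> -4

    print(dx, dy, dz)  # -4.0 -4.0 3.0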

SLIDE 18

Patterns in backward flow

  • add gate: gradient distributor
  • max gate: gradient router
  • mul gate: gradient switcher

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
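A quick numeric check of the three patterns, with illustrative values:

    upstream = 2.0     # gradient flowing back into each gate

    # add gate (a + b): distributor; both inputs receive the upstream gradient
    a, b = 3.0, -1.0
    da, db = upstream * 1.0, upstream * 1.0

    # max gate max(a, b): router; only the winning input gets the gradient
    dmax_a = upstream * (1.0 if a > b else 0.0)
    dmax_b = upstream * (1.0 if b >= a else 0.0)

    # mul gate (a * b): switcher; each input gets upstream * the other input
    dmul_a = upstream * b
    dmul_b = upstream * a

    print(da, db)            # 2.0 2.0   (distributed)
    print(dmax_a, dmax_b)    # 2.0 0.0   (routed to the max input)
    print(dmul_a, dmul_b)    # -2.0 6.0  (switched/scaled by the other input)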

SLIDE 19


Summary

  • We will have a composed non-linear function as our model
    – Several portions will have parameters
  • We will use (stochastic/mini-batch) gradient descent with a loss function to define our objective
  • Rather than analytically deriving gradients for complex functions, we will modularize computation
    – Backpropagation = Gradient Descent + Chain Rule
  • Now:
    – Work through the mathematical view
    – Vectors, matrices, and tensors
    – Next time: Can the computer do this for us automatically?
  • Read:
    – https://explained.ai/matrix-calculus/index.html
    – https://www.cc.gatech.edu/classes/AY2020/cs7643_fall/slides/L5_gradients_notes.pdf

SLIDE 20

Matrix/Vector Derivatives Notation

  • Read:
    – https://explained.ai/matrix-calculus/index.html
    – https://www.cc.gatech.edu/classes/AY2020/cs7643_fall/slides/L5_gradients_notes.pdf

  • Matrix/Vector Derivatives Notation
  • Vector Derivative Example
  • Extension to Tensors
  • Chain Rule: Composite Functions

    – Scalar Case
    – Vector Case
    – Jacobian view
    – Graphical view
    – Tensors

  • Logistic Regression Derivatives
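As a preview of the Jacobian view applied to the lecture's ReLU topic: for an elementwise op the Jacobian is diagonal, so the Jacobian-vector product of the chain rule collapses to an elementwise mask. A small illustrative sketch (not from the notes):

    import numpy as np

    x = np.array([1.5, -2.0, 0.3, -0.7])
    upstream = np.array([0.1, 0.2, 0.3, 0.4])      # dL/dy from later layers

    # Jacobian view: y = relu(x) elementwise, so dy_i/dx_j = 0 for i != j
    # and the Jacobian is diagonal with entries 1[x_i > 0].
    J = np.diag((x > 0).astype(float))
    dx_jacobian = J.T @ upstream                   # explicit product, O(n^2)

    # Vectorized view: the diagonal structure collapses to a mask, O(n).
    dx_masked = upstream * (x > 0)

    assert np.allclose(dx_jacobian, dx_masked)
    print(dx_masked)   # [0.1 0.  0.3 0. ]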

