CS 4803 / 7643: Deep Learning Topics: (Finish) Computing Gradients

SLIDE 1

CS 4803 / 7643: Deep Learning

Zsolt Kira Georgia Tech

Topics:

– (Finish) Computing Gradients
– Backprop in Conv Layers
– Forward mode vs Reverse mode AD
– Modern CNN Architectures

SLIDE 2

Administrivia

  • HW1 due date moved!

– Due: 02/18, 11:55pm

  • Project topic submissions

– Submit by 02/21 to get comments
– Form filled out with:

  • Members identified
  • A paragraph describing the problem, and another paragraph covering what has been done in the literature and your approach (note: the approach can be selected from an existing paper and reimplemented)
  • Description of what each member will do
  • Link

– A project idea: ICLR reproducibility challenge (https://reproducibility-challenge.github.io/iclr_2019/)

  • Official submission was in January, but you can still do it and submit later!

(C) Dhruv Batra and Zsolt Kira 2

SLIDE 3
  • Google Cloud credits are out! (see Piazza for details)
  • Clouderizer ties in with Google Cloud Platform (GCP)

SLIDE 4

Matrix/Vector Derivatives Notation

SLIDE 7

Plan for Today

  • Topics:

– (Finish) Computing Gradients
– Backprop in Conv Layers
– Forward mode vs Reverse mode AD
– Modern CNN Architectures

SLIDE 8

Backprop in Convolutional Layers

SLIDE 9

How do we compute gradients?

  • Analytic or “Manual” Differentiation
  • Symbolic Differentiation
  • Numerical Differentiation
  • Automatic Differentiation

– Forward mode AD
– Reverse mode AD

  • aka “backprop”

SLIDE 10

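Computational Graph

[Figure: computational graph. x and W feed a multiply node (*) that produces s (scores); the scores feed the hinge loss; a regularization node R is added (+) to the hinge loss to give the total loss L.]

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n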
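The graph on this slide can be read as a short forward pass, one line per node. The sketch below is only an illustration: the slide does not give the exact loss, so the multiclass hinge loss with margin 1, the L2 form of R, and the names `forward` and `reg` are all assumptions.

```python
def forward(W, x, y, reg=0.1):
    """Evaluate the graph node by node (assumed: multiclass hinge, L2 regularizer)."""
    # '*' node: s = W x (scores), one score per row of W
    s = [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]
    # hinge loss node: margin of 1 against the correct class y (assumption)
    hinge = sum(max(0.0, s_j - s[y] + 1.0) for j, s_j in enumerate(s) if j != y)
    # R node: L2 regularization on W (assumption), weighted by reg
    R = reg * sum(w * w for row in W for w in row)
    # '+' node: total loss
    L = hinge + R
    return s, hinge, R, L


W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
x = [1.0, 2.0]
s, hinge, R, L = forward(W, x, y=0)
```

Each intermediate (`s`, `hinge`, `R`) corresponds to one edge of the graph, which is what makes it possible to run the same graph backwards later.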

SLIDE 11

Computational Graph

Any DAG of differentiable modules is allowed!

Slide Credit: Marc'Aurelio Ranzato

SLIDE 12

Directed Acyclic Graphs (DAGs)

  • Exactly what the name suggests

– Directed edges
– No (directed) cycles
– Underlying undirected cycles okay

SLIDE 13

Directed Acyclic Graphs (DAGs)

  • Concept

– Topological Ordering
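A topological ordering lists the nodes so that every node appears after all of its inputs, which is the order a forward pass can evaluate them in. A minimal sketch using Kahn's algorithm follows; the function name `topological_order` and the example node names are hypothetical, and the example graph assumes the x1, x2, *, sin, + graph used later in the lecture.

```python
from collections import deque


def topological_order(nodes, edges):
    """Return the nodes in an order where every edge (u, v) has u before v."""
    indegree = {n: 0 for n in nodes}
    children = {n: [] for n in nodes}
    for u, v in edges:
        children[u].append(v)
        indegree[v] += 1
    # Nodes with no incoming edges (the graph inputs) are ready immediately.
    ready = deque(n for n in nodes if indegree[n] == 0)
    order = []
    while ready:
        u = ready.popleft()
        order.append(u)
        for v in children[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                ready.append(v)
    if len(order) != len(nodes):
        raise ValueError("graph has a directed cycle, so it is not a DAG")
    return order


nodes = ["x1", "x2", "mul", "sin", "add"]
edges = [("x1", "mul"), ("x2", "mul"), ("x1", "sin"),
         ("mul", "add"), ("sin", "add")]
order = topological_order(nodes, edges)
```

Note that x1, mul, add, sin form an undirected cycle, which is fine; only a directed cycle makes the ordering impossible.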

SLIDE 15

Numerical vs Analytic Gradients

Numerical gradient: slow :(, approximate :(, easy to write :)
Analytic gradient: fast :), exact :), error-prone :(

In practice: Derive analytic gradient, check your implementation with numerical gradient. This is called a gradient check.

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
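A gradient check can be sketched in a few lines. The example function f(x1, x2) = x1*x2 + sin(x1) is an assumption chosen to match the nodes of the lecture's later AD example; the helper names, the step size `h`, and the relative-error formula are illustrative choices, not something the slide prescribes.

```python
import math


def f(x1, x2):
    return x1 * x2 + math.sin(x1)


def analytic_grad(x1, x2):
    # Hand-derived: df/dx1 = x2 + cos(x1), df/dx2 = x1
    return (x2 + math.cos(x1), x1)


def numerical_grad(func, x1, x2, h=1e-5):
    # Central differences: two function evaluations per input
    d1 = (func(x1 + h, x2) - func(x1 - h, x2)) / (2 * h)
    d2 = (func(x1, x2 + h) - func(x1, x2 - h)) / (2 * h)
    return (d1, d2)


def relative_error(a, b):
    # Largest per-coordinate relative difference between two gradients
    return max(abs(p - q) / max(1e-8, abs(p) + abs(q)) for p, q in zip(a, b))


err = relative_error(analytic_grad(0.5, -1.5), numerical_grad(f, 0.5, -1.5))
```

The numerical version needs two evaluations of f per input, which is why it is slow for models with many parameters and is used only to check the analytic code.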

SLIDE 16

How do we compute gradients?

  • Analytic or “Manual” Differentiation
  • Symbolic Differentiation
  • Numerical Differentiation
  • Automatic Differentiation

– Forward mode AD
– Reverse mode AD

  • aka “backprop”

SLIDE 17

Forward mode vs Reverse Mode

  • Key Computations
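The formulas on this slide did not survive extraction. As a hedged reminder of the standard statement (the symbols below are assumptions, with g denoting a single module, x its input, y = g(x) its output, t one chosen graph input, and L the final loss):

```latex
% Forward mode: push a derivative w.r.t. one input t forward through g
\frac{\partial y}{\partial t} \;=\; \frac{\partial g}{\partial x}\,\frac{\partial x}{\partial t}

% Reverse mode: pull the derivative of the loss L backward through g
\frac{\partial L}{\partial x} \;=\; \left(\frac{\partial g}{\partial x}\right)^{\!\top} \frac{\partial L}{\partial y}
```

Both use the same local Jacobian of g; they differ only in which side it is multiplied from and in which direction the graph is traversed.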

SLIDE 18

Forward mode AD

[Figure: diagram of a single module g]

SLIDE 19

Reverse mode AD

[Figure: diagram of a single module g]

SLIDE 20

Example: Forward mode AD

[Figure: computational graph with inputs x1 and x2, a multiply node (*), a sin( ) node, and an add node (+)]
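Forward mode AD can be sketched with dual numbers: every value carries its derivative with respect to one chosen input, and both are updated together as the graph is evaluated. The wiring of the graph is not recoverable from the text, so f(x1, x2) = x1*x2 + sin(x1) is an assumption, and the `Dual` class and helper names are hypothetical.

```python
import math


class Dual:
    """A value plus its derivative (dot) with respect to one chosen input."""

    def __init__(self, val, dot=0.0):
        self.val = val
        self.dot = dot

    def __add__(self, other):
        # Sum rule
        return Dual(self.val + other.val, self.dot + other.dot)

    def __mul__(self, other):
        # Product rule
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)


def sin(a):
    # Chain rule through sin
    return Dual(math.sin(a.val), math.cos(a.val) * a.dot)


def f(x1, x2):
    return x1 * x2 + sin(x1)


def forward_mode_grad(x1, x2):
    # One full pass per input: seed that input's derivative with 1, the other with 0.
    d_x1 = f(Dual(x1, 1.0), Dual(x2, 0.0)).dot
    d_x2 = f(Dual(x1, 0.0), Dual(x2, 1.0)).dot
    return d_x1, d_x2


grad = forward_mode_grad(0.5, -1.5)
```

Getting the full gradient takes one pass per input here, which is the cost that matters when a network has millions of parameters.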

SLIDE 24

Example: Reverse mode AD

[Figure: computational graph with inputs x1 and x2, a multiply node (*), a sin( ) node, and an add node (+)]
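Reverse mode AD on the same graph can be written out by hand: run the forward pass, then walk the nodes in reverse order and pass the derivative of the output back to each input. As before, f(x1, x2) = x1*x2 + sin(x1) is an assumed wiring of the graph, and the variable names are illustrative.

```python
import math


def f_and_grad(x1, x2):
    # Forward pass: evaluate and keep the intermediates
    a = x1 * x2          # '*' node
    b = math.sin(x1)     # sin node
    out = a + b          # '+' node

    # Backward pass: start from d(out)/d(out) = 1 and go in reverse order
    d_out = 1.0
    d_a = d_out          # '+' passes the gradient to both inputs
    d_b = d_out
    d_x1 = d_a * x2      # '*' node: gradient to x1 is scaled by x2
    d_x2 = d_a * x1      # '*' node: gradient to x2 is scaled by x1
    d_x1 += d_b * math.cos(x1)  # x1 also feeds sin, so its gradients add
    return out, (d_x1, d_x2)


value, grad = f_and_grad(0.5, -1.5)
```

A single backward pass yields the derivative with respect to every input, at the price of keeping the forward-pass values (here `x1` and `x2`) around until the backward pass uses them.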

SLIDE 26

Forward Pass vs Forward mode AD vs Reverse Mode AD

[Figure: three copies of the example graph (inputs x1 and x2; nodes *, sin( ), +), one each for the forward pass, forward mode AD, and reverse mode AD]

SLIDE 27

Forward mode vs Reverse Mode

  • What are the differences?

[Figure: the example graph (inputs x1 and x2; nodes *, sin( ), +) shown twice, once for forward mode and once for reverse mode]

SLIDE 28

Forward mode vs Reverse Mode

  • What are the differences?
  • Which one is faster to compute?

– Forward or backward?

SLIDE 29

Forward mode vs Reverse Mode

  • What are the differences?
  • Which one is faster to compute?

– Forward or backward?

  • Which one is more memory efficient (less storage)?

– Forward or backward?
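The slide leaves these as open questions. One common way to reason about the speed question is to count multiplications when the chain of layer Jacobians is multiplied starting from the output side (reverse mode) versus starting from the input side (forward mode, carrying derivatives for all inputs at once). The sketch below is an illustration under those assumptions; the function names and the layer sizes are hypothetical.

```python
def matmul_cost(a_shape, b_shape):
    """Scalar multiplications in a dense (p x q) @ (q x r) product."""
    p, q = a_shape
    q2, r = b_shape
    assert q == q2, "inner dimensions must match"
    return p * q * r


def chain_cost(jacobian_shapes, mode):
    """Cost of multiplying a Jacobian chain listed from output side to input side."""
    shapes = list(jacobian_shapes)
    total = 0
    if mode == "reverse":
        # Start at the output: the running product stays (outputs x width).
        acc = shapes[0]
        for s in shapes[1:]:
            total += matmul_cost(acc, s)
            acc = (acc[0], s[1])
    elif mode == "forward":
        # Start at the input: the running product stays (width x inputs).
        acc = shapes[-1]
        for s in reversed(shapes[:-1]):
            total += matmul_cost(s, acc)
            acc = (s[0], acc[1])
    else:
        raise ValueError(mode)
    return total


n = 100  # layer width and input size (illustrative)
shapes = [(1, n), (n, n), (n, n)]  # scalar loss on top of two n-to-n layers
reverse_cost = chain_cost(shapes, "reverse")
forward_cost = chain_cost(shapes, "forward")
```

With a scalar loss and many inputs, the reverse order only ever multiplies a row vector by a matrix, so it is far cheaper in this setting. On memory the trade-off goes the other way: reverse mode has to keep the forward-pass intermediates until the backward pass reaches them, while forward mode can discard them as it goes.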

SLIDE 30

Patterns in backward flow

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 32

Patterns in backward flow

add gate: gradient distributor

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 33

Patterns in backward flow

add gate: gradient distributor
Q: What is a max gate?

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 34

Patterns in backward flow

add gate: gradient distributor
max gate: gradient router

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 35

Patterns in backward flow

add gate: gradient distributor
max gate: gradient router
Q: What is a mul gate?

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 36

Patterns in backward flow

add gate: gradient distributor
max gate: gradient router
mul gate: gradient switcher

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
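The three patterns can be written as tiny backward functions for two-input scalar gates. This is a sketch; the function names and the tie-breaking rule in the max gate (ties go to the first input) are assumptions.

```python
def add_backward(x, y, upstream):
    # Distributor: both inputs receive the upstream gradient unchanged.
    return upstream, upstream


def max_backward(x, y, upstream):
    # Router: only the input that was the max receives the gradient.
    # Ties are sent to x here, which is an arbitrary choice.
    if x >= y:
        return upstream, 0.0
    return 0.0, upstream


def mul_backward(x, y, upstream):
    # Switcher: each input's gradient is scaled by the *other* input's value.
    return upstream * y, upstream * x


add_grads = add_backward(3.0, -4.0, 2.0)
max_grads = max_backward(2.0, -1.0, 2.0)
mul_grads = mul_backward(3.0, -4.0, 2.0)
```

With an upstream gradient of 2.0, the add gate returns (2.0, 2.0), the max gate returns (2.0, 0.0), and the mul gate returns (-8.0, 6.0).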

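SLIDE 37

Gradients add at branches

[Figure: a value that branches to several downstream nodes; the gradients coming back along the branches are summed (+)]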

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 38

Duality in Fprop and Bprop

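[Figure: FPROP vs BPROP. A sum (+) in the forward pass becomes a copy in the backward pass, and a copy in the forward pass becomes a sum (+) in the backward pass.]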
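The duality can be sketched as two small modules, each with a forward and a backward function. The function names are hypothetical and the values are plain Python floats for simplicity.

```python
def copy_forward(x, n):
    # FPROP: one value fans out to n branches.
    return [x] * n


def copy_backward(branch_grads):
    # BPROP of a copy is a sum: gradients from all branches add up.
    return sum(branch_grads)


def sum_forward(xs):
    # FPROP: several values are added into one.
    return sum(xs)


def sum_backward(upstream, n):
    # BPROP of a sum is a copy: every input gets the same upstream gradient.
    return [upstream] * n


branches = copy_forward(3.0, 2)
grad_at_branch_point = copy_backward([0.5, 2.0])
total = sum_forward([1.0, 2.0, 3.0])
grads_to_inputs = sum_backward(0.5, 3)
```

This is the same rule as "gradients add at branches": whenever a value is reused in the forward pass, its gradient is the sum over all the places it was used.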

SLIDE 39

Modularized implementation: forward / backward API

Graph (or Net) object (rough pseudo code)

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
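The pseudo code on this slide did not survive extraction. A minimal sketch of what such a graph object might look like is below: forward visits the gates in topological order, backward visits them in reverse and accumulates gradients. The class names `Graph`, `SinGate`, and `AddGate`, and the tuple layout of `nodes`, are hypothetical.

```python
import math


class SinGate:
    def forward(self, x):
        self.x = x                      # cache the input for backward
        return math.sin(x)

    def backward(self, dz):
        return (dz * math.cos(self.x),)


class AddGate:
    def forward(self, x, y):
        return x + y

    def backward(self, dz):
        return (dz, dz)


class Graph:
    """nodes: list of (gate, input names, output name) in topological order."""

    def __init__(self, nodes, output):
        self.nodes = nodes
        self.output = output

    def forward(self, inputs):
        self.values = dict(inputs)
        for gate, in_names, out_name in self.nodes:
            args = [self.values[n] for n in in_names]
            self.values[out_name] = gate.forward(*args)
        return self.values[self.output]

    def backward(self):
        grads = {name: 0.0 for name in self.values}
        grads[self.output] = 1.0
        for gate, in_names, out_name in reversed(self.nodes):
            in_grads = gate.backward(grads[out_name])
            for name, g in zip(in_names, in_grads):
                grads[name] += g        # gradients add at branches
        return grads


# f(x) = sin(x) + x, so x is used on two branches
graph = Graph([(SinGate(), ("x",), "s"), (AddGate(), ("s", "x"), "f")], "f")
value = graph.forward({"x": 0.0})
grads = graph.backward()
```

The graph object itself knows nothing about sin or addition; it only relies on every gate exposing the same forward / backward pair.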

SLIDE 40

Modularized implementation: forward / backward API

[Figure: multiply gate (*) with inputs x and y and output z (x, y, z are scalars)]

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
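The code on this slide was an image and is not recoverable. For a scalar multiply gate z = x * y, a forward / backward pair might look like the sketch below; the class name `MultiplyGate` and the argument name `dz` (the gradient of the loss with respect to z) are assumptions.

```python
class MultiplyGate:
    def forward(self, x, y):
        # Cache the inputs: backward needs them to form the local gradients.
        self.x = x
        self.y = y
        return x * y

    def backward(self, dz):
        # Chain rule: local gradient times upstream gradient.
        dx = self.y * dz   # dz/dx = y
        dy = self.x * dz   # dz/dy = x
        return dx, dy


gate = MultiplyGate()
z = gate.forward(3.0, -4.0)
dx, dy = gate.backward(2.0)
```

The only state the gate keeps between the two calls is the pair of cached inputs, which is the memory cost reverse mode pays for every node.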

SLIDE 42

Example: Caffe layers

Caffe is licensed under BSD 2-Clause

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 43

Caffe Sigmoid Layer

[Figure: source code of Caffe's sigmoid layer, with the backward pass annotated "* top_diff (chain rule)"]

Caffe is licensed under BSD 2-Clause

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
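The Caffe source shown on the slide is not reproduced in this transcript. The pure-Python sketch below is not Caffe's C++ code; it only illustrates the same forward / backward structure, borrowing the names `bottom`, `top`, and `top_diff` from the slide's annotation and Caffe's usual terminology. The class name `SigmoidLayer` is hypothetical.

```python
import math


class SigmoidLayer:
    def forward(self, bottom):
        # top = sigmoid(bottom), elementwise; cache it for the backward pass
        self.top = [1.0 / (1.0 + math.exp(-v)) for v in bottom]
        return self.top

    def backward(self, top_diff):
        # Local gradient of sigmoid is top * (1 - top);
        # multiplying by top_diff is the chain rule step.
        return [d * t * (1.0 - t) for d, t in zip(top_diff, self.top)]


layer = SigmoidLayer()
top = layer.forward([0.0, 2.0])
bottom_diff = layer.backward([2.0, 1.0])
```

Because the derivative of the sigmoid can be written in terms of its own output, the layer only needs to cache `top`, not the original input.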

SLIDE 44

Figure Credit: Andrea Vedaldi