CS 7643: Deep Learning

Topics: Computational Graphs, Computing Gradients, Forward mode vs Reverse mode AD
Dhruv Batra, Georgia Tech


SLIDE 1

CS 7643: Deep Learning

Dhruv Batra Georgia Tech

Topics:

– Computational Graphs
– Notation + example
– Computing Gradients
– Forward mode vs Reverse mode AD

SLIDE 2

Administrativia

  • HW1 Released

– Due: 09/22

  • PS1 Solutions

– Coming soon

(C) Dhruv Batra 2

SLIDE 3

Project

  • Goal

– Chance to try Deep Learning
– Combine with other classes / research / credits / anything

  • You have our blanket permission
  • Extra credit for shooting for a publication

– Encouraged to apply to your research (computer vision, NLP, robotics, …)
– Must be done this semester.

  • Main categories

– Application/Survey

  • Compare a bunch of existing algorithms on a new application domain of your interest

– Formulation/Development

  • Formulate a new model or algorithm for a new or old problem

– Theory

  • Theoretically analyze an existing algorithm

SLIDE 4

Administrativia

  • Project Teams Google Doc

– https://docs.google.com/spreadsheets/d/1AaXY0JE4lAbHvoDaWlc9zsmfKMyuGS39JAn9dpeXhhQ/edit#gid=0
– Project Title
– 1-3 sentence project summary (TL;DR)
– Team member names + GT IDs

SLIDE 5

Recap of last time

SLIDE 6

How do we compute gradients?

  • Manual Differentiation
  • Symbolic Differentiation
  • Numerical Differentiation
  • Automatic Differentiation

– Forward mode AD
– Reverse mode AD

  • aka “backprop”

SLIDE 7

Computational Graph

Any DAG of differentiable modules is allowed!

Slide Credit: Marc'Aurelio Ranzato

SLIDE 8

Directed Acyclic Graphs (DAGs)

  • Exactly what the name suggests

– Directed edges
– No (directed) cycles
– Underlying undirected cycles okay

SLIDE 9

Directed Acyclic Graphs (DAGs)

  • Concept

– Topological Ordering
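A topological ordering lists every node after all of its parents, which is exactly the order in which a computational graph's nodes can be evaluated. As a sketch (not from the slides), Kahn's queue-based algorithm finds one; the node and edge names for the running example f(x1, x2) = x1x2 + sin(x1) are my own labels:

```python
from collections import deque

def topological_order(nodes, edges):
    """Return one topological ordering of a DAG given as (parent, child) edges."""
    indegree = {n: 0 for n in nodes}
    children = {n: [] for n in nodes}
    for u, v in edges:
        children[u].append(v)
        indegree[v] += 1
    ready = deque(n for n in nodes if indegree[n] == 0)  # start from the sources
    order = []
    while ready:
        u = ready.popleft()
        order.append(u)
        for v in children[u]:          # removing u may free up its children
            indegree[v] -= 1
            if indegree[v] == 0:
                ready.append(v)
    if len(order) != len(nodes):
        raise ValueError("graph has a directed cycle")
    return order

# The running example: w1 = sin(x1), w2 = x1*x2, w3 = w1 + w2
nodes = ["x1", "x2", "w1", "w2", "w3"]
edges = [("x1", "w1"), ("x1", "w2"), ("x2", "w2"), ("w1", "w3"), ("w2", "w3")]
print(topological_order(nodes, edges))
```

Any ordering the function returns is valid for a forward pass; reverse mode AD walks the same ordering backwards.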

SLIDE 10

Directed Acyclic Graphs (DAGs)

SLIDE 11

Computational Graphs

  • Notation #1


f(x1, x2) = x1x2 + sin(x1)

SLIDE 12

Computational Graphs

  • Notation #2


f(x1, x2) = x1x2 + sin(x1)

SLIDE 13

Example


f(x1, x2) = x1x2 + sin(x1)

[Graph: x1 → sin( ); x1, x2 → * ; both feed the + node]
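Evaluating the graph means computing each node in topological order. A minimal sketch for the example above, with the intermediate names w1, w2, w3 matching the two-node notation:

```python
import math

# Forward evaluation of f(x1, x2) = x1*x2 + sin(x1), one node at a time.
def f_forward(x1, x2):
    w1 = math.sin(x1)   # sin node
    w2 = x1 * x2        # multiply node
    w3 = w1 + w2        # add node: the output
    return w3

print(f_forward(2.0, 3.0))
```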

SLIDE 14

Logistic Regression as a Cascade


Given a library of simple functions, compose them into a complicated function:

−log( 1 / (1 + e^(−wᵀx)) )

Slide Credit: Marc'Aurelio Ranzato, Yann LeCun
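The cascade idea can be sketched directly: build the logistic regression loss −log(1/(1+e^(−wᵀx))) out of three simple library functions. The helper names here (dot, sigmoid, neg_log) are my own, not from the slides:

```python
import math

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))  # w.x

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))            # 1 / (1 + e^-z)

def neg_log(p):
    return -math.log(p)

def logistic_loss(w, x):
    # Cascade: dot product -> sigmoid -> negative log
    return neg_log(sigmoid(dot(w, x)))
```

With w = 0 the sigmoid outputs 0.5, so the loss is log 2, a handy sanity check.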

SLIDE 15

Forward mode vs Reverse Mode

  • Key Computations

SLIDE 16

Forward mode AD

SLIDE 17

Reverse mode AD

SLIDE 18

Example: Forward mode AD


f(x1, x2) = x1x2 + sin(x1)


SLIDE 19



ẇ1 = cos(x1)·ẋ1
ẇ2 = ẋ1·x2 + x1·ẋ2
ẇ3 = ẇ1 + ẇ2

Example: Forward mode AD

f(x1, x2) = x1x2 + sin(x1)

SLIDE 20



ẇ1 = cos(x1)·ẋ1
ẇ2 = ẋ1·x2 + x1·ẋ2
ẇ3 = ẇ1 + ẇ2

Example: Forward mode AD

f(x1, x2) = x1x2 + sin(x1)
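The forward-mode rules above propagate a tangent ẇ alongside every value, in the same topological sweep as the forward pass. A minimal sketch: seeding (ẋ1, ẋ2) = (1, 0) yields df/dx1 in one sweep (a second sweep with (0, 1) would be needed for df/dx2):

```python
import math

# Forward-mode AD on f(x1, x2) = x1*x2 + sin(x1):
# carry (value, tangent) pairs through each node.
def f_forward_mode(x1, x2, x1_dot, x2_dot):
    w1, w1_dot = math.sin(x1), math.cos(x1) * x1_dot        # sin rule
    w2, w2_dot = x1 * x2, x1_dot * x2 + x1 * x2_dot         # product rule
    w3, w3_dot = w1 + w2, w1_dot + w2_dot                   # sum rule
    return w3, w3_dot

value, df_dx1 = f_forward_mode(2.0, 3.0, 1.0, 0.0)
# Analytically, df/dx1 = x2 + cos(x1)
```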

SLIDE 21

Example: Reverse mode AD


f(x1, x2) = x1x2 + sin(x1)


SLIDE 22


Example: Reverse mode AD

f(x1, x2) = x1x2 + sin(x1)


w̄3 = 1
w̄1 = w̄3
w̄2 = w̄3
x̄1 = w̄1·cos(x1) + w̄2·x2 (the two contributions to x̄1 add)
x̄2 = w̄2·x1
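Reverse mode runs the forward pass first, then sweeps the adjoints w̄ backwards from the output, obtaining both partials in a single backward sweep. A minimal sketch:

```python
import math

# Reverse-mode AD on f(x1, x2) = x1*x2 + sin(x1)
def f_reverse_mode(x1, x2):
    # forward pass (store intermediates)
    w1 = math.sin(x1)
    w2 = x1 * x2
    w3 = w1 + w2
    # backward pass (adjoints), reverse topological order
    w3_bar = 1.0
    w1_bar = w3_bar                                  # add gate distributes
    w2_bar = w3_bar
    x1_bar = w1_bar * math.cos(x1) + w2_bar * x2     # branch contributions add
    x2_bar = w2_bar * x1
    return w3, x1_bar, x2_bar

val, dx1, dx2 = f_reverse_mode(2.0, 3.0)
# Analytically: df/dx1 = x2 + cos(x1), df/dx2 = x1
```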

SLIDE 23

Forward Pass vs Forward mode AD vs Reverse Mode AD


[Three copies of the same graph, annotated side by side:
Forward pass: w1 = sin(x1), w2 = x1·x2, w3 = w1 + w2
Forward mode AD: ẇ1 = cos(x1)·ẋ1, ẇ2 = ẋ1·x2 + x1·ẋ2, ẇ3 = ẇ1 + ẇ2
Reverse mode AD: w̄3 = 1, w̄1 = w̄3, w̄2 = w̄3, x̄1 = w̄1·cos(x1) + w̄2·x2, x̄2 = w̄2·x1]

f(x1, x2) = x1x2 + sin(x1)

SLIDE 24

Forward mode vs Reverse Mode

  • What are the differences?
  • Which one is more memory efficient (less storage)?

– Forward or backward?

SLIDE 25

Forward mode vs Reverse Mode

  • What are the differences?
  • Which one is more memory efficient (less storage)?

– Forward or backward?

  • Which one is faster to compute?

– Forward or backward?

SLIDE 26

Plan for Today

  • (Finish) Computing Gradients

– Forward mode vs Reverse mode AD
– Patterns in backprop
– Backprop in FC+ReLU NNs

  • Convolutional Neural Networks

SLIDE 27

Patterns in backward flow

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 28

Patterns in backward flow

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 29

add gate: gradient distributor

Patterns in backward flow

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 30

add gate: gradient distributor Q: What is a max gate?

Patterns in backward flow

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 31

add gate: gradient distributor max gate: gradient router

Patterns in backward flow

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 32

add gate: gradient distributor max gate: gradient router Q: What is a mul gate?

Patterns in backward flow

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 33

add gate: gradient distributor max gate: gradient router mul gate: gradient switcher

Patterns in backward flow

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
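The three gate patterns can be written down as local backward rules, each taking the upstream gradient `dout` and returning the gradients on the gate's two inputs. This is a sketch of the idea, not any particular framework's API:

```python
def add_backward(dout):
    # Distributor: the add gate passes the same gradient to both inputs.
    return dout, dout

def max_backward(x, y, dout):
    # Router: the max gate sends the whole gradient to the winning input.
    return (dout, 0.0) if x >= y else (0.0, dout)

def mul_backward(x, y, dout):
    # Switcher: the mul gate scales dout by the *other* input's value.
    return dout * y, dout * x
```

When a value fans out to several gates, the gradients flowing back along each branch simply add, which is the next slide's point.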

SLIDE 34


Gradients add at branches

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 35

Duality in Fprop and Bprop


A SUM node in FPROP becomes a COPY node in BPROP, and vice versa.

SLIDE 36

Graph (or Net) object (rough pseudocode)

Modularized implementation: forward / backward API

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 37

Modularized implementation: forward / backward API

(x, y, z are scalars)

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
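The forward / backward API can be sketched as a gate class: `forward` caches its inputs, `backward` applies the chain rule with them. This is a minimal sketch in the CS231n style, not a verbatim copy of the slide's code:

```python
class MultiplyGate:
    """One graph node exposing the forward / backward API (scalar x, y)."""

    def forward(self, x, y):
        self.x, self.y = x, y       # cache inputs for the backward pass
        return x * y                # z = x * y

    def backward(self, dz):
        dx = dz * self.y            # chain rule: dL/dx = dL/dz * dz/dx
        dy = dz * self.x
        return dx, dy

gate = MultiplyGate()
z = gate.forward(-2.0, 5.0)
dx, dy = gate.backward(1.0)
```

A Graph object would just call `forward` on its gates in topological order and `backward` in the reverse order.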

SLIDE 38

Modularized implementation: forward / backward API

(x, y, z are scalars)

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 39


Example: Caffe layers

Caffe is licensed under BSD 2-Clause

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 40

Caffe Sigmoid Layer

backward: multiply the local gradient by top_diff (chain rule)

Caffe is licensed under BSD 2-Clause

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
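The sigmoid layer's two passes can be sketched in numpy (Caffe's actual layer is C++; this just mirrors the pattern, with `top_diff` playing the upstream-gradient role). Since σ'(z) = σ(z)(1−σ(z)), the backward pass only needs the cached forward output:

```python
import numpy as np

def sigmoid_forward(bottom_data):
    # top_data = 1 / (1 + e^-x)
    return 1.0 / (1.0 + np.exp(-bottom_data))

def sigmoid_backward(top_data, top_diff):
    # bottom_diff = top_diff * sigma(x) * (1 - sigma(x)), via the cached output
    return top_diff * top_data * (1.0 - top_data)

x = np.array([0.0, 2.0])
y = sigmoid_forward(x)
dx = sigmoid_backward(y, np.ones_like(y))
```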

SLIDE 41


SLIDE 42


SLIDE 43

Key Computation in DL: Forward-Prop


Slide Credit: Marc'Aurelio Ranzato, Yann LeCun

SLIDE 44

Key Computation in DL: Back-Prop


Slide Credit: Marc'Aurelio Ranzato, Yann LeCun

SLIDE 45

f(x) = max(0,x) (elementwise)

4096-d input vector → 4096-d output vector

Jacobian of ReLU

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 46


f(x) = max(0,x) (elementwise)

4096-d input vector → 4096-d output vector

Q: what is the size of the Jacobian matrix?

Jacobian of ReLU

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 47


f(x) = max(0,x) (elementwise)

4096-d input vector → 4096-d output vector

Q: what is the size of the Jacobian matrix? [4096 x 4096!]

Jacobian of ReLU

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 48

f(x) = max(0,x) (elementwise)

4096-d input vector → 4096-d output vector

Q: what is the size of the Jacobian matrix? [4096 x 4096!]

In practice we process an entire minibatch (e.g. 100) of examples at one time, i.e. the Jacobian would technically be a [409,600 x 409,600] matrix :\

Jacobian of ReLU

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
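The reason the huge Jacobian is harmless: for elementwise ReLU it is diagonal (output i depends only on input i), so backprop never materializes it. Multiplying upstream gradients by that diagonal reduces to zeroing them where the input was negative. A minimal numpy sketch:

```python
import numpy as np

def relu_forward(x):
    return np.maximum(0.0, x)       # elementwise max(0, x)

def relu_backward(x, dout):
    # Equivalent to multiplying dout by the (diagonal) ReLU Jacobian:
    # pass the gradient where x > 0, zero it elsewhere.
    return dout * (x > 0)

x = np.array([-1.0, 2.0, -3.0, 4.0])
dx = relu_backward(x, np.ones_like(x))
```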

SLIDE 49

Q: what is the size of the Jacobian matrix? [4096 x 4096!] Q2: what does it look like?

f(x) = max(0,x) (elementwise)

4096-d input vector → 4096-d output vector

Jacobian of ReLU

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 50

Jacobians of FC-Layer


SLIDE 51

Jacobians of FC-Layer


SLIDE 52

Jacobians of FC-Layer


SLIDE 53

Convolutional Neural Networks

(without the brain stuff)

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 54


Example: 200x200 image 40K hidden units ~2B parameters!!!

  • Spatial correlation is local
  • Waste of resources + we do not have enough training samples anyway

Fully Connected Layer

Slide Credit: Marc'Aurelio Ranzato

SLIDE 55

Example: 200x200 image, 40K hidden units, filter size 10x10 → 4M parameters

Note: This parameterization is good when the input image is registered (e.g., face recognition).

Locally Connected Layer

Slide Credit: Marc'Aurelio Ranzato

SLIDE 56

STATIONARITY? Statistics are similar at different locations.

Example: 200x200 image, 40K hidden units, filter size 10x10 → 4M parameters

Note: This parameterization is good when the input image is registered (e.g., face recognition).

Locally Connected Layer

Slide Credit: Marc'Aurelio Ranzato

SLIDE 57


Share the same parameters across different locations (assuming input is stationary): Convolutions with learned kernels

Convolutional Layer

Slide Credit: Marc'Aurelio Ranzato

SLIDE 58

Convolutions for mathematicians


SLIDE 59


"Convolution of box signal with itself2" by Brian Amberg (Convolution_of_box_signal_with_itself.gif); derivative work: Tinos. Licensed under CC BY-SA 3.0 via Commons: https://commons.wikimedia.org/wiki/File:Convolution_of_box_signal_with_itself2.gif#/media/File:Convolution_of_box_signal_with_itself2.gif

SLIDE 60

Convolutions for computer scientists


SLIDE 61

Convolutions for programmers


SLIDE 62

Convolution Explained

  • http://setosa.io/ev/image-kernels/
  • https://github.com/bruckner/deepViz

SLIDE 63

Convolutional Layer

Slide Credit: Marc'Aurelio Ranzato


SLIDES 64-77

Convolutional Layer (animation frames repeating the same text)

Slide Credit: Marc'Aurelio Ranzato

SLIDE 78

Mathieu et al. “Fast training of CNNs through FFTs” ICLR 2014

Convolutional Layer

Slide Credit: Marc'Aurelio Ranzato

SLIDE 79

[image] * [1 0 1; 1 0 1; 1 0 1] = [filtered image]

Convolutional Layer

Slide Credit: Marc'Aurelio Ranzato

SLIDE 80

Learn multiple filters.

E.g.: 200x200 image, 100 filters, filter size 10x10 → 10K parameters

Convolutional Layer

Slide Credit: Marc'Aurelio Ranzato
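The parameter counts quoted across the last few slides are plain arithmetic (biases ignored); a quick sketch makes the three regimes comparable:

```python
# Parameter counts for the 200x200 example (biases ignored).
image = 200 * 200          # input pixels
hidden = 40_000            # hidden units

fully_connected = image * hidden        # every unit sees every pixel (~2B on the slide)
locally_connected = hidden * (10 * 10)  # every unit sees its own 10x10 patch
convolutional = 100 * (10 * 10)         # 100 shared 10x10 filters

print(fully_connected, locally_connected, convolutional)
```

Weight sharing drops the count from billions to thousands, which is the whole argument for the convolutional layer.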

SLIDE 81


Fully Connected Layer

32x32x3 image -> stretch to 3072 x 1

weights: 10 x 3072; input: 3072 x 1; activation: 10 x 1

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 82


Fully Connected Layer

32x32x3 image -> stretch to 3072 x 1

weights: 10 x 3072; input: 3072 x 1; activation: 10 x 1

1 number: the result of taking a dot product between a row of W and the input (a 3072-dimensional dot product)


Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
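The fully connected layer described above is two numpy operations: stretch the 32x32x3 image into a 3072-vector, then multiply by a 10x3072 weight matrix, so each activation is a 3072-dimensional dot product with one row of W. A sketch with random data:

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.standard_normal((32, 32, 3))   # 32x32x3 input image
W = rng.standard_normal((10, 3072))        # 10 x 3072 weight matrix

x = image.reshape(3072)                    # "stretch to 3072 x 1"
activation = W @ x                         # 10 numbers, one per row of W
```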

SLIDE 83


Convolutional Layer

SLIDE 84


Convolutional Layer

SLIDE 85


Convolution Layer

32x32x3 image -> preserve spatial structure

width height depth

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 86


Convolution Layer

5x5x3 filter 32x32x3 image

Convolve the filter with the image i.e. “slide over the image spatially, computing dot products”

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 87


Convolution Layer

5x5x3 filter 32x32x3 image

Convolve the filter with the image i.e. “slide over the image spatially, computing dot products” Filters always extend the full depth of the input volume

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 88


Convolution Layer

32x32x3 image 5x5x3 filter

1 number: the result of taking a dot product between the filter and a small 5x5x3 chunk of the image (i.e. 5*5*3 = 75-dimensional dot product + bias)

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 89


Convolution Layer

32x32x3 image 5x5x3 filter

convolve (slide) over all spatial locations → one 28x28 activation map

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
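Sliding the 5x5x3 filter over the 32x32x3 image (stride 1, no padding) gives a (32−5+1) = 28x28 activation map, each entry a 75-dimensional dot product plus a bias. A naive loop sketch (like most deep learning libraries, this computes cross-correlation, i.e. the filter is not flipped; the function name is my own):

```python
import numpy as np

def conv_single_filter(image, filt, bias=0.0):
    """Slide one fxfxC filter over an HxWxC image, stride 1, no padding."""
    H, W, _ = image.shape
    f = filt.shape[0]
    out = np.zeros((H - f + 1, W - f + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + f, j:j + f, :]          # fxfxC chunk
            out[i, j] = np.sum(patch * filt) + bias     # f*f*C-dim dot product
    return out

image = np.random.default_rng(0).standard_normal((32, 32, 3))
filt = np.ones((5, 5, 3))
amap = conv_single_filter(image, filt)
```

Running 6 such filters and stacking the maps yields the 28x28x6 output volume on the next slide.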

SLIDE 90


Convolution Layer

32x32x3 image 5x5x3 filter

convolve (slide) over all spatial locations → a second 28x28 activation map

consider a second, green filter

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 91

Convolution Layer: 6 activation maps, each 28x28

For example, if we had 6 5x5 filters, we’ll get 6 separate activation maps: We stack these up to get a “new image” of size 28x28x6!

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n