CS 4803 / 7643: Deep Learning (lecture slides)
(C) Dhruv Batra & Zsolt Kira


SLIDE 1

CS 4803 / 7643: Deep Learning

Zsolt Kira, Georgia Tech

Topics:

– Specifying Layers
– Forward & Backward auto-differentiation
– (Beginning of) Convolutional neural networks

SLIDE 2

Administrivia

  • PS0 released

– mean of 20.7
– standard deviation of 3.4
– median of 21
– max of 25
– See me if you did not pass

  • PS1/HW1 out
  • Start thinking about project topics/teams

– More details on project next time

SLIDE 3

Recap from last time

SLIDE 4

Gradient Descent Pseudocode

for i in {0, …, num_epochs}:
    for (x, y) in data:
        compute the gradient of the loss at (x, y) w.r.t. the weights w
        w = w - alpha * gradient

Some design decisions:

  • How many examples to use to calculate gradient per iteration?
  • What should alpha (learning rate) be?
  • Should it be constant throughout?
  • How many epochs to run for?
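A minimal runnable mini-batch SGD sketch of these choices in numpy; loss_grad is a hypothetical function (not from the slides) that returns the gradient of the loss on a batch w.r.t. the weights:

    import numpy as np

    def sgd(w, x, y, loss_grad, alpha=0.01, num_epochs=10, batch_size=32):
        n = x.shape[0]
        for epoch in range(num_epochs):               # how many epochs to run for
            idx = np.random.permutation(n)            # reshuffle each epoch
            for start in range(0, n, batch_size):     # batch_size = examples per gradient step
                b = idx[start:start + batch_size]
                w = w - alpha * loss_grad(w, x[b], y[b])  # constant learning rate in this sketch
        return w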
SLIDE 5

Computational Graph

Any DAG of differentiable modules is allowed!

Slide Credit: Marc'Aurelio Ranzato

SLIDE 6

Key Computation: Back-Prop


Slide Credit: Marc'Aurelio Ranzato, Yann LeCun

SLIDE 7

Neural Network Training

  • Step 1: Compute Loss on mini-batch [F-Pass]

Slide Credit: Marc'Aurelio Ranzato, Yann LeCun

SLIDE 8

Neural Network Training

  • Step 1: Compute Loss on mini-batch [F-Pass]
  • Step 2: Compute gradients wrt parameters [B-Pass]

Slide Credit: Marc'Aurelio Ranzato, Yann LeCun
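In a framework like PyTorch, these two passes (plus the weight update from the gradient descent slide) look like the following sketch; model, loss_fn, and optimizer are assumed to already exist:

    import torch

    def train_step(model, loss_fn, optimizer, x, y):
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)   # Step 1: forward pass computes the loss on the mini-batch
        loss.backward()               # Step 2: backward pass computes gradients wrt parameters
        optimizer.step()              # gradient update (e.g., SGD)
        return loss.item()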

SLIDE 9

General Flow Graphs

“Deep Learning” book, Bengio

SLIDE 10

SLIDE 11

SLIDE 12

Jacobian of ReLU

g(x) = max(0, x) (elementwise)

4096-d input vector -> 4096-d output vector

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 13

Jacobian of ReLU

g(x) = max(0, x) (elementwise)

4096-d input vector -> 4096-d output vector

Q: what is the size of the Jacobian matrix?

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 14

Jacobian of ReLU

g(x) = max(0, x) (elementwise)

4096-d input vector -> 4096-d output vector

Q: what is the size of the Jacobian matrix? [4096 x 4096!]

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 15

Jacobian of ReLU

g(x) = max(0, x) (elementwise)

4096-d input vector -> 4096-d output vector

Q: what is the size of the Jacobian matrix? [4096 x 4096!]
Q2: what does it look like?

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
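Q2's answer, for reference: because ReLU acts elementwise, the Jacobian is diagonal, with a 1 wherever the input is positive and 0 elsewhere. A small numpy check (4-d instead of 4096-d for readability):

    import numpy as np

    x = np.array([1.5, -2.0, 0.3, -0.1])   # stand-in for the 4096-d input
    J = np.diag((x > 0).astype(float))     # Jacobian of ReLU: diagonal 0/1 matrix
    print(J)
    # [[1. 0. 0. 0.]
    #  [0. 0. 0. 0.]
    #  [0. 0. 1. 0.]
    #  [0. 0. 0. 0.]]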

SLIDE 16

Plan for Today

  • Specifying Layers
  • Forward & Backward auto-differentiation
  • (Beginning of) Convolutional neural networks

SLIDE 17

Deep Learning = Differentiable Programming

  • Computation = Graph

– Input = Data + Parameters
– Output = Loss
– Scheduling = Topological ordering

  • What do we need to do?

– Generic code for representing the graph of modules
– Specify modules (both forward and backward function)

SLIDE 18

Graph (or Net) object (rough pseudocode)

Modularized implementation: forward / backward API

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
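The pseudocode figure itself is not preserved; a rough Python sketch of such a Graph object, with hypothetical graph/gate helpers (names are illustrative, not an actual framework API):

    class ComputationalGraph:
        """Rough pseudocode: run modules in topological order."""
        def forward(self, inputs):
            self.graph.set_inputs(inputs)                    # hand data to the input nodes
            for gate in self.graph.nodes_topologically_sorted():
                gate.forward()                               # each module computes its output
            return self.graph.loss_node.output
        def backward(self):
            for gate in reversed(self.graph.nodes_topologically_sorted()):
                gate.backward()                              # chain rule, in reverse order
            return self.graph.input_gradients()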

SLIDE 19

Modularized implementation: forward / backward API

[Figure: a multiply gate z = x * y, where x, y, z are scalars; code figure not preserved]

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
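A sketch of what that code figure contains, reconstructed in Python; caching the inputs during forward for use in backward is the key point:

    class MultiplyGate:
        """Sketch of the forward/backward API for z = x * y (scalars)."""
        def forward(self, x, y):
            self.x, self.y = x, y    # cache inputs for the backward pass
            return x * y
        def backward(self, dz):
            dx = self.y * dz         # dz/dx = y, scaled by the upstream gradient
            dy = self.x * dz         # dz/dy = x
            return dx, dy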

SLIDE 21


Example: Caffe layers

Caffe is licensed under BSD 2-Clause

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 22

Caffe Sigmoid Layer

Backward pass highlight: bottom_diff = top_data * (1 - top_data) * top_diff (chain rule, since sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x)))

Caffe is licensed under BSD 2-Clause

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
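A minimal Python analogue of the layer, assuming Caffe's convention that "top" holds the layer output and "diff" the upstream gradient (a sketch, not Caffe's actual C++):

    import numpy as np

    class SigmoidLayer:
        def forward(self, bottom_data):
            self.top_data = 1.0 / (1.0 + np.exp(-bottom_data))  # sigmoid(x)
            return self.top_data
        def backward(self, top_diff):
            # chain rule: d(sigmoid)/dx = sigmoid * (1 - sigmoid)
            return self.top_data * (1.0 - self.top_data) * top_diff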

SLIDE 23

Deep Learning = Differentiable Programming

  • Computation = Graph

– Input = Data + Parameters
– Output = Loss
– Scheduling = Topological ordering

  • Auto-Diff

– A family of algorithms for implementing the chain rule on computation graphs

SLIDE 24

Forward mode vs Reverse Mode

  • Key Computations

SLIDE 25

Forward mode AD

[Figure: propagating derivatives forward through a module g; diagram not preserved]

SLIDE 26

Reverse mode AD

[Figure: propagating gradients backward through a module g; diagram not preserved]

SLIDE 27

Example: Forward mode AD

[Computational graph with inputs x1, x2 and nodes *, sin(·), +; figure not preserved]

SLIDE 30

Example: Forward mode AD

Q: What happens if there’s another input variable x3?

SLIDE 31

Example: Forward mode AD

Q: What happens if there’s another input variable x3?
A: more sophisticated graph; d “forward props” for d variables

SLIDE 32

Example: Forward mode AD

Q: What happens if there’s another output variable f2?

SLIDE 33

Example: Forward mode AD

Q: What happens if there’s another output variable f2?
A: more sophisticated graph; single “forward prop”

SLIDE 34

Example: Reverse mode AD

[Computational graph with inputs x1, x2 and nodes *, sin(·), +; figure not preserved]

SLIDE 36

Gradients add at branches: when a variable feeds multiple downstream nodes, its gradient is the sum of the gradients flowing back from each use.

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 37

Example: Reverse mode AD

Q: What happens if there’s another input variable x3?

SLIDE 38

Example: Reverse mode AD

Q: What happens if there’s another input variable x3?
A: more sophisticated graph; single “backward prop”

SLIDE 39

Example: Reverse mode AD

Q: What happens if there’s another output variable f2?

SLIDE 40

Example: Reverse mode AD

Q: What happens if there’s another output variable f2?
A: more sophisticated graph; c “backward props” for c output variables

SLIDE 41

Forward mode vs Reverse Mode

  • x -> Graph -> L
  • Intuition of Jacobian

SLIDE 42

Forward mode vs Reverse Mode

  • What are the differences?
  • Which one is faster to compute?

– Forward or backward?

SLIDE 43

Forward mode vs Reverse Mode

  • What are the differences?
  • Which one is faster to compute?

– Forward or backward?

  • Which one is more memory efficient (less storage)?

– Forward or backward?

[The two computational graphs from the AD examples, shown side by side; figures not preserved]

(For reference: with one scalar output and many inputs, reverse mode computes all input gradients in a single backward pass, which is why deep learning frameworks use it; forward mode needs one pass per input, but does not have to store the intermediate activations, so it can be lighter on memory.)

SLIDE 44

Practical Note 2: Software Frameworks

[Figure: landscape of deep learning software frameworks; annotation: “A few weeks ago! +Keras”]

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 45

PyTorch

SLIDE 46

SLIDE 47

Plan for Today (Cont.)

  • Specifying Layers
  • Forward & Backward auto-differentiation
  • (Beginning of) Convolutional neural networks

– What is a convolution?
– FC vs Conv Layers

SLIDE 48

Recall: Linear Classifier

Image: array of 32x32x3 numbers (3072 numbers total)
Parameters, or weights: W

f(x, W) = Wx + b -> 10 numbers giving class scores

Sizes: x is 3072x1, W is 10x3072, b is 10x1, scores are 10x1

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 49

Example with an image with 4 pixels, and 3 classes (cat/dog/ship)

Input image (2x2): pixels 56, 231, 24, 2; stretch pixels into a column:

x = [56, 231, 24, 2]

W =
  [ 0.2  -0.5   0.1   2.0 ]
  [ 1.5   1.3   2.1   0.0 ]
  [ 0.0   0.25  0.2  -0.3 ]

b = [1.1, 3.2, -1.2]

Scores (as shown on the slide): -96.8 (cat), 437.9 (dog), 61.95 (ship)

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
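The same computation in numpy, with the values transcribed from the slide:

    import numpy as np

    W = np.array([[0.2, -0.5, 0.1,  2.0],
                  [1.5,  1.3, 2.1,  0.0],
                  [0.0, 0.25, 0.2, -0.3]])
    x = np.array([56., 231., 24., 2.])   # the four pixels as a column
    b = np.array([1.1, 3.2, -1.2])
    print(W @ x + b)                     # one score per class: cat, dog, ship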

SLIDE 50

Recall: (Fully-Connected) Neural networks

(Before) Linear score function: f = Wx
(Now) 2-layer Neural Network: f = W2 max(0, W1x)

x (3072-d) -> h (100-d, via W1) -> s (10-d, via W2)

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 51

Convolutional Neural Networks

(without the brain stuff)

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 52

Fully Connected Layer

Example: 200x200 image, 40K hidden units -> ~2B parameters!!!

  • Spatial correlation is local
  • Waste of resources + we do not have enough training samples anyway...

Slide Credit: Marc'Aurelio Ranzato
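Check (full connectivity assumed): 200 × 200 = 40,000 inputs, each connected to all 40,000 hidden units, gives 40,000 × 40,000 = 1.6 × 10⁹, roughly 2B weights.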

SLIDE 53

Locally Connected Layer

Example: 200x200 image, 40K hidden units, “filter” size 10x10 -> 4M parameters

Note: This parameterization is good when the input image is registered (e.g., face recognition).

Slide Credit: Marc'Aurelio Ranzato
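Check: each of the 40,000 hidden units now sees only a 10 × 10 patch, so 40,000 × 100 = 4 × 10⁶ = 4M weights.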

SLIDE 54

Locally Connected Layer

STATIONARITY? Statistics similar at all locations

Slide Credit: Marc'Aurelio Ranzato

SLIDE 55

Convolutional Layer

Share the same parameters across different locations (assuming input is stationary): convolutions with learned kernels

Slide Credit: Marc'Aurelio Ranzato

SLIDE 56

What filter to use?

SLIDE 57

Discrete convolution

  • Discrete Convolution!
  • Very similar to correlation but associative

[Figure: 1D and 2D convolution of an input with a filter; images not preserved]

SLIDE 58

A note on sizes

Input: N x N; filter: m x m; output: (N-m+1) x (N-m+1)

MATLAB to the rescue!

  • conv2(x,w, ‘valid’)
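A Python equivalent, as a sketch using scipy (mode='valid' matches conv2's 'valid'):

    import numpy as np
    from scipy.signal import convolve2d

    x = np.random.randn(32, 32)          # N x N input
    w = np.random.randn(5, 5)            # m x m filter
    y = convolve2d(x, w, mode='valid')   # true convolution (filter flipped)
    print(y.shape)                       # (28, 28) = (N-m+1, N-m+1)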
SLIDE 59

Convolutions!

  • Math vs. CS vs. programming viewpoints

SLIDE 60

Convolutions for mathematicians

  • An operation on two functions to produce a third function
  • E.g. an input signal and a kernel or weighting function

(C) Peter Anderson

SLIDE 61

Convolutions for mathematicians

  • One dimension: (f * g)(t) = Σ_a f(a) g(t - a)
  • Two dimensions: (f * g)(x1, x2) = Σ_{a1} Σ_{a2} f(a1, a2) g(x1 - a1, x2 - a2)

(C) Peter Anderson
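A quick numeric check of the one-dimensional formula with numpy:

    import numpy as np

    f = np.array([1., 2., 3.])
    g = np.array([1., 0., -1.])
    print(np.convolve(f, g))   # [ 1.  2.  2. -2. -3.]: each output sums f(a)*g(t-a)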

SLIDE 62

Convolutions for CS/Programmers

SLIDE 63-78

Convolutional Layer

[Sixteen consecutive figure slides, likely an animation of the convolution; images not preserved]

Slide Credit: Marc'Aurelio Ranzato

SLIDE 79

Convolution Explained

  • http://setosa.io/ev/image-kernels/
  • https://github.com/bruckner/deepViz

SLIDE 80

Convolutional Layer

Learn multiple filters.

E.g.: 200x200 image, 100 filters, filter size 10x10 -> 10K parameters

Slide Credit: Marc'Aurelio Ranzato
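Check: the parameter count no longer depends on the image size: 100 filters × (10 × 10) weights each = 10,000 parameters (plus one bias per filter, if used).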

SLIDE 81

Convolution Layer

32x32x3 image -> preserve spatial structure (width 32, height 32, depth 3)

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 82

Convolution Layer

32x32x3 image, 5x5x3 filter

Convolve the filter with the image, i.e. “slide over the image spatially, computing dot products”

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 83

Convolution Layer

32x32x3 image, 5x5x3 filter

Convolve the filter with the image, i.e. “slide over the image spatially, computing dot products”. Filters always extend the full depth of the input volume.

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 84

Convolution Layer

32x32x3 image, 5x5x3 filter

1 number: the result of taking a dot product between the filter and a small 5x5x3 chunk of the image (i.e. 5*5*3 = 75-dimensional dot product + bias)

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 85

Convolution Layer

32x32x3 image, 5x5x3 filter

Convolve (slide) over all spatial locations -> one 28x28 activation map

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 86

Convolution Layer

32x32x3 image, 5x5x3 filter

Consider a second, green filter: convolve (slide) over all spatial locations -> a second 28x28 activation map

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 87

Convolution Layer

For example, if we had 6 5x5 filters, we’ll get 6 separate 28x28 activation maps. We stack these up to get a “new image” of size 28x28x6!

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
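The shapes above can be checked in PyTorch; a sketch (any six 5x5 filters over a 3-channel 32x32 input):

    import torch
    import torch.nn as nn

    conv = nn.Conv2d(in_channels=3, out_channels=6, kernel_size=5)  # six 5x5x3 filters
    x = torch.randn(1, 3, 32, 32)     # one 32x32x3 image (channels first)
    print(conv(x).shape)              # torch.Size([1, 6, 28, 28]) = six 28x28 activation maps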

SLIDE 88

Im2Col


Figure Credit: https://petewarden.com/2015/04/20/why-gemm-is-at-the-heart-of-deep-learning/
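Im2col lays each filter-sized patch out as a column so the convolution becomes one big matrix multiply; a minimal single-channel numpy sketch (an illustrative helper, not a framework's actual implementation):

    import numpy as np

    def im2col(x, m):
        """Collect every m x m patch of 2D input x as a column."""
        N = x.shape[0]
        out = N - m + 1
        cols = np.empty((m * m, out * out))
        for i in range(out):
            for j in range(out):
                cols[:, i * out + j] = x[i:i + m, j:j + m].ravel()
        return cols

    x = np.random.randn(32, 32)
    w = np.random.randn(5, 5)
    y = (w.ravel() @ im2col(x, 5)).reshape(28, 28)   # convolution (as correlation) via one GEMM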

SLIDE 89

General Matrix Multiply (GEMM)


Figure Credit: https://petewarden.com/2015/04/20/why-gemm-is-at-the-heart-of-deep-learning/

SLIDE 90

Time Distribution of AlexNet


Figure Credit: Yangqing Jia, PhD Thesis

SLIDE 91

Preview: ConvNet is a sequence of Convolution Layers, interspersed with activation functions

32x32x3 -> [CONV, ReLU: e.g. 6 5x5x3 filters] -> 28x28x6

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

SLIDE 92

Preview: ConvNet is a sequence of Convolutional Layers, interspersed with activation functions

32x32x3 -> [CONV, ReLU: e.g. 6 5x5x3 filters] -> 28x28x6 -> [CONV, ReLU: e.g. 10 5x5x6 filters] -> 24x24x10 -> [CONV, ReLU] -> ….

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
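The same stack as a PyTorch sketch:

    import torch
    import torch.nn as nn

    net = nn.Sequential(
        nn.Conv2d(3, 6, 5), nn.ReLU(),    # 32x32x3 -> 28x28x6
        nn.Conv2d(6, 10, 5), nn.ReLU(),   # 28x28x6 -> 24x24x10
    )
    print(net(torch.randn(1, 3, 32, 32)).shape)   # torch.Size([1, 10, 24, 24])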