CS 4803 / 7643: Deep Learning
Zsolt Kira Georgia Tech
Topics:
– Specifying Layers
– Forward & Backward autodifferentiation
– (Beginning of) Convolutional neural networks
Administrivia: PS0 released and graded:
– mean of 20.7
– standard deviation of 3.4
– median of 21
– max of 25
– See me if you did not pass
– More details on project next time
(C) Dhruv Batra & Zsolt Kira 2
for epoch in range(num_epochs):
    for x, y in data:
        # forward pass, compute loss, backward pass, update parameters
Some design decisions:
Any DAG of differentiable modules is allowed!
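The loop above can be fleshed out as a minimal SGD training loop. The linear model, squared-error loss, and learning rate here are illustrative assumptions, not the course's reference implementation:

```python
import numpy as np

# Minimal SGD training loop for a linear model y ~ w.x (illustrative).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # 100 samples, 3 features
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)

w = np.zeros(3)
lr, num_epochs = 0.02, 50
for epoch in range(num_epochs):
    for xi, yi in zip(X, y):
        pred = xi @ w                    # forward pass
        grad = 2 * (pred - yi) * xi      # backward pass: d/dw of (pred - yi)^2
        w -= lr * grad                   # SGD update

print(w)  # should end up close to true_w
```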
Slide Credit: Marc'Aurelio Ranzato
Slide Credit: Marc'Aurelio Ranzato, Yann LeCun
[Forward Pass]
“Deep Learning” book, Goodfellow, Bengio & Courville
g(x) = max(0, x) (elementwise)
4096-d input vector → 4096-d output vector
Q: what is the size of the Jacobian matrix? [4096 x 4096!]
Q2: what does it look like? [Diagonal: each output depends only on its own input, so entry (i, i) is 1 where x_i > 0 and 0 otherwise; all off-diagonal entries are 0.]
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
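A quick numerical check of that answer, with a small 5-d vector standing in for the 4096-d one:

```python
import numpy as np

def relu_jacobian(x):
    # Jacobian of elementwise ReLU: diagonal matrix with 1 where x > 0.
    return np.diag((x > 0).astype(float))

x = np.array([1.0, -2.0, 3.0, -0.5, 0.7])
J = relu_jacobian(x)

# Every off-diagonal entry is zero; the diagonal is the ReLU mask.
assert np.count_nonzero(J - np.diag(np.diag(J))) == 0
print(np.diag(J))  # [1. 0. 1. 0. 1.]
```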
– Input = Data + Parameters
– Output = Loss
– Scheduling = Topological ordering
– Generic code for representing the graph of modules
– Specify modules (both forward and backward functions)
Graph (or Net) object (rough pseudo code)
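A rough Python sketch of such a Graph object. The module interface (`forward`/`backward` methods, a `Multiply` module) is an illustrative assumption mirroring the slide, not a specific library's API:

```python
class Multiply:
    """A module: computes z = x * y and its local gradients."""
    def forward(self, x, y):
        self.x, self.y = x, y              # cache inputs for the backward pass
        return x * y
    def backward(self, dz):
        return dz * self.y, dz * self.x    # chain rule: dL/dx, dL/dy

class Graph:
    """Runs a chain of modules in topological order; backward in reverse."""
    def __init__(self, modules):
        self.modules = modules
    def forward(self, *inputs):
        out = inputs
        for m in self.modules:             # topological order
            out = (m.forward(*out),)
        return out[0]
    def backward(self, dloss):
        grads = (dloss,)
        for m in reversed(self.modules):   # reverse topological order
            grads = m.backward(*grads)
        return grads

g = Graph([Multiply()])
z = g.forward(3.0, 4.0)      # 12.0
dx, dy = g.backward(1.0)     # dL/dx = 4.0, dL/dy = 3.0
```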
(x, y, z are scalars)
Caffe is licensed under BSD 2-Clause
Each layer’s backward multiplies its local gradient * top_diff, the gradient arriving from the layer above (chain rule)
Backpropagation:
– A family of algorithms for implementing the chain rule on computation graphs
Example: computation graph for f(x1, x2) = sin(x1) + x1 * x2 (nodes: sin( ), *, +)
Q: What happens if there’s another input variable x3?
A: more sophisticated graph; d “forward props” for d input variables
Q: What happens if there’s another output variable f2?
A: more sophisticated graph; still a single “forward prop”
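The “forward props” being counted here can be sketched with dual numbers, a standard way to implement forward-mode autodifferentiation. This tiny class is illustrative, not from the slides:

```python
import math

class Dual:
    """Dual number val + dot*eps for forward-mode autodiff; dot carries df/dx."""
    def __init__(self, val, dot):
        self.val, self.dot = val, dot
    def __add__(self, o):
        return Dual(self.val + o.val, self.dot + o.dot)
    def __mul__(self, o):
        # product rule
        return Dual(self.val * o.val, self.dot * o.val + self.val * o.dot)

def dsin(d):
    # chain rule for sin
    return Dual(math.sin(d.val), math.cos(d.val) * d.dot)

def f(x1, x2):
    return dsin(x1) + x1 * x2    # f(x1, x2) = sin(x1) + x1*x2

# One forward prop per input variable: seed that input's dot with 1.
x1, x2 = 0.5, 2.0
df_dx1 = f(Dual(x1, 1.0), Dual(x2, 0.0)).dot   # analytically cos(x1) + x2
df_dx2 = f(Dual(x1, 0.0), Dual(x2, 1.0)).dot   # analytically x1
print(df_dx1, math.cos(x1) + x2)               # the two agree
```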
Now run the chain rule in reverse (backward mode):
Q: What happens if there’s another input variable x3?
A: more sophisticated graph; still a single “backward prop”
Q: What happens if there’s another output variable f2?
A: more sophisticated graph; c “backward props” for c output variables
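And the single “backward prop” for the same function, written out by hand (again a sketch, not library code):

```python
import math

# Reverse mode on f(x1, x2) = sin(x1) + x1*x2: one backward pass
# yields the gradient with respect to every input at once.
x1, x2 = 0.5, 2.0

# forward pass, caching intermediates
a = math.sin(x1)    # a = sin(x1)
b = x1 * x2         # b = x1 * x2
f = a + b

# backward pass: start from df/df = 1 and apply the chain rule
df_da = 1.0
df_db = 1.0
df_dx1 = df_da * math.cos(x1) + df_db * x2   # two paths flow into x1
df_dx2 = df_db * x1

print(df_dx1, df_dx2)   # cos(0.5) + 2.0, and 0.5
```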
– Forward or backward? With a single scalar loss (one output) and millions of parameters (inputs), one backward prop computes all the gradients, so deep learning uses reverse mode (backpropagation).
Convolutional Neural Networks:
– What is a convolution?
– FC vs Conv Layers
Linear classifier: class scores = W x + b
– Input image: array of 32x32x3 numbers (3072 numbers total), stretched into a 3072x1 column x
– Parameters: W is 10x3072, b is 10x1
– Output: 10x1, i.e. 10 numbers giving class scores
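In code, the shape bookkeeping looks like this (random W and b purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(32, 32, 3)).astype(float)

x = image.reshape(3072, 1)        # stretch pixels into a 3072x1 column
W = rng.normal(size=(10, 3072))   # 10 classes x 3072 pixel values
b = rng.normal(size=(10, 1))      # one bias per class

scores = W @ x + b                # 10x1 vector of class scores
print(scores.shape)               # (10, 1)
```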
Example with an image with 4 pixels, and 3 classes (cat/dog/ship): stretch the pixels into a column vector x = [56, 231, 24, 2], multiply by a 3x4 weight matrix W, and add a 3x1 bias b to get the three class scores (in the slide’s numbers, e.g. dog score 437.9 and ship score 61.95).
(Before) Linear score function: f = Wx
(Now) 2-layer Neural Network: s = W2 max(0, W1 x)
x (3072-d) → h (100-d) → s (10 class scores); W1 is 100x3072, W2 is 10x100
(without the brain stuff)
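A forward pass with those shapes (weights random, just to check dimensions):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(3072,))        # flattened 32x32x3 image
W1 = rng.normal(size=(100, 3072))   # first layer: 3072 -> 100
W2 = rng.normal(size=(10, 100))     # second layer: 100 -> 10

h = np.maximum(0, W1 @ x)           # hidden layer with elementwise ReLU
s = W2 @ h                          # 10 class scores
print(h.shape, s.shape)             # (100,) (10,)
```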
A fully connected layer on a full image would need vastly more parameters than we have training samples anyway.
Locally connected layer, example: 200x200 image, 40K hidden units, “filter” size 10x10 → 4M parameters. Note: this parameterization is good when the input image is registered (e.g., face recognition).
Stationarity? Statistics are similar at all locations.
Share the same parameters across different locations (assuming input is stationary): Convolutions with learned kernels
2D Convolution: input N x N, filter m x m → output (N-m+1) x (N-m+1). (1D convolution is analogous.)
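A direct “valid” 2D convolution as a sanity check on that output size. This is the naive sliding-window version, and (as is conventional in deep learning) it does not flip the filter, i.e. it is technically cross-correlation:

```python
import numpy as np

def conv2d_valid(inp, filt):
    """Naive 'valid' 2D convolution: N x N input, m x m filter -> (N-m+1) x (N-m+1)."""
    N, m = inp.shape[0], filt.shape[0]
    out = np.zeros((N - m + 1, N - m + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(inp[i:i+m, j:j+m] * filt)
    return out

inp = np.arange(36.0).reshape(6, 6)   # N = 6
filt = np.ones((3, 3)) / 9.0          # m = 3 (box filter: local average)
out = conv2d_valid(inp, filt)
print(out.shape)                      # (4, 4), i.e. (6-3+1) x (6-3+1)
```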
Convolution: an operation on two functions (an input and a kernel, or weighting function) to produce a third function.
(C) Peter Anderson
With shared filters, e.g.: 200x200 image, 100 filters, filter size 10x10 → only 10K parameters
Convolution Layer: a 32x32x3 image → preserve spatial structure (width 32 x height 32 x depth 3)
32x32x3 image, 5x5x3 filter: convolve the filter with the image, i.e. “slide over the image spatially, computing dot products”. Filters always extend the full depth of the input volume.
Each location gives 1 number: the result of taking a dot product between the filter and a small 5x5x3 chunk of the image (i.e. 5*5*3 = 75-dimensional dot product + bias)
Convolve (slide) over all spatial locations → a 28x28x1 activation map. Consider a second, green filter: it gives a second 28x28 activation map.
For example, if we had 6 5x5 filters, we’ll get 6 separate activation maps: we stack these up to get a “new image” of size 28x28x6!
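Putting the pieces together: 6 filters of size 5x5x3 over a 32x32x3 input, checking that the stacked output is 28x28x6. This is a naive loop written for clarity, not speed:

```python
import numpy as np

def conv_layer(inp, filters, biases):
    """Naive conv layer: inp HxWxC, filters KxFxFxC -> (H-F+1) x (W-F+1) x K."""
    H, W, C = inp.shape
    K, F = filters.shape[0], filters.shape[1]
    out = np.zeros((H - F + 1, W - F + 1, K))
    for k in range(K):
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                # 75-dimensional dot product (5*5*3) plus a bias
                out[i, j, k] = np.sum(inp[i:i+F, j:j+F, :] * filters[k]) + biases[k]
    return out

rng = np.random.default_rng(0)
image = rng.normal(size=(32, 32, 3))
filters = rng.normal(size=(6, 5, 5, 3))   # 6 filters, each 5x5x3 (full depth)
biases = np.zeros(6)
maps = conv_layer(image, filters, biases)
print(maps.shape)                          # (28, 28, 6)
```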
Figure Credit: https://petewarden.com/2015/04/20/why-gemm-is-at-the-heart-of-deep-learning/
Figure Credit: Yangqing Jia, PhD Thesis
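Both figure credits concern implementing convolution as one big matrix multiply (GEMM). The standard trick, im2col, copies each receptive-field patch into a column so the whole layer becomes a single matmul; this single-channel sketch is illustrative, not Caffe's actual implementation:

```python
import numpy as np

def im2col(inp, F):
    """Lay out every FxF patch of a 2D input as one column."""
    N = inp.shape[0]
    out_n = N - F + 1
    cols = np.zeros((F * F, out_n * out_n))
    for i in range(out_n):
        for j in range(out_n):
            cols[:, i * out_n + j] = inp[i:i+F, j:j+F].ravel()
    return cols

rng = np.random.default_rng(0)
inp = rng.normal(size=(6, 6))
filt = rng.normal(size=(3, 3))

# Convolution as GEMM: (1 x 9) @ (9 x 16) -> 16 outputs -> 4x4 map
cols = im2col(inp, 3)
gemm_out = (filt.ravel()[None, :] @ cols).reshape(4, 4)

# Same result as the direct sliding-window computation
direct = np.array([[np.sum(inp[i:i+3, j:j+3] * filt) for j in range(4)]
                   for i in range(4)])
assert np.allclose(gemm_out, direct)
```

The memory cost is duplicating overlapping patches; the payoff is that the heavy lifting becomes a single highly optimized matrix multiplication.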
Preview: a ConvNet is a sequence of Convolution Layers, interspersed with activation functions:
32x32x3 → [CONV, ReLU, e.g. 6 5x5x3 filters] → 28x28x6 → [CONV, ReLU, e.g. 10 5x5x6 filters] → 24x24x10 → [CONV, ReLU] → ….
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n