CS 4803 / 7643: Deep Learning
Topics:
- (Finish) Computing Gradients
- Backprop in Conv Layers
- Forward mode vs Reverse mode AD
- Modern CNN Architectures
Zsolt Kira, Georgia Tech
Administrivia
- HW1 due date moved!
– Due: 02/18, 11:55pm
- Project topic submissions
– Submit by 02/21 to get comments
– Form filled out with:
- Members identified
- One paragraph on the problem and another on what has been done in the literature and on your approach (note: the approach can be selected from an existing paper and reimplemented)
- Description of what each member will do
- Link
– A project idea: ICLR reproducibility challenge (https://reproducibility-challenge.github.io/iclr_2019/)
- The official submission was in January, but you can still do it and submit later!
(C) Dhruv Batra and Zsolt Kira 2
- Google cloud credits out! (see piazza for details)
- Clouderizer ties in with Google Cloud Platform (GCP)
Matrix/Vector Derivatives Notation
Plan for Today
- Topics:
– (Finish) Computing Gradients
– Backprop in Conv Layers
– Forward mode vs Reverse mode AD
– Modern CNN Architectures
Backprop in Convolutional Layers
How do we compute gradients?
- Analytic or “Manual” Differentiation
- Symbolic Differentiation
- Numerical Differentiation
- Automatic Differentiation
– Forward mode AD
– Reverse mode AD
- aka “backprop”
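[Figure: computational graph. x and W feed a multiply node (*) that outputs s (scores); s feeds the hinge loss; W also feeds the regularizer R; the hinge loss and R are added (+) to give the loss L.]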
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
Computational Graph
Any DAG of differentiable modules is allowed!
Slide Credit: Marc'Aurelio Ranzato
Directed Acyclic Graphs (DAGs)
- Exactly what the name suggests
– Directed edges
– No (directed) cycles
– Underlying undirected cycles okay
- Concept
– Topological Ordering
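A topological ordering lists the nodes so that every directed edge points from an earlier node to a later one, which is the order in which a forward pass can visit a computational graph. A minimal sketch using Kahn's algorithm follows; the function name and the example graph (node names "mul", "sin", "add") are hypothetical and chosen only for illustration.

```python
from collections import deque


def topological_order(edges, nodes):
    """Return nodes ordered so every edge (u, v) has u before v."""
    indegree = {n: 0 for n in nodes}
    children = {n: [] for n in nodes}
    for u, v in edges:
        children[u].append(v)
        indegree[v] += 1

    # Nodes with no incoming edges can be visited immediately.
    ready = deque(n for n in nodes if indegree[n] == 0)
    order = []
    while ready:
        n = ready.popleft()
        order.append(n)
        for c in children[n]:
            indegree[c] -= 1
            if indegree[c] == 0:
                ready.append(c)

    # If some node never became ready, the graph contains a directed cycle.
    if len(order) != len(nodes):
        raise ValueError("graph has a directed cycle, so it is not a DAG")
    return order


# Hypothetical graph: x1 and x2 feed "mul", x1 feeds "sin", both feed "add".
nodes = ["x1", "x2", "mul", "sin", "add"]
edges = [("x1", "mul"), ("x2", "mul"), ("x1", "sin"),
         ("mul", "add"), ("sin", "add")]
order = topological_order(edges, nodes)
```

Visiting `order` front to back gives a valid forward pass; visiting it back to front gives a valid order for a backward pass.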
Numerical vs Analytic Gradients
- Numerical gradient: slow :(, approximate :(, easy to write :)
- Analytic gradient: fast :), exact :), error-prone :(
- In practice: derive the analytic gradient, then check your implementation against the numerical gradient. This is called a gradient check.
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
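A gradient check can be sketched in a few lines. The example below is only an illustration: the test function f(x1, x2) = x1*x2 + sin(x1), the step size h, the evaluation point, and the relative-error formula are all assumptions of this sketch, not something prescribed by the slides.

```python
import math


def f(x1, x2):
    # Hypothetical test function used only for this sketch.
    return x1 * x2 + math.sin(x1)


def analytic_grad(x1, x2):
    # Hand-derived partial derivatives of f.
    return [x2 + math.cos(x1), x1]


def numerical_grad(func, args, h=1e-5):
    """Centered finite differences, one input at a time."""
    grads = []
    for i in range(len(args)):
        plus = list(args)
        minus = list(args)
        plus[i] += h
        minus[i] -= h
        grads.append((func(*plus) - func(*minus)) / (2 * h))
    return grads


def relative_error(a, b):
    # Small constant avoids division by zero when both gradients are 0.
    return abs(a - b) / max(1e-12, abs(a) + abs(b))


point = [1.5, -0.7]
num = numerical_grad(f, point)
ana = analytic_grad(*point)
errors = [relative_error(a, n) for a, n in zip(ana, num)]
```

If the analytic gradient is implemented correctly, every entry of `errors` should be tiny; a large entry points to the input whose derivative was derived or coded incorrectly.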
Forward mode vs Reverse Mode
- Key Computations
[Figure: Forward mode AD at a node g]
[Figure: Reverse mode AD at a node g]
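The formulas on these slides did not survive extraction. As a reminder, the key computation at a single node can be written in standard AD notation as below; the symbols (a node g with inputs x_1..x_k and output y, a chosen input t, a final output L, dots for tangents and bars for adjoints) are assumptions of this sketch and may differ from the slide's own notation.

```latex
% Node g computes y = g(x_1, \dots, x_k).

% Forward mode: carry derivatives with respect to one chosen input t
% from the node's inputs to its output, alongside the values.
\dot{y} = \sum_{i=1}^{k} \frac{\partial g}{\partial x_i}\,\dot{x}_i,
\qquad \dot{x}_i = \frac{\partial x_i}{\partial t}

% Reverse mode: carry derivatives of the final output L
% from the node's output back to each of its inputs.
\bar{x}_i = \bar{y}\,\frac{\partial g}{\partial x_i},
\qquad \bar{y} = \frac{\partial L}{\partial y}
```

When an input x_i feeds more than one node, its adjoint is the sum of the contributions from every node that consumes it.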
Example: Forward mode AD
[Figure: computational graph with inputs x1, x2 and nodes *, sin( ), +]
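Forward mode AD can be sketched with dual numbers, where every value carries its derivative with respect to one chosen input. The function below, f(x1, x2) = x1*x2 + sin(x1), is an assumed reading of the graph on the slide (nodes *, sin, +); the class and function names are hypothetical.

```python
import math


class Dual:
    """A value paired with its derivative w.r.t. one chosen input."""

    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    def __add__(self, other):
        # Sum rule.
        return Dual(self.value + other.value, self.deriv + other.deriv)

    def __mul__(self, other):
        # Product rule.
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)


def sin(d):
    # Chain rule through sin.
    return Dual(math.sin(d.value), math.cos(d.value) * d.deriv)


def f(x1, x2):
    # Assumed function for the slide's graph.
    return x1 * x2 + sin(x1)


def forward_mode_grad(x1, x2):
    # One forward sweep per input: seed that input's derivative with 1.
    df_dx1 = f(Dual(x1, 1.0), Dual(x2, 0.0)).deriv
    df_dx2 = f(Dual(x1, 0.0), Dual(x2, 1.0)).deriv
    return df_dx1, df_dx2


g1, g2 = forward_mode_grad(2.0, 3.0)
```

Note that the full gradient needed two sweeps here, one per input, because each sweep only tracks the derivative with respect to the seeded input.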
Example: Reverse mode AD
[Figure: computational graph with inputs x1, x2 and nodes *, sin( ), +]
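Reverse mode AD can be sketched by hand for a small graph: run the forward pass, keep the intermediates, then walk the nodes in reverse order. The function f(x1, x2) = x1*x2 + sin(x1) is an assumed reading of the slide's graph, and the variable names are hypothetical.

```python
import math


def f_and_grad_reverse(x1, x2):
    # Forward pass: evaluate each node and keep the intermediates.
    a = x1 * x2          # "*" node
    b = math.sin(x1)     # "sin" node
    f = a + b            # "+" node

    # Backward pass: start from df/df = 1 and visit nodes in reverse order.
    df = 1.0
    da = df * 1.0        # "+" passes the gradient to both inputs
    db = df * 1.0
    # x1 feeds both "*" and "sin", so its two contributions are summed.
    dx1 = da * x2 + db * math.cos(x1)
    dx2 = da * x1
    return f, (dx1, dx2)


value, (g1, g2) = f_and_grad_reverse(2.0, 3.0)
```

A single backward sweep produced the derivative with respect to both inputs, at the cost of holding `a`, `b`, and the inputs until the sweep finished.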
Forward Pass vs Forward mode AD vs Reverse Mode AD
[Figure: the same graph (inputs x1, x2; nodes *, sin( ), +) shown three times: forward pass, forward mode AD, reverse mode AD]
Forward mode vs Reverse Mode
- What are the differences?
- Which one is faster to compute?
– Forward or backward?
- Which one is more memory efficient (less storage)?
– Forward or backward?
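A hedged summary of the usual answer, stated for a generic function with n inputs and m outputs (this notation is an assumption of the sketch, not taken from the slides):

```latex
% f : \mathbb{R}^n \to \mathbb{R}^m, with Jacobian J \in \mathbb{R}^{m \times n}
\text{forward mode: } n \text{ sweeps, one column } J e_j \text{ per sweep}
\qquad
\text{reverse mode: } m \text{ sweeps, one row } e_i^{\top} J \text{ per sweep}
```

For a scalar loss (m = 1) over many parameters, reverse mode needs a single backward sweep, which is why it is the usual choice in deep learning. The price is memory: reverse mode has to keep the forward intermediates until the backward sweep uses them, while forward mode can discard each intermediate as soon as its consumers have been evaluated.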
Patterns in backward flow
- add gate: gradient distributor
- max gate: gradient router
- mul gate: gradient switcher
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
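One way to read the three nicknames is through each gate's local backward rule. The sketch below covers two-input scalar gates; the function names are hypothetical, and sending ties in the max gate to the first input is a choice of this sketch.

```python
def add_backward(x, y, upstream):
    # Distributor: both inputs receive the upstream gradient unchanged.
    return upstream, upstream


def max_backward(x, y, upstream):
    # Router: only the larger input receives the upstream gradient.
    if x >= y:
        return upstream, 0.0
    return 0.0, upstream


def mul_backward(x, y, upstream):
    # Switcher: each input's gradient is scaled by the *other* input's value.
    return upstream * y, upstream * x


x, y, upstream = 3.0, -4.0, 2.0
add_grads = add_backward(x, y, upstream)
max_grads = max_backward(x, y, upstream)
mul_grads = mul_backward(x, y, upstream)
```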
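Gradients add at branches
[Figure: a value branches to several consumers; on the backward pass the upstream gradients are summed (+) at the branch point]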
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
Duality in Fprop and Bprop
- SUM (+) in FPROP <-> COPY in BPROP
- COPY in FPROP <-> SUM (+) in BPROP
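The duality can be made concrete with a pair of tiny operations: a sum in the forward pass copies its gradient backward, and a copy (a branch) in the forward pass sums its gradients backward. The function names and the example f(x) = x * x are hypothetical, chosen only to show a value that branches.

```python
def copy_forward(x):
    # FPROP: one value fans out to two consumers.
    return x, x


def copy_backward(grad_a, grad_b):
    # BPROP: gradients arriving from both consumers are summed.
    return grad_a + grad_b


def sum_forward(a, b):
    # FPROP: two values are summed.
    return a + b


def sum_backward(grad_out):
    # BPROP: the upstream gradient is copied to both inputs.
    return grad_out, grad_out


# f(x) = x * x written with an explicit copy, so x branches into a and b.
x = 3.0
a, b = copy_forward(x)
f = a * b
grad_a, grad_b = b * 1.0, a * 1.0        # multiply gate's local gradients
grad_x = copy_backward(grad_a, grad_b)   # equals 2 * x
```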
Modularized implementation: forward / backward API
Graph (or Net) object (rough pseudocode)
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
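The slide's pseudocode was an image and is not in the extracted text. The sketch below shows the general idea under a simplifying assumption: the graph is a plain chain of gates that is already topologically sorted, so each gate's output is the next gate's input. The class names (Graph, Scale, Square) are hypothetical.

```python
class Graph:
    """Runs gates in topological order, then in reverse for gradients."""

    def __init__(self, gates):
        # Assumption: `gates` is a topologically sorted chain.
        self.gates = gates

    def forward(self, x):
        for gate in self.gates:
            x = gate.forward(x)
        return x

    def backward(self, dout=1.0):
        # Chain rule: visit gates in reverse order of the forward pass.
        for gate in reversed(self.gates):
            dout = gate.backward(dout)
        return dout


class Scale:
    def __init__(self, k):
        self.k = k

    def forward(self, x):
        return self.k * x

    def backward(self, dout):
        return self.k * dout


class Square:
    def forward(self, x):
        self.x = x  # cache the input for the backward pass
        return x * x

    def backward(self, dout):
        return 2.0 * self.x * dout


net = Graph([Scale(3.0), Square()])
loss = net.forward(2.0)   # (3 * 2) ** 2
grad = net.backward()     # d/dx (3x)^2 = 18x, evaluated at x = 2
```

A general DAG version would store each gate's inputs and outputs by name and accumulate gradients at branch points, but the forward-then-reversed loop structure stays the same.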
Modularized implementation: forward / backward API
[Figure: multiply gate (*) with inputs x, y and output z (x, y, z are scalars)]
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
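For the scalar multiply gate, the forward / backward API can be sketched as follows. The class and attribute names are assumptions of this sketch; the important point is that forward caches its inputs because backward needs them.

```python
class MultiplyGate:
    """z = x * y for scalars, with a matching backward method."""

    def forward(self, x, y):
        # Cache the inputs: each one is the other's local gradient.
        self.x = x
        self.y = y
        return x * y

    def backward(self, dz):
        # dz is dL/dz from upstream; apply the chain rule.
        dx = self.y * dz   # dL/dx = dz * dz/dx = dz * y
        dy = self.x * dz   # dL/dy = dz * dz/dy = dz * x
        return dx, dy


gate = MultiplyGate()
z = gate.forward(3.0, -4.0)
dx, dy = gate.backward(2.0)
```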
Example: Caffe layers
Caffe is licensed under BSD 2-Clause
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
Caffe Sigmoid Layer
Backward pass: local sigmoid gradient * top_diff (chain rule)
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
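The chain-rule step in a sigmoid layer's backward pass can be sketched in plain Python. This is not Caffe's code: the function names are hypothetical, and only the naming convention (top for the layer's output, bottom for its input, `top_diff` for the upstream gradient) follows the slide.

```python
import math


def sigmoid_forward(bottom_data):
    # top = 1 / (1 + exp(-bottom)), applied elementwise.
    return [1.0 / (1.0 + math.exp(-x)) for x in bottom_data]


def sigmoid_backward(top_data, top_diff):
    # Local gradient of sigmoid is s * (1 - s); multiply by the
    # upstream gradient top_diff (chain rule).
    return [td * s * (1.0 - s) for s, td in zip(top_data, top_diff)]


top = sigmoid_forward([0.0, 2.0])
bottom_diff = sigmoid_backward(top, [1.0, 0.5])
```

The backward pass uses only the cached output `top`, not the original input, because the sigmoid's derivative can be written entirely in terms of its output.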
Figure Credit: Andrea Vedaldi