Backpropagation
Ryan Cotterell and Clara Meister
Administrivia: Changes in the Teaching Staff
○ Clara Meister (Head TA): BSc/MSc from Stanford University. Despite the last name, my German is very poor.
○ Niklas Stoehr
Coding assignment
○ The teaching staff is preparing the assignment
○ We will update you as things become clearer!
○ You should form groups of 2 to 4 people
■ Feel free to use Piazza to reach out to other students in the course
○ We will require you to write a 1-page project proposal, on which we will give you feedback
■ Expect to turn this in before the end of October; the date will be given soon
○ When our models are statistical, we have to estimate their parameters from data
○ How do we estimate the parameters?
○ This lecture teaches you how to compute the gradient of virtually any model efficiently
○ Many of you may find it irksome, but we are teaching backpropagation in detail
○ Backpropagation is the 21st century’s algorithm: you need to know it
○ At many places in this course, I am going to say: you can compute X with backpropagation, and then move on to cover more interesting things
○ Many NLP algorithms come in duals where one is the “backpropagation version” of the other
■ Forward → Forward–Backward (by backpropagation)
■ Inside → Inside–Outside (by backpropagation)
■ Computing a normalizer → computing marginals
Chris Olah’s Blog, Justin Domke’s Notes, Tim Vieira’s Blog, Moritz Hardt’s Notes, Baur and Strassen (1983), Griewank and Walter (2008), Eisner (2016)
1. Backpropagation
2. Calculus Review
3. Computation Graphs
4. Reverse-Mode AD
○ The chain rule (Leibniz, 1676; L'Hôpital, 1696)
○ Dynamic programming (DP; Bellman, 1957)
○ Minimisation of errors through gradient descent (Cauchy, 1847; Hadamard, 1908)
■ in the parameter space of complex, nonlinear, differentiable, multi-stage, NN-related systems (Kelley, 1960; Bryson, 1961; Bryson and Denham, 1961; Pontryagin et al., 1961, …)
○ Efficient backpropagation in arbitrary, possibly sparsely connected, NN-like networks apparently was first described in 1970 by Finnish master's student Seppo Linnainmaa
○ It was first applied specifically to neural networks in 1982
○ BP for NNs took off as computers became faster
http://people.idsia.ch/~juergen/who-invented-backpropagation.html
See this critique for some CS drama!!
The derivative of a function measures its instantaneous rate of change. The derivative of f at a point x ∊ ℝ is defined as
f'(x) = lim_{ε → 0} (f(x + ε) − f(x)) / ε
where f is said to be differentiable at x if such a limit exists. Informally, this requires that f be smooth and continuous at x.
○ If we perturb x by a small amount ε, the value of f changes by approximately ε∙f'(x)
○ The tangent line is the best linear approximation of the function near x
○ We can then use f(x) ≈ f(x₀) + f'(x₀)(x − x₀) as a locally linear approximation of f at x for some x₀ near x
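To make this concrete, here is a small Python check (my own sketch, not from the slides; the function f(x) = sin(x²) is an arbitrary choice) of both the limit definition and the locally linear approximation:

```python
import math

def f(x):
    return math.sin(x ** 2)

def f_prime(x):
    # Analytic derivative, by the scalar chain rule: d/dx sin(x^2) = 2x cos(x^2).
    return 2 * x * math.cos(x ** 2)

x0 = 1.5
for eps in (1e-1, 1e-2, 1e-3, 1e-4):
    finite_diff = (f(x0 + eps) - f(x0)) / eps    # the limit definition at a finite eps
    linear = f(x0) + eps * f_prime(x0)           # locally linear approximation around x0
    print(f"eps={eps:g}  finite diff={finite_diff:.6f}  "
          f"f'(x0)={f_prime(x0):.6f}  linearization error={abs(f(x0 + eps) - linear):.2e}")
```

As ε shrinks, the finite difference approaches f'(x₀) and the error of the linear approximation vanishes quadratically.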
For a function f: ℝⁿ → ℝ, the gradient is ∇f(x) = (∂f/∂x₁, …, ∂f/∂xₙ), where ∂f/∂xᵢ is the (partial) derivative of f with respect to xᵢ: it tells us how f changes if we move x along the i-th coordinate axis.
Now, ∇f(x) is a vector!
For a function f: ℝⁿ → ℝᵐ with y = f(x), the Jacobian J ∈ ℝᵐˣⁿ collects the partial derivatives between each element of x and each element of y. I.e., the (i, j)-th element Jᵢⱼ = ∂yᵢ/∂xⱼ tells us the amount by which yᵢ will change if xⱼ is changed by a small amount.
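As a quick illustration (my own toy example, not the slides'), the Jacobian of a map from ℝ³ to ℝ² can be assembled column by column by nudging one input coordinate at a time:

```python
import numpy as np

def f(x):
    # Toy map from R^3 to R^2, chosen only for illustration.
    return np.array([x[0] * x[1], np.sin(x[2]) + x[0]])

def numerical_jacobian(f, x, eps=1e-6):
    y = f(x)
    J = np.zeros((y.size, x.size))
    for j in range(x.size):
        x_pert = x.copy()
        x_pert[j] += eps                    # move x along the j-th coordinate axis
        J[:, j] = (f(x_pert) - y) / eps     # column j holds dy_i/dx_j for every i
    return J

x = np.array([1.0, 2.0, 0.5])
print(numerical_jacobian(f, x))
# Analytic Jacobian for comparison: [[x1, x0, 0], [1, 0, cos(x2)]]
print(np.array([[x[1], x[0], 0.0], [1.0, 0.0, np.cos(x[2])]]))
```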
The chain rule extends to vector-valued functions (the derivatives of the component functions are Jacobians). Consequently, for a composition h = g ∘ f we can form the Jacobian J_h(x) = J_g(f(x)) · J_f(x), where the product is an ordinary matrix product.
see https://en.wikipedia.org/wiki/Chain_rule#General_rule for proof of multivariate case
Example:
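Here is a small numerical check (my own example, not the one from the slides) that the Jacobian of a composition g ∘ f is indeed the matrix product of the individual Jacobians:

```python
import numpy as np

def f(x):   # f: R^2 -> R^3
    return np.array([x[0] ** 2, x[0] * x[1], np.exp(x[1])])

def g(y):   # g: R^3 -> R^2
    return np.array([y[0] + y[1], y[1] * y[2]])

def jacobian(fn, x, eps=1e-6):
    # Finite-difference Jacobian, one column per input coordinate.
    y = fn(x)
    J = np.zeros((y.size, x.size))
    for j in range(x.size):
        xp = x.copy()
        xp[j] += eps
        J[:, j] = (fn(xp) - y) / eps
    return J

x = np.array([1.0, 0.5])
J_composed = jacobian(lambda v: g(f(v)), x)     # Jacobian of g(f(x)) computed directly
J_chain = jacobian(g, f(x)) @ jacobian(f, x)    # chain rule: product of Jacobians
print(np.allclose(J_composed, J_chain, atol=1e-4))   # True, up to finite-difference error
```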
A computation graph is a (directed, acyclic) hypergraph where each node is a variable and each hyperedge is labeled with a function.
[Figure: example computation graph with input nodes x, y, z, intermediate nodes a, b, c, d, e built from the primitives (·)², exp(∙), ⨉, and sin(∙), and output node g]
How many paths must we sum over in order to calculate the derivative?
Answer: 3 ⨉ 3 = 9 paths for each derivative.
What happens if we add another “layer”?
There are an exponential number of paths! We have O(3ⁿ) paths in this case.
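A back-of-the-envelope sketch (my own, assuming a fully connected layered graph with 3 nodes per layer) of why summing over paths is hopeless while backpropagation is not: the number of paths grows as 3ⁿ, but the number of edges, which is all the work backprop has to do, grows only linearly in n.

```python
# Assumed setting: n layers, 3 nodes per layer, fully connected between adjacent layers.
for n in (1, 2, 5, 10, 20):
    num_paths = 3 ** n    # the naive chain rule sums one term per path
    num_edges = 9 * n     # roughly; backprop touches each edge a constant number of times
    print(f"n={n:2d}  paths={num_paths:>13,}  edges~{num_edges}")
```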
Set of primitives
And their derivatives
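For instance (a minimal sketch of my own, not the course's reference code), the primitives and their derivatives can be stored side by side, so the backward pass can later look up local derivatives by name:

```python
import math

# Each primitive maps to (forward function, list of partial derivatives, one per argument).
PRIMITIVES = {
    "square": (lambda a: a ** 2,   [lambda a: 2 * a]),
    "exp":    (math.exp,           [math.exp]),
    "sin":    (math.sin,           [math.cos]),
    "add":    (lambda a, b: a + b, [lambda a, b: 1.0, lambda a, b: 1.0]),
    "mul":    (lambda a, b: a * b, [lambda a, b: b,   lambda a, b: a]),
}

fn, partials = PRIMITIVES["mul"]
print(fn(3.0, 4.0))           # 12.0
print(partials[0](3.0, 4.0))  # d(a*b)/da = b = 4.0
```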
Example (forward pass): with inputs x = 2, y = −1, z = 0, the intermediate values are a = 4, b = 1, c = −1, d = 3, e = 0.141, and the output is g = 0.141.
Formally, the graph is a directed acyclic hypergraph with labels pᵢ (for primitives) on the hyperarcs; forward propagation evaluates the nodes in topological order.
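A sketch of the forward pass over such a hypergraph (my own minimal implementation; the particular graph below, including the final addition of z, is inferred from the node labels and the forward values in the example, not stated explicitly on the slides):

```python
import math

# Forward functions for the primitives appearing in the figure.
PRIM = {"square": lambda a: a ** 2, "exp": math.exp, "sin": math.sin,
        "add": lambda a, b: a + b, "mul": lambda a, b: a * b}

def forward(graph, inputs):
    """graph: list of (node, primitive, parent names) in topological order."""
    values = dict(inputs)                  # the input nodes are set to x
    for node, prim, parents in graph:
        values[node] = PRIM[prim](*(values[p] for p in parents))
    return values

# Hypothetical encoding of the example graph.
graph = [("a", "square", ("x",)), ("b", "exp", ("z",)), ("c", "mul", ("y", "b")),
         ("d", "add", ("a", "c")), ("e", "sin", ("d",)), ("g", "add", ("e", "z"))]
print(forward(graph, {"x": 2.0, "y": -1.0, "z": 0.0}))
# a=4, b=1, c=-1, d=3, e~0.141, g~0.141, matching the worked example above
```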
Example (backward pass): running backpropagation on the same graph gives the derivative of g with respect to each variable: a simple application of the chain rule! The per-node derivative values that appear are −3.96, −0.99, 1.99, 0.99, and 1.
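Those numbers can be checked directly (my own sketch; it assumes the graph computes g = sin(x² + y·exp(z)) + z, which is consistent with the node labels and the forward values above, but is an inference rather than something stated on the slides):

```python
import math

x, y, z = 2.0, -1.0, 0.0
d = x ** 2 + y * math.exp(z)            # d = a + c = 3
cos_d = math.cos(d)                      # dg/dd = cos(d) ~ -0.99

dg_dx = cos_d * 2 * x                    # ~ -3.96
dg_dy = cos_d * math.exp(z)              # ~ -0.99
dg_dz = cos_d * y * math.exp(z) + 1.0    # ~ 1.99 (z reaches g both through b and directly)
print(round(dg_dx, 2), round(dg_dy, 2), round(dg_dz, 2))
```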
The backpropagation algorithm operates on the same structure: a directed acyclic hypergraph with labels pᵢ (for primitives) on the hyperarcs. The base case for the backward recursion is the derivative of the output with respect to itself, which is 1.
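A minimal end-to-end sketch of the algorithm (my own code, using the same assumed graph encoding as the forward-pass sketch above): the adjoint of the output is initialized to 1 (the base case), and adjoints are then accumulated in reverse topological order.

```python
import math

# (forward function, per-argument partial derivatives) for each primitive.
PRIM = {
    "square": (lambda a: a ** 2,   [lambda a: 2 * a]),
    "exp":    (math.exp,           [math.exp]),
    "sin":    (math.sin,           [math.cos]),
    "add":    (lambda a, b: a + b, [lambda a, b: 1.0, lambda a, b: 1.0]),
    "mul":    (lambda a, b: a * b, [lambda a, b: b,   lambda a, b: a]),
}

def backprop(graph, inputs):
    # Forward pass: compute every intermediate value.
    values = dict(inputs)
    for node, prim, parents in graph:
        values[node] = PRIM[prim][0](*(values[p] for p in parents))
    # Backward pass: base case is d(output)/d(output) = 1.
    adjoint = {name: 0.0 for name in values}
    adjoint[graph[-1][0]] = 1.0
    for node, prim, parents in reversed(graph):
        args = [values[p] for p in parents]
        for parent, dfn in zip(parents, PRIM[prim][1]):
            adjoint[parent] += adjoint[node] * dfn(*args)   # chain rule, summed over children
    return values, adjoint

graph = [("a", "square", ("x",)), ("b", "exp", ("z",)), ("c", "mul", ("y", "b")),
         ("d", "add", ("a", "c")), ("e", "sin", ("d",)), ("g", "add", ("e", "z"))]
_, adj = backprop(graph, {"x": 2.0, "y": -1.0, "z": 0.0})
print(adj["x"], adj["y"], adj["z"])   # ~ -3.96, -0.99, 1.99
```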
○ The finite-difference approximation: ∂f/∂xᵢ ≈ (f(x + ε∙eᵢ) − f(x)) / ε
○ It involves repeated computation: one extra evaluation of f for every coordinate of x
○ Much, much slower in general than a single backward pass
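For comparison, a sketch of the finite-difference check (my own code; the test function is arbitrary): estimating an n-dimensional gradient this way costs one extra evaluation of f per coordinate, which is why it is useful for verifying gradients but far too slow for training.

```python
import numpy as np

def finite_difference_grad(f, x, eps=1e-6):
    """Forward-difference estimate of the gradient: one extra call to f per coordinate."""
    fx = f(x)
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        grad[i] = (f(x + e) - fx) / eps
    return grad

f = lambda x: np.sum(x ** 2 * np.sin(x))        # arbitrary test function R^n -> R
x = np.array([0.5, -1.0, 2.0])
print(finite_difference_grad(f, x))
print(2 * x * np.sin(x) + x ** 2 * np.cos(x))   # analytic gradient, for comparison
```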
We can encode a computation graph as a set of simple equality constraints for a constrained optimization problem.
General case: take a directed acyclic hypergraph with N edges and labels pᵢ (for primitives) on the hyperarcs, topologically ordered so that i < j implies vᵢ comes before vⱼ, with the input nodes set to x. Each non-input node contributes the constraint vᵢ = pᵢ(parents of vᵢ), and the objective is the output node.
Using the standard machinery for constrained optimization problems (Lagrange multipliers), we can exactly recover the intermediate derivatives in the backprop algorithm.
Derivation: write down the Lagrangian, apply the optimality condition (setting the gradient of the Lagrangian to zero), and solve the resulting equations.
Look familiar? Recall our backprop algorithm: the Lagrange multipliers are exactly the intermediate derivatives (adjoints) that backprop computes.
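Written out, the derivation is short; the following is my own rendering of the standard argument, with π(i) denoting the parents of node i (notation assumed, not copied from the slides):

```latex
% Constrained formulation: maximize the output v_N subject to the graph's equality constraints.
\mathcal{L}(v, \lambda)
  \;=\; v_N \;-\; \sum_{i} \lambda_i \,\bigl( v_i - p_i(v_{\pi(i)}) \bigr)

% Optimality condition: set the gradient of the Lagrangian (w.r.t. each v_j) to zero.
\frac{\partial \mathcal{L}}{\partial v_j}
  \;=\; [\![\, j = N \,]\!] \;-\; \lambda_j
        \;+\; \sum_{i \,:\, j \in \pi(i)} \lambda_i \, \frac{\partial p_i}{\partial v_j}
  \;=\; 0

% Solving for the multipliers:
\lambda_N = 1,
\qquad
\lambda_j \;=\; \sum_{i \,:\, j \in \pi(i)} \lambda_i \, \frac{\partial p_i}{\partial v_j}
\quad \text{for } j < N

% This is exactly the backprop recursion: \lambda_j plays the role of the adjoint
% \partial v_N / \partial v_j, with base case \partial v_N / \partial v_N = 1.
```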
1. Probability Refresher
2. Log-Linear Models
3. Softmax Function
4. The Exponential Family
Afterwards: we are finally ready to do some NLP together! 🙃
○ Even though libraries for deep learning compute gradients for you automatically, you should understand how it works
○ Life lesson: You should understand the tools you are using!
○ Backpropagation bounds the complexity of computing the derivative of a function
○ Same asymptotic complexity as evaluating the original function!
○ Many inefficient algorithms were published because the authors did not fully understand backpropagation