Basics of Numerical Optimization: Computing Derivatives
Ju Sun
Computer Science & Engineering University of Minnesota, Twin Cities
February 25, 2020
Derivatives are needed throughout numerical optimization: gradient descent, Newton's method.

[Figure credit: aria42.com]

[Figure credit: Baydin et al., 2017]
Chain rule: for f: Rⁿ → Rᵐ and h: Rᵐ → R,

  ∇ₓ h(f(x)) = [J_f(x)]⊺ ∇h(f(x)).
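The identity above can be checked on a small hypothetical example (the functions f and h below are illustrative choices, not from the slides):

```python
import numpy as np

# Check ∇x h(f(x)) = J_f(x)^T ∇h(f(x)) on
#   f(x) = [x0^2, x0*x1],  h(u) = u0 + u1^2  (hypothetical example)
x = np.array([1.5, -0.7])

f = lambda x: np.array([x[0]**2, x[0]*x[1]])
Jf = np.array([[2*x[0], 0.0],            # analytic Jacobian of f at x
               [x[1],   x[0]]])
grad_h = lambda u: np.array([1.0, 2*u[1]])   # gradient of h

chain = Jf.T @ grad_h(f(x))              # chain-rule gradient of h∘f

# direct gradient of g(x) = h(f(x)) = x0^2 + (x0*x1)^2
direct = np.array([2*x[0] + 2*x[0]*x[1]**2, 2*x[0]**2*x[1]])

assert np.allclose(chain, direct)
```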
Example: a three-layer linear network. Solve

  min over W_1, W_2, W_3 of f(W_1, W_2, W_3) = Σᵢ ‖yᵢ − W_3 W_2 W_1 xᵢ‖²₂.

To derive ∇_{W_2} f, perturb W_2 by Δ:

  f(W_1, W_2 + Δ, W_3)
    = Σᵢ ‖yᵢ − W_3 (W_2 + Δ) W_1 xᵢ‖²₂
    = f(W_1, W_2, W_3) − 2 Σᵢ ⟨yᵢ − W_3 W_2 W_1 xᵢ, W_3 Δ W_1 xᵢ⟩ + O(‖Δ‖²_F)
    = f(W_1, W_2, W_3) − 2 ⟨Σᵢ W_3⊺ (yᵢ − W_3 W_2 W_1 xᵢ)(W_1 xᵢ)⊺, Δ⟩ + O(‖Δ‖²_F),

so ∇_{W_2} f = −2 Σᵢ W_3⊺ (yᵢ − W_3 W_2 W_1 xᵢ)(W_1 xᵢ)⊺.
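As a sanity check, the derived gradient can be compared against central finite differences on random data (all shapes below are arbitrary choices):

```python
import numpy as np

# Numerically verify ∇_{W2} f = -2 Σ_i W3^T (y_i - W3 W2 W1 x_i)(W1 x_i)^T
rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 3))
W2 = rng.standard_normal((5, 4))
W3 = rng.standard_normal((2, 5))
X = rng.standard_normal((3, 10))     # column i is x_i
Y = rng.standard_normal((2, 10))     # column i is y_i

def f(W2):
    R = Y - W3 @ W2 @ W1 @ X         # column i is the residual y_i - W3 W2 W1 x_i
    return np.sum(R**2)

R = Y - W3 @ W2 @ W1 @ X
grad = -2 * W3.T @ R @ (W1 @ X).T    # matrix form of the sum over i

# central finite-difference check, entry by entry
delta = 1e-6
num = np.zeros_like(W2)
for j in range(W2.shape[0]):
    for k in range(W2.shape[1]):
        E = np.zeros_like(W2); E[j, k] = delta
        num[j, k] = (f(W2 + E) - f(W2 - E)) / (2 * delta)

assert np.allclose(grad, num, atol=1e-4)
```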
Numerical differentiation via finite differences (credit: numex-blog.com):

  f′(x) ≈ (f(x + δ) − f(x)) / δ.
Forward difference:

  (1/δ)(f(x + δeᵢ) − f(x)) = (1/δ)(δ ∂f/∂xᵢ + O(δ²)) = ∂f/∂xᵢ + O(δ).

Central difference:

  (1/(2δ))(f(x + δeᵢ) − f(x − δeᵢ))
    = (1/(2δ))[(f(x) + δ ∂f/∂xᵢ + ½δ² ∂²f/∂xᵢ² + O(δ³)) − (f(x) − δ ∂f/∂xᵢ + ½δ² ∂²f/∂xᵢ² + O(δ³))]
    = ∂f/∂xᵢ + O(δ²).
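The O(δ) vs. O(δ²) error rates are easy to observe numerically; a minimal sketch on f(x) = exp(x) at x = 1, where f′(1) = e:

```python
import numpy as np

# Compare forward (O(delta)) and central (O(delta^2)) difference errors.
f, x, true = np.exp, 1.0, np.e
for delta in [1e-1, 1e-2, 1e-3]:
    fwd = (f(x + delta) - f(x)) / delta
    ctr = (f(x + delta) - f(x - delta)) / (2 * delta)
    print(f"delta={delta:.0e}  fwd err={abs(fwd - true):.2e}  ctr err={abs(ctr - true):.2e}")
```

The central-difference error shrinks roughly 100× per 10× decrease in δ, versus 10× for the forward difference.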
Each partial derivative ∂f/∂xᵢ(x): Rⁿ → R costs one extra evaluation of f (forward difference) or two (central difference). So approximating the full gradient ∇f(x) = [∂f/∂x₁, …, ∂f/∂xₙ]⊺ costs n + 1 (forward) or 2n (central) evaluations of f — expensive when n is large or f is costly.
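The n + 1 evaluation count is visible directly in a forward-difference gradient routine (a minimal sketch):

```python
import numpy as np

# Forward-difference gradient of f: R^n -> R; costs n + 1 evaluations of f.
def num_grad(f, x, delta=1e-6):
    fx = f(x)                        # 1 evaluation, reused for every coordinate
    g = np.zeros_like(x)
    for i in range(x.size):          # n more evaluations, one per coordinate
        e = np.zeros_like(x); e[i] = delta
        g[i] = (f(x + e) - fx) / delta
    return g

# e.g. f(x) = ||x||^2 has gradient 2x
x = np.array([1.0, -2.0, 3.0])
g = num_grad(lambda v: np.sum(v**2), x)
assert np.allclose(g, 2 * x, atol=1e-4)
```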
[Figure credit: Baydin et al., 2017]
Automatic differentiation of a composition f = f_k ∘ ⋯ ∘ f₁: write yᵢ = fᵢ(yᵢ₋₁) and compute df/dx = dy_k/dy₀.

Forward mode:
  Input: x₀, initialization
  set y₀ = x₀ and dy₀/dy₀ = 1
  for i = 1, …, k do
    compute yᵢ = fᵢ(yᵢ₋₁)
    compute dyᵢ/dy₀ = (dyᵢ/dyᵢ₋₁) · (dyᵢ₋₁/dy₀) = fᵢ′(yᵢ₋₁) · dyᵢ₋₁/dy₀
  end for
  Output: dy_k/dy₀
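For a chain of scalar functions, forward mode is a few lines: propagate the value yᵢ and the derivative dyᵢ/dy₀ together (a minimal sketch):

```python
import math

# Forward mode for a chain f = f_k ∘ ... ∘ f_1 of scalar functions.
def forward_mode(fs, dfs, x0):
    y, dy = x0, 1.0                  # y_0 = x_0, dy_0/dy_0 = 1
    for f, df in zip(fs, dfs):
        # the tuple RHS is evaluated with the OLD y, matching f_i'(y_{i-1})
        y, dy = f(y), df(y) * dy
    return y, dy

# example chain: f(x) = exp(sin(x)), so f'(x) = exp(sin(x)) * cos(x)
fs, dfs = [math.sin, math.exp], [math.cos, math.exp]
y, dy = forward_mode(fs, dfs, 0.5)
assert abs(y - math.exp(math.sin(0.5))) < 1e-12
assert abs(dy - math.exp(math.sin(0.5)) * math.cos(0.5)) < 1e-12
```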
Reverse mode:
  Input: x₀
  set y₀ = x₀
  for i = 1, …, k do compute yᵢ = fᵢ(yᵢ₋₁) end for        // forward pass
  set dy_k/dy_k = 1
  for i = k − 1, k − 2, …, 0 do
    compute dy_k/dyᵢ = (dy_k/dyᵢ₊₁) · (dyᵢ₊₁/dyᵢ) = f′ᵢ₊₁(yᵢ) · dy_k/dyᵢ₊₁
  end for                                                  // backward pass
  Output: dy_k/dy₀
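Reverse mode on the same scalar chain stores all intermediate values in the forward pass, then accumulates the derivative backward (a minimal sketch):

```python
import math

# Reverse mode for a chain f = f_k ∘ ... ∘ f_1 of scalar functions.
def reverse_mode(fs, dfs, x0):
    ys = [x0]
    for f in fs:                     # forward pass: store every y_i
        ys.append(f(ys[-1]))
    bar = 1.0                        # dy_k/dy_k = 1
    for df, y in zip(reversed(dfs), reversed(ys[:-1])):
        bar *= df(y)                 # dy_k/dy_i = f_{i+1}'(y_i) * dy_k/dy_{i+1}
    return ys[-1], bar

# same example chain as before: f(x) = exp(sin(x))
fs, dfs = [math.sin, math.exp], [math.cos, math.exp]
y, dy = reverse_mode(fs, dfs, 0.5)
assert abs(dy - math.exp(math.sin(0.5)) * math.cos(0.5)) < 1e-12
```

Note the memory trade-off: reverse mode must keep every yᵢ from the forward pass, while forward mode keeps only the current one.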
[Figure: an example computational graph. NB: this is a computational graph, not a NN.]
Forward mode on a computational graph: fix one input, say x₁, and differentiate ∂/∂x₁; for each intermediate variable vᵢ, compute the tangent v̇ᵢ = ∂vᵢ/∂x₁ (e.g., v̇₁, v̇₃, …), propagating the tangents forward through the graph.
Reverse mode on a computational graph: fix one output y; for each intermediate variable vᵢ, compute v̄ᵢ = ∂y/∂vᵢ (called the adjoint), propagating the adjoints backward through the graph, e.g.,

  v̄₄ = (∂v₅/∂v₄) v̄₅ + (∂v₆/∂v₄) v̄₆.
In summary: forward mode computes ∂vᵢ/∂x for every variable vᵢ (x — root of interest); reverse mode computes ∂y/∂vᵢ for every variable vᵢ (y — leaf of interest).
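Both modes can be traced by hand on a tiny graph. The graph below, y = x₁x₂ + sin(x₁), is a hypothetical example (not the one on the slides); both modes recover the same ∂y/∂x₁:

```python
import math

# Tiny computational graph: v1 = x1, v2 = x2, v3 = v1*v2, v4 = sin(v1), y = v3 + v4
x1, x2 = 0.7, 1.3
v1, v2 = x1, x2
v3 = v1 * v2
v4 = math.sin(v1)
y = v3 + v4

# forward mode: tangents t_i = ∂v_i/∂x1, propagated forward
t1, t2 = 1.0, 0.0
t3 = v2 * t1 + v1 * t2
t4 = math.cos(v1) * t1
ty = t3 + t4                         # ∂y/∂x1

# reverse mode: adjoints b_i = ∂y/∂v_i, propagated backward
by = 1.0
b3, b4 = by, by                      # y = v3 + v4
b1 = b3 * v2 + b4 * math.cos(v1)     # v1 feeds both v3 and v4
b2 = b3 * v1                         # ∂y/∂x2

assert abs(ty - (x2 + math.cos(x1))) < 1e-12
assert abs(b1 - ty) < 1e-12          # both modes agree on ∂y/∂x1
assert abs(b2 - x1) < 1e-12
```

Note that the single reverse pass also produced ∂y/∂x₂ for free, whereas forward mode would need a second pass with (t1, t2) = (0, 1).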
For a vector-valued map f and a query vector q, one reverse pass computes ∇ₓ(f⊺q) = J_f⊺ q, i.e., a Jacobian-transpose-vector product. The adjoint recursion generalizes: for each variable vᵢ,

  d/dvᵢ (f⊺q) = Σ_{k: outgoing} (∂v_k/∂vᵢ) · d/dv_k (f⊺q).
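The identity ∇ₓ(f⊺q) = J_f⊺ q can be checked directly on a small hypothetical map (the f below is an illustrative choice):

```python
import numpy as np

# Check grad_x(q^T f) = J_f^T q for f(x) = [x0*x1, x1^2, sin(x0)]
x = np.array([0.4, 1.2])
q = np.array([1.0, -2.0, 0.5])

J = np.array([[x[1],         x[0]],
              [0.0,          2*x[1]],
              [np.cos(x[0]), 0.0]])   # analytic Jacobian of f at x

jtq = J.T @ q                         # Jacobian-transpose-vector product

# gradient of the scalar g(x) = q^T f(x) = q0*x0*x1 + q1*x1^2 + q2*sin(x0)
grad_g = np.array([q[0]*x[1] + q[2]*np.cos(x[0]),
                   q[0]*x[0] + 2*q[1]*x[1]])

assert np.allclose(jtq, grad_g)
```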
Example: least squares f(x) = ½‖y − Ax‖²₂ with ∇f(x) = −A⊺(y − Ax).
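The least-squares gradient above is easy to confirm against central finite differences on random data (a minimal sketch; shapes are arbitrary):

```python
import numpy as np

# Check ∇f(x) = -A^T (y - A x) for f(x) = 0.5 * ||y - A x||_2^2
rng = np.random.default_rng(1)
A = rng.standard_normal((6, 4))
y = rng.standard_normal(6)
x = rng.standard_normal(4)

f = lambda x: 0.5 * np.sum((y - A @ x)**2)
grad = -A.T @ (y - A @ x)

# central-difference check, coordinate by coordinate
delta, num = 1e-6, np.zeros_like(x)
for i in range(x.size):
    e = np.zeros_like(x); e[i] = delta
    num[i] = (f(x + e) - f(x - e)) / (2 * delta)

assert np.allclose(grad, num, atol=1e-5)
```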
Credit: https://people.csail.mit.edu/tzumao/gradient_halide/
[Baydin et al., 2017] Baydin, A. G., Pearlmutter, B. A., Radul, A. A., and Siskind, J. M. (2017). Automatic differentiation in machine learning: a survey. Journal of Machine Learning Research, 18(1):5595–5637.

[Griewank and Walther, 2008] Griewank, A. and Walther, A. (2008). Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation. Society for Industrial and Applied Mathematics.