Basics of Numerical Optimization: Computing Derivatives - Ju Sun (PowerPoint PPT presentation)


SLIDE 1

Basics of Numerical Optimization: Computing Derivatives

Ju Sun

Computer Science & Engineering University of Minnesota, Twin Cities

February 25, 2020

1 / 36

SLIDES 2-5

Derivatives for numerical optimization

Credit: aria42.com

– gradient descent
– Newton’s method
– momentum methods
– quasi-Newton methods
– coordinate descent
– conjugate gradient methods
– trust-region methods
– etc.

– Almost all methods require low-order derivatives, i.e., the gradient and/or Hessian, to proceed.
– Numerical derivatives (i.e., numbers), rather than analytic derivatives (i.e., math expressions), are needed.

This lecture: how to compute numerical derivatives.

2 / 36

SLIDE 6

Four kinds of computing techniques

Credit: [Baydin et al., 2017]

3 / 36

SLIDE 7

Outline

– Analytic differentiation
– Finite-difference approximation
– Automatic differentiation
– Differentiable programming
– Suggested reading

4 / 36

SLIDES 8-10

Analytic derivatives

Idea: derive the analytic derivatives first, then make numerical substitution.

To derive the analytic derivatives by hand:

– Chain rule (vector version) method
  Let f : Rn → Rm and h : Rm → Rk, where f is differentiable at x and h is differentiable at y = f(x). Then h ◦ f : Rn → Rk is differentiable at x, and

      J[h◦f](x) = Jh(f(x)) Jf(x).

  When k = 1, ∇[h ◦ f](x) = Jf⊤(x) ∇h(f(x)).

– Taylor expansion method
  Expand the perturbed function f(x + δ) and then match it against the Taylor expansions to read off the gradient and/or Hessian:

      f(x + δ) ≈ f(x) + ⟨∇f(x), δ⟩
      f(x + δ) ≈ f(x) + ⟨∇f(x), δ⟩ + (1/2) ⟨δ, ∇²f(x) δ⟩

5 / 36

SLIDES 11-13

Derive chain rule by Taylor expansion (optional)

Start with h(f(x + δ)), where δ is as small as we want:

    h(f(x + δ)) = h( f(x) + Jf(x) δ + o(‖δ‖₂) )                      [expand the perturbation]
                = h(f(x)) + Jh(f(x)) [Jf(x) δ + o(‖δ‖₂)] + o(‖Jf(x) δ + o(‖δ‖₂)‖₂)
                = h(f(x)) + Jh(f(x)) Jf(x) δ + o(‖δ‖₂)               [keep the linear term]

So, Jh◦f(x) = Jh(f(x)) Jf(x).

6 / 36

SLIDES 14-17

Taylor expansion method, again

Derive the gradient of a three-layer linear neural network:

    min_{W1, W2, W3} f(W1, W2, W3) := ∑_i ‖y_i − W3 W2 W1 x_i‖_F²

We can derive the partial gradients wrt the W_i’s separately. (Why?)

For example, for W2,

    f(W1, W2 + ∆, W3) = ∑_i ‖y_i − W3 (W2 + ∆) W1 x_i‖_F²
                      = ∑_i ‖(y_i − W3 W2 W1 x_i) − W3 ∆ W1 x_i‖_F²
                      = ∑_i ‖y_i − W3 W2 W1 x_i‖_F² − 2 ⟨y_i − W3 W2 W1 x_i, W3 ∆ W1 x_i⟩ + O(‖∆‖_F²)
                      = ∑_i ‖y_i − W3 W2 W1 x_i‖_F² − 2 ∑_i ⟨W3⊺ (y_i − W3 W2 W1 x_i) (W1 x_i)⊺, ∆⟩ + O(‖∆‖_F²)

So: ∇_{W2} f = −2 ∑_i W3⊺ (y_i − W3 W2 W1 x_i) (W1 x_i)⊺.

7 / 36

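A quick way to validate a gradient derived this way is a finite-difference check. Below is a minimal NumPy sketch (my own illustration, not from the slides; the problem sizes and the step size 1e-6 are arbitrary choices) comparing the formula for ∇_{W2} f above against a central difference.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d1, d2, d_out, N = 5, 4, 3, 2, 10
W1 = rng.standard_normal((d1, d_in))
W2 = rng.standard_normal((d2, d1))
W3 = rng.standard_normal((d_out, d2))
X = rng.standard_normal((d_in, N))            # columns are x_i
Y = rng.standard_normal((d_out, N))           # columns are y_i

def f(W2_):
    R = Y - W3 @ W2_ @ W1 @ X                 # residuals y_i - W3 W2 W1 x_i, stacked as columns
    return np.sum(R ** 2)

# analytic gradient from the slide: -2 * sum_i W3^T r_i (W1 x_i)^T
R = Y - W3 @ W2 @ W1 @ X
grad_analytic = -2 * W3.T @ R @ (W1 @ X).T

# central finite-difference gradient, entry by entry
eps = 1e-6
grad_fd = np.zeros_like(W2)
for i in range(W2.shape[0]):
    for j in range(W2.shape[1]):
        E = np.zeros_like(W2); E[i, j] = eps
        grad_fd[i, j] = (f(W2 + E) - f(W2 - E)) / (2 * eps)

print(np.max(np.abs(grad_analytic - grad_fd)))   # should be tiny (~1e-7 or smaller)
```
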
SLIDES 18-20

Symbolic differentiation

Idea: derive the analytic derivatives first, then make numerical substitution.

To derive the analytic derivatives by software:
– Matlab (Symbolic Math Toolbox, diff)
– Python (SymPy, diff)
– Mathematica (D)

Effective for functions with few variables only.

8 / 36

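For instance, a minimal SymPy sketch (my own example, reusing the bivariate function that appears later in the forward-mode AD example): differentiate symbolically, then substitute numbers.

```python
import sympy as sp

x1, x2 = sp.symbols('x1 x2')
y = (sp.sin(x1 / x2) + x1 / x2 - sp.exp(x2)) * (x1 / x2 - sp.exp(x2))

grad = [sp.simplify(sp.diff(y, v)) for v in (x1, x2)]   # analytic gradient, as expressions
print(grad)

# numerical substitution at (x1, x2) = (1.5, 0.5)
print([float(g.subs({x1: 1.5, x2: 0.5})) for g in grad])
```
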
SLIDE 21

Outline

– Analytic differentiation
– Finite-difference approximation
– Automatic differentiation
– Differentiable programming
– Suggested reading

9 / 36

SLIDES 22-24

Limitation of analytic differentiation

What is the gradient and/or Hessian of

    f(W) = ∑_i ‖y_i − σ(Wk σ(Wk−1 σ(. . . (W1 x_i))))‖_F² ?

Applying the chain rule is boring and error-prone. Performing Taylor expansion is also tedious.

Lesson we learn from technology history: leave boring jobs to computers.

10 / 36

SLIDES 25-28

Approximate the gradient

(Credit: numex-blog.com)

    f′(x) = lim_{δ→0} [f(x + δ) − f(x)] / δ

For f(x) : Rn → R,

    ∂f/∂x_i ≈ [f(x + δe_i) − f(x)] / δ              (forward)
    ∂f/∂x_i ≈ [f(x) − f(x − δe_i)] / δ              (backward)
    ∂f/∂x_i ≈ [f(x + δe_i) − f(x − δe_i)] / (2δ)    (central)

Similarly, to approximate the Jacobian for f(x) : Rn → Rm:

    ∂f_j/∂x_i ≈ [f_j(x + δe_i) − f_j(x)] / δ        (one element each time)
    ∂f/∂x_i  ≈ [f(x + δe_i) − f(x)] / δ             (one column each time)
    Jp       ≈ [f(x + δp) − f(x)] / δ               (directional)

Central schemes can also be derived.

11 / 36

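A minimal NumPy sketch of these schemes (my own, not from the slides; the step size δ = 1e-6 is a typical but arbitrary choice in double precision):

```python
import numpy as np

def grad_fd(f, x, delta=1e-6, scheme="central"):
    """Finite-difference gradient of a scalar function f : R^n -> R."""
    g = np.zeros_like(x, dtype=float)
    for i in range(x.size):
        e = np.zeros_like(x, dtype=float); e[i] = delta
        if scheme == "forward":
            g[i] = (f(x + e) - f(x)) / delta
        elif scheme == "backward":
            g[i] = (f(x) - f(x - e)) / delta
        else:  # central
            g[i] = (f(x + e) - f(x - e)) / (2 * delta)
    return g

def jvp_fd(f, x, p, delta=1e-6):
    """Directional scheme: approximate J_f(x) p for vector-valued f."""
    return (f(x + delta * p) - f(x)) / delta

f = lambda x: np.sum(np.sin(x) ** 2)
x = np.array([0.3, -1.2, 2.0])
print(grad_fd(f, x), np.sin(2 * x))   # analytic gradient of sum(sin^2) is sin(2x)
```
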
SLIDES 29-31

Why central?

Stronger form of Taylor’s theorem:
– 1st order: If f(x) : Rn → R is twice continuously differentiable,
      f(x + δ) = f(x) + ⟨∇f(x), δ⟩ + O(‖δ‖₂²)
– 2nd order: If f(x) : Rn → R is three-times continuously differentiable,
      f(x + δ) = f(x) + ⟨∇f(x), δ⟩ + (1/2) ⟨δ, ∇²f(x) δ⟩ + O(‖δ‖₂³)

Why is the central scheme better?

– Forward: by 1st-order Taylor expansion,
      (1/δ) [f(x + δe_i) − f(x)] = (1/δ) [δ ∂f/∂x_i + O(δ²)] = ∂f/∂x_i + O(δ)

– Central: by 2nd-order Taylor expansion,
      (1/(2δ)) [f(x + δe_i) − f(x − δe_i)]
          = (1/(2δ)) [δ ∂f/∂x_i + (1/2) δ² ∂²f/∂x_i² + δ ∂f/∂x_i − (1/2) δ² ∂²f/∂x_i² + O(δ³)]
          = ∂f/∂x_i + O(δ²)

12 / 36

SLIDES 32-33

Approximate the Hessian

– Recall that for f(x) : Rn → R that is 2nd-order differentiable, ∂f/∂x_i (x) : Rn → R. So

      ∂²f/∂x_j∂x_i (x) = ∂/∂x_j [∂f/∂x_i] (x) ≈ [∂f/∂x_i (x + δe_j) − ∂f/∂x_i (x)] / δ

– We can also compute one row of the Hessian each time by

      ∂/∂x_j [∂f/∂x] (x) ≈ [∂f/∂x (x + δe_j) − ∂f/∂x (x)] / δ,

  obtaining a matrix H̃ which might not be symmetric. Return (1/2)(H̃ + H̃⊺) instead.

– Most times (e.g., in TRM, Newton-CG), only ∇²f(x) v for certain v’s is needed (see, e.g., Manopt https://www.manopt.org/):

      ∇²f(x) v ≈ [∇f(x + δv) − ∇f(x)] / δ    (1)

13 / 36

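A small NumPy sketch of equation (1) (my own illustration; the quadratic test function and δ = 1e-6 are arbitrary choices): approximate the Hessian-vector product ∇²f(x) v from two gradient evaluations.

```python
import numpy as np

def hvp_fd(grad_f, x, v, delta=1e-6):
    """Finite-difference Hessian-vector product using only the gradient."""
    return (grad_f(x + delta * v) - grad_f(x)) / delta

# example: f(x) = 0.5 x^T A x has gradient A x and Hessian A
A = np.array([[3.0, 1.0], [1.0, 2.0]])
grad_f = lambda x: A @ x
x, v = np.array([0.5, -1.0]), np.array([1.0, 2.0])
print(hvp_fd(grad_f, x, v), A @ v)   # the two should agree closely
```
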
SLIDES 34-36

A few words

– Can be used as a sanity check for correctness of an analytic gradient.
– Finite-difference approximation of higher (i.e., ≥ 2)-order derivatives combined with high-order iterative methods can be very efficient (e.g., Manopt https://www.manopt.org/tutorial.html#costdescription).
– Numerical stability can be an issue: truncation and round-off errors (finite δ; accurate evaluation of the numerators).

14 / 36

SLIDE 37

Outline

– Analytic differentiation
– Finite-difference approximation
– Automatic differentiation
– Differentiable programming
– Suggested reading

15 / 36

SLIDES 38-39

Four kinds of computing techniques

Credit: [Baydin et al., 2017]

Misnomer: should be automatic numerical differentiation.

16 / 36

SLIDES 40-43

Forward mode in 1D

Consider a univariate function fk ◦ fk−1 ◦ · · · ◦ f2 ◦ f1 (x) : R → R. Write y0 = x, y1 = f1(y0), y2 = f2(y1), . . . , yk = fk(yk−1), or in computational graph form: [chain graph y0 → y1 → · · · → yk]

Chain rule:

    df/dx = df/dy0 = (dyk/dyk−1) (dyk−1/dyk−2) · · · (dy2/dy1) (dy1/dy0)

Compute df/dx |x0 in one pass, from the innermost to the outermost parenthesis:

    Input: x0; initialization dy0/dy0 |x0 = 1
    for i = 1, . . . , k do
        compute yi = fi(yi−1)
        compute dyi/dy0 |x0 = dyi/dyi−1 |yi−1 · dyi−1/dy0 |x0 = f′i(yi−1) · dyi−1/dy0 |x0
    end for
    Output: dyk/dy0 |x0

17 / 36

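A tiny Python sketch of this loop (my own illustration; the composite function exp(sin(x)²) and its stage derivatives are arbitrary choices): propagate the pair (value, derivative) in a single pass.

```python
import math

# each stage is (f_i, f_i'); here f(x) = exp(sin(x)**2)
stages = [
    (math.sin,            math.cos),
    (lambda y: y * y,     lambda y: 2 * y),
    (math.exp,            math.exp),
]

def forward_mode(stages, x0):
    y, dy_dx = x0, 1.0                         # dy0/dy0 = 1
    for f, fprime in stages:
        # the RHS is evaluated before assignment, so both f and fprime see the
        # stage input y_{i-1}; then dy_i/dy0 = f'_i(y_{i-1}) * dy_{i-1}/dy0
        y, dy_dx = f(y), fprime(y) * dy_dx
    return y, dy_dx

x0 = 0.7
val, grad = forward_mode(stages, x0)
print(val, grad)   # grad ≈ exp(sin(x0)**2) * 2*sin(x0)*cos(x0)
```
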
SLIDES 44-47

Reverse mode in 1D

Consider a univariate function fk ◦ fk−1 ◦ · · · ◦ f2 ◦ f1 (x) : R → R. Write y0 = x, y1 = f1(y0), y2 = f2(y1), . . . , yk = fk(yk−1), or in computational graph form: [chain graph y0 → y1 → · · · → yk]

Chain rule:

    df/dx = df/dy0 = (dyk/dyk−1) (dyk−1/dyk−2) · · · (dy2/dy1) (dy1/dy0)

Compute df/dx |x0 in two passes, from the innermost to the outermost parenthesis for the 2nd:

    Input: x0; dyk/dyk = 1
    // forward pass
    for i = 1, . . . , k do
        compute yi = fi(yi−1)
    end for
    // backward pass
    for i = k − 1, k − 2, . . . , 0 do
        compute dyk/dyi |yi = dyk/dyi+1 |yi+1 · dyi+1/dyi |yi = f′i+1(yi) · dyk/dyi+1 |yi+1
    end for
    Output: dyk/dy0 |x0

18 / 36

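A matching Python sketch (my own, using the same assumed stages as in the forward-mode snippet): one forward pass stores the intermediate values, one backward pass accumulates dyk/dyi.

```python
import math

stages = [
    (math.sin,        math.cos),
    (lambda y: y * y, lambda y: 2 * y),
    (math.exp,        math.exp),
]

def reverse_mode(stages, x0):
    ys = [x0]                                   # forward pass: store y_0, ..., y_k
    for f, _ in stages:
        ys.append(f(ys[-1]))
    dyk_dyi = 1.0                               # dy_k/dy_k = 1
    for i in range(len(stages) - 1, -1, -1):    # backward pass
        _, fprime = stages[i]
        dyk_dyi = fprime(ys[i]) * dyk_dyi       # dy_k/dy_i = f'_{i+1}(y_i) * dy_k/dy_{i+1}
    return ys[-1], dyk_dyi

print(reverse_mode(stages, 0.7))   # same value and derivative as forward_mode
```
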
SLIDES 48-52

Forward vs reverse modes

– forward mode AD: one forward pass; compute the intermediate variable and derivative values together
– reverse mode AD: one forward pass to compute the intermediate variable values, one backward pass to compute the intermediate derivatives

Effectively, two different ways of grouping the multiplicative differential terms:

    df/dx = df/dy0 = dyk/dyk−1 ( dyk−1/dyk−2 ( · · · ( dy2/dy1 ( dy1/dy0 ) ) ) )

i.e., starting from the root: dy0/dy0 → dy1/dy0 → dy2/dy0 → · · · → dyk/dy0

    df/dx = df/dy0 = ( ( ( dyk/dyk−1 ) dyk−1/dyk−2 ) · · · dy2/dy1 ) dy1/dy0

i.e., starting from the leaf: dyk/dyk → dyk/dyk−1 → dyk/dyk−2 → · · · → dyk/dy0

. . . mixed forward and reverse modes are indeed possible!

19 / 36

SLIDES 53-58

Chain rule in computational graphs

Chain rule: Let f : Rn → Rm and h : Rm → Rk, where f is differentiable at x and h is differentiable at y = f(x). Then h ◦ f : Rn → Rk is differentiable at x, and (writing z = h(y))

    J[h◦f](x) = Jh(f(x)) Jf(x),    or    ∂zj/∂xi = ∑_{ℓ=1}^{m} (∂zj/∂yℓ) (∂yℓ/∂xi)   ∀ i, j

NB: this is a computational graph, not a NN.

– Each node is a variable, as a function of all incoming variables.
– If node B is a descendant of node A, ∂B/∂A is the rate of change in B wrt change in A.
– Traveling along a path, rates of change are multiplied.
– Chain rule: summing up rates over all connecting paths! (e.g., x2 to zj as shown)

20 / 36

SLIDES 59-64

A multivariate example — forward mode

    y = ( sin(x1/x2) + x1/x2 − e^{x2} ) ( x1/x2 − e^{x2} )

– interested in ∂/∂x1; for each variable vi, write v̇i := ∂vi/∂x1
– for each node, sum up partials over all incoming edges, e.g., v̇4 = (∂v4/∂v1) v̇1 + (∂v4/∂v3) v̇3
– complexity: O(#edges + #nodes)
– for f : Rn → Rm, make n forward passes: O(n (#edges + #nodes))

21 / 36

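A hand-unrolled Python sketch of forward mode on this function (my own illustration; the node numbering is mine and need not match the slide’s figure), propagating v̇i = ∂vi/∂x1 node by node:

```python
import math

x1, x2 = 1.5, 0.5
v1, dv1 = x1 / x2, 1.0 / x2                  # v1 = x1/x2
v2, dv2 = math.sin(v1), math.cos(v1) * dv1
v3, dv3 = math.exp(x2), 0.0                  # does not depend on x1
v4, dv4 = v1 - v3, dv1 - dv3                 # sum partials over incoming edges
v5, dv5 = v2 + v4, dv2 + dv4
y,  dy  = v5 * v4, dv5 * v4 + v5 * dv4
print(y, dy)                                 # dy = ∂y/∂x1
```
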
SLIDES 65-70

A multivariate example — reverse mode

– interested in ∂y/∂(·); for each variable vi, write v̄i := ∂y/∂vi (called the adjoint variable)
– for each node, sum up partials over all outgoing edges, e.g., v̄4 = (∂v5/∂v4) v̄5 + (∂v6/∂v4) v̄6
– complexity: O(#edges + #nodes)
– for f : Rn → Rm, make m backward passes: O(m (#edges + #nodes))

example from Ch 1 of [Griewank and Walther, 2008]

22 / 36

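The matching reverse-mode sketch (my own, same function and node numbering as in the forward-mode snippet): one forward pass for the values, then adjoints v̄i = ∂y/∂vi accumulated over outgoing edges, giving both partials in one backward pass.

```python
import math

x1, x2 = 1.5, 0.5
v1 = x1 / x2
v2 = math.sin(v1)
v3 = math.exp(x2)
v4 = v1 - v3
v5 = v2 + v4
y  = v5 * v4

ybar = 1.0
v5bar = v4 * ybar                             # y = v5 * v4
v4bar = v5 * ybar + 1.0 * v5bar               # v4 feeds into y and into v5
v3bar = -1.0 * v4bar
v2bar = 1.0 * v5bar
v1bar = math.cos(v1) * v2bar + 1.0 * v4bar    # v1 feeds into v2 and v4
x1bar = (1.0 / x2) * v1bar
x2bar = (-x1 / x2 ** 2) * v1bar + math.exp(x2) * v3bar
print(x1bar, x2bar)                           # ∂y/∂x1 and ∂y/∂x2
```
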
SLIDES 71-72

Forward vs. reverse modes

For a general function f : Rn → Rm, suppose there is no loop in the computational graph, i.e., the graph is acyclic. Define E: set of edges; V: set of nodes.

                  forward mode                          reverse mode
    start from    roots                                 leaves
    end with      leaves                                roots
    invariants    v̇i := ∂vi/∂x (x—root of interest)     v̄i := ∂y/∂vi (y—leaf of interest)
    rule          sum over incoming edges               sum over outgoing edges
    complexity    O(n|E| + n|V|)                        O(m|E| + m|V|)
    better when   m ≫ n                                 n ≫ m

23 / 36

SLIDES 73-78

Directional derivatives

Consider f(x) : Rn → Rm. Let the vi’s be the variables in its computational graph; in particular, vn−1 = x1, vn−2 = x2, . . . , v0 = xn. Dp(·) means the directional derivative wrt direction p. In practical implementations,

forward mode: compute Jf p, i.e., a Jacobian-vector product
– Why? (1) Columns of Jf can be obtained by setting p = e1, . . . , en. (2) When Jf has special structure (e.g., sparsity), save computation by judicious choices of p’s. (3) The problem may only need Jf p for a specific p, not Jf itself—save computation again.
– How? (1) Initialize Dp vn−1 = p1, . . . , Dp v0 = pn. (2) Apply the chain rule:

      ∇x vi = ∑_{j: incoming} (∂vi/∂vj) ∇x vj   ⟹   Dp vi = ∑_{j: incoming} (∂vi/∂vj) Dp vj

reverse mode: compute Jf⊺ q = ∇x (f⊺ q), i.e., a Jacobian-transpose-vector product
– Why? Similar to the above.
– How? Track d/dvi (f⊺ q):

      d/dvi (f⊺ q) = ∑_{k: outgoing} (∂vk/∂vi) d/dvk (f⊺ q)

24 / 36

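These two products are exactly what AD frameworks expose. A hedged Jax sketch (my own; Jax appears later in the deck, and the toy function here is an arbitrary choice): compute Jf p and Jf⊺ q without ever forming Jf.

```python
import jax
import jax.numpy as jnp

def f(x):                          # f : R^3 -> R^2
    return jnp.stack([jnp.sin(x[0]) * x[1], x[1] * x[2] ** 2])

x = jnp.array([0.3, 1.5, -2.0])
p = jnp.array([1.0, 0.0, 0.5])     # direction in input space
q = jnp.array([1.0, -1.0])         # direction in output space

_, Jp  = jax.jvp(f, (x,), (p,))    # forward mode: J_f(x) p
_, vjp = jax.vjp(f, x)
(Jtq,) = vjp(q)                    # reverse mode: J_f(x)^T q
print(Jp, Jtq)
```
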
SLIDES 79-82

Tensor abstraction

Tensors: multi-dimensional arrays.

Each node in the computational graph can be a tensor (scalar, vector, matrix, 3-D tensor, ...).

    f(W) = ‖Y − σ(Wk σ(Wk−1 σ(. . . (W1 X))))‖_F²

[figure: computational graph for a DNN]

25 / 36

SLIDES 83-89

Tensor abstraction

– Abstract out low-level details; operations are often simple (e.g., ∗, σ), so partials are simple
– Tensor (i.e., vector) chain rules apply, often via tensor-free computation
– Basis of implementation for: Tensorflow, Pytorch, Jax, etc. Jax: https://github.com/google/jax

Good to know:
– In practice, graphs are built automatically by software
– Higher-order derivatives can also be done, particularly the Hessian-vector product ∇²f(x) v (check out Jax!)
– Auto-diff in Tensorflow and Pytorch is specialized to DNNs and focuses on 1st order; Jax (in Python) is full-fledged and also supports GPU
– General resources for autodiff: http://www.autodiff.org/, [Griewank and Walther, 2008]

26 / 36

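As a concrete instance of the Hessian-vector product remark above, a hedged Jax sketch (my own; forward-over-reverse is one of several possible compositions): ∇²f(x) v as the JVP of ∇f along v.

```python
import jax
import jax.numpy as jnp

def f(x):
    return jnp.sum(jnp.sin(x) ** 2) + x[0] * x[1]

def hvp(f, x, v):
    # Hessian-vector product: directional derivative of grad(f) along v
    return jax.jvp(jax.grad(f), (x,), (v,))[1]

x = jnp.array([0.3, -1.2, 2.0])
v = jnp.array([1.0, 0.5, -0.5])
print(hvp(f, x, v))
```
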
SLIDES 90-91

Autodiff in Pytorch

Solve least squares f(x) = (1/2) ‖y − Ax‖₂² with ∇f(x) = −A⊺ (y − Ax)

[figure: loss vs. iterate]

27 / 36

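The slide’s code is not reproduced in this transcript; below is a minimal PyTorch sketch of the same idea (my own reconstruction; problem sizes, learning rate, and iteration count are arbitrary): gradient descent where autograd supplies ∇f(x).

```python
import torch

torch.manual_seed(0)
m, n = 50, 10
A = torch.randn(m, n)
y = torch.randn(m)
x = torch.zeros(n, requires_grad=True)

lr = 0.01
for t in range(500):
    loss = 0.5 * torch.sum((y - A @ x) ** 2)
    loss.backward()                  # autograd computes grad f(x) = -A^T (y - A x)
    with torch.no_grad():            # take the gradient step outside the graph
        x -= lr * x.grad
    x.grad.zero_()
print(loss.item())
```
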
SLIDE 92

Autodiff in Pytorch

Train a shallow neural network

    f(W) = ∑_i ‖y_i − W2 σ(W1 x_i)‖₂²

where σ(z) = max(z, 0), i.e., ReLU.
https://pytorch.org/tutorials/beginner/pytorch_with_examples.html
– torch.mm
– torch.clamp
– torch.no_grad()

Back propagation is reverse-mode auto-differentiation!

28 / 36

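A condensed sketch in the spirit of the linked tutorial (my own; dimensions, learning rate, and iteration count are arbitrary), using torch.mm, torch.clamp, and torch.no_grad() as listed above.

```python
import torch

N, d_in, h, d_out = 64, 10, 20, 5
X = torch.randn(N, d_in)
Y = torch.randn(N, d_out)
W1 = torch.randn(d_in, h, requires_grad=True)
W2 = torch.randn(h, d_out, requires_grad=True)

lr = 1e-4
for t in range(200):
    Y_pred = torch.mm(torch.clamp(torch.mm(X, W1), min=0), W2)   # W2 σ(W1 x) for all samples
    loss = (Y_pred - Y).pow(2).sum()
    loss.backward()                  # reverse-mode AD, i.e., back propagation
    with torch.no_grad():
        W1 -= lr * W1.grad
        W2 -= lr * W2.grad
        W1.grad.zero_()
        W2.grad.zero_()
```
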
SLIDE 93

Outline

– Analytic differentiation
– Finite-difference approximation
– Automatic differentiation
– Differentiable programming
– Suggested reading

29 / 36

SLIDES 94-98

Example: image enhancement

– Each stage applies a parameterized function to the image, i.e., q_{wk} ◦ · · · ◦ h_{w3} ◦ g_{w2} ◦ f_{w1}(X) (X is the camera raw)
– The parameterized functions may or may not be DNNs
– Each function may be analytic, or simply a chunk of code dependent on the parameters
– The wi’s are the trainable parameters

Credit: https://people.csail.mit.edu/tzumao/gradient_halide/

30 / 36

SLIDE 99

Example: image enhancement

– The trainable parameters are learned by gradient descent based on auto-differentiation
– This is a generalization of training DNNs with the classic feedforward structure to training general parameterized functions, using derivative-based methods

Credit: https://people.csail.mit.edu/tzumao/gradient_halide/

31 / 36

SLIDES 100-101

Example: control a trebuchet

https://fluxml.ai/2019/03/05/dp-vs-rl.html

– Given wind speed and target distance, the DNN predicts the angle of release and the mass of the counterweight
– Given the angle of release and the mass of the counterweight as initial conditions, the ODE solver calculates the actual distance by iterative methods
– We perform auto-differentiation through the iterative ODE solver and the DNN

32 / 36

SLIDE 102

Differentiable programming

Interesting resources:
– Notable implementations: Swift for Tensorflow https://www.tensorflow.org/swift, and Zygote in Julia https://github.com/FluxML/Zygote.jl
– Flux: machine learning package based on Zygote https://fluxml.ai/
– Taichi: differentiable programming language tailored to 3D computer graphics https://github.com/taichi-dev/taichi

33 / 36

SLIDE 103

Outline

– Analytic differentiation
– Finite-difference approximation
– Automatic differentiation
– Differentiable programming
– Suggested reading

34 / 36

SLIDE 104

Suggested reading

Autodiff in DNNs
– http://neuralnetworksanddeeplearning.com/chap2.html
– https://colah.github.io/posts/2015-08-Backprop/

Differentiable programming
– https://en.wikipedia.org/wiki/Differentiable_programming
– https://fluxml.ai/2019/02/07/what-is-differentiable-programming.html
– https://fluxml.ai/2019/03/05/dp-vs-rl.html

35 / 36

SLIDE 105

References

[Baydin et al., 2017] Baydin, A. G., Pearlmutter, B. A., Radul, A. A., and Siskind, J. M. (2017). Automatic differentiation in machine learning: a survey. The Journal of Machine Learning Research, 18(1):5595–5637.

[Griewank and Walther, 2008] Griewank, A. and Walther, A. (2008). Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation. Society for Industrial and Applied Mathematics.

36 / 36