

Second Order Reverse Mode of AD : A Vertex Elimination Perspective

Mu Wang, Alex Pothen and Paul Hovland

Computer Science, Purdue University; MCS Division, Argonne National Laboratory

Thanks: NSF, DOE, Intel

October 10, 2016

Outline

◮ Second order reverse mode of Automatic Differentiation
◮ Vertex elimination for evaluating the Gradient and the Hessian
◮ The correspondence between second order reverse mode and vertex elimination
◮ Discussion and broad picture

AD Fundamentals

◮ Automatic Differentiation (AD) is a technique that augments a computer program so that the augmented program computes the derivatives as well as the values of the function defined by the original program.
◮ Scalar objective function f : R^n → R
  ◮ Implemented as a computer program
  ◮ The evaluation is a sequence of decomposed elemental functions:
      For k = 1, 2, · · · , l :  vk = ϕk(vi){vi : vi ≺ vk}


◮ Example: y = pow(pow(x*x, 2.0), x)   (for x > 0, y = x^(4x))
  ◮ v0 <<= x
  ◮ v1 = ϕ1(v0) = v0 ∗ v0
  ◮ v2 = ϕ2(v1) = pow(v1, 2.0)
  ◮ v3 = ϕ3(v2, v0) = pow(v2, v0)
  ◮ v3 >>= y
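As a sanity check on this decomposition, a minimal sketch in plain C++ (no AD library; the test point x = 1.7 is illustrative) evaluates the trace by hand and compares it with the closed form x^(4x):

#include <cmath>
#include <cstdio>

int main() {
    double x = 1.7;                 // arbitrary test point, x > 0
    double v0 = x;                  // v0 <<= x   (independent)
    double v1 = v0 * v0;            // v1 = phi1(v0) = v0 * v0
    double v2 = std::pow(v1, 2.0);  // v2 = phi2(v1) = pow(v1, 2.0)
    double v3 = std::pow(v2, v0);   // v3 = phi3(v2, v0) = pow(v2, v0)
    double y  = v3;                 // v3 >>= y   (dependent)

    std::printf("trace: %.12f   closed form: %.12f\n", y, std::pow(x, 4.0 * x));
    return 0;
}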

◮ Indexing convention:
  ◮ Independent variables: v_{1−n}, · · · , v0
  ◮ Intermediate variables: v1, · · · , v_{l−1}
  ◮ Dependent variable: vl

Second Order Reverse Mode : Story Line

◮ First proposed by Gower and Mello¹
  ◮ Called Edge Pushing initially
  ◮ Derived from the closed form of the second order derivative of composite functions
◮ Wang, Gebremedhin, and Pothen provided a second perspective by adopting live variable analysis² from compiler theory.
  ◮ Better complexity bound
  ◮ Correct implementation
  ◮ Further improved with preaccumulation
◮ The new proof can be extended to general high orders.

¹ Gower, Robert Mansel, and Margarida P. Mello. Hessian matrices via automatic differentiation. Universidade Estadual de Campinas, Instituto de Matemática, Estatística e Computação Científica, 2010.
² Wang, Mu, Assefaw Gebremedhin, and Alex Pothen. "Capitalizing on live variables: new algorithms for efficient Hessian computation via automatic differentiation." Mathematical Programming Computation (2016): 1-41.


Reverse Mode of AD

◮ Function evaluation : evaluate each elemental function
    for k = 1, 2, · · · , l :  vk = ϕk(vi){vi : vi ≺ vk}
◮ Reverse mode of AD : process the sequence of elemental functions in reverse order
    for k = l, l − 1, · · · , 1 :  do something with vk = ϕk(vi){vi : vi ≺ vk}
◮ Equivalent function fk(Sk) : the function defined by the elemental functions ϕl, · · · , ϕk that have been processed at the end of step k of the reverse mode,
    f = (ϕl ◦ · · · ◦ ϕk) ◦ ϕk−1 ◦ · · · ◦ ϕ1,  where  fk(Sk) = ϕl ◦ · · · ◦ ϕk
◮ The independent variables of fk are denoted by Sk.


◮ Processing vk substitutes it into the equivalent function:
    fk(Sk) = fk+1(Sk+1 \ {vk}, vk = ϕk(vi){vi : vi ≺ vk})

    f = (ϕl ◦ · · · ◦ ϕk+1) ◦ ϕk ◦ ϕk−1 ◦ · · · ◦ ϕ1,   where  ϕl ◦ · · · ◦ ϕk+1 = fk+1(Sk+1)
    f = (ϕl ◦ · · · ◦ ϕk+1 ◦ ϕk) ◦ ϕk−1 ◦ · · · ◦ ϕ1,   where  ϕl ◦ · · · ◦ ϕk+1 ◦ ϕk = fk(Sk)
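For the running example, the equivalent functions can be written out explicitly (a worked illustration, assuming x > 0):

    f3(S3) = ϕ3(v2, v0) = v2^v0,               S3 = {v0, v2}
    f2(S2) = ϕ3(ϕ2(v1), v0) = (v1²)^v0,        S2 = {v0, v1}
    f1(S1) = ϕ3(ϕ2(ϕ1(v0)), v0) = v0^(4·v0),   S1 = {v0}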

◮ First order chain rule : for all vi ≺ vk,
    ∂fk/∂vi = ∂fk+1/∂vi + (∂vk/∂vi) · (∂fk+1/∂vk)


◮ Second order chain rule : for all unordered pairs (vi, vj) with vi ≺ vk or vj ≺ vk,
    ∂²fk/∂vi∂vj = ∂²fk+1/∂vi∂vj + (∂vk/∂vi) · (∂²fk+1/∂vj∂vk) + (∂vk/∂vj) · (∂²fk+1/∂vi∂vk)
                + (∂vk/∂vi) · (∂vk/∂vj) · (∂²fk+1/∂vk∂vk) + (∂²vk/∂vi∂vj) · (∂fk+1/∂vk)
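At the first step of the reverse sweep (k = l = 3 in the running example), fk+1 is the identity map on v3, so its second derivatives vanish and ∂f4/∂v3 = 1; the update reduces to the second partials of ϕ3 (a worked illustration, assuming x > 0):

    ∂²f3/∂v0∂v0 = ∂²v3/∂v0²   = (log v2)² · v3
    ∂²f3/∂v0∂v2 = ∂²v3/∂v0∂v2 = (v3/v2) · (1 + v0 · log v2)
    ∂²f3/∂v2∂v2 = ∂²v3/∂v2²   = v0 · (v0 − 1) · v3/v2²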


◮ The first order update is carried by adjoint variables ¯vi :
    ¯vi += (∂vk/∂vi) · ¯vk
  ◮ ¯vi holds the value of ∂fk/∂vi after step k
  ◮ Incremental updates in the implementation (a minimal sketch follows)
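A minimal sketch of this incremental adjoint update applied to the running example, in plain C++ (the hand-unrolled trace and the test point x = 1.7 are illustrative; an AD tool records and replays the trace automatically):

#include <cmath>
#include <cstdio>

int main() {
    double x = 1.7;
    // Forward sweep: evaluate and store all intermediate values.
    double v0 = x, v1 = v0 * v0, v2 = std::pow(v1, 2.0), v3 = std::pow(v2, v0);

    // Reverse sweep: the adjoint of the dependent variable starts at 1.
    double b0 = 0.0, b1 = 0.0, b2 = 0.0, b3 = 1.0;

    // k = 3 : v3 = pow(v2, v0), predecessors v2 and v0.
    b2 += v0 * v3 / v2 * b3;        // dv3/dv2 = v0 * v2^(v0-1)
    b0 += std::log(v2) * v3 * b3;   // dv3/dv0 = log(v2) * v3
    // k = 2 : v2 = pow(v1, 2.0), predecessor v1.
    b1 += 2.0 * v1 * b2;            // dv2/dv1 = 2*v1
    // k = 1 : v1 = v0 * v0, predecessor v0.
    b0 += 2.0 * v0 * b1;            // dv1/dv0 = 2*v0

    // b0 now holds df/dx; compare with the closed form 4*x^(4x)*(1 + ln x).
    std::printf("reverse mode: %.10f   closed form: %.10f\n",
                b0, 4.0 * std::pow(x, 4.0 * x) * (1.0 + std::log(x)));
    return 0;
}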

◮ The second order implementation involves more details, for exploiting sparsity and symmetry.

◮ A general high order chain rule → a general high order reverse mode
◮ Taking advantage of symmetry becomes more important

Reverse Mode of AD : Implementation

◮ Second order reverse mode: initially implemented as LivarH in ADOL-C
  ◮ https://github.com/CSCsw/LivarH
◮ ReverseAD: an operator overloading implementation of general high order reverse mode in C++11
  ◮ https://github.com/wangmu0701/ReverseAD
  ◮ Available for experimentation
◮ Monotonic indexing for variables on the trace:
    vi ≺ vk  =⇒  index(vi) < index(vk)
  ◮ Not satisfied by ADOL-C
  ◮ An immature fix was provided for LivarH


Reverse Mode of AD : Performance

◮ The FeasNewt benchmark (T. S. Munson and P. D. Hovland, 2005)
◮ A mesh optimization problem with a sparse Hessian matrix
◮ Compared with the compression-and-recovery approach implemented in ADOL-C + ColPack

    n                       :   2,598    12,597    39,379
    #nnz in H               :  46,488   253,029   828,129
    Direct    #colors       :      54        62        65
              runtime (s)   :    3.77     39.34    137.07
    Indirect  #colors       :      31        30        31
              runtime (s)   :    3.56     31.07    119.04
    ReverseAD runtime (s)   :    0.51      3.37     12.40

From Analytical to Combinatorial

◮ The second (high) order reverse mode is derived from a purely analytical point of view.
  ◮ Same as the original derivation of Edge Pushing.
◮ There are combinatorial models for AD algorithms, based on the concept of the Computational Graph G of the objective function:
  ◮ Edge Elimination
  ◮ Vertex Elimination
  ◮ Face Elimination
◮ Closely related to the classical linear algebra problem of sparse Gaussian elimination.


Computational Graph

◮ Computational graph: G = (V, E)
  ◮ Variables are vertices: V = {vi | 1 − n ≤ i ≤ l}
  ◮ Precedence relations are directed edges: E = {vi → vk | vi ≺ vk, 1 − n ≤ i < k ≤ l}
  ◮ Edge weights: c(i, k) := w(vi, vk) = ∂vk/∂vi

◮ For the example v1 = ϕ1(v0) = v0 ∗ v0, v2 = ϕ2(v1) = pow(v1, 2.0), v3 = ϕ3(v2, v0) = pow(v2, v0):

  [Graph: v0 → v1 → v2 → v3, plus the edge v0 → v3]

    c(0, 1) = ∂v1/∂v0 = 2 · v0
    c(1, 2) = ∂v2/∂v1 = 2 · v1
    c(2, 3) = ∂v3/∂v2 = v0 · v3/v2
    c(0, 3) = ∂v3/∂v0 = log v2 · v3


Vertex Elimination

Repeat:
  ◮ Pick an intermediate vertex vj
  ◮ For all (i, k) such that i ≺ j ≺ k :  c(i, k) += c(i, j) ∗ c(j, k)
  ◮ Remove vj from V
Until V has no intermediate vertices

◮ Proposed by Griewank and Reese, and studied extensively by Naumann and students

Example (eliminating v2, then v1):

  [Initial graph: v0 → v1 → v2 → v3 and v0 → v3, with weights c(0, 1), c(1, 2), c(2, 3), c(0, 3)]

  [After eliminating v2: vertices v0, v1, v3; weights c(0, 1), c(1, 3) = c(1, 2) · c(2, 3), c(0, 3)]

  [After eliminating v1: vertices v0, v3; weight c(0, 3) + c(0, 1) · c(1, 2) · c(2, 3)]

◮ Any elimination order gives the same final result.
◮ The time complexity (the number of edge weights computed) varies with the ordering; minimizing the space complexity is also likely to be intractable.
  ◮ It is NP-hard to determine the optimal ordering.

A minimal sketch of the elimination loop on the example graph follows.
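The sketch is plain C++ (the map-of-edges representation and the test point x = 1.7 are illustrative choices, not the implementation discussed in the talk):

#include <cmath>
#include <cstdio>
#include <map>
#include <utility>

int main() {
    double x = 1.7;
    double v0 = x, v1 = v0 * v0, v2 = std::pow(v1, 2.0), v3 = std::pow(v2, v0);

    std::map<std::pair<int,int>, double> c;   // edge weights c(i, k) for i -> k
    c[{0, 1}] = 2.0 * v0;                     // dv1/dv0
    c[{1, 2}] = 2.0 * v1;                     // dv2/dv1
    c[{2, 3}] = v0 * v3 / v2;                 // dv3/dv2
    c[{0, 3}] = std::log(v2) * v3;            // dv3/dv0

    for (int j : {2, 1}) {                    // eliminate the intermediate vertices
        // For all (i, k) with i -> j and j -> k : c(i, k) += c(i, j) * c(j, k)
        std::map<std::pair<int,int>, double> added;
        for (const auto& in : c)      if (in.first.second == j)
            for (const auto& out : c) if (out.first.first == j)
                added[{in.first.first, out.first.second}] += in.second * out.second;
        for (const auto& e : added) c[e.first] += e.second;
        // Remove v_j and all of its incident edges.
        for (auto it = c.begin(); it != c.end(); )
            if (it->first.first == j || it->first.second == j) it = c.erase(it); else ++it;
    }

    // The surviving edge weight c(0, 3) is df/dx.
    std::printf("c(0,3) = %.10f   df/dx = %.10f\n",
                c[{0, 3}], 4.0 * std::pow(x, 4.0 * x) * (1.0 + std::log(x)));
    return 0;
}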

Vertex Elimination for Hessian

◮ Applying the vertex elimination algorithm to G gives the gradient ∇f.
◮ To evaluate the Hessian of f we need the computational graph of the gradient, Gg, i.e., the computational graph of evaluating ∇f.
◮ Gg can be constructed from the first order non-incremental reverse mode.

Function evaluation:
    for k = 1, 2, · · · , l :  vk = ϕk(vi){vi : vi ≺ vk}

First order (non-incremental) reverse mode:
    Initialize ¯vl = 1.0, ¯v_{l−1} = · · · = 0
    for i = l − 1, · · · , 1, 0, · · · , 1 − n :
        ¯vi = Σ_{vi ≺ vk} (∂vk/∂vi) · ¯vk,
    which can be read as one elemental function per adjoint:
        ¯vi = ¯ϕi(∪_{vi ≺ vk}({vj : vj ≺ vk} ∪ {¯vk}))
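For the running example, the non-incremental reverse sweep produces one such elemental function per adjoint; written out (a worked illustration):

    ¯v3 = 1
    ¯v2 = ¯ϕ2 = (∂v3/∂v2) · ¯v3 = (v0 · v3/v2) · ¯v3
    ¯v1 = ¯ϕ1 = (∂v2/∂v1) · ¯v2 = (2 · v1) · ¯v2
    ¯v0 = ¯ϕ0 = (∂v1/∂v0) · ¯v1 + (∂v3/∂v0) · ¯v3 = (2 · v0) · ¯v1 + (log v2 · v3) · ¯v3

These statements, together with the original evaluation, are exactly what the graph Gg below records.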


Computational Graph of the Gradient

  [Graph Gg for the example: the original vertices v0, v1, v2, v3 together with the adjoint vertices ¯v0, ¯v1, ¯v2, ¯v3]

◮ Gg is built from the function evaluation together with the first order non-incremental reverse mode above:
    Vg = V ∪ ¯V,   Eg = EG ∪ EḠ ∪ EC


◮ EG : the edges of the original computational graph
    (vi, vk) ∈ Eg  ⇐⇒  vi ≺ vk,   with weight c(i, k) = ∂vk/∂vi   (from vk = ϕk(vi){vi : vi ≺ vk})

◮ EḠ : the mirrored edges between adjoint vertices
    (¯vk, ¯vi) ∈ Eg  ⇐⇒  ¯vk ≺ ¯vi  ⇐⇒  vi ≺ vk,   with weight c(¯k, ¯i) = ∂vk/∂vi   (from ¯vi = Σ_{vi ≺ vk} (∂vk/∂vi) · ¯vk)

◮ EC : the edges connecting primal and adjoint vertices, carrying the second order terms
    (vi, ¯vj) ∈ Eg  ⇐⇒  ∃ vk s.t. vi, vj ≺ vk,   with weight c(i, ¯j) = Σ_{vi, vj ≺ vk} (∂²vk/∂vi∂vj) · ¯vk

  [In the example graph this adds the edges (v0, ¯v0), (v1, ¯v1), (v2, ¯v2), (v0, ¯v2) and (v2, ¯v0), alongside the EG edges c(0, 1), c(1, 2), c(2, 3), c(0, 3) and their mirrored EḠ counterparts]
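Written out for the example (a worked illustration; the weights follow directly from the second partials of ϕ1, ϕ2, ϕ3):

    c(0, ¯0) = (∂²v1/∂v0²) · ¯v1 + (∂²v3/∂v0²) · ¯v3 = 2 · ¯v1 + (log v2)² · v3 · ¯v3
    c(1, ¯1) = (∂²v2/∂v1²) · ¯v2 = 2 · ¯v2
    c(2, ¯2) = (∂²v3/∂v2²) · ¯v3 = v0 · (v0 − 1) · (v3/v2²) · ¯v3
    c(0, ¯2) = c(2, ¯0) = (∂²v3/∂v0∂v2) · ¯v3 = (v3/v2) · (1 + v0 · log v2) · ¯v3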

Equivalence

◮ Vertex elimination on the gradient graph Gg gives the Hessian (combinatorial approach).
◮ Second order reverse mode gives the Hessian (analytical approach).

Second order reverse mode (a minimal sketch follows below):
    Initialize ¯vl = 1.0, ¯v_{l−1} = · · · = 0
    for k = l, l − 1, · · · , 1 :
        for each unordered pair (vi, vj) :
            hk(vi, vj) = hk+1(vi, vj) + (∂vk/∂vi) · hk+1(vj, vk) + (∂vk/∂vj) · hk+1(vi, vk)
                       + (∂vk/∂vi) · (∂vk/∂vj) · hk+1(vk, vk) + (∂²vk/∂vi∂vj) · ¯vk

Vertex elimination on Gg:
    Repeat:
        Pick an intermediate vertex vj
        For all (i, k) such that i ≺ j ≺ k :  c(i, k) += c(i, j) ∗ c(j, k)
        Remove vj from V
    Until V has no intermediate vertices

Theorem
If vertex elimination is performed on Gg in a symmetric reverse topological ordering, i.e., (vk, ¯vk) are eliminated in pairs in the order k = l, l − 1, · · · , 1, then the two algorithms correspond step by step.
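A minimal dense sketch of the second order reverse mode update on the running example, in plain C++ (the dense 4×4 storage, index layout, and test point are illustrative; the actual implementations keep only the symmetric nonzeros):

#include <cmath>
#include <cstdio>

int main() {
    double x = 1.7;
    double v[4];
    v[0] = x; v[1] = v[0] * v[0]; v[2] = std::pow(v[1], 2.0); v[3] = std::pow(v[2], v[0]);

    double d[4][4] = {};      // d[k][i]     = dv_k / dv_i          (zero if v_i is not a predecessor)
    double dd[4][4][4] = {};  // dd[k][i][j] = d^2 v_k / dv_i dv_j  (symmetric in i, j)
    d[1][0] = 2.0 * v[0];                        dd[1][0][0] = 2.0;
    d[2][1] = 2.0 * v[1];                        dd[2][1][1] = 2.0;
    d[3][2] = v[0] * v[3] / v[2];
    d[3][0] = std::log(v[2]) * v[3];
    dd[3][2][2] = v[0] * (v[0] - 1.0) * v[3] / (v[2] * v[2]);
    dd[3][0][2] = dd[3][2][0] = (v[3] / v[2]) * (1.0 + v[0] * std::log(v[2]));
    dd[3][0][0] = std::log(v[2]) * std::log(v[2]) * v[3];

    double bar[4] = {0.0, 0.0, 0.0, 1.0};  // adjoints, bar(v_l) initialized to 1
    double H[4][4] = {};                   // holds h_{k+1}, starts at zero

    for (int k = 3; k >= 1; --k) {
        double Hn[4][4] = {};
        for (int i = 0; i < k; ++i)        // h_k(v_i, v_j) from the update rule above
            for (int j = 0; j < k; ++j)
                Hn[i][j] = H[i][j] + d[k][i] * H[j][k] + d[k][j] * H[i][k]
                         + d[k][i] * d[k][j] * H[k][k] + dd[k][i][j] * bar[k];
        for (int i = 0; i < k; ++i)
            for (int j = 0; j < k; ++j) H[i][j] = Hn[i][j];
        for (int j = 0; j < 4; ++j) { H[k][j] = 0.0; H[j][k] = 0.0; }   // v_k is eliminated
        for (int i = 0; i < k; ++i) bar[i] += d[k][i] * bar[k];         // first order adjoint update
    }

    // H[0][0] is d^2 f / dx^2; closed form: 4 x^(4x) (4 (1 + ln x)^2 + 1/x).
    std::printf("second order reverse: %.8f   closed form: %.8f\n", H[0][0],
                4.0 * std::pow(x, 4.0 * x) * (4.0 * std::pow(1.0 + std::log(x), 2.0) + 1.0 / x));
    return 0;
}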


Theorem

◮ The two algorithms perform the same computations, and thus maintain the same intermediate results after each step.
◮ This requires two minor tweaks of vertex elimination on Gg:
  ◮ Tweak one: parallel edges in EC
    ◮ Break the edge c(i, ¯j) = Σ_{vi, vj ≺ vk} (∂²vk/∂vi∂vj) · ¯vk
    ◮ into parallel edges ck(i, ¯j) = (∂²vk/∂vi∂vj) · ¯vk, one per vk
  ◮ Tweak two: a new set of edges EH
    ◮ Rule 1: all edges added during elimination go into EH
    ◮ Rule 2: after eliminating (vk, ¯vk), move all ck(i, ¯j) from EC to EH
◮ Claim: EH corresponds to the nonzeros of the Hessian of fk(Sk) after each step.


Discussion

◮ Second order reverse mode is equivalent to a special form of vertex elimination on the computational graph of the gradient, Gg.
◮ It may not be the optimal form of vertex elimination, due to the structure of Gg; but in practice it can be implemented with efficient storage and memory access.
  ◮ Second order reverse mode does not require the graph Gg to be formed.
  ◮ It can be implemented with a single reverse sweep.
  ◮ It can incorporate checkpointing to overcome memory/disk limits.
◮ There are possibilities for optimizing second order reverse mode by exploiting structural properties:
  ◮ Out-of-order processing of vk = ϕk(vi){vi : vi ≺ vk}
  ◮ The benefit must outweigh the optimization overhead.


Future Work : Broad Picture

◮ This work reveals the correspondence between analytical and combinatorial points of view of AD algorithms.
  ◮ First order forward/reverse mode corresponds to edge elimination on G with a specific elimination ordering.
  ◮ Second order reverse mode corresponds to vertex elimination on Gg with the reverse symmetric elimination ordering.
◮ Is there a generalization to high orders?
  ◮ The analytical form of the high order reverse mode is the implementation of the high order chain rule.
  ◮ What is the generalization of the combinatorial form of the high order reverse mode?
    ◮ What is the computational graph of the Hessian, GH?
    ◮ What is the elimination technique that we should perform on GH?


References

◮ Griewank, Andreas, and Andrea Walther. Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation. SIAM, 2008.
◮ Griewank, Andreas, and Shawn Reese. On the calculation of Jacobian matrices by the Markowitz rule. In Andreas Griewank and George F. Corliss, editors, Automatic Differentiation of Algorithms: Theory, Implementation, and Application, pages 126-135. SIAM, Philadelphia, PA, 1991.
◮ Naumann, Uwe. Optimal Jacobian Accumulation is NP-complete. Mathematical Programming, 112(2):427-441, 2008.
◮ Gower, Robert Mansel, and Margarida P. Mello. Hessian matrices via automatic differentiation. Universidade Estadual de Campinas, Instituto de Matemática, Estatística e Computação Científica, 2010.
◮ Wang, Mu, Assefaw Gebremedhin, and Alex Pothen. Capitalizing on live variables: new algorithms for efficient Hessian computation via automatic differentiation. Mathematical Programming Computation (2016): 1-41.
◮ Wang, Mu, Alex Pothen, and Paul Hovland. Edge Pushing is Equivalent to Vertex Elimination for Computing Hessians. SIAM CSC16.
◮ Wang, Mu, and Alex Pothen. Evaluating High Order Derivative Tensors in Reverse Mode of Automatic Differentiation. AD2016.
◮ Wang, Mu, and Alex Pothen. High Order Reverse Mode of AD: Theory and Implementation. In preparation.

Backup Slides

Vertex Elimination as Gaussian Elimination

◮ We can build a matrix C = [c_ij], 1 − n ≤ i, j ≤ l, with
  ◮ c_ij = ∂vi/∂vj (the edge weight in G) when vj ≺ vi,
  ◮ c_ii = −1 on the diagonal,
  ◮ and all other elements zero.

◮ With block sizes n (independents), l − m (intermediates) and m (dependents):

               n      l − m     m
        n   [ −I        0       0  ]
  C  = l−m  [  B      L − I     0  ]
        m   [  R        T      −I  ]

◮ C is a block lower triangular matrix.
◮ The Jacobian ∇f = R + T · (I − L)⁻¹ · B is the Schur complement obtained by eliminating the intermediate block.
◮ It can be computed with a Gaussian elimination procedure, as sketched below.
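For the running example (n = 1, l = 3, m = 1) this becomes a concrete 4 × 4 matrix; a worked instance, assuming x > 0:

          [      −1          0         0        0 ]
    C  =  [     2·v0        −1         0        0 ]
          [       0         2·v1      −1        0 ]
          [  log v2 · v3     0     v0·v3/v2    −1 ]

    B = (2·v0, 0)ᵀ,   L = [ 0  0 ; 2·v1  0 ],   R = ( log v2 · v3 ),   T = ( 0, v0·v3/v2 )

    ∇f = R + T · (I − L)⁻¹ · B = log v2 · v3 + 4·v0²·v1·v3/v2 = 4·x^(4x) · (1 + log x)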

Adjacency Matrix for Gg

◮ The same construction for Gg gives a block matrix with block sizes n, l − m, m for the primal vertices and m, l − m, n for the adjoint vertices:

               n      l − m     m       m      l − m     n
        n   [ −I        0       0       0        0       0  ]
       l−m  [  B      L − I     0       0        0       0  ]
  H  =  m   [  R        T      −I       0        0       0  ]
        m   [  0        0       0      −I        0       0  ]
       l−m  [  Z        Y       0       T′     L′ − I    0  ]
        n   [  X        Z′      0       R′       B′     −I  ]

◮ The lower right block structure, C′, is the transpose of C along the antidiagonal.
◮ The Hessian is the Schur complement of X with respect to the rest of the matrix.