Automatic Differentiation of Programs and its Applications to Scientific Computing


1. Automatic Differentiation of Programs and its Applications to Scientific Computing
Laurent Hascoët
Laurent.Hascoet@sophia.inria.fr
http://www-sop.inria.fr/tropics
Manchester, April 7th, 2005

2. Outline
1  Introduction
2  Formalization
3  ...... Multi-directional
4  Reverse AD
5  ...... Alternative formalizations
6  Reverse AD for Optimization
7  Performance issues
8  ...... Static Analyses in AD tools
9  Some AD Tools
10 Validation
11 ...... Expert-level AD
12 Conclusion

3. So you need derivatives?...
Given a program P computing a function
\[ F : \mathbb{R}^m \to \mathbb{R}^n, \qquad X \mapsto Y \]
we want to build a program that computes the derivatives of F.
Specifically, we want the derivatives of the dependents (some variables in Y) with respect to the independents (some variables in X).

4. Which derivatives do you want?
Derivatives come in various shapes and flavors:
Jacobian matrices: $J = \left[\frac{\partial y_j}{\partial x_i}\right]$
Directional or tangent derivatives, differentials: $dY = \dot{Y} = J \times dX = J \times \dot{X}$
Gradients:
  when $n = 1$ output: gradient $= J = \left[\frac{\partial y}{\partial x_i}\right]$
  when $n > 1$ outputs: gradient $= \overline{Y}^{t} \times J$
Higher-order derivative tensors
Taylor coefficients
Intervals?

5. Divided Differences
Given $\dot{X}$, run P twice, and compute
\[ \dot{Y} = \frac{P(X + \varepsilon \dot{X}) - P(X)}{\varepsilon} \]
Pros: immediate; no thinking required!
Cons: approximation; what $\varepsilon$? ⇒ Not so cheap after all!
Optimization wants inexpensive and accurate derivatives.
⇒ Let's go for exact, analytic derivatives!
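To make the divided-differences recipe concrete, here is a minimal hand-written driver, assuming the subroutine FOO shown on the next slide; the input values, the direction (perturbing v1 only), and the choice of EPS are illustrative assumptions, not part of the original slides.

      PROGRAM DIVDIF
C Approximate dv4/dv1 for FOO by a one-sided divided difference
      REAL v1, v2, v4, v4e, p1, eps
      v1 = 1.0
      v2 = 2.0
      p1 = 3.0
      eps = 1.0E-4
C Run P twice: once at X, once at X + eps*Xdot, with Xdot = (1,0)
      CALL FOO(v1, v2, v4, p1)
      CALL FOO(v1+eps, v2, v4e, p1)
      PRINT *, 'approx dv4/dv1 =', (v4e-v4)/eps
      END

The answer degrades if EPS is too large (truncation error) or too small (cancellation in v4e-v4), which is exactly the "what ε?" problem above.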

6. AD Example: analytic tangent differentiation by program transformation

      SUBROUTINE FOO(v1, v2, v4, p1)
      REAL v1,v2,v3,v4,p1
      v3 = 2.0*v1 + 5.0
      v4 = v3 + p1*v2/v3
      END

7. AD Example: analytic tangent differentiation by program transformation

      SUBROUTINE FOO(v1, v2, v4, p1)
      REAL v1,v2,v3,v4,p1
      v3d = 2.0*v1d
      v3 = 2.0*v1 + 5.0
      v4d = v3d + p1*(v2d*v3-v2*v3d)/(v3*v3)
      v4 = v3 + p1*v2/v3
      END

8. AD Example: analytic tangent differentiation by program transformation

      SUBROUTINE FOO(v1, v1d, v2, v2d, v4, v4d, p1)
      REAL v1d,v2d,v3d,v4d
      REAL v1,v2,v3,v4,p1
      v3d = 2.0*v1d
      v3 = 2.0*v1 + 5.0
      v4d = v3d + p1*(v2d*v3-v2*v3d)/(v3*v3)
      v4 = v3 + p1*v2/v3
      END

Just inserts "differentiated instructions" into FOO.
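A caller must now seed the input derivatives and read back the output derivative. This is a minimal sketch of such a driver; the input values are illustrative assumptions, and only the calling convention comes from the slide above.

      PROGRAM TANDRV
      REAL v1, v2, v4, p1
      REAL v1d, v2d, v4d
      v1 = 1.0
      v2 = 2.0
      p1 = 3.0
C Seed the direction Xdot = (1,0): differentiate w.r.t. v1 only
      v1d = 1.0
      v2d = 0.0
      CALL FOO(v1, v1d, v2, v2d, v4, v4d, p1)
C v4d now holds the directional derivative dv4/dv1 at (v1,v2)
      PRINT *, 'dv4/dv1 =', v4d
      END

Seeding v1d = 0, v2d = 1 instead returns dv4/dv2: one run of the tangent code per direction.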

9. Outline (repeats the outline of slide 2; next section: Formalization)

10. Take control away!
We differentiate programs. But control ⇒ non-differentiability!
Freeze the current control:
⇒ the program becomes a simple sequence of instructions
⇒ AD differentiates these sequences:

Program:                      Diff(Program):
Control 1: CodeList 1    →    Control 1: Diff(CodeList 1)
Control 2: CodeList 2    →    Control 2: Diff(CodeList 2)
...                           ...
Control N: CodeList N    →    Control N: Diff(CodeList N)

Caution: the program is only piecewise differentiable!
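As a hedged sketch not taken from the original slides, freezing the control means the differentiated statements are simply inserted under the same branches, so each run differentiates the code list that was actually executed:

C Tangent code for a branch, with the control flow kept as-is
      IF (v1 .GT. 0.0) THEN
        v2d = 2.0*v1d
        v2 = 2.0*v1
      ELSE
        v2d = -v1d
        v2 = -v1
      END IF

At v1 = 0 the two branches give different derivatives (2 vs. -1): the function is differentiable on each side of the test but not at the switching point, hence only piecewise differentiable.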

11. Computer Programs as Functions
Identify sequences of instructions { I_1; I_2; ... I_{p-1}; I_p; } with compositions of functions.
Each simple instruction I_k, e.g.
      v4 = v3 + v2/v3
is a function $f_k : \mathbb{R}^q \to \mathbb{R}^q$ where:
the output v4 is built from the inputs v2 and v3;
all other variables are passed unchanged.
Thus we see P: { I_1; I_2; ... I_{p-1}; I_p; } as $f = f_p \circ f_{p-1} \circ \cdots \circ f_1$
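Spelling this out for the example instruction, with the state written as the four variables of the running example (a worked instance added here for clarity):

\[ f_k(v_1, v_2, v_3, v_4) = \left( v_1,\; v_2,\; v_3,\; v_3 + \frac{v_2}{v_3} \right) \]

Only the v4 component is non-trivial; the identity components are what make the composition $f = f_p \circ \cdots \circ f_1$ well defined on the whole memory state.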

12. Using the Chain Rule
\[ f = f_p \circ f_{p-1} \circ \cdots \circ f_1 \]
We define for short: $W_0 = X$ and $W_k = f_k(W_{k-1})$
The chain rule yields:
\[ f'(X) = f'_p(W_{p-1}) \cdot f'_{p-1}(W_{p-2}) \cdot \ldots \cdot f'_1(W_0) \]

13. Tangent mode and Reverse mode
Full J is expensive and often useless. We'd better compute useful projections of J.
Tangent AD:
\[ \dot{Y} = f'(X) \cdot \dot{X} = f'_p(W_{p-1}) \cdot f'_{p-1}(W_{p-2}) \cdot \ldots \cdot f'_1(W_0) \cdot \dot{X} \]
Reverse AD:
\[ \overline{X} = f'^{t}(X) \cdot \overline{Y} = f'^{t}_1(W_0) \cdot \ldots \cdot f'^{t}_{p-1}(W_{p-2}) \cdot f'^{t}_p(W_{p-1}) \cdot \overline{Y} \]
Evaluate both from right to left: ⇒ always matrix × vector
Theoretical cost is about 4 times the cost of P
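Right-to-left evaluation can be written as two recurrences (a small expansion added here, using the $W_k$ of the previous slide):

\[ \dot{W}_0 = \dot{X}, \qquad \dot{W}_k = f'_k(W_{k-1}) \cdot \dot{W}_{k-1}, \qquad \dot{Y} = \dot{W}_p \]
\[ \overline{W}_p = \overline{Y}, \qquad \overline{W}_{k-1} = f'^{t}_k(W_{k-1}) \cdot \overline{W}_k, \qquad \overline{X} = \overline{W}_0 \]

The tangent recurrence runs in the same order as P, while the reverse recurrence needs the $W_k$ in the opposite order; this is the source of the memory problem discussed later.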

14. Costs of Tangent and Reverse AD
$F : \mathbb{R}^m \to \mathbb{R}^n$ (m inputs, n outputs)
J costs $m \times 4 \times P$ using the tangent mode ⇒ good if $m \leq n$
J costs $n \times 4 \times P$ using the reverse mode ⇒ good if $m \gg n$ (e.g. $n = 1$ in optimization, where J is the gradient)
For instance, for a scalar objective ($n = 1$) with $m = 1000$ inputs, reverse mode returns the full gradient for about 4 evaluation costs, whereas tangent mode would need about 4000.

15. Back to the Tangent Mode example
      v3 = 2.0*v1 + 5.0
      v4 = v3 + p1*v2/v3
Elementary Jacobian matrices (state order v1, v2, v3, v4):
\[ f'_1 = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 2 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} \qquad f'_2 = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & \frac{p_1}{v_3} & 1 - \frac{p_1 v_2}{v_3^2} & 0 \end{pmatrix} \]
so that
\[ \dot{v}_3 = 2 \dot{v}_1 \]
\[ \dot{v}_4 = \dot{v}_3 \left( 1 - \frac{p_1 v_2}{v_3^2} \right) + \dot{v}_2 \frac{p_1}{v_3} \]

16. Tangent Mode example continued
Tangent AD keeps the structure of P:
      ...
      v3d = 2.0*v1d
      v3 = 2.0*v1 + 5.0
      v4d = v3d*(1-p1*v2/(v3*v3)) + v2d*p1/v3
      v4 = v3 + p1*v2/v3
      ...
Differentiated instructions inserted into P's original control flow.

17. Outline (repeats the outline of slide 2; next section: Multi-directional)

18. Multi-directional mode and Jacobians
If you want $\dot{Y} = f'(X) \cdot \dot{X}$ for the same X and several $\dot{X}$:
either run the tangent differentiated program several times, evaluating f several times,
or run a "multi-directional" tangent once, evaluating f once.
Same for $\overline{X} = f'^{t}(X) \cdot \overline{Y}$ for several $\overline{Y}$.
In particular, multi-directional tangent or reverse is good for getting the full Jacobian.
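A multi-directional tangent carries one derivative slot per direction. This is a hedged sketch of what such code could look like for FOO; the name FOOMD, the NBDIRS parameter, and the array layout are assumptions made for illustration, not the output of any particular tool.

      SUBROUTINE FOOMD(v1, v1d, v2, v2d, v4, v4d, p1, nbdirs)
C Propagates NBDIRS tangent directions in one evaluation of FOO
      INTEGER nbdirs, i
      REAL v1, v2, v3, v4, p1
      REAL v1d(nbdirs), v2d(nbdirs), v3d(nbdirs), v4d(nbdirs)
      DO i = 1, nbdirs
        v3d(i) = 2.0*v1d(i)
      END DO
      v3 = 2.0*v1 + 5.0
      DO i = 1, nbdirs
        v4d(i) = v3d(i)*(1.0-p1*v2/(v3*v3)) + v2d(i)*p1/v3
      END DO
      v4 = v3 + p1*v2/v3
      END

The original statements run once; only the derivative statements are repeated per direction, which is why this beats running the tangent code NBDIRS times.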

19. Sparse Jacobians with seed matrices
When the Jacobian is sparse, use "seed matrices" to propagate fewer $\dot{X}$ or $\overline{Y}$:
Multi-directional tangent mode: compute $J \times S$, with a 0/1 seed matrix S combining structurally independent columns of J.
Multi-directional reverse mode: compute $W^{t} \times J$, with a 0/1 seed matrix $W^{t}$ combining structurally independent rows of J.
[Slide example: a sparse Jacobian with nonzeros a ... g, multiplied by a 0/1 seed matrix, so that every nonzero is recovered from the compressed product.]
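Since the slide's own matrices did not survive extraction, here is an illustrative example in the same spirit, constructed for this transcript rather than taken from the slide. Columns 1 and 2 of J share no row, and neither do columns 3 and 4, so two tangent directions cover four columns:

\[
\begin{pmatrix} a & 0 & b & 0 \\ 0 & c & 0 & 0 \\ 0 & 0 & 0 & d \end{pmatrix}
\times
\begin{pmatrix} 1 & 0 \\ 1 & 0 \\ 0 & 1 \\ 0 & 1 \end{pmatrix}
=
\begin{pmatrix} a & b \\ c & 0 \\ 0 & d \end{pmatrix}
\]

Similarly in reverse mode, rows 1 and 3 share no column, so two adjoint sweeps cover three rows:

\[
\begin{pmatrix} 1 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix}
\times
\begin{pmatrix} a & 0 & b & 0 \\ 0 & c & 0 & 0 \\ 0 & 0 & 0 & d \end{pmatrix}
=
\begin{pmatrix} a & 0 & b & d \\ 0 & c & 0 & 0 \end{pmatrix}
\]

Every nonzero of J appears exactly once in each compressed product, so the full sparse Jacobian is recovered in 2 sweeps instead of 4 (tangent) or 3 (reverse).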

20. Outline (repeats the outline of slide 2; next section: Reverse AD)

21.-23. Focus on the Reverse mode (built up incrementally over three slides)
\[ \overline{X} = f'^{t}(X) \cdot \overline{Y} = f'^{t}_1(W_0) \cdot \ldots \cdot f'^{t}_{p-1}(W_{p-2}) \cdot f'^{t}_p(W_{p-1}) \cdot \overline{Y} \]
is evaluated by:
$I_1;$
...
$I_{p-2};$
$I_{p-1};$
$\overline{W} = \overline{Y};$
$\overline{W} = f'^{t}_p(W_{p-1}) \ast \overline{W};$
Restore $W_{p-2}$ before $I_{p-1}$;
$\overline{W} = f'^{t}_{p-1}(W_{p-2}) \ast \overline{W};$
...
Restore $W_0$ before $I_1$;
$\overline{W} = f'^{t}_1(W_0) \ast \overline{W};$
$\overline{X} = \overline{W};$
Instructions differentiated in the reverse order!

24. Reverse mode: graphical interpretation
[Figure: a forward sweep runs $I_1, I_2, I_3, \ldots, I_{p-1}, I_p$ left to right; below it, a backward sweep runs the differentiated instructions $\overline{I}_p, \overline{I}_{p-1}, \ldots, \overline{I}_2, \overline{I}_1$ right to left, with time flowing downwards.]
Bottleneck: memory usage ("Tape").

25. Back to the example
      v3 = 2.0*v1 + 5.0
      v4 = v3 + p1*v2/v3
Transposed Jacobian matrices (state order v1, v2, v3, v4):
\[ f'^{t}_2 = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & \frac{p_1}{v_3} \\ 0 & 0 & 1 & 1 - \frac{p_1 v_2}{v_3^2} \\ 0 & 0 & 0 & 0 \end{pmatrix} \qquad f'^{t}_1 = \begin{pmatrix} 1 & 0 & 2 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} \]
giving the adjoint statements:
$\overline{v}_2 = \overline{v}_2 + \overline{v}_4 \cdot p_1 / v_3$
...
$\overline{v}_1 = \overline{v}_1 + 2 \cdot \overline{v}_3$
$\overline{v}_3 = 0$
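For completeness, a hand-written sketch of what reverse (adjoint) code for FOO could look like; the name FOOB, its argument list, and the recompute-instead-of-store strategy for v3 are illustrative assumptions, since the slides have not yet said how the $W_k$ are restored.

      SUBROUTINE FOOB(v1, v1b, v2, v2b, v4b, p1)
      REAL v1, v2, v3, p1
      REAL v1b, v2b, v3b, v4b
C Forward sweep: recompute the intermediate v3 needed below
      v3 = 2.0*v1 + 5.0
C Backward sweep: apply f'^t_2 then f'^t_1, in reverse order
      v3b = v4b*(1.0 - p1*v2/(v3*v3))
      v2b = v2b + v4b*p1/v3
      v4b = 0.0
      v1b = v1b + 2.0*v3b
      v3b = 0.0
      END

Called with v4b = 1 and v1b = v2b = 0, it returns the gradient (dv4/dv1, dv4/dv2) in (v1b, v2b) in a single sweep.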
