  1. Derivative Evaluation by Automatic Differentiation of Programs. Laurent Hascoët, Laurent.Hascoet@sophia.inria.fr, http://www-sop.inria.fr/tropics. CEA-EDF-INRIA Summer School, July 2005.

  2. Outline
     1 Introduction
     2 Formalization
     3 Reverse AD
     4 Alternative formalizations
     5 Memory issues in Reverse AD: Checkpointing
     6 Multi-directional
     7 Reverse AD for Optimization
     8 AD for Sensitivity to Uncertainties
     9 Some AD Tools
     10 Static Analyses in AD tools
     11 The TAPENADE AD tool
     12 Validation of AD results
     13 Expert-level AD
     14 Conclusion

  3. So you need derivatives?... Given a program P computing a function $F : \mathbb{R}^m \to \mathbb{R}^n$, $X \mapsto Y$, we want to build a program that computes the derivatives of F. Specifically, we want the derivatives of the dependent variables (some variables in Y) with respect to the independent variables (some variables in X).

  4. Which derivatives do you want? Derivatives come in various shapes and flavors:
     Jacobian matrices: $J = \left[\frac{\partial y_j}{\partial x_i}\right]$
     Directional or tangent derivatives, differentials: $dY = \dot{Y} = J \times dX = J \times \dot{X}$
     Gradients: when $n = 1$ output, gradient $= J = \left[\frac{\partial y}{\partial x_i}\right]$; when $n > 1$ outputs, gradient $= \bar{Y}^{\,t} \times J$
     Higher-order derivative tensors, Taylor coefficients, intervals?

  5. Divided Differences. Given $\dot{X}$, run P twice, and compute
     $\dot{Y} = \frac{P(X + \varepsilon \dot{X}) - P(X)}{\varepsilon}$
     Pros: immediate; no thinking required! Cons: only an approximation, and which $\varepsilon$ to choose? ⇒ Not so cheap after all! Most applications require inexpensive and accurate derivatives. ⇒ Let's go for exact, analytic derivatives!
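A minimal sketch of this recipe, assuming a stand-in function p and a hand-picked step eps (neither is from the slides):

      ! Divided differences: two runs of p and one division per direction.
      program divdif
        implicit none
        real :: x, xdot, eps, ydot
        x = 2.0
        xdot = 1.0          ! direction in which we want the derivative
        eps = 1.0e-4        ! the "which epsilon?" question from the slide
        ! first-order approximation of J * xdot
        ydot = (p(x + eps*xdot) - p(x)) / eps
        print *, 'approx derivative:', ydot   ! exact value is 2*x = 4.0
      contains
        real function p(u)
          real, intent(in) :: u
          p = u*u + 5.0     ! hypothetical stand-in for the program P
        end function p
      end program divdif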

  6. Automatic Differentiation. Augment program P to make it compute the analytic derivatives.
     P:   a = b*T(10) + c
     The differentiated program must somehow compute:
     P':  da = db*T(10) + b*dT(10) + dc
     How can we achieve this? AD by overloading, or AD by program transformation.

  7. AD by overloading. Tools: adol-c, adtageo, ... Few manipulations required: DOUBLE → ADOUBLE; link with the provided overloaded +, -, *, ... Easy extension to higher-order derivatives, Taylor series, intervals, ... but not so easy for gradients. Anecdote: real → complex, so that x = a*b computes (x, dx) = (a*b - da*db, a*db + da*b)
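Fortran's built-in COMPLEX type makes this anecdote runnable without writing any overloads; a sketch with illustrative values (note the spurious -da*db term in the real part, which is why serious overloading tools use a dedicated dual-number type instead):

      ! Value in the real part, derivative in the imaginary part:
      ! complex * then yields (a*b - da*db, a*db + da*b), as on the slide.
      program complex_trick
        implicit none
        complex :: a, b, x
        a = cmplx(3.0, 1.0)   ! value 3.0, seed derivative da = 1.0
        b = cmplx(2.0, 0.0)   ! value 2.0, db = 0.0
        x = a * b
        print *, 'value =', real(x)   ! 6.0 (exact here since da*db = 0)
        print *, 'dx    =', aimag(x)  ! a*db + da*b = 2.0
      end program complex_trick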

  8. AD by Program transformation. Tools: adifor, taf, tapenade, ... A complex transformation is required: build a new program that computes the analytic derivatives explicitly. This requires a sophisticated, compiler-like tool: 1 PARSING, 2 ANALYSIS, 3 DIFFERENTIATION, 4 REGENERATION.

  9. Overloading vs Transformation. Overloading is versatile; transformed programs are efficient: global program analyses are possible, and most welcome! The compiler can optimize the generated program.

  10. Example: Tangent differentiation by Program transformation
      SUBROUTINE FOO(v1, v2, v4, p1)
      REAL v1,v2,v3,v4,p1
      v3 = 2.0*v1 + 5.0
      v4 = v3 + p1*v2/v3
      END

  11. Example: Tangent differentiation by Program transformation
      SUBROUTINE FOO(v1, v2, v4, p1)
      REAL v1,v2,v3,v4,p1
      v3d = 2.0*v1d
      v3 = 2.0*v1 + 5.0
      v4d = v3d + p1*(v2d*v3-v2*v3d)/(v3*v3)
      v4 = v3 + p1*v2/v3
      END

  12. Example: Tangent differentiation by Program transformation
      SUBROUTINE FOO(v1, v1d, v2, v2d, v4, v4d, p1)
      REAL v1d,v2d,v3d,v4d
      REAL v1,v2,v3,v4,p1
      v3d = 2.0*v1d
      v3 = 2.0*v1 + 5.0
      v4d = v3d + p1*(v2d*v3-v2*v3d)/(v3*v3)
      v4 = v3 + p1*v2/v3
      END
      This just inserts "differentiated instructions" into FOO.
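A possible driver for this differentiated FOO, compiled alongside it (a sketch; the seeding convention is standard tangent AD, but the driver itself is not from the slides): seeding (v1d, v2d) = (1, 0) makes v4d the partial derivative dv4/dv1 at the point (v1, v2).

      program drive_foo
        implicit none
        real :: v1, v1d, v2, v2d, v4, v4d, p1
        v1 = 1.0;  v2 = 2.0;  p1 = 3.0
        v1d = 1.0; v2d = 0.0            ! input direction Xdot
        call foo(v1, v1d, v2, v2d, v4, v4d, p1)
        print *, 'v4      =', v4        ! original result
        print *, 'dv4/dv1 =', v4d       ! tangent derivative
      end program drive_foo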

  13. Outline (repeats slide 2).

  14. Dealing with the Programs' Control. Programs contain control: discrete ⇒ non-differentiable.
      if (x <= 1.0) then printf("x too small");
      else {
        y = 1.0;
        while (y <= 10.0) {
          y = y*x;
          x = x+0.5;
        }
      }
      Not differentiable at x=1.0 (the if switches branch) nor at x=2.9221444 (the while changes its number of iterations).

  15. Take control away! We differentiate programs, but control ⇒ non-differentiability! Freeze the current control: for one given control, the program becomes a simple list of instructions ⇒ differentiable:
      printf("x too small");
      y = 1.0; y = y*x; x = x+0.5;
      AD differentiates these code lists; schematically, the Program splits into Control 1 ⇒ CodeList 1, ..., Control N ⇒ CodeList N, each CodeList i is differentiated into Diff(CodeList i), and reassembling them under the same controls gives Diff(Program). A Fortran sketch of this follows below.
      Caution: the program is only piecewise differentiable!
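For illustration, here is a hand-written tangent version of the loop above, transcribed into Fortran to match the deck's other examples (a sketch, not tool output): the control statements are kept verbatim, and derivative statements are inserted next to the ones they differentiate.

      subroutine branchy(x, xd, y, yd)
        implicit none
        real, intent(inout) :: x, xd
        real, intent(out)   :: y, yd
        y = 0.0; yd = 0.0
        if (x <= 1.0) then
          print *, 'x too small'      ! no derivative: y is never set here
        else
          yd = 0.0                    ! d(1.0) = 0
          y  = 1.0
          do while (y <= 10.0)        ! frozen control: same iterations
            yd = yd*x + y*xd          ! tangent of y = y*x (uses old y)
            y  = y*x
            x  = x + 0.5              ! xd unchanged: d(x+0.5) = xd
          end do
        end if
      end subroutine branchy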

  16. Computer Programs as Functions. Identify a sequence of instructions { I_1; I_2; ... I_{p-1}; I_p; } with a composition of functions. Each simple instruction I_k, e.g. v4 = v3 + v2/v3, is a function $f_k : \mathbb{R}^q \to \mathbb{R}^q$: the output v4 is built from the inputs v2 and v3, and all other variables are passed unchanged. Thus we see P : { I_1; I_2; ... I_{p-1}; I_p; } as $f = f_p \circ f_{p-1} \circ \cdots \circ f_1$

  17. Using the Chain Rule. We see program P as $f = f_p \circ f_{p-1} \circ \cdots \circ f_1$. Define for short $W_0 = X$ and $W_k = f_k(W_{k-1})$. The chain rule yields:
      $f'(X) = f'_p(W_{p-1}) \cdot f'_{p-1}(W_{p-2}) \cdot \ldots \cdot f'_1(W_0)$

  18. The Jacobian Program. $f'(X) = f'_p(W_{p-1}) \cdot f'_{p-1}(W_{p-2}) \cdot \ldots \cdot f'_1(W_0)$ translates immediately into a program that computes the Jacobian J:
      I_1;                  /* W = f_1(W) */
      I_2;                  /* W = f_2(W) */
      ...
      I_p;                  /* W = f_p(W) */

  19. The Jacobian Program. $f'(X) = f'_p(W_{p-1}) \cdot f'_{p-1}(W_{p-2}) \cdot \ldots \cdot f'_1(W_0)$ translates immediately into a program that computes the Jacobian J:
      W = X;
      J = f'_1(W);          I_1;   /* W = f_1(W) */
      J = f'_2(W) * J;      I_2;   /* W = f_2(W) */
      ...
      J = f'_p(W) * J;      I_p;   /* W = f_p(W) */
      Y = W;
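To make this concrete, here is a sketch that executes this Jacobian program by hand for the two-instruction FOO example, with state W = (v1, v2, v3, v4); note each $f'_k(W)$ is evaluated on $W_{k-1}$, just before I_k runs. This is illustration only: as the next slide argues, the full J is usually too expensive, so real tools never form these matrices explicitly.

      program jacobian_foo
        implicit none
        real :: w(4), j(4,4), fp(4,4), p1
        integer :: i
        p1 = 3.0
        w  = (/ 1.0, 2.0, 0.0, 0.0 /)   ! W = (v1, v2, v3, v4) = X
        j  = 0.0
        do i = 1, 4
           j(i,i) = 1.0                 ! J = Identity
        end do
        ! J = f1'(W) * J ; then I1:  v3 = 2.0*v1 + 5.0
        fp = 0.0
        do i = 1, 4
           fp(i,i) = 1.0
        end do
        fp(3,:) = (/ 2.0, 0.0, 0.0, 0.0 /)   ! row of the overwritten v3
        j = matmul(fp, j)
        w(3) = 2.0*w(1) + 5.0
        ! J = f2'(W) * J ; then I2:  v4 = v3 + p1*v2/v3
        fp = 0.0
        do i = 1, 4
           fp(i,i) = 1.0
        end do
        fp(4,:) = (/ 0.0, p1/w(3), 1.0 - p1*w(2)/w(3)**2, 0.0 /)
        j = matmul(fp, j)
        w(4) = w(3) + p1*w(2)/w(3)
        print *, 'dv4/dv1 =', j(4,1), '  dv4/dv2 =', j(4,2)
      end program jacobian_foo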

  20. Tangent mode and Reverse mode. The full J is expensive and often useless; we'd better compute useful projections of J.
      Tangent AD: $\dot{Y} = f'(X) \cdot \dot{X} = f'_p(W_{p-1}) \cdot f'_{p-1}(W_{p-2}) \cdots f'_1(W_0) \cdot \dot{X}$
      Reverse AD: $\bar{X} = f'^{\,t}(X) \cdot \bar{Y} = f'^{\,t}_1(W_0) \cdots f'^{\,t}_{p-1}(W_{p-2}) \cdot f'^{\,t}_p(W_{p-1}) \cdot \bar{Y}$
      Evaluate both from right to left ⇒ always matrix × vector products. Theoretical cost: about 4 times the cost of P.
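As a preview of reverse mode (treated in detail in a later section), here is what the transposed, right-to-left evaluation looks like when written by hand for the FOO example. This is a sketch, not tool output; the *b naming for adjoint variables is a common convention.

      subroutine foo_b(v1, v1b, v2, v2b, v4b, p1)
        implicit none
        real, intent(in)    :: v1, v2, p1
        real, intent(inout) :: v1b, v2b, v4b
        real :: v3, v3b
        ! forward sweep: rerun the original computation
        v3 = 2.0*v1 + 5.0
        ! (v4 = v3 + p1*v2/v3 is not needed: its value is unused below)
        ! backward sweep: transposed derivatives, in reverse order
        v3b = v4b*(1.0 - p1*v2/(v3*v3))   ! adjoint of v4 = v3 + p1*v2/v3
        v2b = v2b + v4b*p1/v3
        v4b = 0.0
        v1b = v1b + 2.0*v3b               ! adjoint of v3 = 2.0*v1 + 5.0
      end subroutine foo_b

Seeding v4b = 1 (and v1b = v2b = 0) returns the gradient of v4: the same values as m tangent runs, in one sweep.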

  21. Costs of Tangent and Reverse AD. For $F : \mathbb{R}^m \to \mathbb{R}^n$ (m inputs, n outputs):
      Tangent mode builds J column by column: J costs $m \times 4 \times P$. Good if $m \leq n$.
      Reverse mode builds J row by row: J costs $n \times 4 \times P$. Good if $m \gg n$ (e.g. n = 1 in optimization, where one reverse sweep yields the whole gradient).

  22. Back to the Tangent Mode example
      v3 = 2.0*v1 + 5.0
      v4 = v3 + p1*v2/v3
      Elementary Jacobian matrices, with variables ordered (v1, v2, v3, v4):
      $$f'_1 = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 2 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} \qquad f'_2 = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & \frac{p_1}{v_3} & 1 - \frac{p_1 v_2}{v_3^2} & 0 \end{pmatrix}$$
      so that
      $\dot{v}_3 = 2\,\dot{v}_1$
      $\dot{v}_4 = \dot{v}_3\,(1 - p_1 v_2 / v_3^2) + \dot{v}_2\,p_1 / v_3$

  23. Tangent Mode example continued. Tangent AD keeps the structure of P:
      ...
      v3d = 2.0*v1d
      v3 = 2.0*v1 + 5.0
      v4d = v3d*(1-p1*v2/(v3*v3)) + v2d*p1/v3
      v4 = v3 + p1*v2/v3
      ...
      Differentiated instructions are inserted into P's original control flow.

  24. Outline (repeats slide 2).
