On Automatic Differentiation of Computer Codes

  1. www.csd.abdn.ac.uk/~etadjoud
  On Automatic Differentiation of Computer Codes
  Presented to the Institute of Cybernetics, Tallinn, 3 September 2007
  Emmanuel M. Tadjouddine, Computing Sciences Department, Aberdeen University, Aberdeen, AB24 3UE
  e.tadjouddine@abdn.ac.uk, www.csd.abdn.ac.uk/~etadjoud

  2. Outline
  - Automatic Differentiation
  - Formalisation
  - The Forward Mode
  - The Reverse Mode
  - Application: Optimisation of a Satellite Boom Structure
  - Computational Issues
  - Program Analyses
  - Graph Elimination
  - Numerical Results
  - Some Open Problems

  3. Evaluating Derivatives [2]
  Problem statement: given a program P computing a numerical function $F: \mathbb{R}^n \to \mathbb{R}^m$, $x \mapsto y$, build up a program that computes F and its derivatives.
  What derivatives? Precisely, we want derivatives of the dependents (e.g., some variables in the outputs y) with respect to the independents (e.g., some variables in the inputs x). This may be:
  - the Jacobian matrix $J = \nabla F = (\partial y_i / \partial x_j)$;
  - a directional derivative $\dot y = J \dot x$;
  - gradients (when m = 1), as well as higher-order derivatives.
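
  A small worked illustration (my example, not from the slides): for
      $F(x_1, x_2) = (x_1 x_2,\ \sin x_1)^T$
  the Jacobian and a directional derivative are
      $J = \nabla F = \begin{pmatrix} x_2 & x_1 \\ \cos x_1 & 0 \end{pmatrix}$, $\quad \dot y = J \dot x = (x_2 \dot x_1 + x_1 \dot x_2,\ \cos(x_1)\,\dot x_1)^T$.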

  4. Finite Differencing (FD)
  Given a direction $\dot x$, run P twice to compute
      $\dot y \approx \dfrac{P(x + h \dot x) - P(x)}{h}$
  where h is a small positive number.
  (+) Easy to implement.
  (-) Approximation: what step size h?
  (-) May be expensive, but not always!
  Accurate derivatives are needed in optimisation, for example.
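
  A minimal sketch of one-sided finite differencing in Python (the test function P and the default step size are my own illustrative choices, not from the slides):

      import numpy as np

      def fd_directional(P, x, xdot, h=1e-7):
          """One-sided finite-difference approximation of the directional
          derivative J(x) @ xdot, at the cost of two evaluations of P."""
          return (P(x + h * xdot) - P(x)) / h

      # Illustrative program P: R^2 -> R^2.
      def P(x):
          return np.array([x[0] * x[1], np.sin(x[0])])

      x = np.array([1.0, 2.0])
      xdot = np.array([1.0, 0.0])
      print(fd_directional(P, x, xdot))   # roughly [2.0, cos(1.0)]

  The result depends on h: too large gives truncation error, too small gives floating-point cancellation, which is exactly the step-size dilemma mentioned above.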

  5. Automatic Differentiation (AD)
  A semantics-augmentation framework, $P(x \mapsto y) \Rightarrow \dot P(x, \dot x \mapsto y, \dot y)$, applying the chain rule of calculus to elementary operations in an automated fashion.
  A simple example:
      proc foo(a,b,s)              proc food(a,da,b,db,s,ds)
        REAL a,b,s                   REAL a,b,s
                                     REAL da,db,ds
                           ==>       ds = a*db + da*b
        s = a*b                      s = a*b
      end proc foo                 end proc food
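
  A hedged Python rendering of the same transformation (the names foo and foo_d mirror the slide's foo/food pair; this is a sketch, not the output of any particular AD tool):

      def foo(a, b):
          """Original code: s = a * b."""
          s = a * b
          return s

      def foo_d(a, da, b, db):
          """Forward-mode differentiated code: the derivative statement,
          generated by the product rule, is placed alongside the original."""
          ds = a * db + da * b
          s = a * b
          return s, ds

      # Propagate the direction (da, db) = (1, 0), i.e. differentiate w.r.t. a:
      print(foo_d(3.0, 1.0, 4.0, 0.0))   # (12.0, 4.0), since d(a*b)/da = b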

  6. The AD Framework
  Source Code -> Parsing -> Intermediate Representation (e.g., AST) -> Graphs/Analyses + Transformation -> Augmented AST -> Unparsing -> Derivative Code
  Some AD tools: ADIFOR, Tapenade, TAF, etc.

  7. Differentiating Programs? (1)
  AD relies on the assumption that the input program is piecewise differentiable. To implement AD:
  - freeze the control flow of the input program;
  - view the program as a sequence of simple instructions;
  - differentiate the sequence of instructions.
  Caution: some programs may not be piecewise differentiable while representing a differentiable function!
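
  A classic illustration of the caution above (my example, not from the slides): the following code computes f(x) = x, a function that is differentiable everywhere with f'(0) = 1, yet differentiating with the control flow frozen at x = 0 follows the constant branch and reports a derivative of 0.

      def f(x):
          # Mathematically this is just f(x) = x, written with a branch.
          if x == 0.0:
              return 0.0      # constant branch, derivative 0
          return x            # identity branch, derivative 1

      def f_d(x, xdot):
          """Forward-mode differentiation with the control flow frozen at
          the current value of x: each branch is differentiated as written."""
          if x == 0.0:
              return 0.0, 0.0
          return x, xdot

      print(f_d(0.0, 1.0))    # (0.0, 0.0): the true derivative at x = 0 is 1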

  8. Differentiating Programs? (2)
  A program P is viewed as a sequence of instructions
      $P: I_1, I_2, \ldots, I_{p-1}, I_p$
  where each $I_i$ represents a function $\phi_i$,
      $I_i: v_i = \phi_i(\{v_j\}_{j \prec i}), \quad i = 1, \ldots, p,$
  which computes the value of $v_i$ in terms of previously defined $v_j$. P is a composition of functions
      $\phi = \phi_p \circ \phi_{p-1} \circ \ldots \circ \phi_2 \circ \phi_1.$
  Differentiating $\phi$ yields
      $\phi'(x) = \phi'_p(v_{p-1}) \times \phi'_{p-1}(v_{p-2}) \times \ldots \times \phi'_1(x),$
  i.e., a chain of matrix multiplications.
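
  A small numpy sketch of this view (the three-step program and its local Jacobians are invented for illustration): each instruction contributes a local Jacobian, and the program's Jacobian is their product in the right order.

      import numpy as np

      # Illustrative straight-line program on a state v in R^2:
      #   step 1: v <- (v1 + v2, v2)   step 2: v <- (v1 * v2, v2)   step 3: v <- (sin v1, v2)
      def step1(v): return np.array([v[0] + v[1], v[1]])
      def step2(v): return np.array([v[0] * v[1], v[1]])
      def step3(v): return np.array([np.sin(v[0]), v[1]])

      # Local Jacobians of the three steps, evaluated at their own inputs.
      def jac1(v): return np.array([[1.0, 1.0], [0.0, 1.0]])
      def jac2(v): return np.array([[v[1], v[0]], [0.0, 1.0]])
      def jac3(v): return np.array([[np.cos(v[0]), 0.0], [0.0, 1.0]])

      x = np.array([0.3, 0.7])
      v1 = step1(x); v2 = step2(v1); y = step3(v2)

      # Chain rule: J = A3(v2) @ A2(v1) @ A1(x), a chain of matrix multiplications.
      J = jac3(v2) @ jac2(v1) @ jac1(x)
      print(J)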

  9. Forward and Reverse Modes
  Forward mode:
      $\dot y = \phi'(x)\,\dot x = \phi'_p(v_{p-1}) \times \phi'_{p-1}(v_{p-2}) \times \ldots \times \phi'_1(x) \times \dot x$
  The cost of computing $\nabla F$ is about 3n times the cost of computing F.
  Reverse mode:
      $\bar x = \phi'(x)^T\,\bar y = \phi'_1(x)^T \times \phi'_2(v_1)^T \times \ldots \times \phi'_p(v_{p-1})^T \times \bar y$
  The cost of computing $\nabla F$ is about 3m times the cost of computing F, but the memory requirement may explode.
  Gradients are cheaper by reverse mode AD.
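
  Continuing the previous sketch with made-up local Jacobians, the two modes accumulate the same product in opposite directions, and each pass propagates a single vector rather than a full matrix:

      import numpy as np

      # Made-up local Jacobians A1, A2, A3 of three program steps.
      A1 = np.array([[1.0, 1.0], [0.0, 1.0]])
      A2 = np.array([[0.7, 1.0], [0.0, 1.0]])
      A3 = np.array([[0.8, 0.0], [0.0, 1.0]])

      xdot = np.array([1.0, 0.0])    # direction in input space
      ybar = np.array([1.0, 0.0])    # weight vector in output space

      # Forward mode: push xdot through the steps, right to left in the product.
      ydot = A3 @ (A2 @ (A1 @ xdot))

      # Reverse mode: push ybar through the transposed steps, in reverse order.
      xbar = A1.T @ (A2.T @ (A3.T @ ybar))

      J = A3 @ A2 @ A1
      print(np.allclose(ydot, J @ xdot), np.allclose(xbar, J.T @ ybar))   # True True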

  10. The Reverse Mode AD
      $\bar x = \phi'_1(x)^T \times \phi'_2(v_1)^T \times \ldots \times \phi'_p(v_{p-1})^T \times \bar y$
  Forward sweep (run the original instructions):
      $I_1: v_1 = \phi_1(x)$
      $I_2: v_2 = \phi_2(v_1, x)$
      ...
      $I_p: v_p = \phi_p(v_1, .., v_{p-1}, x)$
  Reverse sweep (adjoint instructions, initialised with $\bar x = \bar y$):
      $\bar I_p$: restore $v_{p-1}$ before $I_p$, then $\bar x = \phi'_p(v_{p-1})^T * \bar x$
      ...
      $\bar I_1$: restore x before $I_1$, then $\bar x = \phi'_1(x)^T * \bar x$
  Instructions are differentiated in reverse order. Either recompute or store the required values. The memory/execution time usage is a bottleneck!
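
  A minimal tape-based sketch of the store-and-reverse idea in Python (my own illustration, not the code any AD tool generates): the forward sweep records each operation, its inputs and its local partials on a tape, and the reverse sweep walks the tape backwards accumulating adjoints.

      import math

      tape = []   # entries: (result_id, input_ids, local partial derivatives)

      def var(values, x):
          values.append(x)
          return len(values) - 1

      def mul(values, i, j):
          k = var(values, values[i] * values[j])
          tape.append((k, (i, j), (values[j], values[i])))   # d(vi*vj)/dvi, d(vi*vj)/dvj
          return k

      def sin(values, i):
          k = var(values, math.sin(values[i]))
          tape.append((k, (i,), (math.cos(values[i]),)))
          return k

      def reverse_sweep(values, out_id, n_inputs):
          adj = [0.0] * len(values)
          adj[out_id] = 1.0                             # seed the output adjoint
          for k, inputs, partials in reversed(tape):    # adjoint instructions in reverse order
              for i, d in zip(inputs, partials):
                  adj[i] += d * adj[k]
          return adj[:n_inputs]

      # y = sin(x1 * x2); its gradient is (x2*cos(x1*x2), x1*cos(x1*x2)).
      values = []
      x1 = var(values, 1.0)
      x2 = var(values, 2.0)
      y = sin(values, mul(values, x1, x2))
      print(reverse_sweep(values, y, 2))   # [2*cos(2), cos(2)]

  The tape is what makes the memory usage grow with the length of the computation, which is the bottleneck mentioned above.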

  11. Structural Design [3]
  A lightweight cantilever structure for suspending scientific instruments away from a satellite. We wish to minimise the transmission of vibration through the structure, from the satellite to the instrument. NASA Photo ID: STS61B-120-052.

  12. Optimisation of the Structure (Meta-Lamarckian Approach)
  [Diagram of the boom structure:] 1) periodic forcing is applied at one end; 2) the transmitted power is measured at the other end; 3) a GA is used to minimise that power.

  13. Memetic Algorithm [4]
  Not a huge problem: the number of independents is n = 453. Using a GA with a population of 100 gives slow convergence, with run times of 83 CPU days for 10 generations [3]. We wish to speed up convergence using the Meta-Lamarckian approach [4] (Lamarckian evolution: inheritance of acquired characteristics), coupling gradient descent with the GA. But the gradient of the transmitted power is expensive to approximate with FD; we need the reverse mode of AD.

  14. Improved Performance
  AD coupled with hand-coded optimisations gave the following results:
      Method              CPU(∇F) (s)    CPU(∇F)/CPU(F)
      ADIFOR (reverse)    192.0          8.2
      FD (1-sided)        10912.7        464.4
  The gradient is now obtained for a cost equivalent to 8.2 function evaluations: 56 times faster than FD, without truncation error, and with a memory requirement of just 0.3 GB.

  15. Improving the AD Process
  Avoiding useless memory storage and computations:
  - data-flow analyses (e.g., dependencies between program variables);
  - undecidability ⇒ conservative decisions;
  - abstract interpretation, using conservative approximations of the semantics of computer programs over lattices.
  Exploiting the processor architecture:
  - code reordering techniques;
  - heuristics on code tuning, à la the ATLAS project (J. Dongarra et al.).

  16. Program Analyses
  - Activity analysis: determine the set of active variables, i.e., those that depend on an independent and impact a dependent.
  - Common subexpression elimination: reduce the number of floating-point operations (FLOPs).
  - Tape size: minimise the set of variables to be stored or recomputed for the reverse mode.
  - Sparse computations: dynamic exploitation via a sparse matrix library (as in the ADIFOR tool); static exploitation via array region analysis to detect sparse derivative objects, select an appropriate data structure and generate code accordingly [5].
  - Graph elimination techniques to pre-accumulate local derivatives at basic block level [1, 6].
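
  A hedged sketch of activity analysis on a straight-line code (the toy statement list and variable names are invented): a variable is active when it is both varied (it depends on an independent) and useful (it impacts a dependent).

      # Each statement: (left-hand side, set of right-hand-side variables), in program order.
      stmts = [("t", {"x1", "x2"}),
               ("u", {"t", "c"}),     # c is a passive constant parameter
               ("y", {"u", "x1"}),
               ("z", {"c"})]          # z does not influence the dependent

      independents, dependents = {"x1", "x2"}, {"y"}

      # Forward pass: 'varied' variables depend on some independent.
      varied = set(independents)
      for lhs, rhs in stmts:
          if rhs & varied:
              varied.add(lhs)

      # Backward pass: 'useful' variables impact some dependent.
      useful = set(dependents)
      for lhs, rhs in reversed(stmts):
          if lhs in useful:
              useful |= rhs

      active = varied & useful
      print(sorted(active))   # ['t', 'u', 'x1', 'x2', 'y']: no derivatives needed for c or z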

  17. AD by Vertex Elimination (1)
  Consider the code (left) and its computational graph (right):
      $v_3 = \phi_3(v_1, v_2)$
      $v_4 = \phi_4(v_2, v_3)$
      $v_5 = \phi_5(v_1, v_3)$
      $v_6 = \phi_6(v_4, v_5)$
  [Computational graph: vertices 1, .., 6; edges labelled $c_{3,1}, c_{3,2}, c_{4,2}, c_{4,3}, c_{5,1}, c_{5,3}, c_{6,4}, c_{6,5}$]
  wherein $c_{i,j} = \partial \phi_i / \partial v_j$. The derivative $\partial v_6 / \partial(v_1, v_2)$ is obtained by eliminating vertices 3, 4, 5.
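
  A hedged Python sketch of vertex elimination on this graph (the numeric local partials are made up; the update rule c[k,j] += c[k,i] * c[i,j], followed by deleting vertex i, is the standard elimination step):

      # Edge weights c[(i, j)] = local partial of v_i with respect to v_j,
      # for the graph on this slide, with invented numeric values.
      c = {(3, 1): 2.0, (3, 2): 0.5,
           (4, 2): 1.5, (4, 3): -1.0,
           (5, 1): 0.3, (5, 3): 4.0,
           (6, 4): 2.5, (6, 5): -0.5}

      def eliminate(c, i):
          """Eliminate intermediate vertex i: connect every predecessor j of i
          to every successor k of i, then remove all edges incident to i."""
          preds = [j for (a, j) in c if a == i]
          succs = [k for (k, b) in c if b == i]
          for k in succs:
              for j in preds:
                  c[(k, j)] = c.get((k, j), 0.0) + c[(k, i)] * c[(i, j)]
          for edge in [e for e in c if i in e]:
              del c[edge]

      for v in (3, 4, 5):         # one possible elimination ordering
          eliminate(c, v)

      # The surviving edges are the Jacobian entries dv6/dv1 and dv6/dv2.
      print(c)                    # {(6, 2): 1.5, (6, 1): -9.15} for these values

  Different elimination orderings (forward, reverse, cross-country) give the same Jacobian but different operation counts, which is what the heuristics on the next slide exploit.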

  18. AD by Vertex Elimination (2)
  The Vertex Elimination (VE) approach consists in:
  - building up explicitly the computational graph from the input code;
  - finding a vertex elimination sequence using heuristics (forward, reverse, or cross-country orderings);
  - then generating the derivative code.
  This approach enables us to:
  - re-use expertise from sparse linear algebra;
  - exploit the sparsity of the computation at compilation time;
  - pre-accumulate local Jacobians at basic block level in a hierarchical manner.

  19. Hierarchical AD
  To pre-accumulate a basic block, perform an in-out analysis:
  - the inputs and outputs of basic blocks are determined using a read and write analysis;
  - the inputs are those active variables that are written before, and read within, the basic block;
  - the outputs are those active variables that are written in the basic block and read thereafter.
  The overall differentiation is carried out by hierarchically pre-accumulating each basic block.
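
  A hedged sketch of the in-out analysis for a single basic block (the read/write sets and variable names are invented for illustration):

      # Invented read/write information around one basic block.
      written_before = {"x1", "x2", "t"}        # active variables defined before the block
      block_reads    = {"x1", "t", "c"}         # variables read inside the block
      block_writes   = {"u", "t"}               # variables written inside the block
      read_after     = {"u", "x2"}              # variables read after the block
      active         = {"x1", "x2", "t", "u"}   # from a prior activity analysis

      # Inputs: active variables written before, and read within, the block.
      block_inputs = active & written_before & block_reads
      # Outputs: active variables written in the block and read thereafter.
      block_outputs = active & block_writes & read_after

      print(sorted(block_inputs))    # ['t', 'x1']
      print(sorted(block_outputs))   # ['u']
      # The local Jacobian pre-accumulated for this block maps block_inputs to block_outputs.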

  20. Numerical Results (1)
  The Osher scheme is an approximate Riemann solver. It is used to evaluate the inviscid flux normal to a surface interface. The Jacobian is a 5 × 10 matrix.
      Technique           cpu(∇F)/cpu(F)    error(∇F)
      Hand-coded          2.646             0.0E+00
      FD                  10.824            1.6E-03
      ADIFOR (fwd)        6.373             5.7E-14
      ADIFOR (rev)        23.235            7.4E-14
      Tapenade (fwd)      4.784             6.8E-14
      VE method (best)    3.788             5.7E-14

  21. Numerical Results (2)
  The Roe scheme is an approximate Riemann solver. It computes numerical fluxes of mass, energy and momentum across a cell face in a finite-volume compressible flow calculation. The Jacobian is a 5 × 20 matrix.
      Technique           cpu(∇F)/cpu(F)    error(∇F)
      ADIFOR (fwd)        18.5              0.0
      ADIFOR (rev)        9.1               O(10^-16)
      FD                  24.1              O(10^-6)
      VE method (best)    4.7               O(10^-15)
