SLIDE 1

Algorithmic Differentiation of Structured Mesh Applications

Gábor Dániel Balogh
Supervisor: Dr. István Reguly

Pázmány Péter Catholic University, Faculty of Information Technology and Bionics

October 20, 2020

  • G. D. Balogh (PPCU – ITK)

October 20, 2020 1 / 18

SLIDE 2

Algorithmic Differentiation - AD

Algorithmic Differentiation (AD) evaluates derivatives of a function defined by a computer program.

SLIDE 3

Algorithmic Differentiation - AD

Algorithmic Differentiation (AD) evaluates derivatives of a function defined by a computer program. Why AD? Numerical methods require derivatives. There are three main ways of automating the computation of derivatives:

Finite differences: slow in high dimensions, lower accuracy
Symbolic differentiation: cannot handle some programming constructs
Algorithmic Differentiation: exact and fast

SLIDE 4

Algorithmic Differentiation - AD

Our program computes f : R^n → R^m: from some input u ∈ R^n it generates some output w ∈ R^m.

SLIDE 5

Algorithmic Differentiation - AD

Our program computes f : R^n → R^m: from some input u ∈ R^n it generates some output w ∈ R^m. Our goal is to get the Jacobian J (or a part of it): J_ij = ∂f_i/∂x_j

SLIDE 6

Algorithmic Differentiation - AD

Our program computes f : R^n → R^m: from some input u ∈ R^n it generates some output w ∈ R^m. Our goal is to get the Jacobian J (or a part of it): J_ij = ∂f_i/∂x_j. But how do we get there?

SLIDE 7

Algorithmic Differentiation - AD

Assume that we can write f as a composite of K functions: f = f_K ∘ f_{K−1} ∘ … ∘ f_1

SLIDE 8

Algorithmic Differentiation - AD

Assume that we can write f as a composite of K functions: f = f_K ∘ f_{K−1} ∘ … ∘ f_1. Then, by the chain rule, we can write the Jacobian as a product: J = J_K · J_{K−1} · … · J_1

SLIDE 9

Algorithmic Differentiation - AD

Assume that we can write f as a composite of K functions: f = f_K ∘ f_{K−1} ∘ … ∘ f_1. Then we can write the Jacobian as a product: J = J_K · J_{K−1} · … · J_1. There are two modes of AD:

Forward (tangent) mode computes J · u = J_K · J_{K−1} · … · J_1 · u, for a seed u ∈ R^n
Backward (adjoint) mode computes J^T · w = J_1^T · J_2^T · … · J_K^T · w, for a seed w ∈ R^m

SLIDE 10

Adjoint mode AD

Backward (adjoint) mode computes J^T · w = J_1^T · J_2^T · … · J_K^T · w, for w ∈ R^m

Use a seed w whose ith element is 1 and all others are 0: J^T · w then produces the ith row of J.

SLIDE 11

Adjoint mode AD

Backward (adjoint) mode computes J^T · w = J_1^T · J_2^T · … · J_K^T · w, for w ∈ R^m

Use a seed w whose ith element is 1 and all others are 0: J^T · w then produces the ith row of J.
Evaluate the product one step at a time: each step only needs the derivative of one function in the chain.
If we choose the f_i decomposition carefully, we can implement AD efficiently.

SLIDE 12

Adjoint mode AD

Backward (adjoint) mode computes J^T · w = J_1^T · J_2^T · … · J_K^T · w, for w ∈ R^m

Use a seed w whose ith element is 1 and all others are 0: J^T · w then produces the ith row of J.
Evaluate the product one step at a time: each step only needs the derivative of one function in the chain.
If we choose the f_i decomposition carefully, we can implement AD efficiently.
But to get J_i we need the state of the program at the evaluation of f_i.

SLIDE 13

Adjoint mode AD

Backward (adjoint) mode computes J^T · w = J_1^T · J_2^T · … · J_K^T · w, for w ∈ R^m

Commonly implemented with operator overloading: each f_i is an elementary operation, so J_i^T · w′ is easy to compute.
But this produces enormous chains, and we need to store every state of every variable.

SLIDE 14

DSLs for future-proof HPC applications

Fast-changing hardware: it is infeasible to maintain code supporting all of it.

Embedded Domain Specific Languages (eDSLs) can hide platform-specific details: future-proof, performance-portable solutions for application developers.

SLIDE 15

Oxford Parallel library for Structured mesh solvers

OPS (Oxford Parallel library for Structured mesh solvers): a C++ library with high-level API calls for structured mesh applications. High-level concepts:

grids
data defined on grids
loops over a subgrid, accessing data through stencils

OPS generates parallel implementations of the loops for all hardware.

SLIDE 16

OPS + AD

Each loop that performs computation must be a call to ops_par_loop, which

takes the loop body as a function
takes descriptors of the datasets: access type and stencil of the access

OPS generates the implementation for every ops_par_loop.

SLIDE 17

OPS + AD

Each loop that performs computation must be a call to ops_par_loop, which

takes the loop body as a function
takes descriptors of the datasets: access type and stencil of the access

OPS generates the implementation for every ops_par_loop. If we have all this information, can we do AD?

SLIDE 18

OPS + AD

If we have all this information, can we do AD? OPS already has control over the sequence of parallel loops. If we have the derivatives of these loops instead of the elementary operations, we can evaluate the chain rule at the loop level.

We assume that either the user supplies the derivative of a loop, or we can obtain it via source transformation.

SLIDE 19

OPS + AD

If we have all this information, can we do AD? OPS already has control over the sequence of parallel loops. If we have the derivatives of these loops instead of the elementary operations, we can evaluate the chain rule at the loop level.

We assume that either the user supplies the derivative of a loop, or we can obtain it via source transformation.

Features missing to perform the backward sweep:

storing intermediate states
reversing the data flow inside loops

SLIDE 20

Intermediate state storage - tape

The generated code registers loops and copies overwritten data into a data structure (the tape). In the backward sweep we execute the adjoint of each loop and load the intermediate states of the datasets.

[Figure: the forward and backward sweeps around the OPS tape. Forward sweep: initialise OPS and the variables, then run ops_par_loop_1 through ops_par_loop_N, writing checkpoints to the tape. Backward sweep: initialise the adjoints, then run the adjoint loops ops_par_loop_N_a1s back to ops_par_loop_1_a1s, loading checkpoints from the tape, interpret the adjoints, and finish.]

SLIDE 21

Reversing data flow

The second problem is that we need to parallelise the adjoints of the stencil loops as well.

Figure 1: Example computational step in OPS given by the user: (a) to compute the results and (b) to compute the derivatives backwards.

a: Compute the mean of the neighbours for each grid point

    inline void mean_kernel(const OPS_ACC<double> &u,
                            OPS_ACC<double> &u_2) {
      u_2(0, 0) = (u(-1, 0) + u(1, 0)
                   + u(0, -1) + u(0, 1)) * 0.25;
    }

b: The corresponding adjoint kernel

    inline void mean_kernel_adjoint(const OPS_ACC<double> &u,
                                    OPS_ACC<double> &u_a1s,
                                    const OPS_ACC<double> &u_2,
                                    OPS_ACC<double> &u_2_a1s) {
      u_a1s(-1, 0) += 0.25 * u_2_a1s(0, 0);
      u_a1s(1, 0)  += 0.25 * u_2_a1s(0, 0);
      u_a1s(0, -1) += 0.25 * u_2_a1s(0, 0);
      u_a1s(0, 1)  += 0.25 * u_2_a1s(0, 0);
    }

SLIDE 22

Reversing data flow

The writing pattern of the forward code allows no race conditions. The adjoint kernels use the reversed stencil, which introduces race conditions on the increments of each adjoint.
SLIDE 23

Avoiding race conditions

CPU: red-black colouring, creating alternating red and black stripes for each thread (Thr 0, Thr 1, …)

CUDA: overloading operators so that the adjoint increments use atomics

SLIDE 24

Performance: CPU

On a representative code from finance, our best implementation produces the derivatives in less than 6× the runtime of the primal code, which is close to the theoretical optimum.

SLIDE 25

Performance: V100

A naive GPU implementation of the adjoint loops takes 10-25× the runtime of the primal on typical problem sizes.

SLIDE 26

Current solution: Memory

Another problem with the current implementation is that checkpointing requires too much memory.

                            Memory (GB)
                                     With checkpointing, by iteration count
    Grid size       primal        10        100       200
    512 × 256        0.025     0.292      1.788     3.792
    1024 × 512       0.094     0.892      6.902     12.90
    1024 × 1024      0.188     1.946      13.04     26.04

SLIDE 27

Conclusions

We extended the OPS library with an adjoint-aware API.
We successfully parallelised the computation of the adjoints for CPUs and GPUs.
We showed promising runtime performance, but the current implementation requires too much memory.

We are currently working on an implementation that recomputes loops instead of storing checkpoints.

Supported by the ÚNKP-19-3-I New National Excellence Program of the Ministry for Innovation and Technology.
