SLIDE 1

Partitioning and numbering meshes for efficient MPI-parallel execution in PyOP2

Lawrence Mitchell, Mark Filipiak¹
Tuesday 18th March 2013

¹ lawrence.mitchell@ed.ac.uk, mjf@epcc.ed.ac.uk

SLIDE 2

Outline

◮ Numbering to be cache friendly
◮ Numbering for parallel execution
◮ Hybrid shared memory + MPI parallelisation

SLIDE 3

Modern hardware

◮ Latency to RAM is 100s of clock cycles
◮ Multiple caches to hide this latency
  ◮ memory from RAM arrives in cache lines (64 bytes; 128 bytes on Xeon Phi)
  ◮ hardware prefetching attempts to predict the next memory access

SLIDE 4

Exploiting hardware caches in FE assembly

◮ Direct loops over mesh entities are cache-friendly
◮ Indirect loops may not be
  ◮ can we arrange them to be cache friendly? (see the sketch below)
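To make the distinction concrete, here is a minimal Python sketch; the mesh, the cell-to-vertex map, and both loop bodies are hypothetical stand-ins, not PyOP2 code. The direct loop walks an array in index order, so every cache line fetched from RAM is fully used; the indirect loop gathers through the map, so its access pattern is only as good as the vertex numbering.

import numpy as np

num_cells, num_vertices = 8, 9
cell_data = np.random.rand(num_cells)
vertex_data = np.random.rand(num_vertices)
# hypothetical triangle mesh: each cell touches 3 vertices
cell_to_vertex = np.random.randint(0, num_vertices, size=(num_cells, 3))

# Direct loop: indexed by the iteration variable itself, sequential access.
out_direct = np.empty(num_cells)
for c in range(num_cells):
    out_direct[c] = 2.0 * cell_data[c]

# Indirect loop: gathers vertex data through the map; jumps around in
# memory if neighbouring cells reference far-apart vertex numbers.
out_indirect = np.empty(num_cells)
for c in range(num_cells):
    v0, v1, v2 = cell_to_vertex[c]
    out_indirect[c] = vertex_data[v0] + vertex_data[v1] + vertex_data[v2]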

SLIDE 5

A mesh

(figure: an example unstructured mesh)

SLIDE 6

Cache friendly visit order (default numbering)

(figure: animation of the visit order under the default numbering)


SLIDE 9

Mesh entity numbering is critical

◮ arrange for “connected” vertices to have a good numbering (close to each other)
◮ given this good numbering, derive numberings for the other entities

SLIDE 10

Numbering dofs

◮ Cover the mesh with a space-filling curve
◮ vertices that are close to each other get close numbers (see the sketch below)
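A minimal sketch of one way to do this, using the classic xy2d Hilbert-curve index; the coordinates and quantisation are hypothetical, and the Fluidity implementation the talk refers to is not shown here.

import numpy as np

def hilbert_index(order, x, y):
    # Distance of grid point (x, y) along a Hilbert curve filling a
    # 2**order x 2**order grid (the classic xy2d algorithm).
    side = 2 ** order
    d = 0
    s = side // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:            # rotate/flip the quadrant
            if rx == 1:
                x = side - 1 - x
                y = side - 1 - y
            x, y = y, x
        s //= 2
    return d

coords = np.random.rand(20, 2)                 # hypothetical vertex coordinates
order = 8                                      # quantise onto a 256 x 256 grid
grid = (coords * (2 ** order - 1)).astype(int)
keys = [hilbert_index(order, gx, gy) for gx, gy in grid]
new_number = np.argsort(np.argsort(keys))      # old vertex i -> new number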

SLIDE 11

Other entities

◮ construct additional entities with some numbering
◮ sort them and renumber lexicographically, keyed on the sorted list of vertices they touch (sketch below)
◮ do this every time the mesh topology changes
  ◮ (doesn’t work yet)
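A minimal sketch of that renumbering for edges; the edge list is a hypothetical example.

# Edges given as pairs of vertex numbers, in arbitrary construction order.
edges = [(7, 2), (0, 5), (2, 0), (5, 7)]
# Key each edge by the sorted tuple of vertices it touches, then renumber
# in lexicographic order of those keys, inheriting the vertices' locality.
keys = [tuple(sorted(e)) for e in edges]
order = sorted(range(len(edges)), key=lambda i: keys[i])
new_number = {old: new for new, old in enumerate(order)}
# here new_number == {2: 0, 1: 1, 0: 2, 3: 3}: edges touching
# low-numbered vertices come first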

SLIDE 12

Comparing

(figure: comparison of the default and space-filling-curve numberings)

SLIDE 13

Does it work?

◮ In Fluidity
  ◮ P1 problems get around 15% speedup
◮ In PyOP2
  ◮ GPU/OpenMP backends get 2x–3x speedup (over the badly numbered case)
  ◮ Fluidity kernels provoke cache misses in other ways

SLIDE 14

Iteration in parallel

◮ Mesh distributed between MPI processes
◮ communicate halo data
◮ would like to overlap computation and communication

SLIDE 15

Picture

(figure: a mesh partitioned across processes, with halo regions)

SLIDE 16

Comp/comms overlap

◮ entities that need halos can’t be assembled until the data has arrived
◮ the other entities can be assembled already

start_halo_exchanges()
for e in entities:
    if can_assemble(e):
        assemble(e)
finish_halo_exchanges()
for e in entities:
    if still_needs_assembly(e):
        assemble(e)

SLIDE 17

Making this cheap

◮ separate mesh entities into groups, so no per-entity test is needed (the exchange primitives are sketched after the code)

start_halo_exchanges()
for e in core_entities:
    assemble(e)
finish_halo_exchanges()
for e in additional_entities:
    assemble(e)
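A minimal sketch of what the two exchange primitives might look like with mpi4py nonblocking calls; the neighbour ranks and buffers are hypothetical, and PyOP2's actual implementation is not shown here.

from mpi4py import MPI

comm = MPI.COMM_WORLD

def start_halo_exchanges(send_bufs, recv_bufs):
    # Post nonblocking receives and sends for each neighbouring rank
    # (buffers are numpy arrays keyed by rank); computation on core
    # entities proceeds while the messages are in flight.
    reqs = [comm.Irecv(buf, source=rank) for rank, buf in recv_bufs.items()]
    reqs += [comm.Isend(buf, dest=rank) for rank, buf in send_bufs.items()]
    return reqs

def finish_halo_exchanges(reqs):
    # Block until all halo data has arrived and all sends have completed.
    MPI.Request.Waitall(reqs)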

SLIDE 18

PyOP2 groups

◮ Core entities
  ◮ can assemble these without halo data
◮ Owned entities
  ◮ local, but need halo data
◮ Exec halo
  ◮ off-process, but redundantly executed over (touch local dofs)
◮ Non-exec halo
  ◮ off-process, needed to compute the exec halo

SLIDE 19

Why like this?

◮ GPU execution
  ◮ launch separate kernels for core and additional entities
  ◮ no branching in the kernel to check whether an entity may be assembled
◮ Defer halo exchange as much as possible (lazy evaluation)

SLIDE 20

How to separate the entities

◮ separate data structures for the different parts
  ◮ possible, but hurts direct iterations, and is complicated
◮ additional ordering constraint
  ◮ core, owned, exec, non-exec
  ◮ implemented in Fluidity/PyOP2
  ◮ each type of mesh entity stored contiguously, obeying this ordering (see the sketch below)
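A minimal sketch of how that ordering lets each group be addressed as a contiguous slice; the sizes are hypothetical and this is not PyOP2's actual data structure.

import numpy as np

# Entities of one type (say cells) stored in the required order:
#   [ core | owned | exec halo | non-exec halo ]
sizes = [1000, 200, 50, 30]
offsets = np.cumsum([0] + sizes)

cells = np.arange(offsets[-1])            # entity numbers, already ordered
core = cells[offsets[0]:offsets[1]]       # assemble before halo data arrives
owned = cells[offsets[1]:offsets[2]]      # assemble after the exchange
exec_halo = cells[offsets[2]:offsets[3]]  # redundantly executed, touch local dofs
# non-exec halo entities (cells[offsets[3]:]) are read but never assembled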

SLIDE 21

Hybrid shared memory + MPI parallelisation

◮ On the boundary, assembling off-process entities can contribute to on-process dofs
  ◮ how to deal with this?
◮ use a linear algebra library that can deal with it
  ◮ e.g. PETSc allows insertion and subsequent communication of off-process matrix and vector entries
◮ Not thread safe

SLIDE 22

Solution

◮ Do redundant computation
  ◮ this is the default PyOP2 computation model
◮ Maintain a larger halo
  ◮ assemble all entities that touch local dofs
◮ turn off PETSc off-process insertion (see the sketch below)
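A minimal petsc4py sketch of the last point; the matrix itself is a hypothetical stand-in, but MAT_IGNORE_OFF_PROC_ENTRIES is the relevant PETSc option.

from petsc4py import PETSc

# Hypothetical distributed AIJ matrix: 10 local rows/columns per process.
A = PETSc.Mat().createAIJ(((10, PETSc.DETERMINE), (10, PETSc.DETERMINE)))
A.setUp()

# With redundant computation, every process assembles all contributions to
# its own dofs itself, so values aimed at off-process rows can be dropped
# rather than cached and communicated:
A.setOption(PETSc.Mat.Option.IGNORE_OFF_PROC_ENTRIES, True)

# ... set local values with A.setValues(...), then finalise:
A.assemblyBegin()
A.assemblyEnd()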

SLIDE 23

Picture

(figure: the enlarged halo used for redundant computation)

SLIDE 24

Multiple gains

◮ You probably did the halo swap anyway
  ◮ this makes form assembly non-communicating
  ◮ we’ve seen significant (40%) benefit on 1000s of processes (Fluidity only)
◮ thread safety!

SLIDE 25

Thread safety

◮ Concurrent insertion into MPI PETSc matrices is thread safe if:
  ◮ there’s no off-process insertion caching
  ◮ user deals with concurrent writes to rows
    ◮ colour the local sparsity pattern (see the sketch below)
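A minimal sketch of such colouring over a hypothetical element-to-dof map: elements sharing a dof (and hence possibly a matrix row) get different colours, and all elements of one colour can then be assembled concurrently without write conflicts.

def colour_elements(element_dofs):
    # Greedy colouring: element_dofs[i] is the set of dof (row) indices
    # element i writes to; elements sharing a dof get different colours.
    dof_colours = {}                 # dof -> colours already writing to it
    colours = []
    for dofs in element_dofs:
        used = set().union(*(dof_colours.get(d, set()) for d in dofs))
        c = 0
        while c in used:             # smallest colour free on all shared dofs
            c += 1
        colours.append(c)
        for d in dofs:
            dof_colours.setdefault(d, set()).add(c)
    return colours

# Hypothetical 1D mesh of 4 elements, where neighbours share a dof:
print(colour_elements([{0, 1}, {1, 2}, {2, 3}, {3, 4}]))  # -> [0, 1, 0, 1]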

SLIDE 26

Corollary

◮ It is possible to do hybrid MPI/OpenMP assembly with existing linear algebra libraries
  ◮ implemented (and tested!) in PyOP2
◮ Ongoing work to add more shared memory parallelisation in kernels in PETSc
  ◮ PETSc team
  ◮ Michael Lange (Imperial)

SLIDE 27

Conclusions

◮ With a bit of work, we can make unstructured mesh codes reasonably cache friendly
◮ For good strong scaling, we’d like to overlap computation and communication as much as possible, but cheaply
◮ We think the approaches here work, and they are implemented in Fluidity/PyOP2

SLIDE 28

Acknowledgements

◮ Hilbert reordering in Fluidity:
  ◮ Mark Filipiak (EPCC) [a dCSE award from EPSRC/NAG]
◮ Lexicographic mesh entity numbering and ordering in Fluidity:
  ◮ David Ham (Imperial), and me (prodding him along the way)
◮ PyOP2 MPI support:
  ◮ me (EPCC) [EU FP7/277481 (APOS-EU)]
  ◮ ideas from Mike Giles and Gihan Mudalige (Oxford)
◮ MAPDES team:
  ◮ funding (EPSRC grants EP/I00677X/1, EP/I006079/1)