
An Efficient Implementation of Parallelization in The Domain Decomposition of TELEMAC

Dzung Nguyen, T. Liepert, M. Reisenbüchler, K. Kaveh, M. D. Bui, P. Rutschmann

Chair of Hydraulic and Water Resources Engineering, Technical University of Munich, Arcisstr. 21, 80333 Munich, dzung.nguyen@tum.de

Abstract—In this study, we present an MPI-parallelized implementation of the domain decomposer of TELEMAC. The decomposer, called PARTEL, splits the computational domain into several partitions, which form the basis of parallel simulations. In current TELEMAC releases (version 8), only a serial mode is implemented in PARTEL, which leaves room for speeding up the code with a parallel implementation together with a specific enhancement of the internal data structures used by the code. In this work, we fully parallelized the domain decomposition using MPI and utilized different implementations of the Hash table for representing several data structures in PARTEL. This approach allows us to decompose a huge computational domain consisting of a hundred million elements into ten thousand partitions on an ordinary workstation. The benchmark also revealed that a speed-up by a factor of 4 is readily obtained compared to the original PARTEL before the enhancement.

I. INTRODUCTION

Numerical simulations have proven to be an efficient way to study the characteristics of river hydro- and morphodynamic processes. Especially for flood protection, numerical computations are indispensable. The VieWBay project, part of the strategy "Wasser-Zukunft-Bayern" funded by the Bavarian State Ministry of the Environment and Consumer Protection, aims to develop an integrative approach combining the fields of hydrology and hydromorphodynamics under an HPC framework and applying it to the whole of Bavaria. The hydromorphodynamic model represents the river network, including the propagation of waves and the morphological evolution of the streams. The river system that has to be modeled spans approximately 10,000 kilometers. Therefore, the demands on the simulation software resulting from the project are high. A preliminary study [3] shows that TELEMAC fits the needs of the project best regarding flexibility, scalability, and accuracy. TELEMAC, however, has a limit in handling mesh sizes, as we first noticed in version v7p2r3. This version is capable of dealing with domain sizes of a few million elements and scales well by default [1, 2, 4]. However, the aim of VieWBay regarding the mesh size goes beyond that: the intended mesh size ranges between 40 and 100 million elements. Deployed on CoolMUC-2, a Linux cluster at the Leibniz Supercomputing Center (LRZ) [5], TELEMAC v7p2r3 fails to run simulations with a substantially larger mesh, for instance a mesh of more than 5 million elements. Therefore, it was urgent for our development team to improve this version of TELEMAC.

An in-depth analysis later revealed that memory exceedance was the reason for the failure of the software. CoolMUC-2 imposes a limit on computer memory (RAM) that we strive to overcome for TELEMAC. The cluster consists of 384 nodes with 28-core Haswell processors, all interconnected by Infiniband. Each node is equipped with 64 GB RAM shared by the cores of the same node. In practice, the actually available RAM is reduced by about 10% due to runtime system requirements [5]. Under the assumption of evenly distributed memory usage, each core has about 2 GB RAM to work with (roughly 0.9 x 64 GB / 28 cores ≈ 2.1 GB). In case of a lack of memory, the number of activated cores might be reduced in order to raise the available RAM per core.

By tracing the workflow of TELEMAC v7p2r3, we found that the domain decomposer, called PARTEL, was the main cause of the memory exceedance. By design, PARTEL is a standalone program that always runs ahead of the simulation code in parallel mode to split the computational domain into the desired number of partitions. Our performance analysis revealed that PARTEL exceeds the memory limit when dealing with a mesh of more than 5 million elements and thus fails to complete. The failure therefore causes the whole simulation to terminate. A number of studies that attempted to solve the memory problem can be found in [1, 2, 3, 4]. The solution in [2] resolved only part of the problem by optimizing the data structures. With that, TELEMAC was able to work with a large mesh of up to 25 million elements before being subjected to memory and runtime limits. Among the solutions, [1] is the most complete parallel implementation of PARTEL. As such, TELEMAC worked successfully with an extremely large mesh (up to 200 million elements) in an acceptable time. It, however, comes with great complexity in implementation. We first mentioned the memory problem in [4] and came up with a temporary solution, which did not treat the problem properly until this paper. The most efficient solution is given by [3], whose implementation not only solved the memory problem effectively but also ran reasonably fast. The core of this improvement lies in a Hash table that replaces several memory-critical variables of the old code. At this stage, an ordinary user who usually works with a moderately large mesh would be satisfied with the solution provided by [3]. In this work, we tried to push the boundary of this improvement further, to the point where we gain a better speed-up while not losing the advantages of memory efficiency.

Our first work was based on TELEMAC v7p2r3. We had a parallel version of PARTEL that fulfilled the needs of the VieWBay project. This work, however, was rendered obsolete as [2] and our latest work in this paper came out. Our latest work, therefore, focuses on improving PARTEL in TELEMAC v8p0r0 by pushing the upper bound of domain decomposition in terms of performance. As the original PARTEL code only runs in serial mode, it first seemed obvious to speed up the code by making it parallel. Second, we applied different implementations of the Hash table used in the original PARTEL to find out the fastest, thus providing users with several options of Hash table implementations that best suit their needs. By combining parallelization and a data structure enhancement through an efficient Hash table, each MPI process of our PARTEL code maintains a low memory consumption while achieving a significant gain in performance. Our PARTEL is in fact made embarrassingly parallel and therefore theoretically possesses optimal scaling characteristics.

II. METHOD

A. TELEMAC workflow

Our aim is to accomplish the following two tasks: (1) translate the serial code of the original PARTEL into an equivalent version that speeds up the process; parallelizing PARTEL is the natural choice as it reflects the structure of the original code, and it requires that the parallelized PARTEL produce the same output as is expected from the serial counterpart, which we call output consistency; (2) apply an efficient implementation of the Hash table in the original PARTEL that may further speed up the process.

Fig. 1: The workflow of TELEMAC

Parallelizing PARTEL in TELEMAC is challenging due to its convoluted structure. The lack of documentation makes the code even more difficult to comprehend when trying to identify which parts have the potential to be made parallel. We started by finding the key data structures, their locations, their meaning, and their influence on the rest of the code. After this analysis, we separated the code that can be made parallel and adjusted for the impacts of the changes. As an overview, a simulation in TELEMAC is done in three steps (Fig. 1): (1) decomposing the domain, which is carried out by PARTEL; (2) running the simulation code on each domain partition; (3) merging the results from the partitions. These steps are controlled by the TELEMAC wrappers, several standalone Python scripts that control the workflow. The core of TELEMAC is, however, written in FORTRAN.

B. Domain Decomposition in PARTEL

In parallel mode, TELEMAC applies a domain decomposition scheme with which the data associated with a problem, such as the representation of the geometry and the boundary conditions of the domain, are decomposed. Each parallel task then works on one partition of the domain data. Domain partitions are the direct result of graph partitioning. Since a mesh representing a 2D domain is topologically a planar graph, it can be decomposed efficiently by a graph partitioning algorithm. For that reason, the graph partitioning code always runs first to create the partition data. PARTEL depends on third-party tools for graph partitioning, namely METIS and SCOTCH [7, 9]. While both tools are actively supported, METIS is selected by default. The tool is a set of serial programs for partitioning graphs as in Fig. 2, partitioning finite element meshes, and producing fill-reducing orderings for sparse matrices. It is worth mentioning that ParMETIS is an MPI-based parallel library that extends the functionality provided by METIS and includes routines that are especially suited for parallel computations and large-scale numerical simulations; ParMETIS is, however, not usable in release versions of TELEMAC. The result of the graph partitioning is then used by the subsequent code to create additional partition data such as boundary conditions and parallel information. Finally, all results are written either in the TELEMAC standard format, e.g. Selafin (for geometry files), or in text format (parallel files).

Fig. 2: Domain decomposition in DragForce, a built-in example of TELEMAC, done by serial PARTEL
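
To make the partitioning step concrete, the following minimal sketch (C++, not code from PARTEL; the call should be checked against the METIS release linked by TELEMAC) partitions a tiny two-triangle mesh with METIS_PartMeshDual and prints the per-element partition ids, which is the kind of information PARTEL keeps in its EPART array:

// Minimal sketch: partitioning a two-triangle mesh with METIS,
// the partitioner PARTEL uses by default.
#include <cstdio>
#include <metis.h>

int main() {
    idx_t ne = 2, nn = 4;                       // two triangles, four nodes
    idx_t eptr[] = {0, 3, 6};                   // start of each element in eind
    idx_t eind[] = {0, 1, 2, 1, 3, 2};          // node indices of each element
    idx_t ncommon = 2;                          // elements sharing an edge are adjacent
    idx_t nparts = 2, objval = 0;
    idx_t epart[2], npart[4];                   // element and node partition ids

    int status = METIS_PartMeshDual(&ne, &nn, eptr, eind, nullptr, nullptr,
                                    &ncommon, &nparts, nullptr, nullptr,
                                    &objval, epart, npart);
    if (status == METIS_OK)
        for (idx_t e = 0; e < ne; ++e)
            std::printf("element %d -> partition %d\n", (int)e, (int)epart[e]);
    return status == METIS_OK ? 0 : 1;
}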

Our work requires an understanding of the internal representations of these formats, especially of the parallel files. The internal representations of those files, however, are not standardized, and in part not documented at all, because they strongly depend on the parallel solvers, which may be subject to change in future releases. This adds difficulty to our development of PARTEL.

C. Refactoring the PARTEL code

PARTEL, as a program, in fact calls the subroutine PARTEL to carry out the actual decomposition task; we therefore do not distinguish the program PARTEL from the subroutine PARTEL in this paper. For that reason, our development is primarily made on the code of the PARTEL subroutine. The workflow of PARTEL can be logically divided into three tasks, as in Fig. 4. First comes the graph partitioning with METIS (green block); the partition data are stored in an array called EPART. Next, the code for writing the parallel files (PAR files, red block) is called. Finally, the BC block (in blue) is invoked for writing the decomposed boundary condition files in the standard Serafin format. This is followed by the GEO task, which generates a geometry file for each partition, also written in the Serafin format.

As PARTEL_PARA, a parallel version of PARTEL, already exists, it seemed at first to be a promising alternative to PARTEL. The PARTEL_PARA code performs most of the tasks except for the auxiliary files (PAR); the parallel files are therefore missing or incomplete because of the discontinued development. Because of this, the simulation code of TELEMAC cannot directly run with the output of PARTEL_PARA. We attempted to complete PARTEL_PARA by adding the missing code, but we later learned that the program produces wrong output even for the geometry data files, as shown in Fig. 3. Therefore, we proceeded to write our own parallel version of PARTEL, using PARTEL_PARA as an inspiration.

As mentioned, our process of making PARTEL run in parallel faced difficulties, as we could not find format descriptions of the result files generated by each logical task mentioned above in the official TELEMAC development manual. We therefore had to rely heavily on reverse-engineering PARTEL_PARA, the internal data structures of METIS such as EPART, and the auxiliary parallel files (PAR files) generated by the original serial code of PARTEL. The PARTEL code (the PARTEL subroutine) was then refactored to increase readability, so that the code is easier to decouple. PARTEL is originally one long file that makes intensive use of global variables. We simplified it by grouping its code into a few functions. This proved useful in accelerating the parallelization of PARTEL.

Fig. 3: PARTEL_PARA failed to decompose the domain geometry data in the DragForce example

Fig. 4: Conceptual flowchart of serial PARTEL

D. Making PARTEL Parallel

Which code to make parallel? Since the original code of PARTEL is complicated, in order to make it work in parallel we try to identify two different portions of the code: (1) the critical serial code that cannot, or only with great difficulty, be made parallel, and (2) the code that has a high potential of being made parallel. Regarding the critical serial code, the code responsible for the calculation of the variable PART_P is essentially serial. PART_P stores the information telling how many partitions share a given node. If PART_P were computed in parallel, the parallel code would require a certain amount of synchronization among the MPI processes, which would undoubtedly add complexity to the process of parallelizing PARTEL. In addition, our benchmark of the original serial code of PARTEL (discussed in the results section) shows that the time spent on the graph decomposition (green block in Fig. 4 and Fig. 5), on the computation of PART_P and on generating the parallel files makes up a small portion of the processing of PARTEL as a whole; as measured, it is less than 5 percent. As a consequence, PARTEL spends most of its time generating and writing the boundary condition files as well as the geometry files (blue block in Fig. 4 and Fig. 5). Since generating and writing the boundary condition and geometry files takes up a significant portion of the processing time, we focus on parallelizing this portion of the code.
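
For illustration, the following minimal sketch (C++, with an assumed mesh layout rather than PARTEL's actual data structures) computes such a per-node count, i.e. for every node the number of partitions that share it, from the element-partition assignment:

#include <cstddef>
#include <set>
#include <vector>

int main() {
    // Two triangles on four nodes; epart gives the partition of each element,
    // as produced by the graph partitioning step.
    std::vector<std::vector<int>> elem_nodes = {{0, 1, 2}, {1, 3, 2}};
    std::vector<int> epart = {0, 1};
    std::vector<std::set<int>> parts_of_node(4);    // partitions touching each node
    for (std::size_t e = 0; e < elem_nodes.size(); ++e)
        for (int n : elem_nodes[e])
            parts_of_node[n].insert(epart[e]);
    std::vector<int> part_p(4);                     // PART_P-like counter
    for (int n = 0; n < 4; ++n)
        part_p[n] = static_cast<int>(parts_of_node[n].size());
    // Here nodes 1 and 2 lie on the interface and are shared by two partitions.
    return 0;
}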

Fig. 6: The implementation concept of the embarrassing parallelization of the Partition Solution module

How to make the code parallel? We notice that the Partition Solution module (blue block in Fig. 4 and Fig. 5) can be made parallel by taking advantage of the embarrassingly parallel character of the original PARTEL code.

Fig. 5: Embarrassing parallelization of the Partition Solution module

Our algorithm nests an internal loop inside the outer loop, as shown in Fig. 5. The internal loop has the duty of assigning a partition i to a process j in a certain way, which we call the partition-process mapping. The rule of the mapping is as follows: let i = 0, ..., NPARTS - 1 denote the partitions and j = 0, ..., N - 1 denote the MPI processes; then partition i is assigned to process j if and only if j == mod(i, N), where NPARTS is the number of partitions and N is the number of MPI processes. As an example, take N = 4 processes and NPARTS = 10 partitions. The processes are indexed from 0 to 3 and the partitions from 0 to 9. The partitions are grouped into different sets identified by the process id; for instance, partitions {0, 4, 8} belong to process 0, and so on. It is worth mentioning that N is not necessarily equal to NPARTS, and because of the embarrassing parallelization, the bigger the value of N, the higher the speed-up one can gain. In other words, the parallelized PARTEL has a good scaling characteristic.
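
The mapping itself is only a modulo operation; the short sketch below (C++, mirroring the example above rather than PARTEL's FORTRAN code) prints the partition sets each process would receive for N = 4 and NPARTS = 10:

#include <cstdio>

int main() {
    const int N = 4;          // number of MPI processes
    const int NPARTS = 10;    // number of domain partitions
    for (int j = 0; j < N; ++j) {
        std::printf("process %d:", j);
        for (int i = 0; i < NPARTS; ++i)
            if (i % N == j)                 // the rule j == mod(i, N)
                std::printf(" %d", i);
        std::printf("\n");                  // e.g. "process 0: 0 4 8"
    }
    return 0;
}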


Changes to the code. First, we must change the command line that launches PARTEL in the TELEMAC configuration file (systel.cfg) so that PARTEL can run in parallel.

Before:
par_cmdexec: <config>/partel < PARTEL.PAR >> <partel.log>

After:
par_cmdexec: mpiexec -n <nprocs> <config>/partel < PARTEL.PAR >> <partel.log>

The parallelized PARTEL is thus launched with an MPI launcher, i.e. mpiexec. One can also see that PARTEL.PAR is no longer redirected into PARTEL after the change; instead, the file is read by each MPI process independently. The reason is that keeping PARTEL.PAR redirected into the MPI processes would make the MPI code fail, so we change the entry code of PARTEL, the file homere_partel.f, accordingly:

open(unit=fd, file="PARTEL.PAR", action="read")

Second, we add the code that implements the partition-process mapping in the file mod_write_solutions.f:

DO I = 0, NPARTS-1
  IF ( MOD(I, NPROCS) == MYRANK ) THEN
    CALL GENERATE_CLI_FILE(I,…)
    CALL GENERATE_GEO_FILE(I,…)
  END IF
END DO

E. Enhancing PARTEL internal data structures

Applying a Hash table. As of TELEMAC v8, the memory problem has been solved by introducing an efficient data structure for the memory-critical variables CUT_P, GELEGL and KNOGL. In previous versions of TELEMAC they are 2-dimensional arrays, as below:

- CUT_P(NPOIN2, NPARTS): stores the data that help generate halo cells and, therefore, the parallel files.
- GELEGL(NELEM2, NPARTS): a mapping from the global to the local index of a mesh element in a given domain partition.
- KNOGL(NPOIN2, NPARTS): a mapping from the global to the local index of a mesh node.

As we can see, these variables take a significant amount of memory when allocated. As mentioned in the latest paper that improved PARTEL [3], most of the content of those variables is empty. The improvement that reduces memory usage significantly is obtained by replacing the 2D array implementation with equivalent Hash tables, declared as follows:

use mod_hash_table
…
type(hash_table) :: CUT_P
type(hash_table) :: KNOGL
type(hash_table) :: GELEGL

where the derived type hash_table is provided by the Hash table module of PARTEL. The Hash table is implemented in FORTRAN in the original code of PARTEL. Since the implementation of the Hash table has a significant impact on the client code that uses it, in this work we implemented an interface code that allows us to test PARTEL with different implementations of the Hash table, even when the Hash table is implemented in a language other than FORTRAN. From this we can be more certain in selecting the best Hash table implementation for our parallelized PARTEL code. Our idea is to test PARTEL with different Hash table implementations and, based on that, provide the user with a better choice of Hash table code. As C++ is well known for its strength in the implementation of generic algorithms, we decided to bridge the existing FORTRAN code with C++. For this purpose, we wrote a FORTRAN interface module that is responsible for creating, manipulating and destroying a C++ data structure object. In fact, there is not yet a standard mechanism for FORTRAN/C++ interoperability; however, we can work around this issue efficiently by using a C module as a bridge (FORTRAN -> C -> C++). As a result, we ended up writing two interface modules: mod_hash_table_cpp.f90 for the FORTRAN/C interaction and hash_table.h for the C/C++ interaction. The core implementation of the Hash tables is carried out in a C++ module, i.e. hash_table.cpp, which makes use of the different Hash table implementations. We then compile the C++ code into a shared library (*.so) and link TELEMAC against the library during the normal compilation process. In the end we are able to summon a C++ object from the C++ library, manipulate it, and destroy it once PARTEL is done.
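
The memory saving comes from storing only the entries that actually exist. The following minimal sketch (C++, with an assumed key encoding, not the layout used in TELEMAC) shows the idea of replacing a mostly empty 2D array such as KNOGL(NPOIN2, NPARTS) by a Hash table keyed on the (global node, partition) pair:

#include <cstdint>
#include <unordered_map>

// Pack (global node index, partition id) into one 64-bit key; this assumes
// fewer than 2^16 partitions, which covers the "ten thousand partitions" case.
static std::int64_t pack(std::int64_t global_node, std::int32_t part) {
    return (global_node << 16) | part;
}

int main() {
    std::unordered_map<std::int64_t, std::int32_t> knogl;  // replaces KNOGL(NPOIN2, NPARTS)
    knogl[pack(123456, 7)] = 42;        // global node 123456 is local node 42 in partition 7
    auto it = knogl.find(pack(123456, 7));
    return (it != knogl.end() && it->second == 42) ? 0 : 1;
}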

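A minimal sketch of what such a C-linkage layer can look like is given below (C++; the function names are illustrative, not the ones used in the actual TELEMAC sources). A FORTRAN module such as mod_hash_table_cpp.f90 can then bind to these functions with ISO_C_BINDING:

// hash_table.cpp (illustrative): a C++ std::unordered_map behind plain C
// functions, so that FORTRAN can create, fill, query and destroy it.
#include <cstdint>
#include <unordered_map>

using table_t = std::unordered_map<std::int64_t, std::int64_t>;

extern "C" {

void* ht_create() { return new table_t(); }

void ht_destroy(void* handle) { delete static_cast<table_t*>(handle); }

void ht_put(void* handle, std::int64_t key, std::int64_t value) {
    (*static_cast<table_t*>(handle))[key] = value;
}

// Returns the stored value, or "missing" when the key is absent.
std::int64_t ht_get(void* handle, std::int64_t key, std::int64_t missing) {
    const table_t& t = *static_cast<table_t*>(handle);
    auto it = t.find(key);
    return it == t.end() ? missing : it->second;
}

}  // extern "C"

Because only this translation unit knows the concrete container type, swapping std::unordered_map for, e.g., a robin_hood or abseil container would only touch this file, which is what makes comparing several Hash table backends practical.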

The implementation is made in such a way that the C++ Hash table object is a drop-in replacement for the original Hash table written in FORTRAN; the altered workflow only routes the Hash table operations through the FORTRAN -> C -> C++ layering described above. Several implementations of the Hash table have been used for the benchmark (Table 2).

Fig. 7: An extremely simple example for inspecting serial PARTEL: a water bump falling down due to gravity in a rectangular pool. The partitions are shown in different colors; these simple partitions are used to verify the output of parallel PARTEL.

III. RESULTS

A. Code Correctness

Since the parallel PARTEL resulted from our reverse-engineering process, it is important to verify the correctness of the code. We use the following strategies: 1) Using extremely simple examples (Fig. 7) to inspect the internal representation of the partition data. Due to the lack of proper documentation of serial PARTEL, we can only rely on the code itself and its large number of comments; while these comments are helpful most of the time, they are sometimes misleading because they have not been properly updated. 2) Comparing with a set of built-in examples of TELEMAC: the tests were performed correctly, which confirms the correctness of the parallel PARTEL code. In the next step, we investigate the performance for domains which are more realistic and closer to the desired extent, including meshes of millions of cells.

B. Memory and Performance

The original PARTEL code was fully parallelized and combined with different implementations of the Hash table. We consider three test cases, shown in Table 1, with which our implementations of PARTEL are tested for several mesh sizes. The benchmark covers the six implementations listed in Table 2 and was run on an ordinary workstation with 4 physical cores at 2.0 GHz and 32 GB RAM.

TABLE 1. DIFFERENT MESH SIZES FOR DECOMPOSITION

Case #          1      2      3
No. Elements    20M    40M    100M

TABLE 2. DIFFERENT IMPLEMENTATIONS FOR HASH TABLE

No.   Author                 Hash table library                       Language
I     C++ STL                std::unordered_map                       C++
II    TELEMAC                GoogleFarm Hash table                    FORTRAN
III   Google [10]            abseil::flat_hash_table                  C++
IV    Martin Ankerl [11]     robin_hood::unordered_map                C++
V     Malte Skarupke [12]    ska::flat_hash_table                     C++
VI    -                      ska::flat_hash_table + parallelization   FORTRAN/C++

Fig. 8: PARTEL implementation benchmark

As we can see, implementation (V) of PARTEL gives the best performance. Compared with the original implementation (II) of PARTEL, (V) is about twice as fast for cases 1 and 2 and about 15% faster for case 3. It is also worth noting that for cases 1 and 2, implementations (III) and (IV) also give a very good speed-up, roughly a factor of 2 over implementation (II), but they are slower for the biggest case 3. From this, one can conclude that it is reasonable to stay with the default implementation of PARTEL for a moderately large mesh; in case of an extremely large mesh, one should consider using implementation (V) of PARTEL. Finally, implementation (VI) is the combination of (V) with the parallelization of PARTEL. In the benchmark we only tested the parallelization with two MPI processes, and the result clearly shows that this implementation is the most efficient one: it is about 4 times faster in cases 1 and 2. Although we were not able to obtain a result for case 3 because of the memory constraints of our workstation, the embarrassingly parallel character of the implementation implies that a significantly better speed-up can be obtained as the number of processes of the parallelized PARTEL is increased.

IV. CONCLUSION AND FUTURE WORK

In this study, we provide several parallel implementations of the domain decomposer, PARTEL, in TELEMAC. The original serial version of PARTEL has been parallelized, and it succeeds in maintaining a low memory consumption while obtaining a good speed-up. As a result, we can effortlessly run simulations with significantly larger meshes, for example a 100 million element mesh. The parallel implementation, however, still has a limitation when running on a single computational node, since the total memory used by all MPI processes of PARTEL is high. As a consequence, for a significantly larger mesh (hundreds of millions of elements) the code must run on multiple computational nodes, as it is no longer suitable for a workstation. In this case, we suggest a new scheme for future development of PARTEL, implemented either with OpenMP or with one-sided MPI communication, with which we would enjoy the benefits of both the parallelization and the memory efficiency of the domain decomposition of PARTEL on a single computational node. This implementation should be made optional so that users can select it for their suitable scenarios.

ACKNOWLEDGMENT

The authors gratefully acknowledge the support of the Leibniz Supercomputing Center, Garching, Germany. We would like to thank the KONWIHR initiative (Bavarian Competence Network for Technical and Scientific HPC) for funding this study.

REFERENCES

[1] Audouin, Yoann; Moulinec, Charles; Barber, Robert; Sunderland, Andrew (2011), "Preparing TELEMAC-2D for extremely large simulations". In: Violeau, Damien; Hervouet, Jean-Michel; Razafindrakoto, Emile; Denis, Christophe (eds.): XVIIIth TELEMAC & MASCARET User Club 2011. Chatou: Bundesanstalt für Wasserbau (BAW), pp. 41-48.
[2] Moulinec, Charles, et al., "TELEMAC: An efficient hydrodynamics suite for massively parallel architectures", Computers & Fluids 51.1 (2011): 30-34.
[3] Grasset, Judicael; Audouin, Yoann; Fontaine, Jacques; Moulinec, Charles; Emerson, David R., "Improving TELEMAC system pre-processing and IO stages", TELEMAC User Conference 2018.
[4] Reisenbüchler, M.; Liepert, T.; Nguyen, N.D.; Bui, M.D.; Rutschmann, P. (2018), "Preliminary study on a Bavaria-wide coupled hydrological and hydromorphological model", Environmental Informatics: Techniques and Trends. Adjunct Proceedings of the 32nd edition of the EnviroInfo (eds. H.-J. Bungartz, D. Kranzlmüller, V. Weinberg, J. Weismüller, V. Wohlgemuth), ISBN: 978-3-8440-6138-3, DOI: http://doi.org/10.2370/9783844061383, Part IV Disaster Management for Resilience and Public Safety, pp. 145-148, Herzogenrath, Germany.
[5] Reisenbüchler, M.; Nguyen, N.D.; Liepert, T.; Bui, M.D.; Rutschmann, P. (2018), "TELEMAC - A hydrodynamic solver for HPC?", InSiDE Magazine Spring 2018, Vol. 16, pp. 95-97.
[6] Leibniz-Rechenzentrum (LRZ), https://www.lrz.de/
[7] METIS, http://glaros.dtc.umn.edu/gkhome/metis/metis/overview
[8] Open TELEMAC-MASCARET, http://www.opentelemac.org/
[9] SCOTCH, https://www.labri.fr/perso/pelegrin/scotch/
[10] Abseil, https://github.com/abseil/abseil-cpp
[11] Martin Ankerl, robin-hood-hashing, https://github.com/martinus/robin-hood-hashing
[12] Malte Skarupke, flat_hash_map, https://github.com/skarupke/flat_hash_map
