  1. 3D Particle Methods Code Parallelization Fabrice Schlegel

  2. Introduction Goal: Efficient parallelization and memory optimization of a CFD code used for the direct numerical simulation (DNS) of turbulent combustion. Hardware: 1) Our lab cluster, Pharos, which consists of 60 Intel Xeon Harpertown nodes, each consisting of dual quad-core CPUs (8 processors) at 2.66 GHz. 2) Shaheen, a 16-rack IBM Blue Gene/P system owned by KAUST. Four 850 MHz processors are integrated on each Blue Gene/P chip; a standard Blue Gene/P rack houses 4,096 processors. Parallel library: MPI

  3. Outline
  • Brief overview of Lagrangian vortex methods
  • Comparison of the Ring and Copy parallelization algorithms – numerical results in terms of speed and parallel efficiency
  • Previous code modifications to account for the new data structure
  • Applications of the ring algorithm to large transverse jet simulations

  4. Vortex Element Methods
  Vortex simulations: solve for the vorticity ω instead of the velocity u, which gives a more compact support and efficient utilization of computational elements.
  Vorticity transport equation:
  $$\frac{\partial \boldsymbol{\omega}}{\partial t} + \mathbf{u}\cdot\nabla\boldsymbol{\omega} = \boldsymbol{\omega}\cdot\nabla\mathbf{u} + \frac{1}{Re}\nabla^{2}\boldsymbol{\omega} + \frac{Gr}{Re^{2}}\,\nabla\rho\times\mathbf{g}$$
  Particle discretization, where each element is described by a discrete node point {χ_i} and a weight {W_i}:
  $$\boldsymbol{\omega}(\mathbf{x},t) \approx \sum_{i=1}^{N} \mathbf{W}_i(t)\, f_\sigma\!\left(\mathbf{x}-\boldsymbol{\chi}_i(t)\right), \qquad \frac{d\boldsymbol{\chi}_i}{dt} = \mathbf{u}(\boldsymbol{\chi}_i,t), \qquad \frac{d\mathbf{W}_i}{dt} = \mathbf{W}_i\cdot\nabla\mathbf{u}(\boldsymbol{\chi}_i,t)$$
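As a concrete illustration of the element evolution above, here is a minimal serial sketch, in C, of one forward-Euler step advancing particle positions and weights. It assumes the velocity and velocity gradient at each particle have already been evaluated (e.g., by the fast summation on the next slide); the struct layout and function names are illustrative, not taken from the actual code.

```c
#include <stddef.h>

/* One 3D vortex element: position chi, vector-valued weight W. */
typedef struct { double chi[3]; double W[3]; } Particle;

/* Advance the particles one forward-Euler step:
 *   d(chi_i)/dt = u(chi_i, t)
 *   d(W_i)/dt   = (W_i . grad) u(chi_i, t)
 * u[i][k] holds the velocity at particle i; gradu[i][k][m] holds du_k/dx_m.
 */
void advance_particles(Particle *p, size_t n,
                       const double (*u)[3],
                       const double (*gradu)[3][3],
                       double dt)
{
    for (size_t i = 0; i < n; ++i) {
        for (int k = 0; k < 3; ++k)
            p[i].chi[k] += dt * u[i][k];

        double dW[3];
        for (int k = 0; k < 3; ++k) {          /* component k of (W . grad) u */
            dW[k] = 0.0;
            for (int m = 0; m < 3; ++m)
                dW[k] += p[i].W[m] * gradu[i][k][m];
        }
        for (int k = 0; k < 3; ++k)
            p[i].W[k] += dt * dW[k];
    }
}
```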

  5. Fast summation for the velocity evaluation
  Serial treecode [Lindsay and Krasny 2001]
  • Velocity from the Rosenhead-Moore kernel:
  $$\mathbf{u}(\mathbf{x}_i) = \sum_{j=1}^{N} K_{RM}(\mathbf{x}_i,\mathbf{y}_j)\times\boldsymbol{\omega}_j, \qquad K_{RM}(\mathbf{x},\mathbf{y}) = -\frac{1}{4\pi}\,\frac{\mathbf{x}-\mathbf{y}}{\left(|\mathbf{x}-\mathbf{y}|^{2}+\delta^{2}\right)^{3/2}}$$
  • Constructs an adaptive oct-tree and uses a Taylor expansion of the kernel in Cartesian coordinates:
  $$\mathbf{u}(\mathbf{x}_{\mathrm{target}}) \approx \sum_{i=1}^{N}\sum_{p} \frac{1}{p!}\, D_{y}^{\,p} K_{RM}(\mathbf{x}_{\mathrm{target}},\mathbf{y}_c)\,(\mathbf{y}_i-\mathbf{y}_c)^{p}$$
  The Taylor coefficients are computed with a recurrence relation; the cell moments are stored for re-use.
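For reference, here is a direct O(N²) evaluation in C of the Rosenhead-Moore summation above; the treecode itself replaces the inner loop with the cell-moment / Taylor-coefficient approximation for well-separated clusters. The array layout and the name of the smoothing parameter delta are illustrative assumptions.

```c
#include <math.h>
#include <stddef.h>

/* Direct evaluation of u(x_i) = sum_j K_RM(x_i, y_j) x w_j with
 * K_RM(x,y) = -(1/4pi) (x - y) / (|x - y|^2 + delta^2)^{3/2}.
 * x: n target positions; y, w: m source positions and vector weights;
 * u: output velocities. All arrays are stored as [][3].
 */
void velocity_direct(const double (*x)[3], size_t n,
                     const double (*y)[3], const double (*w)[3], size_t m,
                     double delta, double (*u)[3])
{
    const double c = -1.0 / (4.0 * 3.14159265358979323846);
    for (size_t i = 0; i < n; ++i) {
        u[i][0] = u[i][1] = u[i][2] = 0.0;
        for (size_t j = 0; j < m; ++j) {
            double d[3] = { x[i][0] - y[j][0],
                            x[i][1] - y[j][1],
                            x[i][2] - y[j][2] };
            double r2 = d[0]*d[0] + d[1]*d[1] + d[2]*d[2] + delta*delta;
            double s  = c / (r2 * sqrt(r2));   /* -(1/4pi) / (r^2 + delta^2)^{3/2} */
            /* K x w: cross product of the scaled separation with the weight. */
            u[i][0] += s * (d[1]*w[j][2] - d[2]*w[j][1]);
            u[i][1] += s * (d[2]*w[j][0] - d[0]*w[j][2]);
            u[i][2] += s * (d[0]*w[j][1] - d[1]*w[j][0]);
        }
    }
}
```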

  6. Clustering of particles for optimal load balancing: distribution of 2,000,000 particles over 256 processors/clusters. A k-means clustering algorithm is used (see the sketch below).
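The slides do not show the clustering routine itself; as a point of reference, a minimal serial k-means sketch over 3D particle positions might look as follows. The names and the caller-supplied initialization of the centers are illustrative; the production implementation is parallel and more elaborate.

```c
#include <float.h>
#include <stdlib.h>
#include <string.h>

/* Minimal k-means over 3D particle positions pos[n][3].
 * membership[i] receives the cluster index (0..k-1) of particle i;
 * centers[k][3] must be initialized by the caller (e.g., to k random particles).
 */
void kmeans(const double (*pos)[3], size_t n,
            double (*centers)[3], int *membership, int k, int iters)
{
    double *sum = malloc((size_t)k * 3 * sizeof *sum);
    int    *cnt = malloc((size_t)k * sizeof *cnt);

    for (int it = 0; it < iters; ++it) {
        memset(sum, 0, (size_t)k * 3 * sizeof *sum);
        memset(cnt, 0, (size_t)k * sizeof *cnt);

        /* Assignment step: each particle joins the nearest center. */
        for (size_t i = 0; i < n; ++i) {
            int best = 0; double bestd = DBL_MAX;
            for (int c = 0; c < k; ++c) {
                double d = 0.0;
                for (int m = 0; m < 3; ++m) {
                    double t = pos[i][m] - centers[c][m];
                    d += t * t;
                }
                if (d < bestd) { bestd = d; best = c; }
            }
            membership[i] = best;
            for (int m = 0; m < 3; ++m) sum[3*best + m] += pos[i][m];
            cnt[best]++;
        }

        /* Update step: move each center to the mean of its particles. */
        for (int c = 0; c < k; ++c)
            if (cnt[c] > 0)
                for (int m = 0; m < 3; ++m)
                    centers[c][m] = sum[3*c + m] / cnt[c];
    }
    free(sum); free(cnt);
}
```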

  7. Copy vs. Ring Algorithm: Copy algorithm: each processor keeps a copy of all the particles involved; the processors then communicate their results. [Diagram: processors P1-P4, each holding copies of the memory blocks M1-M4.]

  8. Copy vs. Ring Algorithm: Ring algorithm: each processor keeps track of its own particles and communicates the sources with the others whenever needed. [Diagram: particle clusters P1-P4 arranged around a ring of processors.]

  9. Copy vs. Ring Algorithm [Diagram: ring illustration continued, with the source clusters shifted to the next processor.]
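The ring pattern on the last three slides can be sketched in C/MPI as follows, assuming for simplicity that every rank holds the same number of source values and that a placeholder routine accumulate_velocity stands in for the local interaction (in practice, the fast summation): the targets stay on their rank while the source buffer is shifted around the ring until every rank has seen every source block.

```c
#include <mpi.h>

/* Ring algorithm sketch: targets stay put; a source buffer of `count`
 * doubles (e.g., packed positions and weights) circulates around the ring.
 * accumulate_velocity() stands in for the local interaction / fast summation.
 */
void ring_interactions(double *targets, int ntarg,
                       double *sources, int count,   /* same count on all ranks */
                       void (*accumulate_velocity)(double *, int, const double *, int),
                       MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    int next = (rank + 1) % size;          /* send to the next rank ...        */
    int prev = (rank + size - 1) % size;   /* ... receive from the previous one */

    for (int shift = 0; shift < size; ++shift) {
        /* Interact the local targets with the source block currently held. */
        accumulate_velocity(targets, ntarg, sources, count);

        /* Pass the source block one step around the ring (skip the last pass). */
        if (shift < size - 1)
            MPI_Sendrecv_replace(sources, count, MPI_DOUBLE,
                                 next, 0, prev, 0, comm, MPI_STATUS_IGNORE);
    }
}
```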

  10. Parallel Performance (Watson-Shaheen): CPU time for advection vs. number of processors, 2,000,000 particles, strong scaling. [Plot: CPU time (s) against 64, 256, and 1024 processors for the ring and copy algorithms.]

  11. Parallel Performance (Watson-Shaheen): parallel efficiency, 2,000,000 particles, normalized strong scaling. [Plot: parallel efficiency against 64, 256, and 1024 processors for the ring and copy algorithms.]

  12. Comparison
  Ring algorithm:
  • Allows for a bigger number of computational points.
  • Too much data circulation: its efficiency decreases for a high number of processors.
  Copy algorithm:
  • Less communication: recommended for high numbers of processors.
  • Memory limited.

  13. Parallel implementation using the ring algorithm: resolution of the N-body problem
  • Parallel implementation using the ring algorithm (pure MPI).
  • Performed a simulation with 60 million particles on 1 node of Pharos (8 processors, 16 GB).
  • Expected results on Shaheen: 1.8 million particles per processor, i.e., 1.8 billion particles on 1024 processors.
  • New implementation of the clustering algorithm.

  14. Particles redistribution: new implementation of the clustering algorithm
  • Assign a new membership to each particle.
  • Heap sort of the particles within each processor according to their membership.
  • Redistribute all the particles to their respective processors using MPI, for better data locality (a sketch of this exchange step follows slide 16 below).
  [Diagram: processors P1-P4 exchanging particles.]

  15. Particles redistribution (continued). [Diagram: next step of the redistribution illustration, processors P1-P4.]

  16. Particles redistribution (continued). [Diagram: final step of the redistribution illustration.]
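The MPI exchange step referenced on slide 14 can be expressed with MPI_Alltoallv once the particles have been sorted by destination rank. The sketch below assumes a fixed number of doubles per particle and uses illustrative names; it is not a transcription of the actual code.

```c
#include <mpi.h>
#include <stdlib.h>

#define VALS_PER_PARTICLE 6   /* e.g., 3 position + 3 weight components (assumed) */

/* Redistribute particles that are already sorted by destination rank.
 * sendcounts[r] = number of particles this rank must send to rank r.
 * Returns a newly allocated receive buffer; *nrecv gets the particle count.
 */
double *redistribute(const double *send, const int *sendcounts,
                     int *nrecv, MPI_Comm comm)
{
    int size;
    MPI_Comm_size(comm, &size);

    int *recvcounts = malloc(size * sizeof *recvcounts);
    int *scount = malloc(size * sizeof *scount);
    int *rcount = malloc(size * sizeof *rcount);
    int *sdispl = malloc(size * sizeof *sdispl);
    int *rdispl = malloc(size * sizeof *rdispl);

    /* Every rank learns how many particles it will receive from each rank. */
    MPI_Alltoall((void *)sendcounts, 1, MPI_INT, recvcounts, 1, MPI_INT, comm);

    int stot = 0, rtot = 0;
    for (int r = 0; r < size; ++r) {
        scount[r] = sendcounts[r] * VALS_PER_PARTICLE;
        rcount[r] = recvcounts[r] * VALS_PER_PARTICLE;
        sdispl[r] = stot;  stot += scount[r];
        rdispl[r] = rtot;  rtot += rcount[r];
    }

    double *recv = malloc((size_t)rtot * sizeof *recv);
    MPI_Alltoallv((void *)send, scount, sdispl, MPI_DOUBLE,
                  recv, rcount, rdispl, MPI_DOUBLE, comm);

    *nrecv = rtot / VALS_PER_PARTICLE;
    free(recvcounts); free(scount); free(rcount); free(sdispl); free(rdispl);
    return recv;
}
```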

  17. Transverse jet: Motivations
  Wide range of applications:
  • Combustion: industrial burners, aircraft engines.
  • Pipe tee mixers.
  • V/STOL aerodynamics.
  • Pollutant dispersion (chimney plumes, effluent discharge).
  Mixing in combustors for gas turbines: higher thrust, better operability, lower NOx, CO, UHC.
  [Figures: gas-turbine combustor schematic (fuel nozzle, primary and secondary combustion zones, primary air and dilution air jets, turbine inlet guide vanes), Boeing Military Airplane Company / U.S. Department of Defense; photograph by Otto Perry (1894-1970), Western History Department of the Denver Public Library.]

  18. Numerical Results: r = 10, Re = 245, Re_j = 2450. Vorticity isosurfaces |ω| = 3.5. The ring algorithm allows for large simulations (> 5 million points) that could not be run with the copy algorithm.

  19. Future Tasks: towards more parallel efficiency
  • Find hybrid strategies between the copy and the ring algorithm, by splitting the particles in such a way as to maximize the use of local memory, rather than splitting them by the number of processors when the memory limit has not been reached yet. This will reduce the number of shifts in the ring algorithm and increase its efficiency for large numbers of processors.
  • Other alternative: use mixed OpenMP/MPI programming (see next slide). This will reduce the number of shifts by the number of processors per node (8 in our case).

  20. Mixed MPI-OpenMP implementation
  Another alternative would be to use MPI between nodes and OpenMP locally (within each node's shared memory), with the ring algorithm.
  Pros:
  - Easy implementation, few modifications required.
  - Built-in load-balancing subroutines.
  - The fast summation will be more time-efficient on a bigger set of particles.
  - Will reduce the communication time: the number of travelling clusters in the ring algorithm will be reduced by the number of processors per node (8 in our case). A sketch of the idea follows below.
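A minimal sketch of this hybrid idea in C: one MPI rank per node circulates the source block around the ring, while OpenMP threads share the node's memory and split the local target loop. The packed array layout, the smoothed kernel used as a stand-in for the fast summation, and the assumption that every rank holds the same number of sources are illustrative choices, not the authors' implementation. MPI would need to be initialized with at least MPI_THREAD_FUNNELED support, since only the master thread calls MPI.

```c
#include <math.h>
#include <mpi.h>
#include <omp.h>

/* Illustrative smoothed interaction between one target and one source
 * (a stand-in for the real kernel / fast-summation call). */
static void interact(const double *xt, const double *xs, const double *ws,
                     double delta, double *du)
{
    double d[3] = { xt[0]-xs[0], xt[1]-xs[1], xt[2]-xs[2] };
    double r2 = d[0]*d[0] + d[1]*d[1] + d[2]*d[2] + delta*delta;
    double s  = -1.0 / (4.0 * 3.14159265358979323846 * r2 * sqrt(r2));
    du[0] = s * (d[1]*ws[2] - d[2]*ws[1]);
    du[1] = s * (d[2]*ws[0] - d[0]*ws[2]);
    du[2] = s * (d[0]*ws[1] - d[1]*ws[0]);
}

/* Hybrid ring: one MPI rank per node shifts the source block around the
 * ring, OpenMP threads split the local target loop in shared memory.
 * targets/sources/weights are packed as [i*3+k]; vel accumulates velocity.
 */
void hybrid_ring(const double *targets, double *vel, int ntarg,
                 double *sources, double *weights, int nsrc,
                 double delta, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    int next = (rank + 1) % size, prev = (rank + size - 1) % size;

    for (int shift = 0; shift < size; ++shift) {
        /* Threads split the targets; each thread reads the whole source block. */
        #pragma omp parallel for schedule(dynamic)
        for (int i = 0; i < ntarg; ++i) {
            for (int j = 0; j < nsrc; ++j) {
                double du[3];
                interact(&targets[3*i], &sources[3*j], &weights[3*j], delta, du);
                vel[3*i+0] += du[0];
                vel[3*i+1] += du[1];
                vel[3*i+2] += du[2];
            }
        }
        if (shift < size - 1) {   /* one shift per node instead of one per core */
            MPI_Sendrecv_replace(sources, 3*nsrc, MPI_DOUBLE,
                                 next, 0, prev, 0, comm, MPI_STATUS_IGNORE);
            MPI_Sendrecv_replace(weights, 3*nsrc, MPI_DOUBLE,
                                 next, 0, prev, 0, comm, MPI_STATUS_IGNORE);
        }
    }
}
```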
