Slide 1

Parallelization of Scientific Applications (I)

A Parallel Structured Flow Solver - URANUS

Russian-German School on High Performance Computer Systems, June 27th to July 6th, 2005, Novosibirsk

  • Day 4, June 30th, 2005

HLRS, University of Stuttgart

Slide 2

Outline

  • Introduction
  • Basics
  • Boundary Handling
  • Example: Finite Volume Flow Simulation on Structured Meshes
  • Example: Finite Element Approach on an Unstructured Mesh
Slide 3

URANUS - Overview

  • Calculation of 3D re-entry flows
    – High Mach number
    – High temperature
    – Chemistry
  • Calculation of the heat flow on the surface
    – Fully catalytic and semi-catalytic surfaces

Slide 4

URANUS - Numerics

  • cell-centered finite-volume approach for the spatial discretization of the unsteady, compressible Navier-Stokes equations
  • time integration accomplished by the backward Euler scheme
  • the implicit equation system is solved iteratively by Newton's method
  • two different limiters for second-order accuracy
  • Jacobi line relaxation method with subiterations to speed up convergence
  • special handling of the singularity in one-block C-meshes
Slide 5

Parallelization - Target

  • High application performance
  • Calculation of full configurations in 3D with chemical reactions
    – performance issue
    – memory issue
  • Calculation of complex topologies
  • Use of really big MPPs
    – no loss in efficiency even when using 500 processors or more
  • Use of large vector systems
Slide 6

Re-entry Simulation - X38 (CRV)

Slide 7

Amdahl's Law

  • Let's assume a problem of fixed size
  • Serial fraction: $s$
  • Parallel fraction: $p$
  • Number of processors: $n$
  • Degree of parallelization: $\beta = \dfrac{p}{s+p}$
  • Speedup:

    $S(n) = \dfrac{1}{(1-\beta) + \dfrac{\beta}{n}} \;\longrightarrow\; \dfrac{1}{1-\beta} \quad \text{for } n \to \infty$

Does parallelization pay off?
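A quick worked example with illustrative numbers (not from the slides): take $\beta = 0.96$ on $n = 64$ processors.

$$S(64) = \frac{1}{(1-0.96) + 0.96/64} = \frac{1}{0.04 + 0.015} \approx 18.2, \qquad \lim_{n\to\infty} S(n) = \frac{1}{0.04} = 25$$

Even a 96 % parallel code can never exceed a speedup of 25, no matter how many processors are added.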

Slide 8

Amdahl's Law for 1-64 Processors

[Plot: Amdahl speedup curves for n = 4, 8, 16, 32 and 64 processors over the degree of parallelization β between 0.80 and 1.00]

Slide 9

Problem with Amdahl's Law

  • The conclusion is obviously correct
  • But it relies on an important precondition:
  • the problem size is considered to be constant
  • This is typically not true for simulations
  • Computers are always too small
  • The problem size will grow whenever possible
Slide 10

Gustafson's Law

  • Let's assume a problem of growing size
  • Number of processors: $n$
  • Constant serial fraction: $s$
  • The parallel fraction scales with the machine: $p(n) = n \, p_1$
  • Linear speedup:

    $S(n) = \dfrac{s + p(n)}{s + p_1} = \dfrac{s + n\,p_1}{s + p_1} = s + n\,p_1 \quad \text{(for } s + p_1 = 1\text{)}$

Parallelization may pay off
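For comparison with the Amdahl example above (again illustrative numbers): with $s = 0.04$, $p_1 = 0.96$ and $n = 64$,

$$S(64) = 0.04 + 64 \cdot 0.96 = 61.48,$$

i.e. the speedup grows almost linearly with the number of processors, because the parallel work grows with the machine.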

Slide 11

Parallelization of CFD Applications - Principles

Slide 12

A Problem (I)

Flow around a cylinder: numerical simulation using FV, FE or FD
Data structure: A(1:n,1:m)
Solve: (A+B+C)x = b

Movie: Lutz Tobiska, University of Magdeburg http://david.math.uni-magdeburg.de/home/john/cylinder.html

Slide 13

Parallelization strategies

Flow around a cylinder: numerical simulation using FV, FE or FD
Data structure: A(1:n,1:m)
Solve: (A+B+C)x = b

Three candidate strategies:

  • Work decomposition: split the loop iterations, e.g. do i=1,100 becomes i=1,25 / i=26,50 / i=51,75 / i=76,100 (a sketch follows below)
  • Data decomposition: split the array, e.g. A(1:20,1:50), A(1:20,51:100), A(1:20,101:150), A(1:20,151:200)
  • Domain decomposition

Does it scale? Too much communication? Domain decomposition offers a good chance.
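A minimal sketch of the work-decomposition variant in Fortran with MPI (hypothetical variable names, not URANUS code): each process computes its own contiguous part of the i = 1,100 loop.

      program work_decomposition
      implicit none
      include 'mpif.h'
      integer :: rank, nprocs, info, ilow, ihigh, chunk, i
      real    :: local_work

      call MPI_INIT(info)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, info)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, info)

c     split the global iteration range 1..100 into contiguous chunks,
c     e.g. four processes get i=1,25 / i=26,50 / i=51,75 / i=76,100
      chunk = 100 / nprocs
      ilow  = rank * chunk + 1
      ihigh = ilow + chunk - 1
      if (rank .eq. nprocs-1) ihigh = 100

      local_work = 0.0
      do i = ilow, ihigh
c        ... the work for iteration i is done here ...
         local_work = local_work + real(i)
      end do

      call MPI_FINALIZE(info)
      end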

Slide 14

Parallelization Problems

  • Decomposition (Domain, Data, Work)
  • Communication

Example: a central difference needs values from the neighbouring cells,

    $\dfrac{du}{dx}\Big|_i = \dfrac{u_{i+1} - u_{i-1}}{dx}$

so at a subdomain boundary the values $u_{i-1}$ or $u_{i+1}$ must be obtained from the neighbouring process (a minimal sketch follows below).
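A minimal 1D sketch of the resulting communication (illustrative names, not the actual URANUS routines): before the derivative loop, each process fills its halo cells u(0) and u(nloc+1) from its left and right neighbours.

      subroutine halo_exchange(u, nloc, left, right)
c     u(1:nloc) are the local cells, u(0) and u(nloc+1) the halo cells;
c     left/right are the neighbour ranks (or MPI_PROC_NULL at the domain ends)
      implicit none
      include 'mpif.h'
      integer :: nloc, left, right, info
      integer :: status(MPI_STATUS_SIZE)
      real    :: u(0:nloc+1)

c     send the first inner cell to the left, receive the right halo
      call MPI_SENDRECV(u(1),      1, MPI_REAL, left,  1,
     &                  u(nloc+1), 1, MPI_REAL, right, 1,
     &                  MPI_COMM_WORLD, status, info)
c     send the last inner cell to the right, receive the left halo
      call MPI_SENDRECV(u(nloc),   1, MPI_REAL, right, 2,
     &                  u(0),      1, MPI_REAL, left,  2,
     &                  MPI_COMM_WORLD, status, info)
      return
      end

After the exchange, du/dx can be evaluated for every local cell i = 1, ..., nloc without further communication.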

Slide 15

Concepts - Message Passing (II)

User defined communication

Slide 16

How to split the Domain in the Dimensions (I)

1-dimensional versus 2- (and 3-) dimensional decomposition

Slide 17

How to split the Domain in the Dimensions (II)

  • That depends on:
    – computational speed, i.e. the processor type: vector processor or cache-based
    – communication speed:
        • latency
        • bandwidth
        • topology
    – number of subdomains needed
    – load distribution (is the effort equal for every mesh cell?)

(A rough estimate of the communication/computation ratio follows below.)
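A rough estimate (standard surface-to-volume reasoning, not taken from the slides): for a cubic domain of $N^3$ cells on $P$ processes, the computation per process scales with the subdomain volume and the communication with its surface, so

$$\left.\frac{\text{communication}}{\text{computation}}\right|_{\text{1D split}} \approx \frac{2\,N^2}{N^3/P} = \frac{2P}{N},
\qquad
\left.\frac{\text{communication}}{\text{computation}}\right|_{\text{3D split}} \approx \frac{6\,(N/P^{1/3})^2}{N^3/P} = \frac{6\,P^{1/3}}{N}.$$

For large P the 3D split therefore exchanges far less data per process, but it sends more and smaller messages, so latency (and the shorter vector lengths on vector processors) also enter the decision.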

Slide 18

Replication versus Communication (I)

  • If we need a value from a neighbour, we basically have two options:
    – get the necessary value directly from the neighbour when it is needed → communication, additional synchronisation
    – recalculate the neighbour's value locally from values known there → additional calculation
  • The selection depends on the application
Slide 19

Replication versus Communication (II)

  • Normally the values are replicated
    – Consider how many calculations you can execute while sending only 1 bit from one process to another (6 µs latency at 1.0 Gflop/s ≈ 6 000 operations)
    – Sending 16 kByte, i.e. 20x20x5 doubles (at 300 MB/s bandwidth ≈ 53.3 µs ≈ 53 300 operations, spelled out below)
    – very often blocks have to wait for their neighbours
    – but the extra work limits the parallel efficiency
  • Communication should only be used if one is quite sure that it is the better solution
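The same numbers spelled out (the 20x20x5 block corresponds to 2000 double-precision values of 8 bytes each):

$$t_{\text{transfer}} = \frac{2000 \cdot 8\ \text{Byte}}{300\ \text{MB/s}} \approx 53.3\ \mu\text{s}, \qquad 53.3\ \mu\text{s} \times 1.0\ \text{Gflop/s} \approx 53\,300\ \text{operations}$$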

Slide 20

2- Dimensional DD with two Halo Cells

[Figure: mesh partitioning, one subdomain for each process, with two halo cells along the inner boundaries]

Slide 21

Back to URANUS

Slide 22

Analysis of the Sequential Program

  • Written in FORTRAN77
  • Using structured meshes
  • Parts of the program:
    – Preprocessing, reading data
    – Main loop
        • Setup of the equation system
        • Preconditioning
        • Solving step
    – Postprocessing, writing data

Slide 23

Deciding for the Data Model

  • We will use domain decomposition
    – The domain is split into subdomains
    – Large topologies and large numbers of subdomains require a 3D domain decomposition
  • Each cell has at most 6 neighbours
    – Now there are 2 boundary types:
        • Physical boundary: the subdomain boundary is a domain boundary
        • Inner boundary: the subdomain boundary is the boundary to another subdomain
    – Data exchange between the subdomains is necessary → communication

Slide 24

Data Distribution

  • The domain is split by a simple algorithm
    – Make the subdomains equal-sized → minor load-balancing issue
  • Each subdomain is calculated by its own process (on its own processor)
  • Data needs to be distributed before calculating
    – One process reads all data
    – The data are then distributed to all the other processes (see the sketch below)
        • Bottleneck
        • Sequential part
    – A parallel read would have been better, but MPI-I/O was not available at that time
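A minimal sketch of the "one process reads, all others receive" pattern (illustrative routine and variable names, not the actual URANUS I/O):

      subroutine read_and_distribute(xloc, nglob, nloc)
      implicit none
      include 'mpif.h'
      integer :: nglob, nloc, rank, info
      real    :: xloc(nloc)
      real, allocatable :: xglob(:)

      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, info)
      if (rank .eq. 0) then
c        only rank 0 touches the file and holds the global field
         allocate(xglob(nglob))
c        ... read the whole field from file into xglob ...
      else
         allocate(xglob(1))
      endif

c     hand every process its contiguous piece; rank 0 is the
c     sequential bottleneck mentioned above
      call MPI_SCATTER(xglob, nloc, MPI_REAL,
     &                 xloc,  nloc, MPI_REAL, 0, MPI_COMM_WORLD, info)
      deallocate(xglob)
      return
      end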

Slide 25

Dynamic Data Structures

  • Pure FORTRAN77 is too static
    – the number of processors can vary from run to run, and the array sizes can vary even within the same case → dynamic data structures
        • use Fortran90 dynamic arrays, or
        • use all local memory on a PE for one huge FORTRAN77 array and set up your own memory management
  • The second method has a problem on SMPs and cc-NUMAs: we should only use as much memory as necessary

Slide 26

Inauguration of the Dynamic Data Structure

  • FORTRAN77

      common /geo/ x(0:n1m,0:n2m,0:n3m), y(0:n1m,0:n2m,0:n3m), z ...

  • Fortran90
    – Direct usage of dynamic (allocatable) arrays in a common block is not possible
    – Usage of Fortran90 pointers instead:

      common /geo/ x, y, z ...
      real, pointer :: x(:,:,:)
      real, pointer :: y(:,:,:)

    – Allocation and deallocation in the main program are necessary (see the sketch below)
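A minimal sketch of the corresponding allocation in the main program (the array names follow the slide; the sizes are fixed here only for illustration, in the real code they come from the input data):

      program uranus_main
      implicit none
      integer :: n1, n2, n3
      real, pointer :: x(:,:,:), y(:,:,:), z(:,:,:)
      common /geo/ x, y, z

c     local subdomain size (illustrative values)
      n1 = 64
      n2 = 32
      n3 = 32

      allocate( x(0:n1+1, 0:n2+1, 0:n3+1) )
      allocate( y(0:n1+1, 0:n2+1, 0:n3+1) )
      allocate( z(0:n1+1, 0:n2+1, 0:n3+1) )

c     ... preprocessing, main loop, postprocessing ...

      deallocate( x, y, z )
      end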

Slide 27

Hints for (Dynamic) Data Structure

  • Do not use global data structures
    – All data in a subroutine should be local
    – Data should be passed to the subroutine through the call (see the sketch below)
    – Better maintainability of the program
  • This was done in a complete re-engineering of the URANUS code in a later step
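A small illustration of the rule (hypothetical routines, not taken from URANUS):

c     discouraged: the routine reaches into global data
      subroutine timestep_global
      implicit none
      real, pointer :: u(:,:,:)
      common /flow/ u
c     ... works on the global field u ...
      end

c     preferred: everything the routine needs is passed in the call
      subroutine timestep(u, n1, n2, n3, dt)
      implicit none
      integer :: n1, n2, n3
      real    :: dt
      real    :: u(0:n1+1, 0:n2+1, 0:n3+1)
c     ... works only on its arguments ...
      end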

Slide 28

Main Loop (I)

  • Setup of the equation system
    – each cell has 6 neighbours
    – needs data from the neighbouring cells
    – 2 halo cells at the inner boundaries
    – special part for handling the physical boundaries

Slide 29

Numbering of the cells in the subdomains

Slide 30

Routine for data exchange (I)

      subroutine dqchange(dq)
c------------------------------------------------------------------------------
c
c     Subroutine for the boundary exchange of dq at all 6 boundaries,
c     exchange depth +-1
c
c------------------------------------------------------------------------------
      implicit none
      real dq(0:n1+1,0:n2+1,0:n3+1,nc)
      integer sendhandle(6), recvhandle(6)
      real, dimension (:), pointer :: sendbuf1, sendbuf2, recvbuf1, recvbuf2,
     .                                sendbuf3, sendbuf4, recvbuf3, recvbuf4,
     .                                sendbuf5, sendbuf6, recvbuf5, recvbuf6
      integer, dimension (MPI_STATUS_SIZE,6) :: sendstatusarray
      integer, dimension (MPI_STATUS_SIZE)   :: status
      integer :: uranus_packed

Slide 31

Routine for data exchange (II)

#ifdef TRACE
      write (6,*) mytid, ': subroutine dqchange'
#endif
c...Asynchronous receive of dq from the neighbouring processors
      if (sio.eq.0) then
         iu = n1+1
         io = n1+1
         ju = dmju
         jo = dmjo
         ku = dmku
         ko = dmko
         call prepare_receive(dq, 0, n1+1, 0, n2+1, 0, n3+1, nc, REALx,
     &        iu, io, ju, jo, ku, ko, epio, buffer, count2, uranus_packed)
         recvbuf2 => buffer
         call MPI_IRECV(recvbuf2, count2, uranus_packed, tid_io, 10001,
     &        MPI_COMM_WORLD, recvhandle(1), info)
      else
         recvhandle(1) = MPI_REQUEST_NULL
      endif

Slide 32

Routine for data exchange (III)

      if (siu.eq.0) then
         iu = 0
         io = 0
         ju = dmju
         jo = dmjo
         ku = dmku
         ko = dmko
         call prepare_receive(dq, 0, n1+1, 0, n2+1, 0, n3+1, nc, REALx,
     &        iu, io, ju, jo, ku, ko, epiu, buffer, count1, uranus_packed)
         recvbuf1 => buffer
         call MPI_IRECV(recvbuf1, count1, uranus_packed, tid_iu, 10001,
     &        MPI_COMM_WORLD, recvhandle(2), info)
      else
         recvhandle(2) = MPI_REQUEST_NULL
      endif

Slide 33

Routine for data exchange (IV)

c...Asynchronous send of dq to the neighbouring processors
      if (siu.eq.0) then
         iu = 1
         io = 1
         ju = dmju
         jo = dmjo
         ku = dmku
         ko = dmko
         call uranus_pack(dq, 0, n1+1, 0, n2+1, 0, n3+1, nc, REALx,
     &        iu, io, ju, jo, ku, ko, epiu, buffer, count1, uranus_packed)
         sendbuf1 => buffer
         call MPI_ISEND(sendbuf1, count1, uranus_packed, tid_iu, 10001,
     &        MPI_COMM_WORLD, sendhandle(1), info)
      else
         sendhandle(1) = MPI_REQUEST_NULL
      endif

Slide 34

Routine for data exchange (V)

      do 300 ni = 1, 6
         call MPI_WAITANY(6, recvhandle, handnum, status, info)
         if (handnum .eq. 1) then
            iu = n1+1
            io = n1+1
            ju = dmju
            jo = dmjo
            ku = dmku
            ko = dmko
            buffer => recvbuf2
            call uranus_unpack(dq, 0, n1+1, 0, n2+1, 0, n3+1, nc, REALx,
     &           iu, io, ju, jo, ku, ko, epio, buffer, count2, uranus_packed)
            call uranus_buffer_release
         else if (handnum .eq. 2) then

Slide 35

Routine for data exchange (VI)

         else if (handnum .eq. MPI_UNDEFINED) then
c           case (MPI_UNDEFINED)
            goto 400
         endif
 300  continue
 400  continue

      call MPI_WAITALL(6, sendhandle, sendstatusarray, info)
      buffer => sendbuf1
      call uranus_buffer_release
      ...
      return
      end

Slide 36

Changes in the loops

Loop over the whole domain in the sequential code:

      do 100 k = 0, n3+1
      do 100 j = 0, n2+1
      do 100 i = 0, n1+1
         vdt(i,j,k) = damax(i,j,k)
     .        * ( sqrt( u(i,j,k) * u(i,j,k)
     .        + v(i,j,k) * v(i,j,k)
     .        + w(i,j,k) * w(i,j,k)) + c(i,j,k) ) / cfl
 100  continue

The same loop restricted to the local subdomain bounds in the parallel code:

      do 100 k = dmku, dmko
      do 100 j = dmju, dmjo
      do 100 i = dmiu, dmio
         vdt(i,j,k) = damax(i,j,k)
     .        * ( sqrt( u(i,j,k) * u(i,j,k)
     .        + v(i,j,k) * v(i,j,k)
     .        + w(i,j,k) * w(i,j,k)) + c(i,j,k) ) / cfl
 100  continue

Slide 37

Changes for boundary handling

  • Boundary handling, only if there is a physical boundary

      if (rand_inflow) then
         do 100 n = 1, nc
         do 100 k = dmku, dmko
         do 100 j = dmju, dmjo
            rhs(0,j,k,n) = - ( q(0,j,k,n) - qzu(n) )
 100     continue
      endif

Slide 38

Main Loop (II)

  • Preconditioning
    – no neighbour information needed at all
    – completely done locally
    – each process calls the preconditioner with its local subdomain

Slide 39

Main Loop (III) - Solver

  • Sequential:

    $A\,\vec{u} = \vec{r}, \qquad \vec{u} = A^{-1}\,\vec{r}$

    – Jacobi line relaxation with subiterations
  • Parallelization problems:
    – the matrix A is distributed
    – matrix inversion is a priori not parallelizable
        • Gauss elimination is recursive
Slide 40

Heptadiagonal Matrix

[Matrix sketch: sparsity pattern of the heptadiagonal matrix, the main diagonal D plus six off-diagonals X from the couplings to the i±1, j±1 and k±1 neighbour cells]

Slide 41

Parallelization - Solver

    $A = M + L$

    $A\,\vec{u} = \vec{r}$

    $(M + L)\,\vec{u} = \vec{r}$

    $M\,\vec{u} = \vec{r} - L\,\vec{u}$

    $\vec{u}^{\,(j)} = M^{-1}\left(\vec{r} - L\,\vec{u}^{\,(j-1)}\right)$
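Written out per subdomain (the same splitting, only with block indices added for clarity), block $k$ needs only the previous iterate of its neighbouring blocks, which is exactly what the halo exchange provides:

$$M_k\,\vec{u}_k^{\,(j)} = \vec{r}_k - \sum_{k' \in \mathrm{neigh}(k)} L_{kk'}\,\vec{u}_{k'}^{\,(j-1)}$$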

Slide 42

Parallelization - Solver (III)

[Matrix sketch of the splitting A = M + L: M holds the tridiagonal entries (l_i, m_i, u_i) inside each diagonal block (one block per subdomain), while L holds the few coupling entries (here u_4, l_5, u_8, l_9) that connect neighbouring blocks]

Slide 43

A Real Heptadiagonal Matrix

[Matrix sketch: heptadiagonal matrix for a small structured mesh with the cells numbered 1-36, diagonal entries D plus off-diagonal couplings X to the neighbouring cells]

Slide 44

Difference between strong / weak coupling

  • Solver with weak coupling
    – extra computational effort due to an additional solving step (but no factorization)
    – additional update of the right-hand side
    – communication twice, once after each solving step
  • Solver with stronger coupling
    – Jacobi line relaxation method with subiterations
    – collapses the iterations from the line relaxation and the parallelization
    – no additional iterations
    – much more communication

Slide 45

Comparison of Solvers - Convergence

[Plot: L2 residual versus iterations for Solver 1 and Solver 2]

Slide 46

Comparison of Solvers - Performance

[Plot: time in seconds versus number of processors for the strong- and weak-coupling solvers over Ethernet and over the HPS interconnect]

Slide 47

Results - Solving Method

  • the presented solver with weak coupling works fine for these CFD problems
  • the solutions differ on the scale of one percent
  • the convergence rate is nearly equal to that of the sequential program
Slide 48

Domain Decomposition

Slide 49

Computation Time on Cray T3E

[Plot: computation time in seconds versus number of processors on the Cray T3E]

Slide 50

Speedup on Cray T3E

[Plot: speedup versus number of processors on the Cray T3E for problem sizes of 110592, 55296 and 27648 cells]

Slide 51

Scaleup on Cray T3E

[Plot: scaleup versus number of processors on the Cray T3E]

Slide 52

Results - Properties of the Parallel Program

  • very flexible
    – no recompilation necessary when changing the problem size or the number of processes
  • runs on workstation clusters and on supercomputers
  • large problems can be computed
  • there is no change in the program handling
Slide 53

Portability of Parallel Code

  • The application should run on every platform where Fortran 90 and MPI are available
  • Tested platforms:
    – Cray T3D, Cray T3E
    – Hitachi SR2201, Hitachi SR8000
    – IBM RS/6000 SP, IBM p690 (Regatta)
    – Intel Paragon
    – NEC SX-4, NEC SX-5, NEC SX-6, NEC SX-8
    – SGI Origin2000
    – Intel IA-64
    – Intel IA-32, AMD Opteron

Slide 54

Computed Solution and Visualization

Slide 55

Computed Solution and Visualization X-38

Slide 56

Conclusions

  • parallelization of a 3D flow solver
  • parallel program is portable and flexible
  • no loss in convergence speed due to the parallelization
  • speedup is normal for a flow solvers
  • scaleup is very good
Slide 57

Metacomputing

Slide 58

Metacomputer Setup

Slide 59

Metacomputer Setup: Network

Slide 60

Metacomputer Setup: Network Performance

Connection    Distance      Latency       Bandwidth
HLRS-CSAR     594 miles     20 ms         10 Mbit/s
PSC-CSAR      3594 miles    60 ms         1 Mbit/s
PSC-HLRS      4181 miles    80 ms         4 Mbit/s
HLRS-ETL      5870 miles    300-500 ms    1-2 Mbit/s

Slide 61

Metacomputing Setup: Computers

  • Hitachi SR8000 at TACC/Japan, 512 CPU/64nodes
  • Cray T3E-900 at PSC/U.S.A., 512 CPU
  • Cray T3E-1200 at CSAR/U.K., 576 CPU
  • Cray T3E-900 at HLRS/Germany, 512 CPU

==> Total performance of more than 2 TFlop/s

Slide 62

X-38 Calculated Solution

  • Mach 6
  • Angle of attack 40°
  • 888 832 control volumes
  • 600 iterations
  • Cray T3E: 101 minutes on 128 processors
  • Metacomputing:
    – 3.6 million cell mesh
    – 1536 processors
    – 3 T3Es
Slide 63

Migrating from a parallel single block to a parallel multiblock flow solver

Slide 64

Outline

  • Introduction to parallel URANUS
  • Why use multiblock meshes
  • Extending URANUS to use multiblock meshes
  • Results
  • Outlook
Slide 65

Sequential 2D/3D URANUS (Non Equilibrium Flows)

  • Cell-centered finite-volume approach
  • solving the unsteady, compressible Navier-Stokes equations
  • the implicit equation system is solved iteratively by Newton's method
  • two different limiters for second-order accuracy
  • CVCV multiple-temperature gas-phase model
  • Chapman-Cowling transport coefficient models
  • Gas-kinetic gas-surface model with different catalysis models
  • PARADE/HERTA gas-radiation coupling
Slide 66

Parallelization

  • domain decomposition

– with two halo cells at the subdomain boundaries

  • dynamic data structures using Fortran90
  • special solver
  • execution model SPMD
  • message-passing with MPI
  • still working only on C-meshes
Slide 67

Why Using Multiblock Meshes

  • There are topologies which cannot be meshed, or which are hard to mesh, with a C-mesh
  • The singularity and sometimes heavily distorted mesh cells limit the convergence rate
  • Using unstructured meshes would mean rewriting the code
  • Obtaining performance on current supercomputers is easier with structured meshes
  • Using multiblock meshes:
    – meshing of complex topologies is possible
    – structured blocks
    – performance is easier to obtain

Slide 68

A Multiblock Mesh for X-38 (I)

Slide 69

A Multiblock Mesh for X-38 (II)

Slide 70

Characteristics of Multiblock Meshes

  • Each block may have a local coordinate system which differs from that of its neighbours
  • A block may have one, two or more neighbours on one block side
  • Physical boundaries may occur on each block side
  • Blocks generally have different sizes

→ the program must be able to handle all of this

Slide 71

Extensions Necessary to Handle Multiblock Meshes

  • Handling of the block-internal orientation
  • Handling of more complex neighbour dependencies
  • Handling of physical boundaries at each block side
  • Load balancing:
    – handling of multiple blocks on one processor
    – automatic block splitting
    – use of a load balancer for the block distribution
Slide 72

Axis Orientation: Different Local Coordinate Systems

  • Reason: neighbouring blocks may use differently oriented local (ξ, η) coordinate systems
  • Solution: changing the storage order according to the difference during the communication (currently on the sender side)

Slide 73

Neighbour Dependencies - Occurrence

  • A block may have more than one neighbour at one blockside
Slide 74

Neighbour Dependencies - Handling

  • Up to now only one neighbour block at each side
  • How to handle several neighbours at each block side:
  • Replacement of the whole communication structure
  • New data structure to store all communication-specific parameters, such as:
    – host process of the neighbouring block
    – orientation of the neighbour
    – identifier of the neighbouring side that this (part of the) side is connected to
    – specification of the side (cell subscripts) at the block side
  • Extension of the communication routines
Slide 75

Physical Boundaries

  • C-mesh:
    – a specific physical boundary type is bound to a specific block side
  • Physical boundaries in multiblock meshes can occur on all of the block sides

[Figure: two blocks (Block 1 and Block 2) adjacent to the body, with physical boundaries on different block sides]

Slide 76

Efficient Calculation of Boundaries (I)

  • Special data structure for each boundary type:
    – location of each boundary
    – subtype of the boundary
  • Only one code segment for boundary handling
    – no duplication of the code for each side
        • one code to update and maintain
        • no cut-and-paste bugs
  • No branches
    – chance of a performance improvement

Slide 77

Efficient Calculation of Boundaries (II)

      do m = 0, aktblock%inflowvek(1)-1
        do n = 1, nc
          do k = aktblock%inflowvek(m*17+14), aktblock%inflowvek(m*17+15),
     .           aktblock%inflowvek(m*17+17)
            do j = aktblock%inflowvek(m*17+9), aktblock%inflowvek(m*17+10),
     .             aktblock%inflowvek(m*17+12)
              do i = aktblock%inflowvek(m*17+4), aktblock%inflowvek(m*17+5),
     .               aktblock%inflowvek(m*17+7)
                rhs(i,j,k,n) = - ( q(i,j,k,n) - qzu(n) )
              end do
            end do
          end do
        end do
      end do

Slide 78

Load Balancing

  • Target: efficient use of massively parallel processors
  • Blocks have different sizes
  • The number of blocks generally differs from the number of processors used
  • Initial load balancing is necessary
  • Problems to solve:
    – There are blocks which are too large to be calculated efficiently on one processor
        • block splitting necessary
    – There are blocks which are too small to be calculated alone on one processor
        • a process should be able to calculate more than one block at a time

Slide 79

Extensions for block handling

  • Different numbers of blocks on a process
    – Extension of the subroutines and algorithms
        • outer block loops around the subroutines
    – Communication between blocks on one process is also done using MPI
    – Extension of the communication structure, so that each incoming message reaches its block
  • Blocks on a process may have different sizes
    – how shall we store them?
Slide 80

New Data Structure

  • New dynamic data structure (a sketch follows below)
    – dynamic linked list of blocks
    – a block may contain different meshes with different resolutions (prepared for multigrid methods and adaptive mesh refinement)
    – all information regarding a block is stored in its data structure, e.g. neighbours, physical boundaries, the local Jacobian matrix, message handles, ...; the result vector and the geometry information are bound to the mesh
    – the number of blocks per process is only limited by the available memory
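A minimal Fortran 90 sketch of such a linked list of blocks (the component names are illustrative, not the actual URANUS declarations):

      module block_list
        implicit none

        type mesh_t
          integer :: n1, n2, n3                    ! resolution of this mesh level
          real, pointer :: x(:,:,:)                ! geometry
          real, pointer :: q(:,:,:,:)              ! result vector on this mesh
        end type mesh_t

        type block_t
          integer :: id
          type(mesh_t), pointer :: meshes(:)       ! several resolutions (multigrid, AMR)
          integer, pointer :: neighbours(:)        ! neighbour / boundary descriptors
          integer, pointer :: msg_handles(:)       ! outstanding message handles
          type(block_t), pointer :: next           ! next block owned by this process
        end type block_t

        type(block_t), pointer :: first_block      ! head of the per-process block list
      end module block_list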

Slide 81

Block Splitting

  • Blocks which are too large for one process are split automatically:
    – with a side proportion of 2:1:1
    – only powers of two and three are used as side dividers
  • The update information for the data transmission between the new neighbours is handled by the program (even the case shown on the left can be handled)

Slide 82

Load Balancing Step

  • The (parallel) graph partitioner jostle is used to distribute the obtained blocks to the available processors
  • A graph is generated from the block distribution with:
    – nodes representing the blocks
    – node weights representing the block sizes (computational effort)
    – edges representing the neighbour dependencies between the blocks
  • The blocks are redistributed according to jostle's suggestion

(A sketch of assembling such a graph follows below.)
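A sketch of how such a graph could be assembled into compressed (CSR-style) arrays before it is handed to the partitioner (generic arrays and names; the actual jostle calling interface is not shown here):

      subroutine build_block_graph(nblocks, maxneigh, ncells, nneigh,
     &                             neighlist, vwgt, xadj, adjncy)
      implicit none
      integer :: nblocks, maxneigh
      integer :: ncells(nblocks)               ! cells per block
      integer :: nneigh(nblocks)               ! number of neighbour blocks
      integer :: neighlist(maxneigh, nblocks)  ! ids of the neighbour blocks
      integer :: vwgt(nblocks)                 ! node weight = computational effort
      integer :: xadj(nblocks+1)               ! start of each block's edge list
      integer :: adjncy(*)                     ! edges = neighbour dependencies
      integer :: b, j, pos

      pos = 1
      do b = 1, nblocks
         vwgt(b) = ncells(b)
         xadj(b) = pos
         do j = 1, nneigh(b)
            adjncy(pos) = neighlist(j, b)
            pos = pos + 1
         end do
      end do
      xadj(nblocks+1) = pos
      return
      end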
Slide 83

Example Block Distribution (I)

  • X-38 mesh:
    – 6 blocks
    – 68712 cells
    – largest block: 32256 cells
    – smallest block: 1176 cells

Slide 84

Example Block Distribution (II)

  • The same mesh with the blocks cut automatically:
    – 9 processes
    – 11 blocks

Slide 85

Example Block Distribution (III)

  • Blocks distributed to the processors:
    – largest: 8064 cells
    – smallest: 6048 cells
    – load imbalance: 18 %

Slide 86

Results

  • Solution (X-38 NG):
    – Mach 19.8
    – angle of attack 40°

Slide 87

Results - Speedup

[Plot: speedup versus number of processors]

Slide 88

Results Parallelization

  • The block handling is in principle totally decoupled from the calculation of the simulation
  • In principle the whole block handling can be used to parallelize any sequential program which uses structured (multiblock) meshes
  • In some cases even the solver can be adapted or used directly
  • Thanks to the advanced data structure there are many cases where it should fit (adaptive mesh refinement, multigrid approaches, load balancing)
  • All this comes for free
Slide 89

Summary: Parallel 3D-Multiblock URANUS

  • Portable data-parallel simulation program
    – Fortran90 (dynamic data structures)
    – message passing using MPI
  • Domain decomposition based on structured multiblock meshes
  • Different index directions within blocks
  • Physical and inner boundaries on all block sides
  • Different numbers of neighbours possible on each block side
  • Handling of different block sizes (automatic initial block distribution)
  • Blocks not fitting on one process (load imbalance) are split automatically
  • The number of blocks on each process is only limited by the memory