Slide 1

Parallelization of Scientific Applications (I)

A Parallel Structured Flow Solver - URANUS

Russian-German School on High Performance Computer Systems, June 27th to July 6th, 2005, Novosibirsk

  • Day 4, June 30th, 2005

HLRS, University of Stuttgart

Slide 2

Outline

  • Introduction
  • Basics
  • Boundary Handling
  • Example: Finite Volume Flow Simulation on Structured Meshes
  • Example: Finite Element Approach on an Unstructured Mesh
Slide 3

URANUS - Overview

  • Calculation of 3D re-entry flows
    – High Mach number
    – High temperature
    – Chemistry
  • Calculation of the heat flow on the surface
    – Fully catalytic and semi-catalytic surfaces

Slide 4

URANUS - Numerics

  • cell-centered finite-volume approach for the spatial discretization of the unsteady, compressible Navier-Stokes equations
  • time integration accomplished by the backward Euler scheme
  • the implicit equation system is solved iteratively by Newton's method
  • two different limiters for second-order accuracy
  • Jacobi line relaxation method with subiterations to speed up convergence
  • special handling of the singularity in one-block C-meshes
Slide 5

Parallelization - Target

  • High application performance
  • Calculation of full configurations in 3D with chemical reactions
    – performance issue
    – memory issue
  • Calculation of complex topologies
  • Use of really big MPPs
    – no loss in efficiency even when using 500 processors or more
  • Use of large vector systems
Slide 6

Re-entry Simulation - X38 (CRV)

Slide 7

Amdahl's Law

  • Let's assume a problem of fixed size
  • Serial fraction: $s$
  • Parallel fraction: $p$
  • Number of processors: $n$
  • Degree of parallelization: $\beta = \dfrac{p}{s+p}$
  • Speedup:

    $S(n) = \dfrac{1}{(1-\beta) + \dfrac{\beta}{n}} \;\longrightarrow\; \dfrac{1}{1-\beta} \quad \text{for } n \to \infty$

Does parallelization pay off?
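A quick worked example with illustrative numbers (not from the slides): take $\beta = 0.96$ on $n = 64$ processors.

$$S(64) = \frac{1}{(1-0.96) + 0.96/64} = \frac{1}{0.04 + 0.015} \approx 18.2, \qquad \lim_{n\to\infty} S(n) = \frac{1}{0.04} = 25$$

Even a 96 % parallel code can never exceed a speedup of 25, no matter how many processors are added.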

Slide 8

Amdahl's Law for 1-64 Processors

[Plot: Amdahl speedup curves for n = 4, 8, 16, 32 and 64 processors over the degree of parallelization β between 0.80 and 1.00]

Slide 9

Problem with Amdahl's Law

  • The conclusion is obviously correct
  • But it relies on an important precondition:
  • the problem size is considered to be constant
  • This is typically not true for simulations
  • Computers are always too small
  • The problem size will grow whenever possible
Slide 10

Gustafson's Law

  • Let's assume a problem of growing size
  • Number of processors: $n$
  • Constant serial fraction: $s$
  • The parallel fraction scales with the machine: $p(n) = n \, p_1$
  • Linear speedup:

    $S(n) = \dfrac{s + p(n)}{s + p_1} = \dfrac{s + n\,p_1}{s + p_1} = s + n\,p_1 \quad \text{(for } s + p_1 = 1\text{)}$

Parallelization may pay off
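For comparison with the Amdahl example above (again illustrative numbers): with $s = 0.04$, $p_1 = 0.96$ and $n = 64$,

$$S(64) = 0.04 + 64 \cdot 0.96 = 61.48,$$

i.e. the speedup grows almost linearly with the number of processors, because the parallel work grows with the machine.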

Slide 11

Parallelization of CFD Applications - Principles

Slide 12

A Problem (I)

Flow around a cylinder: numerical simulation using FV, FE or FD
Data structure: A(1:n,1:m)
Solve: (A+B+C)x = b

Movie: Lutz Tobiska, University of Magdeburg http://david.math.uni-magdeburg.de/home/john/cylinder.html

Slide 13

Parallelization strategies

Flow around a cylinder: numerical simulation using FV, FE or FD
Data structure: A(1:n,1:m)
Solve: (A+B+C)x = b

Three candidate strategies:

  • Work decomposition: split the loop iterations, e.g. do i=1,100 becomes i=1,25 / i=26,50 / i=51,75 / i=76,100 (a sketch follows below)
  • Data decomposition: split the array, e.g. A(1:20,1:50), A(1:20,51:100), A(1:20,101:150), A(1:20,151:200)
  • Domain decomposition

Does it scale? Too much communication? Domain decomposition offers a good chance.
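A minimal sketch of the work-decomposition variant in Fortran with MPI (hypothetical variable names, not URANUS code): each process computes its own contiguous part of the i = 1,100 loop.

      program work_decomposition
      implicit none
      include 'mpif.h'
      integer :: rank, nprocs, info, ilow, ihigh, chunk, i
      real    :: local_work

      call MPI_INIT(info)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, info)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, info)

c     split the global iteration range 1..100 into contiguous chunks,
c     e.g. four processes get i=1,25 / i=26,50 / i=51,75 / i=76,100
      chunk = 100 / nprocs
      ilow  = rank * chunk + 1
      ihigh = ilow + chunk - 1
      if (rank .eq. nprocs-1) ihigh = 100

      local_work = 0.0
      do i = ilow, ihigh
c        ... the work for iteration i is done here ...
         local_work = local_work + real(i)
      end do

      call MPI_FINALIZE(info)
      end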

Slide 14

Parallelization Problems

  • Decomposition (Domain, Data, Work)
  • Communication

Example: a central difference needs values from the neighbouring cells,

    $\dfrac{du}{dx}\Big|_i = \dfrac{u_{i+1} - u_{i-1}}{dx}$

so at a subdomain boundary the values $u_{i-1}$ or $u_{i+1}$ must be obtained from the neighbouring process (a minimal sketch follows below).
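A minimal 1D sketch of the resulting communication (illustrative names, not the actual URANUS routines): before the derivative loop, each process fills its halo cells u(0) and u(nloc+1) from its left and right neighbours.

      subroutine halo_exchange(u, nloc, left, right)
c     u(1:nloc) are the local cells, u(0) and u(nloc+1) the halo cells;
c     left/right are the neighbour ranks (or MPI_PROC_NULL at the domain ends)
      implicit none
      include 'mpif.h'
      integer :: nloc, left, right, info
      integer :: status(MPI_STATUS_SIZE)
      real    :: u(0:nloc+1)

c     send the first inner cell to the left, receive the right halo
      call MPI_SENDRECV(u(1),      1, MPI_REAL, left,  1,
     &                  u(nloc+1), 1, MPI_REAL, right, 1,
     &                  MPI_COMM_WORLD, status, info)
c     send the last inner cell to the right, receive the left halo
      call MPI_SENDRECV(u(nloc),   1, MPI_REAL, right, 2,
     &                  u(0),      1, MPI_REAL, left,  2,
     &                  MPI_COMM_WORLD, status, info)
      return
      end

After the exchange, du/dx can be evaluated for every local cell i = 1, ..., nloc without further communication.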

Slide 15

Concepts - Message Passing (II)

User defined communication

Slide 16

How to split the Domain in the Dimensions (I)

1-dimensional versus 2- (and 3-) dimensional decomposition

Slide 17

How to split the Domain in the Dimensions (II)

  • That depends on:
    – computational speed, i.e. the processor type: vector processor or cache-based
    – communication speed:
        • latency
        • bandwidth
        • topology
    – number of subdomains needed
    – load distribution (is the effort equal for every mesh cell?)

(A rough estimate of the communication/computation ratio follows below.)
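A rough estimate (standard surface-to-volume reasoning, not taken from the slides): for a cubic domain of $N^3$ cells on $P$ processes, the computation per process scales with the subdomain volume and the communication with its surface, so

$$\left.\frac{\text{communication}}{\text{computation}}\right|_{\text{1D split}} \approx \frac{2\,N^2}{N^3/P} = \frac{2P}{N},
\qquad
\left.\frac{\text{communication}}{\text{computation}}\right|_{\text{3D split}} \approx \frac{6\,(N/P^{1/3})^2}{N^3/P} = \frac{6\,P^{1/3}}{N}.$$

For large P the 3D split therefore exchanges far less data per process, but it sends more and smaller messages, so latency (and the shorter vector lengths on vector processors) also enter the decision.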

Slide 18

Replication versus Communication (I)

  • If we need a value from a neighbour, we basically have two options:
    – get the necessary value directly from the neighbour when it is needed → communication, additional synchronisation
    – recalculate the neighbour's value locally from values known there → additional calculation
  • The selection depends on the application
Slide 19

Replication versus Communication (II)

  • Normally the values are replicated
    – Consider how many calculations you can execute while sending only 1 bit from one process to another (6 µs latency at 1.0 Gflop/s ≈ 6 000 operations)
    – Sending 16 kByte, i.e. 20x20x5 doubles (at 300 MB/s bandwidth ≈ 53.3 µs ≈ 53 300 operations, spelled out below)
    – very often blocks have to wait for their neighbours
    – but the extra work limits the parallel efficiency
  • Communication should only be used if one is quite sure that it is the better solution
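The same numbers spelled out (the 20x20x5 block corresponds to 2000 double-precision values of 8 bytes each):

$$t_{\text{transfer}} = \frac{2000 \cdot 8\ \text{Byte}}{300\ \text{MB/s}} \approx 53.3\ \mu\text{s}, \qquad 53.3\ \mu\text{s} \times 1.0\ \text{Gflop/s} \approx 53\,300\ \text{operations}$$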

Slide 20

2- Dimensional DD with two Halo Cells

[Figure: mesh partitioning, one subdomain for each process, with two halo cells along the inner boundaries]

Slide 21

Back to URANUS

Slide 22

Analysis of the Sequential Program

  • Written in FORTRAN77
  • Using structured meshes
  • Parts of the program:
    – Preprocessing, reading data
    – Main loop
        • Setup of the equation system
        • Preconditioning
        • Solving step
    – Postprocessing, writing data

Slide 23

Deciding for the Data Model

  • We will use domain decomposition
    – The domain is split into subdomains
    – Large topologies and large numbers of subdomains require a 3D domain decomposition
  • Each cell has at most 6 neighbours
    – Now there are 2 boundary types:
        • Physical boundary: the subdomain boundary is a domain boundary
        • Inner boundary: the subdomain boundary is the boundary to another subdomain
    – Data exchange between the subdomains is necessary → communication

Slide 24

Data Distribution

  • The domain is split by a simple algorithm
    – Make the subdomains equal-sized → minor load-balancing issue
  • Each subdomain is calculated by its own process (on its own processor)
  • Data needs to be distributed before calculating
    – One process reads all data
    – The data are then distributed to all the other processes (see the sketch below)
        • Bottleneck
        • Sequential part
    – A parallel read would have been better, but MPI-I/O was not available at that time
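A minimal sketch of the "one process reads, all others receive" pattern (illustrative routine and variable names, not the actual URANUS I/O):

      subroutine read_and_distribute(xloc, nglob, nloc)
      implicit none
      include 'mpif.h'
      integer :: nglob, nloc, rank, info
      real    :: xloc(nloc)
      real, allocatable :: xglob(:)

      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, info)
      if (rank .eq. 0) then
c        only rank 0 touches the file and holds the global field
         allocate(xglob(nglob))
c        ... read the whole field from file into xglob ...
      else
         allocate(xglob(1))
      endif

c     hand every process its contiguous piece; rank 0 is the
c     sequential bottleneck mentioned above
      call MPI_SCATTER(xglob, nloc, MPI_REAL,
     &                 xloc,  nloc, MPI_REAL, 0, MPI_COMM_WORLD, info)
      deallocate(xglob)
      return
      end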

Slide 25

Dynamic Data Structures

  • Pure FORTRAN77 is too static
    – the number of processors can vary from run to run, and the array sizes can vary even within the same case → dynamic data structures
        • use Fortran90 dynamic arrays, or
        • use all local memory on a PE for one huge FORTRAN77 array and set up your own memory management
  • The second method has a problem on SMPs and cc-NUMAs: we should only use as much memory as necessary

Slide 26

Inauguration of the Dynamic Data Structure

  • FORTRAN77

      common /geo/ x(0:n1m,0:n2m,0:n3m), y(0:n1m,0:n2m,0:n3m), z ...

  • Fortran90
    – Direct usage of dynamic (allocatable) arrays in a common block is not possible
    – Usage of Fortran90 pointers instead:

      common /geo/ x, y, z ...
      real, pointer :: x(:,:,:)
      real, pointer :: y(:,:,:)

    – Allocation and deallocation in the main program are necessary (see the sketch below)
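A minimal sketch of the corresponding allocation in the main program (the array names follow the slide; the sizes are fixed here only for illustration, in the real code they come from the input data):

      program uranus_main
      implicit none
      integer :: n1, n2, n3
      real, pointer :: x(:,:,:), y(:,:,:), z(:,:,:)
      common /geo/ x, y, z

c     local subdomain size (illustrative values)
      n1 = 64
      n2 = 32
      n3 = 32

      allocate( x(0:n1+1, 0:n2+1, 0:n3+1) )
      allocate( y(0:n1+1, 0:n2+1, 0:n3+1) )
      allocate( z(0:n1+1, 0:n2+1, 0:n3+1) )

c     ... preprocessing, main loop, postprocessing ...

      deallocate( x, y, z )
      end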

Slide 27

Hints for (Dynamic) Data Structure

  • Do not use global data structures
    – All data in a subroutine should be local
    – Data should be passed to the subroutine through the call (see the sketch below)
    – Better maintainability of the program
  • This was done in a complete re-engineering of the URANUS code in a later step
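A small illustration of the rule (hypothetical routines, not taken from URANUS):

c     discouraged: the routine reaches into global data
      subroutine timestep_global
      implicit none
      real, pointer :: u(:,:,:)
      common /flow/ u
c     ... works on the global field u ...
      end

c     preferred: everything the routine needs is passed in the call
      subroutine timestep(u, n1, n2, n3, dt)
      implicit none
      integer :: n1, n2, n3
      real    :: dt
      real    :: u(0:n1+1, 0:n2+1, 0:n3+1)
c     ... works only on its arguments ...
      end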

Slide 28

Main Loop (I)

  • Setup of the equation system
    – each cell has 6 neighbours
    – needs data from the neighbouring cells
    – 2 halo cells at the inner boundaries
    – special part for handling the physical boundaries

Slide 29

Numbering of the cells in the subdomains

Slide 30

Routine for data exchange (I)

      subroutine dqchange(dq)
c------------------------------------------------------------------------------
c
c     Subroutine for the boundary exchange of dq at all 6 boundaries,
c     exchange depth +-1
c
c------------------------------------------------------------------------------
      implicit none
      real dq(0:n1+1,0:n2+1,0:n3+1,nc)
      integer sendhandle(6), recvhandle(6)
      real, dimension (:), pointer :: sendbuf1, sendbuf2, recvbuf1, recvbuf2,
     .                                sendbuf3, sendbuf4, recvbuf3, recvbuf4,
     .                                sendbuf5, sendbuf6, recvbuf5, recvbuf6
      integer, dimension (MPI_STATUS_SIZE,6) :: sendstatusarray
      integer, dimension (MPI_STATUS_SIZE)   :: status
      integer :: uranus_packed

Slide 31

Routine for data exchange (II)

#ifdef TRACE
      write (6,*) mytid, ': subroutine dqchange'
#endif
c...Asynchronous receive of dq from the neighbouring processors
      if (sio.eq.0) then
         iu = n1+1
         io = n1+1
         ju = dmju
         jo = dmjo
         ku = dmku
         ko = dmko
         call prepare_receive(dq, 0, n1+1, 0, n2+1, 0, n3+1, nc, REALx,
     &        iu, io, ju, jo, ku, ko, epio, buffer, count2, uranus_packed)
         recvbuf2 => buffer
         call MPI_IRECV(recvbuf2, count2, uranus_packed, tid_io, 10001,
     &        MPI_COMM_WORLD, recvhandle(1), info)
      else
         recvhandle(1) = MPI_REQUEST_NULL
      endif

Slide 32

Routine for data exchange (III)

      if (siu.eq.0) then
         iu = 0
         io = 0
         ju = dmju
         jo = dmjo
         ku = dmku
         ko = dmko
         call prepare_receive(dq, 0, n1+1, 0, n2+1, 0, n3+1, nc, REALx,
     &        iu, io, ju, jo, ku, ko, epiu, buffer, count1, uranus_packed)
         recvbuf1 => buffer
         call MPI_IRECV(recvbuf1, count1, uranus_packed, tid_iu, 10001,
     &        MPI_COMM_WORLD, recvhandle(2), info)
      else
         recvhandle(2) = MPI_REQUEST_NULL
      endif

Slide 33

Routine for data exchange (IV)

c...Asynchronous send of dq to the neighbouring processors
      if (siu.eq.0) then
         iu = 1
         io = 1
         ju = dmju
         jo = dmjo
         ku = dmku
         ko = dmko
         call uranus_pack(dq, 0, n1+1, 0, n2+1, 0, n3+1, nc, REALx,
     &        iu, io, ju, jo, ku, ko, epiu, buffer, count1, uranus_packed)
         sendbuf1 => buffer
         call MPI_ISEND(sendbuf1, count1, uranus_packed, tid_iu, 10001,
     &        MPI_COMM_WORLD, sendhandle(1), info)
      else
         sendhandle(1) = MPI_REQUEST_NULL
      endif

Slide 34

Routine for data exchange (V)

      do 300 ni = 1, 6
         call MPI_WAITANY(6, recvhandle, handnum, status, info)
         if (handnum .eq. 1) then
            iu = n1+1
            io = n1+1
            ju = dmju
            jo = dmjo
            ku = dmku
            ko = dmko
            buffer => recvbuf2
            call uranus_unpack(dq, 0, n1+1, 0, n2+1, 0, n3+1, nc, REALx,
     &           iu, io, ju, jo, ku, ko, epio, buffer, count2, uranus_packed)
            call uranus_buffer_release
         else if (handnum .eq. 2) then

Slide 35

Routine for data exchange (VI)

         else if (handnum .eq. MPI_UNDEFINED) then
c           case (MPI_UNDEFINED)
            goto 400
         endif
 300  continue
 400  continue

      call MPI_WAITALL(6, sendhandle, sendstatusarray, info)
      buffer => sendbuf1
      call uranus_buffer_release
      ...
      return
      end

Slide 36

Changes in the loops

Loop over the whole domain in the sequential code:

      do 100 k = 0, n3+1
      do 100 j = 0, n2+1
      do 100 i = 0, n1+1
         vdt(i,j,k) = damax(i,j,k)
     .        * ( sqrt( u(i,j,k) * u(i,j,k)
     .        + v(i,j,k) * v(i,j,k)
     .        + w(i,j,k) * w(i,j,k)) + c(i,j,k) ) / cfl
 100  continue

The same loop restricted to the local subdomain bounds in the parallel code:

      do 100 k = dmku, dmko
      do 100 j = dmju, dmjo
      do 100 i = dmiu, dmio
         vdt(i,j,k) = damax(i,j,k)
     .        * ( sqrt( u(i,j,k) * u(i,j,k)
     .        + v(i,j,k) * v(i,j,k)
     .        + w(i,j,k) * w(i,j,k)) + c(i,j,k) ) / cfl
 100  continue

Slide 37

Changes for boundary handling

  • Boundary handling, only if there is a physical boundary

      if (rand_inflow) then
         do 100 n = 1, nc
         do 100 k = dmku, dmko
         do 100 j = dmju, dmjo
            rhs(0,j,k,n) = - ( q(0,j,k,n) - qzu(n) )
 100     continue
      endif

Slide 38

Main Loop (II)

  • Preconditioning
    – no neighbour information needed at all
    – completely done locally
    – each process calls the preconditioner with its local subdomain

Slide 39

Main Loop (III) - Solver

  • Sequential:

    $A\,\vec{u} = \vec{r}, \qquad \vec{u} = A^{-1}\,\vec{r}$

    – Jacobi line relaxation with subiterations
  • Parallelization problems:
    – the matrix A is distributed
    – matrix inversion is a priori not parallelizable
        • Gauss elimination is recursive
Slide 40

Heptadiagonal Matrix

[Matrix sketch: sparsity pattern of the heptadiagonal matrix, the main diagonal D plus six off-diagonals X from the couplings to the i±1, j±1 and k±1 neighbour cells]

Slide 41

Parallelization - Solver

    $A = M + L$

    $A\,\vec{u} = \vec{r}$

    $(M + L)\,\vec{u} = \vec{r}$

    $M\,\vec{u} = \vec{r} - L\,\vec{u}$

    $\vec{u}^{\,(j)} = M^{-1}\left(\vec{r} - L\,\vec{u}^{\,(j-1)}\right)$
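Written out per subdomain (the same splitting, only with block indices added for clarity), block $k$ needs only the previous iterate of its neighbouring blocks, which is exactly what the halo exchange provides:

$$M_k\,\vec{u}_k^{\,(j)} = \vec{r}_k - \sum_{k' \in \mathrm{neigh}(k)} L_{kk'}\,\vec{u}_{k'}^{\,(j-1)}$$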

Slide 42

Parallelization - Solver (III)

[Matrix sketch of the splitting A = M + L: M holds the tridiagonal entries (l_i, m_i, u_i) inside each diagonal block (one block per subdomain), while L holds the few coupling entries (here u_4, l_5, u_8, l_9) that connect neighbouring blocks]

Slide 43

A Real Heptadiagonal Matrix

[Matrix sketch: heptadiagonal matrix for a small structured mesh with the cells numbered 1-36, diagonal entries D plus off-diagonal couplings X to the neighbouring cells]

Slide 44

Difference between strong / weak coupling

  • Solver with weak coupling
    – extra computational effort due to an additional solving step (but no factorization)
    – additional update of the right-hand side
    – communication twice, once after each solving step
  • Solver with stronger coupling
    – Jacobi line relaxation method with subiterations
    – collapses the iterations from the line relaxation and the parallelization
    – no additional iterations
    – much more communication

Slide 45

Comparison of Solvers - Convergence

[Plot: L2 residual versus iterations for Solver 1 and Solver 2]

Slide 46

Comparison of Solvers - Performance

[Plot: time in seconds versus number of processors for the strong- and weak-coupling solvers over Ethernet and over the HPS interconnect]

Slide 47

Results - Solving Method

  • the presented solver with weak coupling works fine for these CFD problems
  • the solutions differ on the scale of one percent
  • the convergence rate is nearly equal to that of the sequential program
Slide 48

Domain Decomposition

Slide 49

Computation Time on Cray T3E

[Plot: computation time in seconds versus number of processors on the Cray T3E]

Slide 50

Speedup on Cray T3E

[Plot: speedup versus number of processors on the Cray T3E for problem sizes of 110592, 55296 and 27648 cells]

Slide 51

Scaleup on Cray T3E

[Plot: scaleup versus number of processors on the Cray T3E]

Slide 52

Results - Properties of the Parallel Program

  • very flexible
    – no recompilation necessary when changing the problem size or the number of processes
  • runs on workstation clusters and on supercomputers
  • large problems can be computed
  • there is no change in the program handling
Slide 53

Portability of Parallel Code

  • The application should run on every platform where Fortran 90 and MPI are available
  • Tested platforms:
    – Cray T3D, Cray T3E
    – Hitachi SR2201, Hitachi SR8000
    – IBM RS/6000 SP, IBM p690 (Regatta)
    – Intel Paragon
    – NEC SX-4, NEC SX-5, NEC SX-6, NEC SX-8
    – SGI Origin2000
    – Intel IA-64
    – Intel IA-32, AMD Opteron

Slide 54

Computed Solution and Visualization

Slide 55

Computed Solution and Visualization X-38

Slide 56

Conclusions

  • parallelization of a 3D flow solver
  • parallel program is portable and flexible
  • no loss in convergence speed due to the parallelization
  • speedup is normal for a flow solvers
  • scaleup is very good
Slide 57

Metacomputing

Slide 58

Metacomputer Setup

Slide 59

Metacomputer Setup: Network

Slide 60

Metacomputer Setup: Network Performance

Connection    Distance      Latency       Bandwidth
HLRS-CSAR     594 miles     20 ms         10 Mbit/s
PSC-CSAR      3594 miles    60 ms         1 Mbit/s
PSC-HLRS      4181 miles    80 ms         4 Mbit/s
HLRS-ETL      5870 miles    300-500 ms    1-2 Mbit/s

Slide 61

Metacomputing Setup: Computers

  • Hitachi SR8000 at TACC/Japan, 512 CPU/64nodes
  • Cray T3E-900 at PSC/U.S.A., 512 CPU
  • Cray T3E-1200 at CSAR/U.K., 576 CPU
  • Cray T3E-900 at HLRS/Germany, 512 CPU

==> Total performance of more than 2 TFlop/s

Slide 62

X-38 Calculated Solution

  • Mach 6
  • Angle of attack 40°
  • 888 832 control volumes
  • 600 iterations
  • Cray T3E: 101 minutes on 128 processors
  • Metacomputing:
    – 3.6 million cell mesh
    – 1536 processors
    – 3 T3Es
Slide 63

Migrating from a parallel single block to a parallel multiblock flow solver

Slide 64

Outline

  • Introduction to parallel URANUS
  • Why use multiblock meshes
  • Extending URANUS to use multiblock meshes
  • Results
  • Outlook
Slide 65

Sequential 2D/3D URANUS (Non Equilibrium Flows)

  • Cell-centered finite-volume approach
  • solving the unsteady, compressible Navier-Stokes equations
  • the implicit equation system is solved iteratively by Newton's method
  • two different limiters for second-order accuracy
  • CVCV multiple-temperature gas-phase model
  • Chapman-Cowling transport coefficient models
  • Gas-kinetic gas-surface model with different catalysis models
  • PARADE/HERTA gas-radiation coupling
Slide 66

Parallelization

  • domain decomposition

– with two halo cells at the subdomain boundaries

  • dynamic data structures using Fortran90
  • special solver
  • execution model SPMD
  • message-passing with MPI
  • still working only on C-meshes
Slide 67

Why Using Multiblock Meshes

  • There are topologies which cannot be meshed, or which are hard to mesh, with a C-mesh
  • The singularity and sometimes heavily distorted mesh cells limit the convergence rate
  • Using unstructured meshes would mean rewriting the code
  • Obtaining performance on current supercomputers is easier with structured meshes
  • Using multiblock meshes:
    – meshing of complex topologies is possible
    – structured blocks
    – performance is easier to obtain

Slide 68

A Multiblock Mesh for X-38 (I)

Slide 69

A Multiblock Mesh for X-38 (II)

Slide 70

Characteristics of Multiblock Meshes

  • Each block may have a local coordinate system which differs from that of its neighbours
  • A block may have one, two or more neighbours on one block side
  • Physical boundaries may occur on each block side
  • Blocks generally have different sizes

→ the program must be able to handle all of this

Slide 71

Extensions Necessary to Handle Multiblock Meshes

  • Handling of the block-internal orientation
  • Handling of more complex neighbour dependencies
  • Handling of physical boundaries at each block side
  • Load balancing:
    – handling of multiple blocks on one processor
    – automatic block splitting
    – use of a load balancer for the block distribution
Slide 72

Axis Orientation: Different Local Coordinate Systems

  • Reason: neighbouring blocks may use differently oriented local (ξ, η) coordinate systems
  • Solution: changing the storage order according to the difference during the communication (currently on the sender side)

Slide 73

Neighbour Dependencies - Occurrence

  • A block may have more than one neighbour at one blockside
Slide 74

Neighbour Dependencies - Handling

  • Up to now only one neighbour block at each side
  • How to handle several neighbours at each block side:
  • Replacement of the whole communication structure
  • New data structure to store all communication-specific parameters, such as:
    – host process of the neighbouring block
    – orientation of the neighbour
    – identifier of the neighbouring side that this (part of the) side is connected to
    – specification of the side (cell subscripts) at the block side
  • Extension of the communication routines
Slide 75

Physical Boundaries

  • C-mesh:
    – a specific physical boundary type is bound to a specific block side
  • Physical boundaries in multiblock meshes can occur on all of the block sides

[Figure: two blocks (Block 1 and Block 2) adjacent to the body, with physical boundaries on different block sides]

Slide 76

Efficient Calculation of Boundaries (I)

  • Special data structure for each boundary type:
    – location of each boundary
    – subtype of the boundary
  • Only one code segment for boundary handling
    – no duplication of the code for each side
        • one code to update and maintain
        • no cut-and-paste bugs
  • No branches
    – chance of a performance improvement

Slide 77

Efficient Calculation of Boundaries (II)

      do m = 0, aktblock%inflowvek(1)-1
        do n = 1, nc
          do k = aktblock%inflowvek(m*17+14), aktblock%inflowvek(m*17+15),
     .           aktblock%inflowvek(m*17+17)
            do j = aktblock%inflowvek(m*17+9), aktblock%inflowvek(m*17+10),
     .             aktblock%inflowvek(m*17+12)
              do i = aktblock%inflowvek(m*17+4), aktblock%inflowvek(m*17+5),
     .               aktblock%inflowvek(m*17+7)
                rhs(i,j,k,n) = - ( q(i,j,k,n) - qzu(n) )
              end do
            end do
          end do
        end do
      end do

Slide 78

Load Balancing

  • Target: efficient use of massively parallel processors
  • Blocks have different sizes
  • The number of blocks generally differs from the number of processors used
  • Initial load balancing is necessary
  • Problems to solve:
    – There are blocks which are too large to be calculated efficiently on one processor
        • block splitting necessary
    – There are blocks which are too small to be calculated alone on one processor
        • a process should be able to calculate more than one block at a time

Slide 79

Extensions for block handling

  • Different numbers of blocks on a process
    – Extension of the subroutines and algorithms
        • outer block loops around the subroutines
    – Communication between blocks on one process is also done using MPI
    – Extension of the communication structure, so that each incoming message reaches its block
  • Blocks on a process may have different sizes
    – how shall we store them?
Slide 80

New Data Structure

  • New dynamic data structure (a sketch follows below)
    – dynamic linked list of blocks
    – a block may contain different meshes with different resolutions (prepared for multigrid methods and adaptive mesh refinement)
    – all information regarding a block is stored in its data structure, e.g. neighbours, physical boundaries, the local Jacobian matrix, message handles, ...; the result vector and the geometry information are bound to the mesh
    – the number of blocks per process is only limited by the available memory
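A minimal Fortran 90 sketch of such a linked list of blocks (the component names are illustrative, not the actual URANUS declarations):

      module block_list
        implicit none

        type mesh_t
          integer :: n1, n2, n3                    ! resolution of this mesh level
          real, pointer :: x(:,:,:)                ! geometry
          real, pointer :: q(:,:,:,:)              ! result vector on this mesh
        end type mesh_t

        type block_t
          integer :: id
          type(mesh_t), pointer :: meshes(:)       ! several resolutions (multigrid, AMR)
          integer, pointer :: neighbours(:)        ! neighbour / boundary descriptors
          integer, pointer :: msg_handles(:)       ! outstanding message handles
          type(block_t), pointer :: next           ! next block owned by this process
        end type block_t

        type(block_t), pointer :: first_block      ! head of the per-process block list
      end module block_list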

Slide 81

Block Splitting

  • Blocks which are too large for one process are split automatically:
    – with a side proportion of 2:1:1
    – only powers of two and three are used as side dividers
  • The update information for the data transmission between the new neighbours is handled by the program (even the case shown on the left can be handled)

Slide 82

Load Balancing Step

  • The (parallel) graph partitioner jostle is used to distribute the obtained blocks to the available processors
  • A graph is generated from the block distribution with:
    – nodes representing the blocks
    – node weights representing the block sizes (computational effort)
    – edges representing the neighbour dependencies between the blocks
  • The blocks are redistributed according to jostle's suggestion

(A sketch of assembling such a graph follows below.)
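A sketch of how such a graph could be assembled into compressed (CSR-style) arrays before it is handed to the partitioner (generic arrays and names; the actual jostle calling interface is not shown here):

      subroutine build_block_graph(nblocks, maxneigh, ncells, nneigh,
     &                             neighlist, vwgt, xadj, adjncy)
      implicit none
      integer :: nblocks, maxneigh
      integer :: ncells(nblocks)               ! cells per block
      integer :: nneigh(nblocks)               ! number of neighbour blocks
      integer :: neighlist(maxneigh, nblocks)  ! ids of the neighbour blocks
      integer :: vwgt(nblocks)                 ! node weight = computational effort
      integer :: xadj(nblocks+1)               ! start of each block's edge list
      integer :: adjncy(*)                     ! edges = neighbour dependencies
      integer :: b, j, pos

      pos = 1
      do b = 1, nblocks
         vwgt(b) = ncells(b)
         xadj(b) = pos
         do j = 1, nneigh(b)
            adjncy(pos) = neighlist(j, b)
            pos = pos + 1
         end do
      end do
      xadj(nblocks+1) = pos
      return
      end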
Slide 83

Example Block Distribution (I)

  • X-38 mesh:
    – 6 blocks
    – 68712 cells
    – largest block: 32256 cells
    – smallest block: 1176 cells

Slide 84

Example Block Distribution (II)

  • The same mesh with the blocks cut automatically:
    – 9 processes
    – 11 blocks

Slide 85

Example Block Distribution (III)

  • Blocks distributed to the processors:
    – largest: 8064 cells
    – smallest: 6048 cells
    – load imbalance: 18 %

Slide 86

Results

  • Solution (X-38 NG):
    – Mach 19.8
    – angle of attack 40°

Slide 87

Results - Speedup

[Plot: speedup versus number of processors]

Slide 88

Results Parallelization

  • The block handling is in principle totally decoupled from the calculation of the simulation
  • In principle the whole block handling can be used to parallelize any sequential program which uses structured (multiblock) meshes
  • In some cases even the solver can be adapted or used directly
  • Thanks to the advanced data structure there are many cases where it should fit (adaptive mesh refinement, multigrid approaches, load balancing)
  • All this comes for free
Slide 89

Summary: Parallel 3D-Multiblock URANUS

  • Portable data-parallel simulation program
    – Fortran90 (dynamic data structures)
    – message passing using MPI
  • Domain decomposition based on structured multiblock meshes
  • Different index directions within blocks
  • Physical and inner boundaries on all block sides
  • Different numbers of neighbours possible on each block side
  • Handling of different block sizes (automatic initial block distribution)
  • Blocks not fitting on one process (load imbalance) are split automatically
  • The number of blocks on each process is only limited by the memory