SLIDE 1

Simple Steps for Parallelizing a FORTRAN Code Using Message Passing Interface (MPI)

Justin L. Morgan and Jason B. Gilbert

Department of Aerospace Engineering, Auburn University

Why Parallelize?

  • To decrease the overall computation time of a job.

  • To decrease the per-processor memory usage.

  • As William Gropp states in Using MPI, “To pull a bigger wagon, it is easier to add more oxen than to grow a gigantic ox.”

SLIDE 2

Physical Problem Formulation

  • Determine the temperature distribution in a flat plate with 300 K applied to three edges and 500 K applied to the fourth edge.

Governing Equation

  • Conservation of Energy (Differential Conservation Form)

  • Assumptions Made
      • Front and Back Faces are Perfectly Insulated
      • Steady Conditions
      • No Energy Transformation

$$\dot{E}_{st} = \dot{E}_{in} + \dot{E}_{g} - \dot{E}_{out}$$
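Under the stated assumptions the storage and generation terms drop out, and the balance reduces to the two-dimensional Laplace equation that the next slide discretizes:

$$\frac{\partial^2 T}{\partial x^2} + \frac{\partial^2 T}{\partial y^2} = 0$$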

SLIDE 3

Discretization

  • Point Jacobi Method

  • Iteratively solve for $T^{k+1}_{i,j}$:

$$\frac{T^{k}_{i-1,j} - 2T^{k+1}_{i,j} + T^{k}_{i+1,j}}{(\Delta x)^2} + \frac{T^{k}_{i,j-1} - 2T^{k+1}_{i,j} + T^{k}_{i,j+1}}{(\Delta y)^2} \cong 0$$

Implementation in FORTRAN

  • Dimension Arrays

  • Set Initial and Boundary Conditions

  • Begin Iterative Process

  • Monitor Convergence (see the sketch below)
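A minimal serial sketch of these four steps (grid size, spacing, tolerance, and the iteration cap are illustrative assumptions, not values from the presentation):

    program jacobi_serial
      implicit none
      integer, parameter :: nx = 102, ny = 102, kmax = 100000
      double precision, parameter :: tol = 1.0d-6
      double precision :: T(nx,ny), Tnew(nx,ny), dx2, dy2, eps
      integer :: i, j, k

      ! dimension arrays; set initial and boundary conditions
      T = 300.0d0              ! 300 K interior guess and three edges
      T(:,ny) = 500.0d0        ! 500 K applied to the fourth edge
      dx2 = 1.0d0              ! (delta x)**2, illustrative
      dy2 = 1.0d0              ! (delta y)**2, illustrative
      Tnew = T

      ! begin iterative process and monitor convergence
      do k = 1, kmax
         eps = 0.0d0
         do j = 2, ny-1
            do i = 2, nx-1
               ! Point Jacobi update: neighbors at iteration k give T at k+1
               Tnew(i,j) = ((T(i-1,j) + T(i+1,j))/dx2    &
                          + (T(i,j-1) + T(i,j+1))/dy2)   &
                          / (2.0d0/dx2 + 2.0d0/dy2)
               eps = max(eps, abs(Tnew(i,j) - T(i,j)))
            end do
         end do
         T = Tnew
         if (eps < tol) exit   ! converged
      end do
      print *, 'iterations:', k, '  max change:', eps
    end program jacobi_serial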

SLIDE 4

Results

  • Iterative Convergence

$$\varepsilon_{i,j} = T^{k+1}_{i,j} - T^{k}_{i,j}$$

$$L_2\,\mathrm{norm} = \left( \frac{\displaystyle\sum_{i=2}^{i_{\max}-1} \sum_{j=2}^{j_{\max}-1} \varepsilon_{i,j}^{\alpha}}{N} \right)^{1/\alpha}, \qquad \alpha = 2$$
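As a sketch, this norm (with $\alpha = 2$) can be accumulated over the interior nodes as follows; the array names and bounds are assumed to match the serial sketch on slide 3:

    ! L2 norm of the iteration-to-iteration change, alpha = 2
    function l2_change(T, Tnew, imax, jmax) result(l2norm)
      implicit none
      integer, intent(in) :: imax, jmax
      double precision, intent(in) :: T(imax,jmax), Tnew(imax,jmax)
      double precision :: l2norm
      integer :: i, j
      l2norm = 0.0d0
      do j = 2, jmax-1
         do i = 2, imax-1
            ! eps(i,j) = T(k+1) - T(k) at each interior node
            l2norm = l2norm + (Tnew(i,j) - T(i,j))**2
         end do
      end do
      ! divide by N, the number of interior nodes, then take the 1/alpha root
      l2norm = sqrt(l2norm / dble((imax-2)*(jmax-2)))
    end function l2_change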

Results

  • Temperature Distribution

SLIDE 5

Code Verification

  • Method of Manufactured Solutions (MMS)

$$\tilde{T} = C_1 + C_2 \sin\!\left(\frac{\pi x}{a_1}\right) + C_3 \cos\!\left(\frac{\pi y}{a_2}\right) + C_4 \sin\!\left(\frac{\pi x y}{a_3}\right)$$

$$\frac{\partial^2 \tilde{T}}{\partial x^2} + \frac{\partial^2 \tilde{T}}{\partial y^2} = -\left[ C_2\!\left(\frac{\pi}{a_1}\right)^{2} \sin\!\left(\frac{\pi x}{a_1}\right) + C_4\!\left(\frac{\pi y}{a_3}\right)^{2} \sin\!\left(\frac{\pi x y}{a_3}\right) \right] - \left[ C_3\!\left(\frac{\pi}{a_2}\right)^{2} \cos\!\left(\frac{\pi y}{a_2}\right) + C_4\!\left(\frac{\pi x}{a_3}\right)^{2} \sin\!\left(\frac{\pi x y}{a_3}\right) \right] = f(x, y)$$
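The manufactured residual $f(x, y)$ is then carried along as a source term: the code is run on

$$\frac{\partial^2 T}{\partial x^2} + \frac{\partial^2 T}{\partial y^2} = f(x, y)$$

with boundary values taken from $\tilde{T}$, so the exact solution is known at every node and the discretization error below can be evaluated directly.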

Code Verification

  • Discretization Error (DE)

The numerical solution $\tilde{T}_{\mathrm{NUMERICAL}}$ comes from the same Point Jacobi update applied to the manufactured problem:

$$\tilde{T}^{\,k+1}_{i,j} \cong \frac{\dfrac{\tilde{T}^{k}_{i-1,j} + \tilde{T}^{k}_{i+1,j}}{(\Delta x)^2} + \dfrac{\tilde{T}^{k}_{i,j-1} + \tilde{T}^{k}_{i,j+1}}{(\Delta y)^2} - f_{i,j}}{\dfrac{2}{(\Delta x)^2} + \dfrac{2}{(\Delta y)^2}}$$

$$DE = \tilde{T}_{\mathrm{NUMERICAL}} - \tilde{T}_{\mathrm{EXACT}}$$

SLIDE 6

Code Verification

  • Discretization Error (DE)

    Mesh Nodes    Maximum DE (K)
    10 x 10            13.00
    25 x 25             1.30
    50 x 50             0.34
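As a quick consistency check on the table: from the 25 x 25 to the 50 x 50 mesh the spacing halves and the maximum DE drops by a factor of $1.30/0.34 \approx 3.8$, so the observed order of accuracy is roughly

$$\hat{p} = \frac{\ln(1.30/0.34)}{\ln 2} \approx 1.9,$$

close to the formal second order of the central-difference scheme discussed next.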

Code Verification

  • Global Discretization Error

  • Formal Order of Accuracy

  • Observed Order of Accuracy

[Figure: log-log plot of the $L_2$ norm of the DE versus normalized grid spacing $h$, with a second-order slope shown for reference.]

$$h_k = \frac{\Delta x_k}{\Delta x_1} = \frac{\Delta y_k}{\Delta y_1}$$

SLIDE 7

Parallelization

Domain Decomposition for 2 Processors

[Figure: schematic of the grid split between two processors.]

  • Blue box represents information to be passed between processors after each iteration.

  • Red boxes are fixed boundary conditions.

  • Green boxes include the grid points that are initially sent to each processor.

Parallel Code Structure

  • The code is divided into three main sections: the portion performed by all processors, the portion performed by the master processor, and the portion performed by the slave processors.

    All Processors
      Declare Variables
      Dimension Arrays
      INCLUDE 'MPIF.H'
      Initialize MPI
      If I am master then
        ...
      Else (slave processors)
        ...
      End If

  • MPIF.H is the MPI header file; including it declares the constants and interfaces the code needs to call the MPI library.
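A minimal Fortran rendering of the outline above; the master and slave bodies stay elided, as in the outline:

    program heat_mpi
      implicit none
      include 'mpif.h'
      integer :: myid, numprocs, ierr

      ! every processor initializes MPI and learns its rank and the pool size
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, numprocs, ierr)

      if (myid == 0) then
         ! master: apply initial/boundary conditions, decompose and send the grid
      else
         ! slaves: receive a grid piece, iterate, send results back
      end if

      call MPI_FINALIZE(ierr)
    end program heat_mpi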

SLIDE 8

Parallel Code Structure

  • The job of the master processor is to initialize the grid with initial and boundary conditions, then decompose it and send each processor the information it needs.

  • Each slave processor receives its initial grid from the master node and begins to perform calculations. After each iteration, individual processors must pass the first and last columns of their respective grids to neighboring processors to update the boundary values (sketched below).

  • The slave processors iterate until an acceptable convergence has been reached and then send the new temperature values back to the master processor to reassemble the grid.
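A sketch of the per-iteration column exchange described above, using the blocking MPI_SEND/MPI_RECV calls listed under MPI Functions below. The 1-D decomposition by columns, the ghost columns 0 and jloc+1, and the tag value are illustrative assumptions; the presentation's actual routine may differ. In Fortran a column T(:,j) is contiguous in memory, so it can be passed directly as the buffer, and because the end ranks skip the missing send and post their receive, the blocking chain completes without deadlock:

    subroutine exchange_columns(T, imax, jloc, myid, numprocs)
      implicit none
      include 'mpif.h'
      integer, intent(in) :: imax, jloc, myid, numprocs
      ! owned columns 1..jloc; ghost columns 0 and jloc+1
      double precision, intent(inout) :: T(imax, 0:jloc+1)
      integer :: ierr, status(MPI_STATUS_SIZE)
      integer, parameter :: tag = 1

      ! pass the first owned column left; fill the right ghost column
      if (myid > 0)          call MPI_SEND(T(1,1), imax,               &
          MPI_DOUBLE_PRECISION, myid-1, tag, MPI_COMM_WORLD, ierr)
      if (myid < numprocs-1) call MPI_RECV(T(1,jloc+1), imax,          &
          MPI_DOUBLE_PRECISION, myid+1, tag, MPI_COMM_WORLD, status, ierr)

      ! pass the last owned column right; fill the left ghost column
      if (myid < numprocs-1) call MPI_SEND(T(1,jloc), imax,            &
          MPI_DOUBLE_PRECISION, myid+1, tag, MPI_COMM_WORLD, ierr)
      if (myid > 0)          call MPI_RECV(T(1,0), imax,               &
          MPI_DOUBLE_PRECISION, myid-1, tag, MPI_COMM_WORLD, status, ierr)
    end subroutine exchange_columns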

MPI Functions

  • MPI Functions Called By All Processors
      • MPI_INIT(IERR)
      • MPI_FINALIZE(IERR)
      • MPI_COMM_RANK(MPI_COMM_WORLD, MYID, IERR)
      • MPI_COMM_SIZE(MPI_COMM_WORLD, NUMPROCS, IERR)

  • MPI Communication Operations
      • MPI_SEND(BUFFER, COUNT, DATATYPE, DESTINATION, TAG, MPI_COMM_WORLD, IERR)
      • MPI_RECV(BUFFER, COUNT, DATATYPE, SOURCE, TAG, MPI_COMM_WORLD, STATUS, IERR)

SLIDE 9

MPI Functions

  • MPI_INIT: Initializes the MPI environment. Can be called only once in a given code.

  • MPI_FINALIZE: Closes MPI. Once MPI_FINALIZE is called, MPI cannot be restarted.

  • MPI_COMM_RANK: Assigns each processor a unique integer identifier.

  • MPI_COMM_SIZE: Returns the number of processors available.

  • MPI_SEND: Standard MPI send operation.

  • MPI_RECV: Standard MPI receive operation.

Parallel Performance

  • Timing
      • To gauge the performance of the parallelized program, a built-in MPI timing routine was used: MPI_WTIME().
      • MPI_WTIME is called like any other MPI function except that it has no arguments. It returns a double-precision floating-point number that is the time in seconds since some arbitrary time in the past.

      • The timing routine is called once by the master node at the beginning of the program and again after the solution has converged and the grid has been reassembled. The difference between the two values returned is the total runtime of the program (see the sketch below).
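A sketch of that timing pattern as it might slot into the skeleton from slide 7 (t_start and t_end are illustrative names, declared as double precision):

    ! in the declarations:  double precision :: t_start, t_end
    if (myid == 0) t_start = MPI_WTIME()

    ! ... decompose the grid, iterate to convergence, reassemble ...

    if (myid == 0) then
       t_end = MPI_WTIME()
       print *, 'total runtime (s):', t_end - t_start
    end if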

SLIDE 10

Parallel Performance

  • Parallel Speedup
      • Parallel speedup is a measure of the performance of parallel programs: $t_1$ is the time for 1 processor to finish, and $t_n$ is the time for $n$ processors.

$$\mathrm{Speedup} = \frac{t_1}{t_n}$$

  • Amdahl's Law
      • Theoretical maximum speedup of a parallel program: $P$ is the parallel fraction, $S$ is the serial fraction, and $N$ is the number of processors.
      • If 100% of the code is parallelized ($S = 0$, $P = 1$), the maximum speedup is $N$, the number of processors.

$$\mathrm{MaxSpeedup} = \frac{1}{\dfrac{P}{N} + S}$$
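To see how quickly the serial fraction bites (illustrative numbers, not from the presentation): with $S = 0.05$, $P = 0.95$, and $N = 50$ processors,

$$\mathrm{MaxSpeedup} = \frac{1}{0.95/50 + 0.05} \approx 14.5,$$

far short of the ideal factor of 50, which is consistent with the measured speedups flattening out in the results that follow.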

Results

[Figure: speedup versus number of processors for the 102x102 and 1002x102 node meshes, compared with the theoretical maximum speedup.]

SLIDE 11

Results

  • The results for speedup demonstrate the increasing influence of communication time on the overall program runtime.

  • As the amount of calculation each processor has to do decreases, the ratio of communication time to processing time increases. This causes the speedup to fall off as more processors are added to smaller jobs.

  • For example, two processors operating on the 102x102 node mesh have 50 columns of data apiece to perform calculations on, and there are only two sends and two receives per iteration.

  • Fifty processors operating on the same mesh have only two columns apiece to operate on, and there are 98 sends and 98 receives per iteration.

  • These send/receive operations begin to take longer to complete than the calculations being performed, and the processors become “starved” for data.

Results

This figure shows the dramatic decrease in runtime using 1, 2, 5, 10, 25, and 50 processors for the 1002x102 node mesh.

[Figure: runtime in seconds versus number of processors for the 1002x102 node mesh.]

SLIDE 12

Parallel Comparison

  • Temperature Distribution Compared between Parallelized and Serial Codes

  • Zero Difference at Every Node