
Jun 20, 2023


SLIDE 1

Speeding Up Reactive Transport Code Using OpenMP

By Jared McLaughlin

OpenMP

A standard for parallelizing Fortran and C/C++ on shared memory systems
Minimal changes to sequential code required
Incremental parallelization
Works with both OpenMP‐compliant and normal compilers
– Directives start with the !$OMP sentinel, which a compiler without OpenMP support reads as a comment
No message passing between processors
Fine‐ and coarse‐grained parallelism
– Do Loop
– Sections

SLIDE 2

Threads

At the start of a parallel region the master thread forks a team of threads, and the team is joined again at the end of the region. The thread that forks is called the master, while the other threads are the workers.

What is a thread?

Parallel Region Constructor

!$OMP PARALLEL clause1 clause2 …
  … parallel code is placed here …
!$OMP END PARALLEL

Optional clauses include

PRIVATE (list)
SHARED (list)
DEFAULT (PRIVATE | SHARED | NONE)
FIRSTPRIVATE (list)
REDUCTION (operator:list)
IF (scalar logical expression)
NUM_THREADS (scalar integer expression)

SLIDE 3

!$OMP DO clause1 clause2 …
DO i=1, N
  … parallel code is placed here …
END DO
!$OMP END DO end_clause

Optional clauses include

PRIVATE (list)
FIRSTPRIVATE (list)
LASTPRIVATE (list)
REDUCTION (operator:list)
SCHEDULE (type, chunk)

Work‐Sharing Constructs

!$OMP SECTIONS clause1 clause2 …
!$OMP SECTION
  … parallel code is placed here …
!$OMP SECTION
  … parallel code is placed here …
!$OMP END SECTIONS end_clause

Optional clauses include

PRIVATE (list)
FIRSTPRIVATE (list)
LASTPRIVATE (list)
REDUCTION (operator:list)

!$OMP SINGLE clause1 clause2 …
  …
!$OMP END SINGLE end_clause

Optional clauses include

PRIVATE (list)
FIRSTPRIVATE (list)

!$OMP WORKSHARE
  …
!$OMP END WORKSHARE end_clause

Clauses

Shared(list)
– The same variable location is available to all threads; it exists before and after the parallel region
– Must check that no race conditions occur

Private(list)
– Each thread has its own copy of the variable, considered to be local to that parallel construct
– Private variables have to be initialized inside the parallel region and are considered to be undefined outside of that region
– Do loop counters are always private

Default(NONE|SHARED|PRIVATE)
– Any unstated variables can be defaulted to shared or private
– NONE says all variables must be declared in the shared or private clauses

SLIDE 4

Clauses

FIRSTPRIVATE (list) – gives the private variable an initial value equal to that of the original variable when entering the parallel region

LASTPRIVATE (list) – gives the original variable, on exit, the value from the last iteration or final section

REDUCTION (operator:list) – ensures that a shared variable location is written to by one thread at a time; each thread has a private copy of the shared variable, and the copies are combined into the shared variable at the end of the parallel region; operators include +, *, ‐, .AND., .OR., .EQV., .NEQV., MAX, MIN, IAND, IOR, and IEOR

IF (scalar logical expression) – allows the parallel region to be run sequentially if the expression is false

NUM_THREADS (scalar integer expression) – sets the number of threads the region is forked into (optional; not needed)
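The idea behind REDUCTION can be mimicked outside OpenMP. The sketch below is a Python analogy, not OpenMP code and not the author's implementation: each worker accumulates into its own private total, and the partial totals are combined once at the end. The function names and the chunking are hypothetical choices made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sum(chunk):
    """One worker's share: accumulate into a private total."""
    total = 0.0  # private copy, initialized to the identity of "+"
    for value in chunk:
        total += value
    return total


def reduced_sum(values, n_threads=4):
    """Split the work, sum privately, then combine the partial results once."""
    size = max(1, -(-len(values) // n_threads))  # ceiling division
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        partials = list(pool.map(partial_sum, chunks))
    return sum(partials)  # the single combining step at the end


if __name__ == "__main__":
    print(reduced_sum(list(range(1, 101))))
```

Because no worker ever writes to a shared accumulator, there is no race on the total; that is the property the REDUCTION clause provides for the listed variables.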

Clauses

SCHEDULE (type, chunk) – type can be static, dynamic, or guided; the choice helps determine the efficiency of the code

Static – divides the iterations between the threads once, at the beginning; if a chunk size is set, the last thread may have a different number of iterations than the others; offers the best performance if all the iterations require the same computational time

Dynamic – each thread is given a small amount of work, the size of chunk, and is given more when it is done; if the chunk is not specified the default is one; handing out work at run time increases overhead

Guided – a combination of the two: large loads are handed out at first, followed by smaller loads that decrease exponentially

NOWAIT – an end_clause that lets the threads continue on to the next work‐sharing region instead of waiting at the end of the current one; without this clause there is an implied barrier where all the threads catch up and synchronize with each other
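The three schedule types can be compared with a small model. The following Python sketch is a simplified simulation, not the behaviour of any particular OpenMP runtime: the round‐robin static assignment, the "next free thread takes the next chunk" dynamic rule, and the guided chunk formula (remaining iterations divided by the number of threads) are all modelling assumptions, and every function name is hypothetical.

```python
import heapq
from math import ceil


def static_schedule(n_iter, n_threads, chunk=None):
    """Per-thread lists of iteration indices for a static schedule."""
    if chunk is None:
        chunk = max(1, ceil(n_iter / n_threads))  # one block per thread
    assignment = [[] for _ in range(n_threads)]
    for k, start in enumerate(range(0, n_iter, chunk)):
        stop = min(start + chunk, n_iter)
        assignment[k % n_threads].extend(range(start, stop))  # round-robin
    return assignment


def static_finish_time(costs, n_threads, chunk=None):
    """Time at which the slowest thread finishes under a static schedule."""
    plan = static_schedule(len(costs), n_threads, chunk)
    return max(sum(costs[i] for i in idxs) for idxs in plan)


def dynamic_finish_time(costs, n_threads, chunk=1):
    """Each chunk goes to whichever thread becomes free first."""
    free_at = [0.0] * n_threads  # a list of equal values is a valid heap
    for start in range(0, len(costs), chunk):
        t = heapq.heappop(free_at)
        heapq.heappush(free_at, t + sum(costs[start:start + chunk]))
    return max(free_at)


def guided_chunk_sizes(n_iter, n_threads, min_chunk=1):
    """Chunk sizes that start large and shrink as work runs out."""
    sizes, remaining = [], n_iter
    while remaining > 0:
        size = min(remaining, max(min_chunk, ceil(remaining / n_threads)))
        sizes.append(size)
        remaining -= size
    return sizes


if __name__ == "__main__":
    uneven = [4, 1, 1, 1, 1]  # hypothetical per-iteration costs
    print(static_finish_time(uneven, 2), dynamic_finish_time(uneven, 2))
    print(guided_chunk_sizes(100, 4))
```

With uneven iteration costs the dynamic model finishes earlier than the static one, while with equal costs the static split is already balanced and avoids the hand‐out overhead, which this model does not charge for.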

SLIDE 5

Van der Pas, R. (2005, June 1‐4). An Introduction into OpenMP. Presented at the University of Oregon.

REACTION TRANSPORT MODELING

SLIDE 6

Performance Analysis

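Note: debug versus release mode

Governing Equation:

$$\frac{\partial C}{\partial t} = -v \frac{\partial C}{\partial x} + D \frac{\partial^2 C}{\partial x^2} + SS + \text{reactions}$$

Operator Split:

$$\frac{\partial C}{\partial t} = -v \frac{\partial C}{\partial x}$$

$$\frac{\partial C}{\partial t} = D \frac{\partial^2 C}{\partial x^2}$$

$$\frac{\partial C}{\partial t} = SS$$

$$\frac{\partial C}{\partial t} = \text{reactions}$$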
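Operator splitting advances the concentration through each sub‐equation in turn within one time step. The Python sketch below illustrates the idea only. It is not the author's solver: the explicit upwind and central‐difference schemes, the first‐order decay used as the reaction, the crude boundary handling, and all parameter values are assumptions, and the source/sink step is left out. The explicit dispersion step is stable only for D∆t/∆x² ≤ 0.5, so the parameters from the 100 species problem would not work with it.

```python
import math


def advect(c, v, dx, dt):
    """Explicit upwind step for dC/dt = -v dC/dx (v >= 0, Courant <= 1)."""
    cr = v * dt / dx
    new = c[:]  # new[0] keeps the inflow concentration
    for i in range(1, len(c)):
        new[i] = c[i] - cr * (c[i] - c[i - 1])
    return new


def disperse(c, d, dx, dt):
    """Explicit central step for dC/dt = D d2C/dx2 (needs D*dt/dx**2 <= 0.5)."""
    r = d * dt / dx ** 2
    new = c[:]  # end nodes are left unchanged
    for i in range(1, len(c) - 1):
        new[i] = c[i] + r * (c[i + 1] - 2.0 * c[i] + c[i - 1])
    return new


def react(c, k, dt):
    """Exact step for a first-order decay dC/dt = -k C at every node."""
    factor = math.exp(-k * dt)
    return [ci * factor for ci in c]


def split_step(c, v, d, k, dx, dt):
    """One operator-split time step: advection, then dispersion, then reaction."""
    return react(disperse(advect(c, v, dx, dt), d, dx, dt), k, dt)


if __name__ == "__main__":
    column = [1.0] + [0.0] * 9  # hypothetical initial profile
    for _ in range(5):
        column = split_step(column, v=1.0, d=0.1, k=0.01, dx=1.0, dt=0.5)
    print(column)
```

In this layout the reaction step touches each node independently, and with many species the transport steps can be done one species at a time, which is where loop‐level parallelism fits.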

SLIDE 7

100 Species in 1D Column after 40 years

100 Species Problem Specifics

Parallelized in two places
– Advection‐Dispersion Equation: parallel Do‐Loop with species iterations split between threads
– Reactions: parallel Do‐Loop with node iterations split between threads, the same way it is done in RT3D

Results presented from Debug mode runs

Simulation Time (yr)                  40
Length (m)                            2000
Velocity (m/yr)                       5
∆x (m)                                1
∆t (yr)                               0.1
Dispersion coefficient, Dx (m^2/yr)   50
Courant                               0.5
Peclet                                0.1
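The Courant and Peclet entries can be checked from the other parameters. The slides do not state the definitions, so the sketch below assumes the usual grid forms, Courant = v∆t/∆x and Peclet = v∆x/Dx, with ∆x in metres and ∆t in years; these do reproduce the listed values. The function names are hypothetical.

```python
def courant_number(v, dt, dx):
    """Fraction of a cell the flow crosses in one time step (assumed definition)."""
    return v * dt / dx


def grid_peclet_number(v, dx, d):
    """Ratio of advection to dispersion at the grid scale (assumed definition)."""
    return v * dx / d


# Values from the problem specifics
v, dx, dt, d = 5.0, 1.0, 0.1, 50.0
length, sim_time = 2000.0, 40.0

n_cells = round(length / dx)   # number of cells along the column
n_steps = round(sim_time / dt)  # number of time steps in the run

if __name__ == "__main__":
    print(courant_number(v, dt, dx), grid_peclet_number(v, dx, d))
    print(n_cells, n_steps)
```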

SLIDE 8

Timing ‐ (Static Scheduling)

Threads                               1            2            3            4
Program Run Time              172.77765     88.43443     62.32897    49.229939
Program Speedup                       -     1.953737     2.772028    3.5096052
Program Efficiency                    -     0.976869     0.924009    0.8774013
Reaction Run Time            134.536829     66.42403     44.81983    35.355174
Reaction Speedup                      -     2.025424     3.001726   3.80529393
Reaction Efficiency                   -     1.012712     1.000575   0.95132348
Adv‐Disp Run Time             35.341182      18.9151     14.41493    10.701726
Adv‐Disp Speedup                      -     1.868411     2.451707    3.3023815
Adv‐Disp Efficiency                   -     0.934206     0.817236   0.82559538
Time Spent in Reactions          77.87%       75.11%       71.91%       71.82%
Time Spent in Adv‐Disp           20.45%       21.39%       23.13%       21.74%

Don’t focus on the Program speedup and efficiency, just the parallelized sections.

Timing ‐ (Guided Scheduling)

Threads                               1            2            3            4
Program Run Time              172.77765     93.36956     65.11877     51.98954
Program Speedup                       -     1.850471      2.65327   3.32331561
Program Efficiency                    -     0.925235     0.884423    0.8308289
Reaction Run Time            134.536829     66.10992     44.28971    33.657039
Reaction Speedup                      -     2.035048     3.037654   3.99728654
Reaction Efficiency                   -     1.017524     1.012551   0.99932164
Adv‐Disp Run Time             35.341182     23.54525     17.70636    14.985179
Adv‐Disp Speedup                      -      1.50099      1.99596   2.35840907
Adv‐Disp Efficiency                   -     0.750495      0.66532   0.58960227
Time Spent in Reactions          77.87%       70.80%       68.01%       64.74%
Time Spent in Adv‐Disp           20.45%       25.22%       27.19%       28.82%
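The speedup and efficiency rows follow from the run times. A minimal Python sketch, assuming the columns 1 to 4 are thread counts and using the usual definitions (speedup = one‐thread time divided by n‐thread time, efficiency = speedup divided by n), reproduces the Reaction rows of the static scheduling table. The function names are hypothetical.

```python
def speedup(t_one_thread, t_n_threads):
    """How many times faster the run is than the one-thread run."""
    return t_one_thread / t_n_threads


def efficiency(t_one_thread, t_n_threads, n_threads):
    """Speedup per thread; 1.0 is ideal, above 1.0 is superlinear."""
    return speedup(t_one_thread, t_n_threads) / n_threads


# Reaction run times from the static scheduling table, keyed by thread count
reaction_times = {1: 134.536829, 2: 66.42403, 3: 44.81983, 4: 35.355174}

if __name__ == "__main__":
    t1 = reaction_times[1]
    for n, t in reaction_times.items():
        print(n, round(speedup(t1, t), 6), round(efficiency(t1, t, n), 6))
```

An efficiency slightly above 1, as at 2 and 3 threads here, means the speedup is larger than the thread count.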

SLIDE 9

Timing ‐ (Static Adv‐Disp & Guided Reactions)

Threads                               1            2            3            4
Program Run Time              172.77765     88.07855     61.75762    48.335757
Program Speedup                       -     1.961631     2.797673   3.57453076
Program Efficiency                    -     0.980816     0.932558   0.89363269
Reaction Run Time            134.536829     66.12452     44.33819    34.313412
Reaction Speedup                      -     2.034598     3.034333   3.92082341
Reaction Efficiency                   -     1.017299     1.011444   0.98020585
Adv‐Disp Run Time             35.341182     18.84921     14.26557    10.735491
Adv‐Disp Speedup                      -     1.874943     2.477375   3.29199494
Adv‐Disp Efficiency                   -     0.937471     0.825792   0.82299873
Time Spent in Reactions          77.87%       75.07%       71.79%       70.99%
Time Spent in Adv‐Disp           20.45%       21.40%       23.10%       22.21%

Superlinear Speedup: the Reaction efficiency is above 1 at 2 and 3 threads.

100 Species Runtimes

SLIDE 10

100 Species Speedup

Vinyl Chloride after 10000 days

SLIDE 11

RT3D Problem Specifics

A program called MT3D solves the advection, dispersion, and source/sink equations and calls the RT3D subroutines to solve the reactions equation.
The specific problem solved in this example was the sequential decay of PCE, TCE, DCE, and VC.
The continuous source spill concentration of PCE was 1000 mg/L at the well.
The initial levels of all chemicals in the aquifer were 0.0 mg/L.
The site was 510 m x 310 m x 100 m, which gave a 51x31x10 grid.
The reactions solved were as follows:

$$R_{PCE} = -k_1 [PCE]$$

$$R_{TCE} = k_1 Y_{TCE/PCE} [PCE] - k_2 [TCE]$$

$$R_{DCE} = k_2 Y_{DCE/TCE} [TCE] - k_3 [DCE]$$

$$R_{VC} = k_3 Y_{VC/DCE} [DCE] - k_4 [VC]$$

Results presented from a Release mode version

k1          0.005 day‐1
k2          0.003 day‐1
k3          0.002 day‐1
k4          0.001 day‐1
YTCE/PCE    0.7920
YDCE/TCE    0.7377
YVC/DCE     0.6445
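The four rate expressions and the listed constants can be written as one function. This is a Python sketch for checking the arithmetic, not the RT3D reaction subroutine; the function and variable names are hypothetical, concentrations are taken in mg/L, and rates come out in mg/L per day.

```python
# Rate constants (1/day) and yield coefficients from the problem specifics
K1, K2, K3, K4 = 0.005, 0.003, 0.002, 0.001
Y_TCE_PCE, Y_DCE_TCE, Y_VC_DCE = 0.7920, 0.7377, 0.6445


def decay_rates(pce, tce, dce, vc):
    """Return (R_PCE, R_TCE, R_DCE, R_VC) for the sequential decay chain."""
    r_pce = -K1 * pce
    r_tce = K1 * Y_TCE_PCE * pce - K2 * tce
    r_dce = K2 * Y_DCE_TCE * tce - K3 * dce
    r_vc = K3 * Y_VC_DCE * dce - K4 * vc
    return r_pce, r_tce, r_dce, r_vc


if __name__ == "__main__":
    # At the source: 1000 mg/L PCE and no daughter products yet
    print(decay_rates(1000.0, 0.0, 0.0, 0.0))
```

Each node's rates depend only on that node's concentrations, which is why the reaction loop over nodes can be split between threads.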

Timing ‐ Loop Around Row Do Loop (Static Scheduling)

Threads                               1            2            3            4
Program Run Time               394.4463     284.5834     258.6008     240.5579
Program Speedup                       -     1.386048     1.525309     1.639715
Program Efficiency                    -     0.693024     0.508436     0.409929
RT3D Run Time                  229.1753     120.3458     94.81803      75.2866
RT3D Speedup                          -     1.904307     2.417002     3.044039
RT3D Efficiency                       -     0.952153     0.805667      0.76101
Time Spent in RT3D               58.10%       42.29%       36.67%       31.30%

Don’t focus on the Program speedup and efficiency, just the parallelized sections.

SLIDE 12

Timing ‐ Loop Around Row Do Loop (Guided Scheduling)

Threads                               1            2            3            4
Program Run Time               394.4463     280.7615     247.1585     227.8098
Program Speedup                       -     1.404916     1.595924     1.731472
Program Efficiency                    -     0.702458     0.531975     0.432868
RT3D Run Time                  229.1753     117.2388     80.37176     60.98532
RT3D Speedup                          -     1.954774     2.851441     3.757877
RT3D Efficiency                       -     0.977387      0.95048     0.939469
Time Spent in RT3D               58.10%       41.76%       32.52%       26.77%

RT3D Decay Problem Runtimes

SLIDE 13

RT3D Decay Problem Runtimes

Conclusion

Clearly the capabilities of OpenMP are limited by the available computer architecture. Much more speedup is possible with hundreds of processors in a cluster system, possibly using Message Passing Interface (MPI) routines, but OpenMP leaves the sequential code intact, is easy to implement, and accomplishes great speedup when a limited number of processors are available in a shared memory system.

Options for future research include a hybrid MPI/OpenMP code utilizing the benefits of both standards.

OpenMP is available primarily in commercial compilers such as Intel Visual Fortran and the PGI compilers.

The Omni compiler might be a free option with OpenMP support (http://phase.hpcc.jp/Omni/); I have not tried it, so I don’t know if it works.