Load Balancing and Data Migration in a Hybrid Computational Fluid Dynamics Application - PowerPoint PPT Presentation



SLIDE 1

Load Balancing and Data Migration in a Hybrid Computational Fluid Dynamics Application

Esteban Meneses Patrick Pisciuneri

Center for Simulation and Modeling (SaM)
 University of Pittsburgh

SLIDE 2

Load Balancing in a CFD Application


  • High Performance Computing
  • Scientific Computing
  • Computer Science

SLIDE 3

Center for Simulation and Modeling (SaM)


Frank (SaM HPC cluster):

  • Serves sciences, engineering, health, technical, educational, and research users
  • 521 users, 8,040 cores
  • 91% utilization in 2014
  • Supported by HPC researchers/consultants

SLIDE 4

IPLMCFD

  • A massively parallel solver for turbulent reactive flows.
  • LES via filtered density function (FDF).


SLIDE 5

Load Imbalance

  • IPLMCFD uses a graph partitioning library (METIS) to redistribute work.
  • Requires splitting execution between calls that repartition the cells.


SLIDE 6

Reasons for Load Imbalance in CFD

  • Approaches:

❖ Task-parallel
 ❖ Zoltan
 ❖ Charm++


[Figure: sources of load imbalance — adaptive mesh refinement, chemical reaction — comparing a traditional decomposition with IPLMCFD. Langer et al., SBAC-PAD, 2012.]

SLIDE 7

Agenda

  • IPLMCFD: A Hybrid Computational Fluid

Dynamics Application

  • Zoltan Library
  • PaSR Benchmark
  • Zoltan vs Charm++ Comparison


SLIDE 8

Hybrid CFD Application

  • IPLMCFD: Irregularly Partitioned Lagrangian Monte Carlo Finite Difference.
  • Domain divided into cells, the atomic distribution unit.
  • Ensemble of cells:
  • Same number of FD points.
  • Same number of MC particles.


SLIDE 9

Computational Fluid Dynamics


#"Grids" #"Par*cles" #"Species" Required" Memory" GBs" GFLOP"per" itera*on" #"Itera*ons" Serial"""" Run>*me"" (1"GFLOP/s)" 106$ 6$x$106$ 9$ 1.69$ 29.5$ 60,000$ 20.5$days$ 106$ 6$x$106$ 19$ 2.48$ 90.7$ 60,000$ 63$days$ 5$x$106$ 50$x$106$ 19$ 24.0$ 544.7$ 220,000$ 3.8$years$

SLIDE 10

Code Structure


[Diagram: code structure — C++ modules Iplmcfd, Ipfd, and Iplmc (10,101 and 3,091 LOC), with Fortran/C interfaces to Metis, TVMet, Chemkin, ODEPACK, and MPI.]

SLIDE 11

IPLMCFD

  • A scalable algorithm for hybrid Eulerian/Lagrangian solvers.
  • Goals:
  • Balance the computational load among processors through weighted graph partitioning.
  • Minimize the number of adjacent elements assigned to different processors (minimize the edge-cut).
  • Irregularly shaped decompositions:
  • Disadvantages:
  • Nontrivial communication patterns.
  • Increased communication cost.
  • Advantage (major):
  • Evenly distributed load among partitions.


  • P. H. Pisciuneri et al., SIAM J. Sci. Comput., vol. 35, no. 4, pp. C438–C452 (2013).
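The weighted-partitioning objectives above can be illustrated with a small self-contained sketch (a greedy toy partitioner, not METIS; `CellGraph`, `greedyPartition`, and `edgeCut` are names of our own): it balances per-cell weights across partitions and measures the edge-cut that a graph partitioner like METIS additionally tries to minimize.

```cpp
#include <algorithm>
#include <numeric>
#include <utility>
#include <vector>

// A cell graph: per-cell computational weight plus adjacency (edges).
struct CellGraph {
    std::vector<int> weight;                // work per cell (e.g. MC particles)
    std::vector<std::pair<int, int>> edges; // pairs of adjacent cells
};

// Greedy weighted partitioning: repeatedly assign the heaviest
// unassigned cell to the currently lightest partition. This balances
// load but, unlike METIS, ignores the edge-cut objective.
std::vector<int> greedyPartition(const CellGraph& g, int nparts) {
    std::vector<int> order(g.weight.size());
    std::iota(order.begin(), order.end(), 0);
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return g.weight[a] > g.weight[b]; });
    std::vector<long> load(nparts, 0);
    std::vector<int> part(g.weight.size(), -1);
    for (int c : order) {
        int p = static_cast<int>(
            std::min_element(load.begin(), load.end()) - load.begin());
        part[c] = p;
        load[p] += g.weight[c];
    }
    return part;
}

// Edge-cut: number of adjacent cells assigned to different partitions.
int edgeCut(const CellGraph& g, const std::vector<int>& part) {
    int cut = 0;
    for (const auto& e : g.edges)
        if (part[e.first] != part[e.second]) ++cut;
    return cut;
}
```

A real partitioner trades these two objectives off; the sketch only shows why both metrics matter.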

SLIDE 12

Strong Scaling

  • Geometry:
  • 2.5 million FD points
  • 20 million MC particles
  • Chemistry: 9 species, 5-step
  • Top:
  • Unbalanced: 22% efficiency (9K cores)
  • IPLMCFD: 76% efficiency (9K cores)
  • Bottom:
  • Performance of IPLMCFD improves as the number of MC particles increases
  • IPLMCFD: 84% efficiency at 9K processors for 40M particles
  • Timing:
  • The average of 10 iterations immediately after load balancing
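The efficiency figures above follow the standard strong-scaling definition (a sketch; the exact baseline run used on this slide is an assumption on our part):

```cpp
// Strong-scaling efficiency relative to a reference run:
//   E(p) = (T_ref * p_ref) / (T_p * p)
// where T_ref is the runtime on p_ref cores and T_p the runtime on p cores.
double strongScalingEfficiency(double t_ref, int p_ref, double t_p, int p) {
    return (t_ref * p_ref) / (t_p * p);
}
```

For example, a run that is 50x faster on 100x the cores has 50% efficiency.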


SLIDE 13

Simulation of a Premixed Flame


SLIDE 14

Temporal Performance of IPLMCFD

  • Unbalanced: approximately static performance
  • IPLMCFD: variable performance
  • Load balancing is performed approx. every 2,000 iterations
  • Optimal performance immediately after load balancing
  • Performance degrades over time
  • Potential walltime savings afforded by IPLMCFD for this example:


T_Unbalanced − T_IPLMCFD = 30 hours

SLIDE 15

Cost of Repartitioning

  • Naïve approach:
  • Immediately before load balancing, checkpoint the entire simulation
  • Restart the simulation with a new decomposition
  • Costly; involves:
  • Writing to the shared filesystem
  • Simulation cleanup
  • Simulation startup
  • Reading from the shared filesystem
  • Does not scale
  • O(10^2 – 10^3) iterations in cost
  • Optimal approach:
  • Repartitioning should be handled in memory
  • The new partition is aware of the previous partition, so data movement and interruption are minimal
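The partition-aware repartitioning described above can be sketched as follows (a toy illustration, not the actual IPLMCFD or Zoltan machinery; `Move` and `migrationPlan` are hypothetical names): only cells whose owner changed between the old and new decompositions enter the migration plan, which is what keeps data movement minimal.

```cpp
#include <cstddef>
#include <vector>

// One required data transfer: cell `cell` moves from rank `from` to rank `to`.
struct Move {
    int cell, from, to;
};

// Given old and new owner lists indexed by cell, build the migration
// plan. Cells whose owner did not change stay in place and are skipped.
std::vector<Move> migrationPlan(const std::vector<int>& oldOwner,
                                const std::vector<int>& newOwner) {
    std::vector<Move> plan;
    for (std::size_t c = 0; c < oldOwner.size(); ++c)
        if (oldOwner[c] != newOwner[c])
            plan.push_back({static_cast<int>(c), oldOwner[c], newOwner[c]});
    return plan;
}
```

Contrast with the naïve approach, which rewrites every cell through the filesystem regardless of whether its owner changed.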


SLIDE 16

Zoltan

  • “A toolkit of parallel combinatorial algorithms for unstructured and/or adaptive computations.”
  • Sandia–OSU collaboration since 2000.
  • Part of the Trilinos package.
  • Zoltan2 project in C++.


  • Dynamic load balancing
  • Parallel repartitioning
  • Data migration tools
  • Distributed data directories
  • Unstructured communication
  • Dynamic memory management

SLIDE 17

Zoltan IPLMCFD

  • Zoltan’s callback function interface.
  • Methodology:

❖ Atomic unit ⟶ cell (irregular subdomains).
 ❖ Data registration ⟶ number of objects, object weights.
 ❖ Graph management ⟶ number of edges, edge weights.
 ❖ Migration ⟶ pack/unpack functions.
 ❖ Load balancing ⟶ partition, repartition, refinement.
 ❖ Global information ⟶ distributed data directory.
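The pack/unpack migration functions mentioned above must flatten a cell's state into a byte buffer and rebuild it on the receiving rank. The sketch below shows that pattern self-contained (it does not use Zoltan's actual callback signatures; `Cell`, `pack`, and `unpack` are illustrative names, and the buffer layout is our own choice):

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// A cell's migratable state (illustrative fields only).
struct Cell {
    double temperature;
    std::vector<double> particles; // MC particle data
};

// Pack a cell into a flat byte buffer (the role a pack callback plays).
// Layout: [temperature][particle count][particle data...]
std::vector<std::uint8_t> pack(const Cell& c) {
    std::size_t n = c.particles.size();
    std::vector<std::uint8_t> buf(sizeof(double) + sizeof(n) + n * sizeof(double));
    std::uint8_t* p = buf.data();
    std::memcpy(p, &c.temperature, sizeof(double)); p += sizeof(double);
    std::memcpy(p, &n, sizeof(n));                  p += sizeof(n);
    std::memcpy(p, c.particles.data(), n * sizeof(double));
    return buf;
}

// Rebuild the cell on the receiving rank (the role of an unpack callback).
Cell unpack(const std::vector<std::uint8_t>& buf) {
    Cell c;
    const std::uint8_t* p = buf.data();
    std::memcpy(&c.temperature, p, sizeof(double)); p += sizeof(double);
    std::size_t n;
    std::memcpy(&n, p, sizeof(n));                  p += sizeof(n);
    c.particles.resize(n);
    std::memcpy(c.particles.data(), p, n * sizeof(double));
    return c;
}
```

In the real Zoltan interface these are registered as callbacks, together with a sizing function that reports the buffer length up front.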


SLIDE 18

Charm++ IPLMCFD

  • Goal: fully exploit Charm++ features.
  • Methodology:

❖ Atomic unit ⟶ subdomain (regular subdomains).
 ❖ Containing class ⟶ 3D chare array.
 ❖ Process-based data ⟶ chare group.
 ❖ Communication ⟶ outermost level.
 ❖ Structured control flow ⟶ Structured Dagger.
 ❖ Migration ⟶ PUP methods.
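The PUP methods mentioned above are what cut Charm++'s migration code down relative to Zoltan's pack/unpack pair: one `pup` traversal serves sizing, packing, and unpacking. The sketch below imitates that idiom self-contained (the real API is `PUP::er` in Charm++'s pup.h; `Puper` and `Subdomain` here are our own stand-ins):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// A tiny PUP-like "er": the same traversal sizes, packs, or unpacks,
// depending on the mode it is run in.
struct Puper {
    enum Mode { SIZING, PACKING, UNPACKING } mode = SIZING;
    std::vector<std::uint8_t> buf; // must be sized before PACKING
    std::size_t pos = 0;
    void bytes(void* data, std::size_t n) {
        switch (mode) {
            case SIZING:    break; // only advance pos to count bytes
            case PACKING:   std::memcpy(buf.data() + pos, data, n); break;
            case UNPACKING: std::memcpy(data, buf.data() + pos, n); break;
        }
        pos += n;
    }
    template <class T> Puper& operator|(T& x) { bytes(&x, sizeof(T)); return *this; }
};

// A migratable subdomain writes ONE pup routine instead of separate
// size/pack/unpack functions.
struct Subdomain {
    int id;
    double dt;
    void pup(Puper& p) { p | id; p | dt; }
};
```

Because the same routine drives all three directions, the serialization logic cannot drift out of sync, which is a large part of the LOC savings reported on the Programming Effort slide.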


SLIDE 19

Partially Stirred Reactor (PaSR)


[Figure: PaSR schematic — inflows of 100% air at 300 K and 60% CH4 / 40% air at 300 K; outflow of products.]

  • Parameters:
  • IC: stoichiometric mixture of methane and air, reacted until equilibrium (T ≈ 2230 K)
  • Simulation duration: t_end = 10 τ_res
  • Realizability:
  • Lower bound: no mixing
  • Upper bound: perfectly stirred
SLIDE 20

Dynamic Load Balancing

[Figure: dynamic partitioning vs. static partition.]

SLIDE 21

Strong Scaling

  • Parameters:

❖ 10,000 particles
 ❖ Chemistry: 9 species, 5-step

  • Timings over the entire simulation (Stampede)

❖ The Zoltan and Charm++ timings include all overhead associated with repartitioning and data migration


[Figure: strong-scaling timings for the Zoltan and Charm++ versions.]

SLIDE 22

Programming Effort


                            Zoltan IPLMCFD   Charm++ IPLMCFD
  Startup                   39               —
  Object Graph Management   80               —
  Data Migration            427              61
  Load Balancing            40               3

Measured in lines of code (LOC)

SLIDE 23

Charm++ Wishlist

  • MPI ⟶ Charm++ migration guide:

❖ Instructions on using Charm++ with build systems.
 ❖ Translating common MPI programming patterns.
 ❖ Dealing with communication operations.
 ❖ Highlighting opportunities for improvement.

  • Parallel I/O documentation.
  • Accelerator programming documentation.


SLIDE 24

Conclusions

  • Competitive performance between Zoltan and Charm++ for adaptive simulations of turbulent reactive flows.
  • Charm++ alleviates the programming effort of building infrastructure for adaptive computation.


Thank You! Q&A