DUNE on Blue Gene/P (Markus Blatt) - PowerPoint PPT Presentation



SLIDE 1

DUNE on Blue Gene / P

Markus Blatt (Markus.Blatt@iwr.uni-heidelberg.de) joint work with: Olaf Ippisch and Felix Heimann

Interdisziplinäres Zentrum für wissenschaftliches Rechnen, Universität Heidelberg

SciComp 15, Barcelona, May 21, 2009

M. Blatt (IWR, Heidelberg)

DUNE on BG/P ScicomP15, May 21, 2009 1 / 19

SLIDE 2

Outline

1. DUNE
2. Parallelization Approach
3. Porting to BG/P
4. Scalability

SLIDE 3

DUNE

DUNE

Why another framework?

  • Lots of good frameworks for PDEs out there.
  • Using an existing one, it might be
  • either impossible to have a particular feature,
  • or very inefficient in certain applications.
  • Extending the feature set is usually hard.

Distributed and Unified Numerics Environment

  • Separation of data structures and algorithms by abstract interfaces.
  • Efficient implementation of these interfaces using generic programming techniques in C++.
  • Static polymorphism enables extensive optimization by the compiler.
  • Algorithms are parametrized with data structures. The interface is removed at compile time.

  • Open Source available from http://www.dune-project.org
SLIDE 4

DUNE

DUNE is modular

[Module dependency diagram: dune-common, dune-grid, dune-istl, dune-localfunctions, dune-pdelab and dune-fem with the howto modules, plus external packages (ALU, UG, Alberta, NeuronGrid, VTK, Gmsh, SuperLU, Metis)]

  • Grid interface: (non-)conforming, hierarchically nested, multi-element-type parallel grids in arbitrary space dimensions.
  • Iterative Solver Template Library: generic sparse and dense matrix and vector classes supporting recursive block structures, with corresponding (parallel) solvers, e.g. AMG.
  • PDELab: discretization module that is closely related to the mathematical formulation of finite element methods.

SLIDE 5

DUNE

Sample Simulations

  • Flow and transport in porous media
  • Neuron network simulation
  • Density-driven flow
  • Root uptake
SLIDE 6

Parallelization Approach

Parallelization Approach

1. DUNE
2. Parallelization Approach
3. Porting to BG/P
4. Scalability

SLIDE 7

Parallelization Approach

Index Based Communication

Goals

  • Allow reuse of efficient sequential data structures for computations.
  • Let the user initiate communication when needed.
  • Support
  • unstructuredness,
  • adaptivity,
  • communication of different data with the same decomposition.

Approach

  • Keep decomposition and communication information outside of the data structures.

  • Use simple and portable index identification of items.
  • Data structures need to be augmented to contain ghost items.
SLIDE 8

Parallelization Approach

Index Sets

Index Set

  • Distributed overlapping index set I = ⋃_{p=0}^{P−1} I_p.
  • Process p stores and manages the mapping I_p → [0, n_p).
  • Supports adaptivity.
SLIDE 9

Parallelization Approach

Index Sets


Global Index

  • Identifies a position (index) globally.
  • Arbitrary and not consecutive (to support adaptivity).
  • Persistent.
SLIDE 10

Parallelization Approach

Index Sets


Local Index

  • Addresses a position in the local container.
  • Convertible to an integral type.
  • Consecutive index starting from 0.
  • Non-persistent.
  • Provides an attribute to partition the set.
SLIDE 11

Parallelization Approach

Remote Index Information

  • Communication between different distributions of the index set is possible, e.g.
  • data agglomeration onto fewer processes,
  • data redistribution for load balancing.
  • For each process one needs to store all global indices that are also stored on that process, together with the corresponding attribute.
  • The remote index information can either be set up by hand (better efficiency)
  • or computed automatically using global communication.
SLIDE 12

Parallelization Approach

Communication Interface

  • Contains information on a specific communication scheme.
  • Target and source partition of the index is chosen using attribute flags, e.g. from ghost to owner and ghost.
  • Still independent of the data to be communicated.
  • For each process a list of corresponding local indices at the source and target index set is stored.

SLIDE 13

Parallelization Approach

Communication

  • Communication occurs according to the set-up interfaces.
  • Communication is possible in both directions (from source to target and vice versa).
  • Data associated with indices can either
  • be of the same size for each index,
  • or of different size for each index.
  • Data can be manipulated either at the source or at the target (customizable by the user).

SLIDE 14

Porting to BG/P

Porting to BG/P

1. DUNE
2. Parallelization Approach
3. Porting to BG/P
4. Scalability

SLIDE 15

Porting to BG/P

Porting, a piece of cake?

Naive Assumptions

  • DUNE uses the autotools toolchain together with a custom script for managing the module dependencies.
  • Autotools support cross compilation.
  • Configure tests that need to run MPI programs can be switched off.
  • DUNE uses standard C++ (but advanced template constructs).
  • This should be really easy! It worked on other Linux clusters, too!

The real HPC World

  • XLC lacks support for some standard template code (e.g. partial template specialization).
  • Libtool gets confused somehow and tries to link shared libraries statically.
  • An O(P) bottleneck in the communication setup becomes apparent.
SLIDE 17

Porting to BG/P

Problem Resolutions

Missing template support in XLC

  • Thank goodness, GNU C++ compiler is also available!

Libtool problem

  • Use special option for Darwin (-dynamic).
  • Thanks to Bernd Mohr (JSC) and Frank Ingram (IBM).

O(P) bottleneck

  • At the time of programming we were not thinking of > 512 processors.
  • Fortunately we use a structured tensor product grid for our simulation.
  • Therefore we do not need to send all indices around in a ring!
  • Switched to asynchronous communication with just the neighboring processors. Now O(3^d) for dimension d.
SLIDE 20

Porting to BG/P

Further Improvements (BGP Personality)

  • Most of the communication is with neighboring processors.
  • Our grid assigns the processors lexicographically to the subdomains.
  • Make sure that neighboring processors (in terms of the grid distribution) are neighbors in the BG/P torus as well.
  • Reorder MPI ranks according to BG/P torus coordinates.
SLIDE 21

Scalability

Scalability

1. DUNE
2. Parallelization Approach
3. Porting to BG/P
4. Scalability

SLIDE 22

Scalability

INVEST (Inverse Modeling of Terrestrial Systems)

Inverse Modeling of Terrestrial Systems

  • Virtual institute (IWR Heidelberg and Agrosphere ICG-4, Jülich).
  • Strategies for deriving flow and transport parameters for models.
  • Processes at the scale of an agricultural field.

Real Systems

Hard to distinguish between

  • Measurement errors,
  • insufficient representation of the heterogeneity, and
  • wrong effective model.

Virtual Soil-Plant Systems

  • Highly detailed field models with accurate representation of within-field conditions.
  • Obtain synthetic data sets of simulated state variables, fluxes and measurements by high-resolution simulation.


SLIDE 23

Scalability

INVEST (Inverse Modeling of Terrestrial Systems)

  • Use these synthetic data sets to develop and test parameter estimation procedures.

SLIDE 25

Scalability

Weak Scalability

  • 3D simulation of infiltration into a heterogeneous soil (1 m x 1 m x 1 m) starting at hydraulic equilibrium, using Richards' equation.
  • No-flux boundary conditions at the sides.
  • Lower boundary imposed by the groundwater table.
  • Newton's method as nonlinear solver.
  • Algebraic multigrid as linear solver.
  • Problem size: 64 x 64 x 64 elements per processor.

cpus | DoFs (10^6) | time steps | T Sol | eff. t.st. | New. steps | No. It. | T It | eff. T It | T Build | eff. Build
-----|-------------|------------|-------|------------|------------|---------|------|-----------|---------|-----------
1    | 0.26        | 1          | 393   | --         | 7          | 21      | 1.76 | --        | 7.66    | --
8    | 2.10        | 2          | 692   | 1.13       | 11         | 46      | 1.88 | 0.94      | 8.84    | 0.87
64   | 16.8        | 4          | 1143  | 1.37       | 16         | 88      | 1.92 | 0.92      | 12.24   | 0.63
512  | 134         | 8          | 1957  | 1.60       | 26         | 187     | 1.95 | 0.90      | 12.06   | 0.64
4096 | 1074        | 16         | 3033  | 2.07       | 38         | 345     | 1.95 | 0.90      | 12.01   | 0.64

(eff. columns: ratio of the 1-cpu value to the measured value, per time step, per AMG iteration and per AMG build, respectively)

SLIDE 26

Scalability

Conclusion and Acknowledgments

  • The porting effort was mainly due to technical and platform difficulties.
  • The original code scaled well with only minor changes,
  • thanks to the carefully crafted and extensible DUNE code.
  • Some bugs only appear with ≥ 512 processors *sigh*.
  • Achieved good scalability for our application on Blue Gene/P.
  • We highly appreciate the good support given by both JSC and IBM.

Thank you very much!

  • Thank you to all fellow DUNE developers from Berlin, Freiburg and Heidelberg.
  • Do not forget to check out DUNE at http://dune-project.org