SLIDE 1

Evaluation of Message Passing Communication Patterns in Finite Element Solution of Coupled Problems

Renato N. Elias Jose J. Camata Albino A. Aveleda Alvaro L. G. A. Coutinho

High Performance Computing Center (NACAD) Federal University of Rio de Janeiro (UFRJ) Rio de Janeiro, Brazil

SLIDE 2

Summary

  • Motivation;
  • EdgeCFD: software features;
  • Parallel models:
    • MPI collective;
    • MPI peer-to-peer;
    • Threaded parallelism;
  • Benchmark systems;
  • Benchmark problem;
  • Results;
  • Conclusions.

SLIDE 3

Motivations

  • Petaflop computing poses new challenges and paradigms;
  • Commitment to continuous performance improvements in our software;
  • Understand relevant hardware and software issues;
  • Track the evolution of Intel Xeon processors.

Largest research system in South America: 6,464 Nehalem cores; preliminary tests: 65 Tflops; final configuration: 7,200 cores. The main question is: what is happening? Why do modern multi-core systems not naturally give us better performance?

SLIDE 4

EdgeCFD Main Features

EdgeCFD: A parallel and general purpose CFD solver

  • Finite Element Method;
  • SUPG/PSPG formulation for incompressible flow;
  • SUPG/YZβ formulation for advection-diffusion;
  • Edge-based data structure;
  • Hybrid parallelism (MPI, OpenMP, or both);
  • Fully coupled u-p flow solver;
  • Free-surface flows (VoF and Level Sets);
  • Adaptive time-step control;
  • Inexact Newton solver;
  • Dynamic deactivation;
  • Mesh reordering tailored to the target architecture.

SLIDE 5

EdgeCFD in action

SLIDE 6

Governing Equations

Incompressible Navier-Stokes equations:

\[
\frac{\partial \mathbf{u}}{\partial t} + \mathbf{u}\cdot\nabla\mathbf{u} + \frac{1}{\rho}\nabla p - \nabla\cdot(\nu\,\nabla\mathbf{u}) = \mathbf{f} \quad \text{in } \Omega\times[0,t_f] \qquad (1)
\]
\[
\nabla\cdot\mathbf{u} = 0 \quad \text{in } \Omega\times[0,t_f] \qquad (2)
\]

Advection-diffusion transport equation:

\[
\frac{\partial \phi}{\partial t} + \mathbf{u}\cdot\nabla\phi - \nabla\cdot(K\,\nabla\phi) = 0 \quad \text{in } \Omega\times[0,t_f] \qquad (3)
\]

SLIDE 7

Stabilized Finite Element Formulation (1/2)

SUPG/PSPG Formulation

\[
\int_{\Omega} \mathbf{w}^h \cdot \rho\left(\frac{\partial \mathbf{u}^h}{\partial t} + \mathbf{u}^h\cdot\nabla\mathbf{u}^h - \mathbf{f}\right) d\Omega
+ \int_{\Omega} \boldsymbol{\varepsilon}(\mathbf{w}^h) : \boldsymbol{\sigma}(p^h,\mathbf{u}^h)\, d\Omega
- \int_{\Gamma_h} \mathbf{w}^h\cdot\mathbf{h}\, d\Gamma
+ \int_{\Omega} q^h\,\nabla\cdot\mathbf{u}^h\, d\Omega
\]
\[
+ \sum_{e=1}^{n_{el}} \int_{\Omega^e} \tau_{SUPG}\,\mathbf{u}^h\cdot\nabla\mathbf{w}^h \cdot \left[\rho\left(\frac{\partial \mathbf{u}^h}{\partial t} + \mathbf{u}^h\cdot\nabla\mathbf{u}^h\right) - \nabla\cdot\boldsymbol{\sigma}(p^h,\mathbf{u}^h) - \rho\mathbf{f}\right] d\Omega^e
\]
\[
+ \sum_{e=1}^{n_{el}} \int_{\Omega^e} \frac{\tau_{PSPG}}{\rho}\,\nabla q^h \cdot \left[\rho\left(\frac{\partial \mathbf{u}^h}{\partial t} + \mathbf{u}^h\cdot\nabla\mathbf{u}^h\right) - \nabla\cdot\boldsymbol{\sigma}(p^h,\mathbf{u}^h) - \rho\mathbf{f}\right] d\Omega^e
\]
\[
+ \sum_{e=1}^{n_{el}} \int_{\Omega^e} \tau_{LSIC}\,\nabla\cdot\mathbf{w}^h\,\rho\,\nabla\cdot\mathbf{u}^h\, d\Omega^e = 0 \qquad (4)
\]

SLIDE 8

Stabilized Finite Element Formulation (2/2)

SUPG/YZβ Formulation

\[
\int_{\Omega} w^h\left(\frac{\partial \phi^h}{\partial t} + \mathbf{u}^h\cdot\nabla\phi^h\right) d\Omega
+ \int_{\Omega} \nabla w^h \cdot K \cdot \nabla\phi^h\, d\Omega
- \int_{\Gamma_h} w^h\, h^h\, d\Gamma
\]
\[
+ \sum_{e=1}^{n_{el}} \int_{\Omega^e} \tau_{SUPG}\,\mathbf{u}^h\cdot\nabla w^h \left(\frac{\partial \phi^h}{\partial t} + \mathbf{u}^h\cdot\nabla\phi^h - \nabla\cdot\left(K\cdot\nabla\phi^h\right)\right) d\Omega
\]
\[
+ \sum_{e=1}^{n_{el}} \int_{\Omega^e} \delta(\phi)\,\nabla w^h\cdot\nabla\phi^h\, d\Omega = \int_{\Omega} w^h f\, d\Omega \qquad (5)
\]

SLIDE 9

EdgeCFD Time Stepping and Main Loops

SLIDE 10

Parallel Models in EdgeCFD (Summary)

(1) MPI based on a collective communication pattern;
(2) MPI based on a peer-to-peer (P2P) pattern;
(3) OpenMP in hot loops (edge-matrix assembly and matrix-vector products);
(4) Hybrid: OpenMP combined with (1) or (2).

SLIDE 11

Parallel Models

MPI collective

  • 1. All shared equations are synchronized in a single collective operation (see the sketch below);
  • 2. Easy to implement;
  • 3. Poor performance for massive parallelism;
  • 4. Some (small) improvements can be made (...but there is no miracle...).

(a) Original mesh (b) Redundant communication
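As a rough illustration of the collective pattern above, here is a minimal sketch in C. The names (sync_shared_equations, n_shared_eqs, local_contrib) are illustrative assumptions, not EdgeCFD's actual API: every rank contributes its partial sums and one MPI_Allreduce returns the fully assembled values to all ranks, which is exactly the redundant communication the figure highlights.

```c
/* Minimal sketch of the collective synchronization pattern (illustrative
 * names, not EdgeCFD's API). Each rank holds a vector spanning all shared
 * equations; a single MPI_Allreduce sums the partial contributions from
 * every partition, so every rank receives every assembled value. */
#include <mpi.h>

void sync_shared_equations(double *local_contrib, int n_shared_eqs)
{
    /* In-place global sum: after the call, local_contrib holds the fully
       assembled values on every process. */
    MPI_Allreduce(MPI_IN_PLACE, local_contrib, n_shared_eqs,
                  MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
}
```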

SLIDE 12

Parallel Models

MPI peer-to-peer (P2P)

  • 1. Tedious to implement:
    • neighbouring relationships;
    • clever scheduling of message exchanges;
    • computation-communication overlap opportunities;
  • 2. Very efficient (if correctly implemented, of course...);
  • 3. More suitable for massive parallelism (see the sketch below).

(a) Mesh Partition (b) Communication graph
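A minimal sketch of the P2P exchange, assuming a hypothetical partition description (neighbour list, per-neighbour send/receive buffers and counts) rather than EdgeCFD's actual data structures: non-blocking sends and receives are posted only to the partition's neighbours, interior work can be overlapped, and interface contributions are accumulated after the wait.

```c
/* Minimal sketch of the P2P interface exchange (illustrative names, not
 * EdgeCFD's data structures). req must hold 2 * nneigh requests. */
#include <mpi.h>

void exchange_interface(int nneigh, const int *neigh,
                        double **sendbuf, double **recvbuf,
                        const int *count, MPI_Request *req)
{
    /* Post all receives and sends to the partition neighbours. */
    for (int i = 0; i < nneigh; i++)
        MPI_Irecv(recvbuf[i], count[i], MPI_DOUBLE, neigh[i], 0,
                  MPI_COMM_WORLD, &req[i]);
    for (int i = 0; i < nneigh; i++)
        MPI_Isend(sendbuf[i], count[i], MPI_DOUBLE, neigh[i], 0,
                  MPI_COMM_WORLD, &req[nneigh + i]);

    /* ...computation on interior (non-interface) edges can overlap here... */

    MPI_Waitall(2 * nneigh, req, MPI_STATUSES_IGNORE);
    /* Interface contributions from recvbuf are accumulated afterwards. */
}
```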

SLIDE 13

Parallel Models

Threaded parallelism

  • 1. Easy to implement;
  • 2. Performance depends on compiler, hardware and implementation (hardware and compilers are getting better...);
  • 3. Revived by many-core processors and GPU computing;
  • 4. In EdgeCFD:
    • employed only in the main kernels (e.g., matrix-vector product);
    • mesh coloring used to remove memory dependences (see the sketch below).

(a) Original mesh (b) Colored mesh
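A minimal sketch of the colored edge loop, with illustrative array names (color_ptr, edge_node, edge_coef) that are not EdgeCFD's actual kernel: colors are processed one after another, and within a color no two edges share a node, so the OpenMP loop can update the result vector without write conflicts.

```c
/* Minimal sketch of a colored, edge-based matrix-vector product
 * (illustrative names, not EdgeCFD's kernel). Edges of the same color
 * share no nodes, so each color is safe to process in parallel. */
#include <omp.h>

void edge_matvec(int ncolors, const int *color_ptr, const int *edge_node,
                 const double *edge_coef, const double *x, double *y)
{
    for (int c = 0; c < ncolors; c++) {            /* colors run sequentially */
        #pragma omp parallel for                    /* edges in a color: parallel */
        for (int e = color_ptr[c]; e < color_ptr[c + 1]; e++) {
            int i = edge_node[2 * e];
            int j = edge_node[2 * e + 1];
            double a = edge_coef[e];
            y[i] += a * x[j];                       /* safe: no two edges of the */
            y[j] += a * x[i];                       /* same color touch a node   */
        }
    }
}
```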

SLIDE 14

Putting All Together

Hybrid Matrix-Vector Product
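The hybrid kernel itself is shown as a figure on this slide. As a rough sketch of one way such a hybrid (MPI + OpenMP) matrix-vector product can be organized, and only an assumption rather than EdgeCFD's exact scheme, interface edges are processed first with the threaded kernel, the P2P exchange of their partial results is posted, interior edges overlap the communication, and the neighbours' contributions are accumulated after the wait. All names follow the previous sketches.

```c
/* Minimal sketch of a hybrid (MPI + OpenMP) matrix-vector product, reusing
 * the hypothetical edge_matvec() from the previous slide. The interface /
 * interior edge split and all names are illustrative assumptions. */
#include <mpi.h>

/* From the previous sketch: colored, OpenMP-parallel edge loop. */
void edge_matvec(int ncolors, const int *color_ptr, const int *edge_node,
                 const double *edge_coef, const double *x, double *y);

void hybrid_matvec(int nneigh, const int *neigh, double **sendbuf,
                   double **recvbuf, const int *count, MPI_Request *req,
                   int ncolors_ifc, const int *color_ptr_ifc,
                   int ncolors_int, const int *color_ptr_int,
                   const int *edge_node, const double *edge_coef,
                   const double *x, double *y)
{
    /* 1. Threaded work on interface edges first, so their partial results
          in y are available for sending as early as possible. */
    edge_matvec(ncolors_ifc, color_ptr_ifc, edge_node, edge_coef, x, y);

    /* 2. Post the non-blocking P2P exchange of the interface partial sums.
          (Packing y at interface nodes into sendbuf is omitted for brevity;
          req must hold 2 * nneigh requests.) */
    for (int i = 0; i < nneigh; i++) {
        MPI_Irecv(recvbuf[i], count[i], MPI_DOUBLE, neigh[i], 0,
                  MPI_COMM_WORLD, &req[i]);
        MPI_Isend(sendbuf[i], count[i], MPI_DOUBLE, neigh[i], 0,
                  MPI_COMM_WORLD, &req[nneigh + i]);
    }

    /* 3. Threaded work on interior edges overlaps the communication. */
    edge_matvec(ncolors_int, color_ptr_int, edge_node, edge_coef, x, y);

    /* 4. Complete the exchange; unpacking recvbuf and adding the
          neighbours' contributions into y is omitted for brevity. */
    MPI_Waitall(2 * nneigh, req, MPI_STATUSES_IGNORE);
}
```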

SLIDE 15

Benchmark systems

Hardware description

SLIDE 16

Benchmark Problem

Rayleigh-Bénard (Ra = 30,000, Pr = 0.71)

Assumptions:

  • 1. All mesh entities were reordered to exploit memory locality [1];
  • 2. Same mesh ordering on all systems;
  • 3. Same compiler (Intel) and compilation flags (-fast).

Mesh sizes:

                      MSH1         MSH2
Tetrahedra            178,605      39,140,625
Nodes                 39,688       7,969,752
Edges                 225,978      48,721,528
Flow equations        94,512       31,024,728
Transport equations   36,080       7,843,248

Natural convection problem.

[1] Coutinho et al., IJNME 66:431-460, 2006.

SLIDE 17

Tests summary

  • Parallel-model speedup per number of cores (intra-node);
  • Serial performance per processor;
  • Parallel performance per processor (intra-node);
  • MPI process placement (inter-node);
  • Large-scale run (case study).

SLIDE 18

Parallel models

Speedup per number of cores (intra-node): Clovertown vs. Nehalem

(Panels: SGI Altix-ICE (Clovertown); Nehalem server (Core i7))

SLIDE 19

Results

CPU comparison

(a) Serial performance per CPU (serial run) (b) Intra-node performance on 8 cores (2 CPUs × 4 cores in one node)

SLIDE 20

Results

Process placement effect (Cluster Dell/Harpertown)

Speedup considering best process placement

SLIDE 21

THE MULTICORE DILEMMA:

SANDIA and TACC’s statements:

“...more chip cores can mean slower supercomputing...”;
“...16 multicores perform barely as well as two for complex applications...”;
“...process placement in multi-core processors has a strong influence on performance...”;
“...more cores on a single chip don’t necessarily mean faster clock speeds...”.

The supermarket analogy (by Sandia): if two clerks at the same checkout counter are processing your groceries instead of one, the checkout should go faster. The problem is that if each clerk does not have access to the groceries, he does not necessarily help the process. Worse, the clerks may get in each other’s way.

SOURCES: Diamond, J. et al., “Multicore Optimization for Ranger”, TACC, 2009; https://share.sandia.gov/news/resources/news_releases/more-chip-cores-can-mean-slower-supercomputing-sandia-simulation-shows/

SLIDE 22

Results

Large scale run on 64 cores (case study using MSH2)

About 40M tets, 8M nodes, 50M edges, 31M flow equations, 8M transport equations

Time spent in 10 time steps

Communication graph

Tests performed on Cluster Dell (Harpertown)

SLIDE 23

Conclusions

Older Intel Xeon processors suffer dramatically when large workloads are imposed on a single CPU.

Consequence: process placement has a great influence on performance (...sadly, we should not fill up our nodes...).

Nehalem (Core i7) has several improvements over its predecessors:

  • third-level shared cache (now Intel has a true quad core...);
  • fast interconnect channel among processors (well, it sounds like AMD HyperTransport...).

  • The peer-to-peer MPI model is the best strategy to reach good parallel performance in EdgeCFD;
  • OpenMP performance in EdgeCFD is still poor, but it is getting better (for the same implementation!);
  • Many-core and GPU paradigms are bringing threaded parallelism back...

SLIDE 24

Special Thanks to

  • Dell Brazil (Cluster Dell);
  • Intel Brazil (Nehalem servers);
  • High Performance Computing Center (NACAD) (Altix-ICE and infrastructure);
  • Texas Advanced Computing Center (TACC) (Ranger);
  • All of you that attended this presentation;
  • The Brazilian soccer team (for winning today).

SLIDE 25

Evaluation of Message Passing Communication Patterns in Finite Element Solution of Coupled Problems

Renato N. Elias Jose J. Camata Albino A. Aveleda Alvaro L. G. A. Coutinho

High Performance Computing Center (NACAD) Federal University of Rio de Janeiro (UFRJ) Rio de Janeiro, Brazil
