SLIDE 1

Evaluation of Message Passing Communication Patterns in Finite Element Solution of Coupled Problems

Renato N. Elias Jose J. Camata Albino A. Aveleda Alvaro L. G. A. Coutinho

High Performance Computing Center (NACAD) Federal University of Rio de Janeiro (UFRJ) Rio de Janeiro, Brazil

SLIDE 2

Summary

  • Motivation;
  • EdgeCFD: software features;
  • Parallel models:
    • MPI collective;
    • MPI peer-to-peer;
    • Threaded parallelism;
  • Benchmark systems;
  • Benchmark problem;
  • Results;
  • Conclusions.

SLIDE 3

Motivations

  • Petaflop computing poses new challenges and paradigms;
  • Commitment to continuous performance improvements in our software;
  • Understand relevant hardware and software issues;
  • Track the evolution of Intel Xeon processors.

Largest research system in South America: 6,464 Nehalem cores; preliminary tests: 65 Tflops; final configuration: 7,200 cores. The main question is: what is happening? Why do modern multi-core systems not naturally give us better performance?

SLIDE 4

EdgeCFD Main Features

EdgeCFD: A parallel and general purpose CFD solver

  • Finite Element Method;
  • SUPG/PSPG formulation for incompressible flow;
  • SUPG/YZβ formulation for advection-diffusion;
  • Edge-based data structure;
  • Hybrid parallelism (MPI, OpenMP, or both);
  • Fully coupled u-p flow solver;
  • Free-surface flows (VoF and Level Sets);
  • Adaptive time-step control;
  • Inexact Newton solver;
  • Dynamic deactivation;
  • Mesh reordering tailored to the target architecture.

SLIDE 5

EdgeCFD in action

SLIDE 6

Governing Equations

Incompressible Navier-Stokes equations:

\[
\frac{\partial \mathbf{u}}{\partial t} + \mathbf{u}\cdot\nabla\mathbf{u} + \frac{1}{\rho}\nabla p - \nabla\cdot(\nu\,\nabla\mathbf{u}) = \mathbf{f} \quad \text{in } \Omega\times[0,t_f] \qquad (1)
\]
\[
\nabla\cdot\mathbf{u} = 0 \quad \text{in } \Omega\times[0,t_f] \qquad (2)
\]

Advection-diffusion transport equation:

\[
\frac{\partial \phi}{\partial t} + \mathbf{u}\cdot\nabla\phi - \nabla\cdot(K\,\nabla\phi) = 0 \quad \text{in } \Omega\times[0,t_f] \qquad (3)
\]

SLIDE 7

Stabilized Finite Element Formulation (1/2)

SUPG/PSPG Formulation

\[
\int_{\Omega} \mathbf{w}^h \cdot \rho\left(\frac{\partial \mathbf{u}^h}{\partial t} + \mathbf{u}^h\cdot\nabla\mathbf{u}^h - \mathbf{f}\right) d\Omega
+ \int_{\Omega} \boldsymbol{\varepsilon}(\mathbf{w}^h) : \boldsymbol{\sigma}(p^h,\mathbf{u}^h)\, d\Omega
- \int_{\Gamma_h} \mathbf{w}^h\cdot\mathbf{h}\, d\Gamma
+ \int_{\Omega} q^h\,\nabla\cdot\mathbf{u}^h\, d\Omega
\]
\[
+ \sum_{e=1}^{n_{el}} \int_{\Omega^e} \tau_{SUPG}\,\mathbf{u}^h\cdot\nabla\mathbf{w}^h \cdot \left[\rho\left(\frac{\partial \mathbf{u}^h}{\partial t} + \mathbf{u}^h\cdot\nabla\mathbf{u}^h\right) - \nabla\cdot\boldsymbol{\sigma}(p^h,\mathbf{u}^h) - \rho\mathbf{f}\right] d\Omega^e
\]
\[
+ \sum_{e=1}^{n_{el}} \int_{\Omega^e} \frac{\tau_{PSPG}}{\rho}\,\nabla q^h \cdot \left[\rho\left(\frac{\partial \mathbf{u}^h}{\partial t} + \mathbf{u}^h\cdot\nabla\mathbf{u}^h\right) - \nabla\cdot\boldsymbol{\sigma}(p^h,\mathbf{u}^h) - \rho\mathbf{f}\right] d\Omega^e
\]
\[
+ \sum_{e=1}^{n_{el}} \int_{\Omega^e} \tau_{LSIC}\,\nabla\cdot\mathbf{w}^h\,\rho\,\nabla\cdot\mathbf{u}^h\, d\Omega^e = 0 \qquad (4)
\]

SLIDE 8

Stabilized Finite Element Formulation (2/2)

SUPG/YZβ Formulation

\[
\int_{\Omega} w^h\left(\frac{\partial \phi^h}{\partial t} + \mathbf{u}^h\cdot\nabla\phi^h\right) d\Omega
+ \int_{\Omega} \nabla w^h \cdot K \cdot \nabla\phi^h\, d\Omega
- \int_{\Gamma_h} w^h\, h^h\, d\Gamma
\]
\[
+ \sum_{e=1}^{n_{el}} \int_{\Omega^e} \tau_{SUPG}\,\mathbf{u}^h\cdot\nabla w^h \left(\frac{\partial \phi^h}{\partial t} + \mathbf{u}^h\cdot\nabla\phi^h - \nabla\cdot\left(K\cdot\nabla\phi^h\right)\right) d\Omega
\]
\[
+ \sum_{e=1}^{n_{el}} \int_{\Omega^e} \delta(\phi)\,\nabla w^h\cdot\nabla\phi^h\, d\Omega = \int_{\Omega} w^h f\, d\Omega \qquad (5)
\]

SLIDE 9

EdgeCFD Time Stepping and Main Loops

SLIDE 10

Parallel Models in EdgeCFD (Summary)

(1) MPI based on a collective communication pattern;
(2) MPI based on a peer-to-peer (P2P) pattern;
(3) OpenMP in hot loops (edge-matrix assembly and matrix-vector products);
(4) Hybrid: OpenMP combined with (1) or (2).

SLIDE 11

Parallel Models

MPI collective

  • 1. All shared equations are synchronized in a single collective operation (see the sketch below);
  • 2. Easy to implement;
  • 3. Poor performance for massive parallelism;
  • 4. Some (small) improvements can be made (...but there is no miracle...).

(a) Original mesh (b) Redundant communication
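As a rough illustration of the collective pattern above, here is a minimal sketch in C. The names (sync_shared_equations, n_shared_eqs, local_contrib) are illustrative assumptions, not EdgeCFD's actual API: every rank contributes its partial sums and one MPI_Allreduce returns the fully assembled values to all ranks, which is exactly the redundant communication the figure highlights.

```c
/* Minimal sketch of the collective synchronization pattern (illustrative
 * names, not EdgeCFD's API). Each rank holds a vector spanning all shared
 * equations; a single MPI_Allreduce sums the partial contributions from
 * every partition, so every rank receives every assembled value. */
#include <mpi.h>

void sync_shared_equations(double *local_contrib, int n_shared_eqs)
{
    /* In-place global sum: after the call, local_contrib holds the fully
       assembled values on every process. */
    MPI_Allreduce(MPI_IN_PLACE, local_contrib, n_shared_eqs,
                  MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
}
```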

SLIDE 12

Parallel Models

MPI peer-to-peer (P2P)

  • 1. Tedious to implement:
    • neighbouring relationships;
    • clever scheduling of message exchanges;
    • computation-communication overlap opportunities;
  • 2. Very efficient (if correctly implemented, of course...);
  • 3. More suitable for massive parallelism (see the sketch below).

(a) Mesh Partition (b) Communication graph
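A minimal sketch of the P2P exchange, assuming a hypothetical partition description (neighbour list, per-neighbour send/receive buffers and counts) rather than EdgeCFD's actual data structures: non-blocking sends and receives are posted only to the partition's neighbours, interior work can be overlapped, and interface contributions are accumulated after the wait.

```c
/* Minimal sketch of the P2P interface exchange (illustrative names, not
 * EdgeCFD's data structures). req must hold 2 * nneigh requests. */
#include <mpi.h>

void exchange_interface(int nneigh, const int *neigh,
                        double **sendbuf, double **recvbuf,
                        const int *count, MPI_Request *req)
{
    /* Post all receives and sends to the partition neighbours. */
    for (int i = 0; i < nneigh; i++)
        MPI_Irecv(recvbuf[i], count[i], MPI_DOUBLE, neigh[i], 0,
                  MPI_COMM_WORLD, &req[i]);
    for (int i = 0; i < nneigh; i++)
        MPI_Isend(sendbuf[i], count[i], MPI_DOUBLE, neigh[i], 0,
                  MPI_COMM_WORLD, &req[nneigh + i]);

    /* ...computation on interior (non-interface) edges can overlap here... */

    MPI_Waitall(2 * nneigh, req, MPI_STATUSES_IGNORE);
    /* Interface contributions from recvbuf are accumulated afterwards. */
}
```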

SLIDE 13

Parallel Models

Threaded parallelism

  • 1. Easy to implement;
  • 2. Performance depends on compiler, hardware and implementation (hardware and compilers are getting better...);
  • 3. Revived by many-core processors and GPU computing;
  • 4. In EdgeCFD:
    • employed only in the main kernels (e.g., matrix-vector product);
    • mesh coloring used to remove memory dependences (see the sketch below).

(a) Original mesh (b) Colored mesh
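A minimal sketch of the colored edge loop, with illustrative array names (color_ptr, edge_node, edge_coef) that are not EdgeCFD's actual kernel: colors are processed one after another, and within a color no two edges share a node, so the OpenMP loop can update the result vector without write conflicts.

```c
/* Minimal sketch of a colored, edge-based matrix-vector product
 * (illustrative names, not EdgeCFD's kernel). Edges of the same color
 * share no nodes, so each color is safe to process in parallel. */
#include <omp.h>

void edge_matvec(int ncolors, const int *color_ptr, const int *edge_node,
                 const double *edge_coef, const double *x, double *y)
{
    for (int c = 0; c < ncolors; c++) {            /* colors run sequentially */
        #pragma omp parallel for                    /* edges in a color: parallel */
        for (int e = color_ptr[c]; e < color_ptr[c + 1]; e++) {
            int i = edge_node[2 * e];
            int j = edge_node[2 * e + 1];
            double a = edge_coef[e];
            y[i] += a * x[j];                       /* safe: no two edges of the */
            y[j] += a * x[i];                       /* same color touch a node   */
        }
    }
}
```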

SLIDE 14

Putting All Together

Hybrid Matrix-Vector Product
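The hybrid kernel itself is shown as a figure on this slide. As a rough sketch of one way such a hybrid (MPI + OpenMP) matrix-vector product can be organized, and only an assumption rather than EdgeCFD's exact scheme, interface edges are processed first with the threaded kernel, the P2P exchange of their partial results is posted, interior edges overlap the communication, and the neighbours' contributions are accumulated after the wait. All names follow the previous sketches.

```c
/* Minimal sketch of a hybrid (MPI + OpenMP) matrix-vector product, reusing
 * the hypothetical edge_matvec() from the previous slide. The interface /
 * interior edge split and all names are illustrative assumptions. */
#include <mpi.h>

/* From the previous sketch: colored, OpenMP-parallel edge loop. */
void edge_matvec(int ncolors, const int *color_ptr, const int *edge_node,
                 const double *edge_coef, const double *x, double *y);

void hybrid_matvec(int nneigh, const int *neigh, double **sendbuf,
                   double **recvbuf, const int *count, MPI_Request *req,
                   int ncolors_ifc, const int *color_ptr_ifc,
                   int ncolors_int, const int *color_ptr_int,
                   const int *edge_node, const double *edge_coef,
                   const double *x, double *y)
{
    /* 1. Threaded work on interface edges first, so their partial results
          in y are available for sending as early as possible. */
    edge_matvec(ncolors_ifc, color_ptr_ifc, edge_node, edge_coef, x, y);

    /* 2. Post the non-blocking P2P exchange of the interface partial sums.
          (Packing y at interface nodes into sendbuf is omitted for brevity;
          req must hold 2 * nneigh requests.) */
    for (int i = 0; i < nneigh; i++) {
        MPI_Irecv(recvbuf[i], count[i], MPI_DOUBLE, neigh[i], 0,
                  MPI_COMM_WORLD, &req[i]);
        MPI_Isend(sendbuf[i], count[i], MPI_DOUBLE, neigh[i], 0,
                  MPI_COMM_WORLD, &req[nneigh + i]);
    }

    /* 3. Threaded work on interior edges overlaps the communication. */
    edge_matvec(ncolors_int, color_ptr_int, edge_node, edge_coef, x, y);

    /* 4. Complete the exchange; unpacking recvbuf and adding the
          neighbours' contributions into y is omitted for brevity. */
    MPI_Waitall(2 * nneigh, req, MPI_STATUSES_IGNORE);
}
```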

SLIDE 15

Benchmark systems

Hardware description

SLIDE 16

Benchmark Problem

Rayleigh-Bénard (Ra = 30,000, Pr = 0.71)

Assumptions:

  • 1. All mesh entities were reordered to exploit memory locality [1];
  • 2. Same mesh ordering on all systems;
  • 3. Same compiler (Intel) and compilation flags (-fast).

Mesh sizes:

                      MSH1         MSH2
Tetrahedra            178,605      39,140,625
Nodes                 39,688       7,969,752
Edges                 225,978      48,721,528
Flow equations        94,512       31,024,728
Transport equations   36,080       7,843,248

Natural convection problem.

[1] Coutinho et al., IJNME 66:431-460, 2006.

SLIDE 17

Tests summary

  • Parallel-model speedup per number of cores (intra-node);
  • Serial performance per processor;
  • Parallel performance per processor (intra-node);
  • MPI process placement (inter-node);
  • Large-scale run (case study).

SLIDE 18

Parallel models

Speedup per number of cores (intra-node): Clovertown vs. Nehalem

(Panels: SGI Altix-ICE (Clovertown); Nehalem server (Core i7))

SLIDE 19

Results

CPU comparison

(a) Serial performance per CPU (serial run) (b) Intra-node performance on 8 cores (2 CPUs × 4 cores in one node)

SLIDE 20

Results

Process placement effect (Cluster Dell/Harpertown)

Speedup considering best process placement

SLIDE 21

THE MULTICORE DILEMMA:

SANDIA and TACC’s statements:

“...more chip cores can mean slower supercomputing...”;
“...16 multicores perform barely as well as two for complex applications...”;
“...process placement in multi-core processors has a strong influence on performance...”;
“...more cores on a single chip don’t necessarily mean faster clock speeds...”.

The supermarket analogy (by Sandia): if two clerks at the same checkout counter are processing your groceries instead of one, the checkout should go faster. The problem is that if each clerk does not have access to the groceries, he does not necessarily help the process. Worse, the clerks may get in each other’s way.

SOURCES: Diamond, J. et al., “Multicore Optimization for Ranger”, TACC, 2009; https://share.sandia.gov/news/resources/news_releases/more-chip-cores-can-mean-slower-supercomputing-sandia-simulation-shows/

SLIDE 22

Results

Large scale run on 64 cores (case study using MSH2)

About 40M tets, 8M nodes, 50M edges, 31M flow equations, 8M transport equations

Time spent in 10 time steps

Communication graph

Tests performed on Cluster Dell (Harpertown)

SLIDE 23

Conclusions

Older Intel Xeon processors suffer dramatically when large workloads are imposed on a single CPU.

Consequence: process placement has a great influence on performance (...sadly, we should not fill up our nodes...).

Nehalem (Core i7) has several improvements over its predecessors:

  • third-level shared cache (now Intel has a true quad core...);
  • fast interconnect channel among processors (well, it sounds like AMD HyperTransport...).

  • The peer-to-peer MPI model is the best strategy to reach good parallel performance in EdgeCFD;
  • OpenMP performance in EdgeCFD is still poor, but it is getting better (for the same implementation!);
  • Many-core and GPU paradigms are bringing threaded parallelism back...

SLIDE 24

Special Thanks to

  • Dell Brazil (Cluster Dell);
  • Intel Brazil (Nehalem servers);
  • High Performance Computing Center (NACAD) (Altix-ICE and infrastructure);
  • Texas Advanced Computing Center (TACC) (Ranger);
  • All of you that attended this presentation;
  • The Brazilian soccer team (for winning today).

SLIDE 25

Evaluation of Message Passing Communication Patterns in Finite Element Solution of Coupled Problems

Renato N. Elias Jose J. Camata Albino A. Aveleda Alvaro L. G. A. Coutinho

High Performance Computing Center (NACAD) Federal University of Rio de Janeiro (UFRJ) Rio de Janeiro, Brazil
