Flat MPI vs. Hybrid: Evaluation of Parallel Programming Models for Preconditioned Iterative Solvers on T2K Open Supercomputer


SLIDE 1

Flat MPI vs. Hybrid: Evaluation of Parallel Programming Models for Preconditioned Iterative Solvers on “T2K Open Supercomputer”

Kengo NAKAJIMA

Information Technology Center The University of Tokyo

Second International Workshop on Parallel Programming Models and Systems Software for High-End Computing (P2S2), September 22, 2009, Vienna

to be held in conjunction with ICPP-09: The 38th International Conference on Parallel Processing

SLIDE 2

Topics of this Study

  • Preconditioned Iterative Sparse Matrix Solvers for FEM Applications
  • T2K Open Supercomputer (Tokyo) (T2K/Tokyo)
  • Hybrid vs. Flat MPI Parallel Programming Models
  • Optimization of Hybrid Parallel Programming Models
– NUMA Control
– First Touch
– Further Reordering of Data

SLIDE 3

TOC

  • Background
– Why Hybrid?
  • Target Application
– Overview
– HID
– Reordering
  • Preliminary Results
  • Remarks

SLIDE 4

T2K/Tokyo (1/2)

  • “T2K Open Supercomputer Alliance”
– http://www.open-supercomputer.org/
– Tsukuba, Tokyo, Kyoto
  • “T2K Open Supercomputer (Todai Combined Cluster)”
– by Hitachi
– operation started June 2008
– Total 952 nodes (15,232 cores), 141 TFLOPS peak

  • Quad-core Opteron (Barcelona)

– 27th in TOP500 (NOV 2008) (fastest in Japan at that time)

SLIDE 5

T2K/Tokyo (2/2)

  • AMD Quad-core Opteron (Barcelona), 2.3 GHz
  • 4 “sockets” per node
– 16 cores/node
  • Multi-core, multi-socket system
  • cc-NUMA architecture
– careful configuration needed
– local data ~ local memory

– To reduce memory traffic in the system, it is important to keep the data close to the cores that will work with the data (e.g. NUMA control).

[Figure: 16-core node with four sockets; each socket has four cores with private L1/L2 caches, a shared L3 cache, and local memory]

SLIDE 6

Flat MPI vs. Hybrid

Hybrid: Hierarchical Structure. Flat MPI: Each PE is Independent.

[Figure: in Flat MPI, each core is an independent PE with its own memory region; in Hybrid, the cores that share a memory form one hierarchical (threaded) unit]
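To make the two models concrete, here is a minimal hybrid MPI + OpenMP example in C (purely illustrative; the solver in this study is written in Fortran90). Each MPI process spawns a team of OpenMP threads that share that process's memory, whereas flat MPI runs one single-threaded MPI process per core.

    /* Minimal hybrid MPI + OpenMP sketch: one MPI process per socket or node,
     * several OpenMP threads per process sharing that process's memory. */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided, rank;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        #pragma omp parallel
        {
            /* e.g. Hybrid 4x4 on T2K/Tokyo: 4 processes per node, 4 threads each */
            printf("MPI rank %d, thread %d of %d\n",
                   rank, omp_get_thread_num(), omp_get_num_threads());
        }

        MPI_Finalize();
        return 0;
    }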

SLIDE 7

Flat MPI vs. Hybrid

  • Performance is determined by various parameters
  • Hardware
– core architecture itself
– peak performance
– memory bandwidth, latency
– network bandwidth, latency
– their balance
  • Software
– types: memory or network/communication bound
– problem size

SLIDE 8

Sparse Matrix Solvers by FEM, FDM …

  • Memory-Bound
– indirect accesses
– Hybrid (OpenMP) is more memory-bound
  • Latency-Bound for Parallel Computations
– communications occur only at domain boundaries
– small amount of messages
  • Exa-scale Systems
– O(10^8) cores
– communication overhead by MPI latency for > 10^8-way MPI
– expectations for Hybrid

  • 1/16 MPI processes for T2K/Tokyo

for (i = 0; i < N; i++) {                     /* CRS sparse matrix-vector product */
  for (k = Index[i]; k < Index[i+1]; k++) {   /* Index[]: row pointers, Index[0] = 0 */
    Y[i] += A[k] * X[Item[k]];                /* Item[]: column indices (indirect access) */
  }
}

SLIDE 9

Weak Scaling Results on ES

GeoFEM Benchmarks [KN 2003]

[Figure: weak-scaling performance (TFLOPS) vs. number of PEs (256-1280) on the Earth Simulator, for Flat MPI and Hybrid with large and small problem sizes]

  • Generally speaking, Hybrid is better for a large number of nodes
  • especially for a small problem size per node
– “less” memory bound


SLIDE 10

  • Background
– Why Hybrid?
  • Target Application
– Overview
– HID
– Reordering
  • Preliminary Results
  • Remarks

SLIDE 11

Target Application

  • 3D Elastic Problems with Heterogeneous Material Property
– E_max = 10^3, E_min = 10^-3, ν = 0.25
  • generated by the “sequential Gauss” algorithm for geo-statistics [Deutsch & Journel, 1998]
– 128^3 tri-linear hexahedral elements, 6,291,456 DOF
  • Strong Scaling
  • (SGS+CG) Iterative Solvers
– Symmetric Gauss-Seidel preconditioning
– HID-based domain decomposition
  • T2K/Tokyo
– 512 cores (32 nodes)
  • FORTRAN90 (Hitachi) + MPI
– Flat MPI, Hybrid (4x4, 8x2, 16x1)

SLIDE 12

HID: Hierarchical Interface Decomposition [Henon & Saad 2007]

  • Multilevel Domain Decomposition
– Extension of Nested Dissection
  • Non-overlapped Approach: Connectors, Separators
  • Suitable for Parallel Preconditioning Method

[Figure: HID applied to a 2D mesh partitioned among four domains (0-3); interior nodes form level-1 groups, nodes shared by two domains form level-2 connectors (e.g. 0,1 or 2,3), and nodes shared by all four domains form the level-4 connector (0,1,2,3)]

SLIDE 13

Parallel Preconditioned Iterative Solvers

  • DAXPY, SMVP, dot products
– easy
  • Factorization, forward/backward substitutions in preconditioning processes
– global dependency
– reordering required for parallelism: forming independent sets
– Multicolor Ordering (MC), Reverse Cuthill-McKee (RCM)
– works on the Earth Simulator [KN 2002, 2003]
  • both for parallel/vector performance
  • CM-RCM (Cyclic Multicoloring + RCM)
– robust and efficient
– elements of each color are independent, and are processed in parallel on an SMP/multicore node by OpenMP (see the sketch below)
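To illustrate this, here is a minimal C/OpenMP sketch of a multicolored forward substitution (the lower-triangular sweep of the SGS preconditioner). The array names (color_ptr, indexL, itemL, AL, D) are hypothetical and the actual solver is written in Fortran90, but the structure is the essential one: colors are processed sequentially, while the unknowns inside one color are mutually independent and are shared among the OpenMP threads.

    /* Multicolored forward substitution: solve (D + L) X = B color by color.
     * color_ptr[c] marks where color c starts; the lower-triangular part is
     * stored CRS-style in AL/indexL/itemL; D[] holds the diagonal entries. */
    void forward_substitution_mc(int n_colors, const int *color_ptr,
                                 const int *indexL, const int *itemL,
                                 const double *AL, const double *D,
                                 const double *B, double *X)
    {
        for (int ic = 0; ic < n_colors; ic++) {          /* colors: sequential */
            #pragma omp parallel for
            for (int i = color_ptr[ic]; i < color_ptr[ic+1]; i++) {
                double w = B[i];
                for (int k = indexL[i]; k < indexL[i+1]; k++) {
                    w -= AL[k] * X[itemL[k]];            /* itemL[k]: earlier colors only */
                }
                X[i] = w / D[i];
            }
        }
    }

The backward substitution is analogous, sweeping the colors in reverse order with the upper-triangular part of the matrix.
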
SLIDE 14

Ordering Methods

[Figure: node numbering of an 8x8 grid under three orderings: RCM (Reverse Cuthill-McKee), MC (Multicoloring, 4 colors), and CM-RCM (Cyclic Multicoloring + RCM, 4 colors)]

SLIDE 15

Effect of Ordering Methods on Convergence

[Figure: iterations for convergence vs. number of colors (1-1000, log scale) for MC (▲) and CM-RCM (●) orderings]

SLIDE 16

Re-Ordering by CM-RCM (5 colors, 8 threads)

[Figure: an initial vector is colored into 5 colors and reordered color by color; within each color the entries are split among 8 threads]

Elements in each color are independent, therefore parallel processing is possible; they are divided among OpenMP threads (8 threads in this case). Because all arrays are numbered according to “color”, discontinuous memory access may occur within each thread.

SLIDE 17

  • Background
– Why Hybrid?
  • Target Application
– Overview
– HID
– Reordering
  • Preliminary Results
  • Remarks

SLIDE 18

Flat MPI, Hybrid (4x4, 8x2, 16x1)

[Figure: assignment of MPI processes and OpenMP threads to the four sockets of a node for Flat MPI, Hybrid 4x4, Hybrid 8x2, and Hybrid 16x1]

SLIDE 19

CASES for Evaluation

  • Focused on optimization of HB 8x2 and HB 16x1
  • CASE-1
– initial case (CM-RCM)
– for evaluation of the NUMA control effect
  • specifies local core-memory configuration
  • CASE-2 (Hybrid only)
– First Touch
  • CASE-3 (Hybrid only)
– Further Data Reordering + First Touch
  • NUMA policy (0-5) for each case

SLIDE 20

Results of CASE-1, 32 nodes/512cores

computation time for linear solvers

  Policy ID   Command line switches
  0           no command line switches
  1           -cpunodebind=$SOCKET -interleave=all
  2           -cpunodebind=$SOCKET -interleave=$SOCKET
  3           -cpunodebind=$SOCKET -membind=$SOCKET
  4           -cpunodebind=$SOCKET -localalloc
  5           -localalloc

  Method     Iterations (CASE-1)   Best Policy
  Flat MPI   1264                  2
  HB 4x4     1261                  2
  HB 8x2     1216                  2
  HB 16x1    1244                  2

[Figure: relative performance of Flat MPI, HB 4x4, HB 8x2, and HB 16x1 under Policy 0 and the best policy (Policy 2), normalized by Flat MPI with Policy 0]

e.g. mpirun -np 64 -cpunodebind 0,1,2,3 a.out

SLIDE 21

First Touch Data Placement

  • ref. “Patterns for Parallel Programming” Mattson, T.G. et al.

To reduce memory traffic in the system, it is important to keep the data close to the PEs that will work with the data (e.g. NUMA control). On NUMA computers, this corresponds to making sure the pages of memory are allocated and “owned” by the PEs that will be working with the data contained in the page. The most common NUMA page-placement algorithm is the “first touch” algorithm, in which the PE first referencing a region of memory will have the page holding that memory assigned to it. A very common technique in OpenMP programs is to initialize data in parallel using the same loop schedule as will be used later in the computations.
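A minimal C/OpenMP sketch of this idea is given below (hypothetical function and array names; the study's solver is Fortran90). The arrays are allocated and then initialized in parallel with the same static schedule that the later solver loops use, so each memory page is first touched, and hence placed, on the socket that will actually work on it.

    /* First-touch placement sketch: initialize in parallel with the same
     * static schedule as the later compute loops, so each page is owned
     * by the socket of the thread that will use it. */
    #include <stdlib.h>
    #include <omp.h>

    void allocate_and_first_touch(double **x, double **y, int n)
    {
        *x = (double *)malloc(n * sizeof(double));
        *y = (double *)malloc(n * sizeof(double));

        #pragma omp parallel for schedule(static)
        for (int i = 0; i < n; i++) {
            (*x)[i] = 0.0;   /* the touching thread's socket owns this page */
            (*y)[i] = 0.0;
        }
    }

Combined with the NUMA control switches of the previous slides (e.g. -cpunodebind), this keeps each thread's data in the memory attached to its own socket.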

SLIDE 22

Further Re-Ordering for Continuous Memory Access (5 colors, 8 threads)

[Figure: starting from the CM-RCM ordering (5 colors, each split among 8 threads), the entries are renumbered so that each thread's entries across all colors form one contiguous block of memory]
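A possible sketch of such a permutation in C is shown below; it assumes hypothetical arrays (color_ptr for the color boundaries of the CM-RCM numbering, perm for the new numbering) and that every color divides evenly among the threads. It only illustrates the idea of the figure; it is not the code used in the study.

    /* Renumber unknowns so that each thread's entries, across all colors,
     * become contiguous. perm[old] gives the new position of entry 'old'. */
    void build_thread_major_perm(int n_colors, int n_threads,
                                 const int *color_ptr, int *perm)
    {
        int new_id = 0;
        for (int t = 0; t < n_threads; t++) {            /* thread-major ... */
            for (int c = 0; c < n_colors; c++) {         /* ... then color   */
                int size  = color_ptr[c+1] - color_ptr[c];
                int chunk = size / n_threads;            /* assume even division */
                for (int k = 0; k < chunk; k++) {
                    perm[color_ptr[c] + t * chunk + k] = new_id++;
                }
            }
        }
    }

With this numbering, each thread's entries form one contiguous block, which also lines up naturally with the first-touch initialization of the previous slide.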

SLIDE 23

Improvement: CASE-1 ⇒ CASE-3

[Figure: relative performance of Flat MPI, HB 4x4, HB 8x2, and HB 16x1 for the initial case and CASE-1 through CASE-3, normalized by the best performance of Flat MPI; 32 nodes, 512 cores, 196,608 DOF/node]

CASE-1: NUMA control, CASE-2: + First Touch, CASE-3: + Further Reordering

SLIDE 24

Improvement: CASE-1 ⇒ CASE-3

[Figure: relative performance of Flat MPI, HB 4x4, HB 8x2, and HB 16x1 for the initial case and CASE-1 through CASE-3, normalized by the best performance of Flat MPI; two panels: 32 nodes/512 cores (196,608 DOF/node) and 8 nodes/128 cores (786,432 DOF/node)]

SLIDE 25

Strong Scalability (Best Cases)

32-512 cores; performance of Flat MPI with 32 cores = 32.0

[Figure: speed-up vs. number of cores (128-512) for Flat MPI, HB 4x4, HB 8x2, and HB 16x1, compared with ideal scaling]

SLIDE 26

Relative Performance for Strong Scaling (Best Cases)

32-512 cores; normalized by the best Flat MPI at each core count

[Figure: relative performance of HB 4x4, HB 8x2, and HB 16x1 vs. number of cores (32-512)]

SLIDE 27

  • Background
– Why Hybrid?
  • Target Application
– Overview
– HID
– Reordering
  • Preliminary Results
  • Remarks

SLIDE 28

Summary & Future Works

  • HID for ill-conditioned problems on T2K/Tokyo
– Hybrid/Flat MPI, CM-RCM reordering
  • Hybrid 4x4 and Flat MPI are competitive
  • Data locality and continuous memory access by (further re-ordering + first touch) provide significant improvement for Hybrid 8x2/16x1
  • Performance of Hybrid improves with many cores and a smaller problem size per core (strong scaling)
  • Future Works
– Higher order of fill-ins: BILU(p)
– Extension to multigrid-type solvers/preconditioning
– Considering page size for optimization
– Sophisticated models for performance prediction/evaluation

SLIDE 29

Summary & Future Works (cont.)

  • Improvement of Flat MPI
– Current “Flat MPI” is not really flat
– Socket, Node, Node-to-Node
  • Extension to GPGPU

[Figure: 16-core node with four sockets; each socket has four cores with private L1/L2 caches, a shared L3 cache, and local memory]

SLIDE 30

GPGPU Community

  • Coalesced Access (the better one)
  • Sequential Access

[Figure: the color-ordered vector and the further-reordered (thread-contiguous) vector from the previous slides, contrasted as coalesced vs. sequential access patterns]