Flat MPI vs. Hybrid: Evaluation of Parallel Programming Models for Preconditioned Iterative Solvers on T2K Open Supercomputer


SLIDE 1

Flat MPI vs. Hybrid: Evaluation of Parallel Programming Models for Preconditioned Iterative Solvers on “T2K Open Supercomputer”

Kengo NAKAJIMA

Information Technology Center The University of Tokyo

Second International Workshop on Parallel Programming Models and Systems Software for High-End Computing (P2S2), September 22, 2009, Vienna

to be held in conjunction with ICPP-09: The 38th International Conference on Parallel Processing

SLIDE 2

Topics of this Study

  • Preconditioned Iterative Sparse Matrix Solvers for FEM Applications
  • T2K Open Supercomputer (Tokyo) (T2K/Tokyo)
  • Hybrid vs. Flat MPI Parallel Programming Models
  • Optimization of Hybrid Parallel Programming Models
– NUMA Control
– First Touch
– Further Reordering of Data

SLIDE 3

TOC

  • Background
– Why Hybrid?
  • Target Application
– Overview
– HID
– Reordering
  • Preliminary Results
  • Remarks

SLIDE 4

T2K/Tokyo (1/2)

  • “T2K Open Supercomputer Alliance”
– http://www.open-supercomputer.org/
– Tsukuba, Tokyo, Kyoto
  • “T2K Open Supercomputer (Todai Combined Cluster)”
– by Hitachi
– operation started June 2008
– Total 952 nodes (15,232 cores), 141 TFLOPS peak

  • Quad-core Opteron (Barcelona)

– 27th in TOP500 (NOV 2008) (fastest in Japan at that time)

SLIDE 5

T2K/Tokyo (2/2)

  • AMD Quad-core Opteron (Barcelona), 2.3 GHz
  • 4 “sockets” per node
– 16 cores/node
  • Multi-core, multi-socket system
  • cc-NUMA architecture
– careful configuration needed
– local data ~ local memory

– To reduce memory traffic in the system, it is important to keep the data close to the cores that will work with the data (e.g. NUMA control).

[Figure: 16-core node with four sockets; each socket has four cores with private L1/L2 caches, a shared L3 cache, and local memory]

SLIDE 6

Flat MPI vs. Hybrid

Hybrid: Hierarchical Structure. Flat MPI: Each PE is Independent.

[Figure: in Flat MPI, each core is an independent PE with its own memory region; in Hybrid, the cores that share a memory form one hierarchical (threaded) unit]
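To make the two models concrete, here is a minimal hybrid MPI + OpenMP example in C (purely illustrative; the solver in this study is written in Fortran90). Each MPI process spawns a team of OpenMP threads that share that process's memory, whereas flat MPI runs one single-threaded MPI process per core.

    /* Minimal hybrid MPI + OpenMP sketch: one MPI process per socket or node,
     * several OpenMP threads per process sharing that process's memory. */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided, rank;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        #pragma omp parallel
        {
            /* e.g. Hybrid 4x4 on T2K/Tokyo: 4 processes per node, 4 threads each */
            printf("MPI rank %d, thread %d of %d\n",
                   rank, omp_get_thread_num(), omp_get_num_threads());
        }

        MPI_Finalize();
        return 0;
    }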

SLIDE 7

Flat MPI vs. Hybrid

  • Performance is determined by various parameters
  • Hardware
– core architecture itself
– peak performance
– memory bandwidth, latency
– network bandwidth, latency
– their balance
  • Software
– types: memory or network/communication bound
– problem size

SLIDE 8

Sparse Matrix Solvers by FEM, FDM …

  • Memory-Bound
– indirect accesses
– Hybrid (OpenMP) is more memory-bound
  • Latency-Bound for Parallel Computations
– communications occur only at domain boundaries
– small amount of messages
  • Exa-scale Systems
– O(10^8) cores
– communication overhead by MPI latency for > 10^8-way MPI
– expectations for Hybrid

  • 1/16 MPI processes for T2K/Tokyo

for (i = 0; i < N; i++) {                     /* CRS sparse matrix-vector product */
  for (k = Index[i]; k < Index[i+1]; k++) {   /* Index[]: row pointers, Index[0] = 0 */
    Y[i] += A[k] * X[Item[k]];                /* Item[]: column indices (indirect access) */
  }
}

SLIDE 9

Weak Scaling Results on ES

GeoFEM Benchmarks [KN 2003]

[Figure: weak-scaling performance (TFLOPS) vs. number of PEs (256-1280) on the Earth Simulator, for Flat MPI and Hybrid with large and small problem sizes]

  • Generally speaking, Hybrid is better for a large number of nodes
  • especially for a small problem size per node
– “less” memory bound


SLIDE 10

  • Background
– Why Hybrid?
  • Target Application
– Overview
– HID
– Reordering
  • Preliminary Results
  • Remarks

SLIDE 11

Target Application

  • 3D Elastic Problems with Heterogeneous Material Property
– E_max = 10^3, E_min = 10^-3, ν = 0.25
  • generated by the “sequential Gauss” algorithm for geo-statistics [Deutsch & Journel, 1998]
– 128^3 tri-linear hexahedral elements, 6,291,456 DOF
  • Strong Scaling
  • (SGS+CG) Iterative Solvers
– Symmetric Gauss-Seidel preconditioning
– HID-based domain decomposition
  • T2K/Tokyo
– 512 cores (32 nodes)
  • FORTRAN90 (Hitachi) + MPI
– Flat MPI, Hybrid (4x4, 8x2, 16x1)

SLIDE 12

HID: Hierarchical Interface Decomposition [Henon & Saad 2007]

  • Multilevel Domain Decomposition
– Extension of Nested Dissection
  • Non-overlapped Approach: Connectors, Separators
  • Suitable for Parallel Preconditioning Method

[Figure: HID applied to a 2D mesh partitioned among four domains (0-3); interior nodes form level-1 groups, nodes shared by two domains form level-2 connectors (e.g. 0,1 or 2,3), and nodes shared by all four domains form the level-4 connector (0,1,2,3)]

SLIDE 13

Parallel Preconditioned Iterative Solvers

  • DAXPY, SMVP, dot products
– easy
  • Factorization, forward/backward substitutions in preconditioning processes
– global dependency
– reordering required for parallelism: forming independent sets
– Multicolor Ordering (MC), Reverse Cuthill-McKee (RCM)
– works on the Earth Simulator [KN 2002, 2003]
  • both for parallel/vector performance
  • CM-RCM (Cyclic Multicoloring + RCM)
– robust and efficient
– elements of each color are independent, and are processed in parallel on an SMP/multicore node by OpenMP (see the sketch below)
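To illustrate this, here is a minimal C/OpenMP sketch of a multicolored forward substitution (the lower-triangular sweep of the SGS preconditioner). The array names (color_ptr, indexL, itemL, AL, D) are hypothetical and the actual solver is written in Fortran90, but the structure is the essential one: colors are processed sequentially, while the unknowns inside one color are mutually independent and are shared among the OpenMP threads.

    /* Multicolored forward substitution: solve (D + L) X = B color by color.
     * color_ptr[c] marks where color c starts; the lower-triangular part is
     * stored CRS-style in AL/indexL/itemL; D[] holds the diagonal entries. */
    void forward_substitution_mc(int n_colors, const int *color_ptr,
                                 const int *indexL, const int *itemL,
                                 const double *AL, const double *D,
                                 const double *B, double *X)
    {
        for (int ic = 0; ic < n_colors; ic++) {          /* colors: sequential */
            #pragma omp parallel for
            for (int i = color_ptr[ic]; i < color_ptr[ic+1]; i++) {
                double w = B[i];
                for (int k = indexL[i]; k < indexL[i+1]; k++) {
                    w -= AL[k] * X[itemL[k]];            /* itemL[k]: earlier colors only */
                }
                X[i] = w / D[i];
            }
        }
    }

The backward substitution is analogous, sweeping the colors in reverse order with the upper-triangular part of the matrix.
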
SLIDE 14

Ordering Methods

[Figure: node numbering of an 8x8 grid under three orderings: RCM (Reverse Cuthill-McKee), MC (Multicoloring, 4 colors), and CM-RCM (Cyclic Multicoloring + RCM, 4 colors)]

SLIDE 15

Effect of Ordering Methods on Convergence

[Figure: iterations for convergence vs. number of colors (1-1000, log scale) for MC (▲) and CM-RCM (●) orderings]

SLIDE 16

Re-Ordering by CM-RCM (5 colors, 8 threads)

[Figure: an initial vector is colored into 5 colors and reordered color by color; within each color the entries are split among 8 threads]

Elements in each color are independent, therefore parallel processing is possible; they are divided among OpenMP threads (8 threads in this case). Because all arrays are numbered according to “color”, discontinuous memory access may occur within each thread.

SLIDE 17

  • Background
– Why Hybrid?
  • Target Application
– Overview
– HID
– Reordering
  • Preliminary Results
  • Remarks

SLIDE 18

Flat MPI, Hybrid (4x4, 8x2, 16x1)

[Figure: assignment of MPI processes and OpenMP threads to the four sockets of a node for Flat MPI, Hybrid 4x4, Hybrid 8x2, and Hybrid 16x1]

SLIDE 19

CASES for Evaluation

  • Focused on optimization of HB 8x2 and HB 16x1
  • CASE-1
– initial case (CM-RCM)
– for evaluation of the NUMA control effect
  • specifies local core-memory configuration
  • CASE-2 (Hybrid only)
– First Touch
  • CASE-3 (Hybrid only)
– Further Data Reordering + First Touch
  • NUMA policy (0-5) for each case

SLIDE 20

Results of CASE-1, 32 nodes/512cores

computation time for linear solvers

  Policy ID   Command line switches
  0           no command line switches
  1           -cpunodebind=$SOCKET -interleave=all
  2           -cpunodebind=$SOCKET -interleave=$SOCKET
  3           -cpunodebind=$SOCKET -membind=$SOCKET
  4           -cpunodebind=$SOCKET -localalloc
  5           -localalloc

  Method     Iterations (CASE-1)   Best Policy
  Flat MPI   1264                  2
  HB 4x4     1261                  2
  HB 8x2     1216                  2
  HB 16x1    1244                  2

[Figure: relative performance of Flat MPI, HB 4x4, HB 8x2, and HB 16x1 under Policy 0 and the best policy (Policy 2), normalized by Flat MPI with Policy 0]

e.g. mpirun -np 64 -cpunodebind 0,1,2,3 a.out

SLIDE 21

First Touch Data Placement

  • ref. “Patterns for Parallel Programming” Mattson, T.G. et al.

To reduce memory traffic in the system, it is important to keep the data close to the PEs that will work with the data (e.g. NUMA control). On NUMA computers, this corresponds to making sure the pages of memory are allocated and “owned” by the PEs that will be working with the data contained in the page. The most common NUMA page-placement algorithm is the “first touch” algorithm, in which the PE first referencing a region of memory will have the page holding that memory assigned to it. A very common technique in OpenMP programs is to initialize data in parallel using the same loop schedule as will be used later in the computations.
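A minimal C/OpenMP sketch of this idea is given below (hypothetical function and array names; the study's solver is Fortran90). The arrays are allocated and then initialized in parallel with the same static schedule that the later solver loops use, so each memory page is first touched, and hence placed, on the socket that will actually work on it.

    /* First-touch placement sketch: initialize in parallel with the same
     * static schedule as the later compute loops, so each page is owned
     * by the socket of the thread that will use it. */
    #include <stdlib.h>
    #include <omp.h>

    void allocate_and_first_touch(double **x, double **y, int n)
    {
        *x = (double *)malloc(n * sizeof(double));
        *y = (double *)malloc(n * sizeof(double));

        #pragma omp parallel for schedule(static)
        for (int i = 0; i < n; i++) {
            (*x)[i] = 0.0;   /* the touching thread's socket owns this page */
            (*y)[i] = 0.0;
        }
    }

Combined with the NUMA control switches of the previous slides (e.g. -cpunodebind), this keeps each thread's data in the memory attached to its own socket.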

SLIDE 22

Further Re-Ordering for Continuous Memory Access (5 colors, 8 threads)

[Figure: starting from the CM-RCM ordering (5 colors, each split among 8 threads), the entries are renumbered so that each thread's entries across all colors form one contiguous block of memory]
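A possible sketch of such a permutation in C is shown below; it assumes hypothetical arrays (color_ptr for the color boundaries of the CM-RCM numbering, perm for the new numbering) and that every color divides evenly among the threads. It only illustrates the idea of the figure; it is not the code used in the study.

    /* Renumber unknowns so that each thread's entries, across all colors,
     * become contiguous. perm[old] gives the new position of entry 'old'. */
    void build_thread_major_perm(int n_colors, int n_threads,
                                 const int *color_ptr, int *perm)
    {
        int new_id = 0;
        for (int t = 0; t < n_threads; t++) {            /* thread-major ... */
            for (int c = 0; c < n_colors; c++) {         /* ... then color   */
                int size  = color_ptr[c+1] - color_ptr[c];
                int chunk = size / n_threads;            /* assume even division */
                for (int k = 0; k < chunk; k++) {
                    perm[color_ptr[c] + t * chunk + k] = new_id++;
                }
            }
        }
    }

With this numbering, each thread's entries form one contiguous block, which also lines up naturally with the first-touch initialization of the previous slide.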

SLIDE 23

Improvement: CASE-1 ⇒ CASE-3

[Figure: relative performance of Flat MPI, HB 4x4, HB 8x2, and HB 16x1 for the initial case and CASE-1 through CASE-3, normalized by the best performance of Flat MPI; 32 nodes, 512 cores, 196,608 DOF/node]

CASE-1: NUMA control, CASE-2: + First Touch, CASE-3: + Further Reordering

SLIDE 24

Improvement: CASE-1 ⇒ CASE-3

[Figure: relative performance of Flat MPI, HB 4x4, HB 8x2, and HB 16x1 for the initial case and CASE-1 through CASE-3, normalized by the best performance of Flat MPI; two panels: 32 nodes/512 cores (196,608 DOF/node) and 8 nodes/128 cores (786,432 DOF/node)]

SLIDE 25

Strong Scalability (Best Cases)

32-512 cores; performance of Flat MPI with 32 cores = 32.0

[Figure: speed-up vs. number of cores (128-512) for Flat MPI, HB 4x4, HB 8x2, and HB 16x1, compared with ideal scaling]

SLIDE 26

Relative Performance for Strong Scaling (Best Cases)

32-512 cores; normalized by the best Flat MPI at each core count

[Figure: relative performance of HB 4x4, HB 8x2, and HB 16x1 vs. number of cores (32-512)]

SLIDE 27

  • Background
– Why Hybrid?
  • Target Application
– Overview
– HID
– Reordering
  • Preliminary Results
  • Remarks

SLIDE 28

Summary & Future Works

  • HID for ill-conditioned problems on T2K/Tokyo
– Hybrid/Flat MPI, CM-RCM reordering
  • Hybrid 4x4 and Flat MPI are competitive
  • Data locality and continuous memory access by (further re-ordering + first touch) provide significant improvement for Hybrid 8x2/16x1
  • Performance of Hybrid improves with many cores and a smaller problem size per core (strong scaling)
  • Future Works
– Higher order of fill-ins: BILU(p)
– Extension to multigrid-type solvers/preconditioning
– Considering page size for optimization
– Sophisticated models for performance prediction/evaluation

SLIDE 29

Summary & Future Works (cont.)

  • Improvement of Flat MPI
– Current “Flat MPI” is not really flat
– Socket, Node, Node-to-Node
  • Extension to GPGPU

[Figure: 16-core node with four sockets; each socket has four cores with private L1/L2 caches, a shared L3 cache, and local memory]

SLIDE 30

GPGPU Community

  • Coalesced Access (the better one)
  • Sequential Access

[Figure: the color-ordered vector and the further-reordered (thread-contiguous) vector from the previous slides, contrasted as coalesced vs. sequential access patterns]