A Parallel Numerical Solver Using Hierarchically Tiled Arrays
James Brodman, G. Carl Evans, Murat Manguoglu, Ahmed Sameh, María J. Garzarán, and David Padua
Outline
- Motivation
- Hierarchically Tiled Arrays (HTAs)
- SPIKE
- HTA SPIKE
- Conclusions
Motivation
- Still much room for advancement in parallel programming
- Want to facilitate parallel programming but still provide control over performance
– Raise the level of abstraction
- Higher level data parallel operators and their associated aggregate data types
- Write programs as a sequence of operators
– Can easily reason about the resulting program
– Portable across any platform that implements the desired operators
– Control performance through:
- Data Parallelism
- Hierarchical Tiling
– Who should tile?
Tiling and Compilers (Matrix Multiplication)
[Chart: matrix multiplication MFLOPS vs. matrix size. Intel MKL is roughly 20x faster than the code generated by icc -O3 -xT or icc -O3. Clearly, the compiler could do better at tiling.]
Hierarchically Tiled Array
- The Hierarchically Tiled Array, or HTA, is a data type based on higher level data parallel operators
– Fortran 90 array operators
– Hierarchical Tiling
- Makes Tiles first class objects
– Recognizes the importance of tiling to control:
- Locality
- Data Distribution
– Referenced explicitly, not implicitly generated by the compiler
– Operators extended to function on tiles
– C++ Library Implementations:
- Distributed-Memory using MPI
- Shared-Memory using TBB
HTAs, Illustrated
[Figure: HTA tiles mapped onto the memory hierarchy: cluster, node, multicore, L2 cache, L1 cache, registers.]
HTA Indexing
- A[0,3]: flattened access
- A(0,0): tile access
- A(0,1)[0,1]: hybrid access
- A(1, 0:1): simplified syntax
HTA Assignment
lhs = rhs;
lhs(:,:) = rhs(:,:);
lhs(0,:) = rhs(1,:);
lhs(0,:)[1,:] = rhs(1,:)[0,:];
– Assignment between tiles implies communication
Higher Level HTA Operations
- Element‐by‐element operations
– +, -, *, /, …
- Reductions
- Transpositions
– Tile
– Data
- MapReduce
- Matrix Multiplication
- FFT
- etc.
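To give the flavor of composing such operators without assuming the HTA library's exact API, here is a minimal stand-in in plain C++ using std::valarray, whose whole-array operators compose in the same Fortran 90 style (illustrative only, not HTA code):

#include <iostream>
#include <valarray>

int main() {
    // Stand-in for HTA-style whole-array operators on untiled data:
    // element-by-element arithmetic followed by a reduction.
    std::valarray<double> a = {1, 2, 3, 4};
    std::valarray<double> b = {5, 6, 7, 8};

    std::valarray<double> c = a + b;   // element-by-element addition
    double total = c.sum();            // reduction over all elements

    std::cout << total << "\n";        // prints 36
}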
User-Defined Operations
- Programmers can create new complex parallel operators through the primitive hmap
– Applies user defined operators to each tile of the HTA
- And corresponding tiles if multiple HTAs are involved
– Operator applied in parallel across tiles
– Assumes operation on tiles is independent
HTA a, b;
b = a.map( F() );
[Figure: F() applied independently to each tile of a to produce the corresponding tile of b.]
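A minimal stand-in sketch of this model (not the HTA library's API: here a "tile" is just a std::vector<double>, and a parallel for-each plays the role of hmap):

#include <algorithm>
#include <execution>
#include <vector>

// User-defined operator applied to one tile at a time; all work stays
// inside the tile, so tiles can be processed independently.
struct Scale {
    double factor;
    void operator()(std::vector<double>& tile) const {
        for (double& x : tile) x *= factor;
    }
};

int main() {
    // Eight independent "tiles" of 1024 elements each.
    std::vector<std::vector<double>> tiles(8, std::vector<double>(1024, 1.0));
    // Because the per-tile operation is independent, it may run in parallel.
    std::for_each(std::execution::par, tiles.begin(), tiles.end(), Scale{2.0});
}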
SPIKE
- Parallel solver for banded linear systems of equations
- SPIKE family of algorithms originally by Prof. Ahmed Sameh
- http://software.intel.com/en-us/articles/intel-adaptive-spike-based-solver/
- SPIKE's blocks map naturally to tiles in HTA
SPIKE Algorithm
- Name derived from the matrix formed by multiplying the original block tri-diagonal system by the inverse of the blocks of A
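In standard SPIKE notation (a sketch; the symbols are mine, not taken from the slides): partition A into p diagonal blocks A_j with off-diagonal coupling blocks B_j and C_j, and let D = diag(A_1, ..., A_p). Then

\[
S = D^{-1}A =
\begin{bmatrix}
I & V_1 & & \\
W_2 & I & V_2 & \\
 & \ddots & \ddots & \ddots \\
 & & W_p & I
\end{bmatrix},
\qquad
V_j = A_j^{-1}\hat{B}_j, \quad W_j = A_j^{-1}\hat{C}_j,
\]

where \hat{B}_j and \hat{C}_j are B_j and C_j padded with zeros to full block height. The columns V_j and W_j are the "spikes" that give the algorithm its name, and solving Ax = f splits into Dg = f followed by Sx = g.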
SPIKE Algorithm 2
- Most SPIKE algorithms proceed as follows:
1. Compute the Spike matrix, S, in parallel
- Compute an LU factorization and solve instead of directly computing the inverse
- LAPACK routines DGBTRF/DGBTRS (see the sketch after this list)
2. Form a reduced system of much smaller size
3. Solve the reduced system directly
4. Retrieve the global solution in parallel
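As a sketch of step 1 for a single diagonal block (illustrative code, not the paper's implementation; the Fortran LAPACK symbols are declared by hand here), DGBTRF factors the banded block and DGBTRS applies that factorization to the right-hand sides:

#include <vector>

// Fortran LAPACK prototypes (trailing-underscore convention).
extern "C" {
void dgbtrf_(const int* m, const int* n, const int* kl, const int* ku,
             double* ab, const int* ldab, int* ipiv, int* info);
void dgbtrs_(const char* trans, const int* n, const int* kl, const int* ku,
             const int* nrhs, const double* ab, const int* ldab, const int* ipiv,
             double* b, const int* ldb, int* info);
}

// Factor one banded diagonal block A_j (order n, bandwidths kl/ku) held in
// LAPACK band storage ab, then solve A_j * x = rhs in place for nrhs columns.
void factor_and_solve_block(int n, int kl, int ku,
                            std::vector<double>& ab, std::vector<int>& ipiv,
                            std::vector<double>& rhs, int nrhs) {
    const int ldab = 2 * kl + ku + 1;   // leading dimension required by DGBTRF
    int info = 0;
    dgbtrf_(&n, &n, &kl, &ku, ab.data(), &ldab, ipiv.data(), &info);
    const char trans = 'N';
    dgbtrs_(&trans, &n, &kl, &ku, &nrhs, ab.data(), &ldab, ipiv.data(),
            rhs.data(), &n, &info);
}

Each diagonal block can be handled this way independently, which is why step 1 parallelizes across partitions.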
SPIKE Reduced System
- Extracts a smaller, independent subsystem from the original system
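Schematically (my notation, following the usual SPIKE presentation): only the top and bottom m rows of each spike, V_j^{(t)}, V_j^{(b)}, W_j^{(t)}, W_j^{(b)}, and the corresponding rows of g enter this subsystem. Collecting those interface rows for all partitions gives a reduced system

\[
\hat{S}\,\hat{x} = \hat{g},
\]

of size roughly 2m(p-1) instead of n, where m is the coupling-block size and p the number of partitions. Its solution provides the unknowns at the partition interfaces, after which the interior unknowns are recovered independently within each partition (step 4).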
SPIKE TU and TA
- SPIKE has several algorithmic variants
- Recent focus has been on the variants referred to as “Truncated”
- Faster than the full SPIKE algorithm
– Reduces computation and has greater parallelism
– Only applicable to diagonally dominant systems (see the note after this list)
- Can wrap with an iterative method for non-diagonally dominant systems
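Why truncation works (a sketch of the standard argument, notation mine): when A is diagonally dominant, the entries of the spikes decay in magnitude away from the partition interfaces, so the far tips V_j^{(t)} and W_j^{(b)} can be set to zero. The reduced system then decouples into one independent 2m x 2m system per interface,

\[
\begin{bmatrix} I_m & V_j^{(b)} \\ W_{j+1}^{(t)} & I_m \end{bmatrix}
\begin{bmatrix} x_j^{(b)} \\ x_{j+1}^{(t)} \end{bmatrix}
=
\begin{bmatrix} g_j^{(b)} \\ g_{j+1}^{(t)} \end{bmatrix},
\]

which is where the reduced computation and extra parallelism come from.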
Truncated Savings 1 and 2
[Figures illustrating the savings from the truncated variants.]
TU and TA
- The truncated variants come in several flavors; the two that we implemented are TU and TA
– T – Truncated
– U – Uses LU and UL factorizations
– A – Uses alternating LU and UL factorizations
- The difference is in how the work is distributed to the processors, not in what work is done.
TA Work Distribution
[Figure: work distribution across processors for the TA variant.]
TU Work Distribution
[Figure: work distribution across processors for the TU variant.]
HTA SPIKE
- Blocks of SPIKE map very well to HTA tiles
- Factorization and Solving of tiles parallelized with hmap
- Reduced System formed using array assignments
[Figure: the original banded system and its HTA representation; the blocks become tiles of the HTAs LUA, ULA, BC, and r used in the code below.]
TU Implementation
...
// Factorize blocks of A
LUA.hmap(factorize_lua());
ULA.hmap(factorize_ula());

// Calculate the spike tips W(t) and V(b) from the Bs and Cs
BC.hmap(solve_bc(), LUA, ULA);

// Update the right-hand side
g.hmap(solve_lua(), LUA);

// Form the reduced system
REDUCED()[0:m-1, m:2*m-1] = BC(0:blks-2)[0:m-1, 0:m-1];
REDUCED()[m:2*m-1, 0:m-1] = BC(1:blks-1)[0:m-1, m:2*m-1];

// Form the reduced system RHS
greduced()[0:m-1] = g(0:blks-2)[blocksize-m:blocksize-1];
greduced()[m:2*m-1] = g(1:blks-1)[0:m-1];

// Factorize the reduced system
REDUCED.hmap(factorize());

// Solve the reduced system
greduced.hmap(solve(), REDUCED);

// Update RHS as r = r - Bz - Cz
fv = r(0:num_blocks-2);
fr_half = greduced()[0:m-1];
B.map(dgemv(), fv, fr_half);
r(0:num_blocks-2) = fv;
fw = r(1:num_blocks-1);
fr_half = greduced()[m:2*m-1];
C.map(dgemv(), fw, fr_half);
r(1:num_blocks-1) = fw;

// Solve the updated system
r.hmap(solve_lua(), LUA);
...
Portability
- The HTA programming model expresses programs as a sequence of higher level data parallel operators
– Portable as long as the appropriate implementations of the operators exist for the target platform
– For SPIKE, array assignments, tile indexing, and hmap.
- The code on the last slide runs on both the MPI and TBB implementations of the HTA library
– Simply change which header to include.
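For instance, the backend could be selected with a preprocessor switch along these lines (the header names and macro here are hypothetical, not the library's actual ones):

// Hypothetical header names; the real HTA library headers may differ.
#ifdef HTA_MPI
  #include "htalib_mpi.h"    // distributed-memory implementation (MPI)
#else
  #include "htalib_shmem.h"  // shared-memory implementation (TBB)
#endif
// The SPIKE code itself is unchanged; only this include (and how the program
// is launched, e.g. mpirun for the MPI build) differs between platforms.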
SPIKE TU Performance
[Chart: speedup over sequential MKL vs. number of CPUs (5-35) for HTA SHM, HTA MPI, and SpikePACK. 32-core Intel Xeon 7555 (Nehalem) @ 1.86 GHz.]
Cluster TU Performance
[Chart: speedup over sequential MKL vs. number of CPUs (5-35) for HTA MPI and SpikePACK. Cluster of Intel Xeon X5550 (Nehalem) @ 2.66 GHz.]
SPIKE TA Performance
[Chart: speedup over sequential MKL vs. number of CPUs (10-40) for HTA SHM, HTA MPI, and SpikePACK. 32-core Intel Xeon 7555 (Nehalem) @ 1.86 GHz.]
Cluster TA Performance
[Chart: speedup over sequential MKL vs. number of CPUs (10-40) for HTA MPI and SpikePACK. Cluster of Intel Xeon X5550 (Nehalem) @ 2.66 GHz.]
Conclusions
- The SPIKE algorithms are a good fit for the HTA programming model
– First class tiles
– Data parallel operators
– Portable
- Clean, compact code
– Our implementation takes ~50 lines of code
– SpikePACK’s Fortran+MPI? … a lot more.
– A challenge – one of the most complex algorithms we’ve looked at using HTAs
- Good performance
– Original goal was only to match performance of Fortran+MPI
Future Work
- Sparse Tiling
– Allow programmers to express the structure of the original system
- Support for Heterogeneous Architectures
– Clusters of Multicores, GPUs
- Look at other complex algorithms
- Linear algebra libraries and data layout
Questions?
Multithreaded MKL Performance
[Chart: speedup of multithreaded MKL vs. number of processors (1-8); the speedup axis ranges up to about 2.5.]
Why is Raising the Level of Abstraction Good?
1) Allows programmers to focus on the algorithm rather than on the implementation
2) Data parallel programs resemble conventional, sequential programs
– Parallelism is encapsulated
– Parallelism is structured
3) Compact and readable
4) Portable
– Run on any class of machine with suitable implementations of the operators
– Multicore/Clusters/GPUs
5) Programmers have control of determinacy.
– Implicit barriers exist between operators. If operators do not alter any shared state, the program is deterministic.
6) Facilitates Optimization
– Exposes tunable parameters