A Parallel Numerical Solver Using Hierarchically Tiled Arrays
James Brodman, G. Carl Evans, Murat Manguoglu, Ahmed Sameh, María J. Garzarán, and David Padua
Outline
- Motivation
- Hierarchically Tiled Arrays (HTAs)
- SPIKE
- HTA SPIKE
- Conclusions
Motivation
- Still much room for advancement in parallel programming
- Want to facilitate parallel programming but still provide control over performance
– Raise the level of abstraction
- Higher level data parallel operators and their associated aggregate data types
- Write programs as a sequence of operators
– Can easily reason about the resulting program
– Portable across any platform that implements the desired operators
– Control performance through:
- Data Parallelism
- Hierarchical Tiling
– Who should tile?
Tiling and Compilers (Matrix Multiplication)
[Chart: matrix multiplication MFLOPS vs. matrix size. Intel MKL is roughly 20x faster than the code generated by icc -O3 -xT or icc -O3. Clearly, the compiler could do better at tiling.]
Hierarchically Tiled Array
- The Hierarchically Tiled Array, or HTA, is a data type based on higher level data parallel operators
– Fortran 90 array operators
– Hierarchical Tiling
- Makes Tiles first class objects
– Recognizes the importance of tiling to control:
- Locality
- Data Distribution
– Referenced explicitly, not implicitly generated by the compiler
– Operators extended to function on tiles
– C++ Library Implementations:
- Distributed-Memory using MPI
- Shared-Memory using TBB
HTAs, Illustrated
[Figure: HTA tiles mapped onto the memory hierarchy: cluster, node, multicore, L2 cache, L1 cache, registers.]
HTA Indexing
- A[0,3]: flattened access
- A(0,0): tile access
- A(0,1)[0,1]: hybrid access
- A(1, 0:1): simplified syntax
HTA Assignment
lhs = rhs;
lhs(:,:) = rhs(:,:);
lhs(0,:) = rhs(1,:);
lhs(0,:)[1,:] = rhs(1,:)[0,:];
– Assignment between tiles implies communication
Higher Level HTA Operations
- Element‐by‐element operations
– +, -, *, /, …
- Reductions
- Transpositions
– Tile
– Data
- MapReduce
- Matrix Multiplication
- FFT
- etc.
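To give the flavor of composing such operators without assuming the HTA library's exact API, here is a minimal stand-in in plain C++ using std::valarray, whose whole-array operators compose in the same Fortran 90 style (illustrative only, not HTA code):

#include <iostream>
#include <valarray>

int main() {
    // Stand-in for HTA-style whole-array operators on untiled data:
    // element-by-element arithmetic followed by a reduction.
    std::valarray<double> a = {1, 2, 3, 4};
    std::valarray<double> b = {5, 6, 7, 8};

    std::valarray<double> c = a + b;   // element-by-element addition
    double total = c.sum();            // reduction over all elements

    std::cout << total << "\n";        // prints 36
}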
User-Defined Operations
- Programmers can create new complex parallel operators through the primitive hmap
– Applies user defined operators to each tile of the HTA
- And corresponding tiles if multiple HTAs are involved
– Operator applied in parallel across tiles
– Assumes operation on tiles is independent
HTA a, b;
b = a.map( F() );
[Figure: F() applied independently to each tile of a to produce the corresponding tile of b.]
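A minimal stand-in sketch of this model (not the HTA library's API: here a "tile" is just a std::vector<double>, and a parallel for-each plays the role of hmap):

#include <algorithm>
#include <execution>
#include <vector>

// User-defined operator applied to one tile at a time; all work stays
// inside the tile, so tiles can be processed independently.
struct Scale {
    double factor;
    void operator()(std::vector<double>& tile) const {
        for (double& x : tile) x *= factor;
    }
};

int main() {
    // Eight independent "tiles" of 1024 elements each.
    std::vector<std::vector<double>> tiles(8, std::vector<double>(1024, 1.0));
    // Because the per-tile operation is independent, it may run in parallel.
    std::for_each(std::execution::par, tiles.begin(), tiles.end(), Scale{2.0});
}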
SPIKE
- Parallel solver for banded linear systems of equations
- SPIKE family of algorithms originally by Prof. Ahmed Sameh
- http://software.intel.com/en-us/articles/intel-adaptive-spike-based-solver/
- SPIKE's blocks map naturally to tiles in HTA
SPIKE Algorithm
- Name derived from the matrix formed by multiplying the original block tri-diagonal system by the inverse of the blocks of A
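In standard SPIKE notation (a sketch; the symbols are mine, not taken from the slides): partition A into p diagonal blocks A_j with off-diagonal coupling blocks B_j and C_j, and let D = diag(A_1, ..., A_p). Then

\[
S = D^{-1}A =
\begin{bmatrix}
I & V_1 & & \\
W_2 & I & V_2 & \\
 & \ddots & \ddots & \ddots \\
 & & W_p & I
\end{bmatrix},
\qquad
V_j = A_j^{-1}\hat{B}_j, \quad W_j = A_j^{-1}\hat{C}_j,
\]

where \hat{B}_j and \hat{C}_j are B_j and C_j padded with zeros to full block height. The columns V_j and W_j are the "spikes" that give the algorithm its name, and solving Ax = f splits into Dg = f followed by Sx = g.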
SPIKE Algorithm 2
- Most SPIKE algorithms proceed as follows:
1. Compute the Spike matrix, S, in parallel
- Compute an LU factorization and solve instead of directly computing the inverse
- LAPACK routines DGBTRF/DGBTRS (see the sketch after this list)
2. Form a reduced system of much smaller size
3. Solve the reduced system directly
4. Retrieve the global solution in parallel
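As a sketch of step 1 for a single diagonal block (illustrative code, not the paper's implementation; the Fortran LAPACK symbols are declared by hand here), DGBTRF factors the banded block and DGBTRS applies that factorization to the right-hand sides:

#include <vector>

// Fortran LAPACK prototypes (trailing-underscore convention).
extern "C" {
void dgbtrf_(const int* m, const int* n, const int* kl, const int* ku,
             double* ab, const int* ldab, int* ipiv, int* info);
void dgbtrs_(const char* trans, const int* n, const int* kl, const int* ku,
             const int* nrhs, const double* ab, const int* ldab, const int* ipiv,
             double* b, const int* ldb, int* info);
}

// Factor one banded diagonal block A_j (order n, bandwidths kl/ku) held in
// LAPACK band storage ab, then solve A_j * x = rhs in place for nrhs columns.
void factor_and_solve_block(int n, int kl, int ku,
                            std::vector<double>& ab, std::vector<int>& ipiv,
                            std::vector<double>& rhs, int nrhs) {
    const int ldab = 2 * kl + ku + 1;   // leading dimension required by DGBTRF
    int info = 0;
    dgbtrf_(&n, &n, &kl, &ku, ab.data(), &ldab, ipiv.data(), &info);
    const char trans = 'N';
    dgbtrs_(&trans, &n, &kl, &ku, &nrhs, ab.data(), &ldab, ipiv.data(),
            rhs.data(), &n, &info);
}

Each diagonal block can be handled this way independently, which is why step 1 parallelizes across partitions.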
SPIKE Reduced System
- Extracts a smaller, independent subsystem from the original system
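Schematically (my notation, following the usual SPIKE presentation): only the top and bottom m rows of each spike, V_j^{(t)}, V_j^{(b)}, W_j^{(t)}, W_j^{(b)}, and the corresponding rows of g enter this subsystem. Collecting those interface rows for all partitions gives a reduced system

\[
\hat{S}\,\hat{x} = \hat{g},
\]

of size roughly 2m(p-1) instead of n, where m is the coupling-block size and p the number of partitions. Its solution provides the unknowns at the partition interfaces, after which the interior unknowns are recovered independently within each partition (step 4).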
SPIKE TU and TA
- SPIKE has several algorithmic variants
- Recent focus has been on the variants referred to as “Truncated”
- Faster than the full SPIKE algorithm
– Reduces computation and has greater parallelism
– Only applicable to diagonally dominant systems (see the note after this list)
- Can wrap with an iterative method for non-diagonally dominant systems
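Why truncation works (a sketch of the standard argument, notation mine): when A is diagonally dominant, the entries of the spikes decay in magnitude away from the partition interfaces, so the far tips V_j^{(t)} and W_j^{(b)} can be set to zero. The reduced system then decouples into one independent 2m x 2m system per interface,

\[
\begin{bmatrix} I_m & V_j^{(b)} \\ W_{j+1}^{(t)} & I_m \end{bmatrix}
\begin{bmatrix} x_j^{(b)} \\ x_{j+1}^{(t)} \end{bmatrix}
=
\begin{bmatrix} g_j^{(b)} \\ g_{j+1}^{(t)} \end{bmatrix},
\]

which is where the reduced computation and extra parallelism come from.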
Truncated Savings 1 and 2
[Figures illustrating the savings from the truncated variants.]
TU and TA
- The truncated variants come in several flavors; the two that we implemented are TU and TA
– T – Truncated
– U – Uses LU and UL factorizations
– A – Uses alternating LU and UL factorizations
- The difference is in how the work is distributed to the processors, not in what work is done.
TA Work Distribution
[Figure: work distribution across processors for the TA variant.]
TU Work Distribution
[Figure: work distribution across processors for the TU variant.]
HTA SPIKE
- Blocks of SPIKE map very well to HTA tiles
- Factorization and Solving of tiles parallelized with hmap
- Reduced System formed using array assignments
[Figure: the original banded system and its HTA representation; the blocks become tiles of the HTAs LUA, ULA, BC, and r used in the code below.]
TU Implementation
...
// Factorize blocks of A
LUA.hmap(factorize_lua());
ULA.hmap(factorize_ula());

// Calculate the spike tips W(t) and V(b) from the Bs and Cs
BC.hmap(solve_bc(), LUA, ULA);

// Update the right-hand side
g.hmap(solve_lua(), LUA);

// Form the reduced system
REDUCED()[0:m-1, m:2*m-1] = BC(0:blks-2)[0:m-1, 0:m-1];
REDUCED()[m:2*m-1, 0:m-1] = BC(1:blks-1)[0:m-1, m:2*m-1];

// Form the reduced system RHS
greduced()[0:m-1] = g(0:blks-2)[blocksize-m:blocksize-1];
greduced()[m:2*m-1] = g(1:blks-1)[0:m-1];

// Factorize the reduced system
REDUCED.hmap(factorize());

// Solve the reduced system
greduced.hmap(solve(), REDUCED);

// Update RHS as r = r - Bz - Cz
fv = r(0:num_blocks-2);
fr_half = greduced()[0:m-1];
B.map(dgemv(), fv, fr_half);
r(0:num_blocks-2) = fv;
fw = r(1:num_blocks-1);
fr_half = greduced()[m:2*m-1];
C.map(dgemv(), fw, fr_half);
r(1:num_blocks-1) = fw;

// Solve the updated system
r.hmap(solve_lua(), LUA);
...
Portability
- The HTA programming model expresses programs as a sequence of higher level data parallel operators
– Portable as long as the appropriate implementations of the operators exist for the target platform
– For SPIKE, array assignments, tile indexing, and hmap.
- The code on the last slide runs on both the MPI and TBB implementations of the HTA library
– Simply change which header to include.
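For instance, the backend could be selected with a preprocessor switch along these lines (the header names and macro here are hypothetical, not the library's actual ones):

// Hypothetical header names; the real HTA library headers may differ.
#ifdef HTA_MPI
  #include "htalib_mpi.h"    // distributed-memory implementation (MPI)
#else
  #include "htalib_shmem.h"  // shared-memory implementation (TBB)
#endif
// The SPIKE code itself is unchanged; only this include (and how the program
// is launched, e.g. mpirun for the MPI build) differs between platforms.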
SPIKE TU Performance
[Chart: speedup over sequential MKL vs. number of CPUs (5-35) for HTA SHM, HTA MPI, and SpikePACK. 32-core Intel Xeon 7555 (Nehalem) @ 1.86 GHz.]
Cluster TU Performance
[Chart: speedup over sequential MKL vs. number of CPUs (5-35) for HTA MPI and SpikePACK. Cluster of Intel Xeon X5550 (Nehalem) @ 2.66 GHz.]
SPIKE TA Performance
[Chart: speedup over sequential MKL vs. number of CPUs (10-40) for HTA SHM, HTA MPI, and SpikePACK. 32-core Intel Xeon 7555 (Nehalem) @ 1.86 GHz.]
Cluster TA Performance
[Chart: speedup over sequential MKL vs. number of CPUs (10-40) for HTA MPI and SpikePACK. Cluster of Intel Xeon X5550 (Nehalem) @ 2.66 GHz.]
Conclusions
- The SPIKE algorithms are a good fit for the HTA programming model
– First class tiles
– Data parallel operators
– Portable
- Clean, compact code
– Our implementation takes ~50 lines of code
– SpikePACK’s Fortran+MPI? … a lot more.
– A challenge – one of the most complex algorithms we’ve looked at using HTAs
- Good performance
– Original goal was only to match performance of Fortran+MPI
Future Work
- Sparse Tiling
– Allow programmers to express the structure of the original system
- Support for Heterogeneous Architectures
– Clusters of Multicores, GPUs
- Look at other complex algorithms
- Linear algebra libraries and data layout
Questions?
Multithreaded MKL Performance
[Chart: speedup of multithreaded MKL vs. number of processors (1-8); the speedup axis ranges up to about 2.5.]
Why is Raising the Level of Abstraction Good?
1) Allows programmers to focus on the algorithm rather than on the implementation
2) Data parallel programs resemble conventional, sequential programs
– Parallelism is encapsulated
– Parallelism is structured
3) Compact and readable
4) Portable
– Run on any class of machine with suitable implementations of the operators
– Multicore/Clusters/GPUs
5) Programmers have control of determinacy.
– Implicit barriers exist between operators. If operators do not alter any shared state, the program is deterministic.
6) Facilitates Optimization
– Exposes tunable parameters