A Parallel Numerical Solver Using Hierarchically Tiled Arrays - PowerPoint PPT Presentation


SLIDE 1

A Parallel Numerical Solver Using Hierarchically Tiled Arrays

James Brodman, G. Carl Evans, Murat Manguoglu, Ahmed Sameh, María J. Garzarán, and David Padua

SLIDE 2

Outline

  • Motivation
  • Hierarchically Tiled Arrays (HTAs)
  • SPIKE
  • HTA SPIKE
  • Conclusions

SLIDE 3

Motivation

  • Still much room for advancement in parallel programming
  • Want to facilitate parallel programming but still provide control over performance

– Raise the level of abstraction

  • Higher-level data parallel operators and their associated aggregate data types
  • Write programs as a sequence of operators

– Can easily reason about the resulting program
– Portable across any platform that implements the desired operators
– Control performance through:

  • Data Parallelism
  • Hierarchical Tiling

– Who should tile?

SLIDE 4

Tiling and Compilers (Matrix Multiplication)

[Chart: matrix multiplication performance (MFLOPS) vs. matrix size for Intel MKL, icc -O3 -xT, and icc -O3; MKL is roughly 20x faster than the compiler-generated code.]

Clearly, the compiler could do better at tiling.

SLIDE 5

Hierarchically Tiled Array

  • The Hierarchically Tiled Array, or HTA, is a data type based on higher-level data parallel operators

– Fortran 90 array operators
– Hierarchical Tiling

  • Makes tiles first-class objects

– Recognizes importance of tiling to control:

  • Locality
  • Data Distribution

– Referenced explicitly, not implicitly generated by the compiler
– Operators extended to function on tiles
– C++ library implementations:

  • Distributed-memory using MPI
  • Shared-memory using TBB

SLIDE 6

HTAs, Illustrated

[Figure: HTA tiling levels mapped onto the cluster memory hierarchy, from cluster nodes down through multicore caches (L2, L1) to registers.]

SLIDE 7

HTA Indexing

[Figure: a tiled HTA A used to illustrate the indexing modes below.]

A[0,3]         Flattened access
A(0,0)         Tile access
A(0,1)[0,1]    Hybrid access
A(1, 0:1)      Simplified syntax
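To make the three access modes concrete, here is a minimal, self-contained C++ toy. It is not the HTA library's API (TiledArray2D, tile, hybrid, and flat are hypothetical names); it only mirrors the tile/flattened/hybrid distinction shown above.

    // Toy illustration (not the actual HTA library API): a 2-D array split
    // into fixed-size tiles, supporting the three access modes from the slide.
    #include <cassert>
    #include <cstddef>
    #include <iostream>
    #include <vector>

    struct TiledArray2D {
        std::size_t tiles_per_dim, tile_size;       // square tiling for brevity
        std::vector<std::vector<double>> tiles;     // one contiguous buffer per tile

        TiledArray2D(std::size_t t, std::size_t s)
            : tiles_per_dim(t), tile_size(s),
              tiles(t * t, std::vector<double>(s * s, 0.0)) {}

        // Tile access, like A(ti, tj): returns the whole tile.
        std::vector<double>& tile(std::size_t ti, std::size_t tj) {
            return tiles[ti * tiles_per_dim + tj];
        }

        // Hybrid access, like A(ti, tj)[i, j]: element (i, j) within tile (ti, tj).
        double& hybrid(std::size_t ti, std::size_t tj, std::size_t i, std::size_t j) {
            return tile(ti, tj)[i * tile_size + j];
        }

        // Flattened access, like A[gi, gj]: global indices, tiling is invisible.
        double& flat(std::size_t gi, std::size_t gj) {
            return hybrid(gi / tile_size, gj / tile_size, gi % tile_size, gj % tile_size);
        }
    };

    int main() {
        TiledArray2D A(2, 2);           // 2x2 tiles of 2x2 elements = 4x4 array
        A.hybrid(0, 1, 0, 1) = 42.0;    // A(0,1)[0,1] in HTA notation
        assert(A.flat(0, 3) == 42.0);   // same element as A[0,3]
        std::cout << A.flat(0, 3) << "\n";
    }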

SLIDE 8

HTA Assignment

lhs = rhs;

[Figure: the tiles of lhs and rhs affected by each assignment.]

lhs(:,:) = rhs(:,:);
lhs(0,:) = rhs(1,:);
lhs(0,:)[1,:] = rhs(1,:)[0,:];

SLIDE 9

HTA Assignment

lhs = rhs;

[Figure: the same assignments as on the previous slide; arrows mark the data movement each one induces.]

lhs(:,:) = rhs(:,:);
lhs(0,:) = rhs(1,:);
lhs(0,:)[1,:] = rhs(1,:)[0,:];

Assignment ⇒ Communication

SLIDE 10

Higher Level HTA Operations

  • Element-by-element operations (see the sketch after this list)

– +, -, *, /, …

  • Reductions
  • Transpositions

– Tile
– Data

  • MapReduce
  • Matrix Multiplication
  • FFT
  • etc.
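As an analogy for this operator style, here is a short sketch in standard C++ (std::valarray, not HTA code) showing element-by-element operators and a reduction composed as a sequence of data parallel operations:

    // The data parallel operator style in standard C++ (std::valarray),
    // shown only as an analogy to HTA's element-wise operators and reductions.
    #include <iostream>
    #include <valarray>

    int main() {
        std::valarray<double> a = {1.0, 2.0, 3.0, 4.0};
        std::valarray<double> b = {4.0, 3.0, 2.0, 1.0};

        std::valarray<double> c = a * b + a;   // element-by-element operations
        double dot = (a * b).sum();            // reduction

        std::cout << "dot = " << dot << "\n";  // dot = 20
    }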

SLIDE 11

User-Defined Operations

  • Programmers can create new complex parallel operators through the primitive hmap

– Applies user-defined operators to each tile of the HTA

  • And corresponding tiles if multiple HTAs are involved

– Operator applied in parallel across tiles
– Assumes operation on tiles is independent

HTA a, b;
b = a.map( F() );

[Figure: F() applied independently to each tile of a to produce b.]
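A minimal sketch of the hmap idea using only standard C++17 parallel algorithms; Tile, Scale, and hmap here are hypothetical stand-ins, not the HTA library's types (GCC needs -ltbb for parallel execution policies):

    // A toy version of hmap (hypothetical, not the HTA library API): apply a
    // user-defined functor independently to each tile, in parallel across tiles.
    #include <algorithm>
    #include <execution>
    #include <iostream>
    #include <vector>

    using Tile = std::vector<double>;

    struct Scale {                     // a user-defined per-tile operator, like F()
        double factor;
        void operator()(Tile& t) const {
            for (double& x : t) x *= factor;   // touches only its own tile
        }
    };

    // hmap: tiles are independent, so they can be processed in parallel.
    template <class Op>
    void hmap(std::vector<Tile>& tiles, Op op) {
        std::for_each(std::execution::par, tiles.begin(), tiles.end(), op);
    }

    int main() {
        std::vector<Tile> a = {{1, 2}, {3, 4}, {5, 6}};
        hmap(a, Scale{10.0});
        std::cout << a[2][1] << "\n";  // 60
    }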

SLIDE 12

SPIKE

  • Parallel solver for banded linear systems of equations
  • SPIKE family of algorithms originally by Prof. Ahmed Sameh
  • http://software.intel.com/en-us/articles/intel-adaptive-spike-based-solver/
  • SPIKE’s blocks map naturally to tiles in HTA

SLIDE 13

SPIKE Algorithm

  • Name derived from the matrix formed by multiplying the original block tri-diagonal system by the inverse of the blocks of A

SLIDE 14

SPIKE Algorithm 2

  • Most SPIKE algorithms proceed as follows:

1. Compute the Spike matrix, S, in parallel

  • Compute LU factorization and solve instead of directly computing the inverse
  • LAPACK routines DGBTRF/DGBTRS (see the sketch after this list)

2. Form a reduced system of much smaller size
3. Solve the reduced system directly
4. Retrieve the global solution in parallel
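For reference, a minimal sketch of the per-block DGBTRF/DGBTRS call pattern, assuming a Fortran LAPACK is linked (e.g., -llapack); the tridiagonal test block here is illustrative only, not taken from the slides:

    // Sketch: factor and solve one banded block with LAPACK DGBTRF/DGBTRS,
    // as the SPIKE step does per block. Band storage follows the LAPACK
    // convention: A(i,j) goes to ab[(j-1)*ldab + kl + ku + i - j] (1-based i, j).
    #include <iostream>
    #include <vector>

    extern "C" {
    void dgbtrf_(const int* m, const int* n, const int* kl, const int* ku,
                 double* ab, const int* ldab, int* ipiv, int* info);
    void dgbtrs_(const char* trans, const int* n, const int* kl, const int* ku,
                 const int* nrhs, const double* ab, const int* ldab,
                 const int* ipiv, double* b, const int* ldb, int* info);
    }

    int main() {
        const int n = 3, kl = 1, ku = 1, ldab = 2 * kl + ku + 1, nrhs = 1;
        // Tridiagonal test block A = [[2,-1,0],[-1,2,-1],[0,-1,2]].
        std::vector<double> ab(ldab * n, 0.0);
        auto put = [&](int i, int j, double v) { ab[(j - 1) * ldab + kl + ku + i - j] = v; };
        put(1, 1, 2);  put(1, 2, -1);
        put(2, 1, -1); put(2, 2, 2);  put(2, 3, -1);
        put(3, 2, -1); put(3, 3, 2);

        std::vector<int> ipiv(n);
        int info = 0;
        dgbtrf_(&n, &n, &kl, &ku, ab.data(), &ldab, ipiv.data(), &info); // LU factor

        std::vector<double> b = {1.0, 0.0, 1.0};
        const char trans = 'N';
        dgbtrs_(&trans, &n, &kl, &ku, &nrhs, ab.data(), &ldab, ipiv.data(),
                b.data(), &n, &info);                                    // solve A x = b
        std::cout << b[0] << " " << b[1] << " " << b[2] << "\n";         // 1 1 1
    }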

SLIDE 15

SPIKE Reduced System

  • Extracts a smaller, independent subsystem from the original system

SLIDE 16

SPIKE TU and TA

  • SPIKE has several algorithmic variants
  • Recent focus has been on the variants referred to as “Truncated”
  • Faster than the full SPIKE algorithm

– Reduces computation and has greater parallelism
– Only applicable to diagonally dominant systems

  • Can wrap with an iterative method for non-diagonally dominant systems

SLIDE 17

Truncated Savings 1

SLIDE 18

Truncated Savings 2

SLIDE 19

TU and TA

The truncated variants come in several flavors; the two that we implemented are TU and TA.

– T – truncated
– U – uses LU and UL factorizations
– A – uses alternating LU and UL factorizations

The difference is in how the work is distributed to the processors, not in what work is done.

SLIDE 20

TA Work Distribution

SLIDE 21

TU Work Distribution

SLIDE 22

HTA SPIKE

  • Blocks of SPIKE map very well to HTA tiles
  • Factorization and solving of tiles parallelized with hmap
  • Reduced system formed using array assignments

[Figure: the original banded matrix system and its HTA representation; the tiled HTAs LUA, ULA, BC, and r hold the blocks used on the next slide.]

SLIDE 23

TU Implementation

...
// factorize blocks of A
LUA.hmap(factorize_lua());
ULA.hmap(factorize_ula());

// calculate the spike tips W(t) and V(b) from Bs and Cs
BC.hmap(solve_bc(), LUA, ULA);

// update right hand side
g.hmap(solve_lua(), LUA);

// form the reduced system
REDUCED()[0:m-1, m:2*m-1] = BC(0:blks-2)[0:m-1, 0:m-1];
REDUCED()[m:2*m-1, 0:m-1] = BC(1:blks-1)[0:m-1, m:2*m-1];

// form the reduced system RHS
greduced()[0:m-1]   = g(0:blks-2)[blocksize-m:blocksize-1];
greduced()[m:2*m-1] = g(1:blks-1)[0:m-1];

// factorize the reduced system
REDUCED.hmap(factorize());

// solve the reduced system
greduced.hmap(solve(), REDUCED);

// update RHS as r = r - Bz - Cz
fv = r(0:num_blocks-2);
fr_half = greduced()[0:m-1];
B.map(dgemv(), fv, fr_half);
r(0:num_blocks-2) = fv;

fw = r(1:num_blocks-1);
fr_half = greduced()[m:2*m-1];
C.map(dgemv(), fw, fr_half);
r(1:num_blocks-1) = fw;

// solve the updated system
r.hmap(solve_lua(), LUA);
...

SLIDE 24

Portability

  • The HTA programming model expresses programs as a sequence of higher-level data parallel operators

– Portable as long as the appropriate implementations of the operators exist for the target platform
– For SPIKE: array assignments, tile indexing, and hmap

  • The code on the last slide runs on both the MPI and TBB implementations of the HTA library

– Simply change which header to include.
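A self-contained sketch of this single-switch idea; the backend namespace and the USE_PARALLEL_BACKEND macro are hypothetical, standing in for the HTA library's actual MPI and TBB headers, which the slide does not name:

    // Two interchangeable backends behind one operator interface; user code
    // is unchanged, only the compile-time switch selects the implementation.
    #include <iostream>
    #include <numeric>
    #include <vector>

    #ifdef USE_PARALLEL_BACKEND
    #include <execution>
    namespace backend {
    template <class It> double sum(It first, It last) {
        return std::reduce(std::execution::par, first, last, 0.0); // parallel path
    }
    }
    #else
    namespace backend {
    template <class It> double sum(It first, It last) {
        return std::accumulate(first, last, 0.0);                  // sequential path
    }
    }
    #endif

    int main() {
        std::vector<double> v(1000, 1.0);
        std::cout << backend::sum(v.begin(), v.end()) << "\n";     // 1000, either way
    }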

SLIDE 25

SPIKE TU Performance

[Chart: speedup over sequential MKL (up to 10x) vs. number of CPUs (5–35) for HTA SHM, HTA MPI, and SpikePACK.]

32 Core Intel Xeon 7555 (Nehalem) @ 1.86 GHz

SLIDE 26

Cluster TU Performance

[Chart: speedup over sequential MKL (up to 16x) vs. number of CPUs (5–35) for HTA MPI and SpikePACK.]

Intel Xeon X5550 (Nehalem) @ 2.66 GHz

SLIDE 27

SPIKE TA Performance

[Chart: speedup over sequential MKL (up to 10x) vs. number of CPUs (10–40) for HTA SHM, HTA MPI, and SpikePACK.]

32 Core Intel Xeon 7555 (Nehalem) @ 1.86 GHz

SLIDE 28

Cluster TA Performance

[Chart: speedup over sequential MKL (up to 10x) vs. number of CPUs (10–40) for HTA MPI and SpikePACK.]

Intel Xeon X5550 (Nehalem) @ 2.66 GHz

SLIDE 29

Conclusions

  • The SPIKE algorithms are a good fit for the HTA programming model

– First-class tiles
– Data parallel operators
– Portable

  • Clean, compact code

– Our implementation takes ~50 lines of code
– SpikePACK’s Fortran+MPI? … a lot more.
– A challenge: one of the most complex algorithms we’ve looked at using HTAs

  • Good performance

– Original goal was only to match the performance of Fortran+MPI

SLIDE 30

Future Work

  • Sparse Tiling

– Allow programmers to express the structure of the original system

  • Support for Heterogeneous Architectures

– Clusters of Multicores, GPUs

  • Look at other complex algorithms
  • Linear algebra libraries and data layout

SLIDE 31

Questions?

SLIDE 32

Multithreaded MKL Performance

[Chart: multithreaded MKL speedup (up to about 2.5x) vs. number of processors (1–8).]

SLIDE 33

Why is Raising the Level of Abstraction Good?

1) Allows programmers to focus on the algorithm rather than on the implementation
2) Data parallel programs resemble conventional, sequential programs

– Parallelism is encapsulated
– Parallelism is structured

3) Compact and readable

SLIDE 34

Why is Raising the Level of Abstraction Good?

4) Portable

– Run on any class of machine with suitable implementations of the operators
– Multicore/Clusters/GPUs

5) Programmers have control of determinacy.

– Implicit barriers exist between operators. If operators do not alter any shared state, the program is deterministic. (A sketch follows this list.)

6) Facilitates Optimization

– Exposes tunable parameters
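A toy illustration of point 5 in standard C++ (hypothetical, not HTA code): two data parallel operators separated by an implicit barrier, each touching only its own tile, so the result is deterministic:

    // Two operators over independent tiles with an implicit barrier between
    // them: the second always sees the same input, so the program is
    // deterministic regardless of how tiles are scheduled.
    #include <algorithm>
    #include <execution>
    #include <iostream>
    #include <vector>

    int main() {
        std::vector<double> tiles(8, 1.0);

        // Operator 1: parallel across tiles; no shared state is written.
        std::for_each(std::execution::par, tiles.begin(), tiles.end(),
                      [](double& t) { t += 1.0; });
        // Implicit barrier: for_each returns only when every tile is done.

        // Operator 2: always sees the same input as a sequential run would.
        std::for_each(std::execution::par, tiles.begin(), tiles.end(),
                      [](double& t) { t *= 3.0; });

        std::cout << tiles[0] << "\n";  // always 6
    }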
