
SLIDE 1

Sparse direct solvers on top of runtime systems

ANR SOLHAR

  • E. Agullo, G. Bosilca, A. Buttari, A. Guermouche and F. Lopez

Université de Toulouse - IRIT

ANR SOLHAR meeting 2014

SLIDE 2

The multifrontal QR method

SLIDE 3

The Multifrontal QR method

The multifrontal QR factorization is guided by a graph called the elimination tree:

  • each node is associated with a relatively small dense matrix called the frontal matrix (or front), containing k pivots to be eliminated along with all the other coefficients concerned by their elimination



SLIDE 5

The Multifrontal QR method

The tree is traversed in topological order (i.e., bottom-up) and, at each node, two operations are performed:

  • assembly: coefficients from the original matrix associated with the pivots, and contribution blocks produced by the treatment of the child nodes, are stacked to form the frontal matrix
  • factorization: the k pivots are eliminated through a complete QR factorization of the frontal matrix. As a result we get:
  • part of the global R and Q factors
  • a triangular contribution block that will be assembled into the father's front

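The assemble-then-factorize step above can be sketched in a few lines of NumPy. This is a minimal, illustrative sketch, not the actual qr_mumps code: the tree layout, the function names, and the assumption that contribution blocks are already column-aligned with the father's front are all made up for the example (real codes map child columns into the father's front with indirect addressing).

```python
import numpy as np

def multifrontal_qr(children, rows, npiv, node, R_parts):
    """Process `node` bottom-up: assemble its front, factorize it, and
    return the contribution block destined for the father's front."""
    # assembly: stack the original-matrix rows with the children's
    # contribution blocks (column alignment is assumed for simplicity)
    pieces = [rows[node]]
    for c in children.get(node, ()):
        pieces.append(multifrontal_qr(children, rows, npiv, c, R_parts))
    front = np.vstack(pieces)

    # factorization: a complete QR of the front eliminates npiv[node] pivots
    _, R = np.linalg.qr(front, mode="reduced")
    k = npiv[node]
    R_parts[node] = R[:k, :]   # this node's part of the global R factor
    return R[k:, k:]           # contribution block for the father's front

# a 3-node tree: fronts 1 and 2 are children of the root front 3
rng = np.random.default_rng(0)
children = {3: [1, 2]}
rows = {1: rng.standard_normal((4, 3)),   # leaf front, 1 pivot
        2: rng.standard_normal((3, 3)),   # leaf front, 1 pivot
        3: rng.standard_normal((3, 2))}   # root front, 2 pivots
npiv = {1: 1, 2: 1, 3: 2}
R_parts = {}
cb = multifrontal_qr(children, rows, npiv, 3, R_parts)
assert R_parts[3].shape == (2, 2) and cb.size == 0  # root leaves no contribution
```

Each leaf produces a 2×2 contribution block that is stacked under the root's original rows before the root's own QR factorization.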

SLIDE 6

The Multifrontal QR method

Notable differences with multifrontal LU:

  • fronts are rectangular, either over- or under-determined
  • assembly operations are just copies (with lots of indirect addressing) and not sums. They can thus be done in any order (like in LU) but also in parallel (most likely not efficient because of false sharing issues)
  • fronts are not full: they have a staircase structure. The zeroes in the lower-leftmost part can be ignored. This irregular structure makes the modeling of performance rather difficult
  • fronts are completely factorized, not just partially. This makes the overall size of the factors bigger and thus the active memory consumption less sensitive to the tree traversal
  • contribution blocks are trapezoidal and not square


SLIDE 7

The Multifrontal QR method: parallelism

In multifrontal methods we can distinguish two sources of parallelism:

Tree parallelism

Frontal matrices located in independent branches in the tree can be processed in parallel

Node parallelism

The factorization of large frontal matrices may be performed in parallel by multiple threads


SLIDE 8

The Multifrontal QR method in qr_mumps

SLIDE 9

Parallelism in qr_mumps: a new approach

Our baseline is the approach used in qr_mumps, where the workload is expressed as a DAG of tasks defined through a 1D block-column partitioning. In qr_mumps, threading is implemented through OpenMP and the scheduling of tasks is done "by hand".


SLIDE 10

Parallelism: a new approach

The scheduling is performed by a finely-tuned, hand-written code:

  • the fine-grained decomposition and the asynchronous/dynamic scheduling deliver high concurrency and much better performance compared to the classical approach (SPQR)
  • the scheduler is not scalable (the search for ready tasks in the DAG is inefficient)...
  • ... extremely difficult to maintain...
  • ... and not really portable



SLIDE 12

Add new features in qr_mumps

We want to develop the following features in qr_mumps:

  • 2D partitioning of frontal matrices (finer granularity allowing better parallelism), as 1D partitioning may not be well suited:
  • most fronts are overdetermined
  • the problem is mitigated by concurrent front factorizations

more concurrency, but more complex dependencies and more tasks

  • Exploit GPUs

memory transfers, CUDA kernel management

  • Memory-aware algorithms (perform the factorization under a given memory constraint)
  • Distributed memory architectures

MPI layer

All these problems may be overcome by using a runtime system


SLIDE 13

STF vs PTG models

SLIDE 14

STF vs PTG models

The Sequential Task Flow (STF) model in StarPU:

  • The parallel code corresponds to the sequential one, except that operations are not executed but submitted to the runtime system in the form of tasks
  • Depending on the data accessed by tasks and the order of submission, the runtime infers dependencies among them and builds a DAG

Drawbacks of this model:

  • The DAG is entirely unrolled in the runtime: limited scalability

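The dependency-inference rule can be illustrated with a toy submission loop. This is a hypothetical mini-runtime, not StarPU's API; the task and handle names are invented. Each `submit` call records which data a task reads and writes, and edges (read-after-write, write-after-read, write-after-write) are derived from the submission order alone:

```python
class STFRuntime:
    """Toy STF runtime: tasks are submitted sequentially and the DAG is
    inferred from declared data accesses, as in a task-insertion model."""
    def __init__(self):
        self.tasks = []          # task names, in submission order
        self.deps = []           # deps[i] = set of task ids task i waits on
        self.last_writer = {}    # data handle -> last task that wrote it
        self.readers = {}        # data handle -> readers since the last write

    def submit(self, name, reads=(), writes=()):
        tid = len(self.tasks)
        self.tasks.append(name)
        d = set()
        for h in reads:                       # read-after-write dependency
            if h in self.last_writer:
                d.add(self.last_writer[h])
            self.readers.setdefault(h, set()).add(tid)
        for h in writes:                      # write-after-write / after-read
            if h in self.last_writer:
                d.add(self.last_writer[h])
            d |= self.readers.pop(h, set())
            self.last_writer[h] = tid
        self.deps.append(d)
        return tid

# a sequential-looking 1D block-column factorization of one front
rt = STFRuntime()
p1 = rt.submit("panel(1)",     writes=["c1"])
u12 = rt.submit("update(1,2)", reads=["c1"], writes=["c2"])
u13 = rt.submit("update(1,3)", reads=["c1"], writes=["c3"])
p2 = rt.submit("panel(2)",     writes=["c2"])
assert rt.deps[u12] == {p1} and rt.deps[u13] == {p1}
assert rt.deps[p2] == {u12}   # inferred from submission order + accesses
assert len(rt.deps) == 4      # drawback: every task is materialized
```

The last assertion shows the drawback named above: the whole DAG lives in memory, one entry per submitted task, which limits scalability.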


SLIDE 16

STF vs PTG models

The Parametrized Task Graph (PTG) model in PaRSEC:

  • The DAG is represented in a compact format where the different types of tasks are defined (domain of definition, CPU/GPU implementation) as well as their dependencies w.r.t. other tasks (input/output data)
  • On task completion, the DAG is partially unrolled following the released data dependencies

Drawbacks of this model:

  • programming model less intuitive than STF

Objective

Develop a PaRSEC version of qr_mumps following the PTG model and evaluate its effectiveness on single-node multicore systems

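The compact representation can be mimicked in a few lines. This is an illustrative sketch of the idea, not PaRSEC's JDF syntax; the flat-reduction dependency rules and task names are assumptions, and the sketch follows a single releasing edge per task (a real engine also counts each task's remaining input dependencies). The point is that successors are *computed* from task parameters, so the DAG is never stored in unrolled form:

```python
N = 4  # number of block-columns of a front (illustrative)

def successors(task):
    """Rule-based dependencies of a 1D QR front factorization:
    panel(k) releases update(k, j) for j > k, and update(k, k+1)
    releases panel(k+1)."""
    kind, k, j = task
    if kind == "panel":
        return [("update", k, j) for j in range(k + 1, N)]
    if kind == "update" and j == k + 1:
        return [("panel", k + 1, k + 1)]
    return []

def execute(start):
    """Partially unroll the DAG on task completion, as the runtime does."""
    done, ready = [], [start]
    while ready:
        t = ready.pop()
        if t not in done:
            done.append(t)
            ready.extend(successors(t))   # only released tasks materialize
    return done

order = execute(("panel", 0, 0))
# N panels + N*(N-1)/2 updates, discovered without storing the full DAG
assert len(order) == N + N * (N - 1) // 2
```

Only the `successors` rules are stored; the memory footprint of the task graph no longer grows with the number of tasks, which is the scalability advantage over the STF model.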

SLIDE 17

PaRSEC multifrontal QR

SLIDE 18

PaRSEC Multifrontal QR

  • The elimination tree is represented in a main JDF
  • The front factorization is represented in separate JDFs:
  • 1D block partitioning
  • 2D block partitioning (not necessarily square) with flat, binary (communication avoiding) or hybrid panel reduction trees
  • Upon activation (allocating memory and initializing structures), the DAG corresponding to the front factorization is spawned in PaRSEC

[Figure: elimination tree with fronts 1, 2 and 3 and their activation (a) tasks]



SLIDE 22

PaRSEC Multifrontal QR

  • Elimination tree and assembly operations have an irregular input/output data-flow: tricky to express in the JDF format

[Figure: assembly data-flow between the children's contribution blocks c1 ... cj and the father's front fi]

  • Frontal matrices have a sparse structure (staircase): the corresponding factorization DAG must be adapted from the dense kernels


SLIDE 23

Experimental results

 #  Matrix        Gflops  Ordering
 1  LargeRegFile      19  Metis
 2  EternityII_A      39  Metis
 3  EternityII_E     107  Metis
 4  cont11_l         112  Metis
 5  sc205-2r         160  Metis
 6  cat_ears_4_4     184  Metis
 7  karted           335  Metis
 8  degme            558  Metis
 9  flower_7_4       724  Metis
10  hirlam          1112  Metis
11  e18             1286  Metis
12  Rucci1          5179  Metis
13  TF17           15663  Metis
14  sls            26363  Metis

  • System 1: IBM x3755
  • AMD Opteron Processor 8431 @ 2.4 GHz, 4 × 6 cores
  • 72 GB memory (NUMA)


SLIDE 24

PaRSEC Multifrontal QR: results

[Figure: speedup on 24 cores of qrm_starpu and qrm_parsec, each with 1D and 2D partitioning, over the 14 test matrices]




SLIDE 28

PaRSEC Multifrontal QR: results

  • In the elimination tree the parent-child dependencies are not finely managed, resulting in a poorer pipeline in the case of qrm_parsec
  • Due to limitations in PaRSEC (not in the PTG model) it is not currently possible to achieve pure data-flow parallelism

[Figure: elimination tree with each front's factorization DAG (activation a, panel p, update u, assembly s, contribution c tasks)]


SLIDE 29

STF vs PTG: Front partitioning

How can we take advantage of the PTG model?

  • The STF model allows a static approach: front partitioning occurs at the beginning of the factorization
  • The PTG model allows a dynamic approach: front partitioning occurs upon front activation

In the PTG model the front partitioning may be decided depending on the context of execution

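A dynamic decision of this kind might look as follows. The policy and its thresholds are purely hypothetical (the real choice would also weigh the staircase structure, memory pressure and the number of idle workers); it only illustrates choosing a partitioning at activation time from the execution context:

```python
def choose_partitioning(depth, max_depth, nrows, coarse_nb=512, fine_nb=128):
    """Pick a front partitioning upon activation: deep fronts rely on
    tree parallelism (coarse 1D block-columns), fronts near the root
    need node parallelism (fine 2D tiles)."""
    if depth < max_depth // 2:            # near the root: few fronts left
        return ("2D", fine_nb, fine_nb)   # more tasks, better pipeline
    return ("1D", nrows, coarse_nb)       # better kernels, fewer tasks

# near the root -> fine 2D tiles; deep in the tree -> coarse 1D columns
assert choose_partitioning(depth=1, max_depth=12, nrows=20000) == ("2D", 128, 128)
assert choose_partitioning(depth=10, max_depth=12, nrows=3000) == ("1D", 3000, 512)
```

Under the STF model this function would have to run once, before the factorization starts; the PTG model lets it run per front, when the activation task fires.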

SLIDE 30

STF vs PTG: Front partitioning

How can we adapt the front partitioning depending on the context of execution?

  • Tree parallelism at the bottom of the tree: coarse-grain partitioning (1D partitioning or rectangular tiles)
  • better kernel efficiency
  • fewer tasks: less scheduling overhead

[Figure: elimination tree with coarse-grain partitioning of the fronts at the bottom]



SLIDE 33

STF vs PTG: Front partitioning

How can we adapt the front partitioning depending on the context of execution?

  • Node parallelism at the top of the tree, or when reaching a memory constraint: fine-grain partitioning
  • more parallelism
  • better pipeline

[Figure: elimination tree with fine-grain partitioning of the fronts near the root]



SLIDE 36

STF vs PTG: Front partitioning

How can we adapt the front partitioning depending on the context of execution?

  • Extremely challenging to apply these rules in practice:
  • Huge search space for parameters:
  • Tile dimensions
  • Inner blocking sizes
  • Panel reduction trees
  • Take into account the sparse structure of frontal matrices (staircase structure)


SLIDE 37

Conclusions and future work

Conclusions on PaRSEC

  • More challenging to use than other runtime systems
  • Potentially more scalable
  • Some features should be added to PaRSEC to enhance the current version of qrm_parsec

Ongoing and future work

  • Use GPUs with PaRSEC
  • Distributed-memory architectures
SLIDE 38

Questions?