
slide-1
SLIDE 1

Direct methods on GPU-based systems

Preliminary work towards a functioning code

  • A. Decollas and F. Lopez,

Joint work with IRIT Toulouse, LaBRI / Inria Bordeaux, LIP / Inria Lyon

Sparse Days 2012. Toulouse, June 25th

slide-2
SLIDE 2

Context of the work

slide-3
SLIDE 3

Context

  • F. Lopez @ IRIT-Toulouse

Evaluate the efficiency of modern runtime systems on heterogeneous and irregular workloads, such as multifrontal solvers, on homogeneous multicore architectures.

  • A. Decollas @ Inria-Bordeaux

Develop dense linear algebra kernels, specific to sparse direct solvers, capable of achieving high efficiency on heterogeneous systems equipped with multiple CPUs and GPUs.

These two activities will ultimately be merged into a sparse direct solver for accelerated multicore systems.

slide-4
SLIDE 4

The multifrontal method

The multifrontal factorization is guided by a graph called the elimination tree:

  • At each node of the tree, k pivots are eliminated

  • Each node of the tree is associated with a relatively small dense matrix, called frontal matrix (or, simply, front), which contains the k rows/columns related to the pivots and all the other coefficients concerned by their elimination

slide-5
SLIDE 5

The multifrontal method

The tree is traversed in topological order (i.e., bottom-up) and, at each node, two operations are performed:

  • assembly: a set of coefficients from the original matrix associated with the pivots, and a number of contribution blocks produced by the treatment of the child nodes, are summed to form the frontal matrix

  • factorization: the k pivots are eliminated through a partial factorization of the frontal matrix. As a result we get:

  • k rows/columns of the global factors
  • A contribution block that will be assembled into the parent's front
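As a toy illustration of the two operations above, a minimal sketch in Python/NumPy; the node layout, field names, and the use of a Cholesky partial factorization are assumptions for illustration, not the actual solver code:

```python
import numpy as np

def spd(n, rng):
    """Random symmetric positive definite test matrix (illustration only)."""
    M = rng.standard_normal((n, n))
    return M @ M.T + n * np.eye(n)

def process_front(node):
    """Toy multifrontal step: assemble the front from original-matrix
    coefficients plus the children's contribution blocks, then eliminate
    the node's k pivots by a partial (here: Cholesky) factorization."""
    F = np.array(node["coeffs"], dtype=float)
    for child in node["children"]:
        cb, idx = process_front(child)            # post-order (bottom-up)
        F[np.ix_(idx, idx)] += cb                 # assembly: extend-add

    k = node["k"]
    L11 = np.linalg.cholesky(F[:k, :k])           # factor the k pivot rows/cols
    L21 = np.linalg.solve(L11, F[k:, :k].T).T     # off-diagonal factor block
    cb = F[k:, k:] - L21 @ L21.T                  # contribution block (Schur)
    return cb, node.get("parent_idx", [])

# Two-node demo: the leaf's contribution block is assembled into the root.
rng = np.random.default_rng(0)
leaf = {"children": [], "coeffs": spd(3, rng), "k": 1, "parent_idx": [0, 1]}
root = {"children": [leaf], "coeffs": spd(3, rng), "k": 2}
cb_root, _ = process_front(root)   # a 1x1 contribution block remains at the root
```

In a real solver the extend-add uses row/column index lists computed during the analysis phase; here the child's `parent_idx` plays that role.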


slide-16
SLIDE 16

CPU-GPU hybrid architectures

GPUs may be used as powerful accelerators for HPC applications:

  • High computational performance (compared to a CPU: roughly 10× faster compute, 5× faster memory access)
  • Energy efficient

Despite these capabilities, the use of GPUs is challenging:

  • Complex architectures (roughly 100× more cores than a CPU)
  • CPU and GPU programming models are incompatible
  • CPU ↔ GPU transfers are expensive (no shared memory) ⇒ specific algorithms are needed

slide-17
SLIDE 17

CPU-GPU hybrid architectures

Heterogeneous platform

[Diagram: several multicore CPUs and GPUs attached to distinct RAM modules, next to an elimination tree]

  • An extremely heterogeneous workload
  • A heterogeneous architecture
  • ⇒ mapping tasks is challenging

slide-18
SLIDE 18

CPU-GPU hybrid architectures

Heterogeneous platform (diagram as above)

One option is to do the mapping by hand (see T. Davis' talk at SIAM PP12). This requires very accurate performance models, which are difficult to achieve.

slide-19
SLIDE 19

CPU-GPU hybrid architectures

Heterogeneous platform (diagram as above)

Elimination tree → StarPU (scheduler, DSM, drivers for CPU/GPU/SPU): the runtime system sits between the application and the hardware.

Another option is to exploit the features of a modern runtime system capable of handling the scheduling and the data coherency in a dynamic way.

slide-20
SLIDE 20

Runtime systems

Runtime system: an abstraction layer between the application and the machine, with the following features:

  • Automatic detection of task dependencies
  • Dynamic task scheduling on different types of processing units
  • Management of multi-versioned tasks (an implementation for each type of processing unit)
  • Coherency management of manipulated data
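The first feature, automatic dependency detection, can be sketched with a hypothetical toy model (not StarPU's actual API): a task depends on the last writer of any datum it reads or writes, which covers read-after-write and write-after-write hazards; write-after-read is ignored here for brevity.

```python
def build_dag(tasks):
    """tasks: list of (name, reads, writes) with data referred to by label.
    Returns the dependency edges (predecessor, successor)."""
    last_writer = {}
    edges = []
    for name, reads, writes in tasks:
        for d in reads + writes:
            if d in last_writer:              # RAW / WAW hazard
                edges.append((last_writer[d], name))
        for d in writes:
            last_writer[d] = name             # this task is now the last writer
    return edges

# Task names mimic the tile kernels used later in the talk (illustrative only).
tasks = [
    ("potrf(0,0)", [],      ["A00"]),
    ("trsm(1,0)",  ["A00"], ["A10"]),
    ("syrk(1,1)",  ["A10"], ["A11"]),
]
edges = build_dag(tasks)
```

Declaring each task's inputs and outputs is all the application has to do; the runtime infers the DAG from the access pattern, exactly as the bullet above states.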

slide-21
SLIDE 21

Multifrontal QR factorization on multicores
slide-22
SLIDE 22

The multifrontal QR factorization

The multifrontal QR factorization of a sparse matrix A = QR follows the pattern defined by the Cholesky factorization of the associated normal equations ATA = LLT, because of the equivalence of R and LT (up to signs). It shares most of the features of the multifrontal Cholesky algorithm, apart from (most importantly):

  • Frontal matrices are, in general, rectangular (either over- or under-determined)
  • Frontal matrices are fully factorized
  • Contribution blocks are stacked, not summed
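The stated equivalence between R and the Cholesky factor of the normal equations is easy to check numerically; a minimal NumPy sketch (random test matrix assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 5))      # over-determined, full column rank

R = np.linalg.qr(A, mode="r")        # upper-triangular R from A = QR
L = np.linalg.cholesky(A.T @ A)      # lower-triangular L from A^T A = L L^T

# R coincides with L^T up to the sign of each row of R.
assert np.allclose(np.abs(R), np.abs(L.T))
```

This is why the symbolic analysis (elimination tree, fill-in) for the QR factorization of A can be derived from the Cholesky analysis of ATA.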

slide-23
SLIDE 23

The multifrontal QR factorization: parallelism

Parallelism comes from two sources:

  • Tree: nodes in separate branches can be treated independently
  • Node: large nodes can be treated by multiple processes

In qr_mumps both sources are exploited consistently, by partitioning the frontal matrices and replacing the elimination tree with a DAG.

slide-24
SLIDE 24

The multifrontal QR factorization: StarPU integration

A StarPU task carries:

  • Code for each processing unit (CPU, GPU, SPU)
  • A priority
  • Inputs 1…n and outputs 1…m

  • Depending on the inputs/outputs, StarPU detects the dependencies among tasks
  • Depending on the availability of resources and the data placement, StarPU decides where to run a task

slide-25
SLIDE 25

The multifrontal QR factorization: StarPU integration

The easy way: replace every call operation1(i1, ..., in, o1, ..., om) with call submit_task(operation1, i1, ..., in, o1, ..., om) and let StarPU do all the work.

slide-26
SLIDE 26

The multifrontal QR factorization: StarPU integration

This is functionally correct, but the DAG may have millions of nodes, which makes the scheduling job too complex and memory consuming.

slide-27
SLIDE 27

The multifrontal QR factorization: StarPU integration

Our approach: we give StarPU a limited view of the DAG; this is achieved by defining tasks that submit other tasks.

slide-28
SLIDE 28

The multifrontal QR factorization: StarPU integration

In the DAG we define:

  • activation tasks, i.e., tasks in charge of allocating the memory and preparing the data structures needed for processing a front

slide-29
SLIDE 29

The multifrontal QR factorization: StarPU integration

  • All the activation tasks are submitted at once, with the right dependencies and very low priority. Each of them submits other tasks with higher priority

  • The runtime handles a DAG whose size is proportional only to the number of fronts that are active at a given moment

  • Tree traversal orders can be identified such that the size of this dynamic DAG is as small as possible, but big enough to feed all the threads
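The activation scheme above can be mimicked with a toy priority-queue scheduler (the scheduler and all names are invented for illustration; real inter-task dependencies are not modeled): low-priority activation tasks are all submitted up front, and each one submits its front's fine-grained tasks at high priority, so the pending pool only ever holds the tasks of currently active fronts.

```python
import heapq

class ToyScheduler:
    def __init__(self):
        self.queue = []          # (priority, seq, fn); lower number runs first
        self.seq = 0

    def submit(self, fn, priority):
        heapq.heappush(self.queue, (priority, self.seq, fn))
        self.seq += 1

    def run(self):
        while self.queue:
            _, _, fn = heapq.heappop(self.queue)
            fn()

sched = ToyScheduler()
log = []

def activate(front, n_tasks):
    # Activation task: "allocate" the front, then submit its fine-grained
    # tasks at high priority (0 beats the activations' 10).
    log.append(f"activate {front}")
    for t in range(n_tasks):
        sched.submit(lambda f=front, t=t: log.append(f"work {f}.{t}"), priority=0)

# All activation tasks are submitted at once, with very low priority.
for front in ["leaf1", "leaf2", "root"]:
    sched.submit(lambda f=front: activate(f, 2), priority=10)
sched.run()
```

Because fine-grained tasks outrank pending activations, each front's work drains before the next front is activated, keeping the live DAG small.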


slide-35
SLIDE 35

Experimental setup

  • Platform:
  • 4× AMD hexacore
  • 76 GB of memory (in 4 NUMA modules)
  • GNU 4.4 compilers
  • MKL 10.2
  • Problems: some relatively small matrices from the UF collection

Matrix        m       n        nnz       flops
degme         185501  659415   8127528   591 G
karted        46502   133115   1770349   258 G
flower_7_4    27693   67593    202218    4261 G
EternityII_E  11077   262144   1503732   544 G
cat_ears_4_4  19020   44448    132888    716 G
tp-6          142752  1014301  11537419  255 G

slide-36
SLIDE 36

Experimental results

[Plots: factorization time (sec.) vs. number of cores (1-24) on the 24-core AMD machine, for the degme and karted matrices, comparing qrm, qrm_starpu, and spqr]

slide-37
SLIDE 37

Experimental results

[Plots: factorization time (sec.) vs. number of cores (1-24) on the 24-core AMD machine, for the flower_7_4 and EternityII_E matrices, comparing qrm, qrm_starpu, and spqr]

slide-38
SLIDE 38

Experimental results

[Plots: factorization time (sec.) vs. number of cores (1-24) on the 24-core AMD machine, for the cat_ears_4_4 and tp-6 matrices, comparing qrm, qrm_starpu, and spqr]

slide-39
SLIDE 39

Experimental results

Execution trace for the degme matrix. Two main problems can be identified:

  • Too much time is spent in task submission (in red). The issue is under investigation

  • At the moment, parent-child dependencies are not finely managed, which means that it is not possible to start working on a node until all of its children are completely factorized

slide-40
SLIDE 40

Multifrontal Cholesky: front factorization on CPU-GPU hybrid systems

slide-41
SLIDE 41

From elimination tree to frontal matrix

Bottom-up traversal of the elimination tree. At each vertex (front):

  • Assembly of the contribution blocks from the children
  • Partial factorization of the frontal matrix

[Diagram: elimination tree]


slide-43
SLIDE 43

Tile Cholesky front factorization

  • Derived from: "A class of parallel tiled linear algebra algorithms for multicore architectures", Buttari et al., Parallel Comput., 2009
  • Extension: partial factorization of NPiv variables and computation of the Schur complement
  • Use of a runtime system: StarPU

[Diagram: elimination tree and frontal matrix (zoom), with the NPiv fully-summed variables highlighted]

slide-44
SLIDE 44

Tile Cholesky front factorization, NB|NPiv case

  • NB: tile size
  • NPiv: number of pivots

[Diagram: front partitioned into NB×NB tiles; at step k, the up-to-date data, current panel, trailing submatrix, and symmetric tiles are highlighted, together with the kernels applied]

  • NB: tile size (as above)
  • NPiv: number of pivots (as above)
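A minimal NumPy sketch of this tiled partial factorization for the NB | NPiv case (the function name and layout are assumptions; the kernel calls are labeled with their BLAS/LAPACK analogues). Only the first NPiv columns are factored; the remaining tiles are updated but never factored, so they end up holding the Schur complement, i.e., the contribution block:

```python
import numpy as np

def tile_partial_cholesky(A, NB, NPiv):
    """Right-looking tiled partial Cholesky: factor the first NPiv
    rows/columns in panels of width NB, updating the trailing tiles."""
    A = np.array(A, dtype=float)
    n = A.shape[0]
    for k in range(0, NPiv, NB):                  # panels of pivots only
        kk = slice(k, k + NB)
        A[kk, kk] = np.linalg.cholesky(A[kk, kk])              # potrf
        for i in range(k + NB, n, NB):
            ii = slice(i, i + NB)
            A[ii, kk] = np.linalg.solve(A[kk, kk], A[ii, kk].T).T  # trsm
        for i in range(k + NB, n, NB):
            ii = slice(i, i + NB)
            A[ii, ii] -= A[ii, kk] @ A[ii, kk].T               # syrk
            for j in range(k + NB, i, NB):
                jj = slice(j, j + NB)
                A[ii, jj] -= A[ii, kk] @ A[jj, kk].T           # gemm
    return A   # lower triangle: factor (first NPiv cols) + Schur complement
```

Each kernel call touches one tile, which is what gives the fine granularity and high concurrency exploited by the runtime.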


slide-48
SLIDE 48

Task flow

  • Fine granularity
  • High concurrency
  • Out-of-order execution

Kernels: potrf, trsm, syrk, gemm

slide-49
SLIDE 49

Multicore architecture

  • 12 cores: dual-socket hexa-core 2.67 GHz Intel Xeon X5650 (Westmere)
  • 12 MB L3 cache
  • 36 GB RAM

slide-50
SLIDE 50

Performance (multicore architecture)

[Plots: performance as a function of the front size N]

slide-51
SLIDE 51

Cholesky front factorization, general case (any NPiv)

[Diagram: same tiled layout as before, for an NPiv that is not a multiple of the tile size NB]


slide-53
SLIDE 53

Cholesky front factorization, general case (any NPiv)

Kernels: potrf, trsm, syrk, gemm, plus the variants potrf_sc and trsm_sc

slide-54
SLIDE 54

Performance (multicore architecture)

[Plots: performance as a function of the front size N]

slide-55
SLIDE 55

CPU-GPU hybrid architecture

  • 12 cores: dual-socket hexa-core 2.67 GHz Intel Xeon X5650 (Westmere)
  • 12 MB L3 cache
  • 36 GB RAM
  • 3 NVIDIA Tesla M2070 (Fermi) GPUs

slide-56
SLIDE 56

Performance (CPU-GPU hybrid architecture) - N=40K


slide-57
SLIDE 57

Execution trace (CPU-GPU hybrid architecture)

[Execution trace; legend: task execution, idle, active waiting]

slide-58
SLIDE 58

Conclusion, perspectives

slide-59
SLIDE 59

Conclusion

  • Preliminary work towards a functioning multifrontal code
  • Not as efficient (for now) as codes designed for a specific architecture
  • Performance portability across architectures
  • Dense kernels to be released in the MAGMA library

slide-60
SLIDE 60

Future work

  • Robust solver (Cholesky assembly step, solution steps)
  • Memory consumption (progressive task activation)
  • Cluster of heterogeneous nodes
  • Investigate explicit data dependencies management (DAGuE)


slide-61
SLIDE 61


Thanks!


slide-62
SLIDE 62

Appendix

slide-63
SLIDE 63

Validation (step 1)

  • Factorization on nelim



slide-65
SLIDE 65

Validation (step 1)

  • Update the trailing submatrix



slide-67
SLIDE 67

Validation (step 2)

  • Finish the factorization of the "nelim" tile


slide-68
SLIDE 68

Validation (step 2)

  • Update the rest of the matrix


slide-69
SLIDE 69

Validation (step 2)

  • Factorize the rest of the matrix, as in the case where NB divides nelim (NB | nelim)
