SLIDE 1

Computational Linear Algebra

in the age of Multicores

Alfredo Buttari, CNRS-IRIT Toulouse

RAIM 2011, Perpignan

SLIDE 2

Outline

Mixed-precision Iterative Refinement
  • Dense linear systems
  • Sparse, direct solvers
  • Sparse, iterative solvers
Linear Algebra on Multicores
  • Dense Factorizations
  • Sparse QR Factorization
SLIDE 3

Outline

Mixed-precision Iterative Refinement
  • Dense linear systems
  • Sparse, direct solvers
  • Sparse, iterative solvers
Linear Algebra on Multicores
  • Dense Factorizations
  • Sparse QR Factorization
SLIDE 7

Single vs double

On modern processors, single precision is faster than double precision.

Vector units

we can fit twice as many single-precision values in a vector as double-precision values

Memory bus

we can move twice as many single-precision values through the bus as double-precision values

Caches, TLBs and registers

we can fit twice as many single-precision values in the cache as double-precision values
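To see the effect in practice, here is a minimal NumPy sketch (not from the talk; the matrix size is an arbitrary choice) that times the BLAS gemm in both precisions:

    import time
    import numpy as np

    n = 2000
    a = np.random.rand(n, n)
    b = np.random.rand(n, n)

    for dtype in (np.float32, np.float64):
        x, y = a.astype(dtype), b.astype(dtype)
        t0 = time.perf_counter()
        c = x @ y                              # calls the BLAS *gemm for this precision
        t1 = time.perf_counter()
        print(f"{np.dtype(dtype).name}: {2 * n**3 / (t1 - t0) / 1e9:.1f} Gflop/s")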

SLIDE 8

Single vs double

     System                         BLAS
  1  Intel Pentium III Coppermine   GotoBLAS
  2  Intel Pentium III Katmai       GotoBLAS
  3  Sun UltraSPARC IIe             Sunperf
  4  Intel Pentium IV Prescott      GotoBLAS
  5  Intel Pentium IV-M Northwood   GotoBLAS
  6  AMD Opteron                    GotoBLAS
  7  CRAY X1                        libsci
  8  IBM PowerPC G5                 VecLib
  9  Compaq ALPHA EV6               CXML
 10  IBM Power 3                    ESSL
 11  SGI Octane                     ATLAS
SLIDE 9

Single vs double

[Bar chart: single/double performance ratio of _gemm, _getrf and _gesv on systems 1-11 above.]
SLIDE 11

Mixed precision iterative refinement

Can we solve a linear system with single-precision speed but double-precision accuracy? almost

SLIDE 12

Newton’s method

Newton’s method says that, given an approximate root x_k of the function f(x), we can refine it through iterations of the form:

    x_{k+1} = x_k − f(x_k) / f′(x_k)

[Figure: one Newton step, from x_k to x_{k+1}, on the graph of f(x).]

SLIDE 14

Mixed-precision arithmetic

Newton’s method can be applied, for example, to refine the root of the function f(x) = b − Ax (equivalent to solving the linear system Ax = b):

    x_{k+1} = x_k + A^{-1} r_k,   where r_k = b − Ax_k

This leads to the well-known iterative refinement method:

    x_0 ← A^{-1} b               O(n^3)
    repeat
        r_k ← b − Ax_k           O(n^2)
        e_k ← A^{-1} r_k         O(n^2)
        x_{k+1} ← x_k + e_k      O(n)
    until convergence

SLIDE 15

Mixed-precision arithmetic

Newton’s method can be applied, for example, to refine the root of the function f(x) = b − Ax (equivalent to solving the linear system Ax = b):

    x_{k+1} = x_k + A^{-1} r_k,   where r_k = b − Ax_k

This leads to the well-known iterative refinement method:

    x_0 ← A^{-1} b               (εs)
    repeat
        r_k ← b − Ax_k           (εd)
        e_k ← A^{-1} r_k         (εs)
        x_{k+1} ← x_k + e_k      (εd)
    until convergence

We can perform the expensive factorization in single precision and then do the refinement in double.

SLIDE 16

Outline

Mixed-precision Iterative Refinement
  • Dense linear systems
  • Sparse, direct solvers
  • Sparse, iterative solvers
Linear Algebra on Multicores
  • Dense Factorizations
  • Sparse QR Factorization
SLIDE 17

Dense linear systems

Mixed precision, Iterative Refinement for Dense Direct Solvers

    1: LU ← PA                     (εs)
    2: solve Ly = Pb               (εs)
    3: solve Ux_0 = y              (εs)
    4: r_0 ← b − Ax_0              (εd)
       do k = 0, 1, 2, ...
    5:    solve Ly = Pr_k          (εs)
    6:    solve Ue_k = y           (εs)
    7:    x_{k+1} ← x_k + e_k      (εd)   check convergence
    8:    r_{k+1} ← b − Ax_{k+1}   (εd)
       done

    call sgetrf( )     ! step 1
    call sgetrs( )     ! steps 2,3
    call dgemm( )      ! step 4
    do it = 1, itmax
       call sgetrs( )  ! steps 5,6
       call daxpy( )   ! step 7
       call idamax( )  ! convergence check
       if (converged) exit
       call dgemm( )   ! step 8
    end do
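The same recipe can be sketched in a few lines of Python, with SciPy standing in for the LAPACK calls above (an illustrative sketch, not the code behind the results; the test matrix and iteration limit are arbitrary):

    import numpy as np
    import scipy.linalg as la

    def mixed_precision_solve(A, b, itmax=30):
        n = A.shape[0]
        eps_d = np.finfo(np.float64).eps
        nrmA = np.linalg.norm(A, 'fro')
        lu, piv = la.lu_factor(A.astype(np.float32))                     # step 1: sgetrf (eps_s)
        x = la.lu_solve((lu, piv), b.astype(np.float32)).astype(np.float64)  # steps 2-3 (eps_s)
        for _ in range(itmax):
            r = b - A @ x                                                # steps 4/8 (eps_d)
            if np.linalg.norm(r) <= np.linalg.norm(x) * nrmA * eps_d * np.sqrt(n):
                break                                                    # stopping test from the talk
            e = la.lu_solve((lu, piv), r.astype(np.float32))             # steps 5-6: sgetrs (eps_s)
            x += e.astype(np.float64)                                    # step 7 (eps_d)
        return x

    A = np.random.rand(1000, 1000) + 1000 * np.eye(1000)   # a well-conditioned test matrix
    b = np.random.rand(1000)
    x = mixed_precision_solve(A, b)
    print(np.linalg.norm(b - A @ x) / np.linalg.norm(b))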

SLIDE 18

Results: Double vs Mixed

AMD Opteron 246 – 2.0 GHz

[Plots: Gflop/s vs problem size (1000-5000) for single, double and mixed precision; unsymmetric and symmetric problems.]

Convergence is always achieved within 3 iterations when ‖b − Ax‖_2 ≤ ‖x‖_2 · ‖A‖_fro · ε_d · √n

SLIDE 19

Results: Double vs Mixed

IBM PowerPC 970 – 2.0 GHz

[Plots: Gflop/s vs problem size (1000-5000) for single, double and mixed precision; symmetric and unsymmetric problems.]

Convergence is always achieved within 3 iterations when ‖b − Ax‖_2 ≤ ‖x‖_2 · ‖A‖_fro · ε_d · √n

SLIDE 20

Results: Double vs Mixed

Intel Woodcrest – 3.0 GHz

[Plots: Gflop/s vs problem size (1000-5000) for single, double and mixed precision; symmetric and unsymmetric problems.]

Convergence is always achieved within 3 iterations when ‖b − Ax‖_2 ≤ ‖x‖_2 · ‖A‖_fro · ε_d · √n

SLIDE 21

Results: Double vs mixed

SP is 14× faster than DP on the 1st generation Cell:

  • a factor 2 from the vector units
  • a factor 7 from the pipelining

[Plot: STI Cell CBE. Gflop/s vs problem size (1000-4000) for SP peak, SGEMM peak, DP peak, DSPOSV and SPOSV.]

SLIDE 22

Results: Quad vs mixed

Intel Xeon – 3.2 GHz

[Plot: mixed/quad performance ratio vs problem size (100-1000), unsymmetric problems.]

Convergence is always achieved within 3 iterations when ‖b − Ax‖_2 ≤ ‖x‖_2 · ‖A‖_fro · ε_q · √n

SLIDE 23

Convergence

Double precision (Datta):

    its = ln(ε_d) / ( ln(ε_d) + ln(κ) )

Mixed precision:

    its = ln(ε_d) / ( ln(ε_s) + ln(κ) )

[Plot: number of iterations (5-35) vs condition number (10^2 to 10^10).]

The method is stopped when ‖b − Ax‖_2 ≤ ‖x‖_2 · ‖A‖_fro · ε_d · √n
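As a quick sanity check of the estimate (not from the slides), plugging in the IEEE values of ε_s and ε_d gives:

    import numpy as np

    eps_s = np.finfo(np.float32).eps    # ~1.2e-7
    eps_d = np.finfo(np.float64).eps    # ~2.2e-16
    for kappa in (1e2, 1e4, 1e6):
        its = np.log(eps_d) / (np.log(eps_s) + np.log(kappa))
        print(f"kappa = {kappa:.0e}: about {its:.1f} iterations")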

SLIDE 26

So far so good

What’s good

  • Almost as fast as lower precision
  • As accurate as higher precision
  • Convergence is achieved in most practical cases within a few, cheap iterations

What’s bad

  • It requires the matrix to be stored in both high and low precision (50% more storage)

SLIDE 27

Outline

Mixed-precision Iterative Refinement
  • Dense linear systems
  • Sparse, direct solvers
  • Sparse, iterative solvers
Linear Algebra on Multicores
  • Dense Factorizations
  • Sparse QR Factorization
SLIDE 28

Sparse linear systems

With MUMPS it works exactly the same as for dense linear systems:

    1: LU ← PA                     (εs)
    2: solve Ly = Pb               (εs)
    3: solve Ux_0 = y              (εs)
    4: r_0 ← b − Ax_0              (εd)
       do k = 0, 1, 2, ...
    5:    solve Ly = Pr_k          (εs)
    6:    solve Ue_k = y           (εs)
    7:    x_{k+1} ← x_k + e_k      (εd)   check convergence
    8:    r_{k+1} ← b − Ax_{k+1}   (εd)
       done

    job = 6
    call smumps( )     ! steps 1,2,3
    call dspmv( )      ! step 4
    do it = 1, itmax
       job = 3
       call smumps( )  ! steps 5,6
       call daxpy( )   ! step 7
       call idamax( )  ! convergence check
       if (converged) exit
       call dspmv( )   ! step 8
    end do
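A hedged sketch of the same structure in Python, with SciPy's SuperLU factorization standing in for MUMPS (it is not MUMPS, and depending on the SciPy build the single-precision factorization may be upcast internally; the point is the precision split, not the solver):

    import numpy as np
    import scipy.sparse as sp
    import scipy.sparse.linalg as spla

    def sparse_mixed_solve(A, b, itmax=30):
        try:
            lu = spla.splu(A.astype(np.float32).tocsc())   # factorization in single precision
        except Exception:
            lu = spla.splu(A.astype(np.float64).tocsc())   # fallback if the build is double-only
        x = lu.solve(b.astype(np.float32)).astype(np.float64)
        n = A.shape[0]
        eps_d = np.finfo(np.float64).eps
        nrmA = spla.norm(A)                                # Frobenius norm of the sparse matrix
        for _ in range(itmax):
            r = b - A @ x                                  # residual in double precision
            if np.linalg.norm(r) <= np.linalg.norm(x) * nrmA * eps_d * np.sqrt(n):
                break
            x += lu.solve(r.astype(np.float32)).astype(np.float64)
        return x

    # usage on a small test matrix (a 1-D Poisson operator, purely for illustration)
    n = 5000
    A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format='csr')
    b = np.random.rand(n)
    x = sparse_mixed_solve(A, b)
    print(np.linalg.norm(b - A @ x) / np.linalg.norm(b))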

SLIDE 29

Sparse linear systems

     Matrix       N      NNZ      Sym.  PD   κ
  1  G64          7000   82918    yes   no   O(10^4)
  2  Si10H16      17077  875923   yes   no   O(10^3)
  3  c-71         76638  859554   yes   no   O(10)
  4  cage11       39082  559722   no    no   O(1)
  5  dawson5      51537  1010777  yes   no   O(10^4)
  6  nasasrb      54870  2677324  yes   yes  O(10^7)
  7  poisson3Db   85623  2374949  no    no   O(10^3)
  8  rma10        46835  2374001  no    no   O(10)
  9  s3rmt3m1     5489   112505   yes   yes  O(10^9)
 10  wang4        26068  177196   no    no   O(10^3)
SLIDE 30

Sparse linear systems: results

[Plots: single-precision and mixed-precision speedup with respect to double precision for matrices 1-10, on SunUltraSparc-IIe, Intel PentiumIII, AMD Opteron 246 and Intel Woodcrest.]
SLIDE 31

Sparse linear systems: memory

When using sparse direct solvers, the original matrix and the factors are stored separately.

     Matrix       N      NNZ      NNZ in LU
  1  G64          7000   82918    4889782
  2  Si10H16      17077  875923   32888553
  3  c-71         76638  859554   21077836
  4  cage11       39082  559722   60016690
  5  dawson5      51537  1010777  7547980
  6  nasasrb      54870  2677324  15682155
  7  poisson3Db   85623  2374949  74907857
  8  rma10        46835  2374001  13617819
  9  s3rmt3m1     5489   112505   847139
 10  wang4        26068  177196   11708654
SLIDE 32

Sparse linear systems: memory

In mixed-precision iterref the factors are stored in SP instead of DP.

[Plot: “Memory in MUMPS iterref”, relative memory consumption for matrices 1-10.]

SLIDE 33

Outline

Mixed-precision Iterative Refinement
  • Dense linear systems
  • Sparse, direct solvers
  • Sparse, iterative solvers
Linear Algebra on Multicores
  • Dense Factorizations
  • Sparse QR Factorization
SLIDE 34

Sparse, iterative solvers

Iterref and Richardson

The iterative refinement

    x_{k+1} = x_k + M (b − Ax_k),   with M = (LU)^{-1} P,

is equivalent to the Richardson iteration on MAx = Mb, where M is called a left preconditioner. We can go three steps further:

  • 1. use right preconditioning: AMu = b, x = Mu
  • 2. replace M with one step of an iterative method on A in single precision
  • 3. use a Krylov method for the outer iteration in double precision
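A one-line check of the stated equivalence (standard algebra, not an addition from the talk): the Richardson iteration with unit step on the left-preconditioned system $MAx = Mb$ reads

    $x_{k+1} = x_k + (Mb - MAx_k) = x_k + M(b - Ax_k), \qquad M = (LU)^{-1}P,$

which is exactly the iterative refinement update; since the factorization is computed in ε_s, MA ≈ I and the iteration contracts quickly.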

SLIDE 35

Sparse, iterative solvers

The result is a mixed-precision, inner-outer iteration of FGMRES-GMRES type:

     1: for i = 0, 1, ... do
     2:    r = b − Ax_i                                             (εd)
     3:    β = h_{1,0} = ‖r‖_2                                      (εd)
     4:    check convergence and exit if done
     5:    for k = 1, ..., m_out do
     6:       v_k = r / h_{k,k−1}                                   (εd)
     7:       perform one cycle of GMRES_SP(m_in) in order to
              (approximately) solve Az_k = v_k (initial guess z_k = 0)   (εs)
     8:       r = A z_k                                             (εd)
     9:       for j = 1, ..., k do
    10:          h_{j,k} = r^T v_j                                  (εd)
    11:          r = r − h_{j,k} v_j                                (εd)
    12:       end for
    13:       h_{k+1,k} = ‖r‖_2                                     (εd)
    14:    end for
    15:    define Z_k = [z_1, ..., z_k],  H_k = {h_{i,j}}  1≤i≤k+1, 1≤j≤k   (εd)
    16:    find y_k, the vector of size k, that minimizes ‖β e_1 − H_k y_k‖_2   (εd)
    17:    x_{i+1} = x_i + Z_k y_k                                  (εd)
    18: end for
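A compact NumPy/SciPy sketch of this inner-outer scheme (illustrative only: the restart length, inner cycle size and tolerances are arbitrary choices, and scipy.sparse.linalg.gmres is used for the single-precision inner cycles):

    import numpy as np
    import scipy.sparse as sp
    import scipy.sparse.linalg as spla

    def fgmres_gmres(A, b, m_out=10, outer_iters=20, tol=1e-12):
        n = A.shape[0]
        A32 = A.astype(np.float32)                 # single-precision copy for the inner solver
        x = np.zeros(n)
        for _ in range(outer_iters):               # outer FGMRES, double precision
            r = b - A @ x
            beta = np.linalg.norm(r)
            if beta <= tol * np.linalg.norm(b):
                break
            V = np.zeros((n, m_out + 1)); Z = np.zeros((n, m_out))
            H = np.zeros((m_out + 1, m_out))
            V[:, 0] = r / beta
            k_used = m_out
            for k in range(m_out):
                # inner correction: one restarted GMRES cycle in single precision
                z, _ = spla.gmres(A32, V[:, k].astype(np.float32), restart=30, maxiter=1)
                Z[:, k] = z.astype(np.float64)
                w = A @ Z[:, k]                    # back to double precision
                for j in range(k + 1):             # modified Gram-Schmidt
                    H[j, k] = w @ V[:, j]
                    w -= H[j, k] * V[:, j]
                H[k + 1, k] = np.linalg.norm(w)
                if H[k + 1, k] < 1e-30:            # happy breakdown
                    k_used = k + 1
                    break
                V[:, k + 1] = w / H[k + 1, k]
            e1 = np.zeros(k_used + 1); e1[0] = beta
            y, *_ = np.linalg.lstsq(H[:k_used + 1, :k_used], e1, rcond=None)  # min ||beta e1 - H y||
            x = x + Z[:, :k_used] @ y
        return x

    # usage on a diagonally dominant test matrix
    n = 10000
    A = sp.diags([-1, 2.1, -1], [-1, 0, 1], shape=(n, n), format='csr')
    b = np.ones(n)
    x = fgmres_gmres(A, b)
    print(np.linalg.norm(b - A @ x) / np.linalg.norm(b))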

SLIDE 36

Sparse, iterative solvers: results

 n.  Matrix       N      NNZ      Sym  PD   κ
  1  SiO          33401  1317655  yes  no   O(10^3)
  2  Lin          25600  1766400  yes  no   O(10^5)
  3  c-71         76638  859554   yes  no   O(10)
  4  cage-11      39082  559722   no   no   O(1)
  5  raefsky3     21200  1488768  no   no   O(10)
  6  poisson3Db   85623  2374949  no   no   O(10^3)
SLIDE 37

Sparse, iterative solvers: results

[Plots: speedup of SP-DP over DP and of SP-DP over DP-DP for matrices 1-6, on Intel Woodcrest 3.0 GHz, AMD Opteron 246 2.0 GHz and IBM PowerPC 970 2.5 GHz.]
SLIDE 38

Outline

Mixed-precision Iterative Refinement
  • Dense linear systems
  • Sparse, direct solvers
  • Sparse, iterative solvers
Linear Algebra on Multicores
  • Dense Factorizations
  • Sparse QR Factorization
SLIDE 39

Coding for multicores

Fine Granularity

  • as parallelism degree increases, finer granularity is needed
  • data should be split in pieces that fit in the small local memories

Asynchronicity

  • a higher parallelism degree suffers from strict synchronizations
  • asynchronous, dynamic execution can be used to hide the latency of access to memory

Data Locality

  • the gap between CPU power and communication speed is increasing and, thus, it is essential to reduce data movement

SLIDE 40

How to achieve the properties?

We must make the transition to parallel algorithms.

[Diagram: LAPACK on top of a multithreaded BLAS (MT-BLAS, parallelism inside the BLAS) versus parallel algorithms in which each thread calls a sequential BLAS (ST-BLAS, parallelism expressed at the algorithm level).]

SLIDE 41

Outline

Mixed-precision Iterative Refinement
  • Dense linear systems
  • Sparse, direct solvers
  • Sparse, iterative solvers
Linear Algebra on Multicores
  • Dense Factorizations
  • Sparse QR Factorization
SLIDE 42

LAPACK algorithms

Panel reduction: a number of transformations are computed on a small portion of the matrix called the “panel”. These are Level-2 BLAS operations.

Trailing submatrix update: the transformations computed on the panel are applied to the rest of the matrix, i.e., the “trailing submatrix”. These are Level-3 BLAS operations.

SLIDE 44

LAPACK algorithms

Idle time can be due to sequential tasks and load imbalance.
SLIDE 45

Fine Granularity: parallel tiled algorithms

Parallel tiled algorithms: fine granularity (of data and operations) can be achieved by partitioning a matrix A of size m × n into tiles of size b × b:

    A = [ A_11  A_12  ...  A_1s ]
        [ A_21  A_22  ...  A_2s ]
        [  ...   ...  ...   ... ]
        [ A_r1  A_r2  ...  A_rs ]

Algorithms are rewritten in such a way that tiles are the operands of elementary operations.

SLIDE 46

Fine Granularity: parallel tiled algorithms

There are cases where the “tiling” can be done starting from the LAPACK algorithm.

    DPOTRF:  A    → L = chol(A)
    DTRSM:   L, A → A = A L^{-T}
    DSYRK:   B, A → A = A − B B^T

    for k = 1, b+1, 2b+1, ... do
        DPOTRF( A(k : k+b−1, k : k+b−1) )
        DTRSM ( L(k : k+b−1, k : k+b−1), A(k+b : n, k : k+b−1) )
        DSYRK ( L(k+b : n, k : k+b−1),   A(k+b : n, k+b : n) )
    end for

SLIDE 47

Fine Granularity: parallel tiled algorithms

There are cases where the “tiling” can be done starting from the LAPACK algorithm.

    for k = 1, 2, ..., s do
        DPOTRF(A_kk)
        for i = k+1, k+2, ..., s do
            DTRSM(L_kk, A_ik)
        end for
        for i = k+1, k+2, ..., s do
            for j = k+1, k+2, ..., s do
                DSYRK(L_ik, L_jk, A_ij)
            end for
        end for
    end for
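A small NumPy sketch of this tiled Cholesky (illustrative; it assumes the tile size divides the matrix size and runs the tasks sequentially, the point being the tile-level kernels):

    import numpy as np
    from scipy.linalg import cholesky, solve_triangular

    def tiled_cholesky(A, b):
        """In-place lower Cholesky of A, operating on b x b tiles (b must divide n)."""
        n = A.shape[0]
        s = n // b
        T = lambda i, j: A[i*b:(i+1)*b, j*b:(j+1)*b]        # view of tile (i, j)
        for k in range(s):
            T(k, k)[:] = cholesky(T(k, k), lower=True)                   # DPOTRF
            for i in range(k + 1, s):
                # DTRSM: A_ik <- A_ik * L_kk^{-T}
                T(i, k)[:] = solve_triangular(T(k, k), T(i, k).T, lower=True).T
            for i in range(k + 1, s):
                for j in range(k + 1, i + 1):
                    # DSYRK/DGEMM: A_ij <- A_ij - L_ik * L_jk^T (lower triangle only)
                    T(i, j)[:] -= T(i, k) @ T(j, k).T
        return A

    # usage: random SPD matrix, checked against the reconstructed product
    n, b = 512, 128
    M = np.random.rand(n, n); A = M @ M.T + n * np.eye(n)
    L = np.tril(tiled_cholesky(A.copy(), b))
    print(np.allclose(L @ L.T, A))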

SLIDE 48

Fine Granularity: parallel tiled algorithms

Things get much more complicated for LU...

    DGETRF:  A_kk                        →  L_kk, U_kk, P_kk = LU(A_kk)
    DGESSM:  A_kj, L_kk, P_kk            →  U_kj = L_kk^{-1} P_kk A_kj
    DTSTRF:  [ U_kk ; A_ik ]             →  U_kk, L_ik, P_ik = LU( [ U_kk ; A_ik ] )
    DSSSSM:  [ U_kj ; A_ij ], L_ik, P_ik →  [ U_kj ; A_ij ] = L_ik^{-1} P_ik [ U_kj ; A_ij ]

(Here [ X ; Y ] denotes two tiles stacked vertically.)
SLIDE 49

Fine Granularity: parallel tiled algorithms

... and QR

    DGEQRT:  A_kk                        →  (V_kk, R_kk, T_kk) = QR(A_kk)
    DLARFB:  A_kj, V_kk, T_kk            →  R_kj = (I − V_kk T_kk V_kk^T) A_kj
    DTSQRT:  [ R_kk ; A_ik ]             →  (V_ik, T_ik, R_kk) = QR( [ R_kk ; A_ik ] )
    DSSRFB:  [ R_kj ; A_ij ], V_ik, T_ik →  [ R_kj ; A_ij ] = (I − V_ik T_ik V_ik^T) [ R_kj ; A_ij ]
SLIDE 50

Fine Granularity: parallel tiled algorithms

LU:

    for k = 1, 2, ..., min(p, q) do
        DGETRF(A_kk, L_kk, U_kk, P_kk)
        for j = k+1, k+2, ..., q do
            DGESSM(A_kj, L_kk, P_kk, U_kj)
        end for
        for i = k+1, k+2, ..., p do
            DTSTRF(U_kk, A_ik, P_ik)
            for j = k+1, k+2, ..., q do
                DSSSSM(U_kj, A_ij, L_ik, P_ik)
            end for
        end for
    end for

QR:

    for k = 1, 2, ..., min(p, q) do
        DGEQRT(A_kk, V_kk, R_kk, T_kk)
        for j = k+1, k+2, ..., q do
            DLARFB(A_kj, V_kk, T_kk, R_kj)
        end for
        for i = k+1, k+2, ..., p do
            DTSQRT(R_kk, A_ik, V_ik, T_ik)
            for j = k+1, k+2, ..., q do
                DSSRFB(R_kj, A_ij, V_ik, T_ik)
            end for
        end for
    end for

SLIDE 51

Fine Granularity: parallel tile algorithms

SLIDE 55

Fine Granularity: parallel tile algorithms

Task

The execution of an elementary operation on one (or a few) tiles.

  • fine granularity: tiles can be of arbitrarily small size, as long as sequential BLAS routines are efficient on them
  • many tasks: tiles are typically of size ∼100; for a matrix of size 10000 we have more than 300000 tasks!
  • complex dependencies: how can they be exploited in order to maximize parallelism?

Data-Flow Parallelism

SLIDE 56

Data-flow parallel programming

Data-flow programming model [Silc et al. 97]

An instruction is said to be executable when all the input operands that are necessary for its execution are available to it. The instruction for which this condition is satisfied is said to be fired. The effect of an instruction is the consumption of its input values and the generation of output values... As a result, a dataflow program can be represented as a directed graph consisting of named nodes, which represent instructions, and arcs, which represent data dependencies among instructions.

SLIDE 57

Asynchronicity: dynamic DAG driven execution

  • nodes are tasks that operate on tiles
  • edges are dependencies among nodes

Tasks can be scheduled asynchronously in any order as long as the dependencies are not violated.

  • dynamic execution flow
  • latency hiding
  • fewer synchronizations
  • no idle times
  • scheduling policies

SLIDE 59

Asynchronicity: dynamic DAG driven execution

Data-flow programming model [Silc et al. 97]

Due to the above rule the model is asynchronous. It is also self-scheduling since instruction sequencing is constrained only by data dependencies.

SLIDE 60

Block Data Layout

Fine granularity raises the need for a novel data storage format in order to overcome the limitations of BLAS routines on small chunks of data.
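A minimal NumPy sketch of such a conversion to tile (block) storage, assuming the tile size divides the matrix dimensions (sizes below are illustrative):

    import numpy as np

    def to_tile_layout(A, b):
        """Return a 4-D array T with T[i, j] holding the contiguous tile A_ij."""
        m, n = A.shape
        return (A.reshape(m // b, b, n // b, b)   # split rows and columns into tiles
                 .transpose(0, 2, 1, 3)           # bring the tile indices to the front
                 .copy())                         # make every tile contiguous in memory

    def from_tile_layout(T):
        p, q, b, _ = T.shape
        return T.transpose(0, 2, 1, 3).reshape(p * b, q * b)

    A = np.arange(36.0).reshape(6, 6)
    T = to_tile_layout(A, 3)
    print(T[1, 0])                                # the tile A[3:6, 0:3], stored contiguously
    print(np.array_equal(from_tile_layout(T), A)) # round trip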

SLIDE 62

Performance results

[Plot: Cholesky on a quad-socket, dual-core Opteron. Gflop/s vs problem size (1000-10000) for PLASMA & ACML BLAS, ACML Cholesky, MKL Cholesky, LAPACK & ACML BLAS.]
SLIDE 63

Performance results

[Plot: QR on a quad-socket, dual-core Opteron. Gflop/s vs problem size (1000-10000) for PLASMA & ACML BLAS, ACML QR, MKL QR, LAPACK & ACML BLAS.]
SLIDE 64

Performance results

[Plot: LU on a quad-socket, dual-core Opteron. Gflop/s vs problem size (1000-10000) for PLASMA & ACML BLAS, ACML LU, MKL LU, LAPACK & ACML BLAS.]
SLIDE 65

Stability of GETWP

Gaussian Elimination with Tiled Pairwise Pivoting:

    N = (L_{p,p} P_{p,p} ··· L_{2,p} P_{2,p} ··· L_{2,2} P_{2,2} L_{1,p} P_{1,p} ··· L_{1,1} P_{1,1})^{-1}   →   A = NU

Three main differences occur between L from GEPP and N from GETWP:

  • 1. N is the combination of permutations and elementary transformations, whose effect cannot be separated into two matrices L and P as it can for GEPP,
  • 2. although we need n(n − 1)/2 elements to store all the (L_{i,j})’s, the matrix N is not unit lower triangular and has in general far more than n(n − 1)/2 entries,
  • 3. the absolute values of the off-diagonal entries of N can be greater than 1 and in practice they are notably larger.

SLIDE 66

Stability of GETWP

Gaussian Elimination with Tiled Pairwise Pivoting:

    N = (L_{p,p} P_{p,p} ··· L_{2,p} P_{2,p} ··· L_{2,2} P_{2,2} L_{1,p} P_{1,p} ··· L_{1,1} P_{1,1})^{-1}   →   A = NU

Backward error for the factorization and the solution

SLIDE 67

Stability of GETWP

Gaussian Elimination with Tiled Pairwise Pivoting:

    N = (L_{p,p} P_{p,p} ··· L_{2,p} P_{2,p} ··· L_{2,2} P_{2,2} L_{1,p} P_{1,p} ··· L_{1,1} P_{1,1})^{-1}   →   A = NU

[Plots of the backward-error ratios of tiled pairwise pivoting (wp) to partial pivoting (pp):]

    ( ‖y − A x_wp‖_∞ / (‖A‖_∞ ‖x_wp‖_∞) )  /  ( ‖y − A x_pp‖_∞ / (‖A‖_∞ ‖x_pp‖_∞) )

    ( ‖A − N_wp U_wp‖_∞ / ‖A‖_∞ )  /  ( ‖P A − L_pp U_pp‖_∞ / ‖A‖_∞ )
SLIDE 68

Outline

Mixed-precision Iterative Refinement
  • Dense linear systems
  • Sparse, direct solvers
  • Sparse, iterative solvers
Linear Algebra on Multicores
  • Dense Factorizations
  • Sparse QR Factorization
SLIDE 69

Multifrontal QR

  • 1. Analysis: symbolic computations to reduce fill-in, compute the elimination tree, symbolic factorization, estimates, etc.
  • 2. Factorization: compute the Q and R factors
  • 3. Solve: use Q and R to compute the solution of a problem (e.g. min_x ‖Ax − b‖_2)

[Figure: example sparse matrix and its elimination tree (nodes 1-9).]
SLIDE 70

Multifrontal QR

  • 2. Factorization: compute the Q and R factors

[Figure: elimination tree (nodes 1-9).]

  • a dense frontal matrix is associated with each node. Every frontal matrix contains a number of fully assembled rows that deliver a piece of the global R factor
  • the tree is processed bottom-up and at each node:
      1. Assembly: the contribution blocks from the children nodes are assembled into the frontal matrix
      2. Factorization: k eliminations are done through partial or full factorization of the frontal matrix

SLIDE 71

The Multifrontal QR: front factorization

  • the frontal matrix is assumed to be sparse and permuted into a staircase structure
  • a complete Level-3 BLAS QR factorization of the frontal matrix is computed
  • the result is
      • k rows of the global R factor
      • a number of reflectors used to build the Q factor
      • a contribution block to be assembled into the father node
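A toy NumPy illustration of what one front factorization produces (this is not qr_mumps code; factorize_front and its arguments are hypothetical, and the staircase structure is ignored):

    import numpy as np

    def factorize_front(front, k):
        """front: dense frontal matrix; k: number of fully assembled variables."""
        Q, R = np.linalg.qr(front, mode='reduced')   # Level-3 BLAS QR of the whole front
        R_rows = R[:k, :]                            # k rows of the global R factor
        contribution = R[k:, k:]                     # block to be assembled into the father
        return Q, R_rows, contribution

    # usage on a random front with 3 fully assembled columns
    front = np.random.rand(8, 6)
    Q, R_rows, cb = factorize_front(front, 3)
    print(R_rows.shape, cb.shape)                    # (3, 6) and (3, 3)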

SLIDE 74

The Multifrontal QR: front assembly

Frontal matrix assembly is achieved in two steps:

  • 1. contribution blocks are simply appended at the bottom of the father front
  • 2. a row permutation must be computed to restore a staircase structure

SLIDE 75

Parallelism

As for the Cholesky, LU and LDL^T multifrontal methods, two levels of parallelism can be exploited:

Tree parallelism

  • fronts associated with nodes in different branches are independent and can, thus, be factorized in parallel

Front parallelism

  • if the size of a front is big enough, the front can be factorized in parallel

SLIDE 76

Parallelism: classical approach

The classical approach (Puglisi, Matstom, Davis)

  • Tree parallelism:
      • a front assembly + factorization corresponds to a task
      • computational tasks are added to a task pool
      • threads fetch tasks from the pool repeatedly until all the fronts are done
  • Node parallelism:
      • multithreaded BLAS for the front factorization

What’s wrong with this approach? A complete separation of the two levels of parallelism, which causes

  • potentially strong load imbalance
  • heavy synchronizations due to the sequential nature of some operations (assembly)
  • sub-optimal exploitation of the concurrency in the multifrontal method

SLIDE 77

Parallelism: classical approach

Node parallelism grows towards the root; tree parallelism grows towards the leaves.
SLIDE 78

Parallelism: a new approach

fine-grained, data-flow parallel approach

  • fine granularity: tasks are not defined as operations on fronts but as operations on portions of fronts defined by a 1-D partitioning
  • data-flow parallelism: tasks are scheduled dynamically based on the dependencies between them

Both node and tree parallelism are handled the same way at any level of the tree.

SLIDE 79

Parallelism: a new approach

Fine-granularity is achieved through a 1-D block partitioning of fronts and the definition of five elementary operations:

  • 1. activate(front): the activation of a front corresponds to a full determination of its (staircase) structure and the allocation of the needed memory areas
  • 2. panel(bcol): QR factorization (Level-2 BLAS) of a column
  • 3. update(bcol): update of a column in the trailing submatrix with respect to a panel
  • 4. assemble(bcol): assembly of a column of the contribution block into the father
  • 5. clean(front): clean up the front in order to release all the memory areas that are no longer needed

SLIDE 92

Parallelism: a new approach

  • a frontal matrix is 1-D partitioned into block-columns
  • panels are factorized as usual
  • updates can be applied to each column separately
  • Dependency rule #1: a panel can be factorized as soon as it is updated with respect to the previous step (lookahead). Early panel factorizations result in more updates in a “ready” state and, thus, more parallelism
  • Dependency rule #2: a column can be updated with respect to a panel if it is up to date with respect to all previous panels

SLIDE 100

Parallelism: a new approach

A look at the whole tree:

  • fronts must be activated: the structure is computed and memory is allocated
  • Dependency rule #3: a node can be activated only if all of its children are already active
  • Dependency rule #4: a column can be assembled into the father if it is up to date with respect to all the preceding panels and the father is active
  • Dependency rule #5: a column can be treated, once it is fully assembled, regardless of the status of the rest of the tree

SLIDE 101

Parallelism: the DAG-tree

Following a data-flow programming model, the elimination tree is replaced by a DAG:
SLIDE 102

Parallelism: scheduling

    ! execution loop
    do
       call get_task()
       select case(task_type)
       case (0)
          exit
       case (1)
          call do_activate(...)
       case (2)
          call do_panel(...)
       case (3)
          call do_update(...)
       case (4)
          call do_assemble(...)
       case (5)
          call do_clean(...)
       end select
    end do

Each thread runs a loop where:

  • 1. it fetches a task
  • 2. it executes the fetched task

The resulting flow of tasks is

  • self-scheduled
  • asynchronous
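A hedged Python sketch of such a self-scheduled worker loop (the task types and handler names mirror the cases above but are illustrative; dependency tracking and task generation are omitted):

    import queue
    import threading

    EXIT, ACTIVATE, PANEL, UPDATE, ASSEMBLE, CLEAN = range(6)

    def worker(tasks, handlers):
        while True:
            task_type, args = tasks.get()          # blocking fetch, like get_task()
            if task_type == EXIT:
                break
            handlers[task_type](*args)             # execute the fetched task

    tasks = queue.Queue()
    handlers = {ACTIVATE: print, PANEL: print, UPDATE: print,
                ASSEMBLE: print, CLEAN: print}     # stand-ins for do_activate, do_panel, ...

    threads = [threading.Thread(target=worker, args=(tasks, handlers)) for _ in range(4)]
    for t in threads: t.start()
    for item in [(ACTIVATE, ("front 1",)), (PANEL, ("front 1, bcol 1",)), (CLEAN, ("front 1",))]:
        tasks.put(item)
    for _ in threads: tasks.put((EXIT, ()))
    for t in threads: t.join()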

SLIDE 103

Parallelism: results

[Plots: speedup vs number of cores (1-8) on Intel Xeon for matrices Rucci1 and ohne2: qrm, spqr and dgeqrf.]
SLIDE 104

Parallelism: results

[Plots: speedup vs number of cores (1-24) on AMD Opteron for matrices Rucci1 and ohne2: qrm, spqr and dgeqrf.]
SLIDE 105

Parallelism: results

[Plots: performance as a fraction of dgemm vs number of cores (1-8) on Intel Xeon for matrices Rucci1 and ohne2: dgeqrf and qrm.]
SLIDE 106

Parallelism: results

[Plots: performance as a fraction of dgemm vs number of cores (1-24) on AMD Opteron for matrices Rucci1 and ohne2: dgeqrf and qrm.]
SLIDE 107

Parallelism: results

[Plot: % CPU usage vs time (sec.): qr-mumps 799.8%, spqr 773.1%.]
SLIDE 108

Parallelism: results

SLIDE 109

Biblio

◮ High Performance Computing and Grids in Action, chapter “Exploiting Mixed Precision Floating Point Hardware in Scientific Computations”. 2007. [PDF]

◮ Marc Baboulin, Alfredo Buttari, Jack Dongarra, Jakub Kurzak, Julie Langou, Julien Langou, Piotr Luszczek, and Stanimire Tomov. Accelerating scientific computations with mixed precision algorithms. Computer Physics Communications, 180(12):2526-2533, 2009. [doi:10.1016/j.cpc.2008.11.005]

◮ A. Buttari. Fine granularity sparse QR factorization for multicore based systems. To appear in the PARA 2010 conference proceedings. [PDF]
SLIDE 110

Biblio

◮ A. Buttari, J. Dongarra, J. Kurzak, P. Luszczek, and S. Tomov. Using mixed precision for sparse matrix computations to enhance the performance while achieving 64-bit accuracy. ACM Trans. Math. Softw., 34(4):1-22, 2008. [doi:10.1145/1377596.1377597]

◮ A. Buttari, J. Dongarra, J. Langou, J. Langou, P. Luszczek, and J. Kurzak. Mixed precision iterative refinement techniques for the solution of dense linear systems. Int. J. High Perform. Comput. Appl., 21(4):457-466, 2007. [doi:10.1177/1094342007084026]

◮ A. Buttari, J. Langou, J. Kurzak, and J. Dongarra. Parallel tiled QR factorization for multicore architectures. In PPAM'07: Proceedings of the 7th International Conference on Parallel Processing and Applied Mathematics, pages 639-648, Berlin, Heidelberg, 2008. Springer-Verlag. [doi:10.1007/978-3-540-68111-3_67]
SLIDE 111

Biblio

◮ A. Buttari, J. Langou, J. Kurzak, and J. Dongarra. A class of parallel tiled linear algebra algorithms for multicore architectures. Parallel Comput., 35(1):38-53, 2009. [doi:10.1016/j.parco.2008.10.002]

◮ J. Kurzak, A. Buttari, and J. Dongarra. Solving systems of linear equations on the Cell processor using Cholesky factorization. IEEE Trans. Parallel Distrib. Syst., 19(9):1175-1186, 2008. [doi:10.1109/TPDS.2007.70813]

◮ J. Langou, J. Langou, P. Luszczek, J. Kurzak, A. Buttari, and J. Dongarra. Exploiting the performance of 32 bit floating point arithmetic in obtaining 64 bit accuracy (revisiting iterative refinement for linear systems). In SC '06: Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, page 113, New York, NY, USA, 2006. ACM. [doi:10.1145/1188455.1188573]