Computational Linear Algebra
in the age of Multicores
Alfredo Buttari, CNRS-IRIT Toulouse
RAIM 2011, Perpignan
Computational Linear Algebra in the age of Multicores Alfredo - - PowerPoint PPT Presentation
Computational Linear Algebra in the age of Multicores Alfredo Buttari , CNRS-IRIT Toulouse RAIM 2011, Perpignan Outline Mixed-precision Iterative Refinement Dense linear systems Sparse, direct solvers Sparse, iterative solvers Linear
RAIM 2011, Perpignan
2/70 RAIM 2011, Perpignan
3/70 RAIM 2011, Perpignan
4/70 RAIM 2011, Perpignan
4/70 RAIM 2011, Perpignan
4/70 RAIM 2011, Perpignan
4/70 RAIM 2011, Perpignan
5/70 RAIM 2011, Perpignan
1 2 3 4 5 6 7 8 9 10 11 1 2
Single/Double system _gemm _getrf _gesv
6/70 RAIM 2011, Perpignan
7/70 RAIM 2011, Perpignan
7/70 RAIM 2011, Perpignan
Newton's method f(x)
k k+1
8/70 RAIM 2011, Perpignan
9/70 RAIM 2011, Perpignan
9/70 RAIM 2011, Perpignan
9/70 RAIM 2011, Perpignan
10/70 RAIM 2011, Perpignan
1: LU← PA
2: solve Ly = Pb
3: solve Ux0 = y
4: r0 ← b − Ax0
5:
6:
7:
8:
11/70 RAIM 2011, Perpignan
1000 2000 3000 4000 5000 1 2 3 4 5 6 problem size Gflop/s
Unsymmetric problems
single double mixed 1000 2000 3000 4000 5000 1 2 3 4 5 6 problem size Gflop/s
Symmetric problems
single double mixed
12/70 RAIM 2011, Perpignan
1000 2000 3000 4000 5000 2 4 6 8 10 problem size Gflop/s
Symmetric problems
1000 2000 3000 4000 5000 1 2 3 4 5 6 7 8 problem size Gflop/s
Unsymmetric problems
single double mixed single double mixed
12/70 RAIM 2011, Perpignan
1000 2000 3000 4000 5000 5 10 15 problem size Gflop/s
Symmetric problems
1000 2000 3000 4000 5000 2 4 6 8 10 12 14 16 problem size Gflop/s
Unsymmetric problems
single double mixed single double mixed
12/70 RAIM 2011, Perpignan
1000 2000 3000 4000 25 50 75 100 125 150 175 200
Size Gflop/s
SP peak SGEMM peak DP peak DSPOSV SPOSV
13/70 RAIM 2011, Perpignan
100 200 300 400 500 600 700 800 900 1000 10 20 30 40 50 60 70 80 90 100
problem size mmixed/quad performance
Quad vs Mixed -- Unsymmetric
14/70 RAIM 2011, Perpignan
10
2
10
4
10
6
10
8
10
10
5 10 15 20 25 30 35
Condition number # iterations
15/70 RAIM 2011, Perpignan
16/70 RAIM 2011, Perpignan
16/70 RAIM 2011, Perpignan
16/70 RAIM 2011, Perpignan
17/70 RAIM 2011, Perpignan
1: LU← PA
2: solve Ly = Pb
3: solve Ux0 = y
4: r0 ← b − Ax0
5:
6:
7:
8:
18/70 RAIM 2011, Perpignan
19/70 RAIM 2011, Perpignan
1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 Matrix n. Speedup wrt double precision
SunUltraSparc−IIe
3 3 2 2 3 9 2 2 20 2 Single prec. su Mixed prec. su 1 2 3 4 5 6 7 8 9 10 1 2 3 4 Matrix n. Speedup wrt double precision
Intel PentiumIII
5 4 2 2 5 9 2 2 20 2 Single prec. su Mixed prec. su 1 2 3 4 5 6 7 8 9 10 0.5 1 1.5 2 Matrix n. Speedup wrt double precision
AMD Opteron 246
5 4 2 2 4 10 2 2 20 2 Single prec. su Mixed prec. su 1 2 3 4 5 6 7 8 9 10 0.5 1 1.5 2 Matrix n. Speedup wrt double precision
Intel Woodcrest
6 4 3 2 5 10 2 2 20 2 Single prec. su Mixed prec. su
20/70 RAIM 2011, Perpignan
21/70 RAIM 2011, Perpignan
1 2 3 4 5 6 7 8 9 10 0.2 0.4 0.6 0.8 1
relative memory consumption matrix number
22/70 RAIM 2011, Perpignan
23/70 RAIM 2011, Perpignan
24/70 RAIM 2011, Perpignan
1: for i = 0, 1, ... do 2:
r = b − Axi (εd)
3:
β = h1,0 = ||r||2 (εd)
4:
check convergence and exit if done
5:
for k = 1, . . . , mout do
6:
vk = r / hk,k−1 (εd)
7:
Perform one cycle of GMRESSP (min) in order to (approximately) solve Azk = vk, (initial guess zk = 0) (εs)
8:
r = A zk (εd)
9:
for j=1,. . . ,k do
10:
hj,k = rT vj (εd)
11:
r = r − hj,k vj (εd)
12:
end for
13:
hk+1,k = ||r||2 (εd)
14:
end for
15:
Define Zk = [z1, . . . , zk], Hk = {hi,j}1≤i≤k+1,1≤j≤k (εd)
16:
Find yk, the vector of size k, that minimizes ||βe1 − Hk yk||2 (εd)
17:
xi+1 = xi + Zk yk (εd)
18: end for
25/70 RAIM 2011, Perpignan
26/70 RAIM 2011, Perpignan
matrix number speedup
SP-DP/DP
1 2 3 4 5 6 1 2 3 4 5 6
matrix number speedup
SP-DP / DP-DP
1 2 3 4 5 6 0.5 1 1.5 2 2.5 Intel Woodcrest 3.0 GHz AMD Opteron 246 2.0 GHz IBM PowerPC 970 2.5 GHz Intel Woodcrest 3.0 GHz AMD Opteron 246 2.0 GHz IBM PowerPC 970 2.5 GHz
27/70 RAIM 2011, Perpignan
28/70 RAIM 2011, Perpignan
29/70 RAIM 2011, Perpignan
LAPACK LAPACK MT−BLAS thread 1 thread n ST−BLAS ST−BLAS thread 1 thread n
parallelism parallelism
30/70 RAIM 2011, Perpignan
31/70 RAIM 2011, Perpignan
32/70 RAIM 2011, Perpignan
32/70 RAIM 2011, Perpignan
33/70 RAIM 2011, Perpignan
34/70 RAIM 2011, Perpignan
k : k+b−1, k : k+b−1 ))
k : k+b−1, k : k+b−1 ), A ( k+b : n, k : k+b−1 ))
k+b : n, k : k+b−1 ), A ( k+b : n, k+b : n ))
35/70 RAIM 2011, Perpignan
35/70 RAIM 2011, Perpignan
kk PkkAkj
ik Pik
RAIM 2011, Perpignan
kk)Akj
ik )
RAIM 2011, Perpignan
for k = 1, 2..., min(p, q) do DGETRF(Akk, Lkk, Ukk, Pkk) for j = k + 1, k + 2, ..., q do DGESSM(Akj, Lkk, Pkk, Ukj) end for for i = k + 1, k + 1, ..., p do DTSTRF(Ukk, Aik, Pik) for j = k + 1, k + 2, ..., q do DSSSSM(Ukj, Aij, Lik, Pik) end for end for end for
for k = 1, 2..., min(p, q) do DGEQRT(Akk, Lkk, Ukk, Pkk) for j = k + 1, k + 2, ..., q do DLARFB(Akj, Lkk, Pkk, Ukj) end for for i = k + 1, k + 1, ..., p do DTSQRT(Ukk, Aik, Pik) for j = k + 1, k + 2, ..., q do DSSRFB(Ukj, Aij, Lik, Pik) end for end for end for
38/70 RAIM 2011, Perpignan
39/70 RAIM 2011, Perpignan
39/70 RAIM 2011, Perpignan
39/70 RAIM 2011, Perpignan
39/70 RAIM 2011, Perpignan
40/70 RAIM 2011, Perpignan
41/70 RAIM 2011, Perpignan
42/70 RAIM 2011, Perpignan
42/70 RAIM 2011, Perpignan
43/70 RAIM 2011, Perpignan
44/70 RAIM 2011, Perpignan
44/70 RAIM 2011, Perpignan
1000 2000 3000 4000 5000 6000 7000 8000 9000 10000 10 20 30 40 50 60 70
Cholesky −− quad−socket, dual−core Opteron
problem size Gflop/s
PLASMA & ACML BLAS ACML Cholesky MKL Cholesky LAPACK & ACML BLAS
45/70 RAIM 2011, Perpignan
1000 2000 3000 4000 5000 6000 7000 8000 9000 10000 10 20 30 40 50 60 70
QR −− quad−socket, dual−core Opteron
problem size Gflop/s
PLASMA & ACML BLAS ACML QR MKL QR LAPACK & ACML BLAS
46/70 RAIM 2011, Perpignan
1000 2000 3000 4000 5000 6000 7000 8000 9000 10000 10 20 30 40 50 60 70
LU −− quad−socket, dual−core Opteron
problem size Gflop/s
PLASMA & ACML BLAS ACML LU MKL LU LAPACK & ACML BLAS
47/70 RAIM 2011, Perpignan
N = (Lp,pPp,p . . . L2,pP2,p . . . L2,2P2,2L1,pP1,p . . . L1,1P1,1)−1 → A = NU
48/70 RAIM 2011, Perpignan
N = (Lp,pPp,p . . . L2,pP2,p . . . L2,2P2,2L1,pP1,p . . . L1,1P1,1)−1 → A = NU
48/70 RAIM 2011, Perpignan
N = (Lp,pPp,p . . . L2,pP2,p . . . L2,2P2,2L1,pP1,p . . . L1,1P1,1)−1 → A = NU
A∞xwp∞
A∞xpp∞
A∞
A∞
RAIM 2011, Perpignan
49/70 RAIM 2011, Perpignan
1 3 4 5 6 7 8 9
2 50/70 RAIM 2011, Perpignan
1 3 4 5 6 7 8 9
2
children nodes are assembled into the frontal matrix
partial or full factorization of the frontal matrix
50/70 RAIM 2011, Perpignan
51/70 RAIM 2011, Perpignan
52/70 RAIM 2011, Perpignan
52/70 RAIM 2011, Perpignan
52/70 RAIM 2011, Perpignan
53/70 RAIM 2011, Perpignan
done
54/70 RAIM 2011, Perpignan
55/70 RAIM 2011, Perpignan
56/70 RAIM 2011, Perpignan
57/70 RAIM 2011, Perpignan
58/70 RAIM 2011, Perpignan
58/70 RAIM 2011, Perpignan
58/70 RAIM 2011, Perpignan
58/70 RAIM 2011, Perpignan
58/70 RAIM 2011, Perpignan
58/70 RAIM 2011, Perpignan
58/70 RAIM 2011, Perpignan
58/70 RAIM 2011, Perpignan
58/70 RAIM 2011, Perpignan
58/70 RAIM 2011, Perpignan
58/70 RAIM 2011, Perpignan
58/70 RAIM 2011, Perpignan
58/70 RAIM 2011, Perpignan
59/70 RAIM 2011, Perpignan
59/70 RAIM 2011, Perpignan
59/70 RAIM 2011, Perpignan
59/70 RAIM 2011, Perpignan
59/70 RAIM 2011, Perpignan
59/70 RAIM 2011, Perpignan
59/70 RAIM 2011, Perpignan
59/70 RAIM 2011, Perpignan
60/70 RAIM 2011, Perpignan
exec loop do call get task() select case(task_type) case (0) exit case (1) call do activate(...) case (2) call do panel(...) case (3) call do update(...) case (4) call do assemble(...) case (5) call do clean(...) end select end do
61/70 RAIM 2011, Perpignan
1 2 4 8 1 2 4 8
Speedup −− Intel Xeon
# of cores speedup
qrm spqr dgeqrf
1 2 4 8 1 2 4 8
Speedup −− Intel Xeon
# of cores speedup
qrm spqr dgeqrf
62/70 RAIM 2011, Perpignan
1 2 4 8 16 24 1 2 4 8 16 24
Speedup −− AMD Opteron
# of cores speedup
qrm spqr dgeqrf
1 2 4 8 16 24 1 2 4 8 16 24
Speedup −− AMD Opteron
# of cores speedup
qrm spqr dgeqrf
63/70 RAIM 2011, Perpignan
1 2 4 8 0.2 0.4 0.6 0.8 1
Performance −− Intel Xeon
# of cores fraction of dgemm
dgeqrf qrm
1 2 4 8 0.2 0.4 0.6 0.8 1
Performance −− Intel Xeon
# of cores fraction of dgemm
dgeqrf qrm
64/70 RAIM 2011, Perpignan
1 2 4 8 16 24 0.2 0.4 0.6 0.8 1
Performance −− AMD Opteron
# of cores fraction of dgemm
dgeqrf qrm
1 2 4 8 16 24 0.2 0.4 0.6 0.8 1
Performance −− AMD Opteron
# of cores fraction of dgemm
dgeqrf qrm
65/70 RAIM 2011, Perpignan
50 100 150 200 250 300 100 200 400 800
qr-mumps -- 799.8% spqr -- 773.1%
time (sec.)
66/70 RAIM 2011, Perpignan
67/70 RAIM 2011, Perpignan
◮ High Performance Computing and Grids in Action, chapter Exploiting Mixed
Precision Floating Point Hardware in Scientific Computations. 2007. [PDF].
◮ Marc Baboulin, Alfredo Buttari, Jack Dongarra, Jakub Kurzak, Julie Langou,
Julien Langou, Piotr Luszczek, and Stanimire Tomov. Accelerating scientific computations with mixed precision algorithms. Computer Physics Communications, 180(12):2526–2533, 2009. [doi:10.1016/j.cpc.2008.11.005].
◮ A. Buttari.
Fine granularity sparse QR factorization for multicore based systems. To appear in PARA2010 conference proceedings. [PDF]
◮ A. Buttari, J. Dongarra, J. Kurzak, P. Luszczek, and S. Tomov.
Using mixed precision for sparse matrix computations to enhance the performance while achieving 64-bit accuracy. ACM Trans. Math. Softw., 34(4):1–22, 2008. [doi:10.1145/1377596.1377597].
◮ A. Buttari, J. Dongarra, J. Langou, J. Langou, P. Luszczek, and J. Kurzak.
Mixed precision iterative refinement techniques for the solution of dense linear systems.
[doi:10.1177/1094342007084026].
◮ A. Buttari, J. Langou, J. Kurzak, and J. Dongarra.
Parallel tiled qr factorization for multicore architectures. In PPAM’07: Proceedings of the 7th international conference on Parallel processing and applied mathematics, pages 639–648, Berlin, Heidelberg, 2008. Springer-Verlag. [doi:10.1007/978-3-540-68111-3 67].
◮ A. Buttari, J. Langou, J. Kurzak, and J. Dongarra.
A class of parallel tiled linear algebra algorithms for multicore architectures. Parallel Comput., 35(1):38–53, 2009. [doi:10.1016/j.parco.2008.10.002].
◮ J. Kurzak, A. Buttari, and J. Dongarra.
Solving systems of linear equations on the cell processor using cholesky factorization. IEEE Trans. Parallel Distrib. Syst., 19(9):1175–1186, 2008. [doi:10.1109/TPDS.2007.70813].
◮ J. Langou, J. Langou, P. Luszczek, J. Kurzak, A. Buttari, and J. Dongarra.
Exploiting the performance of 32 bit floating point arithmetic in obtaining 64 bit accuracy (revisiting iterative refinement for linear systems). In SC ’06: Proceedings of the 2006 ACM/IEEE conference on Supercomputing, page 113, New York, NY, USA, 2006. ACM. [doi:10.1145/1188455.1188573].