Communication-avoiding LU and QR factorizations for multicore architectures



SLIDE 1

Communication-avoiding LU and QR factorizations for multicore architectures

DONFACK Simplice, INRIA Saclay

Joint work with Laura Grigori (INRIA Saclay) and Alok Kumar Gupta (BCCS, Norway-5075)

16th April 2010

Communication-avoiding LU and QR factorizations for multicore architectures 16th April 2010 1 / 25

SLIDE 2

Introduction

CALU and CAQR factorization

Multithreaded CALU and CAQR

Experimental section

Conclusion



SLIDE 4

Introduction

Architectural trends show that the cost of communication keeps increasing relative to the time it takes to perform arithmetic operations.

This has motivated the design of communication-avoiding algorithms, which minimize communication.

The first results are CAQR [Demmel, Grigori, Hoemmen, Langou ’08] and CALU [Grigori, Demmel, Xiang ’08], both implemented for distributed memory.

Our goal is to design multithreaded QR and LU factorizations for multicores, based on communication-avoiding algorithms.


SLIDE 5

LU factorization with partial pivoting

Factorization on a Pr by Pc grid of processors, as implemented in ScaLAPACK:

For ib = 1 to n-1 step b
    A(ib) = A(ib:n, ib:n)

    Compute panel factorization (pdgetf2): O(n log2 Pr)
        find pivot in each column, swap rows

    Apply all row permutations (pdlaswp): O(n/b (log2 Pc + log2 Pr))
        broadcast pivot information along the rows
        swap rows at left and right

    Compute block row of U (pdtrsm): O(n/b log2 Pc)
        broadcast right diagonal block of L of current panel

    Update trailing matrix (pdgemm): O(n/b (log2 Pc + log2 Pr))
        broadcast right block column of L
        broadcast down block row of U

Pivoting requires communication among processors on distributed memory and synchronisation between threads on multicores.
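The four steps of the loop above can be sketched sequentially in NumPy. This is a minimal illustration only, not the ScaLAPACK code: there is no processor grid and no communication, and the function name `blocked_lu` and its return convention are assumptions made for this example.

```python
import numpy as np


def blocked_lu(A, b):
    """Right-looking blocked LU with partial pivoting (sequential sketch).

    Returns (LU, perm) with A[perm] == L @ U, where L is the unit lower
    triangle of LU and U is its upper triangle.
    """
    LU = np.array(A, dtype=float)
    n = LU.shape[0]
    perm = np.arange(n)
    for ib in range(0, n, b):
        e = min(ib + b, n)
        # Panel factorization: find a pivot in each column, swap rows.
        for j in range(ib, e):
            p = j + int(np.argmax(np.abs(LU[j:, j])))
            if p != j:
                # The swap is applied to the whole row (left and right of the panel).
                LU[[j, p], :] = LU[[p, j], :]
                perm[[j, p]] = perm[[p, j]]
            LU[j + 1:, j] /= LU[j, j]
            LU[j + 1:, j + 1:e] -= np.outer(LU[j + 1:, j], LU[j, j + 1:e])
        if e < n:
            # Block row of U: solve L11 @ U12 = A12 with unit lower triangular L11.
            L11 = np.tril(LU[ib:e, ib:e], -1) + np.eye(e - ib)
            LU[ib:e, e:] = np.linalg.solve(L11, LU[ib:e, e:])
            # Trailing matrix update: A22 -= L21 @ U12.
            LU[e:, e:] -= LU[e:, ib:e] @ LU[ib:e, e:]
    return LU, perm
```

The pivot search inside the panel loop is the part that needs one exchange per column in a distributed or multithreaded setting, which is the cost the rest of the talk aims to reduce.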


SLIDE 6

CALU and CAQR approach

The communication-avoiding approach [Demmel, Grigori, Hoemmen, Langou, Xiang ’08]:

Decrease the communication required for pivoting and overcome the latency bottleneck of classic algorithms by
    performing the factorization of a block column (a tall and skinny matrix) as a reduction operation
    doing some redundant computations

The resulting algorithms are communication optimal in terms of both latency and bandwidth.

They lead to important speedups on distributed memory computers.


SLIDE 7

Goal

Our goal: combine the main ideas for reducing communication in CALU and CAQR with:
    appropriate blocking
    task identification
    dynamic scheduling

The reduction operation used for a block-column factorization is based on a binary tree with asynchronous tasks, which:
    reduces synchronisation between threads (only O(log2(Pr)))
    avoids bus contention
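To make the O(log2(Pr)) figure concrete, the pairing pattern of a binary reduction tree can be enumerated level by level. This is a small sketch; the helper name `binary_tree_schedule` and the block numbering 0..p-1 are assumptions made for illustration.

```python
def binary_tree_schedule(p):
    """Return, level by level, the (receiver, sender) pairs of a binary
    reduction tree over p blocks numbered 0..p-1."""
    levels = []
    stride = 1
    while stride < p:
        # At this level, block i absorbs the partial result held by block i + stride.
        pairs = [(i, i + stride) for i in range(0, p - stride, 2 * stride)]
        levels.append(pairs)
        stride *= 2
    return levels


# Eight blocks are reduced in three levels; pairs within a level are independent.
for depth, pairs in enumerate(binary_tree_schedule(8), start=1):
    print(depth, pairs)
```

Each pair only has to wait for its own two inputs, so tasks from different levels can overlap when they are run asynchronously.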



SLIDE 9

CAQR

Each panel factorization is computed as a reduction operation, where a QR factorization is performed at each node.

The reduction tree is chosen depending on the underlying architecture.

For a binary tree, log2(Pr) steps are used.

Figure: Parallel TSQR
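A minimal sequential sketch of TSQR with a binary tree, assuming NumPy and computing only the R factor. The function name `tsqr_r` and the use of `np.array_split` to form the row blocks are choices made for this example; a real implementation would run the leaf and node factorizations on different threads or processors.

```python
import numpy as np


def tsqr_r(A, p):
    """R factor of a tall and skinny A via a binary reduction tree over p row blocks.

    Returns (R, steps), where steps is the number of tree levels above the leaves.
    """
    # Leaves: an independent local QR on each row block.
    rs = [np.linalg.qr(blk, mode="r") for blk in np.array_split(A, p, axis=0)]
    steps = 0
    # Each tree level stacks pairs of R factors and factors the stack again.
    while len(rs) > 1:
        rs = [np.linalg.qr(np.vstack(rs[i:i + 2]), mode="r")
              for i in range(0, len(rs), 2)]
        steps += 1
    return rs[0], steps


rng = np.random.default_rng(0)
A = rng.standard_normal((64, 4))
R, steps = tsqr_r(A, 8)
print(steps)  # 3 levels for 8 row blocks
```

The resulting R agrees with the R of a direct QR factorization of A up to the signs of its rows.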


SLIDE 10

CAQR

The trailing submatrix is updated using the same tree, in log2(Pr) steps.

Figure: The update of the trailing submatrix is triggered by the reduction tree used during panel factorization
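The coupling between the reduction tree and the trailing update can be sketched as follows: every QR performed at a tree node is immediately applied, as Q^T, to the matching rows of the trailing matrix C. This is an illustrative sequential sketch assuming NumPy; the name `caqr_panel_and_update` is hypothetical, each row block is assumed to have at least as many rows as the panel has columns, and only the top n rows of the updated trailing matrix are returned for brevity.

```python
import numpy as np


def caqr_panel_and_update(A, C, p):
    """Factor the panel A with a binary tree and apply each node's Q^T to C.

    Returns (R, C_top): the R factor of A and the first n rows of Q^T C,
    where n is the number of columns of A.
    """
    n = A.shape[1]
    nodes = []
    # Leaves: local QR of each row block, applied at once to the matching rows of C.
    for a, c in zip(np.array_split(A, p), np.array_split(C, p)):
        q, r = np.linalg.qr(a, mode="complete")
        nodes.append((r[:n], (q.T @ c)[:n]))
    # Inner nodes: each reduction step triggers an update of the paired rows of C.
    while len(nodes) > 1:
        nxt = []
        for i in range(0, len(nodes), 2):
            r_stack = np.vstack([nd[0] for nd in nodes[i:i + 2]])
            c_stack = np.vstack([nd[1] for nd in nodes[i:i + 2]])
            q, r = np.linalg.qr(r_stack, mode="complete")
            nxt.append((r[:n], (q.T @ c_stack)[:n]))
        nodes = nxt
    return nodes[0]
```

A quick consistency check is that `R.T @ C_top` equals `A.T @ C`, which holds for any orthogonal Q with A = Q [R; 0].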


SLIDE 11

CALU [Grigori, Demmel, Xiang ’08]

The panel factorization is performed in two steps:
    A preprocessing step identifies good pivot rows at low communication cost.
    The pivot rows are permuted into the first positions of the panel, and LU without pivoting of the panel is performed.

Figure: Stable parallel panel factorization
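The two steps can be sketched with a binary-tree tournament, assuming NumPy. All function names here (`gepp_pivot_rows`, `tournament_pivots`, `lu_nopiv`) are hypothetical helpers written for this illustration; the sketch is sequential and does not reproduce the authors' implementation.

```python
import numpy as np


def gepp_pivot_rows(B):
    """Local indices of the rows that partial pivoting selects as pivots in B."""
    W = np.array(B, dtype=float)
    idx = np.arange(W.shape[0])
    k = min(W.shape)
    for j in range(k):
        p = j + int(np.argmax(np.abs(W[j:, j])))
        W[[j, p]] = W[[p, j]]
        idx[[j, p]] = idx[[p, j]]
        W[j + 1:, j:] -= np.outer(W[j + 1:, j] / W[j, j], W[j, j:])
    return idx[:k]


def tournament_pivots(A, p):
    """Step 1: select candidate pivot rows of the panel A with a binary tree
    over p row blocks. Returns global row indices, best pivot first."""
    cands = [rows[gepp_pivot_rows(A[rows])]
             for rows in np.array_split(np.arange(A.shape[0]), p)]
    while len(cands) > 1:
        merged = [np.concatenate(cands[i:i + 2]) for i in range(0, len(cands), 2)]
        # Each node reruns partial pivoting on the original rows of its candidates.
        cands = [rows[gepp_pivot_rows(A[rows])] for rows in merged]
    return cands[0]


def lu_nopiv(B):
    """Step 2: LU without pivoting, L and U stored in place of B."""
    W = np.array(B, dtype=float)
    for j in range(min(W.shape)):
        W[j + 1:, j] /= W[j, j]
        W[j + 1:, j + 1:] -= np.outer(W[j + 1:, j], W[j, j + 1:])
    return W


rng = np.random.default_rng(0)
A = rng.standard_normal((32, 4))
piv = tournament_pivots(A, 4)
order = np.concatenate([piv, np.setdiff1d(np.arange(32), piv)])
LU = lu_nopiv(A[order])  # pivot rows first, then LU without pivoting
```

Only the candidate rows travel up the tree, so the pivot search needs one exchange per tree level rather than one per column.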


SLIDE 12

CALU (Stability)

[Plot: average growth factor (y-axis, 100 to 700); x-axis ticks 1024, 2048, 4096, 8192. Curves: CALU with P=256 (b=32, 16), P=128 (b=64, 32, 16) and P=64 (b=128, 64, 32, 16); GEPP; and the reference curves n^(2/3), 2*n^(2/3), 3*n^(1/2).]

Figure: Stability of binary tree based CALU factorization for random matrices

Extensive tests on random matrices and on a set of special matrices, using both binary and flat reduction trees, show that CALU is as stable as GEPP in practice.
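For reference, a growth factor for GEPP can be measured as below. The slide does not state which definition or normalization was used for the plot, so this sketch assumes the classical one: the largest entry magnitude seen during elimination divided by the largest entry magnitude of the original matrix. The function name `gepp_growth_factor` is an assumption.

```python
import numpy as np


def gepp_growth_factor(A):
    """Largest |a_ij| over all elimination steps, divided by the largest |a_ij| of A."""
    W = np.array(A, dtype=float)
    n = W.shape[0]
    a_max = np.max(np.abs(W))
    g = a_max
    for j in range(n - 1):
        # Partial pivoting: bring the largest entry of column j to the diagonal.
        p = j + int(np.argmax(np.abs(W[j:, j])))
        W[[j, p]] = W[[p, j]]
        W[j + 1:, j] /= W[j, j]
        W[j + 1:, j + 1:] -= np.outer(W[j + 1:, j], W[j, j + 1:])
        g = max(g, np.max(np.abs(W[j + 1:, j + 1:])))
    return g / a_max


# Classical worst case for partial pivoting: the growth factor is 2^(n-1).
n = 6
W = np.eye(n) - np.tril(np.ones((n, n)), -1)
W[:, -1] = 1.0
print(gepp_growth_factor(W))
```

Averaging this quantity over many random matrices of a given size gives one data point of the kind plotted for GEPP.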



SLIDE 14

Multithreaded CALU

The matrix is partitioned in blocks of size Tr x b.
The computation of each block is associated with a task.
The task dependency graph is scheduled using a dynamic scheduler.

Figure: A matrix of 4 × 4 blocks with Tr = 2, and the corresponding task dependency graph
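A simplified sketch of a task dependency graph and a ready-queue scheduler for a tiled LU. The task names P (panel), U (block row of U) and S (trailing update), the exact dependencies, and the function names are assumptions made for illustration; the sketch ignores Tr and the pivot-search tree, and it records an execution order instead of running tasks on threads.

```python
from collections import deque


def lu_task_graph(nb):
    """Tasks of a tiled LU on an nb x nb grid of blocks, with their dependencies."""
    deps = {}
    for k in range(nb):
        # Panel k needs every step k-1 update that touched block column k.
        deps[("P", k)] = [("S", k - 1, i, k) for i in range(k, nb)] if k else []
        for j in range(k + 1, nb):
            deps[("U", k, j)] = [("P", k)] + ([("S", k - 1, k, j)] if k else [])
            for i in range(k + 1, nb):
                deps[("S", k, i, j)] = [("P", k), ("U", k, j)] + (
                    [("S", k - 1, i, j)] if k else [])
    return deps


def dynamic_schedule(deps):
    """Release each task as soon as all of its dependencies have completed."""
    pending = {t: len(d) for t, d in deps.items()}
    children = {t: [] for t in deps}
    for t, d in deps.items():
        for parent in d:
            children[parent].append(t)
    ready = deque(t for t, c in pending.items() if c == 0)
    order = []
    while ready:
        t = ready.popleft()
        order.append(t)  # a worker thread would execute the task here
        for c in children[t]:
            pending[c] -= 1
            if pending[c] == 0:
                ready.append(c)
    return order


order = dynamic_schedule(lu_task_graph(4))
print(len(order), order[0])
```

With several workers pulling from the ready queue, updates from step k can overlap with the panel of step k+1 as soon as the blocks of column k+1 have been updated.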


SLIDE 15

Multithreaded CALU

Panel factorization is performed in two steps: find good pivots at low communication cost, then permute them and compute the LU factorization of the panel without pivoting.

The panel factorization stays on the critical path, but it is done more efficiently.


SLIDE 16

Multithreaded CALU (Execution)

Figure: Example of execution of CALU for a 10^5 × 1000 tall skinny matrix, using b = 100 and Tr = 1, on 8 cores

Figure: Example of execution of CALU for a 10^5 × 1000 tall skinny matrix, using b = 100 and Tr = 8, on 8 cores


SLIDE 17

Multithreaded CAQR

Same approach as CALU, but:
    Panel factorization is performed only once.
    The update of the trailing matrix is triggered by the binary tree used for the panel factorization.



SLIDE 19

Environments

Tests were performed on:
    a two-socket, quad-core machine based on Intel Xeon EM64T processors, running Linux
    a four-socket, quad-core machine based on AMD Opteron processors

Comparison with MKL-10.0.4.23 and PLASMA 2.0 (with default parameters).

b = MIN(n, 100) was chosen as the block size.


SLIDE 20

Performance of CALU

Performance of CALU, MKL_dgetrf, PLASMA_dgetrf on 8 cores

[Plot: Tall Skinny Matrix, CALU, m=10^5. x-axis: log2(n), 3 to 10; y-axis: GFlops/s, 5 to 35. Curves: MKL_dgetf2, MKL_dgetrf, PLASMA_dgetrf, CALU(Tr=4), CALU(Tr=8).]

Figure: m = 10^5, with n varying from 10 to 1000.


SLIDE 21

Performance of CALU

Performance of CALU, MKL_dgetrf, PLASMA_dgetrf on 16 cores

[Plot: Tall Skinny Matrix, CALU, m=10^5. x-axis: log2(n), 3 to 10; y-axis: GFlops/s, 5 to 45. Curves: ACML_dgeqrf, PLASMA_dgeqrf, CALU(Tr=8), CALU(Tr=16).]

Figure: m = 10^5, with n varying from 10 to 1000.


SLIDE 22

Performance of CAQR

Performance of CAQR, MKL_dgeqrf, PLASMA_dgeqrf on 8 cores

[Plot: Tall Skinny Matrix, CAQR, m=10^5. x-axis: log2(n), 3 to 10; y-axis: GFlops/s, 5 to 45. Curves: MKL_dgeqrf, PLASMA_dgeqrf, CAQR(Tr=2), CAQR(Tr=4), CAQR(Tr=8), TSQR.]

Figure: m = 10^5, with n varying from 10 to 1000.



SLIDE 24

Conclusion

Multithreaded CALU and CAQR lead to important improvements for tall and skinny matrices with respect to the corresponding routines in MKL and PLASMA.

PLASMA becomes more efficient as the number of columns increases.

No significant improvements have been obtained so far for square matrices.

Prospects:
    Improve the performance of the trailing matrix update by increasing the block size to optimize BLAS3 operations.
    Compare with the recent approach of [Hadri, Ltaief, Agullo, Dongarra ’09] for QR factorization, which uses a different reduction tree during panel factorization.


SLIDE 25

Thank you
