SLIDE 1

QR Factorization of Tall and Skinny Matrices in a Grid Computing Environment

Emmanuel AGULLO (INRIA / LaBRI)
Camille COTI (Iowa State University)
Jack DONGARRA (University of Tennessee)
Thomas HÉRAULT (U. Paris Sud / U. of Tennessee / LRI / INRIA)
Julien LANGOU (University of Colorado Denver)

IPDPS, Atlanta, USA, April 19-23, 2010

Agullo - Coti - Dongarra - Hérault - Langou QR Factorization of Tall and Skinny Matrices in a Grid Computing Environment 1

SLIDE 2

Introduction

Question

Can we speed up dense linear algebra applications using a computational grid?

SLIDE 3

Introduction

Building blocks

Tremendous computational power of grid infrastructures

⋆ BOINC: 2.4 Pflop/s;
⋆ Folding@home: 7.9 Pflop/s.

MPI-based linear algebra libraries

⋆ ScaLAPACK;
⋆ HP Linpack.

Grid-enabled MPI middleware

⋆ MPICH-G2;
⋆ PACX-MPI;
⋆ GridMPI.

SLIDE 4

Introduction

Past answers

Can we speed up dense linear algebra applications using a computational grid?

⋆ GrADS project [Petitet et al., 2001]:
  - the grid makes it possible to process larger matrices;
  - for matrices that fit in the (distributed) memory of a cluster, using a single cluster is optimal.

⋆ Study on a cloud infrastructure [Napper et al., 2009], Linpack on the Amazon EC2 commercial offering:
  - under-calibrated components;
  - the grid costs too much.

SLIDE 5

Introduction

Our approach

Principle

Confine intensive communications (ScaLAPACK calls) within the different geographical sites.

Method

Articulate:

⋆ Communication-Avoiding algorithms [Demmel et al., 2008];
⋆ with a topology-aware middleware (QCG-OMPI).

Focus

⋆ QR factorization;
⋆ Tall and Skinny matrices.

SLIDE 6

Introduction

Outline

  1. Background
  2. Articulation of TSQR with QCG-OMPI
  3. Experiments
     • ScaLAPACK performance
     • TSQR performance
     • TSQR vs ScaLAPACK performance
  4. Conclusion and future work

SLIDE 8

Background TSQR / CAQR

Communication-Avoiding QR (CAQR) [Demmel et al., 2008]

[Figure: two panels, Tall and Skinny QR (TSQR) and CAQR, with the labels R, TSQR, and UPDATES.]

Examples of applications for TSQR

⋆ panel factorization in CAQR;
⋆ block iterative methods (iterative methods with multiple right-hand sides or iterative eigenvalue solvers);
⋆ linear least squares problems with far more equations than unknowns.
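
To make the TSQR idea concrete, here is a minimal NumPy sketch. It is not the authors' ScaLAPACK-based implementation: the tall matrix is split into row blocks (the "domains"), each block is factored independently, and the small R factors are then combined pairwise in a binary reduction tree. The function name `tsqr_r` and the choice of four domains are illustrative assumptions, and the domains are processed serially here.

```python
import numpy as np

def tsqr_r(A, num_domains):
    """Return the N-by-N R factor of a tall matrix A via a TSQR-style reduction.

    Illustrative sketch only: a real implementation would factor the
    domains in parallel and exchange the R factors between processes.
    """
    # Step 1: independent local QR on each row block (domain).
    blocks = np.array_split(A, num_domains, axis=0)
    rs = [np.linalg.qr(block, mode="r") for block in blocks]

    # Step 2: binary reduction tree. Stack two R factors and re-factor them.
    while len(rs) > 1:
        merged = []
        for i in range(0, len(rs) - 1, 2):
            stacked = np.vstack((rs[i], rs[i + 1]))
            merged.append(np.linalg.qr(stacked, mode="r"))
        if len(rs) % 2 == 1:
            merged.append(rs[-1])  # odd one out moves up a level unchanged
        rs = merged
    return rs[0]


rng = np.random.default_rng(0)
A = rng.standard_normal((4096, 8))  # tall and skinny: M >> N
R_tsqr = tsqr_r(A, num_domains=4)
R_ref = np.linalg.qr(A, mode="r")

# R is unique only up to the sign of each row, so compare absolute values.
print(np.allclose(np.abs(R_tsqr), np.abs(R_ref)))
```

Only the small N-by-N R factors travel up the tree, which is why the reduction step is cheap in communication compared with the local factorizations.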

SLIDE 12

Background QCG-OMPI

Topology-aware MPI middleware for the Grid

MPICH-G2

⋆ description of the topology through the concept of colors:
  - used to build topology-aware MPI communicators;
  - the application has to adapt itself to the discovered topology;
⋆ based on MPICH.

QCG-OMPI

⋆ resource-aware grid meta-scheduler (QosCosGrid);
⋆ allocation of resources that match requirements expressed in a “JobProfile” (amount of memory, CPU speed, network properties between groups of processes, ...);
  application always executed on an appropriate resource topology.
⋆ based on OpenMPI.
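
The notion of building communicators from colors can be illustrated with a plain-Python emulation of the grouping that the standard `MPI_Comm_split` call performs: processes that share a color end up in the same group. No MPI library is called here, this is not the MPICH-G2 or QCG-OMPI API, and the process layout and the helper name `split_by_color` are hypothetical.

```python
from collections import defaultdict

def split_by_color(colors):
    """Group global ranks by color, as MPI_Comm_split would.

    colors[rank] is the color of global rank `rank`; ranks keep their
    relative order inside each group (equivalent to key = global rank).
    """
    groups = defaultdict(list)
    for rank, color in enumerate(colors):
        groups[color].append(rank)
    return dict(groups)

# Hypothetical layout: 8 processes spread over 3 sites.
site_of_rank = ["orsay", "orsay", "orsay", "toulouse",
                "toulouse", "bordeaux", "bordeaux", "bordeaux"]
intra_site = split_by_color(site_of_rank)

# The lowest rank of each site acts as its leader in the inter-site group.
leaders = sorted(ranks[0] for ranks in intra_site.values())

print(intra_site)
print(leaders)
```

Intensive communication would stay inside the per-site groups, and only the leaders would talk across sites.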

SLIDE 16

Articulation of TSQR with QCG-OMPI Communication pattern

Communication pattern (M-by-3 matrix)

ScaLAPACK (panel factorization routine) - non optimized tree

[Figure: Illustration of ScaLAPACK PDGEQRF without reduce affinity. Cluster 1 holds Domains 1,1 to 1,5; Cluster 2 holds Domains 2,1 to 2,4; Cluster 3 holds Domains 3,1 and 3,2.]

25 inter-cluster communications

SLIDE 18

Articulation of TSQR with QCG-OMPI Communication pattern

Communication pattern (M-by-3 matrix)

ScaLAPACK (panel factorization routine) - optimized tree

[Figure: Illustration of ScaLAPACK PDGEQRF with reduce affinity. Cluster 1 holds Domains 1,1 to 1,5; Cluster 2 holds Domains 2,1 to 2,4; Cluster 3 holds Domains 3,1 and 3,2.]

10 inter-cluster communications

SLIDE 19

Articulation of TSQR with QCG-OMPI Communication pattern

Communication pattern (M-by-3 matrix)

TSQR - optimized tree

2 inter-cluster communications
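
As a rough illustration of why a reduction tree aligned with cluster boundaries needs so few inter-cluster messages, the sketch below counts the tree edges that cross clusters for a single reduction over 5, 4, and 2 domains (the cluster sizes shown in the figures). It does not reproduce the 25 and 10 message counts quoted for ScaLAPACK, which depend on details of the panel routine not modeled here. The interleaved ordering used for the topology-oblivious case is an assumption chosen for contrast.

```python
def binary_tree_edges(nodes):
    """Pairwise reduction: return the (sender, receiver) pairs of a binary tree."""
    edges = []
    level = list(nodes)
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            edges.append((level[i + 1], level[i]))  # right partner sends to left
            nxt.append(level[i])
        if len(level) % 2 == 1:
            nxt.append(level[-1])
        level = nxt
    return edges

def inter_cluster(edges):
    """Count the edges whose endpoints live in different clusters."""
    return sum(1 for src, dst in edges if src[0] != dst[0])

# Domains as (cluster, index) pairs: 5, 4 and 2 domains per cluster.
clusters = {1: 5, 2: 4, 3: 2}
domains = [(c, d) for c, n in clusters.items() for d in range(1, n + 1)]

# Topology-oblivious: one flat tree over an interleaved ordering (assumption).
flat_order = sorted(domains, key=lambda cd: (cd[1], cd[0]))
flat = inter_cluster(binary_tree_edges(flat_order))

# Topology-aware: reduce inside each cluster first, then across cluster leaders.
aware_edges = []
leaders = []
for c in clusters:
    members = [d for d in domains if d[0] == c]
    aware_edges += binary_tree_edges(members)
    leaders.append(members[0])
aware_edges += binary_tree_edges(leaders)
aware = inter_cluster(aware_edges)

print(flat, aware)
```

With three clusters the topology-aware tree crosses cluster boundaries C - 1 = 2 times, matching the count on the slide.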

SLIDE 21

Articulation of TSQR with QCG-OMPI Communication pattern

Communication pattern (M-by-3 matrix)

[Figure: side-by-side communication patterns for ScaLAPACK and TSQR.]

SLIDE 22

Articulation of TSQR with QCG-OMPI Articulation of TSQR with QCG-OMPI

Articulation of TSQR with QCG-OMPI

ScaLAPACK-based TSQR

⋆ each domain is factorized with a ScaLAPACK call;
⋆ the reduction is performed by pairs of communicators;
⋆ the number of domains per cluster may vary from 1 to 64 (number of cores per cluster).

QCG JobProfile

⋆ processes are split into groups of equivalent computing power;
⋆ good network connectivity inside each group (low latency, high bandwidth);
⋆ lower network connectivity between the groups.

→ Classical cluster of clusters approach (with a constraint on the relative size of the clusters to facilitate load balancing).

SLIDE 23

Articulation of TSQR with QCG-OMPI Articulation of TSQR with QCG-OMPI

  • Comm. and computation breakdown (critical path)

Computing R

                 # inter-cluster msg    inter-cluster vol. exchanged    # FLOPs
ScaLAPACK QR2    $2N \log_2(C)$         $\log_2(C)\,(N^2/2)$            $(2MN^2 - \frac{2}{3}N^3)/C$
TSQR             $\log_2(C)$            $\log_2(C)\,(N^2/2)$            $(2MN^2 - \frac{2}{3}N^3)/C + \frac{2}{3}\log_2(C)\,N^3$

Computing Q and R (on C clusters)

                 # inter-cluster msg    inter-cluster vol. exchanged    # FLOPs
ScaLAPACK QR2    $4N \log_2(C)$         $2\log_2(C)\,(N^2/2)$           $(4MN^2 - \frac{4}{3}N^3)/C$
TSQR             $2\log_2(C)$           $2\log_2(C)\,(N^2/2)$           $(4MN^2 - \frac{4}{3}N^3)/C + \frac{4}{3}\log_2(C)\,N^3$

⋆ C: number of clusters;
⋆ 1 domain per cluster is assumed for these tables.
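
The formulas in the two tables can be evaluated directly to see the orders of magnitude involved. The sketch below is only a transcription of those formulas into Python; the function name `critical_path_costs` and the sample values M = 2^20, N = 64, C = 4 are illustrative choices, not measurements.

```python
from math import log2

def critical_path_costs(M, N, C, want_q=False):
    """Evaluate the critical-path formulas of the tables (1 domain per cluster)."""
    f = 2 if want_q else 1  # computing Q and R doubles every term
    volume = f * log2(C) * (N**2 / 2)
    base_flops = f * (2 * M * N**2 - (2 / 3) * N**3) / C
    return {
        "ScaLAPACK QR2": {
            "msgs": f * 2 * N * log2(C),
            "volume": volume,
            "flops": base_flops,
        },
        "TSQR": {
            "msgs": f * log2(C),
            "volume": volume,
            "flops": base_flops + f * (2 / 3) * log2(C) * N**3,
        },
    }

costs = critical_path_costs(M=2**20, N=64, C=4)
for name, c in costs.items():
    print(name, c)
```

Both algorithms exchange the same volume; TSQR sends 2N times fewer inter-cluster messages at the price of an extra flop term that does not depend on M.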

SLIDE 31

Experiments Experimental environment

Experimental environment: Grid’5000

⋆ Four clusters – 32 nodes per cluster – 2 cores per node.
⋆ AMD Opteron 246 (2 GHz / 1 MB L2 cache) up to AMD Opteron 2218 (2.6 GHz / 2 MB L2 cache).
⋆ Linux 2.6.30 – ScaLAPACK 1.8.0 – GotoBLAS 1.26.
⋆ 256 cores total (theoretical peak 2048 Gflop/s – dgemm upper bound 940 Gflop/s).

Network

Latency (ms)        Orsay    Toulouse    Bordeaux    Sophia
Orsay               0.07     7.97        6.98        6.12
Toulouse                     0.03        9.03        8.18
Bordeaux                                 0.05        7.18
Sophia                                               0.06

Throughput (Mb/s)   Orsay    Toulouse    Bordeaux    Sophia
Orsay               890      78          90          102
Toulouse                     890         77          90
Bordeaux                                 890         83
Sophia                                               890
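
A back-of-the-envelope estimate shows what these latencies imply when combined with the inter-cluster message counts from the breakdown table (computing R only). This is a crude bound under stated assumptions, not a result from the talk: every inter-cluster message is charged the worst inter-site latency, and bandwidth and any overlap with computation are ignored. The values N = 64 and C = 4 are sample inputs.

```python
from math import log2

# Inter-site latencies in ms, copied from the table above (upper triangle).
latency_ms = {
    ("Orsay", "Toulouse"): 7.97,
    ("Orsay", "Bordeaux"): 6.98,
    ("Orsay", "Sophia"): 6.12,
    ("Toulouse", "Bordeaux"): 9.03,
    ("Toulouse", "Sophia"): 8.18,
    ("Bordeaux", "Sophia"): 7.18,
}
worst = max(latency_ms.values())

N, C = 64, 4
msgs = {"ScaLAPACK QR2": 2 * N * log2(C), "TSQR": log2(C)}

# Crude bound: message count times worst-case latency, in milliseconds.
latency_cost_ms = {name: count * worst for name, count in msgs.items()}
print(latency_cost_ms)
```

Under this model the latency term is on the order of seconds for ScaLAPACK QR2 and tens of milliseconds for TSQR.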

SLIDE 33

Experiments ScaLAPACK performance

ScaLAPACK - N = 64

[Figure: Gflop/s (axis ticks 5 to 35) versus number of rows M (100000 to 1e+08); curves for 4 sites (128 nodes), 2 sites (64 nodes), and 1 site (32 nodes).]

SLIDE 34

Experiments ScaLAPACK performance

ScaLAPACK - N = 128

[Figure: Gflop/s (axis ticks 10 to 60) versus number of rows M (100000 to 1e+08); curves for 4 sites (128 nodes), 2 sites (64 nodes), and 1 site (32 nodes).]

SLIDE 35

Experiments ScaLAPACK performance

ScaLAPACK - N = 256

[Figure: Gflop/s (axis ticks 10 to 60) versus number of rows M (100000 to 1e+07); curves for 4 sites (128 nodes), 2 sites (64 nodes), and 1 site (32 nodes).]

SLIDE 36

Experiments ScaLAPACK performance

ScaLAPACK - N = 512

[Figure: Gflop/s (axis ticks 10 to 90) versus number of rows M (100000 to 1e+07); curves for 4 sites (128 nodes), 2 sites (64 nodes), and 1 site (32 nodes).]

SLIDE 38

Experiments TSQR performance - effect of the number of domains per cluster

TSQR - N = 64 - one cluster

[Figure: Gflop/s (axis ticks 5 to 40) versus number of domains (64, 32, 16, 8, 4, 2, 1); curves for M = 8 388 608, M = 1 048 576, M = 131 072, and M = 65 536.]

SLIDE 39

Experiments TSQR performance - effect of the number of domains per cluster

TSQR - N = 64 - all four clusters

[Figure: Gflop/s (axis ticks 20 to 100) versus number of domains per cluster (64, 32, 16, 8, 4, 2, 1); curves for M = 33 554 432, M = 4 194 304, M = 524 288, and M = 131 072.]

SLIDE 40

Experiments TSQR performance - effect of the number of domains per cluster

TSQR - N = 512 - one cluster

[Figure: Gflop/s (axis ticks 20 to 100) versus number of domains (64, 32, 16, 8, 4, 2, 1); curves for M = 2 097 152, M = 1 048 576, M = 131 072, and M = 65 536.]

SLIDE 41

Experiments TSQR performance - effect of the number of domains per cluster

TSQR - N = 512 - all four clusters

[Figure: Gflop/s (axis ticks 50 to 350) versus number of domains per cluster (64, 32, 16, 8, 4, 2, 1); curves for M = 8 388 608, M = 2 097 152, M = 524 288, and M = 262 144.]

SLIDE 42

Experiments TSQR performance (optimum configuration)

TSQR - N = 64

[Figure: Gflop/s (axis ticks 20 to 100) versus number of rows M (100000 to 1e+08); curves for 4 sites (128 nodes), 2 sites (64 nodes), and 1 site (32 nodes).]

SLIDE 43

Experiments TSQR performance (optimum configuration)

TSQR - N = 128

[Figure: Gflop/s (axis ticks 20 to 140) versus number of rows M (100000 to 1e+08); curves for 4 sites (128 nodes), 2 sites (64 nodes), and 1 site (32 nodes).]

SLIDE 44

Experiments TSQR performance (optimum configuration)

TSQR - N = 256

[Figure: Gflop/s (axis ticks 20 to 180) versus number of rows M (100000 to 1e+07); curves for 4 sites (128 nodes), 2 sites (64 nodes), and 1 site (32 nodes).]

SLIDE 45

Experiments TSQR performance (optimum configuration)

TSQR - N = 512

[Figure: Gflop/s (axis ticks 50 to 300) versus number of rows M (100000 to 1e+07); curves for 4 sites (128 nodes), 2 sites (64 nodes), and 1 site (32 nodes).]

SLIDE 47

Experiments TSQR vs ScaLAPACK performance

TSQR vs ScaLAPACK - N = 64

[Figure: Gflop/s (axis ticks 10 to 90) versus number of rows M (100000 to 1e+08); curves for TSQR (best) and ScaLAPACK (best).]

SLIDE 48

Experiments TSQR vs ScaLAPACK performance

TSQR vs ScaLAPACK - N = 128

[Figure: Gflop/s (axis ticks 20 to 120) versus number of rows M (100000 to 1e+08); curves for TSQR (best) and ScaLAPACK (best).]

SLIDE 49

Experiments TSQR vs ScaLAPACK performance

TSQR vs ScaLAPACK - N = 256

[Figure: Gflop/s (axis ticks 20 to 180) versus number of rows M (100000 to 1e+07); curves for TSQR (best) and ScaLAPACK (best).]

SLIDE 50

Experiments TSQR vs ScaLAPACK performance

TSQR vs ScaLAPACK - N = 512

[Figure: Gflop/s (axis ticks 50 to 300) versus number of rows M (100000 to 1e+07); curves for TSQR (best) and ScaLAPACK (best).]

SLIDE 52

Conclusion and future work Conclusion

Conclusion

Can we speed up dense linear algebra applications using a computational grid?

Yes, at least for applications based on the QR factorization of Tall and Skinny matrices.

SLIDE 53

Conclusion and future work Future directions

Future directions

⋆ What about square matrices (CAQR)?
⋆ LU and Cholesky factorizations?
⋆ Can we benefit from recursive kernels?

SLIDE 54

Conclusion and future work current investigations

N = 64 - one cluster

[Figure: Comparison of QR algorithms, N = 64, 1 cluster. GFlops (axis ticks 10 to 80) versus matrix height M (5e+06 to 3.5e+07); curves for Cholesky QR A v0, Householder A v2, PDGEQRF, Householder A v1, CGS v0, CGS v1, and Householder A v0.]

SLIDE 55

Conclusion and future work current investigations

N = 64 - two clusters

[Figure: Comparison of QR algorithms, N = 64, 2 clusters. GFlops (axis ticks 20 to 140) versus matrix height M (1e+07 to 7e+07); curves for Cholesky QR A v0, Householder A v2, PDGEQRF, Householder A v1, CGS v0, CGS v1, and Householder A v0.]

SLIDE 56

Conclusion and future work current investigations

N = 64 - all four clusters

[Figure: Comparison of QR algorithms, N = 64, 4 clusters. GFlops (axis ticks 50 to 300) versus matrix height M (2e+07 to 1.4e+08); curves for Cholesky QR A v0, Householder A v2, PDGEQRF, Householder A v1, CGS v0, CGS v1, and Householder A v0.]

SLIDE 57

Conclusion and future work current investigations

N = 512 - one cluster

[Figure: Comparison of QR algorithms, N = 512, 1 cluster. GFlops (axis ticks 20 to 140) versus matrix height M (500000 to 4.5e+06); curves for Cholesky QR A v0, Householder A v2, PDGEQRF, Householder A v1, CGS v0, CGS v1, and Householder A v0.]

SLIDE 58

Conclusion and future work current investigations

N = 512 - two clusters

[Figure: Comparison of QR algorithms, N = 512, 2 clusters. GFlops (axis ticks 50 to 250) versus matrix height M (1e+06 to 9e+06); curves for Cholesky QR A v0, Householder A v2, PDGEQRF, Householder A v1, CGS v0, CGS v1, and Householder A v0.]

SLIDE 59

Conclusion and future work current investigations

N = 512 - all four clusters

[Figure: Comparison of QR algorithms, N = 512, 4 clusters. GFlops (axis ticks 50 to 500) versus matrix height M (2e+06 to 1.8e+07); curves for Cholesky QR A v0, Householder A v2, PDGEQRF, Householder A v1, CGS v0, CGS v1, and Householder A v0.]

SLIDE 60

Conclusion and future work current investigations

Thanks

⋆ Questions?