Profiling High Performance Dense Linear Algebra Algorithms on Multicore Architectures for Power and Energy Efficiency

Hatem Ltaief (1), Piotr Luszczek (2), Jack Dongarra (2)

(1) KAUST Supercomputing Laboratory, Thuwal, Saudi Arabia

(2) Innovative Computing Laboratory, University of Tennessee Knoxville

EnaHPC’11 Conference, Hamburg, Germany

Ltaief, Luszczek, Dongarra (KAUST, UTK) Energy Profiling of DLA Algorithms EnaHPC 2011 1 / 28

Outline

  1. The "K" Computer
  2. A Look Back...
  3. LAPACK: Block Algorithms
  4. PLASMA: Tile Algorithms
  5. Power Analysis
  6. Summary and Future Work

The "K" Computer

Motivations

  • 10 MW needed to feed the baby
  • Exascale roadmap says up to 20 MW
  • Huge challenge: achieving 2 orders of magnitude in performance by only doubling the power rate
  • Co-designed hardware and software solutions
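The challenge above can be restated as an efficiency requirement. The following is a minimal sketch of that arithmetic, assuming "2 orders of magnitude" means a factor of 100 and "doubling the power rate" means going from 10 MW to 20 MW; the function name is illustrative only.

```python
def required_efficiency_gain(perf_factor, power_factor):
    """Factor by which performance per watt must grow.

    perf_factor:  desired speedup (100 for two orders of magnitude)
    power_factor: allowed growth of the power budget (2 for 10 MW -> 20 MW)
    """
    return perf_factor / power_factor


# Assumed reading of the slide: 100x the performance within 2x the power.
gain = required_efficiency_gain(perf_factor=100.0, power_factor=20.0 / 10.0)
print(f"performance per watt must improve by a factor of {gain:g}")
```

Under these assumptions, energy efficiency has to improve roughly 50-fold.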

A Look Back...

Software infrastructure and algorithmic design follow hardware evolution in time:

  • 70's - LINPACK, vector operations: Level-1 BLAS operation
  • 80's - LAPACK, block, cache-friendly: Level-3 BLAS operation
  • 90's - ScaLAPACK, distributed memory: PBLAS, message passing
  • 00's - PLASMA, many-cores friendly: DAG scheduler, tile data layout, some extra kernels

LAPACK: Block Algorithms

Principles

  • Panel-Update Sequence
  • Transformations are blocked/accumulated within the Panel (Level 2 BLAS)
  • Transformations applied at once on the trailing submatrix (Level 3 BLAS)
  • Parallelism hidden inside the BLAS
  • Fork-join Model
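The panel-update sequence can be illustrated with a right-looking blocked Cholesky factorization. This is a simplified pure-Python sketch, not LAPACK code: the function name, the list-of-rows storage, and the test matrix are assumptions made for illustration, and the loops stand in for the BLAS calls a real implementation would use.

```python
import math


def blocked_cholesky(a, nb):
    """Right-looking blocked Cholesky, in place on the lower triangle of `a`.

    `a` is a symmetric positive definite matrix stored as a list of rows and
    `nb` is the panel width. On return, a[i][j] for j <= i holds L with
    A = L L^T; the strictly upper triangle is left untouched.
    """
    n = len(a)
    for lo in range(0, n, nb):
        hi = min(lo + nb, n)

        # PANEL: factor the block column a[lo:n, lo:hi] one column at a time
        # (vector and matrix-vector style work, as in Level 2 BLAS).
        for j in range(lo, hi):
            for k in range(lo, j):
                a[j][j] -= a[j][k] * a[j][k]
            a[j][j] = math.sqrt(a[j][j])
            for i in range(j + 1, n):
                for k in range(lo, j):
                    a[i][j] -= a[i][k] * a[j][k]
                a[i][j] /= a[j][j]

        # UPDATE: apply the whole panel at once to the trailing submatrix
        # a[hi:n, hi:n] (a matrix-matrix product, as in Level 3 BLAS).
        for i in range(hi, n):
            for j in range(hi, i + 1):
                for k in range(lo, hi):
                    a[i][j] -= a[i][k] * a[j][k]
    return a


# Small symmetric positive definite matrix, made up for illustration.
A = [[4.0, 2.0, 2.0], [2.0, 5.0, 3.0], [2.0, 3.0, 6.0]]
L = blocked_cholesky([row[:] for row in A], nb=2)
```

In this form the next panel cannot start until the previous update has finished, which is the synchronization point of the fork-join model.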

LU, QR, Cholesky

Figure: Panel-update sequences for the LAPACK one-sided factorizations. (a) First step: PANEL and UPDATE regions. (b) Second step: FINAL, PANEL, and UPDATE regions. (c) Third step: FINAL and PANEL regions.

Hessenberg, TRD and BRD

Figure: Panel-update sequences for the LAPACK two-sided transformations. (a) First step: PANEL and UPDATE regions. (b) Second step: FINAL, PANEL, and UPDATE regions. (c) Third step: FINAL and PANEL regions.

Fork-Join Paradigm

PLASMA: Tile Algorithms

  • PLASMA: Parallel Linear Algebra for Scalable Multi-core Architectures ⇒ http://icl.cs.utk.edu/plasma/
  • Parallelism is brought to the fore
  • May require the redesign of linear algebra algorithms
  • Tile data layout translation
  • Remove unnecessary synchronization points between Panel-Update sequences
  • DAG execution where nodes represent tasks and edges define dependencies between them
  • Dynamic runtime system environment QUARK
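To make the DAG idea concrete, the sketch below lists the tasks of a tile Cholesky factorization and derives the dependency edges from which tiles each task reads and updates. It assumes the usual four-kernel formulation (POTRF, TRSM, SYRK, GEMM, named after the LAPACK/BLAS routines); the function names and the task representation are hypothetical and are not PLASMA's API.

```python
def tile_cholesky_tasks(nt):
    """List the tasks of a tile Cholesky on an nt x nt grid of tiles.

    Each task is (name, tiles_read, tiles_updated), with tiles given as
    (row, col) indices. Only the lower triangle is referenced.
    """
    tasks = []
    for k in range(nt):
        tasks.append((f"POTRF({k})", [], [(k, k)]))
        for m in range(k + 1, nt):
            tasks.append((f"TRSM({m},{k})", [(k, k)], [(m, k)]))
        for m in range(k + 1, nt):
            tasks.append((f"SYRK({m},{k})", [(m, k)], [(m, m)]))
            for n in range(k + 1, m):
                tasks.append((f"GEMM({m},{n},{k})", [(m, k), (n, k)], [(m, n)]))
    return tasks


def build_dag(tasks):
    """Derive dependency edges (producer, consumer) from tile accesses."""
    last_writer = {}   # tile -> index of the last task that updated it
    readers = {}       # tile -> tasks that read it since that update
    edges = set()
    for tid, (_, reads, updates) in enumerate(tasks):
        for tile in reads:
            if tile in last_writer:                 # read after write
                edges.add((last_writer[tile], tid))
            readers.setdefault(tile, []).append(tid)
        for tile in updates:
            if tile in last_writer:                 # write after write
                edges.add((last_writer[tile], tid))
            for r in readers.get(tile, []):         # write after read
                edges.add((r, tid))
            last_writer[tile] = tid
            readers[tile] = []
    return edges


tasks = tile_cholesky_tasks(3)
edges = build_dag(tasks)
for src, dst in sorted(edges):
    print(f"{tasks[src][0]} -> {tasks[dst][0]}")
```

Any execution order that respects these edges is valid, so a task from step k+1 may run before all updates of step k have completed.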

Data Layout Format

  • LAPACK: column-major format
  • PLASMA: tile format
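The difference between the two layouts can be sketched as an index translation. The code below assumes square nb x nb tiles, a matrix order divisible by nb, tiles ordered column by column, and column-major storage inside each tile; the slides do not specify these details, and the function names are illustrative.

```python
def colmajor_to_tile(a, n, nb):
    """Copy an n x n column-major array into tile layout with nb x nb tiles."""
    if n % nb != 0:
        raise ValueError("this sketch assumes nb divides n")
    nt = n // nb
    tiled = []
    for tj in range(nt):              # tile column
        for ti in range(nt):          # tile row
            for j in range(nb):       # column inside the tile
                for i in range(nb):   # row inside the tile
                    tiled.append(a[(ti * nb + i) + (tj * nb + j) * n])
    return tiled


def tile_to_colmajor(tiled, n, nb):
    """Inverse translation, back to column-major storage."""
    nt = n // nb
    a = [None] * (n * n)
    pos = 0
    for tj in range(nt):
        for ti in range(nt):
            for j in range(nb):
                for i in range(nb):
                    a[(ti * nb + i) + (tj * nb + j) * n] = tiled[pos]
                    pos += 1
    return a


# 4 x 4 matrix whose entry values equal their column-major offsets.
flat = list(range(16))
tiles = colmajor_to_tile(flat, n=4, nb=2)
```

After translation each tile occupies one contiguous run of memory, so a kernel that works on a single tile touches a compact block instead of strided columns.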

Dynamic Scheduling QUARK

Conceptually similar to out-of-order processor scheduling because it has:

  • Dynamic runtime DAG scheduler
  • Out-of-order execution flow of fine-grained tasks
  • Task scheduling as soon as dependencies are satisfied
  • Producer-Consumer
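The "as soon as dependencies are satisfied" rule can be sketched with a toy ready-queue scheduler. This is not QUARK's interface: the `schedule` function, the round-based worker model, and the small dependency table (the first tasks of a 3 x 3 tile Cholesky) are all assumptions made for illustration.

```python
from collections import deque


def schedule(deps, workers):
    """Simulate dependency-driven execution.

    deps maps each task name to the list of tasks it depends on.
    Returns a list of rounds; the tasks in one round run concurrently.
    """
    remaining = {task: len(needs) for task, needs in deps.items()}
    consumers = {task: [] for task in deps}
    for task, needs in deps.items():
        for producer in needs:
            consumers[producer].append(task)

    ready = deque(task for task in deps if remaining[task] == 0)
    rounds = []
    while ready:
        running = [ready.popleft() for _ in range(min(workers, len(ready)))]
        rounds.append(running)
        for done in running:
            for task in consumers[done]:
                remaining[task] -= 1
                if remaining[task] == 0:   # last input produced: task is ready
                    ready.append(task)
    return rounds


# Hypothetical dependency table for illustration.
deps = {
    "POTRF(0)": [],
    "TRSM(1,0)": ["POTRF(0)"],
    "TRSM(2,0)": ["POTRF(0)"],
    "SYRK(1,0)": ["TRSM(1,0)"],
    "SYRK(2,0)": ["TRSM(2,0)"],
    "GEMM(2,1,0)": ["TRSM(1,0)", "TRSM(2,0)"],
    "POTRF(1)": ["SYRK(1,0)"],
}
rounds = schedule(deps, workers=2)
for number, running in enumerate(rounds, start=1):
    print(number, running)
```

With two workers, the panel task of the second step shares a round with an update task of the first step, which a fork-join execution would not allow.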

PLASMA In a Nutshell

  • Parallel Linear Algebra for Scalable Multi-core Architectures
  • Numerical software library
  • Dense Linear Algebra
      • linear systems of equations
      • least squares problems
      • singular value problems
      • eigenvalue problems
      • all precisions (S, D, C, Z)
      • Linux, Windows, Mac OS, AIX
  • Multicore Systems
      • multicore
      • multi-socket
      • shared memory
      • possibly NUMA

AX = B

  • LU-based solver: Gaussian elimination, non-symmetric
  • Cholesky-based solver: symmetric positive definite
  • LDLT-based solver: symmetric indefinite
  • QR/LQ-based solver: least squares
  • Matrix inversion using LU and Cholesky (statistics)
  • Tall and skinny factorizations using tree reductions
  • Mixed precision iterative refinement
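Mixed precision iterative refinement can be sketched as follows: factor once in low precision, then repeatedly compute the residual in high precision and correct the solution with the cheap low-precision factors. The code is a pure-Python illustration, not PLASMA's implementation. It assumes a Cholesky-based solver, emulates single precision by rounding every result through `struct`, and uses a fixed number of sweeps instead of a convergence test; all names are hypothetical.

```python
import math
import struct


def f32(x):
    """Round a Python float (double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def cholesky_single(a):
    """Cholesky factor L (A = L L^T) with every result rounded to single."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for j in range(n):
        s = f32(a[j][j])
        for k in range(j):
            s = f32(s - f32(low[j][k] * low[j][k]))
        low[j][j] = f32(math.sqrt(s))
        for i in range(j + 1, n):
            s = f32(a[i][j])
            for k in range(j):
                s = f32(s - f32(low[i][k] * low[j][k]))
            low[i][j] = f32(s / low[j][j])
    return low


def solve_single(low, b):
    """Solve L L^T x = b by forward and backward substitution in single."""
    n = len(low)
    y = [0.0] * n
    for i in range(n):
        s = f32(b[i])
        for k in range(i):
            s = f32(s - f32(low[i][k] * y[k]))
        y[i] = f32(s / low[i][i])
    x = [0.0] * n
    for i in reversed(range(n)):
        s = y[i]
        for k in range(i + 1, n):
            s = f32(s - f32(low[k][i] * x[k]))
        x[i] = f32(s / low[i][i])
    return x


def refine(a, b, sweeps=5):
    """Low-precision factorization plus double-precision refinement."""
    n = len(a)
    low = cholesky_single(a)       # factor once, in low precision
    x = solve_single(low, b)       # initial low-precision solution
    for _ in range(sweeps):
        # Residual r = b - A x, computed in double precision.
        r = [b[i] - sum(a[i][j] * x[j] for j in range(n)) for i in range(n)]
        d = solve_single(low, r)   # correction reuses the low-precision factors
        x = [x[i] + d[i] for i in range(n)]
    return x
```

The expensive cubic-cost factorization runs entirely in the lower precision; only the quadratic-cost residual and update steps use double precision.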

$A = X \Lambda X^T$ and $A = U \Sigma V^T$

  • Symmetric Eigenvalue Problem (Two-stage reduction + QR Iteration)
  • Singular Value Problem (Two-stage reduction + QR Algorithm)
  • Generalized Symmetric Eigenvalue Problem... more later!

Power Analysis

Machine Description

  • dori.cs.vt.edu from Virginia Tech (K. Cameron)
  • Cluster of 8 nodes
  • Each node contains two dual-core AMD Opteron processors with 6 GB RAM, and each core has 1 MB cache
  • "Power to the people, not the chips" (P. Luszczek)

Power Monitoring with PowerPack

Power Rate of LAPACK Cholesky

[Plot: power (Watts) versus time (seconds); curves for System, CPU, Memory, Motherboard, Fan]

Power Rate of PLASMA Cholesky

[Plot: power (Watts) versus time (seconds); curves for System, CPU, Memory, Motherboard, Fan]

Power Rate of LAPACK QR

[Plot: power (Watts) versus time (seconds); curves for System, CPU, Memory, Motherboard, Fan]

Power Rate of PLASMA QR

[Plot: power (Watts) versus time (seconds); curves for System, CPU, Memory, Motherboard, Fan]

Power Rate of LAPACK TRD

[Plot: power (Watts) versus time (seconds); curves for System, CPU, Memory, Motherboard, Fan]

Power Rate of PLASMA TRD

[Plot: power (Watts) versus time (seconds); curves for System, CPU, Memory, Motherboard, Fan]

Power Rate of LAPACK BRD

[Plot: power (Watts) versus time (seconds); curves for System, CPU, Memory, Motherboard, Fan]

Power Rate of PLASMA BRD

[Plot: power (Watts) versus time (seconds); curves for System, CPU, Memory, Motherboard, Fan]
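A power-versus-time trace can be turned into an energy figure by integrating the samples over the run. The sketch below uses the trapezoidal rule; the sample values are synthetic and invented for illustration (they are not measurements from the talk), and the function name is an assumption rather than part of PowerPack.

```python
def energy_joules(times_s, power_w):
    """Integrate sampled power over time with the trapezoidal rule."""
    if len(times_s) != len(power_w):
        raise ValueError("need one power sample per time stamp")
    total = 0.0
    for i in range(len(times_s) - 1):
        dt = times_s[i + 1] - times_s[i]
        total += 0.5 * (power_w[i] + power_w[i + 1]) * dt
    return total


# Synthetic samples, invented for illustration only.
times = [0.0, 1.0, 2.0, 3.0, 4.0]                    # seconds
system_power = [100.0, 200.0, 200.0, 200.0, 100.0]   # watts
energy = energy_joules(times, system_power)
average_power = energy / (times[-1] - times[0])
print(f"{energy:.1f} J consumed, {average_power:.1f} W on average")
```

The same function can be applied to each component curve separately. Because energy is power integrated over time, a run that draws similar power but finishes sooner consumes less energy.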

QUARK and DVFS

Summary and Future Work

What's next?

  • Non-symmetric eigenvalue problem
  • Eigenvector computations
  • RTE systems will have to do more than just scheduling (performing DVFS on-the-fly)
  • Power analysis with MAGMA and DPLASMA
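The appeal of DVFS (dynamic voltage and frequency scaling) can be sketched with the textbook approximation for CMOS dynamic power, P = C V² f. The model below is an idealization that ignores static (leakage) power and assumes the task needs a fixed number of cycles; the capacitance, voltage, and frequency values are invented for illustration and do not describe the machine used in the talk.

```python
def dynamic_energy(cycles, freq_hz, volts, capacitance_f):
    """Return (energy in joules, run time in seconds) under P = C * V^2 * f."""
    power_w = capacitance_f * volts ** 2 * freq_hz
    time_s = cycles / freq_hz
    return power_w * time_s, time_s


# Invented parameters: the same task at full speed and at half speed,
# with the supply voltage assumed to scale down together with frequency.
cycles = 2.0e9
fast_energy, fast_time = dynamic_energy(cycles, freq_hz=2.0e9, volts=1.2, capacitance_f=1.0e-9)
slow_energy, slow_time = dynamic_energy(cycles, freq_hz=1.0e9, volts=0.6, capacitance_f=1.0e-9)
print(f"fast: {fast_energy:.2f} J in {fast_time:.1f} s")
print(f"slow: {slow_energy:.2f} J in {slow_time:.1f} s")
```

In this idealized model the slower setting takes twice as long but uses a quarter of the dynamic energy, which suggests why a runtime might lower the frequency for tasks that are not on the critical path of the DAG.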

Thank you!
