Comparative Study of One-Sided Factorizations with Multiple Software Packages on Multi-Core Hardware



slide-1
SLIDE 1

Comparative Study of One-Sided Factorizations with Multiple Software Packages on Multi-Core Hardware

Emmanuel AGULLO, Jack DONGARRA, Bilel HADRI, Jakub KURZAK, Hatem LTAIEF, Piotr LUSZCZEK

Scheduling for Large-Scale Systems, Knoxville, TN, May 13-15, 2009

PLASMA group Comparative Study of One-Sided Factorizations 1

slide-2
SLIDE 2

Outline

  1. Tile Algorithms
       Cholesky Factorization
       QR (& LU) Factorizations
  2. Experimental environment
       Libraries
       Hardware
       Metrics
  3. Tuning
       PLASMA
  4. Comparison against other libraries
       Experiments on few cores
       Experiments on a large number of cores
       PLASMA scalability
  5. Conclusion and current work

slide-5
SLIDE 5

Tile Algorithms Cholesky

Tile Cholesky Factorization

⋆ Basically identical to the block algorithm (LAPACK).
⋆ Input matrix stored and processed by square tiles.
⋆ Complex DAG.
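To make the "complex DAG" concrete, the following minimal sketch lists the kernel calls of a right-looking tile Cholesky factorization on an nt x nt grid of tiles (lower-triangular case). The function name and the tuple layout are illustrative assumptions, not PLASMA's API; the kernel names follow the usual LAPACK/BLAS routines (POTRF, TRSM, SYRK, GEMM).

```python
def tile_cholesky_tasks(nt):
    """List kernel calls of a right-looking tile Cholesky on an nt x nt tile grid.

    Each task is (kernel, step k, tile row m, tile column n), where (m, n)
    is the tile the kernel writes. Illustrative layout, not PLASMA's own.
    """
    tasks = []
    for k in range(nt):
        # Factor the diagonal tile A[k][k].
        tasks.append(("POTRF", k, k, k))
        # Triangular solves produce the panel tiles A[m][k] below the diagonal.
        for m in range(k + 1, nt):
            tasks.append(("TRSM", k, m, k))
        # Trailing update: each remaining tile is updated with the new panel.
        for m in range(k + 1, nt):
            tasks.append(("SYRK", k, m, m))      # A[m][m] -= A[m][k] * A[m][k]^T
            for n in range(k + 1, m):
                tasks.append(("GEMM", k, m, n))  # A[m][n] -= A[m][k] * A[n][k]^T
    return tasks


if __name__ == "__main__":
    for task in tile_cholesky_tasks(3):
        print(task)
```

Every task on a tile depends on the earlier tasks that wrote the tiles it reads, which is what turns this flat list into a DAG once the matrix has more than a few tiles per side.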

slide-6
SLIDE 6

Tile Algorithms Cholesky

Tile Cholesky Factorization - Static pipeline

⋆ Work partitioned in one dimension (by block-rows).
⋆ Cyclic assignment of work across all steps of the factorization (pipelining of factorization steps).
⋆ Process tracking by a global progress table.
⋆ Stall on dependencies (busy waiting).
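The four points above can be illustrated with a small single-threaded simulation of tile Cholesky. It is only a sketch under stated assumptions: block-row m is owned by worker m % nworkers, the progress table is a set of finished tiles, and a "stall" is a tick in which a worker finds its next task not ready. All names are hypothetical; real busy waiting would spin on shared memory rather than count ticks.

```python
def worker_tasks(nt, p, nworkers):
    """Tasks of worker p: the block-rows m with m % nworkers == p, step by step."""
    tasks = []
    for k in range(nt):
        for m in range(k, nt):
            if m % nworkers != p:
                continue
            if m == k:
                tasks.append(("POTRF", k, m, k))
            else:
                tasks.append(("TRSM", k, m, k))
                for n in range(k + 1, m):
                    tasks.append(("GEMM", k, m, n))
                tasks.append(("SYRK", k, m, m))
    return tasks


def ready(task, final):
    """Check cross-row dependencies against the global progress table."""
    name, k, m, n = task
    if name == "TRSM":
        return (k, k) in final   # needs the factored diagonal tile of step k
    if name == "GEMM":
        return (n, k) in final   # needs the panel tile of block-row n
    return True                  # POTRF/SYRK only follow this row's own tasks


def simulate(nt, nworkers):
    """Return (tasks executed, stalled ticks) for a static cyclic pipeline."""
    queues = [worker_tasks(nt, p, nworkers) for p in range(nworkers)]
    pos = [0] * nworkers
    final = set()                # progress table: tiles holding their final value
    executed = stalls = 0
    while any(pos[p] < len(queues[p]) for p in range(nworkers)):
        newly_final = []
        for p in range(nworkers):
            if pos[p] == len(queues[p]):
                continue
            task = queues[p][pos[p]]
            if ready(task, final):
                pos[p] += 1
                executed += 1
                if task[0] in ("POTRF", "TRSM"):
                    newly_final.append((task[2], task[3]))
            else:
                stalls += 1      # stand-in for one round of busy waiting
        final.update(newly_final)
    return executed, stalls


if __name__ == "__main__":
    print(simulate(8, 4))
```

Because each worker walks its own task list in step order and every dependency points to an earlier step or a lower block-row, some worker can always make progress, so the simulation terminates.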

slide-8
SLIDE 8

Tile Algorithms QR & LU

Tile QR (&LU) Factorization

⋆ Different from the block algorithm.
⋆ Derived from out-of-core algorithm.
⋆ Input matrix stored and processed by square tiles.
⋆ Complex DAG.
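As a rough illustration of how tile QR differs from the block algorithm, the sketch below lists the kernel calls for an nt x nt tile grid. Only dssrfb is named on these slides; the other kernel names (GEQRT, LARFB, TSQRT) follow the naming commonly used for tile QR and are assumptions here, as are the function name and the tuple layout.

```python
def tile_qr_tasks(nt):
    """List kernel calls of a tile QR factorization on an nt x nt tile grid.

    Each task is (kernel, step k, tile row m, tile column n).
    Kernel names other than SSRFB are assumed, not taken from the slides.
    """
    tasks = []
    for k in range(nt):
        # QR of the diagonal tile.
        tasks.append(("GEQRT", k, k, k))
        # Apply its reflectors to the tiles to the right in block-row k.
        for n in range(k + 1, nt):
            tasks.append(("LARFB", k, k, n))
        for m in range(k + 1, nt):
            # Couple the triangular tile (k, k) with the full tile (m, k).
            tasks.append(("TSQRT", k, m, k))
            # Update the tile pairs (k, n) and (m, n) to the right.
            for n in range(k + 1, nt):
                tasks.append(("SSRFB", k, m, n))
    return tasks


if __name__ == "__main__":
    tasks = tile_qr_tasks(3)
    print(len(tasks), "tasks")
```

The SSRFB updates dominate the task count as nt grows, which is consistent with dssrfb being used as the reference kernel for DGEQRF later in the deck.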

slide-9
SLIDE 9

Tile Algorithms QR & LU

Tile QR Factorization - Static pipeline

⋆ Work partitioned in one dimension (by block-rows).
⋆ Cyclic assignment of work across all steps of the factorization (pipelining of factorization steps).
⋆ Process tracking by a global progress table.
⋆ Stall on dependencies (busy waiting).

slide-12
SLIDE 12

Experimental environment Libraries

Libraries

⋆ LAPACK:
  ◮ LAPACK 3.2 on Intel machine;
  ◮ LAPACK 3.1.1 on IBM machine;
⋆ SCALAPACK:
  ◮ SCALAPACK 1.8.0;
⋆ Vendor libraries:
  ◮ Intel MKL 10.1;
  ◮ IBM ESSL 4.3;
  ◮ IBM PESSL 3.3;
⋆ Tile algorithms:
  ◮ PLASMA;
  ◮ TBLAS.

slide-15
SLIDE 15

Experimental environment Hardware

Intel Xeon - 16 cores machine

⋆ Node:
  ◮ quad-socket quad-core Intel64 processors (16 cores).
⋆ Intel Xeon processor:
  ◮ quad-core;
  ◮ Frequency: 2.4 GHz.
⋆ Theoretical peak:
  ◮ 9.6 Gflop/s/core;
  ◮ 153.6 Gflop/s/node.
⋆ System and compilers:
  ◮ Linux 2.6.25;
  ◮ Intel Compilers 11.0.

slide-16
SLIDE 16

Experimental environment Hardware

IBM Power6 - 32 cores machine

⋆ Node:
  ◮ 16 dual-core Power6 processors (32 cores).
⋆ Power6 processor:
  ◮ dual-core;
  ◮ each core 2-way SMT;
  ◮ L1: 64 kB data + 64 kB instructions;
  ◮ L2: 4 MB per core, accessible by the other core;
  ◮ L3: 32 MB per processor, one controller per core (80 MB/s);
  ◮ Frequency: 4.7 GHz.
⋆ Theoretical peak:
  ◮ 18.8 Gflop/s/core;
  ◮ 601.6 Gflop/s/node.
⋆ System and compilers:
  ◮ AIX 5.3;
  ◮ xlf version 12.1;
  ◮ xlc version 10.1.
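The theoretical peaks quoted for both machines are consistent with frequency x flops per cycle x cores. The short check below assumes 4 double-precision flops per cycle per core; that factor is inferred from 9.6 / 2.4 and 18.8 / 4.7 and is not stated on the slides. The function name is illustrative.

```python
def peak_gflops(freq_ghz, flops_per_cycle, cores):
    """Theoretical peak in Gflop/s for `cores` cores at `freq_ghz` GHz."""
    return freq_ghz * flops_per_cycle * cores


if __name__ == "__main__":
    FLOPS_PER_CYCLE = 4  # assumption inferred from the per-core peaks above
    print("Intel64 core:", peak_gflops(2.4, FLOPS_PER_CYCLE, 1))
    print("Intel64 node:", peak_gflops(2.4, FLOPS_PER_CYCLE, 16))
    print("Power6 core: ", peak_gflops(4.7, FLOPS_PER_CYCLE, 1))
    print("Power6 node: ", peak_gflops(4.7, FLOPS_PER_CYCLE, 32))
```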

slide-18
SLIDE 18

Experimental environment Metrics

Performance metrics (How to read the graphs)

⋆ Performance: Gflop/s (y-axis).
⋆ Plots scaled to the theoretical peak.
⋆ Parallel DGEMM.
⋆ Upper bound: embarrassingly parallel fastest core kernel:
  ◮ DPOTRF (LL^T) → dgemm;
  ◮ DGEQRF (QR) → dssrfb;
  ◮ DGETRF (LU) → dssssm.

[Plot, Intel64 - DGEMM: Gflop/s versus matrix size (2000 to 12000). Series: 16xdgemm-seq, 16 cores, 8 cores, 4 cores, 2 cores, 1 core.]
[Plot, Power6 - DGEMM: Gflop/s versus matrix size (2000 to 12000). Series: 32xdgemm-seq, 32 cores, 16 cores, 8 cores, 4 cores, 2 cores, 1 core.]
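A Gflop/s figure can be derived from a measured time with the standard leading-order operation counts for an n x n matrix: n^3/3 for Cholesky, 2n^3/3 for LU and 4n^3/3 for QR. The slides do not say which formulas were used for the plots, so this is a sketch under that assumption; the function names are illustrative.

```python
# Leading-order flop counts for an n x n matrix (standard textbook values).
FLOP_COUNT = {
    "DPOTRF": lambda n: n ** 3 / 3.0,        # Cholesky
    "DGETRF": lambda n: 2.0 * n ** 3 / 3.0,  # LU
    "DGEQRF": lambda n: 4.0 * n ** 3 / 3.0,  # QR
}


def gflops(routine, n, seconds):
    """Performance in Gflop/s for one factorization of size n."""
    return FLOP_COUNT[routine](n) / seconds / 1e9


def fraction_of_peak(routine, n, seconds, peak_gflops):
    """Performance relative to a theoretical peak, as on the scaled plots."""
    return gflops(routine, n, seconds) / peak_gflops


if __name__ == "__main__":
    # Hypothetical timing: a 3000 x 3000 Cholesky that took 1 second.
    print(gflops("DPOTRF", 3000, 1.0))
```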

slide-21
SLIDE 21

Tuning PLASMA

Degrees of freedom

Input parameters of the serial core kernels:

⋆ NB: tile size;
⋆ IB: internal blocking (for dssrfb and dssssm only).

slide-22
SLIDE 22

Tuning PLASMA

Impact of NB - DPOTRF- Intel64- 16 cores

[Plot: Gflop/s versus matrix size (2000 to 12000). Series: 16xdgemm-seq, NB=196, NB=168, NB=120, NB=84, NB=60.]

slide-23
SLIDE 23

Tuning PLASMA

Impact of NB - DPOTRF- Power6- 32 cores

[Plot: Gflop/s versus matrix size (2000 to 12000). Series: 32xdgemm-seq, NB=288, NB=224, NB=220, NB=192, NB=160, NB=120, NB=80, NB=60.]

slide-24
SLIDE 24

Tuning PLASMA

Impact of NB/IB - DGEQRF- Intel64- 16 cores

[Plot: Gflop/s versus matrix size (2000 to 12000). Series: 16xdssrfb-seq, NB=256 IB=64, NB=200 IB=40, NB=168 IB=56, NB=120 IB=40, NB=84 IB=28, NB=60 IB=20.]

slide-25
SLIDE 25

Tuning PLASMA

Impact of NB/IB - DGEQRF- Power6- 32 cores

[Plot: Gflop/s versus matrix size (2000 to 12000). Series: 16xdssrfb-seq, NB=480 IB=96, NB=340 IB=68, NB=300 IB=60, NB=256 IB=64, NB=168 IB=56, NB=160 IB=40, NB=120 IB=40, NB=80 IB=40.]

slide-26
SLIDE 26

Tuning PLASMA

Impact of NB/IB - DGETRF- Intel64- 16 cores

[Plot: Gflop/s versus matrix size (2000 to 12000). Series: 16xdssssm-seq, NB=252 IB=28, NB=196 IB=28, NB=168 IB=28, NB=120 IB=24, NB=84 IB=28, NB=60 IB=20.]

slide-27
SLIDE 27

Tuning PLASMA

Impact of NB/IB - DGETRF- Power6- 32 cores

[Plot: Gflop/s versus matrix size (2000 to 12000). Series: 32xdssssm-seq, NB=480 IB=60, NB=340 IB=68, NB=280 IB=56, NB=240 IB=60, NB=196 IB=28, NB=168 IB=28, NB=120 IB=40, NB=80 IB=20.]

slide-28
SLIDE 28

Tuning PLASMA

Exhaustive search

For "each" matrix size and number of cores:

  1. Time PLASMA on all NB/IB samples;
  2. Select the best sample.

Number of samples

⋆ |{(IB, NB) : IB divides NB, 40 ≤ NB ≤ 500, 4 ≤ IB ≤ NB}| = 1352;
⋆ not all combinations can be explored on large executions;

→ need for a pruned search.
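The exhaustive search can be sketched as follows. The candidate set mirrors the divisibility constraint above (IB divides NB, IB at least 4); which NB values are actually sampled is not specified on the slide, so the sketch takes them as an argument and does not try to reproduce the sample count quoted there. `time_fn` is a hypothetical callback that runs the factorization for one (NB, IB) pair and returns the elapsed time.

```python
def candidate_pairs(nb_values, ib_min=4):
    """All (NB, IB) pairs with IB dividing NB and ib_min <= IB <= NB."""
    return [
        (nb, ib)
        for nb in nb_values
        for ib in range(ib_min, nb + 1)
        if nb % ib == 0
    ]


def exhaustive_search(time_fn, pairs):
    """Time every candidate and return the fastest (NB, IB) pair."""
    return min(pairs, key=time_fn)


if __name__ == "__main__":
    pairs = candidate_pairs(range(40, 501))
    print(len(pairs), "candidates when every NB from 40 to 500 is sampled")

    # Stand-in for a real timing run (hypothetical cost model).
    def fake_time(pair):
        nb, ib = pair
        return abs(nb - 200) + abs(ib - 40)

    print(exhaustive_search(fake_time, pairs))
```

The cost of this loop is one full factorization per candidate, per matrix size and per core count, which is why it does not scale to large executions.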

slide-32
SLIDE 32

Tuning PLASMA

Pruned search

Method

  1. Time serial core kernels (dgemm, dssrfb, dssssm).

[Plot, Intel64 - dgemm: dgemm performance versus NB (50 to 250), with peak performance marked. Selected NB: 60, 84, 120, 168, 196.]
[Plot, Power6 - dssrfb: dssrfb performance versus NB (50 to 500) for each (NB,IB), with peak performance marked. Selected (NB,IB) pairs: (480,96), (480,6), (340,68), (300,60), (256,64), (120,40), (80,40), (160,40), (168,56).]

  2. Select the "best" NB or NB/IB samples (pruning);
  3. Select one per matrix size and number of cores.
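The three steps can be sketched as below. The pruning rule is an assumption made for illustration: a candidate is kept only if its serial kernel is faster than every candidate with a smaller NB, which favors small tiles (more parallelism) unless a larger tile buys kernel speed. The slides do not spell out the actual heuristic. `kernel_perf` and `time_factorization` are hypothetical inputs standing in for real measurements.

```python
def prune(kernel_perf):
    """Keep (NB, IB) pairs whose kernel beats all smaller-NB candidates.

    kernel_perf maps (NB, IB) to the serial kernel performance in Gflop/s.
    Assumed heuristic, not necessarily the one used for the slides.
    """
    kept = []
    best = 0.0
    for pair, perf in sorted(kernel_perf.items()):
        if perf > best:
            kept.append(pair)
            best = perf
    return kept


def tune(kernel_perf, time_factorization, sizes, core_counts):
    """Pick one pruned candidate per (matrix size, number of cores)."""
    candidates = prune(kernel_perf)              # step 2: pruning
    table = {}
    for n in sizes:
        for cores in core_counts:
            # Step 3: time the full factorization on the survivors only.
            table[(n, cores)] = min(
                candidates, key=lambda c: time_factorization(n, cores, c)
            )
    return table


if __name__ == "__main__":
    # Step 1 stand-in: made-up kernel timings, for illustration only.
    perf = {(60, 20): 5.0, (84, 28): 6.0, (100, 25): 5.5, (120, 40): 7.0}
    print(prune(perf))
```

With the pruned list, the number of full factorization runs per configuration drops from the whole candidate set to a handful of pairs.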

slide-33
SLIDE 33

Tuning PLASMA

Exhaustive search vs. pruned search - Intel64 - 16 cores - DPOTRF

[Plot: Gflop/s versus matrix size (1000 to 8000). Series: 16xdgemm-seq, EXPLASMA, PLASMA.]

slide-34
SLIDE 34

Tuning PLASMA

Exhaustive search vs. pruned search - Intel64 - 16 cores - DGEQRF

[Plot: Gflop/s versus matrix size (1000 to 8000). Series: 16xdssrfb-seq, EXPLASMA, PLASMA.]

slide-35
SLIDE 35

Tuning PLASMA

Exhaustive search vs. pruned search - Intel64 - 16 cores - DGETRF

[Plot: Gflop/s versus matrix size (1000 to 8000). Series: 16xdssssm-seq, EXPLASMA, PLASMA.]

slide-36
SLIDE 36

Tuning Summary

Other software

⋆ PLASMA: pruned search.
⋆ TBLAS: exhaustive search.
⋆ SCALAPACK, PESSL: exhaustive search.
⋆ LAPACK, MKL, ESSL: tuned by vendor.

slide-39
SLIDE 39

Comparison against other libraries Experiments on few cores

DPOTRF- Intel64- 4 cores

[Plot: Gflop/s versus matrix size (2000 to 12000). Series: 4xdgemm-seq, DGEMM, PLASMA, TBLAS, MKL, SCALAPACK, LAPACK.]

slide-40
SLIDE 40

Comparison against other libraries Experiments on few cores

DPOTRF- Power6- 2 cores

[Plot: Gflop/s versus matrix size (2000 to 12000). Series: 2xdgemm-seq, DGEMM, PLASMA, TBLAS, ESSL, PESSL, SCALAPACK, LAPACK.]

slide-41
SLIDE 41

Comparison against other libraries Experiments on few cores

DGEQRF- Intel64- 4 cores

[Plot: Gflop/s versus matrix size (2000 to 12000). Series: 4xdssrfb-seq, DGEMM, PLASMA, TBLAS, MKL, SCALAPACK, LAPACK.]

slide-42
SLIDE 42

Comparison against other libraries Experiments on few cores

DGEQRF- Power6- 2 cores

[Plot: Gflop/s versus matrix size (2000 to 12000). Series: 2xdssrfb-seq, DGEMM, PLASMA, TBLAS, ESSL, PESSL, SCALAPACK, LAPACK.]

slide-43
SLIDE 43

Comparison against other libraries Experiments on few cores

DGETRF- Intel64- 4 cores

[Plot: Gflop/s versus matrix size (2000 to 12000). Series: 4xdssssm-seq, DGEMM, PLASMA, MKL, SCALAPACK, LAPACK.]

slide-44
SLIDE 44

Comparison against other libraries Experiments on few cores

DGETRF- Power6- 2 cores

[Plot: Gflop/s versus matrix size (2000 to 12000). Series: 2xdssssm-seq, DGEMM, PLASMA, ESSL, PESSL, SCALAPACK, LAPACK.]

slide-46
SLIDE 46

Comparison against other libraries Experiments on a large number of cores

DPOTRF- Intel64- 16 cores

[Plot: Gflop/s versus matrix size (2000 to 12000). Series: 16xdgemm-seq, DGEMM, PLASMA, TBLAS, MKL, SCALAPACK, LAPACK.]

slide-47
SLIDE 47

Comparison against other libraries Experiments on a large number of cores

DPOTRF- Power6- 32 cores

[Plot: Gflop/s versus matrix size (2000 to 12000). Series: 32xdgemm-seq, DGEMM, PLASMA, TBLAS, ESSL, PESSL, SCALAPACK, LAPACK.]

slide-48
SLIDE 48

Comparison against other libraries Experiments on a large number of cores

DGEQRF- Intel64- 16 cores

[Plot: Gflop/s versus matrix size (2000 to 12000). Series: 16xdssrfb-seq, DGEMM, PLASMA, TBLAS, MKL, SCALAPACK, LAPACK.]

slide-49
SLIDE 49

Comparison against other libraries Experiments on a large number of cores

DGEQRF- Power6- 32 cores

[Plot: Gflop/s versus matrix size (2000 to 12000). Series: 32xdssrfb-seq, DGEMM, PLASMA, TBLAS, ESSL, PESSL, SCALAPACK, LAPACK.]

slide-50
SLIDE 50

Comparison against other libraries Experiments on a large number of cores

DGETRF- Intel64- 16 cores

[Plot: Gflop/s versus matrix size (2000 to 12000). Series: 16xdssssm-seq, DGEMM, PLASMA, MKL, SCALAPACK, LAPACK.]

slide-51
SLIDE 51

Comparison against other libraries Experiments on a large number of cores

DGETRF- Power6- 32 cores

[Plot: Gflop/s versus matrix size (2000 to 12000). Series: 32xdssssm-seq, DGEMM, PLASMA, ESSL, PESSL, SCALAPACK, LAPACK.]

slide-53
SLIDE 53

Comparison against other libraries PLASMA scalability

PLASMA- DPOTRF- Intel64

[Plot: Gflop/s versus matrix size (2000 to 12000). Series: 16xdgemm-seq, 16 cores, 14 cores, 12 cores, 10 cores, 8 cores, 6 cores, 4 cores, 2 cores, 1 core.]

slide-54
SLIDE 54

Comparison against other libraries PLASMA scalability

PLASMA- DGEQRF- Intel64

[Plot: Gflop/s versus matrix size (2000 to 12000). Series: 16xdssrfb-seq, 16 cores, 14 cores, 12 cores, 10 cores, 8 cores, 6 cores, 4 cores, 2 cores, 1 core.]

slide-55
SLIDE 55

Comparison against other libraries PLASMA scalability

PLASMA- DGETRF- Intel64

[Plot: Gflop/s versus matrix size (2000 to 12000). Series: 16xdssssm-seq, 16 cores, 14 cores, 12 cores, 10 cores, 8 cores, 6 cores, 4 cores, 2 cores, 1 core.]

slide-56
SLIDE 56

Comparison against other libraries PLASMA scalability

PLASMA- DPOTRF- Power6

[Plot: Gflop/s versus matrix size (2000 to 12000). Series: 32xdgemm-seq, 32 cores, 16 cores, 8 cores, 4 cores, 2 cores, 1 core.]

slide-57
SLIDE 57

Comparison against other libraries PLASMA scalability

PLASMA- DGEQRF- Power6

[Plot: Gflop/s versus matrix size (2000 to 12000). Series: 32xdssrfb-seq, 32 cores, 16 cores, 8 cores, 4 cores, 2 cores, 1 core.]

slide-59
SLIDE 59

Conclusion and current work Conclusion

Conclusion

⋆ Performance brought by tile algorithms:
  Possible overheads:
    • extra flops;
    • kernels not optimized.
  Benefits:
    • better data reuse;
    • better scheduling opportunities.
⋆ Better scalability.
⋆ Importance of tuning:
  → efficient pruned search.

slide-60
SLIDE 60

Conclusion and current work Current work

Current work

⋆ Compute-intensive kernels:
  successive BLAS-3 calls → single BLAS-3 call.
⋆ Dynamic scheduling:
  → Piotr's presentation.
⋆ Improve scalability for small matrix sizes:
  → increase parallelism (tile TSQR).
⋆ Generalization to other linear algebra algorithms:
  → two-sided factorizations.

slide-61
SLIDE 61

Conclusion and current work Current work

Thanks

Questions?


slide-62
SLIDE 62

Scalability of other libraries

Outline

  1. Scalability of other libraries
       TBLAS
       MKL - ESSL
       SCALAPACK - PESSL
       LAPACK

slide-64
SLIDE 64

Scalability of other libraries TBLAS

TBLAS- DPOTRF- Intel64

[Plot: Gflop/s versus matrix size (2000 to 12000). Series: 16xdgemm-seq, 16 cores, 8 cores, 4 cores, 2 cores, 1 core.]

slide-65
SLIDE 65

Scalability of other libraries TBLAS

TBLAS- DGEQRF- Intel64

[Plot: Gflop/s versus matrix size (2000 to 12000). Series: 16xdssrfb-seq, 16 cores, 8 cores, 4 cores, 2 cores, 1 core.]

slide-66
SLIDE 66

Scalability of other libraries TBLAS

TBLAS- DPOTRF- Power6

[Plot: Gflop/s versus matrix size (2000 to 12000). Series: 32xdgemm-seq, 32 cores, 16 cores, 8 cores, 4 cores, 2 cores, 1 core.]

slide-67
SLIDE 67

Scalability of other libraries TBLAS

TBLAS- DGEQRF- Power6

[Plot: Gflop/s versus matrix size (2000 to 12000). Series: 32xdssrfb-seq, 32 cores, 16 cores, 8 cores, 4 cores, 2 cores, 1 core.]

slide-69
SLIDE 69

Scalability of other libraries MKL- ESSL

MKL- DPOTRF- Intel64

[Plot: Gflop/s versus matrix size (2000 to 12000). Series: 16xdgemm-seq, 16 cores, 8 cores, 4 cores, 2 cores, 1 core.]

slide-70
SLIDE 70

Scalability of other libraries MKL- ESSL

MKL- DGEQRF- Intel64

[Plot: Gflop/s versus matrix size (2000 to 12000). Series: 16xdssrfb-seq, 16 cores, 8 cores, 4 cores, 2 cores, 1 core.]

slide-71
SLIDE 71

Scalability of other libraries MKL- ESSL

MKL- DGETRF- Intel64

[Plot: Gflop/s versus matrix size (2000 to 12000). Series: 16xdssrfb-seq, 16 cores, 8 cores, 4 cores, 2 cores, 1 core.]

slide-72
SLIDE 72

Scalability of other libraries MKL- ESSL

ESSL- DPOTRF- Power6

[Plot: Gflop/s versus matrix size (2000 to 12000). Series: 32xdgemm-seq, 32 cores, 16 cores, 8 cores, 4 cores, 2 cores, 1 core.]

slide-73
SLIDE 73

Scalability of other libraries MKL- ESSL

ESSL- DGEQRF- Power6

[Plot: Gflop/s versus matrix size (2000 to 12000). Series: 32xdssrfb-seq, 32 cores, 16 cores, 8 cores, 4 cores, 2 cores, 1 core.]

slide-74
SLIDE 74

Scalability of other libraries MKL- ESSL

ESSL- DGETRF- Power6

[Plot: Gflop/s versus matrix size (2000 to 12000). Series: 32xdssssm-seq, 32 cores, 16 cores, 8 cores, 4 cores, 2 cores, 1 core.]

slide-76
SLIDE 76

Scalability of other libraries SCALAPACK- PESSL

SCALAPACK- DPOTRF- Intel64

[Plot: Gflop/s versus matrix size (2000 to 12000). Series: 16xdgemm-seq, 16 cores, 8 cores, 4 cores, 2 cores, 1 core.]

slide-77
SLIDE 77

Scalability of other libraries SCALAPACK- PESSL

SCALAPACK- DPOTRF- Power6

[Plot: Gflop/s versus matrix size (2000 to 12000). Series: 32xdgemm-seq, 32 cores, 16 cores, 8 cores, 4 cores, 2 cores, 1 core.]

slide-78
SLIDE 78

Scalability of other libraries SCALAPACK- PESSL

PESSL- DPOTRF- Power6

[Plot: Gflop/s versus matrix size (2000 to 12000). Series: 32xdgemm-seq, 32 cores, 16 cores, 8 cores, 4 cores, 2 cores, 1 core.]

slide-79
SLIDE 79

Scalability of other libraries SCALAPACK- PESSL

SCALAPACK- DGEQRF- Intel64

[Plot: Gflop/s versus matrix size (2000 to 12000). Series: 16xdssrfb-seq, 16 cores, 8 cores, 4 cores, 2 cores, 1 core.]

slide-80
SLIDE 80

Scalability of other libraries SCALAPACK- PESSL

SCALAPACK- DGEQRF- Power6

[Plot: Gflop/s versus matrix size (2000 to 12000). Series: 32xdssrfb-seq, 32 cores, 16 cores, 8 cores, 4 cores, 2 cores, 1 core.]

slide-81
SLIDE 81

Scalability of other libraries SCALAPACK- PESSL

PESSL- DGEQRF- Power6

[Plot: Gflop/s versus matrix size (2000 to 12000). Series: 32xdssrfb-seq, 32 cores, 16 cores, 8 cores, 4 cores, 2 cores, 1 core.]

slide-82
SLIDE 82

Scalability of other libraries SCALAPACK- PESSL

SCALAPACK- DGETRF- Intel64

[Plot: Gflop/s versus matrix size (2000 to 12000). Series: 16xdssssm-seq, 16 cores, 8 cores, 4 cores, 2 cores, 1 core.]

slide-83
SLIDE 83

Scalability of other libraries SCALAPACK- PESSL

SCALAPACK- DGETRF- Power6

[Plot: Gflop/s versus matrix size (2000 to 12000). Series: 32xdssssm-seq, 32 cores, 16 cores, 8 cores, 4 cores, 2 cores, 1 core.]

slide-84
SLIDE 84

Scalability of other libraries SCALAPACK- PESSL

PESSL- DGETRF- Power6

[Plot: Gflop/s versus matrix size (2000 to 12000). Series: 32xdssssm-seq, 32 cores, 16 cores, 8 cores, 4 cores, 2 cores, 1 core.]

slide-86
SLIDE 86

Scalability of other libraries LAPACK

LAPACK- DPOTRF

[Plot: Gflop/s versus matrix size (2000 to 12000). Series: 32xdgemm-seq, 32 cores, 16 cores, 8 cores, 4 cores, 2 cores, 1 core.]

slide-87
SLIDE 87

Scalability of other libraries LAPACK

LAPACK- DGEQRF

[Plot: Gflop/s versus matrix size (2000 to 12000). Series: 32xdssrfb-seq, 32 cores, 16 cores, 8 cores, 4 cores, 2 cores, 1 core.]

slide-88
SLIDE 88

Scalability of other libraries LAPACK

LAPACK- DGETRF

[Plot: Gflop/s versus matrix size (2000 to 12000). Series: 32xdssssm-seq, 32 cores, 16 cores, 8 cores, 4 cores, 2 cores, 1 core.]