

SLIDE 1

Dense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends

Paolo Bientinesi

AICES, RWTH Aachen pauldj@aices.rwth-aachen.de

ComplexHPC Spring School 2013 Heterogeneous computing - Impact on algorithms June 7th, 2013 Uppsala University, Sweden

Paolo Bientinesi (AICES, RWTH Aachen) 1 / 34

SLIDE 2

1. Setting the stage
2. Part 1: blocked algorithms
3. Part 2: multithreading, fork-join
4. Part 3: multithreading, algorithms-by-blocks
5. Part 4: streaming

SLIDES 3-7

Dense Linear Algebra

[Figure: a 12 × 12 matrix M of × marks; successive slides show fully dense, block-structured, sparse, lower-triangular, and banded nonzero patterns.]

SLIDES 8-10

Dense Linear Algebra

Linear systems: Ax = b, AX = B, least squares, ...
Eigenproblems: Ax = λx, AX = BXΛ, SVD, ...
Matrix equations: AX + XB = C, A := (A + A⁻¹)/2, ...
Support routines: factorizations, reductions, ...

SLIDES 11-17

Organization in layers

Other libraries: ScaLAPACK, Elemental, PETSc, ...

LAPACK: LU = A, LLᵀ = A, QR = A, QᵀTQ = A, ...

BLAS:
  BLAS-3: C := C + AB, C := L⁻¹B   (A, B, C, L ∈ Rⁿˣⁿ)
  BLAS-2: y := y + Ax, y := L⁻¹x   (A, L ∈ Rⁿˣⁿ; x, y ∈ Rⁿ)
  BLAS-1: y := y + αx, dot := α + xᵀy   (x, y ∈ Rⁿ; α ∈ R)
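The three BLAS levels can be illustrated with NumPy (a sketch; NumPy is used here only as a convenient front end to an underlying BLAS, and all variable names are illustrative):

```python
import numpy as np

n = 4
rng = np.random.default_rng(0)
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))
x = rng.standard_normal(n)
y = rng.standard_normal(n)
alpha = 0.5

# BLAS-1: vector-vector operations (axpy, dot): O(n) flops on O(n) data
y1 = y + alpha * x
d = x @ y

# BLAS-2: matrix-vector operations (gemv): O(n^2) flops on O(n^2) data
y2 = y + A @ x

# BLAS-3: matrix-matrix operations (gemm): O(n^3) flops on O(n^2) data
C = np.zeros((n, n))
C = C + A @ B
```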

SLIDE 18

Example: AX = B (full A)

[Tree: AX = B (linear system) decomposes into LU = A (LU factorization) and two triangular systems LX = B; the factorization and the triangular solves in turn reduce to GEMM updates C = AB + C.]

SLIDES 19-26

Why BLAS-3? Why GEMM?

BLAS                                            #FLOPS   Mem. refs.   Ratio
Level 3: C := C + AB   (A, B, C ∈ Rⁿˣⁿ)         2n³      4n²          n/2
Level 2: y := y + Ax   (A ∈ Rⁿˣⁿ; x, y ∈ Rⁿ)    2n²      n²           2
Level 1: y := y + αx   (x, y ∈ Rⁿ; α ∈ R)       2n       3n           2/3

Morale: For BLAS-3, the larger the problem the better, as long as it fits in memory. GEMM is the building block for all the other BLAS-3 kernels, and for LAPACK.
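The flop and memory-reference counts above are easy to recompute; a minimal sketch in plain Python (the function name and dictionary layout are illustrative):

```python
def blas_ratios(n):
    """#FLOPS, memory references, and their ratio for the three BLAS levels."""
    levels = {
        1: (2 * n,    3 * n),     # y := y + alpha*x : read x and y, write y
        2: (2 * n**2, n**2),      # y := y + A*x     : reading A dominates
        3: (2 * n**3, 4 * n**2),  # C := C + A*B     : read A, B, C, write C
    }
    return {lvl: (f, m, f / m) for lvl, (f, m) in levels.items()}

# For n = 1000, BLAS-3 performs 500 flops per memory reference,
# while BLAS-1 and BLAS-2 stay at 2/3 and 2 regardless of n.
r = blas_ratios(1000)
```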

SLIDE 27

Part 1: Blocked algorithms

Simple example: Cholesky factorization
Input: matrix A, symmetric and positive definite.
Goal: determine a lower-triangular L such that LLᵀ = A.

Partition
  L = ( L_TL    0   )
      ( L_BL  L_BR )
and determine the blocks, one panel at a time.
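A minimal sketch of the blocked algorithm in Python/NumPy (the block size nb and the function name are illustrative, not from the slides): at each step the diagonal block is factored (CHOL), the panel below it is updated with a triangular solve (TRSM), and the trailing matrix receives a symmetric update (SYRK).

```python
import numpy as np

def blocked_cholesky(A, nb=2):
    """Return lower-triangular L with L @ L.T == A (A symmetric positive definite)."""
    A = A.copy()
    n = A.shape[0]
    for k in range(0, n, nb):
        e = min(k + nb, n)
        # CHOL: factor the diagonal block in place
        A[k:e, k:e] = np.linalg.cholesky(A[k:e, k:e])
        if e < n:
            # TRSM: panel update, L21 such that L21 @ L11.T = A21
            A[e:, k:e] = np.linalg.solve(A[k:e, k:e], A[e:, k:e].T).T
            # SYRK: symmetric update of the trailing matrix
            A[e:, e:] -= A[e:, k:e] @ A[e:, k:e].T
    return np.tril(A)
```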

SLIDE 28

Cholesky factorization

iteration i

[Figure: the leading blocks of L are DONE; the current panel and the trailing matrix remain.]

SLIDE 29

Cholesky factorization

iteration i + 1

[Figure] Unblocked algorithm: sqrt, trsv, syr. Blocked algorithm: CHOL, TRSM, SYRK.

SLIDE 30

Cholesky factorization

iteration i + 1

[Figure: progress after iteration i + 1 for the unblocked and the blocked algorithm.]

SLIDE 31

Cholesky: unblocked vs. blocked algorithms

SLIDE 32

Part 2: Parallelism? fork-join

Solution #1: multithreaded BLAS (+ vector instructions)

[Figure: each operation in the sequence (CHOL, TRSM, GEMM; LU, TRSM, TRSM, GEMM) runs as an internally parallel call, with a join after each one.]

Advantage: ease of use. Legacy code!
Drawback: unnecessary synchronization points.
OpenBLAS, ATLAS, BLIS, old versions of MKL, ...
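The fork-join pattern can be sketched in plain Python (the operations are placeholders, not BLAS calls): each "call" forks workers over the blocks of its operands and joins them before the next call starts, which is exactly where the unnecessary synchronization points come from.

```python
from concurrent.futures import ThreadPoolExecutor

def fork_join_call(op, blocks, pool):
    # Fork: apply op to every block in parallel; join: wait for all of them.
    return list(pool.map(op, blocks))

with ThreadPoolExecutor(max_workers=4) as pool:
    data = list(range(8))
    # Each "BLAS call" completes fully before the next one begins,
    # even when only some blocks of the next call depend on its output.
    step1 = fork_join_call(lambda b: b * 2, data, pool)
    step2 = fork_join_call(lambda b: b + 1, step1, pool)
```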

SLIDE 34

Multithreaded BLAS

Xeon, 32 physical cores, MKL

[Plot: efficiency of GEMM (0.2 to 1) vs. matrix dimension (1000 to 8000) for 1, 2, 4, 8, 16, and 32 threads.]

SLIDES 35-44

Development of LA libraries

  • New architecture / new architectural features
  • GEMM → peak performance [FFT, SpMV]
  • BLAS3, factorizations, AX=B → LINPACK benchmark
  • factorizations, AX=B
  • factorizations, AX=B, support routines
  • ...
  • eigenproblems

SLIDE 45

Examples 2005–2006: multicores

GEMM, mtBLAS, HPL

2005-7: Blue Gene L, dual cores (PowerPC A2)
2008-9: Roadrunner, dual cores (Opteron)
2009: Jaguar, 6-cores (Opteron)
2010-11: K Computer, 8-cores (SPARC64)
...

libFLAME, PLASMA, MKL; Eigensolvers: 2010

SLIDE 46

Examples 2005–2006: Cell

GEMM: 99% [FFT], no BLAS, HPL

2008-9: Roadrunner, >1 PetaFLOP, 6k × (Opteron + 2 Cell)

no libs; 2009: discontinued; Eigensolvers: –

SLIDE 47

Examples 2005: GPGPUs

cuBLAS (*), HPL

2010: Tianhe-1A, 2.5k dual-GPU (ATI Radeon)

CULA (*), libFLAME, MAGMA (multi-GPU); Eigensolvers: 2010-11

SLIDE 48

Present: Intel Xeon Phi (MIC)

GEMM (*), MKL, HPL (predicted)

2013: Tianhe-2, 16k × (2 Ivy Bridge + 3 Phi)

  • Also ARM, FPGAs, DSPs, ...

SLIDE 50

Future 2017: Exascale (?)

???

heterogeneous architecture

SLIDE 51

Back to the algorithms: Chol. fact

Fork-join ⇒ unnecessary synchronizations

[Figure: Iteration 1 (CHOL, SYRK, TRSM) followed by Iteration 2 (CHOL, SYRK, TRSM), with a synchronization point after every operation.]

SLIDE 52

Part 3: Parallelism? algorithms-by-blocks

Idea: decompose the tasks. Solution #2: dependent tasks + scheduling.

AX = B, with A and B partitioned into blocks of rows:

  A0 X = B0,   A1 X = B1,   A2 X = B2

SLIDE 55

Storage by blocks

[Figure: tiles touched by CHOL, SYRK, and TRSM in iterations 1 and 2 when the matrix is stored by blocks.]

SLIDE 56

Creating tasks

Iteration 1: CHOL; TRSM ×3; SYRK ×3; GEMM ×3

SLIDE 57

Creating tasks

Iteration 2: CHOL; TRSM ×2; SYRK ×2; GEMM ×1

SLIDE 58

Creating tasks

Iteration 3: CHOL; TRSM ×1; SYRK ×1

SLIDE 59

Creating tasks

Iteration 4: CHOL
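The per-iteration task lists can be generated mechanically. A sketch in plain Python (the function name and tuple layout are illustrative): for tile column k of a tiled matrix, emit one CHOL for the diagonal tile, one TRSM per tile below it, and one SYRK or GEMM per tile of the trailing submatrix.

```python
def cholesky_tasks(n_tiles):
    """Yield (iteration, kind, i, j) tasks for a tiled Cholesky factorization."""
    for k in range(n_tiles):
        yield (k, "CHOL", k, k)              # factor diagonal tile (k, k)
        for i in range(k + 1, n_tiles):
            yield (k, "TRSM", i, k)          # panel solve on tile (i, k)
        for i in range(k + 1, n_tiles):
            yield (k, "SYRK", i, i)          # symmetric update of tile (i, i)
            for j in range(k + 1, i):
                yield (k, "GEMM", i, j)      # general update of tile (i, j)

# For a 4 x 4-tile matrix, iteration 1 produces 1 CHOL, 3 TRSM, 3 SYRK,
# and 3 GEMM, matching the task lists on the slides.
tasks = list(cholesky_tasks(4))
```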

SLIDES 60-64

Dependencies

[Figure, repeated across five build slides with different tasks highlighted: the ten tasks of iteration 1 (CHOL; TRSM ×3; GEMM ×3; SYRK ×3) and the dependencies among them.]

SLIDE 65

DAG - Dependencies

4 × 4-tile matrix

[DAG: the first CHOL feeds three TRSMs; the TRSMs feed the SYRK/GEMM updates of the trailing 3 × 3 tiles; those feed the next CHOL, which feeds two TRSMs, then SYRK/GEMM updates, then CHOL, one TRSM, one SYRK, and the final CHOL.]

SLIDE 66

Task Execution

4 × 4-tile matrix

Stage | Scheduled tasks
  1   | CHOL
  2   | TRSM TRSM TRSM TRSM
  3   | SYRK GEMM SYRK GEMM
  4   | GEMM SYRK GEMM GEMM
  5   | GEMM SYRK CHOL
  6   | TRSM TRSM TRSM
  7   | SYRK GEMM SYRK GEMM
  8   | GEMM SYRK CHOL
  9   | TRSM TRSM
 10   | SYRK GEMM SYRK
 11   | CHOL
 12   | TRSM
 13   | SYRK
 14   | CHOL

SLIDE 67

SPD Inverse: Chol+Inv+GEMM

5 × 5-tile matrix

SLIDE 68

SPD Inverse: Chol+Inv+GEMM

5 × 5-tile matrix

Stage | Scheduled tasks
  1   | CHOL
  2   | TRSM TRSM TRSM TRSM
  3   | SYRK GEMM SYRK GEMM
  4   | GEMM SYRK GEMM GEMM
  5   | GEMM SYRK CHOL TRSM
  6   | TRSM TRSM TRSM TRSM
  7   | TRSM TRSM TRINV SYRK
  8   | GEMM SYRK GEMM GEMM
  9   | SYRK TTMM CHOL TRSM
 10   | TRSM TRSM TRSM TRSM
 11   | GEMM GEMM GEMM SYRK
 12   | GEMM SYRK TRSM CHOL
 13   | TRSM TRSM TRINV SYRK
 14   | TRSM GEMM GEMM GEMM
 15   | GEMM TRMM SYRK TRSM
 16   | TRSM TTMM CHOL TRSM
 17   | SYRK TRINV GEMM SYRK
 18   | GEMM GEMM GEMM TRMM
 19   | TRMM TRSM TRSM TRSM
 20   | TRSM TRSM TRSM TRSM
 21   | TTMM SYRK GEMM SYRK
 22   | TRINV GEMM GEMM TRINV
 23   | SYRK SYRK GEMM SYRK
 24   | TRMM GEMM TRMM GEMM
 25   | TRMM SYRK GEMM GEMM
 26   | TTMM GEMM TRMM TRMM
 27   | SYRK TRMM
 28   | TRMM
 29   | TTMM

SLIDE 69

Blocked vs. By-blocks

SLIDE 70

Blocked vs. By-blocks: crossover

[Plot: performance of Cholesky (LLᵀ = A), GFLOPS (10 to 80) vs. matrix dimension (2000 to 14000), SuperMatrix vs. FLAME; the two curves cross over.]

SLIDE 71

Multithreaded BLAS vs. Algorithms-by-blocks

No absolute winner: crossover!
Multithreaded BLAS: ✓ ease of use; ✖ synchronization.
Algorithms-by-blocks: ✓ out-of-order execution; ✓ parallelism dictated by data dependencies; ✖ plateaus.
libFLAME, MKL, PLASMA, ...
Heterogeneous systems ↔ schedulers

SLIDE 72

Part 4: Streaming

The entire problem does not even fit in main memory.

Example:  b := (XᵀM⁻¹X)⁻¹XᵀM⁻¹y

Linear regression with non-independent outcomes

Inputs: M ∈ Rⁿˣⁿ, SPD, n ∈ [10³, 10⁴]
        X ∈ Rⁿˣᵖ, full rank, p ∈ [1, 20]
        y ∈ Rⁿ
Output: b ∈ Rᵖ

⋆ Sequence of thousands of problems ⋆

SLIDE 74

Algorithm

LLᵀ = M
X := L⁻¹X
S := XᵀX
GGᵀ = S
y := L⁻¹y
b := Xᵀy
b := G⁻¹b
b := G⁻ᵀb

SLIDE 75

Algorithm – bottleneck?

LLᵀ = M
X := L⁻¹X   → to accelerator (TRSM)
S := XᵀX
GGᵀ = S
y := L⁻¹y
b := Xᵀy
b := G⁻¹b
b := G⁻ᵀb

SLIDE 76

double+triple buffering

GPU CPU HDD

[Timeline across GPU, CPU, and HDD lanes: while the GPU runs TRSM on problem b, the CPU computes and writes out problem b-1, problems b+1 and b+2 are sent to the GPU, and problems b+2 and b+3 are read from disk. Legend: CPU ⇄ GPU transfer, HDD ⇄ CPU transfer, GPU computation, CPU computation, data dependencies, asynchronous dispatch.]
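The overlap of disk reads, accelerator work, and CPU work can be prototyped with a small thread pipeline (plain Python; queue sizes of 2 play the role of the double buffers, and the load/compute/store callables are placeholders for the HDD read, the GPU TRSM, and the CPU post-processing):

```python
import threading
import queue

def stream(batches, load, compute, store, depth=2):
    """Run load -> compute -> store as a 3-stage pipeline with bounded buffers."""
    q_in, q_out = queue.Queue(maxsize=depth), queue.Queue(maxsize=depth)

    def reader():                        # stage 1: e.g. HDD -> RAM
        for b in batches:
            q_in.put(load(b))
        q_in.put(None)                   # end-of-stream marker

    def worker():                        # stage 2: e.g. TRSM on the accelerator
        while (item := q_in.get()) is not None:
            q_out.put(compute(item))
        q_out.put(None)

    threads = [threading.Thread(target=reader), threading.Thread(target=worker)]
    for t in threads:
        t.start()
    results = []                         # stage 3: e.g. CPU compute + write-back
    while (r := q_out.get()) is not None:
        results.append(store(r))
    for t in threads:
        t.join()
    return results
```

With depth = 2 each buffer holds at most two problems, one being filled while the previous one is consumed, which is the double-buffering scheme of the figure.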

SLIDE 77

double+triple buffering

[Figure: buffer layout for double + triple buffering. Data X for problems b-3 … b+3 streams HDD → CPU/RAM → GPU; the GPU holds buffers A, B, C for problems b+2, b-1, b+1 and runs TRSM on problem b; results r for problems b-1 and b-2 flow back toward disk.]

SLIDE 78

Summary

Dense linear algebra: it’s a matter of layers.
GEMM = the foundation, THE building block.
Irrespective of the architecture, there is a need for highly optimized BLAS.
How to use threads/cores? Multithreaded BLAS vs. algorithms-by-blocks.
Large/many problems but limited memory ⇒ streaming.

SLIDE 79

References

MKL            http://software.intel.com/en-us/intel-mkl
cuBLAS, CULA   https://developer.nvidia.com/cublas
libFLAME       http://wiki.cs.utexas.edu/flame
BLIS           http://code.google.com/p/blis/
OpenBLAS       http://xianyi.github.io/OpenBLAS/
MAGMA          http://icl.cs.utk.edu/magma/
PLASMA         http://icl.cs.utk.edu/plasma/
