
slide-1
SLIDE 1

SuperMatrix: A Multithreaded Runtime Scheduling System for Algorithms-by-Blocks

Ernie Chan, Field G. Van Zee, Robert van de Geijn, Paolo Bientinesi, Enrique S. Quintana-Ortí and Gregorio Quintana-Ortí

Software Engineering Seminar – Luc Humair

slide-2
SLIDE 2

Motivation

  • Multicore architectures demand concurrent algorithms
  • Linear algebra libraries are complicated and error-prone

2

slide-4
SLIDE 4

Motivation

  • SuperMatrix offers a level of abstraction for algorithms-by-blocks:
    – Automatic parallelization
    – Straightforward implementation of algorithms-by-blocks
  • Work with blocked matrices (FLAME/FLASH API)
  • Dependency analysis
  • Out-of-order scheduling

4

slide-8
SLIDE 8

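Inversion of an SPD Matrix

Given a symmetric positive definite matrix $A \in \mathbb{R}^{n \times n}$:

1. Cholesky factorization (CHOL): $A = U^T U$, with $U$ upper triangular
2. Inversion of a triangular matrix (TRINV): $R := U^{-1}$
3. Triangular transpose matrix multiplication (TTMM): $A^{-1} := R R^T$

Example (entries rounded to one decimal place):

$$A = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 2 & 1 \\ 1 & 1 & 3 \end{pmatrix} \;\rightarrow\; U = \begin{pmatrix} 1 & 1 & 1 \\ 0 & 1 & 0 \\ 0 & 0 & 1.4 \end{pmatrix} \;\rightarrow\; R = \begin{pmatrix} 1 & -1 & -0.7 \\ 0 & 1 & 0 \\ 0 & 0 & 0.7 \end{pmatrix} \;\rightarrow\; A^{-1} = \begin{pmatrix} 2.5 & -1 & -0.5 \\ -1 & 1 & 0 \\ -0.5 & 0 & 0.5 \end{pmatrix}$$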
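The three steps can be traced on the 3x3 example with a short script. This is only an unblocked, pure-Python sketch for illustration; the function names `chol_upper`, `trinv_upper`, `ttmm` and `spd_inverse` are made up here and are not FLAME/FLASH routines.

```python
import math

def chol_upper(A):
    """CHOL: return upper-triangular U with U^T U = A (A must be SPD)."""
    n = len(A)
    U = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Diagonal entry: remainder of A[i][i] after the rows above are accounted for.
        U[i][i] = math.sqrt(A[i][i] - sum(U[k][i] ** 2 for k in range(i)))
        for j in range(i + 1, n):
            t = A[i][j] - sum(U[k][i] * U[k][j] for k in range(i))
            U[i][j] = t / U[i][i]
    return U

def trinv_upper(U):
    """TRINV: return R = U^{-1} by back substitution, one column at a time."""
    n = len(U)
    R = [[0.0] * n for _ in range(n)]
    for j in range(n):
        R[j][j] = 1.0 / U[j][j]
        for i in range(j - 1, -1, -1):
            s = sum(U[i][k] * R[k][j] for k in range(i + 1, j + 1))
            R[i][j] = -s / U[i][i]
    return R

def ttmm(R):
    """TTMM: return the product R R^T."""
    n = len(R)
    return [[sum(R[i][k] * R[j][k] for k in range(n)) for j in range(n)]
            for i in range(n)]

def spd_inverse(A):
    """Invert an SPD matrix via CHOL, then TRINV, then TTMM."""
    return ttmm(trinv_upper(chol_upper(A)))

A = [[1.0, 1.0, 1.0], [1.0, 2.0, 1.0], [1.0, 1.0, 3.0]]
A_inv = spd_inverse(A)
for row in A_inv:
    print([round(x, 1) for x in row])
```

Up to rounding, the result agrees with the matrix $A^{-1}$ of the example.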

slide-9
SLIDE 9

Inversion of an SPD Matrix – Proof

For an SPD matrix $A \in \mathbb{R}^{n \times n}$:

1. (CHOL) $A = U^T U$
2. (TRINV) $R := U^{-1}$
3. (TTMM) $A^{-1} := R R^T$

Proof:

$$A A^{-1} \overset{(1),(3)}{=} U^T U R R^T \overset{(2)}{=} U^T U U^{-1} U^{-T} = U^T U^{-T} = I$$

slide-14
SLIDE 14

One variant of computing (CHOL)

14

Source: Paper

slide-16
SLIDE 16

First iteration (4x4 matrix blocks)

A1,1    A1,2    A1,3    A1,4
(A2,1)  A2,2    A2,3    A2,4
(A3,1)  (A3,2)  A3,3    A3,4
(A4,1)  (A4,2)  (A4,3)  A4,4

16

Source: Paper

slide-26
SLIDE 26

Dependency Graph:

Kernels:

  • CHOL: Cholesky factorization
  • TRSM: Triangular solves with multiple right-hand sides
  • SYRK: Symmetric rank-k update
  • GEMM: Matrix-matrix multiplication

Computations (4x4 matrix blocks):

First iteration:

  • CHOL0: $\mathrm{CHOL}(A_{1,1})$
  • TRSM1: $\mathrm{Inv}(A_{1,1})\, A_{1,2}$
  • TRSM2: $\mathrm{Inv}(A_{1,1})\, A_{1,3}$
  • TRSM3: $\mathrm{Inv}(A_{1,1})\, A_{1,4}$
  • SYRK4: $A_{2,2} - A_{1,2}^T A_{1,2}$
  • GEMM5: $A_{2,3} - A_{1,2}^T A_{1,3}$
  • GEMM6: $A_{2,4} - A_{1,2}^T A_{1,4}$
  • SYRK7: $A_{3,3} - A_{1,3}^T A_{1,3}$
  • GEMM8: $A_{3,4} - A_{1,3}^T A_{1,4}$
  • SYRK9: $A_{4,4} - A_{1,4}^T A_{1,4}$

Second iteration:

  • CHOL0: $\mathrm{CHOL}(A_{2,2})$
  • TRSM1: $\mathrm{Inv}(A_{2,2})\, A_{2,3}$
  • TRSM2: $\mathrm{Inv}(A_{2,2})\, A_{2,4}$
  • SYRK3: $A_{3,3} - A_{2,3}^T A_{2,3}$
  • GEMM4: $A_{3,4} - A_{2,3}^T A_{2,4}$
  • SYRK5: $A_{4,4} - A_{2,4}^T A_{2,4}$

Third iteration:

  • CHOL0: $\mathrm{CHOL}(A_{3,3})$
  • TRSM1: $\mathrm{Inv}(A_{3,3})\, A_{3,4}$
  • SYRK2: $A_{4,4} - A_{3,4}^T A_{3,4}$

Fourth iteration:

  • CHOL0: $\mathrm{CHOL}(A_{4,4})$

Source: Paper
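The per-iteration task lists can be generated mechanically. The following is a small sketch, assuming a right-looking blocked Cholesky with 1-based block indices as on the slides; `cholesky_tasks` and `labels` are hypothetical helper names, and the tuples only describe which blocks a task reads and overwrites, they do not perform any arithmetic.

```python
def cholesky_tasks(nb):
    """Enumerate the block tasks of a blocked Cholesky on an nb x nb grid.

    Returns one list per iteration; each task is (kernel, input_blocks, output_block),
    where the output block is the one the kernel overwrites.
    """
    iterations = []
    for k in range(1, nb + 1):
        tasks = [("CHOL", [], (k, k))]                        # factor the diagonal block
        for j in range(k + 1, nb + 1):
            tasks.append(("TRSM", [(k, k)], (k, j)))          # update block row k
        for i in range(k + 1, nb + 1):
            tasks.append(("SYRK", [(k, i)], (i, i)))          # update diagonal block
            for j in range(i + 1, nb + 1):
                tasks.append(("GEMM", [(k, i), (k, j)], (i, j)))  # off-diagonal update
        iterations.append(tasks)
    return iterations

def labels(tasks):
    """Number the tasks of one iteration in order, e.g. CHOL0, TRSM1, ..."""
    return [f"{kernel}{idx}" for idx, (kernel, _, _) in enumerate(tasks)]

for number, tasks in enumerate(cholesky_tasks(4), start=1):
    print(f"iteration {number}:", " ".join(labels(tasks)))
```

For a 4x4 grid this yields 10, 6, 3 and 1 tasks in the four iterations.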

slide-27
SLIDE 27

Layers of SuperMatrix

SuperMatrix

  • Out-of-order scheduling of FLASH code
  • Part of the FLASH API

FLASH API

  • Hypermatrices (matrices of matrices)

FLAME API

  • API for other numerical libraries (e.g. GotoBLAS, MKL)
  • New notation for expressing algorithms

27

slide-28
SLIDE 28

FLASH implementation of (CHOL)

Implementation: Procedural:

28

Source: Paper

slide-31
SLIDE 31

Computing SPD-1

31

Source: Paper

slide-33
SLIDE 33

Computing SPD$^{-1}$

  • CHOL: $A = U^T U$
  • TRINV: $R := U^{-1}$
  • TTMM: $A^{-1} := R R^T$

33

Source: Paper

slide-35
SLIDE 35

How to analyze dependencies

  • FLAME/FLASH API: the last operand is overwritten
  • The preceding operands are strictly inputs
  • Distinguish between three types of dependencies:
    – Flow dependencies
    – Anti-dependencies
    – Output dependencies

35

Source: Paper

slide-38
SLIDE 38

Flow dependencies

  • Read-after-write dependency

S1: A = B + C
S2: D = A + E

S1 must complete the write before S2 can read.

[Figure: flow dependencies between the blocks A00, A01, A02, A11, A12, A22; legend: intra-iteration, inter-iteration]

Source: Paper

slide-41
SLIDE 41

Anti-dependencies

  • Write-after-read dependency

S3: A = B + C
S4: C = B + E

S3 must complete the read before S4 can write.

[Figure: anti-dependencies and flow dependencies between the blocks A00, A01, A02, A11, A12, A22; legend: intra-iteration, inter-iteration]

Source: Paper

slide-42
SLIDE 42

Output dependencies

  • Write-after-write dependency
  • The last operand is also an input operand, so an output dependency can be interpreted as a flow dependency.

S5: A = B + C
S6: A = D + E

S6 must overwrite A after S5.

[Figure legend: anti-dependency, flow dependency; intra-iteration, inter-iteration]

Source: Paper
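The three dependency types can be told apart from nothing more than the operands each statement reads and the one it writes. Below is a minimal sketch of that check using the statements S1 to S6 from the slides; the `classify` helper and the `(reads, write)` tuple layout are illustrative assumptions, not the data structures SuperMatrix uses.

```python
def classify(first, second):
    """Return the kinds of dependency that force `second` to run after `first`.

    Each statement is a pair (reads, write): the set of operands it reads
    and the single operand it overwrites.
    """
    reads1, write1 = first
    reads2, write2 = second
    kinds = set()
    if write1 in reads2:
        kinds.add("flow")    # read-after-write
    if write2 in reads1:
        kinds.add("anti")    # write-after-read
    if write1 == write2:
        kinds.add("output")  # write-after-write
    return kinds

S1 = ({"B", "C"}, "A")  # A = B + C
S2 = ({"A", "E"}, "D")  # D = A + E
S3 = ({"B", "C"}, "A")  # A = B + C
S4 = ({"B", "E"}, "C")  # C = B + E
S5 = ({"B", "C"}, "A")  # A = B + C
S6 = ({"D", "E"}, "A")  # A = D + E

print(classify(S1, S2))
print(classify(S3, S4))
print(classify(S5, S6))
```

An empty result means the two statements are independent and could run in either order.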

slide-45
SLIDE 45

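SuperMatrix

Workflow of SuperMatrix

  • Call of a FLASH_* routine: enqueue a job in the DAG of pending jobs
  • Call of FLASH_Queue_exec(): start working
  • Worker pool (threads): get work; the DAG returns the available jobs

[Figure: workflow diagram of the components above; one group is labeled "Logical unit"]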
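The enqueue-then-execute workflow can be imitated with a small thread pool that only hands out jobs whose prerequisites have finished. This is a rough sketch under stated assumptions: `run_dag`, the `deps` dictionary and the no-op jobs are invented for illustration, the dependency graph must be acyclic, and nothing here reflects how FLASH_Queue_exec() is actually implemented.

```python
import queue
import threading

def run_dag(tasks, deps, num_workers=4):
    """Run `tasks` (name -> callable) so that every job starts only after
    the jobs listed in `deps` (name -> set of prerequisite names) are done.
    Returns the order in which jobs completed."""
    if not tasks:
        return []
    remaining = {name: len(deps.get(name, ())) for name in tasks}
    children = {name: [] for name in tasks}
    for name, prerequisites in deps.items():
        for p in prerequisites:
            children[p].append(name)

    ready = queue.Queue()                 # jobs whose prerequisites are all done
    for name, count in remaining.items():
        if count == 0:
            ready.put(name)

    lock = threading.Lock()
    order = []

    def worker():
        while True:
            name = ready.get()            # "get work"
            if name is None:              # sentinel: everything has finished
                return
            tasks[name]()
            with lock:
                order.append(name)
                for child in children[name]:
                    remaining[child] -= 1
                    if remaining[child] == 0:
                        ready.put(child)  # job became available
                if len(order) == len(tasks):
                    for _ in range(num_workers):
                        ready.put(None)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return order

# A few jobs from the first Cholesky iteration, with no-op bodies.
names = ["CHOL0", "TRSM1", "TRSM2", "SYRK4", "GEMM5"]
tasks = {name: (lambda: None) for name in names}
deps = {"TRSM1": {"CHOL0"}, "TRSM2": {"CHOL0"},
        "SYRK4": {"TRSM1"}, "GEMM5": {"TRSM1", "TRSM2"}}
order = run_dag(tasks, deps)
print(order)
```

The completion order may differ between runs, but every job always appears after all of its prerequisites.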

45

slide-46
SLIDE 46

Performance of SPD-inv

Architecture: eight nodes, each with two 1.5 GHz Intel Itanium2 processors. Peak performance: 96 GFLOPS.
Block size: 192
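The quoted peak can be reproduced with a back-of-the-envelope calculation, assuming each Itanium2 processor completes 4 floating-point operations per cycle (two fused multiply-add units). That per-cycle figure is an assumption added here and is not stated on the slide.

```latex
\[
8\ \text{nodes} \times 2\ \tfrac{\text{processors}}{\text{node}}
\times 1.5 \times 10^{9}\ \tfrac{\text{cycles}}{\text{s}}
\times 4\ \tfrac{\text{flops}}{\text{cycle}}
= 96 \times 10^{9}\ \tfrac{\text{flops}}{\text{s}}
= 96\ \text{GFLOPS}
\]
```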

46

Source: Paper

slide-47
SLIDE 47

Performance of (CHOL) variants

Architecture: eight nodes, each with two 1.5 GHz Intel Itanium2 processors. Peak performance: 96 GFLOPS.
Block size: 192

  • SuperMatrix: linked with serial GotoBLAS
  • FLAME & LAPACK: linked with multithreaded GotoBLAS

47

Source: Paper

slide-48
SLIDE 48

Conclusion

  • Algorithms are easy to implement
  • Less error-prone
  • Pretty good performance results
  • Detailed installation instructions are available

48

slide-49
SLIDE 49

Further material

  • Homepage of the FLAME project
    http://www.cs.utexas.edu/users/flame/
  • Overview of FLASH, FLAME and SuperMatrix:
    High Performance Computing for Computational Science: 8th International Conference, Toulouse, France, June 24-27, 2008. Revised Selected Papers, pages 228–239
    http://books.google.co.uk/books?id=us_EQmotveYC&lpg=PR3&pg=PR3#v=onepage&q&f=false

49

slide-50
SLIDE 50

Thanks for your attention! Questions?