SuperMatrix: A Multithreaded Runtime Scheduling System for Algorithms-by-Blocks
Ernie Chan, Field G. Van Zee, Robert van de Geijn, Paolo Bientinesi, Enrique S. Quintana-Ortí and Gregorio Quintana-Ortí
Software Engineering Seminar, Luc Humair
[Figure: worked Cholesky factorization example for an n × n matrix. Example matrix rows: (1 1 1), (1 2 1), (1 1 3); factor entries shown: 1 1 1 1 1.4.]
Source: Paper
A1,1 | A1,2 | A1,3 | A1,4
(A2,1) | A2,2 | A2,3 | A2,4
(A3,1) | (A3,2) | A3,3 | A3,4
(A4,1) | (A4,2) | (A4,3) | A4,4
Iteration 1:
CHOL0: CHOL(A1,1) | TRSM1: Inv(A1,1) A1,2 | TRSM2: Inv(A1,1) A1,3 | TRSM3: Inv(A1,1) A1,4
(A2,1) | SYRK4: A2,2 - A1,2^T A1,2 | GEMM5: A2,3 - A1,2^T A1,3 | GEMM6: A2,4 - A1,2^T A1,4
(A3,1) | (A3,2) | SYRK7: A3,3 - A1,3^T A1,3 | GEMM8: A3,4 - A1,3^T A1,4
(A4,1) | (A4,2) | (A4,3) | SYRK9: A4,4 - A1,4^T A1,4
CHOL: Cholesky factorization
TRSM: Triangular solves with multiple right-hand sides
SYRK: Symmetric rank-k update
GEMM: Matrix-matrix multiplication
Iteration 2:
(A1,1) | (A1,2) | (A1,3) | (A1,4)
(A2,1) | CHOL0: CHOL(A2,2) | TRSM1: Inv(A2,2) A2,3 | TRSM2: Inv(A2,2) A2,4
(A3,1) | (A3,2) | SYRK3: A3,3 - A2,3^T A2,3 | GEMM4: A3,4 - A2,3^T A2,4
(A4,1) | (A4,2) | (A4,3) | SYRK5: A4,4 - A2,4^T A2,4
Iteration 3:
(A1,1) | (A1,2) | (A1,3) | (A1,4)
(A2,1) | (A2,2) | (A2,3) | (A2,4)
(A3,1) | (A3,2) | CHOL0: CHOL(A3,3) | TRSM1: Inv(A3,3) A3,4
(A4,1) | (A4,2) | (A4,3) | SYRK2: A4,4 - A3,4^T A3,4
Iteration 4:
(A1,1) | (A1,2) | (A1,3) | (A1,4)
(A2,1) | (A2,2) | (A2,3) | (A2,4)
(A3,1) | (A3,2) | (A3,3) | (A3,4)
(A4,1) | (A4,2) | (A4,3) | CHOL0: CHOL(A4,4)
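The per-iteration task grids above follow a regular pattern: one CHOL on the diagonal block, TRSMs along the rest of that block row, then SYRK and GEMM updates on the trailing blocks. As a rough sketch, assuming a 4 × 4 grid of blocks with 1-based indices as in the slides, the task lists can be generated as follows. The function name `cholesky_tasks` and the tuple layout are hypothetical and are not taken from the paper or the FLAME library.

```python
def cholesky_tasks(nb):
    """List the tasks of each iteration of a blocked Cholesky factorization.

    Each task is (operation, (row, col)), where (row, col) is the block
    that the operation overwrites. Tasks are listed in row-major order.
    """
    steps = []
    for k in range(1, nb + 1):
        step = [("CHOL", (k, k))]
        # Triangular solves for the blocks to the right of the diagonal block.
        step += [("TRSM", (k, j)) for j in range(k + 1, nb + 1)]
        # Updates of the trailing submatrix (upper triangle only).
        for i in range(k + 1, nb + 1):
            step.append(("SYRK", (i, i)))
            step += [("GEMM", (i, j)) for j in range(i + 1, nb + 1)]
        steps.append(step)
    return steps


steps = cholesky_tasks(4)
for k, step in enumerate(steps, start=1):
    labels = [f"{op}{n}(A{i},{j})" for n, (op, (i, j)) in enumerate(step)]
    print(f"iteration {k}: " + " ".join(labels))
```

For four block rows this yields 10, 6, 3 and 1 tasks in iterations 1 to 4.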
GotoBLAS, MKL)
S1: A = B + C
S2: D = A + E
Flow dependency (read after write): S2 reads A, which S1 writes.
[Figure: dependencies between the blocks A00, A01, A02, A11, A12, A22, marked as intra-iterational or inter-iterational.]
S3: A = B + C
S4: C = B + E
Anti-dependency (write after read): S4 overwrites C, which S3 reads.
S5: A = B + C
S6: A = D + E
Output dependency (write after write): S6 overwrites A, which S5 also writes.
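The three statement pairs S1/S2, S3/S4 and S5/S6 can be told apart mechanically by comparing which operands each statement reads and writes. A minimal sketch, assuming each statement is described by a pair of sets (operands written, operands read); the function name `classify` and this representation are hypothetical.

```python
def classify(first, second):
    """Return the dependencies of `second` on `first`.

    Each statement is (writes, reads); `first` comes earlier in program order.
    """
    writes1, reads1 = first
    writes2, reads2 = second
    kinds = []
    if writes1 & reads2:
        kinds.append("flow")    # read after write
    if reads1 & writes2:
        kinds.append("anti")    # write after read
    if writes1 & writes2:
        kinds.append("output")  # write after write
    return kinds


S1 = ({"A"}, {"B", "C"})  # A = B + C
S2 = ({"D"}, {"A", "E"})  # D = A + E
S3 = ({"A"}, {"B", "C"})  # A = B + C
S4 = ({"C"}, {"B", "E"})  # C = B + E
S5 = ({"A"}, {"B", "C"})  # A = B + C
S6 = ({"A"}, {"D", "E"})  # A = D + E

print(classify(S1, S2))
print(classify(S3, S4))
print(classify(S5, S6))
```

An empty result would mean the two statements touch no common operand in a conflicting way and could run in either order.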
Call of FLASH_* routine
Call of FLASH_Queue_exec();
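The two lines above suggest a two-phase design: FLASH_* routine calls come first, and FLASH_Queue_exec() is called afterwards. The sketch below assumes that the first phase only records tasks together with the operands they read and write, and that the second phase runs them in an order that respects flow, anti- and output dependencies. The class `TaskQueue` and its methods `enqueue` and `execute` are hypothetical stand-ins and are not the FLAME/FLASH API; the sketch also runs tasks sequentially rather than on multiple threads.

```python
class TaskQueue:
    """Toy deferred-execution queue (hypothetical, single-threaded)."""

    def __init__(self):
        self.tasks = []  # entries: (name, function, reads, writes)

    def enqueue(self, name, func, reads, writes):
        # Phase 1: record the task; nothing is computed yet.
        self.tasks.append((name, func, frozenset(reads), frozenset(writes)))

    @staticmethod
    def _conflicts(earlier, later):
        _, _, reads1, writes1 = earlier
        _, _, reads2, writes2 = later
        # Flow, anti- or output dependency between the two tasks.
        return bool(writes1 & reads2 or reads1 & writes2 or writes1 & writes2)

    def execute(self):
        # Phase 2: run every task whose predecessors have all finished.
        n = len(self.tasks)
        preds = [{j for j in range(i)
                  if self._conflicts(self.tasks[j], self.tasks[i])}
                 for i in range(n)]
        done, order = set(), []
        while len(done) < n:
            ready = [i for i in range(n) if i not in done and preds[i] <= done]
            for i in ready:
                name, func, _, _ = self.tasks[i]
                func()
                order.append(name)
            done.update(ready)
        self.tasks = []
        return order


vals = {"B": 1, "C": 2, "E": 3}


def s1():
    vals["A"] = vals["B"] + vals["C"]


def s2():
    vals["D"] = vals["A"] + vals["E"]


queue = TaskQueue()
queue.enqueue("S1", s1, reads={"B", "C"}, writes={"A"})
queue.enqueue("S2", s2, reads={"A", "E"}, writes={"D"})
deferred = "A" not in vals  # True: enqueueing did not run anything
order = queue.execute()
print(order, vals["A"], vals["D"])
```

In a multithreaded setting, the tasks in each `ready` list are independent of one another and could be handed to different threads.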
Architecture: eight nodes, each with two 1.5 GHz Intel Itanium2 processors (16 processors in total).
Peak performance: 96 GFLOPS.
Block size: 192.
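The stated peak can be checked with a short calculation. The factor of 4 floating-point operations per cycle per processor is an assumption that is not given in the slides:

```latex
\[
8 \ \text{nodes} \times 2 \ \frac{\text{processors}}{\text{node}}
  \times 1.5 \ \text{GHz}
  \times 4 \ \frac{\text{flops}}{\text{cycle}}
  = 96 \ \text{GFLOPS}
\]
```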
international conference, Toulouse, France, June 24-27, 2008. Revised selected papers, pages 228-239.