CPU+GPU Load Balance Guided by Execution Time Prediction
Jean-François Dollinger, Vincent Loechner
Inria CAMUS, ICube Lab., University of Strasbourg
jean-francois.dollinger@inria.fr, vincent.loechner@inria.fr
19 January 2015
Extract scop

    #pragma scop
    for(...)
      for(...)
        for(...)
          S0(...);
    #pragma endscop

Static code generation
    Launch PPCG
    Launch PLUTO
    Build templates
    Parallelize and chunk (ppcg)

        #pragma omp parallel for
        for(t0 = lb; t0 < ub; ...)
          for(...)
            for(...)
              S0(...);

    Generated code versions: version 0, version 1, version 2

Offline profiling
    Profile kernels: ranking tables
    Profile memcpy: bandwidth table

Runtime prediction
    Kernel duration (GPU, CPU), per version
    memcpy duration
    Schedule

Application binary

    ...
    /* scop */
    call schedule(...)
    call dispatch(...)
    /* endscop */
    ...
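The "Schedule" step above takes predicted kernel durations and decides how much of the chunked loop each processing unit receives. The slides do not spell out the algorithm, so the following is only a minimal sketch of one plausible approach: split the iteration range in proportion to each unit's predicted speed. The function name `balance_chunks`, its parameters, and the assumption that the time per iteration does not depend on the chunk size are all hypothetical simplifications, not the authors' implementation.

```c
#include <stddef.h>

/* Hypothetical helper (not from the slides): split n_iter iterations of the
 * chunked outer loop among n_dev processing units so that their predicted
 * durations are roughly equal.
 * t_iter[d] is the predicted execution time per iteration on unit d (> 0).
 * On return, unit d owns the half-open range [lb[d], ub[d]). */
void balance_chunks(size_t n_iter, size_t n_dev, const double *t_iter,
                    size_t *lb, size_t *ub)
{
    /* Speed of a unit = iterations per time unit = 1 / t_iter. */
    double total_speed = 0.0;
    for (size_t d = 0; d < n_dev; d++)
        total_speed += 1.0 / t_iter[d];

    double cum_speed = 0.0;
    size_t start = 0;
    for (size_t d = 0; d < n_dev; d++) {
        cum_speed += 1.0 / t_iter[d];
        /* Cumulative share rounded to the nearest iteration; the last unit
         * always ends at n_iter so that no iteration is lost. */
        size_t end = (d + 1 == n_dev)
            ? n_iter
            : (size_t)((double)n_iter * (cum_speed / total_speed) + 0.5);
        if (end < start) end = start;
        if (end > n_iter) end = n_iter;
        lb[d] = start;
        ub[d] = end;
        start = end;
    }
}
```

Each resulting pair could then serve as the `lb` and `ub` bounds of the chunked `t0` loop on the corresponding unit. With one unit predicted to be four times slower per iteration than two others, the slow unit receives one ninth of the iterations.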
[Plot: memcpy bandwidth (MB/s, 1000 to 7000) versus transfer size (KB, logarithmic axis from 10^-3 to 10^7) on GTX 590. Series: real dev-host, real host-dev.]
[Plot: memcpy bandwidth (MB/s, 1000 to 7000) versus transfer size (KB, logarithmic axis from 10^-3 to 10^7) on GTX 680. Series: real dev-host, real host-dev.]
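The bandwidth measurements above are what a memcpy duration prediction can be built on: look up the bandwidth for the transfer size, then divide. The sketch below illustrates this under stated assumptions. The table layout (`struct bw_entry`), the function names, and the choice of linear interpolation between profiled sizes are hypothetical and not taken from the slides.

```c
#include <stddef.h>

/* One row of a hypothetical bandwidth table: bandwidth measured offline
 * (MB/s) for a given transfer size (KB). This layout is an assumption. */
struct bw_entry {
    double size_kb;
    double mb_per_s;
};

/* Bandwidth for size_kb, linearly interpolated between the two nearest rows
 * of a table sorted by increasing size, and clamped at both ends. */
double lookup_bandwidth(const struct bw_entry *tab, size_t n, double size_kb)
{
    if (size_kb <= tab[0].size_kb)
        return tab[0].mb_per_s;
    for (size_t i = 1; i < n; i++) {
        if (size_kb <= tab[i].size_kb) {
            double w = (size_kb - tab[i - 1].size_kb)
                     / (tab[i].size_kb - tab[i - 1].size_kb);
            return tab[i - 1].mb_per_s
                 + w * (tab[i].mb_per_s - tab[i - 1].mb_per_s);
        }
    }
    return tab[n - 1].mb_per_s;
}

/* Predicted transfer time in milliseconds, taking 1 MB = 1024 KB. */
double predict_memcpy_ms(const struct bw_entry *tab, size_t n, double size_kb)
{
    double size_mb = size_kb / 1024.0;
    return 1000.0 * size_mb / lookup_bandwidth(tab, n, size_kb);
}
```

Because the profiled sizes span many orders of magnitude, interpolating on the logarithm of the size may fit such a table better; linear interpolation is used here only to keep the sketch short. Separate tables would be needed for the host-to-device and device-to-host directions, and for each GPU model.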
[Plot: gemm 32x16 on GTX 590. Execution time per iteration (ns, logarithmic axis from 0.01 to 1) versus number of blocks (logarithmic axis from 10^0 to 10^5). Series: real, profiled.]
[Plot: syrk2 on GTX 590. (e)xecution time per iteration (ns, logarithmic axis from 0.01 to 1) versus (p)arameter size (128 to 1536). Series: real, e_i = p_i β + u_i.]
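The fitted curve in this plot follows the regression model e_i = p_i β + u_i, where e_i is an observed execution time per iteration, p_i the corresponding parameter value, and u_i the residual. As an illustration only, the sketch below estimates β by ordinary least squares in the simplest case, where p_i and β are scalars and there is no intercept. That scalar form, and the names `fit_beta` and `predict_time`, are assumptions made here; the slides do not state which regressors the authors use.

```c
#include <stddef.h>

/* Least-squares estimate of a scalar beta in e_i = p_i * beta + u_i,
 * i.e. the value minimizing the sum of squared residuals u_i:
 *     beta = sum(p_i * e_i) / sum(p_i * p_i)
 * Assumes at least one p_i is non-zero. */
double fit_beta(const double *p, const double *e, size_t n)
{
    double sum_pe = 0.0, sum_pp = 0.0;
    for (size_t i = 0; i < n; i++) {
        sum_pe += p[i] * e[i];
        sum_pp += p[i] * p[i];
    }
    return sum_pe / sum_pp;
}

/* Predicted execution time per iteration for a parameter value p. */
double predict_time(double beta, double p)
{
    return beta * p;
}
```

In the general form of the model, p_i would be a row vector of regressors and β a coefficient vector, which requires solving the normal equations rather than a single division.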
[Two plots: horizontal axis "step" (1 to 6), vertical axis from 0.2 to 1; vertical axis label and series names not recoverable.]
[Bar chart: speedup (axis from 5 to 30) for gemm, 2mm, 3mm, syrk, syr2k, doitgen, gesummv, mvt, gemver. Series: CPU, 1GPU, CPU+1GPU, CPU+2GPUs, CPU+3GPUs, CPU+4GPUs.]
[Bar chart: imbalance (axis from 0.2 to 1) for gemm, 2mm, 3mm, syrk, syr2k, doitgen, gesummv, mvt, gemver. Series: CPU+1GPU, CPU+2GPUs, CPU+3GPUs, CPU+4GPUs.]
[Bar chart: speedup (axis from 1 to 6) for configurations CPU, 1GPU, CPU+1GPU, CPU+2GPUs, CPU+3GPUs, CPU+4GPUs. Series: syr2k (c1) to syr2k (c9), syr2k (all).]
[Bar chart: imbalance (axis from 0.2 to 1) for configurations CPU+1GPU, CPU+2GPUs, CPU+3GPUs, CPU+4GPUs. Series: syr2k (c1) to syr2k (c9), syr2k (all).]