Portable Designs for Performance Using the Hybrid Task Graph Scheduler
Tim Blattner NIST | ITL | SSD | ISG
Disclaimer: No approval or endorsement of any commercial product by NIST is intended or implied.
2018-03-28, GPU Technology Conference
- Multicore, GPU, and cluster computing
- Modest programming effort
- Achieve 80-90% of attainable performance
- Built on abstractions and software
  - Accessible performance model

IBM "Minsky": Power8 + NVLink, 4 Tesla P100 GPUs (GTC DC 2016)
[Profile: 50000×500 multi-GPU block+panel LU decomposition. Compute time 87.67 s, 885.3 GFLOPS. The task-graph dump reports per-task compute time, wait time, and max queue size for GausElimTask, FactorLowerTask, the CUDA copy-in/copy-out tasks, MatrixMulPanelTask, and the Bookkeepers; MatrixMulPanelTask dominates at 82.6 s of compute.]

- 10+ per CPU
- GPUs
- Fine- vs coarse-grained parallelism
- OpenMP, OpenACC
- OpenCV, OpenBLAS, …
- OpenMP
- StarPU, Legion, …
- Kokkos
- Multi-GPU
- Multiple producers, multiple consumers
- Significant effort to implement
- The scalable programmer
- Experimentation for performance
[Figure: pipeline graphs. Single-graph version: read → Q12 → FFT/Disp. → Q23 → BK, with thread counts 1, ≥1, ≥1, 1. Per-GPU version (GPU0 … GPUn pipelines): read → Q12 → copier → Q23 → FFT → Q34 → BK → Q45 → Disp, all feeding a shared Q56 → CCF.]
- Modify traversal strategies
- Decomposition strategies
- GPU (NVIDIA/AMD), FPGA, …
- Manages complex data dependencies
- Maintains state of computation
- Binds a task to an NVIDIA CUDA GPU
- Creates copies of a task graph
  - Each copy bound to a specified GPU
- Attaches a memory edge to a task
  - getMemory("nameOfEdge")
  - Binds memory allocation to an address space
    - CPU, GPU, etc.
[Task graph: ReadStreamTask x1 → MADTask x40 → WriteResultTask x1, with a MM(static): DataBlock x1 memory edge (char get).]
- >300 time series images

[Pipeline: Read → FFT → BK → PCIAM → CCF, with a MemManagerFFT memory edge.]
A × B = C

Σ_{i=1..k} (A_{r,i} × B_{i,c}) = C_{r,c}

[Figure: block decomposition of A, B, and C; C_{r,c} depends on the row of blocks A_{r,1} … A_{r,k} and the column of blocks B_{1,c} … B_{k,c}.]
[Task graph (CPU): BK1 → LoadA, LoadB → BK2 → MatMul: A_{r,k} × B_{k,c} = C^k_{r,c} → BK3 → Accumulate: C^k_{r,c} += C^{k+1}_{r,c} → WriteC.]

[Task graph (GPU): BK1 → LoadA, LoadB → CopyA, CopyB → BK2 → MatMul: A_{r,k} × B_{k,c} = C^k_{r,c} → CopyC, with memory edges MemManagerA, MemManagerB, MemManagerC; then BK3 → Accumulate: C^k_{r,c} += C^{k+1}_{r,c} → WriteC.]

[Figure: A × B = C decomposed into column blocks A_{1,k} … A_{m,k}, row blocks B_{k,1} … B_{k,n}, and partial results C^{1…k}.]
[Plot: 64k matrix multiplication, HTGS vs cuBLAS-XT. Time (s) vs block size (1024-8192), for 1-4 GPUs each running HTGS and cuBLAS-XT.]
[Plot: 64k matrix multiplication, HTGS vs cuBLAS-XT. DP TFlops vs block size (1024-8192), for 1-4 GPUs each running HTGS and cuBLAS-XT.]
[Profile: 4 execution pipelines (one per GPU). Per pipeline: ReadMatrixTask(A/B) ≈0.0001 s, CudaCopyInTask(MatrixA/B) ≈44-48 s, MatrixMulBlkTask ≈60-63 s, CudaCopyOutTask(MatrixC) ≈60-63 s. Shared: MatrixAccumTask x40 5.59 s, OutputTask x1 13.53 s.]
[Profile, second run: 4 execution pipelines. Per pipeline: CudaCopyInTask(MatrixA/B) ≈14-17 s, MatrixMulBlkTask ≈30-32 s, CudaCopyOutTask(MatrixC) ≈27-30 s. Shared: MatrixAccumTask x40 2.02 s, OutputTask x1 13.20 s.]
[Comparison of the two runs: CudaCopyInTask(MatrixA) 47.52 s → 16.46 s, CudaCopyInTask(MatrixB) 46.47 s → 14.62 s, MatrixMulBlkTask 62.44 s → 32.04 s, CudaCopyOutTask(MatrixC) 62.41 s → 28.07 s.]
- Decreases with traversal of the main diagonal
- Block; block+panel
- Block; block+panel; block+panel+window
- BK1
  - Produces for factor tasks for blocks that …
  - Enables concurrent execution of Factor, …
- Keep factored blocks in GPU memory until all computation is complete
- Blocks that still require updates must be copied to/from the GPU multiple times
- MemManagerL keeps memory on the GPU until it has been used for the entire update
- MemManagerU and MemManagerUpdate release memory each iteration
  - Requires re-copying blocks to/from the GPU
- Average max queue size: ~10-20
- Copy-in-for-update max queue size: 347
[Figure: sliding window — Gauss → Factor → Update → Iterate.]
[Task graph: GaussElim → BK1 → FactorL → BK2 → BK3 → CopyInU, CopyInW, CopyInL → BK4 → FactorU and Update → CopyOut → BK5, with memory edges MML, MMU, MMW.]
- Uses a sliding window for updating panels of the …
- 1 pipeline per IP address
- CPU core affinity when processing data from an IP address
- Fixed force detection
- Pre-conditioning
- Scattering
- …
- Compute each angle in parallel
- Parallel reduction
- Use L-BFGS-B
  - Limited-memory quasi-Newton code for bound-constrained optimization
[Profile: Graph Input → Bookkeeper (FwdCTRule, BfgsRule) → FwdCTTask x40 (compute 6.85 s) and BfgsTask x1 (compute 15.13 s) → Bookkeeper (OutputRule, AccumulateRule) → AccumulateLike x10 (0.67 s) → Graph Output, with a MM(static): LikeMem x1 memory edge; wait times ≈21-24 s throughout.]
[Pipeline: load image tiles → views from image tiles (HTGS) → compute with views.]
- Well-defined inputs and outputs
- Clear critical path
- Integration into the NVIDIA profiler (?)
- Time selector + generate bar graphs
- Fill in the task functionality