SLIDE 15 — Data prefetch: Improving GPU utilization
[Figure: data-prefetch pipeline for C = A × B. Each thread block (threads 0–3 shown) keeps the current tile of A in registers and the current tile of B in shared memory. At steps t1–t3, the next tile of A is prefetched into registers and the next tile of B is loaded into shared memory before the next iteration, while computation proceeds on the current tile; the prefetched data becomes the current tile in the next iteration. Per-iteration timeline for each thread block: LD C → LD NextB → LD NextA → Compute → ST C, with a thread synchronization between iterations, so the loads for the next tile overlap with computation on the current one.]
Data prefetch — Version 3: Outer Product + Shared Mem. + Data Prefetch
- Prefetch the data needed for the next iteration: adding prefetching improves latency hiding and GPU utilization.
- Load data into shared memory in a more efficient way (separate Load and Use phases): memory transaction efficiency = 100%.
- Keep the original data-use pattern of the outer-product version.
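The loop structure behind the slide can be sketched as a CUDA kernel. This is a minimal, hedged illustration of the double-buffered prefetch pattern (registers for the next tile of A, a register staging copy for the next tile of B), not the actual TSM2 implementation; all names and the tile sizes are assumptions, and it assumes n is a multiple of TILE, B is n×K, and blockDim.x == TILE × K == 128.

```cuda
#include <cuda_runtime.h>

#define TILE 16  // rows of A/B consumed per iteration (illustrative)
#define K    8   // columns of the tall-and-skinny matrix B

// Each thread computes one row of C (n x K). A is n x n, row-major.
__global__ void gemm_prefetch_sketch(const float *A, const float *B,
                                     float *C, int n) {
    const int row = blockIdx.x * blockDim.x + threadIdx.x;
    const int tid = threadIdx.x;          // assumed blockDim.x == TILE * K

    __shared__ float currB[TILE * K];     // current tile of B in shared memory
    float currA[TILE], nextA[TILE];       // tiles of A held in registers
    float nextB;                          // register staging for next B element
    float acc[K] = {0.0f};                // accumulators for one row of C

    // Initial loads: first tile of A into registers, first tile of B to shared.
    for (int i = 0; i < TILE; ++i)
        currA[i] = (row < n) ? A[row * n + i] : 0.0f;
    currB[tid] = B[(tid / K) * K + (tid % K)];
    __syncthreads();

    for (int t = 0; t < n; t += TILE) {
        const int tn = t + TILE;          // first row of the next tile
        // 1. Prefetch the NEXT tiles into registers; these global loads
        //    overlap with the computation on the current tile below.
        if (tn < n) {
            for (int i = 0; i < TILE; ++i)
                nextA[i] = (row < n) ? A[row * n + tn + i] : 0.0f;
            nextB = B[(tn + tid / K) * K + (tid % K)];
        }
        // 2. Outer-product accumulation on the CURRENT tiles.
        for (int i = 0; i < TILE; ++i)
            for (int j = 0; j < K; ++j)
                acc[j] += currA[i] * currB[i * K + j];
        __syncthreads();                  // all threads done reading currB
        // 3. Prefetched data becomes the current tile for the next iteration.
        if (tn < n) {
            for (int i = 0; i < TILE; ++i) currA[i] = nextA[i];
            currB[tid] = nextB;
        }
        __syncthreads();                  // currB refilled before next compute
    }
    if (row < n)
        for (int j = 0; j < K; ++j)
            C[row * K + j] = acc[j];
}
```

Note the two synchronizations per iteration: the first guarantees every thread has finished reading the current tile of B before it is overwritten, the second guarantees the refill is visible before the next compute phase, matching the "Threads Sync." steps in the timeline.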
[Figure: speedup vs. matrix size n (10K–30K) for tall-and-skinny GEMM with K=8 on an Nvidia Tesla K40c, comparing cuBLAS, BLASX, TSM2-V0, TSM2-V1, and TSM2-V2.]