“Cartoon” parallel architectures; CPUs and GPUs
CSE 6230, Fall 2014 Th Sep 11
- Thanks to Jee Choi (a senior PhD student) for a big “assist”
[Figure: a "cartoon" parallel machine built up level by level — a socket contains multiple cores, and each core combines hardware multithreading (HWMT) with SIMD; on GPUs this HWMT+SIMD combination is called "SIMT".]
[Figure: die photo of an Intel "Sandy Bridge-EP" processor, a single socket of the cartoon hierarchy.]
                          Intel Xeon E5-2687W                  NVIDIA K20X                          Difference
# Cores / SMXs            8 cores                              14 SMXs                              1.75×
Clock frequency (max)     3.8 GHz                              735 MHz                              0.20×
SIMD width                256 bits                             2688 SP + 896 DP thread processors   —
Performance (single)      8 cores × 3.8 GHz × (8 Add + 8 Mul)  2688 × 735 MHz × 2 (FMA)             8.12×
                          = 486.4 GFLOP/s                      = 3951.4 GFLOP/s
Performance (double)      8 cores × 3.8 GHz × (4 Add + 4 Mul)  896 × 735 MHz × 2 (FMA)              5.42×
                          = 243.2 GFLOP/s                      = 1317.1 GFLOP/s
Memory bandwidth          51.2 GB/s                            250 GB/s                             4.88×
TDP                       150 W                                235 W                                1.57×
[Figure: mapping threads onto an array A. With 4 thread blocks of 4 threads each, threadIdx.x runs 0..3 within every block, blockIdx.x runs 0..3 across blocks, and the global ID = threadIdx.x + blockDim.x * blockIdx.x runs 0..15 across A.]
Built-in index variables available inside every kernel:
- blockIdx.x, blockIdx.y, and blockIdx.z: the block's coordinates within the grid
- blockDim.x, blockDim.y, blockDim.z: the number of threads per block in each dimension
- threadIdx.x, threadIdx.y, threadIdx.z: the thread's coordinates within its block
Variables are visible at one of three scopes, each backed by its own memory:

  thread        ->  local memory
  thread block  ->  shared memory
  grid          ->  global memory, constant memory (read-only), texture memory (read-only)
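A minimal sketch of where each memory space shows up in CUDA source (the names here are hypothetical, for illustration only):

__constant__ float coeffs[16];             /* constant memory: read-only in kernels, grid scope */

__global__ void scopesDemo (float* gData)  /* gData points into global memory */
{
    __shared__ float tile[256];            /* shared memory: one copy per thread block */
    float mine = gData[threadIdx.x];       /* ordinary locals: per-thread registers / local memory */
    tile[threadIdx.x] = mine * coeffs[0];
    __syncthreads();
    gData[threadIdx.x] = tile[threadIdx.x];
}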
__global__ void test (int* in, int* out, int N)
{
  int gId = threadIdx.x + blockDim.x * blockIdx.x;
  if (gId < N)            /* guard against out-of-range threads */
    out[gId] = in[gId];   /* example: copy input to output */
}

int main (int argc, char** argv)
{
  int N = 1048576;
  int tbSize = 256;

  /* allocate memory for host and device */
  int *h_in, *h_out, *d_in, *d_out;
  h_in = (int*) malloc (N * sizeof (int));
  h_out = (int*) malloc (N * sizeof (int));
  cudaMalloc ((void**) &d_in, N * sizeof (int));
  cudaMalloc ((void**) &d_out, N * sizeof (int));

  /* copy data from CPU to GPU */
  cudaMemcpy (d_in, h_in, N * sizeof (int), cudaMemcpyHostToDevice);

  /* launch the kernel */
  dim3 block (tbSize);
  dim3 grid (N / tbSize);
  test<<<grid, block>>> (d_in, d_out, N);
  cudaThreadSynchronize ();   /* wait for the kernel to finish */

  /* copy data back to host */
  cudaMemcpy (h_out, d_out, N * sizeof (int), cudaMemcpyDeviceToHost);

  /* free memory */
  free (h_in); free (h_out);
  cudaFree (d_in); cudaFree (d_out);
  return 0;
}
#define TILE_DIM 16   /* assumed tile size; the slide does not specify it */

__global__ void mysteryFunction (int* in)
{
    __shared__ int buffer[TILE_DIM][TILE_DIM];

    int tidx = threadIdx.x;
    int tidy = threadIdx.y;
    int gIdX = tidx + blockDim.x * blockIdx.x;
    int gIdY = tidy + blockDim.y * blockIdx.y;

    /* stage this block's tile of the input in shared memory */
    buffer[tidx][tidy] = in[gIdX + gIdY * blockDim.x * gridDim.x];
    __syncthreads();

    int temp;
    if (tidx > 0 && tidx < blockDim.x - 1 &&
        tidy > 0 && tidy < blockDim.y - 1) {
        /* 5-point average: the cell plus its four neighbors */
        temp = (buffer[tidx][tidy - 1] +
                buffer[tidx][tidy + 1] +
                buffer[tidx - 1][tidy] +
                buffer[tidx + 1][tidy] +
                buffer[tidx][tidy]) / 5;
    } else {
        /* take care of boundary conditions */
        temp = buffer[tidx][tidy];   /* e.g., leave border cells unchanged */
    }
    in[gIdX + gIdY * blockDim.x * gridDim.x] = temp;
}
Why do we need __syncthreads()? Because buffer lives in shared memory: every thread in the block must finish loading its element into the tile before any thread reads its neighbors' entries.
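A sketch of how this kernel might be launched (the image dimensions W and H are hypothetical; the slide does not give them):

/* launch with 2D thread blocks matching the tile size */
dim3 block (TILE_DIM, TILE_DIM);
dim3 grid (W / TILE_DIM, H / TILE_DIM);   /* assumes W and H are multiples of TILE_DIM */
mysteryFunction<<<grid, block>>> (d_in);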
Little's Law relates latency (W) and bandwidth (λ): the concurrency you must sustain is N = λ × W. At GPU bandwidths and memory latencies, that means tens of thousands of in-flight requests!
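A rough worked example for the K20X (the ~500 ns memory latency is an assumed ballpark figure, not from the slide):

  N = λ × W ≈ 250 GB/s × 500 ns = 125,000 bytes in flight
    ≈ 31,000 concurrent 4-byte requests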
Worked example: a pivot partition (the core step of quicksort), pivot = 5.

Input array (16 elements):  4 2 3 5 6 1 9 3 4 7 6 5 9 8 3 1

Four thread blocks with two threads each; each thread takes two consecutive elements:

                block 0     block 1     block 2     block 3
                thr0  thr1  thr0  thr1  thr0  thr1  thr0  thr1
elements        4 2   3 5   6 1   9 3   4 7   6 5   9 8   3 1
< pivot (5)     2     1     1     1     1     0     0     2
>= pivot (5)    0     1     1     1     1     2     2     0

Step 1: every thread counts how many of its elements fall below the pivot and how many do not.
Step 2: do a cumulative sum (prefix scan) on the < pivot counts and on the >= pivot counts. This should be done in shared memory, in parallel. The resulting per-block totals

  < pivot:   block 0: 3,  block 1: 2,  block 2: 1,  block 3: 2   (8 total)
  >= pivot:  block 0: 1,  block 1: 2,  block 2: 3,  block 3: 2   (8 total)

tell us how much space each thread block needs and where it must store its values.
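A minimal sketch of such a scan over per-thread counts in shared memory: a naive Hillis-Steele inclusive scan, assuming the block size is a power of two (the slide does not show the actual implementation):

__device__ void blockInclusiveScan (int* counts)  /* counts: shared array, one entry per thread */
{
    int tid = threadIdx.x;
    for (int stride = 1; stride < blockDim.x; stride *= 2) {
        int addend = 0;
        if (tid >= stride)
            addend = counts[tid - stride];
        __syncthreads();          /* everyone reads before anyone writes */
        counts[tid] += addend;
        __syncthreads();
    }
}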
Step 3: allocate a temporary array of the same size. Elements < pivot are packed from the front via a shared "start" counter; elements >= pivot are packed from the back via a shared "end" counter. Each thread block reserves its slots with an atomic fetch-and-add (FAA): it adds its < pivot total to the start counter and its >= pivot total to the end counter, then scatters its elements into the ranges it reserved.

[Figure: the temporary array with "start" and "end" pointers moving inward as successive thread blocks claim space via atomic FAA.]
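Putting the steps together, a simplified sketch of a partition kernel (one element per thread to keep it short; d_start and d_end are hypothetical global counters, and the per-block slot assignment uses shared-memory atomics where the slide's version would use the scan above):

__global__ void partition (const int* in, int* out, int N, int pivot,
                           int* d_start, int* d_end)
{
    __shared__ int cnt_lt, cnt_ge;     /* this block's per-side counts */
    __shared__ int base_lt, base_ge;   /* this block's reserved offsets */

    int gId  = threadIdx.x + blockDim.x * blockIdx.x;
    int val  = (gId < N) ? in[gId] : pivot;
    int isLt = (gId < N && val <  pivot);
    int isGe = (gId < N && val >= pivot);

    if (threadIdx.x == 0) { cnt_lt = 0; cnt_ge = 0; }
    __syncthreads();

    /* each thread grabs a unique slot among its block's elements on each side */
    int my_lt = isLt ? atomicAdd (&cnt_lt, 1) : 0;
    int my_ge = isGe ? atomicAdd (&cnt_ge, 1) : 0;
    __syncthreads();

    /* one fetch-and-add per block reserves space in the temporary array */
    if (threadIdx.x == 0) {
        base_lt = atomicAdd (d_start, cnt_lt);          /* front grows forward  */
        base_ge = atomicAdd (d_end, -cnt_ge) - cnt_ge;  /* back grows backward  */
    }
    __syncthreads();

    if (isLt) out[base_lt + my_lt] = val;
    if (isGe) out[base_ge + my_ge] = val;
}

With d_start initialized to 0 and d_end to N, out ends up with every element < pivot at the front and the rest at the back; quicksort would then recurse on the two halves.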