CUDA 6.0
Manuel Ujaldón
Associate Professor, Univ. of Malaga (Spain) Conjoint Senior Lecturer, Univ. of Newcastle (Australia) Nvidia CUDA Fellow
1
CUDA 6.0 Manuel Ujaldn Associate Professor, Univ. of Malaga (Spain) - - PowerPoint PPT Presentation
CUDA 6.0 Manuel Ujaldn Associate Professor, Univ. of Malaga (Spain) Conjoint Senior Lecturer, Univ. of Newcastle (Australia) Nvidia CUDA Fellow 1 Acknowledgements To the great Nvidia people, for sharing with me ideas, material, figures,
Associate Professor, Univ. of Malaga (Spain) Conjoint Senior Lecturer, Univ. of Newcastle (Australia) Nvidia CUDA Fellow
1
2
CUDA 6.0 overview. Optimizations for Kepler.
CUDA 6.0 announcements. New hardware features in Maxwell.
2
3 3
4
5
100.000.000 CUDA-capable GPUs 150.000 CUDA downloads 1 supercomputer 60 university courses 4.000 academic papers
500.000.000 CUDA-capable GPUs 2.100.000 CUDA downloads 52 supercomputers 780 courses 40.000 academic papers
5
6 6
7 7
8 8
9 9
Side-by-side source and disassembly view showing. New analysis passes (per SM activity level), generates a kernel analysis report.
10 10
11
12 12
13 13
14 14
15 15
16 16
17
18
Include files together to build
18
19 19
To facilitate code reuse. To reduce the compilation time.
20
source device libraries to call user-defined device callback functions.
20
21
22
22
23
int main() { float *data; setup(data); A <<< ... >>> (data); B <<< ... >>> (data); C <<< ... >>> (data); cudaDeviceSynchronize(); return 0; } __global__ void B(float *data) { do_stuff(data); X <<< ... >>> (data); Y <<< ... >>> (data); Z <<< ... >>> (data); cudaDeviceSynchronize(); do_more_stuff(data); }
main
A B C
X Y Z
23
24
24
25
Finite Pending Launch Buffer Virtualized Extended Pending Launch Buffer (PLB)
25
26 26
27
Back to Back Launches (usecs) Launch and Synchronize (usecs)
27
28
29
29
30
31
31
32 32
33 33
34
GPU-GPU latency (microseconds) Message size (bytes) Side number Total execution time (seconds)
34
35
36
DDR3 GDDR5 Main memory Video memory Dual-, tri- or quad-channel (~100 GB/s.) 256, 320, 384 bits (~300 GB/s.) PCI-express (~10 GB/s.) Kepler+ GPU
DDR3 GDDR5 Unified memory
36
37 37
38
Required Limitations GPU Operating System Windows Linux Linux on ARM Mac OSX Kepler (GK10x+) or Maxwell (GM10x+) Limited performance in CCC 3.0 and CCC 3.5 64 bits 7 or 8 WDDM & TCC no XP/Vista Kernel 2.6.18+ All CUDA-supported distros, not ARM ARM64 Not supporte ported in CUDA 6.0
38
39
Zero-Copy (pinned memory) Unified Virtual Addressing Unified Memory CUDA call Allocation fixed in Local access for PIC-e access for Other features Coherency Full support in cudaMallocHost(&A, 4); cudaMalloc(&A, 4); cudaMallocManaged(&A, 4); Main memory (DDR3) Video memory (GDDR5) Both CPU Home GPU CPU and home GPU All GPUs Other GPUs Other GPUs Avoid swapping to disk No CPU access On access CPU/GPU migration At all times Between GPUs Only at launch & sync. CUDA 2.2 CUDA 1.0 CUDA 6.0
39
40 40
41
void sortfile (FILE *fp, int N) { char *data; data = (char *) malloc)N); fread(data, 1, N, fp); qsort(data, N, 1, compare); use_data(data); free(data); } void sortfile (FILE *fp, int N) { char *data; cudaMallocManaged(&data, N); fread(data, 1, N, fp); qsort<<<...>>> (data, N, 1, compare); cudaDeviceSynchronize(); use_data(data); cudaFree(data); }
41
42
dataElem prop1 prop2 *text
“Hello, world”
dataElem prop1 prop2 *text
“Hello, world”
struct dataElem { int prop1; int prop2; char *text; }
Two addresses and two copies
42
43
dataElem prop1 prop2 *text
“Hello, world”
dataElem prop1 prop2 *text
“Hello, world”
void launch(dataElem *elem) { dataElem *g_elem; char *g_text; int textlen = strlen(elem->text); // Allocate storage for struct and text cudaMalloc(&g_elem, sizeof(dataElem)); cudaMalloc(&g_text, textlen); // Copy up each piece separately, including new “text” pointer value cudaMemcpy(g_elem, elem, sizeof(dataElem)); cudaMemcpy(g_text, elem->text, textlen); cudaMemcpy(&(g_elem->text), &g_text, sizeof(g_text)); // Finally we can launch our kernel, but // CPU and GPU use different copies of “elem” kernel<<< ... >>>(g_elem); }
Two addresses and two copies
43
44
void launch(dataElem *elem) { kernel<<< ... >>>(elem); }
dataElem prop1 prop2 *text
“Hello, world”
44
45
key value next key value next key value next key value next key value next key value next All accesses via PCI-express bus
45
46
key value next key value next key value next
46
47 47
48 48
49
50
50
51 51
52
Sep'07 Feb'10 Jul'10 Abr'11 Oct'11 Nov'11 Dic'12 Jun'13 Oct'13
52
53 53
54 54
55 55
56 56
57 57
58 58
59 59
60 60
61 61