Institute of Computational Science
Efficient CPU↔GPU data transfers: CUDA 6.0 Unified Virtual Memory
Juraj Kardoš
(University of Lugano)
July 9, 2014
Juraj Kardoš Efficient GPU data transfers July 9, 2014 1 / 40
[Figure: host CPU (Ivy Bridge EX) connected to a GPU (Tesla K40)]
//allocate memory (pageable host buffer)
w0 = (real*)malloc(szarrayb);
cudaMalloc(&w0_dev, szarrayb);
//memcopy
cudaMemcpy(w0_dev, w0, szarrayb, cudaMemcpyHostToDevice);
//kernel compute
wave13pt_d<<<...>>>(..., w0_dev, ...);
//memcopy
cudaMemcpy(w0, w0_dev, szarrayb, cudaMemcpyDeviceToHost);
//allocate memory (pinned host buffer)
cudaMallocHost(&w0, szarrayb);
cudaMalloc(&w0_dev, szarrayb);
//memcopy
cudaMemcpy(w0_dev, w0, szarrayb, cudaMemcpyHostToDevice);
//kernel compute
wave13pt_d<<<...>>>(..., w0_dev, ...);
//memcopy
cudaMemcpy(w0, w0_dev, szarrayb, cudaMemcpyDeviceToHost);
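The difference between the two allocations can be measured directly. The following is a minimal sketch (not code from the talk) that times the same host-to-device copy from a pageable `malloc` buffer and from a pinned `cudaMallocHost` buffer using CUDA events; the variable names `w0`/`w0_dev`/`szarrayb` are reused from the slides, the 64 MiB payload size is an assumption.

```cuda
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    size_t szarrayb = 64 << 20;               // 64 MiB payload (assumed)
    float *w0_pageable = (float*)malloc(szarrayb);
    float *w0_pinned, *w0_dev;
    cudaMallocHost(&w0_pinned, szarrayb);     // page-locked host memory
    cudaMalloc(&w0_dev, szarrayb);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float ms;

    // copy from pageable memory: driver stages through a pinned bounce buffer
    cudaEventRecord(start);
    cudaMemcpy(w0_dev, w0_pageable, szarrayb, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("pageable: %.2f ms\n", ms);

    // copy from pinned memory: DMA can read the buffer directly
    cudaEventRecord(start);
    cudaMemcpy(w0_dev, w0_pinned, szarrayb, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("pinned:   %.2f ms\n", ms);

    cudaFree(w0_dev);
    cudaFreeHost(w0_pinned);
    free(w0_pageable);
    return 0;
}
```

The pinned copy is typically faster because the DMA engine can read the page-locked buffer directly, while a pageable copy is first staged through a driver-owned pinned buffer.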
[Figure: CPU (Ivy Bridge EX, ~670 GFLOPS, DDR3) and GPU (Tesla K40, ~4 TFLOPS, GDDR5) linked by PCI-Express]
//allocate memory
w0 = (real*)malloc(szarrayb);
cudaMalloc(&w0_dev, szarrayb);
//memcopy
cudaMemcpy(w0_dev, w0, szarrayb, cudaMemcpyHostToDevice);
//kernel compute
wave13pt_d<<<...>>>(..., w0_dev, ...);
//memcopy
cudaMemcpy(w0, w0_dev, szarrayb, cudaMemcpyDeviceToHost);
//host function
f(w0);
//allocate memory (managed, visible to host and device)
cudaMallocManaged(&w0, szarrayb);
//kernel compute
wave13pt_d<<<...>>>(..., w0, ...);
//host function
f(w0);
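A complete, compilable version of this managed-memory pattern might look as follows. This is a sketch, not code from the talk: the `scale` kernel is a hypothetical stand-in for `wave13pt_d`. The one subtlety it highlights is that with CUDA 6.0 UVM the host must not touch managed memory while a kernel may still be running, so a `cudaDeviceSynchronize()` is required before the host function.

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void scale(float *w0, int n, float a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) w0[i] *= a;
}

int main(void) {
    int n = 1 << 20;
    size_t szarrayb = n * sizeof(float);
    float *w0;
    cudaMallocManaged(&w0, szarrayb);      // one pointer for host AND device
    for (int i = 0; i < n; i++) w0[i] = 1.0f;   // host initializes directly

    scale<<<(n + 255) / 256, 256>>>(w0, n, 2.0f);

    // CUDA 6.0 UVM: host access to w0 is illegal while the kernel may
    // still be running -- synchronize before reading the result.
    cudaDeviceSynchronize();
    printf("w0[0] = %f\n", w0[0]);

    cudaFree(w0);
    return 0;
}
```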
[Figure: without unified addressing, system memory, GPU0 memory, and GPU1 memory each have their own separate 0x0000–0xFFFF address range]
[Figure: with unified virtual addressing, system memory, GPU0 memory, and GPU1 memory share a single 0x0000–0xFFFF virtual address space]
//allocate memory on gpu0 and gpu1
cudaSetDevice(gpuid_0);
cudaMalloc(&gpu0_buf, buf_size);
cudaSetDevice(gpuid_1);
cudaMalloc(&gpu1_buf, buf_size);
//enable P2P
cudaSetDevice(gpuid_0);
cudaDeviceEnablePeerAccess(gpuid_1, 0);
cudaSetDevice(gpuid_1);
cudaDeviceEnablePeerAccess(gpuid_0, 0);
//P2P copy
cudaMemcpy(gpu0_buf, gpu1_buf, buf_size, cudaMemcpyDefault);
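In practice the setup above is usually guarded by a capability check, since peer access is only available between GPUs on compatible PCI-Express topologies. The sketch below (an addition, not code from the talk) checks with `cudaDeviceCanAccessPeer` and uses `cudaMemcpyPeer`, which copies directly over PCI-Express when peer access is enabled and otherwise falls back to staging through system memory.

```cuda
#include <cuda_runtime.h>

void p2p_copy(int gpuid_0, int gpuid_1,
              float *gpu0_buf, float *gpu1_buf, size_t buf_size) {
    int can_access = 0;
    cudaDeviceCanAccessPeer(&can_access, gpuid_0, gpuid_1);
    if (can_access) {
        // direct GPU1 -> GPU0 path over PCI-Express
        cudaSetDevice(gpuid_0);
        cudaDeviceEnablePeerAccess(gpuid_1, 0);
    }
    // works either way: direct when peer access is enabled,
    // staged through system memory when it is not
    cudaMemcpyPeer(gpu0_buf, gpuid_0, gpu1_buf, gpuid_1, buf_size);
}
```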
[Figure: two nodes, each with 32 GB DDR3 system memory and two GPUs (GPU 0, GPU 1) with 12 GB GDDR5 on PCI-Express, connected to each other through network cards (GPUDirect RDMA)]
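The topology in the figure is what CUDA-aware MPI targets: a device pointer is handed straight to MPI, which routes the transfer over P2P within a node and over GPUDirect RDMA between nodes where the hardware allows it. A minimal sketch, assuming a CUDA-aware MPI build such as OpenMPI with CUDA support (not code from the talk):

```cuda
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    size_t n = 1 << 20;
    float *buf_dev;
    cudaMalloc(&buf_dev, n * sizeof(float));

    // the DEVICE pointer goes straight into MPI -- no explicit
    // cudaMemcpy to a host staging buffer is needed
    if (rank == 0)
        MPI_Send(buf_dev, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(buf_dev, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);

    cudaFree(buf_dev);
    MPI_Finalize();
    return 0;
}
```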
1. How many PCI-E lanes can one GPU consume? Suppose you have 40 PCI-E lanes and 4 GPUs.
2. Given that UVM is slower than explicit copying, what could it still be good for?
3. What is better to use for a multi-GPU application: P2P memory transfers, GPUDirect, or CUDA-aware MPI?
1. One GPU can usually use up to ×16 lanes. With 4 GPUs sharing 40 lanes in a single system, there are only 10 lanes per GPU on average, so not every GPU can run at the full ×16 width.
2. UVM simplifies GPU porting by letting you omit explicit memory copies; once the port works, explicit transfers can be reintroduced where performance matters.
3. CUDA-aware MPI uses P2P and GPUDirect as underlying engines. Thus, CUDA-aware MPI is the most convenient choice: it covers both intra-node and inter-node transfers through a single interface.