CUDA 6.0 Unified Virtual Memory - Juraj Kardoš (University of Lugano) - PowerPoint PPT Presentation



SLIDE 1

Institute of Computational Science

Efficient CPU↔GPU data transfers
CUDA 6.0 Unified Virtual Memory

Juraj Kardoš

(University of Lugano)

July 9, 2014

SLIDE 2

Motivation

Impact of data transfers on overall application performance

SLIDE 3


When are GPU↔CPU memory transfers performed?

When transferring input/output arrays
Where else?
Loading kernel binary code (implicitly, by the driver)
Loading kernel arguments (transferred into GPU constant memory upon kernel launch, implicitly, by the driver)
Passing a return scalar value, e.g. a reduction result (remember __global__ functions are always void)
Initializing __device__ variables
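
The last two items can be made concrete with a short example. A minimal sketch (all names and sizes are illustrative, not from the slides) initializing a __device__ variable with cudaMemcpyToSymbol and reading a scalar result back with cudaMemcpyFromSymbol, since __global__ functions cannot return values:

#include <cstdio>

__device__ float scale;   // initialized from the host via cudaMemcpyToSymbol
__device__ float result;  // written by the kernel, read back by the host

__global__ void scale_and_sum(const float *in, int n) {
    // trivial single-thread reduction, just to produce a scalar result
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        float sum = 0.0f;
        for (int i = 0; i < n; ++i) sum += scale * in[i];
        result = sum;  // results leave the kernel through memory, not a return value
    }
}

int main() {
    const int n = 256;
    float h_in[n], h_result = 0.0f;
    for (int i = 0; i < n; ++i) h_in[i] = 1.0f;

    float *d_in;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    float h_scale = 2.0f;
    cudaMemcpyToSymbol(scale, &h_scale, sizeof(float));      // host -> __device__ variable

    scale_and_sum<<<1, 32>>>(d_in, n);                        // kernel arguments are copied implicitly

    cudaMemcpyFromSymbol(&h_result, result, sizeof(float));   // scalar result back to the host
    printf("result = %f\n", h_result);

    cudaFree(d_in);
    return 0;
}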


SLIDE 10

PCIe

SLIDE 11

PCI Express overview

Computer expansion bus
Point-to-point connection
Lane sharing
Single bus (x1): 500 MB/s per lane (PCIe v2)
Multiple lanes (x2, x4, x8, x16, x32): 8 GB/s for a 16-lane bus

SLIDE 12

Generations of PCI-Express

PCI Express version   Per-lane bandwidth   x16 bandwidth
1.0 (2003)            250 MB/s             4 GB/s
2.0 (2007)            500 MB/s             8 GB/s
3.0 (2010)            984 MB/s             15 GB/s
4.0 (2014-15)         1969 MB/s            31 GB/s
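
The per-lane figures follow from each generation's signaling rate and line coding (standard PCIe parameters, not stated on the slide):

5 GT/s x 8/10 (8b/10b encoding) = 4 Gbit/s = 500 MB/s per lane (PCIe 2.0)
8 GT/s x 128/130 (128b/130b encoding) ≈ 7.9 Gbit/s ≈ 984 MB/s per lane (PCIe 3.0)

A x16 link multiplies the per-lane rate by 16, e.g. 16 x 500 MB/s = 8 GB/s.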

SLIDE 13

PCI-E Bandwidth Test
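
The measured bandwidth chart on this slide is not part of the extracted text. For reference, a minimal sketch of how host-to-device bandwidth is typically measured with CUDA events (buffer size and names are illustrative):

#include <cstdio>

int main() {
    const size_t bytes = 64 << 20;        // 64 MiB test buffer
    float *h_buf, *d_buf;
    cudaMallocHost(&h_buf, bytes);        // pinned host buffer for a fair measurement
    cudaMalloc(&d_buf, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("H2D bandwidth: %.2f GB/s\n", (bytes / 1e9) / (ms / 1e3));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}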

SLIDE 14

Remember PCI-E Lanes?

SLIDE 15
SLIDE 16

Types of data transfers in CUDA

Pageable or pinned
Explicit or implicit (automatic, UVM)
Synchronous or asynchronous
Peer to peer (between GPUs of the same host)
GPUDirect (between GPU and network interface)


SLIDE 18

Pageable and pinned memory transfer

[Diagram: CPU (Ivy Bridge EX, ~670 GFLOPS) and GPU (Tesla K40, ~4 TFLOPS, 12 GB GDDR5) connected over PCI-Express; bandwidths: 288 GB/s (GPU memory), 42 GB/s (CPU memory), 8 GB/s (PCI-Express).]


SLIDE 22

Pageable and pinned memory transfer

SLIDE 23

Pageable and pinned memory transfer

//allocate memory
w0 = (real*)malloc(szarrayb);
cudaMalloc(&w0_dev, szarrayb);
//memcopy
cudaMemcpy(w0_dev, w0, szarrayb, cudaMemcpyHostToDevice);
//kernel compute
wave13pt_d<<<...>>>(..., w0_dev, ...);
//memcopy
cudaMemcpy(w0, w0_dev, szarrayb, cudaMemcpyDeviceToHost);

Listing 1: Pageable

//allocate memory
cudaMallocHost(&w0, szarrayb);
cudaMalloc(&w0_dev, szarrayb);
//memcopy
cudaMemcpy(w0_dev, w0, szarrayb, cudaMemcpyHostToDevice);
//kernel compute
wave13pt_d<<<...>>>(..., w0_dev, ...);
//memcopy
cudaMemcpy(w0, w0_dev, szarrayb, cudaMemcpyDeviceToHost);

Listing 2: Pinned

SLIDE 24

Pageable and pinned memory transfer - Summary

Pageable memory - user memory space, requires an extra mem-copy
Pinned memory - kernel memory space (page-locked)
Pinned memory performs better (higher bandwidth)
Do not over-allocate pinned memory - it reduces the amount of physical memory available to the OS
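
A minimal, self-contained sketch of allocating and releasing pinned memory (buffer size is illustrative); a pinned buffer must be released with cudaFreeHost(), not free():

int main() {
    float *h_buf;
    const size_t bytes = 64 << 20;    // 64 MiB, illustrative
    cudaMallocHost(&h_buf, bytes);    // page-locked (pinned) host allocation
    // ... use h_buf as the host side of cudaMemcpy / cudaMemcpyAsync ...
    cudaFreeHost(h_buf);              // not free()
    return 0;
}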

SLIDE 25

Types of data transfers in CUDA

Pageable or pinned
Explicit or implicit (UVM)
Synchronous or asynchronous
Peer to peer (between GPUs of the same host)
GPUDirect (between GPU and network interface)

SLIDE 26

Unified Memory

Developer view on the memory model
Still two distinct physical memories at the HW level

[Diagram: CPU (Ivy Bridge EX, ~670 GFLOPS) and GPU (Tesla K40, ~4 TFLOPS, 12 GB GDDR5) presented to the developer as a single Unified Memory.]

SLIDE 27

Unified Memory - Usage

//allocate memory
w0 = (real*)malloc(szarrayb);
cudaMalloc(&w0_dev, szarrayb);
//memcopy
cudaMemcpy(w0_dev, w0, szarrayb, cudaMemcpyHostToDevice);
//kernel compute
wave13pt_d<<<...>>>(..., w0_dev, ...);
//memcopy
cudaMemcpy(w0, w0_dev, szarrayb, cudaMemcpyDeviceToHost);
//host function
f(w0);

Listing 3: Explicit memory

//allocate memory
cudaMallocManaged(&w0, szarrayb);
//kernel compute
wave13pt_d<<<...>>>(..., w0, ...);
//host function
f(w0);

Listing 4: UVM
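
One caveat not shown in the listing: the kernel launch is asynchronous, and with CUDA 6.0 managed memory the CPU must not touch w0 while the GPU may still be using it. A minimal sketch of the safe ordering (mirroring the listing above):

//allocate memory
cudaMallocManaged(&w0, szarrayb);
//kernel compute
wave13pt_d<<<...>>>(..., w0, ...);
//wait for the GPU before the CPU dereferences managed memory
cudaDeviceSynchronize();
//host function
f(w0);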

SLIDE 28

Unified Memory - Use Case

[Diagram: CPU (Ivy Bridge EX, ~670 GFLOPS, 32 GB DDR3) and GPU (Tesla K40, ~4 TFLOPS, 12 GB GDDR5) connected via PCI-Express; bandwidths: 288 GB/s (GPU memory), 42 GB/s (CPU memory), 8 GB/s (PCI-Express).]


SLIDE 31

Unified Memory - Use Case

How does UVM perform when compared to explicit memory movements?

[Same CPU/GPU/PCI-Express diagram as on the previous Use Case slide.]

SLIDE 32

Implicit memory transfers: UVM

SLIDE 33

Implicit memory transfers: UVM

How does UVM perform in the case of multi-threading?

SLIDE 34

Implicit memory transfers: UVM

UVM implements a critical section (CS) - host threads are serialized, causing performance degradation

SLIDE 35

UVM - Summary

Simplifies the programming model, but...
Performance issue for D -> H transfers
CS (thread serialization) in multi-threaded applications

SLIDE 37

Types of data transfers in CUDA

Pageable or pinned
Explicit or implicit (automatic, UVM)
Synchronous or asynchronous
Peer to peer (between GPUs of the same host)
GPUDirect (between GPU and network interface)

SLIDE 38

Peer to peer data transfers overview

[Diagram: CPU (Ivy Bridge EX, ~670 GFLOPS, 32 GB DDR3) and two GPUs, GPU 0 and GPU 1 (each Tesla K40, ~4 TFLOPS, 12 GB GDDR5), attached to the same PCI-Express bus.]

SLIDE 39

Peer to peer data transfers - Unified Virtual Addressing

[Diagram: CPU (Ivy Bridge EX), GPU 0 and GPU 1 (Tesla K40) on PCI-Express; without UVA, system memory, GPU0 memory and GPU1 memory each span their own 0x0000-0xFFFF address range.]

SLIDE 40

Peer to peer data transfers - Unified Virtual Addressing

UVA maps all memories into a single address space

[Diagram: with UVA, system memory, GPU0 memory and GPU1 memory occupy disjoint regions of one shared 0x0000-0xFFFF address space spanning the CPU and both GPUs (Tesla K40) over PCI-Express.]

SLIDE 41

P2P Memory Transfer - Usage

//allocate memory on gpu0 and gpu1
cudaSetDevice(gpuid_0);
cudaMalloc(&gpu0_buf, buf_size);
cudaSetDevice(gpuid_1);
cudaMalloc(&gpu1_buf, buf_size);
//enable P2P
cudaSetDevice(gpuid_0);
cudaDeviceEnablePeerAccess(gpuid_1, 0);
cudaSetDevice(gpuid_1);
cudaDeviceEnablePeerAccess(gpuid_0, 0);
//P2P copy
cudaMemcpy(gpu0_buf, gpu1_buf, buf_size, cudaMemcpyDefault);

Listing 5: P2P
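
Peer access is not available between every pair of GPUs (it depends on the PCI-Express topology), so a more robust version queries the capability first; a minimal sketch (variable names as in Listing 5):

//check P2P capability in both directions
int can_0to1 = 0, can_1to0 = 0;
cudaDeviceCanAccessPeer(&can_0to1, gpuid_0, gpuid_1);
cudaDeviceCanAccessPeer(&can_1to0, gpuid_1, gpuid_0);
if (can_0to1 && can_1to0) {
    cudaSetDevice(gpuid_0);
    cudaDeviceEnablePeerAccess(gpuid_1, 0);
    cudaSetDevice(gpuid_1);
    cudaDeviceEnablePeerAccess(gpuid_0, 0);
}
//without peer access, cudaMemcpy with cudaMemcpyDefault still works,
//but the data is staged through host memory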

SLIDE 42

Peer to peer data transfers - Summary

P2P and UVA can be used to both simplify and accelerate CUDA programs
One address space for all CPU and GPU memory
Determine the physical memory location from the pointer value (see the sketch below)
Simplified library interface - cudaMemcpy()
Faster memory copies between GPUs with less host overhead
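
A minimal sketch of how a program can determine where a UVA pointer lives (helper name is illustrative; on CUDA 6.0 the relevant field is attr.memoryType, renamed to attr.type in later CUDA releases):

#include <cstdio>

//print whether a UVA pointer refers to host or device memory, and which GPU owns it
void describe_pointer(const void *ptr) {
    cudaPointerAttributes attr;
    if (cudaPointerGetAttributes(&attr, ptr) == cudaSuccess) {
        printf("device %d, memoryType %d\n", attr.device, (int)attr.memoryType);
    }
}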

SLIDE 43

Types of data transfers in CUDA

Pageable or pinned
Explicit or implicit (automatic, UVM)
Synchronous or asynchronous
Peer to peer (between GPUs of the same host)
GPUDirect (between GPU and network interface)

SLIDE 44

GPU direct overview

Eliminate CPU bandwidth and latency bottlenecks using remote direct memory access (RDMA) transfers between GPUs and other PCIe devices

[Diagram: two nodes, Node 0 and Node 1, each with a CPU (32 GB DDR3), GPU 0 and GPU 1 (12 GB GDDR5 each) and a network card on the PCI-Express bus; GPUDirect lets GPUs and network cards exchange data directly.]
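
GPUDirect RDMA is typically used through a CUDA-aware MPI library rather than programmed directly. A minimal sketch assuming such a library (e.g. a CUDA-aware Open MPI or MVAPICH2 build), where device pointers are passed straight to MPI calls:

#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;
    float *d_buf;
    cudaMalloc(&d_buf, n * sizeof(float));

    // With a CUDA-aware MPI the device pointer goes directly into the MPI call;
    // the library uses GPUDirect (or internal staging buffers) under the hood.
    if (rank == 0)
        MPI_Send(d_buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(d_buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}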

SLIDE 45

General recommendations

PCI-E is efficient only starting from a reasonably large data buffer
UVM simplifies the programming model but may result in worse performance
It's always a good idea to know when the underlying runtime routes data through an intermediate buffer (additional copying) and to avoid that (pinned memory, GPUDirect)
It's always a good idea to compute something while data is being transferred (asynchronous transfers; a sketch follows below)
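
A minimal sketch of the last point, overlapping transfers and compute with streams (chunking and names are illustrative and reuse the identifiers from the earlier listings; cudaMemcpyAsync overlaps only when the host buffer is pinned):

//split the work into chunks and give each chunk its own stream
const int nchunks = 4;
size_t chunk_bytes = szarrayb / nchunks;
cudaStream_t streams[nchunks];
for (int i = 0; i < nchunks; ++i) cudaStreamCreate(&streams[i]);

for (int i = 0; i < nchunks; ++i) {
    size_t off = i * (chunk_bytes / sizeof(real));
    cudaMemcpyAsync(w0_dev + off, w0 + off, chunk_bytes, cudaMemcpyHostToDevice, streams[i]);
    wave13pt_d<<<grid, block, 0, streams[i]>>>(/* ... chunk i ... */);
    cudaMemcpyAsync(w0 + off, w0_dev + off, chunk_bytes, cudaMemcpyDeviceToHost, streams[i]);
}
cudaDeviceSynchronize();
for (int i = 0; i < nchunks; ++i) cudaStreamDestroy(streams[i]);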

SLIDE 46

Control questions

1. How many PCI-E lanes can one GPU consume? Suppose you have 40 PCI-E lanes and 4 GPUs. How many lanes will be available per GPU if they are all transferring data simultaneously?
2. Given that UVM is slower than explicit copying, what could it still be good for?
3. What is better to use for a multi-GPU application: P2P memory transfers, GPUDirect, or CUDA-aware MPI?

SLIDE 47

Control questions: answers

1. One GPU can usually use up to 16 lanes. With 4 GPUs in a single system there will be an 8-lane link per GPU on average, i.e. 2 times fewer than with a single GPU in the system. Note this when building your GPU servers.
2. UVM simplifies GPU porting, allowing you to omit explicit memory copies during intensive development of GPU kernel code.
3. CUDA-aware MPI uses P2P and GPUDirect as its underlying engines. Thus, CUDA-aware MPI may suit MPI applications better, while single-node programs can be written more simply with CUDA P2P.
