CUDA 6.0 Unified Virtual Memory - Juraj Kardoš (University of Lugano) - PowerPoint PPT Presentation



SLIDE 1

Institute of Computational Science

Efficient CPU↔GPU data transfers
CUDA 6.0 Unified Virtual Memory

Juraj Kardoš

(University of Lugano)

July 9, 2014

SLIDE 2

Motivation

Impact of data transfers on overall application performance

SLIDE 3


When are GPU↔CPU memory transfers performed?

When transferring input/output arrays
Where else?
Loading kernel binary code (implicitly, by the driver)
Loading kernel arguments (transferred into GPU constant memory upon kernel launch, implicitly, by the driver)
Passing a return scalar value, e.g. a reduction result (remember __global__ functions are always void)
Initializing __device__ variables
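
The last two items can be made concrete with a short example. A minimal sketch (all names and sizes are illustrative, not from the slides) initializing a __device__ variable with cudaMemcpyToSymbol and reading a scalar result back with cudaMemcpyFromSymbol, since __global__ functions cannot return values:

#include <cstdio>

__device__ float scale;   // initialized from the host via cudaMemcpyToSymbol
__device__ float result;  // written by the kernel, read back by the host

__global__ void scale_and_sum(const float *in, int n) {
    // trivial single-thread reduction, just to produce a scalar result
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        float sum = 0.0f;
        for (int i = 0; i < n; ++i) sum += scale * in[i];
        result = sum;  // results leave the kernel through memory, not a return value
    }
}

int main() {
    const int n = 256;
    float h_in[n], h_result = 0.0f;
    for (int i = 0; i < n; ++i) h_in[i] = 1.0f;

    float *d_in;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    float h_scale = 2.0f;
    cudaMemcpyToSymbol(scale, &h_scale, sizeof(float));      // host -> __device__ variable

    scale_and_sum<<<1, 32>>>(d_in, n);                        // kernel arguments are copied implicitly

    cudaMemcpyFromSymbol(&h_result, result, sizeof(float));   // scalar result back to the host
    printf("result = %f\n", h_result);

    cudaFree(d_in);
    return 0;
}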


SLIDE 10

PCIe

SLIDE 11

PCI Express overview

Computer expansion bus
Point-to-point connection
Lane sharing
Single bus (x1): 500 MB/s per lane (PCIe v2)
Multiple lanes (x2, x4, x8, x16, x32): 8 GB/s for a 16-lane bus

SLIDE 12

Generations of PCI-Express

PCI Express version   Per-lane bandwidth   x16 bandwidth
1.0 (2003)            250 MB/s             4 GB/s
2.0 (2007)            500 MB/s             8 GB/s
3.0 (2010)            984 MB/s             15 GB/s
4.0 (2014-15)         1969 MB/s            31 GB/s
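
The per-lane figures follow from each generation's signaling rate and line coding (standard PCIe parameters, not stated on the slide):

5 GT/s x 8/10 (8b/10b encoding) = 4 Gbit/s = 500 MB/s per lane (PCIe 2.0)
8 GT/s x 128/130 (128b/130b encoding) ≈ 7.9 Gbit/s ≈ 984 MB/s per lane (PCIe 3.0)

A x16 link multiplies the per-lane rate by 16, e.g. 16 x 500 MB/s = 8 GB/s.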

SLIDE 13

PCI-E Bandwidth Test
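
The measured bandwidth chart on this slide is not part of the extracted text. For reference, a minimal sketch of how host-to-device bandwidth is typically measured with CUDA events (buffer size and names are illustrative):

#include <cstdio>

int main() {
    const size_t bytes = 64 << 20;        // 64 MiB test buffer
    float *h_buf, *d_buf;
    cudaMallocHost(&h_buf, bytes);        // pinned host buffer for a fair measurement
    cudaMalloc(&d_buf, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("H2D bandwidth: %.2f GB/s\n", (bytes / 1e9) / (ms / 1e3));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}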

SLIDE 14

Remember PCI-E Lanes?

SLIDE 15
SLIDE 16

Types of data transfers in CUDA

Pageable or pinned
Explicit or implicit (automatic, UVM)
Synchronous or asynchronous
Peer to peer (between GPUs of the same host)
GPUDirect (between GPU and network interface)


SLIDE 18

Pageable and pinned memory transfer

[Diagram: CPU (Ivy Bridge EX, ~670 GFLOPS) and GPU (Tesla K40, ~4 TFLOPS, 12 GB GDDR5) connected over PCI-Express; bandwidths: 288 GB/s (GPU memory), 42 GB/s (CPU memory), 8 GB/s (PCI-Express).]


SLIDE 22

Pageable and pinned memory transfer

SLIDE 23

Pageable and pinned memory transfer

//allocate memory
w0 = (real*)malloc(szarrayb);
cudaMalloc(&w0_dev, szarrayb);
//memcopy
cudaMemcpy(w0_dev, w0, szarrayb, cudaMemcpyHostToDevice);
//kernel compute
wave13pt_d<<<...>>>(..., w0_dev, ...);
//memcopy
cudaMemcpy(w0, w0_dev, szarrayb, cudaMemcpyDeviceToHost);

Listing 1: Pageable

//allocate memory
cudaMallocHost(&w0, szarrayb);
cudaMalloc(&w0_dev, szarrayb);
//memcopy
cudaMemcpy(w0_dev, w0, szarrayb, cudaMemcpyHostToDevice);
//kernel compute
wave13pt_d<<<...>>>(..., w0_dev, ...);
//memcopy
cudaMemcpy(w0, w0_dev, szarrayb, cudaMemcpyDeviceToHost);

Listing 2: Pinned

SLIDE 24

Pageable and pinned memory transfer - Summary

Pageable memory - user memory space, requires an extra mem-copy
Pinned memory - kernel memory space (page-locked)
Pinned memory performs better (higher bandwidth)
Do not over-allocate pinned memory - it reduces the amount of physical memory available to the OS
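
A minimal, self-contained sketch of allocating and releasing pinned memory (buffer size is illustrative); a pinned buffer must be released with cudaFreeHost(), not free():

int main() {
    float *h_buf;
    const size_t bytes = 64 << 20;    // 64 MiB, illustrative
    cudaMallocHost(&h_buf, bytes);    // page-locked (pinned) host allocation
    // ... use h_buf as the host side of cudaMemcpy / cudaMemcpyAsync ...
    cudaFreeHost(h_buf);              // not free()
    return 0;
}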

SLIDE 25

Types of data transfers in CUDA

Pageable or pinned
Explicit or implicit (UVM)
Synchronous or asynchronous
Peer to peer (between GPUs of the same host)
GPUDirect (between GPU and network interface)

SLIDE 26

Unified Memory

Developer view on the memory model
Still two distinct physical memories at the HW level

[Diagram: CPU (Ivy Bridge EX, ~670 GFLOPS) and GPU (Tesla K40, ~4 TFLOPS, 12 GB GDDR5) presented to the developer as a single Unified Memory.]

SLIDE 27

Unified Memory - Usage

//allocate memory
w0 = (real*)malloc(szarrayb);
cudaMalloc(&w0_dev, szarrayb);
//memcopy
cudaMemcpy(w0_dev, w0, szarrayb, cudaMemcpyHostToDevice);
//kernel compute
wave13pt_d<<<...>>>(..., w0_dev, ...);
//memcopy
cudaMemcpy(w0, w0_dev, szarrayb, cudaMemcpyDeviceToHost);
//host function
f(w0);

Listing 3: Explicit memory

//allocate memory
cudaMallocManaged(&w0, szarrayb);
//kernel compute
wave13pt_d<<<...>>>(..., w0, ...);
//host function
f(w0);

Listing 4: UVM
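
One caveat not shown in the listing: the kernel launch is asynchronous, and with CUDA 6.0 managed memory the CPU must not touch w0 while the GPU may still be using it. A minimal sketch of the safe ordering (mirroring the listing above):

//allocate memory
cudaMallocManaged(&w0, szarrayb);
//kernel compute
wave13pt_d<<<...>>>(..., w0, ...);
//wait for the GPU before the CPU dereferences managed memory
cudaDeviceSynchronize();
//host function
f(w0);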

SLIDE 28

Unified Memory - Use Case

[Diagram: CPU (Ivy Bridge EX, ~670 GFLOPS, 32 GB DDR3) and GPU (Tesla K40, ~4 TFLOPS, 12 GB GDDR5) connected via PCI-Express; bandwidths: 288 GB/s (GPU memory), 42 GB/s (CPU memory), 8 GB/s (PCI-Express).]


SLIDE 31

Unified Memory - Use Case

How does UVM perform when compared to explicit memory movements?

[Same CPU/GPU/PCI-Express diagram as on the previous Use Case slide.]

SLIDE 32

Implicit memory transfers: UVM

SLIDE 33

Implicit memory transfers: UVM

How does UVM perform in the case of multi-threading?

SLIDE 34

Implicit memory transfers: UVM

UVM implements a critical section (CS) - host threads are serialized, causing performance degradation

SLIDE 35

UVM - Summary

Simplifies the programming model, but...
Performance issue for D -> H transfers
CS (thread serialization) in multi-threaded applications

SLIDE 37

Types of data transfers in CUDA

Pageable or pinned
Explicit or implicit (automatic, UVM)
Synchronous or asynchronous
Peer to peer (between GPUs of the same host)
GPUDirect (between GPU and network interface)

SLIDE 38

Peer to peer data transfers overview

[Diagram: CPU (Ivy Bridge EX, ~670 GFLOPS, 32 GB DDR3) and two GPUs, GPU 0 and GPU 1 (each Tesla K40, ~4 TFLOPS, 12 GB GDDR5), attached to the same PCI-Express bus.]

SLIDE 39

Peer to peer data transfers - Unified Virtual Addressing

[Diagram: CPU (Ivy Bridge EX), GPU 0 and GPU 1 (Tesla K40) on PCI-Express; without UVA, system memory, GPU0 memory and GPU1 memory each span their own 0x0000-0xFFFF address range.]

SLIDE 40

Peer to peer data transfers - Unified Virtual Addressing

UVA maps all memories into a single address space

[Diagram: with UVA, system memory, GPU0 memory and GPU1 memory occupy disjoint regions of one shared 0x0000-0xFFFF address space spanning the CPU and both GPUs (Tesla K40) over PCI-Express.]

SLIDE 41

P2P Memory Transfer - Usage

//allocate memory on gpu0 and gpu1
cudaSetDevice(gpuid_0);
cudaMalloc(&gpu0_buf, buf_size);
cudaSetDevice(gpuid_1);
cudaMalloc(&gpu1_buf, buf_size);
//enable P2P
cudaSetDevice(gpuid_0);
cudaDeviceEnablePeerAccess(gpuid_1, 0);
cudaSetDevice(gpuid_1);
cudaDeviceEnablePeerAccess(gpuid_0, 0);
//P2P copy
cudaMemcpy(gpu0_buf, gpu1_buf, buf_size, cudaMemcpyDefault);

Listing 5: P2P
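
Peer access is not available between every pair of GPUs (it depends on the PCI-Express topology), so a more robust version queries the capability first; a minimal sketch (variable names as in Listing 5):

//check P2P capability in both directions
int can_0to1 = 0, can_1to0 = 0;
cudaDeviceCanAccessPeer(&can_0to1, gpuid_0, gpuid_1);
cudaDeviceCanAccessPeer(&can_1to0, gpuid_1, gpuid_0);
if (can_0to1 && can_1to0) {
    cudaSetDevice(gpuid_0);
    cudaDeviceEnablePeerAccess(gpuid_1, 0);
    cudaSetDevice(gpuid_1);
    cudaDeviceEnablePeerAccess(gpuid_0, 0);
}
//without peer access, cudaMemcpy with cudaMemcpyDefault still works,
//but the data is staged through host memory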

SLIDE 42

Peer to peer data transfers - Summary

P2P and UVA can be used to both simplify and accelerate CUDA programs
One address space for all CPU and GPU memory
Determine the physical memory location from the pointer value (see the sketch below)
Simplified library interface - cudaMemcpy()
Faster memory copies between GPUs with less host overhead
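
A minimal sketch of how a program can determine where a UVA pointer lives (helper name is illustrative; on CUDA 6.0 the relevant field is attr.memoryType, renamed to attr.type in later CUDA releases):

#include <cstdio>

//print whether a UVA pointer refers to host or device memory, and which GPU owns it
void describe_pointer(const void *ptr) {
    cudaPointerAttributes attr;
    if (cudaPointerGetAttributes(&attr, ptr) == cudaSuccess) {
        printf("device %d, memoryType %d\n", attr.device, (int)attr.memoryType);
    }
}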

SLIDE 43

Types of data transfers in CUDA

Pageable or pinned
Explicit or implicit (automatic, UVM)
Synchronous or asynchronous
Peer to peer (between GPUs of the same host)
GPUDirect (between GPU and network interface)

SLIDE 44

GPU direct overview

Eliminate CPU bandwidth and latency bottlenecks using remote direct memory access (RDMA) transfers between GPUs and other PCIe devices

[Diagram: two nodes, Node 0 and Node 1, each with a CPU (32 GB DDR3), GPU 0 and GPU 1 (12 GB GDDR5 each) and a network card on the PCI-Express bus; GPUDirect lets GPUs and network cards exchange data directly.]
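
GPUDirect RDMA is typically used through a CUDA-aware MPI library rather than programmed directly. A minimal sketch assuming such a library (e.g. a CUDA-aware Open MPI or MVAPICH2 build), where device pointers are passed straight to MPI calls:

#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;
    float *d_buf;
    cudaMalloc(&d_buf, n * sizeof(float));

    // With a CUDA-aware MPI the device pointer goes directly into the MPI call;
    // the library uses GPUDirect (or internal staging buffers) under the hood.
    if (rank == 0)
        MPI_Send(d_buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(d_buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}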

SLIDE 45

General recommendations

PCI-E is efficient only starting from a reasonably large data buffer
UVM simplifies the programming model but may result in worse performance
It's always a good idea to know when the underlying runtime routes data through an intermediate buffer (additional copying) and to avoid that (pinned memory, GPUDirect)
It's always a good idea to compute something while data is being transferred (asynchronous transfers; a sketch follows below)
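
A minimal sketch of the last point, overlapping transfers and compute with streams (chunking and names are illustrative and reuse the identifiers from the earlier listings; cudaMemcpyAsync overlaps only when the host buffer is pinned):

//split the work into chunks and give each chunk its own stream
const int nchunks = 4;
size_t chunk_bytes = szarrayb / nchunks;
cudaStream_t streams[nchunks];
for (int i = 0; i < nchunks; ++i) cudaStreamCreate(&streams[i]);

for (int i = 0; i < nchunks; ++i) {
    size_t off = i * (chunk_bytes / sizeof(real));
    cudaMemcpyAsync(w0_dev + off, w0 + off, chunk_bytes, cudaMemcpyHostToDevice, streams[i]);
    wave13pt_d<<<grid, block, 0, streams[i]>>>(/* ... chunk i ... */);
    cudaMemcpyAsync(w0 + off, w0_dev + off, chunk_bytes, cudaMemcpyDeviceToHost, streams[i]);
}
cudaDeviceSynchronize();
for (int i = 0; i < nchunks; ++i) cudaStreamDestroy(streams[i]);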

SLIDE 46

Control questions

1. How many PCI-E lanes can one GPU consume? Suppose you have 40 PCI-E lanes and 4 GPUs. How many lanes will be available per GPU if they are all transferring data simultaneously?
2. Given that UVM is slower than explicit copying, what could it still be good for?
3. What is better to use for a multi-GPU application: P2P memory transfers, GPUDirect, or CUDA-aware MPI?

SLIDE 47

Control questions: answers

1. One GPU can usually use up to 16 lanes. With 4 GPUs in a single system there will be an 8-lane link per GPU on average, i.e. 2 times fewer than with a single GPU in the system. Note this when building your GPU servers.
2. UVM simplifies GPU porting, allowing you to omit explicit memory copies during intensive development of GPU kernel code.
3. CUDA-aware MPI uses P2P and GPUDirect as its underlying engines. Thus, CUDA-aware MPI may suit MPI applications better, while single-node programs can be written more simply with CUDA P2P.
