OPENCL AT NVIDIA – BEST PRACTICES, LEARNINGS AND PLANS


SLIDE 1

Nikhil Joshi and Karthik Raghavan Ravi, May 8, 2017

OPENCL AT NVIDIA – BEST PRACTICES, LEARNINGS AND PLANS

SLIDE 2

AGENDA

  • Insights: memory management; multi-GPU/multi-queue use cases
  • Best Practices: efficient memory management; efficient data transfers
  • Plans/Updates: new development; performance improvements

SLIDE 3

INSIGHTS

SLIDE 4

ALLOCATING MEMORY

CL_MEM_ALLOC_HOST_PTR

“This flag specifies that the application wants the OpenCL implementation to allocate memory from host accessible memory”

The spec does not specify:

  • Type of host memory (pinned vs. pageable)
  • Device accessibility (assumed by default)
  • Memory placement (host vs. device)

What the OpenCL spec says (and does not)
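As a sketch of the point above: the call below requests host-accessible backing memory, and nothing more is guaranteed. The `allocation_fits` helper is an illustrative name of ours, not part of any API.

```c
#include <stdint.h>
#include <stddef.h>

/* Call-site sketch (requires CL/cl.h and an initialized cl_context):
 *
 *   cl_int err;
 *   cl_mem buf = clCreateBuffer(ctx,
 *                               CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
 *                               bytes, NULL, &err);
 *
 * The spec guarantees only host-accessible backing memory; pinned vs.
 * pageable, and host vs. device placement, are implementation-defined. */

/* Illustrative pure helper (our name): sanity-check a requested size
 * against the limit queried via
 * clGetDeviceInfo(dev, CL_DEVICE_MAX_MEM_ALLOC_SIZE, ...). */
int allocation_fits(uint64_t max_alloc_bytes, size_t requested)
{
    return (uint64_t)requested <= max_alloc_bytes;
}
```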

SLIDE 5

ALLOCATING MEMORY

Performance characteristics change based on:

  • Type of host memory (pinned vs. pageable)
  • Size of the buffer
  • Choice of API (Read/WriteBuffer vs. Map/Unmap)

Data transfer performance characteristics
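One way to observe these characteristics is to time transfers with event profiling (queue created with CL_QUEUE_PROFILING_ENABLE, timestamps from clGetEventProfilingInfo). A minimal sketch; `transfer_gbps` is an illustrative helper name of ours:

```c
#include <stdint.h>
#include <stddef.h>

/* Measurement sketch (requires CL/cl.h and a profiling-enabled queue):
 *
 *   cl_event ev;
 *   clEnqueueWriteBuffer(q, buf, CL_TRUE, 0, bytes, src, 0, NULL, &ev);
 *   cl_ulong t0, t1;
 *   clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START, sizeof t0, &t0, NULL);
 *   clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,   sizeof t1, &t1, NULL);
 */

/* Effective bandwidth in GB/s from start/end timestamps in nanoseconds. */
double transfer_gbps(uint64_t start_ns, uint64_t end_ns, size_t bytes)
{
    double secs = (double)(end_ns - start_ns) * 1e-9;
    return secs > 0.0 ? ((double)bytes / 1e9) / secs : 0.0;
}
```

Repeating the measurement across buffer sizes and memory types makes the pinned vs. pageable gap visible directly.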

SLIDE 6

MEMORY MANAGEMENT

GPU friendly

Memory placement close to GPU

Lazy

Deferred until memory is actually needed on the device

Host memory pinning

Treats pinned memory as a scarce resource; memory is not always pinned

NVIDIA Interpretation and Heuristics

SLIDE 7

FINER CONTROL OVER MEMORY MGMT

The current heuristics are optimal for common GPU-bound use cases, but not all. For example:

  • Fully async copies between host and device
  • Sparse access from kernel

A new extension under preview provides greater control over memory placement, to better optimize for each use case. Production expected in 3Q17.

New extension (tentatively “cl_nv_create_buffer”)
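Since the extension name is tentative and its entry points are not yet public, an application should first probe the device's extension string. A sketch; `has_extension` is our illustrative helper (exact-token matching avoids false positives on prefixed names):

```c
#include <string.h>
#include <stddef.h>

/* Query sketch (requires CL/cl.h):
 *
 *   char exts[4096];
 *   clGetDeviceInfo(dev, CL_DEVICE_EXTENSIONS, sizeof exts, exts, NULL);
 *   if (has_extension(exts, "cl_nv_create_buffer")) { ... }
 */

/* Exact-token search in a space-separated extension list. */
int has_extension(const char *ext_list, const char *name)
{
    size_t n = strlen(name);
    const char *p = ext_list;
    while ((p = strstr(p, name)) != NULL) {
        int starts = (p == ext_list) || (p[-1] == ' ');
        int ends   = (p[n] == '\0') || (p[n] == ' ');
        if (starts && ends)
            return 1;
        p += n;
    }
    return 0;
}
```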

SLIDE 8

MULTIGPU/MULTIQUEUE USE CASES

Current driver tuned for:

  • single-GPU, single-command-queue use cases
  • maximum performance, optimal latency

Not optimized for:

  • multi-GPU/multi-command-queue use cases
  • pageable memcpy (naïve copies, very large buffers)

SLIDE 9

COMING UP

Parts of the driver are currently being rearchitected to improve performance in these scenarios, expected in 3Q17. Existing fast paths will remain unchanged.

SLIDE 10

BEST PRACTICES

SLIDE 11

BEST PRACTICES

Provides all the functionality of clCreateBuffer and, in addition, knobs for memory placement [trading off access latency vs. data-migration cost]:

  • close to GPU
    • fast access from GPU
    • ideal for heavy access from kernels
  • close to CPU
    • saves GPU memory and data-migration cost
    • ideal for sparse access from kernels

cl_nv_create_buffer

SLIDE 12

BEST PRACTICES

Host allocation [trading off speed vs. availability]:

  • Pinned memory
    • fast, async copies between host and GPU
    • a scarce system resource
  • Pageable memory
    • easily available
    • copies not as fast as pinned

cl_nv_create_buffer

SLIDE 13

BEST PRACTICES

Choosing cl_mem_flags and cl_map_flags appropriately can avoid unnecessary data movement and improve performance:

  • CL_MEM_READ_ONLY
  • CL_MEM_WRITE_ONLY
  • CL_MAP_WRITE_INVALIDATE_REGION

cl_mem_flags & cl_map_flags
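A sketch of picking the tightest flags for a kernel's access pattern; `mem_flags_for` is our illustrative helper, and the bit values are the ones defined by the OpenCL headers:

```c
#include <stdint.h>

/* Bit values as defined in CL/cl.h; include the real header in
 * production code. */
#define MEM_READ_WRITE (1u << 0) /* CL_MEM_READ_WRITE */
#define MEM_WRITE_ONLY (1u << 1) /* CL_MEM_WRITE_ONLY */
#define MEM_READ_ONLY  (1u << 2) /* CL_MEM_READ_ONLY  */

/* Tightest cl_mem_flags for a kernel's access pattern, so the driver
 * can skip transfers the declared usage makes unnecessary. */
uint32_t mem_flags_for(int kernel_reads, int kernel_writes)
{
    if (kernel_reads && !kernel_writes) return MEM_READ_ONLY;
    if (!kernel_reads && kernel_writes) return MEM_WRITE_ONLY;
    return MEM_READ_WRITE;
}
```

Similarly, when mapping a region only to overwrite it completely, CL_MAP_WRITE_INVALIDATE_REGION tells the runtime it need not make the region's old contents visible, which can skip a device-to-host transfer.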

SLIDE 14

BEST PRACTICES

Prefer Map/Unmap over Read/Write

Map/Unmap internally uses pinned memory; pinned memcpy bandwidth is near the practical maximum (speed of light, SOL)

Read/WriteBuffer

Performance depends on the nature of the host memory:

  • Pinned memory: performance comparable to Map/Unmap
  • Pageable memory: bandwidth 30%-50% of pinned memcpy bandwidth*

*Upcoming improvements will bridge some of the gap to pinned-copy performance

Read/WriteBuffer vs Map/UnmapBuffer
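The Map/Unmap pattern above can be sketched as below. For very large pageable sources, staging through a bounded pinned buffer in chunks is a common workaround; `num_chunks` is our illustrative helper:

```c
#include <stddef.h>

/* Map/Unmap upload sketch (requires CL/cl.h and <string.h>):
 *
 *   cl_mem pinned = clCreateBuffer(ctx,
 *                                  CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
 *                                  sz, NULL, &err);
 *   void *p = clEnqueueMapBuffer(q, pinned, CL_TRUE, CL_MAP_WRITE,
 *                                0, sz, 0, NULL, NULL, &err);
 *   memcpy(p, src, sz);                         // fill at pinned-copy speed
 *   clEnqueueUnmapMemObject(q, pinned, p, 0, NULL, NULL);
 */

/* Chunks needed to stage a large transfer through a fixed-size
 * pinned buffer (ceiling division). */
size_t num_chunks(size_t total_bytes, size_t chunk_bytes)
{
    return (total_bytes + chunk_bytes - 1) / chunk_bytes;
}
```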

SLIDE 15

PLANS & UPDATES

SLIDE 16

UPDATES

New development

  • OpenCL 1.2+ preview support
  • Upcoming cl_nv_create_buffer extension

Perf improvements

  • Improvements for multi-GPU, multi-command-queue use cases
  • Better pageable memcpy

SLIDE 17

OPENCL 1.2+

OpenCL 2.0 features preview, available 378+

  • Device-side enqueues
  • Shared Virtual Memory (coarse-grained buffers)
  • Generic address spaces
  • 3D writes

NOTE

Not OpenCL 2.0 conformant; 1.2+ features are experimental and not intended for production use

SLIDE 18

QUESTIONS?

SLIDE 19

PREVIOUS TALKS

  • “Using OpenCL for Performance-Portable, Hardware-Agnostic, Cross-Platform Video Processing” by Dennis Adams (Sony Creative Software Inc.)
  • “Boosting Image Processing Performance in Adobe Photoshop with GPGPU Technology” by Joseph Hsieh (Adobe)

Focused on Applications


SLIDE 20

PREVIOUS TALKS

  • “Better Than All the Rest: Finding Max-Performance GPU Kernels Using Auto-Tuning” by Cedric Nugteren (SURFsara HPC centre)
  • “Auto-Tuning OpenCL Matrix-Multiplication: K40 versus K80” by Cedric Nugteren (SURFsara)

Focused on Kernel performance


SLIDE 21

PREVIOUS TALKS

“Performance Considerations for OpenCL on NVIDIA GPUs” by Karthik Raghavan Ravi (NVIDIA)

Driver/Runtime performance

SLIDE 22

MORE OPPORTUNITIES TO DISCUSS

H7109: Creating Efficient OpenCL Software [Connect With The Experts sessions]

  • 5/8, 1PM-2PM, Lower Level Pod B
  • 5/9, 4PM-5PM, Lower Level Pod C
