S8837 OPENCL AT NVIDIA – RECENT IMPROVEMENTS AND PLANS (PowerPoint PPT presentation) by Nikhil Joshi


SLIDE 1

Nikhil Joshi, March 26, 2018

S8837 OPENCL AT NVIDIA – RECENT IMPROVEMENTS AND PLANS

SLIDE 2

AGENDA

  • Power optimizations
  • Performance tuning
  • Data transfer
  • What’s new: cl_nv_create_buffer
  • MultiGPU improvements
  • Upcoming

SLIDE 3

POWER OPTIMIZATIONS

SLIDE 4

POWER OPTIMIZATIONS

Existing behaviour

Work-load patterns vary
  • Bursty vs. continuous work-loads

Driver heuristics
  • Designed for performance, not for power
  • Leads to higher power consumption
  • Run-to-completion always?

SLIDE 5

POWER OPTIMIZATIONS

New behaviour

Revamping heuristics to optimize for power. Key goals:
  • Default behaviour that suits wider use-cases
  • Lower CPU and GPU utilization when there is no work
  • Potentially finer-grained control for addressing specific, unusual cases

*Work in progress; production expected in Q2’18.

SLIDE 6

DATA TRANSFER

SLIDE 7

DATA TRANSFER

Perf tuning

Different perf characteristics based on:
  • Type of host memory (pinned vs. pageable)
  • Size of the buffer
  • Choice of API (Read/WriteBuffer vs. Map/Unmap)
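The sweep behind the bandwidth charts on the following slides can be sketched as a simple timing harness. This is an illustrative stand-in, not NVIDIA tooling: the `transfer` function below copies host memory where a real harness would issue a blocking clEnqueueWriteBuffer/clEnqueueReadBuffer against pinned or pageable host memory.

```python
import time

# Host-side stand-in for a blocking transfer; a real harness would call a
# blocking clEnqueueWriteBuffer/clEnqueueReadBuffer here.
def transfer(src):
    return bytes(src)

def measure_bandwidth_gbps(size_bytes, iters=10):
    """Return achieved copy bandwidth in GB/s for a given transfer size."""
    src = bytearray(size_bytes)
    transfer(src)  # warm-up, excluded from timing
    start = time.perf_counter()
    for _ in range(iters):
        transfer(src)
    elapsed = time.perf_counter() - start
    return size_bytes * iters / elapsed / 1e9

# Sweep a few sizes, as the charts on the next slides do
for size_kb in (32, 256, 1024, 4096):
    print(f"{size_kb:5d} KB: {measure_bandwidth_gbps(size_kb * 1024):.2f} GB/s")
```

The same loop, pointed at each memory type and API, produces one curve per configuration.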

SLIDE 8

DATA TRANSFER

Type of memory

Pinned / page-locked memory
  • Guaranteed to stay in physical memory; never swapped out
  • Limited by RAM size

Pageable memory
  • Typically malloc’ed memory
  • Can be swapped out
  • Not limited by RAM size, but limited by VA space

SLIDE 9

DATA TRANSFER

Pageable vs. pre-pinned

[Chart: Pageable vs. pre-pinned memcpy bandwidth – bandwidth in GB/s against transfer size in KB (32 KB to ~16 MB), comparing pageable WriteBuffer and pinned WriteBuffer]

SLIDE 10

DATA TRANSFER

Best Practices

Use pinned memory for fast async copies
  • Pinned memcpy is 2-3x faster than pageable memcpy
  • Can be truly async; pageable memcpy may not be
  • Power efficient

Use pinned memory judiciously
  • Scarce resource; overuse may affect system stability
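The "truly async" point is what enables overlapping the copy of one chunk with the compute on the previous one. A minimal step-count model, under the assumption that copying one chunk and computing on one chunk each take one time step, shows the payoff:

```python
# Toy step-count model of why truly-async (pinned) copies pay off.
# Assumption for the sketch: one chunk's copy and one chunk's compute
# each take exactly one time step.

def steps_serialized(n_chunks):
    # Pageable path: nothing overlaps ->
    # copy0, compute0, copy1, compute1, ...
    return 2 * n_chunks

def steps_overlapped(n_chunks):
    # Pinned path: the copy of chunk i+1 runs during the compute of
    # chunk i -> copy0, (copy1 | compute0), ..., compute of last chunk.
    return n_chunks + 1

for n in (1, 4, 16):
    print(n, steps_serialized(n), steps_overlapped(n))
```

For long streams the overlapped pipeline approaches half the serialized time; with equal copy and compute costs, 16 chunks take 17 steps instead of 32.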

SLIDE 11

DATA TRANSFER

Best Practices (contd.)

Prefer Map/Unmap over Read/WriteBuffer
  • Read/WriteBuffer requires host memory to be allocated and pre-pinned by the application
  • Map/Unmap internally allocates pinned memory
  • Pinned memcpy bandwidth is close to peak performance

SLIDE 12

DATA TRANSFER

Best Practices (contd.)

Avoid small (<200 KB) copies
  • Small copies achieve poor bandwidth
  • DMA setup overhead is larger than the actual copy cost
  • Cannot saturate PCIe bandwidth

Prefer larger sizes
  • Better bandwidth
  • Fewer copies
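A fixed per-transfer setup cost explains both points above. The constants below (10 us setup, 12 GB/s peak link) are assumptions for the sketch, not measured NVIDIA figures, but the shape of the result matches the chart on slide 9:

```python
# Illustrative model: effective copy bandwidth with a fixed per-transfer
# setup cost. SETUP_S and PEAK_GBPS are assumed values, not measurements.
SETUP_S = 10e-6       # assumed fixed DMA setup cost per transfer (seconds)
PEAK_GBPS = 12.0      # assumed peak link bandwidth (GB/s)

def effective_gbps(size_bytes):
    """Achieved bandwidth once the fixed setup cost is accounted for."""
    wire_time = size_bytes / (PEAK_GBPS * 1e9)
    return size_bytes / (SETUP_S + wire_time) / 1e9

print(f"{effective_gbps(64 * 1024):.1f} GB/s")          # 64 KB copy -> 4.2 GB/s
print(f"{effective_gbps(16 * 1024 * 1024):.1f} GB/s")   # 16 MB copy -> 11.9 GB/s

# Moving the same 16 MB as 256 separate 64 KB copies vs. one large copy:
one_large = SETUP_S + 16 * 1024 * 1024 / (PEAK_GBPS * 1e9)
many_small = 256 * (SETUP_S + 64 * 1024 / (PEAK_GBPS * 1e9))
print(f"{many_small / one_large:.1f}x")                 # batching is 2.8x faster
```

Under these assumptions a 64 KB copy reaches barely a third of peak, while a 16 MB copy is within 1% of it; batching the small copies into one large transfer wins by the same setup-cost arithmetic.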

SLIDE 13

MEMORY OWNERSHIP AND PLACEMENT

SLIDE 14

ALLOCATING MEMORY

What the OpenCL spec says (and does not)

CL_MEM_ALLOC_HOST_PTR

“This flag specifies that the application wants the OpenCL implementation to allocate memory from host accessible memory.”

The spec does NOT specify:
  • The type of host memory (pinned vs. pageable)
  • Memory placement (host vs. device)

SLIDE 15

ALLOCATING PINNED MEMORY

Existing way on NVIDIA

Use CL_MEM_ALLOC_HOST_PTR:

  cl_mem mem = clCreateBuffer(context, CL_MEM_ALLOC_HOST_PTR, size, NULL, NULL);
  void* host_ptr = clEnqueueMapBuffer(command_queue, mem, …);

SLIDE 16

ALLOCATING MEMORY

Existing way - limitations

Implementation-defined behavior
  • Not guaranteed to be consistent across platforms
  • Does not guarantee pinned host memory

Memory placement close to the GPU
  • Designed for performance
  • Allocations limited by GPU RAM, while CPU RAM may still be available

SLIDE 17

CL_NV_CREATE_BUFFER

New extension for memory allocation

New extension with a new set of flags to control allocation:

  cl_mem clCreateBufferNV(cl_context context,
                          cl_mem_flags flags,
                          cl_mem_flags_NV flags_NV,
                          size_t size,
                          void *host_ptr,
                          cl_int *errcode_ret);

cl_mem_flags_NV:
  • CL_MEM_PINNED_NV
  • CL_MEM_LOCATION_HOST_NV

*Available for production in Q2’18.

SLIDE 18

CL_NV_CREATE_BUFFER

Allocating pinned memory

CL_MEM_PINNED_NV
  + Guaranteed pinned host memory on mapping
  + Fast async data copies
  + Kernel access goes through GPU memory, hence faster
  - Scarce resource, subject to availability

SLIDE 19

CL_NV_CREATE_BUFFER

Allocating GPU-accessible host memory

CL_MEM_LOCATION_HOST_NV
  + Places memory close to the CPU
  + Saves GPU memory
  + Suitable for sparse kernel accesses
  - GPU access goes through host memory, hence slower

SLIDE 20

CL_NV_CREATE_BUFFER

Performance

Pinned memory performance
  • Same as existing pinned memory performance
  • Read/WriteBuffer and Map/Unmap perf at peak

SLIDE 21

CL_NV_CREATE_BUFFER

Easier to use, same performance

[Chart: Existing vs. new pre-pinned memory allocations – Read/WriteBuffer bandwidth in GB/s against transfer size in KB, comparing pinned WriteBuffer and Pinned_NV WriteBuffer]

SLIDE 22

MULTI-GPU IMPROVEMENTS

SLIDE 23

PINNED MEMORY ACCESS

MultiGPU use-cases - existing way

Mapping a buffer on a command queue
  • Gives optimal performance on that device

Mapping on one device and using on a different device
  • Works, but incurs a performance penalty

To get the best performance:
  • Need to map buffers on each device separately
  • Not suitable for multiGPU use-cases

SLIDE 24

PINNED MEMORY ACCESS

MultiGPU use-cases - new way

No need to map on each device separately. Mapping on one device and using it on another:
  • As optimal as using it on the same device

Note: still need to ensure event dependencies, as before.

*Available for production in Q2’18.

SLIDE 25

PINNED MEMORY ACCESS

Existing vs. new way

Using pinned mappings - existing way

  ptr = clEnqueueMapBuffer(cq1, buff, …)
  // Use ptr on the host
  clEnqueueWriteBuffer(cq1, buff2, …, ptr, …)
  clEnqueueUnmapMemObject(cq1, buff, ptr, …, &ev1)
  ptr2 = clEnqueueMapBuffer(cq2, buff, …, &ev1, …)
  // Use ptr2
  clEnqueueWriteBuffer(cq2, buff3, …, ptr2, …)

Using pinned mappings - new way

  ptr = clEnqueueMapBuffer(cq1, buff, …)
  // Use ptr on the host
  clEnqueueWriteBuffer(cq1, buff2, …, ptr, …)
  // No need to map on cq2; use ptr on cq2 directly
  clEnqueueWriteBuffer(cq2, buff3, …, ptr, …, &ev1, …)
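The new way drops the re-map and re-unmap, but the cross-queue ordering still has to be expressed through events: cq2 must not consume the pointer before the cq1 work that fills it has completed. A toy model (class and operation names are illustrative only, not the OpenCL runtime) makes the dependency explicit:

```python
# Toy model of the event dependency that the new way still requires.
class Event:
    def __init__(self):
        self.complete = False

class CommandQueue:
    def __init__(self, name, log):
        self.name = name
        self.log = log          # shared list recording global submission order

    def enqueue(self, op, wait_list=(), out_event=None):
        # Executed eagerly here; a real runtime defers, but may only start
        # 'op' once every event in wait_list has completed.
        assert all(ev.complete for ev in wait_list), (
            f"{op} on {self.name} ran before its dependencies completed")
        self.log.append((self.name, op))
        if out_event is not None:
            out_event.complete = True

log = []
cq1 = CommandQueue("cq1", log)
cq2 = CommandQueue("cq2", log)

ev1 = Event()
cq1.enqueue("WriteBuffer(buff2, ptr)", out_event=ev1)
# No re-map needed on cq2, but the cross-queue order is pinned down by ev1:
cq2.enqueue("WriteBuffer(buff3, ptr)", wait_list=(ev1,))
print(log)
```

Dropping the `wait_list` would leave the two queues free to race on `ptr`, which is exactly the hazard the slide's note warns about.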

SLIDE 26

SUMMARY

SLIDE 27

OPENCL PERFORMANCE

Multi-year effort

Improvements over the years:
  • Robust and more efficient CL-GL interop
  • Copy-compute overlap using Map/Unmap
  • Preview subset of OpenCL 2.0 features

*See references for previous talks on driver/runtime improvements.

SLIDE 28

OPENCL PERFORMANCE

Continued effort

Improvements this year:
  • cl_nv_create_buffer extension for explicit control of allocation attributes
  • Improvements in multiGPU use-cases w.r.t. pinned memory access

SLIDE 29

UPCOMING

SLIDE 30

UPCOMING

  • Power optimizations
  • Improve pageable memcpy performance
  • Improve multiGPU, multi-command-queue use-cases
  • Your use case? Let’s talk offline: nikhilj@nvidia.com
SLIDE 31

QUESTIONS ??

SLIDE 32

PREVIOUS TALKS

Focused on kernel performance:
  • “Better Than All the Rest: Finding Max-Performance GPU Kernels Using Auto-Tuning” by Cedric Nugteren (SURFsara HPC centre)
  • “Auto-Tuning OpenCL Matrix-Multiplication: K40 versus K80” by Cedric Nugteren (SURFsara)

SLIDE 33

PREVIOUS TALKS

Focused on applications:
  • “Using OpenCL for Performance-Portable, Hardware-Agnostic, Cross-Platform Video Processing” by Dennis Adams (Sony Creative Software Inc.)
  • “Boosting Image Processing Performance in Adobe Photoshop with GPGPU Technology” by Joseph Hsieh (Adobe)

SLIDE 34

PREVIOUS TALKS

Focused on driver/runtime performance:
  • “Performance Considerations for OpenCL on NVIDIA GPUs” by Karthik Raghavan Ravi, GTC 2016
  • “OpenCL at NVIDIA – Best Practices, Learnings and Plans” by Karthik Raghavan Ravi, GTC 2017

SLIDE 35

THANK YOU !!