S8837: OpenCL at NVIDIA - Recent Improvements and Plans
Nikhil Joshi, NVIDIA, March 26, 2018
AGENDA
- Power optimizations
- Performance tuning: data transfer
- What's new? cl_nv_create_buffer
- MultiGPU improvements
- Upcoming
POWER OPTIMIZATIONS
Existing behaviour
- Workload patterns vary: bursty vs. continuous workloads.
- Driver heuristics are designed for performance, not for power; this leads to higher power consumption. Run-to-completion always?
New behaviour
- Revamping the heuristic to optimize for power.
- Key goals: a default behaviour that suits a wider range of use-cases; lower CPU and GPU utilization when there is no work; potentially finer-grained control for addressing specific, unusual cases.
*Work in progress; production expected in Q2'18.
DATA TRANSFER
Perf tuning
Transfer performance characteristics differ based on:
- the type of host memory (pinned vs. pageable)
- the size of the buffer
- the choice of API (Read/WriteBuffer vs. Map/Unmap)
Type of memory
Pinned/page-locked memory:
- Guaranteed to stay resident in RAM; never swapped out.
- Limited by RAM size.
Pageable memory:
- Typically malloc'ed memory; can be swapped out.
- Not limited by RAM size, but limited by VA space.
[Figure: Pageable vs. Pre-pinned Memcpy Bandwidth. Bandwidth (GB/s) vs. transfer size (KB); series: Pageable WriteBuffer, Pinned WriteBuffer.]
Best Practices
Use pinned memory for fast async copies:
- Pinned memcpy is 2-3x faster than pageable memcpy.
- It can be truly async; pageable memcpy may not be.
- It is power efficient.
Use pinned memory judiciously:
- It is a scarce resource; overuse may affect system stability.
Best Practices (contd.)
Prefer Map/Unmap over Read/WriteBuffer:
- Read/WriteBuffer requires the host memory to be allocated and pre-pinned by the application to reach full speed.
- Map/Unmap internally allocates pinned memory for you.
- Pinned memcpy bandwidth is close to peak performance.
Best Practices (contd.)
Avoid small (<200 KB) copies:
- Small copies have poor bandwidth: the DMA setup overhead is larger than the actual copy cost, so they cannot saturate PCIe bandwidth.
Prefer larger sizes:
- Better bandwidth, fewer copies.
MEMORY OWNERSHIP AND PLACEMENT
ALLOCATING MEMORY
What the OpenCL spec says (and does not)
CL_MEM_ALLOC_HOST_PTR: "This flag specifies that the application wants the OpenCL implementation to allocate memory from host accessible memory."
The spec does NOT specify:
- the type of host memory (pinned vs. pageable)
- the memory placement (host vs. device)
ALLOCATING PINNED MEMORY
Existing way on NVIDIA: use CL_MEM_ALLOC_HOST_PTR.

cl_mem mem = clCreateBuffer(context, CL_MEM_ALLOC_HOST_PTR, size, NULL, NULL);
void *host_ptr = clEnqueueMapBuffer(command_queue, mem, CL_TRUE,
                                    CL_MAP_READ | CL_MAP_WRITE,
                                    0, size, 0, NULL, NULL, NULL);
Existing way: limitations
- Implementation-defined behavior; not guaranteed to be consistent across platforms.
- Does not guarantee pinned host memory.
- On NVIDIA, memory is placed close to the GPU (designed for performance), so allocations are limited by GPU RAM even while CPU RAM is still available.
CL_NV_CREATE_BUFFER
New extension for memory allocation, with a new set of flags to control allocation attributes:

cl_mem clCreateBufferNV(cl_context context,
                        cl_mem_flags flags,
                        cl_mem_flags_NV flags_NV,
                        size_t size,
                        void *host_ptr,
                        cl_int *errcode_ret);

cl_mem_flags_NV values:
- CL_MEM_PINNED_NV
- CL_MEM_LOCATION_HOST_NV

*Available for production in Q2'18.
CL_MEM_PINNED_NV: allocating pinned memory
+ Guaranteed pinned host memory on mapping.
+ Fast async data copies.
+ Kernel access goes through GPU memory, hence faster.
- Scarce resource; subject to availability.
CL_MEM_LOCATION_HOST_NV: allocating GPU-accessible host memory
+ Places memory close to the CPU.
+ Saves GPU memory.
+ Suitable for sparse kernel accesses.
- GPU access goes through host memory, hence slower.
Performance
Pinned-memory performance is the same as with the existing pinned-memory path: Read/WriteBuffer and Map/Unmap both perform at peak.
[Figure: Existing vs. new pre-pinned memory allocations, Read/WriteBuffer bandwidth vs. transfer size (KB); series: Pinned WriteBuffer, Pinned_NV WriteBuffer. Easier to use, same performance.]
MULTI-GPU IMPROVEMENTS
PINNED MEMORY ACCESS
MultiGPU use-cases: existing way
- Mapping a buffer on a command queue gives optimal performance on that device.
- Mapping on one device and using the mapping on a different device works, but incurs a performance penalty.
- To get the best performance, buffers need to be mapped on each device separately, which is not suitable for multiGPU use-cases.
MultiGPU use-cases: new way
- No need to map on each device separately: mapping on one device and using the mapping on another is as optimal as using it on the same device.
- Note: event dependencies still need to be ensured, as before.
*Available for production in Q2'18.
Existing vs. new way

Using pinned mappings, existing way:

ptr = clEnqueueMapBuffer(cq1, buff, ...);           // map on cq1
// use ptr on the host
clEnqueueWriteBuffer(cq1, buff2, ..., ptr, ...);
clEnqueueUnmapMemObject(cq1, buff, ptr, 0, NULL, &ev1);
ptr2 = clEnqueueMapBuffer(cq2, buff, ..., &ev1, ...); // re-map on cq2
// use ptr2
clEnqueueWriteBuffer(cq2, buff3, ..., ptr2, ...);

Using pinned mappings, new way:

ptr = clEnqueueMapBuffer(cq1, buff, ...);           // map once, on cq1
// use ptr on the host
clEnqueueWriteBuffer(cq1, buff2, ..., ptr, ...);
// no need to map on cq2; use the same ptr there
clEnqueueWriteBuffer(cq2, buff3, ..., ptr, ..., &ev1, ...);
SUMMARY
OPENCL PERFORMANCE
A multi-year effort: improvements over the years
- Robust and more efficient CL-GL interop
- Copy-compute overlap using Map/Unmap
- Preview subset of OpenCL 2.0 features
*See references for previous talks on driver/runtime improvements.
Continued effort: improvements this year
- cl_nv_create_buffer extension for explicit control of allocation attributes
- Improvements in multiGPU use-cases with respect to pinned memory access
UPCOMING
- Power optimizations
- Improve pageable memcpy performance
- Improve multiGPU, multi-command-queue use-cases
- Your use-case? Let's talk offline: nikhilj@nvidia.com
QUESTIONS?
PREVIOUS TALKS
Focused on kernel performance:
- "Better Than All the Rest: Finding Max-Performance GPU Kernels Using Auto-Tuning", Cedric Nugteren (SURFsara HPC centre)
- "Auto-Tuning OpenCL Matrix-Multiplication: K40 versus K80", Cedric Nugteren (SURFsara)
Focused on applications:
- "Using OpenCL for Performance-Portable, Hardware-Agnostic, Cross-Platform Video Processing", Dennis Adams (Sony Creative Software Inc.)
- "Boosting Image Processing Performance in Adobe Photoshop with GPGPU Technology", Joseph Hsieh (Adobe)
Focused on driver/runtime performance:
- "Performance Considerations for OpenCL on NVIDIA GPUs", Karthik Raghavan Ravi, GTC 2016
- "OpenCL at NVIDIA - Best Practices, Learnings and Plans", Karthik Raghavan Ravi, GTC 2017