Exploit GPU-Specific Features Still at a High Level Seyong Lee and - PowerPoint PPT Presentation

Extended OpenACC Programming to Exploit GPU-Specific Features Still at a High Level Seyong Lee and Jeffrey S. Vetter Future Technologies Group Oak Ridge National Laboratory http://ft.ornl.gov

Outline
• Issues in OpenACC and Other Directive-Based GPU Programming Models
• OpenACCe: Extended OpenACC to Support Architecture-Specific Features at High-Level
  – Extension to Better Support Unified Memory
  – Extension to Support Architecture-Specific Features
• Implementation and Evaluation
• Summary
http://ft.ornl.gov/research/openarc GTC15

Motivation
• Scalable Heterogeneous Computing (SHC)
  – Enabled by graphics processors (e.g., NVIDIA CUDA, AMD APU), Intel Xeon Phi, or other non-traditional devices.
  – Emerging solution to respond to the constraints of energy, density, and device technology trends.
  – However, the complexity of SHC systems causes portability and productivity issues.

What is OpenACC?
• Directive-based API standard for programming accelerators
  – Consists of compiler directives, library routines, and environment variables.
  – Provides a high-level abstraction over architectural details and low-level programming complexities.
• Allows parallel programmers to provide hints, known as “directives”, to the compiler, identifying which areas of code to accelerate, without requiring programmers to modify or adapt the underlying code itself.
  – Aimed at incremental development of accelerator code.

Issues in OpenACC and Other Directive-Based Accelerator Programming Models
• Too much abstraction puts significant burdens on performance tuning, debugging, scaling, etc.
• We need in-depth evaluation and research on directive-based heterogeneous programming to address the two conflicting goals in SHC systems: productivity and performance.

OpenACCe: Extended OpenACC to Better Support Architecture-Specific Features
• OpenACC Extension to Better Support Unified Memory
• OpenACC Extension to Support Accelerator-Specific Features

OpenACC Extension to Better Support Unified Memory
• Problem
  – Explicit GPU memory management in OpenACC (and other directive-based GPU programming models) can still be complex and error-prone.
[Figure: normalized total execution time and normalized total transferred data size, log scale from 1 to 100000. Execution time and transferred data size with the OpenACC default memory management scheme, normalized to those of the fully optimized OpenACC version.]

OpenACC Extension to Better Support Unified Memory (2)
• Problem
  – Unified memory (NVIDIA CUDA 6 or AMD APUs) can simplify the complex and error-prone memory management in OpenACC.
  – However, the current OpenACC model will work well on unified memory only if the whole memory is shared by default.
  – Performance tradeoffs in existing unified memory systems call for fine-grained control over the use of unified memory.

OpenACC Extension to Better Support Unified Memory (3)
• Proposed Solution
  – Extend OpenACC with new library routines to explicitly manage unified memory:
    • Work on both separate memory systems and unified memory systems.
    • Allow hybrid OpenACC programming that selectively combines separate memory and unified memory.

Augmented OpenACC Runtime Routines to Support Unified Memory
Runtime routine: description
• acc_create_unified(pointer, size): Allocate unified memory if supported; otherwise, allocate CPU memory using malloc().
• acc_pcreate_unified(pointer, size): Same as acc_create_unified() if the input is not present on the unified memory; otherwise, do nothing.
• acc_copyin_unified(pointer, size): Allocate unified memory and copy data from the input pointer if supported; otherwise, allocate CPU memory and copy data from the input pointer.
• acc_pcopyin_unified(pointer, size): Same as acc_copyin_unified() if the input data is not present on the unified memory; otherwise, do nothing.
• acc_delete_unified(pointer, size): Deallocate memory, which can be either unified memory or CPU memory.
• Existing runtime routines and internal routines used for data clauses: Check whether the input data is on the unified memory; if not, perform the intended operations.
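The fallback behavior described in the table above (no unified memory support, so plain CPU memory is used) can be sketched on the host alone. This is a minimal illustration under stated assumptions, not the OpenARC implementation: every mock_* name, the return types, the way the pointer argument is used, and the fixed-size presence registry are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical host-only stand-ins for the routines in the table.
 * They model only the documented fallback path (CPU memory via malloc).
 * Names, signatures, and the registry are assumptions for illustration. */
#define MOCK_MAX_REGIONS 64

static void *mock_regions[MOCK_MAX_REGIONS]; /* pointers handed out so far */

/* Return 1 if ptr was allocated by one of the mock routines. */
int mock_is_present(const void *ptr) {
    if (ptr == NULL) return 0;
    for (int i = 0; i < MOCK_MAX_REGIONS; i++)
        if (mock_regions[i] == ptr) return 1;
    return 0;
}

/* Fallback of acc_create_unified: allocate CPU memory with malloc(). */
void *mock_create_unified(void *ptr, size_t size) {
    (void)ptr; /* assumption: the input pointer is not needed to allocate */
    void *p = malloc(size);
    if (p == NULL) return NULL;
    for (int i = 0; i < MOCK_MAX_REGIONS; i++) {
        if (mock_regions[i] == NULL) {
            mock_regions[i] = p; /* remember the region for present checks */
            return p;
        }
    }
    free(p); /* registry full: report failure */
    return NULL;
}

/* Fallback of acc_pcreate_unified: do nothing if already present. */
void *mock_pcreate_unified(void *ptr, size_t size) {
    return mock_is_present(ptr) ? ptr : mock_create_unified(ptr, size);
}

/* Fallback of acc_copyin_unified: allocate, then copy from the input. */
void *mock_copyin_unified(void *src, size_t size) {
    assert(src != NULL);
    void *p = mock_create_unified(NULL, size);
    if (p != NULL) memcpy(p, src, size);
    return p;
}

/* Fallback of acc_pcopyin_unified: do nothing if already present. */
void *mock_pcopyin_unified(void *src, size_t size) {
    return mock_is_present(src) ? src : mock_copyin_unified(src, size);
}

/* Fallback of acc_delete_unified: works for registered regions and for
 * ordinary malloc'ed memory alike. */
void mock_delete_unified(void *ptr, size_t size) {
    (void)size;
    if (ptr == NULL) return;
    for (int i = 0; i < MOCK_MAX_REGIONS; i++)
        if (mock_regions[i] == ptr) mock_regions[i] = NULL;
    free(ptr);
}
```

In this sketch the "present" variants return the pointer they were given when it is already registered, which mirrors the "otherwise, do nothing" wording of the table; a real runtime would query the device or driver instead of a local array.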

Hybrid Example to Selectively Combine both Separate and Unified Memories

    float (*a)[N2]= (float(*)[N2]) malloc(..);
    float (*b)[N2]= (float(*)[N2]) acc_create_unified(..);
    ...
    #pragma acc data copy(b), create(a)
    for (k = 0; k < ITER; k++) {
    #pragma acc kernels loop independent
        ...//kernel-loop1
    } //end of k-loop
    acc_delete_unified(a,...);
    acc_delete_unified(b,...);

OpenACC Extension to Support Accelerator-Specific Features
• Problem
  – The high-level abstraction in OpenACC does not give users control over compiler-specific or architecture-specific features, incurring a noticeable performance gap between OpenACC and low-level programming models (e.g., CUDA and OpenCL).
Normalized execution time over manual CUDA:
• OpenACC translated by OpenARC: 31
• OpenACC translated by PGI: 12.8
Performance of the Rodinia LUD benchmark on an NVIDIA Tesla M2090

OpenACC Extension to Support Accelerator-Specific Features (2)
• Proposed Solution
  – Extend OpenACC with new, device-aware directives.
  – Enable advanced interactions between users and compilers, still at a high level.
    • Allow users high-level control over compiler translations.
    • Most extensions are optional; they preserve portability.
    • Can be used to understand/debug internal translation processes.

Device-Aware OpenACC Extension
• Directive Extension for Device-Specific Memory Architectures
  #pragma openarc cuda [list of clauses]
  where each clause is one of the following: global, constant, noconstant, texture, notexture, sharedRO, sharedRW, noshared, registerRO, registerRW, noregister

Device-Aware OpenACC Extension (2)
• Multi-Dimensional Work-Sharing Loop Mapping
  – Nested work-sharing loops of the same type are allowed if tightly nested, and the OpenACC compiler applies static mapping of the tightly nested work-sharing loops.
• Fine-Grained Synchronization
  – Add a new barrier directive (#pragma acc barrier) for local synchronizations (among workers in the same gang or vectors in the same worker).

OpenACCe Example

    #pragma acc kernels loop gang(N/BSIZE) copy(C) copyin(A, B)
    #pragma openarc cuda sharedRW(As, Bs)
    for(by = 0; by < (N/BSIZE); by++) { //by is mapped to blockIdx.y
    #pragma acc loop gang(N/BSIZE)
      for(bx = 0; bx < (N/BSIZE); bx++) { //bx is mapped to blockIdx.x
        float As[BSIZE][BSIZE];
        float Bs[BSIZE][BSIZE];
    #pragma acc loop worker(BSIZE)
        for(ty = 0; ty<BSIZE; ty++) { //ty is mapped to threadIdx.y
    #pragma acc loop worker(BSIZE)
          for(tx = 0; tx<BSIZE; tx++) { //tx is mapped to threadIdx.x
            … //computation part1
    #pragma acc barrier
            … //computation part2
          }
        } //end of the nested worker loops
      }
    } //end of the nested gang loops
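The example above elides both computation parts. As a hedged illustration of why a barrier sits between them, the sequential C sketch below assumes they are the tile-load and tile-multiply phases of a tiled matrix multiplication; that reading, the function name tiled_matmul, the extra tile loop t, the accumulator array, and the sizes N and BSIZE are assumptions, not taken from the slides.

```c
#include <assert.h>
#include <stddef.h>

#define N 8      /* matrix dimension (assumed; a multiple of BSIZE) */
#define BSIZE 4  /* tile width, matching the worker(BSIZE) clauses */

/* Sequential reference: C = A * B, computed tile by tile.
 * The by/bx loops play the role of the gang loops and the ty/tx loops
 * the role of the worker loops in the OpenACCe example. */
void tiled_matmul(float A[N][N], float B[N][N], float C[N][N]) {
    for (int by = 0; by < N / BSIZE; by++) {
        for (int bx = 0; bx < N / BSIZE; bx++) {
            float As[BSIZE][BSIZE];              /* tile of A (sharedRW) */
            float Bs[BSIZE][BSIZE];              /* tile of B (sharedRW) */
            float acc[BSIZE][BSIZE] = {{0.0f}};  /* per-worker partial sums */

            for (int t = 0; t < N / BSIZE; t++) {
                /* Assumed part 1: each (ty, tx) loads one element per tile. */
                for (int ty = 0; ty < BSIZE; ty++)
                    for (int tx = 0; tx < BSIZE; tx++) {
                        As[ty][tx] = A[by * BSIZE + ty][t * BSIZE + tx];
                        Bs[ty][tx] = B[t * BSIZE + ty][bx * BSIZE + tx];
                    }

                /* A barrier is needed here on the GPU: part 2 reads tile
                 * elements written by other workers. Sequentially, ending
                 * the load loops before the compute loops has that effect. */

                /* Assumed part 2: each (ty, tx) uses the whole tiles. */
                for (int ty = 0; ty < BSIZE; ty++)
                    for (int tx = 0; tx < BSIZE; tx++)
                        for (int k = 0; k < BSIZE; k++)
                            acc[ty][tx] += As[ty][k] * Bs[k][tx];
            }

            /* Write the finished tile back to C. */
            for (int ty = 0; ty < BSIZE; ty++)
                for (int tx = 0; tx < BSIZE; tx++)
                    C[by * BSIZE + ty][bx * BSIZE + tx] = acc[ty][tx];
        }
    }
}
```

Each (by, bx, ty, tx) combination owns exactly one element of C, which is the kind of static two-dimensional mapping the tightly nested gang and worker loops express. On a GPU, a tile loop like t would also need synchronization before the next load overwrites As and Bs.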

Implementation
• The proposed OpenACC extensions are fully implemented in the Open Accelerator Research Compiler (OpenARC).
• OpenARC: Open-Sourced, High-Level Intermediate Representation (HIR)-Based, Extensible Compiler Framework.
  – Performs source-to-source translation from OpenACC C to target accelerator models (CUDA or OpenCL).
  – Can be used as a research framework for various studies on directive-based accelerator computing.

Evaluation
• Experimental Setup
  – 13 OpenACC programs from NPB, Rodinia, and kernel benchmarks are translated to CUDA programs by OpenARC.
  – Test Platforms
    • Unified memory test:
      – NVIDIA Tesla K40c and Intel Xeon E5520 CPU
      – NVCC V6.5 and GCC V4.4.7
    • All the other tests:
      – NVIDIA Tesla M2090 GPU and Intel Xeon X5600 CPU
      – NVCC V5.0, GCC V4.4.6, PGI V13.6

Performance of Standard OpenACC
[Figure: bar chart of normalized execution time (axis from 0 to 6) for OpenARC and PGI per benchmark; two off-scale bars are labeled 31 and 12.8.]
The execution times are normalized to those of hand-written CUDA versions. Lower is better.

OpenACCe Performance
[Figure: bar chart of normalized execution time, log scale from 0.1 to 100, comparing OpenARC, PGI, and OpenACCe on the LUD, NW, and MATMUL benchmarks. Bar labels as extracted, without a recoverable bar assignment: 31, 12.8, 5, 1.8, 1.4, 1.4, 1.3, 1.1, 1.]
The execution times are normalized to those of hand-written CUDA versions. Lower is better.

Unified Memory vs. Separate Memory
[Figure: bar chart of normalized execution time, log scale from 0.0001 to 1, comparing Unified Memory and Separate Memory per benchmark.]
Execution times are normalized to no-memory-transfer-optimized versions.
