

SLIDE 1

Extended OpenACC Programming to Exploit GPU-Specific Features Still at a High Level

Seyong Lee and Jeffrey S. Vetter Future Technologies Group Oak Ridge National Laboratory

http://ft.ornl.gov

SLIDE 2

GTC15
http://ft.ornl.gov/research/openarc

Outline

  • Issues in OpenACC and Other Directive-Based GPU Programming Models
  • OpenACCe: Extended OpenACC to Support Architecture-Specific Features at a High Level
    – Extension to Better Support Unified Memory
    – Extension to Support Architecture-Specific Features
  • Implementation and Evaluation
  • Summary
SLIDE 3

Motivation

  • Scalable Heterogeneous Computing (SHC)
    – Enabled by graphics processors (e.g., NVIDIA CUDA GPUs, AMD APUs), Intel Xeon Phi, and other non-traditional devices.
    – An emerging response to the constraints of energy, density, and device-technology trends.
    – However, the complexity of SHC systems causes portability and productivity issues.

SLIDE 4

What is OpenACC?

  • A directive-based API standard for programming accelerators
    – Consists of compiler directives, library routines, and environment variables.
    – Provides a high-level abstraction over architectural details and low-level programming complexity.
  • Allows parallel programmers to give the compiler hints, known as "directives," that identify which regions of code to accelerate, without requiring them to modify or adapt the underlying code itself.
    – Aimed at incremental development of accelerator code.

SLIDE 5

Issues in OpenACC and Other Directive-Based Accelerator Programming Models

  • Too much abstraction places a significant burden on performance tuning, debugging, scaling, etc.
  • In-depth evaluation of, and research on, directive-based heterogeneous programming is needed to address the two conflicting goals of SHC systems: productivity and performance.

SLIDE 6

OpenACCe: Extended OpenACC to Better Support Architecture-Specific Features

  • OpenACC Extension to Better Support Unified Memory
  • OpenACC Extension to Support Accelerator-Specific Features

SLIDE 7

OpenACC Extension to Better Support Unified Memory

  • Problem
    – Explicit GPU memory management in OpenACC (and other directive-based GPU programming models) can still be complex and error-prone.

[Figure: Total execution time and total transferred-data size under the default OpenACC memory-management scheme, normalized to a fully optimized OpenACC version (log scale).]

SLIDE 8

OpenACC Extension to Better Support Unified Memory (2)

  • Problem
    – Unified memory (NVIDIA CUDA 6 or AMD APUs) can simplify the complex and error-prone memory management in OpenACC.
    – However, the current OpenACC model works well on unified memory only if the whole memory is shared by default.
    – Performance tradeoffs in existing unified-memory systems call for fine-grained control over when unified memory is used.

SLIDE 9

OpenACC Extension to Better Support Unified Memory (3)

  • Proposed Solution
    – Extend OpenACC with new library routines to explicitly manage unified memory:
      • Work on both separate-memory systems and unified-memory systems.
      • Allow hybrid OpenACC programs that selectively combine separate memory and unified memory.

SLIDE 10

Augmented OpenACC Runtime Routines to Support Unified Memory

Runtime Routine                              Description
acc_create_unified(pointer, size)            Allocate unified memory if supported; otherwise, allocate CPU memory using malloc().
acc_pcreate_unified(pointer, size)           Same as acc_create_unified() if the input is not already present in unified memory; otherwise, do nothing.
acc_copyin_unified(pointer, size)            Allocate unified memory and copy data from the input pointer if supported; otherwise, allocate CPU memory and copy data from the input pointer.
acc_pcopyin_unified(pointer, size)           Same as acc_copyin_unified() if the input data is not already present in unified memory; otherwise, do nothing.
acc_delete_unified(pointer, size)            Deallocate memory, which can be either unified memory or CPU memory.
Existing runtime routines and internal       Check whether the input data is in unified memory; if not, perform the intended operations.
routines used for data clauses

SLIDE 11

Hybrid Example to Selectively Combine both Separate and Unified Memories

float (*a)[N2] = (float(*)[N2]) malloc(..);              /* separate (CPU) memory */
float (*b)[N2] = (float(*)[N2]) acc_create_unified(..);  /* unified memory        */
...
#pragma acc data copy(b), create(a)
for (k = 0; k < ITER; k++) {
    #pragma acc kernels loop independent
    ... // kernel-loop1
} // end of k-loop
acc_delete_unified(a, ...);   /* frees CPU memory for a, unified memory for b */
acc_delete_unified(b, ...);

SLIDE 12

OpenACC Extension to Support Accelerator-Specific Features

  • Problem
    – The high-level abstraction in OpenACC does not give users control over compiler-specific or architecture-specific features, incurring a noticeable performance gap between OpenACC and low-level programming models (e.g., CUDA and OpenCL).

[Figure: Performance of the Rodinia LUD benchmark on an NVIDIA Tesla M2090, normalized execution time over manual CUDA: 31x for OpenACC translated by OpenARC and 12.8x for OpenACC translated by PGI.]

SLIDE 13

OpenACC Extension to Support Accelerator-Specific Features (2)

  • Proposed Solution
    – Extend OpenACC with new, device-aware directives.
    – Enable advanced interactions between users and compilers, still at a high level:
      • Allow users high-level control over compiler translations.
      • Most extensions are optional, preserving portability.
      • Can be used to understand and debug internal translation processes.
SLIDE 14

Device-Aware OpenACC Extension

  • Directive Extension for Device-Specific Memory Architectures

    #pragma openarc cuda [list of clauses]

    where a clause is one of the following:
      global, constant, noconstant, texture, notexture,
      sharedRO, sharedRW, noshared, registerRO, registerRW, noregister

SLIDE 15

Device-Aware OpenACC Extension (2)

  • Multi-Dimensional Work-Sharing Loop Mapping
    – Nested work-sharing loops of the same type are allowed if tightly nested; the OpenACC compiler applies a static mapping to the tightly nested work-sharing loops.
  • Fine-Grained Synchronization
    – Add a new barrier directive (#pragma acc barrier) for local synchronization (among workers in the same gang or vectors in the same worker).

SLIDE 16

OpenACCe Example

#pragma acc kernels loop gang(N/BSIZE) copy(C) copyin(A, B)
#pragma openarc cuda sharedRW(As, Bs)
for (by = 0; by < (N/BSIZE); by++) {            // by is mapped to blockIdx.y
    #pragma acc loop gang(N/BSIZE)
    for (bx = 0; bx < (N/BSIZE); bx++) {        // bx is mapped to blockIdx.x
        float As[BSIZE][BSIZE];
        float Bs[BSIZE][BSIZE];
        #pragma acc loop worker(BSIZE)
        for (ty = 0; ty < BSIZE; ty++) {        // ty is mapped to threadIdx.y
            #pragma acc loop worker(BSIZE)
            for (tx = 0; tx < BSIZE; tx++) {    // tx is mapped to threadIdx.x
                ... // computation part1
                #pragma acc barrier
                ... // computation part2
            }
        } // end of the nested worker loops
    }
} // end of the nested gang loops

SLIDE 17

Implementation

  • The proposed OpenACC extensions are fully implemented in the Open Accelerator Research Compiler (OpenARC).
  • OpenARC: an open-source, High-level Intermediate Representation (HIR)-based, extensible compiler framework.
    – Performs source-to-source translation from OpenACC C to target accelerator models (CUDA or OpenCL).
    – Can be used as a research framework for various studies on directive-based accelerator computing.

SLIDE 18

Evaluation

  • Experimental Setup
    – 13 OpenACC programs from NPB, Rodinia, and kernel benchmarks are translated to CUDA programs by OpenARC.
    – Test Platforms
      • Unified memory test: NVIDIA Tesla K40c GPU and Intel Xeon E5520 CPU; NVCC V6.5 and GCC V4.4.7
      • All other tests: NVIDIA Tesla M2090 GPU and Intel Xeon X5600 CPU; NVCC V5.0, GCC V4.4.6, PGI V13.6

SLIDE 19

Performance of Standard OpenACC

Execution times are normalized to those of hand-written CUDA versions; lower is better.

[Figure: Normalized execution time per benchmark for OpenACC compiled by OpenARC and by PGI; outliers reach 31x and 12.8x.]

SLIDE 20

OpenACCe Performance

Execution times are normalized to those of hand-written CUDA versions; lower is better (log scale).

Benchmark   OpenARC   PGI    OpenACCe
LUD         31        12.8   1.3
NW          5         1.4    1.1
MATMUL      1.4       1.8    1.0

SLIDE 21

Unified Memory vs. Separate Memory

Execution times are normalized to versions without memory-transfer optimization.

[Figure: Normalized execution time per benchmark, unified memory vs. separate memory (log scale).]

SLIDE 22

Summary

  • The proposed OpenACC extensions for controlling architecture-specific features demonstrate that directive-based GPU programming models can achieve both programmability and performance.
  • However, the proposed extensions also expose architectural details, even as they hide the complex language syntax of low-level CUDA models, raising the question of the right balance between productivity and performance.
  • Reference: Seyong Lee and Jeffrey S. Vetter, "OpenARC: Extensible OpenACC Compiler Framework for Directive-Based Accelerator Programming Study," WACCPD, 2014.

SLIDE 23

Thank You!

Contact Us
http://ft.ornl.gov/research/openarc
Email: lees2@ornl.gov

SLIDE 24

OpenACC Example

#pragma acc data copy(a[SIZE]), create(b[SIZE])  /* data region: GPU memory for a and b is
                                                    allocated at its entrance and de-allocated
                                                    at its exit; copy() specifies which data is
                                                    copied from CPU to GPU or vice versa       */
for (step = 0; step < ITER; step++) {
    #pragma acc kernels loop gang   /* kernel region: code offloaded to and executed on the
                                       GPU; gang specifies how to execute the loop on the GPU */
    for (...) {
        // parallel loop body
    }
}
