Analysis of Performance Gap Between OpenACC and the Native Approach - - PowerPoint PPT Presentation




SLIDE 1

Analysis of Performance Gap Between OpenACC and the Native Approach on P100 GPU and SW26010: A Case Study with GTC-P

Stephen Wang†1, James Lin†1, William Tang†2, Stephane Ethier†2, Bei Wang†2, Simon See†1,3

†1 Shanghai Jiao Tong University, Center for HPC
†2 Princeton University, Institute for Computational Science & Engineering (PICSciE) and Plasma Physics Laboratory (PPPL)
†3 NVIDIA Corporation


GTC 2018, San Jose, USA March 27, 2018

SLIDE 2

Background

  • Sunway TaihuLight is currently the No. 1 supercomputer on the Top500 list; in the near future, Summit at ORNL will be the next leap in leadership-class supercomputers. → We want to maintain a single code base across different supercomputers.

  • Real-world applications written with OpenACC can achieve portability across NVIDIA GPUs and Sunway processors; the GTC-P code serves as a case study. → We propose to analyze the performance gap between the OpenACC version and the native programming approach on the two architectures.

SLIDE 3

GTC-P: Gyrokinetic Toroidal Code - Princeton

  • Developed by Princeton to accelerate progress in highly scalable plasma-turbulence HPC Particle-in-Cell (PIC) codes

  • A modern "co-design" version of the comprehensive original GTC code, focused on using computer-science performance modeling to improve basic PIC operations and deliver simulations at extreme scale with unprecedented resolution and speed on a variety of architectures worldwide

  • Deployed on present-day multi-petaflop supercomputers, including Tianhe-2, Titan, Sequoia, Mira, etc., which feature GPU, CPU multicore, and many-core processors

  • KEY REFERENCE: W. Tang, B. Wang, S. Ethier, G. Kwasniewski, T. Hoefler et al., "Extreme Scale Plasma Turbulence Simulations on Top Supercomputers Worldwide", Supercomputing (SC) 2016 Conference, Salt Lake City, Utah, USA

SLIDE 4

The case study of the GTC-P code with OpenACC

  • Charge: particle-to-grid interpolation (SCATTER)

  • Smooth/Poisson/Field: grid work (local stencil)

  • Push:
    • grid-to-particle interpolation (GATHER)
    • update position and velocity

  • Shift: in a distributed-memory environment, exchange particles among processors
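A minimal 1-D sketch of the charge and push kernels above, assuming linear interpolation weights and illustrative names (the real GTC-P interpolates on a 3-D toroidal grid):

```c
#include <stddef.h>

/* Charge: deposit each particle's charge onto its two nearest grid
 * points (SCATTER).  When parallelized, two particles in the same cell
 * write to the same grid entries -- the data hazard discussed later. */
void charge_scatter(const double *pos, double q, size_t np,
                    double *grid, size_t ng) {
    for (size_t p = 0; p < np; p++) {
        size_t i = (size_t)pos[p];          /* cell index */
        double w = pos[p] - (double)i;      /* linear weight */
        grid[i % ng]       += q * (1.0 - w);
        grid[(i + 1) % ng] += q * w;
    }
}

/* Push: interpolate the field back to each particle (GATHER) and
 * advance its velocity and position -- the reads are hazard-free,
 * but the grid lookups are random-access. */
void push_gather(double *pos, double *vel, const double *field,
                 size_t np, size_t ng, double dt) {
    for (size_t p = 0; p < np; p++) {
        size_t i = (size_t)pos[p];
        double w = pos[p] - (double)i;
        double e = (1.0 - w) * field[i % ng] + w * field[(i + 1) % ng];
        vel[p] += e * dt;
        pos[p] += vel[p] * dt;
    }
}
```

Note how the scatter's `+=` on shared grid entries is exactly where the atomics or duplication schemes on the following slides come in.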


SLIDE 5

The case study of the GTC-P code with OpenACC

  • Challenges
    a. Memory-bound kernels
    b. Data hazards
    c. Random memory access

  • Methodology
    a. Reduce the memory-bandwidth demand
    b. Use atomic operations, or duplication and reduction
    c. Take full advantage of local memory

SLIDE 6

The performance of atomic operations on P100 and SW26010

  • Sunway processor (SW26010): atomic operations are implemented with a lock-and-unlock methodology.

      Serial code on 1 MPE: 4.7 s    OpenACC code on 64 CPEs: 2360.5 s

    504x slower -- unacceptable!

  • NVIDIA GPU (P100): CUDA supports global atomics in a coalesced way by transposing in shared memory.

      CUDA: 5.9 s    OpenACC: 6.0 s
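A serial model of the two strategies compared above, with illustrative names: SW26010 emulates an atomic add with a lock, so every update pays a lock/unlock round trip, while P100 performs the read-modify-write as one native instruction (modeled here with C11 atomics):

```c
#include <stdatomic.h>

static atomic_flag grid_lock = ATOMIC_FLAG_INIT;

/* SW26010-style: acquire a lock, do a plain add, release.  Under
 * contention this serializes all 64 CPEs, which is where the 504x
 * slowdown comes from. */
void locked_add(double *cell, double v) {
    while (atomic_flag_test_and_set(&grid_lock))
        ;                             /* spin until the lock is free */
    *cell += v;
    atomic_flag_clear(&grid_lock);
}

/* P100-style: the hardware does the read-modify-write in a single
 * atomic instruction (here modeled on an integer grid cell). */
void native_add(atomic_int *cell, int v) {
    atomic_fetch_add(cell, v);
}
```

Both produce the same result; only the cost per update differs.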

SLIDE 7

Performance evaluation on NVIDIA P100

  • The native atomicAdd instruction is used on P100, instead of the compare-and-swap loop implemented with the atomicCAS instruction on K80.

  • The performance gap of GTC-P between CUDA and OpenACC is narrowed by the hardware upgrade.
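The K80-era compare-and-swap loop mentioned above can be sketched as follows (a serial C11 model of the classic CUDA pattern; the add is retried until no other thread has changed the word in between):

```c
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

/* Emulated double-precision atomic add via compare-and-swap on the
 * 64-bit bit pattern, as GPUs without native FP64 atomicAdd must do. */
double cas_add(_Atomic uint64_t *cell, double v) {
    uint64_t old_bits = atomic_load(cell);
    uint64_t new_bits;
    double old_val, new_val;
    do {
        memcpy(&old_val, &old_bits, sizeof old_val);  /* bits -> double */
        new_val = old_val + v;
        memcpy(&new_bits, &new_val, sizeof new_bits); /* double -> bits */
        /* On failure, old_bits is refreshed and the loop retries. */
    } while (!atomic_compare_exchange_weak(cell, &old_bits, new_bits));
    return new_val;
}
```

Under contention every retry re-reads and re-adds, which is why the single-instruction atomicAdd on P100 closes the gap.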

SLIDE 8

Implementation of the OpenACC version on SW26010

  • A duplication-and-reduction algorithm is used instead of atomic operations, implemented with the help of the global variable acc_thread_id.

  • The tile directive is used to coalesce data accesses through DMA requests and to fill the 64 KB LDM.

[Figure: DMA transfers between main memory and the CPEs' local memory]
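A serial sketch of the duplication-and-reduction scheme, with illustrative sizes and names: each CPE (identified by something like the acc_thread_id variable on the slide) scatters into its own private copy of the grid, so no atomics are needed, and the copies are summed afterwards:

```c
#include <stddef.h>
#include <string.h>

/* Phase 1: hazard-free scatter into per-thread duplicate grids.
 * Phase 2: reduce the duplicates into the shared grid.
 * In the real code, phase 1 is split across the 64 CPEs. */
void scatter_dup_reduce(const size_t *cell, const double *q, size_t np,
                        double *grid, size_t ng,
                        double *priv, size_t nthreads) {
    memset(priv, 0, nthreads * ng * sizeof(double));
    for (size_t p = 0; p < np; p++) {
        size_t tid = p % nthreads;        /* stand-in for acc_thread_id */
        priv[tid * ng + cell[p]] += q[p]; /* private copy: no hazard */
    }
    for (size_t t = 0; t < nthreads; t++)
        for (size_t i = 0; i < ng; i++)
            grid[i] += priv[t * ng + i];
}
```

The trade-off is extra memory (one grid copy per thread) and a reduction pass, in exchange for removing the 504x atomic penalty.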

SLIDE 9

Performance evaluation of the OpenACC version on SW26010

[Chart: elapsed time in seconds of the Charge, Push, Poisson, Field, Smooth, and Shift kernels for the Sequential (MPE), OpenACC (CPE), +w/o atomics, +Tile, and +SPM library variants; lower is better, with Baseline, 1.1x, and 2.5x annotations]

  • The performance is acceptable after removing the atomic operations on SW26010.

  • Taking full advantage of the DMA bandwidth is the key factor for the memory-bound kernels.

  • The Charge kernel is the hotspot of the OpenACC version.

SLIDE 10

Register level communication on SW26010

  • The low-latency register communication mechanism operates among the CPEs of a cluster and is the key factor for data locality.

SLIDE 11

The RLC optimization for the charge kernel on SW26010

  • The index values are preconditioned on the MPE and then transferred to the first column of the CPE cluster.

  • The irregular accesses are served on the remaining CPEs via row communication.

[Figure: irregular memory access pattern in the charge kernel]
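One way the MPE-side index preconditioning could look, as a hedged sketch (the bucketing rule and names are assumptions, not GTC-P's actual scheme): group the grid indices by the CPE row that owns the corresponding grid slice, so each row of the 8x8 cluster only receives the indices it will serve over RLC:

```c
#include <stddef.h>

#define CPE_ROWS 8   /* rows of the 8x8 CPE cluster */

/* counts[r] receives how many indices fall into row r;
 * bucketed[] receives the indices grouped row by row (counting sort). */
void precondition_indices(const size_t *idx, size_t n, size_t ng,
                          size_t *bucketed, size_t counts[CPE_ROWS]) {
    size_t slice = (ng + CPE_ROWS - 1) / CPE_ROWS; /* grid points per row */
    size_t offs[CPE_ROWS], c = 0;
    for (int r = 0; r < CPE_ROWS; r++) counts[r] = 0;
    for (size_t i = 0; i < n; i++) counts[idx[i] / slice]++;
    for (int r = 0; r < CPE_ROWS; r++) { offs[r] = c; c += counts[r]; }
    for (size_t i = 0; i < n; i++)
        bucketed[offs[idx[i] / slice]++] = idx[i];
}
```

After this pass, the first CPE column can forward each row's contiguous bucket along the row with a single stream of register messages.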

SLIDE 12
The async optimization for the charge kernel on SW26010

  • The irregular memory accesses implemented with RLC on the CPE cluster and the remaining part (which does not fit in the SPM) run simultaneously.

  • The performance is tuned manually.
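A serial model of the overlap idea, with illustrative names: while a CPE computes on the tile held in one SPM buffer, the DMA engine fills the other buffer, so transfer and compute proceed simultaneously (here the "DMA" is an ordinary memcpy and the phases run back to back):

```c
#include <stddef.h>
#include <string.h>

#define TILE 4   /* illustrative tile size; real tiles fill the 64 KB LDM */

double process_tiles(const double *data, size_t n) {
    double spm[2][TILE];              /* double-buffered SPM tiles */
    double sum = 0.0;
    size_t ntiles = n / TILE;
    if (ntiles == 0) return 0.0;
    memcpy(spm[0], data, TILE * sizeof(double));   /* prime buffer 0 */
    for (size_t t = 0; t < ntiles; t++) {
        int cur = (int)(t & 1);
        /* "Async DMA": fetch the next tile into the other buffer
         * (on SW26010 this would be an asynchronous DMA request). */
        if (t + 1 < ntiles)
            memcpy(spm[!cur], data + (t + 1) * TILE, TILE * sizeof(double));
        /* Compute on the current tile while the fetch is in flight. */
        for (int i = 0; i < TILE; i++)
            sum += spm[cur][i];
    }
    return sum;
}
```

The manual tuning on the slide corresponds to choosing how much work to place in each of the two overlapped streams.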

SLIDE 13

Performance tuning of the charge kernel on SW26010

Finally, the native approach achieved around a 4x speedup over the OpenACC version on the SW26010 processor.

SLIDE 14

How about the scaling of the OpenACC version of the GTC-P code on real supercomputers? (Early Results)

SLIDE 15

Weak Scaling

Experiment results of the scaling evaluation on the GPU cluster at SJTU

SLIDE 16

Experiment results of the scaling evaluation on the Titan supercomputer

  • One K20X GPU per node

  • "Gemini" interconnect

  • Strong scaling is still to be done

SLIDE 17

Experiment results of the scaling evaluation on the Sunway TaihuLight supercomputer

SLIDE 18
Summary

  • The case study demonstrated the portability of OpenACC across the GPU and the Chinese home-grown many-core processor, although the algorithm had to be refactored on SW26010 compared with the GPU.

  • The performance gap between the OpenACC and CUDA versions of GTC-P on the NVIDIA P100 has been narrowed by the hardware upgrade.

  • The experiments showed that the performance gap on SW26010 cannot be ignored, due to the lack of an efficient general software cache on the CPE cluster. We designed a specific register-level communication scheme to address the problem.

SLIDE 19

References

  • Performance and Portability Studies with OpenACC Accelerated Version of GTC-P. Yueming Wei, Yichao Wang, Linjin Cai, William Tang, Bei Wang, Stephane Ethier, Simon See and James Lin. The 17th International Conference on Parallel and Distributed Computing, Applications and Technologies, Guangzhou, China, December 16-18, 2016.

  • Porting and Optimizing GTC-P on TaihuLight Supercomputer with Sunway OpenACC. Yichao Wang, James Lin, Linjin Cai, William Tang, Stephane Ethier, Bei Wang, Simon See and Satoshi Matsuoka. Journal of Computer Research and Development, 2018, 55(4).