Analysis of Performance Gap Between OpenACC and the Native Approach - - PowerPoint PPT Presentation




SLIDE 1

Analysis of Performance Gap Between OpenACC and the Native Approach on P100 GPU and SW26010: A Case Study with GTC-P

Stephen Wang†1, James Lin†1, William Tang†2, Stephane Ethier†2, Bei Wang†2, Simon See†1,3

†1 Shanghai Jiao Tong University, Center for HPC
†2 Princeton University, Institute for Computational Science & Engineering (PICSciE) and Plasma Physics Laboratory (PPPL)
†3 NVIDIA Corporation


GTC 2018, San Jose, USA March 27, 2018

SLIDE 2

Background

  • Sunway TaihuLight is currently the No. 1 supercomputer on the Top500 list; in the near future, Summit at ORNL will be the next leap in leadership-class supercomputers. → We want to maintain a single code base across different supercomputers.

  • Real-world applications written with OpenACC can achieve portability across NVIDIA GPUs and Sunway processors; the GTC-P code serves as a case study. → We propose to analyze the performance gap between the OpenACC version and the native programming approach on the two architectures.

SLIDE 3

GTC-P: Gyrokinetic Toroidal Code - Princeton

  • Developed by Princeton to accelerate progress in highly scalable plasma-turbulence HPC Particle-in-Cell (PIC) codes

  • A modern "co-design" version of the comprehensive original GTC code, focused on using computer-science performance modeling to improve basic PIC operations and deliver simulations at extreme scale with unprecedented resolution and speed on a variety of architectures worldwide

  • Deployed on present-day multi-petaflop supercomputers, including Tianhe-2, Titan, Sequoia, Mira, etc., which feature GPU, CPU multicore, and many-core processors

  • KEY REFERENCE: W. Tang, B. Wang, S. Ethier, G. Kwasniewski, T. Hoefler et al., "Extreme Scale Plasma Turbulence Simulations on Top Supercomputers Worldwide", Supercomputing (SC) 2016 Conference, Salt Lake City, Utah, USA

SLIDE 4

The case study of the GTC-P code with OpenACC

  • Charge: particle-to-grid interpolation (SCATTER)

  • Smooth/Poisson/Field: grid work (local stencil)

  • Push:
    • grid-to-particle interpolation (GATHER)
    • update position and velocity

  • Shift: in a distributed-memory environment, exchange particles among processors
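A minimal 1-D sketch of the charge and push kernels above, assuming linear interpolation weights and illustrative names (the real GTC-P interpolates on a 3-D toroidal grid):

```c
#include <stddef.h>

/* Charge: deposit each particle's charge onto its two nearest grid
 * points (SCATTER).  When parallelized, two particles in the same cell
 * write to the same grid entries -- the data hazard discussed later. */
void charge_scatter(const double *pos, double q, size_t np,
                    double *grid, size_t ng) {
    for (size_t p = 0; p < np; p++) {
        size_t i = (size_t)pos[p];          /* cell index */
        double w = pos[p] - (double)i;      /* linear weight */
        grid[i % ng]       += q * (1.0 - w);
        grid[(i + 1) % ng] += q * w;
    }
}

/* Push: interpolate the field back to each particle (GATHER) and
 * advance its velocity and position -- the reads are hazard-free,
 * but the grid lookups are random-access. */
void push_gather(double *pos, double *vel, const double *field,
                 size_t np, size_t ng, double dt) {
    for (size_t p = 0; p < np; p++) {
        size_t i = (size_t)pos[p];
        double w = pos[p] - (double)i;
        double e = (1.0 - w) * field[i % ng] + w * field[(i + 1) % ng];
        vel[p] += e * dt;
        pos[p] += vel[p] * dt;
    }
}
```

Note how the scatter's `+=` on shared grid entries is exactly where the atomics or duplication schemes on the following slides come in.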


SLIDE 5

The case study of the GTC-P code with OpenACC

  • Challenges
    a. Memory-bound kernels
    b. Data hazards
    c. Random memory access

  • Methodology
    a. Reduce the memory-bandwidth demand
    b. Use atomic operations, or duplication and reduction
    c. Take full advantage of local memory

SLIDE 6

The performance of atomic operations on P100 and SW26010

  • Sunway processor (SW26010): atomic operations are implemented with a lock-and-unlock methodology.

      Serial code on 1 MPE: 4.7 s    OpenACC code on 64 CPEs: 2360.5 s

    504x slower -- unacceptable!

  • NVIDIA GPU (P100): CUDA supports global atomics in a coalesced way by transposing in shared memory.

      CUDA: 5.9 s    OpenACC: 6.0 s
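A serial model of the two strategies compared above, with illustrative names: SW26010 emulates an atomic add with a lock, so every update pays a lock/unlock round trip, while P100 performs the read-modify-write as one native instruction (modeled here with C11 atomics):

```c
#include <stdatomic.h>

static atomic_flag grid_lock = ATOMIC_FLAG_INIT;

/* SW26010-style: acquire a lock, do a plain add, release.  Under
 * contention this serializes all 64 CPEs, which is where the 504x
 * slowdown comes from. */
void locked_add(double *cell, double v) {
    while (atomic_flag_test_and_set(&grid_lock))
        ;                             /* spin until the lock is free */
    *cell += v;
    atomic_flag_clear(&grid_lock);
}

/* P100-style: the hardware does the read-modify-write in a single
 * atomic instruction (here modeled on an integer grid cell). */
void native_add(atomic_int *cell, int v) {
    atomic_fetch_add(cell, v);
}
```

Both produce the same result; only the cost per update differs.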

SLIDE 7

Performance evaluation on NVIDIA P100

  • The native atomicAdd instruction is used on P100, instead of the compare-and-swap loop implemented with the atomicCAS instruction on K80.

  • The performance gap of GTC-P between CUDA and OpenACC is narrowed by the hardware upgrade.
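The K80-era compare-and-swap loop mentioned above can be sketched as follows (a serial C11 model of the classic CUDA pattern; the add is retried until no other thread has changed the word in between):

```c
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

/* Emulated double-precision atomic add via compare-and-swap on the
 * 64-bit bit pattern, as GPUs without native FP64 atomicAdd must do. */
double cas_add(_Atomic uint64_t *cell, double v) {
    uint64_t old_bits = atomic_load(cell);
    uint64_t new_bits;
    double old_val, new_val;
    do {
        memcpy(&old_val, &old_bits, sizeof old_val);  /* bits -> double */
        new_val = old_val + v;
        memcpy(&new_bits, &new_val, sizeof new_bits); /* double -> bits */
        /* On failure, old_bits is refreshed and the loop retries. */
    } while (!atomic_compare_exchange_weak(cell, &old_bits, new_bits));
    return new_val;
}
```

Under contention every retry re-reads and re-adds, which is why the single-instruction atomicAdd on P100 closes the gap.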

SLIDE 8

Implementation of the OpenACC version on SW26010

  • A duplication-and-reduction algorithm is used instead of atomic operations, implemented with the help of the global variable acc_thread_id.

  • The tile directive is used to coalesce data accesses through DMA requests and to fill the 64 KB LDM.

[Figure: DMA transfers between main memory and the CPEs' local memory]
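A serial sketch of the duplication-and-reduction scheme, with illustrative sizes and names: each CPE (identified by something like the acc_thread_id variable on the slide) scatters into its own private copy of the grid, so no atomics are needed, and the copies are summed afterwards:

```c
#include <stddef.h>
#include <string.h>

/* Phase 1: hazard-free scatter into per-thread duplicate grids.
 * Phase 2: reduce the duplicates into the shared grid.
 * In the real code, phase 1 is split across the 64 CPEs. */
void scatter_dup_reduce(const size_t *cell, const double *q, size_t np,
                        double *grid, size_t ng,
                        double *priv, size_t nthreads) {
    memset(priv, 0, nthreads * ng * sizeof(double));
    for (size_t p = 0; p < np; p++) {
        size_t tid = p % nthreads;        /* stand-in for acc_thread_id */
        priv[tid * ng + cell[p]] += q[p]; /* private copy: no hazard */
    }
    for (size_t t = 0; t < nthreads; t++)
        for (size_t i = 0; i < ng; i++)
            grid[i] += priv[t * ng + i];
}
```

The trade-off is extra memory (one grid copy per thread) and a reduction pass, in exchange for removing the 504x atomic penalty.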

SLIDE 9

Performance evaluation of the OpenACC version on SW26010

[Chart: elapsed time in seconds of the Charge, Push, Poisson, Field, Smooth, and Shift kernels for the Sequential (MPE), OpenACC (CPE), +w/o atomics, +Tile, and +SPM library variants; lower is better, with Baseline, 1.1x, and 2.5x annotations]

  • The performance is acceptable after removing the atomic operations on SW26010.

  • Taking full advantage of the DMA bandwidth is the key factor for the memory-bound kernels.

  • The Charge kernel is the hotspot of the OpenACC version.

SLIDE 10

Register level communication on SW26010

  • The low-latency register communication mechanism operates among the CPEs of a cluster and is the key factor for data locality.

SLIDE 11

The RLC optimization for the charge kernel on SW26010

  • The index values are preconditioned on the MPE and then transferred to the first column of the CPE cluster.

  • The irregular accesses are served on the remaining CPEs via row communication.

[Figure: irregular memory access pattern in the charge kernel]
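One way the MPE-side index preconditioning could look, as a hedged sketch (the bucketing rule and names are assumptions, not GTC-P's actual scheme): group the grid indices by the CPE row that owns the corresponding grid slice, so each row of the 8x8 cluster only receives the indices it will serve over RLC:

```c
#include <stddef.h>

#define CPE_ROWS 8   /* rows of the 8x8 CPE cluster */

/* counts[r] receives how many indices fall into row r;
 * bucketed[] receives the indices grouped row by row (counting sort). */
void precondition_indices(const size_t *idx, size_t n, size_t ng,
                          size_t *bucketed, size_t counts[CPE_ROWS]) {
    size_t slice = (ng + CPE_ROWS - 1) / CPE_ROWS; /* grid points per row */
    size_t offs[CPE_ROWS], c = 0;
    for (int r = 0; r < CPE_ROWS; r++) counts[r] = 0;
    for (size_t i = 0; i < n; i++) counts[idx[i] / slice]++;
    for (int r = 0; r < CPE_ROWS; r++) { offs[r] = c; c += counts[r]; }
    for (size_t i = 0; i < n; i++)
        bucketed[offs[idx[i] / slice]++] = idx[i];
}
```

After this pass, the first CPE column can forward each row's contiguous bucket along the row with a single stream of register messages.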

SLIDE 12
The async optimization for the charge kernel on SW26010

  • The irregular memory accesses implemented with RLC on the CPE cluster and the remaining part (which does not fit in the SPM) run simultaneously.

  • The performance is tuned manually.
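A serial model of the overlap idea, with illustrative names: while a CPE computes on the tile held in one SPM buffer, the DMA engine fills the other buffer, so transfer and compute proceed simultaneously (here the "DMA" is an ordinary memcpy and the phases run back to back):

```c
#include <stddef.h>
#include <string.h>

#define TILE 4   /* illustrative tile size; real tiles fill the 64 KB LDM */

double process_tiles(const double *data, size_t n) {
    double spm[2][TILE];              /* double-buffered SPM tiles */
    double sum = 0.0;
    size_t ntiles = n / TILE;
    if (ntiles == 0) return 0.0;
    memcpy(spm[0], data, TILE * sizeof(double));   /* prime buffer 0 */
    for (size_t t = 0; t < ntiles; t++) {
        int cur = (int)(t & 1);
        /* "Async DMA": fetch the next tile into the other buffer
         * (on SW26010 this would be an asynchronous DMA request). */
        if (t + 1 < ntiles)
            memcpy(spm[!cur], data + (t + 1) * TILE, TILE * sizeof(double));
        /* Compute on the current tile while the fetch is in flight. */
        for (int i = 0; i < TILE; i++)
            sum += spm[cur][i];
    }
    return sum;
}
```

The manual tuning on the slide corresponds to choosing how much work to place in each of the two overlapped streams.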

SLIDE 13

Performance tuning of the charge kernel on SW26010

Finally, the native approach achieved around a 4x speedup over the OpenACC version on the SW26010 processor.

SLIDE 14

How about the scaling of the OpenACC version of the GTC-P code on real supercomputers? (Early Results)

SLIDE 15

Weak Scaling

Experiment results of the scaling evaluation on the GPU cluster at SJTU

SLIDE 16

Experiment results of the scaling evaluation on the Titan supercomputer

  • One K20X GPU per node

  • "Gemini" interconnect

  • Strong scaling is still to be done

SLIDE 17

Experiment results of the scaling evaluation on the Sunway TaihuLight supercomputer

SLIDE 18
Summary

  • The case study demonstrated the portability of OpenACC across the GPU and the Chinese home-grown many-core processor, although the algorithm had to be refactored on SW26010 compared with the GPU.

  • The performance gap between the OpenACC and CUDA versions of GTC-P on the NVIDIA P100 has been narrowed by the hardware upgrade.

  • The experiments showed that the performance gap on SW26010 cannot be ignored, due to the lack of an efficient general software cache on the CPE cluster. We designed a specific register-level communication scheme to address the problem.

SLIDE 19

References

  • Performance and Portability Studies with OpenACC Accelerated Version of GTC-P. Yueming Wei, Yichao Wang, Linjin Cai, William Tang, Bei Wang, Stephane Ethier, Simon See and James Lin. The 17th International Conference on Parallel and Distributed Computing, Applications and Technologies, Guangzhou, China, December 16-18, 2016.

  • Porting and Optimizing GTC-P on TaihuLight Supercomputer with Sunway OpenACC. Yichao Wang, James Lin, Linjin Cai, William Tang, Stephane Ethier, Bei Wang, Simon See and Satoshi Matsuoka. Journal of Computer Research and Development, 2018, 55(4).