Porting Atmospheric Programs on Various Systems Lin Gan Tsinghua - - PowerPoint PPT Presentation


Porting Atmospheric Programs on Various Systems Lin Gan Tsinghua University NSCC-Wuxi lingan@tsinghua.edu.cn 2017.10.23-27 @Schloss Dagstuhl Dagstuhl Seminar on Performance Portability in Extreme Scale Computing: Metrics, Challenges,


SLIDE 1

Porting Atmospheric Programs on Various Systems

Lin Gan

Tsinghua University NSCC-Wuxi lingan@tsinghua.edu.cn

2017.10.23-27 @Schloss Dagstuhl

Dagstuhl Seminar on Performance Portability in Extreme Scale Computing: Metrics, Challenges, Solutions

SLIDE 2

lingan@tsinghua.edu.cn www.thuhpgc.org

About me

Tsinghua University, NSCC-Wuxi

Climate Modeling, Geophysics Exploration

FPGA, Sunway, GPU, MIC

Performance portable applications can be efficiently executed on a wide variety of HPC architectures without significant manual modifications.

SLIDE 3

Climate Modeling

Performance portable applications can be efficiently executed on a wide variety of HPC architectures without significant manual modifications.

  • a high complexity in the application, and a heavy legacy in the code base (millions of lines of code)
  • an extremely complicated MPMD program with no hotspots (or hundreds of hotspots)
  • a misfit between the in-place design philosophy and the new architecture
  • a lack of people with interdisciplinary knowledge and experience

SLIDE 4

  • [2010 SISC]: 2D SWE model, 2k CPU cores on BlueGene/L
  • [2011 JCP]: 2D SWE model, 80k CPU cores on Tianhe-1A
  • [2013 PPoPP]: 2D SWE model, 100K CPU-GPU cores, 0.8 DP-TF on Tianhe-1A
  • [2014 IPDPS]: 2D SWE model, 1.6m CPU-MIC cores, 1.63 DP-PF on Tianhe-2
  • [2015 ToC]: 3D Nonhyd model, 1.2m CPU-MIC cores, 1.74 DP-PF on Tianhe-2
  • [2016 Gordon Bell Prize]: 3D Nonhyd model, 10.6m Sunway cores, both implicit solver and explicit solver on TaihuLight

Our FPGA Work

  • 1. “Global Atmosphere Simulation based on Reconfigurable Platform” (short paper), in FCCM 2013
  • 2. “Accelerating Solvers for Global SWEs through Mixed-precision Data Flow Engines”, in FPL 2013
  • 3. “Solving the Global Atmospheric Equations through Heterogeneous Reconfigurable Platforms”, in ACM TRETS
  • 4. “A Highly-Efficient and Green Data Flow Engine for Solving Euler Atmospheric Equations”, in FPL 2014
  • 5. “Solving Mesoscale Atmospheric Dynamics Using a Reconfigurable Dataflow Architecture”, in IEEE MICRO 2017

  • 8 Altera Stratix5 D8 FPGAs: 3x in perf., 1.2x in energy.
  • 8 Altera Stratix5 D8 FPGAs: 6x in perf., 5x in energy.
  • Virtex-6 SX475T FPGA: 4.3x in perf., 10x in energy.

Our atmosphere study

Programming?

SLIDE 5

Comparison of Lines of Code

Platform          CPU     1x GPU   1x MIC   1x FPGA
Original LOC      969     969      969      969
Direct Porting    -       +96      +10      +644
Further Opti.     -       +90      +80      +115
Final LOC         969     1155     1059     1728
LOC/perf.         46.8    7.3      4.4      1.8
perf./LOC         0.02    0.14     0.22     0.56

[Chart: CPU+FPGA, CPU+MIC, CPU+GPU, CPU]

What is your choice?

OpenCL for FPGA
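The rows of the lines-of-code comparison on this slide are related by simple arithmetic, which the following minimal C sketch checks. It assumes that the values in each row are listed in the column order CPU, 1x GPU, 1x MIC, 1x FPGA, that all ports start from the same 969-line baseline, and that the CPU column has no added lines. The names `PortCost`, `loc_table`, `final_loc`, and `perf_per_loc` are hypothetical and not from the talk.

```c
#include <assert.h>

/* One column of the lines-of-code table (field names are illustrative). */
typedef struct {
    const char *platform;
    int original_loc;     /* assumed shared baseline of 969 lines */
    int direct_port;      /* lines added by direct porting */
    int further_opt;      /* lines added by further optimization */
    double loc_per_perf;  /* "LOC/perf." row as printed on the slide */
} PortCost;

/* Values copied from the slide; the column order is an assumption. */
const PortCost loc_table[4] = {
    {"CPU",     969,   0,   0, 46.8},
    {"1x GPU",  969,  96,  90,  7.3},
    {"1x MIC",  969,  10,  80,  4.4},
    {"1x FPGA", 969, 644, 115,  1.8},
};

/* Final LOC = original + direct porting + further optimization. */
int final_loc(const PortCost *p)
{
    return p->original_loc + p->direct_port + p->further_opt;
}

/* "perf./LOC" is the reciprocal of "LOC/perf.". */
double perf_per_loc(const PortCost *p)
{
    assert(p->loc_per_perf > 0.0);
    return 1.0 / p->loc_per_perf;
}
```

Calling `final_loc(&loc_table[3])` reproduces the 1728 lines listed for the FPGA port, and `perf_per_loc` reproduces the last row of the table after rounding to two decimals.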

SLIDE 6

The Sunway TaihuLight Supercomputer

Sunway TaihuLight

  • Peak: 125 PFlops
  • Linpack: 93 PFlops
  • 40960 SW26010 CPUs
  • Over 10M cores

163840 MPI processes 65 threads

Disaster?

  • Reg. Comm.
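The figures on this slide fit together arithmetically, as the short C sketch below shows. It reads "65 threads" as one management core plus 64 compute cores per MPI process, which is an interpretation rather than something the slide states; the function names are hypothetical.

```c
#include <assert.h>

/* Figures taken from the slide. */
enum {
    NUM_CPUS = 40960,          /* SW26010 CPUs */
    MPI_PROCESSES = 163840,    /* MPI processes */
    THREADS_PER_PROCESS = 65   /* assumed: 1 MPE + 64 CPEs */
};

/* MPI processes hosted by one CPU. */
long processes_per_cpu(void)
{
    assert(MPI_PROCESSES % NUM_CPUS == 0);
    return MPI_PROCESSES / NUM_CPUS;
}

/* Cores per CPU if every thread occupies its own core. */
long cores_per_cpu(void)
{
    return processes_per_cpu() * THREADS_PER_PROCESS;
}

/* Total cores across the machine under the same assumption. */
long total_cores(void)
{
    return (long)MPI_PROCESSES * THREADS_PER_PROCESS;
}
```

Under these assumptions each CPU hosts 4 processes and 260 cores, and the machine totals 10,649,600 cores, consistent with "Over 10M cores".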
SLIDE 7

Principal Programming Model on TaihuLight

§ MPI+X
§ X: OpenACC / Athread
§ One MPI process runs on one management core (MPE)
§ OpenACC conducts data transfer between main memory and on-chip memory (SPM), and distributes the kernel workload across the compute cores (CPEs)
§ Athread is the threading library that manages threads on the compute cores (CPEs); it is used in the OpenACC implementation

SLIDE 8

OpenACC-level

    int A[1024][1024];
    int B[1024][1024];
    int C[1024][1024];
    int i, j;
    /* Copy B and C in, copy A out, and parallelize the i-loop. */
    #pragma acc parallel loop \
        copyin(B, C) copyout(A)
    for (i = 0; i < 1024; i++) {
        for (j = 0; j < 1024; j++) {
            A[i][j] = B[i][j] + C[i][j];
        }
    }

Athread-level

Host Code

    CPEs_spawn(CPE_kernel, args);

Kernel Code

    __SPM_local int SPM_A[1][1024];
    __SPM_local int SPM_B[1][1024];
    __SPM_local int SPM_C[1][1024];
    void CPE_kernel(args) {
        int i, j;
        /* Each CPE handles rows CPE_id, CPE_id + CPE_num, ... */
        for (i = CPE_id; i < 1024; i += CPE_num) {
            /* Fetch one row (4096 bytes) of B and C into SPM. */
            dma_get(&B[i][0], SPM_B, 4096);
            dma_get(&C[i][0], SPM_C, 4096);
            for (j = 0; j < 1024; j++) {
                SPM_A[0][j] = SPM_B[0][j] + SPM_C[0][j];
            } // j-loop
            /* Write the finished row of A back to main memory. */
            dma_put(SPM_A, &A[i][0], 4096);
        } // i-loop
    }

SWACC → Basic Compiler → a.out
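The row-wise work split and the buffer staging in the Athread-level kernel can be tried on an ordinary host. The sketch below is only an emulation under stated assumptions: `memcpy` stands in for the slide's `dma_get`/`dma_put`, a sequential loop stands in for spawning the kernel on the compute cores, and 64 CPEs per process is assumed. All function names ending in `_emul` are hypothetical and are not part of the Athread library.

```c
#include <assert.h>
#include <string.h>

#define N 1024       /* matrix dimension, as on the slide */
#define CPE_NUM 64   /* assumption: 64 compute cores (CPEs) per process */

static int A[N][N], B[N][N], C[N][N];   /* arrays in "main memory" */

/* Hypothetical stand-ins for dma_get/dma_put: plain memcpy on the host. */
static void dma_get_emul(const int *main_mem, int *spm, size_t bytes)
{
    memcpy(spm, main_mem, bytes);
}

static void dma_put_emul(const int *spm, int *main_mem, size_t bytes)
{
    memcpy(main_mem, spm, bytes);
}

/* Work of one emulated CPE: rows cpe_id, cpe_id + CPE_NUM, ... */
void cpe_kernel_emul(int cpe_id)
{
    int spm_a[N], spm_b[N], spm_c[N];   /* emulated scratch-pad row buffers */
    assert(cpe_id >= 0 && cpe_id < CPE_NUM);
    for (int i = cpe_id; i < N; i += CPE_NUM) {
        dma_get_emul(&B[i][0], spm_b, sizeof spm_b);
        dma_get_emul(&C[i][0], spm_c, sizeof spm_c);
        for (int j = 0; j < N; j++) {
            spm_a[j] = spm_b[j] + spm_c[j];
        }
        dma_put_emul(spm_a, &A[i][0], sizeof spm_a);
    }
}

/* Fill the inputs, run every emulated CPE in turn, and verify A = B + C.
   Returns 1 on success and 0 on the first mismatch. */
int emulate_and_check(void)
{
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            B[i][j] = i + j;
            C[i][j] = i - j;
        }
    }
    for (int id = 0; id < CPE_NUM; id++) {
        cpe_kernel_emul(id);
    }
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            if (A[i][j] != B[i][j] + C[i][j]) {
                return 0;
            }
        }
    }
    return 1;
}
```

Because the strided i-loop assigns each row to exactly one emulated CPE, the emulated cores never write the same row of `A`, which is why the real kernel needs no synchronization between cores for this example.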

SLIDE 9

Porting CAM: OpenACC + Athread

The performance improvements for the entire CAM model in ne30. MPE\ori refers to the original version based on MPE, openacc refers to the usage of OpenACC directives, and athread refers to the further usage of Athread.

Can you accept the performance of OpenACC?

SLIDE 10

Loop transformation tools

    do ie = nets, nete
      do k = 1, nlev
        dp(k) = func_1()
        do q = 1, qsize
          Qtens(k,q,ie) = func_2(dp(k))
        end do
      end do
    end do

    do ie = nets, nete
      do k = 1, nlev
        do q = 1, qsize
          qmin(k,q,ie) = …
          qmax(k,q,ie) = …
        end do
      end do
    end do

    do ie = nets, nete
      do k = 1, nlev
        dp(k) = func_5()
        Vstar(k) = func_6()
      end do
      do q = 1, qsize
        do k = 1, nlev
          Qtens(k,q,ie) = func_7(dp(k), Vstar(k))
        end do
      end do
    end do

Optimized:

    do ie = nets, nete
      do k = 1, nlev
        do q = 1, qsize
          Qtens(k,q,ie) = func_2(func_1())
        end do
      end do
    end do

    do ie = nets, nete
      do k = 1, nlev
        do q = 1, qsize
          qmin(k,q,ie) = …
          qmax(k,q,ie) = …
        end do
      end do
    end do

    do ie = nets, nete
      do q = 1, qsize
        do k = 1, nlev
          Qtens(k,q,ie) = func_7(func_5(), func_6())
        end do
      end do
    end do

OpenACC Athread

Tools for loop transformation Data locality
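The third loop nest on this slide trades the temporary arrays `dp` and `Vstar` for recomputation inside a single perfectly nested loop. The C sketch below is a transliteration under assumptions, not the tool's actual output: `func_5`, `func_6`, and `func_7` are given arbitrary bodies and explicit index arguments, the array sizes are illustrative, and the Fortran `Qtens(k,q,ie)` becomes `qtens[ie][q][k]` so that `k` stays the fastest-varying index.

```c
#include <assert.h>

enum { NLEV = 30, QSIZE = 25, NELEM = 4 };   /* illustrative sizes only */

/* Hypothetical stand-ins for the slide's func_5, func_6, func_7. */
static double func_5(int k, int ie) { return 1.0 + 0.5 * k + 0.25 * ie; }
static double func_6(int k, int ie) { return 2.0 + 0.125 * k - 0.5 * ie; }
static double func_7(double dp, double vstar) { return dp * vstar; }

/* Original form: fill temporaries per element, then consume them. */
void qtens_original(double qtens[NELEM][QSIZE][NLEV])
{
    for (int ie = 0; ie < NELEM; ie++) {
        double dp[NLEV], vstar[NLEV];
        for (int k = 0; k < NLEV; k++) {
            dp[k] = func_5(k, ie);
            vstar[k] = func_6(k, ie);
        }
        for (int q = 0; q < QSIZE; q++) {
            for (int k = 0; k < NLEV; k++) {
                qtens[ie][q][k] = func_7(dp[k], vstar[k]);
            }
        }
    }
}

/* Transformed form: one perfect loop nest, temporaries recomputed inline. */
void qtens_fused(double qtens[NELEM][QSIZE][NLEV])
{
    for (int ie = 0; ie < NELEM; ie++) {
        for (int q = 0; q < QSIZE; q++) {
            for (int k = 0; k < NLEV; k++) {
                qtens[ie][q][k] = func_7(func_5(k, ie), func_6(k, ie));
            }
        }
    }
}

/* Returns 1 if both forms produce identical results, 0 otherwise. */
int fused_matches_original(void)
{
    static double a[NELEM][QSIZE][NLEV], b[NELEM][QSIZE][NLEV];
    qtens_original(a);
    qtens_fused(b);
    for (int ie = 0; ie < NELEM; ie++)
        for (int q = 0; q < QSIZE; q++)
            for (int k = 0; k < NLEV; k++)
                if (a[ie][q][k] != b[ie][q][k])
                    return 0;
    return 1;
}
```

The fused form recomputes `func_5` and `func_6` once per `q`, so it only pays off when those functions are cheap relative to the cost of keeping the temporaries in memory; in exchange, every iteration of the nest becomes independent and easier to distribute across compute cores.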

SLIDE 11

Porting CAM on Sunway

  • CAM original total is 754,129 LOC
  • We modified 152,336 LOC
  • We further added 57,709 LOC

  • 3.3 PFlops for a 750-m resolution using 10,075,000 cores
  • 3.4 SYPD for 25-km horizontal resolution
  • 21.5 SYPD for 100-km horizontal resolution

2 years

Redesign the atmosphere model

SLIDE 12

Climate Models on Sunway

CAM WRF AM3 POM MASN UM CESM CIESM

SLIDE 13

Software on Sunway

OpenFoam LAMMPS xMath VASP Thunder Star bowtie Petsc

SLIDE 14

China’s homegrown supercomputers

Sunway BlueLight 2012

  • National Supercomputing Center in Jinan
  • 1 PFlops
  • multi-core processor (16-core)

Sunway TaihuLight 2016

  • National Supercomputing Center in Wuxi
  • 125 PFlops
  • many-core processor (260-core)

OpenFoam LAMMPS xMath VASP Thunder Star bowtie Petsc

Would you be interested in using future machines?

SLIDE 15

Thanks and welcome to visit us in Wuxi

ICCS 2018 will take place in June 2018
