

SLIDE 1

Evaluation of Productivity and Performance of the XcalableACC programming language

LENS2015 INTERNATIONAL WORKSHOP, Oct. 29th. 2015

Masahiro Nakao (RIKEN AICS)

SLIDE 2

HA-PACS/TCA Cluster System

(http://www.ccs.tsukuba.ac.jp) Each node has four GPUs (NVIDIA K20X). Therefore we assigned four processes to each node, and each process deals with one GPU.

[Node diagram: two CPUs, each connected to two GPUs]

SLIDE 3

Objectives

Evaluate performance and productivity of XcalableACC (XACC) using four benchmarks:

  • HIMENO: evaluates the performance of an incompressible fluid analysis code (stencil code)
  • NPB CG: solves for the minimum eigenvalue of a symmetric positive-definite sparse matrix using the Conjugate Gradient method
  • STREAM: evaluates sustainable memory bandwidth
  • HPL: High Performance Linpack; evaluates the floating-point execution rate for solving a linear system of equations

SLIDE 4

Objectives

(Objectives revisited; next benchmark: HIMENO)

SLIDE 5

Implementation of HIMENO

float p[I][J][K];
#pragma xmp template t(0:K-1,0:J-1,0:I-1)
#pragma xmp nodes n(1, NDY, NDX)
#pragma xmp distribute t(block, block, block) onto n
#pragma xmp align p[k][j][i] with t(i, j, k)
#pragma xmp shadow p[1:2][1:2][0:1]   /* Define distributed array with halo region */

#pragma acc data copy(p) ..           /* Transfer distributed array to accelerator */
{
  ..
  #pragma xmp reflect (p) acc ..      /* Exchange halo region */
  #pragma xmp loop (k,j,i) on t(k,j,i)
  #pragma acc parallel loop ..        /* Parallelize loop statement */
  for(i=1; i<MIMAX; ++i)
    for(j=1; j<MJMAX; ++j){
      #pragma acc loop vector ..
      for(k=1; k<MKMAX; ++k){
        S0 = p[i+1][j][k] * ..;

Only XMP and OpenACC directives are added to the sequential Himeno benchmark.

SLIDE 6

Pingpong on HA-PACS/TCA


PEACH2: PCIe Gen.2 x8 (4 GB/s). GPUDirect RDMA: InfiniBand 4x QDR, 2 rails (8 GB/s).

[Figure: ping-pong latency (microseconds) vs. transfer size (bytes, 8 B to 2 MB), device memory to device memory between neighbor nodes; PEACH2 vs. GPUDirect RDMA (MVAPICH2-GDR 2.0); lower is better. The curves cross around 128 KB.]
SLIDE 7

Performance of HIMENO (1/2)

Comparison of XACC with PEACH2 and XACC with GDR (MVAPICH2-GDR). XACC with PEACH2 performs better than XACC with GDR for the p[64][64][128] array size.

[Figure: performance (GFlops) vs. number of nodes (1-16) for XACC (PEACH2) and XACC (GDR); higher is better. For array size p[64][64][128], PEACH2 is up to 14% faster; for p[128][128][256], up to 4% faster.]

SLIDE 8

Performance of HIMENO (2/2)


The performance of XACC is almost the same as that of OpenACC + MPI. The SLOC of XACC is about 60% of that of OpenACC + MPI.

[Figure: performance (GFlops) vs. number of nodes (1-16) for XACC (GDR) and OpenACC + MPI. At 16 nodes (64 GPUs): XACC 1518 GFlops with SLOC 198; OpenACC + MPI 1507 GFlops with SLOC 328. SLOC is source lines of code.]

SLIDE 9

Objectives

(Objectives revisited; next benchmark: NPB CG)

SLIDE 10

Implementation of NPB CG

double w[NA];
#pragma xmp nodes p(PROC_COLS,PROC_ROWS)
#pragma xmp nodes sub_p(PROC_COLS)=p(:,*)
#pragma xmp template t(0:NA-1,0:NA-1)
#pragma xmp distribute t(block, block) onto p
#pragma xmp align w[i] with t(*,i)            /* Define distributed array */

for(cgit=1;cgit<=cgitmax;cgit++){
  rho0 = rho; d = 0.0; rho = 0.0;
  #pragma xmp loop on t(*,j)                  /* Parallelize loop statement */
  #pragma acc parallel loop gang
  for(j=0;j<NA;j++){
    double sum = 0.0;
    int rowstr_j  = rowstr[j];
    int rowstr_j1 = rowstr[j+1];
    #pragma acc loop vector reduction(+:sum)  /* Reduction on device memory */
    for(k=rowstr_j;k<rowstr_j1;k++){
      sum = sum + a[k]*p[colidx[k]];
    }
    w[j] = sum;
  } // for j
  #pragma xmp reduction(+:w) on sub_p(:) acc  /* Reduction among nodes */

SLIDE 11

[Figure: performance (Gops) vs. number of nodes (1-64) for XACC and OpenACC + MPI]

Performance of NPB CG


The performance of XACC is almost the same as that of OpenACC + MPI. The SLOC of XACC is 79% of that of OpenACC + MPI. At 64 nodes (256 GPUs): XACC 236 Gops with SLOC 609; OpenACC + MPI 246 Gops with SLOC 772.

SLIDE 12

Objectives

(Objectives revisited; next benchmark: STREAM)

SLIDE 13

Implementation of STREAM


#pragma xmp nodes p(*)
#pragma xmp barrier
time = -xmp_wtime();
#pragma omp parallel for
for (i=0;i<N;i++)
  a[i] = b[i] + scalar*c[i];
#pragma xmp barrier
time += xmp_wtime();
GBs = calc_performance(time);
#pragma xmp reduction(+:GBs)

XMP

Evaluate sustainable memory bandwidth (a[i] = b[i] + scalar * c[i])

#pragma xmp nodes p(*)
#pragma acc data copy(a[:GSIZE], b[:GSIZE], c[:GSIZE])
{
  #pragma xmp barrier
  time = -xmp_wtime();
  #pragma acc parallel loop async    /* GPU computes the first GSIZE elements */
  for(int j=0;j<GSIZE;j++)
    a[j] = b[j] + scalar*c[j];
  #pragma omp parallel for           /* CPU computes the remaining elements */
  for(i=GSIZE;i<N;i++)
    a[i] = b[i] + scalar*c[i];
  #pragma acc wait
  #pragma xmp barrier
  time += xmp_wtime();
}
GBs = calc_performance(time);
#pragma xmp reduction(+:GBs)

XACC

SLIDE 14

Performance of STREAM


The performance of XACC is 3.2 times that of XMP: 18,895 GB/s vs. 6,067 GB/s. (Note that XACC uses both the GPUs and the CPU, while XMP uses only the CPU.)

[Figure: performance (GB/s) vs. number of nodes (1-64) for XACC and XMP. At 64 nodes (256 GPUs): XACC SLOC 90, XMP SLOC 78.]

SLIDE 15

Objectives

(Objectives revisited; next benchmark: HPL)

SLIDE 16

Implementation of HPL


  • 1. Block-cyclic distribution for coefficient matrix

double A[N][N];
#pragma xmp nodes p(P,Q)
#pragma xmp template t(0:N-1, 0:N-1)
#pragma xmp distribute t(cyclic(NB), cyclic(NB)) onto p
#pragma xmp align A[i][j] with t(j,i)


double L[N][NB];
#pragma xmp align L[i][*] with t(*,i)
#pragma acc enter data create(L[:][:])
 :
#pragma xmp gmove acc(L)
L[k:len][0:NB] = A[k:len][k-NB:NB];

[Diagram: the panel L[N][NB] on the device is copied from an NB-wide column block of A[N][N] on the host, len rows starting at row k]

  • 2. Panel Broadcast from host memory to device memory


  • 3. Update matrix: use the cuBLAS DGEMM developed by NVIDIA

[Diagram: panel broadcast across node #1, node #2, node #3, ...]

SLIDE 17

[Figure: performance (GFlops) vs. number of nodes (1-8): expected performance from Top500, XACC, and XMP]

Performance of HPL


The expected performance is the Top500 result, obtained with the CUDA + MPI version of HPL developed by NVIDIA: 34.6 TFlops (76% efficiency). XACC achieves 11.6 TFlops (26%) and XMP 3.3 TFlops. The DGEMM kernel may differ; this is under investigation. SLOC: 437 (XACC) vs. 343 (XMP).

SLIDE 18

Conclusion


Objective: evaluation of productivity and performance of XACC.

  • In HIMENO, XACC using PEACH2 performs better than XACC using MVAPICH2-GDR for small data sizes.
  • The SLOC of XACC is smaller than that of OpenACC + MPI, the typical programming model.
  • The performance of XACC is almost the same as that of OpenACC + MPI, except for HPL; we are now tuning the XACC version of HPL.

Future plan

Apply XACC to a real-world application: N-body simulation in the space science field (in collaboration with Yohei Miki)