XcalableMP and XcalableACC for Productivity and Performance in HPC Challenge Award Competition

Masahiro Nakao, Hitoshi Murai, Hidetoshi Iwashita, Takenori Shimosaka, Akihiro Tabuchi, Taisuke Boku, Mitsuhisa Sato

Graduate School of Systems and Information Engineering, University of Tsukuba
Center for Computational Sciences, University of Tsukuba
RIKEN Advanced Institute for Computational Science, Japan

HPC Challenge Class II BoF@SC14, Nov. 18th

Outline
1. XcalableMP (XMP) for cluster systems (14 min.)
2. XcalableACC (XACC) for accelerator cluster systems (6 min.)
   Extension of XMP using OpenACC. Sorry, work-in-progress.
The submission report is available at http://xcalablemp.org

What is XcalableMP (XMP)?
Directive-based language extensions of Fortran and C
  By the XMP specification working group of the PC Cluster Consortium (SC Booth #2924)
  Version 1.2.1 specification available (http://xcalablemp.org)
Supports two memory models
  Global-view (HPF-like data/work mapping directives)
  Local-view (coarray)
Implementation of compiler
  Omni XMP Compiler version 0.9 (http://omni-compiler.org)
  Platforms: Fujitsu the K computer and FX10, Cray XT/XE, IBM BlueGene, NEC SX, Hitachi SR, Linux clusters, etc.

Code example (Global-view)
Directives are added to the serial code: incremental parallelization.

    int a[MAX];
    /* Data distribution */
    #pragma xmp nodes p(4)
    #pragma xmp template t(0:MAX-1)
    #pragma xmp distribute t(block) onto p
    #pragma xmp align a[i] with t(i)

    main(){
      int i, j, res = 0;
      /* Work mapping and data synchronization */
    #pragma xmp loop on t(i) reduction(+:res)
      for(i = 0; i < MAX; i++){
        a[i] = func(i);
        res += a[i];
      }
    }
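To make the block mapping concrete, here is a minimal serial sketch in plain C of which template indices each node would own and how the per-node partial sums combine. It assumes contiguous chunks of ceil(n / nodes) elements and 1-based node numbers. The names block_range, block_owner_range, block_owner, and emulate_block_sum are hypothetical helpers, not part of XMP.

```c
#include <assert.h>

/* Half-open index range [lo, hi) owned by one node. */
typedef struct { int lo; int hi; } block_range;

/* Range of template indices 0..n-1 owned by node `node` (1-based) when the
   template is split into contiguous chunks of ceil(n / nodes) elements.
   The ceiling rule is an assumption of this sketch. */
block_range block_owner_range(int n, int nodes, int node)
{
    assert(n > 0 && nodes > 0 && node >= 1 && node <= nodes);
    int chunk = (n + nodes - 1) / nodes;   /* ceil(n / nodes) */
    block_range r;
    r.lo = (node - 1) * chunk;
    r.hi = r.lo + chunk;
    if (r.lo > n) r.lo = n;                /* trailing nodes may own nothing */
    if (r.hi > n) r.hi = n;
    return r;
}

/* 1-based node that owns template index i under the same rule. */
int block_owner(int n, int nodes, int i)
{
    assert(n > 0 && nodes > 0 && i >= 0 && i < n);
    int chunk = (n + nodes - 1) / nodes;
    return i / chunk + 1;
}

/* Serial emulation of the slide's loop: each node sums only the elements
   it owns, then the partial sums are added as reduction(+:res) would. */
long emulate_block_sum(const int *a, int n, int nodes)
{
    long res = 0;
    for (int node = 1; node <= nodes; node++) {
        block_range r = block_owner_range(n, nodes, node);
        long partial = 0;
        for (int i = r.lo; i < r.hi; i++)
            partial += a[i];
        res += partial;
    }
    return res;
}
```

With 10 elements on 4 nodes, this rule gives nodes 1 to 3 three elements each and node 4 the last one.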

Code example (Local-view)
Coarray syntax in XMP/C: array_name[start:length]:[node_number];
XMP/Fortran is upward compatible with Fortran 2008.

    double a[100]:[*], b[100]:[*];      /* Define coarrays */
    int me = xmp_node_num();
    if(me == 2) a[:]:[1] = b[:];        /* Put operation */
    if(me == 1) a[0:50] = b[0:50]:[2];  /* Get operation */
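The put and get on this slide can be emulated in plain C by keeping one copy of each coarray per node in a single address space. This is only a sketch of the semantics, assuming 1-based node numbers and the start:length section notation shown on the slide. The names a_img, b_img, emulate_put, and emulate_get are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define NODES 2
#define LEN   100

/* One copy of each coarray per node; row 0 holds node 1's copy. */
static double a_img[NODES][LEN];
static double b_img[NODES][LEN];

/* Emulates  a[start:length]:[target] = b[start:length];  run on node `me`:
   the local section of b is written into a on the target node. */
void emulate_put(int me, int target, int start, int length)
{
    assert(me >= 1 && me <= NODES && target >= 1 && target <= NODES);
    assert(start >= 0 && length >= 0 && start + length <= LEN);
    memcpy(&a_img[target - 1][start], &b_img[me - 1][start],
           (size_t)length * sizeof(double));
}

/* Emulates  a[start:length] = b[start:length]:[source];  run on node `me`:
   a section of b on the source node is read into the local a. */
void emulate_get(int me, int source, int start, int length)
{
    assert(me >= 1 && me <= NODES && source >= 1 && source <= NODES);
    assert(start >= 0 && length >= 0 && start + length <= LEN);
    memcpy(&a_img[me - 1][start], &b_img[source - 1][start],
           (size_t)length * sizeof(double));
}
```

In this emulation, the slide's two statements correspond to emulate_put(2, 1, 0, LEN) and emulate_get(1, 2, 0, 50).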

Results and Machine

The K computer: 82,944 nodes
  SPARC64 VIIIfx chip, 128 GFlops
  DDR3 SDRAM 16GB, 64GB/s
  Tofu interconnect: 6D mesh/torus network, 5GB/s x 4 links x 2
  Photo: http://www.aics.riken.jp/jp/outreach/photogallery.html

Summary of the four HPCC benchmarks

    Benchmark       # Nodes   Performance (/peak)    SLOC
    HPL Ver. 1      16,384    971 TFlops (46.3%)     313
    HPL Ver. 2       4,096    423 TFlops (80.7%)     426
    FFT             82,944    212 TFlops (2.0%)      205
    STREAM          82,944    3,583 TB/s (67.5%)      69
    RandomAccess    16,384    254 GUPs               253
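The "/peak" percentages follow from the per-node figures on this slide. As a small worked check in C, assuming decimal units (1 TFlops = 1000 GFlops, 1 TB/s = 1000 GB/s), peak is the per-node value times the node count. The function names are hypothetical.

```c
#include <assert.h>

#define NODE_PEAK_GFLOPS 128.0  /* SPARC64 VIIIfx chip, from the slide */
#define NODE_MEM_GBS      64.0  /* DDR3 memory bandwidth, from the slide */

/* Percentage of peak for a result of `tflops` TFlops on `nodes` nodes. */
double flops_efficiency_percent(double tflops, int nodes)
{
    assert(nodes > 0);
    double peak_tflops = NODE_PEAK_GFLOPS * nodes / 1000.0;
    return 100.0 * tflops / peak_tflops;
}

/* Percentage of peak memory bandwidth for `tbs` TB/s on `nodes` nodes. */
double bandwidth_efficiency_percent(double tbs, int nodes)
{
    assert(nodes > 0);
    double peak_tbs = NODE_MEM_GBS * nodes / 1000.0;
    return 100.0 * tbs / peak_tbs;
}
```

For example, 16,384 nodes give a peak of 2,097.152 TFlops, so 971 TFlops is about 46.3% of peak, matching the HPL Ver. 1 result.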

HPL version 1
Source lines of code (SLOC) is 313, written in XMP/C.

Block-cyclic distribution
[Figure: A[N][N] divided into blocks that are assigned cyclically to nodes 1 to 4]

    double A[N][N];
    #pragma xmp nodes p(P,Q)
    #pragma xmp template t(0:N-1, 0:N-1)
    #pragma xmp distribute t(cyclic(NB), \
                             cyclic(NB)) onto p
    #pragma xmp align A[i][j] with t(j,i)

Programmer can use BLAS for distributed array.

Panel broadcast by using gmove directive
[Figure: the panel of A[N][N] starting at row k, column j, with len rows and NB columns, is copied into A_L[N][NB]]

    double A_L[N][NB];
    #pragma xmp align A_L[i][*] with t(*,i)
        :
    #pragma xmp gmove
    A_L[k:len][0:NB] = A[k:len][j:NB];
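The index arithmetic behind a cyclic(NB) distribution can be sketched along one dimension as follows. This is the usual block-cyclic rule (blocks of nb consecutive indices dealt round-robin to the nodes), applied independently to each template dimension. Node positions are 0-based here for simplicity, and the function names are hypothetical.

```c
#include <assert.h>

/* 0-based position of the node that owns global index i when blocks of
   nb consecutive indices are dealt round-robin to nprocs nodes. */
int cyclic_owner(int i, int nb, int nprocs)
{
    assert(i >= 0 && nb > 0 && nprocs > 0);
    return (i / nb) % nprocs;
}

/* Index of global index i inside the local storage of its owning node. */
int cyclic_local_index(int i, int nb, int nprocs)
{
    assert(i >= 0 && nb > 0 && nprocs > 0);
    int block = i / nb;                     /* global block number */
    return (block / nprocs) * nb + i % nb;  /* local block offset + offset in block */
}
```

With nb = 2 and two nodes, indices 0-1 and 4-5 land on node position 0, while 2-3 and 6-7 land on position 1. This keeps the work balanced as the factorization shrinks the active part of the matrix.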

HPL version 2
SLOC is 426, written in XMP/C.
Overlaps communication and calculation: a "lookahead algorithm" implemented with the gmove directive and its async clause.
[Figure: the panel of A[N][N] starting at row k, column j, with len rows and NB columns, is broadcast asynchronously into A_L[N][NB]]

    double A_L[N][NB];
    #pragma xmp align A_L[i][*] with t(*,i)
        :
    #pragma xmp gmove async(1)               /* asynchronous broadcast communication */
    A_L[k:len][0:NB] = A[k:len][j:NB];
        :
    for(m=j+NB;m<N;m+=NB){
      for(n=j+NB;n<N;n+=NB){
        cblas_dgemm(&A[m][n], ..);
        if( xmp_test_async(1) ){
          // receive A[k:len][j:NB];
              :

xmp_test_async(1) checks whether the data of the gmove with async(1) has arrived.
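The polling pattern can be illustrated without any communication library by replacing the transfer with a stand-in that completes after a fixed number of polls. This is purely a sketch of the control flow: a real transfer completes according to the network, not the poll count. The names fake_async, fake_test_async, and updates_overlapped are hypothetical.

```c
#include <assert.h>

/* Stand-in for an asynchronous transfer: it reports completion once it has
   been polled `polls_needed` times. Illustrative only. */
typedef struct { int polls_needed; int polls_done; } fake_async;

/* Returns 1 if the stand-in transfer has completed, 0 otherwise. */
int fake_test_async(fake_async *h)
{
    if (h->polls_done < h->polls_needed) {
        h->polls_done++;
        return 0;
    }
    return 1;
}

/* Runs up to `updates` block updates, polling once after each one.
   Returns how many updates had been done when the panel was first seen
   as arrived, or -1 if it did not arrive within the loop. */
int updates_overlapped(int updates, int polls_needed)
{
    assert(updates >= 0 && polls_needed >= 0);
    fake_async h = { polls_needed, 0 };
    for (int done = 1; done <= updates; done++) {
        /* one block update (the cblas_dgemm call on the slide) goes here */
        if (fake_test_async(&h))
            return done;   /* panel is available: switch to using it */
    }
    return -1;
}
```

The point of the pattern is that the block updates done before the test succeeds are useful work performed while the panel was still in flight.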

Performance of HPL
XMP-HPL Version 2 scales well.
Sorry, the Version 2 measurement on 16,384 nodes was not ready in time for this BoF.

    # Nodes   Version 1             Version 2
     1,024    88 TFlops (67.2%)     109 TFlops (83.5%)
     4,096    310 TFlops (59.1%)    423 TFlops (80.7%)
    16,384    971 TFlops (46.3%)    -

[Plot: TFlops versus number of nodes (256 to 16,384), logarithmic axes]

RandomAccess
SLOC is 253, written in XMP/C.
Local-view programming with XMP/C coarray syntax.
The XMP RandomAccess is iterated over sets of CHUNK updates on each node.
A point-to-point synchronization is specified with the XMP's post and wait directives to realize asynchronous behavior of this algorithm.

    u64Int recv[LOGPROCS][RCHUNK+1]:[*];                /* Define coarray */
    ...
    for (j = 0; j < logNumProcs; j++) {
      recv[j][0:num]:[i_partner] = send[i][0:num];     /* Put operation */
    #pragma xmp sync_memory
    #pragma xmp post(p(i_partner), 0)
        :
    #pragma xmp wait(p(j_partner))
    }
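For readers unfamiliar with the benchmark, the update that each node ultimately applies to its part of the table can be sketched on a single node as below. It uses the HPCC RandomAccess generator (shift left, XOR with the polynomial 0x7 when the top bit was set) and the XOR table update. The distributed routing of updates between nodes shown on the slide is not modelled, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define POLY 0x7ULL  /* feedback polynomial of the HPCC RandomAccess generator */

/* Next value of the pseudo-random sequence. */
uint64_t next_random(uint64_t ran)
{
    return (ran << 1) ^ (((int64_t)ran < 0) ? POLY : 0ULL);
}

/* Applies `chunk` updates to a table whose size is a power of two and
   returns the generator state after the last update. */
uint64_t apply_chunk(uint64_t *table, size_t table_size,
                     uint64_t ran, size_t chunk)
{
    assert(table != NULL);
    assert(table_size > 0 && (table_size & (table_size - 1)) == 0);
    for (size_t k = 0; k < chunk; k++) {
        ran = next_random(ran);
        table[ran & (table_size - 1)] ^= ran;  /* one update */
    }
    return ran;
}
```

Because each update is an XOR, applying the same chunk twice from the same starting state restores the table, which is a convenient way to check an implementation. GUPs counts billions of such updates per second.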

Performance of RandomAccess
Last year, XMP implemented the post/wait directives with MPI_Send/Recv.
This year, XMP implements them with the RDMA of the K computer.

    Version      # Nodes   Performance
    Last year    16,384    162 GUPs
    This year    16,384    254 GUPs

[Plot: GUPs versus number of nodes (64 to 16,384), logarithmic axes]

FFT and STREAM
STREAM (SLOC 69, XMP/C) and FFT (SLOC 205, XMP/F): code cleanup and performance improvement.
Please refer to the submission report at http://xcalablemp.org

    Benchmark   Version      # Nodes   Performance
    FFT         Last year    36,864    50 TFlops
    FFT         This year    82,944    212 TFlops
    STREAM      Last year    82,944    2,439 TB/s
    STREAM      This year    82,944    3,583 TB/s

[Plots: FFT in TFlops and STREAM in TB/s versus number of nodes, logarithmic axes]

Comparison of the two versions

Improvement rate (on the same nodes): 37 - 94% improvement (higher is better)

    Benchmark       # Nodes   Ratio
    HPL              4,096    1.37
    RandomAccess    16,384    1.56
    STREAM          16,384    1.47
    FFT             36,864    1.94

SLOC (lower is better)

    Benchmark       Earlier   Later
    HPL             313       426    (Ver. 1 and Ver. 2)
    RandomAccess    250       253
    STREAM           66        69
    FFT             2416      205    (last year's code was work-in-progress, awaiting cleanup)

Outline
1. XcalableMP (XMP) for cluster systems (14 min.)
2. XcalableACC (XACC) for accelerator cluster systems (6 min.)
   Extension of XMP using OpenACC. Sorry, work-in-progress.
The submission report is available at http://xcalablemp.org

What is XcalableACC?
Extension of XMP using OpenACC for accelerator clusters
Features:
  Mix XMP and OpenACC directives seamlessly
  Support transferring data among accelerators directly

Difference between XMP and XACC memory models

XMP memory model
  Global indexing across the host memories of node #1, node #2, ...
  Transfer data among host memories (XMP)

XACC memory model
  Map "global indexing" to accelerators
  Transfer data among host memories (XMP)
  Transfer data between host and ACC (OpenACC)
  Transfer data among ACCs (XACC)

XACC code example: Laplace's equation

    /* Data distribution */
    double u[XSIZE][YSIZE], uu[XSIZE][YSIZE];
    #pragma xmp nodes p(x, y)
    #pragma xmp template t(0:YSIZE-1, 0:XSIZE-1)
    #pragma xmp distribute t(block, block) onto p
    #pragma xmp align [j][i] with t(i,j) :: u, uu
    #pragma xmp shadow uu[1:1][1:1]
    …
    /* Transfer XMP distributed arrays to accelerator */
    #pragma acc data copy(u) copyin(uu)
    {
      for(k=0; k<MAX_ITER; k++){
        /* OpenACC directive parallelizes the loop statement parallelized by XMP directive */
    #pragma xmp loop (y,x) on t(y,x)
    #pragma acc parallel loop collapse(2)
        for(x=1; x<XSIZE-1; x++)
          for(y=1; y<YSIZE-1; y++)
            uu[x][y] = u[x][y];
        /* Exchange halo region of uu[][] */
    #pragma xmp reflect (uu) acc
    #pragma xmp loop (y,x) on t(y,x)
    #pragma acc parallel loop collapse(2)
        for(x=1; x<XSIZE-1; x++)
          for(y=1; y<YSIZE-1; y++)
            u[x][y] = (uu[x-1][y]+uu[x+1][y]+
                       uu[x][y-1]+uu[x][y+1])/4.0;
      } // end k
    } // end data

When the "acc" clause is specified in an XMP communication directive, data on the accelerator is transferred.
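For reference, the computation that the directives distribute is an ordinary Jacobi iteration. Below is a minimal single-node sketch in plain C with no directives, assuming a small fixed grid and that the boundary values of u stay constant. XSIZE and YSIZE are set to 8 only for illustration, and laplace_serial is a hypothetical name.

```c
#include <assert.h>

#define XSIZE 8
#define YSIZE 8

/* Runs `iters` Jacobi sweeps on u, keeping the boundary values fixed.
   Mirrors the two loop nests of the XACC example on one node. */
void laplace_serial(double u[XSIZE][YSIZE], int iters)
{
    assert(iters >= 0);
    double uu[XSIZE][YSIZE];

    for (int k = 0; k < iters; k++) {
        /* Copy the whole grid, boundary included, so the stencil below can
           read the fixed boundary values from uu. */
        for (int x = 0; x < XSIZE; x++)
            for (int y = 0; y < YSIZE; y++)
                uu[x][y] = u[x][y];

        /* The XACC version exchanges the halo of uu between nodes at this
           point (reflect). In one address space nothing needs exchanging. */

        /* Each interior point becomes the average of its four neighbours. */
        for (int x = 1; x < XSIZE - 1; x++)
            for (int y = 1; y < YSIZE - 1; y++)
                u[x][y] = (uu[x - 1][y] + uu[x + 1][y] +
                           uu[x][y - 1] + uu[x][y + 1]) / 4.0;
    }
}
```

With the boundary held at 1.0 and the interior starting at 0.0, the interior converges toward 1.0. A distributed version can be checked against this serial result.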
