Cray Centre of Excellence for HECToR


  1. Thomas Edwards and Kevin Roy, Cray Centre of Excellence for HECToR

  2.  This talk is not about how to get maximum performance from a Lustre file system.
      Plenty of information about tuning Lustre performance is already available: previous CUGs, Lustre User Groups.
      This talk is about a way to design applications to be independent of I/O performance.
      It is all about output, though input is technically possible with explicit pre-posting of receives.

  3. [Figure: I/O bandwidth vs. number of processors. Bandwidth rises rapidly as the number of processors increases from zero, then grows more slowly as more processors become involved, approaching peak bandwidth asymptotically. A good percentage of peak is achieved by a proportion of the total processors available.]

  4. [Figure: percent of wallclock spent in checkpoints, strong scaling, assuming 100 MB per processor and 10 GB/s I/O bandwidth. One curve per checkpoint frequency (10 mins, 20 mins, 30 mins, 1 hour, 3 hours); x-axis: number of cores, 1024 to 46340; y-axis: percent of time spent in checkpoint, 0% to 90%. The fraction grows steeply with core count, worst for frequent checkpoints.]
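A rough sanity check of the arithmetic behind this chart, on one plausible reading of its stated assumptions (100 MB of checkpoint data per core, so total checkpoint size grows with core count, written at a fixed 10 GB/s):

      32768 cores x 100 MB  =  ~3.3 TB per checkpoint
      3.3 TB / 10 GB/s      =  ~330 s to write

At a 10-minute checkpoint frequency that is over half of each interval spent writing, whereas at 1024 cores the same write (~10 s) is negligible. This is the steep growth the curves show.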

  5.  As applications show good weak scaling to ever larger numbers of processors, the proportion of time spent writing results will increase.
      It is not always necessary for applications to finish writing before continuing computation, provided the data is cached in memory.
      Therefore I/O can be overlapped with computation.
      This I/O could be performed by only a fraction of the processors used for computation and still achieve good I/O bandwidth.

  6.  Developed by Prof K. Taylor and team at Queen's University, Belfast.
      Solves the time-dependent Schrödinger equation for two electrons in a helium atom interacting with a laser pulse.
      Parallelised using domain decomposition and MPI.
      Very computationally intensive; uses high-order methods to integrate the PDEs.
      Larger problems result in larger checkpoints.
      The I/O component is being optimised as part of a Cray Centre of Excellence for HECToR project, preparing the code for the next-generation machine.

  7. • Upper-triangular domain decomposition:

           0  1  2  3  4  5
              6  7  8  9 10
                11 12 13 14
                   15 16 17
                      18 19
                         20

     • Does not fit HDF5 or MPI-IO models cleanly
     • Regular checkpoints
     • File-per-process I/O, 50 MB per file
     • Scientific data extracted from checkpoint data

  8. [Diagram: two timelines. Standard sequential I/O alternates Compute, I/O, Compute, I/O, ... so the run lengthens with every write. Asynchronous I/O overlaps each I/O phase with the next compute phase, so the compute timeline runs uninterrupted.]

  9. Compute Node:

        do i = 1, time_steps
           compute( j )
           checkpoint( data )
        end do

        subroutine checkpoint( data )
           MPI_Wait( send_req )
           buffer = data
           MPI_Isend( IO_SERVER, buffer )
        end subroutine

     I/O Server:

        do i = 1, time_steps
           do j = 1, compute_nodes
              MPI_Recv( j, buffer )
              write( buffer )
           end do
        end do

     Receiving in rank order enforces sequential processing.
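Fleshed out, the same pattern might look like the following self-contained sketch. The buffer size, step count, and the use of the last rank as the I/O server are illustrative assumptions, not details from the original code:

     program async_io_sketch
       use mpi
       implicit none
       integer, parameter :: NSTEPS = 10, BUFLEN = 1024
       integer :: ierr, rank, nprocs, io_server, i, j
       integer :: send_req, status(MPI_STATUS_SIZE)
       double precision :: work(BUFLEN), buffer(BUFLEN)

       call MPI_Init(ierr)
       call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
       call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
       io_server = nprocs - 1               ! assumption: last rank is the I/O server

       if (rank == io_server) then
         do i = 1, NSTEPS
           do j = 0, nprocs - 2             ! receive strictly in rank order
             call MPI_Recv(buffer, BUFLEN, MPI_DOUBLE_PRECISION, j, 0, &
                           MPI_COMM_WORLD, status, ierr)
             ! ... write(buffer) to disk here ...
           end do
         end do
       else
         send_req = MPI_REQUEST_NULL
         do i = 1, NSTEPS
           work = real(i, kind(work))       ! stand-in for compute()
           ! checkpoint(work): finish the previous send, cache, send again
           call MPI_Wait(send_req, status, ierr)
           buffer = work                    ! cache so compute may overwrite work
           call MPI_Isend(buffer, BUFLEN, MPI_DOUBLE_PRECISION, io_server, 0, &
                          MPI_COMM_WORLD, send_req, ierr)
         end do
         call MPI_Wait(send_req, status, ierr)
       end if
       call MPI_Finalize(ierr)
     end program async_io_sketch

Each compute rank only blocks in MPI_Wait if its previous checkpoint has not yet been drained by the server, which is exactly the overlap the slide describes.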

  10. Compute Node:

        do i = 1, time_steps
           compute( j )
           checkpoint( data )
        end do

        subroutine checkpoint( data )
           MPI_Wait( send_req )
           buffer = data
           MPI_Isend( IO_SERVER, buffer )
        end subroutine

      I/O Server:

        do i = 1, time_steps
           do j = 1, compute_nodes
              MPI_Irecv( j, buffer(j), req(j) )
           end do
           do j = 1, compute_nodes
              MPI_Waitany( req, j, buffer )
              write( buffer(j) )
           end do
        end do

      Receives complete in any order, but the server requires a lot more buffer space.
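The server-side change is small. A sketch of one checkpoint sweep with pre-posted receives (the routine name and the one-server rank layout are assumptions):

     subroutine serve_one_checkpoint(nprocs, buflen, bufs)
       use mpi
       implicit none
       integer, intent(in) :: nprocs, buflen
       double precision, intent(out) :: bufs(buflen, nprocs - 1)
       integer :: reqs(nprocs - 1), idx, j, ierr, status(MPI_STATUS_SIZE)

       ! pre-post one receive per compute node (ranks 0 .. nprocs-2)
       do j = 1, nprocs - 1
         call MPI_Irecv(bufs(:, j), buflen, MPI_DOUBLE_PRECISION, j - 1, 0, &
                        MPI_COMM_WORLD, reqs(j), ierr)
       end do
       ! complete them in whatever order the data arrives
       do j = 1, nprocs - 1
         call MPI_Waitany(nprocs - 1, reqs, idx, status, ierr)
         ! ... write bufs(:, idx) to disk here; idx is the completed slot ...
       end do
     end subroutine serve_one_checkpoint

The cost is one pre-posted buffer per compute node, which is the extra buffer space the slide warns about.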

  11. • Many compute nodes per I/O server
      • All compute nodes transmitting (almost) simultaneously
      • Potentially too many incoming messages or pre-posted receives
      • Overloads the I/O server

  12. Compute Node:

        do i = 1, time_steps
           compute()
           send_io_data()
           checkpoint()
        end do

        subroutine send_io_data()
           if ( data_to_send ) then
              MPI_Test( pinged )
              if ( pinged ) then
                 MPI_Isend( buffer, req )
                 data_to_send = .false.
              end if
           end if
        end subroutine

        subroutine checkpoint( data )
           send_io_data()
           MPI_Wait( req )
           buffer = data          ! Cache data
           data_to_send = .true.
        end subroutine

      I/O Server:

        do i = 1, time_steps
           do j = 1, compute_nodes
              MPI_Send( j )       ! Ping
              MPI_Recv( j, buffer )
              write( buffer )
           end do
        end do

      Enforces the order of processing: sequential, but only one message reaches the server at a time. However, send_io_data is called so infrequently that data is rarely sent.
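A compute-side sketch of this handshake. The module name io_client, the server rank, and the tag convention (tag 0 for a 1-integer ping from the server, tag 1 for data) are all assumptions; note that in this sketch an unsent buffer is simply overwritten by the next checkpoint, matching the slide's observation:

     module io_client
       use mpi
       implicit none
       integer, parameter :: BUFLEN = 1024
       integer :: io_server = 0                 ! assumed server rank
       double precision :: buffer(BUFLEN)
       logical :: data_to_send = .false.
       integer :: ping_req = MPI_REQUEST_NULL
       integer :: send_req = MPI_REQUEST_NULL
       integer :: ping_buf
     contains
       subroutine send_io_data()
         integer :: ierr, status(MPI_STATUS_SIZE)
         logical :: pinged
         if (data_to_send) then
           ! has the server's ping arrived yet?
           call MPI_Test(ping_req, pinged, status, ierr)
           if (pinged) then
             call MPI_Isend(buffer, BUFLEN, MPI_DOUBLE_PRECISION, io_server, 1, &
                            MPI_COMM_WORLD, send_req, ierr)
             data_to_send = .false.
           end if
         end if
       end subroutine send_io_data

       subroutine checkpoint(work)
         double precision, intent(in) :: work(BUFLEN)
         integer :: ierr, status(MPI_STATUS_SIZE)
         call send_io_data()
         call MPI_Wait(send_req, status, ierr)  ! previous send must finish
         buffer = work                          ! cache; an unsent buffer is
         data_to_send = .true.                  ! silently replaced here
         if (ping_req == MPI_REQUEST_NULL) then ! listen for the next ping
           call MPI_Irecv(ping_buf, 1, MPI_INTEGER, io_server, 0, &
                          MPI_COMM_WORLD, ping_req, ierr)
         end if
       end subroutine checkpoint
     end module io_client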

  13. [Diagram: compute timelines with one I/O server pinging the compute nodes one at a time versus two at a time.]

  14. Compute Node:

        do i = 1, time_steps
           do j = 1, sections
              compute_section( j )
              send_io_data()
           end do
           checkpoint()
        end do

        (send_io_data and checkpoint as in slide 12)

      I/O Server:

        do i = 1, time_steps
           do j = 1, compute_nodes
              MPI_Send( j )       ! Ping
              MPI_Recv( j, buffer )
              write( buffer )
           end do
        end do

      send_io_data is now called more frequently, so there is a greater chance of success. The greater the frequency of calls, the more efficient the transfer, but the higher the load on the system.
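Using the hypothetical io_client module sketched above, the restructured time step might look like this; compute_section is a stand-in, and the matching server loop from this slide is assumed to be running on rank io_server:

     program sectioned_step
       use mpi
       use io_client                    ! hypothetical helper module above
       implicit none
       integer, parameter :: time_steps = 10, sections = 8
       integer :: i, j, ierr
       double precision :: work(BUFLEN)

       call MPI_Init(ierr)
       do i = 1, time_steps
         do j = 1, sections
           work = real(i, kind(work))   ! stand-in for compute_section(j)
           call send_io_data()          ! cheap no-op when no ping is pending
         end do
         call checkpoint(work)
       end do
       ! a real code would drain the outstanding send and ping requests here
       call MPI_Finalize(ierr)
     end program sectioned_step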

  15. [Diagram: the I/O server timeline alternates Ping, Wait, and I/O while the compute nodes execute many short compute sections; each send_io_data call between sections is an interrupt point where a pending transfer can make progress.]

  16. [Graph: time (s) vs. number of processors (64 to 65536), comparing runs with and without I/O servers; y-axis 0.0 to 200.0 s.]

  17. [Graph: time (s) vs. number of processors (64 to 32768), comparing runs with and without I/O servers; y-axis 300 to 700 s.]

  18.  Using MPI, messages have to be sent from the compute nodes to the I/O server.
       To prevent overloading the I/O server, the compute nodes have to actively check for permission to send messages.
       It is simpler to have the I/O server pull the data from the compute nodes when it is ready.
       SHMEM is a single-sided communications API supported on Cray systems.
       SHMEM supports remote push and remote pull of distributed data over the network.
       It can be directly integrated with MPI on Cray architectures.

  19. Compute Node:

        do i = 1, time_steps
           compute()
           checkpoint()
        end do

        subroutine checkpoint( data )
           if ( .not. CP_DONE ) then
              wait_until( flag, CP_DONE )
           end if
           buffer = data          ! Cache data
           flag = DATA_READY
        end subroutine

      I/O Server:

        do
           do j = 1, compute_nodes
              get( j, local_flag )
              if ( local_flag == DATA_READY ) then
                 get( j, buffer )
                 write( buffer )
                 put( j, flag, CP_DONE )
              end if
           end do
        end do

      • The I/O server is slightly more complicated: it constantly polls the compute nodes, one message at a time.
      • The compute-node code becomes much simpler: no requirement to explicitly send data, and the polling interrupt is done by the system libraries.
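A minimal sketch of the one-sided scheme in Fortran SHMEM. The flag encoding, names, and header are assumptions, and exact routine and header names vary slightly between Cray SHMEM and OpenSHMEM versions (e.g. shmem_init vs. start_pes); shmem.fh is assumed to declare shmem_my_pe and shmem_n_pes:

     program shmem_io_sketch
       implicit none
       include 'shmem.fh'
       integer, parameter :: BUFLEN = 1024, NSTEPS = 10
       integer(kind=8), parameter :: DATA_READY = 1, CP_DONE = 2
       ! symmetric (statically allocated) data, addressable by remote PEs
       integer(kind=8), save :: flag = CP_DONE      ! start "done" so step 1 proceeds
       double precision, save :: buffer(BUFLEN)
       integer(kind=8) :: local_flag, done_val
       double precision :: local_buf(BUFLEN), work(BUFLEN)
       integer :: me, npes, pe, step, written

       call shmem_init()
       me = shmem_my_pe()
       npes = shmem_n_pes()

       if (me == npes - 1) then                     ! assumption: last PE is the I/O server
         written = 0
         do while (written < NSTEPS * (npes - 1))
           do pe = 0, npes - 2                      ! poll every compute PE
             call shmem_get8(local_flag, flag, 1, pe)
             if (local_flag == DATA_READY) then
               call shmem_get8(local_buf, buffer, BUFLEN, pe)  ! pull the data
               ! ... write(local_buf) to disk here ...
               done_val = CP_DONE
               call shmem_put8(flag, done_val, 1, pe)          ! release the node
               written = written + 1
             end if
           end do
         end do
       else
         do step = 1, NSTEPS
           work = real(step, kind(work))            ! stand-in for compute()
           ! block until the server has pulled the previous checkpoint
           call shmem_int8_wait_until(flag, SHMEM_CMP_EQ, CP_DONE)
           buffer = work                            ! cache the new checkpoint
           flag = DATA_READY                        ! set after buffer, so the
         end do                                     ! server sees a complete buffer
       end if
     end program shmem_io_sketch

The server only ever pulls data it has seen flagged as ready, so the compute node never initiates a message, which is the simplification the slide claims.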

  20. [Diagram: the I/O server polls each compute node in turn ("Not ready", "Not ready", "Ready!"), pulling data only from nodes whose flag shows a checkpoint is waiting.]

  21. [Diagram: the server timeline alternates "Check ready?" polls with I/O, while the compute nodes run uninterrupted apart from an occasional Wait.]

  22.  I/O servers introduce additional communication to the application.
       Does this additional load affect the application's overall performance?
       [Diagram: time steps that coincide with an I/O phase vs. time steps during the I/O servers' idle phase.]
       Tests measured the wall-clock time to complete standard model time steps during I/O communications and during I/O idle time.

  23.  An average time step took 9.31 s with MPI and 9.72 s with SHMEM.
       86% of time steps fell during idle time using MPI, 75% with SHMEM.
       Using MPI, time steps during the I/O phase cost 2.33% more; with SHMEM, 0.19% more.

  24. Fewer I/O server processors: greater risk to checkpoint data, since it takes longer before writing is complete, but the time the I/O servers sit idle is minimised.
      More I/O server processors: reduced risk to checkpoint data, which is written out at the fastest possible speed, but the I/O servers are idle most of the time.
      [Diagram: compute and I/O-server timelines for each configuration.]

  25. [Diagram: the trade-off between efficiency and performance.]

  26. I/O Communicators

  27. Another application on the system writes out a checkpoint at the same time; bandwidth is shared between jobs, so the effective application I/O bandwidth is halved.
      With standard sequential I/O, the write time doubles and the total run time increases.
      With asynchronous I/O, the same event leaves the total run time constant.
      [Diagram: the sequential and asynchronous timelines under this contention.]

  28.  I/O server idle time could be put to good use:
       Performing post-processing on data structures: averages, sums.
       Restructuring data (transposes etc.).
       Repacking data (to HDF5, NetCDF etc.).
       Compression (RLE, block sort).
       Aggregating information between multiple jobs: collecting information from multiple jobs and performing calculations.
       Ideally large numbers of small tasks: short jobs that can be scheduled between I/O operations, as serial processes or as parallel tasks over the I/O servers.
       I/O servers could become multi-threaded to increase responsiveness.
