Collecting Application-Level Job Completion Statistics (CUG 2010)

Presented at CUG 2010, Edinburgh, by Matthew Ezell, HPC Systems Administrator, National Institute for Computational Sciences, University of Tennessee.


  1. Collecting Application-Level Job Completion Statistics
     CUG 2010, Edinburgh
     Matthew Ezell, HPC Systems Administrator

  2. National Institute for Computational Sciences, University of Tennessee
     • NICS is the latest NSF HPC center
     • Kraken: #3 on the Top 500
       – 1.030 petaflops peak; 831.7 teraflops Linpack
       – First academic petaflop
     • Athena: #30 on the Top 500
       – 166 teraflops peak; 125 teraflops Linpack

  3. Motivation and Goals
     • Need for statistics on the frequency and nature of job failures
     • XT systems produce massive amounts of log data
       – Some job-level error messages appear only in the job's standard output or standard error
     • The tool should be able to explain "cryptic" error messages to users
     • It should not increase job walltime or change the user experience

  4. Design: apwrap Data Flow
     [Data flow diagram; labels not recoverable]

  5. Design: Prologues and Epilogues
     • Allow arbitrary, system-defined programs to run before and after aprun execution
     • Should be able to send messages to the user and/or prevent the application from being launched
     • Can be integrated with other tools, such as the Automatic Library Tracking Database (ALTD) at NICS
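The prologue and epilogue idea can be sketched roughly as follows, in Python. This is an illustration under stated assumptions, not apwrap's code: the slide describes system-defined programs, whereas this sketch uses Python callables, and the names `run_wrapped`, `prologues`, and `epilogues` as well as the refusal exit code of 1 are hypothetical.

```python
import subprocess
import sys
from typing import Callable, List, Optional, Sequence

# Hypothetical hook types: a prologue returns a refusal message (or None to
# allow the launch); an epilogue receives the command and its exit code.
Prologue = Callable[[Sequence[str]], Optional[str]]
Epilogue = Callable[[Sequence[str], int], None]


def run_wrapped(
    command: Sequence[str],
    prologues: List[Prologue],
    epilogues: List[Epilogue],
) -> int:
    """Run prologues, then the command, then epilogues; return the exit code."""
    for prologue in prologues:
        refusal = prologue(command)
        if refusal is not None:
            # A prologue may send a message to the user and block the launch.
            print(refusal, file=sys.stderr)
            return 1  # assumed exit code for a refused launch
    exit_code = subprocess.run(list(command)).returncode
    for epilogue in epilogues:
        epilogue(command, exit_code)
    return exit_code
```

In this sketch a refused launch skips the epilogues entirely; whether the real tool behaves that way is not stated in the slides.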

  6. Design: Example Rules
     rules => [{
         name    => 'NODEFAIL',
         pattern => '^\[NID \d+\] \d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} Apid \d+ killed. Received node failed or halted event for nid (\d+)',
         message => 'A compute node had a hardware failure. Please resubmit your job.'
     },{
         name    => 'SEGFAULT',
         pattern => '^_pmii_daemon\(SIGCHLD\): PE \d+ exit signal Segmentation fault',
         message => 'A node experienced a segmentation fault. This happens when the code attempts to access a memory location that it is not allowed to.'
     }]
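A minimal sketch of how rules like these might be applied to a job's output, written in Python rather than the Perl syntax of the slide. The two patterns and messages are copied from the slide; the `Rule` type, the `classify` function, and the first-match-wins behavior are assumptions for illustration.

```python
import re
from typing import Iterable, NamedTuple, Optional, Tuple


class Rule(NamedTuple):
    name: str
    pattern: re.Pattern
    message: str


# Patterns and messages copied from the example rules on the slide.
RULES = [
    Rule(
        "NODEFAIL",
        re.compile(
            r"^\[NID \d+\] \d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} Apid \d+ killed. "
            r"Received node failed or halted event for nid (\d+)"
        ),
        "A compute node had a hardware failure. Please resubmit your job.",
    ),
    Rule(
        "SEGFAULT",
        re.compile(r"^_pmii_daemon\(SIGCHLD\): PE \d+ exit signal Segmentation fault"),
        "A node experienced a segmentation fault. This happens when the code "
        "attempts to access a memory location that it is not allowed to.",
    ),
]


def classify(lines: Iterable[str]) -> Optional[Tuple[Rule, re.Match]]:
    """Return the first (rule, match) pair found in the output, or None."""
    for line in lines:
        for rule in RULES:
            match = rule.pattern.search(line)
            if match:
                return rule, match
    return None
```

The captured group in the NODEFAIL pattern yields the failed node ID, so a caller could record it alongside the error name.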

  7. Sample Database Entry
     id           | 189
     username     | user1
     system       | athena
     pbsserver    | nid00004
     batchid      | 68122.nid00004
     batchidnum   | 68122
     apid         | 1290954
     batch_node   | aprun3
     pwd          | /lustre/scratch/user1
     arguments    | -n 4096 -N 1 -d 4 binary
     pes          | 4096
     pes_per_node | 1
     depth        | 4
     user_binary  | /lustre/scratch/user1/binary
     mpmd         | f
     pid          | 18367
     start_time   | 1270358965
     exit_time    | 1270366985
     Duration     | 8020
     exit_code    | 1
     error_name   | NODEFAIL
     error_string | [NID 15050] 2010-04-04 03:42:45 Apid 1290954 killed. Received node failed or halted event for nid 15051
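Several fields in the sample entry look derivable from others: the duration equals `exit_time` minus `start_time`, and `pes`, `pes_per_node`, and `depth` mirror the `-n`, `-N`, and `-d` values in `arguments`. The Python sketch below shows one way to derive them. The function names are hypothetical, only these three aprun flags are handled, and how the real tool fills these columns is not described in the slides.

```python
import argparse
import shlex


def parse_aprun_arguments(arguments: str) -> dict:
    """Extract pes (-n), pes_per_node (-N), and depth (-d) from an argument string."""
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("-n", dest="pes", type=int)
    parser.add_argument("-N", dest="pes_per_node", type=int)
    parser.add_argument("-d", dest="depth", type=int)
    # Anything not recognized (such as the binary name) is returned in `rest`.
    known, rest = parser.parse_known_args(shlex.split(arguments))
    return {
        "pes": known.pes,
        "pes_per_node": known.pes_per_node,
        "depth": known.depth,
        "binary": rest[0] if rest else None,
    }


def duration_seconds(start_time: int, exit_time: int) -> int:
    """Elapsed run time in seconds from two Unix timestamps."""
    return exit_time - start_time
```

With the values from the sample entry, `duration_seconds(1270358965, 1270366985)` gives 8020, matching the stored duration.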

  8. Successful Completion Rate
     • Completed successfully: 87%
     • Exited non-zero: 13%

  9. Types of Errors Experienced
     • MPI_ABORT: 61%
     • KILLED: 16%
     • NID_UNKNOWN: 10%
     • EXE_NOTFOUND: 7%
     • OOM: 2%
     • SEGFAULT: 2%
     • EXCEEDS_ALLOC: 1%
     • FLOAT_EXCEPTION: 1%
     • APRUN_ARGS: 0%
     • NODEFAIL: 0%
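A breakdown like this one can be tallied from the `error_name` column of the job records. The Python sketch below is only an illustration: the function name `error_breakdown` is hypothetical, and the slides do not say how the published percentages were computed or rounded.

```python
from collections import Counter
from typing import Dict, Iterable


def error_breakdown(error_names: Iterable[str]) -> Dict[str, int]:
    """Return each error name's share of all failures as a whole-number percentage."""
    counts = Counter(error_names)
    total = sum(counts.values())
    if total == 0:
        return {}
    # most_common() orders the categories from most to least frequent.
    return {name: round(100 * count / total) for name, count in counts.most_common()}
```

With whole-number rounding, a category that has a few entries in a large sample is reported as 0%.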

  10. MPI_ABORT (61%)
      • The code purposely calls this function
      • May occur if
        – an input file could not be found
        – the algorithm reaches numeric instability
        – a call to malloc() returns a NULL pointer
        – etc.
      • Usually not a system problem

  11. KILLED (16%)
      • Two causes
        – Job runs out of walltime and the batch system kills it
        – User chooses to kill the job or application
      • Extended walltime may be due to a system problem, but it is difficult to tell

  12. NID_UNKNOWN (10%)
      • Usually code-specific
        The last 50 lines from stderr follow:
        wks.c: Error in opngks_(): Could not open "./20100517-gmeta/comref-2010051700_spg40-24h.gmeta"
        FORTRAN STOP
        [NID 00078] 2010-05-17 11:57:19 Apid 1409935: initiated application termination
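The header "The last 50 lines from stderr follow:" suggests that a tail of standard error is kept for errors like this one. A bounded buffer is one simple way to do that without holding the whole stream in memory. The Python sketch below assumes that approach; the function names are hypothetical, and the header text is the only part taken from the slide.

```python
from collections import deque
from typing import Iterable, List


def last_lines(stream: Iterable[str], limit: int = 50) -> List[str]:
    """Keep only the final `limit` lines of a stream, stripped of newlines."""
    # A deque with maxlen discards the oldest line each time a new one arrives.
    return list(deque((line.rstrip("\n") for line in stream), maxlen=limit))


def format_report(stream: Iterable[str], limit: int = 50) -> str:
    """Build a report consisting of the header followed by the stderr tail."""
    tail = last_lines(stream, limit)
    return "\n".join([f"The last {limit} lines from stderr follow:"] + tail)
```

Any iterable of lines works as input, including an open file object for a job's stderr file.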

  13. Conclusions
      • Most errors experienced by users are (most likely) due to user error
      • System-level errors are rarer and require administrator involvement to debug

  14. Questions?
      Contact me at ezell@nics.utk.edu
