Batch Systems
Running your jobs on an HPC machine
Outline
What are batch systems?
Why are they needed?
How to run jobs on an HPC machine via a batch system:
  Concepts
  Resource scheduling and job execution
  Job submission
What are they and why do we need them?
computing resources
many jobs (“production runs”)
(demand usually exceeds supply)
licences
exceeds a time limit
Diagram: job stages: Write Job Script, Job Queued, Job Executes, Job Finished. Commands: Job Submit Command, Job Status Command, Job Delete Command. Results: Allocated Job ID, Status, Output Files (& Errors).
resources it requires (number of nodes / cores, job time, etc.)
these requirements to become available for your job to use
(specified in advance in your job script):
the full time
to a portion of the machine:
architecture or accelerators such as GPUs, etc.
09:00-20:00 only
influences how soon it will start (higher priority more likely to start sooner)
time used
Command and examples
                      PBS (ARCHER)       SLURM
Job submit command    qsub myjob.pbs     sbatch myjob_sbatch
Job status command    qstat -u $USER     squeue -u $USER
Job delete command    qdel ########      scancel ########

PBS job state (ARCHER)    Meaning
Q                         The job is queued and waiting to start
R                         The job is currently running
E                         The job is currently exiting
H                         The job is held and not eligible to run
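The state codes listed above can be decoded in a script, for example when post-processing the output of the job status command. The following is a minimal sketch; `pbs_state_meaning` is a hypothetical helper name, not a PBS command, and it covers only the four states shown here.

```shell
#!/bin/bash
# Translate a PBS job state code (as listed for ARCHER) into its meaning.
# pbs_state_meaning is a hypothetical helper, not part of PBS itself.
pbs_state_meaning() {
    case "$1" in
        Q) echo "queued and waiting to start" ;;
        R) echo "currently running" ;;
        E) echo "currently exiting" ;;
        H) echo "held and not eligible to run" ;;
        *) echo "unknown state: $1"; return 1 ;;
    esac
}

pbs_state_meaning Q
```

Other batch systems, and other PBS installations, may report additional states that this sketch does not handle.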
Use these commands inside a job script to launch a parallel executable
Parallel application launcher commands
aprun -n 48 -N 12 -d 2 my_program (ARCHER)
mpirun -ppn 12 -np 48 my_program
mpiexec -n 48 my_program
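In the aprun and mpirun examples above, 48 instances are launched with 12 placed on each node, so the job needs 4 nodes. A small sketch of that arithmetic follows; `nodes_needed` is a hypothetical helper name introduced only for illustration.

```shell
#!/bin/bash
# nodes_needed TOTAL_TASKS TASKS_PER_NODE
# Hypothetical helper: integer division rounded up, so that a partially
# filled node is still counted as a whole node.
nodes_needed() {
    local total=$1 per_node=$2
    echo $(( (total + per_node - 1) / per_node ))
}

nodes_needed 48 12   # 48 tasks at 12 per node
```

The result is the kind of value that would go into the node request of the job script, assuming the machine allocates whole nodes to a job.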
PBS example:
#!/bin/bash --login
#PBS -N Weather1
#PBS -l select=200
#PBS -l walltime=1:00:00
#PBS -q short
cd $PBS_O_WORKDIR
aprun -n 4800 weathersim

#!/bin/bash --login: Linux shell to run job script in
-N Weather1: Job name
-l select=200: Number of nodes requested
-l walltime=1:00:00: Requested job duration
-q short: Queue to submit job to
cd $PBS_O_WORKDIR: Changing to directory to run in
aprun: Parallel job launcher
-n 4800: Number of parallel instances of program to launch
weathersim: Program name
SLURM example:
#!/bin/bash
#SBATCH -J Weather1
#SBATCH --nodes=2
#SBATCH --time=12:00:00
#SBATCH --ntasks=24
#SBATCH -p tesla
mpirun -np 24 weathersim

#!/bin/bash: Linux shell to run job script in
-J Weather1: Job name
--nodes=2: Number of nodes requested
--time=12:00:00: Requested job duration
--ntasks=24: Number of parallel tasks
-p tesla: Queue to submit job to (GPU queue)
mpirun: Parallel job launcher
-np 24: Number of parallel instances of program to launch
weathersim: Program name
Testing, development and visualisation
interactively
machine
need to request an interactive job from the batch scheduler
size, queue, etc.):
qsub -I -l select=1,walltime=0:10:0 -A y14 -q short
using parallel launcher (aprun, mpirun, etc.) as for batch jobs
interactive jobs
ARCHER)
How does the scheduler decide which job to run when?
sizes on system to ensure
policy that varies from machine to machine, allowing control
resources, calculate when the required resources will become available and schedule A to run at that future time.
and for which sufficient resources are currently available
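The bullets above are incomplete, but they appear to describe a backfill-style policy: a waiting job A is given a future start time, and another job may run in the meantime if resources are free now. The sketch below assumes that reading. `can_backfill` and its four parameters are hypothetical, and real schedulers apply site-specific policy on top of a check like this.

```shell
#!/bin/bash
# can_backfill FREE_NODES SECONDS_UNTIL_A_STARTS B_NODES B_WALLTIME_SECONDS
# Hypothetical check: succeeds (exit status 0) if job B fits in the nodes
# that are free now AND its requested walltime ends no later than the
# reserved start time of job A, so that running B does not delay A.
can_backfill() {
    local free_nodes=$1 until_a=$2 b_nodes=$3 b_walltime=$4
    [ "$b_nodes" -le "$free_nodes" ] && [ "$b_walltime" -le "$until_a" ]
}

if can_backfill 10 3600 4 1800; then
    echo "B can run now without delaying A"
else
    echo "B must wait"
fi
```

Under this assumption, requesting a walltime close to what a job really needs makes it more likely to fit into such a gap.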
http://archer.ac.uk/status/
Scheduling coefficient = runtime / (runtime + queuedtime)
Statistics over last year:
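The scheduling coefficient formula above can be evaluated for a single job as in the following sketch. `sched_coeff` is a hypothetical helper name; it assumes awk is available, that both times are given in the same unit, and that their sum is not zero.

```shell
#!/bin/bash
# sched_coeff RUNTIME QUEUEDTIME
# Hypothetical helper implementing runtime / (runtime + queuedtime).
# Both arguments must use the same unit (e.g. seconds) and must not both be 0.
sched_coeff() {
    awk -v run="$1" -v queued="$2" \
        'BEGIN { printf "%.2f\n", run / (run + queued) }'
}

sched_coeff 3600 1200   # one hour of runtime after 20 minutes queued: 0.75
```

A value of 1 means the job did not queue at all, and a value of 0.5 means it queued for as long as it ran.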
Tips for using HPC batch systems
jobs without first testing may burn resources without producing good results)
script and submit it to the batch system (e.g. to short queue)
command line (bash or other) available in scripts
is very easy to lose the results of a large simulation due to a typo (or unforeseen error) in a script
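One cheap guard against typos, in the spirit of the tips above, is to have bash parse the job script before it is submitted. The sketch below is only an illustration: `check_job_script` is a hypothetical helper, and `bash -n` catches syntax errors only, not wrong paths, wrong resource requests or logic errors.

```shell
#!/bin/bash
# check_job_script FILE
# Hypothetical pre-submission check: confirm the file exists and that bash
# can parse it. "bash -n" reads the script without executing any commands.
check_job_script() {
    local script=$1
    [ -f "$script" ] || { echo "missing: $script"; return 1; }
    bash -n "$script" 2>/dev/null || { echo "syntax error: $script"; return 1; }
    echo "ok: $script"
}

# Demonstration on a throwaway script.
demo=$(mktemp)
printf '#!/bin/bash\necho "hello"\n' > "$demo"
check_job_script "$demo"
rm -f "$demo"
```

A short test run on a small number of nodes is still needed to catch the errors that a syntax check cannot see.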
Changing your scripts from one batch system to another
batch systems/HPC resources: https://github.com/aturner-epcc/bolt
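To illustrate why scripts differ between batch systems, the sketch below prints the directives used in the PBS and SLURM examples of this talk from one set of job parameters. It is not how bolt works and is not part of bolt; `job_header` is a hypothetical helper that handles only job name, node count and walltime.

```shell
#!/bin/bash
# job_header SYSTEM NAME NODES WALLTIME
# Hypothetical helper (not part of bolt): print scheduler directives for
# either PBS or SLURM from the same job parameters.
job_header() {
    local system=$1 name=$2 nodes=$3 walltime=$4
    case "$system" in
        pbs)
            echo "#PBS -N $name"
            echo "#PBS -l select=$nodes"
            echo "#PBS -l walltime=$walltime"
            ;;
        slurm)
            echo "#SBATCH -J $name"
            echo "#SBATCH --nodes=$nodes"
            echo "#SBATCH --time=$walltime"
            ;;
        *)
            echo "unsupported batch system: $system" >&2
            return 1
            ;;
    esac
}

job_header slurm Weather1 2 12:00:00
```

Queue names, launcher commands and per-node task counts also change from machine to machine, so a real conversion needs more than a change of directive syntax.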
resources on HPC systems and maximise utilisation
while they queue and run
utilisation according to policy