
SLIDE 1

SLURM. Our Way.

Douglas Jacobsen, James Botts, Helen He NERSC

CUG 2016

SLIDE 2

NERSC Vital Statistics

  • 860 active projects

○ DOE selects projects and PIs, allocates most of our computer time

  • 7750 active users
  • 700+ codes both established and in-development
  • edison XC30, 5586 ivybridge nodes

○ Primarily used for large capability jobs
○ Small to midrange jobs as well
○ Moved edison from Oakland, CA to Berkeley, CA in Dec 2015

  • cori phase 1 XC40, 1628 haswell nodes

○ DataWarp
○ realtime jobs for experimental facilities
○ massive quantities of serial jobs
○ regular workload too
○ shifter

SLIDE 3

Native SLURM at NERSC

Why native?

1. Enables direct support for serial jobs
2. Simplifies operation by easing prolog/epilog access to compute nodes
3. Simplifies user experience

a. No shared batch-script nodes
b. Similar to other cluster systems

4. Enables new features and functionality on existing systems

5. Creates a "platform for innovation"

[Diagram: primary slurmctld on a repurposed "net" node plus a backup slurmctld among the compute nodes; slurmdbd backed by mysql; ldap, rsip, and eslogin nodes on the external side]

/dsl/opt/slurm/default slurm.conf: ControlAddr left unset to allow slurmctld traffic to use ipogif0, owing to lookup of the nid0xxxx hostname.

/opt/slurm/default slurm.conf: ControlAddr overridden to force slurmctld traffic over the ethernet interface.
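A minimal sketch of the two slurm.conf variants described above; the hostname and address are illustrative placeholders, not NERSC's actual values:

# compute-node copy (/dsl/opt/slurm/default), sketch:
ControlMachine=nid00009        # hypothetical slurmctld host
# ControlAddr left unset: the nid0xxxx name resolves over ipogif0

# eslogin copy (/opt/slurm/default), sketch:
ControlMachine=nid00009        # hypothetical slurmctld host
ControlAddr=10.10.0.9          # hypothetical address, forces ethernet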

SLIDE 4

Basic CLE 5.2 Deployment

Challenge: Upgrade native SLURM.
Issue: SLURM installs to /dsl/opt/slurm/<version>, with a symlink named "default".
→ Changing the symlink can have little impact on the actual version "pointed to" on compute nodes.
Result: Sites often receive the recommendation to reboot the supercomputer after upgrading.

Challenge: NERSC patches SLURM often and is not interested in rebooting.
Issue: The /dsl DVS mount's attribute cache prevents proper dereference of the "default" symlink.
Solution: Mount /dsl/opt/slurm a second time with a short (15s) attrcache.
Result: NERSC can live-upgrade without rebooting.

Also moved the slurm sysconfdir to /opt/slurm/etc, where etc is a symlink to conf.<rev>, to work around a rare DVS issue.

Original Method:

/opt/slurm/15.08.xx_instTag_20150912xxxx
/opt/slurm/default -> /etc/alternatives/slurm
/etc/alternatives/slurm -> /opt/slurm/15.08.xx_...

Production Method:

/opt/slurm/15.08.xx_instTag_20150912xxxx
/opt/slurm/default -> 15.08.xx_instTag_20150912xxxx

AND Compute node /etc/fstab:

/opt/slurm  /dsl/opt/slurm  dvs \
    path=/dsl/opt/slurm,nodename=<dslNidList>, \
    <opts>,attrcache_timeout=15
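A sketch of the resulting live-upgrade flow; the version tag is hypothetical, and the 15-second wait mirrors the attrcache_timeout above:

# install the new build alongside the running one (tag hypothetical)
./configure --prefix=/opt/slurm/15.08.yy_instTag_20160101xxxx
make install

# atomically flip the "default" symlink to the new version
ln -sfn 15.08.yy_instTag_20160101xxxx /opt/slurm/default

# within ~15s the short attrcache lets compute nodes dereference
# the new target; no reboot required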

SLIDE 5

Scaling Up

Sun Jan 24 04:51:29 2016: [unset]:_pmi_alps_get_apid:alps response not OKAY
Sun Jan 24 04:51:29 2016: [unset]:_pmi_init:_pmi_alps_init returned -1
[Sun Jan 24 04:51:30 2016] [c3-0c2s9n3] Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(547):
MPID_Init(203).......: channel initialization failed
MPID_Init(584).......: PMI2 init failed: 1
<repeat ad nauseam for every rank>

Challenge: Small and mid-scale jobs work great! But when MPI ranks exceed ~50,000, users sometimes get the errors above.

Workaround: Increase the PMI timeout from 60s to something bigger in the application environment:

PMI_MMAP_SYNC_WAIT_TIME=300

Problem: srun directly execs the application from its hosting-filesystem location, and the filesystem cannot deliver the application at scale. aprun would copy the executable to an in-memory filesystem by default.

Solution: A new 15.08 srun feature merges sbcast and srun (see the sketch after the list below):

srun --bcast=/tmp/a.out ./mpi/a.out

slurm 16.05 adds a --compress option that delivers the executable in a similar time as aprun.

Other scaling topics:

  • srun ports for stdout/err
  • rsip port exhaustion
  • slurm.conf TreeWidth
  • Backfill tuning
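A minimal batch-script sketch of the --bcast workaround; the node count and paths are illustrative:

#!/bin/bash
#SBATCH -N 4096

# --bcast copies ./mpi/a.out to /tmp/a.out on every compute node
# (in-memory filesystem) before launch, then runs the staged copy,
# sparing the parallel filesystem at scale
srun --bcast=/tmp/a.out ./mpi/a.out

# with slurm >= 16.05, add --compress to speed the broadcast:
#   srun --bcast=/tmp/a.out --compress ./mpi/a.out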
SLIDE 6

Scheduling

[Figure: NERSC workload broken down by job scale. Source: Brian Austin, NERSC]

"NERSC users run applications at every scale to conduct their research."

SLIDE 7

Scheduling

cori

  • "shared" partition

○ Up to 32 jobs per node
○ HINT: set --gres=craynetwork:0 in job_submit.lua for shared jobs (see the sketch after this list)
○ Allows users to submit 10,000 jobs with up to 1,000 concurrently running

  • "realtime" partition

○ Jobs must start within 2 minutes
○ Per-project limits implemented using QOS
○ Top priority jobs + exclusive access to a small number of nodes (92% utilized)

  • burstbuffer QOS gives a constant priority boost to burst buffer jobs

edison

  • big job metric - need to always be running at least one "large" job (>682 nodes)

○ Give priority boost + discount

cori+edison

  • debug partition

○ delivers debug-exclusive nodes
○ more exclusive nodes during business hours

  • regular partition

○ Highly utilized workhorse

  • low and premium QOS

○ accessible in most partitions

  • scavenger QOS

○ Once a user's account balance drops below zero, all jobs are automatically put into scavenger. Eligible for all partitions except realtime.
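A minimal job_submit.lua sketch of the craynetwork hint above, assuming the partition is named "shared"; NERSC's production plugin logic is more involved:

-- job_submit.lua (sketch): shared jobs request zero craynetwork
-- gres so up to 32 of them can pack onto one node
function slurm_job_submit(job_desc, part_list, submit_uid)
    if job_desc.partition == "shared" then
        job_desc.gres = "craynetwork:0"
    end
    return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
    return slurm.SUCCESS
end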

SLIDE 8

Scheduling - How Debug Works

[Diagram: nodes nid00008 through nid05586 split between the debug and regular partitions, with debug holding more nodes during business hours than on nights and weekends]

Debug jobs:

  • are smaller than "regular" jobs
  • are shorter than "regular" jobs
  • have access to all nodes in the system
  • have advantageous priority

Day/Night:

  • cron-run script manipulates the regular partition configuration (scontrol update partition=regular…)
  • during night mode, adds a reservation to prevent long-running jobs from starting on contended nodes

these concepts are extended for cori's realtime and shared partitions
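A sketch of what the cron-driven flip might look like; node ranges, times, and the reservation name are hypothetical:

#!/bin/bash
# business-hours mode: shrink regular so debug gains exclusive nodes
scontrol update PartitionName=regular Nodes=nid[00108-05586]

# night mode widens regular again but adds a reservation so long
# jobs cannot run into the morning on nodes debug will reclaim:
#   scontrol update PartitionName=regular Nodes=nid[00008-05586]
#   scontrol create reservation ReservationName=debug_morning \
#       StartTime=08:00:00 Duration=600 Nodes=nid[00008-00107] Users=root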

SLIDE 9

Scheduling - Backfill

[Diagram: backfill timeline starting at "now", with running jobs and planned resource reservations stacked forward in time, and so on...]

  • NERSC typically has hundreds of running jobs (thousands on cori)
  • Queue frequently 10x larger (2,000 - 10,000 eligible jobs)
  • Much parameter optimization required to get things "working"

○ bf_interval
○ bf_max_job_part
○ bf_max_job_user
○ …

  • We still weren't getting our target utilization (>95%)
  • Still were having long waits with many backfill targets in the queue

New Backfill Algorithm! bf_min_prio_reserve

1. Choose a particular priority value as a threshold
2. Everything above the threshold gets resource reservations
3. Everything below is evaluated with a simple "start now" check (NEW for SLURM)

Utilization jumped on average more than 7% per day. Every backfill opportunity is realized.

Job Prioritization

1. QOS
2. Aging (scaled to 1 point per minute)
3. Fairshare (up to 1440 points)
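A slurm.conf sketch of this tuning; every value is an illustrative placeholder, not NERSC's production setting:

# backfill: reserve resources only above the priority threshold
SchedulerParameters=bf_interval=30,bf_max_job_part=200,bf_max_job_user=10,bf_min_prio_reserve=100000

# multifactor priority: QOS dominates, aging ~1 point/minute,
# fairshare capped at 1440 points
PriorityType=priority/multifactor
PriorityWeightQOS=1000000
PriorityMaxAge=14-00:00:00     # 14 days = 20160 minutes
PriorityWeightAge=20160        # ramps to 20160 over 14 days = 1 point/minute
PriorityWeightFairshare=1440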

SLIDE 10

Primary Difficulty Faced

[Diagram: slurmctld driving a single xtcleanup_after pass over the whole allocation ... needs to become ... slurmctld driving xtcheckhealth independently on each node]

NHC doesn't run until the entire allocation has ended. In the case of a slow-to-complete node, this holds large allocations idle. If NHC is run from a per-node epilog, each node can complete independently, returning nodes to service faster. The remaining issue is that a "completing" node, stuck on an unkillable process (or other similar issue), becomes an emergency.
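A per-node epilog sketch of the idea; the xtcheckhealth path and invocation are placeholders, since the real health-check call is site-specific:

#!/bin/bash
# slurm epilog (sketch): run NHC on this node only, so healthy
# nodes return to service without waiting for the whole allocation
if ! /opt/cray/nodehealth/default/bin/xtcheckhealth; then
    # drain just this node; the rest of the allocation completes
    scontrol update NodeName=$(hostname) State=DRAIN Reason="NHC failed"
    exit 1
fi
exit 0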

SLIDE 11

Exciting slurm topics I'm not covering today

  • user training and tutorials
  • accounting / integrating slurmdbd with NERSC databases
  • user experience and documentation
  • my speculations about Rhine/Redwood
  • details of the realtime implementation
  • burstbuffer / DataWarp integration
  • NERSC slurm plugins: vtune, blcr, shifter, completion, ccm
  • monitoring
  • reservations
  • knl
  • job_submit.lua
  • blowing up slurm without getting burned
  • draining dvs service nodes with prolog

SLIDE 12

Conclusions and Future Directions

  • We have consistently delivered highly usable systems with SLURM since first deploying it
  • Our typical experience is that bugs are repaired same-or-next day
  • Native SLURM is a new technology that has rough edges but great opportunity!
  • Increasing resolution of binding affinities
  • Integrating Cori Phase 2 (+9,300 KNL nodes)

○ 11,000 node system
○ New processor requiring new NUMA binding capabilities, node reboot capabilities, …

  • Deploying SLURM on Rhine/Redwood

○ Continuous delivery of configurations
○ Live rebuild/redeploy (less frequent)

  • Scaling topologically aware scheduling

SLIDE 13

Acknowledgements

NERSC

  • Tina Declerck
  • Ian Nascimento
  • Stephen Leak

Cray

  • Brian Gilmer

SchedMD

  • Moe Jette
  • Danny Auble
  • Tim Wickberg
  • Brian Christiansen