Multicore job management in the Worldwide LHC Computing Grid
EGI Community Forum, Helsinki, May 20th 2014
Antonio Pérez-Calero Yzquierdo
20-05-2014 Multicore job management in WLCG - Antonio Pérez-Calero Yzquierdo 2
Outline
- Multicore applications in WLCG
- Job submission across the WLCG
- The problem of multicore job scheduling
- The WLCG multicore deployment TF
- First results and current status
- Conclusions and Outlook
Jobs in WLCG
- LHC experiments need a global computing infrastructure in order to analyze LHC
collisions: the Worldwide LHC Computing Grid
- A distributed computing model with data and job submission across the Grid in
order to process billions of collision events
- Computation tasks include experimental data reconstruction and analysis as
well as event simulation.
- Mainly sequential tasks, one event at a time, with no parallelization: single core
jobs
Multicore jobs in WLCG
Looking ahead to the restart of LHC data taking in 2015, the experiments are developing multicore applications due to:
- Hardware evolution: over the last decade, architecture design has moved in the
direction of adding cores to the CPU, while individual core performance will probably not increase significantly
- Evolution of LHC conditions: higher data volumes to be processed, with
increased event complexity due to higher pileup, causing increasing
  – processing time per event
  – memory usage
New era for HEP computing:
- Integration of elements of Grid Computing and High Performance Computing,
going from sequential programming to parallel processing over the Grid: distributed parallel computing
- In parallel with other activities:
LHC upgrades ↔ LHC detector upgrades ↔ LHC VO software upgrades
Multicore jobs in WLCG
Advantages for multicore jobs:
- Fully exploit future CPU capabilities, adapting code to new architecture designs
- Reduced memory consumption per core, as memory may be shared
between threads
Parallel processing is being considered at different levels:
- Run over events in parallel processes
- Process data modules inside an event in parallel
- Both combined: processing in parallel modules not necessarily from the same
event
Jobs running parallel threads share common data in memory, such as detector geometry, calibration and conditions data, etc.
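This sharing of common data is what drives the memory saving. As a rough illustration (plain Python, not experiment code; the data structure and numbers are invented), worker processes forked after the parent loads read-only conditions data reuse the parent's memory pages via copy-on-write instead of each holding a private copy:

```python
import multiprocessing as mp

# Hypothetical stand-in for detector geometry / calibration / conditions data,
# loaded once in the parent before the workers are forked.
SHARED_CONDITIONS = {"geometry_version": 42, "calib": list(range(1000))}

def process_event(event_id):
    # Read-only access: the pages stay shared with the parent process.
    return event_id * SHARED_CONDITIONS["geometry_version"]

def run_workers(n_events=8, n_procs=4):
    # The "fork" start method gives copy-on-write page sharing on Linux.
    ctx = mp.get_context("fork")
    with ctx.Pool(processes=n_procs) as pool:
        return pool.map(process_event, range(n_events))

if __name__ == "__main__":
    print(run_workers())
```

The same idea applies whether the parallelism is over events or over modules within an event: only per-event state is private to each worker.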
Multicore job scheduling problem
Objectives:
- Integrate scheduling of both multicore and single-
core jobs, which will still be used by the LHC experiments, as well as by other VOs at shared sites.
- Avoid splitting resources, such as dedicated whole
node slots and separated queues, which may introduce additional inefficiency and complexity in site resources configuration and management.
- Maximize CPU usage: no idle CPUs while there is
work to be done
Pilot based job submission
- LHC experiments commonly make use of pilot jobs:
– pilots reserve resources at the remote computing centers
– once pilots get resources, they start pulling jobs from a general job pool,
to be run at their location
- In this schema, multicore jobs require multicore pilots
- Example: glideinWMS, based on HTCondor (CMS)
WLCG Multicore Deployment TF
Job scheduling involves two main elements:
a) Grid-wide job submission by the experiments
b) Resource allocation at the sites
The purpose of the WLCG Multicore Deployment TF is to explore, develop and propose ways to connect a) and b) in the most efficient way, with reasonable effort from sites and experiments, and in a reasonable time, in order to achieve our multicore job scheduling objectives.
WLCG Multicore Deployment TF
Evaluate:
- Multicore capabilities of local batch systems
- Compatibility of approaches to multicore job distribution by different
LHC VOs
This contribution: summary of the activities of this task force over
the last months.
- Acknowledgements: thanks to all the participating people and
sites, who provided the content for this talk!
Project twiki: https://twiki.cern.ch/twiki/bin/view/LCG/DeployMultiCore
Review of batch systems
- We have reviewed batch systems in terms of their functionalities useful for
multicore scheduling
- Experience related to:
  – ATLAS: multicore jobs in production since January
  – CMS: limited testing up to now
- Mini workshops dedicated to each technology:
  – HTCondor (RAL), UGE (KIT), Torque/Maui (NIKHEF), SLURM (CSCS)
- Main conclusion: most popular batch systems support multicore jobs
  – native functionalities, plus sometimes complementary scripts
- System configuration (tuning) depends on site load composition and running
conditions: we will need more than one iteration to fully evaluate the performance of each system
Scheduling multicore jobs
- Key problem: in order for a multicore job to start in a non-dedicated environment,
the machine needs to be sufficiently drained
- Creating a multicore slot: prevent single core jobs from taking the freed resources
  – draining = idle CPUs!
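The cost of that draining can be put in numbers with a small back-of-envelope sketch (the job lengths are invented, not TF measurements): while waiting for all single core jobs on a node to finish, every freed core sits idle until the last one ends.

```python
# Idle core-hours wasted while draining a worker node for a multicore slot.
def draining_cost(remaining_hours, cores=8):
    """Idle core-hours accumulated while waiting for `cores` free cores.

    remaining_hours: remaining walltime of each running single-core job.
    """
    drain_time = max(remaining_hours)  # slot is ready when the last job ends
    busy = sum(remaining_hours)        # core-hours still doing useful work
    return cores * drain_time - busy   # the rest is idle waste

# Example: 8 single-core jobs with 1..8 hours left -> node drained after 8 h.
print(draining_cost([1, 2, 3, 4, 5, 6, 7, 8]))  # 8*8 - 36 = 28 idle core-hours
```

If all jobs finish together the waste is zero, which is exactly what backfilling tries to approximate by filling the temporary holes.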
Scheduling with backfilling
However, a well tuned scheduler doing backfilling can reduce the amount of idle CPUs caused by the WN draining:
- Jobs of lower priority are allowed to utilize the reserved resources only if their
prospective end time (based on their declared wallclock usage) falls before the start of the reservation
[Figure: timeline of a draining WN, with short backfilled jobs filling the temporary holes before the multicore reservation starts]
The ability of the scheduler to perform successful backfilling depends on two concepts: entropy and predictability
- Entropy: having a variety of jobs with different requirements in the
queue. There should be a distribution of job resource requests in
order to increase the likelihood of finding the right "piece" to fill each
temporary hole in draining WNs
- Predictability: a reasonably accurate prediction of job running times, so
that the scheduler can decide whether it should run a given job in a given hole or not.
  – How accurate does this prediction need to be?
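The backfill rule above can be sketched in a few lines (an illustrative model, not a real scheduler; the job tuples and times are invented):

```python
# A lower-priority job may use a draining node's free cores only if its
# declared walltime ends before the multicore reservation starts.
def can_backfill(declared_walltime, now, reservation_start):
    """True if the job would finish before the reserved slot is needed."""
    return now + declared_walltime <= reservation_start

def pick_backfill_jobs(queue, free_cores, now, reservation_start):
    """Greedily fill the temporary hole with fitting short jobs.

    queue: list of (job_id, cores, declared_walltime) tuples.
    """
    chosen = []
    for job_id, cores, walltime in queue:
        if cores <= free_cores and can_backfill(walltime, now, reservation_start):
            chosen.append(job_id)
            free_cores -= cores
    return chosen

# 3 free cores, reservation in 4 h: only jobs declaring <= 4 h fit.
print(pick_backfill_jobs([("a", 1, 2), ("b", 1, 6), ("c", 2, 3)], 3, 0, 4))
# -> ['a', 'c']
```

Note how both concepts appear: entropy determines whether a fitting job exists in the queue at all, and predictability determines whether the declared walltime can be trusted.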
Job running time estimation
Providing a reliable estimation of job running times is however difficult for various reasons:
- Inherent to the jobs themselves, as the instantaneous luminosity and pile-up
determine the complexity of events and thus the job running time
  – different for analysis, MC production and data reconstruction/reprocessing
  – there are currently ways to mitigate this, for example distributing a data reconstruction workload over a number of jobs with approximately equal running times
- Waiting times for access to input data: unpredictable in a complex environment
such as the WLCG
- Variance in CPU power for WNs distributed across the grid, and also within sites
  – This may not be so much of a problem if the actual difference between the fastest and slowest machines at a given site still allows an estimation accurate enough to do some backfilling
- The masking effect of pilots: submission of jobs through pilots introduces
other effects, such as running more than one job per pilot, waiting for new jobs to
appear, etc.
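The mitigation mentioned above, splitting a reconstruction workload into jobs of approximately equal running time, can be sketched with the classic greedy longest-processing-time heuristic (illustrative only; the chunk times are invented):

```python
import heapq

# Split a workload into a fixed number of jobs of roughly equal running time:
# assign each chunk (longest first) to the currently lightest job.
def split_into_equal_jobs(chunk_times, n_jobs):
    """Return the sorted per-job total times after greedy balancing."""
    jobs = [(0.0, i, []) for i in range(n_jobs)]  # (total_time, id, chunks)
    heapq.heapify(jobs)
    for t in sorted(chunk_times, reverse=True):
        total, i, chunks = heapq.heappop(jobs)    # lightest job so far
        heapq.heappush(jobs, (total + t, i, chunks + [t]))
    return sorted(total for total, _, _ in jobs)

# Seven chunks of 5,4,3,3,2,2,1 hours packed into 3 jobs of 6-7 hours each.
print(split_into_equal_jobs([5, 4, 3, 3, 2, 2, 1], 3))  # -> [6.0, 7.0, 7.0]
```

Equal-length jobs make the declared walltime a much better predictor, which is exactly what the backfilling scheduler needs.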
Conserving the slots
- There are two aspects of the problem: creating and conserving multicore slots
– Once the cost has been paid, avoid multicore slot destruction
[Figure: multicore slots occupied in turn by jobs from different VOs (VO 1, VO 2)]
Conserving the slots
In order to keep the multicore slots alive, sites should receive a more or less stable flow of multicore jobs, so that vacated slots can be refilled with new multicore jobs. Several aspects are involved:
- Different VOs should agree on a common slot size so that they can
access the same slots in shared sites.
– This is well understood and there is general consensus that there
should exist at least a default value (for example 8)
- Rank expressions/job priorities should be adjusted in order to assign
multicore jobs to multicore slots, as opposed to getting partially filled by single core jobs.
- Stable flow of multicore jobs: bursty submission patterns force the
system to continually re-adjust the level of draining
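The rank adjustment above can be illustrated with a toy scoring function (a sketch, not an actual batch-system rank expression; names and numbers are invented): for a freed 8-core slot, a job that uses the whole slot outranks any single core job.

```python
# Rank candidate jobs for a freed multicore slot so a multicore job wins,
# keeping the slot alive instead of letting it be partially filled.
def slot_rank(job_cores, slot_cores=8):
    """Higher is better: prefer jobs that use the whole slot."""
    if job_cores > slot_cores:
        return -1                      # does not fit at all
    return job_cores / slot_cores      # fraction of the slot actually used

# Hypothetical queue of (job_id, requested_cores) candidates.
queue = [("single-1", 1), ("multi-8", 8), ("single-2", 1)]
best = max(queue, key=lambda j: slot_rank(j[1]))
print(best[0])  # -> multi-8
```

Real batch systems express this through rank expressions or priorities, but the intent is the same: single core jobs should not grab a freshly drained multicore slot.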
Multicore job submission models: ATLAS
ATLAS considers scheduling to be mostly a site problem
- ATLAS will keep single core and multicore jobs separated
- one pilot pulls only one payload.
The strong point of this model is entropy. The experiment's submission system is being adapted to provide job running times to the batch systems.
Multicore job submission models: CMS
CMS pilots provide the internal machinery (partitionable slots) so that sites in general do not have to deal with a mixture of jobs from CMS
- Integrate single core and multicore jobs into the same pilots
- pilots continue pulling jobs until they exhaust the queue walltime limits
The strong point of this model is predictability, as pilots should run for as long as they are allowed to. However, it reduces the entropy in the system, making it more difficult to perform backfilling. The aim is to keep the multicore slots alive.
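The pilot behaviour described above can be sketched as a simple loop (illustrative Python, not glideinWMS code; the queue contents and times are invented):

```python
# A CMS-style pilot keeps pulling payloads from the queue until the next
# job's declared runtime would exceed the remaining walltime limit.
def run_pilot(job_queue, walltime_limit):
    """job_queue: list of (job_id, declared_runtime). Returns jobs run."""
    elapsed, executed = 0, []
    for job_id, runtime in job_queue:
        if elapsed + runtime > walltime_limit:
            break                      # would overrun the queue's limit
        elapsed += runtime             # "run" the payload
        executed.append(job_id)
    return executed

# 48 h walltime limit: the pilot runs payloads until the limit is reached.
print(run_pilot([("j1", 20), ("j2", 20), ("j3", 10)], 48))  # -> ['j1', 'j2']
```

This is why the model is predictable from the site's point of view: the pilot's total lifetime is pinned to the queue limit, whatever the individual payloads look like.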
- First results
First results
- Most sites have been exposed to ATLAS jobs, as ATLAS started earlier with MC
production via multicore jobs. Experience from KIT: multicore jobs showed long waiting times combined with short running times. The cost of draining a slot to run a multicore job is not fully amortized by short running jobs.
Ref: https://indico.cern.ch/event/298062/contribution/3/material/slides/0.pdf
First results
- Most sites have been exposed to ATLAS jobs, as ATLAS started earlier with MC
production via multicore jobs. Experience from KIT:
Observed bursty multicore job submission patterns
Ref: https://indico.cern.ch/event/298062/contribution/3/material/slides/0.pdf
First results
- Most sites have been exposed to ATLAS jobs, as ATLAS started earlier with MC
production via multicore jobs. Results from RAL:
Draining nodes were cancelled. Conclusion: slots need to be drained constantly to maintain the number of running multicore jobs, but constant draining means keeping idle CPUs all the time. The draining rate was retuned in an attempt to reduce wastage. Conclusion: the draining rate has to be tuned according to the number of running and queued multicore jobs.
Ref: https://indico.cern.ch/event/298065/contribution/0/material/slides/1.pdf
First results
In summary:
- When no backfilling is available due to the lack of running time estimates,
draining nodes has a cost in idle CPUs
- The impact of the degradation of CPU usage as a consequence of draining
depends on the size and load composition of the site.
  – However, even if small, it could be extremely important, as in general funding is linked to good results
- Short multicore jobs do not fully exploit the effort made in creating the multicore
slot for them
- Job submission patterns affect tuning, performance and wastage of the system
  – Wave-like submission patterns require constant re-tuning of the amount of draining needed
Dynamic partitioning
The solution many sites have gone for is the dynamic partitioning of their clusters:
- Moving WNs between separated pools allows single and multicore jobs to
coexist at a site without the single core jobs destroying the multicore slots
- There is a floating boundary between the two partitions which is adjusted
dynamically according to the load at the site
  – Draining happens in a very controlled amount: at most 1-2% of the total number of cores at a
site are being drained simultaneously (e.g. NIKHEF)
  – No draining is needed to support a constant multicore job load
  – However, depending on the size of the request and the number of cores in the nodes, getting multicore jobs to run may take quite a long time
- It accommodates both the CMS and ATLAS models
- It is a native solution for a number of batch systems (e.g. UGE at KIT), while for
others some custom scripts may be required (e.g. Torque/Maui at NIKHEF).
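A floating-boundary policy of this kind can be sketched as follows (an assumed policy for illustration, not any site's actual script; the 2% cap echoes the NIKHEF figure above and all parameter names are invented):

```python
# Adjust the floating boundary between the single-core and multicore WN
# pools from the queued multicore demand, capping how much is drained at once.
def adjust_partition(mc_nodes, queued_mc_jobs, running_mc_jobs,
                     total_cores, cores_per_node=8, max_drain_frac=0.02):
    """Return how many extra nodes to start draining toward the MC pool."""
    free_mc_nodes = mc_nodes - running_mc_jobs   # one 8-core job per node
    deficit = queued_mc_jobs - free_mc_nodes
    if deficit <= 0:
        return 0                                 # enough multicore slots already
    # Drain at most max_drain_frac of the site's cores simultaneously.
    max_draining = int(total_cores * max_drain_frac) // cores_per_node
    return min(deficit, max(max_draining, 1))

# 1000-core site (2% cap => 20 cores => at most 2 nodes draining at a time):
print(adjust_partition(mc_nodes=4, queued_mc_jobs=10, running_mc_jobs=4,
                       total_cores=1000))  # -> 2
```

The cap is what keeps the idle-CPU cost bounded: under a constant multicore load the deficit stays at zero and no draining happens at all, matching the point above.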
- Conclusions and Outlook
Conclusions
- Multicore jobs will be required to process and simulate data in the next LHC run
- The distribution and scheduling of multicore jobs across the WLCG is a problem in
itself:
  – a multitude of sites with diverse batch system technologies
  – different job submission models for each experiment (ATLAS and CMS)
- The WLCG multicore deployment task force has the mandate of coordinating
these activities to make sure they can work together
- First results from the experiments show that, since backfilling is not currently
available, multicore slots have to be conserved to avoid unnecessary draining resulting in CPU wastage
  – A dynamic partitioning of the resources into separate WN pools is emerging as a viable solution
Outlook
- The immediate objective is to test both models (CMS and ATLAS) in a shared
environment at real scale
  – Evaluate whether they are compatible
  – Analyze how the global performance depends on the size of the site, the actual mixture of single core / multicore jobs, whether the site is dedicated mainly to HEP or not, the actual batch system capabilities and their particular tuning, etc.
- For these tests, the ideal places are Tier1s, given that a good fraction of them
support both experiments and there is also a diversity of batch systems to study.
  – CMS and ATLAS will send multicore jobs concurrently to shared Tier1s by the end of May
- Multicore support result from the interaction of the VO submission models with
local batch system capabilities and scheduler tuning
  – It is an iterative process; feedback exchange will be needed: sites ↔ VOs
- Extra slides
Abstract
Multicore job management in the Worldwide LHC Computing Grid After two years since the very successful first run of the Large Hadron Collider finished, data taking is scheduled to be restarted in early 2015. The experimental conditions for this second run include higher collision energy and beam luminosities, both leading to increased data volumes and event complexity. In order to process the data generated in such scenario, and also best exploit the multicore architectures of current CPUs, the LHC experiments have been developing parallelized data analysis and simulation software. However, workload scheduling in these conditions becomes a complex problem in itself, as computing jobs with a broad range of resources requirements have to be efficiently distributed across the multiple sites which make up the Worldwide LHC Computing Grid. A WLCG Task Force has been created with the purpose of coordinating the joint effort from experiments and WLCG sites. This contribution will present the activities of the Task Force, including the experiences from sites on how to best use the different batch system technologies, the development of advanced workload submission tools by the experiments and the real-size scale tests of the different proposed strategies.
Abstract (continued)
Description: Job scheduling in a distributed resources environment such as the WLCG involves the grid-wide workload submission tools used by the LHC experimental collaborations, known as Virtual Organizations (VOs) in this context, and the batch system technologies in charge of the allocation of the local resources, which are deployed at every WLCG site. The objective of the WLCG Multicore Deployment Task Force is to explore, develop and propose ways to connect both elements in order to fulfill the computing needs of the different VOs, which now require sites to be able to run their newly developed multicore applications in addition to the more usual single-core software. Furthermore, the best use of the resources must be ensured, avoiding CPUs being idle when there is work to be done and minimizing CPU inefficiencies which may originate from the scheduling mechanisms. All this should be achieved without imposing unnecessary complexities on the way sites manage their resources, and while maintaining a high rotation of jobs from the different users in multi-VO sites.
Conclusion: Apart from the main objective of satisfying the new computing needs of the LHC VOs, this task force has the mandate of providing the necessary coordination in order to avoid duplicated efforts in the development of new grid-wide submission tools, as well as ensuring the convergence of approaches from different VOs to best use shared resources. Additionally, a better understanding of the technical capabilities of existing batch systems and schedulers is expected, as the participants develop and present the best system configurations, which may be shared between sites operating the same technologies.
https://twiki.cern.ch/twiki/bin/view/LCG/DeployMultiCore