SLIDE 1

Multicore job management in the Worldwide LHC Computing Grid

EGI Community Forum, Helsinki, May 20th 2014

Antonio Pérez-Calero Yzquierdo and Alessandra Forti for the WLCG Multicore deployment Task Force

SLIDE 2

20-05-2014 Multicore job management in WLCG - Antonio Pérez-Calero Yzquierdo 2

Outline

  • Multicore applications in WLCG
  • Job submission across the WLCG
  • The problem of multicore job scheduling
  • The WLCG multicore deployment TF
  • First results and current status
  • Conclusions and Outlook
SLIDE 3

Jobs in WLCG

  • LHC experiments need a global computing infrastructure in order to analyze LHC collisions: the Worldwide LHC Computing Grid
  • A distributed computing model with data and job submission across the Grid in order to process billions of collision events
  • Computation tasks include experimental data reconstruction and analysis as well as event simulation.
  • Mainly sequential tasks, one event at a time, with no parallelization: single-core jobs

SLIDE 4

Multicore jobs in WLCG

Looking ahead to the restart of LHC data taking in 2015, the experiments are developing multicore applications due to:

  • Hardware evolution: over the last decade, architecture design has moved in the direction of adding cores to the CPU, while individual core performance will probably not increase significantly
  • Evolution of LHC conditions: higher data volumes to be processed, with increased event complexity due to higher pileup, causing increasing processing time and memory usage per event

New era for HEP computing:

  • Integration of elements of Grid computing and High Performance Computing: going from sequential programming to parallel processing over the Grid, i.e. distributed parallel computing
  • A parallel view with other activities: LHC upgrades ↔ LHC detector upgrades ↔ LHC VO software upgrades

SLIDE 5

Multicore jobs in WLCG

Advantages of multicore jobs:

  • Fully exploit future CPU capabilities, adapting code to new architecture designs
  • Reduced memory consumption per core, as memory may be shared between threads

Parallel processing is being considered at different levels:

  • Running over events in parallel processes
  • Processing the data modules inside an event in parallel
  • Both combined: processing in parallel modules not necessarily from the same event

Jobs running parallel threads share common data in memory, such as detector geometry, calibration and conditions data, etc.
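The memory saving described above can be sketched in a few lines. This is illustrative only, not experiment code: worker threads process events in parallel while all reading a single shared, read-only copy of the conditions data, instead of each job holding its own copy as independent single-core jobs would (the data and function names are invented).

```python
# Sketch: parallel event processing over one shared conditions payload.
from concurrent.futures import ThreadPoolExecutor

# Loaded once, shared read-only by every worker thread (invented content).
CONDITIONS = {"geometry": "run2-v1", "calibration": [0.98, 1.01, 1.00]}

def reconstruct(event_id: int) -> float:
    calib = CONDITIONS["calibration"]          # shared read-only access
    return event_id * sum(calib) / len(calib)  # stand-in for real reconstruction

with ThreadPoolExecutor(max_workers=4) as pool:   # one worker per core
    results = list(pool.map(reconstruct, range(100)))

print(len(results))   # 100 events processed against one conditions copy
```

A single-core deployment of the same workload would instead load `CONDITIONS` once per job, multiplying the memory footprint by the number of cores.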

SLIDE 6

Multicore job scheduling problem

Objectives:

  • Integrate the scheduling of multicore and single-core jobs, which will still be used by the LHC experiments, as well as by other VOs at shared sites.
  • Avoid splitting resources, such as dedicated whole-node slots and separate queues, which may introduce additional inefficiency and complexity into site resource configuration and management.
  • Maximize CPU usage: no idle CPUs while there is work to be done

SLIDE 7

Pilot based job submission

  • LHC experiments commonly make use of pilot jobs to reserve resources at the remote computing centers
  • Once pilots get resources, they start pulling jobs from a general job pool to be run at their location
  • In this schema, multicore jobs require multicore pilots
  • Example: glideinWMS, based on HTCondor (CMS)
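The pilot model above can be reduced to a pull loop. The sketch below is purely illustrative (not the glideinWMS or HTCondor API): a pilot that has landed on a reserved slot keeps pulling payloads from a central queue as long as they fit the slot it holds.

```python
# Sketch of a pilot pulling payloads that fit its reserved slot.
import collections

JobPool = collections.deque  # stands in for the experiment's central job pool

def run_pilot(pool: JobPool, cores: int) -> list:
    """Pull and 'run' payloads whose core request fits this pilot's slot."""
    executed = []
    while pool:
        job = pool[0]
        if job["cores"] > cores:       # a multicore payload needs a multicore pilot
            break
        pool.popleft()
        executed.append(job["name"])   # a real pilot would launch the payload here
    return executed

pool = JobPool([{"name": "reco-1", "cores": 1},
                {"name": "sim-8", "cores": 8}])
print(run_pilot(pool, cores=1))   # single-core pilot: runs reco-1, leaves sim-8
```

The last line shows why multicore jobs require multicore pilots: the 8-core payload stays in the pool until a pilot holding at least 8 cores pulls it.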
SLIDE 8

WLCG Multicore Deployment TF

Job scheduling involves two main elements:

  a) Grid-wide job submission by the experiments
  b) Resource allocation at the sites

The purpose of the WLCG Multicore Deployment TF is to explore, develop and propose ways to connect a) and b) in the most efficient way, with reasonable effort from sites and experiments, and in a reasonable time, in order to achieve our multicore job scheduling objectives.

SLIDE 9

WLCG Multicore Deployment TF

Evaluate:

  • The multicore capabilities of local batch systems
  • The compatibility of the approaches to multicore job distribution taken by the different LHC VOs

This contribution: a summary of the activities of this task force over the last months.

  • Acknowledgements: thanks to all the participating people and sites, who provided the content for this talk!

Project twiki: https://twiki.cern.ch/twiki/bin/view/LCG/DeployMultiCore

SLIDE 10

Review of batch systems

  • We have reviewed batch systems in terms of the functionalities useful for multicore scheduling
  • Experience so far:
    – ATLAS: multicore jobs in production since January
    – CMS: limited testing up to now
  • Mini-workshops dedicated to each technology: HTCondor (RAL), UGE (KIT), Torque/Maui (NIKHEF), SLURM (CSCS)
  • Main conclusion: the most popular batch systems support multicore jobs, through native functionalities plus, in some cases, complementary scripts
  • System configuration (tuning) depends on the site's load composition and running conditions: we will need more than one iteration to fully evaluate the performance of each system

SLIDE 11

Scheduling multicore jobs

  • Key problem: in order for a multicore job to start in a non-dedicated environment, the machine needs to be sufficiently drained
  • Creating a multicore slot means preventing single-core jobs from taking the freed resources

draining = idle CPUs!!
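The cost flagged above can be put in rough numbers. A back-of-envelope sketch (the runtimes are invented for illustration): while a worker node drains for a multicore slot, each core freed early sits idle until the last running single-core job ends.

```python
# Sketch: idle core-hours accumulated while draining a WN for a multicore slot.
def drain_cost_core_hours(remaining_runtimes):
    """Idle core-hours spent waiting for the slowest remaining job."""
    drain_time = max(remaining_runtimes)            # slot is ready only then
    return sum(drain_time - r for r in remaining_runtimes)

# 8 single-core jobs with 0.5 to 4 hours left on one worker node
print(drain_cost_core_hours([0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4]))
```

Even this tiny example loses 14 core-hours per 8-core slot created, which is the waste that backfilling (next slides) tries to recover.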

SLIDE 12

Scheduling with backfilling

However, a well-tuned scheduler doing backfilling can reduce the number of idle CPUs caused by WN draining:

  • Jobs of lower priority are allowed to use the reserved resources only if their prospective end time (i.e. their declared wallclock usage) is before the start of the reservation

(Diagram: short backfilled jobs filling the gaps on a draining WN)
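The backfilling rule stated above reduces to a simple admission check. The sketch below is illustrative only (times in hours from "now", job names invented): a candidate may use a core freed by draining only if its declared walltime ends before the multicore reservation starts.

```python
# Sketch: backfill admission test against a multicore slot reservation.
def can_backfill(declared_walltime: float, reservation_start: float) -> bool:
    """A low-priority job fits the hole only if it ends before the reservation."""
    return declared_walltime <= reservation_start

reservation_start = 3.0                      # multicore slot reserved in 3 h
candidates = [("short-mc", 1.5), ("analysis", 2.9), ("long-reco", 6.0)]
accepted = [name for name, wt in candidates
            if can_backfill(wt, reservation_start)]
print(accepted)   # only the jobs that finish before the reservation start
```

Note that the test relies entirely on the declared walltime, which is why the next slides focus on how accurately job running times can be predicted.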

SLIDE 13

The ability of the scheduler algorithm to perform successful backfilling depends on two concepts, entropy and predictability:

  • Entropy: having a variety of jobs with different requirements in the queue. There should be a distribution of job resource requests, in order to increase the likelihood of finding the right "piece" to fill each temporary hole in the draining WNs
  • Predictability: a reasonably accurate prediction of each job's running time, so that the scheduler can decide whether it should run a given job in a given hole or not.
    – How accurate does this prediction need to be?

SLIDE 14

Job running time estimation

Providing a reliable estimate of job running times is however difficult, for various reasons:

  • Inherent to the jobs themselves: the instantaneous luminosity and pile-up determine the complexity of the events and thus the job running time
    – different for analysis, MC production and data reconstruction/reprocessing
    – there are currently ways to mitigate this, for example distributing a data reconstruction workload over a number of jobs with approximately equal running times
  • Waiting times for access to input data: unpredictable in an environment as complex as the WLCG
  • Variance in CPU power across WNs, both across the grid and within sites
    – This may not be such a problem if, despite the actual difference between the fastest and slowest machines at a given site, the estimate is still accurate enough to do some backfilling
  • The masking effect of pilots: submitting jobs through pilots introduces further effects, such as running more than one job per pilot, waiting for new jobs to appear, etc.
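A naive estimator along the lines discussed above might scale a per-event time (which grows with pileup) by the event count and by the relative speed of the worker node. All coefficients below are invented for illustration; a real estimator would be calibrated per workflow and per site.

```python
# Hedged sketch: walltime estimate from event count, pileup and WN speed.
def estimate_walltime_s(n_events, pileup, wn_speed_factor,
                        base_s_per_event=2.0, s_per_pileup=0.1):
    """Seconds of walltime: per-event cost grows with pileup,
    shrinks on faster worker nodes (speed relative to a site reference)."""
    s_per_event = base_s_per_event + s_per_pileup * pileup
    return n_events * s_per_event / wn_speed_factor

# 1000 events at pileup 40 on a WN 20% faster than the site reference
print(estimate_walltime_s(1000, 40, 1.2))
```

The unmodelled terms listed above (input data waiting times, multiple payloads per pilot) are exactly what makes such a formula an upper-level sketch rather than a dependable declared walltime.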

SLIDE 15

Conserving the slots

  • There are two aspects to the problem: creating and conserving multicore slots
    – Once the cost has been paid, avoid destroying the multicore slot

SLIDE 16

Conserving the slots

In order to keep the multicore slots alive, sites should receive a more or less stable flow of multicore jobs, so that vacated slots can be refilled with new multicore jobs. Several aspects matter here:

  • Different VOs should agree on a common slot size, so that they can access the same slots at shared sites.
    – This is well understood, and there is general consensus that there should be at least a default value (for example 8)
  • Rank expressions/job priorities should be adjusted so as to assign multicore jobs to multicore slots, as opposed to letting them get partially filled by single-core jobs.
  • Stable flow of multicore jobs: bursty submission patterns force the system to continually re-adjust the level of draining
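The ranking idea above can be sketched as a toy dispatch rule, not a real batch system scheduler: when an N-core slot frees up, prefer a queued multicore job of matching size over single-core jobs that would fragment the slot (job names and the ranking key are illustrative).

```python
# Sketch: prefer exact-size multicore jobs when refilling a freed slot.
def pick_job_for_slot(slot_cores, queue):
    """Return the index of the job to dispatch, or None if nothing fits."""
    fitting = [i for i, j in enumerate(queue) if j["cores"] <= slot_cores]
    if not fitting:
        return None
    # Rank: an exact multicore fit first, then larger requests before smaller
    # ones, so single-core jobs only win when no multicore job is queued.
    return max(fitting, key=lambda i: (queue[i]["cores"] == slot_cores,
                                       queue[i]["cores"]))

queue = [{"name": "sc-1", "cores": 1}, {"name": "mc-8", "cores": 8}]
i = pick_job_for_slot(8, queue)
print(queue[i]["name"])   # the 8-core job keeps the slot alive
```

With this preference the 8-core slot survives; a first-come-first-served rule would have let `sc-1` bite one core out of it and destroy the slot.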

SLIDE 17

Multicore job submission models: ATLAS

ATLAS considers scheduling to be mostly a site problem:

  • ATLAS will keep single-core and multicore jobs separated
  • One pilot pulls only one payload.

Its strong point is entropy. The experiment's submission system is being adapted to provide the job running times to the batch systems.

SLIDE 18

Multicore job submission models: CMS

CMS pilots provide the internal machinery (partitionable slots), so that sites in general do not have to deal with a mixture of jobs from CMS:

  • Single-core and multicore jobs are integrated into the same pilots
  • Pilots continue pulling jobs until they exhaust the queue walltime limits

The strong point of this model is predictability, as pilots should run for as long as they are allowed to. However, it has the effect of removing entropy from the system, thus making it more difficult to perform backfilling. The aim is to keep the multicore slots alive.

SLIDE 19

  • First results
SLIDE 20

First results

  • Most sites have been exposed to ATLAS jobs, as ATLAS started earlier with MC production via multicore jobs. Experience from KIT:
    – Multicore jobs showed long waiting times combined with short running times
    – The cost of draining a slot to run a multicore job is not fully amortized by short-running jobs

Ref: https://indico.cern.ch/event/298062/contribution/3/material/slides/0.pdf

SLIDE 21

First results

  • Most sites have been exposed to ATLAS jobs, as ATLAS started earlier with MC production via multicore jobs. Experience from KIT:
    – Observed bursty multicore job submission patterns

Ref: https://indico.cern.ch/event/298062/contribution/3/material/slides/0.pdf

SLIDE 22

First results

  • Most sites have been exposed to ATLAS jobs, as ATLAS started earlier with MC production via multicore jobs. Results from RAL:
    – Cancelled the draining of nodes. Conclusion: slots need to be drained constantly to maintain the number of running multicore jobs
    – Constant draining means keeping idle CPUs all the time
    – The draining rate was retuned in an attempt to reduce wastage. Conclusion: the draining rate has to be tuned according to the number of running and queued multicore jobs

Ref: https://indico.cern.ch/event/298065/contribution/0/material/slides/1.pdf

SLIDE 23

First results

In summary:

  • When no backfilling is possible, due to the lack of running time estimates, draining nodes has a cost in idle CPUs
  • The impact of the degradation of CPU usage as a consequence of draining depends on the size and load composition of the site.
    – However, even if small, it could be extremely important, as in general funding is linked to good results
  • Short multicore jobs do not fully exploit the effort made in creating the multicore slot for them
  • Job submission patterns affect the tuning, performance and wastage of the system
    – Wave-like patterns require constantly re-tuning the amount of draining needed

SLIDE 24

Dynamic partitioning

The solution many sites have gone for is the dynamic partitioning of their clusters:

  • Moving WNs between separate pools allows single-core and multicore jobs to coexist at a site, without the single-core jobs destroying the multicore slots
  • There is a floating boundary between the two partitions, which is adjusted dynamically according to the load at the site
    – Draining happens in a very controlled amount: at most 1-2% of the total number of cores in a site being drained simultaneously (e.g. NIKHEF)
    – No draining is needed to support a constant multicore job load
    – However, depending on the size of the request and the number of cores in the nodes, getting multicore jobs to run may take quite a long time
  • It accommodates both the CMS and ATLAS models
  • It is a native solution for a number of batch systems (e.g. UGE at KIT), while for others some custom scripts may be required (e.g. Torque/Maui at NIKHEF).
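The floating boundary described above can be sketched as a small control rule, again purely illustrative rather than any site's actual script: move whole WNs toward the multicore pool according to queued demand, while draining at most a small fraction of the site's cores at once (the 1-2% figure quoted).

```python
# Sketch: how many WNs to start draining toward the multicore pool.
def nodes_to_move(queued_mc_cores, free_mc_cores, cores_per_node,
                  total_cores, max_drain_fraction=0.02):
    """WNs to drain: cover the core deficit, capped by the drain budget."""
    deficit = max(0, queued_mc_cores - free_mc_cores)
    wanted = -(-deficit // cores_per_node)               # ceiling division
    cap = int(total_cores * max_drain_fraction) // cores_per_node
    return min(wanted, cap)

# 64 multicore cores queued, none free, 8-core WNs, 4000-core site
print(nodes_to_move(64, 0, 8, 4000))
```

Run periodically, such a rule keeps the boundary tracking the load: with a constant multicore flow the deficit stays near zero and, as the slide notes, no draining is needed at all.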
SLIDE 25

  • Conclusions and Outlook
SLIDE 26

Conclusions

  • Multicore jobs will be required to process and simulate data in the next LHC run
  • The distribution and scheduling of multicore jobs across the WLCG is a problem in itself:
    – a multitude of sites with diverse batch system technologies
    – different job submission models for each experiment (ATLAS and CMS)
  • The WLCG multicore deployment task force has the mandate of coordinating these activities to make sure they can work together
  • First results from the experiments show that, since backfilling is not currently available, multicore slots have to be conserved to avoid unnecessary draining resulting in CPU wastage
    – A dynamic partitioning of the resources into separate WN pools is emerging as a viable solution

SLIDE 27

Outlook

  • The immediate objective is to test both models (CMS and ATLAS) in a shared environment at real scale
    – Evaluate whether they are compatible
    – Analyze how the global performance depends on the size of the site, the actual mixture of single-core/multicore jobs, whether the site is dedicated mainly to HEP or not, the actual batch system capabilities and their particular tuning, etc.
  • The ideal places for these tests are the Tier-1s, given that a good fraction of them support both experiments, and there is also a diversity of batch systems to study.
    – CMS and ATLAS will send multicore jobs concurrently to shared Tier-1s by the end of May
  • Multicore support results from the interaction of the VO submission models with the local batch system capabilities and scheduler tuning
    – It is an iterative process; feedback exchange will be needed: sites ↔ VOs

SLIDE 28

  • Extra slides
SLIDE 29

Abstract

Multicore job management in the Worldwide LHC Computing Grid. Two years after the very successful first run of the Large Hadron Collider finished, data taking is scheduled to restart in early 2015. The experimental conditions for this second run include higher collision energy and beam luminosities, both leading to increased data volumes and event complexity. In order to process the data generated in such a scenario, and also to best exploit the multicore architectures of current CPUs, the LHC experiments have been developing parallelized data analysis and simulation software. However, workload scheduling under these conditions becomes a complex problem in itself, as computing jobs with a broad range of resource requirements have to be efficiently distributed across the multiple sites which make up the Worldwide LHC Computing Grid. A WLCG Task Force has been created with the purpose of coordinating the joint effort from experiments and WLCG sites. This contribution will present the activities of the Task Force, including the experiences from sites on how to best use the different batch system technologies, the development of advanced workload submission tools by the experiments, and real-size scale tests of the different proposed strategies.

SLIDE 30

Abstract (continued)

Description: Job scheduling in a distributed resource environment such as the WLCG involves the grid-wide workload submission tools used by the LHC experimental collaborations, known as Virtual Organizations (VOs) in this context, and the batch system technologies in charge of the allocation of the local resources, which are deployed at every WLCG site. The objective of the WLCG Multicore Deployment Task Force is to explore, develop and propose ways to connect both elements in order to fulfill the computing needs of the different VOs, which now require sites to be able to run their newly developed multicore applications in addition to the more usual single-core software. Furthermore, the best use of the resources must be ensured, avoiding CPUs being idle when there is work to be done and minimizing CPU inefficiencies which may originate from the scheduling mechanisms. All this should be achieved without imposing unnecessary complexities on the sites in the way they manage their resources, and while maintaining a high rotation of jobs from the different users at multi-VO sites.

Conclusion: Apart from the main objective of satisfying the new computing needs of the LHC VOs, this task force has the mandate of providing the necessary coordination in order to avoid duplicated efforts in the development of new grid-wide submission tools, as well as ensuring the convergence of the approaches of different VOs to best use shared resources. Additionally, a better understanding of the technical capabilities of existing batch systems and schedulers is expected, as the participants develop and present the best system configurations, which may be shared between sites operating the same technologies.

https://twiki.cern.ch/twiki/bin/view/LCG/DeployMultiCore