Computing Operation: A Stab at Improving Production Throughput
05/19/15 CalTech Group Meeting 2
Outlook
- Figure out cleaning procedures; people would like to use DDM for managing dataops space: this will require some work, headaches, and testing
- Alert system: to get an indication that something is being held for assignment
- Time monitoring: to see the dynamics of workflows passing through the system
- Figure out a better “# of copies” strategy: by size? estimated CPU? priority? … transfers are parallel = no delays added?
- Tune parameters to prevent starvation
- We were almost there, but then we added several T2s to the digi-reco pool and things are now going through very fast.
- Let it run and have Matteo (co-L3) take care of it every other
week
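As a concrete illustration of the “# of copies” question above, a heuristic could weigh dataset size, estimated CPU, and priority. The function and thresholds below are hypothetical, purely to sketch the idea — they are not from the actual Unified code:

```python
def n_input_copies(size_tb, est_cpu_hours, priority, max_copies=3):
    """Hypothetical heuristic: more copies for CPU-heavy, high-priority
    requests; fewer for very large inputs that are expensive to replicate.
    All thresholds are made up for illustration."""
    copies = 1
    if est_cpu_hours > 100_000:   # CPU-bound: spread input wider
        copies += 1
    if priority >= 85_000:        # high-priority campaign
        copies += 1
    if size_tb > 50:              # huge input: cap replication cost
        copies = min(copies, 2)
    return min(copies, max_copies)
```

A real strategy would also have to account for whether transfers overlap with processing (the "parallel = no delays" question above).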
Overview
- McM is the service for organizing, configuring and book-keeping the production requests from all PAG/POG/DPG
  https://cms-pdmv.cern.ch/mcm/
  https://twiki.cern.ch/twiki/bin/view/CMS/PdmVMcM
- Request Manager is the production service for book-keeping actual production
requests https://cmsweb.cern.ch/reqmgr/
- Why two?
➢ PREP/ReqMgr development went in parallel; ReqMgr aimed at doing PREP's job
➢ McM rewrote PREP with more integration with ReqMgr
➢ ReqMgr is production-oriented while McM is book-keeping- and information-oriented
➢ Chaining of workflows is not a concept in ReqMgr
➢ In a nutshell: one does the preparation/book-keeping, the other does the production
➢ More integration is possible (McM under cmsweb, simplify the interface, ...)
- WMAgent instances are
➢ Pulling workload from Request Manager and pushing production jobs to sites
➢ Injecting data to DBS & PhEDEx
- This is not enough to do the job
What was missing
- What sites to use for what purpose
- How much to queue to sites
- Where to locate input data when needed
- When is the data placed and ready to be used
- Is the production complete and sane
- Where to place the output for users
- All this, or most of it, has been done by hand
- A lot of automation was put into “gen-sim” production (including FastSim)
- Not much was done for “digi-reco”
Goals and Strategy
- Reduce manual intervention to the minimum (it always fails in the commissioning part)
➔ Adopt a set of generic rules that should lead to a stable operation
- Reduce latency for delivery of samples
➔ Automate all steps for requests not having any issue
- Reduce re-shuffling of priorities
➔ Spread the load using multi-site white list systematically
- Increase throughput
➔ Use as many sites as possible
Implementation
- Python modules developed from previous scripts
https://github.com/CMSCompOps/WmAgentScripts
- Unify handling of all request types in a single piece of software
https://github.com/CMSCompOps/WmAgentScripts/tree/master/Unified
- Documentation from the beginning
https://twiki.cern.ch/twiki/bin/view/CMSPublic/CompOpsWorkflowL3Responsibilities#Automatic_Assignment_and_Unified
- Monitoring from day one
https://cmst2.web.cern.ch/cmst2/unified/
- Adopt a set of representative statuses (see next slide)
- Use a simple database back-end: an SQLite file with the Python SQLAlchemy library
http://www.sqlalchemy.org/
- Hourly polling cycles
- Configuration global or by campaign
- All modules can be run by hand, with an option to push the system when necessary
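The slides name SQLAlchemy over an SQLite file as the back-end; the sketch below illustrates the same single-file status-tracking idea using only the standard-library sqlite3 module, with a hypothetical one-table schema (not the actual Unified schema):

```python
import sqlite3

# In-memory DB for the example; Unified would point at a file on disk.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE workflows (
                   name   TEXT PRIMARY KEY,
                   status TEXT NOT NULL DEFAULT 'considered')""")

# injector: record a workflow picked up from ReqMgr
con.execute("INSERT INTO workflows (name) VALUES (?)", ("example_wf",))

# a later module (e.g. stagor) advances the status
con.execute("UPDATE workflows SET status = 'staged' WHERE name = ?",
            ("example_wf",))

status = con.execute("SELECT status FROM workflows WHERE name = ?",
                     ("example_wf",)).fetchone()[0]
```

Keeping the state in a single file means every hourly cycle, and any module run by hand, sees the same book-keeping.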
[State diagram: workflow statuses — considered, staging, staged, away, done, clean, trouble, forget — with transitions labeled Assignment, Remove input, No input needed, Input needed, Input available, Aborted, Cloned, Rejected]
Modules
- injector
- transferor
- stagor
- assignor
- checkor
- closor
- cleanor
- rejector
- From assignment-approved
[further diagram labels: Cloned, assistance, Closed-out issues, completed, close]
Jean-Roch Vlimant
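The hourly cycle driving these modules can be sketched as a status machine: each module picks up workflows in one status and advances them. Status names follow the diagram; the transition table and logic below are a simplified assumption, not the actual Unified code (rejector, which handles aborted/rejected workflows, is omitted):

```python
# Hypothetical mapping: module name -> (input status, output status).
TRANSITIONS = {
    "injector":   ("assignment-approved", "considered"),
    "transferor": ("considered", "staging"),
    "stagor":     ("staging", "staged"),
    "assignor":   ("staged", "away"),
    "checkor":    ("away", "close"),
    "closor":     ("close", "done"),
    "cleanor":    ("done", "clean"),
}

def run_cycle(workflows):
    """One polling pass: each module advances the workflows sitting in
    its input status (a snapshot is taken first, so a workflow moves at
    most one step per cycle)."""
    snapshot = dict(workflows)
    for module, (src, dst) in TRANSITIONS.items():
        for wf, status in snapshot.items():
            if status == src:
                workflows[wf] = dst
    return workflows
```

In the real system each module is a stand-alone script in WmAgentScripts/Unified that can also be invoked by hand to push the system.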
Rules
- Look at workflows in terms of input needed
➢ If none, go with all T2s and T1s
➢ If primary input but no secondary, same
➢ If secondary (PU), go with T1s and strong T2s
- Distribute the secondary systematically to all sites in the whitelist
- Distribute primary inputs to produce a certain number of copies (3) of the input dataset across all sites in the whitelist
➢ Datasets are chopped into 1 TB chunks, and these chunks are spread
- Once initiated transfers complete, also use residual blocks of input at other sites to inflate the site whitelist
- Assign to sites, restricted to where secondary is
“wmagent business interlude”
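The chopping-and-spreading rule above can be sketched as follows; the chunking granularity is from the slide, but the round-robin placement and all names are assumptions, not the exact Unified algorithm:

```python
def chop(blocks, chunk_tb=1.0):
    """Group (block_name, size_tb) pairs into chunks of roughly chunk_tb."""
    chunks, current, size = [], [], 0.0
    for block, tb in blocks:
        current.append(block)
        size += tb
        if size >= chunk_tb:      # chunk is full: start a new one
            chunks.append(current)
            current, size = [], 0.0
    if current:
        chunks.append(current)
    return chunks

def spread(chunks, whitelist, copies=3):
    """Place each chunk at `copies` sites, round-robin over the whitelist,
    so roughly `copies` full replicas of the dataset exist overall."""
    placement = {site: [] for site in whitelist}
    for c in range(copies):
        for i, chunk in enumerate(chunks):
            site = whitelist[(i + c) % len(whitelist)]
            placement[site].append(chunk)
    return placement
```

With a whitelist larger than `copies`, each chunk lands at distinct sites, spreading both storage and subsequent processing load.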
- Once completed, check for processing completion, data injection completion to DBS and PhEDEx, lumi size, lumi duplications, and custodial replication requests
➢ If checking out, the output is passed on to DDM (when applicable) and back to McM
➢ If not, initiate custodial replication, wait for data injection, or let operators create recovery workflows
- Two days from completion, clean the input dataset from disk (except for one copy?)
- Clean the output
➢ Completely from disk if one copy is under analysis ops (DDM)
➢ Keep a full copy on disk if none is under analysis ops
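The completion checks above amount to a checklist gating the hand-off to DDM and McM. The flag names and the lumi-size cap below are hypothetical stand-ins, only to show the shape of the decision:

```python
def checks_out(wf):
    """True if a completed workflow passes all closing checks.
    `wf` is a dict of hypothetical flags/metrics for one workflow."""
    return all([
        wf["events_done"] >= wf["events_expected"],    # processing complete
        wf["dbs_injected"] and wf["phedex_injected"],  # data injection done
        wf["lumi_duplicates"] == 0,                    # no duplicated lumis
        wf["max_lumi_size_gb"] <= 8,                   # lumi size sane (made-up cap)
        wf["custodial_requested"],                     # tape replication asked for
    ])

def next_step(wf):
    if checks_out(wf):
        return "pass output to DDM (when applicable) and back to McM"
    return "assistance: custodial/injection pending or recovery workflow needed"
```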