  1. Computing Operation: A stab at improving production throughput (Jean-Roch Vlimant, CalTech Group Meeting, 05/19/15)

  2. Outlook
  ● Figure out cleaning procedures: people would like to use DDM for managing the dataops space; this will require some work, headaches, and testing
  ● Alert system: get an indication that something is being held up for assignment
  ● Time monitoring: see the dynamics of workflows passing through the system
  ● Figure out a better “# of copies” strategy: by size? estimated CPU? priority? … are transfers parallel, so no delays are added? (sketch below)
  ● Tune parameters to prevent starvation
  ● We were almost there, but then we added several T2s to the digi-reco pool and things are now going through very fast
  ● Let it run and have Matteo (co-L3) take care of it every other week
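The “# of copies” item above only lists candidate inputs (size, estimated CPU, priority); nothing in the deck fixes how they would be combined. Below is a purely hypothetical sketch of such a heuristic, with made-up thresholds and weights, just to illustrate the kind of rule the outlook item is asking for.

```python
# Hypothetical "# of copies" heuristic for input placement.
# Thresholds and weights are illustrative assumptions, not the actual Unified logic.

def number_of_copies(input_size_tb, estimated_cpu_hours, priority,
                     default=3, max_copies=6):
    """Guess how many disk copies of the input dataset to spread over sites."""
    copies = default
    # Large expected CPU work benefits from more replicas, so more sites can run.
    if estimated_cpu_hours > 1e6:
        copies += 1
    # High-priority requests get an extra copy to reduce latency.
    if priority >= 90000:
        copies += 1
    # Very large inputs are expensive to replicate; do not go above the default.
    if input_size_tb > 100:
        copies = min(copies, default)
    return min(copies, max_copies)

print(number_of_copies(input_size_tb=20, estimated_cpu_hours=2e6, priority=85000))
```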

  3. Overview
  ● McM is the service for organizing, configuring and book-keeping the production requests from all PAGs/POGs/DPGs
    https://cms-pdmv.cern.ch/mcm/
    https://twiki.cern.ch/twiki/bin/view/CMS/PdmVMcM
  ● Request Manager is the production service for book-keeping the actual production requests
    https://cmsweb.cern.ch/reqmgr/
  ● Why two?
    ➢ PREP and ReqMgr development went on in parallel; ReqMgr aimed at doing PREP's job
    ➢ McM rewrote PREP with more integration with ReqMgr
    ➢ ReqMgr is production oriented, while McM is book-keeping and information oriented
    ➢ Chaining of workflows is not a concept of ReqMgr
    ➢ In a nutshell: one does the preparation/book-keeping, the other does the production (sketch below)
    ➢ More integration is possible (McM under cmsweb, simplify the interface, ...)
  ● WMAgent(s) are
    ➢ pulling workload from Request Manager and pushing production jobs to sites
    ➢ injecting data into DBS & PhEDEx
  ● This is not enough to do the job
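A tool that automates operations has to read both views: the book-keeping side from McM and the production side from Request Manager. The minimal sketch below shows that split, assuming the two service URLs from this slide; the exact REST paths, the returned JSON layout, and the certificate-based authentication are assumptions for illustration and are not specified in the deck.

```python
# Minimal sketch: read a request from McM (book-keeping) and from Request
# Manager (production). REST paths and auth are assumptions, not from the slides.
import requests

CERT = ("usercert.pem", "userkey.pem")  # assumed: CMS grid certificate pair

def mcm_view(prepid):
    """Book-keeping view of a request in McM (path assumed)."""
    url = f"https://cms-pdmv.cern.ch/mcm/public/restapi/requests/get/{prepid}"
    return requests.get(url).json()

def reqmgr_view(workflow):
    """Production view of a workflow in Request Manager (path assumed)."""
    url = f"https://cmsweb.cern.ch/reqmgr/reqMgr/request?requestName={workflow}"
    return requests.get(url, cert=CERT, verify=False).json()
```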

  4. What was missing
  ● What sites to use for what purpose
  ● How much to queue to sites
  ● Where to locate input data when needed
  ● When is the data placed and ready to be used
  ● Is the production complete and sane
  ● Where to place the output for users
  ● All of this, or most of it, had been done by hand
  ● A lot of automation was put into “gen-sim” production (including FastSim)
  ● Not much was done for “digi-reco”

  5. Goals and Strategy
  ● Reduce manual intervention to the minimum (manual steps always fail during the commissioning part)
    ➔ Adopt a set of generic rules that should lead to a stable operation
  ● Reduce latency for delivery of samples
    ➔ Automate all steps for requests not having any issue
  ● Reduce re-shuffling of priorities
    ➔ Spread the load by using a multi-site whitelist systematically (sketch below)
  ● Increase throughput
    ➔ Use as many sites as possible
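The multi-site whitelist strategy means a request is not pinned to one site; the expected work is shared across every usable site. A minimal sketch of proportional sharing is below; the site names, slot counts, and the proportional split itself are illustrative assumptions, not the actual assignment algorithm.

```python
# Illustrative: split a request's jobs over a multi-site whitelist in
# proportion to each site's CPU slots. Names and numbers are made up.

def share_work(total_jobs, site_slots):
    """Return an approximate per-site job share, proportional to slots."""
    total_slots = sum(site_slots.values())
    return {site: round(total_jobs * slots / total_slots)
            for site, slots in site_slots.items()}

whitelist = {"T1_US_FNAL": 8000, "T2_US_Caltech": 3000, "T2_CH_CERN": 5000}
print(share_work(total_jobs=10000, site_slots=whitelist))
```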

  6. Implementation
  ● Python modules developed from previous scripts
    https://github.com/CMSCompOps/WmAgentScripts
  ● Unify the handling of all request types into a single piece of software
    https://github.com/CMSCompOps/WmAgentScripts/tree/master/Unified
  ● Documentation from the beginning
    https://twiki.cern.ch/twiki/bin/view/CMSPublic/CompOpsWorkflowL3Responsibilities#Automatic_Assignment_and_Unified
  ● Monitoring from day one
    https://cmst2.web.cern.ch/cmst2/unified/
  ● Adopt a set of representative statuses (see next slide)
  ● Use a simple database back-end: an sqlite file with the python sqlalchemy library (sketch below)
    http://www.sqlalchemy.org/
  ● Hourly polling cycles
  ● Configuration global or by campaign
  ● All modules can be run by hand, with options to push the system when necessary
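A minimal sketch of the back-end described here: an sqlite file accessed through sqlalchemy, holding workflows with a status and a campaign, polled on an hourly cycle. The table name, column names, and the loop body are illustrative assumptions, not the actual Unified schema or module code.

```python
# Sketch, assuming a simple one-table schema; not the real Unified layout.
import time
from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class Workflow(Base):
    __tablename__ = "workflows"
    id = Column(Integer, primary_key=True)
    name = Column(String, unique=True)
    campaign = Column(String)   # configuration can be global or per campaign
    status = Column(String)     # considered, staged, away, close, done, ...

engine = create_engine("sqlite:///unified.db")   # the sqlite file back-end
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)

def polling_cycle():
    """One pass of the modules (injector, transferor, ...) over the database."""
    session = Session()
    for wf in session.query(Workflow).filter(Workflow.status == "considered"):
        pass  # e.g. hand the workflow to the transferor/assignor modules
    session.close()

if __name__ == "__main__":
    while True:              # hourly polling, as on the slide
        polling_cycle()
        time.sleep(3600)
```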

  7. From assignment-approved
  [State diagram, not reproduced here: workflow statuses flowing from assignment-approved through considered, staged and away to completed, close, closed-out, done and clean, with trouble, assistance, forget, rejected and aborted branches, plus edge conditions such as "input needed", "no input staging needed", "input available" and "remove input". Modules acting on them: injector, transferor, stagor, assignor, checkor, closor, cleanor, rejector. A sketch of these transitions follows.]
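The statuses above can be read as a small state machine. The sketch below encodes them as an allowed-transitions table; the set of edges is reconstructed from the garbled diagram text and is likely incomplete, so treat it as illustrative only.

```python
# Illustrative reconstruction of the status transitions; edges are assumptions.
TRANSITIONS = {
    "assignment-approved": ["considered"],           # injector picks up new requests
    "considered": ["staged", "trouble"],             # transferor stages input if needed
    "staged": ["away", "trouble"],                   # assignor hands it to wmagent
    "away": ["assistance", "close"],                 # checkor watches completion
    "assistance": ["away", "close", "trouble"],      # operators create recovery workflows
    "close": ["done"],                               # closor closes out, cleanor removes input
    "trouble": ["forget", "considered"],             # rejector or manual intervention
}

def can_move(current, new):
    """True if the (assumed) state machine allows the transition."""
    return new in TRANSITIONS.get(current, [])
```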

  8. Rules
  ● Look at workflows in terms of the input needed
    ➢ If none, go with all T2s and T1s
    ➢ If there is primary input but no secondary, same
    ➢ If there is secondary input (PU), go with T1s and strong T2s
  ● Distribute the secondary input systematically to all sites in the whitelist
  ● Distribute the primary inputs so as to produce a certain number of copies (3) of the input dataset across all sites in the whitelist
    ➢ Datasets are chopped into 1 TB chunks, and these chunks are spread over the sites (sketch below)
  ● Once the initiated transfers complete, also use residual blocks of input at other sites to inflate the site whitelist
  ● Assign to sites, restricted to where the secondary is (“wmagent business” interlude)
  ● Once completed, check for processing completion and data injection completion into DBS and PhEDEx; check lumi size, lumi duplications, and custodial replication requests
    ➢ If the workflow checks out, the output is passed on to DDM (when applicable) and back to McM
    ➢ If not, initiate custodial replication, wait for data injection, or let operators create recovery workflows
  ● Two days from completion, clean the input dataset from disk (except for one copy?)
  ● Clean the output
    ➢ Completely from disk if one copy is under analysis ops (DDM)
    ➢ Keep a full copy on disk if none is under analysis ops
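The primary-input placement rule lends itself to a short sketch: group the input blocks into roughly 1 TB chunks, then spread the chunks over the site whitelist so that the requested number of copies (3 by default) exists in total. Block names, site names, and the round-robin placement are illustrative assumptions.

```python
# Sketch of "chop into 1 TB chunks and spread for N copies"; placement policy assumed.
import itertools

def chop_blocks(blocks, chunk_size_tb=1.0):
    """Group (block_name, size_tb) pairs into chunks of roughly chunk_size_tb."""
    chunks, current, current_size = [], [], 0.0
    for block, size in blocks:
        current.append(block)
        current_size += size
        if current_size >= chunk_size_tb:
            chunks.append(current)
            current, current_size = [], 0.0
    if current:
        chunks.append(current)
    return chunks

def spread(chunks, sites, copies=3):
    """Place each chunk at `copies` sites, cycling over the whitelist
    (assumes more sites than copies, so replicas land at distinct sites)."""
    placement = {site: [] for site in sites}
    site_cycle = itertools.cycle(sites)
    for chunk in chunks:
        for _ in range(copies):
            placement[next(site_cycle)].extend(chunk)
    return placement
```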
