Computing Operation: A stab at improving production through-put



SLIDE 1

Computing Operation A stab at improving production through-put

SLIDE 2

05/19/15 CalTech Group Meeting 2

Outlook

  • Figure out cleaning procedures. People would like to use DDM for managing the dataops space: this will require some work, headaches and testing
  • Alert system: to get an indication that something is being held for assignment
  • Time monitoring: to see the dynamics of workflows passing through the system
  • Figure out a better “# of copies” strategy: based on size? Estimated CPU? Priority? … Transfers are parallel = no delays added?
  • Tune parameters to prevent starvation
  • We were almost there, but then we added several T2s to the digi-reco pool and things are now going through very fast
  • Let it run and have Matteo (co-L3) take care of it every other week

SLIDE 3

Overview

  • McM is the service for organizing, configuring and book-keeping the production requests from all PAG/POG/DPG
https://cms-pdmv.cern.ch/mcm/
https://twiki.cern.ch/twiki/bin/view/CMS/PdmVMcM

  • Request Manager is the production service for book-keeping the actual production requests
https://cmsweb.cern.ch/reqmgr/

  • Why two?
➢ PREP and reqmgr development went on in parallel; reqmgr aimed at doing the job of PREP
➢ McM rewrote PREP with more integration with reqmgr
➢ Reqmgr is production oriented, while McM is book-keeping and information oriented
➢ Chaining of workflows is not a concept in reqmgr
➢ In a nutshell: one does the preparation/book-keeping, the other does the production
➢ More integration is possible (McM under cmsweb, a simplified interface, ...)

  • WMAgent is/are
➢ pulling workload from Request Manager and pushing production jobs to sites
➢ injecting data into DBS & PhEDEx

  • This is not enough to do the job
SLIDE 4

What was missing

  • What sites to use for what purpose
  • How much to queue to sites
  • Where to locate input data when needed
  • When is the data placed and ready to be used
  • Is the production complete and sane
  • Where to place the output for users
  • All of this, or most of it, has been done by hand
  • A lot of automation was put in place for “gen-sim” production (including fastsim)
  • Not much was done for “digi-reco”
SLIDE 5

Goals and Strategy

  • Reduce manual intervention to the minimum (it always fails in the commissioning part)
➔ Adopt a set of generic rules that should lead to stable operation

  • Reduce latency for the delivery of samples
➔ Automate all steps for requests that do not have any issue

  • Reduce re-shuffling of priorities
➔ Systematically spread the load using a multi-site whitelist

  • Increase throughput
➔ Use as many sites as possible

SLIDE 6

Implementation

  • Python modules developed from previous scripts

https://github.com/CMSCompOps/WmAgentScripts

  • Unify the handling of all request types in a single piece of software

https://github.com/CMSCompOps/WmAgentScripts/tree/master/Unified

  • Documentation from the beginning

https://twiki.cern.ch/twiki/bin/view/CMSPublic/CompOpsWorkflowL3Responsibilities#Automatic_Assignment_and_Unified

  • Monitoring from day one

https://cmst2.web.cern.ch/cmst2/unified/

  • Adopt a set of representative statuses (see next slide)
  • Use a simple database back-end: an sqlite file accessed with the python sqlalchemy library

http://www.sqlalchemy.org/

  • Hourly polling cycles
  • Configuration is global or per campaign
  • All modules can be run by hand, with an option to push the system when necessary
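
The status book-keeping and the "global or per campaign" configuration above can be sketched as follows. This is only an illustration, not the Unified code: it uses the standard-library sqlite3 module instead of sqlalchemy to stay dependency-free, and the table layout, the function names, the campaign name and the overriding value are all hypothetical. The default values (3 copies, 1 TB chunks) are the ones quoted on the Rules slide.

```python
import sqlite3

# Global defaults; values taken from the Rules slide (3 copies, 1 TB chunks).
GLOBAL_CONFIG = {"copies": 3, "chunk_tb": 1}
# Hypothetical per-campaign overrides, for illustration only.
CAMPAIGN_CONFIG = {"ExampleCampaign": {"copies": 2}}


def config_for(campaign):
    """Return the global configuration, overridden by campaign settings."""
    merged = dict(GLOBAL_CONFIG)
    merged.update(CAMPAIGN_CONFIG.get(campaign, {}))
    return merged


def open_db(path=":memory:"):
    """Open (or create) the sqlite file holding one status per workflow."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS workflow ("
        " name TEXT PRIMARY KEY,"
        " status TEXT NOT NULL)"
    )
    return conn


def set_status(conn, name, status):
    """Record the current status of a workflow, replacing any previous one."""
    conn.execute(
        "INSERT OR REPLACE INTO workflow (name, status) VALUES (?, ?)",
        (name, status),
    )
    conn.commit()


def with_status(conn, status):
    """List the workflows a module should pick up in one polling cycle."""
    rows = conn.execute(
        "SELECT name FROM workflow WHERE status = ? ORDER BY name", (status,)
    )
    return [row[0] for row in rows]
```

In such a layout each module would, once per polling cycle, read the workflows in the status it cares about, act on them, and write the new status back.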
SLIDE 7

[Figure: workflow status diagram, starting from assignment-approved. Labels recoverable from the diagram: considered, staging, staged, away, assistance, done, close, clean, trouble, forget; no input needed, input needed, input available, assignment, completed, closed-out, issues, remove input, aborted, cloned, rejected.]

Modules

  • injector
  • transferor
  • stagor
  • assignor
  • checkor
  • closor
  • cleanor
  • rejector

Jean-Roch Vlimant

SLIDE 8

Rules

  • Look at workflows in terms of the input needed
➢ If none, go with all T2s and T1s
➢ If primary input but no secondary, same
➢ If secondary (PU), go with T1s and strong T2s
  • Distribute the secondary input systematically to all sites in the whitelist
  • Distribute primary inputs to produce a certain number of copies (3) of the input dataset across all sites in the whitelist
➢ Datasets are chopped into 1 TB chunks, and these chunks are spread
  • Once the initiated transfers complete, also use residual blocks of input at other sites to inflate the site whitelist
  • Assign to sites, restricted to where the secondary is

“wmagent business interlude”
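
The pre-assignment rules above can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual transferor or assignor code: the function names, the site names in any usage, and the round-robin placement are hypothetical. Only the whitelist rule, the 1 TB chunk size and the 3 copies come from the slide.

```python
def site_whitelist(has_secondary, t1s, strong_t2s, all_t2s):
    """Pick sites by input type: secondary (PU) input restricts to T1s and strong T2s."""
    if has_secondary:
        return sorted(set(t1s) | set(strong_t2s))
    # No input, or primary input only: all T1s and all T2s.
    return sorted(set(t1s) | set(all_t2s))


def chop(blocks, chunk_tb=1.0):
    """Group consecutive (block, size_in_TB) pairs into chunks of about chunk_tb."""
    chunks, current, size = [], [], 0.0
    for block, block_tb in blocks:
        current.append(block)
        size += block_tb
        if size >= chunk_tb:
            chunks.append(current)
            current, size = [], 0.0
    if current:
        chunks.append(current)  # leftover blocks form a last, smaller chunk
    return chunks


def spread(chunks, sites, copies=3):
    """Place each chunk on `copies` distinct sites, rotating the starting site.

    Round-robin is an assumption made for this sketch.
    """
    copies = min(copies, len(sites))
    placement = {site: [] for site in sites}
    for index, chunk in enumerate(chunks):
        for k in range(copies):
            site = sites[(index + k) % len(sites)]
            placement[site].extend(chunk)
    return placement
```

Rotating the starting site means no single site has to host the whole dataset, while every block still ends up with the requested number of copies.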

  • Once completed, check for processing completion, completion of data injection to DBS and PhEDEx, lumi size, lumi duplications, and custodial replication requests
➢ If the checks pass, the output is passed on to DDM (when applicable) and back to McM
➢ If not, initiate custodial, wait for data injection, or let operators create recovery workflows
  • Two days from completion, clean the input dataset from disk (except for one copy?)
  • Clean the output
➢ Completely from disk if one copy is under analysis ops (DDM)
➢ Keep a full copy on disk if none is under analysis ops
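
The post-completion decisions above can be sketched as a pair of small functions. This is a hedged illustration only: the check names, and which failed check leads to which follow-up action, are assumptions made for the example. The slide lists the checks and the possible actions but not the mapping between them.

```python
def next_actions(checks):
    """Map the outcome of the completion checks to follow-up actions.

    `checks` is a dict of booleans; its keys are hypothetical names.
    """
    actions = []
    if not checks["custodial_requested"]:
        actions.append("initiate custodial replication")
    if not (checks["in_dbs"] and checks["in_phedex"]):
        actions.append("wait for data injection")
    if not (checks["processing_complete"] and checks["lumis_ok"]):
        actions.append("let operators create recovery workflows")
    if not actions:
        # Everything checks out.
        actions.append("pass output on to DDM (when applicable) and back to McM")
    return actions


def output_disk_policy(copy_under_analysis_ops):
    """Decide what to do with the output once the workflow is closed."""
    if copy_under_analysis_ops:
        return "clean completely from disk"
    return "keep a full copy on disk"
```

For example, a workflow whose output is not yet in DBS would get only the "wait for data injection" action, and would be checked again in a later cycle.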