Production & Reprocessing Unified Overview
Jean-Roch Vlimant
12/06/16 CMS Alca/DB Meeting, Unified, JR Vlimant 2
Big Picture
[Diagram: O&C, PPD, Physics, McM, Workflow Management, HTCondor, Unified, T0, T1s, T2s, Data Mngt (?); one part is marked "THIS TALK"]
In a Nutshell
- Automatic transfers, automatic assignment, workload overflowing (see backup slides)
- Simple sqlite DB with 3 tables
https://github.com/CMSCompOps/WmAgentScripts/blob/master/Unified/assignSchema.py
➔ Actual DB file on afs
- 80k workflows (460 char)
- 124k outputs (430 char, 5 int)
- 37k transfers (1 int, 1 pickled string representing a vector of int of size <40)
➔ Lightweight schema (right?)
➔ Plan on adding one more table temporarily, depending on needs, for locking datasets/blocks. Should never exceed 5k entries
- Access pattern
➔ Currently have difficulties with concurrent updates to the table, on different records
➔ Commits are essentially status changes (30 char) and insertion of outputs (5 per 10s) or transfers, plus rare modification of transfer attributes
➔ Read using wide filters, ~1000 workflows at a time, every 20-30 min
- Load foreseen
➔ Might go up to 3-4x ~1000 workflows at a time
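The three-table layout and the access pattern above can be sketched as follows. This is a minimal illustration using the standard-library sqlite3 module, not the actual assignSchema.py: every table name, column name and helper function here is hypothetical.

```python
import pickle
import sqlite3


def create_schema(conn):
    """Create three small tables. All names are hypothetical."""
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS workflow (
            id INTEGER PRIMARY KEY,
            name TEXT UNIQUE NOT NULL,
            status TEXT NOT NULL
        );
        CREATE TABLE IF NOT EXISTS output (
            id INTEGER PRIMARY KEY,
            workflow_id INTEGER REFERENCES workflow(id),
            dataset TEXT NOT NULL,
            nlumis INTEGER,
            nevents INTEGER
        );
        CREATE TABLE IF NOT EXISTS transfer (
            id INTEGER PRIMARY KEY,
            transfer_id INTEGER NOT NULL,
            workflow_ids BLOB NOT NULL
        );
        CREATE INDEX IF NOT EXISTS workflow_status ON workflow(status);
    """)


def set_status(conn, name, status):
    """Typical commit: a short status change on one record."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "UPDATE workflow SET status = ? WHERE name = ?", (status, name)
        )


def workflows_with_status(conn, prefix, limit=1000):
    """Typical read: a wide filter returning up to ~1000 workflows."""
    rows = conn.execute(
        "SELECT name, status FROM workflow "
        "WHERE status LIKE ? ORDER BY id LIMIT ?",
        (prefix + "%", limit),
    )
    return rows.fetchall()


def add_transfer(conn, transfer_id, workflow_ids):
    """Store the vector of workflow ids as one pickled value."""
    with conn:
        conn.execute(
            "INSERT INTO transfer (transfer_id, workflow_ids) VALUES (?, ?)",
            (transfer_id, pickle.dumps(list(workflow_ids))),
        )
```

Each write is wrapped in its own short transaction, which keeps the time any writer holds the database lock small; that does not by itself remove the concurrency limits of a single sqlite file.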
Integration
- Using sqlalchemy
- Tested on devdb12 (June, July?)
➢ Maxed out the account quota very fast; could perform read/write and automatic indexing
- Tested on int9r (Dec)
➢ Same schema, no issues
➢ Looking forward to a backed-up, reliable table that supports concurrent updates
Unified State Diagram
[Figure: Unified state diagram. Labels recoverable from the slide: considered, staged, staging, away, done, Assignment, No input needed, Input needed, Input available, trouble, Aborted, Cloned, forget, Rejected, From assignment-approved, assistance, Closed-out, issues, completed, close]
https://cmst2.web.cern.ch/cmst2/unified/
ReqMgr2 State Diagram
https://cmsweb.cern.ch/wmstats/index.html
Strategies 1/3
- If any input is needed
➔ Distribute primary in N copies (configurable)
✔ Among sites matching the workflow requirements (Core, Mem, quota, …)
✔ Any existing block counts toward the N copies
✔ Distributed in chunks of 4TB
✔ Any existing subscription counts toward the N copies
✔ Locks the dataset from deletion
- When all requirements are met
➔ The N copies are ready at sites matching the workflow requirements
➔ A transfer appears stuck
✔ Early start with ≥1
➔ Send back to setting up a transfer if too many endpoints are in downtime
- Assignment in request manager (all below configurable by campaign)
➔ Set the lfn from parents, campaign, or default to /store/mc
➔ Set the mem/time watchdogs
➔ Tune the splitting (pre-assignment)
➔ Use as many compatible sites as possible
➔ Set xrootd flags on primary and/or secondary
➔ Pick a site for full copy among whitelist
✔ T1 first, then T2
✔ DDM-buffer enforced
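The primary-input distribution above (N copies, 4TB chunks, existing replicas counted) could look roughly like the sketch below. It is an illustration under assumptions, not the Unified implementation: the function names, the round-robin site choice and the input structures are all hypothetical, and `sites` is assumed to be already filtered to those matching the workflow requirements.

```python
from itertools import cycle

CHUNK_TB = 4.0  # chunk size quoted on the slide


def chunk_blocks(block_sizes_tb, chunk_tb=CHUNK_TB):
    """Group blocks (name -> size in TB) into consecutive chunks of at most chunk_tb."""
    chunks, current, current_size = [], [], 0.0
    for name, size in block_sizes_tb.items():
        if current and current_size + size > chunk_tb:
            chunks.append(current)
            current, current_size = [], 0.0
        current.append(name)
        current_size += size
    if current:
        chunks.append(current)
    return chunks


def plan_transfers(block_sizes_tb, existing, sites, n_copies):
    """Return {site: [blocks to transfer]} so each block reaches n_copies.

    existing maps block -> set of sites that already hold the block or
    are subscribed to it; those count toward the N copies.
    """
    plan = {site: [] for site in sites}
    rotation = cycle(sites)  # hypothetical policy: plain round-robin
    for chunk in chunk_blocks(block_sizes_tb):
        targets = [next(rotation) for _ in range(min(n_copies, len(sites)))]
        for block in chunk:
            have = set(existing.get(block, ())) & set(sites)
            missing = n_copies - len(have)
            for site in targets:
                if missing <= 0:
                    break
                if site not in have:
                    plan[site].append(block)
                    have.add(site)
                    missing -= 1
    return plan
```

A block larger than 4TB ends up alone in its own chunk in this sketch; how the real system treats that case is not stated on the slide.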
Strategies 2/3
- When WorkQueue elements appear with no location
➔ Transfer the missing blocks to a site in the whitelist
- When workflow has too many pending jobs
➔ Overflow to neighboring (hard-coded) sites
- Adapt job Memory requirement (job classad)
➔ From history of successful jobs per task: 95th percentile of the Mem distribution, with >100 successful jobs
➔ Hook for job time requirement not used
- Truncate the processing
➔ If >99% after 7 days, force complete 10 per cycle
➔ If the requester asks for it via McM API
- Verify the output (see later if something is not right)
➔ Completed in terms of # of lumisections (95% pass bar, 100% for data), falling back to # of events in case of request manager corruption
➔ Lumi section size is “small enough” (mostly ignored now)
➔ DBS/Phedex file count consistency, lfn consistency
➔ Output dataset consistency
➔ Tape subscription made when applicable
➔ Duplicate lumi
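Two of the rules above are simple enough to write down: the memory requirement taken from the 95th percentile of successful jobs, and the completion bar on lumisections. The sketch below is an assumption-laden illustration; the function names are hypothetical, the percentile uses the nearest-rank definition, and the event-count fallback is triggered here simply when the expected lumi count is missing.

```python
def memory_requirement_mb(successful_job_mem_mb, min_jobs=100, percentile=95):
    """Nearest-rank percentile of memory use, or None without >min_jobs jobs."""
    n = len(successful_job_mem_mb)
    if n <= min_jobs:
        return None  # not enough history to adapt the requirement
    ordered = sorted(successful_job_mem_mb)
    rank = (percentile * n + 99) // 100  # integer ceiling of percentile*n/100
    return ordered[rank - 1]


def passes_completion(produced_lumis, expected_lumis, is_data,
                      produced_events=None, expected_events=None):
    """Apply the pass bar: 100% for data, 95% otherwise."""
    bar_percent = 100 if is_data else 95
    if expected_lumis:
        return produced_lumis * 100 >= bar_percent * expected_lumis
    # Assumed fallback: compare event counts when lumi counts are unusable.
    if expected_events:
        return produced_events * 100 >= bar_percent * expected_events
    return False
```

For example, a Monte Carlo workflow with 96 of 100 expected lumisections passes, while the same fraction fails for data.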
Strategies 3/3
- Subscribe most production blocks to DataOps
➔ For on-going workflows, closed-out, and announced
✗ Some blocks from aborted workflows might be unclaimed
- Create Harvesting
➔ For data requests, once the /DQMIO is in full somewhere, extract info from original workflow and make the harvesting (no further check on statuses)
- Set workflow announced
➔ Triggering condition for next step in McM
- Set the status VALID
➔ Although datasets are usable beforehand
➔ Dataset might not be in full at a single site, but all blocks are out at production sites
- Send analysis datasets to AnalysisOps when applicable
➔ DDM scripts turn the full subscriptions to AnalysisOps
➔ Production blocks are left as DataOps
- Locks are released
➔ When requested tape copy is completed
➔ When no other workflow uses that /LHE or 30 days
➔ If INVALID or non-existent (aborted/rejected workflow)
➔ ++
When Things Go Wrong
- Stuck transfers are exposed to transfer team
➔ Both to disk and tape: https://cmst2.web.cern.ch/cmst2/unified/data.html
- Workflow with high failure rate
➔ Shifters' alert
➔ Inspected and manually aborted
➔ Automated notification to requester via McM (if via Unified)
- Workflow and dataset to be rejected
➔ Detected from McM and operated
- When requester asks for update
➔ https://dmytro.web.cern.ch/dmytro/cmsprodmon/workflows.php?prep_id=HIG-RunIISpring16DR80-01780
➔ http://dabercro.web.cern.ch/dabercro/unified/showlog/?search=HIG-RunIISpring16DR80-01780
➔ All needed information is pretty much available
- Workflow with some failures
➔ And does not pass the completion bar: see next slide
Recoveries
- Investigation of errors
➔ Itemized: https://cmst2.web.cern.ch/cmst2/unified/assistance.html
➔ Drill down to error report on https://cmsweb.cern.ch/wmstats/index.html
➔ Categorized by most popular issues and cast as:
✔ ACDC: fetches from request manager the missing bits from failing jobs, creates a new set of jobs and submits them
✔ Clone: just restart from square one, Unified picks it back up
✔ Recovery: evolved procedure to create an ACDC document with what is needed to remake the missing data (mostly used for data rereco)
✔ Extension: create new events in non-overlapping lumi sections (rarely used these days)
➔ ACDC are partially handled by Unified automatically following some rules
✔ >20% 50660: bump Mem by 1G
✔ >20% 50664: split x2
✔ >20% 61104: plain recovery
✔ >20% 8028: plain recovery
✔ >20% 8021: plain recovery (if FileReadError)
✔ >20% 8001: split x4 (if No lhe event found in ExternalLHEProducer)
➔ Things usually clear out on first round
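The automatic ACDC rules above map naturally onto a lookup table. The sketch below is one way to express them and is not taken from the Unified code: the action labels and function name are hypothetical, and the 20% threshold is interpreted here as the share of failed jobs carrying a given exit code, which the slide does not spell out.

```python
ERROR_THRESHOLD = 0.20  # ">20%" from the slide

# exit code -> (action label, text that must appear in the error message or None)
RULES = {
    50660: ("bump_memory_1G", None),
    50664: ("split_x2", None),
    61104: ("plain_recovery", None),
    8028: ("plain_recovery", None),
    8021: ("plain_recovery", "FileReadError"),
    8001: ("split_x4", "No lhe event found in ExternalLHEProducer"),
}


def recovery_actions(error_counts, total_failures, error_messages=None):
    """Return [(code, action)] for every rule that fires.

    error_counts maps exit code -> number of failed jobs with that code;
    error_messages maps exit code -> a representative error message.
    """
    error_messages = error_messages or {}
    actions = []
    for code, (action, required_text) in RULES.items():
        count = error_counts.get(code, 0)
        if total_failures <= 0 or count / total_failures <= ERROR_THRESHOLD:
            continue  # this code is not dominant enough
        if required_text and required_text not in error_messages.get(code, ""):
            continue  # the rule only applies to a specific error message
        actions.append((code, action))
    return actions
```

Keeping the rules in a table makes it cheap to add a new exit code once an operator has seen the same manual fix work a few times.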
When Things Go Really Bad
- Causes of very large tails (not in order of importance)
➔ Long-lasting workflow has merge issues → recovery fails because the unmerged files are being removed at the site
✔ A solution is to have the site clean up according to a list that can be extracted from request manager
➔ Lots of workflows to be inspected by hand, workflows with low priority looked at last → recovery fails because the unmerged files are gone
✔ Do much less by hand (increase automation)
✔ Do things much faster by hand (see Dan's slides)
➔ Site going down → clone → another site going down, …
✔ Solved by using a more reliable site at last stage
✔ Maybe need to prune sites by availability?
✔ Maybe match estimated workload to site mean uptime?
➔ Performance issue → clone with good splitting → other issue → clone with initial splitting, …
✗ Operator interference, bad issue tracking, ...
➔ Data reprocessing needing ~100% completion
✗ Large lumi prevents job creation
✗ Segfault = no fwjreport
✗ ACDC of ACDC finish with no error
✗ Assignment mistakes
✗ Bad issue tracking
...