
Production & Reprocessing Unified Overview Jean-Roch Vlimant - PowerPoint PPT Presentation



  1. Production & Reprocessing “Unified” Overview Jean-Roch Vlimant

  2. Big Picture
     [Diagram. Labels recovered from the slide: O&C, PPD, Physics, McM, Data Mngt, ?, Unified (THIS TALK), Workflow Management, HTCondor, T0, T1s, T2s]
     12/06/16 CMS Alca/DB Meeting, Unified, JR Vlimant

  3. In a Nutshell
     ● Automatic transfers, automatic assignment, workload overflowing (see backup slides)
     ● Simple sqlite DB with 3 tables https://github.com/CMSCompOps/WmAgentScripts/blob/master/Unified/assignSchema.py
       ➔ Actual DB file on afs
         ● 80k workflows (460 char)
         ● 124k outputs (430 char, 5 int)
         ● 37k transfers (1 int, 1 pickled string representing a vector of int of size <40)
       ➔ Lightweight schema (right?)
       ➔ Plan on adding one more table temporarily, depending on needs, for locking datasets/blocks. Should never exceed 5k entries
     ● Access pattern
       ➔ Currently have difficulties with concurrent updates to the table, on different records
       ➔ Commits are essentially status changes (30 char) and insertion of outputs (5 per 10s) or transfers, plus rare modification of transfer attributes
       ➔ Read using wide filters, ~1000 workflows at a time, every 20 to 30 min
     ● Load foreseen
       ➔ Might go up to 3-4x ~1000 workflows at a time
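
     The access pattern above (small status commits, wide filtered reads) can be sketched as follows. This is only an illustration under stated assumptions: the table and column names are hypothetical and are not taken from assignSchema.py, the standard-library sqlite3 module is used instead of sqlalchemy to keep the sketch self-contained, and the transfer's list of workflow ids is stored as text rather than as a pickled string.

```python
import sqlite3

# Hypothetical three-table layout; the real schema lives in assignSchema.py
# and may differ in names, types, and keys.
SCHEMA = """
CREATE TABLE workflow (
    id INTEGER PRIMARY KEY,
    name TEXT UNIQUE NOT NULL,
    status TEXT NOT NULL
);
CREATE TABLE output (
    id INTEGER PRIMARY KEY,
    workflow_id INTEGER NOT NULL REFERENCES workflow(id),
    dataset TEXT NOT NULL
);
CREATE TABLE transfer (
    id INTEGER PRIMARY KEY,
    request_id INTEGER NOT NULL,
    workflow_ids TEXT NOT NULL  -- assumption: serialized list of workflow ids
);
CREATE INDEX idx_workflow_status ON workflow(status);
"""


def open_db(path=":memory:"):
    """Open (or create) the database and install the sketch schema."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def set_status(conn, name, status):
    """Commit a single status change; returns the number of rows updated."""
    with conn:  # commits on success, rolls back on exception
        cur = conn.execute(
            "UPDATE workflow SET status = ? WHERE name = ?", (status, name)
        )
    return cur.rowcount


def workflows_in(conn, statuses, limit=1000):
    """Wide-filter read: names of up to `limit` workflows in the given statuses."""
    marks = ",".join("?" for _ in statuses)
    rows = conn.execute(
        f"SELECT name FROM workflow WHERE status IN ({marks}) ORDER BY id LIMIT ?",
        (*statuses, limit),
    )
    return [row[0] for row in rows]
```

     Keeping each status change in its own short transaction is one way to limit the time a writer holds the sqlite file lock; it does not remove the single-writer limitation the slide refers to.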

  4. Integration
     ● Using sqlalchemy
     ● Tested on devdb12 (June, July?)
       ➢ Maxed out the account quota very fast; could perform read/write and automatic indexing
     ● Tested on int9r (Dec)
       ➢ Same schema, no issues
       ➢ Looking forward to a backed-up, reliable table that supports concurrent updates

  5. Unified State Diagram https://cmst2.web.cern.ch/cmst2/unified/
     [Diagram. Labels recovered from the slide: From assignment-approved, Cloned, considered, Input needed, No input needed, staging, Input available, staged, forget, Assignment, Rejected, trouble, away, issues, completed, Aborted, assistance, close, Closed-out, done]

  6. ReqMgr2 State Diagram https://cmsweb.cern.ch/wmstats/index.html

  7. Strategies 1/3
     ● If any input is needed
       ➔ Distribute primary in N copies (configurable)
         ✔ Among sites matching the workflow requirements (Core, Mem, quota, …)
         ✔ Any existing block counts towards the N copies
         ✔ Distributed in chunks of 4TB
         ✔ Any existing subscription counts towards the N copies
         ✔ Locks the dataset from deletion
     ● When all requirements are met
       ➔ The N copies are ready at sites matching workflow requirements
       ➔ A transfer appears stuck
         ✔ Early start with ≥1
       ➔ Send back to setting a transfer if too many end points are in downtime
     ● Assignment in request manager (all below configurable by campaign)
       ➔ Set the lfn from parents, campaign, or default to /store/mc
       ➔ Set the mem/time watchdogs
       ➔ Tune the splitting (pre-assignment)
       ➔ Use as many compatible sites as possible
       ➔ Set xrootd flags on primary and/or secondary
       ➔ Pick a site for full copy among whitelist
         ✔ T1 first, then T2
         ✔ DDM-buffer enforced
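
     The "N copies in chunks of 4TB" strategy could look roughly like the sketch below. It is not the Unified implementation: the function names, the greedy in-order chunking, the round-robin choice of destination sites, and the decision to count only existing replicas at matching sites are all assumptions made for illustration.

```python
def chunk_blocks(block_sizes_tb, chunk_tb=4.0):
    """Group (block, size_tb) pairs, in order, into chunks of at most chunk_tb.

    A block larger than chunk_tb ends up alone in its own chunk.
    """
    chunks, current, current_size = [], [], 0.0
    for name, size in block_sizes_tb:
        if current and current_size + size > chunk_tb:
            chunks.append(current)
            current, current_size = [], 0.0
        current.append(name)
        current_size += size
    if current:
        chunks.append(current)
    return chunks


def plan_copies(block_sizes_tb, sites, n_copies, existing=None, chunk_tb=4.0):
    """Return {site: [blocks to transfer]} so each block reaches n_copies replicas.

    sites: list of sites already matched to the workflow requirements.
    existing: {block: iterable of sites already holding it}; replicas at
    matching sites count towards n_copies (assumption).
    """
    existing = existing or {}
    plan = {site: [] for site in sites}
    for i, chunk in enumerate(chunk_blocks(block_sizes_tb, chunk_tb)):
        # Rotate the starting site per chunk to spread the load (assumption).
        shift = i % len(sites)
        rotated = sites[shift:] + sites[:shift]
        for block in chunk:
            have = set(existing.get(block, ())) & set(sites)
            missing = max(0, n_copies - len(have))
            for site in [s for s in rotated if s not in have][:missing]:
                plan[site].append(block)
    return plan
```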

  8. Strategies 2/3
     ● When WorkQueue elements appear with no location
       ➔ Transfer the missing blocks to a site in the whitelist
     ● When workflow has too many pending jobs
       ➔ Overflow to neighboring (hard-coded) sites
     ● Adapt job Memory requirement (job classad)
       ➔ From history of successful jobs per task: 95th percentile of Mem distribution with >100 successful jobs
       ➔ Hook for job time requirement not used
     ● Truncate the processing
       ➔ If >99% after 7 days, force complete 10 per cycle
       ➔ If the requester asks for it via McM API
     ● Verify the output (see later if something is not right)
       ➔ Completed in terms of # of lumisections (95% pass bar, 100% for data), fall back to # of events in case of request manager corruption
       ➔ Lumi section size is “small enough” (mostly ignored now)
       ➔ DBS/Phedex file count consistency, lfn consistency
       ➔ Output dataset consistency
       ➔ Tape subscription made when applicable
       ➔ Duplicate lumi
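
     The memory adaptation rule (95th percentile of the memory distribution, once a task has more than 100 successful jobs) might be sketched as below. The function names, the MB unit, and the nearest-rank percentile definition are assumptions; the slide does not say which percentile estimator is used or how the value is written to the job classad.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile of a non-empty sequence (q in 0..100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def tuned_memory_mb(peak_memory_mb, current_request_mb, min_jobs=100, q=95):
    """Suggest a memory request for a task from its successful-job history.

    peak_memory_mb: peak memory of each successful job of the task.
    Keeps the current request until more than min_jobs jobs have succeeded.
    """
    if len(peak_memory_mb) <= min_jobs:
        return current_request_mb
    return percentile(peak_memory_mb, q)
```

     For example, with 200 successful jobs whose peak memory runs from 1000 to 1199 MB, `tuned_memory_mb(list(range(1000, 1200)), 2000)` returns 1189, while a history of exactly 100 jobs leaves the request at 2000.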

  9. Strategies 3/3
     ● Subscribe most production blocks to DataOps
       ➔ For on-going workflows, closed-out, and announced
         ✗ Some blocks from aborted workflows might be unclaimed
     ● Create Harvesting
       ➔ For data requests, once the /DQMIO is in full somewhere, extract info from original workflow and make the harvesting (no further check on statuses)
     ● Set workflow announced
       ➔ Triggering condition for next step in McM
     ● Set the status VALID
       ➔ Although datasets are usable beforehand
       ➔ Dataset might not be in full at a single site, but all blocks are out at production sites
     ● Send analysis datasets to AnalysisOps when applicable
       ➔ DDM scripts turn the full subscriptions to AnalysisOps
       ➔ Production blocks are left as DataOps
     ● Locks are released
       ➔ When requested tape copy is completed
       ➔ When no other workflow uses that /LHE or 30 days
       ➔ If INVALID or non-existent (aborted/rejected workflow)
       ➔ ++

  10. When Things Go Wrong
     ● Stuck transfers are exposed to transfer team
       ➔ Both to disk and tape https://cmst2.web.cern.ch/cmst2/unified/data.html
     ● Workflow with high failure rate
       ➔ Shifters' alert
       ➔ Inspected and manually aborted
       ➔ Automated notification to requester via McM (if via Unified)
     ● Workflow and dataset to be rejected
       ➔ Detected from McM and operated
     ● When requester asks for update
       ➔ https://dmytro.web.cern.ch/dmytro/cmsprodmon/workflows.php?prep_id=HIG-RunIISpring16DR80-01780
       ➔ http://dabercro.web.cern.ch/dabercro/unified/showlog/?search=HIG-RunIISpring16DR80-01780
       ➔ All needed information is pretty much available
     ● Workflow with some failures
       ➔ And does not pass the completion bar: see next slide

  11. Recoveries
     ● Investigation of errors
       ➔ Itemized https://cmst2.web.cern.ch/cmst2/unified/assistance.html
       ➔ Drill down to error report on https://cmsweb.cern.ch/wmstats/index.html
       ➔ Categorized by most popular issues and cast:
         ➢ ACDC: fetches from request manager the missing bits from failing jobs, creates a new set of jobs and submits them
         ➢ Clone: just restart from square one, Unified picks it back up
         ➢ Recovery: evolved procedure to create an ACDC document with what is needed to remake the missing data (mostly used for data rereco)
         ➢ Extension: create new events in non-overlapping lumi-sections (rarely used these days)
       ➔ ACDC are partially handled by Unified automatically following some rules
         ➢ >20% 50660: bump Mem by 1G
         ➢ >20% 50664: split x2
         ➢ >20% 61104: plain recovery
         ➢ >20% 8028: plain recovery
         ➢ >20% 8021: plain recovery (if FileReadError)
         ➢ >20% 8001: split x4 (if No lhe event found in ExternalLHEProducer)
       ➔ Things usually clear out on first round
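
     The automatic ACDC rules above amount to a lookup from error code to action, gated by a 20% threshold. A minimal sketch follows, assuming the threshold is the fraction of a workflow's failed jobs that carry a given code (the slide does not define the denominator) and that the FileReadError and LHE conditions are substring matches on the error message. The function name, input shape, and action strings are hypothetical.

```python
from collections import Counter

# error code -> (action, substring required in the error message, or None)
RULES = {
    50660: ("bump memory by 1G", None),
    50664: ("split x2", None),
    61104: ("plain recovery", None),
    8028: ("plain recovery", None),
    8021: ("plain recovery", "FileReadError"),
    8001: ("split x4", "No lhe event found in ExternalLHEProducer"),
}


def acdc_actions(failures, threshold=0.20):
    """Map failed jobs to automatic recovery actions.

    failures: list of (error_code, error_message), one entry per failed job.
    Returns {error_code: action} for each rule whose matching failures make
    up more than `threshold` of all failures (assumed denominator).
    """
    total = len(failures)
    counts = Counter()
    for code, message in failures:
        rule = RULES.get(code)
        if rule is None:
            continue  # no automatic rule: left to the operators
        needle = rule[1]
        if needle is None or needle in message:
            counts[code] += 1
    return {code: RULES[code][0] for code, n in counts.items() if n / total > threshold}
```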

  12. When Things Go Really Bad
     ● Causes of very large tails (not in order of importance)
       ➔ Long-lasting workflow has merge issues → recovery fails because the unmerged files are being removed at the site
         ➢ A solution is to have the site clean up according to a list that can be extracted from request manager
       ➔ Lots of workflows to be inspected by hand, workflows with low priority looked at last → recovery fails because unmerged files are gone
         ➢ Do much less by hand (increase automation)
         ➢ Do things much faster by hand (see Dan's slides)
       ➔ Site going down → clone → another site going down, …
         ➢ Solved by using a more reliable site at last stage
         ➢ Maybe need to prune sites by availability?
         ➢ Maybe match estimated workload to site mean-uptime?
       ➔ Performance issue → clone with good splitting → other issue → clone with initial splitting, …
         ➢ Operator interference, bad issue tracking, ...
       ➔ Data reprocessing needing ~100% completion
         ✗ Large lumi prevents job creation
         ✗ Segfault = no fwjreport
         ✗ ACDC of ACDC finishes with no error
         ✗ Assignment mistakes
         ✗ Bad issue tracking
         ...
