SLIDE 1

Production & Reprocessing “Unified” Overview Jean-Roch Vlimant

SLIDE 2

12/06/16 CMS Alca/DB Meeting, Unified, JR Vlimant 2

Big Picture

[Diagram. Labels recovered from the figure: O&C, PPD, Physics, McM, Workflow Management, HTCondor, Unified, T0, T1s, T2s, "Data Mngt ?". One element is flagged "THIS TALK".]

SLIDE 3

In a Nutshell

  • Automatic transfers, automatic assignment, workload overflowing (see backup slides)
  • Simple sqlite DB with 3 tables

https://github.com/CMSCompOps/WmAgentScripts/blob/master/Unified/assignSchema.py

➔ Actual DB file on afs

  • 80k workflows (460 char)
  • 124k outputs (430 char, 5 int)
  • 37k transfers (1 int, 1 pickled string representing a vector of int of size <40)

➔ Lightweight schema (right?)
➔ Plan on adding one more table, temporarily and depending on needs, for locking datasets/blocks. Should never exceed 5k entries

  • Access pattern

➔ Currently have difficulties with concurrent updates to the table, on different records
➔ Commits are essentially status changes (30 char) and insertion of outputs (5 per 10s) or transfers, plus rare modification of transfer attributes
➔ Read using wide filters, ~1000 workflows at a time, every 20 to 30 min

  • Load foreseen

➔ Might go up to 3-4x ~1000 workflows at a time
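
The three-table layout and the access pattern above can be illustrated with a small sketch. It uses the standard-library sqlite3 module rather than sqlalchemy so that it runs on its own. All table names, column names and values are hypothetical and reduced to a minimum; the real schema is the one in assignSchema.py.

```python
import pickle
import sqlite3

# Hypothetical, simplified schema: names and columns are assumptions,
# not the definitions used in assignSchema.py.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE workflow (
    id INTEGER PRIMARY KEY,
    name TEXT,
    status TEXT
);
CREATE TABLE output (
    id INTEGER PRIMARY KEY,
    datasetname TEXT,
    status TEXT,
    workflow_id INTEGER
);
CREATE TABLE transfer (
    id INTEGER PRIMARY KEY,
    phedexid INTEGER,
    workflow_ids BLOB
);
""")

# Typical write: insert a workflow, then commit a status change.
conn.execute("INSERT INTO workflow (name, status) VALUES (?, ?)",
             ("wf_example_1", "considered"))
conn.execute("UPDATE workflow SET status = ? WHERE name = ?",
             ("staging", "wf_example_1"))

# A transfer row holds one integer plus a pickled list of workflow ids.
conn.execute("INSERT INTO transfer (phedexid, workflow_ids) VALUES (?, ?)",
             (12345, pickle.dumps([1])))
conn.commit()

# Typical read: a wide filter returning every workflow in a given status.
staging = [name for (name,) in
           conn.execute("SELECT name FROM workflow WHERE status = ?", ("staging",))]

# Unpickle the list of workflow ids attached to a transfer.
(blob,) = conn.execute("SELECT workflow_ids FROM transfer WHERE phedexid = ?",
                       (12345,)).fetchone()
linked_ids = pickle.loads(blob)
```

Here `staging` holds the names selected by the status filter and `linked_ids` the unpickled list of workflow ids for the transfer.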

SLIDE 4

Integration

  • Using sqlalchemy
  • Tested on devdb12 (June, July?)

➢ Maxed out the account quota very fast, could perform read/write and automatic indexing

  • Tested on int9r (Dec)

➢ Same schema, no issues
➢ Looking forward to a backed-up, reliable table supporting concurrent updates

SLIDE 5

Unified State Diagram

[State diagram. Labels recovered from the figure: considered, staging, staged, away, done, trouble, forget, assistance, close; Assignment, No input needed, Input needed, Input available, Aborted, Rejected, Cloned, From assignment-approved, Closed-out issues, completed.]

https://cmst2.web.cern.ch/cmst2/unified/

SLIDE 6

ReqMgr2 State Diagram

https://cmsweb.cern.ch/wmstats/index.html

SLIDE 7

Strategies 1/3

  • If any input is needed

➔ Distribute the primary in N copies (configurable)
  ✔ Among sites matching the workflow requirements (core, mem, quota, …)
  ✔ Any existing block counts towards the N copies
  ✔ Distributed in chunks of 4TB
  ✔ Any existing subscription counts towards the N copies
  ✔ Locks the dataset from deletion

  • When all requirements are met

➔ The N copies are ready at sites matching the workflow requirements
➔ A transfer appears stuck
  ✔ Early start with ≥1
➔ Send back to setting up a transfer if too many endpoints are in downtime

  • Assignment in request manager (all below configurable by campaign)

➔ Set the lfn from parents, campaign, or default to /store/mc
➔ Set the mem/time watchdogs
➔ Tune the splitting (pre-assignment)
➔ Use as many compatible sites as possible
➔ Set xrootd flags on primary and/or secondary
➔ Pick a site for full copy among whitelist
  ✔ T1 first, then T2
  ✔ DDM-buffer enforced
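
The input-placement rule on this slide (N copies, 4TB chunks, existing copies counting towards N) can be sketched as follows. This is only an illustration under assumptions: the function names, the rotation of the site list between chunks, and the site and block names are hypothetical, and site matching on core/mem/quota is taken as already done upstream.

```python
CHUNK_TB = 4.0  # chunk size quoted on the slide

def chunk_blocks(blocks, chunk_tb=CHUNK_TB):
    """Group (name, size_tb) blocks into consecutive chunks of at most chunk_tb.

    A single block larger than chunk_tb ends up alone in its own chunk.
    """
    chunks, current, current_size = [], [], 0.0
    for name, size in blocks:
        if current and current_size + size > chunk_tb:
            chunks.append(current)
            current, current_size = [], 0.0
        current.append(name)
        current_size += size
    if current:
        chunks.append(current)
    return chunks

def plan_copies(blocks, sites, n_copies, existing=None):
    """Return {site: [block, ...]} listing the transfers needed for n_copies.

    `existing` maps a block name to the sites that already host it;
    those copies count towards n_copies.
    """
    existing = existing or {}
    locations = {name: set(existing.get(name, ())) for name, _ in blocks}
    plan = {site: [] for site in sites}
    offset = 0
    for chunk in chunk_blocks(blocks):
        # Rotate the site list so successive chunks start at different sites.
        rotated = sites[offset:] + sites[:offset]
        offset = (offset + 1) % len(sites)
        for block in chunk:
            for site in rotated:
                if len(locations[block]) >= n_copies:
                    break
                if site not in locations[block]:
                    locations[block].add(site)
                    plan[site].append(block)
    return plan

# Hypothetical blocks (name, size in TB) and sites.
blocks = [("b1", 3.0), ("b2", 2.0), ("b3", 1.0)]
sites = ["T1_A", "T2_B", "T2_C"]
plan = plan_copies(blocks, sites, n_copies=2, existing={"b1": {"T2_C"}})
```

In this example `b1` already has one copy, so only one more transfer is planned for it, while `b2` and `b3` travel together as one chunk to two sites.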

SLIDE 8

Strategies 2/3

  • When WorkQueue elements appear with no location

➔ Transfer the missing blocks to a site in the whitelist

  • When workflow has too many pending jobs

➔ Overflow to neighboring (hard-coded) sites

  • Adapt job Memory requirement (job classad)

➔ From history of successful jobs per task: 95th percentile of the memory distribution, with >100 successful jobs
➔ Hook for job time requirement not used

  • Truncate the processing

➔ If >99% after 7 days, force complete 10 per cycle
➔ If the requester asks for it via the McM API

  • Verify the output (see later if something is not right)

➔ Completed in terms of # of lumisections (95% pass bar, 100% for data), falling back to # of events in case of request manager corruption
➔ Lumisection size is “small enough” (mostly ignored now)
➔ DBS/Phedex file count consistency, lfn consistency
➔ Output dataset consistency
➔ Tape subscription made when applicable
➔ Duplicate lumi
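
Two of the rules on this slide are simple enough to write down: the memory requirement taken from the 95th percentile of successful jobs, and the completion pass bar. The sketch below is an assumption-laden illustration, not Unified's code. Function and argument names are hypothetical, the percentile uses the nearest-rank definition, and "request manager corruption" is modelled simply as a missing expected lumisection count.

```python
def memory_requirement(peak_mem_mb, current_mb, percentile=95, min_jobs=100):
    """Return the percentile of observed peak memory per job.

    The current requirement is kept unless more than min_jobs
    successful jobs are available for the task.
    """
    n = len(peak_mem_mb)
    if n <= min_jobs:
        return current_mb
    ordered = sorted(peak_mem_mb)
    # Nearest-rank percentile, computed with integer arithmetic:
    # rank = ceil(percentile * n / 100).
    rank = (percentile * n + 99) // 100
    return ordered[rank - 1]

def passes_completion(produced_lumis, expected_lumis, is_data,
                      produced_events=None, expected_events=None):
    """Apply the pass bar: 95% of lumisections, 100% for data.

    Falls back to event counts when the expected lumisection count
    is missing (assumed here to be how corruption shows up).
    """
    bar = 1.0 if is_data else 0.95
    if expected_lumis:
        return produced_lumis / expected_lumis >= bar
    if expected_events:
        return produced_events / expected_events >= bar
    return False

# Hypothetical task history: 200 successful jobs with peak memory 1..200 MB.
new_mem = memory_requirement(list(range(1, 201)), current_mb=2000)
mc_ok = passes_completion(96, 100, is_data=False)
data_ok = passes_completion(99, 100, is_data=True)
```

With fewer than 101 successful jobs the function leaves the requirement untouched, which mirrors the ">100 successful jobs" condition.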

SLIDE 9

Strategies 3/3

  • Subscribe most production blocks to DataOps

➔ For on-going workflows, closed-out, and announced
  ✗ Some blocks from aborted workflows might be unclaimed

  • Create Harvesting

➔ For data requests, once the /DQMIO is in full somewhere, extract info from the original workflow and make the harvesting (no further check on statuses)

  • Set workflow announced

➔ Triggering condition for next step in McM

  • Set the status VALID

➔ Although datasets are usable beforehand
➔ Dataset might not be in full at a single site, but all blocks are out at production sites

  • Send analysis datasets to AnalysisOps when applicable

➔ DDM scripts turn the full subscriptions to AnalysisOps
➔ Production blocks are left as DataOps

  • Locks are released

➔ When the requested tape copy is completed
➔ When no other workflow uses that /LHE or 30 days
➔ If INVALID or non-existent (aborted/rejected workflow)
➔ ++

SLIDE 10

When Things Go Wrong

  • Stuck transfers are exposed to transfer team

➔ Both to disk and tape https://cmst2.web.cern.ch/cmst2/unified/data.html

  • Workflows with a high failure rate

➔ Shifters' alert
➔ Inspected and manually aborted
➔ Automated notification to requester via McM (if via Unified)

  • Workflows and datasets to be rejected

➔ Detected from McM and operated

  • When a requester asks for an update

https://dmytro.web.cern.ch/dmytro/cmsprodmon/workflows.php?prep_id=HIG-RunIISpring16DR80-01780

http://dabercro.web.cern.ch/dabercro/unified/showlog/?search=HIG-RunIISpring16DR80-01780

➔ All needed information is pretty much available

  • Workflows with some failures

➔ That do not pass the completion bar: see next slide

SLIDE 11

Recoveries

  • Investigation of errors

➔ Itemized at https://cmst2.web.cern.ch/cmst2/unified/assistance.html
➔ Drill down to the error report on https://cmsweb.cern.ch/wmstats/index.html
➔ Categorized by most popular issues and cast into one of:
  ➢ ACDC: fetches from request manager the missing bits from failing jobs, creates a new set of jobs and submits them
  ➢ Clone: just restart from square one, Unified picks it back up
  ➢ Recovery: evolved procedure to create an ACDC document with what is needed to remake the missing data (most used for data rereco)
  ➢ Extension: create new events in non-overlapping lumisections (rarely used these days)
➔ ACDC are partially handled by Unified automatically following some rules
  ➢ >20% 50660: bump memory by 1G
  ➢ >20% 50664: split x2
  ➢ >20% 61104: plain recovery
  ➢ >20% 8028: plain recovery
  ➢ >20% 8021: plain recovery (if FileReadError)
  ➢ >20% 8001: split x4 (if No lhe event found in ExternalLHEProducer)
➔ Things usually clear out on first round
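
The automatic ACDC rules above amount to a lookup from exit code to action, gated by a 20% threshold. A minimal sketch follows. The exit codes, actions and message conditions are the ones listed on the slide; the function name, the input format, and the choice to measure the 20% against the total number of failed jobs are assumptions.

```python
THRESHOLD = 0.20

# exit code -> (action, substring the error message must contain, or None)
RULES = {
    50660: ("bump memory by 1G", None),
    50664: ("split x2", None),
    61104: ("plain recovery", None),
    8028: ("plain recovery", None),
    8021: ("plain recovery", "FileReadError"),
    8001: ("split x4", "No lhe event found in ExternalLHEProducer"),
}

def recovery_actions(failures, threshold=THRESHOLD):
    """Return [(exit_code, action)] for every rule that fires.

    `failures` is a list of (exit_code, error_message) pairs, one per
    failed job. A rule fires when its matching failures exceed
    `threshold` of all failures (assumed denominator).
    """
    total = len(failures)
    actions = []
    for code, (action, needle) in RULES.items():
        matching = [msg for c, msg in failures
                    if c == code and (needle is None or needle in msg)]
        if total and len(matching) / total > threshold:
            actions.append((code, action))
    return actions

# Hypothetical failure sample for one workflow (10 failed jobs).
failures = (
    [(50660, "")] * 3
    + [(8001, "No lhe event found in ExternalLHEProducer")] * 3
    + [(8021, "some other message")] * 3
    + [(99999, "")]
)
actions = recovery_actions(failures)
```

In this sample the 8021 failures do not trigger a recovery because their message lacks FileReadError, while 50660 and 8001 both exceed the threshold.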

SLIDE 12

When Things Go Really Bad

  • Causes for very large tails (not in order of importance)

➔ Long-lasting workflow has merge issues → recovery fails because the unmerged files are being removed at the site
  ➢ A solution is to have the site clear according to a list that can be extracted from request manager
➔ Lots of workflows to be inspected by hand, workflows with low priority looked at last → recovery fails because unmerged files are gone
  ➢ Do much less by hand (increase automation)
  ➢ Do things much faster by hand (see Dan's slides)
➔ Site going down → clone → another site going down, …
  ➢ Solved by using a more reliable site at last stage
  ➢ Maybe need to prune sites by availability?
  ➢ Maybe match estimated workload to site mean uptime?
➔ Performance issue → clone with good splitting → other issue → clone with initial splitting, …
  ➢ Operator interference, bad issue tracking, ...
➔ Data reprocessing needing ~100% completion
  ✗ Large lumi prevents job creation
  ✗ Segfault = no fwjreport
  ✗ ACDC of ACDC finish with no error
  ✗ Assignment mistakes
  ✗ Bad issue tracking ...