Production & Reprocessing Unified Overview Jean-Roch Vlimant - PowerPoint PPT Presentation

Production & Reprocessing “Unified” Overview Jean-Roch Vlimant

Big Picture
[Diagram; labels as extracted: O&C, PPD, Physics, McM, Data Mngt ?, Unified (THIS TALK), Workflow Management, HTCondor, T0, T1s, T2s]
12/06/16 CMS Alca/DB Meeting, Unified, JR Vlimant

In a Nutshell
● Automatic transfers, automatic assignment, workload overflow (see backup slides)
● Simple sqlite DB with 3 tables
  https://github.com/CMSCompOps/WmAgentScripts/blob/master/Unified/assignSchema.py
  ➔ Actual DB file on afs
    ● 80k workflows (460 char)
    ● 124k outputs (430 char, 5 int)
    ● 37k transfers (1 int, 1 pickled string representing a vector of int of size <40)
  ➔ Lightweight schema (right?)
  ➔ Plan on adding one more table temporarily, depending on needs, for locking datasets/blocks. Should never exceed 5k entries
● Access pattern
  ➔ Currently have difficulties with concurrent updates to the table, on different records
  ➔ Commits are essentially status changes (30 char) and insertion of outputs (5 per 10s) or transfers, plus rare modification of transfer attributes
  ➔ Reads use wide filters, ~1000 workflows at a time, every 20 to 30 min
● Load foreseen
  ➔ Might go up to 3-4x ~1000 workflows at a time
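The three-table layout and access pattern above can be sketched as follows. This is only an illustration using Python's standard sqlite3 module rather than sqlalchemy; the table names, column names, and helper functions are hypothetical and are not taken from assignSchema.py. The status strings are borrowed from the Unified state diagram.

```python
import pickle
import sqlite3

# Hypothetical column names; the real schema lives in assignSchema.py.
SCHEMA = """
CREATE TABLE workflow (
    id     INTEGER PRIMARY KEY,
    name   TEXT UNIQUE NOT NULL,
    status TEXT NOT NULL
);
CREATE TABLE output (
    id          INTEGER PRIMARY KEY,
    datasetname TEXT NOT NULL,
    workflow_id INTEGER REFERENCES workflow(id)
);
CREATE TABLE transfer (
    id           INTEGER PRIMARY KEY,
    transfer_ref INTEGER NOT NULL,
    workflow_ids BLOB NOT NULL
);
"""


def make_db(path=":memory:"):
    """Create the three tables in a new database and return the connection."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def set_status(conn, name, status):
    """A typical commit: a short status change on one workflow record."""
    with conn:  # commits on success, rolls back on error
        conn.execute("UPDATE workflow SET status = ? WHERE name = ?", (status, name))


def workflows_in(conn, statuses):
    """A typical read: a wide filter returning every workflow in the given statuses."""
    if not statuses:
        return []
    marks = ",".join("?" * len(statuses))
    rows = conn.execute(
        f"SELECT name FROM workflow WHERE status IN ({marks}) ORDER BY name",
        tuple(statuses),
    )
    return [row[0] for row in rows]


def add_transfer(conn, transfer_ref, workflow_ids):
    """Store one integer plus a pickled list of integers, as the slide describes."""
    with conn:
        conn.execute(
            "INSERT INTO transfer (transfer_ref, workflow_ids) VALUES (?, ?)",
            (transfer_ref, pickle.dumps(list(workflow_ids))),
        )


def transfer_workflows(conn, transfer_ref):
    """Load the pickled list of workflow ids back for one transfer."""
    row = conn.execute(
        "SELECT workflow_ids FROM transfer WHERE transfer_ref = ?", (transfer_ref,)
    ).fetchone()
    return pickle.loads(row[0]) if row else []
```

Each write is a small single-row transaction, which matches the "status change or insertion" commit pattern; the sketch does not address the concurrency difficulty mentioned on the slide.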

Integration
● Using sqlalchemy
● Tested on devdb12 (June, July?)
  ➢ Maxed out the account quota very fast; could perform read/write and automatic indexing
● Tested on int9r (Dec)
  ➢ Same schema, no issues
  ➢ Looking forward to a backed-up, reliable database that supports concurrent table updates

Unified State Diagram
[State diagram; labels in extraction order: From assignment-approved Cloned considered Input needed No input staging needed Input available Cloned staged Cloned forget Assignment Rejected trouble away issues completed Aborted assistance completed close Closed-out done]
https://cmst2.web.cern.ch/cmst2/unified/

ReqMgr2 State Diagram
https://cmsweb.cern.ch/wmstats/index.html

Strategies 1/3
● If any input is needed
  ➔ Distribute primary in N copies (configurable)
    ✔ Among sites matching the workflow requirements (Core, Mem, quota, …)
    ✔ Any existing block counts toward the N copies
    ✔ Distributed in chunks of 4TB
    ✔ Any existing subscription counts toward the N copies
    ✔ Locks the dataset from deletion
● When all requirements are met
  ➔ The N copies are ready at sites matching workflow requirements
  ➔ A transfer appears stuck
    ✔ Early start with ≥1
  ➔ Send back to setting a transfer if too many end points are in downtime
● Assignment in request manager (all below configurable by campaign)
  ➔ Set the lfn from parents, campaign, or default to /store/mc
  ➔ Set the mem/time watchdogs
  ➔ Tune the splitting (pre-assignment)
  ➔ Use as many compatible sites as possible
  ➔ Set xrootd flags on primary and/or secondary
  ➔ Pick a site for full copy among whitelist
    ✔ T1 first, then T2
    ✔ DDM-buffer enforced
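As an illustration of "N copies, distributed in chunks of 4TB", here is a minimal sketch. The greedy grouping, the round-robin site rotation, the function names, and the site names in the usage note are all assumptions made for the example; the slide does not say how Unified actually groups blocks or picks destinations.

```python
from typing import Dict, List

CHUNK_TB = 4.0  # chunk size quoted on the slide


def chunk_blocks(block_sizes_tb: Dict[str, float], chunk_tb: float = CHUNK_TB) -> List[List[str]]:
    """Greedily group blocks, in the given order, into chunks of at most chunk_tb.

    A single block larger than chunk_tb ends up alone in its own chunk.
    """
    chunks: List[List[str]] = []
    current: List[str] = []
    current_size = 0.0
    for name, size in block_sizes_tb.items():
        # Close the current chunk when this block would push it over the limit.
        if current and current_size + size > chunk_tb:
            chunks.append(current)
            current, current_size = [], 0.0
        current.append(name)
        current_size += size
    if current:
        chunks.append(current)
    return chunks


def plan_copies(chunks: List[List[str]], sites: List[str], n_copies: int) -> Dict[int, List[str]]:
    """Assign each chunk to n_copies distinct sites, rotating through the site list."""
    if n_copies > len(sites):
        raise ValueError("not enough matching sites for the requested number of copies")
    return {
        index: [sites[(index + k) % len(sites)] for k in range(n_copies)]
        for index in range(len(chunks))
    }
```

For example, `plan_copies(chunk_blocks({"b1": 1.5, "b2": 2.0, "b3": 1.0}), ["T1_A", "T2_B", "T2_C"], 2)` spreads two chunks over the three (hypothetical) sites. A fuller version would first subtract copies that already exist through blocks or subscriptions, as the slide requires.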

Strategies 2/3
● When WorkQueue elements appear with no location
  ➔ Transfer the missing blocks to a site in the whitelist
● When workflow has too many pending jobs
  ➔ Overflow to neighboring (hard-coded) sites
● Adapt job memory requirement (job classad)
  ➔ From history of successful jobs per task: 95th percentile of the memory distribution, with >100 successful jobs
  ➔ Hook for job time requirement not used
● Truncate the processing
  ➔ If >99% after 7 days, force complete 10 per cycle
  ➔ If the requester asks for it via McM API
● Verify the output (see later if something is not right)
  ➔ Completed in terms of number of lumisections (95% pass bar, 100% for data), fall back to number of events in case of request manager corruption
  ➔ Lumisection size is “small enough” (mostly ignored now)
  ➔ DBS/Phedex file count consistency, lfn consistency
  ➔ Output dataset consistency
  ➔ Tape subscription made when applicable
  ➔ Duplicate lumi
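Two of the rules above are simple enough to write down: the memory tuning (95th percentile over more than 100 successful jobs) and the completion bar (95%, or 100% for data, with a fall-back to event counts). The sketch below is an illustration only; the function names, the nearest-rank percentile definition, and the choice to trigger the fall-back when the expected lumisection count is missing or zero are assumptions, not the Unified code.

```python
from typing import Optional, Sequence

MIN_SUCCESSFUL_JOBS = 100  # slide: ">100 successful jobs"
PERCENTILE = 95            # slide: 95th percentile of the memory distribution


def tuned_memory_mb(peak_memory_mb: Sequence[int]) -> Optional[int]:
    """Return the 95th-percentile peak memory, or None without enough history."""
    n = len(peak_memory_mb)
    if n <= MIN_SUCCESSFUL_JOBS:
        return None
    ordered = sorted(peak_memory_mb)
    # Nearest-rank percentile with integer arithmetic: rank = ceil(95 * n / 100).
    rank = (PERCENTILE * n + 99) // 100
    return ordered[rank - 1]


def passes_completion(
    done_lumis: int,
    expected_lumis: Optional[int],
    is_data: bool,
    done_events: int = 0,
    expected_events: Optional[int] = None,
) -> bool:
    """Apply the pass bar on lumisections, falling back to events if needed."""
    bar_percent = 100 if is_data else 95
    if expected_lumis:
        return done_lumis * 100 >= bar_percent * expected_lumis
    # Assumed fall-back condition: no usable lumisection count is available.
    if expected_events:
        return done_events * 100 >= bar_percent * expected_events
    return False
```

Returning None from `tuned_memory_mb` leaves the original memory request untouched when a task has too little history, which mirrors the ">100 successful jobs" condition.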

Strategies 3/3
● Subscribe most production blocks to DataOps
  ➔ For on-going workflows, closed-out, and announced
    ✗ Some blocks from aborted workflows might be unclaimed
● Create Harvesting
  ➔ For data requests, once the /DQMIO is in full somewhere, extract info from original workflow and make the harvesting (no further check on statuses)
● Set workflow announced
  ➔ Triggering condition for next step in McM
● Set the status VALID
  ➔ Although datasets are usable beforehand
  ➔ Dataset might not be in full at a single site, but all blocks are out at production sites
● Send analysis datasets to AnalysisOps when applicable
  ➔ DDM scripts turn the full subscriptions over to AnalysisOps
  ➔ Production blocks are left as DataOps
● Locks are released
  ➔ When requested tape copy is completed
  ➔ When no other workflow uses that /LHE or 30 days
  ➔ If INVALID or non-existent (aborted/rejected workflow)
  ➔ ++

When Things Go Wrong
● Stuck transfers are exposed to transfer team
  ➔ Both to disk and tape
    https://cmst2.web.cern.ch/cmst2/unified/data.html
● Workflow with high failure rate
  ➔ Shifters' alert
  ➔ Inspected and manually aborted
  ➔ Automated notification to requester via McM (if via Unified)
● Workflow and dataset to be rejected
  ➔ Detected from McM and acted upon
● When requester asks for an update
  ➔ https://dmytro.web.cern.ch/dmytro/cmsprodmon/workflows.php?prep_id=HIG-RunIISpring16DR80-01780
  ➔ http://dabercro.web.cern.ch/dabercro/unified/showlog/?search=HIG-RunIISpring16DR80-01780
  ➔ All needed information is pretty much available
● Workflow with some failures
  ➔ And does not pass the completion bar: see next slide

Recoveries
● Investigation of errors
  ➔ Itemized at https://cmst2.web.cern.ch/cmst2/unified/assistance.html
  ➔ Drill down to error report on https://cmsweb.cern.ch/wmstats/index.html
  ➔ Categorized by most popular issues and cast
    ✔ ACDC: fetches from request manager the missing bits from failing jobs, creates a new set of jobs and submits them
    ✔ Clone: just restart from square one, Unified picks it back up
    ✔ Recovery: evolved procedure to create an ACDC document with what is needed to remake the missing data (mostly used for data rereco)
    ✔ Extension: create new events in non-overlapping lumisections (rarely used these days)
  ➔ ACDC are partially handled by Unified automatically, following some rules
    ✔ >20% 50660: bump memory by 1G
    ✔ >20% 50664: split x2
    ✔ >20% 61104: plain recovery
    ✔ >20% 8028: plain recovery
    ✔ >20% 8021: plain recovery (if FileReadError)
    ✔ >20% 8001: split x4 (if No lhe event found in ExternalLHEProducer)
  ➔ Things usually clear out on first round
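The automatic ACDC rules listed above amount to a lookup from exit code to action, gated by a 20% threshold. A minimal sketch follows; the action labels, the function name, and the reading of ">20%" as a share of all failed jobs are assumptions made for illustration, since the slide does not state what the percentage is taken over.

```python
from typing import Dict, List, Optional, Tuple

THRESHOLD_PERCENT = 20  # slide: rules fire above 20%

# exit code -> (action label, text that must appear in the error message, if any)
RULES: Dict[int, Tuple[str, Optional[str]]] = {
    50660: ("bump_memory_1G", None),
    50664: ("split_x2", None),
    61104: ("plain_recovery", None),
    8028: ("plain_recovery", None),
    8021: ("plain_recovery", "FileReadError"),
    8001: ("split_x4", "No lhe event found in ExternalLHEProducer"),
}


def pick_actions(
    error_counts: Dict[int, int],
    messages: Optional[Dict[int, str]] = None,
) -> List[str]:
    """Return the actions whose rule fires, in rule order and without duplicates.

    error_counts maps exit code to number of failed jobs; messages maps exit
    code to a representative error message used by the conditional rules.
    """
    messages = messages or {}
    total = sum(error_counts.values())
    actions: List[str] = []
    if total == 0:
        return actions
    for code, (action, needle) in RULES.items():
        count = error_counts.get(code, 0)
        # Strictly more than 20% of the failures must carry this exit code.
        if count * 100 <= THRESHOLD_PERCENT * total:
            continue
        # Conditional rules also need the expected text in the error message.
        if needle is not None and needle not in messages.get(code, ""):
            continue
        if action not in actions:
            actions.append(action)
    return actions
```

Keeping the rules in a table rather than in branching code makes it straightforward to add a new exit code after a round of manual investigation.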

When Things Go Really Bad
● Causes of very large tails (not in order of importance)
  ➔ Long-lasting workflow has merge issues → recovery fails because the unmerged files are being removed at the site
    ✔ A solution is to have the site clear according to a list that can be extracted from request manager
  ➔ Lots of workflows to be inspected by hand, workflows with low priority looked at last → recovery fails because unmerged files are gone
    ✔ Do much less by hand (increase automation)
    ✔ Do things much faster by hand (see Dan's slides)
  ➔ Site going down → clone → another site going down, …
    ✔ Solved by using a more reliable site at last stage
    ✔ Maybe need to prune sites by availability?
    ✔ Maybe match estimated workload to site mean uptime?
  ➔ Performance issue → clone with good splitting → other issue → clone with initial splitting, …
    ✔ Operator interference, bad issue tracking, ...
  ➔ Data reprocessing needing ~100% completion
    ✗ Large lumi prevents job creation
    ✗ Segfault = no fwjreport
    ✗ ACDC of ACDC finish with no error
    ✗ Assignment mistakes
    ✗ Bad issue tracking ...
