
ICHEP MC Production Post-Mortem, J-R Vlimant on behalf of everyone else



  1. “ICHEP MC Production” Post-Mortem J-R Vlimant on behalf of everyone else

  2. Disclaimer
     “Post-mortem” means after death/end, yet MC samples are still being produced as we speak. The body is still warm!
     A full post-mortem analysis of the computing operation is planned for the Computing & Offline Management Meeting in Trieste, 25-27 July 2012. Many of the lessons learned will be turned into action items then.
     It is easy to notice only what goes wrong.
     6/26/12

  3. Summer12 GEN-SIM
     https://twiki.cern.ch/twiki/bin/view/CMS/PdmVProductionSummer12
     ● Started on January 25
     ● 3.3B events in the campaign
     ● 2.5B events produced
     ● 500M/month achieved (previously 400M/month)
     ● Part of the campaign was put on stand-by while the HPA and Upgrade samples were being completed
     ● All “non-HPA” gen-sim (800M) resumes this week
     http://vlimant.web.cern.ch/vlimant/Directory/summer12/progress/Summer12_GEN-SIM_speed.html
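
The figures on this slide are mutually consistent: 3.3B in the campaign minus 2.5B produced leaves the 800M quoted for the resumed gen-sim. A minimal sketch of that arithmetic follows, together with a naive time-to-completion estimate. The constant-rate assumption and the function names are illustrative only and are not part of the original slides.

```python
# Figures quoted on the Summer12 GEN-SIM slide.
CAMPAIGN_EVENTS = 3_300_000_000   # total events in the campaign
PRODUCED_EVENTS = 2_500_000_000   # events produced so far
RATE_PER_MONTH = 500_000_000      # recently achieved rate
OLD_RATE_PER_MONTH = 400_000_000  # earlier rate


def remaining_events(total: int, produced: int) -> int:
    """Events still to be produced."""
    return total - produced


def months_to_finish(remaining: int, rate_per_month: int) -> float:
    """Naive estimate; assumes the production rate stays constant."""
    if rate_per_month <= 0:
        raise ValueError("rate_per_month must be positive")
    return remaining / rate_per_month


remaining = remaining_events(CAMPAIGN_EVENTS, PRODUCED_EVENTS)
print(f"remaining: {remaining // 1_000_000}M events")
print(f"at 500M/month: {months_to_finish(remaining, RATE_PER_MONTH):.1f} months")
print(f"at 400M/month: {months_to_finish(remaining, OLD_RATE_PER_MONTH):.1f} months")
```

In practice the rate depends on site availability and queue priorities, so this is at best a rough lower bound on the remaining time.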

  4. Summer12 Digi-Reco
     https://twiki.cern.ch/twiki/bin/view/CMS/PdmVProductionSummer12
     ● 5.1 digi-reco : 370M events, started March 7, delivered early April, with a tail into the beginning of May
     ● 5.2 digi-reco : 1.3B events, started end of April, with a tail into the end of June
       ✔ Validation samples (end of March - end of April)
       ✔ Low PU production (April 18 - May 21) (PU_S8 or E8TeV4BX50ns)
       ✔ TSG production (April 21 - May 14) (PU_S9)
       ✔ HPA production (end of March - today) (PU_S7 and PU_S6)
     http://vlimant.web.cern.ch/vlimant/Directory/summer12/progress/Summer12_START52_AODSIM_speed.html

  5. Priority Lists
     ● Selected samples for 5.1
       ✔ Defined with Physics Coordination
       ✗ Production overshot by <~1 week
       ➔ Data popularity analysis?
     ● High Priority Analysis (HPA) with 5.2
       ✔ Defined by all groups, filtered by Physics Coordination, compiled, and arranged for production
       ✔ 5 blocks + 1 block for the rest (see details in the next slides)
       ✔ Everything not on that list was frozen in production (or not attended to)
       ✗ Complications arose with samples already submitted in gen-sim and acquired in the queue at lower priority, inherited from the beginning of Summer12 (early February)
       ✔ Few issues with Digi-Reco prioritization (since nothing had been started yet)
       ✔ Overall, the production went fine

  6. HPA (1/2)
     http://vlimant.web.cern.ch/vlimant/Directory/summer12/summary.html?search=Block1
     HPA x = Block x in the URL
     ● HPA1 : 38/40 completed
       ✗ DiPhotonJets_7TeV-madgraph useless in Summer12
       ✗ TTJets_MassiveBinDECAY available in PU_S6 as requested, missing in PU_S7
       ✔ 140M to AODSIM
     ● HPA2 : 42/48 completed
       ✗ 4 Higgs requests still “new”, meaning not defined in PREP
       ✗ EWK : DY4JetsToLL_M-50 digi-reco stalled
       ✗ EWK : DY2JetsToLL_M-50 gen-sim extension stalled
       ✔ 200M to AODSIM
     ● HPA3 : 326/329 completed
       ✗ 2 requests in “new”, meaning not defined in PREP
       ✗ JME : QCD_Pt-15to30 digi-reco stalled
       ✔ 240M to AODSIM
     NB : “stalled” = site issues or queue overhead; probably done by now.

  7. HPA (2/2)
     http://vlimant.web.cern.ch/vlimant/Directory/summer12/summary.html?search=Block1
     HPA x = Block x in the URL
     ● HPA4 : 308/339 completed
       ✗ BPH : BdToKK, BdToPiPi, BdToPiMuNu, LambdaBToPK digi-reco stalled
       ✗ BPH : LambdaBToPMuNu gen-sim taking forever due to a very low filter efficiency
       ✗ EWK : DYToTauTau_M-20_CT10 digi-reco stalled
       ✗ SUS : QCD_HT-500To1000 completing
       ✗ SUS : TTWWJets, WZZNoGstarJets, WWWJets, TTGJets, TTWJets
       ✗ Top : 7 systematic samples (TT/T/W scale up/down) digi-reco stalled
       ✗ Top : 3 systematic samples gen-sim stalled
       ✔ 550M to AODSIM
     ● HPA5 : 82/87 completed
       ✗ SUS : QCD_HT-100To250 digi-reco stalled
       ✗ Top : 4 systematic samples (TT/W matching up/down) digi-reco stalled
       ✗ Higgs : VBF_HToZZTo2L2Nu_M-525 digi-reco stalled
       ✔ 52M to AODSIM
     NB : “stalled” = site issues or queue overhead; probably done by now.
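
The per-block numbers on the two HPA slides can be added up to give an overall picture. The sketch below only sums the figures quoted on the slides; the dictionary layout and function name are illustrative, and the totals are derived here rather than stated in the original presentation.

```python
# (requests completed, requests total, millions of events to AODSIM),
# copied from the HPA (1/2) and HPA (2/2) slides.
HPA_BLOCKS = {
    "HPA1": (38, 40, 140),
    "HPA2": (42, 48, 200),
    "HPA3": (326, 329, 240),
    "HPA4": (308, 339, 550),
    "HPA5": (82, 87, 52),
}


def tally(blocks):
    """Sum completed requests, total requests and AODSIM events (in millions)."""
    completed = sum(done for done, _, _ in blocks.values())
    total = sum(requested for _, requested, _ in blocks.values())
    aodsim_millions = sum(millions for _, _, millions in blocks.values())
    return completed, total, aodsim_millions


for name, (done, requested, _) in HPA_BLOCKS.items():
    print(f"{name}: {done}/{requested} = {100 * done / requested:.1f}% completed")

completed, total, aodsim_millions = tally(HPA_BLOCKS)
print(f"all blocks: {completed}/{total} = {100 * completed / total:.1f}% completed")
print(f"events to AODSIM: {aodsim_millions}M")
```

Summed this way, the five blocks account for 796 of 843 requests and about 1.18B events delivered to AODSIM.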

  8. Issues and Action (1/3)
     ● WM commissioning was done during the production itself
       ✔ No support from the main developers, who had left to work in industry
       ✔ Solved, by definition, with the experience gained
       ✔ Lots of experience gained by both PdmV and Comp-Ops
       ➔ Full Computing post-mortem at the end of July
       ➔ PREP2 project
     ● Lack of monitoring at several levels
       ✔ Ad-hoc monitoring pages will be turned into a consolidated third-party PREP/reqMng monitoring on a medium time scale (pre-PREP2)
       ➔ GlobalMonitor is being upgraded
     ● Operational overhead for submission & chaining
       ✔ Ad-hoc chaining from PREP evolved into an ad-hoc operation summary
       ➔ Improve the current PREP to speed up operation
       ➔ PREP2 / integration with the request manager
     ● Operational overhead for dispatching
       ✗ Daily assignment is a killer overhead
       ✗ Weekly assignment does not allow for quick turn-over
       ✗ Monthly assignment in early April severely delayed some samples
       ➔ Accumulate experience into automated procedures
       ➔ More from the July post-mortem

  9. Issues and Action (2/3)
     ● Aborting valid requests to reclaim resources
       ✗ Damaged the output dataset
       ✔ We won't do that again anytime soon
       ➔ Development of the system to allow for this feature
     ● Frenzy of wanting things faster
       ✗ Many cases of “change the priority” where “the next day it was acquired”
       ✔ Ask for careful pre-planning in the future
       ✔ Tied to the lack of an approximate estimated time of delivery
       ➔ Development of the system to allow more flexibility
     ● Some “missing” samples had in fact never been requested with priority
       ✔ Were then dealt with as a priority
       ➔ Add a link to a PAS in PREP2 to tie requests to analyses
       ➔ The groups need to plan more carefully, early on
     ● Some samples were submitted with incorrect physics content
       ✔ Improve the preparation/documentation of special requests
       ➔ Implement a gen-validation step as part of the submission procedure
     ● Not possible to “take a peek” at large samples
       ✗ The first 10% of a sample was not reachable fast enough
       ✔ Numerous requests were staged, but the remainder steals resources
       ✔ A handful of requests were extended
       ➔ Planning for two-speed submission of samples (10% high, 90% bulk) with PREP2
       ➔ Development of the WM infrastructure to allow for safe extension of datasets
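
The two-speed submission idea (10% at high priority, 90% as bulk) can be illustrated with a small helper that splits a requested event count into two parts. This is only a sketch of the idea: the function name, the rounding rule and the example sample size are assumptions, and nothing here reflects how PREP2 actually implements it.

```python
from typing import Tuple


def two_speed_split(total_events: int, fast_fraction: float = 0.10) -> Tuple[int, int]:
    """Split a request into a small high-priority part and a bulk part.

    The 10% default follows the slide; rounding to the nearest event is an
    assumption made for this sketch.
    """
    if total_events < 0:
        raise ValueError("total_events must be non-negative")
    if not 0.0 <= fast_fraction <= 1.0:
        raise ValueError("fast_fraction must be between 0 and 1")
    fast = round(total_events * fast_fraction)
    bulk = total_events - fast  # the two parts always add up to the request
    return fast, bulk


# Hypothetical 10M-event request.
fast_part, bulk_part = two_speed_split(10_000_000)
print(f"high priority: {fast_part} events, bulk: {bulk_part} events")
```

Computing the bulk part by subtraction guarantees that no events are lost or duplicated by rounding.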

  10. Issues and Action (3/3)
     ● Misunderstanding of the priority number and the operations related to it
       ✔ Clarified half-way
       ✔ Tied to resource downtime
       ➔ Planned to be automated “by date” in PREP2
     ● Requested statistics not matched
       ✗ Due to filter efficiency, corrupted LHE, ...
       ➔ Incorporate this as part of a gen-valid request
     ● Stuck samples
       ✗ History monitoring missing
       ✔ Weekly report from PREP
       ✔ Scanning scripts developed by the operators
       ✔ Thanks to the sharp eyes of some requesters, who made clear reports
       ➔ More from the July post-mortem
     ● Difficult to get large systematic samples
       ➔ Increase the usage of Fastsim
     ● Losing track of samples, requests, relevance
       ✔ Increase coordination between groups
       ✔ Follow up on important samples
       ✔ Propagation of operational information and news
       ➔ Monte-Carlo coordination meeting put in place
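
One reason requested statistics may not be matched is the generator filter efficiency: only a fraction of the generated events pass the filter, so more events must be generated than are requested. A minimal sketch of that relation is given below. The function name and the example numbers are hypothetical, and the actual PREP bookkeeping may differ.

```python
import math


def events_to_generate(requested: int, filter_efficiency: float) -> int:
    """Number of events to generate so that, on average, `requested` pass the filter.

    Assumes a single overall filter efficiency in (0, 1]; rounds up so the
    expected yield is not below the request.
    """
    if requested < 0:
        raise ValueError("requested must be non-negative")
    if not 0.0 < filter_efficiency <= 1.0:
        raise ValueError("filter_efficiency must be in (0, 1]")
    return math.ceil(requested / filter_efficiency)


# Hypothetical request: 1M events after a filter that keeps 25% of events.
needed = events_to_generate(1_000_000, 0.25)
print(f"generate {needed} events to keep about 1000000")
```

The lower the efficiency, the larger the generation cost, which is consistent with the LambdaBToPMuNu gen-sim reported as very slow on the HPA (2/2) slide. A gen-validation step could measure the efficiency on a small test run before the full request is submitted.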

  11. Summary
     ● Things went fine for the bulk of the production.
     ● Production exceeded expectations.
     ● Most of the prioritized samples were delivered “before last week”.
     ● A few bumps along the way; most issues have already been addressed.
     ● More to come from the full Computing post-mortem analysis.
