ICHEP MC Production Post-Mortem, J-R Vlimant on behalf of everyone else


SLIDE 1

“ICHEP MC Production” Post-Mortem

J-R Vlimant, on behalf of everyone else
SLIDE 2

6/26/12 2

Disclaimer

Post-mortem: literally “after death/end”, yet MC samples are still being produced as we speak. The body is still warm!

  • A full computing-operations post-mortem analysis is planned for the Computing & Offline Management Meeting in Trieste, 25-27 July 2012.
  • The full post-mortem will be done by then.
  • Lots of lessons learned will be turned into action items then.
  • It's easy to only notice what goes wrong.

SLIDE 3

Summer12 GEN-SIM

  • Started on January 25
  • 3.3B events in the campaign
  • 2.5B events produced
  • 500M events/month achieved (previously 400M/month)
  • Part of it has been put on stand-by while completing the HPA and Upgrade samples.
  • All “non HPA” gen-sim (800M events) resumes this week

https://twiki.cern.ch/twiki/bin/view/CMS/PdmVProductionSummer12

http://vlimant.web.cern.ch/vlimant/Directory/summer12/progress/Summer12_GEN-SIM_speed.html
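The numbers quoted on this slide imply a completion fraction and a rough time to finish. The sketch below only redoes that arithmetic; it assumes the 500M events/month rate stays constant, which the slides do not claim, and all names are hypothetical.

```python
# Figures quoted on the Summer12 GEN-SIM slide (in events).
CAMPAIGN_TOTAL = 3.3e9    # events in the campaign
PRODUCED = 2.5e9          # events produced so far
RATE_PER_MONTH = 500e6    # achieved rate; ASSUMED constant from here on


def remaining_events(total, produced):
    """Events still to be produced (never negative)."""
    return max(total - produced, 0.0)


def months_to_complete(total, produced, rate_per_month):
    """Naive estimate: remaining events divided by a constant monthly rate."""
    if rate_per_month <= 0:
        raise ValueError("rate_per_month must be positive")
    return remaining_events(total, produced) / rate_per_month


fraction_done = PRODUCED / CAMPAIGN_TOTAL
eta_months = months_to_complete(CAMPAIGN_TOTAL, PRODUCED, RATE_PER_MONTH)
print(f"{fraction_done:.1%} produced, about {eta_months:.1f} months left at constant rate")
```

The remaining 0.8B events match the 800M of “non HPA” gen-sim quoted on the slide; the time estimate is only as good as the constant-rate assumption.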

SLIDE 4

Summer12 Digi-Reco

  • 5.1 digi-reco: 370M events, started March 7, delivered early April, with a tail into the beginning of May
  • 5.2 digi-reco: 1.3B events, started end of April, with a tail into the end of June

✔ Validation samples (end of March - end of April)
✔ Low PU production (April 18 – May 21) (PU_S8 or E8TeV4BX50ns)
✔ TSG production (April 21 - May 14) (PU_S9)
✔ HPA production (end of March – today) (PU_S7 and PU_S6)

https://twiki.cern.ch/twiki/bin/view/CMS/PdmVProductionSummer12

http://vlimant.web.cern.ch/vlimant/Directory/summer12/progress/Summer12_START52_AODSIM_speed.html

SLIDE 5

Priority Lists

  • Selected samples for 5.1

✔ Defined with Physics Coordination
✗ Production overshot by <~1 week
➔ Data popularity analysis?

  • High Priority Analysis with 5.2

✔ Defined by all groups, filtered by Physics Coordination, compiled, and arranged for production
✔ 5 blocks + 1 block for the rest (see details in next slides)
✔ Everything else not on that list was frozen in production (or not attended to)
✗ Complications were met with samples already submitted in gen-sim and acquired in the queue with lower priority, inherited from the beginning of Summer12 (early Feb)
✔ Not much of an issue with Digi-Reco prioritization (since nothing had been started yet)
✔ Overall, the production went fine

SLIDE 6

HPA (1/2)

  • HPA1 : 38/40 completed.

✗ DiPhotonJets_7TeV-madgraph useless in Summer12
✗ TTJets_MassiveBinDECAY available in PU_S6 as requested, missing PU_S7
✔ 140M to AODSIM

  • HPA2 : 42/48 completed.

✗ 4 Higgs requests still “new”: means not defined in PREP
✗ EWK: DY4JetsToLL_M-50 digi-reco stalled
✗ EWK: DY2JetsToLL_M-50 gen-sim extension stalled
✔ 200M to AODSIM

  • HPA3 : 326/329 completed.

✗ 2 requests in “new”: means not defined in PREP
✗ JME: QCD_Pt-15to30 digi-reco stalled
✔ 240M to AODSIM

NB: “stalled” = site issues, queue overhead; probably done by now.

http://vlimant.web.cern.ch/vlimant/Directory/summer12/summary.html?search=Block1

HPAx = Blockx in the URL

SLIDE 7

HPA (2/2)

  • HPA4 : 308/339 completed.

✗ BPH: BdToKK, BdToPiPi, BdToPiMuNu, LambdaBToPK digi-reco stalled
✗ BPH: LambdaBToPMuNu gen-sim taking forever due to a very low filter efficiency
✗ EWK: DYToTauTau_M-20_CT10 digi-reco stalled
✗ SUS: QCD_HT-500To1000 completing
✗ SUS: TTWWJets, WZZNoGstarJets, WWWJets, TTGJets, TTWJets,
✗ Top: 7 systematic samples (TT/T/W scale up/down) digi-reco stalled
✗ Top: 3 systematic samples gen-sim stalled
✔ 550M to AODSIM

  • HPA5 : 82/87 completed.

✗ SUS: QCD_HT-100To250 digi-reco stalled
✗ Top: 4 systematic samples (TT/W matching up/down) digi-reco stalled
✗ Higgs: VBF_HToZZTo2L2Nu_M-525 digi-reco stalled
✔ 52M to AODSIM

http://vlimant.web.cern.ch/vlimant/Directory/summer12/summary.html?search=Block1

HPAx = Blockx in the URL

NB: “stalled” = site issues, queue overhead; probably done by now.
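The per-block counts on the two HPA slides can be combined into an overall completion figure. The sketch below simply tabulates the quoted numbers; the dictionary layout and function name are hypothetical, not part of PREP or any monitoring page.

```python
# (completed requests, total requests, events delivered to AODSIM in millions),
# copied from the HPA (1/2) and HPA (2/2) slides.
HPA_BLOCKS = {
    "HPA1": (38, 40, 140),
    "HPA2": (42, 48, 200),
    "HPA3": (326, 329, 240),
    "HPA4": (308, 339, 550),
    "HPA5": (82, 87, 52),
}


def summarize(blocks):
    """Return (completed, total, AODSIM events in millions) summed over blocks."""
    completed = sum(done for done, _, _ in blocks.values())
    total = sum(tot for _, tot, _ in blocks.values())
    aodsim_m = sum(events for _, _, events in blocks.values())
    return completed, total, aodsim_m


for name, (done, tot, events) in HPA_BLOCKS.items():
    print(f"{name}: {done}/{tot} requests ({done / tot:.1%}), {events}M to AODSIM")

completed, total, aodsim_m = summarize(HPA_BLOCKS)
print(f"All blocks: {completed}/{total} requests ({completed / total:.1%}), {aodsim_m}M to AODSIM")
```

Summed over the five blocks this gives 796 of 843 requests completed and 1182M events delivered to AODSIM.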

SLIDE 8

Issues and Action (1/3)

  • WM Commissioning was done during the production itself

✔ No support from the main developers, who had left to work in industry
✔ Solved by definition with experience gained
✔ Lots of experience gained by both PdmV and Comp-Ops
➔ Full Computing post-mortem at the end of July
➔ PREP2 project

  • Lack of monitoring at several levels

✔ Ad-hoc monitoring pages will be turned into a consolidated third-party PREP/reqMng monitoring on a medium time scale (pre-PREP2)
➔ GlobalMonitor is being upgraded

  • Operation over-head for submission & chaining

✔ Ad-hoc chaining from PREP evolved into an ad-hoc operation summary
➔ Improvement of current PREP to speed up operation
➔ PREP2 / integration with request manager

  • Operation over-head for dispatching

✗ Daily assignment is a killer overhead
✗ Weekly assignment does not allow for quick turn-over
✗ Monthly assignment in early April severely delayed some samples
➔ Accumulate experience into automated procedures
➔ More from the July post-mortem

SLIDE 9

Issues and Action (2/3)

  • Aborting valid requests to reclaim resources

✗ Damaged the output dataset
✔ We won't do that again anytime soon
➔ Development of the system to allow for this feature

  • Frenzy of wanting things faster

✗ Many cases of “change the priority” followed by “the next day it was acquired”
✔ Ask for careful pre-planning in the future
✔ Tied to the lack of an approximate estimated time of delivery
➔ Development of the system to allow more flexibility

  • Some “missing” samples were in fact never requested with priority

✔ Were dealt with in priority
➔ Add a link to a PAS in PREP2 to tie requests to analyses
➔ More careful planning needs to be done by the groups, early on

  • Some samples were submitted with an incorrect physics content

✔ Improve the preparation/documentation of special requests
➔ Implement a gen-validation step as part of the submission procedure

  • Not possible to “take a peek” at large samples

✗ The first 10% of the samples was not reachable fast enough
✔ Numerous requests were staged, but the rest steals resources
✔ A handful of requests were extended
➔ Planning for two-speed submission of samples (10% high, 90% bulk) with PREP2
➔ Development on the WM infrastructure to allow for safe extension of datasets
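The planned two-speed submission (10% high, 90% bulk) can be pictured as splitting one request into two chunks with different priorities. This is a minimal illustration only: the `Chunk` type, the priority labels and the function name are hypothetical and do not reflect the PREP2 design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    dataset: str
    events: int
    priority: str  # hypothetical labels: "high" or "bulk"


def split_two_speed(dataset, total_events, fast_fraction=0.10):
    """Split a request into a small high-priority chunk and a bulk remainder."""
    if total_events < 0:
        raise ValueError("total_events must be non-negative")
    if not 0 < fast_fraction < 1:
        raise ValueError("fast_fraction must be strictly between 0 and 1")
    fast_events = int(round(total_events * fast_fraction))
    return [
        Chunk(dataset, fast_events, "high"),
        Chunk(dataset, total_events - fast_events, "bulk"),
    ]


# "ExampleSample" is a placeholder name, not a real request.
for chunk in split_two_speed("ExampleSample", 10_000_000):
    print(chunk)
```

The two chunks always add up to the requested statistics, so the bulk part amounts to an extension of the first 10%; doing that safely is what the WM development item above is about.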

SLIDE 10

Issues and Action (3/3)

  • Misunderstanding of the priority number and the operations related to it

✔ Clarified half-way
✔ Tied to resource downtime
➔ Planned to be automated “by date” in PREP2

  • Requested statistics not matched

✗ Due to filter efficiency, corrupted LHE, ...
➔ Incorporate this as part of a gen-valid request

  • Stuck samples

✗ History monitoring missing
✔ Weekly report from PREP
✔ Scanning scripts developed by the operators
✔ Thanks to the sharp eyes of some requesters, who made clear reports
➔ More from the July post-mortem

  • Difficult to get large systematic samples

➔ Increase the usage of Fastsim

  • Losing track of samples, requests, relevance

✔ Increase coordination between groups
✔ Follow up on important samples
✔ Propagation of operational information and news
➔ Monte-Carlo coordination meeting put in place
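For the “Requested statistics not matched” item, the link between filter efficiency and delivered statistics is simple arithmetic: to end up with N events after a generator filter of efficiency ε, about N/ε events must be generated. The sketch below assumes the efficiency is known in advance (for example from a gen-validation step); the function name and the optional safety margin are hypothetical.

```python
import math


def events_to_generate(requested, filter_efficiency, safety_margin=0.0):
    """Events to generate so that about `requested` events pass the filter.

    `safety_margin` is a hypothetical knob (0.05 means 5% extra) and is not
    something the slides prescribe.
    """
    if not 0 < filter_efficiency <= 1:
        raise ValueError("filter_efficiency must be in (0, 1]")
    if safety_margin < 0:
        raise ValueError("safety_margin must be non-negative")
    return math.ceil(requested * (1 + safety_margin) / filter_efficiency)


# Illustrative numbers only: 1M requested events with a 25% filter efficiency.
print(events_to_generate(1_000_000, 0.25))
```

A mis-estimated efficiency translates directly into a shortfall, and a very low efficiency inflates the generation time in the same proportion, as with the LambdaBToPMuNu gen-sim sample listed under HPA4.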

SLIDE 11

Summary

  • Things went fine for the bulk of production.
  • Production exceeded expectations.
  • Most of the prioritized samples were delivered “before last week”.
  • A few bumps along the way.
  • Most issues have already been addressed.
  • More to come from the full Computing post-mortem analysis.