
Advancement of usage of TaskChain in production, J-R Vlimant



  1. Advancement of usage of TaskChain in production J-R Vlimant

  2. In A Nutshell
     ● TaskChain is the most flexible type of workflow
     ● One cmsRun per “task”
     ● A root task either reads from an input dataset or generates events
       ● wmLHE and pLHE enabled
     ● Each subsequent task feeds from one of the output modules of one of the preceding tasks
     ● Trees of tasks are possible
       ● A → B → C1 → D1 and B → C2 → D2 (C2 → D3 and so on)
     ● Job splitting is either set explicitly (#events/job, #lumis/job) or automatic using time/event (N.B. #events/lumi fully functioning)
     ● All outputs are exposed to computing up-front
     ● PROS
       ● In a multi-campaign mode of operation, reduces the number of workflows (items in request manager) from N>1 to 1
       ● No intermediate manipulation of datasets
       ● No latency in assigning the next workflows
       ● No latency and less manual operation in creating tape families
     ● CONS
       ● Full chain has to be tested at once: change of mode of operation for the gen contact
       ● Recovery workflows can become complicated with a large number of tasks: change of operation for ops
       ● The chain has one priority
       ● All requests need to run at the same site (no T2 → T1 relocation)
     Post-MccM Discussion, J-R Vlimant 9/19/14
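
The chaining rules on this slide (one root task, every later task fed by an output module of a preceding task, trees allowed) can be sketched as a small consistency check. This is only an illustration, not the WMCore validation code. The keys `TaskChain`, `TaskN`, `TaskName`, `InputTask` and `InputFromOutputModule` are assumed to follow the usual TaskChain request dictionary layout; the helpers `make_task` and `check_chain`, the task names and the output module names are hypothetical.

```python
from collections import defaultdict


def make_task(name, parent=None, module=None):
    """Build a minimal task entry (hypothetical helper, illustration only)."""
    entry = {"TaskName": name}
    if parent is not None:
        entry["InputTask"] = parent
        entry["InputFromOutputModule"] = module
    return entry


def check_chain(request):
    """Check that the tasks form a tree hanging off a single root task.

    Returns a mapping: parent task name -> list of child task names.
    """
    seen = set()                    # names of the tasks declared so far
    children = defaultdict(list)
    for i in range(1, request["TaskChain"] + 1):
        task = request["Task%d" % i]
        name = task["TaskName"]
        parent = task.get("InputTask")
        if name in seen:
            raise ValueError("duplicate task name: %s" % name)
        if i == 1:
            # The root task reads an input dataset or generates events.
            if parent is not None:
                raise ValueError("the root task cannot have an InputTask")
        else:
            # Every other task feeds from a task declared before it.
            if parent not in seen:
                raise ValueError("%s must feed from a preceding task" % name)
            if not task.get("InputFromOutputModule"):
                raise ValueError("%s must name an output module" % name)
            children[parent].append(name)
        seen.add(name)
    return dict(children)


# The tree from the slide: A -> B -> C1 -> D1 and B -> C2 -> D2
example = {
    "TaskChain": 6,
    "Task1": make_task("A"),
    "Task2": make_task("B", "A", "outA"),
    "Task3": make_task("C1", "B", "outB"),
    "Task4": make_task("D1", "C1", "outC1"),
    "Task5": make_task("C2", "B", "outB"),
    "Task6": make_task("D2", "C2", "outC2"),
}
tree = check_chain(example)
```

Here `tree["B"]` lists both `C1` and `C2`, which is the branching that a linear chain of separate workflows would need two extra requests to express.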

  3. Already Tested
     ● Years of operation of release validation samples
       ● Although job splitting was always set explicitly
     ● Treating eos-based .lhe files in input https://github.com/dmwm/WMCore/issues/4871
     ● #Events per lumi https://github.com/dmwm/WMCore/issues/4872
     ● Doing wmLHE and gen-sim in a single workflow
       ● 2 requests in McM
       ● 2 tasks in the taskchain
       ● https://vlimant.web.cern.ch/vlimant/SUS-Fall13wmLHE-00011_dict_2t.json
       ● Outputs 2 datasets as if they were processed in two different workflows, without the dataset manipulation latency
     ● Doing trees of requests from SUS-Fall13wmLHE-00011
       ● https://cms-pdmv.cern.ch/mcm/chained_requests?root_request=SUS-Fall13wmLHE-0001
       ● 1 wmLHE, 1 gen-sim, 2 digi-reco, 1 mini-aod: 5 workflows compared to one taskchain
       ● https://vlimant.web.cern.ch/vlimant/SUS-Fall13wmLHE-00011_dict_at.json
       ● The last clone made by Alan succeeded with only an AODSIM output dataset collision due to wrong assignment.

  4. Already Developed (1/3)
     ● Testing script for the full chain request (March 2014) https://cms-pdmv.cern.ch/mcm/public/restapi/chained_requests/
       ● get_setup/<chained request id>
       ● Set up & run each request one after the other
     ● Testing API for chained requests (March 2014) https://cms-pdmv.cern.ch/mcm/restapi/chained_requests/
       ● Test/<chained request id>
       ● Threaded runtest of the chain
       ● Verification of performance & efficiency measured
       ● Requires certificates and xrootd enabled
     ● Creating the taskchain dictionary from
       ● A chained request ID (March 2014) https://cms-pdmv.cern.ch/mcm/public/restapi/chained_requests/get_dict/SUS-chain_Fall13wmLHE_flowWMLHEtoF13_flowS14P
         ● Handles only the requests that are part of the chain
         ● N.B. The link has scratch=true which unfolds the whole chain
       ● A request ID (August 2014) https://cms-pdmv.cern.ch/mcm/public/restapi/chained_requests/get_dict/SUS-Fall13wmLHE-00011?scratch=true
         ● Looks for the tree of requests from the chains the request is involved with
         ● N.B. The link has scratch=true which unfolds the whole chain
     ● Injection of taskchain (March 2014)
       ● wmcontrol is provided with the URL to the dictionary https://github.com/cms-PdmV/wmcontrol/commit/0a2352e7866a61cf41fb31afa334f4f268f8a415
       ● Everything is done within McM
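
As a rough illustration of how the public endpoints quoted on this slide are addressed, the sketch below only assembles the URLs; it does not contact McM. The base URL, the `get_setup` and `get_dict` paths and the `scratch=true` query parameter are taken from the slide. The function names are hypothetical, and whether an endpoint accepts other parameters is not covered here.

```python
from urllib.parse import quote, urlencode

# Base of the public chained-request REST API, as quoted on the slide.
MCM_PUBLIC_API = "https://cms-pdmv.cern.ch/mcm/public/restapi/chained_requests"


def get_setup_url(chained_request_id):
    """URL of the testing script for a full chained request."""
    return "%s/get_setup/%s" % (MCM_PUBLIC_API, quote(chained_request_id, safe=""))


def get_dict_url(prepid, scratch=False):
    """URL of the taskchain dictionary for a chained request ID or a request ID.

    With scratch=True the query string asks McM to unfold the whole chain.
    """
    url = "%s/get_dict/%s" % (MCM_PUBLIC_API, quote(prepid, safe=""))
    if scratch:
        url += "?" + urlencode({"scratch": "true"})
    return url


request_url = get_dict_url("SUS-Fall13wmLHE-00011", scratch=True)
print(request_url)
```

The resulting string can be handed to any HTTP client, or to wmcontrol as the URL of the dictionary; authentication requirements, if any, are outside the scope of this sketch.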

  5. Already Developed (2/3)
     ● Labelling of the output dataset “processingstring” (March 2014)
       ● Application of experience with relvals
       ● Greatly simplifies the assignment of TaskChains
     ● Registering statistics and status of multiple output datasets (August 2014)
       ● Required for proper toggling of done status with completed events in McM
     ● Reduction of stats DB size by making a history member of each doc (August 2014)
       ● From 23 GB to 500 MB …
       ● Growth plot fully available and made simpler to make
     ● Button for chain request testing available to gen contact (September 2014)
       ● Fixed for unintentional reset of requests
     ● Approval toggling from gen contact & convener (September 2014)
       ● Once validation is finished, status is toggled
       ● Toggling to define then approve in the regular way
     ● Injection of taskchain and batch texting (September 2014)
       ● Injection is now threaded and locked
       ● Subject & text of the pilot batch was ambiguous https://hypernews.cern.ch/HyperNews/CMS/get/dataopsrequests/5546.html
       ● and now fixed https://github.com/cms-PdmV/cmsPdmV/pull/652

  6. Already Developed (3/3)
     ● Toggling of status to done using multiple outputs (September 2014)
       ● Few typos fixed
       ● Worked out of the box, with regular request inspection https://cms-pdmv.cern.ch/mcm//requests?member_of_chain=HIG-chain_Summer12_flowS12to53-00264&page=0&shown=146297325599
     ● Protection against dataset name collision (September 2014)
       ● PR https://github.com/cms-PdmV/cmsPdmV/pull/658
       ● Required to prevent TaskChains from creating collisions with existing requests
       ● Functions with indirect injection of taskchain, i.e. when toggling submit approval
       ● Does not operate with direct injection, i.e. using /restapi/chained_requests/inject/<id>
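
The idea behind the collision protection can be sketched as a check of the output dataset names a TaskChain would produce against the names already taken. This is not the logic of the pull request above, which is not reproduced in these slides. The naming pattern `/<primary>/<era>-<processing string>-v<version>/<tier>` is an assumption about the dataset naming convention, and the function names and all dataset names below are hypothetical.

```python
def output_dataset_name(primary, era, processing_string, version, tier):
    """Compose a dataset name (assumed convention, for illustration only)."""
    return "/%s/%s-%s-v%d/%s" % (primary, era, processing_string, version, tier)


def find_collisions(planned, existing):
    """Return the planned names that are already taken.

    A name collides if it belongs to an existing request or if an earlier
    task of the same chain already plans to write it.
    """
    taken = set(existing)
    collisions = []
    for name in planned:
        if name in taken:
            collisions.append(name)
        taken.add(name)
    return collisions


# Hypothetical example: two digi-reco tasks mistakenly assigned the same
# processing string would both write the same AODSIM dataset.
existing = {output_dataset_name("SampleA", "EraX", "PS1", 1, "GEN-SIM")}
planned = [
    output_dataset_name("SampleB", "EraX", "PS1", 1, "GEN-SIM"),
    output_dataset_name("SampleB", "EraY", "PS2", 1, "AODSIM"),
    output_dataset_name("SampleB", "EraY", "PS2", 1, "AODSIM"),
]
collisions = find_collisions(planned, existing)
```

Running such a check before injection, whichever injection path is used, is what keeps a TaskChain from writing into a dataset that another request already owns.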

  7. On-Going
     ● Pilot batch of TaskChain from McM
       ● From HIG mass scan https://cms-pdmv.cern.ch/mcm/requests?dataset_name=*FilterMuOrEle15*&member_of_campaign=Summer12
       ● Extra mass point (55) added, validated https://cms-pdmv.cern.ch/mcm/requests?prepid=HIG-Summer12-02258
       ➔ Completed after a few manual steps
       ➔ Issue with ACDC not solved yet
     ● Brainstorming on assignment (Ops), quoting chats with Alan
       ● Adapt the scripts that look for possible job location based on input datasets, being primary or pileup
       ● Adapt possible modification to job splitting made by assignment scripts
       ● Allocate TaskChains to sites based on resource availability
       ➔ No feedback yet
       ➔ Proper site whitelist wasn't used in the pilot and led to failures in digi-reco https://hypernews.cern.ch/HyperNews/CMS/get/dataopsrequests/5546/1/1/1/1/1/1/1/1/1/1/1/1/1/1.html
       ➔ Suspicion that this is also what is causing the ACDC to not start

  8. Suggestions for Next Steps
     ● Get feedback from the Ops brainstorming and iron out the handshaking details
     ● Do a reservation campaign in Summer11 & Summer12*
     ● Put all new requests in Summer11 and Summer12* through TaskChain
     ● Extend to new requests in Fall13* → miniAOD
     ● Extend to new requests in Fall14wmLHE → Fall14

