Advancement of usage of TaskChain in production, J-R Vlimant
9/19/14 Post-MccM Discussion, J-R Vlimant 2
In A Nutshell
- TaskChain is the most flexible type of workflow
- One cmsRun per “task”
- A root task either reading from input dataset or generating events
- wmLHE and pLHE enabled
- Each subsequent task feeds from one of the output modules of one of the
preceding tasks
- Trees of tasks possible
- A → B → C1 → D1 and B → C2 → D2 (C2 → D3 and so on)
- Job splitting either done explicitly (#events/job, #lumis/job) or automatic using
time/event (N.B. #events/lumi fully functioning)
- All outputs are exposed to computing up-front
- PROS
- In a multi-campaign mode of operation, reduces the number of workflows (items in
request manager) from N>1 to 1
- No intermediate manipulation of datasets
- No latency in assigning the next workflows
- No latency, less manual operation in creating tape families
- CONS
- Full chain has to be tested at once : change of mode of operation for gen contacts
- Recovery workflows can become complicated with a large number of tasks : change
of operation for ops
- The chain has one priority
- All requests need to run at the same site (no T2 → T1 relocation)
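A minimal sketch of the tree of tasks described above, assuming each task is identified by a name and by the name of the preceding task whose output module it reads. The `Task` class and the `chains` helper are hypothetical names for illustration, not the WMCore or McM data model.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Task:
    name: str
    # Name of the preceding task this task feeds from (hypothetical field);
    # None marks the root task (input dataset or event generation).
    input_task: Optional[str] = None


def chains(tasks: List[Task]) -> List[List[str]]:
    """Return every root-to-leaf path in the tree of tasks."""
    children: Dict[str, List[str]] = {t.name: [] for t in tasks}
    roots: List[str] = []
    for t in tasks:
        if t.input_task is None:
            roots.append(t.name)
        else:
            children[t.input_task].append(t.name)

    paths: List[List[str]] = []

    def walk(name: str, prefix: List[str]) -> None:
        prefix = prefix + [name]
        if not children[name]:
            paths.append(prefix)  # a leaf closes one chain
        for child in children[name]:
            walk(child, prefix)

    for root in roots:
        walk(root, [])
    return paths


# The example from the slide: A -> B -> C1 -> D1 and B -> C2 -> D2 (C2 -> D3)
tree = [
    Task("A"),
    Task("B", "A"),
    Task("C1", "B"),
    Task("D1", "C1"),
    Task("C2", "B"),
    Task("D2", "C2"),
    Task("D3", "C2"),
]

for path in chains(tree):
    print(" -> ".join(path))
```

Submitted as separate workflows, each of the seven tasks would be its own item in request manager; in a TaskChain they form a single one.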
Already Tested
- Years of operation of release validation samples
- Although job splitting was always set explicitly
- Treating eos-based .lhe files in input
https://github.com/dmwm/WMCore/issues/4871
- #Events per lumi
https://github.com/dmwm/WMCore/issues/4872
- Doing wmLHE and gen-sim in a single workflow
- 2 requests in mcm
- 2 tasks in the taskchain
- https://vlimant.web.cern.ch/vlimant/SUS-Fall13wmLHE-00011_dict_2t.json
- Output 2 datasets as if they were processed in two different workflows,
without the dataset manipulation latency
- Doing trees of requests from SUS-Fall13wmLHE-00011
- https://cms-pdmv.cern.ch/mcm/chained_requests?root_request=SUS-Fall13wmLHE-0001
- 1 wmLHE, 1 gen-sim, 2 digi-reco, 1 mini-aod : 5 workflows compared to one
taskchain
- https://vlimant.web.cern.ch/vlimant/SUS-Fall13wmLHE-00011_dict_at.json
- The last clone made by Alan succeeded, apart from an AODSIM output
dataset collision due to a wrong assignment.
Already Developed (1/3)
- Testing script for the full chain request (March 2014)
https://cms-pdmv.cern.ch/mcm/public/restapi/chained_requests/
- get_setup/<chained request id>
- Set up and run each request one after the other
- Testing API for chained requests (March 2014)
https://cms-pdmv.cern.ch/mcm/restapi/chained_requests/
- Test/<chained request id>
- Threaded runtest of the chain
- Verification of performance & efficiency measured
- Requires certificates and xrootd enabled
- Creating the taskchain dictionary from
- A chained request ID (March 2014)
https://cms-pdmv.cern.ch/mcm/public/restapi/chained_requests/get_dict/SUS-chain_Fall13wmLHE_flowWMLHEtoF13_flowS14P
- Handle only the requests that are part of the chain
- N.B. The link has scratch=true which unfolds the whole chain
- A request ID (August 2014)
https://cms-pdmv.cern.ch/mcm/public/restapi/chained_requests/get_dict/SUS-Fall13wmLHE-00011?scratch=true
- Look for the tree of requests from the chains the request is involved with
- N.B. The link has scratch=true which unfolds the whole chain
- Injection of taskchain (March 2014)
- wmcontrol is provided with the url to the dictionary
https://github.com/cms-PdmV/wmcontrol/commit/0a2352e7866a61cf41fb31afa334f4f268f8a415
- Everything is done within McM
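The public endpoints listed on this slide can be assembled as below. This is a sketch that only builds URL strings from the base path and the `get_setup` / `get_dict` names quoted above; the helper function names are hypothetical, and no request is sent.

```python
from urllib.parse import quote, urlencode

# Base path as quoted on this slide.
BASE = "https://cms-pdmv.cern.ch/mcm/public/restapi/chained_requests"


def get_setup_url(chained_request_id: str) -> str:
    """URL of the testing script for a full chained request."""
    return f"{BASE}/get_setup/{quote(chained_request_id, safe='')}"


def get_dict_url(prepid: str, scratch: bool = False) -> str:
    """URL of the taskchain dictionary for a request or chained request ID.

    scratch=True appends the scratch=true option, which per the slide
    unfolds the whole chain.
    """
    url = f"{BASE}/get_dict/{quote(prepid, safe='')}"
    if scratch:
        url += "?" + urlencode({"scratch": "true"})
    return url


print(get_dict_url("SUS-Fall13wmLHE-00011", scratch=True))
```

The resulting string is what wmcontrol would be handed as the url to the dictionary.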
Already Developed (2/3)
- Labelling of the output dataset “processingstring” (March 2014)
- Application of experience with relvals
- Simplifies greatly the assignment of TaskChains
- Registering statistics and status of multiple output dataset (August 2014)
- Required for proper toggling of done status with completed events in McM
- Reduction of stats DB size by making a history member of each doc (August 2014)
- From 23Gb to 500Mb …
- Growth plot fully available and made simpler to make
- Button for chain request testing available to gen contact (September 2014)
- Fix for unintentional reset of requests
- Approval toggling from gen contact & convener (September 2014)
- Once validation is finished, status is toggled
- Toggling to define then approve in the regular way
- Injection of taskchain and batch texting (September 2014)
- Injection is now threaded and locked
- Subject&Text of the pilot batch was ambiguous
https://hypernews.cern.ch/HyperNews/CMS/get/dataopsrequests/5546.html
- and now fixed https://github.com/cms-PdmV/cmsPdmV/pull/652
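One possible reading of "making a history member of each doc" in the stats DB item above is sketched here: instead of one document per update, each workflow keeps a single document whose `history` list holds the successive measurements. The field names (`workflow`, `time`, `events`) and the assumption that the earlier layout stored one document per snapshot are illustrative guesses, not the actual stats DB schema.

```python
from typing import Dict, List


def fold_history(snapshots: List[dict]) -> Dict[str, dict]:
    """Fold per-update snapshot documents into one document per workflow."""
    docs: Dict[str, dict] = {}
    for snap in sorted(snapshots, key=lambda s: s["time"]):
        doc = docs.setdefault(
            snap["workflow"], {"_id": snap["workflow"], "history": []}
        )
        # Keep only the quantities that change between updates.
        doc["history"].append({"time": snap["time"], "events": snap["events"]})
    return docs


# Hypothetical snapshots of two workflows.
snapshots = [
    {"workflow": "wf_a", "time": 2, "events": 500},
    {"workflow": "wf_a", "time": 1, "events": 100},
    {"workflow": "wf_b", "time": 1, "events": 50},
]

docs = fold_history(snapshots)
# The growth curve of a workflow is then read directly from its history.
growth = [(h["time"], h["events"]) for h in docs["wf_a"]["history"]]
print(growth)
```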
Already Developed (3/3)
- Toggling of status to done using multiple output (September 2014)
- Few typos fixed
- Worked out of the box, with regular request inspection
https://cms-pdmv.cern.ch/mcm//requests?member_of_chain=HIG-chain_Summer12_flowS12to53-00264&page=0&shown=146297325599
- Protection for dataset name collision (September 2014)
- PR https://github.com/cms-PdmV/cmsPdmV/pull/658
- Required to prevent TaskChains from creating collisions with existing requests
- Functions with indirect injection of taskchain : i.e. when toggling submit approval
- Does not operate with direct injection : i.e. using /restapi/chained_requests/inject/<id>
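The kind of check implied by the collision protection above can be sketched as follows, assuming output dataset names are plain strings and that the names already claimed by existing requests are available as a set. The function and the example dataset names are hypothetical; this is not the code from the pull request.

```python
from typing import Iterable, List, Set


def find_collisions(new_outputs: Iterable[str], existing: Set[str]) -> List[str]:
    """Return the output dataset names that would collide.

    A name collides if an existing request already claims it, or if two
    tasks of the same taskchain would write to it.
    """
    seen = set(existing)
    collisions: List[str] = []
    for name in new_outputs:
        if name in seen:
            collisions.append(name)
        seen.add(name)
    return collisions


# Hypothetical dataset names, for illustration only.
existing = {"/Sample/CampaignA-v1/GEN-SIM"}
taskchain_outputs = [
    "/Sample/CampaignA-v1/GEN-SIM",  # already claimed by an existing request
    "/Sample/CampaignB-v1/AODSIM",
    "/Sample/CampaignB-v1/AODSIM",   # two tasks writing to the same name
]

print(find_collisions(taskchain_outputs, existing))
```

An injection would proceed only when the returned list is empty.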
On-Going
- Pilot batch of TaskChain from McM
- From HIG mass scan
https://cms-pdmv.cern.ch/mcm/requests?dataset_name=*FilterMuOrEle15*&member_of_campaign=Summer12
- Extra mass point (55) added, validated
https://cms-pdmv.cern.ch/mcm/requests?prepid=HIG-Summer12-02258
➔ Completed after a few manual steps
➔ Issue with ACDC not solved yet
- Brainstorming on assignment (Ops) quoting chats with Alan
- Adapt the scripts that look for possible job location based on input datasets,
being primary or pileup
- Adapt possible modification to job splitting made by assignment scripts
- Allocate TaskChains to site based on resource availability
➔ No feedback yet
➔ Proper site white list wasn't used in the pilot and led to failures in digi-reco
https://hypernews.cern.ch/HyperNews/CMS/get/dataopsrequests/5546/1/1/1/1/1/1/1/1/1/1/1/1/1/1.html
➔ Suspicion that this is also what is causing the ACDC to not start
Suggestion To Next Steps
- Get feedback on the Ops brainstorming and iron out the handshaking details
- Do a reservation campaign in Summer11 & Summer12*
- Put all new requests in Summer11 and Summer12* through TaskChain
- Extend to new requests in Fall13* → miniAOD
- Extend to new requests in Fall14wmLHE → Fall14