SLIDE 1

Advancement of usage of TaskChain in production J-R Vlimant

SLIDE 2

9/19/14 Post-MccM Discussion, J-R Vlimant 2

In A Nutshell

  • TaskChain is the most flexible type of workflow
  • One cmsRun per “task”
  • A root task either reading from input dataset or generating events
  • wmLHE and pLHE enabled
  • Each subsequent task feeds from one of the output modules of one of the preceding tasks

  • Trees of tasks possible
  • A → B → C1 → D1 and B → C2 → D2 (C2 → D3 and so on)
  • Job splitting is either set explicitly (#events/job, #lumis/job) or automatic, using time/event (N.B. #events/lumi is fully functional)

  • All outputs are exposed to computing up-front
  • PROS
  • In a multi-campaign mode of operation, reduces the number of workflows (items in request manager) from N>1 to 1

  • No intermediate manipulation of datasets
  • No latency in assigning the next workflows
  • No latency, less manual operation in creating tape families
  • CONS
  • Full chain has to be tested at once: a change in mode of operation for the gen contacts
  • Recovery workflows can become complicated with a large number of tasks: a change of operation for ops
  • The chain has one priority
  • All requests need to run at the same site (no T2 → T1 relocation)
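
The tree structure described on this slide can be sketched in Python. This is an illustration only: the key names (`TaskChain`, `Task1`, `TaskName`, `InputTask`, `InputFromOutputModule`) are modelled on the WMCore request format but should be treated as assumptions here, and the task and output module names are placeholders. The helper `task_paths` is hypothetical and simply lists every root-to-leaf path of the tree.

```python
# Illustrative TaskChain-like request (placeholder values, assumed key names).
request = {
    "RequestType": "TaskChain",
    "TaskChain": 6,  # number of tasks in the chain
    "Task1": {"TaskName": "A"},  # root task: generates events or reads a dataset
    "Task2": {"TaskName": "B", "InputTask": "A", "InputFromOutputModule": "outA"},
    "Task3": {"TaskName": "C1", "InputTask": "B", "InputFromOutputModule": "outB"},
    "Task4": {"TaskName": "D1", "InputTask": "C1", "InputFromOutputModule": "outC1"},
    "Task5": {"TaskName": "C2", "InputTask": "B", "InputFromOutputModule": "outB"},
    "Task6": {"TaskName": "D2", "InputTask": "C2", "InputFromOutputModule": "outC2"},
}


def task_paths(req):
    """Return every root-to-leaf path of task names in the request."""
    tasks = [req["Task%d" % i] for i in range(1, req["TaskChain"] + 1)]
    children = {}
    roots = []
    for task in tasks:
        parent = task.get("InputTask")
        if parent is None:
            roots.append(task["TaskName"])
        else:
            children.setdefault(parent, []).append(task["TaskName"])
    if len(roots) != 1:
        raise ValueError("expected exactly one root task, found %d" % len(roots))

    paths = []

    def walk(name, prefix):
        prefix = prefix + [name]
        kids = children.get(name, [])
        if not kids:
            paths.append(prefix)  # leaf task: one complete branch
        for kid in kids:
            walk(kid, prefix)

    walk(roots[0], [])
    return paths


for path in task_paths(request):
    print(" -> ".join(path))
```

With the placeholder request above, the two branches A → B → C1 → D1 and A → B → C2 → D2 share the tasks A and B, which run only once.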
SLIDE 3

Already Tested

  • Years of operation of release validation samples
  • Although job splitting was always set explicitly
  • Treating eos-based .lhe files as input

https://github.com/dmwm/WMCore/issues/4871

  • #Events per lumi

https://github.com/dmwm/WMCore/issues/4872

  • Doing wmLHE and gen-sim in a single workflow
  • 2 requests in mcm
  • 2 tasks in the taskchain
  • https://vlimant.web.cern.ch/vlimant/SUS-Fall13wmLHE-00011_dict_2t.json
  • Output 2 datasets as if they were processed in two different workflows, without the dataset manipulation latency

  • Doing trees of requests from SUS-Fall13wmLHE-00011
  • https://cms-pdmv.cern.ch/mcm/chained_requests?root_request=SUS-Fall13wmLHE-0001
  • 1 wmLHE, 1 gen-sim, 2 digi-reco, 1 mini-aod: 5 workflows compared to one taskchain

  • https://vlimant.web.cern.ch/vlimant/SUS-Fall13wmLHE-00011_dict_at.json
  • The last clone made by Alan succeeded, with only an AODSIM output dataset collision caused by a wrong assignment.
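
The job-splitting options mentioned in these slides (explicit #events/job or #lumis/job, or automatic from time/event, with #events/lumi) come down to simple arithmetic, sketched below. The 8-hour target job length and all function names are assumptions made for illustration; they are not taken from WMCore or McM.

```python
import math


def events_per_job(time_per_event_s, target_job_hours=8.0):
    """Events one job can process within the target wall time (assumed 8 h)."""
    return max(1, int(target_job_hours * 3600 / time_per_event_s))


def lumis_per_job(n_events_per_job, events_per_lumi):
    """Whole lumi sections that fit in one job, at least one."""
    return max(1, n_events_per_job // events_per_lumi)


def number_of_jobs(total_events, n_events_per_job):
    """Jobs needed to cover the requested number of events."""
    return math.ceil(total_events / n_events_per_job)


# Example: 20 s/event, 100 events per lumi, 100000 events requested.
per_job = events_per_job(20.0)
print(per_job, lumis_per_job(per_job, 100), number_of_jobs(100000, per_job))
```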

SLIDE 4

Already Developed (1/3)

  • Testing script for the full chain request (March 2014)

https://cms-pdmv.cern.ch/mcm/public/restapi/chained_requests/

  • get_setup/<chained request id>
  • Set up and run each request one after the other
  • Testing API for chained requests (March 2014)

https://cms-pdmv.cern.ch/mcm/restapi/chained_requests/

  • Test/<chained request id>
  • Threaded runtest of the chain
  • Verification of performance & efficiency measured
  • Requires certificates and xrootd enabled
  • Creating the taskchain dictionary from
  • A chained request ID (March 2014)

https://cms-pdmv.cern.ch/mcm/public/restapi/chained_requests/get_dict/SUS-chain_Fall13wmLHE_flowWMLHEtoF13_flowS14P

  • Handle only the requests that are part of the chain
  • N.B. The link has scratch=true which unfolds the whole chain
  • A request ID (August 2014)

https://cms-pdmv.cern.ch/mcm/public/restapi/chained_requests/get_dict/SUS-Fall13wmLHE-00011?scratch=true

  • Look for the tree of requests from the chains the request is involved in
  • N.B. The link has scratch=true which unfolds the whole chain
  • Injection of taskchain (March 2014)
  • wmcontrol is provided with the url to the dictionary

https://github.com/cms-PdmV/wmcontrol/commit/0a2352e7866a61cf41fb31afa334f4f268f8a415

  • Everything is done within McM
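
As an illustration of the `get_dict` endpoint listed on this slide, the Python sketch below builds the URL for a request or chained request ID and, optionally, downloads the dictionary. The base URL and the `scratch=true` parameter come from the links above; the function names are hypothetical, and the assumption that the endpoint returns JSON and is reachable from your network is not verified here.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

# Base URL as it appears in the links on this slide.
MCM_PUBLIC = "https://cms-pdmv.cern.ch/mcm/public/restapi/chained_requests"


def get_dict_url(prepid, scratch=False):
    """Build the get_dict URL for a request or chained request ID."""
    url = "%s/get_dict/%s" % (MCM_PUBLIC, quote(prepid, safe=""))
    if scratch:
        # scratch=true unfolds the whole chain, as noted on the slide.
        url += "?" + urlencode({"scratch": "true"})
    return url


def fetch_taskchain_dict(prepid, scratch=False, timeout=30):
    """Download the dictionary (assumes a JSON response and network access)."""
    with urlopen(get_dict_url(prepid, scratch), timeout=timeout) as response:
        return json.load(response)


print(get_dict_url("SUS-Fall13wmLHE-00011", scratch=True))
```

Only `get_dict_url` runs in this example; `fetch_taskchain_dict` is left uncalled because it needs access to the McM server.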
SLIDE 5

Already Developed (2/3)

  • Labelling of the output dataset “processingstring” (March 2014)
  • Application of experience with relvals
  • Simplifies greatly the assignment of TaskChains
  • Registering statistics and status of multiple output datasets (August 2014)
  • Required for proper toggling of done status with completed events in McM
  • Reduction of stats DB size by making a history member of each doc (August 2014)
  • From 23 GB to 500 MB …
  • Growth plot fully available and made simpler to produce
  • Button for chain request testing available to gen contact (September 2014)
  • Fixed an unintentional reset of requests
  • Approval toggling from gen contact & convener (September 2014)
  • Once validation is finished, status is toggled
  • Toggling to define then approve in the regular way
  • Injection of taskchain and batch texting (September 2014)
  • Injection is now threaded and locked
  • Subject and text of the pilot batch were ambiguous

https://hypernews.cern.ch/HyperNews/CMS/get/dataopsrequests/5546.html

  • and now fixed https://github.com/cms-PdmV/cmsPdmV/pull/652
SLIDE 6

Already Developed (3/3)

  • Toggling of status to done using multiple outputs (September 2014)
  • Few typos fixed
  • Worked out of the box, with regular request inspection

https://cms-pdmv.cern.ch/mcm//requests?member_of_chain=HIG-chain_Summer12_flowS12to53-00264&page=0&shown=146297325599

  • Protection for dataset name collision (September 2014)
  • PR https://github.com/cms-PdmV/cmsPdmV/pull/658
  • Required to prevent TaskChains from creating collisions with existing requests
  • Functions with indirect injection of taskchain, i.e. when toggling submit approval
  • Does not operate with direct injection, i.e. using /restapi/chained_requests/inject/<id>
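
The idea behind a dataset name collision check can be sketched as follows. This is not the implementation from the pull request above, only a minimal Python illustration: the function name and the dataset names are placeholders, and the check simply flags output names that appear twice within one chain or that already belong to an existing request.

```python
from collections import Counter


def find_collisions(chain_outputs, existing_outputs):
    """Return the sorted dataset names that would collide.

    A name collides if two tasks of the chain would write it, or if an
    existing request already produces it.
    """
    counts = Counter(chain_outputs)
    internal = {name for name, count in counts.items() if count > 1}
    external = set(chain_outputs) & set(existing_outputs)
    return sorted(internal | external)


# Placeholder names in the /Primary/Processed/TIER layout.
chain = [
    "/Sample/Campaign-tagA-v1/GEN-SIM",
    "/Sample/Campaign-tagB-v1/AODSIM",
    "/Sample/Campaign-tagB-v1/AODSIM",  # two digi-reco tasks with the same label
    "/Sample/Campaign-tagC-v1/MINIAODSIM",
]
existing = ["/Sample/Campaign-tagA-v1/GEN-SIM", "/Other/Campaign-tagA-v1/GEN-SIM"]

print(find_collisions(chain, existing))
```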
SLIDE 7

On-Going

  • Pilot batch of TaskChain from McM
  • From HIG mass scan

https://cms-pdmv.cern.ch/mcm/requests?dataset_name=*FilterMuOrEle15*&member_of_campaign=Summer12

  • Extra mass point (55) added, validated

https://cms-pdmv.cern.ch/mcm/requests?prepid=HIG-Summer12-02258

➔ Completed after a few manual steps
➔ Issue with ACDC not solved yet

  • Brainstorming on assignment (Ops), quoting chats with Alan
  • Adapt the scripts that look for possible job locations based on input datasets, whether primary or pileup

  • Adapt possible modification to job splitting made by assignment scripts
  • Allocate TaskChains to site based on resource availability

➔ No feedback yet
➔ Proper site white list wasn't used in the pilot, which led to failures in digi-reco

https://hypernews.cern.ch/HyperNews/CMS/get/dataopsrequests/5546/1/1/1/1/1/1/1/1/1/1/1/1/1/1.html

➔ Suspicion that this is also what is causing the ACDC to not start
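
Because all tasks of a chain run at the same site, one simple way to pick candidate sites is to intersect the site white list with the locations of the primary input and of the pileup dataset. The Python sketch below illustrates that idea only; it is not the Ops assignment script, the function name is hypothetical, and the site lists are made-up examples.

```python
def candidate_sites(whitelist, primary_sites=None, pileup_sites=None):
    """Sites allowed by the white list that host every required input.

    Pass None for an input that the chain does not need (for example, no
    primary input when the root task generates events).
    """
    sites = set(whitelist)
    if primary_sites is not None:
        sites &= set(primary_sites)
    if pileup_sites is not None:
        sites &= set(pileup_sites)
    return sorted(sites)


# Made-up example locations.
whitelist = ["T1_US_FNAL", "T1_DE_KIT", "T2_CH_CERN"]
pileup_at = ["T1_US_FNAL", "T2_CH_CERN", "T1_IT_CNAF"]

print(candidate_sites(whitelist, primary_sites=None, pileup_sites=pileup_at))
```

An empty result would indicate that no white-listed site hosts all inputs, which is the kind of mismatch worth catching before assignment.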

SLIDE 8

Suggestion To Next Steps

  • Get feedback on the Ops brainstorming and iron out the handshaking details
  • Do a reservation campaign in Summer11 & Summer12*
  • Put all new requests in Summer11 and Summer12* through TaskChain
  • Extend to new requests in Fall13* → miniAOD
  • Extend to new requests in Fall14wmLHE → Fall14