Useful info on SNT workflows
Nick Amin
Overview
⚫ Two main parts
    ⚫ Data/metadata retrieval
        ⚫ people usually use DAS
        ⚫ many of us use DIS
        ⚫ metadata about SNT samples (i.e., CMS4)
    ⚫ Job submission
        ⚫ people usually use CRAB
        ⚫ many …
⚫ Also advertising how to retrieve this information and how to …
⚫ Many CMS services to deal with data bookkeeping
⚫ DAS (Data Aggregation Service) deals with many but not all
⚫ "Can I get the cmsRun configuration for the GENSIM used to make this MC sample?"
[Diagram: user → DAS → DBS (dataset/file info), Phedex (dataset/file location), PmP (campaign progress), McM (MC configurations), other stuff (minor stuff nobody uses). CMS4 information is not in DAS: Skype with Nick, or ls /hadoop/...]
⚫ DIS alleviates this issue by querying services directly
⚫ Drops some DAS things that we don’t use daily
⚫ Adds some things that combine multiple sources
⚫ Adds CMS4 bookkeeping
⚫ You don’t need a proxy/cert for anything; only the person running the DIS server does
[Diagram: user → DIS → DBS (dataset/file info), Phedex (dataset/file location), PmP (campaign progress), McM (MC configurations), CMS4 (CMS4 information)]
[Screenshot: DAS result page, processing time: 8.1s]
…not to mention it times out sometimes, and there’s also this kind of page: [screenshot]
[Screenshot: DIS result page, processing time: 2.2s]
⚫ If DIS talks to DBS directly, and DAS talks to DBS for the same data, then how is DAS 4x slower? 🤸
⚫ dasgoclient (CLI) written by DAS author to bypass DAS and query DBS directly
⚫ http://uaf-8.t2.ucsd.edu/~namin/dis/?query=%2FEGamma%2FRun2018D-22Jan2019-v2%2FMINIAOD&type=files&short=short
[Screenshot of the DIS web form, with annotations:]
⚫ query (almost always just a dataset name)
⚫ type: what kind of info do you want?
⚫ "short" output? If unchecked, display more details
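The example link above suggests that the web form simply maps to three URL parameters (query, type, short). As a minimal sketch, assuming that is all there is to it, the same link can be built in Python. The helper name dis_url is made up for illustration, and leaving out the short parameter to mimic an unchecked box is an assumption about how the form behaves.

```python
from urllib.parse import urlencode

# Base URL taken from the example link on the slide
DIS_URL = "http://uaf-8.t2.ucsd.edu/~namin/dis/"


def dis_url(query, query_type="files", short=True):
    """Build a DIS web URL (hypothetical helper, not part of DIS itself)."""
    params = {"query": query, "type": query_type}
    if short:
        # Assumption: an unchecked "short" box just omits this parameter
        params["short"] = "short"
    # urlencode percent-encodes the slashes in the dataset name
    return DIS_URL + "?" + urlencode(params)


url = dis_url("/EGamma/Run2018D-22Jan2019-v2/MINIAOD")
print(url)
```

Pasting the printed URL into a browser should give the same page as filling in the form by hand.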
⚫ Basic
⚫ Files — by default, shows only 10 files (uncheck short option to see all)
⚫ Sites — where is my data?
If you put in a dataset, you get fractional T2 presence.
If you put in a file, you get info about that file/block.
⚫ Chain
⚫ Pick (pickevents)
⚫ Pick_cms4 (pickevents to CMS4 level)
events
⚫ SNT (search CMS4 samples)
⚫ How can we summarize lots of output?
⚫ "What’s the total event count of all /TTTT* samples?"
⚫ Additionally, for SNT queries, put restrictions as a comma-separated list after the dataset pattern
⚫ Print out hadoop path and dataset name for Prompt 2018 data processed with the CMS4_V10-02-04 tag
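To make the query syntax concrete, here is a rough sketch of how such a query string could be assembled: dataset pattern first, then comma-separated key=value restrictions, then an optional "| grep" field selection. The helper build_snt_query and the dataset pattern are made up for illustration; cms3tag and the field names are the ones that appear in the command-line example in these slides.

```python
def build_snt_query(pattern, restrictions=None, fields=None):
    """Assemble an SNT-style query string (hypothetical helper, not part of DIS)."""
    query = pattern
    if restrictions:
        # Restrictions follow the dataset pattern as a comma-separated list
        query += "," + ",".join(f"{key}={value}" for key, value in restrictions.items())
    if fields:
        # "| grep" keeps only the listed fields in the output
        query += " | grep " + ",".join(fields)
    return query


query = build_snt_query(
    "/MET/Run2018*/MINIAOD",  # illustrative pattern only
    restrictions={"cms3tag": "CMS4_V10-02-04"},
    fields=["dataset_name", "gtag", "nevents_in"],
)
print(query)
```

The resulting string is what would go into the query box on the webpage or be passed to the command-line client.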
⚫ Python command line client has exact same syntax as webpage (just give -t <type>), and you can make nice tables too
… MINIAOD,cms3tag=CMS4_V10-02-04 | grep dataset_name,gtag,nevents_in" --table
dataset_name                          gtag                       nevents_in
/MET/Run2018A-PromptReco-v2/MINIAOD   102X_dataRun2_Prompt_v11   5980578
/MET/Run2018A-PromptReco-v1/MINIAOD   102X_dataRun2_Prompt_v11   30172992
/MET/Run2018B-PromptReco-v1/MINIAOD   102X_dataRun2_Prompt_v11   28012780
/MET/Run2018C-PromptReco-v1/MINIAOD   102X_dataRun2_Prompt_v11   1986935
/MET/Run2018B-PromptReco-v2/MINIAOD   102X_dataRun2_Prompt_v11   1739672
/MET/Run2018A-PromptReco-v3/MINIAOD   102X_dataRun2_Prompt_v11   17175066
/MET/Run2018D-PromptReco-v2/MINIAOD   102X_dataRun2_Prompt_v11   162272551
/MET/Run2018C-PromptReco-v2/MINIAOD   102X_dataRun2_Prompt_v11   14698298
/MET/Run2018C-PromptReco-v3/MINIAOD   102X_dataRun2_Prompt_v11   14586790
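If you save table output like this to a file or a string, adding up a column takes a few lines of Python. The sketch below assumes the output is plain whitespace-separated text with a header row, which may not match the client's exact formatting; total_events is a made-up helper, and only the first three rows are copied in to keep it short.

```python
# First three rows of the table above, as whitespace-separated text
table = """\
dataset_name gtag nevents_in
/MET/Run2018A-PromptReco-v2/MINIAOD 102X_dataRun2_Prompt_v11 5980578
/MET/Run2018A-PromptReco-v1/MINIAOD 102X_dataRun2_Prompt_v11 30172992
/MET/Run2018B-PromptReco-v1/MINIAOD 102X_dataRun2_Prompt_v11 28012780
"""


def total_events(text, column="nevents_in"):
    """Sum one integer column of a whitespace-separated table with a header row."""
    lines = [line.split() for line in text.strip().splitlines()]
    header, rows = lines[0], lines[1:]
    idx = header.index(column)  # locate the column by its header name
    return sum(int(row[idx]) for row in rows)


total = total_events(table)
print(total)
```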
⚫ Other features
⚫ "CRAB mostly works when it works, but it mostly doesn’t work"
⚫ Luckily we have local condor submission and running lots of cmsRun isn’t that complicated
⚫ Almost all data processing we do is based on dataset in → files out
produce files with events
configuration file
⚫ Metis (https://github.com/aminnj/ProjectMetis) makes it more functional
    task = CMSSWTask(
        sample = DBSSample(dataset="/ZeroBias6/Run2017A-PromptReco-v2/MINIAOD"),
        events_per_output = 450e3,
        tag = "CMS4_V00-00-03",
        pset = "pset_test.py",
        pset_args = "data=True prompt=True",
        cmssw_version = "CMSSW_9_2_1",
        tarfile = "/nfs-7/userdata/libCMS3/lib_CMS4_V00-00-03_workaround.tar.gz",
        is_data = True,
    )
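To give a feel for what a setting like events_per_output implies, here is a toy sketch of the "dataset in → files out" idea: pack input files into groups until each group holds roughly the requested number of events. This is not Metis's actual splitting logic, just an illustration under that assumption; the function name, file names, and event counts are all made up.

```python
def group_files_by_events(files, events_per_output):
    """Greedily pack (filename, nevents) pairs into groups of about events_per_output events."""
    groups, current, count = [], [], 0
    for name, nevents in files:
        current.append(name)
        count += nevents
        if count >= events_per_output:
            # This group has enough events to make one output file
            groups.append(current)
            current, count = [], 0
    if current:
        # Leftover files form a final, smaller group
        groups.append(current)
    return groups


# Made-up input files with made-up event counts
files = [("a.root", 200000), ("b.root", 300000), ("c.root", 100000), ("d.root", 400000)]
groups = group_files_by_events(files, 450e3)
print(groups)
```

Each group would then correspond to one job that runs cmsRun over its input files and writes one output file.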
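    import time

    # Module paths as laid out in the ProjectMetis repository
    from metis.CMSSWTask import CMSSWTask
    from metis.Sample import DBSSample
    from metis.StatsParser import StatsParser

    def main():
        task = CMSSWTask(
            sample = DBSSample(dataset="/ZeroBias6/Run2017A-PromptReco-v2/MINIAOD"),
            events_per_output = 450e3,
            tag = "CMS4_V00-00-03",
            pset = "pset_test.py",
            pset_args = "data=True prompt=True",
            cmssw_version = "CMSSW_9_2_1",
            tarfile = "/nfs-7/userdata/libCMS3/lib_CMS4_V00-00-03_workaround.tar.gz",
            is_data = True,
        )
        # Run one pass of the task
        task.process()
        # Collect the job summary for the dashboard, keyed by dataset name
        total_summary = {task.get_sample().get_datasetname(): task.get_task_summary()}
        StatsParser(data=total_summary, webdir="~/public_html/dump/metis_test/").do()

    if __name__ == "__main__":
        # Do stuff, sleep, do stuff, sleep, etc.
        for i in range(100):
            main()
            # Since everything is backed up, totally OK to Ctrl+C and pick up later
            time.sleep(1.*3600)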
⚫ Process a task
⚫ Make a summary of jobs and put it on a dashboard
⚫ Easily extensible to a loop over datasets
    import time

    # Module paths as laid out in the ProjectMetis repository
    from metis.CMSSWTask import CMSSWTask
    from metis.Sample import DirectorySample, DummySample
    from metis.StatsParser import StatsParser

    tag = "v1"
    total_summary = {}
    for _ in range(10000):

        # GENSIM: no input dataset, so use a dummy sample and split by events
        gen = CMSSWTask(
            sample = DummySample(N=1, dataset="/WH_HtoRhoGammaPhiGamma/privateMC_102x/GENSIM"),
            events_per_output = 1000,
            total_nevents = 1000000,
            pset = "gensim_cfg.py",
            cmssw_version = "CMSSW_10_2_5",
            scram_arch = "slc6_amd64_gcc700",
            tag = tag,
            split_within_files = True,
        )

        # Each later step reads the output directory of the step before it
        raw = CMSSWTask(
            sample = DirectorySample(
                location = gen.get_outputdir(),
                dataset = gen.get_sample().get_datasetname().replace("GENSIM","RAWSIM"),
            ),
            files_per_output = 1,
            pset = "rawsim_cfg.py",
            cmssw_version = "CMSSW_10_2_5",
            scram_arch = "slc6_amd64_gcc700",
            tag = tag,
        )

        aod = CMSSWTask(
            sample = DirectorySample(
                location = raw.get_outputdir(),
                dataset = raw.get_sample().get_datasetname().replace("RAWSIM","AODSIM"),
            ),
            files_per_output = 5,
            pset = "aodsim_cfg.py",
            cmssw_version = "CMSSW_10_2_5",
            scram_arch = "slc6_amd64_gcc700",
            tag = tag,
        )

        miniaod = CMSSWTask(
            sample = DirectorySample(
                location = aod.get_outputdir(),
                dataset = aod.get_sample().get_datasetname().replace("AODSIM","MINIAODSIM"),
            ),
            flush = True,
            files_per_output = 5,
            pset = "miniaodsim_cfg.py",
            cmssw_version = "CMSSW_10_2_5",
            scram_arch = "slc6_amd64_gcc700",
            tag = tag,
        )

        cms4 = CMSSWTask(
            sample = DirectorySample(
                location = miniaod.get_outputdir(),
                dataset = miniaod.get_sample().get_datasetname().replace("MINIAODSIM","CMS4"),
            ),
            flush = True,
            files_per_output = 1,
            pset = "psets_cms4/main_pset_V10-02-04.py",
            pset_args = "data=False year=2018",
            global_tag = "102X_upgrade2018_realistic_v12",
            cmssw_version = "CMSSW_10_2_5",
            scram_arch = "slc6_amd64_gcc700",
            tag = tag,
            tarfile = "/nfs-7/userdata/libCMS3/lib_CMS4_V10-02-04_1025.tar.xz",
        )

        tasks = [gen,raw,aod,miniaod,cms4]

        for task in tasks:
            task.process()
            # Store each task's summary under its dataset name for the dashboard
            summary = task.get_task_summary()
            total_summary[task.get_sample().get_datasetname()] = summary

        StatsParser(data=total_summary, webdir="~/public_html/dump/metis/").do()
        time.sleep(30*60)
⚫ Can chain together tasks
⚫ Allows one to make a
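The chaining in the code above rests on a simple naming convention: each step takes the previous step's dataset name and swaps the data tier at the end. The plain-Python sketch below (no Metis needed) just traces those names through the chain; the list of steps is read off the replace() calls in the code.

```python
# Starting dataset name, as given to DummySample in the chain above
gensim = "/WH_HtoRhoGammaPhiGamma/privateMC_102x/GENSIM"

# (old tier, new tier) pairs, in the order the tasks are chained
steps = [
    ("GENSIM", "RAWSIM"),
    ("RAWSIM", "AODSIM"),
    ("AODSIM", "MINIAODSIM"),
    ("MINIAODSIM", "CMS4"),
]

names = [gensim]
for old, new in steps:
    # Same string replacement each task uses to name its output dataset
    names.append(names[-1].replace(old, new))

for name in names:
    print(name)
```

Note that str.replace swaps every occurrence, so this convention only works when the tier name does not also appear earlier in the dataset name.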
⚫ Also has other Task objects (CondorTask) that can be used to make babies from CMS4
⚫ Many examples in
⚫ Other Metis features
⚫ Monitoring page for central instance of Metis that submits SNT samples
⚫ https://github.com/cmstas/NtupleTools/blob/master/MetisScripts/bigly.py
for >500 datasets (currently)
⚫ Procedure to get more CMS4
file, e.g.,
product of xsec, kfactor, filtereff really matters)
    "/QCD_HT100to200_TuneCP5_13TeV-madgraphMLM-pythia8/RunIIAutumn18MiniAOD-102X_upgrade2018_realistic_v15-v1/MINIAODSIM|19380000|1|1",
    "/QCD_HT200to300_TuneCP5_13TeV-madgraphMLM-pythia8/RunIIAutumn18MiniAOD-102X_upgrade2018_realistic_v15-v1/MINIAODSIM|1559000|1.0|1.0",
    "/QCD_HT300to500_TuneCP5_13TeV-madgraphMLM-pythia8/RunIIAutumn18MiniAOD-102X_upgrade2018_realistic_v15-v1/MINIAODSIM|323300|1.0|1.0",
    "/QCD_HT500to700_TuneCP5_13TeV-madgraphMLM-pythia8/RunIIAutumn18MiniAOD-102X_upgrade2018_realistic_v15-v1/MINIAODSIM|30000|1.0|1.0",
    …
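Entries like these appear to be pipe-separated, with the dataset name followed by three numbers. The sketch below assumes those numbers are xsec, kfactor, and filtereff, in that order, and computes their product, which is the quantity the slide says matters. The field order is an inference from the text, and parse_sample is a made-up helper rather than anything from NtupleTools.

```python
def parse_sample(entry):
    """Split a 'dataset|xsec|kfactor|filtereff' entry (assumed field order)."""
    dataset, xsec, kfactor, filtereff = entry.split("|")
    # Only the product of the three numbers is used as the effective cross section
    return dataset, float(xsec) * float(kfactor) * float(filtereff)


entry = (
    "/QCD_HT200to300_TuneCP5_13TeV-madgraphMLM-pythia8/"
    "RunIIAutumn18MiniAOD-102X_upgrade2018_realistic_v15-v1/MINIAODSIM|1559000|1.0|1.0"
)
dataset, effective_xsec = parse_sample(entry)
print(dataset, effective_xsec)
```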