

  1. Interactive NanoAOD analysis Nick Amin Aug 19, 2019

  2. Introduction
  ⚫ Condor jobs have a lot of overhead
  • Transferring and setting up the environment for every job
  • Scheduling/internal condor overhead
  ⚫ Condor jobs are not interactive
  • Designed for monolithic background processing
  • Job status only through condor_q spam
  ⚫ Interactive login-node use will not really be "interactive" anymore
  • With several versions (e.g., years/periods) of ntuples, everyone turns to parallel looping to make histograms on the login nodes

  3. Introduction
  ⚫ Goal: Enable faster interactive analysis with NanoAOD using a message broker and HTCondor
  • Preliminary code in https://github.com/aminnj/redis-htcondor
  ⚫ Not a new concept! There are lots of tools designed to do this more generally (dynamic task execution with brokered communication):
  • htmap, parsl, dask, celery, ray, airflow, dramatiq, …
  • Limitations on open ports/communication within/outside the batch system mean it's difficult to use these out of the box
  ⚫ Wrote my own simple task queue system based on redis to allow low-level customization to suit our use case (HTCondor, data locality, caching, compressed communication, hardware, hadoop, …)
  • Jobs are "embarrassingly parallel", so there's no real need for dynamic logic, inter-worker communication, DAGs, etc.
  • This relies on exactly one redis master server (one single public-facing ip/port)

  4. Setup
  1. First, the user submits N condor jobs (each condor job is one "worker"), which listen to a FIFO task queue on the redis server (the broker)
  • There are actually N+1 task queues: one general ("tasks") and N specific ("tasks:worker1" … "tasks:workerN")
  • Workers listen to the general task queue and also to the one specific to themselves
  • In this way, we can send targeted tasks to specific workers
  2. The user communicates the workload/tasks to the redis server
  3. Workers perform the work (reading files from hadoop) and push their output into a results queue
  4. The user checks the results queue and combines/reduces
  [Diagram: user on the UAF submits condor jobs; worker1 … workerN pull tasks from the redis broker and read hadoop files; a task is (function, arguments); payloads are compressed/decompressed with lz4]
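The queue pattern above can be sketched with redis lists in a few lines of Python. This is a minimal illustration, not the repo's actual code: the queue names follow the slide, while the serialization choice (cloudpickle) and the broker host are assumptions.

```python
import cloudpickle           # serializes (function, arguments) pairs
import lz4.frame             # payloads are lz4-compressed, as in the diagram
import redis

r = redis.Redis(host="uaf-10.t2.ucsd.edu", port=6379)  # assumed broker host/port

def submit(func, args, queue="tasks"):
    # push a task onto the general FIFO queue (or onto "tasks:workerN" to target one worker)
    r.rpush(queue, lz4.frame.compress(cloudpickle.dumps((func, args))))

def worker_loop(worker_id):
    # each condor job runs a loop like this: listen on the general queue and its own targeted queue
    queues = ["tasks", "tasks:{}".format(worker_id)]
    while True:
        _, payload = r.blpop(queues)  # blocking FIFO pop
        func, args = cloudpickle.loads(lz4.frame.decompress(payload))
        result = func(args)
        r.rpush("results", lz4.frame.compress(cloudpickle.dumps(result)))
```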

  5. Simple example
  ⚫ Someone hosts a redis server (I have one on uaf-10 right now)
  ⚫ The user first externally submits 30 jobs to condor, each of which is a worker
  ⚫ The user writes a function and has the option to map it over a list of arguments locally or remotely
  ⚫ Very low overhead: 40 ms to send tasks to 10 workers and get the results back
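As an illustration of the "map locally or remotely" idea, here is a minimal remote_map built on the queue sketch from the previous slide. The function name matches the one used later in these slides, but this body is an assumption rather than the repo's implementation.

```python
import cloudpickle
import lz4.frame

def remote_map(r, func, list_of_args):
    # push one task per argument onto the general queue
    for args in list_of_args:
        r.rpush("tasks", lz4.frame.compress(cloudpickle.dumps((func, args))))
    # block until every task has reported back
    results = []
    while len(results) < len(list_of_args):
        _, payload = r.blpop("results")
        results.append(cloudpickle.loads(lz4.frame.decompress(payload)))
    return results

# e.g. map a trivial function over 30 arguments on the workers:
# results = remote_map(r, lambda x: x**2, list(range(30)))
```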

  6. Dimuon example
  ⚫ Take a 200M-event subset of Run 2 /DoubleMuon/
  ⚫ Make 448 chunks of up to 500k events each, over the 133 files
  ⚫ Make a function that histograms the dimuon invariant mass
  • Simple selection reading 4 branches
  ⚫ remote_map runs in under 2 minutes: 2 MHz with 30 workers
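A sketch of the kind of chunked task function this describes, assuming the four branches are Muon_pt/eta/phi/mass, that a chunk is specified as (filename, entry_start, entry_stop), and that uproot4/awkward are available on the workers (the original code predates uproot4):

```python
import awkward as ak
import numpy as np
import uproot

def dimuon_histogram(args):
    fname, entry_start, entry_stop = args
    tree = uproot.open(fname)["Events"]
    mu = tree.arrays(["Muon_pt", "Muon_eta", "Muon_phi", "Muon_mass"],
                     entry_start=entry_start, entry_stop=entry_stop)
    # simple selection: keep events with at least two muons and pair the leading two
    mu = mu[ak.num(mu["Muon_pt"]) >= 2]
    pt1, pt2 = mu["Muon_pt"][:, 0], mu["Muon_pt"][:, 1]
    eta1, eta2 = mu["Muon_eta"][:, 0], mu["Muon_eta"][:, 1]
    phi1, phi2 = mu["Muon_phi"][:, 0], mu["Muon_phi"][:, 1]
    m1, m2 = mu["Muon_mass"][:, 0], mu["Muon_mass"][:, 1]
    # invariant mass of the pair from pt/eta/phi/mass
    px = pt1 * np.cos(phi1) + pt2 * np.cos(phi2)
    py = pt1 * np.sin(phi1) + pt2 * np.sin(phi2)
    pz = pt1 * np.sinh(eta1) + pt2 * np.sinh(eta2)
    e = np.sqrt((pt1 * np.cosh(eta1)) ** 2 + m1 ** 2) + np.sqrt((pt2 * np.cosh(eta2)) ** 2 + m2 ** 2)
    mass = np.sqrt(np.maximum(e ** 2 - px ** 2 - py ** 2 - pz ** 2, 0.0))
    counts, edges = np.histogram(ak.to_numpy(mass), bins=100, range=(0, 150))
    return counts, edges
```

The arguments sent through remote_map would then be the 448 (filename, start, stop) chunks.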

  7. Dimuon example
  ⚫ Sum up the partial result histograms and plot the ~100M muon pairs
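Continuing the sketch above: since every chunk returns (counts, edges) with a common binning, the reduce step is just a sum over the counts. Here `results` is the list of (counts, edges) pairs returned by the hypothetical remote_map from earlier.

```python
import matplotlib.pyplot as plt
import numpy as np

counts = sum(np.asarray(c) for c, _ in results)  # elementwise sum of the 448 partial histograms
edges = results[0][1]

plt.step(edges[:-1], counts, where="post")
plt.xlabel("dimuon invariant mass [GeV]")
plt.yscale("log")
plt.show()
```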

  8. What did the workers do?
  ⚫ The results include metadata about start and stop times, so we can plot the blocks during which the 30 workers were working
  • Tasks are distributed to workers on a first-come, first-served basis, i.e., no scheduling
  • No white gaps, so communication overhead is negligible
  • Some "tail" workers can cause inefficiency near the end
  ⚫ Sustained network read speed of ~0.26 GB/s while processing
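One way to make such an occupancy plot from the per-task metadata; the field names in the metadata dicts are assumptions for illustration.

```python
import matplotlib.pyplot as plt

def plot_worker_blocks(task_metadata):
    """task_metadata: list of dicts like {"worker": "worker3", "t_start": ..., "t_stop": ...}"""
    workers = sorted({m["worker"] for m in task_metadata})
    fig, ax = plt.subplots()
    for i, w in enumerate(workers):
        # one horizontal block per task, stacked by worker
        blocks = [(m["t_start"], m["t_stop"] - m["t_start"])
                  for m in task_metadata if m["worker"] == w]
        ax.broken_barh(blocks, (i, 0.8))
    ax.set_xlabel("time [s]")
    ax.set_ylabel("worker index")
    return fig
```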

  9. Can we do better? More workers
  ⚫ Consider two toy examples
  • "single MET branch": read 1 MET branch from the DoubleMuon dataset
  • "dimuon invariant mass": the function from the previous slides, which reads 4 branches
  ⚫ Run a handful of times for different numbers of workers and plot the event rate (MHz)
  ⚫ The rate scales approximately linearly with the number of branches read
  • For dimuon/four branches, the rate saturates at 7 MHz with around 150 workers
  • cabinet/sdsc nodes have a 10 gbit connection; with 150 workers the peak read speed on the right is ~10 gbit, consistent with the saturation being network-limited (an empirical observation, still to be confirmed)
  ⚫ Not shown here, but the computation of the invariant mass from momentum components is not the bottleneck: each worker can compute invariant masses at ~10 MHz with numpy's vectorization

  10. Can we do better? Caching
  ⚫ Need to introduce scheduling of tasks in order to take advantage of caching
  • I.e., if events A-B of file C were processed on sdsc-48 and the branches were cached there, subsequent submissions should put that task on sdsc-48 again
  • If the function has a placeholder argument for a branch cache, subsequent runs of remote_map will use the same workers and this cache parameter will be filled automatically on the workers (sketched below)
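A sketch of what a task function with such a cache placeholder might look like; the parameter name and the cache key are assumptions, not the repo's exact interface.

```python
import numpy as np
import uproot

def histogram_met(args, cache=None):
    fname, entry_start, entry_stop = args
    key = (fname, entry_start, entry_stop, "MET_pt")
    if cache is not None and key in cache:
        met = cache[key]  # branch already cached in RAM on this worker
    else:
        tree = uproot.open(fname)["Events"]
        met = np.asarray(tree["MET_pt"].array(entry_start=entry_start, entry_stop=entry_stop))
        if cache is not None:
            cache[key] = met  # fill the cache for the next remote_map over the same chunks
    return np.histogram(met, bins=50, range=(0, 200))
```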

  11. Can we do better? Caching
  ⚫ If we have 150 workers and each gets allocated 15 GB of disk, that's ~2 TB of NanoAOD we could cache on disk instead of hadoop
  • Turns out this is slower than reading from hadoop, because if we have 8 workers running on a given node they compete to read from the same disk, introducing tail jobs
  ⚫ 6 GB of RAM per worker means ~1 TB of NanoAOD could be cached in RAM
  • Promising, especially if we operate on a few columns at a time
  ⚫ Running the dimuon example a second time, we go from ~7 MHz to 50-80 MHz because the branches have been cached
  [Plots: event rate with 500k events/task and 1M events/task]

  12. Possible next steps
  ⚫ Right now, we read through the hadoop fuse mount since I couldn't get xrootd wrappers installed with python3
  • Potentially another big speedup if we submit jobs to other sites
  ⚫ Investigate "columnarization" of NanoAOD (see the sketch below)
  • Each branch gets converted into a single standalone file which is a gzipped numpy array
  • Offers a 2-3x speedup over reading branches from ROOT files initially, but once branches are cached in RAM, the speed is the same
  • If files are smaller than the hadoop block size (64 MB), we can submit workers to the same nodes hosting the file. Rough tests give me a ~3x read-speed increase when running over a file on the same node relative to a different node
  ⚫ Intelligent worker selection
  • Some nodes are worse than others
  • Important when nodes are shared with other people
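A sketch of the columnarization idea for flat branches; the file layout and naming are illustrative assumptions (jagged branches would additionally need a per-event counts array stored alongside the content).

```python
import gzip
import os

import awkward as ak
import numpy as np
import uproot

def columnarize(fname, branches, outdir="columns"):
    # dump each branch to its own standalone gzipped numpy file
    os.makedirs(outdir, exist_ok=True)
    tree = uproot.open(fname)["Events"]
    for b in branches:
        with gzip.open(os.path.join(outdir, b + ".npy.gz"), "wb") as f:
            np.save(f, ak.to_numpy(tree[b].array()))

def read_column(b, indir="columns"):
    with gzip.open(os.path.join(indir, b + ".npy.gz"), "rb") as f:
        return np.load(f)

# e.g. columnarize("doublemu_lz4_reopt.root", ["MET_pt", "PV_npvs"])
```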

  13. Backup

  14. Task overhead
  ⚫ 150 workers, 10k tasks that each sleep a random amount from 0.2 to 0.4 seconds
  ⚫ Efficiency, based on the fraction of inter-task whitespace (ignoring the whitespace on the right side), is histogrammed over the 150 workers on the right
  • Mean efficiency is ~99%
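A sketch of the per-worker efficiency definition used here, assuming per-task (start, stop) intervals; the exact definition in the slides may handle the tail slightly differently.

```python
def worker_efficiency(intervals):
    """intervals: list of (t_start, t_stop) for one worker's tasks."""
    intervals = sorted(intervals)
    busy = sum(stop - start for start, stop in intervals)
    span = intervals[-1][1] - intervals[0][0]  # first start to last stop, so the idle tail is ignored
    return busy / span  # 1.0 means no inter-task whitespace
```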

  15. Node metrics
  ⚫ Separate pub/sub queue to poll metrics for the worker process or the whole node. E.g., for nodes as a whole, …
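A minimal sketch of such a metrics channel using redis pub/sub; the channel name, payload fields, and the use of psutil are assumptions.

```python
import json
import socket
import time

import psutil  # assumed to be available on the workers
import redis

r = redis.Redis(host="uaf-10.t2.ucsd.edu", port=6379)  # assumed broker host/port

def publish_node_metrics(channel="metrics"):
    # each worker (or one process per node) publishes a snapshot of node health
    payload = {
        "node": socket.gethostname(),
        "time": time.time(),
        "cpu_percent": psutil.cpu_percent(interval=None),
        "mem_percent": psutil.virtual_memory().percent,
    }
    r.publish(channel, json.dumps(payload))

def poll_metrics(channel="metrics"):
    # the user subscribes and prints whatever the nodes report
    pubsub = r.pubsub()
    pubsub.subscribe(channel)
    for msg in pubsub.listen():
        if msg["type"] == "message":
            print(json.loads(msg["data"]))
```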

  16. Vs dask
  ⚫ Also try dask-distributed vs mine
  ⚫ 100 workers, 8 GB array cache
  • Run dask 6 times and mine 6 times: 3 with a cold cache, 3 with a warm cache
  ⚫ With a warm cache, the overhead with dask makes it roughly ~2x slower than mine when ignoring tail jobs
  • Lots of variance with dask
  ⚫ Is this because of allow_other_workers/work stealing causing the cache to not be used?
  [Plots: mine vs dask, cold cache and warm cache]

  17. Compression (LZMA)
  ⚫ Study a 1.7M-event DoubleMuon NanoAOD file: a 1.4 GB file, compressed with LZMA (the default NanoAOD workflow)
  ⚫ Read ~20 branches on my laptop with an SSD
  ⚫ Everything done with as warm a cache as possible (run cells multiple times)
  ⚫ First, just open/initialize the file and TTree
  ⚫ The icicle plot shows time on the x-axis, with nested child function calls along the y-axis
  ⚫ ~1 s of overhead to open the file and get the tree
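The slides do not say which profiler produced the icicle plot; one way to reproduce this kind of measurement is cProfile plus snakeviz (an assumption about tooling, not necessarily what was actually used).

```python
import cProfile

import uproot

# profile just the open/initialize step for the LZMA file
cProfile.run('uproot.open("doublemu_lzma.root")["Events"]', "open_tree.prof")
# then view the icicle plot with:  snakeviz open_tree.prof
```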

  18. Compression (LZMA)
  ⚫ Now actually read those branches
  ⚫ Takes 18 s, and 80% of that is decompression

  19. Compression (LZ4)
  ⚫ Take the same file and convert it to LZ4 (hadd -O -f404 doublemu_lz4_reopt.root doublemu_lzma.root), which gives a 2.7 GB file
  ⚫ Opening the file and getting the tree takes 0.74 s, and 1% of that is decompression
  • Thus, the intrinsic overhead of uproot opening a file/tree is on the order of 1 s
  ‣ So 50 workers and 500 files of 1M events each will never surpass 50 MHz (500 files × ~1 s of open overhead spread over 50 workers is ~10 s spent just opening files, for 500M events)

  20. Compression (LZ4)
  ⚫ Now read the branches
  ⚫ Takes 2.7 s, and 11% of that is spent decompressing
  ⚫ So LZMA took 13 s to decompress vs 0.28 s for LZ4 (~40x faster to decompress), and the remaining walltime is interpretation overhead
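A rough way to reproduce this comparison, reading the same branches from the LZMA original and the LZ4 copy produced by the hadd command on the previous slide (the branch list here is abbreviated and assumed):

```python
import time

import uproot

branches = ["Muon_pt", "Muon_eta", "Muon_phi", "Muon_mass"]  # stand-in for the ~20 branches

for fname in ["doublemu_lzma.root", "doublemu_lz4_reopt.root"]:
    t0 = time.time()
    tree = uproot.open(fname)["Events"]
    t1 = time.time()
    arrays = tree.arrays(branches)
    t2 = time.time()
    print(f"{fname}: open {t1 - t0:.2f} s, read branches {t2 - t1:.2f} s")
```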

  21. Comparing compression algos
  ⚫ Convert the previous files with several algorithms
  • Compression enums are in https://github.com/root-project/root/blob/master/core/zip/inc/Compression.h
  • The default NanoAOD/LZMA setting is 208, and the recommended LZ4 setting is 404
  • 401-408 produce similar file sizes and decompression speeds, so proceed with 404
  • File sizes: [table in slide]
  ⚫ Time to read 1-12 branches, on the right
  ⚫ LZ4 and uncompressed are basically equivalent
  ⚫ LZ4 is ~10x faster than LZMA
