Interactive NanoAOD analysis
Nick Amin, Aug 19, 2019
⚫ Condor jobs have a lot of overhead
- Transferring and setting up environment for every job
- Scheduling/internal condor overhead
⚫ Condor jobs are not interactive
- Designed for monolithic background processing
- Job status through condor_q spam
⚫ Interactive login node use will not really be "interactive" anymore
- Several versions (e.g., years/periods) of ntuples → everyone is turning to parallel looping to make histograms on login nodes
Introduction
⚫ Goal: Enable faster interactive analysis with NanoAOD using a message
broker and HTCondor
- preliminary code in https://github.com/aminnj/redis-htcondor
⚫ Not a new concept! There are lots of tools designed to do this more
generally (dynamic task execution with brokered communication):
- htmap, parsl, dask, celery, ray, airflow, dramatiq, …
- Limitations on open ports/communication within/outside the batch system mean it’s difficult to use these out of the box
⚫ Wrote my own simple task queue system based on redis to allow low-level
customization to suit our use case (HTCondor, data locality, caching, compressed communication, hardware, hadoop, …).
- jobs are "embarrassingly parallel", so there's no real need for dynamic logic, inter-worker communication, DAGs, etc.
- this relies on exactly one redis master server (one single public-facing
ip/port).
Introduction
1. First, the user submits N condor jobs (each one is 1 "worker"), which listen to a FIFO tasks queue on the redis server
- There are actually N+1 task queues: 1 general ("tasks") and N specific ("tasks:worker1" … "tasks:workerN")
- Workers listen to the general task queue and also the one specific to themselves
- In this way, we can send targeted tasks to specific workers
2. User communicates workload/tasks to the redis server
3. Workers perform work and push output into the results queue
4. User checks the results queue and combines/reduces
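The queue layout in the steps above can be sketched without a real redis server. This is a minimal in-memory stand-in, not the code in the repository: `ToyBroker`, `worker_step`, and the choice to check the worker-specific queue before the general one are all assumptions made for illustration.

```python
from collections import deque


class ToyBroker:
    """In-memory stand-in for the redis server: one FIFO deque per named queue."""

    def __init__(self):
        self.queues = {}

    def push(self, name, item):
        # Append on the right; popping from the left gives FIFO order.
        self.queues.setdefault(name, deque()).append(item)

    def pop_first(self, names):
        # Return (queue name, item) from the first non-empty queue, else None.
        for name in names:
            queue = self.queues.get(name)
            if queue:
                return name, queue.popleft()
        return None


def worker_step(broker, worker_name):
    """Take one task from the worker's own queue or the general one and run it."""
    got = broker.pop_first(["tasks:" + worker_name, "tasks"])
    if got is None:
        return False
    _, (func, args) = got
    broker.push("results", (worker_name, func(*args)))
    return True


broker = ToyBroker()
broker.push("tasks", (pow, (2, 10)))                # any worker may take this
broker.push("tasks:worker2", (sum, ([1, 2, 3],)))   # targeted at worker2 only

for name in ["worker1", "worker2"]:
    while worker_step(broker, name):
        pass

results = list(broker.queues["results"])
```

With a real broker, the two `push`/`pop_first` calls would map onto list operations on the redis server, and each worker would run its loop inside its own condor job.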
Setup
[Diagram: users on the UAF submit condor jobs (worker1, worker2, … workerN), which read hadoop files. Users and workers exchange tasks and results through the redis server (broker), which holds the task queue and the results queue. task = (function, arguments); compress with lz4 / decompress.]
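A task is a (function, arguments) pair that is compressed before it travels through the broker. A minimal round-trip sketch follows, with two loudly stated assumptions: `zlib` from the standard library stands in for lz4 (which is a third-party package), and `pickle` is used for serialization, which only handles functions importable by name. The helper names `encode_task` and `decode_task` are hypothetical.

```python
import math
import pickle
import zlib


def encode_task(func, args):
    """Serialize a (function, arguments) pair and compress it (zlib as an lz4 stand-in)."""
    return zlib.compress(pickle.dumps((func, args)))


def decode_task(payload):
    """Inverse of encode_task: decompress, then unpickle."""
    return pickle.loads(zlib.decompress(payload))


# math.hypot is importable by name, so plain pickle can reference it.
payload = encode_task(math.hypot, (3.0, 4.0))
func, args = decode_task(payload)
result = func(*args)
```

Functions written interactively by the user are not importable on the worker, so a by-value serializer would be needed there; which one the actual code uses is not stated in these slides.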
⚫ Someone hosts a redis server (I have one on uaf-10 right now)
⚫ User first externally submits 30 jobs to condor, each of which is a worker
⚫ User writes a function and has the option to map it over a list of arguments locally or remotely
⚫ Very low overhead: 40ms to send tasks to 10 workers and get the results back
Simple example
⚫ Take 200M event subset of Run2 /DoubleMuon/
⚫ Make 448 chunks of up to 500k events each, over the 133 files
⚫ Make function that histograms dimuon invariant mass
- Simple selection reading 4 branches
⚫ remote_map runs in under 2 minutes: 2MHz with 30 workers
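One way the chunking step could look, as a sketch: split each file into event ranges of at most 500k events. The function name `make_chunks`, the file names, and the per-file event counts are made up for illustration; only the 500k chunk size, the 200M events, and the 2MHz rate come from the slide.

```python
def make_chunks(events_per_file, chunk_size=500_000):
    """Split every file into (filename, first_event, last_event_exclusive) ranges."""
    chunks = []
    for filename, nevents in events_per_file.items():
        for start in range(0, nevents, chunk_size):
            chunks.append((filename, start, min(start + chunk_size, nevents)))
    return chunks


# Hypothetical inputs: three files with different event counts.
events_per_file = {"file_a.root": 1_200_000, "file_b.root": 500_000, "file_c.root": 1}
chunks = make_chunks(events_per_file)

# Sanity check on the quoted rate: 200M events at 2MHz.
seconds = 200e6 / 2e6
```

Each chunk becomes the argument of one task, and 200M events at 2MHz is 100 seconds, consistent with "under 2 minutes".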
Dimuon example
⚫ Sum up the partial result histograms and plot the ~100M muon pairs
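The reduce step is a bin-by-bin sum of the per-task histograms. A plain-Python sketch, assuming every task fills a histogram with identical binning; `fill_histogram`, `sum_histograms`, and the toy mass values are illustrative only.

```python
def fill_histogram(values, nbins, lo, hi):
    """Count values into nbins equal-width bins on [lo, hi); out-of-range values are dropped."""
    counts = [0] * nbins
    width = (hi - lo) / nbins
    for v in values:
        if lo <= v < hi:
            # Clamp guards against floating-point rounding at the upper edge.
            counts[min(int((v - lo) / width), nbins - 1)] += 1
    return counts


def sum_histograms(partials):
    """Add partial histograms bin by bin (all must share the same binning)."""
    return [sum(bin_counts) for bin_counts in zip(*partials)]


# Two partial results, as if returned by two different tasks.
partials = [
    fill_histogram([1.0, 3.1, 91.2], nbins=4, lo=0.0, hi=100.0),
    fill_histogram([90.5, 9.4, 150.0], nbins=4, lo=0.0, hi=100.0),
]
total = sum_histograms(partials)
```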
Dimuon example
⚫ The results include metadata about start and
stop time, so we can plot blocks during which the 30 workers are working
- Tasks are distributed to workers on a first-come, first-served basis → no scheduling
- No white gaps → negligible communication overhead
- Some "tail" workers can cause inefficiency near the end
⚫ Sustained network read speed ~0.26GB/s while
processing
What did the workers do?
⚫ Consider two toy examples
- "single MET branch" — read 1 MET branch from DoubleMu dataset
- "dimuon invariant mass" — function from previous slides, which reads 4 branches
⚫ Run a handful of times for different number of workers and plot the event rate (MHz)
⚫ Rate scales approximately linearly with number of branches
- For dimuon/four branches, rate saturates at 7MHz with around 150 workers
- cabinet/sdsc nodes have a 10gbit connection. With 150 workers we see on the right that the peak read speed is ~10gbit (this is an empirical statement; is it true?)
⚫ Not shown here, but computation of invariant mass from momentum components is irrelevant here. Each worker can compute invariant mass at ~10MHz with numpy’s vectorization
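For reference, the invariant mass calculation itself is short. This is a scalar, plain-Python sketch rather than the vectorized numpy version used on the workers. The inputs (pt, eta, phi per muon plus a fixed muon mass) are an assumption, since the slides do not say which 4 branches are read.

```python
import math

MUON_MASS_GEV = 0.10566  # assumed fixed muon mass in GeV


def four_vector(pt, eta, phi, mass=MUON_MASS_GEV):
    """Return (E, px, py, pz) from transverse momentum, pseudorapidity and azimuth."""
    px = pt * math.cos(phi)
    py = pt * math.sin(phi)
    pz = pt * math.sinh(eta)
    energy = math.sqrt(px * px + py * py + pz * pz + mass * mass)
    return energy, px, py, pz


def invariant_mass(mu1, mu2):
    """Invariant mass of two particles given as (pt, eta, phi) tuples."""
    e1, px1, py1, pz1 = four_vector(*mu1)
    e2, px2, py2, pz2 = four_vector(*mu2)
    m2 = (e1 + e2) ** 2 - (px1 + px2) ** 2 - (py1 + py2) ** 2 - (pz1 + pz2) ** 2
    return math.sqrt(max(m2, 0.0))  # guard against tiny negative rounding errors


# Two back-to-back central muons with pt = 45.6 GeV each.
mass = invariant_mass((45.6, 0.0, 0.0), (45.6, 0.0, math.pi))
```

For this back-to-back pair the result is close to 91.2 GeV, i.e. near the Z peak.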
Can we do better? More workers
⚫ Need to introduce scheduling of tasks in order to take advantage of caching
- I.e., events A-B of file C were processed on sdsc-48 and branches were cached there, so subsequent submissions should put that task on sdsc-48 again
- Ensure that the function has a placeholder for a branch cache; subsequent runs of remote_map will use the same workers and this cache parameter will be filled automatically on the workers
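A toy version of this cache-aware scheduling, under stated assumptions: the user side remembers which worker last processed each chunk and targets that worker's queue on the next submission, and the task function takes a `cache` placeholder that the worker fills in. The names `count_reads`, `pick_queue`, `last_worker`, and the branch names are hypothetical and not taken from the repository.

```python
def count_reads(chunk, branches, cache=None):
    """Toy task: 'read' branches for a chunk and return how many reads missed the cache."""
    cache = {} if cache is None else cache
    n_reads = 0
    for branch in branches:
        key = (chunk, branch)
        if key not in cache:
            cache[key] = [0.0]  # stand-in for the expensive read from hadoop
            n_reads += 1
    return n_reads


def pick_queue(chunk, last_worker):
    """Target the worker that processed this chunk before, else use the general queue."""
    worker = last_worker.get(chunk)
    return "tasks" if worker is None else "tasks:" + worker


chunk = ("file_c.root", 0, 500_000)
branches = ["Muon_pt", "Muon_eta"]
last_worker = {}                    # kept on the user side between submissions
worker_caches = {"sdsc-48": {}}     # each worker keeps its own branch cache

# First submission: no history, so the task goes to the general queue.
first_queue = pick_queue(chunk, last_worker)
cold = count_reads(chunk, branches, cache=worker_caches["sdsc-48"])
last_worker[chunk] = "sdsc-48"

# Second submission: same chunk is sent to the same worker, which has it cached.
second_queue = pick_queue(chunk, last_worker)
warm = count_reads(chunk, branches, cache=worker_caches["sdsc-48"])
```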
Can we do better? Caching
⚫ If we have 150 workers and each gets allocated 15GB of disk, that’s ~2TB of
NanoAOD we could cache on disk instead of hadoop
- Turns out this is slower than reading from hadoop because if we have 8 workers
running in a given node they will compete to read from the same disk, introducing tail jobs
⚫ 6GB of RAM per worker means 1 TB of NanoAOD could be cached in RAM
- Promising, especially if we operate on columns at a time from the same disk
⚫ Running the dimuon example a second time, we go from ~7MHz to 50-80MHz
because branches have been cached
[Plots: 500k events/task and 1M events/task]
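The cache sizes quoted above are simple products of the per-worker allocations, using only numbers from the slide:

```python
workers = 150
disk_per_worker_gb = 15
ram_per_worker_gb = 6

# Total cache capacity across all workers, in TB (1 TB = 1000 GB).
disk_cache_tb = workers * disk_per_worker_gb / 1000
ram_cache_tb = workers * ram_per_worker_gb / 1000
```

That gives 2.25 TB of disk and 0.9 TB of RAM, which the slide rounds to ~2TB and 1 TB.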
⚫ Right now, we read through hadoop fuse mount since I couldn’t get xrootd
wrappers installed with python3
- Potentially another big speedup if we submit jobs to other sites
⚫ Investigate "columnarization" of NanoAOD
- Each branch gets converted into a single standalone file which is a
gzipped numpy array
- Offers 2-3x speedup over reading branches from ROOT files initially, but once branches are cached in RAM, speed is the same
- If files are smaller than hadoop block size (64MB), we can submit workers
to the same nodes hosting the file. Rough tests give me ~3x read speed increase when running over a file on the same node wrt a different node
⚫ Intelligent worker selection
- Some nodes are worse than others
- Important when nodes are shared with other people
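To make the "columnarization" idea above concrete, here is a sketch of one branch stored as one standalone gzipped file. To keep it free of third-party packages it uses the standard-library `array` module as a stand-in for a numpy array; the file name and helper names are hypothetical.

```python
import gzip
import os
import tempfile
from array import array


def write_column(path, values):
    """Store one branch as a gzipped flat buffer of float32 values."""
    with gzip.open(path, "wb") as f:
        f.write(array("f", values).tobytes())


def read_column(path):
    """Read the whole branch back in a single decompress-and-load step."""
    column = array("f")
    with gzip.open(path, "rb") as f:
        column.frombytes(f.read())
    return column


with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "Muon_pt.f32.gz")   # hypothetical file name
    write_column(path, [25.0, 31.5, 47.25])
    column = read_column(path).tolist()
```

Reading a column this way skips the file/tree opening step entirely, at the cost of always reading the whole branch.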
Possible next steps
Backup
⚫ 150 workers, 10k tasks that sleep a random amount from 0.2 to 0.4 seconds
⚫ Efficiency based on fraction of inter-task whitespace (ignoring whitespace on right side) is histogrammed over 150 workers on the right
- Mean eff is ~99%
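One plausible reading of this efficiency definition, as a sketch: for each worker, divide the time spent inside tasks by the span from its first task start to its last task stop, so that idle time after the last task is ignored. The function name and the toy intervals are illustrative, and tasks on one worker are assumed not to overlap.

```python
def worker_efficiency(intervals):
    """Busy fraction between the first start and the last stop of one worker's tasks."""
    intervals = sorted(intervals)
    busy = sum(stop - start for start, stop in intervals)
    span = intervals[-1][1] - intervals[0][0]
    return busy / span


# (start, stop) times in seconds for three tasks on one worker,
# with a 0.01 s gap between the first and second task.
tasks = [(0.0, 0.3), (0.31, 0.6), (0.6, 1.0)]
eff = worker_efficiency(tasks)
```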
Task overhead
⚫ Separate pub/sub queue to poll metrics for the worker
process or whole node. E.g., for nodes as a whole, …
Node metrics
⚫ Also try dask-distributed vs mine
⚫ 100 workers, 8GB array cache
- Run dask 6 times and mine 6 times: 3 with cold cache, 3 with warm cache
⚫ With warm cache, overhead with dask makes it roughly 2x slower than mine when ignoring tail jobs
- Lots of variance with dask
⚫ Is this because of allow_other_workers/work stealing causing cache to not be used?
Vs dask
[Plots: mine, cold cache; mine, warm cache; dask, warm cache]
⚫ Study a 1.7M event DoubleMuon nanoaod file: 1.4GB file, compressed with LZMA (default NanoAOD workflow)
⚫ Read ~20 branches on my laptop with SSD
⚫ Everything done with as warm cache as possible (run cells multiple times)
Compression (LZMA)
⚫ First, just open/initialize the file and TTree
⚫ Icicle plot shows time on x-axis, nested child function calls along y-axis
⚫ 1s overhead to open file and get tree
Compression (LZMA)
⚫ Now actually read those branches
⚫ Takes 18s and 80% of that is decompression
⚫ Take the same file and convert to LZ4 (hadd -O -f404 doublemu_lz4_reopt.root doublemu_lzma.root), which gives a 2.7GB file
⚫ Opening the file and getting the tree takes 0.74s and 1% of that is decompression
- Thus, intrinsic overhead with uproot opening file/tree is on the order of 1s
- So 50 workers and 500 files of 1M events each will never surpass 50MHz
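The 50MHz ceiling follows from the open overhead alone: if every file costs about 1s before any event is read, each worker can finish at most one 1M-event file per second. A small sketch of that bound (the function name is illustrative):

```python
def max_rate_hz(n_workers, events_per_file, open_overhead_s):
    """Upper bound on event rate when each file costs a fixed open overhead."""
    return n_workers * events_per_file / open_overhead_s


# 50 workers, 1M events per file, ~1 s to open a file and get the tree.
bound_mhz = max_rate_hz(50, 1_000_000, 1.0) / 1e6
```

The number of files (500) does not enter the bound, as long as there are at least as many files as workers.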
Compression (LZ4)
⚫ Reading branches
⚫ Takes 2.7s and 11% of that is spent decompressing
⚫ So, LZMA took 13s to decompress vs LZ4 with 0.28s: 40x faster to decompress, and the remaining walltime is interpretation overhead
⚫ Convert previous files with several algorithms
- Compression enums from https://github.com/root-project/root/blob/master/core/zip/inc/Compression.h
- Default NanoAOD/LZMA is 208, and recommended LZ4 is 404
- 401-408 produce similar filesizes and decompression speeds, so proceed with 404
- Filesizes:
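The three-digit settings in this list pack two numbers together: in ROOT's convention the setting is 100 × algorithm + level. A small decoder, assuming the algorithm codes as I read them from Compression.h (1 = ZLIB, 2 = LZMA, 4 = LZ4, 5 = ZSTD); the function name is illustrative.

```python
# Algorithm codes as read from ROOT's Compression.h (assumption stated above).
ALGORITHMS = {1: "ZLIB", 2: "LZMA", 4: "LZ4", 5: "ZSTD"}


def decode_setting(setting):
    """Split a ROOT compression setting into (algorithm name, level)."""
    algorithm, level = divmod(setting, 100)
    return ALGORITHMS.get(algorithm, "unknown"), level


nanoaod_default = decode_setting(208)
recommended_lz4 = decode_setting(404)
```

So 208 is LZMA at level 8, 404 is LZ4 at level 4, and the 401-408 range scans the LZ4 levels.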
Comparing compression algos
⚫ Time to read 1-12 branches, on right
⚫ LZ4 and uncompressed are basically equivalent
⚫ LZ4 ~10x faster than LZMA
⚫ Next, try to move away from ROOT and use hdf5 files
⚫ awkward array package (comes with uproot) allows us to serialize whole branches at a time into hdf5 files with arbitrary compression algorithms
⚫ We can also test blosc, which has shuffle/bitshuffle filters that can get an additional 10-15% compression over nanoAOD if used properly
⚫ Record time and space for compressing a handful of branches
⚫ LZMA is the clear winner for compression ratio, but blosc_lz4hc is a good choice with very fast decompression time
- Need further studies on bitshuffle. While it seems to get ~20% extra compression ratio for ~no loss in decompression time, it messes up values at the end of arrays. Probably need to tune the type size/number of bits, otherwise it might be shuffling bits across floating point values.
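To illustrate why the type size matters for these filters, here is a byte-level shuffle written by hand. It is the simpler cousin of bitshuffle and is not blosc's implementation: `zlib` stands in for the blosc codecs, and the array contents are made up. The filter groups byte 0 of every element, then byte 1, and so on, which only lines up with the float boundaries when `typesize` matches the element size.

```python
import struct
import zlib


def shuffle(data, typesize):
    """Regroup bytes so that byte i of every element is stored contiguously."""
    if len(data) % typesize:
        raise ValueError("length must be a multiple of typesize")
    return b"".join(data[i::typesize] for i in range(typesize))


def unshuffle(data, typesize):
    """Inverse of shuffle for the same typesize."""
    n = len(data) // typesize
    out = bytearray(len(data))
    for i in range(typesize):
        out[i::typesize] = data[i * n:(i + 1) * n]
    return bytes(out)


# 10k slowly varying float32 values, packed little-endian (4 bytes each).
values = [20.0 + 0.001 * i for i in range(10_000)]
raw = struct.pack("<%df" % len(values), *values)

shuffled = shuffle(raw, typesize=4)
plain_size = len(zlib.compress(raw))
shuffled_size = len(zlib.compress(shuffled))   # compare with plain_size
restored = unshuffle(shuffled, typesize=4)
```

This sketch is lossless by construction and does not reproduce the end-of-array problem described above; it only shows where the type size enters.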
Comparing compression algos
[Plots: compression algorithms sorted by compression ratio, and sorted by decompression time]
⚫ Why move away from ROOT and use hdf5 files?
- The overhead with opening a file and getting a tree is eliminated, and reading is even faster because blosc allows the use of
multiple threads when decompressing
⚫ Convert the 1.7M event DoubleMuon file from before into a ".awkd" file (blosc lz4hc and .h5 under the hood) and perform the same exercise of reading ~20 branches
- LZMA was 1.4GB and this .awkd file is 1.5GB
- Opening the file AND reading takes 0.7s with 4 threads (by default), or 0.9s with 1 thread.
⚫ Downside:
- Can’t read subset of events out of the box (i.e., read only events x through y), so a job might "overread" if we only want a subset of events. However, if each job wants all the events in a file (1-2M events), this is not an issue.
Comparing compression algos
⚫ 1.7M event DoubleMuon NanoAOD file
⚫ open+read = initialize file/tree and read 20 branches
- warm cache + SSD on laptop
Compression study summary
- LZMA .root: 1.4GB, 20s to open+read
- LZ4 .root: 2.7GB, 4s to open+read
- blosc LZ4HC .awkd: 1.5GB, 1s to open+read