SLIDE 1

Interactive NanoAOD analysis

Nick Amin Aug 19, 2019

SLIDE 2

⚫ Condor jobs have a lot of overhead

  • Transferring and setting up environment for every job
  • Scheduling/internal condor overhead

⚫ Condor jobs are not interactive

  • Designed for monolithic background processing
  • Job status through condor_q spam

⚫ Interactive login node use will not really be "interactive" anymore
  • Several versions (e.g., years/periods) of ntuples → everyone turning to parallel looping to make histograms on login nodes

Introduction


SLIDE 3

⚫ Goal: Enable faster interactive analysis with NanoAOD using a message broker and HTCondor

  • preliminary code in https://github.com/aminnj/redis-htcondor

⚫ Not a new concept! There are lots of tools designed to do this more generally (dynamic task execution with brokered communication):
  • htmap, parsl, dask, celery, ray, airflow, dramatiq, …
  • Limitations on open ports/communication within/outside the batch system mean it's difficult to use these out of the box

⚫ Wrote my own simple task queue system based on redis to allow low-level customization to suit our use case (HTCondor, data locality, caching, compressed communication, hardware, hadoop, …)
  • Jobs are "embarrassingly parallel", so there's no real need for dynamic logic, inter-worker communication, DAGs, etc.
  • This relies on exactly one redis master server (one single public-facing ip/port)

Introduction


SLIDE 4

1. First, user submits N condor jobs (each one is 1 "worker") which listen to a FIFO task queue of the redis server
  • There's actually N+1 task queues, 1 general ("tasks") and N specific ("tasks:worker1" … "tasks:workerN")
  • Workers listen to the general task queue and also the one specific to themselves
  • In this way, we can send targeted tasks to specific workers
2. User communicates workload/tasks to the redis server
3. Workers perform work and push output into the results queue
4. User checks the results queue and combines/reduces

Setup

[Diagram: the user on the UAF submits condor jobs (worker1 … workerN) and sends tasks to the redis server (broker); workers pull from the task queue, read hadoop files, and push output to the results queue. A task = (function, arguments), compressed with lz4 on send and decompressed on receipt.]
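A minimal sketch of the worker side of this scheme, using an in-memory stand-in for the redis broker (the class and function names here are illustrative, not the actual redis-htcondor code; real workers would use redis LPUSH/BRPOP and lz4-compress the payloads):

```python
import pickle
from collections import deque

class FakeBroker:
    """In-memory stand-in for the redis broker (illustration only; the
    real system uses redis lists via LPUSH/BRPOP)."""
    def __init__(self):
        self.queues = {}

    def lpush(self, name, item):
        self.queues.setdefault(name, deque()).appendleft(item)

    def brpop(self, names):
        # like redis BRPOP, check the listed queues in order (but don't block)
        for name in names:
            q = self.queues.get(name)
            if q:
                return name, q.pop()
        return None

def worker_loop(broker, worker_name, max_tasks):
    """Pop from the worker-specific queue first, then the general one."""
    for _ in range(max_tasks):
        popped = broker.brpop(["tasks:" + worker_name, "tasks"])
        if popped is None:
            break
        _, payload = popped
        func, args = pickle.loads(payload)  # task = (function, arguments)
        broker.lpush("results", pickle.dumps(func(*args)))

broker = FakeBroker()
broker.lpush("tasks", pickle.dumps((pow, (2, 10))))        # general task
broker.lpush("tasks:worker1", pickle.dumps((abs, (-5,))))  # targeted task
worker_loop(broker, "worker1", max_tasks=2)
results = [pickle.loads(broker.queues["results"].pop()) for _ in range(2)]
```

Because the worker-specific queue is checked before the general one, targeted tasks are always picked up first, which is what makes per-worker task routing possible.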

SLIDE 5

⚫ Someone hosts a redis server (I have one on uaf-10 right now)
⚫ User first externally submits 30 jobs to condor, each of which is a worker
⚫ User writes a function and has the option to map it over a list of arguments locally or remotely
⚫ Very low overhead — 40ms to send tasks to 10 workers and get the results back

Simple example


SLIDE 6

⚫ Take 200M event subset of Run 2 /DoubleMuon/
⚫ Make 448 chunks of up to 500k events each, over the 133 files
⚫ Make function that histograms dimuon invariant mass
  • Simple selection reading 4 branches
⚫ remote_map runs in under 2 minutes — 2MHz with 30 workers
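The chunking step can be sketched like this (a hypothetical helper; the real code would derive the per-file event counts from the 133 files themselves):

```python
def make_chunks(file_nevents, chunk_size=500_000):
    """Split (filename, nevents) pairs into (filename, start, stop)
    event ranges of up to chunk_size events each."""
    chunks = []
    for fname, nevents in file_nevents:
        for start in range(0, nevents, chunk_size):
            chunks.append((fname, start, min(start + chunk_size, nevents)))
    return chunks

# e.g. a 1.2M-event file becomes three chunks: two full, one partial
chunks = make_chunks([("doublemu_1.root", 1_200_000)])
```

Each chunk is then one task, so the broker can hand them out to workers independently.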

Dimuon example


SLIDE 7

⚫ Sum up the partial result histograms and plot the ~100M muon pairs

Dimuon example


SLIDE 8

⚫ The results include metadata about start and stop times, so we can plot blocks during which the 30 workers are working
  • Tasks are distributed to workers on a first-come first-serve basis → no scheduling
  • No white gaps → negligible communication overhead
  • Some "tail" workers can cause inefficiency near the end
⚫ Sustained network read speed ~0.26GB/s while processing

What did the workers do?


SLIDE 9

⚫ Consider two toy examples

  • "single MET branch" — read 1 MET branch from DoubleMu dataset
  • "dimuon invariant mass" — function from previous slides, which reads 4 branches

⚫ Run a handful of times for different numbers of workers and plot the event rate (MHz)
⚫ Rate scales approximately linearly with number of branches
  • For dimuon/four branches, rate saturates at 7MHz with around 150 workers
  • cabinet/sdsc nodes have a 10gbit connection. With 150 workers we see on the right that the peak read speed is ~10gbit — this is an empirical statement, is it true?
⚫ Not shown here, but computation of invariant mass from momentum components is irrelevant here. Each worker can compute invariant mass at ~10MHz with numpy's vectorization
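The per-worker invariant-mass computation is plain vectorized numpy; a sketch in the massless-muon approximation (the formula is a standard assumption here, not taken from the slides):

```python
import numpy as np

def dimuon_mass(pt1, eta1, phi1, pt2, eta2, phi2):
    # massless approximation: m^2 = 2*pt1*pt2*(cosh(eta1-eta2) - cos(phi1-phi2))
    return np.sqrt(2.0 * pt1 * pt2 * (np.cosh(eta1 - eta2) - np.cos(phi1 - phi2)))

# two 50 GeV muons back-to-back in phi at eta=0 give m = 100 GeV
m = dimuon_mass(np.array([50.0]), np.array([0.0]), np.array([0.0]),
                np.array([50.0]), np.array([0.0]), np.array([np.pi]))
```

Since every operation is array-at-a-time, the arithmetic itself is far from the bottleneck compared to I/O and decompression.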

Can we do better? More workers


SLIDE 10

⚫ Need to introduce scheduling of tasks in order to take advantage of caching
  • I.e., events A-B of file C were processed on sdsc-48 and branches were cached there, so subsequent submissions put that task on sdsc-48 again
  • Ensure that the function has a placeholder for a branch cache; subsequent runs of remote_map will use the same workers and this cache parameter will be filled automatically on the workers

Can we do better? Caching
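The branch-cache placeholder idea can be sketched as follows (all names here are hypothetical; in the real system the cache argument is filled automatically on the workers by remote_map, and the read is an actual uproot branch read):

```python
calls = []  # record physical reads so we can see the cache working

def fake_read_branch(fname, branch, start, stop):
    # stand-in for an actual uproot branch read
    calls.append((fname, branch, start, stop))
    return list(range(start, stop))

def process_chunk(fname, start, stop, cache=None):
    """Task function with a cache placeholder, keyed by (file, branch, range)."""
    key = (fname, "Muon_pt", start, stop)
    if cache is not None and key in cache:
        return cache[key]
    arr = fake_read_branch(*key)
    if cache is not None:
        cache[key] = arr
    return arr

cache = {}  # persists across tasks on a given worker
a = process_chunk("file.root", 0, 5, cache=cache)
b = process_chunk("file.root", 0, 5, cache=cache)  # second run: no re-read
```

This only pays off if the same (file, range) tasks keep landing on the same workers, which is why scheduling has to be added on top of the first-come first-serve queue.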

SLIDE 11


Can we do better? Caching

⚫ If we have 150 workers and each gets allocated 15GB of disk, that's ~2TB of NanoAOD we could cache on disk instead of hadoop
  • Turns out this is slower than reading from hadoop because if we have 8 workers running on a given node they will compete to read from the same disk, introducing tail jobs
⚫ 6GB of RAM per worker means 1TB of NanoAOD could be cached in RAM
  • Promising, especially if we operate on columns at a time from the same disk
⚫ Running the dimuon example a second time, we go from ~7MHz to 50-80MHz because branches have been cached

[Plots: 500k events/task and 1M events/task]

SLIDE 12

⚫ Right now, we read through the hadoop fuse mount since I couldn't get xrootd wrappers installed with python3
  • Potentially another big speedup if we submit jobs to other sites
⚫ Investigate "columnarization" of NanoAOD
  • Each branch gets converted into a single standalone file which is a gzipped numpy array
  • Offers 2-3x speedup over reading branches from ROOT files initially, but once branches are cached in RAM, speed is the same
  • If files are smaller than the hadoop block size (64MB), we can submit workers to the same nodes hosting the file. Rough tests give me ~3x read speed increase when running over a file on the same node wrt a different node

⚫ Intelligent worker selection

  • Some nodes are worse than others
  • Important when nodes are shared with other people

Possible next steps

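The "columnarization" format described above (one branch per gzipped numpy file) can be sketched with the stdlib plus numpy (function names are illustrative):

```python
import gzip
import io
import os
import tempfile

import numpy as np

def write_column(arr, path):
    # one branch -> one standalone file: a gzipped serialized numpy array
    buf = io.BytesIO()
    np.save(buf, arr)
    with gzip.open(path, "wb") as f:
        f.write(buf.getvalue())

def read_column(path):
    with gzip.open(path, "rb") as f:
        return np.load(io.BytesIO(f.read()))

# round-trip a fake branch
pt = np.arange(10, dtype=np.float32)
path = os.path.join(tempfile.mkdtemp(), "Muon_pt.npy.gz")
write_column(pt, path)
back = read_column(path)
```

Keeping each such file under the 64MB hadoop block size is what makes the same-node locality trick possible.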

SLIDE 13

Backup


SLIDE 14

⚫ 150 workers, 10k tasks that sleep a random amount from 0.2 to 0.4 seconds
⚫ Efficiency based on fraction of inter-task whitespace (ignoring whitespace on the right side) is histogrammed over the 150 workers on the right
  • Mean eff is ~99%
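This efficiency definition can be computed per worker from the (start, stop) task intervals recorded in the result metadata (a sketch; the interval bookkeeping is assumed):

```python
def worker_efficiency(intervals):
    """Busy fraction of one worker's timeline: total task time divided by
    the span from its first start to its last stop, so trailing idle time
    ("whitespace on the right side") is ignored."""
    busy = sum(stop - start for start, stop in intervals)
    span = max(stop for _, stop in intervals) - min(start for start, _ in intervals)
    return busy / span

# back-to-back tasks -> fully efficient; a 1s gap between 1s tasks -> 2/3
eff_tight = worker_efficiency([(0.0, 1.0), (1.0, 2.0)])
eff_gap = worker_efficiency([(0.0, 1.0), (2.0, 3.0)])
```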

Task overhead


SLIDE 15

⚫ Separate pub/sub queue to poll metrics for the worker process or whole node. E.g., for nodes as a whole, …

Node metrics


SLIDE 16

⚫ Also try dask-distributed vs mine
⚫ 100 workers, 8GB array cache
  • Run dask 6 times and mine 6 times: 3 with cold cache, 3 with warm cache
⚫ With warm cache, overhead with dask makes it roughly ~2x slower than mine when ignoring tail jobs
  • Lots of variance with dask
⚫ Is this because of allow_other_workers/work stealing causing the cache to not be used?

Vs dask

[Plot legend: mine cold cache, mine warm cache, dask warm cache]

SLIDE 17

⚫ Study a 1.7M event DoubleMuon NanoAOD file — 1.4GB file, compressed with LZMA (default NanoAOD workflow)
⚫ Read ~20 branches on my laptop with SSD
⚫ Everything done with as warm a cache as possible (run cells multiple times)

Compression (LZMA)


⚫ First, just open/initialize the file and TTree
⚫ Icicle plot shows time on x-axis, nested child function calls along y-axis
⚫ 1s overhead to open file and get tree

SLIDE 18

Compression (LZMA)


⚫ Now actually read those branches
⚫ Takes 18s and 80% of that is decompression

SLIDE 19

⚫ Take the same file and convert to LZ4 (hadd -O -f404 doublemu_lz4_reopt.root doublemu_lzma.root) — gives a 2.7GB file
⚫ Opening the file and getting the tree takes 0.74s and 1% of that is decompression
  • Thus, intrinsic overhead with uproot opening file/tree is on the order of 1s
  • So 50 workers and 500 files of 1M events each will never surpass 50MHz

Compression (LZ4)


SLIDE 20

Compression (LZ4)


⚫ Reading branches
⚫ Takes 2.7s and 11% of that is spent decompressing
⚫ So, LZMA took 13s to decompress vs LZ4 with 0.28s — ~40x faster to decompress, and the remaining walltime is interpretation overhead

SLIDE 21

⚫ Convert previous files with several algorithms
  • Compression enums from https://github.com/root-project/root/blob/master/core/zip/inc/Compression.h
  • Default NanoAOD/LZMA is 208, and recommended LZ4 is 404
  • 401-408 produce similar filesizes and decompression speeds, so proceed with 404
  • Filesizes:

Comparing compression algos


⚫ Time to read 1-12 branches, on right
⚫ LZ4 and uncompressed are basically equivalent
⚫ LZ4 ~10x faster than LZMA
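The qualitative tradeoff (LZMA compresses harder, LZ4-class codecs decompress much faster) can be reproduced with stdlib codecs; lzma vs zlib are used here as stand-ins, since lz4 is not in the Python standard library:

```python
import lzma
import zlib

# highly compressible stand-in for a branch payload: 1 MiB of repeating bytes
data = bytes(range(256)) * 4096

lzma_blob = lzma.compress(data, preset=6)
zlib_blob = zlib.compress(data, level=6)

# both round-trip losslessly
assert lzma.decompress(lzma_blob) == data
assert zlib.decompress(zlib_blob) == data

# compression ratios (original size / compressed size)
ratio_lzma = len(data) / len(lzma_blob)
ratio_zlib = len(data) / len(zlib_blob)
```

Real NanoAOD branch payloads are far less regular than this toy buffer, which is why the slide's measured filesizes and decompression times are the numbers that matter.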

SLIDE 22

⚫ Next, try to move away from ROOT and use hdf5 files
⚫ awkward array package (comes with uproot) allows us to serialize whole branches at a time into hdf5 files with arbitrary compression algorithms
⚫ We can also test blosc, which has shuffle/bitshuffle filters that can get an additional 10-15% extra compression over NanoAOD if used properly
⚫ Record time and space for compressing a handful of branches
⚫ LZMA is the clear winner for compression ratio, but blosc_lz4hc is a good choice with very fast decompression time
  • Need further studies on bitshuffle. While it seems to get ~20% extra compression ratio for ~no loss in decompression time, it messes up values at the end of arrays. Probably need to tune the type size/number of bits, otherwise it might be shuffling bits across floating point values.

Comparing compression algos

[Tables: sorted by compression ratio, sorted by decompression time]

SLIDE 23

⚫ Why move away from ROOT and use hdf5 files?
  • The overhead with opening a file and getting a tree is eliminated, and reading is even faster because blosc allows the use of multiple threads when decompressing
⚫ Convert the 1.7M event DoubleMuon file from before into a ".awkd" file (blosc lz4hc and .h5 under the hood) and perform the same exercise of reading ~20 branches
  • LZMA was 1.4GB and this .awkd file is 1.5GB
  • Opening the file AND reading takes 0.7s with 4 threads (by default), or 0.9s with 1 thread
⚫ Downside:
  • Can't read a subset of events out of the box (i.e., read only events x through y), so a job might "overread" if we only want a subset of events. However, if each job wants all the events in a file (1-2M events), this is not an issue.

Comparing compression algos


SLIDE 24

⚫ 1.7M event DoubleMuon NanoAOD file
⚫ open+read = initialize file/tree and read 20 branches
  • warm cache + SSD on laptop

Compression study summary

  • LZMA .root: 1.4GB, 20s to open+read
  • LZ4 .root: 2.7GB, 4s to open+read
  • blosc LZ4HC .awkd: 1.5GB, 1s to open+read