What to Learn From MicroBooNE DAQ?
Wesley Ketchum with input of lots of MicroBooNE people
30 October 2017
First things first
• MicroBooNE Detector Paper: JINST 12, P02017 (2017)
  • https://arxiv.org/abs/1612.05824
  • (Basically) everything in this talk that is not my opinion comes from there
• MicroBooNE continues running well
  • Starting third year of data-taking
  • >95% of POT delivered is recorded to tape
  • That’s integrated, so 5% loss not (all) due to DAQ (typical uptime >97%)
36 PMT channels
• PMT readout
  • Beam discriminator: unbiased readout for 23.4 µs around trigger
  • Cosmic discriminator: threshold requirement, readout for 625 ns
• Clock
  • Common “frame number” (1.6 ms counter) from start of run
  • Pulse-per-second (PPS) signal from GPS latches the time, allowing a lookup map to real time
  • Used for matching auxiliary data (like beam and cosmic ray tagger)
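The frame-number-to-real-time lookup can be sketched as follows. This is a minimal illustration, not the MicroBooNE implementation: the function name, the `(frame, unix_second)` latch table layout, and the simplification that each PPS latch falls exactly on a frame boundary are all assumptions. Only the 1.6 ms frame length comes from the slide.

```python
from bisect import bisect_right

FRAME_LENGTH_S = 1.6e-3  # frame length quoted on the slide


def frame_to_unix_time(frame, latches):
    """Convert a frame number to an approximate Unix time.

    `latches` is a hypothetical lookup table: a list of
    (frame_number_at_pps, unix_second) pairs sorted by frame number,
    one entry per GPS pulse-per-second latch.
    """
    frames = [f for f, _ in latches]
    # Index of the most recent latch at or before the requested frame.
    i = bisect_right(frames, frame) - 1
    if i < 0:
        raise ValueError("frame precedes the first PPS latch")
    latched_frame, unix_second = latches[i]
    # Extrapolate forward from the latch using the fixed frame length.
    return unix_second + (frame - latched_frame) * FRAME_LENGTH_S
```

Auxiliary data that carry their own GPS time (beam, cosmic ray tagger) could then be matched by comparing against the converted time within some tolerance.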
• “Triggered” (NU):
  • TPC: lossless Huffman compression
  • PMT: no compression applied, readout of 4 frames (no trimming)
  • Data ~150 MB per event before compression, ~35 MB after compression
• “Continuous” (SN):
  • TPC: lossy zero-suppression, read out frame-by-frame (15 MB/s per crate)
  • PMT just reads out (~7 MB/s)
  • Preference to triggered stream data
• Additional data
  • Cosmic tagger panels added around MicroBooNE detector, being (or will be) combined in
  • Read out continuously, matched to TPC data on timestamp
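For readers unfamiliar with zero-suppression, the idea behind the continuous-stream TPC readout can be sketched as below. This is a generic threshold-plus-padding scheme, not the firmware algorithm used in MicroBooNE; the function name, the `pad` parameter, and the fixed `baseline` argument are illustrative assumptions.

```python
def zero_suppress(waveform, baseline, threshold, pad=2):
    """Keep only samples near excursions from the baseline.

    Returns a list of (start_index, samples) regions. Samples farther
    than `pad` ticks from any above-threshold sample are dropped,
    which is what makes the scheme lossy.
    """
    keep = [False] * len(waveform)
    for i, adc in enumerate(waveform):
        if abs(adc - baseline) > threshold:
            lo = max(0, i - pad)
            hi = min(len(waveform), i + pad + 1)
            for j in range(lo, hi):
                keep[j] = True

    # Collapse the keep-mask into contiguous regions of interest.
    regions = []
    start = None
    for i, kept in enumerate(keep):
        if kept and start is None:
            start = i
        elif not kept and start is not None:
            regions.append((start, waveform[start:i]))
            start = None
    if start is not None:
        regions.append((start, waveform[start:]))
    return regions
```

The output rate of such a scheme depends directly on how often noise crosses the threshold, so it is sensitive to the real noise level of the detector.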
Operating point
• First/foremost: it works for the needs of the experiment
  • And works pretty darn well
• Largest struggles in dealing with real data
  • PMT rate higher than expected → modifications of thresholds/buffers
    • Likely leading cause of DAQ crashes is FIFO overflow on PMT readout
  • TPC noise higher/generally not as expected
    • Huffman compression factor x5 instead of hoped-for x10
    • More complications on continuous readout mode rates
  • Continuous stream competition for resources
    • Despite dedicated readout stream, still some shared resources (data transmission on crate, go to same server)
• Lacked parasitic data-taking modes for testing DAQ components
  • Hardware-based PMT trigger and continuous stream not online at start of beam → difficulty in commissioning without losing data
  • Also, you really really need to use real data for commissioning
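Why higher noise lowers the Huffman compression factor can be illustrated with a toy model. The sketch below builds a textbook Huffman code over sample-to-sample ADC differences of a simulated flat baseline with Gaussian noise and compares the coded size to the raw size. It is not the MicroBooNE encoding (which is described in the detector paper); the 16-bit raw word size, the pedestal of 2048, the noise widths, and all function names are assumptions for illustration, and the cost of storing the code table is ignored.

```python
import heapq
import random
from collections import Counter


def huffman_code_lengths(symbols):
    """Return {symbol: code length in bits} for a Huffman code."""
    counts = Counter(symbols)
    if len(counts) == 1:
        return {next(iter(counts)): 1}
    # Heap entries: (weight, unique tie-breaker, {symbol: depth so far}).
    heap = [(n, i, {s: 0}) for i, (s, n) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        n1, _, d1 = heapq.heappop(heap)
        n2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every contained symbol one level deeper.
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (n1 + n2, tie, merged))
        tie += 1
    return heap[0][2]


def compression_factor(waveform, raw_bits=16):
    """Raw size divided by Huffman-coded size of the sample differences."""
    deltas = [b - a for a, b in zip(waveform, waveform[1:])]
    lengths = huffman_code_lengths(deltas)
    coded_bits = sum(lengths[d] for d in deltas)
    return raw_bits * len(deltas) / coded_bits


def noisy_baseline(sigma, n=5000, seed=1):
    """Simulated flat pedestal with Gaussian noise of width `sigma` ADC counts."""
    rng = random.Random(seed)
    return [round(2048 + rng.gauss(0, sigma)) for _ in range(n)]
```

Comparing `compression_factor(noisy_baseline(1.0))` with `compression_factor(noisy_baseline(4.0))` shows the trend: wider noise spreads the differences over more distinct values, the codes get longer, and the achievable factor drops.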
• MicroBooNE doesn’t use artdaq, but shares the …
  • I’ll translate to artdaq names
• BoardReaders
  • Receive data from hardware
  • Move to large circular buffer
  • Process, identify data belonging to single event, move to outbound queue
  • Send to EventBuilder
• EventBuilder+Aggregator (one multi-threaded process for us)
  • Collect fragments
  • When event complete, transfer fragments to raw event queue
  • Process raw events, apply software trigger, write to disk
  • 50 events per file, no filtering into separate files
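The fragment-collection step can be sketched in a few lines. This is a single-threaded toy, not the MicroBooNE or artdaq code: the class name, method names, and in-memory "files" are hypothetical, and the software trigger step is left out. The 50-events-per-file default is the only number taken from the slide.

```python
from collections import defaultdict


class EventBuilder:
    """Collect fragments per event; emit an event only when all sources report."""

    def __init__(self, expected_sources, events_per_file=50):
        self.expected = frozenset(expected_sources)
        self.events_per_file = events_per_file
        self.pending = defaultdict(dict)  # event number -> {source: payload}
        self.current_file = []            # events in the file being written
        self.closed_files = []            # completed files (lists of events)

    def add_fragment(self, event_number, source, payload):
        if source not in self.expected:
            raise ValueError(f"unexpected fragment source: {source!r}")
        fragments = self.pending[event_number]
        fragments[source] = payload
        # An event is complete only once every expected source has reported.
        if set(fragments) == self.expected:
            self._write(event_number, self.pending.pop(event_number))

    def _write(self, event_number, fragments):
        self.current_file.append((event_number, fragments))
        if len(self.current_file) == self.events_per_file:
            self.closed_files.append(self.current_file)
            self.current_file = []
```

Note that an event with a missing fragment stays in `pending` forever in this sketch, which mirrors the "everything needs to work, or we get nothing" strategy: there is no partial-event path.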
• High-level software trigger to reduce rate
  • Low-level trigger from neutrino beam gates
  • High-level trigger looks for coincident PMT signals above threshold
  • Accepts prescaled unbiased data
  • <~10 ms per event total algorithm time
  • ~factor 20 reduction in data rate
• Trigger applied after event-building
  • Limits low-level trigger rate to network bandwidth (20 Hz)/readout crate stability
  • Better to have PMT info at low-level trigger…
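A coincidence-plus-prescale decision of the kind described above might look like the following. This is an illustrative stand-in rather than the actual MicroBooNE trigger algorithm: the per-tick counting logic, the function names, and the convention that every `prescale`-th event is kept as unbiased are assumptions.

```python
def coincidence_trigger(pmt_waveforms, threshold, min_pmts):
    """True if at least `min_pmts` PMTs are above threshold in the same tick."""
    n_ticks = min(len(w) for w in pmt_waveforms)
    for t in range(n_ticks):
        n_above = sum(1 for w in pmt_waveforms if w[t] > threshold)
        if n_above >= min_pmts:
            return True
    return False


def software_trigger(event_index, pmt_waveforms, threshold, min_pmts, prescale):
    """Return the reason an event is kept, or None if it is rejected."""
    # Prescaled unbiased sample: keep every `prescale`-th event regardless
    # of PMT activity, so trigger efficiency can be measured offline.
    if prescale > 0 and event_index % prescale == 0:
        return "unbiased"
    if coincidence_trigger(pmt_waveforms, threshold, min_pmts):
        return "pmt"
    return None
```

Because this decision runs only after the full event has been built, every low-level trigger still costs the full readout and network transfer, which is the limitation noted in the last two bullets.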
[Figure: data-flow diagram with labels PMT Readout, TPC Readout, Event Builder, Software Trigger, Data Logger, Pass, Fail]
• General strategy: everything needs to work, or we get nothing
  • We rely on …
    • Well-formatted data (well, with hard-coded exceptions)
    • In-sync fragments, all fragments report
  • Pros: simpler (no partial events to handle/monitor, everything in shared state); when it works you trust it
  • Cons: one piece goes down, you have nothing; special modes really a bit special
  • This has generally worked well for MicroBooNE
    • Things much more often than not work! But it’s a simple system
• Data format: binary data
  • Needs conversion to offline format, which didn’t really happen until later in commissioning → hectic moments in early commissioning to understand data
• Run control
  • Simple console-based Python/shell scripts in VNC
  • Highly automated
    • Automatic re-launching of runs, no selection of components, etc.: pick configuration, run length, and go
  • Music to wake shifter in case of major errors
• Monitoring
  • Custom metrics reported to real-time database with Ganglia
    • Some reported to SlowMonitoring / central alarm area
    • The ones that aren’t are “expert” level
  • Online data processing to monitor basic PMT and TPC waveforms/activity rates
    • Runs off of spying data in shared memory, processes binary data
• Logging
  • We just write log files out for history
• Configuration database
  • PSQL↔FCL tool: upload new configs by making new fcl files
• MicroBooNE gets away with a highly-automated console-based DAQ because not too many components, and overall simple system
  • Configurations must be carefully maintained … can create high load on experts
• Online monitoring off of raw binary data separate from offline data format → rather only slow changes in the quantities to monitor
  • In periods of duress, we demand the swift conversion of files and dedicated people to continuously analyze the data
• Don’t collect enough run information into databases
  • E.g. local log text files written with run uptime, to be used for POT integration information
  • → If it’s worth having, plan to store in a database
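As a concrete picture of the last point, run uptime could be written to a small database table at end of run instead of a local text log. The sketch below uses Python's built-in `sqlite3`; the table name, columns, and function names are hypothetical and are not the MicroBooNE schema.

```python
import sqlite3


def open_run_db(path=":memory:"):
    """Open (or create) a database holding one summary row per run."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS run_summary (
               run        INTEGER PRIMARY KEY,
               start_unix REAL NOT NULL,
               stop_unix  REAL NOT NULL,
               n_events   INTEGER NOT NULL
           )"""
    )
    return conn


def record_run(conn, run, start_unix, stop_unix, n_events):
    """Store the run summary; the `with` block commits on success."""
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO run_summary VALUES (?, ?, ?, ?)",
            (run, start_unix, stop_unix, n_events),
        )


def total_uptime_s(conn, t0, t1):
    """DAQ uptime in seconds inside the window [t0, t1]."""
    row = conn.execute(
        """SELECT COALESCE(SUM(MIN(stop_unix, ?) - MAX(start_unix, ?)), 0)
           FROM run_summary
           WHERE stop_unix > ? AND start_unix < ?""",
        (t1, t0, t0, t1),
    ).fetchone()
    return row[0]
```

With the uptime in a queryable table, a POT-integration job can ask for the live time in any window directly rather than parsing log files from the DAQ machines.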
• DAQ responsibility ends once file hits local disk
• Online data management (DM) takes over for getting file from disk to tape-backed storage
  • Automated local processes for …
    • Search for new files
    • Generate metadata/auxiliary files
    • Copy to outbound dropbox
    • Monitor when data is whisked away
    • Cleanup local files
• Nearline/Offline DM takes it from there
  • Automated processes on grid for ...
    • Keep-up “swizzling” (reformatting) and reconstruction
    • Occupies ~100 nodes for “normal” (1 Hz) data rates
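The local online-DM steps listed above (find new files, generate metadata, copy to dropbox, clean up) can be sketched with the standard library. Everything specific here is assumed for illustration: the `*.bin` file pattern, the JSON sidecar used as an "already handled" marker, the metadata fields, and the rule that a file vanishing from the dropbox means it was picked up. A production system would confirm the tape-backed copy before deleting anything.

```python
import hashlib
import json
import shutil
from pathlib import Path


def make_metadata(data_file):
    """Build a small metadata record (name, size, checksum) for one file."""
    digest = hashlib.sha256()
    with open(data_file, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return {
        "file_name": data_file.name,
        "file_size": data_file.stat().st_size,
        "sha256": digest.hexdigest(),
    }


def ship_new_files(daq_dir, dropbox_dir, pattern="*.bin"):
    """Copy not-yet-handled data files and their metadata to the dropbox."""
    shipped = []
    for data_file in sorted(Path(daq_dir).glob(pattern)):
        meta_file = data_file.with_name(data_file.name + ".json")
        if meta_file.exists():
            continue  # sidecar present: this file was already shipped
        shutil.copy2(data_file, Path(dropbox_dir) / data_file.name)
        meta_file.write_text(json.dumps(make_metadata(data_file)))
        shutil.copy2(meta_file, Path(dropbox_dir) / meta_file.name)
        shipped.append(data_file.name)
    return shipped


def cleanup_shipped(daq_dir, dropbox_dir):
    """Delete local copies whose dropbox copy has been taken away."""
    removed = []
    for meta_file in sorted(Path(daq_dir).glob("*.json")):
        data_file = meta_file.with_suffix("")  # strip the ".json" sidecar suffix
        in_dropbox = (Path(dropbox_dir) / data_file.name).exists()
        if data_file.exists() and not in_dropbox:
            data_file.unlink()
            meta_file.unlink()
            removed.append(data_file.name)
    return removed
```

Even in this toy form, the checksum pass and the copy both read the raw file from the DAQ disk, which is where the competition for local disk and network resources comes from.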
• Requires very close coordination of DAQ and DM groups
  • DM group needs local DAQ cluster resources (CPU, disk read/write, network bandwidth) that can compete with DAQ functions
• MicroBooNE woefully underestimated its data rate, volume, and resource needs
  • From TDR:
    • Expected final compression was x10 (we achieved only x5)
    • Expected recorded data rate from BNB was 0.05 Hz (actual: ~0.15 Hz)
    • No careful accounting of any additional trigger sources (reality → ~0.7 Hz total rate)
  • And physics groups demand more data still...
  • Need carefully validated and realistic data volume and resource estimates
• Additional considerations to ease offline DM?
  • MicroBooNE DAQ writes everything to one file
    • E.g. filtering on trigger streams likely would help offline re-swizzling/reconstruction
  • “Swizzling” takes significant resources
    • Reduce reformatting? Improve/make less necessary decompression routines?
  • To borrow from Josh Klein: try to be less paranoid and greedy
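A back-of-the-envelope check shows how far apart the planned and actual data volumes are. The rates and the ~35 MB compressed event size come from these slides; the "planned" event size of 17.5 MB is an assumption (the actual size halved, as if x10 compression had been reached across the whole event, which ignores the uncompressed PMT share), and 100% uptime is assumed throughout.

```python
SECONDS_PER_DAY = 86_400


def daily_volume_tb(event_size_mb, rate_hz):
    """Data volume in TB/day for a given event size and recorded rate."""
    return event_size_mb * rate_hz * SECONDS_PER_DAY / 1e6


# Actual: ~35 MB per compressed event at ~0.7 Hz total recorded rate.
actual = daily_volume_tb(event_size_mb=35.0, rate_hz=0.7)

# Planned (assumption, see above): 17.5 MB per event at the TDR BNB rate of 0.05 Hz.
planned = daily_volume_tb(event_size_mb=17.5, rate_hz=0.05)

print(f"actual  ~{actual:.2f} TB/day")
print(f"planned ~{planned:.3f} TB/day")
print(f"ratio   ~{actual / planned:.0f}x")
```

Under these assumptions the rate alone accounts for a factor of 14 (0.7 / 0.05) and the compression shortfall for up to another factor of 2, which is the scale of mismatch that downstream storage and processing had to absorb.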
• MicroBooNE DAQ is running, running well, and fits our physics needs
• Very useful experience running a real physics experiment
  • With real results! And MORE COMING!
• Elements to learn, as discussed, from design point of view
  • For multiple data streams, need careful evaluation of shared resources
  • Need flexibility in data handling, compression, and triggering
  • Early and close integration with data management
  • Design for realistic data rates/volume at a global/integrated level (DAQ+DM)
    • And then resist pressure for changes without complete reevaluation of entire chain
• Also, MicroBooNE has loads of operational experience/advice, but won’t dwell
• DISCUSSION TIME