ProtoDUNE prototype TPC self-trigger
Philip Rodrigues
University of Oxford
24 September 2019
A self-triggered muon parallel to the APA face
Figure: event display, Z/U/V views (offline channel number vs time in ticks). Run 9619, event 6 (timestamp 0x1168a5f24e41158, 2019-09-09 15:41:35 UTC)
◮ Read out 1000 ticks before and after the trigger time
◮ Why: Stepping stone to the DUNE FD, which will need a self-trigger
◮ How:
  ◮ Proof-of-concept; more concerned with data flow than physics
  ◮ Make the simplest thing that works; iterate if necessary
  ◮ Work within the existing ProtoDUNE DAQ, with incremental changes
  ◮ Downstream (lower data rates) shouldn’t impose requirements on upstream (higher data rates, where performance is more important)
Figure: trigger dataflow. Each APA’s FELIX BRs (×10 links) send hits as ptmp messages to a per-APA trigger candidate BR (APA 4 finds hits on-host in the FELIX BRs; APA 5 uses a hit-finding FPGA and a hit-sending BR). The trigger candidate BRs send ptmp messages to the module-level trigger BR, which talks to the Routing Master/DFO; triggered data is returned by the FELIX BRs as artdaq fragments.
◮ Implement code inside artdaq board readers:
◮ Advantages: integration with run control, log file handling, saving of raw data and metadata
◮ Trigger primitives and candidates are sent using ptmp (from Brett Viren: https://github.com/brettviren/ptmp), which provides:
  ◮ Data structures
  ◮ Message passing
  ◮ Algorithms
◮ Not shown: SSPs don’t participate in generating triggers, but will provide data in response to a trigger request
◮ TrigPrim: a TPC hit with channel, start time, time span, ADC sum, peak ADC (unused so far), error flags (unused so far)
◮ TPSet: a container for TrigPrims, with count, detector ID, creation time, time/channel span, and the actual list of TrigPrims
◮ TPSets can be passed as ZeroMQ messages over the network or in-process. Fast, configurable, with alternatives for handling backpressure (drop or wait)
◮ The TPWindow algorithm repackages TrigPrims into fixed-time windows
◮ The TPZipper algorithm aggregates multiple TPSet message streams in time order, with a (soft) maximum latency guarantee
Figure: waveform, ADC (arbitrary offset) vs time (tick). Run 9619, event 6 (timestamp 0x1168a5f24e41158, 2019-09-09 15:41:35 UTC)
◮ Simple hit finding running in CPUs on FELIX BR hosts
◮ Uses about 60% of a CPU core per link (10 links per APA)
Figure: Jon Sensenig
◮ Trigger candidates (TCs) are found per-APA
◮ Take 50 µs windows and find groups of hits contiguous in channel (allowing gaps of up to 4 wires)
◮ Generate a TC if the channel range of the largest group is 100 or more
Figure: Jon Sensenig
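◮ A minimal sketch of the per-APA candidate finding described above, assuming each hit is just a (channel, start time) pair. The 50 µs window, 4-wire gap and 100-channel span come from the bullets; the function names and data layout are illustrative, not the ProtoDUNE code.

```python
# Illustrative sketch (not the ProtoDUNE code): per-APA trigger candidate finding by
# grouping hits that are contiguous in channel within fixed 50 us windows.

WINDOW_TICKS = 100      # 50 us in 2 MHz TPC ticks (2500 if times are 50 MHz timing ticks)
MAX_CHANNEL_GAP = 4     # allow gaps of up to 4 wires within a group
MIN_CHANNEL_SPAN = 100  # emit a candidate if the largest group spans >= 100 channels

def find_trigger_candidates(hits):
    """hits: iterable of (channel, tstart) pairs from one APA. Returns a list of
    (window_start_time, channel_span) trigger candidates."""
    by_window = {}
    for channel, tstart in hits:
        by_window.setdefault(tstart // WINDOW_TICKS, []).append(channel)

    candidates = []
    for window, channels in sorted(by_window.items()):
        channels.sort()
        # Split the sorted channel list wherever the gap exceeds MAX_CHANNEL_GAP,
        # keeping track of the largest channel span seen in this window
        group_start = prev = channels[0]
        best_span = 0
        for ch in channels[1:]:
            if ch - prev > MAX_CHANNEL_GAP:
                best_span = max(best_span, prev - group_start)
                group_start = ch
            prev = ch
        best_span = max(best_span, prev - group_start)
        if best_span >= MIN_CHANNEL_SPAN:
            candidates.append((window * WINDOW_TICKS, best_span))
    return candidates
```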
◮ Stitch together TCs from the individual APAs
◮ Use a sliding window of one drift time and look for a consistent slope (∆(time)/∆(channel)) between TCs
◮ Trigger if the stitched TCs contain 350 hits or more in each APA
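◮ The stitching could look roughly like the sketch below. The one-drift-time window, the ∆(time)/∆(channel) slope comparison and the 350-hit-per-APA requirement are from the bullets above; the drift-time value, slope tolerance and data layout are assumptions made purely for illustration.

```python
# Illustrative sketch of module-level stitching: within a sliding window of one drift
# time, stitch per-APA trigger candidates whose d(time)/d(channel) slopes with respect
# to a seed TC are mutually consistent, and trigger if every stitched APA has enough hits.

DRIFT_TICKS = 4500        # roughly one drift time in TPC ticks (illustrative value)
SLOPE_TOLERANCE = 0.3     # allowed relative spread of the pairwise slopes (made up)
MIN_HITS_PER_APA = 350    # from the slide: >= 350 hits in each stitched APA

def slope(tc_a, tc_b):
    dch = tc_b["channel"] - tc_a["channel"]
    return (tc_b["time"] - tc_a["time"]) / dch if dch else float("inf")

def module_level_trigger(tcs):
    """tcs: time-sorted list of dicts with keys 'apa', 'time', 'channel', 'nhits'.
    Returns a trigger timestamp, or None if no trigger is formed."""
    for i, seed in enumerate(tcs):
        # All TCs within one drift time of the seed form the sliding window
        window = [tc for tc in tcs[i:] if tc["time"] - seed["time"] < DRIFT_TICKS]
        if len(window) < 2:
            continue
        # "Consistent slope": the slopes of the other TCs relative to the seed agree
        slopes = [slope(seed, tc) for tc in window[1:]]
        spread = max(slopes) - min(slopes)
        mean = sum(slopes) / len(slopes)
        if mean == 0 or abs(spread / mean) > SLOPE_TOLERANCE:
            continue
        # Count hits per APA among the stitched TCs
        hits_per_apa = {}
        for tc in window:
            hits_per_apa[tc["apa"]] = hits_per_apa.get(tc["apa"], 0) + tc["nhits"]
        if len(hits_per_apa) > 1 and all(n >= MIN_HITS_PER_APA for n in hits_per_apa.values()):
            return seed["time"]
    return None
```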
Figure: event display, Z/U/V views (offline channel number vs time in ticks). Run 9619, event 6 (timestamp 0x1168a5f24e41158, 2019-09-09 15:41:35 UTC)
◮ Read out 1000 ticks before and after the trigger time
Figure: event display, Z/U/V views (offline channel number vs time in ticks). Run 9406, event 4 (timestamp 0x11647d02ca488f4, 2019-08-23 17:07:39 UTC)
◮ Read out 1000 ticks before and after the trigger time
◮ Event from a lower-purity period
Figure: trigger latency (ms) distribution, frequency on a log scale. Run 9413
◮ Trigger latency is defined as the time between data with timestamp T arriving in the FELIX BR and the trigger request for data with timestamp T arriving in the FELIX BR. One histogram per link
◮ ∼3000 triggers in this run; no latencies close to the buffer depth (∼1 s)
◮ CPU usage: TODO, but approximately 2–3 cores per APA for the work downstream of hit finding
◮ Incorporate hit finding into the on-host FELIX config
◮ Improve characterization/monitoring. In particular, check the rate of dropped data (it had better be negligible!)
◮ Handle multiple trigger candidate algorithms: more like DUNE
◮ Michel electron trigger: work from Columbia University
◮ Write a technical paper (NIM or JINST, maybe)
◮ Investigate induction-wire hit finding
◮ Could open the possibility of smaller ROI readout in DUNE
◮ Easy first test: run the collection-wire algorithm on the induction wires
◮ Works, for the appropriate value of “works”. Ran out of cores at 4 links (out of 10)
◮ Nice idea from ICARUS collaborators for induction-wire preprocessing: keep a running sum of the waveform to convert the bipolar signal to unipolar
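◮ The running-sum trick in the last bullet is simple to sketch: integrating a pedestal-subtracted bipolar pulse turns it into a single unipolar bump that a collection-style hit finder can handle. The leaky-sum decay factor below is an assumption to stop baseline errors accumulating, not a detail from ICARUS or ProtoDUNE.

```python
# Illustrative sketch: convert a bipolar induction-wire pulse to a unipolar one by
# keeping a running (leaky) sum of the pedestal-subtracted waveform.

def running_sum(waveform, pedestal, decay=0.999):
    """waveform: list of ADC samples; pedestal: baseline to subtract.
    decay < 1 slowly bleeds the sum away so baseline errors don't accumulate (assumed)."""
    out = []
    acc = 0.0
    for adc in waveform:
        acc = decay * acc + (adc - pedestal)
        out.append(acc)
    return out

# A symmetric bipolar pulse integrates to a single unipolar bump:
bipolar = [0, 0, 3, 8, 3, 0, -3, -8, -3, 0, 0]
print(running_sum(bipolar, pedestal=0))
```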
◮ Demonstration of self-triggering on TPC data using an entirely software-based trigger chain
◮ A base for more R&D towards DUNE
Figure: event display, Z/U/V views (offline channel number vs time in ticks). Run 9619, event 6 (timestamp 0x1168a5f24e41158, 2019-09-09 15:41:35 UTC)
◮ Read out 1000 ticks before and after the trigger time
◮ Each link uses two threads for hit finding: one to decode the funky WIB format and find collection-wire hits (as before), and another to find induction-wire hits. They each get a core to themselves, using 75% and 65% CPU respectively
◮ Ran out of cores on the FELIX BR machine at 4 links. Expect improvements with a more favourable input data format. Might also just sum adjacent ticks to halve the data rate (there’s no information at frequencies that high anyway)
◮ Important distinction between “data time” and “computer time”:
  ◮ Data time is 50 MHz ticks from the timing system
  ◮ Computer time is microseconds according to a computer’s system clock
◮ TrigPrim, a trigger primitive:
  channel   Offline channel number
  tstart    Start time (data time)
  tspan     Time span (data time)
  adcsum    Sum of hit ADCs
  adcpeak   Peak ADC (not filled by CPU hit finding yet)
  flags     Error flags. Not used yet
◮ TPSet, a set of connected trigger primitives:
  count     Sequential count
  detid     Which detector portion is represented (in PDSP, this is the FELIX link)
  created   Time this TPSet was created (computer time)
  tstart    Earliest tstart of member TrigPrims (data time)
  tspan     Time span of member TrigPrims (data time)
  chanbeg   Lowest channel of TrigPrims in the set
  chanend   Highest channel of TrigPrims in the set
  totaladc  Total ADC of TrigPrims in the set
  tps       The actual list of TrigPrims
◮ Full details: https://github.com/brettviren/ptmp/blob/master/src/ptmp.proto
◮ The count field means that lost data can be identified, but that’s not done yet
◮ By convention, empty TPSets are not sent
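◮ For orientation, the same fields written out as Python dataclasses; the authoritative definitions are the protobuf messages in ptmp.proto linked above, and the field types here are assumptions.

```python
# Illustrative mirror of the ptmp TrigPrim/TPSet fields described above; the real
# definitions are protobuf messages in ptmp's src/ptmp.proto.
from dataclasses import dataclass, field
from typing import List

@dataclass
class TrigPrim:
    channel: int      # offline channel number
    tstart: int       # start time (data time: 50 MHz timing-system ticks)
    tspan: int        # time span (data time)
    adcsum: int       # sum of hit ADCs
    adcpeak: int = 0  # peak ADC (not filled by CPU hit finding yet)
    flags: int = 0    # error flags (not used yet)

@dataclass
class TPSet:
    count: int        # sequential count, allows lost data to be identified
    detid: int        # detector portion represented (FELIX link in PDSP)
    created: int      # creation time (computer time)
    tstart: int       # earliest tstart of member TrigPrims (data time)
    tspan: int        # time span of member TrigPrims (data time)
    chanbeg: int      # lowest channel of member TrigPrims
    chanend: int      # highest channel of member TrigPrims
    totaladc: int     # total ADC of member TrigPrims
    tps: List[TrigPrim] = field(default_factory=list)
```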
◮ TPSets are (de)serialized using protobuf from Google, and the data is sent using ZeroMQ
◮ Relevant features of ZeroMQ:
  ◮ Messages can go via the network or efficiently within a process, switched by just a config change
  ◮ Can “subscribe” to messages in standalone code for debugging/data dumping
  ◮ Alternatives for handling backpressure: make upstream wait (PUSH/PULL) or drop messages (PUB/SUB)
  ◮ Good performance: millions of msgs/s, sub-ms latencies¹
¹ http://zeromq.org/results:10gbe-tests-v432
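◮ A minimal pyzmq sketch of the drop-vs-wait distinction and of attaching a standalone debugging subscriber: with PUB/SUB a slow or missing subscriber means dropped messages, whereas PUSH/PULL makes the sender wait. The address and payload below are placeholders, not the ptmp configuration.

```python
# Minimal pyzmq sketch (not the ptmp code): publish serialized TPSets over PUB/SUB.
# Swapping PUB/SUB for PUSH/PULL changes the backpressure behaviour from "drop" to "wait";
# using an inproc:// address instead of tcp:// keeps the messages within one process.
import time
import zmq

ctx = zmq.Context()

pub = ctx.socket(zmq.PUB)           # sender side (in ptmp, the hit/candidate producer)
pub.bind("tcp://*:5555")

sub = ctx.socket(zmq.SUB)           # e.g. a standalone debugging/dumping subscriber
sub.connect("tcp://localhost:5555")
sub.setsockopt(zmq.SUBSCRIBE, b"")  # subscribe to everything

time.sleep(0.2)                     # give the subscription time to propagate (slow joiner)
pub.send(b"protobuf-serialized TPSet bytes would go here")
print(sub.recv())
```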
Figure: TPWindow (span=100) repackaging incoming TPSets of TrigPrims into output TPSets aligned to fixed time windows
◮ Repackage TrigPrims into new TPSets whose start (data) times fall in windows starting on a fixed boundary
◮ Input hits are not required to be sorted by start time. Hits are buffered for a configurable time before being sent out; late hits are dropped
◮ Full details: https://github.com/brettviren/ptmp/blob/master/docs/tpwindow.org
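◮ A rough sketch of the behaviour described above: hits are binned into fixed windows of a given span, buffered, and emitted per window, with late hits dropped. The real TPWindow buffers for a configurable time; the flush interface below is a simplification.

```python
# Illustrative sketch of TPWindow-style repackaging: hits are emitted in TPSets whose
# start times fall in fixed windows [k*span, (k+1)*span); hits older than the last
# emitted window are dropped as "late". This is a simplification of ptmp's TPWindow.

class TPWindowSketch:
    def __init__(self, span, emit):
        self.span = span          # window length in data-time ticks
        self.emit = emit          # callback taking (window_start_time, list_of_hits)
        self.buffer = {}          # window index -> buffered hits
        self.last_emitted = -1    # highest window index already sent out

    def add_hit(self, hit):
        index = hit["tstart"] // self.span
        if index <= self.last_emitted:
            return                # late hit: its window has already been sent; drop it
        self.buffer.setdefault(index, []).append(hit)

    def flush_older_than(self, index):
        # Called periodically (the real code buffers for a configurable time instead)
        for k in sorted(w for w in self.buffer if w < index):
            self.emit(k * self.span, self.buffer.pop(k))
            self.last_emitted = max(self.last_emitted, k)
```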
Figure: TPZipper merging several input streams of TPSets into a single time-ordered output stream
◮ Aggregate multiple TPSet message streams, outputting unchanged TPSets in time order with a maximum latency
◮ The maximum latency is the computer time between the first TPSet for a given data time arriving and all available TPSets for that data time being sent out
◮ TPSets arriving after “their” time has been sent out are considered “tardy” and dropped (e.g. the green “TPSet 2” in the diagram above)
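◮ A rough sketch of the merge itself: pop whichever input currently has the earliest buffered TPSet and emit in global time order, dropping anything tardy. The real TPZipper additionally enforces the maximum latency in computer time, which is omitted here.

```python
# Illustrative sketch of TPZipper-style merging: take per-stream sequences of TPSets and
# emit them in tstart order; TPSets arriving for a time already emitted are "tardy" and
# dropped. The real TPZipper additionally bounds the wait in computer (wall-clock) time.
import heapq

def zip_streams(streams):
    """streams: list of iterables of TPSets (dicts with a 'tstart' key), each already
    time-ordered within itself. Yields TPSets from all streams in global tstart order."""
    iters = [iter(s) for s in streams]
    heap = []
    for i, it in enumerate(iters):
        first = next(it, None)
        if first is not None:
            heapq.heappush(heap, (first["tstart"], i, first))
    last_emitted = float("-inf")
    while heap:
        tstart, i, tpset = heapq.heappop(heap)
        if tstart >= last_emitted:       # otherwise it would be tardy; drop it
            yield tpset
            last_emitted = tstart
        nxt = next(iters[i], None)
        if nxt is not None:
            heapq.heappush(heap, (nxt["tstart"], i, nxt))
```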
Figure: SSP waveforms from the first few channels of a triggered event. Run 9347, event 7
◮ Jen Haigh added code to read out the SSPs in response to a request from the self-trigger system
◮ The data reads out successfully, but I don’t know enough about the photon system to interpret it. The plot shows the waveforms from the first few channels of a triggered event, back when the purity was fairly low. It has signal, so that means it worked??
Figure: FELIX BR internals. A subscriber thread receives netio messages and pushes them to the raw queue and the processing queue; a processing thread pops the processing queue, finds hits, and pushes them to the hit queue and to ptmp; a trigger-match thread receives trigger requests, pops the raw-data and hit queues, and passes fragments to artdaq.
◮ netio messages go into the main BR buffer (for extraction if triggered, 2 s deep) and the TP-finding queue
◮ The TP-finding thread pops netio messages off the queue, expands the WIB format, finds hits, puts the hits in an internal buffer (for extraction if triggered) and sends them to ptmp
◮ Hits are created when they end
◮ One TPSet is created for every N messages, to reduce message overhead (default N = 20, i.e. 240 ticks = 120 µs). The TPSet then contains all hits that ended in that 120 µs
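◮ A sketch of the batching in the last bullet: hits are collected as they end and shipped as one TPSet every N netio messages. The find_hits and send_tpset hooks are placeholders, not the board reader’s actual interfaces.

```python
# Illustrative sketch of batching hits into one TPSet per N netio messages, as in the
# last bullet (default N = 20 messages = 240 TPC ticks = 120 us). Not the BR code itself.

MESSAGES_PER_TPSET = 20
TICKS_PER_MESSAGE = 12     # 240 ticks / 20 messages

def process_messages(messages, find_hits, send_tpset):
    """messages: iterable of raw netio payloads; find_hits(msg) returns the hits that
    *ended* in that message; send_tpset(hits) ships one TPSet downstream (e.g. via ptmp)."""
    pending = []
    for n, msg in enumerate(messages, start=1):
        pending.extend(find_hits(msg))        # hits are created only when they end
        if n % MESSAGES_PER_TPSET == 0:
            if pending:                       # by convention, empty TPSets are not sent
                send_tpset(pending)
            pending = []
```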
Figure: single-channel hit finding, ADCs (median subtracted) vs time (TPC ticks), showing the raw and filtered waveforms, the pedestal, the 25th/75th percentile estimates, the threshold, and hit ends. Run 9619, event 6 (timestamp 0x1168a5f24e41158, 2019-09-09 15:41:35 UTC)
◮ The pedestal is estimated using a modified version of the “frugal streaming” algorithm (arXiv:1407.1121). The 25th and 75th percentiles are estimated similarly to get the interquartile range (IQR); the threshold is 5 times the IQR
◮ Noise filtering is via a 7-tap finite impulse response filter. It doesn’t really do a lot here. Might be the filter coefficient choice/integer approximation, but probably just too few taps
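◮ A sketch of this chain on a single channel, using the basic Frugal-1U update from arXiv:1407.1121 for the pedestal and quartiles (the ProtoDUNE code uses a modified variant), placeholder FIR coefficients, and the 5×IQR threshold. It also mirrors the current quirk, noted on a later slide, that the IQR is computed on the raw waveform while the threshold is applied to the filtered one.

```python
# Illustrative single-channel sketch of the hit finding described above: frugal-streaming
# pedestal and 25th/75th percentile estimates (basic Frugal-1U from arXiv:1407.1121; the
# real code uses a modified variant), a 7-tap FIR filter with placeholder coefficients,
# and a threshold of 5 times the IQR. Hits are created when they end.
import random

def frugal_update(estimate, sample, quantile):
    """One Frugal-1U step: the estimate drifts towards the requested quantile."""
    if sample > estimate and random.random() < quantile:
        return estimate + 1
    if sample < estimate and random.random() < 1 - quantile:
        return estimate - 1
    return estimate

FIR_TAPS = [1, 2, 3, 4, 3, 2, 1]   # placeholder low-pass coefficients, not the real ones

def find_hits(waveform):
    ped = q25 = q75 = waveform[0]   # estimators need some samples to settle
    history = [0] * len(FIR_TAPS)
    hits, in_hit, hit_start, adcsum = [], False, 0, 0
    for t, adc in enumerate(waveform):
        # Pedestal and quartiles are updated on the raw waveform...
        ped = frugal_update(ped, adc, 0.50)
        q25 = frugal_update(q25, adc, 0.25)
        q75 = frugal_update(q75, adc, 0.75)
        threshold = 5 * (q75 - q25)
        # ...but the threshold is applied to the filtered, pedestal-subtracted waveform
        history = history[1:] + [adc - ped]
        filtered = sum(c * s for c, s in zip(FIR_TAPS, history)) / sum(FIR_TAPS)
        if filtered > threshold:
            if not in_hit:
                in_hit, hit_start, adcsum = True, t, 0
            adcsum += filtered
        elif in_hit:               # first sample back below threshold ends the hit
            hits.append({"tstart": hit_start, "tspan": t - hit_start, "adcsum": adcsum})
            in_hit = False
    return hits
```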
Figure: the same channel over the full readout window, ADCs (median subtracted) vs time (TPC ticks), with the same raw/filtered/pedestal/percentile/threshold/hit-end curves. Run 9619, event 6 (timestamp 0x1168a5f24e41158, 2019-09-09 15:41:35 UTC)
◮ I implemented the current pedestal RMS scheme to show it can be done efficiently, and because it’s cool. Is it necessary?
◮ Setting the hit threshold as “Nσ” is ∼fixed-rate, but setting the hit threshold as N ADC is ∼fixed-efficiency. Not so obvious to me which is best
◮ The pedestal and RMS don’t really change on the single-tick timescale, so maybe this scheme isn’t worth the processing
◮ A 5 “sigma” threshold is way too high for serious physics
◮ The IQR is calculated in raw ADC but applied in filtered ADC. Should be made consistent
Figure: CPU usage before and after the improvement
◮ Change which CPUs run which tasks. Big improvements in CPU usage and startup behaviour (probably startup latency too) because of improved memory access
◮ See the backups and https://its.cern.ch/jira/browse/NP04DAQ-96 for details
Figure: a dual-socket machine with two NUMA nodes, each with a CPU (×24 cores) and its local RAM; access within a node is fast
◮ RAM is “closer” to one CPU than the other. CPUs have faster access to the “closer” RAM, slower access to the “further” RAM
◮ Items within the same NUMA node are closest
Figure: as above, with the 100 Gb NIC attached to NUMA node 0; data from the FELIX host (×10 links) arrives by DMA
◮ Dual-socket machines. The 100 Gb NIC is connected to NUMA node 0
Figure: before the improvement, per-link subscriber threads (Link 0 sub, Link 1 sub, ...) run on NUMA node 0 and hit-finding threads (Link 0 hitfind, Link 1 hitfind, ...) run on NUMA node 1
◮ netio subscriber threads (one per link) poll for new data, add it to a ring buffer, and send another copy to the hit-finding threads. All run on NUMA node 0
◮ Hit-finding threads take the raw data, extract the collection channels, and do the pedestal, filtering and hit finding. All run on NUMA node 1
◮ The full data stream crosses NUMA nodes (the slow path)
Figure: after the improvement, each link’s subscriber and hit-finding threads run on the same NUMA node, with the links split between the two nodes
◮ Half of the links have their subscriber and hit-finding threads on NUMA node 0; the other half have them on NUMA node 1
◮ Only half of the data stream crosses NUMA nodes
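◮ A sketch of that pinning using Linux CPU affinity from Python; the real board readers set affinity differently, and the core numbering (0–23 on node 0, 24–47 on node 1) is an assumption based on the 2 × 24-core layout.

```python
# Illustrative sketch: pin each link's subscriber and hit-finding threads to the same
# NUMA node, with the 10 links split across the two nodes. Uses Linux CPU affinity via
# os.sched_setaffinity; the core numbering assumes a 2 x 24-core machine.
import os
import threading

NODE_CORES = {0: set(range(0, 24)), 1: set(range(24, 48))}  # assumed core layout
N_LINKS = 10

def pinned_worker(name, node):
    os.sched_setaffinity(0, NODE_CORES[node])   # pid 0 = the calling thread
    print(f"{name} running on NUMA node {node}")
    # ... the subscriber or hit-finding loop would go here ...

threads = []
for link in range(N_LINKS):
    node = 0 if link < N_LINKS // 2 else 1       # half the links on each node
    for role in ("sub", "hitfind"):
        t = threading.Thread(target=pinned_worker, args=(f"link{link}-{role}", node))
        t.start()
        threads.append(t)
for t in threads:
    t.join()
```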