ProtoDUNE prototype TPC self-trigger
Philip Rodrigues
University of Oxford
24 September 2019
A self-triggered muon parallel to the APA face
Figure: event display, Z/U/V views (offline channel number vs time in ticks). Run 9619, event 6 (timestamp 0x1168a5f24e41158, 2019-09-09 15:41:35 UTC)
◮ Read out 1000 ticks before and after the trigger time
◮ Why: Stepping stone to the DUNE FD, which will need a self-trigger
◮ How:
  ◮ Proof-of-concept; more concerned with data flow than physics
  ◮ Make the simplest thing that works; iterate if necessary
  ◮ Work within the existing ProtoDUNE DAQ, with incremental changes
  ◮ Downstream (lower data rates) shouldn’t impose requirements on upstream (higher data rates, where performance is more important)
Figure: trigger dataflow. Each APA’s FELIX BRs (×10 links) send hits as ptmp messages to a per-APA trigger candidate BR (APA 4 finds hits on-host in the FELIX BRs; APA 5 uses a hit-finding FPGA and a hit-sending BR). The trigger candidate BRs send ptmp messages to the module-level trigger BR, which talks to the Routing Master/DFO; triggered data is returned by the FELIX BRs as artdaq fragments.
◮ Implement code inside artdaq board readers:
◮ Advantages: integration with run control, log file handling, saving of raw data and metadata
◮ Trigger primitives and candidates are sent using ptmp (from Brett Viren: https://github.com/brettviren/ptmp), which provides:
  ◮ Data structures
  ◮ Message passing
  ◮ Algorithms
◮ Not shown: SSPs don’t participate in generating triggers, but will provide data in response to a trigger request
◮ TrigPrim: a TPC hit with channel, start time, time span, ADC sum, peak ADC (unused so far), error flags (unused so far)
◮ TPSet: a container for TrigPrims, with count, detector ID, creation time, time/channel span, and the actual list of TrigPrims
◮ TPSets can be passed as ZeroMQ messages over the network or in-process. Fast, configurable, with alternatives for handling backpressure (drop or wait)
◮ The TPWindow algorithm repackages TrigPrims into fixed-time windows
◮ The TPZipper algorithm aggregates multiple TPSet message streams in time order, with a (soft) maximum latency guarantee
Figure: waveform, ADC (arbitrary offset) vs time (tick). Run 9619, event 6 (timestamp 0x1168a5f24e41158, 2019-09-09 15:41:35 UTC)
◮ Simple hit finding running in CPUs on FELIX BR hosts
◮ Uses about 60% of a CPU core per link (10 links per APA)
Figure: Jon Sensenig
◮ Trigger candidates (TCs) are found per-APA
◮ Take 50 µs windows and find groups of hits contiguous in channel (allowing gaps of up to 4 wires)
◮ Generate a TC if the channel range of the largest group is 100 or more
Figure: Jon Sensenig
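◮ A minimal sketch of the per-APA candidate finding described above, assuming each hit is just a (channel, start time) pair. The 50 µs window, 4-wire gap and 100-channel span come from the bullets; the function names and data layout are illustrative, not the ProtoDUNE code.

```python
# Illustrative sketch (not the ProtoDUNE code): per-APA trigger candidate finding by
# grouping hits that are contiguous in channel within fixed 50 us windows.

WINDOW_TICKS = 100      # 50 us in 2 MHz TPC ticks (2500 if times are 50 MHz timing ticks)
MAX_CHANNEL_GAP = 4     # allow gaps of up to 4 wires within a group
MIN_CHANNEL_SPAN = 100  # emit a candidate if the largest group spans >= 100 channels

def find_trigger_candidates(hits):
    """hits: iterable of (channel, tstart) pairs from one APA. Returns a list of
    (window_start_time, channel_span) trigger candidates."""
    by_window = {}
    for channel, tstart in hits:
        by_window.setdefault(tstart // WINDOW_TICKS, []).append(channel)

    candidates = []
    for window, channels in sorted(by_window.items()):
        channels.sort()
        # Split the sorted channel list wherever the gap exceeds MAX_CHANNEL_GAP,
        # keeping track of the largest channel span seen in this window
        group_start = prev = channels[0]
        best_span = 0
        for ch in channels[1:]:
            if ch - prev > MAX_CHANNEL_GAP:
                best_span = max(best_span, prev - group_start)
                group_start = ch
            prev = ch
        best_span = max(best_span, prev - group_start)
        if best_span >= MIN_CHANNEL_SPAN:
            candidates.append((window * WINDOW_TICKS, best_span))
    return candidates
```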
◮ Stitch together TCs from the individual APAs
◮ Use a sliding window of one drift time and look for a consistent slope (∆(time)/∆(channel)) between TCs
◮ Trigger if the stitched TCs contain 350 hits or more in each APA
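◮ The stitching could look roughly like the sketch below. The one-drift-time window, the ∆(time)/∆(channel) slope comparison and the 350-hit-per-APA requirement are from the bullets above; the drift-time value, slope tolerance and data layout are assumptions made purely for illustration.

```python
# Illustrative sketch of module-level stitching: within a sliding window of one drift
# time, stitch per-APA trigger candidates whose d(time)/d(channel) slopes with respect
# to a seed TC are mutually consistent, and trigger if every stitched APA has enough hits.

DRIFT_TICKS = 4500        # roughly one drift time in TPC ticks (illustrative value)
SLOPE_TOLERANCE = 0.3     # allowed relative spread of the pairwise slopes (made up)
MIN_HITS_PER_APA = 350    # from the slide: >= 350 hits in each stitched APA

def slope(tc_a, tc_b):
    dch = tc_b["channel"] - tc_a["channel"]
    return (tc_b["time"] - tc_a["time"]) / dch if dch else float("inf")

def module_level_trigger(tcs):
    """tcs: time-sorted list of dicts with keys 'apa', 'time', 'channel', 'nhits'.
    Returns a trigger timestamp, or None if no trigger is formed."""
    for i, seed in enumerate(tcs):
        # All TCs within one drift time of the seed form the sliding window
        window = [tc for tc in tcs[i:] if tc["time"] - seed["time"] < DRIFT_TICKS]
        if len(window) < 2:
            continue
        # "Consistent slope": the slopes of the other TCs relative to the seed agree
        slopes = [slope(seed, tc) for tc in window[1:]]
        spread = max(slopes) - min(slopes)
        mean = sum(slopes) / len(slopes)
        if mean == 0 or abs(spread / mean) > SLOPE_TOLERANCE:
            continue
        # Count hits per APA among the stitched TCs
        hits_per_apa = {}
        for tc in window:
            hits_per_apa[tc["apa"]] = hits_per_apa.get(tc["apa"], 0) + tc["nhits"]
        if len(hits_per_apa) > 1 and all(n >= MIN_HITS_PER_APA for n in hits_per_apa.values()):
            return seed["time"]
    return None
```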
Figure: event display, Z/U/V views (offline channel number vs time in ticks). Run 9619, event 6 (timestamp 0x1168a5f24e41158, 2019-09-09 15:41:35 UTC)
◮ Read out 1000 ticks before and after the trigger time
Figure: event display, Z/U/V views (offline channel number vs time in ticks). Run 9406, event 4 (timestamp 0x11647d02ca488f4, 2019-08-23 17:07:39 UTC)
◮ Read out 1000 ticks before and after the trigger time
◮ Event from a lower-purity period
Figure: trigger latency (ms) distribution, frequency on a log scale. Run 9413
◮ Trigger latency is defined as the time between data with timestamp T arriving in the FELIX BR and the trigger request for data with timestamp T arriving in the FELIX BR. One histogram per link
◮ ∼3000 triggers in this run; no latencies close to the buffer depth (∼1 s)
◮ CPU usage: TODO, but approximately 2–3 cores per APA for the work downstream of hit finding
◮ Incorporate hit finding into the on-host FELIX config
◮ Improve characterization/monitoring. In particular, check the rate of dropped data (it had better be negligible!)
◮ Handle multiple trigger candidate algorithms: more like DUNE
◮ Michel electron trigger: work from Columbia University
◮ Write a technical paper (NIM or JINST, maybe)
◮ Investigate induction-wire hit finding
◮ Could open the possibility of smaller ROI readout in DUNE
◮ Easy first test: run the collection-wire algorithm on the induction wires
◮ Works, for the appropriate value of “works”. Ran out of cores at 4 links (out of 10)
◮ Nice idea from ICARUS collaborators for induction-wire preprocessing: keep a running sum of the waveform to convert the bipolar signal to unipolar
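◮ The running-sum trick in the last bullet is simple to sketch: integrating a pedestal-subtracted bipolar pulse turns it into a single unipolar bump that a collection-style hit finder can handle. The leaky-sum decay factor below is an assumption to stop baseline errors accumulating, not a detail from ICARUS or ProtoDUNE.

```python
# Illustrative sketch: convert a bipolar induction-wire pulse to a unipolar one by
# keeping a running (leaky) sum of the pedestal-subtracted waveform.

def running_sum(waveform, pedestal, decay=0.999):
    """waveform: list of ADC samples; pedestal: baseline to subtract.
    decay < 1 slowly bleeds the sum away so baseline errors don't accumulate (assumed)."""
    out = []
    acc = 0.0
    for adc in waveform:
        acc = decay * acc + (adc - pedestal)
        out.append(acc)
    return out

# A symmetric bipolar pulse integrates to a single unipolar bump:
bipolar = [0, 0, 3, 8, 3, 0, -3, -8, -3, 0, 0]
print(running_sum(bipolar, pedestal=0))
```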
◮ Demonstration of self-triggering on TPC data using an entirely software-based trigger chain
◮ A base for more R&D towards DUNE
Figure: event display, Z/U/V views (offline channel number vs time in ticks). Run 9619, event 6 (timestamp 0x1168a5f24e41158, 2019-09-09 15:41:35 UTC)
◮ Read out 1000 ticks before and after the trigger time
◮ Each link uses two threads for hit finding: one to decode the funky WIB format and find collection-wire hits (as before), and another to find induction-wire hits. They each get a core to themselves, using 75% and 65% CPU respectively
◮ Ran out of cores on the FELIX BR machine at 4 links. Expect improvements with a more favourable input data format. Might also just sum adjacent ticks to halve the data rate (there’s no information at frequencies that high anyway)
◮ Important distinction between “data time” and “computer time”:
  ◮ Data time is 50 MHz ticks from the timing system
  ◮ Computer time is microseconds according to a computer’s system clock
◮ TrigPrim, a trigger primitive:
  channel   Offline channel number
  tstart    Start time (data time)
  tspan     Time span (data time)
  adcsum    Sum of hit ADCs
  adcpeak   Peak ADC (not filled by CPU hit finding yet)
  flags     Error flags. Not used yet
◮ TPSet, a set of connected trigger primitives:
  count     Sequential count
  detid     Which detector portion is represented (in PDSP, this is the FELIX link)
  created   Time this TPSet was created (computer time)
  tstart    Earliest tstart of member TrigPrims (data time)
  tspan     Time span of member TrigPrims (data time)
  chanbeg   Lowest channel of TrigPrims in the set
  chanend   Highest channel of TrigPrims in the set
  totaladc  Total ADC of TrigPrims in the set
  tps       The actual list of TrigPrims
◮ Full details: https://github.com/brettviren/ptmp/blob/master/src/ptmp.proto
◮ The count field means that lost data can be identified, but that’s not done yet
◮ By convention, empty TPSets are not sent
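◮ For orientation, the same fields written out as Python dataclasses; the authoritative definitions are the protobuf messages in ptmp.proto linked above, and the field types here are assumptions.

```python
# Illustrative mirror of the ptmp TrigPrim/TPSet fields described above; the real
# definitions are protobuf messages in ptmp's src/ptmp.proto.
from dataclasses import dataclass, field
from typing import List

@dataclass
class TrigPrim:
    channel: int      # offline channel number
    tstart: int       # start time (data time: 50 MHz timing-system ticks)
    tspan: int        # time span (data time)
    adcsum: int       # sum of hit ADCs
    adcpeak: int = 0  # peak ADC (not filled by CPU hit finding yet)
    flags: int = 0    # error flags (not used yet)

@dataclass
class TPSet:
    count: int        # sequential count, allows lost data to be identified
    detid: int        # detector portion represented (FELIX link in PDSP)
    created: int      # creation time (computer time)
    tstart: int       # earliest tstart of member TrigPrims (data time)
    tspan: int        # time span of member TrigPrims (data time)
    chanbeg: int      # lowest channel of member TrigPrims
    chanend: int      # highest channel of member TrigPrims
    totaladc: int     # total ADC of member TrigPrims
    tps: List[TrigPrim] = field(default_factory=list)
```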
◮ TPSets are (de)serialized using protobuf from Google, and the data is sent using ZeroMQ
◮ Relevant features of ZeroMQ:
  ◮ Messages can go via the network or efficiently within a process, switched by just a config change
  ◮ Can “subscribe” to messages in standalone code for debugging/data dumping
  ◮ Alternatives for handling backpressure: make upstream wait (PUSH/PULL) or drop messages (PUB/SUB)
  ◮ Good performance: millions of msgs/s, sub-ms latencies¹
¹ http://zeromq.org/results:10gbe-tests-v432
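◮ A minimal pyzmq sketch of the drop-vs-wait distinction and of attaching a standalone debugging subscriber: with PUB/SUB a slow or missing subscriber means dropped messages, whereas PUSH/PULL makes the sender wait. The address and payload below are placeholders, not the ptmp configuration.

```python
# Minimal pyzmq sketch (not the ptmp code): publish serialized TPSets over PUB/SUB.
# Swapping PUB/SUB for PUSH/PULL changes the backpressure behaviour from "drop" to "wait";
# using an inproc:// address instead of tcp:// keeps the messages within one process.
import time
import zmq

ctx = zmq.Context()

pub = ctx.socket(zmq.PUB)           # sender side (in ptmp, the hit/candidate producer)
pub.bind("tcp://*:5555")

sub = ctx.socket(zmq.SUB)           # e.g. a standalone debugging/dumping subscriber
sub.connect("tcp://localhost:5555")
sub.setsockopt(zmq.SUBSCRIBE, b"")  # subscribe to everything

time.sleep(0.2)                     # give the subscription time to propagate (slow joiner)
pub.send(b"protobuf-serialized TPSet bytes would go here")
print(sub.recv())
```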
Figure: TPWindow (span=100) repackaging incoming TPSets of TrigPrims into output TPSets aligned to fixed time windows
◮ Repackage TrigPrims into new TPSets whose start (data) times fall in windows starting on a fixed boundary
◮ Input hits are not required to be sorted by start time. Hits are buffered for a configurable time before being sent out; late hits are dropped
◮ Full details: https://github.com/brettviren/ptmp/blob/master/docs/tpwindow.org
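◮ A rough sketch of the behaviour described above: hits are binned into fixed windows of a given span, buffered, and emitted per window, with late hits dropped. The real TPWindow buffers for a configurable time; the flush interface below is a simplification.

```python
# Illustrative sketch of TPWindow-style repackaging: hits are emitted in TPSets whose
# start times fall in fixed windows [k*span, (k+1)*span); hits older than the last
# emitted window are dropped as "late". This is a simplification of ptmp's TPWindow.

class TPWindowSketch:
    def __init__(self, span, emit):
        self.span = span          # window length in data-time ticks
        self.emit = emit          # callback taking (window_start_time, list_of_hits)
        self.buffer = {}          # window index -> buffered hits
        self.last_emitted = -1    # highest window index already sent out

    def add_hit(self, hit):
        index = hit["tstart"] // self.span
        if index <= self.last_emitted:
            return                # late hit: its window has already been sent; drop it
        self.buffer.setdefault(index, []).append(hit)

    def flush_older_than(self, index):
        # Called periodically (the real code buffers for a configurable time instead)
        for k in sorted(w for w in self.buffer if w < index):
            self.emit(k * self.span, self.buffer.pop(k))
            self.last_emitted = max(self.last_emitted, k)
```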
Figure: TPZipper merging several input streams of TPSets into a single time-ordered output stream
◮ Aggregate multiple TPSet message streams, outputting unchanged TPSets in time order with a maximum latency
◮ The maximum latency is the computer time between the first TPSet for a given data time arriving and all available TPSets for that data time being sent out
◮ TPSets arriving after “their” time has been sent out are considered “tardy” and dropped (e.g. the green “TPSet 2” in the diagram above)
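◮ A rough sketch of the merge itself: pop whichever input currently has the earliest buffered TPSet and emit in global time order, dropping anything tardy. The real TPZipper additionally enforces the maximum latency in computer time, which is omitted here.

```python
# Illustrative sketch of TPZipper-style merging: take per-stream sequences of TPSets and
# emit them in tstart order; TPSets arriving for a time already emitted are "tardy" and
# dropped. The real TPZipper additionally bounds the wait in computer (wall-clock) time.
import heapq

def zip_streams(streams):
    """streams: list of iterables of TPSets (dicts with a 'tstart' key), each already
    time-ordered within itself. Yields TPSets from all streams in global tstart order."""
    iters = [iter(s) for s in streams]
    heap = []
    for i, it in enumerate(iters):
        first = next(it, None)
        if first is not None:
            heapq.heappush(heap, (first["tstart"], i, first))
    last_emitted = float("-inf")
    while heap:
        tstart, i, tpset = heapq.heappop(heap)
        if tstart >= last_emitted:       # otherwise it would be tardy; drop it
            yield tpset
            last_emitted = tstart
        nxt = next(iters[i], None)
        if nxt is not None:
            heapq.heappush(heap, (nxt["tstart"], i, nxt))
```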
Figure: SSP waveforms from the first few channels of a triggered event. Run 9347, event 7
◮ Jen Haigh added code to read out the SSPs in response to a request from the self-trigger system
◮ The data reads out successfully, but I don’t know enough about the photon system to interpret it. The plot shows the waveforms from the first few channels of a triggered event, back when the purity was fairly low. It has signal, so that means it worked??
Figure: FELIX BR internals. A subscriber thread receives netio messages and pushes them to the raw queue and the processing queue; a processing thread pops the processing queue, finds hits, and pushes them to the hit queue and to ptmp; a trigger-match thread receives trigger requests, pops the raw-data and hit queues, and passes fragments to artdaq.
◮ netio messages go into the main BR buffer (for extraction if triggered, 2 s deep) and the TP-finding queue
◮ The TP-finding thread pops netio messages off the queue, expands the WIB format, finds hits, puts the hits in an internal buffer (for extraction if triggered) and sends them to ptmp
◮ Hits are created when they end
◮ One TPSet is created for every N messages, to reduce message overhead (default N = 20, i.e. 240 ticks = 120 µs). The TPSet then contains all hits that ended in that 120 µs
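◮ A sketch of the batching in the last bullet: hits are collected as they end and shipped as one TPSet every N netio messages. The find_hits and send_tpset hooks are placeholders, not the board reader’s actual interfaces.

```python
# Illustrative sketch of batching hits into one TPSet per N netio messages, as in the
# last bullet (default N = 20 messages = 240 TPC ticks = 120 us). Not the BR code itself.

MESSAGES_PER_TPSET = 20
TICKS_PER_MESSAGE = 12     # 240 ticks / 20 messages

def process_messages(messages, find_hits, send_tpset):
    """messages: iterable of raw netio payloads; find_hits(msg) returns the hits that
    *ended* in that message; send_tpset(hits) ships one TPSet downstream (e.g. via ptmp)."""
    pending = []
    for n, msg in enumerate(messages, start=1):
        pending.extend(find_hits(msg))        # hits are created only when they end
        if n % MESSAGES_PER_TPSET == 0:
            if pending:                       # by convention, empty TPSets are not sent
                send_tpset(pending)
            pending = []
```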
Figure: single-channel hit finding, ADCs (median subtracted) vs time (TPC ticks), showing the raw and filtered waveforms, the pedestal, the 25th/75th percentile estimates, the threshold, and hit ends. Run 9619, event 6 (timestamp 0x1168a5f24e41158, 2019-09-09 15:41:35 UTC)
◮ The pedestal is estimated using a modified version of the “frugal streaming” algorithm (arXiv:1407.1121). The 25th and 75th percentiles are estimated similarly to get the interquartile range (IQR); the threshold is 5 times the IQR
◮ Noise filtering is via a 7-tap finite impulse response filter. It doesn’t really do a lot here. Might be the filter coefficient choice/integer approximation, but probably just too few taps
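◮ A sketch of this chain on a single channel, using the basic Frugal-1U update from arXiv:1407.1121 for the pedestal and quartiles (the ProtoDUNE code uses a modified variant), placeholder FIR coefficients, and the 5×IQR threshold. It also mirrors the current quirk, noted on a later slide, that the IQR is computed on the raw waveform while the threshold is applied to the filtered one.

```python
# Illustrative single-channel sketch of the hit finding described above: frugal-streaming
# pedestal and 25th/75th percentile estimates (basic Frugal-1U from arXiv:1407.1121; the
# real code uses a modified variant), a 7-tap FIR filter with placeholder coefficients,
# and a threshold of 5 times the IQR. Hits are created when they end.
import random

def frugal_update(estimate, sample, quantile):
    """One Frugal-1U step: the estimate drifts towards the requested quantile."""
    if sample > estimate and random.random() < quantile:
        return estimate + 1
    if sample < estimate and random.random() < 1 - quantile:
        return estimate - 1
    return estimate

FIR_TAPS = [1, 2, 3, 4, 3, 2, 1]   # placeholder low-pass coefficients, not the real ones

def find_hits(waveform):
    ped = q25 = q75 = waveform[0]   # estimators need some samples to settle
    history = [0] * len(FIR_TAPS)
    hits, in_hit, hit_start, adcsum = [], False, 0, 0
    for t, adc in enumerate(waveform):
        # Pedestal and quartiles are updated on the raw waveform...
        ped = frugal_update(ped, adc, 0.50)
        q25 = frugal_update(q25, adc, 0.25)
        q75 = frugal_update(q75, adc, 0.75)
        threshold = 5 * (q75 - q25)
        # ...but the threshold is applied to the filtered, pedestal-subtracted waveform
        history = history[1:] + [adc - ped]
        filtered = sum(c * s for c, s in zip(FIR_TAPS, history)) / sum(FIR_TAPS)
        if filtered > threshold:
            if not in_hit:
                in_hit, hit_start, adcsum = True, t, 0
            adcsum += filtered
        elif in_hit:               # first sample back below threshold ends the hit
            hits.append({"tstart": hit_start, "tspan": t - hit_start, "adcsum": adcsum})
            in_hit = False
    return hits
```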
Figure: the same channel over the full readout window, ADCs (median subtracted) vs time (TPC ticks), with the same raw/filtered/pedestal/percentile/threshold/hit-end curves. Run 9619, event 6 (timestamp 0x1168a5f24e41158, 2019-09-09 15:41:35 UTC)
◮ I implemented the current pedestal RMS scheme to show it can be done efficiently, and because it’s cool. Is it necessary?
◮ Setting the hit threshold as “Nσ” is ∼fixed-rate, but setting the hit threshold as N ADC is ∼fixed-efficiency. Not so obvious to me which is best
◮ The pedestal and RMS don’t really change on the single-tick timescale, so maybe this scheme isn’t worth the processing
◮ A 5 “sigma” threshold is way too high for serious physics
◮ The IQR is calculated in raw ADC but applied in filtered ADC. Should be made consistent
Figure: CPU usage before and after the improvement
◮ Change which CPUs run which tasks. Big improvements in CPU usage and startup behaviour (probably startup latency too) because of improved memory access
◮ See the backups and https://its.cern.ch/jira/browse/NP04DAQ-96 for details
Figure: a dual-socket machine with two NUMA nodes, each with a CPU (×24 cores) and its local RAM; access within a node is fast
◮ RAM is “closer” to one CPU than the other. CPUs have faster access to the “closer” RAM, slower access to the “further” RAM
◮ Items within the same NUMA node are closest
Figure: as above, with the 100 Gb NIC attached to NUMA node 0; data from the FELIX host (×10 links) arrives by DMA
◮ Dual-socket machines. The 100 Gb NIC is connected to NUMA node 0
Figure: before the improvement, per-link subscriber threads (Link 0 sub, Link 1 sub, ...) run on NUMA node 0 and hit-finding threads (Link 0 hitfind, Link 1 hitfind, ...) run on NUMA node 1
◮ netio subscriber threads (one per link) poll for new data, add it to a ring buffer, and send another copy to the hit-finding threads. All run on NUMA node 0
◮ Hit-finding threads take the raw data, extract the collection channels, and do the pedestal, filtering and hit finding. All run on NUMA node 1
◮ The full data stream crosses NUMA nodes (the slow path)
Figure: after the improvement, each link’s subscriber and hit-finding threads run on the same NUMA node, with the links split between the two nodes
◮ Half of the links have their subscriber and hit-finding threads on NUMA node 0; the other half have them on NUMA node 1
◮ Only half of the data stream crosses NUMA nodes
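◮ A sketch of that pinning using Linux CPU affinity from Python; the real board readers set affinity differently, and the core numbering (0–23 on node 0, 24–47 on node 1) is an assumption based on the 2 × 24-core layout.

```python
# Illustrative sketch: pin each link's subscriber and hit-finding threads to the same
# NUMA node, with the 10 links split across the two nodes. Uses Linux CPU affinity via
# os.sched_setaffinity; the core numbering assumes a 2 x 24-core machine.
import os
import threading

NODE_CORES = {0: set(range(0, 24)), 1: set(range(24, 48))}  # assumed core layout
N_LINKS = 10

def pinned_worker(name, node):
    os.sched_setaffinity(0, NODE_CORES[node])   # pid 0 = the calling thread
    print(f"{name} running on NUMA node {node}")
    # ... the subscriber or hit-finding loop would go here ...

threads = []
for link in range(N_LINKS):
    node = 0 if link < N_LINKS // 2 else 1       # half the links on each node
    for role in ("sub", "hitfind"):
        t = threading.Thread(target=pinned_worker, args=(f"link{link}-{role}", node))
        t.start()
        threads.append(t)
for t in threads:
    t.join()
```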