
DAQ development and operations at PDSP DUNE Collaboration Week



  1. DAQ development and operations at PDSP
     DUNE Collaboration Week, 2019-05-20
     Roland Sipos, CERN EP-DT

  2. Overview
     This talk is about the development and operations of the ProtoDUNE-SP DAQ. Three main elements:
     ● Detector operations
       ○ Ensure system stability for data taking
       ○ Support for DAQ users
     ● Interventions and developments
       ○ Understanding limitations and issues, eliminate problems
       ○ Tests of new features during the dedicated development days
     ● R&D towards DUNE DAQ
       ○ DUNE DAQ components development
       ○ Integration of the stable new features

  3. DAQ operations
     ● Dedicated periods for DAQ tests + weekly DAQ Development Day
       ○ Development Fridays moved to Monday
       ○ In order to avoid starting the week with hidden issues in the system
     ● Several problems reported by DAQ users
       ○ System wide stability issues in January
       ○ Efforts for better issue tracking
     ● Requirements from Detector Operations
       ○ Extremely useful for understanding limitations of the system

  4. DAQ support approach
     Currently the DAQ support approach is informal and operates on a best-effort basis. This is not sustainable. We are aware of it and are continuously working on improvements.
     Planning the re-introduction of the on-call DAQ shift for ProtoDUNE:
     ● More formal
     ● More fair-share
     ● Full-remote is feasible (first level support)
     ● Better understanding of hidden problems
     ● Perfect crash course for new developers

  5. NP04 DAQ JIRA
     Slack is a great tool for communication, not so much for keeping track of progress. We introduced an issue tracker for ongoing developments and pending problems.
     ● JIRA seems to help tracking issues
       ○ Still not too many users, but slowly growing
     ● Long-time open To-Do tickets
       ○ We need to encourage developers to follow up on issues, and also to track their progress
     ● Clear indication of components that lost manpower
       ○ The lost manpower of critical components needs to be compensated

  6. New features
     ● Prepared ColdBox readout
       ○ Finalized: the DAQ is fully prepared for APA7
       ○ Partition 6 with RCE readout
       ○ This might change, as we gradually move to full FELIX readout
     ● New hardware triggers
       ○ HV current limit threshold
       ○ Ground plane signals
       ○ Purity monitor signals
     Under development:
     ● Disable triggers from the DCS
       ○ For automated purity monitor runs

  7. Interventions
     DAQ servers
     ● Partition issues of servers eliminated
     ● Kernel upgrade campaign
       ○ To avoid early Meltdown/Spectre mitigation (retpoline)
       ○ Aligned configuration
     ● Cold restart test
       ○ Some servers have reboot/poweroff issues
     ● Device maintenance
       ○ SSD firmware upgrade of FELIX servers
       ○ Intel QAT driver automation (with user support)
     Services
     ● Supervised OP mon. restart procedures
       ○ WIB and SSP operational monitoring scripts
     ● Kibana log aggregation
       ○ Still have some issues

  8. High-rate runs
     Needed for several noise study runs, which take substantial time (max.: 3 hours)
     ● Stabilized at 40Hz
       ○ Design goal: 25Hz
     There are still some hidden issues under the hood!
     ● Investigating the 10Gb network
     ● UDP messages get lost in RoutingTable update acknowledgements
       ○ The introduction of the DFO will eliminate the use of UDP messaging for the routing table

  9. DAQ development
     ● There were/are several activities to improve the DAQ:
       ○ ArtDAQ
         ■ Several parts of the framework (details on next slide)
       ○ FELIX
         ■ Align software versions to newest ATLAS FELIX suite
         ■ Better operational monitoring and automated error recovery
       ○ Run Control
         ■ Alarm system improvements
       ○ System administration
         ■ Automation of missing elements (e.g.: FELIX)
     ● New features of different components
       ○ Feature requests
       ○ Continuously discussed and followed up

  10. ArtDAQ
      Substantial improvements in the DAQ software framework (Many thanks to Kurt and the ArtDAQ developers)
      ● Routing Master improvements
      ● RoundRobin routing policy issues fixed
      ● EventBuilder fault-tolerance
        ○ Crashed EBs can be restarted in the same run!
      ● EventBuilder FragmentWatcher plugin
        ○ Event integrity check
      ● Geographic grouping
        ○ Ongoing work to group FELIX data from each APA in its own art/ROOT data product
      ● Studies on better development workflow
        ○ And on work area packaging/handling
      ● Components for the self-triggering chain are under development

  11. Event integrity metrics
      Offline experts reported incomplete events in data, therefore a new plugin (FragmentWatcher) for the EventBuilder was introduced
      ● EB reports on event completeness
        ○ Missing fragments
        ○ Empty fragments
      ● ~1.5% of events have empty fragments!
        ○ Mostly from SSP BoardReaders
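The completeness check described on this slide can be illustrated with a minimal Python sketch. It is not the FragmentWatcher plugin and does not use the artdaq API. The `Fragment` class, the `check_event` and `empty_fraction` functions, and the integer source IDs are hypothetical names chosen for illustration. The sketch assumes an event is a list of fragments, each tagged with the ID of the BoardReader that produced it.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """Hypothetical stand-in for one data fragment of an event."""
    source_id: int   # assumed: ID of the producing BoardReader
    payload: bytes   # raw data; zero length means an empty fragment


def check_event(fragments, expected_sources):
    """Report missing and empty fragments for a single event."""
    seen = {f.source_id for f in fragments}
    missing = sorted(set(expected_sources) - seen)
    empty = sorted(f.source_id for f in fragments if len(f.payload) == 0)
    return {
        "missing": missing,
        "empty": empty,
        "complete": not missing and not empty,
    }


def empty_fraction(events, expected_sources):
    """Fraction of events that contain at least one empty fragment."""
    if not events:
        return 0.0
    flagged = sum(
        1 for ev in events if check_event(ev, expected_sources)["empty"]
    )
    return flagged / len(events)


# Usage: source 3 never arrives, source 2 arrives with no payload.
event = [Fragment(1, b"\x01\x02"), Fragment(2, b"")]
report = check_event(event, expected_sources=[1, 2, 3])
```

A per-run value such as the ~1.5% quoted on the slide would correspond to `empty_fraction` evaluated over all events of that run, under the assumptions above.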

  12. RunControl improvements
      Aimed at improving the user experience and warning users if the system is in an error state
      ● Bug-fixes
      ● Separated logical elements from GUI
      ● EventBuilder metrics integrated with OP monitoring
      ● Improved alarms
        ○ Introduction of a RunControl bot on Slack

  13. Extending the FELIX readout
      RCE APAs are gradually moving to FELIX
      ● Planning of resources
        ○ APA4 moves to FELIX in June (Half side of the detector read-out by FELIX)
      ● Performance and topology evaluation of servers

  14. R&D towards DUNE
      ● ProtoDUNE is the potential test facility for DUNE DAQ prototypes
        ○ HitFinding
        ○ Self-triggering chain
        ○ Single host FELIX setup
        ○ Co-processor
        ○ Control and Configuration Management (CCM studies)
        ○ Fault-recovery
        ○ … and many more!
      ● The only (currently) available platform for system level integration

  15. HitFinding
      From Philip Rodrigues
      ● Software implementation, using Intel AVX2 registers and instructions
      ● Keeps up with the dataflow!
      1 WIB frame (464 B) => 256 ADCs + headers @ 2 MHz
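The frame figures on this slide imply a data rate that the hit finder has to sustain. The short Python sketch below works through that arithmetic. The frame size, ADC count and frame rate are taken from the slide, and the 12-bit ADC width from the TPC trigger slide. The function names are illustrative, and the result refers to one stream of WIB frames only. How many such streams a host processes is not stated here.

```python
# Figures quoted on the slides.
WIB_FRAME_BYTES = 464       # size of one WIB frame
ADCS_PER_FRAME = 256        # ADC samples carried by one frame
FRAME_RATE_HZ = 2_000_000   # frames per second (2 MHz)
ADC_BITS = 12               # ADC width, from the TPC trigger slide


def stream_throughput_bytes_per_s():
    """Raw bytes per second for one stream of WIB frames."""
    return WIB_FRAME_BYTES * FRAME_RATE_HZ


def adc_payload_bytes():
    """Bytes per frame occupied by the packed ADC samples."""
    return ADCS_PER_FRAME * ADC_BITS // 8


def header_bytes():
    """Bytes per frame left over for headers."""
    return WIB_FRAME_BYTES - adc_payload_bytes()


def time_budget_ns_per_frame():
    """Average processing time available per frame to keep up."""
    return 1e9 / FRAME_RATE_HZ


print(stream_throughput_bytes_per_s())  # 928000000 bytes/s, i.e. 928 MB/s
print(adc_payload_bytes(), header_bytes())  # 384 80
print(time_budget_ns_per_frame())  # 500.0
```

Keeping up with the dataflow therefore means processing 256 samples roughly every 500 ns per stream on average, which is the kind of budget that motivates vectorized (AVX2) code.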

  16. Data reordering
      From Giovanna’s plenary DAQ talk
      ● Unpack and extend collection channels with AVX2 code
      ● This is a heavy operation for the CPU
      ● FELIX modified to perform channels reordering in FW
        ○ This implies the need of extending the FELIX Overlay with another version
      Preliminary tests show gain in CPU utilization

  17. TPC trigger
      From Giovanna’s plenary DAQ talk
      ● Get the complete stream of TPC raw data
      ● Reformat WIB frames to
        ○ Expand 12 bit ADCs into 16 bits
        ○ Reorder wires in order to select only the collection plane
      ● Find “hits” in the stream
      ● Combine information of hits in order to form track candidates
      ● Implement software-based trigger logic
      Full chain tests are already ongoing: FELIX BR -> HitFinder BR -> SoftwareTrigger BR
      Close to reality: the full chain will be tested during the next DAQ testing periods at NP04
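Two of the steps listed above, expanding 12-bit ADCs into 16-bit values and finding hits in the stream, can be sketched in plain Python. This is only an illustration under stated assumptions. The byte layout used by `expand_12_to_16` (two samples packed little-endian into three bytes) is a generic packing and not the actual WIB frame layout. The fixed pedestal and threshold in `find_hits` are simplifications and not the algorithm used at ProtoDUNE, whose production code uses AVX2. All function names are hypothetical.

```python
from array import array


def expand_12_to_16(packed: bytes) -> array:
    """Expand packed 12-bit samples into an array of 16-bit integers.

    Assumed layout (not the real WIB format): every 3 bytes hold two
    samples, low bits first.
    """
    if len(packed) % 3 != 0:
        raise ValueError("packed length must be a multiple of 3 bytes")
    out = array("H")  # unsigned 16-bit storage
    for i in range(0, len(packed), 3):
        b0, b1, b2 = packed[i], packed[i + 1], packed[i + 2]
        out.append(b0 | ((b1 & 0x0F) << 8))  # first sample
        out.append((b1 >> 4) | (b2 << 4))    # second sample
    return out


def find_hits(samples, pedestal, threshold):
    """Return (start_tick, end_tick, charge) for each run of samples
    that exceeds pedestal + threshold. Charge is the pedestal-subtracted
    sum over the run."""
    hits = []
    start = None
    charge = 0
    for tick, adc in enumerate(samples):
        if adc - pedestal > threshold:
            if start is None:
                start, charge = tick, 0
            charge += adc - pedestal
        elif start is not None:
            hits.append((start, tick - 1, charge))
            start = None
    if start is not None:  # hit still open at the end of the stream
        hits.append((start, len(samples) - 1, charge))
    return hits


# Usage: unpack two samples, then search a short waveform for hits.
unpacked = expand_12_to_16(bytes([0xBC, 0x3A, 0x12]))
waveform = array("H", [500, 500, 520, 540, 510, 500])
hits = find_hits(waveform, pedestal=500, threshold=5)
```

In a chain like the one on the slide, the hit tuples from many channels would then be combined downstream to form track candidates.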

  18. OnHost FELIX BoardReader
      Goal: Elimination of the 100Gb/s peer-to-peer connection between the FELIX host server and the BoardReader application. Merging the FELIX data processing software with the BoardReader’s data selection.
      Gain: Less space requirement (1 less server) with less cost (1 x server and 2 x NICs). Also R&D towards DUNE approach.
      First working version. (Not production ready, needs manual adjustments)

  19. Outlook
      ● Organizing the weekly development days in advance
      ● Preparation for the June/July DAQ testing period has the highest priority
        Main goals:
        ○ Introduction of the DataFlow Orchestrator in the EB
        ○ Eliminate the Routing Master
        ○ Still support only full event building
        ○ Introduce software hit finding, trigger candidate and module trigger applications
        ○ Introduce FELIX firmware with data reordered by planes
        ○ Provide FELIX readout for APA4
      ● Collecting requirements from the WGs for the sprint

  20. End
      Thank you for your attention!
