DAQ development and operations at PDSP DUNE Collaboration Week - - PowerPoint PPT Presentation

SLIDE 1

DAQ development and operations at PDSP

DUNE Collaboration Week, 2019-05-20
Roland Sipos, CERN EP-DT

SLIDE 2

Overview

This talk is about the development and operations of the ProtoDUNE-SP DAQ.

Three main elements:

  • Detector operations

○ Ensure system stability for data taking
○ Support for DAQ users

  • Interventions and developments

○ Understanding limitations and issues, eliminating problems
○ Tests of new features during the dedicated development days

  • R&D towards DUNE DAQ

○ DUNE DAQ components development
○ Integration of the stable new features

SLIDE 3

DAQ operations

  • Dedicated periods for DAQ tests + weekly DAQ Development Day

○ Development Fridays moved to Monday
○ In order to avoid starting the week with hidden issues in the system

  • Several problems reported by DAQ users

○ System-wide stability issues in January
○ Efforts for better issue tracking

  • Requirements from Detector Operations

○ Extremely useful for understanding limitations of the system

SLIDE 4

DAQ support approach

Currently the DAQ support approach is informal and operates on a best-effort basis. This is not sustainable. We are aware of it and are continuously working on improvements.

We are planning to re-introduce an on-call DAQ shift for ProtoDUNE:

  • More formal
  • More fair-share
  • Full-remote is feasible (first level support)
  • Better understanding of hidden problems
  • Perfect crash course for new developers

SLIDE 5

NP04 DAQ JIRA

Slack is a great tool for communication, but not so much for keeping track of progress. We introduced an issue tracker for ongoing developments and pending problems.

  • JIRA seems to help with tracking issues

○ Still not too many users, but slowly growing

  • Long-time open To-Do tickets

We need to encourage developers to follow up on issues, and also to track their progress

  • Clear indication of components that lost manpower

The lost manpower of critical components needs to be compensated

SLIDE 6

New features

  • Prepared ColdBox readout

○ Finalized: the DAQ is fully prepared for APA7
○ Partition 6 with RCE readout
○ This might change, as we gradually move to full FELIX readout

  • New hardware triggers

○ HV current limit threshold
○ Ground plane signals
○ Purity monitor signals

Under development:

  • Disable triggers from the DCS

○ For automated purity monitor runs

SLIDE 7

Interventions

DAQ servers

  • Partition issues of servers eliminated
  • Kernel upgrade campaign

○ To avoid early Meltdown/Spectre mitigation (retpoline)
○ Aligned configuration

  • Cold restart test

○ Some servers have reboot/poweroff issues

  • Device maintenance

○ SSD firmware upgrade of FELIX servers
○ Intel QAT driver automation (with user support)

Services

  • Supervised OP mon. restart procedures

○ WIB and SSP operational monitoring scripts

  • Kibana log aggregation

○ Still have some issues

SLIDE 8

High-rate runs

Needed for several noise study runs, which take substantial time (max.: 3 hours)

  • Stabilized at 40 Hz

○ Design goal: 25 Hz

Still there are some hidden issues under the hood!

  • Investigating the 10Gb network
  • UDP messages get lost in RoutingTable update acknowledgements

○ The introduction of the DataFlow Orchestrator (DFO) will eliminate the use of UDP messaging for the routing table

SLIDE 9

DAQ development

  • There were/are several activities to improve the DAQ:

○ ArtDAQ
  ■ Several parts of the framework (details on next slide)
○ FELIX
  ■ Align software versions to newest ATLAS FELIX suite
  ■ Better operational monitoring and automated error recovery
○ Run Control
  ■ Alarm system improvements
○ System administration
  ■ Automation of missing elements (e.g.: FELIX)

  • New features of different components

○ Feature requests
○ Continuously discussed and followed up

SLIDE 10

ArtDAQ

Substantial improvements in the DAQ software framework (Many thanks to Kurt and the ArtDAQ developers)

  • Routing Master improvements
  • RoundRobin routing policy issues fixed
  • EventBuilder fault-tolerance

○ Crashed EBs can be restarted in the same run!

  • EventBuilder FragmentWatcher plugin

○ Event integrity check

  • Geographic grouping

○ Ongoing work to group FELIX data from each APA in its own art/ROOT data product

  • Studies on better development workflow

○ And on work area packaging/handling

  • Components for the self-triggering chain are under development

SLIDE 11

Event integrity metrics

Offline experts reported incomplete events in the data, so a new plugin (FragmentWatcher) for the EventBuilder was introduced.

  • EB reports on event completeness

○ Missing fragments
○ Empty fragments

  • ~1.5% of events have empty fragments!

○ Mostly from SSP BoardReaders
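
The completeness report described above can be sketched in a few lines. This is a minimal illustration and not the FragmentWatcher implementation: the event representation (a dict from fragment ID to payload bytes), the function names `check_event` and `empty_fraction`, and the fragment IDs are all hypothetical, and the example data is made up.

```python
def check_event(fragments, expected_ids):
    """Return (missing, empty) fragment IDs for one event.

    fragments: dict mapping fragment ID -> payload bytes (hypothetical layout).
    expected_ids: IDs of all fragments a complete event should contain.
    """
    missing = sorted(set(expected_ids) - set(fragments))
    empty = sorted(fid for fid, payload in fragments.items() if len(payload) == 0)
    return missing, empty


def empty_fraction(events, expected_ids):
    """Fraction of events that contain at least one empty fragment."""
    if not events:
        return 0.0
    n_with_empty = sum(1 for ev in events if check_event(ev, expected_ids)[1])
    return n_with_empty / len(events)


# Made-up example data: three events, three expected fragment sources.
expected = ["felix0", "rce0", "ssp0"]
events = [
    {"felix0": b"\x01\x02", "rce0": b"\x03", "ssp0": b"\x04"},  # complete
    {"felix0": b"\x01\x02", "rce0": b"\x03", "ssp0": b""},      # empty SSP fragment
    {"felix0": b"\x01\x02", "ssp0": b"\x04"},                   # missing RCE fragment
]

for ev in events:
    print(check_event(ev, expected))
print(empty_fraction(events, expected))
```

Keeping "missing" and "empty" as separate categories mirrors the two metrics listed on the slide, so each can be tracked per fragment source.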

SLIDE 12

RunControl improvements

To improve the user experience and warn users when the system is in an error state:

  • Bug-fixes
  • Separated logical elements from GUI
  • EventBuilder metrics integrated with OP monitoring
  • Improved alarms

○ Introduction of a RunControl bot on Slack

SLIDE 13

Extending the FELIX readout

RCE APAs are gradually moving to FELIX

  • Planning of resources

○ APA4 moves to FELIX in June (half of the detector will then be read out by FELIX)

  • Performance and topology evaluation of servers

SLIDE 14

R&D towards DUNE

  • ProtoDUNE is the potential test facility for DUNE DAQ prototypes

HitFinding

Self-triggering chain

Single host FELIX setup

Co-processor

Control and Configuration Management (CCM studies)

Fault-recovery

… and many more!

  • The only (currently) available platform for system level integration

SLIDE 15

HitFinding

From Philip Rodrigues

  • Software implementation, using Intel AVX2 registers and instructions
  • Keeps up with the dataflow!

1 WIB frame (464 B) => 256 ADCs + headers @ 2 MHz
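
The frame figures above imply a per-link data rate that the hit finder has to sustain. The arithmetic can be sketched as follows, under two assumptions that are not stated on this slide: the ADC samples are 12 bits wide and packed without padding, and "2 MHz" is the frame rate on a single link.

```python
# Numbers quoted on the slide.
FRAME_BYTES = 464          # size of one WIB frame
ADCS_PER_FRAME = 256       # ADC samples carried by one frame
FRAME_RATE_HZ = 2_000_000  # 2 MHz, read here as frames per second (assumption)

# Assumption: 12-bit ADC samples, packed without padding.
ADC_BITS = 12

payload_bytes = ADCS_PER_FRAME * ADC_BITS // 8   # bytes of packed ADC data
header_bytes = FRAME_BYTES - payload_bytes       # remainder attributed to headers
rate_bytes_per_s = FRAME_BYTES * FRAME_RATE_HZ   # sustained input rate per link
rate_gbit_per_s = rate_bytes_per_s * 8 / 1e9

print(f"payload per frame: {payload_bytes} B, headers: {header_bytes} B")
print(f"rate per link: {rate_bytes_per_s / 1e6:.0f} MB/s ({rate_gbit_per_s:.3f} Gb/s)")
```

Under these assumptions one link delivers 928 MB/s, of which 384 B per frame are ADC samples and 80 B are headers.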

SLIDE 16

Data reordering

  • Unpack and extend collection channels with AVX2 code
  • This is a heavy operation for the CPU
  • FELIX modified to perform channel reordering in firmware (FW)

○ This implies the need to extend the FELIX Overlay with another version

Preliminary tests show gain in CPU utilization
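
To make the "unpack and extend" step concrete, here is a scalar sketch of expanding packed 12-bit samples into 16-bit-sized integers. It is not the AVX2 code referred to above, and the packing layout used here (two samples in three bytes, low bits first) is a hypothetical one chosen for illustration; the real WIB frame layout is not described on these slides. The helper `pack12` exists only to build test input.

```python
def pack12(values):
    """Pack an even number of 12-bit values, two per three bytes (hypothetical layout)."""
    if len(values) % 2:
        raise ValueError("need an even number of samples")
    out = bytearray()
    for a, b in zip(values[0::2], values[1::2]):
        out.append(a & 0xFF)                       # low 8 bits of a
        out.append((a >> 8) | ((b & 0x0F) << 4))   # high 4 bits of a, low 4 bits of b
        out.append(b >> 4)                         # high 8 bits of b
    return bytes(out)


def expand12to16(packed):
    """Expand packed 12-bit samples into a list of integers that each fit in 16 bits."""
    if len(packed) % 3:
        raise ValueError("packed length must be a multiple of 3 bytes")
    out = []
    for i in range(0, len(packed), 3):
        b0, b1, b2 = packed[i:i + 3]
        out.append(b0 | ((b1 & 0x0F) << 8))
        out.append((b1 >> 4) | (b2 << 4))
    return out


samples = [0, 1, 4095, 2048, 0xABC, 0x123]
packed = pack12(samples)
print(len(packed), expand12to16(packed) == samples)
```

Every sample costs several shifts and masks, which gives a feel for why doing this on the full data stream is heavy for the CPU and why moving the reordering into firmware is attractive.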

From Giovanna’s plenary DAQ talk

SLIDE 17

TPC trigger

  • Get the complete stream of TPC raw data
  • Reformat WIB frames to:

○ Expand 12-bit ADCs into 16 bits
○ Reorder wires in order to select only the collection plane

  • Find “hits” in the stream
  • Combine information of hits in order to form track candidates
  • Implement software-based trigger logic
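
The "find hits in the stream" step can be illustrated with a simple threshold-based hit finder. This is only a sketch of the general idea and not the algorithm used in the ProtoDUNE chain: the fixed pedestal, the threshold value, the `Hit` fields and the function name `find_hits` are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    channel: int
    start: int    # index of the first sample over threshold
    end: int      # index of the last sample over threshold
    charge: int   # sum of pedestal-subtracted ADC values over the hit


def find_hits(waveform, channel, pedestal, threshold):
    """Group contiguous samples above (pedestal + threshold) into hits."""
    hits = []
    start = None
    charge = 0
    for i, adc in enumerate(waveform):
        signal = adc - pedestal
        if signal > threshold:
            if start is None:          # a new hit begins
                start, charge = i, 0
            charge += signal
        elif start is not None:        # the current hit ends
            hits.append(Hit(channel, start, i - 1, charge))
            start = None
    if start is not None:              # hit still open at the end of the waveform
        hits.append(Hit(channel, start, len(waveform) - 1, charge))
    return hits


# Made-up waveform on a hypothetical collection channel 7.
waveform = [500, 501, 499, 520, 540, 530, 500, 500, 525, 500]
for hit in find_hits(waveform, channel=7, pedestal=500, threshold=10):
    print(hit)
```

Hits from many channels, each carrying a channel number and a time range, are the kind of input a later stage could combine into track candidates.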

Full chain tests are already ongoing: FELIX BR -> HitFinder BR -> SoftwareTrigger BR

This work is ongoing. Close to reality: the full chain will be tested during the next DAQ testing periods at NP04.

From Giovanna’s plenary DAQ talk

SLIDE 18

OnHost FELIX BoardReader

Goal: Eliminate the 100 Gb/s peer-to-peer connection between the FELIX host server and the BoardReader application, by merging the FELIX data processing software with the BoardReader’s data selection.

Gain: Less space required (1 fewer server) at lower cost (1 x server and 2 x NICs)

Also R&D towards the DUNE approach

First working version (not production ready, needs manual adjustments)

SLIDE 19

Outlook

  • Organizing the weekly development days in advance
  • Preparation for the June/July DAQ testing period has the highest priority

Main goals:

○ Introduction of the DataFlow Orchestrator in the EB
○ Eliminate the Routing Master
○ Still support only full event building
○ Introduce software hit finding, trigger candidate and module trigger applications
○ Introduce FELIX firmware with data reordered by planes
○ Provide FELIX readout for APA4

  • Collecting requirements from the WGs for the sprint

SLIDE 20

End

Thank you for your attention!