Data Format and Packaging, An Update. Kurt Biery, 18 March 2020.



SLIDE 1

Data Format and Packaging, An Update

Kurt Biery 18 March 2020 DUNE DAQ Dataflow Working Group Meeting

SLIDE 2

Data ‘Format’

At the DAQ workshop, it was proposed to focus our data format investigations on

  • A DUNE-specific binary format stored in HDF5 files

In the (admittedly small number of) subsequent discussions, this proposal has been received positively. Eric Flumerfelt has done preliminary work demonstrating the writing of artdaq::Fragments (a la PDSP) in HDF5. Next steps: share what has been done so far with a few more technical experts from offline and online, gather feedback, and run tests (encoding/decoding speed, etc.).
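As a purely illustrative sketch of what "a DUNE-specific binary format" can mean in practice, the Python below packs a fragment as a hypothetical fixed-size header followed by a raw payload; the field list and layout are invented for this example and are not the format used in the HDF5 work mentioned above. Blobs like these could then be stored as datasets inside an HDF5 file.

```python
import struct

# Hypothetical fragment layout, for illustration only: a fixed-size
# binary header followed by a raw payload. The real DUNE-specific format
# was still being designed at the time of this talk; the fields here
# (trigger number, geographic ID, window start/end timestamps, payload
# size) are invented for the sketch.
HEADER_FMT = "<QQQQQ"  # five little-endian 64-bit unsigned integers
HEADER_SIZE = struct.calcsize(HEADER_FMT)

def pack_fragment(trigger_num, geo_id, t_start, t_end, payload):
    """Prepend the fixed-size header to a raw payload."""
    header = struct.pack(HEADER_FMT, trigger_num, geo_id,
                         t_start, t_end, len(payload))
    return header + payload

def unpack_fragment(blob):
    """Split a packed fragment back into its header fields and payload."""
    trigger_num, geo_id, t_start, t_end, size = struct.unpack_from(HEADER_FMT, blob)
    payload = blob[HEADER_SIZE:HEADER_SIZE + size]
    return (trigger_num, geo_id, t_start, t_end), payload
```

Round-tripping a fragment through pack/unpack is the kind of encoding/decoding-speed test mentioned above.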

18-Mar-2020 2 Data Format and Packaging Update

SLIDE 3

Data ‘Packaging’

‘Packaging’ ~= ‘grouping and subdividing’

  • Determining how file boundaries are managed…
  • What appears in each file…

At the DAQ workshop, we used different ‘types of data’ as a starting point for discussion, which may have been misleading. Here, I’d like to start with different types of packaging and come back to different types of data later…

SLIDE 4

Data packaging choices (parameters)

Some of the parameters that can be used to specify how data is grouped into files:

  • 1. Whether or not the data in each file on disk will have geographically complete coverage (superset; the Trigger Decision has details)
    • If not, what subdivision will be used
  • 2. The maximum size of the files that will be created
  • 3. The maximum time interval/duration that will be stored in a single file (data time or wall clock time both seem possible)
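As a sketch only, the parameters above could be captured in a small structure like the following; the field names and default values are invented for illustration, not taken from any DUNE schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PackagingParams:
    """Sketch of the packaging parameters on this slide.
    Field names and defaults are invented for illustration."""
    geographically_complete: bool            # parameter 1
    subdivision: Optional[str] = None        # sub-choice of 1: e.g. "APA"
    max_file_size_bytes: int = 4 * 1024**3   # parameter 2
    max_duration_seconds: float = 1800.0     # parameter 3 (data or wall-clock time)
```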

SLIDE 5

Priority of the choices

Given the choices described on the previous slide, we can imagine sets of answers/values for #1, #2, and #3 that can’t simultaneously be satisfied. So we would need to specify which one(s) are the most important. For example:

  • For trigger type Y during normal running, the file size specification is the most important.

SLIDE 6

Part of the configuration for DF?

Should we (Dataflow subsystem) support a set of configuration parameters, keyed by data type, that specifies how the data for that data type is packaged? I believe that we can identify the set of parameters that will be needed to specify how the data files for a given data type should be handled. Discussion of the parameter values can, and should, be deferred until closer to data taking.
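A minimal sketch of such a configuration, keyed by data type, might look like the following; all key names and values are invented for the sketch and are not a proposed DUNE schema.

```python
# Purely illustrative configuration block, keyed by data type as the
# slide proposes; key names and values are invented.
PACKAGING_CONFIG = {
    "beam_trigger": {
        "integer_trigger_records": True,
        "geographically_complete": True,
        "max_file_size_bytes": 4 * 1024**3,
        "max_duration_seconds": 1800,
    },
    "snb_trigger": {
        "integer_trigger_records": False,
        "geographically_complete": False,
        "subdivision": "APA",
        "max_file_size_bytes": 4 * 1024**3,
        "max_duration_seconds": None,  # TBD; deferred until closer to data taking
    },
}

def packaging_for(data_type):
    """Look up the packaging parameters for a given data type."""
    return PACKAGING_CONFIG[data_type]
```

Deferring the parameter values, as the slide suggests, then amounts to editing this configuration rather than changing code.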

SLIDE 7

Easing back into data types…

My sense is that we have two high-level types:

  • Triggered data
    • Trigger Records that are produced in response to a Trigger Decision
  • Streaming data
    • Data that is collected without Trigger Record boundaries
    • E.g. WIB debugging data
    • The Trigger Primitive stream might also fit in this category

SLIDE 8

Data packaging choices, take 2

For Triggered data:

  • 1. Whether each file on disk will have an integer number of Trigger Records, or whether each file can have a fractional number of Trigger Record(s)

For both Triggered and Streaming data:

  • 2. Whether or not the data in each file on disk will have geographically complete coverage (superset; the Trigger Decision has details)
    • If not, what subdivision will be used
  • 3. The maximum size of files that will be created
  • 4. The maximum time interval/duration that will be stored in a single file (data time or wall clock time both seem possible)

We will still need to specify priority among these…

SLIDE 9

Different ‘types of data’

Types mentioned in earlier discussions:

  • Local Trigger Records – e.g. beam triggers
  • Extended Trigger Records – e.g. SNB triggers
  • Trigger Primitive stream – all TPs
  • WIB debugging stream – temporary stream that can be enabled for debugging

Others may be mentioned/proposed over time…

SLIDE 10

Possible choices for 1 data type

Beam Trigger Records:

  • Integer number of Trigger Records per file: Yes
  • Geographically complete data in each file: Yes (TPC, PDS, Trigger, Timing; superset, Trigger specifies details in the Trigger Decision)
  • Maximum file size: &lt;optimized for offline use&gt;
  • Maximum time duration per file: TBD (0.5 hour?)
  • Priorities: TBD; for example:
    1. If TR size < max_file_size, integer # of TRs; otherwise, file size
    2. Etc.

** These value choices are for illustration only. If we support configurable data packaging in the Dataflow subsystem, then the values can be changed, under the direction of the appropriate physics groups, offline folks, online folks, etc.
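The example priority rule ("if TR size < max_file_size, integer # of TRs; otherwise, file size") can be sketched as a single boundary check. The function and parameter names below are invented for illustration.

```python
def should_close_before(current_file_bytes, next_tr_bytes, max_file_size):
    """Sketch of the example priority rule: keep an integer number of
    Trigger Records per file when a TR fits under the size limit;
    otherwise let the file-size limit win. Returns True when the current
    file should be closed before writing the next Trigger Record."""
    if next_tr_bytes >= max_file_size:
        # Oversized TR: the file-size limit wins and the TR is split
        # across files, so we don't force a close on the TR boundary.
        return False
    # Normal TR: preserve an integer number of TRs per file by closing
    # whenever the next TR would push the file past the size limit.
    return current_file_bytes + next_tr_bytes > max_file_size
```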

SLIDE 11

Possible choices for a 2nd data type

Supernova Burst Trigger Records:

  • Integer number of Trigger Records per file: No
  • Geographically complete data in each file: No
    • Files split by APA (for example); PDS, etc. details TBD
  • Maximum file size: &lt;optimized for offline use&gt;
  • Maximum time window per file: TBD
  • Priorities: TBD; for example:
    1. File size
    2. Etc.

** These value choices are for illustration only.

SLIDE 12

Possible choices for a 3rd data type

The Trigger Primitive Stream:

  • Integer number of Trigger Records per file: n/a
  • Geographically complete data in each file: Yes (TPC, PDS, Trigger, Timing; superset, subdetector components which don’t have TPs won’t contribute)
  • Maximum file size: &lt;optimized for offline use&gt;
  • Maximum time window per file: TBD
  • Priorities:
    1. File size
    2. Etc.

** These value choices are for illustration only. If we support configurable data packaging in the Dataflow subsystem, then the values can be changed, under the direction of the appropriate physics groups, offline folks, online folks, etc.

SLIDE 13

Possible choices for a 4th data type

The WIB Debug Stream:

  • Integer number of Trigger Records per file: n/a
  • Geographically complete data in each file: No
  • Files split by <TBD>
  • Maximum file size: <optimized for offline use>
  • Maximum time window per file: TBD
  • Priorities:
    1. File size
    2. Etc.

** These value choices are for illustration only.

SLIDE 14

Comments

1. Choosing to support this configurability does not necessarily mean that we will need to build a general-purpose rules engine. The options aren’t that numerous; we could simply encapsulate them in a class.

2. Remember that we’re talking about interfaces here… data handoff, and the specification of the packaging. Implementation details within both the online and the offline have freedom…

3. New trigger types that have readout windows in the range of 10-100 seconds can easily be supported by a configurable DF data packaging system – the packaging config would be part of the proposal from the physics group or whomever.

4. Files wouldn’t necessarily need to have consistent “spans” (time window or number of TRs) [metadata files discussed next slide]
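As a sketch of comment 1, the options could be encapsulated in a plain class carrying an ordered priority list, rather than a rules engine. All names and the two example limits below are invented for illustration.

```python
class DataPackagingPolicy:
    """Sketch of comment 1: packaging options encapsulated in a plain
    class with an ordered priority list, instead of a general-purpose
    rules engine. Names and limits are invented for illustration."""

    def __init__(self, max_file_size, max_duration,
                 priorities=("file_size", "duration")):
        self.max_file_size = max_file_size
        self.max_duration = max_duration
        self.priorities = priorities

    def limit_reached(self, file_bytes, file_duration):
        """Return the highest-priority limit that has been reached,
        or None if the file can keep growing."""
        exceeded = {
            "file_size": file_bytes >= self.max_file_size,
            "duration": file_duration >= self.max_duration,
        }
        for name in self.priorities:
            if exceeded[name]:
                return name
        return None
```

Reordering the `priorities` tuple per data type is one simple way to express "which limit wins" without any rule language.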

SLIDE 15

Ideas

  • 1. Data challenge in Feb 2021
  • 2. Metadata and manifest files…
    • Metadata file for each raw data file
    • Manifest file for each TR that spans multiple files
    • Metadata could instead be internal to the raw data file
    • Sample metadata information for SNB files:
      • the trigger number/identifier
      • the APA number (or whatever geographic identifier(s) are appropriate)
      • the beginning and ending timestamps of the trigger window (or start time and window size)
      • the beginning and ending timestamps of the interval that is covered by the individual file (or start time and window size)
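The sample metadata fields for SNB files listed above could be serialized as a small JSON record; every field name and value below is invented for illustration.

```python
import json

# Illustrative metadata record for one SNB file, with one entry per item
# in the sample list above; field names and values are invented.
metadata = {
    "trigger_number": 123456,
    "geo_id": {"apa": 3},
    "trigger_window": {"start_ts": 1_000_000_000, "end_ts": 1_000_100_000},
    "file_interval": {"start_ts": 1_000_010_000, "end_ts": 1_000_020_000},
}

metadata_json = json.dumps(metadata, indent=2)
```

The same record could live in a sidecar metadata file or be embedded in the raw data file itself, as the slide notes.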

SLIDE 16

Backup slides

SLIDE 17

Some topics that have come up

Where to save information about which components are in the partition. In each data file? Each metadata file? (configuration archive, for sure) Reminder: partitions do not span detector cryomodules.

SLIDE 18

Reminder about Tom’s requirements

Tom has summarized the following requirements:

1. longevity of support
2. integrity checks – for the file format as well as the data fragments
3. ability to read in small subsets of the trigger records and drop from memory data no longer being used
4. ability to navigate through a trigger record to get the adjacent time or space samples
5. compression tools
6. browsable with a lightweight, interactive tool
7. ability to handle evolution of data formats and structure gracefully with backward compatibility ensured

https://wiki.dunescience.org/wiki/Project_Requirement_Brainstorming#Data_Format
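Requirement 3 is essentially about partial reads. As a format-agnostic sketch using only the standard library, seeking into a plain file and reading a slice illustrates the access pattern; HDF5 provides the same capability natively through chunked datasets (not shown here, to keep the sketch stdlib-only).

```python
import os
import tempfile

def read_slice(path, offset, length):
    """Read `length` bytes starting at `offset`, without loading the
    whole file into memory."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)

# Tiny demonstration: write ten bytes, then read a four-byte slice.
_fd, _path = tempfile.mkstemp()
os.close(_fd)
with open(_path, "wb") as f:
    f.write(b"0123456789")
sliced = read_slice(_path, 3, 4)
os.remove(_path)
```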

SLIDE 19

Notes on trigger & stream types

  • 'Local Trigger Records' in Georgia's summary of the Aug/Sep 2019 data model discussions, which is on pages 2-4 here - these data are made up of Trigger Records from triggers that specify the readout of the detector for time intervals measured in milliseconds. The Trigger system will decide which fraction of a single detector module to include in each Trigger Record, as we've discussed many times before.
  • Supernova burst data - the ~100 seconds of readout of a full detector module resulting from a supernova burst trigger
  • The Trigger Primitive stream - the constant flow of Trigger Primitives, which is planned to be persisted by the DAQ system (in addition to being delivered to the Trigger system for use in generating Trigger Decisions)
  • A WIB debug stream - an available, but not constant, stream of data (raw waveforms?) from a manageable number of TPC electronics channels that is used for debugging. This stream would be enabled and disabled by authorized humans, when it is needed for debugging.
  • The identification of these different types of data is not intended to imply anything about how they are handled operationally, or how much data they produce, or what fraction of the data is transferred to the offline, etc.
