

SLIDE 1

DUNE Software and Computing Incomplete Overview for NPPS Brett Viren

Physics Department NPPS – 2019-06-05

SLIDE 2

Outline

  • Experiment
  • Online Computing (Far Detector DAQ)
  • Offline Computing

Brett Viren (BNL) DUNE S&C 05 June 2019 2 / 24

SLIDE 3

Experiment

DUNE Experiment and Physics

  • Long-baseline neutrino beam: discover ν CP violation and the ν mass hierarchy; precision ν oscillation parameter measurements.
  • Nucleon decay: targeting SUSY-favored modes (p → K⁺ν̄).
  • Supernova Neutrino Burst (SNB): sensitive throughout the galaxy, sensitive to νe (complementary to water-Cherenkov detectors).


SLIDE 4–9

Experiment (figure-only slides; no text recoverable beyond the section heading)

SLIDE 10

Online Computing (Far Detector DAQ)

DUNE Far Detector DAQ Overview

  • Common DAQ for 4 loosely-coupled 10 kton modules.
  • LAr 3-plane wire readout, LAr/GAr 2-plane strip readout, future LAr pixels?
  • O(10⁶) channels, 2 MHz, 12-bit waveforms, 3–4 TB/s into the DAQ.
  • Output to tape: 30 PB/year, a ≈3000× reduction.
  • Physics drivers of the DAQ design:
  • supernova neutrino burst: 10 s of pre-trigger buffer, 100 s full readout
  • natural ³⁹Ar decay: 10 MHz, 0.5 MeV endpoint energy, must be rejected
  • Self-triggering on ionization activity (largely software-based).
  • Fermilab's artDAQ is used now in protoDUNE and is expected to provide the basis for the DAQ back-end.
  • Considering aggregating triggered data "events" via a distributed file system directory (eg, glusterfs), or as entries in a key-value store (eg, DAQ-DB), following ATLAS R&D and technology studies.
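The throughput and reduction figures above can be cross-checked with trivial arithmetic (a sketch; the seconds-per-year constant and byte rounding are my own assumptions, not from the slide):

```python
# Back-of-envelope check of the DAQ rates quoted above.
channels = 1e6          # O(10^6) wire channels
sample_hz = 2e6         # 2 MHz digitization
bits = 12               # 12-bit waveforms
into_daq_tb_s = channels * sample_hz * bits / 8 / 1e12   # bytes/s -> TB/s
seconds_per_year = 3.15e7   # assumed: ~1 calendar year of continuous readout
to_tape_pb_yr = 30
reduction = into_daq_tb_s * 1e12 * seconds_per_year / (to_tape_pb_yr * 1e15)
print(f"{into_daq_tb_s:.1f} TB/s into DAQ, ~{reduction:.0f}x reduction to tape")
```

This lands at 3 TB/s and roughly 3000×, consistent with the "3–4 TB/s" and "≈3000× reduction" quoted on the slide.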


SLIDE 11

Online Computing (Far Detector DAQ)

FELIX as DUNE FD DAQ Input Interface

FELIX collaboration: BNL, ANL, Bologna, CERN, FNAL, Irvine, Nikhef, UCL and Weizmann. BNL: hardware design and co-development of firmware.

  • Thin custom hardware and FPGA between the detector electronics and the DAQ's commodity computing.
  • R&D shared with ATLAS, protoDUNE, sPHENIX, Belle II, others.
  • Powerful FPGA, 48 optical I/O links (460 Gbps aggregate), support for a daughterboard.
  • Commodity host PC interface: PCIe gen3 x16 (16 GB/s).
  • Goal: 75 front-end PCs, each with 2 FELIX PCIe gen3 boards (per 10 kt).
  • Stretch goal: reduce the PC/board count by 2–4× with PCIe gen4.
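A quick sanity check of the quoted I/O numbers (the comparison is my own arithmetic, not from the slide): the aggregate optical bandwidth exceeds what one PCIe gen3 x16 host link can carry, so per-board input has to stay well below the optical maximum.

```python
# FELIX I/O figures from the slide; derived quantities are my own arithmetic.
links = 48
aggregate_gbps = 460                      # total optical I/O per board
per_link_gbps = aggregate_gbps / links    # ~9.6 Gbps per optical link
pcie_gbps = 16 * 8                        # PCIe gen3 x16 ~ 16 GB/s = 128 Gbps
headroom = pcie_gbps / aggregate_gbps     # fraction of optical max the host can absorb
print(f"{per_link_gbps:.1f} Gbps/link; PCIe host link = {pcie_gbps} Gbps "
      f"({headroom:.0%} of the {aggregate_gbps} Gbps optical aggregate)")
```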


SLIDE 12

Offline Computing

Data Transfer and File Catalog

  • For the prototype detectors (ProtoDUNE), raw data are transferred from CERN to FNAL via Fermi-FTS, using SAM as the data catalog.
  • For DUNE, expecting to transition to Rucio for online→offline transfer and production data management.
  • A replacement for SAM is under consideration.


SLIDE 13

Offline Computing

Major Offline Processing Stages

Signal processing: noise filters and detector-response deconvolution. Heavy use of FFTs. Outputting signal regions of interest gives ≈100× data reduction.

3D imaging: reconstruct ionization activity patterns using fast, compressed-sensing techniques.

Conventional reconstruction: clustering, track/shower modeling, pattern recognition.

Machine learning: dense and sparse CNNs, Graph NNs, GANs.
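As a toy illustration (not Wire-Cell code; the filter form, noise constant, and all names here are invented for the sketch), the FFT-heavy deconvolution step amounts to dividing out the detector response in frequency space, with regularization where the response vanishes:

```python
import numpy as np

def deconvolve(measured, response, noise_power=1e-2):
    """Wiener-style deconvolution: measured = response (*) true + noise."""
    M = np.fft.rfft(measured)
    R = np.fft.rfft(response, n=len(measured))
    # Regularized inverse filter: behaves like 1/R where R is large,
    # and is damped where R ~ 0 (frequencies the response cannot pass).
    W = np.conj(R) / (np.abs(R) ** 2 + noise_power)
    return np.fft.irfft(M * W, n=len(measured))

# Toy input: a unit impulse smeared by a Gaussian "field response".
n = 256
true = np.zeros(n)
true[100] = 1.0
t = np.arange(n)
response = np.exp(-0.5 * ((t - 5) / 2.0) ** 2)
response /= response.sum()
measured = np.convolve(true, response)[:n]
recovered = deconvolve(measured, response)
print(int(np.argmax(recovered)))  # peak restored at sample 100
```

Thresholding the recovered waveform into regions of interest is what drives the ≈100× reduction quoted above.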


SLIDE 14

Offline Computing

Wire-Cell Toolkit

  • Provides leading LAr TPC signal+noise simulation, noise filtering, signal processing, and 3D imaging algorithms.
  • Pattern recognition and charge/light matching in prototype.
  • Toolkit supports the data flow programming paradigm.
  • Dynamic plugin system; comprehensive configuration via Jsonnet.
  • Abstract DFP graph execution engine with multiple implementations:
  • Default is low-memory and single-threaded.
  • Experimental multi-threaded engine based on Intel TBB.
  • A future multi-node engine is possible.
  • Developed and maintained by BNL.
  • Initially for MicroBooNE; now for ProtoDUNE, ICARUS, DUNE, ...
  • Runs as a stand-alone CLI or embedded in Fermilab's art/LArSoft framework.
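The data-flow idea can be caricatured in a few lines (a sketch with invented names, not the WCT API): components are graph nodes, edges carry data between them, and an engine, here single-threaded like WCT's default, walks the graph.

```python
from collections import defaultdict

def run_graph(nodes, edges, sources):
    """nodes: name -> callable(*inputs); edges: (src, dst) pairs;
    sources: name -> initial value. Assumes node names are already
    listed in topological order (good enough for a sketch)."""
    inputs = defaultdict(list)
    results = dict(sources)
    for name in nodes:
        if name not in results:
            results[name] = nodes[name](*inputs[name])
        for src, dst in edges:           # push this node's output downstream
            if src == name:
                inputs[dst].append(results[name])
    return results

# Pipeline mirroring the slide: source -> noise filter -> region of interest.
nodes = {
    "adc": None,                                      # pure source
    "noise_filter": lambda xs: [x for x in xs if abs(x) > 1],
    "roi": lambda xs: xs[:2],
}
edges = [("adc", "noise_filter"), ("noise_filter", "roi")]
out = run_graph(nodes, edges, {"adc": [0.5, 3, -4, 0.2, 7]})
print(out["roi"])  # -> [3, -4]
```

The point of the abstraction is that the same graph description can be handed to a single-threaded, TBB-based, or (in the future) multi-node engine.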


SLIDE 15

Offline Computing

WCT Job Graph for ProtoDUNE 3D Imaging

[Graphviz rendering of the WCT job graph: six AnodePlane instances (apa0–apa5) fed from a wclsCookedFrameSource via a ChannelSplitter, each driving a per-APA pipeline of SumSlices → SliceFanout → GridTiling (one per anode face) → BlobSetSync → BlobClustering → BlobGrouping → BlobSolving, with JsonClusterTap and ClusterSink output taps.]


SLIDE 16

Offline Computing

“Bee”: BNL’s Web-based Visualization

https://www.phy.bnl.gov/wire-cell/bee/set/bccdc2d7-16e1-4363-8034-032d9fe2de50/event/0/

Developed and maintained by Chao Zhang. WebGL, GPU-accelerated; runs in PC and phone browsers. JSON data; users upload their own.


SLIDE 17

Offline Computing

art/LArSoft

art: Fermilab's event processing framework, forked from CMSSW and used by several Intensity Frontier experiments. Fermilab is now considering "merging" art back into CMSSW (or something like it).

LArSoft: "application" layer for LArTPC detectors in general, shared by multiple experiments; provides the data model, art "modules" (equivalent to Gaudi "algorithms"), and services. Maintained by Fermilab with contributions from many. Includes a Wire-Cell Toolkit integration layer.

dunetpc: a further "application" layer on top of LArSoft with DUNE-specific modules/services.

  • Provides Wire-Cell its interface to Geant4 and to raw detector data.
  • art/LS support is effectively required for running code in production DUNE jobs.
  • LArSoft includes a "competing" drift simulation + signal processing based on simple 1D field response functions.
  • Somewhat faster but less correct compared to WCT's 2D versions.


SLIDE 18

Offline Computing

art/LArSoft/dunetpc/WCT package deps


dunetpc and its Dependencies

ups sets all of this up when you set up dunetpc

Mike Kirby, DUNE Collab Meeting May 2019.


SLIDE 19

Offline Computing

Coordination with Worldwide LHC Computing Grid

  • WLCG is undergoing a transition in organizational structure towards a Scientific Computing Infrastructure Steering Committee.
  • Approved middleware and interfaces (CREAM, ARC, HTCondorCE, dCache, EOS, SRM, etc.)
  • Tickets (GGUS), registry (GOCDB), and accounting (APEL) from EGI.
  • DUNE is now an observing member of the WLCG Management Board, announced at the HSF/OSG/WLCG Workshop at JLab this spring.
  • Will participate in the Grid Deployment Board (GDB) for infrastructure planning.
  • DUNE has already started to participate at an operational level through GGUS tickets, reporting and aiding in the identification of site-specific issues (UK and French sites currently).
  • Hoping to have a GDB meeting in the US later this year with a focus on DUNE-relevant areas.
  • Technical contributions made with development and evaluation of services (not focused on getting everything into production operations):
  – Edinburgh + FNAL: development work on Rucio for data management.
  – RAL-PPD + Manchester: exploring DIRAC capabilities for workload management (FNAL kx509 certs and file catalog filled with metadata from SAM).
  – RAL-PPD: running small MC simulation on WLCG, and RAL SE integration.

Mike Kirby, DUNE Collab Meeting May 2019.


SLIDE 20

Offline Computing

Initial DUNE FD CPU Estimate

  • Current protoDUNE-SP data production:
  • 42 M triggered events; 1.8 PB raw data sample.
  • Processed 8 M "good" events in 2.5 M core-hours (≈300 core-years) ⇒ ≈1500 core-years for the full sample.
  • DUNE 10 kt module estimate: ∼8000 core-years per year of data.

As is, this is modest, but it lacks some things:

  • Three other 10 kt modules + the near detector.
  • Full reconstruction software is still in development; new algorithms may require substantially more CPU.
  • Resources to support machine learning are not yet fully understood:
  • Requires the high-quality, somewhat slower "2D" Wire-Cell simulation and signal processing.
  • Training set size N_MC = X · N_data, where X is 10–100?
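The scaling above is simple arithmetic (my reproduction; 1 core-year = 24 × 365 core-hours assumed):

```python
# Reproducing the CPU estimate quoted on this slide.
core_hours = 2.5e6                      # spent on 8M "good" events
core_years = core_hours / (24 * 365)    # ~285, quoted as ~300
full_sample = core_years * 42 / 8       # scale 8M -> 42M triggered events
print(f"{core_years:.0f} core-years for 8M events; "
      f"~{full_sample:.0f} core-years for the full 42M-event sample")
```

This reproduces the ≈300 and ≈1500 core-year figures on the slide.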


SLIDE 21

Offline Computing

DUNE / Wire-Cell HPC/GPU “strategies”

  • DUNE CPU requirements are maybe modest (not ATLAS), but can DUNE afford not to try to exploit HPC resources?
  • BNL CSI + Physics EDG, with CCE funds, somehow under the LArSoft umbrella (?), are evaluating the Wire-Cell Toolkit for HPC/GPU porting.
  • Accelerate the many FFTs for signal simulation and signal processing.
  • A lot of "heuristic" code remains GPU-unfriendly.
  • Will it be a good match to HPC? Will GPUs be under-utilized?
  • Working with machine learning collaborators to use Wire-Cell:
  • WCT's quality 2D sim/sigproc is a must for a reasonable 2D CNN.
  • WCT sigproc/ROIs and 3D imaging naturally feed "sparse array CNN" and Graph NN techniques, which are friendly to limited GPU RAM.
  · Dense 2D CNNs typically must strongly down-sample their images.
  • Can we make a highly GPU-accelerated chain (WCT sim, sigproc, 3D imaging + DL training) to effectively utilize GPUs?
  • Considering adding HDF5 format support to WCT.
  • A new EDG post-doc starts on this next month.


SLIDE 22

Offline Computing

DUNE T.B.D.

Some major items with computing implications are still unknown:

near detector: measures neutrino cross sections and projects the neutrino flux to the FD. Detector design not yet finalized; data rates and CPU requirements unknown.

far detector: the full makeup of all 4 FD modules is not yet known. For sure 1 SP, maybe 2 SP, maybe 1 DP. Maybe a 4th with pixel readout, potentially producing far more data than the other technologies.


SLIDE 23

Offline Computing

BNL/DUNE Computing

  • The DUNE Computing "Consortium" is solidifying now.
  • Fermilab management "wants to break down the computing fortress model".
  • BNL can offer expertise in many needed areas: Rucio, a SAM file catalog replacement, conditions DB, workflow management, HPC/GPU (CSI).
  • Above my pay grade: how to develop trust and a collaborative plan, agreement, or something between the two labs?
  • I had many good discussions during the DUNE collaboration meeting with FNAL computing management, LArSoft management, and other fellow software/computing types.
  • Pervasive, positive expressions for increasing collaboration with BNL on DUNE software and computing.


SLIDE 24

Offline Computing

DUNE Computing Meetings

Upcoming workshops:

  • Data Model: 14–16 Aug 2019 at BNL. Data structure, DAQ/Offline handoff, production processing requirements.
  • Computing Model: 9–11 Sep 2019 at FNAL (likely). Computing infrastructure, contributions from other labs/universities.

Weekly Monday meetings (via Zoom):

  • 10:00 "global computing sites" https://indico.fnal.gov/category/827/
  • 10:30 "core computing" https://indico.fnal.gov/category/496/
