
Modeling a Large Data-Acquisition Network in a Simulation Framework



  1. 1st IEEE Workshop on High-Performance Interconnection Networks towards the Exascale and Big-Data Era
  Chicago, 8 September 2015
  Modeling a Large Data-Acquisition Network in a Simulation Framework
  Tommaso Colombo 1,2 · Holger Fröning 2 · Pedro Javier García 3 · Wainer Vandelli 1
  1 Physics Department, CERN
  2 Institut für Technische Informatik, Universität Heidelberg
  3 Departamento de Sistemas Informáticos, Universidad de Castilla-La Mancha

  2. Data-acquisition systems
  ● In a scientific experiment, a data-acquisition (DAQ) system handles the experimental signals
  ● Main functions:
    – Signal processing (e.g. analog-to-digital conversion)
    – Data gathering (collection of signals from different devices)
    – Filter / Trigger (discarding faulty / uninteresting data)
    – Storage
  ● Usually implemented as a mix of custom hardware and software running on commodity hardware

  3. Data-acquisition systems
  ● Key requirement: DAQ efficiency
    – Fraction of correctly acquired experimental data
    – Ideally 100%: experimental data is precious!
    – An inefficient DAQ might introduce bias in the data
  ⬇
  ● Stringent requirements on:
    – System availability
    – Buffer depth
    – Latency

  4. Data-acquisition systems
  ● Systematically studying the performance envelope of a DAQ system is difficult:
    – A DAQ system is a mission-critical component of an experiment
    – System availability for performance studies is limited
    – Hardware or system software modifications are usually not possible
  ● Simulation models give more freedom:
    – Must be accurate enough in reproducing the system's behavior
    – Must be reasonably fast

  5. Case study: the ATLAS experiment
  Large-scale machine built to discover and study rare particle-physics phenomena

  6. Case study: the ATLAS experiment
  Observes proton collisions delivered by the LHC accelerator at CERN

  7. Case study: the ATLAS experiment
  ● Basic parameters:
    – The LHC delivers a collision “event” every 25 ns (40 MHz)
    – Each event is separately detected and measured
    – An event corresponds to 1-2 MB
  ● The data-acquisition system incorporates a data-filtering component:
    – If all collision events were acquired, ATLAS would produce up to 80 TB/s and hundreds of EB per year!
    – After two filtering stages, ~1/10000 events survive
    – Data is recorded at 1-4 GB/s
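
A quick back-of-the-envelope check of these rates (a sketch using the approximate figures from the slides; the ~1 kHz output rate is quoted on slide 9):

    # Rough rate arithmetic behind slide 7 (approximate slide figures, not measured values).
    collision_rate_hz = 40e6       # one collision event every 25 ns
    event_size_bytes = 2e6         # upper end of the 1-2 MB range
    hlt_output_rate_hz = 1e3       # ~1 kHz surviving both filtering stages (slide 9)

    raw_rate = collision_rate_hz * event_size_bytes        # if every event were acquired
    recorded_rate = hlt_output_rate_hz * event_size_bytes  # what is actually written to storage

    print(f"raw:      {raw_rate / 1e12:.0f} TB/s")    # ~80 TB/s
    print(f"recorded: {recorded_rate / 1e9:.0f} GB/s") # ~2 GB/s, within the quoted 1-4 GB/s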

  8. First stage: custom hardware
  ● Synchronous, pipelined electronics
  ● Selects and acquires 1/400 events
  ● 40 MHz input, 100 kHz output
  ● ~80 million input channels, aggregated into ~2000 outputs (“Event fragments”)
  ● Output is striped over ~100 “Readout” nodes with deep buffers
  [Diagram: ATLAS trigger/DAQ dataflow – ~80 million readout channels feed ~1800 readout drivers; on each Level-1 accept, event fragments (~2000 per event) are buffered in ~100 readout systems; the region-of-interest builder, ~2000 High-Level Trigger worker nodes (HLT supervisor, processing units, Data Collection Manager) and ~10 data loggers complete the path to permanent storage]
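
As a quick sanity check on these figures (a sketch; the per-fragment size is derived from the slide numbers, not stated on the slide):

    # Level-1 selection rate and fragment-size arithmetic (derived from slide figures).
    l1_input_rate_hz = 40e6
    l1_fraction_kept = 1 / 400
    print(f"{l1_input_rate_hz * l1_fraction_kept:.0f} Hz")   # 100000 Hz = 100 kHz Level-1 output

    event_size_bytes = 1.5e6        # middle of the 1-2 MB range (assumption)
    fragments_per_event = 2000
    print(f"{event_size_bytes / fragments_per_event:.0f} B per fragment")   # ~750 B on average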

  9. Second stage: distributed software
  ● Commodity hardware: ~10000 CPU cores in ~2000 worker nodes
  ● Events are processed in parallel, as soon as acquired by the first stage
  ● 100 kHz input, ~1 kHz output
  [Diagram: same dataflow as the previous slide, with the ~2000 High-Level Trigger worker nodes highlighted]
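
A rough concurrency estimate for this stage (a sketch using Little's law; the 50 ms average processing time is quoted on the next slide):

    # Little's law: events in flight = arrival rate x mean time in system.
    input_rate_hz = 100e3       # Level-1 accept rate
    mean_processing_s = 50e-3   # average HLT decision time per event (next slide)
    in_flight = input_rate_hz * mean_processing_s
    print(in_flight)            # ~5000 events processed concurrently,
                                # i.e. about half of the ~10000 available cores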

  10. Second stage: distributed software
  ● A “Supervisor” assigns events to free cores (“Processing Units”)
  ● Each Unit handles a different event:
    – Retrieves event fragments
    – Decides if the event is to be kept
    – Avg time per event: 50 ms
  ● I/O is mediated by a per-node “Manager”
  [Diagram: same dataflow as the previous slides, with the HLT supervisor, processing units and Data Collection Manager highlighted]
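
A minimal sketch of this pull model (class and function names such as Supervisor, Manager and collect_fragments are illustrative stand-ins, not the actual ATLAS interfaces):

    import queue, random, time

    class Manager:
        """Per-node I/O mediator (illustrative stub, not the real ATLAS component)."""
        def collect_fragments(self, event_id, n_readout_nodes=100):
            # In the real system this issues network requests to the readout nodes.
            return [f"fragment-{event_id}-{i}" for i in range(n_readout_nodes)]

        def report_decision(self, event_id, keep):
            print(f"event {event_id}: {'accept' if keep else 'reject'}")

    class Supervisor:
        """Hands Level-1-accepted event IDs to whichever Processing Unit asks next."""
        def __init__(self, event_ids):
            self.pending = queue.Queue()
            for e in event_ids:
                self.pending.put(e)

        def next_event(self):
            return self.pending.get()

    def processing_unit(supervisor, manager):
        """One worker core: fetch an event, collect its fragments, decide keep/reject."""
        while not supervisor.pending.empty():
            event_id = supervisor.next_event()
            fragments = manager.collect_fragments(event_id)
            time.sleep(0.05)                  # stand-in for the ~50 ms of trigger algorithms
            keep = random.random() < 0.01     # only a small fraction of events survive
            manager.report_decision(event_id, keep)

    processing_unit(Supervisor(range(5)), Manager())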

  11. Second stage: commodity hardware
  ● Datacenter technologies
  ● Two large 10GbE routers
    – Several hundred ports
  ● Readout buffer nodes
    – 2x 10GbE links to each router
  ● Worker nodes organized in racks of 40 nodes each
    – One switch per rack
    – GbE links from the nodes to the switch
    – 10GbE links from the switch to each core router
  [Diagram: network topology – 98 Readout System nodes, each with two 10 Gbps links to each of the two core routers (196 links per router); 50 HLT racks of 40 worker nodes each behind a rack switch, with 1 Gbps node links and 10 Gbps uplinks to both routers; 10 data loggers and the HLT supervisors also attached to the routers]
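
A minimal sketch of how this topology could be written down as input to a simulation model (plain Python data structures; the node counts and link speeds are taken from the slide, the representation itself is illustrative):

    # Node groups and link speeds (Gbps) from the slide; the structure is illustrative.
    topology = {
        "core_routers": 2,
        "readout_nodes": 98,     # each has 2x 10GbE to each router
        "hlt_racks": 50,         # each rack: 1 switch + 40 worker nodes
        "workers_per_rack": 40,
        "data_loggers": 10,
    }

    links = [
        # (from, to, speed_gbps, count)
        ("readout_node", "core_router", 10, 4 * topology["readout_nodes"]),  # 2 links to each of the 2 routers
        ("rack_switch",  "core_router", 10, 2 * topology["hlt_racks"]),      # one 10GbE uplink per router
        ("worker_node",  "rack_switch",  1, topology["hlt_racks"] * topology["workers_per_rack"]),
    ]

    total_workers = topology["hlt_racks"] * topology["workers_per_rack"]
    print(total_workers)   # 2000 worker nodes, matching the earlier slides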

  12. DAQ traffic pattern
  ● Need to aggregate data from different instruments ➡ communication pattern: many-to-one
  ● Data transfers are driven by the experimental conditions ➡ bursty traffic
  ● In ATLAS:
    – Event fragments are striped over all the readout nodes
    – A processing unit needs fragments from multiple nodes at the same time
    – Many nodes will start sending fragments at the same time to the same destination, creating instantaneous network congestion
  [Diagram: funneling path from the Readout System through a core router to a rack switch and on to an HLT node, with the 10 Gbps to 1 Gbps bandwidth mismatch at the last hop highlighted]
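
A small sketch of the burst that hits the last 1 Gbps hop when one processing unit collects a full event (fragment striping and sizes are the slides' approximate figures; the calculation itself is illustrative):

    # Burst converging on a single worker node when it collects one event.
    readout_nodes = 100            # all of them hold fragments of every event
    event_size_bytes = 1.5e6       # middle of the 1-2 MB range (assumption)
    per_node_burst = event_size_bytes / readout_nodes   # data sent by each readout node

    # All ~100 senders answer at 10 Gbps into a 1 Gbps edge link at roughly the same time.
    edge_link_bps = 1e9
    drain_time_s = event_size_bytes * 8 / edge_link_bps
    print(f"{per_node_burst/1e3:.0f} kB from each of {readout_nodes} senders arrive almost simultaneously")
    print(f"~{drain_time_s*1e3:.0f} ms just to drain one event through the 1 Gbps edge link")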

  13. DAQ traffic pattern
  ● On a lossy network such as Ethernet, the DAQ traffic pattern leads to the incast pathology:
    – A client (the worker node) simultaneously receives short bursts of data from multiple sources (the readout nodes)
    – The switch buffers overflow
    – All the packets from one source are dropped
    – TCP congestion-control mechanisms cannot prevent this
  ● A dramatic increase in data transfer latency is observed:
    – Incast triggers slow TCP timeout-based retransmission
    – Causes under-utilization of computing power
    – Can lead to violating DAQ latency requirements
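
To see why a timeout hurts so much, compare the timescales involved (a sketch; the 200 ms figure is the common Linux default minimum TCP retransmission timeout, an assumption rather than a number from the slides):

    # Orders of magnitude behind the incast latency penalty (illustrative values).
    normal_collection_s = 0.012   # ~12 ms to drain one ~1.5 MB event through a 1 Gbps link
    tcp_min_rto_s = 0.200         # typical Linux minimum retransmission timeout (assumption)
    processing_s = 0.050          # average HLT processing time per event (slide 10)

    # A single dropped burst stalls the whole event until the retransmission timeout fires.
    print(f"collection inflates from ~{normal_collection_s*1e3:.0f} ms "
          f"to >{(normal_collection_s + tcp_min_rto_s)*1e3:.0f} ms, "
          f"while the CPU work itself is only ~{processing_s*1e3:.0f} ms")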

  14. DAQ traffic pattern
  ● Simple incast mitigation strategy: client-side traffic shaping
    Smoothing the rate of data requests limits the maximum size of the traffic bursts
  ● Key metric: data collection time
    Time required to gather all fragments of an event
  ● Implementation in ATLAS:
    – Each worker node has a fixed number of credits available
    – Each requested fragment “costs” one credit
  ● Results:
    – Few traffic-shaping credits: data collection time grows because the worker nodes cannot fully utilize the network bandwidth
    – Many traffic-shaping credits: high latency due to incast
    – Optimal working point must be found manually
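
A minimal sketch of the credit-based shaping idea described here (the function and parameter names are illustrative, not the actual ATLAS Data Collection Manager code):

    def collect_event(fragment_ids, credits, request_fragment, await_any_response):
        """Gather all fragments of one event with at most `credits` requests in flight."""
        pending = list(fragment_ids)
        in_flight, received = set(), {}
        while pending or in_flight:
            while pending and len(in_flight) < credits:    # spend available credits
                frag = pending.pop()
                request_fragment(frag)                      # ask the readout node holding it
                in_flight.add(frag)
            frag, data = await_any_response(in_flight)      # one reply returns one credit
            in_flight.discard(frag)
            received[frag] = data
        return received

    # Illustrative stand-ins for the network layer:
    event = collect_event(
        fragment_ids=range(10),
        credits=4,
        request_fragment=lambda f: print(f"request fragment {f}"),
        await_any_response=lambda in_flight: (next(iter(in_flight)), b"data"),
    )
    print(len(event), "fragments collected")

With few credits the requests are serialized and the bandwidth is underused; with many credits the bursts grow until incast appears, which is exactly the trade-off described in the results above.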

  15. Quantifying the problem
  ● Measurements in a test system: one worker rack
  ● Synthetic traffic:
    – 2.1 MB events, assigned to Processing Units at 750 Hz ➡ 1.6 GB/s input
  ● Core routers have huge buffers ➡ no drops
  ● Two worker-rack switches tested:
    – Per-port buffers (600 kB each)
    – Shared buffers (2x 10 MB)
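
Checking the input rate and comparing the per-event burst with the two buffer configurations (a sketch; treating a whole 2.1 MB event as a worst-case burst converging on one switch port follows from the striping described earlier and is an assumption):

    # Input rate of the synthetic traffic.
    event_size_bytes = 2.1e6
    event_rate_hz = 750
    print(event_size_bytes * event_rate_hz / 1e9)   # ~1.6 GB/s into the rack

    # Worst-case burst converging on one worker's switch port vs. available buffering.
    per_port_buffer = 600e3      # per-port buffer switch: 600 kB per port
    shared_buffer = 2 * 10e6     # shared-buffer switch: 2x 10 MB for all ports
    print(event_size_bytes > per_port_buffer)   # True  -> one burst alone can overflow a port buffer
    print(event_size_bytes > shared_buffer)     # False -> a single burst fits in the shared buffer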
