SLIDE 1

Modeling a Large Data-Acquisition Network in a Simulation Framework

Tommaso Colombo¹,², Holger Fröning², Pedro Javier García³, Wainer Vandelli¹

¹ Physics Department, CERN
² Institut für Technische Informatik, Universität Heidelberg
³ Departamento de Sistemas Informáticos, Universidad de Castilla-La Mancha

1st IEEE Workshop on High-Performance Interconnection Networks towards the Exascale and Big-Data Era, Chicago, 8 September 2015

SLIDE 2

Data-acquisition systems

• In a scientific experiment, a data-acquisition (DAQ) system handles the experimental signals

• Main functions:
– Signal processing (e.g. analog-to-digital conversion)
– Data gathering (collection of signals from different devices)
– Filter / Trigger (discarding faulty / uninteresting data)
– Storage

• Usually implemented as a mix of custom hardware and software running on commodity hardware

SLIDE 3

Data-acquisition systems

• Key requirement: DAQ efficiency
– Fraction of correctly acquired experimental data
– Ideally 100%: experimental data is precious!
– An inefficient DAQ might introduce bias in the data

• Stringent requirements on
– System availability
– Buffer depth
– Latency

SLIDE 4

Data-acquisition systems

• Systematically studying the performance envelope of a DAQ system is difficult:
– A DAQ system is a mission-critical component of an experiment
– System availability for performance studies is limited
– Hardware or system software modifications are usually not possible

• Simulation models give more freedom
– Must be accurate enough in reproducing the system's behavior
– Must be reasonably fast

SLIDE 5

Case study: the ATLAS experiment

Large-scale machine built to discover and study rare particle physics phenomena

SLIDE 6

Case study: the ATLAS experiment

Observes proton collisions delivered by the LHC accelerator at CERN

SLIDE 7

Case study: the ATLAS experiment

• Basic parameters:
– LHC delivers a collision “event” every 25 ns (40 MHz)
– Each event is separately detected and measured
– An event corresponds to 1-2 MB

• The data-acquisition system incorporates a data filtering component:
– If all collision events were acquired, ATLAS would produce up to 80 TB/s and hundreds of EB per year! (see the quick check below)
– After two filtering stages, ~1/10000 events survive
– Data is recorded at 1-4 GB/s
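For orientation, here is a quick check of the headline figures above. The 2 MB value is the per-event upper bound quoted on this slide; the yearly total additionally assumes on the order of 10⁷ seconds of data-taking per year, which is an assumption not stated on the slide.

```latex
\begin{align*}
  \text{raw rate}     &= 40\,\mathrm{MHz} \times 2\,\mathrm{MB/event}
                       = 8\times10^{13}\,\mathrm{B/s} = 80\,\mathrm{TB/s} \\
  \text{yearly volume} &\approx 80\,\mathrm{TB/s} \times 10^{7}\,\mathrm{s}
                       = 8\times10^{20}\,\mathrm{B} = 800\,\mathrm{EB}
\end{align*}
```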

SLIDE 8

First stage: custom hardware

• Synchronous, pipelined electronics

• Selects and acquires 1/400 events

• 40 MHz input, 100 kHz output

• ~80 million input channels, aggregated into ~2000 outputs (“Event fragments”)

• Output is striped over ~100 “Readout” nodes with deep buffers

[Diagram: ATLAS DAQ data flow — readout channels (~80 million) → Level-1 readout drivers (~1800) → readout buffers (~100) → High-Level Trigger worker nodes (~2000 processing units), coordinated by the region-of-interest builder and the HLT supervisor; accepted events go to the data loggers (order of 10) and permanent storage; ~2000 event fragments per event]

SLIDE 9

Second stage: distributed software

• Commodity hardware: ~10000 CPU cores in ~2000 worker nodes

• Events are processed in parallel, as soon as acquired by the first stage

• 100 kHz input, ~1 kHz output

[Diagram: ATLAS DAQ data flow, as on the previous slide]

SLIDE 10

Second stage: distributed software

• A “Supervisor” assigns events to free cores (“Processing Units”)

• Each Unit handles a different event
– Retrieves event fragments
– Decides if the event is to be kept
– Avg time per event: 50 ms

• I/O is mediated by a per-node “Manager”

[Diagram: ATLAS DAQ data flow, as on the previous slides]

SLIDE 11

Second stage: commodity hardware

• Datacenter technologies

• Two large 10GbE routers
– Several hundred ports each

• Readout buffer nodes
– 2x 10GbE links to each router

• Worker nodes organized in racks of 40 nodes each
– One switch per rack
– GbE links from nodes to switch
– 10GbE links from switch to each core router

[Diagram: network topology — ~98 Readout Systems and ~10 Data Loggers attached to the two core routers over 10 Gbps links; ~50 HLT racks of 40 worker nodes each, with 1 Gbps links from the nodes to the rack switch and 10 Gbps uplinks from each rack switch to each core router; the HLT Supervisor is also attached to the core]

SLIDE 12

DAQ traffic pattern

• Need to aggregate data from different instruments
➡ Communication pattern: many-to-one

• Data transfers are driven by the experimental conditions
➡ Bursty traffic

[Diagram: traffic funneling — flows from the Readout Systems converge through a core router and the rack switch onto a single HLT node, with a 10 Gbps to 1 Gbps bandwidth mismatch at the last hop]

• In ATLAS:
– Event fragments are striped over all the readout nodes
– A processing unit needs fragments from multiple nodes at the same time
– Many nodes will start sending fragments at the same time to the same destination, creating instantaneous network congestion

SLIDE 13

DAQ traffic pattern

• On a lossy network such as Ethernet, the DAQ traffic pattern leads to the incast pathology:
– A client (the worker node) simultaneously receives short bursts of data from multiple sources (the readout nodes)
– The switch buffers overflow
– All the packets from one source are dropped
– TCP congestion control mechanisms cannot prevent this

• A dramatic increase in data transfer latency is observed
– Incast triggers slow TCP timeout-based retransmission
– Causes under-utilization of computing power
– Can lead to violating DAQ latency requirements

SLIDE 14

DAQ traffic pattern

• Simple incast mitigation strategy: client-side traffic shaping
Smoothing the rate of data requests limits the maximum size of the traffic bursts

• Key metric: data collection time
Time required to gather all fragments of an event

• Implementation in ATLAS (a minimal sketch of the credit mechanism follows after this list):
– Each worker node has a fixed number of credits available
– Each requested fragment “costs” one credit

• Results:
– Few traffic shaping credits: data collection time grows because the worker nodes cannot fully utilize the network bandwidth
– Many traffic shaping credits: high latency due to incast
– Optimal working point must be found manually
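A minimal sketch of credit-based request shaping as described above. The class and function names (TrafficShaper, requestFragment, onResponse) are illustrative assumptions, not the actual ATLAS code; the point is only that a fragment request goes out when a credit is free and the credit is returned when the corresponding response arrives.

```cpp
#include <cstddef>
#include <functional>
#include <iostream>
#include <queue>

// Illustrative credit-based traffic shaper (hypothetical names, not the ATLAS code).
// Outstanding fragment requests are capped by a fixed credit pool; further
// requests wait in a queue until a response frees a credit.
class TrafficShaper {
public:
    TrafficShaper(std::size_t credits, std::function<void(int)> sendRequest)
        : credits_(credits), sendRequest_(std::move(sendRequest)) {}

    // Called when a Processing Unit wants a fragment from a readout node.
    void requestFragment(int readoutNode) {
        if (credits_ > 0) {
            --credits_;
            sendRequest_(readoutNode);      // request goes on the wire immediately
        } else {
            pending_.push(readoutNode);     // shaped: wait until a credit is freed
        }
    }

    // Called when a fragment response has been fully received.
    void onResponse() {
        if (!pending_.empty()) {
            int next = pending_.front();    // reuse the credit for a queued request
            pending_.pop();
            sendRequest_(next);
        } else {
            ++credits_;                     // return the credit to the pool
        }
    }

private:
    std::size_t credits_;                   // free credits = allowed outstanding requests
    std::function<void(int)> sendRequest_;  // actual I/O, e.g. a TCP message
    std::queue<int> pending_;               // requests waiting for a credit
};

int main() {
    TrafficShaper shaper(2, [](int node) { std::cout << "request to readout " << node << "\n"; });
    for (int node = 0; node < 4; ++node)
        shaper.requestFragment(node);       // only the first 2 go out immediately
    shaper.onResponse();                    // frees a credit -> queued request for node 2 goes out
}
```

The credit count is exactly the tuning knob discussed in the “Results” item: too few credits starve the network, too many reproduce incast.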

SLIDE 15

Quantifying the problem

• Measurements in a test system: one worker rack

• Synthetic traffic:
– 2.1 MB events, assigned to Processing Units at 750 Hz ➡ 1.6 GB/s input

• Core routers have huge buffers ➡ no drops

• Two worker rack switches tested:
– Per-port buffers (600 kB each)
– Shared buffers (2x 10 MB)

SLIDE 16

Modeling the system: framework

• Solutions to incast are usually very invasive:
– Modified TCP implementations
– Network-wide traffic orchestrators

• Must be tried in simulation first
– Can we model the relevant features of the DAQ system?

• Simulation framework: OMNeT++
– Widely accepted in the academic community
– Easy to use for network modeling (see the module skeleton below)
– Provides INET, a protocol model library including Ethernet, IP and TCP
– TCP implementation can preserve application-level message boundaries ➡ simplified modeling of message-based applications
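To make the framework bullet concrete, this is the general shape of an OMNeT++ simple module: all behaviour is a reaction to messages, and a self-message scheduled with scheduleAt() acts as a timer. This is a generic skeleton in the style of OMNeT++ 5.x, not the authors' model code; the module name, the "period" parameter and the "out" gate are invented for illustration and would have to be declared in a matching NED file.

```cpp
#include <omnetpp.h>
using namespace omnetpp;

// Generic OMNeT++ simple-module skeleton (illustrative, not the paper's code).
// Assumes a NED declaration with a "period" parameter and an "out" gate.
class PeriodicSource : public cSimpleModule
{
  private:
    cMessage *timer = nullptr;          // self-message used as a periodic timer

  protected:
    virtual void initialize() override {
        timer = new cMessage("timer");
        scheduleAt(simTime() + par("period").doubleValue(), timer);
    }

    virtual void handleMessage(cMessage *msg) override {
        if (msg == timer) {
            // periodic action: emit one message and re-arm the timer
            send(new cMessage("request"), "out");
            scheduleAt(simTime() + par("period").doubleValue(), timer);
        } else {
            delete msg;                 // ignore anything arriving from the network
        }
    }

    virtual void finish() override {
        cancelAndDelete(timer);
    }
};

Define_Module(PeriodicSource);
```

The application models on the next slides (Supervisor, Readout, Processing Unit, Manager) all follow this event-driven pattern.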

SLIDE 17

Modeling the system: applications

  • Supervisor & Readout applicatjons

– In real system: completely data-driven (from fjrst stage input)

S ➡ imulatjon should use pre-recorded traces

– In test system: synthetjc traffjc patuerns

Simplifjed modeling ➡

  • Supervisor:

simple periodic scheduler which assigns events to free Processing Units with confjgurable policies

  • Readout applicatjons:

trivial servers, responding to event fragment requests with a confjgurable delay and response size
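A minimal sketch of such a “trivial server”, again in generic OMNeT++ style and with invented names (ReadoutApp, responseDelay, fragmentSize, gate "out"): every incoming request is answered after a configurable delay with a packet of configurable size, which is all the readout model described above needs to do. Routing the response back to the correct requester over TCP is left out of the sketch.

```cpp
#include <omnetpp.h>
using namespace omnetpp;

// Illustrative readout-application model (invented names, not the paper's code):
// each incoming fragment request is answered after a configurable delay with a
// response packet of configurable size.
class ReadoutApp : public cSimpleModule
{
  protected:
    virtual void handleMessage(cMessage *msg) override {
        if (msg->isSelfMessage()) {
            // the artificial response delay has elapsed: send the fragment out
            send(msg, "out");
            return;
        }
        // a fragment request arrived: build the response and delay it
        delete msg;
        cPacket *fragment = new cPacket("eventFragment");
        fragment->setByteLength(par("fragmentSize").intValue());    // configurable size
        scheduleAt(simTime() + par("responseDelay").doubleValue(),  // configurable delay
                   fragment);
    }
};

Define_Module(ReadoutApp);
```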

SLIDE 18

Modeling the system: applications

• Processing Units and worker node Manager
– Simulation can share code with the real implementations

– Processing Unit:
simulates per-event iterative collection and processing of data
  • Is assigned an event from the Supervisor
  • Requests data from the Readout nodes with configurable patterns
  • Simulates processing by waiting

– Worker node Manager:
interfaces Processing Units with TCP
  • Maps data requests from the Processing Units to messages to multiple Readout nodes
  • Enforces the traffic-shaping algorithm
SLIDE 19

Modeling the system: nodes

• Re-used as much of the INET stack as possible: Ethernet MAC; IP; TCP

• Custom implementation of switches:
– No low-level details: assume that the switch speedup is high enough to prevent input head-of-line blocking, and to make the packetization delay of the switch cells negligible
– Focus on packet queuing and buffering
  • Packet droppers enforce buffering space limits (a sketch of the dropper logic follows after the diagram below)
  • Can model both per-port and shared buffering schemes

[Diagram: switch model — Ethernet MAC inputs feed a relay unit that forwards packets to per-port output queues and Ethernet MAC outputs; the packet dropper sits either on each output queue (per-port buffering) or as a single instance in front of all queues (shared buffering)]
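A minimal sketch of the dropper idea described above, with invented names and outside any framework: a packet is accepted only if the accounted buffer occupancy stays within the limit, and the occupancy counter is kept per output port or shared across the whole switch depending on the buffering scheme being modeled. Queuing and forwarding themselves are not shown.

```cpp
#include <cstdint>
#include <vector>

// Illustrative buffer-limit dropper (hypothetical names, not the paper's code).
// One shared counter models a shared-buffer switch; one counter per port models
// per-port buffering.
class Dropper {
public:
    Dropper(std::size_t numPorts, int64_t limitBytes, bool sharedBuffer)
        : limit_(limitBytes), shared_(sharedBuffer),
          used_(sharedBuffer ? 1 : numPorts, 0) {}

    // Called when a packet is about to be enqueued on an output port.
    // Returns true if it fits in the buffer, false if it must be dropped.
    bool tryEnqueue(std::size_t port, int64_t packetBytes) {
        int64_t &used = used_[shared_ ? 0 : port];
        if (used + packetBytes > limit_)
            return false;              // buffer full: drop the packet
        used += packetBytes;
        return true;
    }

    // Called when a packet leaves the output queue (has been transmitted).
    void onDequeue(std::size_t port, int64_t packetBytes) {
        used_[shared_ ? 0 : port] -= packetBytes;
    }

private:
    int64_t limit_;                    // buffer limit in bytes (per port or shared)
    bool shared_;                      // true: one shared buffer; false: per-port buffers
    std::vector<int64_t> used_;        // current occupancy counter(s)
};
```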

SLIDE 20

Model validation

• Validation strategy: compare model predictions with test system measurements

• Focus on key metric: data collection time

• Good agreement for both switch types
– Some parameter tuning needed

[Plots: measured vs. simulated data collection time for the per-port-buffer switch (600 kB per port) and the shared-buffer switch (2x 10 MB)]

SLIDE 21

Further simulation results

• Impact of the event-to-Processing-Unit assignment policy
– Real system uses FCFS
– More Units per node ➡ uneven network traffic
– Other policies (e.g. random) are easy to implement in the model, difficult in practice due to performance concerns

[Plots: results for the per-port-buffer switch (600 kB per port) and the shared-buffer switch (2x 10 MB)]

SLIDE 22

Further simulation results

• Impact of buffer sizes: what is the minimum buffer size that will prevent drops?

[Plots: results for per-port buffers and for a shared buffer]

SLIDE 23

Conclusions

• The simulation model can reproduce the key performance traits of a complex data-acquisition system such as ATLAS's

• Comparing the simulated and measured results yields useful indications that can drive optimizations and further development of the system

• The modeling approach can be extended to systems with similar traffic patterns

SLIDE 24

Future directions

• Using this model as a base, more speculative trials can be undertaken

• Three categories of solutions to the incast pathology could be modeled:
– More sophisticated application-level traffic shaping
– Alterations of the transport protocol itself (requiring kernel modifications not feasible in the real system)
  • Altering TCP's retransmission timeout
  • Experimental incast-aware TCP variants
– Changing the link-layer technology
  • Lossless networks (InfiniBand)
  • IEEE 802.1Q Congestion Notification