

SLIDE 1

Markus Frank (CERN) & Albert Puig (UB)

SLIDE 2

• An opportunity (Motivation)
• Adopted approach
• Implementation specifics
• Status
• Conclusions

SLIDE 3

[Diagram: the Readout Network feeds the Online cluster (CPU farm, event selection), which forwards selected events to the data logging facility and Storage.]

SLIDE 4

• ~16,000 CPU cores foreseen (~1,000 boxes).
• Environmental constraints:
  • Space limit of 2,000 1U boxes.
  • Cooling/power limit of 50 x 11 kW.
• Computing power equivalent to that provided to LHCb by all Tier-1s.
• Storage system:
  • 40 TB installed.
  • 400-500 MB/s.

[Diagram: Readout Network, Online cluster (event selection), data logging facility and Storage, as on Slide 3.]

SLIDE 5

• Significant idle time of the farm:
  • During the LHC winter shutdown (~months).
  • During beam periods, experiment and machine downtime (~hours).
• Could we use it for reconstruction?
  • The farm is fully LHCb controlled.
  • Good internal network connectivity.
  • Slow disk access (only fast for a very few nodes, via a Fibre Channel interface).

SLIDE 6

• Background information:
  • 1 file (2 GB) contains 60,000 events.
  • It takes 1-2 s to reconstruct an event.
• Cannot reprocess à la Tier-1 (1 file per core):
  • Cannot perform reconstruction in short idle periods: each file takes 1-2 s/evt * 60k evt ~ 1 day.
  • Insufficient storage, or CPUs not used efficiently:
    • Input: 32 TB (16,000 files * 2 GB/file).
    • Output: ~44 TB (16,000 files * 60k evt * 50 kB/evt).
• A different approach is needed: a distributed reconstruction architecture (back-of-envelope check below).
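The numbers above follow from simple arithmetic. A minimal back-of-envelope check, using only the figures quoted on the slide (decimal units assumed):

```python
# Back-of-envelope check of the figures quoted on this slide (decimal units assumed).
events_per_file = 60_000
seconds_per_event = 1.5        # slide quotes 1-2 s/event
files = 16_000                 # one file per core, a la Tier-1
file_size_gb = 2
output_kb_per_event = 50

# Sequential reconstruction of one file on a single core:
hours_per_file = events_per_file * seconds_per_event / 3600
print(f"{hours_per_file:.0f} h per file")        # ~25 h, i.e. roughly one day

# Storage needed to stage one file per core:
input_tb = files * file_size_gb / 1000
output_tb = files * events_per_file * output_kb_per_event / 1e9
print(f"in: {input_tb:.0f} TB, out: {output_tb:.0f} TB")   # 32 TB in, ~48 TB out (~44 TiB)
```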

SLIDE 7

• Files are split into events, which are distributed to many cores that perform the reconstruction:
  • First idea: full parallelization (1 file / 16k cores).
    • Reconstruction time: 4-8 s per file.
    • Full speed not reachable (only one file open!).
  • Practical approach: split the farm into slices of subfarms (1 file / n subfarms).
    • Example: 4 concurrent open files yield a reconstruction time of ~30 s/file (see the scaling sketch below).
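A rough model of how the per-file reconstruction time scales with the number of concurrently open files, using the figures from the previous slide. This ignores I/O and scheduling overheads, which is why the slide quotes ~30 s/file rather than the ideal value:

```python
# Idealized scaling: cores are divided evenly among the concurrently open files.
total_cores = 16_000
events_per_file = 60_000
seconds_per_event = 1.5          # slide quotes 1-2 s/event

def time_per_file(open_files: int) -> float:
    """Wall-clock seconds to reconstruct one file, ignoring I/O overheads."""
    cores_per_file = total_cores // open_files
    return events_per_file * seconds_per_event / cores_per_file

print(time_per_file(1))   # ~5.6 s  (slide: 4-8 s; not reachable with a single open file)
print(time_per_file(4))   # ~22.5 s (slide quotes ~30 s/file once overheads are included)
```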

SLIDE 8

[Diagram: overall architecture]
• ECS: control and allocation of resources.
• Reco Manager: job steering, backed by a database.
• DIRAC: connection to the LHCb production system.
• Farm: storage nodes behind a storage switch, plus the subfarms.

See A. Tsaregorodtsev's talk, "DIRAC3 - the new generation of the LHCb grid software".

SLIDE 9

[Same architecture diagram as Slide 8: ECS (control and allocation of resources), Reco Manager (job steering, database), DIRAC (connection to the production system), storage nodes and subfarms.]

SLIDE 10

• Control using the standard LHCb ECS software:
  • Reuse of existing components for storage and subfarms.
  • New components for reconstruction tree management.
  • See Clara Gaspar's talk (LHCb Run Control System).
• Allocate, configure, start/stop resources (storage and subfarms).
• Task initialization is slow, so tasks do not restart on file change:
  • Idea: tasks sleep during data taking and are only restarted on configuration change.

SLIDE 11

[Diagram: structure of one subfarm]
• PVSS control, x 50 subfarms, 1 control PC each.
• 4 PCs each, x 8 cores/PC.
• 1 Reco task per core, plus data management tasks (Input and Output).
• Event (data) processing is separated from event (data) management.

SLIDE 12

• Data processing block (Processing Node):
  • Producers put events into a buffer manager (MBM).
  • Consumers receive events from the MBM.
• Data transfer block (Source Node to Target Node):
  • Senders access events from the MBM.
  • Receivers get the data and declare it to the MBM.
(A minimal sketch of this producer/consumer pattern follows.)
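A minimal sketch of the producer/consumer pattern that the buffer manager (MBM) implements. This is only an illustration: the real MBM is a shared-memory buffer manager, and the names `BufferManager`, `declare`, and `get` below are hypothetical, not the actual MBM API.

```python
# Illustrative stand-in for the MBM pattern: producers declare events into a
# buffer, consumers receive them. A thread-safe queue replaces the real
# shared-memory buffer manager; it only shows the data flow.
import queue
import threading

class BufferManager:
    def __init__(self, max_events: int = 100):
        self._buffer = queue.Queue(maxsize=max_events)

    def declare(self, event: bytes) -> None:   # used by producers (and receivers)
        self._buffer.put(event)

    def get(self) -> bytes:                    # used by consumers (and senders)
        return self._buffer.get()

def producer(mbm: BufferManager, n_events: int) -> None:
    for i in range(n_events):
        mbm.declare(f"event-{i}".encode())

def consumer(mbm: BufferManager, n_events: int) -> None:
    for _ in range(n_events):
        event = mbm.get()          # ... the event would be reconstructed here ...

mbm = BufferManager()
threads = [threading.Thread(target=producer, args=(mbm, 10)),
           threading.Thread(target=consumer, args=(mbm, 10))]
for t in threads:
    t.start()
for t in threads:
    t.join()
```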

SLIDE 13

[Diagram: event flow between storage nodes and worker nodes]
• Storage nodes: a Storage Reader feeds input events to a Sender; a Receiver passes output events to a Storage Writer.
• Worker nodes: a Receiver delivers events to Brunel reconstruction tasks (1 per core), whose output goes to a Sender.

SLIDE 14

[Same architecture diagram as Slide 8: ECS, Reco Manager (database, job steering), DIRAC (connection to the LHCb production system), storage nodes and subfarms.]

SLIDE 15

• Granularity down to the file level:
  • The individual event flow is handled automatically by the allocated resource slices.
• Reconstruction: specific actions in a specific order.
  • Each file is treated as a Finite State Machine (FSM); a minimal sketch follows.
• Reconstruction information is stored in a database:
  • System status.
  • Protection against crashes.

FSM states: TODO, PREPARING, PREPARED, PROCESSING, DONE, ERROR.
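A minimal sketch of the per-file state machine. The state names are taken from the slide; the allowed transitions shown here are an assumed, natural ordering, not taken from the talk.

```python
# Minimal per-file FSM sketch. State names come from the slide; the transition
# table is an assumption about their natural ordering.
from enum import Enum, auto

class FileState(Enum):
    TODO = auto()
    PREPARING = auto()
    PREPARED = auto()
    PROCESSING = auto()
    DONE = auto()
    ERROR = auto()

# Assumed transitions; any active state may also fall into ERROR.
TRANSITIONS = {
    FileState.TODO: {FileState.PREPARING},
    FileState.PREPARING: {FileState.PREPARED, FileState.ERROR},
    FileState.PREPARED: {FileState.PROCESSING, FileState.ERROR},
    FileState.PROCESSING: {FileState.DONE, FileState.ERROR},
}

class RecoFile:
    """One input file as tracked by the Reco Manager."""
    def __init__(self, name: str):
        self.name = name
        self.state = FileState.TODO

    def advance(self, new_state: FileState) -> None:
        if new_state not in TRANSITIONS.get(self.state, set()):
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        self.state = new_state
        # In the real system the new state would also be written to the
        # database, so that the system status survives crashes.
```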

SLIDE 16

• Job steering is done by a Reco Manager (a sketch of the steering loop follows):
  • Holds each FSM instance and moves it through the states based on feedback from the static resources.
  • Sends commands to the readers and writers: which files to read and which filenames to write to.
  • Interacts with the database.

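A sketch of what such a steering loop might look like, reusing `FileState` and `RecoFile` from the Slide 15 sketch. The `reader`, `writer`, and `db` objects and all of their methods are hypothetical placeholders, not the actual implementation.

```python
# Hypothetical steering loop of the Reco Manager (illustrative only).
# FileState and RecoFile are the classes from the Slide 15 sketch; reader,
# writer and db stand in for the real readers, writers and database interface.
def steer(files, reader, writer, db):
    for f in files:
        if f.state is FileState.TODO:
            reader.open(f.name)                   # tell a Storage Reader which file to read
            f.advance(FileState.PREPARING)
        elif f.state is FileState.PREPARED:
            writer.set_output(f.name + ".reco")   # tell the Storage Writer the output filename
            f.advance(FileState.PROCESSING)
        elif f.state is FileState.PROCESSING and reader.finished(f.name):
            f.advance(FileState.DONE)
        db.save(f.name, f.state)                  # persist the status for crash protection
```
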
SLIDE 17

• The Online farm will be treated like a Computing Element (CE) connected to the LHCb Production system.
• Reconstruction is formulated as DIRAC jobs and managed by DIRAC WMS Agents.
• DIRAC interacts with the Reco Manager through a thin client, not directly with the database.
• Data transfers into and out of the Online farm are managed by DIRAC DMS Agents.

SLIDE 18

• Current performance is constrained by hardware (a quick consistency check of these figures follows below):
  • Reading from disk: ~130 MB/s.
    • The Fibre Channel link is saturated with 3 readers.
    • A reader saturates its CPU at 45 MB/s.
  • Test with a dummy reconstruction (data is simply copied from input to output):
    • Stable throughput of 105 MB/s.
    • Constrained by the Gigabit network; an upgrade to 10 Gigabit is planned.
[Diagram: same event-flow chain as Slide 13 (Storage Reader, Sender/Receiver, Reco, Storage Writer).]
• Resource handling and the Reco Manager are implemented.
• Integration into the LHCb Production system was recently decided, but is not yet implemented.
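The quoted limits are mutually consistent. A quick check of the arithmetic; the ~125 MB/s line rate for 1 Gbit/s is a textbook figure, not taken from the slide:

```python
# Quick consistency check of the quoted throughput limits.
reader_rate_mb_s = 45               # one reader saturates its CPU at 45 MB/s
readers = 3
print(readers * reader_rate_mb_s)   # 135 MB/s, close to the ~130 MB/s Fibre Channel limit

gigabit_line_rate_mb_s = 1000 / 8   # ~125 MB/s theoretical for 1 Gbit/s (assumption)
print(gigabit_line_rate_mb_s)       # the observed 105 MB/s sits just below this
```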

SLIDE 19

• Software:
  • Pending: implementation of the thin client for interfacing with DIRAC.
• Hardware:
  • Network upgrade to 10 Gbit in the storage nodes before the summer.
  • More subfarms and PCs to be installed: from the current ~4,800 cores to the planned 16,000.

SLIDE 20

• The LHCb Online cluster needs huge resources for the event selection of LHC collisions.
• These resources have a lot of idle time (~50% of the time).
• They can be used during idle periods by applying a parallelized architecture to data reprocessing.
• A working system is already in place, pending integration into the LHCb Production system.
• Planned hardware upgrades to meet the DAQ requirements should overcome the current bandwidth constraints.