

SLIDE 1

ATLAS Production System in ATLAS Data Challenge 2

Luc Goossens (CERN/EP/ATC), Kaushik De (UTA)

27 September 2004, CHEP 2004

SLIDE 2

  • in this talk

– introduction – terminology and conceptual model – architecture and components – experience so far – conclusions and outlook

SLIDE 3

  • introduction

– ATLAS decided to undertake a series of Data Challenges (DC) in order to validate its Computing Model, its software, and its data model
– DC2 started in July 2004 and introduced the new ATLAS Production System (prodsys):

  • unsupervised production across many sites spread over three different Grids (US Grid3, NorduGrid, LCG-2)
  • 4 major components:

    – production supervisor
    – executor
      » one executor per “grid flavor”, developed by the corresponding grid experts
    – common data management system
    – common central production database for all of ATLAS

SLIDE 4

  • terminology and conceptual model

[Diagram: a task applies a transformation to an input dataset to produce an output dataset; the task is split into jobs, each applying the transformation to input files and producing output files plus a log file; task and job definitions are kept in prodDB/AMI and prodDB]

SLIDE 5

  • architecture

– as simple as possible (well, almost)
– flexible
– target automatic production
– based on DC1 experience with AtCom (the DC1 interactive production system) and GRAT

  • core engine with plug-ins

– some buzz technologies

  • XML, Jabber, Webservices, ...

– federation of grids

  • LCG, Nordugrid, Grid3
  • legacy systems only as backup

– use middleware components as much as possible

  • avoid inventing ATLAS’ own version of grid

– broker, catalogs, information system, ...

  • risky dependency !

SLIDE 6

[Architecture diagram: supervisor instances, each connected through Jabber to one executor (two LCG executors, one NorduGrid executor, one Grid3 executor, one legacy executor); all supervisors work from the common prodDB and the common data management system (dms), which in turn talks to the replica catalogs (LRC, RLS) of the individual grids (LCG, NG, Grid3, legacy)]

SLIDE 7

[Same architecture diagram, with the legacy facility shown concretely as an LSF batch system]

SLIDE 8

  • prodDB = production database

– holds records for

  • job transformations
  • job definitions
    – status of jobs
  • job executions
  • logical files

– Oracle database hosted at CERN

SLIDE 9

[ER diagram: the main prodDB record types and their fields]

– jobTrans: uses, implementation, formalPars, ...
– jobDefinition: jobName, jobXML, currentState, lastAttempt, supervisor, priority, ...
– jobExecution: attemptNr, jobstatus, supervisor, executor, joboutputs, metadata, ...
– logicalFile: logicalFileName, logicalCollection, datasetName, guid, metadata, ...
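To make the schema above concrete, here is a hedged Python sketch of the four record types as dataclasses; the field names follow the slide, while the types and defaults are guesses (the real store is an Oracle database at CERN, not Python objects).

    """Sketch of the prodDB record types listed above (illustrative only)."""
    from dataclasses import dataclass, field


    @dataclass
    class JobTrans:
        uses: str = ""            # software release the transformation relies on
        implementation: str = ""  # script/executable implementing the transformation
        formalPars: str = ""      # XML signature, see the next slide


    @dataclass
    class JobDefinition:
        jobName: str = ""
        jobXML: str = ""          # actual parameters, inputs, outputs (see slides 11-12)
        currentState: str = ""    # e.g. pending / running / done / aborted
        lastAttempt: int = 0
        supervisor: str = ""
        priority: int = 0


    @dataclass
    class JobExecution:
        attemptNr: int = 0
        jobstatus: str = ""
        supervisor: str = ""
        executor: str = ""
        joboutputs: str = ""
        metadata: dict = field(default_factory=dict)


    @dataclass
    class LogicalFile:
        logicalFileName: str = ""
        logicalCollection: str = ""
        datasetName: str = ""
        guid: str = ""
        metadata: dict = field(default_factory=dict)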

SLIDE 10

<signature>
  <formalPar>
    <name>inputfile</name>
    <position>1</position>
    <type>LFN</type>
    <metaType>inputLFN</metaType>
  </formalPar>
  <formalPar>
    <name>outputfile</name>
    <position>2</position>
    <type>LFN</type>
    <metaType>outputLFN</metaType>
  </formalPar>
  ...
  <formalPar>
    <name>ranseed</name>
    <position>7</position>
    <type>natural</type>
    <metaType>plain</metaType>
  </formalPar>
</signature>

jobTrans:formalPars

SLIDE 11

<jobDef>
  <jobPars>
    <actualPar>
      <name>inputfile</name>
      <position>1</position>
      <type>LFN</type>
      <metaType>inputLFN</metaType>
      <value>dc2.003014.evgen.M1_minbias._00020.pool.root</value>
    </actualPar>
    ...
  </jobPars>
  <jobInputs>
    <fileInfo>
      <LFN>dc2.003014.evgen.M1_minbias._00020.pool.root</LFN>
      <logCol>/datafiles/dc2/evgen/dc2.003014.evgen.M1_minbias/</logCol>
    </fileInfo>
  </jobInputs>
  <jobOutputs>...</jobOutputs>
  <jobLogs>...</jobLogs>
</jobDef>

jobDefinition:jobXML
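A small illustration of how the actualPar entries above could be consumed: the Python sketch below orders the parameters by their <position>, as declared in the jobTrans signature on the previous slide, and builds a command line for the transformation. The transformation name and the random-seed value are hypothetical, and this is not the prodsys code itself.

    """Illustrative only: bind ordered actual parameters to a command line."""
    import xml.etree.ElementTree as ET

    JOB_XML = """
    <jobDef>
      <jobPars>
        <actualPar>
          <name>ranseed</name><position>7</position>
          <type>natural</type><metaType>plain</metaType>
          <value>1234567</value>
        </actualPar>
        <actualPar>
          <name>inputfile</name><position>1</position>
          <type>LFN</type><metaType>inputLFN</metaType>
          <value>dc2.003014.evgen.M1_minbias._00020.pool.root</value>
        </actualPar>
      </jobPars>
    </jobDef>
    """

    def command_line(job_xml: str, transformation: str) -> str:
        root = ET.fromstring(job_xml)
        pars = root.findall("./jobPars/actualPar")
        # order the actual parameters by the <position> declared in the
        # transformation signature (previous slide)
        pars.sort(key=lambda p: int(p.findtext("position")))
        values = [p.findtext("value") for p in pars]
        return " ".join([transformation] + values)

    print(command_line(JOB_XML, "dc2.simul.trf"))
    # -> dc2.simul.trf dc2.003014.evgen.M1_minbias._00020.pool.root 1234567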

SLIDE 12

<jobDef>
  <jobPars>...</jobPars>
  <jobInputs>...</jobInputs>
  <jobLogs>
    <fileInfo>
      <stream>stdboth</stream>
      <LFN>dc2.003014.simul.M1_minbias._00980.job.log</LFN>
      <logCol>/logfiles/dc2/simul/dc2.003014.simul.M1_minbias/</logCol>
      <dataset><name>dc2.003014.simul.M1_minbias.log</name></dataset>
      <SEList><SE>castorgrid.cern.ch</SE></SEList>
    </fileInfo>
  </jobLogs>
  <jobOutputs>
    <fileInfo>
      <LFN>dc2.003014.simul.M1_minbias._00980.pool.root</LFN>
      <logCol>/datafiles/dc2/simul/dc2.003014.simul.M1_minbias/</logCol>
      <dataset><name>dc2.003014.simul.M1_minbias</name></dataset>
      <SEList><SE>castorgrid.cern.ch</SE></SEList>
    </fileInfo>
  </jobOutputs>
</jobDef>

jobDefinition:jobXML

SLIDE 13

[Architecture diagram repeated, to introduce the supervisor (next slide)]

SLIDE 14

  • supervisor

– consumes jobs from the production database
– submits them to one of the executors it is connected with
– follows up on the job
– validates presence of expected outputs
– takes care of final registration of output products in case of success
– possibly takes care of clean-up in case of failure
– will retry n times if necessary
– implementation -> Windmill

  • http://heppc12.uta.edu/windmill/

– no brokering

  • “how-many-jobs-do-you-want” protocol

– possibly stateless
– uses Jabber to communicate with executors (a minimal loop is sketched below)
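The bullets above describe a simple lifecycle; the following Python sketch shows one way that loop could look. All class and method names (ProdDB, Executor, DMS front-ends, the Job record) are assumptions for illustration, not Windmill's actual API, and the real supervisor exchanges XML messages with its executors over Jabber rather than calling methods directly.

    """Illustrative sketch of the supervisor cycle (not Windmill's real code)."""

    MAX_ATTEMPTS = 3  # "will retry n times if necessary"


    class Supervisor:
        def __init__(self, proddb, executor, dms):
            self.proddb = proddb      # production database front-end
            self.executor = executor  # one of the connected executors
            self.dms = dms            # data management system front-end

        def cycle(self):
            # "how-many-jobs-do-you-want": the executor states its capacity,
            # the supervisor does no brokering of its own
            wanted = self.executor.num_jobs_wanted()

            # consume pending job definitions from prodDB and hand them over
            jobs = self.proddb.fetch_pending(limit=wanted)
            handles = {self.executor.submit(job): job for job in jobs}

            # follow up until every job reaches a final state
            for handle, job in handles.items():
                status = self.executor.get_status(handle)
                if status == "finished" and self.outputs_present(job):
                    self.register_outputs(job)      # final registration on success
                    self.proddb.set_state(job, "done")
                elif status == "failed":
                    self.executor.clean_up(handle)  # possible clean-up on failure
                    next_state = "pending" if job.attempt < MAX_ATTEMPTS else "aborted"
                    self.proddb.set_state(job, next_state, attempt=job.attempt + 1)

        def outputs_present(self, job):
            # validate the presence of the expected output files in the catalog
            return all(self.dms.exists(lfn) for lfn in job.output_lfns)

        def register_outputs(self, job):
            for lfn in job.output_lfns:
                self.dms.register(lfn, dataset=job.output_dataset)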

SLIDE 15

[Architecture diagram repeated, to introduce the executors (next slide)]

SLIDE 16

  • executor

– one for each facility flavor

  • LCG (lexor), NG (dulcinea), Grid3 (capone), PBS, LSF, BQS, Condor?, …

– translates the facility-neutral job definition into the facility-specific language

  • XRSL, JDL, wrapper scripts, …

– implements a facility-neutral interface (sketched after this slide)

  • usual methods: submit, getStatus, kill, …

– possibly stateless
– two implementation strategies

  • executor subclass
  • SOAP adapter + executor webservice (Capone)

– see other talks in this conference
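As a rough illustration of that facility-neutral interface, here is a hedged Python sketch. The method names follow the slide (submit, getStatus/kill, plus the capacity query and the translation step), but the class, the signatures, and the toy LSF example are assumptions, not the actual lexor/dulcinea/capone code.

    """Illustrative sketch of the facility-neutral executor interface."""
    from abc import ABC, abstractmethod


    class Executor(ABC):
        """One concrete subclass per facility flavor (lexor, dulcinea, capone, ...)."""

        @abstractmethod
        def num_jobs_wanted(self) -> int:
            """Answer the supervisor's 'how-many-jobs-do-you-want' request."""

        @abstractmethod
        def translate(self, job_xml: str) -> str:
            """Turn the neutral job definition into the facility's own language
            (JDL for LCG, XRSL for NorduGrid, a wrapper script for a batch system)."""

        @abstractmethod
        def submit(self, job_xml: str) -> str:
            """Submit the translated job and return a facility-specific handle."""

        @abstractmethod
        def get_status(self, handle: str) -> str:
            """Report the job state ('running', 'finished', 'failed', ...)."""

        @abstractmethod
        def kill(self, handle: str) -> None:
            """Abort the job on the facility."""


    class LSFExecutor(Executor):
        """Hypothetical legacy example: wrap the job for a local LSF queue."""

        def num_jobs_wanted(self) -> int:
            return 10  # e.g. keep at most 10 jobs queued locally

        def translate(self, job_xml: str) -> str:
            return "#!/bin/sh\n# run the transformation described in " + job_xml

        def submit(self, job_xml: str) -> str:
            wrapper = self.translate(job_xml)
            print(wrapper)         # a real implementation would pass this to bsub
            return "lsf-job-0001"  # and return the LSF job id as the handle

        def get_status(self, handle: str) -> str:
            return "finished"      # a real implementation would query the batch system

        def kill(self, handle: str) -> None:
            pass                   # a real implementation would cancel the batch job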

SLIDE 17

[Architecture diagram repeated, to introduce the data management system (next slide)]

SLIDE 18

  • data management system

– allows global cataloguing of files

  • we have opted to interface to existing replica catalog flavors

– allows global file movement

  • an ATLAS job can get/put a file anywhere

– presents a uniform interface on top of all the facility-native data management tools (a minimal sketch follows this slide)
– we only counted on the ability to do inter-grid file transfers

  • ideally jobs should be able to use input files located in other grids and write output files into other grids

  • this was not exercised

– stateless
– implementation -> Don Quijote

  • see separate talk by Miguel Branco
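A minimal sketch, under assumed class and method names, of what such a uniform interface could look like: one front-end hiding several per-grid replica catalogs (the LRC/RLS flavors shown in the architecture diagrams) behind exists/get/put calls. Don Quijote's real interface may well differ; see the referenced talk.

    """Illustrative sketch of a uniform DMS front-end (not Don Quijote's API)."""


    class ReplicaCatalog:
        """Minimal stand-in for one grid's native catalog."""

        def __init__(self):
            self.replicas = {}  # logical file name -> list of physical replicas

        def lookup(self, lfn):
            return self.replicas.get(lfn, [])

        def register(self, lfn, surl):
            self.replicas.setdefault(lfn, []).append(surl)


    class DMS:
        """Uniform interface on top of the per-grid catalogs and transfer tools."""

        def __init__(self, catalogs):
            self.catalogs = catalogs  # e.g. {"lcg": ..., "ng": ..., "grid3": ...}

        def exists(self, lfn):
            # global cataloguing: a file is known if any grid's catalog has it
            return any(cat.lookup(lfn) for cat in self.catalogs.values())

        def get(self, lfn, destination):
            # global movement: fetch the first available replica, whichever
            # grid it lives on (the transfer itself is omitted in this sketch)
            for cat in self.catalogs.values():
                replicas = cat.lookup(lfn)
                if replicas:
                    return "copy {} -> {}".format(replicas[0], destination)
            raise FileNotFoundError(lfn)

        def put(self, local_path, lfn, grid, surl):
            # store the file on the chosen grid and register it in that catalog
            self.catalogs[grid].register(lfn, surl)
            return "copy {} -> {}".format(local_path, surl)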

SLIDE 19

  • experience

– since the start of DC2 (July) the system has

  • recorded 235000 jobExecutions, 158000 jobDefinitions, and 251000 logicalFiles
  • approx. evenly distributed over the three Grid flavors
  • run 157 tasks using 22 jobTrans
  • consumed ~1.5 million SI2k-months of CPU (~5000 CPU-months)

– we had high dependency on middleware

  • broker in LCG, RLS in Grid3/NG, ...
  • we suffered a lot !
  • many bugs were found and corrected

– DC2 started before development was finished

  • we suffered a lot !
  • many bugs were found and corrected

– detailed experience reports per Grid in other talks

SLIDE 20

  • conclusion

– for DC2 ATLAS relies completely on a federation of grid systems (LCG, NorduGrid, Grid3)
– the ATLAS production system allows for automatic production on this federation of grids
– the ATLAS production system is based directly on the services offered by these grids
– stress-testing these services in the context of a major production was a new experience and many lessons were learned
– it was possible, but not easy

  • a lot of manpower was needed to compensate for missing and/or buggy software