
SLIDE 1

Computing at LHC experiments in the first year of data taking at 7 TeV

Daniele Bonacorsi

[ deputy CMS Computing coordinator - University of Bologna, Italy ]

  • On behalf of ALICE, ATLAS, CMS, LHCb Computing
SLIDE 2

Daniele Bonacorsi [CMS] ISGC’11 - Taipei - 22 March 2011

Growing up with Grids

The LHC Computing Grid (LCG) was approved by the CERN Council in 2001

  ✦ First Grid Deployment Board (GDB) in 2002

Since then, LCG was built on services developed in the EU and US

  ✦ LCG has collaborated with a number of Grid projects: EGEE, NorduGrid, and the Open Science Grid (OSG)

It evolved into the Worldwide LCG (WLCG)

  ✦ Coordination and service support for the operations of the 4 LHC experiments

Computing for LHC experiments grew up together with Grids

  ✦ Distributed computing achieved by previous experiments
  ✦ LHC experiments started in this environment, in which most resources were located away from CERN
  ✦ A huge collaborative effort throughout the years, and massive cross-fertilizations

[Logos: “Grid Solution for Wide Area Computing and Data Handling”, NorduGrid]

SLIDE 3

WLCG today for LHC experiments

11 Tier-1 centres, >140 Tier-2 centres (plus Tier-3s)

  ✦ ~150k CPU cores, hit 1M jobs/day
  ✦ >50 PB disk

SLIDE 4

Site reliability in WLCG

Basic monitoring of WLCG services

  ✦ at Tier-0/1/2 levels

Site reliability is a key ingredient in the success of LHC Computing

  ✦ Result of a huge collaborative work
  ✦ Thanks to WLCG and site admins!

[Plots: site reliability, Jul’06–Feb’11; 2010 data taking start at 7 TeV marked]

SLIDE 5

Readiness of WLCG Tiers

Site Availability Monitoring

  ✦ Critical tests, per Tier, per experiment

Some experiments built their own readiness criteria on top of the basic ones

  ✦ e.g. CMS defines a “site readiness” based on a boolean ‘AND’ of many tests

Easy to be OK on some; hard to be OK on all, and in a stable manner...

[Plots: CMS Tier-1s and CMS Tier-2s readiness (~plateau), Sep’08–Mar’11; 2010 data taking start at 7 TeV marked]
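The boolean-‘AND’ readiness idea above can be sketched in a few lines. This is an illustrative toy, not CMS's actual implementation: the test names and the `site_ready` helper are invented for the example; the only point carried over from the slide is that a single failing critical test makes the whole site "not ready".

```python
# Toy sketch of a CMS-style "site readiness" evaluation:
# a site counts as ready only if ALL critical tests pass (boolean AND).
# Test names below are hypothetical placeholders, not real CMS metrics.

def site_ready(test_results: dict) -> bool:
    """A site is ready only if every critical test passed."""
    return all(test_results.values())

t2_site = {
    "availability test": True,
    "job robot success": True,
    "download links commissioned": True,
    "upload links commissioned": False,   # a single failing test...
}

print(site_ready({name: True for name in t2_site}))  # OK on all tests
print(site_ready(t2_site))                           # ...is enough to fail
```

This is why, as the slide says, it is easy to be OK on some tests and hard to be OK on all of them in a stable manner: the AND gives no partial credit.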

SLIDE 6

LHC Computing models

LHC Computing models are based on the MONARC model

  ✦ Tiered computing facilities to meet the needs of the LHC experiments

MONARC was developed more than a decade ago

  ✦ It served the community remarkably well; evolutions in progress

[Diagrams: ATLAS example, hierarchical “cloud” T0 → T1s → T2s; CMS example, full mesh]
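The difference between the two topologies in the diagrams can be made concrete by counting allowed transfer routes. A minimal sketch, with invented placeholder site names: in the MONARC-style hierarchy, data moves only between a site and its parent, while in a full mesh any site can transfer to any other.

```python
# Contrast of the two transfer topologies on this slide.
# Site names are hypothetical placeholders, not real WLCG sites.

tiers = {"T0": ["CERN"], "T1": ["T1_A", "T1_B"], "T2": ["T2_A1", "T2_B1"]}
parent = {"T1_A": "CERN", "T1_B": "CERN", "T2_A1": "T1_A", "T2_B1": "T1_B"}

def monarc_routes(parent):
    """Hierarchical model: links exist only between a site and its parent."""
    down = {(p, c) for c, p in parent.items()}
    up = {(c, p) for c, p in parent.items()}
    return down | up

def full_mesh_routes(sites):
    """Full mesh: every ordered pair of distinct sites is a valid route."""
    return {(a, b) for a in sites for b in sites if a != b}

sites = [s for group in tiers.values() for s in group]
print(len(monarc_routes(parent)))    # 8 links: 4 parent-child pairs, both ways
print(len(full_mesh_routes(sites)))  # 20 links: 5 sites x 4 peers each
```

The route count grows quadratically in the full mesh, which is why commissioning the whole transfer matrix (as CMS did, see the data placement slide later) is a substantial effort.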

SLIDE 7

From commissioning to data taking

[Timeline 2004–2011, ending with pp+HI data taking]

“Data Challenges”: experiment-specific, independent tests (first full chain of computing models on Grids)

  • DC04 (ALICE, CMS, LHCb), DC2 (ATLAS)

“Service Challenges”: since 2004, to demonstrate service aspects:

  • DM and sustained data transfers
  • WM and scaling of job workloads
  • Support processes
  • Interoperability
  • Security incidents (“fire drills”)
  • SC1, SC2 (network transfer tests); SC3 (sustained transfer rates, DM, service reliability); SC4 (nominal LHC rates, disk→tape tests, all T1s, some T2s)

+ “Readiness/Scale Challenges”: Data/Service Challenges to exercise aspects of the overall service at the same time, if possible with VO overlap

  • CCRC08 (phases I–II): readiness challenge, all experiments, ~full computing models
  • STEP’09: scale challenge, all experiments + multi-VO overlap, FULL computing models
  • More experiment-specific challenges...

Run the service(s): focus on real and continuous production use of the services over several years:

  • simulations (since 2003)
  • cosmics data taking, …

SLIDE 8

LHC data taking 2010

Remarkable ramp-up in luminosity in 2010

  ✦ At the beginning, a “good” weekend could double or triple the dataset
  ✦ A significant failure or outage for a fill would cost a big fraction of the total data

[Plot: integrated luminosity, PRELIMINARY; NOTE: log scale]

Original planning for Computing in the first 6 months foresaw higher data volumes (tens of pb⁻¹)

  ✦ Time in stable beams per week reached 40% only a few times

Load on computing systems was lower than expected; no stress on resources

  ✦ The slower ramp allowed predicted activities to be performed more frequently

This will definitely not happen again in 2011: we will be resource constrained

SLIDE 9

Networks

OPN links are now fully redundant

  ✦ Means no service interruptions (cf. the fiber cut during STEP’09)

SLIDE 10

Networks in operations

Excellent monitoring systems

SLIDE 11

CERN→T1 data transfers

CERN outbound traffic showed high performance and reliability

  ✦ Very well serving the needs of the LHC experiments
  ✦ A joint and long commissioning and testing effort to achieve this

[Plot: all experiments, up to 1 PB; CCRC’08 challenge (phases I and II), STEP’09 challenge and ICHEP’10 conference marked]

SLIDE 12

An example: ATLAS data transfers

Transfers on all routes (among all Tier levels)

  ✦ Average: ~2.3 GB/s (daily average)
  ✦ Peak: ~7 GB/s (daily average)

Data available on-site after a few hours; traffic on the OPN measured up to 70 Gbps

ATLAS massive reprocessing campaigns

[Plot: GB/s per day, Jan–Dec 2010, stacked by activity: T0 export (incl. calib streams), MC transfers in clouds, data consolidation (MC transfers extra-clouds), data brokering (analysis data), user subscriptions; marked: 2009 data reproc, MC reproc, 2010 data taking start at 7 TeV, data + MC reproc, data taking + MC prod, 2010 pp reproc, PbPb reproc @T1s; average ~2.3 GB/s]
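As a back-of-envelope check, the quoted daily-average rates translate into daily volumes as follows (pure arithmetic; decimal units, 1 TB = 1000 GB assumed):

```python
# Convert the quoted daily-average transfer rates into volume per day.

SECONDS_PER_DAY = 86_400

def tb_per_day(rate_gb_per_s: float) -> float:
    """Daily volume in TB for a sustained rate in GB/s (1 TB = 1000 GB)."""
    return rate_gb_per_s * SECONDS_PER_DAY / 1000

print(round(tb_per_day(2.3), 1))  # ~198.7 TB/day at the ~2.3 GB/s average
print(round(tb_per_day(7.0), 1))  # ~604.8 TB/day on the ~7 GB/s peak day
```

So the ATLAS average alone corresponds to roughly 200 TB moved per day, comparable to the >200 TB/day CMS figure on the next slide.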

SLIDE 13

An example: CMS data transfers

Massive commissioning, now in continuous production mode of operations

  ✦ Can sustain up to >200 TB/day of production transfers on the overall topology

CMS improved by ad-hoc challenges of increasing complexity and by computing commissioning activities

[Plot: CMS PhEDEx transfer volume; NOTE: log scale]

SLIDE 14

More examples: ALICE and LHCb data transfers

ALICE transfers among all Tiers

[Plot: ALICE, # done transfers, up to 325k]

LHCb data is successfully transferred on a regular basis

  ✦ RAW data is replicated to one of the T1 sites

[Plot: LHCb, T0→T1 transfers, up to 80k GB]

SLIDE 15

Reprocessing

Once landed at the T1 level, LHC data gets reprocessed as needed

  ✦ New calibrations, improved software, new data formats

ALICE: Pass-2 reco; HI reco with opportunistic usage of resources

ATLAS: 4 reprocessing campaigns in 2010

  ✦ Feb’10: 2009 pp data + cosmics
  ✦ Apr’10: 2009/2010 data
  ✦ May’10: 2009/2010 data + MC
  ✦ Nov’10: full 2010 data + MC (from tapes)
  ✦ + HI reprocessing foreseen in Mar’11

CMS: ~a dozen reprocessing passes in 2010 (reprocessing passes only)

[Plots: # jobs for ALICE, LHCb, ATLAS and CMS, peaks around 6k and 16k]

SLIDE 16

Reprocessing profile

In 2010 it was possible to reprocess even more frequently than originally planned

ATLAS reprocessed 100% of data (1.5G events)

  ✦ RAW → ESD
  ✦ ESD merge
  ✦ ESD → dESD, AOD
  ✦ Grid distribution of derived data

Actually, from 7 days onwards mostly dealing with tails

About a dozen CMS reprocessing passes in 2010

[Plots: ATLAS fraction complete (normalised for each T1) vs campaign day 1–12, per cloud (CA, DE, ES, FR, IT, ND, NL, UK, US), for ESD and dESD/AOD; CMS]

SLIDE 17

MC production

Simulated Event Production on the Grid is very successful

  ✦ Accounts for a large fraction of the global Grid usage
  ✦ One of the earliest Grid applications

Other factors, e.g. pile-up, make this a more interesting problem

Simulation production continued in the background all the time

  ✦ Fluctuations caused by a range of causes, including release cycles, site downtimes, etc.

Depending on the experiment, it is done mainly at the T1/T2 level

  ✦ E.g. LHCb: Simulation 50%, Analysis 29%, Reconstruction 21%
  ✦ E.g. CMS: initially on 50% of T2 resources, recently expanded to T1s as well

SLIDE 18

MC production in ALICE

Performed on all T1/T2 sites

  ✦ A large fraction of the overall Grid usage

[Plots: ‘aliprod’ only, average ~8.8k jobs, peak ~23k; all users, average ~12k, peak ~27k]

SLIDE 19

MC production in LHCb

Using the LHCb production system

  ✦ Simulation, reconstruction, stripping, WG analysis (μDST)

Mainly at the T2 level

  ✦ >100 T2 sites stably supporting simulation

Several T2s contributing as if they were “small T1s”

Example: MC’09 production (Jun’09–Feb’10)

  ✦ LHCb hit ~140 T2s
  ✦ 4.2M jobs

SLIDE 20

MC production in ATLAS

Processing is managed centrally on all Tiers’ resources

[Plots: # jobs, up to 60k, Feb’10–Feb’11, each color a regional “cloud”; 2010 data taking start at 7 TeV marked; production share per “cloud” and per “T1” (including CERN); digging into one cloud (e.g. IT): CNAF + T2s]

SLIDE 21

MC production in CMS

Mostly T2s and some opportunistic T3s; since August 2010, T1s also

Tiers usage for MC production is driven by requests

  ✦ Only Tiers that pass the “Site Readiness” criteria are used

[Plots: # jobs, up to 20k, Jan’10–Jan’11, each color a Tier; 600M events]

SLIDE 22

MC production: data formats

ATLAS: more simulated ESDs produced since Dec’09 (simulated data produced Jul’09 → Feb’10)

  ✦ to match real data analysis

CMS: analysis is moving to AODs (MC production Jan’10 → Feb’11)

  ✦ already being produced in 2010

SLIDE 23

Analysis: shielding the complexity

At the level of the analysis tools developed by the experiments, a key point is to shield the user from the structure/complexity of the Grids

  ✦ Each framework implements an instance of this concept in a different way: pAthena and Ganga in ATLAS, AliEn in ALICE, CRAB in CMS, Ganga and Dirac in LHCb
  ✦ They manage creation, submission and tracking of jobs, and return results to users

A lot of things need to go right for this to work: find the local environment, authenticate user and VO, discover and open local or remote data files, choose the site(s) with the desired data files or resources, submit successfully through the site grid interface, arrive at the batch farm in the packaged local environment, communicate job status and write results somewhere

We see a margin of improvement here, at several levels

  ✦ Efficiency of completion, CPU efficiency, user experience, status tracking, monitoring & accounting, debugging and troubleshooting, …

Nevertheless, the LHC experiments successfully run analysis on the Grid! (see next slides)
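The chain of steps the slide lists can be sketched as a toy state machine. This is purely illustrative: the `AnalysisJob` class, the step names and `run_workflow` are invented for the example and are not the API of CRAB, pAthena/Ganga, AliEn or Dirac; the point is only that every step must succeed in sequence for the user to get results back.

```python
# Toy sketch of the workflow an analysis front-end shields users from.
# All names here are hypothetical, not real experiment-tool APIs.

from dataclasses import dataclass, field

@dataclass
class AnalysisJob:
    user: str
    dataset: str
    status: str = "created"
    log: list = field(default_factory=list)

    def advance(self, step: str) -> None:
        self.status = step
        self.log.append(step)

STEPS = [
    "authenticate user and VO",
    "discover data (local or remote replicas)",
    "choose site(s) holding the data",
    "submit through the site grid interface",
    "arrive at batch farm",
    "run in packaged local environment",
    "report status and stage out results",
]

def run_workflow(job: AnalysisJob) -> AnalysisJob:
    # Every step must succeed for the chain to complete; real tools also
    # retry and resubmit on failures, which is omitted here.
    for step in STEPS:
        job.advance(step)
    return job

job = run_workflow(AnalysisJob(user="alice", dataset="/Data/Run2010A"))
print(job.status)    # the final step of the chain
print(len(job.log))  # 7 steps executed
```

A failure at any single step (expired proxy, missing replica, full batch queue) breaks the whole chain, which is why the slide flags completion efficiency and troubleshooting as areas with a margin of improvement.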

SLIDE 24

Data placement for analysis: an example

Once data is on the WLCG, it must be made accessible to analysis applications

  ✦ The largest fraction of analysis computing at LHC is at the T2 level
  ✦ Flexibility of the transfer model helps to reduce the latency seen by analysis end-users

>95% of the entire transfer matrix commissioned

  ✦ Up to 30 links commissioned per day; the average is ~7 links/day over the first 6 months of data taking
  ✦ T1–T2 dominates, T2–T2 emerges: # T2–T2 links used for data transfers monthly (not always the same ones)

[Plot: links commissioned in 2010, average 7; NOTE: log scale]

SLIDE 25

Analysis in ALICE

Interesting analysis train model

  ✦ User code is picked up and executed together with other analyses

On average, 1.7k concurrent user jobs in 2010

>9M user jobs completed over the last 12 months

~200 distinct users on average, and increasing

[Plots: # running jobs (‘alitrain’ only), average ~1.7k, dip at Xmas + AliEn release; increasing trend in the # of end-user analysers (continues after Xmas); each color is a user, blue is the total]

SLIDE 26

Analysis in ATLAS

Increase in analysis load after the start of 2010 data taking

  ✦ After that, roughly stable load

Holiday holes, as well as activity peaks before major conferences (e.g. ICHEP’10), are visible

Analysis share per “cloud” (only for the pAthena-Panda system; Ganga-WMS not counted)

[Plot: # jobs, up to 20k, Feb’10–Feb’11; 2010 data taking start at 7 TeV marked]

SLIDE 27

Analysis in CMS

Analysis at the T2 level; constant increase in # users

  ✦ ~300–350 distinct daily users
  ✦ Up to >500 users per week during peaks
  ✦ >800 individuals per month

NOTE: weekend and Xmas “holes” are visible only in the distributed analysis pattern, not in scheduled processing (e.g. MC, re-reco)

[Plots: users, 2007–2010 and Apr’10–Feb’11; ad-hoc CMS Computing scale exercise focussed on Analysis marked]

SLIDE 28

Analysis in LHCb

No a-priori assignment of site

  ✦ Shared according to availability of resources and data

Only ~2% of analysis at T2s

  ✦ Toy MC, private small simulations, etc.

~320 unique analysis users

Roughly ~50% of LHCb analysis is performed outside CERN

[Plots: # running user jobs, up to 6k, average 1.4k, Mar’10–Feb’11, each color a user; 2010 data taking start at 7 TeV marked; successful user jobs at T1s (# jobs/hr, average 310), Sep’10–Feb’11: CERN, CNAF, GRIDKA, RAL, ...]

SLIDE 29

Do analysis teams complete their tasks?

~81% successful LHCb jobs on the Grid (Apr’10–Feb’11)

Training and experience:

  ✦ is allowing wider access to the Grid(s)
  ✦ is building more solid data analysis teams

E.g. see here for CMS

[Plots: LHCb and CMS; NOTE: the line is _not_ a fit]

SLIDE 30

Conclusions

The overall Grid infrastructure is working for LHC Physics at 7 TeV

  ✦ Able to cope with the load in all sectors

(Rare) backlogs or (rare) service losses showed no impact whatsoever on physics

  ✦ Real data is collected, stored, reprocessed, skimmed, transferred
  ✦ Simulated data are produced according to the physics needs
  ✦ Data and simulations are successfully delivered to analysis end-users

Impressive to see such big collaborative work on such a large scale

  ✦ If you worked on even some bits of this, YOU are part of this success

Integrated volume of data and live time of the accelerator are still lower than expected, but the plan calls for “interesting” times soon...

  ✦ Not all resources were equally utilized in 2010
  ✦ A resource-constrained environment in 2011

Activity level is high

  ✦ Enthusiasm and hope for discoveries is very high

Stay tuned...

SLIDE 31

Acknowledgements

Thanks to the LHC Computing teams (managers / coordinators / operators) - and to the LHC accelerator division! - for such a fruitful 2010.

Thanks to the Grid developers, the entire WLCG community and all the site admins at the distributed Tiers for their committed, competent and constant work.