

SLIDE 1

LHC‐CMS Tier2 facility at TIFR

http://indiacms.res.in

Kajari Mazumdar, Department of High Energy Physics, Tata Institute of Fundamental Research, Mumbai, India.

Plan

  • Introduction
  • Quick tour of LHC and the computing grid
  • Issues related to network and CMS-T2 at Mumbai

EU-India Grid meeting, APAN conference, Delhi, August 24, 2011

SLIDE 2

e-Science and e-Research

  • Collaborative research that is made possible by sharing resources across the internet (data, computation, people's expertise, ...)
    – Crosses organisational, national and international boundaries
    – Often very compute-intensive and/or data-intensive
  • The CERN-LHC project is an excellent example of all of the above
    – HEP-LHC has been a driving force for GRID technology
    – The Worldwide LHC Computing GRID (WLCG) is a natural evolution of internet technology
  • The WWW was born at CERN to satisfy the needs of the previous generation of HEP experiments

SLIDE 3

Large Hadron Collider (LHC)

The largest scientific project ever: 20 years to plan and build, 20 years to work with.

  • 27 km circumference
  • at 1.9 K
  • at 10⁻¹³ Torr
  • at 50-175 m below the surface
  • > 10,000 magnets

4 big experiments: > 10,000 scientists, students, engineers. Operational since 2009 Q4, with excellent performance and a fast harvest of science!

SLIDE 4

[Timeline figure: the LHC probes the universe at ~10⁻¹² seconds after the Big Bang in p-p collisions and ~10⁻⁶ seconds in Pb-Pb collisions; experiments in astrophysics & cosmology, COBE (1989) and WMAP (2001), probe from ~300,000 years after the Big Bang up to today.]

SLIDE 5

Enter a New Era in Fundamental Science

Exploration of a new energy frontier in p-p and Pb-Pb collisions.

[Figure: the 27 km LHC ring with the four experiments ATLAS, CMS, LHCb and ALICE marked.]

SLIDE 6

What happens in an LHC experiment

Proton-proton running, summer 2011:
  – Bunches/beam: 1400
  – Protons/bunch: 2 × 10¹¹
  – Beam energy: 3.5 TeV (1 TeV = 10¹² eV)
  – Luminosity: 2 × 10³³ cm⁻² s⁻¹
  – Crossing rate: 20 MHz
  – Collision rate: ~10⁸ Hz (cross-checked in the sketch below)

Mammoth detectors register signals for energetic, mostly (hard) inelastic collisions involving large momentum transfer.
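These numbers fix the collision rate: it is just the luminosity times the inelastic p-p cross-section. A minimal cross-check in Python, where the ~70 mb cross-section is an assumed textbook figure, not a number from this talk:

```python
# Back-of-the-envelope check: collision rate = luminosity x cross-section.
luminosity = 2e33             # cm^-2 s^-1 (slide value, summer 2011)
sigma_mb = 70.0               # inelastic p-p cross-section in mb (assumption)
sigma_cm2 = sigma_mb * 1e-27  # 1 mb = 1e-27 cm^2

rate = luminosity * sigma_cm2
print(f"collision rate ~ {rate:.1e} Hz")   # ~1.4e8 Hz, i.e. the ~10^8 Hz above

# At a 20 MHz crossing rate this means several collisions per bunch crossing.
print(f"collisions per crossing ~ {rate / 20e6:.0f}")   # ~7
```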

SLIDE 7

Motivation of LHC experiments

The LHC is meant to resolve some of the most puzzling issues in physics:

  • Nature of elementary particles and interactions shortly after the Big Bang:
    – how many interactions were there when the universe was much hotter?
    – which elementary particles existed, and with what properties?
  • We recreate the conditions of the very early universe at the LHC.
  • Origin of mass:
    – mass patterns among different particles in today's universe
    – why is the photon massless, while the carriers of the weak interaction are massive?
    – if the symmetry is broken spontaneously, what is the signature?
    – existence of the "God particle"?
  • The Higgs boson is yet to be discovered, but coming soon, stay tuned!

The LHC is at the threshold of discoveries likely to change the way we are used to thinking of Nature!

SLIDE 8

Grand menu from LHC

  • Nature of dark matter: we know only 4% of the constituents of the universe.
  • A good 25% of the rest is massive enough to dictate the motion of galaxies, but non-luminous and hence "dark".
  • The LHC can tell us the nature of this dark matter!

The LHC will also shed light on:

  • why there is only matter and no antimatter today;
  • properties of the 4th state of matter: the Quark-Gluon Plasma, which existed ~1 picosecond after the Big Bang, before the formation of neutrons and protons.

[Figure: observed vs. expected (from visible matter) distribution of matter in galaxies.]

All this is possible because the LHC is essentially a microscope AND a telescope as well!

SLIDE 9

To begin at the end:

  • The operations of the LHC machine and the experiments have been a great success.
  • The experiments produced fantastic results, often only days after the data was taken: more than 200 publications in < 2 years.
  • This is partly because of the long lead-time the experiments had.
    – This has had implications for the analysis patterns.
  • But there have been lessons to be learned.
    – And we have just started on a treadmill, which will require continual development.

The LHC Computing Grid is the backbone of the success story.

SLIDE 10

Example of a modern detector

3170 scientists and engineers, including ~800 students, from 169 institutes in 39 countries, India among them.

[Figure: map of the collaborating countries, with India highlighted.]

SLIDE 11

Data rates @ CMS as foreseen for design parameters

Presently the event size is ~1 MB (beam flux is lower than design) and the data collection rate is ~400 Hz; the sketch below turns these into an annual volume.
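Those two numbers set the raw-data scale directly. A minimal sketch, assuming ~7 × 10⁶ s of beam per year (the run-duration figure quoted later on slide 34):

```python
# Rough raw-data volume from the current trigger rate and event size.
event_size_bytes = 1e6       # ~1 MB per event (slide value)
trigger_rate_hz = 400        # events written per second (slide value)
live_seconds = 7e6           # assumed beam time per year (cf. slide 34)

petabytes = event_size_bytes * trigger_rate_hz * live_seconds / 1e15
print(f"raw data ~ {petabytes:.1f} PB/year")   # ~2.8 PB/year
```

Reconstruction passes, simulation and replicas multiply this several-fold, which is how the O(10) PB/year on the next slide arises.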

SLIDE 12

Challenges

  • Versatile experiments, equipped with very specialized detectors: ~10⁷ electronic channels per experiment, ready every 25 ns to collect information on debris from violent collisions.
  • Reconstruct ~20,000 charged tracks in a single event (lead-lead collisions at LHC).
  • Event size is related to flux/intensity.
  • 1.5 billion events recorded in 2010.
  • > 2 billion events, much more complicated, to be recorded during 2011; resource utilization to be prioritized by carefully throwing away soft, mundane collisions.

[Figures: charged tracks from a heavy-ion collision vertex; ~10 vertices in a single p-p bunch crossing, to be discriminated from the interesting process.]

SLIDE 13

The GRID Computing Goal

  • Science without borders.
  • Provide resources and services to store/serve O(10) PB of data per year.
  • Provide access to all interesting physics events to O(1500) collaborators around the world.
  • Minimize constraints due to user localisation and resource variety.
  • Decentralize control and costs of the computing infrastructure.
  • Share resources with other LHC experiments.

Solution: the Worldwide LHC Computing GRID (WLCG). Delivery of physics should be fast; the grid is the workhorse for production data handling.

  • Today > 140 sites
  • ~250k CPU cores
  • ~100 PB disk
SLIDE 14

Layered structure of the CMS GRID, connecting computers across the globe

  • Tier 0: the experimental site and the CERN computer centre, Geneva; online data recording (several petabytes/sec).
  • Tier 1: national centres, e.g. ASIA (Taiwan), France, Italy, Germany, USA; connected to CERN at 10 Gbps.
  • Tier 2: regional groups in a continent/nation, e.g. India, China, Korea, Taiwan, Pakistan; connected at 1-2.5 Gbps. India's is Indiacms: T2_IN_TIFR.
  • Tier 3: institutes in a country; different universities: TIFR, BARC, Panjab Univ., Delhi Univ., ...
  • Individual scientist's PC, laptop, ...

SLIDE 15

CMS in total: 1 Tier-0 at CERN (Geneva), 7 Tier-1s on 3 continents, 50 Tier-2s on 4 continents.

The CMS T2 in India is one of the 5 in the Asia-Pacific region. Today: 6 collaborating institutes in CMS, ~50 scientists + students; 2.1% of signing authors in publications; contributing ~3% of the computing resources of CMS.

SLIDE 16

Quick description of LHC grid tiers

The first job of the offline system is to process and monitor data at T0. Processing time depends on the experiment (less than anticipated for CMS).

T0: 1 M jobs/day + test jobs; traffic: 4 Gbps input, > 13 Gbps served. The CERN Tier 0 moves ~1 PB of data per day, with automated subscription.

T1: processes data further several times a year, and coordinates with the T2s.

Challenges at T2:
  • T2s host specific data streams for physics analysis.
  • As local resources demand, data is "tidied up" at T2.
  • T2s get their main data from T1s; recently there are more communications among T2s.
  • Real workhorses of the system, with growing roles.
  • Analysis system/paths working well.
  • Site readiness is at a high level (lots of testing); availability keeps improving (with a lot of effort!).
  • Typically 100k analysis jobs/day/experiment.

SLIDE 17

General T2 issues

  • Distributed analysis is not a wish, it is a necessity: the tools have to be reliable!
  • Operations are far too effort-intensive.
    – More automation needed + pro-activeness of sites.
  • More monitoring is not enough; we need more history.
    – More coherent and compact views for sites are needed.
  • Need views for distinct purposes:
    – View for the experiment: is the end-to-end system working?
    – View for the site: are the local services working?
    – View for the managers: are the resources being used effectively, and if not, where is the problem?
  • Storage access remains a problem (data deletion works fine, though!)
    – The typical failure mode of analysis jobs is storage-related; the toy tally below illustrates the kind of compact view meant here.
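To make the "compact views" point concrete, here is a purely hypothetical sketch of such a view: tally job failures by category so that storage-related problems stand out. The record format and the category names are invented for illustration:

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class JobReport:
    """Hypothetical per-job record a site might collect."""
    day: str     # e.g. "2011-08-24"
    status: str  # "success" or "failed"
    reason: str  # e.g. "stage-out", "read-error", "cpu-limit", ""

def failure_summary(reports):
    """Compact view: failure counts by reason, plus the storage-related share."""
    failed = [r for r in reports if r.status == "failed"]
    by_reason = Counter(r.reason for r in failed)
    storage = by_reason.get("stage-out", 0) + by_reason.get("read-error", 0)
    return by_reason, storage / len(failed) if failed else 0.0

reports = [
    JobReport("2011-08-24", "success", ""),
    JobReport("2011-08-24", "failed", "stage-out"),
    JobReport("2011-08-24", "failed", "read-error"),
    JobReport("2011-08-24", "failed", "cpu-limit"),
]
by_reason, storage_share = failure_summary(reports)
print(by_reason)                                # Counter({'stage-out': 1, ...})
print(f"storage-related: {storage_share:.0%}")  # 67% in this toy sample
```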

SLIDE 18

T2_IN_TIFR, Mumbai

Resources (Q3 2011):

  • CPU (Condor): 320 cores
  • Disk (DPM): 500 TB
  • Network (WAN): 1 Gbit/s (related to the disk figure in the sketch below)
  • T3: 20 TB disk, with local access to T2 data
  • ~45 users, many of them students

  • The CMS T2 @ India is contributing well to the computing effort of the CMS experiment: credits earned against mandatory service jobs.
  • Central CMS analysis activities: physics, HO detector calibration.
  • Part of the LHCONE test operation (privileged access to a private network involving T1s).

Middleware:
  – Storage Elements
  – Computing Elements
  – Workload Management
  – Local File Catalog
  – Information System
  – Virtual Organisation management
  – Inter-operability among different GRIDs (EGEE, OSG, NorduGrid, ...)
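To relate the disk and WAN figures above: even running the 1 Gbit/s link flat out, refreshing the full 500 TB would take well over a month. A one-line check:

```python
# How the T2 WAN link relates to its disk capacity (slide values).
disk_tb, wan_gbps = 500, 1.0
days = disk_tb * 1e12 * 8 / (wan_gbps * 1e9) / 86_400
print(f"filling {disk_tb} TB at {wan_gbps} Gbit/s takes ~{days:.0f} days")  # ~46
```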

SLIDE 19

Network connections

TIFR-INDIACMS T2:
  • 1 Gbps to CERN, peered to GEANT
  • 250 Mbps NKN + 2.5 Gbps TEIN3
  • 1 Gbps GLORIAD ???

VECC/SINP (India):
  • 100 Mbps to VECC, RRCAT, IPR

SLIDE 20

Data Transfer challenges

  • T0 – T1: multi-virtual-organization (VO) challenge.
  • T1 – T1: Analysis Object Data (AOD) to be replicated at all T1s.
    – Simultaneous transfers of ~50 TB of data without major bottlenecks (timescale sketched below).
  • T1 – T2: data serving (the data typically resides on tape, hence a staging step is needed).
  • T2 – T2: stronger involvement than foreseen during the design phase.
    – Heavy activity for CMS link-commissioning experts and sites.

Many thanks to our funding agencies: we do have reasonable connectivity for our T2s (CMS @ Mumbai and ALICE @ Kolkata). But it is not enough for equal-share performance within the collaboration. From India, we would like to take up computing shifts (monitoring grid centres). More importantly, within the country much better connectivity is essential for the T2 to effectively serve collaborators within India (~6 CMS T3s at present). The community is growing fast, and so is the demand for fast access of data to Mumbai.
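For scale, the ~50 TB T1-T1 replications above would occupy a 10 Gbps T1 link (the speed quoted on slide 14) for about half a day even at full line rate; a quick check, assuming an idealised dedicated link:

```python
# Idealised transfer time for 50 TB over a 10 Gbps link.
tb, gbps = 50, 10                          # slide values
hours = tb * 1e12 * 8 / (gbps * 1e9) / 3600
print(f"~{hours:.0f} hours at line rate")  # ~11 hours; longer in practice
```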

SLIDE 21

Physics datasets for analysis

  • Distribution of data to participating centres all over the world.
  • Huge datasets (~TB) cannot be transferred to each user: the analysis jobs go to "where the data is", as the toy broker below illustrates.
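A hedged sketch of that data-locality principle: a broker matches a job's input dataset against the sites hosting a replica instead of moving the data. The catalog contents and free-slot numbers are made up; the site names merely follow the CMS naming convention:

```python
# Toy data-locality brokering: send the job where its dataset already is.
replica_catalog = {  # dataset -> sites hosting a replica (illustrative)
    "/MinBias/Run2011A/AOD": ["T2_IN_TIFR", "T1_DE_KIT"],
    "/MuSkim/Run2011A/RECO": ["T2_US_MIT"],
}

def broker(dataset, free_slots):
    """Pick the hosting site with the most free slots; None if none can run."""
    hosts = replica_catalog.get(dataset, [])
    usable = [s for s in hosts if free_slots.get(s, 0) > 0]
    return max(usable, key=free_slots.get, default=None)

free = {"T2_IN_TIFR": 120, "T1_DE_KIT": 40, "T2_US_MIT": 0}
print(broker("/MinBias/Run2011A/AOD", free))  # -> T2_IN_TIFR
print(broker("/MuSkim/Run2011A/RECO", free))  # -> None (only host is full)
```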

SLIDE 22

Data Transfers from/to TIFR

  • Total data volume at present: ~250 TB.
  • Total transfers during the last 6 months: ~70 TB.
  • TIFR is hosting (i) centrally managed data (simulated, custodial) and (ii) collision-data skims.

[Figures: upload and download rate plots.]

  • Current CMS total CPU pledge at T2s: 18k job slots.
  • Nominal analysis pledge: 50%.
  • Slot utilization during Summer/Fall '09 was reasonable, but we need to go into sustained analysis mode (see the quick arithmetic below).
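Reading those numbers together, as a sketch (the "6 months" is approximated as 180 days):

```python
# What the pledge and transfer numbers imply, roughly.
total_t2_slots = 18_000          # CMS-wide T2 job slots (slide value)
analysis_fraction = 0.50         # nominal analysis pledge (slide value)
print(f"analysis slots ~ {total_t2_slots * analysis_fraction:.0f}")  # 9000

tb_moved, days = 70, 180         # ~70 TB in ~6 months (slide value, assumption)
mbps = tb_moved * 1e12 * 8 / (days * 86_400) / 1e6
print(f"average transfer rate ~ {mbps:.0f} Mbit/s")  # ~36 of the 1000 Mbit/s link
```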

SLIDE 23

CMS Production at TIFR

Total no. of events processed: ~10 billion in the last 6 months.

SLIDE 24

How CMS rates the performance of a site

  • Availability and reliability charts: mostly green for the last 1 year.
  • Data transfers (PhEDEx, minimum upload/download rates).
  • Job Robot (simulates small analysis/user jobs).
    – For failures, the logfile describes their nature; investigations are carried out locally + by central experts. If the JR success rate is good, the T2 is doing fine.

Site availability: fraction of time all functional tests at a site are successful (sketched in code below).

Link quality: number of data-transfer links with an acceptable quality.
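The availability definition above is simple enough to state in code; this is a sketch of the formula only — the real WLCG/CMS functional tests and their scheduling are more involved, and the hourly record format here is invented:

```python
def site_availability(intervals):
    """Fraction of time intervals in which *all* functional tests passed.

    intervals: one dict per time slice, test name -> bool (invented format).
    """
    if not intervals:
        return 0.0
    good = sum(1 for tests in intervals if all(tests.values()))
    return good / len(intervals)

# Toy day: 24 hourly slices, one with a failed transfer test.
day = [{"ce": True, "se": True, "transfer": True}] * 23
day.append({"ce": True, "se": True, "transfer": False})
print(f"availability = {site_availability(day):.1%}")  # 95.8%
```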

SLIDE 25

The KEY of GRID Computing: NETWORK

SLIDE 26

The KEY of GRID Computing: NETWORK II

The ERNET, NKN + TEIN3 facility has been crucial for general internet connectivity. About 40% of data transfers take place through the TEIN3 link, at times 100%.

SLIDE 27

[Network throughput plot, August 15-18, 2011: maximum 1.5 Gbps, average ~1 Gbps.]
SLIDE 28

Needed from the network

  • Bandwidth, reliability, connectivity: collaborators on 6 continents.
    – A group has been set up to express these requirements in conjunction with the network communities.
  • But we also need a service:
    – Monitoring is largely missing today: we often have a hard time understanding whether there is a network problem.
    – Operational support: this is a complex problem with many organisations involved. Who owns the problem? How can we (the users) track progress?

These issues are to be addressed urgently for the full utilization of the facility available to Indian scientists; they are a deterrent for a CMS/ALICE operations centre, which would involve almost 100 people contributing to mandatory service jobs towards the experiments.

SLIDE 29

Conclusion: augmenting the success of the grid

  • Making what we have today more sustainable (software, operations, etc.) is a challenge.
  • Data issues:
    – Data management and access.
    – How to make reliable systems from commodity (or expensive!) hardware.
    – Fault tolerance.
  • Need to adapt to changing technologies:
    – Use of many-core CPUs (and other processor types?).
    – Soon global filesystems, virtualization?
  • Network infrastructure:
    – This is the most reliable service we have!
    – Invest in networks and make full use of the distributed system!

Sincere thanks to ERNET, EU-India grid and NKN for the support towards our effort in putting India on the CMS-Grid map.

SLIDE 30

Backup

SLIDE 31

Production of simulated data

  • A good measure of the computing performance on the GRID.
  • ProdAgent: the main CMS production tool.
    – Modular Python daemons on top of a MySQL DB, interacting with the CMS Data Management systems (a toy sketch of this daemon pattern follows).
    – Used by several CMS workflows (Tier-0, Tier-1, Tier-2).
  • CMS has a centralized production strategy handled by the Data Operations teams: 6 "regional teams" + a Production Manager; 1 team (6 people) manages submissions in defined "T1-regions".
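ProdAgent itself is not reproduced here; the following is a purely illustrative sketch of the "modular daemon polling a database" pattern the bullet describes, using SQLite in place of MySQL, with an invented table and invented states:

```python
import sqlite3
import time

# Stand-in for the production database a daemon would poll (invented schema).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE workflow (id INTEGER PRIMARY KEY, state TEXT)")
db.executemany("INSERT INTO workflow (state) VALUES (?)",
               [("new",), ("new",), ("running",)])

def poll_once(conn):
    """One daemon cycle: claim 'new' workflows and mark them 'submitted'."""
    rows = conn.execute("SELECT id FROM workflow WHERE state = 'new'").fetchall()
    for (wf_id,) in rows:
        conn.execute("UPDATE workflow SET state = 'submitted' WHERE id = ?",
                     (wf_id,))
        print(f"submitted workflow {wf_id}")
    conn.commit()
    return len(rows)

# A real daemon would loop forever; run two cycles here for illustration.
for _ in range(2):
    if poll_once(db) == 0:
        time.sleep(0.1)  # nothing to do; back off briefly
```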

SLIDE 32

India: Introducing the players

  • Initial impetus: Indian groups in CMS and ALICE pledged Tier-2 regional computing centers; CERN-IT collaborated with DAE to develop global GRID components & tools for WLCG.
    – Regional computing centers: at TIFR, Mumbai for CMS; at SINP/VECC, Kolkata for ALICE.
    – The major funding agencies DAE (Dept. of Atomic Energy) and DST (Dept. of Science & Tech.) became involved, not only for international but also for domestic connectivity for Tier-2 to Tier-3 connections.
    – Many DAE institutions and projects: DAE-GRID.
  • Parallel effort: the EU (GEANT-Dante) in touch with DIT-ERNET for setting up Europe-India connectivity into GEANT for educational and scientific collaboration (DIT: Dept. of Information Tech.; ERNET: implementing agency). 50% funding of the Europe-India link from the EU.

SLIDE 33

Players ... cont.

  • 2005-06: marriage of the two; a 45 Mbps link under the GEANT-ERNET agreement, of which 34 Mbps for India-LHC connectivity. MOU between DIT, DST and TIFR (DAE) on the Indian side to share costs.
  • GARUDA, the National GRID Initiative: a project for domestic connectivity & GRID infrastructure among Indian educational and R&D institutions, by C-DAC and ERNET (C-DAC: Center for Development of Advanced Computing).
  • The EU-IndiaGRID project started in Oct 2006: the Indian LHC experiment groups and others in biology, condensed-matter physics, earth and atmospheric sciences. Work on INTEROPERABILITY OF GRIDS. International and domestic connectivity via the GEANT-ERNET link and GARUDA.
  • End-2007: 0.3-1 Gbps direct DAE TIFR-CERN link for LHC startup. Peering to GEANT at the CERN end for connectivity to other CMS Tier-1 centers.

SLIDE 34

Computing Resources: Setting the scale

  • Run 2009-10: 300 Hz / 7 × 10⁶ sec / 2.2 × 10⁹ events.
  • Size & CPU per event:

    Data tier        RAW   RECO   SIMRAW   SIMRECO   AOD
    <Size> [kB]      200   400    2,000    500       150
    CPU [HS06-sec]   -     100    1,000

  • CMS datasets:
    – O(10) Primary Datasets (RAW, RECO, AOD)
    – 40-70 Secondary Datasets (RECO, AOD)
    – × 1.5 more simulated events (SIMRAW, SIMRECO, AOD)
  • HEP-Spec2006 (HS06), applied in the sketch after this slide:
    – a modern CPU ~ 8 HS06/core
    – 100 HS06-sec ~ 12.5 sec/event
    – 100 kHS06 ~ 12,500 cores
  • 3 full re-reconstructions are planned. CMS resource request:
    – 400 kHS06 CPU
    – 26 PB disk
    – 38 PB tape
  • Resource ratio CERN / (T1+T2): CPU 25%, disk 15%.

[Bar charts: disk [PB] and CPU [kHS06] at CERN T0, T1 and T2 — installed Oct '09, CMS requests '09 and '10, pledges '10.]
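The HS06 bullets above convert directly into wall-clock time and core counts. A minimal sketch using only numbers from this slide:

```python
# HS06 conversions applied to the 2009-10 run (all inputs are slide values).
hs06_per_core = 8             # a modern CPU ~ 8 HS06/core
reco_hs06_sec = 100           # RECO cost per event
events = 2.2e9                # events recorded in 2009-10
raw_kb, aod_kb = 200, 150     # per-event sizes

sec_per_event = reco_hs06_sec / hs06_per_core
print(f"{sec_per_event} s/event on one core")             # 12.5 s/event

cores = 100_000 / hs06_per_core
print(f"100 kHS06 ~ {cores:.0f} cores")                   # 12,500 cores

days = events * sec_per_event / cores / 86_400            # one re-reco pass
print(f"one re-reconstruction pass ~ {days:.0f} days on 100 kHS06")  # ~25

print(f"RAW ~ {events * raw_kb * 1e3 / 1e15:.2f} PB")     # ~0.44 PB
print(f"AOD ~ {events * aod_kb * 1e3 / 1e15:.2f} PB")     # ~0.33 PB
```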

SLIDE 35

Site Description

  • 6 collaborating institutes at present, more in the near future: BARC, Delhi Univ., Panjab Univ., TIFR (EHEP & HECR), Visva-Bharati Univ.
  • About 50 physicists.
  • 3.5 FTE to manage the site till now, reducing gradually.

User access to T2_IN_TIFR:

  • High-end server as User Interface with the latest gLite, ROOT and CMS software versions (concurrent versions).
  • GRID computing analysis facility using the CMS-specific package CRAB.
  • Directly connected to the T2 LAN.
  • Fast access to storage using RFIO.
  • 20 TB of local disk space for users, with individual directories.
  • AFS client, dedicated job slots (PBS) facilities.
  • Latest OS security patches.
  • ...
SLIDE 36

[Network topology diagram of the IndiaCMS Tier-2. On site: ERNET ASR-1002 router, FLAG router, Cisco 3750G, and an NKN router (Cisco) at the computer center; a public switch carrying the CE, SE, MON and NAT-GW nodes, and a private switch (with trunking) carrying the DPM storage servers DPMS01-DPMS18 and worker nodes WN01-WN48. External connectivity: a shared 1 Gbps NKN link; a 175 Mbps internet link via Milano towards GEANT2; TEIN3 at 2.5 Gbps to Europe/Singapore; GLORIAD at 1 Gbps via Amsterdam (1:1) and Singapore; a 1 Gbps link to CERN, reaching other T1s by peering; links to VECC (Tier-2), RRCAT and IPR; NIC.]