17th Dec 2014
Regional WLCG connectivity in India and LHC Open Network Environment
Third Annual Workshop, Indian Institute of Technology Guwahati (IITG)
Regional WLCG connectivity in India and LHC Open Network Environment
Brij Kishor Jashal, Tata Institute of Fundamental Research, Mumbai
Email: brij.jashal@tifr.res.in
2
Agenda
- LHC data challenges and evolution of LHC data access model
- LHC open network environment - LHCONE
- India-CMS site infrastructure
- Network at T2_IN_TIFR
- Expectations from networks for LHC Run2
- WLCG regional connectivity in India by NKN
Complexity of LHC experiments
- 1 billion collisions per second (for each experiment). Each collision generates particles that often decay in complex ways into even more particles.
- Electronic circuits record the passage of each particle through a detector as a series of electronic signals, and send the data to the CERN Data Centre (DC) for digital reconstruction.
- 1 million gigabytes (1 PB) per second of raw data. The digitized summary is recorded as a "collision event".
- Physicists must sift through the data produced to determine whether the collisions have thrown up any interesting physics.
Source – cern.ch
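As a quick sanity check on the figures above, the quoted rates imply roughly 1 MB of raw data per collision:

```python
# Sanity check on the slide's figures (both numbers taken from the text above):
# ~1e9 collisions/s per experiment and ~1 PB/s of raw data imply
# about 1 MB of raw data per collision event.
collisions_per_second = 1_000_000_000   # 1 billion
raw_bytes_per_second = 10**15           # 1 PB/s (SI: 1 PB = 10^15 bytes)

bytes_per_collision = raw_bytes_per_second / collisions_per_second
print(f"{bytes_per_collision / 1e6:.1f} MB per collision")  # -> 1.0 MB per collision
```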
4
Source: Bill Johnston, ESnet
5
Evolution of computing models - 1
- When the computing models for the 4 LHC experiments were developed in the late nineties, networking was still a scarce resource, and almost all the models reflect this, although in different ways
- Instead of relying on the ability to provide the data when needed for analysis, the data is replicated to many places on the grid shortly after it has been produced, so that it is readily available for user analysis
- This model has proven to work well for the early stages of analysis, but it is limited by the ever-increasing need for disk space as the data volume from the machine grows with time
6
Evolution of computing models
- Bos-Fisk report on Tier 2 (T2) requirements for the Large Hadron Collider (LHC)
- From hierarchical to distributed
- The network is the most reliable resource for LHC computing
- This makes it possible to reconsider the data models and not rely on pre-placement of the data for analysis, or on running jobs only where the data is. Instead, jobs can pull the data from somewhere else if it is not already available locally
- The data needs decide whether to copy the data locally or to access it remotely
- If the data is copied locally, the storage turns into a cache that is likely to hold the selection of the data that is most popular for analysis at that time
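The caching behaviour described above can be sketched as a toy recency-based cache. This is an illustration only; the dataset names, sizes and the LRU eviction policy are assumptions, not any real WLCG component:

```python
from collections import OrderedDict

class DatasetCache:
    """Toy cache illustrating 'site storage as a cache of popular data'.
    Sizes in TB; purely illustrative, not a real WLCG service."""
    def __init__(self, capacity_tb):
        self.capacity = capacity_tb
        self.used = 0.0
        self.datasets = OrderedDict()   # name -> size, most recently used last

    def access(self, name, size_tb):
        if name in self.datasets:           # hit: data already cached locally
            self.datasets.move_to_end(name)
            return "local"
        # miss: evict least-recently-used datasets until the new one fits,
        # then pull the data from a remote site and cache it
        while self.used + size_tb > self.capacity and self.datasets:
            _, evicted_size = self.datasets.popitem(last=False)
            self.used -= evicted_size
        self.datasets[name] = size_tb
        self.used += size_tb
        return "pulled"

cache = DatasetCache(capacity_tb=10)
cache.access("/store/data/Run2014A", 6)   # pulled remotely, cached
cache.access("/store/data/Run2014B", 6)   # pulled, evicts Run2014A
cache.access("/store/data/Run2014B", 6)   # served locally
```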
APAN-38 7
Managing large-scale science traffic in a shared infrastructure
- The traffic from T0 to T1 is well served by the LHCOPN
- The traffic from the Tier 1 data centres to the Tier 2 sites (mostly universities), where the data analysis is done, is now large enough that it must be managed separately from the general R&E traffic
- In aggregate, the Tier 1 to Tier 2 traffic is equal to the Tier 0 to Tier 1 traffic (there are about 170 Tier 2 sites)
- Managing this across all possible Tier 1 to Tier 2 combinations required special infrastructure: the LHC Open Network Environment (LHCONE) was designed for this purpose
- LHCONE is an overlay network supported by over 15 national and international RENs
- Several Open Exchange Points, including NetherLight, StarLight, MANLAN and others
- Trans-Atlantic connectivity provided by ACE, GEANT, NORDUnet and USLHCNet
- Over 50 end sites connected to LHCONE: 45 Tier 2s and 10 Tier 1s
https://twiki.cern.ch/twiki/bin/view/LHCONE/LhcOneVRF#Connected_sites
8
Source: Bill Johnston, ESnet
9
LHCONE L3VPN architecture
TierX sites connect to national or continental VRFs
What LHCONE L3VPN is:
- A Layer 3 (routed) Virtual Private Network
- A dedicated worldwide backbone connecting Tier 1s and Tier 2s (and Tier 3s) at high bandwidth
- Reserved for LHC data transfers and analysis
10
11
The TierX site needs to have:
- Public IP addresses
- A public Autonomous System (AS) number
- A BGP-capable router
LHCONE Acceptable Use Policy (AUP):
- Use of LHCONE should be restricted to WLCG-related traffic
- IP addresses announced to LHCONE:
  - should be assigned only to WLCG servers
  - cannot be assigned to generic campus devices (desktop and portable computers, wireless devices, printers, VoIP phones, ...)
Routing setup:
- A BGP peering is established between the TierX and the VRF border routers
- The TierX announces only the IP subnets used for WLCG servers
- The TierX accepts all the prefixes announced by the LHCONE VRF router
- The TierX must ensure traffic symmetry: it injects only packets sourced by the announced subnets
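The symmetry rule above amounts to a source filter at the LHCONE edge. A minimal sketch using Python's ipaddress module; the subnets are hypothetical documentation ranges, not TIFR's real allocations:

```python
import ipaddress

# Illustrative subnets only (RFC 5737 documentation prefixes), standing in for
# the site's real WLCG-server and campus address space.
wlcg_server_subnets = [ipaddress.ip_network("192.0.2.0/27")]   # announced to LHCONE
campus_subnets = [ipaddress.ip_network("198.51.100.0/24")]     # NOT announced

def may_inject_into_lhcone(src_ip):
    """Traffic symmetry: inject only packets sourced from announced subnets."""
    src = ipaddress.ip_address(src_ip)
    return any(src in net for net in wlcg_server_subnets)

may_inject_into_lhcone("192.0.2.10")     # True: a WLCG storage node
may_inject_into_lhcone("198.51.100.7")   # False: a generic campus desktop
```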
12
Symmetric paths must be ensured. To achieve symmetry, we can use any one of the following techniques:
- Policy-based routing (source-destination routing)
- Physically separated networks
- Virtually separated networks (VRF)
- Science DMZ
Policy-based routing (source-destination routing): if a single border router is used to connect to the Internet and LHCONE, source-destination routing must be used
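The policy-based routing decision can be illustrated as a lookup keyed on the (source, destination) pair rather than destination alone. All prefixes below are hypothetical documentation ranges, not real LHCONE routes:

```python
import ipaddress

# Sketch of the source-destination routing decision: a packet leaves via the
# LHCONE path only when BOTH its source is a local WLCG server subnet AND its
# destination matches a prefix learned from the LHCONE VRF.
local_wlcg = ipaddress.ip_network("192.0.2.0/27")          # hypothetical
lhcone_prefixes = [ipaddress.ip_network("203.0.113.0/24")] # hypothetical

def next_hop(src_ip, dst_ip):
    src, dst = ipaddress.ip_address(src_ip), ipaddress.ip_address(dst_ip)
    if src in local_wlcg and any(dst in p for p in lhcone_prefixes):
        return "lhcone"       # dedicated LHCONE link / VRF
    return "general-ip"       # default route to the commodity Internet

next_hop("192.0.2.10", "203.0.113.50")   # -> "lhcone"
next_hop("192.0.2.10", "8.8.8.8")        # -> "general-ip"
```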
13
Physically separated networks: different routers can be used for generic and LCG hosts
Virtually separated networks (VRF): traffic separation is achieved with virtual routing instances on the same physical box
VRF = Virtual Routing and Forwarding (virtual routing instance)
14
15
16
T2_IN_TIFR Resources
Networking
- Dedicated P2P link to LHCONE, upgraded to 4 G guaranteed, with best effort up to 10 G
- Two 2.5 Gbit/s links: one to Europe and the other to Singapore, through TEIN4
- Internal network upgraded with a new chassis switch with a total switching throughput of 3.5 Tbps; storage nodes upgraded with 10 G links
- T2_IN_TIFR has been part of LHCONE, via CERNlight, since its inception
Computing
- Total number of physical cores: 1024; total HEP-SPEC06 score (Special Performance Evaluation for HEP code): 7218.12
Storage
- Total storage capacity of the 28 DPM disk nodes aggregates to more than 1 PB (1020 TB)
- Regional XRootD federation redirector for India
17
Network at T2_IN_TIFR
Last week GEANT upgraded the backbone link between the CERNlight router (where the Indian link terminates at CERN) and the GEANT PoP to 10 G
18
[Bar chart: monthly site availability and reliability, Dec-13 to Sep-14; values tabulated below]
Source:
https://espace2013.cern.ch/WLCG-document-repository/ReliabilityAvailability/Forms/AllItems.aspx?RootFolder=%2fWLCG-document-repository%2fReliabilityAvailability%2f2014&FolderCTID=0x0120000B00526BA2DC7C4AAEB1B0E517F001F8
WLCG site availability and reliability report, India-CMS TIFR, 2014 (Dec-13 to Sep-14)

Month                         Dec-13   Jan-14   Feb-14    Mar-14    Apr-14    May-14   Jun-14    Jul-14   Aug-14   Sep-14
Site availability                81%      98%      74%       96%       93%       93%      85%       98%      96%      98%
Site reliability                 81%     100%      76%      100%      100%       93%      85%       98%      96%     100%
CPU usage (HEP-SPEC06-hrs)   250,432  601,448  323,452 1,186,232 1,203,460 1,198,572  944,016 1,086,608  871,916  731,964
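From the monthly figures above, the ten-month averages work out to about 91% availability and 93% reliability:

```python
# Ten-month averages computed from the availability/reliability table above.
availability = [81, 98, 74, 96, 93, 93, 85, 98, 96, 98]
reliability  = [81, 100, 76, 100, 100, 93, 85, 98, 96, 100]

avg_avail = sum(availability) / len(availability)
avg_rel = sum(reliability) / len(reliability)
print(f"availability {avg_avail:.1f}%, reliability {avg_rel:.1f}%")
# -> availability 91.2%, reliability 92.9%
```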
19
20
21
22
23
Total cumulative data transfers over the last five months: 700 TB
[Plot: production and debug instance uploads and downloads]
24
Country-wise WLCG traffic to India; total volume in the last one year: 557 TB
(FTS + PhEDEx only; period 2013-12-01 00:00 UTC to 2014-12-15 00:00 UTC)

Volume per country (TB): USA 194, Swiss/Italy 156, Germany 68, UK 57, France 21, Spain 18, Russia 17, Netherlands 8, Others 5, Portugal 5, Switzerland 5, Belgium 3, Taiwan 2, South Korea 1, Austria 0.972, Hungary 0.824, Russian Federation 0.102, Brazil 0.025, Estonia 0.01, Finland 0.008, Ukraine 0.004, China 0.000829
25
Country-wise WLCG traffic from India (FTS + PhEDEx); total volume: 408 TB
(Period 2013-12-01 00:00 UTC to 2014-12-15 00:00 UTC)

Share per country: UK 41%, Spain 39%, Italy 4%, USA 4%, Taiwan 3%, Germany 3%, France 2%, South Korea 1%, Switzerland 1%; under 1% each: Austria, Belgium, Brazil, China, Estonia, Hungary, India, Others
26
Collaborating Indian institutes at the LHC (14 or more)
- Tata Institute of Fundamental Research (TIFR, Mumbai) - WLCG site
- Variable Energy Cyclotron Centre (VECC, Kolkata) - WLCG site
- Bhabha Atomic Research Centre (BARC, Mumbai)
- Delhi University
- Saha Institute of Nuclear Physics (SINP, Kolkata)
- Panjab University
- Indian Institute of Technology Bombay (IITB, Mumbai)
- Indian Institute of Technology Madras (IITM, Chennai)
- Raja Ramanna Centre for Advanced Technology (RRCAT, Indore)
- Indian Institute of Technology Bhubaneswar (IITBBS)
- Institute for Plasma Research (IPR, Ahmedabad)
- National Institute of Science Education and Research (NISER, Bhubaneswar)
- Visva-Bharati University (Santiniketan, WB)
- Indian Institute of Science Education and Research, Pune
(More than 75 active users from these institutes have accounts at T2_IN_TIFR and the TIFR Tier 3)
27
IndiaCMS - NKN - TIFR - LHCONE
- Currently NKN traffic to LHCONE and CERN goes via the TEIN4 link: NKN-TEIN4 (Mumbai) - Madrid - GEANT - LHCONE, on a 2.5 G link
- NKN is an integral part of the connectivity for the regional component of the Worldwide LHC Computing Grid (WLCG)
28
Traceroute analysis

traceroute to srm-cms.cern.ch (128.142.162.99), 30 hops max, 40 byte packets
 1  router.puhep.res.in (144.16.112.177)  2.862 ms  2.992 ms  3.149 ms
 2  202.141.110.229 (202.141.110.229)  0.353 ms  0.368 ms  0.372 ms
 3  10.144.20.5 (10.144.20.5)  38.901 ms  38.968 ms  38.976 ms
 4  10.255.238.189 (10.255.238.189)  38.659 ms  38.643 ms  38.625 ms
 5  mb-xe-01-v4.bb.tein3.net (202.179.249.41)  38.128 ms  38.254 ms  38.076 ms
 6  eu-mad-pr-v4.bb.tein3.net (202.179.249.118)  144.511 ms  144.504 ms  144.507 ms
 7  ae3.mx1.par.fr.geant.net (62.40.98.65)  170.136 ms  170.188 ms  170.033 ms
 8  switch-bckp-gw.mx1.par.fr.geant.net (62.40.124.82)  224.438 ms  190.136 ms  190.056 ms
 9  e513-e-rbrxl-2-te20.cern.ch (192.65.184.70)  177.140 ms  177.028 ms  177.135 ms
10  e513-e-rbrxl-1-ne0.cern.ch (192.65.184.37)  173.402 ms  163.651 ms  172.637 ms

traceroute to 144.16.111.24 (144.16.111.24), 30 hops max, 40 byte packets
 1  router.puhep.res.in (144.16.112.177)  0.685 ms  1.063 ms  1.262 ms
 2  202.141.110.229 (202.141.110.229)  0.357 ms  0.403 ms  0.375 ms
 3  * * *
 4  * * *
 5  202.141.153.30 (202.141.153.30)  31.708 ms  32.060 ms  31.628 ms
 6  202.141.153.29 (202.141.153.29)  38.695 ms  38.578 ms  38.657 ms
 7  ci-connect.indiacms.res.in (144.16.111.24)  38.329 ms  38.317 ms  38.291 ms
29
[root@cmst3ui2 ~]# traceroute -I 192.65.184.73
traceroute to 192.65.184.73 (192.65.184.73), 30 hops max, 60 byte packets
 1  172.16.11.252 (172.16.11.252)  2.353 ms  2.642 ms  2.883 ms
 2  172.16.0.254 (172.16.0.254)  0.580 ms  0.574 ms  0.560 ms
 3  vpn2.saha.ac.in (14.139.193.1)  0.945 ms  0.969 ms  0.968 ms   (NKN)
 4  10.118.248.93 (10.118.248.93)  0.951 ms  0.954 ms  0.954 ms   (NKN private core)
 5  * * *
 6  * * *
 7  10.255.221.34 (10.255.221.34)  31.804 ms  31.700 ms  31.694 ms   (NKN private core)
 8  115.249.209.6 (115.249.209.6)  37.302 ms  37.541 ms  37.541 ms   (RCOM - Andhra)
 9  * * *
10  * * *
11  62.216.147.73 (62.216.147.73)  46.588 ms  46.577 ms  46.556 ms   (UK)
12  xe-0-0-0.0.pjr03.ldn001.flagtel.com (85.95.26.238)  186.642 ms  174.002 ms  173.979 ms
13  xe-5-2-0.0.cji01.ldn004.flagtel.com (62.216.128.114)  187.455 ms  187.519 ms  187.693 ms
14  80.150.171.69 (80.150.171.69)  295.319 ms  293.396 ms  293.372 ms   (Germany)
15  217.239.43.29 (217.239.43.29)  305.694 ms  309.873 ms  306.677 ms   (Deutsche Telekom AG)
16  e513-e-rbrxl-1-ne1.cern.ch (192.65.184.73)  221.143 ms  230.249 ms  215.501 ms   (CERN)
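As a rough illustration of how such traces can be analysed, here is a minimal parser for output lines of the shape shown above. It is a sketch only: real traceroute output has more variants (unresolved hosts, per-probe paths) than this handles.

```python
import re

def parse_traceroute_line(line):
    """Extract (hop, host, median RTT in ms) from one traceroute output line.
    Returns None for lines it cannot parse, e.g. ' 5  * * *' timeouts."""
    m = re.match(r"\s*(\d+)\s+(\S+)\s+\(([\d.]+)\)\s+(.*)", line)
    if not m:
        return None
    hop, host = int(m.group(1)), m.group(2)
    rtts = sorted(float(x) for x in re.findall(r"([\d.]+)\s*ms", m.group(4)))
    return hop, host, rtts[len(rtts) // 2]

# Hop 6 of the first trace above: the Mumbai -> Madrid leg adds ~106 ms of RTT.
parse_traceroute_line(
    "6 eu-mad-pr-v4.bb.tein3.net (202.179.249.118) 144.511 ms 144.504 ms 144.507 ms")
# -> (6, 'eu-mad-pr-v4.bb.tein3.net', 144.507)
```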
30
- At present only TIFR traffic utilizes the dedicated link and follows LHCONE routing symmetry
- VECC Kolkata, with 6000 CPU cores and 240 TB of storage, is one of the major ALICE T2s, but at present its traffic goes via the TEIN4 link
- In the framework of AAA (any data, anytime, anywhere), it has become easier for the smaller T3 clusters to do HEP analysis on LHC data, as they do not have to worry about the data location. But this is only efficient if they have access to the LHCONE network
- All the WLCG member institutes of India are already operating (or coming up with) T3 computing clusters (e.g. SINP's 200-core cluster)
- To improve the networking experience of Indian WLCG institutes, LHC traffic needs to be segregated from general Internet traffic
- The Cisco Catalyst 9120, our main router (it belongs to NIC), needs to be upgraded for IPv6 features (we recently upgraded one SR SFP for 10 G)
- Enabling and testing jumbo frames (9000 bytes) on the NKN L3VPN in India?
- The LHCONE traffic and ecosystem have already adopted jumbo frames
- It is already proven that CPU overhead is reduced significantly and efficiency is improved
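A back-of-envelope calculation illustrates the jumbo-frame point: at a fixed line rate, the packet (and hence per-packet interrupt and processing) rate falls in proportion to the MTU.

```python
# At a given line rate, packet rate scales inversely with MTU, so a 9000-byte
# MTU needs ~6x fewer packets than the standard 1500 bytes. The 10 Gbit/s
# figure matches the TIFR uplink mentioned earlier; header overhead ignored.
line_rate_bps = 10 * 10**9   # 10 Gbit/s

def packets_per_second(mtu_bytes):
    return line_rate_bps / (mtu_bytes * 8)

standard = packets_per_second(1500)   # ~833,000 pkt/s
jumbo = packets_per_second(9000)      # ~139,000 pkt/s
print(f"{standard / jumbo:.0f}x fewer packets with jumbo frames")  # -> 6x
```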
31