

slide-1
SLIDE 1

LHCnet: Proposal for LHC Network infrastructure extending globally to Tier2 and Tier3 sites

Artur Barczyk, Harvey Newman
California Institute of Technology / US LHCNet
LHCT2S Meeting, CERN, January 13th, 2011

1

slide-2
SLIDE 2

THE PROBLEM TO SOLVE

2

slide-3
SLIDE 3

LHC Computing Infrastructure

3

WLCG in brief:

  • 1 Tier-0 (CERN)
  • 11 Tier-1s; 3 continents
  • 164 Tier-2s; 5 (6) continents

Plus O(300) Tier-3s worldwide

slide-4
SLIDE 4

CMS Data Movements

(All Sites and Tier1-Tier2)

4

[Figure: throughput plots, vertical axis Throughput [GBy/s]]

  • 120 days, June-October: daily average total rates reach over 2 GBytes/s
  • 120 days, June-October: daily average T1-T2 rates reach 1-1.8 GBytes/s
  • 132 hours, last week: 1-hour average up to 3.5 GBytes/s

Tier2-Tier2 traffic is ~25% of Tier1-Tier2 traffic, rising to ~50% during dataset reprocessing & repopulation

slide-5
SLIDE 5

5

Worldwide data distribution and analysis (F.Gianotti)

[Figure: Total throughput of ATLAS data through the Grid, 1st January to November, in MB/s per day. Annotations: ~2 GB/s (design); 6 GB/s; peaks of 10 GB/s reached]

Grid-based analysis in Summer 2010: >1000 different users; >15M analysis jobs

The excellent Grid performance has been crucial for fast release of physics results. E.g. ICHEP: the full data sample taken until Monday was shown at the conference on Friday
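
The throughput figures on the two preceding slides are quoted in GBytes/s, while link capacities in the rest of the talk are given in Gbps. A minimal Python sketch of the conversion, assuming 8 bits per byte, no protocol overhead, and 10 Gbps links as the unit of comparison (the function names are illustrative and not taken from the slides):

```python
import math


def gbytes_to_gbps(gbytes_per_s: float) -> float:
    """Convert a rate in GBytes/s to Gbps (8 bits per byte, overhead ignored)."""
    return gbytes_per_s * 8


def links_needed(gbytes_per_s: float, link_gbps: float = 10.0) -> int:
    """Smallest number of links of capacity `link_gbps` able to carry the rate."""
    return math.ceil(gbytes_to_gbps(gbytes_per_s) / link_gbps)


if __name__ == "__main__":
    # Rates quoted on the CMS and ATLAS throughput slides.
    for label, rate in [("CMS 1-hour average", 3.5), ("ATLAS peak", 10.0)]:
        print(f"{label}: {rate} GBytes/s = {gbytes_to_gbps(rate):.0f} Gbps, "
              f"at least {links_needed(rate)} x 10 Gbps links")
```

This counts aggregate capacity only; the quoted rates are sums over many site pairs, not the load on any single link.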

slide-6
SLIDE 6

Changing LHC Data Models

  • 3 recurring themes:

– Flat(ter) hierarchy: Any site might in the future pull data from any other site hosting it.
– Data caching: Analysis sites will pull datasets from other sites “on demand”, including from Tier2s in other regions

  • Possibly in combination with strategic pre-placement of data sets

– Remote data access: jobs executing locally, using data cached at a remote site in quasi-real time

  • Possibly in combination with local caching
  • Expect variations by experiment

6

slide-7
SLIDE 7

7

Ian Bird, CHEP conference, Oct 2010

slide-8
SLIDE 8

Remote Data Access and Local Processing with Xrootd (CMS)

  • Useful for smaller sites with less (or even no) data storage
  • Only selected objects are read (with object read-ahead). No transfer of entire data sets
  • CMS demonstrator: Omaha diskless Tier3, served data from Caltech and Nebraska (Xrootd)

8

Strategic Decisions: Remote Access vs Data Transfers
Brian Bockelman, September 2010
Similar operations in ALICE for years

slide-9
SLIDE 9

9

Ian Bird, CHEP conference, Oct 2010

slide-10
SLIDE 10

Requirements summary

(from Kors’ document)

  • Bandwidth:

– Ranging from 1 Gbps (Minimal site) to 5-10 Gbps (Nominal) to N x 10 Gbps (Leadership)
– No need for full-mesh @ full-rate, but several full-rate connections between Leadership sites
– Scalability is important:

  • sites are expected to migrate Minimal → Nominal → Leadership
  • Bandwidth growth: Minimal = 2x/yr, Nominal & Leadership = 2x/2yr
  • “Staging”:

– Facilitate good connectivity to so far (network-wise) underserved sites

  • Flexibility:

– Should be able to include or remove sites at any time

  • Budget Neutrality:

– Solution should be cost neutral [or at least affordable, A/N]
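
The growth rates in the Bandwidth item translate into a simple doubling formula. The Python sketch below projects capacity per site class; the growth rates come from the slide, while the starting values (1, 10 and 20 Gbps) are illustrative picks from the quoted ranges, with N = 2 assumed for a Leadership site:

```python
def projected_gbps(start_gbps: float, years: float, doubling_period_years: float) -> float:
    """Capacity after `years`, doubling once every `doubling_period_years`."""
    return start_gbps * 2 ** (years / doubling_period_years)


# (starting capacity in Gbps, doubling period in years).
# Starting capacities are assumptions chosen from the ranges on the slide.
SITE_CLASSES = {
    "Minimal": (1.0, 1.0),      # 2x per year
    "Nominal": (10.0, 2.0),     # 2x per 2 years
    "Leadership": (20.0, 2.0),  # 2x per 2 years, N = 2 assumed
}


def projection_table(years):
    """Map each site class to its projected capacity for every year in `years`."""
    return {
        name: [projected_gbps(start, y, period) for y in years]
        for name, (start, period) in SITE_CLASSES.items()
    }


if __name__ == "__main__":
    for name, values in projection_table(range(0, 5)).items():
        print(name, [round(v, 1) for v in values])
```

Under these assumptions a Minimal site starting at 1 Gbps reaches 16 Gbps after four years, while a Nominal site starting at 10 Gbps reaches 40 Gbps, so the gap between the classes narrows over time.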

10

slide-11
SLIDE 11

SOLUTION PROPOSAL

11

slide-12
SLIDE 12

Lessons learned

  • The LHC OPN has proven itself; we should learn from it
  • Simple architecture

– Point-to-point Layer 2 circuits
– Flexible and scalable topology

  • Grew organically

– From star to partial mesh
– Open to several technology choices

  • each of which satisfies requirements
  • Federated governance model

– Coordination between stakeholders
– No single administrative body required
– Made extensions and funding straightforward

  • Remaining challenge: monitoring and reporting

– More of a systems approach

12

slide-13
SLIDE 13

Design Inputs

  • Given the scale, geographical distribution and diversity of the sites, as well as of the funding, only a federated solution is feasible
  • The current LHC OPN is not modified

– OPN will become part of a larger whole
– Some purely Tier2/Tier3 operations

  • Architecture has to be Open and Scalable

– Scalability in bandwidth, extent and scope

  • Resiliency in the core, allow resilient connections at the edge
  • Bandwidth guarantees → determinism

– Reward effective use
– End-to-end systems approach

  • Operation at Layer 2 and below

– Advantage in performance, costs, power consumption

13

slide-14
SLIDE 14

Design Inputs, cont.

  • Most/all R&E networks (technically) can offer Layer 2 services

– Where not, commercial carriers can
– Some advanced ones offer dynamic (user controlled) allocation

  • Leverage existing infrastructures and collaborations as much as possible

– GLIF, DICE, GLORIAD, …

  • Last but not least:

– This would be the perfect occasion to start using IPv6; we should therefore (at least) encourage IPv6, but support IPv4

  • Admittedly the challenge is above Layer 3

14

slide-15
SLIDE 15

Design Proposal

  • A design satisfying all requirements:

Switched Core with Routed Edge

  • Sites interconnected through Lightpaths

– Site-to-site Layer 2 connections, static or dynamic

  • Switching is far more robust and cost-effective for high-capacity interconnects
  • Routing (from end-site viewpoint) is deemed necessary

15

slide-16
SLIDE 16

Switched Core

  • Strategically placed core exchange points

– E.g. start with 2-3 in Europe, 2 in NA, 1 in SA, 1-2 in Asia
– E.g. existing devices at Tier1s, GOLEs, GEANT nodes, …

  • Interconnected through high capacity trunks

– 10-40 Gbps today, soon 100 Gbps

  • Trunk links can be CBF, multi-domain Layer 1 / Layer 2 links, …

– E.g. Layer 1 circuits with virtualised sub-rate channels, sub-dividing 100G links in early stages

  • Resiliency, where needed, provided at Layer 1 / Layer 2

– E.g. SONET/SDH Automatic Protection Switching, Virtual Concatenation

  • At a later stage, automated Lightpath exchanges will enable a flexible “stitching” of dynamic circuits

– See demonstration (proof of principle) at last GLIF meeting and SC10

16

slide-17
SLIDE 17

One Possible Core Technology: Carrier Ethernet

  • IEEE standard 802.1Qay (PBB-TE)

– Separation of backbone and customer network through MAC-in-MAC
– No flooding, no Spanning Tree
– Scalable to 16 M services

  • Provides OAM comparable to SONET/SDH

– 802.1ag, end-to-end service OAM

  • Continuity Check Message, loopback, linktrace

– 802.3ah, link OAM

  • Remote loopback, loopback control, remote failure indication
  • Cost Effective

– e.g. NSP study indicates TCO ~43% lower for COE (PBB-TE) vs MPLS-TE

  • 802.1Qay and the ITU-T G.8031 Ethernet Linear Protection standard provide 1+1 and 1:1 protection switching

– Similar to SONET/SDH APS
– Works by Y.1731 message exchange (ITU-T standard)

17

slide-18
SLIDE 18

Routed Edge

  • End sites (might) require Layer 3 connectivity in the LAN

– Otherwise a true Layer 2 solution might be adequate

  • Lightpaths terminate on a site’s router

– Site’s border router, or, preferably,
– Router closest to the storage elements

  • All IP peerings are p2p, site-to-site

– Reduces convergence time, avoids issues with flapping links

  • Each site decides and negotiates with which remote site it desires to peer (e.g. based on experiment’s connectivity design)
  • Router (BGP) advertises only the SE subnet(s) through the configured Lightpath
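
The last point, advertising only the storage element (SE) subnets over a Lightpath peering, amounts to an export filter. The Python sketch below illustrates the selection logic only; it is not router configuration, the function name is hypothetical, and the prefixes are documentation ranges chosen for the example:

```python
import ipaddress


def prefixes_to_advertise(local_prefixes, se_subnets):
    """Return the local prefixes that fall inside one of the SE subnets.

    Both arguments are lists of CIDR strings (IPv4 or IPv6). Prefixes
    outside every SE subnet are withheld from the Lightpath peering.
    """
    se_networks = [ipaddress.ip_network(s) for s in se_subnets]
    allowed = []
    for prefix in local_prefixes:
        net = ipaddress.ip_network(prefix)
        # subnet_of() raises TypeError across IP versions, so compare versions first.
        if any(net.version == se.version and net.subnet_of(se) for se in se_networks):
            allowed.append(str(net))
    return allowed


if __name__ == "__main__":
    site_prefixes = ["192.0.2.0/25", "198.51.100.0/24", "2001:db8:1::/48"]
    storage_subnets = ["192.0.2.0/24", "2001:db8:1::/48"]
    print(prefixes_to_advertise(site_prefixes, storage_subnets))
```

Keeping the general campus prefix (198.51.100.0/24 in this example) out of the advertisement means only storage traffic is attracted onto the Lightpath.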

18

slide-19
SLIDE 19

Lightpath termination

  • Avoid LAN connectivity issues when terminating lightpath at campus edge
  • Lightpath should be terminated as close as possible to the Storage Elements, but this can be challenging if not impossible (support a dedicated border router?)
  • Or, provide a “local lightpath” (e.g. a VLAN with proper bandwidth, or a dedicated link where possible); border router does the “stitching”

19

slide-20
SLIDE 20

IP backup

  • Foresee IP routed paths as backup

– End-site’s BR is configured for both default IP connectivity and direct peering through Lightpath
– Direct peering takes precedence

  • Works also for dynamic Lightpaths
  • For full dynamic Lightpath setup, dynamic end-site configuration through e.g. LambdaStation or TeraPaths will be used
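
The precedence rule on this slide can be sketched as a choice between two candidate paths to the same remote site. The Python below is an illustration only: the `preference` field is an assumption loosely analogous to a BGP local preference (higher wins), and the path names are hypothetical:

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Path:
    name: str
    preference: int  # higher value is preferred (assumed convention)
    up: bool = True


def select_path(paths: Iterable[Path]) -> Optional[Path]:
    """Pick the usable path with the highest preference, or None if all are down."""
    usable = [p for p in paths if p.up]
    if not usable:
        return None
    return max(usable, key=lambda p: p.preference)


if __name__ == "__main__":
    lightpath = Path("lightpath-peering", preference=200)
    routed = Path("default-ip", preference=100)
    print(select_path([lightpath, routed]).name)  # direct peering is chosen
    lightpath.up = False                          # Lightpath torn down or failed
    print(select_path([lightpath, routed]).name)  # traffic falls back to routed IP
```

The same logic covers dynamic Lightpaths: while no circuit is provisioned, the direct peering is simply down and the routed path carries the traffic.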

20

slide-21
SLIDE 21

Resiliency

  • Resiliency in the core is provided by protection switching, depending on technology used between core nodes

– SONET/SDH or OTN protection switching (Layer 1)
– MPLS failover
– PBB-TE protection switching
– Ethernet LAG

  • Sites can opt for additional resiliency (e.g. where protected trunk links are not available) by forming transit agreements with other sites

– akin to the current LHC OPN use of CBF

21

slide-22
SLIDE 22

Layer 1 through Layer 3

22

slide-23
SLIDE 23

Scalability

  • Assuming Layer 2 point-to-point operations, a natural scalability limitation is the 4k VLAN IDs
  • This problem is naturally resolved in

– PBB-TE (802.1Qay), through MAC-in-MAC encapsulation
– dynamic bandwidth allocation with re-use of VLAN IDs

  • Only constraint: no two connections through the same network element may use the same VLAN
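
The re-use constraint can be sketched as a first-fit assignment: a VLAN ID is acceptable for a new connection if none of the network elements on its path already uses it. The Python below is a minimal illustration; the element names and the first-fit policy are assumptions, not part of the proposal:

```python
# 12-bit VLAN ID field; IDs 0 and 4095 are reserved, leaving 1..4094 usable.
VLAN_IDS = range(1, 4095)


def assign_vlan(path, in_use):
    """Assign the lowest VLAN ID that is free on every element of `path`.

    `path` is a list of network element names; `in_use` maps an element name
    to the set of VLAN IDs already carried by it, and is updated in place.
    """
    for vid in VLAN_IDS:
        if all(vid not in in_use.get(element, set()) for element in path):
            for element in path:
                in_use.setdefault(element, set()).add(vid)
            return vid
    raise RuntimeError("no VLAN ID is free on every element of this path")


if __name__ == "__main__":
    usage = {}
    print(assign_vlan(["site-A", "exchange-X", "site-B"], usage))  # first connection
    print(assign_vlan(["site-C", "exchange-X", "site-D"], usage))  # shares exchange-X
    print(assign_vlan(["site-E", "site-F"], usage))                # disjoint path
```

Connections sharing an exchange point receive different IDs, while a disjoint path can re-use an ID already taken elsewhere, which is what lifts the 4k limit from a global to a per-element bound.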

23

[Figure: PBB-TE frame layout] B-DA | B-SA | Ethertype 0x88A8 | B-VID | Ethertype 0x88E7 | I-SID | Customer Frame incl. Header+FCS | B-FCS
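
The frame layout in the figure can be made concrete by packing the backbone header in Python. This is a simplified sketch: the tag layout (a 16-bit B-TAG control field carrying a 12-bit B-VID, and a 32-bit I-TAG control field carrying a 24-bit I-SID) is assumed from the MAC-in-MAC encapsulation of IEEE 802.1ah, the drop-eligibility and other flag bits are left at zero, and the customer frame and B-FCS are omitted. The function names and MAC addresses are illustrative:

```python
import struct


def mac_to_bytes(mac: str) -> bytes:
    """Convert 'aa:bb:cc:dd:ee:ff' to its 6-byte form."""
    raw = bytes.fromhex(mac.replace(":", ""))
    if len(raw) != 6:
        raise ValueError("a MAC address must be 6 bytes")
    return raw


def pbb_header(b_da: str, b_sa: str, b_vid: int, i_sid: int, pcp: int = 0) -> bytes:
    """Build B-DA, B-SA, B-TAG (0x88A8) and I-TAG (0x88E7) in network byte order."""
    if not 0 <= b_vid < 2 ** 12:
        raise ValueError("B-VID is a 12-bit field")
    if not 0 <= i_sid < 2 ** 24:
        raise ValueError("I-SID is a 24-bit field")
    if not 0 <= pcp < 8:
        raise ValueError("priority is a 3-bit field")
    b_tci = (pcp << 13) | b_vid   # priority in the top 3 bits, B-VID in the low 12
    i_tci = (pcp << 29) | i_sid   # priority in the top 3 bits, I-SID in the low 24
    return (mac_to_bytes(b_da) + mac_to_bytes(b_sa)
            + struct.pack("!HHHI", 0x88A8, b_tci, 0x88E7, i_tci))


if __name__ == "__main__":
    header = pbb_header("02:00:00:00:00:01", "02:00:00:00:00:02",
                        b_vid=100, i_sid=0xABCDEF)
    print(len(header), header.hex())
```

The 24-bit I-SID is what gives the "16 M services" figure quoted for PBB-TE, against 4094 usable values for a 12-bit VLAN ID.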

slide-24
SLIDE 24

How do End-Sites Connect? A Simple Example

  • A Tier2 in Asia needs 1 Gbps connectivity (each) to 2 sites in Europe, 2 in US and the ASGC Tier1
  • 5 x 1G intercontinental circuits is cost-prohibitive
  • The Tier2 could however afford a 1-2 Gbps (e.g. EoMPLS) circuit to the next GOLE (e.g. HKOP, KRLight, TaiwanLight, T-LEX)

– Through NREN(s) or commercial circuits

  • The GOLE connects to StarLight, NetherLight (trunks) and has a connection to ASGC (example)
  • Static bandwidth allocation (first stage):

– The end-site has a 1 Gbps link, with 5 VLANs, each one terminating at one of the desired remote sites
– Bandwidth is allocated by the exchange points to fit the needs

  • Dynamic allocation (early adopter + later stage):

– The end-site has a 1 Gbps link, with configurable remote end-points and bandwidth allocation
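
For the static first stage, the five VLANs have to share the single 1 Gbps access link. The slide does not say how the exchange points divide the bandwidth; the Python sketch below assumes a proportional split whenever the requested total exceeds the link capacity, and the remote site labels are placeholders:

```python
def allocate_static(link_mbps: float, demands_mbps: dict) -> dict:
    """Share a link among per-VLAN demands.

    Demands are granted in full if they fit; otherwise each VLAN gets a share
    proportional to its demand (an assumed policy, not from the proposal).
    """
    total = sum(demands_mbps.values())
    if total <= link_mbps:
        return dict(demands_mbps)
    return {site: demand * link_mbps / total for site, demand in demands_mbps.items()}


if __name__ == "__main__":
    # One VLAN per desired remote site, each nominally wanting 1 Gbps.
    wanted = {"EU-1": 1000, "EU-2": 1000, "US-1": 1000, "US-2": 1000, "ASGC": 1000}
    for site, mbps in allocate_static(1000, wanted).items():
        print(f"{site}: {mbps:.0f} Mbps")
```

With equal demands each VLAN ends up with a fifth of the link, which is the limitation that dynamic allocation is meant to remove by giving the full link to whichever remote site is in use.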

24

slide-25
SLIDE 25

Monitoring and Reporting

  • Pervasive monitoring of status and utilisation is a must!

– Robust (100% monitoring up-time)
– Resilient
– Reliable
– Real-time
– End-to-end

  • Candidate 1: MonALISA monitoring system, used in US LHCNet, and at large scale e.g. in the ALICE experiment

– From US LHCNet experience: it has all the components, and is proven to be scalable to satisfy the requirements
– See e.g. LHC OPN presentation on MonALISA in US LHCNet: http://indico.cern.ch/getFile.py/access?subContId=1&contribId=15&resId=0&materialId=slides&confId=80755

  • Candidate 2: perfSONAR, building on a set of community-developed tools

25

slide-26
SLIDE 26

DYNAMIC LIGHTPATHS

26

slide-27
SLIDE 27

Dynamic Lightpaths - Intro

  • Kors’ requirements document: “[…] the backbone does not need to support all possible connections at full speed all the time. The backbone does need to support several full speed connections between the leadership Tier2s simultaneously.”
  • Dynamic Lightpaths provide temporary bandwidth allocation on an as-needed basis

– Connection reservation between any pair of sites for the requested amount of time (only)

  • Deployed in several R&E networks (ESnet, Internet2, SURFnet, US LHCNet)
  • Pilots being prepared in others (GEANT + selected NRENs)
  • DYNES instrument, interconnecting ~40 US campuses, will start deployment in early 2011
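
Reserving a connection for a requested amount of time can be illustrated with an admission check on a single trunk. The Python sketch below is a toy model, not the interface of any of the services named above: the class name, the half-open time intervals and the single-trunk view are all assumptions made for the example:

```python
class TrunkCalendar:
    """Time-windowed bandwidth reservations on one trunk (toy model)."""

    def __init__(self, capacity_gbps: float):
        self.capacity_gbps = capacity_gbps
        self.reservations = []  # list of (start, end, gbps), interval is [start, end)

    def peak_load(self, start: float, end: float) -> float:
        """Highest committed bandwidth at any instant within [start, end)."""
        # Load can only rise at the window start or where another reservation begins.
        points = [start] + [s for s, _, _ in self.reservations if start < s < end]
        return max(
            sum(g for s, e, g in self.reservations if s <= t < e) for t in points
        )

    def reserve(self, start: float, end: float, gbps: float) -> bool:
        """Admit the request only if capacity is never exceeded in the window."""
        if end <= start:
            raise ValueError("end must be after start")
        if self.peak_load(start, end) + gbps > self.capacity_gbps:
            return False
        self.reservations.append((start, end, gbps))
        return True


if __name__ == "__main__":
    trunk = TrunkCalendar(capacity_gbps=10)
    print(trunk.reserve(0, 4, 6))  # admitted
    print(trunk.reserve(2, 6, 6))  # rejected: overlaps hours 2-4 at 12 Gbps
    print(trunk.reserve(4, 8, 6))  # admitted: starts when the first one ends
```

Two 6 Gbps transfers cannot overlap on a 10 Gbps trunk but can run back to back, which is how time-limited reservations avoid the need for a permanent full mesh at full rate.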

27

slide-28
SLIDE 28

Dynamic Lightpaths in the proposed architecture

  • Dynamic Network Resource Allocation is a powerful tool to avoid permanent full-mesh topology, while providing flexible connectivity and resource guarantees between end-systems
  • Requires integration in the experiments’ software stack
  • We foresee including dynamic allocation in the final design, complementing static Lightpaths between Leadership sites

– Starting with early adopters, including DYNES-connected sites

28

slide-29
SLIDE 29

DYNES Overview

  • What is DYNES?

– A U.S.-wide dynamic network “cyber-instrument” spanning ~40 US universities and ~14 Internet2 connectors
– Extends Internet2’s dynamic network service “ION” into U.S. regional networks and campuses; aims to support LHC traffic (also internationally)
– Based on the implementation of the Inter-Domain Circuit protocol developed by ESnet and Internet2; cooperative development also with GEANT, GLIF

  • Who is it?

– Collaborative team: Internet2, Caltech, Univ. of Michigan, Vanderbilt
– The LHC experiments, astrophysics community, WLCG, OSG, other VOs
– The community of US regional networks and campuses

  • What are the goals?

– Support large, long-distance scientific data flows in the LHC, other programs (e.g. LIGO, Virtual Observatory), & the broader scientific community
– Build a distributed virtual instrument at sites of interest to the LHC but available to the R&E community generally

29

slide-30
SLIDE 30

DYNES Team

  • Internet2, Caltech, Vanderbilt, Univ. of Michigan
  • PI: Eric Boyd (Internet2)
  • Co-PIs:

– Harvey Newman (Caltech)
– Paul Sheldon (Vanderbilt)
– Shawn McKee (Univ. of Michigan)

30

http://www.internet2.edu/dynes

slide-31
SLIDE 31

DYNES System Description

  • AIM: extend hybrid & dynamic capabilities to campus & regional networks.

– A DYNES instrument must provide two basic capabilities at the Tier2s, Tier3s and regional networks:

  • 1. Network resource allocation such as bandwidth to ensure transfer performance
  • 2. Monitoring of the network and data transfer performance
  • All networks in the path require the ability to allocate network resources and monitor the transfer. This capability currently exists on backbone networks such as Internet2 and ESnet, but is not widespread at the campus and regional level.

– In addition Tier 2 & 3 sites require:

  • 3. Hardware at the end sites capable of making optimal use of the available network resources

31

Two typical transfers that DYNES supports: one Tier2 - Tier3 and another Tier1-Tier2. The clouds represent the network domains involved in such a transfer.

slide-32
SLIDE 32

DYNES: Regional Network - Instrument Design

  • Regional networks require

1. An Ethernet switch
2. An Inter-domain Controller (IDC)

  • The configuration of the IDC consists of OSCARS, DRAGON, and perfSONAR. This allows the regional network to provision resources on-demand through interaction with the other instruments
  • A regional network does not require a disk array or FDT server because it is providing transport for the Tier 2 and Tier 3 data transfers, not initiating them.

32

At the network level, each regional connects the incoming campus connection to the Ethernet switch provided. Optionally, if a regional network already has a qualified switch compatible with the dynamic software that they prefer, they may use that instead of, or in addition to, the provided equipment. The Ethernet switch provides a VLAN dynamically allocated by OSCARS & DRAGON. The VLAN has quality of service (QoS) parameters set to guarantee the bandwidth requirements of the connection as defined in the VLAN. These parameters are determined by the original circuit request from the researcher / application. Through this VLAN, the regional provides transit between the campus IDCs connected in the same region or to the global IDC infrastructure.

slide-33
SLIDE 33

DYNES: Tier2 and Tier3 Instrument Design

  • Each DYNES (sub-)instrument at a Tier2 or Tier3 site consists of the following hardware, combining low cost & high performance:
  • 1. An Inter-domain Controller (IDC)
  • 2. An Ethernet switch
  • 3. A Fast Data Transfer (FDT) server. Sites with 10GE throughput capability will have a dual-port Myricom 10GE network interface in the server.
  • 4. An optional attached disk array with a Serial Attached SCSI (SAS) controller capable of several hundred MBytes/sec to local storage.

33

The Fast Data Transfer (FDT) server connects to the disk array via the SAS controller and runs FDT software developed by Caltech. FDT is an asynchronous multithreaded system that automatically adjusts I/O and network buffers to achieve maximum network utilization. The disk array stores datasets to be transferred among the sites in some cases. The FDT server serves as an aggregator/throughput optimizer in this case, feeding smooth flows over the networks directly to the Tier2 or Tier3 clusters. The IDC server handles the allocation of network resources on the switch, interactions with other DYNES instruments related to network provisioning, and network performance monitoring. The IDC creates virtual LANs (VLANs) as needed.

slide-34
SLIDE 34

How can DYNES be leveraged?

  • The Internet2 ION service currently has end-points at two GOLEs in the US: MANLAN and StarLight
  • A static Lightpath from any end-site to one of these two Lightpath Exchanges can be extended through ION to any of the DYNES sites (LHC Tier2 or Tier3)

34

slide-35
SLIDE 35

MANAGEMENT AND ORGANIZATION

35

slide-36
SLIDE 36

Governance structure

  • The global scale of the LHC network basically excludes a single administrative/management unit
  • Needs to be under LHC community’s control

– Capacity planning
– Exchange point placement

  • Open, federated governance

– Stakeholders in LHC computing shall be able to participate and contribute

  • LHC computing sites (Tier0/1/2/3) (directly? through WLCG? GDB?)
  • R&E networks

– One coordinating body (open participation)

  • Meet regularly
  • Define and oversee service levels
  • Perform planning functions
  • MoUs with exchange point operators

36

slide-37
SLIDE 37

Funding

  • Each site is responsible for assuring funding for its own

– End-site equipment (possibly a router or port costs on campus BR)
– Layer 2 connection to the next Lightpath exchange point
– Monitoring device

  • Core network will necessitate some shared funding

– Centrally organised?

  • Defining exchange point placement and core trunk capacities

– On regional basis?

  • By end-sites connecting to same exchange point

37

slide-38
SLIDE 38

Summary

  • We propose a robust, scalable and comparatively low-cost solution based on a switched core with routed edge architecture
  • Core consists of a sufficient number of strategically placed exchange points interconnected by properly sized trunk circuits

– Scaling rapidly with time as in requirements document

  • IP routing is implemented at the end-sites
  • Sites are responsible for securing proper funding for their connectivity to the core
  • Initial deployment to use predominantly static Lightpaths, later predominantly using dynamic resource allocation
  • A federated governance model has to be used due to global geographical extent and diversity of funding sources

38

slide-39
SLIDE 39

QUESTIONS?

Artur.Barczyk@cern.ch

39