LHCONE Operational Framework


SLIDE 1

LHCONE Operational Framework

Xavier Jeannin RENATER 2013/01/28

Part 1: principles and ideas for the operational model
Part 2: LHCONE VRF operational handbook
Part 3: Next step

SLIDE 2

LHCONE VRF nature

  • In a standard L3 VRF/VPN:

– Users manage

  • Site operation

– changes (new peering / site withdrawal / prefixes)
– maintenance
– Information: relevance, location, keeping it up to date, publication

  • Security policy: firewall / filtering / science DMZ
  • Monitoring policy: tools, information and tests available

– NSPs manage

  • routing policy
  • network monitoring: tools, statistics
  • network operation:

– Relevant information (NOC email, tel., …)
– Troubleshooting process: basic trouble (connectivity issue, …), asymmetric traffic, performance

  • LHCONE VRF is a specific L3 VPN

– Multiple user entities
– Multiple NSPs → collaboration is required

SLIDE 3

Operational handbook

  • Create light documentation:
  • https://twiki.cern.ch/twiki/pub/LHCONE/LhcOneVRF/LHCONE_VRF_Operational_Handbook-v0.2.pptx

– Avoid a 100-page static document that is never updated
– Living document: the strict minimum … but be accurate enough to
– Summarize all operational specifications in one document

  • Collect the results of the different sub-groups (routing, security, monitoring)
  • Point to all relevant documents
  • Goals
  • Help a new site connect to LHCONE and provide the appropriate information/tools
  • Help a new NSP NOC manage LHCONE and provide the appropriate information/tools
  • Help experiments interact with LHCONE
  • Topics covered

– Specify routing policy: BGP protocol / community / load balancing …
– Specify security policy: firewall / filtering / science DMZ …?
– Specify monitoring policy: tools, information and tests available?
– Site operation (connection, withdrawal, maintenance …)
– Network operation and troubleshooting process:

  • Most network entities (VRF/NSP) already have their own operational procedures defined
  • basic trouble (connectivity issue, …), asymmetric traffic, performance

– Information management: relevance, location, keeping it up to date, publication (who can access what?)

SLIDE 4

Approach

  • Define actors, roles & responsibilities

– Separate roles from implementation
– Identify the relationships between the actors

  • Identify

– Relevant use cases
– Relevant information and its location
– Who is responsible for keeping the information up to date
– Tools that can help network operation

  • Building the operational model

– An iterative approach → validation during LHCOPN/LHCONE meetings
– Be careful, as it is hard to get "agreement" from all entities

SLIDE 5

Operational Framework

  • Not enough involvement of users and NSPs in the operational framework design
  • In order to make progress, sub-groups have been proposed

– Routing (NSP, Liaison sponsor)

  • Specify routing policy: BGP protocol / community / load balancing …

– Security (Users)

  • Specify security policy: firewall / filtering / science DMZ …

– Monitoring (Users/NSP)

  • Tools to be deployed both at sites and in NSP domains …
  • Appeal for an “author” or “reviewer” for the document → no answer
  • These other topics have to be handled too

– Site operation (connection, withdrawal, maintenance …) (Users)
– Network operation and troubleshooting process (NSP)
– Information management (Users/NSP)

  • a reliable mechanism to broadcast information
SLIDE 6

LHCONE VRF Operational handbook

version 0.5

Contributors:

  • Xavier Jeannin 2012/11/28
  • ???

Inspired by G. Cessieux's work on the LHCOPN operational model

Part 2: operational handbook

SLIDE 7

Table of contents

  • Drawing convention
  • Actors
  • Information management: relevance, location, keeping it up to date, publication (who can access what?)
  • Site operation (connection, withdrawal, maintenance …)
  • Network operation and troubleshooting process:

– basic trouble (connectivity issue, …)
– asymmetric traffic, performance
– maintenance process

  • Routing policy (simple pointer)
  • Security policy (simple pointer)
  • Specify monitoring policy (simple pointer)
SLIDE 8

Drawing convention

[Figure: legend of the symbols used in the diagrams]

  • Actors (A, B)
  • Information repository
  • Optional information repository
  • Optional or not yet existing relation, repository, information, …
  • A can access information in B with no authentication
  • A can access information in B with authentication
  • Ticket exchange between A and B
  • A sends an alarm to B
  • BGP peering
  • BGP peering with load balancing
  • A is responsible for keeping repository 1 operational
  • A is responsible for keeping the information in repository 1 up to date

SLIDE 9

LHCONE Actors

  • VRF

– Provides a connection to other VRFs for NSPs and sites/tiers
– A VRF is a specific NSP, and the VRF interconnection defines the “free zone”
– NOC

  • NSP

– Provides a connection to sites/tiers
– NOC

  • Users

– Sites/tiers

  • T1
  • T2/T3 (should T2Ds be clearly identified by other actors?)

– LHC experiments

  • ATLAS, CMS, LHCb, ALICE
  • Use the infrastructure
  • Define the data flow model
  • Interact at the operational level: downtime (agenda), site ranking

Part 2: Actors

SLIDE 10

LHCONE Actors

[Diagram: the actors fall into two groups. Infrastructure operators: VRF NOC, NSP NOC. Users: sites (T0/T1), sites (T1/T2/T2D/T3), LHC experiments (ATLAS, CMS, LHCb, ALICE).]

SLIDE 11

LHCONE infrastructure actors

[Diagram: VRFs and NSPs, each with its NOC, and the sites connected to them; the VRFs form the free zone. Legend: actors; BGP peering; BGP peering with load balancing.]

SLIDE 12

Network operation information organization

  • A single information access point known by everyone:
  • A central portal (CERN wiki) should make it possible to find where the information is

– Provide an exhaustive list of pointers to other repositories: for instance RIR database, monitoring tools, VRF NOC site …
– Information could/should be distributed

  • Each piece of information should be put under the responsibility of one identified actor

– For instance: a site is responsible for maintaining the list of its announced prefixes; an NSP is responsible for maintaining the list of sites connected to it.

  • Should critical information be mirrored?

– For instance, a mirror of the central portal could be set up on other continents (America, Asia, Europe)?

Part 2: Information management

SLIDE 13

Information and repositories management

[Diagram: actors, information repositories, and who keeps each repository up to date.]

  • Actors: VRF or NSP NOC, site, LHC experiment, CERN
  • Information repositories:

– LHCONE TT trouble tickets (GGUS)
– L3 monitoring: PerfSONAR PS or MDM, BGP looking glass, statistics reports
– LHC experiment operation service (TBD): statistics reports, site ranking
– Global web repository (Twiki): operational procedures and information (routing policy, AS, filters implemented …), operational contacts (site/NSP), network site information, pointers to all other repositories (RIR database, monitoring)

  • Legend: information repository; optional information repository; actor; A is responsible for keeping B operational; A is responsible for keeping the information in B up to date

SLIDE 14

List of information maintained up-to-date by NSP/VRF

Network operators' contact information (HTML link on Twiki table)

Operators | Served region | POPs | Contact information | VRF/NSP connected | Site connected | Phone
CERNlight | Europe/any | Geneve (CH) | extip@cernSPAMNOT.ch | GEANT, … | CERN |
ESnet | US | MANLAN, WIX, … | trouble@esSPAMNOT.net | I2 | BNL, FNAL, SLAC, … |
Geant | Europe | | Roberto.Sabatino@ | I2, Esnet, CernLight, RedIRIS .. | ? |
RedIRIS | Spain | Madrid? | ? | | PIC |

Monitoring information* (HTML link on Twiki table)

Operators | BWCTL | One Way Delay | BGP announce / received route | Looking glass * | Statistic
CERNlight | @server | @server | @server | @server | @server
Geant | @server | @server | @server | @server | @server

Legend: Required / Optional
* Authentication required

SLIDE 15

List of information maintained up-to-date by sites

Site operators' contact information (HTML link on Twiki table)

Site Name | Country | Tier | Technical Contact | VRF/NSP connected | Phone
AGLT2 (UM) | US | Tier-2D | Shawn McKee, smckee@umichSPAMNOT.edu | ? |
DESY-HH | DE | Tier-2D | Kars Ohrenberg, Kars.Ohrenberg@ …de | DFN |

Monitoring information * (HTML link on Twiki table)

Operators | BWCTL | One Way Delay | BGP announce / received route | Looking glass
DESY | @server | @server | Routes announced, Routes received | @server
Geant | @server | @server | @server | @server

Site network-related information published (HTML link on Twiki table)

Site Name | NSP/VRF connected | Prefixes | MTU | Firewall | Comment
AGLT2 (UM) | | published either on LHCONE or RIR Database | | |

Legend: Required / Optional
* Authentication required

SLIDE 16

Information access

[Diagram: which actor can access which information repository.]

  • Actors: VRF or NSP NOC, site, LHC experiment
  • Information repositories:

– LHCONE TT trouble tickets (GGUS)
– Monitoring: PerfSONAR PS or MDM, BGP looking glass, statistics reports
– LHC experiment operation service (TBD): statistics reports, site ranking
– Global web repository (Twiki): operational procedures and information (routing policy, AS, filters implemented …), operational contacts (site/NSP), network site information, pointers to all other repositories (RIR database, monitoring)

  • Legend: information repository; optional information repository; actor; A can access information in B with no authentication; A can access information in B with authentication; ticket exchange between A and B

SLIDE 17

Information: pending questions

  • An authorization framework is needed for access to some information:
  • Looking glass
  • Monitoring tools
  • Data
  • Should prefixes be published in the RIR database or in the CERN Twiki?
  • With the RIR database, one can easily build automatic tools to check whether the prefixes sent by a site are appropriately declared
  • Twiki: easy at first but less robust in the long term
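
The automatic check mentioned for the RIR database option could look roughly like the sketch below. It assumes the declared prefixes of a site have already been retrieved from the RIR database (the retrieval itself is not shown). The function name `undeclared_prefixes` and the sample prefixes are hypothetical; the prefixes are taken from the documentation ranges.

```python
import ipaddress


def undeclared_prefixes(announced, declared):
    """Return the announced prefixes that no declared prefix covers."""
    declared_nets = [ipaddress.ip_network(p) for p in declared]
    missing = []
    for prefix in announced:
        net = ipaddress.ip_network(prefix)
        # subnet_of() requires both networks to be the same IP version,
        # so the version is compared first.
        covered = any(
            net.version == d.version and net.subnet_of(d)
            for d in declared_nets
        )
        if not covered:
            missing.append(prefix)
    return missing


# Hypothetical data: what a site declared vs. what it announces over BGP.
declared = ["192.0.2.0/24", "2001:db8::/32"]
announced = ["192.0.2.128/25", "198.51.100.0/24", "2001:db8:1::/48"]
print(undeclared_prefixes(announced, declared))
```

A check like this is only as good as the declared data, which is one reason the choice of repository matters.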

SLIDE 18

Information broadcast channels

  • How should maintenance or trouble be announced?
  • Operational email list
  • Which other tools should we set up for more interactive activity?
  • Instant messenger
  • Skype, …
  • Possibility to set up audio and video conferences quickly
  • EVO, Skype, …
  • An always-open conference on an MCU bridge or in EVO?

SLIDE 19

Basic troubleshooting process

[Diagram: Site A and Site B, each connected to a VRF or NSP (with its NOC); the VRF NOC in the free zone; and the LHCONE TT trouble-ticket system (GGUS). Each entity investigates internally and contacts potentially involved partners. Legend: ticket exchange between entity X and entity Y; entity X sends an alarm to entity Y.]

Use cases: generic trouble

  • Connectivity problem
  • Site A cannot reach LHCONE anymore
  • Site A cannot reach site B

  • In general, NSP/VRF procedures should be reused to solve basic trouble

Part 2: Network operation

SLIDE 20

Asymmetric troubleshooting process

[Diagram: Site A and Site B, each connected to a VRF or NSP (with its NOC); the VRF NOC in the free zone; and the LHCONE TT trouble-ticket system (GGUS). The numbers on the arrows refer to the steps below.]

Asymmetry is often due to a route propagation issue or to route filtering.

1. Site A contacts site B to check whether there is a filtering problem, or checks using site B's looking glass.
2. Check route distribution along the path on the NSP looking glasses. If a looking glass is not available, site A sends an alarm to the NSP that has to be checked.
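
Step 2 can be illustrated with a small sketch. It assumes the routes visible in each network's LHCONE table have been collected by hand from the looking glasses into a dictionary; no real looking-glass API is used. The function name `first_missing_route`, the network names, and the prefixes are all hypothetical.

```python
import ipaddress


def first_missing_route(path, routes_seen, prefix):
    """Walk the path in order and return the first network whose table
    has no route covering `prefix`, or None if every network has one."""
    target = ipaddress.ip_network(prefix)
    for network in path:
        table = [ipaddress.ip_network(r) for r in routes_seen.get(network, [])]
        has_route = any(
            r.version == target.version and target.subnet_of(r)
            for r in table
        )
        if not has_route:
            return network
    return None


# Hypothetical path from site B back toward site A, and the routes each
# looking glass shows. Site A's prefix is 192.0.2.0/24 (documentation range).
path = ["NSP-B", "VRF-2", "VRF-1", "NSP-A"]
routes_seen = {
    "NSP-A": ["192.0.2.0/24"],
    "VRF-1": ["192.0.2.0/24"],
    "VRF-2": [],
    "NSP-B": [],
}
print(first_missing_route(["NSP-A", "VRF-1", "VRF-2", "NSP-B"], routes_seen, "192.0.2.0/24"))
```

The network returned is where the route stops propagating, and therefore the first candidate to contact.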


SLIDE 21

Performance troubleshooting process

[Diagram: Site A and Site B, each connected to a VRF or NSP (with its NOC); the VRF NOC in the free zone; and the LHCONE TT trouble-ticket system (GGUS). The numbers on the arrows refer to the steps below.]

1. Site A runs an end-to-end test between sites A and B using remote monitoring tools: PerfSONAR PS / MDM
2. Identify the faulty span

A. Launch tests between site A and every NSP crossed on the way to site B
B. If the entity responsible for the faulty span is identified, contact that entity; otherwise contact the suspected entities for investigation
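
Step 2 amounts to a simple search over per-segment measurements, sketched below. It assumes throughput tests from site A to each successive network have already been run and recorded; the function name `find_faulty_span`, the threshold, and all figures are hypothetical and not taken from any real measurement.

```python
def find_faulty_span(measurements, threshold_mbps):
    """Locate the first span where throughput from site A degrades.

    `measurements` is an ordered list of (endpoint, throughput_mbps)
    pairs, one per network crossed from site A toward site B.
    Returns (last_good_endpoint, first_bad_endpoint), or None if every
    test meets the threshold.
    """
    previous = "site A"
    for endpoint, throughput in measurements:
        if throughput < threshold_mbps:
            return (previous, endpoint)
        previous = endpoint
    return None


# Hypothetical test results from site A to each network on the path.
measurements = [
    ("NSP-A", 9400.0),
    ("VRF-1", 9100.0),
    ("VRF-2", 850.0),
    ("site B", 800.0),
]
print(find_faulty_span(measurements, threshold_mbps=5000.0))
```

The returned pair names the two entities on either side of the suspect span, which are the ones to contact in step B.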


SLIDE 22

Maintenance management

  • Use cases
  • A link will undergo maintenance
  • Routers will undergo maintenance

Questions

  • Which entities should be informed about this event? → broadcast channel
  • How long before a service interruption must an NSP give notice?

To be discussed with experiments and sites

SLIDE 23

Site operation

  • Connection, withdrawal, maintenance
  • A new site connects to LHCONE
  • New prefixes announced
  • New link
  • Site maintenance ...

Part 2: Site operation

SLIDE 24

Routing policy

We can insert the work of the routing policy group into the operational handbook, or simply put a pointer to the relevant document.

General rules

  • BGP must be used to connect to the LHCONE VRF
  • Public ASNs must be used; private ASNs are not allowed; a VRF operator can announce the site addresses as belonging to the VRF's AS if necessary.
  • Public IP addresses must be used …
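
The two "public only" rules lend themselves to an automatic check, sketched below under a few assumptions. The private-use ASN ranges are the ones reserved by RFC 6996; the function names `is_private_asn` and `check_announcement` and the sample values are hypothetical. Other reserved ASNs and special-purpose ranges are not covered by this sketch.

```python
import ipaddress

# Private-use ASN ranges reserved by RFC 6996 (16-bit and 32-bit).
PRIVATE_ASN_RANGES = [(64512, 65534), (4200000000, 4294967294)]


def is_private_asn(asn):
    """True if the ASN falls in a private-use range."""
    return any(low <= asn <= high for low, high in PRIVATE_ASN_RANGES)


def check_announcement(asn, prefix):
    """Return a list of rule violations for one site announcement."""
    problems = []
    if is_private_asn(asn):
        problems.append(f"AS{asn} is a private ASN")
    # is_private is True for private address space such as 10.0.0.0/8.
    if ipaddress.ip_network(prefix).is_private:
        problems.append(f"{prefix} is not public address space")
    return problems


# Hypothetical announcements: one violating both rules, one compliant.
print(check_announcement(65000, "10.1.0.0/16"))
print(check_announcement(1000, "8.8.8.0/24"))
```

An empty list means the announcement passes both checks; the authoritative rules remain those of the routing policy document.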

https://twiki.cern.ch/twiki/pub/LHCONE/LhcOneVRF/LHCONE_routing_policy-v2.1.pdf

Part 2: Routing Policy

SLIDE 25

Security policy

See Michael O’Connor's presentation.
Simply put a pointer to the relevant document provided by the sub-group:
https://twiki.cern.ch/twiki/pub/LHCONE/LhcOneVRF/LHCONE_security_policy-v2.1.pdf

Monitoring policy

Simply put a pointer to the relevant document provided by the sub-group:
https://twiki.cern.ch/twiki/pub/LHCONE/LhcOneVRF/LHCONE_monitoring_policy-v2.1.pdf

Part 2: Security and monitoring policy

SLIDE 26

Next step

1. If there is agreement on this approach

– This document needs to be reviewed, especially by users: experiments and sites

2. Implement the information repository now (almost done on the CERN Twiki)
3. Implement information broadcast channels
4. Define network maintenance management properly
5. Review and improve the troubleshooting use cases
6. Users must define site operation
7. Validate the new specification during the next LHCOPN/LHCONE meeting

  • Appeal for volunteers to act as “author”, “reviewer” or contributor, especially for

– Routing policy
– Site operation
– Security policy

Part 3 : Next step