LHCONE Operational Framework
Xavier Jeannin RENATER 2013/01/28
Part 1 : principles and ideas for the operational model
Part 2 : LHCONE VRF operational handbook
Part 3 : Next step
– Users manage
  – changes (new peering / site withdrawal / prefixes)
  – maintenance
  – information: relevant, location, maintaining up-to-date, publication
– NSP manage
  – relevant information (NOC email, tel., …)
  – troubleshooting process: basic trouble (connectivity issue, …), asymmetric traffic, performance
– Multi user entities
– Multi NSPs collaboration is required
Part 1 : principles and ideas for the operational model
– Avoid a 100-page, static document that is never updated
– Living document : the strict minimum … but be accurate enough to
– Summarize all operation specifications in one document
information /tools
/tools
– Specify routing policy: protocol BGP / community / load balancing …
– Specify security policy: firewall / filtering / science DMZ …?
– Specify monitoring policy: tools, information and tests available?
– Site operation (connection, withdrawal, maintenance …)
– Network operation and troubleshooting process :
– Information management: relevant, location, maintaining up-to-date, publication (who can access what?)
– Separate roles from implementation
– Identify the relationships between the actors
– Relevant use cases
– Relevant information and its location
– Who is responsible for keeping the information up to date
– Tools that can help network operation
– An iterative approach
validation during LHCOPN/LHCONE meeting
– Be careful as it is hard to have « agreement » from all entities
design
– Routing (NSP, Liaison sponsor)
– Security (Users)
– Monitoring (Users/NSP)
– Site operation (connection, withdrawal, maintenance …) (Users)
– Network operation and troubleshooting process (NSP)
– Information management (Users/NSP)
Contributor :
Inspired by G. Cessieux's work on the LHCOPN operational model
Part 2: operational handbook
– basic trouble (connectivity issue, …)
– asymmetric traffic, performance
– maintenance process
[Diagram legend]
– Actors (A, B)
– A can access information in B with no authentication
– A can access information in B with authentication
– Ticket exchange between A and B
– A sends an alarm to B
– Peering BGP
– Peering BGP with load balancing
– Information repository
– Optional information repository
– Optional or not yet existing relation, repository, information, …
– A is responsible for maintaining 1 operational
– A is responsible for maintaining information up to date within 1
– Provides a connection to other VRFs for NSPs and sites/tiers
– VRF is a specific NSP, and VRF interconnection defines the “free zone”
– NOC
– Provides a connection to sites/tiers
– NOC
– Sites/tiers
– LHC experiments
Part 2: Actors
[Diagram: actors. Infrastructure: VRF NOC, NSP NOC. Users: Sites (T0/T1), Sites (T1/T2/T2D/T3), LHC Experiments (ATLAS, CMS, LHCb, ALICE).]
[Diagram: actors and their interconnection. Elements shown: sites, NSPs with their NOCs, VRFs with their NOCs, and the free zone. Link types: peering BGP; peering BGP with load balancing.]
find the information
– Provide an exhaustive list of pointers to other repositories: for instance RIR database, monitoring tools, VRF NOC site …
– Information could/should be distributed
– For instance: each site is responsible for maintaining the list of its announced prefixes; an NSP is responsible for maintaining the list of sites connected to it.
– For instance, a mirror of the central portal could be implemented
Part 2: Information management
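The idea of a central portal that holds only pointers, with each piece of information maintained by the actor who owns it, can be sketched as follows. This is an illustration under stated assumptions, not the LHCONE implementation: the `Pointer` and `Portal` classes, the topic names, and the actor names "Site A" and "NSP X" are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Pointer:
    """One entry in the central list of pointers (hypothetical structure)."""
    topic: str       # what information the pointer leads to
    location: str    # where the information lives (placeholder text here)
    maintainer: str  # actor responsible for keeping it up to date


@dataclass
class Portal:
    """Central portal that stores pointers only, not the information itself."""
    pointers: list = field(default_factory=list)

    def register(self, topic, location, maintainer):
        self.pointers.append(Pointer(topic, location, maintainer))

    def find(self, topic):
        """Return every pointer whose topic matches, case-insensitively."""
        wanted = topic.lower()
        return [p for p in self.pointers if p.topic.lower() == wanted]

    def maintained_by(self, actor):
        """List the topics a given actor must keep up to date."""
        return sorted(p.topic for p in self.pointers if p.maintainer == actor)


# Hypothetical entries: actor names and locations are placeholders.
portal = Portal()
portal.register("announced prefixes", "site page (placeholder)", "Site A")
portal.register("connected sites", "NSP page (placeholder)", "NSP X")
portal.register("operational contacts", "global web repository", "Site A")

print(portal.maintained_by("Site A"))
```

Keeping the maintainer next to each pointer makes the question "who keeps this up to date?" answerable from the portal itself, which is the point of distributing the information.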
[Diagram: information management. Actors: site, VRF or NSP NOC, LHC experiment, CERN. Information repositories: LHCONE TT trouble ticket (GGUS); L3 monitoring (PerfSONAR PS, BGP looking glass); statistics reports; site ranking; operation service (TBD, optional). Global web repository (Twiki): operational procedure and information (routing policy, AS, filter implemented …), operational contacts (site/NSP), network sites information, pointer toward all other repositories (RIR database, monitoring). Legend: A is responsible for maintaining B operational; A is responsible for maintaining information up to date within B.]
Network Operators' Contact information

Operators | Served region | POPs | Contact information | VRF/NSP connected | Site connected | Phone
CERNlight | Europe/any | Geneve (CH) | extip@cernSPAMNOT.ch | GEANT, … | CERN |
ESnet | US | MANLAN, WIX, … | trouble@esSPAMNOT.net | I2 | BNL, FNAL, SLAC, … |
Geant | Europe | | Roberto.Sabatino@ | I2, Esnet, CernLight, RedIRIS .. | ? |
RedIRIS | Spain | Madrid? | ? | | PIC |

Monitoring information*

Operators | BWCTL | One Way Delay | BGP announce / received route | Looking glass * | Statistic
CERNlight | @server | @server | @server | @server | @server
Geant | @server | @server | @server | @server | @server

Legend: fields are marked Required or Optional. * Authentication required. Entries are HTML links on the twiki table.
Site Operators' Contact information

Site Name | Country | Tier | Technical Contact | VRF/NSP connected | Phone
AGLT2 (UM) | US | Tier-2D | Shawn McKee, smckee@umichSPAMNOT.edu | ? |
DESY-HH | DE | Tier-2D | Kars Ohrenberg, Kars.Ohrenberg@ …de | DFN |

Monitoring information *

Operators | BWCTL | One Way Delay | BGP announce / received route | Looking glass
DESY | @server | @server | Routes announced, Routes received | @server
Geant | @server | @server | @server | @server

Sites network related information published

Site Name | NSP/VRF connected | Prefixes | MTU | Firewall | Comment
AGLT2 (UM) | | published either on LHCONE or RIR Database | | |

Legend: fields are marked Required or Optional. * Authentication required. Entries are HTML links on the twiki table.
[Diagram: information management, access view. Actors: site, VRF or NSP NOC, LHC experiment. Information repositories: LHCONE TT trouble ticket (GGUS); monitoring (PerfSONAR PS, BGP looking glass); statistics reports; site ranking; operation service (TBD, optional). Global web repository (Twiki): operational procedure and information (routing policy, AS, filter implemented …), operational contacts (site/NSP), network sites information, pointer toward all other repositories (RIR database, monitoring). Legend: A can access information in B with no authentication; A can access information in B with authentication; ticket exchange between A and B.]
Use cases: Generic trouble
[Diagram: Site A; the VRF or NSP where site A is connected (NOC); the free zone (VRF NOC); the VRF or NSP where site B is connected; Site B; LHCONE TT trouble ticket (GGUS). Legend: ticket exchange between entity X and Y; entity X sends an alarm to entity Y; investigate internally and contact potentially involved partners.]
Part 2: Network operation
Use case: asymmetric traffic
Asymmetric traffic is often due to a route propagation issue or route filtering.
1. Site A contacts site B to check whether there is a filtering problem.
2. Check route distribution along the path on NSP looking … alarm to NSP that has to be checked.
[Diagram: Experiment; Site A; the VRF or NSP where site A is connected (NOC); the free zone (VRF NOC); the VRF or NSP where site B is connected; Site B; LHCONE TT trouble ticket (GGUS), with steps 1 and 2 marked on the links. Legend: ticket exchange between entity X and Y; entity X sends an alarm to entity Y; investigate internally and contact potentially involved partners.]
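The two checks for asymmetric traffic (filtering between the sites, then route distribution along the path) can be illustrated with a small sketch. It is only an illustration under assumptions: the function names are hypothetical, the AS numbers and prefixes come from the ranges reserved for documentation, and in practice the paths and route lists would be read from looking glasses or router output rather than typed in.

```python
def is_asymmetric(forward_path, reverse_path):
    """True when the return path is not the forward path reversed.

    Both arguments are sequences of AS numbers, each listed in the
    direction the traffic travels.
    """
    return list(forward_path) != list(reversed(reverse_path))


def missing_prefixes(announced, received):
    """Prefixes one side announces that the other side does not receive.

    A non-empty result hints at a propagation or filtering problem.
    """
    return sorted(set(announced) - set(received))


# Hypothetical data: documentation AS numbers and prefixes only.
a_to_b = [64496, 64500, 64497]   # path taken from site A to site B
b_to_a = [64497, 64501, 64496]   # path taken from site B back to site A
print(is_asymmetric(a_to_b, b_to_a))   # True: transit AS differs

announced_by_a = ["192.0.2.0/24", "198.51.100.0/24"]
received_by_b = ["192.0.2.0/24"]
print(missing_prefixes(announced_by_a, received_by_b))
```

A prefix that is announced but never received narrows the investigation to the entities between the two sites, which is where the route distribution check of step 2 takes over.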
Use case: performance
1. Site A makes an end-to-end test between site A and site B using remote monitoring tools: Perfsonar PS / MDM
2. Identify the faulty span
   A. Launch a test between site A and every NSP crossed toward site B
   B. If the entity responsible for the faulty span is identified, contact that entity; otherwise contact the suspected entities for investigation
[Diagram: Site A; the VRF or NSP where site A is connected (NOC); the free zone (VRF NOC); the VRF or NSP where site B is connected; Site B; LHCONE TT trouble ticket (GGUS), with steps 1 and 2 marked on the links. Legend: ticket exchange between entity X and Y; entity X sends an alarm to entity Y; investigate internally and contact potentially involved partners.]
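Step 2 (testing from site A to every entity crossed toward site B) amounts to finding the first hop at which the measurement degrades. A minimal sketch, assuming a hypothetical `measure` callable standing in for a real test tool, made-up entity names, and made-up throughput figures:

```python
def find_faulty_span(hops, measure, threshold):
    """Return (previous_hop, hop) for the first span where the test degrades.

    hops      -- entities crossed from site A toward site B, in order
    measure   -- callable giving a throughput figure for a test from
                 site A to the given hop (stand-in for a real test tool)
    threshold -- minimum acceptable throughput, same unit as measure()
    Returns None when every hop meets the threshold.
    """
    previous = "Site A"
    for hop in hops:
        if measure(hop) < threshold:
            return (previous, hop)
        previous = hop
    return None


# Hypothetical results in Gbit/s for tests launched from site A.
results = {"NSP 1": 9.1, "VRF X": 8.8, "NSP 2": 1.2, "Site B": 1.1}
span = find_faulty_span(list(results), results.get, threshold=5.0)
print(span)   # ('VRF X', 'NSP 2')
```

The returned pair names the two entities on either side of the suspected span, which are the ones to contact in step B.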
Question
To be discussed with experiments and sites
Part 2: Site operation
We can insert the document of the Routing policy group within the operational handbook, or simply put a pointer to the relevant document.
General Rule
… announce the site address as belonging to the VRF's AS if necessary.
https://twiki.cern.ch/twiki/pub/LHCONE/LhcOneVRF/LHCONE_routing_policy-v2.1.pdf
Part 2: Routing Policy
Security policy: see Michael O'Connor's presentation. Simply put a pointer to the relevant document provided by the sub-group:
https://twiki.cern.ch/twiki/pub/LHCONE/LhcOneVRF/LHCONE_security_policy-v2.1.pdf
Monitoring policy: simply put a pointer to the relevant document provided by the sub-group:
https://twiki.cern.ch/twiki/pub/LHCONE/LhcOneVRF/LHCONE_monitoring_policy-v2.1.pdf
Part 2: Security and monitoring Policy
1. If there is an agreement on this approach
– This document needs to be reviewed, especially by users: experiments and sites
1. Implement the information repository now (almost done on CERN Twiki)
2. Implement information broadcast channels
3. Network maintenance management well defined
4. Review and improve troubleshooting use cases
5. Users must define site operation
6. Validation of the new specification during the next LHCOPN/LHCONE meeting
for
– Routing policy
– Site operation
– Security policy
Part 3 : Next step