Survivable Network Design
Dr. János Tapolcai
tapolcai@tmit.bme.hu

The final goal
We prefer not to see service outages in telecommunication networks.
[Figure: telecommunication network hierarchy – High Speed Backbone, Metro, Business and Mobile access; service providers: PSTN, Internet, Video. Source: http://www.icn.co]
IP (Internet Protocol) – addressing, routing
ATM (Asynchronous Transfer Mode) – traffic engineering
SDH/SONET (Synchronous Digital Hierarchy) – transport and protection
WDM (Wavelength Division Multiplexing) – high bandwidth
[Figure: evolution of the protocol stack]
– 1999: IP / ATM / SONET / Optics (Layers 3, 2, 1)
– 2003: IP / MPLS / Thin SONET / Optics
– 201x: Packet (IP/Ethernet) over Smart Optics, with GMPLS packet–optical interworking (Layers 2/3 over 0/1)
Recovery times: BGP-4: 15–30 minutes; OSPF: 10 seconds to minutes; SONET: 50 milliseconds
– Hop-by-hop routing: packets are forwarded based on forwarding tables
– Shortest-path routing
– IS-IS (Intermediate System to Intermediate System)
– From a technical point of view, not very popular
– Centralized control
– Exact knowledge of the physical topology
– Source and destination node pairs, bandwidth
[Figure: optical network with wavelength crossconnects (nodes A–E), lightpaths, and IP routers]
– Typical failures: wear-out
– Cooling fans, hard disks, power supplies
– Natural phenomena mostly influence and damage these devices (e.g. high humidity, high temperature, earthquakes)
– Compiler detects most of these failures
– Misconfiguration
– Misconfigured addresses or prefixes, interface identifiers, link metrics, timers and queues (DiffServ)
– Policers, classifiers, markers, shapers
– Blocking legacy traffic
– Other operational faults:
… errors:
– Weak processors in routers
– High BER (bit error rate) on long cables
– Topology is not meshed enough (not enough redundancy for protection path selection)
– Incompatibility between different vendors and versions
– Incompatibility between service providers or ASs (Autonomous Systems)
Updates and patches Misconfiguration Device upgrade Maintenance Data mirroring or recovery Monitoring and testing Teach users Other
– Physical devices
– Against nodes
– DoS (denial-of-service) attacks (e.g. on the Internet)
– In 1996, computers could be frozen by receiving oversized packets (the "Ping of Death")
– Short term
– Long term
– Road construction (‘Universal Cable Locator’) – Rodent bites
– New skyscraper (e.g. CN Tower) – Clouds, fog, smog, etc. – Birds, planes
– Electromagnetic noise (e.g. solar flares)
– Air-conditioner fault
– Fires, floods, terrorist attacks, lightning, earthquakes, etc.
Maintenance, Power Outage, Fiber Cut/Circuit/Carrier Problem, Hardware Problem, Routing Problems, Interface Down, Congestion/Sluggish, Malicious Attack, Software Problem
Failure causes by type: Operator 35%, Environmental 31%, Hardware 15%, Unknown 11%, User 5%, Malice 2%, Software 1%

Cause                               Type            #     [%]
Maintenance                         Operator        272   16.2
Power Outage                        Environmental   273   16.0
Fiber Cut/Circuit/Carrier Problem   Environmental   261   15.3
Unreachable                         Operator        215   12.6
Hardware Problem                    Hardware        154    9.0
Interface Down                      Hardware        105    6.2
Routing Problems                    Operator        104    6.1
Miscellaneous                       Unknown          86    5.9
Unknown/Undetermined/No problem     Unknown          32    5.6
Congestion/Sluggish                 User             65    4.6
Malicious Attack                    Malice           26    1.5
Software Problem                    Software         23    1.3
Definition, Techniques, and Case Studies”, UC Berkeley Computer Science Technical Report UCB//CSD-02-1175, March 15, 2002,
– Simple solutions needed – sometimes up to 90% of all failures
– Running at night – sometimes up to 20% of all failures
– It will get worse in the future – 10 million lines of source code
– Anything that makes a point-to-point connection fail (not only cable cuts)
Failure – the termination of the ability of a network element to perform a required function; hence, a network failure happens at one particular moment t_f
Reliability – continuous operation of a system or service; the probability that the system is adequately operational (i.e. failure-free) for the intended period of time [0, t], in the presence of network failures
– Defined as 1 − F(t), where F(t) is the cumulative distribution function (cdf) of the time to failure
– Simple model: exponentially distributed variables
– R(t) is non-increasing
– R(0) = 1, lim_{t→∞} R(t) = 0

    R(t) = 1 − F(t) = 1 − (1 − e^(−λt)) = e^(−λt)

[Plot: R(t) starting at 1 and decreasing; R(a) marked at t = a]
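As a quick sanity check of the exponential reliability model above, a minimal Python sketch (the failure rate λ is an assumed example value, one failure per 10,000 hours):

```python
import math

def reliability(t, lam):
    """R(t) = 1 - F(t) = e^(-lambda * t) for exponentially distributed failure times."""
    return math.exp(-lam * t)

lam = 1e-4  # assumed example failure rate [1/h]
print(reliability(0, lam))      # R(0) = 1
print(reliability(10000, lam))  # R(MTTF) = e^-1, about 0.368
```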
[Figure: the device alternates between UP and DOWN states over time t; a Failure transition moves it to DOWN, a repair moves it back to UP. In the DOWN state the network element is failed and a repair action is in progress.]
– Availability, A(t): the probability that the device is in the UP (operational) state at time t
– Unavailability, U(t): the probability that the device is in the faulty state at some time t in the future
– MTTR – Mean Time To Repair
– MTTF – Mean Time To Failure
– MTBF – Mean Time Between Failures (MTBF = MTTF + MTTR)
– MUT – Mean Up Time
– MDT – Mean Down Time
– MCT – Mean Cycle Time (MCT = MUT + MDT)
Availability   Nines                         Outage/year   Outage/month   Outage/week
90%            1 nine                        36.52 day     73.04 hour     16.80 hour
95%                                          18.26 day     36.52 hour      8.40 hour
98%                                           7.30 day     14.60 hour      3.36 hour
99%            2 nines (maintained)           3.65 day      7.30 hour      1.68 hour
99.5%                                         1.83 day      3.65 hour     50.40 min
99.8%                                        17.53 hour    87.66 min      20.16 min
99.9%          3 nines (well maintained)      8.77 hour    43.83 min      10.08 min
99.95%                                        4.38 hour    21.91 min       5.04 min
99.99%         4 nines                       52.59 min      4.38 min       1.01 min
99.999%        5 nines (failure protected)    5.26 min     25.9 sec        6.05 sec
99.9999%       6 nines (high reliability)    31.56 sec      2.62 sec       0.61 sec
99.99999%      7 nines                        3.16 sec      0.26 sec       0.06 sec
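The table rows above follow directly from the unavailability; a short Python sketch reproducing a few entries (assuming an average year of 365.25 days):

```python
MIN_PER_YEAR = 365.25 * 24 * 60  # minutes in an average year

def outage_per_year_minutes(availability):
    """Expected downtime per year for a given steady-state availability."""
    return (1.0 - availability) * MIN_PER_YEAR

print(round(outage_per_year_minutes(0.9999), 1))   # 4 nines: about 52.6 min/year
print(round(outage_per_year_minutes(0.99999), 2))  # 5 nines: about 5.26 min/year
```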
Failure process:
– independent and identically distributed (iid) variables following an exponential distribution
– sometimes a Weibull distribution is used (harder to handle)
– λ > 0 failure rate (time-independent!)
Repair process:
– iid exponential variables; sometimes a Weibull distribution is used (harder to handle)
– µ > 0 repair rate (time-independent!)
If both are exponentially distributed, we have a simple model:
– Continuous-Time Markov Chain

    F(t) = 1 − e^(−λt)
[Figure: two-state Markov chain; states UP and DN, transition probabilities λ (UP→DN) and µ (DN→UP), self-loops 1−λ and 1−µ]
– Transition matrix P (stochastic matrix)
– The transition matrix after k steps: P^k
– The stationary distribution is a row vector π with πP = π
– π exists (and in this case it is unique)
– Mean of an exponentially distributed variable with rate λ: 1/λ
Transition matrix:

    P = ( 1−λ    λ  )
        (  µ    1−µ )

Stationary distribution Π = (π_UP, π_DN) = (A, U):

    (A  U) · P = (A  U),   U = 1 − A
    ⇒ A·(1−λ) + U·µ = A  ⇒  U = λ/(λ+µ),  A = µ/(λ+µ)
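The closed-form stationary distribution (A, U) = (µ/(λ+µ), λ/(λ+µ)) can be verified numerically by iterating the two-state transition matrix; the per-step probabilities below are assumed example values:

```python
# Assumed example per-step failure and repair probabilities
lam, mu = 0.01, 0.2

# Two-state transition matrix P for states (UP, DN)
P = [[1 - lam, lam],
     [mu, 1 - mu]]

pi = [1.0, 0.0]  # start in the UP state
for _ in range(10000):
    # one step of pi <- pi * P
    pi = [pi[0] * P[0][0] + pi[1] * P[1][0],
          pi[0] * P[0][1] + pi[1] * P[1][1]]

A, U = mu / (lam + mu), lam / (lam + mu)
print(pi)  # converges to [A, U]
```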
Transient availability:

    A(t) = µ/(λ+µ) + λ/(λ+µ) · e^(−(λ+µ)t)

– Steady state: A_ss = lim_{t→∞} A(t) = µ/(λ+µ)
– Without repair (µ = 0): A(t) = e^(−λt) = R(t)

[Plot: A(t) decaying from 1 towards A_ss]
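A short Python sketch of the transient availability formula (the rates are assumed example values, roughly MTTF = 10^5 h and MTTR = 4 h):

```python
import math

def availability(t, lam, mu):
    """A(t) = mu/(lam+mu) + lam/(lam+mu) * exp(-(lam+mu) * t), starting in the UP state."""
    s = lam + mu
    return mu / s + (lam / s) * math.exp(-s * t)

lam, mu = 1e-5, 0.25  # assumed example failure and repair rates [1/h]
print(availability(0.0, lam, mu))     # A(0) = 1
print(availability(1e6, lam, mu))     # approaches the steady state mu/(lam+mu)
print(availability(100.0, lam, 0.0))  # no repair: equals R(100) = e^(-lam*100)
```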
(… Prediction of Electronic Equipment)
– Match curves to the observed failure data to get the predicted failure rate λ_p

    R(t) = e^(−λ_p · t)
– On-spot measured data – data tested in laboratory
– Since then called the Telcordia standard (1998) – France Telecom (CNET93) and British Telecom (HRD5) improved the method
IP router (simplified model, example configuration):
– HW common parts, SW library
– 1 × 4-port OC3/STM1 POS line card
– 2 × 1-port Gigabit Ethernet module
– 4 × 1-port OC48/STM16 POS line card
– 8 slots available (one slot not used)
– housing, conditioning

Component reliability data:
– IP router, interface card: MTBF[h] = 8.5·10^4, MTTR[h] = 4
– IP router, SW: MTBF[h] = 3·10^4; MTTR[h] = 0.0004 (SW restart), 0.02 (SW reload), 0.25 (no automatic restart)
– IP router, route processor: MTBF[h] = 2·10^5, MTTR[h] = 4
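From MTBF and MTTR the steady-state availability of each component follows as A = MTBF / (MTBF + MTTR); a minimal sketch using the router figures quoted above:

```python
def avail(mtbf_h, mttr_h):
    """Steady-state availability from mean time between failures and mean time to repair."""
    return mtbf_h / (mtbf_h + mttr_h)

print(avail(8.5e4, 4))      # interface card
print(avail(2e5, 4))        # route processor
print(avail(3e4, 0.0004))   # SW with automatic restart
```

Note how strongly the short software-restart MTTR pushes the SW availability towards 1 despite its low MTBF.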
[Figure: SDH DXC/ADM – trunk transponder, tributary transponder, control unit]
– SDH DXC/ADM: MTBF[h] = 1·10^6, MTTR[h] = 4
– A DXC has more ports than an IP router
Abbreviations: SDH – Synchronous Digital Hierarchy; SONET – Synchronous Optical NETworking; DXC – digital cross-connect; ADM – add-drop multiplexer; OEO – optical-electrical-optical conversion
[Figure: WDM line system – OXC, transponders, amplifiers, cable/fibre]
– Transponder: MTBF[h] = 400·10^3, MTTR[h] = 6
– WDM line system: MTBF[h] = 250·10^3, MTTR[h] = 6
– Amplifier: MTBF[h] = 160·10^3, MTTR[h] = 6
– WDM OXC (OEO) or OADM: MTBF[h] = 1·10^5, MTTR[h] = 6
– Aerial cable: MTBF[km] = 1.75·10^5, MTTR[h] = 6
– Buried cable: MTBF[km] = 2.6·10^5, MTTR[h] = 12
– Submarine cable: MTBF[km] = 4.64·10^6, MTTR[h] = 540
Abbreviations: WDM – wavelength division multiplexing; OXC – optical cross-connect; OADM – optical add-drop multiplexer
[Figure: connection s–d traversing OXC – transponder – WDM line system (amplifiers) – cable – transponder – OXC]
– Transponder: MTBF = 4·10^5, MTTR = 6
– WDM line system: MTBF = 2.5·10^5, MTTR = 6
– Amplifier: MTBF = 1.6·10^5, MTTR = 6
– WDM OXC: MTBF = 1·10^5, MTTR = 6
– Ground cable (200 km): MTBF[km] = 2.63·10^5, MTTR = 12

Series rule: A = ∏_{i=1}^{m} A_i

A_s-d = A_OXC · A_tr · A_MUX · A_cable · A_amp · A_MUX · A_tr · A_OXC
      = 0.99994 · 0.999985 · 0.9999625 · 0.99087 · 0.999976 · 0.9999625 · 0.999985 · 0.99994
      = 0.99994 · 0.99074 · 0.99994 ≈ 0.99062
(unavailability ≈ 0.94%, i.e. about 3.4 days/year)
Parallel rule: A = 1 − ∏_{i=1}^{m} (1 − A_i)

A_s-d = A_OXC · [1 − (1−A_path1)·(1−A_path2)] · A_OXC
      = 0.99994 · [1 − (1−0.99074)·(1−0.99074)] · 0.99994
      ≈ 0.99979
(unavailability ≈ 0.021%, i.e. about 110 min/year)
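The series and parallel rules are easy to check numerically; this sketch reproduces the unprotected and 1+1 protected end-to-end availabilities computed above:

```python
from functools import reduce

def series(avails):
    """Series rule: A = product of the A_i."""
    return reduce(lambda acc, a: acc * a, avails, 1.0)

def parallel(avails):
    """Parallel rule: A = 1 - product of (1 - A_i)."""
    return 1.0 - reduce(lambda acc, a: acc * (1.0 - a), avails, 1.0)

# Unprotected path (OXC, transponder, MUX, cable, amplifier, MUX, transponder, OXC)
A_path = series([0.99994, 0.999985, 0.9999625, 0.99087, 0.999976,
                 0.9999625, 0.999985, 0.99994])
print(round(A_path, 5))  # about 0.99062

# 1+1 protected: two disjoint paths in parallel between the end OXCs
A_prot = series([0.99994, parallel([0.99074, 0.99074]), 0.99994])
print(round(A_prot, 5))  # about 0.99979
```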
– Efficiency vs. complexity
[Scale: recovery schemes ordered from simple to complex]
– 1 working + 1 protection path is allocated; the two paths are disjoint
[Figure: working and protection paths with per-link capacities; two demands share a link on their protection routes]
– The reserved capacity along the common link is A + B
– PRO: instantaneous recovery (no action is needed)
– The spare capacity along their protection routes can be shared
– At most one of them is activated after a single failure
[Figure: two disjoint working paths whose protection routes share a link]
– The spare capacity along the common link is max{A, B}
– CON: actions (signaling) are needed after the failure
Failure management phases (depending on the architecture):
– Fault detection and localization (isolation) (t_l)
– Fault notification (t_n)
– Path selection (t_p)
– Device configuration (t_d)
[Timeline of recovery: failure → hold-off time → failure detected by the nearest node (fault detection time) → sending fault notification (fault notification time) → recovery operation/switching time (the protection path is deployed) → data flow arrives at the destination node; the service is down from the failure until recovery]
Example (shared protection): t_l = 10 ms, t_n = 20–30 ms, t_c = 20–30 ms, t_p = 0–30 ms, t_d = 50 ms, t_R = 100–150 ms
[Charts: percentage of traffic recovered vs. time (0–150 ms) for dedicated protection, shared protection (pre-planned), and dynamic restoration]
– Protection: the restoration process (e.g. the protection paths) is planned at connection setup
– Dynamic restoration: the restoration process is computed on-the-fly, after the failure
[Figures: 9-node example network with a link fault, illustrating the three recovery scopes along the working path]
– Link protection: local, loop-back
– Segment protection: a good compromise
– Path protection: global, efficient
[Classification tree of recovery schemes]
– pre-planned (protection): 100% recovery, fast | after the failure event occurs (restoration): no guarantee, slower
– scope: link / path / segment
– capacity: dedicated / shared
– failure dependent / failure independent (the failed element is unknown)
A concrete approach is named by reading the tree from bottom to top (e.g. Dedicated Path Protection or Failure-Dependent Shared Link Protection)
– The traffic is sent simultaneously along the working path and along the protection path
– The destination node switches to the protection path
– CON: 100% capacity redundancy
[Figure: S → D, switching at the destination]
– The source and the destination node switch to the protection path
– In failure-free operation the protection path can be used for best-effort traffic
– This is called "preemption"
[Figure: S → D, switching at both the source and the destination]
– Switching at the source and destination nodes
– PRO: better capacity efficiency
– CON: slightly lower availability
– Dedicated protection (A_w, A_p): A = 1 − (1−A_w)(1−A_p) = A_w + A_p − A_w·A_p
– Shared protection (A_w1, A_w2, A_p): A = A_w1·A_w2 + (1−A_w1)·A_w2·A_p + A_w1·(1−A_w2)·A_p
[Figure: S → D]
– Protects against a single failure
– For n = 2, the protection signal is the bitwise XOR of the two working paths
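The two availability formulas can be compared directly; a minimal sketch with assumed example values A_w = A_p = 0.99, showing that sharing the protection path costs a little availability:

```python
def dedicated(Aw, Ap):
    """Working and dedicated protection path in parallel."""
    return Aw + Ap - Aw * Ap

def shared(Aw1, Aw2, Ap):
    """One protection path shared by two working paths: up if both workings
    are up, or exactly one fails and the protection path covers it."""
    return Aw1 * Aw2 + (1 - Aw1) * Aw2 * Ap + Aw1 * (1 - Aw2) * Ap

print(dedicated(0.99, 0.99))       # 0.9999
print(shared(0.99, 0.99, 0.99))    # slightly lower than dedicated
```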
[Figure: dedicated protection in a ring – traffic A→B is sent on both Path 1 and Path 2; after a failure, the receiving node switches to the surviving path]
[Figure: shared ring protection – traffic A→B travels on the working ring; after a failure, the nodes adjacent to the failure switch the traffic onto the protection ring (two switch actions)]
[Figure: pan-European example network – Amsterdam, London, Brussels, Paris, Zurich, Milan, Berlin, Vienna, Prague, Munich, Rome, Hamburg, Lyon, Frankfurt, Strasbourg, Zagreb]
– Protection cycles are defined in advance in the spare capacity of the network
– They can protect both on-cycle and straddling links
– On-cycle links
– Straddling links (both endpoints are on the cycle, but the link itself is not part of it)
– Protects one unit of working bandwidth if the working path is routed along the cycle
– Protects two units of working bandwidth if the working path traverses a straddling link
– No spare capacity reservation along straddling links
– There can be many straddling links
– Efficient bandwidth usage
– Only two switching actions are needed at recovery
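The counting rule above (one protected unit per on-cycle link, two per straddling link) can be sketched in a few lines of Python; the 5-node cycle and the two straddling links below are hypothetical examples:

```python
# Hypothetical example: a unit-capacity protection cycle A-B-C-D-E-A
cycle = ["A", "B", "C", "D", "E"]
cycle_links = {frozenset(p) for p in zip(cycle, cycle[1:] + cycle[:1])}

# All links of the example graph: the cycle plus two straddling links
graph_links = cycle_links | {frozenset({"A", "C"}), frozenset({"B", "E"})}

protected = 0
for link in graph_links:
    if link in cycle_links:
        protected += 1   # on-cycle link: one protected unit
    elif link <= set(cycle):
        protected += 2   # straddling link: two protected units

print(protected)  # 5 on-cycle links + 2 straddling links * 2 units = 9
```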
– They are built up in the optical control plane, but the switches are not yet configured
[Figure: a cycle with unit spare capacity on each of its links]
[Figure: capacity partitioning on link j – working capacity, spare capacity (split into a part shareable with working path W and a non-shareable part, s_j^W), and free capacity]
[Example network: links annotated with spare, working, and free capacities]
– Single-link SRLGs are considered!
– For each link, we record which SRLGs (Shared Risk Link Groups) it is involved in
s_j^l = the non-shareable spare capacity along link j, if the working path is in SRLG l

[Table: rows l = SRLGs (the working edge involved); columns j = protection edges (all edges of the network); entry (l, j) stores s_j^l]
[Example: SRLG l fails – s_j^l is the sum of the protection bandwidth switched onto link j]
– Keep track of the network state after each possible failure
– At most one SRLG is assumed to be failed at a moment
[Example: failure of link l – resulting spare capacity demands on link j]
– How much spare capacity is needed on link j?
– How much of the spare capacity along link j is shareable if the working path W is known?
– Both reduce to finding the maximum demand:

    v_j = max_{l ∈ SRLG} s_j^l                (spare capacity needed on link j)

    s_j^W = max_{l ∈ SRLG(W)} s_j^l           (maximum over the SRLGs traversed by W)
Admission of a new demand with bandwidth b whose protection path uses link j (v_j = spare capacity, f_j = free capacity on link j):
– b < v_j − s_j^W: the demand fits into the shareable spare capacity, nothing new needs to be reserved
– v_j − s_j^W ≤ b < v_j − s_j^W + f_j: ADMIT, the shortfall is reserved from the free capacity
– b ≥ v_j − s_j^W + f_j: BLOCK
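The admission logic above fits in a few lines; a minimal sketch (the numeric capacities in the usage example are assumed values, and boundary cases follow the inequalities as written above):

```python
def admit(b, v_j, s_j_W, f_j):
    """Shareability check for a new demand of bandwidth b on protection link j.
    v_j = current spare capacity, f_j = free capacity,
    s_j_W = non-shareable spare w.r.t. the demand's working path W."""
    shareable = v_j - s_j_W          # spare capacity this demand may reuse
    if b < shareable:
        return "ADMIT (fully shared, no new capacity)"
    if b < shareable + f_j:
        return "ADMIT (extra spare reserved from free capacity)"
    return "BLOCK"

print(admit(b=5, v_j=20, s_j_W=10, f_j=5))   # fits in the shareable spare
print(admit(b=12, v_j=20, s_j_W=10, f_j=5))  # needs 2 units of free capacity
print(admit(b=20, v_j=20, s_j_W=10, f_j=5))  # exceeds spare + free: blocked
```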
References:
– "… network reliability"
– "… protection techniques in WDM networks"
– J.-P. Vasseur, M. Pickavet, P. Demeester: "Network Recovery: Protection and Restoration of Optical, SONET-SDH, IP, and MPLS", Morgan Kaufmann Publishers, 2004.
– J. Kurose, K. Ross: "Computer Networking: A Top-Down Approach Featuring the Internet", 3rd edition, Addison-Wesley, July 2004.
– "… Approach to Connection Unavailability Estimation in Shared Backup Path Protection"
– "… Ethernet"
– "… Fast Reroute"
– "… Network"