Availability models – Dr. János Tapolcai (tapolcai@tmit.bme.hu) – PowerPoint PPT Presentation
Availability models

  • Dr. János Tapolcai

tapolcai@tmit.bme.hu http://opti.tmit.bme.hu/~tapolcai/


Failure sources – HW failures

  • Network element failures

– Type failures

  • Manufacturing or design failures
  • Revealed during the testing phase

– Wear out

  • Processor, memory, main board, interface cards
  • Components with moving parts:

– Cooling fans, hard disks, power supplies
– These devices are mostly influenced and damaged by natural phenomena (e.g. high humidity, high temperature, earthquakes)

  • Circuit breakers, transistors, etc.

Failure sources – SW failures

  • Design errors
  • High complexity and compound failures
  • Faulty implementations
  • Typos in variable names

– Compiler detects most of these failures

  • Failed memory reading/writing operation

Failure sources – Operator errors (1)

  • Unplanned maintenance

– Misconfiguration

  • Routing and addressing

– misconfigured addresses or prefixes, interface identifiers, link metrics, timers and queues (DiffServ)

  • Traffic Conditioners

– Policers, classifiers, markers, shapers

  • Wrong security settings

– Block legacy traffic

– Other operation faults:

  • Accidental errors (unplug, reset)
  • Access denial (forgotten password)
  • Planned maintenance
  • Upgrade is longer than planned

Failure sources – Operator errors (2)

  • Topology/Dimensioning/Implementation design errors

– Weak processors in routers
– High BER in long cables
– Topology is not meshed enough (not enough redundancy for protection path selection)

  • Compatibility errors

– Between different vendors and versions
– Between service providers or ASs (Autonomous Systems)

  • Different routing settings and Admission Control between two ASs


Failure sources – Operator errors (3)

  • Operation and maintenance errors

– Updates and patches
– Misconfiguration
– Device upgrades
– Maintenance
– Data mirroring or recovery
– Monitoring and testing
– Training users
– Other


Failure sources – User errors

  • Failures from malicious users

– Physical devices

  • Robbery, damage the device

– Against nodes

  • Viruses

– DoS (denial-of-service) attacks (e.g. on the Internet)

  • Routers are overloaded
  • At once from many addresses
  • IP address spoofing
  • Example: Ping of Death – the maximal size of a ping packet is 65,535 bytes. In 1996, computers could be frozen by receiving larger packets.

  • Unexpected user behavior

– Short term

  • Extreme events (mass calling)
  • Mobility of users (e.g. after a football match the given cell is congested)

– Long term

  • New popular sites and killer applications

Failure sources – Environmental causes

  • Cable cuts

– Road construction (‘Universal Cable Locator’)
– Rodent bites

  • Fading of radio waves

– New skyscrapers (e.g. CN Tower)
– Clouds, fog, smog, etc.
– Birds, planes

  • Electro-magnetic interference

– Electro-magnetic noise
– Solar flares

  • Power outage
  • Humidity and temperature

– Air-conditioner fault

  • Natural disasters

– Fires, floods, terrorist attacks, lightnings, earthquakes, etc.


Michnet ISP Backbone 11/97 – 11/98

– Maintenance
– Power Outage
– Fiber Cut/Circuit/Carrier Problem
– Hardware Problem
– Routing Problems
– Interface Down
– Congestion/Sluggish
– Malicious Attack
– Software Problem

  • Which failures are the most probable ones?

Michnet ISP Backbone 11/97 – 11/98

Operator 35%, Environmental 31%, Hardware 15%, Unknown 11%, User 5%, Malice 2%, Software 1%

Cause                              Type           #    [%]
Maintenance                        Operator       272  16.2
Power Outage                       Environmental  273  16.0
Fiber Cut/Circuit/Carrier Problem  Environmental  261  15.3
Unreachable                        Operator       215  12.6
Hardware Problem                   Hardware       154   9.0
Interface Down                     Hardware       105   6.2
Routing Problems                   Operator       104   6.1
Miscellaneous                      Unknown         86   5.9
Unknown/Undetermined/No problem    Unknown         32   5.6
Congestion/Sluggish                User            65   4.6
Malicious Attack                   Malice          26   1.5
Software Problem                   Software        23   1.3


Case study - 2002

  • D. Patterson et al.: “Recovery Oriented Computing (ROC): Motivation, Definition, Techniques, and Case Studies”, UC Berkeley Computer Science Technical Report UCB//CSD-02-1175, March 15, 2002


Failure sources - Summary

  • Operator errors (misconfiguration)

– Simple solutions needed
– Sometimes reaches 90% of all failures

  • Planned maintenance

– Run at night
– Sometimes reaches 20% of all failures

  • DoS attack

– It will be worse in the future

  • Software failures

– Source code of 10 million lines

  • Link failures

– Anything from which a point-to-point connection fails (not only cable cuts)


Motivation behind survivable network design


Reliability

  • Failure

– is the termination of the ability of a network element to perform a required function. Hence, a network failure happens at one particular moment t_f

  • Reliability, R(t)

– continuous operation of a system or service
– refers to the probability of the system being adequately operational (i.e. failure-free operation) for the intended period of time [0, t] in the presence of network failures


Reliability (2)

  • Reliability, R(t)

– Defined as 1 − F(t) (F is the cumulative distribution function, cdf)
– Simple model: exponentially distributed variables

R(t) = 1 - F(t) = 1 - (1 - e^{-\lambda t}) = e^{-\lambda t}

  • Properties:

– non-increasing
– R(0) = 1
– R(∞) = lim_{t→∞} R(t) = 0

[Figure: R(t) decays from R(0) = 1; the value R(a) is marked at t = a]
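As a quick numeric sketch of the exponential model above (the failure rate λ = 0.5 here is an arbitrary illustrative value, not from the slides):

```python
import math

# Reliability of the simple exponential model: R(t) = e^(-lam * t)
def reliability(lam, t):
    return math.exp(-lam * t)

lam = 0.5                      # illustrative failure rate
print(reliability(lam, 0))     # 1.0, since R(0) = 1
print(reliability(lam, 2))     # e^-1 ~ 0.368
# R is non-increasing: reliability(lam, a) >= reliability(lam, b) for a <= b
```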


Network with repairable subsystems

[Figure: timeline alternating between UP intervals (device is operational) and DOWN intervals (the network element has failed and a repair action is in progress); each UP→DOWN transition is a failure]

  • Measures to characterize a repairable system are:

– Availability, A(t)

  • refers to the probability of a repairable system being found in the operational state at some time t in the future
  • A(t) = P(time = t, system = UP)

– Unavailability, U(t)

  • refers to the probability of a repairable system being found in the faulty state at some time t in the future
  • U(t) = P(time = t, system = DOWN)
  • A(t) + U(t) = 1 at any time t


Element Availability Assignment

  • The most commonly used measures are

– MTTR - Mean Time To Repair – MTTF - Mean Time to Failure

  • MTTR << MTTF

– MTBF - Mean Time Between Failures

  • MTBF=MTTF+MTTR
  • if the repair is fast, MTBF is approximately the same as MTTF
  • Sometimes given in FITs (Failures In Time): MTBF[h] = 10^9 / FIT
  • Another notation

– MUT - Mean Up Time

  • Like MTTF

– MDT - Mean Down Time

  • Like MTTR

– MCT - Mean Cycle Time

  • MCT=MUT+MDT
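The relations on this slide can be sketched in a few lines of Python; the 10 000 FIT line-card rating below is a hypothetical figure for illustration only:

```python
# FIT = failures per 10^9 device-hours, so MTBF[h] = 10^9 / FIT
def mtbf_from_fit(fit):
    return 1e9 / fit

# MTBF = MTTF + MTTR; when repair is fast (MTTR << MTTF), MTBF ~ MTTF
def mtbf(mttf, mttr):
    return mttf + mttr

card_fit = 10_000                    # hypothetical line card rated at 10 000 FIT
card_mtbf = mtbf_from_fit(card_fit)
print(card_mtbf)                     # 100000.0 hours

print(mtbf(mttf=card_mtbf, mttr=4))  # 100004.0 -- barely differs from MTTF
```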

Availability in hours

Availability  Nines                       Outage/year  Outage/month  Outage/week
90%           1 nine                      36.52 day    73.04 hour    16.80 hour
95%                                       18.26 day    36.52 hour     8.40 hour
98%                                        7.30 day    14.60 hour     3.36 hour
99%           2 nines (maintained)         3.65 day     7.30 hour     1.68 hour
99.5%                                      1.83 day     3.65 hour    50.40 min
99.8%                                     17.53 hour   87.66 min     20.16 min
99.9%         3 nines (well maintained)    8.77 hour   43.83 min     10.08 min
99.95%                                     4.38 hour   21.91 min      5.04 min
99.99%        4 nines                     52.59 min     4.38 min      1.01 min
99.999%       5 nines (failure protected)  5.26 min    25.9 sec       6.05 sec
99.9999%      6 nines (high reliability)  31.56 sec     2.62 sec      0.61 sec
99.99999%     7 nines                      3.16 sec     0.26 sec      0.06 sec
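Rows of the table can be reproduced directly from the availability figure; this small sketch assumes a 365.25-day year, which matches the table's values:

```python
HOURS_PER_YEAR = 365.25 * 24

# Expected outage for a given steady-state availability
def outage_hours_per_year(availability):
    return (1 - availability) * HOURS_PER_YEAR

def outage_minutes_per_week(availability):
    return (1 - availability) * 7 * 24 * 60

# "3 nines" row of the table:
print(round(outage_hours_per_year(0.999), 2))    # 8.77 hours/year
print(round(outage_minutes_per_week(0.999), 2))  # 10.08 minutes/week
```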


Availability evaluation – assumptions (1)

  • Deployment

– availability increases (unavailability decreases)
– performance is optimized

  • Steady state

– the availability remains the same for a long period (time independent)

  • Wear out (component aging)

– availability decreases (unavailability increases)
– e.g. impairments in the fiber

[Figure: bathtub curve of U(t) over time – high at deployment, flat in steady state, rising again at wear out]


Availability evaluation – assumptions (2)

  • Failure arrival times

– independent and identically distributed (iid) variables following an exponential distribution:

F(t) = 1 - e^{-\lambda t}

– sometimes a Weibull distribution is used (hard)
– λ > 0 failure rate (time independent!)

  • Repair times

– iid exponential variables
– sometimes a Weibull distribution is used (hard)
– μ > 0 repair rate (time independent!)

  • If both failure arrival times and repair times are exponentially distributed we have a simple model

– Continuous Time Markov Chain


Two-state Markov model – Steady state analysis (1)

[State diagram: states UP and DN; transition probability λ from UP to DN, μ from DN to UP; self-loops 1 − λ and 1 − μ]

Means of the exponentially distributed variables: 1/λ = MTTF, 1/μ = MTTR

  • Transition probability distribution in matrix form

– Transition matrix P (stochastic matrix)

  • Time-homogeneous Markov chain

– The transition matrix after k steps: P^k
– The stationary distribution is a row vector π for which πP = π
– π exists (and in this case it is unique)
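The claim that P^k converges to the stationary distribution can be checked numerically. This sketch uses illustrative per-step probabilities λ and μ (not values from the slides):

```python
# Multiply two 2x2 matrices (row-major lists of lists)
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

lam, mu = 0.01, 0.2                       # illustrative failure/repair step probs
P = [[1 - lam, lam], [mu, 1 - mu]]        # two-state transition matrix

Pk = P
for _ in range(200):                      # compute a high power of P
    Pk = matmul(Pk, P)

pi = (mu / (lam + mu), lam / (lam + mu))  # closed-form stationary vector
print(Pk[0])                              # both rows of P^k approach pi
print(pi)
```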

Two-state Markov model – Steady state analysis (2)


[State diagram: states UP and DN; transition probability λ from UP to DN, μ from DN to UP; self-loops 1 − λ and 1 − μ]

Transition matrix:

P = ( 1−λ   λ  )
    (  μ   1−μ )

Stationary distribution: π = (π_UP, π_DOWN) = (A, U)

From πP = π and A + U = 1 we have:

A(1−λ) + Uμ = A  ⟹  Aλ = (1−A)μ  ⟹  A = μ/(λ+μ), U = λ/(λ+μ)
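The steady-state result can be verified numerically: π = (A, U) with A = μ/(λ+μ) satisfies πP = π. The rates below are illustrative:

```python
lam, mu = 0.01, 0.2                  # illustrative failure/repair step probs

A = mu / (lam + mu)                  # steady-state availability
U = lam / (lam + mu)                 # steady-state unavailability

P = [[1 - lam, lam], [mu, 1 - mu]]

# Check pi P = pi, component by component:
up_next = A * P[0][0] + U * P[1][0]  # A(1-lam) + U*mu
dn_next = A * P[0][1] + U * P[1][1]  # A*lam + U(1-mu)
print(up_next, A)                    # equal: pi is stationary
print(dn_next, U)                    # equal
```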


Two-state Markov model – Summary

A_{ss} = \frac{\mu}{\lambda+\mu} = \frac{1/\lambda}{1/\lambda + 1/\mu} = \frac{MTTF}{MTTF + MTTR}

[Figure: A(t) starts at 1 and decays towards the steady-state value A_ss]

  • Without the assumption of repairable subsystems (μ = 0), availability is the same as reliability:

A(t)\big|_{\mu=0} = \left[ \frac{\mu}{\lambda+\mu} + \frac{\lambda}{\lambda+\mu} e^{-(\lambda+\mu)t} \right]_{\mu=0} = e^{-\lambda t} = R(t)
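The transient availability formula above can be sketched directly; the MTTF/MTTR values below are illustrative, not from the slides:

```python
import math

# A(t) = mu/(lam+mu) + lam/(lam+mu) * e^(-(lam+mu)t):
# starts at 1 and decays to the steady state A_ss = MTTF/(MTTF+MTTR)
def availability(lam, mu, t):
    s = lam + mu
    return mu / s + (lam / s) * math.exp(-s * t)

lam, mu = 1 / 1000, 1 / 4           # MTTF = 1000 h, MTTR = 4 h (illustrative)
print(availability(lam, mu, 0))     # ~1.0: a fresh system is up
print(availability(lam, mu, 1e6))   # ~A_ss = 1000/1004 ~ 0.99602
print(availability(lam, 0, 10))     # with mu = 0 this equals R(10) = e^(-10*lam)
```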


Estimate the failure rate λ – Military Handbook

  • First for electric devices
  • MIL-HDBK-217 (Military Handbook, Reliability Prediction of Electronic Equipment)

  • Microelectronic circuits
  • Semiconductors
  • Passive elements

– Match curves on the observations to get λ_p

  • where λ_p is the failure rate of the element

R(t) = e^{-\lambda_p t}


Estimate the failure rate λ – Telcordia standard

  • The operation environment is considered in the estimation

– On-spot measured data
– Data tested in laboratory

  • AT&T Bell Labs.

– Since then called the Telcordia standard
– France Telecom (CNET93) and British Telecom (HRD5) improved the method


Equipment availability - IP router

IP Router (simplified model, configuration example)

  • HW common parts, SW library
  • 1 × 4-port OC3/STM1 POS line card
  • 2 × 1-port Gigabit Ethernet module
  • 4 × 1-port OC48/STM16 POS line card
  • 8 slots available (the rest not used)
  • Power supply, housing, conditioning

IP router: interface card    MTBF[h] = 8.5·10^4,  MTTR[h] = 4
IP router: SW                MTBF[h] = 3·10^4,    MTTR[h] = 0.0004 (SW restart), 0.02 (SW reload), 0.25 (no automatic restart)
IP router: route processor   MTBF[h] = 2·10^5,    MTTR[h] = 4
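From these figures the steady-state availability of each part follows from A = MTTF/(MTTF + MTTR); this sketch approximates MTTF by the quoted MTBF, which is valid here since MTTR << MTBF:

```python
# Steady-state availability from MTBF (~MTTF) and MTTR, both in hours
def availability(mtbf_h, mttr_h):
    return mtbf_h / (mtbf_h + mttr_h)

print(round(availability(8.5e4, 4), 6))     # interface card: ~0.999953
print(round(availability(2e5, 4), 6))       # route processor: ~0.99998
print(round(availability(3e4, 0.0004), 9))  # SW with automatic restart
```

Note how strongly the repair time drives SW availability: the same MTBF with MTTR = 0.25 h (no automatic restart) gives a noticeably lower figure.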


Equipment availability – DXC in SDH/SONET

[Figure: DXC block diagram – trunk transponders and tributary transponders (OEO) plus a control unit]

SDH DXC/ADM: MTBF[h] = 1·10^6, MTTR[h] = 4. A DXC has more ports than an IP router.

SDH – Synchronous Digital Hierarchy
SONET – Synchronous Optical NETworking
DXC – digital cross-connect
ADM – add-drop multiplexer
OEO – optical-electrical-optical conversion


Equipment availability – WDM system

[Figure: WDM line system – OXC, transponders, amplifiers, cable/fibre]

  • Transponder: MTBF = 400·10^3, MTTR = 6
  • Amplifier: MTBF = 250·10^3, MTTR = 6
  • WDM line system: MTBF = 160·10^3, MTTR = 6
  • WDM OXC (OEO) or OADM: MTBF = 1·10^5, MTTR = 6
  • OXC redundant, 1+1 protected: MTBF = 6·10^6, MTTR = 4
  • Buried cable: MTBF[km] = 2.6·10^5, MTTR = 12
  • Aerial cable: MTBF[km] = 1.75·10^5, MTTR = 6
  • Submarine cables: MTBF[km] = 4.64·10^6, MTTR = 540

WDM – wavelength division multiplexing
OXC – optical cross-connect
OADM – optical add-drop multiplexer


Single WDM lightpath

[Figure: lightpath chain – OXC, transponder, WDM line system, amplifier, cable, WDM line system, transponder, OXC]

  • Transponder: MTBF = 4·10^5, MTTR = 6
  • Amplifier: MTBF = 2.5·10^5, MTTR = 6
  • WDM line system: MTBF = 1.6·10^5, MTTR = 6
  • WDM OXC: MTBF = 1·10^5, MTTR = 6
  • Ground cable (200 km): MTBF[km] = 2.63·10^5, MTTR = 12

Series rule: A = \prod_{i=1}^{m} A_i

A_s-d = A_OXC · A_tr · A_MUX · A_cable · A_amp · A_MUX · A_tr · A_OXC
      = 0.99994 · 0.999985 · 0.9999625 · 0.99087 · 0.999976 · 0.9999625 · 0.999985 · 0.99994
      = 0.99994 · 0.99074 · 0.99994 = 0.99062

≈ 3.65 day/year outage
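The series computation can be reproduced from the MTBF/MTTR figures; this sketch uses the slide's A ≈ 1 − MTTR/MTBF approximation for each element:

```python
# Per-element availability, approximated as 1 - MTTR/MTBF (MTTR << MTBF)
def avail(mtbf_h, mttr_h):
    return 1 - mttr_h / mtbf_h

a_oxc   = avail(1e5, 6)             # 0.99994
a_tr    = avail(4e5, 6)             # 0.999985
a_mux   = avail(1.6e5, 6)           # 0.9999625 (WDM line system)
a_amp   = avail(2.5e5, 6)           # 0.999976
a_cable = avail(2.63e5 / 200, 12)   # 200 km ground cable: ~0.99087

# Series rule: multiply the availabilities along the chain
chain = [a_oxc, a_tr, a_mux, a_cable, a_amp, a_mux, a_tr, a_oxc]
a_sd = 1.0
for a in chain:
    a_sd *= a
print(round(a_sd, 5))               # ~0.99063 (slide quotes 0.99062)
```

The cable dominates: its unavailability (~9·10⁻³) is more than an order of magnitude larger than all the equipment terms combined.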


1+1 Protection (disjoint pair of paths)

  • 200 km lightpath: A = 0.99074

Parallel rule: A = 1 - \prod_{i=1}^{m} (1 - A_i)

A_s-d = A_OXC · [1 − (1 − A_path1)·(1 − A_path2)] · A_OXC
      = 0.99994 · [1 − (1 − 0.99074)·(1 − 0.99074)] · 0.99994 = 0.99979

≈ 53 min/year outage
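The parallel rule computation can be sketched the same way, with the two disjoint lightpaths in parallel and the end OXCs still in series:

```python
# Parallel rule: the group fails only if every parallel element fails
def parallel(*avails):
    u = 1.0
    for a in avails:
        u *= (1 - a)
    return 1 - u

a_oxc  = 0.99994
a_path = 0.99074                 # one 200 km lightpath (previous slide)

a_sd = a_oxc * parallel(a_path, a_path) * a_oxc
print(round(a_sd, 5))            # 0.99979
```

Protection turns the dominant path unavailability (~9·10⁻³) into its square (~8.6·10⁻⁵); the end OXCs, which are not protected here, now contribute a comparable share of the remaining downtime.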


References

  • Dr. Chidung Lac, “Telecommunication network reliability”
  • D. Arci et al., “Availability models for protection techniques in WDM networks”
  • J. Kurose and K. Ross, Computer Networking: A Top-Down Approach Featuring the Internet, 3rd edition, Addison-Wesley, July 2004.
  • J.-P. Vasseur, M. Pickavet, and P. Demeester, Network Recovery: Protection and Restoration of Optical, SONET-SDH, IP, and MPLS, Morgan Kaufmann Publishers, 2004.