Calculating Total System Availability: KLM ICT Infrastructure

SLIDE 1

Calculating Total System Availability

KLM ICT Infrastructure

Hoda Rohani, Azad Kamali Roosta
Project Supervisors: Betty Gommans, Leon Gommans

SLIDE 2

What we will see

- Availability definition
- How to calculate availability for:
  - A single component
  - Parallel / serial configurations
- How to calculate the availability of a system

SLIDE 3

Research Project Place in the Hierarchy

- Artificial IT Intervention Handler (AITIH)
- To establish a framework for calculating the availability (as a non-functional requirement) of a KLM business application

Availability is a requirement

SLIDE 4

Definitions

- Availability
  - In reliability engineering: a function of time, defined as the probability that the system is operating correctly and is available to perform its function at the instant of time t
- Unavailability
  - 1 - Availability

SLIDE 5

Definitions

MTBF (Mean Time Between Failures)

- The mean time expected between two consecutive system failures
- High MTBF means...

MTTR (Mean Time To Repair)

- The mean time required to repair a failed system
- This time includes …
- Represented in units of hours

Basic measures for calculating availability

SLIDE 6

Failure Rate

Hardware failures:

- Design faults
- Mechanical malfunction
- Electronic interference

Bathtub curve

Software failures:

- Complexity of software, size of code, team experience
- Depth of testing before releasing the product, percentage of code reused from a previous stable project

Basic assumption: constant failure rates

http://www.mana-ups.com

SLIDE 7

How to Calculate Availability

\[ A = \frac{\text{Uptime}}{\text{Downtime} + \text{Uptime}} \]

\[ A = \frac{\text{MTBF}}{\text{MTBF} + \text{MTTR}} \]

The impact of MTBF and MTTR
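
To make the impact of the two parameters concrete, here is a minimal Python sketch of the MTBF/MTTR formula. The function name and the hour figures are illustrative assumptions, not values from the slides.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability A = MTBF / (MTBF + MTTR)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical figures, chosen only to show the effect of each parameter.
baseline = availability(mtbf_hours=10_000, mttr_hours=4)
longer_mtbf = availability(mtbf_hours=20_000, mttr_hours=4)    # failures half as frequent
faster_repair = availability(mtbf_hours=10_000, mttr_hours=2)  # repairs twice as fast

for label, a in [("baseline", baseline),
                 ("2 x MTBF", longer_mtbf),
                 ("MTTR / 2", faster_repair)]:
    print(f"{label:10s} availability = {a:.6f}")
```

Doubling MTBF and halving MTTR give the same availability, because only the ratio of the two parameters enters the formula.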

SLIDE 8

Many Factors in Availability Calculation

Designing and implementing a highly available network involves:

- Hardware: hardware failures such as I/O errors, hard disk failures, memory parity errors, and network hardware failures
- Software: software errors such as bugs in source code, system overload, and resource exhaustion
- Environmental faults
- Human errors: mostly occur as a result of changes

SLIDE 9

HW/SW factors in Availability Calculation of a Component

Calculating hardware availability:

- MTBF: can be obtained from the vendor for off-the-shelf components, or from the hardware team for in-house components
- MTTR: service contract response time

Calculating software availability:

- MTBF: estimated by multiplying the defect rate by the size of program executed per second (dimensionally this product is a failure rate, so the MTBF is its reciprocal)
- MTTR: mean time taken to reboot or debug

SLIDE 10

Human Errors and Environmental Factors

- Environment
  - 29 minutes of downtime per year due to power loss gives an availability of 0.999945
  - Can be increased by backup power devices
- Human errors
  - Whether the person is experienced
  - Task complexity: whether it is simple or hard, routine or non-routine
  - Stress factor: how much time is available
  - Whether there is any procedural guidance for doing the job
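
The power-loss figure can be checked by dividing the yearly downtime by the number of minutes in a year (assuming a 365-day year):

\[
A = 1 - \frac{29}{365 \times 24 \times 60} = 1 - \frac{29}{525600} \approx 0.999945
\]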

SLIDE 11

Availability in a Serial System

Availability = A1 × A2 = 0.990025 (consistent with A1 = A2 = 0.995)

What happens if A1 is high but A2 is low?

SLIDE 12

Availability in a Parallel System

Unavailability = (1 - A1) × (1 - A2) = 0.000025

Availability = 1 - Unavailability = 0.999975
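
The serial and parallel rules can be sketched in a few lines of Python. The function names are illustrative, and A1 = A2 = 0.995 is an assumption that happens to reproduce both slide values; the slides do not state the individual availabilities.

```python
from math import prod


def serial(*availabilities: float) -> float:
    """All components are needed: multiply the availabilities."""
    return prod(availabilities)


def parallel(*availabilities: float) -> float:
    """Any one component suffices: multiply the unavailabilities."""
    return 1.0 - prod(1.0 - a for a in availabilities)


a1 = a2 = 0.995  # assumed values, consistent with the slide results
print(f"serial:   {serial(a1, a2):.6f}")
print(f"parallel: {parallel(a1, a2):.6f}")

# A1 high but A2 low: the chain can never beat its weakest link.
print(f"serial with a weak link: {serial(0.99999, 0.9):.6f}")
```

In a serial chain the result is bounded above by the lowest component availability, which is why a single weak component dominates the total.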

SLIDE 13

So far…

- We know what availability is
- We can calculate the availability of a single (independent) component
- We can calculate the availability of dependent components with simple relations

SLIDE 14

Application Dependency

[Figure: dependency diagram with User, Application, Web Service 1, Web Service 2, Database, and Network Switches 1, 2, and 3]

SLIDE 15

Real life example!

[Figure: topology of applications A1 to A14, hosts H1 to H26, and switches S1 to S6. Legend: Application, Host, Switch]

SLIDE 16

Different Layers

[Figure: layer stack. End point side: Application, Web Server, JVM, Operating System, Virtualization, Hardware, Network Interface Cards. Network side: Cables and Network Devices (Module 1, Module 2; Stack)]

What about a cloud?!

SLIDE 17

What may go wrong?

An application may have bugs

An application server may run out of resources

An operating system may fail

A hard disk may fail

Server hardware may fail

A network cable may get disconnected

A switch may malfunction

An administrator may make a mistake while configuring something

You may have a power outage

Your cooling system may fail

And …

Are these happening one at a time?!

SLIDE 18

The approach

[Figure: dependency diagram with User, Application, Web Service 1, Web Service 2, Database, and Network Switches 1, 2, and 3]

APP | DB | WS1 | WS2 | NS1 | NS2 | NS3 | User
U   | X  | X   | X   | X   | X   | X   | U
A   | U  | X   | X   | X   | X   | X   | U
A   | A  | X   | X   | X   | U   | X   | U
A   | A  | X   | X   | U   | A   | X   | U
A   | A  | U   | U   | A   | A   | X   | U
Otherwise                              | A

A = Available, U = Unavailable, X = Don't care

\[ A_{\text{System}} = 1 - \sum_{\text{Unavailable States}} P(\text{being in the state}) \]
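
As an illustration of this approach, the following Python sketch enumerates every up/down state of the seven components in the truth table and sums the probability of the states in which the user is not served. The component availabilities (0.999 each) and the function names are assumptions for the example, and components are assumed to fail independently.

```python
from itertools import product

COMPONENTS = ["APP", "DB", "WS1", "WS2", "NS1", "NS2", "NS3"]


def user_is_served(up: dict) -> bool:
    """Encodes the truth table: NS3 is a don't-care, one web service suffices."""
    return (up["APP"] and up["DB"] and up["NS1"] and up["NS2"]
            and (up["WS1"] or up["WS2"]))


def system_availability(avail: dict) -> float:
    """1 minus the summed probability of all unavailable states."""
    unavailable = 0.0
    for states in product([True, False], repeat=len(COMPONENTS)):
        up = dict(zip(COMPONENTS, states))
        if user_is_served(up):
            continue
        # Probability of being in exactly this state (independent components).
        p = 1.0
        for name, is_up in up.items():
            p *= avail[name] if is_up else 1.0 - avail[name]
        unavailable += p
    return 1.0 - unavailable


avail = {name: 0.999 for name in COMPONENTS}  # hypothetical availabilities
a_system = system_availability(avail)
print(f"system availability = {a_system:.9f}")
```

For a topology this small the result can be cross-checked against the serial/parallel formulas; full enumeration grows as 2^n, which is why limiting the number of simultaneous faults matters for larger systems.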

SLIDE 19

In order to find failures

- Choose which layers you want to include in your calculations
  - You may want to skip a level or integrate it into others
- Partition those layers into two categories:
  - Network category: all those providing network connectivity
  - End point category: all those not engaged in network connectivity
- Divide end points into two subcategories:
  - Application itself
  - Containers (no dependency rule)
- The network subcategories are:
  - Container
  - Interface

SLIDE 20

The rules are:

- A container will fail if any of its components fails
- An application will fail if:
  - It fails itself;
  - Its container fails;
  - What it depends on has failed;
  - There is no connectivity between the application and what it depends on.
- An interface will fail if it fails!
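
One way to encode these rules is sketched below in Python. The data model (two hosts, one switch, the names `hostA`, `sw1`, and so on) is entirely hypothetical and is not the authors' tooling; dependencies are assumed to be acyclic.

```python
from collections import deque

# Hypothetical model: containers and their components.
CONTAINER_PARTS = {"hostA": ["hostA.os", "hostA.hw"],
                   "hostB": ["hostB.os", "hostB.hw"]}
APP_CONTAINER = {"app": "hostA", "db": "hostB"}
APP_DEPENDS_ON = {"app": ["db"], "db": []}
# Undirected links: host - interface - switch - interface - host.
LINKS = [("hostA", "hostA.nic"), ("hostA.nic", "sw1"),
         ("sw1", "hostB.nic"), ("hostB.nic", "hostB")]


def container_failed(container, failed):
    """A container fails if it or any of its components fails."""
    return container in failed or any(p in failed for p in CONTAINER_PARTS[container])


def connected(src, dst, failed):
    """Breadth-first search over links whose endpoints have not failed."""
    neighbours = {}
    for a, b in LINKS:
        if a in failed or b in failed:
            continue
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    seen, queue = {src}, deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def app_failed(app, failed):
    """Apply the four application rules to a set of failed components."""
    if app in failed:                          # it fails itself
        return True
    host = APP_CONTAINER[app]
    if container_failed(host, failed):         # its container fails
        return True
    for dep in APP_DEPENDS_ON[app]:
        if app_failed(dep, failed):            # what it depends on has failed
            return True
        if not connected(host, APP_CONTAINER[dep], failed):  # no connectivity
            return True
    return False


print(app_failed("app", set()))      # nothing failed
print(app_failed("app", {"sw1"}))    # the only switch failed
```

An interface or switch appears only as a member of the `failed` set, which matches the rule that an interface fails only when it fails itself.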

SLIDE 21

Situation Modeling

[Figure: situation model with three panels: Redundancy (Web Service 1, Web Service 2), Relationship Rules (Application, Web Service 1, Web Service 2, Database), and Connections (User, Application, Web Service, and Database hosts with their NICs, attached to Network Switches 1, 2, and 3)]

SLIDE 22

Calculation Steps

[Figure: the Redundancy, Relationship Rules, and Connections panels of the situation model, feeding the steps below]

1. Inject fault(s)
2. If not all rules are satisfied, it is a fail state
3. Calculate the probability
4. Add to the sum
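
The steps can be sketched as a bounded fault-injection loop in Python. This is a sketch under stated assumptions, not the authors' exe.py: the function names are hypothetical, components are assumed independent, the probability of a scenario is taken to be the probability that exactly the injected components are down, and the example topology and 0.999 availabilities are made up.

```python
from itertools import combinations
from math import prod


def availability_with_max_faults(avail, system_fails, max_faults):
    """Return (availability, number of failure scenarios) for <= max_faults faults."""
    names = list(avail)
    unavailable = 0.0
    scenarios = 0
    for k in range(1, max_faults + 1):
        for faults in combinations(names, k):      # 1. inject fault(s)
            failed = set(faults)
            if not system_fails(failed):            # 2. all rules satisfied: not a fail state
                continue
            # 3. probability that exactly these components are down
            p = prod((1.0 - avail[n]) if n in failed else avail[n] for n in names)
            unavailable += p                         # 4. add to the sum
            scenarios += 1
    return 1.0 - unavailable, scenarios


def system_fails(failed):
    """Hypothetical rule set: four critical components and a redundant pair."""
    core_down = bool(failed & {"APP", "DB", "NS1", "NS2"})
    both_ws_down = {"WS1", "WS2"} <= failed
    return core_down or both_ws_down


avail = {n: 0.999 for n in ["APP", "DB", "WS1", "WS2", "NS1", "NS2", "NS3"]}
for k in (1, 2, 3):
    a, n = availability_with_max_faults(avail, system_fails, k)
    print(f"max faults = {k}: {n} failure scenarios, availability = {a:.10f}")
```

Raising the fault limit adds ever less probable scenarios, so the computed availability decreases and quickly settles.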

SLIDE 23

Test Case

- Getting AITIH data for a part of a business application in CSV format:
  - appT
  - appCSA
  - appEUI
  - appEBC
  - appEDB
  - appCS
  - appkia

SLIDE 24

Application - Hosts

Application Name | Host  | No. of Clones Running
appCSA           | hst01 | 1
appCSA           | hst02 | 1
appEUI           | hst03 | 5
appEUI           | hst04 | 5
appEUI           | hst05 | 5
appEBC           | hst06 | 3
appEBC           | hst07 | 3
appEBC           | hst08 | 3
appEBC           | hst03 | 3
appEBC           | hst04 | 3
appEBC           | hst05 | 3
appCS            | hst06 | 1
appCS            | hst07 | 1
appCS            | hst08 | 1
appCS            | hst03 | 1
appCS            | hst04 | 1
appCS            | hst05 | 1
appkia           | hst06 | 1
appkia           | hst07 | 1
appkia           | hst08 | 1
appkia           | hst03 | 1
appkia           | hst04 | 1
appkia           | hst05 | 1
appT             | hst09 | 1
appT             | hst10 | 1
appEDB           | hst11 | 1

SLIDE 25

Application Dependencies

Application Name | Database Service | Hosted on
appCS            | appT             | hst09
appkia           | appT             | hst10
appEBC           | appEDB           | hst11

SLIDE 26

All components together

[Figure: Total Availability Calculation Process, combining Application, Host, NIC, Network, and End User]

SLIDE 27

The input data

apps.csv

hst01,appCSA,1
hst02,appCSA,1

netnods.csv

Switch_1,Switch_3
Switch_3,Switch_2,Switch_1

hostnicsw.csv

hst08,eth2,Switch_1
hst07,eth2,Switch_1
hst01,eth2,Switch_1

dep.csv

appCS,appT
appkia,appT

availability.csv (a random number between 0.9999 and 0.999997)

hst08->eth2,,,0.999944
hst09,,,0.999972
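
A hedged Python sketch of reading availability.csv follows. The column layout (name, MTBF, MTTR, availability) is an assumption inferred from the two empty fields in the sample rows; the slides do not name the columns. The `hst10` and `hst11` rows and the function name are made up to exercise the fallbacks.

```python
import csv
import io
import random

# First two rows are from the slide; the last two are hypothetical.
SAMPLE = """hst08->eth2,,,0.999944
hst09,,,0.999972
hst10,8760,2,
hst11,,,
"""


def load_availability(text, rng=random):
    """Map component name to availability, assuming name,MTBF,MTTR,availability."""
    table = {}
    for name, mtbf, mttr, avail in csv.reader(io.StringIO(text)):
        if avail:                                   # pre-calculated value
            table[name] = float(avail)
        elif mtbf and mttr:                         # derive from MTBF and MTTR
            table[name] = float(mtbf) / (float(mtbf) + float(mttr))
        else:                                       # nothing known: random placeholder
            table[name] = rng.uniform(0.9999, 0.999997)
    return table


table = load_availability(SAMPLE)
for name, a in table.items():
    print(f"{name:12s} {a:.6f}")
```

Passing a seeded `random.Random` instance as `rng` makes the placeholder values reproducible between runs.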

SLIDE 28

The Process

[Figure: flow diagram of the Total Availability Calculation Process, with a legend distinguishing Input, Intermediate, Process, and Output elements. Its labelled elements are listed below.]

- Dependency List (dep.csv)
- Host - NIC - Switch Relation (hostnicsw.csv)
- App - Replica - Host Relation (hostapp.csv)
- App - Host Relation (apps.csv)
- Network Nodes (netnods.csv)
- Redundancy List (clusters.csv)
- Component Availability Parameters (availability.csv)
- Component Availability Parameters (acalculator.py)
- Application Replica Finder (replicator.py)
- Code Template (template.py)
- Code Maker (makeit.py)
- Main Code (exe.py)
- Execution Log (log.exe.py)
- Failure Log (failed.log.exe.py)

SLIDE 29

Results - Summary

Maximum Simultaneous Faults | Number of Failure Scenarios | Total Availability
1                           | 5                           | 99.9781476669 %
2                           | 280                         | 99.9780993579 %
3                           | 8,192                       | 99.9780993065 %
4                           | 136,153                     | 99.9780993064 %
5                           | 1,769,375                   | 99.9780993064 %
6                           | 17,919,053                  | 99.9780993064 %

SLIDE 30

Results - Two Simultaneous Fault Scenarios

#  | Component A   | Component B   | Desc
1  | 'hst09'       | 'appT.REP2'   | appT.REP2 on hst10
2  | 'hst09'       | 'hst10'       |
3  | 'appT.REP2'   | 'appT.REP1'   | appT.REP1 on hst09
4  | 'appT.REP1'   | 'hst10'       |
5  | 'hst02'       | 'hst01->eth2' |
6  | 'hst02'       | 'appCSA.REP1' | appCSA.REP1 on hst01
7  | 'hst02'       | 'hst01'       |
8  | 'hst01->eth2' | 'appCSA.REP2' | appCSA.REP2 on hst02
9  | 'hst01->eth2' | 'hst02->eth2' |
10 | 'appCSA.REP2' | 'appCSA.REP1' |
11 | 'appCSA.REP2' | 'hst01'       |
12 | 'appCSA.REP1' | 'hst02->eth2' |
13 | 'hst02->eth2' | 'hst01'       |
14 | 'Switch_2'    | 'hst11->eth1' |
15 | 'hst11->eth1' | 'hst11->eth2' |

SLIDE 31

Result – A little analysis

- Criticality function:
  CF(component) = (number of times it appeared as a cause of system failure) × (1 - Availability(component))
- Most critical components: ['Switch_1']
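
The criticality function can be sketched in Python as below. The scenario list and availabilities are hypothetical and are not the study's data; the function name is an assumption.

```python
from collections import Counter


def criticality(failure_scenarios, avail):
    """CF(c) = appearances of c in failure scenarios * (1 - availability of c)."""
    appearances = Counter(c for scenario in failure_scenarios for c in set(scenario))
    return {c: n * (1.0 - avail[c]) for c, n in appearances.items()}


# Hypothetical failure scenarios: each tuple is a set of components that,
# when down together, make the system unavailable.
scenarios = [("Switch_1",), ("hst09", "hst10"), ("Switch_1", "hst09")]
avail = {"Switch_1": 0.9999, "hst09": 0.99995, "hst10": 0.99995}

cf = criticality(scenarios, avail)
most_critical = max(cf, key=cf.get)
print(most_critical, cf)
```

A component scores high either by appearing in many failure scenarios or by being unreliable itself, so the ranking combines structural importance with component quality.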

SLIDE 32

Assumptions

Each single node's independent availability is either pre-calculated, or its MTBF and MTTR parameters are present. If neither was present, a random number between 0.9999 and 0.999997 was assigned as the availability.

Whenever there is a physical network path between two network nodes, it is treated as a network connection between them. In other words, no network segmentation exists in upper layers.

Physical connectors (like cables) are considered always available.

Network devices are seen as a single component even if they are modular.

There is no virtualization involved.

There is only one web application on each web server.

Hosts include the operating system and the host hardware (except for the NIC).

All network cards of a server are able to take over for the other cards.

In the network layer, redundancy is achieved by using separate paths. There is no stacked switch.

Environmental and human-related factors are ruled out for simplicity.

SLIDE 33

To sum up, we saw

What availability means

How to calculate it for a standalone (independent) component

How to calculate it for simple dependent components

A method of calculating availability in a complex system

An example of such calculation
