

SLIDE 1

Emerging trends for High Availability

Asim Zuberi

Senior Consultant, Collective Technologies

Ayaz Mudarris

Senior Consultant, Collective Technologies

SLIDE 2

Module 1: Concepts…

SLIDE 3

What is Downtime?

– If a user cannot get his job done on time, the system is down and downtime is incurred.

SLIDE 4

Causes of Downtime!

SLIDE 5

What is Availability?

A = MTBF / (MTBF + MTTR)

where:
  A    is the degree of availability, expressed as a percentage
  MTBF is the mean time between failures (uptime)
  MTTR is the mean time to repair (downtime)
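A quick numeric illustration of the formula; the MTBF/MTTR figures below are made up for the example, not taken from the presentation:

    # Availability from MTBF and MTTR; example figures are illustrative only.
    def availability(mtbf_hours, mttr_hours):
        return mtbf_hours / (mtbf_hours + mttr_hours)

    # Case I: driving MTTR toward zero pushes A toward 100%.
    print(availability(1000, 8.0))    # ~0.9921
    print(availability(1000, 0.5))    # ~0.9995
    # Case II: with a much larger MTBF, the same MTTR matters far less.
    print(availability(100000, 8.0))  # ~0.99992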

SLIDE 6

Availability Equation (A closer look)

Case I: As MTTR approaches zero, A increases toward 100%.

A = MTBF / (MTBF + MTTR)

SLIDE 7

Availability Equation (A closer look)

Case I: As MTTR approaches zero, A increases toward 100%.
Case II: As MTBF gets larger, MTTR has less impact on A.

A = MTBF / (MTBF + MTTR)

SLIDE 8

Increasing Availability

  • Key is obviously to minimize downtime
  • As downtime approaches zero, availability approaches 100%

[Chart: availability (%) plotted against downtime, rising toward 100% as downtime approaches zero]

SLIDE 9

The Rule of 9’s

% Uptime            Annual Downtime
99.00 (2 nines)     87 hours 36 minutes
99.50               43 hours 48 minutes
99.70               26 hours 17 minutes
99.80               17 hours 31 minutes
99.90 (3 nines)     8 hours 45 minutes
99.98               1 hour 45 minutes
99.99 (4 nines)     52.6 minutes
100                 0 hours
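These downtime figures follow directly from the uptime percentage; a quick sketch of the arithmetic (a 365-day year is assumed):

    # Annual downtime implied by an uptime percentage (365-day year assumed).
    def annual_downtime_minutes(uptime_pct):
        return (100.0 - uptime_pct) / 100.0 * 365 * 24 * 60

    for pct in (99.0, 99.5, 99.9, 99.99):
        hours, minutes = divmod(round(annual_downtime_minutes(pct)), 60)
        print(f"{pct}% uptime -> {hours} h {minutes} min of downtime per year")
    # 99.0% -> 87 h 36 min; 99.99% -> 0 h 53 min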

SLIDE 10

Why do you need Availability?

– Issues which have caused problems or concerns with computer availability…

  • Terrorist attacks
  • Satellite Outages
  • Attacks by computer viruses
  • Emergence of the Internet as a viable force
SLIDE 11

Levels of Availability

  • Level 1: Regular Availability (Do Nothing Special)
  • Level 2: Increased Availability (Protect the Data)
  • Level 3: High Availability (Protect the System)
  • Level 4: Disaster Recovery (Protect the Organization)

SLIDE 12

Twenty Key System Design Principles

20) Spend Money…but not blindly
19) Assume Nothing
18) Remove/Identify SPOFs
17) Maintain Tight Security
16) Consolidate Your Servers
15) Automate Common Tasks
14) Document Everything
13) Establish Service Level Agreements
12) Plan Ahead
11) Test Everything
10) Maintain Separate Environments
9) Invest in Failure Isolation
8) Examine the History of the System
7) Build for Growth
6) Choose Mature Software
5) Select Reliable and Serviceable Hardware
4) Reuse Configurations
3) Exploit External Resources
2) One Problem, One Solution
1) KISS: Keep It Simple, Stupid

SLIDE 13

End-to-end Availability Measurement

[Diagram: end-to-end availability (E-E-A) measured across the full stack: Application, Network Infrastructure, System Software, Operating System, Hardware]

SLIDE 14

Modeling Availability

  • Complex, as a system comprises many components
  • Most common techniques

– Monte Carlo simulation
– Markov techniques: basically state diagrams
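As a concrete illustration of the Monte Carlo approach, here is a minimal two-state (up/down) simulation; the MTBF and MTTR figures and the exponential distributions are assumptions for the example, not from the presentation:

    import random

    # Minimal Monte Carlo sketch of a two-state (up/down) availability model.
    def simulated_availability(mtbf=1000.0, mttr=4.0, horizon=1_000_000.0):
        t, up_time = 0.0, 0.0
        while t < horizon:
            up = random.expovariate(1.0 / mtbf)    # time until the next failure
            down = random.expovariate(1.0 / mttr)  # time spent repairing
            up_time += min(up, horizon - t)
            t += up + down
        return up_time / horizon

    print(simulated_availability())  # close to MTBF / (MTBF + MTTR) = 0.996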
SLIDE 15

State Diagram

SLIDE 16

What Does It Mean to Us?

How do you minimize downtime?

[Chart: recovery-time scale from 1 minute to 24 hours, with Clustering, Replication, Snapshot, Mirroring, and Backups placed along it; clustering recovers fastest, tape backups slowest]
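One way to read the chart is as a lookup from a recovery-time objective to candidate techniques; the pairings below are assumed, order-of-magnitude values rather than figures from the presentation:

    # Assumed, order-of-magnitude recovery times per technique (minutes).
    recovery_minutes = {
        "clustering": 1,
        "replication": 10,
        "snapshot": 60,
        "mirroring": 12 * 60,
        "tape backups": 24 * 60,
    }

    def candidates(rto_minutes):
        """Techniques whose typical recovery time fits within the RTO."""
        return [t for t, m in recovery_minutes.items() if m <= rto_minutes]

    print(candidates(60))   # ['clustering', 'replication', 'snapshot']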

SLIDE 17

Trinity of TTs

SLIDE 18

Module 2: Storage Area Networks…

SLIDE 19

Why Storage Area Networks?

  • Management of distant configurations.
  • Soft recabling.
  • Storage consolidation.
  • Heterogeneous connectivity.
  • Data sharing.
  • Massive configurations.
  • LAN-less and/or server-less backup.
SLIDE 20

Why Fibre Channel?

  • Reliable Communication

– Removes the performance barriers of legacy LANs.
– Support for other, typically "non-network" protocols, such as SCSI.

  • Low-latency message passing
  • High bandwidth transfer

– Connection-oriented and connectionless data delivery.
– Sustained data transfer rates of 90 MBps.
– Variable-length (0-2 KB) frames.
– Highly effective for protocol frames of less than 100 bytes, as well as for bulk data transfer.

  • Scalable networks.
SLIDE 21

SAN Components

  • Host Bus Adapter (HBA)
  • Channel
  • Switch/Hub/Bridge
SLIDE 22

HBA

  • Fibre Channel Cards

– Every device on the SAN, including HBAs, has a World Wide Name (WWN)
– 64-bit identifier assigned by the IEEE
– Similar to the way MAC addresses are assigned to Network Interface Cards (NICs)
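To make the MAC-address analogy concrete, a WWN is just a 64-bit value usually written as eight colon-separated bytes; the value below is hypothetical:

    # Format a (hypothetical) 64-bit WWN the way HBA tools usually display it.
    wwn = 0x10000000C9ABCDEF
    print(":".join(f"{(wwn >> shift) & 0xFF:02x}" for shift in range(56, -1, -8)))
    # -> 10:00:00:00:c9:ab:cd:ef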

  • Vendors

– JNI for Solaris
– Emulex for NT

SLIDE 23

Channel

  • Medium

– Copper: up to 30 m
– Fibre optics: multimode up to 500 m, single mode up to 10 km

  • Buffer-to-buffer copy
  • Transmission isolated from control
  • FC-0 through FC-4 (the Fibre Channel layer stack)
SLIDE 24

Topologies

  • Point-to-point

– Two Nodes

SLIDE 25

Topologies

  • Arbitrated loop

– 126 nodes
– Practically even less

[Diagram: FC-AL devices connected in a loop through a hub]

SLIDE 26

Topologies

  • Fabric

– 16 million nodes

[Diagram: fabric built from hubs, switches, an enterprise switch, and a bridge, connecting hosts to JBODs and storage arrays]

SLIDE 27

Switches/Hubs/Bridges

  • Workgroup switches

– 8 or 16 ports
– Redundant power supplies
– Hot-swappable GBICs

  • Enterprise switches

– 64 ports
– Everything redundant, everything hot swappable

  • Hubs

– Connect FC-AL devices to the FC-SW fabric

  • FC/SCSI bridges

– Reuse old JBODs or SCSI tape drives

SLIDE 28

NAS vs. SAN

  • NAS devices are storage appliances: big, single-purpose servers that you plug into your network.
  • These appliances perform one task, and they perform it well: they serve files very fast.
  • The difference between how a NAS appliance and a SAN function is subtle.
  • NAS is a defined product that sits between your application server and your file system.
  • SAN is a defined architecture that sits between your file system and your underlying physical storage.
  • NAS is network-centric.
  • A SAN is data-centric.
SLIDE 29

The Final Conflict

  • NAS appliances offer

– Performance and reliability at a low cost
– Excellent devices for collaboration and data storage, especially in heterogeneous computing environments
– Yet NAS appliances can send only files, not data blocks, which limits their ability

  • SAN promises to free your network of bottlenecks

– Traffic relief comes at a high price

SLIDE 30

Third level of High Availability

  • 85% of storage on Unix servers is unprotected!
  • RAID, replication, and snapshots can protect you when disaster strikes.
  • New emerging concepts

– Business Continuance Volumes (BCV)
– Shared Storage Option (SSO) / Smart Media
– SAN over WAN
– iSCSI

SLIDE 31

Business Continuance Volumes

[Diagram: typical BCV use cases]

  • Backup/Restore: high-speed, tapeless, offsite
  • Test Environment: software lifecycles, Y2K / Euro currency conversion
  • Decision Support: reporting, data warehouse

SLIDE 32

BCV

  • Sync-split-mount sequence

– Data flows directly from disk to internal cache and then to the BCV
– Works at the volume group level

  • Block-by-block copy
  • Only changed tracks are copied at the next sync
  • Instantaneous fallback
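A toy sketch of the resync idea (only blocks changed since the split are copied back to the copy); real arrays track dirty tracks internally, and everything below is simplified for illustration:

    # Toy model of BCV incremental resync: full copy at the split, then only
    # blocks dirtied on the production volume are copied at the next sync.
    class Volume:
        def __init__(self, blocks):
            self.blocks = list(blocks)

    def split(source):
        """Return a point-in-time copy and an (empty) dirty-block set."""
        return Volume(source.blocks), set()

    def resync(source, bcv, dirty):
        for i in dirty:                    # copy only the changed blocks
            bcv.blocks[i] = source.blocks[i]
        dirty.clear()

    prod = Volume(b"ABCDEFGH")
    bcv, dirty = split(prod)
    prod.blocks[2] = ord("x"); dirty.add(2)   # a write made after the split
    resync(prod, bcv, dirty)
    print(bytes(bcv.blocks))                  # b'ABxDEFGH'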
SLIDE 33

Sharing Tape Libraries

  • Tape drives are shared

– Heterogeneous connectivity
– Reduces cost
– Increases availability

[Diagram: Sun, NT, and Tru64 hosts sharing tape drives through a switch, an enterprise switch, and an FC/SCSI bridge]

SLIDE 34

Module 3: High Availability trends for SAN…

SLIDE 35

SPOF: Switch/Switch Components

[Diagram: mirrored cluster in which all storage paths run through a single FC switch]

SLIDE 36

SPOF: Switch/Switch Components

[Diagram: the same mirrored cluster with two FC switches, removing the single switch as a point of failure]

SLIDE 37

SPOF:

[Diagram: mirrored cluster connected through a single enterprise FC switch]

SLIDE 38

SPOF: Tape Drives

[Diagram: tape drives split into two 50% groups behind an enterprise FC switch]

SLIDE 39

Module 4: Design Issues for Clustering…

SLIDE 40

Design Issues

  • Objectives

– Understand the design issues of high availability
– Understand the trade-offs of those design issues
SLIDE 41

Design Suggestions

  • Keep it simple

– Complexity hurts long term maintenance and manageability

  • Know all single points of failure (SPOFs)
  • Avoid failover if possible
SLIDE 42

Know and Document ALL Single Points of Failure

  • Look for all SPOFs in both hardware and software
  • Look for SPOFs both on individual hosts and on the cluster as a whole
  • Could the failure of any single component prevent a client from accessing a vital service?
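A trivial way to operationalize that question: inventory how many independent instances of each component the cluster has, and flag anything with only one. The component names and counts below are purely illustrative:

    # Anything the cluster has only one of is, by definition, a SPOF.
    inventory = {
        "heartbeat link": 2,
        "service NIC": 2,
        "FC switch": 1,
        "power feed": 1,
        "tape robot": 1,
    }
    spofs = [name for name, count in inventory.items() if count < 2]
    print("Single points of failure:", spofs)
    # ['FC switch', 'power feed', 'tape robot']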

SLIDE 43

A Typical Layout

[Diagram: two clustered hosts with redundant Ethernet heartbeat links, NICs and hubs on the service network, dual FC paths (FC0/FC1) through switches and a bridge to shared disks and tape, and local OS disks on SCSI buses]

SPOFs to check in each area:

NETWORK:

  • HBAs/NICs
  • routers
  • switches
  • hubs
  • power source

SAN:

  • HBAs
  • routers
  • switches
  • hubs
  • power source

HOSTS:

  • critical file systems, e.g. / and /usr
  • power source

TAPE:

  • HBAs
  • drives
  • robots
  • power source

DISKS:

  • HBAs
  • drives
  • power source

SLIDE 44

SPOF: Hosts

[Diagram: a single host with its NICs, service-network hub, Ethernet heartbeat link, and SCSI-attached OS disks]

HOSTS:

  • critical file systems, e.g. / and /var
  • power source

SLIDE 45

SPOF: Disks

[Diagram: clustered hosts with Ethernet heartbeat link, service network, and OS disks; shared disks attached over SCSI]

DISKS:

  • controllers
  • drives
  • power source

SLIDE 46

SPOF: Heartbeat

[Diagram: clustered hosts with Ethernet heartbeat links through hubs, service-network NICs, and SCSI-attached disks]

NETWORK:

  • NICs
  • switches
  • hubs
  • power source

SLIDE 47

SPOF: Storage/SAN

[Diagram: clustered hosts with dual FC paths (FC0/FC1), Ethernet heartbeat links, service network, and SCSI-attached OS disks]

Storage/SAN:

  • HBAs
  • cabling
  • switches
  • hubs
  • power source

SLIDE 48

SPOF: Tape

[Diagram: clustered hosts reaching the tape library through FC paths (FC0/FC1) and an FC/SCSI bridge; Ethernet heartbeat links and SCSI OS disks also shown]

TAPE:

  • HBAs
  • drives
  • robots
  • power source

SLIDE 49

SPOF: Network Switching

[Diagram: SystemA and SystemB reaching their clients through a single switch]

SLIDE 50

SPOF: Network Switching

[Diagram: SystemA and SystemB reaching their clients through two switches]

  • IEEE 802.3ad port aggregation ("Cisco EtherChannel")
  • IP MultiNIC

SLIDE 51

SPOF: Network Switching

[Diagram: SystemA and SystemB connected to redundant main switches through multiple intermediate switches]

  • Spanning Tree Protocol

SLIDE 52

SPOF: Network Routing

[Diagram: duplicated main switches and routers serving Net 1 and Net 2]

  • Hot Standby Router Protocol (HSRP)

SLIDE 53

SPOF: Network Support Services

[Diagram: SystemA hosting the NIS master and primary DNS, SystemB hosting an NIS slave and secondary DNS, each reachable through its own switch]

SLIDE 54

Avoid failover if possible

  • All failovers cause some type of client disruption
  • Utilize all resources to make individual servers as reliable as possible
  • Failover should always be the last resort

SLIDE 55

Avoid dependencies on any outside resource!

  • Name services

– Look up local data first
– Ensure redundant network paths exist to any outside service required by the cluster

  • Data service

– Network Attached Storage (NAS): avoid letting it become a single point of failure

SLIDE 56

Application Considerations

  • Not all applications are compatible with clustering solutions
  • Recoverability
  • Failover times
  • License issues
  • Other issues
SLIDE 57

Module 5: A Ride thru VCS

SLIDE 58

Heartbeats (HB)

  • HBs are set at the “network layer”.
  • Low Latency Transport (LLT)

– /etc/llttab
– /etc/llthosts

  • Group Membership Services / Atomic Broadcast (GAB)

– /etc/gabtab
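For orientation, roughly what these files contain on a hypothetical two-node Solaris cluster; the node names, cluster ID, and qfe devices are made-up examples, and exact syntax varies by platform and VCS version:

    /etc/llttab:
        set-node nodeA
        set-cluster 10
        link qfe0 /dev/qfe:0 - ether - -
        link qfe1 /dev/qfe:1 - ether - -

    /etc/llthosts:
        0 nodeA
        1 nodeB

    /etc/gabtab:
        /sbin/gabconfig -c -n2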
SLIDE 59

main.cf & types.cf

  • main.cf is the “ONLY” VCS configuration file.
  • types.cf contains all the agent information.
  • Locations:

– /etc/VRTSvcs/conf/config/main.cf
– /etc/VRTSvcs/conf/config/types.cf

SLIDE 60

Service Group and Resources

  • Service Groups

– A service group is a set of resources that work together to provide application services to clients.
– Group operations are standard for each resource in a group.
– If a group faults on one system, the group and its resources can be configured to start on another system.

SLIDE 61

Service Group and Resources

  • Resources

– A resource is a software or hardware component required by an application under VCS control.

  • Disk
  • Disk Group
  • IP
  • NIC
  • Mount
  • NFS
  • Process
  • Share
  • Volume
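To tie service groups and resources together, here is a heavily trimmed, main.cf-style sketch of one group built from a few of the resource types above; all names, devices, and addresses are invented, and exact syntax details vary between VCS versions:

    include "types.cf"

    cluster demo_cluster ( )

    system nodeA ( )
    system nodeB ( )

    group app_sg (
        SystemList = { nodeA = 0, nodeB = 1 }
        AutoStartList = { nodeA }
        )

        DiskGroup app_dg (
            DiskGroup = appdg
            )

        Mount app_mnt (
            MountPoint = "/app"
            BlockDevice = "/dev/vx/dsk/appdg/appvol"
            FSType = vxfs
            )

        NIC app_nic (
            Device = hme0
            )

        IP app_ip (
            Device = hme0
            Address = "192.168.10.20"
            )

        app_mnt requires app_dg
        app_ip requires app_nic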
SLIDE 62
SLIDE 63

VCS vs Sun Clustering

Sun Clustering                                 VCS
---------------------------------------------  ---------------------------------------------
                                               Same code on any platform
Edit a file on each node to make changes       One configuration file
Only supports 8 nodes                          Cluster can grow to 32 nodes
Complete tear-down is required for upgrades    Upgrades are easy
No support for SNMP                            Supports SNMP
                                               Terminal concentrators or management consoles are NOT required
Only supports Sun disks                        Supports non-Sun disks

SLIDE 64

Conclusion

  • High Availability cannot be achieved by merely installing failover software and walking away.
  • Build your systems so reliably that they never have to fail over.
  • Figure out how much downtime costs on your systems, and then use that figure to determine how much you can afford to spend to protect against that downtime.
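As a back-of-the-envelope example of that last point (every figure below is invented):

    # Estimated annual downtime cost sets a ceiling on the HA budget.
    cost_per_hour = 50_000          # what one hour of outage costs the business
    expected_downtime_hours = 8.75  # e.g. roughly 99.9% uptime over a year
    ceiling = cost_per_hour * expected_downtime_hours
    print(f"Spending more than ${ceiling:,.0f}/year on protection is hard to justify")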