Building High-available systems
Chander Damodaran - Collabera
Agenda
- Introduction
- Key Concepts
- Approach
- Availability Index
- Key HA Design Principles
Sample Business Scenarios
Scenarios:
- IRCTC
- RPO
- Mid-size Company
- GitHub
Root causes:
- Single point of failure
- System used beyond design limits
- Software error
- Human error
- Wrong design assumptions
Key Concepts
- Availability: Availability is the measure of how often or how long a service or a system component is available for use.
  Availability = Uptime / (Uptime + Downtime)
- Reliability: Reliability is the measure of fault avoidance.
- Serviceability: Serviceability is a measurement that expresses how easily a system is serviced or repaired.
- Disaster Recovery: Disaster recovery is the ability to continue with services in the case of major outages, often with reduced capabilities or performance.
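As a rough illustration of the availability ratio (uptime divided by uptime plus downtime), the sketch below computes availability from measured hours and converts a target availability into a yearly downtime budget. The function names and the sample figures are hypothetical, and a 365-day year is assumed.

```python
# Minimal sketch of the availability ratio from Key Concepts.
# Names and sample numbers are illustrative assumptions, not from the slides.

HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year, 8760 hours


def availability(uptime_hours: float, downtime_hours: float) -> float:
    """Return Uptime / (Uptime + Downtime) as a fraction between 0 and 1."""
    total = uptime_hours + downtime_hours
    if total <= 0:
        raise ValueError("observation period must be positive")
    return uptime_hours / total


def yearly_downtime_hours(target: float) -> float:
    """Return the downtime per year implied by an availability fraction."""
    if not 0.0 <= target <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - target) * HOURS_PER_YEAR


if __name__ == "__main__":
    measured = availability(uptime_hours=8751.24, downtime_hours=8.76)
    print(f"measured availability: {measured:.4%}")
    for target in (0.99, 0.999, 0.9999):
        budget = yearly_downtime_hours(target)
        print(f"{target:.2%} allows about {budget:.2f} hours of downtime per year")
```

Expressing a target as a downtime budget tends to make the cost of each additional "nine" easier to discuss with stakeholders.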
Approach
- List vulnerabilities
- Evaluate scenarios, and determine their probability
- Map scenarios to requirements
- Design solution
- Review the solution, and check its behaviour against failure scenarios
VULNERABILITY | LIKELIHOOD (1-5) | IMPACT (1-5) | LEVEL OF CONCERN | SOLUTION
Failed disk | 5 | 1 | 5 | Implement mirrored disks
Application crash | 5 | 4 | 20 | Distributed application, failover, clustering
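The two rows in the vulnerability table are consistent with level of concern being likelihood multiplied by impact (5 x 1 = 5 and 5 x 4 = 20). The slides do not state the formula, so the sketch below treats it as an assumption; the class and function names are hypothetical.

```python
# Sketch of the vulnerability scoring step, assuming
# level of concern = likelihood x impact (inferred from the table rows).
from dataclasses import dataclass


@dataclass(frozen=True)
class Vulnerability:
    name: str
    likelihood: int  # 1 (rare) to 5 (frequent)
    impact: int      # 1 (minor) to 5 (severe)
    solution: str

    def __post_init__(self) -> None:
        # Reject scores outside the 1-5 scale used in the table.
        for label, score in (("likelihood", self.likelihood), ("impact", self.impact)):
            if not 1 <= score <= 5:
                raise ValueError(f"{label} must be between 1 and 5, got {score}")

    @property
    def level_of_concern(self) -> int:
        return self.likelihood * self.impact


def rank(vulnerabilities):
    """Return vulnerabilities ordered from highest to lowest level of concern."""
    return sorted(vulnerabilities, key=lambda v: v.level_of_concern, reverse=True)


VULNERABILITIES = [
    Vulnerability("Failed disk", 5, 1, "Implement mirrored disks"),
    Vulnerability("Application crash", 5, 4,
                  "Distributed application, failover, clustering"),
]

if __name__ == "__main__":
    for v in rank(VULNERABILITIES):
        print(f"{v.level_of_concern:>2}  {v.name}: {v.solution}")
```

Ranking by the combined score gives a simple order in which to map scenarios to requirements and design solutions.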
Availability Index
[Figure: availability (vertical axis) plotted against investment (horizontal axis). Levels shown, from top to bottom:]
- Disaster Recovery
- Replication
- Failovers
- Services and Applications
- Client Management
- Local Environment
- Networking
- Disk and Volume Management
- Reliable Backups
- Good System Administration Practices
Source: Blueprints for High Availability
Components, failures & protection mechanism
Component category | Typical failure | Fault protection
User environment | Data deletion or corruption | Disaster-recovery processes
Administration environment | Data deletion or corruption | Disaster-recovery processes
Application | Crashes, data corruption | Distributed application, failover, clustering
Middleware | Crashes, memory leaks | Clustering
(Network) infrastructure | Connection loss | Independent high-availability architecture
Operating system | Crash, device driver errors | Clustering
Hardware | Device defect | Redundant components, hot-spare disks, maintenance contracts
Physical environment | Power outage, fire, floods | UPS, backup data center
Key High Availability Design Principles
- Assume Nothing
- Remove Single Points of Failure (SPOFs)
- Plan Ahead & Design for Growth
- One Problem, One Solution
- Choose Mature, Reliable Hardware
- Choose Mature Software
- Learn from History
- Separate Your Environments
- Test Everything
- Employ Service Level Agreements
- Document Everything
- Enforce Change Control
- Watch Your Speed
- Consolidate Your Servers
- Enforce Security
- Don’t Be Cheap
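To make the "Remove Single Points of Failure" principle concrete, the sketch below compares a chain of components that must all be up with a redundant pair. It relies on the standard probability rules for independent failures, which is an assumption the slides do not make explicit; the 0.99 figures and function names are illustrative only.

```python
# Sketch: effect of redundancy on availability, assuming independent failures.
# Component availabilities below are made-up illustrative values.
from math import prod


def series_availability(parts):
    """All components must be up, so their availabilities multiply."""
    return prod(parts)


def redundant_availability(single: float, copies: int) -> float:
    """At least one of `copies` identical, independent components is up."""
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1.0 - (1.0 - single) ** copies


if __name__ == "__main__":
    chain = series_availability([0.99, 0.99, 0.99])
    pair = redundant_availability(0.99, copies=2)
    print(f"three 99% components in a chain: {chain:.4%}")
    print(f"one 99% component, duplicated:   {pair:.4%}")
```

In practice failures are often correlated (shared power, shared software bugs, human error), so real gains from redundancy are usually smaller than this idealized calculation suggests.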
QUESTIONS?