High availability is a system design protocol and associated implementation that ensures a prearranged level of operational continuity during a given measurement period.
Availability refers to the ability of the user community to access the system, whether to submit new work, update or alter existing work, or collect the results of previous work. If a user cannot access the system, it is said to be unavailable. Generally, the term downtime is used to refer to periods when a system is unavailable.
Planned and unplanned downtime
A distinction needs to be made between planned downtime and unplanned downtime. Planned downtime is typically the result of maintenance that is disruptive to system operation and cannot be avoided with the currently installed system design. Planned downtime events might include patches to system software that require a reboot, or system configuration changes that only take effect upon a reboot. Planned downtime is usually the result of some logical, management-initiated event. Unplanned downtime events typically arise from some physical event, such as a hardware failure or environmental anomaly. Examples of unplanned downtime events include power outages, failed CPU or RAM components (or other failed hardware components), over-temperature shutdowns, logically or physically severed network connections, catastrophic security breaches, and various application, middleware, and operating system failures.
Many computing sites exclude planned downtime from availability calculations, assuming, correctly or incorrectly, that planned downtime has little or no impact upon the computing user community. By excluding planned downtime, many systems can claim to have phenomenally high availability, which might give the illusion of continuous availability. Systems that exhibit truly continuous availability are comparatively rare and higher priced, and they have carefully implemented specialty designs that eliminate any single point of failure and allow online hardware, network, operating system, middleware, and application upgrades, patches, and replacements.
Availability is usually expressed as a percentage of uptime in a given year. The minutes of unplanned downtime for a system are tallied over the year and divided by the total number of minutes in a year (approximately 525,600), giving the fraction of downtime, which is expressed as a percentage; the complement is the percentage of uptime, which is what is typically referred to as the availability of the system. Common values of availability, typically stated as a number of “nines”, for highly available systems are:
- 99.9% ≡ 43.8 minutes/month or 8.76 hours/year (“three nines”)
- 99.99% ≡ 4.38 minutes/month or 52.6 minutes/year (“four nines”)
- 99.999% ≡ 0.44 minutes/month or 5.26 minutes/year (“five nines”)
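The figures in the list above follow directly from the definition. The sketch below is a minimal illustration, assuming a 365-day year of 525,600 minutes and a month taken as one twelfth of a year; the function name `allowed_downtime_minutes` is hypothetical and not taken from any library.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def allowed_downtime_minutes(availability_percent: float) -> tuple[float, float]:
    """Return the permitted downtime as (minutes per year, minutes per month)."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    per_year = unavailable_fraction * MINUTES_PER_YEAR
    per_month = per_year / 12  # assumption: a month is one twelfth of a year
    return per_year, per_month


for pct in (99.9, 99.99, 99.999):
    per_year, per_month = allowed_downtime_minutes(pct)
    print(f"{pct}%: {per_year:.2f} min/year, {per_month:.2f} min/month")
```

For 99.9% this gives about 525.6 minutes per year, which is the 8.76 hours quoted in the list.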
Uptime and availability are not synonymous. A system can be up but not available, as in the case of a network outage.
Measurement and interpretation
How availability is measured is subject to some degree of interpretation. A system that has been up for 365 days in a non-leap year might have been cut off by a network failure that lasted for 9 hours during a peak usage period; the user community will see the system as unavailable, whereas the system administrator will claim 100% “uptime.” However, given the true definition of availability, the system was approximately 99.897% available (8751 hours of available time out of 8760 hours per non-leap year). Also, systems experiencing performance problems are often deemed partially or entirely unavailable by users, while administrators might have a different (and probably incorrect, certainly in the business sense) perception. Similarly, unavailability of select application functions might go unnoticed by administrators yet be devastating to users. A true availability measure is holistic.
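The 99.897% figure can be checked with a few lines of arithmetic. This is only a sketch; the function name `availability_percent` is hypothetical, and the measurement period is assumed to be one non-leap year expressed in hours.

```python
def availability_percent(total_hours: float, unavailable_hours: float) -> float:
    """Availability as the percentage of the period in which users had access."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    if not 0 <= unavailable_hours <= total_hours:
        raise ValueError("unavailable_hours must lie within the period")
    return 100.0 * (total_hours - unavailable_hours) / total_hours


hours_in_year = 365 * 24  # 8760 hours in a non-leap year

# Administrator's view: the server itself never went down.
print(availability_percent(hours_in_year, 0))            # 100.0
# Users' view: 9 hours were lost to the network failure.
print(round(availability_percent(hours_in_year, 9), 3))  # 99.897
```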
Availability must be measured to be determined, ideally with comprehensive monitoring tools (“instrumentation”) that are themselves highly available. Where instrumentation is lacking, systems supporting high volume transaction processing throughout the day and night, such as credit card processing systems or telephone switches, are often inherently better monitored, at least by the users themselves, than systems that experience periodic lulls in demand.
Closely related concepts
Recovery time, a concept closely related to availability, is the total time required for a planned outage or the time required to fully recover from an unplanned outage. Recovery time could be infinite with certain system designs and failures; that is, full recovery is impossible. One such example is a fire or flood that destroys a data center and its systems when there is no secondary disaster recovery data center.
Another related concept is data availability: the degree to which databases and other information storage systems faithfully record and report system transactions. Information management specialists often focus separately on data availability in order to determine acceptable (or actual) data loss with various failure events. Some users can tolerate application service interruptions but cannot tolerate data loss.
A service level agreement (“SLA”) formalizes an organization’s availability objectives and requirements.
System design for high availability
Paradoxically, adding more components to an overall system design can actually undermine efforts to achieve high availability. This is because complex systems inherently have more potential failure points and are more difficult to implement correctly. The most highly available systems hew to a simple design pattern: a single, high quality, multi-purpose physical system with comprehensive internal redundancy running all interdependent functions, paired with a second, like system at a separate physical location. This classic design pattern is common among financial institutions, for example. The communications and computing industry has established the Service Availability Forum to foster the creation of high availability network infrastructure products, systems and services. The same basic design principle applies beyond computing in such diverse fields as nuclear power, aeronautics, and medical care.
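The effect of adding components can be made concrete with standard reliability arithmetic. The sketch below assumes that component failures are statistically independent, which the text does not state and which real systems often violate; the function names are hypothetical. A chain in which every component must work has the product of the individual availabilities, while a redundant pair is unavailable only when both members are down.

```python
from math import prod


def series_availability(components: list[float]) -> float:
    """Availability of a chain in which every component must be working."""
    return prod(components)


def redundant_availability(replicas: list[float]) -> float:
    """Availability of replicas where any single working replica suffices."""
    return 1.0 - prod(1.0 - a for a in replicas)


# Five interdependent components, each 99.9% available (illustrative values).
chain = series_availability([0.999] * 5)
# Two like systems, each 99.9% available, either of which can serve users.
pair = redundant_availability([0.999, 0.999])

print(f"chain of five: {chain:.5f}")
print(f"redundant pair: {pair:.6f}")
```

Under these assumptions the five-component chain is less available than any single component in it, while the paired design is more available than either of its members.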
High-availability clusters (also known as HA clusters or failover clusters) are computer clusters that are implemented primarily for the purpose of improving the availability of the services that the cluster provides. They operate by having redundant computers or nodes that provide service when system components fail. Normally, if a server with a particular application crashes, the application will be unavailable until someone fixes the crashed server. HA clustering remedies this situation by detecting hardware and software faults and immediately restarting the application on another system without requiring administrative intervention, a process known as failover. As part of this process, clustering software may configure the node before starting the application on it. For example, appropriate filesystems may need to be imported and mounted, network hardware may have to be configured, and some supporting applications may need to be running as well.
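The failover sequence described above can be sketched as a small simulation. Nothing here calls real clustering software: the `Node` class, the `fail_over` function, and the application, filesystem, address, and dependency names are all hypothetical, and the "actions" are only recorded as strings rather than executed.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A cluster member; `actions` records what would be done on it."""
    name: str
    healthy: bool = True
    actions: list = field(default_factory=list)


def fail_over(app, nodes, failed, filesystems, addresses, dependencies):
    """Restart `app` on the first healthy node other than `failed`."""
    for node in nodes:
        if node is failed or not node.healthy:
            continue
        # Configure the node before starting the application on it.
        node.actions += [f"mount {fs}" for fs in filesystems]
        node.actions += [f"configure address {addr}" for addr in addresses]
        node.actions += [f"start dependency {dep}" for dep in dependencies]
        node.actions.append(f"start {app}")
        return node
    raise RuntimeError("no healthy node available for failover")


primary, standby = Node("node1"), Node("node2")
primary.healthy = False  # simulated crash detected by the cluster software
target = fail_over("orders-db", [primary, standby], primary,
                   filesystems=["/data"], addresses=["10.0.0.50"],
                   dependencies=["log-shipper"])
print(target.name, target.actions)
```

The ordering matters in this sketch: storage, network configuration, and supporting applications are prepared first, and the application itself is started last.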
HA clusters are often used for key databases, file sharing on a network, business applications, and customer services such as electronic commerce websites.
HA cluster implementations attempt to build redundancy into a cluster to eliminate single points of failure, including multiple network connections and data storage that is redundantly connected via storage area networks.
HA clusters usually use a private heartbeat network connection to monitor the health and status of each node in the cluster. One subtle but serious condition that all clustering software must be able to handle is split-brain. Split-brain occurs when all of the private links go down simultaneously, but the cluster nodes are still running. If that happens, each node in the cluster may mistakenly decide that every other node has gone down and attempt to start services that other nodes are still running. Having duplicate instances of services may cause data corruption on the shared storage.
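The split-brain condition can be reproduced in a toy simulation. The sketch below assumes a deliberately naive policy in which a node takes over the service as soon as every peer's heartbeat is older than a timeout; the class, the function, and the timeout value are hypothetical, and real clustering software guards against exactly this outcome.

```python
from dataclasses import dataclass, field

HEARTBEAT_TIMEOUT = 3.0  # seconds; illustrative value only


@dataclass
class ClusterNode:
    name: str
    last_heard: dict = field(default_factory=dict)  # peer name -> timestamp
    running_service: bool = False

    def receive_heartbeat(self, peer: str, now: float) -> None:
        self.last_heard[peer] = now

    def peers_presumed_down(self, now: float) -> list:
        return [p for p, t in self.last_heard.items()
                if now - t > HEARTBEAT_TIMEOUT]

    def react(self, now: float) -> None:
        # Naive policy: if every known peer looks down, take over the service.
        down = self.peers_presumed_down(now)
        if self.last_heard and len(down) == len(self.last_heard):
            self.running_service = True


def simulate_split_brain() -> list:
    a = ClusterNode("a", running_service=True)  # currently serving
    b = ClusterNode("b")                        # standby
    # t = 0: heartbeats are exchanged normally over the private links.
    a.receive_heartbeat("b", 0.0)
    b.receive_heartbeat("a", 0.0)
    # The private links then fail; both nodes keep running but hear nothing.
    now = 10.0
    a.react(now)
    b.react(now)
    return [n.name for n in (a, b) if n.running_service]


print(simulate_split_brain())  # ['a', 'b']
```

Both nodes end up running the service even though neither has failed, which is the duplicate-instance situation that can corrupt shared storage.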