High availability

6 February 2008

High availability is a system design protocol and associated implementation that ensures a certain absolute degree of operational continuity during a given measurement period.

Availability refers to the ability of the user community to access the system, whether to submit new work, update or alter existing work, or collect the results of previous work. If a user cannot access the system, it is said to be unavailable. Generally, the term downtime is used to refer to periods when a system is unavailable.

Planned and unplanned downtime

A distinction needs to be made between planned and unplanned downtime. Typically, planned downtime is a result of maintenance that is disruptive to system operation and usually cannot be avoided with the currently installed system design. Planned downtime events might include patches to system software that require a reboot, or system configuration changes that only take effect upon a reboot. In general, planned downtime is the result of some logical, management-initiated event. Unplanned downtime events typically arise from some physical event, such as a hardware or software failure or an environmental anomaly. Examples of unplanned downtime events include power outages, failed CPU or RAM components (or possibly other failed hardware components), an over-temperature related shutdown, logically or physically severed network connections, catastrophic security breaches, or various application, middleware, and operating system failures.

Many computing sites exclude planned downtime from availability calculations, assuming, correctly or incorrectly, that planned downtime has little or no impact upon the computing user community. By excluding planned downtime, many systems can claim to have phenomenally high availability, which might give the illusion of continuous availability. Systems that exhibit truly continuous availability are comparatively rare and higher priced, and they have carefully implemented specialty designs that eliminate any single point of failure and allow online hardware, network, operating system, middleware, and application upgrades, patches, and replacements.
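To illustrate how much the exclusion of planned outages can change the reported figure, here is a minimal Python sketch. The `Outage` class, the function name, and the outage schedule (twelve 2-hour maintenance windows and one 30-minute crash) are hypothetical values chosen only for illustration.

```python
from dataclasses import dataclass

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 in a non-leap year


@dataclass
class Outage:
    minutes: float
    planned: bool


def availability_percent(outages, include_planned):
    """Return availability over one non-leap year as a percentage."""
    downtime = sum(
        o.minutes for o in outages if include_planned or not o.planned
    )
    return 100.0 * (1 - downtime / MINUTES_PER_YEAR)


# Hypothetical year: monthly 2-hour maintenance windows plus one 30-minute crash.
outages = [Outage(120, planned=True) for _ in range(12)]
outages.append(Outage(30, planned=False))

print(f"excluding planned: {availability_percent(outages, False):.3f}%")
print(f"including planned: {availability_percent(outages, True):.3f}%")
```

With these made-up numbers the site could report roughly 99.994% when planned outages are left out, while users experienced roughly 99.720%.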

Percentage calculation

Availability is usually expressed as a percentage of uptime in a given year. The number of minutes of unplanned downtime is tallied for a system over the year; the aggregate unplanned downtime is divided by the total number of minutes in a year (approximately 525,600), producing a percentage of downtime; the complement is the percentage of uptime, which is what is typically referred to as the availability of the system. Common values of availability, typically stated as a number of “nines”, for highly available systems are:

  • 99.9% ≡ 43.8 minutes/month or 8.76 hours/year (“three nines”)
  • 99.99% ≡ 4.38 minutes/month or 52.6 minutes/year (“four nines”)
  • 99.999% ≡ 0.44 minutes/month or 5.26 minutes/year (“five nines”)
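The arithmetic behind these figures can be sketched in a few lines of Python. The function names are illustrative, and a 365-day year with a month taken as one twelfth of a year is assumed.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, non-leap year


def availability_from_downtime(downtime_minutes, period_minutes=MINUTES_PER_YEAR):
    """Complement of the downtime share, as a percentage."""
    return 100.0 * (1 - downtime_minutes / period_minutes)


def downtime_budget_minutes(nines, period_minutes=MINUTES_PER_YEAR):
    """Downtime allowed by an availability of `nines` nines (3 means 99.9%)."""
    return period_minutes * 10 ** (-nines)


for n in (3, 4, 5):
    per_year = downtime_budget_minutes(n)
    print(f"{n} nines: {per_year:.2f} min/year, {per_year / 12:.2f} min/month")
```

Going the other way, `availability_from_downtime(9 * 60)` gives the figure for a year containing a single 9-hour outage.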

Uptime and availability are not synonymous. A system can be up, but not available, as in the case of a network outage.

Measurement and interpretation

How availability is measured is subject to some degree of interpretation. A system that has been up for 365 days in a non-leap year might have been eclipsed by a network failure that lasted for 9 hours during a peak usage period; the user community will see the system as unavailable, whereas the system administrator will claim 100% “uptime.” However, given the true definition of availability, the system will be approximately 99.897% available (8751 hours of available time out of 8760 hours per non-leap year). Also, systems experiencing performance problems are often deemed partially or entirely unavailable by users, while administrators might have a different (and probably incorrect, certainly in the business sense) perception. Similarly, unavailability of select application functions might go unnoticed by administrators yet be devastating to users; a true availability measure is holistic.

Availability must be measured to be determined, ideally with comprehensive monitoring tools (“instrumentation”) that are themselves highly available. If there is a lack of instrumentation, systems supporting high volume transaction processing throughout the day and night, such as credit card processing systems or telephone switches, are often inherently better monitored, at least by the users themselves, than systems which experience periodic lulls in demand.
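One simple way to measure rather than assume is to probe the service at regular intervals, including quiet periods, and report the share of successful probes. The sketch below assumes evenly spaced probes issued from the users' side of the network; the function name and the sample data are hypothetical.

```python
def availability_from_probes(results):
    """Estimate availability as the percentage of successful probes.

    `results` is a sequence of booleans, one per evenly spaced probe.
    """
    if not results:
        raise ValueError("no probe results; availability cannot be determined")
    return 100.0 * sum(results) / len(results)


# Hypothetical day of once-a-minute probes containing a 9-minute outage.
probes = [True] * (24 * 60 - 9) + [False] * 9
print(f"{availability_from_probes(probes):.3f}%")
```

The estimate is only as good as the probe: a check that merely confirms the machine is up would miss the network and performance problems described above.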

Closely related concepts

Recovery time is closely related to availability: it is the total time required for a planned outage, or the time required to fully recover from an unplanned outage. Recovery time can be infinite with certain system designs and failures, that is, full recovery is impossible. One such example is a fire or flood that destroys a data center and its systems when there is no secondary disaster recovery data center.

Another related concept is data availability, the degree to which databases and other information storage systems faithfully record and report system transactions. Information management specialists often focus separately on data availability in order to determine acceptable (or actual) data loss with various failure events. Some users can tolerate application service interruptions but cannot tolerate data loss.

A service level agreement (“SLA”) formalizes an organization’s availability objectives and requirements.

System design for high availability

Paradoxically, adding more components to an overall system design can actually undermine efforts to achieve high availability, because complex systems inherently have more potential failure points and are more difficult to implement correctly. The most highly available systems hew to a simple design pattern: a single, high quality, multi-purpose physical system with comprehensive internal redundancy running all interdependent functions, paired with a second, like system at a separate physical location. This classic design pattern is common among financial institutions, for example. The communications and computing industry has established the Service Availability Forum to foster the creation of high availability network infrastructure products, systems and services. The same basic design principle applies beyond computing in such diverse fields as nuclear power, aeronautics, and medical care.
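The effect of component count can be made concrete with the standard probability rules for series and redundant arrangements. This sketch assumes that components fail independently, which real systems only approximate; the 99.9% per-component figure and the function names are hypothetical.

```python
from math import prod


def series_availability(components):
    """Every component must be up, so the availabilities multiply."""
    return prod(components)


def redundant_availability(components):
    """Service is up as long as at least one redundant component is up."""
    return 1 - prod(1 - a for a in components)


# Hypothetical figures: each component is 99.9% available on its own.
chain_of_five = series_availability([0.999] * 5)
mirrored_pair = redundant_availability([0.999, 0.999])

print(f"five components in series: {chain_of_five:.4%}")
print(f"two redundant systems:     {mirrored_pair:.4%}")
```

Under these assumptions a chain of five interdependent components is less available than any single one of them, while a redundant pair is far more available, which is consistent with the paired-system pattern described above.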

High-availability clusters

High-availability clusters (also known as HA clusters or failover clusters) are computer clusters implemented primarily to improve the availability of the services the cluster provides. They operate by having redundant computers or nodes which are used to provide service when system components fail. Normally, if a server with a particular application crashes, the application will be unavailable until someone fixes the crashed server. HA clustering remedies this situation by detecting hardware or software faults and immediately restarting the application on another system without requiring administrative intervention, a process known as failover. As part of this process, the clustering software may configure the node before starting the application on it. For example, appropriate filesystems may need to be imported and mounted, network hardware may have to be configured, and some supporting applications may need to be running as well.
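The failover sequence described above can be sketched as a toy in-memory model. It is not the behavior of any particular clustering product: the `Node` class, `prepare_node`, and `fail_over` are hypothetical names, and fault detection is simulated by setting a flag.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    healthy: bool = True
    services: set = field(default_factory=set)


def prepare_node(node, service):
    """Placeholder for node configuration (mount filesystems, set up network)."""
    print(f"configuring {node.name} for {service}")


def fail_over(nodes):
    """Restart services from unhealthy nodes on the first healthy node."""
    healthy = [n for n in nodes if n.healthy]
    if not healthy:
        raise RuntimeError("no healthy node left to take over")
    target = healthy[0]
    for node in nodes:
        if node.healthy:
            continue
        for service in sorted(node.services):
            prepare_node(target, service)  # configure before starting
            target.services.add(service)
        node.services.clear()


primary = Node("node-a", services={"database"})
standby = Node("node-b")
primary.healthy = False  # simulate a detected fault
fail_over([primary, standby])
print(standby.services)
```

A real implementation would also have to make sure the failed node has truly stopped before the service starts elsewhere.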

HA clusters are often used for key databases, file sharing on a network, business applications, and customer services such as electronic commerce websites.

HA cluster implementations attempt to build redundancy into a cluster to eliminate single points of failure, including multiple network connections and data storage which is multiply connected via storage area networks.

HA clusters usually use a private heartbeat network connection to monitor the health and status of each node in the cluster. One subtle but serious condition that all clustering software must be able to handle is split-brain. Split-brain occurs when all of the private links go down simultaneously, but the cluster nodes are still running. If that happens, each node in the cluster may mistakenly decide that every other node has gone down and attempt to start services that other nodes are still running. Having duplicate instances of services may cause data corruption on the shared storage.
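The text above does not say how split-brain is handled. One widely used approach, shown here only as an illustration, is a majority quorum: a node starts services only if it and the peers it can still reach form a strict majority of the cluster. The function names below are hypothetical.

```python
def has_quorum(reachable_nodes, cluster_size):
    """True if the reachable nodes form a strict majority of the cluster."""
    return reachable_nodes > cluster_size // 2


def may_start_services(peers_reachable, cluster_size):
    # Count this node itself along with the peers it can still contact.
    return has_quorum(peers_reachable + 1, cluster_size)


# Hypothetical three-node cluster whose private links fail, isolating one node.
print(may_start_services(peers_reachable=0, cluster_size=3))  # isolated node
print(may_start_services(peers_reachable=1, cluster_size=3))  # remaining pair
```

Because at most one partition can hold a strict majority, at most one side starts services. In a two-node cluster neither isolated node has a majority, so such clusters need an additional tie-breaker.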
