The data center reliability decision tree outlined below was contributed to BICSI for inclusion in the recently released ANSI/BICSI 002-2011 Data Center Standard. It implements the following methodology:
- Determine the Data Center Operational Requirements:
This step identifies whether opportunities exist for scheduled (planned) downtime to perform regular maintenance on critical facility systems. Little or no opportunity for planned downtime increases the requirement for redundancy.
- Determine the Data Center Availability Requirements:
This step identifies the expected availability of the IT systems. It refers only to scheduled uptime; scheduled downtime for planned maintenance should not be included when expressing the availability requirement. The higher the expectation that the IT systems can perform their intended functions, the greater the requirement for reliability and redundancy.
- Determine the Impact of Downtime:
This step identifies how significantly the business is affected by an event that leaves the IT systems unable to perform their intended functions. The greater the impact on the business, the higher the requirement for redundancy and reliability.
- Determine the required reliability "Class":
The criteria established in the preceding steps determine the required reliability Class, which aligns the facility systems with the expectation that the data center services can perform their intended functions.
It should be noted that the current release of the ANSI/BICSI 002-2011 Standard does not incorporate Step 2, "Data Center Availability Requirements", into the determination of the appropriate reliability Class. This step was removed from the standard just prior to publication. The criteria for the "Availability Requirements" included below have been developed by Isaak Technologies and should not be considered part of the current standard. Isaak Technologies is in the process of submitting the Availability Requirements criteria for inclusion in the ANSI/BICSI 002-2011 standard, following the normal standards development process with the content being vetted by the standards committee. Isaak Technologies, as a member of the standards committee, continues to work with the standards body to improve and update the standards content as required.
A Reliability Selector Tool is available as a free download to help you quickly determine the appropriate Classification for your data center. Please contact us if you would like to discuss in greater detail your particular data center requirements and how they impact the required level of reliability and redundancy.
Step 1: Determine Operational Requirements
The first step in defining the risk level associated with a mission-critical IT facility is to define the facility's intended operational requirements. Sufficient resources must be available to achieve an acceptable level of quality over a given time period. IT functions that carry a high quality expectation over a longer time period are by definition more critical than those that require fewer resources, tolerate lower quality, or are needed over a shorter time period. The key element to consider here is time, which is quantified as "windows of opportunity" for testing and maintenance. Thus, to define the operational requirements for a critical IT facility, assign one of five facility operational levels, as defined in the table below.
Note that the term "shutdown" means that operation has ceased; the equipment is not able to perform its mission during that time. Shutdown does not refer to the loss of system components if they do not disrupt the ability of the system to continue its mission.
Operational Classifications

| Description | Annual Allowable Maintenance Hours | Operational Level |
| --- | --- | --- |
| Functions are operational less than 24 hours a day and less than 7 days a week. Scheduled maintenance "down" time is available during working hours and off hours | > 400 | 0 |
| Functions are operational less than 24 hours a day and less than 7 days a week. Scheduled maintenance "down" time is available during working hours and off hours | 100 - 400 | 1 |
| Functions are operational up to 24 hours a day, up to 7 days a week, and up to 50 weeks per year. Scheduled maintenance "down" time is available during working hours and off hours | 50 - 99 | 2 |
| Functions are operational 24 hours a day, 7 days a week for 50 weeks or more. No scheduled maintenance "down" time is available during working hours | 0 - 49 | 3 |
| Functions are operational 24 hours a day, 7 days a week for 52 weeks each year. No scheduled maintenance "down" time is available | 0 | 4 |
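
The Operational Classifications table can be read as a simple threshold lookup. The Python sketch below is illustrative only and is not part of the standard. The function name `operational_level` is hypothetical. It assumes that exactly zero allowable maintenance hours maps to level 4 (the table lists 0 under both levels 3 and 4) and that fractional values falling between the listed ranges (for example 99.5 hours) are assigned to the stricter level.

```python
def operational_level(annual_maintenance_hours: float) -> int:
    """Map annual allowable maintenance hours to an operational level (0-4).

    Thresholds follow the Operational Classifications table. Boundary
    handling is an assumption made for this sketch, not a rule from the
    standard.
    """
    if annual_maintenance_hours < 0:
        raise ValueError("maintenance hours cannot be negative")
    if annual_maintenance_hours == 0:
        return 4  # assumption: no maintenance window at all means level 4
    if annual_maintenance_hours < 50:
        return 3  # table range 0 - 49
    if annual_maintenance_hours < 100:
        return 2  # table range 50 - 99
    if annual_maintenance_hours <= 400:
        return 1  # table range 100 - 400
    return 0      # table range > 400


if __name__ == "__main__":
    for hours in (600, 250, 75, 20, 0):
        print(hours, "hours ->", "level", operational_level(hours))
```

The hours value alone does not capture the qualitative descriptions in the table (for example, whether downtime is available during working hours), so the result is best treated as a starting point for discussion.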
Step 2: Determine Availability Requirements
The second step in the risk management process is to identify the required facility availability, which is defined as the total uptime that the facility must support. Facility availability refers only to scheduled uptime; that is, the time during which the IT functions are actually expected to run.
We express facility availability in terms of an Availability Ranking. The rank for a given facility is found at the intersection of the intended maximum annual downtime and the operational level previously discussed. Since a function or process that has a high availability requirement with a low operational level has less risk associated with it than a similar function with a higher operational level, we use this step to adjust the overall availability to reflect the true functional requirement. This step results in one of five Availability Rankings to be used in Step 4. The Availability Rankings can be analyzed either by expressing the acceptable range of unplanned downtime in minutes or by expressing the acceptable availability in terms of 9's.
Note: Use either table below to determine the Availability Ranking.
Availability Ranking by Allowable Maximum Annual Downtime (Minutes)

| Operational Level | > 5000 | 500 to 5000 | 50 to 500 | 5 to 50 | 0.5 to 5.0 |
| --- | --- | --- | --- | --- | --- |
| 0 | 0 | 0 | 1 | 2 | 2 |
| 1 | 0 | 1 | 2 | 2 | 2 |
| 2 | 1 | 2 | 2 | 2 | 3 |
| 3 | 2 | 2 | 2 | 3 | 4 |
| 4 | 2 | 3 | 3 | 4 | 4 |
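
The downtime-based ranking table above is a two-dimensional lookup: the operational level selects the row and the allowable annual downtime selects the column. The following Python sketch shows one way to encode it. The names `AVAILABILITY_RANK`, `downtime_band`, and `availability_rank` are hypothetical. Because adjacent column ranges share their boundary values (500 appears in two columns, for instance), the sketch assumes a boundary value belongs to the column with more allowable downtime, and it assumes anything at or below 5 minutes, including values under 0.5, falls in the last column.

```python
# Rows: operational level 0-4.
# Columns: downtime bands, ordered from most to least allowable downtime
# (> 5000, 500 to 5000, 50 to 500, 5 to 50, 0.5 to 5.0 minutes per year).
AVAILABILITY_RANK = [
    [0, 0, 1, 2, 2],  # operational level 0
    [0, 1, 2, 2, 2],  # operational level 1
    [1, 2, 2, 2, 3],  # operational level 2
    [2, 2, 2, 3, 4],  # operational level 3
    [2, 3, 3, 4, 4],  # operational level 4
]


def downtime_band(minutes: float) -> int:
    """Return the column index for an allowable annual downtime in minutes."""
    if minutes < 0:
        raise ValueError("downtime cannot be negative")
    if minutes > 5000:
        return 0
    if minutes > 500:   # assumption: exactly 5000 falls in "500 to 5000"
        return 1
    if minutes > 50:    # assumption: exactly 500 falls in "50 to 500"
        return 2
    if minutes > 5:     # assumption: exactly 50 falls in "5 to 50"
        return 3
    return 4            # assumption: 5 minutes or less, including < 0.5


def availability_rank(operational_level: int, downtime_minutes: float) -> int:
    """Look up the Availability Ranking (0-4) from the table."""
    if operational_level not in (0, 1, 2, 3, 4):
        raise ValueError("operational level must be an integer from 0 to 4")
    return AVAILABILITY_RANK[operational_level][downtime_band(downtime_minutes)]


if __name__ == "__main__":
    # A level 3 facility that can tolerate 30 minutes of downtime per year.
    print(availability_rank(3, 30))
```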
Availability Ranking by Allowable Availability (Expressed as 9's)

| Operational Level | < 99% | 99% to 99.9% | 99.9% to 99.99% | 99.99% to 99.999% | 99.999% to 99.9999% |
| --- | --- | --- | --- | --- | --- |
| 0 | 0 | 0 | 1 | 2 | 2 |
| 1 | 0 | 1 | 2 | 2 | 2 |
| 2 | 1 | 2 | 2 | 2 | 3 |
| 3 | 2 | 2 | 2 | 3 | 4 |
| 4 | 2 | 3 | 3 | 4 | 4 |
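
The two ranking tables carry the same values because an availability percentage and an annual downtime allowance are two expressions of the same quantity. The Python sketch below shows the arithmetic. It is illustrative only: the function name `allowed_downtime_minutes` is hypothetical, and the default basis of a full 365-day year (525,600 minutes) is an assumption. Since availability here refers to scheduled uptime only, a facility with fewer scheduled hours would pass its own `scheduled_minutes` value.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; assumes a 365-day year


def allowed_downtime_minutes(availability_percent: float,
                             scheduled_minutes: float = MINUTES_PER_YEAR) -> float:
    """Convert an availability percentage into allowable downtime minutes.

    scheduled_minutes is the time the IT functions are expected to run;
    it defaults to the whole year, which is an assumption of this sketch.
    """
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100 percent")
    return scheduled_minutes * (1 - availability_percent / 100)


if __name__ == "__main__":
    for pct in (99, 99.9, 99.99, 99.999, 99.9999):
        minutes = allowed_downtime_minutes(pct)
        print(f"{pct}% -> {minutes:.2f} minutes per year")
```

On a full-year basis, 99% availability corresponds to roughly 5,256 minutes of downtime and each additional 9 divides that figure by ten, which is why the column boundaries of the minutes table (5000, 500, 50, 5, 0.5) line up approximately with the percentage boundaries above.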
Step 3: Determine Impact of Downtime
The third step in the risk management process is to identify the impact or consequences of downtime. This is an essential component of risk management because not all downtime has the same impact on mission-critical facilities. Identifying the impact of downtime on mission-critical functions helps determine the tactics to deploy to mitigate downtime risk. As shown in the table below, there are five impact classifications, each associated with a specific impact scope.
| Description | Impact of Downtime |
| --- | --- |
| Local in scope, affecting only a single function or operation, resulting in a minor disruption or delay in achieving non-critical organizational objectives. | Sub-local |
| Local in scope, affecting only a single site, or resulting in a minor disruption or delay in achieving key organizational objectives. | Local |
| Regional in scope, affecting a portion of the enterprise (although not in its entirety) or resulting in a moderate disruption or delay in achieving key organizational objectives. | Regional |
| Multi-regional in scope, affecting a major portion of the enterprise (although not in its entirety) or resulting in a major disruption or delay in achieving key organizational objectives. | Multiregional |
| Affecting the quality of service delivery across the entire enterprise, or resulting in a significant disruption or delay in achieving key organizational objectives. | Enterprise |
Step 4: Determine Facility Reliability Class
The final step in the process defined in this section is to combine the three previously identified factors to arrive at a usable expression of availability. This expression of availability is used as a guide to determine the architectural and engineering features needed to appropriately support critical IT functions. Since the operational level is already subsumed within the availability ranking (as explained in Step 2), the remaining task is to cross-reference the availability ranking against the impact of downtime and arrive at the appropriate facility reliability Class. The table below shows how this is done:
| Impact of Downtime | Availability Rank 0 | Availability Rank 1 | Availability Rank 2 | Availability Rank 3 | Availability Rank 4 |
| --- | --- | --- | --- | --- | --- |
| Sub-local | Class F0 | Class F0 | Class F1 | Class F2 | Class F2 |
| Local | Class F0 | Class F1 | Class F2 | Class F3 | Class F3 |
| Regional | Class F1 | Class F2 | Class F2 | Class F3 | Class F3 |
| Multiregional | Class F1 | Class F2 | Class F3 | Class F3 | Class F4 |
| Enterprise | Class F1 | Class F2 | Class F3 | Class F4 | Class F4 |
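
The class table above can likewise be encoded as a lookup keyed by impact of downtime and availability rank. The Python sketch below is a minimal illustration, not an official implementation of the standard; the names `RELIABILITY_CLASS` and `reliability_class` are hypothetical, and the availability rank is assumed to have been determined already in Step 2.

```python
# Keys: impact of downtime. List index: availability rank 0-4.
RELIABILITY_CLASS = {
    "Sub-local":     ["F0", "F0", "F1", "F2", "F2"],
    "Local":         ["F0", "F1", "F2", "F3", "F3"],
    "Regional":      ["F1", "F2", "F2", "F3", "F3"],
    "Multiregional": ["F1", "F2", "F3", "F3", "F4"],
    "Enterprise":    ["F1", "F2", "F3", "F4", "F4"],
}


def reliability_class(impact: str, availability_rank: int) -> str:
    """Return the facility reliability Class for an impact and rank."""
    if impact not in RELIABILITY_CLASS:
        raise ValueError(f"unknown impact of downtime: {impact!r}")
    if not isinstance(availability_rank, int) or not 0 <= availability_rank <= 4:
        raise ValueError("availability rank must be an integer from 0 to 4")
    return RELIABILITY_CLASS[impact][availability_rank]


if __name__ == "__main__":
    # A facility with availability rank 3 whose downtime has regional impact.
    print("Class", reliability_class("Regional", 3))
```

Keeping the table as data rather than as nested conditionals makes it easy to compare against the published matrix and to update if the criteria change.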