Case Study
We will determine a reliability figure for three very basic SAN architectures. The starting point of our study is the network storage requirements.
Network Storage Requirements
We want networked storage that is accessible to one server. Later, this storage will be accessible to other servers. The server is already in place, and has been designed to sustain single component hardware failures (with dual host bus adapters (HBAs), for example). Data on this storage must be mirrored, and the storage access must also stand up to hardware failures. The cost of the storage system must be reasonable, while still providing good performance.
Our first temptation might be to decide which components to use: switches, hubs, Sun StorEdge T3 arrays, Sun StorEdge A5x00 arrays, and so on. However, a more prudent approach would be to determine the appropriate architecture in terms of its resistance to hardware failures, cost, and performance, leaving the selection of specific components for a later stage.
NOTE
For this case study, the focus is on storage architecture redundancy and reliability; cost and performance issues are not addressed.
Architecture 1
FIGURE 2 Architecture 1 Block Diagram
Architecture 1 provides the basic storage necessities we are looking for with the following advantages and disadvantages:
Advantages:
Storage is accessible if one of the links is down.
Storage A is mirrored onto B.
Other servers can be connected to the concentrator to access the storage.
Disadvantages:
If the concentrator fails, we no longer have access to the storage. This concentrator is a single point of failure (SPOF).
Architecture 2
FIGURE 3 Architecture 2 Block Diagram
Architecture 2 has been improved to take into account the previous SPOF. A concentrator has been added, and now the storage configuration is redundant and the requirements are satisfied with the following advantages:
If any links or components go down, storage is still accessible (resilient to hardware failures).
Data is mirrored (Disk A <-> Disk B).
Other servers can be connected to both concentrators to access the storage space.
Architecture 3
FIGURE 4 Architecture 3 Block Diagram
Architecture 3 seems very close to architecture 2. The main difference resides in the fact that Disk A and Disk B have only one data path. Disk A is still mirrored to Disk B, as required.
This architecture has all the advantages of the previous architectures with the following differences:
Disk A can only be accessed through Link C, and Disk B only through Link D.
There is no data multipathing software layer, which results in easier administration and easier troubleshooting.
In some sense it seems we are losing a level of redundancy in architecture 3. To appreciate the differences between architectures 2 and 3, we will use block diagram analysis to determine and compare their reliability values.
Determining Redundancy
We first list an inventory of components involved in the three architectures as shown in the first column of the following table. Next, we analyze the three architectures for redundancy.
| Failing Component (first failure) | Architecture 1: Is the System OK? | Architectures 2 and 3: Is the System OK? |
| --- | --- | --- |
| HBA 1 | Yes | Yes |
| HBA 2 | Yes | Yes |
| Link A | Yes | Yes |
| Link B | Yes | Yes |
| Concentrator 1 | No | Yes |
| Concentrator 2 | n/a | Yes |
| Link C | Yes | Yes |
| Link D | Yes | Yes |
| Disk A | Yes | Yes |
| Disk B | Yes | Yes |
| Total number of redundant components | 8 | 10 |
Consequently, we see that Architectures 2 and 3 satisfy our objectives for redundancy, while Architecture 1 does not.
It is possible to obtain an objective difference between architectures 2 and 3 by studying their respective reliability. We will find that, although both architectures 2 and 3 are fully redundant, one is more reliable than the other.
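The redundancy analysis in the table above can be sketched programmatically. In the sketch below, each architecture is described by its minimal path sets from server to storage; the path sets are assumptions read off the block diagrams (which are not reproduced here), and the component names follow the table. A component is a single point of failure exactly when it appears in every path.

```python
def spofs(path_sets):
    """Return the components whose single failure breaks every
    server-to-storage path: exactly those that appear in every
    minimal path set."""
    components = set().union(*path_sets)
    return {c for c in components if all(c in p for p in path_sets)}

# Architecture 1: one concentrator shared by both front-end chains
# (path sets assumed from the block diagram).
arch1 = [
    {"HBA 1", "Link A", "Concentrator 1", "Link C", "Disk A"},
    {"HBA 1", "Link A", "Concentrator 1", "Link D", "Disk B"},
    {"HBA 2", "Link B", "Concentrator 1", "Link C", "Disk A"},
    {"HBA 2", "Link B", "Concentrator 1", "Link D", "Disk B"},
]

# Architecture 3: two fully independent chains, one per disk.
arch3 = [
    {"HBA 1", "Link A", "Concentrator 1", "Link C", "Disk A"},
    {"HBA 2", "Link B", "Concentrator 2", "Link D", "Disk B"},
]

print(spofs(arch1))  # {'Concentrator 1'}
print(spofs(arch3))  # set()
```

The result matches the table: only Architecture 1 has a SPOF, its single concentrator.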
Determining Reliability
Using the reliability formulas discussed earlier, we can determine which architecture has the highest reliability value. For the purpose of this article, we will use the sample MTBF values (as obtained from the manufacturers) and the derived AFR values shown in the table below:
TABLE 1 Component Inventory

| Component | AFR Variable | Sample MTBF Values (hours) | AFR |
| --- | --- | --- | --- |
| HBA 1 | H | 800,000 | 0.011 |
| HBA 2 | H | 800,000 | 0.011 |
| Link A | L | 400,000 | 0.022 |
| Link B | L | 400,000 | 0.022 |
| Concentrator 1 | C | 580,000 | 0.0151 |
| Concentrator 2 | C | 580,000 | 0.0151 |
| Link C | L | 400,000 | 0.022 |
| Link D | L | 400,000 | 0.022 |
| Disk A | D | 1,000,000 | 0.0088 |
| Disk B | D | 1,000,000 | 0.0088 |
NOTE
The example MTBF values were taken from real network storage component statistics. However, such values vary greatly, and these numbers are given here purely for illustration.
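The AFR column in TABLE 1 follows directly from the MTBF column. A minimal sketch, assuming the approximation AFR = 8760 / MTBF used in this article (which holds while the resulting AFR is much smaller than 1):

```python
HOURS_PER_YEAR = 8760  # 24 hours * 365 days

def afr(mtbf_hours):
    """Annual failure rate implied by an MTBF, using the approximation
    AFR = 8760 / MTBF (valid while the resulting AFR is much less than 1)."""
    return HOURS_PER_YEAR / mtbf_hours

print(afr(800_000))    # ≈ 0.011  (HBA)
print(afr(580_000))    # ≈ 0.0151 (concentrator)
print(afr(1_000_000))  # ≈ 0.0088 (disk)
```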
Architecture 1
FIGURE 5 Architecture 1 Reliability Block Diagram
Having the rate of failure of each individual component, we can obtain the system's annual failure rate AFR1, and consequently the system reliability and system MTBF values. Using the block diagram (FIGURE 5), it is easy to identify which components are configured redundantly, and which are not. The following formula is derived using the block diagram analysis discussed earlier: the AFR of each group of redundant components is raised to a power equal to the number of redundant components in the group, the AFR of each non-redundant component is taken in series as-is (in this case, the concentrator C is the only non-redundant component, so C * 1 = C), and finally the resulting AFR terms are summed.
The formula for this architecture:
AFR1 = (H + L)² + C + L² + D²
Sample values applied:
AFR1 = (0.011 + 0.022)² + 0.0151 + 0.022² + 0.0088² = 0.0167
Using the AFR value, we determine the annual reliability R1 of the system:
R1 = 1 - AFR1
R1 = 1 - 0.0167 = 0.9833, or 98.33%
Using the AFR value, the following system MTBF value is derived:
System MTBF = 8760/AFR1
System MTBF = 8760 / 0.0167 = 524,551 hours
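The chain of calculations above (AFR1, then R1, then system MTBF) can be reproduced directly from the table values. A minimal sketch using the sample AFRs from TABLE 1; the small difference from the article's MTBF figure is rounding, since the article divides by the rounded AFR of 0.0167:

```python
# Sample annual failure rates from TABLE 1.
H = 0.011   # HBA
L = 0.022   # link
C = 0.0151  # concentrator
D = 0.0088  # disk

HOURS_PER_YEAR = 8760

# Architecture 1: redundant pairs are squared,
# the lone concentrator is added in series.
afr1 = (H + L)**2 + C + L**2 + D**2
r1 = 1 - afr1                  # annual reliability
mtbf1 = HOURS_PER_YEAR / afr1  # system MTBF, in hours

print(afr1)   # ≈ 0.0167
print(r1)     # ≈ 0.9833
print(mtbf1)  # ≈ 523,000 hours (the article's 524,551 uses the rounded AFR)
```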
Architecture 2
FIGURE 6 Architecture 2 Reliability Block Diagram
This architecture has a different configuration, and the resulting formula is derived using the block diagram analysis.
The formula for this architecture:
AFR2 = (H + L + C + L)² + D²
Sample values applied:
AFR2 = (0.011 + 0.022 + 0.0151 + 0.022)² + 0.0088² = 0.005
Using the AFR, determine the annual reliability R2 of the system:
R2 = 1 - AFR2
R2 = 1 - 0.005 = 0.995, or 99.5%
Using the AFR value, the following system MTBF value is derived:
System MTBF = 8760 / AFR2
System MTBF = 8760 / 0.005 = 1,752,000 hours
Architecture 3
FIGURE 7 Architecture 3 Reliability Block Diagram
Architecture 3 results in yet another block diagram calculation.
The formula for this architecture:
AFR3 = (H + L + C + L + D)²
Sample values applied:
AFR3 = (0.011 + 0.022 + 0.0151 + 0.022 + 0.0088)² = 0.0062
Using the AFR, determine the annual reliability R3 of the system.
The formula:
R3 = 1 - AFR3
Numbers applied:
R3 = 1 - 0.0062 = 0.9938, or 99.38%
Using the AFR value, the following system MTBF value is derived:
System MTBF = 8760 / AFR3
System MTBF = 8760 / 0.0062 = 1,412,903 hours
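Putting the three formulas side by side makes the comparison in the conclusion easy to reproduce. A sketch using the same sample AFRs from TABLE 1:

```python
H, L, C, D = 0.011, 0.022, 0.0151, 0.0088  # sample AFRs from TABLE 1

afr1 = (H + L)**2 + C + L**2 + D**2   # ≈ 0.0167
afr2 = (H + L + C + L)**2 + D**2      # ≈ 0.0050
afr3 = (H + L + C + L + D)**2         # ≈ 0.0062

# The architecture with the lowest annual failure rate is the most reliable.
rates = {"Architecture 1": afr1, "Architecture 2": afr2, "Architecture 3": afr3}
best = min(rates, key=rates.get)
print(best)  # Architecture 2
```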
Conclusion
When the calculations are complete, we compare the data:
Architecture 1: R1 = 98.33%, or a system MTBF of 524,551 hours
Architecture 2: R2 = 99.50%, or a system MTBF of 1,752,000 hours
Architecture 3: R3 = 99.38%, or a system MTBF of 1,412,903 hours
The MTBF figures are the most revealing, and indicate that architecture 2 is statistically the most reliable of all.
In conclusion, the case study calculations provide the following points:
Only architectures 2 and 3 are fully redundant; hence, they satisfy the requirement of a redundant configuration that can sustain a single hardware failure.
The reliability value for Architecture 1 does not reveal the non-redundant aspect of this architecture. It is therefore important to consider both characteristics: redundancy and reliability.
Architecture 2 is nearly three times more reliable than Architecture 1, and its estimated MTBF is 339,097 hours higher than that of Architecture 3.
Finally, weighing the advantages of one solution over another, we must also take other parameters into account, such as:
Storage capacity requirements
Performance
Cost
Maintainability (indexed by the MTTR: mean time to repair)
Availability (which depends on the MTBF and MTTR)
Serviceability
Ease of deployment
Support
The last point, support, is a critical consideration, because it is through support that a second failure is avoided, by quick troubleshooting and prompt part replacement. One factor not obvious in the calculations is that although Architecture 2 appears to bring more in terms of redundancy, due to the dual path from server to disks, it has the drawback of requiring additional multipathing software, which adds a layer of complexity that might be less desirable (possibly lowering ease of deployment and serviceability, while increasing cost).
Finally, it is worth noting that any storage area network (SAN) implementation must be carefully planned and analyzed before deployment. Moreover, a simple SAN design will often be preferable, because it is easier to support (troubleshooting and problem resolution). But one must not favor one parameter over the others without knowing the consequences; every aspect of the architecture decision must be considered. This is the only way to increase the reliability of a storage architecture.