A Better Architecture For High Availability

Data replication at the file-system level between two independent systems is a high-availability configuration that has many significant advantages over shared-storage clustering.

The majority of existing high-availability installations use a shared-storage clustering architecture, where multiple servers each connect to external RAID subsystems. However, this architecture presents several hurdles to eliminating data loss and to achieving high availability. A high-availability system based on disk data replication over TCP/IP between two completely independent systems does not have these shortcomings. This replication architecture, not widely used for high availability until now, offers higher availability at a lower cost than the commonly used shared-storage clustering architecture.

An architecture based on data replication instead of shared storage contains no single points of failure and enables sub-second failover because file system recovery time is eliminated. The cache and lock management issues that plague shared-storage clustering are not present in data replication architectures. The transparency of replication and the avoidance of many of the intricate problems faced by clustering mean that replication offers an ease of implementation and maintenance that contrasts with the complexity of clustered configurations.

The Drawbacks of Shared-Storage Clustering

Shared-storage clustering configurations have multiple nodes attached to the same external RAID subsystems (see Figure 1). In the event that one server fails, a second server provides access to the shared storage. There are two variations of the shared-storage clustering architecture. In one variation, called "single-system access," each multihost disk is accessed through a single, primary node. Only in the event that this primary node fails does a second node take over control of the multihost disk. In the second variation, called "multi-system access," two or more nodes access the disk at the same time. In the event one of the nodes fails, the remaining nodes continue to access the disk. Both of these variations of shared-storage clustering have shortcomings that reduce availability and increase costs.


Figure 1: Shared-Storage Clustering Architecture.

The first drawback of shared-storage clustering is that, by necessity, the shared storage must be an external dual-ported, dual-controller, multi-disk channel RAID subsystem. These subsystems are significantly more expensive than internal storage or external single-controller, single-ported RAID subsystems. In addition to being expensive, these shared RAID subsystems have several single points of failure. A July 2001 InfoStor article (Drawbacks to Active-Active RAID by Tomlinson G. Rauscher) identifies the following as single points of failure within RAID subsystems:

  • Each disk channel
    - Disk channel controller chip fails (hardware or firmware) and locks the disk channel
    - Physical disk channel fails
  • Backplane into which the RAID controllers are inserted
  • Communication link between redundant RAID controllers

The other obvious drawback to using shared-storage subsystems is the single geographic point of vulnerability that they present. A fire, flood, or other disaster can eliminate the disk subsystem and the data on it. In addition, the limited distance of a SCSI or Fibre Channel connection limits the distance between the clustered servers themselves. Thus, the shared-storage architecture does not provide disaster recovery.

In both single-system access and multi-system access, lost cache data due to server failure is a significant problem. When a server fails with committed data in its cache that has not yet been written to disk, that data is unavailable, at best, or permanently lost, at worst. To prevent lost cache data, all file system writes must be synchronously passed to the disk subsystem before the file system commits to the write. This synchronization requirement significantly reduces system performance.

The multi-system access variation of shared-storage clustering presents several difficult issues that must be dealt with by the clustering software and the application. Lock management and cache consistency must be provided. Because multiple systems can directly access all the data in a multi-system access configuration, the clustering software must deal with the intricacies of lock management. Before accessing any data, each system must determine whether any other system is already accessing that data. When one system fails, the other nodes must be aware of which locks belonged to that system so that they can remove those locks and access the data.
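The lock bookkeeping described above can be illustrated with a minimal sketch. The `LockTable` class, its method names, and the node names are hypothetical and chosen only for illustration; real clustering software distributes this state across nodes and must also cope with network partitions, which this toy version ignores.

```python
from collections import defaultdict


class LockTable:
    """Toy cluster-wide lock table keyed by block id (hypothetical)."""

    def __init__(self):
        self.owner = {}                   # block id -> node holding the lock
        self.held_by = defaultdict(set)   # node -> block ids it currently holds

    def acquire(self, node, block):
        """Grant the lock unless a different node already holds it."""
        current = self.owner.get(block)
        if current is not None and current != node:
            return False
        self.owner[block] = node
        self.held_by[node].add(block)
        return True

    def release_all(self, failed_node):
        """Remove every lock owned by a failed node so survivors can proceed."""
        released = self.held_by.pop(failed_node, set())
        for block in released:
            del self.owner[block]
        return released


table = LockTable()
granted_a = table.acquire("node-a", 7)    # node-a locks block 7
blocked_b = table.acquire("node-b", 7)    # node-b must wait
released = table.release_all("node-a")    # node-a fails; its locks are removed
granted_b = table.acquire("node-b", 7)    # node-b can now access block 7
```

Even in this simplified form, the table has to track ownership in both directions (by block and by node) so that a failed node's locks can be found and removed.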

In the case of cache consistency, some mechanism must ensure that data recently written by any of the cluster members to its cache is signaled to other cluster members. It is necessary either to immediately write all data to the shared-storage subsystem (disk synchronization) or to alert other cluster members which blocks on the shared storage are invalid.
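The stale-read problem can be shown with a small sketch. The `Node` class and the `coherent` flag are assumptions made for illustration, and for simplicity the coherent mode combines both mechanisms mentioned above (a write-through to the shared disk plus an invalidation message to peers) rather than choosing one.

```python
class Node:
    """Toy cluster member with a private cache over a shared disk (hypothetical)."""

    def __init__(self, disk, coherent):
        self.disk = disk            # shared dict: block id -> data
        self.cache = {}             # this node's private cache
        self.peers = []             # other cluster members
        self.coherent = coherent    # whether consistency mechanisms are enabled

    def read(self, block):
        # Serve from the cache when possible, otherwise load from shared disk.
        if block not in self.cache:
            self.cache[block] = self.disk.get(block)
        return self.cache[block]

    def write(self, block, data):
        self.cache[block] = data
        if self.coherent:
            self.disk[block] = data              # disk synchronization
            for peer in self.peers:
                peer.cache.pop(block, None)      # tell peers the block is invalid


def demo(coherent):
    disk = {1: "old"}
    a, b = Node(disk, coherent), Node(disk, coherent)
    a.peers, b.peers = [b], [a]
    b.read(1)             # b caches the old contents of block 1
    a.write(1, "new")     # a updates block 1
    return b.read(1)      # what does b see now?


stale = demo(coherent=False)   # b still sees its cached copy
fresh = demo(coherent=True)    # b reloads the block from shared disk
```

Without the consistency step, node `b` keeps returning the old data; with it, every write pays for a disk write and a message to each peer.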

These lock management and cache consistency issues are difficult to solve as they involve the tiniest intricacies of the entire data path from file system to disk subsystem. The cache and lock management issues significantly increase the total cost of high availability by requiring sophisticated configuration and system design.

Possibly the most significant drawback of shared-storage clustering is that upon a system failure, a file system recovery operation must be run on each file system of the failed system. This file system recovery makes rapid (i.e., sub-second) failover impossible even if the other two components of the failover time budget, failure detection and application recovery, are instantaneous. A lengthy process must identify and roll back any data partially written by the failed server to prevent corruption and lost data. Although a journaled file system simplifies and speeds up this data integrity checking, sub-second failover is not possible. In the case of single-system access, the standby server must also mount the file systems previously controlled by the downed server after conducting a data integrity check. Where nearly continuous service is required, shared-storage clustering is not appropriate because of the required file system recovery operation.
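The failover time budget can be written as a simple sum of its three components. The sketch below uses a hypothetical `FailoverBudget` class, and the 30-second recovery figure is an illustrative placeholder, not a measurement.

```python
from dataclasses import dataclass


@dataclass
class FailoverBudget:
    """Three components of failover time, in seconds (illustrative model)."""
    detection_s: float
    fs_recovery_s: float
    app_recovery_s: float

    def total_s(self):
        return self.detection_s + self.fs_recovery_s + self.app_recovery_s

    def is_sub_second(self):
        return self.total_s() < 1.0


# Hypothetical case: detection and application recovery are instantaneous,
# but file system recovery still takes 30 seconds.
shared_storage = FailoverBudget(detection_s=0.0, fs_recovery_s=30.0, app_recovery_s=0.0)
total = shared_storage.total_s()
sub_second = shared_storage.is_sub_second()
```

Because the terms add, the file system recovery term alone sets a floor on failover time regardless of how fast the other two components are.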

The Advantages of Data Replication for High Availability

Synchronous data replication between two independent systems (see Figures 2a and 2b) offers an alternative high-availability architecture that delivers significant advantages over shared-storage clustering. It enables sub-second failover in a simple configuration that is easy to implement and to maintain. Data replication creates a second copy of all data onto a completely independent standby system, with its own separate storage. Note that replication differs from mirroring in that the data duplication takes place over a TCP/IP network to an independent system instead of within a RAID array or within a storage-area network. Until now, data replication has not been widely used as a high-performance solution for high availability. Instead, it has been used to provide periodic snapshots for limited disaster recovery. However, recent advances make the data replication architecture ideal for providing higher availability at a lower cost than shared-storage clustering.

Figure 2a: Data Replication Architecture with Single-ported RAID Subsystem.

Figure 2b: Data Replication Architecture with Internal Server Storage.

There are two methods of data replication: (1) block-level replication and (2) file-system-level replication. Block-level replication functions at the disk or volume manager level and replicates opaque blocks of data. (See Figure 3a.) Block-level replication sits below the cache. File-system-level replication replicates data to the standby system as it enters the file system. (See Figure 3b.) File-system-level replication sits above the cache instead of below it.

File-system-level replication provides many significant advantages over block-level replication. Because block-level replication functions below the cache, all data must be synchronously passed to the volume manager so that it can be replicated to the standby system. In the case of file-system-level replication, the replication occurs above the cache layer, so disk synchronization is not required. More importantly, rapid failover is not possible with block-level replication since the standby file system must be quiesced, checked, and mounted upon failover. With file-system-level replication, the standby file system is always ready to take over, making file system recovery instantaneous. Because of the advantages of file-system-level replication over block-level replication, the remainder of this article will focus exclusively on file-system-level replication and its advantages over shared-storage clustering.

Figure 3a: Block-Level Replication.

Figure 3b: File-System-Level Replication.

File-system-level replication involves synchronously replicating data across TCP/IP at the file system level to an independent system. The file system does not commit to a write until the data has been replicated to the other system. In the event the active system fails, the standby system has immediate access to all data that was written to the failed system. There is no possibility of lost data due to any single failure, and because replication occurs at the file system level, no file system recovery operation is required during a failover. The data replication function can take place completely within the operating system kernel, making it completely transparent to applications, volume managers, and storage systems. This transparency, both upwards to the application and downwards to the I/O path, makes high availability easier to implement and to maintain, which significantly reduces the total cost of high availability.
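The commit rule described above (no commit until the standby has the data) can be sketched as follows. The `Active` and `Standby` classes are hypothetical in-memory stand-ins for the two systems, and refusing a write when the standby is unreachable is an assumed policy for this sketch, not something the architecture prescribes.

```python
class Standby:
    """Toy standby system holding its own independent copy (hypothetical)."""

    def __init__(self):
        self.files = {}
        self.reachable = True

    def replicate(self, path, data):
        if not self.reachable:
            raise ConnectionError("standby unreachable")
        self.files[path] = data
        return True   # acknowledgment back to the active system


class Active:
    """Toy active system that commits only after the standby acknowledges."""

    def __init__(self, standby):
        self.files = {}
        self.standby = standby

    def write(self, path, data):
        try:
            acked = self.standby.replicate(path, data)
        except ConnectionError:
            return False          # assumed policy: refuse the write
        if acked:
            self.files[path] = data   # commit locally only after the ack
        return acked


standby = Standby()
active = Active(standby)
committed = active.write("/data/a", b"hello")

# If the active system fails now, the standby already holds every
# committed write, so nothing needs to be recovered or replayed.
standby_has_all = all(standby.files.get(p) == d for p, d in active.files.items())

standby.reachable = False
refused = active.write("/data/b", b"world")   # not committed anywhere
```

The invariant to notice is that the active system's committed state is always a subset of what the standby holds.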

Although some commentators have expressed concern that synchronous data replication will adversely affect the performance of the system, file-system-level replication can actually improve the performance of the local file system. This performance improvement is due to a subtle aspect of disk data replication — the ability to control synchronous disk writes. In many applications, synchronous writes are used to ensure that data written by the application is actually written to disk or to an independent disk subsystem. In the case of shared-storage clustering, synchronous writes are required to eliminate cache consistency issues and to prevent lost data in the case of a failure. These systems require synchronous writes to ensure that the data is flushed from the cache of the server out to the storage system. Otherwise, a failure of the server makes the cached data lost or unavailable. These synchronous writes decrease performance by making the application wait for the data to be written to disk.

In a file-system-level replication environment, these disk synchronous writes can be replaced by network synchronous writes. With a network synchronous write, the active system transmits the data to the standby system and waits for confirmation (e.g., a TCP acknowledgment) that the standby system has received it. Once that confirmation arrives, the active file system commits to the write and the application is allowed to continue. Thus, although the data may not yet be written to either disk, the active file system can commit to the write. Because both systems have the data, no single fault will prevent the data from being written to disk. This method therefore provides the same level of protection as disk synchronous writes without the penalty of waiting for the disk.
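A back-of-the-envelope model makes the performance argument concrete. The function name and both latency figures below are hypothetical placeholders chosen for illustration; actual disk and network latencies depend entirely on the hardware and the distance between the systems.

```python
def app_wait_ms(policy, writes, disk_write_ms, network_rtt_ms):
    """Total time an application spends blocked waiting for commits.

    Under "disk_sync" each write waits for the disk; under "network_sync"
    each write waits only for the standby's acknowledgment.
    """
    per_write_ms = {
        "disk_sync": disk_write_ms,
        "network_sync": network_rtt_ms,
    }[policy]
    return writes * per_write_ms


# Hypothetical latencies, for illustration only.
DISK_WRITE_MS = 8.0
NETWORK_RTT_MS = 0.5

disk_total = app_wait_ms("disk_sync", 1000, DISK_WRITE_MS, NETWORK_RTT_MS)
net_total = app_wait_ms("network_sync", 1000, DISK_WRITE_MS, NETWORK_RTT_MS)
```

The model only counts time the application is blocked. It ignores the background disk writes that both systems must still complete, so it says nothing about sustained throughput, and the advantage shrinks or reverses when the network round trip approaches the disk write time.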

In addition to performance, synchronous file-system-level replication delivers high-availability features that are not offered by shared-storage clustering. First, the simple redundancy of the architecture — redundant systems, redundant storage, and redundant networks — guarantees that there are no single points of failure. Second, file-system-level replication enables sub-second failover for any application. Because the standby system is a completely independent system, file system recovery is not required, eliminating it from the failover time budget. The standby system does not need to check the integrity of the file system, and the standby system already has the storage mounted. Application recovery time can also be eliminated by having a hot standby application ready to take over for a failed application. With file system recovery and application recovery occurring almost instantly, the time required for complete failover reduces to failure detection. Thus, the replication configuration provides the sub-second failover required by the most demanding environments.

Because both the active and the standby systems have their own independent storage, internal server storage or external single-ported, single-controller RAID subsystems can be used instead of expensive dual-ported, dual-controller RAID subsystems. Even though up to twice the storage capacity may be required, this less expensive storage can substantially reduce the cost of the system. And because the two independent systems require only a TCP/IP connection, there are no distance separation limitations between the systems. Thus, the replication configuration is ideal for disaster recovery.

Finally, data replication at the file-system level does not suffer from the cache and lock problems that plague clustering configurations. Because file-system-level replication takes place above the caching process and because replication is synchronous, the cache consistency and lost cache data issues suffered by shared-storage clustering are eliminated. The lock management issues that arise with shared-storage clustering are not present in replication configurations because each system has its own independent storage. By eliminating the cache and lock management issues, replication architectures significantly reduce the complexity of implementing and maintaining high-availability applications.

File-system-level replication provides a high-availability architecture that delivers superior availability at lower cost. The replication architecture has no single points of failure, and it enables sub-second failover for any application. In addition, replication offers lower storage costs and is ideal for disaster recovery since no geographic limitations exist. Just as significantly, file-system-level replication is transparent to applications, volume managers, and storage. This means that replication significantly reduces the complexity and cost of implementing and maintaining high-availability applications. It offers higher availability at a lower total cost of ownership.

About Continuous Computing

Continuous Computing provides integrated systems and services that enable telecom equipment manufacturers to rapidly deploy Next Generation Networks (NGN). Over 150 customers worldwide benefit from the company's unique blend of customized professional services, Trillium protocol software, AdvancedTCA and CompactPCI systems, and BladeCenter hardware. Continuous Computing helps customers reduce platform lifecycle costs, optimize data delivery, and accelerate deployments of NGN, 3G/4G Wireless, and IP Multimedia Subsystem (IMS) infrastructure. The company is ISO-9001 and CMMI certified and based in San Diego with development centers in China and India. For more information, visit www.ccpu.com.

Continuous Computing, the Continuous Computing logo, Create | Deploy | Converge, FlexTCA, Flex21, FlexChassis, FlexCompute, FlexCore, FlexDSP, FlexPacket, FlexStore, FlexSwitch, FlexTCA, Network Service-Ready Platform, Quick!Start, TAPA, Trillium, Trillium+plus, and the Trillium logo are trademarks or registered trademarks of Continuous Computing Corporation. Other names and brands may be claimed as the property of others.

Copyright © 2009 Continuous Computing. All Rights Reserved.  |  +1.858.882.8800 phone  |  www.ccpu.com
