CompactPCI without PCI: Heresy or High Availability?
The Challenges to Packet-Based Telephony
A packet-switched telephone system can provide far more functionality and flexibility than a traditional circuit-switched network. However, it has not yet achieved the level of reliability enjoyed by copper wire.
Hence the issue of High Availability, where a platform must, in effect, work even when it is not working. For example, a CPU might fail for some reason, but its function would be immediately taken up and carried out by a standby CPU. The platform as a whole would not miss a beat while this takes place.
Conventional Solutions To The Challenges Of High Availability
The 2N Redundant Architecture
The most obvious solution to the challenges put forth by equipment failure is hardware duplication. Such a 2N-Redundant platform is composed of a primary node and an identical backup node that will provide service if the primary fails (Figure 1). Such a solution is fairly easy to implement both in terms of design and the software that must be used to manage it, but can be prohibitively expensive as everything must be duplicated.

Figure 1
Furthermore, such a design carries a more significant drawback in that all calls in process are dropped when the primary node fails, even if the backup node comes up quickly. In some applications such as voicemail this is acceptable, as a user can simply call back to replay their messages. However, in the case of a user who is attempting to manage a Wall Street conference call with 200 analyst participants, such a situation would result in an enormous inconvenience and lost revenue. A customer who was placed in this situation would not fail to remember it when the time came to renew their service contract with their telephony provider.
The Dual System Slot Architecture
Another approach to high availability is the duplication of subsets of hardware instead of the entire node. Most often, this translates to duplication of the CPU. Should the primary CPU fail, the backup awakens from standby and begins to provide service (Figure 1).
This design addresses the cost issues that arise when compared to the 2N Redundant architecture, since not all of the hardware in a given node need be replicated. However, while a Dual System Slot architecture is less expensive to implement, the software challenges for such a platform are formidable. Should the primary CPU fail and the backup take over its function, the backup must accurately determine the state of all of the I/O cards and take control of them — all without disturbing the state of the cards or the calls in process at the time.
Another consideration is the immaturity of such systems and the lack of standardization. The PICMG dual system slot standard has not yet been released, and differing approaches are being championed by various vendors, thus making it very difficult to build a system using off-the-shelf parts.
Furthermore, the CompactPCI bus remains a single point of failure. A single misbehaving I/O card can take down the entire system much like a single bulb in a string of Christmas lights. The newly discovered nano-bounce problems with CompactPCI hot-swap only further illustrate the difficulty of this approach.
Plainly, there is strong motivation for a new architecture that simplifies the programming needed to implement high availability, avoids the problems caused by fluid standards by building on open ones, removes the bus as a single point of failure, and takes advantage of the increasingly intelligent I/O cards and peripheral processors now available.
A Conceptual Solution: CompactPCI without PCI
Let's take a step back for a moment and look at what components are required for a high availability packet processing system:
- Redundant configuration and application data
- Redundant communication between control nodes and processing nodes
- Redundant control nodes
- A dynamically assignable pool of processing resources
All of the above can be realized by using dual networks rather than a backplane/midplane for the system bus (Figure 2).

Figure 2: Network Bus Architecture (active CPU and standby CPU)
Ethernet is an obvious choice for this network, as most current CPU boards and many intelligent I/O processor boards have dual 100BaseT interfaces. However, there is no reason this architecture cannot be extended to faster networks such as Gigabit Ethernet or Infiniband. Also, even though this diagram shows a single chassis connected via two local switches, all communication is based on TCP/IP, so there are no geographical or physical constraints on what kind of system one might design.
Let's take a look at how this architecture addresses the needs of a high availability packet processing system.
- Redundant data
Conventional data redundancy depends on shared access to mirrored SCSI disks. Unfortunately, this makes the SCSI bus a single point of failure. Repair and upgrade of disks is difficult, and the length of the SCSI bus is limited. In a network bus architecture, data is shared over the network, and these limitations are removed.
- Redundant communication
Each component in the system has two separate paths to every other component. Any single component, even an entire switch, can fail, and the system continues to operate. Also, when a component does fail, the other components do not see a hardware failure. Instead, they see a network failure, which is much easier to handle in software.
- Redundant control
With a midplane/backplane bus, it is non-trivial to arrange for a standby system controller since such an arrangement takes up a precious slot on the bus. In a network bus architecture, the standby controller can maintain state and quickly take over for a failed controller. Also, a standby controller does not have to be idle: two active controllers can be standby for each other.
- Pool of I/O processors
In a network bus architecture, intelligent I/O processors work largely independently of the system controller. I/O is directly to and from the network — not over the midplane/backplane from the controller. The I/O processors get services from the controller (boot image, network disk, configuration management, etc.), and possibly marching orders (although they may come over the network from an external soft switch).
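To make the redundant communication point concrete, here is a minimal Python sketch of a sender that tries the path over the first network and falls back to the second. The function name, the addresses, and the port are hypothetical and not part of any product described here. The point is only that the caller sees an ordinary network error, never a hardware fault.

```python
import socket


def send_with_failover(payload, paths, timeout=1.0,
                       connect=socket.create_connection):
    """Send payload over the first working path and return that path.

    `paths` lists the same peer as reached over network A, then network B.
    `connect` is injectable so the logic can be exercised without a network.
    """
    last_error = None
    for address in paths:
        try:
            with connect(address, timeout=timeout) as conn:
                conn.sendall(payload)
            return address
        except OSError as exc:  # refused, unreachable, or timed out
            last_error = exc
    raise ConnectionError(f"all paths failed: {paths}") from last_error


if __name__ == "__main__":
    # Hypothetical addresses of one I/O processor on each of the two networks.
    peer_paths = [("10.0.0.5", 9000), ("10.0.1.5", 9000)]
    try:
        used = send_with_failover(b"ping\n", peer_paths, timeout=0.5)
        print("delivered via", used)
    except ConnectionError as exc:
        print("peer unreachable on both networks:", exc)
```

A real system would more likely keep connections open on both links rather than reconnecting per message. The sketch only illustrates how a link failure reduces to a retry on the other path.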
Now let's take a look at some additional advantages provided by this architecture.
- High Density
Popular busses for telephony applications support a limited number of slots. This limits the number of I/O processors per system controller and generally reduces the density of the system. In a network bus architecture, depending on the chassis, any combination of system controllers, I/O processors, or CPUs of any kind is possible.
- Ease of Programming
Conventional HA systems present a formidable programming challenge. Most operating systems cannot survive a hardware failure, so any fault takes down the whole system. To get around this, conventional systems must use hardened operating systems, hardened device drivers, and even hardened applications to protect against the failure of an I/O processor or any peripheral device. In the case of a failover to a standby CPU, the newly active CPU must completely determine the state of each device and its control registers.
If, however, the boards in a system are coupled only by network connections, the programming task becomes much simpler and the results more reliable. The failure of a network link merely means failover to a second link (more on this below); the failure of a single component merely means dropped network connections to other components. In other words, if a single component fails, the other parts of the system do not experience a hardware failure in the conventional sense.
For the CPU control nodes, this means they do not have to run hardened operating systems, but can run rich, popular systems such as Solaris or Linux. (I/O processors may run any of a number of operating systems, including real-time operating systems.)
This approach also leverages the increasing intelligence of available I/O cards, relieving the CPU of having to carry out functions that can be assumed by more peripheral parts of the architecture. Much like an object-oriented programming model, this allows I/O processors to present themselves as an encapsulated set of resources available over TCP/IP rather than as a set of device registers to be managed by the CPU.
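As an illustration of an I/O processor presenting itself as an encapsulated set of resources over TCP/IP, the following Python sketch exposes a board's state through a tiny line-based service. The `STATUS` command, the JSON reply, and every class and field name are assumptions made up for this example. They do not describe any actual board firmware or Continuous Computing API.

```python
import json
import socket
import socketserver
import threading


class IOProcessorState:
    """Hypothetical state an I/O board might choose to expose."""

    def __init__(self, slot, channels):
        self.slot = slot
        self.channels = channels
        self.active_calls = 0
        self.lock = threading.Lock()

    def snapshot(self):
        with self.lock:
            return {"slot": self.slot, "channels": self.channels,
                    "active_calls": self.active_calls}


class StatusHandler(socketserver.StreamRequestHandler):
    """Answers one command per connection with a single JSON line."""

    def handle(self):
        command = self.rfile.readline().strip()
        if command == b"STATUS":
            reply = self.server.state.snapshot()
        else:
            reply = {"error": "unknown command"}
        self.wfile.write(json.dumps(reply).encode() + b"\n")


def start_service(state, host="127.0.0.1", port=0):
    """Start the service in a background thread; port 0 picks a free port."""
    server = socketserver.ThreadingTCPServer((host, port), StatusHandler)
    server.state = state
    server.daemon_threads = True
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def query_status(address, timeout=2.0):
    """What a controller would do: ask over the network, not read registers."""
    with socket.create_connection(address, timeout=timeout) as conn:
        conn.sendall(b"STATUS\n")
        return json.loads(conn.makefile("rb").readline())
```

A controller that takes over after a failure could call `query_status` on each board to rebuild its view of the system, instead of probing device registers across a shared bus.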
- Hot-Swap Advantages
In our model, replacing failed boards or upgrading boards is simply a matter of swapping the old board out and the new board in. There are no device drivers to shut down or bring back up. (Depending on the application, it may well be wise to provide support for taking a board out of service and managing power smoothly. More on this below.)
- Failover
In this model, failover is greatly simplified. Intelligent I/O processors can continue handling calls even while system controllers are failing over. It is easy for both active and standby system controllers to know the system state, and intelligent I/O processors tend to know their own state and can communicate it upon request.
Failure of network components is handled via redundant network links.
Within the system, failover to a standby component can be handled either by the standby component assuming the network address of the failed component (IP or MAC failover), or by coordination among the working components, or by a combination of both.
Communication to equipment outside the system can be handled transparently via IP failover, or by the standby component re-registering with a new IP address (e.g. re-registering with a softswitch).
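The failover decision itself can be sketched in a few lines of Python. This is an illustrative model only: the class name, the timeout value, and the `on_takeover` callback are hypothetical. The takeover action (claiming the failed controller's IP or MAC address, or re-registering with a softswitch) is deliberately left to the callback, since it depends on the platform.

```python
import time


class StandbyMonitor:
    """Standby controller logic: take over if the active peer goes silent."""

    def __init__(self, timeout, on_takeover, clock=time.monotonic):
        self.timeout = timeout          # seconds of silence tolerated
        self.on_takeover = on_takeover  # e.g. claim the service IP address
        self.clock = clock              # injectable for testing
        self.last_seen = clock()
        self.active = False

    def heartbeat_received(self):
        """Call whenever a heartbeat from the active controller arrives."""
        self.last_seen = self.clock()

    def poll(self):
        """Call periodically; returns True once this node has taken over."""
        silent_for = self.clock() - self.last_seen
        if not self.active and silent_for > self.timeout:
            self.active = True
            self.on_takeover()
        return self.active


if __name__ == "__main__":
    monitor = StandbyMonitor(
        timeout=0.2,
        on_takeover=lambda: print("standby: assuming the active role"))
    monitor.heartbeat_received()
    time.sleep(0.3)          # the active controller stays silent too long
    print("took over:", monitor.poll())
```

Because the I/O processors keep handling calls on their own, nothing in this logic has to touch call state. The standby only has to decide when to assume the controller role.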
- Geographic Distribution
High availability solutions based on backplane/midplane busses such as CompactPCI are constrained to a single rack. Given that some causes of hardware failure are geographical (power outages, fires, earthquakes, floods, etc.), this is a serious limitation for certain applications. Replacing the midplane/backplane bus with network links removes this limitation as well. Such systems can be composed of nodes located on opposite sides of a Central Office space, a building, or even a city to increase redundancy and availability in case of a disaster. A fully distributed, network-based architecture greatly widens the design possibilities.
- Bandwidth Benefits
While nominal midplane/backplane bus speeds tend to be higher than nominal network speeds, fully non-blocking, full-duplex switches can provide higher aggregate throughput. For example, a cPCI bus provides 1 Gbit/sec shared among a maximum of 8 slots. A full-duplex 100 Mbit Ethernet switch would provide 1.6 Gbit/sec to an 8-slot system and 4.2 Gbit/sec to a 21-slot system. (In a redundant configuration those numbers could be as high as 3.2 and 8.4 Gbit/sec respectively, although an application would have to allow for reduced throughput in the case of a network component failure.)
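The switch figures above follow from simple arithmetic: each slot gets its own port, and a full-duplex port carries the link rate in each direction. A short Python sketch reproduces them (the function name and parameters are ours, chosen for illustration):

```python
def switch_aggregate_gbps(slots, link_mbps=100, full_duplex=True, networks=1):
    """Aggregate throughput of a non-blocking switch, in Gbit/sec.

    Every slot has a dedicated port; a full-duplex port moves link_mbps
    in each direction at once. `networks` is 2 for the redundant pair.
    """
    per_port_mbps = link_mbps * (2 if full_duplex else 1)
    return slots * per_port_mbps * networks / 1000


if __name__ == "__main__":
    for slots in (8, 21):
        single = switch_aggregate_gbps(slots)
        dual = switch_aggregate_gbps(slots, networks=2)
        print(f"{slots} slots: {single} Gbit/sec, {dual} Gbit/sec with both networks")
```

For 8 slots this gives 8 x 200 Mbit/sec = 1.6 Gbit/sec, and for 21 slots 21 x 200 Mbit/sec = 4.2 Gbit/sec, doubling to 3.2 and 8.4 when both networks carry traffic.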
Continuous Computing's High-Density High-Availability (HD/HA) Network Bus Architecture
Continuous Computing's network bus architecture (Figure 3) is a combination of hardware, software, and APIs that allow customers to use off the shelf CompactPCI cards to minimize development time and avoid getting locked into proprietary architectures.

Figure 3
Since all control and data are passed over the network, the CompactPCI backplane provides only power, not PCI. This greatly increases the reliability of the system, increases MTBF, and simplifies maintenance.
Hardware
- CCN
The CCN (Continuous Control Node) provides presence detect, board healthy, and reset control. The CCN can power up and power down individual slots, and can provide network access for any boards that have serial consoles. The CCN acts as a single point of contact for management and provisioning of the entire system.
- System Controllers
Dual SPARC/Solaris system controllers provide a redundant platform for control applications. They also provide transparent IP disk mirroring (see upDisk below) and can provide boot and configuration services to the I/O processors.
- Ethernet Switches
Continuous Computing is shipping the world's first 24+2 CompactPCI Ethernet switch. This switch has twenty-four 100 Mbit and two 1 Gbit interfaces, and is full-duplex and non-blocking. This means predictable, low-latency communication among all components of the system. A pair of these switches ensures that the network is not a single point of failure in the system.
- Power Supplies
Continuous Computing's power supplies are unique in that they are dual-feed. Either power supply can run the whole system, and both power supplies can take power from both the A and the B 48 volt feeds in a typical telco installation.
- Intelligent I/O Processors
These are application dependent and are completely up to the customer. A customer may configure 2N redundancy or N+M redundancy or no redundancy at all (in which case system capacity is reduced in the case of a board failure).
There are a few requirements for these boards to take full advantage of the HD/HA network bus architecture:
- They must have multiple Ethernet interfaces
- They must be able to configure multiple IP addresses on an interface
- They must be able to run a small heartbeat client
- They must be able to boot from onboard flash or over the network
There are numerous off-the-shelf boards on the market that meet these requirements and can be used in the HD/HA environment. These include DSP cards, CPU cards, and voice-processing cards combining DSP, telephony, and RISC resources.
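To give a sense of how small the heartbeat client requirement can be, here is a minimal Python sketch that sends a UDP datagram toward a controller over each network. The message format, port number, addresses, and function name are assumptions for illustration only. The actual HD/HA heartbeat protocol is not described in this paper.

```python
import json
import socket
import time


def send_heartbeats(board_id, targets, count, interval=1.0,
                    sock=None, sleep=time.sleep):
    """Send `count` heartbeats to every target, one target per network.

    A send failure on one network is ignored so that the other network
    still carries the heartbeat. `sock` and `sleep` are injectable for tests.
    """
    own_socket = sock is None
    if own_socket:
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        for seq in range(count):
            message = json.dumps({"board": board_id, "seq": seq}).encode()
            for target in targets:
                try:
                    sock.sendto(message, target)
                except OSError:
                    pass  # this path is down; keep using the other one
            if seq + 1 < count:
                sleep(interval)
    finally:
        if own_socket:
            sock.close()


if __name__ == "__main__":
    # Hypothetical controller addresses, one on each of the two networks.
    controllers = [("10.0.0.1", 7000), ("10.0.1.1", 7000)]
    send_heartbeats("dsp-3", controllers, count=3, interval=0.5)
```

On a board with an embedded or real-time operating system the same logic would be written against that system's socket API. The sketch only shows that the client needs nothing beyond a datagram socket and a timer.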
One of the unique features of the architecture is that these I/O processors are not limited to CompactPCI I/O slot processors. Because there is no CompactPCI bus, standard system slot processor boards can be used in the HD/HA I/O slots. This considerably increases the range of cards available.
Conclusion
In the past, standard bus architecture has restricted the implementation of true highly available CompactPCI platforms. Proprietary extensions and hot-swap implementation details have made it difficult to build standard high availability platforms. At the same time, the push to use CompactPCI for Voice-over-IP has created a need for "Five-9's" (99.999%) reliability.
Replacing the PCI bus with network links addresses these concerns while greatly widening the possibilities for development. The Continuous Computing HD/HA architecture demonstrates that off-the-shelf components can be used to obtain dramatically higher levels of availability while providing a dramatically simpler software interface than competing solutions.
About Continuous Computing
Continuous Computing provides integrated systems and services that enable telecom equipment manufacturers to rapidly deploy Next Generation Networks (NGN). Over 150 customers worldwide benefit from the company's unique blend of customized professional services, Trillium protocol software, AdvancedTCA and CompactPCI systems, and BladeCenter hardware. Continuous Computing helps customers reduce platform lifecycle costs, optimize data delivery, and accelerate deployments of NGN, 3G/4G Wireless, and IP Multimedia Subsystem (IMS) infrastructure. The company is ISO-9001 and CMMI certified and based in San Diego with development centers in China and India. For more information, visit www.ccpu.com.
