High Availability

Issues and Challenges when Designing an ATCA System

Author: Philippe Chevalier, Principal Architect
Issue Date: November 2007

Introduction

In the United States, there is a mandate from the Federal Communications Commission (FCC) to push telecom service providers to make their telephone services highly available. The idea is that anyone should be able to pick up a phone and call for help.

When designing a High Availability (HA) platform, system architects often have to deal with many parameters ranging from hardware to software, number of ports, number of subscribers, amount of resources, and ultimately cost. The cost of implementing HA must be weighed against the amount of revenue the service will generate - or conversely, the amount of revenue lost if the service is unavailable.

This paper describes the factors involved in designing for, and calculating, the availability of a carrier-class telecom system. The analysis focuses on AdvancedTCA (ATCA) platforms, but the facts can be extended to other types of platforms and environments.

Definitions

We first need to understand what we mean by High Availability, or a highly available system. System availability is expressed in terms of percentage; it defines the amount of time the system will be available to perform a service over the length of measurement.

There are nine levels of availability ranging from level 1 (or "1 NINE") to 9 (or "9 NINES"), but we will consider the first seven levels as follows:¹

Table 1: Levels of Availability

System Type               Downtime / year   Availability (A)   N
Unmanaged                 52,560 min        90%                1 NINE
Managed                   5,256 min         99%                2 NINES
Well-Managed              526 min           99.9%              3 NINES
Fault-Tolerant            52 min            99.99%             4 NINES
Highly-Available          314 sec           99.999%            5 NINES
Very-High Availability    31 sec            99.9999%           6 NINES
Ultra-High Availability   3 sec             99.99999%          7 NINES
¹Jim Gray, Andreas Reuter. Transaction Processing: Concepts and Techniques - Morgan Kaufmann

Where the number N of NINES is a function of the availability value (A):

$$N = -\log_{10}(1 - A)$$

The availability (A) is the percentage of the total service time during which the system is operational and able to provide its intended service. The relationship between availability and downtime is given by the following formula:

$$A = 1 - \frac{\text{Downtime}}{\text{TotalServiceTime}}$$

If we need to calculate our downtime budget from a given availability figure, we can use the following formula:

$$\text{Downtime} = (1 - A) \times \text{TotalServiceTime}$$

The total service time is the total number of hours in one year:

The telecommunications industry has been driving the 5 NINES requirement for years. 5 NINES means that a system must achieve 99.999% uptime, which allows 5.24 minutes - or 314 seconds - of downtime per year.
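The relationship between availability, NINES, and the yearly downtime budget can be checked with a few lines of code. The following is a minimal sketch; the function names and the 8,760-hour year (365 days) are assumptions, not values taken from the paper.

```python
import math

HOURS_PER_YEAR = 365 * 24  # assumption: 8,760-hour year


def downtime_seconds(availability, hours_per_year=HOURS_PER_YEAR):
    """Yearly downtime budget in seconds for an availability given as a fraction."""
    return (1.0 - availability) * hours_per_year * 3600.0


def nines(availability):
    """Number of NINES, N = -log10(1 - A)."""
    return -math.log10(1.0 - availability)


for a in (0.9, 0.99, 0.999, 0.9999, 0.99999, 0.999999, 0.9999999):
    print(f"A = {a:.7f}  N = {nines(a):.1f}  downtime = {downtime_seconds(a):,.1f} s/year")
```

With a 365-day year, 5 NINES comes out at roughly 315 seconds. The 314 seconds quoted in the text appears to correspond to a 52-week year of 8,736 hours, which can be reproduced by passing `hours_per_year=52 * 7 * 24`.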

At the system level, this means that the Mean Time To Failure (MTTF) must be within a budget driven by the time it takes to repair, also known as the Mean Time To Repair (MTTR).

The Mean Time Between Failures (MTBF) is the sum of these two times:

$$MTBF = MTTF + MTTR$$

We can calculate the system availability by dividing the mean time it takes for the platform to fail by the sum of that time and the mean time it takes to repair it:

$$A = \frac{MTTF}{MTTF + MTTR} = \frac{MTTF}{MTBF}$$

Considering an average of four hours to repair a system deployed in the field - assumed to be three hours travel time plus one hour of repair time - and considering that we are given 5 NINES system availability, the platform MTBF is:

$$MTBF = \frac{MTTR}{1 - A} = \frac{4}{1 - 0.99999} = 400{,}000 \text{ hours}$$

This means that the entire system must achieve an MTBF of 400,000 hours.
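The MTBF budget above follows directly from the definitions of MTTF, MTTR, and MTBF. A minimal sketch, with hypothetical function names, is:

```python
def availability(mttf_hours, mttr_hours):
    """A = MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


def required_mtbf(target_availability, mttr_hours):
    """MTBF (= MTTF + MTTR) needed to reach a target availability."""
    return mttr_hours / (1.0 - target_availability)


mttr = 4.0                          # 3 hours travel + 1 hour repair, as in the text
mtbf = required_mtbf(0.99999, mttr)
mttf = mtbf - mttr

print(f"required MTBF: {mtbf:,.0f} hours")
print(f"availability check: {availability(mttf, mttr):.5%}")
```

Halving the repair time halves the MTBF the platform has to reach, which is why field logistics matter as much as component quality.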

Why Systems Fail

Why does a system fail? There are several reasons, but they fall primarily into two areas:

  1. Software
  2. Hardware

Let's model these two elements, or "components", into an initial Reliability Block Diagram (RBD) representation:

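The total system availability is the product of all components' availability:

$$A_{System} = A_{HW} \times A_{SW}$$

For instance, a hardware availability of 99.999% and a software availability of 99.999% (i.e., both HA, or 5 NINES) result in a total system availability of:

$$A_{System} = 0.99999 \times 0.99999 \approx 0.99998 \quad (99.998\%)$$

And its corresponding MTBF, with the same 4-hour MTTR, is:

$$MTBF_{System} = \frac{4}{1 - 0.99998} = 200{,}000 \text{ hours}$$

Or, if expressed in downtime per year:

$$\text{Downtime}_{System} \approx 2 \times 314 \text{ s} = 628 \text{ s}$$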

The same value can be found by summing the unavailable time of each component, as follows:

$$\text{Downtime}_{System} \approx \text{Downtime}_{HW} + \text{Downtime}_{SW} = 314 \text{ s} + 314 \text{ s} = 628 \text{ s}$$

This leads to a third way to calculate the total availability budget, using Failures In Time (FIT):

$$FIT = \frac{10^{9}}{MTBF}$$

A FIT value is the number of failures expected for a component in 10^9 hours of operation.

Taking the above example, where a 99.999% component and a 4-hour MTTR give an MTBF of 400,000 hours, the corresponding FIT is:

$$FIT = \frac{10^{9}}{400{,}000} = 2{,}500$$

To calculate the total FIT for the entire system, we sum the FIT values of all components:

$$FIT_{System} = \sum_{i} FIT_{i}$$

The platform MTBF is then:

$$MTBF_{System} = \frac{10^{9}}{FIT_{System}}$$

This method is very useful to quickly calculate the total system availability number without too much "thinking." We will use it as an example later on.
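The FIT method lends itself to a short calculation. The sketch below assumes the common convention of counting failures per 10^9 operating hours; the function names are hypothetical.

```python
FIT_HOURS = 1e9  # assumption: FIT counted per 10^9 operating hours


def fit_from_mtbf(mtbf_hours):
    """Convert an MTBF in hours into a FIT value."""
    return FIT_HOURS / mtbf_hours


def mtbf_from_fit(fit):
    """Convert a FIT value back into an MTBF in hours."""
    return FIT_HOURS / fit


def series_mtbf(component_mtbfs):
    """MTBF of components in series: sum the FIT values, then convert back."""
    total_fit = sum(fit_from_mtbf(m) for m in component_mtbfs)
    return mtbf_from_fit(total_fit)


# Hardware and software, each at 400,000 hours (5 NINES with a 4-hour MTTR)
system_mtbf = series_mtbf([400_000.0, 400_000.0])
system_availability = 1.0 - 4.0 / system_mtbf

print(f"system MTBF: {system_mtbf:,.0f} hours")
print(f"system availability: {system_availability:.5%}")
```

Two 5 NINES components in series give a 200,000-hour system MTBF, matching the product-of-availabilities result.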

Systems fail for one major reason: faults. Once activated, a fault causes a system failure, which translates into system downtime.

There are three types of faults which need to be considered in a system:

  • Mechanical Faults
  • Hardware Faults
  • Software Faults

Before we look closely at each type of fault, however, we will discuss failure rate.

Failure Rate

The failure rate is the average number of failures that occur within a given time. The relationship between the MTBF and the failure rate is:

$$\text{FailureRate} = \frac{1}{MTBF}$$

Considering a 5 NINES platform and perfect fault coverage, we demonstrated earlier that the total downtime was not to exceed 314 seconds. However, we can look at this number in a couple of different ways.

  1. We can either have one failure per year lasting 314 seconds

    Or...

  2. We can have 314 failures per year with a time to recover of 1 second each.

Both are equal from a numerical HA calculation standpoint, but what does that mean in reality?

The relationship between the two is:

$$\text{FailuresPerYear} \times \text{TimeToRecover} = \text{Downtime} \le 314 \text{ s}$$

The following graph shows the relationship between the average time to repair versus the average number of failures per year.

This means that a high availability system can be designed under different failure rate models, but a higher failure rate must be compensated for by a shorter time to repair. Some applications may favor an HA model where many short failures are preferred over a single long one, and others may favor the opposite model for cost and practicality reasons. As a result, when designing an HA platform, system engineers have to choose between the two models in order to drive the requirements appropriately.
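The trade-off between failure count and recovery time can be tabulated directly. The sketch below assumes the 314-second 5 NINES budget quoted above; the function name is hypothetical.

```python
BUDGET_S = 314.0  # 5 NINES yearly downtime budget quoted in the text


def max_recovery_time(failures_per_year, budget_s=BUDGET_S):
    """Longest average recovery time that still fits the yearly budget."""
    return budget_s / failures_per_year


for n in (1, 2, 10, 100, 314):
    print(f"{n:4d} failures/year -> at most {max_recovery_time(n):7.2f} s per failure")
```

A design that expects frequent failures therefore needs automated, sub-second to few-second recovery, whereas a design relying on field repair can tolerate almost no failures at all.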

System Failure Rate

Failure rates are often expressed by different values - for example, failure rate of the hardware alone, failure rate of the software alone, or failure rate of the complete system. This last case is the most complicated to identify because it involves the application that the network service provider will be using on the system / platform itself. Because of this, telecom vendors often claim the ability to offer a "5-NINES-ready" platform, or something similar - but it is left to the service provider to apply this availability figure at the subscriber level, which includes the application.

Hardware Failure Rate

The hardware failure rate is the simplest one to determine: it is derived directly from the MTBF of each component applied to a redundancy model. Such a model can be 1N, 2N, or N+M, in which case we restrict the downtime to when there is not enough hardware left to sustain the designated service. The only repair time is the time it takes to change the hardware, which is assumed to be four hours as described previously.

For instance, in the case of a 1N model, failure of one component causes the failure of the entire system. When a board fails, we need to wait four hours to repair it - pretty straightforward.

In a 2N model, it takes two failures to cause a complete shutdown. In other words, the unavailability is the probability of the second component failing after the first one has failed and before it has been repaired, or what is known as "double failure."

We can apply this theory to as many components as we have in our model. Interestingly, however, increasing the number of components does not necessarily keep improving our availability. The reason is that there is a higher probability of having a failure in one of the N components.

The following chart has been created using N number of components with the same MTBF of 100,000 hours in an N+1 redundancy model:

The hardware equivalent failure rate is very low when the number of components is less than 10 (e.g., > 5 NINES), but availability decreases when the number of components grows beyond this value (e.g., a 4 NINES availability model with 40 components). This is an unrealistic redundancy model, but it is instructive to be aware of this when working at the system level with a 5 NINES goal in mind.

Software Failure Rate

Software failure rates are not known until testing and deployment occur. A software failure rate reflects the quality of the software organization developing the code. Some organizations delivering code to sensitive environments - such as aviation, nuclear power stations, and the like - even restrict their software engineers from writing a single line of code by hand; instead, the code is generated by code generators and highly reliable compilers. This rigorous approach costs money, of course, and is not necessarily applicable to all target environments. The Software Engineering Institute (SEI) at Carnegie Mellon has developed the Capability Maturity Model (CMM) with five levels:

Table 2: Capability Maturity Model

Level   Identification   Description
1       Initial          Immature or undefined process
2       Repeatable       Requirements management, project planning, configuration management, SW quality assurance
3       Defined          L2 + organization-level management and process control, defined processes, established training programs
4       Managed          L3 with process measurement and analysis
5       Optimizing       L4 with process improvement and optimizations

It is considered good practice to run a software engineering organization at a minimum SEI Level 3 to produce reliable code; it is also the most difficult level to reach when originally assessed at Level 1 (Initial).

System engineering processes also consist of thoroughly describing each scenario, validating all the ins and outs, and describing the test cases necessary to flush out as many design and development issues as possible.

System Failure Rate

When hardware and software are combined and applied to a specific use, the system engineer needs to calculate and demonstrate the end system failure rate. This is achieved by first allocating the hardware and software failure rates, which can be 50/50 or 20/80 or whatever is necessary to reach the 5 NINES goal at the user level. Such allocation is a very complex task which requires a very thorough knowledge of the entire platform, redundancy models, fault zones, call control allocation, etc.

MECHANICAL FAULTS

Mechanical parts are components with a mechanical rotating part such as fans and hard drives, which are the two most likely to be found in a telecom platform.

Mechanical components have three periods during their lifetime:

  1. Burn-in period
  2. Useful life
  3. Wear-out period

During the burn-in period, the mechanical component has a high failure rate. Once it has passed this stage, the failure rate is constant until it wears out and the failure rate increases again.

The availability study relies heavily on mechanical components being operated during their useful life, but despite this condition, mechanical components are highly unreliable. For example, fans, which are necessary to cool down components in a chassis, are one of the least reliable components in a system with less than 100,000 hours MTBF during their useful life of two to five years, depending on cost. Hard drives are also within this category, although with a much higher MTBF and lifetime. Proper operational management needs to be in place to extend these components' lifetimes as long as possible, such as by lowering the speed of fans when applicable or shutting down hard drives when not in use.

Hardware Faults

Hardware faults are typically considered hard failures. Hardware fails when a component fails, and therefore a hardware change (replacement) is required. The failure is simple to isolate and repair, but it takes time. Hardware faults are latent and will manifest themselves over time at a rate referred to as their failure rate.

When a hardware fault occurs, the hardware stops functioning and causes a system failure, upon which a technician needs to be dispatched to repair the problem (assumed to be 4 hours MTTR). Such failures are very "expensive" in terms of cost as they represent hours of missed revenue and possibly angry or lost customers.

One technique for addressing hardware faults consists of placing a second component in parallel with the first one.

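The new total availability is the probability that either one or both of the hardware components will be available. That is:

$$A_{HWeq} = A_{HW1} + A_{HW2} - A_{HW1} \times A_{HW2}$$

Which gives the total hardware availability as follows:

$$A_{HWeq} = 1 - (1 - A_{HW1})(1 - A_{HW2})$$

So, given 100,000 hours MTBF for each of the hardware components and the 4-hour MTTR assumed earlier, our total availability is now:

$$A_{HWeq} = 1 - \left(\frac{4}{100{,}000}\right)^{2} = 1 - 1.6 \times 10^{-9} = 0.9999999984$$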

As a result, even with a couple of not-so-reliable components, we have managed to get a theoretical system to be almost 100% available.
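The parallel model can be sketched in a few lines. The function names and the 8,760-hour year are assumptions; the per-unit unavailability is taken as MTTR / MTBF, in line with the earlier MTBF calculation.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: 8,760-hour year


def unit_unavailability(mtbf_hours, mttr_hours=4.0):
    """Fraction of time a single unit is down, U = MTTR / MTBF."""
    return mttr_hours / mtbf_hours


def parallel_availability(unavailabilities):
    """Units in parallel are down only when every one of them is down."""
    all_down = 1.0
    for u in unavailabilities:
        all_down *= u
    return 1.0 - all_down


u = unit_unavailability(100_000.0)
pair = parallel_availability([u, u])
downtime_s = (u * u) * HOURS_PER_YEAR * 3600.0

print(f"single unit availability: {1.0 - u:.5%}")
print(f"pair availability:        {pair:.10f}")
print(f"pair downtime:            {downtime_s:.3f} s/year")
```

Two units that individually offer only about 4 NINES combine into a pair whose theoretical downtime is a small fraction of a second per year, before failover time and fault coverage are taken into account.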

This is a very high number when both hardware components are working in an "active / active" manner and sharing all information with each other. As one might expect, though, this is not a very realistic scenario. In order to have full redundancy between two components (whether in an active / standby configuration or an active / active configuration), it takes time to reconfigure a single component to take over all of the services that the first one had been handling; this is called the fail over time.

During the fail over time, the service is not operational, and however short we manage to make it, it still counts as downtime.

Let's take our two pieces of HA hardware in a 2N redundancy model and consider λ as the time it takes to detect, isolate, and recover from the fault. We now have the new system availability at the hardware level, which is:

In our previous example, given a 100,000 hours MTBF hardware component, and considering that it takes 2 seconds to fail over all necessary state information for the secondary unit (HW2) to take over fully, we now have for our new hardware equivalent availability:

Or ~0.17 seconds downtime per year!

One other factor to consider is that a system may not have the ability to actually detect a fault, which has the potential of causing total system failure. Let's assign a new variable δ for coverage, expressed as a percentage; the new hardware equivalent is:

Given a realistic 80% fault coverage, we now have:

Or ~ 251 seconds!

This represents 80% of our total 314-second downtime budget to reach our total system (software + hardware) high availability.
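The two figures above can be approximated with a small model. The sketch below is an assumption inferred from the quoted numbers, not the paper's own equation: each failure of the active unit costs the fail over time when the fault is covered and the full 4-hour MTTR when it is not, and double failures are ignored. The function name and the 8,760-hour year are likewise assumptions.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: 8,760-hour year


def redundant_pair_downtime_s(mtbf_hours, failover_s, coverage=1.0, mttr_hours=4.0):
    """Expected yearly downtime (seconds) of a 2N pair, ignoring double failures.

    coverage   -- fraction of faults detected and recovered by fail over (delta)
    failover_s -- time to detect, isolate, and recover from a covered fault
    """
    failures_per_year = HOURS_PER_YEAR / mtbf_hours
    outage_per_failure_s = (coverage * failover_s
                            + (1.0 - coverage) * mttr_hours * 3600.0)
    return failures_per_year * outage_per_failure_s


perfect = redundant_pair_downtime_s(100_000.0, failover_s=2.0)
realistic = redundant_pair_downtime_s(100_000.0, failover_s=2.0, coverage=0.80)

print(f"perfect coverage: {perfect:.2f} s/year")
print(f"80% coverage:     {realistic:.1f} s/year")
```

Under these assumptions the results land near 0.18 and 252 seconds, close to the ~0.17 and ~251 seconds quoted. Almost all of the downtime comes from the uncovered 20% of faults, each of which costs a full field repair.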

Before moving on, we should drill down on this very simplistic view of the redundancy model a little bit more. In particular, as a practical matter, the concept of stacking up hardware components in order to reach a higher level of redundancy is very expensive. So, instead of duplicating an entire system, we need to decompose our hardware into pieces.

Let's use a typical ATCA platform with a set of "n" payload cards and two switches. We have the following components:

  • Fan tray in a "2 out of 3" availability model, FT1, FT2 and FT3
  • Backplane, 1N
  • Power Entry Module 2N availability model, PEM1 and PEM2
  • Payload cards in a "k out of n" availability model, PL1… PLn
  • Switch cards in a 2N availability model, SW1, SW2

Our new RBD model is represented as follows:

This model can be simplified to the following RBD, where each component has been reduced to a 1N model and serialized as shown below:

SHELF MANAGER AVAILABILITY

One of the most counter-intuitive elements of the telecom HA model is that shelf managers do not actually contribute directly to the high availability of a system. The reason is that we can lose both shelf managers and still be operational - assuming, of course, that none of the components are directly affected by the absence of the shelf manager. Fan trays, for instance, which are typically controlled by the shelf managers, are pre-programmed to run at high speed when not controlled by either of the two shelf managers.

During that same time, however, the availability manager has no visibility of the system, which can lead to a complete system outage. Under this scenario, we count the failure of the shelf managers as a partial outage.

BACKPLANE AVAILABILITY MODEL

The backplane is an interesting component; it is typically very reliable yet is still a single point of failure of the entire system. As such, the backplane must never have any active components on it - only passive components.

Regardless of its calculated MTBF, historically the backplane has caused downtime not because of a component failure, but because of human error - such as bent pins, coffee spills, foreign material jammed into connectors, electrostatic discharge, etc. When these types of things happen, a technician needs to be dispatched with a new chassis to rebuild the complete system in the field, which can take significantly more than 4 hours. Extra care must be given to the backplane, therefore, and only highly trained personnel should be allowed to service it.

FAN TRAY AVAILABILITY MODEL

With the Fan Tray, we do not have to consider any switchover case. The Fan Tray equivalent (FTEQ) is given by the parallel redundancy model as follows:

$$A_{FTEQ} = 1 - (1 - A_{FT})^{3}$$

Given a typical MTBF_FT of 70,000 hours during its operational lifetime (typically a 5-year period), we can consider that the Fan Tray equivalent will always be operational, with an MTBF_FTEQ of more than a million years; its contribution to downtime can therefore be considered completely negligible.

Note too that this model assumes the system is still operational when only one Fan Tray is running, which may not always be the case. If we consider that the system is out of service when two Fan Trays are down, we need to change the formula as follows: the probability of the FTEQ being operational is the probability of at least two Fan Trays being up, and the probability of the FTEQ being inoperative is therefore given by two Fan Trays being down while one Fan Tray is up.

With our 70,000 hours MTBF for each Fan Tray, we now have:

Yielding ~0.1 seconds downtime per year, which is essentially negligible when delivering a 5 NINES platform.
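A "k out of n" model such as the Fan Tray case can be evaluated with the binomial distribution. This is a minimal sketch assuming independent, identical units, a 4-hour MTTR, and an 8,760-hour year; the function name is hypothetical.

```python
from math import comb

HOURS_PER_YEAR = 365 * 24  # assumption: 8,760-hour year


def k_out_of_n_unavailability(k, n, unit_unavailability):
    """Probability that fewer than k of n independent units are up."""
    u = unit_unavailability
    a = 1.0 - u
    return sum(comb(n, up) * a**up * u**(n - up) for up in range(k))


u_ft = 4.0 / 70_000.0  # one Fan Tray: 4-hour MTTR, 70,000-hour MTBF
seconds_per_year = HOURS_PER_YEAR * 3600.0

one_of_three = k_out_of_n_unavailability(1, 3, u_ft) * seconds_per_year
two_of_three = k_out_of_n_unavailability(2, 3, u_ft) * seconds_per_year

print(f"1 out of 3 needed: {one_of_three:.6f} s/year")
print(f"2 out of 3 needed: {two_of_three:.3f} s/year")
```

Counting all three possible pairs of failed Fan Trays gives roughly 0.3 seconds per year, while the product for a single pair gives the ~0.1 seconds quoted above. Either way, the contribution is negligible against a 314-second budget.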

POWER ENTRY MODULES AVAILABILITY MODEL

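And for the Power Entry Module equivalent (PEMeq), in the same parallel model:

$$A_{PEMeq} = 1 - (1 - A_{PEM})^{2}$$

Considering a 100,000 hours MTBF for the Power Entry Module, we have:

$$A_{PEMeq} = 1 - \left(\frac{4}{100{,}000}\right)^{2} = 0.9999999984$$

Or 0.05 seconds downtime per year - again negligible, so it can be removed from the RBD.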

Unlike the models for the two hardware components just analyzed, the switch availability model is slightly different. In this case, we need to consider two new factors.

First, we need to consider the time it takes to switch over, and second, we need to account for the possibility that the fault is not detected.

Reusing Equation X, we have:

Assuming 100,000 hours MTBF for the switch, 90% fault coverage, and 2 seconds failover, we can calculate our Switch Availability as follows:

This is equivalent to 125 seconds downtime per year, which is fairly significant when trying to meet 314 seconds for 5 NINES availability.

Looking at Equation X, it is clear that this number will be primarily driven by two important factors:

  • Fault Coverage
  • MTBF

Fault coverage is an important factor, and as we can see, the higher the coverage the less we will depend on the 4 hours MTTR. Ninety percent fault coverage is a very reasonable target for a carrier-class switch. The MTBF is the only factor that is not so easy to increase; it depends a lot on the type of components used and the complexity of the design. The ideal switch design is one that is as simple as possible with no mechanical parts (such as a hard drive). Two hundred thousand hours is a respectable MTBF for a 10GbE switch.

It might be counter-intuitive, but the fail over time is actually not that important in regard to availability figures, especially when considering a fail over time within seconds. Fail over time is therefore more important from the standpoint of loss of connectivity rather than from the standpoint of high availability. 250ms is a reasonable metric for switch fail over.

With the given parameters of 90% fault coverage, 200,000 hours MTBF, and 250ms fail over time, an ATCA chassis / switch complex can provide up to 99.9998% up time, or 63 seconds downtime.
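The relative weight of fault coverage, MTBF, and fail over time can be explored with a short sensitivity check. The model below is an assumption inferred from the quoted figures (covered faults cost the fail over time, uncovered faults cost the 4-hour MTTR), as are the function name and the 8,760-hour year.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: 8,760-hour year
MTTR_S = 4 * 3600.0        # 4-hour field repair, as assumed in the text


def switch_downtime_s(mtbf_hours, coverage, failover_s):
    """Expected yearly downtime (seconds) of a redundant switch pair."""
    failures_per_year = HOURS_PER_YEAR / mtbf_hours
    return failures_per_year * (coverage * failover_s
                                + (1.0 - coverage) * MTTR_S)


baseline = switch_downtime_s(100_000, 0.90, 2.0)          # first example
improved = switch_downtime_s(200_000, 0.90, 0.25)         # second example
faster_only = switch_downtime_s(100_000, 0.90, 0.25)      # only fail over improved
better_coverage = switch_downtime_s(100_000, 0.99, 2.0)   # only coverage improved

for label, value in [("baseline", baseline), ("improved", improved),
                     ("faster fail over only", faster_only),
                     ("99% coverage only", better_coverage)]:
    print(f"{label:22s} {value:7.2f} s/year")
```

Under these assumptions the two examples come out at about 126 and 63 seconds. Cutting the fail over time from 2 seconds to 250 ms on its own changes the yearly total by well under a second, whereas raising coverage from 90% to 99% cuts it by roughly a factor of ten.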

Note too that if the switch complex provided on an ATCA platform is completely independent from the rest of the application, the system architect is free to design the application without having to manage the switch failover scenarios. The system architect must then make sure that the rest of the application stays within the remaining availability budget:

$$314 \text{ s} - 63 \text{ s} = 251 \text{ s}$$

Or in terms of availability: 99.9992% up time. The new simplified RBD is now composed of two components, each with its respective availability number:

SOFTWARE FAULTS

Software faults are a bit different from hardware faults. Like mechanical parts, they tend to follow a bathtub model, with a high failure rate when the software is introduced, then a fairly steady failure rate, which tends to increase again over time (since software is updated constantly and is not necessarily tested as thoroughly as it was the first time). So, the tendency is to see failure rates increasing as the software ages, illustrated as follows:

Software faults are primarily due to bugs introduced during the design and are latent until activated under a certain condition, whether during tests, integration, or a trial period. Once released, the software may never fail until a new maintenance patch is introduced which exposes a latent fault. We can group software faults into four different categories:

  • Bohrbug: A repeatable bug. This is the one we want to have, and hopefully the kind we will most often have to deal with. The issue is to figure out how to repeat it. With a good set of log and debugging tools, it can be isolated and fixed.
  • Heisenbug [from Heisenberg's Uncertainty Principle in quantum physics]: This type of bug has the very unfortunate behavior of being difficult to isolate; it tends to disappear when one is trying to isolate it.
  • Mandelbug: This is the most difficult bug to deal with; it tends to have many dependencies and can present a different behavior each time one tries to isolate it.
  • Schroedinbug [MIT: from the Schroedinger's Cat thought experiment in quantum physics]: This bug doesn't really manifest itself until a thorough design review is performed and, when it does, it can turn into a Heisenbug or Mandelbug and is almost impossible to isolate. A well-established code and design review is necessary to flush out these sorts of faults.

Needless to say, software is the most unpredictable aspect of the system and a High Availability Manager is necessary to control the reliability of the overall software application.

One aspect that needs to be highly reliable is the fault manager itself. Fault management is decomposed into three main categories:

  • Fault Detection: Mechanism to report the failure to the availability manager
  • Fault Isolation: Mechanism to analyze the failure and identify which exact component of the system has failed
  • Fault Recovery: Mechanism to apply our recovery procedure in order to keep the system running

There are two main parameters that need to be taken into consideration when going through fault management. One is the amount of time it takes to go through the entire fault management procedure, and the other is its accuracy. There is not much point in having strong fault detection if we do not also have strong fault recovery, since a failed recovery leads to a full system outage.

Recovery Time

Ideally we expect the system to be available all the time, or to recover from a failure as quickly as possible. Unfortunately, it takes time to complete this task because there are multiple steps involved.

First, we may not be aware of the fault immediately; there is a time lag between when the failure occurs and when we get informed. For instance, some system implementations rely on monitoring the system log (syslog) file, so between the time the fault is reported into the syslog file and the time the fault detection mechanism reads this information, hundreds of milliseconds may have already been lost.

Then, the HA manager needs time to identify what the problem is exactly, such as by running diagnostic tools. Depending upon the complexity of the fault, this can take several seconds.

Lastly, the recovery mechanism takes place based on the recovery policy, and this depends on whether the standby component is already up and ready to take over from the failed component ("hot standby"), or whether it first needs to be brought into service before it can take over ("cold standby"). The complete fault management procedure can take from one second to several minutes depending upon the complexity of the recovery procedure and is illustrated below.

Fault Management

Another factor is the accuracy of the fault management procedure; each stage must complete without any problems before the next one can commence. Unfortunately, most of these procedures run in software, and as a result they are not 100% reliable. At each stage, there is a possibility that the fault management does not complete accurately.

Considering a per-stage success coverage represented by "C," the probability of a successful fault management procedure across the three stages (detection, isolation, and recovery) is:

$$P_{success} = C^{3}$$

For instance, considering a 99.9% success rate of completing the fault management procedure at each stage, the probability for the system to become operational is now:

$$P_{success} = 0.999^{3} \approx 0.997 \quad (99.7\%)$$

This impacts the fault coverage described earlier by this factor, and the new availability equation becomes:
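One way to see the effect is to scale the fault coverage by the probability that all three fault management stages succeed. The sketch below is an assumption about how the factor is applied (effective coverage = δ × C³), reusing a model in which covered faults cost the fail over time and uncovered faults cost the 4-hour MTTR; the names and the 8,760-hour year are hypothetical.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: 8,760-hour year
MTTR_S = 4 * 3600.0        # 4-hour field repair, as assumed in the text


def procedure_success(stage_success, stages=3):
    """Probability that detection, isolation, and recovery all succeed."""
    return stage_success ** stages


def effective_coverage(coverage, stage_success, stages=3):
    """Fault coverage reduced by imperfect fault management stages."""
    return coverage * procedure_success(stage_success, stages)


def yearly_downtime_s(mtbf_hours, coverage, failover_s):
    """Expected yearly downtime (seconds) of a redundant pair."""
    failures_per_year = HOURS_PER_YEAR / mtbf_hours
    return failures_per_year * (coverage * failover_s
                                + (1.0 - coverage) * MTTR_S)


p = procedure_success(0.999)
ideal = yearly_downtime_s(200_000, 0.90, 0.25)
degraded = yearly_downtime_s(200_000, effective_coverage(0.90, 0.999), 0.25)

print(f"procedure success: {p:.4%}")
print(f"downtime with ideal fault management:    {ideal:.2f} s/year")
print(f"downtime with 99.9% success per stage:   {degraded:.2f} s/year")
```

Even a 99.9% per-stage success rate adds more than a second and a half of yearly downtime in this example, and the penalty grows quickly as the per-stage figure drops.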

The upshot is clear: it is important to use an HA manager with the highest reliability - that is, one developed with the most rigorous software development process.

SERVICE AVAILABILITY FORUM (SAF)

A good fault management solution begins with good software architecture.

For years, telecom equipment manufacturers developed their own HA middleware to manage their systems and protect against failures. Each vendor developed its own proprietary model under closely guarded intellectual property rights, sometimes even on their own operating systems for greater security and control. More recently, however, as these vendors have worked to become more competitive, they have decided to move to a more open, standards-based way to address these challenges.

With the adoption of ATCA as the base platform for future development, telecom equipment manufacturers are at the same time enabling a new set of tools and standards-based middleware to provide them with a solid foundation for their final application while also minimizing their development costs.

The Service Availability Forum (SAF) was created a few years ago to address the specific HA middleware aspects of the platform. The intent of SAF is not to define an architecture, but rather a set of application program interfaces (APIs) associated with a specific behavior - which is left to the HA middleware developer to create.

HARDWARE PLATFORM INTERFACE (HPI)

One of the specifications of SAF is the Hardware Platform Interface (HPI), which is the main point of access into the hardware platform. HPI provides a standard way to retrieve information available at the hardware level, such as the type and identification of each ATCA blade plugged into the backplane, power on or off, retrieval of thermal sensor data across the chassis, etc.

APPLICATION INTERFACE SPECIFICATION (AIS)

Another set of specifications released by the SAF is the Application Interface Specification (AIS). This interface provides all the necessary hooks into the HA middleware for the application to be managed in a high availability manner.

AVAILABILITY MANAGEMENT FRAMEWORK (AMF)

The most important piece of the HA middleware is the Availability Manager; it is located either on the switch blade or on one of the payload blades and provides the heart of the entire availability framework.

PLATFORM MANAGEMENT SERVICES

SAF parameters assume that platforms spend most of their time in an operational status - which we all know is not always the case. For example, platforms are not operational during provisioning, maintenance, and of course during initial development and debugging.

The methods for accessing the hardware platform during these phases are often overlooked by platform vendors. Having an additional layer of "platform management services" is extremely helpful in facilitating interaction with the system in non-operational modes. One type of platform management interface leverages the serial-over-LAN communication protocol to enable Ethernet access to any CPU, thereby allowing ATCA blades and AMCs to be pre-programmed before booting, firmware to be reprogrammed, etc. Such an approach from one single interface to the platform can be a very powerful tool for platform development and deployment.

Testing

Software reliability depends heavily on testing. Even if software doesn't age in the sense that hardware does, it is constantly undergoing changes either from the type of data that it has to handle (e.g., the famous Y2K problem) or from the maintenance and myriad updates it has to endure.

There is only so much that architects and engineers can do from a system engineering aspect, and there are always cases and scenarios that cannot be foreseen. As a result, thorough testing is used to not only validate the complete design but also to try to expose latent faults by pushing the limits of the software. There are two basic types of software testing.

  • Normal test coverage: The most common and most widely-accepted test coverage is done by verifying each and every one of the software requirements identified at the beginning of the development cycle. These tests are typically positive verifications and flush out about 75% of software faults.
  • Abnormal test coverage: An additional and less common approach to software testing is done by developing new scenarios which are based on non-typical scenarios that force the software to take different routes and thereby uncover new faults.

Applying the 80/20 rule, it is typical that 75% of software faults are discovered during 20% of testing, but it becomes a resource and time delay issue when trying to sort out the remaining 25% of bugs.

It comes down to the following ratio:

The overall cost of faults is estimated as the amount of lost revenue - i.e., service revenue not captured by the operator when the system is out of service - plus the cost of repairing the problem and getting the system back in service.

Likewise, the cost of fixing bugs is estimated as the amount of dollars and delays induced by spending more time and resources on testing the system. System engineers have to evaluate the best value for "T," as nobody wants to spend too much time testing nor does anyone want to pay too much to repair the system once in the field. The inflexion point is when T=1. Beyond that (or similarly when T<1) is a matter of negotiation between internal teams: some want to ship the product as soon as possible in order to start generating revenue, while others want to test it as much as possible to prevent future issues in the field.

An "80/20" rule of thumb is that a system should be tested for 3 to 6 months maximum; any more than that might cause the product to miss its prime market window. In addition, a software product can be considered qualified to enter the release phase when 80% of the faults have been uncovered, meaning when 20% of the testing has been executed. This approach helps evaluate the amount of resources needed for testing a particular product. The difficulty lies in identifying the highest possible number of test cases; from there, we determine the 20% most critical ones, divide them by the number of test days available, and then determine the number of test engineers needed to perform the task.

It is not uncommon to see 10K, 20K, or even 30K potential test cases, but it is not realistic to have 30K test cases performed manually. Indeed, a certain number of test cases should be automated - ideally those which are in the 20% most critical ones. Automating test cases allows engineering to release new code based on issues found in the field and to make sure that no new bugs have been introduced, for which regression testing is conducted.
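The staffing arithmetic described above can be sketched as follows. The 90-day window (the low end of the 3 to 6 months mentioned earlier) and the number of test cases one engineer can run per day are purely hypothetical assumptions, as is the function name.

```python
import math


def staffing_estimate(total_cases, critical_fraction=0.2, test_days=90,
                      cases_per_engineer_per_day=10):
    """Return (critical cases, cases per day, engineers needed).

    cases_per_engineer_per_day is a hypothetical productivity figure.
    """
    critical = round(total_cases * critical_fraction)
    per_day = critical / test_days
    engineers = math.ceil(per_day / cases_per_engineer_per_day)
    return critical, per_day, engineers


for total in (10_000, 20_000, 30_000):
    critical, per_day, engineers = staffing_estimate(total)
    print(f"{total:6d} cases -> {critical:5d} critical, "
          f"{per_day:5.1f} per day, {engineers} engineers")
```

With these assumed figures, 30,000 potential test cases translate into 6,000 critical ones and a team of seven test engineers, which illustrates why automating the critical 20% pays off quickly.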

The 80% of not-so-critical test cases must still be tested, but can be done after the product has been released because the predicted failures are typically latent. These remaining test cases can be conducted manually, and sometimes are done so because of their complexity and/or the requirement for specific fault injections.

Summary

High Availability is a very complex engineering challenge which needs to be addressed at the earliest stages of the development cycle. High availability comes with a cost that needs to be identified and carefully factored into the business model of the services that the platform will provide. Not all applications need 5 or 6 NINES availability, but the ones that do are often the very core services that define the essence of the end-user experience - and hence worthy of the investment in high availability.

ATCA platforms are well-positioned to speed up the development process because of the existence of off-the-shelf high availability software and hardware components. Even so, taking a complete systems view of the availability of the overall product is a non-trivial endeavor that requires deep telecom expertise in order to be done swiftly and with the most efficient use of development resources. Having a telecom-focused partner who is skilled in the art and science of delivering high availability systems is often the most prudent approach to successful and cost-effective product introduction.

About Continuous Computing

Continuous Computing provides integrated systems and services that enable telecom equipment manufacturers to rapidly deploy Next Generation Networks (NGN). Over 150 customers worldwide benefit from the company's unique blend of customized professional services, Trillium protocol software, AdvancedTCA and CompactPCI systems, and BladeCenter hardware. Continuous Computing helps customers reduce platform lifecycle costs, optimize data delivery, and accelerate deployments of NGN, 3G/4G Wireless, and IP Multimedia Subsystem (IMS) infrastructure. The company is ISO-9001 and CMMI certified and based in San Diego with development centers in China and India. For more information, visit www.ccpu.com.

Continuous Computing, the Continuous Computing logo, Create | Deploy | Converge, FlexTCA, Flex21, FlexChassis, FlexCompute, FlexCore, FlexDSP, FlexPacket, FlexStore, FlexSwitch, FlexTCA, Network Service-Ready Platform, Quick!Start, TAPA, Trillium, Trillium+plus, and the Trillium logo are trademarks or registered trademarks of Continuous Computing Corporation. Other names and brands may be claimed as the property of others.


Copyright © 2008 Continuous Computing. All Rights Reserved.  |  +1.858.882.8800 phone  |  www.ccpu.com