upSuite Frequently Asked Questions (FAQs)
ubManager
- Software upgrade of applications (registered to UbMgr) causes the UbMgr failure.
The applications upgraded on the active machine should also be upgraded on the standby machine for consistency.
Note: The application upgrade is recommended when the upSuite services are stopped (if applications are dependent on the upSuite).
- The UbMgr does not fail over (application switch back) automatically in the following scenario:
- Shut down the inactive machine first and then the active machine.
- Wait for some time and then start the inactive machine.
- If the inactive machine is not active, use the 'ubmgr active' command in the console.
upBeat
- Explain the exact handshakes that happen as part of the heartbeat.
An upbeat handshake is a three-way active handshake. It normally takes 5 seconds for an upbeat handshake to come up and start assigning roles to services and registering them upon request; this time is configurable. When the upbeat daemon starts, it sends a heartbeat announcing its UP status to the peer daemon (as specified in the configuration file). The peer node (if active) then sends back its own status along with information about the services running on it. Depending on that information, the daemon brings the system to the appropriate state and sends back an acknowledgement.
- What does the Upbeat heartbeat message contain (a simple message, which indicates that the node is UP or it also contains service related information)?
The Upbeat heartbeat packet contains the following information, which is required to keep track of the status of the link or node as well as the services on the peer nodes:
- Node ID
- Link ID
- Service ID
- Service IP
- Directive messages (if any), for example to make the other node active or standby.
- What happens when two nodes register for a service at the same time? How is it determined, which service will become ACTIVE?
The split brain condition results when two nodes register for a service at the same time. This will be handled by Upbeat as per the policy specified in the configuration file (that is, value of SPLIT_RES attribute). If no value is specified, the default (that is, LOWEST_NODE_ID) is taken into consideration and the service on the node with lowest ID will be registered as ACTIVE.
- What happens when the Upsuite goes out of sync on the nodes?
If two nodes go out of sync, both will become ACTIVE, as neither node recognises that the other is alive. Later, when they sync again, split brain resolution takes place according to the SPLIT_RES policy in the upSuite.conf file. For example, if SPLIT_RES is “HIGHEST_15MIN_LOAD”, the node with the highest 15-minute load average becomes ACTIVE and the other one becomes STANDBY.
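The effect of the two SPLIT_RES policies named above can be illustrated with a minimal Python sketch. This is not Upbeat's implementation: the Node class, the pick_active function, and the sample values are hypothetical, and the sketch assumes that ties do not occur.

```python
from dataclasses import dataclass


@dataclass
class Node:
    node_id: int        # node ID (hypothetical sample value)
    load_15min: float   # 15-minute load average (hypothetical sample value)


def pick_active(nodes, policy="LOWEST_NODE_ID"):
    """Return the node that would become ACTIVE under the given policy.

    Only the two policy names mentioned in this FAQ are modelled.
    """
    if policy == "LOWEST_NODE_ID":
        return min(nodes, key=lambda n: n.node_id)
    if policy == "HIGHEST_15MIN_LOAD":
        return max(nodes, key=lambda n: n.load_15min)
    raise ValueError(f"policy not covered by this sketch: {policy}")


left = Node(node_id=1, load_15min=0.40)
right = Node(node_id=2, load_15min=1.75)

print(pick_active([left, right]).node_id)                        # 1
print(pick_active([left, right], "HIGHEST_15MIN_LOAD").node_id)  # 2
```

In both cases the node that is not selected would take the STANDBY role.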
- What happens when the /opt/upsuite/bin/ups shows the link status as DOWN?
- If the ups reports the link status as DOWN, try pinging that link manually. If there is no answer to the ping, check your LAN cables and switches.
- If the link is fluctuating between UP and DOWN, check the heartbeat TIMEOUT_MSEC setting in the upSuite.conf file. Ensure that this setting is not too short for the network.
- What happens when the messages indicate the DISKS as busy?
If you encounter the message below, the disk is not responding:
Oct 3 09:52:41 left upbeat[418]: disk or partition '/dev/rdsk/c0t0d0s0' is not responding
If the investigation reveals that there are no problems with your disk or SCSI bus, the SCSIbeat settings in the upSuite.conf file are probably too low. Increase the values of the TIMEOUT_MSEC and FREQ_MSEC attributes beneath the "PARTITION" subtag of the "NODE" tag.
- What happens when the UpBeat is not working properly and error messages appear in /var/log/upsuite?
If there are error messages in /var/log/upsuite, ensure that the system configuration defined in the configuration file (upSuite.conf) matches the actual configuration. To do so, run the 'ifconfig -a' command at the Solaris prompt and compare its output with the configuration file. After configuration, the UpBeat status can be monitored through 'syslog'. By default, the upBeat daemon sends all output to the syslog facility unless it is running in debug mode, in which case it sends all output to the terminal.
upLink
Note: For all NIC (Network Interface Card) devices using 'ce' drivers, the upLink needs to be upgraded to version 2.0.0.
- Link failure error: upLink: Standby link failure - not receiving heartbeats.
The modified 'rc' script segment patch included in release 2.0.1 solves this by introducing a 'sleep' between each of the configurations.
- The bge interface displays the error (Configuration not valid) and the upLink does not get configured at the startup of the machine.
If more than one instance of the upLink interface is to be configured, append one line per instance to '/kernel/drv/upLink.conf', as follows:
name="upLink" parent="pseudo" failovertime=1000 instance=0;
name="upLink" parent="pseudo" failovertime=1000 instance=1;
name="upLink" parent="pseudo" failovertime=1000 instance=2;
Note: For each instance of the upLink interface, one line should be present in the '/kernel/drv/upLink.conf' file. After editing the file, the machine should be rebooted.
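As a rough aid, the one-line-per-instance rule can be checked with a short Python sketch. The helper names uplink_conf_lines and missing_instances are hypothetical and not part of upSuite; the line format and the failovertime value of 1000 are copied from the example above.

```python
def uplink_conf_lines(instances, failovertime=1000):
    """Return one upLink.conf line per instance number 0..instances-1."""
    return [
        f'name="upLink" parent="pseudo" failovertime={failovertime} instance={i};'
        for i in range(instances)
    ]


def missing_instances(conf_text, instances):
    """Return the instance numbers that have no upLink line in conf_text."""
    present = set()
    for line in conf_text.splitlines():
        line = line.strip()
        if line.startswith('name="upLink"') and "instance=" in line:
            value = line.split("instance=")[1].rstrip(";").strip()
            if value.isdigit():
                present.add(int(value))
    return [i for i in range(instances) if i not in present]


lines = uplink_conf_lines(3)
print("\n".join(lines))

# A file that only contains the instance=0 line lacks instances 1 and 2.
print(missing_instances(lines[0], 3))  # [1, 2]
```

To check a real system, the text of '/kernel/drv/upLink.conf' would be passed to missing_instances instead of the generated sample.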
- The qfe interface links are not active after the power cycle switch back between two interfaces.
Disable the auto-negotiation option on both the interfaces to solve this problem.
- Kernel Panic because of the 'recursive mutex_enter' in the 'em_proto' module of the upLink.
This bug has already been fixed in release 2.0.1.
- How to check whether the packets are correctly delivered between the uplink interfaces?
The following commands can be used to find this state:
1. The 'ifconfig upLink0' command should display the correct IP address and the UP flag as 'TRUE'.
2. The 'netstat -i' command shows whether the interface is active and whether any errors (collisions, packet loss, and so on) have occurred.
3. The 'netstat -rn' command checks whether the routing table is correct.
- How to check whether the upLink does not fail after the system reboot?
The files hostname.upLink0 and /etc/upsuite/upLink0.conf should be configured as described in the Installation and Configuration files. These files contain settings that ensure the upLink configuration survives a reboot of the system.
- How to check the status of the upLink instance?
The 'ifconfig upLinkN' command is used to check the status of the upLink instance. If it fails, the upLink instance has not been properly configured. For example, on running 'ifconfig upLink1 plumb', the following message is displayed in the console:
"ifconfig: SIOCSLIFNAME for ip: upLink1: no such interface"
This message indicates that the upLink setting 'instance=1' is missing from '/kernel/drv/upLink.conf'. Edit the configuration file, then unload and reload the upLink instance so that the file is re-read. For more details, please refer to the upLink User Manual.
- How to check the configuration of the upLink instance?
The 'uplink-config -c' command is used to check the configuration of the upLink instance. If it fails, the following message is displayed in the console:
"upLink0 = A: None + B: None"
The 'upLink-config -c' command must not be performed before the physical interfaces are configured. Configure them by using the -a and -b options, and then use the -c option.
- How to check whether the upLink interface is functioning?
The 'ping' command (test a virtual interface by pinging an external address) is used to check the functioning of the upLink interface. If it fails, the following message is displayed in the console:
"ICMP Host Unreachable from gateway localhost (127.0.0.1)for icmp from localhost (127.0.0.1) to 192.168.72.1"
The upLink and the interfaces might appear to be running, but the interface was never initiated, so no entries have been made in the routing table. Run the 'ifconfig upLinkN up' command as part of the configuration procedure.
- How to check the link status of all upLink instances?
The 'uplink-config -g' command is used to check the link status of all the upLink interfaces. If it fails or displays the status of 'instance 0' only, upgrade the interface driver to one that supports the 'ndd' feature. The upLink requests 'ndd' to retrieve the link status from other Ethernet drivers. This is supported for the Solaris Ethernet drivers on the SPARC platform, but not for most of the gld-based drivers on Solaris x86.
- How to check whether the upLink instances are receiving the heartbeats properly?
Use the 'dmesg' command, or see /var/adm/messages, to check for the proper delivery of heartbeat messages between the upLink interfaces. If delivery fails, the following message is displayed in the console or in the logs:
"upLink0: Standby link failure - not receiving heartbeats (A)"
The standby link is not receiving any heartbeats from the active link, or vice versa. This latent fault prevents the standby link from assuming the active role.
upDisk
- 'ipfs_diff' (Repair process) fails continuously many times or Updisk restarts without any reason.
First, do the following:
1. Run 'ls -iR' on the root directory of the dataset, while keeping the standby system offline.
2. Bring up the standby system.
If the problem still persists, do the following:
1. Stop the Updisk process.
2. Mount the Updisk monitored dataset onto a temporary directory.
3. Delete the '.ipfs' directory.
4. Repeat the same for all the other datasets on that machine.
5. Do the same on the standby machine.
6. Run 'fsck' and restart the Updisk on both machines.
- Updisk does not wait for all the dependent applications to close.
The modified 'rc' script segment patch included in release 2.7.0 provides the solution.
- Operations on the dataset are slower while Updisk repair process is happening.
In tests performed in our lab, a repair usually takes 10 to 15 minutes for a dataset of around 25 to 30 GB. If a repair takes longer than this, report it to support@ccpu.com.
- When logged in as a user other than root (for example, Oracle), some commands do not work.
This credential issue has already been fixed; kindly upgrade to the new package 2.7.0.
- Kernel panic with the ipfs_close() API as the cause of the panic.
The bug in the traversal method has already been fixed; kindly upgrade to the new package 2.7.0.
- Unable to perform /bin/pwd or /bin/find in directories replicated by Updisk after successive backup/restore procedures using the 'ufsrestore' and the 'ufsdump'.
The recommended backup/restore procedure:
- As a precaution, 'fsck' all the filesystems after completing 'ufsrestore' and before starting the Updisk.
- Remove the '.ipfs' subdirectory in the UFS root directory of every Updisk dataset, before starting the Updisk.
- Immediately after starting the Updisk and before starting any IPFS filesystem activity, run 'ls -iR' in the highest-level IPFS directories (such as /opt/abc), that is, the IPFS directories whose parent directories are UFS directories.
- Updisk State Mismatch: 'udstat' command reports are inconsistent on active and standby machines.
This bug has already been fixed in the version v2_5_4r01.
- 'ipfs_diff' fails to execute on a valid dataset (v2_3_4r00).
This bug has already been fixed. Kindly upgrade to UpSuite 2.3.4r01 for Solaris-9 or UpSuite 2.6.0r01 for Solaris-10.
- Problem in bringing the Systems Online.
When the connection is established for the first time, internal 'ipfs' consistency checks are made between the two systems. If these checks result in problems that are not urgent, 'ipfs' will continue to function. Otherwise, 'ipfs' drops the communication link and informs Updisk of the problem.
- Problem with the startup exchange.
"Oct 3 09:52:41 left unix: /ipfs: GDAY failed - losing link"
This message indicates that a network error has occurred during the initial exchange; Updisk will attempt to restore the connection immediately. The network error should be addressed, but Updisk will continue to provide links until connection is available.
"Oct 3 09:52:41 left unix: /ipfs: GDAY wrong - losing link"
This message indicates that the initial exchange was invalid or corrupt. Updisk will attempt to restart the services. If the problem persists, it generally indicates that another application is using the remote port and Updisk is not running on that system.
- Messages in /var/adm/messages or in /var/log/upsuite that require Operator intervention.
"Oct 3 09:52:41 left unix: /ipfs: role link state - dropping link, operator intervention for (split) required"
This message indicates that the 'ipfs' cannot proceed with operations until the problem has been fixed. The link is dropped, and the reason that operator intervention is required is shown in the Updisk status display.
- Active/Standby state mismatches and clashes are shown in logs.
"Oct 3 09:52:41 left unix: ROLE CLASH - local ACTIVE remote ACTIVE"
"Oct 3 09:52:41 left unix: ROLE CLASH - local STANDBY remote STANDBY"
The messages above indicate that both systems are in the same role (both active or both standby) and the 'ipfs' cannot proceed. Updisk will assume a split brain condition. Operator intervention is required before the operations can proceed.
- Warning messages about the underlying file system.
"Oct 3 09:52:41 left unix: DIRTY detected - active SYNC standby DIRTY"
This message indicates that the 'ipfs' has detected an underlying file system that is “dirty” (in need of repair) and that the 'ipfs' is expected to be in repair mode, but it is not. In other words, if either of the file systems is dirty and the 'ipfs' is not in repair mode, the above message is generated.
Note: There is, however, one circumstance under which the above message is displayed unnecessarily. If a link drops during the replay, Updisk restarts the replay. The 'ipfs' does not have enough information to know that this is the case and thus warns that the standby machine is dirty. In each of the cases described above, no action is required.
"Oct 3 09:52:41 left unix: NEW FS detected - active OLD standby NEW"
This message indicates that a new file system has been detected and the 'ipfs' is not in repair mode.
- Problem with the mounting of the 'ipfs'.
During the mount time, when the 'ipfs' detects that the file system is busy (having operations that are not acknowledged by the standby machine), the following is reported:
"Oct 09 18:38:30 port /ipfs: startup down summary - previously BUSY, setting DIRTY"
This message indicates that the active system was switched off or crashed while it had outstanding operations. Thus, a repair should be performed.
- Display pattern of Operational Messages.
The operational messages are sent to the console during normal Updisk operations. The role, link, and state of any given mount point are displayed with each output message. Additional informational comments may appear to the right of the role, link, and state fields, separated by a hyphen. Operational messages appear in the following format:
"servername: mountpoint: role link state - status messages"
"role: active, standby, or startup"
The role field indicates whether this file system has assumed the active, standby, or startup role.
"link: up or down"
The link field can either be in the up or the down status.
"state: normal, repair, replay, or summary"
The state field indicates the state of the system. Please refer to the UpSuite user manual for more details.
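The message format described above lends itself to simple parsing when scanning logs. The following Python sketch is one possible way to split a message into its fields; the regular expression and the parse_op_message helper are illustrative assumptions, not part of Updisk, and only the role, link, and state values listed above are recognised.

```python
import re

# Matches "mountpoint: role link state - status messages" inside a log line.
OP_MSG = re.compile(
    r"(?P<mountpoint>/\S*): "
    r"(?P<role>active|standby|startup) "
    r"(?P<link>up|down) "
    r"(?P<state>normal|repair|replay|summary)"
    r"(?: - (?P<status>.*))?$"
)


def parse_op_message(line):
    """Return the fields of an operational message, or None if no match."""
    match = OP_MSG.search(line.strip())
    return match.groupdict() if match else None


sample = ("Aug 8 08:03:24 yoursys: /ipfs: standby down repair - "
          "operator intervention now required to fix (split)")
fields = parse_op_message(sample)
print(fields["role"], fields["link"], fields["state"])  # standby down repair
print(fields["status"])
```

Lines that do not follow the format, such as the ROLE CLASH messages, yield None and can be handled separately.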
- How to identify whether the Updisk link is 'up' or 'down'?
The 'udstat [dataset]' command shows the status of the current link.
- Comments or status messages.
The comments or status messages to the right of the role, link, and state fields, separated by a hyphen, inform about changes to the role, link, or state. These messages are purely informational and simply intend to inform about the system's condition. If an error is reported, the Updisk can, in most cases, solve the problem; but if the error requires operator intervention, messages will be sent to the console and logs. Additionally, if the 'udstat' command is used, its output indicates that operator intervention is required.
- Reasons for dropping the communication link between Updisk daemons.
The communication link is dropped under the following circumstances:
• Operator shutdown: Updisk or Upbeat has detected that a link has failed and instructed the 'ipfs' to stop accessing it; or, the operator has intentionally dropped the link.
Also be aware that a link can be dropped as a side effect of the termination of the link daemon when the system shuts down, when the Updisk service is stopped, or when the daemon is terminated directly.
• Link error: The 'ipfs' has detected a link error before Updisk or Upbeat informs the 'ipfs' about it.
• Active error: An error has occurred during the normal operations on the active system. The link is down and the Updisk will failover to the standby system. In this scenario, depending on the operational messages displayed, fix the problem on the active machine and then perform a repair operation with the 'udrepair' command.
• Standby error: An error has occurred during the normal operations on the standby system. In this scenario, depending on the operational messages displayed, fix the problem on the standby machine and then perform a repair operation with the 'udrepair' command.
- Reasons for the Active/Standby errors.
The active server will fail only if there is an operational or disk error (where the active server will failover to the standby server); also, the standby server will fail, if it runs out of disk space, encounters disk errors, or other operational difficulties. If both the systems cannot perform the operations correctly, the Updisk will drop the link and the repair will be required. Therefore, it is important to determine which system is causing the error before trying to fix it. Please refer to the UpSuite user manual for more details.
- Miscellaneous Transmission errors.
Rarely, a protocol error can occur between the two Updisk servers. An XOP error will appear on the console and in /var/adm/messages to report the problem. Normally, the Updisk will reset the network link, fix any problems, and continue. For XOP errors, do the following:
1. Investigate the health of the standby system, because XOP errors often indicate a bad disk or other problems on the standby.
2. Verify the health of the system by using the 'udstat' command, because if an XOP error is present, the Updisk usually drops the link. Bring up the new link, which performs the repair automatically.
3. Even if the above seems to solve the problem and the system appears healthy, kindly send the XOP error to Continuous Computing’s Technical Support at 'support@ccpu.com' along with the /var/adm/messages from both the active and the standby servers, so that the error can be investigated.
- On performing operations on the 'ipfs', the console replies - "File System Access Denied".
If the access to the file system is denied, the dataset may be designated as standby by the Updisk. To verify, use the 'udstat /ipfs' command at the console. Note that the standby machine is read-only and, therefore, all changes must be made to the active machine.
- Problem with the File System (Out of Disk Space).
If one of the datasets is out of disk space, clear files on the active system to free space and then perform a repair (even if it is the standby disk that is out of space).
Note 1: If the standby dataset’s disk is out of space, the active system will continue to provide full service, but files will not be replicated from the active dataset to the standby dataset. If the active system’s disk is out of space, read, delete, and overwrite operations on the data can still be performed, but files or directories cannot be created.
If one of the datasets is out of disk space, a message similar to one of the following will appear on the console:
Sep 28 17:16:37 port ipfs: [ID 941318 kern.notice] /i: The standby is out of space and must be fixed.
Sep 30 03:16:03 left unix: WARNING: /ipfs: File system full
Sep 27 10:59:45 left unix: NOTICE: alloc: /ipfs: file system full
Note 2: Under some circumstances, which depend on the order in which the files were created and deleted, the repair may not be able to complete. In that case, complete the repair as follows:
1. Take the standby system offline.
2. Remove files from the standby partition to free the disk space.
3. Restart Updisk; the 'ipfs' performs the repair automatically.
- Problem with the 'ipfs' (Unmount Unsuccessful).
The 'ipfs' does not unmount under certain circumstances, such as:
1. The user is currently working under the Updisk monitored dataset.
2. Some processes are using data monitored by the Updisk.
3. The Updisk is performing some repair operations.
In these cases, the '/etc/init.d/Updisk stop' operation does not complete successfully. After the problem is resolved (by moving out of the Updisk monitored dataset, stopping the process that uses the Updisk monitored data, or waiting until the Updisk completes its repair operation), perform the '/etc/init.d/Updisk start' operation to remount the file systems and restart the Updisk.
- Problem with the virtual file system (Conflicting File Modification Times (mtime)).
Under normal operations, file modification times may appear to differ by up to one second between the active and standby systems (the modification times may differ by only tens of milliseconds, but some system utilities round this number up to the nearest second). This is due to the operations of the virtual file system, so file modification times may not appear identical on the active and the standby systems. However, the order of the file modification times should be identical on both systems. Therefore, utilities that check, sort, and compare file modification times should produce equivalent results. For example, if the UNIX touch command is used to set file modification times, their order remains the same on the active and the standby systems.
- Possible reasons for failure of the 'ipfs' repair operation.
If the 'ipfs' repair fails for any reason, the Updisk tries to repair again immediately. Every time a repair fails, the Updisk waits longer than before the previous attempt, until the interval reaches a maximum of one attempt every four minutes; the attempts then continue indefinitely, or until the repair completes successfully. If needed, manually initiate the repair by running the 'udrepair' command on the active system. If the repair has already been started by the Updisk, the message "The repair is already in progress" is displayed.
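The retry behaviour described above (an immediate first retry, growing waits, and a ceiling of one attempt every four minutes) can be sketched in Python as follows. The exact growth of the wait is not documented here, so the initial wait of 15 seconds and the doubling factor are purely illustrative assumptions, and repair_delays is a hypothetical name.

```python
from itertools import islice

MAX_DELAY_SECONDS = 4 * 60  # ceiling: one attempt every four minutes


def repair_delays(first_wait=15, factor=2):
    """Yield the wait in seconds before each successive repair retry.

    first_wait and factor are assumptions made for this sketch only.
    """
    yield 0  # the first retry happens immediately
    wait = first_wait
    while True:
        yield min(wait, MAX_DELAY_SECONDS)
        wait = min(wait * factor, MAX_DELAY_SECONDS)


print(list(islice(repair_delays(), 8)))
# [0, 15, 30, 60, 120, 240, 240, 240]
```

The generator never terminates on its own, which mirrors the statement that attempts continue indefinitely until a repair succeeds.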
- Split Brain Conditions.
Normally, any High Availability (HA) environment is at risk for the split brain condition. A split brain condition can occur under the following two conditions:
1. Extreme traffic and network conditions.
2. The heartbeat timeout is set too low in the UpSuite.conf file (because of continuous timeouts, the Updisk determines the peer to be down and acquires the active role).
The split brain condition occurs when both the active and the standby datasets assume the active role. After the split brain is detected, the Updisk shuts down the service and then signals a split brain condition both to the operator and to the peer. Under normal operating circumstances, this can only happen after a double failure (multiple Updisk restarts, operation transmission failure, and so on) if only two links are used, or after multiple communications failures if more than two links are used. After one of the conditions described earlier has been detected, a message similar to the following is displayed on the console:
Aug 8 08:03:24 yoursys: /ipfs: standby down repair - operator intervention now required to fix (split)
To verify whether the split brain condition still exists, perform the 'udstat' command. The output from this command displays whether the split brain condition is rectified or requires operator intervention.
Note: The UpSuite HA does not take automatic corrective action once recovery from a split brain condition is detected. The operator or application developer needs to determine which dataset(s) is most accurate.
- Recovering from split brain condition (DO's).
When the Updisk server pair (active and standby) detects a dual active condition, both servers are flagged as split brain and taken offline. As a result, there is no active service. If either server is flagged as split brain, the Updisk flags the other server as split brain as well, as soon as they come into contact. Therefore, to fix a split brain condition, the operator must intervene on both servers. If the split-brain flag is cleared on only one server, then on bringing the two systems into contact, the Updisk detects the remaining split brain flag on the other server and automatically re-flags the fixed server as split brain. There are two ways to effectively intervene on both servers to fix a split-brain condition:
• Force one server to be active and let the Updisk force the other server to be standby. This is the simplest solution.
• Explicitly force one server to be active and the other to be standby.
- Recovering from split brain condition (DONT's).
Avoid the following two actions when trying to fix a split brain condition; otherwise, they will lead to a new split brain condition.
• Performing the 'udactive -A -f' command on both servers (A and B) while they are unable to communicate with each other. Later, when both servers are able to communicate, they both try to acquire the active role forcefully (because of the "-f" option).
• Performing the 'udactive -A -f' command on server A while server B is shut down, offline, or unable to communicate with server A, and then rebooting server A or restarting the Updisk on server A. Later, server A forgets that the '-f' option was performed, creating a new split brain when B is operational again.
- Failover problems.
If one or both of the servers is down:
1. Determine which server has the most recent data by determining which server shut down first or last, by inspecting the relevant files and system logs, or by performing the 'udstat -hh' command.
2. Perform the 'udactive -f dataset' command on the server with the most recent data to force it to become active. The 'ipfs' repair will start automatically.
