Standby Servers

Manual Switchover

In the event of a primary database server failure, one standby server can be promoted to become the primary. The existing primary will become unusable at this point and must be removed from the ecosystem. There is no switch-back mechanism.

Promote a Standby

To promote a standby server to become the primary, execute the following command:

fmos ecosystem promote

This tool will attempt to connect to the existing primary server and request that it shut down. Doing so reduces the risk of a “split brain” scenario, where two database servers are operating as the primary at the same time. If the existing primary cannot be contacted, a warning will be displayed, but the process can proceed anyway. In most cases, promoting a standby is only done when the primary has failed, so it may not be possible to shut it down safely.

Reconfigure AS and DB Machines

After a standby server has been promoted to primary, all other standby database machines and all application server machines must be pointed to the new primary before they become operational again. This is accomplished by running the fmos ecosystem switchover command on each of those machines, passing it the FQDN of the new primary as a command-line argument. For example:

fmos ecosystem switchover <server name>

The switchover process may fail if the original primary database server is unavailable. If the fmos ecosystem switchover command fails on an application server, try rebooting the AS and running the command again.
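
As a minimal sketch, the switchover call can be wrapped so that a failure prints the recovery hint described above. The function name switchover_to and the FMOS_BIN override variable are hypothetical and not part of FMOS; only the fmos ecosystem switchover command itself comes from this guide.

```shell
# Hypothetical wrapper around "fmos ecosystem switchover".
# FMOS_BIN is an assumed override (e.g. FMOS_BIN=echo) for dry runs.
FMOS_BIN="${FMOS_BIN:-fmos}"

switchover_to() {
    new_primary="$1"
    if [ -z "$new_primary" ]; then
        echo "usage: switchover_to <FQDN of new primary>" >&2
        return 2
    fi
    # Point this machine at the newly promoted primary.
    if ! "$FMOS_BIN" ecosystem switchover "$new_primary"; then
        echo "switchover failed; on an application server, reboot and run the command again" >&2
        return 1
    fi
}
```

Calling switchover_to db2.example.com (a placeholder FQDN) runs the documented command once and returns a non-zero status if it fails, so the caller can decide when to reboot and retry.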

After switching over an application server following an unexpected failure of the primary database machine, the application server may be in a partially degraded state. When the shared filesystem hosted by the original primary DB machine is not available during the switchover process, FMOS needs to forcefully unmount it in order to mount it again from the newly-promoted database machine. This forceful unmount prevents the application server machine from communicating with the original database machine’s IP address, and can cause processes to accumulate. To resolve this situation, any application server machine that switched its superior server while its original superior server was unavailable should be rebooted as soon as possible.

Recover CA Role

Once all application server and database machines have been reconfigured to use the new primary server, the FMOS ecosystem will be operational again. At this point, however, no new machines can be added to the ecosystem as there is no longer a machine with the CA role. Thus, the CA role needs to be recovered onto the newly-promoted database machine.

This process requires a backup of the certificate authority store from the original CA machine. If it is still accessible, a backup can be created using the fmos ca backup command.

If the CA machine is permanently inaccessible and no certificate authority store is available, the FMOS ecosystem will have to be reprovisioned. For this reason, it is imperative to keep a current backup of the certificate authority store in a safe location at all times!
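
One way to notice a stale backup is to check the age of the stored copy on a schedule. The sketch below assumes the backup is kept as a regular file at a known path; the function name backup_is_current and any path passed to it are hypothetical, and the check uses only the standard find utility rather than any FMOS command.

```shell
# Succeeds if the file exists and was modified within roughly the last
# max_age_days days (find -mtime counts whole 24-hour periods).
backup_is_current() {
    backup_file="$1"
    max_age_days="$2"
    # find prints the path only when the file exists and is recent enough.
    [ -n "$(find "$backup_file" -type f -mtime "-$max_age_days" -print 2>/dev/null)" ]
}
```

For example, backup_is_current /srv/backups/ca-store.bak 7 (an illustrative path) returns a non-zero status when the copy is missing or more than about a week old, which a scheduled job can turn into an alert.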

  1. Copy the most recent backup of the certificate authority store to the new primary database machine.
  2. On the new primary database machine, restore the backup, e.g.:

fmos ca restore ${filename}

  3. Execute fmos ecosystem recover-ca to enable the CA role.
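
The restore and recovery commands above can be chained so that the CA role is only enabled if the restore succeeded. This is a sketch that assumes the backup has already been copied to the new primary database machine; the function name recover_ca_from_backup and the FMOS_BIN override variable are hypothetical.

```shell
# Hypothetical helper; run on the new primary database machine.
# FMOS_BIN is an assumed override (e.g. FMOS_BIN=echo) for dry runs.
FMOS_BIN="${FMOS_BIN:-fmos}"

recover_ca_from_backup() {
    backup_file="$1"
    if [ ! -r "$backup_file" ]; then
        echo "backup file not readable: $backup_file" >&2
        return 2
    fi
    # Restore the certificate authority store from the backup.
    "$FMOS_BIN" ca restore "$backup_file" || return 1
    # Enable the CA role only after a successful restore.
    "$FMOS_BIN" ecosystem recover-ca || return 1
}
```

Stopping at the first failure avoids enabling the CA role on a machine whose certificate authority store was not restored.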

Reprovision Original Primary

Once a primary server has been shut down, either safely when another server was promoted or forcefully, it must be reprovisioned before it can join the ecosystem again. FMOS must be reinstalled using removable media, and the appliance must be configured as a new member of an existing ecosystem. It can be added as a standby DB server, and optionally promoted to primary again using the same switchover procedure.

Automatic Failover

FMOS does not provide a mechanism for automatic failover when the primary database server becomes unavailable. Organizations should ensure that appropriate monitoring is in place in order to be notified of communication issues quickly so that a manual switchover can be initiated in a timely manner.
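
As an illustration of the kind of monitoring meant here, the sketch below probes a TCP port on the primary from an external monitoring host. It is not an FMOS feature: the function name primary_reachable is hypothetical, the host and port to probe are assumptions that depend on the deployment, and the check relies on bash's /dev/tcp pseudo-device and the coreutils timeout command being available.

```shell
# Succeeds if a TCP connection to host:port opens within 5 seconds.
primary_reachable() {
    host="$1"
    port="$2"
    # bash opens the connection via /dev/tcp; timeout bounds the wait.
    # Inside the bash -c script, $0 is the host and $1 is the port.
    timeout 5 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}
```

A scheduled job could run primary_reachable db1.example.com 5432 (placeholder host and port) and raise an alert when it returns a non-zero status several times in a row, so that an operator can decide whether to start a manual switchover.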