Failover and Recovery

ClusterControl is programmed with a number of recovery algorithms to automatically respond to different types of common failures affecting your database systems. It understands different types of database topologies and database-related process management, and uses this knowledge to determine the best way to recover the cluster. Some topology managers only cover cluster recovery and leave node recovery to you; ClusterControl supports recovery at both the cluster and node levels.

There are two recovery components supported by ClusterControl:

  1. Node – Attempts to recover a node to an operational state. Node recovery covers cases where a node was stopped outside of ClusterControl's knowledge, e.g., via a user-issued stop command from an SSH console or a process killed by the OOM killer. See Node Recovery.
  2. Cluster – Attempts to recover a cluster to an operational state. Cluster recovery covers recovery attempts to bring up the entire cluster topology. See Cluster Recovery.

These two components are the most important pieces in keeping service availability as high as possible.

Node Recovery

ClusterControl can recover a database node in case of intermittent failure by monitoring the process and connectivity to the database nodes. For the process, it works similarly to systemd: it will make sure the MySQL service is started and running unless you intentionally stopped it via the ClusterControl UI.

If the node comes back online, ClusterControl will establish a connection back to the database node and perform the necessary actions. The following is what ClusterControl does to recover a node:

  1. It will wait 30 seconds for systemd/chkconfig/init to start up the monitored services/processes.
  2. If the monitored services/processes are still down, ClusterControl will try to start the database service automatically (a manual equivalent is sketched after this list).
  3. If ClusterControl is unable to recover the monitored services/processes, an alarm will be raised.
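
For reference, step 2 is roughly what you would do manually on a systemd-managed host; the unit name below is only an example and depends on the database vendor and version (e.g., mysql, mysqld, mariadb, or postgresql):

# Check whether the monitored process is down, then start it manually
sudo systemctl status mysql
sudo systemctl start mysql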

Note

If a database shutdown is initiated by the user via ClusterControl, ClusterControl will not attempt to recover the particular node at a later stage. It expects the user to start it again via the ClusterControl UI (Start Node) or explicitly by using an OS command.

The recovery includes all database-related services like ProxySQL, HAProxy, MaxScale, Keepalived, PgBouncer, Prometheus exporters, and garbd. Pay special attention to Prometheus exporters, where ClusterControl uses a program called daemon to daemonize the exporter process. ClusterControl will try to connect to the exporter's listening port for health checks and verification. Thus, it is recommended to open the exporter ports from the ClusterControl and Prometheus servers to make sure there are no false alarms during recovery.
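
To verify that an exporter port is reachable from the ClusterControl and Prometheus servers, you can probe the exporter's HTTP endpoint directly. The host and port below are examples; each exporter type listens on its own default port (e.g., 9100 for node_exporter):

# From the ClusterControl or Prometheus server: the exporter should answer with metrics output
curl -s http://db1.example.com:9100/metrics | head -n 3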

Cluster Recovery

ClusterControl understands the database topology and follows best practices in performing the recovery. For a database cluster that comes with built-in fault tolerance like Galera Cluster, NDB Cluster, and MongoDB Replica Set, the failover process is performed automatically by the database server via quorum calculation, heartbeat, and role switching (if any). ClusterControl monitors the process and makes the necessary adjustments, such as reflecting the changes in the Topology view and adjusting the monitoring and management components for the new role, e.g., a new primary node in a replica set.

For database technologies that do not have built-in fault tolerance with automatic recovery, like MySQL/MariaDB Replication and PostgreSQL/TimescaleDB Streaming Replication (see further down), ClusterControl will perform the recovery procedures following the best practices provided by the database vendor. If the recovery fails, user intervention is required, and you will receive an alarm notification about it.

In a mixed/hybrid topology, for example, an asynchronous replica that is attached to a Galera Cluster or NDB Cluster, the node will be recovered by ClusterControl if cluster recovery is enabled.

Cluster recovery does not apply to standalone MySQL servers. However, it’s recommended to turn on both node and cluster recoveries for this cluster type in the ClusterControl UI.

MySQL/MariaDB Replication

ClusterControl supports recovery of the following MySQL/MariaDB replication setups:

  • Primary-replica with MySQL GTID
  • Primary-replica with MariaDB GTID
  • Primary-primary with MySQL GTID
  • Primary-primary with MariaDB GTID
  • Asynchronous replica attached to a Galera Cluster

ClusterControl will respect the following parameters when performing cluster recovery:

  • enable_cluster_autorecovery
  • auto_manage_readonly
  • repl_password
  • repl_user
  • replication_auto_rebuild_slave
  • replication_check_binlog_filtration_bf_failover
  • replication_check_external_bf_failover
  • replication_failed_reslave_failover_script
  • replication_failover_blacklist
  • replication_failover_events
  • replication_failover_wait_to_apply_timeout
  • replication_failover_whitelist
  • replication_onfail_failover_script
  • replication_post_failover_script
  • replication_post_switchover_script
  • replication_post_unsuccessful_failover_script
  • replication_pre_failover_script
  • replication_pre_switchover_script
  • replication_skip_apply_missing_txs
  • replication_stop_on_error

For more details on each of the parameters, refer to the documentation page.
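
As an illustration, a few of these parameters as they might appear in the cluster's cmon configuration file (typically /etc/cmon.d/cmon_<cluster_id>.cnf); the values below are examples only, not recommendations:

# Hypothetical excerpt from /etc/cmon.d/cmon_1.cnf
enable_cluster_autorecovery=1
auto_manage_readonly=1
replication_auto_rebuild_slave=1
replication_stop_on_error=1
replication_failover_whitelist=10.0.0.11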

ClusterControl will obey the following rules when monitoring and managing a primary-replica replication:

  1. All nodes will be started with read_only=ON and super_read_only=ON (regardless of their role).
  2. Only one primary (read_only=OFF) is allowed to operate at any given time.
  3. ClusterControl relies on the MySQL variable report_host to map the topology.
  4. If two or more nodes have read_only=OFF at the same time, ClusterControl will automatically set read_only=ON on all of them to protect against accidental writes. User intervention is required to pick the actual primary by disabling read-only on it.
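
To see how these rules look from the database side, you can check the same variables ClusterControl relies on. A minimal query, using standard MySQL system variables (note that super_read_only is MySQL-specific and not available on MariaDB):

-- Run on each node: only the acting primary should report read_only = 0
SELECT @@global.report_host     AS report_host,
       @@global.read_only       AS read_only,
       @@global.super_read_only AS super_read_only;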

In case the active primary goes down, ClusterControl will attempt to perform the primary failover in the following order:

  1. After 3 seconds of primary unreachability, ClusterControl will raise an alarm.
  2. Check replica availability; at least one of the replicas must be reachable by ClusterControl.
  3. Pick one of the replicas as a candidate to become the new primary.
  4. ClusterControl will calculate the probability of errant transactions if GTID is enabled (a sketch of the underlying GTID check follows this list).
  5. If no errant transaction is detected, the chosen node will be promoted as the new primary.
  6. Create and grant the replication user to be used by replicas.
  7. Change the primary for all replicas that were pointing to the old primary to the newly promoted primary.
  8. Start replication on the replicas and enable read-only.
  9. Flush logs on all nodes.
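
The errant-transaction check in step 4 boils down to GTID set arithmetic. A minimal sketch of the idea, assuming MySQL GTID (this is an illustration, not necessarily ClusterControl's exact implementation):

-- Run on the promotion candidate. A non-empty result means the candidate has executed
-- transactions that the compared node never received, i.e. potential errant transactions.
SELECT GTID_SUBTRACT(
         @@global.gtid_executed,                       -- GTIDs executed on this candidate
         '00000000-0000-0000-0000-000000000000:1-100'  -- example only: paste gtid_executed from another node
       ) AS potential_errant_transactions;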

If the replica promotion fails, ClusterControl will abort the recovery job. User intervention or a cmon service restart is required to trigger the recovery job again.

When the old primary is available again, it will be started as read-only and will not be part of the replication. User intervention is required.
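
One common form of that intervention is to reattach the old primary as a replica of the new primary (or rebuild it from a backup of the new primary). A minimal example, assuming MySQL GTID; the host and credentials are placeholders, and on MariaDB you would use MASTER_USE_GTID=slave_pos instead of MASTER_AUTO_POSITION:

-- Run on the recovered old primary; keep read_only/super_read_only enabled
CHANGE MASTER TO
  MASTER_HOST = 'new-primary.example.com',
  MASTER_USER = 'repl_user',
  MASTER_PASSWORD = '********',
  MASTER_AUTO_POSITION = 1;
START SLAVE;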

PostgreSQL/TimescaleDB Streaming Replication

ClusterControl supports recovery of the following PostgreSQL replication setups:

  • PostgreSQL Streaming Replication
  • TimescaleDB Streaming Replication

ClusterControl will respect the following parameters when performing cluster recovery:

  • enable_cluster_autorecovery
  • repl_password
  • repl_user
  • replication_auto_rebuild_slave
  • replication_failover_whitelist
  • replication_failover_blacklist

For more details on each of the parameters, refer to the documentation page.

ClusterControl will obey the following rules for managing and monitoring a PostgreSQL streaming replication setup:

  • wal_level is set to replica (or hot_standby depending on the PostgreSQL version).
  • The parameter archive_mode is set to ON on the primary.
  • A recovery.conf file is set up on the replica nodes, turning them into hot standbys with read-only enabled.
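
For reference, this standby configuration corresponds to something like the following recovery.conf on a replica (PostgreSQL 11 and earlier; from PostgreSQL 12 onward the equivalent settings live in postgresql.auto.conf together with a standby.signal file). Host and credentials are placeholders:

# recovery.conf on a replica node (illustrative values)
standby_mode = 'on'
primary_conninfo = 'host=primary.example.com port=5432 user=repl_user password=********'
recovery_target_timeline = 'latest'
# hot_standby = on must also be set in postgresql.conf so the node accepts read-only queries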

In case the active primary goes down, ClusterControl will attempt to perform the cluster recovery in the following order:

  1. After 10 seconds of primary unreachability, ClusterControl will raise an alarm.
  2. After a 10-second grace period, ClusterControl will initiate the primary failover job.
  3. Sample the replayLocation and receiveLocation on all available nodes to determine the most advanced node (see the query after this list).
  4. Promote the most advanced node as the new primary.
  5. Stop replicas.
  6. Verify the synchronization state with pg_rewind.
  7. Restart the replicas with the new primary.
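
Step 3 can be reproduced manually with PostgreSQL's built-in functions; a minimal query (the function names below are for PostgreSQL 10 and later, which replaced the older pg_last_xlog_* variants):

-- Run on each standby: the node with the highest receive/replay LSN is the most advanced
SELECT pg_last_wal_receive_lsn() AS receive_location,
       pg_last_wal_replay_lsn()  AS replay_location;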

If the replica promotion fails, ClusterControl will abort the recovery job. User intervention or a cmon service restart is required to trigger the recovery job again.

Attention

When the old primary is available again, it will be forced to shut down and will not be part of the replication. User intervention is required. See further down.

When the old primary comes back online with the PostgreSQL service running, ClusterControl will force a shutdown of the PostgreSQL service. This protects the server from accidental writes, since it would be started without a recovery file (recovery.conf) and would therefore be writable. You should expect lines like the following to appear in postgresql-{day}.log:

2019-11-27 05:06:10.091 UTC [2392] LOG: database system is ready to accept connections
2019-11-27 05:06:27.696 UTC [2392] LOG: received fast shutdown request
2019-11-27 05:06:27.700 UTC [2392] LOG: aborting any active transactions
2019-11-27 05:06:27.703 UTC [2766] FATAL: terminating connection due to administrator command
2019-11-27 05:06:27.704 UTC [2758] FATAL: terminating connection due to administrator command
2019-11-27 05:06:27.709 UTC [2392] LOG: background worker "logical replication launcher" (PID 2419) exited with exit code 1
2019-11-27 05:06:27.709 UTC [2414] LOG: shutting down
2019-11-27 05:06:27.735 UTC [2392] LOG: database system is shut down

PostgreSQL was started after the server came back online at around 05:06:10, but ClusterControl performed a fast shutdown about 17 seconds later, at around 05:06:27. If this is not the behavior you want, you can momentarily disable node recovery for this cluster.