Failover and Recovery
ClusterControl is equipped with automated recovery algorithms designed to address common failures in database systems. It understands various database topologies and process management, enabling it to determine the optimal recovery strategy for a cluster.
ClusterControl offers two key recovery components:
- Node Recovery: This attempts to restore a node to an operational state, covering issues such as a node being stopped outside of ClusterControl's awareness (e.g., via a user-initiated SSH stop command or an OOM process kill). See Node Recovery.
- Cluster Recovery: This component aims to bring the entire cluster topology back to an operational state. See Cluster Recovery.
These two components are crucial for ensuring the highest possible service availability.
Attention
ClusterControl provides automatic failover and recovery for nodes within a single cluster. However, this functionality does not extend to replication between separate clusters. Consequently, automatic recovery is not supported for cluster-to-cluster replication. In the event that the primary cluster in Data Center A becomes unavailable, user intervention is required to promote the secondary cluster in Data Center B as the new active cluster. See Create Replica Cluster.
Node Recovery
ClusterControl recovers database nodes from intermittent failures by monitoring both the database process and its connectivity. Similar to systemd, it keeps the database service running unless it was intentionally stopped via the ClusterControl GUI or CLI. Recovery also covers cluster-related services such as ProxySQL, HAProxy, MaxScale, Keepalived, PgBouncer, Prometheus exporters, and garbd. Prometheus exporters deserve special attention: ClusterControl connects to each exporter's listening port for health checks and verification, so the exporter ports must be reachable from both the ClusterControl host and the Prometheus server to avoid false alarms during recovery. See Firewall and security groups for details.
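For example, on a node protected by firewalld, the exporter ports could be opened along the lines of the sketch below. The port numbers are common defaults (9100 for node_exporter, 9104 for mysqld_exporter, 9187 for postgres_exporter) and are assumptions that may differ in your deployment:

```bash
# Minimal sketch using firewalld -- adjust the ports to match your exporters.
# 9100 = node_exporter, 9104 = mysqld_exporter, 9187 = postgres_exporter (common defaults)
sudo firewall-cmd --permanent --add-port=9100/tcp
sudo firewall-cmd --permanent --add-port=9104/tcp
sudo firewall-cmd --permanent --add-port=9187/tcp
sudo firewall-cmd --reload
```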
When a node comes back online, ClusterControl re-establishes the connection (using a persistent SSH session) and performs the necessary recovery actions:
- It waits 30 seconds for systemd/init script to start monitored services/processes.
- If services/processes remain down, ClusterControl automatically attempts to start the database service.
- If ClusterControl is unable to recover the monitored services/processes, an alarm is raised.
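Conceptually, this behaves like the shell sketch below. It is illustrative only; ClusterControl implements the logic internally, and the service name mysql is just a placeholder:

```bash
# Illustrative sketch of the node-recovery flow described above (not ClusterControl's actual code).
sleep 30                                    # give systemd/init a chance to start the service
if ! systemctl is-active --quiet mysql; then
    systemctl start mysql                   # attempt automatic recovery
    systemctl is-active --quiet mysql || echo "recovery failed -- raise alarm"
fi
```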
Note
If a user initiates a database shutdown through ClusterControl GUI or CLI, ClusterControl will not automatically attempt to recover that specific node later. The user is expected to restart it either via the Start Node action in the ClusterControl GUI (or ClusterControl CLI), or by explicitly using an operating system command.
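For example, restarting the node with an operating system command might look like the following; the actual unit name (mysql, mysqld, mariadb, postgresql-14, and so on) depends on your distribution and database vendor:

```bash
# Example only -- use the service unit name that matches your setup.
sudo systemctl start mysql
```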
Cluster Recovery
ClusterControl excels at database recovery by understanding the topology and adhering to best practices. For clusters with built-in fault tolerance (like Galera, NDB Cluster, Redis Cluster, MongoDB Replicaset, and Elasticsearch), failover is automatic, managed by the database server through quorum calculations, heartbeats, and role switching. ClusterControl monitors this process, updating the Topology view and adjusting monitoring and management components for new roles, such as a new primary node in a replica set.
For database technologies lacking built-in fault tolerance and automatic recovery, such as MySQL/MariaDB Replication and PostgreSQL/TimescaleDB Streaming Replication, ClusterControl performs recovery procedures based on vendor best practices, as explained further down. Should recovery fail, user intervention is required, and an alarm notification will be issued.
In mixed or hybrid topologies, like an asynchronous replica attached to a Galera or NDB Cluster, ClusterControl will recover the node if cluster recovery is enabled.
While cluster recovery does not apply to standalone MySQL servers, it is still recommended to enable both node and cluster recoveries for this type of cluster in the ClusterControl GUI.
MySQL/MariaDB Replication
ClusterControl supports recovery of the following MySQL/MariaDB replication setups:
- Primary-replica with MySQL GTID
- Primary-replica with MariaDB GTID
- Primary-primary with MySQL GTID
- Primary-primary with MariaDB GTID
- Asynchronous replica attached to a Galera Cluster
ClusterControl will respect the following parameters when performing cluster recovery:
- `enable_cluster_autorecovery`
- `auto_manage_readonly`
- `repl_password`
- `repl_user`
- `replication_auto_rebuild_slave`
- `replication_check_binlog_filtration_bf_failover`
- `replication_check_external_bf_failover`
- `replication_failed_reslave_failover_script`
- `replication_failover_blacklist`
- `replication_failover_events`
- `replication_failover_wait_to_apply_timeout`
- `replication_failover_whitelist`
- `replication_onfail_failover_script`
- `replication_post_failover_script`
- `replication_post_switchover_script`
- `replication_post_unsuccessful_failover_script`
- `replication_pre_failover_script`
- `replication_pre_switchover_script`
- `replication_skip_apply_missing_txs`
- `replication_stop_on_error`
For more details on each of the parameters, see Configuration Options.
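As an illustration, a few of these options might appear in the cluster's CMON configuration file, typically `/etc/cmon.d/cmon_<cluster_id>.cnf`. The values below are hypothetical examples, not recommendations, and the exact value formats should be checked against the Configuration Options reference:

```
# Hypothetical excerpt from /etc/cmon.d/cmon_1.cnf -- example values only
enable_cluster_autorecovery=1
auto_manage_readonly=true
repl_user=rpl_user
replication_auto_rebuild_slave=1
replication_failover_whitelist=10.0.0.11
replication_stop_on_error=1
```

Changes to this file generally require a restart of the cmon service before they take effect.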
ClusterControl adheres to specific rules when monitoring and managing primary-replica replication:
- All nodes, regardless of their role, are started with `read_only=ON` and `super_read_only=ON`.
- Only one primary (`read_only=OFF`) is permitted at any given time.
- The MySQL variable `report_host` is used to map the topology.
- If multiple nodes have `read_only=OFF` concurrently, ClusterControl automatically sets `read_only=ON` on all of them to prevent accidental writes. Manual intervention is then required to designate the actual primary by disabling read-only.
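If ClusterControl has set `read_only=ON` everywhere because it found more than one writable node, the intended primary can be designated manually, for example with the statements below (host and credentials are placeholders; note that `super_read_only` exists in MySQL but not in MariaDB):

```bash
# Inspect the current flags on the intended primary, then make it writable.
mysql -h 10.0.0.11 -uroot -p -e "SHOW VARIABLES LIKE '%read_only%'"
mysql -h 10.0.0.11 -uroot -p -e "SET GLOBAL super_read_only = OFF; SET GLOBAL read_only = OFF"
```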
In the event of an active primary going offline, ClusterControl attempts a primary failover with the following sequence:
- An alarm is raised after the primary has been unreachable for 3 seconds from multiple vantage points, including the ClusterControl node, the replica nodes, and any load balancer nodes.
- Replica availability is checked, ensuring at least one replica is reachable by ClusterControl.
- A replica is chosen as a candidate for promotion to primary.
- If GTID is enabled, ClusterControl calculates the probability of errant transactions (see the sketch after this list).
- If no errant transactions are detected, the chosen replica is promoted as the new primary.
- A replication user is created and granted the privileges required by the replicas.
- All replicas previously pointing to the old primary are reconfigured to point to the newly promoted primary.
- The replicas are started.
- Logs are flushed on all nodes.
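The errant-transaction check is conceptually a comparison of GTID sets: anything executed on the candidate replica that is unknown to the other hosts is errant. A simplified manual check for MySQL GTID (MariaDB GTID works differently) might look like this, where the reference GTID set is a placeholder:

```bash
# Non-empty output indicates errant transactions on the candidate replica.
# '<gtid_executed_of_other_hosts>' is a placeholder for the reference GTID set.
mysql -e "SELECT GTID_SUBTRACT(@@GLOBAL.gtid_executed, '<gtid_executed_of_other_hosts>') AS errant_gtids"
```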
If the replica promotion fails, the recovery job is aborted by ClusterControl. To re-trigger the recovery job, user intervention or a ClusterControl service restart is necessary.
When the old primary becomes available again, it is started in read-only mode and is no longer part of the replication. User intervention is required to reincorporate the old primary into the replication by restaging it as a replica, or to remove the node from the cluster using ClusterControl.
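Restaging can be done from ClusterControl (for example, a rebuild-replica job), or by hand once the old primary's data is known to be consistent with the new primary. A minimal manual sketch, assuming MySQL GTID and placeholder host and credentials:

```bash
# Manual sketch only -- assumes the old primary's data is already consistent with the new primary
# (otherwise rebuild it from a backup first). Host, user, and password are placeholders.
mysql -e "CHANGE MASTER TO MASTER_HOST='10.0.0.12', MASTER_USER='rpl_user', MASTER_PASSWORD='secret', MASTER_AUTO_POSITION=1"
mysql -e "START SLAVE"
```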
PostgreSQL/TimescaleDB Streaming Replication
ClusterControl supports recovery of the following PostgreSQL replication setups:
- PostgreSQL Streaming Replication
- TimescaleDB Streaming Replication
ClusterControl will respect the following parameters when performing cluster recovery:
- `enable_cluster_autorecovery`
- `repl_password`
- `repl_user`
- `replication_auto_rebuild_slave`
- `replication_failover_whitelist`
- `replication_failover_blacklist`
For more details on each of the parameters, see Configuration Options.
ClusterControl will obey the following rules for managing and monitoring a PostgreSQL streaming replication setup:
- `wal_level` is set to `replica` (or `hot_standby`, depending on the PostgreSQL version).
- The parameter `archive_mode` is set to `ON` on the primary.
- A `recovery.conf` or `recovery.signal` file is set on the replica nodes, which turns the node into a hot standby with read-only enabled.
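For reference, the settings involved look roughly like the `postgresql.conf` excerpt below (`hot_standby` matters on the standby side). ClusterControl manages these values for you, and the exact choices depend on the PostgreSQL version:

```
# postgresql.conf excerpt (illustrative) -- managed by ClusterControl in practice
wal_level = replica          # 'hot_standby' on older PostgreSQL versions
archive_mode = on
hot_standby = on             # relevant on standbys; allows read-only queries during recovery
```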
In case the active primary goes down, ClusterControl will attempt to perform the cluster recovery in the following order:
- After 10 seconds of primary unreachability, ClusterControl will raise an alarm.
- Following a 10-second graceful waiting period, ClusterControl will initiate the primary failover job.
- Sample the `replayLocation` and `receiveLocation` on all available nodes to identify the most advanced node.
- Promote the most advanced node as the new primary.
- Stop the replicas.
- Verify the synchronization state with `pg_rewind` (a manual illustration follows this list).
- Restart the replicas, pointing them to the new primary.
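The sampling and rewind steps can be reproduced manually for verification. On PostgreSQL 10 and later, the receive/replay positions map to the built-in LSN functions, and `pg_rewind` can resynchronize a diverged data directory; the data directory path and connection string below are placeholders:

```bash
# Sample the receive/replay positions on each standby (PostgreSQL 10+ function names).
psql -c "SELECT pg_last_wal_receive_lsn() AS receive_lsn, pg_last_wal_replay_lsn() AS replay_lsn"

# Resynchronize a diverged former primary against the new primary (placeholders).
pg_rewind --target-pgdata=/var/lib/pgsql/14/data \
          --source-server="host=10.0.0.21 port=5432 user=postgres dbname=postgres"
```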
If the replica promotion fails, ClusterControl will abort the recovery job. In such cases, user intervention or a ClusterControl service restart is required to re-trigger the recovery job.
Attention
When the old primary becomes available again, it will be forcefully shut down and will not rejoin the replication setup. User intervention is necessary. See further down.
Upon the old primary coming back online, if the PostgreSQL service is running, ClusterControl will forcefully shut it down. This prevents accidental writes, since the server would otherwise start without a recovery file (`recovery.conf` or `recovery.signal`) and would therefore be writable. You should expect lines like the following to appear in `postgresql-{day}.log`:
```
2019-11-27 05:06:10.091 UTC [2392] LOG: database system is ready to accept connections
2019-11-27 05:06:27.696 UTC [2392] LOG: received fast shutdown request
2019-11-27 05:06:27.700 UTC [2392] LOG: aborting any active transactions
2019-11-27 05:06:27.703 UTC [2766] FATAL: terminating connection due to administrator command
2019-11-27 05:06:27.704 UTC [2758] FATAL: terminating connection due to administrator command
2019-11-27 05:06:27.709 UTC [2392] LOG: background worker "logical replication launcher" (PID 2419) exited with exit code 1
2019-11-27 05:06:27.709 UTC [2414] LOG: shutting down
2019-11-27 05:06:27.735 UTC [2392] LOG: database system is shut down
```
PostgreSQL was started after the server came back online at around 05:06:10, but ClusterControl performed a fast shutdown about 17 seconds later, at around 05:06:27. If this is not the behavior you want, you can temporarily disable node recovery for this cluster.
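Recovery can be toggled per cluster from the ClusterControl GUI. If you use the ClusterControl CLI, the cluster-level toggles look roughly like the commands below; the flag names are assumed from the s9s CLI and should be verified against your installed version:

```bash
# Assumed s9s invocation -- check 's9s cluster --help' on your version for the exact flags.
s9s cluster --disable-recovery --cluster-id=1 --log
# ...and later, re-enable it:
s9s cluster --enable-recovery --cluster-id=1 --log
```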