Generally, ClusterControl performs its monitoring, alerting, and trending duties in the following four ways:
- SSH – Host metrics collection using the SSH library.
- Prometheus – Host and database metrics collection using Prometheus server and exporters.
- Database client – Database metrics collection using the CMON database client library.
- Advisor – Mini programs written using ClusterControl DSL and running within ClusterControl itself, for monitoring, tuning, and alerting purposes.
Starting from version 1.7.0, ClusterControl supports two methods of monitoring operation:
- Agentless monitoring (default).
- Agent-based monitoring with Prometheus.
The monitoring operation method is not a global configuration; it is bound per cluster. This allows you to have two different database clusters configured with two different monitoring methods simultaneously. For example, Cluster A uses SSH sampling while Cluster B uses a Prometheus agent-based setup to gather host monitoring data.
Regardless of the monitoring method chosen, database and load balancer (except HAProxy) metrics are still sampled agentlessly by CMON’s database client library and stored inside the CMON database for reporting (alarms, notifications, operational reports) and for accurate management decisions in critical operations like failover and recovery. That said, with agent-based monitoring, ClusterControl does not use SSH to sample host metrics, which can be excessive in some environments.
For host and load balancer stats collection, ClusterControl executes this task via SSH with super-user privilege. Therefore, passwordless SSH with super-user privilege is vital, to allow ClusterControl to run the necessary commands remotely with proper escalation. With this pull approach, there are a couple of advantages as compared to the agent-based monitoring method:
- Agentless – There is no need for an agent to be installed, configured, and maintained.
- Unifying the management and monitoring configuration – SSH can be used to pull monitoring metrics or push management jobs on the target nodes.
- Simplify the deployment – The only requirement is proper passwordless SSH setup and that’s it. SSH is also very secure and encrypted.
- Centralized setup – One ClusterControl server can manage multiple servers and clusters, provided it has sufficient resources.
However, the agentless monitoring approach, a.k.a. the pull mechanism, also has drawbacks:
- The monitoring data is accurate only from the ClusterControl perspective. For example, if there is a network glitch and ClusterControl loses communication with the monitored host, the sample will be skipped until the next available cycle.
- For high-granularity monitoring, increasing the sampling rate adds network overhead, because ClusterControl needs to establish more connections to every target host.
- ClusterControl will keep attempting to re-establish a connection to the target node, because it has no agent to do this on its behalf.
- Data sampling is redundant if more than one ClusterControl server monitors a cluster, since each ClusterControl server has to pull the monitoring data for itself.
The above points are the reasons we introduced agent-based monitoring, as described in the next section.
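To make the pull approach concrete, the sketch below shows the kind of arithmetic an agentless poller performs after fetching `/proc/stat` from a host over SSH: CPU utilization derived from two consecutive snapshots. The snapshot strings and helper names are illustrative, not actual ClusterControl output or code.

```python
# Illustrative sketch: computing CPU busy % from two /proc/stat samples,
# as a pull-based host sampler would after retrieving them over SSH.

def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into (idle, total) jiffies."""
    fields = [int(v) for v in line.split()[1:]]
    idle = fields[3] + fields[4]          # idle + iowait columns
    return idle, sum(fields)

def cpu_utilization(sample1, sample2):
    """CPU busy percentage between two consecutive /proc/stat samples."""
    idle1, total1 = parse_cpu_line(sample1)
    idle2, total2 = parse_cpu_line(sample2)
    dt = total2 - total1
    return 100.0 * (1 - (idle2 - idle1) / dt) if dt else 0.0

# Two hand-written snapshots, one sampling interval apart:
before = "cpu 1000 0 500 8000 100 0 0 0 0 0"
after_ = "cpu 1100 0 550 8300 110 0 0 0 0 0"
print(round(cpu_utilization(before, after_), 1))  # 32.6
```

Note that the result is only as fresh as the last successful pull, which is exactly the accuracy caveat listed above.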
Starting from version 1.7.0, ClusterControl introduced an agent-based monitoring integration with Prometheus. Other operations like management, scaling, and deployment are still performed through the agentless approach described in Management and Deployment Operations. Agent-based monitoring eliminates excessive SSH connections to the monitored hosts and offloads the monitoring jobs to a dedicated monitoring system like Prometheus.
With agent-based configuration, you can use a set of new dashboards that use Prometheus as the data source, giving you access to its flexible query language and a multi-dimensional data model in which time series are identified by metric name and key/value pairs. Simply put, in this configuration ClusterControl integrates with Prometheus to retrieve the collected monitoring data and visualize it in the ClusterControl UI, much like a GUI client for Prometheus. ClusterControl also connects to the exporters via HTTP GET and POST methods to determine process state for process management purposes. For the list of Prometheus exporters, see Monitoring Tools.
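To illustrate the data model described above, the sketch below parses a response in the JSON shape returned by Prometheus’ `/api/v1/query` HTTP API endpoint, the kind of payload a UI client consumes. The response body is a hand-written sample in the documented format, not data fetched from a live Prometheus server.

```python
import json

# Illustrative sample of a Prometheus instant-query response body:
# each series is identified by a metric name plus key/value labels.
response_body = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "node_load1",
                        "instance": "10.0.0.11:9100", "job": "node"},
             "value": [1700000000.0, "0.42"]},
        ],
    },
})

doc = json.loads(response_body)
for series in doc["data"]["result"]:
    labels = series["metric"]
    name = labels.pop("__name__")   # metric name lives in the __name__ label
    ts, value = series["value"]     # [unix timestamp, value-as-string]
    print(name, labels, float(value))
```

In a real deployment the same JSON would come from an HTTP GET against the Prometheus server’s query endpoint.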
One Prometheus data source can be shared among multiple clusters within ClusterControl. You have the option to deploy a new Prometheus server or import an existing one, under ClusterControl → Dashboards → Enable Agent Based Monitoring.
For agentless monitoring mode, ClusterControl’s monitoring duty only requires the OpenSSH server package on the monitored hosts. ClusterControl uses the libssh client library to collect host metrics from the monitored hosts – CPU, memory, disk usage, network, disk I/O, processes, etc. The OpenSSH client package is required on the ClusterControl host only for setting up passwordless SSH and for debugging purposes. Other SSH implementations like Dropbear and TinySSH are not supported.
For agent-based monitoring mode, ClusterControl requires a Prometheus server running on port 9090, and all monitored nodes must be configured with at least three exporters (depending on the node’s role):
- Process exporter (port 9011)
- Node/system metrics exporter (port 9100)
- Database or application exporters, depending on the node’s role.
On every monitored host, ClusterControl will configure and daemonize the exporter process using a program called daemon. Thus, the ClusterControl host is recommended to have an Internet connection to install the necessary packages and automate the Prometheus deployment. For offline installation, the packages must be pre-downloaded into /var/cache/cmon/packages on the ClusterControl node. For the list of required packages and links, please refer to /usr/share/cmon/templates/packages.conf. Apart from the Prometheus scrape process, ClusterControl also connects to the process exporter via direct HTTP calls to determine the process state of the node. No sampling via SSH is involved in this process.
Since ClusterControl 1.7.3 allows multiple instances per single host, it will automatically configure a different exporter port when there is more than one process of the same type to monitor, incrementing the port number for every instance to avoid port conflicts. Suppose you have two ProxySQL instances deployed by ClusterControl and you would like to monitor them both via Prometheus: ClusterControl will configure the first ProxySQL exporter on the default port, 42004, while the second ProxySQL exporter will be configured with port 42005, incremented by 1.
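The port-increment rule above can be sketched as a small allocation helper. The function and its `used` parameter are illustrative, not ClusterControl’s implementation; the default ProxySQL exporter port 42004 is taken from the text.

```python
PROXYSQL_EXPORTER_DEFAULT = 42004  # default ProxySQL exporter port per the text

def assign_exporter_ports(default_port, n_instances, used=()):
    """Give the first instance the default port, each further instance the
    next free port, skipping any ports already in use on the host."""
    ports, candidate, used = [], default_port, set(used)
    for _ in range(n_instances):
        while candidate in used:
            candidate += 1          # step over taken ports to avoid conflicts
        ports.append(candidate)
        used.add(candidate)
        candidate += 1
    return ports

print(assign_exporter_ports(PROXYSQL_EXPORTER_DEFAULT, 2))  # [42004, 42005]
```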
The collector flags are configured based on the node’s role, as shown in the following table (some exporters do not use collector flags):
Database Client Libraries
When gathering database stats and metrics, regardless of the monitoring operation method, ClusterControl Controller (CMON) connects to the database server directly via database client libraries – libmysqlclient (MySQL/MariaDB and ProxySQL), libpq (PostgreSQL), and libmongoc (MongoDB). That is why it is crucial to set up proper privileges for the ClusterControl server from the database server’s perspective. For MySQL-based clusters, ClusterControl requires the database user “cmon”, while for other databases any username can be used for monitoring, as long as it is granted super-user privileges. Most of the time, ClusterControl will set up the required privileges (or use the specified database user) automatically during the cluster import or cluster deployment stage.
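The library-per-database mapping above can be captured in a small lookup table. The mapping contents come from the text; the helper function itself is only an illustration.

```python
# Client library used by CMON per cluster type, per the documentation above.
CLIENT_LIBRARIES = {
    "mysql": "libmysqlclient",
    "mariadb": "libmysqlclient",
    "proxysql": "libmysqlclient",
    "postgresql": "libpq",
    "mongodb": "libmongoc",
}

def client_library(cluster_type):
    """Return the client library CMON uses for the given cluster type."""
    try:
        return CLIENT_LIBRARIES[cluster_type.lower()]
    except KeyError:
        raise ValueError(f"unsupported cluster type: {cluster_type}")

print(client_library("PostgreSQL"))  # libpq
```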
For load balancers, ClusterControl requires the following additional tools:
- MaxAdmin on the MariaDB MaxScale server.
- netcat and/or socat on the HAProxy server to connect to the HAProxy socket file.
- The MySQL client on the ProxySQL server.
Agentless vs Agent-based Architecture
The following diagram summarizes both host and database monitoring processes executed by ClusterControl using libssh and database client libraries (agentless approach):
The following diagram summarizes both host and database monitoring processes executed by ClusterControl using Prometheus and database client libraries (agent-based approach):
Timeouts and Intervals
ClusterControl Controller (CMON) is a multi-threaded process. For agentless monitoring, the ClusterControl Controller sampling thread connects via SSH to each monitored host once and maintains a persistent connection (hence, no timeout) until the host drops it or disconnects while sampling host stats. It may establish more connections depending on the jobs assigned to the host, since most management jobs run in their own thread. For example, cluster recovery runs on the recovery thread, Advisor execution runs on a cron thread, and process monitoring runs on the process collector thread.
For agent-based monitoring, the Scrape Interval and Data Retention period depend on the Prometheus settings.
The ClusterControl monitoring thread performs the following sampling operations at the following intervals:
| Metric | Sampling interval |
| --- | --- |
| MySQL query/status/variables | Every second |
| Process collection | Every 10 seconds |
| Server detection | Every 10 seconds |
| Host | Every 30 seconds (configurable via the CMON configuration file) |
| Database (PostgreSQL and MongoDB only) | Every 30 seconds (configurable via the CMON configuration file) |
| Database schema | Every 3 hours (configurable via the CMON configuration file) |
| Load balancer | Every 15 seconds (configurable via the CMON configuration file) |
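A simple way to picture the sampling table is as an interval scheduler: each sampler fires once its interval has elapsed. The interval values below are from the table; the sampler names and scheduling code are illustrative, not CMON internals.

```python
# Sampling intervals in seconds, per the table above (illustrative names).
INTERVALS = {
    "mysql_status": 1,
    "process_collection": 10,
    "server_detection": 10,
    "host_stats": 30,
    "db_stats": 30,
    "lb_stats": 15,
    "db_schema": 3 * 3600,
}

def due_samplers(now, last_run):
    """Return the samplers whose interval has elapsed since their last run."""
    return sorted(name for name, every in INTERVALS.items()
                  if now - last_run.get(name, -every) >= every)

last = dict.fromkeys(INTERVALS, 0)   # everything last ran at t=0
print(due_samplers(30, last))        # every sampler except the 3-hourly schema one
```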
The Advisors (imperative scripts), which can be created, compiled, tested, and scheduled directly from the ClusterControl UI under Manage → Developer Studio, can make use of SSH and the database client libraries for monitoring, data processing, and alerting within the ClusterControl domain, with the following restrictions:
- 5 seconds hard time limit for SSH execution,
- 10 seconds default time limit for database connection, configurable via connect_timeout in the CMON configuration file,
- 60 seconds total script execution time limit before CMON ungracefully aborts it.
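The hard limits above can be pictured with a small timeout wrapper. The enforcement mechanism below (a worker thread joined with a deadline) is purely illustrative and not how CMON implements its limits; the sub-second limits keep the example fast.

```python
import threading
import time

def run_with_limit(fn, limit_s):
    """Run fn in a worker thread; report whether it finished within limit_s.
    A False result corresponds to the 'abort the script' case in the text."""
    done = threading.Event()

    def worker():
        fn()
        done.set()

    t = threading.Thread(target=worker, daemon=True)
    t.start()
    t.join(limit_s)          # wait at most limit_s for the work to finish
    return done.is_set()

print(run_with_limit(lambda: time.sleep(0.05), 0.2))  # True: finished in time
print(run_with_limit(lambda: time.sleep(1.0), 0.2))   # False: over the limit
```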
Short-interval monitoring data like MySQL queries and status are stored directly in the CMON database, while long-interval monitoring data like weekly/monthly/yearly data points are aggregated every 60 seconds and kept in memory for 10 minutes. These behaviors are not configurable due to the architecture design.
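The 60-second aggregation step can be sketched as a fixed-window rollup of per-second samples. The buffering details are illustrative; only the "aggregate every 60 seconds" idea comes from the text.

```python
def aggregate(samples, window=60):
    """Average fixed-size windows of per-second samples into coarser points."""
    return [sum(samples[i:i + window]) / len(samples[i:i + window])
            for i in range(0, len(samples), window)]

# Two minutes of fake per-second samples cycling through 0..9:
per_second = [float(i % 10) for i in range(120)]
print(aggregate(per_second))  # [4.5, 4.5] -- one point per 60-second window
```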