Notification Services
Configures third-party notifications for events triggered by ClusterControl. This feature lets you integrate ClusterControl with your organization’s communication channels, incident response systems, and workflow applications. To avoid flooding recipients with notifications, each integration can be limited to specific critical or warning alerts (see Warning Events and Critical Events).
See also
Introducing the ClusterControl Alerting Integrations.
Supported integrations
Supported services are:
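Service type | Service provider |
---|---|
Incident management | PagerDuty, Splunk On-Call (formerly VictorOps), OpsGenie, ServiceNow, ilert |
Messaging platform | Slack, Telegram |
Others | Webhook |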
Notification behavior
Internally, ClusterControl tracks three types of alarm events:
- CREATED - An alarm was raised.
- CHANGED - Something changed in the alarm event.
- ENDED - The alarm was resolved.
However, by default, the ClusterControl notification process (cmon-events) listens for only two of these events, CREATED and ENDED, and passes them to the configured services. This can be changed with the allowed_events CLI parameter.
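The default filtering described above can be pictured with a small sketch. This is only an illustration of the documented behavior, not the cmon-events source code; the names `DEFAULT_ALLOWED_EVENTS` and `should_forward` are hypothetical.

```python
# Illustrative only: mirrors the documented default, not the cmon-events source.
DEFAULT_ALLOWED_EVENTS = frozenset({"CREATED", "ENDED"})


def should_forward(event_status, allowed_events=DEFAULT_ALLOWED_EVENTS):
    """Return True if an alarm event with this status would be passed on."""
    return event_status.upper() in allowed_events


statuses = ["CREATED", "CHANGED", "ENDED"]
forwarded = [s for s in statuses if should_forward(s)]
print(forwarded)  # ['CREATED', 'ENDED']
```

Passing a wider set, for example `{"CREATED", "CHANGED", "ENDED"}`, models what allowing CHANGED events would do.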
The notification behavior depends on the type of service that receives the event, as shown below:
Service | Notification behavior |
---|---|
|
|
Incident management |
|
Messaging platform |
|
Webhook |
|
Attention
If you already have an alarm raised and then you create, for example, a Slack (or any other) integration, you will never see events for the alarms already raised/created before you created the Slack integration. You will only see alarms created after the integration is set up.
Set up notification
Every supported service has different requirements and configurations as shown in the following subsections.
Slack
Slack is a cloud‑based collaboration platform designed for teams to communicate and work together in real time. For Slack integration, the following information is needed:
- Slack workspace URL - The incoming webhook URL for your workspace. It commonly looks like this: https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX
- Channel - The Slack channel name.
The following are the steps to get the required information:
- Create a Slack app by going to Your Apps.
- Go to Features → Incoming Webhooks and toggle on Activate Incoming Webhooks.
- Click Add New Webhook to Workspace, choose a Slack channel from the workspace in the dropdown, and click Allow. Take note of the channel name that you choose.
- Copy the generated Webhook URL. This is the Slack workspace URL that will be used in the integration.
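Before entering the webhook URL in ClusterControl, it can help to confirm that the URL works on its own. The following is a minimal sketch, assuming a standard Slack incoming webhook that accepts a JSON body with a `text` field; the helper name `build_slack_test_request` is hypothetical and the URL is the placeholder from this page.

```python
import json
import urllib.request


def build_slack_test_request(webhook_url, text):
    """Build (but do not send) a POST request for a Slack incoming webhook."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder URL in the format shown above; replace it with your own.
req = build_slack_test_request(
    "https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX",
    "Test message before configuring ClusterControl",
)
```

To actually send the message, pass the request to `urllib.request.urlopen(req, timeout=10)`; the message should then appear in the channel chosen when the webhook was created.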
To configure Slack integration:
- Go to ClusterControl GUI → Settings → Notification Services → Add new integration → Slack.
- In the Service Configuration section, specify the following:
  - Integration name: Give a name to this integration.
  - Workspace URL: The webhook URL of your workspace, taken from the Slack app page.
  - Channel: The channel where ClusterControl will post the events, prefixed with a hash sign (#).
- Click Test credentials. Ensure you receive a test notification from ClusterControl before proceeding to the next step.
- In the Notification settings section, specify the following:
  - Clusters - Choose one or more clusters to be notified. Multiple values are accepted.
  - Events - Choose one or more ClusterControl events that will trigger the notification. Multiple values are accepted. Details on the events are described in Warning Events and Critical Events.
- Click Finish to complete the configuration.
Telegram
Telegram is a cloud‑based instant messaging app available on mobile, desktop, and web that lets you send text and voice messages, make voice and video calls, and share photos, videos, and files of any type. For Telegram integration, the following information is needed:
- Telegram bot token - The Telegram bot API authentication token. It commonly looks like this: 999999999:AAHfiqksKZ8WmR2zSjiQ7_v4TMAKdiHm8T0.
- Channel - The Telegram channel name.
The following are the steps to get the required information:
- Open Telegram and create a new bot by contacting @BotFather.
- Give the bot a unique name and copy the bot token. This is the Telegram bot token that is required.
- Create a new Telegram channel (or use an existing channel).
- Add the created bot to the channel as Administrator.
- Allow your bot to send messages to the channel.
- Copy the channel ID from the channel link. For example, if the link is t.me/mypowerfulchannel, the channel ID is mypowerfulchannel.
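The bot token and channel ID gathered in the steps above can be checked outside ClusterControl. The sketch below assumes the public Telegram Bot API `sendMessage` method, which accepts a `chat_id` of the form `@channelname` for public channels; the helper names and the short token used in the usage note are hypothetical.

```python
import json
import urllib.request


def channel_alias(channel_link):
    """Turn a link such as 't.me/mypowerfulchannel' into '@mypowerfulchannel'."""
    channel_id = channel_link.rstrip("/").rsplit("/", 1)[-1]
    return "@" + channel_id.lstrip("@")


def build_telegram_test_request(bot_token, channel_link, text):
    """Build (but do not send) a Bot API sendMessage request."""
    url = f"https://api.telegram.org/bot{bot_token}/sendMessage"
    body = json.dumps({"chat_id": channel_alias(channel_link), "text": text})
    return urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

For example, `build_telegram_test_request("123:ABC", "t.me/mypowerfulchannel", "hello")` targets the channel `@mypowerfulchannel`; sending it with `urllib.request.urlopen` should only succeed once the bot is an administrator of that channel.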
To configure Telegram integration:
- Go to ClusterControl GUI → Settings → Notification Services → Add new integration → Telegram.
- In the Service Configuration section, specify the following:
  - Integration name: Give a name to this integration.
  - Telegram BOT token: The Telegram bot API authentication token, provided by the Telegram bot @BotFather.
  - Channel: The channel where ClusterControl will post the events, prefixed with an at sign (@).
- Click Test credentials. Ensure you receive a test notification from ClusterControl before proceeding to the next step.
- In the Notification settings section, specify the following:
  - Clusters - Choose one or more clusters to be notified. Multiple values are accepted.
  - Events - Choose one or more ClusterControl events that will trigger the notification. Multiple values are accepted. Details on the events are described in Warning Events and Critical Events.
- Click Finish to complete the configuration.
ServiceNow
ServiceNow is a cloud‑based platform for digital workflow automation and enterprise service management. For ServiceNow integration, the following information is needed:
- Username - The instance username.
- Password - The password for the instance user.
- Service - The Service name that you want to monitor.
- Configuration item - The ServiceNow components that need to be tracked.
- Instance - The instance name.
To configure ServiceNow integration:
- Go to ClusterControl GUI → Settings → Notification Services → Add new integration → ServiceNow.
- In the Service Configuration section, specify the following:
  - Integration name: Give a name to this integration.
  - Username: The instance username.
  - Password: The password for the instance user.
  - Service: The Service name that you want to monitor.
  - Configuration item: The ServiceNow components that need to be tracked.
  - Instance: The instance name.
- Click Test credentials. Ensure you receive a test notification from ClusterControl before proceeding to the next step.
- In the Notification settings section, specify the following:
  - Clusters - Choose one or more clusters to be notified. Multiple values are accepted.
  - Events - Choose one or more ClusterControl events that will trigger the notification. Multiple values are accepted. Details on the events are described in Warning Events and Critical Events.
- Click Finish to complete the configuration.
PagerDuty
PagerDuty is an incident management platform for IT operations, enabling teams to detect and respond to service disruptions and outages in real time. For PagerDuty integration, the following information is needed:
- Service key - The integration key for PagerDuty Events API v2. It commonly looks like this: aaaaa75294fe496b8ba9a87f71bddddd.
The following are the steps to get the required information:
- Log in to your PagerDuty account.
- Create a new service by going to Services → Service Directory → New Service.
- Give the service a name and create a new escalation policy (or select an existing one).
- In the Integration section, choose "Events API V2" and generate a new Integration Key. This is the Service key that is required.
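To verify the integration key independently of ClusterControl, a test event can be sent to the PagerDuty Events API v2. The sketch below assumes the public `enqueue` endpoint and its `routing_key`, `event_action`, and `payload` fields; it does not describe how ClusterControl itself formats its events, and the helper name and `source` value are hypothetical.

```python
import json
import urllib.request

PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"
VALID_SEVERITIES = {"critical", "error", "warning", "info"}


def build_pagerduty_trigger(service_key, summary, source, severity="critical"):
    """Build (but do not send) an Events API v2 'trigger' request."""
    if severity not in VALID_SEVERITIES:
        raise ValueError(f"severity must be one of {sorted(VALID_SEVERITIES)}")
    body = {
        "routing_key": service_key,  # the Service key from the steps above
        "event_action": "trigger",
        "payload": {"summary": summary, "source": source, "severity": severity},
    }
    return urllib.request.Request(
        PAGERDUTY_EVENTS_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Sending the request with `urllib.request.urlopen` should open an incident on the service that owns the key, which can then be resolved manually in PagerDuty.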
To configure PagerDuty integration:
- Go to ClusterControl GUI → Settings → Notification Services → Add new integration → PagerDuty.
- In the Service Configuration section, specify the following:
  - Integration name: Give any name to this integration.
  - Service Key: The PagerDuty Events API v2 integration key of the chosen service.
- Click Test credentials. Ensure you receive a test notification from ClusterControl before proceeding to the next step.
- In the Notification settings section, specify the following:
  - Clusters - Choose one or more clusters to be notified. Multiple values are accepted.
  - Events - Choose one or more ClusterControl events that will trigger the notification. Multiple values are accepted. Details on the events are described in Warning Events and Critical Events.
- Click Finish to complete the configuration.
Splunk On-Call (formerly known as VictorOps)
Splunk On-Call (formerly known as VictorOps) is an incident management platform for IT operations, enabling teams to detect and respond to service disruptions and outages in real time. For Splunk On-Call integration, the following information is needed:
- REST Integration URL - The Splunk On-Call service API endpoint you get when you enable the REST integration. It commonly looks like this: https://alert.victorops.com/integrations/generic/<YOUR_REST_ENDPOINT_KEY>/alert/<YOUR_ROUTING_KEY>.
The following are the steps to get the required information:
- Log in to your Splunk On-Call/VictorOps account.
- Create a new integration by going to Settings → Integrations → 3rd Party Integrations → REST – Generic.
- Click on the Enable button to generate an endpoint destination URL.
- Set up and associate routing keys in Splunk On-Call by going to Settings → Routing Keys. The routing key must be included as part of the REST integration URL.
To configure Splunk On-Call integration:
- Go to ClusterControl GUI → Settings → Notification Services → Add new integration → VictorOps.
- In the Service Configuration section, specify the following:
  - Integration name: Give any name to this integration.
  - REST Integration URL: The Splunk On-Call integration URL for the REST API. The URL must include the REST endpoint key and routing key.
- Click Test credentials. Ensure you receive a test notification from ClusterControl before proceeding to the next step.
- In the Notification settings section, specify the following:
  - Clusters - Choose one or more clusters to be notified. Multiple values are accepted.
  - Events - Choose one or more ClusterControl events that will trigger the notification. Multiple values are accepted. Details on the events are described in Warning Events and Critical Events.
- Click Finish to complete the configuration.
OpsGenie
Opsgenie is a cloud-based incident management and on-call scheduling platform (part of the Atlassian suite) that centralizes alerts from your monitoring, ticketing, and collaboration tools. For Opsgenie integration, the following information is needed:
- Region - The Opsgenie data-center your account lives in, so your HTTP calls go to the correct endpoint. It is either US or EU.
- Teams - The Opsgenie team identifiers or names that should receive and handle the alerts generated by this integration.
- API key - The unique "GenieKey" token created when you add an API integration under Settings → Integrations → API.
The following are the steps to get the required information:
- Log in to your OpsGenie account.
- Create a new integration by going to Settings → Integrations → API.
- Click Continue, expand Steps to configure the integration, then copy the API key.
- Click Turn on integration to enable it.
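The Region setting matters because US and EU accounts are served by different API hosts. The sketch below shows how the three values fit together in a manual test alert, assuming the public Opsgenie Alert API (`/v2/alerts`, `GenieKey` authorization header, team responders); the helper name is hypothetical and this is not ClusterControl's own implementation.

```python
import json
import urllib.request

OPSGENIE_API_HOSTS = {
    "US": "https://api.opsgenie.com",
    "EU": "https://api.eu.opsgenie.com",
}


def build_opsgenie_test_alert(region, api_key, teams, message):
    """Build (but do not send) a create-alert request for the given region."""
    base_url = OPSGENIE_API_HOSTS[region.upper()]  # KeyError for unknown regions
    body = {
        "message": message,
        "responders": [{"name": team, "type": "team"} for team in teams],
    }
    return urllib.request.Request(
        f"{base_url}/v2/alerts",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"GenieKey {api_key}",
        },
        method="POST",
    )
```

An API key created in an EU account is expected to be rejected by the US host and vice versa, so a failed test is often a sign that the wrong region was selected.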
To configure OpsGenie integration:
- Go to ClusterControl GUI → Settings → Notification Services → Add new integration → OpsGenie.
- In the Service Configuration section, specify the following:
  - Region: Choose either the US or EU region.
  - Teams: The Opsgenie team identifiers or names that should receive and handle the alerts generated by this integration.
  - API Key: The unique "GenieKey" token created when you add the API integration.
- Click Test credentials. Ensure you receive a test notification from ClusterControl before proceeding to the next step.
- In the Notification settings section, specify the following:
  - Clusters - Choose one or more clusters to be notified. Multiple values are accepted.
  - Events - Choose one or more ClusterControl events that will trigger the notification. Multiple values are accepted. Details on the events are described in Warning Events and Critical Events.
- Click Finish to complete the configuration.
ilert
ilert is an incident response platform that helps DevOps, IT, and SRE teams manage on-call schedules, automate alerting, and streamline incident resolution through integrations with monitoring tools, real-time notifications, and collaborative response features. For ilert integration, the following information is needed:
- Alert source URL - The ilert API alert source URL generated in the ClusterControl integration page for ilert. It commonly looks like this: https://api.ilert.com/api/v1/events/clustercontrol/xxxxxxxxxxxxxxxxxxxxxxxxxxxx.
The following are the steps to get the required information:
- Log in to your ilert account.
- Go to Alert sources → Alert sources and click Create new alert source.
- Search for "ClusterControl" in the search field, click the ClusterControl tile, and then Next.
- Give your alert source a name, optionally assign teams, and click Next.
- Select an escalation policy by creating a new one or assigning an existing one.
- On the final page, a ClusterControl URL is generated. Use this URL when you configure the ClusterControl integration.
To configure ilert integration:
- Go to ClusterControl GUI → Settings → Notification Services → Add new integration → ilert.
- In the Service Configuration section, specify the following:
  - Integration name: Give any name to this integration.
  - URL: The ilert ClusterControl integration URL generated from the alert source integration dashboard.
- Click Test credentials. Ensure you receive a test notification from ClusterControl before proceeding to the next step.
- In the Notification settings section, specify the following:
  - Clusters - Choose one or more clusters to be notified. Multiple values are accepted.
  - Events - Choose one or more ClusterControl events that will trigger the notification. Multiple values are accepted. Details on the events are described in Warning Events and Critical Events.
- Click Finish to complete the configuration.
Webhook integration
A webhook is a lightweight, event-driven mechanism for one-way data sharing between applications: when an event occurs, the source system pushes data to an HTTP endpoint on the receiving system. For webhook integration, the following information is needed:
- URL - The HTTP endpoint of the service that should receive the notifications. Make sure the service is properly set up to handle webhook events.
For webhook integration, ClusterControl reports an alarm by sending JSON data to the configured endpoint using the HTTP POST method. When a new alarm is created, the webhook endpoint should receive an event like the following:
{
"id": 470,
"status": "CREATED",
"component": "Node",
"hostname": "192.168.20.62",
"title": "Server disconnected",
"message": "PostgreSQL server on 192.168.20.62:5432 disconnected: Connect failure: Watchdog: Failed connection to [email protected]:5432. timeout expired\n",
"recommendation": "Check node status on UI and error log of failed server.",
"severity": "CRITICAL"
}
The above event was created and sent out to the webhook endpoint after ClusterControl has detected that one of our database servers (192.168.20.62) was down. After the server came back online, ClusterControl would then send another event with the same alarm ID (470) with a different status “ENDED”, indicating the above alarm has ended and the node is back operational:
{
"id": 470,
"status": "ENDED",
"component": "Node",
"hostname": "192.168.20.62",
"title": "Server disconnected",
"message": "PostgreSQL server on 192.168.20.62:5432 disconnected: Connect failure: Watchdog: Failed connection to [email protected]:5432. timeout expired\n",
"recommendation": "Check node status on UI and error log of failed server.",
"severity": "CRITICAL"
}
Commonly, the “ended” event is almost identical to the “created” event except for the “status” value.
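Because the created and ended events share the same alarm ID, a receiving service can pair them to keep track of which alarms are still open. The following is a minimal sketch of such a receiver using only the Python standard library; the names `apply_event`, `AlarmWebhookHandler`, and `serve`, as well as port 8080, are hypothetical choices for illustration, and only the `id` and `status` fields from the example events are relied upon.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

open_alarms = {}  # alarm id -> most recent event for alarms that have not ended


def apply_event(alarms, event):
    """Track an alarm: ENDED closes it, any other status opens or updates it."""
    alarm_id = event["id"]
    if event["status"] == "ENDED":
        alarms.pop(alarm_id, None)
    else:
        alarms[alarm_id] = event
    return alarms


class AlarmWebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        try:
            length = int(self.headers.get("Content-Length", 0))
            event = json.loads(self.rfile.read(length))
            apply_event(open_alarms, event)
        except (ValueError, KeyError, TypeError):
            self.send_response(400)  # not a JSON alarm event in the expected shape
        else:
            self.send_response(204)  # accepted, no response body
        self.end_headers()


def serve(port=8080):
    """Start the receiver; the port is an arbitrary choice for this sketch."""
    HTTPServer(("", port), AlarmWebhookHandler).serve_forever()
```

Calling `serve()` starts the listener; the URL configured in ClusterControl would then point at that host and port. A production receiver would also need authentication and persistent storage, which are left out here.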
Tip
You may use https://webhook.site/ to test out the webhook integration with ClusterControl.
To configure webhook integration:
- Go to ClusterControl GUI → Settings → Notification Services → Add new integration → Webhook.
- In the Service Configuration section, specify the following:
  - Integration name: Give any name to this integration.
  - URL: The HTTP endpoint of the service that should receive the notifications.
- Click Test credentials. Ensure you receive a test notification from ClusterControl before proceeding to the next step.
- In the Notification settings section, specify the following:
  - Clusters - Choose one or more clusters to be notified. Multiple values are accepted.
  - Events - Choose one or more ClusterControl events that will trigger the notification. Multiple values are accepted. Details on the events are described in Warning Events and Critical Events.
- Click Finish to complete the configuration.
Events
The following table lists out all event groups supported by ClusterControl:
Event | Description |
---|---|
All Events | All ClusterControl events including warning and critical events. |
All Warning Events | All ClusterControl warning events, e.g. cluster degradation, network glitch. See Warning Events. |
All Critical Events | All ClusterControl critical events, e.g. cluster failed, host failed. See Critical Events. |
Network | Network-related events, e.g. host unreachable, SSH issues. |
CMON Database | Internal CMON database-related events, e.g. unable to connect to CMON database, datadir mounted as read-only. |
Mail | Mail system-related events, e.g. unable to send mail, mail server unreachable. |
Cluster | Cluster-related events, e.g. cluster failure, cluster degradation, time drifting. |
Cluster Configuration | Cluster configuration events, e.g. SST account mismatch. |
Cluster Recovery | Recovery events, e.g. cluster or node recovery failures. |
Node | Node-related events, e.g. node disconnected, missing GRANT, failed to start HAProxy, failed to start NDB cluster nodes. |
Host | Host-related messages, e.g. CPU/disk/RAM/swap exceeds thresholds, memory full. |
Database Health | Database health-related events, e.g. memory usage of MySQL servers, connections, missing primary key. |
Database Performance | Alarms for long-running transactions, replication lag, and deadlocks. |
Software Installation | Software installation-related events, e.g. license expiration. |
Backup | Backup-related events, e.g. backup failed. |
Warning Events
Note
ClusterControl reuses the same alarm code names across database cluster types. For example, if a PostgreSQL slave is lagging behind, the alarm’s internal code name is still “MySqlReplicationLag”. However, the actual alarm text is PostgreSQL-specific. This is handled internally.
Area | Alarms | Severity | Description |
---|---|---|---|
Node | MySqlReplicationLag | Warning | MySQL replication slave lag, default 10 seconds. |
Node | MySqlReplicationBroken | Warning | The SQL thread has stopped. For PostgreSQL, it means the slave gets disconnected from the master. |
Node | CertificateExpiration | Warning | SSL certificate expiration time (<=31 days, >7 days). |
Node | MySqlAdvisor | Warning | Raised by wsrep_sst_method.js and wsrep_node_name.js advisors. |
Node | MySqlTableAnalyzer | Warning | Raised by schema_check_nopk.js advisor. |
Node | StorageMyIsam | Warning | Raised by schema_check_myisam.js advisor. |
Node | MySqlIndexAnalyzer | Warning | Raised by schema_check_dupl_index.js advisor. |
Node | MySqlReplicationLooseServer | Warning | A slave is found but its master can’t be determined, or it is not part of the cluster. |
Host | HostSwapV2 | Warning | If a configurable number of pages has been swapped in/out during a configurable period of time. Default 20 pages in 10 minutes. |
Host | HostSwapping | Warning | >5% swap space has been used. |
Host | HostCpuUsage | Warning | >80%, <90% CPU used. |
Host | HostRamUsage | Warning | >80%, <90% RAM used. |
Host | HostDiskUsage | Warning | >80%, <90% disk space used on a monitored_mountpoint. |
Host | ProcessCpuUsage | Warning | >95% CPU used on average by a process for 15 minutes. |
Backup | BackupFailed | Warning | The backup job fails. |
Recovery | GaleraWsrepMissing | Warning | wsrep_cluster_address or wsrep_provider is missing. |
Recovery | GaleraSstAuth | Warning | SST settings (user/pass) are wrong. |
Network | HostFirewall | Warning | The host is not responding to ping after 3 cycles. |
Network | HostSshSlow | Warning | It takes 6-12 seconds to SSH into a host. |
Cluster | ClusterTimeDrift | Warning | Time drift between ClusterControl and database nodes. |
Cluster | ClusterLicenseExpire | Warning | The license is about to expire. |
Cluster | ClusterInconsitentView | Warning | The load balancer or ClusterControl sees a different set of working nodes (the master is down from ClusterControl's point of view, while the load balancer or the slave reports the master as working). |
Critical Events
Note
ClusterControl reuses the same alarm code names across database cluster types. For example, if a PostgreSQL server goes down, the alarm’s internal code name is still “MySqlDisconnected”. However, the actual alarm text is PostgreSQL-specific. This is handled internally.
Area | Alarms | Severity | Description |
---|---|---|---|
Node | MySqlDisconnected | Critical | The database server cannot be reached. |
Node | MySqlGrantMissing | Critical | Node does not have the correct privileges set for the cmon user. |
Node | MySqlLongRunningQuery | Critical | Queries are running for too long. Only used if configured; it is not configured by default. |
Node | ProcFailedRestart | Critical | A process (HAProxy, ProxySQL, Garbd, MaxScale) could not be restarted after a failure. |
Node | CertificateExpiration | Critical | SSL certificate expiration time (<=7 days). |
Node | MySqlReplicationMultiMaster | Critical | Multiple writable masters detected. |
Host | HostSwapV2 | Critical | If a configurable number of pages has been swapped in/out during a configurable period of time. Default 20 pages in 10 minutes. |
Host | HostSwapping | Critical | >20% swap space has been used. |
Host | HostCpuUsage | Critical | >90% CPU used. |
Host | HostRamUsage | Critical | >90% RAM used. |
Host | HostDiskUsage | Critical | >90% disk space used on a monitored_mountpoint. |
Host | ProcessCpuUsage | Critical | >99% CPU used on average by a process for 15 minutes. |
Backup | BackupVerificationFailed | Critical | Backup verification fails. |
Recovery | GaleraWsrepMissing | Critical | wsrep_cluster_address or wsrep_provider is missing, and is still missing after 20 sample cycles (about 100 seconds in this case). |
Recovery | GaleraClusterSplit | Critical | There is a split brain. |
Recovery | ClusterRecoveryFail | Critical | Recovery has failed. |
Recovery | GaleraConfigProblem1 | Critical | A configuration issue is preventing the node from starting. |
Recovery | GaleraNodeRecoveryFail | Critical | Automatic recovery has failed 3 consecutive times. |
Recovery | ReplicationFailoverBlacklistError | Critical | Raised when, during automatic failover, the only possible candidate is blacklisted. |
Network | HostUnreachable | Critical | The host is not responding to ping after 3 cycles. |
Network | HostSshFailed | Critical | Please check SSH access to the host. The host may also be down. |
Network | HostSshAuth | Critical | Please check whether the configured SSH key is properly configured and can be authenticated on the host. |
Network | HostSudoError | Critical | sudo command error on the host. |
Network | HostSshSlow | Critical | It takes more than 12 seconds to SSH into a host. |
Cluster | ClusterFailure | Critical | The cluster has failed. |
Cluster | ClusterLicenseExpire | Critical | The license has expired. |
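Several alarms appear in both tables and differ only by threshold. The sketch below restates the documented HostCpuUsage/HostRamUsage/HostDiskUsage and CertificateExpiration thresholds as two small functions. It is an illustration of the tables, not ClusterControl code; the function names are hypothetical, and treating exactly 90% as a warning is an assumption, since the tables do not say which side that boundary falls on.

```python
def usage_severity(percent_used):
    """Classify CPU, RAM, or disk usage using the documented thresholds."""
    if percent_used > 90:
        return "CRITICAL"
    if percent_used > 80:  # assumption: exactly 90% is still a warning
        return "WARNING"
    return None  # below both thresholds, no alarm


def certificate_severity(days_until_expiry):
    """Classify SSL certificate expiry: <=7 days critical, <=31 days warning."""
    if days_until_expiry <= 7:
        return "CRITICAL"
    if days_until_expiry <= 31:
        return "WARNING"
    return None


for value in (50, 85, 95):
    print(value, usage_severity(value))
```

A receiver that subscribes to both warning and critical events could use the `severity` field of the incoming event in the same way, for example to page on critical alarms only.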