Notification Services

Configures third-party notifications on events triggered by ClusterControl. This feature lets you integrate ClusterControl into your organization’s communication channels, incident response systems, and workflow applications. To avoid flooding recipients with notifications, each integration can be limited to specific critical or warning alerts (see Warning Events and Critical Events).

Supported integrations

Supported services are:

Service type	Service provider
Incident management	PagerDuty, Splunk On-Call (formerly known as VictorOps), OpsGenie, ServiceNow, ilert
Messaging platform	Slack, Telegram
Others	Webhook

Notification behavior

Within ClusterControl internal states, there are 3 types of events:

CREATED - An alarm was raised.
CHANGED - Something changed in the alarm event.
ENDED - The alarm was resolved.

However, by default, the ClusterControl notification process (cmon-events) listens to only 2 of these events, CREATED and ENDED, and passes them to the configured services. This can be changed with the allowed_events CLI parameter.

The notification behavior depends on the type of service that receives the event, as shown below:

Service	Notification behavior
Email	CREATED - Will send an email; ENDED - Will send an email
Incident management	CREATED - Will create an incident; ENDED - Will resolve an incident
Messaging platform	CREATED - Will send a message; ENDED - Will send a message
Webhook	CREATED - Will send an HTTP POST request to the webhook; ENDED - Will send an HTTP POST request to the webhook. See Webhook integration for examples.
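
The table above can be read as a small dispatch table. The following Python sketch is illustrative only: the names DEFAULT_ALLOWED_EVENTS, ACTIONS, and action_for are hypothetical and do not reflect how cmon-events is implemented.

```python
# Illustrative sketch only; every name here is hypothetical, not part of ClusterControl.

# Events forwarded by default, per the text above.
DEFAULT_ALLOWED_EVENTS = frozenset({"CREATED", "ENDED"})

# What each service type does for each forwarded event (taken from the table above).
ACTIONS = {
    "email": {"CREATED": "send an email", "ENDED": "send an email"},
    "incident management": {"CREATED": "create an incident", "ENDED": "resolve an incident"},
    "messaging platform": {"CREATED": "send a message", "ENDED": "send a message"},
    "webhook": {"CREATED": "send an HTTP POST request", "ENDED": "send an HTTP POST request"},
}


def action_for(service_type, event_status, allowed_events=DEFAULT_ALLOWED_EVENTS):
    """Return the action for an event, or None if the event is not forwarded."""
    if event_status not in allowed_events:
        return None
    return ACTIONS[service_type.lower()].get(event_status)
```

With the default set, a CHANGED event yields no action for any service type.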

Attention

If you already have an alarm raised and then you create, for example, a Slack (or any other) integration, you will never see events for the alarms already raised/created before you created the Slack integration. You will only see alarms created after the integration is set up.

Set up notification

Every supported service has different requirements and configurations as shown in the following subsections.

Slack

Slack is a cloud‑based collaboration platform designed for teams to communicate and work together in real time. For Slack integration, the following information is needed:

Slack workspace URL - The incoming webhook URL for your workspace. It commonly looks like this: https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX
Channel - The Slack channel name.

The following are the steps to get the required information:

Create a Slack app by going to Your Apps.
Go to Features → Incoming Webhooks and toggle on Activate Incoming Webhooks.
Click on Add New Webhook to Workspace, choose a Slack channel from the workspace in the dropdown, and click Allow. Take note of the channel name that you choose.
Copy the generated Webhook URL. This is the Slack workspace URL that will be used in the integration.

Tip

See Slack documentation - Sending messages using incoming webhooks.
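
Before entering the webhook URL in ClusterControl, it can be useful to confirm that the URL accepts messages. The sketch below uses only the Python standard library and relies on the Slack incoming webhook convention of posting a JSON body with a "text" field. The function names are hypothetical, and the URL must be replaced with the one generated in the steps above.

```python
import json
import urllib.request


def build_slack_request(webhook_url, text):
    """Build (but do not send) a POST request for a Slack incoming webhook."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_slack_message(webhook_url, text, timeout=10):
    """Send the message and return the HTTP status code of the response."""
    request = build_slack_request(webhook_url, text)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

For example, send_slack_message(url, "Test from the ClusterControl host") should make a message appear in the channel chosen when the webhook was created.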

To configure Slack integration:

ClusterControl GUI

Go to ClusterControl GUI → Settings → Notification Services → Add new integration → Slack.
In the Service Configuration section, specify the following:
- Integration name: Give a name to this integration.
- Workspace URL: The webhook URL of your workspace, taken from the Slack app page.
- Channel: The channel where ClusterControl will post the events, prefixed with a hash sign (#).
Then click on Test credentials. Ensure you receive a test notification from ClusterControl before proceeding to the next step.
In the Notification settings section, specify the following:
- Clusters - Choose one or more clusters to be notified. Multiple values are accepted.
- Events - Choose one or more ClusterControl events that will trigger the notification. Multiple values are accepted. Details on the events are described in Warning Events and Critical Events.
Click Finish to complete the configuration.

Telegram

Telegram is a cloud‑based instant messaging app available on mobile, desktop, and web that lets you send text and voice messages, make voice and video calls, and share photos, videos, and files of any type. For Telegram integration, the following information is needed:

Telegram bot token - The Telegram bot API authentication token. It commonly looks like this: 999999999:AAHfiqksKZ8WmR2zSjiQ7_v4TMAKdiHm8T0.
Channel - The Telegram channel name.

The following are the steps to get the required information:

Open Telegram and create a new bot by contacting @BotFather.
Give the bot a unique name and copy the bot token. This is the Telegram bot token that is required.
Create a new Telegram channel (or use an existing channel).
Add the created bot to the channel as Administrator.
Allow your bot to send messages to the channel.
Copy the channel ID from the channel link. For example, if the link is t.me/mypowerfulchannel, the channel ID is mypowerfulchannel.

Tip

See Telegram documentation - Bots.
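
To check that the bot can post to the channel before configuring ClusterControl, a request can be sent to the Telegram Bot API sendMessage method. The sketch below is a minimal example using the Python standard library. The function name is hypothetical, and it assumes a public channel addressed by its @name.

```python
import json
import urllib.request

TELEGRAM_API = "https://api.telegram.org"


def build_telegram_request(bot_token, channel, text):
    """Build (but do not send) a sendMessage request for a public channel."""
    # Public channels are addressed as @channelname in the chat_id field.
    chat_id = channel if channel.startswith("@") else "@" + channel
    body = json.dumps({"chat_id": chat_id, "text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{TELEGRAM_API}/bot{bot_token}/sendMessage",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned request to urllib.request.urlopen sends the message. An error response usually means the bot is not an administrator of the channel or the channel name is wrong.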

To configure Telegram integration:

ClusterControl GUI

Go to ClusterControl GUI → Settings → Notification Services → Add new integration → Telegram.
In the Service Configuration section, specify the following:
- Integration name: Give a name to this integration.
- Telegram BOT token: The Telegram bot API authentication token, provided by the Telegram bot @BotFather.
- Channel: The channel where ClusterControl will post the events, prefixed with an at sign (@).
Then click on Test credentials. Ensure you receive a test notification from ClusterControl before proceeding to the next step.
In the Notification settings section, specify the following:
- Clusters - Choose one or more clusters to be notified. Multiple values are accepted.
- Events - Choose one or more ClusterControl events that will trigger the notification. Multiple values are accepted. Details on the events are described in Warning Events and Critical Events.
Click Finish to complete the configuration.

ServiceNow

ServiceNow is a cloud‑based platform for digital workflow automation and enterprise service management. For ServiceNow integration, the following information is needed:

Username - The instance username.
Password - The password for the instance user.
Service - The Service name that you want to monitor.
Configuration item - The ServiceNow components that need to be tracked.
Instance - The instance name.

Tip

See ServiceNow documentation.
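
The values listed above can be sanity-checked outside ClusterControl by creating a test incident through the ServiceNow Table API. The sketch below is an assumption-laden example: whether ClusterControl itself uses the Table API is not stated here, and the mapping of Service to the business_service field and Configuration item to the cmdb_ci field is an assumption for illustration. The function name is hypothetical.

```python
import base64
import json
import urllib.request


def build_servicenow_incident_request(instance, username, password,
                                      short_description,
                                      service=None, configuration_item=None):
    """Build (but do not send) a request that creates an incident record."""
    fields = {"short_description": short_description}
    if service:
        fields["business_service"] = service  # assumed field mapping
    if configuration_item:
        fields["cmdb_ci"] = configuration_item  # assumed field mapping
    # HTTP basic authentication with the instance user.
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        f"https://{instance}.service-now.com/api/now/table/incident",
        data=json.dumps(fields).encode("utf-8"),
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
            "Accept": "application/json",
        },
        method="POST",
    )
```

The Test credentials button in ClusterControl remains the authoritative check; this sketch only helps to separate credential problems from ClusterControl configuration problems.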

To configure ServiceNow integration:

ClusterControl GUI

Go to ClusterControl GUI → Settings → Notification Services → Add new integration → ServiceNow.
In the Service Configuration section, specify the following:
- Integration name: Give a name to this integration.
- Username - The instance username.
- Password - The password for the instance user.
- Service - The Service name that you want to monitor.
- Configuration item - The ServiceNow components that need to be tracked.
- Instance - The instance name.
Then click on Test credentials. Ensure you receive a test notification from ClusterControl before proceeding to the next step.
In the Notification settings section, specify the following:
- Clusters - Choose one or more clusters to be notified. Multiple values are accepted.
- Events - Choose one or more ClusterControl events that will trigger the notification. Multiple values are accepted. Details on the events are described in Warning Events and Critical Events.
Click Finish to complete the configuration.

PagerDuty

PagerDuty is an incident management platform for IT operations, enabling teams to detect and respond to service disruptions and outages in real time. For PagerDuty integration, the following information is needed:

Service key - Integration key for PagerDuty Events API v2. Commonly it looks like this: aaaaa75294fe496b8ba9a87f71bddddd.

The following are the steps to get the required information:

Log in to your PagerDuty account.
Create a new service by going to Services → Service Directory → New Service.
Give the Service a name and generate a new escalation policy (or select an existing one).
In the Integration section, choose "Events API V2". Generate a new Integration Key for this integration; this is the Service key that is required.

Tip

See PagerDuty documentation - Services and Integrations.
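
To illustrate how the create/resolve behavior maps onto the PagerDuty Events API v2, the sketch below builds the JSON body for a "trigger" or "resolve" event. It is not ClusterControl's implementation: the function name and the dedup_key format are hypothetical, and the alarm fields are borrowed from the example event shown under Webhook integration.

```python
# Events API v2 endpoint; the event body below is POSTed here as JSON.
PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"


def build_pagerduty_event(service_key, alarm):
    """Map a ClusterControl-style alarm dict to an Events API v2 body."""
    if alarm["status"] == "CREATED":
        action = "trigger"
    elif alarm["status"] == "ENDED":
        action = "resolve"
    else:
        raise ValueError(f"unsupported status: {alarm['status']}")

    event = {
        "routing_key": service_key,  # the integration key (Service key)
        "event_action": action,
        # The same key on trigger and resolve ties both events to one incident.
        "dedup_key": f"clustercontrol-{alarm['id']}",  # hypothetical format
    }
    if action == "trigger":
        event["payload"] = {
            "summary": alarm["title"],
            "source": alarm["hostname"],
            "severity": alarm["severity"].lower(),  # e.g. "critical", "warning"
        }
    return event
```

Because the resolve event reuses the dedup_key of the trigger event, PagerDuty closes the incident that the earlier event opened rather than creating a new one.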

To configure PagerDuty integration:

ClusterControl GUI

Go to ClusterControl GUI → Settings → Notification Services → Add new integration → PagerDuty.
In the Service Configuration section, specify the following:
- Integration name: Give any name to this integration.
- Service Key: The PagerDuty Events API v2 integration key of the chosen service.
Then click on Test credentials. Ensure you receive a test notification from ClusterControl before proceeding to the next step.
In the Notification settings section, specify the following:
- Clusters - Choose one or more clusters to be notified. Multiple values are accepted.
- Events - Choose one or more ClusterControl events that will trigger the notification. Multiple values are accepted. Details on the events are described in Warning Events and Critical Events.
Click Finish to complete the configuration.

Splunk On-Call (formerly known as VictorOps)

Splunk On-Call (formerly known as VictorOps) is an incident management platform for IT operations, enabling teams to detect and respond to service disruptions and outages in real time. For Splunk On-Call integration, the following information is needed:

REST Integration URL - The Splunk On-Call Service API Endpoint you get when you enable the REST integration. Commonly it looks like this: https://alert.victorops.com/integrations/generic/<YOUR_REST_ENDPOINT_KEY>/alert/<YOUR_ROUTING_KEY>.

The following are the steps to get the required information:

Log in to your Splunk On-Call/VictorOps account.
Create a new integration by going to Settings → Integrations → 3rd Party Integrations → REST – Generic.
Click on the Enable button to generate an endpoint destination URL.
Routing keys in Splunk On-Call can be set up and associated by clicking on Settings → Routing Keys. A routing key must be included as part of the REST integration URL.

Tip

See Splunk On-Call documentation - REST Endpoint Integration Guide.
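
The sketch below shows how the REST integration URL is assembled from the endpoint key and routing key, and what a generic REST alert body might look like. The function names and the entity_id format are hypothetical, and the alarm fields are borrowed from the example event under Webhook integration. It is not ClusterControl's implementation.

```python
def build_rest_url(endpoint_key, routing_key):
    """Assemble the REST integration URL in the format shown above."""
    return (
        "https://alert.victorops.com/integrations/generic/"
        f"{endpoint_key}/alert/{routing_key}"
    )


def build_alert(alarm):
    """Map a ClusterControl-style alarm dict to a generic REST alert body."""
    if alarm["status"] == "ENDED":
        message_type = "RECOVERY"  # closes the incident for this entity_id
    elif alarm["severity"] in ("CRITICAL", "WARNING"):
        message_type = alarm["severity"]
    else:
        message_type = "INFO"
    return {
        "message_type": message_type,
        # The same entity_id links the recovery to the original alert.
        "entity_id": f"clustercontrol-{alarm['id']}",  # hypothetical format
        "entity_display_name": alarm["title"],
        "state_message": alarm["message"],
    }
```

If the routing key is missing from the URL, the alert is not routed to the intended team, which is why the full URL, including both keys, is required in ClusterControl.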

To configure Splunk On-Call integration:

ClusterControl GUI

Go to ClusterControl GUI → Settings → Notification Services → Add new integration → VictorOps.
In the Service Configuration section, specify the following:
- Integration name: Give any name to this integration.
- REST Integration URL: The Splunk On-Call integration URL for the REST API. The URL must include the REST endpoint key and the routing key.
Then click on Test credentials. Ensure you receive a test notification from ClusterControl before proceeding to the next step.
In the Notification settings section, specify the following:
- Clusters - Choose one or more clusters to be notified. Multiple values are accepted.
- Events - Choose one or more ClusterControl events that will trigger the notification. Multiple values are accepted. Details on the events are described in Warning Events and Critical Events.
Click Finish to complete the configuration.

OpsGenie

Opsgenie is a cloud-based incident management and on-call scheduling platform (part of the Atlassian suite) that centralizes alerts from your monitoring, ticketing, and collaboration tools. For Opsgenie integration, the following information is needed:

Region - The Opsgenie data-center your account lives in, so your HTTP calls go to the correct endpoint. It is either US or EU.
Teams - The Opsgenie team identifiers or names that should receive and handle the alerts generated by this integration.
API key - The unique "GenieKey" token created when you add an API integration under Settings → Integrations → API.

The following are the steps to get the required information:

Log in to your OpsGenie account.
Create a new integration by going to Settings → Integrations → API.
Click Continue, expand Steps to configure the integration, then copy the API key.
Click Turn on integration to enable it.

Tip

See Opsgenie Support - Create an API Integration.
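
The three values above (region, teams, API key) correspond to the endpoint, the responders, and the Authorization header of an Opsgenie Alert API request. The sketch below builds such a request with the Python standard library so the values can be checked independently. The function name is hypothetical, and this is not ClusterControl's implementation.

```python
import json
import urllib.request

# The region selects the API host.
OPSGENIE_ALERT_URLS = {
    "US": "https://api.opsgenie.com/v2/alerts",
    "EU": "https://api.eu.opsgenie.com/v2/alerts",
}


def build_opsgenie_request(region, api_key, teams, message, alias=None):
    """Build (but do not send) a create-alert request for the given teams."""
    body = {
        "message": message,
        "responders": [{"name": team, "type": "team"} for team in teams],
    }
    if alias:
        body["alias"] = alias  # lets a later request refer to the same alert
    return urllib.request.Request(
        OPSGENIE_ALERT_URLS[region.upper()],
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"GenieKey {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Using a key from an EU account against the US endpoint (or the reverse) is rejected, so an authentication failure is a hint to check the region setting first.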

To configure OpsGenie integration:

ClusterControl GUI

Go to ClusterControl GUI → Settings → Notification Services → Add new integration → OpsGenie.
In the Service Configuration section, specify the following:
- Region: Choose either the US or EU region.
- Teams: The Opsgenie team identifiers or names that should receive and handle the alerts generated by this integration.
- API Key: The unique "GenieKey" token created when you add the API integration.
Then click on Test credentials. Ensure you receive a test notification from ClusterControl before proceeding to the next step.
In the Notification settings section, specify the following:
- Clusters - Choose one or more clusters to be notified. Multiple values are accepted.
- Events - Choose one or more ClusterControl events that will trigger the notification. Multiple values are accepted. Details on the events are described in Warning Events and Critical Events.
Click Finish to complete the configuration.

ilert

ilert is an AI-first company offering an all-in-one incident management tool for alerting, on-call management, and incident communication to help companies increase their digital uptime. B2C and B2B companies from across the globe, including well-known brands such as IKEA, Lufthansa Systems, and NTT Data, trust ilert to empower their operations teams and ensure everything is running smoothly. For ilert integration, the following information is needed:

Alert source URL - The ilert API alert source URL generated in the ClusterControl integration page for ilert. Commonly it looks like this: https://api.ilert.com/api/v1/events/clustercontrol/xxxxxxxxxxxxxxxxxxxxxxxxxxxx.

The following are the steps to get the required information:

Log in to your ilert account.
Go to Alert sources → Alert sources and click Create new alert source.
Search for "ClusterControl" in the search field, click the ClusterControl tile, and then Next.
Give your alert source a name, optionally assign teams, and click Next.
Select an escalation policy by creating a new one or assigning an existing one.
On the final page, a ClusterControl URL will be generated. Use this URL when you configure the ClusterControl integration.

Tip

See ilert Documentation - ClusterControl Integration.

To configure ilert integration:

ClusterControl GUI

Go to ClusterControl GUI → Settings → Notification Services → Add new integration → ilert.
In the Service Configuration section, specify the following:
- Integration name: Give any name to this integration.
- URL: The ilert ClusterControl integration URL generated from the alert source integration dashboard.
Then click on Test credentials. Ensure you receive a test notification from ClusterControl before proceeding to the next step.
In the Notification settings section, specify the following:
- Clusters - Choose one or more clusters to be notified. Multiple values are accepted.
- Events - Choose one or more ClusterControl events that will trigger the notification. Multiple values are accepted. Details on the events are described in Warning Events and Critical Events.
Click Finish to complete the configuration.

Webhook integration

A webhook is a lightweight API that powers one-way data sharing triggered by events. APIs and webhooks both allow different software systems to sync up and share information; together, they enable applications to share data and functionality. For webhook integration, the following information is needed:

URL - The HTTP endpoint of the service that should receive the notifications. Make sure the service is properly set up to handle webhook events.

For webhook integration, ClusterControl will raise an alarm by sending the JSON data using the HTTP POST method to the configured endpoint. The webhook endpoint should receive the following example event if a new alarm is created:

{
  "id": 470,
  "status": "CREATED",
  "component": "Node",
  "hostname": "192.168.20.62",
  "title": "Server disconnected",
  "message": "PostgreSQL server on 192.168.20.62:5432 disconnected: Connect failure: Watchdog: Failed connection to [email protected]:5432. timeout expired\n",
  "recommendation": "Check node status on UI and error log of failed server.",
  "severity": "CRITICAL"
}

The above event was created and sent to the webhook endpoint after ClusterControl detected that one of our database servers (192.168.20.62) was down. Once the server comes back online, ClusterControl sends another event with the same alarm ID (470) but with the status “ENDED”, indicating that the alarm has ended and the node is operational again:

{
  "id": 470,
  "status": "ENDED",
  "component": "Node",
  "hostname": "192.168.20.62",
  "title": "Server disconnected",
  "message": "PostgreSQL server on 192.168.20.62:5432 disconnected: Connect failure: Watchdog: Failed connection to [email protected]:5432. timeout expired\n",
  "recommendation": "Check node status on UI and error log of failed server.",
  "severity": "CRITICAL"
}

Commonly, the “ended” event is almost identical to the “created” event, except for the “status” value.
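
Since the two events share the same "id", a receiver can pair them to keep track of which alarms are still open. The sketch below is a minimal receiver built on the Python standard library. The names handle_event, WebhookHandler, and serve, the port, and the 204/400 response codes are choices made for this example, not requirements of ClusterControl.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Alarm id -> the CREATED event that opened it.
open_alarms = {}


def handle_event(event, alarms=open_alarms):
    """CREATED opens an alarm; ENDED closes the alarm with the same id."""
    alarm_id = event["id"]
    status = event["status"]
    if status == "CREATED":
        alarms[alarm_id] = event
    elif status == "ENDED":
        alarms.pop(alarm_id, None)
    return sorted(alarms)  # ids of the alarms that are still open


class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            handle_event(json.loads(self.rfile.read(length)))
        except (ValueError, KeyError, TypeError):
            self.send_response(400)  # body was not a valid event
        else:
            self.send_response(204)  # accepted, nothing to return
        self.end_headers()


def serve(port=8080):
    """Run the receiver until interrupted."""
    HTTPServer(("0.0.0.0", port), WebhookHandler).serve_forever()
```

Calling serve() and entering http://<receiver-host>:8080/ as the webhook URL would let the receiver track alarm 470 from the examples above: it is added on "CREATED" and removed on "ENDED".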

Tip

You may use https://webhook.site/ to test out the webhook integration with ClusterControl.

To configure webhook integration:

ClusterControl GUI

Go to ClusterControl GUI → Settings → Notification Services → Add new integration → Webhook.
In the Service Configuration section, specify the following:
- Integration name: Give any name to this integration.
- URL: The HTTP endpoint of the service that should receive the notifications.
Then click on Test credentials. Ensure you receive a test notification from ClusterControl before proceeding to the next step.
In the Notification settings section, specify the following:
- Clusters - Choose one or more clusters to be notified. Multiple values are accepted.
- Events - Choose one or more ClusterControl events that will trigger the notification. Multiple values are accepted. Details on the events are described in Warning Events and Critical Events.
Click Finish to complete the configuration.

Events

The following table lists all event groups supported by ClusterControl:

Event	Description
All Events	All ClusterControl events including warning and critical events.
All Warning Events	All ClusterControl warning events, e.g. cluster degradation, network glitch. See Warning Events.
All Critical Events	All ClusterControl critical events, e.g. cluster failed, host failed. See Critical Events.
Network	Network-related events, e.g. host unreachable, SSH issues.
CMON Database	Internal CMON database-related events, e.g. unable to connect to CMON database, `datadir` mounted as read-only.
Mail	Mail system-related events, e.g. unable to send mail, mail server unreachable.
Cluster	Cluster-related events, e.g. cluster failure, cluster degradation, time drifting.
Cluster Configuration	Cluster configuration events, e.g. SST account mismatch.
Cluster Recovery	Recovery events, e.g. cluster or node recovery failures.
Node	Node-related events, e.g. node disconnected, missing GRANT, failed to start HAProxy, failed to start NDB cluster nodes.
Host	Host-related messages, e.g. CPU/disk/RAM/swap exceeds thresholds, memory full.
Database Health	Database health-related events, e.g. memory usage of MySQL servers, connections, missing primary key.
Database Performance	Alarms for long-running transactions, replication lag, and deadlocks.
Software Installation	Software installation-related events, e.g. license expiration.
Backup	Backup-related events, e.g. backup failed.

Warning Events

Note

ClusterControl reuses the same alarm code names across database cluster types when the alarm describes a similar condition. For example, if a PostgreSQL slave is lagging behind, the alarm’s internal code name is still “MySqlReplicationLag”. However, the actual alarm text will contain the relevant PostgreSQL wording. This is handled internally.

Area	Alarms	Severity	Description
Node	MySqlReplicationLag	Warning	MySQL replication slave lag, default 10 seconds.
	MySqlReplicationBroken	Warning	The SQL thread has stopped. For PostgreSQL, it means the slave gets disconnected from the master.
	CertificateExpiration	Warning	SSL certificate expiration time (<=31 days, >7 days).
	MySqlAdvisor	Warning	Raised by `wsrep_sst_method.js` and `wsrep_node_name.js` advisors.
	MySqlTableAnalyzer	Warning	Raised by `schema_check_nopk.js` advisor.
	StorageMyIsam	Warning	Raised by `schema_check_myisam.js` advisor.
	MySqlIndexAnalyzer	Warning	Raised by `schema_check_dupl_index.js` advisor.
	MySqlReplicationLooseServer	Warning	A slave is found but its master can’t be determined, or it is not part of the cluster.
Host	HostSwapV2	Warning	If a configurable number of pages has been swapped in/out during a configurable period of time. Default 20 pages in 10 minutes.
	HostSwapping	Warning	>5% swap space has been used.
	HostCpuUsage	Warning	>80%, <90% CPU used.
	HostRamUsage	Warning	>80%, <90% RAM used.
	HostDiskUsage	Warning	>80%, <90% disk space used on a `monitored_mountpoint`.
	ProcessCpuUsage	Warning	>95% CPU used on average by a process for 15 minutes.
Backup	BackupFailed	Warning	The backup job fails.
Recovery	GaleraWsrepMissing	Warning	`wsrep_cluster_address` or `wsrep_provider` is missing.
	GaleraSstAuth	Warning	SST settings (user/pass) are wrong.
Network	HostFirewall	Warning	The host is not responding to ping after 3 cycles.
	HostSshSlow	Warning	It takes 6-12 seconds to SSH into a host.
Cluster	ClusterTimeDrift	Warning	Time drift between ClusterControl and database nodes.
	ClusterLicenseExpire	Warning	The license is about to expire.
	ClusterInconsitentView	Warning	The load balancer and ClusterControl see different sets of working nodes (the master is down from ClusterControl’s point of view, while the load balancer or the slave reports the master as working).

Critical Events

Note

ClusterControl reuses the same alarm code names across database cluster types when the alarm describes a similar condition. For example, if a PostgreSQL server goes down, the alarm’s internal code name is still “MySqlDisconnected”. However, the actual alarm text will contain the relevant PostgreSQL wording. This is handled internally.

Area	Alarms	Severity	Description
Node	MySqlDisconnected	Critical	The database server cannot be reached.
	MySqlGrantMissing	Critical	Node does not have the correct privileges set for the cmon user.
	MySqlLongRunningQuery	Critical	Queries are running for too long. Only used if configured; it is not configured by default.
	ProcFailedRestart	Critical	A process (HAProxy, ProxySQL, Garbd, MaxScale) could not be restarted after a failure.
	CertificateExpiration	Critical	SSL certificate expiration time (<=7 days).
	MySqlReplicationMultiMaster	Critical	Multiple writable masters detected.
Host	HostSwapV2	Critical	If a configurable number of pages has been swapped in/out during a configurable period of time. Default 20 pages in 10 minutes.
	HostSwapping	Critical	>20% swap space has been used.
	HostCpuUsage	Critical	>90% CPU used.
	HostRamUsage	Critical	>90% RAM used.
	HostDiskUsage	Critical	>90% disk space used on a `monitored_mountpoint`.
	ProcessCpuUsage	Critical	>99% CPU used on average by a process for 15 minutes.
Backup	BackupVerificationFailed	Critical	Backup verification fails.
Recovery	GaleraWsrepMissing	Critical	`wsrep_cluster_address` or `wsrep_provider` is missing, and still missing after 20 sample cycles (about 100 seconds in this case).
	GaleraClusterSplit	Critical	There is a split brain.
	ClusterRecoveryFail	Critical	Recovery has failed.
	GaleraConfigProblem1	Critical	A configuration issue is preventing the node from starting.
	GaleraNodeRecoveryFail	Critical	Automatic recovery has failed 3 consecutive times.
	ReplicationFailoverBlacklistError	Critical	Raised during automatic failover when the only possible failover candidate is blacklisted.
Network	HostUnreachable	Critical	The host is not responding to ping after 3 cycles.
	HostSshFailed	Critical	Please check SSH access to the host. The host may also be down.
	HostSshAuth	Critical	Please check whether the configured SSH key is properly configured and can be authenticated on the host.
	HostSudoError	Critical	`sudo` command error on the host.
	HostSshSlow	Critical	It takes more than 12 seconds to SSH into a host.
Cluster	ClusterFailure	Critical	The cluster has failed.
	ClusterLicenseExpire	Critical	The license has expired.
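
Several alarms appear in both tables and differ only by threshold. The sketch below restates two of those threshold pairs as code to make the boundaries explicit. The function names are hypothetical. The tables do not say how a value of exactly 90% is treated, so classifying it as a warning here is an assumption.

```python
def classify_host_usage(percent):
    """HostCpuUsage, HostRamUsage and HostDiskUsage share these thresholds."""
    if percent > 90:
        return "CRITICAL"
    if percent > 80:
        # Assumption: exactly 90% falls here; the tables leave it unspecified.
        return "WARNING"
    return None  # no alarm


def classify_certificate(days_left):
    """CertificateExpiration: <=7 days is critical, <=31 days is a warning."""
    if days_left <= 7:
        return "CRITICAL"
    if days_left <= 31:
        return "WARNING"
    return None  # no alarm
```

For example, a host at 85% disk usage maps to a warning, and a certificate with 5 days left maps to a critical alarm.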