Monitoring
AxonIQ Console provides a comprehensive monitoring solution for your Axon Server cluster and Axon Framework applications. By defining conditions based on the Axon Server health status or Axon Framework metrics, you can get notified when something goes wrong. This way, you can take action before it becomes a problem.
How it works
Once per minute, AxonIQ Console checks the conditions of the workspace and of individual resources against the metrics and health status it has collected. When a condition is met, a pending alert is created at the level defined by the condition. If the condition stays met for the duration specified in the condition, the alert is no longer pending and is sent to the configured integrations.
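The lifecycle described above can be sketched as a small state machine. This is an illustrative model only, not AxonIQ Console's implementation: the `AlertTracker` class, its state names, and the rule that an alert returns to "ok" as soon as the condition stops being met are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertTracker:
    """Tracks one condition on one resource (hypothetical model, not a Console API)."""

    duration_minutes: int
    first_met_minute: Optional[int] = None

    def check(self, minute: int, condition_met: bool) -> str:
        """Run one per-minute check and return "ok", "pending", or "firing"."""
        if not condition_met:
            # Assumption: the alert resets once the condition is no longer met.
            self.first_met_minute = None
            return "ok"
        if self.first_met_minute is None:
            # First check on which the condition is met: the alert becomes pending.
            self.first_met_minute = minute
        elapsed = minute - self.first_met_minute
        # The alert is sent only after the condition has held for the full duration.
        return "firing" if elapsed >= self.duration_minutes else "pending"
```

With `duration_minutes=2`, a condition that is met at minutes 0, 1, and 2 is pending for the first two checks and is sent to integrations on the third.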
Conditions
You can set up conditions for all monitored resources in your environment. For an overview of all the possible metrics to monitor, see available metrics. To provide a good starting point, AxonIQ Console sets up some conditions by default.
You can set up a condition for the entire environment at once. So, for example, you can set up a condition that triggers an alert when the ingest latency of any event processor in any application exceeds a certain threshold.
However, sometimes a subset of resources has different requirements. For example, a specific event processor may call a slow legacy system and therefore be slower than normal. You can then define a condition for the ingest latency of that specific event processor, overriding the environment-wide condition for that processor alone.
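The override behavior can be pictured as a lookup that prefers a resource-specific condition and falls back to the environment-wide one. The sketch below is a simplified illustration: the dictionary layout, the metric key `ingest_latency_ms`, and the processor name `legacy-processor` are hypothetical and do not reflect how AxonIQ Console stores conditions.

```python
from typing import Dict, Optional, Tuple

Key = Tuple[str, str]


def effective_threshold(
    resource_name: str,
    resource_type: str,
    metric: str,
    environment_wide: Dict[Key, float],
    overrides: Dict[Key, float],
) -> Optional[float]:
    """Return the threshold that applies to one resource, or None if no condition exists."""
    # A resource-specific condition wins over the environment-wide one.
    specific = overrides.get((resource_name, metric))
    if specific is not None:
        return specific
    # Otherwise fall back to the condition defined for the whole resource type.
    return environment_wide.get((resource_type, metric))


# Hypothetical configuration: 100 ms for every event processor, 500 ms for the slow one.
environment_wide = {("event_processor", "ingest_latency_ms"): 100.0}
overrides = {("legacy-processor", "ingest_latency_ms"): 500.0}
```

Here `legacy-processor` is checked against 500 ms, while every other event processor keeps the environment-wide 100 ms threshold.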
Environment-wide conditions
On the Monitoring page, you can set up the environment-wide conditions.
You can add a condition to any resource type by clicking "Add new condition". This adds a new condition to the list that you can configure and then save. The formula has the following parts:
Field | Description | Possible values |
---|---|---|
Level | The level of the alerts, useful for filtering which integration receives which alerts | Incident, Critical, Major, Minor |
Metric | The metric to check | Differs per resource, see available metrics. |
Operator | The operator to use for the check | =, !=, >, <, >=, <= |
Value | The value to compare the metric to | Any number |
Percentile | If the metric is a timer, the percentile to check against. The 90th percentile is generally recommended | Minimum, Median, 90th, 95th, Maximum |
Duration | The number of minutes the condition must be met before the alert is sent to the configured integrations. This helps prevent false positives. | Any number |
The screen shows this in a readable format, so you can think of it as: "Create <level> when <metric> <operator> <value> for <duration> minutes", or "Create critical when segment claim percentage != 100% for 2 minutes". You can see this in the screen below.
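To make the parts of the formula concrete, the following sketch models a condition as a small data structure that can be evaluated against an observed value and rendered in the readable form. The `Condition` class and its field names are hypothetical, chosen for illustration; they are not part of any AxonIQ Console API, and the percentile field is left out for brevity.

```python
import operator
from dataclasses import dataclass

# The six operators offered in the condition formula.
OPERATORS = {
    "=": operator.eq,
    "!=": operator.ne,
    ">": operator.gt,
    "<": operator.lt,
    ">=": operator.ge,
    "<=": operator.le,
}


@dataclass(frozen=True)
class Condition:
    """Hypothetical model of one condition row."""

    level: str
    metric: str
    op: str
    value: float
    duration_minutes: int

    def is_met(self, observed: float) -> bool:
        """Compare an observed metric value against the configured value."""
        return OPERATORS[self.op](observed, self.value)

    def describe(self) -> str:
        """Render the condition in the readable form (units omitted)."""
        return (
            f"Create {self.level} when {self.metric} {self.op} "
            f"{self.value:g} for {self.duration_minutes} minutes"
        )


condition = Condition("critical", "segment claim percentage", "!=", 100, 2)
print(condition.describe())
```

In this model, `condition.is_met(75)` is true because 75 differs from 100, while `condition.is_met(100)` is false.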
You can always adjust the conditions by clicking the "Edit" button next to the condition. This makes the entire row editable. You can change any field, except the level and metric. If you want to change the level or metric, you need to delete the condition and add a new one.
Resource-specific conditions
If you want to set up conditions for a specific resource, you can do so by navigating to the resource and clicking "Configure" next to the Alerts header in the top right corner. This opens a dialog where you can add a new condition for that specific instance.
Setting up conditions for a specific instance works similarly to setting up conditions for all instances. You can find a list of all available metrics and their defaults below. After you add a specific condition, you can find it on the resource itself.
On the Monitoring page, conditions with an override show clickable text on their row that opens a dialog listing the defined overrides.
Alerts
When a condition is met, an alert is created. You can see all alerts in the "Current alerts" section on the Monitoring page. Each resource page also has an Alerts section where you can see all alerts for that specific resource. You can also see a badge in all tables where resources are listed with the number of alerts, like in the example below.
When you click on a row with alerts, you are taken to the resource page where you can see all alerts for that resource.
Integrations
AxonIQ Console can send alerts to various integrations. This feature is only available in the AxonIQ Professional plan. You can get started by going to the Integrations page.
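Because each alert carries a level, integrations can be filtered by level. The sketch below illustrates the idea with a plain mapping from integration name to the levels it accepts. The function, the mapping, and the integration names are hypothetical and only show the filtering concept; the actual configuration is done on the Integrations page.

```python
from typing import Dict, List, Set


def integrations_for(level: str, subscriptions: Dict[str, Set[str]]) -> List[str]:
    """Return the names of the integrations that accept alerts of the given level."""
    return sorted(name for name, levels in subscriptions.items() if level in levels)


# Hypothetical setup: the on-call channel only receives the most severe levels.
subscriptions = {
    "slack-oncall": {"Incident", "Critical"},
    "slack-team": {"Critical", "Major", "Minor"},
}
```

With this setup, a Critical alert reaches both channels and a Minor alert reaches only `slack-team`.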
Slack
There are three steps to set up the Slack integration:
- Add our Slack app to your workspace
- Connect your Slack workspace to your AxonIQ Console workspace
- Set up the channels to send alerts to
Due to the dynamic nature of set-up instructions, we cannot provide a step-by-step guide here. However, you can find this information on the Integrations page.
Available metrics
The following table contains all available metrics and their defaults. Our Solution Engineers have found these defaults to be a good starting point for monitoring. Some of them are set up automatically when you start using AxonIQ Console.
Resource | Metric | Default threshold | Set up by default |
---|---|---|---|
Message Handler | Error Rate | > 1% | Yes |
Message Handler | Latency (P90) | > 200 ms | Yes |
Message Handler | Throughput | > 1000/minute | No |
Aggregate | Error Rate | > 1% | Yes |
Aggregate | Latency (P90) | > 200 ms | Yes |
Aggregate | Lock Time (P90) | > 25 ms | Yes |
Aggregate | Load Time (P90) | > 100 ms | Yes |
Aggregate | Event Commit Time (P90) | > 300 ms | Yes |
Event Processor | Segment Claim Percentage | != 100% | Yes |
Event Processor | Ingest latency | > 100 ms | Yes |
Event Processor | Commit latency | > 300 ms | Yes |
Event Processor | DLQ Size | > 0 | Yes |
Application | Replica Count | < 1 | Yes |
Application | CPU Usage | > 80% | Yes |
Application | Host CPU Usage | > 80% | Yes |
Application | Heap Usage | > 80% | Yes |
Application | Thread Count | > 200 | No |
Application | Query Bus Usage | > 80% | Yes |
Application | Command Bus Usage | > 80% | Yes |
Environment | Used connections count | > 8 | No |
Environment | Used connections percentage | > 80% | Yes |
Environment | Free connections percentage | <= 2 | No |
Environment | Free connections percentage | < 20% | No |
Axon Server cluster | Used connections count | > 8 | No |
Axon Server cluster | Used connections percentage | > 80% | Yes |
Axon Server cluster | Free connections percentage | <= 2 | No |
Axon Server cluster | Free connections percentage | < 20% | No |
Axon Server instance | Unhealthy connections | > 0 | Yes |
Axon Server instance disk | Free space in MB | < 1000 | Yes |
Axon Server replication group | Healthy replicas | < 1 | No |
Axon Server replication group | Unhealthy replicas | != 0 | Yes |
Axon Server replication group | Unapplied entries | > 100 | Yes |
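To experiment with these defaults outside the Console, for example when deciding whether a threshold suits your own workload, a threshold written in the notation of the table can be checked against a measured value. The helper below is a rough sketch: the `breaches` function and its parsing rules are assumptions made for this example, and it ignores units such as `ms`, `%`, and `/minute`.

```python
import operator
import re

OPERATORS = {
    "=": operator.eq,
    "!=": operator.ne,
    ">": operator.gt,
    "<": operator.lt,
    ">=": operator.ge,
    "<=": operator.le,
}

# Two-character operators are listed first so that ">=" is not read as ">".
THRESHOLD_PATTERN = re.compile(r"^(!=|>=|<=|=|>|<)\s*(\d+(?:\.\d+)?)")


def breaches(threshold: str, observed: float) -> bool:
    """Return True if the observed value satisfies a threshold such as "> 200 ms"."""
    match = THRESHOLD_PATTERN.match(threshold.strip())
    if match is None:
        raise ValueError(f"Unrecognized threshold: {threshold!r}")
    op, value = match.groups()
    # Anything after the number (the unit) is ignored in this sketch.
    return OPERATORS[op](observed, float(value))
```

For instance, `breaches("> 200 ms", 250)` is true, while `breaches("!= 100%", 100)` is false.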