Monitoring

AxonIQ Console provides a comprehensive monitoring solution for your Axon Server cluster and Axon Framework applications. By defining conditions based on the Axon Server health status or Axon Framework metrics, you can get notified when something goes wrong. This way, you can take action before it becomes a problem.

How it works

The conditions of the workspace and of individual resources are checked once per minute against the metrics and health data collected by AxonIQ Console. When a condition is met, a pending alert is created at the level defined by the condition. If the condition remains met for the duration specified in the condition, the alert is no longer pending and is sent to the configured integrations.
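The pending-then-sent behavior described above can be sketched as a small state tracker. This is only an illustration, not AxonIQ Console's implementation: the `AlertTracker` class and its `check` method are hypothetical names, and the exact boundary (an alert is sent once the condition has held for at least `duration_minutes` since it was first observed) is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertTracker:
    """Hypothetical tracker for one condition on one resource."""

    duration_minutes: int
    consecutive_checks_met: int = 0  # number of once-per-minute checks in a row where the condition held

    def check(self, condition_met: bool) -> Optional[str]:
        """Run one per-minute check and return the alert state, or None if there is no alert."""
        if not condition_met:
            # The condition no longer holds, so any pending alert is dropped.
            self.consecutive_checks_met = 0
            return None
        self.consecutive_checks_met += 1
        # Assumption: the first check that meets the condition counts as minute 0.
        minutes_elapsed = self.consecutive_checks_met - 1
        return "sent" if minutes_elapsed >= self.duration_minutes else "pending"


tracker = AlertTracker(duration_minutes=2)
for met in [True, True, True, False, True]:
    print(tracker.check(met))
```

With a duration of 2 minutes, a short spike that disappears before the third check never leaves the pending state, which is how the duration helps to suppress false positives.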

Conditions

You can set up conditions for all monitored resources in your environment. For an overview of all the possible metrics to monitor, see available metrics. To provide a good starting point, AxonIQ Console sets up some conditions by default.

You can set up a condition for the entire environment at once. So, for example, you can set up a condition that triggers an alert when the ingest latency of any event processor in any application exceeds a certain threshold.

However, sometimes a subset of resources has different requirements. For example, a specific event processor may call a slow legacy system and therefore be slower than normal. You can then define a condition for the ingest latency of that specific event processor, overriding the environment-wide condition for that processor alone.
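One way to picture the override is as a lookup that prefers a resource-specific value and falls back to the environment-wide one. The sketch below is illustrative only: the function name, the dictionary layout, the processor names, and the 500 ms override are all assumptions made for the example, not part of AxonIQ Console.

```python
from typing import Dict, Tuple


def effective_threshold(
    environment_wide: Dict[str, float],
    overrides: Dict[Tuple[str, str], float],
    resource: str,
    metric: str,
) -> float:
    """Return the resource-specific threshold if one exists, else the environment-wide one."""
    return overrides.get((resource, metric), environment_wide[metric])


# Environment-wide condition: ingest latency above 100 ms raises an alert.
environment_wide = {"ingest_latency_ms": 100}
# Hypothetical override for the processor that calls the slow legacy system.
overrides = {("legacy-processor", "ingest_latency_ms"): 500}

print(effective_threshold(environment_wide, overrides, "legacy-processor", "ingest_latency_ms"))
print(effective_threshold(environment_wide, overrides, "order-processor", "ingest_latency_ms"))
```

Only the overridden processor gets the relaxed threshold; every other event processor keeps the environment-wide value.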

Environment-wide conditions

On the Monitoring page, you can set up the environment-wide conditions.

Screenshot of the Monitoring Conditions screen in AxonIQ Console

You can add a condition to any resource type by clicking "Add new condition". This adds a new condition to the list that you can configure and then save. The formula has the following parts:

Field	Description	Possible values
Level	The level of the alerts, useful for filtering which integration receives which alerts	Incident, Critical, Major, Minor
Metric	The metric to check	Differs per resource, see available metrics.
Operator	The operator to use for the check	=, !=, >, <, >=, <=
Value	The value to compare the metric to	Any number
Percentile	If the metric is a timer, the percentile to check against. The 90th percentile is generally recommended.	Minimum, Median, 90th, 95th, Maximum
Duration	The number of minutes the condition must be met before the alert is sent to the configured integrations. This helps prevent false positives.	Any number
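To see why a percentile is often a better target than the maximum for timer metrics, consider the following sketch. It uses the nearest-rank method on made-up latency samples; how AxonIQ Console actually computes percentiles is not described here, so treat the function and the data as assumptions for illustration.

```python
from typing import Sequence


def percentile(samples: Sequence[float], p: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    # Ceiling of p * n / 100 using integer arithmetic, clamped to at least rank 1.
    rank = max(1, (p * len(ordered) + 99) // 100)
    return ordered[rank - 1]


# Hypothetical handler latencies in milliseconds, with a single slow outlier.
latencies_ms = [12, 15, 14, 13, 18, 16, 11, 17, 19, 950]

print("P90:", percentile(latencies_ms, 90))
print("Maximum:", percentile(latencies_ms, 100))
```

A condition on the maximum would trigger on the single outlier, while a condition on the 90th percentile only triggers when a meaningful share of invocations is slow.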


The screen shows this in a readable format, so you can think of it as: "Create <level> when <metric> <operator> <value> for <duration> minutes", or "Create critical when segment claim percentage != 100% for 2 minutes". You can see this in the screen below.

Screenshot of the Monitoring Conditions screen in AxonIQ Console with a new condition being added
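The parts of a condition map naturally onto a small data structure. The following sketch is not an AxonIQ Console API: the `Condition` class, its field names, and its methods are hypothetical and only mirror the fields and the readable format described above.

```python
import operator
from dataclasses import dataclass

# The six comparison operators a condition can use.
OPERATORS = {
    "=": operator.eq,
    "!=": operator.ne,
    ">": operator.gt,
    "<": operator.lt,
    ">=": operator.ge,
    "<=": operator.le,
}


@dataclass(frozen=True)
class Condition:
    """Hypothetical model of a monitoring condition."""

    level: str
    metric: str
    op: str
    value: float
    duration_minutes: int

    def is_met(self, observed: float) -> bool:
        """Compare an observed metric value against the configured value."""
        return OPERATORS[self.op](observed, self.value)

    def describe(self) -> str:
        """Render the condition in the readable format shown on the screen."""
        return (
            f"Create {self.level} when {self.metric} {self.op} "
            f"{self.value:g} for {self.duration_minutes} minutes"
        )


condition = Condition("critical", "segment claim percentage", "!=", 100, 2)
print(condition.describe())
print(condition.is_met(75))   # not all segments are claimed
print(condition.is_met(100))  # all segments are claimed
```

Freezing the dataclass is a convenience for the example only; it does not model which fields the Console allows you to edit after a condition is created.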

You can always adjust the conditions by clicking the "Edit" button next to the condition. This makes the entire row editable. You can change any field, except the level and metric. If you want to change the level or metric, you need to delete the condition and add a new one.

Resource-specific conditions

If you want to set up conditions for a specific resource, you can do so by navigating to the resource and clicking "Configure" next to the Alerts header in the top right corner. This opens a dialog where you can add a new condition for that specific instance.

Screenshot of a specific resource with the Configure button in the top right corner

Setting up conditions for a specific instance works similarly to setting up conditions for all instances. You can find a list of all available metrics and their defaults in the Available metrics section. After you add a specific condition, you can find it on the resource itself.

Conditions with an override are marked on the Monitoring page with clickable text on the condition's row. Clicking it opens a dialog that lists the defined overrides.

Override page of the monitoring section showing a handler override

Alerts

When a condition is met, an alert is created. You can see all alerts in the "Current alerts" section on the Monitoring page. Each resource page also has an Alerts section where you can see all alerts for that specific resource. You can also see a badge in all tables where resources are listed with the number of alerts, like in the example below.

When you click on a row with alerts, you are taken to the resource page where you can see all alerts for that resource.

Integrations

AxonIQ Console can send alerts to various integrations. This feature is only available in the AxonIQ Professional plan. You can get started by going to the Integrations page.

Slack

There are three steps to set up Slack integration:

Add our Slack app to your workspace
Connect your Slack workspace to your AxonIQ Console workspace
Set up the channels to send alerts to

Due to the dynamic nature of set-up instructions, we cannot provide a step-by-step guide here. However, you can find this information on the Integrations page.

Screenshot of the Integrations section in the Monitoring tab in AxonIQ Console

Email

Setting up email integration is easy. Just enter the email address, the level of the alerts you want to receive, and click "Add integration".

Available metrics

The following table contains all available metrics and their defaults. Our Solution Engineers have found these defaults to be a good starting point for setting up monitoring. Some of them are set up for you automatically when you start using AxonIQ Console.

Resource	Metric	Default threshold	Set up by default
Message Handler	Error Rate	> 1%	Yes
Message Handler	Latency (P90)	> 200 ms	Yes
Message Handler	Throughput	> 1000/minute	No
Aggregate	Error Rate	> 1%	Yes
Aggregate	Latency (P90)	> 200 ms	Yes
Aggregate	Lock Time (P90)	> 25 ms	Yes
Aggregate	Load Time (P90)	> 100 ms	Yes
Aggregate	Event Commit Time (P90)	> 300 ms	Yes
Event Processor	Segment Claim Percentage	!= 100%	Yes
Event Processor	Ingest latency	> 100 ms	Yes
Event Processor	Commit latency	> 300 ms	Yes
Event Processor	DLQ Size	> 0	Yes
Application	Replica Count	< 1	Yes
Application	CPU Usage	> 80%	Yes
Application	Host CPU Usage	> 80%	Yes
Application	Heap Usage	> 80%	Yes
Application	Thread Count	> 200	No
Application	Query Bus Usage	> 80%	Yes
Application	Command Bus Usage	> 80%	Yes
Environment	Used connections count	> 8	No
Environment	Used connections percentage	> 80%	Yes
Environment	Free connections percentage	<= 2	No
Environment	Free connections percentage	< 20%	No
Axon Server cluster	Used connections count	> 8	No
Axon Server cluster	Used connections percentage	> 80%	Yes
Axon Server cluster	Free connections percentage	<= 2	No
Axon Server cluster	Free connections percentage	< 20%	No
Axon Server instance	Unhealthy connections	> 0	Yes
Axon Server instance disk	Free space in MB	< 1000	Yes
Axon Server replication group	Healthy replicas	< 1	No
Axon Server replication group	Unhealthy replicas	!= 0	Yes
Axon Server replication group	Unapplied entries	> 100	Yes
