All about Alerting in ELK stack

Introduction

Alerting lets you take action based on changes in your data. It is designed around the principle that, if you can query something in Elasticsearch, you can alert on it. Simply define a query, condition, schedule, the actions to take, and Alerting will do the rest.

Until Elasticsearch v7.6, Watcher was the only way to set up alerting in ELK. Starting with v7.7, Alerting is integrated with APM, Metrics, SIEM, and Uptime; it can be centrally managed from the Management UI and provides a set of built-in actions and alerts for you to use. We will go through both options.

[Image: Alerting Options]

Elastic has excellent documentation on alerting; we have tried to collate everything in a single post and also share our experience with different configurations.

Watcher

Watcher is part of X-Pack in Elastic. X-Pack is an Elastic Stack extension that provides security, alerting, monitoring, reporting, machine learning, and many other capabilities. By default, when you install Elasticsearch, X-Pack is installed. However, it is enabled only for a 30-day trial period, after which you have to purchase a subscription to keep using the full functionality of X-Pack.

To get started with Watcher, navigate to Management > Elasticsearch > Watcher.

There are two ways to create a watch-

  • Simple threshold alert using the Watcher UI
  • Advanced watch using API syntax (JSON)

Elastic has very good documentation on how to get started with Watcher.

https://www.elastic.co/guide/en/kibana/current/watcher-ui.html#watcher-create-advanced-watch

Let's look at both options at a high level here-

Creating a threshold watcher alert

A threshold watch can be used for basic conditions where you need to check data against a threshold and trigger an action. This type of watch does not support complex conditions or aggregations, so its usage is fairly limited.

For example- trigger an alert when the average CPU usage on a machine goes above a certain threshold. CPU is a system metric, so we can use the metricbeat index to configure the alert.

[image-40]

  1. Select the index pattern to query the data from. In this case it is metricbeat-*.
  2. Select the time field and the frequency at which the alert runs. We have set it to run every 5 minutes.
  3. In the condition section, select the average of the system.process.cpu.total.norm.pct field, grouped by hostname.
  4. Select the threshold and the time period over which the threshold needs to be checked.
  5. Finally, select the action. Watcher supports multiple actions; we will look at their configuration in a separate section.
[image-41]

6. You can check the JSON version (for creating the watch through the API) by clicking the Show request link at the bottom of the page.

[image-42]

7. Go through the JSON, as it will help you in creating an advanced watch.
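For reference, the request a threshold watch generates looks roughly like the sketch below. The index, metric field, and interval match the example above, but the group-by field name (host.name) is an assumption, and the UI also appends a scripted condition and your selected actions, which are omitted here, so treat this purely as an illustration.

```json
{
  "trigger": { "schedule": { "interval": "5m" } },
  "input": {
    "search": {
      "request": {
        "indices": ["metricbeat-*"],
        "body": {
          "query": {
            "range": { "@timestamp": { "gte": "now-5m" } }
          },
          "aggs": {
            "bucketAgg": {
              "terms": { "field": "host.name" },
              "aggs": {
                "metricAgg": {
                  "avg": { "field": "system.process.cpu.total.norm.pct" }
                }
              }
            }
          }
        }
      }
    }
  }
}
```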

Creating an advanced watcher

An advanced watch requires some prior knowledge of Elasticsearch queries and syntax. The sections are still the same- input, condition, and action.

Example- trigger an alert if the Elasticsearch service goes down on any node. We will use the heartbeat index to configure it. Since we need to check the monitor.status text field, a threshold watch cannot be used for this scenario. We also want to send the list of hosts where the service is down in the alert action.

  1. Once you click the Create advanced watch option, Elastic gives you a default template which you can update according to your requirements.
  2. Update the trigger interval to 5 minutes and keep heartbeat-* as the index to query.
  3. Update the time range for which the events need to be filtered.
  4. Next, add filters to select the specific monitor (like Elasticsearch) and monitor.status not equal to ‘up’ (which means all down-monitor events will be selected).
  5. Create an aggregation on url.domain (which refers to the hostname) so that we can use it to list hostnames in the alert action.
  6. Add a condition for payload.hits >= 1, which means at least one down event was identified.
  7. In the actions section, add the desired actions like email/Slack/webhook etc. with their respective configurations. Refer to the link for action configs.
  8. Select Create.
  9. You can check the watch history (for an advanced watch as well as a threshold watch) by clicking on the watch ID.

See the full config-

{
  "trigger": {
    "schedule": {
      "interval": "5m"
    }
  },
  "input": {
    "search": {
      "request": {
        "search_type": "query_then_fetch",
        "indices": [
          "heartbeat-*"
        ],
        "body": {
          "query": {
            "bool": {
              "filter": [
                {
                  "range": {
                    "@timestamp": {
                      "gte": "{{ctx.trigger.scheduled_time}}||-5m"
                    }
                  }
                },
                {
                  "term": {
                    "monitor.name": "ES_Service_Monitor"
                  }
                }
              ],
              "must_not": {
                "match": {
                  "monitor.status": "up"
                }
              }
            }
          },
          "aggs": {
            "nodes": {
              "terms": {
                "field": "url.domain",
                "size": 20
              }
            }
          }
        }
      }
    }
  },
  "condition": {
    "compare": {
      "ctx.payload.hits.total": {
        "gte": 1
      }
    }
  },
  "actions": {
    "notify-slack": {
      "slack": {
        "account": "monitoring",
        "message": {
          "from": "ELK-Monitor",
          "to": [
            "#elastic_alerts"
          ],
          "text": "ElasticSearch Service Down Alert",
          "attachments": [
            {
              "color": "danger",
              "title": "Alert Details",
              "text": "ElasticSearch Service Down on - \n {{#ctx.payload.aggregations.nodes.buckets}} Host - {{key}} | Status - Down \n {{/ctx.payload.aggregations.nodes.buckets}}"
            }
          ]
        }
      }
    },
    "send_email": {
      "email": {
        "profile": "standard",
        "to": [
          "[email protected]",
          "[email protected]"
        ],
        "subject": "ElasticSearch Service Down Alert",
        "body": {
          "text": "ElasticSearch Service Down on - \n {{#ctx.payload.aggregations.nodes.buckets}} Host - {{key}} | Status - Down \n {{/ctx.payload.aggregations.nodes.buckets}}"
        }
      }
    }
  }
}
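To see what the Mustache loop in the action text renders to, here is a small Python sketch. This is an illustration only, not Watcher code; the payload below is mocked and the hostnames are made up.

```python
# Mocked watch payload: what ctx.payload.aggregations.nodes.buckets
# might contain after the terms aggregation on url.domain runs.
mock_payload = {
    "aggregations": {
        "nodes": {
            "buckets": [
                {"key": "es-node-1.techmanyu.com", "doc_count": 3},
                {"key": "es-node-2.techmanyu.com", "doc_count": 1},
            ]
        }
    }
}

def render_alert_text(payload):
    """Expand the bucket loop the same way the Mustache section
    {{#...buckets}} ... {{/...buckets}} does in the Slack/email text."""
    lines = ["ElasticSearch Service Down on -"]
    for bucket in payload["aggregations"]["nodes"]["buckets"]:
        lines.append(f"Host - {bucket['key']} | Status - Down")
    return "\n".join(lines)

print(render_alert_text(mock_payload))
```

Each bucket key (one per down host) becomes one line in the alert message, which is exactly why step 5 above adds the url.domain aggregation.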

Let's also look at a watch example with an HTTP API as input instead of an index. Example- alert when the Elasticsearch cluster status becomes red.

{
  "trigger": {
    "schedule": {
      "interval": "5m"
    }
  },
  "input": {
    "http": {
      "request": {
        "scheme": "http",
        "host": "elastichost.techmanyu.com",
        "port": 9200,
        "method": "get",
        "path": "/_cluster/health",
        "params": {},
        "headers": {},
        "auth": {
          "basic": {
            "username": "elastic",
            "password": "password"
          }
        }
      }
    }
  },
  "condition": {
    "compare": {
      "ctx.payload.status": {
        "eq": "red"
      }
    }
  },
  "actions": {
    "notify-slack": {
      "slack": {
        "account": "monitoring",
        "message": {
          "from": "ELK-Monitor",
          "to": [
            "#elastic_alerts"
          ],
          "text": "ElasticSearch Cluster Status Alert",
          "attachments": [
            {
              "color": "danger",
              "title": "Alert Details",
              "text": "Status of the Cluster is {{ctx.payload.status}}."
            }
          ]
        }
      }
    }
  }
}
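The compare condition here is just an equality check on the parsed HTTP response body. A mocked Python illustration (not Watcher code; the payload is made up):

```python
# Mocked /_cluster/health response; Watcher places the parsed JSON
# body of the HTTP input into ctx.payload.
mock_health = {"cluster_name": "demo-cluster", "status": "red"}

def condition_met(payload):
    # Equivalent of: "compare": {"ctx.payload.status": {"eq": "red"}}
    return payload.get("status") == "red"

print(condition_met(mock_health))  # True -> the Slack action fires
```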

Throttling

Throttling is an important aspect of alerting, as you do not want repeated alerts for the same situation. Most of the time, too many alerts lead to missing the important ones and thus delay action.

Throttling can only be configured with an advanced watch as of now; in future versions it may come to threshold watches as well.

Throttling can be configured at two levels-

  • At watch level which gets applied to all the actions for the alert
  • At action level which gets applied to only the specific action

How throttling works

During the watch execution, once the condition is met, a decision is made per configured throttling period as to whether it should be throttled. The main purpose of action throttling is to prevent too many executions of the same action for the same watch.

You can define a throttling period as part of the action configuration to limit how often the action is executed. When you set a throttling period, Watcher prevents repeated execution of the action if it has already executed within the throttling period time frame (now - throttling period).

Note that the throttling period does not reset if the condition becomes true and then false again. The full throttling period must elapse before throttling resets and an alert can trigger again.

For example, say the throttling period at the watch level or action level is set to 1 hour for the Elasticsearch down monitor. When the service goes down, the condition is met, the action fires, and the throttling period starts. Now the service comes back up and the alert condition becomes false at the next run, but the throttling period is still not reset. If the service goes down again on a subsequent run and the condition is met, the action will not fire because the throttling period is not yet complete. No action fires until the full 1-hour throttling period expires. (Conceptually, throttling should reset when the condition becomes false, but that is not the case with Watcher in Elastic.)

The other option is to manually acknowledge the watch when it has fired. Acknowledging the watch throttles the alert until the condition becomes false again (condition based, not time based).

Example of watch level throttling-

In the above example, you can add the throttle parameter at the root level of the watch.

[image-44]
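In JSON form, watch-level throttling is a single parameter at the root of the watch body, for example (fragment only; the input, condition, and actions stay as in the full config above):

```json
{
  "trigger": { "schedule": { "interval": "5m" } },
  "throttle_period": "1h"
}
```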

Example of action level throttling-

[image-45]

You can also use “throttle_period” in place of “throttle_period_in_millis”; on save it gets converted to millis.

Example

"throttle_period" : "1h"
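For action-level throttling, the same parameter goes inside the individual action instead. A fragment based on the Slack action from the earlier watch:

```json
"actions": {
  "notify-slack": {
    "throttle_period": "1h",
    "slack": {
      "account": "monitoring"
    }
  }
}
```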

Acknowledging a Watch

As explained earlier, if you want condition-based throttling rather than time-based, you can manually acknowledge the alert and it will stay throttled as long as the condition remains true.

[image-47]

It can be done through the API as well.
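For example, from Kibana Dev Tools, a request along these lines acknowledges a watch (or a single action of it); replace the placeholders with your own watch and action IDs:

```
PUT _watcher/watch/<watch_id>/_ack
PUT _watcher/watch/<watch_id>/_ack/<action_id>
```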

Refer to the documentation for more details on throttling.

Settings to enable Email and Slack Actions

Email and Slack are the two most commonly used alerting channels, so let's check how to enable these actions.

Email

To enable email actions, we need to add SMTP configuration to the elasticsearch.yml file on all Elasticsearch nodes.

xpack:  
  notification:
    email:
      account:
        exchange_account:
          smtp:
            host: "mail.techmanyu.com"
            port: 25
          email_defaults:
            from: "[email protected]"

Slack

To enable Slack actions, update the following config in elasticsearch.yml on all Elasticsearch nodes.

xpack:
  notification:
    slack:
      default_account: "monitoring"
      account:
        monitoring:
          message_defaults:
            from: "ELK-monitoring"

The next step is to add the webhook URL to the elasticsearch-keystore. Follow our earlier post for detailed steps-
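In short, assuming the monitoring account name from the config above, the Slack webhook URL is stored with the elasticsearch-keystore tool using the secure_url setting (run on each node; the tool then prompts for the URL):

```sh
bin/elasticsearch-keystore add xpack.notification.slack.account.monitoring.secure_url
```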

Alerts and Actions

Starting with v7.7, Kibana supports built-in alerting from the UI and integration with multiple modules like APM, Uptime, Metrics, etc. This was released in May 2020 (just weeks back, and we are yet to fully evaluate the capabilities).

Documentation- https://www.elastic.co/guide/en/kibana/7.x/alerting-getting-started.html

To get started with Alerts and Actions, navigate to Management > Kibana > Alerts and Actions.

The first question that comes to mind is: when watchers are already there, what is the need for Kibana alerting? What is the difference between the two?

Difference between ElasticSearch Watcher and Kibana Alerting

  • First of all, watchers are executed at the Elasticsearch layer, while alerting is executed at the Kibana layer.
  • As watchers are executed at the Elasticsearch layer, access management cannot be done on them using spaces, since spaces are managed in Kibana. Kibana alerts can be managed using spaces.
  • No advanced query knowledge is required for Kibana alerts, as these are UI based.
  • Kibana alerts are coupled with modules like APM, Metrics, Uptime, etc., so direct data alerts can be created.

Check the post below related to Kibana Access Management-

Create Connector

Connectors provide a central place to store connection information for services and integrations like Email, Slack, PagerDuty, Webhook, etc. Unlike Watcher, you don't have to store webhook URLs or settings in yml files; all the settings can be updated directly in the UI.

[image-48]
[image-49]

Create Alert

Example- notify if the average transaction duration exceeds 5 seconds over the last 30 minutes.

  1. As with a watcher alert, select the frequency as the trigger interval.
  2. Select the Notify every field, which works as the throttling period.
  3. Select the index, which is apm-* in our case.
  4. Select the condition as the average of transaction.duration.us exceeding 5 seconds for all documents.
  5. Select one or more of the previously created connectors as actions.
  6. Done.
[image-50]

You can also create these alerts directly from the APM, SIEM, or Uptime modules.

For example, go to Uptime > Alerts > Create Alerts.

It gives all the options pre-filled, specific to Uptime.

[image-51]

That's it about alerting in the ELK stack. We will keep updating this post with any findings or updates.

Check our post on auto-clear notifications using Watcher-

References:

https://www.elastic.co/guide/en/kibana/current/alerting-getting-started.html

https://www.elastic.co/guide/en/kibana/current/watcher-ui.html#watcher-create-advanced-watch

Thanks for checking out. Do share your comments and feedback.
