Auto-clear notifications using Watcher

Alerting lets you take action based on changes in your data. In the ELK stack, we can create alerts using Watcher. In our previous post, we covered alerting in ELK in detail.

We saw how to use different alert channels such as email, Slack, and webhooks, and set them as actions in a watcher.

This way we get alerts whenever a service is down or a metric has crossed a certain threshold.

But how do we get a clear notification once the alert is back to OK?

For instance, suppose we use a webhook action to create tickets in JIRA (or any other ticketing platform) for tracking. Once the alert clears, we would also want the corresponding JIRA ticket to be closed automatically. This feature does not come out of the box in Watcher.

We can achieve this with the .watcher-history index, which stores the result of every watcher execution, by checking the previous execution results in a separate watcher dedicated to clear notifications (see the quick check after the list below). So we need two watchers for each alert –

  • One for creating the ticket when the service goes down or crosses the threshold
  • One for closing the ticket once the main alert is back to an OK state
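
To see what Watcher stores for each run, you can query the .watcher-history index directly. Here is a minimal check in Kibana Dev Tools (illustrative; it pulls the latest execution record for the main watch we create below and keeps only the fields we care about):

GET .watcher-history-*/_search
{
  "size": 1,
  "sort": [
    {
      "trigger_event.triggered_time": {
        "order": "desc"
      }
    }
  ],
  "query": {
    "term": {
      "watch_id": "es_service_availability"
    }
  },
  "_source": [
    "watch_id",
    "trigger_event.triggered_time",
    "result.condition.met"
  ]
}

The result.condition.met flag in each of these records is what the clear watcher will compare across the last two executions.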

Let's take the example of an Elasticsearch service down alert –

  • Main Watcher – Create a watcher to trigger a webhook action when the Elasticsearch service goes down on any node (using the heartbeat index).

watcher id – es_service_availability (the webhook action below is just an example, not an actual JIRA API configuration)

{
  "trigger": {
    "schedule": {
      "interval": "5m"
    }
  },
  "input": {
    "search": {
      "request": {
        "search_type": "query_then_fetch",
        "indices": [
          "heartbeat-*"
        ],
        "rest_total_hits_as_int": true,
        "body": {
          "query": {
            "bool": {
              "filter": [
                {
                  "range": {
                    "@timestamp": {
                      "gte": "{{ctx.trigger.scheduled_time}}||-5m"
                    }
                  }
                },
                {
                  "term": {
                    "monitor.name": "ElasticSearch_Service"
                  }
                }
              ],
              "must_not": {
                "match": {
                  "monitor.status": "up"
                }
              }
            }
          },
          "aggs": {
            "nodes": {
              "terms": {
                "field": "url.domain",
                "size": 20
              }
            }
          }
        }
      }
    }
  },
  "condition": {
    "compare": {
      "ctx.payload.hits.total": {
        "gte": 1
      }
    }
  },
  "actions": {
    "create_moogsoft": {
      "webhook": {
        "scheme": "https",
        "host": "jira.techmanyu.com",
        "port": 443,
        "method": "post",
        "path": "/createticket",
        "params": {},
        "headers": {
          "Content-Type": "application/json"
        },
        "body": """{"host":"ELK","source": "ELK","severity": "3","description": "ELK - ElasticSearch Service Down on - 
 {{#ctx.payload.aggregations.nodes.buckets}} Host - {{key}} | Status - Down 
 {{/ctx.payload.aggregations.nodes.buckets}}""""
      }
    }
  }
}

The above watcher will create a ticket in JIRA with severity 3 once the Elasticsearch service shows as down in Heartbeat/Uptime.
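
If you are working in Kibana Dev Tools, the watch is registered (or updated) with the Watcher PUT API. A sketch, with the sections elided; the body is simply the full JSON definition shown above:

PUT _watcher/watch/es_service_availability
{
  "trigger": { ... },
  "input": { ... },
  "condition": { ... },
  "actions": { ... }
}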

  • Clear Watcher – Now create another watcher to close the ticket once the Elasticsearch service is up again (using .watcher-history index data).

watcher id – es_service_availability_clear (again, the webhook action below is just an example, not an actual JIRA API configuration)

{
  "trigger": {
    "schedule": {
      "interval": "4m"
    }
  },
  "input": {
    "search": {
      "request": {
        "search_type": "query_then_fetch",
        "indices": [
          "<.watcher-history-*>"
        ],
        "rest_total_hits_as_int": true,
        "body": {
          "sort": [
            {
              "trigger_event.triggered_time": {
                "order": "desc"
              }
            }
          ],
          "query": {
            "bool": {
              "filter": [
                {
                  "range": {
                    "trigger_event.triggered_time": {
                      "gte": "now-1h/m"
                    }
                  }
                }
              ],
              "must": {
                "term": {
                  "watch_id": "es_service_availability"
                }
              }
            }
          },
          "size": 2
        }
      },
      "timeout_in_millis": 15000
    }
  },
  "condition": {
    "script": {
      "source": "return ctx.payload.hits.hits.0._source.result.condition.met == false && ctx.payload.hits.hits.1._source.result.condition.met == true",
      "lang": "painless"
    }
  },
  "actions": {
    "clear_moogsoft": {
      "webhook": {
        "scheme": "https",
        "host": "jira.techmanyu.com",
        "port": 443,
        "method": "post",
        "path": "/closeticket",
        "params": {},
        "headers": {
          "Content-Type": "application/json"
        },
        "body": """{"host":"ELK","source": "ELK","severity": "0","description": "ELK - ElasticSearch Service is up""""
      }
    }
  }
}

The above watcher fetches the last two execution records from the .watcher-history-* index for watch id es_service_availability (our main watcher that creates the ticket).

For those two records, it checks the result.condition.met value.

  1. When the main watcher fires (ES service down), result.condition.met is true.
  2. When the service is back up and the main watcher executes next, result.condition.met is false because the down condition is no longer met.
  3. When the clear watcher runs and reads the last two result.condition.met values, it finds the second-to-last execution true and the latest execution false, which means the service was down but is now up as per the latest execution. So it sends a close notification to JIRA.
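
To verify this logic without waiting for a real down/up cycle, the clear watcher can be dry-run with the execute watch API. A sketch, with actions simulated so the webhook is not actually called (by default the run is not recorded in .watcher-history):

POST _watcher/watch/es_service_availability_clear/_execute
{
  "action_modes": {
    "_all": "simulate"
  }
}

The response shows the evaluated condition, so you can confirm whether the close notification would have fired.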

You are probably wondering why we can't do this in a single watcher. We could, but it would keep triggering the clear notification for as long as the service stays up. Throttling can't be used to suppress these repeated clear alerts either, because throttling is purely time based, irrespective of the condition.
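
For reference, this is roughly what time-based throttling looks like when attached to an action (the 30m value is illustrative); it only suppresses repeats of that action for a fixed window, regardless of whether the underlying condition has cleared in the meantime:

  "actions": {
    "clear_moogsoft": {
      "throttle_period": "30m",
      "webhook": { ... }
    }
  }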

Let us know if this works well for you, or if you have a better way of handling clear notifications in Elastic.
