跳至主要內容

告警

...大约 6 分钟

告警

概览

IoTDB 告警功能预计支持两种模式:

  • 写入触发:用户写入原始数据到原始时间序列,每插入一条数据都会触发 Trigger 的判断逻辑,
    若满足告警要求则发送告警到下游数据接收器,
    数据接收器再转发告警到外部终端。这种模式:

    • 适合需要即时监控每一条数据的场景。
    • 由于触发器中的运算会影响数据写入性能,适合对原始数据写入性能不敏感的场景。
  • 持续查询:用户写入原始数据到原始时间序列,
    ContinousQuery 定时查询原始时间序列,将查询结果写入新的时间序列,
    每一次写入触发 Trigger 的判断逻辑,
    若满足告警要求则发送告警到下游数据接收器,
    数据接收器再转发告警到外部终端。这种模式:

    • 适合需要定时查询数据在某一段时间内的情况的场景。
    • 适合需要将原始数据降采样并持久化的场景。
    • 由于定时查询几乎不影响原始时间序列的写入,适合对原始数据写入性能敏感的场景。

随着 Trigger 模块的引入,可以实现写入触发模式的告警。

部署 AlertManager

安装与运行

二进制文件

预编译好的二进制文件可在 这里open in new window 下载。

运行方法:

./alertmanager --config.file=<your_file>

Docker 镜像

可在 Quay.ioopen in new window
Docker Hubopen in new window 获得。

运行方法:

docker run --name alertmanager -d -p 127.0.0.1:9093:9093 quay.io/prometheus/alertmanager

配置

如下是一个示例,可以覆盖到大部分配置规则,详细的配置规则参见
这里open in new window

示例:

# alertmanager.yml

global:
  # The smarthost and SMTP sender used for mail notifications.
  smtp_smarthost: 'localhost:25'
  smtp_from: 'alertmanager@example.org'

# The root route on which each incoming alert enters.
route:
  # The root route must not have any matchers as it is the entry point for
  # all alerts. It needs to have a receiver configured so alerts that do not
  # match any of the sub-routes are sent to someone.
  receiver: 'team-X-mails'

  # The labels by which incoming alerts are grouped together. For example,
  # multiple alerts coming in for cluster=A and alertname=LatencyHigh would
  # be batched into a single group.
  #
  # To aggregate by all possible labels use '...' as the sole label name.
  # This effectively disables aggregation entirely, passing through all
  # alerts as-is. This is unlikely to be what you want, unless you have
  # a very low alert volume or your upstream notification system performs
  # its own grouping. Example: group_by: [...]
  group_by: ['alertname', 'cluster']

  # When a new group of alerts is created by an incoming alert, wait at
  # least 'group_wait' to send the initial notification.
  # This way ensures that you get multiple alerts for the same group that start
  # firing shortly after another are batched together on the first
  # notification.
  group_wait: 30s

  # When the first notification was sent, wait 'group_interval' to send a batch
  # of new alerts that started firing for that group.
  group_interval: 5m

  # If an alert has successfully been sent, wait 'repeat_interval' to
  # resend them.
  repeat_interval: 3h

  # All the above attributes are inherited by all child routes and can
  # overwritten on each.

  # The child route trees.
  routes:
  # This routes performs a regular expression match on alert labels to
  # catch alerts that are related to a list of services.
  - match_re:
      service: ^(foo1|foo2|baz)$
    receiver: team-X-mails

    # The service has a sub-route for critical alerts, any alerts
    # that do not match, i.e. severity != critical, fall-back to the
    # parent node and are sent to 'team-X-mails'
    routes:
    - match:
        severity: critical
      receiver: team-X-pager

  - match:
      service: files
    receiver: team-Y-mails

    routes:
    - match:
        severity: critical
      receiver: team-Y-pager

  # This route handles all alerts coming from a database service. If there's
  # no team to handle it, it defaults to the DB team.
  - match:
      service: database

    receiver: team-DB-pager
    # Also group alerts by affected database.
    group_by: [alertname, cluster, database]

    routes:
    - match:
        owner: team-X
      receiver: team-X-pager

    - match:
        owner: team-Y
      receiver: team-Y-pager

# Inhibition rules allow to mute a set of alerts given that another alert is
# firing.
# We use this to mute any warning-level notifications if the same alert is
# already critical.
inhibit_rules:
- source_match:
    severity: 'critical'
  target_match:
    severity: 'warning'
  # Apply inhibition if the alertname is the same.
  # CAUTION: 
  #   If all label names listed in `equal` are missing 
  #   from both the source and target alerts,
  #   the inhibition rule will apply!
  equal: ['alertname']

receivers:
- name: 'team-X-mails'
  email_configs:
  - to: 'team-X+alerts@example.org, team-Y+alerts@example.org'

- name: 'team-X-pager'
  email_configs:
  - to: 'team-X+alerts-critical@example.org'
  pagerduty_configs:
  - routing_key: <team-X-key>

- name: 'team-Y-mails'
  email_configs:
  - to: 'team-Y+alerts@example.org'

- name: 'team-Y-pager'
  pagerduty_configs:
  - routing_key: <team-Y-key>

- name: 'team-DB-pager'
  pagerduty_configs:
  - routing_key: <team-DB-key>

在后面的示例中,我们采用的配置如下:

# alertmanager.yml

global: 
  smtp_smarthost: ''
  smtp_from: '' 
  smtp_auth_username: '' 
  smtp_auth_password: '' 
  smtp_require_tls: false

route:
  group_by: ['alertname'] 
  group_wait: 1m
  group_interval: 10m
  repeat_interval: 10h 
  receiver: 'email'

receivers:
  - name: 'email'
    email_configs: 
    - to: '' 

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname']

API

AlertManager API 分为 v1v2 两个版本,当前 AlertManager API 版本为 v2
(配置参见
api/v2/openapi.yamlopen in new window)。

默认配置的前缀为 /api/v1/api/v2
发送告警的 endpoint 为 /api/v1/alerts/api/v2/alerts
如果用户指定了 --web.route-prefix
例如 --web.route-prefix=/alertmanager/
那么前缀将会变为 /alertmanager/api/v1/alertmanager/api/v2
发送告警的 endpoint 变为 /alertmanager/api/v1/alerts
/alertmanager/api/v2/alerts

创建 trigger

编写 trigger 类

用户通过自行创建 Java 类、编写钩子中的逻辑来定义一个触发器。
具体配置流程参见 Triggers

下面的示例创建了 org.apache.iotdb.trigger.ClusterAlertingExample 类,
alertManagerHandler
成员变量可发送告警至地址为 http://127.0.0.1:9093/ 的 AlertManager 实例。

value > 100.0 时,发送 severitycritical 的告警;
50.0 < value <= 100.0 时,发送 severitywarning 的告警。

package org.apache.iotdb.trigger;

import org.apache.iotdb.db.engine.trigger.sink.alertmanager.AlertManagerConfiguration;
import org.apache.iotdb.db.engine.trigger.sink.alertmanager.AlertManagerEvent;
import org.apache.iotdb.db.engine.trigger.sink.alertmanager.AlertManagerHandler;
import org.apache.iotdb.trigger.api.Trigger;
import org.apache.iotdb.trigger.api.TriggerAttributes;
import org.apache.iotdb.tsfile.file.metadata.enums.TSDataType;
import org.apache.iotdb.tsfile.write.record.Tablet;
import org.apache.iotdb.tsfile.write.schema.MeasurementSchema;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.IOException;
import java.util.HashMap;
import java.util.List;

public class ClusterAlertingExample implements Trigger {
  private static final Logger LOGGER = LoggerFactory.getLogger(ClusterAlertingExample.class);

  private final AlertManagerHandler alertManagerHandler = new AlertManagerHandler();

  private final AlertManagerConfiguration alertManagerConfiguration =
      new AlertManagerConfiguration("http://127.0.0.1:9093/api/v2/alerts");

  private String alertname;

  private final HashMap<String, String> labels = new HashMap<>();

  private final HashMap<String, String> annotations = new HashMap<>();

  @Override
  public void onCreate(TriggerAttributes attributes) throws Exception {
    alertname = "alert_test";

    labels.put("series", "root.ln.wf01.wt01.temperature");
    labels.put("value", "");
    labels.put("severity", "");

    annotations.put("summary", "high temperature");
    annotations.put("description", "{{.alertname}}: {{.series}} is {{.value}}");

    alertManagerHandler.open(alertManagerConfiguration);
  }

  @Override
  public void onDrop() throws IOException {
    alertManagerHandler.close();
  }

  @Override
  public boolean fire(Tablet tablet) throws Exception {
    List<MeasurementSchema> measurementSchemaList = tablet.getSchemas();
    for (int i = 0, n = measurementSchemaList.size(); i < n; i++) {
      if (measurementSchemaList.get(i).getType().equals(TSDataType.DOUBLE)) {
        // for example, we only deal with the columns of Double type
        double[] values = (double[]) tablet.values[i];
        for (double value : values) {
          if (value > 100.0) {
            LOGGER.info("trigger value > 100");
            labels.put("value", String.valueOf(value));
            labels.put("severity", "critical");
            AlertManagerEvent alertManagerEvent =
                new AlertManagerEvent(alertname, labels, annotations);
            alertManagerHandler.onEvent(alertManagerEvent);
          } else if (value > 50.0) {
            LOGGER.info("trigger value > 50");
            labels.put("value", String.valueOf(value));
            labels.put("severity", "warning");
            AlertManagerEvent alertManagerEvent =
                new AlertManagerEvent(alertname, labels, annotations);
            alertManagerHandler.onEvent(alertManagerEvent);
          }
        }
      }
    }
    return true;
  }
}

创建 trigger

如下的 sql 语句在 root.ln.wf01.wt01.temperature
时间序列上注册了名为 root-ln-wf01-wt01-alert
运行逻辑由 org.apache.iotdb.trigger.ClusterAlertingExample
类定义的触发器。

  CREATE STATELESS TRIGGER `root-ln-wf01-wt01-alert`
  AFTER INSERT
  ON root.ln.wf01.wt01.temperature
  AS "org.apache.iotdb.trigger.ClusterAlertingExample"
  USING URI 'http://jar/ClusterAlertingExample.jar'

写入数据

当我们完成 AlertManager 的部署和启动、Trigger 的创建,
可以通过向时间序列写入数据来测试告警功能。

INSERT INTO root.ln.wf01.wt01(timestamp, temperature) VALUES (1, 0);
INSERT INTO root.ln.wf01.wt01(timestamp, temperature) VALUES (2, 30);
INSERT INTO root.ln.wf01.wt01(timestamp, temperature) VALUES (3, 60);
INSERT INTO root.ln.wf01.wt01(timestamp, temperature) VALUES (4, 90);
INSERT INTO root.ln.wf01.wt01(timestamp, temperature) VALUES (5, 120);

执行完上述写入语句后,可以收到告警邮件。由于我们的 AlertManager 配置中设定 severitycritical 的告警
会抑制 severitywarning 的告警,我们收到的告警邮件中只包含写入
(5, 120) 后触发的告警。

alerting

Copyright © 2024 The Apache Software Foundation.
Apache IoTDB, IoTDB, Apache, the Apache feather logo, and the Apache IoTDB project logo are either registered trademarks or trademarks of The Apache Software Foundation in all countries

Have a question? Connect with us on QQ, WeChat, or Slack. Join the community now.