
2023/7/10

Alerting

Overview

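IoTDB's alerting feature is expected to support two modes:

Write-triggered: the user writes raw data to the original time series, and every inserted data point runs the trigger's evaluation logic.
If the alert condition is met, the trigger sends an alert to the downstream data sink,
and the sink forwards the alert to external terminals. This mode:
- suits scenarios that need to monitor every data point immediately.
- suits scenarios that are not sensitive to raw-data write performance, because the computation in the trigger slows down writes.
Continuous query: the user writes raw data to the original time series,
a ContinuousQuery periodically queries the original time series and writes the query result to a new time series,
and each of these writes runs the trigger's evaluation logic.
If the alert condition is met, the trigger sends an alert to the downstream data sink,
and the sink forwards the alert to external terminals. This mode:
- suits scenarios that need to check periodically how the data behaved over a given time span.
- suits scenarios that need to downsample the raw data and persist the result.
- suits scenarios that are sensitive to raw-data write performance, because the periodic query has almost no effect on writes to the original time series.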
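
To make the difference between the two modes concrete, here is a minimal conceptual sketch in plain Java. It does not use any IoTDB API: the class `AlertModesSketch`, the threshold of 100.0, and the count-based window (a real continuous query works on time intervals) are all assumptions made for illustration.

```java
/** Conceptual sketch only: compares per-point checks with checks on aggregated windows. */
final class AlertModesSketch {

  // Hypothetical alert threshold, chosen for illustration.
  static final double THRESHOLD = 100.0;

  /** Write-triggered mode: the check runs once for every inserted value. */
  static int countPerPointAlerts(double[] values) {
    int alerts = 0;
    for (double value : values) {
      if (value > THRESHOLD) {
        alerts++;
      }
    }
    return alerts;
  }

  /**
   * Continuous-query mode: values are first aggregated per window (here the
   * average of windowSize consecutive points), and the check runs once per
   * aggregated result instead of once per raw point.
   */
  static int countWindowedAlerts(double[] values, int windowSize) {
    if (windowSize <= 0) {
      throw new IllegalArgumentException("windowSize must be positive");
    }
    int alerts = 0;
    for (int start = 0; start < values.length; start += windowSize) {
      int end = Math.min(start + windowSize, values.length);
      double sum = 0.0;
      for (int i = start; i < end; i++) {
        sum += values[i];
      }
      double average = sum / (end - start);
      if (average > THRESHOLD) {
        alerts++;
      }
    }
    return alerts;
  }

  public static void main(String[] args) {
    double[] values = {90, 120, 80, 130, 140, 150};
    System.out.println("per-point alerts: " + countPerPointAlerts(values));
    System.out.println("windowed alerts:  " + countWindowedAlerts(values, 3));
  }
}
```

In this sketch the per-point check reacts to every spike, while the windowed check reacts only when a whole window is high on average, which mirrors the trade-off between immediacy and write overhead described above.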

With the trigger module and the sink module now available,
users can combine these two modules with AlertManager to implement alerting in write-triggered mode.

Deploying AlertManager

Installation and running

Binaries

Precompiled binaries can be downloaded here.

To run:

./alertmanager --config.file=<your_file>

Docker image

Available on Quay.io
or Docker Hub.

To run:

docker run --name alertmanager -d -p 127.0.0.1:9093:9093 quay.io/prometheus/alertmanager

Configuration

The following example covers most of the configuration rules. For the full set of rules, see
here.

Example:

# alertmanager.yml

global:
  # The smarthost and SMTP sender used for mail notifications.
  smtp_smarthost: 'localhost:25'
  smtp_from: 'alertmanager@example.org'

# The root route on which each incoming alert enters.
route:
  # The root route must not have any matchers as it is the entry point for
  # all alerts. It needs to have a receiver configured so alerts that do not
  # match any of the sub-routes are sent to someone.
  receiver: 'team-X-mails'

  # The labels by which incoming alerts are grouped together. For example,
  # multiple alerts coming in for cluster=A and alertname=LatencyHigh would
  # be batched into a single group.
  #
  # To aggregate by all possible labels use '...' as the sole label name.
  # This effectively disables aggregation entirely, passing through all
  # alerts as-is. This is unlikely to be what you want, unless you have
  # a very low alert volume or your upstream notification system performs
  # its own grouping. Example: group_by: [...]
  group_by: ['alertname', 'cluster']

  # When a new group of alerts is created by an incoming alert, wait at
  # least 'group_wait' to send the initial notification.
  # This ensures that multiple alerts for the same group that start
  # firing shortly after one another are batched together in the first
  # notification.
  group_wait: 30s

  # When the first notification was sent, wait 'group_interval' to send a batch
  # of new alerts that started firing for that group.
  group_interval: 5m

  # If an alert has successfully been sent, wait 'repeat_interval' to
  # resend them.
  repeat_interval: 3h

  # All the above attributes are inherited by all child routes and can
  # be overwritten on each.

  # The child route trees.
  routes:
  # This route performs a regular expression match on alert labels to
  # catch alerts that are related to a list of services.
  - match_re:
      service: ^(foo1|foo2|baz)$
    receiver: team-X-mails

    # The service has a sub-route for critical alerts, any alerts
    # that do not match, i.e. severity != critical, fall-back to the
    # parent node and are sent to 'team-X-mails'
    routes:
    - match:
        severity: critical
      receiver: team-X-pager

  - match:
      service: files
    receiver: team-Y-mails

    routes:
    - match:
        severity: critical
      receiver: team-Y-pager

  # This route handles all alerts coming from a database service. If there's
  # no team to handle it, it defaults to the DB team.
  - match:
      service: database

    receiver: team-DB-pager
    # Also group alerts by affected database.
    group_by: [alertname, cluster, database]

    routes:
    - match:
        owner: team-X
      receiver: team-X-pager

    - match:
        owner: team-Y
      receiver: team-Y-pager

# Inhibition rules allow muting a set of alerts while another alert is
# firing.
# We use this to mute any warning-level notifications if the same alert is
# already critical.
inhibit_rules:
- source_match:
    severity: 'critical'
  target_match:
    severity: 'warning'
  # Apply inhibition if the alertname is the same.
  # CAUTION: 
  #   If all label names listed in `equal` are missing 
  #   from both the source and target alerts,
  #   the inhibition rule will apply!
  equal: ['alertname']

receivers:
- name: 'team-X-mails'
  email_configs:
  - to: 'team-X+alerts@example.org, team-Y+alerts@example.org'

- name: 'team-X-pager'
  email_configs:
  - to: 'team-X+alerts-critical@example.org'
  pagerduty_configs:
  - routing_key: <team-X-key>

- name: 'team-Y-mails'
  email_configs:
  - to: 'team-Y+alerts@example.org'

- name: 'team-Y-pager'
  pagerduty_configs:
  - routing_key: <team-Y-key>

- name: 'team-DB-pager'
  pagerduty_configs:
  - routing_key: <team-DB-key>

The examples that follow use this configuration:

# alertmanager.yml

global: 
  smtp_smarthost: ''
  smtp_from: '' 
  smtp_auth_username: '' 
  smtp_auth_password: '' 
  smtp_require_tls: false

route:
  group_by: ['alertname'] 
  group_wait: 1m
  group_interval: 10m
  repeat_interval: 10h 
  receiver: 'email'

receivers:
  - name: 'email'
    email_configs: 
    - to: '' 

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname']

API

The AlertManager API has two versions, v1 and v2. The current version is v2
(for the specification, see
api/v2/openapi.yaml).

By default the API prefix is /api/v1 or /api/v2,
and the endpoint for sending alerts is /api/v1/alerts or /api/v2/alerts.
If the user specifies --web.route-prefix,
for example --web.route-prefix=/alertmanager/,
the prefix becomes /alertmanager/api/v1 or /alertmanager/api/v2,
and the endpoint for sending alerts becomes /alertmanager/api/v1/alerts
or /alertmanager/api/v2/alerts.
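
The prefix rule above can be sketched as a small helper that derives the alert endpoint from the AlertManager address and an optional route prefix. This is only an illustration: `AlertEndpoint` and `alertsEndpoint` are hypothetical names and are not part of IoTDB or AlertManager.

```java
import java.net.URI;

/** Sketch: derive the alert endpoint from a base address and an optional route prefix. */
final class AlertEndpoint {

  /**
   * @param baseAddress address of the AlertManager instance, e.g. "http://127.0.0.1:9093"
   * @param routePrefix value passed to --web.route-prefix, or "" if the flag is not set
   * @param apiVersion  "v1" or "v2"
   */
  static URI alertsEndpoint(String baseAddress, String routePrefix, String apiVersion) {
    String base = stripSlashes(baseAddress, false);
    String prefix = stripSlashes(routePrefix, true);
    String prefixPath = prefix.isEmpty() ? "" : "/" + prefix;
    return URI.create(base + prefixPath + "/api/" + apiVersion + "/alerts");
  }

  // Removes trailing slashes and, if requested, leading slashes as well.
  private static String stripSlashes(String s, boolean stripLeading) {
    int start = 0;
    int end = s.length();
    while (end > start && s.charAt(end - 1) == '/') {
      end--;
    }
    if (stripLeading) {
      while (start < end && s.charAt(start) == '/') {
        start++;
      }
    }
    return s.substring(start, end);
  }

  public static void main(String[] args) {
    // No route prefix: http://127.0.0.1:9093/api/v2/alerts
    System.out.println(alertsEndpoint("http://127.0.0.1:9093/", "", "v2"));
    // With --web.route-prefix=/alertmanager/: http://127.0.0.1:9093/alertmanager/api/v2/alerts
    System.out.println(alertsEndpoint("http://127.0.0.1:9093", "/alertmanager/", "v2"));
  }
}
```

Normalizing the slashes in one place avoids accidental double slashes when the address or the prefix is written with or without a trailing "/".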

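Creating a trigger

Writing the trigger class

Users define a trigger by creating a Java class and implementing the logic in its hooks.
For the detailed configuration process and for how to use the AlertManagerSink utility classes provided by the Sink module, see Triggers.

The following example creates the class org.apache.iotdb.trigger.AlertingExample,
whose alertManagerHandler
member variable can send alerts to the AlertManager instance at http://127.0.0.1:9093/.

When value > 100.0, it sends an alert with severity critical;
when 50.0 < value <= 100.0, it sends an alert with severity warning.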

package org.apache.iotdb.trigger;

/*
Package imports are omitted here
*/

public class AlertingExample implements Trigger {

  private final AlertManagerHandler alertManagerHandler = new AlertManagerHandler();

  private final AlertManagerConfiguration alertManagerConfiguration =
      new AlertManagerConfiguration("http://127.0.0.1:9093/api/v2/alerts");

  private String alertname;

  private final HashMap<String, String> labels = new HashMap<>();

  private final HashMap<String, String> annotations = new HashMap<>();

  @Override
  public void onCreate(TriggerAttributes attributes) throws Exception {
    alertManagerHandler.open(alertManagerConfiguration);

    alertname = "alert_test";

    labels.put("series", "root.ln.wf01.wt01.temperature");
    labels.put("value", "");
    labels.put("severity", "");

    annotations.put("summary", "high temperature");
    annotations.put("description", "{{.alertname}}: {{.series}} is {{.value}}");
  }

  @Override
  public void onDrop() throws IOException {
    alertManagerHandler.close();
  }

  @Override
  public void onStart() throws Exception {
    alertManagerHandler.open(alertManagerConfiguration);
  }

  @Override
  public void onStop() throws Exception {
    alertManagerHandler.close();
  }

  @Override
  public Double fire(long timestamp, Double value) throws Exception {
    if (value > 100.0) {
      labels.put("value", String.valueOf(value));
      labels.put("severity", "critical");
      AlertManagerEvent alertManagerEvent = new AlertManagerEvent(alertname, labels, annotations);
      alertManagerHandler.onEvent(alertManagerEvent);
    } else if (value > 50.0) {
      labels.put("value", String.valueOf(value));
      labels.put("severity", "warning");
      AlertManagerEvent alertManagerEvent = new AlertManagerEvent(alertname, labels, annotations);
      alertManagerHandler.onEvent(alertManagerEvent);
    }

    return value;
  }

  @Override
  public double[] fire(long[] timestamps, double[] values) throws Exception {
    for (double value : values) {
      if (value > 100.0) {
        labels.put("value", String.valueOf(value));
        labels.put("severity", "critical");
        AlertManagerEvent alertManagerEvent = new AlertManagerEvent(alertname, labels, annotations);
        alertManagerHandler.onEvent(alertManagerEvent);
      } else if (value > 50.0) {
        labels.put("value", String.valueOf(value));
        labels.put("severity", "warning");
        AlertManagerEvent alertManagerEvent = new AlertManagerEvent(alertname, labels, annotations);
        alertManagerHandler.onEvent(alertManagerEvent);
      }
    }
    return values;
  }
}
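
Both fire methods above repeat the same threshold logic. One possible way to factor it out is sketched below in plain Java, independent of the IoTDB classes. `SeverityRule` and `severityFor` are hypothetical names introduced for illustration; the thresholds are the ones used in AlertingExample.

```java
import java.util.Optional;

/** Sketch: the threshold rule from AlertingExample as one reusable function. */
final class SeverityRule {

  static final double WARNING_THRESHOLD = 50.0;
  static final double CRITICAL_THRESHOLD = 100.0;

  /**
   * Returns "critical" for value > 100.0, "warning" for 50.0 < value <= 100.0,
   * and an empty Optional when no alert should be sent.
   */
  static Optional<String> severityFor(double value) {
    if (value > CRITICAL_THRESHOLD) {
      return Optional.of("critical");
    }
    if (value > WARNING_THRESHOLD) {
      return Optional.of("warning");
    }
    return Optional.empty();
  }

  public static void main(String[] args) {
    for (double value : new double[] {0, 30, 60, 90, 120}) {
      System.out.println(value + " -> " + severityFor(value).orElse("no alert"));
    }
  }
}
```

With such a helper, each fire method could look up the severity once and build an AlertManagerEvent only when a severity is present, so the two overloads cannot drift apart.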

Creating the trigger

The following SQL statement registers a trigger named root-ln-wf01-wt01-alert
on the time series root.ln.wf01.wt01.temperature.
Its logic is defined by the class
org.apache.iotdb.trigger.AlertingExample.

  CREATE TRIGGER `root-ln-wf01-wt01-alert`
  AFTER INSERT
  ON root.ln.wf01.wt01.temperature
  AS "org.apache.iotdb.trigger.AlertingExample"

Writing data

Once AlertManager is deployed and running and the trigger has been created,
we can test the alerting feature by writing data to the time series.

INSERT INTO root.ln.wf01.wt01(timestamp, temperature) VALUES (1, 0);
INSERT INTO root.ln.wf01.wt01(timestamp, temperature) VALUES (2, 30);
INSERT INTO root.ln.wf01.wt01(timestamp, temperature) VALUES (3, 60);
INSERT INTO root.ln.wf01.wt01(timestamp, temperature) VALUES (4, 90);
INSERT INTO root.ln.wf01.wt01(timestamp, temperature) VALUES (5, 120);

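After the statements above are executed, an alert email arrives. Because our AlertManager configuration makes alerts with severity critical
inhibit alerts with severity warning, the email contains only the alert triggered by writing
(5, 120).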
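
The outcome can be traced with a simplified simulation in plain Java. It is only a sketch under stated assumptions: it does not call AlertManager, it assumes all three alerts are firing at the same time when the notification is built, and `InhibitionSketch` and its methods are hypothetical names. The thresholds mirror AlertingExample, and the rule mirrors the inhibit_rules entry in the configuration used here.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: which alerts the five inserts raise, and which survive the inhibit rule. */
final class InhibitionSketch {

  /** A simplified alert carrying only the fields this sketch needs. */
  static final class Alert {
    final String alertname;
    final String severity;
    final double value;

    Alert(String alertname, String severity, double value) {
      this.alertname = alertname;
      this.severity = severity;
      this.value = value;
    }
  }

  /** Mirrors the thresholds in AlertingExample: > 100.0 is critical, > 50.0 is warning. */
  static List<Alert> alertsFor(double[] values) {
    List<Alert> alerts = new ArrayList<>();
    for (double value : values) {
      if (value > 100.0) {
        alerts.add(new Alert("alert_test", "critical", value));
      } else if (value > 50.0) {
        alerts.add(new Alert("alert_test", "warning", value));
      }
    }
    return alerts;
  }

  /** A warning alert is muted while a critical alert with the same alertname is firing. */
  static List<Alert> afterInhibition(List<Alert> firing) {
    List<Alert> notified = new ArrayList<>();
    for (Alert target : firing) {
      boolean inhibited = false;
      if (target.severity.equals("warning")) {
        for (Alert source : firing) {
          if (source.severity.equals("critical")
              && source.alertname.equals(target.alertname)) {
            inhibited = true;
            break;
          }
        }
      }
      if (!inhibited) {
        notified.add(target);
      }
    }
    return notified;
  }

  public static void main(String[] args) {
    List<Alert> firing = alertsFor(new double[] {0, 30, 60, 90, 120});
    for (Alert alert : afterInhibition(firing)) {
      System.out.println(alert.severity + " " + alert.value);
    }
  }
}
```

In this sketch the inserts with values 60 and 90 raise warning alerts and the insert with value 120 raises a critical alert; after the rule is applied only the critical alert remains, which matches the single alert in the email.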