Terraform and a Monitoring System

September 6, 2023, 11:45

Hi! Yamada here with my second post!
I ran into some trouble on a Terraform project I worked on a while back, so I'd like to share what happened.

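I built the monitoring system Prometheus with Terraform.
The layers are organized as follows:
  L1: the network, Consul, and so on
  L2: Prometheus itself
  L3: the alert rules
* There are actually other layers, which build separate systems, but I'm leaving them out here.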

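I separated the alert rules from Prometheus itself because I expected the rules to be revised frequently during operation. With this layout, the alert rules can be updated on their own by deploying only L3.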

Fetching metrics, alerts, and targets
curl -s http://localhost:9090/metrics
curl -s http://localhost:9100/metrics
curl -s http://localhost:9093/metrics
curl -s http://localhost:9093/api/v2/alerts
curl -G http://localhost:9090/api/v1/targets/metadata
curl -s http://localhost:9090/api/v1/targets
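
The output of `/api/v1/targets` is long, so it can help to filter it. Here is a minimal Go sketch that calls the same endpoint and prints only the targets that are not healthy. The struct covers just the fields used here, and the helper name `unhealthyTargets` is my own, not part of any Prometheus library.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"log"
	"net/http"
)

// targetsResponse mirrors only the fields of /api/v1/targets used below.
type targetsResponse struct {
	Status string `json:"status"`
	Data   struct {
		ActiveTargets []struct {
			Labels    map[string]string `json:"labels"`
			ScrapeURL string            `json:"scrapeUrl"`
			Health    string            `json:"health"`
		} `json:"activeTargets"`
	} `json:"data"`
}

// unhealthyTargets returns the scrape URLs of active targets whose
// health is anything other than "up".
func unhealthyTargets(body []byte) ([]string, error) {
	var resp targetsResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return nil, err
	}
	if resp.Status != "success" {
		return nil, fmt.Errorf("unexpected status %q", resp.Status)
	}
	var down []string
	for _, t := range resp.Data.ActiveTargets {
		if t.Health != "up" {
			down = append(down, t.ScrapeURL)
		}
	}
	return down, nil
}

func main() {
	// Same URL as the curl command; adjust if Prometheus listens elsewhere.
	res, err := http.Get("http://localhost:9090/api/v1/targets")
	if err != nil {
		log.Fatal(err)
	}
	defer res.Body.Close()

	body, err := io.ReadAll(res.Body)
	if err != nil {
		log.Fatal(err)
	}
	down, err := unhealthyTargets(body)
	if err != nil {
		log.Fatal(err)
	}
	for _, url := range down {
		fmt.Println("not up:", url)
	}
}
```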

Port list
・Prometheus: 9090
・Alertmanager: 9093
・node_exporter: 9100
・AzureMetricsExporter: 9276
・Grafana: 3000
・postgres (exporter): 9187

●prometheus.yml looks like this
The number of jobs grows as you add exporters.
For reference, the example below includes jobs for node_exporter and azure_exporter.
==============
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "alert_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']

  - job_name: 'azure'
    static_configs:
      - targets: ['localhost:9276']

●alert_rules.yml looks like this
・Because rule_files refers to it by a relative path, it has to sit in the same directory as prometheus.yml
==============
groups:
  - name: example
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 1m

●The Alertmanager configuration looks like this
alertmanager.yml
・Left at the defaults
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'web.hook'
receivers:
  - name: 'web.hook'
    webhook_configs:
      - url: 'http://127.0.0.1:5001/'
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
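
With this configuration, Alertmanager sends alerts as a JSON POST to whatever is listening on `http://127.0.0.1:5001/`. To see what actually arrives, a throwaway receiver is handy. The following is a minimal Go sketch, not the receiver used in the project: the struct covers only a few fields of the webhook payload, and the names `webhookPayload`, `alert`, and `summarize` are my own.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

// alert holds a subset of the per-alert fields in the webhook payload.
type alert struct {
	Status      string            `json:"status"`
	Labels      map[string]string `json:"labels"`
	Annotations map[string]string `json:"annotations"`
}

// webhookPayload holds a subset of the top-level webhook payload fields.
type webhookPayload struct {
	Status string  `json:"status"`
	Alerts []alert `json:"alerts"`
}

// summarize renders one line per alert: "<status> <alertname> <instance>".
func summarize(p webhookPayload) []string {
	lines := make([]string, 0, len(p.Alerts))
	for _, a := range p.Alerts {
		lines = append(lines, fmt.Sprintf("%s %s %s",
			a.Status, a.Labels["alertname"], a.Labels["instance"]))
	}
	return lines
}

// handleWebhook decodes the POSTed JSON and logs a summary of each alert.
func handleWebhook(w http.ResponseWriter, r *http.Request) {
	if r.Method != http.MethodPost {
		http.Error(w, "POST only", http.StatusMethodNotAllowed)
		return
	}
	var p webhookPayload
	if err := json.NewDecoder(r.Body).Decode(&p); err != nil {
		http.Error(w, "invalid JSON", http.StatusBadRequest)
		return
	}
	for _, line := range summarize(p) {
		log.Println(line)
	}
	w.WriteHeader(http.StatusOK)
}

func main() {
	// Address taken from webhook_configs.url in alertmanager.yml.
	http.HandleFunc("/", handleWebhook)
	log.Fatal(http.ListenAndServe("127.0.0.1:5001", nil))
}
```

Stopping a scrape target such as node_exporter for longer than the rule's `for: 1m` should make an InstanceDown line show up in the log once Alertmanager's `group_wait` has passed.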

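So far so good, but then we hit a problem!
The monitoring team had agreed to watch Prometheus for us, but they then told us they would only look at alerts that show up in Zabbix, the tool they already use for monitoring. Ouch.
So we ended up forwarding the alerts to Zabbix.
After some research I found that this can be done with a webhook, but the published webhook had a bug, and I struggled to get it working.
I doubt many projects need to do this, but I wrote this post because I want people to know that Prometheus alerts can be forwarded to Zabbix.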

alertmanager-zabbix-webhook
This is the webhook we use, but it does not work as-is and needs a fix.
The fix goes in the webhook.go file.
The exact changes depend on your system, but once this file is fixed, the Zabbix integration works.
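
For context on what such a webhook has to do: it turns each alert into an item value and sends it to a Zabbix trapper item using the Zabbix sender protocol (a `ZBXD` header, a flags byte, a little-endian length, then a JSON body). The following Go sketch shows that general idea only. It is not the code from alertmanager-zabbix-webhook and not the fix described above. The host and key mapping in `alertToItem` (the `instance` label as the Zabbix host, a key of the form `prometheus.alert[...]`, and "1" for firing) is a hypothetical choice and has to match how the hosts and trapper items are defined on your Zabbix side.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"encoding/json"
	"fmt"
)

// zabbixItem is one value for a Zabbix trapper item.
type zabbixItem struct {
	Host  string `json:"host"`
	Key   string `json:"key"`
	Value string `json:"value"`
}

// senderRequest is the JSON body of a "sender data" request.
type senderRequest struct {
	Request string       `json:"request"`
	Data    []zabbixItem `json:"data"`
}

// alertToItem maps an alert to an item. The mapping is hypothetical:
// host from the "instance" label, key built from "alertname",
// value "1" while firing and "0" otherwise.
func alertToItem(status string, labels map[string]string) zabbixItem {
	value := "0"
	if status == "firing" {
		value = "1"
	}
	return zabbixItem{
		Host:  labels["instance"],
		Key:   "prometheus.alert[" + labels["alertname"] + "]",
		Value: value,
	}
}

// buildSenderPacket frames the items as a Zabbix sender packet:
// "ZBXD", flags byte 0x01, 8-byte little-endian body length, JSON body.
func buildSenderPacket(items []zabbixItem) ([]byte, error) {
	body, err := json.Marshal(senderRequest{Request: "sender data", Data: items})
	if err != nil {
		return nil, err
	}
	var buf bytes.Buffer
	buf.WriteString("ZBXD")
	buf.WriteByte(0x01)
	var size [8]byte
	binary.LittleEndian.PutUint64(size[:], uint64(len(body)))
	buf.Write(size[:])
	buf.Write(body)
	return buf.Bytes(), nil
}

func main() {
	item := alertToItem("firing", map[string]string{
		"alertname": "InstanceDown",
		"instance":  "web01", // hypothetical host name registered in Zabbix
	})
	packet, err := buildSenderPacket([]zabbixItem{item})
	if err != nil {
		panic(err)
	}
	// In a real sender the packet would be written to a TCP connection
	// to the Zabbix server's trapper port (10051 by default).
	fmt.Printf("built %d-byte packet for key %s\n", len(packet), item.Key)
}
```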

#terraform #zabbix #prometheus #alertmanager #webhook #monitoring #layers
