Compare commits
No commits in common. "f526af694da39a5e914a756c6314500d65122a00" and "b7cb74f1b517093cfc18ae9fab9c920482ff364a" have entirely different histories.
f526af694d
...
b7cb74f1b5
4 changed files with 4 additions and 47 deletions
@@ -66,7 +66,7 @@ graph TB
 | Forgejo Registry Integrity Probe | Alpine 3.20 + curl/jq | `stacks/monitoring/modules/monitoring/main.tf` | CronJob every 15m: walks `/v2/_catalog` on `forgejo.viktorbarzin.me` (HTTP via in-cluster service), HEADs every tagged manifest + index child; emits `registry_manifest_integrity_*` metrics to Pushgateway. Replaces the legacy `registry-integrity-probe` against `registry.viktorbarzin.me:5050` decommissioned in Phase 4 of forgejo-registry-consolidation 2026-05-07. |
 | blackbox-exporter (Authentik walling-off guard) | `prom/blackbox-exporter` (Keel-managed) | `stacks/monitoring/modules/monitoring/authentik_walloff_probe.tf` | Single-purpose blackbox-exporter. Its `http_no_authentik_redirect` module probes each must-stay-public carve-out URL with `no_follow_redirects` and FAILS (`fail_if_header_matches` on `Location`) iff the response redirects to Authentik. Scraped by job `blackbox-authentik-walloff` (1m); feeds alert `AuthentikWallingOffPublicPath`. Target list = `local.authentik_walloff_targets` in the same file. |
 | snmp-exporter | `prom/snmp-exporter` (Keel-managed) | `stacks/monitoring/modules/monitoring/snmp_exporter.tf` + `ups_snmp_values.yaml` | SNMP→Prometheus bridge. Modules in `ups_snmp_values.yaml`: `huawei` (UPS), `if_mib`/`ip_mib`, and **`dell_idrac`** (R730 iDRAC, merged from `prometheus_snmp_chart_values.yaml` 2026-06-05 + hand-added fan-RPM `coolingDeviceReading` / amperage location lookup). Scrape jobs: `snmp-ups` (30s, module=huawei), **`snmp-idrac` (1m, module=dell_idrac, auth=public_v2)** — the FAST primary source for R730 health/thermal/power/fan/voltage since the 2026-06-05 Redfish→SNMP migration (~3.7s/scrape vs Redfish ~18.5s). Relabels all metrics to `r730_idrac_<mibName>`. |
-| idrac-redfish-exporter | `viktorbarzin/idrac-redfish-exporter:2.4.1-voltage-fix` (mrlhansen/idrac_exporter, Keel-managed) | `stacks/monitoring/modules/monitoring/idrac.tf` | **Slow remnant** (10m scrape, job `redfish-idrac`) since the 2026-06-05 SNMP migration — was the sole iDRAC source at a 3m interval, demoted once SNMP took over the fast path. Trimmed to `system,sensors,power,storage,network,memory`. Serves only what SNMP can't (indicator LED, NIC link-speed Mbps, machine/BIOS info, per-drive storage table). **HA Sofia's R730 sensors moved off this exporter to a fast Prometheus SNMP query on 2026-06-05** (see the iDRAC subsection under "How It Works"), so the `sensors` collector here is now vestigial. |
+| idrac-redfish-exporter | `viktorbarzin/idrac-redfish-exporter:2.4.1-voltage-fix` (mrlhansen/idrac_exporter, Keel-managed) | `stacks/monitoring/modules/monitoring/idrac.tf` | **Slow remnant** (10m scrape, job `redfish-idrac`) since the 2026-06-05 SNMP migration — was the sole iDRAC source at a 3m interval, demoted once SNMP took over the fast path. Trimmed to `system,sensors,power,storage,network,memory`. Serves only what SNMP can't (indicator LED, NIC link-speed Mbps, machine/BIOS info, per-drive storage table) **and keeps HA Sofia's `sensor.r730_fan_speed` REST sensor alive** — that sensor reads `idrac_sensors_fan_speed` from this exporter directly, so the `sensors` collector must stay enabled here. |

 ## How It Works

@@ -151,8 +151,7 @@ Query examples (Grafana → Loki): `{job="rpi-sofia-journal"}`, `{job="rpi-sofia
 The R730 iDRAC (`192.168.1.4` / `idrac.viktorbarzin.lan`) is monitored by **two** Prometheus jobs, both relabeled to the `r730_idrac_*` prefix (which historically hid which source served what). Design/plan: `docs/plans/2026-06-05-idrac-snmp-migration-{design,plan}.md`.

 - **`snmp-idrac` (FAST, primary, 1m / 30s):** snmp-exporter `dell_idrac` module against `:161` (v2c, community `Public0` = `auth=public_v2`). ~3.7s/scrape. Serves all dynamic + health + alerting metrics: `r730_idrac_temperatureProbeReading` (tenths-°C, ÷10), `coolingDeviceReading` (fan RPM, label `coolingDeviceLocationName`), `amperageProbeReading{amperageProbeLocationName="System Board Pwr Consumption"}` (watts), `powerSupplyCurrentInputVoltage`, `globalSystemStatus`, `systemPowerState`, `powerSupplyStatus`, `physicalDiskComponentStatus`, `systemStateMemoryDeviceStatusCombined`, etc.
-- **`redfish-idrac` (SLOW remnant, 10m / 45s):** the old mrlhansen exporter, trimmed, kept only for metrics SNMP can't serve (indicator LED, NIC Mbps, machine/BIOS info, per-drive storage table). Its `sensors` collector is now **vestigial** (HA moved off it — see next bullet) and could be dropped.
-- **HA Sofia R730 sensors → Prometheus SNMP (2026-06-05):** ha-sofia's 7 REST sensors (`/config/rest_resources/idrac_redfish_exporter.yaml` — CPU/exhaust/inlet temp, power, 2× PSU voltage, fan speed) were re-pointed from the slow on-demand Redfish exporter (`scan_interval: 120`, ~16-22s/fetch, intermittent `unavailable` blips) to a **fast Prometheus query of the SNMP values** (`scan_interval: 30`, instant): `https://prometheus-query.viktorbarzin.lan/api/v1/query?query={__name__=~"r730_idrac_…"}`, one query → JSON, each sensor filters by metric+label (temps ÷10). The `prometheus-query.viktorbarzin.lan` ingress is **local-only, `auth=none`, path-scoped to `/api/v1/query`** (added in `prometheus.tf`) so HA can query the API without the Authentik gate on `prometheus.viktorbarzin.me`. Its Technitium CNAME (→ `ingress.viktorbarzin.lan`) was added **manually via the API** — like the other `.lan` exporter hosts it is NOT auto-synced (the `technitium-ingress-dns-sync` CronJob only creates `.me` records; same gap as the Loki-sensor follow-up noted above). HA-side file is auto-version-controlled by the ha-sofia HomeAssistantVersionControl add-on; pre-migration copy saved at `/config/idrac_redfish_exporter.bak-pre-snmp`.
+- **`redfish-idrac` (SLOW remnant, 10m / 45s):** the old mrlhansen exporter, trimmed, kept only for metrics SNMP can't serve (indicator LED, NIC Mbps, machine/BIOS info, per-drive storage table) and to feed **HA Sofia's `sensor.r730_fan_speed`** (reads `idrac_sensors_fan_speed` from the exporter HTTP endpoint directly — NOT via Prometheus, so its freshness is HA's REST poll, independent of the 10m Prometheus scrape).

 **Gotchas:**
 - **Enum values differ from the old Redfish metrics.** DellStatus: `3 = OK` (was Redfish `1`); `systemPowerState`: `4 = on` (was `2`). All iDRAC alert exprs were rewritten accordingly (`!= 3`, `!= 4`).

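Review note: the HA Sofia bullet in this hunk describes a single instant query whose JSON result is filtered per sensor by metric name and label, with temperatures divided by 10. A minimal Python sketch of that filtering step follows. The response shape is the standard Prometheus instant-query format, but the `pick` helper, the `Fan1` label value, and every reading in `SAMPLE` are invented for illustration; the real HA sensors use REST templates, not Python.

```python
# Sample shaped like a Prometheus /api/v1/query (instant query) response.
# All label values and readings below are invented for illustration.
SAMPLE = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "r730_idrac_temperatureProbeReading"},
             "value": [1780000000, "235"]},
            {"metric": {"__name__": "r730_idrac_coolingDeviceReading",
                        "coolingDeviceLocationName": "Fan1"},
             "value": [1780000000, "4800"]},
            {"metric": {"__name__": "r730_idrac_amperageProbeReading",
                        "amperageProbeLocationName": "System Board Pwr Consumption"},
             "value": [1780000000, "168"]},
        ],
    },
}


def pick(response, name, **labels):
    """Return the first sample matching metric name and labels as a float, else None."""
    for series in response["data"]["result"]:
        metric = series["metric"]
        if metric.get("__name__") != name:
            continue
        if all(metric.get(key) == want for key, want in labels.items()):
            # Prometheus encodes sample values as strings: [timestamp, "value"].
            return float(series["value"][1])
    return None


# Temperature probes report tenths of a degree Celsius, hence the division.
temp_c = pick(SAMPLE, "r730_idrac_temperatureProbeReading") / 10
fan_rpm = pick(SAMPLE, "r730_idrac_coolingDeviceReading",
               coolingDeviceLocationName="Fan1")
power_w = pick(SAMPLE, "r730_idrac_amperageProbeReading",
               amperageProbeLocationName="System Board Pwr Consumption")
print(temp_c, fan_rpm, power_w)
```

Returning `None` for a missing series keeps an absent metric distinguishable from a reading of zero, which matters for a sensor that should show `unavailable` rather than a false value.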
@@ -37,19 +37,6 @@ alloy:
 discovery.relabel "pod_logs" {
 targets = discovery.kubernetes.pod.targets

-// Drop high-volume, low-value producers from Loki to cut sdc write wear
-// (the log PVC is on the contended sdc HDD). goflow2 emits one JSON line
-// per NetFlow record to stdout (~8 GB/day, ~64% of all cluster logs) but
-// we only use its Prometheus aggregate metrics, not the per-flow logs;
-// vpa = Goldilocks/VPA recommender chatter (~1.3 GB/day). Both reversible
-// — remove this rule to ship them again. (Added 2026-06-05.)
-rule {
-source_labels = ["__meta_kubernetes_namespace", "__meta_kubernetes_pod_name"]
-separator = "/"
-regex = "monitoring/goflow2-.*|vpa/.*"
-action = "drop"
-}
-
 // Label creation - "namespace" field from "__meta_kubernetes_namespace"
 rule {
 source_labels = ["__meta_kubernetes_namespace"]

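Review note: the `drop` rule in this hunk joins namespace and pod name with `/` and matches the result against a regex. Prometheus-style relabel regexes are anchored at both ends, so `vpa/.*` matches only the `vpa` namespace and not a namespace that merely ends in `vpa`. The Python sketch below mimics that matching with `re.fullmatch`; the `is_dropped` helper and the pod names are hypothetical.

```python
import re

# Same pattern as the rule's `regex` attribute.
DROP = re.compile(r"monitoring/goflow2-.*|vpa/.*")


def is_dropped(namespace: str, pod: str) -> bool:
    """Mimic the relabel rule: join the source labels with '/', then match fully anchored."""
    joined = f"{namespace}/{pod}"
    return DROP.fullmatch(joined) is not None


# Pod names below are made up for illustration.
for ns, pod in [
    ("monitoring", "goflow2-7c9d-abcde"),
    ("vpa", "vpa-recommender-0"),
    ("monitoring", "prometheus-server-0"),
    ("kube-vpa", "some-pod"),
]:
    print(ns, pod, is_dropped(ns, pod))
```

Using `fullmatch` rather than `search` is the design point: an unanchored search would also drop `kube-vpa/some-pod`, which the relabel rule does not.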
@@ -62,28 +62,3 @@ resource "helm_release" "prometheus" {

 values = [templatefile("${path.module}/prometheus_chart_values.tpl", { alertmanager_mail_pass = var.alertmanager_account_password, alertmanager_slack_api_url = var.alertmanager_slack_api_url, tuya_api_key = var.tiny_tuya_service_secret, haos_api_token = var.haos_api_token, authentik_walloff_targets = local.authentik_walloff_targets })]
 }
-
-# Local-only Prometheus query-API ingress for ha-sofia REST sensors (added
-# 2026-06-05). ha-sofia (external HAOS) reads R730 iDRAC SNMP metrics
-# (r730_idrac_coolingDeviceReading, etc.) by querying Prometheus directly via
-# this host instead of hitting the slow on-demand Redfish exporter. Distinct
-# host (prometheus-query.viktorbarzin.lan) + resource name to avoid colliding
-# with the chart-created `prometheus-server` ingress (prometheus.viktorbarzin.me).
-# Path-scoped to /api/v1/query so ONLY the read-only instant-query endpoint is
-# reachable on the LAN — not the UI, admin, or federation endpoints.
-module "prometheus-query-ingress" {
-source = "../../../../modules/kubernetes/ingress_factory"
-# auth = "none": ha-sofia REST sensor queries the Prometheus HTTP API
-# programmatically (no browser, no SSO cookie); the allow_local_access_only
-# IP allowlist (LAN subnets) is the gate. Authentik OIDC would 302 every call.
-auth = "none"
-namespace = kubernetes_namespace.monitoring.metadata[0].name
-name = "prometheus-query"
-service_name = "prometheus-server"
-root_domain = "viktorbarzin.lan"
-tls_secret_name = var.tls_secret_name
-allow_local_access_only = true
-ssl_redirect = false
-port = 80
-ingress_path = ["/api/v1/query"]
-}

@@ -3148,12 +3148,8 @@ extraScrapeConfigs: |
 - "crowdsec-service.crowdsec.svc.cluster.local:6060"
 metrics_path: '/metrics'
 - job_name: 'snmp-idrac'
-# 30s (was 1m) so the HA dashboard iDRAC metrics (temps / fan RPM / power /
-# voltage, read by ha-sofia's prometheus-query.lan REST sensors) refresh
-# every 30s — matching the fan-control daemon's Pushgateway metrics. The
-# SNMP scrape takes ~3-4s; snmp-ups also runs at 30s. (2026-06-05)
-scrape_interval: 30s
-scrape_timeout: 15s
+scrape_interval: 1m
+scrape_timeout: 30s
 params:
 module: [dell_idrac]
 auth: [public_v2]