monitoring: add pfSense WAN/egress alerting + probes
Some checks failed
ci/woodpecker/push/default Pipeline was canceled
On 2026-06-27 pfSense (Proxmox VMID 101) stopped passing internet egress for
~20 min while internal routing + Unbound stayed up; recovery needed a manual
reboot and NOTHING alerted: there was no egress probe and the cloudflared
replica metric stayed green. Add first-class egress monitoring so the next
occurrence pages in ~2 min instead of being noticed by a human.

- blackbox-exporter: new icmp_egress + dns_external probe modules (+ NET_RAW
  so ICMP can use raw sockets).
- Three in-cluster probe jobs exercising the pod->node->pfSense-NAT path that
  failed: wan-gateway-icmp (192.168.1.1), internet-egress-icmp (9.9.9.9 +
  1.1.1.1), internet-egress-dns (cloudflare.com via both resolvers).
- Prometheus alerts (group "Egress / pfSense"): WANGatewayUnreachable,
  InternetEgressDown (both providers dead), ExternalDNSResolutionDown,
  EgressOnlyDivergence (reuses the existing t3-probe legs: the incident's
  exact "external down while internal up" signature), PfSenseVMDown.
- Loki ruler: CloudflaredTunnelConnLoss, the canary that fired first; the
  cloudflared replica metric is blind to tunnel-connection loss. Threshold
  calibrated against live Loki (steady-state ~2/6h vs 37-85/5m in-incident).
- Alertmanager inhibit: WAN/egress-down suppresses the downstream egress
  symptom alerts so one root alert pages, not a storm.
- Runbook docs/runbooks/pfsense-egress.md + .claude/CLAUDE.md.

All metric names + the cloudflared threshold verified against live
Prometheus/Loki. Pure GitOps, no pfSense change. Firewall-side hardening
(dpinger retargeting, failover gateway, pfSense syslog -> Loki) is deferred and
documented in the runbook.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
parent 279b88d2bc
commit 7fe2d9780e
5 changed files with 271 additions and 0 deletions
@ -235,6 +235,7 @@ the workflow's built-in `GITHUB_TOKEN` (`packages: write`).
- Key alerts: OOMKill, pod replica mismatch, 4xx/5xx error rates, UPS battery, CPU temp, SSD writes, NFS responsiveness, ClusterMemoryRequestsHigh (>85%), ContainerNearOOM (>85% limit), PodUnschedulable, ExternalAccessDivergence, ImmichSmartSearchSlow (context-search latency / clip_index cache eviction), AuthentikRootRouter5xxHigh (all-3-server-pods-NotReady cascade → 502/503/504 on the authentik `/` router). **The Traefik scrape keeps `traefik_router_requests_total`** (per-router `code` label) — the drop-regex in the `traefik` scrape job drops only the high-cardinality `*_duration_seconds_bucket` histogram, NOT the request counter, so per-router 429/5xx is queryable + alertable.
- **E2E email monitoring**: CronJob `email-roundtrip-monitor` (every 20 min) sends test email via Brevo HTTP API to `smoke-test@viktorbarzin.me` (catch-all → `spam@`), verifies IMAP delivery, deletes test email, pushes metrics to Pushgateway + Uptime Kuma. Alerts: `EmailRoundtripFailing` (60m), `EmailRoundtripStale` (60m), `EmailRoundtripNeverRun` (60m). Outbound relay: Brevo EU (`smtp-relay.brevo.com:587`, 300/day free — migrated from Mailgun). Inbound external traffic enters via pfSense HAProxy on `10.0.20.1:{25,465,587,993}`, which forwards to k8s `mailserver-proxy` NodePort (30125-30128) with `send-proxy-v2`. Mailserver pod runs alt PROXY-speaking listeners (2525/4465/5587/10993) alongside stock PROXY-free ones (25/465/587/993) for intra-cluster clients. Real client IPs recovered from PROXY v2 header despite kube-proxy SNAT (replaces pre-2026-04-19 MetalLB `10.0.20.202` ETP:Local scheme; see bd code-yiu + `docs/runbooks/mailserver-pfsense-haproxy.md`). Vault: `brevo_api_key` in `secret/viktor` (probe + relay).
- **Authentik walling-off guard**: `blackbox-exporter` (monitoring ns, `stacks/monitoring/modules/monitoring/authentik_walloff_probe.tf`) probes each must-stay-public `auth = "none"` carve-out URL with `no_follow_redirects` and FAILS (`fail_if_header_matches` on `Location`) iff it 302s to Authentik. Catches a carve-out regressing (TF revert / deploy / `ingress_factory` `auth` default flipping back to `"required"`). Scrape job `blackbox-authentik-walloff` (1m) → alert `AuthentikWallingOffPublicPath` (`probe_failed_due_to_regex == 1`, for 10m, `lane=security` → posts to `#alerts` via the `slack-security` receiver, which keeps its `[SECURITY]` styling; the dedicated `#security` channel was abandoned 2026-06-25 — the shared webhook's app isn't a member of it). **To guard a new carve-out: add one line to `local.authentik_walloff_targets`** (a `service → URL` map; `valid_status_codes` includes 301/302 so legit redirects/404s stay green — only the Authentik `Location` fails the probe). `curl -sI '<url>'` must NOT show a Location to `authentik.viktorbarzin.me` before adding.
- **pfSense egress / WAN monitoring** (added 2026-06-28 after the 2026-06-27 egress-only incident — pfSense VMID 101 stopped passing internet egress for ~20 min while internal routing + Unbound stayed up, and NOTHING alerted: no egress probe existed and the cloudflared replica metric stayed green): `blackbox-exporter` gained `icmp_egress` + `dns_external` modules (+ `NET_RAW` on the pod) in `authentik_walloff_probe.tf`. Three in-cluster probe jobs (`wan-gateway-icmp` → 192.168.1.1, `internet-egress-icmp` → 9.9.9.9/1.1.1.1, `internet-egress-dns` → cloudflare.com via both) traverse the pod→node→pfSense-NAT path that fails. Alerts (group `Egress / pfSense` in `alerting_rules.yml`): `WANGatewayUnreachable`, `InternetEgressDown` (`max()==0` = both providers dead, not a single-provider blip), `ExternalDNSResolutionDown`, `EgressOnlyDivergence` (t3-probe `cloudflare` leg down WHILE `internal` leg up — the incident signature, reuses the existing t3-probe), `PfSenseVMDown` (`pve_up{id="qemu/101"}==0` while host up — does NOT catch a guest-internal reboot, `pve_up` tracks the qemu process). Plus Loki ruler `CloudflaredTunnelConnLoss` (>20 edge-conn failures/5m; calibrated live: steady-state ~2/6h vs 37-85/5m in-incident; the cloudflared replica metric is blind to tunnel-connection loss). `WANGatewayUnreachable`/`InternetEgressDown` **inhibit** the downstream egress symptoms (ExternalDNSResolutionDown/EgressOnlyDivergence/CloudflaredTunnelConnLoss/Email*/ExternalAccessDivergence). Runbook: `docs/runbooks/pfsense-egress.md`. **Deferred (needs a live-pfSense change, not in this monitoring-only change):** point dpinger's monitor at the local gateway + widen thresholds, disable `gw_down_kill_states`, add a failover gateway group + auto-recovery watchdog, and ship pfSense system/gateway/routing syslog to Loki (today only filterlog → CrowdSec; those logs are NOT centrally queryable — id #6717). 
No Uptime-Kuma egress monitor was added (the `external-monitor-sync` is purpose-built for `*.viktorbarzin.me` Cloudflare-path discovery; the blackbox probes cover egress directly).
## Security Posture (Wave 1 — locked 2026-05-18)
72
docs/runbooks/pfsense-egress.md
Normal file
@ -0,0 +1,72 @@
# Runbook: pfSense WAN / egress outage

**Scope:** the cluster (and home) loses **internet egress** while pfSense is
otherwise alive: internal VLAN routing and DNS keep working. This is the
**2026-06-27 incident class**: pfSense (Proxmox **VMID 101**) stopped passing
IPv4 egress for ~20 min (00:02→00:23 UTC) while LAN/OPT1 routing + Unbound
stayed up; recovery required a manual reboot, and **nothing alerted** (no egress
probe existed; the cloudflared replica metric stayed green). The alerts +
probes below close that gap. Incident detail: memory ids #6715–#6723.

pfSense is a **single point of failure** (no HA): it is the k8s default gateway
(`10.0.20.1`), Kea DHCP, Unbound DNS, NAT, and the WireGuard hub. WAN is
**static** `192.168.1.2/24`, upstream gateway `WANGW = 192.168.1.1` (the TP-Link
Archer AX6000). `WANGW` is the sole IPv4 default gateway; there is no gateway
group or failover.

## Alerts (all in `stacks/monitoring/modules/monitoring/`)

| Alert | Signal | Means |
|-------|--------|-------|
| `WANGatewayUnreachable` (critical) | in-cluster ICMP to `192.168.1.1` fails >3m | pfSense's upstream gateway is unreachable from the cluster |
| `InternetEgressDown` (critical) | in-cluster ICMP to **both** `9.9.9.9` and `1.1.1.1` fails >2m | internet egress through pfSense NAT is black-holed |
| `ExternalDNSResolutionDown` (warning) | UDP/53 to both public resolvers fails >3m | egress or external-DNS path broken |
| `EgressOnlyDivergence` (critical) | t3-probe `cloudflare` leg down **while** `internal` leg up >3m | egress-specific failure, internal healthy (the exact 2026-06-27 signature) |
| `PfSenseVMDown` (critical) | `pve_up{id="qemu/101"}==0` while host up >2m | the pfSense VM stopped/crashed (host fine) |
| `CloudflaredTunnelConnLoss` (warning, Loki) | >20 cloudflared edge-conn failures/5m | tunnel/egress trouble (canary that fires first; replica metric is blind) |

Probes run **from inside the cluster** (blackbox-exporter, pod → node → pfSense
NAT), so they exercise the exact egress path that fails. `WANGatewayUnreachable`
/ `InternetEgressDown` **inhibit** the downstream egress symptoms so one root
alert pages, not a storm.
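
To reproduce one of these probes by hand, it can help to build the same `/probe` URL that the scrape jobs' relabeling produces. The sketch below is illustrative only: `probe_url` and `BLACKBOX_ADDR` are hypothetical names, the default address is the one used as the scrape jobs' `replacement`, and the commented `curl` assumes it runs somewhere that can resolve cluster DNS.

```shell
#!/bin/sh
# Hypothetical helper (not part of the repo): build the blackbox-exporter
# probe URL for a module/target pair, mirroring the scrape-job relabeling.
BLACKBOX_ADDR="${BLACKBOX_ADDR:-blackbox-exporter.monitoring.svc.cluster.local:9115}"

probe_url() {
  # $1 = blackbox module (icmp_egress or dns_external), $2 = probe target IP
  printf 'http://%s/probe?module=%s&target=%s\n' "$BLACKBOX_ADDR" "$1" "$2"
}

# Print the URLs for the three ICMP targets used by the egress jobs.
for target in 192.168.1.1 9.9.9.9 1.1.1.1; do
  probe_url icmp_egress "$target"
done

# From a pod with cluster DNS, a manual check might look like:
#   curl -s "$(probe_url icmp_egress 9.9.9.9)" | grep '^probe_success'
```

A `probe_success 1` line in the response would mean the exporter reached the target; `0` matches what the alerts evaluate.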

`PfSenseVMDown` **does not** catch a *guest-internal* reboot: `pve_up` tracks
the qemu process, which survives an in-guest reboot (this is why 2026-06-27 was
metric-invisible). `CloudflaredTunnelConnLoss` + the probe alerts cover that case.
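
As a rough sanity check on the `CloudflaredTunnelConnLoss` threshold, the arithmetic can be sketched from the calibration figures quoted in the alert's description (steady-state ~2 matches/6h, 37-85 matches/5m in-incident, threshold >20/5m). The variable names are illustrative and the figures are approximations taken from that description.

```shell
#!/bin/sh
# Calibration figures as quoted in the alert description; names are illustrative.
THRESHOLD=20        # alert fires above this many matches in a 5m window
STEADY_PER_6H=2     # approximate steady-state matches per 6h
INCIDENT_MIN=37     # lowest in-incident count per 5m window

# Hours of steady-state logging needed to accumulate THRESHOLD matches.
hours_to_threshold=$((THRESHOLD * 6 / STEADY_PER_6H))

# Margin between the quietest incident window and the threshold.
incident_margin=$((INCIDENT_MIN - THRESHOLD))

echo "steady state needs ~${hours_to_threshold}h to log ${THRESHOLD} matches"
echo "quietest incident window exceeded the threshold by ${incident_margin}"
```

Under these figures, normal noise sits far below the threshold while even the quietest incident window clears it.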

## Diagnose (read-only first)

1. **Confirm scope:** is it egress-only or total?
   - In Grafana (namespace `monitoring`), query `probe_success{job=~"wan-gateway-icmp|internet-egress-icmp"}` and `t3probe_connected` by `leg`.
   - Internal still up? `pve_up{id="qemu/101"}` should be `1`; internal k8s DNS (`10.0.20.1`) still resolving = pfSense alive, egress-only.
2. **Capture pfSense on-box logs BEFORE rebooting** (they persist on disk, with no RAM disk, and are the only source that proves the mechanism; they are NOT shipped to Loki):
   ```
   ssh -i ~/.ssh/id_ed25519 admin@10.0.20.1   # devvm wizard key (id #6784)
   clog /var/log/gateways.log | grep -iE 'WANGW|down|up|delay|loss'   # dpinger gateway alarms
   clog /var/log/routing.log  | grep -iE 'default|route'              # default-route add/delete
   clog /var/log/system.log   | tail -200
   netstat -rn | head   # is the default route present?
   ls -la /var/crash/   # panic/textdump?
   ```
   (If SSH is rejected post-reboot, the reboot regenerated `authorized_keys` from
   config.xml; re-add the key via console or WebGUI. See id #6718.)
3. **Upstream check:** is the TP-Link / ISP up? It held the same public IP with
   clean DHCP renewals through the 2026-06-27 event, so a *sustained* upstream
   fault is unlikely; a reboot fixing it points at **pfSense-side state**.
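
The scope decision in step 1 can be sketched as a small truth table. This is a minimal illustration, assuming the operator has already reduced the Grafana readings to two 0/1 values; `classify_scope` and its labels are hypothetical names, not part of the repo.

```shell
#!/bin/sh
# classify_scope EGRESS_OK INTERNAL_OK   (hypothetical helper)
#   EGRESS_OK   : 1 if any internet-egress-icmp target answers, else 0
#   INTERNAL_OK : 1 if internal DNS/routing via 10.0.20.1 still works, else 0
classify_scope() {
  case "$1$2" in
    11) echo "healthy" ;;
    01) echo "egress-only" ;;     # the 2026-06-27 signature
    00) echo "total" ;;           # pfSense (or more) is fully down
    10) echo "internal-fault" ;;  # egress fine, internal path broken
    *)  echo "unknown"; return 1 ;;
  esac
}

# Example: egress probes failing while internal DNS still resolves.
classify_scope 0 1
```

An `egress-only` result points at the pfSense NAT/gateway path covered by this runbook; `total` suggests checking whether the VM itself is up first.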

## Recover

- **Fast path (known fix):** reboot pfSense, which re-adds the default route,
  re-arms dpinger, and flushes pf state. **Capture the diagnostics above FIRST:**
  the on-disk logs survive, but a reboot wipes volatile evidence (such as the
  live routing table) needed to find the real mechanism.
- Targeted (if logs show a dpinger gateway-down): System → Routing → Gateways →
  WANGW; check the monitor IP + dpinger state; re-enable the gateway / let it
  re-eval. Confirm `netstat -rn` shows the default route restored.
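
The final confirmation can be scripted as a check over `netstat -rn` output. The sketch below is an assumption-laden illustration: `has_default_route` is a hypothetical helper, and it assumes the FreeBSD-style layout in which the IPv4 default route appears as a line whose first field is `default` and whose second field is the gateway address.

```shell
#!/bin/sh
# has_default_route (hypothetical helper): reads `netstat -rn` output on stdin
# and succeeds only if an IPv4 default route line is present.
has_default_route() {
  awk '$1 == "default" && $2 ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/ { found = 1 }
       END { exit !found }'
}

# Illustrative sample; the interface names are made up for the example.
sample='Destination        Gateway            Flags     Netif
default            192.168.1.1        UGS      vtnet0
10.0.20.0/24       link#2             U        vtnet1'

if printf '%s\n' "$sample" | has_default_route; then
  echo "default route present"
else
  echo "default route MISSING"
fi
```

On the firewall itself this would be fed live output, for example `netstat -rn | has_default_route`.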

## Prevent / harden (deferred, needs a live-pfSense change)

Not done in this monitoring change; tracked for a follow-up with hands-on
pfSense access: point dpinger's monitor at the local gateway (`192.168.1.1`)
instead of an external IP and widen its thresholds; disable `gw_down_kill_states`
for the single WAN; add a failover gateway group; add a 60s auto-recovery
watchdog; and ship pfSense system/gateway/routing syslog to the cluster so these
logs become centrally queryable.
@ -118,6 +118,35 @@ resource "kubernetes_config_map" "blackbox_exporter_config" {
            ]
          }
        }
        # ICMP egress probes (added 2026-06-28 after the 2026-06-27 pfSense
        # WAN/egress black-hole incident). Drive the wan-gateway-icmp +
        # internet-egress-icmp scrape jobs (extraScrapeConfigs) — they probe
        # from INSIDE the cluster, so they traverse the exact node -> pfSense NAT
        # egress path that failed. ICMP needs CAP_NET_RAW (added to the
        # deployment container below).
        icmp_egress = {
          prober = "icmp"
          timeout = "5s"
          icmp = {
            preferred_ip_protocol = "ip4"
            ip_protocol_fallback = false
          }
        }
        # External DNS reachability: a UDP/53 query (cloudflare.com A) sent to a
        # public resolver target. Fails when egress is down OR the upstream
        # resolver is unreachable — a distinct failure surface from ICMP.
        dns_external = {
          prober = "dns"
          timeout = "5s"
          dns = {
            transport_protocol = "udp"
            preferred_ip_protocol = "ip4"
            ip_protocol_fallback = false
            query_name = "cloudflare.com"
            query_type = "A"
            valid_rcodes = ["NOERROR"]
          }
        }
      }
    })
  }
@ -175,6 +204,15 @@ resource "kubernetes_deployment" "blackbox_exporter" {
              memory = "48Mi"
            }
          }
          # The icmp_egress module needs raw sockets to send ICMP echo. NET_RAW
          # only — NOT privileged and NOT NET_ADMIN/SYS_ADMIN — so it stays
          # within the Kyverno wave-1 deny-privileged / restrict-sys-admin
          # policies. (2026-06-28, pfSense egress monitoring.)
          security_context {
            capabilities {
              add = ["NET_RAW"]
            }
          }
          volume_mount {
            name = "config-volume"
            mount_path = "/etc/blackbox_exporter/"
@ -194,6 +194,31 @@ resource "kubernetes_config_map" "loki_alert_rules" {
          },
        ]
      },
      {
        # Egress / pfSense (added 2026-06-28 after the 2026-06-27 WAN/egress
        # incident). Cloudflared edge-connection failures are the log canary
        # that fired FIRST + most reliably — the cloudflared *deployment*
        # replica metric stays GREEN during a tunnel-connection outage (pods
        # Running, tunnels failing), so a metric alert is blind to this.
        # Routed via Loki ruler → Alertmanager → slack by severity; inhibited
        # under WANGatewayUnreachable/InternetEgressDown so it doesn't
        # double-page. Calibrated against live Loki 2026-06-28: steady-state
        # ~2 matches/6h; the incident ran 37-85 matches/5m, so >20/5m sits
        # well clear of noise. Runbook: docs/runbooks/pfsense-egress.md.
        name = "Egress / pfSense"
        rules = [
          {
            alert = "CloudflaredTunnelConnLoss"
            expr = "sum(count_over_time({namespace=\"cloudflared\"} |~ \"(?i)(lost connection with the edge|failed to dial|register tunnel error|failed to serve quic)\" [5m])) > 20"
            for = "2m"
            labels = { severity = "warning", subsystem = "pfsense" }
            annotations = {
              summary = "cloudflared losing edge/tunnel connections (>20/5m) — possible egress/WAN trouble"
              description = "cloudflared edge-connection failures exceeded 20 in 5m (steady-state ~2/6h; the 2026-06-27 egress incident hit 37-85/5m). Pods usually stay Running so the replica-health alert is blind — this log canary is the early egress signal. Correlate with InternetEgressDown / EgressOnlyDivergence. Runbook: docs/runbooks/pfsense-egress.md."
            }
          },
        ]
      },
      {
        # t3 session-auth + auto-upgrade health (devvm host scripts → journald →
        # Loki). Backstops the gated-nightly t3 tracker: the dispatch logs every
@ -208,6 +208,16 @@ alertmanager:
        target_matchers:
          - alertname = T3ProbeDropBurst
        equal: [leg]
      # pfSense egress cascade (2026-06-28): a WAN-gateway-down or full
      # internet-egress-down is the root cause; the external-DNS, egress-only
      # divergence, cloudflared-tunnel, email-roundtrip and external-divergence
      # alerts are downstream symptoms of the same black-hole. Suppress them so
      # one root alert pages, not a storm (mirrors the NodeDown/PowerOutage
      # cascades). No `equal` — these symptom alerts carry no shared label.
      - source_matchers:
          - alertname =~ "WANGatewayUnreachable|InternetEgressDown"
        target_matchers:
          - alertname =~ "ExternalDNSResolutionDown|EgressOnlyDivergence|CloudflaredTunnelConnLoss|EmailRoundtripFailing|EmailRoundtripStale|ExternalAccessDivergence"
    receivers:
      - name: slack-critical
        slack_configs:
@ -3292,6 +3302,79 @@ serverFiles:
            annotations:
              summary: "Public path walled off by Authentik: {{ $labels.service }} ({{ $labels.instance }})"
              description: "The must-stay-public URL {{ $labels.instance }} (carve-out `{{ $labels.service }}`) is failing its blackbox probe — most likely it now 302-redirects to Authentik SSO. A path-scoped `auth = \"none\"` carve-out probably regressed (TF revert / deploy / ingress_factory auth default flipping back to \"required\"). Native-client / public / webhook / WebSocket / SPA-XHR traffic to this endpoint is broken for strangers and machines. Check the owning stack's ingress_factory `auth` + `ingress_path`, and curl the URL: `curl -sI '{{ $labels.instance }}'` — a Location to authentik.viktorbarzin.me confirms the regression. Probe config + target list: stacks/monitoring/modules/monitoring/authentik_walloff_probe.tf."
      # pfSense / egress alerts (added 2026-06-28 after the 2026-06-27 WAN/egress
      # black-hole incident: pfSense VMID 101 stopped passing internet egress for
      # ~20min while internal routing + Unbound stayed up; only a reboot
      # recovered it, and NOTHING alerted — the cloudflared replica metric stayed
      # green and there was no egress probe). Probe metrics come from the
      # blackbox scrape jobs wan-gateway-icmp / internet-egress-icmp /
      # internet-egress-dns (extraScrapeConfigs). Criticals route to
      # slack-critical via the severity=critical child route; WAN/egress-down
      # inhibits the downstream egress symptoms (see inhibit_rules).
      # Runbook: docs/runbooks/pfsense-egress.md.
      - name: Egress / pfSense
        rules:
          - alert: WANGatewayUnreachable
            # In-cluster ICMP to the pfSense WAN gateway (192.168.1.1). 0 = the
            # gateway pfSense routes through is unreachable from the cluster.
            expr: probe_success{job="wan-gateway-icmp"} == 0
            for: 3m
            labels:
              severity: critical
              subsystem: pfsense
            annotations:
              summary: "pfSense WAN gateway 192.168.1.1 unreachable from the cluster (>3m)"
              description: "In-cluster blackbox ICMP to the pfSense upstream gateway (192.168.1.1) has failed for >3m. pfSense egress/NAT or the WAN path is likely down (the 2026-06-27 incident class); internal VLAN routing may still work. Check pfSense System > Routing > Gateways + dpinger; on-box `clog /var/log/gateways.log`. Runbook: docs/runbooks/pfsense-egress.md."
          - alert: InternetEgressDown
            # ICMP to BOTH Quad9 and Cloudflare from in-cluster (path crosses
            # pfSense NAT). max()==0 = NEITHER answered = a true egress
            # black-hole, not a single-provider blip.
            expr: max(probe_success{job="internet-egress-icmp"}) == 0
            for: 2m
            labels:
              severity: critical
              subsystem: pfsense
            annotations:
              summary: "Internet egress DOWN — in-cluster ICMP to both 9.9.9.9 and 1.1.1.1 failing (>2m)"
              description: "Neither 9.9.9.9 nor 1.1.1.1 is reachable via ICMP from inside the cluster for >2m — internet egress through pfSense NAT is black-holed (the 2026-06-27 incident: egress died ~20min while internal stayed up). Check pfSense WAN/gateway/NAT; recovery per docs/runbooks/pfsense-egress.md (gateway re-eval / reboot is the known fix)."
          - alert: ExternalDNSResolutionDown
            # UDP/53 DNS query to public resolvers — a distinct failure surface
            # from ICMP (catches DNS-only egress breakage).
            expr: max(probe_success{job="internet-egress-dns"}) == 0
            for: 3m
            labels:
              severity: warning
              subsystem: pfsense
            annotations:
              summary: "External DNS resolution failing via both 9.9.9.9 and 1.1.1.1 (>3m)"
              description: "In-cluster DNS queries (cloudflare.com A) to both public resolvers fail for >3m — egress or external DNS is broken. If InternetEgressDown also fires it's an egress black-hole; DNS-only points at the resolver path."
          - alert: EgressOnlyDivergence
            # The 2026-06-27 SIGNATURE: the t3-probe cloudflare (full public path)
            # leg is down WHILE the internal (Traefik-only) leg is up =
            # egress-specific failure, internal healthy. Reuses the existing
            # t3-probe (no new infra). on() joins the two single-series legs.
            expr: t3probe_connected{job="t3-probe", leg="cloudflare"} == 0 and on() t3probe_connected{job="t3-probe", leg="internal"} == 1
            for: 3m
            labels:
              severity: critical
              subsystem: pfsense
            annotations:
              summary: "Egress-only failure: t3 cloudflare leg DOWN while internal leg UP (>3m)"
              description: "The full-public-path (cloudflare) t3 probe leg is down but the internal (Traefik-only) leg is healthy — the exact 2026-06-27 egress-only signature (internet/NAT down, internal routing fine). A total outage would also drop the internal leg. Check pfSense WAN/egress; docs/runbooks/pfsense-egress.md."
          - alert: PfSenseVMDown
            # pfSense = Proxmox VMID 101 (proxmox-exporter pve_up). VM stopped
            # while the host is up = total egress + inter-VLAN + DHCP + DNS loss
            # (single point of failure, no HA). CAVEAT: a GUEST-INTERNAL reboot
            # leaves pve_up==1 (it tracks the qemu process) — so this catches a
            # true VM stop/crash, NOT the in-guest reboot seen on 2026-06-27.
            expr: pve_up{id="qemu/101"} == 0 and on() pve_up{id="node/pve"} == 1
            for: 2m
            labels:
              severity: critical
              subsystem: pfsense
            annotations:
              summary: "pfSense VM (Proxmox VMID 101) is DOWN while the host is up"
              description: "The pfSense VM (qemu/101) reports down while the PVE host (node/pve) is up — the cluster has lost its single-point-of-failure gateway/NAT/DHCP/DNS (no HA). Start VM 101 from Proxmox immediately. Note: an in-guest reboot does NOT trip this (pve_up tracks the qemu process)."

extraScrapeConfigs: |
  # Alertmanager self-metrics. The bundled Alertmanager Service carries no
@ -3369,6 +3452,58 @@ extraScrapeConfigs: |
      - targets:
          - "t3-probe.t3code.svc.cluster.local:9108"
    metrics_path: '/metrics'
  # --- pfSense egress / WAN probes (added 2026-06-28 after the 2026-06-27
  # WAN/egress black-hole incident). All three probe via blackbox-exporter from
  # INSIDE the cluster (pod 10.10.x -> node -> pfSense NAT), so they exercise the
  # exact egress path that failed. Alerts: WANGatewayUnreachable /
  # InternetEgressDown / ExternalDNSResolutionDown (alerting_rules.yml, group
  # "Egress / pfSense"). Module defs (icmp_egress, dns_external) + NET_RAW:
  # authentik_walloff_probe.tf.
  - job_name: 'wan-gateway-icmp'
    scrape_interval: 30s
    scrape_timeout: 10s
    metrics_path: /probe
    params:
      module: [icmp_egress]
    static_configs:
      - targets: ["192.168.1.1"]
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 'blackbox-exporter.monitoring.svc.cluster.local:9115'
  - job_name: 'internet-egress-icmp'
    scrape_interval: 30s
    scrape_timeout: 10s
    metrics_path: /probe
    params:
      module: [icmp_egress]
    static_configs:
      - targets: ["9.9.9.9", "1.1.1.1"]
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 'blackbox-exporter.monitoring.svc.cluster.local:9115'
  - job_name: 'internet-egress-dns'
    scrape_interval: 60s
    scrape_timeout: 10s
    metrics_path: /probe
    params:
      module: [dns_external]
    static_configs:
      - targets: ["9.9.9.9", "1.1.1.1"]
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 'blackbox-exporter.monitoring.svc.cluster.local:9115'
  # rpi-sofia: external Raspberry Pi 3 at the Sofia home site (Frigate camera
  # DNAT passthrough + solar inverter path + HA MQTT sensors). node_exporter
  # installed via apt; the rpi_* metrics come from a vcgencmd textfile collector