monitoring: add pfSense WAN/egress alerting + probes

On 2026-06-27 pfSense (Proxmox VMID 101) stopped passing internet egress for ~20 min while internal routing + Unbound stayed up; recovery needed a manual reboot and NOTHING alerted — there was no egress probe and the cloudflared replica metric stayed green. Add first-class egress monitoring so the next occurrence pages in ~2 min instead of being noticed by a human. - blackbox-exporter: new icmp_egress + dns_external probe modules (+ NET_RAW so ICMP can use raw sockets). - Three in-cluster probe jobs exercising the pod->node->pfSense-NAT path that failed: wan-gateway-icmp (192.168.1.1), internet-egress-icmp (9.9.9.9 + 1.1.1.1), internet-egress-dns (cloudflare.com via both resolvers). - Prometheus alerts (group "Egress / pfSense"): WANGatewayUnreachable, InternetEgressDown (both providers dead), ExternalDNSResolutionDown, EgressOnlyDivergence (reuses the existing t3-probe legs — the incident's exact "external down while internal up" signature), PfSenseVMDown. - Loki ruler: CloudflaredTunnelConnLoss — the canary that fired first; the cloudflared replica metric is blind to tunnel-connection loss. Threshold calibrated against live Loki (steady-state ~2/6h vs 37-85/5m in-incident). - Alertmanager inhibit: WAN/egress-down suppresses the downstream egress symptom alerts so one root alert pages, not a storm. - Runbook docs/runbooks/pfsense-egress.md + .claude/CLAUDE.md. All metric names + the cloudflared threshold verified against live Prometheus/Loki. Pure GitOps, no pfSense change. Firewall-side hardening (dpinger retargeting, failover gateway, pfSense syslog -> Loki) is deferred and documented in the runbook. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 16:46:30 +00:00 · 2026-06-28 16:46:30 +00:00 · 7fe2d9780e
commit 7fe2d9780e
parent 279b88d2bc
5 changed files with 271 additions and 0 deletions
--- a/.claude/CLAUDE.md
+++ b/.claude/CLAUDE.md
@ -235,6 +235,7 @@ the workflow's built-in `GITHUB_TOKEN` (`packages: write`).
 - Key alerts: OOMKill, pod replica mismatch, 4xx/5xx error rates, UPS battery, CPU temp, SSD writes, NFS responsiveness, ClusterMemoryRequestsHigh (>85%), ContainerNearOOM (>85% limit), PodUnschedulable, ExternalAccessDivergence, ImmichSmartSearchSlow (context-search latency / clip_index cache eviction), AuthentikRootRouter5xxHigh (all-3-server-pods-NotReady cascade → 502/503/504 on the authentik `/` router). **The Traefik scrape keeps `traefik_router_requests_total`** (per-router `code` label) — the drop-regex in the `traefik` scrape job drops only the high-cardinality `*_duration_seconds_bucket` histogram, NOT the request counter, so per-router 429/5xx is queryable + alertable.
 - **E2E email monitoring**: CronJob `email-roundtrip-monitor` (every 20 min) sends test email via Brevo HTTP API to `smoke-test@viktorbarzin.me` (catch-all → `spam@`), verifies IMAP delivery, deletes test email, pushes metrics to Pushgateway + Uptime Kuma. Alerts: `EmailRoundtripFailing` (60m), `EmailRoundtripStale` (60m), `EmailRoundtripNeverRun` (60m). Outbound relay: Brevo EU (`smtp-relay.brevo.com:587`, 300/day free — migrated from Mailgun). Inbound external traffic enters via pfSense HAProxy on `10.0.20.1:{25,465,587,993}`, which forwards to k8s `mailserver-proxy` NodePort (30125-30128) with `send-proxy-v2`. Mailserver pod runs alt PROXY-speaking listeners (2525/4465/5587/10993) alongside stock PROXY-free ones (25/465/587/993) for intra-cluster clients. Real client IPs recovered from PROXY v2 header despite kube-proxy SNAT (replaces pre-2026-04-19 MetalLB `10.0.20.202` ETP:Local scheme; see bd code-yiu + `docs/runbooks/mailserver-pfsense-haproxy.md`). Vault: `brevo_api_key` in `secret/viktor` (probe + relay).
 - **Authentik walling-off guard**: `blackbox-exporter` (monitoring ns, `stacks/monitoring/modules/monitoring/authentik_walloff_probe.tf`) probes each must-stay-public `auth = "none"` carve-out URL with `no_follow_redirects` and FAILS (`fail_if_header_matches` on `Location`) iff it 302s to Authentik. Catches a carve-out regressing (TF revert / deploy / `ingress_factory` `auth` default flipping back to `"required"`). Scrape job `blackbox-authentik-walloff` (1m) → alert `AuthentikWallingOffPublicPath` (`probe_failed_due_to_regex == 1`, for 10m, `lane=security` → posts to `#alerts` via the `slack-security` receiver, which keeps its `[SECURITY]` styling; the dedicated `#security` channel was abandoned 2026-06-25 — the shared webhook's app isn't a member of it). **To guard a new carve-out: add one line to `local.authentik_walloff_targets`** (a `service → URL` map; `valid_status_codes` includes 301/302 so legit redirects/404s stay green — only the Authentik `Location` fails the probe). `curl -sI '<url>'` must NOT show a Location to `authentik.viktorbarzin.me` before adding.
+- **pfSense egress / WAN monitoring** (added 2026-06-28 after the 2026-06-27 egress-only incident — pfSense VMID 101 stopped passing internet egress for ~20 min while internal routing + Unbound stayed up, and NOTHING alerted: no egress probe existed and the cloudflared replica metric stayed green): `blackbox-exporter` gained `icmp_egress` + `dns_external` modules (+ `NET_RAW` on the pod) in `authentik_walloff_probe.tf`. Three in-cluster probe jobs (`wan-gateway-icmp` → 192.168.1.1, `internet-egress-icmp` → 9.9.9.9/1.1.1.1, `internet-egress-dns` → cloudflare.com via both) traverse the pod→node→pfSense-NAT path that fails. Alerts (group `Egress / pfSense` in `alerting_rules.yml`): `WANGatewayUnreachable`, `InternetEgressDown` (`max()==0` = both providers dead, not a single-provider blip), `ExternalDNSResolutionDown`, `EgressOnlyDivergence` (t3-probe `cloudflare` leg down WHILE `internal` leg up — the incident signature, reuses the existing t3-probe), `PfSenseVMDown` (`pve_up{id="qemu/101"}==0` while host up — does NOT catch a guest-internal reboot, `pve_up` tracks the qemu process). Plus Loki ruler `CloudflaredTunnelConnLoss` (>20 edge-conn failures/5m; calibrated live: steady-state ~2/6h vs 37-85/5m in-incident; the cloudflared replica metric is blind to tunnel-connection loss). `WANGatewayUnreachable`/`InternetEgressDown` **inhibit** the downstream egress symptoms (ExternalDNSResolutionDown/EgressOnlyDivergence/CloudflaredTunnelConnLoss/Email*/ExternalAccessDivergence). Runbook: `docs/runbooks/pfsense-egress.md`. **Deferred (needs a live-pfSense change, not in this monitoring-only change):** point dpinger's monitor at the local gateway + widen thresholds, disable `gw_down_kill_states`, add a failover gateway group + auto-recovery watchdog, and ship pfSense system/gateway/routing syslog to Loki (today only filterlog → CrowdSec; those logs are NOT centrally queryable — id #6717). No Uptime-Kuma egress monitor was added (the `external-monitor-sync` is purpose-built for `*.viktorbarzin.me` Cloudflare-path discovery; the blackbox probes cover egress directly).

 ## Security Posture (Wave 1 — locked 2026-05-18)