infra/docs/runbooks/pfsense-egress.md
Viktor Barzin 7fe2d9780e
Some checks failed
ci/woodpecker/push/default Pipeline was canceled
monitoring: add pfSense WAN/egress alerting + probes
On 2026-06-27 pfSense (Proxmox VMID 101) stopped passing internet egress for
~20 min while internal routing + Unbound stayed up; recovery needed a manual
reboot and NOTHING alerted — there was no egress probe and the cloudflared
replica metric stayed green. Add first-class egress monitoring so the next
occurrence pages in ~2 min instead of being noticed by a human.

- blackbox-exporter: new icmp_egress + dns_external probe modules (+ NET_RAW
  so ICMP can use raw sockets).
- Three in-cluster probe jobs exercising the pod->node->pfSense-NAT path that
  failed: wan-gateway-icmp (192.168.1.1), internet-egress-icmp (9.9.9.9 +
  1.1.1.1), internet-egress-dns (cloudflare.com via both resolvers).
- Prometheus alerts (group "Egress / pfSense"): WANGatewayUnreachable,
  InternetEgressDown (both providers dead), ExternalDNSResolutionDown,
  EgressOnlyDivergence (reuses the existing t3-probe legs — the incident's
  exact "external down while internal up" signature), PfSenseVMDown.
- Loki ruler: CloudflaredTunnelConnLoss — the canary that fired first; the
  cloudflared replica metric is blind to tunnel-connection loss. Threshold
  calibrated against live Loki (steady-state ~2/6h vs 37-85/5m in-incident).
- Alertmanager inhibit: WAN/egress-down suppresses the downstream egress
  symptom alerts so one root alert pages, not a storm.
- Runbook docs/runbooks/pfsense-egress.md + .claude/CLAUDE.md.

All metric names + the cloudflared threshold verified against live
Prometheus/Loki. Pure GitOps, no pfSense change. Firewall-side hardening
(dpinger retargeting, failover gateway, pfSense syslog -> Loki) is deferred
and documented in the runbook.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 16:46:30 +00:00

4.7 KiB
Raw Blame History

Runbook: pfSense WAN / egress outage

Scope: the cluster (and home) loses internet egress while pfSense is otherwise alive — internal VLAN routing and DNS keep working. This is the 2026-06-27 incident class: pfSense (Proxmox VMID 101) stopped passing IPv4 egress for ~20 min (00:02→00:23 UTC) while LAN/OPT1 routing + Unbound stayed up; recovery required a manual reboot, and nothing alerted (no egress probe existed; the cloudflared replica metric stayed green). The alerts + probes below close that gap. Incident detail: memory ids #6715#6723.

pfSense is a single point of failure (no HA): it is the k8s default gateway (10.0.20.1), Kea DHCP, Unbound DNS, NAT, and the WireGuard hub. WAN is static 192.168.1.2/24, upstream gateway WANGW = 192.168.1.1 (the TP-Link Archer AX6000). The sole IPv4 default gateway, no gateway-group/failover.

Alerts (all in stacks/monitoring/modules/monitoring/)

Alert Signal Means
WANGatewayUnreachable (critical) in-cluster ICMP to 192.168.1.1 fails >3m pfSense's upstream gateway is unreachable from the cluster
InternetEgressDown (critical) in-cluster ICMP to both 9.9.9.9 and 1.1.1.1 fails >2m internet egress through pfSense NAT is black-holed
ExternalDNSResolutionDown (warning) UDP/53 to both public resolvers fails >3m egress or external-DNS path broken
EgressOnlyDivergence (critical) t3-probe cloudflare leg down while internal leg up >3m egress-specific failure, internal healthy (the exact 2026-06-27 signature)
PfSenseVMDown (critical) pve_up{id="qemu/101"}==0 while host up >2m the pfSense VM stopped/crashed (host fine)
CloudflaredTunnelConnLoss (warning, Loki) >20 cloudflared edge-conn failures/5m tunnel/egress trouble (canary that fires first; replica metric is blind)

Probes run from inside the cluster (blackbox-exporter, pod → node → pfSense NAT), so they exercise the exact egress path that fails. WANGatewayUnreachable / InternetEgressDown inhibit the downstream egress symptoms so one root alert pages, not a storm.

PfSenseVMDown does not catch a guest-internal reboot — pve_up tracks the qemu process, which survives an in-guest reboot (this is why 2026-06-27 was metric-invisible). CloudflaredTunnelConnLoss + the probe alerts cover that case.

Diagnose (read-only first)

  1. Confirm scope — is it egress-only or total?
    • kubectl -n monitoring Grafana → probe_success{job=~"wan-gateway-icmp|internet-egress-icmp"} and t3probe_connected by leg.
    • Internal still up? pve_up{id="qemu/101"} should be 1; internal k8s DNS (10.0.20.1) still resolving = pfSense alive, egress-only.
  2. Capture pfSense on-box logs BEFORE rebooting (they persist on disk — no RAM-disk — and are the only source that proves the mechanism; they are NOT shipped to Loki):
    ssh -i ~/.ssh/id_ed25519 admin@10.0.20.1      # devvm wizard key (id #6784)
    clog /var/log/gateways.log | grep -iE 'WANGW|down|up|delay|loss'   # dpinger gateway alarms
    clog /var/log/routing.log  | grep -iE 'default|route'              # default-route add/delete
    clog /var/log/system.log   | tail -200
    netstat -rn | head                                                 # is the default route present?
    ls -la /var/crash/                                                 # panic/textdump?
    
    (If SSH is rejected post-reboot, the reboot regenerated authorized_keys from config.xml — re-add the key via console or WebGUI; see id #6718.)
  3. Upstream check — is the TP-Link / ISP up? It held the same public IP with clean DHCP renewals through the 2026-06-27 event, so a sustained upstream fault is unlikely; a reboot fixing it points at pfSense-side state.

Recover

  • Fast path (known fix): reboot pfSense — re-adds the default route, re-arms dpinger, flushes pf state. Capture the logs above FIRST (a reboot wipes the volatile evidence needed to find the real mechanism).
  • Targeted (if logs show a dpinger gateway-down): System → Routing → Gateways → WANGW; check the monitor IP + dpinger state; re-enable the gateway / let it re-eval. Confirm netstat -rn shows the default route restored.

Prevent / harden (deferred, needs a live-pfSense change)

Not done in this monitoring change — tracked for a follow-up with hands-on pfSense access: point dpinger's monitor at the local gateway (192.168.1.1) instead of an external IP + widen thresholds; disable gw_down_kill_states for the single WAN; add a failover gateway group; a 60s auto-recovery watchdog; ship pfSense system/gateway/routing syslog to the cluster so these logs become centrally queryable.