infra/docs/runbooks/pfsense-egress.md
Viktor Barzin 7fe2d9780e
Some checks failed
ci/woodpecker/push/default Pipeline was canceled
monitoring: add pfSense WAN/egress alerting + probes
On 2026-06-27 pfSense (Proxmox VMID 101) stopped passing internet egress for
~20 min while internal routing + Unbound stayed up; recovery needed a manual
reboot and NOTHING alerted — there was no egress probe and the cloudflared
replica metric stayed green. Add first-class egress monitoring so the next
occurrence pages in ~2 min instead of being noticed by a human.

- blackbox-exporter: new icmp_egress + dns_external probe modules (+ NET_RAW
  so ICMP can use raw sockets).
- Three in-cluster probe jobs exercising the pod->node->pfSense-NAT path that
  failed: wan-gateway-icmp (192.168.1.1), internet-egress-icmp (9.9.9.9 +
  1.1.1.1), internet-egress-dns (cloudflare.com via both resolvers).
- Prometheus alerts (group "Egress / pfSense"): WANGatewayUnreachable,
  InternetEgressDown (both providers dead), ExternalDNSResolutionDown,
  EgressOnlyDivergence (reuses the existing t3-probe legs — the incident's
  exact "external down while internal up" signature), PfSenseVMDown.
- Loki ruler: CloudflaredTunnelConnLoss — the canary that fired first; the
  cloudflared replica metric is blind to tunnel-connection loss. Threshold
  calibrated against live Loki (steady-state ~2/6h vs 37-85/5m in-incident).
- Alertmanager inhibit: WAN/egress-down suppresses the downstream egress
  symptom alerts so one root alert pages, not a storm.
- Runbook docs/runbooks/pfsense-egress.md + .claude/CLAUDE.md.

All metric names + the cloudflared threshold verified against live
Prometheus/Loki. Pure GitOps, no pfSense change. Firewall-side hardening
(dpinger retargeting, failover gateway, pfSense syslog -> Loki) is deferred
and documented in the runbook.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 16:46:30 +00:00

72 lines
4.7 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Runbook: pfSense WAN / egress outage
**Scope:** the cluster (and home) loses **internet egress** while pfSense is
otherwise alive — internal VLAN routing and DNS keep working. This is the
**2026-06-27 incident class**: pfSense (Proxmox **VMID 101**) stopped passing
IPv4 egress for ~20 min (00:02→00:23 UTC) while LAN/OPT1 routing + Unbound
stayed up; recovery required a manual reboot, and **nothing alerted** (no egress
probe existed; the cloudflared replica metric stayed green). The alerts +
probes below close that gap. Incident detail: memory ids #6715#6723.
pfSense is a **single point of failure** (no HA): it is the k8s default gateway
(`10.0.20.1`), Kea DHCP, Unbound DNS, NAT, and the WireGuard hub. WAN is
**static** `192.168.1.2/24`, upstream gateway `WANGW = 192.168.1.1` (the TP-Link
Archer AX6000). The sole IPv4 default gateway, no gateway-group/failover.
## Alerts (all in `stacks/monitoring/modules/monitoring/`)
| Alert | Signal | Means |
|-------|--------|-------|
| `WANGatewayUnreachable` (critical) | in-cluster ICMP to `192.168.1.1` fails >3m | pfSense's upstream gateway is unreachable from the cluster |
| `InternetEgressDown` (critical) | in-cluster ICMP to **both** `9.9.9.9` and `1.1.1.1` fails >2m | internet egress through pfSense NAT is black-holed |
| `ExternalDNSResolutionDown` (warning) | UDP/53 to both public resolvers fails >3m | egress or external-DNS path broken |
| `EgressOnlyDivergence` (critical) | t3-probe `cloudflare` leg down **while** `internal` leg up >3m | egress-specific failure, internal healthy (the exact 2026-06-27 signature) |
| `PfSenseVMDown` (critical) | `pve_up{id="qemu/101"}==0` while host up >2m | the pfSense VM stopped/crashed (host fine) |
| `CloudflaredTunnelConnLoss` (warning, Loki) | >20 cloudflared edge-conn failures/5m | tunnel/egress trouble (canary that fires first; replica metric is blind) |
Probes run **from inside the cluster** (blackbox-exporter, pod → node → pfSense
NAT), so they exercise the exact egress path that fails. `WANGatewayUnreachable`
/ `InternetEgressDown` **inhibit** the downstream egress symptoms so one root
alert pages, not a storm.
`PfSenseVMDown` **does not** catch a *guest-internal* reboot — `pve_up` tracks
the qemu process, which survives an in-guest reboot (this is why 2026-06-27 was
metric-invisible). `CloudflaredTunnelConnLoss` + the probe alerts cover that case.
## Diagnose (read-only first)
1. **Confirm scope** — is it egress-only or total?
- `kubectl -n monitoring` Grafana → `probe_success{job=~"wan-gateway-icmp|internet-egress-icmp"}` and `t3probe_connected` by `leg`.
- Internal still up? `pve_up{id="qemu/101"}` should be `1`; internal k8s DNS (`10.0.20.1`) still resolving = pfSense alive, egress-only.
2. **Capture pfSense on-box logs BEFORE rebooting** (they persist on disk — no RAM-disk — and are the only source that proves the mechanism; they are NOT shipped to Loki):
```
ssh -i ~/.ssh/id_ed25519 admin@10.0.20.1 # devvm wizard key (id #6784)
clog /var/log/gateways.log | grep -iE 'WANGW|down|up|delay|loss' # dpinger gateway alarms
clog /var/log/routing.log | grep -iE 'default|route' # default-route add/delete
clog /var/log/system.log | tail -200
netstat -rn | head # is the default route present?
ls -la /var/crash/ # panic/textdump?
```
(If SSH is rejected post-reboot, the reboot regenerated `authorized_keys` from
config.xml — re-add the key via console or WebGUI; see id #6718.)
3. **Upstream check** — is the TP-Link / ISP up? It held the same public IP with
clean DHCP renewals through the 2026-06-27 event, so a *sustained* upstream
fault is unlikely; a reboot fixing it points at **pfSense-side state**.
## Recover
- **Fast path (known fix):** reboot pfSense — re-adds the default route, re-arms
dpinger, flushes pf state. **Capture the logs above FIRST** (a reboot wipes
the volatile evidence needed to find the real mechanism).
- Targeted (if logs show a dpinger gateway-down): System → Routing → Gateways →
WANGW; check the monitor IP + dpinger state; re-enable the gateway / let it
re-eval. Confirm `netstat -rn` shows the default route restored.
## Prevent / harden (deferred, needs a live-pfSense change)
Not done in this monitoring change — tracked for a follow-up with hands-on
pfSense access: point dpinger's monitor at the local gateway (`192.168.1.1`)
instead of an external IP + widen thresholds; disable `gw_down_kill_states` for
the single WAN; add a failover gateway group; a 60s auto-recovery watchdog;
ship pfSense system/gateway/routing syslog to the cluster so these logs become
centrally queryable.