Switch from restart-count based detection (increase restarts[1h] > 5) to
waiting-reason based (kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"}).
Alert auto-resolves when pod recovers, making it clear whether the issue is active.
HomeAssistantVersionControl v1.2.0 installed on ha-sofia for git-based
config tracking. Auto-commits on file change, pushes hourly to private
GitHub repo ViktorBarzin/ha-sofia-config.
Root cause: PMTU black hole on WireGuard tunnel. The tunnel runs over
the HE IPv6 6in4 tunnel (gif0 MTU 1280). With WG overhead (~80 bytes),
effective inner MTU is 1200 — but both sides were configured at 1420.
SSH kex packets >1200 bytes were silently dropped.
Fix: Set tun_wg0 MTU to 1200 on pfSense + peer_855 MTU to 1200 on
London GL-iNet. Re-enabled London DHCP/ARP import in remote CronJob.
All 3 sites now fully automated:
- Sofia: Kea leases + ARP every 5min
- London: DHCP + ARP via pfSense→London SSH hop, hourly
- Valchedrym: DHCP + ARP via pfSense→OpenWRT SSH hop, hourly
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ProxmoxMetricsMissing alert was firing because pve_* metrics were
excluded from the kubernetes-service-endpoints metric_relabel_configs
whitelist. The exporter was scraping successfully but metrics were
being dropped before ingestion.
- Sofia import (every 5min): Kea leases + pfSense ARP via SSH
- Remote import (hourly): Valchedrym DHCP/ARP via pfSense SSH hop
- London SSH (dropbear) hangs during kex on low-power router — disabled
for now, data imported manually. TODO: lightweight push agent
- Fixed SSH key filename (id_rsa, not id_ed25519) for RSA keys
- No more ping sweeping anywhere — all passive DHCP/ARP data
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Document fixes from 2026-04-10 London network debugging session:
- pfSense WG now dual-stack (IPv4+IPv6 via HE tunnel gif0 pf rule)
- GL-iNet AllowedIPs must be single comma-separated UCI entry (parse bug)
- AdGuardHome/carrier-monitor must not use 1.1.1.1 (conntrack + rate limit)
- Expanded troubleshooting for site-to-site tunnel disconnects
Nextcloud persists dbpassword in config.php on its PVC and ignores
MYSQL_PASSWORD env var after initial install. When Vault rotates the
MySQL password, config.php goes stale causing HTTP 500 crash loops.
Adds a before-starting hook that patches config.php with the current
MYSQL_PASSWORD on every pod start. Combined with Stakater Reloader
annotation, the full rotation chain is now automated:
Vault rotates → ESO syncs Secret → Reloader restarts pod → hook
patches config.php → Nextcloud connects with new password.
Also fixes stale existingClaim (nextcloud-data-iscsi → nextcloud-data-proxmox).
- CronJob now SSHs to Valchedrym OpenWRT (192.168.0.1) to pull DHCP leases + ARP table
- Parses /tmp/dhcp.leases for hostname + MAC, /proc/net/arp for additional devices
- London still uses ping sweep via pfSense WG tunnel (no SSH access to GL-iNet)
- 6 Valchedrym devices tracked: router, alarm, video, termoregulator, 2 clients
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
All device discovery now handled by phpipam-pfsense-import CronJob
which queries Kea DHCP leases + pfSense ARP table every 5min.
No active scanning needed — pfSense sees all devices passively.
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>