- Increase socket timeout from 30s to 120s (121+ monitors need time to sync)
- Add wait_events=0.2 for reliable login
- Fix accepted_statuscodes format: use 100-increment ranges not arbitrary
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add automatic external HTTPS monitors to Uptime Kuma for ~96 services
exposed via Cloudflare tunnel. A sync CronJob (every 10min) reads from
a Terraform-generated ConfigMap and creates/deletes [External] monitors
to match cloudflare_proxied_names. Status page groups these separately
as "External Reachability" and pushes a divergence metric to Pushgateway
when services are externally down but internally up. Prometheus alert
ExternalAccessDivergence fires after 15min of divergence.
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds "Reporting an Issue" section with:
- Where to report (Slack, GitHub, DM)
- What to include (examples of good vs bad reports)
- What happens after reporting (flow diagram)
- Self-service status checks (Uptime Kuma, Grafana, K8s Dashboard)
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Mark all 8 safe TODOs as Done. Add Follow-up Implementation table with commit
SHAs. Flag 3 Migration TODOs as needing human review.
Co-Authored-By: postmortem-todo-resolver <noreply@anthropic.com>
- Add PrometheusRule: NFSHighRPCRetransmissions fires when node_nfs_rpc_retransmissions_total
rate exceeds 5/s for 5m — catches NFS server degradation before pod failures cascade
- Migrate alertmanager PV from NFS (192.168.1.127:/srv/nfs/alertmanager) to proxmox-lvm-encrypted
eliminating the circular dependency where alertmanager couldn't alert about NFS failures
- Set force_update=true on prometheus helm_release to handle StatefulSet volumeClaimTemplate changes
Co-Authored-By: postmortem-todo-resolver <noreply@anthropic.com>
Pipeline authenticates to Vault via K8s SA JWT, fetches devvm_ssh_key
from secret/ci/infra, SSHes to DevVM to run Claude Code headlessly.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Added Uptime Kuma TCP monitor for PVE NFS (192.168.1.127:2049), ID 328,
Tier 1 (30s/3 retries). Investigation TODO flagged for human review.
Co-Authored-By: postmortem-todo-resolver <noreply@anthropic.com>
Key additions:
- NFSv3 broke after NFS restart (kernel lockd bug on PVE 6.14)
- All 52 PVs migrated to NFSv4, NFSv3 disabled on PVE
- DNS zone sync gap: secondary/tertiary had no custom zones
- Converted one-time setup Job to recurring zone-sync CronJob
- MySQL, Redis, Vault collateral damage and fixes
- 3 new lessons learned (zone replication, NFS client state, operator rollout)
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Secondary/tertiary DNS instances had no custom zones — only the
primary had viktorbarzin.lan and viktorbarzin.me. The old setup Job
ran once at deployment and never synced new zones.
New CronJob runs every 30 minutes:
- Gets all zones from primary
- Enables zone transfer on primary
- Creates missing zones as Secondary type on replicas
- Resyncs existing zones via AXFR
Fixes .lan resolution failures (2/3 queries returned NXDOMAIN).
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Applied all 20 NFS stacks to converge PV mount_options (nfsvers=4).
State files encrypted and committed.
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>