infra/stacks/monitoring
Viktor Barzin ca2680c189 fix(post-mortem): add NFSHighRPCRetransmissions alert + migrate alertmanager to proxmox-lvm-encrypted [PM-2026-04-14]
- Add PrometheusRule: NFSHighRPCRetransmissions fires when node_nfs_rpc_retransmissions_total
  rate exceeds 5/s for 5m — catches NFS server degradation before pod failures cascade
- Migrate alertmanager PV from NFS (192.168.1.127:/srv/nfs/alertmanager) to proxmox-lvm-encrypted
  eliminating the circular dependency where alertmanager couldn't alert about NFS failures
- Set force_update=true on prometheus helm_release to handle StatefulSet volumeClaimTemplate changes

Co-Authored-By: postmortem-todo-resolver <noreply@anthropic.com>
2026-04-14 18:05:33 +00:00
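
For orientation, the alert described in the commit message above could be declared roughly as follows. This is a minimal sketch, not the repository's actual definition: the resource name, rule name, namespace, group name, severity label, annotation text and the `[5m]` rate window are all assumptions, and the real rule in modules/monitoring may instead be passed through Helm chart values. Only the alert name, the metric, the 5/s threshold and the 5m hold duration come from the commit message.

```hcl
# Sketch only: assumes the hashicorp/kubernetes provider and that the
# Prometheus Operator CRDs (monitoring.coreos.com/v1) are installed.
resource "kubernetes_manifest" "nfs_high_rpc_retransmissions" { # hypothetical resource name
  manifest = {
    apiVersion = "monitoring.coreos.com/v1"
    kind       = "PrometheusRule"
    metadata = {
      name      = "nfs-rpc-retransmissions" # assumed
      namespace = "monitoring"              # assumed
    }
    spec = {
      groups = [{
        name = "nfs" # assumed group name
        rules = [{
          alert = "NFSHighRPCRetransmissions"
          # Per-second retransmission rate; the 5m rate window is an assumption.
          expr = "rate(node_nfs_rpc_retransmissions_total[5m]) > 5"
          # The condition must hold for 5 minutes before the alert fires.
          "for" = "5m"
          labels = {
            severity = "warning" # assumed
          }
          annotations = {
            summary = "High NFS RPC retransmission rate on {{ $labels.instance }}" # assumed wording
          }
        }]
      }]
    }
  }
}
```

The `for` key is quoted so the HCL parser cannot mistake it for the start of a `for` expression.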
Name | Last commit | Date
modules/monitoring | fix(post-mortem): add NFSHighRPCRetransmissions alert + migrate alertmanager to proxmox-lvm-encrypted [PM-2026-04-14] | 2026-04-14 18:05:33 +00:00
main.tf | add TrueNAS Cloud Sync monitor CronJob and bump Prometheus Helm timeout | 2026-03-23 02:24:39 +02:00
secrets | extract monitoring, nvidia, mailserver, cloudflared, kyverno from platform [ci skip] | 2026-03-17 21:34:11 +00:00
terragrunt.hcl | extract monitoring, nvidia, mailserver, cloudflared, kyverno from platform [ci skip] | 2026-03-17 21:34:11 +00:00
tiers.tf | extract monitoring, nvidia, mailserver, cloudflared, kyverno from platform [ci skip] | 2026-03-17 21:34:11 +00:00