infra/stacks/monitoring
Viktor Barzin 16b3969ceb alloy: move resources to alloy.* (chart key bug); 1Gi limit fixes IO storm
The Alloy Helm chart maps `alloy.resources`, NOT `controller.resources`, onto
the alloy container. The block under `controller:` was silently dropped, so
the container ran with `resources: {}` and inherited the Kyverno LimitRange
`tier-defaults` 256Mi — well below Alloy's 400-450Mi steady state. The
cgroup ran at 255.8/256MB with ~50M memory-reclaim events, page-cache
thrashing drove ~185 MB/s sdc reads (12.18 TB in 24h), saturating the
Proxmox host and rippling out to all VMs + NFS.

Fix:
- Move resources to `alloy.resources` (correct chart key).
- Burstable QoS: request 512Mi, limit 1Gi. Workers are at 97-99%
  memory-request saturation cluster-wide; a 1Gi request blocks
  scheduling on node2/node3.
- Bump controller.updateStrategy.maxUnavailable to 50% so a 5-pod DS
  rolling update fits inside the helm timeout.
- Bump helm_release.alloy.timeout to 900s (default 300s was too short
  with occasional runc-stuck-Terminating on k8s-master).

Verified: all 4 alloy pods now show 1Gi/512Mi at the container level;
helm rev=8 deployed; per-pod memory 99-108Mi at steady state (well
under the new limit).

Memory ID 2726.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-26 02:08:35 +00:00
..
modules/monitoring alloy: move resources to alloy.* (chart key bug); 1Gi limit fixes IO storm 2026-05-26 02:08:35 +00:00
main.tf [forgejo] Tolerate missing Vault keys during Phase 0 bootstrap 2026-05-07 15:53:08 +00:00
secrets extract monitoring, nvidia, mailserver, cloudflared, kyverno from platform [ci skip] 2026-03-17 21:34:11 +00:00
terragrunt.hcl extract monitoring, nvidia, mailserver, cloudflared, kyverno from platform [ci skip] 2026-03-17 21:34:11 +00:00