alloy: move resources to alloy.* (chart key bug); 1Gi limit fixes IO storm

The Alloy Helm chart maps `alloy.resources`, NOT `controller.resources`,
onto the alloy container. The block under `controller:` was silently
dropped, so the container ran with `resources: {}` and inherited the
256Mi default limit from the Kyverno LimitRange `tier-defaults`, well
below Alloy's 400-450Mi steady state. The cgroup sat at 255.8/256Mi with
~50M memory-reclaim events, and the resulting page-cache thrashing drove
~185 MB/s of reads on sdc (12.18 TB in 24h), saturating the Proxmox host
and rippling out to all VMs and NFS.

Fix:
- Move resources to `alloy.resources` (the correct chart key).
- Use Burstable QoS: request 512Mi, limit 1Gi. Workers are at 97-99%
  memory-request saturation cluster-wide, so a 1Gi request would block
  scheduling on node2/node3.
- Bump `controller.updateStrategy.rollingUpdate.maxUnavailable` to 50%
  so the 5-pod DS rolling update fits inside the helm timeout.
- Bump the `helm_release.alloy` timeout to 900s (the 300s default was
  too short with occasional runc-stuck-Terminating on k8s-master).
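
The Burstable choice above can be illustrated with a minimal sketch of the Kubernetes QoS rules, reduced to a single container. The function name `qos_class` and the dict layout are hypothetical, and quantities are compared as plain strings, so equivalent spellings such as `1Gi` and `1024Mi` are not treated as equal here.

```python
def qos_class(resources: dict) -> str:
    """Simplified, single-container take on the Kubernetes QoS classes.

    Assumption: `resources` mirrors the `resources:` block of a container
    spec, with quantities as strings that are compared textually.
    """
    requests = resources.get("requests", {})
    limits = resources.get("limits", {})

    # No requests and no limits at all: BestEffort.
    if not requests and not limits:
        return "BestEffort"

    # Guaranteed needs both cpu and memory limits, with each request equal
    # to its limit (an unset request defaults to the limit).
    guaranteed = all(
        name in limits and requests.get(name, limits[name]) == limits[name]
        for name in ("cpu", "memory")
    )
    return "Guaranteed" if guaranteed else "Burstable"


# Values taken from the `alloy.resources` block in this commit.
alloy_resources = {
    "requests": {"cpu": "50m", "memory": "512Mi"},
    "limits": {"memory": "1Gi"},
}
print(qos_class(alloy_resources))  # Burstable
```

Under these rules the pod stays Burstable as long as no CPU limit is set, whatever the memory request is. The memory request matters because it is what the scheduler has to fit onto a node.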

Verified: all 4 alloy pods now show limit 1Gi / request 512Mi at the
container level; helm rev=8 deployed; per-pod memory is 99-108Mi at
steady state (well under the new limit).
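
A rough sketch of why the `maxUnavailable` bump shortens the rollout, assuming the DaemonSet controller rounds a percentage up to a whole number of pods (as the Kubernetes API reference describes for DaemonSets) and modelling the rollout as discrete waves, which is a simplification. The helper names are hypothetical.

```python
import math
from typing import Union


def max_unavailable_pods(desired: int, max_unavailable: Union[int, str]) -> int:
    """Resolve maxUnavailable (absolute number or 'N%') to a pod count.

    Assumption: percentages round up, per the DaemonSet rolling-update docs.
    """
    if isinstance(max_unavailable, str) and max_unavailable.endswith("%"):
        percent = int(max_unavailable[:-1])
        return max(1, math.ceil(desired * percent / 100))
    return int(max_unavailable)


def rollout_waves(desired: int, max_unavailable: Union[int, str]) -> int:
    """Number of sequential replacement waves in the simplified wave model."""
    return math.ceil(desired / max_unavailable_pods(desired, max_unavailable))


print(rollout_waves(5, 1))      # 5 (chart default of 1 pod at a time)
print(rollout_waves(5, "50%"))  # 2 (up to 3 pods at a time)
```

In this model a pod stuck in Terminating delays one wave out of two rather than one out of five, which is the headroom the 900s timeout is meant to cover.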
Memory ID 2726.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
parent b9ac942647
commit 16b3969ceb
2 changed files with 25 additions and 12 deletions
@@ -1,4 +1,18 @@
 alloy:
+  # Resource limits for the alloy container itself.
+  # Must be under `alloy.resources` (NOT `controller.resources`) — the chart
+  # only maps THIS key onto the alloy container. Without it, the container gets
+  # `resources: {}` and inherits Kyverno LimitRange `tier-defaults` (256Mi),
+  # which is below Alloy's 400-450Mi steady state and caused page-cache
+  # thrashing → 185 MB/s sdc reads → host IO saturation (2026-05-26).
+  # Burstable QoS (request < limit) — workers are at 97-99% memory-request
+  # saturation; a 1Gi request blocks scheduling on node2/node3.
+  resources:
+    requests:
+      cpu: 50m
+      memory: 512Mi
+    limits:
+      memory: 1Gi
   configMap:
     content: |-
       // Write your Alloy config here:
@@ -183,6 +197,14 @@ alloy:
         readOnly: true
 
 controller:
+  # Bump maxUnavailable above the chart default (1) so a 5-node DS finishes its
+  # rolling update inside the helm_release timeout. Log shipper tolerates the
+  # brief gap.
+  updateStrategy:
+    type: RollingUpdate
+    rollingUpdate:
+      maxUnavailable: 50%
+
   volumes:
     extra:
       - name: journal-run
@@ -206,13 +228,3 @@ controller:
       operator: "Exists"
       effect: "NoSchedule"
 
-  # Resource limits for DaemonSet pods
-  # Alloy tails logs from all containers on the node via K8s API and batches
-  # them to Loki. Memory scales with number of active log streams (~30-50 per node).
-  # 128Mi was OOMKilled; steady-state usage is ~400-450Mi per pod.
-  resources:
-    requests:
-      cpu: 50m
-      memory: 512Mi
-    limits:
-      memory: 1Gi
@@ -28,8 +28,9 @@ resource "helm_release" "alloy" {
   repository = "https://grafana.github.io/helm-charts"
   chart = "alloy"
 
   values = [file("${path.module}/alloy.yaml")]
   atomic = true
+  timeout = 900 # 5-pod DS rolling update + occasional runc-stuck-Terminating on k8s-master needs >300s default
 
   depends_on = [helm_release.loki]
 }