kyverno: disable reports-controller to stop etcd ephemeralreport load

Viktor flagged that he does not want to wear the single non-RAID SSD with
useless etcd writes if etcd moves there. Investigation found that the avoidable
load is kyverno reporting: the 2026-06-12 etcd-load-reduction change disabled
the report *features* but left the reports-controller running (default
--enableReporting + --validatingAdmissionPolicyReports=true), so the 2026-06-21
kyverno upgrade left a one-time pile of ~10.5k cluster/namespaced
ephemeralreports (~114MB in etcd) that nothing reaps (aggregation is off).
Listing that range starves etcd's fdatasync enough to flap the apiserver
(observed live 2026-06-28).

Disable the reports-controller outright (reportsController.enabled=false),
completing the 2026-06-12 intent. Reports are not consumed (violations surface
via Loki->Slack); admission enforcement (deny-* policies) and Keel mutation are
independent of it. The ~10.5k stale reports already in etcd are cleared
separately (throttled, out-of-band) since bulk-deleting them is itself
etcd-heavy.
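
A minimal sketch of what a throttled, out-of-band cleanup could look like. It is not the procedure actually used here. The function name, the `list_page` and `delete_one` callables, and the batch size and pause values are all hypothetical stand-ins for whatever Kubernetes client does the real listing and deleting. The idea is only to list a small page at a time, rather than the whole ~10.5k range, and to pause between pages so etcd can catch up.

```python
import time
from typing import Callable, List


def throttled_delete(
    list_page: Callable[[int], List[str]],
    delete_one: Callable[[str], None],
    batch_size: int = 50,
    pause_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Delete objects in small pages, pausing after each page.

    list_page(n) is assumed to return at most n names of objects that
    still exist; delete_one(name) is assumed to raise on failure.
    Returns the number of objects deleted.
    """
    deleted = 0
    while True:
        # Only ever ask for a small page, never the full range.
        names = list_page(batch_size)
        if not names:
            return deleted
        for name in names:
            delete_one(name)
            deleted += 1
        # Give the backing store time to settle before the next page.
        sleep(pause_s)
```

Because deleted objects drop out of later listings, re-reading the first page on each pass is enough and no continuation token is needed. The pause and page size would have to be tuned against observed etcd latency.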

Refs: code-oflt (etcd IO isolation), code-at4f (etcd starvation alerting).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Viktor Barzin 2026-06-29 05:35:36 +00:00
parent cf42042cba
commit e43e64c666

@@ -36,8 +36,9 @@ resource "helm_release" "kyverno" {
forceFailurePolicyIgnore = {
enabled = true
}
-# Reporting fully disabled (2026-06-12, etcd-load-reduction). policyReports
-# were already off, so admission/aggregate/background reporting generated
+# Reporting features disabled (2026-06-12, etcd-load-reduction); the
+# reportsController itself is now disabled too (2026-06-28, see below).
+# policyReports were already off, so admission/aggregate/background generated
# ephemeralreports + an hourly all-resource etcd re-scan for NO user-facing
# output. Admission enforcement (deny-* policies) and Keel mutation are
# independent of reporting; policy violations surface via Loki->Slack. This
@@ -56,7 +57,19 @@ resource "helm_release" "kyverno" {
}
}
+# Fully disable the reports controller (2026-06-28). The 2026-06-12 change
+# turned off the report *features* (policy/admission/aggregate/background) but
+# LEFT this controller running with its default --enableReporting +
+# --validatingAdmissionPolicyReports=true, so it kept emitting ephemeralreports.
+# The 2026-06-21 kyverno upgrade then produced a one-time pile of ~10.5k
+# cluster/namespaced ephemeralreports (~114MB in etcd) that nothing reaps
+# (aggregation off) and listing that range starves etcd's fdatasync hard
+# enough to flap the apiserver (observed live 2026-06-28). Reports are not
+# consumed (violations surface via Loki->Slack), so disable the controller
+# outright; enforcement (deny-* policies) + Keel mutation are independent of
+# it. Stale reports are cleared out-of-band (one-time, throttled).
reportsController = {
+enabled = false
resources = {
limits = {
memory = "512Mi"