infra/docs/runbooks/apiserver-audit-logging.md
Viktor Barzin 551412488b
apiserver: enable audit logging (low-write Metadata) + ship to Loki
Resource changes/deletions are now attributable (the novelapp deletion this week
was untraceable because apiserver audit was off). Low-write policy: drops
reads/noise, Metadata level on mutations, omitStages RequestReceived. Wired into
the kube-apiserver static-pod manifest + kubeadm-config (v1beta4
extraArgs/extraVolumes -> survives kubeadm upgrade) on k8s-master; Alloy tails
/var/log/kubernetes/audit/audit.log -> Loki {job=kubernetes-audit}.

Root cause that had silently blocked this AND OIDC for weeks: a stray
kube-apiserver.yaml.bak inside /etc/kubernetes/manifests/ was a duplicate
static-pod manifest kubelet ran instead of the real one, dropping every flag
added to the real manifest. Removed it. Runbook added.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-06 16:51:26 +00:00


Runbook: kube-apiserver Audit Logging

Status: enabled 2026-06-06 on k8s-master (10.0.20.100, the single control-plane node). Motivated by the novelapp incident — a workload was deleted with no way to attribute it, because apiserver audit logging had never been on (see post-incident note below).

What is configured

  • Audit policy: infra/scripts/k8s-apiserver-audit-policy.yaml (source of truth), deployed to /etc/kubernetes/audit-policy.yaml on k8s-master. Low-write by design: drops reads (get/list/watch), high-churn resources (events, leases, endpointslices, token/subjectaccess reviews), and probe URLs; logs everything else (create/update/patch/delete) at Metadata level (who/verb/resource/namespace/name/time/sourceIP — no bodies). omitStages: [RequestReceived] → one line per mutating request.
  • kube-apiserver static-pod manifest (/etc/kubernetes/manifests/kube-apiserver.yaml): --audit-policy-file=/etc/kubernetes/audit-policy.yaml, --audit-log-path=/var/log/kubernetes/audit/audit.log, --audit-log-maxage=30 --audit-log-maxbackup=10 --audit-log-maxsize=100 (100 MB per file, at most 10 rotated files, each kept up to 30 days: roughly 1 GB of rotated logs plus the active file), plus the audit-policy (File, RO) and audit-logs (DirectoryOrCreate) hostPath volumes/mounts.
  • Persistence across kubeadm upgrade: the same flags + volumes are in the kubeadm-config ConfigMap (kube-system), under ClusterConfiguration.apiServer.{extraArgs,extraVolumes} (v1beta4). Without this, a control-plane upgrade regenerates the manifest and silently drops the audit (and OIDC) flags. The OIDC flags are recorded there too (see below).
  • Shipping to Loki: the Alloy DaemonSet (infra/stacks/monitoring/modules/monitoring/alloy.yaml) tails /var/log/kubernetes/audit/audit.log (it schedules on the control-plane node and mounts host /var/log). Query in Loki/Grafana with {job="kubernetes-audit"}.
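The flag list above can be checked mechanically after any manifest edit or kubeadm upgrade. The following is a minimal sketch, not part of the repo: the function name missing_audit_flags is hypothetical, and it assumes each flag sits on its own "- --flag=value" line in the manifest's command: list.

```shell
# Sketch: print every expected audit flag that is absent from a manifest.
# The flag values are the ones listed in this runbook; the function name
# is hypothetical.
missing_audit_flags() (
  manifest="$1"
  for flag in \
    "--audit-policy-file=/etc/kubernetes/audit-policy.yaml" \
    "--audit-log-path=/var/log/kubernetes/audit/audit.log" \
    "--audit-log-maxage=30" \
    "--audit-log-maxbackup=10" \
    "--audit-log-maxsize=100"
  do
    # Strip the YAML list prefix, then require an exact whole-line match
    # (-F fixed string, -x whole line, -e because the pattern starts with dashes).
    sed 's/^[[:space:]]*- //' "$manifest" | grep -Fxq -e "$flag" \
      || printf '%s\n' "$flag"
  done
)
```

Running missing_audit_flags /etc/kubernetes/manifests/kube-apiserver.yaml on k8s-master should print nothing when all five flags are present. The kubeadm-config copy can be inspected with kubectl -n kube-system get cm kubeadm-config -o jsonpath='{.data.ClusterConfiguration}'.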

How to attribute a change ("who deleted X, when")

# In Loki (Grafana Explore or logcli), last 24h:
{job="kubernetes-audit"} |= "delete" |= "<resource-name>"

Each entry is a JSON audit.k8s.io/v1 Event: user.username, verb, objectRef.{resource,namespace,name}, requestReceivedTimestamp, sourceIPs, userAgent. On-node fallback (Loki down): sudo grep <name> /var/log/kubernetes/audit/audit.log on k8s-master.
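For the on-node fallback, a plain grep on the name also returns creates and patches. A slightly narrower search can be sketched as below. It assumes the apiserver writes one compact JSON event per line (no spaces after colons) and that rotated files keep the audit prefix and .log suffix; the function name audit_deletes_for is hypothetical.

```shell
# Sketch: show delete events for one object name from the on-node audit logs.
# Assumptions: one compact JSON event per line; rotated files match audit*.log.
# The function name and argument order are hypothetical.
audit_deletes_for() (
  name="$1"
  dir="${2:-/var/log/kubernetes/audit}"
  # First filter on the verb, then on the object name inside objectRef.
  grep -h -F -e '"verb":"delete"' "$dir"/audit*.log \
    | grep -F -e "\"name\":\"$name\""
)
```

On k8s-master this would be run with sudo, for example sudo sh -c '. ./audit-helpers.sh; audit_deletes_for <name>', where audit-helpers.sh is a placeholder for wherever the function is saved. The user.username and requestReceivedTimestamp fields of each returned line answer "who" and "when".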

Note: direct kubectl/dashboard calls now show the real identity (user, ServiceAccount, or OIDC email). Deletions from before 2026-06-06 are NOT recoverable (audit was off).

CRITICAL gotcha that blocked this (and OIDC) for weeks

kubelet runs every non-dotfile in its staticPodPath (/etc/kubernetes/manifests) as a static pod. A stray kube-apiserver.yaml.bak.<epoch> left in that directory (from an earlier manual edit) was a second manifest defining pod kube-apiserver. kubelet ran the older .bak copy and ignored edits to the real kube-apiserver.yaml, so newly added flags (the OIDC flags, then these audit flags) never reached the running process even though the file clearly had them.

Symptom: the running apiserver's /proc/<pid>/cmdline (or crictl inspect args) is SHORTER than the manifest's command: list.

Fix: move any *.bak/backup file OUT of /etc/kubernetes/manifests/. Always back up control-plane manifests to a directory outside manifests/ (e.g. /etc/kubernetes/apiserver-manifest-archive/), never inside manifests/. Removing the stray file also un-blocked OIDC (memory id=4042) as a side effect.
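Both the cause and the symptom can be checked with a few lines of shell. This is a minimal sketch: the function names are hypothetical, and the list of expected files assumes the four kubeadm default static pods (etcd, kube-apiserver, kube-controller-manager, kube-scheduler), so adjust it if this node runs others.

```shell
# Sketch: list files in the static-pod directory that kubelet would run
# but that are not one of the four kubeadm default manifests.
# Function names are hypothetical; the expected-name list is an assumption.
stray_manifests() (
  dir="${1:-/etc/kubernetes/manifests}"
  for f in "$dir"/*; do        # a plain * glob skips dotfiles, as kubelet does
    [ -f "$f" ] || continue
    case "${f##*/}" in
      etcd.yaml|kube-apiserver.yaml|kube-controller-manager.yaml|kube-scheduler.yaml) ;;
      *) printf '%s\n' "$f" ;;
    esac
  done
)

# Count the "- --flag" entries in a manifest's command: list.
manifest_flag_count() {
  grep -c -E '^[[:space:]]*- --' "$1"
}

# To compare against the running process on the node (not executed here):
#   running=$(tr '\0' '\n' < /proc/"$(pgrep -o kube-apiserver)"/cmdline | grep -c '^--')
#   wanted=$(manifest_flag_count /etc/kubernetes/manifests/kube-apiserver.yaml)
#   [ "$running" -eq "$wanted" ] || echo "running apiserver is missing flags"
```

Any path printed by stray_manifests is a candidate duplicate manifest and should be moved out of the directory before further debugging.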

Rollback

Backups live in /etc/kubernetes/apiserver-manifest-archive/ on k8s-master (the 27-arg pre-audit known-good manifest and the 36-arg desired manifest). To disable audit: remove the --audit-* flags + audit volumes from the manifest (kubelet restarts the apiserver in ~30-40s), and remove them from kubeadm-config. To recover from a bad manifest edit, copy the known-good manifest back over /etc/kubernetes/manifests/kube-apiserver.yaml.

Editing the apiserver manifest restarts the apiserver → ~30-40s API blip on this single-control-plane cluster. Always take a backup before editing, and watch curl -sk https://10.0.20.100:6443/livez until it answers ok before declaring success.
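Watching /livez by hand can be replaced with a small retry loop. The sketch below uses a hypothetical helper, wait_until, which is not part of any Kubernetes tooling: it re-runs a probe command once per second until it succeeds or the timeout (in attempts) is reached.

```shell
# Sketch: retry a probe command until it succeeds or the timeout expires.
# wait_until is a hypothetical helper. Usage: wait_until <attempts> <command...>
wait_until() {
  _timeout="$1"; shift
  _elapsed=0
  while ! "$@"; do
    _elapsed=$((_elapsed + 1))
    # Give up once the probe has failed _timeout times.
    [ "$_elapsed" -ge "$_timeout" ] && return 1
    sleep 1
  done
}

# Example after a manifest edit (not executed here):
#   wait_until 90 curl -sk -f --max-time 2 -o /dev/null https://10.0.20.100:6443/livez \
#     && echo "apiserver is live" || echo "apiserver did not come back; restore the known-good manifest"
```

The 90-attempt budget in the example is an assumption chosen to sit comfortably above the ~30-40s restart noted above. curl's -f makes a non-2xx response count as a failure, so the loop only ends when /livez actually reports healthy.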