infra/docs/post-mortems/2026-05-16-metallb-l2-immutable-pg-vip-flap.md
Viktor Barzin 279b88d2bc
docs: add MetalLB L2Status-immutable PG-VIP-flap post-mortem (code-aoxk)
Post-mortem for the 2026-04/05 SEV3 where a stuck MetalLB ServiceL2Status
CR (immutable status.node) flapped the PG load-balancer VIP and silently
broke Tier-1 Woodpecker terragrunt applies for ~5 days (the wrapper error
"Cannot read PG creds" masked the real cause for ~25 days). Written when
the incident closed (beads code-aoxk, 2026-05-26) but never committed;
landing it so the RCA + stuck-CR cleanup procedure live in the repo.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 16:25:10 +00:00


Post-Mortem: MetalLB ServiceL2Status Stuck Immutable → PG LB VIP Flap → Woodpecker CI Tier 1 Applies Broken

| Field | Value |
|---|---|
| Date | 2026-05-16 (mitigated) / 2026-05-26 (closed) |
| Duration | ~25 days of degraded CI (2026-04-21 first observed → 2026-05-16 mitigated). Symptom-only; no human-visible service downtime. |
| Severity | SEV3: Woodpecker CI default.yml apply step failed on Tier 1 (PG-backend) stacks. Drift-detection ran silently broken. Manual scripts/tg apply continued to work. No data loss, no app downtime. |
| Affected Services | Woodpecker CI pipelines applying any of the 28+ Tier 1 stacks (monitoring, crowdsec, authentik, headscale, etc.). PostgreSQL backend itself was healthy. |
| Issue | Beads code-aoxk (closed 2026-05-26). |
| Status | Closed |

Summary

Woodpecker CI pipelines failed with ERROR: Cannot read PG credentials from Vault. Run: vault login -method=oidc, emitted by scripts/tg, whenever a pipeline tried to apply a Tier 1 stack. The error was misleading on two counts:

  1. Vault was healthy. A direct vault read database/static-creds/pg-terraform-state from inside a Woodpecker pipeline pod (using K8s SA JWT → auth/kubernetes/login role=ci) succeeded every time when run in isolation.
  2. The "Cannot read PG credentials" message in scripts/tg was a catch-all that fired for any Terraform/Terragrunt failure during PG state-lock acquire-release, including TCP RSTs against the PG LoadBalancer VIP.

Actual root cause: the MetalLB ServiceL2Status CR for the postgresql-lb service (dbaas namespace, VIP 10.0.20.200) had a stuck status.node field that the controller treated as immutable. The L2 speaker kept failing to update it with Invalid value: "k8s-nodeX": Value is immutable, so the leader-elected announcer flapped between k8s-node3 and k8s-node4 every few seconds. Each flap dropped open TCP connections (RST). Terraform's state-lock acquire → operation → release sequence straddled flaps and failed mid-operation. scripts/tg surfaced this as the misleading "Cannot read PG credentials" message.

Manual scripts/tg apply from the DevVM kept working because the developer's session happened to land on whichever node currently held the VIP and complete fast enough to not straddle a flap. CI pipelines, being slower (full stack walk), reliably straddled at least one flap.
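The flap described above can be quantified from a series of observations of which node announces the VIP. The following is a minimal sketch, not something from the repo: count_flaps is a hypothetical helper that reads one node name per poll on stdin and counts how often the announcer changed.

```shell
# Count announcer changes in a stream of node names (one observation per line).
# count_flaps is a hypothetical helper name, not part of scripts/.
count_flaps() {
  awk '
    NF == 0 { next }                       # ignore polls that returned nothing
    prev != "" && $1 != prev { flaps++ }   # announcer differs from the previous poll
    { prev = $1 }
    END { print flaps + 0 }                # print 0 when no change was seen
  '
}
```

One way to feed it, assuming the CR exposes the announcing node as status.node (as the error message in this incident suggests), is to sample kubectl get servicel2status for the postgresql-lb entry every couple of seconds and pipe the node names in. A healthy VIP should yield 0 over any short window; a count that grows every few seconds matches the behaviour seen here.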

Impact

  • CI degradation: Tier 1 stack changes pushed to master were NOT auto-applied. Required manual scripts/tg apply from DevVM after every push touching one of 28+ stacks.
  • Drift-detection broken: The daily drift-detection.yml Woodpecker pipeline silently failed on every Tier 1 stack — meaning unannounced manual changes to those stacks could have persisted undetected for the duration.
  • No user-facing outage: PG cluster itself, all apps that use PG, and all in-cluster traffic to 10.0.20.200 worked normally. Only the very specific acquire-state-lock → run operation → release-state-lock round-trip pattern from CI was unreliable.

Timeline (UTC)

| Time | Event |
|---|---|
| 2026-04-21 | First broken CI pipelines (#411, #412, #413). Drift-detection failures noticed. code-aoxk filed. Initial hypothesis: Vault auth/role mismatch. |
| 2026-04-22 to 2026-05-15 | Multiple investigation attempts. Verified Vault K8s auth/kubernetes/role/ci has correct policies (terraform-state, ci). Verified database/static-creds/pg-terraform-state exists, rotates on schedule, credentials valid. Could not reproduce the failure in isolated vault read from Woodpecker pods. |
| 2026-05-16 (~12:14 UTC) | pg-cluster-3 came up (third CNPG replica); endpoint set churn likely triggered MetalLB L2 announcer to attempt to update the existing ServiceL2Status CR (was l2-rgt9d). Update was rejected as immutable. Speaker kept retrying. VIP flapped. |
| 2026-05-16 | RCA breakthrough: noticed kubectl logs -n metallb-system -l component=speaker was full of Invalid value: "k8s-node…": Value is immutable on the postgresql-lb ServiceL2Status. Correlated with kubectl get servicel2status returning multiple stale entries for the same service. |
| 2026-05-16 | Mitigation: kubectl delete servicel2status.metallb.io l2-rgt9d -n metallb-system. Speaker recreated the CR cleanly (became l2-zj9ss). Flap stopped. PG connections stable. Manual CI re-runs of monitoring stack apply succeeded immediately. |
| 2026-05-17 | Audit: acceptance criteria 1 + 2 met implicitly. #3 (post-mortem) remained pending. Beads task reverted from in_progress → open. |
| 2026-05-25 | Node2 SCSI LUN remap → encrypted PVC emergency_ro → containerd boltdb corruption outage. Unrelated, but pulled Woodpecker server off node2. Subsequent server pod restart on k8s-node4. |
| 2026-05-26 | Verification: from a live Woodpecker pipeline pod (wp-01kshph6pa0w6ch0zf5x9bfqgr), vault write auth/kubernetes/login role=ci jwt=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token) succeeded. vault read database/static-creds/pg-terraform-state returned valid creds (username=terraform_state, last_vault_rotation 2026-05-21, TTL 58h). Live default.yml pipeline confirmed applying Tier 1 stacks: dbaas, authentik, crowdsec, monitoring, nvidia, cloudflared, kyverno, metallb (all OK). postgresql-lb ServiceL2Status currently single allocation (l2-sv9vv on k8s-node3, no flap). Beads task closed. |

Root Cause

The metallb-speaker reconciler in the deployed MetalLB version treats ServiceL2Status.status.node as immutable once it has been set. When the L2 announcer's leader election picks a different node to announce a given VIP (which happens on speaker pod restart, node loss, endpoint set churn, or pod-anti-affinity reshuffles), the reconciler fails to patch the existing CR and gets stuck in a retry loop. Until the CR is deleted manually, the reconciler makes no progress.
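A quick way to spot the stuck state is to look for services that own more than one ServiceL2Status CR. The sketch below assumes input lines of the form CR name, service namespace, service name, node; find_duplicate_l2status is a hypothetical helper, and the status.serviceNamespace and status.serviceName field paths in the comment are assumptions about the CRD that should be checked against the deployed version.

```shell
# Print every service that has more than one ServiceL2Status CR.
# Input (stdin): "<cr-name> <service-namespace> <service-name> <node>" per line.
# One possible source, assuming these status fields exist in the deployed CRD:
#   kubectl get servicel2status -n metallb-system --no-headers \
#     -o custom-columns='NAME:.metadata.name,NS:.status.serviceNamespace,SVC:.status.serviceName,NODE:.status.node'
find_duplicate_l2status() {
  awk '
    NF >= 4 {
      key = $2 "/" $3                    # namespace/service identifies the VIP owner
      count[key]++
      names[key] = names[key] " " $1     # remember which CRs claim this service
    }
    END {
      for (key in count)
        if (count[key] > 1) print key ":" names[key]
    }
  ' | sort
}
```

Empty output means one CR per service, which is the healthy state recorded at closure. Any line printed names the CRs to inspect before deleting anything.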

Why it manifested as Vault credential errors:

  1. CI's scripts/tg pre-flight runs vault read database/static-creds/pg-terraform-state (line 83 in current code) to get PG credentials. That call succeeds.
  2. CI then runs terragrunt apply against the Tier 1 stack. Terragrunt connects to 10.0.20.200:5432 for state-lock acquire (via pg_advisory_lock). The TCP connection lands on whichever node MetalLB last announced the VIP from.
  3. Mid-operation, MetalLB tries to re-announce from a different node, sends gratuitous ARPs, and the upstream switch updates its MAC table. Open TCP sessions on the previous announcer's node are immediately RST.
  4. Terragrunt's state-lock release (or any subsequent PG operation) fails with broken pipe / connection refused.
  5. scripts/tg interpreted the wrapper-level failure as "PG creds bad" because that's the most common failure mode it handles. The actual error from terragrunt was buried in 2>/dev/null suppression (since fixed — see Fix #1 below).
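The chain above suggests that a wrapper should name the layer that failed only after looking at the captured output. The following is an illustrative sketch, not the logic in scripts/tg: classify_failure is a hypothetical function and the match patterns are assumptions chosen from the error strings mentioned in this document plus common Vault and PostgreSQL messages.

```shell
# Classify captured vault/terragrunt output by the layer that most likely failed.
# Hypothetical helper; the patterns are illustrative and not exhaustive.
classify_failure() {
  output="$1"
  if printf '%s\n' "$output" | grep -qiE 'broken pipe|connection reset|connection refused|i/o timeout'; then
    echo "network"        # transport-level failure, e.g. TCP RST during a VIP flap
  elif printf '%s\n' "$output" | grep -qiE 'permission denied|invalid token|password authentication failed'; then
    echo "credentials"    # Vault or PostgreSQL rejected the identity
  else
    echo "unknown"        # do not guess; show the raw output to the operator
  fi
}
```

The important branch is the last one: when nothing matches, the wrapper should print the raw output rather than fall back to the most common failure mode.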

Detection

We did not have any of:

  • A direct alert for "MetalLB ServiceL2Status reconciler errors".
  • An alert for "PG LB VIP node changed N times in M minutes".
  • An end-to-end probe for the CI state-lock pattern (terragrunt against 10.0.20.200).

The detection mechanism was a human reading kubectl logs -n metallb-system for unrelated reasons. It took 25 days from the first observed symptom to RCA.

Fixes & Mitigations

1. Surface real error from scripts/tg (DONE)

The original scripts/tg swallowed the real vault read / terragrunt error behind 2>/dev/null and printed a static "Cannot read PG credentials from Vault" message. Fixed in the script:

# scripts/tg lines 79-89 (current)
if ! command -v vault >/dev/null 2>&1; then
  echo "ERROR: vault CLI not found on PATH. Install it or use an image that includes it (ci/Dockerfile)." >&2
  exit 1
fi
VAULT_OUT=$(vault read -format=json database/static-creds/pg-terraform-state 2>&1) || {
  echo "ERROR: Cannot read PG credentials from Vault. Vault output follows:" >&2
  echo "$VAULT_OUT" >&2
  echo "" >&2
  echo "Hint: humans run 'vault login -method=oidc'; CI auths via K8s SA (role=ci)." >&2
  exit 1
}

A comment in the code explicitly references this incident.

2. Stuck-CR cleanup procedure (DOCUMENTED)

Reproduction check for future sessions (also in code-aoxk beads notes):

kubectl logs -n metallb-system -l component=speaker --tail=200 | grep -iE 'Invalid value.*immutable'
# If matches found → same root cause. Delete the stuck CR:
kubectl get servicel2status -n metallb-system
kubectl delete servicel2status.metallb.io <name> -n metallb-system

Speaker recreates the CR cleanly within seconds.
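For repeated use, the delete step can be wrapped so that it prints the command by default and only runs it on request. This is a sketch under stated assumptions: delete_stuck_l2status and the APPLY switch are hypothetical conveniences, not part of the repo; the kubectl command itself is the one used during mitigation.

```shell
# Guarded wrapper around the delete step.
# Prints the command by default; set APPLY=1 to execute it.
delete_stuck_l2status() {
  name="$1"
  ns="${2:-metallb-system}"
  if [ -z "$name" ]; then
    echo "usage: delete_stuck_l2status <cr-name> [namespace]" >&2
    return 2
  fi
  if [ "${APPLY:-0}" = "1" ]; then
    kubectl delete servicel2status.metallb.io "$name" -n "$ns"
  else
    # Dry run: show exactly what would be executed.
    echo "kubectl delete servicel2status.metallb.io $name -n $ns"
  fi
}
```

Running delete_stuck_l2status with a CR name first shows the command for review; APPLY=1 in front of the same call performs the delete.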

3. Long-term MetalLB controller fix (DEFERRED)

The underlying bug — speaker not recreating the CR when the immutable field needs to change — is upstream MetalLB behaviour. Two paths possible:

  • Upgrade MetalLB to a version where this is fixed (needs research — check changelogs).
  • File upstream issue / patch with reproducer.

Not done as part of this post-mortem; tracked separately. Risk acceptance: until then, the manual delete servicel2status workaround is the playbook, and is fast (<10s).

4. Alerting (DEFERRED)

Suggested but not implemented:

  • Prometheus alert on metallb_speaker_reconcile_errors_total{kind="ServiceL2Status"} rate.
  • Synthetic probe: a CronJob that does pg_advisory_lock + release against the PG VIP every 5min from CI namespace, alert if it ever fails.

Tracked as future hardening (no beads task yet — only worth filing if recurrence happens).
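The synthetic probe suggested above could look roughly like the sketch below, which a CronJob container might run. It is an assumption-laden outline rather than an implemented check: probe_sql and run_probe are hypothetical names, the lock key 424242 is arbitrary and must not collide with keys used by Terraform's state locking, and connection settings are expected in the standard libpq environment variables (PGHOST, PGPORT, PGUSER, PGPASSWORD, PGDATABASE).

```shell
# Build the SQL for one lock round-trip: acquire, hold, release.
# pg_advisory_lock, pg_sleep and pg_advisory_unlock are standard PostgreSQL functions.
probe_sql() {
  key="$1"
  hold_seconds="$2"
  printf 'SELECT pg_advisory_lock(%s); SELECT pg_sleep(%s); SELECT pg_advisory_unlock(%s);' \
    "$key" "$hold_seconds" "$key"
}

# Run the round-trip over a single connection to the PG VIP.
# A dropped connection during the hold makes psql exit non-zero.
run_probe() {
  PGCONNECT_TIMEOUT=5 psql -X -v ON_ERROR_STOP=1 \
    -c "$(probe_sql "${1:-424242}" "${2:-10}")"
}
```

Holding the lock for several seconds is deliberate: the flap in this incident recurred every few seconds, so a probe that connects and disconnects instantly would mostly pass, just as the fast manual applies did.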

Lessons

  1. 2>/dev/null is a time-bomb. It hid the real error for weeks. Fix #1 already lands the principle; audit other places in scripts/ for the same anti-pattern next time we touch them.
  2. CRD status.* immutability is a non-obvious failure mode. When debugging weird LB / VIP / endpoint behaviour, always grep speaker logs for immutable, cannot update, and reconciler errors. Add to cluster-health checks.
  3. Misleading wrapper errors cost weeks. scripts/tg claimed "Cannot read PG credentials" — that's what the operator believed. The actual vault read step worked. The real failure was three steps later in a completely different subsystem. When a wrapper script makes a definitive claim about which subsystem failed, distrust it; reproduce the subsystem in isolation before chasing the claim.
  4. CNPG primary changes / endpoint churn can trigger L2 announcer flap. The trigger (within the timeline) was likely the pg-cluster-3 pod coming up. Worth flagging for any future CNPG topology changes.
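The audit mentioned in lesson 1 can start from a simple search. The helper below is a sketch with a hypothetical name (find_stderr_suppression); it lists lines that send stderr straight to /dev/null and deliberately ignores the >/dev/null 2>&1 form used for existence checks such as command -v.

```shell
# List lines under a directory that discard stderr, as "file:line:content".
# Matches "2>/dev/null", "2> /dev/null" and "&>/dev/null"; does not match "2>&1".
find_stderr_suppression() {
  dir="${1:-scripts}"
  # grep exits 1 when nothing matches; treat that as success with empty output.
  grep -rnE '(2>|&>)[[:space:]]*/dev/null' "$dir" || true
}
```

Each hit still needs a human decision: some suppressions are harmless, but any that sit in front of a command whose failure is later reported with a fixed message deserve the same treatment as Fix #1.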

References

  • Beads: code-aoxk — closed 2026-05-26.
  • scripts/tg lines 65-95 — current pre-flight with explicit error surfacing.
  • kubectl get servicel2status -A — current state, single allocation per service.
  • This file: infra/docs/post-mortems/2026-05-16-metallb-l2-immutable-pg-vip-flap.md.