Commit graph

22 commits

Author SHA1 Message Date
Viktor Barzin
4cb2c157da post-mortem 2026-04-22: full timeline — second regression + node4 reboot
The initial recovery at 11:03 was premature; vault-1's audit writes over
NFS started hanging ~15 min later and the cluster regressed to 503.
Full recovery required rebooting node4 (to free vault-0's stuck NFS
mount and shed PVE NFS thread contention) and a second reboot of node3
(to clear another round of kernel NFS client degradation). Final
recovery at 11:43:28 UTC with vault-2 as active leader and a quorum of
vault-0 + vault-2.

vault-1 remains stuck in ContainerCreating on node2 — a third node2
reboot is required for full 3/3 quorum, but 2/3 is operationally
sufficient, so that's deferred.
2026-04-22 11:44:56 +00:00
Viktor Barzin
2f1f9107f8 vault: add fsGroupChangePolicy=OnRootMismatch + 2026-04-22 post-mortem
The 2026-04-22 Vault outage caught kubelet in a chown loop it never exited:
the default fsGroupChangePolicy (Always) walks every file on the NFS-backed
data PVC on each mount attempt. With retrans=3,timeo=30 NFS options and a
1GB audit log, the recursive chown outlasted kubelet's 2-minute deadline and
restarted forever, blocking raft quorum recovery. OnRootMismatch makes chown a
no-op when the volume root is already correct, which it always is after
initial setup.

The breakglass fix was applied live via kubectl patch at 10:54 UTC; this
commit persists it in Terraform so the next apply doesn't revert.
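
For reference, a minimal sketch of what that breakglass patch could look like
(the statefulset name and namespace are assumptions, not taken from this commit):

    # Stop kubelet from recursively chown-ing the whole NFS PVC on every start;
    # with OnRootMismatch the walk only happens if the volume root's ownership
    # doesn't already match fsGroup.
    kubectl -n vault patch statefulset vault --type merge \
      -p '{"spec":{"template":{"spec":{"securityContext":{"fsGroupChangePolicy":"OnRootMismatch"}}}}}'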

The post-mortem also documents the upstream raft stuck-leader pattern,
NFS kernel client corruption after force-kill, and the path to migrate
Vault off NFS to proxmox-lvm-encrypted.
2026-04-22 11:12:19 +00:00
Viktor Barzin
ac695dea38 [registry] bulk-clean 34 orphan manifests + beads-server image bump
Registry integrity probe surfaced 38 broken manifest references
(34 unique repo:tag pairs, same OCI-index orphan pattern as the 04-19
infra-ci incident). Deleted all via registry HTTP API + ran GC;
reclaimed ~3GB blob storage.
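
Roughly, for each broken repo:tag the cleanup resolves the manifest digest,
DELETEs it via the API, then runs GC. A sketch, assuming deletes are enabled
on the registry; the registry host, REPO/TAG and container name are placeholders:

    # Resolve the tag to its manifest digest (Docker-Content-Digest header).
    DIGEST=$(curl -sI -H 'Accept: application/vnd.oci.image.index.v1+json' \
      "https://registry.example.com/v2/${REPO}/manifests/${TAG}" \
      | awk 'tolower($1)=="docker-content-digest:" {print $2}' | tr -d '\r')
    # Delete the manifest by digest, then reclaim blob storage with the
    # built-in garbage collector inside the registry:2 container.
    curl -s -X DELETE "https://registry.example.com/v2/${REPO}/manifests/${DIGEST}"
    docker exec registry bin/registry garbage-collect /etc/docker/registry/config.yml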

beads-server CronJobs were stuck in ImagePullBackOff on
claude-agent-service:0c24c9b6 for >6h — bumped variable default to
2fd7670d (canonical tag in claude-agent-service stack, already healthy
in registry) so new ticks can fire.

Rebuilt in-use broken tags: freedify:{latest,c803de02} and
beadboard:{17a38e43,latest} on registry VM; priority-pass via
Woodpecker pipeline #8. wealthfolio-sync:latest deferred (monthly
CronJob, next run 2026-05-01).

Probe now reports 0/39 failures. RegistryManifestIntegrityFailure
alert cleared.

Closes: code-8hk
Closes: code-jh3c

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 23:16:34 +00:00
Viktor Barzin
42961a5f58 [registry] fix-broken-blobs.sh — check revision-link, not blob data
The original index-child scan checked if the child's blob data file
existed under /blobs/sha256/<child>/data. That's wrong in a subtle
way: registry:2 serves a per-repo manifest via the link file at
<repo>/_manifests/revisions/sha256/<child-digest>/link, NOT by blob
presence. When cleanup-tags.sh rmtrees a tag, the per-repo revision
links for its index's children also disappear — but the blob data
survives (GC owns that, and runs weekly). Result: blob present,
link absent, API 404 on HEAD — the exact 2026-04-19 failure mode.
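
In shell terms (paths abbreviated as in the message above; ROOT, REPO and
CHILD stand in for the registry data root, repository name and child digest),
the fix boils down to asking a different question:

    # Old check: is the child's blob data still on disk? False negative here,
    # because the blob survives when only the per-repo link was removed.
    [ -f "${ROOT}/blobs/sha256/${CHILD}/data" ]
    # New check: does the per-repo revision link that registry:2 actually
    # serves the manifest from still exist?
    [ -f "${ROOT}/${REPO}/_manifests/revisions/sha256/${CHILD}/link" ]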

Live proof: the registry-integrity-probe CronJob just found 38 real
orphan children (including 98f718c8 from the original incident) while
the previous fix-broken-blobs.sh scan reported 0. After the fix, both
tools agree. The probe had been authoritative all along; the scan was
a false-negative because it was asking the wrong question.

Post-mortem updated to reflect the true mechanism (link-file absence,
not blob deletion).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 17:43:35 +00:00
Viktor Barzin
7cb44d7264 [registry] Stop recurring orphan OCI-index incidents — detection + prevention + recovery
Second identical registry incident on 2026-04-19 (first 2026-04-13): the
infra-ci:latest image index resolved to child manifests whose blobs had been
garbage-collected out from under the index. Pipelines P366→P376 all exited
126 "image can't be pulled". Hot fix (a05d63e / 6371e75 / c113be4) restored
green CI but left the underlying bug unaddressed.

Root cause: cleanup-tags.sh rmtrees tag dirs on the registry VM daily at
02:00, and registry:2's GC (Sunday 03:25) walks OCI index children imperfectly
(distribution/distribution#3324 class). Nothing verified pushes end-to-end;
nothing probed the registry for fetchability; nothing caught orphan indexes.

Phase 1 — Detection:
 - .woodpecker/build-ci-image.yml: after build-and-push, a verify-integrity
   step walks the just-pushed manifest (index + children + config + every
   layer blob) via HEAD and fails the pipeline on any non-200. Catches
   broken pushes at the source (a sketch of this walk follows the list below).
 - stacks/monitoring: new registry-integrity-probe CronJob (every 15m) and
   three alerts — RegistryManifestIntegrityFailure,
   RegistryIntegrityProbeStale, RegistryCatalogInaccessible — closing the
   "registry serves 404 for a tag that exists" gap that masked the incident
   for 2+ hours.
 - docs/post-mortems/2026-04-19-registry-orphan-index.md: root cause,
   timeline, monitoring gaps, permanent fix.
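
A minimal sketch of the kind of walk the verify-integrity step performs (the
actual step lives in .woodpecker/build-ci-image.yml and may differ; REG, REPO,
TAG and the use of jq here are assumptions):

    # HEAD the index, every child manifest, and every config/layer blob;
    # any non-200 fails the pipeline.
    fail()    { echo "INTEGRITY FAIL: $1" >&2; exit 1; }
    head_ok() { curl -fsI -o /dev/null "$1" || fail "$1"; }
    head_ok "${REG}/v2/${REPO}/manifests/${TAG}"
    INDEX=$(curl -fs -H 'Accept: application/vnd.oci.image.index.v1+json' \
      "${REG}/v2/${REPO}/manifests/${TAG}")
    for child in $(echo "${INDEX}" | jq -r '.manifests[].digest'); do
      head_ok "${REG}/v2/${REPO}/manifests/${child}"
      MANIFEST=$(curl -fs -H 'Accept: application/vnd.oci.image.manifest.v1+json' \
        "${REG}/v2/${REPO}/manifests/${child}")
      for blob in $(echo "${MANIFEST}" | jq -r '.config.digest, .layers[].digest'); do
        head_ok "${REG}/v2/${REPO}/blobs/${blob}"
      done
    done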

Phase 2 — Prevention:
 - modules/docker-registry/docker-compose.yml: pin registry:2 → registry:2.8.3
   across all six registry services. Removes the floating-tag footgun.
 - modules/docker-registry/fix-broken-blobs.sh: new scan walks every
   _manifests/revisions/sha256/<digest> that is an image index and logs a
   loud WARNING when a referenced child blob is missing. Does NOT auto-
   delete — deleting a published image is a conscious decision. Layer-link
   scan preserved.

Phase 3 — Recovery:
 - build-ci-image.yml: accept `manual` event so Woodpecker API/UI rebuilds
   don't need a cosmetic Dockerfile edit (matches convention from
   pve-nfs-exports-sync.yml).
 - docs/runbooks/registry-rebuild-image.md: exact command sequence for
   diagnosing + rebuilding after an orphan-index incident, plus a fallback
   for building directly on the registry VM if Woodpecker itself is down.
 - docs/runbooks/registry-vm.md + .claude/reference/service-catalog.md:
   cross-references to the new runbook.

Out of scope (verified healthy or intentionally deferred):
 - Pull-through DockerHub/GHCR mirrors (74.5% hit rate, no 404s).
 - Registry HA/replication (single-VM SPOF is a known architectural
   choice; Synology offsite covers RPO < 1 day).
 - Diun exclude for registry:2 — not applicable; Diun only watches
   k8s (DIUN_PROVIDERS_KUBERNETES=true), not the VM's docker-compose.

Verified locally:
 - fix-broken-blobs.sh --dry-run on a synthetic registry directory correctly
   flags both orphan layer links and orphan OCI-index children.
 - terraform fmt + validate on stacks/monitoring: success (only unrelated
   deprecation warnings).
 - python3 yaml.safe_load on .woodpecker/build-ci-image.yml and
   modules/docker-registry/docker-compose.yml: both parse clean.

Closes: code-4b8

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 17:08:28 +00:00
Viktor Barzin
a24cf8c689 [docs] post-mortem: clarify the sizeLimit vs container memory limit gotcha
Initial 2Gi sizeLimit didn't take effect because Kyverno's tier-defaults
LimitRange in authentik ns applies a default container memory limit of
256Mi to pods with resources: {}. Writes to a memory-backed emptyDir count
against the container's cgroup memory, so the container was OOM-killed
(exit 137) at ~256 MiB even though the tmpfs sizeLimit said 2Gi. Confirmed
with `dd if=/dev/zero of=/dev/shm/test bs=1M count=500`.

Fix: also set `containers[0].resources.limits.memory: 2560Mi` via the same
kubernetes_json_patches. Verified end-to-end — 1.5 GB file write succeeds,
df -h /dev/shm reports 2.0G.
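
The imperative equivalent of that fix (the commit applies it declaratively via
the outpost's kubernetes_json_patches; the deployment name is an assumption
based on the pod name in the post-mortem):

    # Give the container an explicit memory limit so the Kyverno LimitRange
    # default of 256Mi no longer applies and tmpfs writes have headroom.
    kubectl -n authentik patch deployment ak-outpost-authentik-embedded-outpost \
      --type json \
      -p '[{"op":"add","path":"/spec/template/spec/containers/0/resources/limits","value":{"memory":"2560Mi"}}]'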

Updates the post-mortem P1 row to capture this for future readers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 13:23:14 +00:00
Viktor Barzin
cacc282f1a .gitignore: ignore terragrunt_rendered.json debug output
Generated by `terragrunt render-json` for debugging. Not meant to be
tracked — a stale one was sitting untracked in stacks/dbaas/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 13:18:05 +00:00
Viktor Barzin
b41528e564 [docs] Add post-mortem for Authentik outpost /dev/shm incident (2026-04-18)
## Context

On 2026-04-18 all Authentik-protected *.viktorbarzin.me sites returned HTTP
400 for all users. First reported as a per-user issue affecting Emil since
2026-04-16 ~17:00 UTC; it escalated to cluster-wide when Viktor's cached
session stopped being enough. Duration: ~44h for the first-affected user,
~30 min from cluster-wide report to unblocked.

## Root cause

The `ak-outpost-authentik-embedded-outpost` pod's /dev/shm (default 64 MB
tmpfs) filled to 100% with ~44k `session_*` files from gorilla/sessions
FileStore. Every forward-auth request with no valid cookie creates one
session-state file; with `access_token_validity=7d` and a measured ~18
files/min, steady-state accumulation (18/min × 60 × 24 × 7 ≈ 180k files)
vastly exceeds the
default tmpfs. Once full, every new `store.Save()` returned ENOSPC and
the outpost replied HTTP 400 instead of the usual 302 to login.
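
A quick way to confirm this failure mode from inside the outpost pod (the pod
name suffix here is a placeholder):

    # tmpfs should show 100% used and the session_* count should be huge.
    kubectl -n authentik exec ak-outpost-authentik-embedded-outpost-XXXXX -- \
      sh -c 'df -h /dev/shm && ls /dev/shm | grep -c "session_"'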

## What's captured

- Full timeline, impact, affected services
- Root-cause chain diagram (request rate → retention → ENOSPC → 400)
- Why diagnosis took 2 days (misattribution of a Viktor event to Emil,
  red-herring suspicion of the new Rybbit Worker, cached sessions masking
  the outage)
- Contributing factors + detection gaps
- Prevention plan with P0 (done — 512Mi emptyDir via kubernetes_json_patches
  on the outpost config), P1 alerts, P2 Terraform codification, P3 upstream
- Lessons learned (check outpost logs first; cookie-less `curl` disproves
  per-user symptoms fast; UI-managed Authentik config is invisible to git)

## Follow-ups not in this commit

- Prometheus alert for outpost /dev/shm usage > 80%
- Meta-alert for correlated Uptime Kuma external-monitor failures
- Decision on tmpfs sizing vs restart cadence vs probe-frequency reduction
  (see discussion in beads code-zru)

Closes: code-zru

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 13:12:27 +00:00
Viktor Barzin
cf87a747d8 docs: update post-mortem follow-up implementation [PM-2026-04-14] [ci skip]
Mark all 8 safe TODOs as Done. Add Follow-up Implementation table with commit
SHAs. Flag 3 Migration TODOs as needing human review.

Co-Authored-By: postmortem-todo-resolver <noreply@anthropic.com>
2026-04-14 18:09:11 +00:00
Viktor Barzin
b1b408ff0e fix: use full path to claude CLI for non-interactive SSH 2026-04-14 17:44:50 +00:00
Viktor Barzin
7674cf8c5c docs: final E2E pipeline test 2026-04-14 17:43:38 +00:00
Viktor Barzin
91b97709b7 docs: trigger postmortem pipeline with TODO 2026-04-14 17:27:45 +00:00
Viktor Barzin
f336e5ed53 docs: E2E test postmortem pipeline with deep clone 2026-04-14 17:12:46 +00:00
Viktor Barzin
60c04e51b7 2026-04-14 17:10:45 +00:00
Viktor Barzin
933c562aa9 docs: trigger postmortem pipeline E2E test 2026-04-14 16:49:07 +00:00
Viktor Barzin
df95f52d08 docs: test postmortem with TODO for pipeline E2E 2026-04-14 16:45:44 +00:00
Viktor Barzin
b3cc5fcc32 test: trigger postmortem pipeline webhook 2026-04-14 16:44:11 +00:00
Viktor Barzin
777450cb19 docs: test post-mortem for pipeline E2E validation 2026-04-14 15:55:32 +00:00
Viktor Barzin
a703c6e84f docs: update post-mortem follow-up implementation [PM-2026-04-14] [ci skip]
Added Uptime Kuma TCP monitor for PVE NFS (192.168.1.127:2049), ID 328,
Tier 1 (30s/3 retries). Investigation TODO flagged for human review.

Co-Authored-By: postmortem-todo-resolver <noreply@anthropic.com>
2026-04-14 15:48:11 +00:00
Viktor Barzin
e832581caf docs: update Apr 14 post-mortem with Phase 2 findings
Key additions:
- NFSv3 broke after NFS restart (kernel lockd bug on PVE 6.14)
- All 52 PVs migrated to NFSv4, NFSv3 disabled on PVE
- DNS zone sync gap: secondary/tertiary had no custom zones
- Converted one-time setup Job to recurring zone-sync CronJob
- MySQL, Redis, Vault collateral damage and fixes
- 3 new lessons learned (zone replication, NFS client state, operator rollout)

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 12:26:11 +00:00
Viktor Barzin
4e059b138c docs: consolidate all post-mortems under docs/post-mortems/
Move HTML post-mortems from repo root post-mortems/ to docs/post-mortems/.
Update index.html with all 3 incidents (newest first).

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 08:24:36 +00:00
Viktor Barzin
bdba15a387 docs: move post-mortems to docs/post-mortems/
Consolidate all outage reports under docs/ for better discoverability.
Moved from .claude/post-mortems/ (agent-internal) to docs/post-mortems/
(repo documentation).

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 08:20:09 +00:00