Commit graph

28 commits

Author SHA1 Message Date
Viktor Barzin
c72b839a2f nvidia: pin chart to v25.10.1 after v26.3.1 upgrade revealed missing ubuntu26.04 driver images
k8s-node1 was upgraded to Ubuntu 26.04 (kernel 7.0.0-15-generic) at some
point. NVIDIA has NOT published ubuntu26.04 driver images yet
(skopeo list-tags docker://nvcr.io/nvidia/driver returned 0 ubuntu26.04
tags vs 779 for ubuntu22.04 and 206 for ubuntu24.04).

Attempted fix today: bump gpu-operator chart v25.10.1 → v26.3.1 +
driver 570.195.03 → 580.105.08 + kernelModuleType=open. The chart
applied cleanly but the v26.3.1 operator auto-detects host OS via NFD
labels and constructs `<version>-ubuntu26.04` image tags, which 404 on
pull. Rolled back to chart v25.10.1 and pinned it explicitly here so
future `terraform apply` doesn't surface the same trap again.

Note: chart rollback alone does NOT restore GPU functionality on
k8s-node1. Both v25.10.1 and v26.3.1's operators now pick the
ubuntu26.04 suffix (the NFD label is sticky once detected). The actual
recovery path requires either (a) NVIDIA shipping ubuntu26.04 driver
images, or (b) rolling the host kernel back to 6.8.0-117-generic
(still installed in /boot, headers in /usr/src) + `apt-mark hold` to
prevent re-upgrade. That step needs explicit user authorization for a
node reboot — left as the next action item on code-8vr0.

Files:
  - stacks/nvidia/modules/nvidia/main.tf — explicit version pin,
    explanatory comment
  - stacks/nvidia/modules/nvidia/values.yaml — comment block
    documenting the situation; driver pinned at 570.195.03
  - docs/post-mortems/2026-05-17-gpu-driver-ubuntu2604-mismatch.md —
    full timeline, root causes, recovery procedure

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:56 +00:00
Viktor Barzin
0480477f44 nfs-csi: pin chart v4.13.1 + controller affinity (post-mortem)
Keel rolled csi-driver-nfs 4.13.1→4.13.2 today. The 4.13.2 chart dropped
control-plane exclusion from the controller Deployment, so both replicas
landed on k8s-master, fought for hostNetwork ports 19809/29653, and one
went CrashLoopBackOff. Helm rollback left orphan containerd sandboxes
holding the ports — only a kubelet restart on master cleared them.

- Pin helm_release.version = "4.13.1" so terraform apply can't drift to
  the broken chart (defense in depth; nfs-csi namespace is already in the
  Kyverno-Keel exclude list)
- Add controller.affinity: podAntiAffinity between replicas +
  nodeAffinity excluding node-role.kubernetes.io/control-plane
- docs/post-mortems/2026-05-17-nfs-csi-keel-upgrade-master-port-conflict.md
  captures the root cause + recovery procedure (kubelet restart via
  nsenter is the escalation path when crictl rmp -f fails)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:56 +00:00
Viktor Barzin
9476649539 docs/pm: kured silently stalled 6 days + Anubis HA lift (2026-05-16)
Captures the May 10–16 kured-vs-sentinel-gate hostPath mismatch (chart
derived hostPath from configuration.rebootSentinel) and the companion
work to harden the rolling-reboot pipeline against single-replica
PDB deadlocks: Anubis 1→2 replicas with shared Valkey store, kured
drainTimeout=30m, CNPG pg-cluster 2→3 instances. Includes the
mysql-standalone-PDB orphan cleanup and the k8s-node1 containerd-source
drift audit (benign).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:48 +00:00
Viktor Barzin
93ee45bd25 docs/authentik: document postgres session backend + close out 2026-04-18 post-mortem items
Update `.claude/reference/authentik-state.md`:
  - Add `ProxyProvider.access_token_validity = "weeks=4"` to the Session
    Duration table with the gotcha that the gorilla session store binds
    the value once at outpost startup (rollout restart needed).
  - Replace the "session storage moved to Postgres in 2025.10" note that
    falsely implied the migration was automatic — explain that the
    `Outpost.managed` field gates the postgres path and our outpost
    silently stayed on `FilesystemStore` until 2026-05-10.
  - Document the goauthentik 2026.2.2 service-selector bug
    (service.py:52) and the JSON-patch workaround.
  - Document that the standalone embedded-outpost deployment needs
    `AUTHENTIK_POSTGRESQL__*` env vars injected via JSON patch, plus the
    `app.kubernetes.io/component=server` pod label.
  - Note the "Terraform doesn't expose `Outpost.managed`" assumption
    that holds the `managed=embedded` value in place across applies.

Close out post-mortem `2026-04-18-authentik-outpost-shm-full.md`:
  - P2 codify-in-Terraform: DONE.
  - P3 access_token_validity reduce: DONE-alt (we did the opposite —
    bumped to 4 weeks — because postgres backend mooted the storage
    concern).
  - P3 move-off-embedded-outpost: DONE-alt (postgres backend addresses
    the loss-of-state class on the embedded outpost itself).
2026-05-22 14:16:41 +00:00
Viktor Barzin
57250cfda2 mysql: bump to 4Gi limit / 3Gi request; grow /srv/nfs LV to 3 TiB
mysql-standalone OOMKilled May 8 18:05 (anon-rss 2 GB at the 2 Gi limit).
innodb_buffer_pool_size=1Gi plus connection buffers and InnoDB internals
don't fit in 2 Gi. Bumping limit to 4 Gi (request 3 Gi) leaves headroom
without changing the buffer pool config.

/srv/nfs was at 90% (1.7T / 2T); grew the underlying pve/nfs-data LV
1 TiB online and ran resize2fs (now 60% used). Triggered by surfacing
during the 2026-05-09 IO-pressure post-mortem; thinpool had ~4.6 TiB
free.

The post-mortem also covers the stale-NFS-client trigger (legacy
/usr/local/bin/weekly-backup pointing at the decommissioned TrueNAS IP)
and the resulting wedged kthread on the PVE host. Script removed and
node_exporter restarted out-of-band; kthread will clear at next PVE
reboot. See docs/post-mortems/2026-05-09-io-pressure-stale-nfs.md.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-10 11:12:38 +00:00
Viktor Barzin
484b4c7190 vault: complete Phase 2 NFS-hostile migration; remove nfs-proxmox SC
All 3 vault voters now on proxmox-lvm-encrypted (vault-0 16:18, vault-1
+ vault-2 today). The NFS fsync incompatibility identified in the
2026-04-22 raft-leader-deadlock post-mortem is no longer reachable —
raft consensus log + audit log live on LUKS2 block storage with real
fsync semantics.

Cluster-wide consumers of the inline kubernetes_storage_class.nfs_proxmox
dropped to zero after the rolling, so the resource is removed from
infra/stacks/vault/main.tf. Released NFS PVs (6) remain in the cluster
and will be reclaimed in Phase 3 cleanup.

Lesson learned (recorded in plan): pvc-protection finalizer races the
StatefulSet controller — pod recreates on the OLD PVCs unless the
finalizer is patched out before pod delete. Force-finalize technique
applied to vault-1 + vault-2 successfully.

Closes: code-gy7h
2026-04-25 17:10:00 +00:00
Viktor Barzin
4cb2c157da post-mortem 2026-04-22: full timeline — second regression + node4 reboot
The initial recovery at 11:03 was premature; vault-1's audit writes over
NFS started hanging ~15 min later and the cluster regressed to 503.
Full recovery required rebooting node4 (to free vault-0's stuck NFS
mount and shed PVE NFS thread contention) and a second reboot of node3
(to clear another round of kernel NFS client degradation). Final
recovery at 11:43:28 UTC with vault-2 as active leader on the quorum
vault-0 + vault-2.

vault-1 remains stuck in ContainerCreating on node2 — a third node2
reboot is required for full 3/3 quorum, but 2/3 is operationally
sufficient, so that's deferred.
2026-04-22 11:44:56 +00:00
Viktor Barzin
2f1f9107f8 vault: add fsGroupChangePolicy=OnRootMismatch + 2026-04-22 post-mortem
The 2026-04-22 Vault outage caught kubelet in a 2-minute chown loop that
never exited because the default fsGroupChangePolicy (Always) walks every
file on the NFS-backed data PVC. With retrans=3,timeo=30 NFS options and
a 1GB audit log, the recursive chown outlasted the deadline and restarted
forever — blocking raft quorum recovery. OnRootMismatch makes chown a
no-op when the volume root is already correct, which it always is after
initial setup.

The breakglass fix was applied live via kubectl patch at 10:54 UTC; this
commit persists it in Terraform so the next apply doesn't revert.

The post-mortem also documents the upstream raft stuck-leader pattern,
NFS kernel client corruption after force-kill, and the path to migrate
Vault off NFS to proxmox-lvm-encrypted.
2026-04-22 11:12:19 +00:00
Viktor Barzin
ac695dea38 [registry] bulk-clean 34 orphan manifests + beads-server image bump
Registry integrity probe surfaced 38 broken manifest references
(34 unique repo:tag pairs, same OCI-index orphan pattern as the 04-19
infra-ci incident). Deleted all via registry HTTP API + ran GC;
reclaimed ~3GB blob storage.

beads-server CronJobs were stuck ImagePullBackOff on
claude-agent-service:0c24c9b6 for >6h — bumped variable default to
2fd7670d (canonical tag in claude-agent-service stack, already healthy
in registry) so new ticks can fire.

Rebuilt in-use broken tags: freedify:{latest,c803de02} and
beadboard:{17a38e43,latest} on registry VM; priority-pass via
Woodpecker pipeline #8. wealthfolio-sync:latest deferred (monthly
CronJob, next run 2026-05-01).

Probe now reports 0/39 failures. RegistryManifestIntegrityFailure
alert cleared.

Closes: code-8hk
Closes: code-jh3c

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 23:16:34 +00:00
Viktor Barzin
42961a5f58 [registry] fix-broken-blobs.sh — check revision-link, not blob data
The original index-child scan checked if the child's blob data file
existed under /blobs/sha256/<child>/data. That's wrong in a subtle
way: registry:2 serves a per-repo manifest via the link file at
<repo>/_manifests/revisions/sha256/<child-digest>/link, NOT by blob
presence. When cleanup-tags.sh rmtrees a tag, the per-repo revision
links for its index's children also disappear — but the blob data
survives (GC owns that, and runs weekly). Result: blob present,
link absent, API 404 on HEAD — the exact 2026-04-19 failure mode.

Live proof: the registry-integrity-probe CronJob just found 38 real
orphan children (including 98f718c8 from the original incident) while
the previous fix-broken-blobs.sh scan reported 0. After the fix, both
tools agree. The probe had been authoritative all along; the scan was
a false-negative because it was asking the wrong question.

Post-mortem updated to reflect the true mechanism (link-file absence,
not blob deletion).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 17:43:35 +00:00
Viktor Barzin
7cb44d7264 [registry] Stop recurring orphan OCI-index incidents — detection + prevention + recovery
Second identical registry incident on 2026-04-19 (first 2026-04-13): the
infra-ci:latest image index resolved to child manifests whose blobs had been
garbage-collected out from under the index. Pipelines P366→P376 all exited
126 "image can't be pulled". Hot fix (a05d63e / 6371e75 / c113be4) restored
green CI but left the underlying bug unaddressed.

Root cause: cleanup-tags.sh rmtrees tag dirs on the registry VM daily at
02:00, registry:2's GC (Sunday 03:25) walks OCI index children imperfectly
(distribution/distribution#3324 class). Nothing verified pushes end-to-end;
nothing probed the registry for fetchability; nothing caught orphan indexes.

Phase 1 — Detection:
 - .woodpecker/build-ci-image.yml: after build-and-push, a verify-integrity
   step walks the just-pushed manifest (index + children + config + every
   layer blob) via HEAD and fails the pipeline on any non-200. Catches
   broken pushes at the source.
 - stacks/monitoring: new registry-integrity-probe CronJob (every 15m) and
   three alerts — RegistryManifestIntegrityFailure,
   RegistryIntegrityProbeStale, RegistryCatalogInaccessible — closing the
   "registry serves 404 for a tag that exists" gap that masked the incident
   for 2+ hours.
 - docs/post-mortems/2026-04-19-registry-orphan-index.md: root cause,
   timeline, monitoring gaps, permanent fix.

Phase 2 — Prevention:
 - modules/docker-registry/docker-compose.yml: pin registry:2 → registry:2.8.3
   across all six registry services. Removes the floating-tag footgun.
 - modules/docker-registry/fix-broken-blobs.sh: new scan walks every
   _manifests/revisions/sha256/<digest> that is an image index and logs a
   loud WARNING when a referenced child blob is missing. Does NOT auto-
   delete — deleting a published image is a conscious decision. Layer-link
   scan preserved.

Phase 3 — Recovery:
 - build-ci-image.yml: accept `manual` event so Woodpecker API/UI rebuilds
   don't need a cosmetic Dockerfile edit (matches convention from
   pve-nfs-exports-sync.yml).
 - docs/runbooks/registry-rebuild-image.md: exact command sequence for
   diagnosing + rebuilding after an orphan-index incident, plus a fallback
   for building directly on the registry VM if Woodpecker itself is down.
 - docs/runbooks/registry-vm.md + .claude/reference/service-catalog.md:
   cross-references to the new runbook.

Out of scope (verified healthy or intentionally deferred):
 - Pull-through DockerHub/GHCR mirrors (74.5% hit rate, no 404s).
 - Registry HA/replication (single-VM SPOF is a known architectural
   choice; Synology offsite covers RPO < 1 day).
 - Diun exclude for registry:2 — not applicable; Diun only watches
   k8s (DIUN_PROVIDERS_KUBERNETES=true), not the VM's docker-compose.

Verified locally:
 - fix-broken-blobs.sh --dry-run on a synthetic registry directory correctly
   flags both orphan layer links and orphan OCI-index children.
 - terraform fmt + validate on stacks/monitoring: success (only unrelated
   deprecation warnings).
 - python3 yaml.safe_load on .woodpecker/build-ci-image.yml and
   modules/docker-registry/docker-compose.yml: both parse clean.

Closes: code-4b8

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 17:08:28 +00:00
Viktor Barzin
a24cf8c689 [docs] post-mortem: clarify the sizeLimit vs container memory limit gotcha
Initial 2Gi sizeLimit didn't take effect because Kyverno's tier-defaults
LimitRange in authentik ns applies a default container memory limit of
256Mi to pods with resources: {}. Writes to a memory-backed emptyDir count
against the container's cgroup memory, so the container was OOM-killed
(exit 137) at ~256 MiB even though the tmpfs sizeLimit said 2Gi. Confirmed
with `dd if=/dev/zero of=/dev/shm/test bs=1M count=500`.

Fix: also set `containers[0].resources.limits.memory: 2560Mi` via the same
kubernetes_json_patches. Verified end-to-end — 1.5 GB file write succeeds,
df -h /dev/shm reports 2.0G.

Updates the post-mortem P1 row to capture this for future readers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 13:23:14 +00:00
Viktor Barzin
cacc282f1a .gitignore: ignore terragrunt_rendered.json debug output
Generated by `terragrunt render-json` for debugging. Not meant to be
tracked — a stale one was sitting untracked in stacks/dbaas/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 13:18:05 +00:00
Viktor Barzin
b41528e564 [docs] Add post-mortem for Authentik outpost /dev/shm incident (2026-04-18)
## Context

On 2026-04-18 all Authentik-protected *.viktorbarzin.me sites returned HTTP
400 for all users. Reported first as a per-user issue affecting Emil since
2026-04-16 ~17:00 UTC, escalated to cluster-wide when Viktor's cached
session stopped being enough. Duration: ~44h for the first-affected user,
~30 min from cluster-wide report to unblocked.

## Root cause

The `ak-outpost-authentik-embedded-outpost` pod's /dev/shm (default 64 MB
tmpfs) filled to 100% with ~44k `session_*` files from gorilla/sessions
FileStore. Every forward-auth request with no valid cookie creates one
session-state file; with `access_token_validity=7d` and measured ~18
files/min, steady-state accumulation (~180k files) vastly exceeds the
default tmpfs. Once full, every new `store.Save()` returned ENOSPC and
the outpost replied HTTP 400 instead of the usual 302 to login.

## What's captured

- Full timeline, impact, affected services
- Root-cause chain diagram (request rate → retention → ENOSPC → 400)
- Why diagnosis took 2 days (misattribution of a Viktor event to Emil,
  red-herring suspicion of the new Rybbit Worker, cached sessions masking
  the outage)
- Contributing factors + detection gaps
- Prevention plan with P0 (done — 512Mi emptyDir via kubernetes_json_patches
  on the outpost config), P1 alerts, P2 Terraform codification, P3 upstream
- Lessons learned (check outpost logs first; cookie-less `curl` disproves
  per-user symptoms fast; UI-managed Authentik config is invisible to git)

## Follow-ups not in this commit

- Prometheus alert for outpost /dev/shm usage > 80%
- Meta-alert for correlated Uptime Kuma external-monitor failures
- Decision on tmpfs sizing vs restart cadence vs probe-frequency reduction
  (see discussion in beads code-zru)

Closes: code-zru

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 13:12:27 +00:00
Viktor Barzin
cf87a747d8 docs: update post-mortem follow-up implementation [PM-2026-04-14] [ci skip]
Mark all 8 safe TODOs as Done. Add Follow-up Implementation table with commit
SHAs. Flag 3 Migration TODOs as needing human review.

Co-Authored-By: postmortem-todo-resolver <noreply@anthropic.com>
2026-04-14 18:09:11 +00:00
Viktor Barzin
b1b408ff0e fix: use full path to claude CLI for non-interactive SSH 2026-04-14 17:44:50 +00:00
Viktor Barzin
7674cf8c5c docs: final E2E pipeline test 2026-04-14 17:43:38 +00:00
Viktor Barzin
91b97709b7 docs: trigger postmortem pipeline with TODO 2026-04-14 17:27:45 +00:00
Viktor Barzin
f336e5ed53 docs: E2E test postmortem pipeline with deep clone 2026-04-14 17:12:46 +00:00
Viktor Barzin
60c04e51b7 2026-04-14 17:10:45 +00:00
Viktor Barzin
933c562aa9 docs: trigger postmortem pipeline E2E test 2026-04-14 16:49:07 +00:00
Viktor Barzin
df95f52d08 docs: test postmortem with TODO for pipeline E2E 2026-04-14 16:45:44 +00:00
Viktor Barzin
b3cc5fcc32 test: trigger postmortem pipeline webhook 2026-04-14 16:44:11 +00:00
Viktor Barzin
777450cb19 docs: test post-mortem for pipeline E2E validation 2026-04-14 15:55:32 +00:00
Viktor Barzin
a703c6e84f docs: update post-mortem follow-up implementation [PM-2026-04-14] [ci skip]
Added Uptime Kuma TCP monitor for PVE NFS (192.168.1.127:2049), ID 328,
Tier 1 (30s/3 retries). Investigation TODO flagged for human review.

Co-Authored-By: postmortem-todo-resolver <noreply@anthropic.com>
2026-04-14 15:48:11 +00:00
Viktor Barzin
e832581caf docs: update Apr 14 post-mortem with Phase 2 findings
Key additions:
- NFSv3 broke after NFS restart (kernel lockd bug on PVE 6.14)
- All 52 PVs migrated to NFSv4, NFSv3 disabled on PVE
- DNS zone sync gap: secondary/tertiary had no custom zones
- Converted one-time setup Job to recurring zone-sync CronJob
- MySQL, Redis, Vault collateral damage and fixes
- 3 new lessons learned (zone replication, NFS client state, operator rollout)

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 12:26:11 +00:00
Viktor Barzin
4e059b138c docs: consolidate all post-mortems under docs/post-mortems/
Move HTML post-mortems from repo root post-mortems/ to docs/post-mortems/.
Update index.html with all 3 incidents (newest first).

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 08:24:36 +00:00
Viktor Barzin
bdba15a387 docs: move post-mortems to docs/post-mortems/
Consolidate all outage reports under docs/ for better discoverability.
Moved from .claude/post-mortems/ (agent-internal) to docs/post-mortems/
(repo documentation).

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 08:20:09 +00:00