infra/docs/post-mortems
Viktor Barzin 368560fd92 t3: pin t3@0.0.24 + stop nightly auto-update (auth-outage fix) [ci skip]
The t3-autoupdate timer (re-enabled by the provisioner's step 5b with
`--now`, which fires the missed daily job immediately on a Persistent
timer) pulled t3@nightly 0.0.25 mid-day. That build ran forward schema
migrations on every ~/.t3 state.sqlite (auth_pairing_links/auth_sessions
role->scopes, +proof_key_thumbprint) AND changed the bootstrap API,
breaking t3-mint/pairing for ALL devvm users (pair prompt, no session).

- t3-autoupdate.sh: now a pinned-version ENFORCER (T3_PIN=0.0.24), not a
  nightly tracker -- re-asserts the pin (a no-op when correct).
- t3-provision-users.sh step 5b: drop `--now` (it triggered the
  immediate missed-job run that pulled the bad build).
- setup-devvm.sh: install pinned t3@0.0.24 at machine setup.
- unit Descriptions + service-catalog reflect the pin.
- post-mortem: 2026-06-09-t3-nightly-autoupdate-auth-outage.md.

Host already reconciled out-of-band: rolled back to 0.0.24, re-enabled
the (now-pinned) enforcer, reset the 2 new users' disposable DBs,
surgically reverted wizard's auth tables to level-30 (96 threads + live
session preserved). All users verified 302 + t3_session.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 16:08:44 +00:00
..
2026-03-16-kured-containerd-cascade-outage.html docs: consolidate all post-mortems under docs/post-mortems/ 2026-04-14 08:24:36 +00:00
2026-03-16-nfs-csi-cascade-failure.md docs: move post-mortems to docs/post-mortems/ 2026-04-14 08:20:09 +00:00
2026-04-14-nfs-fsid0-dns-vault-outage.md docs: update post-mortem follow-up implementation [PM-2026-04-14] [ci skip] 2026-04-14 18:09:11 +00:00
2026-04-14-postmortem-pipeline-test.md fix: use full path to claude CLI for non-interactive SSH 2026-04-14 17:44:50 +00:00
2026-04-18-authentik-outpost-shm-full.md docs/authentik: document postgres session backend + close out 2026-04-18 post-mortem items 2026-05-10 16:28:11 +00:00
2026-04-19-registry-orphan-index.md [registry] bulk-clean 34 orphan manifests + beads-server image bump 2026-04-19 23:16:34 +00:00
2026-04-22-vault-raft-leader-deadlock.md vault: complete Phase 2 NFS-hostile migration; remove nfs-proxmox SC 2026-04-25 17:10:00 +00:00
2026-05-09-io-pressure-stale-nfs.md mysql: bump to 4Gi limit / 3Gi request; grow /srv/nfs LV to 3 TiB 2026-05-09 17:01:57 +00:00
2026-05-16-kured-stalled-and-anubis-ha.md docs/pm: kured silently stalled 6 days + Anubis HA lift (2026-05-16) 2026-05-16 12:17:26 +00:00
2026-05-17-gpu-driver-ubuntu2604-mismatch.md nvidia: fix driver install deadlock + extend startup probe 2026-05-25 11:53:44 +00:00
2026-05-17-nfs-csi-keel-upgrade-master-port-conflict.md nfs-csi: pin chart v4.13.1 + controller affinity (post-mortem) 2026-05-17 09:11:09 +00:00
2026-05-25-immich-anca-elements-io-storm.md docs(immich): cap server-side job concurrency to protect sdc + log recurrence 2026-06-01 15:15:26 +00:00
2026-05-30-redis-split-brain.md redis: revert 3-node Sentinel HA to single standalone instance [ci skip] 2026-05-30 17:49:43 +00:00
2026-05-31-kured-sentinel-gate-oom.md kured: fix sentinel-gate OOM — 256Mi limit + self-restart leak guard 2026-05-31 14:49:04 +00:00
2026-06-01-cloudflared-stale-traefik-origin.md docs: correct cloudflared-502 post-mortem + fix stale .200 Traefik ref [ci skip] 2026-06-01 21:25:33 +00:00
2026-06-01-keel-match-tag-image-swap.md kyverno: strip orphaned keel.sh/match-tag fleet-wide (image-swap fix) 2026-06-01 19:50:41 +00:00
2026-06-02-immich-ml-ttl-gpu-oom-recruiter.md immich: set MACHINE_LEARNING_MODEL_TTL 0->600 to stop GPU VRAM hog 2026-06-02 20:16:11 +00:00
2026-06-09-t3-nightly-autoupdate-auth-outage.md t3: pin t3@0.0.24 + stop nightly auto-update (auth-outage fix) [ci skip] 2026-06-09 16:08:44 +00:00
index.html docs: consolidate all post-mortems under docs/post-mortems/ 2026-04-14 08:24:36 +00:00