Commit graph

17 commits

Author SHA1 Message Date
Viktor Barzin
a24cf8c689 [docs] post-mortem: clarify the sizeLimit vs container memory limit gotcha
Initial 2Gi sizeLimit didn't take effect because Kyverno's tier-defaults
LimitRange in authentik ns applies a default container memory limit of
256Mi to pods with resources: {}. Writes to a memory-backed emptyDir count
against the container's cgroup memory, so the container was OOM-killed
(exit 137) at ~256 MiB even though the tmpfs sizeLimit said 2Gi. Confirmed
with `dd if=/dev/zero of=/dev/shm/test bs=1M count=500`.

Fix: also set `containers[0].resources.limits.memory: 2560Mi` via the same
kubernetes_json_patches. Verified end-to-end — 1.5 GB file write succeeds,
df -h /dev/shm reports 2.0G.

Updates the post-mortem P1 row to capture this for future readers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 13:23:14 +00:00
Viktor Barzin
cacc282f1a .gitignore: ignore terragrunt_rendered.json debug output
Generated by `terragrunt render-json` for debugging. Not meant to be
tracked — a stale one was sitting untracked in stacks/dbaas/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 13:18:05 +00:00
Viktor Barzin
b41528e564 [docs] Add post-mortem for Authentik outpost /dev/shm incident (2026-04-18)
## Context

On 2026-04-18 all Authentik-protected *.viktorbarzin.me sites returned HTTP
400 for all users. Reported first as a per-user issue affecting Emil since
2026-04-16 ~17:00 UTC, escalated to cluster-wide when Viktor's cached
session stopped being enough. Duration: ~44h for the first-affected user,
~30 min from cluster-wide report to unblocked.

## Root cause

The `ak-outpost-authentik-embedded-outpost` pod's /dev/shm (default 64 MB
tmpfs) filled to 100% with ~44k `session_*` files from gorilla/sessions
FileStore. Every forward-auth request with no valid cookie creates one
session-state file; with `access_token_validity=7d` and measured ~18
files/min, steady-state accumulation (~180k files) vastly exceeds the
default tmpfs. Once full, every new `store.Save()` returned ENOSPC and
the outpost replied HTTP 400 instead of the usual 302 to login.

## What's captured

- Full timeline, impact, affected services
- Root-cause chain diagram (request rate → retention → ENOSPC → 400)
- Why diagnosis took 2 days (misattribution of a Viktor event to Emil,
  red-herring suspicion of the new Rybbit Worker, cached sessions masking
  the outage)
- Contributing factors + detection gaps
- Prevention plan with P0 (done — 512Mi emptyDir via kubernetes_json_patches
  on the outpost config), P1 alerts, P2 Terraform codification, P3 upstream
- Lessons learned (check outpost logs first; cookie-less `curl` disproves
  per-user symptoms fast; UI-managed Authentik config is invisible to git)

## Follow-ups not in this commit

- Prometheus alert for outpost /dev/shm usage > 80%
- Meta-alert for correlated Uptime Kuma external-monitor failures
- Decision on tmpfs sizing vs restart cadence vs probe-frequency reduction
  (see discussion in beads code-zru)

Closes: code-zru

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 13:12:27 +00:00
Viktor Barzin
cf87a747d8 docs: update post-mortem follow-up implementation [PM-2026-04-14] [ci skip]
Mark all 8 safe TODOs as Done. Add Follow-up Implementation table with commit
SHAs. Flag 3 Migration TODOs as needing human review.

Co-Authored-By: postmortem-todo-resolver <noreply@anthropic.com>
2026-04-14 18:09:11 +00:00
Viktor Barzin
b1b408ff0e fix: use full path to claude CLI for non-interactive SSH 2026-04-14 17:44:50 +00:00
Viktor Barzin
7674cf8c5c docs: final E2E pipeline test 2026-04-14 17:43:38 +00:00
Viktor Barzin
91b97709b7 docs: trigger postmortem pipeline with TODO 2026-04-14 17:27:45 +00:00
Viktor Barzin
f336e5ed53 docs: E2E test postmortem pipeline with deep clone 2026-04-14 17:12:46 +00:00
Viktor Barzin
60c04e51b7 2026-04-14 17:10:45 +00:00
Viktor Barzin
933c562aa9 docs: trigger postmortem pipeline E2E test 2026-04-14 16:49:07 +00:00
Viktor Barzin
df95f52d08 docs: test postmortem with TODO for pipeline E2E 2026-04-14 16:45:44 +00:00
Viktor Barzin
b3cc5fcc32 test: trigger postmortem pipeline webhook 2026-04-14 16:44:11 +00:00
Viktor Barzin
777450cb19 docs: test post-mortem for pipeline E2E validation 2026-04-14 15:55:32 +00:00
Viktor Barzin
a703c6e84f docs: update post-mortem follow-up implementation [PM-2026-04-14] [ci skip]
Added Uptime Kuma TCP monitor for PVE NFS (192.168.1.127:2049), ID 328,
Tier 1 (30s/3 retries). Investigation TODO flagged for human review.

Co-Authored-By: postmortem-todo-resolver <noreply@anthropic.com>
2026-04-14 15:48:11 +00:00
Viktor Barzin
e832581caf docs: update Apr 14 post-mortem with Phase 2 findings
Key additions:
- NFSv3 broke after NFS restart (kernel lockd bug on PVE 6.14)
- All 52 PVs migrated to NFSv4, NFSv3 disabled on PVE
- DNS zone sync gap: secondary/tertiary had no custom zones
- Converted one-time setup Job to recurring zone-sync CronJob
- MySQL, Redis, Vault collateral damage and fixes
- 3 new lessons learned (zone replication, NFS client state, operator rollout)

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 12:26:11 +00:00
Viktor Barzin
4e059b138c docs: consolidate all post-mortems under docs/post-mortems/
Move HTML post-mortems from repo root post-mortems/ to docs/post-mortems/.
Update index.html with all 3 incidents (newest first).

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 08:24:36 +00:00
Viktor Barzin
bdba15a387 docs: move post-mortems to docs/post-mortems/
Consolidate all outage reports under docs/ for better discoverability.
Moved from .claude/post-mortems/ (agent-internal) to docs/post-mortems/
(repo documentation).

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 08:20:09 +00:00