infra

Author	SHA1	Message	Date
Viktor Barzin	fd0f4a0365	fix: restore tree dropped by `6d224861`; land stem95su gdrive-sync (10m) [ci skip] `6d224861` came from a --no-checkout worktree whose empty index made the commit drop every file except two. This restores 05b50d2b's full tree and correctly adds stacks/stem95su/gdrive-sync.tf + the service-catalog stem95su entry. Forward-only (parent=6d224861, no force-push); [ci skip] since the live infra was never applied from the broken commit. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 08:45:33 +00:00
Viktor Barzin	6d224861c4	stem95su: scheduled Drive->site sync CronJob (every 10m) CronJob stem95su-gdrive-sync (*/10) mounts the content PVC RW and rclone-syncs the read-only Drive folder "claude" (stem claude/files) onto it (rclone/rclone:1.74.3, scope=drive.readonly, empty-source guard + --max-delete 25). ESO ExternalSecret stem95su-rclone <- Vault secret/stem95su. Requires the GCP OAuth app published to Production or the refresh token expires ~weekly. Lands the gdrive-sync stack on master (it had landed on a feature branch by accident on the shared devvm checkout). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 08:42:26 +00:00
Viktor Barzin	117b99e28f	docs/authentik: document postgres session backend + close out 2026-04-18 post-mortem items Update `.claude/reference/authentik-state.md`: - Add `ProxyProvider.access_token_validity = "weeks=4"` to the Session Duration table with the gotcha that the gorilla session store binds the value once at outpost startup (rollout restart needed). - Replace the "session storage moved to Postgres in 2025.10" note that falsely implied the migration was automatic — explain that the `Outpost.managed` field gates the postgres path and our outpost silently stayed on `FilesystemStore` until 2026-05-10. - Document the goauthentik 2026.2.2 service-selector bug (service.py:52) and the JSON-patch workaround. - Document that the standalone embedded-outpost deployment needs `AUTHENTIK_POSTGRESQL__*` env vars injected via JSON patch, plus the `app.kubernetes.io/component=server` pod label. - Note the "Terraform doesn't expose `Outpost.managed`" assumption that holds the `managed=embedded` value in place across applies. Close out post-mortem `2026-04-18-authentik-outpost-shm-full.md`: - P2 codify-in-Terraform: DONE. - P3 access_token_validity reduce: DONE-alt (we did the opposite — bumped to 4 weeks — because postgres backend mooted the storage concern). - P3 move-off-embedded-outpost: DONE-alt (postgres backend addresses the loss-of-state class on the embedded outpost itself).	2026-05-10 16:28:11 +00:00
Viktor Barzin	a24cf8c689	[docs] post-mortem: clarify the sizeLimit vs container memory limit gotcha Initial 2Gi sizeLimit didn't take effect because Kyverno's tier-defaults LimitRange in authentik ns applies a default container memory limit of 256Mi to pods with resources: {}. Writes to a memory-backed emptyDir count against the container's cgroup memory, so the container was OOM-killed (exit 137) at ~256 MiB even though the tmpfs sizeLimit said 2Gi. Confirmed with `dd if=/dev/zero of=/dev/shm/test bs=1M count=500`. Fix: also set `containers[0].resources.limits.memory: 2560Mi` via the same kubernetes_json_patches. Verified end-to-end — 1.5 GB file write succeeds, df -h /dev/shm reports 2.0G. Updates the post-mortem P1 row to capture this for future readers. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 13:23:14 +00:00
Viktor Barzin	cacc282f1a	.gitignore: ignore terragrunt_rendered.json debug output Generated by `terragrunt render-json` for debugging. Not meant to be tracked — a stale one was sitting untracked in stacks/dbaas/. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 13:18:05 +00:00
Viktor Barzin	b41528e564	[docs] Add post-mortem for Authentik outpost /dev/shm incident (2026-04-18) ## Context On 2026-04-18 all Authentik-protected .viktorbarzin.me sites returned HTTP 400 for all users. Reported first as a per-user issue affecting Emil since 2026-04-16 ~17:00 UTC, escalated to cluster-wide when Viktor's cached session stopped being enough. Duration: ~44h for the first-affected user, ~30 min from cluster-wide report to unblocked. ## Root cause The `ak-outpost-authentik-embedded-outpost` pod's /dev/shm (default 64 MB tmpfs) filled to 100% with ~44k `session_` files from gorilla/sessions FileStore. Every forward-auth request with no valid cookie creates one session-state file; with `access_token_validity=7d` and measured ~18 files/min, steady-state accumulation (~180k files) vastly exceeds the default tmpfs. Once full, every new `store.Save()` returned ENOSPC and the outpost replied HTTP 400 instead of the usual 302 to login. ## What's captured - Full timeline, impact, affected services - Root-cause chain diagram (request rate → retention → ENOSPC → 400) - Why diagnosis took 2 days (misattribution of a Viktor event to Emil, red-herring suspicion of the new Rybbit Worker, cached sessions masking the outage) - Contributing factors + detection gaps - Prevention plan with P0 (done — 512Mi emptyDir via kubernetes_json_patches on the outpost config), P1 alerts, P2 Terraform codification, P3 upstream - Lessons learned (check outpost logs first; cookie-less `curl` disproves per-user symptoms fast; UI-managed Authentik config is invisible to git) ## Follow-ups not in this commit - Prometheus alert for outpost /dev/shm usage > 80% - Meta-alert for correlated Uptime Kuma external-monitor failures - Decision on tmpfs sizing vs restart cadence vs probe-frequency reduction (see discussion in beads code-zru) Closes: code-zru Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 13:12:27 +00:00
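
The steady-state figure quoted in the b41528e564 post-mortem message can be re-derived from the numbers it gives (`access_token_validity=7d`, about 18 files/min). This is a minimal sketch, not code from the repository; the function name is hypothetical, and it assumes every session file survives for the full validity window and that the creation rate stays constant.

```python
MINUTES_PER_DAY = 24 * 60


def steady_state_files(files_per_minute: float, retention_days: float) -> int:
    """Files alive at once if each file lives for the whole retention window.

    Assumption: constant creation rate and no early cleanup.
    """
    return round(files_per_minute * retention_days * MINUTES_PER_DAY)


# Figures taken from the commit message.
expected = steady_state_files(18, 7)
observed_when_full = 44_000  # session_ files present when /dev/shm hit 100%

print(expected)                                  # 181440, i.e. the "~180k files"
print(round(expected / observed_when_full, 1))   # 4.1
```

Under these assumptions the tmpfs filled at roughly a quarter of the steady-state file count, which is consistent with the message's claim that accumulation "vastly exceeds" the default 64 MB.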

6 commits
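
The gotcha described in a24cf8c689 (a memory-backed emptyDir's writes count against the container's cgroup memory) can be summarised as "the smaller limit wins". The sketch below is illustrative only: the function name is hypothetical, and it ignores the memory the process itself uses, so the real ceiling is somewhat lower than the value returned.

```python
def usable_tmpfs_mib(size_limit_mib: int, container_memory_limit_mib: int) -> int:
    """Upper bound on data writable to a memory-backed emptyDir.

    Writes are charged to the container's memory cgroup, so the container
    is OOM-killed once its memory limit is reached, even when the tmpfs
    sizeLimit is larger. Simplification: process memory is not subtracted.
    """
    return min(size_limit_mib, container_memory_limit_mib)


# Before the fix: 2Gi sizeLimit, 256Mi default limit from the LimitRange.
print(usable_tmpfs_mib(2048, 256))    # 256

# After the fix: 2Gi sizeLimit, 2560Mi container memory limit.
print(usable_tmpfs_mib(2048, 2560))   # 2048
```

This matches the behaviour reported in the commit: the OOM kill at about 256 MiB before the fix, and `df -h /dev/shm` reporting 2.0G with a 1.5 GB write succeeding afterwards.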