# Infrastructure Documentation
This repository contains the configuration and documentation for a homelab Kubernetes cluster running on Proxmox. The infrastructure hosts 70+ services managed declaratively with Terraform and Terragrunt.
## Quick Reference
### Network Ranges

- Physical Network: `192.168.1.0/24` - Physical devices and host network
- Management VLAN 10: `10.0.10.0/24` - Infrastructure VMs and management
- Kubernetes VLAN 20: `10.0.20.0/24` - Kubernetes cluster network
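The three ranges above are intentionally disjoint, which keeps routing and firewall rules unambiguous. A minimal stdlib sketch that checks this and classifies an address (the range values come from this README; the names and the `segment_of` helper are illustrative):

```python
# Sanity-check sketch of the network layout described above, stdlib only.
# CIDR values are from this README; names and helper are illustrative.
import ipaddress

ranges = {
    "physical": ipaddress.ip_network("192.168.1.0/24"),
    "management-vlan10": ipaddress.ip_network("10.0.10.0/24"),
    "kubernetes-vlan20": ipaddress.ip_network("10.0.20.0/24"),
}

# The segments must not overlap for routing/firewall rules to be unambiguous.
nets = list(ranges.values())
for i, a in enumerate(nets):
    for b in nets[i + 1:]:
        assert not a.overlaps(b), f"{a} overlaps {b}"

def segment_of(ip: str):
    """Return which named range an address falls into, if any."""
    addr = ipaddress.ip_address(ip)
    for name, net in ranges.items():
        if addr in net:
            return name
    return None

print(segment_of("10.0.20.15"))  # kubernetes-vlan20
```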
### Key URLs

- Public: `viktorbarzin.me`
- Internal: `viktorbarzin.lan`
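Individual services are typically reachable as subdomains of these two zones. A tiny sketch of that naming convention — the helper name and the URL schemes are assumptions for illustration, not taken from this repository:

```python
# Illustrative sketch of the per-service subdomain convention under the
# two zones listed above. Helper name and schemes are assumptions.
PUBLIC_ZONE = "viktorbarzin.me"
INTERNAL_ZONE = "viktorbarzin.lan"

def service_urls(service: str) -> dict:
    """Derive the public and internal URL for a service subdomain."""
    return {
        "public": f"https://{service}.{PUBLIC_ZONE}",
        "internal": f"https://{service}.{INTERNAL_ZONE}",
    }

print(service_urls("grafana")["public"])  # https://grafana.viktorbarzin.me
```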
## Architecture Documentation
| Document | Description |
|---|---|
| Overview | Infrastructure overview, hardware specs, VM inventory, and service catalog |
| Networking | Network topology, VLANs, routing, and firewall rules |
| VPN | Headscale mesh VPN and Cloudflare Tunnel configuration |
| Storage | TrueNAS NFS, democratic-csi, and persistent volume management |
| Authentication | Authentik SSO, OIDC flows, and service integration |
| Security | CrowdSec IPS, Kyverno policies, and security controls |
| Monitoring | Prometheus, Grafana, Loki, and observability stack |
| Secrets Management | HashiCorp Vault integration and secret rotation |
| CI/CD | Woodpecker CI pipeline and deployment automation |
| Backup & DR | Backup strategy, disaster recovery, and restore procedures |
| Compute | Proxmox VMs, GPU passthrough, K8s resource management, and VPA |
| Databases | PostgreSQL, MySQL, Redis, and database operators |
| Multi-tenancy | Namespace isolation, tier system, and resource quotas |
## Operations
- Runbooks - Step-by-step operational procedures
- Plans - Infrastructure change plans and rollout strategies
## Getting Started
- Review the Overview for a high-level understanding
- Read the Networking doc to understand connectivity
- Check Compute for resource management patterns
- Explore individual architecture docs based on your area of interest