## Context

Following the 2026-04-18 /dev/shm ENOSPC P0 and a 5-subagent research pass, this is Phase 1 of the authentik reliability + performance hardening epic (beads code-cwj). Scope: everything that is safe, additive, and does not require a DB restart, an architectural migration, or a risky validation window for the 43-service auth path.

Five research findings drove the deltas:

1. **Server/worker at 2 replicas** conflicts with the documented convention "critical path services scaled to 3" in .claude/CLAUDE.md (Traefik, Authentik, CrowdSec LAPI, PgBouncer, Cloudflared). PDB minAvailable was still 1 — a single-pod outage could take auth down.
2. **PgBouncer had no resource requests/limits** — silently capped at the Kyverno tier-defaults LimitRange (256Mi), no PDB, no probes. Pool failures went undetected until connection timeouts.
3. **Authentik 2026.2 has no Redis** (the cache moved to Postgres in 2025.10). Persistent Django connections + longer flow/policy cache TTLs are the two knobs that move the needle most without DB tuning. Both are safe because PgBouncer runs in session mode.
4. **Gunicorn defaults** (2 workers × 4 threads on server, 1 process × 2 threads on worker) don't use the pod's 1.5 Gi headroom. Each worker preloads Django at ~500 MiB — bumping to 3 workers needs a memory bump to 2 Gi first.
5. **AUTHENTIK_WORKER__CONCURRENCY was renamed AUTHENTIK_WORKER__THREADS** in 2025.8 — the old name is aliased, but the canonical config key changed.

## This change

### values.yaml

- `server.replicas` 2 → 3 (PDB minAvailable 1 → 2)
- `worker.replicas` 2 → 3
- server/worker `limits.memory` 1.5 Gi → 2 Gi (headroom for gunicorn workers)
- `authentik.postgresql.conn_max_age` = 60 (persistent connections; safe with pgbouncer session mode, `conn_max_age` < `server_idle_timeout`=600s)
- `authentik.postgresql.conn_health_checks` = true
- `authentik.cache.timeout_flows` = 1800 (30 min; was 300)
- `authentik.cache.timeout_policies` = 900 (15 min; was 300)
- `authentik.web.workers` = 3, `threads` = 4
- `authentik.worker.threads` = 4 (was 2)

### pgbouncer.tf

- container resources: requests cpu=50m/mem=128Mi, limits mem=512Mi (observed live usage is 1–3m CPU, 2–4 MiB RSS — huge headroom, safely above the Kyverno 256Mi tier-default cap)
- readiness probe: TCP :6432, 10s period
- liveness probe: TCP :6432, 30s period, 30s delay
- `kubernetes_pod_disruption_budget_v1.pgbouncer`: minAvailable=2 (3 replicas; a single drain rolls cleanly, a two-node simultaneous outage is correctly blocked)

A shell spot-check for these knobs is sketched at the end of this write-up.

## What is NOT in this change (deferred as Phase 2 follow-ups)

- Codify the outpost /dev/shm patch in Terraform (currently applied via the Authentik API, not in code). Needs an authentik_outpost resource.
- Migrate the embedded outpost → a dedicated outpost Deployment with 2 replicas + sticky sessions. Only HA path per GH issue #18098; requires flow design because outpost sessions are in-process memory only.
- PG max_connections 100 → 200 + shared_buffers 512MB → 768MB + CNPG pod memory 2Gi → 3Gi. Needs a coordinated DB restart.
- Enable pg_stat_statements on the CNPG cluster for Authentik DB observability (currently shared_preload_libraries is empty).
- PgBouncer pool_mode session → transaction + django_channels layer split. Needs an atomic change + psycopg3 prepared-statement support.
- authentik_tasks_tasklog 7-day retention (198k rows, unbounded).
- Traefik forward-auth plugin caching via xabinapal/traefik-authentik-forward-plugin.
- Grafana dashboard 14837 import + recording rule for authentik_flow_execution_duration (reported broken: values in ns while default buckets are seconds — upstream discussion #7156).

## Test plan

### Automated

```
$ cd stacks/authentik && ../../scripts/tg plan
Plan: 1 to add, 3 to change, 0 to destroy.

$ ../../scripts/tg apply --non-interactive
module.authentik.kubernetes_pod_disruption_budget_v1.pgbouncer: Creation complete after 0s
module.authentik.kubernetes_deployment.pgbouncer: Modifications complete after 45s
module.authentik.helm_release.authentik: Modifications complete after 2m47s
Apply complete! Resources: 1 added, 3 changed, 0 destroyed.
```

### Manual Verification

1. **Pod topology and PDBs**:

   ```
   $ kubectl -n authentik get pods,pdb
   pod/goauthentik-server-5fc69b6cc6-ctvkp   1/1   Running   0   3m14s   k8s-node2
   pod/goauthentik-server-5fc69b6cc6-fkn8x   1/1   Running   0   3m45s   k8s-node3
   pod/goauthentik-server-5fc69b6cc6-jtjjd   1/1   Running   0   5m6s    k8s-node1
   pod/goauthentik-worker-5cfb7dc9bf-b2rlr   1/1   Running   0   3m44s   k8s-node2
   pod/goauthentik-worker-5cfb7dc9bf-fkfm4   1/1   Running   0   5m6s    k8s-node1
   pod/goauthentik-worker-5cfb7dc9bf-hxdg6   1/1   Running   0   3m3s    k8s-node4
   pod/pgbouncer-64746f955f-st567            1/1   Running   0   4m58s   k8s-node4
   pod/pgbouncer-64746f955f-xss9c            1/1   Running   0   5m11s   k8s-node2
   pod/pgbouncer-64746f955f-zvfkw            1/1   Running   0   4m45s   k8s-node3

   poddisruptionbudget/goauthentik-server   2     N/A   1
   poddisruptionbudget/goauthentik-worker   N/A   1     1
   poddisruptionbudget/pgbouncer            2     N/A   1
   ```

   All three workloads are spread across 3+ nodes; each PDB allows 1 disruption.

2. **Authentik server health**:

   ```
   $ curl -sS -o /dev/null -w "%{http_code}\n" \
       https://authentik.viktorbarzin.me/-/health/ready/
   200
   ```

3. **Forward-auth redirect on a protected service**:

   ```
   $ curl -sS -o /dev/null -w "%{http_code}\n" -L \
       https://wealthfolio.viktorbarzin.me/
   200
   ```

4. **Outpost /dev/shm still within sizeLimit** (the patches from the 2026-04-18 post-mortem were not regressed):

   ```
   $ kubectl -n authentik exec deploy/ak-outpost-authentik-embedded-outpost \
       -c proxy -- df -h /dev/shm
   tmpfs   2.0G   58M   2.0G   3%   /dev/shm
   ```

5. **PgBouncer port reachable from other pods**:

   ```
   $ kubectl -n authentik exec deploy/pgbouncer -- nc -zv 127.0.0.1 6432
   127.0.0.1 (127.0.0.1:6432) open
   ```

## Reproduce locally

1. `cd stacks/authentik && ../../scripts/tg plan` — expect 0/0/0 (No changes).
2. `kubectl -n authentik get pdb pgbouncer` — expect MIN AVAILABLE 2.
3. `kubectl -n authentik get deploy goauthentik-server -o jsonpath='{.spec.replicas}'` — expect 3.

Closes: code-cwj
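As a final spot-check that the tuning knobs from "This change" reached the running pods — a sketch only; the env var names follow authentik's `AUTHENTIK_<SECTION>__<KEY>` convention and are assumed here rather than copied from the repo:

```sh
# Server pod: confirm web worker/thread counts, cache TTLs, and persistent-connection
# settings are wired in (AUTHENTIK_WEB__WORKERS etc. are assumed env names)
kubectl -n authentik get deploy goauthentik-server \
  -o jsonpath='{.spec.template.spec.containers[0].env[*].name}' | tr ' ' '\n' \
  | grep -E 'WEB__(WORKERS|THREADS)|CACHE__TIMEOUT|POSTGRESQL__CONN_'

# PgBouncer: confirm the new probe and resource blocks are present
kubectl -n authentik get deploy pgbouncer \
  -o jsonpath='{.spec.template.spec.containers[0].readinessProbe.tcpSocket}'
kubectl -n authentik get deploy pgbouncer \
  -o jsonpath='{.spec.template.spec.containers[0].resources}'
```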
This repo contains my infra-as-code sources. My infrastructure is built with Terraform and Kubernetes, and CI/CD is done with Woodpecker CI.

Read more by visiting my website: https://viktorbarzin.me
## Documentation

Full architecture documentation is available in `docs/` — covering networking, storage, security, monitoring, secrets, CI/CD, databases, and more.
## Adding a New User (Admin)

Adding a new namespace-owner to the cluster requires three steps — no code changes needed.
### 1. Authentik Group Assignment

In the Authentik admin UI, add the user to:

- `kubernetes-namespace-owners` group (grants the OIDC group claim for K8s RBAC)
- `Headscale Users` group (if they need VPN access)
### 2. Vault KV Entry

Add a JSON entry to the `k8s_users` key in `secret/platform` in Vault:

```json
"username": {
  "role": "namespace-owner",
  "email": "user@example.com",
  "namespaces": ["username"],
  "domains": ["myapp"],
  "quota": {
    "cpu_requests": "2",
    "memory_requests": "4Gi",
    "memory_limits": "8Gi",
    "pods": "20"
  }
}
```
- The `username` key must match the user's Forgejo username (for Woodpecker admin access)
- `namespaces` — K8s namespaces to create and grant admin access to
- `domains` — subdomains under `viktorbarzin.me` for Cloudflare DNS records
- `quota` — resource limits per namespace (defaults shown above)
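One way to edit that key without clobbering the rest of the secret — a sketch assuming the KV v2 mount, that `k8s_users` is stored as a JSON string, and `jq` is installed (the temp file paths are illustrative):

```sh
# Pull the current k8s_users map out of the secret
vault kv get -field=k8s_users secret/platform > /tmp/k8s_users.json

# Add the new entry (by hand, or with jq as shown)
jq '. + {"username": {"role": "namespace-owner", "email": "user@example.com",
         "namespaces": ["username"], "domains": ["myapp"]}}' \
  /tmp/k8s_users.json > /tmp/k8s_users.new.json

# Patch only this key, leaving the other secret/platform fields untouched
vault kv patch secret/platform k8s_users=@/tmp/k8s_users.new.json
```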
### 3. Apply Stacks

```sh
vault login -method=oidc

cd stacks/vault && terragrunt apply --non-interactive
# Creates: namespace, Vault policy, identity entity, K8s deployer role

cd ../platform && terragrunt apply --non-interactive
# Creates: RBAC bindings, ResourceQuota, TLS secret, DNS records

cd ../woodpecker && terragrunt apply --non-interactive
# Adds user to Woodpecker admin list
```
### What Gets Auto-Generated

| Resource | Stack |
|---|---|
| Kubernetes namespace | vault |
| Vault policy (`namespace-owner-{user}`) | vault |
| Vault identity entity + OIDC alias | vault |
| K8s deployer Role + Vault K8s role | vault |
| RBAC RoleBinding (namespace admin) | platform |
| RBAC ClusterRoleBinding (cluster read-only) | platform |
| ResourceQuota | platform |
| TLS secret in namespace | platform |
| Cloudflare DNS records | platform |
| Woodpecker admin access | woodpecker |
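To sanity-check the result for a new user — a sketch using a hypothetical username `alice`:

```sh
# Namespace, RBAC, and quota created by the vault/platform stacks
kubectl get namespace alice
kubectl -n alice get rolebinding,resourcequota

# Vault policy generated as namespace-owner-{user}
vault policy read namespace-owner-alice
```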
## New User Onboarding

If you've been added as a namespace-owner, follow these steps to get started.
### 1. Join the VPN

```sh
# Install Tailscale: https://tailscale.com/download
tailscale login --login-server https://headscale.viktorbarzin.me
# Send the registration URL to Viktor, wait for approval
ping 10.0.20.100  # verify connectivity
```
### 2. Install Tools

Run the setup script to install kubectl, kubelogin, Vault CLI, Terraform, and Terragrunt:

```sh
# macOS
bash <(curl -fsSL https://k8s-portal.viktorbarzin.me/setup/script?os=mac)

# Linux
bash <(curl -fsSL https://k8s-portal.viktorbarzin.me/setup/script?os=linux)
```
### 3. Authenticate

```sh
# Log into Vault (opens browser for SSO)
vault login -method=oidc

# Test kubectl (opens browser for OIDC login)
kubectl get pods -n YOUR_NAMESPACE
```
### 4. Deploy Your First App

```sh
# Clone the infra repo
git clone https://github.com/ViktorBarzin/infra.git && cd infra

# Copy the stack template
cp -r stacks/_template stacks/myapp
mv stacks/myapp/main.tf.example stacks/myapp/main.tf
# Edit main.tf — replace all <placeholders>

# Store secrets in Vault
vault kv put secret/YOUR_USERNAME/myapp DB_PASSWORD=secret123

# Submit a PR
git checkout -b feat/myapp
git add stacks/myapp/
git commit -m "add myapp stack"
git push -u origin feat/myapp
```

After review and merge, an admin runs `cd stacks/myapp && terragrunt apply`.
### 5. Set Up CI/CD (Optional)

Create `.woodpecker.yml` in your app's Forgejo repo:

```yaml
steps:
  - name: build
    image: woodpeckerci/plugin-docker-buildx
    settings:
      repo: YOUR_DOCKERHUB_USER/myapp
      tag: ["${CI_PIPELINE_NUMBER}", "latest"]
      username:
        from_secret: dockerhub-username
      password:
        from_secret: dockerhub-token
      platforms: linux/amd64
  - name: deploy
    image: hashicorp/vault:1.18.1
    commands:
      - export VAULT_ADDR=http://vault-active.vault.svc.cluster.local:8200
      - export VAULT_TOKEN=$(vault write -field=token auth/kubernetes/login
          role=ci jwt=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token))
      - KUBE_TOKEN=$(vault write -field=service_account_token
          kubernetes/creds/YOUR_NAMESPACE-deployer
          kubernetes_namespace=YOUR_NAMESPACE)
      - kubectl --server=https://kubernetes.default.svc
          --token=$KUBE_TOKEN
          --certificate-authority=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt
          -n YOUR_NAMESPACE set image deployment/myapp
          myapp=YOUR_DOCKERHUB_USER/myapp:${CI_PIPELINE_NUMBER}
```
## Useful Commands

```sh
# Check your pods
kubectl get pods -n YOUR_NAMESPACE

# View quota usage
kubectl describe resourcequota -n YOUR_NAMESPACE

# Store/read secrets
vault kv put secret/YOUR_USERNAME/myapp KEY=value
vault kv get secret/YOUR_USERNAME/myapp

# Get a short-lived K8s deploy token
vault write kubernetes/creds/YOUR_NAMESPACE-deployer \
  kubernetes_namespace=YOUR_NAMESPACE
```
## Important Rules

- All changes go through Terraform — never `kubectl apply/edit/patch` directly
- Never put secrets in code — use Vault: `vault kv put secret/YOUR_USERNAME/...`
- Always use a PR — never push directly to master
- Docker images: build for `linux/amd64`, use versioned tags (not `:latest`)
## git-crypt setup

To decrypt the secrets, you need to set up git-crypt:

- Install git-crypt.
- Set up GPG keys on the machine.
- Run `git-crypt unlock`.

This will unlock the secrets and lock them again on commit.
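A minimal sketch of that flow, assuming an admin has already granted your key access to the repo with `git-crypt add-gpg-user` (the key file name is illustrative):

```sh
# One-time: import your GPG private key on this machine
gpg --import my-key.asc

# From the repo root: decrypt the secrets in the working tree
git-crypt unlock

# Verify which files are encrypted/decrypted
git-crypt status
```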