fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip]

6d224861 came from a --no-checkout worktree whose empty index made the
commit drop every file except two. This restores 05b50d2b's full tree and
correctly adds stacks/stem95su/gdrive-sync.tf + the service-catalog stem95su
entry. Forward-only (parent=6d224861, no force-push); [ci skip] since the
live infra was never applied from the broken commit.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Viktor Barzin 2026-06-09 08:45:33 +00:00
parent 6d224861c4
commit fd0f4a0365
1166 changed files with 358546 additions and 0 deletions


@ -0,0 +1,203 @@
# Authentik Current State
> Snapshot of applications, groups, users, and flows. Use the `authentik` skill for management tasks.
## Applications (11)
| Application | Provider Type | Auth Flow |
|-------------|--------------|-----------|
| Cloudflare Access | OAuth2/OIDC | explicit consent |
| Domain wide catch all | Proxy (forward auth) | implicit consent |
| Forgejo | OAuth2/OIDC | explicit consent |
| Grafana | OAuth2/OIDC | implicit consent |
| Headscale | OAuth2/OIDC | explicit consent |
| Immich | OAuth2/OIDC | explicit consent |
| Kubernetes | OAuth2/OIDC (public) | implicit consent |
| Kubernetes Dashboard | OAuth2/OIDC (confidential) | implicit consent |
| linkwarden | OAuth2/OIDC | explicit consent |
| wrongmove | OAuth2/OIDC | implicit consent |
> **Kubernetes Dashboard** (TF-managed in `stacks/k8s-dashboard/authentik.tf`):
> confidential client `k8s-dashboard`, built for seamless dashboard SSO via
> oauth2-proxy. **Currently IDLE** — the apiserver rejects all OIDC tokens (see
> `docs/plans/2026-06-04-k8s-dashboard-sso-design.md` §12), so the dashboard runs
> on forward-auth + token-paste instead and oauth2-proxy is unwired. Kept for a
> future SSO retry once apiserver OIDC is fixed.
>
> **admin-services-restriction** policy (TF-managed in
> `stacks/authentik/admin-services-restriction.tf`, adopted 2026-06-04): gates the
> 15 admin-only hostnames to `Home Server Admins`, with a carve-out admitting the
> `kubernetes-*` RBAC groups to `k8s.viktorbarzin.me` (dashboard login page).
## Groups (9)
| Group | Parent | Superuser | Purpose |
|-------|--------|-----------|---------|
| Allow Login Users | -- | No | Parent group for login-permitted users |
| authentik Admins | -- | Yes | Full admin access |
| Headscale Users | Allow Login Users | No | VPN access |
| Home Server Admins | Allow Login Users | No | Server admin access |
| Wrongmove Users | Allow Login Users | No | Real-estate app access |
| kubernetes-admins | -- | No | K8s cluster-admin RBAC |
| kubernetes-power-users | -- | No | K8s power-user RBAC |
| kubernetes-namespace-owners | -- | No | K8s namespace-owner RBAC |
| Task Submitters | -- | No | Task submission access |
## Users (8 real)
| Username | Name | Type | Groups |
|----------|------|------|--------|
| akadmin | authentik Default Admin | internal | authentik Admins, Home Server Admins, Headscale Users |
| vbarzin@gmail.com | Viktor Barzin | internal | authentik Admins, Home Server Admins, Wrongmove Users, Headscale Users |
| emil.barzin@gmail.com | Emil Barzin | internal | Home Server Admins, Headscale Users |
| ancaelena98@gmail.com | Anca Milea | external | Wrongmove Users, Headscale Users |
| vabbit81@gmail.com | GHEORGHE Milea | external | Headscale Users, kubernetes-namespace-owners, sops-vabbit81 |
| valentinakolevabarzina@gmail.com | Valentina | internal | Headscale Users |
| anca.r.cristian10@gmail.com | -- | internal | Wrongmove Users |
| kadir.tugan@gmail.com | Kadir | internal | Wrongmove Users |
## Login Sources
- **Google** (OAuth) -- user matching by identifier
- **GitHub** (OAuth) -- user matching by email_link
- **Facebook** (OAuth) -- user matching by email_link
- All sources use `invitation-enrollment` as enrollment flow (new users require invitation)
## Authorization Flows
- **Explicit consent** (`default-provider-authorization-explicit-consent`): Shows consent screen
- **Implicit consent** (`default-provider-authorization-implicit-consent`): Auto-redirects
## Invitation Enrollment Flow
Slug: `invitation-enrollment` | PK: `7d667321-2b02-4e16-8161-148078a8dac1`
New users can only sign up via invitation link. Admins generate single-use invite links.
### Stages (in order)
| Order | Stage | Type | Purpose |
|-------|-------|------|---------|
| 10 | invitation-validation | Invitation | Validates `?itoken=` parameter, blocks without valid token |
| 20 | enrollment-identification | Identification | Shows social login (Google/GitHub/Facebook) + passkey |
| 30 | enrollment-prompt | Prompt | Collects name and email (pre-filled from social login) |
| 40 | enrollment-user-write | User Write | Creates user in `Allow Login Users` group |
| 50 | enrollment-login | User Login | Auto-login after signup (policy: `invitation-group-assignment` adds user to target group from invitation `fixed_data.group`) |
### Invitation Management
Script: `.claude/scripts/authentik-invite.sh`
```bash
# Create invitation (single-use, no expiry)
./authentik-invite.sh create "Headscale Users"
# Create invitation with expiry
./authentik-invite.sh create "Wrongmove Users" --days 7
# Add user to group after enrollment
./authentik-invite.sh assign <username> "Headscale Users"
# List pending invitations
./authentik-invite.sh list
```
Invited users sign up via social login (Google/GitHub/Facebook) or passkey. No username/password enrollment.
The target group (e.g. "Headscale Users") is auto-assigned on enrollment via the `invitation-group-assignment` expression policy. The `assign` command is available for manual post-enrollment group changes.
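As a rough illustration of what stage 10 validates, an invitation link carries the token in the `itoken` query parameter. The sketch below assembles such a link. `invite_url` is a hypothetical helper (not part of `authentik-invite.sh`), and the `/if/flow/<slug>/` path is an assumption about the authentik URL layout; confirm it against the link the script actually prints.

```bash
# Hypothetical helper: build an invitation link for the invitation-enrollment flow.
# Assumption: flows are served under /if/flow/<slug>/ on the authentik host.
invite_url() {
  base="${1%/}"   # drop a trailing slash, if any
  token="$2"      # invitation token (the value checked via ?itoken=)
  printf '%s/if/flow/invitation-enrollment/?itoken=%s\n' "$base" "$token"
}

# Placeholder token, for illustration only.
invite_url "https://authentik.viktorbarzin.me" "00000000-0000-0000-0000-000000000000"
```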
## Cleanup Log (2026-03-13)
### Deleted Flows
- `enrollment-inviation` (typo) -- previous invitation attempt
- `headscale-authentication` -- not used by any provider
- `headscale-authorization` -- not used by any provider
- `default-enrollment-flow` -- password-based, unused
- `oauth-enrollment` -- replaced by invitation-enrollment
### Deleted Stages
- `enrollment-invitation`, `enrollment-invitation-write` (from old invitation flow)
- `invitation` (unbound)
- `default-enrollment-prompt-first`, `default-enrollment-prompt-second` (from default enrollment)
- `default-enrollment-user-write`, `default-enrollment-email-verification`, `default-enrollment-user-login`
### Deleted Groups
- `authentik Read-only` -- 0 users, unused role
### Deleted Policies
- `map github username to email` -- unbound
- `Map Google Attributes` -- unbound
### Deleted Roles
- `authentik Read-only` -- no group assignment
## Policy Fix (2026-04-06)
### Unbound brute-force-protection Policy
The `brute-force-protection` ReputationPolicy (PK: `ac98cb11-31d3-46ab-8883-bf51e6b09a60`, `check_username=True`, `check_ip=True`, `threshold=-5`) was bound to 3 authentication flows, causing "Flow does not apply to current user" for all unauthenticated users (no username to evaluate → failure_result=false → flow denied).
Removed bindings from:
- `default-authentication-flow` (PK: `34618cf3`) — username/password login
- `webauthn` (PK: `0b60c2a5`) — passkey login
- `default-source-authentication` (PK: via policybindingmodel `1a779f24`) — Google/GitHub/Facebook OAuth
Policy still exists with 0 bindings. If brute-force protection is needed, bind to the **password stage** (not the flow level).
## Session Duration (2026-05-01)
Pinned via Terraform in `stacks/authentik/`:
| Knob | Value | Surface | Effect |
|------|-------|---------|--------|
| `UserLoginStage.session_duration` on `default-authentication-login` | `weeks=4` | `authentik_stage_user_login.default_login` in `authentik_provider.tf` | Authenticated users stay logged in 4 weeks across browser restarts. No sliding refresh — resets on each login. |
| `ProxyProvider.access_token_validity` on `Provider for Domain wide catch all` | `weeks=4` | `authentik_provider_proxy.catchall.access_token_validity` in `authentik_provider.tf` | Cookie `Max-Age` on `authentik_proxy_*` and `expires` on rows in `authentik_providers_proxy_proxysession`. Bumped 2026-05-10 from `hours=168`. **Bumping requires `kubectl rollout restart deploy/ak-outpost-authentik-embedded-outpost`** — the gorilla session store binds the value once at outpost startup; the 5-min provider refresh logs `"reusing existing session store"` and skips rebuild. |
| `AUTHENTIK_SESSIONS__UNAUTHENTICATED_AGE` (server + worker) | `hours=2` | `server.env` + `worker.env` in `modules/authentik/values.yaml` | Anonymous Django sessions (bots, healthcheckers, partial flows) are reaped within 2h instead of the 1d default. |
Notes:
- There is **no** `Brand.session_duration`; `UserLoginStage` is the only correct lever for authenticated session lifetime.
- Embedded outpost session storage: PostgreSQL table `authentik_providers_proxy_proxysession` in authentik 2025.10+ (PR #16628), but **only when `IsEmbedded()` returns true** (i.e. `Outpost.managed == "goauthentik.io/outposts/embedded"`). Our outpost record had `managed=null` until 2026-05-10, which silently kept it on the gorilla `FilesystemStore` at `/dev/shm` (TMPDIR) and re-exposed the 2026-04-18 mismatched-session-ID class on every pod restart. Fix landed 2026-05-10: see `authentik_outpost.embedded` in `authentik_provider.tf` and post-mortem `2026-04-18-authentik-outpost-shm-full.md`.
- The proxy outpost service has a known goauthentik 2026.2.2 bug (`internal/outpost/controllers/k8s/service.py:52`): for embedded outposts the controller sets the Service selector to `app.kubernetes.io/name=authentik` (the server pods), not `authentik-outpost-proxy`. We work around it via a `kubernetes_json_patches.service` patch on the outpost record (replaces `/spec/selector` with the outpost's own labels). Without this, endpoints are empty and Traefik forward-auth fails over to the Basic Auth realm `Emergency Access`.
- The standalone embedded-outpost deployment needs `AUTHENTIK_POSTGRESQL__{HOST,PORT,USER,PASSWORD,NAME}` env vars to reach the dbaas cluster — codified via `kubernetes_json_patches.deployment` envFrom the shared `goauthentik` Secret. The `app.kubernetes.io/component=server` pod label is also injected via JSON patch (matches the `component:server` half of the Service selector that the controller adds for embedded outposts).
- `ProxyProvider.remember_me_offset` stays UI-managed via `ignore_changes`.
- The Authentik provider's resource schema does **not** expose the `Outpost.managed` field. We rely on TF's "write only fields it knows about" semantic: the server-set `goauthentik.io/outposts/embedded` value is preserved across applies because Terraform never writes `managed`. Verify this assumption still holds before upgrading the provider or changing the outpost resource.
- The `unauthenticated_age` env var is injected via `server.env` / `worker.env` (not `authentik.sessions.unauthenticated_age`) because we set `authentik.existingSecret.secretName: goauthentik`, which makes the chart skip rendering its own `AUTHENTIK_*` Secret. The `authentik.*` value block is therefore inert in this stack — anything new under `authentik.*` must use the `*.env` arrays instead. The same applies to the existing `authentik.cache.*`, `authentik.web.*`, `authentik.worker.*` blocks (currently inert; live values come from the orphaned, helm-keep-policy `goauthentik` Secret created by chart 2025.10.3 before `existingSecret` was introduced).
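The values in the table use a `unit=value` duration notation. A minimal sketch for converting them to seconds, which can help when comparing against a cookie `Max-Age` or the `expires` column. `duration_seconds` is a hypothetical helper; it handles only a single `unit=value` pair and only the units listed in the `case`.

```bash
# Hypothetical helper: convert a single "unit=value" duration to seconds.
duration_seconds() {
  unit="${1%%=*}"   # text before the first "="
  n="${1#*=}"       # text after the first "="
  case "$unit" in
    weeks)   echo $(( n * 7 * 24 * 3600 )) ;;
    days)    echo $(( n * 24 * 3600 )) ;;
    hours)   echo $(( n * 3600 )) ;;
    minutes) echo $(( n * 60 )) ;;
    *)       echo "unsupported unit: $unit" >&2; return 1 ;;
  esac
}

duration_seconds weeks=4     # 2419200 (28 days)
duration_seconds hours=168   # 604800  (the previous 7-day value)
duration_seconds hours=2     # 7200
```

The 28-day figure is the same one step 5 of the upgrade checklist expects to see in `MAX(expires)`.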
## Upgrade Validation Checklist
Run after **any** of these:
- Authentik chart version bump in `stacks/authentik/modules/authentik/main.tf` (the `version = "..."` line on `helm_release.authentik`).
- `goauthentik/authentik` Terraform provider version bump.
- Outpost pod recreation (kured reboot, eviction, manual `rollout restart`, scheduler move).
The fragile surfaces are the `kubernetes_json_patches` and the `Outpost.managed` field — both rely on assumptions that can silently break across upgrades. The checklist exercises the same path the alerts watch, so it doubles as a smoke test for the alerts.
```bash
# 1. Service routes to the outpost pod (NOT the server pods).
# Empty endpoints => auth-proxy fallback fires; expected: ONE pod IP, ports 9000/9300/9443.
kubectl -n authentik get endpoints ak-outpost-authentik-embedded-outpost
# 2. Service selector still excludes the server pods. Expected: includes
# `app.kubernetes.io/name: authentik-outpost-proxy`. If it flips to
# `name: authentik`, the goauthentik upstream bug came back or our
# JSON patch was unset.
kubectl -n authentik get svc ak-outpost-authentik-embedded-outpost -o jsonpath='{.spec.selector}'
# 3. Outpost mode + session backend. Expected log lines on startup:
# {"embedded":true,"event":"Outpost mode",...}
# {"event":"using PostgreSQL session backend",...}
# If embedded=false or `using filesystem session backend`, the postgres
# fix is broken — likely `Outpost.managed` got cleared, or the upstream
# schema started exposing `managed` and TF reset it.
kubectl -n authentik logs deploy/ak-outpost-authentik-embedded-outpost | grep -E '"Outpost mode"|"session backend"' | head -3
# 4. /dev/shm is essentially empty (postgres backend = no filesystem use).
# A file count > a few dozen indicates filesystem fallback is firing.
kubectl -n authentik exec deploy/ak-outpost-authentik-embedded-outpost -- sh -c 'df -h /dev/shm; ls /dev/shm | wc -l'
# 5. Postgres session table is growing with traffic. Expected: rows with
# `expires` ~28 days out (matches access_token_validity = weeks=4).
kubectl -n authentik exec deploy/goauthentik-server -- ak shell -c "
from django.db import connection; c = connection.cursor()
c.execute('SELECT COUNT(*), MAX(expires) FROM authentik_providers_proxy_proxysession')
print(c.fetchone())"
# 6. Edge auth flow: should be 302 → authentik. NOT 401 with WWW-Authenticate.
curl -sS -o /dev/null -D - 'https://terminal.viktorbarzin.me/' -H 'User-Agent: Mozilla/5.0' \
| grep -iE '^HTTP|^location|x-auth-fallback|www-authenticate'
# 7. Terraform plan-to-zero on the whole authentik stack.
( cd stacks/authentik && /home/wizard/code/infra/scripts/tg plan ) | grep -E 'No changes|Plan:'
```
Steps 1, 3, 6 cover the failure modes the Prometheus alerts trigger on (`AuthentikForwardAuthFallbackActive`, `AuthentikOutpostForwardAuth400Spike`). Steps 4 and 5 cover the silent-regression case (filesystem fallback) where the alerts don't fire but the system loses its postgres-backed session persistence on the next pod restart.
If step 2 shows the controller restored `app.kubernetes.io/name=authentik`, watch the goauthentik/authentik issue tracker for fixes around `internal/outpost/controllers/k8s/service.py:52`. An upstream patch might let us drop our `kubernetes_json_patches.service` workaround.
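Step 6 is the easiest one to misread by eye, so here is a minimal sketch that classifies the header dump instead. `classify_edge_auth` is a hypothetical helper, not something in the repo; it reads `curl -D -` output on stdin and only distinguishes the two outcomes the checklist names.

```bash
# Hypothetical helper: classify a `curl -D -` header dump for checklist step 6.
classify_edge_auth() {
  headers=$(tr -d '\r')   # read stdin, drop CRs from HTTP line endings
  # Take the status code from the last "HTTP/..." status line.
  status=$(printf '%s\n' "$headers" | awk 'toupper($1) ~ /^HTTP/ { code = $2 } END { print code }')
  if [ "$status" = "302" ] && printf '%s\n' "$headers" | grep -qi '^location:'; then
    echo "ok: redirect to login"
  elif [ "$status" = "401" ] && printf '%s\n' "$headers" | grep -qi '^www-authenticate:'; then
    echo "fallback: basic auth realm active"
  else
    echo "unknown: status ${status:-none}"
  fi
}

# Usage against the live edge (same request as step 6):
#   curl -sS -o /dev/null -D - 'https://terminal.viktorbarzin.me/' \
#     -H 'User-Agent: Mozilla/5.0' | classify_edge_auth
```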


@ -0,0 +1,31 @@
# GitHub API Reference
> Token locations and common API patterns.
## GitHub API
- **Username**: `ViktorBarzin`
- **Token**: `grep github_pat terraform.tfvars | cut -d'"' -f2` (git-crypt encrypted)
- **Scopes**: Full access (repo, admin:public_key, admin:repo_hook, delete_repo, admin:org, workflow, write:packages)
- **`gh` CLI**: Blocked by sandbox — use `curl` instead
```bash
GITHUB_TOKEN=$(grep github_pat terraform.tfvars | cut -d'"' -f2)
# List repos
curl -s -H "Authorization: token $GITHUB_TOKEN" "https://api.github.com/users/ViktorBarzin/repos?per_page=100"
# Create repo
curl -s -X POST -H "Authorization: token $GITHUB_TOKEN" "https://api.github.com/user/repos" \
-d '{"name":"repo-name","private":true}'
# Add deploy key
curl -s -X POST -H "Authorization: token $GITHUB_TOKEN" "https://api.github.com/repos/ViktorBarzin/<repo>/keys" \
-d '{"title":"key-name","key":"ssh-ed25519 ...","read_only":false}'
# Create webhook
curl -s -X POST -H "Authorization: token $GITHUB_TOKEN" "https://api.github.com/repos/ViktorBarzin/<repo>/hooks" \
-d '{"config":{"url":"https://ci.viktorbarzin.me/hook","content_type":"json","secret":"..."},"events":["push","pull_request"]}'
```
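The plain `grep github_pat` extraction returns every line containing that substring, so it yields more than one value if another variable name also contains `github_pat`. A slightly stricter sketch, assuming the tfvars file uses `name = "value"` lines; `read_tfvar` is a hypothetical helper, and the demo writes a throwaway file rather than touching the real `terraform.tfvars`.

```bash
# Hypothetical helper: read one quoted value from a tfvars-style file.
# Assumes lines of the form:  name = "value"
read_tfvar() {
  # Anchor on the exact variable name, keep the first match, take the quoted value.
  grep -E "^[[:space:]]*$1[[:space:]]*=" "$2" | head -n 1 | cut -d'"' -f2
}

# Demonstration against a throwaway file.
demo=$(mktemp)
printf '%s\n' 'github_pat_note = "unrelated"' 'github_pat = "example-token"' > "$demo"
read_tfvar github_pat "$demo"   # prints: example-token
rm -f "$demo"
```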
## Capabilities
- **GitHub**: Create/delete repos, push code, manage SSH/deploy keys, manage webhooks, manage org settings, manage packages


@ -0,0 +1,12 @@
# Known Issues (suppress in all agents)
## Permanent
- ha-london Uptime Kuma monitor down — external HA on Raspberry Pi, not in this cluster
- PVFillingUp for navidrome-music — Synology NAS volume, threshold is 95%, expected
## Intermittent
- CrowdSec Helm release stuck in pending-upgrade — known issue, workaround: helm rollback
- Resource usage >80% on nodes — WARN only, overcommit is by design (2x LimitRange ratio)
## How agents consume this file
Each agent definition includes: "Before reporting issues, read `.claude/reference/known-issues.md` and suppress any matches."


@ -0,0 +1,115 @@
# Detailed Infrastructure Patterns
Reference file for patterns, procedures, and tables. Read on demand when the specific topic comes up.
## NFS Volume Pattern
Use the `nfs_volume` shared module for all NFS volumes (creates static PVs, CSI-backed, `soft,timeo=30,retrans=3`):
```hcl
module "nfs_data" {
source = "../../modules/kubernetes/nfs_volume" # ../../../../ for platform modules, ../../../ for sub-stacks
name = "<service>-data" # Must be globally unique (PV is cluster-scoped)
namespace = kubernetes_namespace.<service>.metadata[0].name
nfs_server = var.nfs_server # 192.168.1.127 (Proxmox host)
nfs_path = "/srv/nfs/<service>" # HDD NFS, or "/srv/nfs-ssd/<service>" for SSD
}
# In pod spec: persistent_volume_claim { claim_name = module.nfs_data.claim_name }
```
**Note**: Some legacy PVs still reference `/mnt/main/<service>` paths (from the TrueNAS era). These work via compatibility on the Proxmox host. New PVs should use `/srv/nfs/` or `/srv/nfs-ssd/`.
**DO NOT use inline `nfs {}` blocks** — they mount with `hard,timeo=600` defaults which hang forever.
## Adding NFS Exports
1. Create dir on Proxmox host: `ssh root@192.168.1.127 "mkdir -p /srv/nfs/<service> && chmod 777 /srv/nfs/<service>"`
2. Edit `/etc/exports` on the Proxmox host — add the export entry
3. Reload exports: `ssh root@192.168.1.127 "exportfs -ra"`
4. Verify: `showmount -e 192.168.1.127`
## Static Site Hosting
Two patterns for serving a folder of static files (HTML/CSS/JS/media):
1. **Image-baked** (default for git-native content): bake files into an `nginx:*-alpine` image at build time, deploy like any owned app (CI builds + pushes, Keel/Woodpecker rolls out). Reference: `stacks/blog` (Hugo → nginx, `Website/Dockerfile`). Use when content lives in git and changes via commits.
2. **NFS-backed** (for externally-authored / large / non-git content): a stock `nginx:1.28-alpine` Deployment mounts an `nfs_volume` PVC **read-only** at `/usr/share/nginx/html`; a tiny ConfigMap supplies `/etc/nginx/conf.d/default.conf` (just `root` + `index <entry>.html`). Files are dropped on `/srv/nfs/<site>` out-of-band (Nextcloud "PVE NFS Pool" or rsync) — no rebuild, auto-backed-up by `nfs-mirror`. Reference: `stacks/stem95su` (established 2026-06-07). Use when content is authored outside git (e.g. exported tools), is large (avoids git/image bloat), or a non-dev updates it. **The export subdir on the PVE host must exist before the pod mounts** — the `nfs_volume` module does NOT create it (see "Adding NFS Exports"; a subdir under the already-exported `/srv/nfs` needs no new `/etc/exports` line).
Both front with `ingress_factory` (`auth="none"` for open public content → CrowdSec + ai-bot-block still apply; or chain `anubis_instance` for a PoW gate, as `blog` does).
## ~~iSCSI Storage~~ (REMOVED — replaced by proxmox-lvm)
> iSCSI via democratic-csi and TrueNAS has been fully removed (2026-04). All database storage now uses `StorageClass: proxmox-lvm` (Proxmox CSI, LVM-thin hotplug). TrueNAS has been decommissioned.
## Anti-AI Scraping (4 Active Layers) (Updated 2026-05-10)
Default `anti_ai_scraping = true` in ingress_factory. Disable per-service: `anti_ai_scraping = false`.
1. **Anubis PoW challenge** (per-site reverse proxy) — `modules/kubernetes/anubis_instance/`. Latest: `ghcr.io/techarohq/anubis:v1.25.0`. Difficulty 2 (~250 ms desktop / ~700 ms mobile), 30-day JWT cookie scoped to `viktorbarzin.me` so a single solve covers every Anubis-fronted subdomain. Active on: `viktorbarzin.me`, `kms.viktorbarzin.me`, `travel.viktorbarzin.me`. Add to a stack: `module "anubis" { source = "../../modules/kubernetes/anubis_instance"; name = "X"; namespace = ...; target_url = "http://<svc>.<ns>.svc.cluster.local" }`, then point ingress_factory at `module.anubis.service_name` + `port = module.anubis.service_port` and set `anti_ai_scraping = false`. Shared ed25519 signing key in Vault `secret/viktor` -> `anubis_ed25519_key`. **Avoid putting Anubis in front of CLI/API/Git endpoints (Forgejo, APIs, WebDAV)** — clients without JS can't solve PoW.
2. **Bot blocking forwardAuth** (ForwardAuth → bot-block-proxy → poison-fountain) — global default for non-Anubis sites. `bot-block-proxy` (OpenResty in `traefik` ns) is fail-open with 100 ms connect / 200 ms read timeouts so a downed poison-fountain costs ≤200 ms per request. Source: `stacks/traefik/modules/traefik/main.tf`.
3. **X-Robots-Tag noai** — set by `traefik-anti-ai-headers` middleware. Anubis additionally serves a comprehensive `/robots.txt` (`SERVE_ROBOTS_TXT=true`) to well-behaved bots.
4. **Tarpit/poison content** (standalone at poison.viktorbarzin.me, `stacks/poison-fountain/`). Currently scaled to `replicas = 0` — fail-open path means no live traffic, no penalty.
Trap links (formerly a layer) removed April 2026 — rewrite-body plugin broken on Traefik v3.6.12 (Yaegi bugs). `strip-accept-encoding` and `anti-ai-trap-links` middlewares deleted.
Rybbit analytics injection now via Cloudflare Worker (`stacks/rybbit/worker/`, HTMLRewriter, wildcard route `*.viktorbarzin.me/*`, 28 site ID mappings).
Key files: `modules/kubernetes/anubis_instance/`, `stacks/poison-fountain/`, `stacks/rybbit/worker/`, `stacks/traefik/modules/traefik/main.tf`
## Terragrunt Architecture
- Root `terragrunt.hcl`: DRY providers, backend, variable loading, `generate "tiers"` block
- Each stack: `stacks/<service>/main.tf`, state at `state/stacks/<service>/terraform.tfstate`
- Platform modules: `stacks/platform/modules/<service>/`, shared: `modules/kubernetes/`
- Syntax: `--non-interactive`, `terragrunt run --all -- <command>` (not `run-all`)
- Tiers auto-generated into `tiers.tf` — never add `locals { tiers = {} }` manually
## Factory Pattern (Multi-User Services)
Structure: `stacks/<service>/main.tf` + `factory/main.tf`. Examples: `actualbudget`, `freedify`.
To add a user: export NFS share, add Cloudflare route in tfvars, add module block calling factory.
## Node Rebuild Procedure
1. Drain: `kubectl drain k8s-nodeX --ignore-daemonsets --delete-emptydir-data`
2. Delete: `kubectl delete node k8s-nodeX`
3. Destroy VM (remove from `stacks/infra/main.tf`)
4. Get fresh join command: `ssh wizard@10.0.20.100 'sudo kubeadm token create --print-join-command'` (tokens expire after 24h)
5. Update `k8s_join_command` in `terraform.tfvars`, add VM to `stacks/infra/main.tf`, apply
6. GPU node (k8s-node1): apply platform stack to re-apply GPU label/taint
## Kyverno Resource Governance
### LimitRange Defaults (injected when no explicit `resources {}`)
| Tier | Default Mem | Max Mem | Default CPU | Max CPU |
|------|------------|---------|-------------|---------|
| 0-core | 512Mi | 8Gi | 500m | 4 |
| 1-cluster | 512Mi | 4Gi | 500m | 2 |
| 2-gpu | 2Gi | 16Gi | 1 | 8 |
| 3-edge / 4-aux | 256Mi | 4Gi | 250m | 2 |
| No tier | 256Mi | 2Gi | 250m | 1 |
### ResourceQuota (opt-out: `resource-governance/custom-quota=true`)
| Tier | lim CPU | lim Mem | Pods |
|------|---------|---------|------|
| 0-core | 32 | 64Gi | 100 |
| 1-cluster | 16 | 32Gi | 30 |
| 2-gpu | 48 | 96Gi | 40 |
| 3-edge / 4-aux | 8-16 | 16-32Gi | 20-30 |
Custom quotas: authentik, monitoring (opted out), nvidia (opted out), nextcloud, onlyoffice.
LimitRange opt-out: `resource-governance/custom-limitrange=true` + custom `kubernetes_limit_range` in stack.
### Other Policies
- `inject-priority-class-from-tier` (CREATE only), `inject-ndots` (ndots:2), `sync-tier-label`
- `goldilocks-vpa-auto-mode`: VPA `off` globally — Terraform owns resources, Goldilocks observe-only
- Security policies ALL Audit mode: `deny-privileged-containers`, `deny-host-namespaces`, `restrict-sys-admin`, `require-trusted-registries`
### Debugging Container Failures
1. **OOMKilled?** → `kubectl describe limitrange tier-defaults -n <ns>`. edge/aux default = 256Mi.
2. **Won't schedule?** → `kubectl describe resourcequota tier-quota -n <ns>`.
3. **Evicted?** → aux-tier pods (priority 200K, Never preempt) evicted first.
4. **Unexpected limits?** → LimitRange injects defaults. Always set explicit resources.
5. **Need more?** → Set explicit `resources {}` or add quota/limitrange opt-out labels.
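For the OOMKilled case it can help to see which default a container received when it declares no `resources {}`. A minimal lookup sketch transcribing the memory columns of the LimitRange table above; `tier_default_mem` and `tier_max_mem` are hypothetical helpers, and the live object (`kubectl describe limitrange tier-defaults -n <ns>`) remains authoritative if the two disagree.

```bash
# Hypothetical helpers: LimitRange memory values per tier, copied from the table above.
tier_default_mem() {
  case "$1" in
    0-core|1-cluster) echo 512Mi ;;
    2-gpu)            echo 2Gi ;;
    3-edge|4-aux)     echo 256Mi ;;
    *)                echo 256Mi ;;   # "No tier" row
  esac
}

tier_max_mem() {
  case "$1" in
    0-core)                 echo 8Gi ;;
    1-cluster|3-edge|4-aux) echo 4Gi ;;
    2-gpu)                  echo 16Gi ;;
    *)                      echo 2Gi ;;   # "No tier" row
  esac
}

echo "4-aux: default $(tier_default_mem 4-aux), max $(tier_max_mem 4-aux)"
```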
## Authentik (Identity Provider)
- **URL**: `https://authentik.viktorbarzin.me` | **API**: `/api/v3/` | **Token**: `authentik_api_token` in tfvars
- 3 server + 3 worker + 3 PgBouncer + embedded outpost
- Forward auth: `protected = true` in ingress_factory
- OIDC for K8s: issuer `.../application/o/kubernetes/`, client `kubernetes` (public)
- See archived skills for management tasks and OIDC gotchas
## Archived Troubleshooting Runbooks
28 skills in `.claude/skills/archived/` — load when the specific issue arises.
Topics: authentik, bluestacks, clickhouse-nfs, coturn, crowdsec, fastapi-svelte-gpu,
grafana-datasource, helm-stuck, ingress-migration, image-caching, gpu-devices, hpa-storm,
nfs-mount, kubelet-manifest, llm-gpu, loki-helm, librespot, nextcloud-calendar, nfsv4-idmapd,
openclaw-deploy, pfsense-dnsmasq, pfsense-nat, proxmox-disk, python-sanitize, terraform-state,
traefik-helm, traefik-rewrite-body.


@ -0,0 +1,130 @@
# Proxmox Inventory & Infrastructure
> Static reference for VMs, hardware, and network topology.
## Proxmox Host Hardware
- **Model**: Dell R730
- **CPU**: Intel Xeon E5-2699 v4 @ 2.20GHz (22 cores / 44 threads, single socket, CPU2 unpopulated)
- **RAM**: 272 GB DDR4-2400 ECC RDIMM (10 DIMMs, see Memory Layout below)
- **GPU**: NVIDIA Tesla T4 (PCIe passthrough to k8s-node1)
- **iDRAC**: 192.168.1.4 (root/calvin)
- **Disks**: 1.1TB RAID1 SAS (backup) + 931GB Samsung SSD + 10.7TB RAID1 HDD
- **NFS server**: Proxmox host serves NFS directly. HDD NFS: `/srv/nfs` on ext4 LV `pve/nfs-data` (2TB). SSD NFS: `/srv/nfs-ssd` on ext4 LV `ssd/nfs-ssd-data` (100GB). Exports use `async` mode (safe with UPS + databases on block storage). TrueNAS (10.0.10.15) decommissioned.
- **Proxmox access**: `ssh root@192.168.1.127`
## Memory Layout (updated 2026-04-01)
### Physical DIMM Slot Map
```
╔══════════════════════════════════════════════════════════════════════════════╗
║ CPU1 DIMM SLOTS ║
║ ║
║ ┌─── WHITE (1st per channel) ───┐ ║
║ │ │ ║
║ │ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ ║
║ │ │ A1 │ │ A2 │ │ A3 │ │ A4 │ ║
║ │ │ 32G │ │ 32G │ │ 32G │ │ 32G │ Samsung M393A4K40BB1-CRC (2R) ║
║ │ │██████│ │██████│ │██████│ │██████│ ║
║ │ └──────┘ └──────┘ └──────┘ └──────┘ ║
║ │ Ch 0 Ch 1 Ch 2 Ch 3 ║
║ └────────────────────────────────┘ ║
║ ║
║ ┌─── BLACK (2nd per channel) ───┐ ║
║ │ │ ║
║ │ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ ║
║ │ │ A5 │ │ A6 │ │ A7 │ │ A8 │ ║
║ │ │ 32G │ │ 32G │ │ 32G │ │ 32G │ Samsung M393A4K40CB1-CRC (2R) ║
║ │ │▓▓▓▓▓▓│ │▓▓▓▓▓▓│ │▓▓▓▓▓▓│ │▓▓▓▓▓▓│ ║
║ │ └──────┘ └──────┘ └──────┘ └──────┘ ║
║ │ Ch 0 Ch 1 Ch 2 Ch 3 ║
║ └────────────────────────────────┘ ║
║ ║
║ ┌─── GREEN (3rd per channel) ───┐ ║
║ │ │ ║
║ │ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ ║
║ │ │ A9 │ │ A10 │ │ A11 │ │ A12 │ ║
║ │ │ │ │ │ │ 8G │ │ 8G │ SK Hynix HMA81GR7AFR8N-UH (1R) ║
║ │ │ empty│ │ empty│ │░░░░░░│ │░░░░░░│ ║
║ │ └──────┘ └──────┘ └──────┘ └──────┘ ║
║ │ Ch 0 Ch 1 Ch 2 Ch 3 ║
║ └────────────────────────────────┘ ║
║ ║
║ B1-B12: All empty (requires CPU2) ║
║ ║
║ Legend: ██ = Samsung BB1 32G ▓▓ = Samsung CB1 32G ░░ = Hynix 8G ║
╚══════════════════════════════════════════════════════════════════════════════╝
```
### Channel Summary
```
Channel 0: A1 [32G] ──── A5 [32G] ──── A9 [ ] = 64 GB ✓ matched
Channel 1: A2 [32G] ──── A6 [32G] ──── A10[ ] = 64 GB ✓ matched
Channel 2: A3 [32G] ──── A7 [32G] ──── A11[ 8G ] = 72 GB ~ +8G bonus
Channel 3: A4 [32G] ──── A8 [32G] ──── A12[ 8G ] = 72 GB ~ +8G bonus
───────── ───────── ──────────
WHITE BLACK GREEN TOTAL: 272 GB
```
### DIMM Details
- **A1-A4**: Samsung M393A4K40BB1-CRC 32GB DDR4-2400 ECC RDIMM (2-rank, original)
- **A5-A8**: Samsung M393A4K40CB1-CRC 32GB DDR4-2400 ECC RDIMM (2-rank, added 2026-04-01)
- **A11-A12**: SK Hynix HMA81GR7AFR8N-UH 8GB DDR4-2400 ECC RDIMM (1-rank, relocated from A5/A6)
- **A9-A10, B1-B12**: Empty (B-side requires CPU2)
- **Speed**: 2400 MHz (BIOS override — 3 DPC defaults to 1866 MHz, forced to 2400 via System BIOS > Memory Settings > Memory Frequency)
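A quick arithmetic check of the layout above, using only the slot populations listed under DIMM Details (variable names are illustrative):

```bash
samsung_gb=$(( 8 * 32 ))   # A1-A8: eight 32 GB Samsung RDIMMs
hynix_gb=$(( 2 * 8 ))      # A11-A12: two 8 GB SK Hynix RDIMMs
total_gb=$(( samsung_gb + hynix_gb ))

ch01_gb=$(( 32 + 32 ))       # channels 0 and 1: two DIMMs each
ch23_gb=$(( 32 + 32 + 8 ))   # channels 2 and 3: three DIMMs each

echo "total: ${total_gb} GB"   # total: 272 GB
echo "channels 0-1: ${ch01_gb} GB each, channels 2-3: ${ch23_gb} GB each"
```

The per-channel figures (64, 64, 72, 72) sum to the same 272 GB, matching the Channel Summary.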
## Network Topology
```
10.0.10.0/24 - Management: Wizard (10.0.10.10)
10.0.20.0/24 - Kubernetes: pfSense GW (10.0.20.1), Registry (10.0.20.10),
k8s-master (10.0.20.100), DNS (10.0.20.101), MetalLB (10.0.20.102-200)
192.168.1.0/24 - Physical: Proxmox (192.168.1.127)
```
## Network Bridges
- **vmbr0**: Physical bridge on `eno1`, IP `192.168.1.127/24` — physical/home network
- **vmbr1**: Internal-only bridge, VLAN-aware — VLAN 10 (management) and VLAN 20 (kubernetes)
## VM Inventory
| VMID | Name | Status | CPUs | RAM | Network | Disk | Notes |
|------|------|--------|------|-----|---------|------|-------|
| 101 | pfsense | running | 8 | 4GB | vmbr0, vmbr1:vlan10, vmbr1:vlan20 | 32G | Gateway/firewall |
| 102 | devvm | running | 16 | 24GB | vmbr1:vlan10 | 100G | Development VM + t3code Workstation host. 8G swapfile (swappiness=10). Capacity budget: ~4-5G RAM/active user, max ~3-4 concurrent active Claude sessions. NOT Terraform-managed. |
| 103 | home-assistant | running | 8 | 8GB | vmbr0 | 64G | HA Sofia, net0(vlan10) disabled, SSH: vbarzin@192.168.1.8 |
| 105 | pbs | stopped | 16 | 8GB | vmbr1:vlan10 | 32G | Proxmox Backup (unused) |
| 200 | k8s-master | running | 8 | 16GB | vmbr1:vlan20 | 64G | Control plane (10.0.20.100) |
| 201 | k8s-node1 | running | 16 | 32GB | vmbr1:vlan20 | 256G | GPU node, Tesla T4 |
| 202 | k8s-node2 | running | 8 | 24GB | vmbr1:vlan20 | 256G | Worker |
| 203 | k8s-node3 | running | 8 | 24GB | vmbr1:vlan20 | 256G | Worker |
| 204 | k8s-node4 | running | 8 | 24GB | vmbr1:vlan20 | 256G | Worker |
| 220 | docker-registry | running | 4 | 4GB | vmbr1:vlan20 | 64G | MAC DE:AD:BE:EF:22:22 (10.0.20.10) |
| 300 | Windows10 | running | 16 | 8GB | vmbr0 | 100G | Windows VM |
| ~~9000~~ | ~~truenas~~ | **stopped/decommissioned** | — | — | — | — | NFS migrated to Proxmox host (192.168.1.127) at `/srv/nfs` and `/srv/nfs-ssd` |
**Total VM RAM allocated**: 196 GB of 272 GB (72%) — 76 GB free for future VMs (devvm corrected 8GB→24GB 2026-06-08)
## VM Templates
| VMID | Name | Purpose |
|------|------|---------|
| 1000 | ubuntu-2404-cloudinit-non-k8s-template | Base for non-K8s VMs |
| 1001 | docker-registry-template | Docker registry VM |
| 2000 | ubuntu-2404-cloudinit-k8s-template | Base for K8s nodes |
## PVE Host Systemd Services (Custom)
| Unit | Type | Schedule | Purpose |
|------|------|----------|---------|
| `lvm-pvc-snapshot.timer` | Timer | Daily 03:00 | LVM thin snapshots of all PVCs (7-day retention) |
| `daily-backup.timer` | Timer | Daily 05:00 | PVC file backup, auto SQLite backup, pfSense, PVE config |
| `offsite-sync-backup.timer` | Timer | Daily 06:00 | Two-step rsync to Synology (sda + NFS via inotify) |
| `nfs-change-tracker.service` | Service | Continuous | inotifywait on `/srv/nfs` + `/srv/nfs-ssd`, logs to `/mnt/backup/.nfs-changes.log` |
## GPU Node (currently k8s-node1)
- **VMID**: 201, **PCIe**: `0000:06:00.0` (NVIDIA Tesla T4) — physical passthrough, no Terraform pin
- **Taint**: `nvidia.com/gpu=true:PreferNoSchedule` (applied dynamically to every NFD-discovered GPU node)
- **Label**: `nvidia.com/gpu.present=true` (auto-applied by gpu-feature-discovery; also `feature.node.kubernetes.io/pci-10de.present=true` from NFD)
- GPU workloads need: `node_selector = { "nvidia.com/gpu.present" : "true" }` + nvidia toleration
- Taint applied via `null_resource.gpu_node_config` in `stacks/nvidia/modules/nvidia/main.tf`; node discovery keyed on the NFD `pci-10de.present` label so the taint follows the card to whichever host is carrying it
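
The scheduling requirements listed above can be sketched as a pod-spec fragment. This is a minimal illustration in plain Python, not code from this repo: the field names follow the Kubernetes manifest spelling (`nodeSelector`, `tolerations`) rather than the Terraform one, the exact toleration is an assumption derived from the documented taint, and `gpu_pod_spec_fragment` and `tolerates` are hypothetical helper names.

```python
import json

# Taint documented above for NFD-discovered GPU nodes.
GPU_TAINT = {"key": "nvidia.com/gpu", "value": "true", "effect": "PreferNoSchedule"}


def gpu_pod_spec_fragment() -> dict:
    """Pod-spec fields a GPU workload would add (assumed shape, not from the repo)."""
    return {
        "nodeSelector": {"nvidia.com/gpu.present": "true"},
        "tolerations": [
            {
                "key": "nvidia.com/gpu",
                "operator": "Equal",
                "value": "true",
                "effect": "PreferNoSchedule",
            }
        ],
    }


def tolerates(toleration: dict, taint: dict) -> bool:
    """Return True if the toleration matches the taint under Kubernetes matching rules."""
    # An empty effect on the toleration matches every taint effect.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    operator = toleration.get("operator", "Equal")
    if operator == "Exists":
        # An empty key combined with Exists matches every taint key.
        return not toleration.get("key") or toleration["key"] == taint["key"]
    # Equal: both key and value must match.
    return (
        toleration.get("key") == taint["key"]
        and toleration.get("value", "") == taint.get("value", "")
    )


if __name__ == "__main__":
    fragment = gpu_pod_spec_fragment()
    print(json.dumps(fragment, indent=2))
    # A taint is tolerated when at least one toleration matches it.
    print(any(tolerates(t, GPU_TAINT) for t in fragment["tolerations"]))
```

Because the taint effect is `PreferNoSchedule`, a pod without the toleration can still land on the GPU node when nothing else fits; the node selector is what actually pins GPU workloads to it.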

@@ -0,0 +1,164 @@
{
"github_repo_overrides": {
"ghcr.io/immich-app/immich-server": "immich-app/immich",
"ghcr.io/immich-app/immich-machine-learning": "immich-app/immich",
"docker.io/vaultwarden/server": "dani-garcia/vaultwarden",
"vaultwarden/server": "dani-garcia/vaultwarden",
"docker.io/mailserver/docker-mailserver": "docker-mailserver/docker-mailserver",
"mailserver/docker-mailserver": "docker-mailserver/docker-mailserver",
"docker.n8n.io/n8nio/n8n": "n8n-io/n8n",
"headscale/headscale": "juanfont/headscale",
"technitium/dns-server": "TechnitiumSoftware/DnsServer",
"ghcr.io/paperless-ngx/paperless-ngx": "paperless-ngx/paperless-ngx",
"ghcr.io/blakeblackshear/frigate": "blakeblackshear/frigate",
"ghcr.io/dgtlmoon/changedetection.io": "dgtlmoon/changedetection.io",
"ghcr.io/linkwarden/linkwarden": "linkwarden/linkwarden",
"ghcr.io/open-webui/open-webui": "open-webui/open-webui",
"ghcr.io/advplyr/audiobookshelf": "advplyr/audiobookshelf",
"ghcr.io/browserless/chromium": "browserless/chromium",
"ghcr.io/rybbit-io/rybbit-backend": "rybbit-io/rybbit",
"ghcr.io/rybbit-io/rybbit-client": "rybbit-io/rybbit",
"ghcr.io/gurucomputing/headscale-ui": "gurucomputing/headscale-ui",
"ghcr.io/dmunozv04/isponsorblocktv": "dmunozv04/iSponsorBlockTV",
"ghcr.io/gramps-project/grampsweb": "gramps-project/gramps-web",
"ghcr.io/project-osrm/osrm-backend": "Project-OSRM/osrm-backend",
"ghcr.io/flaresolverr/flaresolverr": "FlareSolverr/FlareSolverr",
"ghcr.io/therobbiedavis/listenarr": "therobbiedavis/listenarr",
"ghcr.io/immichframe/immichframe": "immichframe/ImmichFrame",
"lscr.io/linuxserver/qbittorrent": "linuxserver/docker-qbittorrent",
"lscr.io/linuxserver/lidarr": "linuxserver/docker-lidarr",
"lscr.io/linuxserver/prowlarr": "linuxserver/docker-prowlarr",
"lscr.io/linuxserver/readarr": "linuxserver/docker-readarr",
"lscr.io/linuxserver/speedtest-tracker": "linuxserver/docker-speedtest-tracker",
"privatebin/nginx-fpm-alpine": "PrivateBin/PrivateBin",
"freshrss/freshrss": "FreshRSS/FreshRSS",
"hackmdio/hackmd": "hackmdio/codimd",
"onlyoffice/documentserver": "ONLYOFFICE/DocumentServer",
"netboxcommunity/netbox": "netbox-community/netbox",
"stirlingtools/stirling-pdf": "Stirling-Tools/Stirling-PDF",
"phpipam/phpipam-www": "phpipam/phpipam",
"rhasspy/wyoming-whisper": "rhasspy/wyoming-addons",
"rhasspy/wyoming-piper": "rhasspy/wyoming-addons",
"clickhouse/clickhouse-server": "ClickHouse/ClickHouse",
"docker.io/athomasson2/ebook2audiobook": "athomasson2/ebook2audiobook",
"amruthpillai/reactive-resume": "AmruthPillai/Reactive-Resume",
"dpage/pgadmin4": "pgadmin-org/pgadmin4",
"ghcr.io/yourok/torrserver": "YouROK/TorrServer",
"opentripplanner/opentripplanner": "opentripplanner/OpenTripPlanner",
"codeberg.org/forgejo/forgejo": "forgejo/forgejo",
"shlinkio/shlink": "shlinkio/shlink",
"shlinkio/shlink-web-client": "shlinkio/shlink-web-client",
"dgtlmoon/sockpuppetbrowser": "dgtlmoon/sockpuppetbrowser"
},
"helm_chart_repo_overrides": {
"https://charts.goauthentik.io/": "goauthentik/authentik",
"https://traefik.github.io/charts": "traefik/traefik-helm-chart",
"https://kyverno.github.io/kyverno/": "kyverno/kyverno",
"https://mysql.github.io/mysql-operator/": "mysql/mysql-operator",
"https://cloudnative-pg.github.io/charts": "cloudnative-pg/cloudnative-pg",
"https://charts.external-secrets.io": "external-secrets/external-secrets",
"https://metallb.github.io/metallb": "metallb/metallb",
"https://nextcloud.github.io/helm/": "nextcloud/helm",
"https://crowdsecurity.github.io/helm-charts": "crowdsecurity/helm-charts",
"https://helm.releases.hashicorp.com": "hashicorp/vault-helm",
"https://bitnami-labs.github.io/sealed-secrets": "bitnami-labs/sealed-secrets",
"https://grafana.github.io/helm-charts": "grafana/helm-charts",
"https://prometheus-community.github.io/helm-charts": "prometheus-community/helm-charts",
"https://democratic-csi.github.io/charts/": "democratic-csi/democratic-csi",
"https://stakater.github.io/stakater-charts": "stakater/Reloader",
"https://topolvm.github.io/pvc-autoresizer": "topolvm/pvc-autoresizer",
"https://kubernetes-sigs.github.io/descheduler/": "kubernetes-sigs/descheduler",
"https://kubernetes-sigs.github.io/metrics-server/": "kubernetes-sigs/metrics-server",
"https://charts.fairwinds.com/stable": "FairwindsOps/goldilocks",
"https://helm.ngc.nvidia.com/nvidia": "NVIDIA/gpu-operator",
"oci://ghcr.io/woodpecker-ci/helm": "woodpecker-ci/helm",
"oci://10.0.20.10:5000/bitnamicharts": "bitnami/charts"
},
"db_backed_services": {
"affine": { "type": "postgresql", "db_name": "affine", "shared": true },
"claude-memory": { "type": "postgresql", "db_name": "claude_memory", "shared": true },
"crowdsec": { "type": "postgresql", "db_name": "crowdsec", "shared": true },
"dawarich": { "type": "postgresql", "db_name": "dawarich", "shared": true },
"health": { "type": "postgresql", "db_name": "health", "shared": true },
"linkwarden": { "type": "postgresql", "db_name": "linkwarden", "shared": true },
"n8n": { "type": "postgresql", "db_name": "n8n", "shared": true },
"netbox": { "type": "postgresql", "db_name": "netbox", "shared": true },
"rybbit": { "type": "postgresql", "db_name": "rybbit", "shared": true },
"tandoor": { "type": "postgresql", "db_name": "tandoor", "shared": true },
"technitium": { "type": "postgresql", "db_name": "technitium", "shared": true },
"trading-bot": { "type": "postgresql", "db_name": "trading_bot", "shared": true },
"woodpecker": { "type": "postgresql", "db_name": "woodpecker", "shared": true },
"immich": { "type": "postgresql", "db_name": "immich", "dedicated": true, "backup_cronjob": "postgresql-backup", "backup_namespace": "immich" },
"authentik": { "type": "postgresql", "dedicated": true, "notes": "Uses PgBouncer, managed by Helm chart" },
"hackmd": { "type": "mysql", "db_name": "codimd", "shared": true },
"mailserver": { "type": "mysql", "db_name": "mailserver", "shared": true },
"monitoring": { "type": "mysql", "db_name": "monitoring", "shared": true, "notes": "Grafana backend" },
"nextcloud": { "type": "mysql", "db_name": "nextcloud", "shared": true },
"onlyoffice": { "type": "mysql", "db_name": "onlyoffice", "shared": true },
"paperless-ngx": { "type": "mysql", "db_name": "paperless_ngx", "shared": true },
"phpipam": { "type": "mysql", "db_name": "phpipam", "shared": true },
"real-estate-crawler": { "type": "mysql", "db_name": "wrongmove", "shared": true },
"speedtest": { "type": "mysql", "db_name": "speedtest", "shared": true },
"url": { "type": "mysql", "db_name": "shlink", "shared": true },
"vault": { "type": "mysql", "db_name": "vault", "shared": true }
},
"backup_infrastructure": {
"postgresql": {
"cronjob_name": "postgresql-backup",
"namespace": "dbaas",
"credential_secret": "pg-cluster-superuser",
"credential_key": "password",
"host": "pg-cluster-rw.dbaas",
"backup_pvc": "dbaas-postgresql-backup-host"
},
"mysql": {
"cronjob_name": "mysql-backup",
"namespace": "dbaas",
"credential_secret": "cluster-secret",
"credential_key": "ROOT_PASSWORD",
"host": "mysql.dbaas",
"backup_pvc": "dbaas-mysql-backup-host"
}
},
"version_jump_always_step": [
"authentik",
"nextcloud",
"immich"
],
"auto_detect_rules": {
"ghcr.io/{org}/{repo}": "Use org/repo directly, strip -server/-backend suffixes if repo 404s",
"docker.io/{org}/{repo}": "Try org/repo on GitHub",
"lscr.io/linuxserver/{app}": "Map to linuxserver/docker-{app}",
"quay.io/{org}/{repo}": "Try org/repo on GitHub",
"registry.gitlab.com/{org}/{repo}": "Try org/repo on GitHub (may be GitLab-only)"
},
"skip_image_patterns": [
"viktorbarzin/*",
"registry.viktorbarzin.me/*",
"ancamilea/*",
"mghee/*",
"*postgres*",
"*mysql*",
"*redis*",
"*clickhouse*",
"*etcd*",
"registry.k8s.io/*",
"quay.io/tigera/*",
"quay.io/metallb/*",
"nvcr.io/*",
"reg.kyverno.io/*"
],
"breaking_change_keywords": [
"breaking",
"BREAKING",
"migration required",
"schema change",
"database migration",
"manual intervention",
"action required",
"removed",
"deprecated",
"renamed",
"incompatible"
]
}
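
The consumer of this config is not part of this diff, so the following is only a sketch of how the keys above might be applied. It assumes `skip_image_patterns` are shell-style globs, `github_repo_overrides` is an exact-match lookup on the image name without its tag, and keyword matching is case-insensitive. `strip_tag`, `resolve_repo`, and `breaking_keywords_in` are hypothetical names, and `CONFIG` holds a small excerpt rather than the full file.

```python
import fnmatch
from typing import List, Optional

# Small excerpt of the config above; a real consumer would json.load the whole file.
CONFIG = {
    "github_repo_overrides": {
        "docker.io/vaultwarden/server": "dani-garcia/vaultwarden",
        "lscr.io/linuxserver/lidarr": "linuxserver/docker-lidarr",
    },
    "skip_image_patterns": ["viktorbarzin/*", "*postgres*", "registry.k8s.io/*"],
    "breaking_change_keywords": ["breaking", "BREAKING", "migration required", "deprecated"],
}


def strip_tag(image: str) -> str:
    """Drop a trailing :tag or @digest while keeping any registry port intact."""
    image = image.split("@", 1)[0]
    last_segment_start = image.rfind("/") + 1
    colon = image.find(":", last_segment_start)
    return image if colon == -1 else image[:colon]


def resolve_repo(image: str, config: dict) -> Optional[str]:
    """Map an image reference to a GitHub org/repo, or None if skipped or unknown."""
    name = strip_tag(image)
    # 1. Skip patterns win over everything else.
    if any(fnmatch.fnmatchcase(name, p) for p in config["skip_image_patterns"]):
        return None
    # 2. Explicit overrides.
    override = config["github_repo_overrides"].get(name)
    if override:
        return override
    # 3. Fallback rules modelled on auto_detect_rules.
    parts = name.split("/")
    if len(parts) == 3 and parts[:2] == ["lscr.io", "linuxserver"]:
        return f"linuxserver/docker-{parts[2]}"
    if len(parts) == 3 and parts[0] in ("ghcr.io", "docker.io", "quay.io"):
        return f"{parts[1]}/{parts[2]}"
    return None


def breaking_keywords_in(release_notes: str, config: dict) -> List[str]:
    """Return the distinct lowercase keywords found in the release notes."""
    lowered = release_notes.lower()
    found: List[str] = []
    for keyword in config["breaking_change_keywords"]:
        key = keyword.lower()
        if key in lowered and key not in found:
            found.append(key)
    return found


if __name__ == "__main__":
    print(resolve_repo("docker.io/vaultwarden/server:1.32.0", CONFIG))
    print(resolve_repo("lscr.io/linuxserver/sonarr:latest", CONFIG))
    print(resolve_repo("docker.io/library/postgres:16", CONFIG))
    print(breaking_keywords_in("BREAKING: migration required before upgrade", CONFIG))
```

Checking skip patterns before overrides means a database image never reaches the GitHub lookup even if an override exists for it. Under the case-insensitive assumption, listing both `breaking` and `BREAKING` is redundant, which is why the sketch de-duplicates matches.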