infra

Author	SHA1	Message	Date
Viktor Barzin	53657d9952	infra: document auth = "app|none" tier on every legacy ingress Sweep through the 30+ stacks that predated the auth = "app" tier and were tagged auth = "none" without a comment explaining why they weren't behind Authentik. Each is now self-documenting at the call site, so the tg-level anti-exposure guard passes and future readers don't have to reverse-engineer the intent. Flipped 6 stacks from "none" to "app" — their backends have their own user auth and the new tier records that more accurately: - navidrome (Subsonic user/password) - ntfy (deny-all default + user.db tokens) - nextcloud (WebDAV/CalDAV/CardDAV app passwords) - vaultwarden (Bitwarden-compatible token auth) - headscale (OIDC + preauth keys for Tailscale nodes) - paperless-ngx (app-layer login + API tokens) Kept "none" with a comment on the rest — they're genuinely public, webhook receivers, native-protocol endpoints, OAuth callbacks, or Anubis-fronted: authentik (×2 + guest outpost), beads-server (dolt), claude-memory (bearer-token MCP), dawarich, ebooks/book-search-api, fire-planner /api, forgejo (git/OCI native clients), frigate (HA integration), immich/frame, insta2spotify /api, instagram-poster (meta fetcher), k8s-portal, matrix (native bearer), monitoring×2 (HA REST scrapes), n8n (webhooks), nvidia, onlyoffice (JWT), owntracks (HTTP Basic), postiz, privatebin (client-side enc), rybbit (analytics tracker), send (E2E file drop), tuya-bridge (API key), vault (own auth + CLI), webhook_handler, woodpecker (forgejo webhooks + OAuth), xray (×3 VPN transports). real-estate-crawler/main.tf:400 already had its comment from a prior edit — not touched here. No live state changes — auth = "app" produces the same middleware chain as auth = "none" (verified earlier this session). This commit is purely documentation + intent-tagging.	2026-05-11 19:25:48 +00:00
root	65c4fc6c0b	Woodpecker CI deploy [CI SKIP]	2026-05-10 18:57:31 +00:00
Viktor Barzin	77d111f5fc	owntracks: explicit auth = "none" — Phase 5 audit completion The Phase 4 audit pass missed this site because the previous agent scoped out owntracks (it overrides the factory's middleware list via extra_annotations to use its own basic-auth middleware). Adding the explicit auth = "none" satisfies Phase 5's "every ingress has an explicit decision" goal and makes the intent visible — mobile OwnTracks clients post location data via HTTP basic-auth and can't follow Authentik forward-auth 302s. Closes the loop on Phase 5: 122/122 active ingress_factory call sites now carry an explicit auth = "..." decision (zero callers rely on the default). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-10 18:55:04 +00:00
Viktor Barzin	602103ede1	[owntracks] Strip face avatar from hook payload + drop orphan PVC Bundles two small follow-ups to the live bridge + port-fix work: ## Face avatar fix (dawarich-hook.lua) After the Recorder ran in production for a while it began enriching publish payloads with a `face` field — the base64-encoded user avatar uploaded via the Recorder's web UI (~120 KB). Our Lua hook builds a curl command that embeds the JSON payload as `-d '<payload>'`, which hit `E2BIG` / `Argument list too long` (os.execute reason=code=7) on Linux's `execve` argv limit (~128 KB). Every live POST stopped making it to Dawarich, even though the HTTP POST from the phone to Owntracks still returned 200 and the .rec write still happened. Fix: `data.face = nil` before serializing. Dawarich doesn't use it anyway (not persisted into any column — `raw_data` stored without it). Also upgraded the debug log: on failure we now emit `dawarich-bridge: FAIL tst=... reason=... code=... cmd=...` so any future variant of this problem (next big field surfaced upstream, etc.) is one log tail away from a diagnosis. ``` $ kubectl -n owntracks logs deploy/owntracks --tail=5 | grep dawarich-bridge + dawarich-bridge: init + dawarich-bridge: ok tst=1776600238 ``` ## Orphan PVC removal (main.tf) `owntracks-data-proxmox` (1 Gi, proxmox-lvm, unencrypted) was a leftover from the encrypted-migration attempt; the Deployment has been mounting `owntracks-data-encrypted` the whole time. Verified `Used By: <none>` on the live PVC before removal. Removing the resource from Terraform destroys the PVC — harmless, no data loss. ## Test Plan ### Automated ``` $ ../../scripts/tg plan Plan: 0 to add, 1 to change, 1 to destroy. $ ../../scripts/tg apply --non-interactive Apply complete! Resources: 0 added, 1 changed, 1 destroyed. $ kubectl -n owntracks get pvc NAME STATUS VOLUME ... owntracks-data-encrypted Bound ... 
(owntracks-data-proxmox gone) ``` ### Manual Verification ``` $ VIKTOR_PW=$(vault kv get -field=credentials secret/owntracks | jq -r .viktor) $ TST=$(date +%s) $ kubectl -n owntracks run t --rm -i --image=curlimages/curl -- \ curl -s -w 'HTTP %{http_code}\n' -X POST -u "viktor:$VIKTOR_PW" \ -H 'Content-Type: application/json' \ -H 'X-Limit-U: viktor' -H 'X-Limit-D: iphone-15pro' \ -d "{\"_type\":\"location\",\"lat\":51.5074,\"lon\":-0.1278,\"tst\":$TST,\"tid\":\"vb\"}" \ https://owntracks.viktorbarzin.me/pub HTTP 200 $ sleep 3 && kubectl -n dbaas exec pg-cluster-1 -c postgres -- \ psql -U postgres -d dawarich -tAc \ "SELECT ST_AsText(lonlat::geometry) FROM points WHERE user_id=1 AND timestamp=$TST" POINT(-0.1278 51.5074) ``` Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 12:05:18 +00:00
Viktor Barzin	17a3e03e07	[owntracks] Bridge Recorder → Dawarich via Lua hook script ## Context Viktor wanted live forwarding from Owntracks to Dawarich so his map stays in sync without a periodic backfill. The original plan assumed ot-recorder honoured an `OTR_HTTPHOOK` environment variable — but Recorder 1.0.1 (latest on Docker Hub as of Aug 2025) has no such feature: ``` $ kubectl -n owntracks exec deploy/owntracks -- \ strings /usr/bin/ot-recorder | grep -iE 'hook|webhook|http_post' (no matches) ``` Lua hooks, on the other hand, are first-class: `--lua-script` loads a file and calls the `otr_hook(topic, _type, data)` function for every publish. That is the pivot this commit makes. ## This change Mount a Lua script via ConfigMap and tell ot-recorder to load it: ``` Phone POST /pub ---> Traefik ---> Recorder pod | | handle_payload() writes .rec | otr_hook(topic,_type,data) | | | +---> os.execute("curl … &") | | | v | Dawarich /api/v1/owntracks/points | +---> HTTP 200 to phone ``` Per-publish cost: one `curl` subprocess, `--max-time 5`, backgrounded with `&` so it doesn't block the HTTP response to the phone. A Dawarich 5xx drops exactly one point — the `.rec` write still happens, so the one-shot backfill Job can always re-play. `DAWARICH_API_KEY` is injected from K8s Secret `owntracks-secrets` (sourced from Vault `secret/owntracks.dawarich_api_key` via the existing `dataFrom.extract` ExternalSecret). The Lua reads it with `os.getenv()` so the key never lands in Terraform state. ### Key discoveries in the verification loop (why iteration count > 1) 1. The hook function must be named `otr_hook`, not `hook` (recorder's `luasupport.c` calls `lua_getglobal(L, "otr_hook")`). The recorder logs `cannot invoke otr_hook in Lua script` when missing — the plan's `hook()` naming was wrong. 2. Dawarich's `latitude`/`longitude` scalar columns are legacy and always NULL; the authoritative geometry is in the `lonlat` PostGIS column (`ST_AsText(lonlat::geometry)`). 
Early "it's broken" readings were me querying the wrong columns. 3. Default Recreate-strategy rollouts cause ~30s 502/503 windows on the ingress — tolerable, but every apply is visible as an outage to the phone. Batching edits is important. ## What is NOT in this change - Not OTR_HTTPHOOK. Removed with this commit (dead env var). - Not the one-shot backfill Job — that comes after the phone buffer has flushed to avoid racing against incoming hook POSTs (follow-up: code-h2r). - Not Anca's bridge — a second Recorder instance or a smarter hook is needed to route her posts under her own Dawarich api_key (follow-up: code-72g). - No Ingress or Service change — Commit 1 (``a21d4a44``) already landed those. ## Test Plan ### Automated ``` $ ../../scripts/tg apply --non-interactive Apply complete! Resources: 1 added, 1 changed, 0 destroyed. $ kubectl -n owntracks logs deploy/owntracks --tail=5 + initializing Lua hooks from `/hook/dawarich-hook.lua' + dawarich-bridge: init + HTTP listener started on 0.0.0.0:8083, without browser-apikey ... 
+ dawarich-bridge: tst=1 lat=0 lon=0 ok=true ``` ### Manual Verification ``` $ VIKTOR_PW=$(vault kv get -field=credentials secret/owntracks | jq -r .viktor) $ TST=$(date +%s) $ kubectl -n owntracks run t --rm -i --image=curlimages/curl -- \ curl -s -w 'HTTP %{http_code}\n' -X POST -u "viktor:$VIKTOR_PW" \ -H 'Content-Type: application/json' \ -H 'X-Limit-U: viktor' -H 'X-Limit-D: iphone-15pro' \ -d "{\"_type\":\"location\",\"lat\":51.5074,\"lon\":-0.1278,\"tst\":$TST,\"tid\":\"vb\"}" \ https://owntracks.viktorbarzin.me/pub HTTP 200 $ sleep 3 && kubectl -n dbaas exec pg-cluster-1 -c postgres -- \ psql -U postgres -d dawarich -c \ "SELECT timestamp, ST_AsText(lonlat::geometry) FROM points \ WHERE user_id=1 AND timestamp=$TST" timestamp | st_astext ------------+------------------------- 1776555707 | POINT(-0.1278 51.5074) ``` Real phone traffic (from in-flight buffer flush) lands in Dawarich too: `traefik logs -l app.kubernetes.io/name=traefik | grep 'POST /api/v1/owntracks/points'` shows ingress POSTs from `owntracks` namespace to `dawarich` backend with status 200. ### Reproduce locally 1. `vault login -method=oidc` 2. `kubectl -n owntracks logs deploy/owntracks --tail=20` — expect `dawarich-bridge: init` after the Lua loader line. 3. Do the curl above, poll the DB, expect `POINT(lon lat)`. Closes: code-z9b Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 23:47:22 +00:00
Viktor Barzin	a21d4a4424	[owntracks] Fix Service port scheme (https→http), unbreak phone POSTs ## Context iOS Owntracks app has been unable to upload for months — phone buffer now holds ~1200 pending points. Last successful `.rec` write was 2026-01-02T14:32:00Z, matching when the failures started. ### The 500 — verified in Traefik access log ``` 152.37.101.156 - viktor "POST /pub HTTP/1.1" 500 21 "-" "-" 47900 "owntracks-owntracks-owntracks-viktorbarzin-me@kubernetes" "https://10.10.107.194:8083" 84ms ``` Basic-auth + middleware chain (rate-limit, csp, crowdsec) all pass. Traefik then opens backend connection to `https://10.10.107.194:8083`. The Recorder pod listens plain HTTP on :8083 (`OTR_PORT=0` disables HTTPS in ot-recorder), so the TLS handshake never completes → 500. ### Root cause — Service port spec `kubernetes_service.owntracks` declared the port as: ``` name: https port: 443 targetPort: 8083 ``` Traefik's IngressClass scheme inference: if the Service port is named `https` OR numbered `443`, Traefik speaks HTTPS to that backend. Both were true here, pointing at a plain-HTTP socket. The name/number were purely cosmetic — a leftover from mirroring the external `:443` edge — and worked only while Traefik's default happened to be HTTP. A Traefik upgrade (or middleware-chain change) tightened inference and surfaced the mismatch. ## This change Rename port to `name=http, port=80` and update the matching Ingress backend `port.number` from 443 to 80. `targetPort` stays at 8083. ``` Phone -----> CF tunnel -----> Traefik (:443, TLS) -----> Service \ :80 (http) \ | \ v ---------------> Pod :8083 (plain HTTP hop) (HTTP listener) ``` Deployment container port label also renamed `https` → `http` for consistency (no functional effect — just readability). ## What is NOT in this change - Not switching the Recorder pod to HTTPS natively. That would require mounting a cert + rotation plumbing. 
External TLS is already terminated at Cloudflare/Traefik; in-cluster hop to the pod is plain-HTTP by design. - Not enabling `OTR_HTTPHOOK` to bridge Recorder → Dawarich (follow-up: code-z9b). - Not backfilling historical `.rec` files into Dawarich (follow-up: code-h2r). - Incidental: `providers.tf` + `.terraform.lock.hcl` refreshed by `terraform init -upgrade` to pick up the goauthentik provider that the ingress_factory module recently started requiring. ## Test Plan ### Automated ``` $ ../../scripts/tg plan Plan: 0 to add, 3 to change, 0 to destroy. $ ../../scripts/tg apply --non-interactive Apply complete! Resources: 0 added, 3 changed, 0 destroyed. $ kubectl -n owntracks get svc owntracks -o=jsonpath='{.spec.ports[0]}' {"name":"http","port":80,"protocol":"TCP","targetPort":8083} $ kubectl -n owntracks get ingress owntracks -o=jsonpath='{.spec.rules[0].http.paths[0].backend}' {"service":{"name":"owntracks","port":{"number":80}}} ``` ### Manual Verification In-cluster auth'd POST through the full ingress chain: ``` VIKTOR_PW=$(vault kv get -field=credentials secret/owntracks | jq -r .viktor) kubectl -n owntracks run curltest --rm -i --image=curlimages/curl --restart=Never -- \ curl -s -o /dev/null -w "HTTP %{http_code}\n" -X POST -u "viktor:$VIKTOR_PW" \ -H "Content-Type: application/json" \ -d '{"_type":"location","lat":0,"lon":0,"tst":1000000000,"tid":"vb"}' \ https://owntracks.viktorbarzin.me/pub # HTTP 200 ``` (previously: HTTP 500 on identical request) ### Reproduce locally 1. `vault login -method=oidc` 2. `cd infra/stacks/owntracks && ../../scripts/tg plan` 3. Expected: `Plan: 0 to add, 3 to change, 0 to destroy.` (or empty if already applied) 4. Watch next iOS Owntracks POST → Traefik access log should show `200`, not `500`. Closes: code-nqd Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 23:24:25 +00:00
Viktor Barzin	327ce215b9	[infra] Sweep dns_config ignore_changes across all pod-owning resources [ci skip] ## Context Wave 3A (commit `c9d221d5`) added the `# KYVERNO_LIFECYCLE_V1` marker to the 27 pre-existing `ignore_changes = [...dns_config]` sites so they could be grepped and audited. It did NOT address pod-owning resources that were simply missing the suppression entirely. Post-Wave-3A sampling (2026-04-18) found that navidrome, f1-stream, frigate, servarr, monitoring, crowdsec, and many other stacks showed perpetual `dns_config` drift every plan because their `kubernetes_deployment` / `kubernetes_stateful_set` / `kubernetes_cron_job_v1` resources had no `lifecycle {}` block at all. Root cause (same as Wave 3A): Kyverno's admission webhook stamps `dns_config { option { name = "ndots"; value = "2" } }` on every pod's `spec.template.spec.dns_config` to prevent NxDomain search-domain flooding (see `k8s-ndots-search-domain-nxdomain-flood` skill). Without `ignore_changes` on every Terraform-managed pod-owner, Terraform repeatedly tries to strip the injected field. ## This change Extends the Wave 3A convention by sweeping EVERY `kubernetes_deployment`, `kubernetes_stateful_set`, `kubernetes_daemon_set`, `kubernetes_cron_job_v1`, `kubernetes_job_v1` (+ their `_v1` variants) in the repo and ensuring each carries the right `ignore_changes` path: - kubernetes_deployment / stateful_set / daemon_set / job_v1: `spec[0].template[0].spec[0].dns_config` - kubernetes_cron_job_v1: `spec[0].job_template[0].spec[0].template[0].spec[0].dns_config` (extra `job_template[0]` nesting — the CronJob's PodTemplateSpec is one level deeper) Each injection / extension is tagged `# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2` inline so the suppression is discoverable via `rg 'KYVERNO_LIFECYCLE_V1' stacks/`. Two insertion paths are handled by a Python pass (`/tmp/add_dns_config_ignore.py`): 1. 
No existing `lifecycle {}`: inject a brand-new block just before the resource's closing `}`. 108 new blocks on 93 files. 2. Existing `lifecycle {}` (usually for `DRIFT_WORKAROUND: CI owns image tag` from Wave 4, commit a62b43d1): extend its `ignore_changes` list with the dns_config path. Handles both inline (`= [x]`) and multiline (`= [\n x,\n]`) forms; ensures the last pre-existing list item carries a trailing comma so the extended list is valid HCL. 34 extensions. The script skips anything already mentioning `dns_config` inside an `ignore_changes`, so re-running is a no-op. ## Scale - 142 total lifecycle injections/extensions - 93 `.tf` files touched - 108 brand-new `lifecycle {}` blocks + 34 extensions of existing ones - Every Tier 0 and Tier 1 stack with a pod-owning resource is covered - Together with Wave 3A's 27 pre-existing markers → 169 greppable `KYVERNO_LIFECYCLE_V1` dns_config sites across the repo ## What is NOT in this change - `stacks/trading-bot/main.tf` — entirely commented-out block (`/* … */`). Python script touched the file, reverted manually. - `_template/main.tf.example` skeleton — kept minimal on purpose; any future stack created from it should either inherit the Wave 3A one-line form or add its own on first `kubernetes_deployment`. - `terraform fmt` fixes to pre-existing alignment issues in meshcentral, nvidia/modules/nvidia, vault — unrelated to this commit. Left for a separate fmt-only pass. - Non-pod resources (`kubernetes_service`, `kubernetes_secret`, `kubernetes_manifest`, etc.) — they don't own pods so they don't get Kyverno dns_config mutation. ## Verification Random sample post-commit: ``` $ cd stacks/navidrome && ../../scripts/tg plan → No changes. $ cd stacks/f1-stream && ../../scripts/tg plan → No changes. $ cd stacks/frigate && ../../scripts/tg plan → No changes. $ rg -c 'KYVERNO_LIFECYCLE_V1' stacks/ --include='*.tf' --include='*.tf.example' \ | awk -F: '{s+=$2} END {print s}' 169 ``` ## Reproduce locally 1. `git pull` 2. 
`rg 'KYVERNO_LIFECYCLE_V1' stacks/ | wc -l` → 169+ 3. `cd stacks/navidrome && ../../scripts/tg plan` → expect 0 drift on the deployment's dns_config field. Refs: code-seq (Wave 3B dns_config class closed; kubernetes_manifest annotation class handled separately in `8d94688d` for tls_secret) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 21:19:48 +00:00
Viktor Barzin	8b43692af0	[infra] Suppress Goldilocks vpa-update-mode label drift on all namespaces [ci skip] ## Context Wave 3B-continued: the Goldilocks VPA dashboard (stacks/vpa) runs a Kyverno ClusterPolicy `goldilocks-vpa-auto-mode` that mutates every namespace with `metadata.labels["goldilocks.fairwinds.com/vpa-update-mode"] = "off"`. This is intentional — Terraform owns container resource limits, and Goldilocks should only provide recommendations, never auto-update. The label is how Goldilocks decides per-namespace whether to run its VPA in `off` mode. Effect on Terraform: every `kubernetes_namespace` resource shows the label as pending-removal (`-> null`) on every `scripts/tg plan`. Dawarich survey 2026-04-18 confirmed the drift. Cluster-side count: 88 namespaces carry the label (`kubectl get ns -o json | jq ... | wc -l`). Every TF-managed namespace is affected. This commit brings the intentional admission drift under the same `# KYVERNO_LIFECYCLE_V1` discoverability marker introduced in `c9d221d5` for the ndots dns_config pattern. The marker now stands generically for any Kyverno admission-webhook drift suppression; the inline comment records which specific policy stamps which specific field so future grep audits show why each suppression exists. ## This change 107 `.tf` files touched — every stack's `resource "kubernetes_namespace"` resource gets: ```hcl lifecycle { # KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]] } ``` Injection was done with a brace-depth-tracking Python pass (`/tmp/add_goldilocks_ignore.py`): match `^resource "kubernetes_namespace" ` → track `{` / `}` until the outermost closing brace → insert the lifecycle block before the closing brace. The script is idempotent (skips any file that already mentions `goldilocks.fairwinds.com/vpa-update-mode`) so re-running is safe. 
Vault stack picked up 2 namespaces in the same file (k8s-users produces one, plus a second explicit ns) — confirmed via file diff (+8 lines). ## What is NOT in this change - `stacks/trading-bot/main.tf` — entire file is `/* … */` commented out (paused 2026-04-06 per user decision). Reverted after the script ran. - `stacks/_template/main.tf.example` — per-stack skeleton, intentionally minimal. User keeps it that way. Not touched by the script (file has no real `resource "kubernetes_namespace"` — only a placeholder comment). - `.terraform/` copies (e.g. `stacks/metallb/.terraform/modules/...`) — gitignored, won't commit; the live path was edited. - `terraform fmt` cleanup of adjacent pre-existing alignment issues in authentik, freedify, hermes-agent, nvidia, vault, meshcentral. Reverted to keep the commit scoped to the Goldilocks sweep. Those files will need a separate fmt-only commit or will be cleaned up on next real apply to that stack. ## Verification Dawarich (one of the hundred-plus touched stacks) showed the pattern before and after: ``` $ cd stacks/dawarich && ../../scripts/tg plan Before: Plan: 0 to add, 2 to change, 0 to destroy. # kubernetes_namespace.dawarich will be updated in-place (goldilocks.fairwinds.com/vpa-update-mode -> null) # module.tls_secret.kubernetes_secret.tls_secret will be updated in-place (Kyverno generate.* labels — fixed in `8d94688d`) After: No changes. Your infrastructure matches the configuration. ``` Injection count check: ``` $ rg -c 'KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode' stacks/ | awk -F: '{s+=$2} END {print s}' 108 ``` ## Reproduce locally 1. `git pull` 2. Pick any stack: `cd stacks/<name> && ../../scripts/tg plan` 3. Expect: no drift on the namespace's goldilocks.fairwinds.com/vpa-update-mode label. Closes: code-dwx Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 21:15:27 +00:00
Viktor Barzin	a62b43d19e	[infra] Document intended ignore_changes drift-workarounds [ci skip] ## Context The infra repo has 31 `ignore_changes` blocks. Phase 1 of the state-drift consolidation audit classified 21 as legitimate (immutable fields, cloud-computed values) and 10 as intentional workarounds for known drift sources. The remaining 10 were indistinguishable from accidental/forgotten drift suppression without reading the surrounding context. This commit adds a uniform `# DRIFT_WORKAROUND: <reason>, reviewed 2026-04-18` marker above the 8 intended-workaround blocks (6 CI image-tag decoupling + 2 non-deterministic secret hashes) so they are easy to distinguish from accidental drift suppression during future audits. ## What is NOT in this change - Functional behavior — `ignore_changes` lists are byte-identical. - The Kyverno `dns_config` ignore paths (covered by Wave 3 shared module). - Workarounds being removed — the CI decoupling is intentional by user decision. ## Files touched CI image-tag decoupling (6): - stacks/k8s-portal/modules/k8s-portal/main.tf (also has dns_config for Kyverno) - stacks/novelapp/main.tf - stacks/claude-memory/main.tf - stacks/plotting-book/main.tf - stacks/trading-bot/main.tf (api deployment) - stacks/trading-bot/main.tf (workers deployment — 6 containers) Non-deterministic secret hashes (2): - stacks/owntracks/main.tf (htpasswd bcrypt) - stacks/mailserver/modules/mailserver/main.tf (postfix-accounts.cf) ## Test Plan ### Automated ``` $ rg DRIFT_WORKAROUND stacks/ | wc -l 8 $ terraform fmt -recursive stacks/k8s-portal stacks/novelapp stacks/claude-memory \ stacks/plotting-book stacks/trading-bot stacks/owntracks stacks/mailserver (no output — already formatted) $ git diff --stat stacks/claude-memory/main.tf | 1 + stacks/k8s-portal/modules/k8s-portal/main.tf | 1 + stacks/mailserver/modules/mailserver/main.tf | 3 ++- stacks/novelapp/main.tf | 1 + stacks/owntracks/main.tf | 1 + stacks/plotting-book/main.tf | 1 + 
stacks/trading-bot/main.tf | 2 ++ 7 files changed, 9 insertions(+), 1 deletion(-) ``` ### Manual Verification No apply required — HCL comments only, zero effect on plan output. ## Reproduce locally 1. `cd infra && git pull` 2. `rg "DRIFT_WORKAROUND.*reviewed 2026-04-18" stacks/ | wc -l` → expect 8 3. `terraform fmt -check -recursive stacks/` → expect clean exit Closes: code-yrg Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 14:08:10 +00:00
Viktor Barzin	5319f03ebc	[storage] Fix owntracks + wealthfolio: switch to encrypted PVCs Both services were running against empty unencrypted PVCs after the proxmox-lvm-encrypted migration. Data copied from old Released PVs via LUKS-unlock on PVE host, deployments switched to encrypted PVCs. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-17 20:29:57 +00:00
Viktor Barzin	e80b2f026f	[infra] Migrate Terraform state from local SOPS to PostgreSQL backend Two-tier state architecture: - Tier 0 (infra, platform, cnpg, vault, dbaas, external-secrets): local state with SOPS encryption in git — unchanged, required for bootstrap. - Tier 1 (105 app stacks): PostgreSQL backend on CNPG cluster at 10.0.20.200:5432/terraform_state with native pg_advisory_lock. Motivation: multi-operator friction (every workstation needed SOPS + age + git-crypt), bootstrap complexity for new operators, and headless agents/CI needing the full encryption toolchain just to read state. Changes: - terragrunt.hcl: conditional backend (local vs pg) based on tier0 list - scripts/tg: tier detection, auto-fetch PG creds from Vault for Tier 1, skip SOPS and Vault KV locking for Tier 1 stacks - scripts/state-sync: tier-aware encrypt/decrypt (skips Tier 1) - scripts/migrate-state-to-pg: one-shot migration script (idempotent) - stacks/vault/main.tf: pg-terraform-state static role + K8s auth role for claude-agent namespace - stacks/dbaas: terraform_state DB creation + MetalLB LoadBalancer service on shared IP 10.0.20.200 - Deleted 107 .tfstate.enc files for migrated Tier 1 stacks - Cleaned up per-stack tiers.tf (now generated by root terragrunt.hcl) [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 19:33:12 +00:00
Viktor Barzin	1ea48c93e5	upgrade: owntracks 0.9.9 -> 1.0.1 Changelog summary: - 1.0.0: POI inline image support, deprecate google maps in vmap.html, packaging fixes - 1.0.1: ocat JSON array output fix, revgeo error messages, OpenBSD support, storage dir env fix Risk: CAUTION (major version 0→1, but changes are benign — no schema/config/API breaking changes) Breaking changes: none (deprecate keyword hit on vmap.html google maps — cosmetic only) DB backup: skipped (not DB-backed) Config changes applied: none required Flagged for manual review: none Co-Authored-By: Service Upgrade Agent <noreply@viktorbarzin.me>	2026-04-16 16:34:29 +00:00
Viktor Barzin	216d4240c9	[infra] Add Cloudflare provider to all stack lock files and generated providers Terragrunt now generates cloudflare_provider.tf (Vault-sourced API key) and includes cloudflare in required_providers. These are the generated files from running `terragrunt init -upgrade` across all stacks. [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 16:31:36 +00:00
Viktor Barzin	b1d152be1f	[infra] Auto-create Cloudflare DNS records from ingress_factory ## Context Deploying new services required manually adding hostnames to cloudflare_proxied_names/cloudflare_non_proxied_names in config.tfvars — a separate file from the service stack. This was frequently forgotten, leaving services unreachable externally. ## This change: - Add `dns_type` parameter to `ingress_factory` and `reverse_proxy/factory` modules. Setting `dns_type = "proxied"` or `"non-proxied"` auto-creates the Cloudflare DNS record (CNAME to tunnel or A/AAAA to public IP). - Simplify cloudflared tunnel from 100 per-hostname rules to wildcard `*.viktorbarzin.me → Traefik`. Traefik still handles host-based routing. - Add global Cloudflare provider via terragrunt.hcl (separate cloudflare_provider.tf with Vault-sourced API key). - Migrate 118 hostnames from centralized config.tfvars to per-service dns_type. 17 hostnames remain centrally managed (Helm ingresses, special cases). - Update docs, AGENTS.md, CLAUDE.md, dns.md runbook. ``` BEFORE AFTER config.tfvars (manual list) stacks/<svc>/main.tf \| module "ingress" { v dns_type = "proxied" stacks/cloudflared/ } for_each = list \| cloudflare_record auto-creates tunnel per-hostname cloudflare_record + annotation ``` ## What is NOT in this change: - Uptime Kuma monitor migration (still reads from config.tfvars) - 17 remaining centrally-managed hostnames (Helm, special cases) - Removal of allow_overwrite (keep until migration confirmed stable) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 13:45:04 +00:00
Viktor Barzin	82b0f6c4cb	truenas deprecation: migrate all non-immich storage to proxmox NFS - Migrate 7 backup CronJobs to Proxmox host NFS (192.168.1.127) (etcd, mysql, postgresql, nextcloud, redis, vaultwarden, plotting-book) - Migrate headscale backup, ebook2audiobook, osm_routing to Proxmox NFS - Migrate servarr (lidarr, readarr, soulseek) NFS refs to Proxmox - Remove 79 orphaned TrueNAS NFS module declarations from 49 stacks - Delete stacks/platform/modules/ (27 dead module copies, 65MB) - Update nfs-truenas StorageClass to point to Proxmox (192.168.1.127) - Remove iscsi DNS record from config.tfvars - Fix woodpecker persistence config and alertmanager PV Only Immich (8 PVCs, ~1.4TB) remains on TrueNAS.	2026-04-12 14:35:39 +01:00
Viktor Barzin	cb8a808700	feat(storage): migrate 38 NFS PVCs to proxmox-lvm (Wave 2) Add proxmox-lvm PVCs with pvc-autoresizer annotations for all remaining single-pod app data services. Deployments updated to use new block storage PVCs. Old NFS modules retained for rollback. Services: affine, changedetection, diun, excalidraw, f1-stream, hackmd, isponsorblocktv, matrix, n8n, send, grampsweb, health, onlyoffice, owntracks, paperless-ngx, privatebin, resume, speedtest, stirling-pdf, tandoor, rybbit (clickhouse), tor-proxy (torrserver), whisper+piper, frigate (config), ollama (ui), servarr (prowlarr/listenarr/qbittorrent), aiostreams, freshrss (extensions), meshcentral (data+files), openclaw (data+home+ openlobster), technitium, mailserver (data+roundcube html+enigma), dbaas (pgadmin). Strategy set to Recreate where needed for RWO volumes.	2026-04-04 19:25:12 +03:00
Viktor Barzin	21bb3036af	state(dbaas): update encrypted state	2026-03-19 20:23:59 +00:00
Viktor Barzin	39b3c51709	migrate 16 plan-time stacks: vault data source → ESO + kubernetes_secret Replaced data "vault_kv_secret_v2" with: 1. ExternalSecret (ESO syncs Vault KV → K8s Secret) 2. data "kubernetes_secret" (reads ESO-created secret at plan time) This removes the Vault provider dependency at plan time for these stacks — they now only need K8s API access, not a Vault token. Stacks: actualbudget, affine, audiobookshelf, calibre, changedetection, coturn, freedify, freshrss, grampsweb, navidrome, novelapp, ollama, owntracks, real-estate-crawler, servarr, ytdlp	2026-03-15 22:06:39 +00:00
Viktor Barzin	6f562b5da6	add vaultwarden daily backup CronJob to NFS SQLite backup via Online Backup API + copy of RSA keys, attachments, sends, and config. 30-day retention with rotation. Pod affinity ensures co-scheduling with vaultwarden for RWO PVC access.	2026-03-15 00:03:59 +00:00
Viktor Barzin	f7c2c06009	right-size memory: set requests=limits based on actual usage - Set memory requests = limits across 56 stacks to prevent overcommit - Right-sized limits based on actual pod usage (2x actual, rounded up) - Scaled down trading-bot (replicas=0) to free memory - Fixed OOMKilled services: forgejo, dawarich, health, meshcentral, paperless-ngx, vault auto-unseal, rybbit, whisper, openclaw, clickhouse - Added startup+liveness probes to calibre-web - Bumped inotify limits on nodes 2,3 (max_user_instances 128->8192) Post node2 OOM incident (2026-03-14). Previous kubelet config had no kubeReserved/systemReserved set, allowing pods to starve the kernel.	2026-03-14 21:01:24 +00:00
Viktor Barzin	a8d944eb9b	migrate all secrets from SOPS to Vault KV - Add vault provider to root terragrunt.hcl (generated providers.tf) - Delete stacks/vault/vault_provider.tf (now in generated providers.tf) - Add 124 variable declarations + 43 vault_kv_secret_v2 resources to vault/main.tf to populate Vault KV at secret/<stack-name> - Migrate 43 consuming stacks to read secrets from Vault KV via data "vault_kv_secret_v2" instead of SOPS var-file - Add dependency "vault" to all migrated stacks' terragrunt.hcl - Complex types (maps/lists) stored as JSON strings, decoded with jsondecode() in locals blocks Bootstrap secrets (vault_root_token, vault_authentik_client_id, vault_authentik_client_secret) remain in SOPS permanently. Apply order: vault stack first (populates KV), then all others.	2026-03-14 17:15:48 +00:00
Viktor Barzin	b00f810d3d	Remove all CPU limits cluster-wide to eliminate CFS throttling CPU limits cause CFS throttling even when nodes have idle capacity. Move to a request-only CPU model: keep CPU requests for scheduling fairness but remove all CPU limits. Memory limits stay (incompressible). Changes across 108 files: - Kyverno LimitRange policy: remove cpu from default/max in all 6 tiers - Kyverno ResourceQuota policy: remove limits.cpu from all 5 tiers - Custom ResourceQuotas: remove limits.cpu from 8 namespace quotas - Custom LimitRanges: remove cpu from default/max (nextcloud, onlyoffice) - RBAC module: remove cpu_limits variable and quota reference - Freedify factory: remove cpu_limit variable and limits reference - 86 deployment files: remove cpu from all limits blocks - 6 Helm values files: remove cpu under limits sections	2026-03-14 08:51:45 +00:00
Viktor Barzin	6bd3970579	[ci skip] add Homepage gethomepage.dev annotations to all services Add Kubernetes ingress annotations for Homepage auto-discovery across ~88 services organized into 11 groups. Enable serviceAccount for RBAC, configure group layouts, and add Grafana/Frigate/Speedtest widgets.	2026-03-07 20:39:54 +00:00
Viktor Barzin	1f2c1ca361	[ci skip] phase 5+6: update CI pipelines for SOPS, add sensitive=true to secret vars Phase 5 — CI pipelines: - default.yml: add SOPS decrypt in prepare step, change git add . to specific paths (stacks/ state/ .woodpecker/), cleanup on success+failure - renew-tls.yml: change git add . to git add secrets/ state/ Phase 6 — sensitive=true: - Add sensitive = true to 256 variable declarations across 149 stack files - Prevents secret values from appearing in terraform plan output - Does NOT modify shared modules (ingress_factory, nfs_volume) to avoid breaking module interface contracts Note: CI pipeline SOPS decryption requires sops_age_key Woodpecker secret to be created before the pipeline will work with SOPS. Until then, the old terraform.tfvars path continues to function.	2026-03-07 14:30:36 +00:00
Viktor Barzin	197cef7f3f	[ci skip] add auto-generated tiers.tf, planning docs, and helm chart cache - tiers.tf: Terragrunt-generated tier locals for all standalone stacks - .planning/: resource audit research and plans - docs/plans/: cluster hardening design doc - redis-25.3.2.tgz: Bitnami Redis Helm chart cache	2026-03-06 23:55:57 +00:00
Viktor Barzin	79a2aa3784	[ci skip] migrate 29 services from inline NFS to CSI-backed PV/PVC Batch migration of all single-volume and simple multi-volume stacks. All services verified healthy after migration. Uses nfs-truenas StorageClass with soft,timeo=30,retrans=3 mount options to eliminate stale NFS mount hangs. Services: atuin, audiobookshelf, calibre, changedetection, diun, excalidraw, forgejo, freshrss, grampsweb, hackmd, health, isponsorblocktv, matrix, meshcentral, n8n, navidrome, ntfy, ollama, onlyoffice, owntracks, paperless-ngx, poison-fountain, send, stirling-pdf, tandoor, wealthfolio, whisper, woodpecker, ytdlp	2026-03-02 00:15:39 +00:00
Viktor Barzin	9e4fb23b10	[ci skip] right-size all pod resources based on VPA + live metrics audit Full cluster resource audit: cross-referenced Goldilocks VPA recommendations, live kubectl top metrics, and Terraform definitions for 100+ containers. Critical fixes: - dashy: CPU throttled at 98% (490m/500m) → 2 CPU limit - stirling-pdf: CPU throttled at 99.7% (299m/300m) → 2 CPU limit - traefik auth-proxy/bot-block-proxy: mem limit 32Mi → 128Mi Added explicit resources to ~40 containers that had none: - audiobookshelf, changedetection, cyberchef, dawarich, diun, echo, excalidraw, freshrss, hackmd, isponsorblocktv, linkwarden, n8n, navidrome, ntfy, owntracks, privatebin, send, shadowsocks, tandoor, tor-proxy, wealthfolio, networking-toolbox, rybbit, mailserver, cloudflared, pgadmin, phpmyadmin, crowdsec-web, xray, wireguard, k8s-portal, tuya-bridge, ollama-ui, whisper, piper, immich-server, immich-postgresql, osrm-foot GPU containers: added CPU/mem alongside GPU limits: - ollama: removed CPU/mem limits (models vary in size), keep GPU only - frigate: req 500m/2Gi, lim 4/8Gi + GPU - immich-ml: req 100m/1Gi, lim 2/4Gi + GPU Right-sized ~25 over-provisioned containers: - kms-web-page: 500m/512Mi → 50m/64Mi (was using 0m/10Mi) - onlyoffice: CPU 8 → 2 (VPA upper 45m) - realestate-crawler-api: CPU 2000m → 250m - blog/travel-blog/webhook-handler: 500m → 100m - coturn/health/plotting-book: reduced to match actual usage Conservative methodology: limits = max(VPA upper * 2, live usage * 2)	2026-03-01 19:18:50 +00:00
Viktor Barzin	89a6e08245	[ci skip] Infrastructure hardening: security, monitoring, reliability, maintainability Phase 1 - Critical Security: - Netbox: move hardcoded DB/superuser passwords to variables - MeshCentral: disable public registration, add Authentik auth - Traefik: disable insecure API dashboard (api.insecure=false) - Traefik: configure forwarded headers with Cloudflare trusted IPs Phase 2 - Security Hardening: - Add security headers middleware (HSTS, X-Frame-Options, nosniff, etc.) - Add Kyverno pod security policies in audit mode (privileged, host namespaces, SYS_ADMIN, trusted registries) - Tighten rate limiting (avg=10, burst=50) - Add Authentik protection to grampsweb Phase 3 - Monitoring & Alerting: - Add critical service alerts (PostgreSQL, MySQL, Redis, Headscale, Authentik, Loki) - Increase Loki retention from 7 to 30 days (720h) - Add predictive PV filling alert (predict_linear) - Re-enable Hackmd and Privatebin down alerts Phase 4 - Reliability: - Add resource requests/limits to Redis, DBaaS, Technitium, Headscale, Vaultwarden, Uptime Kuma - Increase Alloy DaemonSet memory to 512Mi/1Gi Phase 6 - Maintainability: - Extract duplicated tiers locals to terragrunt.hcl generate block (removed from 67 stacks) - Replace hardcoded NFS IP 10.0.10.15 with var.nfs_server (114 instances across 63 files) - Replace hardcoded Redis/PostgreSQL/MySQL/Ollama/mail host references with variables across ~35 stacks - Migrate xray raw ingress resources to ingress_factory modules	2026-02-23 22:05:28 +00:00
Viktor Barzin	c7c7047f1c	[ci skip] Flatten module wrappers into stack roots Remove the module "xxx" { source = "./module" } indirection layer from all 66 service stacks. Resources are now defined directly in each stack's main.tf instead of through a wrapper module. - Merge module/main.tf contents into stack main.tf - Apply variable replacements (var.tier -> local.tiers.X, renamed vars) - Fix shared module paths (one fewer ../ at each level) - Move extra files/dirs (factory/, chart_values, subdirs) to stack root - Update state files to strip module.<name>. prefix - Update CLAUDE.md to reflect flat structure Verified: terragrunt plan shows 0 add, 0 destroy across all stacks.	2026-02-22 15:13:55 +00:00
Viktor Barzin	e6420c7b36	[ci skip] Move Terraform modules into stack directories Move all 88 service modules (66 individual + 22 platform) from modules/kubernetes/<service>/ into their corresponding stack directories: - Service stacks: stacks/<service>/module/ - Platform stack: stacks/platform/modules/<service>/ This collocates module source code with its Terragrunt definition. Only shared utility modules remain in modules/kubernetes/: ingress_factory, setup_tls_secret, dockerhub_secret, oauth-proxy. All cross-references to shared modules updated to use correct relative paths. Verified with terragrunt run --all -- plan: 0 adds, 0 destroys across all 68 stacks.	2026-02-22 14:38:14 +00:00
Viktor Barzin	e2522ad9f1	[ci skip] Fix variable type mismatches in owntracks, ollama, tandoor stacks - owntracks_credentials: string -> map(string) - ollama_api_credentials: string -> map(string) - tandoor_email_password: add default="" (not in tfvars)	2026-02-22 14:07:33 +00:00
Viktor Barzin	a9ba8899be	[ci skip] Phase 3: Create 66 service stacks and migrate state Generated individual stack directories for all 66 services under stacks/. Each stack has terragrunt.hcl (depends on platform) and main.tf (thin wrapper calling existing module). Migrated all 64 active service states from root terraform.tfstate to individual state files. Root state is now empty. Verified with terragrunt plan on multiple stacks (no changes).	2026-02-22 13:56:34 +00:00

32 commits