Sweep through the 30+ stacks that predated the auth = "app" tier
and were tagged auth = "none" without a comment explaining why
they weren't behind Authentik. Each is now self-documenting at the
call site, so the tg-level anti-exposure guard passes and future
readers don't have to reverse-engineer the intent.
Flipped 6 stacks from "none" to "app" — their backends have their
own user auth and the new tier records that more accurately:
- navidrome (Subsonic user/password)
- ntfy (deny-all default + user.db tokens)
- nextcloud (WebDAV/CalDAV/CardDAV app passwords)
- vaultwarden (Bitwarden-compatible token auth)
- headscale (OIDC + preauth keys for Tailscale nodes)
- paperless-ngx (app-layer login + API tokens)
Kept "none" with a comment on the rest — they're genuinely public,
webhook receivers, native-protocol endpoints, OAuth callbacks, or
Anubis-fronted: authentik (×2 + guest outpost), beads-server (dolt),
claude-memory (bearer-token MCP), dawarich, ebooks/book-search-api,
fire-planner /api, forgejo (git/OCI native clients), frigate (HA
integration), immich/frame, insta2spotify /api, instagram-poster
(meta fetcher), k8s-portal, matrix (native bearer), monitoring×2
(HA REST scrapes), n8n (webhooks), nvidia, onlyoffice (JWT),
owntracks (HTTP Basic), postiz, privatebin (client-side enc),
rybbit (analytics tracker), send (E2E file drop), tuya-bridge
(API key), vault (own auth + CLI), webhook_handler, woodpecker
(forgejo webhooks + OAuth), xray (×3 VPN transports).
real-estate-crawler/main.tf:400 already had its comment from a
prior edit — not touched here.
No live state changes — auth = "app" produces the same middleware
chain as auth = "none" (verified earlier this session). This commit
is purely documentation + intent-tagging.
The Phase 4 audit pass missed this site because the previous agent scoped
out owntracks (it overrides the factory's middleware list via
extra_annotations to use its own basic-auth middleware). Adding the explicit
auth = "none" satisfies Phase 5's "every ingress has an explicit decision"
goal and makes the intent visible — mobile OwnTracks clients post location
data via HTTP basic-auth and can't follow Authentik forward-auth 302s.
Closes the loop on Phase 5: 122/122 active ingress_factory call sites now
carry an explicit auth = "..." decision (zero callers rely on the default).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Bundles two small follow-ups to the live bridge + port-fix work:
## Face avatar fix (dawarich-hook.lua)
After the Recorder ran in production for a while it began enriching
publish payloads with a `face` field — the base64-encoded user avatar
uploaded via the Recorder's web UI (~120 KB). Our Lua hook builds a
curl command that embeds the JSON payload as `-d '<payload>'`, which
hit `E2BIG` / `Argument list too long` (os.execute reason=code=7) on
Linux's `execve` argv limit (~128 KB). Every live POST stopped making
it to Dawarich, even though the HTTP POST from the phone to Owntracks
still returned 200 and the .rec write still happened.
Fix: `data.face = nil` before serializing. Dawarich doesn't use it
anyway (not persisted into any column — `raw_data` stored without it).
Also upgraded the debug log: on failure we now emit
`dawarich-bridge: FAIL tst=... reason=... code=... cmd=...` so any
future variant of this problem (next big field surfaced upstream, etc.)
is one log tail away from a diagnosis.
```
$ kubectl -n owntracks logs deploy/owntracks --tail=5 | grep dawarich-bridge
+ dawarich-bridge: init
+ dawarich-bridge: ok tst=1776600238
```
## Orphan PVC removal (main.tf)
`owntracks-data-proxmox` (1 Gi, proxmox-lvm, unencrypted) was a leftover
from the encrypted-migration attempt; the Deployment has been mounting
`owntracks-data-encrypted` the whole time. Verified `Used By: <none>`
on the live PVC before removal. Removing the resource from Terraform
destroys the PVC — harmless, no data loss.
## Test Plan
### Automated
```
$ ../../scripts/tg plan
Plan: 0 to add, 1 to change, 1 to destroy.
$ ../../scripts/tg apply --non-interactive
Apply complete! Resources: 0 added, 1 changed, 1 destroyed.
$ kubectl -n owntracks get pvc
NAME STATUS VOLUME ...
owntracks-data-encrypted Bound ...
(owntracks-data-proxmox gone)
```
### Manual Verification
```
$ VIKTOR_PW=$(vault kv get -field=credentials secret/owntracks | jq -r .viktor)
$ TST=$(date +%s)
$ kubectl -n owntracks run t --rm -i --image=curlimages/curl -- \
curl -s -w 'HTTP %{http_code}\n' -X POST -u "viktor:$VIKTOR_PW" \
-H 'Content-Type: application/json' \
-H 'X-Limit-U: viktor' -H 'X-Limit-D: iphone-15pro' \
-d "{\"_type\":\"location\",\"lat\":51.5074,\"lon\":-0.1278,\"tst\":$TST,\"tid\":\"vb\"}" \
https://owntracks.viktorbarzin.me/pub
HTTP 200
$ sleep 3 && kubectl -n dbaas exec pg-cluster-1 -c postgres -- \
psql -U postgres -d dawarich -tAc \
"SELECT ST_AsText(lonlat::geometry) FROM points WHERE user_id=1 AND timestamp=$TST"
POINT(-0.1278 51.5074)
```
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
Viktor wanted live forwarding from Owntracks to Dawarich so his map
stays in sync without a periodic backfill. The original plan assumed
ot-recorder honoured an `OTR_HTTPHOOK` environment variable — but
Recorder 1.0.1 (latest on Docker Hub as of Aug 2025) has no such
feature:
```
$ kubectl -n owntracks exec deploy/owntracks -- \
strings /usr/bin/ot-recorder | grep -iE 'hook|webhook|http_post'
(no matches)
```
Lua hooks, on the other hand, are first-class: `--lua-script` loads a
file and calls the `otr_hook(topic, _type, data)` function for every
publish. That is the pivot this commit makes.
## This change
Mount a Lua script via ConfigMap and tell ot-recorder to load it:
```
Phone POST /pub ---> Traefik ---> Recorder pod
|
| handle_payload() writes .rec
| otr_hook(topic,_type,data)
| |
| +---> os.execute("curl … &")
| |
| v
| Dawarich /api/v1/owntracks/points
|
+---> HTTP 200 to phone
```
Per-publish cost: one `curl` subprocess, `--max-time 5`, backgrounded
with `&` so it doesn't block the HTTP response to the phone. A
Dawarich 5xx drops exactly one point — the `.rec` write still happens,
so the one-shot backfill Job can always re-play.
`DAWARICH_API_KEY` is injected from K8s Secret `owntracks-secrets`
(sourced from Vault `secret/owntracks.dawarich_api_key` via the
existing `dataFrom.extract` ExternalSecret). The Lua reads it with
`os.getenv()` so the key never lands in Terraform state.
### Key discoveries in the verification loop (why iteration count > 1)
1. The hook function must be named `otr_hook`, not `hook` (recorder's
`luasupport.c` calls `lua_getglobal(L, "otr_hook")`). The recorder
logs `cannot invoke otr_hook in Lua script` when missing — the
plan's `hook()` naming was wrong.
2. Dawarich's `latitude`/`longitude` scalar columns are legacy and
always NULL; the authoritative geometry is in the `lonlat` PostGIS
column (`ST_AsText(lonlat::geometry)`). Early "it's broken" readings
were me querying the wrong columns.
3. Default Recreate-strategy rollouts cause ~30s 502/503 windows on
the ingress — tolerable, but every apply is visible as an outage
to the phone. Batching edits is important.
## What is NOT in this change
- **Not** OTR_HTTPHOOK. Removed with this commit (dead env var).
- **Not** the one-shot backfill Job — that comes after the phone
buffer has flushed to avoid racing against incoming hook POSTs
(follow-up: code-h2r).
- **Not** Anca's bridge — a second Recorder instance or a smarter
hook is needed to route her posts under her own Dawarich api_key
(follow-up: code-72g).
- No Ingress or Service change — Commit 1 (`a21d4a44`) already landed
those.
## Test Plan
### Automated
```
$ ../../scripts/tg apply --non-interactive
Apply complete! Resources: 1 added, 1 changed, 0 destroyed.
$ kubectl -n owntracks logs deploy/owntracks --tail=5
+ initializing Lua hooks from `/hook/dawarich-hook.lua'
+ dawarich-bridge: init
+ HTTP listener started on 0.0.0.0:8083, without browser-apikey
...
+ dawarich-bridge: tst=1 lat=0 lon=0 ok=true
```
### Manual Verification
```
$ VIKTOR_PW=$(vault kv get -field=credentials secret/owntracks | jq -r .viktor)
$ TST=$(date +%s)
$ kubectl -n owntracks run t --rm -i --image=curlimages/curl -- \
curl -s -w 'HTTP %{http_code}\n' -X POST -u "viktor:$VIKTOR_PW" \
-H 'Content-Type: application/json' \
-H 'X-Limit-U: viktor' -H 'X-Limit-D: iphone-15pro' \
-d "{\"_type\":\"location\",\"lat\":51.5074,\"lon\":-0.1278,\"tst\":$TST,\"tid\":\"vb\"}" \
https://owntracks.viktorbarzin.me/pub
HTTP 200
$ sleep 3 && kubectl -n dbaas exec pg-cluster-1 -c postgres -- \
psql -U postgres -d dawarich -c \
"SELECT timestamp, ST_AsText(lonlat::geometry) FROM points \
WHERE user_id=1 AND timestamp=$TST"
timestamp | st_astext
------------+-------------------------
1776555707 | POINT(-0.1278 51.5074)
```
Real phone traffic (from in-flight buffer flush) lands in Dawarich too:
`traefik logs -l app.kubernetes.io/name=traefik | grep 'POST /api/v1/owntracks/points'`
shows ingress POSTs from `owntracks` namespace to `dawarich` backend
with status 200.
### Reproduce locally
1. `vault login -method=oidc`
2. `kubectl -n owntracks logs deploy/owntracks --tail=20` — expect
`dawarich-bridge: init` after the Lua loader line.
3. Do the curl above, poll the DB, expect `POINT(lon lat)`.
Closes: code-z9b
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
iOS Owntracks app has been unable to upload for months — phone buffer
now holds ~1200 pending points. Last successful `.rec` write was
2026-01-02T14:32:00Z, matching when the failures started.
### The 500 — verified in Traefik access log
```
152.37.101.156 - viktor "POST /pub HTTP/1.1" 500 21 "-" "-" 47900
"owntracks-owntracks-owntracks-viktorbarzin-me@kubernetes"
"https://10.10.107.194:8083" 84ms
```
Basic-auth + middleware chain (rate-limit, csp, crowdsec) all pass.
Traefik then opens backend connection to `https://10.10.107.194:8083`.
The Recorder pod listens **plain HTTP** on :8083 (`OTR_PORT=0` disables
HTTPS in ot-recorder), so the TLS handshake never completes → 500.
### Root cause — Service port spec
`kubernetes_service.owntracks` declared the port as:
```
name: https
port: 443
targetPort: 8083
```
Traefik's IngressClass scheme inference: if the Service port is named
`https` OR numbered `443`, Traefik speaks HTTPS to that backend. Both
were true here, pointing at a plain-HTTP socket. The name/number were
purely cosmetic — a leftover from mirroring the external `:443` edge —
and worked only while Traefik's default happened to be HTTP. A Traefik
upgrade (or middleware-chain change) tightened inference and surfaced
the mismatch.
## This change
Rename port to `name=http, port=80` and update the matching Ingress
backend `port.number` from 443 to 80. `targetPort` stays at 8083.
```
Phone -----> CF tunnel -----> Traefik (:443, TLS) -----> Service
\ :80 (http)
\ |
\ v
---------------> Pod :8083
(plain HTTP hop) (HTTP listener)
```
Deployment container port label also renamed `https` → `http` for
consistency (no functional effect — just readability).
## What is NOT in this change
- **Not** switching the Recorder pod to HTTPS natively. That would
require mounting a cert + rotation plumbing. External TLS is already
terminated at Cloudflare/Traefik; in-cluster hop to the pod is
plain-HTTP by design.
- **Not** enabling `OTR_HTTPHOOK` to bridge Recorder → Dawarich
(follow-up: code-z9b).
- **Not** backfilling historical `.rec` files into Dawarich (follow-up:
code-h2r).
- Incidental: `providers.tf` + `.terraform.lock.hcl` refreshed by
`terraform init -upgrade` to pick up the goauthentik provider that
the ingress_factory module recently started requiring.
## Test Plan
### Automated
```
$ ../../scripts/tg plan
Plan: 0 to add, 3 to change, 0 to destroy.
$ ../../scripts/tg apply --non-interactive
Apply complete! Resources: 0 added, 3 changed, 0 destroyed.
$ kubectl -n owntracks get svc owntracks -o=jsonpath='{.spec.ports[0]}'
{"name":"http","port":80,"protocol":"TCP","targetPort":8083}
$ kubectl -n owntracks get ingress owntracks -o=jsonpath='{.spec.rules[0].http.paths[0].backend}'
{"service":{"name":"owntracks","port":{"number":80}}}
```
### Manual Verification
In-cluster auth'd POST through the full ingress chain:
```
VIKTOR_PW=$(vault kv get -field=credentials secret/owntracks | jq -r .viktor)
kubectl -n owntracks run curltest --rm -i --image=curlimages/curl --restart=Never -- \
curl -s -o /dev/null -w "HTTP %{http_code}\n" -X POST -u "viktor:$VIKTOR_PW" \
-H "Content-Type: application/json" \
-d '{"_type":"location","lat":0,"lon":0,"tst":1000000000,"tid":"vb"}' \
https://owntracks.viktorbarzin.me/pub
# HTTP 200
```
(previously: HTTP 500 on identical request)
### Reproduce locally
1. `vault login -method=oidc`
2. `cd infra/stacks/owntracks && ../../scripts/tg plan`
3. Expected: `Plan: 0 to add, 3 to change, 0 to destroy.` (or empty if already applied)
4. Watch next iOS Owntracks POST → Traefik access log should show `200`, not `500`.
Closes: code-nqd
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
Wave 3A (commit c9d221d5) added the `# KYVERNO_LIFECYCLE_V1` marker to the
27 pre-existing `ignore_changes = [...dns_config]` sites so they could be
grepped and audited. It did NOT address pod-owning resources that were
simply missing the suppression entirely. Post-Wave-3A sampling (2026-04-18)
found that navidrome, f1-stream, frigate, servarr, monitoring, crowdsec,
and many other stacks showed perpetual `dns_config` drift every plan
because their `kubernetes_deployment` / `kubernetes_stateful_set` /
`kubernetes_cron_job_v1` resources had no `lifecycle {}` block at all.
Root cause (same as Wave 3A): Kyverno's admission webhook stamps
`dns_config { option { name = "ndots"; value = "2" } }` on every pod's
`spec.template.spec.dns_config` to prevent NxDomain search-domain flooding
(see `k8s-ndots-search-domain-nxdomain-flood` skill). Without `ignore_changes`
on every Terraform-managed pod-owner, Terraform repeatedly tries to strip
the injected field.
## This change
Extends the Wave 3A convention by sweeping EVERY `kubernetes_deployment`,
`kubernetes_stateful_set`, `kubernetes_daemon_set`, `kubernetes_cron_job_v1`,
`kubernetes_job_v1` (+ their `_v1` variants) in the repo and ensuring each
carries the right `ignore_changes` path:
- **kubernetes_deployment / stateful_set / daemon_set / job_v1**:
`spec[0].template[0].spec[0].dns_config`
- **kubernetes_cron_job_v1**:
`spec[0].job_template[0].spec[0].template[0].spec[0].dns_config`
(extra `job_template[0]` nesting — the CronJob's PodTemplateSpec is
one level deeper)
Each injection / extension is tagged `# KYVERNO_LIFECYCLE_V1: Kyverno
admission webhook mutates dns_config with ndots=2` inline so the
suppression is discoverable via `rg 'KYVERNO_LIFECYCLE_V1' stacks/`.
Two insertion paths are handled by a Python pass (`/tmp/add_dns_config_ignore.py`):
1. **No existing `lifecycle {}`**: inject a brand-new block just before the
resource's closing `}`. 108 new blocks on 93 files.
2. **Existing `lifecycle {}` (usually for `DRIFT_WORKAROUND: CI owns image tag`
from Wave 4, commit a62b43d1)**: extend its `ignore_changes` list with the
dns_config path. Handles both inline (`= [x]`) and multiline
(`= [\n x,\n]`) forms; ensures the last pre-existing list item carries
a trailing comma so the extended list is valid HCL. 34 extensions.
The script skips anything already mentioning `dns_config` inside an
`ignore_changes`, so re-running is a no-op.
## Scale
- 142 total lifecycle injections/extensions
- 93 `.tf` files touched
- 108 brand-new `lifecycle {}` blocks + 34 extensions of existing ones
- Every Tier 0 and Tier 1 stack with a pod-owning resource is covered
- Together with Wave 3A's 27 pre-existing markers → **169 greppable
`KYVERNO_LIFECYCLE_V1` dns_config sites across the repo**
## What is NOT in this change
- `stacks/trading-bot/main.tf` — entirely commented-out block (`/* … */`).
Python script touched the file, reverted manually.
- `_template/main.tf.example` skeleton — kept minimal on purpose; any
future stack created from it should either inherit the Wave 3A one-line
form or add its own on first `kubernetes_deployment`.
- `terraform fmt` fixes to pre-existing alignment issues in meshcentral,
nvidia/modules/nvidia, vault — unrelated to this commit. Left for a
separate fmt-only pass.
- Non-pod resources (`kubernetes_service`, `kubernetes_secret`,
`kubernetes_manifest`, etc.) — they don't own pods so they don't get
Kyverno dns_config mutation.
## Verification
Random sample post-commit:
```
$ cd stacks/navidrome && ../../scripts/tg plan → No changes.
$ cd stacks/f1-stream && ../../scripts/tg plan → No changes.
$ cd stacks/frigate && ../../scripts/tg plan → No changes.
$ rg -c 'KYVERNO_LIFECYCLE_V1' stacks/ --include='*.tf' --include='*.tf.example' \
| awk -F: '{s+=$2} END {print s}'
169
```
## Reproduce locally
1. `git pull`
2. `rg 'KYVERNO_LIFECYCLE_V1' stacks/ | wc -l` → 169+
3. `cd stacks/navidrome && ../../scripts/tg plan` → expect 0 drift on
the deployment's dns_config field.
Refs: code-seq (Wave 3B dns_config class closed; kubernetes_manifest
annotation class handled separately in 8d94688d for tls_secret)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
Wave 3B-continued: the Goldilocks VPA dashboard (stacks/vpa) runs a Kyverno
ClusterPolicy `goldilocks-vpa-auto-mode` that mutates every namespace with
`metadata.labels["goldilocks.fairwinds.com/vpa-update-mode"] = "off"`. This
is intentional — Terraform owns container resource limits, and Goldilocks
should only provide recommendations, never auto-update. The label is how
Goldilocks decides per-namespace whether to run its VPA in `off` mode.
Effect on Terraform: every `kubernetes_namespace` resource shows the label
as pending-removal (`-> null`) on every `scripts/tg plan`. Dawarich survey
2026-04-18 confirmed the drift. Cluster-side count: 88 namespaces carry the
label (`kubectl get ns -o json | jq ... | wc -l`). Every TF-managed namespace
is affected.
This commit brings the intentional admission drift under the same
`# KYVERNO_LIFECYCLE_V1` discoverability marker introduced in c9d221d5 for
the ndots dns_config pattern. The marker now stands generically for any
Kyverno admission-webhook drift suppression; the inline comment records
which specific policy stamps which specific field so future grep audits
show why each suppression exists.
## This change
107 `.tf` files touched — every stack's `resource "kubernetes_namespace"`
resource gets:
```hcl
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
```
Injection was done with a brace-depth-tracking Python pass (`/tmp/add_goldilocks_ignore.py`):
match `^resource "kubernetes_namespace" ` → track `{` / `}` until the
outermost closing brace → insert the lifecycle block before the closing
brace. The script is idempotent (skips any file that already mentions
`goldilocks.fairwinds.com/vpa-update-mode`) so re-running is safe.
Vault stack picked up 2 namespaces in the same file (k8s-users produces
one, plus a second explicit ns) — confirmed via file diff (+8 lines).
## What is NOT in this change
- `stacks/trading-bot/main.tf` — entire file is `/* … */` commented out
(paused 2026-04-06 per user decision). Reverted after the script ran.
- `stacks/_template/main.tf.example` — per-stack skeleton, intentionally
minimal. User keeps it that way. Not touched by the script (file
has no real `resource "kubernetes_namespace"` — only a placeholder
comment).
- `.terraform/` copies (e.g. `stacks/metallb/.terraform/modules/...`) —
gitignored, won't commit; the live path was edited.
- `terraform fmt` cleanup of adjacent pre-existing alignment issues in
authentik, freedify, hermes-agent, nvidia, vault, meshcentral. Reverted
to keep the commit scoped to the Goldilocks sweep. Those files will
need a separate fmt-only commit or will be cleaned up on next real
apply to that stack.
## Verification
Dawarich (one of the hundred-plus touched stacks) showed the pattern
before and after:
```
$ cd stacks/dawarich && ../../scripts/tg plan
Before:
Plan: 0 to add, 2 to change, 0 to destroy.
# kubernetes_namespace.dawarich will be updated in-place
(goldilocks.fairwinds.com/vpa-update-mode -> null)
# module.tls_secret.kubernetes_secret.tls_secret will be updated in-place
(Kyverno generate.* labels — fixed in 8d94688d)
After:
No changes. Your infrastructure matches the configuration.
```
Injection count check:
```
$ rg -c 'KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode' stacks/ | awk -F: '{s+=$2} END {print s}'
108
```
## Reproduce locally
1. `git pull`
2. Pick any stack: `cd stacks/<name> && ../../scripts/tg plan`
3. Expect: no drift on the namespace's goldilocks.fairwinds.com/vpa-update-mode label.
Closes: code-dwx
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both services were running against empty unencrypted PVCs after the
proxmox-lvm-encrypted migration. Data copied from old Released PVs
via LUKS-unlock on PVE host, deployments switched to encrypted PVCs.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two-tier state architecture:
- Tier 0 (infra, platform, cnpg, vault, dbaas, external-secrets): local
state with SOPS encryption in git — unchanged, required for bootstrap.
- Tier 1 (105 app stacks): PostgreSQL backend on CNPG cluster at
10.0.20.200:5432/terraform_state with native pg_advisory_lock.
Motivation: multi-operator friction (every workstation needed SOPS + age +
git-crypt), bootstrap complexity for new operators, and headless agents/CI
needing the full encryption toolchain just to read state.
Changes:
- terragrunt.hcl: conditional backend (local vs pg) based on tier0 list
- scripts/tg: tier detection, auto-fetch PG creds from Vault for Tier 1,
skip SOPS and Vault KV locking for Tier 1 stacks
- scripts/state-sync: tier-aware encrypt/decrypt (skips Tier 1)
- scripts/migrate-state-to-pg: one-shot migration script (idempotent)
- stacks/vault/main.tf: pg-terraform-state static role + K8s auth role
for claude-agent namespace
- stacks/dbaas: terraform_state DB creation + MetalLB LoadBalancer
service on shared IP 10.0.20.200
- Deleted 107 .tfstate.enc files for migrated Tier 1 stacks
- Cleaned up per-stack tiers.tf (now generated by root terragrunt.hcl)
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Terragrunt now generates cloudflare_provider.tf (Vault-sourced API key)
and includes cloudflare in required_providers. These are the generated
files from running `terragrunt init -upgrade` across all stacks.
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Context
Deploying new services required manually adding hostnames to
cloudflare_proxied_names/cloudflare_non_proxied_names in config.tfvars —
a separate file from the service stack. This was frequently forgotten,
leaving services unreachable externally.
## This change:
- Add `dns_type` parameter to `ingress_factory` and `reverse_proxy/factory`
modules. Setting `dns_type = "proxied"` or `"non-proxied"` auto-creates
the Cloudflare DNS record (CNAME to tunnel or A/AAAA to public IP).
- Simplify cloudflared tunnel from 100 per-hostname rules to wildcard
`*.viktorbarzin.me → Traefik`. Traefik still handles host-based routing.
- Add global Cloudflare provider via terragrunt.hcl (separate
cloudflare_provider.tf with Vault-sourced API key).
- Migrate 118 hostnames from centralized config.tfvars to per-service
dns_type. 17 hostnames remain centrally managed (Helm ingresses,
special cases).
- Update docs, AGENTS.md, CLAUDE.md, dns.md runbook.
```
BEFORE AFTER
config.tfvars (manual list) stacks/<svc>/main.tf
| module "ingress" {
v dns_type = "proxied"
stacks/cloudflared/ }
for_each = list |
cloudflare_record auto-creates
tunnel per-hostname cloudflare_record + annotation
```
## What is NOT in this change:
- Uptime Kuma monitor migration (still reads from config.tfvars)
- 17 remaining centrally-managed hostnames (Helm, special cases)
- Removal of allow_overwrite (keep until migration confirmed stable)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replaced data "vault_kv_secret_v2" with:
1. ExternalSecret (ESO syncs Vault KV → K8s Secret)
2. data "kubernetes_secret" (reads ESO-created secret at plan time)
This removes the Vault provider dependency at plan time for these
stacks — they now only need K8s API access, not a Vault token.
Stacks: actualbudget, affine, audiobookshelf, calibre, changedetection,
coturn, freedify, freshrss, grampsweb, navidrome, novelapp, ollama,
owntracks, real-estate-crawler, servarr, ytdlp
SQLite backup via Online Backup API + copy of RSA keys,
attachments, sends, and config. 30-day retention with rotation.
Pod affinity ensures co-scheduling with vaultwarden for RWO PVC access.
- Set memory requests = limits across 56 stacks to prevent overcommit
- Right-sized limits based on actual pod usage (2x actual, rounded up)
- Scaled down trading-bot (replicas=0) to free memory
- Fixed OOMKilled services: forgejo, dawarich, health, meshcentral,
paperless-ngx, vault auto-unseal, rybbit, whisper, openclaw, clickhouse
- Added startup+liveness probes to calibre-web
- Bumped inotify limits on nodes 2,3 (max_user_instances 128->8192)
Post node2 OOM incident (2026-03-14). Previous kubelet config had no
kubeReserved/systemReserved set, allowing pods to starve the kernel.
- Add vault provider to root terragrunt.hcl (generated providers.tf)
- Delete stacks/vault/vault_provider.tf (now in generated providers.tf)
- Add 124 variable declarations + 43 vault_kv_secret_v2 resources to
vault/main.tf to populate Vault KV at secret/<stack-name>
- Migrate 43 consuming stacks to read secrets from Vault KV via
data "vault_kv_secret_v2" instead of SOPS var-file
- Add dependency "vault" to all migrated stacks' terragrunt.hcl
- Complex types (maps/lists) stored as JSON strings, decoded with
jsondecode() in locals blocks
Bootstrap secrets (vault_root_token, vault_authentik_client_id,
vault_authentik_client_secret) remain in SOPS permanently.
Apply order: vault stack first (populates KV), then all others.
CPU limits cause CFS throttling even when nodes have idle capacity.
Move to a request-only CPU model: keep CPU requests for scheduling
fairness but remove all CPU limits. Memory limits stay (incompressible).
Changes across 108 files:
- Kyverno LimitRange policy: remove cpu from default/max in all 6 tiers
- Kyverno ResourceQuota policy: remove limits.cpu from all 5 tiers
- Custom ResourceQuotas: remove limits.cpu from 8 namespace quotas
- Custom LimitRanges: remove cpu from default/max (nextcloud, onlyoffice)
- RBAC module: remove cpu_limits variable and quota reference
- Freedify factory: remove cpu_limit variable and limits reference
- 86 deployment files: remove cpu from all limits blocks
- 6 Helm values files: remove cpu under limits sections
Add Kubernetes ingress annotations for Homepage auto-discovery across
~88 services organized into 11 groups. Enable serviceAccount for RBAC,
configure group layouts, and add Grafana/Frigate/Speedtest widgets.
Phase 5 — CI pipelines:
- default.yml: add SOPS decrypt in prepare step, change git add . to
specific paths (stacks/ state/ .woodpecker/), cleanup on success+failure
- renew-tls.yml: change git add . to git add secrets/ state/
Phase 6 — sensitive=true:
- Add sensitive = true to 256 variable declarations across 149 stack files
- Prevents secret values from appearing in terraform plan output
- Does NOT modify shared modules (ingress_factory, nfs_volume) to avoid
breaking module interface contracts
Note: CI pipeline SOPS decryption requires sops_age_key Woodpecker secret
to be created before the pipeline will work with SOPS. Until then, the old
terraform.tfvars path continues to function.
Remove the module "xxx" { source = "./module" } indirection layer
from all 66 service stacks. Resources are now defined directly in
each stack's main.tf instead of through a wrapper module.
- Merge module/main.tf contents into stack main.tf
- Apply variable replacements (var.tier -> local.tiers.X, renamed vars)
- Fix shared module paths (one fewer ../ at each level)
- Move extra files/dirs (factory/, chart_values, subdirs) to stack root
- Update state files to strip module.<name>. prefix
- Update CLAUDE.md to reflect flat structure
Verified: terragrunt plan shows 0 add, 0 destroy across all stacks.
Move all 88 service modules (66 individual + 22 platform) from
modules/kubernetes/<service>/ into their corresponding stack directories:
- Service stacks: stacks/<service>/module/
- Platform stack: stacks/platform/modules/<service>/
This collocates module source code with its Terragrunt definition.
Only shared utility modules remain in modules/kubernetes/:
ingress_factory, setup_tls_secret, dockerhub_secret, oauth-proxy.
All cross-references to shared modules updated to use correct
relative paths. Verified with terragrunt run --all -- plan:
0 adds, 0 destroys across all 68 stacks.
Generated individual stack directories for all 66 services under stacks/.
Each stack has terragrunt.hcl (depends on platform) and main.tf (thin
wrapper calling existing module). Migrated all 64 active service states
from root terraform.tfstate to individual state files. Root state is now
empty. Verified with terragrunt plan on multiple stacks (no changes).