TrueNAS VM 9000 at 10.0.10.15 was operationally decommissioned 2026-04-13.
The subagent-driven doc sweep in 5a0b24f5 covered the prose. This commit
removes the remaining in-code references:
- reverse-proxy: drop truenas Traefik ingress + Cloudflare record
(truenas.viktorbarzin.me was 502-ing since the VM stopped), drop
truenas_homepage_token variable.
- config.tfvars: drop deprecated `truenas IN A 10.0.10.15`, `iscsi CNAME
truenas`, and the commented-out `iscsi`/`zabbix` A records.
- dashy/conf.yml: remove Truenas dashboard entry (&ref_28).
- monitoring/loki.yaml: change storageClass from the decommissioned
`iscsi-truenas` to `proxmox-lvm` so a future re-enable has a valid SC
(Loki is currently disabled).
- actualbudget/main.tf + freedify/main.tf: update new-deployment
docstrings to cite Proxmox host NFS instead of TrueNAS.
- nfs-csi: add an explanatory comment to the `nfs-truenas` StorageClass
noting the name is historical — 48 bound PVs reference it, SC names
are immutable on PVs, rename not worth the churn.
Also cleaned out-of-band:
- Technitium DNS: deleted `truenas.viktorbarzin.lan` A and
`iscsi.viktorbarzin.lan` CNAME records.
- Vault: `secret/viktor` → removed `truenas_api_key` and
`truenas_ssh_private_key`; `secret/platform.homepage_credentials.reverse_proxy.truenas_token` removed.
- Terraform-applied: `scripts/tg apply -target=module.reverse-proxy.module.truenas`
destroyed the 3 K8s/Cloudflare resources cleanly.
Deferred:
- VM 9000 is still stopped on PVE. Deletion (destructive) awaits explicit
user go-ahead.
- `nfs-truenas` StorageClass name retained (see nfs-csi comment above).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
Wave 3A (commit c9d221d5) added the `# KYVERNO_LIFECYCLE_V1` marker to the
27 pre-existing `ignore_changes = [...dns_config]` sites so they could be
grepped and audited. It did NOT address pod-owning resources that were
simply missing the suppression entirely. Post-Wave-3A sampling (2026-04-18)
found that navidrome, f1-stream, frigate, servarr, monitoring, crowdsec,
and many other stacks showed perpetual `dns_config` drift every plan
because their `kubernetes_deployment` / `kubernetes_stateful_set` /
`kubernetes_cron_job_v1` resources had no `lifecycle {}` block at all.
Root cause (same as Wave 3A): Kyverno's admission webhook stamps
`dns_config { option { name = "ndots"; value = "2" } }` on every pod's
`spec.template.spec.dns_config` to prevent NxDomain search-domain flooding
(see `k8s-ndots-search-domain-nxdomain-flood` skill). Without `ignore_changes`
on every Terraform-managed pod-owner, Terraform repeatedly tries to strip
the injected field.
## This change
Extends the Wave 3A convention by sweeping EVERY `kubernetes_deployment`,
`kubernetes_stateful_set`, `kubernetes_daemon_set`, `kubernetes_cron_job_v1`,
`kubernetes_job_v1` (+ their `_v1` variants) in the repo and ensuring each
carries the right `ignore_changes` path:
- **kubernetes_deployment / stateful_set / daemon_set / job_v1**:
`spec[0].template[0].spec[0].dns_config`
- **kubernetes_cron_job_v1**:
`spec[0].job_template[0].spec[0].template[0].spec[0].dns_config`
(extra `job_template[0]` nesting — the CronJob's PodTemplateSpec is
one level deeper)
Each injection / extension is tagged `# KYVERNO_LIFECYCLE_V1: Kyverno
admission webhook mutates dns_config with ndots=2` inline so the
suppression is discoverable via `rg 'KYVERNO_LIFECYCLE_V1' stacks/`.
Two insertion paths are handled by a Python pass (`/tmp/add_dns_config_ignore.py`):
1. **No existing `lifecycle {}`**: inject a brand-new block just before the
resource's closing `}`. 108 new blocks on 93 files.
2. **Existing `lifecycle {}` (usually for `DRIFT_WORKAROUND: CI owns image tag`
from Wave 4, commit a62b43d1)**: extend its `ignore_changes` list with the
dns_config path. Handles both inline (`= [x]`) and multiline
(`= [\n x,\n]`) forms; ensures the last pre-existing list item carries
a trailing comma so the extended list is valid HCL. 34 extensions.
The script skips anything already mentioning `dns_config` inside an
`ignore_changes`, so re-running is a no-op.
## Scale
- 142 total lifecycle injections/extensions
- 93 `.tf` files touched
- 108 brand-new `lifecycle {}` blocks + 34 extensions of existing ones
- Every Tier 0 and Tier 1 stack with a pod-owning resource is covered
- Together with Wave 3A's 27 pre-existing markers → **169 greppable
`KYVERNO_LIFECYCLE_V1` dns_config sites across the repo**
## What is NOT in this change
- `stacks/trading-bot/main.tf` — entirely commented-out block (`/* … */`).
Python script touched the file, reverted manually.
- `_template/main.tf.example` skeleton — kept minimal on purpose; any
future stack created from it should either inherit the Wave 3A one-line
form or add its own on first `kubernetes_deployment`.
- `terraform fmt` fixes to pre-existing alignment issues in meshcentral,
nvidia/modules/nvidia, vault — unrelated to this commit. Left for a
separate fmt-only pass.
- Non-pod resources (`kubernetes_service`, `kubernetes_secret`,
`kubernetes_manifest`, etc.) — they don't own pods so they don't get
Kyverno dns_config mutation.
## Verification
Random sample post-commit:
```
$ cd stacks/navidrome && ../../scripts/tg plan → No changes.
$ cd stacks/f1-stream && ../../scripts/tg plan → No changes.
$ cd stacks/frigate && ../../scripts/tg plan → No changes.
$ rg -c 'KYVERNO_LIFECYCLE_V1' stacks/ --include='*.tf' --include='*.tf.example' \
| awk -F: '{s+=$2} END {print s}'
169
```
## Reproduce locally
1. `git pull`
2. `rg 'KYVERNO_LIFECYCLE_V1' stacks/ | wc -l` → 169+
3. `cd stacks/navidrome && ../../scripts/tg plan` → expect 0 drift on
the deployment's dns_config field.
Refs: code-seq (Wave 3B dns_config class closed; kubernetes_manifest
annotation class handled separately in 8d94688d for tls_secret)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
Wave 3B-continued: the Goldilocks VPA dashboard (stacks/vpa) runs a Kyverno
ClusterPolicy `goldilocks-vpa-auto-mode` that mutates every namespace with
`metadata.labels["goldilocks.fairwinds.com/vpa-update-mode"] = "off"`. This
is intentional — Terraform owns container resource limits, and Goldilocks
should only provide recommendations, never auto-update. The label is how
Goldilocks decides per-namespace whether to run its VPA in `off` mode.
Effect on Terraform: every `kubernetes_namespace` resource shows the label
as pending-removal (`-> null`) on every `scripts/tg plan`. Dawarich survey
2026-04-18 confirmed the drift. Cluster-side count: 88 namespaces carry the
label (`kubectl get ns -o json | jq ... | wc -l`). Every TF-managed namespace
is affected.
This commit brings the intentional admission drift under the same
`# KYVERNO_LIFECYCLE_V1` discoverability marker introduced in c9d221d5 for
the ndots dns_config pattern. The marker now stands generically for any
Kyverno admission-webhook drift suppression; the inline comment records
which specific policy stamps which specific field so future grep audits
show why each suppression exists.
## This change
107 `.tf` files touched — every stack's `resource "kubernetes_namespace"`
resource gets:
```hcl
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
```
Injection was done with a brace-depth-tracking Python pass (`/tmp/add_goldilocks_ignore.py`):
match `^resource "kubernetes_namespace" ` → track `{` / `}` until the
outermost closing brace → insert the lifecycle block before the closing
brace. The script is idempotent (skips any file that already mentions
`goldilocks.fairwinds.com/vpa-update-mode`) so re-running is safe.
Vault stack picked up 2 namespaces in the same file (k8s-users produces
one, plus a second explicit ns) — confirmed via file diff (+8 lines).
## What is NOT in this change
- `stacks/trading-bot/main.tf` — entire file is `/* … */` commented out
(paused 2026-04-06 per user decision). Reverted after the script ran.
- `stacks/_template/main.tf.example` — per-stack skeleton, intentionally
minimal. User keeps it that way. Not touched by the script (file
has no real `resource "kubernetes_namespace"` — only a placeholder
comment).
- `.terraform/` copies (e.g. `stacks/metallb/.terraform/modules/...`) —
gitignored, won't commit; the live path was edited.
- `terraform fmt` cleanup of adjacent pre-existing alignment issues in
authentik, freedify, hermes-agent, nvidia, vault, meshcentral. Reverted
to keep the commit scoped to the Goldilocks sweep. Those files will
need a separate fmt-only commit or will be cleaned up on next real
apply to that stack.
## Verification
Dawarich (one of the hundred-plus touched stacks) showed the pattern
before and after:
```
$ cd stacks/dawarich && ../../scripts/tg plan
Before:
Plan: 0 to add, 2 to change, 0 to destroy.
# kubernetes_namespace.dawarich will be updated in-place
(goldilocks.fairwinds.com/vpa-update-mode -> null)
# module.tls_secret.kubernetes_secret.tls_secret will be updated in-place
(Kyverno generate.* labels — fixed in 8d94688d)
After:
No changes. Your infrastructure matches the configuration.
```
Injection count check:
```
$ rg -c 'KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode' stacks/ | awk -F: '{s+=$2} END {print s}'
108
```
## Reproduce locally
1. `git pull`
2. Pick any stack: `cd stacks/<name> && ../../scripts/tg plan`
3. Expect: no drift on the namespace's goldilocks.fairwinds.com/vpa-update-mode label.
Closes: code-dwx
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
Actual Budget v26.4.0 (released 2026-04-05) re-introduces the Sankey
chart report for income/expense flow visualization (PR #7220). An earlier
experimental implementation was deleted in March 2024 (PR #2417) but a
proper reimplementation with "Other" grouping, date-range selection, and
percentage toggle is now shipped behind the experimental feature flag.
Viktor wanted Sankey visualization of budget cash flow; this is the lowest-
cost path since his existing Actual Budget deployment already holds all the
transaction data.
## This change
Bumps the `tag` input on all three factory module calls (viktor, anca, emo)
from `26.3.0` to `26.4.0`. No breaking changes, schema migrations, or config
changes per the 26.4.0 release notes.
## Rollout
Applied via `scripts/tg apply --non-interactive`. All three pods rolled
successfully to `actualbudget/actual-server:26.4.0` and passed readiness
probes. The http-api sidecars (`jhonderson/actual-http-api`) were untouched.
## Post-upgrade
Users need to toggle Settings → Experimental features → Sankey report to
access the chart, then Reports → new Sankey widget.
Closes: code-oof
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The rewrite-body Traefik plugin (both packruler/rewrite-body v1.2.0 and
the-ccsn/traefik-plugin-rewritebody v0.1.3) silently fails on Traefik
v3.6.12 due to Yaegi interpreter issues with ResponseWriter wrapping.
Both plugins load without errors but never inject content.
Removed:
- rewrite-body plugin download (init container) and registration
- strip-accept-encoding middleware (only existed for rewrite-body bug)
- anti-ai-trap-links middleware (used rewrite-body for injection)
- rybbit_site_id variable from ingress_factory and reverse_proxy factory
- rybbit_site_id from 25 service stacks (39 instances)
- Per-service rybbit-analytics middleware CRD resources
Kept:
- compress middleware (entrypoint-level, working correctly)
- ai-bot-block middleware (ForwardAuth to bot-block-proxy)
- anti-ai-headers middleware (X-Robots-Tag: noai, noimageai)
- All CrowdSec, Authentik, rate-limit middleware unchanged
Next: Cloudflare Workers with HTMLRewriter for edge-side injection.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Context
Two operational gaps surfaced during a healthcheck sweep today:
1. **External monitoring coverage**: Only ~13 hostnames (via `cloudflare_proxied_names`
in `config.tfvars`) had `[External]` monitors in Uptime Kuma. Any service deployed via
`ingress_factory` with `dns_type = "proxied"` auto-created its DNS record but was NOT
registered for external probing — so outages like Immich going down externally were
invisible until a user complained. 99 of ~125 public ingresses had no external
monitor.
2. **actualbudget stack unplannable**: `count = var.budget_encryption_password != null
? 1 : 0` in `factory/main.tf:152` failed with "Invalid count argument" because the
value flows from a `data.kubernetes_secret` whose contents are `(known after apply)`
at plan time. Blocked CI applies and drift reconciliation.
## This change
### Per-ingress external-monitor annotation (ingress_factory + reverse_proxy/factory)
- New variables `external_monitor` (bool, nullable) + `external_monitor_name` (string,
nullable). Default is "follow dns_type" — enabled for any public DNS record
(`dns_type != "none"`, covers both proxied and non-proxied so Immich and other
direct-A records are also monitored).
- Emits two annotations on the Ingress:
- `uptime.viktorbarzin.me/external-monitor = "true"`
- `uptime.viktorbarzin.me/external-monitor-name = "<label>"` (optional override)
### external-monitor-sync CronJob (uptime-kuma stack)
- Discovers targets from live Ingress objects via the K8s API first (filter by
annotation), falls back to the legacy `external-monitor-targets` ConfigMap on any
API error (zero rollout risk).
- New `ServiceAccount` + cluster-wide `ClusterRole`/`ClusterRoleBinding` giving
`list`/`get` on `networking.k8s.io/ingresses`.
- `API_SERVER` now uses the `KUBERNETES_SERVICE_HOST` env var (always injected by K8s)
instead of `kubernetes.default.svc` — the search-domain expansion failed in the
CronJob pod's DNS config. Verified working: CronJob now logs
`Loaded N external monitor targets (source=k8s-api)`.
### actualbudget count-on-unknown refactor
- Replaced `count = var.budget_encryption_password != null ? 1 : 0` with two explicit
plan-time booleans: `enable_http_api` and `enable_bank_sync`. Values are known at
plan; no `-target` workaround needed.
- Callers (`stacks/actualbudget/main.tf`) pass `true` explicitly. Runtime behaviour is
unchanged — the secret is still consumed via env var.
- Also aligned the factory with live state (the 3 budget-* PVCs had been migrated
`proxmox-lvm` → `proxmox-lvm-encrypted` outside Terraform): PVC resource renamed
`data_proxmox` → `data_encrypted`, storage class updated, orphaned `nfs_data` module
removed. State was rm'd + re-imported with matching UIDs, so no data was moved.
## Rollout status (already partially applied in this session)
- `stacks/uptime-kuma` applied — SA + RBAC + CronJob changes live; FQDN fix verified
- `stacks/actualbudget` applied — budget-{viktor,anca,emo} all 200 OK externally
- `stacks/mailserver` + 21 other ingress_factory consumers applied — annotations live
- CronJob `external-monitor-sync` latest run: `source=k8s-api`, 26 monitors active
(was 13 on the central list)
## Deferred (separate work)
- 4 stacks show pre-existing DESTRUCTIVE drift in plan (metallb namespace, claude-memory,
rbac, redis) — NOT triggered by this commit but will be by CI's global-file cascade.
`[ci skip]` here so those don't auto-apply; they will be fixed manually before the
next CI push.
- Cleanup of `cloudflare_proxied_names` list once Helm-managed ingresses (authentik,
grafana, vault, forgejo) are annotated — separate PR.
## Test plan
### Automated
\`\`\`
\$ kubectl -n uptime-kuma logs \$(kubectl -n uptime-kuma get pods -l job-name -o name | tail -1)
Loaded 26 external monitor targets (source=k8s-api)
Sync complete: 7 created, 0 deleted, 17 unchanged
\$ curl -sk -o /dev/null -w "%{http_code}\n" -H "Accept: text/html" \\
https://dawarich.viktorbarzin.me/https://nextcloud.viktorbarzin.me/ \\
https://budget-viktor.viktorbarzin.me/
200 302 200
\$ kubectl -n actualbudget get deploy,pvc -l app=budget-viktor
deployment.apps/budget-viktor 1/1 1 1 Ready
persistentvolumeclaim/budget-viktor-data-encrypted Bound 10Gi RWO proxmox-lvm-encrypted
\`\`\`
### Manual Verification
1. Confirm the annotation is present on an ingress_factory ingress:
\`\`\`
kubectl -n dawarich get ingress dawarich -o \\
jsonpath='{.metadata.annotations.uptime\.viktorbarzin\.me/external-monitor}'
# Expected: "true"
\`\`\`
2. Confirm the new `[External] <name>` monitor appears in Uptime Kuma within 10 min
(CronJob interval). For Immich specifically, it will appear after the immich stack
is re-applied.
3. Verify actualbudget plan is clean:
\`\`\`
cd stacks/actualbudget && scripts/tg plan --non-interactive
# Expected: no "Invalid count argument" errors
\`\`\`
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two-tier state architecture:
- Tier 0 (infra, platform, cnpg, vault, dbaas, external-secrets): local
state with SOPS encryption in git — unchanged, required for bootstrap.
- Tier 1 (105 app stacks): PostgreSQL backend on CNPG cluster at
10.0.20.200:5432/terraform_state with native pg_advisory_lock.
Motivation: multi-operator friction (every workstation needed SOPS + age +
git-crypt), bootstrap complexity for new operators, and headless agents/CI
needing the full encryption toolchain just to read state.
Changes:
- terragrunt.hcl: conditional backend (local vs pg) based on tier0 list
- scripts/tg: tier detection, auto-fetch PG creds from Vault for Tier 1,
skip SOPS and Vault KV locking for Tier 1 stacks
- scripts/state-sync: tier-aware encrypt/decrypt (skips Tier 1)
- scripts/migrate-state-to-pg: one-shot migration script (idempotent)
- stacks/vault/main.tf: pg-terraform-state static role + K8s auth role
for claude-agent namespace
- stacks/dbaas: terraform_state DB creation + MetalLB LoadBalancer
service on shared IP 10.0.20.200
- Deleted 107 .tfstate.enc files for migrated Tier 1 stacks
- Cleaned up per-stack tiers.tf (now generated by root terragrunt.hcl)
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Terragrunt now generates cloudflare_provider.tf (Vault-sourced API key)
and includes cloudflare in required_providers. These are the generated
files from running `terragrunt init -upgrade` across all stacks.
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Context
Deploying new services required manually adding hostnames to
cloudflare_proxied_names/cloudflare_non_proxied_names in config.tfvars —
a separate file from the service stack. This was frequently forgotten,
leaving services unreachable externally.
## This change:
- Add `dns_type` parameter to `ingress_factory` and `reverse_proxy/factory`
modules. Setting `dns_type = "proxied"` or `"non-proxied"` auto-creates
the Cloudflare DNS record (CNAME to tunnel or A/AAAA to public IP).
- Simplify cloudflared tunnel from 100 per-hostname rules to wildcard
`*.viktorbarzin.me → Traefik`. Traefik still handles host-based routing.
- Add global Cloudflare provider via terragrunt.hcl (separate
cloudflare_provider.tf with Vault-sourced API key).
- Migrate 118 hostnames from centralized config.tfvars to per-service
dns_type. 17 hostnames remain centrally managed (Helm ingresses,
special cases).
- Update docs, AGENTS.md, CLAUDE.md, dns.md runbook.
```
BEFORE AFTER
config.tfvars (manual list) stacks/<svc>/main.tf
| module "ingress" {
v dns_type = "proxied"
stacks/cloudflared/ }
for_each = list |
cloudflare_record auto-creates
tunnel per-hostname cloudflare_record + annotation
```
## What is NOT in this change:
- Uptime Kuma monitor migration (still reads from config.tfvars)
- 17 remaining centrally-managed hostnames (Helm, special cases)
- Removal of allow_overwrite (keep until migration confirmed stable)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Applied all 20 NFS stacks to converge PV mount_options (nfsvers=4).
State files encrypted and committed.
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The http-api sidecar was connecting to the public URL
(https://budget-*.viktorbarzin.me) which goes through Traefik/Authentik.
When pods got rescheduled to different nodes, this caused ETIMEDOUT errors.
Changed to internal service URL (http://budget-*.actualbudget.svc.cluster.local)
which is fast and reliable regardless of pod placement.
Add proxmox-lvm PVCs with pvc-autoresizer annotations for all
SQLite-backed services. Deployments updated to use new block storage
PVCs. Old NFS modules retained for 1-week rollback.
Services: ntfy, freshrss, insta2spotify, actualbudget (x3),
wealthfolio, navidrome (DB only), audiobookshelf config,
headscale, forgejo, uptime-kuma.
Also: set Recreate strategy on ntfy, forgejo, insta2spotify,
wealthfolio (required for RWO volumes).
Replaced data "vault_kv_secret_v2" with:
1. ExternalSecret (ESO syncs Vault KV → K8s Secret)
2. data "kubernetes_secret" (reads ESO-created secret at plan time)
This removes the Vault provider dependency at plan time for these
stacks — they now only need K8s API access, not a Vault token.
Stacks: actualbudget, affine, audiobookshelf, calibre, changedetection,
coturn, freedify, freshrss, grampsweb, navidrome, novelapp, ollama,
owntracks, real-estate-crawler, servarr, ytdlp
Vault is now the sole source of truth for secrets. SOPS pipeline
removed entirely — auth via `vault login -method=oidc`.
Part A: SOPS removal
- vault/main.tf: delete 990 lines (93 vars + 43 KV write resources),
add self-read data source for OIDC creds from secret/vault
- terragrunt.hcl: remove SOPS var loading, vault_root_token, check_secrets hook
- scripts/tg: remove SOPS decryption, keep -auto-approve logic
- .woodpecker/default.yml: replace SOPS with Vault K8s auth via curl
- Delete secrets.sops.json, .sops.yaml
Part B: External Secrets Operator
- New stack stacks/external-secrets/ with Helm chart + 2 ClusterSecretStores
(vault-kv for KV v2, vault-database for DB engine)
Part C: Database secrets engine (in vault/main.tf)
- MySQL + PostgreSQL connections with static role rotation (24h)
- 6 MySQL roles (speedtest, wrongmove, codimd, nextcloud, shlink, grafana)
- 6 PostgreSQL roles (trading, health, linkwarden, affine, woodpecker, claude_memory)
Part D: Kubernetes secrets engine (in vault/main.tf)
- RBAC for Vault SA to manage K8s tokens
- Roles: dashboard-admin, ci-deployer, openclaw, local-admin
- New scripts/vault-kubeconfig helper for dynamic kubeconfig
K8s auth method with scoped policies for CI, ESO, OpenClaw, Woodpecker sync.
After node2 OOM incident, right-size memory across the cluster by setting
requests=limits based on max_over_time(container_memory_working_set_bytes[7d])
with 1.3x headroom. Eliminates ~37Gi overcommit gap.
Categories:
- Safe equalization (50 containers): set req=lim where max7d well within target
- Limit increases (8 containers): raise limits for services spiking above current
- No Prometheus data (12 containers): conservatively set lim=req
- Exception: nextcloud keeps req=256Mi/lim=8Gi due to Apache memory spikes
Also increased dbaas namespace quota from 12Gi to 16Gi to accommodate mysql
4Gi limits across 3 replicas.
- Add vault provider to root terragrunt.hcl (generated providers.tf)
- Delete stacks/vault/vault_provider.tf (now in generated providers.tf)
- Add 124 variable declarations + 43 vault_kv_secret_v2 resources to
vault/main.tf to populate Vault KV at secret/<stack-name>
- Migrate 43 consuming stacks to read secrets from Vault KV via
data "vault_kv_secret_v2" instead of SOPS var-file
- Add dependency "vault" to all migrated stacks' terragrunt.hcl
- Complex types (maps/lists) stored as JSON strings, decoded with
jsondecode() in locals blocks
Bootstrap secrets (vault_root_token, vault_authentik_client_id,
vault_authentik_client_secret) remain in SOPS permanently.
Apply order: vault stack first (populates KV), then all others.
CPU limits cause CFS throttling even when nodes have idle capacity.
Move to a request-only CPU model: keep CPU requests for scheduling
fairness but remove all CPU limits. Memory limits stay (incompressible).
Changes across 108 files:
- Kyverno LimitRange policy: remove cpu from default/max in all 6 tiers
- Kyverno ResourceQuota policy: remove limits.cpu from all 5 tiers
- Custom ResourceQuotas: remove limits.cpu from 8 namespace quotas
- Custom LimitRanges: remove cpu from default/max (nextcloud, onlyoffice)
- RBAC module: remove cpu_limits variable and quota reference
- Freedify factory: remove cpu_limit variable and limits reference
- 86 deployment files: remove cpu from all limits blocks
- 6 Helm values files: remove cpu under limits sections
- Pin actualbudget/actual-server from edge to 26.3.0 (all 3 instances) to
prevent recurring migration breakage from rolling nightly builds
- Add podAntiAffinity to MySQL InnoDB Cluster to spread replicas across nodes,
relieving memory pressure on k8s-node4
- Scale grampsweb to 0 replicas (unused, consuming 1.7Gi memory)
- Add GPU toleration Kyverno policy to Terraform using patchesJson6902 instead
of patchStrategicMerge to fix toleration array being overwritten (caused
caretta DaemonSet pod to be unable to schedule on k8s-master)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add Kubernetes ingress annotations for Homepage auto-discovery across
~88 services organized into 11 groups. Enable serviceAccount for RBAC,
configure group layouts, and add Grafana/Frigate/Speedtest widgets.
Phase 5 — CI pipelines:
- default.yml: add SOPS decrypt in prepare step, change git add . to
specific paths (stacks/ state/ .woodpecker/), cleanup on success+failure
- renew-tls.yml: change git add . to git add secrets/ state/
Phase 6 — sensitive=true:
- Add sensitive = true to 256 variable declarations across 149 stack files
- Prevents secret values from appearing in terraform plan output
- Does NOT modify shared modules (ingress_factory, nfs_volume) to avoid
breaking module interface contracts
Note: CI pipeline SOPS decryption requires sops_age_key Woodpecker secret
to be created before the pipeline will work with SOPS. Until then, the old
terraform.tfvars path continues to function.
Final batch: servarr (aiostreams, listenarr, readarr, soulseek,
prowlarr, qbittorrent, lidarr) and actualbudget factory.
All use ../../../modules/kubernetes/nfs_volume (3 levels deep).
Major milestone - shared PostgreSQL moved from NFS to CloudNativePG:
- CNPG cluster (pg-cluster) running in dbaas namespace on local-path storage
- PostGIS image (ghcr.io/cloudnative-pg/postgis:16) for dawarich compatibility
- All 20 databases and 19 roles restored from pg_dumpall backup
- postgresql.dbaas Service patched to point at CNPG primary
- Old PG deployment scaled to 0 (NFS data intact for rollback)
- All 12+ dependent services verified running:
authentik, n8n, dawarich, tandoor, linkwarden, netbox, woodpecker,
rybbit, affine, health, resume, trading-bot, atuin
- Authentik PgBouncer working through the switched endpoint
TODO: codify CNPG cluster in Terraform, add 2nd replica, update backup CronJob
- Add missing nvidia.com/gpu toleration to ollama and yt-highlights deployments
- Add node_selector gpu=true to ollama deployment
- Pass nfs_server variable through to actualbudget factory modules
- Fix AuthentikDown alert to match actual deployment name (goauthentik-server)
Remove the module "xxx" { source = "./module" } indirection layer
from all 66 service stacks. Resources are now defined directly in
each stack's main.tf instead of through a wrapper module.
- Merge module/main.tf contents into stack main.tf
- Apply variable replacements (var.tier -> local.tiers.X, renamed vars)
- Fix shared module paths (one fewer ../ at each level)
- Move extra files/dirs (factory/, chart_values, subdirs) to stack root
- Update state files to strip module.<name>. prefix
- Update CLAUDE.md to reflect flat structure
Verified: terragrunt plan shows 0 add, 0 destroy across all stacks.
Move all 88 service modules (66 individual + 22 platform) from
modules/kubernetes/<service>/ into their corresponding stack directories:
- Service stacks: stacks/<service>/module/
- Platform stack: stacks/platform/modules/<service>/
This collocates module source code with its Terragrunt definition.
Only shared utility modules remain in modules/kubernetes/:
ingress_factory, setup_tls_secret, dockerhub_secret, oauth-proxy.
All cross-references to shared modules updated to use correct
relative paths. Verified with terragrunt run --all -- plan:
0 adds, 0 destroys across all 68 stacks.
Generated individual stack directories for all 66 services under stacks/.
Each stack has terragrunt.hcl (depends on platform) and main.tf (thin
wrapper calling existing module). Migrated all 64 active service states
from root terraform.tfstate to individual state files. Root state is now
empty. Verified with terragrunt plan on multiple stacks (no changes).