Two-tier state architecture:
- Tier 0 (infra, platform, cnpg, vault, dbaas, external-secrets): local
state with SOPS encryption in git — unchanged, required for bootstrap.
- Tier 1 (105 app stacks): PostgreSQL backend on CNPG cluster at
10.0.20.200:5432/terraform_state with native pg_advisory_lock.
Motivation: multi-operator friction (every workstation needed SOPS + age +
git-crypt), bootstrap complexity for new operators, and headless agents/CI
needing the full encryption toolchain just to read state.
Changes:
- terragrunt.hcl: conditional backend (local vs pg) based on tier0 list
- scripts/tg: tier detection, auto-fetch PG creds from Vault for Tier 1,
skip SOPS and Vault KV locking for Tier 1 stacks
- scripts/state-sync: tier-aware encrypt/decrypt (skips Tier 1)
- scripts/migrate-state-to-pg: one-shot migration script (idempotent)
- stacks/vault/main.tf: pg-terraform-state static role + K8s auth role
for claude-agent namespace
- stacks/dbaas: terraform_state DB creation + MetalLB LoadBalancer
service on shared IP 10.0.20.200
- Deleted 107 .tfstate.enc files for migrated Tier 1 stacks
- Cleaned up per-stack tiers.tf (now generated by root terragrunt.hcl)
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Terragrunt now generates cloudflare_provider.tf (Vault-sourced API key)
and includes cloudflare in required_providers. These are the generated
files from running `terragrunt init -upgrade` across all stacks.
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Context
Deploying new services required manually adding hostnames to
cloudflare_proxied_names/cloudflare_non_proxied_names in config.tfvars —
a separate file from the service stack. This was frequently forgotten,
leaving services unreachable externally.
## This change:
- Add `dns_type` parameter to `ingress_factory` and `reverse_proxy/factory`
modules. Setting `dns_type = "proxied"` or `"non-proxied"` auto-creates
the Cloudflare DNS record (CNAME to tunnel or A/AAAA to public IP).
- Simplify cloudflared tunnel from 100 per-hostname rules to wildcard
`*.viktorbarzin.me → Traefik`. Traefik still handles host-based routing.
- Add global Cloudflare provider via terragrunt.hcl (separate
cloudflare_provider.tf with Vault-sourced API key).
- Migrate 118 hostnames from centralized config.tfvars to per-service
dns_type. 17 hostnames remain centrally managed (Helm ingresses,
special cases).
- Update docs, AGENTS.md, CLAUDE.md, dns.md runbook.
```
BEFORE AFTER
config.tfvars (manual list) stacks/<svc>/main.tf
| module "ingress" {
v dns_type = "proxied"
stacks/cloudflared/ }
for_each = list |
cloudflare_record auto-creates
tunnel per-hostname cloudflare_record + annotation
```
## What is NOT in this change:
- Uptime Kuma monitor migration (still reads from config.tfvars)
- 17 remaining centrally-managed hostnames (Helm, special cases)
- Removal of allow_overwrite (keep until migration confirmed stable)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Create dedicated 'matrix' PostgreSQL user (was using 'postgres' superuser)
- Add Vault DB static role pg-matrix with 24h rotation
- Add ExternalSecret matrix-db-creds syncing password from Vault
- Add inject-db-password init container that patches homeserver.yaml
with current Vault password on every pod start
- Update dependency annotation to pg-cluster-rw.dbaas
- Also updated Vault DB config to use pg-cluster-rw (was legacy postgresql.dbaas)
- Tandoor: pin image to vabene1111/recipes:1.5.27 (latest tag pull
failing with EOF from pull-through cache corruption)
- Matrix: update homeserver.yaml to use pg-cluster-rw.dbaas instead
of legacy postgresql.dbaas service, update CNPG postgres password
- Navidrome: deleted corrupted SQLite DB (malformed disk image from
proxmox-lvm migration), navidrome recreates fresh DB on startup
Kyverno ClusterPolicy reads dependency.kyverno.io/wait-for annotation
and injects busybox init containers that block until each dependency
is reachable (nc -z). Annotations added to 18 stacks (24 deployments).
Includes graceful-db-maintenance.sh script for planned DB maintenance
(scales dependents to 0, saves replica counts, restores on startup).
SQLite over NFS caused database corruption (malformed disk image).
Recovered the DB, migrated data to PostgreSQL via synapse_port_db,
and updated the deployment to use psycopg2 with an init container.
Database: matrix on postgresql.dbaas.svc.cluster.local
Scaled replicas from 0 to 1.
Add Kubernetes ingress annotations for Homepage auto-discovery across
~88 services organized into 11 groups. Enable serviceAccount for RBAC,
configure group layouts, and add Grafana/Frigate/Speedtest widgets.
Phase 5 — CI pipelines:
- default.yml: add SOPS decrypt in prepare step, change git add . to
specific paths (stacks/ state/ .woodpecker/), cleanup on success+failure
- renew-tls.yml: change git add . to git add secrets/ state/
Phase 6 — sensitive=true:
- Add sensitive = true to 256 variable declarations across 149 stack files
- Prevents secret values from appearing in terraform plan output
- Does NOT modify shared modules (ingress_factory, nfs_volume) to avoid
breaking module interface contracts
Note: CI pipeline SOPS decryption requires sops_age_key Woodpecker secret
to be created before the pipeline will work with SOPS. Until then, the old
terraform.tfvars path continues to function.
Remove the module "xxx" { source = "./module" } indirection layer
from all 66 service stacks. Resources are now defined directly in
each stack's main.tf instead of through a wrapper module.
- Merge module/main.tf contents into stack main.tf
- Apply variable replacements (var.tier -> local.tiers.X, renamed vars)
- Fix shared module paths (one fewer ../ at each level)
- Move extra files/dirs (factory/, chart_values, subdirs) to stack root
- Update state files to strip module.<name>. prefix
- Update CLAUDE.md to reflect flat structure
Verified: terragrunt plan shows 0 add, 0 destroy across all stacks.
Move all 88 service modules (66 individual + 22 platform) from
modules/kubernetes/<service>/ into their corresponding stack directories:
- Service stacks: stacks/<service>/module/
- Platform stack: stacks/platform/modules/<service>/
This collocates module source code with its Terragrunt definition.
Only shared utility modules remain in modules/kubernetes/:
ingress_factory, setup_tls_secret, dockerhub_secret, oauth-proxy.
All cross-references to shared modules updated to use correct
relative paths. Verified with terragrunt run --all -- plan:
0 adds, 0 destroys across all 68 stacks.
Generated individual stack directories for all 66 services under stacks/.
Each stack has terragrunt.hcl (depends on platform) and main.tf (thin
wrapper calling existing module). Migrated all 64 active service states
from root terraform.tfstate to individual state files. Root state is now
empty. Verified with terragrunt plan on multiple stacks (no changes).