WHAT LANDED:
- terragrunt.hcl (root): added telmate/proxmox to k8s_providers
required_providers. Other stacks just don't instantiate a provider
block — harmless. Replaces the same-name override trick the infra
stack used to do, which stopped working under Terragrunt v0.77
("Detected generate blocks with the same name").
- stacks/infra/terragrunt.hcl: new generate "proxmox_provider" block
writes proxmox_provider.tf with the provider config; credentials
read from Vault secret/viktor at plan/apply time (no env vars).
- modules/create-vm: new mbps_rd / mbps_wr number variables (default 0
= uncapped), wired into scsi0/scsi1 disk{} blocks as
mbps_r_concurrent / mbps_wr_concurrent. lifecycle.ignore_changes
extended to scsi6..scsi29 (K8s nodes have many CSI-managed slots),
plus scsihw and qemu_os (vary per-VM; non-trivial live changes).
- stacks/infra/main.tf: docker-registry-vm gains mbps_rd=40,
mbps_wr=40 in HCL — already applied live via qm set on 2026-05-26.
WHAT FAILED AND WAS ROLLED BACK:
- Attempted import of 7 VMs (102 devvm, 103 home-assistant, 200
k8s-master, 201 k8s-node1, 202 k8s-node2, 203 k8s-node3, 204
k8s-node4) via import {} blocks. The telmate/proxmox v3.0.2-rc07
provider mangled proxmox-csi PVC slots on apply for vmid 202 and
203: every scsi slot got rewritten from `vm-9999-pvc-<uuid>` to
the boot disk `vm-<vmid>-disk-0`. Restored both .conf files from
the 2026-05-24 nightly PVE config backup at /mnt/backup/pve-config/
etc-pve/nodes/pve/qemu-server/{202,203}.conf — no reboots, no data
loss, K8s CSI reconciled PVC attachments within minutes. Removed
the 7 imports from state via `terraform state rm` and re-encrypted.
Tracked in beads code-xzbl: blocked on bpg/proxmox provider
migration (telmate has the same dynamic-disk defect that bit us on
iSCSI back in 2026-04-02; see memory id=539).
LIVE CAPS STILL IN PLACE (qm set, 2026-05-26 ~03:13 UTC):
102 devvm 60/60 103 home-assistant 40/40 200 k8s-master 100/60
201 k8s-node1 150/120 202 k8s-node2 150/120 203 k8s-node3 150/120
204 k8s-node4 150/120 220 docker-registry 40/40
(pfSense 101 BSD + Windows10 300 intentionally out of scope.)
PRE-EXISTING DRIFT EXPOSED (NOT NEW):
- HCL declares k8s-master (200) and k8s-node2 (202) but neither was
ever imported into TF state — confirmed against the SOPS-encrypted
state in git (lineage e1cc5bb5, serial 42, last touched 2026-04-06).
This commit leaves both declarations in place but does NOT import
them; that's part of the code-xzbl follow-up.
Closes: code-s9xr
## Resolves code-e2dp (Kyverno TF apply blocked)
Root cause: terraform-provider-kubernetes v3.1.0 panics on plan/refresh of
kubernetes_manifest resources holding Kyverno ClusterPolicy CRDs (large
CEL/foreach schemas). Workaround: swap to gavinbunney/kubectl_manifest which
treats manifests as opaque YAML strings.
## Migration mechanics
- Root terragrunt.hcl: added gavinbunney/kubectl provider declaration so all
stacks get it generated in providers.tf.
- stacks/kyverno/modules/kyverno/versions.tf (new): module-level provider source
declaration (required for kubectl_manifest in a child module).
- Converted 17 kubernetes_manifest resources across 7 files to kubectl_manifest
with yaml_body = yamlencode({...}). depends_on chains preserved.
- terraform state rm for all 17 old kubernetes_manifest entries.
- stacks/kyverno/imports.tf (new): TF 1.5+ import blocks mapping each
kubectl_manifest to its live cluster resource by apiVersion//Kind//name ID.
- One resource (policy_inject_keel_annotations) needed kubectl delete + recreate
because the kubectl provider couldn't patch it cleanly (resourceVersion=0
invalid for update — gotcha when adopting a resource previously
kubernetes_manifest-owned).
## W1.4 — security policies Audit → Enforce (LIVE)
Three policies flipped: deny-privileged-containers, deny-host-namespaces,
restrict-sys-admin. Verified live via kubectl. failurePolicy=Ignore preserved.
## Shared exclude list (35 namespaces)
local.security_policy_exclude_namespaces in security-policies.tf.
- 31 critical from memory id=1970 (Keel rollout list)
- + frigate (camera HW transcoding needs host access)
- + kured (privileged DaemonSet for node reboots)
- + default (etcd backup/defrag CronJobs use hostNetwork)
- + changedetection (uses SYS_ADMIN for chromium sandbox)
## W1.5 — require-trusted-registries stays Audit
Pattern */* allows anything-with-a-slash; Enforce would be a no-op for supply
chain. Tracked under beads code-8ywc as follow-up.
## TF import-blocks
The imports.tf file should be removed in a follow-up cleanup commit once
verified — TF doesn't auto-clean these.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Closes: code-e2dp
## Context
Wave 6a of the state-drift consolidation plan. The Domain wide catch all
Proxy Provider (pk=5) + its wrapping Application (slug=domain-wide-catch-all)
+ the embedded outpost (uuid 0eecac07-97c7-443c-8925-05f2f4fe3e47) have
run for a year as pure UI-created state. When the 2026-04-18 outpost SEV2
hit, it was harder to reason about the config than it should have been —
the only source of truth was the Authentik admin UI. Bringing the provider
+ application under Terraform means future changes are reviewable in PRs
and recoverable from git if the admin UI misbehaves.
## This change
Adds the `goauthentik/authentik` provider to the repo's central
`terragrunt.hcl` `required_providers` (side-effect: every stack can now
declare authentik resources; this stack is the only current consumer).
Stack-local `stacks/authentik/authentik_provider.tf` holds the provider
instance configuration + API token wiring + two resources + their flow
data-source lookups.
### Auth
- API token stored in Vault at `secret/authentik/tf_api_token`, identifier
`terraform-infra-stack`, intent=API, user=akadmin, no expiry. Rotatable
by rewriting the Vault KV + any running TF apply picks it up on next
plan.
### Imports (both landed zero-diff)
- `authentik_application.catchall` ← id `domain-wide-catch-all`
- `authentik_provider_proxy.catchall` ← id `5`
### Flow references
Authorization + invalidation flows are looked up via `data
"authentik_flow"` by slug (`default-provider-authorization-implicit-consent`
+ `default-provider-invalidation-flow`). Keeping them as data sources
rather than hardcoded UUIDs means a flow recreation (slug unchanged)
doesn't require an HCL edit.
### `lifecycle { ignore_changes }` scope
On `authentik_provider_proxy.catchall`:
- `property_mappings` (5 UUIDs), `jwt_federation_sources` (1 UUID) — the
live state references complex many-to-many relations that are easier
to manage from the Authentik UI than to serialise in HCL. Drift
suppressed.
- `skip_path_regex`, `internal_host`, all `basic_auth_*`,
`intercept_header_auth`, `access_token_validity` — either defaults or
UI-only tuning knobs that aren't part of Terraform's concern for this
catch-all provider.
On `authentik_application.catchall`:
- `meta_description`, `meta_launch_url`, `meta_icon`, `group`,
`backchannel_providers`, `policy_engine_mode`, `open_in_new_tab` —
cosmetic/non-functional attributes; the Authentik UI is the right
place to edit these and drift on them isn't interesting.
## What is NOT in this change
- Outpost-binding resource — the embedded outpost's provider list is a
single-row many-to-many that the Authentik UI manages cleanly; adding
TF there would fight the UI without reducing drift.
- Property mappings and JWT federation source — managed via UI, drift
suppressed. A future wave can bring them in when someone actually
wants to edit them through code review.
- Other Authentik entities (Flows, Stages, Groups, RBAC policies) —
same rationale: UI is the natural editing surface. Adopt incrementally
as they become interesting to code-review.
## Verification
```
$ cd stacks/authentik && ../../scripts/tg plan | grep Plan:
Plan: 0 to add, 1 to change, 0 to destroy.
# module.authentik.kubernetes_deployment.pgbouncer — pre-existing drift,
# unrelated to this commit (image_pull_policy Always -> IfNotPresent)
$ ../../scripts/tg state list | grep authentik_
authentik_application.catchall
authentik_provider_proxy.catchall
data.authentik_flow.default_authorization_implicit_consent
data.authentik_flow.default_provider_invalidation
```
## Reproduce locally
1. `git pull && cd stacks/authentik && ../../scripts/tg init`
2. Terraform pulls goauthentik/authentik provider (first time).
3. `tg plan` — expect only pgbouncer drift; authentik resources read-only.
Refs: Wave 6a of the state-drift consolidation (code-hl1)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two-tier state architecture:
- Tier 0 (infra, platform, cnpg, vault, dbaas, external-secrets): local
state with SOPS encryption in git — unchanged, required for bootstrap.
- Tier 1 (105 app stacks): PostgreSQL backend on CNPG cluster at
10.0.20.200:5432/terraform_state with native pg_advisory_lock.
Motivation: multi-operator friction (every workstation needed SOPS + age +
git-crypt), bootstrap complexity for new operators, and headless agents/CI
needing the full encryption toolchain just to read state.
Changes:
- terragrunt.hcl: conditional backend (local vs pg) based on tier0 list
- scripts/tg: tier detection, auto-fetch PG creds from Vault for Tier 1,
skip SOPS and Vault KV locking for Tier 1 stacks
- scripts/state-sync: tier-aware encrypt/decrypt (skips Tier 1)
- scripts/migrate-state-to-pg: one-shot migration script (idempotent)
- stacks/vault/main.tf: pg-terraform-state static role + K8s auth role
for claude-agent namespace
- stacks/dbaas: terraform_state DB creation + MetalLB LoadBalancer
service on shared IP 10.0.20.200
- Deleted 107 .tfstate.enc files for migrated Tier 1 stacks
- Cleaned up per-stack tiers.tf (now generated by root terragrunt.hcl)
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Context
Deploying new services required manually adding hostnames to
cloudflare_proxied_names/cloudflare_non_proxied_names in config.tfvars —
a separate file from the service stack. This was frequently forgotten,
leaving services unreachable externally.
## This change:
- Add `dns_type` parameter to `ingress_factory` and `reverse_proxy/factory`
modules. Setting `dns_type = "proxied"` or `"non-proxied"` auto-creates
the Cloudflare DNS record (CNAME to tunnel or A/AAAA to public IP).
- Simplify cloudflared tunnel from 100 per-hostname rules to wildcard
`*.viktorbarzin.me → Traefik`. Traefik still handles host-based routing.
- Add global Cloudflare provider via terragrunt.hcl (separate
cloudflare_provider.tf with Vault-sourced API key).
- Migrate 118 hostnames from centralized config.tfvars to per-service
dns_type. 17 hostnames remain centrally managed (Helm ingresses,
special cases).
- Update docs, AGENTS.md, CLAUDE.md, dns.md runbook.
```
BEFORE AFTER
config.tfvars (manual list) stacks/<svc>/main.tf
| module "ingress" {
v dns_type = "proxied"
stacks/cloudflared/ }
for_each = list |
cloudflare_record auto-creates
tunnel per-hostname cloudflare_record + annotation
```
## What is NOT in this change:
- Uptime Kuma monitor migration (still reads from config.tfvars)
- 17 remaining centrally-managed hostnames (Helm, special cases)
- Removal of allow_overwrite (keep until migration confirmed stable)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Vault is now the sole source of truth for secrets. SOPS pipeline
removed entirely — auth via `vault login -method=oidc`.
Part A: SOPS removal
- vault/main.tf: delete 990 lines (93 vars + 43 KV write resources),
add self-read data source for OIDC creds from secret/vault
- terragrunt.hcl: remove SOPS var loading, vault_root_token, check_secrets hook
- scripts/tg: remove SOPS decryption, keep -auto-approve logic
- .woodpecker/default.yml: replace SOPS with Vault K8s auth via curl
- Delete secrets.sops.json, .sops.yaml
Part B: External Secrets Operator
- New stack stacks/external-secrets/ with Helm chart + 2 ClusterSecretStores
(vault-kv for KV v2, vault-database for DB engine)
Part C: Database secrets engine (in vault/main.tf)
- MySQL + PostgreSQL connections with static role rotation (24h)
- 6 MySQL roles (speedtest, wrongmove, codimd, nextcloud, shlink, grafana)
- 6 PostgreSQL roles (trading, health, linkwarden, affine, woodpecker, claude_memory)
Part D: Kubernetes secrets engine (in vault/main.tf)
- RBAC for Vault SA to manage K8s tokens
- Roles: dashboard-admin, ci-deployer, openclaw, local-admin
- New scripts/vault-kubeconfig helper for dynamic kubeconfig
K8s auth method with scoped policies for CI, ESO, OpenClaw, Woodpecker sync.
- Add vault provider to root terragrunt.hcl (generated providers.tf)
- Delete stacks/vault/vault_provider.tf (now in generated providers.tf)
- Add 124 variable declarations + 43 vault_kv_secret_v2 resources to
vault/main.tf to populate Vault KV at secret/<stack-name>
- Migrate 43 consuming stacks to read secrets from Vault KV via
data "vault_kv_secret_v2" instead of SOPS var-file
- Add dependency "vault" to all migrated stacks' terragrunt.hcl
- Complex types (maps/lists) stored as JSON strings, decoded with
jsondecode() in locals blocks
Bootstrap secrets (vault_root_token, vault_authentik_client_id,
vault_authentik_client_secret) remain in SOPS permanently.
Apply order: vault stack first (populates KV), then all others.
terragrunt.hcl now loads:
- config.tfvars (required, plaintext)
- terraform.tfvars (optional, git-crypt — backward compat)
- secrets.auto.tfvars.json (optional, SOPS-decrypted)
before_hook checks that at least one secrets source exists.
Use `scripts/tg` wrapper for SOPS-based workflow.
Old terraform.tfvars kept for reference and backward compatibility.