2026-03-17 18:11:53 +00:00
# DB as a service. Installs MySQL operator
variable " tls_secret_name " { }
variable " tier " { type = string }
variable " dbaas_root_password " { }
variable " cluster_master_service " {
default = " mysql "
}
variable " postgresql_root_password " { }
variable " pgadmin_password " { }
variable " prod " {
default = false
type = bool
}
variable " nfs_server " { type = string }
variable " kube_config_path " {
type = string
sensitive = true
}
[dbaas] Declare forgejo + roundcubemail MySQL users in Terraform
## Context
The 2026-04-16 MySQL InnoDB Cluster → standalone migration recreated the
MySQL user table but scripted fresh passwords for every app user. Two apps
(forgejo, roundcubemail) store their DB password inside their own
application config — forgejo in `/data/gitea/conf/app.ini` (baked into the
PVC), roundcubemail in the ROUNDCUBEMAIL_DB_PASSWORD env from the
mailserver stack (sourced from Vault `secret/platform`). Neither app
could be restarted with a new password without rewriting its own config.
Both apps silently broke with `Access denied for user 'X'@'%'` after the
migration. Remediation on 2026-04-17 was a manual `ALTER USER ... IDENTIFIED
BY '<app_password>'` to re-sync MySQL with what each app already has. With
nothing in Terraform managing those users, the next migration would break
them again — that's the gap this change closes.
## What this change does
Codifies both MySQL users in `stacks/dbaas/modules/dbaas/` using the same
`null_resource` + `local-exec` + `kubectl exec` pattern already used for
`pg_terraform_state_db` (line 1373 of the same file). Rejected alternatives:
- `petoju/mysql` Terraform provider — no existing usage in the repo; would
be a net-new dependency. Module-level `for_each` over `mysql_user` +
`mysql_grant` is cleaner, but the added machinery (new provider block,
extra auth path via `MYSQL_HOST`/`MYSQL_USERNAME`/`MYSQL_PASSWORD` TF
env vars, state-dependent password reads) outweighs the benefit for two
static users.
- K8s Job — adds lifecycle management for a one-shot resource; needs
secret mounts and is harder to retry. `local-exec` is exactly what the
existing PG bootstrap uses.
Idempotency contract:
CREATE DATABASE IF NOT EXISTS <db>;
CREATE USER IF NOT EXISTS '<user>'@'%' IDENTIFIED WITH caching_sha2_password BY '<pw>';
ALTER USER '<user>'@'%' IDENTIFIED WITH caching_sha2_password BY '<pw>';
GRANT ALL PRIVILEGES ON <db>.* TO '<user>'@'%';
FLUSH PRIVILEGES;
The `ALTER USER` on every re-run re-syncs the password if Vault was rotated
out-of-band (healing drift). The `sha256(password)` trigger also re-runs
the provisioner when the Vault password legitimately changes, so the
resource is responsive to both new and rotated passwords. `caching_sha2_password`
matches the live plugin returned by `SHOW CREATE USER`; forcing it prevents
silent drift to `mysql_native_password`.
Flow (apply-time):
scripts/tg apply
│
├── data.vault_kv_secret_v2.viktor ── reads mysql_{forgejo,roundcubemail}_password
│
▼
module.dbaas
│
├── mysql-standalone-0 (StatefulSet, already running)
│
├── null_resource.mysql_static_user["forgejo"]
│ └── kubectl exec ... mysql -uroot -p$ROOT_PASSWORD ... CREATE/ALTER/GRANT
│
└── null_resource.mysql_static_user["roundcubemail"]
└── (same, for roundcubemail)
## Secrets
Two new keys added to Vault `secret/viktor`:
mysql_forgejo_password # bound to forgejo `[database]` in app.ini
mysql_roundcubemail_password # duplicates secret/platform
# mailserver_roundcubemail_db_password;
# secret/viktor is the personal vault of
# record per .claude/CLAUDE.md
Passwords are never written to the repo — both come from Vault via
`data "vault_kv_secret_v2" "viktor"` in the dbaas root module.
## What is NOT in this change
- PG-side users (managed by Vault DB engine static-roles already — see
MEMORY.md "Database rotation")
- Other MySQL users (speedtest, wrongmove, codimd, nextcloud, shlink,
grafana, phpipam are all rotated by Vault DB engine; root users
excluded by design)
- Removing the old mysql-operator / InnoDB Cluster helm releases (Phase 4
cleanup tracked under the MySQL standalone migration work — still
pending)
## Test plan
### Automated
`terraform fmt -check -recursive stacks/dbaas` → exit 0
`scripts/tg plan` in stacks/dbaas →
Plan: 2 to add, 7 to change, 0 to destroy.
# module.dbaas.null_resource.mysql_static_user["forgejo"] will be created
# module.dbaas.null_resource.mysql_static_user["roundcubemail"] will be created
The 7 "update in-place" entries are pre-existing drift (Kyverno labels on
LimitRange, MetalLB ip-allocated-from-pool annotation on postgresql_lb,
Kyverno-injected `dns_config` on 4 CronJobs lacking the
`ignore_changes` workaround, `resize.topolvm.io/storage_limit` bump
30Gi→50Gi on mysql-standalone PVC). None of those are introduced by this
commit and all are benign (no data loss, no pod restart).
### Manual Verification
# 1. Sanity check pre-apply — users are in their current (manually-fixed) state.
kubectl exec -n dbaas mysql-standalone-0 -c mysql -- bash -c \
'mysql -uroot -p"$MYSQL_ROOT_PASSWORD" -N -e \
"SELECT user,host,plugin FROM mysql.user WHERE user IN (\"forgejo\",\"roundcubemail\");"'
# Expected:
# forgejo % caching_sha2_password
# roundcubemail % caching_sha2_password
# 2. Apply and confirm the provisioner exits 0.
cd stacks/dbaas && ../../scripts/tg apply
# Expect: null_resource.mysql_static_user["forgejo"]: Creation complete
# null_resource.mysql_static_user["roundcubemail"]: Creation complete
# 3. App-level smoke: log in to forgejo.viktorbarzin.me (any git push)
# and load https://mail.viktorbarzin.me/roundcube (IMAP login). Both
# must succeed.
# 4. Destructive test (run ONCE, off-hours):
kubectl exec -n dbaas mysql-standalone-0 -c mysql -- bash -c \
'mysql -uroot -p"$MYSQL_ROOT_PASSWORD" -e "DROP USER '\''forgejo'\''@'\''%'\''"'
cd stacks/dbaas && ../../scripts/tg apply
# Expected: apply recreates the user with the Vault password, forgejo UI
# recovers without touching /data/gitea/conf/app.ini.
### Reproduce locally
1. vault login -method=oidc
2. cd infra/stacks/dbaas
3. ../../scripts/tg plan
4. Expected: "Plan: 2 to add, 7 to change, 0 to destroy." with the two
null_resource.mysql_static_user additions. 7 changes are pre-existing
drift unrelated to this commit.
Closes: code-6th
Closes: code-96w
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 22:05:21 +00:00
# MySQL static application users (not rotated by Vault DB engine; baked into
# each app's config). Codified here so future MySQL rebuilds cannot silently
# drop them.
variable " mysql_forgejo_password " {
type = string
sensitive = true
}
variable " mysql_roundcubemail_password " {
type = string
sensitive = true
}
2026-03-17 18:11:53 +00:00
resource " kubernetes_namespace " " dbaas " {
metadata {
name = " dbaas "
labels = {
tier = var . tier
" resource-governance/custom-quota " = " true "
}
}
[infra] Suppress Goldilocks vpa-update-mode label drift on all namespaces [ci skip]
## Context
Wave 3B-continued: the Goldilocks VPA dashboard (stacks/vpa) runs a Kyverno
ClusterPolicy `goldilocks-vpa-auto-mode` that mutates every namespace with
`metadata.labels["goldilocks.fairwinds.com/vpa-update-mode"] = "off"`. This
is intentional — Terraform owns container resource limits, and Goldilocks
should only provide recommendations, never auto-update. The label is how
Goldilocks decides per-namespace whether to run its VPA in `off` mode.
Effect on Terraform: every `kubernetes_namespace` resource shows the label
as pending-removal (`-> null`) on every `scripts/tg plan`. Dawarich survey
2026-04-18 confirmed the drift. Cluster-side count: 88 namespaces carry the
label (`kubectl get ns -o json | jq ... | wc -l`). Every TF-managed namespace
is affected.
This commit brings the intentional admission drift under the same
`# KYVERNO_LIFECYCLE_V1` discoverability marker introduced in c9d221d5 for
the ndots dns_config pattern. The marker now stands generically for any
Kyverno admission-webhook drift suppression; the inline comment records
which specific policy stamps which specific field so future grep audits
show why each suppression exists.
## This change
107 `.tf` files touched — every stack's `resource "kubernetes_namespace"`
resource gets:
```hcl
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
```
Injection was done with a brace-depth-tracking Python pass (`/tmp/add_goldilocks_ignore.py`):
match `^resource "kubernetes_namespace" ` → track `{` / `}` until the
outermost closing brace → insert the lifecycle block before the closing
brace. The script is idempotent (skips any file that already mentions
`goldilocks.fairwinds.com/vpa-update-mode`) so re-running is safe.
Vault stack picked up 2 namespaces in the same file (k8s-users produces
one, plus a second explicit ns) — confirmed via file diff (+8 lines).
## What is NOT in this change
- `stacks/trading-bot/main.tf` — entire file is `/* … */` commented out
(paused 2026-04-06 per user decision). Reverted after the script ran.
- `stacks/_template/main.tf.example` — per-stack skeleton, intentionally
minimal. User keeps it that way. Not touched by the script (file
has no real `resource "kubernetes_namespace"` — only a placeholder
comment).
- `.terraform/` copies (e.g. `stacks/metallb/.terraform/modules/...`) —
gitignored, won't commit; the live path was edited.
- `terraform fmt` cleanup of adjacent pre-existing alignment issues in
authentik, freedify, hermes-agent, nvidia, vault, meshcentral. Reverted
to keep the commit scoped to the Goldilocks sweep. Those files will
need a separate fmt-only commit or will be cleaned up on next real
apply to that stack.
## Verification
Dawarich (one of the hundred-plus touched stacks) showed the pattern
before and after:
```
$ cd stacks/dawarich && ../../scripts/tg plan
Before:
Plan: 0 to add, 2 to change, 0 to destroy.
# kubernetes_namespace.dawarich will be updated in-place
(goldilocks.fairwinds.com/vpa-update-mode -> null)
# module.tls_secret.kubernetes_secret.tls_secret will be updated in-place
(Kyverno generate.* labels — fixed in 8d94688d)
After:
No changes. Your infrastructure matches the configuration.
```
Injection count check:
```
$ rg -c 'KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode' stacks/ | awk -F: '{s+=$2} END {print s}'
108
```
## Reproduce locally
1. `git pull`
2. Pick any stack: `cd stacks/<name> && ../../scripts/tg plan`
3. Expect: no drift on the namespace's goldilocks.fairwinds.com/vpa-update-mode label.
Closes: code-dwx
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 21:15:27 +00:00
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [ metadata [ 0 ] . labels [ " goldilocks.fairwinds.com/vpa-update-mode " ] ]
}
2026-03-17 18:11:53 +00:00
}
2026-04-05 23:31:23 +03:00
# Override Kyverno tier-1-cluster LimitRange (max 4Gi) to allow MySQL 6Gi limit
resource " kubernetes_limit_range " " dbaas " {
metadata {
name = " tier-defaults "
namespace = kubernetes_namespace . dbaas . metadata [ 0 ] . name
}
spec {
limit {
type = " Container "
default = {
memory = " 256Mi "
}
default_request = {
cpu = " 50m "
memory = " 256Mi "
}
max = {
memory = " 8Gi "
}
}
}
}
2026-03-17 18:11:53 +00:00
resource " kubernetes_resource_quota " " dbaas " {
metadata {
name = " dbaas-quota "
namespace = kubernetes_namespace . dbaas . metadata [ 0 ] . name
}
spec {
hard = {
" requests.cpu " = " 8 "
2026-04-06 15:57:47 +03:00
" requests.memory " = " 40Gi "
" limits.memory " = " 40Gi "
2026-03-17 18:11:53 +00:00
pods = " 30 "
}
}
}
module " tls_secret " {
source = " ../../../../modules/kubernetes/setup_tls_secret "
namespace = kubernetes_namespace . dbaas . metadata [ 0 ] . name
tls_secret_name = var . tls_secret_name
}
2026-04-16 19:01:06 +00:00
#### MYSQL — Standalone (migration target)
2026-04-16 13:45:04 +00:00
#
# Standalone MySQL without Group Replication. Eliminates ~95 GB/day of GR
# write overhead (binlog, relay log, XCom cache) for databases totaling ~35 MB.
# Binary logging disabled entirely (skip-log-bin) since no replication needed.
2026-04-16 19:01:06 +00:00
# Uses official mysql:8.4 image (Bitnami images deprecated by Broadcom Aug 2025).
2026-04-16 13:45:04 +00:00
2026-04-16 19:01:06 +00:00
resource " kubernetes_config_map " " mysql_standalone_cnf " {
metadata {
name = " mysql-standalone-cnf "
namespace = kubernetes_namespace . dbaas . metadata [ 0 ] . name
}
data = {
" standalone.cnf " = < < - EOT
[ mysqld ]
skip - name - resolve
mysql - native - password =ON
skip - log - bin
max_connections =80
innodb_log_buffer_size =16777216
innodb_flush_log_at_trx_commit =2
innodb_io_capacity =100
innodb_io_capacity_max =200
innodb_redo_log_capacity =1073741824
innodb_buffer_pool_size =1073741824
innodb_flush_neighbors =1
innodb_lru_scan_depth =256
innodb_page_cleaners =1
innodb_adaptive_flushing_lwm =10
innodb_max_dirty_pages_pct =90
innodb_max_dirty_pages_pct_lwm =10
EOT
}
}
2026-04-16 13:45:04 +00:00
2026-04-16 19:01:06 +00:00
resource " kubernetes_stateful_set_v1 " " mysql_standalone " {
metadata {
name = " mysql-standalone "
namespace = kubernetes_namespace . dbaas . metadata [ 0 ] . name
labels = {
" app.kubernetes.io/name " = " mysql "
" app.kubernetes.io/instance " = " mysql-standalone "
" app.kubernetes.io/component " = " primary "
2026-04-16 13:45:04 +00:00
}
2026-04-16 19:01:06 +00:00
}
spec {
service_name = " mysql-standalone "
replicas = 1
2026-04-16 13:45:04 +00:00
2026-04-16 19:01:06 +00:00
selector {
match_labels = {
" app.kubernetes.io/instance " = " mysql-standalone "
" app.kubernetes.io/component " = " primary "
}
2026-04-16 13:45:04 +00:00
}
2026-04-16 19:01:06 +00:00
template {
metadata {
labels = {
" app.kubernetes.io/name " = " mysql "
" app.kubernetes.io/instance " = " mysql-standalone "
" app.kubernetes.io/component " = " primary "
2026-04-16 13:45:04 +00:00
}
}
2026-04-16 19:01:06 +00:00
spec {
affinity {
node_affinity {
required_during_scheduling_ignored_during_execution {
node_selector_term {
match_expressions {
gpu: schedule off NFD label, not k8s-node1 hostname
Remove every hardcoded reference to k8s-node1 that pinned GPU
scheduling to a specific host:
- GPU workload nodeSelectors: gpu=true -> nvidia.com/gpu.present=true
(frigate, immich, whisper, piper, ytdlp, ebook2audiobook, audiblez,
audiblez-web, nvidia-exporter, gpu-pod-exporter). The NFD label is
auto-applied by gpu-feature-discovery on any node carrying an
NVIDIA PCI device, so the selector follows the card.
- null_resource.gpu_node_config: rewrite to enumerate NFD-labeled
nodes (feature.node.kubernetes.io/pci-10de.present=true) and taint
each with nvidia.com/gpu=true:PreferNoSchedule. Drop the manual
'kubectl label gpu=true' since NFD handles labeling.
- MySQL anti-affinity: kubernetes.io/hostname NotIn [k8s-node1] ->
nvidia.com/gpu.present NotIn [true]. Same intent (keep MySQL off
the GPU node) but portable when the card relocates.
Net effect: moving the GPU card between nodes no longer requires any
Terraform edit. Verified no-op for current scheduling — both old and
new labels resolve to node1 today.
Docs updated to match: AGENTS.md, compute.md, overview.md,
proxmox-inventory.md, k8s-portal agent-guidance string.
2026-04-22 13:43:07 +00:00
key = " nvidia.com/gpu.present "
2026-04-16 19:01:06 +00:00
operator = " NotIn "
gpu: schedule off NFD label, not k8s-node1 hostname
Remove every hardcoded reference to k8s-node1 that pinned GPU
scheduling to a specific host:
- GPU workload nodeSelectors: gpu=true -> nvidia.com/gpu.present=true
(frigate, immich, whisper, piper, ytdlp, ebook2audiobook, audiblez,
audiblez-web, nvidia-exporter, gpu-pod-exporter). The NFD label is
auto-applied by gpu-feature-discovery on any node carrying an
NVIDIA PCI device, so the selector follows the card.
- null_resource.gpu_node_config: rewrite to enumerate NFD-labeled
nodes (feature.node.kubernetes.io/pci-10de.present=true) and taint
each with nvidia.com/gpu=true:PreferNoSchedule. Drop the manual
'kubectl label gpu=true' since NFD handles labeling.
- MySQL anti-affinity: kubernetes.io/hostname NotIn [k8s-node1] ->
nvidia.com/gpu.present NotIn [true]. Same intent (keep MySQL off
the GPU node) but portable when the card relocates.
Net effect: moving the GPU card between nodes no longer requires any
Terraform edit. Verified no-op for current scheduling — both old and
new labels resolve to node1 today.
Docs updated to match: AGENTS.md, compute.md, overview.md,
proxmox-inventory.md, k8s-portal agent-guidance string.
2026-04-22 13:43:07 +00:00
values = [ " true " ]
2026-04-16 19:01:06 +00:00
}
}
}
}
}
2026-04-16 13:45:04 +00:00
2026-04-16 19:01:06 +00:00
container {
2026-05-18 22:46:54 +00:00
name = " mysql "
# Pinned to 8.4.8 — 8.4.9 DD upgrade got stuck (no progress, no CPU)
# repeatedly across multiple attempts. 2026-05-18 recovery: wiping
# PVC + restoring from 2026-05-18 00:30 UTC mysqldump (taken while
# MySQL was healthy on 8.4.8). Image stays on 8.4.8 until we plan
# a proper upgrade window with maintenance + faster IO. Beads
# code-eme8 + code-k40p.
image = " mysql:8.4.8 "
2026-04-16 19:01:06 +00:00
port {
container_port = 3306
name = " mysql "
}
env {
name = " MYSQL_ROOT_PASSWORD "
value_from {
secret_key_ref {
name = kubernetes_secret . cluster - password . metadata [ 0 ] . name
key = " ROOT_PASSWORD "
}
}
}
resources {
requests = {
cpu = " 250m "
2026-05-09 17:01:57 +00:00
memory = " 3Gi "
2026-04-16 19:01:06 +00:00
}
limits = {
2026-05-09 17:01:57 +00:00
memory = " 4Gi "
2026-04-16 19:01:06 +00:00
}
}
volume_mount {
name = " data "
mount_path = " /var/lib/mysql "
}
volume_mount {
name = " config "
mount_path = " /etc/mysql/conf.d "
read_only = true
}
liveness_probe {
exec {
command = [ " mysqladmin " , " ping " , " -h " , " localhost " ]
}
initial_delay_seconds = 30
period_seconds = 10
timeout_seconds = 5
failure_threshold = 3
}
readiness_probe {
exec {
command = [ " mysqladmin " , " ping " , " -h " , " localhost " ]
}
initial_delay_seconds = 10
period_seconds = 10
timeout_seconds = 5
failure_threshold = 3
}
2026-04-16 13:45:04 +00:00
}
2026-04-16 19:01:06 +00:00
volume {
name = " config "
config_map {
name = kubernetes_config_map . mysql_standalone_cnf . metadata [ 0 ] . name
}
2026-04-16 13:45:04 +00:00
}
}
2026-04-16 19:01:06 +00:00
}
2026-04-16 13:45:04 +00:00
2026-04-16 19:01:06 +00:00
volume_claim_template {
metadata {
name = " data "
annotations = {
2026-05-10 19:56:16 +00:00
" resize.topolvm.io/threshold " = " 10% "
2026-04-16 19:01:06 +00:00
" resize.topolvm.io/increase " = " 100% "
2026-04-17 05:51:52 +00:00
" resize.topolvm.io/storage_limit " = " 50Gi "
2026-04-16 19:01:06 +00:00
}
}
spec {
access_modes = [ " ReadWriteOnce " ]
storage_class_name = " proxmox-lvm-encrypted "
resources {
requests = {
storage = " 5Gi "
2026-04-16 13:45:04 +00:00
}
}
}
}
2026-04-16 19:01:06 +00:00
}
2026-04-16 13:45:04 +00:00
2026-04-16 19:01:06 +00:00
lifecycle {
[infra] Establish KYVERNO_LIFECYCLE_V1 drift-suppression convention [ci skip]
## Context
Phase 1 of the state-drift consolidation audit (plan Wave 3) identified that
the entire repo leans on a repeated `lifecycle { ignore_changes = [...dns_config] }`
snippet to suppress Kyverno's admission-webhook dns_config mutation (the ndots=2
override that prevents NxDomain search-domain flooding). 27 occurrences across
19 stacks. Without this suppression, every pod-owning resource shows perpetual
TF plan drift.
The original plan proposed a shared `modules/kubernetes/kyverno_lifecycle/`
module emitting the ignore-paths list as an output that stacks would consume in
their `ignore_changes` blocks. That approach is architecturally impossible:
Terraform's `ignore_changes` meta-argument accepts only static attribute paths
— it rejects module outputs, locals, variables, and any expression (the HCL
spec evaluates `lifecycle` before the regular expression graph). So a DRY
module cannot exist. The canonical pattern IS the repeated snippet.
What the snippet was missing was a *discoverability tag* so that (a) new
resources can be validated for compliance, (b) the existing 27 sites can be
grep'd in a single command, and (c) future maintainers understand the
convention rather than each reinventing it.
## This change
- Introduces `# KYVERNO_LIFECYCLE_V1` as the canonical marker comment.
Attached inline on every `spec[0].template[0].spec[0].dns_config` line
(or `spec[0].job_template[0].spec[0]...` for CronJobs) across all 27
existing suppression sites.
- Documents the convention with rationale and copy-paste snippets in
`AGENTS.md` → new "Kyverno Drift Suppression" section.
- Expands the existing `.claude/CLAUDE.md` Kyverno ndots note to reference
the marker and explain why the module approach is blocked.
- Updates `_template/main.tf.example` so every new stack starts compliant.
## What is NOT in this change
- The `kubernetes_manifest` Kyverno annotation drift (beads `code-seq`)
— that is Phase B with a sibling `# KYVERNO_MANIFEST_V1` marker.
- Behavioral changes — every `ignore_changes` list is byte-identical
save for the inline comment.
- The fallback module the original plan anticipated — skipped because
Terraform rejects expressions in `ignore_changes`.
- `terraform fmt` cleanup on adjacent unrelated blocks in three files
(claude-agent-service, freedify/factory, hermes-agent). Reverted to
keep this commit scoped to the convention rollout.
## Before / after
Before (cannot distinguish accidental-forgotten from intentional-convention):
```hcl
lifecycle {
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
```
After (greppable, self-documenting, discoverable by tooling):
```hcl
lifecycle {
ignore_changes = [spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1
}
```
## Test Plan
### Automated
```
$ rg -c 'KYVERNO_LIFECYCLE_V1' stacks/ --include='*.tf' --include='*.tf.example' \
| awk -F: '{s+=$2} END {print s}'
27
$ git diff --stat | grep -E '\.(tf|tf\.example|md)$' | wc -l
21
# All code-file diffs are 1 insertion + 1 deletion per marker site,
# except beads-server (3), ebooks (4), immich (3), uptime-kuma (2).
$ git diff --stat stacks/ | tail -1
20 files changed, 45 insertions(+), 28 deletions(-)
```
### Manual Verification
No apply required — HCL comments only. Zero effect on any stack's plan output.
Future audits: `rg 'KYVERNO_LIFECYCLE_V1' stacks/ | wc -l` must grow as new
pod-owning resources are added.
## Reproduce locally
1. `cd infra && git pull`
2. `rg 'KYVERNO_LIFECYCLE_V1' stacks/` → expect 27 hits in 19 files
3. Grep any new `kubernetes_deployment` for the marker; absence = missing
suppression.
Closes: code-28m
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 14:15:51 +00:00
ignore_changes = [ spec [ 0 ] . template [ 0 ] . spec [ 0 ] . dns_config ] # KYVERNO_LIFECYCLE_V1
2026-04-16 19:01:06 +00:00
}
2026-04-16 13:45:04 +00:00
}
2026-04-16 19:01:06 +00:00
# Compatibility service: mysql.dbaas.svc.cluster.local:3306
# Points at standalone MySQL (migrated from InnoDB Cluster 2026-04-16)
2026-03-17 18:11:53 +00:00
resource " kubernetes_service " " mysql " {
metadata {
name = var . cluster_master_service
namespace = kubernetes_namespace . dbaas . metadata [ 0 ] . name
}
spec {
selector = {
2026-04-16 19:01:06 +00:00
" app.kubernetes.io/instance " = " mysql-standalone "
" app.kubernetes.io/component " = " primary "
2026-03-17 18:11:53 +00:00
}
port {
port = 3306
target_port = 3306
}
}
2026-04-16 19:01:06 +00:00
depends_on = [ kubernetes_stateful_set_v1 . mysql_standalone ]
2026-03-17 18:11:53 +00:00
}
[dbaas] Declare forgejo + roundcubemail MySQL users in Terraform
## Context
The 2026-04-16 MySQL InnoDB Cluster → standalone migration recreated the
MySQL user table but scripted fresh passwords for every app user. Two apps
(forgejo, roundcubemail) store their DB password inside their own
application config — forgejo in `/data/gitea/conf/app.ini` (baked into the
PVC), roundcubemail in the ROUNDCUBEMAIL_DB_PASSWORD env from the
mailserver stack (sourced from Vault `secret/platform`). Neither app
could be restarted with a new password without rewriting its own config.
Both apps silently broke with `Access denied for user 'X'@'%'` after the
migration. Remediation on 2026-04-17 was a manual `ALTER USER ... IDENTIFIED
BY '<app_password>'` to re-sync MySQL with what each app already has. With
nothing in Terraform managing those users, the next migration would break
them again — that's the gap this change closes.
## What this change does
Codifies both MySQL users in `stacks/dbaas/modules/dbaas/` using the same
`null_resource` + `local-exec` + `kubectl exec` pattern already used for
`pg_terraform_state_db` (line 1373 of the same file). Rejected alternatives:
- `petoju/mysql` Terraform provider — no existing usage in the repo; would
be a net-new dependency. Module-level `for_each` over `mysql_user` +
`mysql_grant` is cleaner, but the added machinery (new provider block,
extra auth path via `MYSQL_HOST`/`MYSQL_USERNAME`/`MYSQL_PASSWORD` TF
env vars, state-dependent password reads) outweighs the benefit for two
static users.
- K8s Job — adds lifecycle management for a one-shot resource; needs
secret mounts and is harder to retry. `local-exec` is exactly what the
existing PG bootstrap uses.
Idempotency contract:
CREATE DATABASE IF NOT EXISTS <db>;
CREATE USER IF NOT EXISTS '<user>'@'%' IDENTIFIED WITH caching_sha2_password BY '<pw>';
ALTER USER '<user>'@'%' IDENTIFIED WITH caching_sha2_password BY '<pw>';
GRANT ALL PRIVILEGES ON <db>.* TO '<user>'@'%';
FLUSH PRIVILEGES;
The `ALTER USER` on every re-run re-syncs the password if Vault was rotated
out-of-band (healing drift). The `sha256(password)` trigger also re-runs
the provisioner when the Vault password legitimately changes, so the
resource is responsive to both new and rotated passwords. `caching_sha2_password`
matches the live plugin returned by `SHOW CREATE USER`; forcing it prevents
silent drift to `mysql_native_password`.
Flow (apply-time):
scripts/tg apply
│
├── data.vault_kv_secret_v2.viktor ── reads mysql_{forgejo,roundcubemail}_password
│
▼
module.dbaas
│
├── mysql-standalone-0 (StatefulSet, already running)
│
├── null_resource.mysql_static_user["forgejo"]
│ └── kubectl exec ... mysql -uroot -p$ROOT_PASSWORD ... CREATE/ALTER/GRANT
│
└── null_resource.mysql_static_user["roundcubemail"]
└── (same, for roundcubemail)
## Secrets
Two new keys added to Vault `secret/viktor`:
mysql_forgejo_password # bound to forgejo `[database]` in app.ini
mysql_roundcubemail_password # duplicates secret/platform
# mailserver_roundcubemail_db_password;
# secret/viktor is the personal vault of
# record per .claude/CLAUDE.md
Passwords are never written to the repo — both come from Vault via
`data "vault_kv_secret_v2" "viktor"` in the dbaas root module.
## What is NOT in this change
- PG-side users (managed by Vault DB engine static-roles already — see
MEMORY.md "Database rotation")
- Other MySQL users (speedtest, wrongmove, codimd, nextcloud, shlink,
grafana, phpipam are all rotated by Vault DB engine; root users
excluded by design)
- Removing the old mysql-operator / InnoDB Cluster helm releases (Phase 4
cleanup tracked under the MySQL standalone migration work — still
pending)
## Test plan
### Automated
`terraform fmt -check -recursive stacks/dbaas` → exit 0
`scripts/tg plan` in stacks/dbaas →
Plan: 2 to add, 7 to change, 0 to destroy.
# module.dbaas.null_resource.mysql_static_user["forgejo"] will be created
# module.dbaas.null_resource.mysql_static_user["roundcubemail"] will be created
The 7 "update in-place" entries are pre-existing drift (Kyverno labels on
LimitRange, MetalLB ip-allocated-from-pool annotation on postgresql_lb,
Kyverno-injected `dns_config` on 4 CronJobs lacking the
`ignore_changes` workaround, `resize.topolvm.io/storage_limit` bump
30Gi→50Gi on mysql-standalone PVC). None of those are introduced by this
commit and all are benign (no data loss, no pod restart).
### Manual Verification
# 1. Sanity check pre-apply — users are in their current (manually-fixed) state.
kubectl exec -n dbaas mysql-standalone-0 -c mysql -- bash -c \
'mysql -uroot -p"$MYSQL_ROOT_PASSWORD" -N -e \
"SELECT user,host,plugin FROM mysql.user WHERE user IN (\"forgejo\",\"roundcubemail\");"'
# Expected:
# forgejo % caching_sha2_password
# roundcubemail % caching_sha2_password
# 2. Apply and confirm the provisioner exits 0.
cd stacks/dbaas && ../../scripts/tg apply
# Expect: null_resource.mysql_static_user["forgejo"]: Creation complete
# null_resource.mysql_static_user["roundcubemail"]: Creation complete
# 3. App-level smoke: log in to forgejo.viktorbarzin.me (any git push)
# and load https://mail.viktorbarzin.me/roundcube (IMAP login). Both
# must succeed.
# 4. Destructive test (run ONCE, off-hours):
kubectl exec -n dbaas mysql-standalone-0 -c mysql -- bash -c \
'mysql -uroot -p"$MYSQL_ROOT_PASSWORD" -e "DROP USER '\''forgejo'\''@'\''%'\''"'
cd stacks/dbaas && ../../scripts/tg apply
# Expected: apply recreates the user with the Vault password, forgejo UI
# recovers without touching /data/gitea/conf/app.ini.
### Reproduce locally
1. vault login -method=oidc
2. cd infra/stacks/dbaas
3. ../../scripts/tg plan
4. Expected: "Plan: 2 to add, 7 to change, 0 to destroy." with the two
null_resource.mysql_static_user additions. 7 changes are pre-existing
drift unrelated to this commit.
Closes: code-6th
Closes: code-96w
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 22:05:21 +00:00
# MySQL static application users — not rotated by Vault DB engine.
# Each app stores its password in its own config (forgejo app.ini, roundcube
# ROUNDCUBEMAIL_DB_PASSWORD env). During the 2026-04-16 InnoDB Cluster →
# standalone migration these users were accidentally dropped and recreated with
# mismatched passwords; this block codifies them so a future rebuild cannot
# silently break the apps.
#
# Pattern matches `null_resource.pg_terraform_state_db` below (local-exec into
# the DB pod). We CREATE IF NOT EXISTS + ALTER USER on every apply so a
# password rotation in Vault is re-synced on the next `scripts/tg apply`. The
# `password_hash` trigger re-runs the provisioner when the Vault password
# changes; the namespace/user triggers re-run if identifiers change.
locals {
mysql_static_users = {
forgejo = {
database = " forgejo "
password = var . mysql_forgejo_password
}
roundcubemail = {
database = " roundcubemail "
password = var . mysql_roundcubemail_password
}
}
}
resource " null_resource " " mysql_static_user " {
for_each = local . mysql_static_users
depends_on = [ kubernetes_stateful_set_v1 . mysql_standalone ]
triggers = {
username = each . key
database = each . value . data base
password_hash = sha256 ( each . value . password )
}
provisioner " local-exec " {
2026-04-17 22:34:12 +00:00
command = < < EOT
kubectl - - kubeconfig $ { var . kube_config_path } exec - i - n dbaas mysql - standalone -0 - c mysql - - sh - c ' exec mysql - uroot - p " $ MYSQL_ROOT_PASSWORD " ' < < ' SQL '
CREATE DATABASE IF NOT EXISTS ` $ { each . value . data base } ` ;
CREATE USER IF NOT EXISTS ' $ { each . key } ' @ ' % ' IDENTIFIED WITH caching_sha2_password BY ' $ { each . value . password } ' ;
ALTER USER ' $ { each . key } ' @ ' % ' IDENTIFIED WITH caching_sha2_password BY ' $ { each . value . password } ' ;
GRANT ALL PRIVILEGES ON ` $ { each . value . data base } ` . * TO ' $ { each . key } ' @ ' % ' ;
FLUSH PRIVILEGES ;
SQL
EOT
[dbaas] Declare forgejo + roundcubemail MySQL users in Terraform
## Context
The 2026-04-16 MySQL InnoDB Cluster → standalone migration recreated the
MySQL user table but scripted fresh passwords for every app user. Two apps
(forgejo, roundcubemail) store their DB password inside their own
application config — forgejo in `/data/gitea/conf/app.ini` (baked into the
PVC), roundcubemail in the ROUNDCUBEMAIL_DB_PASSWORD env from the
mailserver stack (sourced from Vault `secret/platform`). Neither app
could be restarted with a new password without rewriting its own config.
Both apps silently broke with `Access denied for user 'X'@'%'` after the
migration. Remediation on 2026-04-17 was a manual `ALTER USER ... IDENTIFIED
BY '<app_password>'` to re-sync MySQL with what each app already has. With
nothing in Terraform managing those users, the next migration would break
them again — that's the gap this change closes.
## What this change does
Codifies both MySQL users in `stacks/dbaas/modules/dbaas/` using the same
`null_resource` + `local-exec` + `kubectl exec` pattern already used for
`pg_terraform_state_db` (line 1373 of the same file). Rejected alternatives:
- `petoju/mysql` Terraform provider — no existing usage in the repo; would
be a net-new dependency. Module-level `for_each` over `mysql_user` +
`mysql_grant` is cleaner, but the added machinery (new provider block,
extra auth path via `MYSQL_HOST`/`MYSQL_USERNAME`/`MYSQL_PASSWORD` TF
env vars, state-dependent password reads) outweighs the benefit for two
static users.
- K8s Job — adds lifecycle management for a one-shot resource; needs
secret mounts and is harder to retry. `local-exec` is exactly what the
existing PG bootstrap uses.
Idempotency contract:
CREATE DATABASE IF NOT EXISTS <db>;
CREATE USER IF NOT EXISTS '<user>'@'%' IDENTIFIED WITH caching_sha2_password BY '<pw>';
ALTER USER '<user>'@'%' IDENTIFIED WITH caching_sha2_password BY '<pw>';
GRANT ALL PRIVILEGES ON <db>.* TO '<user>'@'%';
FLUSH PRIVILEGES;
The `ALTER USER` on every re-run re-syncs the password if Vault was rotated
out-of-band (healing drift). The `sha256(password)` trigger also re-runs
the provisioner when the Vault password legitimately changes, so the
resource is responsive to both new and rotated passwords. `caching_sha2_password`
matches the live plugin returned by `SHOW CREATE USER`; forcing it prevents
silent drift to `mysql_native_password`.
Flow (apply-time):
scripts/tg apply
│
├── data.vault_kv_secret_v2.viktor ── reads mysql_{forgejo,roundcubemail}_password
│
▼
module.dbaas
│
├── mysql-standalone-0 (StatefulSet, already running)
│
├── null_resource.mysql_static_user["forgejo"]
│ └── kubectl exec ... mysql -uroot -p$ROOT_PASSWORD ... CREATE/ALTER/GRANT
│
└── null_resource.mysql_static_user["roundcubemail"]
└── (same, for roundcubemail)
## Secrets
Two new keys added to Vault `secret/viktor`:
mysql_forgejo_password # bound to forgejo `[database]` in app.ini
mysql_roundcubemail_password # duplicates secret/platform
# mailserver_roundcubemail_db_password;
# secret/viktor is the personal vault of
# record per .claude/CLAUDE.md
Passwords are never written to the repo — both come from Vault via
`data "vault_kv_secret_v2" "viktor"` in the dbaas root module.
## What is NOT in this change
- PG-side users (managed by Vault DB engine static-roles already — see
MEMORY.md "Database rotation")
- Other MySQL users (speedtest, wrongmove, codimd, nextcloud, shlink,
grafana, phpipam are all rotated by Vault DB engine; root users
excluded by design)
- Removing the old mysql-operator / InnoDB Cluster helm releases (Phase 4
cleanup tracked under the MySQL standalone migration work — still
pending)
## Test plan
### Automated
`terraform fmt -check -recursive stacks/dbaas` → exit 0
`scripts/tg plan` in stacks/dbaas →
Plan: 2 to add, 7 to change, 0 to destroy.
# module.dbaas.null_resource.mysql_static_user["forgejo"] will be created
# module.dbaas.null_resource.mysql_static_user["roundcubemail"] will be created
The 7 "update in-place" entries are pre-existing drift (Kyverno labels on
LimitRange, MetalLB ip-allocated-from-pool annotation on postgresql_lb,
Kyverno-injected `dns_config` on 4 CronJobs lacking the
`ignore_changes` workaround, `resize.topolvm.io/storage_limit` bump
30Gi→50Gi on mysql-standalone PVC). None of those are introduced by this
commit and all are benign (no data loss, no pod restart).
### Manual Verification
# 1. Sanity check pre-apply — users are in their current (manually-fixed) state.
kubectl exec -n dbaas mysql-standalone-0 -c mysql -- bash -c \
'mysql -uroot -p"$MYSQL_ROOT_PASSWORD" -N -e \
"SELECT user,host,plugin FROM mysql.user WHERE user IN (\"forgejo\",\"roundcubemail\");"'
# Expected:
# forgejo % caching_sha2_password
# roundcubemail % caching_sha2_password
# 2. Apply and confirm the provisioner exits 0.
cd stacks/dbaas && ../../scripts/tg apply
# Expect: null_resource.mysql_static_user["forgejo"]: Creation complete
# null_resource.mysql_static_user["roundcubemail"]: Creation complete
# 3. App-level smoke: log in to forgejo.viktorbarzin.me (any git push)
# and load https://mail.viktorbarzin.me/roundcube (IMAP login). Both
# must succeed.
# 4. Destructive test (run ONCE, off-hours):
kubectl exec -n dbaas mysql-standalone-0 -c mysql -- bash -c \
'mysql -uroot -p"$MYSQL_ROOT_PASSWORD" -e "DROP USER '\''forgejo'\''@'\''%'\''"'
cd stacks/dbaas && ../../scripts/tg apply
# Expected: apply recreates the user with the Vault password, forgejo UI
# recovers without touching /data/gitea/conf/app.ini.
### Reproduce locally
1. vault login -method=oidc
2. cd infra/stacks/dbaas
3. ../../scripts/tg plan
4. Expected: "Plan: 2 to add, 7 to change, 0 to destroy." with the two
null_resource.mysql_static_user additions. 7 changes are pre-existing
drift unrelated to this commit.
Closes: code-6th
Closes: code-96w
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 22:05:21 +00:00
}
}
truenas deprecation: migrate all non-immich storage to proxmox NFS
- Migrate 7 backup CronJobs to Proxmox host NFS (192.168.1.127)
(etcd, mysql, postgresql, nextcloud, redis, vaultwarden, plotting-book)
- Migrate headscale backup, ebook2audiobook, osm_routing to Proxmox NFS
- Migrate servarr (lidarr, readarr, soulseek) NFS refs to Proxmox
- Remove 79 orphaned TrueNAS NFS module declarations from 49 stacks
- Delete stacks/platform/modules/ (27 dead module copies, 65MB)
- Update nfs-truenas StorageClass to point to Proxmox (192.168.1.127)
- Remove iscsi DNS record from config.tfvars
- Fix woodpecker persistence config and alertmanager PV
Only Immich (8 PVCs, ~1.4TB) remains on TrueNAS.
2026-04-12 14:35:39 +01:00
module " nfs_mysql_backup_host " {
2026-03-17 18:11:53 +00:00
source = " ../../../../modules/kubernetes/nfs_volume "
truenas deprecation: migrate all non-immich storage to proxmox NFS
- Migrate 7 backup CronJobs to Proxmox host NFS (192.168.1.127)
(etcd, mysql, postgresql, nextcloud, redis, vaultwarden, plotting-book)
- Migrate headscale backup, ebook2audiobook, osm_routing to Proxmox NFS
- Migrate servarr (lidarr, readarr, soulseek) NFS refs to Proxmox
- Remove 79 orphaned TrueNAS NFS module declarations from 49 stacks
- Delete stacks/platform/modules/ (27 dead module copies, 65MB)
- Update nfs-truenas StorageClass to point to Proxmox (192.168.1.127)
- Remove iscsi DNS record from config.tfvars
- Fix woodpecker persistence config and alertmanager PV
Only Immich (8 PVCs, ~1.4TB) remains on TrueNAS.
2026-04-12 14:35:39 +01:00
name = " dbaas-mysql-backup-host "
2026-03-17 18:11:53 +00:00
namespace = kubernetes_namespace . dbaas . metadata [ 0 ] . name
truenas deprecation: migrate all non-immich storage to proxmox NFS
- Migrate 7 backup CronJobs to Proxmox host NFS (192.168.1.127)
(etcd, mysql, postgresql, nextcloud, redis, vaultwarden, plotting-book)
- Migrate headscale backup, ebook2audiobook, osm_routing to Proxmox NFS
- Migrate servarr (lidarr, readarr, soulseek) NFS refs to Proxmox
- Remove 79 orphaned TrueNAS NFS module declarations from 49 stacks
- Delete stacks/platform/modules/ (27 dead module copies, 65MB)
- Update nfs-truenas StorageClass to point to Proxmox (192.168.1.127)
- Remove iscsi DNS record from config.tfvars
- Fix woodpecker persistence config and alertmanager PV
Only Immich (8 PVCs, ~1.4TB) remains on TrueNAS.
2026-04-12 14:35:39 +01:00
nfs_server = " 192.168.1.127 "
nfs_path = " /srv/nfs/mysql-backup "
2026-03-17 18:11:53 +00:00
}
feat(storage): migrate all sensitive services to proxmox-lvm-encrypted
Reconcile Terraform with cluster state after manual encrypted PVC migrations
and complete the remaining unfinished migrations. All services storing
sensitive data now use LUKS2-encrypted block storage via the Proxmox CSI
plugin.
## Context
Only Technitium DNS was using encrypted storage in Terraform. Many services
had been manually migrated to encrypted PVCs in the cluster, but Terraform
was never updated — creating dangerous state drift where a `tg apply` could
recreate unencrypted PVCs.
## This change
Phase 0 — Infrastructure:
- Add `proxmox-lvm-encrypted` StorageClass to Helm values (extraParameters)
- Add ExternalSecret for LUKS encryption passphrase to Terraform
- Fix CSI node plugin memory: `node.plugin.resources` (not `node.resources`)
with 1280Mi limit for LUKS2 Argon2id key derivation
Phase 1 — TF state reconciliation (zero downtime):
- Health, Matrix, N8N, Forgejo, Vaultwarden, Mailserver: state rm + import
- Redis, DBAAS MySQL, DBAAS PostgreSQL: Helm/CNPG value updates
Phase 2 — Data migration (encrypted PVCs existed but unused):
- Headscale, Frigate, MeshCentral: rsync + switchover
- Nextcloud (20Gi): rsync + chart_values update
Phase 3 — New encrypted PVCs:
- Roundcube HTML, HackMD, Affine, DBAAS pgadmin: create + rsync + switchover
Phase 4 — Cleanup:
- Deleted 5 orphaned unencrypted PVCs
## Services migrated (18 PVCs across 14 namespaces)
```
vaultwarden → vaultwarden-data-encrypted
dbaas → datadir-mysql-cluster-0, pg-cluster-{1,2}, dbaas-pgadmin-encrypted
mailserver → mailserver-data-encrypted, roundcubemail-{enigma,html}-encrypted
nextcloud → nextcloud-data-encrypted
forgejo → forgejo-data-encrypted
matrix → matrix-data-encrypted
n8n → n8n-data-encrypted
affine → affine-data-encrypted
health → health-uploads-encrypted
hackmd → hackmd-data-encrypted
redis → redis-data-redis-node-{0,1}
headscale → headscale-data-encrypted
frigate → frigate-config-encrypted
meshcentral → meshcentral-{data,files}-encrypted
```
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 20:15:30 +00:00
resource " kubernetes_persistent_volume_claim " " pgadmin_encrypted " {
feat(storage): migrate 38 NFS PVCs to proxmox-lvm (Wave 2)
Add proxmox-lvm PVCs with pvc-autoresizer annotations for all
remaining single-pod app data services. Deployments updated to
use new block storage PVCs. Old NFS modules retained for rollback.
Services: affine, changedetection, diun, excalidraw, f1-stream,
hackmd, isponsorblocktv, matrix, n8n, send, grampsweb, health,
onlyoffice, owntracks, paperless-ngx, privatebin, resume,
speedtest, stirling-pdf, tandoor, rybbit (clickhouse), tor-proxy
(torrserver), whisper+piper, frigate (config), ollama (ui),
servarr (prowlarr/listenarr/qbittorrent), aiostreams, freshrss
(extensions), meshcentral (data+files), openclaw (data+home+
openlobster), technitium, mailserver (data+roundcube html+enigma),
dbaas (pgadmin).
Strategy set to Recreate where needed for RWO volumes.
2026-04-04 19:25:12 +03:00
wait_until_bound = false
metadata {
feat(storage): migrate all sensitive services to proxmox-lvm-encrypted
Reconcile Terraform with cluster state after manual encrypted PVC migrations
and complete the remaining unfinished migrations. All services storing
sensitive data now use LUKS2-encrypted block storage via the Proxmox CSI
plugin.
## Context
Only Technitium DNS was using encrypted storage in Terraform. Many services
had been manually migrated to encrypted PVCs in the cluster, but Terraform
was never updated — creating dangerous state drift where a `tg apply` could
recreate unencrypted PVCs.
## This change
Phase 0 — Infrastructure:
- Add `proxmox-lvm-encrypted` StorageClass to Helm values (extraParameters)
- Add ExternalSecret for LUKS encryption passphrase to Terraform
- Fix CSI node plugin memory: `node.plugin.resources` (not `node.resources`)
with 1280Mi limit for LUKS2 Argon2id key derivation
Phase 1 — TF state reconciliation (zero downtime):
- Health, Matrix, N8N, Forgejo, Vaultwarden, Mailserver: state rm + import
- Redis, DBAAS MySQL, DBAAS PostgreSQL: Helm/CNPG value updates
Phase 2 — Data migration (encrypted PVCs existed but unused):
- Headscale, Frigate, MeshCentral: rsync + switchover
- Nextcloud (20Gi): rsync + chart_values update
Phase 3 — New encrypted PVCs:
- Roundcube HTML, HackMD, Affine, DBAAS pgadmin: create + rsync + switchover
Phase 4 — Cleanup:
- Deleted 5 orphaned unencrypted PVCs
## Services migrated (18 PVCs across 14 namespaces)
```
vaultwarden → vaultwarden-data-encrypted
dbaas → datadir-mysql-cluster-0, pg-cluster-{1,2}, dbaas-pgadmin-encrypted
mailserver → mailserver-data-encrypted, roundcubemail-{enigma,html}-encrypted
nextcloud → nextcloud-data-encrypted
forgejo → forgejo-data-encrypted
matrix → matrix-data-encrypted
n8n → n8n-data-encrypted
affine → affine-data-encrypted
health → health-uploads-encrypted
hackmd → hackmd-data-encrypted
redis → redis-data-redis-node-{0,1}
headscale → headscale-data-encrypted
frigate → frigate-config-encrypted
meshcentral → meshcentral-{data,files}-encrypted
```
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 20:15:30 +00:00
name = " dbaas-pgadmin-encrypted "
feat(storage): migrate 38 NFS PVCs to proxmox-lvm (Wave 2)
Add proxmox-lvm PVCs with pvc-autoresizer annotations for all
remaining single-pod app data services. Deployments updated to
use new block storage PVCs. Old NFS modules retained for rollback.
Services: affine, changedetection, diun, excalidraw, f1-stream,
hackmd, isponsorblocktv, matrix, n8n, send, grampsweb, health,
onlyoffice, owntracks, paperless-ngx, privatebin, resume,
speedtest, stirling-pdf, tandoor, rybbit (clickhouse), tor-proxy
(torrserver), whisper+piper, frigate (config), ollama (ui),
servarr (prowlarr/listenarr/qbittorrent), aiostreams, freshrss
(extensions), meshcentral (data+files), openclaw (data+home+
openlobster), technitium, mailserver (data+roundcube html+enigma),
dbaas (pgadmin).
Strategy set to Recreate where needed for RWO volumes.
2026-04-04 19:25:12 +03:00
namespace = kubernetes_namespace . dbaas . metadata [ 0 ] . name
annotations = {
2026-05-10 19:56:16 +00:00
" resize.topolvm.io/threshold " = " 10% "
feat(storage): migrate 38 NFS PVCs to proxmox-lvm (Wave 2)
Add proxmox-lvm PVCs with pvc-autoresizer annotations for all
remaining single-pod app data services. Deployments updated to
use new block storage PVCs. Old NFS modules retained for rollback.
Services: affine, changedetection, diun, excalidraw, f1-stream,
hackmd, isponsorblocktv, matrix, n8n, send, grampsweb, health,
onlyoffice, owntracks, paperless-ngx, privatebin, resume,
speedtest, stirling-pdf, tandoor, rybbit (clickhouse), tor-proxy
(torrserver), whisper+piper, frigate (config), ollama (ui),
servarr (prowlarr/listenarr/qbittorrent), aiostreams, freshrss
(extensions), meshcentral (data+files), openclaw (data+home+
openlobster), technitium, mailserver (data+roundcube html+enigma),
dbaas (pgadmin).
Strategy set to Recreate where needed for RWO volumes.
2026-04-04 19:25:12 +03:00
" resize.topolvm.io/increase " = " 100% "
" resize.topolvm.io/storage_limit " = " 5Gi "
}
}
spec {
access_modes = [ " ReadWriteOnce " ]
feat(storage): migrate all sensitive services to proxmox-lvm-encrypted
Reconcile Terraform with cluster state after manual encrypted PVC migrations
and complete the remaining unfinished migrations. All services storing
sensitive data now use LUKS2-encrypted block storage via the Proxmox CSI
plugin.
## Context
Only Technitium DNS was using encrypted storage in Terraform. Many services
had been manually migrated to encrypted PVCs in the cluster, but Terraform
was never updated — creating dangerous state drift where a `tg apply` could
recreate unencrypted PVCs.
## This change
Phase 0 — Infrastructure:
- Add `proxmox-lvm-encrypted` StorageClass to Helm values (extraParameters)
- Add ExternalSecret for LUKS encryption passphrase to Terraform
- Fix CSI node plugin memory: `node.plugin.resources` (not `node.resources`)
with 1280Mi limit for LUKS2 Argon2id key derivation
Phase 1 — TF state reconciliation (zero downtime):
- Health, Matrix, N8N, Forgejo, Vaultwarden, Mailserver: state rm + import
- Redis, DBAAS MySQL, DBAAS PostgreSQL: Helm/CNPG value updates
Phase 2 — Data migration (encrypted PVCs existed but unused):
- Headscale, Frigate, MeshCentral: rsync + switchover
- Nextcloud (20Gi): rsync + chart_values update
Phase 3 — New encrypted PVCs:
- Roundcube HTML, HackMD, Affine, DBAAS pgadmin: create + rsync + switchover
Phase 4 — Cleanup:
- Deleted 5 orphaned unencrypted PVCs
## Services migrated (18 PVCs across 14 namespaces)
```
vaultwarden → vaultwarden-data-encrypted
dbaas → datadir-mysql-cluster-0, pg-cluster-{1,2}, dbaas-pgadmin-encrypted
mailserver → mailserver-data-encrypted, roundcubemail-{enigma,html}-encrypted
nextcloud → nextcloud-data-encrypted
forgejo → forgejo-data-encrypted
matrix → matrix-data-encrypted
n8n → n8n-data-encrypted
affine → affine-data-encrypted
health → health-uploads-encrypted
hackmd → hackmd-data-encrypted
redis → redis-data-redis-node-{0,1}
headscale → headscale-data-encrypted
frigate → frigate-config-encrypted
meshcentral → meshcentral-{data,files}-encrypted
```
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 20:15:30 +00:00
storage_class_name = " proxmox-lvm-encrypted "
feat(storage): migrate 38 NFS PVCs to proxmox-lvm (Wave 2)
Add proxmox-lvm PVCs with pvc-autoresizer annotations for all
remaining single-pod app data services. Deployments updated to
use new block storage PVCs. Old NFS modules retained for rollback.
Services: affine, changedetection, diun, excalidraw, f1-stream,
hackmd, isponsorblocktv, matrix, n8n, send, grampsweb, health,
onlyoffice, owntracks, paperless-ngx, privatebin, resume,
speedtest, stirling-pdf, tandoor, rybbit (clickhouse), tor-proxy
(torrserver), whisper+piper, frigate (config), ollama (ui),
servarr (prowlarr/listenarr/qbittorrent), aiostreams, freshrss
(extensions), meshcentral (data+files), openclaw (data+home+
openlobster), technitium, mailserver (data+roundcube html+enigma),
dbaas (pgadmin).
Strategy set to Recreate where needed for RWO volumes.
2026-04-04 19:25:12 +03:00
resources {
requests = {
storage = " 1Gi "
}
}
}
2026-05-10 21:57:01 +00:00
lifecycle {
# The autoresizer expands requests.storage up to storage_limit and
# PVCs can't shrink. Without this, every TF apply tries to revert
# to the spec value, K8s rejects the shrink, and the PVC ends up
# in Terminating-but-in-use limbo.
ignore_changes = [ spec [ 0 ] . resources [ 0 ] . requests ]
}
feat(storage): migrate 38 NFS PVCs to proxmox-lvm (Wave 2)
Add proxmox-lvm PVCs with pvc-autoresizer annotations for all
remaining single-pod app data services. Deployments updated to
use new block storage PVCs. Old NFS modules retained for rollback.
Services: affine, changedetection, diun, excalidraw, f1-stream,
hackmd, isponsorblocktv, matrix, n8n, send, grampsweb, health,
onlyoffice, owntracks, paperless-ngx, privatebin, resume,
speedtest, stirling-pdf, tandoor, rybbit (clickhouse), tor-proxy
(torrserver), whisper+piper, frigate (config), ollama (ui),
servarr (prowlarr/listenarr/qbittorrent), aiostreams, freshrss
(extensions), meshcentral (data+files), openclaw (data+home+
openlobster), technitium, mailserver (data+roundcube html+enigma),
dbaas (pgadmin).
Strategy set to Recreate where needed for RWO volumes.
2026-04-04 19:25:12 +03:00
}
truenas deprecation: migrate all non-immich storage to proxmox NFS
- Migrate 7 backup CronJobs to Proxmox host NFS (192.168.1.127)
(etcd, mysql, postgresql, nextcloud, redis, vaultwarden, plotting-book)
- Migrate headscale backup, ebook2audiobook, osm_routing to Proxmox NFS
- Migrate servarr (lidarr, readarr, soulseek) NFS refs to Proxmox
- Remove 79 orphaned TrueNAS NFS module declarations from 49 stacks
- Delete stacks/platform/modules/ (27 dead module copies, 65MB)
- Update nfs-truenas StorageClass to point to Proxmox (192.168.1.127)
- Remove iscsi DNS record from config.tfvars
- Fix woodpecker persistence config and alertmanager PV
Only Immich (8 PVCs, ~1.4TB) remains on TrueNAS.
2026-04-12 14:35:39 +01:00
module " nfs_postgresql_backup_host " {
2026-03-17 18:11:53 +00:00
source = " ../../../../modules/kubernetes/nfs_volume "
truenas deprecation: migrate all non-immich storage to proxmox NFS
- Migrate 7 backup CronJobs to Proxmox host NFS (192.168.1.127)
(etcd, mysql, postgresql, nextcloud, redis, vaultwarden, plotting-book)
- Migrate headscale backup, ebook2audiobook, osm_routing to Proxmox NFS
- Migrate servarr (lidarr, readarr, soulseek) NFS refs to Proxmox
- Remove 79 orphaned TrueNAS NFS module declarations from 49 stacks
- Delete stacks/platform/modules/ (27 dead module copies, 65MB)
- Update nfs-truenas StorageClass to point to Proxmox (192.168.1.127)
- Remove iscsi DNS record from config.tfvars
- Fix woodpecker persistence config and alertmanager PV
Only Immich (8 PVCs, ~1.4TB) remains on TrueNAS.
2026-04-12 14:35:39 +01:00
name = " dbaas-postgresql-backup-host "
2026-03-17 18:11:53 +00:00
namespace = kubernetes_namespace . dbaas . metadata [ 0 ] . name
truenas deprecation: migrate all non-immich storage to proxmox NFS
- Migrate 7 backup CronJobs to Proxmox host NFS (192.168.1.127)
(etcd, mysql, postgresql, nextcloud, redis, vaultwarden, plotting-book)
- Migrate headscale backup, ebook2audiobook, osm_routing to Proxmox NFS
- Migrate servarr (lidarr, readarr, soulseek) NFS refs to Proxmox
- Remove 79 orphaned TrueNAS NFS module declarations from 49 stacks
- Delete stacks/platform/modules/ (27 dead module copies, 65MB)
- Update nfs-truenas StorageClass to point to Proxmox (192.168.1.127)
- Remove iscsi DNS record from config.tfvars
- Fix woodpecker persistence config and alertmanager PV
Only Immich (8 PVCs, ~1.4TB) remains on TrueNAS.
2026-04-12 14:35:39 +01:00
nfs_server = " 192.168.1.127 "
nfs_path = " /srv/nfs/postgresql-backup "
2026-03-17 18:11:53 +00:00
}
resource " kubernetes_cron_job_v1 " " mysql-backup " {
metadata {
name = " mysql-backup "
namespace = kubernetes_namespace . dbaas . metadata [ 0 ] . name
}
spec {
concurrency_policy = " Replace "
failed_jobs_history_limit = 5
2026-03-23 02:24:34 +02:00
schedule = " 30 0 * * * "
2026-03-17 18:11:53 +00:00
# schedule = "* * * * *"
starting_deadline_seconds = 10
successful_jobs_history_limit = 10
job_template {
metadata { }
spec {
backoff_limit = 3
ttl_seconds_after_finished = 10
template {
metadata { }
spec {
container {
name = " mysql-backup "
2026-03-19 23:26:05 +00:00
image = " docker.io/library/mysql:8.0 "
2026-03-19 20:23:59 +00:00
env {
name = " MYSQL_PWD "
value_from {
secret_key_ref {
name = " cluster-secret "
key = " ROOT_PASSWORD "
}
}
}
2026-03-17 18:11:53 +00:00
command = [ " /bin/bash " , " -c " , < < - EOT
set - euxo pipefail
add backup IO logging, Pushgateway metrics, and Grafana dashboard
- Add /proc/self/io read/write tracking to vault raft-backup and etcd backup
- Push backup_duration_seconds, backup_read_bytes, backup_written_bytes,
backup_last_success_timestamp to Pushgateway from all 6 backup CronJobs
(etcd skipped — distroless image has no wget/curl)
- Add cloudsync_duration_seconds metric to cloudsync-monitor
- New "Backup Health" Grafana dashboard with 8 panels: time since last backup,
overview table, duration/IO trends, cloud sync status, alerts, CronJob schedule
2026-03-23 12:19:01 +02:00
_ t0 =$ ( date + % s )
2026-03-23 12:33:52 +02:00
_ rb0 =$ ( awk ' / ^ read_bytes / { print $ 2 } ' / proc / $ $ / io 2 > / dev / null | | echo 0 )
_ wb0 =$ ( awk ' / ^ write_bytes / { print $ 2 } ' / proc / $ $ / io 2 > / dev / null | | echo 0 )
add backup IO logging, Pushgateway metrics, and Grafana dashboard
- Add /proc/self/io read/write tracking to vault raft-backup and etcd backup
- Push backup_duration_seconds, backup_read_bytes, backup_written_bytes,
backup_last_success_timestamp to Pushgateway from all 6 backup CronJobs
(etcd skipped — distroless image has no wget/curl)
- Add cloudsync_duration_seconds metric to cloudsync-monitor
- New "Backup Health" Grafana dashboard with 8 panels: time since last backup,
overview table, duration/IO trends, cloud sync status, alerts, CronJob schedule
2026-03-23 12:19:01 +02:00
2026-03-17 18:11:53 +00:00
export now =$ ( date + " %Y_%m_%d_%H_%M " )
2026-03-23 02:24:34 +02:00
mysqldump - - all - data bases - u root - - host mysql . dbaas . svc . cluster . local | gzip -9 > / backup / dump_ $ now . sql . gz
2026-03-17 18:11:53 +00:00
2026-03-23 02:24:34 +02:00
# Rotate — 14 day retention
2026-03-17 18:11:53 +00:00
cd / backup
2026-03-23 02:24:34 +02:00
find . - name " dump_*.sql.gz " - type f - mtime + 14 - delete
find . - name " dump_*.sql " - type f - mtime + 14 - delete # clean up old uncompressed
add backup IO logging, Pushgateway metrics, and Grafana dashboard
- Add /proc/self/io read/write tracking to vault raft-backup and etcd backup
- Push backup_duration_seconds, backup_read_bytes, backup_written_bytes,
backup_last_success_timestamp to Pushgateway from all 6 backup CronJobs
(etcd skipped — distroless image has no wget/curl)
- Add cloudsync_duration_seconds metric to cloudsync-monitor
- New "Backup Health" Grafana dashboard with 8 panels: time since last backup,
overview table, duration/IO trends, cloud sync status, alerts, CronJob schedule
2026-03-23 12:19:01 +02:00
_ dur =$ ( ( $ ( date + % s ) - _ t0 ) )
2026-03-23 12:33:52 +02:00
_ rb1 =$ ( awk ' / ^ read_bytes / { print $ 2 } ' / proc / $ $ / io 2 > / dev / null | | echo 0 )
_ wb1 =$ ( awk ' / ^ write_bytes / { print $ 2 } ' / proc / $ $ / io 2 > / dev / null | | echo 0 )
add backup IO logging, Pushgateway metrics, and Grafana dashboard
- Add /proc/self/io read/write tracking to vault raft-backup and etcd backup
- Push backup_duration_seconds, backup_read_bytes, backup_written_bytes,
backup_last_success_timestamp to Pushgateway from all 6 backup CronJobs
(etcd skipped — distroless image has no wget/curl)
- Add cloudsync_duration_seconds metric to cloudsync-monitor
- New "Backup Health" Grafana dashboard with 8 panels: time since last backup,
overview table, duration/IO trends, cloud sync status, alerts, CronJob schedule
2026-03-23 12:19:01 +02:00
echo " === Backup IO Stats === "
echo " duration: $ ${ _ dur } s "
echo " read: $ (( (_rb1 - _rb0) / 1048576 )) MiB "
echo " written: $ (( (_wb1 - _wb0) / 1048576 )) MiB "
echo " output: $ (ls -lh /backup/dump_ $ now.sql.gz | awk '{print $ 5}') "
2026-03-25 10:44:53 +02:00
_ out_bytes =$ ( stat - c % s / backup / dump_ $ now . sql . gz )
add backup IO logging, Pushgateway metrics, and Grafana dashboard
- Add /proc/self/io read/write tracking to vault raft-backup and etcd backup
- Push backup_duration_seconds, backup_read_bytes, backup_written_bytes,
backup_last_success_timestamp to Pushgateway from all 6 backup CronJobs
(etcd skipped — distroless image has no wget/curl)
- Add cloudsync_duration_seconds metric to cloudsync-monitor
- New "Backup Health" Grafana dashboard with 8 panels: time since last backup,
overview table, duration/IO trends, cloud sync status, alerts, CronJob schedule
2026-03-23 12:19:01 +02:00
curl - sf - - data - binary @ - " http://prometheus-prometheus-pushgateway.monitoring:9091/metrics/job/mysql-backup " < < PGEOF | | true
backup_duration_seconds $ $ { _ dur }
backup_read_bytes $ ( ( _ rb1 - _ rb0 ) )
backup_written_bytes $ ( ( _ wb1 - _ wb0 ) )
2026-03-25 10:44:53 +02:00
backup_output_bytes $ $ { _ out_bytes }
add backup IO logging, Pushgateway metrics, and Grafana dashboard
- Add /proc/self/io read/write tracking to vault raft-backup and etcd backup
- Push backup_duration_seconds, backup_read_bytes, backup_written_bytes,
backup_last_success_timestamp to Pushgateway from all 6 backup CronJobs
(etcd skipped — distroless image has no wget/curl)
- Add cloudsync_duration_seconds metric to cloudsync-monitor
- New "Backup Health" Grafana dashboard with 8 panels: time since last backup,
overview table, duration/IO trends, cloud sync status, alerts, CronJob schedule
2026-03-23 12:19:01 +02:00
backup_last_success_timestamp $ ( date + % s )
PGEOF
2026-03-17 18:11:53 +00:00
EOT
]
# To restore (from outside of the cluster):
# run kubectl port-forward to pod e.g.:
# > kb port-forward mysql-647cfd4969-46rmw --address 0.0.0.0 3307:3306
# run mysql import (and specify non-localhost address to avoid using unix socket): (password is in tfvars)
# > mysql -u root -p --host 10.0.10.10 --port 3307 < /mnt/nfs/2024_01_06_13_54.sql
volume_mount {
name = " mysql-backup "
mount_path = " /backup "
}
}
volume {
name = " mysql-backup "
persistent_volume_claim {
truenas deprecation: migrate all non-immich storage to proxmox NFS
- Migrate 7 backup CronJobs to Proxmox host NFS (192.168.1.127)
(etcd, mysql, postgresql, nextcloud, redis, vaultwarden, plotting-book)
- Migrate headscale backup, ebook2audiobook, osm_routing to Proxmox NFS
- Migrate servarr (lidarr, readarr, soulseek) NFS refs to Proxmox
- Remove 79 orphaned TrueNAS NFS module declarations from 49 stacks
- Delete stacks/platform/modules/ (27 dead module copies, 65MB)
- Update nfs-truenas StorageClass to point to Proxmox (192.168.1.127)
- Remove iscsi DNS record from config.tfvars
- Fix woodpecker persistence config and alertmanager PV
Only Immich (8 PVCs, ~1.4TB) remains on TrueNAS.
2026-04-12 14:35:39 +01:00
claim_name = module . nfs_mysql_backup_host . claim_name
2026-03-17 18:11:53 +00:00
}
}
}
}
}
}
}
[infra] Sweep dns_config ignore_changes across all pod-owning resources [ci skip]
## Context
Wave 3A (commit c9d221d5) added the `# KYVERNO_LIFECYCLE_V1` marker to the
27 pre-existing `ignore_changes = [...dns_config]` sites so they could be
grepped and audited. It did NOT address pod-owning resources that were
simply missing the suppression entirely. Post-Wave-3A sampling (2026-04-18)
found that navidrome, f1-stream, frigate, servarr, monitoring, crowdsec,
and many other stacks showed perpetual `dns_config` drift every plan
because their `kubernetes_deployment` / `kubernetes_stateful_set` /
`kubernetes_cron_job_v1` resources had no `lifecycle {}` block at all.
Root cause (same as Wave 3A): Kyverno's admission webhook stamps
`dns_config { option { name = "ndots"; value = "2" } }` on every pod's
`spec.template.spec.dns_config` to prevent NxDomain search-domain flooding
(see `k8s-ndots-search-domain-nxdomain-flood` skill). Without `ignore_changes`
on every Terraform-managed pod-owner, Terraform repeatedly tries to strip
the injected field.
## This change
Extends the Wave 3A convention by sweeping EVERY `kubernetes_deployment`,
`kubernetes_stateful_set`, `kubernetes_daemon_set`, `kubernetes_cron_job_v1`,
`kubernetes_job_v1` (+ their `_v1` variants) in the repo and ensuring each
carries the right `ignore_changes` path:
- **kubernetes_deployment / stateful_set / daemon_set / job_v1**:
`spec[0].template[0].spec[0].dns_config`
- **kubernetes_cron_job_v1**:
`spec[0].job_template[0].spec[0].template[0].spec[0].dns_config`
(extra `job_template[0]` nesting — the CronJob's PodTemplateSpec is
one level deeper)
Each injection / extension is tagged `# KYVERNO_LIFECYCLE_V1: Kyverno
admission webhook mutates dns_config with ndots=2` inline so the
suppression is discoverable via `rg 'KYVERNO_LIFECYCLE_V1' stacks/`.
Two insertion paths are handled by a Python pass (`/tmp/add_dns_config_ignore.py`):
1. **No existing `lifecycle {}`**: inject a brand-new block just before the
resource's closing `}`. 108 new blocks on 93 files.
2. **Existing `lifecycle {}` (usually for `DRIFT_WORKAROUND: CI owns image tag`
from Wave 4, commit a62b43d1)**: extend its `ignore_changes` list with the
dns_config path. Handles both inline (`= [x]`) and multiline
(`= [\n x,\n]`) forms; ensures the last pre-existing list item carries
a trailing comma so the extended list is valid HCL. 34 extensions.
The script skips anything already mentioning `dns_config` inside an
`ignore_changes`, so re-running is a no-op.
## Scale
- 142 total lifecycle injections/extensions
- 93 `.tf` files touched
- 108 brand-new `lifecycle {}` blocks + 34 extensions of existing ones
- Every Tier 0 and Tier 1 stack with a pod-owning resource is covered
- Together with Wave 3A's 27 pre-existing markers → **169 greppable
`KYVERNO_LIFECYCLE_V1` dns_config sites across the repo**
## What is NOT in this change
- `stacks/trading-bot/main.tf` — entirely commented-out block (`/* … */`).
Python script touched the file, reverted manually.
- `_template/main.tf.example` skeleton — kept minimal on purpose; any
future stack created from it should either inherit the Wave 3A one-line
form or add its own on first `kubernetes_deployment`.
- `terraform fmt` fixes to pre-existing alignment issues in meshcentral,
nvidia/modules/nvidia, vault — unrelated to this commit. Left for a
separate fmt-only pass.
- Non-pod resources (`kubernetes_service`, `kubernetes_secret`,
`kubernetes_manifest`, etc.) — they don't own pods so they don't get
Kyverno dns_config mutation.
## Verification
Random sample post-commit:
```
$ cd stacks/navidrome && ../../scripts/tg plan → No changes.
$ cd stacks/f1-stream && ../../scripts/tg plan → No changes.
$ cd stacks/frigate && ../../scripts/tg plan → No changes.
$ rg -c 'KYVERNO_LIFECYCLE_V1' stacks/ --include='*.tf' --include='*.tf.example' \
| awk -F: '{s+=$2} END {print s}'
169
```
## Reproduce locally
1. `git pull`
2. `rg 'KYVERNO_LIFECYCLE_V1' stacks/ | wc -l` → 169+
3. `cd stacks/navidrome && ../../scripts/tg plan` → expect 0 drift on
the deployment's dns_config field.
Refs: code-seq (Wave 3B dns_config class closed; kubernetes_manifest
annotation class handled separately in 8d94688d for tls_secret)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 21:19:48 +00:00
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [ spec [ 0 ] . job_template [ 0 ] . spec [ 0 ] . template [ 0 ] . spec [ 0 ] . dns_config ]
}
2026-03-17 18:11:53 +00:00
}
2026-04-14 22:39:33 +00:00
# Per-database MySQL backups (enables single-database restore without affecting others)
resource " kubernetes_cron_job_v1 " " mysql-backup-per-db " {
metadata {
name = " mysql-backup-per-db "
namespace = kubernetes_namespace . dbaas . metadata [ 0 ] . name
}
spec {
feat(storage): migrate all sensitive services to proxmox-lvm-encrypted
Reconcile Terraform with cluster state after manual encrypted PVC migrations
and complete the remaining unfinished migrations. All services storing
sensitive data now use LUKS2-encrypted block storage via the Proxmox CSI
plugin.
## Context
Only Technitium DNS was using encrypted storage in Terraform. Many services
had been manually migrated to encrypted PVCs in the cluster, but Terraform
was never updated — creating dangerous state drift where a `tg apply` could
recreate unencrypted PVCs.
## This change
Phase 0 — Infrastructure:
- Add `proxmox-lvm-encrypted` StorageClass to Helm values (extraParameters)
- Add ExternalSecret for LUKS encryption passphrase to Terraform
- Fix CSI node plugin memory: `node.plugin.resources` (not `node.resources`)
with 1280Mi limit for LUKS2 Argon2id key derivation
Phase 1 — TF state reconciliation (zero downtime):
- Health, Matrix, N8N, Forgejo, Vaultwarden, Mailserver: state rm + import
- Redis, DBAAS MySQL, DBAAS PostgreSQL: Helm/CNPG value updates
Phase 2 — Data migration (encrypted PVCs existed but unused):
- Headscale, Frigate, MeshCentral: rsync + switchover
- Nextcloud (20Gi): rsync + chart_values update
Phase 3 — New encrypted PVCs:
- Roundcube HTML, HackMD, Affine, DBAAS pgadmin: create + rsync + switchover
Phase 4 — Cleanup:
- Deleted 5 orphaned unencrypted PVCs
## Services migrated (18 PVCs across 14 namespaces)
```
vaultwarden → vaultwarden-data-encrypted
dbaas → datadir-mysql-cluster-0, pg-cluster-{1,2}, dbaas-pgadmin-encrypted
mailserver → mailserver-data-encrypted, roundcubemail-{enigma,html}-encrypted
nextcloud → nextcloud-data-encrypted
forgejo → forgejo-data-encrypted
matrix → matrix-data-encrypted
n8n → n8n-data-encrypted
affine → affine-data-encrypted
health → health-uploads-encrypted
hackmd → hackmd-data-encrypted
redis → redis-data-redis-node-{0,1}
headscale → headscale-data-encrypted
frigate → frigate-config-encrypted
meshcentral → meshcentral-{data,files}-encrypted
```
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 20:15:30 +00:00
concurrency_policy = " Replace "
failed_jobs_history_limit = 3
schedule = " 45 0 * * * "
2026-04-14 22:39:33 +00:00
starting_deadline_seconds = 10
successful_jobs_history_limit = 3
job_template {
metadata { }
spec {
backoff_limit = 3
ttl_seconds_after_finished = 10
template {
metadata { }
spec {
container {
name = " mysql-backup-per-db "
image = " docker.io/library/mysql:8.0 "
env {
name = " MYSQL_PWD "
value_from {
secret_key_ref {
name = " cluster-secret "
key = " ROOT_PASSWORD "
}
}
}
command = [ " /bin/bash " , " -c " , < < - EOT
set - euo pipefail
_ t0 =$ ( date + % s )
now =$ ( date + " %Y_%m_%d_%H_%M " )
MYSQL_HOST =mysql . dbaas . svc . cluster . local
failed =0
total =0
ok =0
# Discover all user databases
dbs =$ ( mysql - u root - - host $ MYSQL_HOST - N - e \
" SELECT schema_name FROM information_schema.schemata WHERE schema_name NOT IN ('mysql','information_schema','performance_schema','sys','mysql_innodb_cluster_metadata'); " )
for db in $ dbs ; do
total =$ ( ( total + 1 ) )
mkdir - p / backup / per - db / $ db
echo " === Backing up $ db === "
if mysqldump - u root - - host $ MYSQL_HOST - - single - transaction - - set - gtid - purged =OFF " $ db " | gzip -9 > " /backup/per-db/ $ db/dump_ $ now.sql.gz " ; then
_ size =$ ( stat - c % s " /backup/per-db/ $ db/dump_ $ now.sql.gz " )
echo " OK — $ (( _size / 1024 )) KiB "
ok =$ ( ( ok + 1 ) )
else
echo " FAILED "
rm - f " /backup/per-db/ $ db/dump_ $ now.sql.gz "
failed =$ ( ( failed + 1 ) )
fi
done
# Rotate — 14 day retention per database
find / backup / per - db - name " dump_*.sql.gz " - type f - mtime + 14 - delete
_ dur =$ ( ( $ ( date + % s ) - _ t0 ) )
echo " === Per-DB Backup Summary === "
echo " databases: $ total (ok: $ ok, failed: $ failed) "
echo " duration: $ ${ _ dur } s "
curl - sf - - data - binary @ - " http://prometheus-prometheus-pushgateway.monitoring:9091/metrics/job/mysql-backup-per-db " < < PGEOF | | true
backup_duration_seconds $ $ { _ dur }
backup_databases_total $ total
backup_databases_ok $ ok
backup_databases_failed $ failed
backup_last_success_timestamp $ ( date + % s )
PGEOF
EOT
]
volume_mount {
name = " mysql-backup "
mount_path = " /backup "
}
}
volume {
name = " mysql-backup "
persistent_volume_claim {
claim_name = module . nfs_mysql_backup_host . claim_name
}
}
}
}
}
}
}
[infra] Sweep dns_config ignore_changes across all pod-owning resources [ci skip]
## Context
Wave 3A (commit c9d221d5) added the `# KYVERNO_LIFECYCLE_V1` marker to the
27 pre-existing `ignore_changes = [...dns_config]` sites so they could be
grepped and audited. It did NOT address pod-owning resources that were
simply missing the suppression entirely. Post-Wave-3A sampling (2026-04-18)
found that navidrome, f1-stream, frigate, servarr, monitoring, crowdsec,
and many other stacks showed perpetual `dns_config` drift every plan
because their `kubernetes_deployment` / `kubernetes_stateful_set` /
`kubernetes_cron_job_v1` resources had no `lifecycle {}` block at all.
Root cause (same as Wave 3A): Kyverno's admission webhook stamps
`dns_config { option { name = "ndots"; value = "2" } }` on every pod's
`spec.template.spec.dns_config` to prevent NxDomain search-domain flooding
(see `k8s-ndots-search-domain-nxdomain-flood` skill). Without `ignore_changes`
on every Terraform-managed pod-owner, Terraform repeatedly tries to strip
the injected field.
## This change
Extends the Wave 3A convention by sweeping EVERY `kubernetes_deployment`,
`kubernetes_stateful_set`, `kubernetes_daemon_set`, `kubernetes_cron_job_v1`,
`kubernetes_job_v1` (+ their `_v1` variants) in the repo and ensuring each
carries the right `ignore_changes` path:
- **kubernetes_deployment / stateful_set / daemon_set / job_v1**:
`spec[0].template[0].spec[0].dns_config`
- **kubernetes_cron_job_v1**:
`spec[0].job_template[0].spec[0].template[0].spec[0].dns_config`
(extra `job_template[0]` nesting — the CronJob's PodTemplateSpec is
one level deeper)
Each injection / extension is tagged `# KYVERNO_LIFECYCLE_V1: Kyverno
admission webhook mutates dns_config with ndots=2` inline so the
suppression is discoverable via `rg 'KYVERNO_LIFECYCLE_V1' stacks/`.
Two insertion paths are handled by a Python pass (`/tmp/add_dns_config_ignore.py`):
1. **No existing `lifecycle {}`**: inject a brand-new block just before the
resource's closing `}`. 108 new blocks on 93 files.
2. **Existing `lifecycle {}` (usually for `DRIFT_WORKAROUND: CI owns image tag`
from Wave 4, commit a62b43d1)**: extend its `ignore_changes` list with the
dns_config path. Handles both inline (`= [x]`) and multiline
(`= [\n x,\n]`) forms; ensures the last pre-existing list item carries
a trailing comma so the extended list is valid HCL. 34 extensions.
The script skips anything already mentioning `dns_config` inside an
`ignore_changes`, so re-running is a no-op.
## Scale
- 142 total lifecycle injections/extensions
- 93 `.tf` files touched
- 108 brand-new `lifecycle {}` blocks + 34 extensions of existing ones
- Every Tier 0 and Tier 1 stack with a pod-owning resource is covered
- Together with Wave 3A's 27 pre-existing markers → **169 greppable
`KYVERNO_LIFECYCLE_V1` dns_config sites across the repo**
## What is NOT in this change
- `stacks/trading-bot/main.tf` — entirely commented-out block (`/* … */`).
Python script touched the file, reverted manually.
- `_template/main.tf.example` skeleton — kept minimal on purpose; any
future stack created from it should either inherit the Wave 3A one-line
form or add its own on first `kubernetes_deployment`.
- `terraform fmt` fixes to pre-existing alignment issues in meshcentral,
nvidia/modules/nvidia, vault — unrelated to this commit. Left for a
separate fmt-only pass.
- Non-pod resources (`kubernetes_service`, `kubernetes_secret`,
`kubernetes_manifest`, etc.) — they don't own pods so they don't get
Kyverno dns_config mutation.
## Verification
Random sample post-commit:
```
$ cd stacks/navidrome && ../../scripts/tg plan → No changes.
$ cd stacks/f1-stream && ../../scripts/tg plan → No changes.
$ cd stacks/frigate && ../../scripts/tg plan → No changes.
$ rg -c 'KYVERNO_LIFECYCLE_V1' stacks/ --include='*.tf' --include='*.tf.example' \
| awk -F: '{s+=$2} END {print s}'
169
```
## Reproduce locally
1. `git pull`
2. `rg 'KYVERNO_LIFECYCLE_V1' stacks/ | wc -l` → 169+
3. `cd stacks/navidrome && ../../scripts/tg plan` → expect 0 drift on
the deployment's dns_config field.
Refs: code-seq (Wave 3B dns_config class closed; kubernetes_manifest
annotation class handled separately in 8d94688d for tls_secret)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 21:19:48 +00:00
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [ spec [ 0 ] . job_template [ 0 ] . spec [ 0 ] . template [ 0 ] . spec [ 0 ] . dns_config ]
}
2026-04-14 22:39:33 +00:00
}
2026-03-17 18:11:53 +00:00
# resource "kubernetes_persistent_volume" "mysql" {
# metadata {
# name = "mysql-pv"
# }
# spec {
# capacity = {
# "storage" = "10Gi"
# }
# access_modes = ["ReadWriteOnce"]
# persistent_volume_source {
# iscsi {
# target_portal = "iscsi.viktorbarzin.lan:3260"
# iqn = "iqn.2020-12.lan.viktorbarzin:storage:dbaas:mysql"
# lun = 0
# fs_type = "ext4"
# }
# }
# }
# }
# resource "helm_release" "mysql" {
# namespace = kubernetes_namespace.dbaas.metadata[0].name
# create_namespace = false
# name = "mysql"
# repository = "https://presslabs.github.io/charts"
# chart = "mysql-operator"
# # version = "v0.5.0-rc.3"
# values = [templatefile("${path.module}/mysql_chart_values.yaml", { secretName = var.tls_secret_name })]
# atomic = true
# depends_on = [kubernetes_namespace.dbaas]
# }
# # resource "helm_release" "mysql" {
# # namespace = kubernetes_namespace.dbaas.metadata[0].name
# # create_namespace = false
# # name = "mysql-operator"
# # repository = "https://mysql.github.io/mysql-operator/"
# # chart = "mysql-operator"
# # atomic = true
# # depends_on = [kubernetes_namespace.dbaas]
# # }
# # resource "helm_release" "innodb-cluster" {
# # namespace = kubernetes_namespace.dbaas.metadata[0].name
# # create_namespace = false
# # name = var.cluster_master_service
# # repository = "https://mysql.github.io/mysql-operator/"
# # chart = "mysql-innodbcluster"
# # atomic = true
# # depends_on = [kubernetes_namespace.dbaas]
# # values = [templatefile("${path.module}/chart_values.tpl", { root_password = var.dbaas_root_password })]
# # }
# resource "kubernetes_persistent_volume" "mysql-operator" {
# metadata {
# name = "mysql-operator-pv"
# }
# spec {
# capacity = {
# "storage" = "1Gi"
# }
# access_modes = ["ReadWriteOnce"]
# persistent_volume_source {
# iscsi {
# target_portal = "iscsi.viktorbarzin.lan:3260"
# iqn = "iqn.2020-12.lan.viktorbarzin:storage:dbaas:operator"
# lun = 0
# fs_type = "ext4"
# }
# }
# }
# }
resource " kubernetes_secret " " cluster-password " {
metadata {
name = " cluster-secret "
namespace = kubernetes_namespace . dbaas . metadata [ 0 ] . name
annotations = {
" reloader.stakater.com/match " = " true "
}
}
type = " Opaque "
data = {
" ROOT_PASSWORD " = var . dbaas_root_password
}
}
# resource "kubernetes_ingress_v1" "dbaas" {
# metadata {
# name = "orchestrator-ingress"
# namespace = kubernetes_namespace.dbaas.metadata[0].name
# annotations = {
# "kubernetes.io/ingress.class" = "nginx"
# "nginx.ingress.kubernetes.io/auth-tls-verify-client" = "on"
# "nginx.ingress.kubernetes.io/auth-tls-secret" = "default/ca-secret"
# }
# }
# spec {
# tls {
# hosts = ["db.viktorbarzin.me"]
# secret_name = var.tls_secret_name
# }
# rule {
# host = "db.viktorbarzin.me"
# http {
# path {
# path = "/"
# backend {
# service {
# name = "mysql-mysql-operator"
# port {
# number = 80
# }
# }
# }
# }
# }
# }
# }
# }
# PHPMyAdmin instance
resource " kubernetes_deployment " " phpmyadmin " {
metadata {
name = " phpmyadmin "
namespace = kubernetes_namespace . dbaas . metadata [ 0 ] . name
labels = {
" app " = " phpmyadmin "
tier = var . tier
}
annotations = {
" reloader.stakater.com/search " = " true "
}
}
spec {
replicas = " 1 "
selector {
match_labels = {
" app " = " phpmyadmin "
}
}
template {
metadata {
labels = {
" app " = " phpmyadmin "
}
}
spec {
container {
name = " phpmyadmin "
image = " phpmyadmin/phpmyadmin:5.2.3 "
port {
container_port = 80
}
env {
name = " PMA_HOST "
value = var . cluster_master_service
}
env {
name = " PMA_PORT "
value = " 3306 "
}
env {
name = " MYSQL_ROOT_PASSWORD "
value_from {
secret_key_ref {
name = " cluster-secret "
key = " ROOT_PASSWORD "
}
}
}
env {
name = " UPLOAD_LIMIT "
value = " 300M "
}
resources {
requests = {
cpu = " 15m "
2026-04-06 11:54:45 +03:00
memory = " 100Mi "
2026-03-17 18:11:53 +00:00
}
limits = {
2026-04-06 11:54:45 +03:00
memory = " 100Mi "
2026-03-17 18:11:53 +00:00
}
}
}
dns_config {
option {
name = " ndots "
value = " 2 "
}
}
}
}
}
[infra] Sweep dns_config ignore_changes across all pod-owning resources [ci skip]
## Context
Wave 3A (commit c9d221d5) added the `# KYVERNO_LIFECYCLE_V1` marker to the
27 pre-existing `ignore_changes = [...dns_config]` sites so they could be
grepped and audited. It did NOT address pod-owning resources that were
simply missing the suppression entirely. Post-Wave-3A sampling (2026-04-18)
found that navidrome, f1-stream, frigate, servarr, monitoring, crowdsec,
and many other stacks showed perpetual `dns_config` drift every plan
because their `kubernetes_deployment` / `kubernetes_stateful_set` /
`kubernetes_cron_job_v1` resources had no `lifecycle {}` block at all.
Root cause (same as Wave 3A): Kyverno's admission webhook stamps
`dns_config { option { name = "ndots"; value = "2" } }` on every pod's
`spec.template.spec.dns_config` to prevent NxDomain search-domain flooding
(see `k8s-ndots-search-domain-nxdomain-flood` skill). Without `ignore_changes`
on every Terraform-managed pod-owner, Terraform repeatedly tries to strip
the injected field.
## This change
Extends the Wave 3A convention by sweeping EVERY `kubernetes_deployment`,
`kubernetes_stateful_set`, `kubernetes_daemon_set`, `kubernetes_cron_job_v1`,
`kubernetes_job_v1` (+ their `_v1` variants) in the repo and ensuring each
carries the right `ignore_changes` path:
- **kubernetes_deployment / stateful_set / daemon_set / job_v1**:
`spec[0].template[0].spec[0].dns_config`
- **kubernetes_cron_job_v1**:
`spec[0].job_template[0].spec[0].template[0].spec[0].dns_config`
(extra `job_template[0]` nesting — the CronJob's PodTemplateSpec is
one level deeper)
Each injection / extension is tagged `# KYVERNO_LIFECYCLE_V1: Kyverno
admission webhook mutates dns_config with ndots=2` inline so the
suppression is discoverable via `rg 'KYVERNO_LIFECYCLE_V1' stacks/`.
Two insertion paths are handled by a Python pass (`/tmp/add_dns_config_ignore.py`):
1. **No existing `lifecycle {}`**: inject a brand-new block just before the
resource's closing `}`. 108 new blocks on 93 files.
2. **Existing `lifecycle {}` (usually for `DRIFT_WORKAROUND: CI owns image tag`
from Wave 4, commit a62b43d1)**: extend its `ignore_changes` list with the
dns_config path. Handles both inline (`= [x]`) and multiline
(`= [\n x,\n]`) forms; ensures the last pre-existing list item carries
a trailing comma so the extended list is valid HCL. 34 extensions.
The script skips anything already mentioning `dns_config` inside an
`ignore_changes`, so re-running is a no-op.
## Scale
- 142 total lifecycle injections/extensions
- 93 `.tf` files touched
- 108 brand-new `lifecycle {}` blocks + 34 extensions of existing ones
- Every Tier 0 and Tier 1 stack with a pod-owning resource is covered
- Together with Wave 3A's 27 pre-existing markers → **169 greppable
`KYVERNO_LIFECYCLE_V1` dns_config sites across the repo**
## What is NOT in this change
- `stacks/trading-bot/main.tf` — entirely commented-out block (`/* … */`).
Python script touched the file, reverted manually.
- `_template/main.tf.example` skeleton — kept minimal on purpose; any
future stack created from it should either inherit the Wave 3A one-line
form or add its own on first `kubernetes_deployment`.
- `terraform fmt` fixes to pre-existing alignment issues in meshcentral,
nvidia/modules/nvidia, vault — unrelated to this commit. Left for a
separate fmt-only pass.
- Non-pod resources (`kubernetes_service`, `kubernetes_secret`,
`kubernetes_manifest`, etc.) — they don't own pods so they don't get
Kyverno dns_config mutation.
## Verification
Random sample post-commit:
```
$ cd stacks/navidrome && ../../scripts/tg plan → No changes.
$ cd stacks/f1-stream && ../../scripts/tg plan → No changes.
$ cd stacks/frigate && ../../scripts/tg plan → No changes.
$ rg -c 'KYVERNO_LIFECYCLE_V1' stacks/ --include='*.tf' --include='*.tf.example' \
| awk -F: '{s+=$2} END {print s}'
169
```
## Reproduce locally
1. `git pull`
2. `rg 'KYVERNO_LIFECYCLE_V1' stacks/ | wc -l` → 169+
3. `cd stacks/navidrome && ../../scripts/tg plan` → expect 0 drift on
the deployment's dns_config field.
Refs: code-seq (Wave 3B dns_config class closed; kubernetes_manifest
annotation class handled separately in 8d94688d for tls_secret)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 21:19:48 +00:00
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [ spec [ 0 ] . template [ 0 ] . spec [ 0 ] . dns_config ]
}
2026-03-17 18:11:53 +00:00
}
resource " kubernetes_service " " phpmyadmin " {
metadata {
name = " pma "
namespace = kubernetes_namespace . dbaas . metadata [ 0 ] . name
}
spec {
selector = {
" app " = " phpmyadmin "
}
port {
name = " web "
port = 80
}
}
}
module " ingress " {
2026-04-17 12:41:17 +00:00
source = " ../../../../modules/kubernetes/ingress_factory "
dns_type = " proxied "
namespace = kubernetes_namespace . dbaas . metadata [ 0 ] . name
name = " pma "
tls_secret_name = var . tls_secret_name
ingress_factory: replace `protected` bool with `auth` enum + audit pass across 100 stacks
Phase 3+4 of default-deny ingress plan. Replaces the `protected = bool` (default
false → unprotected) variable in `modules/kubernetes/ingress_factory` with
`auth = string` enum (default "required" → fail-closed). Touches every
ingress_factory caller so the audit decision is recorded explicitly in code.
ingress_factory (Phase 3):
- `auth = "required"`: standard Authentik forward-auth (the legacy
`protected = true` semantic).
- `auth = "public"`: forward-auth via the new `authentik-forward-auth-public`
middleware → dedicated public outpost → guest auto-bind. Logged-in users
keep their real identity.
- `auth = "none"`: no Authentik middleware. For Anubis-fronted content, native
client APIs (Git, /v2/, WebDAV), webhook receivers, the Authentik outpost
itself.
- `effective_anti_ai` default flips ON only when `auth = "none"` (auth-gated
ingresses don't need anti-AI noise; the auth flow already discourages bots).
Audit pass (Phase 4) across 96 ingress_factory call sites:
- 49 explicit `protected = true` → `auth = "required"`
- 8 explicit `protected = false` → `auth = "none"` (5) or `auth = "public"` (3)
- 64 previously-default (no protected line) → `auth = "required"` ADDED, then
reviewed individually:
* 9 Anubis-fronted (blog, www, kms, travel, f1, cyberchef, jsoncrack,
homepage, wrongmove UI, privatebin) → `auth = "none"`
* 22 native-client / programmatic surfaces (Forgejo Git+/v2/, webhook
handler, claude-memory MCP, Nextcloud WebDAV, Matrix, Vault CLI/OIDC,
xray VPN, ntfy, woodpecker webhooks, n8n triggers, ntfy push, dawarich
location ingestion, immich frame kiosk, headscale CP, send anonymous
drops, rybbit beacon, vaultwarden API, Authentik UI itself + outposts) →
`auth = "none"`
* Remaining ~33 → `auth = "required"` confirmed (admin tools, internal
UIs, services without app-level auth)
- Smoke-test promotions to `auth = "public"`: fire-planner public UI,
k8s-portal API, insta2spotify callback.
Three call sites in wrapper modules (`stacks/freedify/factory/`,
`stacks/reverse-proxy/modules/reverse_proxy/`) keep their internal `protected`
bool — they translate to `auth` internally, out of scope for this rename.
Behavior change: previously-default ingresses now fail closed (require
Authentik login) unless explicitly flipped to `auth = "none"` or
`auth = "public"`. This is the audit goal — no more accidentally-unprotected
surfaces. Sites that were intentionally public (Anubis content, native APIs,
webhooks) are now explicitly recorded as `auth = "none"`.
Drive-by: `modules/create-vm/main.tf` picked up cosmetic alignment via
`terraform fmt -recursive` during the audit. Behavior-neutral.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-10 18:53:49 +00:00
auth = " required "
2026-04-17 12:41:17 +00:00
extra_annotations = { }
2026-03-17 18:11:53 +00:00
}
# resource "kubectl_manifest" "mysql-cluster" {
# yaml_body = <<-YAML
# apiVersion: mysql.presslabs.org/v1alpha1
# kind: MysqlCluster
# metadata:
# name: mysql-cluster
# namespace = kubernetes_namespace.dbaas.metadata[0].name
# spec:
# mysqlVersion: "5.7"
# replicas: 1
# secretName: cluster-secret
# mysqlConf:
# # read_only: 0 # mysql forms a single transaction for each sql statement, autocommit for each statement
# # automatic_sp_privileges: "ON" # automatically grants the EXECUTE and ALTER ROUTINE privileges to the creator of a stored routine
# # auto_generate_certs: "ON" # Auto Generation of Certificate
# # auto_increment_increment: 1 # Auto Incrementing value from +1
# # auto_increment_offset: 1 # Auto Increment Offset
# # binlog-format: "STATEMENT" # contains various options such ROW(SLOW,SAFE) STATEMENT(FAST,UNSAFE), MIXED(combination of both)
# # wait_timeout: 31536000 # 28800 number of seconds the server waits for activity on a non-interactive connection before closing it, You might encounter MySQL server has gone away error, you then tweak this value acccordingly
# # interactive_timeout: 28800 # The number of seconds the server waits for activity on an interactive connection before closing it.
# # max_allowed_packet: "512M" # Maximum size of MYSQL Network protocol packet that the server can create or read 4MB, 8MB, 16MB, 32MB
# # max-binlog-size: 1073741824 # binary logs contains the events that describe database changes, this parameter describe size for the bin_log file.
# # log_output: "TABLE" # Format in which the logout will be dumped
# # master-info-repository: "TABLE" # Format in which the master info will be dumped
# # relay_log_info_repository: "TABLE" # Format in which the relay info will be dumped
# volumeSpec:
# persistentVolumeClaim:
# accessModes:
# - ReadWriteOnce
# resources:
# requests:
# storage: 10Gi
# YAML
# depends_on = [helm_release.mysql]
# # manifest = {
# # apiVersion = "mysql.presslabs.org/v1alpha1"
# # kind = "MysqlCluster"
# # metadata = {
# # name = "mysql-cluster"
# # namespace = kubernetes_namespace.dbaas.metadata[0].name
# # }
# # spec = {
# # mysqlVersion = "5.7"
# # replicas = 1
# # secretName = "cluster-secret"
# # mysqlConf = {
# # read_only = 0
# # }
# # volumeSpec = {
# # persistentVolumeClaim = {
# # resources = {
# # requests = {
# # storage = "10Gi"
# # }
# # }
# # }
# # }
# # }
# # }
# }
# For some unknwown reason not all CRDs are installed. Add them manually
# resource "kubectl_manifest" "mysql-user" {
# yaml_body = <<-EOF
# apiVersion: apiextensions.k8s.io/v1
# kind: CustomResourceDefinition
# metadata:
# annotations:
# controller-gen.kubebuilder.io/version: v0.5.0
# helm.sh/hook: crd-install
# name: mysqlusers.mysql.presslabs.org
# labels:
# app: mysql-operator
# spec:
# group: mysql.presslabs.org
# names:
# kind: MysqlUser
# listKind: MysqlUserList
# plural: mysqlusers
# singular: mysqluser
# scope:namespace = kubernetes_namespace.dbaas.metadata[0].name
# versions:
# - additionalPrinterColumns:
# - description: The user status
# jsonPath: .status.conditions[?(@.type == 'Ready')].status
# name: Ready
# type: string
# - jsonPath: .spec.clusterRef.name
# name: Cluster
# type: string
# - jsonPath: .spec.user
# name: UserName
# type: string
# - jsonPath: .metadata.creationTimestamp
# name: Age
# type: date
# name: v1alpha1
# schema:
# openAPIV3Schema:
# description: MysqlUser is the Schema for the MySQL User API
# properties:
# apiVersion:
# description: 'APIVersion defines the versioned schema of this representation of an object. Servers should convert recognized schemas to the latest internal value, and may reject unrecognized values. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#resources'
# type: string
# kind:
# description: 'Kind is a string value representing the REST resource this object represents. Servers may infer this from the endpoint the client submits requests to. Cannot be updated. In CamelCase. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds'
# type: string
# metadata:
# type: object
# spec:
# description: MysqlUserSpec defines the desired state of MysqlUserSpec
# properties:
# allowedHosts:
# description: AllowedHosts is the allowed host to connect from.
# items:
# type: string
# type: array
# clusterRef:
# description: ClusterRef represents a reference to the MySQL cluster. This field should be immutable.
# properties:
# name:
# description: 'Name of the referent. More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#names TODO: Add other useful fields. apiVersion, kind, uid?'
# type: string
# namespace = kubernetes_namespace.dbaas.metadata[0].name
# description:namespace = kubernetes_namespace.dbaas.metadata[0].name
# type: string
# type: object
# password:
# description: Password is the password for the user.
# properties:
# key:
# description: The key of the secret to select from. Must be a valid secret key.
# type: string
# name:
# description: 'Name of the referent. More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#names TODO: Add other useful fields. apiVersion, kind, uid?'
# type: string
# optional:
# description: Specify whether the Secret or its key must be defined
# type: boolean
# required:
# - key
# type: object
# permissions:
# description: Permissions is the list of roles that user has in the specified database.
# items:
# description: MysqlPermission defines a MySQL schema permission
# properties:
# permissions:
# description: Permissions represents the permissions granted on the schema/tables
# items:
# type: string
# type: array
# schema:
# description: Schema represents the schema to which the permission applies
# type: string
# tables:
# description: Tables represents the tables inside the schema to which the permission applies
# items:
# type: string
# type: array
# required:
# - permissions
# - schema
# - tables
# type: object
# type: array
# resourceLimits:
# additionalProperties:
# anyOf:
# - type: integer
# - type: string
# pattern: ^(\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))(([KMGTPE]i)|[numkMGTPE]|([eE](\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))))?$
# x-kubernetes-int-or-string: true
# description: 'ResourceLimits allow settings limit per mysql user as defined here: https://dev.mysql.com/doc/refman/5.7/en/user-resources.html'
# type: object
# user:
# description: User is the name of the user that will be created with will access the specified database. This field should be immutable.
# type: string
# required:
# - allowedHosts
# - clusterRef
# - password
# - user
# type: object
# status:
# description: MysqlUserStatus defines the observed state of MysqlUser
# properties:
# allowedHosts:
# description: AllowedHosts contains the list of hosts that the user is allowed to connect from.
# items:
# type: string
# type: array
# conditions:
# description: Conditions represents the MysqlUser resource conditions list.
# items:
# description: MySQLUserCondition defines the condition struct for a MysqlUser resource
# properties:
# lastTransitionTime:
# description: Last time the condition transitioned from one status to another.
# format: date-time
# type: string
# lastUpdateTime:
# description: The last time this condition was updated.
# format: date-time
# type: string
# message:
# description: A human readable message indicating details about the transition.
# type: string
# reason:
# description: The reason for the condition's last transition.
# type: string
# status:
# description: Status of the condition, one of True, False, Unknown.
# type: string
# type:
# description: Type of MysqlUser condition.
# type: string
# required:
# - lastTransitionTime
# - message
# - reason
# - status
# - type
# type: object
# type: array
# type: object
# type: object
# served: true
# storage: true
# subresources:
# status: {}
# EOF
# }
#### POSTGRESQL — CloudNativePG Cluster
#
# Migrated from single NFS-backed pod to CNPG on local-path storage.
# CNPG Cluster is managed via kubectl apply (not kubernetes_manifest)
# because the CNPG webhook mutates the spec on apply, causing
# Terraform provider "inconsistent result" errors.
#
# Rollback: apply old deployment yaml, revert service selector to app=postgresql.
# Ensure the CNPG cluster manifest exists (idempotent kubectl apply)
resource " null_resource " " pg_cluster " {
triggers = {
2026-05-16 12:06:30 +00:00
instances = " 3 "
2026-03-17 18:11:53 +00:00
image = " ghcr.io/cloudnative-pg/postgis:16 "
storage_size = " 20Gi "
feat(storage): migrate all sensitive services to proxmox-lvm-encrypted
Reconcile Terraform with cluster state after manual encrypted PVC migrations
and complete the remaining unfinished migrations. All services storing
sensitive data now use LUKS2-encrypted block storage via the Proxmox CSI
plugin.
## Context
Only Technitium DNS was using encrypted storage in Terraform. Many services
had been manually migrated to encrypted PVCs in the cluster, but Terraform
was never updated — creating dangerous state drift where a `tg apply` could
recreate unencrypted PVCs.
## This change
Phase 0 — Infrastructure:
- Add `proxmox-lvm-encrypted` StorageClass to Helm values (extraParameters)
- Add ExternalSecret for LUKS encryption passphrase to Terraform
- Fix CSI node plugin memory: `node.plugin.resources` (not `node.resources`)
with 1280Mi limit for LUKS2 Argon2id key derivation
Phase 1 — TF state reconciliation (zero downtime):
- Health, Matrix, N8N, Forgejo, Vaultwarden, Mailserver: state rm + import
- Redis, DBAAS MySQL, DBAAS PostgreSQL: Helm/CNPG value updates
Phase 2 — Data migration (encrypted PVCs existed but unused):
- Headscale, Frigate, MeshCentral: rsync + switchover
- Nextcloud (20Gi): rsync + chart_values update
Phase 3 — New encrypted PVCs:
- Roundcube HTML, HackMD, Affine, DBAAS pgadmin: create + rsync + switchover
Phase 4 — Cleanup:
- Deleted 5 orphaned unencrypted PVCs
## Services migrated (18 PVCs across 14 namespaces)
```
vaultwarden → vaultwarden-data-encrypted
dbaas → datadir-mysql-cluster-0, pg-cluster-{1,2}, dbaas-pgadmin-encrypted
mailserver → mailserver-data-encrypted, roundcubemail-{enigma,html}-encrypted
nextcloud → nextcloud-data-encrypted
forgejo → forgejo-data-encrypted
matrix → matrix-data-encrypted
n8n → n8n-data-encrypted
affine → affine-data-encrypted
health → health-uploads-encrypted
hackmd → hackmd-data-encrypted
redis → redis-data-redis-node-{0,1}
headscale → headscale-data-encrypted
frigate → frigate-config-encrypted
meshcentral → meshcentral-{data,files}-encrypted
```
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 20:15:30 +00:00
storage_class = " proxmox-lvm-encrypted "
dbaas+monitoring: bump PG max_connections to 200, add scrape + alerts
Cluster grew past the 100-conn default — steady-state idle was 90/100,
leaving zero headroom for terragrunt applies or transient surges. The
ceiling was being discovered by Terraform crashing (pq: "remaining
connection slots are reserved for roles with the SUPERUSER attribute"),
not by alerting, because we had no PG scrape config at all.
dbaas (Tier 0):
* max_connections: 100 → 200
* shared_buffers: 512MB → 1GB (Postgres recommends ~25% of pod memory)
* effective_cache_size: 1536MB → 2560MB (scaled with pod memory)
* pod memory: 2Gi → 3Gi (rough rule of thumb: enough for shared_buffers
+ ~16MB work_mem * concurrent sorts + OS cache + overhead)
* Triggers bump on null_resource.pg_cluster forces CNPG to re-apply,
which rolls the cluster (standby first, then primary failover).
monitoring:
* New scrape job 'cnpg' on dbaas namespace pods labeled
cnpg.io/podRole=instance, port name=metrics (9187). Relabels add
cnpg_cluster + cnpg_role labels for alert grouping.
* PGConnectionsHigh (warning, >85% for 10m) — heads-up before exhaustion.
* PGConnectionsCritical (critical, >95% for 3m) — last call before
refusing connections.
Verified: cnpg targets up, sum(cnpg_backends_total)=84, max_connections
metric=200, alert ratio 0.42 → both alerts inactive.
2026-05-11 19:20:54 +00:00
memory_limit = " 3Gi "
pg_params = " v3-shared1024-walcomp-workmem16-max200 "
2026-03-17 18:11:53 +00:00
}
provisioner " local-exec " {
command = < < - EOT
kubectl - - kubeconfig $ { var . kube_config_path } apply - f - < < ' EOF '
apiVersion : postgresql . cnpg . io / v1
kind : Cluster
metadata :
name : pg - cluster
namespace : dbaas
spec :
2026-05-16 12:06:30 +00:00
# 3 instances (1 primary + 2 replicas) so a single-node drain (e.g.
# kured's weekly OS-reboot wave) still leaves a primary candidate
# immediately available for switchover. Previously 2; CNPG would
# still failover with 2 but only if the lone replica was caught up
# — during a long WAL backlog the failover would stall the drain.
# Bumped 2026-05-16 ahead of Monday's first post-fix kured cycle.
instances : 3
2026-03-17 18:11:53 +00:00
imageName : ghcr . io / cloudnative - pg / postgis : 16
postgresql :
parameters :
search_path : ' " $ user " , public '
dbaas+monitoring: bump PG max_connections to 200, add scrape + alerts
Cluster grew past the 100-conn default — steady-state idle was 90/100,
leaving zero headroom for terragrunt applies or transient surges. The
ceiling was being discovered by Terraform crashing (pq: "remaining
connection slots are reserved for roles with the SUPERUSER attribute"),
not by alerting, because we had no PG scrape config at all.
dbaas (Tier 0):
* max_connections: 100 → 200
* shared_buffers: 512MB → 1GB (Postgres recommends ~25% of pod memory)
* effective_cache_size: 1536MB → 2560MB (scaled with pod memory)
* pod memory: 2Gi → 3Gi (rough rule of thumb: enough for shared_buffers
+ ~16MB work_mem * concurrent sorts + OS cache + overhead)
* Triggers bump on null_resource.pg_cluster forces CNPG to re-apply,
which rolls the cluster (standby first, then primary failover).
monitoring:
* New scrape job 'cnpg' on dbaas namespace pods labeled
cnpg.io/podRole=instance, port name=metrics (9187). Relabels add
cnpg_cluster + cnpg_role labels for alert grouping.
* PGConnectionsHigh (warning, >85% for 10m) — heads-up before exhaustion.
* PGConnectionsCritical (critical, >95% for 3m) — last call before
refusing connections.
Verified: cnpg targets up, sum(cnpg_backends_total)=84, max_connections
metric=200, alert ratio 0.42 → both alerts inactive.
2026-05-11 19:20:54 +00:00
# Cluster grew past the 100-conn default ceiling (~90/100 idle
# steady-state in May 2026; authentik+matrix alone hold ~55).
# Bumped to 200 with shared_buffers/effective_cache_size/memory
# scaled proportionally. work_mem stays at 16MB — that's per
# sort/hash op, not per connection, so 16MB * 200 isn't the
# worst case.
max_connections : " 200 "
shared_buffers : " 1024MB "
effective_cache_size : " 2560MB "
truenas deprecation: migrate all non-immich storage to proxmox NFS
- Migrate 7 backup CronJobs to Proxmox host NFS (192.168.1.127)
(etcd, mysql, postgresql, nextcloud, redis, vaultwarden, plotting-book)
- Migrate headscale backup, ebook2audiobook, osm_routing to Proxmox NFS
- Migrate servarr (lidarr, readarr, soulseek) NFS refs to Proxmox
- Remove 79 orphaned TrueNAS NFS module declarations from 49 stacks
- Delete stacks/platform/modules/ (27 dead module copies, 65MB)
- Update nfs-truenas StorageClass to point to Proxmox (192.168.1.127)
- Remove iscsi DNS record from config.tfvars
- Fix woodpecker persistence config and alertmanager PV
Only Immich (8 PVCs, ~1.4TB) remains on TrueNAS.
2026-04-12 14:35:39 +01:00
work_mem : " 16MB "
wal_compression : " on "
random_page_cost : " 4 "
checkpoint_completion_target : " 0.9 "
2026-03-17 18:11:53 +00:00
enableAlterSystem : true
enableSuperuserAccess : true
2026-04-03 23:30:00 +03:00
inheritedMetadata :
annotations :
2026-05-10 19:58:11 +00:00
# threshold = free-space % below which autoresizer expands.
# 10% means "expand when 90% used" (the conventional knob).
resize . topolvm . io / threshold : " 10% "
2026-04-03 23:30:00 +03:00
resize . topolvm . io / increase : " 20% "
resize . topolvm . io / storage_limit : " 100Gi "
2026-03-17 18:11:53 +00:00
storage :
size : 20 Gi
feat(storage): migrate all sensitive services to proxmox-lvm-encrypted
Reconcile Terraform with cluster state after manual encrypted PVC migrations
and complete the remaining unfinished migrations. All services storing
sensitive data now use LUKS2-encrypted block storage via the Proxmox CSI
plugin.
## Context
Only Technitium DNS was using encrypted storage in Terraform. Many services
had been manually migrated to encrypted PVCs in the cluster, but Terraform
was never updated — creating dangerous state drift where a `tg apply` could
recreate unencrypted PVCs.
## This change
Phase 0 — Infrastructure:
- Add `proxmox-lvm-encrypted` StorageClass to Helm values (extraParameters)
- Add ExternalSecret for LUKS encryption passphrase to Terraform
- Fix CSI node plugin memory: `node.plugin.resources` (not `node.resources`)
with 1280Mi limit for LUKS2 Argon2id key derivation
Phase 1 — TF state reconciliation (zero downtime):
- Health, Matrix, N8N, Forgejo, Vaultwarden, Mailserver: state rm + import
- Redis, DBAAS MySQL, DBAAS PostgreSQL: Helm/CNPG value updates
Phase 2 — Data migration (encrypted PVCs existed but unused):
- Headscale, Frigate, MeshCentral: rsync + switchover
- Nextcloud (20Gi): rsync + chart_values update
Phase 3 — New encrypted PVCs:
- Roundcube HTML, HackMD, Affine, DBAAS pgadmin: create + rsync + switchover
Phase 4 — Cleanup:
- Deleted 5 orphaned unencrypted PVCs
## Services migrated (18 PVCs across 14 namespaces)
```
vaultwarden → vaultwarden-data-encrypted
dbaas → datadir-mysql-cluster-0, pg-cluster-{1,2}, dbaas-pgadmin-encrypted
mailserver → mailserver-data-encrypted, roundcubemail-{enigma,html}-encrypted
nextcloud → nextcloud-data-encrypted
forgejo → forgejo-data-encrypted
matrix → matrix-data-encrypted
n8n → n8n-data-encrypted
affine → affine-data-encrypted
health → health-uploads-encrypted
hackmd → hackmd-data-encrypted
redis → redis-data-redis-node-{0,1}
headscale → headscale-data-encrypted
frigate → frigate-config-encrypted
meshcentral → meshcentral-{data,files}-encrypted
```
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 20:15:30 +00:00
storageClass : proxmox - lvm - encrypted
2026-03-17 18:11:53 +00:00
resources :
requests :
cpu : " 50m "
dbaas+monitoring: bump PG max_connections to 200, add scrape + alerts
Cluster grew past the 100-conn default — steady-state idle was 90/100,
leaving zero headroom for terragrunt applies or transient surges. The
ceiling was being discovered by Terraform crashing (pq: "remaining
connection slots are reserved for roles with the SUPERUSER attribute"),
not by alerting, because we had no PG scrape config at all.
dbaas (Tier 0):
* max_connections: 100 → 200
* shared_buffers: 512MB → 1GB (Postgres recommends ~25% of pod memory)
* effective_cache_size: 1536MB → 2560MB (scaled with pod memory)
* pod memory: 2Gi → 3Gi (rough rule of thumb: enough for shared_buffers
+ ~16MB work_mem * concurrent sorts + OS cache + overhead)
* Triggers bump on null_resource.pg_cluster forces CNPG to re-apply,
which rolls the cluster (standby first, then primary failover).
monitoring:
* New scrape job 'cnpg' on dbaas namespace pods labeled
cnpg.io/podRole=instance, port name=metrics (9187). Relabels add
cnpg_cluster + cnpg_role labels for alert grouping.
* PGConnectionsHigh (warning, >85% for 10m) — heads-up before exhaustion.
* PGConnectionsCritical (critical, >95% for 3m) — last call before
refusing connections.
Verified: cnpg targets up, sum(cnpg_backends_total)=84, max_connections
metric=200, alert ratio 0.42 → both alerts inactive.
2026-05-11 19:20:54 +00:00
memory : " 3Gi "
2026-03-17 18:11:53 +00:00
limits :
dbaas+monitoring: bump PG max_connections to 200, add scrape + alerts
Cluster grew past the 100-conn default — steady-state idle was 90/100,
leaving zero headroom for terragrunt applies or transient surges. The
ceiling was being discovered by Terraform crashing (pq: "remaining
connection slots are reserved for roles with the SUPERUSER attribute"),
not by alerting, because we had no PG scrape config at all.
dbaas (Tier 0):
* max_connections: 100 → 200
* shared_buffers: 512MB → 1GB (Postgres recommends ~25% of pod memory)
* effective_cache_size: 1536MB → 2560MB (scaled with pod memory)
* pod memory: 2Gi → 3Gi (rough rule of thumb: enough for shared_buffers
+ ~16MB work_mem * concurrent sorts + OS cache + overhead)
* Triggers bump on null_resource.pg_cluster forces CNPG to re-apply,
which rolls the cluster (standby first, then primary failover).
monitoring:
* New scrape job 'cnpg' on dbaas namespace pods labeled
cnpg.io/podRole=instance, port name=metrics (9187). Relabels add
cnpg_cluster + cnpg_role labels for alert grouping.
* PGConnectionsHigh (warning, >85% for 10m) — heads-up before exhaustion.
* PGConnectionsCritical (critical, >95% for 3m) — last call before
refusing connections.
Verified: cnpg targets up, sum(cnpg_backends_total)=84, max_connections
metric=200, alert ratio 0.42 → both alerts inactive.
2026-05-11 19:20:54 +00:00
memory : " 3Gi "
2026-03-17 18:11:53 +00:00
EOF
EOT
}
}
# Service that maintains the original postgresql.dbaas endpoint,
# now pointing at the CNPG primary pod instead of the old deployment.
resource " kubernetes_service " " postgresql " {
metadata {
name = " postgresql "
namespace = kubernetes_namespace . dbaas . metadata [ 0 ] . name
}
spec {
selector = {
" cnpg.io/cluster " = " pg-cluster "
" cnpg.io/instanceRole " = " primary "
}
port {
name = " postgresql "
port = 5432
target_port = 5432
}
}
}
2026-04-16 19:01:06 +00:00
# LoadBalancer service for PG primary — accessible from DevVM (10.0.20.200:5432).
# Shares MetalLB IP with other non-conflicting services (Traefik, Dolt, etc.).
resource " kubernetes_service " " postgresql_lb " {
metadata {
name = " postgresql-lb "
namespace = kubernetes_namespace . dbaas . metadata [ 0 ] . name
annotations = {
" metallb.universe.tf/loadBalancerIPs " = " 10.0.20.200 "
" metallb.io/allow-shared-ip " = " shared "
}
}
spec {
type = " LoadBalancer "
selector = {
" cnpg.io/cluster " = " pg-cluster "
" cnpg.io/instanceRole " = " primary "
}
port {
name = " postgresql "
port = 5432
target_port = 5432
}
}
}
# Create terraform_state database for remote TF state backend (pg backend).
# User password is managed by Vault Database Secrets Engine (static role rotation).
resource " null_resource " " pg_terraform_state_db " {
depends_on = [ null_resource . pg_cluster ]
triggers = {
db_name = " terraform_state "
username = " terraform_state "
}
provisioner " local-exec " {
command = < < - EOT
2026-05-10 15:03:16 +00:00
PRIMARY =$ ( kubectl - - kubeconfig $ { var . kube_config_path } get cluster - n dbaas pg - cluster - o jsonpath =' { . status . currentPrimary } ' )
kubectl - - kubeconfig $ { var . kube_config_path } exec - n dbaas $ PRIMARY - c postgres - - \
2026-04-16 19:01:06 +00:00
bash - c '
psql - U postgres - tc " SELECT 1 FROM pg_catalog.pg_roles WHERE rolname = ' " ' " 'terraform_state' " ' " ' " | grep - q 1 | | \
psql - U postgres - c " CREATE ROLE terraform_state WITH LOGIN PASSWORD ' " ' " 'changeme-vault-will-rotate' " ' " ' "
psql - U postgres - tc " SELECT 1 FROM pg_catalog.pg_database WHERE datname = ' " ' " 'terraform_state' " ' " ' " | grep - q 1 | | \
psql - U postgres - c " CREATE DATABASE terraform_state OWNER terraform_state "
psql - U postgres - c " GRANT ALL PRIVILEGES ON DATABASE terraform_state TO terraform_state "
'
EOT
}
}
[payslip-ingest] Deploy stack + Grafana dashboard + Vault DB role
## Context
New service `payslip-ingest` (code lives in `/home/wizard/code/payslip-ingest/`)
needs in-cluster deployment, its own Postgres DB + rotating user, a Grafana
datasource, a dashboard, and a Claude agent definition for PDF extraction.
Cluster-internal only — webhook fires from Paperless-ngx in a sibling namespace.
No ingress, no TLS cert, no DNS record.
## What
### New stack `stacks/payslip-ingest/`
- `kubernetes_namespace` payslip-ingest, tier=aux.
- ExternalSecret (vault-kv) projects PAPERLESS_API_TOKEN, CLAUDE_AGENT_BEARER_TOKEN,
WEBHOOK_BEARER_TOKEN into `payslip-ingest-secrets`.
- ExternalSecret (vault-database) reads rotating password from
`static-creds/pg-payslip-ingest` and templates `DATABASE_URL` into
`payslip-ingest-db-creds` with `reloader.stakater.com/match=true`.
- Deployment: single replica, Recreate strategy (matches single-worker queue
design), `wait-for postgresql.dbaas:5432` annotation, init container runs
`alembic upgrade head`, main container serves FastAPI on 8080, Kyverno
dns_config lifecycle ignore.
- ClusterIP Service :8080.
- Grafana datasource ConfigMap in `monitoring` ns (label `grafana_datasource=1`,
uid `payslips-pg`) reading password from the db-creds K8s Secret.
### Grafana dashboard `uk-payslip.json` (4 panels)
- Monthly gross/net/tax/NI (timeseries, currencyGBP).
- YTD tax-band progression with threshold lines at £12,570 / £50,270 / £125,140.
- Deductions breakdown (stacked bars).
- Effective rate + take-home % (timeseries, percent).
### Vault DB role `pg-payslip-ingest`
- Added to `allowed_roles` in `vault_database_secret_backend_connection.postgresql`.
- New `vault_database_secret_backend_static_role.pg_payslip_ingest`
(username `payslip_ingest`, 7d rotation).
### DBaaS — DB + role creation
- New `null_resource.pg_payslip_ingest_db` mirrors `pg_terraform_state_db`:
idempotent CREATE ROLE + CREATE DATABASE + GRANT ALL via `kubectl exec` into
`pg-cluster-1`.
### Claude agent `.claude/agents/payslip-extractor.md`
- Haiku-backed agent invoked by `claude-agent-service`.
- Decodes base64 PDF from prompt, tries pdftotext → pypdf fallback, emits a single
JSON object matching the schema to stdout. No network, no file writes outside /tmp,
no markdown fences.
## Trade-offs / decisions
- Own DB per service (convention), NOT a schema in a shared `app` DB as the plan
initially described. The Alembic migration still creates a `payslip_ingest`
schema inside the `payslip_ingest` DB for table organisation.
- Paperless URL uses port 80 (the Service port), not 8000 (the pod target port).
- Grafana datasource uses the primary RW user — separate `_ro` role is aspirational
and not yet a pattern in this repo.
- No ingress — webhook is cluster-internal; external exposure is unnecessary attack
surface.
- No Uptime Kuma monitor yet: the internal-monitor list is a static block in
`stacks/uptime-kuma/`; will add in a follow-up tied to code-z29 (internal monitor
auto-creator).
## Test Plan
### Automated
```
terraform init -backend=false && terraform validate
Success! The configuration is valid.
terraform fmt -check -recursive
(exit 0)
python3 -c "import json; json.load(open('uk-payslip.json'))"
(exit 0)
```
### Manual Verification (post-merge)
Prerequisites:
1. Seed Vault: `vault kv put secret/payslip-ingest webhook_bearer_token=$(openssl rand -hex 32)`.
2. Seed Vault: `vault kv patch secret/paperless-ngx api_token=<paperless token>`.
Apply:
3. `scripts/tg apply vault` → creates pg-payslip-ingest static role.
4. `scripts/tg apply dbaas` → creates payslip_ingest DB + role.
5. `cd stacks/payslip-ingest && ../../scripts/tg apply -target=kubernetes_manifest.db_external_secret`
(first-apply ESO bootstrap).
6. `scripts/tg apply payslip-ingest` (full).
7. `kubectl -n payslip-ingest get pods` → Running 1/1.
8. `kubectl -n payslip-ingest port-forward svc/payslip-ingest 8080:8080 && curl localhost:8080/healthz` → 200.
End-to-end:
9. Configure Paperless workflow (README in code repo has steps).
10. Upload sample payslip tagged `payslip` → row in `payslip_ingest.payslip` within 60s.
11. Grafana → Dashboards → UK Payslip → 4 panels render.
Closes: code-do7
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 19:07:05 +00:00
# Create payslip_ingest database for the payslip-ingest webhook service.
# Role password is managed by Vault Database Secrets Engine (static role `pg-payslip-ingest`, 7d rotation).
resource " null_resource " " pg_payslip_ingest_db " {
depends_on = [ null_resource . pg_cluster ]
triggers = {
db_name = " payslip_ingest "
username = " payslip_ingest "
}
provisioner " local-exec " {
command = < < - EOT
2026-05-10 15:03:16 +00:00
PRIMARY =$ ( kubectl - - kubeconfig $ { var . kube_config_path } get cluster - n dbaas pg - cluster - o jsonpath =' { . status . currentPrimary } ' )
kubectl - - kubeconfig $ { var . kube_config_path } exec - n dbaas $ PRIMARY - c postgres - - \
[payslip-ingest] Deploy stack + Grafana dashboard + Vault DB role
## Context
New service `payslip-ingest` (code lives in `/home/wizard/code/payslip-ingest/`)
needs in-cluster deployment, its own Postgres DB + rotating user, a Grafana
datasource, a dashboard, and a Claude agent definition for PDF extraction.
Cluster-internal only — webhook fires from Paperless-ngx in a sibling namespace.
No ingress, no TLS cert, no DNS record.
## What
### New stack `stacks/payslip-ingest/`
- `kubernetes_namespace` payslip-ingest, tier=aux.
- ExternalSecret (vault-kv) projects PAPERLESS_API_TOKEN, CLAUDE_AGENT_BEARER_TOKEN,
WEBHOOK_BEARER_TOKEN into `payslip-ingest-secrets`.
- ExternalSecret (vault-database) reads rotating password from
`static-creds/pg-payslip-ingest` and templates `DATABASE_URL` into
`payslip-ingest-db-creds` with `reloader.stakater.com/match=true`.
- Deployment: single replica, Recreate strategy (matches single-worker queue
design), `wait-for postgresql.dbaas:5432` annotation, init container runs
`alembic upgrade head`, main container serves FastAPI on 8080, Kyverno
dns_config lifecycle ignore.
- ClusterIP Service :8080.
- Grafana datasource ConfigMap in `monitoring` ns (label `grafana_datasource=1`,
uid `payslips-pg`) reading password from the db-creds K8s Secret.
### Grafana dashboard `uk-payslip.json` (4 panels)
- Monthly gross/net/tax/NI (timeseries, currencyGBP).
- YTD tax-band progression with threshold lines at £12,570 / £50,270 / £125,140.
- Deductions breakdown (stacked bars).
- Effective rate + take-home % (timeseries, percent).
### Vault DB role `pg-payslip-ingest`
- Added to `allowed_roles` in `vault_database_secret_backend_connection.postgresql`.
- New `vault_database_secret_backend_static_role.pg_payslip_ingest`
(username `payslip_ingest`, 7d rotation).
### DBaaS — DB + role creation
- New `null_resource.pg_payslip_ingest_db` mirrors `pg_terraform_state_db`:
idempotent CREATE ROLE + CREATE DATABASE + GRANT ALL via `kubectl exec` into
`pg-cluster-1`.
### Claude agent `.claude/agents/payslip-extractor.md`
- Haiku-backed agent invoked by `claude-agent-service`.
- Decodes base64 PDF from prompt, tries pdftotext → pypdf fallback, emits a single
JSON object matching the schema to stdout. No network, no file writes outside /tmp,
no markdown fences.
## Trade-offs / decisions
- Own DB per service (convention), NOT a schema in a shared `app` DB as the plan
initially described. The Alembic migration still creates a `payslip_ingest`
schema inside the `payslip_ingest` DB for table organisation.
- Paperless URL uses port 80 (the Service port), not 8000 (the pod target port).
- Grafana datasource uses the primary RW user — separate `_ro` role is aspirational
and not yet a pattern in this repo.
- No ingress — webhook is cluster-internal; external exposure is unnecessary attack
surface.
- No Uptime Kuma monitor yet: the internal-monitor list is a static block in
`stacks/uptime-kuma/`; will add in a follow-up tied to code-z29 (internal monitor
auto-creator).
## Test Plan
### Automated
```
terraform init -backend=false && terraform validate
Success! The configuration is valid.
terraform fmt -check -recursive
(exit 0)
python3 -c "import json; json.load(open('uk-payslip.json'))"
(exit 0)
```
### Manual Verification (post-merge)
Prerequisites:
1. Seed Vault: `vault kv put secret/payslip-ingest webhook_bearer_token=$(openssl rand -hex 32)`.
2. Seed Vault: `vault kv patch secret/paperless-ngx api_token=<paperless token>`.
Apply:
3. `scripts/tg apply vault` → creates pg-payslip-ingest static role.
4. `scripts/tg apply dbaas` → creates payslip_ingest DB + role.
5. `cd stacks/payslip-ingest && ../../scripts/tg apply -target=kubernetes_manifest.db_external_secret`
(first-apply ESO bootstrap).
6. `scripts/tg apply payslip-ingest` (full).
7. `kubectl -n payslip-ingest get pods` → Running 1/1.
8. `kubectl -n payslip-ingest port-forward svc/payslip-ingest 8080:8080 && curl localhost:8080/healthz` → 200.
End-to-end:
9. Configure Paperless workflow (README in code repo has steps).
10. Upload sample payslip tagged `payslip` → row in `payslip_ingest.payslip` within 60s.
11. Grafana → Dashboards → UK Payslip → 4 panels render.
Closes: code-do7
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 19:07:05 +00:00
bash - c '
psql - U postgres - tc " SELECT 1 FROM pg_catalog.pg_roles WHERE rolname = ' " ' " 'payslip_ingest' " ' " ' " | grep - q 1 | | \
psql - U postgres - c " CREATE ROLE payslip_ingest WITH LOGIN PASSWORD ' " ' " 'changeme-vault-will-rotate' " ' " ' "
psql - U postgres - tc " SELECT 1 FROM pg_catalog.pg_database WHERE datname = ' " ' " 'payslip_ingest' " ' " ' " | grep - q 1 | | \
psql - U postgres - c " CREATE DATABASE payslip_ingest OWNER payslip_ingest "
psql - U postgres - c " GRANT ALL PRIVILEGES ON DATABASE payslip_ingest TO payslip_ingest "
'
EOT
}
}
2026-04-19 17:09:29 +00:00
# Create job_hunter database for the job-hunter scraper service.
# Role password is managed by Vault Database Secrets Engine (static role `pg-job-hunter`, 7d rotation).
resource " null_resource " " pg_job_hunter_db " {
depends_on = [ null_resource . pg_cluster ]
triggers = {
db_name = " job_hunter "
username = " job_hunter "
}
provisioner " local-exec " {
command = < < - EOT
2026-05-10 15:03:16 +00:00
PRIMARY =$ ( kubectl - - kubeconfig $ { var . kube_config_path } get cluster - n dbaas pg - cluster - o jsonpath =' { . status . currentPrimary } ' )
kubectl - - kubeconfig $ { var . kube_config_path } exec - n dbaas $ PRIMARY - c postgres - - \
2026-04-19 17:09:29 +00:00
bash - c '
psql - U postgres - tc " SELECT 1 FROM pg_catalog.pg_roles WHERE rolname = ' " ' " 'job_hunter' " ' " ' " | grep - q 1 | | \
psql - U postgres - c " CREATE ROLE job_hunter WITH LOGIN PASSWORD ' " ' " 'changeme-vault-will-rotate' " ' " ' "
psql - U postgres - tc " SELECT 1 FROM pg_catalog.pg_database WHERE datname = ' " ' " 'job_hunter' " ' " ' " | grep - q 1 | | \
psql - U postgres - c " CREATE DATABASE job_hunter OWNER job_hunter "
psql - U postgres - c " GRANT ALL PRIVILEGES ON DATABASE job_hunter TO job_hunter "
'
EOT
}
}
2026-05-10 15:03:16 +00:00
# Postiz: 3 databases (postiz, temporal, temporal_visibility) all owned by the
# `postiz` role. Bundled bitnami PostgreSQL was retired 2026-05-09 in favour of
# this CNPG cluster — covered by postgresql-backup-per-db automatically.
# Role password placeholder; Vault static role `pg-postiz` rotates 7d.
resource " null_resource " " pg_postiz_dbs " {
depends_on = [ null_resource . pg_cluster ]
triggers = {
role = " postiz "
dbs = " postiz,temporal,temporal_visibility "
}
provisioner " local-exec " {
command = < < - EOT
PRIMARY =$ ( kubectl - - kubeconfig $ { var . kube_config_path } get cluster - n dbaas pg - cluster - o jsonpath =' { . status . currentPrimary } ' )
kubectl - - kubeconfig $ { var . kube_config_path } exec - n dbaas $ PRIMARY - c postgres - - \
bash - c '
psql - U postgres - tc " SELECT 1 FROM pg_catalog.pg_roles WHERE rolname = ' " ' " 'postiz' " ' " ' " | grep - q 1 | | \
psql - U postgres - c " CREATE ROLE postiz WITH LOGIN PASSWORD ' " ' " 'changeme-vault-will-rotate' " ' " ' "
for db in postiz temporal temporal_visibility ; do
psql - U postgres - tc " SELECT 1 FROM pg_catalog.pg_database WHERE datname = ' " ' " ' $ db' " ' " ' " | grep - q 1 | | \
psql - U postgres - c " CREATE DATABASE $ db OWNER postiz "
psql - U postgres - c " GRANT ALL PRIVILEGES ON DATABASE $ db TO postiz "
done
'
EOT
}
}
wealth: SQLite→PG ETL sidecar + new Grafana dashboard
Mirrors Wealthfolio's daily_account_valuation / accounts / activities
from SQLite into a new PG database (wealthfolio_sync) every hour, so
Grafana can chart net worth, contributions, and growth over time.
Components:
- dbaas: null_resource creates wealthfolio_sync DB + role on the CNPG
cluster (dynamic primary lookup so it survives failover).
- vault: pg-wealthfolio-sync static role rotates the password every 7d.
- wealthfolio: ExternalSecret pulls the rotated password into the WF
namespace; new pg-sync sidecar (alpine + sqlite + postgresql-client +
busybox crond) does sqlite3 .backup → TSV dump → truncate-and-reload
psql, hourly at :07. Plus a grafana-wealth-datasource ConfigMap in
the monitoring namespace (uid: wealth-pg).
- monitoring: new Wealth dashboard (wealth.json, 10 panels) — current
net worth / contribution / growth / ROI% stats, then time-series
for net worth, contribution-vs-market, growth area, per-account
stacked area, cash-vs-invested, and a 100-row activity log.
Initial sync: 6 accounts, 10,798 daily valuations, 518 activities.
Verified PG totals match SQLite latest snapshot exactly.
2026-04-25 17:07:33 +00:00
# Create wealthfolio_sync database for the SQLite→PG ETL sidecar that mirrors
# Wealthfolio's daily_account_valuation/accounts/activities into PG so Grafana
# can chart net worth, contributions, and growth.
# Role password is managed by Vault Database Secrets Engine (static role `pg-wealthfolio-sync`, 7d rotation).
resource " null_resource " " pg_wealthfolio_sync_db " {
depends_on = [ null_resource . pg_cluster ]
triggers = {
db_name = " wealthfolio_sync "
username = " wealthfolio_sync "
}
provisioner " local-exec " {
command = < < - EOT
PRIMARY =$ ( kubectl - - kubeconfig $ { var . kube_config_path } get cluster - n dbaas pg - cluster - o jsonpath =' { . status . currentPrimary } ' )
kubectl - - kubeconfig $ { var . kube_config_path } exec - n dbaas $ PRIMARY - c postgres - - \
bash - c '
psql - U postgres - tc " SELECT 1 FROM pg_catalog.pg_roles WHERE rolname = ' " ' " 'wealthfolio_sync' " ' " ' " | grep - q 1 | | \
psql - U postgres - c " CREATE ROLE wealthfolio_sync WITH LOGIN PASSWORD ' " ' " 'changeme-vault-will-rotate' " ' " ' "
psql - U postgres - tc " SELECT 1 FROM pg_catalog.pg_database WHERE datname = ' " ' " 'wealthfolio_sync' " ' " ' " | grep - q 1 | | \
psql - U postgres - c " CREATE DATABASE wealthfolio_sync OWNER wealthfolio_sync "
psql - U postgres - c " GRANT ALL PRIVILEGES ON DATABASE wealthfolio_sync TO wealthfolio_sync "
'
EOT
}
}
fire-planner: add stack, Vault DB role, dashboard, DB
New stacks/fire-planner/ mirrors payslip-ingest layout:
- ExternalSecret pulling RECOMPUTE_BEARER_TOKEN from Vault secret/fire-planner
- DB ExternalSecret templating DB_CONNECTION_STRING via static role pg-fire-planner
- FastAPI Deployment (serve), CronJob (recompute-all monthly on 2nd at 09:00 UTC,
scheduled after wealthfolio-sync's 1st at 08:00), ClusterIP Service
- Grafana datasource ConfigMap "FirePlanner" — `database` inside jsonData
(cc56ba29 fix; otherwise Grafana 11.2+ hits "you do not have default database")
Plus:
- vault/main.tf: pg-fire-planner static role (7d rotation), allowed_roles
- dbaas/modules/dbaas/main.tf: null_resource creates fire_planner DB+role
- monitoring/dashboards/fire-planner.json: 9-panel Finance-folder dashboard
(NW timeseries, MC fan chart, success heatmap, lifetime tax bars,
years-to-ruin table, optimal leave-UK stat, ending wealth stat,
UK success-by-strategy bars, sequence-risk correlation table)
- monitoring/modules/monitoring/grafana.tf: register "fire-planner.json" in Finance folder
Apply order:
1. vault stack — creates the static role
2. dbaas stack — creates the database & role
3. external-secrets stack picks up vault-database refs (no change needed)
4. fire-planner stack — first apply with -target=kubernetes_manifest.db_external_secret
before full apply, per the plan-time-data-source pattern
5. monitoring stack — picks up the new dashboard ConfigMap
[ci skip]
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-25 17:27:19 +00:00
# Create fire_planner database for the FIRE retirement-planning service.
# Role password is managed by Vault Database Secrets Engine
# (static role `pg-fire-planner`, 7d rotation).
# fire_planner reads from payslip_ingest + wealthfolio_sync (read-only)
# and writes its own MC results into schema fire_planner.
resource " null_resource " " pg_fire_planner_db " {
depends_on = [ null_resource . pg_cluster ]
triggers = {
db_name = " fire_planner "
username = " fire_planner "
}
provisioner " local-exec " {
command = < < - EOT
PRIMARY =$ ( kubectl - - kubeconfig $ { var . kube_config_path } get cluster - n dbaas pg - cluster - o jsonpath =' { . status . currentPrimary } ' )
kubectl - - kubeconfig $ { var . kube_config_path } exec - n dbaas $ PRIMARY - c postgres - - \
bash - c '
psql - U postgres - tc " SELECT 1 FROM pg_catalog.pg_roles WHERE rolname = ' " ' " 'fire_planner' " ' " ' " | grep - q 1 | | \
psql - U postgres - c " CREATE ROLE fire_planner WITH LOGIN PASSWORD ' " ' " 'changeme-vault-will-rotate' " ' " ' "
psql - U postgres - tc " SELECT 1 FROM pg_catalog.pg_database WHERE datname = ' " ' " 'fire_planner' " ' " ' " | grep - q 1 | | \
psql - U postgres - c " CREATE DATABASE fire_planner OWNER fire_planner "
psql - U postgres - c " GRANT ALL PRIVILEGES ON DATABASE fire_planner TO fire_planner "
'
EOT
}
}
2026-05-10 16:37:33 +00:00
# Create instagram_poster database for the IG-curation pipeline. Initial use:
# benchmark_score table written by `instagram_poster.benchmark` CLI (vision-LLM
# scoring per Immich asset). Future: migrate story_queue/decision/ig_posted_media
# off the pod's sqlite PVC into this DB so the pod is fully stateless.
# Role password is managed by Vault Database Secrets Engine
# (static role `pg-instagram-poster`, 7d rotation).
resource " null_resource " " pg_instagram_poster_db " {
depends_on = [ null_resource . pg_cluster ]
triggers = {
db_name = " instagram_poster "
username = " instagram_poster "
}
provisioner " local-exec " {
command = < < - EOT
PRIMARY =$ ( kubectl - - kubeconfig $ { var . kube_config_path } get cluster - n dbaas pg - cluster - o jsonpath =' { . status . currentPrimary } ' )
kubectl - - kubeconfig $ { var . kube_config_path } exec - n dbaas $ PRIMARY - c postgres - - \
bash - c '
psql - U postgres - tc " SELECT 1 FROM pg_catalog.pg_roles WHERE rolname = ' " ' " 'instagram_poster' " ' " ' " | grep - q 1 | | \
psql - U postgres - c " CREATE ROLE instagram_poster WITH LOGIN PASSWORD ' " ' " 'changeme-vault-will-rotate' " ' " ' "
psql - U postgres - tc " SELECT 1 FROM pg_catalog.pg_database WHERE datname = ' " ' " 'instagram_poster' " ' " ' " | grep - q 1 | | \
psql - U postgres - c " CREATE DATABASE instagram_poster OWNER instagram_poster "
psql - U postgres - c " GRANT ALL PRIVILEGES ON DATABASE instagram_poster TO instagram_poster "
'
EOT
}
}
2026-03-17 18:11:53 +00:00
# Old PostgreSQL deployment — kept commented for rollback reference
# resource "kubernetes_deployment" "postgres" {
# metadata {
# name = "postgresql"
# namespace = kubernetes_namespace.dbaas.metadata[0].name
# labels = { tier = var.tier }
# }
# spec {
# replicas = 0 # scaled to 0 during CNPG migration
# selector { match_labels = { app = "postgresql" } }
# strategy { type = "Recreate" }
# template {
# metadata { labels = { app = "postgresql" } }
# spec {
# container {
# image = "viktorbarzin/postgres:16-master"
# name = "postgresql"
# env { name = "POSTGRES_PASSWORD"; value = var.postgresql_root_password }
# env { name = "POSTGRES_USER"; value = "root" }
# port { container_port = 5432; protocol = "TCP"; name = "postgresql" }
# volume_mount { name = "postgresql-persistent-storage"; mount_path = "/var/lib/postgresql/data" }
# }
# volume {
# name = "postgresql-persistent-storage"
# nfs { path = "/mnt/main/postgresql/data"; server = var.nfs_server }
# }
# }
# }
# }
# }
#### PGADMIN
resource " kubernetes_deployment " " pgadmin " {
metadata {
name = " pgadmin "
namespace = kubernetes_namespace . dbaas . metadata [ 0 ] . name
annotations = {
" reloader.stakater.com/search " = " true "
}
labels = {
tier = var . tier
}
}
spec {
feat(storage): migrate 38 NFS PVCs to proxmox-lvm (Wave 2)
Add proxmox-lvm PVCs with pvc-autoresizer annotations for all
remaining single-pod app data services. Deployments updated to
use new block storage PVCs. Old NFS modules retained for rollback.
Services: affine, changedetection, diun, excalidraw, f1-stream,
hackmd, isponsorblocktv, matrix, n8n, send, grampsweb, health,
onlyoffice, owntracks, paperless-ngx, privatebin, resume,
speedtest, stirling-pdf, tandoor, rybbit (clickhouse), tor-proxy
(torrserver), whisper+piper, frigate (config), ollama (ui),
servarr (prowlarr/listenarr/qbittorrent), aiostreams, freshrss
(extensions), meshcentral (data+files), openclaw (data+home+
openlobster), technitium, mailserver (data+roundcube html+enigma),
dbaas (pgadmin).
Strategy set to Recreate where needed for RWO volumes.
2026-04-04 19:25:12 +03:00
strategy {
type = " Recreate "
}
2026-03-17 18:11:53 +00:00
selector {
match_labels = {
app = " pgadmin "
}
}
template {
metadata {
labels = {
app = " pgadmin "
}
}
spec {
container {
image = " dpage/pgadmin4 "
name = " pgadmin "
env {
name = " PGADMIN_DEFAULT_EMAIL "
value = " me@viktorbarzin.me "
}
env {
name = " PGADMIN_DEFAULT_PASSWORD "
# Changed at startup
value = var . pgadmin_password
}
port {
container_port = 80
name = " web "
}
volume_mount {
name = " pgadmin "
mount_path = " /var/lib/pgadmin/ "
}
resources {
requests = {
cpu = " 25m "
2026-04-06 11:54:45 +03:00
memory = " 450Mi "
2026-03-17 18:11:53 +00:00
}
limits = {
2026-04-06 11:54:45 +03:00
memory = " 450Mi "
2026-03-17 18:11:53 +00:00
}
}
}
volume {
name = " pgadmin "
# config_map {
# name = "pgadmin-config"
# }
persistent_volume_claim {
feat(storage): migrate all sensitive services to proxmox-lvm-encrypted
Reconcile Terraform with cluster state after manual encrypted PVC migrations
and complete the remaining unfinished migrations. All services storing
sensitive data now use LUKS2-encrypted block storage via the Proxmox CSI
plugin.
## Context
Only Technitium DNS was using encrypted storage in Terraform. Many services
had been manually migrated to encrypted PVCs in the cluster, but Terraform
was never updated — creating dangerous state drift where a `tg apply` could
recreate unencrypted PVCs.
## This change
Phase 0 — Infrastructure:
- Add `proxmox-lvm-encrypted` StorageClass to Helm values (extraParameters)
- Add ExternalSecret for LUKS encryption passphrase to Terraform
- Fix CSI node plugin memory: `node.plugin.resources` (not `node.resources`)
with 1280Mi limit for LUKS2 Argon2id key derivation
Phase 1 — TF state reconciliation (zero downtime):
- Health, Matrix, N8N, Forgejo, Vaultwarden, Mailserver: state rm + import
- Redis, DBAAS MySQL, DBAAS PostgreSQL: Helm/CNPG value updates
Phase 2 — Data migration (encrypted PVCs existed but unused):
- Headscale, Frigate, MeshCentral: rsync + switchover
- Nextcloud (20Gi): rsync + chart_values update
Phase 3 — New encrypted PVCs:
- Roundcube HTML, HackMD, Affine, DBAAS pgadmin: create + rsync + switchover
Phase 4 — Cleanup:
- Deleted 5 orphaned unencrypted PVCs
## Services migrated (18 PVCs across 14 namespaces)
```
vaultwarden → vaultwarden-data-encrypted
dbaas → datadir-mysql-cluster-0, pg-cluster-{1,2}, dbaas-pgadmin-encrypted
mailserver → mailserver-data-encrypted, roundcubemail-{enigma,html}-encrypted
nextcloud → nextcloud-data-encrypted
forgejo → forgejo-data-encrypted
matrix → matrix-data-encrypted
n8n → n8n-data-encrypted
affine → affine-data-encrypted
health → health-uploads-encrypted
hackmd → hackmd-data-encrypted
redis → redis-data-redis-node-{0,1}
headscale → headscale-data-encrypted
frigate → frigate-config-encrypted
meshcentral → meshcentral-{data,files}-encrypted
```
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 20:15:30 +00:00
claim_name = kubernetes_persistent_volume_claim . pgadmin_encrypted . metadata [ 0 ] . name
2026-03-17 18:11:53 +00:00
}
}
dns_config {
option {
name = " ndots "
value = " 2 "
}
}
}
}
}
[infra] Sweep dns_config ignore_changes across all pod-owning resources [ci skip]
## Context
Wave 3A (commit c9d221d5) added the `# KYVERNO_LIFECYCLE_V1` marker to the
27 pre-existing `ignore_changes = [...dns_config]` sites so they could be
grepped and audited. It did NOT address pod-owning resources that were
simply missing the suppression entirely. Post-Wave-3A sampling (2026-04-18)
found that navidrome, f1-stream, frigate, servarr, monitoring, crowdsec,
and many other stacks showed perpetual `dns_config` drift every plan
because their `kubernetes_deployment` / `kubernetes_stateful_set` /
`kubernetes_cron_job_v1` resources had no `lifecycle {}` block at all.
Root cause (same as Wave 3A): Kyverno's admission webhook stamps
`dns_config { option { name = "ndots"; value = "2" } }` on every pod's
`spec.template.spec.dns_config` to prevent NxDomain search-domain flooding
(see `k8s-ndots-search-domain-nxdomain-flood` skill). Without `ignore_changes`
on every Terraform-managed pod-owner, Terraform repeatedly tries to strip
the injected field.
## This change
Extends the Wave 3A convention by sweeping EVERY `kubernetes_deployment`,
`kubernetes_stateful_set`, `kubernetes_daemon_set`, `kubernetes_cron_job_v1`,
`kubernetes_job_v1` (+ their `_v1` variants) in the repo and ensuring each
carries the right `ignore_changes` path:
- **kubernetes_deployment / stateful_set / daemon_set / job_v1**:
`spec[0].template[0].spec[0].dns_config`
- **kubernetes_cron_job_v1**:
`spec[0].job_template[0].spec[0].template[0].spec[0].dns_config`
(extra `job_template[0]` nesting — the CronJob's PodTemplateSpec is
one level deeper)
Each injection / extension is tagged `# KYVERNO_LIFECYCLE_V1: Kyverno
admission webhook mutates dns_config with ndots=2` inline so the
suppression is discoverable via `rg 'KYVERNO_LIFECYCLE_V1' stacks/`.
Two insertion paths are handled by a Python pass (`/tmp/add_dns_config_ignore.py`):
1. **No existing `lifecycle {}`**: inject a brand-new block just before the
resource's closing `}`. 108 new blocks on 93 files.
2. **Existing `lifecycle {}` (usually for `DRIFT_WORKAROUND: CI owns image tag`
from Wave 4, commit a62b43d1)**: extend its `ignore_changes` list with the
dns_config path. Handles both inline (`= [x]`) and multiline
(`= [\n x,\n]`) forms; ensures the last pre-existing list item carries
a trailing comma so the extended list is valid HCL. 34 extensions.
The script skips anything already mentioning `dns_config` inside an
`ignore_changes`, so re-running is a no-op.
## Scale
- 142 total lifecycle injections/extensions
- 93 `.tf` files touched
- 108 brand-new `lifecycle {}` blocks + 34 extensions of existing ones
- Every Tier 0 and Tier 1 stack with a pod-owning resource is covered
- Together with Wave 3A's 27 pre-existing markers → **169 greppable
`KYVERNO_LIFECYCLE_V1` dns_config sites across the repo**
## What is NOT in this change
- `stacks/trading-bot/main.tf` — entirely commented-out block (`/* … */`).
Python script touched the file, reverted manually.
- `_template/main.tf.example` skeleton — kept minimal on purpose; any
future stack created from it should either inherit the Wave 3A one-line
form or add its own on first `kubernetes_deployment`.
- `terraform fmt` fixes to pre-existing alignment issues in meshcentral,
nvidia/modules/nvidia, vault — unrelated to this commit. Left for a
separate fmt-only pass.
- Non-pod resources (`kubernetes_service`, `kubernetes_secret`,
`kubernetes_manifest`, etc.) — they don't own pods so they don't get
Kyverno dns_config mutation.
## Verification
Random sample post-commit:
```
$ cd stacks/navidrome && ../../scripts/tg plan → No changes.
$ cd stacks/f1-stream && ../../scripts/tg plan → No changes.
$ cd stacks/frigate && ../../scripts/tg plan → No changes.
$ rg -c 'KYVERNO_LIFECYCLE_V1' stacks/ --include='*.tf' --include='*.tf.example' \
| awk -F: '{s+=$2} END {print s}'
169
```
## Reproduce locally
1. `git pull`
2. `rg 'KYVERNO_LIFECYCLE_V1' stacks/ | wc -l` → 169+
3. `cd stacks/navidrome && ../../scripts/tg plan` → expect 0 drift on
the deployment's dns_config field.
Refs: code-seq (Wave 3B dns_config class closed; kubernetes_manifest
annotation class handled separately in 8d94688d for tls_secret)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 21:19:48 +00:00
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [ spec [ 0 ] . template [ 0 ] . spec [ 0 ] . dns_config ]
}
2026-03-17 18:11:53 +00:00
}
resource " kubernetes_service " " pgadmin " {
metadata {
name = " pgadmin "
namespace = kubernetes_namespace . dbaas . metadata [ 0 ] . name
}
spec {
selector = {
" app " = " pgadmin "
}
port {
name = " pgadmin "
port = 80
}
}
}
module " ingress-pgadmin " {
source = " ../../../../modules/kubernetes/ingress_factory "
2026-04-16 13:45:04 +00:00
dns_type = " proxied "
2026-03-17 18:11:53 +00:00
namespace = kubernetes_namespace . dbaas . metadata [ 0 ] . name
name = " pgadmin "
tls_secret_name = var . tls_secret_name
ingress_factory: replace `protected` bool with `auth` enum + audit pass across 100 stacks
Phase 3+4 of default-deny ingress plan. Replaces the `protected = bool` (default
false → unprotected) variable in `modules/kubernetes/ingress_factory` with
`auth = string` enum (default "required" → fail-closed). Touches every
ingress_factory caller so the audit decision is recorded explicitly in code.
ingress_factory (Phase 3):
- `auth = "required"`: standard Authentik forward-auth (the legacy
`protected = true` semantic).
- `auth = "public"`: forward-auth via the new `authentik-forward-auth-public`
middleware → dedicated public outpost → guest auto-bind. Logged-in users
keep their real identity.
- `auth = "none"`: no Authentik middleware. For Anubis-fronted content, native
client APIs (Git, /v2/, WebDAV), webhook receivers, the Authentik outpost
itself.
- `effective_anti_ai` default flips ON only when `auth = "none"` (auth-gated
ingresses don't need anti-AI noise; the auth flow already discourages bots).
Audit pass (Phase 4) across 96 ingress_factory call sites:
- 49 explicit `protected = true` → `auth = "required"`
- 8 explicit `protected = false` → `auth = "none"` (5) or `auth = "public"` (3)
- 64 previously-default (no protected line) → `auth = "required"` ADDED, then
reviewed individually:
* 9 Anubis-fronted (blog, www, kms, travel, f1, cyberchef, jsoncrack,
homepage, wrongmove UI, privatebin) → `auth = "none"`
* 22 native-client / programmatic surfaces (Forgejo Git+/v2/, webhook
handler, claude-memory MCP, Nextcloud WebDAV, Matrix, Vault CLI/OIDC,
xray VPN, ntfy, woodpecker webhooks, n8n triggers, ntfy push, dawarich
location ingestion, immich frame kiosk, headscale CP, send anonymous
drops, rybbit beacon, vaultwarden API, Authentik UI itself + outposts) →
`auth = "none"`
* Remaining ~33 → `auth = "required"` confirmed (admin tools, internal
UIs, services without app-level auth)
- Smoke-test promotions to `auth = "public"`: fire-planner public UI,
k8s-portal API, insta2spotify callback.
Three call sites in wrapper modules (`stacks/freedify/factory/`,
`stacks/reverse-proxy/modules/reverse_proxy/`) keep their internal `protected`
bool — they translate to `auth` internally, out of scope for this rename.
Behavior change: previously-default ingresses now fail closed (require
Authentik login) unless explicitly flipped to `auth = "none"` or
`auth = "public"`. This is the audit goal — no more accidentally-unprotected
surfaces. Sites that were intentionally public (Anubis content, native APIs,
webhooks) are now explicitly recorded as `auth = "none"`.
Drive-by: `modules/create-vm/main.tf` picked up cosmetic alignment via
`terraform fmt -recursive` during the audit. Behavior-neutral.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-10 18:53:49 +00:00
auth = " required "
2026-03-17 18:11:53 +00:00
}
resource " kubernetes_cron_job_v1 " " postgresql-backup " {
metadata {
name = " postgresql-backup "
namespace = kubernetes_namespace . dbaas . metadata [ 0 ] . name
}
spec {
concurrency_policy = " Replace "
failed_jobs_history_limit = 5
schedule = " 0 0 * * * "
# schedule = "* * * * *"
starting_deadline_seconds = 10
successful_jobs_history_limit = 10
job_template {
metadata { }
spec {
backoff_limit = 3
ttl_seconds_after_finished = 10
template {
metadata { }
spec {
container {
name = " postgresql-backup "
2026-03-19 23:26:05 +00:00
image = " docker.io/library/postgres:16.4-bullseye "
2026-03-17 18:11:53 +00:00
env {
name = " PGPASSWORD "
value_from {
secret_key_ref {
name = " pg-cluster-superuser "
key = " password "
}
}
}
command = [ " /bin/bash " , " -c " , < < - EOT
set - euxo pipefail
2026-03-25 10:44:53 +02:00
apt - get update - qq && apt - get install - yqq curl > / dev / null 2 > & 1 | | true
add backup IO logging, Pushgateway metrics, and Grafana dashboard
- Add /proc/self/io read/write tracking to vault raft-backup and etcd backup
- Push backup_duration_seconds, backup_read_bytes, backup_written_bytes,
backup_last_success_timestamp to Pushgateway from all 6 backup CronJobs
(etcd skipped — distroless image has no wget/curl)
- Add cloudsync_duration_seconds metric to cloudsync-monitor
- New "Backup Health" Grafana dashboard with 8 panels: time since last backup,
overview table, duration/IO trends, cloud sync status, alerts, CronJob schedule
2026-03-23 12:19:01 +02:00
_ t0 =$ ( date + % s )
2026-03-23 12:33:52 +02:00
_ rb0 =$ ( awk ' / ^ read_bytes / { print $ 2 } ' / proc / $ $ / io 2 > / dev / null | | echo 0 )
_ wb0 =$ ( awk ' / ^ write_bytes / { print $ 2 } ' / proc / $ $ / io 2 > / dev / null | | echo 0 )
add backup IO logging, Pushgateway metrics, and Grafana dashboard
- Add /proc/self/io read/write tracking to vault raft-backup and etcd backup
- Push backup_duration_seconds, backup_read_bytes, backup_written_bytes,
backup_last_success_timestamp to Pushgateway from all 6 backup CronJobs
(etcd skipped — distroless image has no wget/curl)
- Add cloudsync_duration_seconds metric to cloudsync-monitor
- New "Backup Health" Grafana dashboard with 8 panels: time since last backup,
overview table, duration/IO trends, cloud sync status, alerts, CronJob schedule
2026-03-23 12:19:01 +02:00
2026-03-17 18:11:53 +00:00
export now =$ ( date + " %Y_%m_%d_%H_%M " )
feat(storage): migrate 38 NFS PVCs to proxmox-lvm (Wave 2)
Add proxmox-lvm PVCs with pvc-autoresizer annotations for all
remaining single-pod app data services. Deployments updated to
use new block storage PVCs. Old NFS modules retained for rollback.
Services: affine, changedetection, diun, excalidraw, f1-stream,
hackmd, isponsorblocktv, matrix, n8n, send, grampsweb, health,
onlyoffice, owntracks, paperless-ngx, privatebin, resume,
speedtest, stirling-pdf, tandoor, rybbit (clickhouse), tor-proxy
(torrserver), whisper+piper, frigate (config), ollama (ui),
servarr (prowlarr/listenarr/qbittorrent), aiostreams, freshrss
(extensions), meshcentral (data+files), openclaw (data+home+
openlobster), technitium, mailserver (data+roundcube html+enigma),
dbaas (pgadmin).
Strategy set to Recreate where needed for RWO volumes.
2026-04-04 19:25:12 +03:00
PGPASSWORD =$ PGPASSWORD pg_dumpall - h pg - cluster - rw . dbaas - U postgres | gzip -9 > / backup / dump_ $ now . sql . gz
2026-03-17 18:11:53 +00:00
2026-03-23 02:24:34 +02:00
# Rotate — 14 day retention
2026-03-17 18:11:53 +00:00
cd / backup
2026-03-23 02:24:34 +02:00
find . - name " dump_*.sql.gz " - type f - mtime + 14 - delete
find . - name " dump_*.sql " - type f - mtime + 14 - delete # clean up old uncompressed
add backup IO logging, Pushgateway metrics, and Grafana dashboard
- Add /proc/self/io read/write tracking to vault raft-backup and etcd backup
- Push backup_duration_seconds, backup_read_bytes, backup_written_bytes,
backup_last_success_timestamp to Pushgateway from all 6 backup CronJobs
(etcd skipped — distroless image has no wget/curl)
- Add cloudsync_duration_seconds metric to cloudsync-monitor
- New "Backup Health" Grafana dashboard with 8 panels: time since last backup,
overview table, duration/IO trends, cloud sync status, alerts, CronJob schedule
2026-03-23 12:19:01 +02:00
_ dur =$ ( ( $ ( date + % s ) - _ t0 ) )
2026-03-23 12:33:52 +02:00
_ rb1 =$ ( awk ' / ^ read_bytes / { print $ 2 } ' / proc / $ $ / io 2 > / dev / null | | echo 0 )
_ wb1 =$ ( awk ' / ^ write_bytes / { print $ 2 } ' / proc / $ $ / io 2 > / dev / null | | echo 0 )
add backup IO logging, Pushgateway metrics, and Grafana dashboard
- Add /proc/self/io read/write tracking to vault raft-backup and etcd backup
- Push backup_duration_seconds, backup_read_bytes, backup_written_bytes,
backup_last_success_timestamp to Pushgateway from all 6 backup CronJobs
(etcd skipped — distroless image has no wget/curl)
- Add cloudsync_duration_seconds metric to cloudsync-monitor
- New "Backup Health" Grafana dashboard with 8 panels: time since last backup,
overview table, duration/IO trends, cloud sync status, alerts, CronJob schedule
2026-03-23 12:19:01 +02:00
echo " === Backup IO Stats === "
echo " duration: $ ${ _ dur } s "
echo " read: $ (( (_rb1 - _rb0) / 1048576 )) MiB "
echo " written: $ (( (_wb1 - _wb0) / 1048576 )) MiB "
echo " output: $ (ls -lh /backup/dump_ $ now.sql.gz | awk '{print $ 5}') "
2026-03-25 10:44:53 +02:00
_ out_bytes =$ ( stat - c % s / backup / dump_ $ now . sql . gz )
add backup IO logging, Pushgateway metrics, and Grafana dashboard
- Add /proc/self/io read/write tracking to vault raft-backup and etcd backup
- Push backup_duration_seconds, backup_read_bytes, backup_written_bytes,
backup_last_success_timestamp to Pushgateway from all 6 backup CronJobs
(etcd skipped — distroless image has no wget/curl)
- Add cloudsync_duration_seconds metric to cloudsync-monitor
- New "Backup Health" Grafana dashboard with 8 panels: time since last backup,
overview table, duration/IO trends, cloud sync status, alerts, CronJob schedule
2026-03-23 12:19:01 +02:00
curl - sf - - data - binary @ - " http://prometheus-prometheus-pushgateway.monitoring:9091/metrics/job/postgresql-backup " < < PGEOF | | true
backup_duration_seconds $ $ { _ dur }
backup_read_bytes $ ( ( _ rb1 - _ rb0 ) )
backup_written_bytes $ ( ( _ wb1 - _ wb0 ) )
2026-03-25 10:44:53 +02:00
backup_output_bytes $ $ { _ out_bytes }
add backup IO logging, Pushgateway metrics, and Grafana dashboard
- Add /proc/self/io read/write tracking to vault raft-backup and etcd backup
- Push backup_duration_seconds, backup_read_bytes, backup_written_bytes,
backup_last_success_timestamp to Pushgateway from all 6 backup CronJobs
(etcd skipped — distroless image has no wget/curl)
- Add cloudsync_duration_seconds metric to cloudsync-monitor
- New "Backup Health" Grafana dashboard with 8 panels: time since last backup,
overview table, duration/IO trends, cloud sync status, alerts, CronJob schedule
2026-03-23 12:19:01 +02:00
backup_last_success_timestamp $ ( date + % s )
PGEOF
2026-03-17 18:11:53 +00:00
EOT
]
volume_mount {
name = " postgresql-backup "
mount_path = " /backup "
}
}
volume {
name = " postgresql-backup "
persistent_volume_claim {
truenas deprecation: migrate all non-immich storage to proxmox NFS
- Migrate 7 backup CronJobs to Proxmox host NFS (192.168.1.127)
(etcd, mysql, postgresql, nextcloud, redis, vaultwarden, plotting-book)
- Migrate headscale backup, ebook2audiobook, osm_routing to Proxmox NFS
- Migrate servarr (lidarr, readarr, soulseek) NFS refs to Proxmox
- Remove 79 orphaned TrueNAS NFS module declarations from 49 stacks
- Delete stacks/platform/modules/ (27 dead module copies, 65MB)
- Update nfs-truenas StorageClass to point to Proxmox (192.168.1.127)
- Remove iscsi DNS record from config.tfvars
- Fix woodpecker persistence config and alertmanager PV
Only Immich (8 PVCs, ~1.4TB) remains on TrueNAS.
2026-04-12 14:35:39 +01:00
claim_name = module . nfs_postgresql_backup_host . claim_name
2026-03-17 18:11:53 +00:00
}
}
}
}
}
}
}
[infra] Sweep dns_config ignore_changes across all pod-owning resources [ci skip]
## Context
Wave 3A (commit c9d221d5) added the `# KYVERNO_LIFECYCLE_V1` marker to the
27 pre-existing `ignore_changes = [...dns_config]` sites so they could be
grepped and audited. It did NOT address pod-owning resources that were
simply missing the suppression entirely. Post-Wave-3A sampling (2026-04-18)
found that navidrome, f1-stream, frigate, servarr, monitoring, crowdsec,
and many other stacks showed perpetual `dns_config` drift every plan
because their `kubernetes_deployment` / `kubernetes_stateful_set` /
`kubernetes_cron_job_v1` resources had no `lifecycle {}` block at all.
Root cause (same as Wave 3A): Kyverno's admission webhook stamps
`dns_config { option { name = "ndots"; value = "2" } }` on every pod's
`spec.template.spec.dns_config` to prevent NxDomain search-domain flooding
(see `k8s-ndots-search-domain-nxdomain-flood` skill). Without `ignore_changes`
on every Terraform-managed pod-owner, Terraform repeatedly tries to strip
the injected field.
## This change
Extends the Wave 3A convention by sweeping EVERY `kubernetes_deployment`,
`kubernetes_stateful_set`, `kubernetes_daemon_set`, `kubernetes_cron_job_v1`,
`kubernetes_job_v1` (+ their `_v1` variants) in the repo and ensuring each
carries the right `ignore_changes` path:
- **kubernetes_deployment / stateful_set / daemon_set / job_v1**:
`spec[0].template[0].spec[0].dns_config`
- **kubernetes_cron_job_v1**:
`spec[0].job_template[0].spec[0].template[0].spec[0].dns_config`
(extra `job_template[0]` nesting — the CronJob's PodTemplateSpec is
one level deeper)
Each injection / extension is tagged `# KYVERNO_LIFECYCLE_V1: Kyverno
admission webhook mutates dns_config with ndots=2` inline so the
suppression is discoverable via `rg 'KYVERNO_LIFECYCLE_V1' stacks/`.
Two insertion paths are handled by a Python pass (`/tmp/add_dns_config_ignore.py`):
1. **No existing `lifecycle {}`**: inject a brand-new block just before the
resource's closing `}`. 108 new blocks on 93 files.
2. **Existing `lifecycle {}` (usually for `DRIFT_WORKAROUND: CI owns image tag`
from Wave 4, commit a62b43d1)**: extend its `ignore_changes` list with the
dns_config path. Handles both inline (`= [x]`) and multiline
(`= [\n x,\n]`) forms; ensures the last pre-existing list item carries
a trailing comma so the extended list is valid HCL. 34 extensions.
The script skips anything already mentioning `dns_config` inside an
`ignore_changes`, so re-running is a no-op.
## Scale
- 142 total lifecycle injections/extensions
- 93 `.tf` files touched
- 108 brand-new `lifecycle {}` blocks + 34 extensions of existing ones
- Every Tier 0 and Tier 1 stack with a pod-owning resource is covered
- Together with Wave 3A's 27 pre-existing markers → **169 greppable
`KYVERNO_LIFECYCLE_V1` dns_config sites across the repo**
## What is NOT in this change
- `stacks/trading-bot/main.tf` — entirely commented-out block (`/* … */`).
Python script touched the file, reverted manually.
- `_template/main.tf.example` skeleton — kept minimal on purpose; any
future stack created from it should either inherit the Wave 3A one-line
form or add its own on first `kubernetes_deployment`.
- `terraform fmt` fixes to pre-existing alignment issues in meshcentral,
nvidia/modules/nvidia, vault — unrelated to this commit. Left for a
separate fmt-only pass.
- Non-pod resources (`kubernetes_service`, `kubernetes_secret`,
`kubernetes_manifest`, etc.) — they don't own pods so they don't get
Kyverno dns_config mutation.
## Verification
Random sample post-commit:
```
$ cd stacks/navidrome && ../../scripts/tg plan → No changes.
$ cd stacks/f1-stream && ../../scripts/tg plan → No changes.
$ cd stacks/frigate && ../../scripts/tg plan → No changes.
$ rg -c 'KYVERNO_LIFECYCLE_V1' stacks/ --include='*.tf' --include='*.tf.example' \
| awk -F: '{s+=$2} END {print s}'
169
```
## Reproduce locally
1. `git pull`
2. `rg 'KYVERNO_LIFECYCLE_V1' stacks/ | wc -l` → 169+
3. `cd stacks/navidrome && ../../scripts/tg plan` → expect 0 drift on
the deployment's dns_config field.
Refs: code-seq (Wave 3B dns_config class closed; kubernetes_manifest
annotation class handled separately in 8d94688d for tls_secret)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 21:19:48 +00:00
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [ spec [ 0 ] . job_template [ 0 ] . spec [ 0 ] . template [ 0 ] . spec [ 0 ] . dns_config ]
}
2026-03-17 18:11:53 +00:00
}
2026-04-14 22:39:33 +00:00
# Per-database PostgreSQL backups (enables single-database restore without affecting others)
resource " kubernetes_cron_job_v1 " " postgresql-backup-per-db " {
metadata {
name = " postgresql-backup-per-db "
namespace = kubernetes_namespace . dbaas . metadata [ 0 ] . name
}
spec {
feat(storage): migrate all sensitive services to proxmox-lvm-encrypted
Reconcile Terraform with cluster state after manual encrypted PVC migrations
and complete the remaining unfinished migrations. All services storing
sensitive data now use LUKS2-encrypted block storage via the Proxmox CSI
plugin.
## Context
Only Technitium DNS was using encrypted storage in Terraform. Many services
had been manually migrated to encrypted PVCs in the cluster, but Terraform
was never updated — creating dangerous state drift where a `tg apply` could
recreate unencrypted PVCs.
## This change
Phase 0 — Infrastructure:
- Add `proxmox-lvm-encrypted` StorageClass to Helm values (extraParameters)
- Add ExternalSecret for LUKS encryption passphrase to Terraform
- Fix CSI node plugin memory: `node.plugin.resources` (not `node.resources`)
with 1280Mi limit for LUKS2 Argon2id key derivation
Phase 1 — TF state reconciliation (zero downtime):
- Health, Matrix, N8N, Forgejo, Vaultwarden, Mailserver: state rm + import
- Redis, DBAAS MySQL, DBAAS PostgreSQL: Helm/CNPG value updates
Phase 2 — Data migration (encrypted PVCs existed but unused):
- Headscale, Frigate, MeshCentral: rsync + switchover
- Nextcloud (20Gi): rsync + chart_values update
Phase 3 — New encrypted PVCs:
- Roundcube HTML, HackMD, Affine, DBAAS pgadmin: create + rsync + switchover
Phase 4 — Cleanup:
- Deleted 5 orphaned unencrypted PVCs
## Services migrated (18 PVCs across 14 namespaces)
```
vaultwarden → vaultwarden-data-encrypted
dbaas → datadir-mysql-cluster-0, pg-cluster-{1,2}, dbaas-pgadmin-encrypted
mailserver → mailserver-data-encrypted, roundcubemail-{enigma,html}-encrypted
nextcloud → nextcloud-data-encrypted
forgejo → forgejo-data-encrypted
matrix → matrix-data-encrypted
n8n → n8n-data-encrypted
affine → affine-data-encrypted
health → health-uploads-encrypted
hackmd → hackmd-data-encrypted
redis → redis-data-redis-node-{0,1}
headscale → headscale-data-encrypted
frigate → frigate-config-encrypted
meshcentral → meshcentral-{data,files}-encrypted
```
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 20:15:30 +00:00
concurrency_policy = " Replace "
failed_jobs_history_limit = 3
schedule = " 15 0 * * * "
2026-04-14 22:39:33 +00:00
starting_deadline_seconds = 10
successful_jobs_history_limit = 3
job_template {
metadata { }
spec {
backoff_limit = 3
ttl_seconds_after_finished = 10
template {
metadata { }
spec {
container {
name = " postgresql-backup-per-db "
image = " docker.io/library/postgres:16.4-bullseye "
env {
name = " PGPASSWORD "
value_from {
secret_key_ref {
name = " pg-cluster-superuser "
key = " password "
}
}
}
command = [ " /bin/bash " , " -c " , < < - EOT
set - euo pipefail
apt - get update - qq && apt - get install - yqq curl > / dev / null 2 > & 1 | | true
_ t0 =$ ( date + % s )
now =$ ( date + " %Y_%m_%d_%H_%M " )
PGHOST =pg - cluster - rw . dbaas
PGUSER =postgres
failed =0
total =0
ok =0
# Discover all user databases
dbs =$ ( PGPASSWORD =$ PGPASSWORD psql - h $ PGHOST - U $ PGUSER - t - A - c \
" SELECT datname FROM pg_database WHERE datistemplate = false AND datname != 'postgres' ORDER BY datname; " )
for db in $ dbs ; do
total =$ ( ( total + 1 ) )
mkdir - p / backup / per - db / $ db
echo " === Backing up $ db === "
if PGPASSWORD =$ PGPASSWORD pg_dump - Fc - h $ PGHOST - U $ PGUSER " $ db " > " /backup/per-db/ $ db/dump_ $ now.dump " ; then
_ size =$ ( stat - c % s " /backup/per-db/ $ db/dump_ $ now.dump " )
echo " OK — $ (( _size / 1024 )) KiB "
ok =$ ( ( ok + 1 ) )
else
echo " FAILED "
rm - f " /backup/per-db/ $ db/dump_ $ now.dump "
failed =$ ( ( failed + 1 ) )
fi
done
# Rotate — 14 day retention per database
find / backup / per - db - name " dump_*.dump " - type f - mtime + 14 - delete
_ dur =$ ( ( $ ( date + % s ) - _ t0 ) )
echo " === Per-DB Backup Summary === "
echo " databases: $ total (ok: $ ok, failed: $ failed) "
echo " duration: $ ${ _ dur } s "
curl - sf - - data - binary @ - " http://prometheus-prometheus-pushgateway.monitoring:9091/metrics/job/postgresql-backup-per-db " < < PGEOF | | true
backup_duration_seconds $ $ { _ dur }
backup_databases_total $ total
backup_databases_ok $ ok
backup_databases_failed $ failed
backup_last_success_timestamp $ ( date + % s )
PGEOF
EOT
]
volume_mount {
name = " postgresql-backup "
mount_path = " /backup "
}
resources {
requests = {
memory = " 256Mi "
cpu = " 50m "
}
limits = {
memory = " 512Mi "
}
}
}
volume {
name = " postgresql-backup "
persistent_volume_claim {
claim_name = module . nfs_postgresql_backup_host . claim_name
}
}
}
}
}
}
}
[infra] Sweep dns_config ignore_changes across all pod-owning resources [ci skip]
## Context
Wave 3A (commit c9d221d5) added the `# KYVERNO_LIFECYCLE_V1` marker to the
27 pre-existing `ignore_changes = [...dns_config]` sites so they could be
grepped and audited. It did NOT address pod-owning resources that were
simply missing the suppression entirely. Post-Wave-3A sampling (2026-04-18)
found that navidrome, f1-stream, frigate, servarr, monitoring, crowdsec,
and many other stacks showed perpetual `dns_config` drift every plan
because their `kubernetes_deployment` / `kubernetes_stateful_set` /
`kubernetes_cron_job_v1` resources had no `lifecycle {}` block at all.
Root cause (same as Wave 3A): Kyverno's admission webhook stamps
`dns_config { option { name = "ndots"; value = "2" } }` on every pod's
`spec.template.spec.dns_config` to prevent NxDomain search-domain flooding
(see `k8s-ndots-search-domain-nxdomain-flood` skill). Without `ignore_changes`
on every Terraform-managed pod-owner, Terraform repeatedly tries to strip
the injected field.
## This change
Extends the Wave 3A convention by sweeping EVERY `kubernetes_deployment`,
`kubernetes_stateful_set`, `kubernetes_daemon_set`, `kubernetes_cron_job_v1`,
`kubernetes_job_v1` (+ their `_v1` variants) in the repo and ensuring each
carries the right `ignore_changes` path:
- **kubernetes_deployment / stateful_set / daemon_set / job_v1**:
`spec[0].template[0].spec[0].dns_config`
- **kubernetes_cron_job_v1**:
`spec[0].job_template[0].spec[0].template[0].spec[0].dns_config`
(extra `job_template[0]` nesting — the CronJob's PodTemplateSpec is
one level deeper)
Each injection / extension is tagged `# KYVERNO_LIFECYCLE_V1: Kyverno
admission webhook mutates dns_config with ndots=2` inline so the
suppression is discoverable via `rg 'KYVERNO_LIFECYCLE_V1' stacks/`.
Two insertion paths are handled by a Python pass (`/tmp/add_dns_config_ignore.py`):
1. **No existing `lifecycle {}`**: inject a brand-new block just before the
resource's closing `}`. 108 new blocks on 93 files.
2. **Existing `lifecycle {}` (usually for `DRIFT_WORKAROUND: CI owns image tag`
from Wave 4, commit a62b43d1)**: extend its `ignore_changes` list with the
dns_config path. Handles both inline (`= [x]`) and multiline
(`= [\n x,\n]`) forms; ensures the last pre-existing list item carries
a trailing comma so the extended list is valid HCL. 34 extensions.
The script skips anything already mentioning `dns_config` inside an
`ignore_changes`, so re-running is a no-op.
## Scale
- 142 total lifecycle injections/extensions
- 93 `.tf` files touched
- 108 brand-new `lifecycle {}` blocks + 34 extensions of existing ones
- Every Tier 0 and Tier 1 stack with a pod-owning resource is covered
- Together with Wave 3A's 27 pre-existing markers → **169 greppable
`KYVERNO_LIFECYCLE_V1` dns_config sites across the repo**
## What is NOT in this change
- `stacks/trading-bot/main.tf` — entirely commented-out block (`/* … */`).
Python script touched the file, reverted manually.
- `_template/main.tf.example` skeleton — kept minimal on purpose; any
future stack created from it should either inherit the Wave 3A one-line
form or add its own on first `kubernetes_deployment`.
- `terraform fmt` fixes to pre-existing alignment issues in meshcentral,
nvidia/modules/nvidia, vault — unrelated to this commit. Left for a
separate fmt-only pass.
- Non-pod resources (`kubernetes_service`, `kubernetes_secret`,
`kubernetes_manifest`, etc.) — they don't own pods so they don't get
Kyverno dns_config mutation.
## Verification
Random sample post-commit:
```
$ cd stacks/navidrome && ../../scripts/tg plan → No changes.
$ cd stacks/f1-stream && ../../scripts/tg plan → No changes.
$ cd stacks/frigate && ../../scripts/tg plan → No changes.
$ rg -c 'KYVERNO_LIFECYCLE_V1' stacks/ --include='*.tf' --include='*.tf.example' \
| awk -F: '{s+=$2} END {print s}'
169
```
## Reproduce locally
1. `git pull`
2. `rg 'KYVERNO_LIFECYCLE_V1' stacks/ | wc -l` → 169+
3. `cd stacks/navidrome && ../../scripts/tg plan` → expect 0 drift on
the deployment's dns_config field.
Refs: code-seq (Wave 3B dns_config class closed; kubernetes_manifest
annotation class handled separately in 8d94688d for tls_secret)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 21:19:48 +00:00
lifecycle {
# KYVERNO_LIFECYCLE_V1: Kyverno admission webhook mutates dns_config with ndots=2
ignore_changes = [ spec [ 0 ] . job_template [ 0 ] . spec [ 0 ] . template [ 0 ] . spec [ 0 ] . dns_config ]
}
2026-04-14 22:39:33 +00:00
}