## Context
The null_resource.mysql_static_user provisioner in commit 2033e767 used
a bash -c wrapper with nested single quotes (`'"$DB"'`-style injection)
to interpolate the app-specific database name and credentials. The outer
bash -c '...' single-quoted string was broken by the inner ' characters
long before reaching the container, so the local (tg) shell saw `$DB`
and `$USER` unset and produced an empty database name:
ERROR 1102 (42000) at line 1: Incorrect database name ''
Apply failed for both forgejo and roundcubemail.
## This change
Feed the SQL to mysql on the pod via stdin through `kubectl exec -i`:
- Outer command: `kubectl exec -i ... -- sh -c 'exec mysql -uroot -p"$MYSQL_ROOT_PASSWORD"'`
- Single-quoted shell heredoc (`<<'SQL'`) carries the SQL statements
- HCL interpolates `${each.key}`, `${each.value.database}`,
`${each.value.password}` into the heredoc body before the shell runs
- No nested quoting — one single-quote layer, one double-quote layer,
one heredoc layer
Plan/apply verified on the live stack: 2 added (forgejo + roundcubemail),
7 pre-existing drift items changed, 0 destroyed. Both users now log in
with their app-cached passwords.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
The 2026-04-16 MySQL InnoDB Cluster → standalone migration recreated the
MySQL user table but scripted fresh passwords for every app user. Two apps
(forgejo, roundcubemail) store their DB password inside their own
application config — forgejo in `/data/gitea/conf/app.ini` (baked into the
PVC), roundcubemail in the ROUNDCUBEMAIL_DB_PASSWORD env from the
mailserver stack (sourced from Vault `secret/platform`). Neither app
could be restarted with a new password without rewriting its own config.
Both apps silently broke with `Access denied for user 'X'@'%'` after the
migration. Remediation on 2026-04-17 was a manual `ALTER USER ... IDENTIFIED
BY '<app_password>'` to re-sync MySQL with what each app already has. With
nothing in Terraform managing those users, the next migration would break
them again — that's the gap this change closes.
## What this change does
Codifies both MySQL users in `stacks/dbaas/modules/dbaas/` using the same
`null_resource` + `local-exec` + `kubectl exec` pattern already used for
`pg_terraform_state_db` (line 1373 of the same file). Rejected alternatives:
- `petoju/mysql` Terraform provider — no existing usage in the repo; would
be a net-new dependency. Module-level `for_each` over `mysql_user` +
`mysql_grant` is cleaner, but the added machinery (new provider block,
extra auth path via `MYSQL_HOST`/`MYSQL_USERNAME`/`MYSQL_PASSWORD` TF
env vars, state-dependent password reads) outweighs the benefit for two
static users.
- K8s Job — adds lifecycle management for a one-shot resource; needs
secret mounts and is harder to retry. `local-exec` is exactly what the
existing PG bootstrap uses.
Idempotency contract:
CREATE DATABASE IF NOT EXISTS <db>;
CREATE USER IF NOT EXISTS '<user>'@'%' IDENTIFIED WITH caching_sha2_password BY '<pw>';
ALTER USER '<user>'@'%' IDENTIFIED WITH caching_sha2_password BY '<pw>';
GRANT ALL PRIVILEGES ON <db>.* TO '<user>'@'%';
FLUSH PRIVILEGES;
The `ALTER USER` on every re-run re-syncs the password if Vault was rotated
out-of-band (healing drift). The `sha256(password)` trigger also re-runs
the provisioner when the Vault password legitimately changes, so the
resource is responsive to both new and rotated passwords. `caching_sha2_password`
matches the live plugin returned by `SHOW CREATE USER`; forcing it prevents
silent drift to `mysql_native_password`.
Flow (apply-time):
scripts/tg apply
│
├── data.vault_kv_secret_v2.viktor ── reads mysql_{forgejo,roundcubemail}_password
│
▼
module.dbaas
│
├── mysql-standalone-0 (StatefulSet, already running)
│
├── null_resource.mysql_static_user["forgejo"]
│ └── kubectl exec ... mysql -uroot -p$ROOT_PASSWORD ... CREATE/ALTER/GRANT
│
└── null_resource.mysql_static_user["roundcubemail"]
└── (same, for roundcubemail)
## Secrets
Two new keys added to Vault `secret/viktor`:
mysql_forgejo_password # bound to forgejo `[database]` in app.ini
mysql_roundcubemail_password # duplicates secret/platform
# mailserver_roundcubemail_db_password;
# secret/viktor is the personal vault of
# record per .claude/CLAUDE.md
Passwords are never written to the repo — both come from Vault via
`data "vault_kv_secret_v2" "viktor"` in the dbaas root module.
## What is NOT in this change
- PG-side users (managed by Vault DB engine static-roles already — see
MEMORY.md "Database rotation")
- Other MySQL users (speedtest, wrongmove, codimd, nextcloud, shlink,
grafana, phpipam are all rotated by Vault DB engine; root users
excluded by design)
- Removing the old mysql-operator / InnoDB Cluster helm releases (Phase 4
cleanup tracked under the MySQL standalone migration work — still
pending)
## Test plan
### Automated
`terraform fmt -check -recursive stacks/dbaas` → exit 0
`scripts/tg plan` in stacks/dbaas →
Plan: 2 to add, 7 to change, 0 to destroy.
# module.dbaas.null_resource.mysql_static_user["forgejo"] will be created
# module.dbaas.null_resource.mysql_static_user["roundcubemail"] will be created
The 7 "update in-place" entries are pre-existing drift (Kyverno labels on
LimitRange, MetalLB ip-allocated-from-pool annotation on postgresql_lb,
Kyverno-injected `dns_config` on 4 CronJobs lacking the
`ignore_changes` workaround, `resize.topolvm.io/storage_limit` bump
30Gi→50Gi on mysql-standalone PVC). None of those are introduced by this
commit and all are benign (no data loss, no pod restart).
### Manual Verification
# 1. Sanity check pre-apply — users are in their current (manually-fixed) state.
kubectl exec -n dbaas mysql-standalone-0 -c mysql -- bash -c \
'mysql -uroot -p"$MYSQL_ROOT_PASSWORD" -N -e \
"SELECT user,host,plugin FROM mysql.user WHERE user IN (\"forgejo\",\"roundcubemail\");"'
# Expected:
# forgejo % caching_sha2_password
# roundcubemail % caching_sha2_password
# 2. Apply and confirm the provisioner exits 0.
cd stacks/dbaas && ../../scripts/tg apply
# Expect: null_resource.mysql_static_user["forgejo"]: Creation complete
# null_resource.mysql_static_user["roundcubemail"]: Creation complete
# 3. App-level smoke: log in to forgejo.viktorbarzin.me (any git push)
# and load https://mail.viktorbarzin.me/roundcube (IMAP login). Both
# must succeed.
# 4. Destructive test (run ONCE, off-hours):
kubectl exec -n dbaas mysql-standalone-0 -c mysql -- bash -c \
'mysql -uroot -p"$MYSQL_ROOT_PASSWORD" -e "DROP USER '\''forgejo'\''@'\''%'\''"'
cd stacks/dbaas && ../../scripts/tg apply
# Expected: apply recreates the user with the Vault password, forgejo UI
# recovers without touching /data/gitea/conf/app.ini.
### Reproduce locally
1. vault login -method=oidc
2. cd infra/stacks/dbaas
3. ../../scripts/tg plan
4. Expected: "Plan: 2 to add, 7 to change, 0 to destroy." with the two
null_resource.mysql_static_user additions. 7 changes are pre-existing
drift unrelated to this commit.
Closes: code-6th
Closes: code-96w
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The rewrite-body Traefik plugin (both packruler/rewrite-body v1.2.0 and
the-ccsn/traefik-plugin-rewritebody v0.1.3) silently fails on Traefik
v3.6.12 due to Yaegi interpreter issues with ResponseWriter wrapping.
Both plugins load without errors but never inject content.
Removed:
- rewrite-body plugin download (init container) and registration
- strip-accept-encoding middleware (only existed for rewrite-body bug)
- anti-ai-trap-links middleware (used rewrite-body for injection)
- rybbit_site_id variable from ingress_factory and reverse_proxy factory
- rybbit_site_id from 25 service stacks (39 instances)
- Per-service rybbit-analytics middleware CRD resources
Kept:
- compress middleware (entrypoint-level, working correctly)
- ai-bot-block middleware (ForwardAuth to bot-block-proxy)
- anti-ai-headers middleware (X-Robots-Tag: noai, noimageai)
- All CrowdSec, Authentik, rate-limit middleware unchanged
Next: Cloudflare Workers with HTMLRewriter for edge-side injection.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Context
Disk write analysis showed MySQL InnoDB Cluster writing ~95 GB/day for only
~35 MB of actual data due to Group Replication overhead (binlog, relay log,
GR apply log). The operator enforces GR even with serverInstances=1.
Bitnami Helm charts were deprecated by Broadcom in Aug 2025 — no free
container images available. Using official mysql:8.4 image instead.
## This change:
- Replace helm_release.mysql_cluster service selector with raw
kubernetes_stateful_set_v1 using official mysql:8.4 image
- ConfigMap mysql-standalone-cnf: skip-log-bin, innodb_flush_log_at_trx_commit=2,
innodb_doublewrite=ON (re-enabled for standalone safety)
- Service selector switched to standalone pod labels
- Technitium: disable SQLite query logging (18 GB/day write amplification),
keep PostgreSQL-only logging (90-day retention)
- Grafana datasource and dashboards migrated from MySQL to PostgreSQL
- Dashboard SQL queries fixed for PG integer division (::float cast)
- Updated CLAUDE.md service-specific notes
## What is NOT in this change:
- InnoDB Cluster + operator removal (Phase 4, 7+ days from now)
- Stale Vault role cleanup (Phase 4)
- Old PVC deletion (Phase 4)
Expected write reduction: ~113 GB/day (MySQL 95 + Technitium 18)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Context
Deploying new services required manually adding hostnames to
cloudflare_proxied_names/cloudflare_non_proxied_names in config.tfvars —
a separate file from the service stack. This was frequently forgotten,
leaving services unreachable externally.
## This change:
- Add `dns_type` parameter to `ingress_factory` and `reverse_proxy/factory`
modules. Setting `dns_type = "proxied"` or `"non-proxied"` auto-creates
the Cloudflare DNS record (CNAME to tunnel or A/AAAA to public IP).
- Simplify cloudflared tunnel from 100 per-hostname rules to wildcard
`*.viktorbarzin.me → Traefik`. Traefik still handles host-based routing.
- Add global Cloudflare provider via terragrunt.hcl (separate
cloudflare_provider.tf with Vault-sourced API key).
- Migrate 118 hostnames from centralized config.tfvars to per-service
dns_type. 17 hostnames remain centrally managed (Helm ingresses,
special cases).
- Update docs, AGENTS.md, CLAUDE.md, dns.md runbook.
```
BEFORE AFTER
config.tfvars (manual list) stacks/<svc>/main.tf
| module "ingress" {
v dns_type = "proxied"
stacks/cloudflared/ }
for_each = list |
cloudflare_record auto-creates
tunnel per-hostname cloudflare_record + annotation
```
## What is NOT in this change:
- Uptime Kuma monitor migration (still reads from config.tfvars)
- 17 remaining centrally-managed hostnames (Helm, special cases)
- Removal of allow_overwrite (keep until migration confirmed stable)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add separate CronJobs that dump each database individually:
- postgresql-backup-per-db: pg_dump -Fc per DB (daily 00:15)
- mysql-backup-per-db: mysqldump per DB (daily 00:45)
Dumps go to /backup/per-db/<dbname>/ on the same NFS PVC.
Enables single-database restore without affecting other databases.
Also fixed CNPG superuser password sync and added --single-transaction
--set-gtid-purged=OFF to MySQL per-db dumps.
Updated restore runbooks with per-database restore procedures.
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
MySQL operator ignores podSpec.containers sidecar resource overrides,
always injecting 6Gi limit defaults. Added sidecar to CR spec for
documentation but raised quota from 32Gi to 40Gi as the practical fix.
Quota usage drops from 99% to 79%.
- NFS CSI: fix liveness-probe port conflict (29652 → 29653)
- Immich ML: add gpu-workload priority class to enable preemption on node1
- dbaas: right-size MySQL memory limits (sidecar 6Gi→350Mi, main 4Gi→3Gi)
- Redis: add redis-master service via HAProxy for master-only routing,
update config.tfvars redis_host to use it
- CoreDNS: forward .viktorbarzin.lan to Technitium ClusterIP (10.96.0.53)
instead of stale LoadBalancer IP (10.0.20.200)
- Trading bot: comment out all resources (no longer needed)
- Vault: remove trading-bot PostgreSQL database role
- MySQL InnoDB: keep required anti-affinity but document why (2/3 members OK during node loss)
- Descheduler: increase frequency from hourly to every 5 min for faster rebalancing
- Prometheus: set terminationGracePeriodSeconds=60 to prevent drain timeout [ci skip]
Kyverno's tier-1-cluster LimitRange had max=4Gi which blocked
mysql-cluster-2 from starting after we bumped MySQL to 6Gi limit.
Also added custom LimitRange in dbaas stack (for when Terraform
manages it directly).
Deploy topolvm/pvc-autoresizer controller that monitors kubelet_volume_stats
via Prometheus and auto-expands annotated PVCs. Annotated all 9 block-storage
PVCs (proxmox-lvm) with per-PVC thresholds and max limits. Updated PVFillingUp
alert to critical/10m (means auto-expansion failed) and added PVAutoExpanding
info alert at 80%.
/proc/self/io inside $(awk ...) resolves to the awk subprocess PID,
not the parent bash shell. Use $$ (bash PID) to read the correct
process IO counters.
- Add /proc/self/io read/write tracking to vault raft-backup and etcd backup
- Push backup_duration_seconds, backup_read_bytes, backup_written_bytes,
backup_last_success_timestamp to Pushgateway from all 6 backup CronJobs
(etcd skipped — distroless image has no wget/curl)
- Add cloudsync_duration_seconds metric to cloudsync-monitor
- New "Backup Health" Grafana dashboard with 8 panels: time since last backup,
overview table, duration/IO trends, cloud sync status, alerts, CronJob schedule
- Add docker.io/library/ prefix to mysql and postgres backup images
to satisfy Kyverno require-trusted-registries policy (both CronJobs
were blocked for 46h, triggering MySQLBackupStale alert)
- Document mysql-operator chart ignoring resources values key — the
LimitRange default (256Mi) was silently applied, putting the operator
at 97% memory. Patched live to 512Mi via kubectl.
- Increase vault-raft-backup backoff_limit to 6 for transient failures
(also fixed NFS export: vault-backup was a separate ZFS dataset not
in the TrueNAS NFS share — destroyed dataset, created directory)
Phase 1 of platform stack split for parallel CI applies.
All 3 modules were fully independent (no cross-module refs).
State migrated via terraform state mv. All 3 stacks applied
with zero changes (dbaas had pre-existing ResourceQuota drift).
Woodpecker pipeline updated to run extracted stacks in parallel.