Vault DB engine rotates passwords weekly but 5 stacks baked passwords
at Terraform plan time, causing stale credentials until next apply.
- real-estate-crawler: add vault-database ESO, use secret_key_ref in 3 deployments
- nextcloud: switch Helm chart to existingSecret for DB password
- grafana: add vault-database ESO, use envFromSecrets in Helm values
- woodpecker: use extraSecretNamesForEnvFrom, remove plan-time data source chain
- affine: add vault-database ESO, use secret_key_ref in deployment + init container
Kyverno ClusterPolicy reads dependency.kyverno.io/wait-for annotation
and injects busybox init containers that block until each dependency
is reachable (nc -z). Annotations added to 18 stacks (24 deployments).
Includes graceful-db-maintenance.sh script for planned DB maintenance
(scales dependents to 0, saves replica counts, restores on startup).
Vault is now the sole source of truth for secrets. SOPS pipeline
removed entirely — auth via `vault login -method=oidc`.
Part A: SOPS removal
- vault/main.tf: delete 990 lines (93 vars + 43 KV write resources),
add self-read data source for OIDC creds from secret/vault
- terragrunt.hcl: remove SOPS var loading, vault_root_token, check_secrets hook
- scripts/tg: remove SOPS decryption, keep -auto-approve logic
- .woodpecker/default.yml: replace SOPS with Vault K8s auth via curl
- Delete secrets.sops.json, .sops.yaml
Part B: External Secrets Operator
- New stack stacks/external-secrets/ with Helm chart + 2 ClusterSecretStores
(vault-kv for KV v2, vault-database for DB engine)
Part C: Database secrets engine (in vault/main.tf)
- MySQL + PostgreSQL connections with static role rotation (24h)
- 6 MySQL roles (speedtest, wrongmove, codimd, nextcloud, shlink, grafana)
- 6 PostgreSQL roles (trading, health, linkwarden, affine, woodpecker, claude_memory)
Part D: Kubernetes secrets engine (in vault/main.tf)
- RBAC for Vault SA to manage K8s tokens
- Roles: dashboard-admin, ci-deployer, openclaw, local-admin
- New scripts/vault-kubeconfig helper for dynamic kubeconfig
K8s auth method with scoped policies for CI, ESO, OpenClaw, Woodpecker sync.
After node2 OOM incident, right-size memory across the cluster by setting
requests=limits based on max_over_time(container_memory_working_set_bytes[7d])
with 1.3x headroom. Eliminates ~37Gi overcommit gap.
Categories:
- Safe equalization (50 containers): set req=lim where max7d well within target
- Limit increases (8 containers): raise limits for services spiking above current
- No Prometheus data (12 containers): conservatively set lim=req
- Exception: nextcloud keeps req=256Mi/lim=8Gi due to Apache memory spikes
Also increased dbaas namespace quota from 12Gi to 16Gi to accommodate mysql
4Gi limits across 3 replicas.
- Add vault provider to root terragrunt.hcl (generated providers.tf)
- Delete stacks/vault/vault_provider.tf (now in generated providers.tf)
- Add 124 variable declarations + 43 vault_kv_secret_v2 resources to
vault/main.tf to populate Vault KV at secret/<stack-name>
- Migrate 43 consuming stacks to read secrets from Vault KV via
data "vault_kv_secret_v2" instead of SOPS var-file
- Add dependency "vault" to all migrated stacks' terragrunt.hcl
- Complex types (maps/lists) stored as JSON strings, decoded with
jsondecode() in locals blocks
Bootstrap secrets (vault_root_token, vault_authentik_client_id,
vault_authentik_client_secret) remain in SOPS permanently.
Apply order: vault stack first (populates KV), then all others.
CPU limits cause CFS throttling even when nodes have idle capacity.
Move to a request-only CPU model: keep CPU requests for scheduling
fairness but remove all CPU limits. Memory limits stay (incompressible).
Changes across 108 files:
- Kyverno LimitRange policy: remove cpu from default/max in all 6 tiers
- Kyverno ResourceQuota policy: remove limits.cpu from all 5 tiers
- Custom ResourceQuotas: remove limits.cpu from 8 namespace quotas
- Custom LimitRanges: remove cpu from default/max (nextcloud, onlyoffice)
- RBAC module: remove cpu_limits variable and quota reference
- Freedify factory: remove cpu_limit variable and limits reference
- 86 deployment files: remove cpu from all limits blocks
- 6 Helm values files: remove cpu under limits sections
- Set loglevel=2 (warnings) and disable mail_smtpdebug via configs
- Enable opcache.enable_file_override for faster file checks
- Increase APCu shared memory from 32M to 128M
- Fix broken module.nfs_nextcloud_data reference in backup cron job
to use the iSCSI PVC directly
SQLite caused 4.7 CPU / 2GB usage, now MySQL uses ~95m / 95Mi.
Reduced limits from 16 CPU / 6Gi to 2 CPU / 1Gi.
Reduced requests from 100m / 1Gi to 50m / 256Mi.
Frees ~14 CPU cores and 5Gi memory for other workloads.
With SQLite, 50 workers caused all workers to block on DB locks.
On MySQL, CPU is ~20m and memory ~143Mi — no resource pressure.
The crash-looping was caused by hitting MaxRequestWorkers=50 limit
("server reached MaxRequestWorkers setting"), not by DB contention.
SQLite was causing constant crash-looping (138 restarts/day) due to
write lock contention with concurrent sync clients.
Migration required workarounds for multiple occ db:convert-type bugs:
- GR error 3100: SET GLOBAL sql_generate_invisible_primary_key = ON
- utf8mb3 column creation: stripped 4-byte emoji + invalid UTF-8 from
SQLite (F1 calendar events, filecache)
- SQLite index corruption: repaired via .dump + INSERT OR IGNORE reimport
- kubectl exec timeouts: used nohup + detached process
Verified: all 136 tables migrated, 100% row count match across 15 key
tables (users, files, calendars, contacts, shares, activity).
Also fixed typo: databse → database in chart values.
- Restored clean SQLite database from pre-migration backup
- Fixed severe database corruption (rowid ordering, page corruption, index issues)
- Applied conservative MaxRequestWorkers=15 for SQLite stability
- Database integrity now perfect, all health checks passing
- Ready for future MySQL migration with clean data
[ci skip]
- Increase MaxRequestWorkers from 10 to 25 for 4 CPU + 3Gi memory container
- Update Apache tuning for Redis + SQLite backend (not pure SQLite)
- Resolves CrashLoopBackOff caused by health probe timeouts
- Allows handling concurrent users without MaxRequestWorkers limit errors
[ci skip]
MySQL (3 OOM kills):
- Cap group_replication_message_cache_size to 128MB (default 1GB caused OOM)
- Reduce innodb_log_buffer_size from 64MB to 16MB
- Lower max_connections from 151 to 80 (peak usage ~40)
- Increase memory limit from 3Gi to 4Gi for headroom
Nextcloud (30+ apache2 OOM kills per incident):
- Reduce MaxRequestWorkers from 50 to 10 to prevent fork bomb
when SQLite locks cause request pileup
- Lower StartServers/MinSpare/MaxSpare proportionally
Uptime Kuma (Node.js memory leak):
- Increase memory limit from 256Mi to 512Mi
- Increase CPU limit from 200m to 500m
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
SQLite on NFS caused persistent 500 errors on WebDAV PROPFIND due to
missing fsync guarantees and database locking under concurrent access.
iSCSI (ext4) provides proper fsync and block-level I/O.
- Replace nfs_volume module with iscsi-truenas PVC (20Gi)
- Update Helm chart to use nextcloud-data-iscsi claim
- Excluded 12.5GB nextcloud.log and corrupted DB from migration
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- MaxRequestWorkers 25→50: too few workers caused ALL workers to block
on SQLite locks, making liveness probes fail even faster (131 restarts
vs 50 before). 50 is a compromise — enough workers for probes.
- Excluded nextcloud from HighServiceErrorRate alert (chronic SQLite issue)
- MySQL migration attempted but hit: GR error 3100 (fixed with GIPK),
emoji in calendar/filecache (stripped), SQLite corruption (pre-existing
from crash-looping). Migration rolled back, Nextcloud restored to SQLite.
The Nextcloud Helm chart expects extraVolumes/extraVolumeMounts nested
under the nextcloud: key. Also mount to mods-available/ (the actual file)
not mods-enabled/ (which is a symlink).
Verified: MaxRequestWorkers 150→25, workers dropped from 49 to 6.
- qBittorrent: use service port 80 (not container port 8080)
- Immich: add version=2 for new API endpoints (/api/server/*)
- Nextcloud: use external URL (internal rejects untrusted Host header)
- HA London: remove widget (token expired, needs manual regeneration)
- Headscale: remove widget (requires nodeId param, not overview)
Services expose port 80 via ClusterIP but widgets were using container
target ports (5000, 3001, 4533, 3000). Calibre was using external URL
through Authentik. All now use correct internal service URLs.
Add Kubernetes ingress annotations for Homepage auto-discovery across
~88 services organized into 11 groups. Enable serviceAccount for RBAC,
configure group layouts, and add Grafana/Frigate/Speedtest widgets.
Phase 5 — CI pipelines:
- default.yml: add SOPS decrypt in prepare step, change git add . to
specific paths (stacks/ state/ .woodpecker/), cleanup on success+failure
- renew-tls.yml: change git add . to git add secrets/ state/
Phase 6 — sensitive=true:
- Add sensitive = true to 256 variable declarations across 149 stack files
- Prevents secret values from appearing in terraform plan output
- Does NOT modify shared modules (ingress_factory, nfs_volume) to avoid
breaking module interface contracts
Note: CI pipeline SOPS decryption requires sops_age_key Woodpecker secret
to be created before the pipeline will work with SOPS. Until then, the old
terraform.tfvars path continues to function.
CPU was pegged at 2000m/2000m (100% throttled). Add custom-quota
opt-out label and ResourceQuota allowing 32 CPU limits to accommodate
the 16 CPU container limit plus sidecar defaults.
Major milestone - shared PostgreSQL moved from NFS to CloudNativePG:
- CNPG cluster (pg-cluster) running in dbaas namespace on local-path storage
- PostGIS image (ghcr.io/cloudnative-pg/postgis:16) for dawarich compatibility
- All 20 databases and 19 roles restored from pg_dumpall backup
- postgresql.dbaas Service patched to point at CNPG primary
- Old PG deployment scaled to 0 (NFS data intact for rollback)
- All 12+ dependent services verified running:
authentik, n8n, dawarich, tandoor, linkwarden, netbox, woodpecker,
rybbit, affine, health, resume, trading-bot, atuin
- Authentik PgBouncer working through the switched endpoint
TODO: codify CNPG cluster in Terraform, add 2nd replica, update backup CronJob
Remove the module "xxx" { source = "./module" } indirection layer
from all 66 service stacks. Resources are now defined directly in
each stack's main.tf instead of through a wrapper module.
- Merge module/main.tf contents into stack main.tf
- Apply variable replacements (var.tier -> local.tiers.X, renamed vars)
- Fix shared module paths (one fewer ../ at each level)
- Move extra files/dirs (factory/, chart_values, subdirs) to stack root
- Update state files to strip module.<name>. prefix
- Update CLAUDE.md to reflect flat structure
Verified: terragrunt plan shows 0 add, 0 destroy across all stacks.
Move all 88 service modules (66 individual + 22 platform) from
modules/kubernetes/<service>/ into their corresponding stack directories:
- Service stacks: stacks/<service>/module/
- Platform stack: stacks/platform/modules/<service>/
This collocates module source code with its Terragrunt definition.
Only shared utility modules remain in modules/kubernetes/:
ingress_factory, setup_tls_secret, dockerhub_secret, oauth-proxy.
All cross-references to shared modules updated to use correct
relative paths. Verified with terragrunt run --all -- plan:
0 adds, 0 destroys across all 68 stacks.
Generated individual stack directories for all 66 services under stacks/.
Each stack has terragrunt.hcl (depends on platform) and main.tf (thin
wrapper calling existing module). Migrated all 64 active service states
from root terraform.tfstate to individual state files. Root state is now
empty. Verified with terragrunt plan on multiple stacks (no changes).