Nextcloud persists dbpassword in config.php on its PVC and ignores
MYSQL_PASSWORD env var after initial install. When Vault rotates the
MySQL password, config.php goes stale causing HTTP 500 crash loops.
Adds a before-starting hook that patches config.php with the current
MYSQL_PASSWORD on every pod start. Combined with Stakater Reloader
annotation, the full rotation chain is now automated:
Vault rotates → ESO syncs Secret → Reloader restarts pod → hook
patches config.php → Nextcloud connects with new password.
Also fixes stale existingClaim (nextcloud-data-iscsi → nextcloud-data-proxmox).
MySQL 8.4 remapped `utf8` to mean `utf8mb4`, but Nextcloud without this
config sends `COLLATE UTF8_general_ci` (a utf8mb3-only collation) in
queries, causing SQLSTATE[42000] errors that broke occ commands and sync.
Also removed stale `Work 🎯.csv` whose emoji filename was stripped in the
DB filecache (stored as `Work .csv`), causing permanent sync errors.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Vault DB engine rotates passwords weekly but 5 stacks baked passwords
at Terraform plan time, causing stale credentials until next apply.
- real-estate-crawler: add vault-database ESO, use secret_key_ref in 3 deployments
- nextcloud: switch Helm chart to existingSecret for DB password
- grafana: add vault-database ESO, use envFromSecrets in Helm values
- woodpecker: use extraSecretNamesForEnvFrom, remove plan-time data source chain
- affine: add vault-database ESO, use secret_key_ref in deployment + init container
Kyverno ClusterPolicy reads dependency.kyverno.io/wait-for annotation
and injects busybox init containers that block until each dependency
is reachable (nc -z). Annotations added to 18 stacks (24 deployments).
Includes graceful-db-maintenance.sh script for planned DB maintenance
(scales dependents to 0, saves replica counts, restores on startup).
After node2 OOM incident, right-size memory across the cluster by setting
requests=limits based on max_over_time(container_memory_working_set_bytes[7d])
with 1.3x headroom. Eliminates ~37Gi overcommit gap.
Categories:
- Safe equalization (50 containers): set req=lim where max7d well within target
- Limit increases (8 containers): raise limits for services spiking above current
- No Prometheus data (12 containers): conservatively set lim=req
- Exception: nextcloud keeps req=256Mi/lim=8Gi due to Apache memory spikes
Also increased dbaas namespace quota from 12Gi to 16Gi to accommodate mysql
4Gi limits across 3 replicas.
CPU limits cause CFS throttling even when nodes have idle capacity.
Move to a request-only CPU model: keep CPU requests for scheduling
fairness but remove all CPU limits. Memory limits stay (incompressible).
Changes across 108 files:
- Kyverno LimitRange policy: remove cpu from default/max in all 6 tiers
- Kyverno ResourceQuota policy: remove limits.cpu from all 5 tiers
- Custom ResourceQuotas: remove limits.cpu from 8 namespace quotas
- Custom LimitRanges: remove cpu from default/max (nextcloud, onlyoffice)
- RBAC module: remove cpu_limits variable and quota reference
- Freedify factory: remove cpu_limit variable and limits reference
- 86 deployment files: remove cpu from all limits blocks
- 6 Helm values files: remove cpu under limits sections
- Set loglevel=2 (warnings) and disable mail_smtpdebug via configs
- Enable opcache.enable_file_override for faster file checks
- Increase APCu shared memory from 32M to 128M
- Fix broken module.nfs_nextcloud_data reference in backup cron job
to use the iSCSI PVC directly
SQLite caused 4.7 CPU / 2GB usage, now MySQL uses ~95m / 95Mi.
Reduced limits from 16 CPU / 6Gi to 2 CPU / 1Gi.
Reduced requests from 100m / 1Gi to 50m / 256Mi.
Frees ~14 CPU cores and 5Gi memory for other workloads.
SQLite was causing constant crash-looping (138 restarts/day) due to
write lock contention with concurrent sync clients.
Migration required workarounds for multiple occ db:convert-type bugs:
- GR error 3100: SET GLOBAL sql_generate_invisible_primary_key = ON
- utf8mb3 column creation: stripped 4-byte emoji + invalid UTF-8 from
SQLite (F1 calendar events, filecache)
- SQLite index corruption: repaired via .dump + INSERT OR IGNORE reimport
- kubectl exec timeouts: used nohup + detached process
Verified: all 136 tables migrated, 100% row count match across 15 key
tables (users, files, calendars, contacts, shares, activity).
Also fixed typo: databse → database in chart values.
SQLite on NFS caused persistent 500 errors on WebDAV PROPFIND due to
missing fsync guarantees and database locking under concurrent access.
iSCSI (ext4) provides proper fsync and block-level I/O.
- Replace nfs_volume module with iscsi-truenas PVC (20Gi)
- Update Helm chart to use nextcloud-data-iscsi claim
- Excluded 12.5GB nextcloud.log and corrupted DB from migration
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The Nextcloud Helm chart expects extraVolumes/extraVolumeMounts nested
under the nextcloud: key. Also mount to mods-available/ (the actual file)
not mods-enabled/ (which is a symlink).
Verified: MaxRequestWorkers 150→25, workers dropped from 49 to 6.
CPU was pegged at 2000m/2000m (100% throttled). Add custom-quota
opt-out label and ResourceQuota allowing 32 CPU limits to accommodate
the 16 CPU container limit plus sidecar defaults.
Remove the module "xxx" { source = "./module" } indirection layer
from all 66 service stacks. Resources are now defined directly in
each stack's main.tf instead of through a wrapper module.
- Merge module/main.tf contents into stack main.tf
- Apply variable replacements (var.tier -> local.tiers.X, renamed vars)
- Fix shared module paths (one fewer ../ at each level)
- Move extra files/dirs (factory/, chart_values, subdirs) to stack root
- Update state files to strip module.<name>. prefix
- Update CLAUDE.md to reflect flat structure
Verified: terragrunt plan shows 0 add, 0 destroy across all stacks.
2026-02-22 15:13:55 +00:00
Renamed from stacks/nextcloud/module/chart_values.yaml (Browse further)