Viktor Barzin
5ad2c273fc
stabilize Nextcloud: relax probes, reduce resources for 2-client SQLite workload
...
SQLite locks cause slow responses under concurrent access, triggering
liveness probe failures and restarts. With only 2 sync clients, relax
the probes and shrink the resource limits:
- Liveness: period 30→60s, timeout 10→30s, failures 6→10 (tolerates 10min)
- Readiness: period 30→60s, timeout 10→30s, failures 3→5
- Startup: timeout 10→30s
- Resources: CPU 16→4, memory 6Gi→3Gi (10 workers × 200MB = 2GB max)
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-12 10:01:20 +00:00
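The "tolerates 10min" figure in the commit above follows from multiplying the probe period by the failure threshold. A minimal sketch of that arithmetic, assuming the tolerance is approximated as period × failures and ignoring the per-probe timeout (the function name is illustrative, not from the repository):

```python
def probe_tolerance_seconds(period_seconds: int, failure_threshold: int) -> int:
    """Approximate time a container may stay unhealthy before the probe
    is considered failed: one probe per period, failure_threshold
    consecutive failures required. Per-probe timeouts are ignored."""
    return period_seconds * failure_threshold


# Values taken from the commit message above.
old_liveness = probe_tolerance_seconds(period_seconds=30, failure_threshold=6)
new_liveness = probe_tolerance_seconds(period_seconds=60, failure_threshold=10)
new_readiness = probe_tolerance_seconds(period_seconds=60, failure_threshold=5)

print(f"liveness:  {old_liveness}s -> {new_liveness}s")
print(f"readiness: {new_readiness}s")
```

Under this approximation the liveness window grows from 180s to 600s, which matches the 10-minute tolerance stated in the message.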
Viktor Barzin
8c920bd496
migrate Nextcloud data volume from NFS to iSCSI for fsync support
...
SQLite on NFS caused persistent 500 errors on WebDAV PROPFIND due to
missing fsync guarantees and database locking under concurrent access.
iSCSI (ext4) provides proper fsync and block-level I/O.
- Replace nfs_volume module with iscsi-truenas PVC (20Gi)
- Update Helm chart to use nextcloud-data-iscsi claim
- Exclude the 12.5GB nextcloud.log and the corrupted DB from the migration
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-11 23:24:03 +00:00
Viktor Barzin
ce95a13bda
fix: mount Apache MPM config under nextcloud.extraVolumes (not top-level)
...
The Nextcloud Helm chart expects extraVolumes/extraVolumeMounts nested
under the nextcloud: key. Also mount to mods-available/ (where the
actual file lives), not mods-enabled/ (which only holds a symlink to it).
Verified: MaxRequestWorkers 150→25, workers dropped from 49 to 6.
2026-03-08 21:37:39 +00:00
Viktor Barzin
e2473fe8a6
tune Nextcloud Apache/PHP to fix constant crash-looping (50 restarts/6d)
...
Root cause: Apache prefork with MaxRequestWorkers 150 (each worker
~220MB RSS) on a SQLite DB causes worker exhaustion + lock contention →
Apache hangs → the aggressive liveness probe (3 failures × 10s) kills
the container.
Fixes:
- Apache: MaxRequestWorkers 150→25, MaxConnectionsPerChild 0→200,
StartServers 5→3 (via ConfigMap mount over mpm_prefork.conf)
- PHP: max_execution_time 0→300s, max_input_time 300s (prevent zombie workers)
- Liveness probe: period 10s→30s, failureThreshold 3→6, timeout 5s→10s
(180s tolerance vs 30s before)
- Readiness probe: period 10s→30s, timeout 5s→10s
2026-03-08 21:33:27 +00:00
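The root cause described in the commit above comes down to a worst-case memory budget: every prefork worker can hold its full RSS at once. A rough sketch of that calculation, assuming the ~220MB per-worker figure from the message and, as an illustrative assumption, the 6Gi container limit set in an earlier commit (treated here as 6144MB, ignoring the MB/MiB distinction):

```python
def worst_case_rss_mb(max_request_workers: int, rss_per_worker_mb: int) -> int:
    """Upper bound on Apache prefork memory if every worker is busy."""
    return max_request_workers * rss_per_worker_mb


def fits_in_limit(max_request_workers: int, rss_per_worker_mb: int,
                  limit_mb: int) -> bool:
    """True if the worst-case worker footprint stays within the limit."""
    return worst_case_rss_mb(max_request_workers, rss_per_worker_mb) <= limit_mb


RSS_PER_WORKER_MB = 220   # from the commit message (approximate)
LIMIT_MB = 6 * 1024       # assumption: the 6Gi limit, approximated in MB

before = worst_case_rss_mb(150, RSS_PER_WORKER_MB)
after = worst_case_rss_mb(25, RSS_PER_WORKER_MB)

print(f"MaxRequestWorkers=150: {before} MB, fits: "
      f"{fits_in_limit(150, RSS_PER_WORKER_MB, LIMIT_MB)}")
print(f"MaxRequestWorkers=25:  {after} MB, fits: "
      f"{fits_in_limit(25, RSS_PER_WORKER_MB, LIMIT_MB)}")
```

The point of the sketch is only the order of magnitude: 150 workers could in principle demand several times the container limit, while 25 workers stay below it.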
Viktor Barzin
0e324df545
[ci skip] complete NFS CSI migration: complex stacks + platform modules
...
Migrate remaining multi-volume stacks and all platform modules from
inline NFS volumes to CSI-backed PV/PVC with nfs-truenas StorageClass
(soft,timeo=30,retrans=3 mount options).
Complex stacks: openclaw (4 vols), immich (8 vols), frigate (2 vols),
nextcloud (2 vols + old PV replaced), rybbit (1 vol)
Remaining stacks: affine, ebook2audiobook, f1-stream, osm_routing,
real-estate-crawler
Platform modules: monitoring (prometheus, loki, alertmanager PVs
converted from native NFS to CSI), redis, dbaas, technitium,
headscale, vaultwarden, uptime-kuma, mailserver, infra-maintenance
2026-03-02 01:24:07 +00:00
Viktor Barzin
beec5acbc7
[ci skip] nextcloud: bump CPU limit to 16, add custom ResourceQuota
...
CPU was pegged at 2000m/2000m (100% throttled). Add the custom-quota
opt-out label and a ResourceQuota allowing 32 CPUs of limits to
accommodate the 16 CPU container limit plus sidecar defaults.
2026-03-01 17:41:18 +00:00
Viktor Barzin
79af6fff47
[ci skip] fix MySQL cluster RBAC, Kyverno policy bugs, Nextcloud memory
...
- dbaas: add mysql-sidecar-extra ClusterRole for namespaces/CRD
list/watch needed by kopf framework in sidecar containers
- kyverno: restrict inject-priority-class-from-tier to CREATE
operations only (was blocking pod patches with immutable spec error)
- kyverno: add resource-governance/custom-limitrange label opt-out
to LimitRange generation policy (mirrors existing custom-quota)
- nextcloud: bump memory limit 4Gi -> 6Gi, add custom LimitRange
with 8Gi max, opt out of Kyverno-managed LimitRange
2026-03-01 17:16:03 +00:00
Viktor Barzin
e22d81275b
[ci skip] fix nextcloud: increase memory to 4Gi, extend startup probe
...
- Memory limit: 2Gi → 4Gi (VPA target is 2.8Gi, was OOMKilling)
- Memory request: 512Mi → 1Gi
- Startup probe: 30s delay, 10s timeout, 60 failures (10min total)
Previous 5min window was too short for NFS-backed SQLite init
2026-02-28 23:32:28 +00:00
Viktor Barzin
419581727f
[ci skip] fix nextcloud OOMKilled: increase memory limit to 2Gi
2026-02-28 20:21:00 +00:00
Viktor Barzin
de4dffbab7
[ci skip] nextcloud: increase resource limits to prevent OOM crash loop
...
The default LimitRange (256Mi) was too low: the pod was using 227Mi/256Mi
and was OOM-killed under sync client load, causing 500s and a blank web UI.
2026-02-28 16:26:19 +00:00
Viktor Barzin
2d919c4d34
[ci skip] Infrastructure hardening: security, monitoring, reliability, maintainability
...
Phase 1 - Critical Security:
- Netbox: move hardcoded DB/superuser passwords to variables
- MeshCentral: disable public registration, add Authentik auth
- Traefik: disable insecure API dashboard (api.insecure=false)
- Traefik: configure forwarded headers with Cloudflare trusted IPs
Phase 2 - Security Hardening:
- Add security headers middleware (HSTS, X-Frame-Options, nosniff, etc.)
- Add Kyverno pod security policies in audit mode (privileged, host
namespaces, SYS_ADMIN, trusted registries)
- Tighten rate limiting (avg=10, burst=50)
- Add Authentik protection to grampsweb
Phase 3 - Monitoring & Alerting:
- Add critical service alerts (PostgreSQL, MySQL, Redis, Headscale,
Authentik, Loki)
- Increase Loki retention from 7 to 30 days (720h)
- Add predictive PV filling alert (predict_linear)
- Re-enable Hackmd and Privatebin down alerts
Phase 4 - Reliability:
- Add resource requests/limits to Redis, DBaaS, Technitium, Headscale,
Vaultwarden, Uptime Kuma
- Increase Alloy DaemonSet memory to 512Mi/1Gi
Phase 6 - Maintainability:
- Extract duplicated tiers locals to terragrunt.hcl generate block
(removed from 67 stacks)
- Replace hardcoded NFS IP 10.0.10.15 with var.nfs_server (114
instances across 63 files)
- Replace hardcoded Redis/PostgreSQL/MySQL/Ollama/mail host references
with variables across ~35 stacks
- Migrate xray raw ingress resources to ingress_factory modules
2026-02-23 22:05:28 +00:00
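The predictive PV filling alert mentioned in Phase 3 relies on linear extrapolation of a metric. The following is a simplified sketch of that idea, not Prometheus's implementation and not the alert rule from this repository: it fits a least-squares line through (timestamp, value) samples and extrapolates from the last sample, whereas Prometheus extrapolates from the evaluation time. The sample data is invented for illustration.

```python
from typing import List, Tuple


def predict_linear(samples: List[Tuple[float, float]],
                   horizon_seconds: float) -> float:
    """Least-squares fit over (timestamp, value) samples, then extrapolate
    horizon_seconds past the last sample's timestamp."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one timestamp")
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    slope = cov / var_t
    intercept = mean_v - slope * mean_t
    return intercept + slope * (samples[-1][0] + horizon_seconds)


# Hypothetical data: free space in GiB, shrinking by 1 GiB per hour.
free_gib = [(0.0, 10.0), (3600.0, 9.0), (7200.0, 8.0)]

in_4h = predict_linear(free_gib, 4 * 3600)
in_24h = predict_linear(free_gib, 24 * 3600)

# An alert of this kind would fire when predicted free space drops below zero.
print("fills within 4h: ", in_4h < 0)
print("fills within 24h:", in_24h < 0)
```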
Viktor Barzin
b692eb0c34
[ci skip] Flatten module wrappers into stack roots
...
Remove the module "xxx" { source = "./module" } indirection layer
from all 66 service stacks. Resources are now defined directly in
each stack's main.tf instead of through a wrapper module.
- Merge module/main.tf contents into stack main.tf
- Apply variable replacements (var.tier -> local.tiers.X, renamed vars)
- Fix shared module paths (one fewer ../ at each level)
- Move extra files/dirs (factory/, chart_values, subdirs) to stack root
- Update state files to strip module.<name>. prefix
- Update CLAUDE.md to reflect flat structure
Verified: terragrunt plan shows 0 add, 0 destroy across all stacks.
2026-02-22 15:13:55 +00:00
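The state update in the commit above amounts to renaming every resource address from module.<name>.<resource> to <resource>. A minimal sketch of that address mapping; the module name and resource addresses below are hypothetical, and how the renames were actually applied to the state files is not stated in the commit:

```python
from typing import Dict, List


def strip_module_prefix(address: str, module_name: str) -> str:
    """Drop a leading 'module.<module_name>.' from a resource address.
    Addresses without that prefix are returned unchanged."""
    prefix = f"module.{module_name}."
    if address.startswith(prefix):
        return address[len(prefix):]
    return address


def plan_renames(addresses: List[str], module_name: str) -> Dict[str, str]:
    """Map each address that would change to its flattened form."""
    renames = {}
    for address in addresses:
        flattened = strip_module_prefix(address, module_name)
        if flattened != address:
            renames[address] = flattened
    return renames


# Hypothetical addresses for illustration only.
addresses = [
    "module.nextcloud.kubernetes_namespace.nextcloud",
    "module.nextcloud.helm_release.nextcloud",
    "kubernetes_secret.tls",
]
renames = plan_renames(addresses, "nextcloud")
for old, new in renames.items():
    print(f"{old} -> {new}")
```

Because only the address changes and not the underlying resource, a correct rename of this kind is consistent with the "0 add, 0 destroy" plan reported in the message.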