Commit graph

8 commits

Author SHA1 Message Date
Viktor Barzin
0e324df545
[ci skip] complete NFS CSI migration: complex stacks + platform modules
Migrate remaining multi-volume stacks and all platform modules from
inline NFS volumes to CSI-backed PV/PVC with nfs-truenas StorageClass
(soft,timeo=30,retrans=3 mount options).

Complex stacks: openclaw (4 vols), immich (8 vols), frigate (2 vols),
nextcloud (2 vols + old PV replaced), rybbit (1 vol)

Remaining stacks: affine, ebook2audiobook, f1-stream, osm_routing,
real-estate-crawler

Platform modules: monitoring (prometheus, loki, alertmanager PVs
converted from native NFS to CSI), redis, dbaas, technitium,
headscale, vaultwarden, uptime-kuma, mailserver, infra-maintenance
2026-03-02 01:24:07 +00:00
Viktor Barzin
beec5acbc7
[ci skip] nextcloud: bump CPU limit to 16, add custom ResourceQuota
CPU was pegged at 2000m/2000m (100% throttled). Add custom-quota
opt-out label and ResourceQuota allowing 32 CPU limits to accommodate
the 16 CPU container limit plus sidecar defaults.
2026-03-01 17:41:18 +00:00
Viktor Barzin
79af6fff47
[ci skip] fix MySQL cluster RBAC, Kyverno policy bugs, Nextcloud memory
- dbaas: add mysql-sidecar-extra ClusterRole for namespaces/CRD
  list/watch needed by kopf framework in sidecar containers
- kyverno: restrict inject-priority-class-from-tier to CREATE
  operations only (was blocking pod patches with immutable spec error)
- kyverno: add resource-governance/custom-limitrange label opt-out
  to LimitRange generation policy (mirrors existing custom-quota)
- nextcloud: bump memory limit 4Gi -> 6Gi, add custom LimitRange
  with 8Gi max, opt out of Kyverno-managed LimitRange
2026-03-01 17:16:03 +00:00
Viktor Barzin
e22d81275b
[ci skip] fix nextcloud: increase memory to 4Gi, extend startup probe
- Memory limit: 2Gi → 4Gi (VPA target is 2.8Gi, was OOMKilling)
- Memory request: 512Mi → 1Gi
- Startup probe: 30s delay, 10s timeout, 60 failures (10min total)
  Previous 5min window was too short for NFS-backed SQLite init
2026-02-28 23:32:28 +00:00
Viktor Barzin
419581727f
[ci skip] fix nextcloud OOMKilled: increase memory limit to 2Gi 2026-02-28 20:21:00 +00:00
Viktor Barzin
de4dffbab7
[ci skip] nextcloud: increase resource limits to prevent OOM crash loop
Default LimitRange (256Mi) was too low — pod was using 227Mi/256Mi and
getting OOM killed under sync client load, causing 500s and blank web UI.
2026-02-28 16:26:19 +00:00
Viktor Barzin
2d919c4d34
[ci skip] Infrastructure hardening: security, monitoring, reliability, maintainability
Phase 1 - Critical Security:
- Netbox: move hardcoded DB/superuser passwords to variables
- MeshCentral: disable public registration, add Authentik auth
- Traefik: disable insecure API dashboard (api.insecure=false)
- Traefik: configure forwarded headers with Cloudflare trusted IPs

Phase 2 - Security Hardening:
- Add security headers middleware (HSTS, X-Frame-Options, nosniff, etc.)
- Add Kyverno pod security policies in audit mode (privileged, host
  namespaces, SYS_ADMIN, trusted registries)
- Tighten rate limiting (avg=10, burst=50)
- Add Authentik protection to grampsweb

Phase 3 - Monitoring & Alerting:
- Add critical service alerts (PostgreSQL, MySQL, Redis, Headscale,
  Authentik, Loki)
- Increase Loki retention from 7 to 30 days (720h)
- Add predictive PV filling alert (predict_linear)
- Re-enable Hackmd and Privatebin down alerts

Phase 4 - Reliability:
- Add resource requests/limits to Redis, DBaaS, Technitium, Headscale,
  Vaultwarden, Uptime Kuma
- Increase Alloy DaemonSet memory to 512Mi/1Gi

Phase 6 - Maintainability:
- Extract duplicated tiers locals to terragrunt.hcl generate block
  (removed from 67 stacks)
- Replace hardcoded NFS IP 10.0.10.15 with var.nfs_server (114
  instances across 63 files)
- Replace hardcoded Redis/PostgreSQL/MySQL/Ollama/mail host references
  with variables across ~35 stacks
- Migrate xray raw ingress resources to ingress_factory modules
2026-02-23 22:05:28 +00:00
Viktor Barzin
b692eb0c34
[ci skip] Flatten module wrappers into stack roots
Remove the module "xxx" { source = "./module" } indirection layer
from all 66 service stacks. Resources are now defined directly in
each stack's main.tf instead of through a wrapper module.

- Merge module/main.tf contents into stack main.tf
- Apply variable replacements (var.tier -> local.tiers.X, renamed vars)
- Fix shared module paths (one fewer ../ at each level)
- Move extra files/dirs (factory/, chart_values, subdirs) to stack root
- Update state files to strip module.<name>. prefix
- Update CLAUDE.md to reflect flat structure

Verified: terragrunt plan shows 0 add, 0 destroy across all stacks.
2026-02-22 15:13:55 +00:00
Renamed from stacks/nextcloud/module/chart_values.yaml (Browse further)