Viktor Barzin
58e93ae4f0
Add HAProxy for Redis HA master-only routing
...
The Redis K8s Service was load-balancing across both master and replica
nodes, causing READONLY errors when clients hit the replica. This broke
Nextcloud (DAV 500s, liveness probe timeouts, crash loops) and
potentially other services.
Replace the direct Service with HAProxy (2 replicas) that health-checks
each Redis node via INFO replication and only routes to role:master.
On Sentinel failover, HAProxy detects the new master within ~9 seconds.
2026-03-18 08:03:58 +00:00
Viktor Barzin
63fb6201c8
[ci skip] migrate Redis, Prometheus, Loki storage to iSCSI
...
- Redis: local-path → iscsi-truenas (master + replica persistence)
- Prometheus: NFS PV+PVC → dynamic iSCSI PVC (prometheus-data)
- Loki: NFS PV → dynamic iSCSI via storageClass in Helm values
- Deleted 2 orphaned Released iSCSI PVs (31Gi freed)
2026-03-06 20:50:55 +00:00
Viktor Barzin
0e324df545
[ci skip] complete NFS CSI migration: complex stacks + platform modules
...
Migrate remaining multi-volume stacks and all platform modules from
inline NFS volumes to CSI-backed PV/PVC with nfs-truenas StorageClass
(soft,timeo=30,retrans=3 mount options).
Complex stacks: openclaw (4 vols), immich (8 vols), frigate (2 vols),
nextcloud (2 vols + old PV replaced), rybbit (1 vol)
Remaining stacks: affine, ebook2audiobook, f1-stream, osm_routing,
real-estate-crawler
Platform modules: monitoring (prometheus, loki, alertmanager PVs
converted from native NFS to CSI), redis, dbaas, technitium,
headscale, vaultwarden, uptime-kuma, mailserver, infra-maintenance
2026-03-02 01:24:07 +00:00
Viktor Barzin
ca648ff9bb
[ci skip] right-size all pod resources based on VPA + live metrics audit
...
Full cluster resource audit: cross-referenced Goldilocks VPA recommendations,
live kubectl top metrics, and Terraform definitions for 100+ containers.
Critical fixes:
- dashy: CPU throttled at 98% (490m/500m) → 2 CPU limit
- stirling-pdf: CPU throttled at 99.7% (299m/300m) → 2 CPU limit
- traefik auth-proxy/bot-block-proxy: mem limit 32Mi → 128Mi
Added explicit resources to ~40 containers that had none:
- audiobookshelf, changedetection, cyberchef, dawarich, diun, echo,
excalidraw, freshrss, hackmd, isponsorblocktv, linkwarden, n8n,
navidrome, ntfy, owntracks, privatebin, send, shadowsocks, tandoor,
tor-proxy, wealthfolio, networking-toolbox, rybbit, mailserver,
cloudflared, pgadmin, phpmyadmin, crowdsec-web, xray, wireguard,
k8s-portal, tuya-bridge, ollama-ui, whisper, piper, immich-server,
immich-postgresql, osrm-foot
GPU containers: added CPU/mem alongside GPU limits:
- ollama: removed CPU/mem limits (models vary in size), keep GPU only
- frigate: req 500m/2Gi, lim 4/8Gi + GPU
- immich-ml: req 100m/1Gi, lim 2/4Gi + GPU
Right-sized ~25 over-provisioned containers:
- kms-web-page: 500m/512Mi → 50m/64Mi (was using 0m/10Mi)
- onlyoffice: CPU 8 → 2 (VPA upper 45m)
- realestate-crawler-api: CPU 2000m → 250m
- blog/travel-blog/webhook-handler: 500m → 100m
- coturn/health/plotting-book: reduced to match actual usage
Conservative methodology: limits = max(VPA upper * 2, live usage * 2)
2026-03-01 19:18:50 +00:00
Viktor Barzin
a8da2e3790
[ci skip] redis: pin service to master pod to fix read-only errors
...
The Bitnami Redis Sentinel chart's service selects all nodes (master + replicas).
Clients using plain redis:// URLs (paperless-ngx, etc.) randomly hit read-only
replicas, causing write failures. Pin the service to redis-node-0 (master).
2026-03-01 17:13:25 +00:00
Viktor Barzin
5685a84c9f
[ci skip] tune resource limits and requests across 10 services
...
Critical OOM fixes (add/increase limits):
- netbox: add 512Mi limit (was at 98.8% of Kyverno default 256Mi)
- speedtest: add 512Mi limit (was at 80.9%)
- meshcentral: add 384Mi limit (was at 72.7%)
- ytdlp: uncomment resources, set 512Mi limit (was at 74.6%)
Over-provisioned (reduce limits):
- dashy: 2Gi → 512Mi (was using 135Mi)
- redis master: 2Gi → 256Mi (was using 14Mi)
- redis replica: 1Gi → 256Mi (was using 12Mi)
- resume printer: 2Gi → 512Mi (was using 108Mi)
- resume app: 1Gi → 384Mi (was using 125Mi)
- openclaw: 4Gi → 1Gi (was using 372Mi)
Under-provisioned requests (increase):
- authentik server: 256Mi → 512Mi request (actual ~560Mi)
- authentik worker: 256Mi → 384Mi request (actual ~400Mi)
New explicit resources (previously Kyverno defaults):
- forgejo: add 512Mi limit, 64Mi request
2026-02-28 21:59:08 +00:00
Viktor Barzin
0cce9d350a
[ci skip] Redis: upgrade to Bitnami Helm chart with Sentinel HA
...
- Replace manual redis:7-alpine deployment with Bitnami Redis Helm chart v25.3.2
- Architecture: replication with Sentinel (1 master + 1 replica + sentinels)
- Automatic failover via Sentinel (quorum=2, masterSet=mymaster)
- Service 'redis.redis' always points at current master (transparent to clients)
- 120 clients connected immediately after deployment
- Sentinel confirmed tracking redis-node-0 as master
- Local-path PVCs for persistence (2Gi per node)
- Auth disabled (matches previous setup)
- Hourly RDB backup CronJob to NFS preserved
- OCI chart pulled via pull-through cache (10.0.20.10:5000)
2026-02-28 19:59:58 +00:00
Viktor Barzin
fdd4e3e467
[ci skip] Phase 2: migrate Redis from NFS to local disk
...
- Switch from redis/redis-stack:latest to redis:7-alpine
(modules were completely unused — zero module commands in stats)
- Move data from NFS (/mnt/main/redis) to local-path PVC
(RDB saves: 39s on NFS → <1s on local disk)
- Start fresh (old RDB had redis-stack module data incompatible with plain redis;
all Redis data is transient — queues and caches rebuild automatically)
- Add hourly redis-backup CronJob: redis-cli --rdb to NFS for backup pipeline
- Remove RedisInsight UI ingress (port 8001, only in redis-stack)
- Add redis-backup to NFS exports
- 110 clients reconnected immediately after switchover
- Memory savings: ~100MB from dropping unused modules
2026-02-28 19:44:08 +00:00
Viktor Barzin
85f88bf167
[ci skip] platform: add ndots=2 dns_config to all deployment pod specs
...
Prevents Terraform from reverting the Kyverno inject-ndots mutation
on every apply. 23 pod specs across 19 platform module files.
2026-02-23 22:43:05 +00:00
Viktor Barzin
e982a8ad81
[ci skip] fix redis OOMKilled: increase memory limits to 2Gi
...
Redis was CrashLoopBackOff due to OOMKilled - 512Mi limit was
insufficient for 650MB RDB dataset plus redis-stack modules.
2026-02-23 22:37:56 +00:00
Viktor Barzin
2d919c4d34
[ci skip] Infrastructure hardening: security, monitoring, reliability, maintainability
...
Phase 1 - Critical Security:
- Netbox: move hardcoded DB/superuser passwords to variables
- MeshCentral: disable public registration, add Authentik auth
- Traefik: disable insecure API dashboard (api.insecure=false)
- Traefik: configure forwarded headers with Cloudflare trusted IPs
Phase 2 - Security Hardening:
- Add security headers middleware (HSTS, X-Frame-Options, nosniff, etc.)
- Add Kyverno pod security policies in audit mode (privileged, host
namespaces, SYS_ADMIN, trusted registries)
- Tighten rate limiting (avg=10, burst=50)
- Add Authentik protection to grampsweb
Phase 3 - Monitoring & Alerting:
- Add critical service alerts (PostgreSQL, MySQL, Redis, Headscale,
Authentik, Loki)
- Increase Loki retention from 7 to 30 days (720h)
- Add predictive PV filling alert (predict_linear)
- Re-enable Hackmd and Privatebin down alerts
Phase 4 - Reliability:
- Add resource requests/limits to Redis, DBaaS, Technitium, Headscale,
Vaultwarden, Uptime Kuma
- Increase Alloy DaemonSet memory to 512Mi/1Gi
Phase 6 - Maintainability:
- Extract duplicated tiers locals to terragrunt.hcl generate block
(removed from 67 stacks)
- Replace hardcoded NFS IP 10.0.10.15 with var.nfs_server (114
instances across 63 files)
- Replace hardcoded Redis/PostgreSQL/MySQL/Ollama/mail host references
with variables across ~35 stacks
- Migrate xray raw ingress resources to ingress_factory modules
2026-02-23 22:05:28 +00:00
Viktor Barzin
e225e81ebf
[ci skip] Move Terraform modules into stack directories
...
Move all 88 service modules (66 individual + 22 platform) from
modules/kubernetes/<service>/ into their corresponding stack directories:
- Service stacks: stacks/<service>/module/
- Platform stack: stacks/platform/modules/<service>/
This collocates module source code with its Terragrunt definition.
Only shared utility modules remain in modules/kubernetes/:
ingress_factory, setup_tls_secret, dockerhub_secret, oauth-proxy.
All cross-references to shared modules updated to use correct
relative paths. Verified with terragrunt run --all -- plan:
0 adds, 0 destroys across all 68 stacks.
2026-02-22 14:38:14 +00:00