Commit graph

16 commits

Author SHA1 Message Date
Viktor Barzin
c034adab5f mitigate cluster instability during terraform applies
- Recreate strategy for heavy single-replica deployments (onlyoffice, stirling-pdf)
- Reduce maxSurge on multi-replica deployments (traefik, authentik, grafana, kyverno)
  to prevent memory request surge overwhelming scheduler
- Weekly etcd defrag CronJob (Sunday 3 AM) to prevent fragmentation buildup
- Disable Kyverno policy reports (ephemeral report cleanup)
- Cloud-init: journald persistence + 4Gi swap for worker nodes
- Kubelet: LimitedSwap behavior for memory pressure relief
2026-03-15 17:23:39 +00:00
Viktor Barzin
194281e527 right-size cluster memory: reduce overprovisioned, fix under-provisioned services
Phase 1 - Quick wins (~4.5 Gi saved):
- democratic-csi: add explicit sidecar resources (64-80Mi vs 256Mi LimitRange default)
- caretta: 768Mi → 600Mi (VPA upper 485Mi)
- immich-ml: 4Gi → 3584Mi (VPA upper 2.95Gi, GPU margin)
- onlyoffice: 3Gi → 2304Mi (VPA upper 1.82Gi)

Phase 2 - Safety fixes (prevent OOMKills):
- frigate: 2Gi/8Gi → 5Gi/10Gi (VPA upper 7.7Gi, was 4% headroom)
- openclaw: 1280Mi req → 2Gi req=limit (documented 2Gi requirement)

Phase 3 - Additional right-sizing:
- authentik workers: 1Gi → 896Mi x3 (VPA upper 722Mi)
- shlink: 512Mi/768Mi → 960Mi req=limit (VPA upper 780Mi, safety increase)

Phase 4 - Burstable QoS for lower tiers:
- tier-3-edge: 128Mi/128Mi → 96Mi req / 192Mi limit
- tier-4-aux: 128Mi/128Mi → 64Mi req / 256Mi limit

Phase 5 - Monitoring:
- Add ClusterMemoryRequestsHigh alert (>85% allocatable, 15m)
- Add ContainerNearOOM alert (>85% limit, 30m)
- Add PodUnschedulable alert (5m, critical)

Cluster: 92.7% → 90.8% memory requests. Stirling-pdf now schedulable.
2026-03-15 15:30:18 +00:00
Viktor Barzin
23019da8e5 equalize memory req=lim across 70+ containers using Prometheus 7d max data
After node2 OOM incident, right-size memory across the cluster by setting
requests=limits based on max_over_time(container_memory_working_set_bytes[7d])
with 1.3x headroom. Eliminates ~37Gi overcommit gap.

Categories:
- Safe equalization (50 containers): set req=lim where max7d well within target
- Limit increases (8 containers): raise limits for services spiking above current
- No Prometheus data (12 containers): conservatively set lim=req
- Exception: nextcloud keeps req=256Mi/lim=8Gi due to Apache memory spikes

Also increased dbaas namespace quota from 12Gi to 16Gi to accommodate mysql
4Gi limits across 3 replicas.
2026-03-14 21:46:49 +00:00
Viktor Barzin
2102cb2d73 Right-size CPU requests cluster-wide and remove missed CPU limits
Increase requests for under-requested pods (dashy 50m→250m, frigate 500m→1500m,
clickhouse 100m→500m, otp 100m→300m, linkwarden 25m→50m, authentik worker 50m→100m).

Reduce requests for over-requested pods (crowdsec agent/lapi 500m→25m each,
prometheus 200m→100m, dbaas mysql 1800m→100m, pg-cluster 250m→50m,
shlink-web 250m→10m, gpu-pod-exporter 50m→10m, stirling-pdf 100m→25m,
technitium 100m→25m, celery 50m→15m). Reduce crowdsec quota from 8→1 CPU.

Remove missed CPU limits in prometheus (cpu: "2") and dbaas (cpu: "3600m") tpl files.
2026-03-14 09:22:24 +00:00
Viktor Barzin
b00f810d3d Remove all CPU limits cluster-wide to eliminate CFS throttling
CPU limits cause CFS throttling even when nodes have idle capacity.
Move to a request-only CPU model: keep CPU requests for scheduling
fairness but remove all CPU limits. Memory limits stay (incompressible).

Changes across 108 files:
- Kyverno LimitRange policy: remove cpu from default/max in all 6 tiers
- Kyverno ResourceQuota policy: remove limits.cpu from all 5 tiers
- Custom ResourceQuotas: remove limits.cpu from 8 namespace quotas
- Custom LimitRanges: remove cpu from default/max (nextcloud, onlyoffice)
- RBAC module: remove cpu_limits variable and quota reference
- Freedify factory: remove cpu_limit variable and limits reference
- 86 deployment files: remove cpu from all limits blocks
- 6 Helm values files: remove cpu under limits sections
2026-03-14 08:51:45 +00:00
Viktor Barzin
d352d6e7f8 resource quota review: fix OOM risks, close quota gaps, add HA protections
Phase 1 - OOM fixes:
- dashy: increase memory limit 512Mi→1Gi (was at 99% utilization)
- caretta DaemonSet: set explicit resources 300Mi/512Mi (was at 85-98%)
- mysql-operator: add Helm resource values 256Mi/512Mi, create namespace
  with tier label (was at 92% of LimitRange default)
- prowlarr, flaresolverr, annas-archive-stacks: add explicit resources
  (outgrowing 256Mi LimitRange defaults)
- real-estate-crawler celery: add resources 512Mi/3Gi (608Mi actual, no
  explicit resources)

Phase 2 - Close quota gaps:
- nvidia, real-estate-crawler, trading-bot: remove custom-quota=true
  labels so Kyverno generates tier-appropriate quotas
- descheduler: add tier=1-cluster label for proper classification

Phase 3 - Reduce excessive quotas:
- monitoring: limits.memory 240Gi→64Gi, limits.cpu 120→64
- woodpecker: limits.memory 128Gi→32Gi, limits.cpu 64→16
- GPU tier default: limits.memory 96Gi→32Gi, limits.cpu 48→16

Phase 4 - Kubelet protection:
- Add cpu: 200m to systemReserved and kubeReserved in kubelet template

Phase 5 - HA improvements:
- cloudflared: add topology spread (ScheduleAnyway) + PDB (maxUnavailable:1)
- grafana: add topology spread + PDB via Helm values
- crowdsec LAPI: add topology spread + PDB via Helm values
- authentik server: add topology spread via Helm values
- authentik worker: add topology spread + PDB via Helm values
2026-03-08 18:17:46 +00:00
Viktor Barzin
a9daf50142 [ci skip] add Homepage widget credentials for Authentik, Shlink, Home Assistant
Wire homepage_credentials tokens through platform stack to enable
live widgets for Authentik, Shlink (URL shortener), and Home Assistant
London. Update SOPS with new credential entries.
2026-03-07 20:39:54 +00:00
Viktor Barzin
6bd3970579 [ci skip] add Homepage gethomepage.dev annotations to all services
Add Kubernetes ingress annotations for Homepage auto-discovery across
~88 services organized into 11 groups. Enable serviceAccount for RBAC,
configure group layouts, and add Grafana/Frigate/Speedtest widgets.
2026-03-07 20:39:54 +00:00
Viktor Barzin
78c0956ab5 [ci skip] add Authentik PDB (minAvailable=2) 2026-03-01 14:24:47 +00:00
Viktor Barzin
f64c979ba5 [ci skip] tune resource limits and requests across 10 services
Critical OOM fixes (add/increase limits):
- netbox: add 512Mi limit (was at 98.8% of Kyverno default 256Mi)
- speedtest: add 512Mi limit (was at 80.9%)
- meshcentral: add 384Mi limit (was at 72.7%)
- ytdlp: uncomment resources, set 512Mi limit (was at 74.6%)

Over-provisioned (reduce limits):
- dashy: 2Gi → 512Mi (was using 135Mi)
- redis master: 2Gi → 256Mi (was using 14Mi)
- redis replica: 1Gi → 256Mi (was using 12Mi)
- resume printer: 2Gi → 512Mi (was using 108Mi)
- resume app: 1Gi → 384Mi (was using 125Mi)
- openclaw: 4Gi → 1Gi (was using 372Mi)

Under-provisioned requests (increase):
- authentik server: 256Mi → 512Mi request (actual ~560Mi)
- authentik worker: 256Mi → 384Mi request (actual ~400Mi)

New explicit resources (previously Kyverno defaults):
- forgejo: add 512Mi limit, 64Mi request
2026-02-28 21:59:08 +00:00
Viktor Barzin
eb32190461 [ci skip] fix OOM crashes: add resource limits for osrm-bicycle, aiostreams, listenarr, authentik
- osrm-bicycle: 1Gi limit (loads 403MB routing graph)
- aiostreams: 768Mi limit (loads 44K anime entries)
- listenarr: 1Gi limit (.NET + Playwright/Chromium)
- authentik server: 1Gi limit, worker: 1Gi limit (Django + gunicorn)
- servarr: pass nfs_server variable to all submodules
2026-02-28 17:03:33 +00:00
Viktor Barzin
0a1d53b6dd [ci skip] platform: add ndots=2 dns_config to all deployment pod specs
Prevents Terraform from reverting the Kyverno inject-ndots mutation
on every apply. 23 pod specs across 19 platform module files.
2026-02-23 22:43:05 +00:00
Viktor Barzin
89a6e08245 [ci skip] Infrastructure hardening: security, monitoring, reliability, maintainability
Phase 1 - Critical Security:
- Netbox: move hardcoded DB/superuser passwords to variables
- MeshCentral: disable public registration, add Authentik auth
- Traefik: disable insecure API dashboard (api.insecure=false)
- Traefik: configure forwarded headers with Cloudflare trusted IPs

Phase 2 - Security Hardening:
- Add security headers middleware (HSTS, X-Frame-Options, nosniff, etc.)
- Add Kyverno pod security policies in audit mode (privileged, host
  namespaces, SYS_ADMIN, trusted registries)
- Tighten rate limiting (avg=10, burst=50)
- Add Authentik protection to grampsweb

Phase 3 - Monitoring & Alerting:
- Add critical service alerts (PostgreSQL, MySQL, Redis, Headscale,
  Authentik, Loki)
- Increase Loki retention from 7 to 30 days (720h)
- Add predictive PV filling alert (predict_linear)
- Re-enable Hackmd and Privatebin down alerts

Phase 4 - Reliability:
- Add resource requests/limits to Redis, DBaaS, Technitium, Headscale,
  Vaultwarden, Uptime Kuma
- Increase Alloy DaemonSet memory to 512Mi/1Gi

Phase 6 - Maintainability:
- Extract duplicated tiers locals to terragrunt.hcl generate block
  (removed from 67 stacks)
- Replace hardcoded NFS IP 10.0.10.15 with var.nfs_server (114
  instances across 63 files)
- Replace hardcoded Redis/PostgreSQL/MySQL/Ollama/mail host references
  with variables across ~35 stacks
- Migrate xray raw ingress resources to ingress_factory modules
2026-02-23 22:05:28 +00:00
Viktor Barzin
5501b5cfbf [ci skip] Increase authentik ResourceQuota limits
Authentik is a critical auth service that was at 83% CPU/memory
quota utilization. Double all limits to prevent throttling.
2026-02-22 17:28:41 +00:00
Viktor Barzin
c7c7047f1c [ci skip] Flatten module wrappers into stack roots
Remove the module "xxx" { source = "./module" } indirection layer
from all 66 service stacks. Resources are now defined directly in
each stack's main.tf instead of through a wrapper module.

- Merge module/main.tf contents into stack main.tf
- Apply variable replacements (var.tier -> local.tiers.X, renamed vars)
- Fix shared module paths (one fewer ../ at each level)
- Move extra files/dirs (factory/, chart_values, subdirs) to stack root
- Update state files to strip module.<name>. prefix
- Update CLAUDE.md to reflect flat structure

Verified: terragrunt plan shows 0 add, 0 destroy across all stacks.
2026-02-22 15:13:55 +00:00
Viktor Barzin
e6420c7b36 [ci skip] Move Terraform modules into stack directories
Move all 88 service modules (66 individual + 22 platform) from
modules/kubernetes/<service>/ into their corresponding stack directories:

- Service stacks: stacks/<service>/module/
- Platform stack: stacks/platform/modules/<service>/

This collocates module source code with its Terragrunt definition.
Only shared utility modules remain in modules/kubernetes/:
ingress_factory, setup_tls_secret, dockerhub_secret, oauth-proxy.

All cross-references to shared modules updated to use correct
relative paths. Verified with terragrunt run --all -- plan:
0 adds, 0 destroys across all 68 stacks.
2026-02-22 14:38:14 +00:00