Commit graph

9 commits

Author SHA1 Message Date
Viktor Barzin
a04335d0f3 right-size 14 services and scale down GPU-heavy workloads [ci skip]
Memory right-sizing based on VPA upperBound analysis:
- Increases: stirling-pdf 1200→1536Mi, claude-memory 64→128Mi,
  dawarich 512→768Mi, kyverno-cleanup 128→192Mi, linkwarden 768→1Gi,
  navidrome 64→128Mi, listenarr 768→896Mi, privatebin 64→128Mi,
  ntfy 64→128Mi, health 128→256Mi, dbaas quota 16→20Gi,
  mysql-operator 384→512Mi
- Decreases: rybbit 768→384Mi, nvidia-exporter added explicit 192Mi,
  dcgm-exporter 2560→1536Mi
- Scale to 0: ebook2audiobook/audiblez-web, whisper (GPU node pressure)

Net effect: -496Mi cluster-wide, 13 ContainerNearOOM alerts resolved,
all ResourceQuota pressures cleared, GPU health green.
2026-03-15 23:00:49 +00:00
Viktor Barzin
c034adab5f mitigate cluster instability during terraform applies
- Recreate strategy for heavy single-replica deployments (onlyoffice, stirling-pdf)
- Reduce maxSurge on multi-replica deployments (traefik, authentik, grafana, kyverno)
  to prevent memory request surge overwhelming scheduler
- Weekly etcd defrag CronJob (Sunday 3 AM) to prevent fragmentation buildup
- Disable Kyverno policy reports (ephemeral report cleanup)
- Cloud-init: journald persistence + 4Gi swap for worker nodes
- Kubelet: LimitedSwap behavior for memory pressure relief
2026-03-15 17:23:39 +00:00
Viktor Barzin
52bce849d7 fix cluster health: resolve 21/23 failures from healthcheck
- nvidia: change GPU taint NoSchedule -> PreferNoSchedule to allow
  overflow scheduling on k8s-node1 (frees ~7Gi capacity)
- kyverno: increase reports-controller memory 256Mi -> 512Mi (OOMKilled)
- speedtest: add missing DB_PORT=3306 env var (nc: service "" unknown)
- realestate-crawler: increase API memory 64Mi -> 256Mi (OOMKilled)
- calibre: increase liveness probe timeout 1s -> 5s (false restarts)
2026-03-15 02:33:46 +00:00
Viktor Barzin
6f562b5da6 add vaultwarden daily backup CronJob to NFS
SQLite backup via Online Backup API + copy of RSA keys,
attachments, sends, and config. 30-day retention with rotation.
Pod affinity ensures co-scheduling with vaultwarden for RWO PVC access.
2026-03-15 00:03:59 +00:00
Viktor Barzin
23019da8e5 equalize memory req=lim across 70+ containers using Prometheus 7d max data
After node2 OOM incident, right-size memory across the cluster by setting
requests=limits based on max_over_time(container_memory_working_set_bytes[7d])
with 1.3x headroom. Eliminates ~37Gi overcommit gap.

Categories:
- Safe equalization (50 containers): set req=lim where max7d well within target
- Limit increases (8 containers): raise limits for services spiking above current
- No Prometheus data (12 containers): conservatively set lim=req
- Exception: nextcloud keeps req=256Mi/lim=8Gi due to Apache memory spikes

Also increased dbaas namespace quota from 12Gi to 16Gi to accommodate mysql
4Gi limits across 3 replicas.
2026-03-14 21:46:49 +00:00
Viktor Barzin
22223ec0fd [ci skip] fix OOMKill: prometheus (4Gi), kyverno-reports (512Mi), grampsweb (512Mi)
- Prometheus server: explicit 1Gi req / 4Gi limit (was inheriting 512Mi LimitRange default)
- Kyverno reports controller: 128Mi req / 512Mi limit (was 128Mi Helm default)
- Grampsweb: 256Mi req / 512Mi limit for both containers (was 256Mi LimitRange default)
2026-03-02 21:39:14 +00:00
Viktor Barzin
7bc975aa16 [ci skip] kyverno: scale to 2 replicas, eliminate API calls from policies
- Scale admission controller to 2 replicas with topology spread across nodes
- Rewrite inject-priority-class-from-tier: use namespaceSelector instead of
  API call per pod admission (eliminates Kyverno→API server round-trip)
- Rewrite sync-tier-label-from-namespace: same namespaceSelector approach
- Extract governance_tiers local to DRY up tier definitions
2026-02-24 23:09:56 +00:00
Viktor Barzin
e7e4faa57a [ci skip] kyverno: fix crash loop — failurePolicy Ignore, increase memory, pin chart
Admission controller was restarting every ~5min due to API server timeouts
causing leader election loss. failurePolicy:Fail meant the webhook blocked
all pod creation cluster-wide when Kyverno was unavailable.
2026-02-24 23:00:45 +00:00
Viktor Barzin
e6420c7b36 [ci skip] Move Terraform modules into stack directories
Move all 88 service modules (66 individual + 22 platform) from
modules/kubernetes/<service>/ into their corresponding stack directories:

- Service stacks: stacks/<service>/module/
- Platform stack: stacks/platform/modules/<service>/

This collocates module source code with its Terragrunt definition.
Only shared utility modules remain in modules/kubernetes/:
ingress_factory, setup_tls_secret, dockerhub_secret, oauth-proxy.

All cross-references to shared modules updated to use correct
relative paths. Verified with terragrunt run --all -- plan:
0 adds, 0 destroys across all 68 stacks.
2026-02-22 14:38:14 +00:00
Renamed from modules/kubernetes/kyverno/main.tf (Browse further)