Commit graph

14 commits

Author SHA1 Message Date
Viktor Barzin
06a0d0599a regenerate providers.tf: remove vault_root_token variable [ci skip] 2026-03-15 21:21:01 +00:00
Viktor Barzin
f7c2c06009 right-size memory: set requests=limits based on actual usage
- Set memory requests = limits across 56 stacks to prevent overcommit
- Right-sized limits based on actual pod usage (2x actual, rounded up)
- Scaled down trading-bot (replicas=0) to free memory
- Fixed OOMKilled services: forgejo, dawarich, health, meshcentral,
  paperless-ngx, vault auto-unseal, rybbit, whisper, openclaw, clickhouse
- Added startup+liveness probes to calibre-web
- Bumped inotify limits on nodes 2,3 (max_user_instances 128->8192)

Post node2 OOM incident (2026-03-14). Previous kubelet config had no
kubeReserved/systemReserved set, allowing pods to starve the kernel.
2026-03-14 21:01:24 +00:00
Viktor Barzin
b00f810d3d Remove all CPU limits cluster-wide to eliminate CFS throttling
CPU limits cause CFS throttling even when nodes have idle capacity.
Move to a request-only CPU model: keep CPU requests for scheduling
fairness but remove all CPU limits. Memory limits stay (incompressible).

Changes across 108 files:
- Kyverno LimitRange policy: remove cpu from default/max in all 6 tiers
- Kyverno ResourceQuota policy: remove limits.cpu from all 5 tiers
- Custom ResourceQuotas: remove limits.cpu from 8 namespace quotas
- Custom LimitRanges: remove cpu from default/max (nextcloud, onlyoffice)
- RBAC module: remove cpu_limits variable and quota reference
- Freedify factory: remove cpu_limit variable and limits reference
- 86 deployment files: remove cpu from all limits blocks
- 6 Helm values files: remove cpu under limits sections
2026-03-14 08:51:45 +00:00
Viktor Barzin
32bd30f56e [ci skip] fix invalid Homepage dashboard icons for 9 services
Use correct dashboard-icons names where available (changedetection,
gramps-web), Material Design Icons for custom apps (city-guesser,
plotting-book, resume, tuya-bridge, trading-bot, poison-fountain),
and Simple Icons for F1 Stream.
2026-03-07 21:14:17 +00:00
Viktor Barzin
6bd3970579 [ci skip] add Homepage gethomepage.dev annotations to all services
Add Kubernetes ingress annotations for Homepage auto-discovery across
~88 services organized into 11 groups. Enable serviceAccount for RBAC,
configure group layouts, and add Grafana/Frigate/Speedtest widgets.
2026-03-07 20:39:54 +00:00
Viktor Barzin
1f2c1ca361 [ci skip] phase 5+6: update CI pipelines for SOPS, add sensitive=true to secret vars
Phase 5 — CI pipelines:
- default.yml: add SOPS decrypt in prepare step, change git add . to
  specific paths (stacks/ state/ .woodpecker/), cleanup on success+failure
- renew-tls.yml: change git add . to git add secrets/ state/

Phase 6 — sensitive=true:
- Add sensitive = true to 256 variable declarations across 149 stack files
- Prevents secret values from appearing in terraform plan output
- Does NOT modify shared modules (ingress_factory, nfs_volume) to avoid
  breaking module interface contracts

Note: CI pipeline SOPS decryption requires sops_age_key Woodpecker secret
to be created before the pipeline will work with SOPS. Until then, the old
terraform.tfvars path continues to function.
2026-03-07 14:30:36 +00:00
Viktor Barzin
79a2aa3784 [ci skip] migrate 29 services from inline NFS to CSI-backed PV/PVC
Batch migration of all single-volume and simple multi-volume stacks.
All services verified healthy after migration. Uses nfs-truenas
StorageClass with soft,timeo=30,retrans=3 mount options to eliminate
stale NFS mount hangs.

Services: atuin, audiobookshelf, calibre, changedetection, diun,
excalidraw, forgejo, freshrss, grampsweb, hackmd, health,
isponsorblocktv, matrix, meshcentral, n8n, navidrome, ntfy, ollama,
onlyoffice, owntracks, paperless-ngx, poison-fountain, send,
stirling-pdf, tandoor, wealthfolio, whisper, woodpecker, ytdlp
2026-03-02 00:15:39 +00:00
Viktor Barzin
fd0c85c9cc [ci skip] bump poison-fountain tier from aux to cluster (critical path for all ingress) 2026-03-01 13:57:54 +00:00
Viktor Barzin
a1ba218cd2 [ci skip] Phase 1: PostgreSQL migrated to CNPG on local disk
Major milestone - shared PostgreSQL moved from NFS to CloudNativePG:
- CNPG cluster (pg-cluster) running in dbaas namespace on local-path storage
- PostGIS image (ghcr.io/cloudnative-pg/postgis:16) for dawarich compatibility
- All 20 databases and 19 roles restored from pg_dumpall backup
- postgresql.dbaas Service patched to point at CNPG primary
- Old PG deployment scaled to 0 (NFS data intact for rollback)
- All 12+ dependent services verified running:
  authentik, n8n, dawarich, tandoor, linkwarden, netbox, woodpecker,
  rybbit, affine, health, resume, trading-bot, atuin
- Authentik PgBouncer working through the switched endpoint

TODO: codify CNPG cluster in Terraform, add 2nd replica, update backup CronJob
2026-02-28 19:08:06 +00:00
Viktor Barzin
f1d90ff840 [ci skip] poison-fountain: fix single point of failure causing transient service outages
- Scale to 2 replicas with RollingUpdate (maxUnavailable=0)
- Add topology spread constraint to place pods on different nodes
- Switch from single-threaded to ThreadingMixIn HTTP server so tarpit
  slow-drip requests no longer block /auth and /healthz endpoints
2026-02-25 21:05:14 +00:00
Viktor Barzin
89a6e08245 [ci skip] Infrastructure hardening: security, monitoring, reliability, maintainability
Phase 1 - Critical Security:
- Netbox: move hardcoded DB/superuser passwords to variables
- MeshCentral: disable public registration, add Authentik auth
- Traefik: disable insecure API dashboard (api.insecure=false)
- Traefik: configure forwarded headers with Cloudflare trusted IPs

Phase 2 - Security Hardening:
- Add security headers middleware (HSTS, X-Frame-Options, nosniff, etc.)
- Add Kyverno pod security policies in audit mode (privileged, host
  namespaces, SYS_ADMIN, trusted registries)
- Tighten rate limiting (avg=10, burst=50)
- Add Authentik protection to grampsweb

Phase 3 - Monitoring & Alerting:
- Add critical service alerts (PostgreSQL, MySQL, Redis, Headscale,
  Authentik, Loki)
- Increase Loki retention from 7 to 30 days (720h)
- Add predictive PV filling alert (predict_linear)
- Re-enable Hackmd and Privatebin down alerts

Phase 4 - Reliability:
- Add resource requests/limits to Redis, DBaaS, Technitium, Headscale,
  Vaultwarden, Uptime Kuma
- Increase Alloy DaemonSet memory to 512Mi/1Gi

Phase 6 - Maintainability:
- Extract duplicated tiers locals to terragrunt.hcl generate block
  (removed from 67 stacks)
- Replace hardcoded NFS IP 10.0.10.15 with var.nfs_server (114
  instances across 63 files)
- Replace hardcoded Redis/PostgreSQL/MySQL/Ollama/mail host references
  with variables across ~35 stacks
- Migrate xray raw ingress resources to ingress_factory modules
2026-02-23 22:05:28 +00:00
Viktor Barzin
c4db6a9fdb [ci skip] Fix poison fetcher: use HTTP/1.1 for upstream (HTTP/2 hangs)
The Poison Fountain upstream (rnsaffn.com/poison2/) doesn't respond
properly over HTTP/2. Force HTTP/1.1 for reliable content fetching.
Also fixed NFS directory permissions for non-root curl container.
2026-02-22 20:42:53 +00:00
Viktor Barzin
36cec7c83f [ci skip] Add poison-fountain Terraform stack (deployment, service, ingress, CronJob) 2026-02-22 19:50:57 +00:00
Viktor Barzin
8a1636b931 [ci skip] Add poison fountain Python service and fetcher script 2026-02-22 19:46:43 +00:00