Commit graph

20 commits

Author SHA1 Message Date
Viktor Barzin
82b0f6c4cb truenas deprecation: migrate all non-immich storage to proxmox NFS
- Migrate 7 backup CronJobs to Proxmox host NFS (192.168.1.127)
  (etcd, mysql, postgresql, nextcloud, redis, vaultwarden, plotting-book)
- Migrate headscale backup, ebook2audiobook, osm_routing to Proxmox NFS
- Migrate servarr (lidarr, readarr, soulseek) NFS refs to Proxmox
- Remove 79 orphaned TrueNAS NFS module declarations from 49 stacks
- Delete stacks/platform/modules/ (27 dead module copies, 65MB)
- Update nfs-truenas StorageClass to point to Proxmox (192.168.1.127)
- Remove iscsi DNS record from config.tfvars
- Fix woodpecker persistence config and alertmanager PV

Only Immich (8 PVCs, ~1.4TB) remains on TrueNAS.
2026-04-12 14:35:39 +01:00
Viktor Barzin
cb8a808700 feat(storage): migrate 38 NFS PVCs to proxmox-lvm (Wave 2)
Add proxmox-lvm PVCs with pvc-autoresizer annotations for all
remaining single-pod app data services. Deployments updated to
use new block storage PVCs. Old NFS modules retained for rollback.

Services: affine, changedetection, diun, excalidraw, f1-stream,
hackmd, isponsorblocktv, matrix, n8n, send, grampsweb, health,
onlyoffice, owntracks, paperless-ngx, privatebin, resume,
speedtest, stirling-pdf, tandoor, rybbit (clickhouse), tor-proxy
(torrserver), whisper+piper, frigate (config), ollama (ui),
servarr (prowlarr/listenarr/qbittorrent), aiostreams, freshrss
(extensions), meshcentral (data+files), openclaw (data+home+
openlobster), technitium, mailserver (data+roundcube html+enigma),
dbaas (pgadmin).

Strategy set to Recreate where needed for RWO volumes.
2026-04-04 19:25:12 +03:00
Viktor Barzin
ee39dd2fc9 feat(storage): migrate 12 SQLite NFS PVCs to proxmox-lvm (Wave 1)
Add proxmox-lvm PVCs with pvc-autoresizer annotations for all
SQLite-backed services. Deployments updated to use new block storage
PVCs. Old NFS modules retained for 1-week rollback.

Services: ntfy, freshrss, insta2spotify, actualbudget (x3),
wealthfolio, navidrome (DB only), audiobookshelf config,
headscale, forgejo, uptime-kuma.

Also: set Recreate strategy on ntfy, forgejo, insta2spotify,
wealthfolio (required for RWO volumes).
2026-04-04 16:26:59 +03:00
Viktor Barzin
21cfa8c072 bump memory limits for OOM-prone services
FreshRSS: 64Mi → 256Mi (171 restarts, VPA upper ~204Mi)
Actual Budget HTTP API: 128Mi → 384Mi (17 restarts, VPA upper ~297Mi)
n8n: 768Mi → 1Gi (18 restarts, VPA upper ~765Mi)
Dawarich: 768Mi → 896Mi (2 restarts, VPA upper ~628Mi)
Traefik: 384Mi → 768Mi (2 restarts, VPA upper ~584Mi)
2026-03-21 11:12:12 +00:00
Viktor Barzin
21bb3036af state(dbaas): update encrypted state 2026-03-19 20:23:59 +00:00
Viktor Barzin
39b3c51709 migrate 16 plan-time stacks: vault data source → ESO + kubernetes_secret
Replaced data "vault_kv_secret_v2" with:
1. ExternalSecret (ESO syncs Vault KV → K8s Secret)
2. data "kubernetes_secret" (reads ESO-created secret at plan time)

This removes the Vault provider dependency at plan time for these
stacks — they now only need K8s API access, not a Vault token.

Stacks: actualbudget, affine, audiobookshelf, calibre, changedetection,
coturn, freedify, freshrss, grampsweb, navidrome, novelapp, ollama,
owntracks, real-estate-crawler, servarr, ytdlp
2026-03-15 22:06:39 +00:00
Viktor Barzin
6f562b5da6 add vaultwarden daily backup CronJob to NFS
SQLite backup via Online Backup API + copy of RSA keys,
attachments, sends, and config. 30-day retention with rotation.
Pod affinity ensures co-scheduling with vaultwarden for RWO PVC access.
2026-03-15 00:03:59 +00:00
Viktor Barzin
f7c2c06009 right-size memory: set requests=limits based on actual usage
- Set memory requests = limits across 56 stacks to prevent overcommit
- Right-sized limits based on actual pod usage (2x actual, rounded up)
- Scaled down trading-bot (replicas=0) to free memory
- Fixed OOMKilled services: forgejo, dawarich, health, meshcentral,
  paperless-ngx, vault auto-unseal, rybbit, whisper, openclaw, clickhouse
- Added startup+liveness probes to calibre-web
- Bumped inotify limits on nodes 2,3 (max_user_instances 128->8192)

Post node2 OOM incident (2026-03-14). Previous kubelet config had no
kubeReserved/systemReserved set, allowing pods to starve the kernel.
2026-03-14 21:01:24 +00:00
Viktor Barzin
a8d944eb9b migrate all secrets from SOPS to Vault KV
- Add vault provider to root terragrunt.hcl (generated providers.tf)
- Delete stacks/vault/vault_provider.tf (now in generated providers.tf)
- Add 124 variable declarations + 43 vault_kv_secret_v2 resources to
  vault/main.tf to populate Vault KV at secret/<stack-name>
- Migrate 43 consuming stacks to read secrets from Vault KV via
  data "vault_kv_secret_v2" instead of SOPS var-file
- Add dependency "vault" to all migrated stacks' terragrunt.hcl
- Complex types (maps/lists) stored as JSON strings, decoded with
  jsondecode() in locals blocks

Bootstrap secrets (vault_root_token, vault_authentik_client_id,
vault_authentik_client_secret) remain in SOPS permanently.

Apply order: vault stack first (populates KV), then all others.
2026-03-14 17:15:48 +00:00
Viktor Barzin
b00f810d3d Remove all CPU limits cluster-wide to eliminate CFS throttling
CPU limits cause CFS throttling even when nodes have idle capacity.
Move to a request-only CPU model: keep CPU requests for scheduling
fairness but remove all CPU limits. Memory limits stay (incompressible).

Changes across 108 files:
- Kyverno LimitRange policy: remove cpu from default/max in all 6 tiers
- Kyverno ResourceQuota policy: remove limits.cpu from all 5 tiers
- Custom ResourceQuotas: remove limits.cpu from 8 namespace quotas
- Custom LimitRanges: remove cpu from default/max (nextcloud, onlyoffice)
- RBAC module: remove cpu_limits variable and quota reference
- Freedify factory: remove cpu_limit variable and limits reference
- 86 deployment files: remove cpu from all limits blocks
- 6 Helm values files: remove cpu under limits sections
2026-03-14 08:51:45 +00:00
Viktor Barzin
57eed07370 [ci skip] add widgets for qbittorrent, navidrome, nextcloud, freshrss, linkwarden, uptime-kuma
Add API credentials to SOPS and wire homepage_credentials through
stacks. Re-add Uptime Kuma widget with new "infra" status page slug.
2026-03-07 20:39:55 +00:00
Viktor Barzin
6bd3970579 [ci skip] add Homepage gethomepage.dev annotations to all services
Add Kubernetes ingress annotations for Homepage auto-discovery across
~88 services organized into 11 groups. Enable serviceAccount for RBAC,
configure group layouts, and add Grafana/Frigate/Speedtest widgets.
2026-03-07 20:39:54 +00:00
Viktor Barzin
1f2c1ca361 [ci skip] phase 5+6: update CI pipelines for SOPS, add sensitive=true to secret vars
Phase 5 — CI pipelines:
- default.yml: add SOPS decrypt in prepare step, change git add . to
  specific paths (stacks/ state/ .woodpecker/), cleanup on success+failure
- renew-tls.yml: change git add . to git add secrets/ state/

Phase 6 — sensitive=true:
- Add sensitive = true to 256 variable declarations across 149 stack files
- Prevents secret values from appearing in terraform plan output
- Does NOT modify shared modules (ingress_factory, nfs_volume) to avoid
  breaking module interface contracts

Note: CI pipeline SOPS decryption requires sops_age_key Woodpecker secret
to be created before the pipeline will work with SOPS. Until then, the old
terraform.tfvars path continues to function.
2026-03-07 14:30:36 +00:00
Viktor Barzin
197cef7f3f [ci skip] add auto-generated tiers.tf, planning docs, and helm chart cache
- tiers.tf: Terragrunt-generated tier locals for all standalone stacks
- .planning/: resource audit research and plans
- docs/plans/: cluster hardening design doc
- redis-25.3.2.tgz: Bitnami Redis Helm chart cache
2026-03-06 23:55:57 +00:00
Viktor Barzin
79a2aa3784 [ci skip] migrate 29 services from inline NFS to CSI-backed PV/PVC
Batch migration of all single-volume and simple multi-volume stacks.
All services verified healthy after migration. Uses nfs-truenas
StorageClass with soft,timeo=30,retrans=3 mount options to eliminate
stale NFS mount hangs.

Services: atuin, audiobookshelf, calibre, changedetection, diun,
excalidraw, forgejo, freshrss, grampsweb, hackmd, health,
isponsorblocktv, matrix, meshcentral, n8n, navidrome, ntfy, ollama,
onlyoffice, owntracks, paperless-ngx, poison-fountain, send,
stirling-pdf, tandoor, wealthfolio, whisper, woodpecker, ytdlp
2026-03-02 00:15:39 +00:00
Viktor Barzin
9e4fb23b10 [ci skip] right-size all pod resources based on VPA + live metrics audit
Full cluster resource audit: cross-referenced Goldilocks VPA recommendations,
live kubectl top metrics, and Terraform definitions for 100+ containers.

Critical fixes:
- dashy: CPU throttled at 98% (490m/500m) → 2 CPU limit
- stirling-pdf: CPU throttled at 99.7% (299m/300m) → 2 CPU limit
- traefik auth-proxy/bot-block-proxy: mem limit 32Mi → 128Mi

Added explicit resources to ~40 containers that had none:
- audiobookshelf, changedetection, cyberchef, dawarich, diun, echo,
  excalidraw, freshrss, hackmd, isponsorblocktv, linkwarden, n8n,
  navidrome, ntfy, owntracks, privatebin, send, shadowsocks, tandoor,
  tor-proxy, wealthfolio, networking-toolbox, rybbit, mailserver,
  cloudflared, pgadmin, phpmyadmin, crowdsec-web, xray, wireguard,
  k8s-portal, tuya-bridge, ollama-ui, whisper, piper, immich-server,
  immich-postgresql, osrm-foot

GPU containers: added CPU/mem alongside GPU limits:
- ollama: removed CPU/mem limits (models vary in size), keep GPU only
- frigate: req 500m/2Gi, lim 4/8Gi + GPU
- immich-ml: req 100m/1Gi, lim 2/4Gi + GPU

Right-sized ~25 over-provisioned containers:
- kms-web-page: 500m/512Mi → 50m/64Mi (was using 0m/10Mi)
- onlyoffice: CPU 8 → 2 (VPA upper 45m)
- realestate-crawler-api: CPU 2000m → 250m
- blog/travel-blog/webhook-handler: 500m → 100m
- coturn/health/plotting-book: reduced to match actual usage

Conservative methodology: limits = max(VPA upper * 2, live usage * 2)
2026-03-01 19:18:50 +00:00
Viktor Barzin
89a6e08245 [ci skip] Infrastructure hardening: security, monitoring, reliability, maintainability
Phase 1 - Critical Security:
- Netbox: move hardcoded DB/superuser passwords to variables
- MeshCentral: disable public registration, add Authentik auth
- Traefik: disable insecure API dashboard (api.insecure=false)
- Traefik: configure forwarded headers with Cloudflare trusted IPs

Phase 2 - Security Hardening:
- Add security headers middleware (HSTS, X-Frame-Options, nosniff, etc.)
- Add Kyverno pod security policies in audit mode (privileged, host
  namespaces, SYS_ADMIN, trusted registries)
- Tighten rate limiting (avg=10, burst=50)
- Add Authentik protection to grampsweb

Phase 3 - Monitoring & Alerting:
- Add critical service alerts (PostgreSQL, MySQL, Redis, Headscale,
  Authentik, Loki)
- Increase Loki retention from 7 to 30 days (720h)
- Add predictive PV filling alert (predict_linear)
- Re-enable Hackmd and Privatebin down alerts

Phase 4 - Reliability:
- Add resource requests/limits to Redis, DBaaS, Technitium, Headscale,
  Vaultwarden, Uptime Kuma
- Increase Alloy DaemonSet memory to 512Mi/1Gi

Phase 6 - Maintainability:
- Extract duplicated tiers locals to terragrunt.hcl generate block
  (removed from 67 stacks)
- Replace hardcoded NFS IP 10.0.10.15 with var.nfs_server (114
  instances across 63 files)
- Replace hardcoded Redis/PostgreSQL/MySQL/Ollama/mail host references
  with variables across ~35 stacks
- Migrate xray raw ingress resources to ingress_factory modules
2026-02-23 22:05:28 +00:00
Viktor Barzin
c7c7047f1c [ci skip] Flatten module wrappers into stack roots
Remove the module "xxx" { source = "./module" } indirection layer
from all 66 service stacks. Resources are now defined directly in
each stack's main.tf instead of through a wrapper module.

- Merge module/main.tf contents into stack main.tf
- Apply variable replacements (var.tier -> local.tiers.X, renamed vars)
- Fix shared module paths (one fewer ../ at each level)
- Move extra files/dirs (factory/, chart_values, subdirs) to stack root
- Update state files to strip module.<name>. prefix
- Update CLAUDE.md to reflect flat structure

Verified: terragrunt plan shows 0 add, 0 destroy across all stacks.
2026-02-22 15:13:55 +00:00
Viktor Barzin
e6420c7b36 [ci skip] Move Terraform modules into stack directories
Move all 88 service modules (66 individual + 22 platform) from
modules/kubernetes/<service>/ into their corresponding stack directories:

- Service stacks: stacks/<service>/module/
- Platform stack: stacks/platform/modules/<service>/

This collocates module source code with its Terragrunt definition.
Only shared utility modules remain in modules/kubernetes/:
ingress_factory, setup_tls_secret, dockerhub_secret, oauth-proxy.

All cross-references to shared modules updated to use correct
relative paths. Verified with terragrunt run --all -- plan:
0 adds, 0 destroys across all 68 stacks.
2026-02-22 14:38:14 +00:00
Viktor Barzin
a9ba8899be [ci skip] Phase 3: Create 66 service stacks and migrate state
Generated individual stack directories for all 66 services under stacks/.
Each stack has terragrunt.hcl (depends on platform) and main.tf (thin
wrapper calling existing module). Migrated all 64 active service states
from root terraform.tfstate to individual state files. Root state is now
empty. Verified with terragrunt plan on multiple stacks (no changes).
2026-02-22 13:56:34 +00:00