Commit graph

8 commits

Author SHA1 Message Date
Viktor Barzin
21bb3036af state(dbaas): update encrypted state 2026-03-19 20:23:59 +00:00
Viktor Barzin
90b7d6ebb5 etcd defrag cronjob: add --command-timeout=60s
Default 5s timeout causes defrag to fail on fragmented DBs.
Discovered during manual defrag that took ~7s.
2026-03-15 17:24:24 +00:00
Viktor Barzin
c034adab5f mitigate cluster instability during terraform applies
- Recreate strategy for heavy single-replica deployments (onlyoffice, stirling-pdf)
- Reduce maxSurge on multi-replica deployments (traefik, authentik, grafana, kyverno)
  to prevent memory request surge overwhelming scheduler
- Weekly etcd defrag CronJob (Sunday 3 AM) to prevent fragmentation buildup
- Disable Kyverno policy reports (ephemeral report cleanup)
- Cloud-init: journald persistence + 4Gi swap for worker nodes
- Kubelet: LimitedSwap behavior for memory pressure relief
2026-03-15 17:23:39 +00:00
Viktor Barzin
0638e2cc2e [ci skip] iSCSI migration, healthcheck fixes, health probes, etcd backup
- Migrate MySQL/PostgreSQL storage from local-path to iscsi-truenas
- Add democratic-csi iSCSI driver module for TrueNAS
- Add open-iscsi to cloud-init VM template
- Fix Shlink health probe path (/api/v3 -> /rest/v3 for Shlink 5.0)
- Fix etcd backup: use etcd 3.5.21-0 (3.6.x is distroless, no /bin/sh)
- Fix cluster healthcheck CronJob: always exit 0 to prevent circular
  JobFailed alerts (reporting via Slack, not exit codes)
- Fix Uptime Kuma nested list handling in cluster-health.sh
- Add health probes to: audiobookshelf, immich ML, ntfy, headscale,
  uptime-kuma, vaultwarden, rybbit (clickhouse + server + client),
  shlink, shlink-web
- Add iSCSI storage documentation to CLAUDE.md
2026-03-06 19:54:21 +00:00
Viktor Barzin
87ef313888 [ci skip] fix post-NFS-migration issues: MySQL GR, Loki, grampsweb, alerts
- Loki: reduce memory limit from 6Gi to 4Gi (within LimitRange max)
- Grampsweb: increase memory to 2Gi (was OOMKilled at 512Mi)
- Fix PostgreSQLDown alert: check pod readiness instead of deployment
- Fix MySQLDown alert: check StatefulSet replicas instead of deployment
- Fix RedisDown alert: check StatefulSet replicas instead of deployment
- Fix NFSServerUnresponsive: aggregate all NFS versions cluster-wide
- Fix Uptime Kuma healthcheck: handle nested list heartbeat format
- Update etcd backup image to registry.k8s.io/etcd:3.6.5-0
2026-03-03 21:10:26 +00:00
Viktor Barzin
0abae33c71 [ci skip] complete NFS CSI migration: complex stacks + platform modules
Migrate remaining multi-volume stacks and all platform modules from
inline NFS volumes to CSI-backed PV/PVC with nfs-truenas StorageClass
(soft,timeo=30,retrans=3 mount options).

Complex stacks: openclaw (4 vols), immich (8 vols), frigate (2 vols),
nextcloud (2 vols + old PV replaced), rybbit (1 vol)

Remaining stacks: affine, ebook2audiobook, f1-stream, osm_routing,
real-estate-crawler

Platform modules: monitoring (prometheus, loki, alertmanager PVs
converted from native NFS to CSI), redis, dbaas, technitium,
headscale, vaultwarden, uptime-kuma, mailserver, infra-maintenance
2026-03-02 01:24:07 +00:00
Viktor Barzin
89a6e08245 [ci skip] Infrastructure hardening: security, monitoring, reliability, maintainability
Phase 1 - Critical Security:
- Netbox: move hardcoded DB/superuser passwords to variables
- MeshCentral: disable public registration, add Authentik auth
- Traefik: disable insecure API dashboard (api.insecure=false)
- Traefik: configure forwarded headers with Cloudflare trusted IPs

Phase 2 - Security Hardening:
- Add security headers middleware (HSTS, X-Frame-Options, nosniff, etc.)
- Add Kyverno pod security policies in audit mode (privileged, host
  namespaces, SYS_ADMIN, trusted registries)
- Tighten rate limiting (avg=10, burst=50)
- Add Authentik protection to grampsweb

Phase 3 - Monitoring & Alerting:
- Add critical service alerts (PostgreSQL, MySQL, Redis, Headscale,
  Authentik, Loki)
- Increase Loki retention from 7 to 30 days (720h)
- Add predictive PV filling alert (predict_linear)
- Re-enable Hackmd and Privatebin down alerts

Phase 4 - Reliability:
- Add resource requests/limits to Redis, DBaaS, Technitium, Headscale,
  Vaultwarden, Uptime Kuma
- Increase Alloy DaemonSet memory to 512Mi/1Gi

Phase 6 - Maintainability:
- Extract duplicated tiers locals to terragrunt.hcl generate block
  (removed from 67 stacks)
- Replace hardcoded NFS IP 10.0.10.15 with var.nfs_server (114
  instances across 63 files)
- Replace hardcoded Redis/PostgreSQL/MySQL/Ollama/mail host references
  with variables across ~35 stacks
- Migrate xray raw ingress resources to ingress_factory modules
2026-02-23 22:05:28 +00:00
Viktor Barzin
e6420c7b36 [ci skip] Move Terraform modules into stack directories
Move all 88 service modules (66 individual + 22 platform) from
modules/kubernetes/<service>/ into their corresponding stack directories:

- Service stacks: stacks/<service>/module/
- Platform stack: stacks/platform/modules/<service>/

This collocates module source code with its Terragrunt definition.
Only shared utility modules remain in modules/kubernetes/:
ingress_factory, setup_tls_secret, dockerhub_secret, oauth-proxy.

All cross-references to shared modules updated to use correct
relative paths. Verified with terragrunt run --all -- plan:
0 adds, 0 destroys across all 68 stacks.
2026-02-22 14:38:14 +00:00