Commit graph

12 commits

Author SHA1 Message Date
Viktor Barzin
6101fb99f9 Reduce disk write amplification across cluster (~200-350 GB/day savings) [ci skip]
- Prometheus: persist metric whitelist (keep rules) to Helm template, preventing
  regression from 33K to 250K samples/scrape on next apply. Reduce retention 52w→26w.
- MySQL InnoDB: aggressive write reduction — flush_log_at_trx_commit=0, sync_binlog=0,
  doublewrite=OFF, io_capacity=100/200, redo_log=1GB, flush_neighbors=1, reduced page cleaners.
- etcd: increase snapshot-count 10000→50000 to reduce WAL snapshot frequency.
- VM disks: enable TRIM/discard passthrough to LVM thin pool via create-vm module.
- Cloud-init: enable fstrim.timer, journald limits (500M/7d/compress).
- Kubelet: containerLogMaxSize=10Mi, containerLogMaxFiles=3.
- Technitium: DNS query log retention 0→30 days (was unlimited writes to MySQL).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 19:01:21 +00:00
Viktor Barzin
c2f9ca0d13 modules: improve create-vm with additional config options and cloud-init updates 2026-04-06 11:57:55 +03:00
Viktor Barzin
a44f35bcf8 harden vaultwarden iSCSI storage and increase backup frequency
- Increase backup from daily to every 6 hours (0 */6 * * *)
- Add pre/post-flight SQLite integrity checks to backup job
- Harden iSCSI on all nodes: increase recovery timeout (300s),
  enable CRC32C data/header digests for bit-flip detection
- Fix restore runbook PVC name (vaultwarden-data-iscsi)

Motivated by SQLite corruption from iSCSI I/O errors.
2026-03-23 00:36:11 +02:00
Viktor Barzin
67d1ce453c add /sentinel dir to cloud-init for kured reboot gating
The kured sentinel gate DaemonSet requires /sentinel to exist on
all nodes. Without it, kured pods get stuck in ContainerCreating
with hostPath mount failure. Previously created manually; now
provisioned automatically for new nodes.
2026-03-19 19:57:27 +00:00
Viktor Barzin
c034adab5f mitigate cluster instability during terraform applies
- Recreate strategy for heavy single-replica deployments (onlyoffice, stirling-pdf)
- Reduce maxSurge on multi-replica deployments (traefik, authentik, grafana, kyverno)
  to prevent memory request surge overwhelming scheduler
- Weekly etcd defrag CronJob (Sunday 3 AM) to prevent fragmentation buildup
- Disable Kyverno policy reports (ephemeral report cleanup)
- Cloud-init: journald persistence + 4Gi swap for worker nodes
- Kubelet: LimitedSwap behavior for memory pressure relief
2026-03-15 17:23:39 +00:00
Viktor Barzin
0638e2cc2e [ci skip] iSCSI migration, healthcheck fixes, health probes, etcd backup
- Migrate MySQL/PostgreSQL storage from local-path to iscsi-truenas
- Add democratic-csi iSCSI driver module for TrueNAS
- Add open-iscsi to cloud-init VM template
- Fix Shlink health probe path (/api/v3 -> /rest/v3 for Shlink 5.0)
- Fix etcd backup: use etcd 3.5.21-0 (3.6.x is distroless, no /bin/sh)
- Fix cluster healthcheck CronJob: always exit 0 to prevent circular
  JobFailed alerts (reporting via Slack, not exit codes)
- Fix Uptime Kuma nested list handling in cluster-health.sh
- Add health probes to: audiobookshelf, immich ML, ntfy, headscale,
  uptime-kuma, vaultwarden, rybbit (clickhouse + server + client),
  shlink, shlink-web
- Add iSCSI storage documentation to CLAUDE.md
2026-03-06 19:54:21 +00:00
Viktor Barzin
946b5b1745 [ci skip] add qemu-guest-agent to VM templates and enable agent by default 2026-03-01 01:58:46 +00:00
Viktor Barzin
3b7d295119 add nginx reverse proxy to serialize registyr requests for the same path to avoid race conditions [ci skip] 2025-12-29 20:16:13 +00:00
Viktor Barzin
45e74bedc6 update vm creation tempaltes [ci skip] 2025-12-14 09:50:15 +00:00
Viktor Barzin
b15246a2cb add docker registry vm and allow multiple provisioning cmds in templates [ci skip] 2025-10-12 18:54:29 +00:00
Viktor Barzin
1968f353a2 add module to create a k8s worker [ci skip] 2025-10-11 20:40:34 +00:00
Viktor Barzin
51a94faff4 add template vm in proxmox [ci skip] 2025-10-11 17:07:47 +00:00