infra

Author	SHA1	Message	Date
Viktor Barzin	f80e1fa868	cluster health fixes: NFS CSI, Immich ML, dbaas, Redis, DNS, trading-bot removal - NFS CSI: fix liveness-probe port conflict (29652 → 29653) - Immich ML: add gpu-workload priority class to enable preemption on node1 - dbaas: right-size MySQL memory limits (sidecar 6Gi→350Mi, main 4Gi→3Gi) - Redis: add redis-master service via HAProxy for master-only routing, update config.tfvars redis_host to use it - CoreDNS: forward .viktorbarzin.lan to Technitium ClusterIP (10.96.0.53) instead of stale LoadBalancer IP (10.0.20.200) - Trading bot: comment out all resources (no longer needed) - Vault: remove trading-bot PostgreSQL database role	2026-04-06 11:54:45 +03:00
Viktor Barzin	c8be07c403	resilience improvements: MySQL anti-affinity comment, descheduler 5min, prometheus termination 60s - MySQL InnoDB: keep required anti-affinity but document why (2/3 members OK during node loss) - Descheduler: increase frequency from hourly to every 5 min for faster rebalancing - Prometheus: set terminationGracePeriodSeconds=60 to prevent drain timeout [ci skip]	2026-04-06 00:25:49 +03:00
Viktor Barzin	ae0585048a	fix: bump tier-1-cluster LimitRange max to 8Gi for MySQL 6Gi limit. Kyverno's tier-1-cluster LimitRange had max=4Gi which blocked mysql-cluster-2 from starting after we bumped MySQL to 6Gi limit. Also added custom LimitRange in dbaas stack (for when Terraform manages it directly).	2026-04-05 23:31:23 +03:00
Viktor Barzin	4da8f0242f	fix: right-size service memory after PVE RAM upgrade (142→272GB) - MySQL InnoDB: 2Gi/4Gi → 3Gi/6Gi (was at 97% of limit) - Redis HAProxy: 16Mi/16Mi → 32Mi/64Mi (OOMKilled) - Plotting-book: 64Mi/64Mi → 128Mi/256Mi (OOMKilled) - Tandoor: 256Mi/256Mi → 384Mi/512Mi (60 OOM restarts), re-enabled - Navidrome: 128Mi/128Mi → 256Mi/384Mi - Matrix: add explicit 256Mi/512Mi resources - Trading-bot workers: 64Mi/64Mi → 128Mi/256Mi, re-enabled - Tier 3-edge defaults: 96Mi/192Mi → 128Mi/256Mi - Fallback tier defaults: 128Mi/128Mi → 128Mi/192Mi, max 2→4Gi - Mailserver: disable rspamd-redis, fix Roundcube IPv6/IMAP, bump dovecot connections	2026-04-05 23:02:50 +03:00
Viktor Barzin	cb8a808700	feat(storage): migrate 38 NFS PVCs to proxmox-lvm (Wave 2). Add proxmox-lvm PVCs with pvc-autoresizer annotations for all remaining single-pod app data services. Deployments updated to use new block storage PVCs. Old NFS modules retained for rollback. Services: affine, changedetection, diun, excalidraw, f1-stream, hackmd, isponsorblocktv, matrix, n8n, send, grampsweb, health, onlyoffice, owntracks, paperless-ngx, privatebin, resume, speedtest, stirling-pdf, tandoor, rybbit (clickhouse), tor-proxy (torrserver), whisper+piper, frigate (config), ollama (ui), servarr (prowlarr/listenarr/qbittorrent), aiostreams, freshrss (extensions), meshcentral (data+files), openclaw (data+home+openlobster), technitium, mailserver (data+roundcube html+enigma), dbaas (pgadmin). Strategy set to Recreate where needed for RWO volumes.	2026-04-04 19:25:12 +03:00
Viktor Barzin	ce7b8c2b2e	add pvc-autoresizer for automatic PVC expansion before volumes fill up [ci skip] Deploy topolvm/pvc-autoresizer controller that monitors kubelet_volume_stats via Prometheus and auto-expands annotated PVCs. Annotated all 9 block-storage PVCs (proxmox-lvm) with per-PVC thresholds and max limits. Updated PVFillingUp alert to critical/10m (means auto-expansion failed) and added PVAutoExpanding info alert at 80%.	2026-04-03 23:30:00 +03:00
Viktor Barzin	dd59512153	migrate iSCSI block volumes from democratic-csi to Proxmox CSI [ci skip] Replace TrueNAS iSCSI (democratic-csi) with Proxmox CSI plugin for all block storage PVCs. Eliminates double-CoW (ZFS + LVM-thin) and removes the iSCSI network hop for database I/O. New stack: stacks/proxmox-csi/ — deploys proxmox-csi-plugin Helm chart with StorageClass "proxmox-lvm" using existing local-lvm thin pool. Migrated PVCs (12 total): - Phase 1 standalone: plotting-book, novelapp, vaultwarden, nextcloud, prometheus - Phase 2 StatefulSets: CNPG PostgreSQL (2), MySQL InnoDB (3), Redis (2) All services verified healthy post-migration.	2026-04-02 22:13:04 +03:00
Viktor Barzin	d20c5e5535	add backup_output_bytes metric and cloudsync_transferred_bytes to backup dashboard - All 7 backup CronJobs now push backup_output_bytes (file size after backup) - Cloud Sync monitor parses rclone transfer stats into cloudsync_transferred_bytes - Grafana dashboard: new Output (MiB) table column, Output Size Trend panel, Write Throughput panel, Cloud Sync Transfer Volume bargauge - All timeseries panels use points-only draw style (discrete backup snapshots) - etcd backup restructured: init_container for etcdctl (distroless image), busybox sidecar for metrics push + purge, ClusterFirstWithHostNet DNS - Fixed pre-existing curl missing in postgres:16.4-bullseye (immich, dbaas PG) - Fixed grep -oP not available in alpine/busybox (cloud sync monitor)	2026-03-25 10:44:53 +02:00
Viktor Barzin	42eb85c578	fix: rybbit init port, mysql memory limit, metallb alert selector - rybbit-client: fix Kyverno wait-for port 3001 → 80 (service port, not targetPort) - dbaas: increase MySQL memory limit 4Gi → 5Gi (mysql-cluster-1 at 95.9%) - dbaas: bump ResourceQuota limits.memory 24Gi → 27Gi to accommodate - monitoring: fix MetalLBControllerDown alert selector for v0.15 (controller → metallb-controller)	2026-03-24 18:55:07 +02:00
Viktor Barzin	a95d434ff1	fix backup IO stats: use /proc/$$/io instead of /proc/self/io. /proc/self/io inside $(awk ...) resolves to the awk subprocess PID, not the parent bash shell. Use $$ (bash PID) to read the correct process IO counters.	2026-03-23 12:33:52 +02:00
Viktor Barzin	0a294a30a6	add backup IO logging, Pushgateway metrics, and Grafana dashboard - Add /proc/self/io read/write tracking to vault raft-backup and etcd backup - Push backup_duration_seconds, backup_read_bytes, backup_written_bytes, backup_last_success_timestamp to Pushgateway from all 6 backup CronJobs (etcd skipped — distroless image has no wget/curl) - Add cloudsync_duration_seconds metric to cloudsync-monitor - New "Backup Health" Grafana dashboard with 8 panels: time since last backup, overview table, duration/IO trends, cloud sync status, alerts, CronJob schedule	2026-03-23 12:19:01 +02:00
Viktor Barzin	e463281205	optimize backup schedules: compress dumps, stagger to weekly, extend retention - dbaas: gzip MySQL/PostgreSQL dumps, stagger to 0:30, clean old uncompressed - infra-maintenance: etcd backup daily→weekly Sunday 1am - redis: backup hourly→weekly Sunday 3am, retention 7→28 days - vault: raft backup daily→weekly Sunday 2am	2026-03-23 02:24:34 +02:00
Viktor Barzin	e823b795f7	fix(dbaas,vault): fix backup CronJob failures and mysql-operator memory - Add docker.io/library/ prefix to mysql and postgres backup images to satisfy Kyverno require-trusted-registries policy (both CronJobs were blocked for 46h, triggering MySQLBackupStale alert) - Document mysql-operator chart ignoring resources values key — the LimitRange default (256Mi) was silently applied, putting the operator at 97% memory. Patched live to 512Mi via kubectl. - Increase vault-raft-backup backoff_limit to 6 for transient failures (also fixed NFS export: vault-backup was a separate ZFS dataset not in the TrueNAS NFS share — destroyed dataset, created directory)	2026-03-19 23:26:05 +00:00
Viktor Barzin	e54bc016ba	reduce alert noise: raise memory thresholds, exclude claude-memory 4xx, right-size mysql-operator - ContainerNearOOM: 85% → 90% (silences forgejo, changedetection, immich-pg, mysql-cluster) - ClusterMemoryRequestsHigh: 85% → 92% (intentional overcommit) - NodeMemoryPressureTrending: 85% → 92% - HighService4xxRate: exclude claude-memory (401s from unauth requests are expected) - mysql-operator memory limit: 512Mi → 580Mi (VPA upperBound 481Mi × 1.2)	2026-03-19 20:25:36 +00:00
Viktor Barzin	21bb3036af	state(dbaas): update encrypted state	2026-03-19 20:23:59 +00:00
Viktor Barzin	12a51c4ffa	right-size memory requests to unblock GPU workloads and fix dbaas quota [ci skip] - nvidia: custom LimitRange (128Mi default, was 1Gi from Kyverno) to stop inflating GPU operator init containers; saves ~2.5Gi on GPU node - nvidia: dcgm-exporter 1536Mi → 768Mi (actual usage 489Mi) - monitoring: prometheus server 4Gi → 3Gi (actual usage 2.6Gi) - onlyoffice: 2304Mi → 1536Mi (actual usage 1.3Gi) - immich: frame explicit 64Mi resources (was getting 1Gi LimitRange default) - dbaas: quota limits.memory 20Gi → 24Gi to fit 3rd MySQL replica Root cause: Kyverno tier-2-gpu LimitRange injected 1Gi on every NVIDIA init container (no explicit resources), wasting ~2.5Gi scheduling overhead on the GPU node. Combined with over-requesting, frigate and immich-ml couldn't schedule.	2026-03-17 22:35:54 +00:00
Viktor Barzin	3c804aedf8	extract dbaas, authentik, crowdsec from platform into independent stacks [ci skip] Phase 1 of platform stack split for parallel CI applies. All 3 modules were fully independent (no cross-module refs). State migrated via terraform state mv. All 3 stacks applied with zero changes (dbaas had pre-existing ResourceQuota drift). Woodpecker pipeline updated to run extracted stacks in parallel.	2026-03-17 18:11:53 +00:00

17 commits
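
The /proc/self/io fix described in commit a95d434ff1 can be reproduced in isolation. The following is a minimal sketch, assuming bash and awk on Linux; the variable names are illustrative and it is not the repository's actual backup script, which does not appear in this log. It uses /proc/<pid>/stat, whose first field is the PID, to show which process each path resolves to, then reads the I/O counters through the $$ form.

```shell
#!/usr/bin/env bash
# Sketch only (Linux): shows which process "/proc/self" refers to
# when a file under it is opened from a command substitution.

# Field 1 of /proc/<pid>/stat is the PID of the process the file describes.
# Here awk opens the file, so "self" is the awk child, not this shell.
pid_via_self=$(awk '{ print $1; exit }' /proc/self/stat)

# $$ is expanded by bash before awk starts and always names the main shell,
# even inside $( ... ).
pid_via_shell=$(awk '{ print $1; exit }' "/proc/$$/stat")

echo "shell PID:       $$"
echo "via /proc/self:  $pid_via_self"
echo "via /proc/\$\$:    $pid_via_shell"

# The same substitution applied to the I/O counters. Guarded because
# /proc/<pid>/io is absent on kernels built without task I/O accounting.
if [ -r "/proc/$$/io" ]; then
  read_bytes=$(awk '$1 == "read_bytes:" { print $2 }' "/proc/$$/io")
  write_bytes=$(awk '$1 == "write_bytes:" { print $2 }' "/proc/$$/io")
  echo "read_bytes=$read_bytes write_bytes=$write_bytes"
fi
```

The PID reported via /proc/self differs from the shell PID, which is why counters read that way describe the short-lived awk process rather than the backup job.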