- NFS CSI: fix liveness-probe port conflict (29652 → 29653)
- Immich ML: add gpu-workload priority class to enable preemption on node1
- dbaas: right-size MySQL memory limits (sidecar 6Gi→350Mi, main 4Gi→3Gi)
- Redis: add redis-master service via HAProxy for master-only routing,
update config.tfvars redis_host to use it
- CoreDNS: forward .viktorbarzin.lan to Technitium ClusterIP (10.96.0.53)
instead of stale LoadBalancer IP (10.0.20.200)
- Trading bot: comment out all resources (no longer needed)
- Vault: remove trading-bot PostgreSQL database role
- MySQL InnoDB: keep required anti-affinity but document why (2/3 members OK during node loss)
- Descheduler: increase frequency from hourly to every 5 min for faster rebalancing
- Prometheus: set terminationGracePeriodSeconds=60 to prevent drain timeout [ci skip]
The global rate limit (10 req/s, 50 burst) was too aggressive for HA
dashboards that load 30+ JS files on page load, causing 429s. VPN tunnel
blips between London K8s and Sofia caused 502s with no retry fallback.
- Add traefik-retry middleware to reverse-proxy factory (all services)
- Add skip_global_rate_limit variable to both reverse-proxy factories
- Create ha-sofia-rate-limit middleware (100 avg, 200 burst)
- Apply to ha-sofia and music-assistant (both route to Sofia)
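The new middleware can be sketched as a Traefik CRD (the numbers are from this change; on older Traefik versions the apiVersion is traefik.containo.us/v1alpha1):

```yaml
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: ha-sofia-rate-limit
spec:
  rateLimit:
    average: 100   # sustained req/s
    burst: 200
```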
The Terraform Helm provider's YAML diff comparison silently ignores rules
containing {{ $labels.job }} in annotations, preventing those alerts from
being applied. This change also syncs the alerts to the platform stack tpl.
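For illustration, the kind of annotation involved looks like this (the rule fragment itself is hypothetical); `{{ ... }}` is valid Prometheus templating but is also valid Go-template syntax, which is what trips up the provider's diff:

```yaml
# Hypothetical PrometheusRule fragment: {{ $labels.job }} collides with
# Go templating when the rule passes through helm_release values.
annotations:
  summary: "Scrapes failing for job {{ $labels.job }}"
# A common workaround is escaping the braces so Helm emits them literally:
#   summary: 'Scrapes failing for job {{ "{{" }}$labels.job{{ "}}" }}'
```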
Caretta eBPF DaemonSet was using 600Mi x 5 nodes = 3Gi total for
non-critical network topology visualization. Removing it to free
memory for novelapp and aiostreams which were stuck in Pending.
HA frontend loads 30-50 JS bundles on page load, exhausting the burst.
iOS Companion app reconnections also trigger bursts. 172 rate-limited
(429) requests were found in Traefik logs, causing intermittent
connectivity failures for the ha-sofia iOS app.
Both services migrated to unified ebooks namespace. Remove:
- Old stack directories and Terraform state
- calibre references from monitoring namespace lists
- calibre/audiobookshelf from operational scripts
- Replace custom ViktorBarzin/metallb module with official Helm chart
- Migrate from ConfigMap-based config to CRD (IPAddressPool + L2Advertisement)
- Update Traefik LB annotations from metallb.universe.tf to metallb.io format
- Technitium DNS keeps stable IP 10.0.20.204 via MetalLB auto-assignment
- Headscale split DNS already configured to use 10.0.20.204
- Add public_ipv6 variable and AAAA records for all 34 non-proxied services
- Fix stale DNS records (85.130.108.6 → 176.12.22.76, old IPv6 → HE tunnel)
- Update SPF record with current IPv4/IPv6 addresses
- Add AAAA update support to Technitium DNS updater CLI
- Pin mailserver MetalLB IP to 10.0.20.201 for stable pfSense NAT
- pfSense: HE_IPv6 interface, strict firewall (80,443,25,465,587,993 + ICMPv6),
socat IPv6→IPv4 proxy, removed dangerous "Allow all DEBUG" rules
- New CronJob runs PRAGMA integrity_check every hour
- Pushes vaultwarden_sqlite_integrity_ok metric to Prometheus pushgateway
- VaultwardenSQLiteCorrupt alert fires immediately on corruption (critical)
- VaultwardenIntegrityCheckStale alert if check hasn't run in 2h (warning)
- Prevents Vaultwarden from running on a corrupted DB for days unnoticed
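A minimal sketch of the check, assuming a generic Alpine image with sqlite3 and curl available and an in-cluster pushgateway service (image, DB path, and pushgateway URL are assumptions, not the exact manifest):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: vaultwarden-integrity-check
spec:
  schedule: "0 * * * *"          # hourly
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: check
              image: alpine:3.20   # assumed; needs sqlite3 + curl installed
              command:
                - /bin/sh
                - -c
                - |
                  # 1 if PRAGMA integrity_check returns "ok", else 0
                  out=$(sqlite3 /data/db.sqlite3 'PRAGMA integrity_check;')
                  [ "$out" = "ok" ] && v=1 || v=0
                  echo "vaultwarden_sqlite_integrity_ok $v" | curl \
                    --data-binary @- \
                    http://pushgateway:9091/metrics/job/vaultwarden_integrity
```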
Deploy error-pages service to show themed error pages instead of raw
Traefik 502/503/504 responses. Adds catch-all IngressRoute (priority 1)
for 404 on unknown hosts. Only 5xx intercepted to avoid breaking JSON APIs.
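The catch-all route can be sketched as follows (Traefik v3 matcher syntax; the service name and port are assumptions):

```yaml
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: error-pages-catch-all
spec:
  entryPoints:
    - websecure
  routes:
    - match: HostRegexp(`.+`)   # any host; v2 syntax is HostRegexp(`{host:.+}`)
      kind: Rule
      priority: 1               # lowest priority: only catches unmatched hosts
      services:
        - name: error-pages
          port: 8080
```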
Upstream v2.4.1 accidentally dropped idrac_power_supply_input_voltage from
the legacy RefreshPowerOld code path during a Huawei OEM support refactor.
Built a patched image that restores the single missing line:
mc.NewPowerSupplyInputVoltage(ch, psu.LineInputVoltage, id)
Image: viktorbarzin/idrac-redfish-exporter:2.4.1-voltage-fix
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Vault DB engine rotates passwords weekly but 5 stacks baked passwords
at Terraform plan time, causing stale credentials until next apply.
- real-estate-crawler: add vault-database ESO, use secret_key_ref in 3 deployments
- nextcloud: switch Helm chart to existingSecret for DB password
- grafana: add vault-database ESO, use envFromSecrets in Helm values
- woodpecker: use extraSecretNamesForEnvFrom, remove plan-time data source chain
- affine: add vault-database ESO, use secret_key_ref in deployment + init container
- Add container!="" filter to ContainerNearOOM to exclude system-level cadvisor entries
- Add $value to summaries: ContainerOOMKilled, ClusterMemoryRequestsHigh,
ContainerNearOOM, PVPredictedFull, NFSServerUnresponsive, NewTailscaleClient
- Add fallback field to all Slack receivers for clean push notifications
- Multiply ratio exprs by 100 for readable percentages
- Rename "New Tailscale client" to CamelCase "NewTailscaleClient"
- Add actionable hints to PodUnschedulable, NodeConditionBad, ForwardAuthFallbackActive
Data-driven user onboarding: add a JSON entry to Vault KV k8s_users,
apply vault + platform + woodpecker stacks, and everything is auto-generated.
Vault stack: namespace creation, per-user Vault policies with secret isolation
via identity entities/aliases, K8s deployer roles, CI policy update.
Platform stack: domains field in k8s_users type, TLS secrets per user namespace,
user domains merged into Cloudflare DNS, user-roles ConfigMap mounted in portal.
Woodpecker stack: admin list auto-generated from k8s_users, WOODPECKER_OPEN=true.
K8s-portal: dual-track onboarding (general/namespace-owner), namespace-owner
dashboard with Vault/kubectl commands, setup script adds Vault+Terraform+Terragrunt,
contributing page with CI pipeline template, versioned image tags in CI pipeline.
New: stacks/_template/ with copyable stack template for namespace-owners.
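An entry's shape might look like this; field names other than domains are assumptions inferred from the features above:

```yaml
# Hypothetical k8s_users entry (stored as JSON in Vault KV):
alice:
  namespace: alice
  domains:
    - alice.example.com   # hypothetical domain
```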
Kyverno ClusterPolicy reads dependency.kyverno.io/wait-for annotation
and injects busybox init containers that block until each dependency
is reachable (nc -z). Annotations added to 18 stacks (24 deployments).
Includes graceful-db-maintenance.sh script for planned DB maintenance
(scales dependents to 0, saves replica counts, restores on startup).
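The mechanism, sketched below; the dependency targets and the injected container's exact shape are assumptions:

```yaml
# On a Deployment:
metadata:
  annotations:
    dependency.kyverno.io/wait-for: "mysql.dbaas:3306,redis.redis:6379"
---
# Roughly what the ClusterPolicy injects per dependency:
spec:
  template:
    spec:
      initContainers:
        - name: wait-for-mysql
          image: busybox
          command:
            - sh
            - -c
            - until nc -z mysql.dbaas 3306; do sleep 2; done
```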
The API server doesn't re-resolve priority from PriorityClassName after
webhook mutation. Changed from remove+add to replace with explicit
priority=1200000 and preemptionPolicy=PreemptLowerPriority.
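In JSONPatch terms the mutation now looks roughly like this (patch paths assume a Pod-level mutation, and the class name is an assumption):

```yaml
# replace avoids relying on the API server re-resolving spec.priority
# from spec.priorityClassName; the values are set explicitly.
- op: replace
  path: /spec/priorityClassName
  value: gpu-workload   # assumed class name
- op: replace
  path: /spec/priority
  value: 1200000
- op: replace
  path: /spec/preemptionPolicy
  value: PreemptLowerPriority
```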
- Upgrade from 1.35.2 to 1.35.4 (fixes API key login userDecryptionOptions bug)
- Switch deployment strategy from RollingUpdate to Recreate (iSCSI PVC can't multi-attach)
Compaction of 5 years of TSDB blocks was OOM-killing at 3Gi (18 restarts
in 8h), causing sustained IO pressure on the PVE host's spinning disk.
Increase liveness probe delay to 300s so WAL replay completes before
the probe kills the pod.
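The probe change, as it would appear in the Prometheus container spec (path and port are Prometheus defaults, assumed here):

```yaml
livenessProbe:
  httpGet:
    path: /-/healthy
    port: 9090
  initialDelaySeconds: 300   # give WAL replay time to finish before probing
```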
SQLite backup via Online Backup API + copy of RSA keys,
attachments, sends, and config. 30-day retention with rotation.
Pod affinity ensures co-scheduling with vaultwarden for RWO PVC access.
SQLite on NFS causes DB corruption due to unreliable POSIX fcntl locking.
iSCSI provides a block device with a local filesystem where locking works
correctly. Same approach used for Redis, MySQL, PostgreSQL, etc.
After node2 OOM incident, right-size memory across the cluster by setting
requests=limits based on max_over_time(container_memory_working_set_bytes[7d])
with 1.3x headroom. Eliminates ~37Gi overcommit gap.
Categories:
- Safe equalization (50 containers): set req=lim where max7d well within target
- Limit increases (8 containers): raise limits for services spiking above current
- No Prometheus data (12 containers): conservatively set lim=req
- Exception: nextcloud keeps req=256Mi/lim=8Gi due to Apache memory spikes
Also increased the dbaas namespace quota from 12Gi to 16Gi to accommodate
MySQL's 4Gi limits across 3 replicas.
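The sizing rule described above can be sketched as a PromQL query, shown here as a hypothetical recording rule (the rule name is invented):

```yaml
groups:
  - name: rightsizing
    rules:
      - record: container:memory_target_bytes:max7d
        expr: |
          ceil(1.3 * max_over_time(
            container_memory_working_set_bytes{container!=""}[7d]
          ))
```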
CrowdSec Helm fix:
- Increase ResourceQuota requests.cpu from 1 to 4 — pods were at 302%
of quota, preventing scheduling during rolling upgrades
- Reduce Helm timeout from 3600s to 600s — 1 hour hang is excessive
- Add wait=true and wait_for_jobs=true for proper readiness checking
Prometheus startup guard:
- Add startup guard to 8 rate/increase-based alerts that false-fire
after Prometheus restarts (needs 2 scrapes for rate() to work):
PodCrashLooping, ContainerOOMKilled, CoreDNSErrors,
HighServiceErrorRate, HighService4xxRate, HighServiceLatency,
SSDHighWriteRate, HDDHighWriteRate
- Guard: and on() (time() - process_start_time_seconds) > 900
suppresses alerts for 15m after Prometheus startup
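Applied to one of the listed alerts, the guard looks roughly like this; the base expression and window are assumptions, only the guard clause is from the change:

```yaml
- alert: ContainerOOMKilled
  expr: |
    increase(kube_pod_container_status_restarts_total[10m]) > 0
    and on() (time() - process_start_time_seconds{job="prometheus"}) > 900
```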
Root cause: sum(rate(node_nfs_requests_total[5m])) == 0 was too fragile:
- rate() returns nothing after Prometheus restarts (needs 2 scrapes)
- Individual nodes show zero NFS rate during scrape gaps or low activity
- The sum() could hit zero during quiet hours + scrape gaps
New expression uses:
- changes() instead of rate() — works with a single scrape
- Per-instance aggregation: count nodes with any NFS counter change
- Threshold < 2 nodes: single-node restarts won't trigger, real NFS
outage (all nodes affected) will
- Prometheus startup guard: skip first 15m after restart to avoid
false positives from empty TSDB
- Wider 15m changes() window to smooth out scrape gaps
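Putting those pieces together, the new expression is roughly as below (metric and label names assumed from node_exporter defaults; the `or vector(0)` is an assumption to keep the count defined when no node shows any change):

```yaml
- alert: NFSServerUnresponsive
  expr: |
    (
      count(
        sum by (instance) (changes(node_nfs_requests_total[15m])) > 0
      )
      or vector(0)
    ) < 2
    and on() (time() - process_start_time_seconds{job="prometheus"}) > 900
```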
CPU limits cause CFS throttling even when nodes have idle capacity.
Move to a request-only CPU model: keep CPU requests for scheduling
fairness but remove all CPU limits. Memory limits stay (incompressible).
Changes across 108 files:
- Kyverno LimitRange policy: remove cpu from default/max in all 6 tiers
- Kyverno ResourceQuota policy: remove limits.cpu from all 5 tiers
- Custom ResourceQuotas: remove limits.cpu from 8 namespace quotas
- Custom LimitRanges: remove cpu from default/max (nextcloud, onlyoffice)
- RBAC module: remove cpu_limits variable and quota reference
- Freedify factory: remove cpu_limit variable and limits reference
- 86 deployment files: remove cpu from all limits blocks
- 6 Helm values files: remove cpu under limits sections