Commit graph

65 commits

Author SHA1 Message Date
Viktor Barzin
d91fbd4a60 [monitoring] Delete orphan server-power-cycle/main.sh with iDRAC default creds
## Context
stacks/monitoring/modules/monitoring/server-power-cycle/main.sh is an old
shell implementation of a power-cycle watchdog that polled the Dell iDRAC
on 192.168.1.4 for PSU voltage. It hardcoded the Dell iDRAC default
credentials (root:calvin) in 5 `curl -u root:calvin` calls. Both remotes
are public, so those credentials — and the implicit statement that 'this
host has not rotated the default BMC password' — have been exposed.

The current implementation is main.py in the same directory. It reads
iDRAC credentials from the environment variables `idrac_user` and
`idrac_password` (see module's iDRAC_USER_ENV_VAR / iDRAC_PASSWORD_ENV_VAR
constants), which are populated from Vault via ExternalSecret at runtime.
main.sh is not referenced by any Terraform, ConfigMap, or deploy script —
grep confirms no `file()` / `templatefile()` / `filebase64()` call loads
it, and no hand-rolled shell wrapper invokes it.
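
For illustration, the env-based call in main.py looks roughly like this
(a sketch: the env var names and the iDRAC IP come from this module; the
Redfish power path is an assumption, not copied from main.py):

  # credentials arrive via ExternalSecret-populated env vars, never hardcoded
  $ curl -sk -u "${idrac_user}:${idrac_password}" \
       "https://192.168.1.4/redfish/v1/Chassis/System.Embedded.1/Power"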

## This change
- git rm stacks/monitoring/modules/monitoring/server-power-cycle/main.sh

main.py is retained unchanged.

## What is NOT in this change
- iDRAC password rotation on 192.168.1.4. The BMC should be moved off the
  vendor default `calvin` regardless; rotation is tracked in the broader
  remediation plan and in the iDRAC web UI.
- A separate finding in stacks/monitoring/modules/monitoring/idrac.tf
  (the redfish-exporter ConfigMap has `default: username: root, password:
  calvin` as a fallback for iDRAC hosts not explicitly listed) is NOT
  addressed here — filed as its own task so the fix (drop the default
  block vs. source from env) can be considered in isolation.
- Git-history scrub of main.sh is pending the broader filter-repo pass.

## Test plan
### Automated
  $ grep -rn 'server-power-cycle/main\.sh\|main\.sh' \
       --include='*.tf' --include='*.hcl' --include='*.yaml' \
       --include='*.yml' --include='*.sh'
  (no consumer references)

### Manual Verification
1. `git show HEAD --stat` shows only the one deletion.
2. `test ! -e stacks/monitoring/modules/monitoring/server-power-cycle/main.sh`
3. `kubectl -n monitoring get deploy idrac-redfish-exporter` still shows
   the exporter running — unrelated to this file.
4. main.py continues to run its watchdog loop without regression, because
   it was never coupled to main.sh.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 19:42:55 +00:00
Viktor Barzin
7a884a0b97 [monitoring] Fix alerts for intentionally scaled-down services
PoisonFountainDown and ForwardAuthFallbackActive both fired because
poison-fountain was scaled to 0 replicas (intentional). Updated both
alert expressions to check kube_deployment_spec_replicas > 0 before
alerting on missing available replicas — if desired replicas is 0,
the service is intentionally down and should not alert.
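
Roughly the guarded shape, queried ad hoc against the Prometheus HTTP API
(a sketch: the available-replicas metric name and the Prometheus URL are
my assumptions; the commit only names kube_deployment_spec_replicas):

  $ curl -sG http://prometheus:9090/api/v1/query --data-urlencode \
      'query=kube_deployment_status_replicas_available{deployment="poison-fountain"} == 0
             and kube_deployment_spec_replicas{deployment="poison-fountain"} > 0'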

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 19:17:41 +00:00
Viktor Barzin
cdc851fc63 [alerts] Fix status-page-pusher crash + Prometheus backup push
## status-page-pusher (ExternalAccessDivergence false positive)
The pusher was crashing with `AttributeError: 'list' object has no attribute
'get'` at line 122 — the uptime-kuma-api library changed the heartbeats return
format. Fixed by making beat flattening more robust: handle any nesting of
lists/dicts in the heartbeat data, and add an isinstance check before
calling `.get()` on the latest beat.

## Prometheus backup (PrometheusBackupNeverRun)
The backup sidecar's Pushgateway push was silently failing because `wget
--post-file=-` needs `--header="Content-Type: text/plain"` for Pushgateway
to accept the Prometheus exposition format. Added the header. Also manually
pushed the metric to clear the `absent()` alert immediately.
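
The fixed push, sketched (generate_metrics, the Pushgateway URL, and the
job name are placeholders):

  # Pushgateway needs an explicit Content-Type for exposition-format POSTs
  $ generate_metrics | wget -qO- --header='Content-Type: text/plain' \
        --post-file=- http://pushgateway:9091/metrics/job/prometheus-backup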

Note: ExternalAccessDivergence still fires because 5 services (ollama, pdf,
poison, dns, travel) ARE genuinely externally unreachable but internally up.
This is a real issue (likely Cloudflare tunnel routing), not a false positive.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 18:29:43 +00:00
Viktor Barzin
f8facf44dd [infra] Fix rewrite-body plugin + cleanup TrueNAS + version bumps
## Context

The rewrite-body Traefik plugin (packruler/rewrite-body v1.2.0) silently
broke on Traefik v3.6.12 — every service using rybbit analytics or anti-AI
injection returned HTTP 200 with "Error 404: Not Found" body. Root cause:
middleware specs referenced plugin name `rewrite-body` but Traefik registered
it as `traefik-plugin-rewritebody`.

Migrated to maintained fork `the-ccsn/traefik-plugin-rewritebody` v0.1.3
which uses the correct plugin name. Also added `lastModified = true` and
`methods = ["GET"]` to anti-AI middleware to avoid rewriting non-HTML
responses.

## This change

- Replace packruler/rewrite-body v1.2.0 with the-ccsn/traefik-plugin-rewritebody v0.1.3
- Fix plugin name in all 3 middleware locations (ingress_factory, reverse-proxy factory, traefik anti-AI)
- Remove deprecated TrueNAS cloud sync monitor (VM decommissioned 2026-04-13)
- Remove CloudSyncStale/CloudSyncFailing/CloudSyncNeverRun alerts
- Fix PrometheusBackupNeverRun alert (for: 48h → 32d to match monthly sidecar schedule)
- Bump versions: rybbit v1.0.21→v1.1.0, wealthfolio v1.1.0→v3.2,
  networking-toolbox 1.1.1→1.6.0, cyberchef v10.24.0→v9.55.0
- MySQL standalone storage_limit 30Gi → 50Gi
- beads-server: fix Dolt workbench type casing, remove Authentik on GraphQL endpoint

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 05:51:52 +00:00
Viktor Barzin
e80b2f026f [infra] Migrate Terraform state from local SOPS to PostgreSQL backend
Two-tier state architecture:
- Tier 0 (infra, platform, cnpg, vault, dbaas, external-secrets): local
  state with SOPS encryption in git — unchanged, required for bootstrap.
- Tier 1 (105 app stacks): PostgreSQL backend on CNPG cluster at
  10.0.20.200:5432/terraform_state with native pg_advisory_lock.
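
For a Tier 1 stack this amounts to roughly the following (a sketch:
conn_str is the pg backend's real setting, but the role name and password
sourcing shown here are illustrative; scripts/tg fetches them from Vault):

  $ terraform init \
      -backend-config="conn_str=postgres://tfstate:${PG_PASS}@10.0.20.200:5432/terraform_state"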

Motivation: multi-operator friction (every workstation needed SOPS + age +
git-crypt), bootstrap complexity for new operators, and headless agents/CI
needing the full encryption toolchain just to read state.

Changes:
- terragrunt.hcl: conditional backend (local vs pg) based on tier0 list
- scripts/tg: tier detection, auto-fetch PG creds from Vault for Tier 1,
  skip SOPS and Vault KV locking for Tier 1 stacks
- scripts/state-sync: tier-aware encrypt/decrypt (skips Tier 1)
- scripts/migrate-state-to-pg: one-shot migration script (idempotent)
- stacks/vault/main.tf: pg-terraform-state static role + K8s auth role
  for claude-agent namespace
- stacks/dbaas: terraform_state DB creation + MetalLB LoadBalancer
  service on shared IP 10.0.20.200
- Deleted 107 .tfstate.enc files for migrated Tier 1 stacks
- Cleaned up per-stack tiers.tf (now generated by root terragrunt.hcl)

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 19:33:12 +00:00
Viktor Barzin
f538115c43 [dbaas] Migrate MySQL from InnoDB Cluster to standalone StatefulSet
## Context
Disk write analysis showed MySQL InnoDB Cluster writing ~95 GB/day for only
~35 MB of actual data due to Group Replication overhead (binlog, relay log,
GR apply log). The operator enforces GR even with serverInstances=1.

Bitnami Helm charts were deprecated by Broadcom in Aug 2025 — no free
container images available. Using official mysql:8.4 image instead.

## This change:
- Replace helm_release.mysql_cluster service selector with raw
  kubernetes_stateful_set_v1 using official mysql:8.4 image
- ConfigMap mysql-standalone-cnf: skip-log-bin, innodb_flush_log_at_trx_commit=2,
  innodb_doublewrite=ON (re-enabled for standalone safety; sketched below)
- Service selector switched to standalone pod labels
- Technitium: disable SQLite query logging (18 GB/day write amplification),
  keep PostgreSQL-only logging (90-day retention)
- Grafana datasource and dashboards migrated from MySQL to PostgreSQL
- Dashboard SQL queries fixed for PG integer division (::float cast)
- Updated CLAUDE.md service-specific notes
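
The cnf from the second bullet, sketched as it might be written from a
shell (key names and values are from this commit; the file path is
illustrative):

  $ cat <<'EOF' > mysql-standalone.cnf
  [mysqld]
  skip-log-bin                        # no binlog: the largest single write saving
  innodb_flush_log_at_trx_commit = 2  # fsync redo log ~once per second, not per commit
  innodb_doublewrite = ON             # re-enabled for standalone safety
  EOF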

## What is NOT in this change:
- InnoDB Cluster + operator removal (Phase 4, 7+ days from now)
- Stale Vault role cleanup (Phase 4)
- Old PVC deletion (Phase 4)

Expected write reduction: ~113 GB/day (MySQL 95 + Technitium 18)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 19:01:06 +00:00
Viktor Barzin
375a3d91d5 [monitoring] Exclude websocket protocol from HighServiceLatency alert
Traefik records websocket connection lifetimes (minutes to hours) as
"request duration." When websockets close, the full lifetime pollutes
the average latency metric — Authentik showed 6.7s avg (201s websocket
avg) vs 0.065s actual HTTP avg. This caused ~90 false alerts/day across
12 services (Authentik, Vaultwarden, Terminal, HA, etc.).
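
The filtered average, sketched against Traefik's service-level histogram
(metric and label names are the standard Traefik ones, recalled rather
than quoted from this commit):

  $ curl -sG http://prometheus:9090/api/v1/query --data-urlencode \
      'query=rate(traefik_service_request_duration_seconds_sum{protocol!="websocket"}[5m])
             / rate(traefik_service_request_duration_seconds_count{protocol!="websocket"}[5m])'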

Changes:
- Add protocol!="websocket" filter to HighServiceLatency alert expr
- Raise minimum traffic threshold from 0.01 to 0.05 rps to filter
  statistical noise from services with <3 req/min
- Remove .githooks/pre-commit file-size hook (blocked state commits)

Validated against 7-day historical data: 637 breaches → ~2 with both
filters applied (99.7% reduction).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 21:51:19 +00:00
Viktor Barzin
bd41bb9230 fix: cluster healthcheck fixes + Authentik upgrade to 2026.2.2
- Authentik: upgrade 2025.10.3 → 2025.12.4 → 2026.2.2 with DB restore
  and stepped migration. Switch to existingSecret, PgBouncer session mode.
- Mailserver: migrate email roundtrip probe from Mailgun to Brevo API
- Redis: fix HAProxy tcp-check regex (rstring), faster health intervals
- Nextcloud: fix Redis fallback to HAProxy service, update dependency
- MeshCentral: fix TLSOffload + certUrl init container for first-run
- Monitoring: remove authentik from latency alert exclusion
- Diun: simplify to webhook notifier, remove git auto-update

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 06:41:56 +00:00
Viktor Barzin
ff360a8807 feat: add external monitoring for all Cloudflare-proxied services
Add automatic external HTTPS monitors to Uptime Kuma for ~96 services
exposed via Cloudflare tunnel. A sync CronJob (every 10min) reads from
a Terraform-generated ConfigMap and creates/deletes [External] monitors
to match cloudflare_proxied_names. Status page groups these separately
as "External Reachability" and pushes a divergence metric to Pushgateway
when services are externally down but internally up. Prometheus alert
ExternalAccessDivergence fires after 15min of divergence.

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 19:04:45 +00:00
Viktor Barzin
ca2680c189 fix(post-mortem): add NFSHighRPCRetransmissions alert + migrate alertmanager to proxmox-lvm-encrypted [PM-2026-04-14]
- Add PrometheusRule: NFSHighRPCRetransmissions fires when node_nfs_rpc_retransmissions_total
  rate exceeds 5/s for 5m — catches NFS server degradation before pod failures
  cascade (see the sketch after this list)
- Migrate alertmanager PV from NFS (192.168.1.127:/srv/nfs/alertmanager) to proxmox-lvm-encrypted
  eliminating the circular dependency where alertmanager couldn't alert about NFS failures
- Set force_update=true on prometheus helm_release to handle StatefulSet volumeClaimTemplate changes
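
A minimal sketch of the retransmission expression as described above (the
5m rate window is assumed):

  $ curl -sG http://prometheus:9090/api/v1/query \
      --data-urlencode 'query=rate(node_nfs_rpc_retransmissions_total[5m]) > 5'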

Co-Authored-By: postmortem-todo-resolver <noreply@anthropic.com>
2026-04-14 18:05:33 +00:00
Viktor Barzin
0901dd5f61 state(monitoring): update encrypted state 2026-04-14 17:52:13 +00:00
Viktor Barzin
ea18116da9 fix: NFS outage recovery — migrate to NFSv4, add alerting
NFS server restart broke NFSv3 (lockd kernel bug on PVE 6.14).
All 52 NFS PVs patched to nfsvers=4, NFSv3 disabled on PVE.
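
Equivalent, per mount, to (a sketch; server IP and export path as used
elsewhere in this repo):

  # NFSv4 carries locking in-protocol, so the broken lockd is bypassed entirely
  $ mount -t nfs -o nfsvers=4 192.168.1.127:/srv/nfs /mnt/test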

Changes:
- nfs_volume module: add nfsvers=4 mount option
- nfs-csi StorageClass: add nfsvers=4 mount option
- dbaas: MySQL serverInstances 3→1, mysql-native-password=ON
- monitoring: add NFSCSINodeDown and NFSMountFailures alerts

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 10:28:27 +00:00
Viktor Barzin
82f674a0b4 rename weekly-backup → daily-backup across scripts, timers, services, and docs [ci skip]
Reflects the schedule change from weekly to daily. All references updated:
- scripts/weekly-backup.{sh,timer,service} → daily-backup.*
- Pushgateway job name: weekly-backup → daily-backup
- Prometheus metric names: weekly_backup_* → daily_backup_*
- All docs, runbooks, AGENTS.md, CLAUDE.md, proxmox-inventory
- offsite-sync dependency: After=daily-backup.service

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 18:37:04 +00:00
Viktor Barzin
38d51ab0af deprecate TrueNAS: migrate Immich NFS to Proxmox, remove all 10.0.10.15 references [ci skip]
- Migrate Immich (8 NFS PVs, 1.1TB) from TrueNAS to Proxmox host NFS
- Update config.tfvars nfs_server to 192.168.1.127 (Proxmox)
- Update nfs-csi StorageClass share to /srv/nfs
- Update scripts (weekly-backup, cluster-healthcheck) to Proxmox IP
- Delete obsolete TrueNAS scripts (nfs_exports.sh, truenas-status.sh)
- Rewrite nfs-health.sh for Proxmox NFS monitoring
- Update Freedify nfs_music_server default to Proxmox
- Mark CloudSync monitor CronJob as deprecated
- Update Prometheus alert summaries
- Update all architecture docs, AGENTS.md, and reference docs
- Zero PVs remain on TrueNAS — VM ready for decommission

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 14:42:07 +00:00
Viktor Barzin
1c300a14cf mailserver: overhaul inbound delivery, monitoring, CrowdSec, and migrate to Brevo relay
Inbound:
- Direct MX to mail.viktorbarzin.me (ForwardEmail relay attempted and abandoned)
- Dedicated MetalLB IP 10.0.20.202 with ETP: Local for CrowdSec real-IP detection
- Removed Cloudflare Email Routing (can't store-and-forward)
- Fixed dual SPF violation, hardened to -all
- Added MTA-STS, TLSRPT, imported Rspamd DKIM into Terraform
- Removed dead BIND zones from config.tfvars (199 lines)

Outbound:
- Migrated from Mailgun (100/day) to Brevo (300/day free)
- Added Brevo DKIM CNAMEs and verification TXT

Monitoring:
- Probe frequency: 30m → 20m, alert thresholds adjusted to 60m
- Enabled Dovecot exporter scraping (port 9166)
- Added external SMTP monitor on public IP

Documentation:
- New docs/architecture/mailserver.md with full architecture
- New docs/architecture/mailserver-visual.html visualization
- Updated monitoring.md, CLAUDE.md, historical plan docs
2026-04-12 22:24:38 +01:00
Viktor Barzin
82b0f6c4cb truenas deprecation: migrate all non-immich storage to proxmox NFS
- Migrate 7 backup CronJobs to Proxmox host NFS (192.168.1.127)
  (etcd, mysql, postgresql, nextcloud, redis, vaultwarden, plotting-book)
- Migrate headscale backup, ebook2audiobook, osm_routing to Proxmox NFS
- Migrate servarr (lidarr, readarr, soulseek) NFS refs to Proxmox
- Remove 79 orphaned TrueNAS NFS module declarations from 49 stacks
- Delete stacks/platform/modules/ (27 dead module copies, 65MB)
- Update nfs-truenas StorageClass to point to Proxmox (192.168.1.127)
- Remove iscsi DNS record from config.tfvars
- Fix woodpecker persistence config and alertmanager PV

Only Immich (8 PVCs, ~1.4TB) remains on TrueNAS.
2026-04-12 14:35:39 +01:00
Viktor Barzin
5da6d75094 fix(monitoring): PodCrashLooping alert now fires only for active CrashLoopBackOff
Switch from restart-count based detection (increase restarts[1h] > 5) to
waiting-reason based (kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"}).
Alert auto-resolves when pod recovers, making it clear whether the issue is active.
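
Sketched as an ad-hoc query (the == 1 guard is my addition for clarity):

  $ curl -sG http://prometheus:9090/api/v1/query --data-urlencode \
      'query=kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"} == 1'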
2026-04-12 12:41:07 +01:00
Viktor Barzin
d7de5de07c fix(monitoring): add pve_* metrics to Prometheus whitelist
ProxmoxMetricsMissing alert was firing because pve_* metrics were
excluded from the kubernetes-service-endpoints metric_relabel_configs
whitelist. The exporter was scraping successfully but metrics were
being dropped before ingestion.
2026-04-10 22:58:49 +01:00
Viktor Barzin
6101fb99f9 Reduce disk write amplification across cluster (~200-350 GB/day savings) [ci skip]
- Prometheus: persist metric whitelist (keep rules) to Helm template, preventing
  regression from 33K to 250K samples/scrape on next apply. Reduce retention 52w→26w.
- MySQL InnoDB: aggressive write reduction — flush_log_at_trx_commit=0, sync_binlog=0,
  doublewrite=OFF, io_capacity=100/200, redo_log=1GB, flush_neighbors=1, reduced page cleaners.
- etcd: increase snapshot-count 10000→50000 to reduce WAL snapshot frequency.
- VM disks: enable TRIM/discard passthrough to LVM thin pool via create-vm module.
- Cloud-init: enable fstrim.timer, journald limits (500M/7d/compress).
- Kubelet: containerLogMaxSize=10Mi, containerLogMaxFiles=3.
- Technitium: DNS query log retention 0→30 days (was unlimited writes to MySQL).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 19:01:21 +00:00
Viktor Barzin
fbdb57eb58 fix(monitoring): UsingInverterEnergyForTooLong only alerts when stuck
Changed from simple time-based (24h on inverter) to condition-based:
only fires when on inverter AND battery charge <80% for 1h. This means
normal daytime inverter usage won't trigger alerts — only fires when
the grid is unavailable and battery is draining.
2026-04-06 15:43:47 +03:00
Viktor Barzin
b5689afe6d fix(monitoring): tune alert thresholds to reduce false positives
- HighPowerUsage: raise from 200W to 300W (R730 idles at ~230W)
- HighServiceLatency: exclude headscale (WebSocket) and authentik (SSO)
  from latency checks — both have inherently high avg response times
2026-04-06 15:39:23 +03:00
Viktor Barzin
91242b0b40 feat(monitoring): add comprehensive hardware exporter alerts
Added 20 new alerts across 3 rule groups:

Power (8 new):
- UPSAlarmsActive, UPSBatteryDegraded, UPSOverloaded, UPSOutputVoltageAbnormal
- ATSFault, ATSPowerFault, ATSOverload, ATSInputVoltageAbnormal

Server Health (10 new):
- iDRACSystemUnhealthy, iDRACPowerSupplyUnhealthy, iDRACMemoryUnhealthy
- iDRACStorageDriveUnhealthy, iDRACSSDWearCritical/Warning
- iDRACServerPoweredOff, ProxmoxExporterDown
- FuseMainFault, FuseGarageFault

Metric Staleness (3 new):
- FuseMainMetricsMissing, FuseGarageMetricsMissing, ProxmoxMetricsMissing

Plus 4 new inhibition rules for alert cascade protection.
2026-04-06 15:31:50 +03:00
Viktor Barzin
6abc0b9742 security(monitoring): remove public SNMP exporter ingress
snmp-exporter-external.viktorbarzin.me exposed UPS metrics to the
public internet with no authentication. Removed the external ingress
and Cloudflare DNS record. ha-sofia now accesses the SNMP exporter
via the existing .lan ingress (allow_local_access_only=true) using
direct IP 10.0.20.200 with Host header.
2026-04-06 15:23:56 +03:00
Viktor Barzin
7f141faa8c Fix: Expose SNMP exporter externally to ha-sofia via Cloudflare tunnel
- Add snmp-exporter-ingress-external module for external HTTPS access to snmp-exporter
- Register snmp-exporter-external.viktorbarzin.me in Cloudflare DNS (proxied via tunnel)
- Update ha-sofia REST integration to use external HTTPS endpoint
- Fix ingress backend service routing to use existing snmp-exporter service
- All UPS sensors on ha-sofia now report values (voltage, battery %, load, etc.)
2026-04-06 15:14:19 +03:00
Viktor Barzin
d009f9a0f2 add 3-2-1 backup pipeline: weekly PVC file copy, NFS mirror, pfsense, offsite sync
- weekly-backup.sh: mounts LVM thin snapshots ro, rsyncs files to /mnt/backup/pvc-data
  with --link-dest versioning (4 weeks). Also mirrors NFS backup dirs from TrueNAS,
  backs up pfsense (config.xml + full tar), PVE host config, and prunes >7d snapshots.
- offsite-sync-backup.sh: rsync --files-from manifest to Synology (no full dir walk).
  Monthly full --delete sync on 1st Sunday. After=weekly-backup.service dependency.
- lvm-pvc-snapshot.timer: changed to daily 03:00 (was 2x daily)
- Prometheus alerts: WeeklyBackupStale, WeeklyBackupFailing, PfsenseBackupStale,
  OffsiteBackupSyncStale, BackupDiskFull. LVMSnapshotStale threshold 24h→48h.
2026-04-06 14:53:28 +03:00
Viktor Barzin
fe342a974b monitoring + proxmox-csi: LVM snapshot RBAC, pushgateway NodePort, backup dashboard
- proxmox-csi: add RBAC for PVE host snapshot restore script
- monitoring: expose Pushgateway via NodePort for PVE LVM snapshot metrics
- monitoring: add backup health Grafana dashboard
2026-04-06 11:57:41 +03:00
Viktor Barzin
0f2ef356d6 fix: remove ISCSICSIControllerDown alert (democratic-csi decommissioned)
iSCSI CSI (democratic-csi) was replaced by proxmox-csi in April 2026.
Controller is intentionally scaled to 0. Remove the stale alert and
update CSIDriverCrashLoop to monitor proxmox-csi instead of iscsi-csi.
2026-04-05 23:53:18 +03:00
Viktor Barzin
3cd560d4d9 fix: bank sync alerts - remove {{ $labels.job }} that Helm provider silently drops [ci skip]
The Terraform Helm provider's YAML diff comparison silently ignores rules
containing {{ $labels.job }} in annotations, preventing the alerts from being
applied. Also syncs alerts to platform stack tpl.
2026-04-05 20:07:51 +03:00
Viktor Barzin
3217a5f605 add bank sync monitoring with Pushgateway metrics and Prometheus alerts [ci skip]
CronJob now captures HTTP status, pushes bank_sync_success/duration/last_success
to Pushgateway. Alerts: BankSyncFailing (6h), BankSyncStale (48h).
2026-04-05 19:32:40 +03:00
Viktor Barzin
ce7b8c2b2e add pvc-autoresizer for automatic PVC expansion before volumes fill up [ci skip]
Deploy topolvm/pvc-autoresizer controller that monitors kubelet_volume_stats
via Prometheus and auto-expands annotated PVCs. Annotated all 9 block-storage
PVCs (proxmox-lvm) with per-PVC thresholds and max limits. Updated PVFillingUp
alert to critical/10m (means auto-expansion failed) and added PVAutoExpanding
info alert at 80%.
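
Per-PVC annotation shape, sketched from the pvc-autoresizer docs as I
recall them (the PVC name and values here are illustrative, not from this
commit):

  $ kubectl -n monitoring annotate pvc prometheus-data \
      resize.topolvm.io/threshold=20% \
      resize.topolvm.io/storage_limit=100Gi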
2026-04-03 23:30:00 +03:00
Viktor Barzin
dd59512153 migrate iSCSI block volumes from democratic-csi to Proxmox CSI [ci skip]
Replace TrueNAS iSCSI (democratic-csi) with Proxmox CSI plugin for all
block storage PVCs. Eliminates double-CoW (ZFS + LVM-thin) and removes
the iSCSI network hop for database I/O.

New stack: stacks/proxmox-csi/ — deploys proxmox-csi-plugin Helm chart
with StorageClass "proxmox-lvm" using existing local-lvm thin pool.

Migrated PVCs (12 total):
- Phase 1 standalone: plotting-book, novelapp, vaultwarden, nextcloud, prometheus
- Phase 2 StatefulSets: CNPG PostgreSQL (2), MySQL InnoDB (3), Redis (2)

All services verified healthy post-migration.
2026-04-02 22:13:04 +03:00
Viktor Barzin
a2b1b0e817 remove caretta network mapper to free 3Gi cluster memory
Caretta eBPF DaemonSet was using 600Mi x 5 nodes = 3Gi total for
non-critical network topology visualization. Removing it to free
memory for novelapp and aiostreams which were stuck in Pending.
2026-03-29 22:17:35 +03:00
Viktor Barzin
878b556179 state(monitoring): update encrypted state 2026-03-29 01:04:11 +02:00
Viktor Barzin
06490b0634 reduce Prometheus cardinality round 3: drop 44k more series
- cadvisor: drop unused network error/dropped counters, unused cpu
  metrics (load_avg, system, user), unused memory metrics (cache,
  failcnt, kernel, mapped_file, max_usage, rss, swap, active/inactive)
- kubelet: drop all unused histogram buckets (storage_operation, csi,
  volume_operation, image_pull, http_requests, rest_client, pod_worker,
  volume_metric, cgroup_manager) + kubernetes_feature_enabled
- apiserver: drop flowcontrol/rest_client histograms, longrunning_requests
- traefik: drop all router-level metrics (keep service + entrypoint)
- service-endpoints: drop coredns histograms, node_filesystem_*

Post-relabel: 332k → 99k (-70%), ingestion: 5,480 → 1,659 samples/sec (-70%)
2026-03-29 00:27:23 +02:00
Viktor Barzin
a9ca65bc31 reduce Prometheus cardinality round 2: drop 137k more series
- fix traefik double-scrape: kubernetes-pods job was scraping traefik
  pods again (43k duplicate series). Added namespace drop rule.
- drop unused cadvisor metrics: container_fs_*, container_blkio_*,
  container_pressure_*, container_spec_*, and misc (30k series)
- drop more apiserver histogram buckets: watch_list, watch_cache,
  response_sizes, watch_events, admission_controller, workqueue (11k)
- drop unused kube-state-metrics: replicaset_*, pod_tolerations,
  pod_labels, endpoint_*, service_*, configmap_*, etc (53k series)

Post-relabel samples: 332k → 142k (-57%)
Ingestion rate: 5,480 → 3,239 samples/sec (-41%)
2026-03-28 23:51:24 +02:00
Viktor Barzin
4b3851829b feat: organize Grafana dashboards into folders
Enable sidecar folderAnnotation + foldersFromFilesStructure to group
26 dashboards into 5 managed folders:

- Cluster (6): k8s health, API server, nodes, pods, kube-state-metrics
- Networking (6): CoreDNS, Technitium, Headscale, ingress, network traffic
- Hardware (5): node-exporter, proxmox, iDRAC, UPS, NVIDIA GPU
- Operations (4): backup health, registry, audit logs, Loki
- Applications (2): realestate-crawler, qBittorrent

Dashboard-to-folder mapping defined in grafana.tf locals block.
External stacks (headscale, technitium) annotated individually.
2026-03-28 16:23:49 +02:00
Viktor Barzin
725fefe565 fix: add Headscale monitoring, alerts, and pin UI image
- Add 4 Prometheus alerts: HeadscaleDown (critical), NoOnlineNodes,
  HighHTTPLatency, HighErrorRate
- Add Grafana dashboard with node count, map responses, HTTP latency,
  nodestore operations, and memory panels
- Pin headscale-ui to digest sha256:015f5ba0... (was :latest)
- Set disable_check_updates: true to skip GitHub check on startup
- Uptime Kuma monitor already existed (id=19, 300s interval)
2026-03-28 16:07:04 +02:00
Viktor Barzin
8a5a53a832 fix alerts and reduce Prometheus disk write rate
- linkwarden: add Reloader match annotation to DB secret so pods
  auto-restart on Vault credential rotation (was causing 100% 5xx)
- authentik: increase memory limits (server 1Gi→1.5Gi, worker 896Mi→1Gi)
  to prevent OOM kills
- prometheus: drop 113k high-cardinality series to reduce HDD write rate
  from ~8.8 to ~6.0 MB/s (31% reduction):
  - drop all traefik/apiserver/etcd histogram bucket metrics
  - drop goflow2_flow_process_nf_templates_total (9.3k series)
  - drop container_tasks_state and container_memory_failures_total
  - rewrite HighServiceLatency alert to use avg latency (_sum/_count)
  - update cluster_health dashboard to match
- raise KubeletRuntimeOperationsLatency threshold from 30s to 60s
2026-03-28 15:42:14 +02:00
Viktor Barzin
04a96955c0 fix: exclude NFS PVs from PVFillingUp alert
NFS PVs report the entire NFS server filesystem usage (e.g., navidrome-music
shows 5.3 TiB Synology volume at 97%), not PVC-specific usage. Filter out
PVs with >1TiB capacity (always NFS mounts; iSCSI PVCs are 10-50Gi).
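
One way to express the cutoff (a sketch: the 10% free-space threshold and
the 1e12 (~1TiB) bound are illustrative; metric names are the standard
kubelet volume-stats ones, and the exact alert expr is not quoted here):

  $ curl -sG http://prometheus:9090/api/v1/query --data-urlencode \
      'query=kubelet_volume_stats_available_bytes / kubelet_volume_stats_capacity_bytes < 0.1
             and kubelet_volume_stats_capacity_bytes < 1e12'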
2026-03-28 01:14:05 +02:00
Viktor Barzin
ae21502698 fix: exclude disabled London Pi cloud sync task from CloudSyncFailing alert
Task 2 (Backup London pi) fails because 192.168.8.102 is unreachable.
Disabled task via TrueNAS, excluded task_id=2 from alert rule.
2026-03-27 15:15:48 +02:00
Viktor Barzin
b8a5740138 reduce alert noise: remove 4 memory alerts, raise latency threshold [ci skip]
- Remove ClusterMemoryRequestsHigh, ContainerNearOOM, NodeLowFreeMemory,
  NodeMemoryPressureTrending — all fire regularly due to intentional
  memory overcommit and are not actionable
- Keep ContainerOOMKilled (actionable — container actually died)
- Raise HighServiceLatency p99 threshold from 10s to 30s to ignore
  transient spikes
2026-03-26 01:15:18 +02:00
Viktor Barzin
4e74f816bc cleanup: remove calibre and audiobookshelf stacks after ebooks migration [ci skip]
Both services migrated to unified ebooks namespace. Remove:
- Old stack directories and Terraform state
- calibre references from monitoring namespace lists
- calibre/audiobookshelf from operational scripts
2026-03-25 23:56:07 +02:00
Viktor Barzin
78dec8f0ad add e2e email roundtrip monitoring
CronJob (every 30 min) sends test email via Mailgun API to
smoke-test@viktorbarzin.me, verifies IMAP delivery in spam@ catch-all,
deletes test email, pushes metrics to Pushgateway + Uptime Kuma.

Prometheus alerts: EmailRoundtripFailing, EmailRoundtripStale,
EmailRoundtripNeverRun. Uptime Kuma: SMTP/IMAP port checks + E2E push.
2026-03-25 22:50:22 +02:00
Viktor Barzin
e455bd06f4 state(monitoring): update encrypted state 2026-03-25 11:04:29 +02:00
Viktor Barzin
d20c5e5535 add backup_output_bytes metric and cloudsync_transferred_bytes to backup dashboard
- All 7 backup CronJobs now push backup_output_bytes (file size after backup)
- Cloud Sync monitor parses rclone transfer stats into cloudsync_transferred_bytes
- Grafana dashboard: new Output (MiB) table column, Output Size Trend panel,
  Write Throughput panel, Cloud Sync Transfer Volume bargauge
- All timeseries panels use points-only draw style (discrete backup snapshots)
- etcd backup restructured: init_container for etcdctl (distroless image),
  busybox sidecar for metrics push + purge, ClusterFirstWithHostNet DNS
- Fixed pre-existing curl missing in postgres:16.4-bullseye (immich, dbaas PG)
- Fixed grep -oP not available in alpine/busybox (cloud sync monitor)
2026-03-25 10:44:53 +02:00
Viktor Barzin
42eb85c578 fix: rybbit init port, mysql memory limit, metallb alert selector
- rybbit-client: fix Kyverno wait-for port 3001 → 80 (service port, not targetPort)
- dbaas: increase MySQL memory limit 4Gi → 5Gi (mysql-cluster-1 at 95.9%)
- dbaas: bump ResourceQuota limits.memory 24Gi → 27Gi to accommodate
- monitoring: fix MetalLBControllerDown alert selector for v0.15 (controller → metallb-controller)
2026-03-24 18:55:07 +02:00
Viktor Barzin
d9eaf42f36 exclude iDRAC from HighServiceLatency alert
iDRAC Redfish exporter is inherently slow, causing noisy alerts.
2026-03-23 22:51:42 +02:00
Viktor Barzin
304f0de43a add Metric Staleness alerts for UPS, iDRAC, ATS, and HA metrics
Replace fragile NoiDRACData alert with proper absent() checks. Add
UPSMetricsMissing (critical), iDRACRedfishMetricsMissing,
iDRACSNMPMetricsMissing, ATSMetricsMissing, and
HomeAssistantMetricsMissing alerts. Update PowerOutage and NodeDown
inhibit rules to suppress staleness alerts during outages.
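
The absent() pattern, sketched with the one metric this log names
elsewhere (commit 6a2bee93b5):

  # returns a single series with value 1 only when no such series exists at all
  $ curl -sG http://prometheus:9090/api/v1/query \
      --data-urlencode 'query=absent(idrac_power_supply_input_voltage)'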
2026-03-23 22:24:17 +02:00
Viktor Barzin
6a2bee93b5 fix(monitoring): use patched idrac exporter with PSU input voltage metric
The upstream ghcr.io/mrlhansen/idrac_exporter:2.4.1 is missing
NewPowerSupplyInputVoltage in RefreshPowerOld, so the R730 iDRAC
never emits idrac_power_supply_input_voltage. Switch to the patched
viktorbarzin/idrac-redfish-exporter:2.4.1-voltage-fix image.
2026-03-23 22:07:36 +02:00
Viktor Barzin
0a294a30a6 add backup IO logging, Pushgateway metrics, and Grafana dashboard
- Add /proc/self/io read/write tracking to vault raft-backup and etcd backup
- Push backup_duration_seconds, backup_read_bytes, backup_written_bytes,
  backup_last_success_timestamp to Pushgateway from all 6 backup CronJobs
  (etcd skipped — distroless image has no wget/curl)
- Add cloudsync_duration_seconds metric to cloudsync-monitor
- New "Backup Health" Grafana dashboard with 8 panels: time since last backup,
  overview table, duration/IO trends, cloud sync status, alerts, CronJob schedule
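
The counters being tracked, readable from any shell (a sketch; note that
a script must read /proc/$$/io, since /proc/self/io passed to grep would
be grep's own accounting):

  # read_bytes/write_bytes count what actually hit the storage layer
  $ grep -E '^(read_bytes|write_bytes):' /proc/$$/io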
2026-03-23 12:19:01 +02:00