Commit graph

22 commits

e455bd06f4 | Viktor Barzin | 2026-03-25 11:04:29 +02:00
state(monitoring): update encrypted state

d20c5e5535 | Viktor Barzin | 2026-03-25 10:44:53 +02:00
add backup_output_bytes metric and cloudsync_transferred_bytes to backup dashboard
- All 7 backup CronJobs now push backup_output_bytes (file size after backup)
- Cloud Sync monitor parses rclone transfer stats into cloudsync_transferred_bytes
- Grafana dashboard: new Output (MiB) table column, Output Size Trend panel,
  Write Throughput panel, Cloud Sync Transfer Volume bargauge
- All timeseries panels use points-only draw style (discrete backup snapshots)
- etcd backup restructured: init container for etcdctl (distroless image),
  busybox sidecar for metrics push + purge, ClusterFirstWithHostNet DNS
- Fixed pre-existing issue where curl is missing in postgres:16.4-bullseye (immich, dbaas PG)
- Fixed use of grep -oP, which is unavailable in alpine/busybox (cloud sync monitor)

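The grep -oP fix can be sketched with plain POSIX awk, which BusyBox ships. This is an illustrative sketch, not the repo's actual script: the function name, sample rclone line, and unit table are assumptions.

```shell
#!/bin/sh
# BusyBox-safe parse of rclone's "Transferred:" stats line (no grep -oP / PCRE).
# Function name and sample input are illustrative.
parse_transferred_bytes() {
  printf '%s\n' "$1" | awk '/Transferred:/ {
    v = $2; u = $3                      # e.g. "105.731" and "MiB"
    m["B"] = 1; m["KiB"] = 1024; m["MiB"] = 1048576; m["GiB"] = 1073741824
    printf "%.0f\n", v * m[u]
  }'
}

parse_transferred_bytes "Transferred:        105.731 MiB / 105.731 MiB, 100%, 4.2 MiB/s, ETA 0s"
```

Fixed-string field splitting covers this case; PCRE (`grep -oP`) is only needed for lookarounds, which the rclone stats line does not require.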
42eb85c578 | Viktor Barzin | 2026-03-24 18:55:07 +02:00
fix: rybbit init port, mysql memory limit, metallb alert selector
- rybbit-client: fix Kyverno wait-for port 3001 → 80 (service port, not targetPort)
- dbaas: increase MySQL memory limit 4Gi → 5Gi (mysql-cluster-1 at 95.9%)
- dbaas: bump ResourceQuota limits.memory 24Gi → 27Gi to accommodate the increase
- monitoring: fix MetalLBControllerDown alert selector for v0.15 (controller → metallb-controller)

d9eaf42f36 | Viktor Barzin | 2026-03-23 22:51:42 +02:00
exclude iDRAC from HighServiceLatency alert
The iDRAC Redfish exporter is inherently slow, causing noisy alerts.

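Exclusions like this are typically done with a negative label matcher in the alert expression. A hypothetical fragment only; the metric, label values, and threshold here are all assumptions, and only the negative-matcher intent is taken from the commit:

```yaml
- alert: HighServiceLatency
  # job value, metric name, and threshold are assumptions;
  # the point is the negative regex matcher excluding the iDRAC exporter
  expr: |
    histogram_quantile(0.95,
      rate(http_request_duration_seconds_bucket{job!~"idrac.*"}[5m])) > 1
  for: 10m
  labels:
    severity: warning
```
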
304f0de43a | Viktor Barzin | 2026-03-23 22:24:17 +02:00
add Metric Staleness alerts for UPS, iDRAC, ATS, and HA metrics
Replace fragile NoiDRACData alert with proper absent() checks. Add
UPSMetricsMissing (critical), iDRACRedfishMetricsMissing,
iDRACSNMPMetricsMissing, ATSMetricsMissing, and
HomeAssistantMetricsMissing alerts. Update PowerOutage and NodeDown
inhibit rules to suppress staleness alerts during outages.

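The absent() pattern the commit describes might look like the following. The alert name matches the commit; the metric name and timings are assumptions:

```yaml
- alert: UPSMetricsMissing
  # absent() yields a single series with value 1 only when no sample matches,
  # so this fires when the UPS exporter stops reporting entirely; a plain
  # threshold alert would simply go silent instead.
  expr: absent(ups_battery_charge_percent)
  for: 10m
  labels:
    severity: critical
```
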
6a2bee93b5 | Viktor Barzin | 2026-03-23 22:07:36 +02:00
fix(monitoring): use patched idrac exporter with PSU input voltage metric
The upstream ghcr.io/mrlhansen/idrac_exporter:2.4.1 is missing
NewPowerSupplyInputVoltage in RefreshPowerOld, so the R730 iDRAC
never emits idrac_power_supply_input_voltage. Switch to the patched
viktorbarzin/idrac-redfish-exporter:2.4.1-voltage-fix image.

0a294a30a6 | Viktor Barzin | 2026-03-23 12:19:01 +02:00
add backup IO logging, Pushgateway metrics, and Grafana dashboard
- Add /proc/self/io read/write tracking to vault raft-backup and etcd backup
- Push backup_duration_seconds, backup_read_bytes, backup_written_bytes,
  backup_last_success_timestamp to Pushgateway from all 6 backup CronJobs
  (etcd skipped: distroless image has no wget/curl)
- Add cloudsync_duration_seconds metric to cloudsync-monitor
- New "Backup Health" Grafana dashboard with 8 panels: time since last backup,
  overview table, duration/IO trends, cloud sync status, alerts, CronJob schedule

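The /proc-based IO accounting can be sketched as below (Linux only). The Pushgateway URL and job name are assumptions, and the real CronJobs do the actual backup work in between; this sketch only prints the metric lines it would push.

```shell
#!/bin/sh
# Sketch: track this shell's cumulative IO via /proc/<pid>/io and format
# Pushgateway metrics. $$ is the shell's own PID (reading /proc/self/io from
# inside awk would report awk's counters, not the script's).
START=$(date +%s)

# ... the actual backup would run here ...

READ_BYTES=$(awk '/^read_bytes:/ {print $2}' "/proc/$$/io")
WRITE_BYTES=$(awk '/^write_bytes:/ {print $2}' "/proc/$$/io")
DURATION=$(( $(date +%s) - START ))

METRICS="backup_duration_seconds $DURATION
backup_read_bytes $READ_BYTES
backup_written_bytes $WRITE_BYTES
backup_last_success_timestamp $(date +%s)"
printf '%s\n' "$METRICS"

# A real job would push instead of print, e.g. (URL is an assumption):
#   printf '%s\n' "$METRICS" | curl --data-binary @- \
#     http://pushgateway.monitoring:9091/metrics/job/vault-raft-backup
```

Pushing to Pushgateway suits CronJobs because the pod is gone before Prometheus would ever scrape it directly.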
5652972c53 | Viktor Barzin | 2026-03-23 10:31:44 +02:00
fix dashboard: add refIds, explicit panel IDs, fix CrowdSec bouncer metric
- Added refId to all targets (required by Grafana)
- Added explicit panel IDs for stable references
- Fixed CrowdSec bouncer metric: cs_lapi_bouncer_requests_total doesn't
  exist; use cs_lapi_route_requests_total instead
- Added drawStyle/showPoints to all timeseries panels
- Updated via MySQL + ConfigMap + Grafana restart

9527f62c2e | Viktor Barzin | 2026-03-23 10:16:46 +02:00
fix network traffic dashboard: use only available GoFlow2 metrics
GoFlow2 v2 only exposes aggregate metrics (traffic_bytes_total,
process_nf_total, delay_seconds) with no per-source/dest labels.
Removed panels referencing non-existent src_addr/dst_port labels.
Replaced them with flowset records by type, and separated bytes and
flows into their own panels to avoid scale issues.

55246c8b5d | Viktor Barzin | 2026-03-23 03:06:56 +02:00
add network traffic monitoring and adversary detection
- CrowdSec: add syslog listener for pfSense firewall logs (NodePort 30514),
  add postfix/dovecot log acquisition, install pf/postfix/dovecot/sshd collections
- Monitoring: add DNS anomaly CronJob (queries Technitium every 15m, DGA detection,
  pushes metrics to Pushgateway)
- Grafana: add "Network Traffic & Adversary Detection" dashboard
  (GoFlow2 flows, CrowdSec decisions, DNS anomaly metrics)

pfSense changes applied live: syslog forwarding to 10.0.20.202:30514,
Snort suppress rules for http_inspect false positives, IPS connectivity policy enabled

877cd15b45 | Viktor Barzin | 2026-03-23 03:04:33 +02:00
fix: increase tier-2-gpu quota to 12Gi, add NvidiaExporterDown alert
- Increase tier-2-gpu requests.memory from 8Gi to 12Gi to give immich
  ML pods scheduling headroom (was at 96% utilization)
- Add critical NvidiaExporterDown Prometheus alert that fires when GPU
  metrics are absent for >10 minutes (faster than generic ScrapeTargetDown)

e4cf0dee83 | Viktor Barzin | 2026-03-23 02:24:39 +02:00
add TrueNAS Cloud Sync monitor CronJob and bump Prometheus Helm timeout
- New cloudsync-monitor CronJob: queries TrueNAS API every 6h, pushes metrics to Pushgateway
- Increase Prometheus Helm timeout to 900s for slow iSCSI reattach

311ff5dd9e | Viktor Barzin | 2026-03-23 00:50:15 +02:00
add hourly SQLite integrity check for vaultwarden with Prometheus alerting
- New CronJob runs PRAGMA integrity_check every hour
- Pushes vaultwarden_sqlite_integrity_ok metric to Prometheus pushgateway
- VaultwardenSQLiteCorrupt alert fires immediately on corruption (critical)
- VaultwardenIntegrityCheckStale alert if the check hasn't run in 2h (warning)
- Prevents a corrupted DB from running unnoticed for days

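Mapping PRAGMA integrity_check output to a 0/1 gauge can be sketched as below. The sqlite3 invocation is shown only as a comment and stubbed so the sketch runs anywhere; the database path, job name, and Pushgateway URL are assumptions.

```shell
#!/bin/sh
# Sketch: turn PRAGMA integrity_check output into a Prometheus gauge.
integrity_to_metric() {
  # sqlite3 prints a single "ok" line when the database is healthy;
  # anything else (error descriptions) means corruption
  [ "$1" = "ok" ] && echo 1 || echo 0
}

# Real CronJob would do something like (path is an assumption):
#   RESULT=$(sqlite3 /data/db.sqlite3 'PRAGMA integrity_check;' | head -n 1)
RESULT="ok"   # stand-in so the sketch is self-contained
printf 'vaultwarden_sqlite_integrity_ok %s\n' "$(integrity_to_metric "$RESULT")"
# Pushed with e.g.:
#   curl --data-binary @- http://pushgateway:9091/metrics/job/vaultwarden-integrity
```

Collapsing the check to 0/1 keeps the alert expression trivial (`== 0` fires), while the staleness alert covers the case where the gauge stops updating at all.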
3b89a7d7e4 | Viktor Barzin | 2026-03-23 00:47:00 +02:00
add VaultwardenDown alert and tighten backup staleness threshold
- Add dedicated VaultwardenDown Prometheus alert (critical, 5m)
- Reduce backup staleness threshold from 8d to 24h to match 6h schedule
- Fixes monitoring gap where VW downtime went undetected

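A 24h staleness check over a success-timestamp metric is usually expressed with time(). A sketch: the alert name and job label are assumptions, while the metric name follows the backup_last_success_timestamp convention used elsewhere in this log:

```yaml
- alert: VaultwardenBackupStale
  # time() - <unix timestamp of last success> grows monotonically between
  # backups; 86400s = 24h, i.e. four missed runs of the 6h schedule
  expr: time() - backup_last_success_timestamp{job="vaultwarden-backup"} > 86400
  labels:
    severity: warning
```
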
bd98b84ded | Viktor Barzin | 2026-03-22 03:02:17 +02:00
scale grafana and alertmanager to 1 replica to free cluster memory
Grafana: 2 → 1 (saves ~312 Mi)
Alertmanager: 2 → 1 (saves ~150 Mi)
Matrix already scaled to 0 (saves ~212 Mi)

1c13af142d | Viktor Barzin | 2026-03-22 02:56:04 +02:00
sync regenerated providers.tf + upstream changes
- Terragrunt-regenerated providers.tf across stacks (vault_root_token
  variable removed from root generate block)
- Upstream monitoring/openclaw/CLAUDE.md changes from rebase

af2222fce8 | Viktor Barzin | 2026-03-19 20:34:33 +00:00
backup & DR: add alerting, fix rotation, secure MySQL password, add runbooks
Phase 1: Add 12 PrometheusRules for backup health alerting
- PostgreSQL, MySQL, Vault, Vaultwarden, Redis staleness + never-succeeded alerts
- CSIDriverCrashLoop alert for nfs-csi/iscsi-csi namespaces
- Generic BackupCronJobFailed alert

Phase 2: Fix backup rotation
- etcd: timestamped snapshots instead of overwriting single file
- Redis: timestamped RDB files with 7-day retention purge
- PostgreSQL: retention increased from 7 to 14 days

Phase 3: Fix MySQL password exposure
- Move root password from command line arg to MYSQL_PWD env var via secretKeyRef

Phase 5: Add restore runbooks
- PostgreSQL, MySQL, Vault, etcd, Vaultwarden, full cluster rebuild

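The Phase 3 fix relies on MySQL clients reading the password from the MYSQL_PWD environment variable instead of argv, where any user running `ps` can see it. A minimal sketch of the container spec fragment, with secret and key names assumed:

```yaml
# Before (password visible in `ps` output and in the pod spec):
#   command: ["sh", "-c", "mysqldump -u root -p\"$ROOT_PASSWORD\" --all-databases"]
env:
  - name: MYSQL_PWD            # mysql/mysqldump read the password from this env var
    valueFrom:
      secretKeyRef:
        name: mysql-root       # secret name and key are assumptions
        key: password
```
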
e54bc016ba | Viktor Barzin | 2026-03-19 20:25:36 +00:00
reduce alert noise: raise memory thresholds, exclude claude-memory 4xx, right-size mysql-operator
- ContainerNearOOM: 85% → 90% (silences forgejo, changedetection, immich-pg, mysql-cluster)
- ClusterMemoryRequestsHigh: 85% → 92% (intentional overcommit)
- NodeMemoryPressureTrending: 85% → 92%
- HighService4xxRate: exclude claude-memory (401s from unauth requests are expected)
- mysql-operator memory limit: 512Mi → 580Mi (VPA upperBound 481Mi × 1.2)

b05421dbb5 | Viktor Barzin | 2026-03-18 21:45:26 +00:00
add comment explaining prometheus 4Gi minimum memory requirement [ci skip]

9d87ce605f | Viktor Barzin | 2026-03-18 21:44:14 +00:00
revert prometheus memory 3Gi→4Gi: WAL tmpfs shares cgroup limit
The 2Gi WAL tmpfs (medium: Memory) counts against the container's
memory limit. At 3Gi, Prometheus OOM-kills during WAL replay on
startup (heap + tmpfs > 3Gi). Reverting to 4Gi restores headroom.

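The interaction described here comes from Kubernetes backing `emptyDir` volumes with `medium: Memory` by tmpfs, whose pages are charged to the pod's memory cgroup. A sketch of the relevant spec fragments; volume and container names are assumptions, the sizes are from the commit:

```yaml
volumes:
  - name: prometheus-wal
    emptyDir:
      medium: Memory     # tmpfs: file contents count against the memory limit
      sizeLimit: 2Gi
containers:
  - name: prometheus
    volumeMounts:
      - name: prometheus-wal
        mountPath: /prometheus/wal
    resources:
      limits:
        memory: 4Gi      # must cover heap + up to 2Gi of WAL tmpfs,
                         # hence the revert from 3Gi
```
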
12a51c4ffa | Viktor Barzin | 2026-03-17 22:35:54 +00:00
right-size memory requests to unblock GPU workloads and fix dbaas quota [ci skip]
- nvidia: custom LimitRange (128Mi default, was 1Gi from Kyverno) to stop
  inflating GPU operator init containers; saves ~2.5Gi on GPU node
- nvidia: dcgm-exporter 1536Mi → 768Mi (actual usage 489Mi)
- monitoring: prometheus server 4Gi → 3Gi (actual usage 2.6Gi)
- onlyoffice: 2304Mi → 1536Mi (actual usage 1.3Gi)
- immich: set explicit 64Mi resources (was getting 1Gi LimitRange default)
- dbaas: quota limits.memory 20Gi → 24Gi to fit 3rd MySQL replica

Root cause: Kyverno tier-2-gpu LimitRange injected 1Gi on every NVIDIA init
container (no explicit resources), wasting ~2.5Gi scheduling overhead on the
GPU node. Combined with over-requesting, frigate and immich-ml couldn't schedule.

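A namespace-scoped LimitRange like the one described could look like the following (the object name is an assumption; the 128Mi default is from the commit):

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: nvidia-defaults    # name is an assumption
  namespace: nvidia
spec:
  limits:
    - type: Container
      default:             # applied as the limit for containers without one
        memory: 128Mi
      defaultRequest:      # replaces the 1Gi default previously injected
        memory: 128Mi
```

Defaults only apply to containers with no explicit resources, which is exactly how the GPU operator init containers were picking up the inflated 1Gi.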
ae36dc253b | Viktor Barzin | 2026-03-17 21:34:11 +00:00
extract monitoring, nvidia, mailserver, cloudflared, kyverno from platform [ci skip]
Phase 2 of platform stack split. 5 more modules extracted into
independent stacks. All applied successfully with zero destroys.
Cloudflared now reads k8s_users from Vault directly to compute
user_domains. Woodpecker pipeline runs all 8 extracted stacks
in parallel. Memory bumped to 6Gi for 9 concurrent TF processes.
Platform reduced from 27 to 19 modules.