Commit graph

17 commits

Author SHA1 Message Date
Viktor Barzin
6a2bee93b5 fix(monitoring): use patched idrac exporter with PSU input voltage metric
The upstream ghcr.io/mrlhansen/idrac_exporter:2.4.1 is missing
NewPowerSupplyInputVoltage in RefreshPowerOld, so the R730 iDRAC
never emits idrac_power_supply_input_voltage. Switch to the patched
viktorbarzin/idrac-redfish-exporter:2.4.1-voltage-fix image.
2026-03-23 22:07:36 +02:00
Viktor Barzin
0a294a30a6 add backup IO logging, Pushgateway metrics, and Grafana dashboard
- Add /proc/self/io read/write tracking to vault raft-backup and etcd backup
- Push backup_duration_seconds, backup_read_bytes, backup_written_bytes,
  backup_last_success_timestamp to Pushgateway from all 6 backup CronJobs
  (etcd skipped — distroless image has no wget/curl)
- Add cloudsync_duration_seconds metric to cloudsync-monitor
- New "Backup Health" Grafana dashboard with 8 panels: time since last backup,
  overview table, duration/IO trends, cloud sync status, alerts, CronJob schedule
2026-03-23 12:19:01 +02:00
Viktor Barzin
5652972c53 fix dashboard: add refIds, explicit panel IDs, fix CrowdSec bouncer metric
- Added refId to all targets (required by Grafana)
- Added explicit panel IDs for stable references
- Fixed CrowdSec bouncer metric: cs_lapi_bouncer_requests_total doesn't
  exist, use cs_lapi_route_requests_total instead
- Added drawStyle/showPoints to all timeseries panels
- Updated via MySQL + ConfigMap + Grafana restart
2026-03-23 10:31:44 +02:00
Viktor Barzin
9527f62c2e fix network traffic dashboard: use only available GoFlow2 metrics
GoFlow2 v2 only exposes aggregate metrics (traffic_bytes_total,
process_nf_total, delay_seconds) — no per-source/dest labels.
Removed panels referencing non-existent src_addr/dst_port labels.
Replaced with flowset records by type, separated bytes and flows
into own panels to avoid scale issues.
2026-03-23 10:16:46 +02:00
Viktor Barzin
55246c8b5d add network traffic monitoring and adversary detection
- CrowdSec: add syslog listener for pfSense firewall logs (NodePort 30514),
  add postfix/dovecot log acquisition, install pf/postfix/dovecot/sshd collections
- Monitoring: add DNS anomaly CronJob (queries Technitium every 15m, DGA detection,
  pushes metrics to Pushgateway)
- Grafana: add "Network Traffic & Adversary Detection" dashboard
  (GoFlow2 flows, CrowdSec decisions, DNS anomaly metrics)

pfSense changes applied live: syslog forwarding to 10.0.20.202:30514,
Snort suppress rules for http_inspect false positives, IPS connectivity policy enabled
2026-03-23 03:06:56 +02:00
Viktor Barzin
877cd15b45 fix: increase tier-2-gpu quota to 12Gi, add NvidiaExporterDown alert
- Increase tier-2-gpu requests.memory from 8Gi to 12Gi to give immich
  ML pods scheduling headroom (was at 96% utilization)
- Add critical NvidiaExporterDown Prometheus alert that fires when GPU
  metrics are absent for >10 minutes (faster than generic ScrapeTargetDown)
2026-03-23 03:04:33 +02:00
Viktor Barzin
e4cf0dee83 add TrueNAS Cloud Sync monitor CronJob and bump Prometheus Helm timeout
- New cloudsync-monitor CronJob: queries TrueNAS API every 6h, pushes metrics to Pushgateway
- Increase Prometheus Helm timeout to 900s for slow iSCSI reattach
2026-03-23 02:24:39 +02:00
Viktor Barzin
311ff5dd9e add hourly SQLite integrity check for vaultwarden with Prometheus alerting
- New CronJob runs PRAGMA integrity_check every hour
- Pushes vaultwarden_sqlite_integrity_ok metric to Prometheus pushgateway
- VaultwardenSQLiteCorrupt alert fires immediately on corruption (critical)
- VaultwardenIntegrityCheckStale alert if check hasn't run in 2h (warning)
- Prevents running for days on a corrupted DB unnoticed
2026-03-23 00:50:15 +02:00
Viktor Barzin
3b89a7d7e4 add VaultwardenDown alert and tighten backup staleness threshold
- Add dedicated VaultwardenDown Prometheus alert (critical, 5m)
- Reduce backup staleness threshold from 8d to 24h to match 6h schedule
- Fixes monitoring gap where VW downtime went undetected
2026-03-23 00:47:00 +02:00
Viktor Barzin
bd98b84ded scale grafana and alertmanager to 1 replica to free cluster memory
Grafana: 2 → 1 (saves ~312 Mi)
Alertmanager: 2 → 1 (saves ~150 Mi)
Matrix already scaled to 0 (saves ~212 Mi)
2026-03-22 03:02:17 +02:00
Viktor Barzin
1c13af142d sync regenerated providers.tf + upstream changes
- Terragrunt-regenerated providers.tf across stacks (vault_root_token
  variable removed from root generate block)
- Upstream monitoring/openclaw/CLAUDE.md changes from rebase
2026-03-22 02:56:04 +02:00
Viktor Barzin
af2222fce8 backup & DR: add alerting, fix rotation, secure MySQL password, add runbooks
Phase 1: Add 12 PrometheusRules for backup health alerting
- PostgreSQL, MySQL, Vault, Vaultwarden, Redis staleness + never-succeeded alerts
- CSIDriverCrashLoop alert for nfs-csi/iscsi-csi namespaces
- Generic BackupCronJobFailed alert

Phase 2: Fix backup rotation
- etcd: timestamped snapshots instead of overwriting single file
- Redis: timestamped RDB files with 7-day retention purge
- PostgreSQL: retention increased from 7 to 14 days

Phase 3: Fix MySQL password exposure
- Move root password from command line arg to MYSQL_PWD env var via secretKeyRef

Phase 5: Add restore runbooks
- PostgreSQL, MySQL, Vault, etcd, Vaultwarden, full cluster rebuild
2026-03-19 20:34:33 +00:00
Viktor Barzin
e54bc016ba reduce alert noise: raise memory thresholds, exclude claude-memory 4xx, right-size mysql-operator
- ContainerNearOOM: 85% → 90% (silences forgejo, changedetection, immich-pg, mysql-cluster)
- ClusterMemoryRequestsHigh: 85% → 92% (intentional overcommit)
- NodeMemoryPressureTrending: 85% → 92%
- HighService4xxRate: exclude claude-memory (401s from unauth requests are expected)
- mysql-operator memory limit: 512Mi → 580Mi (VPA upperBound 481Mi × 1.2)
2026-03-19 20:25:36 +00:00
Viktor Barzin
b05421dbb5 add comment explaining prometheus 4Gi minimum memory requirement [ci skip] 2026-03-18 21:45:26 +00:00
Viktor Barzin
9d87ce605f revert prometheus memory 3Gi→4Gi: WAL tmpfs shares cgroup limit
The 2Gi WAL tmpfs (medium: Memory) counts against the container's
memory limit. At 3Gi, Prometheus OOM-kills during WAL replay on
startup (heap + tmpfs > 3Gi). Reverting to 4Gi restores headroom.
2026-03-18 21:44:14 +00:00
Viktor Barzin
12a51c4ffa right-size memory requests to unblock GPU workloads and fix dbaas quota [ci skip]
- nvidia: custom LimitRange (128Mi default, was 1Gi from Kyverno) to stop
  inflating GPU operator init containers; saves ~2.5Gi on GPU node
- nvidia: dcgm-exporter 1536Mi → 768Mi (actual usage 489Mi)
- monitoring: prometheus server 4Gi → 3Gi (actual usage 2.6Gi)
- onlyoffice: 2304Mi → 1536Mi (actual usage 1.3Gi)
- immich: frame explicit 64Mi resources (was getting 1Gi LimitRange default)
- dbaas: quota limits.memory 20Gi → 24Gi to fit 3rd MySQL replica

Root cause: Kyverno tier-2-gpu LimitRange injected 1Gi on every NVIDIA init
container (no explicit resources), wasting ~2.5Gi scheduling overhead on the
GPU node. Combined with over-requesting, frigate and immich-ml couldn't schedule.
2026-03-17 22:35:54 +00:00
Viktor Barzin
ae36dc253b extract monitoring, nvidia, mailserver, cloudflared, kyverno from platform [ci skip]
Phase 2 of platform stack split. 5 more modules extracted into
independent stacks. All applied successfully with zero destroys.
Cloudflared now reads k8s_users from Vault directly to compute
user_domains. Woodpecker pipeline runs all 8 extracted stacks
in parallel. Memory bumped to 6Gi for 9 concurrent TF processes.
Platform reduced from 27 to 19 modules.
2026-03-17 21:34:11 +00:00