Commit graph

84 commits

Author SHA1 Message Date
Viktor Barzin
26ba9ea371 [ci skip] Fix Prometheus storage alert and Grafana quota exhaustion
- Enable size-based TSDB retention (45GB) to clean up old blocks
  (including 2021-era blocks with failed compaction)
- Increase monitoring namespace quota from 64/128Gi to 80/160Gi
  CPU/memory limits to allow Grafana rolling updates
2026-02-21 21:04:08 +00:00
Viktor Barzin
f06b3ac0e4 [ci skip] Fix .viktorbarzin.lan.viktorbarzin.lan duplicate DNS queries
Add CoreDNS catch-all block for viktorbarzin.lan.viktorbarzin.lan to
return NXDOMAIN immediately, preventing search domain expansion junk
queries from reaching Technitium. Add trailing dots to Prometheus
scrape targets (idrac, ups, ha-sofia) to bypass ndots expansion.
2026-02-16 21:38:38 +00:00
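The search-domain expansion this commit works around can be sketched as follows. This is a minimal illustration of the usual stub-resolver rule (a name with a trailing dot is absolute; otherwise names with fewer than `ndots` dots try the search list first), not code from the repository. The host name `idrac.viktorbarzin.lan`, the single-entry search list, and the function name `resolver_candidates` are assumptions for the example; `ndots=5` is the Kubernetes default for cluster DNS.

```python
def resolver_candidates(name, search_domains, ndots=5):
    """Return the fully qualified names a stub resolver would try, in order.

    Real resolvers stop at the first successful answer; this only lists
    the order of attempts.
    """
    # A trailing dot marks the name as absolute: no search-list expansion.
    if name.endswith("."):
        return [name]

    expanded = [f"{name}.{domain}." for domain in search_domains]
    absolute = name + "."

    # With at least `ndots` dots the name is tried as-is first;
    # otherwise the search list is walked before the absolute name.
    if name.count(".") >= ndots:
        return [absolute] + expanded
    return expanded + [absolute]


# Hypothetical scrape target and search list, for illustration only.
print(resolver_candidates("idrac.viktorbarzin.lan", ["viktorbarzin.lan"]))
print(resolver_candidates("idrac.viktorbarzin.lan.", ["viktorbarzin.lan"]))
```

Without the trailing dot, the first attempt is the doubled `...viktorbarzin.lan.viktorbarzin.lan.` name, which is the junk query the CoreDNS catch-all block answers with NXDOMAIN. With the trailing dot, only the absolute name is queried.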
Viktor Barzin
cd5261161b [ci skip] Add HomeAssistantDown alert for ha-sofia
Fires after 5m if the haos Prometheus scrape target is unreachable.
Covers the HTTP API endpoint which shares the same process as the
WebSocket API used by the mobile app.
2026-02-11 23:24:46 +00:00
Viktor Barzin
46ffc37dcf [ci skip] Fix all active Prometheus alerts
- meshcentral: rename port from "https" to "http" — MeshCentral serves
  plain HTTP when REVERSE_PROXY=true, but Traefik inferred HTTPS from the
  port name, causing 100% 5xx errors
- osm-routing/otp: scale to 0 — TfL GTFS data expired, OTP crash-loops
  trying to build graph with no valid transit trips
- wireguard: add prometheus.io/port=9586 annotation — without it,
  Prometheus tried scraping all container ports (51820 UDP, 80)
- travel-blog: remove stale prometheus.io annotations and dead port 9113
  — nginx-exporter sidecar was commented out but annotations remained
- dawarich: remove prometheus.io annotations — exporter env vars are
  commented out so nothing listens on port 9394
- monitoring: raise CPU temp threshold 60°C→75°C (E5-2699 v4 Tcase is
  79°C), lower registry cache threshold 50%→25%, add minimum traffic
  floor (>0.1 req/s) to 4xx/5xx rate alerts to prevent false positives
  on low-traffic services
2026-02-11 22:40:56 +00:00
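The "minimum traffic floor" in the last bullet can be read as a two-part condition: the error ratio is evaluated only when the service receives enough requests for the ratio to be meaningful. The sketch below shows that logic in Python rather than in the actual alert expressions, which are not shown in the commit message. The function name and the `max_error_ratio` parameter are hypothetical; only the 0.1 req/s floor comes from the commit.

```python
def should_fire(error_rate, total_rate, max_error_ratio, min_total_rate=0.1):
    """Decide whether a 4xx/5xx ratio alert should fire.

    error_rate and total_rate are requests per second.
    max_error_ratio is a hypothetical threshold (the commit does not state it).
    min_total_rate is the traffic floor (0.1 req/s in the commit message).
    """
    # Below the floor, a single failed request could push the ratio to 100%,
    # so the alert is suppressed regardless of the ratio.
    if total_rate <= min_total_rate:
        return False
    return error_rate / total_rate > max_error_ratio


# A near-idle service with one failing request does not alert...
print(should_fire(error_rate=0.05, total_rate=0.05, max_error_ratio=0.1))
# ...while a busy service with half of its requests failing does.
print(should_fire(error_rate=1.0, total_rate=2.0, max_error_ratio=0.1))
```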
Viktor Barzin
c8a41ac567 [ci skip] Add 12 Prometheus alert rules for monitoring gaps
Add 3 new alert groups and 1 rule to existing group:
- Storage: NodeFilesystemFull (<10% free), PVFillingUp (>85% used)
- K8s Health: PodCrashLooping, ContainerOOMKilled, NodeNotReady,
  NodeConditionBad, JobFailed
- Infrastructure Health: CoreDNSErrors, ScrapeTargetDown,
  PrometheusStorageFull, PrometheusNotificationsFailing
- R730 Host: FanFailure (iDRAC Redfish fan health)
2026-02-11 22:14:30 +00:00
Viktor Barzin
dbf397841a Standardize Prometheus alert formatting and fix Slack notifications
- Add color coding (red/green) to Slack alerts, show alertname in title
- Use summary annotation in Slack text (description was always empty)
- Format all alert summaries consistently: value with units and threshold
- Fix ratio expressions (CPU/memory) to display as percentages
- Fix "failiure" typo, capitalize Tailscale
2026-02-11 21:53:22 +00:00
Viktor Barzin
b36932f9a3 Migrate all service modules from nginx-ingress to Traefik
- Remove nginx-specific ingress variables (use_proxy_protocol, proxy_timeout, additional_configuration_snippet)
- Update ingress annotations to use Traefik middleware CRDs
- Delete nginx-ingress module (replaced by traefik)
- Add new traefik middleware.tf for shared middleware definitions
- Update service modules to work with new ingress_factory interface
2026-02-07 13:25:49 +00:00
Viktor Barzin
da4cf18d6d Add per-pod GPU memory metrics exporter
- Add DaemonSet that runs on GPU node and exposes Prometheus metrics
- Uses nvidia-smi to collect per-process GPU memory usage
- Maps PIDs to container IDs via /proc/<pid>/cgroup
- Exposes gpu_pod_memory_used_bytes metric at :9401/metrics
- Add Prometheus scrape config for gpu-pod-memory job

[ci skip]
2026-01-31 16:58:14 +00:00
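The PID-to-container mapping mentioned in this commit can be sketched as below. This is an illustration of the general technique, not the exporter's actual code: it assumes the container runtime embeds a 64-character hexadecimal container ID in the cgroup path (as containerd and Docker commonly do), and the function names and sample cgroup line are made up for the example.

```python
import re

# Assumption: container IDs appear as 64 lowercase hex characters in the path.
CONTAINER_ID = re.compile(r"[0-9a-f]{64}")


def container_id_from_cgroup(cgroup_text):
    """Extract a container ID from the contents of /proc/<pid>/cgroup.

    Returns None when no line contains a 64-hex-character ID, for example
    for processes running directly on the host.
    """
    for line in cgroup_text.splitlines():
        # Each line is "hierarchy-ID:controllers:path"; only the path matters.
        path = line.split(":", 2)[-1]
        matches = CONTAINER_ID.findall(path)
        if matches:
            return matches[-1]
    return None


def container_id_for_pid(pid, proc_root="/proc"):
    """Read /proc/<pid>/cgroup and return the container ID, or None."""
    try:
        with open(f"{proc_root}/{pid}/cgroup") as handle:
            return container_id_from_cgroup(handle.read())
    except OSError:
        # The process may have exited between nvidia-smi and this read.
        return None


# Made-up cgroup v2 line for illustration.
sample = (
    "0::/kubepods.slice/kubepods-burstable.slice/"
    "kubepods-burstable-pod1234abcd_5678.slice/"
    "cri-containerd-" + "ab" * 32 + ".scope\n"
)
print(container_id_from_cgroup(sample))
```

An exporter along these lines would take the PIDs reported by nvidia-smi, resolve each to a container ID, and then look up the owning pod before emitting `gpu_pod_memory_used_bytes`; that last lookup is outside this sketch.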
Viktor Barzin
10092ec285 reduce the frequency of polling idrac and remove some duplicates [ci skip] 2026-01-24 18:47:22 +00:00
Viktor Barzin
a1d945a0b2 add prometheus alerts for deployment/statefulset/daemonset replica mismatches [ci skip]
- Add DeploymentReplicasMismatch alert
- Add StatefulSetReplicasMismatch alert
- Add DaemonSetMissingPods alert
- Add .claude/ directory with remote executor and knowledge base
2026-01-18 11:04:51 +00:00
Viktor Barzin
185a138cc5 dedup ram alert and increase threshold to 95% [ci skip] 2026-01-17 22:42:22 +00:00
Viktor Barzin
1b5cbeb9c8 monitor idrac more frequently [ci skip] 2026-01-07 18:55:59 +00:00
Viktor Barzin
934fa34c79 update cpu temp alert to above 60 [ci skip] 2026-01-04 12:26:46 +00:00
Viktor Barzin
01d4c9c3e1 update definition of high cpu usage to use pve metrics instead for a longer period [ci skip] 2026-01-03 23:30:28 +00:00
Viktor Barzin
31c403cadb update cpu temp alert to 55C down from 75C [ci skip] 2026-01-03 16:48:54 +00:00
Viktor Barzin
d37c693a94 increase idrac scrape timeout in attempt to reduce 499 [ci skip] 2025-12-29 20:34:40 +00:00
Viktor Barzin
253e77f22d add registry low cache hit rate alert [ci skip] 2025-12-29 10:43:57 +00:00
Viktor Barzin
cf9d346cae add more alerts in prometheus and group them better [ci skip] 2025-12-28 20:07:33 +00:00
Viktor Barzin
0b2e6d09d2 move prometheus wal to tmpfs to reduce wear [ci skip] 2025-12-26 20:10:20 +00:00
Viktor Barzin
6c1ae20448 add job to monitor pve host using node exporter and add alert for high ssd writes [ci skip] 2025-12-26 16:23:49 +00:00
Viktor Barzin
a225bad3cb add alert for docker registry [ci skip] 2025-12-18 10:45:32 +00:00
Viktor Barzin
b4f45c7e73 add separate idrac monitoring tool and dashboard [ci skip] 2025-12-14 09:50:16 +00:00
Viktor Barzin
34df786fe4 add haos monitoring job in prometheus 2025-11-29 11:46:42 +00:00
Viktor Barzin
0b7b092c26 add api key to tiny tuya target in prometheus scrape [ci skip] 2025-11-09 22:03:25 +00:00
Viktor Barzin
5f2cc75a8e add prometheus targets for fuses [ci skip] 2025-10-29 21:59:06 +00:00
Viktor Barzin
76103f52e3 add alert if we use inverter power for 1d straight - probably an issue with switching [ci skip] 2025-10-29 20:09:21 +00:00
Viktor Barzin
3161da29a4 add scrape config for tuya bridge and prohibit access to the metrics path via ingress [ci skip] 2025-10-28 21:38:40 +00:00
Viktor Barzin
d5dd81ba30 increase threshold for high power usage to 180 as we have bigger cpu now [ci skip] 2025-10-08 20:33:51 +00:00
Viktor Barzin
c5bb343ebe disable errors for matrix ingress [ci skip] 2025-08-23 20:38:53 +00:00
Viktor Barzin
1530323477 update registry prometheus url to devvm as pi was too slow [ci skip] 2025-08-23 20:15:05 +00:00
Viktor Barzin
c5ae6873c9 add registry monitoring to prometheus [ci skip] 2025-03-30 11:15:54 +00:00
Viktor Barzin
6100372a9e adjust battery low alert to fire only when there is no power [ci skip] 2025-03-22 15:47:30 +00:00
Viktor Barzin
cd89d13ab2 disable alert for pods less than in spec [ci skip] 2025-03-16 18:27:13 +00:00
Viktor Barzin
dfe47657ee disable perms errors and server errors for grafana and nextcloud ingresses as they were too noisy [ci skip] 2025-03-15 17:53:24 +00:00
Viktor Barzin
1e674bff7e add alert for ups low battery remaining [ci skip] 2025-03-02 20:48:07 +00:00
Viktor Barzin
ab9b5b356a increase low voltage alert to 10 min [ci skip] 2025-03-01 14:28:56 +00:00
Viktor Barzin
6a6a6974e2 increase interval for 500 alerts to 20m [ci skip] 2025-01-10 20:47:25 +00:00
Viktor Barzin
f291d8545b move prometheus alerts to different channel and move high cpu period [ci skip] 2025-01-04 14:27:48 +00:00
Viktor Barzin
4643e22cc8 increase idle power threshold to 130w [ci skip] 2025-01-03 17:49:24 +00:00
Viktor Barzin
46736680a6 add alert status to message [ci skip] 2025-01-02 21:13:09 +00:00
Viktor Barzin
53d8b2d2c6 update prometheus alerts to be correctly grouped and sent to slack and deprecate some old ones [ci skip] 2025-01-02 20:33:55 +00:00
Viktor Barzin
48a0deb283 update prometheus chart values to get slack notifications to work and add alerts for 4xx and 5xx on ingress [ci skip] 2025-01-01 11:39:16 +00:00
Viktor Barzin
7336e7c033 fix monitoring stack [ci skip] 2024-12-31 17:15:06 +00:00
Viktor Barzin
5ee5e59e61 add low voltage alert to prometheus and update some dashboards [ci skip] 2024-12-23 18:21:01 +00:00
Viktor Barzin
c987301c48 add ups snmp exporter to prometheus [ci skip] 2024-12-15 18:13:33 +00:00
Viktor Barzin
72d780c26f replace oauth proxy with authentik auth [ci skip] 2024-11-18 22:06:31 +00:00
Viktor Barzin
cf39034bdf add homepage module and some more integrations [ci skip] 2024-10-20 13:05:03 +00:00
Viktor Barzin
ead57fe29b add meshcentral and diun [ci skip] 2024-08-18 18:14:22 +00:00
Viktor Barzin
c05d088598 reduce prometheus storage retention from 12w -> 8w to save ~30gb [ci skip] 2024-08-07 20:18:13 +00:00
Viktor Barzin
84b707fcf8 update old prometheus alert detectors and upgrade immich to 101 [ci skip] 2024-04-12 21:15:31 +00:00