Viktor Barzin
8bea552664
[ci skip] Add HomeAssistantDown alert for ha-sofia
...
Fires after 5m if the haos Prometheus scrape target is unreachable.
Covers the HTTP API endpoint which shares the same process as the
WebSocket API used by the mobile app.
2026-02-11 23:24:46 +00:00
Viktor Barzin
0c18a86a7b
[ci skip] Fix all active Prometheus alerts
...
- meshcentral: rename port from "https" to "http" — MeshCentral serves
plain HTTP when REVERSE_PROXY=true, but Traefik inferred HTTPS from the
port name, causing 100% 5xx errors
- osm-routing/otp: scale to 0 — TfL GTFS data expired, OTP crash-loops
trying to build graph with no valid transit trips
- wireguard: add prometheus.io/port=9586 annotation — without it,
Prometheus tried scraping all container ports (51820 UDP, 80)
- travel-blog: remove stale prometheus.io annotations and dead port 9113
— nginx-exporter sidecar was commented out but annotations remained
- dawarich: remove prometheus.io annotations — exporter env vars are
commented out so nothing listens on port 9394
- monitoring: raise CPU temp threshold 60°C→75°C (E5-2699 v4 Tcase is
79°C), lower registry cache threshold 50%→25%, add minimum traffic
floor (>0.1 req/s) to 4xx/5xx rate alerts to prevent false positives
on low-traffic services
2026-02-11 22:40:56 +00:00
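Note on the minimum traffic floor in 0c18a86a7b: a minimal sketch of why the floor prevents false positives, written in Python rather than as the actual alert rule. Only the 0.1 req/s floor comes from the commit message. The function name and the 5% error-ratio threshold are hypothetical, chosen for illustration.

```python
def should_fire(error_rate: float, total_rate: float,
                ratio_threshold: float = 0.05,  # hypothetical threshold, not from the commit
                min_traffic: float = 0.1) -> bool:  # 0.1 req/s floor from the commit message
    """Return True if an error-ratio alert should fire.

    Rates are in requests per second. A service at or below the traffic
    floor never fires, so one failed request on a nearly idle service
    cannot show up as a very high error ratio.
    """
    if total_rate <= min_traffic:
        return False
    return error_rate / total_rate > ratio_threshold
```

With these assumed values, a service handling 0.02 req/s with half of its requests failing stays silent, while a service at 10 req/s with 1 req/s of errors fires.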
Viktor Barzin
04eaf0f989
[ci skip] Add 12 Prometheus alert rules for monitoring gaps
...
Add 3 new alert groups and 1 rule to existing group:
- Storage: NodeFilesystemFull (<10% free), PVFillingUp (>85% used)
- K8s Health: PodCrashLooping, ContainerOOMKilled, NodeNotReady,
NodeConditionBad, JobFailed
- Infrastructure Health: CoreDNSErrors, ScrapeTargetDown,
PrometheusStorageFull, PrometheusNotificationsFailing
- R730 Host: FanFailure (iDRAC Redfish fan health)
2026-02-11 22:14:30 +00:00
Viktor Barzin
12a84c5207
Standardize Prometheus alert formatting and fix Slack notifications
...
- Add color coding (red/green) to Slack alerts, show alertname in title
- Use summary annotation in Slack text (description was always empty)
- Format all alert summaries consistently: value with units and threshold
- Fix ratio expressions (CPU/memory) to display as percentages
- Fix "failiure" typo, capitalize Tailscale
2026-02-11 21:53:22 +00:00
Viktor Barzin
c32acc70e6
Migrate all service modules from nginx-ingress to Traefik
...
- Remove nginx-specific ingress variables (use_proxy_protocol, proxy_timeout, additional_configuration_snippet)
- Update ingress annotations to use Traefik middleware CRDs
- Delete nginx-ingress module (replaced by traefik)
- Add new traefik middleware.tf for shared middleware definitions
- Update service modules to work with new ingress_factory interface
2026-02-07 13:25:49 +00:00
Viktor Barzin
4a857ebefd
Add per-pod GPU memory metrics exporter
...
- Add DaemonSet that runs on GPU node and exposes Prometheus metrics
- Uses nvidia-smi to collect per-process GPU memory usage
- Maps PIDs to container IDs via /proc/<pid>/cgroup
- Exposes gpu_pod_memory_used_bytes metric at :9401/metrics
- Add Prometheus scrape config for gpu-pod-memory job
[ci skip]
2026-01-31 16:58:14 +00:00
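Note on the PID-to-container mapping in 4a857ebefd: a minimal sketch of how a PID might be mapped to a container ID via /proc/<pid>/cgroup. It is not the exporter's actual code. It assumes the container runtime embeds a 64-character lowercase hex container ID in the cgroup path, which is common for containerd and Docker but not guaranteed. The function names are hypothetical.

```python
import re
from typing import Optional

# Assumption: container IDs appear as 64 lowercase hex characters in the cgroup path.
_CONTAINER_ID = re.compile(r"[0-9a-f]{64}")


def container_id_from_cgroup(cgroup_text: str) -> Optional[str]:
    """Return the first 64-hex container ID found in cgroup file contents, or None."""
    for line in cgroup_text.splitlines():
        match = _CONTAINER_ID.search(line)
        if match:
            return match.group(0)
    return None


def container_id_for_pid(pid: int) -> Optional[str]:
    """Read /proc/<pid>/cgroup and extract the container ID, or None if unavailable."""
    try:
        with open(f"/proc/{pid}/cgroup", encoding="ascii") as handle:
            return container_id_from_cgroup(handle.read())
    except OSError:
        # The process may have exited, or /proc may not be readable here.
        return None
```

An exporter along these lines would take the PIDs reported by nvidia-smi, call container_id_for_pid on each, and then resolve the container ID to a pod before labelling gpu_pod_memory_used_bytes.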
Viktor Barzin
3bda3ab956
reduce the frequency of polling idrac and remove some duplicates [ci skip]
2026-01-24 18:47:22 +00:00
Viktor Barzin
d751a5924c
add prometheus alerts for deployment/statefulset/daemonset replica mismatches [ci skip]
...
- Add DeploymentReplicasMismatch alert
- Add StatefulSetReplicasMismatch alert
- Add DaemonSetMissingPods alert
- Add .claude/ directory with remote executor and knowledge base
2026-01-18 11:04:51 +00:00
Viktor Barzin
5609bbbaf3
dedup ram alert and increase threshold to 95% [ci skip]
2026-01-17 22:42:22 +00:00
Viktor Barzin
20cd480988
monitor idrac more frequently [ci skip]
2026-01-07 18:55:59 +00:00
Viktor Barzin
402dc1f91a
update cpu temp alert to above 60 [ci skip]
2026-01-04 12:26:46 +00:00
Viktor Barzin
29194c06b9
update definition of high cpu usage to use pve metrics instead, over a longer period [ci skip]
2026-01-03 23:30:28 +00:00
Viktor Barzin
d151b582f7
update cpu temp alert to 55C down from 75C [ci skip]
2026-01-03 16:48:54 +00:00
Viktor Barzin
feeb6ee86c
increase idrac scrape timeout in an attempt to reduce 499 responses [ci skip]
2025-12-29 20:34:40 +00:00
Viktor Barzin
42403e0b35
add registry low cache hit rate alert [ci skip]
2025-12-29 10:43:57 +00:00
Viktor Barzin
8be0fc9699
add more alerts in prometheus and group them better [ci skip]
2025-12-28 20:07:33 +00:00
Viktor Barzin
e12c117bdf
move prometheus wal to tmpfs to reduce wear [ci skip]
2025-12-26 20:10:20 +00:00
Viktor Barzin
a7dc4320b3
add job to monitor pve host using node exporter and add alert for high ssd writes [ci skip]
2025-12-26 16:23:49 +00:00
Viktor Barzin
bd60f0faa3
add alert for docker registry [ci skip]
2025-12-18 10:45:32 +00:00
Viktor Barzin
bc486227f7
add separate idrac monitoring tool and dashboard [ci skip]
2025-12-14 09:50:16 +00:00
Viktor Barzin
f85d793afd
add haos monitoring job in prometheus
2025-11-29 11:46:42 +00:00
Viktor Barzin
0752e80231
add api key to tiny tuya target in prometheus scrape [ci skip]
2025-11-09 22:03:25 +00:00
Viktor Barzin
16d27ec225
add prometheus targets for fuses [ci skip]
2025-10-29 21:59:06 +00:00
Viktor Barzin
279592b6e3
add alert if we use inverter power for 1d straight - probably an issue with switching [ci skip]
2025-10-29 20:09:21 +00:00
Viktor Barzin
6be6b06d90
add scrape config for tuya bridge and prohibit access to the metrics path via ingress [ci skip]
2025-10-28 21:38:40 +00:00
Viktor Barzin
093ed81fce
increase threshold for high power usage to 180 as we have a bigger cpu now [ci skip]
2025-10-08 20:33:51 +00:00
Viktor Barzin
c3bc184169
disable errors for matrix ingress [ci skip]
2025-08-23 20:38:53 +00:00
Viktor Barzin
adcd0695ba
update registry prometheus url to devvm as pi was too slow [ci skip]
2025-08-23 20:15:05 +00:00
Viktor Barzin
16d6bcc544
add registry monitoring to prometheus [ci skip]
2025-03-30 11:15:54 +00:00
Viktor Barzin
534fcdbfe3
adjust battery low alert to fire only when there is no power [ci skip]
2025-03-22 15:47:30 +00:00
Viktor Barzin
987fc402b5
disable alert for fewer pods than in spec [ci skip]
2025-03-16 18:27:13 +00:00
Viktor Barzin
72bedfdd6e
disable perms errors and server errors for grafana and nextcloud ingresses as they were too noisy [ci skip]
2025-03-15 17:53:24 +00:00
Viktor Barzin
f7eff3cb74
add alert for ups low battery remaining [ci skip]
2025-03-02 20:48:07 +00:00
Viktor Barzin
095624a337
increase low voltage alert to 10 min [ci skip]
2025-03-01 14:28:56 +00:00
Viktor Barzin
5ef9ba5917
increase interval for 500 alerts to 20m [ci skip]
2025-01-10 20:47:25 +00:00
Viktor Barzin
aeee71751f
move prometheus alerts to different channel and move high cpu period [ci skip]
2025-01-04 14:27:48 +00:00
Viktor Barzin
3473f64670
increase idle power threshold to 130w [ci skip]
2025-01-03 17:49:24 +00:00
Viktor Barzin
4b725b02a6
add alert status to message [ci skip]
2025-01-02 21:13:09 +00:00
Viktor Barzin
c7113fa495
update prometheus alerts to be correctly grouped and sent to slack and deprecate some old ones [ci skip]
2025-01-02 20:33:55 +00:00
Viktor Barzin
9b0d686873
update prometheus chart values to get slack notifications to work and add alerts for 4xx and 5xx on ingress [ci skip]
2025-01-01 11:39:16 +00:00
Viktor Barzin
40f4354316
fix monitoring stack [ci skip]
2024-12-31 17:15:06 +00:00
Viktor Barzin
ce90629b54
add low voltage alert to prometheus and update some dashboards [ci skip]
2024-12-23 18:21:01 +00:00
Viktor Barzin
fbe305a891
add ups snmp exporter to prometheus [ci skip]
2024-12-15 18:13:33 +00:00
Viktor Barzin
185a944acd
replace oauth proxy with authentik auth [ci skip]
2024-11-18 22:06:31 +00:00
Viktor Barzin
64f81621c8
add homepage module and some more integrations [ci skip]
2024-10-20 13:05:03 +00:00
Viktor Barzin
b54fbf72fd
add meshcentral and diun [ci skip]
2024-08-18 18:14:22 +00:00
Viktor Barzin
506b4a2f87
reduce prometheus storage retention from 12w -> 8w to save ~30gb [ci skip]
2024-08-07 20:18:13 +00:00
Viktor Barzin
828f3f115a
update old prometheus alert detectors and upgrade immich to 101 [ci skip]
2024-04-12 21:15:31 +00:00
Viktor Barzin
8afbec0d23
remove hack for london openwrt monitoring after having tailscale now [ci skip]
2024-03-30 18:28:11 +00:00
Viktor Barzin
e5061dec27
update openwrt london prometheus target address [ci skip]
2024-03-29 22:20:29 +00:00