Viktor Barzin
c8a41ac567
[ci skip] Add 12 Prometheus alert rules for monitoring gaps
...
Add 3 new alert groups and 1 rule to existing group:
- Storage: NodeFilesystemFull (<10% free), PVFillingUp (>85% used)
- K8s Health: PodCrashLooping, ContainerOOMKilled, NodeNotReady,
NodeConditionBad, JobFailed
- Infrastructure Health: CoreDNSErrors, ScrapeTargetDown,
PrometheusStorageFull, PrometheusNotificationsFailing
- R730 Host: FanFailure (iDRAC Redfish fan health)
2026-02-11 22:14:30 +00:00
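The two storage thresholds named in the commit above (less than 10% free for NodeFilesystemFull, more than 85% used for PVFillingUp) can be illustrated with a small sketch. The real rules are Prometheus expressions that this log does not show; the function and parameter names below are hypothetical and only mirror the stated thresholds.

```python
# Sketch of the two storage thresholds from the commit message.
# Function and parameter names are hypothetical, not taken from the repository.

def filesystem_full(avail_bytes: float, size_bytes: float,
                    min_free_pct: float = 10.0) -> bool:
    """Return True when free space drops below min_free_pct percent."""
    free_pct = 100.0 * avail_bytes / size_bytes
    return free_pct < min_free_pct


def pv_filling_up(used_bytes: float, capacity_bytes: float,
                  max_used_pct: float = 85.0) -> bool:
    """Return True when usage rises above max_used_pct percent."""
    used_pct = 100.0 * used_bytes / capacity_bytes
    return used_pct > max_used_pct
```

Both comparisons are strict, so a filesystem at exactly 10% free or a volume at exactly 85% used would not fire under this reading of the thresholds.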
Viktor Barzin
dbf397841a
Standardize Prometheus alert formatting and fix Slack notifications
...
- Add color coding (red/green) to Slack alerts, show alertname in title
- Use summary annotation in Slack text (description was always empty)
- Format all alert summaries consistently: value with units and threshold
- Fix ratio expressions (CPU/memory) to display as percentages
- Fix "failiure" typo, capitalize Tailscale
2026-02-11 21:53:22 +00:00
Viktor Barzin
73aab7f4ce
[ci skip] Assorted pending changes: ollama API auth, nvidia dashboard, traefik rewrite-body plugin
...
- ollama: Add basicAuth middleware for external API access
- monitoring: Update nvidia dashboard (add GPU memory per app panel, bump to v9)
- plotting-book: Switch to ancamilea/book-plotter:latest, add lifecycle ignore
- reverse_proxy/factory: Fix rybbit plugin name (rewritebody -> rewrite-body)
- traefik: Switch to packruler/rewrite-body plugin v1.2.0
2026-02-10 21:29:54 +00:00
Viktor Barzin
b36932f9a3
Migrate all service modules from nginx-ingress to Traefik
...
- Remove nginx-specific ingress variables (use_proxy_protocol, proxy_timeout, additional_configuration_snippet)
- Update ingress annotations to use Traefik middleware CRDs
- Delete nginx-ingress module (replaced by traefik)
- Add new traefik middleware.tf for shared middleware definitions
- Update service modules to work with new ingress_factory interface
2026-02-07 13:25:49 +00:00
Viktor Barzin
da4cf18d6d
Add per-pod GPU memory metrics exporter
...
- Add DaemonSet that runs on GPU node and exposes Prometheus metrics
- Uses nvidia-smi to collect per-process GPU memory usage
- Maps PIDs to container IDs via /proc/<pid>/cgroup
- Exposes gpu_pod_memory_used_bytes metric at :9401/metrics
- Add Prometheus scrape config for gpu-pod-memory job
[ci skip]
2026-01-31 16:58:14 +00:00
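The PID-to-container mapping described in the commit above can be sketched as follows. This is a minimal illustration, not the exporter's actual code: it assumes the container ID appears as a 64-character hex string in the cgroup path, and the function names are hypothetical.

```python
import re
from typing import Optional

# Assumption: container IDs show up as 64 lowercase hex characters in the path.
_CONTAINER_ID = re.compile(r"[0-9a-f]{64}")


def container_id_from_cgroup(cgroup_text: str) -> Optional[str]:
    """Extract a container ID from the contents of /proc/<pid>/cgroup."""
    for line in cgroup_text.splitlines():
        # Each line has the form hierarchy-ID:controller-list:cgroup-path
        parts = line.split(":", 2)
        if len(parts) != 3:
            continue
        match = _CONTAINER_ID.search(parts[2])
        if match:
            return match.group(0)
    return None


def container_id_for_pid(pid: int) -> Optional[str]:
    """Read /proc/<pid>/cgroup and return the container ID, if any."""
    try:
        with open(f"/proc/{pid}/cgroup", encoding="utf-8") as fh:
            return container_id_from_cgroup(fh.read())
    except OSError:
        # The process may have exited, or /proc may not be readable.
        return None
```

Keeping the parsing separate from the file read lets the parsing be exercised with a sample string, without a GPU node or a live process.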
Viktor Barzin
10092ec285
reduce the frequency of polling idrac and remove some duplicates [ci skip]
2026-01-24 18:47:22 +00:00
Viktor Barzin
a1d945a0b2
add prometheus alerts for deployment/statefulset/daemonset replica mismatches [ci skip]
...
- Add DeploymentReplicasMismatch alert
- Add StatefulSetReplicasMismatch alert
- Add DaemonSetMissingPods alert
- Add .claude/ directory with remote executor and knowledge base
2026-01-18 11:04:51 +00:00
Viktor Barzin
185a138cc5
dedup ram alert and increase threshold to 95% [ci skip]
2026-01-17 22:42:22 +00:00
Viktor Barzin
61e318398c
scale grafana to 3 pods for resilience [ci skip]
2026-01-12 18:27:54 +00:00
Viktor Barzin
f1e9fb9afe
add tier to all deployments [ci skip]
2026-01-10 16:28:14 +00:00
Viktor Barzin
1b5cbeb9c8
monitor idrac more frequently [ci skip]
2026-01-07 18:55:59 +00:00
Viktor Barzin
934fa34c79
update cpu temp alert to above 60 [ci skip]
2026-01-04 12:26:46 +00:00
Viktor Barzin
01d4c9c3e1
update definition of high cpu usage to use pve metrics instead for a longer period [ci skip]
2026-01-03 23:30:28 +00:00
Viktor Barzin
31c403cadb
update cpu temp alert to 55C down from 75C [ci skip]
2026-01-03 16:48:54 +00:00
Viktor Barzin
d37c693a94
increase idrac scrape timeout in attempt to reduce 499 [ci skip]
2025-12-29 20:34:40 +00:00
Viktor Barzin
253e77f22d
add registry low cache hit rate alert [ci skip]
2025-12-29 10:43:57 +00:00
Viktor Barzin
f1dde96d80
replace hardcoded namespace with module reference [ci skip]
2025-12-29 10:23:42 +00:00
Viktor Barzin
cf9d346cae
add more alerts in prometheus and group them better [ci skip]
2025-12-28 20:07:33 +00:00
Viktor Barzin
a595c4db56
move out all monitoring resources to separate tf files [ci skip]
2025-12-28 20:07:00 +00:00
Viktor Barzin
26d55c6637
move grafana into separate file and turn off persistence as we use external db now [ci skip]
2025-12-28 20:05:27 +00:00
Viktor Barzin
f06e050eaa
migrate grafana to mysql from sqlite [ci skip]
2025-12-27 20:51:05 +00:00
Viktor Barzin
0b2e6d09d2
move prometheus wal to tmpfs to reduce wear [ci skip]
2025-12-26 20:10:20 +00:00
Viktor Barzin
6c1ae20448
add job to monitor pve host using node exporter and add alert for high ssd writes [ci skip]
2025-12-26 16:23:49 +00:00
Viktor Barzin
d07c625064
add pve exporter playbook + pve exporter in k8s [ci skip]
2025-12-26 16:23:17 +00:00
Viktor Barzin
59e6591e2a
update most important grafana dashboards [ci skip]
2025-12-23 18:13:25 +00:00
Viktor Barzin
a225bad3cb
add alert for docker registry [ci skip]
2025-12-18 10:45:32 +00:00
Viktor Barzin
397fa0cba7
add local-only ingress for snmp and idrac exporters [ci skip]
2025-12-14 19:08:44 +00:00
Viktor Barzin
b4f45c7e73
add separate idrac monitoring tool and dashboard [ci skip]
2025-12-14 09:50:16 +00:00
Viktor Barzin
34df786fe4
add haos monitoring job in prometheus
2025-11-29 11:46:42 +00:00
Viktor Barzin
1b0d5e60d8
add ${__field.name:wrap} in the idrac dashboard to fix wrapping issue [ci skip]
2025-11-15 05:15:50 +00:00
Viktor Barzin
0b7b092c26
add api key to tiny tuya target in prometheus scrape [ci skip]
2025-11-09 22:03:25 +00:00
Viktor Barzin
5f2cc75a8e
add prometheus targets for fuses [ci skip]
2025-10-29 21:59:06 +00:00
Viktor Barzin
76103f52e3
add alert if we use inverter power for 1d straight - probably an issue with switching [ci skip]
2025-10-29 20:09:21 +00:00
Viktor Barzin
18a2695e64
add breakdown in main power source from inverter in grafana [ci skip]
2025-10-28 22:41:44 +00:00
Viktor Barzin
77f92ae4ef
update ups grafana dash to have inverter stats [ci skip]
2025-10-28 22:17:32 +00:00
Viktor Barzin
3161da29a4
add scrape config for tuya bridge and prohibit access to the metrics path via ingress [ci skip]
2025-10-28 21:38:40 +00:00
Viktor Barzin
d5dd81ba30
increase threshold for high power usage to 180 as we have bigger cpu now [ci skip]
2025-10-08 20:33:51 +00:00
Viktor Barzin
c5bb343ebe
disable errors for matrix ingress [ci skip]
2025-08-23 20:38:53 +00:00
Viktor Barzin
87c33629b4
backup all grafana dashboards [ci skip]
2025-08-23 20:30:37 +00:00
Viktor Barzin
1530323477
update registry prometheus url to devvm as pi was too slow [ci skip]
2025-08-23 20:15:05 +00:00
Viktor Barzin
a87b9793ad
disable loki and alloy as they are not used [ci skip]
2025-08-23 20:02:37 +00:00
Viktor Barzin
c49e4d0a86
add loki + alloy deployments for logs collection [ci skip]
2025-05-04 11:25:39 +00:00
Viktor Barzin
c5ae6873c9
add registry monitoring to prometheus [ci skip]
2025-03-30 11:15:54 +00:00
Viktor Barzin
6100372a9e
adjust battery low alert to fire only when there is no power [ci skip]
2025-03-22 15:47:30 +00:00
Viktor Barzin
8eb18ec651
add power and ups battery over time widgets to grafana [ci skip]
2025-03-22 15:46:17 +00:00
Viktor Barzin
cd89d13ab2
disable alert for pods less than in spec [ci skip]
2025-03-16 18:27:13 +00:00
Viktor Barzin
59a651c2ad
add 2 more oids for ups to monitor active and reactive power consumption [ci skip]
2025-03-15 17:54:04 +00:00
Viktor Barzin
dfe47657ee
disable perms errors and server errors for grafana and nextcloud ingresses as they were too noisy [ci skip]
2025-03-15 17:53:24 +00:00
Viktor Barzin
1e674bff7e
add alert for ups low battery remaining [ci skip]
2025-03-02 20:48:07 +00:00
Viktor Barzin
ab9b5b356a
increase low voltage alert to 10 min [ci skip]
2025-03-01 14:28:56 +00:00