Viktor Barzin
|
4a857ebefd
|
Add per-pod GPU memory metrics exporter
- Add DaemonSet that runs on GPU node and exposes Prometheus metrics
- Uses nvidia-smi to collect per-process GPU memory usage
- Maps PIDs to container IDs via /proc/<pid>/cgroup
- Exposes gpu_pod_memory_used_bytes metric at :9401/metrics
- Add Prometheus scrape config for gpu-pod-memory job
[ci skip]
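A minimal sketch of the collection steps listed above, not the exporter's actual code. It assumes per-process usage comes from `nvidia-smi --query-compute-apps=pid,used_memory --format=csv,noheader,nounits` (values in MiB) and that the container ID is the 64-character hex string in `/proc/<pid>/cgroup`. The helper names and the `container_id` label are hypothetical; only the metric name `gpu_pod_memory_used_bytes` comes from the commit message.

```python
import re

# Assumption: container runtimes embed a 64-char hex container ID in the cgroup path.
_CONTAINER_ID = re.compile(r"[0-9a-f]{64}")


def container_id_from_cgroup(cgroup_text):
    """Return the first 64-hex container ID found in /proc/<pid>/cgroup text, or None."""
    for line in cgroup_text.splitlines():
        match = _CONTAINER_ID.search(line)
        if match:
            return match.group(0)
    return None


def parse_nvidia_smi(csv_text):
    """Parse 'pid, used_memory_MiB' rows into {pid: used_bytes}, skipping unparsable rows."""
    usage = {}
    for line in csv_text.strip().splitlines():
        try:
            pid_str, mem_str = (part.strip() for part in line.split(","))
            usage[int(pid_str)] = int(mem_str) * 1024 * 1024
        except ValueError:
            # e.g. used_memory reported as "[N/A]" or a malformed row
            continue
    return usage


def render_metric(container_id, used_bytes):
    """Format one sample in Prometheus text exposition format (label name is hypothetical)."""
    return f'gpu_pod_memory_used_bytes{{container_id="{container_id}"}} {used_bytes}'
```

In a real exporter these helpers would be fed by reading `/proc/<pid>/cgroup` for each PID reported by `nvidia-smi`, and the rendered lines would be served at `:9401/metrics`.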
|
2026-01-31 16:58:14 +00:00 |
|
Viktor Barzin
|
3bda3ab956
|
reduce the frequency of polling idrac and remove some duplicates [ci skip]
|
2026-01-24 18:47:22 +00:00 |
|
Viktor Barzin
|
d751a5924c
|
add prometheus alerts for deployment/statefulset/daemonset replica mismatches [ci skip]
- Add DeploymentReplicasMismatch alert
- Add StatefulSetReplicasMismatch alert
- Add DaemonSetMissingPods alert
- Add .claude/ directory with remote executor and knowledge base
|
2026-01-18 11:04:51 +00:00 |
|
Viktor Barzin
|
5609bbbaf3
|
dedup ram alert and increase threshold to 95% [ci skip]
|
2026-01-17 22:42:22 +00:00 |
|
Viktor Barzin
|
88a62f90f5
|
scale grafana to 3 pods for resilience [ci skip]
|
2026-01-12 18:27:54 +00:00 |
|
Viktor Barzin
|
8abb8eddc0
|
add tier to all deployments [ci skip]
|
2026-01-10 16:28:14 +00:00 |
|
Viktor Barzin
|
20cd480988
|
monitor idrac more frequently [ci skip]
|
2026-01-07 18:55:59 +00:00 |
|
Viktor Barzin
|
402dc1f91a
|
update cpu temp alert to above 60 [ci skip]
|
2026-01-04 12:26:46 +00:00 |
|
Viktor Barzin
|
29194c06b9
|
update definition of high cpu usage to use pve metrics instead, over a longer period [ci skip]
|
2026-01-03 23:30:28 +00:00 |
|
Viktor Barzin
|
d151b582f7
|
update cpu temp alert to 55C down from 75C [ci skip]
|
2026-01-03 16:48:54 +00:00 |
|
Viktor Barzin
|
feeb6ee86c
|
increase idrac scrape timeout in an attempt to reduce 499 [ci skip]
|
2025-12-29 20:34:40 +00:00 |
|
Viktor Barzin
|
42403e0b35
|
add registry low cache hit rate alert [ci skip]
|
2025-12-29 10:43:57 +00:00 |
|
Viktor Barzin
|
a3624f80e0
|
replace hardcoded namespace with module reference [ci skip]
|
2025-12-29 10:23:42 +00:00 |
|
Viktor Barzin
|
8be0fc9699
|
add more alerts in prometheus and group them better [ci skip]
|
2025-12-28 20:07:33 +00:00 |
|
Viktor Barzin
|
95a6708361
|
move out all monitoring resources to separate tf files [ci skip]
|
2025-12-28 20:07:00 +00:00 |
|
Viktor Barzin
|
34f90c06dc
|
move grafana into separate file and turn off persistence as we use external db now [ci skip]
|
2025-12-28 20:05:27 +00:00 |
|
Viktor Barzin
|
90bdd38de1
|
migrate grafana to mysql from sqlite [ci skip]
|
2025-12-27 20:51:05 +00:00 |
|
Viktor Barzin
|
e12c117bdf
|
move prometheus wal to tmpfs to reduce wear [ci skip]
|
2025-12-26 20:10:20 +00:00 |
|
Viktor Barzin
|
a7dc4320b3
|
add job to monitor pve host using node exporter and add alert for high ssd writes [ci skip]
|
2025-12-26 16:23:49 +00:00 |
|
Viktor Barzin
|
b622c94334
|
add pve exporter playbook + pve exporter in k8s [ci skip]
|
2025-12-26 16:23:17 +00:00 |
|
Viktor Barzin
|
0197c5a09c
|
update most important grafana dashboards [ci skip]
|
2025-12-23 18:13:25 +00:00 |
|
Viktor Barzin
|
bd60f0faa3
|
add alert for docker registry [ci skip]
|
2025-12-18 10:45:32 +00:00 |
|
Viktor Barzin
|
33be167720
|
add local-only ingress for snmp and idrac exporters [ci skip]
|
2025-12-14 19:08:44 +00:00 |
|
Viktor Barzin
|
bc486227f7
|
add separate idrac monitoring tool and dashboard [ci skip]
|
2025-12-14 09:50:16 +00:00 |
|
Viktor Barzin
|
f85d793afd
|
add haos monitoring job in prometheus
|
2025-11-29 11:46:42 +00:00 |
|
Viktor Barzin
|
2c022fd924
|
add ${__field.name:wrap} in the idrac dashboard to fix wrapping issue [ci skip]
|
2025-11-15 05:15:50 +00:00 |
|
Viktor Barzin
|
0752e80231
|
add api key to tiny tuya target in prometheus scrape [ci skip]
|
2025-11-09 22:03:25 +00:00 |
|
Viktor Barzin
|
16d27ec225
|
add prometheus targets for fuses [ci skip]
|
2025-10-29 21:59:06 +00:00 |
|
Viktor Barzin
|
279592b6e3
|
add alert if we use inverter power for 1d straight - probably an issue with switching [ci skip]
|
2025-10-29 20:09:21 +00:00 |
|
Viktor Barzin
|
71428ddbc0
|
add breakdown in main power source from inverter in grafana [ci skip]
|
2025-10-28 22:41:44 +00:00 |
|
Viktor Barzin
|
62bec95bf2
|
update ups grafana dash to have inverter stats [ci skip]
|
2025-10-28 22:17:32 +00:00 |
|
Viktor Barzin
|
6be6b06d90
|
add scrape config for tuya bridge and prohibit access to the metrics path via ingress [ci skip]
|
2025-10-28 21:38:40 +00:00 |
|
Viktor Barzin
|
093ed81fce
|
increase threshold for high power usage to 180 as we have bigger cpu now [ci skip]
|
2025-10-08 20:33:51 +00:00 |
|
Viktor Barzin
|
c3bc184169
|
disable errors for matrix ingress [ci skip]
|
2025-08-23 20:38:53 +00:00 |
|
Viktor Barzin
|
085dc3258e
|
backup all grafana dashboards [ci skip]
|
2025-08-23 20:30:37 +00:00 |
|
Viktor Barzin
|
adcd0695ba
|
update registry prometheus url to devvm as pi was too slow [ci skip]
|
2025-08-23 20:15:05 +00:00 |
|
Viktor Barzin
|
cfa32d0e31
|
disable loki and alloy as it is not used [ci skip]
|
2025-08-23 20:02:37 +00:00 |
|
Viktor Barzin
|
b425985555
|
add loki + alloy deployments for logs collection [ci skip]
|
2025-05-04 11:25:39 +00:00 |
|
Viktor Barzin
|
16d6bcc544
|
add registry monitoring to prometheus [ci skip]
|
2025-03-30 11:15:54 +00:00 |
|
Viktor Barzin
|
534fcdbfe3
|
adjust battery low alert to fire only when there is no power [ci skip]
|
2025-03-22 15:47:30 +00:00 |
|
Viktor Barzin
|
daeb3b6693
|
add power and ups battery over time widgets to grafana [ci skip]
|
2025-03-22 15:46:17 +00:00 |
|
Viktor Barzin
|
987fc402b5
|
disable alert for pods less than in spec [ci skip]
|
2025-03-16 18:27:13 +00:00 |
|
Viktor Barzin
|
d9e06a9853
|
add 2 more oids for ups to monitor active and reactive power consumption [ci skip]
|
2025-03-15 17:54:04 +00:00 |
|
Viktor Barzin
|
72bedfdd6e
|
disable perms errors and server errors for grafana and nextcloud ingresses as they were too noisy [ci skip]
|
2025-03-15 17:53:24 +00:00 |
|
Viktor Barzin
|
f7eff3cb74
|
add alert for ups low battery remaining [ci skip]
|
2025-03-02 20:48:07 +00:00 |
|
Viktor Barzin
|
095624a337
|
increase low voltage alert to 10 min [ci skip]
|
2025-03-01 14:28:56 +00:00 |
|
Viktor Barzin
|
5ef9ba5917
|
increase interval for 500 alerts to 20m [ci skip]
|
2025-01-10 20:47:25 +00:00 |
|
Viktor Barzin
|
aeee71751f
|
move prometheus alerts to different channel and move high cpu period [ci skip]
|
2025-01-04 14:27:48 +00:00 |
|
Viktor Barzin
|
3473f64670
|
increase idle power threshold to 130w [ci skip]
|
2025-01-03 17:49:24 +00:00 |
|
Viktor Barzin
|
4b725b02a6
|
add alert status to message [ci skip]
|
2025-01-02 21:13:09 +00:00 |
|