Commit graph

106 commits

Author SHA1 Message Date
Viktor Barzin
feeb6ee86c
increase idrac scrape timeout in attempt to reduce 499 [ci skip] 2025-12-29 20:34:40 +00:00
Viktor Barzin
42403e0b35
add registry low cache hit rate alert [ci skip] 2025-12-29 10:43:57 +00:00
Viktor Barzin
a3624f80e0
replace hardcoded namespace with module reference [ci skip] 2025-12-29 10:23:42 +00:00
Viktor Barzin
8be0fc9699
add more alerts in prometheus and group them better [ci skip] 2025-12-28 20:07:33 +00:00
Viktor Barzin
95a6708361
move out all monitoring resources to separate tf files [ci skip] 2025-12-28 20:07:00 +00:00
Viktor Barzin
34f90c06dc
move grafana into separate file and turn off persistence as we use external db now [ci skip] 2025-12-28 20:05:27 +00:00
Viktor Barzin
90bdd38de1
migrate grafana to mysql from sqlite [ci skip] 2025-12-27 20:51:05 +00:00
Viktor Barzin
e12c117bdf
move prometheus wal to tmpfs to reduce wear [ci skip] 2025-12-26 20:10:20 +00:00
Viktor Barzin
a7dc4320b3
add job to monitor pve host using node exporter and add alert for high ssd writes [ci skip] 2025-12-26 16:23:49 +00:00
Viktor Barzin
b622c94334
add pve exporter playbook + pve exporter in k8s [ci skip] 2025-12-26 16:23:17 +00:00
Viktor Barzin
0197c5a09c
update most important grafana dashboards [ci skip] 2025-12-23 18:13:25 +00:00
Viktor Barzin
bd60f0faa3
add alert for docker registry [ci skip] 2025-12-18 10:45:32 +00:00
Viktor Barzin
33be167720
add local-only ingress for snmp and idrac exporters [ci skip] 2025-12-14 19:08:44 +00:00
Viktor Barzin
bc486227f7
add separate idrac monitoring tool and dashboard [ci skip] 2025-12-14 09:50:16 +00:00
Viktor Barzin
f85d793afd
add haos monitoring job in prometheus 2025-11-29 11:46:42 +00:00
Viktor Barzin
2c022fd924
add ${__field.name:wrap} in the idrac dashboard to fix wrapping issue [ci skip] 2025-11-15 05:15:50 +00:00
Viktor Barzin
0752e80231
add api key to tiny tuya target in prometheus scrape [ci skip] 2025-11-09 22:03:25 +00:00
Viktor Barzin
16d27ec225
add prometheus targets for fuses [ci skip] 2025-10-29 21:59:06 +00:00
Viktor Barzin
279592b6e3
add alert if we use inverter power for 1d straight - probably an issue with switching [ci skip] 2025-10-29 20:09:21 +00:00
Viktor Barzin
71428ddbc0
add breakdown in main power source from inverter in grafana [ci skip] 2025-10-28 22:41:44 +00:00
Viktor Barzin
62bec95bf2
update ups grafana dash to have inverter stats [ci skip] 2025-10-28 22:17:32 +00:00
Viktor Barzin
6be6b06d90
add scrape config for tuya bridge and prohibit access to the metrics path via ingress [ci skip] 2025-10-28 21:38:40 +00:00
Viktor Barzin
093ed81fce
increase threshold for high power usage to 180 as we have bigger cpu now [ci skip] 2025-10-08 20:33:51 +00:00
Viktor Barzin
c3bc184169
disable errors for matrix ingress [ci skip] 2025-08-23 20:38:53 +00:00
Viktor Barzin
085dc3258e
backup all grafana dashboards [ci skip] 2025-08-23 20:30:37 +00:00
Viktor Barzin
adcd0695ba
update registry prometheus url to devvm as pi was too slow [ci skip] 2025-08-23 20:15:05 +00:00
Viktor Barzin
cfa32d0e31
disable loki and alloy as it is not used [ci skip] 2025-08-23 20:02:37 +00:00
Viktor Barzin
b425985555
add loki + alloy deployments for logs collection [ci skip] 2025-05-04 11:25:39 +00:00
Viktor Barzin
16d6bcc544
add registry monitoring to prometheus [ci skip] 2025-03-30 11:15:54 +00:00
Viktor Barzin
534fcdbfe3
adjust battery low alert to fire only when there is no power [ci skip] 2025-03-22 15:47:30 +00:00
Viktor Barzin
daeb3b6693
add power and ups battery over time widgets to grafana [ci skip] 2025-03-22 15:46:17 +00:00
Viktor Barzin
987fc402b5
disable alert for pods less than in spec [ci skip] 2025-03-16 18:27:13 +00:00
Viktor Barzin
d9e06a9853
add 2 more oids for ups to monitor active and reactive power consumption [ci skip] 2025-03-15 17:54:04 +00:00
Viktor Barzin
72bedfdd6e
disable perms errors and server errors for grafana and nextcloud ingresses as they were too noisy [ci skip] 2025-03-15 17:53:24 +00:00
Viktor Barzin
f7eff3cb74
add alert for ups low battery remaining [ci skip] 2025-03-02 20:48:07 +00:00
Viktor Barzin
095624a337
increase low voltage alert to 10 min [ci skip] 2025-03-01 14:28:56 +00:00
Viktor Barzin
5ef9ba5917
increase interval for 500 alerts to 20m [ci skip] 2025-01-10 20:47:25 +00:00
Viktor Barzin
aeee71751f
move prometheus alerts to different channel and move high cpu period [ci skip] 2025-01-04 14:27:48 +00:00
Viktor Barzin
3473f64670
increase idle power threshold to 130w [ci skip] 2025-01-03 17:49:24 +00:00
Viktor Barzin
4b725b02a6
add alert status to message [ci skip] 2025-01-02 21:13:09 +00:00
Viktor Barzin
c7113fa495
update prometheus alerts to be correctly grouped and sent to slack and deprecate some old ones [ci skip] 2025-01-02 20:33:55 +00:00
Viktor Barzin
9b0d686873
update prometheus chart values to get slack notifications to work and add alerts for 4xx and 5xx on ingress [ci skip] 2025-01-01 11:39:16 +00:00
Viktor Barzin
40f4354316
fix monitoring stack [ci skip] 2024-12-31 17:15:06 +00:00
Viktor Barzin
d94f39f531
add all grafana dashboards models [ci skip] 2024-12-24 13:48:21 +00:00
Viktor Barzin
ce90629b54
add low voltage alert to prometheus and update some dashboards [ci skip] 2024-12-23 18:21:01 +00:00
Viktor Barzin
0ef7430b6f
fix typo in idrac voltage to be in volts not watts [ci skip] 2024-12-17 19:35:27 +00:00
Viktor Barzin
e6aa28be1c
update ups grafana [ci skip] 2024-12-17 19:22:08 +00:00
Viktor Barzin
23a882a3d5
update idrac refresh rate to 1m [ci skip] 2024-12-17 19:05:57 +00:00
Viktor Barzin
63df62ce1f
add idrac grafana dashboard to repo [ci skip] 2024-12-16 22:36:00 +00:00
Viktor Barzin
ec8f672dfd
add grafana dashboard for ups [ci skip] 2024-12-15 20:58:01 +00:00