Viktor Barzin
c32acc70e6
Migrate all service modules from nginx-ingress to Traefik
- Remove nginx-specific ingress variables (use_proxy_protocol, proxy_timeout, additional_configuration_snippet)
- Update ingress annotations to use Traefik middleware CRDs
- Delete nginx-ingress module (replaced by traefik)
- Add new traefik middleware.tf for shared middleware definitions
- Update service modules to work with new ingress_factory interface
2026-02-07 13:25:49 +00:00
Viktor Barzin
4a857ebefd
Add per-pod GPU memory metrics exporter
- Add DaemonSet that runs on GPU node and exposes Prometheus metrics
- Uses nvidia-smi to collect per-process GPU memory usage
- Maps PIDs to container IDs via /proc/<pid>/cgroup
- Exposes gpu_pod_memory_used_bytes metric at :9401/metrics
- Add Prometheus scrape config for gpu-pod-memory job
[ci skip]
2026-01-31 16:58:14 +00:00
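The exporter flow described in commit 4a857ebefd (nvidia-smi output, PID to container ID via /proc/<pid>/cgroup, Prometheus text output) could look roughly like the sketch below. This is not the code from the commit. It assumes the input comes from `nvidia-smi --query-compute-apps=pid,used_memory --format=csv,noheader,nounits` (values in MiB), that container IDs appear as 64-character hex strings in the cgroup path, and that the metric carries a hypothetical `container_id` label. The function names are illustrative only.

```python
import re

# Assumption: container IDs show up as 64 hex characters in cgroup paths.
CONTAINER_ID_RE = re.compile(r"[0-9a-f]{64}")


def parse_nvidia_smi(csv_text):
    """Parse 'pid, used_memory' rows (MiB, no header, no units) into {pid: bytes}."""
    usage = {}
    for line in csv_text.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        # Skip rows that are not two plain integers (e.g. "[N/A]" values).
        if len(parts) != 2 or not all(p.isdigit() for p in parts):
            continue
        usage[int(parts[0])] = int(parts[1]) * 1024 * 1024
    return usage


def container_id_from_cgroup(cgroup_text):
    """Return the last 64-hex ID found in /proc/<pid>/cgroup content, or None."""
    matches = CONTAINER_ID_RE.findall(cgroup_text)
    return matches[-1] if matches else None


def render_metrics(per_container):
    """Render {container_id: bytes} in the Prometheus text exposition format."""
    lines = ["# TYPE gpu_pod_memory_used_bytes gauge"]
    for cid, used in sorted(per_container.items()):
        # The container_id label is hypothetical, not taken from the commit.
        lines.append('gpu_pod_memory_used_bytes{container_id="%s"} %d' % (cid, used))
    return "\n".join(lines) + "\n"
```

A real exporter would additionally read /proc/<pid>/cgroup for each PID, resolve container IDs to pod names, and serve the rendered text at :9401/metrics. Those steps are left out here because the commit message does not describe how they are done.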
Viktor Barzin
3bda3ab956
reduce the frequency of polling idrac and remove some duplicates [ci skip]
2026-01-24 18:47:22 +00:00
Viktor Barzin
d751a5924c
add prometheus alerts for deployment/statefulset/daemonset replica mismatches [ci skip]
- Add DeploymentReplicasMismatch alert
- Add StatefulSetReplicasMismatch alert
- Add DaemonSetMissingPods alert
- Add .claude/ directory with remote executor and knowledge base
2026-01-18 11:04:51 +00:00
Viktor Barzin
5609bbbaf3
dedup ram alert and increase threshold to 95% [ci skip]
2026-01-17 22:42:22 +00:00
Viktor Barzin
20cd480988
monitor idrac more frequently [ci skip]
2026-01-07 18:55:59 +00:00
Viktor Barzin
402dc1f91a
update cpu temp alert to above 60 [ci skip]
2026-01-04 12:26:46 +00:00
Viktor Barzin
29194c06b9
update definition of high cpu usage to use pve metrics instead for a longer period [ci skip]
2026-01-03 23:30:28 +00:00
Viktor Barzin
d151b582f7
update cpu temp alert to 55C down from 75C [ci skip]
2026-01-03 16:48:54 +00:00
Viktor Barzin
feeb6ee86c
increase idrac scrape timeout in attempt to reduce 499 [ci skip]
2025-12-29 20:34:40 +00:00
Viktor Barzin
42403e0b35
add registry low cache hit rate alert [ci skip]
2025-12-29 10:43:57 +00:00
Viktor Barzin
8be0fc9699
add more alerts in prometheus and group them better [ci skip]
2025-12-28 20:07:33 +00:00
Viktor Barzin
e12c117bdf
move prometheus wal to tmpfs to reduce wear [ci skip]
2025-12-26 20:10:20 +00:00
Viktor Barzin
a7dc4320b3
add job to monitor pve host using node exporter and add alert for high ssd writes [ci skip]
2025-12-26 16:23:49 +00:00
Viktor Barzin
bd60f0faa3
add alert for docker registry [ci skip]
2025-12-18 10:45:32 +00:00
Viktor Barzin
bc486227f7
add separate idrac monitoring tool and dashboard [ci skip]
2025-12-14 09:50:16 +00:00
Viktor Barzin
f85d793afd
add haos monitoring job in prometheus
2025-11-29 11:46:42 +00:00
Viktor Barzin
0752e80231
add api key to tiny tuya target in prometheus scrape [ci skip]
2025-11-09 22:03:25 +00:00
Viktor Barzin
16d27ec225
add prometheus targets for fuses [ci skip]
2025-10-29 21:59:06 +00:00
Viktor Barzin
279592b6e3
add alert if we use inverter power for 1d straight - probably an issue with switching [ci skip]
2025-10-29 20:09:21 +00:00
Viktor Barzin
6be6b06d90
add scrape config for tuya bridge and prohibit access to the metrics path via ingress [ci skip]
2025-10-28 21:38:40 +00:00
Viktor Barzin
093ed81fce
increase threshold for high power usage to 180 as we have bigger cpu now [ci skip]
2025-10-08 20:33:51 +00:00
Viktor Barzin
c3bc184169
disable errors for matrix ingress [ci skip]
2025-08-23 20:38:53 +00:00
Viktor Barzin
adcd0695ba
update registry prometheus url to devvm as pi was too slow [ci skip]
2025-08-23 20:15:05 +00:00
Viktor Barzin
16d6bcc544
add registry monitoring to prometheus [ci skip]
2025-03-30 11:15:54 +00:00
Viktor Barzin
534fcdbfe3
adjust battery low alert to fire only when there is no power [ci skip]
2025-03-22 15:47:30 +00:00
Viktor Barzin
987fc402b5
disable alert for pods less than in spec [ci skip]
2025-03-16 18:27:13 +00:00
Viktor Barzin
72bedfdd6e
disable perms errors and server errors for grafana and nextcloud ingresses as they were too noisy [ci skip]
2025-03-15 17:53:24 +00:00
Viktor Barzin
f7eff3cb74
add alert for ups low battery remaining [ci skip]
2025-03-02 20:48:07 +00:00
Viktor Barzin
095624a337
increase low voltage alert to 10 min [ci skip]
2025-03-01 14:28:56 +00:00
Viktor Barzin
5ef9ba5917
increase interval for 500 alerts to 20m [ci skip]
2025-01-10 20:47:25 +00:00
Viktor Barzin
aeee71751f
move prometheus alerts to different channel and move high cpu period [ci skip]
2025-01-04 14:27:48 +00:00
Viktor Barzin
3473f64670
increase idle power threshold to 130w [ci skip]
2025-01-03 17:49:24 +00:00
Viktor Barzin
4b725b02a6
add alert status to message [ci skip]
2025-01-02 21:13:09 +00:00
Viktor Barzin
c7113fa495
update prometheus alerts to be correctly grouped and sent to slack and deprecate some old ones [ci skip]
2025-01-02 20:33:55 +00:00
Viktor Barzin
9b0d686873
update prometheus chart values to get slack notifications to work and add alerts for 4xx and 5xx on ingress [ci skip]
2025-01-01 11:39:16 +00:00
Viktor Barzin
40f4354316
fix monitoring stack [ci skip]
2024-12-31 17:15:06 +00:00
Viktor Barzin
ce90629b54
add low voltage alert to prometheus and update some dashboards [ci skip]
2024-12-23 18:21:01 +00:00
Viktor Barzin
fbe305a891
add ups snmp exporter to prometheus [ci skip]
2024-12-15 18:13:33 +00:00
Viktor Barzin
185a944acd
replace oauth proxy with authentik auth [ci skip]
2024-11-18 22:06:31 +00:00
Viktor Barzin
64f81621c8
add homepage module and some more integrations [ci skip]
2024-10-20 13:05:03 +00:00
Viktor Barzin
b54fbf72fd
add meshcentral and diun [ci skip]
2024-08-18 18:14:22 +00:00
Viktor Barzin
506b4a2f87
reduce prometheus storage retention from 12w -> 8w to save ~30gb [ci skip]
2024-08-07 20:18:13 +00:00
Viktor Barzin
828f3f115a
update old prometheus alert detectors and upgrade immich to 101 [ci skip]
2024-04-12 21:15:31 +00:00
Viktor Barzin
8afbec0d23
remove hack for london openwrt monitoring after having tailscale now [ci skip]
2024-03-30 18:28:11 +00:00
Viktor Barzin
e5061dec27
update openwrt london prometheus target address [ci skip]
2024-03-29 22:20:29 +00:00
Viktor Barzin
215deb5568
add monitoring jobs to p8s for istiod and the service mesh [ci skip]
2024-01-07 17:47:36 +00:00
Viktor Barzin
15bade148c
upgrade prometheus helm chart [ci skip]
2023-12-25 21:40:19 +00:00
Viktor Barzin
e3a8cd16b4
add baseurl to prometheus helm chart so alertmanager sends correct links with prometheus public url instead of podname [ci skip]
2023-12-25 13:48:19 +00:00
Viktor Barzin
3019f1cca8
add prometheus monitoring to crowdsec [ci skip]
2023-11-25 13:34:16 +00:00