Viktor Barzin
c4a7c5df8e
[ci skip] Update Loki dashboard to use correct datasource UID
2026-02-13 23:41:40 +00:00
Viktor Barzin
fabece6370
[ci skip] Fix compactor/ruler paths to use writable /var/loki mount
2026-02-13 23:22:13 +00:00
Viktor Barzin
0d3acec82c
[ci skip] Re-enable lokiCanary (required by Helm chart validation)
2026-02-13 23:18:13 +00:00
Viktor Barzin
fea7c6cbb1
[ci skip] Disable gateway/canary/cache, increase timeout for Loki deploy
2026-02-13 23:17:32 +00:00
Viktor Barzin
69aae2ec9d
[ci skip] Fix code review findings: correct Alertmanager URL, add atomic to Loki, remove dead minio NFS export, update design doc
2026-02-13 23:08:44 +00:00
Viktor Barzin
71ff803978
[ci skip] Add centralized log collection: Loki + Alloy + sysctl DaemonSet
2026-02-13 23:03:40 +00:00
Viktor Barzin
cd5261161b
[ci skip] Add HomeAssistantDown alert for ha-sofia
...
Fires after 5m if the haos Prometheus scrape target is unreachable.
Covers the HTTP API endpoint which shares the same process as the
WebSocket API used by the mobile app.
2026-02-11 23:24:46 +00:00
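The "fires after 5m" behaviour can be modelled with a short Python sketch. This is an illustration of a hold period on a down target, not the rule from the repository; the function name, the sample format, and the 300-second default are assumptions made for the example.

```python
from typing import Iterable, Optional, Tuple


def fires_after_hold(samples: Iterable[Tuple[float, bool]],
                     hold_seconds: float = 300.0) -> bool:
    """Return True if the target stayed down for at least hold_seconds.

    samples: (unix_timestamp, target_is_up) pairs in ascending time order.
    """
    down_since: Optional[float] = None  # start of the current outage, if any
    for ts, is_up in samples:
        if is_up:
            # Any successful scrape resets the pending timer.
            down_since = None
            continue
        if down_since is None:
            down_since = ts
        if ts - down_since >= hold_seconds:
            return True
    return False
```

A single successful scrape inside the window resets the timer, so brief restarts of the scraped process would not trigger the alert in this model.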
Viktor Barzin
46ffc37dcf
[ci skip] Fix all active Prometheus alerts
...
- meshcentral: rename port from "https" to "http" — MeshCentral serves
plain HTTP when REVERSE_PROXY=true, but Traefik inferred HTTPS from the
port name, causing 100% 5xx errors
- osm-routing/otp: scale to 0 — TfL GTFS data expired, OTP crash-loops
trying to build graph with no valid transit trips
- wireguard: add prometheus.io/port=9586 annotation — without it,
Prometheus tried scraping all container ports (51820 UDP, 80)
- travel-blog: remove stale prometheus.io annotations and dead port 9113
— nginx-exporter sidecar was commented out but annotations remained
- dawarich: remove prometheus.io annotations — exporter env vars are
commented out so nothing listens on port 9394
- monitoring: raise CPU temp threshold 60°C→75°C (E5-2699 v4 Tcase is
79°C), lower registry cache threshold 50%→25%, add minimum traffic
floor (>0.1 req/s) to 4xx/5xx rate alerts to prevent false positives
on low-traffic services
2026-02-11 22:40:56 +00:00
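The minimum traffic floor on the 4xx/5xx rate alerts can be sketched in Python. Only the 0.1 req/s floor comes from the commit message; the function name and the 5% error-ratio threshold are hypothetical values chosen for illustration.

```python
def error_alert_active(error_rps: float,
                       total_rps: float,
                       max_error_ratio: float = 0.05,  # assumed threshold
                       min_total_rps: float = 0.1) -> bool:
    """Decide whether an error-rate alert should be active.

    The alert is suppressed unless total traffic exceeds min_total_rps,
    so a single failed request on an idle service cannot push the
    ratio over the threshold.
    """
    if total_rps <= min_total_rps:
        return False
    return error_rps / total_rps > max_error_ratio
```

Without the floor, a service receiving one request per minute that happens to fail would report a 100% error ratio and page immediately.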
Viktor Barzin
c8a41ac567
[ci skip] Add 12 Prometheus alert rules for monitoring gaps
...
Add 3 new alert groups and 1 rule to existing group:
- Storage: NodeFilesystemFull (<10% free), PVFillingUp (>85% used)
- K8s Health: PodCrashLooping, ContainerOOMKilled, NodeNotReady,
NodeConditionBad, JobFailed
- Infrastructure Health: CoreDNSErrors, ScrapeTargetDown,
PrometheusStorageFull, PrometheusNotificationsFailing
- R730 Host: FanFailure (iDRAC Redfish fan health)
2026-02-11 22:14:30 +00:00
Viktor Barzin
dbf397841a
Standardize Prometheus alert formatting and fix Slack notifications
...
- Add color coding (red/green) to Slack alerts, show alertname in title
- Use summary annotation in Slack text (description was always empty)
- Format all alert summaries consistently: value with units and threshold
- Fix ratio expressions (CPU/memory) to display as percentages
- Fix "failiure" typo, capitalize Tailscale
2026-02-11 21:53:22 +00:00
Viktor Barzin
73aab7f4ce
[ci skip] Assorted pending changes: ollama API auth, nvidia dashboard, traefik rewrite-body plugin
...
- ollama: Add basicAuth middleware for external API access
- monitoring: Update nvidia dashboard (add GPU memory per app panel, bump to v9)
- plotting-book: Switch to ancamilea/book-plotter:latest, add lifecycle ignore
- reverse_proxy/factory: Fix rybbit plugin name (rewritebody -> rewrite-body)
- traefik: Switch to packruler/rewrite-body plugin v1.2.0
2026-02-10 21:29:54 +00:00
Viktor Barzin
b36932f9a3
Migrate all service modules from nginx-ingress to Traefik
...
- Remove nginx-specific ingress variables (use_proxy_protocol, proxy_timeout, additional_configuration_snippet)
- Update ingress annotations to use Traefik middleware CRDs
- Delete nginx-ingress module (replaced by traefik)
- Add new traefik middleware.tf for shared middleware definitions
- Update service modules to work with new ingress_factory interface
2026-02-07 13:25:49 +00:00
Viktor Barzin
da4cf18d6d
Add per-pod GPU memory metrics exporter
...
- Add DaemonSet that runs on GPU node and exposes Prometheus metrics
- Uses nvidia-smi to collect per-process GPU memory usage
- Maps PIDs to container IDs via /proc/<pid>/cgroup
- Exposes gpu_pod_memory_used_bytes metric at :9401/metrics
- Add Prometheus scrape config for gpu-pod-memory job
[ci skip]
2026-01-31 16:58:14 +00:00
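The PID-to-container mapping step might look roughly like the Python sketch below. It assumes the container ID appears as a 64-character hex string somewhere in the contents of /proc/<pid>/cgroup, which is common but depends on the container runtime and cgroup driver. The helper names and the `pod` label are hypothetical; only the metric name `gpu_pod_memory_used_bytes` is taken from the commit message.

```python
import re
from typing import Optional

# Assumption: container IDs are 64 lowercase hex characters.
_CONTAINER_ID = re.compile(r"([0-9a-f]{64})")


def container_id_from_cgroup(cgroup_text: str) -> Optional[str]:
    """Extract the first container ID found in /proc/<pid>/cgroup contents."""
    for line in cgroup_text.splitlines():
        match = _CONTAINER_ID.search(line)
        if match:
            return match.group(1)
    return None  # process is not running inside a recognisable container


def format_metric(pod: str, used_bytes: int) -> str:
    """Render one sample in the Prometheus text exposition format."""
    return f'gpu_pod_memory_used_bytes{{pod="{pod}"}} {used_bytes}'
```

In a real exporter the text passed to `container_id_from_cgroup` would be read from the file for each PID reported by nvidia-smi, and the container ID would then be resolved to a pod name through the runtime or the Kubernetes API.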
Viktor Barzin
10092ec285
reduce the frequency of polling idrac and remove some duplicates [ci skip]
2026-01-24 18:47:22 +00:00
Viktor Barzin
a1d945a0b2
add prometheus alerts for deployment/statefulset/daemonset replica mismatches [ci skip]
...
- Add DeploymentReplicasMismatch alert
- Add StatefulSetReplicasMismatch alert
- Add DaemonSetMissingPods alert
- Add .claude/ directory with remote executor and knowledge base
2026-01-18 11:04:51 +00:00
Viktor Barzin
185a138cc5
dedup ram alert and increase threshold to 95% [ci skip]
2026-01-17 22:42:22 +00:00
Viktor Barzin
61e318398c
scale grafana to 3 pods for resilience [ci skip]
2026-01-12 18:27:54 +00:00
Viktor Barzin
f1e9fb9afe
add tier to all deployments [ci skip]
2026-01-10 16:28:14 +00:00
Viktor Barzin
1b5cbeb9c8
monitor idrac more frequently [ci skip]
2026-01-07 18:55:59 +00:00
Viktor Barzin
934fa34c79
update cpu temp alert to above 60 [ci skip]
2026-01-04 12:26:46 +00:00
Viktor Barzin
01d4c9c3e1
update definition of high cpu usage to use pve metrics instead, over a longer period [ci skip]
2026-01-03 23:30:28 +00:00
Viktor Barzin
31c403cadb
update cpu temp alert to 55C down from 75C [ci skip]
2026-01-03 16:48:54 +00:00
Viktor Barzin
d37c693a94
increase idrac scrape timeout in attempt to reduce 499 [ci skip]
2025-12-29 20:34:40 +00:00
Viktor Barzin
253e77f22d
add registry low cache hit rate alert [ci skip]
2025-12-29 10:43:57 +00:00
Viktor Barzin
f1dde96d80
replace hardcoded namespace with module reference [ci skip]
2025-12-29 10:23:42 +00:00
Viktor Barzin
cf9d346cae
add more alerts in prometheus and group them better [ci skip]
2025-12-28 20:07:33 +00:00
Viktor Barzin
a595c4db56
move out all monitoring resources to separate tf files [ci skip]
2025-12-28 20:07:00 +00:00
Viktor Barzin
26d55c6637
move grafana into separate file and turn off persistence as we use external db now [ci skip]
2025-12-28 20:05:27 +00:00
Viktor Barzin
f06e050eaa
migrate grafana to mysql from sqlite [ci skip]
2025-12-27 20:51:05 +00:00
Viktor Barzin
0b2e6d09d2
move prometheus wal to tmpfs to reduce wear [ci skip]
2025-12-26 20:10:20 +00:00
Viktor Barzin
6c1ae20448
add job to monitor pve host using node exporter and add alert for high ssd writes [ci skip]
2025-12-26 16:23:49 +00:00
Viktor Barzin
d07c625064
add pve exporter playbook + pve exporter in k8s [ci skip]
2025-12-26 16:23:17 +00:00
Viktor Barzin
59e6591e2a
update most important grafana dashboards [ci skip]
2025-12-23 18:13:25 +00:00
Viktor Barzin
a225bad3cb
add alert for docker registry [ci skip]
2025-12-18 10:45:32 +00:00
Viktor Barzin
397fa0cba7
add local-only ingress for snmp and idrac exporters [ci skip]
2025-12-14 19:08:44 +00:00
Viktor Barzin
b4f45c7e73
add separate idrac monitoring tool and dashboard [ci skip]
2025-12-14 09:50:16 +00:00
Viktor Barzin
34df786fe4
add haos monitoring job in prometheus
2025-11-29 11:46:42 +00:00
Viktor Barzin
1b0d5e60d8
add ${__field.name:wrap} in the idrac dashboard to fix wrapping issue [ci skip]
2025-11-15 05:15:50 +00:00
Viktor Barzin
0b7b092c26
add api key to tiny tuya target in prometheus scrape [ci skip]
2025-11-09 22:03:25 +00:00
Viktor Barzin
5f2cc75a8e
add prometheus targets for fuses [ci skip]
2025-10-29 21:59:06 +00:00
Viktor Barzin
76103f52e3
add alert if we use inverter power for 1d straight - probably an issue with switching [ci skip]
2025-10-29 20:09:21 +00:00
Viktor Barzin
18a2695e64
add breakdown of main power source from inverter in grafana [ci skip]
2025-10-28 22:41:44 +00:00
Viktor Barzin
77f92ae4ef
update ups grafana dash to have inverter stats [ci skip]
2025-10-28 22:17:32 +00:00
Viktor Barzin
3161da29a4
add scrape config for tuya bridge and prohibit access to the metrics path via ingress [ci skip]
2025-10-28 21:38:40 +00:00
Viktor Barzin
d5dd81ba30
increase threshold for high power usage to 180 as we have bigger cpu now [ci skip]
2025-10-08 20:33:51 +00:00
Viktor Barzin
c5bb343ebe
disable errors for matrix ingress [ci skip]
2025-08-23 20:38:53 +00:00
Viktor Barzin
87c33629b4
backup all grafana dashboards [ci skip]
2025-08-23 20:30:37 +00:00
Viktor Barzin
1530323477
update registry prometheus url to devvm as pi was too slow [ci skip]
2025-08-23 20:15:05 +00:00
Viktor Barzin
a87b9793ad
disable loki and alloy as it is not used [ci skip]
2025-08-23 20:02:37 +00:00
Viktor Barzin
c49e4d0a86
add loki + alloy deployments for logs collection [ci skip]
2025-05-04 11:25:39 +00:00