Commit graph

140 commits

Author SHA1 Message Date
Viktor Barzin
c1a18a7426
[ci skip] Fix Prometheus storage alert and Grafana quota exhaustion
- Enable size-based TSDB retention (45GB) to clean up old blocks
  (including 2021-era blocks with failed compaction)
- Increase monitoring namespace quota from 64/128Gi to 80/160Gi
  CPU/memory limits to allow Grafana rolling updates
2026-02-21 21:04:08 +00:00
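For illustration, size-based retention can be modelled as "keep the newest blocks that fit in the budget, drop the rest". The sketch below is a simplified model and not Prometheus's own code; the `Block` type and `blocks_to_delete` helper are hypothetical names, and the WAL and other on-disk overhead are ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    min_time: int      # start of the block's time range (any monotonic unit)
    size_bytes: int    # on-disk size of the block


def blocks_to_delete(blocks: list[Block], max_bytes: int) -> list[Block]:
    """Return the blocks that no longer fit in the size budget, oldest last."""
    running_bytes = 0
    doomed = []
    # Walk from newest to oldest; once the running total exceeds the budget,
    # every older block is a deletion candidate.
    for block in sorted(blocks, key=lambda b: b.min_time, reverse=True):
        running_bytes += block.size_bytes
        if running_bytes > max_bytes:
            doomed.append(block)
    return doomed


GB = 10**9
# Hypothetical block sizes; only the 45GB budget comes from the commit message.
example = [Block(1, 20 * GB), Block(2, 20 * GB), Block(3, 20 * GB)]
print(blocks_to_delete(example, 45 * GB))
```

Under this model an old block is removed regardless of whether it was ever compacted, which is consistent with the commit's aim of clearing out stale blocks.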
Viktor Barzin
a12a81bdd5
[ci skip] Bump inotify max_user_instances from 512 to 8192
Fixes "failed to create fsnotify watcher: too many open files" in Drone
CI builds where vitest exhausts the default inotify instance limit.
2026-02-21 20:21:04 +00:00
Viktor Barzin
1206b3860b
[ci skip] Remove Authentik forward auth from Grafana, add admin password management
Fixes HA mobile app 403 when embedding Grafana dashboards - the webview
blocks third-party cookies needed by Authentik forward auth. Grafana
already has anonymous Viewer access enabled, so forward auth is not
needed. Also adds grafana_admin_password variable and explicit resource
limits to prevent ResourceQuota issues during rolling updates.
2026-02-18 21:40:32 +00:00
Viktor Barzin
d0b39f1987
[ci skip] Implement multi-user Kubernetes access with OIDC
- Add RBAC module (modules/kubernetes/rbac/) with admin, power-user,
  and namespace-owner roles, API server OIDC flags, and audit logging
- Add self-service portal (modules/kubernetes/k8s-portal/) SvelteKit app
  with kubeconfig download and setup instructions
- Configure Alloy to collect audit logs from kube-apiserver
- Add Grafana dashboard for Kubernetes audit log visualization
- Configure Authentik OIDC provider with groups scope mapping
- Wire up k8s_users and ssh_private_key variables through module chain
2026-02-17 21:42:39 +00:00
Viktor Barzin
6f3395fbf5
[ci skip] Add Smart Home (ha-sofia) section to Cluster Health Overview dashboard 2026-02-17 19:48:02 +00:00
Viktor Barzin
c0363be5e4
[ci skip] Add Grafana dashboard for Technitium DNS query logs
Add MySQL datasource and 15-panel dashboard for DNS analytics:
queries over time, response codes, top domains/clients, response
times, blocked/NxDomain domains. Enable Grafana dashboard sidecar
for auto-provisioning dashboards from ConfigMaps.
2026-02-16 23:06:41 +00:00
Viktor Barzin
19136c21f1
[ci skip] Fix .viktorbarzin.lan.viktorbarzin.lan duplicate DNS queries
Add CoreDNS catch-all block for viktorbarzin.lan.viktorbarzin.lan to
return NXDOMAIN immediately, preventing search domain expansion junk
queries from reaching Technitium. Add trailing dots to Prometheus
scrape targets (idrac, ups, ha-sofia) to bypass ndots expansion.
2026-02-16 21:38:38 +00:00
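A minimal sketch of why the trailing dots help, assuming the usual stub-resolver rule: a name ending in a dot is treated as absolute, while a name with fewer dots than ndots is tried against each search domain first. The search list and the hostname below are assumptions for illustration, not taken from the repository.

```python
def candidate_queries(name: str, search: list[str], ndots: int = 5) -> list[str]:
    """Return the fully qualified names a stub resolver would try, in order."""
    if name.endswith("."):
        # Absolute name: the search list is skipped entirely.
        return [name]
    expanded = [f"{name}.{domain}." for domain in search]
    as_is = f"{name}."
    if name.count(".") >= ndots:
        # Enough dots: try the name as given first, then the search list.
        return [as_is] + expanded
    # Too few dots: walk the search list first, then fall back to the name.
    return expanded + [as_is]


# Hypothetical search list; the real one comes from the pod's /etc/resolv.conf.
SEARCH = [
    "monitoring.svc.cluster.local",
    "svc.cluster.local",
    "cluster.local",
    "viktorbarzin.lan",
]

for query in candidate_queries("idrac.viktorbarzin.lan", SEARCH):
    print(query)
print(candidate_queries("idrac.viktorbarzin.lan.", SEARCH))
```

Without the trailing dot, the candidate list includes the doubled `viktorbarzin.lan.viktorbarzin.lan` form that the CoreDNS catch-all now answers with NXDOMAIN; with the dot, only the single intended query is sent.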
Viktor Barzin
3d4cdf3203
[ci skip] Fix Alloy OOMKill and iDRAC priority class conflict
- Alloy: bump memory limits from 64Mi/128Mi to 256Mi/768Mi — pods were
  OOMKilled at 128Mi, steady-state usage is ~400-450Mi per node
- iDRAC Redfish Exporter: add explicit priority_class_name to resolve
  conflict between Kyverno priority injection and default priority: 0
2026-02-16 20:09:53 +00:00
Viktor Barzin
2d015c1cb4
Update Cluster Health dashboard: dedup metrics, GPU memory, remove broken panels [ci skip] 2026-02-15 21:51:41 +00:00
Viktor Barzin
a2b44c8ff7
Update Cluster Health dashboard: reorder rows, add links, key services, sorting [ci skip] 2026-02-15 21:24:08 +00:00
Viktor Barzin
f447e45ee1
Add Cluster Health Overview Grafana dashboard [ci skip] 2026-02-15 19:38:28 +00:00
Viktor Barzin
1564ec7e79
Add tier-based resource governance via Kyverno [ci skip]
Four layers of noisy-neighbor protection using existing tier system:
- PriorityClasses (tier-0-core through tier-4-aux)
- LimitRange defaults auto-generated per namespace tier
- ResourceQuotas auto-generated per namespace tier
- PriorityClassName injection on pods via Kyverno mutate

Custom quota overrides for monitoring and crowdsec namespaces
which exceed the default tier quotas.
2026-02-15 18:48:33 +00:00
Viktor Barzin
7644c419a4
[ci skip] Update Loki dashboard to use correct datasource UID 2026-02-13 23:41:40 +00:00
Viktor Barzin
cd2d13d949
[ci skip] Fix compactor/ruler paths to use writable /var/loki mount 2026-02-13 23:22:13 +00:00
Viktor Barzin
d906513f09
[ci skip] Re-enable lokiCanary (required by Helm chart validation) 2026-02-13 23:18:13 +00:00
Viktor Barzin
a38c3d3dc7
[ci skip] Disable gateway/canary/cache, increase timeout for Loki deploy 2026-02-13 23:17:32 +00:00
Viktor Barzin
f013c0a139
[ci skip] Fix code review findings: correct Alertmanager URL, add atomic to Loki, remove dead minio NFS export, update design doc 2026-02-13 23:08:44 +00:00
Viktor Barzin
c7236f09f1
[ci skip] Add centralized log collection: Loki + Alloy + sysctl DaemonSet 2026-02-13 23:03:40 +00:00
Viktor Barzin
8bea552664
[ci skip] Add HomeAssistantDown alert for ha-sofia
Fires after 5m if the haos Prometheus scrape target is unreachable.
Covers the HTTP API endpoint which shares the same process as the
WebSocket API used by the mobile app.
2026-02-11 23:24:46 +00:00
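The "fires after 5m" behaviour can be sketched as a hold timer over scrape results: the alert only fires once the target has been continuously down for the whole window. This is an illustrative model of that idea, not the Prometheus rule evaluator; the function name and sample layout are hypothetical.

```python
def fires(samples: list[tuple[float, int]], hold_seconds: float = 300.0) -> bool:
    """Return True if the target was continuously down for hold_seconds.

    samples: (unix_timestamp, up) pairs ordered by time, where up is 1 or 0.
    """
    down_since = None
    for ts, up in samples:
        if up == 1:
            # Any successful scrape resets the pending timer.
            down_since = None
        elif down_since is None:
            down_since = ts
        if down_since is not None and ts - down_since >= hold_seconds:
            return True
    return False


# Down for a full five minutes: fires.
print(fires([(t, 0) for t in range(0, 301, 60)]))
# A single successful scrape at t=120 resets the timer: does not fire.
print(fires([(t, 1 if t == 120 else 0) for t in range(0, 301, 60)]))
```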
Viktor Barzin
0c18a86a7b
[ci skip] Fix all active Prometheus alerts
- meshcentral: rename port from "https" to "http" — MeshCentral serves
  plain HTTP when REVERSE_PROXY=true, but Traefik inferred HTTPS from the
  port name, causing 100% 5xx errors
- osm-routing/otp: scale to 0 — TfL GTFS data expired, OTP crash-loops
  trying to build graph with no valid transit trips
- wireguard: add prometheus.io/port=9586 annotation — without it,
  Prometheus tried scraping all container ports (51820 UDP, 80)
- travel-blog: remove stale prometheus.io annotations and dead port 9113
  — nginx-exporter sidecar was commented out but annotations remained
- dawarich: remove prometheus.io annotations — exporter env vars are
  commented out so nothing listens on port 9394
- monitoring: raise CPU temp threshold 60°C→75°C (E5-2699 v4 Tcase is
  79°C), lower registry cache threshold 50%→25%, add minimum traffic
  floor (>0.1 req/s) to 4xx/5xx rate alerts to prevent false positives
  on low-traffic services
2026-02-11 22:40:56 +00:00
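The minimum traffic floor in the last item can be read as a guard on the error-ratio check: a ratio is only meaningful when there is enough traffic to compute it from. A minimal sketch, assuming a hypothetical `ratio_threshold` (the commit does not state the actual 4xx/5xx thresholds); only the 0.1 req/s floor comes from the message above.

```python
def should_fire(error_rps: float, total_rps: float,
                ratio_threshold: float, min_total_rps: float = 0.1) -> bool:
    """Decide whether an error-rate alert should fire.

    error_rps and total_rps are request rates in requests per second.
    """
    if total_rps <= min_total_rps:
        # Too little traffic: one failed request would look like a 100% error rate.
        return False
    return error_rps / total_rps > ratio_threshold


# One failing request per 100 seconds on an otherwise idle service: suppressed.
print(should_fire(error_rps=0.01, total_rps=0.01, ratio_threshold=0.05))
# 10% errors on a service handling 10 req/s: fires.
print(should_fire(error_rps=1.0, total_rps=10.0, ratio_threshold=0.05))
```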
Viktor Barzin
04eaf0f989
[ci skip] Add 12 Prometheus alert rules for monitoring gaps
Add 3 new alert groups and 1 rule to existing group:
- Storage: NodeFilesystemFull (<10% free), PVFillingUp (>85% used)
- K8s Health: PodCrashLooping, ContainerOOMKilled, NodeNotReady,
  NodeConditionBad, JobFailed
- Infrastructure Health: CoreDNSErrors, ScrapeTargetDown,
  PrometheusStorageFull, PrometheusNotificationsFailing
- R730 Host: FanFailure (iDRAC Redfish fan health)
2026-02-11 22:14:30 +00:00
Viktor Barzin
12a84c5207
Standardize Prometheus alert formatting and fix Slack notifications
- Add color coding (red/green) to Slack alerts, show alertname in title
- Use summary annotation in Slack text (description was always empty)
- Format all alert summaries consistently: value with units and threshold
- Fix ratio expressions (CPU/memory) to display as percentages
- Fix "failiure" typo, capitalize Tailscale
2026-02-11 21:53:22 +00:00
Viktor Barzin
6acf5ee300
[ci skip] Assorted pending changes: ollama API auth, nvidia dashboard, traefik rewrite-body plugin
- ollama: Add basicAuth middleware for external API access
- monitoring: Update nvidia dashboard (add GPU memory per app panel, bump to v9)
- plotting-book: Switch to ancamilea/book-plotter:latest, add lifecycle ignore
- reverse_proxy/factory: Fix rybbit plugin name (rewritebody -> rewrite-body)
- traefik: Switch to packruler/rewrite-body plugin v1.2.0
2026-02-10 21:29:54 +00:00
Viktor Barzin
c32acc70e6
Migrate all service modules from nginx-ingress to Traefik
- Remove nginx-specific ingress variables (use_proxy_protocol, proxy_timeout, additional_configuration_snippet)
- Update ingress annotations to use Traefik middleware CRDs
- Delete nginx-ingress module (replaced by traefik)
- Add new traefik middleware.tf for shared middleware definitions
- Update service modules to work with new ingress_factory interface
2026-02-07 13:25:49 +00:00
Viktor Barzin
4a857ebefd
Add per-pod GPU memory metrics exporter
- Add DaemonSet that runs on GPU node and exposes Prometheus metrics
- Uses nvidia-smi to collect per-process GPU memory usage
- Maps PIDs to container IDs via /proc/<pid>/cgroup
- Exposes gpu_pod_memory_used_bytes metric at :9401/metrics
- Add Prometheus scrape config for gpu-pod-memory job

[ci skip]
2026-01-31 16:58:14 +00:00
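The PID-to-container mapping step can be sketched as follows. This is not the exporter's actual code: it assumes the container ID appears in the cgroup path as a 64-character hex string, which is the common layout for containerd and Docker but may differ on other runtimes. Both function names are hypothetical.

```python
import re
from typing import Optional

# Assumption: container IDs are 64 lowercase hex characters in the cgroup path.
_CONTAINER_ID = re.compile(r"[0-9a-f]{64}")


def container_id_from_cgroup(cgroup_text: str) -> Optional[str]:
    """Extract a container ID from the contents of /proc/<pid>/cgroup."""
    for line in cgroup_text.splitlines():
        # Each line has the form "hierarchy-id:controllers:path".
        parts = line.split(":", 2)
        if len(parts) != 3:
            continue
        match = _CONTAINER_ID.search(parts[2])
        if match:
            return match.group(0)
    return None


def container_id_for_pid(pid: int) -> Optional[str]:
    """Read /proc/<pid>/cgroup; return None if the process is gone or unreadable."""
    try:
        with open(f"/proc/{pid}/cgroup", encoding="utf-8", errors="replace") as fh:
            return container_id_from_cgroup(fh.read())
    except OSError:
        return None


# Hypothetical cgroup line in the systemd-driver layout.
sample = "0::/kubepods.slice/kubepods-pod1234.slice/cri-containerd-" + "ab" * 32 + ".scope\n"
print(container_id_from_cgroup(sample))
```

An exporter built this way would still need a second lookup from container ID to pod name, for example through the container runtime or the Kubernetes API, before it could label a per-pod metric.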
Viktor Barzin
3bda3ab956
reduce the frequency of polling idrac and remove some duplicates [ci skip] 2026-01-24 18:47:22 +00:00
Viktor Barzin
d751a5924c
add prometheus alerts for deployment/statefulset/daemonset replica mismatches [ci skip]
- Add DeploymentReplicasMismatch alert
- Add StatefulSetReplicasMismatch alert
- Add DaemonSetMissingPods alert
- Add .claude/ directory with remote executor and knowledge base
2026-01-18 11:04:51 +00:00
Viktor Barzin
5609bbbaf3
dedup ram alert and increase threshold to 95% [ci skip] 2026-01-17 22:42:22 +00:00
Viktor Barzin
88a62f90f5
scale grafana to 3 pods for resilience [ci skip] 2026-01-12 18:27:54 +00:00
Viktor Barzin
8abb8eddc0
add tier to all deployments [ci skip] 2026-01-10 16:28:14 +00:00
Viktor Barzin
20cd480988
monitor idrac more frequently [ci skip] 2026-01-07 18:55:59 +00:00
Viktor Barzin
402dc1f91a
update cpu temp alert to above 60 [ci skip] 2026-01-04 12:26:46 +00:00
Viktor Barzin
29194c06b9
update definition of high cpu usage to use pve metrics instead for a longer period [ci skip] 2026-01-03 23:30:28 +00:00
Viktor Barzin
d151b582f7
update cpu temp alert to 55C down from 75C [ci skip] 2026-01-03 16:48:54 +00:00
Viktor Barzin
feeb6ee86c
increase idrac scrape timeout in attempt to reduce 499 [ci skip] 2025-12-29 20:34:40 +00:00
Viktor Barzin
42403e0b35
add registry low cache hit rate alert [ci skip] 2025-12-29 10:43:57 +00:00
Viktor Barzin
a3624f80e0
replace hardcoded namespace with module reference [ci skip] 2025-12-29 10:23:42 +00:00
Viktor Barzin
8be0fc9699
add more alerts in prometheus and group them better [ci skip] 2025-12-28 20:07:33 +00:00
Viktor Barzin
95a6708361
move out all monitoring resources to separate tf files [ci skip] 2025-12-28 20:07:00 +00:00
Viktor Barzin
34f90c06dc
move grafana into separate file and turn off persistence as we use external db now [ci skip] 2025-12-28 20:05:27 +00:00
Viktor Barzin
90bdd38de1
migrate grafana to mysql from sqlite [ci skip] 2025-12-27 20:51:05 +00:00
Viktor Barzin
e12c117bdf
move prometheus wal to tmpfs to reduce wear [ci skip] 2025-12-26 20:10:20 +00:00
Viktor Barzin
a7dc4320b3
add job to monitor pve host using node exporter and add alert for high ssd writes [ci skip] 2025-12-26 16:23:49 +00:00
Viktor Barzin
b622c94334
add pve exporter playbook + pve exporter in k8s [ci skip] 2025-12-26 16:23:17 +00:00
Viktor Barzin
0197c5a09c
update most important grafana dashboards [ci skip] 2025-12-23 18:13:25 +00:00
Viktor Barzin
bd60f0faa3
add alert for docker registry [ci skip] 2025-12-18 10:45:32 +00:00
Viktor Barzin
33be167720
add local-only ingress for snmp and idrac exporters [ci skip] 2025-12-14 19:08:44 +00:00
Viktor Barzin
bc486227f7
add separate idrac monitoring tool and dashboard [ci skip] 2025-12-14 09:50:16 +00:00
Viktor Barzin
f85d793afd
add haos monitoring job in prometheus 2025-11-29 11:46:42 +00:00
Viktor Barzin
2c022fd924
add ${__field.name:wrap} in the idrac dashboard to fix wrapping issue [ci skip] 2025-11-15 05:15:50 +00:00