infra

Author	SHA1	Message	Date
Viktor Barzin	26ba9ea371	[ci skip] Fix Prometheus storage alert and Grafana quota exhaustion - Enable size-based TSDB retention (45GB) to clean up old blocks (including 2021-era blocks with failed compaction) - Increase monitoring namespace quota from 64/128Gi to 80/160Gi CPU/memory limits to allow Grafana rolling updates	2026-02-21 21:04:08 +00:00
Viktor Barzin	dcce738641	[ci skip] Bump inotify max_user_instances from 512 to 8192 Fixes "failed to create fsnotify watcher: too many open files" in Drone CI builds where vitest exhausts the default inotify instance limit.	2026-02-21 20:21:04 +00:00
Viktor Barzin	9889728c49	[ci skip] Remove Authentik forward auth from Grafana, add admin password management Fixes HA mobile app 403 when embedding Grafana dashboards - the webview blocks third-party cookies needed by Authentik forward auth. Grafana already has anonymous Viewer access enabled, so forward auth is not needed. Also adds grafana_admin_password variable and explicit resource limits to prevent ResourceQuota issues during rolling updates.	2026-02-18 21:40:32 +00:00
Viktor Barzin	9bcdb9e59f	[ci skip] Implement multi-user Kubernetes access with OIDC - Add RBAC module (modules/kubernetes/rbac/) with admin, power-user, and namespace-owner roles, API server OIDC flags, and audit logging - Add self-service portal (modules/kubernetes/k8s-portal/) SvelteKit app with kubeconfig download and setup instructions - Configure Alloy to collect audit logs from kube-apiserver - Add Grafana dashboard for Kubernetes audit log visualization - Configure Authentik OIDC provider with groups scope mapping - Wire up k8s_users and ssh_private_key variables through module chain	2026-02-17 21:42:39 +00:00
Viktor Barzin	0545576335	[ci skip] Add Smart Home (ha-sofia) section to Cluster Health Overview dashboard	2026-02-17 19:48:02 +00:00
Viktor Barzin	039f8559c9	[ci skip] Add Grafana dashboard for Technitium DNS query logs Add MySQL datasource and 15-panel dashboard for DNS analytics: queries over time, response codes, top domains/clients, response times, blocked/NxDomain domains. Enable Grafana dashboard sidecar for auto-provisioning dashboards from ConfigMaps.	2026-02-16 23:06:41 +00:00
Viktor Barzin	f06b3ac0e4	[ci skip] Fix .viktorbarzin.lan.viktorbarzin.lan duplicate DNS queries Add CoreDNS catch-all block for viktorbarzin.lan.viktorbarzin.lan to return NXDOMAIN immediately, preventing search domain expansion junk queries from reaching Technitium. Add trailing dots to Prometheus scrape targets (idrac, ups, ha-sofia) to bypass ndots expansion.	2026-02-16 21:38:38 +00:00
Viktor Barzin	0eb5eb738a	[ci skip] Fix Alloy OOMKill and iDRAC priority class conflict - Alloy: bump memory limits from 64Mi/128Mi to 256Mi/768Mi — pods were OOMKilled at 128Mi, steady-state usage is ~400-450Mi per node - iDRAC Redfish Exporter: add explicit priority_class_name to resolve conflict between Kyverno priority injection and default priority: 0	2026-02-16 20:09:53 +00:00
Viktor Barzin	5cb4bda289	Update Cluster Health dashboard: dedup metrics, GPU memory, remove broken panels [ci skip]	2026-02-15 21:51:41 +00:00
Viktor Barzin	2db6e96115	Update Cluster Health dashboard: reorder rows, add links, key services, sorting [ci skip]	2026-02-15 21:24:08 +00:00
Viktor Barzin	608d5ab636	Add Cluster Health Overview Grafana dashboard [ci skip]	2026-02-15 19:38:28 +00:00
Viktor Barzin	4d9b8242e8	Add tier-based resource governance via Kyverno [ci skip] Four layers of noisy-neighbor protection using existing tier system: - PriorityClasses (tier-0-core through tier-4-aux) - LimitRange defaults auto-generated per namespace tier - ResourceQuotas auto-generated per namespace tier - PriorityClassName injection on pods via Kyverno mutate Custom quota overrides for monitoring and crowdsec namespaces which exceed the default tier quotas.	2026-02-15 18:48:33 +00:00
Viktor Barzin	c4a7c5df8e	[ci skip] Update Loki dashboard to use correct datasource UID	2026-02-13 23:41:40 +00:00
Viktor Barzin	fabece6370	[ci skip] Fix compactor/ruler paths to use writable /var/loki mount	2026-02-13 23:22:13 +00:00
Viktor Barzin	0d3acec82c	[ci skip] Re-enable lokiCanary (required by Helm chart validation)	2026-02-13 23:18:13 +00:00
Viktor Barzin	fea7c6cbb1	[ci skip] Disable gateway/canary/cache, increase timeout for Loki deploy	2026-02-13 23:17:32 +00:00
Viktor Barzin	69aae2ec9d	[ci skip] Fix code review findings: correct Alertmanager URL, add atomic to Loki, remove dead minio NFS export, update design doc	2026-02-13 23:08:44 +00:00
Viktor Barzin	71ff803978	[ci skip] Add centralized log collection: Loki + Alloy + sysctl DaemonSet	2026-02-13 23:03:40 +00:00
Viktor Barzin	cd5261161b	[ci skip] Add HomeAssistantDown alert for ha-sofia Fires after 5m if the haos Prometheus scrape target is unreachable. Covers the HTTP API endpoint which shares the same process as the WebSocket API used by the mobile app.	2026-02-11 23:24:46 +00:00
Viktor Barzin	46ffc37dcf	[ci skip] Fix all active Prometheus alerts - meshcentral: rename port from "https" to "http" — MeshCentral serves plain HTTP when REVERSE_PROXY=true, but Traefik inferred HTTPS from the port name, causing 100% 5xx errors - osm-routing/otp: scale to 0 — TfL GTFS data expired, OTP crash-loops trying to build graph with no valid transit trips - wireguard: add prometheus.io/port=9586 annotation — without it, Prometheus tried scraping all container ports (51820 UDP, 80) - travel-blog: remove stale prometheus.io annotations and dead port 9113 — nginx-exporter sidecar was commented out but annotations remained - dawarich: remove prometheus.io annotations — exporter env vars are commented out so nothing listens on port 9394 - monitoring: raise CPU temp threshold 60°C→75°C (E5-2699 v4 Tcase is 79°C), lower registry cache threshold 50%→25%, add minimum traffic floor (>0.1 req/s) to 4xx/5xx rate alerts to prevent false positives on low-traffic services	2026-02-11 22:40:56 +00:00
Viktor Barzin	c8a41ac567	[ci skip] Add 12 Prometheus alert rules for monitoring gaps Add 3 new alert groups and 1 rule to existing group: - Storage: NodeFilesystemFull (<10% free), PVFillingUp (>85% used) - K8s Health: PodCrashLooping, ContainerOOMKilled, NodeNotReady, NodeConditionBad, JobFailed - Infrastructure Health: CoreDNSErrors, ScrapeTargetDown, PrometheusStorageFull, PrometheusNotificationsFailing - R730 Host: FanFailure (iDRAC Redfish fan health)	2026-02-11 22:14:30 +00:00
Viktor Barzin	dbf397841a	Standardize Prometheus alert formatting and fix Slack notifications - Add color coding (red/green) to Slack alerts, show alertname in title - Use summary annotation in Slack text (description was always empty) - Format all alert summaries consistently: value with units and threshold - Fix ratio expressions (CPU/memory) to display as percentages - Fix "failiure" typo, capitalize Tailscale	2026-02-11 21:53:22 +00:00
Viktor Barzin	73aab7f4ce	[ci skip] Assorted pending changes: ollama API auth, nvidia dashboard, traefik rewrite-body plugin - ollama: Add basicAuth middleware for external API access - monitoring: Update nvidia dashboard (add GPU memory per app panel, bump to v9) - plotting-book: Switch to ancamilea/book-plotter:latest, add lifecycle ignore - reverse_proxy/factory: Fix rybbit plugin name (rewritebody -> rewrite-body) - traefik: Switch to packruler/rewrite-body plugin v1.2.0	2026-02-10 21:29:54 +00:00
Viktor Barzin	b36932f9a3	Migrate all service modules from nginx-ingress to Traefik - Remove nginx-specific ingress variables (use_proxy_protocol, proxy_timeout, additional_configuration_snippet) - Update ingress annotations to use Traefik middleware CRDs - Delete nginx-ingress module (replaced by traefik) - Add new traefik middleware.tf for shared middleware definitions - Update service modules to work with new ingress_factory interface	2026-02-07 13:25:49 +00:00
Viktor Barzin	da4cf18d6d	Add per-pod GPU memory metrics exporter - Add DaemonSet that runs on GPU node and exposes Prometheus metrics - Uses nvidia-smi to collect per-process GPU memory usage - Maps PIDs to container IDs via /proc/<pid>/cgroup - Exposes gpu_pod_memory_used_bytes metric at :9401/metrics - Add Prometheus scrape config for gpu-pod-memory job [ci skip]	2026-01-31 16:58:14 +00:00
Viktor Barzin	10092ec285	reduce the frequency of polling idrac and remove some duplicates [ci skip]	2026-01-24 18:47:22 +00:00
Viktor Barzin	a1d945a0b2	add prometheus alerts for deployment/statefulset/daemonset replica mismatches [ci skip] - Add DeploymentReplicasMismatch alert - Add StatefulSetReplicasMismatch alert - Add DaemonSetMissingPods alert - Add .claude/ directory with remote executor and knowledge base	2026-01-18 11:04:51 +00:00
Viktor Barzin	185a138cc5	dedup ram alert and increase threshold to 95% [ci skip]	2026-01-17 22:42:22 +00:00
Viktor Barzin	61e318398c	scale grafana to 3 pods for resilience [ci skip]	2026-01-12 18:27:54 +00:00
Viktor Barzin	f1e9fb9afe	add tier to all deployments [ci skip]	2026-01-10 16:28:14 +00:00
Viktor Barzin	1b5cbeb9c8	monitor idrac more frequently [ci skip]	2026-01-07 18:55:59 +00:00
Viktor Barzin	934fa34c79	update cpu temp alert to above 60 [ci skip]	2026-01-04 12:26:46 +00:00
Viktor Barzin	01d4c9c3e1	update definition of high cpu usage to use pve metrics instead for a longer period [ci skip]	2026-01-03 23:30:28 +00:00
Viktor Barzin	31c403cadb	update cpu temp alert to 55C down from 75C [ci skip]	2026-01-03 16:48:54 +00:00
Viktor Barzin	d37c693a94	increase idrac scrape timeout in attempt to reduce 499 [ci skip]	2025-12-29 20:34:40 +00:00
Viktor Barzin	253e77f22d	add registry low cache hit rate alert [ci skip]	2025-12-29 10:43:57 +00:00
Viktor Barzin	f1dde96d80	replace hardcoded namespace with module reference [ci skip]	2025-12-29 10:23:42 +00:00
Viktor Barzin	cf9d346cae	add more alerts in prometheus and group them better [ci skip]	2025-12-28 20:07:33 +00:00
Viktor Barzin	a595c4db56	move out all monitoring resources to separate tf files [ci skip]	2025-12-28 20:07:00 +00:00
Viktor Barzin	26d55c6637	move grafana into separate file and turn off persistence as we use external db now [ci skip]	2025-12-28 20:05:27 +00:00
Viktor Barzin	f06e050eaa	migrate grafana to mysql from sqlite [ci skip]	2025-12-27 20:51:05 +00:00
Viktor Barzin	0b2e6d09d2	move prometheus wal to tmpfs to reduce wear [ci skip]	2025-12-26 20:10:20 +00:00
Viktor Barzin	6c1ae20448	add job to monitor pve host using node exporter and add alert for high ssd writes [ci skip]	2025-12-26 16:23:49 +00:00
Viktor Barzin	d07c625064	add pve exporter playbook + pve exporter in k8s [ci skip]	2025-12-26 16:23:17 +00:00
Viktor Barzin	59e6591e2a	update most important grafana dashboards [ci skip]	2025-12-23 18:13:25 +00:00
Viktor Barzin	a225bad3cb	add alert for docker registry [ci skip]	2025-12-18 10:45:32 +00:00
Viktor Barzin	397fa0cba7	add local-only ingress for snmp and idrac exporters [ci skip]	2025-12-14 19:08:44 +00:00
Viktor Barzin	b4f45c7e73	add separate idrac monitoring tool and dashboard [ci skip]	2025-12-14 09:50:16 +00:00
Viktor Barzin	34df786fe4	add haos monitoring job in prometheus	2025-11-29 11:46:42 +00:00
Viktor Barzin	1b0d5e60d8	add ${__field.name:wrap} in the idrac dashboard to fix wrapping issue [ci skip]	2025-11-15 05:15:50 +00:00
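
Commit f06b3ac0e4 attributes the `.viktorbarzin.lan.viktorbarzin.lan` queries to search domain expansion and fixes the scrape targets by adding trailing dots. The sketch below models how a glibc-style stub resolver orders its candidate names, which may help show why the trailing dot suppresses the junk lookups. The search list, the `ndots` value, and the hostname `idrac.viktorbarzin.lan` are assumptions for illustration and are not taken from the repository.

```python
from typing import List


def candidate_queries(name: str, search: List[str], ndots: int) -> List[str]:
    """Return the fully qualified names a glibc-style resolver would try, in order."""
    if name.endswith("."):
        # A trailing dot marks the name as absolute: the search list is skipped.
        return [name]
    absolute = name + "."
    expanded = [f"{name}.{domain}." for domain in search]
    if name.count(".") >= ndots:
        # Enough dots: try the name as-is first, then fall back to the search list.
        return [absolute] + expanded
    # Too few dots: walk the search list first, then try the name as-is.
    return expanded + [absolute]


# Hypothetical pod resolv.conf values (assumed, not from the repo).
SEARCH = [
    "monitoring.svc.cluster.local",
    "svc.cluster.local",
    "cluster.local",
    "viktorbarzin.lan",
]

if __name__ == "__main__":
    for target in ("idrac.viktorbarzin.lan", "idrac.viktorbarzin.lan."):
        print(target, "->", candidate_queries(target, SEARCH, ndots=5))
```

Under these assumptions the undotted target produces `idrac.viktorbarzin.lan.viktorbarzin.lan.` as one of its candidates before the real name is tried, while the dotted target produces a single query.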


139 commits
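
Commit da4cf18d6d describes mapping PIDs to container IDs via `/proc/<pid>/cgroup`. A minimal sketch of that lookup follows, assuming the runtime embeds a 64-character hex container ID in the cgroup path, as containerd and Docker commonly do. The function names are hypothetical and this is not the exporter's actual code.

```python
import re
from typing import Optional

# Assumption: the container ID appears as a run of 64 lowercase hex characters.
_CONTAINER_ID = re.compile(r"[0-9a-f]{64}")


def container_id_from_cgroup(cgroup_text: str) -> Optional[str]:
    """Extract the first container ID found in the contents of a cgroup file."""
    for line in cgroup_text.splitlines():
        # Each line has the form "hierarchy-ID:controller-list:cgroup-path".
        parts = line.split(":", 2)
        if len(parts) != 3:
            continue
        match = _CONTAINER_ID.search(parts[2])
        if match:
            return match.group(0)
    return None


def container_id_for_pid(pid: int) -> Optional[str]:
    """Read /proc/<pid>/cgroup and return the container ID, or None if unavailable."""
    try:
        with open(f"/proc/{pid}/cgroup", encoding="utf-8") as fh:
            return container_id_from_cgroup(fh.read())
    except OSError:
        # The process may have exited, or /proc may not be readable.
        return None
```

Keeping the parsing separate from the file read lets the path handling be checked against sample cgroup v1 and v2 lines without a running container.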