infra

Author	SHA1	Message	Date
Viktor Barzin	c0363be5e4	[ci skip] Add Grafana dashboard for Technitium DNS query logs Add MySQL datasource and 15-panel dashboard for DNS analytics: queries over time, response codes, top domains/clients, response times, blocked/NxDomain domains. Enable Grafana dashboard sidecar for auto-provisioning dashboards from ConfigMaps.	2026-02-16 23:06:41 +00:00
Viktor Barzin	1802b4c86d	[ci skip] Add pfsense-dnsmasq-interface-binding skill, update ndots skill to v1.1.0	2026-02-16 22:30:57 +00:00
Viktor Barzin	a268b9107f	[ci skip] Replace specific CoreDNS catch-all blocks with generic template regex Single template regex in the viktorbarzin.lan block catches ALL search domain expansion junk (.com.viktorbarzin.lan, .cluster.local.viktorbarzin.lan, etc.) instead of needing separate server blocks per pattern. Legitimate single-label queries (idrac.viktorbarzin.lan) fall through to Technitium.	2026-02-16 21:49:03 +00:00
Viktor Barzin	19136c21f1	[ci skip] Fix .viktorbarzin.lan.viktorbarzin.lan duplicate DNS queries Add CoreDNS catch-all block for viktorbarzin.lan.viktorbarzin.lan to return NXDOMAIN immediately, preventing search domain expansion junk queries from reaching Technitium. Add trailing dots to Prometheus scrape targets (idrac, ups, ha-sofia) to bypass ndots expansion.	2026-02-16 21:38:38 +00:00
Viktor Barzin	c08bab7da7	[ci skip] Update preference: always use cluster_healthcheck.sh for health checks	2026-02-16 21:19:49 +00:00
Viktor Barzin	205eb2704b	[ci skip] Fix Technitium DNS client IP logging: bypass Traefik L4 proxy DNS queries were going through Traefik's IngressRouteUDP, replacing real client IPs with Traefik pod IPs (10.10.169.150) in Technitium logs. Changed Technitium DNS service from NodePort to LoadBalancer with externalTrafficPolicy: Local, removed dns-udp entrypoint and IngressRouteUDP from Traefik, and updated CoreDNS to forward .lan queries to Technitium's LoadBalancer IP directly.	2026-02-16 21:16:16 +00:00
Viktor Barzin	3d4cdf3203	[ci skip] Fix Alloy OOMKill and iDRAC priority class conflict - Alloy: bump memory limits from 64Mi/128Mi to 256Mi/768Mi — pods were OOMKilled at 128Mi, steady-state usage is ~400-450Mi per node - iDRAC Redfish Exporter: add explicit priority_class_name to resolve conflict between Kyverno priority injection and default priority: 0	2026-02-16 20:09:53 +00:00
Viktor Barzin	32a7c04deb	[ci skip] Remember to use cluster_healthcheck.sh for cluster status checks	2026-02-16 19:45:31 +00:00
Viktor Barzin	5c7efbdb48	[ci skip] Fix docker-registry VM: add SSH key, remove hourly restart cron - Set explicit devvm SSH public key for cloud-init (was empty, breaking SSH access) - Remove hourly cron that restarted all registry containers, which wiped the in-memory blobdescriptor cache and caused low pull-through cache hit rates	2026-02-15 22:16:41 +00:00
Viktor Barzin	7854f89add	[ci skip] Add skill: k8s-ndots-search-domain-nxdomain-flood Documents how Kubernetes ndots:5 search domain expansion floods external DNS with NxDomain queries, and the CoreDNS template block fix.	2026-02-15 21:52:27 +00:00
Viktor Barzin	2d015c1cb4	Update Cluster Health dashboard: dedup metrics, GPU memory, remove broken panels [ci skip]	2026-02-15 21:51:41 +00:00
Viktor Barzin	a8f42d7fc0	[ci skip] Manage CoreDNS Corefile in Terraform and block junk NxDomain queries Add kubernetes_config_map for CoreDNS to the technitium module, with a template block for cluster.local.viktorbarzin.lan that returns NXDOMAIN immediately. This prevents ndots:5 search domain expansion from flooding Technitium with ~66k/day junk queries (e.g. redis.redis.svc.cluster.local.viktorbarzin.lan). Also enabled saveCache on Technitium so the DNS cache persists across pod restarts.	2026-02-15 21:51:12 +00:00
Viktor Barzin	a2b44c8ff7	Update Cluster Health dashboard: reorder rows, add links, key services, sorting [ci skip]	2026-02-15 21:24:08 +00:00
Viktor Barzin	8d38e1474b	[ci skip] Document Terraform state splitting plan for future implementation	2026-02-15 21:10:40 +00:00
Viktor Barzin	f447e45ee1	Add Cluster Health Overview Grafana dashboard [ci skip]	2026-02-15 19:38:28 +00:00
Viktor Barzin	1564ec7e79	Add tier-based resource governance via Kyverno [ci skip] Four layers of noisy-neighbor protection using existing tier system: - PriorityClasses (tier-0-core through tier-4-aux) - LimitRange defaults auto-generated per namespace tier - ResourceQuotas auto-generated per namespace tier - PriorityClassName injection on pods via Kyverno mutate Custom quota overrides for monitoring and crowdsec namespaces which exceed the default tier quotas.	2026-02-15 18:48:33 +00:00
Viktor Barzin	7ef23470cd	Add Uptime Kuma monitor check to cluster health script [ci skip] Adds check #14 that queries Uptime Kuma API for application-level monitor status, complementing the kubectl-level checks with HTTP/ping health data. Reports down monitors by name with PASS/WARN/FAIL thresholds.	2026-02-15 17:49:40 +00:00
Viktor Barzin	df840a0078	[ci skip] remember: spawn subagent to monitor pods instead of sleeping	2026-02-15 17:48:42 +00:00
Viktor Barzin	8867769a75	Add cluster health check script with 13 diagnostic sections [ci skip]	2026-02-15 17:34:22 +00:00
Viktor Barzin	349fffc124	Cluster health remediation: cleanup CronJob, disable Collabora, fix GPU probe, add NFS exports [ci skip] - Add daily CronJob to auto-clean Failed/Evicted pods cluster-wide (infra-maintenance) - Disable Collabora in Nextcloud (broken HPA caused scaling storm; using OnlyOffice instead) - Increase gpu-pod-exporter liveness probe timeout from 1s to 5s - Add osm-routing NFS exports (osrm-data, otp-data)	2026-02-15 17:20:47 +00:00
Viktor Barzin	2d475c8496	[ci skip] Add skills: helm-stuck-release-recovery, k8s-hpa-scaling-storm, crowdsec-agent-registration-failure	2026-02-15 17:18:17 +00:00
Viktor Barzin	f68ed73434	[ci skip] Strengthen Terraform-only change policy in project instructions	2026-02-15 15:10:11 +00:00
Viktor Barzin	c05614e4b8	update the scrape schedule for wrongmove [ci skip]	2026-02-15 14:40:05 +00:00
Viktor Barzin	0da86577fb	[ci skip] Add skills: containerd-multi-registry-pull-through-cache, traefik-plugin-download-failure-404	2026-02-15 14:36:50 +00:00
Viktor Barzin	dca2b0cabd	[ci skip] Add uptime-kuma management skill with tiered monitoring	2026-02-15 14:35:53 +00:00
Viktor Barzin	36d32b49e7	[ci skip] Fix pull-through cache for all registries Replace deprecated wildcard containerd mirror with per-registry config_path approach. Add proxy containers for ghcr.io, quay.io, registry.k8s.io, and reg.kyverno.io on the docker-registry VM. Set static IP for docker-registry VM to avoid DHCP issues.	2026-02-15 14:35:52 +00:00
Viktor Barzin	163d6a728d	Drone CI Update TLS Certificates Commit	2026-02-15 00:05:36 +00:00
Viktor Barzin	22fdb8fbf0	[ci skip] Add pfSense firewall management skill	2026-02-14 12:42:10 +00:00
Viktor Barzin	2b6bcae77f	[ci skip] Add skills: loki-helm-deployment-pitfalls, grafana-stale-datasource-cleanup	2026-02-13 23:47:45 +00:00
Viktor Barzin	ecebf401bf	[ci skip] Update knowledge base with Loki + Alloy service notes	2026-02-13 23:46:01 +00:00
Viktor Barzin	cd52fc400c	[ci skip] Sync tfstate after Loki + Alloy deployment	2026-02-13 23:44:17 +00:00
Viktor Barzin	7644c419a4	[ci skip] Update Loki dashboard to use correct datasource UID	2026-02-13 23:41:40 +00:00
Viktor Barzin	cd2d13d949	[ci skip] Fix compactor/ruler paths to use writable /var/loki mount	2026-02-13 23:22:13 +00:00
Viktor Barzin	d906513f09	[ci skip] Re-enable lokiCanary (required by Helm chart validation)	2026-02-13 23:18:13 +00:00
Viktor Barzin	a38c3d3dc7	[ci skip] Disable gateway/canary/cache, increase timeout for Loki deploy	2026-02-13 23:17:32 +00:00
Viktor Barzin	f013c0a139	[ci skip] Fix code review findings: correct Alertmanager URL, add atomic to Loki, remove dead minio NFS export, update design doc	2026-02-13 23:08:44 +00:00
Viktor Barzin	c7236f09f1	[ci skip] Add centralized log collection: Loki + Alloy + sysctl DaemonSet	2026-02-13 23:03:40 +00:00
Viktor Barzin	c330648b7b	[ci skip] Deploy MoltBot (OpenClaw) AI agent gateway Add new Kubernetes service for OpenClaw gateway connected to in-cluster Ollama, with kubectl/terraform/git access for infrastructure management. Protected behind Authentik SSO.	2026-02-13 22:57:36 +00:00
Viktor Barzin	9df9ab1654	[ci skip] Add extend-vm-storage script and skills - Script to automate K8s node VM disk expansion (drain, shutdown, resize, boot, expand FS, uncordon) - Skill docs for the workflow and troubleshooting pitfalls (growpart, macOS grep -P, drain timeouts) - Successfully tested on k8s-node2, k8s-node3, k8s-node4 (64G → 128G)	2026-02-13 22:08:46 +00:00
Viktor Barzin	ecffe93c22	[ci skip] Add centralized log collection implementation plan	2026-02-13 21:54:55 +00:00
Viktor Barzin	3d64fc9f2c	[ci skip] Add centralized log collection design doc	2026-02-13 21:53:04 +00:00
Viktor Barzin	06f9a7fe74	[ci skip] Add skill: local-llm-gpu-selection	2026-02-13 19:26:19 +00:00
Viktor Barzin	e0ff08978d	[ci skip] add vibetunnel proxy	2026-02-13 18:20:50 +00:00
Viktor Barzin	2377045630	[ci skip] sync tfstate and add frigate helper scripts	2026-02-12 23:11:23 +00:00
Viktor Barzin	8bea552664	[ci skip] Add HomeAssistantDown alert for ha-sofia Fires after 5m if the haos Prometheus scrape target is unreachable. Covers the HTTP API endpoint which shares the same process as the WebSocket API used by the mobile app.	2026-02-11 23:24:46 +00:00
Viktor Barzin	b2d74a93a0	sync tfstate [ci skip]	2026-02-11 22:53:39 +00:00
Viktor Barzin	0c18a86a7b	[ci skip] Fix all active Prometheus alerts - meshcentral: rename port from "https" to "http" — MeshCentral serves plain HTTP when REVERSE_PROXY=true, but Traefik inferred HTTPS from the port name, causing 100% 5xx errors - osm-routing/otp: scale to 0 — TfL GTFS data expired, OTP crash-loops trying to build graph with no valid transit trips - wireguard: add prometheus.io/port=9586 annotation — without it, Prometheus tried scraping all container ports (51820 UDP, 80) - travel-blog: remove stale prometheus.io annotations and dead port 9113 — nginx-exporter sidecar was commented out but annotations remained - dawarich: remove prometheus.io annotations — exporter env vars are commented out so nothing listens on port 9394 - monitoring: raise CPU temp threshold 60°C→75°C (E5-2699 v4 Tcase is 79°C), lower registry cache threshold 50%→25%, add minimum traffic floor (>0.1 req/s) to 4xx/5xx rate alerts to prevent false positives on low-traffic services	2026-02-11 22:40:56 +00:00
Viktor Barzin	9c3f8adc11	[ci skip] Fix CrowdSec to monitor Traefik and add Slack notifications - Switch acquisition from ingress-nginx to traefik namespace/pods - Change collection from crowdsecurity/nginx to crowdsecurity/traefik - Add Slack notification plugin for ban/captcha decisions - Wire alertmanager_slack_api_url through to CrowdSec module	2026-02-11 22:25:03 +00:00
Viktor Barzin	04eaf0f989	[ci skip] Add 12 Prometheus alert rules for monitoring gaps Add 3 new alert groups and 1 rule to existing group: - Storage: NodeFilesystemFull (<10% free), PVFillingUp (>85% used) - K8s Health: PodCrashLooping, ContainerOOMKilled, NodeNotReady, NodeConditionBad, JobFailed - Infrastructure Health: CoreDNSErrors, ScrapeTargetDown, PrometheusStorageFull, PrometheusNotificationsFailing - R730 Host: FanFailure (iDRAC Redfish fan health)	2026-02-11 22:14:30 +00:00
Viktor Barzin	12a84c5207	Standardize Prometheus alert formatting and fix Slack notifications - Add color coding (red/green) to Slack alerts, show alertname in title - Use summary annotation in Slack text (description was always empty) - Format all alert summaries consistently: value with units and threshold - Fix ratio expressions (CPU/memory) to display as percentages - Fix "failiure" typo, capitalize Tailscale	2026-02-11 21:53:22 +00:00

1580 commits
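
Commits a8f42d7fc0, 19136c21f1, and a268b9107f all deal with the same mechanism: with ndots:5, a name containing fewer than five dots is tried against every search domain before it is tried as-is, so already-qualified names pick up a spurious viktorbarzin.lan suffix, while a trailing dot skips the search list entirely. The sketch below models that lookup order following resolv.conf semantics. It is an illustration only: the function names, the SEARCH list, and the JUNK pattern are assumptions made for this example and are not taken from the repository's Corefile or from any pod's resolv.conf.

```python
import re

# Assumed search list for a pod in the "default" namespace, plus the LAN domain
# (hypothetical; the real list depends on the namespace and node configuration).
SEARCH = [
    "default.svc.cluster.local",
    "svc.cluster.local",
    "cluster.local",
    "viktorbarzin.lan",
]


def expand_candidates(name, search_domains, ndots=5):
    """Return the fully qualified names a stub resolver tries, in order."""
    if name.endswith("."):
        # Trailing dot marks the name as absolute: the search list is skipped.
        return [name]
    expanded = [f"{name}.{domain}." for domain in search_domains]
    absolute = name + "."
    if name.count(".") >= ndots:
        # Enough dots: try the name as-is first, then the search list.
        return [absolute] + expanded
    # Too few dots: walk the search list first, then try the name as-is.
    return expanded + [absolute]


# Hypothetical stand-in for the "generic template regex": two or more labels
# in front of the LAN zone are treated as search-domain expansion junk, while
# single-label names such as idrac.viktorbarzin.lan. are left alone.
JUNK = re.compile(r"^([^.]+\.){2,}viktorbarzin\.lan\.$")


def is_junk(fqdn):
    """True if the name looks like expansion junk under the assumed pattern."""
    return bool(JUNK.match(fqdn))


if __name__ == "__main__":
    for query in ("redis.redis.svc.cluster.local", "idrac.viktorbarzin.lan."):
        for candidate in expand_candidates(query, SEARCH):
            print(f"{query!r:35} -> {candidate:60} junk={is_junk(candidate)}")
```

Under these assumptions, redis.redis.svc.cluster.local (four dots) is expanded to redis.redis.svc.cluster.local.viktorbarzin.lan. before the real name is tried, which is the kind of query the commits describe reaching Technitium, whereas the dotted scrape target idrac.viktorbarzin.lan. yields exactly one lookup.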