Commit graph

1580 commits

Author SHA1 Message Date
Viktor Barzin
c0363be5e4
[ci skip] Add Grafana dashboard for Technitium DNS query logs
Add MySQL datasource and 15-panel dashboard for DNS analytics:
queries over time, response codes, top domains/clients, response
times, blocked/NxDomain domains. Enable Grafana dashboard sidecar
for auto-provisioning dashboards from ConfigMaps.
2026-02-16 23:06:41 +00:00
Viktor Barzin
1802b4c86d
[ci skip] Add pfsense-dnsmasq-interface-binding skill, update ndots skill to v1.1.0 2026-02-16 22:30:57 +00:00
Viktor Barzin
a268b9107f
[ci skip] Replace specific CoreDNS catch-all blocks with generic template regex
Single template regex in the viktorbarzin.lan block catches ALL search
domain expansion junk (*.com.viktorbarzin.lan, *.cluster.local.viktorbarzin.lan,
etc.) instead of needing separate server blocks per pattern. Legitimate
single-label queries (idrac.viktorbarzin.lan) fall through to Technitium.
2026-02-16 21:49:03 +00:00
Viktor Barzin
19136c21f1
[ci skip] Fix .viktorbarzin.lan.viktorbarzin.lan duplicate DNS queries
Add CoreDNS catch-all block for viktorbarzin.lan.viktorbarzin.lan to
return NXDOMAIN immediately, preventing search domain expansion junk
queries from reaching Technitium. Add trailing dots to Prometheus
scrape targets (idrac, ups, ha-sofia) to bypass ndots expansion.
2026-02-16 21:38:38 +00:00
Viktor Barzin
c08bab7da7
[ci skip] Update preference: always use cluster_healthcheck.sh for health checks 2026-02-16 21:19:49 +00:00
Viktor Barzin
205eb2704b
[ci skip] Fix Technitium DNS client IP logging: bypass Traefik L4 proxy
DNS queries were going through Traefik's IngressRouteUDP, replacing
real client IPs with Traefik pod IPs (10.10.169.150) in Technitium logs.
Changed Technitium DNS service from NodePort to LoadBalancer with
externalTrafficPolicy: Local, removed dns-udp entrypoint and
IngressRouteUDP from Traefik, and updated CoreDNS to forward .lan
queries to Technitium's LoadBalancer IP directly.
2026-02-16 21:16:16 +00:00
Viktor Barzin
3d4cdf3203
[ci skip] Fix Alloy OOMKill and iDRAC priority class conflict
- Alloy: bump memory limits from 64Mi/128Mi to 256Mi/768Mi — pods were
  OOMKilled at 128Mi, steady-state usage is ~400-450Mi per node
- iDRAC Redfish Exporter: add explicit priority_class_name to resolve
  conflict between Kyverno priority injection and default priority: 0
2026-02-16 20:09:53 +00:00
Viktor Barzin
32a7c04deb
[ci skip] Remember to use cluster_healthcheck.sh for cluster status checks 2026-02-16 19:45:31 +00:00
Viktor Barzin
5c7efbdb48
[ci skip] Fix docker-registry VM: add SSH key, remove hourly restart cron
- Set explicit devvm SSH public key for cloud-init (was empty, breaking SSH access)
- Remove hourly cron that restarted all registry containers, which wiped the
  in-memory blobdescriptor cache and caused low pull-through cache hit rates
2026-02-15 22:16:41 +00:00
Viktor Barzin
7854f89add
[ci skip] Add skill: k8s-ndots-search-domain-nxdomain-flood
Documents how Kubernetes ndots:5 search domain expansion floods external
DNS with NxDomain queries, and the CoreDNS template block fix.
2026-02-15 21:52:27 +00:00
Viktor Barzin
2d015c1cb4
Update Cluster Health dashboard: dedup metrics, GPU memory, remove broken panels [ci skip] 2026-02-15 21:51:41 +00:00
Viktor Barzin
a8f42d7fc0
[ci skip] Manage CoreDNS Corefile in Terraform and block junk NxDomain queries
Add kubernetes_config_map for CoreDNS to the technitium module, with a
template block for cluster.local.viktorbarzin.lan that returns NXDOMAIN
immediately. This prevents ndots:5 search domain expansion from flooding
Technitium with ~66k/day junk queries (e.g.
redis.redis.svc.cluster.local.viktorbarzin.lan).

Also enabled saveCache on Technitium so the DNS cache persists across
pod restarts.
2026-02-15 21:51:12 +00:00
Viktor Barzin
a2b44c8ff7
Update Cluster Health dashboard: reorder rows, add links, key services, sorting [ci skip] 2026-02-15 21:24:08 +00:00
Viktor Barzin
8d38e1474b
[ci skip] Document Terraform state splitting plan for future implementation 2026-02-15 21:10:40 +00:00
Viktor Barzin
f447e45ee1
Add Cluster Health Overview Grafana dashboard [ci skip] 2026-02-15 19:38:28 +00:00
Viktor Barzin
1564ec7e79
Add tier-based resource governance via Kyverno [ci skip]
Four layers of noisy-neighbor protection using existing tier system:
- PriorityClasses (tier-0-core through tier-4-aux)
- LimitRange defaults auto-generated per namespace tier
- ResourceQuotas auto-generated per namespace tier
- PriorityClassName injection on pods via Kyverno mutate

Custom quota overrides for monitoring and crowdsec namespaces
which exceed the default tier quotas.
2026-02-15 18:48:33 +00:00
Viktor Barzin
7ef23470cd
Add Uptime Kuma monitor check to cluster health script [ci skip]
Adds check #14 that queries Uptime Kuma API for application-level
monitor status, complementing the kubectl-level checks with HTTP/ping
health data. Reports down monitors by name with PASS/WARN/FAIL thresholds.
2026-02-15 17:49:40 +00:00
Viktor Barzin
df840a0078
[ci skip] remember: spawn subagent to monitor pods instead of sleeping 2026-02-15 17:48:42 +00:00
Viktor Barzin
8867769a75
Add cluster health check script with 13 diagnostic sections [ci skip] 2026-02-15 17:34:22 +00:00
Viktor Barzin
349fffc124
Cluster health remediation: cleanup CronJob, disable Collabora, fix GPU probe, add NFS exports [ci skip]
- Add daily CronJob to auto-clean Failed/Evicted pods cluster-wide (infra-maintenance)
- Disable Collabora in Nextcloud (broken HPA caused scaling storm; using OnlyOffice instead)
- Increase gpu-pod-exporter liveness probe timeout from 1s to 5s
- Add osm-routing NFS exports (osrm-data, otp-data)
2026-02-15 17:20:47 +00:00
Viktor Barzin
2d475c8496
[ci skip] Add skills: helm-stuck-release-recovery, k8s-hpa-scaling-storm, crowdsec-agent-registration-failure 2026-02-15 17:18:17 +00:00
Viktor Barzin
f68ed73434
[ci skip] Strengthen Terraform-only change policy in project instructions 2026-02-15 15:10:11 +00:00
Viktor Barzin
c05614e4b8
update the scrape schedule for wrongmove [ci skip] 2026-02-15 14:40:05 +00:00
Viktor Barzin
0da86577fb
[ci skip] Add skills: containerd-multi-registry-pull-through-cache, traefik-plugin-download-failure-404 2026-02-15 14:36:50 +00:00
Viktor Barzin
dca2b0cabd
[ci skip] Add uptime-kuma management skill with tiered monitoring 2026-02-15 14:35:53 +00:00
Viktor Barzin
36d32b49e7
[ci skip] Fix pull-through cache for all registries
Replace deprecated wildcard containerd mirror with per-registry
config_path approach. Add proxy containers for ghcr.io, quay.io,
registry.k8s.io, and reg.kyverno.io on the docker-registry VM.
Set static IP for docker-registry VM to avoid DHCP issues.
2026-02-15 14:35:52 +00:00
Viktor Barzin
163d6a728d Drone CI Update TLS Certificates Commit 2026-02-15 00:05:36 +00:00
Viktor Barzin
22fdb8fbf0
[ci skip] Add pfSense firewall management skill 2026-02-14 12:42:10 +00:00
Viktor Barzin
2b6bcae77f
[ci skip] Add skills: loki-helm-deployment-pitfalls, grafana-stale-datasource-cleanup 2026-02-13 23:47:45 +00:00
Viktor Barzin
ecebf401bf
[ci skip] Update knowledge base with Loki + Alloy service notes 2026-02-13 23:46:01 +00:00
Viktor Barzin
cd52fc400c
[ci skip] Sync tfstate after Loki + Alloy deployment 2026-02-13 23:44:17 +00:00
Viktor Barzin
7644c419a4
[ci skip] Update Loki dashboard to use correct datasource UID 2026-02-13 23:41:40 +00:00
Viktor Barzin
cd2d13d949
[ci skip] Fix compactor/ruler paths to use writable /var/loki mount 2026-02-13 23:22:13 +00:00
Viktor Barzin
d906513f09
[ci skip] Re-enable lokiCanary (required by Helm chart validation) 2026-02-13 23:18:13 +00:00
Viktor Barzin
a38c3d3dc7
[ci skip] Disable gateway/canary/cache, increase timeout for Loki deploy 2026-02-13 23:17:32 +00:00
Viktor Barzin
f013c0a139
[ci skip] Fix code review findings: correct Alertmanager URL, add atomic to Loki, remove dead minio NFS export, update design doc 2026-02-13 23:08:44 +00:00
Viktor Barzin
c7236f09f1
[ci skip] Add centralized log collection: Loki + Alloy + sysctl DaemonSet 2026-02-13 23:03:40 +00:00
Viktor Barzin
c330648b7b
[ci skip] Deploy MoltBot (OpenClaw) AI agent gateway
Add new Kubernetes service for OpenClaw gateway connected to in-cluster
Ollama, with kubectl/terraform/git access for infrastructure management.
Protected behind Authentik SSO.
2026-02-13 22:57:36 +00:00
Viktor Barzin
9df9ab1654
[ci skip] Add extend-vm-storage script and skills
- Script to automate K8s node VM disk expansion (drain, shutdown, resize, boot, expand FS, uncordon)
- Skill docs for the workflow and troubleshooting pitfalls (growpart, macOS grep -P, drain timeouts)
- Successfully tested on k8s-node2, k8s-node3, k8s-node4 (64G → 128G)
2026-02-13 22:08:46 +00:00
Viktor Barzin
ecffe93c22
[ci skip] Add centralized log collection implementation plan 2026-02-13 21:54:55 +00:00
Viktor Barzin
3d64fc9f2c
[ci skip] Add centralized log collection design doc 2026-02-13 21:53:04 +00:00
Viktor Barzin
06f9a7fe74
[ci skip] Add skill: local-llm-gpu-selection 2026-02-13 19:26:19 +00:00
Viktor Barzin
e0ff08978d
[ci skip] add vibetunnel proxy 2026-02-13 18:20:50 +00:00
Viktor Barzin
2377045630
[ci skip] sync tfstate and add frigate helper scripts 2026-02-12 23:11:23 +00:00
Viktor Barzin
8bea552664
[ci skip] Add HomeAssistantDown alert for ha-sofia
Fires after 5m if the haos Prometheus scrape target is unreachable.
Covers the HTTP API endpoint which shares the same process as the
WebSocket API used by the mobile app.
2026-02-11 23:24:46 +00:00
Viktor Barzin
b2d74a93a0
sync tfstate [ci skip] 2026-02-11 22:53:39 +00:00
Viktor Barzin
0c18a86a7b
[ci skip] Fix all active Prometheus alerts
- meshcentral: rename port from "https" to "http" — MeshCentral serves
  plain HTTP when REVERSE_PROXY=true, but Traefik inferred HTTPS from the
  port name, causing 100% 5xx errors
- osm-routing/otp: scale to 0 — TfL GTFS data expired, OTP crash-loops
  trying to build graph with no valid transit trips
- wireguard: add prometheus.io/port=9586 annotation — without it,
  Prometheus tried scraping all container ports (51820 UDP, 80)
- travel-blog: remove stale prometheus.io annotations and dead port 9113
  — nginx-exporter sidecar was commented out but annotations remained
- dawarich: remove prometheus.io annotations — exporter env vars are
  commented out so nothing listens on port 9394
- monitoring: raise CPU temp threshold 60°C→75°C (E5-2699 v4 Tcase is
  79°C), lower registry cache threshold 50%→25%, add minimum traffic
  floor (>0.1 req/s) to 4xx/5xx rate alerts to prevent false positives
  on low-traffic services
2026-02-11 22:40:56 +00:00
Viktor Barzin
9c3f8adc11
[ci skip] Fix CrowdSec to monitor Traefik and add Slack notifications
- Switch acquisition from ingress-nginx to traefik namespace/pods
- Change collection from crowdsecurity/nginx to crowdsecurity/traefik
- Add Slack notification plugin for ban/captcha decisions
- Wire alertmanager_slack_api_url through to CrowdSec module
2026-02-11 22:25:03 +00:00
Viktor Barzin
04eaf0f989
[ci skip] Add 12 Prometheus alert rules for monitoring gaps
Add 3 new alert groups and 1 rule to existing group:
- Storage: NodeFilesystemFull (<10% free), PVFillingUp (>85% used)
- K8s Health: PodCrashLooping, ContainerOOMKilled, NodeNotReady,
  NodeConditionBad, JobFailed
- Infrastructure Health: CoreDNSErrors, ScrapeTargetDown,
  PrometheusStorageFull, PrometheusNotificationsFailing
- R730 Host: FanFailure (iDRAC Redfish fan health)
2026-02-11 22:14:30 +00:00
Viktor Barzin
12a84c5207
Standardize Prometheus alert formatting and fix Slack notifications
- Add color coding (red/green) to Slack alerts, show alertname in title
- Use summary annotation in Slack text (description was always empty)
- Format all alert summaries consistently: value with units and threshold
- Fix ratio expressions (CPU/memory) to display as percentages
- Fix "failiure" typo, capitalize Tailscale
2026-02-11 21:53:22 +00:00