Commit graph

1226 commits

Author SHA1 Message Date
Viktor Barzin
2bae6ccce3 Add Uptime Kuma monitor check to cluster health script [ci skip]
Adds check #14 that queries Uptime Kuma API for application-level
monitor status, complementing the kubectl-level checks with HTTP/ping
health data. Reports down monitors by name with PASS/WARN/FAIL thresholds.
2026-02-15 17:49:40 +00:00
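The PASS/WARN/FAIL reporting described in this commit could look roughly like the sketch below. This is not the script's actual code. The function name `classify_monitors` and the thresholds `warn_at` and `fail_at` are hypothetical, since the commit message does not state the cut-offs.

```python
from typing import Iterable, Tuple


def classify_monitors(down: Iterable[str], warn_at: int = 1, fail_at: int = 3) -> Tuple[str, str]:
    """Map a list of down monitor names to a status and a detail string.

    warn_at and fail_at are assumed thresholds, not values from the commit.
    """
    names = sorted(down)
    if len(names) >= fail_at:
        status = "FAIL"
    elif len(names) >= warn_at:
        status = "WARN"
    else:
        status = "PASS"
    # Report down monitors by name, as the commit message describes.
    detail = ", ".join(names) if names else "all monitors up"
    return status, detail
```

With these assumed thresholds, a single down monitor yields WARN and three or more yield FAIL.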
Viktor Barzin
719e3c6244 [ci skip] remember: spawn subagent to monitor pods instead of sleeping 2026-02-15 17:48:42 +00:00
Viktor Barzin
9c4ff21d58 Add cluster health check script with 13 diagnostic sections [ci skip] 2026-02-15 17:34:22 +00:00
Viktor Barzin
a73f3fcb6b Cluster health remediation: cleanup CronJob, disable Collabora, fix GPU probe, add NFS exports [ci skip]
- Add daily CronJob to auto-clean Failed/Evicted pods cluster-wide (infra-maintenance)
- Disable Collabora in Nextcloud (broken HPA caused scaling storm; using OnlyOffice instead)
- Increase gpu-pod-exporter liveness probe timeout from 1s to 5s
- Add osm-routing NFS exports (osrm-data, otp-data)
2026-02-15 17:20:47 +00:00
Viktor Barzin
3da35166ab [ci skip] Add skills: helm-stuck-release-recovery, k8s-hpa-scaling-storm, crowdsec-agent-registration-failure 2026-02-15 17:18:17 +00:00
Viktor Barzin
95013c9056 [ci skip] Strengthen Terraform-only change policy in project instructions 2026-02-15 15:10:11 +00:00
Viktor Barzin
606a79078e [ci skip] Add skills: containerd-multi-registry-pull-through-cache, traefik-plugin-download-failure-404 2026-02-15 14:36:50 +00:00
Viktor Barzin
a7f2d6b9e6 [ci skip] Add uptime-kuma management skill with tiered monitoring 2026-02-15 14:35:53 +00:00
Viktor Barzin
a67a6f350e [ci skip] Fix pull-through cache for all registries
Replace deprecated wildcard containerd mirror with per-registry
config_path approach. Add proxy containers for ghcr.io, quay.io,
registry.k8s.io, and reg.kyverno.io on the docker-registry VM.
Set static IP for docker-registry VM to avoid DHCP issues.
2026-02-15 14:35:52 +00:00
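With containerd's `config_path` approach, each upstream registry gets its own `hosts.toml` in a directory named after the registry. The sketch below renders such a file. The helper `render_hosts_toml` and the mirror URL passed to it are hypothetical, because the commit does not give the proxy addresses or ports.

```python
def render_hosts_toml(registry: str, mirror_url: str) -> str:
    """Render a containerd hosts.toml that tries mirror_url before the upstream.

    mirror_url is an assumed address for the pull-through proxy container.
    """
    return (
        f'server = "https://{registry}"\n'
        "\n"
        f'[host."{mirror_url}"]\n'
        '  capabilities = ["pull", "resolve"]\n'
    )


if __name__ == "__main__":
    # Example: the file that would live under <config_path>/ghcr.io/hosts.toml
    print(render_hosts_toml("ghcr.io", "http://docker-registry.example:5001"))
```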
Viktor Barzin
5a37c26e9b Drone CI Update TLS Certificates Commit 2026-02-15 00:05:36 +00:00
Viktor Barzin
c473663b98 [ci skip] Add pfSense firewall management skill 2026-02-14 12:42:10 +00:00
Viktor Barzin
ca43b97fa0 [ci skip] Add skills: loki-helm-deployment-pitfalls, grafana-stale-datasource-cleanup 2026-02-13 23:47:45 +00:00
Viktor Barzin
a5b240629c [ci skip] Update knowledge base with Loki + Alloy service notes 2026-02-13 23:46:01 +00:00
Viktor Barzin
c4a7c5df8e [ci skip] Update Loki dashboard to use correct datasource UID 2026-02-13 23:41:40 +00:00
Viktor Barzin
fabece6370 [ci skip] Fix compactor/ruler paths to use writable /var/loki mount 2026-02-13 23:22:13 +00:00
Viktor Barzin
0d3acec82c [ci skip] Re-enable lokiCanary (required by Helm chart validation) 2026-02-13 23:18:13 +00:00
Viktor Barzin
fea7c6cbb1 [ci skip] Disable gateway/canary/cache, increase timeout for Loki deploy 2026-02-13 23:17:32 +00:00
Viktor Barzin
69aae2ec9d [ci skip] Fix code review findings: correct Alertmanager URL, add atomic to Loki, remove dead minio NFS export, update design doc 2026-02-13 23:08:44 +00:00
Viktor Barzin
71ff803978 [ci skip] Add centralized log collection: Loki + Alloy + sysctl DaemonSet 2026-02-13 23:03:40 +00:00
Viktor Barzin
a44dfac721 [ci skip] Deploy MoltBot (OpenClaw) AI agent gateway
Add new Kubernetes service for OpenClaw gateway connected to in-cluster
Ollama, with kubectl/terraform/git access for infrastructure management.
Protected behind Authentik SSO.
2026-02-13 22:57:36 +00:00
Viktor Barzin
08ea489fe0 [ci skip] Add extend-vm-storage script and skills
- Script to automate K8s node VM disk expansion (drain, shutdown, resize, boot, expand FS, uncordon)
- Skill docs for the workflow and troubleshooting pitfalls (growpart, macOS grep -P, drain timeouts)
- Successfully tested on k8s-node2, k8s-node3, k8s-node4 (64G → 128G)
2026-02-13 22:08:46 +00:00
Viktor Barzin
04dd438b01 [ci skip] Add centralized log collection implementation plan 2026-02-13 21:54:55 +00:00
Viktor Barzin
6ac8d549cb [ci skip] Add centralized log collection design doc 2026-02-13 21:53:04 +00:00
Viktor Barzin
92f392f64c [ci skip] Add skill: local-llm-gpu-selection 2026-02-13 19:26:19 +00:00
Viktor Barzin
0137913954 [ci skip] add vibetunnel proxy 2026-02-13 18:20:50 +00:00
Viktor Barzin
a926a5022c [ci skip] sync tfstate and add frigate helper scripts 2026-02-12 23:11:23 +00:00
Viktor Barzin
cd5261161b [ci skip] Add HomeAssistantDown alert for ha-sofia
Fires after 5m if the haos Prometheus scrape target is unreachable.
Covers the HTTP API endpoint which shares the same process as the
WebSocket API used by the mobile app.
2026-02-11 23:24:46 +00:00
Viktor Barzin
46ffc37dcf [ci skip] Fix all active Prometheus alerts
- meshcentral: rename port from "https" to "http" — MeshCentral serves
  plain HTTP when REVERSE_PROXY=true, but Traefik inferred HTTPS from the
  port name, causing 100% 5xx errors
- osm-routing/otp: scale to 0 — TfL GTFS data expired, OTP crash-loops
  trying to build graph with no valid transit trips
- wireguard: add prometheus.io/port=9586 annotation — without it,
  Prometheus tried scraping all container ports (51820 UDP, 80)
- travel-blog: remove stale prometheus.io annotations and dead port 9113
  — nginx-exporter sidecar was commented out but annotations remained
- dawarich: remove prometheus.io annotations — exporter env vars are
  commented out so nothing listens on port 9394
- monitoring: raise CPU temp threshold 60°C→75°C (E5-2699 v4 Tcase is
  79°C), lower registry cache threshold 50%→25%, add minimum traffic
  floor (>0.1 req/s) to 4xx/5xx rate alerts to prevent false positives
  on low-traffic services
2026-02-11 22:40:56 +00:00
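The minimum traffic floor on the 4xx/5xx rate alerts can be illustrated with the sketch below. The 0.1 req/s floor comes from the commit message. The function name and the `max_error_ratio` threshold are assumptions, and the real rules are Prometheus expressions rather than Python.

```python
def error_rate_alert(
    error_rps: float,
    total_rps: float,
    max_error_ratio: float = 0.1,  # assumed threshold, not from the commit
    min_total_rps: float = 0.1,    # traffic floor from the commit (>0.1 req/s)
) -> bool:
    """Return True only when the error ratio is high AND traffic exceeds the floor."""
    if total_rps <= min_total_rps:
        # Too little traffic: one failed request would look like a 100% error rate.
        return False
    return error_rps / total_rps > max_error_ratio
```

A service receiving 0.01 req/s with every request failing does not fire, which is the false positive the floor is meant to prevent.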
Viktor Barzin
b4f68d99d8 [ci skip] Fix CrowdSec to monitor Traefik and add Slack notifications
- Switch acquisition from ingress-nginx to traefik namespace/pods
- Change collection from crowdsecurity/nginx to crowdsecurity/traefik
- Add Slack notification plugin for ban/captcha decisions
- Wire alertmanager_slack_api_url through to CrowdSec module
2026-02-11 22:25:03 +00:00
Viktor Barzin
c8a41ac567 [ci skip] Add 12 Prometheus alert rules for monitoring gaps
Add 3 new alert groups and 1 rule to existing group:
- Storage: NodeFilesystemFull (<10% free), PVFillingUp (>85% used)
- K8s Health: PodCrashLooping, ContainerOOMKilled, NodeNotReady,
  NodeConditionBad, JobFailed
- Infrastructure Health: CoreDNSErrors, ScrapeTargetDown,
  PrometheusStorageFull, PrometheusNotificationsFailing
- R730 Host: FanFailure (iDRAC Redfish fan health)
2026-02-11 22:14:30 +00:00
Viktor Barzin
dbf397841a Standardize Prometheus alert formatting and fix Slack notifications
- Add color coding (red/green) to Slack alerts, show alertname in title
- Use summary annotation in Slack text (description was always empty)
- Format all alert summaries consistently: value with units and threshold
- Fix ratio expressions (CPU/memory) to display as percentages
- Fix "failiure" typo, capitalize Tailscale
2026-02-11 21:53:22 +00:00
Viktor Barzin
d48052276e [ci skip] Add skill: traefik-rewrite-body-compression
Extracted from debugging session where packruler/rewrite-body plugin
corrupted gzip responses, breaking HA Companion app auth flow and
WebSocket connections. Fix: strip Accept-Encoding header before
rewrite-body plugin so backends send uncompressed responses.
2026-02-11 21:42:07 +00:00
Viktor Barzin
f03b8a055b [ci skip] Fix rewrite-body plugin corrupting compressed responses
The packruler/rewrite-body plugin (used for rybbit analytics injection)
fails to decompress gzip responses with "flate: corrupt input before
offset 5", corrupting the response body. This broke HA Companion app's
external_auth flow and WebSocket connections on ha-sofia.

Fix: add a strip-accept-encoding middleware that removes Accept-Encoding
from requests when rybbit is active, forcing backends to send uncompressed
responses that the plugin can safely process.

Also add extra_middlewares variable to reverse_proxy factory for
extensibility.
2026-02-11 21:40:11 +00:00
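The effect of the strip-accept-encoding middleware can be sketched as follows. This is a simplified model, not Traefik's or the plugin's implementation. Both helper names are hypothetical, and `inject_snippet` stands in for the analytics injection that needs an uncompressed body to work on.

```python
from typing import Dict


def strip_accept_encoding(headers: Dict[str, str]) -> Dict[str, str]:
    """Drop Accept-Encoding (case-insensitive) so the backend replies uncompressed."""
    return {k: v for k, v in headers.items() if k.lower() != "accept-encoding"}


def inject_snippet(body: bytes, snippet: bytes) -> bytes:
    """Insert snippet before the first </head>; only meaningful on uncompressed HTML."""
    marker = b"</head>"
    return body.replace(marker, snippet + marker, 1)
```

Because the request no longer advertises gzip support, the body-rewriting step only ever sees plain HTML.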
Viktor Barzin
036ec06256 immich to 2.5.6 [ci skip] 2026-02-10 22:01:08 +00:00
Viktor Barzin
c82f82af57 [ci skip] Add ingress-factory-migration skill 2026-02-10 21:31:48 +00:00
Viktor Barzin
73aab7f4ce [ci skip] Assorted pending changes: ollama API auth, nvidia dashboard, traefik rewrite-body plugin
- ollama: Add basicAuth middleware for external API access
- monitoring: Update nvidia dashboard (add GPU memory per app panel, bump to v9)
- plotting-book: Switch to ancamilea/book-plotter:latest, add lifecycle ignore
- reverse_proxy/factory: Fix rybbit plugin name (rewritebody -> rewrite-body)
- traefik: Switch to packruler/rewrite-body plugin v1.2.0
2026-02-10 21:29:54 +00:00
Viktor Barzin
5e1e18a044 [ci skip] Use RollingUpdate strategy for real-estate-crawler deployments
Set max_unavailable=0, max_surge=1 on both UI and API deployments
to ensure at least 1 replica is always available during updates.
2026-02-10 21:28:38 +00:00
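The availability guarantee follows from how Kubernetes bounds pod counts during a rolling update. The sketch below works through the arithmetic for absolute (non-percentage) values. The function name is hypothetical and the replica count of 1 in the example is an assumption.

```python
from typing import Tuple


def rollout_bounds(replicas: int, max_unavailable: int, max_surge: int) -> Tuple[int, int]:
    """Return (minimum available pods, maximum total pods) during a rolling update."""
    min_available = replicas - max_unavailable
    max_total = replicas + max_surge
    return min_available, max_total


if __name__ == "__main__":
    # Assumed single-replica deployment with the settings from the commit.
    print(rollout_bounds(replicas=1, max_unavailable=0, max_surge=1))
```

With `max_unavailable=0` the old pod is only removed after its replacement is ready, at the cost of briefly running one extra pod.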
Viktor Barzin
6d6ec0c1e2 [ci skip] Refactor raw ingresses to use ingress_factory module
Enhance ingress_factory with full_host, extra_middlewares, and
skip_default_rate_limit variables. Fix TLS hosts bug to use
effective_host. Migrate 13 services from raw kubernetes_ingress_v1
resources to centralized ingress_factory module calls, removing
manual rybbit middleware CRDs where the factory now handles them.
2026-02-10 21:11:46 +00:00
Viktor Barzin
70376b623e [ci skip] Fix health service port: container listens on 3000, not 80 2026-02-09 21:27:50 +00:00
Viktor Barzin
f04a072beb [ci skip] Add internal OSM routing services (OSRM foot, bicycle, OTP)
New osm-routing namespace with walking, cycling, and transit routing
services for the real-estate-crawler. Internal-only (no public ingress).
2026-02-09 21:03:57 +00:00
Viktor Barzin
5a81ce5774 [ci skip] Allow 100-way time slicing of the NVIDIA GPU 2026-02-09 21:00:15 +00:00
Viktor Barzin
7b747350de [ci skip] Add descheduler profile to restart idrac-redfish-exporter every 6h 2026-02-09 20:57:09 +00:00
Viktor Barzin
c408887560 [ci skip] Add WebAuthn env vars to real-estate-crawler API deployment 2026-02-08 20:06:24 +00:00
Viktor Barzin
bcdebfd9c1 [ci skip] update claude knowledge: fix NFS scripts path to secrets/ 2026-02-08 02:41:42 +00:00
Viktor Barzin
13659e0fc6 [ci skip] Fix grampsweb ingress: set service_name to match backend service
The ingress_factory defaults service_name to name, so it was routing
to a non-existent "family" service instead of "grampsweb".
2026-02-08 02:30:19 +00:00
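The defaulting behaviour behind this bug can be sketched as below. The real module is Terraform. This Python function is a hypothetical model of the fallback, using the "family" and "grampsweb" names from the commit message.

```python
from typing import Optional


def backend_service(name: str, service_name: Optional[str] = None) -> str:
    """Resolve the backend service: fall back to the ingress name when unset."""
    return service_name if service_name is not None else name
```

Calling `backend_service("family")` resolves to the non-existent "family" service, while passing `service_name="grampsweb"` routes to the intended backend.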
Viktor Barzin
945d2d90a7 [ci skip] update claude knowledge: always apply cloudflared module for DNS
When deploying a new service, the cloudflared module must also be applied
to create the Cloudflare DNS record. Updated CLAUDE.md and setup-project skill.
2026-02-08 02:30:19 +00:00
Viktor Barzin
ce8f81db0c [ci skip] Deploy Gramps Web genealogy service
Add grampsweb module with web app + Celery worker in a single pod,
using shared Redis (DB 2/3), NFS storage, email via mailserver,
and Ollama AI integration. Available at family.viktorbarzin.me.
2026-02-08 02:30:18 +00:00
Viktor Barzin
861cd80c64 add the nfs dirs 2026-02-08 02:29:48 +00:00
Viktor Barzin
a2e1a79286 [ci skip] update claude knowledge: add health service 2026-02-08 01:55:30 +00:00
Viktor Barzin
5ad7b7e76d [ci skip] Deploy health dashboard service
Apple Health data visualization app (Svelte + FastAPI + Caddy).
Uses shared PostgreSQL via DBaaS, NFS storage for uploads,
accessible at health.viktorbarzin.me.
2026-02-08 01:54:24 +00:00