Commit graph

1540 commits

Author SHA1 Message Date
Viktor Barzin
3d64fc9f2c
[ci skip] Add centralized log collection design doc 2026-02-13 21:53:04 +00:00
Viktor Barzin
06f9a7fe74
[ci skip] Add skill: local-llm-gpu-selection 2026-02-13 19:26:19 +00:00
Viktor Barzin
e0ff08978d
[ci skip] add vibetunnel proxy 2026-02-13 18:20:50 +00:00
Viktor Barzin
2377045630
[ci skip] sync tfstate and add frigate helper scripts 2026-02-12 23:11:23 +00:00
Viktor Barzin
8bea552664
[ci skip] Add HomeAssistantDown alert for ha-sofia
Fires after 5m if the haos Prometheus scrape target is unreachable.
Covers the HTTP API endpoint which shares the same process as the
WebSocket API used by the mobile app.
2026-02-11 23:24:46 +00:00
Viktor Barzin
b2d74a93a0
sync tfstate [ci skip] 2026-02-11 22:53:39 +00:00
Viktor Barzin
0c18a86a7b
[ci skip] Fix all active Prometheus alerts
- meshcentral: rename port from "https" to "http" — MeshCentral serves
  plain HTTP when REVERSE_PROXY=true, but Traefik inferred HTTPS from the
  port name, causing 100% 5xx errors
- osm-routing/otp: scale to 0 — TfL GTFS data expired, OTP crash-loops
  trying to build graph with no valid transit trips
- wireguard: add prometheus.io/port=9586 annotation — without it,
  Prometheus tried scraping all container ports (51820 UDP, 80)
- travel-blog: remove stale prometheus.io annotations and dead port 9113
  — nginx-exporter sidecar was commented out but annotations remained
- dawarich: remove prometheus.io annotations — exporter env vars are
  commented out so nothing listens on port 9394
- monitoring: raise CPU temp threshold 60°C→75°C (E5-2699 v4 Tcase is
  79°C), lower registry cache threshold 50%→25%, add minimum traffic
  floor (>0.1 req/s) to 4xx/5xx rate alerts to prevent false positives
  on low-traffic services
2026-02-11 22:40:56 +00:00
Viktor Barzin
9c3f8adc11
[ci skip] Fix CrowdSec to monitor Traefik and add Slack notifications
- Switch acquisition from ingress-nginx to traefik namespace/pods
- Change collection from crowdsecurity/nginx to crowdsecurity/traefik
- Add Slack notification plugin for ban/captcha decisions
- Wire alertmanager_slack_api_url through to CrowdSec module
2026-02-11 22:25:03 +00:00
Viktor Barzin
04eaf0f989
[ci skip] Add 12 Prometheus alert rules for monitoring gaps
Add 3 new alert groups and 1 rule to existing group:
- Storage: NodeFilesystemFull (<10% free), PVFillingUp (>85% used)
- K8s Health: PodCrashLooping, ContainerOOMKilled, NodeNotReady,
  NodeConditionBad, JobFailed
- Infrastructure Health: CoreDNSErrors, ScrapeTargetDown,
  PrometheusStorageFull, PrometheusNotificationsFailing
- R730 Host: FanFailure (iDRAC Redfish fan health)
2026-02-11 22:14:30 +00:00
Viktor Barzin
12a84c5207
Standardize Prometheus alert formatting and fix Slack notifications
- Add color coding (red/green) to Slack alerts, show alertname in title
- Use summary annotation in Slack text (description was always empty)
- Format all alert summaries consistently: value with units and threshold
- Fix ratio expressions (CPU/memory) to display as percentages
- Fix "failiure" typo, capitalize Tailscale
2026-02-11 21:53:22 +00:00
Viktor Barzin
81bdf53386
[ci skip] Add skill: traefik-rewrite-body-compression
Extracted from debugging session where packruler/rewrite-body plugin
corrupted gzip responses, breaking HA Companion app auth flow and
WebSocket connections. Fix: strip Accept-Encoding header before
rewrite-body plugin so backends send uncompressed responses.
2026-02-11 21:42:07 +00:00
Viktor Barzin
220f4a18b7
[ci skip] Fix rewrite-body plugin corrupting compressed responses
The packruler/rewrite-body plugin (used for rybbit analytics injection)
fails to decompress gzip responses with "flate: corrupt input before
offset 5", corrupting the response body. This broke HA Companion app's
external_auth flow and WebSocket connections on ha-sofia.

Fix: add a strip-accept-encoding middleware that removes Accept-Encoding
from requests when rybbit is active, forcing backends to send uncompressed
responses that the plugin can safely process.

Also add extra_middlewares variable to reverse_proxy factory for
extensibility.
2026-02-11 21:40:11 +00:00
Viktor Barzin
8e54a936c4
immich to 2.5.6 [ci skip] 2026-02-10 22:01:08 +00:00
Viktor Barzin
ed3cb6df07
[ci skip] Add ingress-factory-migration skill 2026-02-10 21:31:48 +00:00
Viktor Barzin
6acf5ee300
[ci skip] Assorted pending changes: ollama API auth, nvidia dashboard, traefik rewrite-body plugin
- ollama: Add basicAuth middleware for external API access
- monitoring: Update nvidia dashboard (add GPU memory per app panel, bump to v9)
- plotting-book: Switch to ancamilea/book-plotter:latest, add lifecycle ignore
- reverse_proxy/factory: Fix rybbit plugin name (rewritebody -> rewrite-body)
- traefik: Switch to packruler/rewrite-body plugin v1.2.0
2026-02-10 21:29:54 +00:00
Viktor Barzin
f9c07f015b
[ci skip] Use RollingUpdate strategy for real-estate-crawler deployments
Set max_unavailable=0, max_surge=1 on both UI and API deployments
to ensure at least 1 replica is always available during updates.
2026-02-10 21:28:38 +00:00
Viktor Barzin
348d706a48
[ci skip] Refactor raw ingresses to use ingress_factory module
Enhance ingress_factory with full_host, extra_middlewares, and
skip_default_rate_limit variables. Fix TLS hosts bug to use
effective_host. Migrate 13 services from raw kubernetes_ingress_v1
resources to centralized ingress_factory module calls, removing
manual rybbit middleware CRDs where the factory now handles them.
2026-02-10 21:11:46 +00:00
Viktor Barzin
97175c5444
[ci skip] Fix health service port: container listens on 3000, not 80 2026-02-09 21:27:50 +00:00
Viktor Barzin
dadee44046
[ci skip] Add internal OSM routing services (OSRM foot, bicycle, OTP)
New osm-routing namespace with walking, cycling, and transit routing
services for the real-estate-crawler. Internal-only (no public ingress).
2026-02-09 21:03:57 +00:00
Viktor Barzin
cb761f90d7
[ci skip] allow 100 time slicing of nvidia gpu 2026-02-09 21:00:15 +00:00
Viktor Barzin
b6499a8090
[ci skip] Add descheduler profile to restart idrac-redfish-exporter every 6h 2026-02-09 20:57:09 +00:00
Viktor Barzin
a33476919f
[ci skip] Add WebAuthn env vars to real-estate-crawler API deployment 2026-02-08 20:06:24 +00:00
Viktor Barzin
748c77ff9f
[ci skip] update claude knowledge: fix NFS scripts path to secrets/ 2026-02-08 02:41:42 +00:00
Viktor Barzin
74cc37a3c2
[ci skip] Fix grampsweb ingress: set service_name to match backend service
The ingress_factory defaults service_name to name, so it was routing
to a non-existent "family" service instead of "grampsweb".
2026-02-08 02:30:19 +00:00
Viktor Barzin
87f851c20e
[ci skip] update claude knowledge: always apply cloudflared module for DNS
When deploying a new service, the cloudflared module must also be applied
to create the Cloudflare DNS record. Updated CLAUDE.md and setup-project skill.
2026-02-08 02:30:19 +00:00
Viktor Barzin
d911db6cd9
[ci skip] Deploy Gramps Web genealogy service
Add grampsweb module with web app + Celery worker in a single pod,
using shared Redis (DB 2/3), NFS storage, email via mailserver,
and Ollama AI integration. Available at family.viktorbarzin.me.
2026-02-08 02:30:18 +00:00
Viktor Barzin
c04a5e6229 add the nfs dirs 2026-02-08 02:29:48 +00:00
Viktor Barzin
71223c02ad
[ci skip] update claude knowledge: add health service 2026-02-08 01:55:30 +00:00
Viktor Barzin
43bee50de8
[ci skip] Deploy health dashboard service
Apple Health data visualization app (Svelte + FastAPI + Caddy).
Uses shared PostgreSQL via DBaaS, NFS storage for uploads,
accessible at health.viktorbarzin.me.
2026-02-08 01:54:24 +00:00
Viktor Barzin
00943e92fe
[ci skip] update add-service skill: require NFS setup before deployment
Add step 3 (NFS Storage Setup) to ensure NFS directories are created
and exported on TrueNAS before deploying services that need persistent
storage. Prevents pods getting stuck in ContainerCreating due to missing
NFS mounts.
2026-02-08 01:51:44 +00:00
Viktor Barzin
44a17f8089
[ci skip] Add Ollama TCP entrypoint for HA voice pipeline
Expose Ollama at 10.0.20.202:11434 via Traefik TCP passthrough,
bypassing TLS/auth issues with the HTTPS ingress.
2026-02-08 01:51:43 +00:00
Viktor Barzin
bdbd354396
[ci skip] Add Wyoming Piper TTS alongside Whisper STT
Deploy Piper (rhasspy/wyoming-piper) in the whisper namespace with
en_US-lessac-medium voice. Exposed via Traefik TCP on port 10200.
2026-02-08 01:51:43 +00:00
Viktor Barzin
d89947c2fd
[ci skip] Deploy Wyoming Whisper STT service for Home Assistant voice input
Add Wyoming Faster Whisper (rhasspy/wyoming-whisper) as a new K8s service
exposed via Traefik TCP entrypoint on port 10300. Accessible from ha-london
RPi via VPN at 10.0.20.202:10300.
2026-02-08 01:51:43 +00:00
Viktor Barzin
e067504170
[ci skip] update claude knowledge: fix ha-london IP to 192.168.8.103 2026-02-08 01:51:42 +00:00
Viktor Barzin
476f2d2b66 Drone CI Update TLS Certificates Commit 2026-02-08 00:04:51 +00:00
Viktor Barzin
e04fabaa72
[ci skip] Fix registry tag cleanup for pull-through cache
- Rewrite cleanup script to use filesystem deletion (shutil.rmtree)
  since proxy registries don't support DELETE via API (405)
- Fix cron entry to invoke with python3
2026-02-07 22:45:17 +00:00
Viktor Barzin
9bc56b3f04
[ci skip] Add LLM agents, voice stack, and automations to ha-london knowledge map 2026-02-07 22:40:12 +00:00
Viktor Barzin
6ba026801c
[ci skip] Update terraform state 2026-02-07 22:39:44 +00:00
Viktor Barzin
a5c782578c
[ci skip] Add ha-london knowledge map: RPi Docker setup, smart plugs, air quality, e-bike
ha-london runs on Raspberry Pi at 192.168.8.104 (Docker rootless, HA 2025.9.1).
Key systems: TP-Link Kasa smart plugs with energy monitoring, Apollo AIR-1 air
quality sensor (ESPHome), Cowboy e-bike, UptimeRobot, Oral-B BLE toothbrush.
SSH access via pi@192.168.8.104, config at /home/pi/docker/homeAssistant/.
2026-02-07 22:39:20 +00:00
Viktor Barzin
8bbf4e51da
Add registry DNS record and real-estate scrape schedules
Add registry.viktorbarzin.me to non-proxied DNS names. Add scrape
schedule config for real-estate-crawler. Fix crowdsec var formatting.
2026-02-07 22:38:42 +00:00
Viktor Barzin
1ff5242a57
Bump Immich version from v2.5.2 to v2.5.5 2026-02-07 22:38:33 +00:00
Viktor Barzin
b27e1ad9f1
Add Docker registry UI and tag cleanup automation
Deploy joxit/docker-registry-ui on port 8080 for browsing images/tags.
Add Python script to prune old registry tags (keeps last N per image),
scheduled daily at 2am via cron. Expose UI via reverse proxy at
registry.viktorbarzin.me with Authentik auth.
2026-02-07 22:38:15 +00:00
Viktor Barzin
4bd87df6d6
[ci skip] Add skill: traefik-udp-cross-namespace
Extracted from debugging DNS forwarding through Traefik v3. Documents
two non-obvious requirements for custom UDP entrypoints in the Helm chart:
expose.default=true (port not added to Service by default) and
allowCrossNamespace=true (IngressRouteUDP cross-namespace refs blocked
by default). Both issues compound silently.
2026-02-07 22:25:54 +00:00
Viktor Barzin
fb2a830cb2
[ci skip] Update ha-sofia SSH to direct IP 192.168.1.8 and document limitations 2026-02-07 22:21:30 +00:00
Viktor Barzin
8cc4f022c3
[ci skip] Add NAS, printer, iDRAC, AC, and AI to ha-sofia knowledge map 2026-02-07 21:40:47 +00:00
Viktor Barzin
f0844bc45a
[ci skip] Add ha-sofia knowledge map to home-assistant skill
Document all systems discovered via API: gas boiler (EMS-ESP), 4-room
thermostats, solar/battery (Solarman), ATS, Paradox alarm, Frigate NVR
with 9 cameras, Home Connect appliances, LED controllers, media, UPS,
Pax ventilation, and Bulgarian ↔ English room name mappings.
2026-02-07 21:39:58 +00:00
Viktor Barzin
fe2283d728
[ci skip] Add Proxmox VM inventory to claude knowledge 2026-02-07 21:37:38 +00:00
Viktor Barzin
4c001178f6
[ci skip] Add ha-sofia Home Assistant deployment to skills
- Update home-assistant skill to v2.0.0 covering both ha-london and ha-sofia
- Add separate API script for ha-sofia (home-assistant-sofia.py)
- ha-sofia: SSH via vbarzin@ha-sofia.viktorbarzin.lan, config at /config/
- Update CLAUDE.md with both HA deployments
2026-02-07 21:26:05 +00:00
Viktor Barzin
cea317a1b7
[ci skip] Add skills: traefik-http3-quic and helm-release-force-rerender
- traefik-http3-quic: Enable HTTP/3 (QUIC) on Traefik with advertisedPort
  gotcha, Cloudflare zone settings, and testing instructions
- helm-release-force-rerender: Fix Helm releases where Terraform applies
  but K8s resources don't reflect new values (state rm + reimport pattern)
2026-02-07 20:49:34 +00:00
Viktor Barzin
b964a92a8b
[ci skip] update claude knowledge: HTTP/3 enabled for Traefik and Cloudflare 2026-02-07 20:46:14 +00:00