Commit graph

991 commits

Author SHA1 Message Date
Viktor Barzin
71ff803978 [ci skip] Add centralized log collection: Loki + Alloy + sysctl DaemonSet 2026-02-13 23:03:40 +00:00
Viktor Barzin
a44dfac721 [ci skip] Deploy MoltBot (OpenClaw) AI agent gateway
Add new Kubernetes service for OpenClaw gateway connected to in-cluster
Ollama, with kubectl/terraform/git access for infrastructure management.
Protected behind Authentik SSO.
2026-02-13 22:57:36 +00:00
Viktor Barzin
0137913954 [ci skip] add vibetunnel proxy 2026-02-13 18:20:50 +00:00
Viktor Barzin
cd5261161b [ci skip] Add HomeAssistantDown alert for ha-sofia
Fires after 5m if the haos Prometheus scrape target is unreachable.
Covers the HTTP API endpoint which shares the same process as the
WebSocket API used by the mobile app.
2026-02-11 23:24:46 +00:00
Viktor Barzin
46ffc37dcf [ci skip] Fix all active Prometheus alerts
- meshcentral: rename port from "https" to "http" — MeshCentral serves
  plain HTTP when REVERSE_PROXY=true, but Traefik inferred HTTPS from the
  port name, causing 100% 5xx errors
- osm-routing/otp: scale to 0 — TfL GTFS data expired, OTP crash-loops
  trying to build graph with no valid transit trips
- wireguard: add prometheus.io/port=9586 annotation — without it,
  Prometheus tried scraping all container ports (51820 UDP, 80)
- travel-blog: remove stale prometheus.io annotations and dead port 9113
  — nginx-exporter sidecar was commented out but annotations remained
- dawarich: remove prometheus.io annotations — exporter env vars are
  commented out so nothing listens on port 9394
- monitoring: raise CPU temp threshold 60°C→75°C (E5-2699 v4 Tcase is
  79°C), lower registry cache threshold 50%→25%, add minimum traffic
  floor (>0.1 req/s) to 4xx/5xx rate alerts to prevent false positives
  on low-traffic services
2026-02-11 22:40:56 +00:00
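A minimal sketch of the traffic-floor idea in the commit above, written in Python rather than as the actual Prometheus rule. The function name and the `ratio_threshold` parameter are hypothetical; only the 0.1 req/s floor comes from the commit message.

```python
def should_fire(error_rate: float, total_rate: float,
                ratio_threshold: float, min_total_rate: float = 0.1) -> bool:
    """Decide whether an error-rate alert should fire.

    error_rate and total_rate are requests per second. The alert is
    suppressed whenever total traffic is at or below the floor, so a
    single failed request on an idle service cannot trip it.
    """
    if total_rate <= min_total_rate:
        # Below the traffic floor: the ratio is too noisy to act on.
        return False
    return (error_rate / total_rate) > ratio_threshold


# Idle service with 100% errors: suppressed by the floor.
print(should_fire(0.05, 0.05, ratio_threshold=0.3))
# Busy service with 50% errors: fires.
print(should_fire(1.0, 2.0, ratio_threshold=0.3))
```

Checking the floor first also avoids dividing by zero when a service has received no traffic at all.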
Viktor Barzin
b4f68d99d8 [ci skip] Fix CrowdSec to monitor Traefik and add Slack notifications
- Switch acquisition from ingress-nginx to traefik namespace/pods
- Change collection from crowdsecurity/nginx to crowdsecurity/traefik
- Add Slack notification plugin for ban/captcha decisions
- Wire alertmanager_slack_api_url through to CrowdSec module
2026-02-11 22:25:03 +00:00
Viktor Barzin
c8a41ac567 [ci skip] Add 12 Prometheus alert rules for monitoring gaps
Add 3 new alert groups and 1 rule to existing group:
- Storage: NodeFilesystemFull (<10% free), PVFillingUp (>85% used)
- K8s Health: PodCrashLooping, ContainerOOMKilled, NodeNotReady,
  NodeConditionBad, JobFailed
- Infrastructure Health: CoreDNSErrors, ScrapeTargetDown,
  PrometheusStorageFull, PrometheusNotificationsFailing
- R730 Host: FanFailure (iDRAC Redfish fan health)
2026-02-11 22:14:30 +00:00
Viktor Barzin
dbf397841a Standardize Prometheus alert formatting and fix Slack notifications
- Add color coding (red/green) to Slack alerts, show alertname in title
- Use summary annotation in Slack text (description was always empty)
- Format all alert summaries consistently: value with units and threshold
- Fix ratio expressions (CPU/memory) to display as percentages
- Fix "failiure" typo, capitalize Tailscale
2026-02-11 21:53:22 +00:00
Viktor Barzin
f03b8a055b [ci skip] Fix rewrite-body plugin corrupting compressed responses
The packruler/rewrite-body plugin (used for rybbit analytics injection)
fails to decompress gzip responses with "flate: corrupt input before
offset 5", corrupting the response body. This broke HA Companion app's
external_auth flow and WebSocket connections on ha-sofia.

Fix: add a strip-accept-encoding middleware that removes Accept-Encoding
from requests when rybbit is active, forcing backends to send uncompressed
responses that the plugin can safely process.

Also add extra_middlewares variable to reverse_proxy factory for
extensibility.
2026-02-11 21:40:11 +00:00
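The effect of the strip-accept-encoding middleware described above can be modelled in a few lines of Python. This is only an illustration of the header manipulation, not the Traefik middleware definition used in the repository; the function name is hypothetical.

```python
def strip_accept_encoding(headers: dict[str, str]) -> dict[str, str]:
    """Return a copy of the request headers without Accept-Encoding.

    HTTP header names are case-insensitive, so the comparison is done
    on the lower-cased name. With the header gone, a backend has no
    reason to compress its response.
    """
    return {
        name: value
        for name, value in headers.items()
        if name.lower() != "accept-encoding"
    }


request_headers = {"Host": "example.invalid", "Accept-Encoding": "gzip, br"}
print(strip_accept_encoding(request_headers))
```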
Viktor Barzin
036ec06256 immich to 2.5.6 [ci skip] 2026-02-10 22:01:08 +00:00
Viktor Barzin
73aab7f4ce [ci skip] Assorted pending changes: ollama API auth, nvidia dashboard, traefik rewrite-body plugin
- ollama: Add basicAuth middleware for external API access
- monitoring: Update nvidia dashboard (add GPU memory per app panel, bump to v9)
- plotting-book: Switch to ancamilea/book-plotter:latest, add lifecycle ignore
- reverse_proxy/factory: Fix rybbit plugin name (rewritebody -> rewrite-body)
- traefik: Switch to packruler/rewrite-body plugin v1.2.0
2026-02-10 21:29:54 +00:00
Viktor Barzin
5e1e18a044 [ci skip] Use RollingUpdate strategy for real-estate-crawler deployments
Set max_unavailable=0, max_surge=1 on both UI and API deployments
to ensure at least 1 replica is always available during updates.
2026-02-10 21:28:38 +00:00
Viktor Barzin
6d6ec0c1e2 [ci skip] Refactor raw ingresses to use ingress_factory module
Enhance ingress_factory with full_host, extra_middlewares, and
skip_default_rate_limit variables. Fix TLS hosts bug to use
effective_host. Migrate 13 services from raw kubernetes_ingress_v1
resources to centralized ingress_factory module calls, removing
manual rybbit middleware CRDs where the factory now handles them.
2026-02-10 21:11:46 +00:00
Viktor Barzin
70376b623e [ci skip] Fix health service port: container listens on 3000, not 80 2026-02-09 21:27:50 +00:00
Viktor Barzin
f04a072beb [ci skip] Add internal OSM routing services (OSRM foot, bicycle, OTP)
New osm-routing namespace with walking, cycling, and transit routing
services for the real-estate-crawler. Internal-only (no public ingress).
2026-02-09 21:03:57 +00:00
Viktor Barzin
5a81ce5774 [ci skip] allow 100 time slicing of nvidia gpu 2026-02-09 21:00:15 +00:00
Viktor Barzin
7b747350de [ci skip] Add descheduler profile to restart idrac-redfish-exporter every 6h 2026-02-09 20:57:09 +00:00
Viktor Barzin
c408887560 [ci skip] Add WebAuthn env vars to real-estate-crawler API deployment 2026-02-08 20:06:24 +00:00
Viktor Barzin
13659e0fc6 [ci skip] Fix grampsweb ingress: set service_name to match backend service
The ingress_factory defaults service_name to name, so it was routing
to a non-existent "family" service instead of "grampsweb".
2026-02-08 02:30:19 +00:00
Viktor Barzin
ce8f81db0c [ci skip] Deploy Gramps Web genealogy service
Add grampsweb module with web app + Celery worker in a single pod,
using shared Redis (DB 2/3), NFS storage, email via mailserver,
and Ollama AI integration. Available at family.viktorbarzin.me.
2026-02-08 02:30:18 +00:00
Viktor Barzin
5ad7b7e76d [ci skip] Deploy health dashboard service
Apple Health data visualization app (Svelte + FastAPI + Caddy).
Uses shared PostgreSQL via DBaaS, NFS storage for uploads,
accessible at health.viktorbarzin.me.
2026-02-08 01:54:24 +00:00
Viktor Barzin
b78e60dbf6 [ci skip] Add Ollama TCP entrypoint for HA voice pipeline
Expose Ollama at 10.0.20.202:11434 via Traefik TCP passthrough,
bypassing TLS/auth issues with the HTTPS ingress.
2026-02-08 01:51:43 +00:00
Viktor Barzin
a8caa45589 [ci skip] Add Wyoming Piper TTS alongside Whisper STT
Deploy Piper (rhasspy/wyoming-piper) in the whisper namespace with
en_US-lessac-medium voice. Exposed via Traefik TCP on port 10200.
2026-02-08 01:51:43 +00:00
Viktor Barzin
b22a14c914 [ci skip] Deploy Wyoming Whisper STT service for Home Assistant voice input
Add Wyoming Faster Whisper (rhasspy/wyoming-whisper) as a new K8s service
exposed via Traefik TCP entrypoint on port 10300. Accessible from ha-london
RPi via VPN at 10.0.20.202:10300.
2026-02-08 01:51:43 +00:00
Viktor Barzin
375e3e115a [ci skip] Fix registry tag cleanup for pull-through cache
- Rewrite cleanup script to use filesystem deletion (shutil.rmtree)
  since proxy registries don't support DELETE via API (405)
- Fix cron entry to invoke with python3
2026-02-07 22:45:17 +00:00
Viktor Barzin
c57873c4d4 Bump Immich version from v2.5.2 to v2.5.5 2026-02-07 22:38:33 +00:00
Viktor Barzin
11d328fb99 Add Docker registry UI and tag cleanup automation
Deploy joxit/docker-registry-ui on port 8080 for browsing images/tags.
Add Python script to prune old registry tags (keeps last N per image),
scheduled daily at 2am via cron. Expose UI via reverse proxy at
registry.viktorbarzin.me with Authentik auth.
2026-02-07 22:38:15 +00:00
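The "keeps last N per image" rule from the commit above could look roughly like the following. This is a sketch, not the script from the repository: the function name, the input shape, and ordering tags by a creation timestamp are all assumptions.

```python
def tags_to_prune(tags_by_image: dict[str, list[tuple[str, float]]],
                  keep: int) -> dict[str, list[str]]:
    """Pick the tags to delete so each image retains its `keep` newest.

    tags_by_image maps an image name to (tag, created_timestamp) pairs.
    Returns a mapping of image name to the tag names that fall outside
    the newest `keep`; images with nothing to delete are omitted.
    """
    if keep < 0:
        raise ValueError("keep must be non-negative")
    prune: dict[str, list[str]] = {}
    for image, tags in tags_by_image.items():
        # Newest first, so everything past index `keep` is stale.
        newest_first = sorted(tags, key=lambda tag: tag[1], reverse=True)
        stale = [name for name, _ in newest_first[keep:]]
        if stale:
            prune[image] = stale
    return prune


registry = {
    "app": [("v1", 1.0), ("v2", 2.0), ("v3", 3.0)],
    "db": [("a", 5.0)],
}
print(tags_to_prune(registry, keep=2))
```

Separating the selection from the deletion keeps the destructive step small and makes a dry run easy to add.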
Viktor Barzin
2875bf9d4e [ci skip] Enable HTTP/3 (QUIC) for all ingresses
- Add http3.enabled + advertisedPort=443 to Traefik websecure entrypoint
- Add cloudflare_zone_settings_override to enable HTTP/3 for proxied domains
2026-02-07 20:43:49 +00:00
Viktor Barzin
eef9d25874 [ci skip] Strip Authentik auth headers before forwarding to backend
Add strip-auth-headers Traefik middleware that removes X-authentik-*
headers from requests before they reach the backend. Backends like
iDRAC and TP-Link gateway break when receiving these extra headers.
2026-02-07 20:28:44 +00:00
Viktor Barzin
30bc2e9386 [ci skip] Fix DNS forwarding through Traefik to Technitium
Expose UDP port 53 on the Traefik LoadBalancer service and enable
cross-namespace CRD references so the IngressRouteUDP in the traefik
namespace can route DNS traffic to technitium-dns in the technitium
namespace. This restores DNS resolution via 10.0.20.202 for pfSense
and Home Assistant.
2026-02-07 20:10:47 +00:00
Viktor Barzin
f01e92b1d9 [ci skip] Fix HTTPS backend proxying for reverse-proxy services
- Add insecureSkipVerify=true globally for self-signed backend certs
- Name service ports with https- prefix for HTTPS backends so Traefik uses HTTPS
- Add ServersTransport CRD for per-service insecureSkipVerify
- Add serversscheme/serverstransport annotations to reverse-proxy factory
2026-02-07 13:56:24 +00:00
Viktor Barzin
04d85221c7 [ci skip] Remove unsupported advertisedPort from Traefik Helm values 2026-02-07 13:41:06 +00:00
Viktor Barzin
510673949d [ci skip] Add --api.insecure=true to Traefik for dashboard access on port 8080 2026-02-07 13:35:58 +00:00
Viktor Barzin
b36932f9a3 Migrate all service modules from nginx-ingress to Traefik
- Remove nginx-specific ingress variables (use_proxy_protocol, proxy_timeout, additional_configuration_snippet)
- Update ingress annotations to use Traefik middleware CRDs
- Delete nginx-ingress module (replaced by traefik)
- Add new traefik middleware.tf for shared middleware definitions
- Update service modules to work with new ingress_factory interface
2026-02-07 13:25:49 +00:00
Viktor Barzin
43cdebe791 Migrate ingress_factory from nginx to Traefik annotations
- Replace nginx ingress class and annotations with Traefik middleware CRDs
- Add Traefik router middleware chain: rate-limit, CSP, CrowdSec, Authentik
- Remove nginx-specific proxy settings (handled by Traefik config)
- Add exclude_crowdsec and custom_content_security_policy options
- Add rybbit analytics and custom CSP middleware resources
2026-02-07 13:24:58 +00:00
Viktor Barzin
ebe5eb1e9b Add ssh_private_key/ssh_public_key variables to create-template-vm module 2026-02-07 13:19:15 +00:00
Viktor Barzin
e5d7e4e21e Add Traefik dashboard ingress with Authentik protection
- Enable api.insecure in Helm values for internal dashboard access on port 8080
- Add TLS secret, dashboard service, and ingress via ingress_factory (protected=true)
- Pass tls_secret_name to traefik module
- Add traefik to cloudflare_non_proxied_names DNS list
2026-02-07 13:06:57 +00:00
Viktor Barzin
c4e4aa25d0 Fix AFFiNE init container migration command for v0.26.0
The stable image removed scripts/self-host-predeploy.js. Use the new
predeploy flow: prisma migrate + dist/main.js run.

[ci skip]
2026-02-07 10:33:43 +00:00
Viktor Barzin
24469f4590 Add excalidraw project gitignore and README 2026-02-06 20:38:32 +00:00
Viktor Barzin
abfddfbab1 [ci skip] add plotting book repo 2026-02-06 20:32:08 +00:00
Viktor Barzin
67f5e875f0 Add Celery worker/beat deployments and fix crawler API config
Add celery worker and celery beat deployments for background task
processing and scheduled scraping. Fix API container name, add
image_pull_policy Always, and add missing path_type to ingress rules.
2026-02-06 20:31:34 +00:00
Viktor Barzin
442c662597 Upgrade immich to v2.5.2 and add GPU toleration to ML pod
Bump immich version from v2.5.0 to v2.5.2. Add nvidia.com/gpu
toleration to immich-machine-learning deployment.
2026-02-06 20:28:29 +00:00
Viktor Barzin
fd4dc96372 Forward authentik response headers through ingress
Add auth-response-headers annotation to pass user identity headers
(username, uid, email, name, groups) from authentik to backend services.
2026-02-06 20:26:21 +00:00
Viktor Barzin
594e794eab Add audiblez-web application source
Web frontend for audiblez audiobook conversion with FastAPI backend.
2026-02-06 20:24:10 +00:00
Viktor Barzin
5f0c32d005 Add audiblez-web service and refactor ebook2audiobook deployments
Uncomment ebook2audiobook deployment with proper GPU tolerations
(set to 0 replicas). Disable audiblez CLI deployment in favor of
audiblez-web. Add new audiblez-web deployment, service, and ingress
with GPU support, large upload limits, and auth protection.
2026-02-06 20:22:05 +00:00
Viktor Barzin
1275697f2b Add GPU node taint tolerations and enhance GPU memory exporter
Add nvidia.com/gpu toleration to all GPU workloads (frigate, ollama)
to support NoSchedule taint on GPU nodes. Update nvidia operator
helm values with daemonset tolerations. Enhance GPU pod memory
exporter with Kubernetes API integration to resolve container IDs
to pod names/namespaces, adding RBAC resources for API access.
2026-02-06 20:19:26 +00:00
Viktor Barzin
9ef4d38d51 Add DRONE_WEBHOOK_SECRET for GitHub webhook authentication
Fixes webhook signature validation failures causing 400 errors.
2026-02-01 20:42:07 +00:00
Viktor Barzin
da4cf18d6d Add per-pod GPU memory metrics exporter
- Add DaemonSet that runs on GPU node and exposes Prometheus metrics
- Uses nvidia-smi to collect per-process GPU memory usage
- Maps PIDs to container IDs via /proc/<pid>/cgroup
- Exposes gpu_pod_memory_used_bytes metric at :9401/metrics
- Add Prometheus scrape config for gpu-pod-memory job

[ci skip]
2026-01-31 16:58:14 +00:00
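The PID-to-container mapping mentioned in the commit above can be sketched as follows. It assumes the container runtime embeds a 64-character hexadecimal container ID in the cgroup path, which is common but not guaranteed; the function name is hypothetical and this is not the exporter's actual code.

```python
import re
from typing import Optional

# Assumption: container IDs appear as 64 lowercase hex characters.
_CONTAINER_ID = re.compile(r"[0-9a-f]{64}")


def container_id_from_cgroup(cgroup_text: str) -> Optional[str]:
    """Extract a container ID from the contents of /proc/<pid>/cgroup.

    Returns the first 64-hex-character run found on any line, or None
    when the process does not appear to belong to a container.
    """
    for line in cgroup_text.splitlines():
        match = _CONTAINER_ID.search(line)
        if match:
            return match.group(0)
    return None


sample_id = "ab" * 32  # made-up ID for illustration
sample = f"0::/kubepods.slice/cri-containerd-{sample_id}.scope\n"
print(container_id_from_cgroup(sample))
```

In practice the text would come from reading `/proc/<pid>/cgroup` for each PID that nvidia-smi reports.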
Viktor Barzin
751b83a53c Add crowdsec-blocklist-import CronJob
Import public threat intelligence blocklists into CrowdSec daily at 4 AM.
Uses kubectl exec to run the import script inside an existing CrowdSec
agent pod that is already registered with the LAPI.

Source: https://github.com/wolffcatskyy/crowdsec-blocklist-import

[ci skip]
2026-01-28 20:11:44 +00:00
Viktor Barzin
3d7190e935 fix resume pdf generation [ci skip] 2026-01-28 19:42:13 +00:00