infra

Author	SHA1	Message	Date
Viktor Barzin	c946e5fdc9	tune controller-manager + apiserver for faster volume detach - kube-controller-manager: --attach-detach-reconcile-sync-period=15s (was 1m default) - kube-apiserver: --default-unreachable-toleration-seconds=60 (was 300s default) - kube-apiserver: --default-not-ready-toleration-seconds=60 (was 300s default) Reduces VolumeAttachment auto-detach from ~6 min to ~2 min on node failure. Applied live + codified in cloud-init template. [ci skip]	2026-04-05 22:14:23 +03:00
Viktor Barzin	c239300154	fix: disable rspamd-redis and correct proxmox-lvm PVC size ENABLE_RSPAMD_REDIS=0 prevents the docker-mailserver from attempting to start an embedded Redis server. The rspamd-redis subprocess was failing repeatedly due to a corrupted/empty RDB file after the recent NFS-to-proxmox-lvm storage migration. Since the DKIM signing config uses use_redis=false, Redis is not needed. Also correct the PVC storage request to match the actual provisioned size (2Gi). The mismatch was causing unnecessary PVC replacement during terraform apply.	2026-04-05 21:44:52 +03:00
Viktor Barzin	2eeb73fc57	add priority-based kubelet graceful shutdown ordering Replace 2-bucket shutdownGracePeriod (240s non-critical / 60s critical) with shutdownGracePeriodByPodPriority for 9-tier ordered shutdown: unclassified(20s) → tier-4-aux(20s) → tier-3-edge(30s) → tier-2-gpu(30s) → tier-1-cluster/DBs(90s) → tier-0-core(30s) → gpu(30s) → sys-cluster(30s) → sys-node(30s). Total: 310s. Apps stop before databases, databases stop before infrastructure. VM shutdown timeout: 300s → 420s. InhibitDelay: 300 → 480. [ci skip]	2026-04-05 20:54:36 +03:00
Viktor Barzin	56583c3825	fix: add retry middleware and per-service rate limit for ha-sofia The global rate limit (10 req/s, 50 burst) was too aggressive for HA dashboards that load 30+ JS files on page load, causing 429s. VPN tunnel blips between London K8s and Sofia caused 502s with no retry fallback. - Add traefik-retry middleware to reverse-proxy factory (all services) - Add skip_global_rate_limit variable to both reverse-proxy factories - Create ha-sofia-rate-limit middleware (100 avg, 200 burst) - Apply to ha-sofia and music-assistant (both route to Sofia)	2026-04-05 20:47:58 +03:00
Viktor Barzin	3cd560d4d9	fix: bank sync alerts - remove {{ $labels.job }} that Helm provider silently drops [ci skip] The Terraform Helm provider's YAML diff comparison silently ignores rules containing {{ $labels.job }} in annotations, preventing the alerts from being applied. Also syncs alerts to platform stack tpl.	2026-04-05 20:07:51 +03:00
Viktor Barzin	0e3c0fb503	security: harden traefik auth flow — fix header spoofing, TLS leak, DERP rate-limit - Auth-proxy fallback now sets ALL X-authentik-* headers (username, uid, email, name, groups) to prevent client-supplied header spoofing when Authentik is down. Previously only username was set, allowing a malicious client to inject fake X-authentik-groups. - Catch-all IngressRoute restricted to *.viktorbarzin.me only. Non-matching domains no longer get the wildcard cert served (TLS info leak). - Added rate-limit and CrowdSec middleware to catch-all IngressRoute. - Added rate-limit middleware to Headscale DERP IngressRoute. - Rotated auth-proxy basicAuth credentials (bcrypt cost 5 → 12, admin → emergency-admin). - Created Authentik brute-force reputation policy (threshold -5, IP+username).	2026-04-05 20:01:06 +03:00
Viktor Barzin	3217a5f605	add bank sync monitoring with Pushgateway metrics and Prometheus alerts [ci skip] CronJob now captures HTTP status, pushes bank_sync_success/duration/last_success to Pushgateway. Alerts: BankSyncFailing (6h), BankSyncStale (48h).	2026-04-05 19:32:40 +03:00
Viktor Barzin	aa7a7e74b2	fix: technitium secondary to proxmox-lvm + bootstrap TF state - Migrate technitium-secondary-config from NFS to proxmox-lvm PVC - Change secondary strategy from RollingUpdate to Recreate (RWO) - Bootstrap encrypted state for insta2spotify and ebooks stacks - Import servarr sub-module PVCs and reconcile state	2026-04-05 19:32:40 +03:00
Viktor Barzin	cb8a808700	feat(storage): migrate 38 NFS PVCs to proxmox-lvm (Wave 2) Add proxmox-lvm PVCs with pvc-autoresizer annotations for all remaining single-pod app data services. Deployments updated to use new block storage PVCs. Old NFS modules retained for rollback. Services: affine, changedetection, diun, excalidraw, f1-stream, hackmd, isponsorblocktv, matrix, n8n, send, grampsweb, health, onlyoffice, owntracks, paperless-ngx, privatebin, resume, speedtest, stirling-pdf, tandoor, rybbit (clickhouse), tor-proxy (torrserver), whisper+piper, frigate (config), ollama (ui), servarr (prowlarr/listenarr/qbittorrent), aiostreams, freshrss (extensions), meshcentral (data+files), openclaw (data+home+openlobster), technitium, mailserver (data+roundcube html+enigma), dbaas (pgadmin). Strategy set to Recreate where needed for RWO volumes.	2026-04-04 19:25:12 +03:00
Viktor Barzin	ee39dd2fc9	feat(storage): migrate 12 SQLite NFS PVCs to proxmox-lvm (Wave 1) Add proxmox-lvm PVCs with pvc-autoresizer annotations for all SQLite-backed services. Deployments updated to use new block storage PVCs. Old NFS modules retained for 1-week rollback. Services: ntfy, freshrss, insta2spotify, actualbudget (x3), wealthfolio, navidrome (DB only), audiobookshelf config, headscale, forgejo, uptime-kuma. Also: set Recreate strategy on ntfy, forgejo, insta2spotify, wealthfolio (required for RWO volumes).	2026-04-04 16:26:59 +03:00
Viktor Barzin	3d3759ea2f	fix: disable cert-manager webhook for pvc-autoresizer, use self-signed cert [ci skip] Cluster doesn't have cert-manager installed. Use self-signed certificate for the controller and disable the PVC mutating webhook (annotations are set directly on PVCs via Terraform).	2026-04-03 23:44:49 +03:00
Viktor Barzin	ce7b8c2b2e	add pvc-autoresizer for automatic PVC expansion before volumes fill up [ci skip] Deploy topolvm/pvc-autoresizer controller that monitors kubelet_volume_stats via Prometheus and auto-expands annotated PVCs. Annotated all 9 block-storage PVCs (proxmox-lvm) with per-PVC thresholds and max limits. Updated PVFillingUp alert to critical/10m (means auto-expansion failed) and added PVAutoExpanding info alert at 80%.	2026-04-03 23:30:00 +03:00
Viktor Barzin	d49acebd8e	migrate ebooks-calibre to proxmox-lvm, update storage docs [ci skip] - Migrate ebooks-calibre-config-iscsi (2Gi, 2380 files) to proxmox-lvm - Update docs/architecture/storage.md: document Proxmox CSI as primary block storage, mark democratic-csi iSCSI as deprecated - Add full migration plan to docs/plans/	2026-04-03 19:45:34 +03:00
Viktor Barzin	dd59512153	migrate iSCSI block volumes from democratic-csi to Proxmox CSI [ci skip] Replace TrueNAS iSCSI (democratic-csi) with Proxmox CSI plugin for all block storage PVCs. Eliminates double-CoW (ZFS + LVM-thin) and removes the iSCSI network hop for database I/O. New stack: stacks/proxmox-csi/ — deploys proxmox-csi-plugin Helm chart with StorageClass "proxmox-lvm" using existing local-lvm thin pool. Migrated PVCs (12 total): - Phase 1 standalone: plotting-book, novelapp, vaultwarden, nextcloud, prometheus - Phase 2 StatefulSets: CNPG PostgreSQL (2), MySQL InnoDB (3), Redis (2) All services verified healthy post-migration.	2026-04-02 22:13:04 +03:00
Viktor Barzin	337da2184d	add upstream fallback to containerd registry mirrors When the pull-through proxy (10.0.20.10) is down, containerd now falls back to the official upstream registries (registry-1.docker.io, ghcr.io) instead of failing. Also cleans up stale disabled registry mirror dirs and removes unnecessary containerd restart from the rollout script.	2026-04-02 11:05:30 +03:00
Viktor Barzin	dd461beb33	add registry blob integrity checker to self-heal corrupted cache The cleanup-tags.sh + garbage-collect cycle can delete blob data while leaving _layers/ link files intact. The registry then returns HTTP 200 with 0 bytes for those layers, causing "unexpected EOF" on image pulls. fix-broken-blobs.sh walks all repositories, checks each layer link against actual blob data, and removes orphaned links so the registry re-fetches from upstream on next pull. Schedule: daily at 2:30am (after tag cleanup) and Sunday 3:30am (after garbage collection). First run found 2335/2556 (91%) of layer links were orphaned.	2026-03-29 22:31:39 +03:00
Viktor Barzin	a2b1b0e817	remove caretta network mapper to free 3Gi cluster memory Caretta eBPF DaemonSet was using 600Mi x 5 nodes = 3Gi total for non-critical network topology visualization. Removing it to free memory for novelapp and aiostreams which were stuck in Pending.	2026-03-29 22:17:35 +03:00
Viktor Barzin	7ad01661f0	novelapp: migrate NEXTAUTH env vars to Auth.js v5 (AUTH_*) Replace NEXTAUTH_URL/NEXTAUTH_SECRET with AUTH_URL/AUTH_SECRET and add AUTH_TRUST_HOST=true for Auth.js v5 compatibility.	2026-03-29 20:37:26 +03:00
Viktor Barzin	8bf83147db	add SLACK_WEBHOOK_URL env var to book-search deployment	2026-03-29 13:53:24 +03:00
Viktor Barzin	78eff9ab11	fix: bump book-search memory to 512Mi for file upload/email [ci skip] Downloads and sends ebook files via HTTP — needs more than 128Mi for large PDFs. Applied live via kubectl, persisting in Terraform.	2026-03-29 13:24:19 +03:00
Viktor Barzin	914e0b08e2	add SMTP and CWA auth env vars to book-search for send-to-kindle [ci skip]	2026-03-29 12:42:45 +03:00
Viktor Barzin	cbea959966	feat(ebooks): mount calibre-library PVC in book-search for permission fixing CWA NETWORK_SHARE_MODE=true skips post-import chown, leaving files as root. book-search now mounts the library to periodically fix permissions on recently imported books.	2026-03-29 11:31:41 +03:00
Viktor Barzin	fed9df8c0e	feat(ebooks): mount stacks-config PVC in book-search for force re-download Adds stacks-config volume mount to book-search pod so it can delete Stacks history entries and force re-downloads when a book was consumed by CWA but failed to import.	2026-03-29 11:26:30 +03:00
Viktor Barzin	6d44b4292f	add /api/download-status to book-search unprotected API ingress [ci skip] Needed for async polling from iOS Shortcuts — status endpoint doesn't need Authentik auth (job IDs are unguessable UUIDs).	2026-03-29 10:11:22 +03:00
Viktor Barzin	46444e0306	openclaw: remove install-dotfiles init container to reduce NFS writes The init container was cloning the dotfiles repo via git on every pod start, causing 200+ small NFS writes that amplified through ZFS. Dotfiles already exist on NFS from a previous clone — no need to re-clone on every restart. To update dotfiles, run git pull manually. Also cleaned up stale Uptime Kuma files (1.6GB old SQLite DB + 289MB error log left over from migration to MariaDB).	2026-03-29 01:11:33 +02:00
Viktor Barzin	878b556179	state(monitoring): update encrypted state	2026-03-29 01:04:11 +02:00
Viktor Barzin	d41211ddd5	add API key + unprotected API ingress for book-search iOS Shortcut - API_KEY env var from calibre-secrets for /api/download-url auth - SHORTCUT_ICLOUD_URL env var for /shortcut redirect - Separate ingress for /api/download-url and /shortcut (bypasses Authentik)	2026-03-29 00:43:34 +02:00
Viktor Barzin	06490b0634	reduce Prometheus cardinality round 3: drop 44k more series - cadvisor: drop unused network error/dropped counters, unused cpu metrics (load_avg, system, user), unused memory metrics (cache, failcnt, kernel, mapped_file, max_usage, rss, swap, active/inactive) - kubelet: drop all unused histogram buckets (storage_operation, csi, volume_operation, image_pull, http_requests, rest_client, pod_worker, volume_metric, cgroup_manager) + kubernetes_feature_enabled - apiserver: drop flowcontrol/rest_client histograms, longrunning_requests - traefik: drop all router-level metrics (keep service + entrypoint) - service-endpoints: drop coredns histograms, node_filesystem_* Post-relabel: 332k → 99k (-70%), ingestion: 5,480 → 1,659 samples/sec (-70%)	2026-03-29 00:27:23 +02:00
Viktor Barzin	614d3c72bd	add liveness probe to annas-archive-stacks deployment Prevents corrupted SQLite DB from looping errors forever — K8s will auto-restart the pod if /api/version stops responding.	2026-03-29 00:17:29 +02:00
Viktor Barzin	a9ca65bc31	reduce Prometheus cardinality round 2: drop 137k more series - fix traefik double-scrape: kubernetes-pods job was scraping traefik pods again (43k duplicate series). Added namespace drop rule. - drop unused cadvisor metrics: container_fs_, container_blkio_, container_pressure_, container_spec_, and misc (30k series) - drop more apiserver histogram buckets: watch_list, watch_cache, response_sizes, watch_events, admission_controller, workqueue (11k) - drop unused kube-state-metrics: replicaset_, pod_tolerations, pod_labels, endpoint_, service_, configmap_, etc (53k series) Post-relabel samples: 332k → 142k (-57%) Ingestion rate: 5,480 → 3,239 samples/sec (-41%)	2026-03-28 23:51:24 +02:00
Viktor Barzin	aceea7db94	increase global rate limit from 10/50 to 50/200 HA frontend loads 30-50 JS bundles on page load, exhausting the burst. iOS Companion app reconnections also trigger bursts. 172 rate-limited (429) requests found in Traefik logs causing intermittent connectivity failures for ha-sofia iOS app.	2026-03-28 23:40:10 +02:00
Viktor Barzin	4b3851829b	feat: organize Grafana dashboards into folders Enable sidecar folderAnnotation + foldersFromFilesStructure to group 26 dashboards into 5 managed folders: - Cluster (6): k8s health, API server, nodes, pods, kube-state-metrics - Networking (6): CoreDNS, Technitium, Headscale, ingress, network traffic - Hardware (5): node-exporter, proxmox, iDRAC, UPS, NVIDIA GPU - Operations (4): backup health, registry, audit logs, Loki - Applications (2): realestate-crawler, qBittorrent Dashboard-to-folder mapping defined in grafana.tf locals block. External stacks (headscale, technitium) annotated individually.	2026-03-28 16:23:49 +02:00
Viktor Barzin	725fefe565	fix: add Headscale monitoring, alerts, and pin UI image - Add 4 Prometheus alerts: HeadscaleDown (critical), NoOnlineNodes, HighHTTPLatency, HighErrorRate - Add Grafana dashboard with node count, map responses, HTTP latency, nodestore operations, and memory panels - Pin headscale-ui to digest sha256:015f5ba0... (was :latest) - Set disable_check_updates: true to skip GitHub check on startup - Uptime Kuma monitor already existed (id=19, 300s interval)	2026-03-28 16:07:04 +02:00
Viktor Barzin	f4ff654a69	perf: optimize Headscale for connectivity and latency - Remove viktorbarzin.me from split DNS (same IPs as public DNS, was adding unnecessary tunnel overhead for every DNS query) - Narrow reverse DNS split scope from 10.0.0.0/8 → 10.0.20.0/24 and 10.0.10.0/24 only; 192.168.0.0/16 → 192.168.1.0/24 only - Add extra_records for key internal services (technitium, k8s-master) for instant MagicDNS resolution without tunnel roundtrip - Replace full Tailscale DERP map (29 regions) with curated set: home + 8 European + 5 global fallback DERPs (14 total) - Add custom derp.yaml to ConfigMap, sourced from Vault Port 80 DERP dropped — Traefik's global HTTP→HTTPS redirect prevents non-TLS DERP upgrades on the web entrypoint.	2026-03-28 15:44:13 +02:00
Viktor Barzin	8a5a53a832	fix alerts and reduce Prometheus disk write rate - linkwarden: add Reloader match annotation to DB secret so pods auto-restart on Vault credential rotation (was causing 100% 5xx) - authentik: increase memory limits (server 1Gi→1.5Gi, worker 896Mi→1Gi) to prevent OOM kills - prometheus: drop 113k high-cardinality series to reduce HDD write rate from ~8.8 to ~6.0 MB/s (31% reduction): - drop all traefik/apiserver/etcd histogram bucket metrics - drop goflow2_flow_process_nf_templates_total (9.3k series) - drop container_tasks_state and container_memory_failures_total - rewrite HighServiceLatency alert to use avg latency (_sum/_count) - update cluster_health dashboard to match - raise KubeletRuntimeOperationsLatency threshold from 30s to 60s	2026-03-28 15:42:14 +02:00
Viktor Barzin	7e0b0d9362	fix: headscale VPN setup hardening - Add SQLite backup CronJob (every 6h to NFS for cloud sync pickup) - Move headscale-ui secrets (COOKIE_SECRET, ROOT_API_KEY) from hardcoded values to Vault-managed secrets - Add DERP IPv6 address (2001:470:6e:43d::2) for IPv6-capable clients - Clean up stale test nodes, duplicate users, rename "localhost" nodes Also updated headscale_config in Vault to include DERP ipv6 field and headscale_ui_cookie_secret/headscale_ui_api_key secrets.	2026-03-28 14:38:12 +02:00
Viktor Barzin	a42003fb8f	fix: add dedicated DERP IngressRoute bypassing middlewares CrowdSec, rate limiting, anti-AI, and error pages middlewares were interfering with the Upgrade: DERP protocol handshake. Also updated Headscale ACL in Vault to allow tailnet DNS traffic to Technitium (10.0.20.200:53).	2026-03-28 14:26:51 +02:00
Viktor Barzin	04a96955c0	fix: exclude NFS PVs from PVFillingUp alert NFS PVs report the entire NFS server filesystem usage (e.g., navidrome-music shows 5.3 TiB Synology volume at 97%), not PVC-specific usage. Filter out PVs with >1TiB capacity (always NFS mounts; iSCSI PVCs are 10-50Gi).	2026-03-28 01:14:05 +02:00
Viktor Barzin	ae21502698	fix: exclude disabled London Pi cloud sync task from CloudSyncFailing alert Task 2 (Backup London pi) fails because 192.168.8.102 is unreachable. Disabled task via TrueNAS, excluded task_id=2 from alert rule.	2026-03-27 15:15:48 +02:00
Viktor Barzin	252b65a574	fix: increase memory limits for OOMKilled pods (immich, clickhouse, speedtest) - immich-server: limits 1700Mi → 2500Mi (70 restarts from media processing spikes) - clickhouse: limits 1Gi → 1536Mi, max_server_memory_usage 800Mi → 1200Mi - speedtest: limits 256Mi → 512Mi, requests 256Mi → 128Mi (daily OOM during test)	2026-03-27 13:57:16 +02:00
Viktor Barzin	1ec480e5fa	novelapp: grant vabbit81 (Gheorghe) admin RBAC on novelapp namespace	2026-03-26 17:34:48 +02:00
Viktor Barzin	5e6e71e727	novelapp: add NextAuth + Google OAuth env vars Replace AUTH_SECRET with NEXTAUTH_URL, NEXTAUTH_SECRET, GOOGLE_CLIENT_ID, and GOOGLE_CLIENT_SECRET for Google OAuth integration.	2026-03-26 17:14:16 +02:00
Viktor Barzin	70ea01fb6e	vault: increase k8s auth token TTLs and add periodic renewal Stagger token periods across roles (7d/8d/9d/10d) to prevent bulk lease revocation storms that caused transient 504s. Periodic tokens auto-renew indefinitely, eliminating mass expiry.	2026-03-26 12:21:47 +02:00
Viktor Barzin	b8a5740138	reduce alert noise: remove 4 memory alerts, raise latency threshold [ci skip] - Remove ClusterMemoryRequestsHigh, ContainerNearOOM, NodeLowFreeMemory, NodeMemoryPressureTrending — all fire regularly due to intentional memory overcommit and are not actionable - Keep ContainerOOMKilled (actionable — container actually died) - Raise HighServiceLatency p99 threshold from 10s to 30s to ignore transient spikes	2026-03-26 01:15:18 +02:00
Viktor Barzin	4e74f816bc	cleanup: remove calibre and audiobookshelf stacks after ebooks migration [ci skip] Both services migrated to unified ebooks namespace. Remove: - Old stack directories and Terraform state - calibre references from monitoring namespace lists - calibre/audiobookshelf from operational scripts	2026-03-25 23:56:07 +02:00
Viktor Barzin	95e49134ae	cleanup: remove old audiobook-search, superseded by book-search - Delete servarr/audiobook-search TF module (moved to ebooks/book-search) - Remove audiobook-search from cloudflare_proxied_names - Remove commented-out module reference in servarr/main.tf - Clean up "renamed from" comment in ebooks/main.tf - K8s resources (deploy/svc/ingress) deleted from servarr namespace - Cloudflare DNS record already absent - Import book-search and insta2spotify DNS records into cloudflared state	2026-03-25 23:16:01 +02:00
Viktor Barzin	fe27709fd4	fix email monitor: use internal URL for Uptime Kuma push Pods can't reach uptime.viktorbarzin.me externally. Switch to http://uptime-kuma.uptime-kuma.svc.cluster.local for the push endpoint.	2026-03-25 22:59:26 +02:00
Viktor Barzin	78dec8f0ad	add e2e email roundtrip monitoring CronJob (every 30 min) sends test email via Mailgun API to smoke-test@viktorbarzin.me, verifies IMAP delivery in spam@ catch-all, deletes test email, pushes metrics to Pushgateway + Uptime Kuma. Prometheus alerts: EmailRoundtripFailing, EmailRoundtripStale, EmailRoundtripNeverRun. Uptime Kuma: SMTP/IMAP port checks + E2E push.	2026-03-25 22:50:22 +02:00
Viktor Barzin	3adaf88f62	add MAM_ID env var to book-search deployment [ci skip]	2026-03-25 15:52:24 +02:00
Viktor Barzin	946ea9e1f3	fix ebooks stack: prefix PV names, add book-search DNS, add secrets symlink [ci skip]	2026-03-25 15:14:08 +02:00
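
The shutdown budget quoted in commit 2eeb73fc57 can be checked with a few lines of arithmetic. The sketch below is illustrative only and is not taken from the repository: the tier names and durations are copied from the commit message, while the constant and function names (SHUTDOWN_TIERS, total_grace, headroom) are hypothetical. It assumes the per-tier grace periods run one after another, as the "Total: 310s" figure implies, and that both the VM shutdown timeout and InhibitDelay are expressed in seconds.

```python
# Per-tier grace periods in seconds, in shutdown order, as listed in the
# commit message for 2eeb73fc57. Names below are illustrative, not repo code.
SHUTDOWN_TIERS = [
    ("unclassified", 20),
    ("tier-4-aux", 20),
    ("tier-3-edge", 30),
    ("tier-2-gpu", 30),
    ("tier-1-cluster/DBs", 90),
    ("tier-0-core", 30),
    ("gpu", 30),
    ("sys-cluster", 30),
    ("sys-node", 30),
]

# Limits quoted in the same commit message (assumed to be seconds).
VM_SHUTDOWN_TIMEOUT_S = 420
INHIBIT_DELAY_S = 480


def total_grace(tiers):
    """Sum the grace periods, assuming tiers shut down sequentially."""
    return sum(seconds for _, seconds in tiers)


def headroom(tiers, limit_s):
    """Seconds left over once every tier has used its full grace period."""
    return limit_s - total_grace(tiers)


if __name__ == "__main__":
    print("total grace:", total_grace(SHUTDOWN_TIERS))
    print("VM timeout headroom:", headroom(SHUTDOWN_TIERS, VM_SHUTDOWN_TIMEOUT_S))
    print("InhibitDelay headroom:", headroom(SHUTDOWN_TIERS, INHIBIT_DELAY_S))
```

Under those assumptions the nine tiers add up to the 310s stated in the commit, which would not have fit inside the previous 300s VM shutdown timeout and InhibitDelay of 300.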

401 commits