infra

Author	SHA1	Message	Date
root	a038b2a2c4	Woodpecker CI deploy commit [CI SKIP]	2026-04-06 09:28:10 +00:00
Viktor Barzin	5b43e57efa	actualbudget: use internal ClusterIP for http-api server URL The http-api sidecar was connecting to the public URL (https://budget-.viktorbarzin.me) which goes through Traefik/Authentik. When pods got rescheduled to different nodes, this caused ETIMEDOUT errors. Changed to internal service URL (http://budget-.actualbudget.svc.cluster.local) which is fast and reliable regardless of pod placement.	2026-04-06 12:22:57 +03:00
Viktor Barzin	aac02f0467	meshcentral: restore DB from backup; technitium: remove orphaned PVC - meshcentral: fix homepage annotations formatting (no functional change, serversscheme was tested but not needed since MeshCentral serves HTTP) - meshcentral: restored user DB from Dec 2024 backup (1428B → 45KB) - technitium: remove unused technitium-config-proxmox PVC (WaitForFirstConsumer, never mounted — primary uses NFS, replicas have their own proxmox PVCs)	2026-04-06 12:17:08 +03:00
Viktor Barzin	0de2fef9c9	misc: actualbudget, authentik, headscale, rybbit, terminal, dbaas updates - actualbudget: adjust resource config - authentik: add configuration - headscale: minor fix - rybbit: add resources - terminal: add terminal stack config - platform/dbaas: add config - infra: update lock file	2026-04-06 11:58:00 +03:00
Viktor Barzin	f9e85964ce	traefik: add middleware and platform traefik config updates	2026-04-06 11:57:52 +03:00
Viktor Barzin	dca06b8a00	freedify: increase memory limits and add new features	2026-04-06 11:57:47 +03:00
Viktor Barzin	61dc7a6862	nextcloud: refactor chart values and main.tf configuration	2026-04-06 11:57:44 +03:00
Viktor Barzin	fe342a974b	monitoring + proxmox-csi: LVM snapshot RBAC, pushgateway NodePort, backup dashboard - proxmox-csi: add RBAC for PVE host snapshot restore script - monitoring: expose Pushgateway via NodePort for PVE LVM snapshot metrics - monitoring: add backup health Grafana dashboard	2026-04-06 11:57:41 +03:00
Viktor Barzin	b0178cf6d2	technitium: add tertiary DNS replica and fix CoreDNS forward order - Add tertiary DNS deployment with zone-transfer replication for externalTrafficPolicy=Local coverage across more nodes - Reorder CoreDNS default forwarders: pfSense (10.0.20.1) first, then public DNS fallbacks (8.8.8.8, 1.1.1.1)	2026-04-06 11:57:31 +03:00
Viktor Barzin	f80e1fa868	cluster health fixes: NFS CSI, Immich ML, dbaas, Redis, DNS, trading-bot removal - NFS CSI: fix liveness-probe port conflict (29652 → 29653) - Immich ML: add gpu-workload priority class to enable preemption on node1 - dbaas: right-size MySQL memory limits (sidecar 6Gi→350Mi, main 4Gi→3Gi) - Redis: add redis-master service via HAProxy for master-only routing, update config.tfvars redis_host to use it - CoreDNS: forward .viktorbarzin.lan to Technitium ClusterIP (10.96.0.53) instead of stale LoadBalancer IP (10.0.20.200) - Trading bot: comment out all resources (no longer needed) - Vault: remove trading-bot PostgreSQL database role	2026-04-06 11:54:45 +03:00
Viktor Barzin	faa6868f79	remove claude-memory PDB (blocks drains with single replica) Single replica + minAvailable=1 makes drains impossible. claude-memory is non-critical and recovers quickly. [ci skip]	2026-04-06 00:47:40 +03:00
Viktor Barzin	c8be07c403	resilience improvements: MySQL anti-affinity comment, descheduler 5min, prometheus termination 60s - MySQL InnoDB: keep required anti-affinity but document why (2/3 members OK during node loss) - Descheduler: increase frequency from hourly to every 5 min for faster rebalancing - Prometheus: set terminationGracePeriodSeconds=60 to prevent drain timeout [ci skip]	2026-04-06 00:25:49 +03:00
Viktor Barzin	0f2ef356d6	fix: remove ISCSICSIControllerDown alert (democratic-csi decommissioned) iSCSI CSI (democratic-csi) was replaced by proxmox-csi in April 2026. Controller is intentionally scaled to 0. Remove the stale alert and update CSIDriverCrashLoop to monitor proxmox-csi instead of iscsi-csi.	2026-04-05 23:53:18 +03:00
Viktor Barzin	ae0585048a	fix: bump tier-1-cluster LimitRange max to 8Gi for MySQL 6Gi limit Kyverno's tier-1-cluster LimitRange had max=4Gi which blocked mysql-cluster-2 from starting after we bumped MySQL to 6Gi limit. Also added custom LimitRange in dbaas stack (for when Terraform manages it directly).	2026-04-05 23:31:23 +03:00
Viktor Barzin	772f59d589	fix: add Vault-managed DB credentials for Matrix Synapse - Create dedicated 'matrix' PostgreSQL user (was using 'postgres' superuser) - Add Vault DB static role pg-matrix with 24h rotation - Add ExternalSecret matrix-db-creds syncing password from Vault - Add inject-db-password init container that patches homeserver.yaml with current Vault password on every pod start - Update dependency annotation to pg-cluster-rw.dbaas - Also updated Vault DB config to use pg-cluster-rw (was legacy postgresql.dbaas)	2026-04-05 23:18:16 +03:00
Viktor Barzin	e064778c2c	fix: resolve tandoor, matrix, navidrome crash loops - Tandoor: pin image to vabene1111/recipes:1.5.27 (latest tag pull failing with EOF from pull-through cache corruption) - Matrix: update homeserver.yaml to use pg-cluster-rw.dbaas instead of legacy postgresql.dbaas service, update CNPG postgres password - Navidrome: deleted corrupted SQLite DB (malformed disk image from proxmox-lvm migration), navidrome recreates fresh DB on startup	2026-04-05 23:12:49 +03:00
Viktor Barzin	4da8f0242f	fix: right-size service memory after PVE RAM upgrade (142→272GB) - MySQL InnoDB: 2Gi/4Gi → 3Gi/6Gi (was at 97% of limit) - Redis HAProxy: 16Mi/16Mi → 32Mi/64Mi (OOMKilled) - Plotting-book: 64Mi/64Mi → 128Mi/256Mi (OOMKilled) - Tandoor: 256Mi/256Mi → 384Mi/512Mi (60 OOM restarts), re-enabled - Navidrome: 128Mi/128Mi → 256Mi/384Mi - Matrix: add explicit 256Mi/512Mi resources - Trading-bot workers: 64Mi/64Mi → 128Mi/256Mi, re-enabled - Tier 3-edge defaults: 96Mi/192Mi → 128Mi/256Mi - Fallback tier defaults: 128Mi/128Mi → 128Mi/192Mi, max 2→4Gi - Mailserver: disable rspamd-redis, fix Roundcube IPv6/IMAP, bump dovecot connections	2026-04-05 23:02:50 +03:00
Viktor Barzin	c946e5fdc9	tune controller-manager + apiserver for faster volume detach - kube-controller-manager: --attach-detach-reconcile-sync-period=15s (was 1m default) - kube-apiserver: --default-unreachable-toleration-seconds=60 (was 300s default) - kube-apiserver: --default-not-ready-toleration-seconds=60 (was 300s default) Reduces VolumeAttachment auto-detach from ~6 min to ~2 min on node failure. Applied live + codified in cloud-init template. [ci skip]	2026-04-05 22:14:23 +03:00
Viktor Barzin	c239300154	fix: disable rspamd-redis and correct proxmox-lvm PVC size ENABLE_RSPAMD_REDIS=0 prevents the docker-mailserver from attempting to start an embedded Redis server. The rspamd-redis subprocess was failing repeatedly due to a corrupted/empty RDB file after the recent NFS-to-proxmox-lvm storage migration. Since the DKIM signing config uses use_redis=false, Redis is not needed. Also correct the PVC storage request to match the actual provisioned size (2Gi). The mismatch was causing unnecessary PVC replacement during terraform apply.	2026-04-05 21:44:52 +03:00
Viktor Barzin	2eeb73fc57	add priority-based kubelet graceful shutdown ordering Replace 2-bucket shutdownGracePeriod (240s non-critical / 60s critical) with shutdownGracePeriodByPodPriority for 9-tier ordered shutdown: unclassified(20s) → tier-4-aux(20s) → tier-3-edge(30s) → tier-2-gpu(30s) → tier-1-cluster/DBs(90s) → tier-0-core(30s) → gpu(30s) → sys-cluster(30s) → sys-node(30s). Total: 310s. Apps stop before databases, databases stop before infrastructure. VM shutdown timeout: 300s → 420s. InhibitDelay: 300 → 480. [ci skip]	2026-04-05 20:54:36 +03:00
Viktor Barzin	56583c3825	fix: add retry middleware and per-service rate limit for ha-sofia The global rate limit (10 req/s, 50 burst) was too aggressive for HA dashboards that load 30+ JS files on page load, causing 429s. VPN tunnel blips between London K8s and Sofia caused 502s with no retry fallback. - Add traefik-retry middleware to reverse-proxy factory (all services) - Add skip_global_rate_limit variable to both reverse-proxy factories - Create ha-sofia-rate-limit middleware (100 avg, 200 burst) - Apply to ha-sofia and music-assistant (both route to Sofia)	2026-04-05 20:47:58 +03:00
Viktor Barzin	3cd560d4d9	fix: bank sync alerts - remove {{ $labels.job }} that Helm provider silently drops [ci skip] The Terraform Helm provider's YAML diff comparison silently ignores rules containing {{ $labels.job }} in annotations, preventing the alerts from being applied. Also syncs alerts to platform stack tpl.	2026-04-05 20:07:51 +03:00
Viktor Barzin	0e3c0fb503	security: harden traefik auth flow — fix header spoofing, TLS leak, DERP rate-limit - Auth-proxy fallback now sets ALL X-authentik-* headers (username, uid, email, name, groups) to prevent client-supplied header spoofing when Authentik is down. Previously only username was set, allowing a malicious client to inject fake X-authentik-groups. - Catch-all IngressRoute restricted to *.viktorbarzin.me only. Non-matching domains no longer get the wildcard cert served (TLS info leak). - Added rate-limit and CrowdSec middleware to catch-all IngressRoute. - Added rate-limit middleware to Headscale DERP IngressRoute. - Rotated auth-proxy basicAuth credentials (bcrypt cost 5 → 12, admin → emergency-admin). - Created Authentik brute-force reputation policy (threshold -5, IP+username).	2026-04-05 20:01:06 +03:00
Viktor Barzin	3217a5f605	add bank sync monitoring with Pushgateway metrics and Prometheus alerts [ci skip] CronJob now captures HTTP status, pushes bank_sync_success/duration/last_success to Pushgateway. Alerts: BankSyncFailing (6h), BankSyncStale (48h).	2026-04-05 19:32:40 +03:00
Viktor Barzin	aa7a7e74b2	fix: technitium secondary to proxmox-lvm + bootstrap TF state - Migrate technitium-secondary-config from NFS to proxmox-lvm PVC - Change secondary strategy from RollingUpdate to Recreate (RWO) - Bootstrap encrypted state for insta2spotify and ebooks stacks - Import servarr sub-module PVCs and reconcile state	2026-04-05 19:32:40 +03:00
Viktor Barzin	cb8a808700	feat(storage): migrate 38 NFS PVCs to proxmox-lvm (Wave 2) Add proxmox-lvm PVCs with pvc-autoresizer annotations for all remaining single-pod app data services. Deployments updated to use new block storage PVCs. Old NFS modules retained for rollback. Services: affine, changedetection, diun, excalidraw, f1-stream, hackmd, isponsorblocktv, matrix, n8n, send, grampsweb, health, onlyoffice, owntracks, paperless-ngx, privatebin, resume, speedtest, stirling-pdf, tandoor, rybbit (clickhouse), tor-proxy (torrserver), whisper+piper, frigate (config), ollama (ui), servarr (prowlarr/listenarr/qbittorrent), aiostreams, freshrss (extensions), meshcentral (data+files), openclaw (data+home+openlobster), technitium, mailserver (data+roundcube html+enigma), dbaas (pgadmin). Strategy set to Recreate where needed for RWO volumes.	2026-04-04 19:25:12 +03:00
Viktor Barzin	ee39dd2fc9	feat(storage): migrate 12 SQLite NFS PVCs to proxmox-lvm (Wave 1) Add proxmox-lvm PVCs with pvc-autoresizer annotations for all SQLite-backed services. Deployments updated to use new block storage PVCs. Old NFS modules retained for 1-week rollback. Services: ntfy, freshrss, insta2spotify, actualbudget (x3), wealthfolio, navidrome (DB only), audiobookshelf config, headscale, forgejo, uptime-kuma. Also: set Recreate strategy on ntfy, forgejo, insta2spotify, wealthfolio (required for RWO volumes).	2026-04-04 16:26:59 +03:00
Viktor Barzin	3d3759ea2f	fix: disable cert-manager webhook for pvc-autoresizer, use self-signed cert [ci skip] Cluster doesn't have cert-manager installed. Use self-signed certificate for the controller and disable the PVC mutating webhook (annotations are set directly on PVCs via Terraform).	2026-04-03 23:44:49 +03:00
Viktor Barzin	ce7b8c2b2e	add pvc-autoresizer for automatic PVC expansion before volumes fill up [ci skip] Deploy topolvm/pvc-autoresizer controller that monitors kubelet_volume_stats via Prometheus and auto-expands annotated PVCs. Annotated all 9 block-storage PVCs (proxmox-lvm) with per-PVC thresholds and max limits. Updated PVFillingUp alert to critical/10m (means auto-expansion failed) and added PVAutoExpanding info alert at 80%.	2026-04-03 23:30:00 +03:00
Viktor Barzin	d49acebd8e	migrate ebooks-calibre to proxmox-lvm, update storage docs [ci skip] - Migrate ebooks-calibre-config-iscsi (2Gi, 2380 files) to proxmox-lvm - Update docs/architecture/storage.md: document Proxmox CSI as primary block storage, mark democratic-csi iSCSI as deprecated - Add full migration plan to docs/plans/	2026-04-03 19:45:34 +03:00
Viktor Barzin	dd59512153	migrate iSCSI block volumes from democratic-csi to Proxmox CSI [ci skip] Replace TrueNAS iSCSI (democratic-csi) with Proxmox CSI plugin for all block storage PVCs. Eliminates double-CoW (ZFS + LVM-thin) and removes the iSCSI network hop for database I/O. New stack: stacks/proxmox-csi/ — deploys proxmox-csi-plugin Helm chart with StorageClass "proxmox-lvm" using existing local-lvm thin pool. Migrated PVCs (12 total): - Phase 1 standalone: plotting-book, novelapp, vaultwarden, nextcloud, prometheus - Phase 2 StatefulSets: CNPG PostgreSQL (2), MySQL InnoDB (3), Redis (2) All services verified healthy post-migration.	2026-04-02 22:13:04 +03:00
Viktor Barzin	337da2184d	add upstream fallback to containerd registry mirrors When the pull-through proxy (10.0.20.10) is down, containerd now falls back to the official upstream registries (registry-1.docker.io, ghcr.io) instead of failing. Also cleans up stale disabled registry mirror dirs and removes unnecessary containerd restart from the rollout script.	2026-04-02 11:05:30 +03:00
Viktor Barzin	dd461beb33	add registry blob integrity checker to self-heal corrupted cache The cleanup-tags.sh + garbage-collect cycle can delete blob data while leaving _layers/ link files intact. The registry then returns HTTP 200 with 0 bytes for those layers, causing "unexpected EOF" on image pulls. fix-broken-blobs.sh walks all repositories, checks each layer link against actual blob data, and removes orphaned links so the registry re-fetches from upstream on next pull. Schedule: daily at 2:30am (after tag cleanup) and Sunday 3:30am (after garbage collection). First run found 2335/2556 (91%) of layer links were orphaned.	2026-03-29 22:31:39 +03:00
Viktor Barzin	a2b1b0e817	remove caretta network mapper to free 3Gi cluster memory Caretta eBPF DaemonSet was using 600Mi x 5 nodes = 3Gi total for non-critical network topology visualization. Removing it to free memory for novelapp and aiostreams which were stuck in Pending.	2026-03-29 22:17:35 +03:00
Viktor Barzin	7ad01661f0	novelapp: migrate NEXTAUTH env vars to Auth.js v5 (AUTH_*) Replace NEXTAUTH_URL/NEXTAUTH_SECRET with AUTH_URL/AUTH_SECRET and add AUTH_TRUST_HOST=true for Auth.js v5 compatibility.	2026-03-29 20:37:26 +03:00
Viktor Barzin	8bf83147db	add SLACK_WEBHOOK_URL env var to book-search deployment	2026-03-29 13:53:24 +03:00
Viktor Barzin	78eff9ab11	fix: bump book-search memory to 512Mi for file upload/email [ci skip] Downloads and sends ebook files via HTTP — needs more than 128Mi for large PDFs. Applied live via kubectl, persisting in Terraform.	2026-03-29 13:24:19 +03:00
Viktor Barzin	914e0b08e2	add SMTP and CWA auth env vars to book-search for send-to-kindle [ci skip]	2026-03-29 12:42:45 +03:00
Viktor Barzin	cbea959966	feat(ebooks): mount calibre-library PVC in book-search for permission fixing CWA NETWORK_SHARE_MODE=true skips post-import chown, leaving files as root. book-search now mounts the library to periodically fix permissions on recently imported books.	2026-03-29 11:31:41 +03:00
Viktor Barzin	fed9df8c0e	feat(ebooks): mount stacks-config PVC in book-search for force re-download Adds stacks-config volume mount to book-search pod so it can delete Stacks history entries and force re-downloads when a book was consumed by CWA but failed to import.	2026-03-29 11:26:30 +03:00
Viktor Barzin	6d44b4292f	add /api/download-status to book-search unprotected API ingress [ci skip] Needed for async polling from iOS Shortcuts — status endpoint doesn't need Authentik auth (job IDs are unguessable UUIDs).	2026-03-29 10:11:22 +03:00
Viktor Barzin	46444e0306	openclaw: remove install-dotfiles init container to reduce NFS writes The init container was cloning the dotfiles repo via git on every pod start, causing 200+ small NFS writes that amplified through ZFS. Dotfiles already exist on NFS from a previous clone — no need to re-clone on every restart. To update dotfiles, run git pull manually. Also cleaned up stale Uptime Kuma files (1.6GB old SQLite DB + 289MB error log left over from migration to MariaDB).	2026-03-29 01:11:33 +02:00
Viktor Barzin	878b556179	state(monitoring): update encrypted state	2026-03-29 01:04:11 +02:00
Viktor Barzin	d41211ddd5	add API key + unprotected API ingress for book-search iOS Shortcut - API_KEY env var from calibre-secrets for /api/download-url auth - SHORTCUT_ICLOUD_URL env var for /shortcut redirect - Separate ingress for /api/download-url and /shortcut (bypasses Authentik)	2026-03-29 00:43:34 +02:00
Viktor Barzin	06490b0634	reduce Prometheus cardinality round 3: drop 44k more series - cadvisor: drop unused network error/dropped counters, unused cpu metrics (load_avg, system, user), unused memory metrics (cache, failcnt, kernel, mapped_file, max_usage, rss, swap, active/inactive) - kubelet: drop all unused histogram buckets (storage_operation, csi, volume_operation, image_pull, http_requests, rest_client, pod_worker, volume_metric, cgroup_manager) + kubernetes_feature_enabled - apiserver: drop flowcontrol/rest_client histograms, longrunning_requests - traefik: drop all router-level metrics (keep service + entrypoint) - service-endpoints: drop coredns histograms, node_filesystem_* Post-relabel: 332k → 99k (-70%), ingestion: 5,480 → 1,659 samples/sec (-70%)	2026-03-29 00:27:23 +02:00
Viktor Barzin	614d3c72bd	add liveness probe to annas-archive-stacks deployment Prevents corrupted SQLite DB from looping errors forever — K8s will auto-restart the pod if /api/version stops responding.	2026-03-29 00:17:29 +02:00
Viktor Barzin	a9ca65bc31	reduce Prometheus cardinality round 2: drop 137k more series - fix traefik double-scrape: kubernetes-pods job was scraping traefik pods again (43k duplicate series). Added namespace drop rule. - drop unused cadvisor metrics: container_fs_, container_blkio_, container_pressure_, container_spec_, and misc (30k series) - drop more apiserver histogram buckets: watch_list, watch_cache, response_sizes, watch_events, admission_controller, workqueue (11k) - drop unused kube-state-metrics: replicaset_, pod_tolerations, pod_labels, endpoint_, service_, configmap_, etc (53k series) Post-relabel samples: 332k → 142k (-57%) Ingestion rate: 5,480 → 3,239 samples/sec (-41%)	2026-03-28 23:51:24 +02:00
Viktor Barzin	aceea7db94	increase global rate limit from 10/50 to 50/200 HA frontend loads 30-50 JS bundles on page load, exhausting the burst. iOS Companion app reconnections also trigger bursts. 172 rate-limited (429) requests found in Traefik logs causing intermittent connectivity failures for ha-sofia iOS app.	2026-03-28 23:40:10 +02:00
Viktor Barzin	4b3851829b	feat: organize Grafana dashboards into folders Enable sidecar folderAnnotation + foldersFromFilesStructure to group 26 dashboards into 5 managed folders: - Cluster (6): k8s health, API server, nodes, pods, kube-state-metrics - Networking (6): CoreDNS, Technitium, Headscale, ingress, network traffic - Hardware (5): node-exporter, proxmox, iDRAC, UPS, NVIDIA GPU - Operations (4): backup health, registry, audit logs, Loki - Applications (2): realestate-crawler, qBittorrent Dashboard-to-folder mapping defined in grafana.tf locals block. External stacks (headscale, technitium) annotated individually.	2026-03-28 16:23:49 +02:00
Viktor Barzin	725fefe565	fix: add Headscale monitoring, alerts, and pin UI image - Add 4 Prometheus alerts: HeadscaleDown (critical), NoOnlineNodes, HighHTTPLatency, HighErrorRate - Add Grafana dashboard with node count, map responses, HTTP latency, nodestore operations, and memory panels - Pin headscale-ui to digest sha256:015f5ba0... (was :latest) - Set disable_check_updates: true to skip GitHub check on startup - Uptime Kuma monitor already existed (id=19, 300s interval)	2026-03-28 16:07:04 +02:00
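
The shutdown ordering described in commit 2eeb73fc57 can be sanity-checked with a few lines of Python. This is only a sketch, not the repository's kubelet configuration: the tier names, grace periods, VM shutdown timeout and InhibitDelay are copied from the commit message, while the variable and function names are hypothetical. It assumes the tiers are stopped strictly one after another, so the total is the sum of the per-tier periods.

```python
# Per-tier grace periods in seconds, in shutdown order (from commit 2eeb73fc57).
SHUTDOWN_TIERS = [
    ("unclassified", 20),
    ("tier-4-aux", 20),
    ("tier-3-edge", 30),
    ("tier-2-gpu", 30),
    ("tier-1-cluster", 90),  # databases
    ("tier-0-core", 30),
    ("gpu", 30),
    ("sys-cluster", 30),
    ("sys-node", 30),
]

VM_SHUTDOWN_TIMEOUT = 420  # seconds, from the commit message
INHIBIT_DELAY = 480        # seconds, from the commit message


def total_grace(tiers):
    """Sum of all per-tier grace periods, assuming strictly sequential tiers."""
    return sum(seconds for _, seconds in tiers)


def shutdown_schedule(tiers):
    """Return (tier, start_offset, end_offset) tuples in seconds from shutdown start."""
    schedule = []
    offset = 0
    for name, seconds in tiers:
        schedule.append((name, offset, offset + seconds))
        offset += seconds
    return schedule


if __name__ == "__main__":
    total = total_grace(SHUTDOWN_TIERS)
    for name, start, end in shutdown_schedule(SHUTDOWN_TIERS):
        print(f"{name:15s} {start:4d}s to {end:4d}s")
    print(f"total: {total}s")
    # The outer timeouts have to leave room for the whole sequence.
    assert total <= VM_SHUTDOWN_TIMEOUT <= INHIBIT_DELAY
```

Under these assumptions the per-tier periods add up to the 310s stated in the commit, which fits inside both the 420s VM shutdown timeout and the 480s InhibitDelay.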

418 commits
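
Commit dd461beb33 describes a check that compares each layer link against the blob data behind it. The script itself (fix-broken-blobs.sh) is not shown in this log, so the following is an illustrative Python sketch of the same idea rather than the author's implementation. It assumes the registry uses the standard Docker Distribution filesystem layout (`docker/registry/v2/repositories/<repo>/_layers/sha256/<digest>/link` and `docker/registry/v2/blobs/sha256/<first two hex chars>/<digest>/data`). The function name is hypothetical, and it only reports orphaned links instead of deleting them.

```python
from pathlib import Path


def find_orphaned_layer_links(registry_root):
    """Return layer link files whose blob data is missing or zero bytes.

    Assumes the standard Docker Distribution on-disk layout under
    <registry_root>/docker/registry/v2 (an assumption, not taken from the log).
    """
    v2 = Path(registry_root) / "docker" / "registry" / "v2"
    orphaned = []
    # Repository names may be nested (e.g. library/nginx), hence the "**".
    for link in sorted((v2 / "repositories").glob("**/_layers/sha256/*/link")):
        # A link file contains the digest it points to, e.g. "sha256:abc...".
        algo, _, digest = link.read_text().strip().partition(":")
        data = v2 / "blobs" / algo / digest[:2] / digest / "data"
        # A missing or empty data file is what produces "unexpected EOF" on pull.
        if not data.is_file() or data.stat().st_size == 0:
            orphaned.append(link)
    return orphaned
```

A cleanup step would then remove the returned link files so that the registry re-fetches those layers from upstream on the next pull, as the commit message describes.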