infra

Author	SHA1	Message	Date
Viktor Barzin	25aee1d3e9	fix(mailserver): delete all e2e-probe emails, not just current marker Previously only searched for the current run's specific marker subject. If IMAP deletion failed, old emails accumulated. Now searches for all emails with "e2e-probe" in subject and deletes them, cleaning up any leftovers from prior failed runs.	2026-04-06 13:39:47 +03:00
Viktor Barzin	9349d5d566	fix(meshcentral): use service port 80→443 to prevent Traefik HTTPS Root cause: Traefik v3 auto-detects HTTPS for backend port 443, ignoring the port name "http" and serversscheme annotations. MeshCentral serves HTTP on 443 (TLSOffload mode), but Traefik connected via HTTPS causing TLS handshake failure → 500. Fix: Change K8s service port from 443 to 80 with target_port 443. Traefik sees port 80 → uses HTTP → reaches MeshCentral correctly. Also disables anti-AI scraping (internal tool behind Authentik).	2026-04-06 13:38:30 +03:00
Viktor Barzin	7f13f5fd76	fix(meshcentral): disable anti-AI middleware causing 500 errors The rewrite-body plugin (anti-AI trap links) was crashing when processing MeshCentral's HTML responses, returning 500. Disabled anti_ai_scraping since it's a protected internal tool behind Authentik. Re-enabled Authentik protection.	2026-04-06 13:32:37 +03:00
Viktor Barzin	66f1e2ea3b	fix(meshcentral): re-enable TLSOffload for Traefik reverse proxy The previous init container incorrectly disabled TLSOffload, causing MeshCentral to serve HTTPS on port 443. Traefik connects via HTTP, resulting in protocol mismatch and 500 errors. Fix ensures TLSOffload is always enabled so MeshCentral serves plain HTTP behind Traefik.	2026-04-06 13:29:21 +03:00
Viktor Barzin	cba79cde35	fix(meshcentral): disable certUrl when using TLSOffload MeshCentral was failing to start with "Zipencryptionmodule failed" error because the service tried to fetch TLS certificates from an HTTPS endpoint during bootstrap. When using TLSOffload (reverse proxy terminating TLS), MeshCentral should not attempt to load certificates. Root cause: The existing config.json had "certUrl" set to HTTPS, causing MeshCentral to try fetching the certificate during startup. Since the pod was bootstrapping, this failed and cascaded into the Zipencryptionmodule failure. Fix: Add init container that runs before the main container to disable the certUrl by prefixing it with underscore (MeshCentral's convention for disabled settings). The sed command ensures the fix applies to both new and existing config.json files. This ensures MeshCentral behaves correctly with TLSOffload enabled: - Runs in plain HTTP mode on port 443 - Traefik/Ingress handles HTTPS termination - No certificate bootstrap failures	2026-04-06 13:22:59 +03:00
Viktor Barzin	6ee5b70a36	priority-pass: update backend to v8 (expanded QR container margins)	2026-04-06 13:22:27 +03:00
Viktor Barzin	feeed5ac35	priority-pass: update backend to v7 (square QR container)	2026-04-06 13:16:35 +03:00
Viktor Barzin	ee2c6517ba	fix(meshcentral): remove unused NFS modules after Wave 2 storage migration MeshCentral was migrated from NFS to proxmox-lvm storage (Wave 2). The old NFS modules for data and files are no longer used by the deployment, leaving behind orphaned PVCs (meshcentral-data, meshcentral-files). The backups volume remains on NFS per the backup strategy pattern. Changes: - Removed module.nfs_data and module.nfs_files from Terraform config - Active volumes now: meshcentral-data-proxmox, meshcentral-files-proxmox (proxmox-lvm) - Backups volume: meshcentral-backups (NFS) - unchanged Pod status: healthy, running on proxmox-lvm volumes.	2026-04-06 13:13:16 +03:00
Viktor Barzin	0162b4f130	priority-pass: update backend to v6 (remove edge artifact scratches)	2026-04-06 13:09:45 +03:00
Viktor Barzin	c38ae944fc	priority-pass: update backend to v5 (QR container sizing fix)	2026-04-06 13:06:18 +03:00
Viktor Barzin	9492874c43	fix: restore technitium MySQL query logging with Vault auto-rotation [ci skip] Query logs stopped syncing on 2026-03-16 due to password mismatch after MySQL cluster rebuild and Technitium app config reset. - Add Vault static role mysql-technitium (7-day rotation) - Add ExternalSecret for technitium-db-creds in technitium namespace - Add password-sync CronJob (6h) to push rotated password to Technitium API - Update Grafana datasource to use ESO-managed password - Remove stale technitium_db_password variable (replaced by ESO) - Update databases.md and restore-mysql.md runbook	2026-04-06 13:00:49 +03:00
Viktor Barzin	1d7244e47a	priority-pass: update backend to v4 (QR container clipping fix)	2026-04-06 13:00:33 +03:00
Viktor Barzin	0c44e11146	priority-pass: update backend to v3 (QR container layout fix)	2026-04-06 12:57:08 +03:00
Viktor Barzin	75b18717a1	priority-pass: update backend to v2 (QR code preservation fix)	2026-04-06 12:53:45 +03:00
Viktor Barzin	ef6f57e82c	priority-pass: update frontend image to v5 (clipboard paste support)	2026-04-06 12:44:19 +03:00
Viktor Barzin	3676cdbeeb	state(technitium): update encrypted state	2026-04-06 12:40:55 +03:00
root	a038b2a2c4	Woodpecker CI deploy commit [CI SKIP]	2026-04-06 09:28:10 +00:00
Viktor Barzin	5b43e57efa	actualbudget: use internal ClusterIP for http-api server URL The http-api sidecar was connecting to the public URL (https://budget-.viktorbarzin.me) which goes through Traefik/Authentik. When pods got rescheduled to different nodes, this caused ETIMEDOUT errors. Changed to internal service URL (http://budget-.actualbudget.svc.cluster.local) which is fast and reliable regardless of pod placement.	2026-04-06 12:22:57 +03:00
Viktor Barzin	aac02f0467	meshcentral: restore DB from backup; technitium: remove orphaned PVC - meshcentral: fix homepage annotations formatting (no functional change, serversscheme was tested but not needed since MeshCentral serves HTTP) - meshcentral: restored user DB from Dec 2024 backup (1428B → 45KB) - technitium: remove unused technitium-config-proxmox PVC (WaitForFirstConsumer, never mounted — primary uses NFS, replicas have their own proxmox PVCs)	2026-04-06 12:17:08 +03:00
Viktor Barzin	0de2fef9c9	misc: actualbudget, authentik, headscale, rybbit, terminal, dbaas updates - actualbudget: adjust resource config - authentik: add configuration - headscale: minor fix - rybbit: add resources - terminal: add terminal stack config - platform/dbaas: add config - infra: update lock file	2026-04-06 11:58:00 +03:00
Viktor Barzin	f9e85964ce	traefik: add middleware and platform traefik config updates	2026-04-06 11:57:52 +03:00
Viktor Barzin	dca06b8a00	freedify: increase memory limits and add new features	2026-04-06 11:57:47 +03:00
Viktor Barzin	61dc7a6862	nextcloud: refactor chart values and main.tf configuration	2026-04-06 11:57:44 +03:00
Viktor Barzin	fe342a974b	monitoring + proxmox-csi: LVM snapshot RBAC, pushgateway NodePort, backup dashboard - proxmox-csi: add RBAC for PVE host snapshot restore script - monitoring: expose Pushgateway via NodePort for PVE LVM snapshot metrics - monitoring: add backup health Grafana dashboard	2026-04-06 11:57:41 +03:00
Viktor Barzin	b0178cf6d2	technitium: add tertiary DNS replica and fix CoreDNS forward order - Add tertiary DNS deployment with zone-transfer replication for externalTrafficPolicy=Local coverage across more nodes - Reorder CoreDNS default forwarders: pfSense (10.0.20.1) first, then public DNS fallbacks (8.8.8.8, 1.1.1.1)	2026-04-06 11:57:31 +03:00
Viktor Barzin	f80e1fa868	cluster health fixes: NFS CSI, Immich ML, dbaas, Redis, DNS, trading-bot removal - NFS CSI: fix liveness-probe port conflict (29652 → 29653) - Immich ML: add gpu-workload priority class to enable preemption on node1 - dbaas: right-size MySQL memory limits (sidecar 6Gi→350Mi, main 4Gi→3Gi) - Redis: add redis-master service via HAProxy for master-only routing, update config.tfvars redis_host to use it - CoreDNS: forward .viktorbarzin.lan to Technitium ClusterIP (10.96.0.53) instead of stale LoadBalancer IP (10.0.20.200) - Trading bot: comment out all resources (no longer needed) - Vault: remove trading-bot PostgreSQL database role	2026-04-06 11:54:45 +03:00
Viktor Barzin	faa6868f79	remove claude-memory PDB (blocks drains with single replica) Single replica + minAvailable=1 makes drains impossible. claude-memory is non-critical and recovers quickly. [ci skip]	2026-04-06 00:47:40 +03:00
Viktor Barzin	c8be07c403	resilience improvements: MySQL anti-affinity comment, descheduler 5min, prometheus termination 60s - MySQL InnoDB: keep required anti-affinity but document why (2/3 members OK during node loss) - Descheduler: increase frequency from hourly to every 5 min for faster rebalancing - Prometheus: set terminationGracePeriodSeconds=60 to prevent drain timeout [ci skip]	2026-04-06 00:25:49 +03:00
Viktor Barzin	0f2ef356d6	fix: remove ISCSICSIControllerDown alert (democratic-csi decommissioned) iSCSI CSI (democratic-csi) was replaced by proxmox-csi in April 2026. Controller is intentionally scaled to 0. Remove the stale alert and update CSIDriverCrashLoop to monitor proxmox-csi instead of iscsi-csi.	2026-04-05 23:53:18 +03:00
Viktor Barzin	ae0585048a	fix: bump tier-1-cluster LimitRange max to 8Gi for MySQL 6Gi limit Kyverno's tier-1-cluster LimitRange had max=4Gi which blocked mysql-cluster-2 from starting after we bumped MySQL to 6Gi limit. Also added custom LimitRange in dbaas stack (for when Terraform manages it directly).	2026-04-05 23:31:23 +03:00
Viktor Barzin	772f59d589	fix: add Vault-managed DB credentials for Matrix Synapse - Create dedicated 'matrix' PostgreSQL user (was using 'postgres' superuser) - Add Vault DB static role pg-matrix with 24h rotation - Add ExternalSecret matrix-db-creds syncing password from Vault - Add inject-db-password init container that patches homeserver.yaml with current Vault password on every pod start - Update dependency annotation to pg-cluster-rw.dbaas - Also updated Vault DB config to use pg-cluster-rw (was legacy postgresql.dbaas)	2026-04-05 23:18:16 +03:00
Viktor Barzin	e064778c2c	fix: resolve tandoor, matrix, navidrome crash loops - Tandoor: pin image to vabene1111/recipes:1.5.27 (latest tag pull failing with EOF from pull-through cache corruption) - Matrix: update homeserver.yaml to use pg-cluster-rw.dbaas instead of legacy postgresql.dbaas service, update CNPG postgres password - Navidrome: deleted corrupted SQLite DB (malformed disk image from proxmox-lvm migration), navidrome recreates fresh DB on startup	2026-04-05 23:12:49 +03:00
Viktor Barzin	4da8f0242f	fix: right-size service memory after PVE RAM upgrade (142→272GB) - MySQL InnoDB: 2Gi/4Gi → 3Gi/6Gi (was at 97% of limit) - Redis HAProxy: 16Mi/16Mi → 32Mi/64Mi (OOMKilled) - Plotting-book: 64Mi/64Mi → 128Mi/256Mi (OOMKilled) - Tandoor: 256Mi/256Mi → 384Mi/512Mi (60 OOM restarts), re-enabled - Navidrome: 128Mi/128Mi → 256Mi/384Mi - Matrix: add explicit 256Mi/512Mi resources - Trading-bot workers: 64Mi/64Mi → 128Mi/256Mi, re-enabled - Tier 3-edge defaults: 96Mi/192Mi → 128Mi/256Mi - Fallback tier defaults: 128Mi/128Mi → 128Mi/192Mi, max 2→4Gi - Mailserver: disable rspamd-redis, fix Roundcube IPv6/IMAP, bump dovecot connections	2026-04-05 23:02:50 +03:00
Viktor Barzin	c946e5fdc9	tune controller-manager + apiserver for faster volume detach - kube-controller-manager: --attach-detach-reconcile-sync-period=15s (was 1m default) - kube-apiserver: --default-unreachable-toleration-seconds=60 (was 300s default) - kube-apiserver: --default-not-ready-toleration-seconds=60 (was 300s default) Reduces VolumeAttachment auto-detach from ~6 min to ~2 min on node failure. Applied live + codified in cloud-init template. [ci skip]	2026-04-05 22:14:23 +03:00
Viktor Barzin	c239300154	fix: disable rspamd-redis and correct proxmox-lvm PVC size ENABLE_RSPAMD_REDIS=0 prevents the docker-mailserver from attempting to start an embedded Redis server. The rspamd-redis subprocess was failing repeatedly due to a corrupted/empty RDB file after the recent NFS-to-proxmox-lvm storage migration. Since the DKIM signing config uses use_redis=false, Redis is not needed. Also correct the PVC storage request to match the actual provisioned size (2Gi). The mismatch was causing unnecessary PVC replacement during terraform apply.	2026-04-05 21:44:52 +03:00
Viktor Barzin	2eeb73fc57	add priority-based kubelet graceful shutdown ordering Replace 2-bucket shutdownGracePeriod (240s non-critical / 60s critical) with shutdownGracePeriodByPodPriority for 9-tier ordered shutdown: unclassified(20s) → tier-4-aux(20s) → tier-3-edge(30s) → tier-2-gpu(30s) → tier-1-cluster/DBs(90s) → tier-0-core(30s) → gpu(30s) → sys-cluster(30s) → sys-node(30s). Total: 310s. Apps stop before databases, databases stop before infrastructure. VM shutdown timeout: 300s → 420s. InhibitDelay: 300 → 480. [ci skip]	2026-04-05 20:54:36 +03:00
Viktor Barzin	56583c3825	fix: add retry middleware and per-service rate limit for ha-sofia The global rate limit (10 req/s, 50 burst) was too aggressive for HA dashboards that load 30+ JS files on page load, causing 429s. VPN tunnel blips between London K8s and Sofia caused 502s with no retry fallback. - Add traefik-retry middleware to reverse-proxy factory (all services) - Add skip_global_rate_limit variable to both reverse-proxy factories - Create ha-sofia-rate-limit middleware (100 avg, 200 burst) - Apply to ha-sofia and music-assistant (both route to Sofia)	2026-04-05 20:47:58 +03:00
Viktor Barzin	3cd560d4d9	fix: bank sync alerts - remove {{ $labels.job }} that Helm provider silently drops [ci skip] The Terraform Helm provider's YAML diff comparison silently ignores rules containing {{ $labels.job }} in annotations, preventing the alerts from being applied. Also syncs alerts to platform stack tpl.	2026-04-05 20:07:51 +03:00
Viktor Barzin	0e3c0fb503	security: harden traefik auth flow — fix header spoofing, TLS leak, DERP rate-limit - Auth-proxy fallback now sets ALL X-authentik-* headers (username, uid, email, name, groups) to prevent client-supplied header spoofing when Authentik is down. Previously only username was set, allowing a malicious client to inject fake X-authentik-groups. - Catch-all IngressRoute restricted to *.viktorbarzin.me only. Non-matching domains no longer get the wildcard cert served (TLS info leak). - Added rate-limit and CrowdSec middleware to catch-all IngressRoute. - Added rate-limit middleware to Headscale DERP IngressRoute. - Rotated auth-proxy basicAuth credentials (bcrypt cost 5 → 12, admin → emergency-admin). - Created Authentik brute-force reputation policy (threshold -5, IP+username).	2026-04-05 20:01:06 +03:00
Viktor Barzin	3217a5f605	add bank sync monitoring with Pushgateway metrics and Prometheus alerts [ci skip] CronJob now captures HTTP status, pushes bank_sync_success/duration/last_success to Pushgateway. Alerts: BankSyncFailing (6h), BankSyncStale (48h).	2026-04-05 19:32:40 +03:00
Viktor Barzin	aa7a7e74b2	fix: technitium secondary to proxmox-lvm + bootstrap TF state - Migrate technitium-secondary-config from NFS to proxmox-lvm PVC - Change secondary strategy from RollingUpdate to Recreate (RWO) - Bootstrap encrypted state for insta2spotify and ebooks stacks - Import servarr sub-module PVCs and reconcile state	2026-04-05 19:32:40 +03:00
Viktor Barzin	cb8a808700	feat(storage): migrate 38 NFS PVCs to proxmox-lvm (Wave 2) Add proxmox-lvm PVCs with pvc-autoresizer annotations for all remaining single-pod app data services. Deployments updated to use new block storage PVCs. Old NFS modules retained for rollback. Services: affine, changedetection, diun, excalidraw, f1-stream, hackmd, isponsorblocktv, matrix, n8n, send, grampsweb, health, onlyoffice, owntracks, paperless-ngx, privatebin, resume, speedtest, stirling-pdf, tandoor, rybbit (clickhouse), tor-proxy (torrserver), whisper+piper, frigate (config), ollama (ui), servarr (prowlarr/listenarr/qbittorrent), aiostreams, freshrss (extensions), meshcentral (data+files), openclaw (data+home+openlobster), technitium, mailserver (data+roundcube html+enigma), dbaas (pgadmin). Strategy set to Recreate where needed for RWO volumes.	2026-04-04 19:25:12 +03:00
Viktor Barzin	ee39dd2fc9	feat(storage): migrate 12 SQLite NFS PVCs to proxmox-lvm (Wave 1) Add proxmox-lvm PVCs with pvc-autoresizer annotations for all SQLite-backed services. Deployments updated to use new block storage PVCs. Old NFS modules retained for 1-week rollback. Services: ntfy, freshrss, insta2spotify, actualbudget (x3), wealthfolio, navidrome (DB only), audiobookshelf config, headscale, forgejo, uptime-kuma. Also: set Recreate strategy on ntfy, forgejo, insta2spotify, wealthfolio (required for RWO volumes).	2026-04-04 16:26:59 +03:00
Viktor Barzin	3d3759ea2f	fix: disable cert-manager webhook for pvc-autoresizer, use self-signed cert [ci skip] Cluster doesn't have cert-manager installed. Use self-signed certificate for the controller and disable the PVC mutating webhook (annotations are set directly on PVCs via Terraform).	2026-04-03 23:44:49 +03:00
Viktor Barzin	ce7b8c2b2e	add pvc-autoresizer for automatic PVC expansion before volumes fill up [ci skip] Deploy topolvm/pvc-autoresizer controller that monitors kubelet_volume_stats via Prometheus and auto-expands annotated PVCs. Annotated all 9 block-storage PVCs (proxmox-lvm) with per-PVC thresholds and max limits. Updated PVFillingUp alert to critical/10m (means auto-expansion failed) and added PVAutoExpanding info alert at 80%.	2026-04-03 23:30:00 +03:00
Viktor Barzin	d49acebd8e	migrate ebooks-calibre to proxmox-lvm, update storage docs [ci skip] - Migrate ebooks-calibre-config-iscsi (2Gi, 2380 files) to proxmox-lvm - Update docs/architecture/storage.md: document Proxmox CSI as primary block storage, mark democratic-csi iSCSI as deprecated - Add full migration plan to docs/plans/	2026-04-03 19:45:34 +03:00
Viktor Barzin	dd59512153	migrate iSCSI block volumes from democratic-csi to Proxmox CSI [ci skip] Replace TrueNAS iSCSI (democratic-csi) with Proxmox CSI plugin for all block storage PVCs. Eliminates double-CoW (ZFS + LVM-thin) and removes the iSCSI network hop for database I/O. New stack: stacks/proxmox-csi/ — deploys proxmox-csi-plugin Helm chart with StorageClass "proxmox-lvm" using existing local-lvm thin pool. Migrated PVCs (12 total): - Phase 1 standalone: plotting-book, novelapp, vaultwarden, nextcloud, prometheus - Phase 2 StatefulSets: CNPG PostgreSQL (2), MySQL InnoDB (3), Redis (2) All services verified healthy post-migration.	2026-04-02 22:13:04 +03:00
Viktor Barzin	337da2184d	add upstream fallback to containerd registry mirrors When the pull-through proxy (10.0.20.10) is down, containerd now falls back to the official upstream registries (registry-1.docker.io, ghcr.io) instead of failing. Also cleans up stale disabled registry mirror dirs and removes unnecessary containerd restart from the rollout script.	2026-04-02 11:05:30 +03:00
Viktor Barzin	dd461beb33	add registry blob integrity checker to self-heal corrupted cache The cleanup-tags.sh + garbage-collect cycle can delete blob data while leaving _layers/ link files intact. The registry then returns HTTP 200 with 0 bytes for those layers, causing "unexpected EOF" on image pulls. fix-broken-blobs.sh walks all repositories, checks each layer link against actual blob data, and removes orphaned links so the registry re-fetches from upstream on next pull. Schedule: daily at 2:30am (after tag cleanup) and Sunday 3:30am (after garbage collection). First run found 2335/2556 (91%) of layer links were orphaned.	2026-03-29 22:31:39 +03:00
Viktor Barzin	a2b1b0e817	remove caretta network mapper to free 3Gi cluster memory Caretta eBPF DaemonSet was using 600Mi x 5 nodes = 3Gi total for non-critical network topology visualization. Removing it to free memory for novelapp and aiostreams which were stuck in Pending.	2026-03-29 22:17:35 +03:00
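
The 9-tier kubelet shutdown ordering described in commit 2eeb73fc57 quotes a total of 310s against a VM shutdown timeout of 420s and an InhibitDelay of 480. A minimal sketch that re-checks that arithmetic follows. The tier names and durations are copied from the commit message; the variable names are hypothetical, and treating InhibitDelay as seconds is an assumption, not something the log states.

```python
# Per-tier grace periods in seconds, in shutdown order, as listed in
# the commit message for 2eeb73fc57.
TIERS = [
    ("unclassified", 20),
    ("tier-4-aux", 20),
    ("tier-3-edge", 30),
    ("tier-2-gpu", 30),
    ("tier-1-cluster", 90),  # databases get the longest window
    ("tier-0-core", 30),
    ("gpu", 30),
    ("sys-cluster", 30),
    ("sys-node", 30),
]

VM_SHUTDOWN_TIMEOUT = 420  # seconds, from the commit message
INHIBIT_DELAY = 480        # assumed to be seconds

# Each tier gets its own window, so the worst case is the sum of all windows.
total_seconds = sum(seconds for _, seconds in TIERS)

# The hypervisor and the inhibitor lock must both outlast the ordered shutdown.
fits_vm_timeout = total_seconds <= VM_SHUTDOWN_TIMEOUT
fits_inhibit_delay = total_seconds <= INHIBIT_DELAY

print(f"total={total_seconds}s vm_ok={fits_vm_timeout} inhibit_ok={fits_inhibit_delay}")
```

The sum matches the 310s in the message, leaving 110s of headroom under the VM shutdown timeout.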

434 commits