infra

Author	SHA1	Message	Date
Viktor Barzin	4d3d3316ab	feat(phpipam): deploy phpIPAM for live IP address management Lightweight IPAM with auto-discovery scanning every 15min via fping. Replaces disabled NetBox (OOM'd). Uses existing MySQL InnoDB cluster with Vault-rotated credentials. Cloudflare DNS + Authentik auth. [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-10 14:19:25 +00:00
Viktor Barzin	2a10c1821f	fix(nextcloud): enable mysql.utf8mb4 to fix collation errors with MySQL 8.4 MySQL 8.4 remapped `utf8` to mean `utf8mb4`, but Nextcloud without this config sends `COLLATE UTF8_general_ci` (a utf8mb3-only collation) in queries, causing SQLSTATE[42000] errors that broke occ commands and sync. Also removed stale `Work 🎯.csv` whose emoji filename was stripped in the DB filecache (stored as `Work .csv`), causing permanent sync errors. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-10 09:16:48 +00:00
Viktor Barzin	9820f2ced0	add foolery stack: agent orchestration UI on devvm [ci skip] Service+Endpoints pattern proxying to 10.0.10.10:3210 (Foolery). Protected by Authentik forward-auth. DNS via Cloudflare tunnel.	2026-04-10 00:21:59 +01:00
Viktor Barzin	795874fc21	immich: upgrade to v2.7.3, tune PG for vector search performance - Bump immich server + ML from v2.6.3 to v2.7.3 - Increase PG shared_buffers to 2GB (memory 3Gi) to prevent clip_index eviction by background jobs - Switch DB_STORAGE_TYPE to SSD (effective_io_concurrency=200, random_page_cost=1.2) - Add pg_prewarm autoprewarm for warm restarts - Add postgresql.override.conf via init container for tuning - Add postStart hook to prewarm vector tables on startup Search latency: ~1.3s → ~130ms (external), ~60ms (internal)	2026-04-09 23:04:13 +01:00
Viktor Barzin	6101fb99f9	Reduce disk write amplification across cluster (~200-350 GB/day savings) [ci skip] - Prometheus: persist metric whitelist (keep rules) to Helm template, preventing regression from 33K to 250K samples/scrape on next apply. Reduce retention 52w→26w. - MySQL InnoDB: aggressive write reduction — flush_log_at_trx_commit=0, sync_binlog=0, doublewrite=OFF, io_capacity=100/200, redo_log=1GB, flush_neighbors=1, reduced page cleaners. - etcd: increase snapshot-count 10000→50000 to reduce WAL snapshot frequency. - VM disks: enable TRIM/discard passthrough to LVM thin pool via create-vm module. - Cloud-init: enable fstrim.timer, journald limits (500M/7d/compress). - Kubelet: containerLogMaxSize=10Mi, containerLogMaxFiles=3. - Technitium: DNS query log retention 0→30 days (was unlimited writes to MySQL). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 19:01:21 +00:00
Viktor Barzin	9de543c076	feat(technitium): add Split Horizon AddressTranslation for hairpin NAT fix 192.168.1.x LAN clients couldn't reach non-proxied *.viktorbarzin.me domains because the TP-Link router doesn't support hairpin NAT. Adds a CronJob that configures Technitium's Split Horizon AddressTranslation post-processor on all 3 instances to translate 176.12.22.76 (public IP) → 10.0.20.200 (Traefik LB) in DNS responses for 192.168.1.0/24 clients. Also adds viktorbarzin.me to the DNS Rebinding Protection privateDomains allowlist so the translated private IP isn't stripped. [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-08 18:43:15 +00:00
Viktor Barzin	64585e329c	fix: update Technitium DNS IP from 10.0.20.200 to 10.0.20.201 Technitium DNS was moved to its own dedicated MetalLB LoadBalancer IP (10.0.20.201) but several references still pointed to the old shared IP (10.0.20.200, now used by traefik/coturn/etc). This caused DNS resolution failures for *.viktorbarzin.lan from pfSense and LAN clients. - Update CoreDNS Corefile forward in both technitium and platform modules - Update MetalLB annotation and remove stale allow-shared-ip - Update zone NS records and apex A record in config.tfvars - Update legacy BIND forwarder reference Also fixed on pfSense (not in repo): - Removed NAT rule redirecting UDP 53 to wrong IP (10.0.20.200) - Added dnsmasq listen on WAN (192.168.1.2) for LAN clients - Added domain-specific forwarding (viktorbarzin.lan -> 10.0.20.201) - Created aliases (technitium_dns, k8s_shared_lb) for all NAT rules [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-08 09:53:56 +00:00
Viktor Barzin	cdfa1b7e92	fix(headscale): backup CronJob uses pod_affinity for RWO PVC access The backup CronJob was stuck in ContainerCreating because it couldn't mount the proxmox-lvm RWO PVC from a different node. Fixed by: - Adding pod_affinity to co-locate with the headscale pod (same node) - Mounting both data PVC (read-only) and NFS backup PVC (write) - Adding integrity check pattern from vaultwarden backup - Setting concurrency_policy=Replace and ttl_seconds_after_finished=10	2026-04-08 08:20:08 +01:00
Viktor Barzin	0498cc4ad1	feat(terminal): add image upload button for iOS paste support iOS Safari doesn't support reading images via navigator.clipboard.read(). Added a camera button that opens the native file/photo picker, which works reliably on all platforms including iOS.	2026-04-08 08:11:18 +01:00
Viktor Barzin	4d753a6486	fix(immich): improve thumbnail loading performance on iOS app - Bump immich-server memory 1700Mi/2500Mi → 2000Mi/3500Mi to prevent OOM kills - Disable anti-AI middleware chain for Immich (removes 3 unnecessary ForwardAuth hops per request — Immich content is behind auth, not crawlable) - Double rate limit to 200 avg / 2000 burst for fast-scroll thumbnail requests - Fix ImmichFrame image tag (1.7.4 → v1.0.32.0) - Add PostgreSQL vector search prewarming and tuning (SSD storage type, init container for override conf, postStart pg_prewarm)	2026-04-08 08:08:53 +01:00
Viktor Barzin	15e45b95a9	feat(terminal): add clipboard paste support for text and images - Custom index.html with xterm.js for reliable Ctrl+V text paste - Go clipboard-upload service saves pasted images to /tmp/clipboard-images/ - Traefik IngressRoute routes /clipboard/* to upload service (same-origin) - Authentik-protected upload path with strip-prefix middleware	2026-04-06 16:57:18 +03:00
Viktor Barzin	9338af3c29	fix(dbaas): raise ResourceQuota to 40Gi and add sidecar resources MySQL operator ignores podSpec.containers sidecar resource overrides, always injecting 6Gi limit defaults. Added sidecar to CR spec for documentation but raised quota from 32Gi to 40Gi as the practical fix. Quota usage drops from 99% to 79%.	2026-04-06 15:57:47 +03:00
Viktor Barzin	fbdb57eb58	fix(monitoring): UsingInverterEnergyForTooLong only alerts when stuck Changed from simple time-based (24h on inverter) to condition-based: only fires when on inverter AND battery charge <80% for 1h. This means normal daytime inverter usage won't trigger alerts — only fires when the grid is unavailable and battery is draining.	2026-04-06 15:43:47 +03:00
Viktor Barzin	b5689afe6d	fix(monitoring): tune alert thresholds to reduce false positives - HighPowerUsage: raise from 200W to 300W (R730 idles at ~230W) - HighServiceLatency: exclude headscale (WebSocket) and authentik (SSO) from latency checks — both have inherently high avg response times	2026-04-06 15:39:23 +03:00
Viktor Barzin	91242b0b40	feat(monitoring): add comprehensive hardware exporter alerts Added 20 new alerts across 3 rule groups: Power (8 new): - UPSAlarmsActive, UPSBatteryDegraded, UPSOverloaded, UPSOutputVoltageAbnormal - ATSFault, ATSPowerFault, ATSOverload, ATSInputVoltageAbnormal Server Health (10 new): - iDRACSystemUnhealthy, iDRACPowerSupplyUnhealthy, iDRACMemoryUnhealthy - iDRACStorageDriveUnhealthy, iDRACSSDWearCritical/Warning - iDRACServerPoweredOff, ProxmoxExporterDown - FuseMainFault, FuseGarageFault Metric Staleness (3 new): - FuseMainMetricsMissing, FuseGarageMetricsMissing, ProxmoxMetricsMissing Plus 4 new inhibition rules for alert cascade protection.	2026-04-06 15:31:50 +03:00
Viktor Barzin	6abc0b9742	security(monitoring): remove public SNMP exporter ingress snmp-exporter-external.viktorbarzin.me exposed UPS metrics to the public internet with no authentication. Removed the external ingress and Cloudflare DNS record. ha-sofia now accesses the SNMP exporter via the existing .lan ingress (allow_local_access_only=true) using direct IP 10.0.20.200 with Host header.	2026-04-06 15:23:56 +03:00
Viktor Barzin	7f141faa8c	Fix: Expose SNMP exporter externally to ha-sofia via Cloudflare tunnel - Add snmp-exporter-ingress-external module for external HTTPS access to snmp-exporter - Register snmp-exporter-external.viktorbarzin.me in Cloudflare DNS (proxied via tunnel) - Update ha-sofia REST integration to use external HTTPS endpoint - Fix ingress backend service routing to use existing snmp-exporter service - All UPS sensors on ha-sofia now report values (voltage, battery %, load, etc.)	2026-04-06 15:14:19 +03:00
Viktor Barzin	a4c80adbce	fix(prowlarr): correct image tag from 1.31.1 to 2.3.5 [ci skip] LinuxServer.io prowlarr uses different version scheme than the agent guessed. Tag 1.31.1 doesn't exist on lscr.io.	2026-04-06 14:55:33 +03:00
Viktor Barzin	d009f9a0f2	add 3-2-1 backup pipeline: weekly PVC file copy, NFS mirror, pfsense, offsite sync - weekly-backup.sh: mounts LVM thin snapshots ro, rsyncs files to /mnt/backup/pvc-data with --link-dest versioning (4 weeks). Also mirrors NFS backup dirs from TrueNAS, backs up pfsense (config.xml + full tar), PVE host config, and prunes >7d snapshots. - offsite-sync-backup.sh: rsync --files-from manifest to Synology (no full dir walk). Monthly full --delete sync on 1st Sunday. After=weekly-backup.service dependency. - lvm-pvc-snapshot.timer: changed to daily 03:00 (was 2x daily) - Prometheus alerts: WeeklyBackupStale, WeeklyBackupFailing, PfsenseBackupStale, OffsiteBackupSyncStale, BackupDiskFull. LVMSnapshotStale threshold 24h→48h.	2026-04-06 14:53:28 +03:00
Viktor Barzin	09b4bad958	feat: pin ~28 images to specific versions, enable DIUN monitoring, add app-stacks pipeline Pin third-party images from :latest to current stable versions: - Platform: cloudflared, technitium, snmp-exporter, pve-exporter, headscale, shadowsocks, xray - Apps: paperless-ngx, linkwarden, wealthfolio, speedtest, synapse, n8n, prowlarr, qbittorrent, lidarr, rybbit, ollama, immichframe, cyberchef, networking-toolbox, echo, coturn, shlink, affine Enable DIUN annotations on all pinned deployments with per-image tag patterns. Add Woodpecker app-stacks pipeline for selective terragrunt apply on changed app stacks.	2026-04-06 14:27:13 +03:00
Viktor Barzin	a81f7df2a0	feat(diun): add auto-update infrastructure - Custom DIUN image with git/ssh for script notifier - Auto-update script: detects new image versions, updates .tf files, pushes - ESO secret for git credentials, persistent repo clone PVC - GHA workflow to build custom DIUN image - Skips databases and CI/CD-managed images automatically	2026-04-06 14:27:01 +03:00
Viktor Barzin	7cfcbfa405	upgrade immich v2.6.1 → v2.6.3 (bug fixes only) [ci skip]	2026-04-06 14:26:56 +03:00
Viktor Barzin	9e25441c30	fix: restore changedetection and flaresolverr services - changedetection: increase memory from 64Mi to 256Mi/512Mi (was OOMKilling), set replicas back to 1 - flaresolverr: re-enable with replicas=1, increase memory limit to 1Gi (needed by book-search for Cloudflare bypass)	2026-04-06 14:26:29 +03:00
Viktor Barzin	25aee1d3e9	fix(mailserver): delete all e2e-probe emails, not just current marker Previously only searched for the current run's specific marker subject. If IMAP deletion failed, old emails accumulated. Now searches for all emails with "e2e-probe" in subject and deletes them, cleaning up any leftovers from prior failed runs.	2026-04-06 13:39:47 +03:00
Viktor Barzin	9349d5d566	fix(meshcentral): use service port 80→443 to prevent Traefik HTTPS Root cause: Traefik v3 auto-detects HTTPS for backend port 443, ignoring the port name "http" and serversscheme annotations. MeshCentral serves HTTP on 443 (TLSOffload mode), but Traefik connected via HTTPS causing TLS handshake failure → 500. Fix: Change K8s service port from 443 to 80 with target_port 443. Traefik sees port 80 → uses HTTP → reaches MeshCentral correctly. Also disables anti-AI scraping (internal tool behind Authentik).	2026-04-06 13:38:30 +03:00
Viktor Barzin	7f13f5fd76	fix(meshcentral): disable anti-AI middleware causing 500 errors The rewrite-body plugin (anti-AI trap links) was crashing when processing MeshCentral's HTML responses, returning 500. Disabled anti_ai_scraping since it's a protected internal tool behind Authentik. Re-enabled Authentik protection.	2026-04-06 13:32:37 +03:00
Viktor Barzin	66f1e2ea3b	fix(meshcentral): re-enable TLSOffload for Traefik reverse proxy The previous init container incorrectly disabled TLSOffload, causing MeshCentral to serve HTTPS on port 443. Traefik connects via HTTP, resulting in protocol mismatch and 500 errors. Fix ensures TLSOffload is always enabled so MeshCentral serves plain HTTP behind Traefik.	2026-04-06 13:29:21 +03:00
Viktor Barzin	cba79cde35	fix(meshcentral): disable certUrl when using TLSOffload MeshCentral was failing to start with "Zipencryptionmodule failed" error because the service tried to fetch TLS certificates from an HTTPS endpoint during bootstrap. When using TLSOffload (reverse proxy terminating TLS), MeshCentral should not attempt to load certificates. Root cause: The existing config.json had "certUrl" set to HTTPS, causing MeshCentral to try fetching the certificate during startup. Since the pod was bootstrapping, this failed and cascaded into the Zipencryptionmodule failure. Fix: Add init container that runs before the main container to disable the certUrl by prefixing it with underscore (MeshCentral's convention for disabled settings). The sed command ensures the fix applies to both new and existing config.json files. This ensures MeshCentral behaves correctly with TLSOffload enabled: - Runs in plain HTTP mode on port 443 - Traefik/Ingress handles HTTPS termination - No certificate bootstrap failures	2026-04-06 13:22:59 +03:00
Viktor Barzin	6ee5b70a36	priority-pass: update backend to v8 (expanded QR container margins)	2026-04-06 13:22:27 +03:00
Viktor Barzin	feeed5ac35	priority-pass: update backend to v7 (square QR container)	2026-04-06 13:16:35 +03:00
Viktor Barzin	ee2c6517ba	fix(meshcentral): remove unused NFS modules after Wave 2 storage migration MeshCentral was migrated from NFS to proxmox-lvm storage (Wave 2). The old NFS modules for data and files are no longer used by the deployment, leaving behind orphaned PVCs (meshcentral-data, meshcentral-files). The backups volume remains on NFS per the backup strategy pattern. Changes: - Removed module.nfs_data and module.nfs_files from Terraform config - Active volumes now: meshcentral-data-proxmox, meshcentral-files-proxmox (proxmox-lvm) - Backups volume: meshcentral-backups (NFS) - unchanged Pod status: healthy, running on proxmox-lvm volumes.	2026-04-06 13:13:16 +03:00
Viktor Barzin	0162b4f130	priority-pass: update backend to v6 (remove edge artifact scratches)	2026-04-06 13:09:45 +03:00
Viktor Barzin	c38ae944fc	priority-pass: update backend to v5 (QR container sizing fix)	2026-04-06 13:06:18 +03:00
Viktor Barzin	9492874c43	fix: restore technitium MySQL query logging with Vault auto-rotation [ci skip] Query logs stopped syncing on 2026-03-16 due to password mismatch after MySQL cluster rebuild and Technitium app config reset. - Add Vault static role mysql-technitium (7-day rotation) - Add ExternalSecret for technitium-db-creds in technitium namespace - Add password-sync CronJob (6h) to push rotated password to Technitium API - Update Grafana datasource to use ESO-managed password - Remove stale technitium_db_password variable (replaced by ESO) - Update databases.md and restore-mysql.md runbook	2026-04-06 13:00:49 +03:00
Viktor Barzin	1d7244e47a	priority-pass: update backend to v4 (QR container clipping fix)	2026-04-06 13:00:33 +03:00
Viktor Barzin	0c44e11146	priority-pass: update backend to v3 (QR container layout fix)	2026-04-06 12:57:08 +03:00
Viktor Barzin	75b18717a1	priority-pass: update backend to v2 (QR code preservation fix)	2026-04-06 12:53:45 +03:00
Viktor Barzin	ef6f57e82c	priority-pass: update frontend image to v5 (clipboard paste support)	2026-04-06 12:44:19 +03:00
Viktor Barzin	3676cdbeeb	state(technitium): update encrypted state	2026-04-06 12:40:55 +03:00
root	a038b2a2c4	Woodpecker CI deploy commit [CI SKIP]	2026-04-06 09:28:10 +00:00
Viktor Barzin	5b43e57efa	actualbudget: use internal ClusterIP for http-api server URL The http-api sidecar was connecting to the public URL (https://budget-.viktorbarzin.me) which goes through Traefik/Authentik. When pods got rescheduled to different nodes, this caused ETIMEDOUT errors. Changed to internal service URL (http://budget-.actualbudget.svc.cluster.local) which is fast and reliable regardless of pod placement.	2026-04-06 12:22:57 +03:00
Viktor Barzin	aac02f0467	meshcentral: restore DB from backup; technitium: remove orphaned PVC - meshcentral: fix homepage annotations formatting (no functional change, serversscheme was tested but not needed since MeshCentral serves HTTP) - meshcentral: restored user DB from Dec 2024 backup (1428B → 45KB) - technitium: remove unused technitium-config-proxmox PVC (WaitForFirstConsumer, never mounted — primary uses NFS, replicas have their own proxmox PVCs)	2026-04-06 12:17:08 +03:00
Viktor Barzin	0de2fef9c9	misc: actualbudget, authentik, headscale, rybbit, terminal, dbaas updates - actualbudget: adjust resource config - authentik: add configuration - headscale: minor fix - rybbit: add resources - terminal: add terminal stack config - platform/dbaas: add config - infra: update lock file	2026-04-06 11:58:00 +03:00
Viktor Barzin	f9e85964ce	traefik: add middleware and platform traefik config updates	2026-04-06 11:57:52 +03:00
Viktor Barzin	dca06b8a00	freedify: increase memory limits and add new features	2026-04-06 11:57:47 +03:00
Viktor Barzin	61dc7a6862	nextcloud: refactor chart values and main.tf configuration	2026-04-06 11:57:44 +03:00
Viktor Barzin	fe342a974b	monitoring + proxmox-csi: LVM snapshot RBAC, pushgateway NodePort, backup dashboard - proxmox-csi: add RBAC for PVE host snapshot restore script - monitoring: expose Pushgateway via NodePort for PVE LVM snapshot metrics - monitoring: add backup health Grafana dashboard	2026-04-06 11:57:41 +03:00
Viktor Barzin	b0178cf6d2	technitium: add tertiary DNS replica and fix CoreDNS forward order - Add tertiary DNS deployment with zone-transfer replication for externalTrafficPolicy=Local coverage across more nodes - Reorder CoreDNS default forwarders: pfSense (10.0.20.1) first, then public DNS fallbacks (8.8.8.8, 1.1.1.1)	2026-04-06 11:57:31 +03:00
Viktor Barzin	f80e1fa868	cluster health fixes: NFS CSI, Immich ML, dbaas, Redis, DNS, trading-bot removal - NFS CSI: fix liveness-probe port conflict (29652 → 29653) - Immich ML: add gpu-workload priority class to enable preemption on node1 - dbaas: right-size MySQL memory limits (sidecar 6Gi→350Mi, main 4Gi→3Gi) - Redis: add redis-master service via HAProxy for master-only routing, update config.tfvars redis_host to use it - CoreDNS: forward .viktorbarzin.lan to Technitium ClusterIP (10.96.0.53) instead of stale LoadBalancer IP (10.0.20.200) - Trading bot: comment out all resources (no longer needed) - Vault: remove trading-bot PostgreSQL database role	2026-04-06 11:54:45 +03:00
Viktor Barzin	faa6868f79	remove claude-memory PDB (blocks drains with single replica) Single replica + minAvailable=1 makes drains impossible. claude-memory is non-critical and recovers quickly. [ci skip]	2026-04-06 00:47:40 +03:00

457 commits