infra

Author	SHA1	Message	Date
Viktor Barzin	4b2339bfae	state(terminal): update encrypted state	2026-04-06 16:49:30 +03:00
Viktor Barzin	fea8519f51	update VPN architecture docs and Authentik state reference - vpn.md: Rewrite WireGuard section to match actual config (single tun_wg0 interface, 10.3.2.0/24 subnet, hub-and-spoke topology, correct device names and subnets for London/Valchedrym) - authentik-state.md: Document brute-force-protection policy unbinding fix that was blocking all unauthenticated users from login flows [ci skip]	2026-04-06 16:26:21 +03:00
Viktor Barzin	d2af5339af	fix offsite sync: use --chmod for Synology permission compatibility Synology Administrator user can't create dirs with root-owned permissions from PVC snapshots. Switch from -az to -rltz --chmod to set writable permissions on destination. Also updated Cloud Sync Task 1 excludes to prevent duplication of backup dirs on Synology.	2026-04-06 16:01:42 +03:00
Viktor Barzin	9338af3c29	fix(dbaas): raise ResourceQuota to 40Gi and add sidecar resources MySQL operator ignores podSpec.containers sidecar resource overrides, always injecting 6Gi limit defaults. Added sidecar to CR spec for documentation but raised quota from 32Gi to 40Gi as the practical fix. Quota usage drops from 99% to 79%.	2026-04-06 15:57:47 +03:00
Viktor Barzin	fbdb57eb58	fix(monitoring): UsingInverterEnergyForTooLong only alerts when stuck Changed from simple time-based (24h on inverter) to condition-based: only fires when on inverter AND battery charge <80% for 1h. This means normal daytime inverter usage won't trigger alerts — only fires when the grid is unavailable and battery is draining.	2026-04-06 15:43:47 +03:00
Viktor Barzin	b5689afe6d	fix(monitoring): tune alert thresholds to reduce false positives - HighPowerUsage: raise from 200W to 300W (R730 idles at ~230W) - HighServiceLatency: exclude headscale (WebSocket) and authentik (SSO) from latency checks — both have inherently high avg response times	2026-04-06 15:39:23 +03:00
Viktor Barzin	ca4acaecd0	bd init: initialize beads issue tracking	2026-04-06 15:38:46 +03:00
Viktor Barzin	91242b0b40	feat(monitoring): add comprehensive hardware exporter alerts Added 20 new alerts across 3 rule groups: Power (8 new): - UPSAlarmsActive, UPSBatteryDegraded, UPSOverloaded, UPSOutputVoltageAbnormal - ATSFault, ATSPowerFault, ATSOverload, ATSInputVoltageAbnormal Server Health (10 new): - iDRACSystemUnhealthy, iDRACPowerSupplyUnhealthy, iDRACMemoryUnhealthy - iDRACStorageDriveUnhealthy, iDRACSSDWearCritical/Warning - iDRACServerPoweredOff, ProxmoxExporterDown - FuseMainFault, FuseGarageFault Metric Staleness (3 new): - FuseMainMetricsMissing, FuseGarageMetricsMissing, ProxmoxMetricsMissing Plus 4 new inhibition rules for alert cascade protection.	2026-04-06 15:31:50 +03:00
Viktor Barzin	6abc0b9742	security(monitoring): remove public SNMP exporter ingress snmp-exporter-external.viktorbarzin.me exposed UPS metrics to the public internet with no authentication. Removed the external ingress and Cloudflare DNS record. ha-sofia now accesses the SNMP exporter via the existing .lan ingress (allow_local_access_only=true) using direct IP 10.0.20.200 with Host header.	2026-04-06 15:23:56 +03:00
Viktor Barzin	7f141faa8c	Fix: Expose SNMP exporter externally to ha-sofia via Cloudflare tunnel - Add snmp-exporter-ingress-external module for external HTTPS access to snmp-exporter - Register snmp-exporter-external.viktorbarzin.me in Cloudflare DNS (proxied via tunnel) - Update ha-sofia REST integration to use external HTTPS endpoint - Fix ingress backend service routing to use existing snmp-exporter service - All UPS sensors on ha-sofia now report values (voltage, battery %, load, etc.)	2026-04-06 15:14:19 +03:00
Viktor Barzin	b345b086ef	update backup/DR docs and runbooks for 3-2-1 architecture - Full rewrite of backup-dr.md: 3-2-1 strategy with sda backup disk, PVC file-level copy from LVM snapshots, pfsense backup, two offsite paths. 4 Mermaid diagrams (data flow, timeline, disk layout, restore tree). - Update storage.md: 65 proxmox-lvm PVCs, sda backup tier - Update restore-full-cluster.md: add Phase 3.5 for PVC restore from sda - Update restore-{mysql,postgresql,vault,vaultwarden}.md: add sda fallback paths - New runbook: restore-pvc-from-backup.md (file-level restore from sda) - Update CLAUDE.md Storage & Backup section for 3-2-1 architecture	2026-04-06 15:06:01 +03:00
Viktor Barzin	d5b0990ed1	state(platform): update encrypted state	2026-04-06 15:04:39 +03:00
Viktor Barzin	9e2ac5fbb5	feat: add hardware exporter checks to cluster healthcheck (check #30) Verifies snmp-exporter, idrac-redfish-exporter, proxmox-exporter, and tuya-bridge pods are running, plus checks Prometheus scrape targets (snmp-idrac, snmp-ups, redfish-idrac, proxmox-host) are UP.	2026-04-06 14:58:46 +03:00
Viktor Barzin	a4c80adbce	fix(prowlarr): correct image tag from 1.31.1 to 2.3.5 [ci skip] LinuxServer.io prowlarr uses different version scheme than the agent guessed. Tag 1.31.1 doesn't exist on lscr.io.	2026-04-06 14:55:33 +03:00
Viktor Barzin	d009f9a0f2	add 3-2-1 backup pipeline: weekly PVC file copy, NFS mirror, pfsense, offsite sync - weekly-backup.sh: mounts LVM thin snapshots ro, rsyncs files to /mnt/backup/pvc-data with --link-dest versioning (4 weeks). Also mirrors NFS backup dirs from TrueNAS, backs up pfsense (config.xml + full tar), PVE host config, and prunes >7d snapshots. - offsite-sync-backup.sh: rsync --files-from manifest to Synology (no full dir walk). Monthly full --delete sync on 1st Sunday. After=weekly-backup.service dependency. - lvm-pvc-snapshot.timer: changed to daily 03:00 (was 2x daily) - Prometheus alerts: WeeklyBackupStale, WeeklyBackupFailing, PfsenseBackupStale, OffsiteBackupSyncStale, BackupDiskFull. LVMSnapshotStale threshold 24h→48h.	2026-04-06 14:53:28 +03:00
Viktor Barzin	09b4bad958	feat: pin ~28 images to specific versions, enable DIUN monitoring, add app-stacks pipeline Pin third-party images from :latest to current stable versions: - Platform: cloudflared, technitium, snmp-exporter, pve-exporter, headscale, shadowsocks, xray - Apps: paperless-ngx, linkwarden, wealthfolio, speedtest, synapse, n8n, prowlarr, qbittorrent, lidarr, rybbit, ollama, immichframe, cyberchef, networking-toolbox, echo, coturn, shlink, affine Enable DIUN annotations on all pinned deployments with per-image tag patterns. Add Woodpecker app-stacks pipeline for selective terragrunt apply on changed app stacks.	2026-04-06 14:27:13 +03:00
Viktor Barzin	a81f7df2a0	feat(diun): add auto-update infrastructure - Custom DIUN image with git/ssh for script notifier - Auto-update script: detects new image versions, updates .tf files, pushes - ESO secret for git credentials, persistent repo clone PVC - GHA workflow to build custom DIUN image - Skips databases and CI/CD-managed images automatically	2026-04-06 14:27:01 +03:00
Viktor Barzin	7cfcbfa405	upgrade immich v2.6.1 → v2.6.3 (bug fixes only) [ci skip]	2026-04-06 14:26:56 +03:00
Viktor Barzin	9e25441c30	fix: restore changedetection and flaresolverr services - changedetection: increase memory from 64Mi to 256Mi/512Mi (was OOMKilling), set replicas back to 1 - flaresolverr: re-enable with replicas=1, increase memory limit to 1Gi (needed by book-search for Cloudflare bypass)	2026-04-06 14:26:29 +03:00
Viktor Barzin	0c0a346d50	state(changedetection): update encrypted state	2026-04-06 14:25:03 +03:00
Viktor Barzin	ec50aaae59	state(changedetection): update encrypted state	2026-04-06 14:23:30 +03:00
Viktor Barzin	25aee1d3e9	fix(mailserver): delete all e2e-probe emails, not just current marker Previously only searched for the current run's specific marker subject. If IMAP deletion failed, old emails accumulated. Now searches for all emails with "e2e-probe" in subject and deletes them, cleaning up any leftovers from prior failed runs.	2026-04-06 13:39:47 +03:00
Viktor Barzin	9349d5d566	fix(meshcentral): use service port 80→443 to prevent Traefik HTTPS Root cause: Traefik v3 auto-detects HTTPS for backend port 443, ignoring the port name "http" and serversscheme annotations. MeshCentral serves HTTP on 443 (TLSOffload mode), but Traefik connected via HTTPS causing TLS handshake failure → 500. Fix: Change K8s service port from 443 to 80 with target_port 443. Traefik sees port 80 → uses HTTP → reaches MeshCentral correctly. Also disables anti-AI scraping (internal tool behind Authentik).	2026-04-06 13:38:30 +03:00
Viktor Barzin	2ced1e8fb5	state(meshcentral): update encrypted state	2026-04-06 13:38:09 +03:00
Viktor Barzin	65b320873c	state(meshcentral): update encrypted state	2026-04-06 13:37:41 +03:00
Viktor Barzin	41a329c0a5	state(meshcentral): update encrypted state	2026-04-06 13:36:22 +03:00
Viktor Barzin	33485dc5c7	state(meshcentral): update encrypted state	2026-04-06 13:36:00 +03:00
Viktor Barzin	355c787169	state(meshcentral): update encrypted state	2026-04-06 13:33:21 +03:00
Viktor Barzin	7f13f5fd76	fix(meshcentral): disable anti-AI middleware causing 500 errors The rewrite-body plugin (anti-AI trap links) was crashing when processing MeshCentral's HTML responses, returning 500. Disabled anti_ai_scraping since it's a protected internal tool behind Authentik. Re-enabled Authentik protection.	2026-04-06 13:32:37 +03:00
Viktor Barzin	09501fdb64	state(meshcentral): update encrypted state	2026-04-06 13:32:16 +03:00
Viktor Barzin	d549efbe11	state(meshcentral): update encrypted state	2026-04-06 13:30:12 +03:00
Viktor Barzin	66f1e2ea3b	fix(meshcentral): re-enable TLSOffload for Traefik reverse proxy The previous init container incorrectly disabled TLSOffload, causing MeshCentral to serve HTTPS on port 443. Traefik connects via HTTP, resulting in protocol mismatch and 500 errors. Fix ensures TLSOffload is always enabled so MeshCentral serves plain HTTP behind Traefik.	2026-04-06 13:29:21 +03:00
Viktor Barzin	cf62771177	state(meshcentral): update encrypted state	2026-04-06 13:28:34 +03:00
Viktor Barzin	e3514d7d5b	state(meshcentral): update encrypted state	2026-04-06 13:26:59 +03:00
Viktor Barzin	cba79cde35	fix(meshcentral): disable certUrl when using TLSOffload MeshCentral was failing to start with "Zipencryptionmodule failed" error because the service tried to fetch TLS certificates from an HTTPS endpoint during bootstrap. When using TLSOffload (reverse proxy terminating TLS), MeshCentral should not attempt to load certificates. Root cause: The existing config.json had "certUrl" set to HTTPS, causing MeshCentral to try fetching the certificate during startup. Since the pod was bootstrapping, this failed and cascaded into the Zipencryptionmodule failure. Fix: Add init container that runs before the main container to disable the certUrl by prefixing it with underscore (MeshCentral's convention for disabled settings). The sed command ensures the fix applies to both new and existing config.json files. This ensures MeshCentral behaves correctly with TLSOffload enabled: - Runs in plain HTTP mode on port 443 - Traefik/Ingress handles HTTPS termination - No certificate bootstrap failures	2026-04-06 13:22:59 +03:00
Viktor Barzin	b8120b22c0	state(meshcentral): update encrypted state	2026-04-06 13:22:41 +03:00
Viktor Barzin	6ee5b70a36	priority-pass: update backend to v8 (expanded QR container margins)	2026-04-06 13:22:27 +03:00
Viktor Barzin	64c378d158	add critical instruction to update docs with every infra change [ci skip]	2026-04-06 13:21:49 +03:00
Viktor Barzin	fc233bd27f	docs: comprehensive audit and update of all architecture docs and runbooks [ci skip] Audited 14 documentation files against live cluster state and Terraform code. Architecture docs: - databases.md: MySQL 8.4.4, proxmox-lvm storage (not iSCSI), anti-affinity excludes k8s-node1 (GPU), 2Gi/3Gi resources, 7-day rotation (not 24h), CNPG 2 instances, PostGIS 16, postgresql.dbaas has endpoints - overview.md: 1x CPU, ~160GB RAM, all nodes 32GB, proxmox-lvm storage, correct Vault paths (secret/ not kv/) - compute.md: 272GB physical host RAM, ~160GB allocated to VMs - secrets.md: 7-day rotation, 7 MySQL + 5 PG roles, correct ESO config - networking.md: MetalLB pool 10.0.20.200-220 - ci-cd.md: 9 GHA projects, travel_blog 5.7GB Runbooks: - restore-mysql/postgresql: backup files are .sql.gz (not .sql) - restore-vault: weekly backup (not daily), auto-unseal sidecar note - restore-vaultwarden: PVC is proxmox (not iscsi) - restore-full-cluster: updated node roles, removed trading Reference docs: - CLAUDE.md: 7-day rotation, removed trading from PG list - AGENTS.md: 100+ stacks, proxmox-lvm, platform empty shell - service-catalog.md: 6 new stacks, 14 stack column updates	2026-04-06 13:21:05 +03:00
Viktor Barzin	06359aa3fa	state(mailserver): update encrypted state	2026-04-06 13:17:27 +03:00
Viktor Barzin	feeed5ac35	priority-pass: update backend to v7 (square QR container)	2026-04-06 13:16:35 +03:00
Viktor Barzin	ee2c6517ba	fix(meshcentral): remove unused NFS modules after Wave 2 storage migration MeshCentral was migrated from NFS to proxmox-lvm storage (Wave 2). The old NFS modules for data and files are no longer used by the deployment, leaving behind orphaned PVCs (meshcentral-data, meshcentral-files). The backups volume remains on NFS per the backup strategy pattern. Changes: - Removed module.nfs_data and module.nfs_files from Terraform config - Active volumes now: meshcentral-data-proxmox, meshcentral-files-proxmox (proxmox-lvm) - Backups volume: meshcentral-backups (NFS) - unchanged Pod status: healthy, running on proxmox-lvm volumes.	2026-04-06 13:13:16 +03:00
Viktor Barzin	7b94abd54e	state(meshcentral): update encrypted state	2026-04-06 13:13:09 +03:00
Viktor Barzin	0162b4f130	priority-pass: update backend to v6 (remove edge artifact scratches)	2026-04-06 13:09:45 +03:00
Viktor Barzin	c38ae944fc	priority-pass: update backend to v5 (QR container sizing fix)	2026-04-06 13:06:18 +03:00
Viktor Barzin	9492874c43	fix: restore technitium MySQL query logging with Vault auto-rotation [ci skip] Query logs stopped syncing on 2026-03-16 due to password mismatch after MySQL cluster rebuild and Technitium app config reset. - Add Vault static role mysql-technitium (7-day rotation) - Add ExternalSecret for technitium-db-creds in technitium namespace - Add password-sync CronJob (6h) to push rotated password to Technitium API - Update Grafana datasource to use ESO-managed password - Remove stale technitium_db_password variable (replaced by ESO) - Update databases.md and restore-mysql.md runbook	2026-04-06 13:00:49 +03:00
Viktor Barzin	1d7244e47a	priority-pass: update backend to v4 (QR container clipping fix)	2026-04-06 13:00:33 +03:00
Viktor Barzin	0c44e11146	priority-pass: update backend to v3 (QR container layout fix)	2026-04-06 12:57:08 +03:00
Viktor Barzin	75b18717a1	priority-pass: update backend to v2 (QR code preservation fix)	2026-04-06 12:53:45 +03:00
Viktor Barzin	ef6f57e82c	priority-pass: update frontend image to v5 (clipboard paste support)	2026-04-06 12:44:19 +03:00

2343 commits