Commit graph

2343 commits

Author SHA1 Message Date
Viktor Barzin
4b2339bfae state(terminal): update encrypted state 2026-04-06 16:49:30 +03:00
Viktor Barzin
fea8519f51 update VPN architecture docs and Authentik state reference
- vpn.md: Rewrite WireGuard section to match actual config (single tun_wg0
  interface, 10.3.2.0/24 subnet, hub-and-spoke topology, correct device
  names and subnets for London/Valchedrym)
- authentik-state.md: Document brute-force-protection policy unbinding fix
  that was blocking all unauthenticated users from login flows

[ci skip]
2026-04-06 16:26:21 +03:00
Viktor Barzin
d2af5339af fix offsite sync: use --chmod for Synology permission compatibility
Synology Administrator user can't create dirs with root-owned permissions
from PVC snapshots. Switch from -az to -rltz --chmod to set writable
permissions on destination. Also updated Cloud Sync Task 1 excludes
to prevent duplication of backup dirs on Synology.
2026-04-06 16:01:42 +03:00
Viktor Barzin
9338af3c29 fix(dbaas): raise ResourceQuota to 40Gi and add sidecar resources
MySQL operator ignores podSpec.containers sidecar resource overrides,
always injecting 6Gi limit defaults. Added sidecar to CR spec for
documentation but raised quota from 32Gi to 40Gi as the practical fix.
Quota usage drops from 99% to 79%.
2026-04-06 15:57:47 +03:00
Viktor Barzin
fbdb57eb58 fix(monitoring): UsingInverterEnergyForTooLong only alerts when stuck
Changed from simple time-based (24h on inverter) to condition-based:
only fires when on inverter AND battery charge <80% for 1h. This means
normal daytime inverter usage won't trigger alerts — only fires when
the grid is unavailable and battery is draining.
2026-04-06 15:43:47 +03:00
Viktor Barzin
b5689afe6d fix(monitoring): tune alert thresholds to reduce false positives
- HighPowerUsage: raise from 200W to 300W (R730 idles at ~230W)
- HighServiceLatency: exclude headscale (WebSocket) and authentik (SSO)
  from latency checks — both have inherently high avg response times
2026-04-06 15:39:23 +03:00
Viktor Barzin
ca4acaecd0 bd init: initialize beads issue tracking 2026-04-06 15:38:46 +03:00
Viktor Barzin
91242b0b40 feat(monitoring): add comprehensive hardware exporter alerts
Added 20 new alerts across 3 rule groups:

Power (8 new):
- UPSAlarmsActive, UPSBatteryDegraded, UPSOverloaded, UPSOutputVoltageAbnormal
- ATSFault, ATSPowerFault, ATSOverload, ATSInputVoltageAbnormal

Server Health (10 new):
- iDRACSystemUnhealthy, iDRACPowerSupplyUnhealthy, iDRACMemoryUnhealthy
- iDRACStorageDriveUnhealthy, iDRACSSDWearCritical/Warning
- iDRACServerPoweredOff, ProxmoxExporterDown
- FuseMainFault, FuseGarageFault

Metric Staleness (3 new):
- FuseMainMetricsMissing, FuseGarageMetricsMissing, ProxmoxMetricsMissing

Plus 4 new inhibition rules for alert cascade protection.
2026-04-06 15:31:50 +03:00
Viktor Barzin
6abc0b9742 security(monitoring): remove public SNMP exporter ingress
snmp-exporter-external.viktorbarzin.me exposed UPS metrics to the
public internet with no authentication. Removed the external ingress
and Cloudflare DNS record. ha-sofia now accesses the SNMP exporter
via the existing .lan ingress (allow_local_access_only=true) using
direct IP 10.0.20.200 with Host header.
2026-04-06 15:23:56 +03:00
Viktor Barzin
7f141faa8c Fix: Expose SNMP exporter externally to ha-sofia via Cloudflare tunnel
- Add snmp-exporter-ingress-external module for external HTTPS access to snmp-exporter
- Register snmp-exporter-external.viktorbarzin.me in Cloudflare DNS (proxied via tunnel)
- Update ha-sofia REST integration to use external HTTPS endpoint
- Fix ingress backend service routing to use existing snmp-exporter service
- All UPS sensors on ha-sofia now report values (voltage, battery %, load, etc.)
2026-04-06 15:14:19 +03:00
Viktor Barzin
b345b086ef update backup/DR docs and runbooks for 3-2-1 architecture
- Full rewrite of backup-dr.md: 3-2-1 strategy with sda backup disk,
  PVC file-level copy from LVM snapshots, pfsense backup, two offsite
  paths. 4 Mermaid diagrams (data flow, timeline, disk layout, restore tree).
- Update storage.md: 65 proxmox-lvm PVCs, sda backup tier
- Update restore-full-cluster.md: add Phase 3.5 for PVC restore from sda
- Update restore-{mysql,postgresql,vault,vaultwarden}.md: add sda fallback paths
- New runbook: restore-pvc-from-backup.md (file-level restore from sda)
- Update CLAUDE.md Storage & Backup section for 3-2-1 architecture
2026-04-06 15:06:01 +03:00
Viktor Barzin
d5b0990ed1 state(platform): update encrypted state 2026-04-06 15:04:39 +03:00
Viktor Barzin
9e2ac5fbb5 feat: add hardware exporter checks to cluster healthcheck (check #30)
Verifies snmp-exporter, idrac-redfish-exporter, proxmox-exporter, and
tuya-bridge pods are running, plus checks Prometheus scrape targets
(snmp-idrac, snmp-ups, redfish-idrac, proxmox-host) are UP.
2026-04-06 14:58:46 +03:00
Viktor Barzin
a4c80adbce fix(prowlarr): correct image tag from 1.31.1 to 2.3.5 [ci skip]
LinuxServer.io prowlarr uses different version scheme than the agent
guessed. Tag 1.31.1 doesn't exist on lscr.io.
2026-04-06 14:55:33 +03:00
Viktor Barzin
d009f9a0f2 add 3-2-1 backup pipeline: weekly PVC file copy, NFS mirror, pfsense, offsite sync
- weekly-backup.sh: mounts LVM thin snapshots ro, rsyncs files to /mnt/backup/pvc-data
  with --link-dest versioning (4 weeks). Also mirrors NFS backup dirs from TrueNAS,
  backs up pfsense (config.xml + full tar), PVE host config, and prunes >7d snapshots.
- offsite-sync-backup.sh: rsync --files-from manifest to Synology (no full dir walk).
  Monthly full --delete sync on 1st Sunday. After=weekly-backup.service dependency.
- lvm-pvc-snapshot.timer: changed to daily 03:00 (was 2x daily)
- Prometheus alerts: WeeklyBackupStale, WeeklyBackupFailing, PfsenseBackupStale,
  OffsiteBackupSyncStale, BackupDiskFull. LVMSnapshotStale threshold 24h→48h.
2026-04-06 14:53:28 +03:00
Viktor Barzin
09b4bad958 feat: pin ~28 images to specific versions, enable DIUN monitoring, add app-stacks pipeline
Pin third-party images from :latest to current stable versions:
- Platform: cloudflared, technitium, snmp-exporter, pve-exporter,
  headscale, shadowsocks, xray
- Apps: paperless-ngx, linkwarden, wealthfolio, speedtest, synapse,
  n8n, prowlarr, qbittorrent, lidarr, rybbit, ollama, immichframe,
  cyberchef, networking-toolbox, echo, coturn, shlink, affine

Enable DIUN annotations on all pinned deployments with per-image
tag patterns. Add Woodpecker app-stacks pipeline for selective
terragrunt apply on changed app stacks.
2026-04-06 14:27:13 +03:00
Viktor Barzin
a81f7df2a0 feat(diun): add auto-update infrastructure
- Custom DIUN image with git/ssh for script notifier
- Auto-update script: detects new image versions, updates .tf files, pushes
- ESO secret for git credentials, persistent repo clone PVC
- GHA workflow to build custom DIUN image
- Skips databases and CI/CD-managed images automatically
2026-04-06 14:27:01 +03:00
Viktor Barzin
7cfcbfa405 upgrade immich v2.6.1 → v2.6.3 (bug fixes only) [ci skip] 2026-04-06 14:26:56 +03:00
Viktor Barzin
9e25441c30 fix: restore changedetection and flaresolverr services
- changedetection: increase memory from 64Mi to 256Mi/512Mi (was OOMKilling),
  set replicas back to 1
- flaresolverr: re-enable with replicas=1, increase memory limit to 1Gi
  (needed by book-search for Cloudflare bypass)
2026-04-06 14:26:29 +03:00
Viktor Barzin
0c0a346d50 state(changedetection): update encrypted state 2026-04-06 14:25:03 +03:00
Viktor Barzin
ec50aaae59 state(changedetection): update encrypted state 2026-04-06 14:23:30 +03:00
Viktor Barzin
25aee1d3e9 fix(mailserver): delete all e2e-probe emails, not just current marker
Previously only searched for the current run's specific marker subject.
If IMAP deletion failed, old emails accumulated. Now searches for all
emails with "e2e-probe" in subject and deletes them, cleaning up any
leftovers from prior failed runs.
2026-04-06 13:39:47 +03:00
Viktor Barzin
9349d5d566 fix(meshcentral): use service port 80→443 to prevent Traefik HTTPS
Root cause: Traefik v3 auto-detects HTTPS for backend port 443,
ignoring the port name "http" and serversscheme annotations.
MeshCentral serves HTTP on 443 (TLSOffload mode), but Traefik
connected via HTTPS causing TLS handshake failure → 500.

Fix: Change K8s service port from 443 to 80 with target_port 443.
Traefik sees port 80 → uses HTTP → reaches MeshCentral correctly.
Also disables anti-AI scraping (internal tool behind Authentik).
2026-04-06 13:38:30 +03:00
Viktor Barzin
2ced1e8fb5 state(meshcentral): update encrypted state 2026-04-06 13:38:09 +03:00
Viktor Barzin
65b320873c state(meshcentral): update encrypted state 2026-04-06 13:37:41 +03:00
Viktor Barzin
41a329c0a5 state(meshcentral): update encrypted state 2026-04-06 13:36:22 +03:00
Viktor Barzin
33485dc5c7 state(meshcentral): update encrypted state 2026-04-06 13:36:00 +03:00
Viktor Barzin
355c787169 state(meshcentral): update encrypted state 2026-04-06 13:33:21 +03:00
Viktor Barzin
7f13f5fd76 fix(meshcentral): disable anti-AI middleware causing 500 errors
The rewrite-body plugin (anti-AI trap links) was crashing when
processing MeshCentral's HTML responses, returning 500. Disabled
anti_ai_scraping since it's a protected internal tool behind Authentik.
Re-enabled Authentik protection.
2026-04-06 13:32:37 +03:00
Viktor Barzin
09501fdb64 state(meshcentral): update encrypted state 2026-04-06 13:32:16 +03:00
Viktor Barzin
d549efbe11 state(meshcentral): update encrypted state 2026-04-06 13:30:12 +03:00
Viktor Barzin
66f1e2ea3b fix(meshcentral): re-enable TLSOffload for Traefik reverse proxy
The previous init container incorrectly disabled TLSOffload, causing
MeshCentral to serve HTTPS on port 443. Traefik connects via HTTP,
resulting in protocol mismatch and 500 errors. Fix ensures TLSOffload
is always enabled so MeshCentral serves plain HTTP behind Traefik.
2026-04-06 13:29:21 +03:00
Viktor Barzin
cf62771177 state(meshcentral): update encrypted state 2026-04-06 13:28:34 +03:00
Viktor Barzin
e3514d7d5b state(meshcentral): update encrypted state 2026-04-06 13:26:59 +03:00
Viktor Barzin
cba79cde35 fix(meshcentral): disable certUrl when using TLSOffload
MeshCentral was failing to start with "Zipencryptionmodule failed" error
because the service tried to fetch TLS certificates from an HTTPS endpoint
during bootstrap. When using TLSOffload (reverse proxy terminating TLS),
MeshCentral should not attempt to load certificates.

Root cause: The existing config.json had "certUrl" set to HTTPS, causing
MeshCentral to try fetching the certificate during startup. Since the pod
was bootstrapping, this failed and cascaded into the Zipencryptionmodule
failure.

Fix: Add init container that runs before the main container to disable
the certUrl by prefixing it with underscore (MeshCentral's convention for
disabled settings). The sed command ensures the fix applies to both new
and existing config.json files.

This ensures MeshCentral behaves correctly with TLSOffload enabled:
- Runs in plain HTTP mode on port 443
- Traefik/Ingress handles HTTPS termination
- No certificate bootstrap failures
2026-04-06 13:22:59 +03:00
Viktor Barzin
b8120b22c0 state(meshcentral): update encrypted state 2026-04-06 13:22:41 +03:00
Viktor Barzin
6ee5b70a36 priority-pass: update backend to v8 (expanded QR container margins) 2026-04-06 13:22:27 +03:00
Viktor Barzin
64c378d158 add critical instruction to update docs with every infra change [ci skip] 2026-04-06 13:21:49 +03:00
Viktor Barzin
fc233bd27f docs: comprehensive audit and update of all architecture docs and runbooks [ci skip]
Audited 14 documentation files against live cluster state and Terraform code.

Architecture docs:
- databases.md: MySQL 8.4.4, proxmox-lvm storage (not iSCSI), anti-affinity
  excludes k8s-node1 (GPU), 2Gi/3Gi resources, 7-day rotation (not 24h),
  CNPG 2 instances, PostGIS 16, postgresql.dbaas has endpoints
- overview.md: 1x CPU, ~160GB RAM, all nodes 32GB, proxmox-lvm storage,
  correct Vault paths (secret/ not kv/)
- compute.md: 272GB physical host RAM, ~160GB allocated to VMs
- secrets.md: 7-day rotation, 7 MySQL + 5 PG roles, correct ESO config
- networking.md: MetalLB pool 10.0.20.200-220
- ci-cd.md: 9 GHA projects, travel_blog 5.7GB

Runbooks:
- restore-mysql/postgresql: backup files are .sql.gz (not .sql)
- restore-vault: weekly backup (not daily), auto-unseal sidecar note
- restore-vaultwarden: PVC is proxmox (not iscsi)
- restore-full-cluster: updated node roles, removed trading

Reference docs:
- CLAUDE.md: 7-day rotation, removed trading from PG list
- AGENTS.md: 100+ stacks, proxmox-lvm, platform empty shell
- service-catalog.md: 6 new stacks, 14 stack column updates
2026-04-06 13:21:05 +03:00
Viktor Barzin
06359aa3fa state(mailserver): update encrypted state 2026-04-06 13:17:27 +03:00
Viktor Barzin
feeed5ac35 priority-pass: update backend to v7 (square QR container) 2026-04-06 13:16:35 +03:00
Viktor Barzin
ee2c6517ba fix(meshcentral): remove unused NFS modules after Wave 2 storage migration
MeshCentral was migrated from NFS to proxmox-lvm storage (Wave 2). The old NFS
modules for data and files are no longer used by the deployment, leaving behind
orphaned PVCs (meshcentral-data, meshcentral-files). The backups volume remains
on NFS per the backup strategy pattern.

Changes:
- Removed module.nfs_data and module.nfs_files from Terraform config
- Active volumes now: meshcentral-data-proxmox, meshcentral-files-proxmox (proxmox-lvm)
- Backups volume: meshcentral-backups (NFS) - unchanged

Pod status: healthy, running on proxmox-lvm volumes.
2026-04-06 13:13:16 +03:00
Viktor Barzin
7b94abd54e state(meshcentral): update encrypted state 2026-04-06 13:13:09 +03:00
Viktor Barzin
0162b4f130 priority-pass: update backend to v6 (remove edge artifact scratches) 2026-04-06 13:09:45 +03:00
Viktor Barzin
c38ae944fc priority-pass: update backend to v5 (QR container sizing fix) 2026-04-06 13:06:18 +03:00
Viktor Barzin
9492874c43 fix: restore technitium MySQL query logging with Vault auto-rotation [ci skip]
Query logs stopped syncing on 2026-03-16 due to password mismatch after
MySQL cluster rebuild and Technitium app config reset.

- Add Vault static role mysql-technitium (7-day rotation)
- Add ExternalSecret for technitium-db-creds in technitium namespace
- Add password-sync CronJob (6h) to push rotated password to Technitium API
- Update Grafana datasource to use ESO-managed password
- Remove stale technitium_db_password variable (replaced by ESO)
- Update databases.md and restore-mysql.md runbook
2026-04-06 13:00:49 +03:00
Viktor Barzin
1d7244e47a priority-pass: update backend to v4 (QR container clipping fix) 2026-04-06 13:00:33 +03:00
Viktor Barzin
0c44e11146 priority-pass: update backend to v3 (QR container layout fix) 2026-04-06 12:57:08 +03:00
Viktor Barzin
75b18717a1 priority-pass: update backend to v2 (QR code preservation fix) 2026-04-06 12:53:45 +03:00
Viktor Barzin
ef6f57e82c priority-pass: update frontend image to v5 (clipboard paste support) 2026-04-06 12:44:19 +03:00