446 commits

Author SHA1 Message Date
Viktor Barzin
9338af3c29 fix(dbaas): raise ResourceQuota to 40Gi and add sidecar resources
MySQL operator ignores podSpec.containers sidecar resource overrides,
always injecting 6Gi limit defaults. Added sidecar to CR spec for
documentation but raised quota from 32Gi to 40Gi as the practical fix.
Quota usage drops from 99% to 79%.
2026-04-06 15:57:47 +03:00
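The drop from 99% to 79% follows directly from the quota change if the amount of requested memory stays the same. A minimal sketch of that arithmetic (the function name is illustrative and not part of the repository):

```python
def usage_after_raise(old_quota_gi: float, old_usage_pct: float, new_quota_gi: float) -> float:
    """Return the usage percentage under the new quota.

    Assumes the absolute amount of requested memory is unchanged and only
    the quota ceiling moves.
    """
    used_gi = old_quota_gi * old_usage_pct / 100.0
    return used_gi / new_quota_gi * 100.0


if __name__ == "__main__":
    # 99% of 32Gi is about 31.7Gi, which is roughly 79% of 40Gi.
    print(f"{usage_after_raise(32, 99, 40):.1f}%")
```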
Viktor Barzin
fbdb57eb58 fix(monitoring): UsingInverterEnergyForTooLong only alerts when stuck
Changed from simple time-based (24h on inverter) to condition-based:
only fires when on inverter AND battery charge <80% for 1h. This means
normal daytime inverter usage won't trigger alerts — only fires when
the grid is unavailable and battery is draining.
2026-04-06 15:43:47 +03:00
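The difference between the old time-based rule and the new condition-based rule can be sketched in plain Python. This is only a model of the behaviour described above: the real alert is a Prometheus rule that is not shown here, and the `Sample` type and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Sample:
    minute: int         # minutes since the start of the observation window
    on_inverter: bool   # True while the load is fed from the inverter
    battery_pct: float  # battery state of charge in percent


def alert_fires(samples: Iterable[Sample],
                battery_threshold: float = 80.0,
                hold_minutes: int = 60) -> bool:
    """Return True once 'on inverter AND battery below threshold' has held
    continuously for hold_minutes."""
    pending_since = None
    for s in samples:
        condition = s.on_inverter and s.battery_pct < battery_threshold
        if not condition:
            pending_since = None  # condition broke, restart the timer
            continue
        if pending_since is None:
            pending_since = s.minute
        if s.minute - pending_since >= hold_minutes:
            return True
    return False
```

Under this model, a full day on the inverter with a well-charged battery never fires, while an hour on the inverter with the battery below 80% does.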
Viktor Barzin
b5689afe6d fix(monitoring): tune alert thresholds to reduce false positives
- HighPowerUsage: raise from 200W to 300W (R730 idles at ~230W)
- HighServiceLatency: exclude headscale (WebSocket) and authentik (SSO)
  from latency checks — both have inherently high avg response times
2026-04-06 15:39:23 +03:00
Viktor Barzin
91242b0b40 feat(monitoring): add comprehensive hardware exporter alerts
Added 21 new alerts across 3 rule groups:

Power (8 new):
- UPSAlarmsActive, UPSBatteryDegraded, UPSOverloaded, UPSOutputVoltageAbnormal
- ATSFault, ATSPowerFault, ATSOverload, ATSInputVoltageAbnormal

Server Health (10 new):
- iDRACSystemUnhealthy, iDRACPowerSupplyUnhealthy, iDRACMemoryUnhealthy
- iDRACStorageDriveUnhealthy, iDRACSSDWearCritical/Warning
- iDRACServerPoweredOff, ProxmoxExporterDown
- FuseMainFault, FuseGarageFault

Metric Staleness (3 new):
- FuseMainMetricsMissing, FuseGarageMetricsMissing, ProxmoxMetricsMissing

Plus 4 new inhibition rules for alert cascade protection.
2026-04-06 15:31:50 +03:00
Viktor Barzin
6abc0b9742 security(monitoring): remove public SNMP exporter ingress
snmp-exporter-external.viktorbarzin.me exposed UPS metrics to the
public internet with no authentication. Removed the external ingress
and Cloudflare DNS record. ha-sofia now accesses the SNMP exporter
via the existing .lan ingress (allow_local_access_only=true) using
direct IP 10.0.20.200 with Host header.
2026-04-06 15:23:56 +03:00
Viktor Barzin
7f141faa8c Fix: Expose SNMP exporter externally to ha-sofia via Cloudflare tunnel
- Add snmp-exporter-ingress-external module for external HTTPS access to snmp-exporter
- Register snmp-exporter-external.viktorbarzin.me in Cloudflare DNS (proxied via tunnel)
- Update ha-sofia REST integration to use external HTTPS endpoint
- Fix ingress backend service routing to use existing snmp-exporter service
- All UPS sensors on ha-sofia now report values (voltage, battery %, load, etc.)
2026-04-06 15:14:19 +03:00
Viktor Barzin
a4c80adbce fix(prowlarr): correct image tag from 1.31.1 to 2.3.5 [ci skip]
LinuxServer.io prowlarr uses a different version scheme than the agent
guessed. Tag 1.31.1 doesn't exist on lscr.io.
2026-04-06 14:55:33 +03:00
Viktor Barzin
d009f9a0f2 add 3-2-1 backup pipeline: weekly PVC file copy, NFS mirror, pfsense, offsite sync
- weekly-backup.sh: mounts LVM thin snapshots ro, rsyncs files to /mnt/backup/pvc-data
  with --link-dest versioning (4 weeks). Also mirrors NFS backup dirs from TrueNAS,
  backs up pfsense (config.xml + full tar), PVE host config, and prunes >7d snapshots.
- offsite-sync-backup.sh: rsync --files-from manifest to Synology (no full dir walk).
  Monthly full --delete sync on 1st Sunday. After=weekly-backup.service dependency.
- lvm-pvc-snapshot.timer: changed to daily 03:00 (was 2x daily)
- Prometheus alerts: WeeklyBackupStale, WeeklyBackupFailing, PfsenseBackupStale,
  OffsiteBackupSyncStale, BackupDiskFull. LVMSnapshotStale threshold 24h→48h.
2026-04-06 14:53:28 +03:00
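The "monthly full --delete sync on 1st Sunday" rule comes down to a date check. The sketch below shows one way to express it. The actual logic in offsite-sync-backup.sh is not shown in this log, so the function names and the mode labels are assumptions.

```python
from datetime import date


def is_first_sunday(day: date) -> bool:
    """True if day is the first Sunday of its month."""
    # date.weekday(): Monday is 0, Sunday is 6. The first Sunday of a
    # month always falls on day 1 through 7.
    return day.weekday() == 6 and day.day <= 7


def sync_mode(day: date) -> str:
    """Pick the (hypothetical) sync mode for a given run date."""
    return "full-delete" if is_first_sunday(day) else "manifest"


if __name__ == "__main__":
    print(sync_mode(date(2026, 4, 5)))
    print(sync_mode(date(2026, 4, 12)))
```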
Viktor Barzin
09b4bad958 feat: pin ~28 images to specific versions, enable DIUN monitoring, add app-stacks pipeline
Pin third-party images from :latest to current stable versions:
- Platform: cloudflared, technitium, snmp-exporter, pve-exporter,
  headscale, shadowsocks, xray
- Apps: paperless-ngx, linkwarden, wealthfolio, speedtest, synapse,
  n8n, prowlarr, qbittorrent, lidarr, rybbit, ollama, immichframe,
  cyberchef, networking-toolbox, echo, coturn, shlink, affine

Enable DIUN annotations on all pinned deployments with per-image
tag patterns. Add Woodpecker app-stacks pipeline for selective
terragrunt apply on changed app stacks.
2026-04-06 14:27:13 +03:00
Viktor Barzin
a81f7df2a0 feat(diun): add auto-update infrastructure
- Custom DIUN image with git/ssh for script notifier
- Auto-update script: detects new image versions, updates .tf files, pushes
- ESO secret for git credentials, persistent repo clone PVC
- GHA workflow to build custom DIUN image
- Skips databases and CI/CD-managed images automatically
2026-04-06 14:27:01 +03:00
Viktor Barzin
7cfcbfa405 upgrade immich v2.6.1 → v2.6.3 (bug fixes only) [ci skip] 2026-04-06 14:26:56 +03:00
Viktor Barzin
9e25441c30 fix: restore changedetection and flaresolverr services
- changedetection: increase memory from 64Mi to 256Mi/512Mi (was OOMKilling),
  set replicas back to 1
- flaresolverr: re-enable with replicas=1, increase memory limit to 1Gi
  (needed by book-search for Cloudflare bypass)
2026-04-06 14:26:29 +03:00
Viktor Barzin
25aee1d3e9 fix(mailserver): delete all e2e-probe emails, not just current marker
Previously only searched for the current run's specific marker subject.
If IMAP deletion failed, old emails accumulated. Now searches for all
emails with "e2e-probe" in subject and deletes them, cleaning up any
leftovers from prior failed runs.
2026-04-06 13:39:47 +03:00
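A cleanup along these lines can be written with Python's standard imaplib. This is an illustrative sketch and not the probe's actual code: the function name is made up, and it assumes the caller has already logged in and selected the mailbox.

```python
import imaplib


def delete_probe_emails(conn: imaplib.IMAP4, marker: str = "e2e-probe") -> int:
    """Delete every message in the selected mailbox whose subject contains
    marker, and return how many messages were flagged for deletion."""
    status, data = conn.search(None, "SUBJECT", f'"{marker}"')
    if status != "OK" or not data or not data[0]:
        return 0
    message_ids = data[0].split()
    for msg_id in message_ids:
        # Flag each match; nothing is removed until expunge() runs.
        conn.store(msg_id, "+FLAGS", "\\Deleted")
    conn.expunge()
    return len(message_ids)
```

Searching on the shared marker rather than on one run's unique subject means leftovers from earlier failed runs are removed by the next successful one.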
Viktor Barzin
9349d5d566 fix(meshcentral): use service port 80→443 to prevent Traefik HTTPS
Root cause: Traefik v3 auto-detects HTTPS for backend port 443,
ignoring the port name "http" and serversscheme annotations.
MeshCentral serves HTTP on 443 (TLSOffload mode), but Traefik
connected via HTTPS causing TLS handshake failure → 500.

Fix: Change K8s service port from 443 to 80 with target_port 443.
Traefik sees port 80 → uses HTTP → reaches MeshCentral correctly.
Also disables anti-AI scraping (internal tool behind Authentik).
2026-04-06 13:38:30 +03:00
Viktor Barzin
7f13f5fd76 fix(meshcentral): disable anti-AI middleware causing 500 errors
The rewrite-body plugin (anti-AI trap links) was crashing when
processing MeshCentral's HTML responses, returning 500. Disabled
anti_ai_scraping since it's a protected internal tool behind Authentik.
Re-enabled Authentik protection.
2026-04-06 13:32:37 +03:00
Viktor Barzin
66f1e2ea3b fix(meshcentral): re-enable TLSOffload for Traefik reverse proxy
The previous init container incorrectly disabled TLSOffload, causing
MeshCentral to serve HTTPS on port 443. Traefik connects via HTTP,
resulting in protocol mismatch and 500 errors. Fix ensures TLSOffload
is always enabled so MeshCentral serves plain HTTP behind Traefik.
2026-04-06 13:29:21 +03:00
Viktor Barzin
cba79cde35 fix(meshcentral): disable certUrl when using TLSOffload
MeshCentral was failing to start with "Zipencryptionmodule failed" error
because the service tried to fetch TLS certificates from an HTTPS endpoint
during bootstrap. When using TLSOffload (reverse proxy terminating TLS),
MeshCentral should not attempt to load certificates.

Root cause: The existing config.json had "certUrl" set to HTTPS, causing
MeshCentral to try fetching the certificate during startup. Since the pod
was bootstrapping, this failed and cascaded into the Zipencryptionmodule
failure.

Fix: Add init container that runs before the main container to disable
the certUrl by prefixing it with underscore (MeshCentral's convention for
disabled settings). The sed command ensures the fix applies to both new
and existing config.json files.

This ensures MeshCentral behaves correctly with TLSOffload enabled:
- Runs in plain HTTP mode on port 443
- Traefik/Ingress handles HTTPS termination
- No certificate bootstrap failures
2026-04-06 13:22:59 +03:00
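The commit does this with sed in an init container. The same underscore-prefix idea can be sketched in Python for clarity. The config layout used in the example is hypothetical (the real config.json is not shown here), so the helper renames the key wherever it appears instead of assuming a fixed location.

```python
import json


def disable_setting(node, key: str):
    """Return a copy of node with every occurrence of key renamed to _key.

    Running it twice is harmless: an already-prefixed key no longer
    matches, so the second pass changes nothing.
    """
    if isinstance(node, dict):
        return {
            (f"_{k}" if k == key else k): disable_setting(v, key)
            for k, v in node.items()
        }
    if isinstance(node, list):
        return [disable_setting(item, key) for item in node]
    return node


if __name__ == "__main__":
    # Hypothetical layout and placeholder URL, for illustration only.
    raw = '{"settings": {"certUrl": "https://example.invalid", "TLSOffload": true}}'
    print(json.dumps(disable_setting(json.loads(raw), "certUrl"), indent=2))
```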
Viktor Barzin
6ee5b70a36 priority-pass: update backend to v8 (expanded QR container margins) 2026-04-06 13:22:27 +03:00
Viktor Barzin
feeed5ac35 priority-pass: update backend to v7 (square QR container) 2026-04-06 13:16:35 +03:00
Viktor Barzin
ee2c6517ba fix(meshcentral): remove unused NFS modules after Wave 2 storage migration
MeshCentral was migrated from NFS to proxmox-lvm storage (Wave 2). The old NFS
modules for data and files are no longer used by the deployment, leaving behind
orphaned PVCs (meshcentral-data, meshcentral-files). The backups volume remains
on NFS per the backup strategy pattern.

Changes:
- Removed module.nfs_data and module.nfs_files from Terraform config
- Active volumes now: meshcentral-data-proxmox, meshcentral-files-proxmox (proxmox-lvm)
- Backups volume: meshcentral-backups (NFS) - unchanged

Pod status: healthy, running on proxmox-lvm volumes.
2026-04-06 13:13:16 +03:00
Viktor Barzin
0162b4f130 priority-pass: update backend to v6 (remove edge artifact scratches) 2026-04-06 13:09:45 +03:00
Viktor Barzin
c38ae944fc priority-pass: update backend to v5 (QR container sizing fix) 2026-04-06 13:06:18 +03:00
Viktor Barzin
9492874c43 fix: restore technitium MySQL query logging with Vault auto-rotation [ci skip]
Query logs stopped syncing on 2026-03-16 due to password mismatch after
MySQL cluster rebuild and Technitium app config reset.

- Add Vault static role mysql-technitium (7-day rotation)
- Add ExternalSecret for technitium-db-creds in technitium namespace
- Add password-sync CronJob (6h) to push rotated password to Technitium API
- Update Grafana datasource to use ESO-managed password
- Remove stale technitium_db_password variable (replaced by ESO)
- Update databases.md and restore-mysql.md runbook
2026-04-06 13:00:49 +03:00
Viktor Barzin
1d7244e47a priority-pass: update backend to v4 (QR container clipping fix) 2026-04-06 13:00:33 +03:00
Viktor Barzin
0c44e11146 priority-pass: update backend to v3 (QR container layout fix) 2026-04-06 12:57:08 +03:00
Viktor Barzin
75b18717a1 priority-pass: update backend to v2 (QR code preservation fix) 2026-04-06 12:53:45 +03:00
Viktor Barzin
ef6f57e82c priority-pass: update frontend image to v5 (clipboard paste support) 2026-04-06 12:44:19 +03:00
Viktor Barzin
3676cdbeeb state(technitium): update encrypted state 2026-04-06 12:40:55 +03:00
root
a038b2a2c4 Woodpecker CI deploy commit [CI SKIP] 2026-04-06 09:28:10 +00:00
Viktor Barzin
5b43e57efa actualbudget: use internal ClusterIP for http-api server URL
The http-api sidecar was connecting to the public URL
(https://budget-*.viktorbarzin.me) which goes through Traefik/Authentik.
When pods got rescheduled to different nodes, this caused ETIMEDOUT errors.

Changed to internal service URL (http://budget-*.actualbudget.svc.cluster.local)
which is fast and reliable regardless of pod placement.
2026-04-06 12:22:57 +03:00
Viktor Barzin
aac02f0467 meshcentral: restore DB from backup; technitium: remove orphaned PVC
- meshcentral: fix homepage annotations formatting (no functional change,
  serversscheme was tested but not needed since MeshCentral serves HTTP)
- meshcentral: restored user DB from Dec 2024 backup (1428B → 45KB)
- technitium: remove unused technitium-config-proxmox PVC (WaitForFirstConsumer,
  never mounted — primary uses NFS, replicas have their own proxmox PVCs)
2026-04-06 12:17:08 +03:00
Viktor Barzin
0de2fef9c9 misc: actualbudget, authentik, headscale, rybbit, terminal, dbaas updates
- actualbudget: adjust resource config
- authentik: add configuration
- headscale: minor fix
- rybbit: add resources
- terminal: add terminal stack config
- platform/dbaas: add config
- infra: update lock file
2026-04-06 11:58:00 +03:00
Viktor Barzin
f9e85964ce traefik: add middleware and platform traefik config updates 2026-04-06 11:57:52 +03:00
Viktor Barzin
dca06b8a00 freedify: increase memory limits and add new features 2026-04-06 11:57:47 +03:00
Viktor Barzin
61dc7a6862 nextcloud: refactor chart values and main.tf configuration 2026-04-06 11:57:44 +03:00
Viktor Barzin
fe342a974b monitoring + proxmox-csi: LVM snapshot RBAC, pushgateway NodePort, backup dashboard
- proxmox-csi: add RBAC for PVE host snapshot restore script
- monitoring: expose Pushgateway via NodePort for PVE LVM snapshot metrics
- monitoring: add backup health Grafana dashboard
2026-04-06 11:57:41 +03:00
Viktor Barzin
b0178cf6d2 technitium: add tertiary DNS replica and fix CoreDNS forward order
- Add tertiary DNS deployment with zone-transfer replication for
  externalTrafficPolicy=Local coverage across more nodes
- Reorder CoreDNS default forwarders: pfSense (10.0.20.1) first,
  then public DNS fallbacks (8.8.8.8, 1.1.1.1)
2026-04-06 11:57:31 +03:00
Viktor Barzin
f80e1fa868 cluster health fixes: NFS CSI, Immich ML, dbaas, Redis, DNS, trading-bot removal
- NFS CSI: fix liveness-probe port conflict (29652 → 29653)
- Immich ML: add gpu-workload priority class to enable preemption on node1
- dbaas: right-size MySQL memory limits (sidecar 6Gi→350Mi, main 4Gi→3Gi)
- Redis: add redis-master service via HAProxy for master-only routing,
  update config.tfvars redis_host to use it
- CoreDNS: forward .viktorbarzin.lan to Technitium ClusterIP (10.96.0.53)
  instead of stale LoadBalancer IP (10.0.20.200)
- Trading bot: comment out all resources (no longer needed)
- Vault: remove trading-bot PostgreSQL database role
2026-04-06 11:54:45 +03:00
Viktor Barzin
faa6868f79 remove claude-memory PDB (blocks drains with single replica)
Single replica + minAvailable=1 makes drains impossible.
claude-memory is non-critical and recovers quickly. [ci skip]
2026-04-06 00:47:40 +03:00
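Why a single replica with minAvailable=1 blocks drains can be seen with a one-line model. This is a simplified sketch of how an integer minAvailable limits voluntary evictions, not the Kubernetes implementation.

```python
def allowed_disruptions(healthy_replicas: int, min_available: int) -> int:
    """Number of pods that may be evicted voluntarily without dropping
    below min_available (simplified model of a PodDisruptionBudget)."""
    return max(0, healthy_replicas - min_available)


if __name__ == "__main__":
    # One replica with minAvailable=1 leaves no room for an eviction.
    print(allowed_disruptions(healthy_replicas=1, min_available=1))
    # A second replica would allow one pod to be evicted at a time.
    print(allowed_disruptions(healthy_replicas=2, min_available=1))
```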
Viktor Barzin
c8be07c403 resilience improvements: MySQL anti-affinity comment, descheduler 5min, prometheus termination 60s
- MySQL InnoDB: keep required anti-affinity but document why (2/3 members OK during node loss)
- Descheduler: increase frequency from hourly to every 5 min for faster rebalancing
- Prometheus: set terminationGracePeriodSeconds=60 to prevent drain timeout [ci skip]
2026-04-06 00:25:49 +03:00
Viktor Barzin
0f2ef356d6 fix: remove ISCSICSIControllerDown alert (democratic-csi decommissioned)
iSCSI CSI (democratic-csi) was replaced by proxmox-csi in April 2026.
Controller is intentionally scaled to 0. Remove the stale alert and
update CSIDriverCrashLoop to monitor proxmox-csi instead of iscsi-csi.
2026-04-05 23:53:18 +03:00
Viktor Barzin
ae0585048a fix: bump tier-1-cluster LimitRange max to 8Gi for MySQL 6Gi limit
Kyverno's tier-1-cluster LimitRange had max=4Gi which blocked
mysql-cluster-2 from starting after we bumped MySQL to 6Gi limit.
Also added custom LimitRange in dbaas stack (for when Terraform
manages it directly).
2026-04-05 23:31:23 +03:00
Viktor Barzin
772f59d589 fix: add Vault-managed DB credentials for Matrix Synapse
- Create dedicated 'matrix' PostgreSQL user (was using 'postgres' superuser)
- Add Vault DB static role pg-matrix with 24h rotation
- Add ExternalSecret matrix-db-creds syncing password from Vault
- Add inject-db-password init container that patches homeserver.yaml
  with current Vault password on every pod start
- Update dependency annotation to pg-cluster-rw.dbaas
- Also updated Vault DB config to use pg-cluster-rw (was legacy postgresql.dbaas)
2026-04-05 23:18:16 +03:00
Viktor Barzin
e064778c2c fix: resolve tandoor, matrix, navidrome crash loops
- Tandoor: pin image to vabene1111/recipes:1.5.27 (latest tag pull
  failing with EOF from pull-through cache corruption)
- Matrix: update homeserver.yaml to use pg-cluster-rw.dbaas instead
  of legacy postgresql.dbaas service, update CNPG postgres password
- Navidrome: deleted corrupted SQLite DB (malformed disk image from
  proxmox-lvm migration), navidrome recreates fresh DB on startup
2026-04-05 23:12:49 +03:00
Viktor Barzin
4da8f0242f fix: right-size service memory after PVE RAM upgrade (142→272GB)
- MySQL InnoDB: 2Gi/4Gi → 3Gi/6Gi (was at 97% of limit)
- Redis HAProxy: 16Mi/16Mi → 32Mi/64Mi (OOMKilled)
- Plotting-book: 64Mi/64Mi → 128Mi/256Mi (OOMKilled)
- Tandoor: 256Mi/256Mi → 384Mi/512Mi (60 OOM restarts), re-enabled
- Navidrome: 128Mi/128Mi → 256Mi/384Mi
- Matrix: add explicit 256Mi/512Mi resources
- Trading-bot workers: 64Mi/64Mi → 128Mi/256Mi, re-enabled
- Tier 3-edge defaults: 96Mi/192Mi → 128Mi/256Mi
- Fallback tier defaults: 128Mi/128Mi → 128Mi/192Mi, max 2→4Gi
- Mailserver: disable rspamd-redis, fix Roundcube IPv6/IMAP, bump dovecot connections
2026-04-05 23:02:50 +03:00
Viktor Barzin
c946e5fdc9 tune controller-manager + apiserver for faster volume detach
- kube-controller-manager: --attach-detach-reconcile-sync-period=15s (was 1m default)
- kube-apiserver: --default-unreachable-toleration-seconds=60 (was 300s default)
- kube-apiserver: --default-not-ready-toleration-seconds=60 (was 300s default)

Reduces VolumeAttachment auto-detach from ~6 min to ~2 min on node failure.
Applied live + codified in cloud-init template. [ci skip]
2026-04-05 22:14:23 +03:00
Viktor Barzin
c239300154 fix: disable rspamd-redis and correct proxmox-lvm PVC size
ENABLE_RSPAMD_REDIS=0 prevents the docker-mailserver from attempting to start
an embedded Redis server. The rspamd-redis subprocess was failing repeatedly
due to a corrupted/empty RDB file after the recent NFS-to-proxmox-lvm storage
migration. Since the DKIM signing config uses use_redis=false, Redis is not
needed.

Also correct the PVC storage request to match the actual provisioned size (2Gi).
The mismatch was causing unnecessary PVC replacement during terraform apply.
2026-04-05 21:44:52 +03:00
Viktor Barzin
2eeb73fc57 add priority-based kubelet graceful shutdown ordering
Replace 2-bucket shutdownGracePeriod (240s non-critical / 60s critical) with
shutdownGracePeriodByPodPriority for 9-tier ordered shutdown:
  unclassified(20s) → tier-4-aux(20s) → tier-3-edge(30s) → tier-2-gpu(30s) →
  tier-1-cluster/DBs(90s) → tier-0-core(30s) → gpu(30s) → sys-cluster(30s) →
  sys-node(30s). Total: 310s.

Apps stop before databases, databases stop before infrastructure.
VM shutdown timeout: 300s → 420s. InhibitDelay: 300 → 480. [ci skip]
2026-04-05 20:54:36 +03:00
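The 310s total can be checked by summing the per-tier grace periods listed above. The sketch below only restates the numbers from this commit message; the variable names are illustrative.

```python
# Per-tier grace periods in shutdown order, taken from the commit message.
SHUTDOWN_TIERS = [
    ("unclassified", 20),
    ("tier-4-aux", 20),
    ("tier-3-edge", 30),
    ("tier-2-gpu", 30),
    ("tier-1-cluster/DBs", 90),
    ("tier-0-core", 30),
    ("gpu", 30),
    ("sys-cluster", 30),
    ("sys-node", 30),
]

TOTAL_GRACE_S = sum(seconds for _, seconds in SHUTDOWN_TIERS)
VM_SHUTDOWN_TIMEOUT_S = 420  # new value from the commit message

if __name__ == "__main__":
    print(f"tiers: {len(SHUTDOWN_TIERS)}, total: {TOTAL_GRACE_S}s")
    print(f"headroom before VM timeout: {VM_SHUTDOWN_TIMEOUT_S - TOTAL_GRACE_S}s")
```

The total exceeds the previous 300s VM shutdown timeout, which is consistent with the timeout being raised in the same commit.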
Viktor Barzin
56583c3825 fix: add retry middleware and per-service rate limit for ha-sofia
The global rate limit (10 req/s, 50 burst) was too aggressive for HA
dashboards that load 30+ JS files on page load, causing 429s. VPN tunnel
blips between London K8s and Sofia caused 502s with no retry fallback.

- Add traefik-retry middleware to reverse-proxy factory (all services)
- Add skip_global_rate_limit variable to both reverse-proxy factories
- Create ha-sofia-rate-limit middleware (100 avg, 200 burst)
- Apply to ha-sofia and music-assistant (both route to Sofia)
2026-04-05 20:47:58 +03:00
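The effect of the two limits can be illustrated with a small token-bucket simulation. This is a sketch assuming token-bucket semantics (average rate plus burst) and is not Traefik's implementation. The request count of 80 is a hypothetical load, not a measurement from this setup.

```python
class TokenBucket:
    """Minimal token bucket: refills at rate_per_s, holds at most burst tokens."""

    def __init__(self, rate_per_s: float, burst: int):
        self.rate = rate_per_s
        self.capacity = float(burst)
        self.tokens = float(burst)  # start full
        self.last = 0.0

    def allow(self, now: float) -> bool:
        # Refill for the time elapsed since the previous request.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # the caller would answer with HTTP 429


def rejected(rate_per_s: float, burst: int, requests: int, window_s: float = 1.0) -> int:
    """Count rejections when requests arrive evenly spread over window_s."""
    bucket = TokenBucket(rate_per_s, burst)
    step = window_s / requests
    return sum(0 if bucket.allow(i * step) else 1 for i in range(requests))


if __name__ == "__main__":
    print("global limit (10/s, burst 50):", rejected(10, 50, 80))
    print("ha-sofia limit (100/s, burst 200):", rejected(100, 200, 80))
```

With the global settings the bucket empties partway through the burst and later requests are rejected, while the per-service settings absorb the same load without rejections.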
Viktor Barzin
3cd560d4d9 fix: bank sync alerts - remove {{ $labels.job }} that Helm provider silently drops [ci skip]
The Terraform Helm provider's YAML diff comparison silently ignores rules
containing {{ $labels.job }} in annotations, preventing the alerts from being
applied. Also syncs alerts to platform stack tpl.
2026-04-05 20:07:51 +03:00