Commit graph

2323 commits

Author SHA1 Message Date
Viktor Barzin
ec50aaae59 state(changedetection): update encrypted state 2026-04-06 14:23:30 +03:00
Viktor Barzin
25aee1d3e9 fix(mailserver): delete all e2e-probe emails, not just current marker
Previously only searched for the current run's specific marker subject.
If IMAP deletion failed, old emails accumulated. Now searches for all
emails with "e2e-probe" in subject and deletes them, cleaning up any
leftovers from prior failed runs.
2026-04-06 13:39:47 +03:00
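The cleanup described in this commit can be sketched as follows. This is a minimal illustration, not the probe's actual code: the function name `delete_probe_emails`, the `INBOX` default, and the assumption that the probe talks to the server through Python's `imaplib` are all hypothetical. It searches for every message whose subject contains the marker substring, flags each one as deleted, and expunges once at the end.

```python
import imaplib


def delete_probe_emails(conn: imaplib.IMAP4, mailbox: str = "INBOX",
                        needle: str = "e2e-probe") -> int:
    """Delete every message whose subject contains `needle`.

    Returns the number of messages flagged for deletion. `conn` must
    already be logged in (hypothetical helper, not the author's code).
    """
    conn.select(mailbox)
    # IMAP SUBJECT search is a substring match, so this also catches
    # leftovers from earlier runs that used a different marker suffix.
    status, data = conn.search(None, "SUBJECT", f'"{needle}"')
    if status != "OK" or not data or not data[0]:
        return 0
    msg_ids = data[0].split()
    for msg_id in msg_ids:
        conn.store(msg_id, "+FLAGS", "\\Deleted")
    # Expunge once, after all flags are set, so sequence numbers stay stable.
    conn.expunge()
    return len(msg_ids)
```

Matching on the shared substring rather than on the current run's marker is what makes the cleanup self-healing: a run that fails to delete its own message no longer leaves it behind permanently.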
Viktor Barzin
9349d5d566 fix(meshcentral): expose service port 80 (target 443) to stop Traefik using HTTPS
Root cause: Traefik v3 auto-detects HTTPS for backend port 443,
ignoring the port name "http" and serversscheme annotations.
MeshCentral serves HTTP on 443 (TLSOffload mode), but Traefik
connected via HTTPS, causing a TLS handshake failure → 500.

Fix: Change K8s service port from 443 to 80 with target_port 443.
Traefik sees port 80 → uses HTTP → reaches MeshCentral correctly.
Also disables anti-AI scraping (internal tool behind Authentik).
2026-04-06 13:38:30 +03:00
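The port mapping from this fix can be pictured as a Service manifest. The sketch below builds one as a plain Python dictionary. The repository itself manages this through Terraform, so the function name, the `app` selector label, and the manifest layout are assumptions made only for illustration.

```python
import json


def service_manifest(name: str, namespace: str,
                     port: int = 80, target_port: int = 443) -> dict:
    """Build a Service that exposes `port` and forwards to `target_port`.

    Hypothetical helper: the `app` selector label is an assumption.
    """
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "selector": {"app": name},
            "ports": [
                {
                    "name": "http",
                    # The proxy sees the service port (80), not the container port.
                    "port": port,
                    # The container still listens on 443, serving plain HTTP.
                    "targetPort": target_port,
                }
            ],
        },
    }


manifest = service_manifest("meshcentral", "meshcentral")
print(json.dumps(manifest["spec"]["ports"], indent=2))
```

Only the service-facing port changes; the container keeps listening on 443, so nothing inside MeshCentral needs to be reconfigured.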
Viktor Barzin
2ced1e8fb5 state(meshcentral): update encrypted state 2026-04-06 13:38:09 +03:00
Viktor Barzin
65b320873c state(meshcentral): update encrypted state 2026-04-06 13:37:41 +03:00
Viktor Barzin
41a329c0a5 state(meshcentral): update encrypted state 2026-04-06 13:36:22 +03:00
Viktor Barzin
33485dc5c7 state(meshcentral): update encrypted state 2026-04-06 13:36:00 +03:00
Viktor Barzin
355c787169 state(meshcentral): update encrypted state 2026-04-06 13:33:21 +03:00
Viktor Barzin
7f13f5fd76 fix(meshcentral): disable anti-AI middleware causing 500 errors
The rewrite-body plugin (anti-AI trap links) was crashing when
processing MeshCentral's HTML responses, returning 500. Disabled
anti_ai_scraping, since MeshCentral is an internal tool already protected
by Authentik. Re-enabled Authentik protection.
2026-04-06 13:32:37 +03:00
Viktor Barzin
09501fdb64 state(meshcentral): update encrypted state 2026-04-06 13:32:16 +03:00
Viktor Barzin
d549efbe11 state(meshcentral): update encrypted state 2026-04-06 13:30:12 +03:00
Viktor Barzin
66f1e2ea3b fix(meshcentral): re-enable TLSOffload for Traefik reverse proxy
The previous init container incorrectly disabled TLSOffload, causing
MeshCentral to serve HTTPS on port 443. Traefik connects via HTTP,
resulting in protocol mismatch and 500 errors. Fix ensures TLSOffload
is always enabled so MeshCentral serves plain HTTP behind Traefik.
2026-04-06 13:29:21 +03:00
Viktor Barzin
cf62771177 state(meshcentral): update encrypted state 2026-04-06 13:28:34 +03:00
Viktor Barzin
e3514d7d5b state(meshcentral): update encrypted state 2026-04-06 13:26:59 +03:00
Viktor Barzin
cba79cde35 fix(meshcentral): disable certUrl when using TLSOffload
MeshCentral was failing to start with "Zipencryptionmodule failed" error
because the service tried to fetch TLS certificates from an HTTPS endpoint
during bootstrap. When using TLSOffload (reverse proxy terminating TLS),
MeshCentral should not attempt to load certificates.

Root cause: The existing config.json had "certUrl" set to HTTPS, causing
MeshCentral to try fetching the certificate during startup. Since the pod
was bootstrapping, this failed and cascaded into the Zipencryptionmodule
failure.

Fix: Add init container that runs before the main container to disable
the certUrl by prefixing it with underscore (MeshCentral's convention for
disabled settings). The sed command ensures the fix applies to both new
and existing config.json files.

This ensures MeshCentral behaves correctly with TLSOffload enabled:
- Runs in plain HTTP mode on port 443
- Traefik/Ingress handles HTTPS termination
- No certificate bootstrap failures
2026-04-06 13:22:59 +03:00
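The underscore-prefix convention mentioned in this commit can be illustrated with a short sketch. The commit itself uses a sed command in an init container. The version below swaps in a JSON-aware rename instead, and assumes config.json is strict JSON. The function names and the example layout of the config are hypothetical.

```python
import json


def disable_key(node, key: str = "certUrl") -> int:
    """Rename `key` to `_key` wherever it appears in nested dicts/lists.

    Mutates `node` in place and returns how many keys were renamed.
    Running it again is a no-op, so it is safe for new and existing files.
    """
    renamed = 0
    if isinstance(node, dict):
        if key in node:
            node["_" + key] = node.pop(key)
            renamed += 1
        for value in node.values():
            renamed += disable_key(value, key)
    elif isinstance(node, list):
        for item in node:
            renamed += disable_key(item, key)
    return renamed


def disable_key_in_file(path: str, key: str = "certUrl") -> int:
    """Apply disable_key to a JSON file, rewriting it only if something changed."""
    with open(path, encoding="utf-8") as fh:
        config = json.load(fh)
    renamed = disable_key(config, key)
    if renamed:
        with open(path, "w", encoding="utf-8") as fh:
            json.dump(config, fh, indent=2)
    return renamed
```

Renaming the key rather than deleting it keeps the original value in the file, so the setting can be restored later by removing the underscore.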
Viktor Barzin
b8120b22c0 state(meshcentral): update encrypted state 2026-04-06 13:22:41 +03:00
Viktor Barzin
6ee5b70a36 priority-pass: update backend to v8 (expanded QR container margins) 2026-04-06 13:22:27 +03:00
Viktor Barzin
64c378d158 add critical instruction to update docs with every infra change [ci skip] 2026-04-06 13:21:49 +03:00
Viktor Barzin
fc233bd27f docs: comprehensive audit and update of all architecture docs and runbooks [ci skip]
Audited 14 documentation files against live cluster state and Terraform code.

Architecture docs:
- databases.md: MySQL 8.4.4, proxmox-lvm storage (not iSCSI), anti-affinity
  excludes k8s-node1 (GPU), 2Gi/3Gi resources, 7-day rotation (not 24h),
  CNPG 2 instances, PostGIS 16, postgresql.dbaas has endpoints
- overview.md: 1x CPU, ~160GB RAM, all nodes 32GB, proxmox-lvm storage,
  correct Vault paths (secret/ not kv/)
- compute.md: 272GB physical host RAM, ~160GB allocated to VMs
- secrets.md: 7-day rotation, 7 MySQL + 5 PG roles, correct ESO config
- networking.md: MetalLB pool 10.0.20.200-220
- ci-cd.md: 9 GHA projects, travel_blog 5.7GB

Runbooks:
- restore-mysql/postgresql: backup files are .sql.gz (not .sql)
- restore-vault: weekly backup (not daily), auto-unseal sidecar note
- restore-vaultwarden: PVC is proxmox (not iscsi)
- restore-full-cluster: updated node roles, removed trading

Reference docs:
- CLAUDE.md: 7-day rotation, removed trading from PG list
- AGENTS.md: 100+ stacks, proxmox-lvm, platform empty shell
- service-catalog.md: 6 new stacks, 14 stack column updates
2026-04-06 13:21:05 +03:00
Viktor Barzin
06359aa3fa state(mailserver): update encrypted state 2026-04-06 13:17:27 +03:00
Viktor Barzin
feeed5ac35 priority-pass: update backend to v7 (square QR container) 2026-04-06 13:16:35 +03:00
Viktor Barzin
ee2c6517ba fix(meshcentral): remove unused NFS modules after Wave 2 storage migration
MeshCentral was migrated from NFS to proxmox-lvm storage (Wave 2). The old NFS
modules for data and files are no longer used by the deployment, leaving behind
orphaned PVCs (meshcentral-data, meshcentral-files). The backups volume remains
on NFS per the backup strategy pattern.

Changes:
- Removed module.nfs_data and module.nfs_files from Terraform config
- Active volumes now: meshcentral-data-proxmox, meshcentral-files-proxmox (proxmox-lvm)
- Backups volume: meshcentral-backups (NFS) - unchanged

Pod status: healthy, running on proxmox-lvm volumes.
2026-04-06 13:13:16 +03:00
Viktor Barzin
7b94abd54e state(meshcentral): update encrypted state 2026-04-06 13:13:09 +03:00
Viktor Barzin
0162b4f130 priority-pass: update backend to v6 (remove edge artifact scratches) 2026-04-06 13:09:45 +03:00
Viktor Barzin
c38ae944fc priority-pass: update backend to v5 (QR container sizing fix) 2026-04-06 13:06:18 +03:00
Viktor Barzin
9492874c43 fix: restore technitium MySQL query logging with Vault auto-rotation [ci skip]
Query logs stopped syncing on 2026-03-16 due to password mismatch after
MySQL cluster rebuild and Technitium app config reset.

- Add Vault static role mysql-technitium (7-day rotation)
- Add ExternalSecret for technitium-db-creds in technitium namespace
- Add password-sync CronJob (6h) to push rotated password to Technitium API
- Update Grafana datasource to use ESO-managed password
- Remove stale technitium_db_password variable (replaced by ESO)
- Update databases.md and restore-mysql.md runbook
2026-04-06 13:00:49 +03:00
Viktor Barzin
1d7244e47a priority-pass: update backend to v4 (QR container clipping fix) 2026-04-06 13:00:33 +03:00
Viktor Barzin
0c44e11146 priority-pass: update backend to v3 (QR container layout fix) 2026-04-06 12:57:08 +03:00
Viktor Barzin
75b18717a1 priority-pass: update backend to v2 (QR code preservation fix) 2026-04-06 12:53:45 +03:00
Viktor Barzin
ef6f57e82c priority-pass: update frontend image to v5 (clipboard paste support) 2026-04-06 12:44:19 +03:00
Viktor Barzin
3676cdbeeb state(technitium): update encrypted state 2026-04-06 12:40:55 +03:00
Viktor Barzin
cd2d00703c state(vault): update encrypted state 2026-04-06 12:40:54 +03:00
root
a038b2a2c4 Woodpecker CI deploy commit [CI SKIP] 2026-04-06 09:28:10 +00:00
Viktor Barzin
5b43e57efa actualbudget: use internal ClusterIP for http-api server URL
The http-api sidecar was connecting to the public URL
(https://budget-*.viktorbarzin.me), which is routed through Traefik/Authentik.
When pods got rescheduled to different nodes, this caused ETIMEDOUT errors.

Changed to internal service URL (http://budget-*.actualbudget.svc.cluster.local)
which is fast and reliable regardless of pod placement.
2026-04-06 12:22:57 +03:00
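The internal URL pattern used in this change follows the standard Kubernetes service DNS form, `<service>.<namespace>.svc.<cluster-domain>`. A minimal sketch of building such a URL is shown below. The helper name and the example service name `budget-example` are hypothetical, since the commit elides the real names with a wildcard.

```python
from typing import Optional


def cluster_service_url(service: str, namespace: str,
                        port: Optional[int] = None, scheme: str = "http",
                        cluster_domain: str = "cluster.local") -> str:
    """Return the in-cluster URL for a Kubernetes Service.

    Hypothetical helper; `cluster.local` is the common default cluster
    domain, but a cluster may be configured with a different one.
    """
    host = f"{service}.{namespace}.svc.{cluster_domain}"
    if port is not None:
        host = f"{host}:{port}"
    return f"{scheme}://{host}"


print(cluster_service_url("budget-example", "actualbudget"))
```

Traffic to this address stays inside the cluster network and does not pass through the ingress proxy or its authentication layer.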
Viktor Barzin
07bad79489 state(status-page): update encrypted state 2026-04-06 12:20:54 +03:00
Viktor Barzin
5fef9945de state(actualbudget): update encrypted state 2026-04-06 12:20:28 +03:00
Viktor Barzin
4594472e69 state(status-page): update encrypted state 2026-04-06 12:20:01 +03:00
Viktor Barzin
aac02f0467 meshcentral: restore DB from backup; technitium: remove orphaned PVC
- meshcentral: fix homepage annotations formatting (no functional change,
  serversscheme was tested but not needed since MeshCentral serves HTTP)
- meshcentral: restored user DB from Dec 2024 backup (1428B → 45KB)
- technitium: remove unused technitium-config-proxmox PVC (WaitForFirstConsumer,
  never mounted — primary uses NFS, replicas have their own proxmox PVCs)
2026-04-06 12:17:08 +03:00
Viktor Barzin
f675b1492f state(meshcentral): update encrypted state 2026-04-06 12:10:41 +03:00
Viktor Barzin
bbc9ea3444 state(status-page): update encrypted state 2026-04-06 12:09:50 +03:00
Viktor Barzin
0de2fef9c9 misc: actualbudget, authentik, headscale, rybbit, terminal, dbaas updates
- actualbudget: adjust resource config
- authentik: add configuration
- headscale: minor fix
- rybbit: add resources
- terminal: add terminal stack config
- platform/dbaas: add config
- infra: update lock file
2026-04-06 11:58:00 +03:00
Viktor Barzin
c2f9ca0d13 modules: improve create-vm with additional config options and cloud-init updates 2026-04-06 11:57:55 +03:00
Viktor Barzin
f9e85964ce traefik: add middleware and platform traefik config updates 2026-04-06 11:57:52 +03:00
Viktor Barzin
dca06b8a00 freedify: increase memory limits and add new features 2026-04-06 11:57:47 +03:00
Viktor Barzin
61dc7a6862 nextcloud: refactor chart values and main.tf configuration 2026-04-06 11:57:44 +03:00
Viktor Barzin
fe342a974b monitoring + proxmox-csi: LVM snapshot RBAC, pushgateway NodePort, backup dashboard
- proxmox-csi: add RBAC for PVE host snapshot restore script
- monitoring: expose Pushgateway via NodePort for PVE LVM snapshot metrics
- monitoring: add backup health Grafana dashboard
2026-04-06 11:57:41 +03:00
Viktor Barzin
72d832fee7 add HA Sofia checks (26-29) to cluster healthcheck and backup-dr docs
- Healthcheck: add entity availability, integration health, automation
  status, and system resources checks for Home Assistant Sofia
- Docs: add backup-dr architecture documentation
2026-04-06 11:57:36 +03:00
Viktor Barzin
b0178cf6d2 technitium: add tertiary DNS replica and fix CoreDNS forward order
- Add tertiary DNS deployment with zone-transfer replication for
  externalTrafficPolicy=Local coverage across more nodes
- Reorder CoreDNS default forwarders: pfSense (10.0.20.1) first,
  then public DNS fallbacks (8.8.8.8, 1.1.1.1)
2026-04-06 11:57:31 +03:00
Viktor Barzin
f80e1fa868 cluster health fixes: NFS CSI, Immich ML, dbaas, Redis, DNS, trading-bot removal
- NFS CSI: fix liveness-probe port conflict (29652 → 29653)
- Immich ML: add gpu-workload priority class to enable preemption on node1
- dbaas: right-size MySQL memory limits (sidecar 6Gi→350Mi, main 4Gi→3Gi)
- Redis: add redis-master service via HAProxy for master-only routing,
  update config.tfvars redis_host to use it
- CoreDNS: forward .viktorbarzin.lan to Technitium ClusterIP (10.96.0.53)
  instead of stale LoadBalancer IP (10.0.20.200)
- Trading bot: comment out all resources (no longer needed)
- Vault: remove trading-bot PostgreSQL database role
2026-04-06 11:54:45 +03:00
Viktor Barzin
0115320d72 state(status-page): update encrypted state 2026-04-06 11:48:40 +03:00