Commit graph

2292 commits

Author SHA1 Message Date
Viktor Barzin
cd2d00703c state(vault): update encrypted state 2026-04-06 12:40:54 +03:00
root
a038b2a2c4 Woodpecker CI deploy commit [CI SKIP] 2026-04-06 09:28:10 +00:00
Viktor Barzin
5b43e57efa actualbudget: use internal ClusterIP for http-api server URL
The http-api sidecar was connecting to the public URL
(https://budget-*.viktorbarzin.me) which goes through Traefik/Authentik.
When pods got rescheduled to different nodes, this caused ETIMEDOUT errors.

Changed to internal service URL (http://budget-*.actualbudget.svc.cluster.local)
which is fast and reliable regardless of pod placement.
2026-04-06 12:22:57 +03:00
Viktor Barzin
07bad79489 state(status-page): update encrypted state 2026-04-06 12:20:54 +03:00
Viktor Barzin
5fef9945de state(actualbudget): update encrypted state 2026-04-06 12:20:28 +03:00
Viktor Barzin
4594472e69 state(status-page): update encrypted state 2026-04-06 12:20:01 +03:00
Viktor Barzin
aac02f0467 meshcentral: restore DB from backup; technitium: remove orphaned PVC
- meshcentral: fix homepage annotations formatting (no functional change,
  serversscheme was tested but not needed since MeshCentral serves HTTP)
- meshcentral: restored user DB from Dec 2024 backup (1428B → 45KB)
- technitium: remove unused technitium-config-proxmox PVC (WaitForFirstConsumer,
  never mounted — primary uses NFS, replicas have their own proxmox PVCs)
2026-04-06 12:17:08 +03:00
Viktor Barzin
f675b1492f state(meshcentral): update encrypted state 2026-04-06 12:10:41 +03:00
Viktor Barzin
bbc9ea3444 state(status-page): update encrypted state 2026-04-06 12:09:50 +03:00
Viktor Barzin
0de2fef9c9 misc: actualbudget, authentik, headscale, rybbit, terminal, dbaas updates
- actualbudget: adjust resource config
- authentik: add configuration
- headscale: minor fix
- rybbit: add resources
- terminal: add terminal stack config
- platform/dbaas: add config
- infra: update lock file
2026-04-06 11:58:00 +03:00
Viktor Barzin
c2f9ca0d13 modules: improve create-vm with additional config options and cloud-init updates 2026-04-06 11:57:55 +03:00
Viktor Barzin
f9e85964ce traefik: add middleware and platform traefik config updates 2026-04-06 11:57:52 +03:00
Viktor Barzin
dca06b8a00 freedify: increase memory limits and add new features 2026-04-06 11:57:47 +03:00
Viktor Barzin
61dc7a6862 nextcloud: refactor chart values and main.tf configuration 2026-04-06 11:57:44 +03:00
Viktor Barzin
fe342a974b monitoring + proxmox-csi: LVM snapshot RBAC, pushgateway NodePort, backup dashboard
- proxmox-csi: add RBAC for PVE host snapshot restore script
- monitoring: expose Pushgateway via NodePort for PVE LVM snapshot metrics
- monitoring: add backup health Grafana dashboard
2026-04-06 11:57:41 +03:00
Viktor Barzin
72d832fee7 add HA Sofia checks (26-29) to cluster healthcheck and backup-dr docs
- Healthcheck: add entity availability, integration health, automation
  status, and system resources checks for Home Assistant Sofia
- Docs: add backup-dr architecture documentation
2026-04-06 11:57:36 +03:00
Viktor Barzin
b0178cf6d2 technitium: add tertiary DNS replica and fix CoreDNS forward order
- Add tertiary DNS deployment with zone-transfer replication for
  externalTrafficPolicy=Local coverage across more nodes
- Reorder CoreDNS default forwarders: pfSense (10.0.20.1) first,
  then public DNS fallbacks (8.8.8.8, 1.1.1.1)
2026-04-06 11:57:31 +03:00
Viktor Barzin
f80e1fa868 cluster health fixes: NFS CSI, Immich ML, dbaas, Redis, DNS, trading-bot removal
- NFS CSI: fix liveness-probe port conflict (29652 → 29653)
- Immich ML: add gpu-workload priority class to enable preemption on node1
- dbaas: right-size MySQL memory limits (sidecar 6Gi→350Mi, main 4Gi→3Gi)
- Redis: add redis-master service via HAProxy for master-only routing,
  update config.tfvars redis_host to use it
- CoreDNS: forward .viktorbarzin.lan to Technitium ClusterIP (10.96.0.53)
  instead of stale LoadBalancer IP (10.0.20.200)
- Trading bot: comment out all resources (no longer needed)
- Vault: remove trading-bot PostgreSQL database role
2026-04-06 11:54:45 +03:00
Viktor Barzin
0115320d72 state(status-page): update encrypted state 2026-04-06 11:48:40 +03:00
Viktor Barzin
d0ed3cc3ce state(status-page): update encrypted state 2026-04-06 11:31:45 +03:00
Viktor Barzin
8e6034c34d state(status-page): update encrypted state 2026-04-06 11:29:48 +03:00
Viktor Barzin
9479a562f1 state(status-page): update encrypted state 2026-04-06 11:28:40 +03:00
Viktor Barzin
e0dcfd7694 state(status-page): update encrypted state 2026-04-06 11:27:16 +03:00
Viktor Barzin
9f91a3db88 state: update encrypted terraform state 2026-04-06 11:26:45 +03:00
Viktor Barzin
a486bbd66c state(immich): update encrypted state 2026-04-06 10:50:34 +03:00
Viktor Barzin
c988c5b43e state(nfs-csi): update encrypted state 2026-04-06 10:40:50 +03:00
Viktor Barzin
58e698c647 state(immich): update encrypted state 2026-04-06 10:38:43 +03:00
Viktor Barzin
9aea54674e state(trading-bot): update encrypted state 2026-04-06 10:29:54 +03:00
Viktor Barzin
faa6868f79 remove claude-memory PDB (blocks drains with single replica)
Single replica + minAvailable=1 makes drains impossible.
claude-memory is non-critical and recovers quickly. [ci skip]
2026-04-06 00:47:40 +03:00
Viktor Barzin
70c870a2ed state: update encrypted terraform state 2026-04-06 00:37:58 +03:00
Viktor Barzin
dcda285d9b state(authentik): update encrypted state 2026-04-06 00:35:33 +03:00
Viktor Barzin
88307e3e5f state(headscale): update encrypted state 2026-04-06 00:33:54 +03:00
Viktor Barzin
c8be07c403 resilience improvements: MySQL anti-affinity comment, descheduler 5min, prometheus termination 60s
- MySQL InnoDB: keep required anti-affinity but document why (2/3 members OK during node loss)
- Descheduler: increase frequency from hourly to every 5 min for faster rebalancing
- Prometheus: set terminationGracePeriodSeconds=60 to prevent drain timeout [ci skip]
2026-04-06 00:25:49 +03:00
Viktor Barzin
3eb15149e1 state(platform): update encrypted state 2026-04-06 00:25:21 +03:00
Viktor Barzin
2ead11d36b state(descheduler): update encrypted state 2026-04-06 00:25:14 +03:00
Viktor Barzin
0f2ef356d6 fix: remove ISCSICSIControllerDown alert (democratic-csi decommissioned)
iSCSI CSI (democratic-csi) was replaced by proxmox-csi in April 2026.
Controller is intentionally scaled to 0. Remove the stale alert and
update CSIDriverCrashLoop to monitor proxmox-csi instead of iscsi-csi.
2026-04-05 23:53:18 +03:00
Viktor Barzin
ae0585048a fix: bump tier-1-cluster LimitRange max to 8Gi for MySQL 6Gi limit
Kyverno's tier-1-cluster LimitRange had max=4Gi which blocked
mysql-cluster-2 from starting after we bumped MySQL to 6Gi limit.
Also added custom LimitRange in dbaas stack (for when Terraform
manages it directly).
2026-04-05 23:31:23 +03:00
Viktor Barzin
aa62e789e5 state(kyverno): update encrypted state 2026-04-05 23:28:46 +03:00
Viktor Barzin
772f59d589 fix: add Vault-managed DB credentials for Matrix Synapse
- Create dedicated 'matrix' PostgreSQL user (was using 'postgres' superuser)
- Add Vault DB static role pg-matrix with 24h rotation
- Add ExternalSecret matrix-db-creds syncing password from Vault
- Add inject-db-password init container that patches homeserver.yaml
  with current Vault password on every pod start
- Update dependency annotation to pg-cluster-rw.dbaas
- Also updated Vault DB config to use pg-cluster-rw (was legacy postgresql.dbaas)
2026-04-05 23:18:16 +03:00
Viktor Barzin
e064778c2c fix: resolve tandoor, matrix, navidrome crash loops
- Tandoor: pin image to vabene1111/recipes:1.5.27 (latest tag pull
  failing with EOF from pull-through cache corruption)
- Matrix: update homeserver.yaml to use pg-cluster-rw.dbaas instead
  of legacy postgresql.dbaas service, update CNPG postgres password
- Navidrome: deleted corrupted SQLite DB (malformed disk image from
  proxmox-lvm migration), navidrome recreates fresh DB on startup
2026-04-05 23:12:49 +03:00
Viktor Barzin
4da8f0242f fix: right-size service memory after PVE RAM upgrade (142→272GB)
- MySQL InnoDB: 2Gi/4Gi → 3Gi/6Gi (was at 97% of limit)
- Redis HAProxy: 16Mi/16Mi → 32Mi/64Mi (OOMKilled)
- Plotting-book: 64Mi/64Mi → 128Mi/256Mi (OOMKilled)
- Tandoor: 256Mi/256Mi → 384Mi/512Mi (60 OOM restarts), re-enabled
- Navidrome: 128Mi/128Mi → 256Mi/384Mi
- Matrix: add explicit 256Mi/512Mi resources
- Trading-bot workers: 64Mi/64Mi → 128Mi/256Mi, re-enabled
- Tier 3-edge defaults: 96Mi/192Mi → 128Mi/256Mi
- Fallback tier defaults: 128Mi/128Mi → 128Mi/192Mi, max 2→4Gi
- Mailserver: disable rspamd-redis, fix Roundcube IPv6/IMAP, bump dovecot connections
2026-04-05 23:02:50 +03:00
Viktor Barzin
825adc4a67 state(trading-bot): update encrypted state 2026-04-05 22:59:53 +03:00
Viktor Barzin
e206ffd676 state(kyverno): update encrypted state 2026-04-05 22:53:47 +03:00
Viktor Barzin
9c47311d45 state(mailserver): update encrypted state 2026-04-05 22:23:12 +03:00
Viktor Barzin
c946e5fdc9 tune controller-manager + apiserver for faster volume detach
- kube-controller-manager: --attach-detach-reconcile-sync-period=15s (was 1m default)
- kube-apiserver: --default-unreachable-toleration-seconds=60 (was 300s default)
- kube-apiserver: --default-not-ready-toleration-seconds=60 (was 300s default)

Reduces VolumeAttachment auto-detach from ~6 min to ~2 min on node failure.
Applied live + codified in cloud-init template. [ci skip]
2026-04-05 22:14:23 +03:00
Viktor Barzin
502ab23156 state(actualbudget): update encrypted state 2026-04-05 22:08:13 +03:00
Viktor Barzin
0ca177ff98 state(mailserver): update encrypted state 2026-04-05 21:55:30 +03:00
Viktor Barzin
c239300154 fix: disable rspamd-redis and correct proxmox-lvm PVC size
ENABLE_RSPAMD_REDIS=0 prevents the docker-mailserver from attempting to start
an embedded Redis server. The rspamd-redis subprocess was failing repeatedly
due to a corrupted/empty RDB file after the recent NFS-to-proxmox-lvm storage
migration. Since the DKIM signing config uses use_redis=false, Redis is not
needed.

Also correct the PVC storage request to match the actual provisioned size (2Gi).
The mismatch was causing unnecessary PVC replacement during terraform apply.
2026-04-05 21:44:52 +03:00
Viktor Barzin
2eeb73fc57 add priority-based kubelet graceful shutdown ordering
Replace 2-bucket shutdownGracePeriod (240s non-critical / 60s critical) with
shutdownGracePeriodByPodPriority for 9-tier ordered shutdown:
  unclassified(20s) → tier-4-aux(20s) → tier-3-edge(30s) → tier-2-gpu(30s) →
  tier-1-cluster/DBs(90s) → tier-0-core(30s) → gpu(30s) → sys-cluster(30s) →
  sys-node(30s). Total: 310s.

Apps stop before databases, databases stop before infrastructure.
VM shutdown timeout: 300s → 420s. InhibitDelay: 300 → 480. [ci skip]
2026-04-05 20:54:36 +03:00
Viktor Barzin
56583c3825 fix: add retry middleware and per-service rate limit for ha-sofia
The global rate limit (10 req/s, 50 burst) was too aggressive for HA
dashboards that load 30+ JS files on page load, causing 429s. VPN tunnel
blips between London K8s and Sofia caused 502s with no retry fallback.

- Add traefik-retry middleware to reverse-proxy factory (all services)
- Add skip_global_rate_limit variable to both reverse-proxy factories
- Create ha-sofia-rate-limit middleware (100 avg, 200 burst)
- Apply to ha-sofia and music-assistant (both route to Sofia)
2026-04-05 20:47:58 +03:00