infra

Author	SHA1	Message	Date
Viktor Barzin	cd2d00703c	state(vault): update encrypted state	2026-04-06 12:40:54 +03:00
root	a038b2a2c4	Woodpecker CI deploy commit [CI SKIP]	2026-04-06 09:28:10 +00:00
Viktor Barzin	5b43e57efa	actualbudget: use internal ClusterIP for http-api server URL. The http-api sidecar was connecting to the public URL (https://budget-.viktorbarzin.me), which goes through Traefik/Authentik. When pods got rescheduled to different nodes, this caused ETIMEDOUT errors. Changed to the internal service URL (http://budget-.actualbudget.svc.cluster.local), which is fast and reliable regardless of pod placement.	2026-04-06 12:22:57 +03:00
Viktor Barzin	07bad79489	state(status-page): update encrypted state	2026-04-06 12:20:54 +03:00
Viktor Barzin	5fef9945de	state(actualbudget): update encrypted state	2026-04-06 12:20:28 +03:00
Viktor Barzin	4594472e69	state(status-page): update encrypted state	2026-04-06 12:20:01 +03:00
Viktor Barzin	aac02f0467	meshcentral: restore DB from backup; technitium: remove orphaned PVC - meshcentral: fix homepage annotations formatting (no functional change, serversscheme was tested but not needed since MeshCentral serves HTTP) - meshcentral: restored user DB from Dec 2024 backup (1428B → 45KB) - technitium: remove unused technitium-config-proxmox PVC (WaitForFirstConsumer, never mounted — primary uses NFS, replicas have their own proxmox PVCs)	2026-04-06 12:17:08 +03:00
Viktor Barzin	f675b1492f	state(meshcentral): update encrypted state	2026-04-06 12:10:41 +03:00
Viktor Barzin	bbc9ea3444	state(status-page): update encrypted state	2026-04-06 12:09:50 +03:00
Viktor Barzin	0de2fef9c9	misc: actualbudget, authentik, headscale, rybbit, terminal, dbaas updates - actualbudget: adjust resource config - authentik: add configuration - headscale: minor fix - rybbit: add resources - terminal: add terminal stack config - platform/dbaas: add config - infra: update lock file	2026-04-06 11:58:00 +03:00
Viktor Barzin	c2f9ca0d13	modules: improve create-vm with additional config options and cloud-init updates	2026-04-06 11:57:55 +03:00
Viktor Barzin	f9e85964ce	traefik: add middleware and platform traefik config updates	2026-04-06 11:57:52 +03:00
Viktor Barzin	dca06b8a00	freedify: increase memory limits and add new features	2026-04-06 11:57:47 +03:00
Viktor Barzin	61dc7a6862	nextcloud: refactor chart values and main.tf configuration	2026-04-06 11:57:44 +03:00
Viktor Barzin	fe342a974b	monitoring + proxmox-csi: LVM snapshot RBAC, pushgateway NodePort, backup dashboard - proxmox-csi: add RBAC for PVE host snapshot restore script - monitoring: expose Pushgateway via NodePort for PVE LVM snapshot metrics - monitoring: add backup health Grafana dashboard	2026-04-06 11:57:41 +03:00
Viktor Barzin	72d832fee7	add HA Sofia checks (26-29) to cluster healthcheck and backup-dr docs - Healthcheck: add entity availability, integration health, automation status, and system resources checks for Home Assistant Sofia - Docs: add backup-dr architecture documentation	2026-04-06 11:57:36 +03:00
Viktor Barzin	b0178cf6d2	technitium: add tertiary DNS replica and fix CoreDNS forward order - Add tertiary DNS deployment with zone-transfer replication for externalTrafficPolicy=Local coverage across more nodes - Reorder CoreDNS default forwarders: pfSense (10.0.20.1) first, then public DNS fallbacks (8.8.8.8, 1.1.1.1)	2026-04-06 11:57:31 +03:00
Viktor Barzin	f80e1fa868	cluster health fixes: NFS CSI, Immich ML, dbaas, Redis, DNS, trading-bot removal - NFS CSI: fix liveness-probe port conflict (29652 → 29653) - Immich ML: add gpu-workload priority class to enable preemption on node1 - dbaas: right-size MySQL memory limits (sidecar 6Gi→350Mi, main 4Gi→3Gi) - Redis: add redis-master service via HAProxy for master-only routing, update config.tfvars redis_host to use it - CoreDNS: forward .viktorbarzin.lan to Technitium ClusterIP (10.96.0.53) instead of stale LoadBalancer IP (10.0.20.200) - Trading bot: comment out all resources (no longer needed) - Vault: remove trading-bot PostgreSQL database role	2026-04-06 11:54:45 +03:00
Viktor Barzin	0115320d72	state(status-page): update encrypted state	2026-04-06 11:48:40 +03:00
Viktor Barzin	d0ed3cc3ce	state(status-page): update encrypted state	2026-04-06 11:31:45 +03:00
Viktor Barzin	8e6034c34d	state(status-page): update encrypted state	2026-04-06 11:29:48 +03:00
Viktor Barzin	9479a562f1	state(status-page): update encrypted state	2026-04-06 11:28:40 +03:00
Viktor Barzin	e0dcfd7694	state(status-page): update encrypted state	2026-04-06 11:27:16 +03:00
Viktor Barzin	9f91a3db88	state: update encrypted terraform state	2026-04-06 11:26:45 +03:00
Viktor Barzin	a486bbd66c	state(immich): update encrypted state	2026-04-06 10:50:34 +03:00
Viktor Barzin	c988c5b43e	state(nfs-csi): update encrypted state	2026-04-06 10:40:50 +03:00
Viktor Barzin	58e698c647	state(immich): update encrypted state	2026-04-06 10:38:43 +03:00
Viktor Barzin	9aea54674e	state(trading-bot): update encrypted state	2026-04-06 10:29:54 +03:00
Viktor Barzin	faa6868f79	remove claude-memory PDB (blocks drains with single replica). Single replica + minAvailable=1 makes drains impossible. claude-memory is non-critical and recovers quickly. [ci skip]	2026-04-06 00:47:40 +03:00
Viktor Barzin	70c870a2ed	state: update encrypted terraform state	2026-04-06 00:37:58 +03:00
Viktor Barzin	dcda285d9b	state(authentik): update encrypted state	2026-04-06 00:35:33 +03:00
Viktor Barzin	88307e3e5f	state(headscale): update encrypted state	2026-04-06 00:33:54 +03:00
Viktor Barzin	c8be07c403	resilience improvements: MySQL anti-affinity comment, descheduler 5min, prometheus termination 60s - MySQL InnoDB: keep required anti-affinity but document why (2/3 members OK during node loss) - Descheduler: increase frequency from hourly to every 5 min for faster rebalancing - Prometheus: set terminationGracePeriodSeconds=60 to prevent drain timeout [ci skip]	2026-04-06 00:25:49 +03:00
Viktor Barzin	3eb15149e1	state(platform): update encrypted state	2026-04-06 00:25:21 +03:00
Viktor Barzin	2ead11d36b	state(descheduler): update encrypted state	2026-04-06 00:25:14 +03:00
Viktor Barzin	0f2ef356d6	fix: remove ISCSICSIControllerDown alert (democratic-csi decommissioned). iSCSI CSI (democratic-csi) was replaced by proxmox-csi in April 2026. Controller is intentionally scaled to 0. Remove the stale alert and update CSIDriverCrashLoop to monitor proxmox-csi instead of iscsi-csi.	2026-04-05 23:53:18 +03:00
Viktor Barzin	ae0585048a	fix: bump tier-1-cluster LimitRange max to 8Gi for MySQL 6Gi limit. Kyverno's tier-1-cluster LimitRange had max=4Gi, which blocked mysql-cluster-2 from starting after we bumped MySQL to a 6Gi limit. Also added a custom LimitRange in the dbaas stack (for when Terraform manages it directly).	2026-04-05 23:31:23 +03:00
Viktor Barzin	aa62e789e5	state(kyverno): update encrypted state	2026-04-05 23:28:46 +03:00
Viktor Barzin	772f59d589	fix: add Vault-managed DB credentials for Matrix Synapse - Create dedicated 'matrix' PostgreSQL user (was using 'postgres' superuser) - Add Vault DB static role pg-matrix with 24h rotation - Add ExternalSecret matrix-db-creds syncing password from Vault - Add inject-db-password init container that patches homeserver.yaml with current Vault password on every pod start - Update dependency annotation to pg-cluster-rw.dbaas - Also updated Vault DB config to use pg-cluster-rw (was legacy postgresql.dbaas)	2026-04-05 23:18:16 +03:00
Viktor Barzin	e064778c2c	fix: resolve tandoor, matrix, navidrome crash loops - Tandoor: pin image to vabene1111/recipes:1.5.27 (latest tag pull failing with EOF from pull-through cache corruption) - Matrix: update homeserver.yaml to use pg-cluster-rw.dbaas instead of legacy postgresql.dbaas service, update CNPG postgres password - Navidrome: deleted corrupted SQLite DB (malformed disk image from proxmox-lvm migration), navidrome recreates fresh DB on startup	2026-04-05 23:12:49 +03:00
Viktor Barzin	4da8f0242f	fix: right-size service memory after PVE RAM upgrade (142→272GB) - MySQL InnoDB: 2Gi/4Gi → 3Gi/6Gi (was at 97% of limit) - Redis HAProxy: 16Mi/16Mi → 32Mi/64Mi (OOMKilled) - Plotting-book: 64Mi/64Mi → 128Mi/256Mi (OOMKilled) - Tandoor: 256Mi/256Mi → 384Mi/512Mi (60 OOM restarts), re-enabled - Navidrome: 128Mi/128Mi → 256Mi/384Mi - Matrix: add explicit 256Mi/512Mi resources - Trading-bot workers: 64Mi/64Mi → 128Mi/256Mi, re-enabled - Tier 3-edge defaults: 96Mi/192Mi → 128Mi/256Mi - Fallback tier defaults: 128Mi/128Mi → 128Mi/192Mi, max 2→4Gi - Mailserver: disable rspamd-redis, fix Roundcube IPv6/IMAP, bump dovecot connections	2026-04-05 23:02:50 +03:00
Viktor Barzin	825adc4a67	state(trading-bot): update encrypted state	2026-04-05 22:59:53 +03:00
Viktor Barzin	e206ffd676	state(kyverno): update encrypted state	2026-04-05 22:53:47 +03:00
Viktor Barzin	9c47311d45	state(mailserver): update encrypted state	2026-04-05 22:23:12 +03:00
Viktor Barzin	c946e5fdc9	tune controller-manager + apiserver for faster volume detach - kube-controller-manager: --attach-detach-reconcile-sync-period=15s (was 1m default) - kube-apiserver: --default-unreachable-toleration-seconds=60 (was 300s default) - kube-apiserver: --default-not-ready-toleration-seconds=60 (was 300s default). Reduces VolumeAttachment auto-detach from ~6 min to ~2 min on node failure. Applied live + codified in cloud-init template. [ci skip]	2026-04-05 22:14:23 +03:00
Viktor Barzin	502ab23156	state(actualbudget): update encrypted state	2026-04-05 22:08:13 +03:00
Viktor Barzin	0ca177ff98	state(mailserver): update encrypted state	2026-04-05 21:55:30 +03:00
Viktor Barzin	c239300154	fix: disable rspamd-redis and correct proxmox-lvm PVC size. ENABLE_RSPAMD_REDIS=0 prevents docker-mailserver from attempting to start an embedded Redis server. The rspamd-redis subprocess was failing repeatedly due to a corrupted/empty RDB file after the recent NFS-to-proxmox-lvm storage migration. Since the DKIM signing config uses use_redis=false, Redis is not needed. Also correct the PVC storage request to match the actual provisioned size (2Gi). The mismatch was causing unnecessary PVC replacement during terraform apply.	2026-04-05 21:44:52 +03:00
Viktor Barzin	2eeb73fc57	add priority-based kubelet graceful shutdown ordering. Replace 2-bucket shutdownGracePeriod (240s non-critical / 60s critical) with shutdownGracePeriodByPodPriority for 9-tier ordered shutdown: unclassified(20s) → tier-4-aux(20s) → tier-3-edge(30s) → tier-2-gpu(30s) → tier-1-cluster/DBs(90s) → tier-0-core(30s) → gpu(30s) → sys-cluster(30s) → sys-node(30s). Total: 310s. Apps stop before databases, databases stop before infrastructure. VM shutdown timeout: 300s → 420s. InhibitDelay: 300 → 480. [ci skip]	2026-04-05 20:54:36 +03:00
Viktor Barzin	56583c3825	fix: add retry middleware and per-service rate limit for ha-sofia. The global rate limit (10 req/s, 50 burst) was too aggressive for HA dashboards that load 30+ JS files on page load, causing 429s. VPN tunnel blips between London K8s and Sofia caused 502s with no retry fallback. - Add traefik-retry middleware to reverse-proxy factory (all services) - Add skip_global_rate_limit variable to both reverse-proxy factories - Create ha-sofia-rate-limit middleware (100 avg, 200 burst) - Apply to ha-sofia and music-assistant (both route to Sofia)	2026-04-05 20:47:58 +03:00
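
The shutdown budget described in commit 2eeb73fc57 can be checked with a little arithmetic. The sketch below is not part of the repository: it copies the per-tier grace periods from the commit message, assumes the worst case in which every tier uses its full grace period, and assumes the VM shutdown timeout and InhibitDelay are both expressed in seconds. The function names are hypothetical.

```python
# Per-tier grace periods in seconds, copied from commit 2eeb73fc57,
# in the order the commit message lists them (first to stop comes first).
SHUTDOWN_TIERS = [
    ("unclassified", 20),
    ("tier-4-aux", 20),
    ("tier-3-edge", 30),
    ("tier-2-gpu", 30),
    ("tier-1-cluster", 90),
    ("tier-0-core", 30),
    ("gpu", 30),
    ("sys-cluster", 30),
    ("sys-node", 30),
]

# Outer limits named in the same commit (units assumed to be seconds).
VM_SHUTDOWN_TIMEOUT = 420
INHIBIT_DELAY = 480


def total_grace(tiers):
    """Worst-case total shutdown time if every tier uses its full period."""
    return sum(seconds for _, seconds in tiers)


def headroom(tiers, outer_timeout):
    """Seconds left over once all tiers have had their full grace period."""
    return outer_timeout - total_grace(tiers)


def cumulative_schedule(tiers):
    """Return (tier, start, end) offsets for a strictly sequential shutdown."""
    schedule = []
    elapsed = 0
    for name, seconds in tiers:
        schedule.append((name, elapsed, elapsed + seconds))
        elapsed += seconds
    return schedule


if __name__ == "__main__":
    for name, start, end in cumulative_schedule(SHUTDOWN_TIERS):
        print(f"{name:15s} {start:4d}s -> {end:4d}s")
    print("total:", total_grace(SHUTDOWN_TIERS))
    print("VM timeout headroom:", headroom(SHUTDOWN_TIERS, VM_SHUTDOWN_TIMEOUT))
    print("InhibitDelay headroom:", headroom(SHUTDOWN_TIERS, INHIBIT_DELAY))
```

Under these assumptions the tiers sum to the 310s stated in the commit, and both outer limits sit above that total, which is presumably why they were raised from 300.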

2292 commits
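
Commit faa6868f79 above removes a PodDisruptionBudget because a single replica combined with minAvailable=1 leaves no room for eviction. A minimal sketch of that reasoning follows. It is a simplified model of how a disruption budget is evaluated (healthy pods minus the required minimum), not the Kubernetes implementation, and the function names are hypothetical.

```python
def allowed_disruptions(healthy_replicas: int, min_available: int) -> int:
    """Number of pods that may be evicted voluntarily without breaching
    the budget: healthy pods beyond the required minimum, never negative."""
    return max(0, healthy_replicas - min_available)


def drain_can_evict(healthy_replicas: int, min_available: int) -> bool:
    """A node drain can evict a pod only if at least one disruption is allowed."""
    return allowed_disruptions(healthy_replicas, min_available) > 0


if __name__ == "__main__":
    # The claude-memory case from the commit: 1 replica, minAvailable=1.
    print(drain_can_evict(healthy_replicas=1, min_available=1))
    # With a second replica the same budget would leave room for one eviction.
    print(drain_can_evict(healthy_replicas=2, min_available=1))
```

In this model the single-replica case always yields zero allowed disruptions, so the drain waits indefinitely. Removing the budget or adding a replica are the two ways out, and the commit chose removal because the service is non-critical.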