2262 commits

Author SHA1 Message Date
Viktor Barzin
dcda285d9b state(authentik): update encrypted state 2026-04-06 00:35:33 +03:00
Viktor Barzin
88307e3e5f state(headscale): update encrypted state 2026-04-06 00:33:54 +03:00
Viktor Barzin
c8be07c403 resilience improvements: MySQL anti-affinity comment, descheduler 5min, prometheus termination 60s
- MySQL InnoDB: keep required anti-affinity but document why (2/3 members OK during node loss)
- Descheduler: increase frequency from hourly to every 5 min for faster rebalancing
- Prometheus: set terminationGracePeriodSeconds=60 to prevent drain timeout [ci skip]
2026-04-06 00:25:49 +03:00
Viktor Barzin
3eb15149e1 state(platform): update encrypted state 2026-04-06 00:25:21 +03:00
Viktor Barzin
2ead11d36b state(descheduler): update encrypted state 2026-04-06 00:25:14 +03:00
Viktor Barzin
0f2ef356d6 fix: remove ISCSICSIControllerDown alert (democratic-csi decommissioned)
iSCSI CSI (democratic-csi) was replaced by proxmox-csi in April 2026.
Controller is intentionally scaled to 0. Remove the stale alert and
update CSIDriverCrashLoop to monitor proxmox-csi instead of iscsi-csi.
2026-04-05 23:53:18 +03:00
Viktor Barzin
ae0585048a fix: bump tier-1-cluster LimitRange max to 8Gi for MySQL 6Gi limit
Kyverno's tier-1-cluster LimitRange had max=4Gi which blocked
mysql-cluster-2 from starting after we bumped MySQL to 6Gi limit.
Also added custom LimitRange in dbaas stack (for when Terraform
manages it directly).
2026-04-05 23:31:23 +03:00
Viktor Barzin
aa62e789e5 state(kyverno): update encrypted state 2026-04-05 23:28:46 +03:00
Viktor Barzin
772f59d589 fix: add Vault-managed DB credentials for Matrix Synapse
- Create dedicated 'matrix' PostgreSQL user (was using 'postgres' superuser)
- Add Vault DB static role pg-matrix with 24h rotation
- Add ExternalSecret matrix-db-creds syncing password from Vault
- Add inject-db-password init container that patches homeserver.yaml
  with current Vault password on every pod start
- Update dependency annotation to pg-cluster-rw.dbaas
- Also updated Vault DB config to use pg-cluster-rw (was legacy postgresql.dbaas)
2026-04-05 23:18:16 +03:00
Viktor Barzin
e064778c2c fix: resolve tandoor, matrix, navidrome crash loops
- Tandoor: pin image to vabene1111/recipes:1.5.27 (latest tag pull
  failing with EOF from pull-through cache corruption)
- Matrix: update homeserver.yaml to use pg-cluster-rw.dbaas instead
  of legacy postgresql.dbaas service, update CNPG postgres password
- Navidrome: deleted corrupted SQLite DB (malformed disk image from
  proxmox-lvm migration), navidrome recreates fresh DB on startup
2026-04-05 23:12:49 +03:00
Viktor Barzin
4da8f0242f fix: right-size service memory after PVE RAM upgrade (142→272GB)
- MySQL InnoDB: 2Gi/4Gi → 3Gi/6Gi (was at 97% of limit)
- Redis HAProxy: 16Mi/16Mi → 32Mi/64Mi (OOMKilled)
- Plotting-book: 64Mi/64Mi → 128Mi/256Mi (OOMKilled)
- Tandoor: 256Mi/256Mi → 384Mi/512Mi (60 OOM restarts), re-enabled
- Navidrome: 128Mi/128Mi → 256Mi/384Mi
- Matrix: add explicit 256Mi/512Mi resources
- Trading-bot workers: 64Mi/64Mi → 128Mi/256Mi, re-enabled
- Tier 3-edge defaults: 96Mi/192Mi → 128Mi/256Mi
- Fallback tier defaults: 128Mi/128Mi → 128Mi/192Mi, max 2→4Gi
- Mailserver: disable rspamd-redis, fix Roundcube IPv6/IMAP, bump dovecot connections
2026-04-05 23:02:50 +03:00
Viktor Barzin
825adc4a67 state(trading-bot): update encrypted state 2026-04-05 22:59:53 +03:00
Viktor Barzin
e206ffd676 state(kyverno): update encrypted state 2026-04-05 22:53:47 +03:00
Viktor Barzin
9c47311d45 state(mailserver): update encrypted state 2026-04-05 22:23:12 +03:00
Viktor Barzin
c946e5fdc9 tune controller-manager + apiserver for faster volume detach
- kube-controller-manager: --attach-detach-reconcile-sync-period=15s (was 1m default)
- kube-apiserver: --default-unreachable-toleration-seconds=60 (was 300s default)
- kube-apiserver: --default-not-ready-toleration-seconds=60 (was 300s default)

Reduces VolumeAttachment auto-detach from ~6 min to ~2 min on node failure.
Applied live + codified in cloud-init template. [ci skip]
2026-04-05 22:14:23 +03:00
Viktor Barzin
502ab23156 state(actualbudget): update encrypted state 2026-04-05 22:08:13 +03:00
Viktor Barzin
0ca177ff98 state(mailserver): update encrypted state 2026-04-05 21:55:30 +03:00
Viktor Barzin
c239300154 fix: disable rspamd-redis and correct proxmox-lvm PVC size
ENABLE_RSPAMD_REDIS=0 prevents the docker-mailserver from attempting to start
an embedded Redis server. The rspamd-redis subprocess was failing repeatedly
due to a corrupted/empty RDB file after the recent NFS-to-proxmox-lvm storage
migration. Since the DKIM signing config uses use_redis=false, Redis is not
needed.

Also correct the PVC storage request to match the actual provisioned size (2Gi).
The mismatch was causing unnecessary PVC replacement during terraform apply.
2026-04-05 21:44:52 +03:00
Viktor Barzin
2eeb73fc57 add priority-based kubelet graceful shutdown ordering
Replace 2-bucket shutdownGracePeriod (240s non-critical / 60s critical) with
shutdownGracePeriodByPodPriority for 9-tier ordered shutdown:
  unclassified(20s) → tier-4-aux(20s) → tier-3-edge(30s) → tier-2-gpu(30s) →
  tier-1-cluster/DBs(90s) → tier-0-core(30s) → gpu(30s) → sys-cluster(30s) →
  sys-node(30s). Total: 310s.

Apps stop before databases, databases stop before infrastructure.
VM shutdown timeout: 300s → 420s. InhibitDelay: 300 → 480. [ci skip]
2026-04-05 20:54:36 +03:00
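The 310s total quoted in the commit above can be checked by summing the per-tier grace periods. The sketch below does that and compares the result with the new 420s VM shutdown timeout. The tier names and durations are copied from the commit message; the variable and function names are illustrative and are not taken from the repository.

```python
# Per-tier grace periods in seconds, in shutdown order, as listed in the commit message.
# Names such as SHUTDOWN_TIERS are hypothetical and only used for this illustration.
SHUTDOWN_TIERS = [
    ("unclassified", 20),
    ("tier-4-aux", 20),
    ("tier-3-edge", 30),
    ("tier-2-gpu", 30),
    ("tier-1-cluster", 90),  # databases
    ("tier-0-core", 30),
    ("gpu", 30),
    ("sys-cluster", 30),
    ("sys-node", 30),
]

VM_SHUTDOWN_TIMEOUT_S = 420  # new VM shutdown timeout from the commit message


def total_grace_seconds(tiers):
    """Sum the grace periods of all tiers (they run one after another)."""
    return sum(seconds for _, seconds in tiers)


def headroom_seconds(tiers, vm_timeout_s):
    """Time left between the end of the ordered shutdown and the VM timeout."""
    return vm_timeout_s - total_grace_seconds(tiers)


if __name__ == "__main__":
    print("total grace period:", total_grace_seconds(SHUTDOWN_TIERS), "s")
    print("headroom before VM timeout:",
          headroom_seconds(SHUTDOWN_TIERS, VM_SHUTDOWN_TIMEOUT_S), "s")
```

Under these numbers the ordered shutdown finishes with some margin before the hypervisor would force the VM off, which is presumably why the timeout was raised together with the kubelet change.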
Viktor Barzin
56583c3825 fix: add retry middleware and per-service rate limit for ha-sofia
The global rate limit (10 req/s, 50 burst) was too aggressive for HA
dashboards that load 30+ JS files on page load, causing 429s. VPN tunnel
blips between London K8s and Sofia caused 502s with no retry fallback.

- Add traefik-retry middleware to reverse-proxy factory (all services)
- Add skip_global_rate_limit variable to both reverse-proxy factories
- Create ha-sofia-rate-limit middleware (100 avg, 200 burst)
- Apply to ha-sofia and music-assistant (both route to Sofia)
2026-04-05 20:47:58 +03:00
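To illustrate why the global limit in the commit above could produce 429s, here is a minimal token-bucket sketch. It assumes a plain token bucket that refills at the average rate and holds at most `burst` tokens; Traefik's actual rate limiter may differ in detail. The class, the function, and the figure of 60 near-simultaneous requests are hypothetical and chosen only for the example.

```python
class TokenBucket:
    """Simplified token bucket: refills at `rate` tokens/s, capped at `burst`."""

    def __init__(self, rate, burst):
        self.rate = rate
        self.burst = burst
        self.tokens = float(burst)  # start with a full bucket
        self.last = 0.0             # timestamp of the previous request, in seconds

    def allow(self, now):
        # Refill for the time elapsed since the last request, never above burst.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # would be answered with HTTP 429


def count_rejected(rate, burst, requests, at=0.0):
    """Count rejections when `requests` arrive at the same instant `at`."""
    bucket = TokenBucket(rate, burst)
    return sum(0 if bucket.allow(at) else 1 for _ in range(requests))


if __name__ == "__main__":
    # Hypothetical load: 60 requests arriving at once (e.g. a page load plus a reload).
    print("global limit (10 avg, 50 burst):", count_rejected(10, 50, 60), "rejected")
    print("ha-sofia limit (100 avg, 200 burst):", count_rejected(100, 200, 60), "rejected")
```

With the assumed load, the global settings reject every request beyond the burst size, while the per-service settings accept all of them.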
Viktor Barzin
ccc956ab9c state(reverse-proxy): update encrypted state 2026-04-05 20:46:06 +03:00
Viktor Barzin
3cd560d4d9 fix: bank sync alerts - remove {{ $labels.job }} that Helm provider silently drops [ci skip]
The Terraform Helm provider's YAML diff comparison silently ignores rules
containing {{ $labels.job }} in annotations, preventing the alerts from being
applied. Also syncs alerts to platform stack tpl.
2026-04-05 20:07:51 +03:00
Viktor Barzin
f8daf7a245 state(platform): update encrypted state 2026-04-05 20:01:06 +03:00
Viktor Barzin
0e3c0fb503 security: harden traefik auth flow — fix header spoofing, TLS leak, DERP rate-limit
- Auth-proxy fallback now sets ALL X-authentik-* headers (username, uid,
  email, name, groups) to prevent client-supplied header spoofing when
  Authentik is down. Previously only username was set, allowing a malicious
  client to inject fake X-authentik-groups.
- Catch-all IngressRoute restricted to *.viktorbarzin.me only. Non-matching
  domains no longer get the wildcard cert served (TLS info leak).
- Added rate-limit and CrowdSec middleware to catch-all IngressRoute.
- Added rate-limit middleware to Headscale DERP IngressRoute.
- Rotated auth-proxy basicAuth credentials (bcrypt cost 5 → 12, admin → emergency-admin).
- Created Authentik brute-force reputation policy (threshold -5, IP+username).
2026-04-05 20:01:06 +03:00
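The header-spoofing fix in the commit above comes down to one rule: when the fallback path is active, every X-authentik-* header must be set by the proxy, so nothing supplied by the client survives. The sketch below shows that logic only. It is not the Traefik middleware configuration used in the repository, and the function name and fallback values are hypothetical.

```python
# The five headers named in the commit message.
AUTHENTIK_HEADERS = (
    "X-authentik-username",
    "X-authentik-uid",
    "X-authentik-email",
    "X-authentik-name",
    "X-authentik-groups",
)


def apply_fallback_identity(request_headers, fallback):
    """Return a copy of the headers with all X-authentik-* values controlled by the proxy.

    Client-supplied X-authentik-* headers are dropped (case-insensitively), then
    every known header is set from `fallback`, defaulting to an empty string.
    """
    cleaned = {
        name: value
        for name, value in request_headers.items()
        if not name.lower().startswith("x-authentik-")
    }
    for name in AUTHENTIK_HEADERS:
        cleaned[name] = fallback.get(name, "")
    return cleaned


if __name__ == "__main__":
    # Hypothetical request in which the client tries to inject a group membership.
    incoming = {"Host": "example.invalid", "x-authentik-groups": "admins"}
    fallback = {"X-authentik-username": "fallback-user"}  # hypothetical value
    print(apply_fallback_identity(incoming, fallback))
```

Setting only the username, as described for the previous behaviour, would leave the injected groups header in place; overwriting the full set closes that gap.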
Viktor Barzin
3d02036a18 state(platform): update encrypted state 2026-04-05 20:01:03 +03:00
Viktor Barzin
32be8a3789 state(headscale): update encrypted state 2026-04-05 19:58:42 +03:00
Viktor Barzin
3e2a2b8b28 state(traefik): update encrypted state 2026-04-05 19:58:25 +03:00
Viktor Barzin
ad7c0d7fc8 docs: add critical "Terraform Only" rule to CLAUDE.md
All infrastructure changes must go through Terraform/Terragrunt.
kubectl is read-only except for temporary migration steps.
If a resource isn't in Terraform, evaluate adding it before
making manual changes.
2026-04-05 19:46:07 +03:00
Viktor Barzin
9b134fe2ff state(platform): update encrypted state 2026-04-05 19:44:44 +03:00
Viktor Barzin
3217a5f605 add bank sync monitoring with Pushgateway metrics and Prometheus alerts [ci skip]
CronJob now captures HTTP status, pushes bank_sync_success/duration/last_success
to Pushgateway. Alerts: BankSyncFailing (6h), BankSyncStale (48h).
2026-04-05 19:32:40 +03:00
Viktor Barzin
3f09a2d007 state(actualbudget): update encrypted state 2026-04-05 19:32:40 +03:00
Viktor Barzin
aa7a7e74b2 fix: technitium secondary to proxmox-lvm + bootstrap TF state
- Migrate technitium-secondary-config from NFS to proxmox-lvm PVC
- Change secondary strategy from RollingUpdate to Recreate (RWO)
- Bootstrap encrypted state for insta2spotify and ebooks stacks
- Import servarr sub-module PVCs and reconcile state
2026-04-05 19:32:40 +03:00
Viktor Barzin
8bb486339c state(servarr): update encrypted state 2026-04-05 19:32:40 +03:00
root
22b4410cb7 Woodpecker CI Update TLS Certificates Commit 2026-04-05 00:03:00 +00:00
Viktor Barzin
cb8a808700 feat(storage): migrate 38 NFS PVCs to proxmox-lvm (Wave 2)
Add proxmox-lvm PVCs with pvc-autoresizer annotations for all
remaining single-pod app data services. Deployments updated to
use new block storage PVCs. Old NFS modules retained for rollback.

Services: affine, changedetection, diun, excalidraw, f1-stream,
hackmd, isponsorblocktv, matrix, n8n, send, grampsweb, health,
onlyoffice, owntracks, paperless-ngx, privatebin, resume,
speedtest, stirling-pdf, tandoor, rybbit (clickhouse), tor-proxy
(torrserver), whisper+piper, frigate (config), ollama (ui),
servarr (prowlarr/listenarr/qbittorrent), aiostreams, freshrss
(extensions), meshcentral (data+files), openclaw (data+home+
openlobster), technitium, mailserver (data+roundcube html+enigma),
dbaas (pgadmin).

Strategy set to Recreate where needed for RWO volumes.
2026-04-04 19:25:12 +03:00
Viktor Barzin
3dccbca95b state(mailserver): update encrypted state 2026-04-04 17:55:54 +03:00
Viktor Barzin
333bc6ecc4 state(meshcentral): update encrypted state 2026-04-04 17:44:23 +03:00
Viktor Barzin
cd52422eb0 state(freshrss): update encrypted state 2026-04-04 17:43:50 +03:00
Viktor Barzin
5662962e60 state(ollama): update encrypted state 2026-04-04 17:43:06 +03:00
Viktor Barzin
1ccb625d7b state(frigate): update encrypted state 2026-04-04 17:42:59 +03:00
Viktor Barzin
0e3df62a35 state(whisper): update encrypted state 2026-04-04 17:42:03 +03:00
Viktor Barzin
4907b9932c state(tor-proxy): update encrypted state 2026-04-04 17:41:56 +03:00
Viktor Barzin
53305a11f8 state(rybbit): update encrypted state 2026-04-04 17:40:42 +03:00
Viktor Barzin
8016dcb9d3 state(tandoor): update encrypted state 2026-04-04 17:32:21 +03:00
Viktor Barzin
de21e035cc state(stirling-pdf): update encrypted state 2026-04-04 17:32:05 +03:00
Viktor Barzin
57b84a110f state(speedtest): update encrypted state 2026-04-04 17:29:26 +03:00
Viktor Barzin
396c06c2a7 state(privatebin): update encrypted state 2026-04-04 17:28:31 +03:00
Viktor Barzin
411f4ed585 state(paperless-ngx): update encrypted state 2026-04-04 17:27:42 +03:00
Viktor Barzin
42ab019ddb state(owntracks): update encrypted state 2026-04-04 17:27:04 +03:00
Viktor Barzin
b22087effe state(onlyoffice): update encrypted state 2026-04-04 17:26:28 +03:00