Commit graph

418 commits

Author SHA1 Message Date
root
a038b2a2c4 Woodpecker CI deploy commit [CI SKIP] 2026-04-06 09:28:10 +00:00
Viktor Barzin
5b43e57efa actualbudget: use internal ClusterIP for http-api server URL
The http-api sidecar was connecting to the public URL
(https://budget-*.viktorbarzin.me) which goes through Traefik/Authentik.
When pods got rescheduled to different nodes, this caused ETIMEDOUT errors.

Changed to internal service URL (http://budget-*.actualbudget.svc.cluster.local)
which is fast and reliable regardless of pod placement.
2026-04-06 12:22:57 +03:00
Viktor Barzin
aac02f0467 meshcentral: restore DB from backup; technitium: remove orphaned PVC
- meshcentral: fix homepage annotations formatting (no functional change,
  serversscheme was tested but not needed since MeshCentral serves HTTP)
- meshcentral: restored user DB from Dec 2024 backup (1428B → 45KB)
- technitium: remove unused technitium-config-proxmox PVC (WaitForFirstConsumer,
  never mounted — primary uses NFS, replicas have their own proxmox PVCs)
2026-04-06 12:17:08 +03:00
Viktor Barzin
0de2fef9c9 misc: actualbudget, authentik, headscale, rybbit, terminal, dbaas updates
- actualbudget: adjust resource config
- authentik: add configuration
- headscale: minor fix
- rybbit: add resources
- terminal: add terminal stack config
- platform/dbaas: add config
- infra: update lock file
2026-04-06 11:58:00 +03:00
Viktor Barzin
f9e85964ce traefik: add middleware and platform traefik config updates 2026-04-06 11:57:52 +03:00
Viktor Barzin
dca06b8a00 freedify: increase memory limits and add new features 2026-04-06 11:57:47 +03:00
Viktor Barzin
61dc7a6862 nextcloud: refactor chart values and main.tf configuration 2026-04-06 11:57:44 +03:00
Viktor Barzin
fe342a974b monitoring + proxmox-csi: LVM snapshot RBAC, pushgateway NodePort, backup dashboard
- proxmox-csi: add RBAC for PVE host snapshot restore script
- monitoring: expose Pushgateway via NodePort for PVE LVM snapshot metrics
- monitoring: add backup health Grafana dashboard
2026-04-06 11:57:41 +03:00
Viktor Barzin
b0178cf6d2 technitium: add tertiary DNS replica and fix CoreDNS forward order
- Add tertiary DNS deployment with zone-transfer replication for
  externalTrafficPolicy=Local coverage across more nodes
- Reorder CoreDNS default forwarders: pfSense (10.0.20.1) first,
  then public DNS fallbacks (8.8.8.8, 1.1.1.1)
2026-04-06 11:57:31 +03:00
Viktor Barzin
f80e1fa868 cluster health fixes: NFS CSI, Immich ML, dbaas, Redis, DNS, trading-bot removal
- NFS CSI: fix liveness-probe port conflict (29652 → 29653)
- Immich ML: add gpu-workload priority class to enable preemption on node1
- dbaas: right-size MySQL memory limits (sidecar 6Gi→350Mi, main 4Gi→3Gi)
- Redis: add redis-master service via HAProxy for master-only routing,
  update config.tfvars redis_host to use it
- CoreDNS: forward .viktorbarzin.lan to Technitium ClusterIP (10.96.0.53)
  instead of stale LoadBalancer IP (10.0.20.200)
- Trading bot: comment out all resources (no longer needed)
- Vault: remove trading-bot PostgreSQL database role
2026-04-06 11:54:45 +03:00
Viktor Barzin
faa6868f79 remove claude-memory PDB (blocks drains with single replica)
Single replica + minAvailable=1 makes drains impossible.
claude-memory is non-critical and recovers quickly. [ci skip]
2026-04-06 00:47:40 +03:00
Viktor Barzin
c8be07c403 resilience improvements: MySQL anti-affinity comment, descheduler 5min, prometheus termination 60s
- MySQL InnoDB: keep required anti-affinity but document why (2/3 members OK during node loss)
- Descheduler: increase frequency from hourly to every 5 min for faster rebalancing
- Prometheus: set terminationGracePeriodSeconds=60 to prevent drain timeout [ci skip]
2026-04-06 00:25:49 +03:00
Viktor Barzin
0f2ef356d6 fix: remove ISCSICSIControllerDown alert (democratic-csi decommissioned)
iSCSI CSI (democratic-csi) was replaced by proxmox-csi in April 2026.
Controller is intentionally scaled to 0. Remove the stale alert and
update CSIDriverCrashLoop to monitor proxmox-csi instead of iscsi-csi.
2026-04-05 23:53:18 +03:00
Viktor Barzin
ae0585048a fix: bump tier-1-cluster LimitRange max to 8Gi for MySQL 6Gi limit
Kyverno's tier-1-cluster LimitRange had max=4Gi which blocked
mysql-cluster-2 from starting after we bumped MySQL to 6Gi limit.
Also added custom LimitRange in dbaas stack (for when Terraform
manages it directly).
2026-04-05 23:31:23 +03:00
Viktor Barzin
772f59d589 fix: add Vault-managed DB credentials for Matrix Synapse
- Create dedicated 'matrix' PostgreSQL user (was using 'postgres' superuser)
- Add Vault DB static role pg-matrix with 24h rotation
- Add ExternalSecret matrix-db-creds syncing password from Vault
- Add inject-db-password init container that patches homeserver.yaml
  with current Vault password on every pod start
- Update dependency annotation to pg-cluster-rw.dbaas
- Also updated Vault DB config to use pg-cluster-rw (was legacy postgresql.dbaas)
2026-04-05 23:18:16 +03:00
Viktor Barzin
e064778c2c fix: resolve tandoor, matrix, navidrome crash loops
- Tandoor: pin image to vabene1111/recipes:1.5.27 (latest tag pull
  failing with EOF from pull-through cache corruption)
- Matrix: update homeserver.yaml to use pg-cluster-rw.dbaas instead
  of legacy postgresql.dbaas service, update CNPG postgres password
- Navidrome: deleted corrupted SQLite DB (malformed disk image from
  proxmox-lvm migration), navidrome recreates fresh DB on startup
2026-04-05 23:12:49 +03:00
Viktor Barzin
4da8f0242f fix: right-size service memory after PVE RAM upgrade (142→272GB)
- MySQL InnoDB: 2Gi/4Gi → 3Gi/6Gi (was at 97% of limit)
- Redis HAProxy: 16Mi/16Mi → 32Mi/64Mi (OOMKilled)
- Plotting-book: 64Mi/64Mi → 128Mi/256Mi (OOMKilled)
- Tandoor: 256Mi/256Mi → 384Mi/512Mi (60 OOM restarts), re-enabled
- Navidrome: 128Mi/128Mi → 256Mi/384Mi
- Matrix: add explicit 256Mi/512Mi resources
- Trading-bot workers: 64Mi/64Mi → 128Mi/256Mi, re-enabled
- Tier 3-edge defaults: 96Mi/192Mi → 128Mi/256Mi
- Fallback tier defaults: 128Mi/128Mi → 128Mi/192Mi, max 2→4Gi
- Mailserver: disable rspamd-redis, fix Roundcube IPv6/IMAP, bump dovecot connections
2026-04-05 23:02:50 +03:00
Viktor Barzin
c946e5fdc9 tune controller-manager + apiserver for faster volume detach
- kube-controller-manager: --attach-detach-reconcile-sync-period=15s (was 1m default)
- kube-apiserver: --default-unreachable-toleration-seconds=60 (was 300s default)
- kube-apiserver: --default-not-ready-toleration-seconds=60 (was 300s default)

Reduces VolumeAttachment auto-detach from ~6 min to ~2 min on node failure.
Applied live + codified in cloud-init template. [ci skip]
2026-04-05 22:14:23 +03:00
Viktor Barzin
c239300154 fix: disable rspamd-redis and correct proxmox-lvm PVC size
ENABLE_RSPAMD_REDIS=0 prevents the docker-mailserver from attempting to start
an embedded Redis server. The rspamd-redis subprocess was failing repeatedly
due to a corrupted/empty RDB file after the recent NFS-to-proxmox-lvm storage
migration. Since the DKIM signing config uses use_redis=false, Redis is not
needed.

Also correct the PVC storage request to match the actual provisioned size (2Gi).
The mismatch was causing unnecessary PVC replacement during terraform apply.
2026-04-05 21:44:52 +03:00
Viktor Barzin
2eeb73fc57 add priority-based kubelet graceful shutdown ordering
Replace 2-bucket shutdownGracePeriod (240s non-critical / 60s critical) with
shutdownGracePeriodByPodPriority for 9-tier ordered shutdown:
  unclassified(20s) → tier-4-aux(20s) → tier-3-edge(30s) → tier-2-gpu(30s) →
  tier-1-cluster/DBs(90s) → tier-0-core(30s) → gpu(30s) → sys-cluster(30s) →
  sys-node(30s). Total: 310s.

Apps stop before databases, databases stop before infrastructure.
VM shutdown timeout: 300s → 420s. InhibitDelay: 300 → 480. [ci skip]
2026-04-05 20:54:36 +03:00
Viktor Barzin
56583c3825 fix: add retry middleware and per-service rate limit for ha-sofia
The global rate limit (10 req/s, 50 burst) was too aggressive for HA
dashboards that load 30+ JS files on page load, causing 429s. VPN tunnel
blips between London K8s and Sofia caused 502s with no retry fallback.

- Add traefik-retry middleware to reverse-proxy factory (all services)
- Add skip_global_rate_limit variable to both reverse-proxy factories
- Create ha-sofia-rate-limit middleware (100 avg, 200 burst)
- Apply to ha-sofia and music-assistant (both route to Sofia)
2026-04-05 20:47:58 +03:00
Viktor Barzin
3cd560d4d9 fix: bank sync alerts - remove {{ $labels.job }} that Helm provider silently drops [ci skip]
The Terraform Helm provider's YAML diff comparison silently ignores rules
containing {{ $labels.job }} in annotations, preventing the alerts from being
applied. Also syncs alerts to platform stack tpl.
2026-04-05 20:07:51 +03:00
Viktor Barzin
0e3c0fb503 security: harden traefik auth flow — fix header spoofing, TLS leak, DERP rate-limit
- Auth-proxy fallback now sets ALL X-authentik-* headers (username, uid,
  email, name, groups) to prevent client-supplied header spoofing when
  Authentik is down. Previously only username was set, allowing a malicious
  client to inject fake X-authentik-groups.
- Catch-all IngressRoute restricted to *.viktorbarzin.me only. Non-matching
  domains no longer get the wildcard cert served (TLS info leak).
- Added rate-limit and CrowdSec middleware to catch-all IngressRoute.
- Added rate-limit middleware to Headscale DERP IngressRoute.
- Rotated auth-proxy basicAuth credentials (bcrypt cost 5 → 12, admin → emergency-admin).
- Created Authentik brute-force reputation policy (threshold -5, IP+username).
2026-04-05 20:01:06 +03:00
Viktor Barzin
3217a5f605 add bank sync monitoring with Pushgateway metrics and Prometheus alerts [ci skip]
CronJob now captures HTTP status, pushes bank_sync_success/duration/last_success
to Pushgateway. Alerts: BankSyncFailing (6h), BankSyncStale (48h).
2026-04-05 19:32:40 +03:00
Viktor Barzin
aa7a7e74b2 fix: technitium secondary to proxmox-lvm + bootstrap TF state
- Migrate technitium-secondary-config from NFS to proxmox-lvm PVC
- Change secondary strategy from RollingUpdate to Recreate (RWO)
- Bootstrap encrypted state for insta2spotify and ebooks stacks
- Import servarr sub-module PVCs and reconcile state
2026-04-05 19:32:40 +03:00
Viktor Barzin
cb8a808700 feat(storage): migrate 38 NFS PVCs to proxmox-lvm (Wave 2)
Add proxmox-lvm PVCs with pvc-autoresizer annotations for all
remaining single-pod app data services. Deployments updated to
use new block storage PVCs. Old NFS modules retained for rollback.

Services: affine, changedetection, diun, excalidraw, f1-stream,
hackmd, isponsorblocktv, matrix, n8n, send, grampsweb, health,
onlyoffice, owntracks, paperless-ngx, privatebin, resume,
speedtest, stirling-pdf, tandoor, rybbit (clickhouse), tor-proxy
(torrserver), whisper+piper, frigate (config), ollama (ui),
servarr (prowlarr/listenarr/qbittorrent), aiostreams, freshrss
(extensions), meshcentral (data+files), openclaw (data+home+
openlobster), technitium, mailserver (data+roundcube html+enigma),
dbaas (pgadmin).

Strategy set to Recreate where needed for RWO volumes.
2026-04-04 19:25:12 +03:00
Viktor Barzin
ee39dd2fc9 feat(storage): migrate 12 SQLite NFS PVCs to proxmox-lvm (Wave 1)
Add proxmox-lvm PVCs with pvc-autoresizer annotations for all
SQLite-backed services. Deployments updated to use new block storage
PVCs. Old NFS modules retained for 1-week rollback.

Services: ntfy, freshrss, insta2spotify, actualbudget (x3),
wealthfolio, navidrome (DB only), audiobookshelf config,
headscale, forgejo, uptime-kuma.

Also: set Recreate strategy on ntfy, forgejo, insta2spotify,
wealthfolio (required for RWO volumes).
2026-04-04 16:26:59 +03:00
Viktor Barzin
3d3759ea2f fix: disable cert-manager webhook for pvc-autoresizer, use self-signed cert [ci skip]
Cluster doesn't have cert-manager installed. Use self-signed certificate
for the controller and disable the PVC mutating webhook (annotations are
set directly on PVCs via Terraform).
2026-04-03 23:44:49 +03:00
Viktor Barzin
ce7b8c2b2e add pvc-autoresizer for automatic PVC expansion before volumes fill up [ci skip]
Deploy topolvm/pvc-autoresizer controller that monitors kubelet_volume_stats
via Prometheus and auto-expands annotated PVCs. Annotated all 9 block-storage
PVCs (proxmox-lvm) with per-PVC thresholds and max limits. Updated PVFillingUp
alert to critical/10m (means auto-expansion failed) and added PVAutoExpanding
info alert at 80%.
2026-04-03 23:30:00 +03:00
Viktor Barzin
d49acebd8e migrate ebooks-calibre to proxmox-lvm, update storage docs [ci skip]
- Migrate ebooks-calibre-config-iscsi (2Gi, 2380 files) to proxmox-lvm
- Update docs/architecture/storage.md: document Proxmox CSI as primary
  block storage, mark democratic-csi iSCSI as deprecated
- Add full migration plan to docs/plans/
2026-04-03 19:45:34 +03:00
Viktor Barzin
dd59512153 migrate iSCSI block volumes from democratic-csi to Proxmox CSI [ci skip]
Replace TrueNAS iSCSI (democratic-csi) with Proxmox CSI plugin for all
block storage PVCs. Eliminates double-CoW (ZFS + LVM-thin) and removes
the iSCSI network hop for database I/O.

New stack: stacks/proxmox-csi/ — deploys proxmox-csi-plugin Helm chart
with StorageClass "proxmox-lvm" using existing local-lvm thin pool.

Migrated PVCs (12 total):
- Phase 1 standalone: plotting-book, novelapp, vaultwarden, nextcloud, prometheus
- Phase 2 StatefulSets: CNPG PostgreSQL (2), MySQL InnoDB (3), Redis (2)

All services verified healthy post-migration.
2026-04-02 22:13:04 +03:00
Viktor Barzin
337da2184d add upstream fallback to containerd registry mirrors
When the pull-through proxy (10.0.20.10) is down, containerd now falls
back to the official upstream registries (registry-1.docker.io, ghcr.io)
instead of failing. Also cleans up stale disabled registry mirror dirs
and removes unnecessary containerd restart from the rollout script.
2026-04-02 11:05:30 +03:00
Viktor Barzin
dd461beb33 add registry blob integrity checker to self-heal corrupted cache
The cleanup-tags.sh + garbage-collect cycle can delete blob data while
leaving _layers/ link files intact. The registry then returns HTTP 200
with 0 bytes for those layers, causing "unexpected EOF" on image pulls.

fix-broken-blobs.sh walks all repositories, checks each layer link
against actual blob data, and removes orphaned links so the registry
re-fetches from upstream on next pull.

Schedule: daily at 2:30am (after tag cleanup) and Sunday 3:30am
(after garbage collection). First run found 2335/2556 (91%) of
layer links were orphaned.
2026-03-29 22:31:39 +03:00
Viktor Barzin
a2b1b0e817 remove caretta network mapper to free 3Gi cluster memory
Caretta eBPF DaemonSet was using 600Mi x 5 nodes = 3Gi total for
non-critical network topology visualization. Removing it to free
memory for novelapp and aiostreams which were stuck in Pending.
2026-03-29 22:17:35 +03:00
Viktor Barzin
7ad01661f0 novelapp: migrate NEXTAUTH env vars to Auth.js v5 (AUTH_*)
Replace NEXTAUTH_URL/NEXTAUTH_SECRET with AUTH_URL/AUTH_SECRET and add
AUTH_TRUST_HOST=true for Auth.js v5 compatibility.
2026-03-29 20:37:26 +03:00
Viktor Barzin
8bf83147db add SLACK_WEBHOOK_URL env var to book-search deployment 2026-03-29 13:53:24 +03:00
Viktor Barzin
78eff9ab11 fix: bump book-search memory to 512Mi for file upload/email [ci skip]
Downloads and sends ebook files via HTTP — needs more than 128Mi
for large PDFs. Applied live via kubectl, persisting in Terraform.
2026-03-29 13:24:19 +03:00
Viktor Barzin
914e0b08e2 add SMTP and CWA auth env vars to book-search for send-to-kindle [ci skip] 2026-03-29 12:42:45 +03:00
Viktor Barzin
cbea959966 feat(ebooks): mount calibre-library PVC in book-search for permission fixing
CWA NETWORK_SHARE_MODE=true skips post-import chown, leaving files as
root. book-search now mounts the library to periodically fix permissions
on recently imported books.
2026-03-29 11:31:41 +03:00
Viktor Barzin
fed9df8c0e feat(ebooks): mount stacks-config PVC in book-search for force re-download
Adds stacks-config volume mount to book-search pod so it can delete
Stacks history entries and force re-downloads when a book was consumed
by CWA but failed to import.
2026-03-29 11:26:30 +03:00
Viktor Barzin
6d44b4292f add /api/download-status to book-search unprotected API ingress [ci skip]
Needed for async polling from iOS Shortcuts — status endpoint
doesn't need Authentik auth (job IDs are unguessable UUIDs).
2026-03-29 10:11:22 +03:00
Viktor Barzin
46444e0306 openclaw: remove install-dotfiles init container to reduce NFS writes
The init container was cloning the dotfiles repo via git on every pod
start, causing 200+ small NFS writes that amplified through ZFS.
Dotfiles already exist on NFS from a previous clone — no need to
re-clone on every restart. To update dotfiles, run git pull manually.

Also cleaned up stale Uptime Kuma files (1.6GB old SQLite DB + 289MB
error log left over from migration to MariaDB).
2026-03-29 01:11:33 +02:00
Viktor Barzin
878b556179 state(monitoring): update encrypted state 2026-03-29 01:04:11 +02:00
Viktor Barzin
d41211ddd5 add API key + unprotected API ingress for book-search iOS Shortcut
- API_KEY env var from calibre-secrets for /api/download-url auth
- SHORTCUT_ICLOUD_URL env var for /shortcut redirect
- Separate ingress for /api/download-url and /shortcut (bypasses Authentik)
2026-03-29 00:43:34 +02:00
Viktor Barzin
06490b0634 reduce Prometheus cardinality round 3: drop 44k more series
- cadvisor: drop unused network error/dropped counters, unused cpu
  metrics (load_avg, system, user), unused memory metrics (cache,
  failcnt, kernel, mapped_file, max_usage, rss, swap, active/inactive)
- kubelet: drop all unused histogram buckets (storage_operation, csi,
  volume_operation, image_pull, http_requests, rest_client, pod_worker,
  volume_metric, cgroup_manager) + kubernetes_feature_enabled
- apiserver: drop flowcontrol/rest_client histograms, longrunning_requests
- traefik: drop all router-level metrics (keep service + entrypoint)
- service-endpoints: drop coredns histograms, node_filesystem_*

Post-relabel: 332k → 99k (-70%), ingestion: 5,480 → 1,659 samples/sec (-70%)
2026-03-29 00:27:23 +02:00
Viktor Barzin
614d3c72bd add liveness probe to annas-archive-stacks deployment
Prevents corrupted SQLite DB from looping errors forever —
K8s will auto-restart the pod if /api/version stops responding.
2026-03-29 00:17:29 +02:00
Viktor Barzin
a9ca65bc31 reduce Prometheus cardinality round 2: drop 137k more series
- fix traefik double-scrape: kubernetes-pods job was scraping traefik
  pods again (43k duplicate series). Added namespace drop rule.
- drop unused cadvisor metrics: container_fs_*, container_blkio_*,
  container_pressure_*, container_spec_*, and misc (30k series)
- drop more apiserver histogram buckets: watch_list, watch_cache,
  response_sizes, watch_events, admission_controller, workqueue (11k)
- drop unused kube-state-metrics: replicaset_*, pod_tolerations,
  pod_labels, endpoint_*, service_*, configmap_*, etc (53k series)

Post-relabel samples: 332k → 142k (-57%)
Ingestion rate: 5,480 → 3,239 samples/sec (-41%)
2026-03-28 23:51:24 +02:00
Viktor Barzin
aceea7db94 increase global rate limit from 10/50 to 50/200
HA frontend loads 30-50 JS bundles on page load, exhausting the burst.
iOS Companion app reconnections also trigger bursts. 172 rate-limited
(429) requests found in Traefik logs causing intermittent connectivity
failures for ha-sofia iOS app.
2026-03-28 23:40:10 +02:00
Viktor Barzin
4b3851829b feat: organize Grafana dashboards into folders
Enable sidecar folderAnnotation + foldersFromFilesStructure to group
26 dashboards into 5 managed folders:

- Cluster (6): k8s health, API server, nodes, pods, kube-state-metrics
- Networking (6): CoreDNS, Technitium, Headscale, ingress, network traffic
- Hardware (5): node-exporter, proxmox, iDRAC, UPS, NVIDIA GPU
- Operations (4): backup health, registry, audit logs, Loki
- Applications (2): realestate-crawler, qBittorrent

Dashboard-to-folder mapping defined in grafana.tf locals block.
External stacks (headscale, technitium) annotated individually.
2026-03-28 16:23:49 +02:00
Viktor Barzin
725fefe565 fix: add Headscale monitoring, alerts, and pin UI image
- Add 4 Prometheus alerts: HeadscaleDown (critical), NoOnlineNodes,
  HighHTTPLatency, HighErrorRate
- Add Grafana dashboard with node count, map responses, HTTP latency,
  nodestore operations, and memory panels
- Pin headscale-ui to digest sha256:015f5ba0... (was :latest)
- Set disable_check_updates: true to skip GitHub check on startup
- Uptime Kuma monitor already existed (id=19, 300s interval)
2026-03-28 16:07:04 +02:00