The http-api sidecar was connecting to the public URL
(https://budget-*.viktorbarzin.me) which goes through Traefik/Authentik.
When pods got rescheduled to different nodes, this caused ETIMEDOUT errors.
Changed to internal service URL (http://budget-*.actualbudget.svc.cluster.local)
which is fast and reliable regardless of pod placement.
- meshcentral: fix homepage annotations formatting (no functional change,
serversscheme was tested but not needed since MeshCentral serves HTTP)
- meshcentral: restored user DB from Dec 2024 backup (1428B → 45KB)
- technitium: remove unused technitium-config-proxmox PVC (WaitForFirstConsumer,
never mounted — primary uses NFS, replicas have their own proxmox PVCs)
- Add tertiary DNS deployment with zone-transfer replication for
externalTrafficPolicy=Local coverage across more nodes
- Reorder CoreDNS default forwarders: pfSense (10.0.20.1) first,
then public DNS fallbacks (8.8.8.8, 1.1.1.1)
- NFS CSI: fix liveness-probe port conflict (29652 → 29653)
- Immich ML: add gpu-workload priority class to enable preemption on node1
- dbaas: right-size MySQL memory limits (sidecar 6Gi→350Mi, main 4Gi→3Gi)
- Redis: add redis-master service via HAProxy for master-only routing,
update config.tfvars redis_host to use it
- CoreDNS: forward .viktorbarzin.lan to Technitium ClusterIP (10.96.0.53)
instead of stale LoadBalancer IP (10.0.20.200)
- Trading bot: comment out all resources (no longer needed)
- Vault: remove trading-bot PostgreSQL database role
- MySQL InnoDB: keep required anti-affinity but document why (2/3 members OK during node loss)
- Descheduler: increase frequency from hourly to every 5 min for faster rebalancing
- Prometheus: set terminationGracePeriodSeconds=60 to prevent drain timeout [ci skip]
iSCSI CSI (democratic-csi) was replaced by proxmox-csi in April 2026.
Controller is intentionally scaled to 0. Remove the stale alert and
update CSIDriverCrashLoop to monitor proxmox-csi instead of iscsi-csi.
Kyverno's tier-1-cluster LimitRange had max=4Gi which blocked
mysql-cluster-2 from starting after we bumped MySQL to 6Gi limit.
Also added custom LimitRange in dbaas stack (for when Terraform
manages it directly).
- Create dedicated 'matrix' PostgreSQL user (was using 'postgres' superuser)
- Add Vault DB static role pg-matrix with 24h rotation
- Add ExternalSecret matrix-db-creds syncing password from Vault
- Add inject-db-password init container that patches homeserver.yaml
with current Vault password on every pod start
- Update dependency annotation to pg-cluster-rw.dbaas
- Also updated Vault DB config to use pg-cluster-rw (was legacy postgresql.dbaas)
- Tandoor: pin image to vabene1111/recipes:1.5.27 (latest tag pull
failing with EOF from pull-through cache corruption)
- Matrix: update homeserver.yaml to use pg-cluster-rw.dbaas instead
of legacy postgresql.dbaas service, update CNPG postgres password
- Navidrome: deleted corrupted SQLite DB (malformed disk image from
proxmox-lvm migration), navidrome recreates fresh DB on startup
ENABLE_RSPAMD_REDIS=0 prevents the docker-mailserver from attempting to start
an embedded Redis server. The rspamd-redis subprocess was failing repeatedly
due to a corrupted/empty RDB file after the recent NFS-to-proxmox-lvm storage
migration. Since the DKIM signing config uses use_redis=false, Redis is not
needed.
Also correct the PVC storage request to match the actual provisioned size (2Gi).
The mismatch was causing unnecessary PVC replacement during terraform apply.
The global rate limit (10 req/s, 50 burst) was too aggressive for HA
dashboards that load 30+ JS files on page load, causing 429s. VPN tunnel
blips between London K8s and Sofia caused 502s with no retry fallback.
- Add traefik-retry middleware to reverse-proxy factory (all services)
- Add skip_global_rate_limit variable to both reverse-proxy factories
- Create ha-sofia-rate-limit middleware (100 avg, 200 burst)
- Apply to ha-sofia and music-assistant (both route to Sofia)
The Terraform Helm provider's YAML diff comparison silently ignores rules
containing {{ $labels.job }} in annotations, preventing the alerts from being
applied. Also syncs alerts to platform stack tpl.
- Auth-proxy fallback now sets ALL X-authentik-* headers (username, uid,
email, name, groups) to prevent client-supplied header spoofing when
Authentik is down. Previously only username was set, allowing a malicious
client to inject fake X-authentik-groups.
- Catch-all IngressRoute restricted to *.viktorbarzin.me only. Non-matching
domains no longer get the wildcard cert served (TLS info leak).
- Added rate-limit and CrowdSec middleware to catch-all IngressRoute.
- Added rate-limit middleware to Headscale DERP IngressRoute.
- Rotated auth-proxy basicAuth credentials (bcrypt cost 5 → 12, admin → emergency-admin).
- Created Authentik brute-force reputation policy (threshold -5, IP+username).
- Migrate technitium-secondary-config from NFS to proxmox-lvm PVC
- Change secondary strategy from RollingUpdate to Recreate (RWO)
- Bootstrap encrypted state for insta2spotify and ebooks stacks
- Import servarr sub-module PVCs and reconcile state
Add proxmox-lvm PVCs with pvc-autoresizer annotations for all
SQLite-backed services. Deployments updated to use new block storage
PVCs. Old NFS modules retained for 1-week rollback.
Services: ntfy, freshrss, insta2spotify, actualbudget (x3),
wealthfolio, navidrome (DB only), audiobookshelf config,
headscale, forgejo, uptime-kuma.
Also: set Recreate strategy on ntfy, forgejo, insta2spotify,
wealthfolio (required for RWO volumes).
Cluster doesn't have cert-manager installed. Use self-signed certificate
for the controller and disable the PVC mutating webhook (annotations are
set directly on PVCs via Terraform).
Deploy topolvm/pvc-autoresizer controller that monitors kubelet_volume_stats
via Prometheus and auto-expands annotated PVCs. Annotated all 9 block-storage
PVCs (proxmox-lvm) with per-PVC thresholds and max limits. Updated PVFillingUp
alert to critical/10m (means auto-expansion failed) and added PVAutoExpanding
info alert at 80%.
- Migrate ebooks-calibre-config-iscsi (2Gi, 2380 files) to proxmox-lvm
- Update docs/architecture/storage.md: document Proxmox CSI as primary
block storage, mark democratic-csi iSCSI as deprecated
- Add full migration plan to docs/plans/
When the pull-through proxy (10.0.20.10) is down, containerd now falls
back to the official upstream registries (registry-1.docker.io, ghcr.io)
instead of failing. Also cleans up stale disabled registry mirror dirs
and removes unnecessary containerd restart from the rollout script.
The cleanup-tags.sh + garbage-collect cycle can delete blob data while
leaving _layers/ link files intact. The registry then returns HTTP 200
with 0 bytes for those layers, causing "unexpected EOF" on image pulls.
fix-broken-blobs.sh walks all repositories, checks each layer link
against actual blob data, and removes orphaned links so the registry
re-fetches from upstream on next pull.
Schedule: daily at 2:30am (after tag cleanup) and Sunday 3:30am
(after garbage collection). First run found 2335/2556 (91%) of
layer links were orphaned.
Caretta eBPF DaemonSet was using 600Mi x 5 nodes = 3Gi total for
non-critical network topology visualization. Removing it to free
memory for novelapp and aiostreams which were stuck in Pending.
CWA NETWORK_SHARE_MODE=true skips post-import chown, leaving files as
root. book-search now mounts the library to periodically fix permissions
on recently imported books.
Adds stacks-config volume mount to book-search pod so it can delete
Stacks history entries and force re-downloads when a book was consumed
by CWA but failed to import.
The init container was cloning the dotfiles repo via git on every pod
start, causing 200+ small NFS writes that amplified through ZFS.
Dotfiles already exist on NFS from a previous clone — no need to
re-clone on every restart. To update dotfiles, run git pull manually.
Also cleaned up stale Uptime Kuma files (1.6GB old SQLite DB + 289MB
error log left over from migration to MariaDB).
- API_KEY env var from calibre-secrets for /api/download-url auth
- SHORTCUT_ICLOUD_URL env var for /shortcut redirect
- Separate ingress for /api/download-url and /shortcut (bypasses Authentik)
HA frontend loads 30-50 JS bundles on page load, exhausting the burst.
iOS Companion app reconnections also trigger bursts. 172 rate-limited
(429) requests found in Traefik logs causing intermittent connectivity
failures for ha-sofia iOS app.