infra

Author	SHA1	Message	Date
Viktor Barzin	8b43692af0	[infra] Suppress Goldilocks vpa-update-mode label drift on all namespaces [ci skip] ## Context Wave 3B-continued: the Goldilocks VPA dashboard (stacks/vpa) runs a Kyverno ClusterPolicy `goldilocks-vpa-auto-mode` that mutates every namespace with `metadata.labels["goldilocks.fairwinds.com/vpa-update-mode"] = "off"`. This is intentional — Terraform owns container resource limits, and Goldilocks should only provide recommendations, never auto-update. The label is how Goldilocks decides per-namespace whether to run its VPA in `off` mode. Effect on Terraform: every `kubernetes_namespace` resource shows the label as pending-removal (`-> null`) on every `scripts/tg plan`. Dawarich survey 2026-04-18 confirmed the drift. Cluster-side count: 88 namespaces carry the label (`kubectl get ns -o json \| jq ... \| wc -l`). Every TF-managed namespace is affected. This commit brings the intentional admission drift under the same `# KYVERNO_LIFECYCLE_V1` discoverability marker introduced in `c9d221d5` for the ndots dns_config pattern. The marker now stands generically for any Kyverno admission-webhook drift suppression; the inline comment records which specific policy stamps which specific field so future grep audits show why each suppression exists. ## This change 107 `.tf` files touched — every stack's `resource "kubernetes_namespace"` resource gets: ```hcl lifecycle { # KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]] } ``` Injection was done with a brace-depth-tracking Python pass (`/tmp/add_goldilocks_ignore.py`): match `^resource "kubernetes_namespace" ` → track `{` / `}` until the outermost closing brace → insert the lifecycle block before the closing brace. The script is idempotent (skips any file that already mentions `goldilocks.fairwinds.com/vpa-update-mode`) so re-running is safe. Vault stack picked up 2 namespaces in the same file (k8s-users produces one, plus a second explicit ns) — confirmed via file diff (+8 lines). ## What is NOT in this change - `stacks/trading-bot/main.tf` — entire file is `/* … /` commented out (paused 2026-04-06 per user decision). Reverted after the script ran. - `stacks/_template/main.tf.example` — per-stack skeleton, intentionally minimal. User keeps it that way. Not touched by the script (file has no real `resource "kubernetes_namespace"` — only a placeholder comment). - `.terraform/` copies (e.g. `stacks/metallb/.terraform/modules/...`) — gitignored, won't commit; the live path was edited. - `terraform fmt` cleanup of adjacent pre-existing alignment issues in authentik, freedify, hermes-agent, nvidia, vault, meshcentral. Reverted to keep the commit scoped to the Goldilocks sweep. Those files will need a separate fmt-only commit or will be cleaned up on next real apply to that stack. ## Verification Dawarich (one of the hundred-plus touched stacks) showed the pattern before and after: ``` $ cd stacks/dawarich && ../../scripts/tg plan Before: Plan: 0 to add, 2 to change, 0 to destroy. # kubernetes_namespace.dawarich will be updated in-place (goldilocks.fairwinds.com/vpa-update-mode -> null) # module.tls_secret.kubernetes_secret.tls_secret will be updated in-place (Kyverno generate. labels — fixed in `8d94688d`) After: No changes. Your infrastructure matches the configuration. ``` Injection count check: ``` $ rg -c 'KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode' stacks/ \| awk -F: '{s+=$2} END {print s}' 108 ``` ## Reproduce locally 1. `git pull` 2. Pick any stack: `cd stacks/<name> && ../../scripts/tg plan` 3. Expect: no drift on the namespace's goldilocks.fairwinds.com/vpa-update-mode label. Closes: code-dwx Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 21:15:27 +00:00
Viktor Barzin	afb8a16623	[infra] Scale down unused services + remove DoH ingress Scale to 0 replicas: - ollama: low usage, saves ~2Gi memory + 59GB NFS-SSD model data idle - poison-fountain: RSS link archiver, not actively used - travel-blog: Hugo blog, not actively used Remove technitium DoH ingress (dns.viktorbarzin.me): externally unreachable and unused. DNS is served on UDP/TCP port 53 via LoadBalancer (10.0.20.201). Clears 3 of 5 ExternalAccessDivergence services. Remaining 2 (pdf, travel) should clear now that the Uptime Kuma monitors will report both down. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-17 18:55:52 +00:00
Viktor Barzin	996bdfc9b6	[technitium] Uninstall MySQL+SQLite query log plugins instead of just disabling ## Context Disabling MySQL/SQLite query logging via config was not durable — Technitium re-enables disabled plugins on pod restart, causing 46 GB/day of writes to the standalone MySQL (15M inserts to technitium.dns_logs between CronJob runs). ## This change: The password-sync CronJob now UNINSTALLS MySQL and SQLite query log plugins via `/api/apps/uninstall` instead of setting `enableLogging:false`. This is permanent — the plugin files are removed from the PVC, so they can't re-enable on restart. The CronJob checks if the plugins are present first (idempotent). Only PostgreSQL query logging remains (90-day retention). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-17 08:20:55 +00:00
Viktor Barzin	f538115c43	[dbaas] Migrate MySQL from InnoDB Cluster to standalone StatefulSet ## Context Disk write analysis showed MySQL InnoDB Cluster writing ~95 GB/day for only ~35 MB of actual data due to Group Replication overhead (binlog, relay log, GR apply log). The operator enforces GR even with serverInstances=1. Bitnami Helm charts were deprecated by Broadcom in Aug 2025 — no free container images available. Using official mysql:8.4 image instead. ## This change: - Replace helm_release.mysql_cluster service selector with raw kubernetes_stateful_set_v1 using official mysql:8.4 image - ConfigMap mysql-standalone-cnf: skip-log-bin, innodb_flush_log_at_trx_commit=2, innodb_doublewrite=ON (re-enabled for standalone safety) - Service selector switched to standalone pod labels - Technitium: disable SQLite query logging (18 GB/day write amplification), keep PostgreSQL-only logging (90-day retention) - Grafana datasource and dashboards migrated from MySQL to PostgreSQL - Dashboard SQL queries fixed for PG integer division (::float cast) - Updated CLAUDE.md service-specific notes ## What is NOT in this change: - InnoDB Cluster + operator removal (Phase 4, 7+ days from now) - Stale Vault role cleanup (Phase 4) - Old PVC deletion (Phase 4) Expected write reduction: ~113 GB/day (MySQL 95 + Technitium 18) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 19:01:06 +00:00
Viktor Barzin	b1d152be1f	[infra] Auto-create Cloudflare DNS records from ingress_factory ## Context Deploying new services required manually adding hostnames to cloudflare_proxied_names/cloudflare_non_proxied_names in config.tfvars — a separate file from the service stack. This was frequently forgotten, leaving services unreachable externally. ## This change: - Add `dns_type` parameter to `ingress_factory` and `reverse_proxy/factory` modules. Setting `dns_type = "proxied"` or `"non-proxied"` auto-creates the Cloudflare DNS record (CNAME to tunnel or A/AAAA to public IP). - Simplify cloudflared tunnel from 100 per-hostname rules to wildcard `*.viktorbarzin.me → Traefik`. Traefik still handles host-based routing. - Add global Cloudflare provider via terragrunt.hcl (separate cloudflare_provider.tf with Vault-sourced API key). - Migrate 118 hostnames from centralized config.tfvars to per-service dns_type. 17 hostnames remain centrally managed (Helm ingresses, special cases). - Update docs, AGENTS.md, CLAUDE.md, dns.md runbook. ``` BEFORE AFTER config.tfvars (manual list) stacks/<svc>/main.tf \| module "ingress" { v dns_type = "proxied" stacks/cloudflared/ } for_each = list \| cloudflare_record auto-creates tunnel per-hostname cloudflare_record + annotation ``` ## What is NOT in this change: - Uptime Kuma monitor migration (still reads from config.tfvars) - 17 remaining centrally-managed hostnames (Helm, special cases) - Removal of allow_overwrite (keep until migration confirmed stable) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 13:45:04 +00:00
Viktor Barzin	9baefa22ab	fix: technitium CronJob scheduling, LUKS backup support, speedtest scrape - technitium-password-sync: remove RWO encrypted PVC mount that caused pods to stick in ContainerCreating on wrong nodes. Plugin install now warns instead of failing when zip unavailable. - daily-backup: add LUKS decryption support for encrypted PVC snapshots using /root/.luks-backup-key. Uses noload mount option to skip ext4 journal replay. Also installed cryptsetup-bin on PVE host. - speedtest: disable prometheus.io/scrape annotation (no /prometheus endpoint exists, causing ScrapeTargetDown alert). [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 15:12:32 +00:00
Viktor Barzin	803cb5fd26	fix: convert Technitium zone sync from one-time Job to CronJob Secondary/tertiary DNS instances had no custom zones — only the primary had viktorbarzin.lan and viktorbarzin.me. The old setup Job ran once at deployment and never synced new zones. New CronJob runs every 30 minutes: - Gets all zones from primary - Enables zone transfer on primary - Creates missing zones as Secondary type on replicas - Resyncs existing zones via AXFR Fixes .lan resolution failures (2/3 queries returned NXDOMAIN). [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 12:18:19 +00:00
Viktor Barzin	68c8c5b4a0	fix(technitium): migrate primary to proxmox-lvm-encrypted + post-mortem SEV1 outage: fsid=0 in PVE /etc/exports broke all NFS subdirectory mounts from k8s (NFSv4 pseudo-root path resolution). Combined with lockd failure, both NFSv4 and NFSv3 mount paths broken. Cascaded into DNS primary, Vault (2/3 pods), Alertmanager, 20+ services. Changes: - Primary PVC: NFS (nfs-truenas) → proxmox-lvm-encrypted - Secondary/tertiary PVCs: proxmox-lvm → proxmox-lvm-encrypted - Removed NFS module dependency from technitium stack - Added full post-mortem with prevention plan [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 08:18:59 +00:00
Viktor Barzin	82b0f6c4cb	truenas deprecation: migrate all non-immich storage to proxmox NFS - Migrate 7 backup CronJobs to Proxmox host NFS (192.168.1.127) (etcd, mysql, postgresql, nextcloud, redis, vaultwarden, plotting-book) - Migrate headscale backup, ebook2audiobook, osm_routing to Proxmox NFS - Migrate servarr (lidarr, readarr, soulseek) NFS refs to Proxmox - Remove 79 orphaned TrueNAS NFS module declarations from 49 stacks - Delete stacks/platform/modules/ (27 dead module copies, 65MB) - Update nfs-truenas StorageClass to point to Proxmox (192.168.1.127) - Remove iscsi DNS record from config.tfvars - Fix woodpecker persistence config and alertmanager PV Only Immich (8 PVCs, ~1.4TB) remains on TrueNAS.	2026-04-12 14:35:39 +01:00
Viktor Barzin	6101fb99f9	Reduce disk write amplification across cluster (~200-350 GB/day savings) [ci skip] - Prometheus: persist metric whitelist (keep rules) to Helm template, preventing regression from 33K to 250K samples/scrape on next apply. Reduce retention 52w→26w. - MySQL InnoDB: aggressive write reduction — flush_log_at_trx_commit=0, sync_binlog=0, doublewrite=OFF, io_capacity=100/200, redo_log=1GB, flush_neighbors=1, reduced page cleaners. - etcd: increase snapshot-count 10000→50000 to reduce WAL snapshot frequency. - VM disks: enable TRIM/discard passthrough to LVM thin pool via create-vm module. - Cloud-init: enable fstrim.timer, journald limits (500M/7d/compress). - Kubelet: containerLogMaxSize=10Mi, containerLogMaxFiles=3. - Technitium: DNS query log retention 0→30 days (was unlimited writes to MySQL). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 19:01:21 +00:00
Viktor Barzin	9de543c076	feat(technitium): add Split Horizon AddressTranslation for hairpin NAT fix 192.168.1.x LAN clients couldn't reach non-proxied *.viktorbarzin.me domains because the TP-Link router doesn't support hairpin NAT. Adds a CronJob that configures Technitium's Split Horizon AddressTranslation post-processor on all 3 instances to translate 176.12.22.76 (public IP) → 10.0.20.200 (Traefik LB) in DNS responses for 192.168.1.0/24 clients. Also adds viktorbarzin.me to the DNS Rebinding Protection privateDomains allowlist so the translated private IP isn't stripped. [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-08 18:43:15 +00:00
Viktor Barzin	09b4bad958	feat: pin ~28 images to specific versions, enable DIUN monitoring, add app-stacks pipeline Pin third-party images from :latest to current stable versions: - Platform: cloudflared, technitium, snmp-exporter, pve-exporter, headscale, shadowsocks, xray - Apps: paperless-ngx, linkwarden, wealthfolio, speedtest, synapse, n8n, prowlarr, qbittorrent, lidarr, rybbit, ollama, immichframe, cyberchef, networking-toolbox, echo, coturn, shlink, affine Enable DIUN annotations on all pinned deployments with per-image tag patterns. Add Woodpecker app-stacks pipeline for selective terragrunt apply on changed app stacks.	2026-04-06 14:27:13 +03:00
Viktor Barzin	9492874c43	fix: restore technitium MySQL query logging with Vault auto-rotation [ci skip] Query logs stopped syncing on 2026-03-16 due to password mismatch after MySQL cluster rebuild and Technitium app config reset. - Add Vault static role mysql-technitium (7-day rotation) - Add ExternalSecret for technitium-db-creds in technitium namespace - Add password-sync CronJob (6h) to push rotated password to Technitium API - Update Grafana datasource to use ESO-managed password - Remove stale technitium_db_password variable (replaced by ESO) - Update databases.md and restore-mysql.md runbook	2026-04-06 13:00:49 +03:00
Viktor Barzin	aac02f0467	meshcentral: restore DB from backup; technitium: remove orphaned PVC - meshcentral: fix homepage annotations formatting (no functional change, serversscheme was tested but not needed since MeshCentral serves HTTP) - meshcentral: restored user DB from Dec 2024 backup (1428B → 45KB) - technitium: remove unused technitium-config-proxmox PVC (WaitForFirstConsumer, never mounted — primary uses NFS, replicas have their own proxmox PVCs)	2026-04-06 12:17:08 +03:00
Viktor Barzin	b0178cf6d2	technitium: add tertiary DNS replica and fix CoreDNS forward order - Add tertiary DNS deployment with zone-transfer replication for externalTrafficPolicy=Local coverage across more nodes - Reorder CoreDNS default forwarders: pfSense (10.0.20.1) first, then public DNS fallbacks (8.8.8.8, 1.1.1.1)	2026-04-06 11:57:31 +03:00
Viktor Barzin	f80e1fa868	cluster health fixes: NFS CSI, Immich ML, dbaas, Redis, DNS, trading-bot removal - NFS CSI: fix liveness-probe port conflict (29652 → 29653) - Immich ML: add gpu-workload priority class to enable preemption on node1 - dbaas: right-size MySQL memory limits (sidecar 6Gi→350Mi, main 4Gi→3Gi) - Redis: add redis-master service via HAProxy for master-only routing, update config.tfvars redis_host to use it - CoreDNS: forward .viktorbarzin.lan to Technitium ClusterIP (10.96.0.53) instead of stale LoadBalancer IP (10.0.20.200) - Trading bot: comment out all resources (no longer needed) - Vault: remove trading-bot PostgreSQL database role	2026-04-06 11:54:45 +03:00
Viktor Barzin	aa7a7e74b2	fix: technitium secondary to proxmox-lvm + bootstrap TF state - Migrate technitium-secondary-config from NFS to proxmox-lvm PVC - Change secondary strategy from RollingUpdate to Recreate (RWO) - Bootstrap encrypted state for insta2spotify and ebooks stacks - Import servarr sub-module PVCs and reconcile state	2026-04-05 19:32:40 +03:00
Viktor Barzin	cb8a808700	feat(storage): migrate 38 NFS PVCs to proxmox-lvm (Wave 2) Add proxmox-lvm PVCs with pvc-autoresizer annotations for all remaining single-pod app data services. Deployments updated to use new block storage PVCs. Old NFS modules retained for rollback. Services: affine, changedetection, diun, excalidraw, f1-stream, hackmd, isponsorblocktv, matrix, n8n, send, grampsweb, health, onlyoffice, owntracks, paperless-ngx, privatebin, resume, speedtest, stirling-pdf, tandoor, rybbit (clickhouse), tor-proxy (torrserver), whisper+piper, frigate (config), ollama (ui), servarr (prowlarr/listenarr/qbittorrent), aiostreams, freshrss (extensions), meshcentral (data+files), openclaw (data+home+ openlobster), technitium, mailserver (data+roundcube html+enigma), dbaas (pgadmin). Strategy set to Recreate where needed for RWO volumes.	2026-04-04 19:25:12 +03:00
Viktor Barzin	4b3851829b	feat: organize Grafana dashboards into folders Enable sidecar folderAnnotation + foldersFromFilesStructure to group 26 dashboards into 5 managed folders: - Cluster (6): k8s health, API server, nodes, pods, kube-state-metrics - Networking (6): CoreDNS, Technitium, Headscale, ingress, network traffic - Hardware (5): node-exporter, proxmox, iDRAC, UPS, NVIDIA GPU - Operations (4): backup health, registry, audit logs, Loki - Applications (2): realestate-crawler, qBittorrent Dashboard-to-folder mapping defined in grafana.tf locals block. External stacks (headscale, technitium) annotated individually.	2026-03-28 16:23:49 +02:00
Viktor Barzin	c49e4561a3	consolidate MetalLB IPs: 5 → 1 (10.0.20.200) Migrate all 11 LoadBalancer services to share 10.0.20.200: - Update annotations: metallb.universe.tf → metallb.io - Pin all services to 10.0.20.200 with allow-shared-ip: shared - Standardize externalTrafficPolicy to Cluster (required for IP sharing) - Remove redundant port 80 (roundcube) from mailserver LB - Update CoreDNS forward: 10.0.20.204 → 10.0.20.200 - Update cloudflared tunnel target: 10.0.20.202 → 10.0.20.200 Services consolidated: coturn, headscale, kms, qbittorrent, shadowsocks, torrserver, wireguard, mailserver, traefik, xray, technitium	2026-03-24 18:35:43 +02:00
Viktor Barzin	33037eba46	upgrade MetalLB v0.10.2 → v0.15.3 and update annotations - Replace custom ViktorBarzin/metallb module with official Helm chart - Migrate from ConfigMap-based config to CRD (IPAddressPool + L2Advertisement) - Update Traefik LB annotations from metallb.universe.tf to metallb.io format - Technitium DNS keeps stable IP 10.0.20.204 via MetalLB auto-assignment - Headscale split DNS already configured to use 10.0.20.204	2026-03-24 17:24:05 +02:00
Viktor Barzin	73511b1230	extract remaining 19 modules from platform, complete stack split [ci skip] Phase 3: all 27 platform modules now run as independent stacks. Platform reduced to empty shell (outputs only) for backward compat with 72 app stacks that declare dependency "platform". Fixed technitium cross-module dashboard reference by copying file. Woodpecker pipeline applies all 27+1 stacks in parallel via loop. All applied with zero destroys.	2026-03-17 21:42:16 +00:00

22 commits